Basic Usage of DataMapPlot

This notebook will walk you through the basic usage patterns of DataMapPlot, from the format of the data you’ll need to get started, to tweaking and saving a plot for use in a presentation, paper, or poster. To get started we’ll need to import DataMapPlot. Also, for the purposes of this documentation, I need to keep the image sizes smaller to fit in readthedocs; because of that I will set the global DPI for matplotlib (which DataMapPlot uses for plotting), but you should probably remove those lines if you are running this notebook yourself.

[1]:
# Ensure we don't generate large images for inline docs
# You probably want to remove this if running the notebook yourself
import matplotlib
matplotlib.rcParams["figure.dpi"] = 72

import datamapplot

Next we will need some data to plot. DataMapPlot has some example datasets, and we can simply pull those directly from github. For this example we’ll use data about Wikipedia. The original data that this was derived from is the Wikipedia embeddings generated by Cohere using their embed system to generate paragraph embeddings. You can read more about that on Cohere’s blog-post about the dataset. This particular dataset was derived from the embeddings of the “Simple English” Wikipedia, which contains 486 thousand paragraphs taken from around 200 thousand articles. A “Data Map” was created from this data using UMAP for dimension reduction, and it was then clustered using HDBSCAN to pick out topics. Each of these topics was then given a name (through a combination of LLMs and head tweaking) to create the labelled data map we’ll be using.

So, without further ado, let’s grab the data from github and load the numpy arrays:

[2]:
import numpy as np
import requests
import io

data_map_file = requests.get("https://github.com/TutteInstitute/datamapplot/raw/main/examples/Wikipedia-data_map.npy")
wikipedia_data_map = np.load(io.BytesIO(data_map_file.content))
label_file = requests.get("https://github.com/TutteInstitute/datamapplot/raw/main/examples/Wikipedia-cluster_labels.npy")
wikipedia_labels = np.load(io.BytesIO(label_file.content), allow_pickle=True)

What does the data actually look like? The data map provides a two dimensional location for each paragraph, and is thus a 2d numpy array of float values:

[3]:
wikipedia_data_map
[3]:
array([[ 1.7236053,  3.706319 ],
       [ 1.7875007,  3.7248063],
       [ 1.7760702,  3.6861744],
       ...,
       [ 7.58305  , -4.234798 ],
       [ 7.7917733, -3.90715  ],
       [ 8.133467 , -3.991396 ]], dtype=float32)

In total we have almost 486 thousand rows, one for each paragraph in the original dataset.

[4]:
wikipedia_data_map.shape
[4]:
(485859, 2)

The topic label data is an array of text entries, with paragraphs that didn’t fall into a specific topic (at the given clustering granularity) given the label 'Unlabelled'. Note that we have a label entry for each and every paragraph from the original dataset.

[5]:
wikipedia_labels
[5]:
array(['Unlabelled', 'Unlabelled', 'Unlabelled', ..., 'Television Series',
       'Unlabelled', 'Television Series'], dtype=object)

Thus we have almost 486 thousand labels, one for each row in the data map:

[6]:
wikipedia_labels.shape
[6]:
(485859,)

In the simplest usage we can simply hand this data to DataMapPlot and let it work out what to do. The goal of DataMapPlot is to automate as much of the work as possible, making it as easy as possible to get a good looking plot out. That means that DataMapPlot does all the heavy lifting of picking out topic centers, creating colour palettes, formatting the labels, finding a suitable font size and arranging the labels to avoid both overlapping labels and crossing label indicator lines, and plotting. All of that is handled with the base function create_plot. For the most part, whever you are using DataMapPlot, the create_plot function is all you’ll need, since it provides a great deal of options to tune and tweak results, some of which we’ll look at in the further tutorial notebooks. For now we’ll use it in the most straightforward way, and simply hand it the data map and the topic labels.

[7]:
datamapplot.create_plot(wikipedia_data_map, wikipedia_labels)
[7]:
(<Figure size 864x864 with 1 Axes>, <Axes: >)
_images/basic_usage_13_1.png

As you can see, just handing DataMapPlot the data map and labels is enough to generate an attractive plot with all the clusters labelled. The default colour palette is designed such that nearby topics should have similar colours, but with enough variation to distinguish the topics. The labels are laid out in rings around the data map, which indicator arrows pointing to the cluster in the data map that the label refers to. By default the text labels are also coloured to match thematically with the colour of the cluster they are attached to, aiding in picking out which cluster the label refers to (see, for example, the “Religion and Deities” cluster above). DataMapPlot also attempts to make good choices about representing the data map itself; in particular for data maps with many points it defaults to using datashader for rendering the data map, avoiding many issues associated with overplotting and other associated plotting pitfalls. Again, by default, DataMapPlot also adds a Kernel Density Estimate (KDE) based “glow” around clusters which can help highlight extremely small dense clusters; this can be turned off easily if required. All of this can be tweaked and adjusted, but the goal of DataMapPlot is to try to make good choices for you, and avoid hours of carefully tweaking the plot yourself by hand to get things “just right”.

One thing to note here is that there are a lot of topics to be labelled. In fact there are fifty different text labels that have to placed in the plot. Nonetheless DataMapPlot has managed to find a reasonable layout that avoids overlapping text or crossing indicator lines. In practice DataMapPlot can handling up to 64 labels reasonably well; beyond that it will continue to function, and do it’s best (now with three or more rings of labels), but label layout will be challenging and no promises can be made about the quality of the aesthetics in the result.

In fact even with fifty labels we have a few places where things are getting tightly packed. Let’s prune things down a little by removing labels for some of the smallest clusters. To do that let’s first look at the sizes of the various clusters:

[8]:
import pandas as pd

label_sizes = pd.Series(wikipedia_labels).value_counts()
label_sizes.reset_index()
[8]:
index count
0 Unlabelled 239368
1 Countries of the World 16751
2 Music 15219
3 Zoology 14028
4 Religion and Deities 8438
5 Biographies of Celebrities 8311
6 Movies 8305
7 United States Cities, Towns and Counties 8096
8 Cities and Towns in France 7497
9 Biographies of Actors and Actresses 6909
10 Biographies in Arts, Literature and Journalism 6674
11 Biographies of Soccer Players 6516
12 Video Games 6508
13 Historic Buildings and Monuments 6495
14 Classical Music 6234
15 Software and Computing 6134
16 Sports and Recreation 5689
17 Social Sciences and Humanities 5380
18 Notable Historical Events 5272
19 US Politicians 4940
20 Deaths of Famous People 4939
21 Medicine and Diseases 4676
22 Biographies of Athletes and Sportspeople 4669
23 Biographies of Scientists and Mathematicians 4174
24 Politicians 3888
25 Cities and Towns in Germany 3781
26 Television Series 3713
27 Plants and Botany 3484
28 American Politicians 3320
29 British and French Royal Families 3319
30 Educational Institutions 3309
31 Languages and Grammar 3306
32 Regions of France 3241
33 Aircraft and Spacecraft 2952
34 Food and Cooking 2873
35 Atlantic Tropical Storms and Hurricanes 2848
36 Biology and Molecular Biology 2818
37 Mathematics 2669
38 Global Geography 2621
39 Chemistry 2613
40 Sports Players 2509
41 WWE Wrestlers 2394
42 Census Details 2245
43 Amphibians 2239
44 Astronomy 2201
45 Public Transport Systems 2172
46 Physics 2153
47 National Governments and Parliaments 2034
48 Conflict and War 2012
49 Ice Hockey 1963
50 Biographies of Politicians 1960

It seems not unreasonable to prune off clusters with fewer than 2500 paragraphs in them; that should unclustter things a little. While we are at it let’s change the "Unlabelled" label given to paragraphs that didn’t have a specific topic to something else – in this case "No Topic".

[9]:
simplified_labels = wikipedia_labels.copy()
simplified_labels[simplified_labels == "Unlabelled"] = "No Topic"
simplified_labels[np.in1d(simplified_labels, label_sizes[label_sizes < 2500].index)] = "No Topic"

By default DataMapPlot assumes that unclustered points are given the label "Unlabelled", but that isn’t required. We can specify a name for the unclustered point label using the noise_label keyword argument to create_plot. Let’s see how our simplified label set looks:

[10]:
datamapplot.create_plot(wikipedia_data_map, simplified_labels, noise_label="No Topic")
[10]:
(<Figure size 864x864 with 1 Axes>, <Axes: >)
_images/basic_usage_19_1.png

Great! That did a good job of uncluttering things. This is looking like a plot we could keep. At this point it is time to add any final tweaks (here’s we’ll set a label font size, use medoids instead of centroids for where the indicator line points, and wrap the text labels more tightly), provide a title (and optionally a sub-title) and store the result figure and axis in some variables so we can save the generated plot.

[11]:
fig, ax = datamapplot.create_plot(
    wikipedia_data_map,
    simplified_labels,
    noise_label="No Topic",
    title="Map of Wikipedia",
    sub_title="Paragraphs from articles on Simple Wikipedia embedded with Cohere embed",
    label_font_size=11,
    font_weight=100,
    label_wrap_width=8,
    use_medoids=True,
)
OMP: Info #276: omp_set_nested routine deprecated, please use omp_set_max_active_levels instead.
_images/basic_usage_21_1.png

That looks like what we want; now we need to save the plot. Fortunately the plot itself is simply a matplotlib Figure instance. Thus we can leverage matplotlib’s powerful tools for saving plots. To do this we call the savefig method and provide a filename to save to. Matplotlib is smart and will interpret the file extension in the filename to determine the format of the saved file. Here we’ll simply save to a PNG, but you can equally well generate a JPG, PDF, or SVG file if you wish. The supported filetypes will vary depending on the backend being used by matplotlib, but you can see the options available by running fig.canvas.get_supported_filetypes(). One thing to note is that, due to how DataMapPlot handles the titles it is important to add the argument bbox_inches="tight" to the savefig method to ensure that the title is not clipped off in the saved file.

[12]:
fig.savefig("datamapplot-basic_usage-example.png", bbox_inches="tight")
[13]:
fig.canvas.get_supported_filetypes()
[13]:
{'eps': 'Encapsulated Postscript',
 'jpg': 'Joint Photographic Experts Group',
 'jpeg': 'Joint Photographic Experts Group',
 'pdf': 'Portable Document Format',
 'pgf': 'PGF code for LaTeX',
 'png': 'Portable Network Graphics',
 'ps': 'Postscript',
 'raw': 'Raw RGBA bitmap',
 'rgba': 'Raw RGBA bitmap',
 'svg': 'Scalable Vector Graphics',
 'svgz': 'Scalable Vector Graphics',
 'tif': 'Tagged Image File Format',
 'tiff': 'Tagged Image File Format',
 'webp': 'WebP Image Format'}

And that wraps up our basic introduction to using DataMapPlot. The real key is to get your data map information in the right format (a text label per point ideally based on clusters in the data map, and ideally with not to many distinct/unique text labels). After that DataMapPlot can do most of the heavy lifting for you.