Basic Usage of DataMapPlot
This notebook will walk you through the basic usage patterns of DataMapPlot, from the format of the data you’ll need to get started, to tweaking and saving a plot for use in a presentation, paper, or poster. To get started we’ll need to import DataMapPlot. Also, for the purposes of this documentation, I need to keep the image sizes smaller to fit in readthedocs; because of that I will set the global DPI for matplotlib (which DataMapPlot uses for plotting), but you should probably remove those lines if you are running this notebook yourself.
[1]:
# Ensure we don't generate large images for inline docs
# You probably want to remove this if running the notebook yourself
import matplotlib
matplotlib.rcParams["figure.dpi"] = 72
import datamapplot
Next we will need some data to plot. DataMapPlot has some example datasets, and we can simply pull those directly from github. For this example we’ll use data about Wikipedia. The original data that this was derived from is the Wikipedia embeddings generated by Cohere using their embed
system to generate paragraph embeddings. You can read more about that on Cohere’s blog-post about the dataset. This particular dataset was derived
from the embeddings of the “Simple English” Wikipedia, which contains 486 thousand paragraphs taken from around 200 thousand articles. A “Data Map” was created from this data using UMAP for dimension reduction, and it was then clustered using HDBSCAN to pick out topics. Each of these topics was then given a name (through a combination of LLMs and head
tweaking) to create the labelled data map we’ll be using.
So, without further ado, let’s grab the data from github and load the numpy arrays:
[2]:
import numpy as np
import requests
import io
data_map_file = requests.get("https://github.com/TutteInstitute/datamapplot/raw/main/examples/Wikipedia-data_map.npy")
wikipedia_data_map = np.load(io.BytesIO(data_map_file.content))
label_file = requests.get("https://github.com/TutteInstitute/datamapplot/raw/main/examples/Wikipedia-cluster_labels.npy")
wikipedia_labels = np.load(io.BytesIO(label_file.content), allow_pickle=True)
What does the data actually look like? The data map provides a two dimensional location for each paragraph, and is thus a 2d numpy array of float values:
[3]:
wikipedia_data_map
[3]:
array([[ 1.7236053, 3.706319 ],
[ 1.7875007, 3.7248063],
[ 1.7760702, 3.6861744],
...,
[ 7.58305 , -4.234798 ],
[ 7.7917733, -3.90715 ],
[ 8.133467 , -3.991396 ]], dtype=float32)
In total we have almost 486 thousand rows, one for each paragraph in the original dataset.
[4]:
wikipedia_data_map.shape
[4]:
(485859, 2)
The topic label data is an array of text entries, with paragraphs that didn’t fall into a specific topic (at the given clustering granularity) given the label 'Unlabelled'
. Note that we have a label entry for each and every paragraph from the original dataset.
[5]:
wikipedia_labels
[5]:
array(['Unlabelled', 'Unlabelled', 'Unlabelled', ..., 'Television Series',
'Unlabelled', 'Television Series'], dtype=object)
Thus we have almost 486 thousand labels, one for each row in the data map:
[6]:
wikipedia_labels.shape
[6]:
(485859,)
In the simplest usage we can simply hand this data to DataMapPlot and let it work out what to do. The goal of DataMapPlot is to automate as much of the work as possible, making it as easy as possible to get a good looking plot out. That means that DataMapPlot does all the heavy lifting of picking out topic centers, creating colour palettes, formatting the labels, finding a suitable font size and arranging the labels to avoid both overlapping labels and crossing label indicator lines, and
plotting. All of that is handled with the base function create_plot
. For the most part, whever you are using DataMapPlot, the create_plot
function is all you’ll need, since it provides a great deal of options to tune and tweak results, some of which we’ll look at in the further tutorial notebooks. For now we’ll use it in the most straightforward way, and simply hand it the data map and the topic labels.
[7]:
datamapplot.create_plot(wikipedia_data_map, wikipedia_labels)
[7]:
(<Figure size 864x864 with 1 Axes>, <Axes: >)

As you can see, just handing DataMapPlot the data map and labels is enough to generate an attractive plot with all the clusters labelled. The default colour palette is designed such that nearby topics should have similar colours, but with enough variation to distinguish the topics. The labels are laid out in rings around the data map, which indicator arrows pointing to the cluster in the data map that the label refers to. By default the text labels are also coloured to match thematically with the colour of the cluster they are attached to, aiding in picking out which cluster the label refers to (see, for example, the “Religion and Deities” cluster above). DataMapPlot also attempts to make good choices about representing the data map itself; in particular for data maps with many points it defaults to using datashader for rendering the data map, avoiding many issues associated with overplotting and other associated plotting pitfalls. Again, by default, DataMapPlot also adds a Kernel Density Estimate (KDE) based “glow” around clusters which can help highlight extremely small dense clusters; this can be turned off easily if required. All of this can be tweaked and adjusted, but the goal of DataMapPlot is to try to make good choices for you, and avoid hours of carefully tweaking the plot yourself by hand to get things “just right”.
One thing to note here is that there are a lot of topics to be labelled. In fact there are fifty different text labels that have to placed in the plot. Nonetheless DataMapPlot has managed to find a reasonable layout that avoids overlapping text or crossing indicator lines. In practice DataMapPlot can handling up to 64 labels reasonably well; beyond that it will continue to function, and do it’s best (now with three or more rings of labels), but label layout will be challenging and no promises can be made about the quality of the aesthetics in the result.
In fact even with fifty labels we have a few places where things are getting tightly packed. Let’s prune things down a little by removing labels for some of the smallest clusters. To do that let’s first look at the sizes of the various clusters:
[8]:
import pandas as pd
label_sizes = pd.Series(wikipedia_labels).value_counts()
label_sizes.reset_index()
[8]:
index | count | |
---|---|---|
0 | Unlabelled | 239368 |
1 | Countries of the World | 16751 |
2 | Music | 15219 |
3 | Zoology | 14028 |
4 | Religion and Deities | 8438 |
5 | Biographies of Celebrities | 8311 |
6 | Movies | 8305 |
7 | United States Cities, Towns and Counties | 8096 |
8 | Cities and Towns in France | 7497 |
9 | Biographies of Actors and Actresses | 6909 |
10 | Biographies in Arts, Literature and Journalism | 6674 |
11 | Biographies of Soccer Players | 6516 |
12 | Video Games | 6508 |
13 | Historic Buildings and Monuments | 6495 |
14 | Classical Music | 6234 |
15 | Software and Computing | 6134 |
16 | Sports and Recreation | 5689 |
17 | Social Sciences and Humanities | 5380 |
18 | Notable Historical Events | 5272 |
19 | US Politicians | 4940 |
20 | Deaths of Famous People | 4939 |
21 | Medicine and Diseases | 4676 |
22 | Biographies of Athletes and Sportspeople | 4669 |
23 | Biographies of Scientists and Mathematicians | 4174 |
24 | Politicians | 3888 |
25 | Cities and Towns in Germany | 3781 |
26 | Television Series | 3713 |
27 | Plants and Botany | 3484 |
28 | American Politicians | 3320 |
29 | British and French Royal Families | 3319 |
30 | Educational Institutions | 3309 |
31 | Languages and Grammar | 3306 |
32 | Regions of France | 3241 |
33 | Aircraft and Spacecraft | 2952 |
34 | Food and Cooking | 2873 |
35 | Atlantic Tropical Storms and Hurricanes | 2848 |
36 | Biology and Molecular Biology | 2818 |
37 | Mathematics | 2669 |
38 | Global Geography | 2621 |
39 | Chemistry | 2613 |
40 | Sports Players | 2509 |
41 | WWE Wrestlers | 2394 |
42 | Census Details | 2245 |
43 | Amphibians | 2239 |
44 | Astronomy | 2201 |
45 | Public Transport Systems | 2172 |
46 | Physics | 2153 |
47 | National Governments and Parliaments | 2034 |
48 | Conflict and War | 2012 |
49 | Ice Hockey | 1963 |
50 | Biographies of Politicians | 1960 |
It seems not unreasonable to prune off clusters with fewer than 2500 paragraphs in them; that should unclustter things a little. While we are at it let’s change the "Unlabelled"
label given to paragraphs that didn’t have a specific topic to something else – in this case "No Topic"
.
[9]:
simplified_labels = wikipedia_labels.copy()
simplified_labels[simplified_labels == "Unlabelled"] = "No Topic"
simplified_labels[np.in1d(simplified_labels, label_sizes[label_sizes < 2500].index)] = "No Topic"
By default DataMapPlot assumes that unclustered points are given the label "Unlabelled"
, but that isn’t required. We can specify a name for the unclustered point label using the noise_label
keyword argument to create_plot
. Let’s see how our simplified label set looks:
[10]:
datamapplot.create_plot(wikipedia_data_map, simplified_labels, noise_label="No Topic")
[10]:
(<Figure size 864x864 with 1 Axes>, <Axes: >)

Great! That did a good job of uncluttering things. This is looking like a plot we could keep. At this point it is time to add any final tweaks (here’s we’ll set a label font size, use medoids instead of centroids for where the indicator line points, and wrap the text labels more tightly), provide a title (and optionally a sub-title) and store the result figure and axis in some variables so we can save the generated plot.
[11]:
fig, ax = datamapplot.create_plot(
wikipedia_data_map,
simplified_labels,
noise_label="No Topic",
title="Map of Wikipedia",
sub_title="Paragraphs from articles on Simple Wikipedia embedded with Cohere embed",
label_font_size=11,
font_weight=100,
label_wrap_width=8,
use_medoids=True,
)
OMP: Info #276: omp_set_nested routine deprecated, please use omp_set_max_active_levels instead.

That looks like what we want; now we need to save the plot. Fortunately the plot itself is simply a matplotlib Figure instance. Thus we can leverage matplotlib’s powerful tools for saving plots. To do this we call the savefig
method and provide a filename to save to. Matplotlib is smart and will interpret the file extension in the filename to determine the format of the saved file. Here we’ll simply save to a
PNG, but you can equally well generate a JPG, PDF, or SVG file if you wish. The supported filetypes will vary depending on the backend being used by matplotlib, but you can see the options available by running fig.canvas.get_supported_filetypes()
. One thing to note is that, due to how DataMapPlot handles the titles it is important to add the argument bbox_inches="tight"
to the savefig
method to ensure that the title is not clipped off in the saved file.
[12]:
fig.savefig("datamapplot-basic_usage-example.png", bbox_inches="tight")
[13]:
fig.canvas.get_supported_filetypes()
[13]:
{'eps': 'Encapsulated Postscript',
'jpg': 'Joint Photographic Experts Group',
'jpeg': 'Joint Photographic Experts Group',
'pdf': 'Portable Document Format',
'pgf': 'PGF code for LaTeX',
'png': 'Portable Network Graphics',
'ps': 'Postscript',
'raw': 'Raw RGBA bitmap',
'rgba': 'Raw RGBA bitmap',
'svg': 'Scalable Vector Graphics',
'svgz': 'Scalable Vector Graphics',
'tif': 'Tagged Image File Format',
'tiff': 'Tagged Image File Format',
'webp': 'WebP Image Format'}
And that wraps up our basic introduction to using DataMapPlot. The real key is to get your data map information in the right format (a text label per point ideally based on clusters in the data map, and ideally with not to many distinct/unique text labels). After that DataMapPlot can do most of the heavy lifting for you.