Interactive DataMapPlot Sizing Options

This notebook will walk you through some of the sizing-specific customizations that are available for interactive plots in DataMapPlot. There are a lot of fine-grained options, so rather than cover them all this notebook will highlight some of the major ones and hint at the further customization that can be achieved with them. To get started we’ll need to import DataMapPlot.

[1]:
import datamapplot

To demonstrate DataMapPlot interactive plotting we’ll need some data, ideally data with multiple layers of clustering and labelling. The examples directory of the DataMapPlot repository contains some pre-prepared datasets for experimenting with. We’ll grab one of those. Much like static plotting we’ll need a data map, but unlike static plotting, instead of a single set of clusters with labels we’ll want multiple layers of labelling at different levels of granularity (that can be revealed as one zooms in). In this case we’ll use a data map derived from titles and abstracts of papers from the Machine Learning section of the ArXiv pre-print server. The cluster layers were built using fast_hdbscan, and the cluster label generation was done using an LLM based on the content of the clusters.

[2]:
import numpy as np
import requests
import io

base_url = "https://github.com/TutteInstitute/datamapplot"
data_map_file = requests.get(
    f"{base_url}/raw/main/examples/arxiv_ml_data_map.npy"
)
arxivml_data_map = np.load(io.BytesIO(data_map_file.content))
arxivml_label_layers = []
for layer_num in range(5):
    label_file = requests.get(
        f"{base_url}/raw/interactive/examples/arxiv_ml_layer{layer_num}_cluster_labels.npy"
    )
    arxivml_label_layers.append(np.load(io.BytesIO(label_file.content), allow_pickle=True))
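
If the download is not available, a stand-in with the same general shape can be built by hand. The sketch below is illustrative only: the names toy_data_map and toy_label_layers, the blob layout, and the label strings are all made up for this example. The one thing carried over from the real data is the structure, namely a data map that is an (n_points, 2) array and label layers that each hold one string per point.

```python
import numpy as np

rng = np.random.default_rng(42)

# Four hypothetical blobs of 250 points each, placed on the corners of a square.
centers = np.array([[0.0, 0.0], [0.0, 5.0], [5.0, 0.0], [5.0, 5.0]])
blob_ids = np.repeat(np.arange(4), 250)
toy_data_map = centers[blob_ids] + rng.normal(scale=0.5, size=(1000, 2))

# Fine layer: one label per blob. Coarse layer: the blobs merged in pairs.
fine_names = np.array(["Topic A1", "Topic A2", "Topic B1", "Topic B2"], dtype=object)
coarse_names = np.array(["Area A", "Area A", "Area B", "Area B"], dtype=object)
toy_label_layers = [fine_names[blob_ids], coarse_names[blob_ids]]

print(toy_data_map.shape, [layer.shape for layer in toy_label_layers])
```

Two layers are enough to see labels change with zoom; the real dataset used in the rest of this notebook has five.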

Let’s start by making a basic interactive DataMapPlot output based on this data so we have an idea of what the starting point looks like, and can better understand what the various customizations we will be applying can do for us.

[3]:
plot = datamapplot.create_interactive_plot(
    arxivml_data_map,
    arxivml_label_layers[0],
    arxivml_label_layers[2],
    arxivml_label_layers[4],
)
plot
[3]:

By default the result is displayed in an iframe within the notebook. The most obvious thing one might like to change is the size of that output. That we can do via the width and height keyword arguments, which specify the width and height of the resulting iframe. Notably the values you pass can either be integers (which will be interpreted as a number of pixels), or strings that can be interpreted as a size specification for an iframe. For example, the default width is "100%", ensuring that the output spans the entire width of the notebook. We can choose something else, however; let’s try making both the width and the height an explicit number of pixels:

[4]:
plot = datamapplot.create_interactive_plot(
    arxivml_data_map,
    arxivml_label_layers[0],
    arxivml_label_layers[2],
    arxivml_label_layers[4],
    width=400,
    height=400,
)
plot
[4]:

We get a much smaller plot, now left-justified since it is not stretching to fill the width. Note that the overall size of the scatterplot has not changed; instead we’ve changed the size of our viewport onto it.
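
To make the two kinds of size value concrete, here is a minimal sketch of how an integer or string could be turned into an iframe size. The helper to_css_size is hypothetical and is not DataMapPlot’s own code; it only mirrors the rule described above, that integers are read as pixels and strings are used as given. The "60vh" value assumes the usual CSS viewport-relative units are acceptable wherever the string ends up.

```python
def to_css_size(value):
    """Hypothetical helper: turn a width/height argument into a CSS size string."""
    if isinstance(value, bool):
        raise TypeError("size must be an int or a str, not a bool")
    if isinstance(value, int):
        return f"{value}px"  # integers are read as pixel counts
    if isinstance(value, str):
        return value  # strings such as "100%" pass through unchanged
    raise TypeError(f"unsupported size specification: {value!r}")


sizes = [to_css_size(v) for v in (400, "100%", "60vh")]
print(sizes)
```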

Moving beyond the size of the plot itself, let’s look at how we can control the size of the plot elements. The major plot elements for consideration are the individual points in the scatterplot, and the text labels of the clusters. We’ll look at the sizing of text first. In general the interactive plot in DataMapPlot attempts to automatically size text based on the size of the clusters (specifically the scaling is based on the fourth root of the cluster size), providing smaller labels for smaller clusters. While not giving explicit control over text sizes per cluster, DataMapPlot provides means to bound the sizes of text generated. Specifically, the min_fontsize and max_fontsize keyword arguments give, in points, the lower and upper bounds of the font sizes generated by DataMapPlot. For example, we can set both to the same size and have all text labels be the same size.

[5]:
plot = datamapplot.create_interactive_plot(
    arxivml_data_map,
    arxivml_label_layers[0],
    arxivml_label_layers[2],
    arxivml_label_layers[4],
    min_fontsize=14,
    max_fontsize=14,
)
plot
[5]:
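
The fourth-root scaling and the two bounds can be sketched in a few lines. This is one plausible reading of the description above rather than DataMapPlot’s actual formula: the function label_font_size, the scale factor, and the bound values used here are all assumptions chosen for illustration.

```python
def label_font_size(cluster_size, min_fontsize=12.0, max_fontsize=24.0, scale=2.0):
    """Hypothetical sketch: fourth-root scaling clamped to [min_fontsize, max_fontsize]."""
    raw = scale * cluster_size ** 0.25  # grows slowly with cluster size
    return min(max(raw, min_fontsize), max_fontsize)


cluster_sizes = [16, 625, 10_000, 1_000_000]
bounded = [label_font_size(n) for n in cluster_sizes]
uniform = [label_font_size(n, min_fontsize=14, max_fontsize=14) for n in cluster_sizes]
print(bounded)
print(uniform)
```

When the two bounds are equal the clamp leaves only one possible value, which is why the plot above shows every label at the same size.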

A more common task is to simply increase the minimum size of text shown, or allow for extra large labels for the very large clusters. There are a number of other options for fine-tuning, including text_outline_width and linespacing, which we will not go into here. One further option that may be useful is text_collision_size_scale. This controls how collision detection between labels is handled, with higher priority labels (those from larger clusters) being shown and those they overlap or collide with being hidden. The default value is 3, which provides a good amount of spacing between labels, but you can set it to other values:

[6]:
plot = datamapplot.create_interactive_plot(
    arxivml_data_map,
    arxivml_label_layers[0],
    arxivml_label_layers[2],
    arxivml_label_layers[4],
    min_fontsize=14,
    max_fontsize=14,
    text_collision_size_scale=2,
)
plot
[6]:

This results in more overlaps among the text elements; in contrast, increasing this value above 3 will result in sparser labels.
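
The effect of the collision scale can be illustrated with a small greedy filter. This is a toy model and not the rendering code DataMapPlot uses: the function visible_labels, the label boxes, and the priorities are hypothetical. Each label’s box is enlarged by the scale factor, labels are visited from highest to lowest priority, and a label is kept only if its enlarged box does not intersect one already kept.

```python
def visible_labels(labels, size_scale):
    """Greedy sketch: keep a label only if its scaled box hits no kept label."""
    kept = []
    ordered = sorted(labels, key=lambda item: item[5], reverse=True)
    for text, x, y, width, height, _priority in ordered:
        half_w, half_h = size_scale * width / 2, size_scale * height / 2
        box = (x - half_w, y - half_h, x + half_w, y + half_h)
        overlaps = any(
            box[0] < other[2] and other[0] < box[2]
            and box[1] < other[3] and other[1] < box[3]
            for _, other in kept
        )
        if not overlaps:
            kept.append((text, box))
    return [text for text, _ in kept]


# (text, x, y, width, height, priority) with priority standing in for cluster size.
labels = [
    ("Reinforcement Learning", 0.0, 0.0, 10.0, 2.0, 5000),
    ("Bandits", 12.0, 0.0, 4.0, 2.0, 800),
    ("Graph Networks", 40.0, 0.0, 8.0, 2.0, 3000),
]
for scale in (1, 2, 3):
    print(scale, visible_labels(labels, scale))
```

In this toy layout the low-priority label next to a large one survives only at the smallest scale, which matches the behaviour described above: smaller values pack labels more tightly, larger values thin them out.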

Other than text, the major plot element to control sizing for is the points in the scatterplot. It is possible to provide different sizes for individual points via the marker_size_array keyword, much like with static plots. To demonstrate this we’ll need a little more data.

[7]:
hover_data_file = requests.get(
    f"{base_url}/raw/interactive/examples/arxiv_ml_hover_data.npy"
)
arxiv_hover_data = np.load(io.BytesIO(hover_data_file.content), allow_pickle=True)
arxiv_marker_size_array = np.asarray([len(x) for x in arxiv_hover_data], dtype=np.float32)

Now we can individually size the points based on, in this case, the length of the title of the paper. We do this by passing the array through via the marker_size_array keyword. Note that we have not scaled these values; DataMapPlot will do that for us. The only situation where you would want to scale values prior to passing them in is if they are not roughly linearly distributed. For example, if we were to use citation counts, most of which are small but a few of which are very large, we would likely want to log-scale that data.
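
As a sketch of that log-scaling step, the snippet below applies np.log1p to some made-up citation counts. The counts and variable names are hypothetical and are not part of the ArXiv dataset used in this notebook; the resulting array is the sort of thing one could pass as marker_size_array in place of the raw counts.

```python
import numpy as np

# Hypothetical citation counts: mostly small, with a few very large values.
citation_counts = np.array([0, 1, 2, 3, 5, 8, 13, 40, 2500, 90000], dtype=np.float32)

# log1p computes log(1 + x), so zero counts map to 0 rather than -inf.
log_scaled_sizes = np.log1p(citation_counts)

# Compare how far the largest value sits from the median before and after.
raw_ratio = citation_counts.max() / np.median(citation_counts)
log_ratio = log_scaled_sizes.max() / np.median(log_scaled_sizes)
print(raw_ratio, log_ratio)
```

The ordering of the values is preserved, but the few huge counts no longer dwarf everything else.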

[8]:
plot = datamapplot.create_interactive_plot(
    arxivml_data_map,
    arxivml_label_layers[0],
    arxivml_label_layers[2],
    arxivml_label_layers[4],
    min_fontsize=14,
    max_fontsize=14,
    marker_size_array=arxiv_marker_size_array,
)
plot
[8]:

If you zoom in you will see that the points in the scatterplot are now of varying size. If you zoom in close enough on a densely packed region you will note that, at a high enough zoom, all the points are the same size and the densely packed region spreads out, allowing you to see the individual points. This is controlled by the point_radius_min_pixels and point_radius_max_pixels keyword arguments. The point_radius_min_pixels value provides the minimum radius, in on-screen pixels, that a point is allowed to be; if the point would render smaller than this value (if you zoom out far enough) it will instead end up fixed at this value. Similarly, point_radius_max_pixels provides a maximum radius, in on-screen pixels, for any point. So as you zoom in, points will increase in size in proportion to your zoom until they reach point_radius_max_pixels pixels, and then they are capped at this size. This allows you to zoom in and resolve dense clusters of points. We can set these to something other than the default values of 0.01 and 24:

[9]:
plot = datamapplot.create_interactive_plot(
    arxivml_data_map,
    arxivml_label_layers[0],
    arxivml_label_layers[2],
    arxivml_label_layers[4],
    min_fontsize=14,
    max_fontsize=14,
    marker_size_array=arxiv_marker_size_array,
    point_radius_min_pixels=1,
    point_radius_max_pixels=8,
)
plot
[9]:

Now, as we zoom out, points will only get so small; in contrast, when we zoom in, the points will max out their size and start resolving from dense clusters much earlier.
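
The clamping behaviour can be written down directly. The function on_screen_radius and the base radius below are hypothetical, and the assumption that on-screen size is simply the base radius times a zoom scale factor is a simplification; the default bounds of 0.01 and 24 are the ones quoted above.

```python
def on_screen_radius(base_radius, zoom_scale, min_pixels=0.01, max_pixels=24.0):
    """Hypothetical sketch: radius grows with zoom but is clamped to pixel bounds."""
    return min(max(base_radius * zoom_scale, min_pixels), max_pixels)


# Same point at increasing zoom, using the bounds from the example above.
zoom_scales = [1, 16, 64, 512]
radii = [on_screen_radius(0.05, s, min_pixels=1, max_pixels=8) for s in zoom_scales]
print(radii)
```

Once every point has hit the upper bound, further zooming moves the points apart without making them any bigger, which is what lets dense clusters resolve into individual points.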

Similar options exist to control the line width around points, and the min and max pixel widths thereof. For example, we can set point_line_width to 0 so that the points are no longer bounded by a thin white border but are instead a consistently solid colour.

[10]:
plot = datamapplot.create_interactive_plot(
    arxivml_data_map,
    arxivml_label_layers[0],
    arxivml_label_layers[2],
    arxivml_label_layers[4],
    min_fontsize=14,
    max_fontsize=14,
    marker_size_array=arxiv_marker_size_array,
    point_radius_min_pixels=1,
    point_line_width=0,
)
plot
[10]:

The last sizing option we will look at is the initial zoom of the plot. It is not uncommon for data maps to include a few sparse points that are quite far from the main body of points. By default DataMapPlot attempts to keep all the data in frame with its initial zoom level. This is not always desirable. The keyword initial_zoom_fraction provides the means to control this. The zoom level is set such that the width or height displayed is this fraction of what would have been selected by default. Thus if we pick a value such as 0.5, we will be zoomed in such that we are viewing half the width or height of what would be chosen to cover all the data.

[11]:
plot = datamapplot.create_interactive_plot(
    arxivml_data_map,
    arxivml_label_layers[0],
    arxivml_label_layers[2],
    arxivml_label_layers[4],
    min_fontsize=14,
    max_fontsize=14,
    initial_zoom_fraction=0.5,
)
plot
[11]:
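
The arithmetic behind the zoom fraction is simple enough to sketch. Both functions below are hypothetical: centring the view on the midpoint of the data bounds is an assumption about how the default view is chosen, and zoom_level_change assumes a power-of-two zoom scale (each zoom level doubles the magnification), which is common in web map viewers but is not confirmed here for DataMapPlot.

```python
import math


def initial_view(x_min, x_max, y_min, y_max, zoom_fraction=1.0):
    """Return (center_x, center_y, width, height) of a hypothetical initial view."""
    center_x = (x_min + x_max) / 2
    center_y = (y_min + y_max) / 2
    width = (x_max - x_min) * zoom_fraction
    height = (y_max - y_min) * zoom_fraction
    return center_x, center_y, width, height


def zoom_level_change(zoom_fraction):
    """Extra zoom levels implied by a fraction, assuming a power-of-two scale."""
    return -math.log2(zoom_fraction)


print(initial_view(-10, 30, 0, 20, zoom_fraction=0.5))
print(zoom_level_change(0.5))
```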

There are many more options (we have mentioned a few without going into details or providing examples); see the API documentation, particularly that of the render_html function, for the fullest list.