{
"cells": [
{
"cell_type": "markdown",
"id": "6eb7056b-b116-41f8-a16d-1c0a991f414d",
"metadata": {},
"source": [
"# Interactive DataMapPlot Sizing Options\n",
"\n",
"This notebook walks you through some of the sizing-specific customizations available for interactive plots in DataMapPlot. There are many fine-grained options, so rather than cover them all, this notebook highlights the major ones and hints at the further customization they allow. To get started we'll need to import DataMapPlot."
]
},
{
"cell_type": "code",
"execution_count": 1,
"id": "e10bab62",
"metadata": {},
"outputs": [],
"source": [
"import datamapplot"
]
},
{
"cell_type": "markdown",
"id": "4cb6e3e2-2781-4f9f-88cb-c99654a26e1d",
"metadata": {},
"source": [
"To demonstrate DataMapPlot's interactive plotting we'll need some data -- ideally data with multiple layers of clustering and labelling. The examples directory of the DataMapPlot repository contains some pre-prepared datasets for experimenting with, so we'll grab one of those. Much like static plotting we'll need a data map, but instead of a single set of clusters with labels, we'll want multiple layers of labelling at different levels of granularity (that can be revealed as one zooms in). In this case we'll use a data map derived from the titles and abstracts of papers from the Machine Learning section of the ArXiv pre-print server. The cluster layers were built using [fast_hdbscan](https://github.com/TutteInstitute/fast_hdbscan), and the cluster labels were generated by an LLM based on the content of the clusters."
]
},
{
"cell_type": "code",
"execution_count": 2,
"id": "775cffd6",
"metadata": {},
"outputs": [],
"source": [
"import numpy as np\n",
"import requests\n",
"import io\n",
"\n",
"base_url = \"https://github.com/TutteInstitute/datamapplot\"\n",
"data_map_file = requests.get(\n",
" f\"{base_url}/raw/main/examples/arxiv_ml_data_map.npz\"\n",
")\n",
"arxivml_data_map = np.load(io.BytesIO(data_map_file.content))[\"arr_0\"]\n",
"arxivml_label_layers = []\n",
"for layer_num in range(5):\n",
"    label_file = requests.get(\n",
"        f\"{base_url}/raw/interactive/examples/arxiv_ml_layer{layer_num}_cluster_labels.npz\"\n",
"    )\n",
"    # Index into the loaded archive first, then append the resulting array\n",
"    arxivml_label_layers.append(np.load(io.BytesIO(label_file.content), allow_pickle=True)[\"arr_0\"])"
]
},
{
"cell_type": "markdown",
"id": "86b6fa45-7f04-4bea-abf6-768f831424a7",
"metadata": {},
"source": [
"Let’s start by making a basic interactive DataMapPlot output based on this data so we have an idea of what the starting point looks like, and can better understand what the various customizations we will be applying can do for us."
]
},
{
"cell_type": "code",
"execution_count": 3,
"id": "f38a1caa",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
" \n",
" "
],
"text/plain": [
""
]
},
"execution_count": 3,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"plot = datamapplot.create_interactive_plot(\n",
" arxivml_data_map,\n",
" arxivml_label_layers[0],\n",
" arxivml_label_layers[2],\n",
" arxivml_label_layers[4],\n",
")\n",
"plot"
]
},
{
"cell_type": "markdown",
"id": "dd0b868a-8abb-40ef-ad33-ddd32f341c8a",
"metadata": {},
"source": [
"By default the result is displayed in an iframe within the notebook. The most obvious thing one might like to change is the size of that output. We can do that via the ``width`` and ``height`` keyword arguments, which specify the width and height of the resulting iframe. The value you pass can be either an integer (which will be interpreted as a number of pixels), or a string that can be interpreted as a size specification for an iframe. For example, the default width is ``\"100%\"``, ensuring that the output spans the entire width of the notebook. We can choose something else, however -- let's try making both the width and the height an explicit number of pixels:"
]
},
{
"cell_type": "code",
"execution_count": 4,
"id": "4c6c19c6-7bb9-4ba7-8f94-c32f450f1a32",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
" \n",
" "
],
"text/plain": [
""
]
},
"execution_count": 4,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"plot = datamapplot.create_interactive_plot(\n",
" arxivml_data_map,\n",
" arxivml_label_layers[0],\n",
" arxivml_label_layers[2],\n",
" arxivml_label_layers[4],\n",
" width=400,\n",
" height=400,\n",
")\n",
"plot"
]
},
{
"cell_type": "markdown",
"id": "3367ea58-b015-4f93-a088-3b1657c35e3d",
"metadata": {},
"source": [
"We get a much smaller plot, now left-justified since it no longer stretches to fill the width. Note that the overall size of the scatterplot has not changed; instead we've changed the size of our viewport onto it.\n",
"\n",
"Moving beyond the size of the plot itself, let's look at how we can control the size of the plot elements. The major plot elements to consider are the individual points in the scatterplot and the text labels of the clusters. We'll look at the sizing of text first. In general the interactive plot in DataMapPlot attempts to automatically size text based on the size of the clusters (specifically, the scaling is based on the fourth root of the cluster size), providing smaller labels for smaller clusters. While it does not give explicit control over text sizes per cluster, DataMapPlot does provide a means to bound the sizes of the text it generates. Specifically, the ``min_fontsize`` and ``max_fontsize`` keyword arguments set the lower and upper bounds, in points, of the font sizes generated by DataMapPlot. For example, we can set both to the same value and have all text labels be the same size."
]
},
{
"cell_type": "code",
"execution_count": 5,
"id": "78cf0189",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
" \n",
" "
],
"text/plain": [
""
]
},
"execution_count": 5,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"plot = datamapplot.create_interactive_plot(\n",
" arxivml_data_map,\n",
" arxivml_label_layers[0],\n",
" arxivml_label_layers[2],\n",
" arxivml_label_layers[4],\n",
" min_fontsize=14,\n",
" max_fontsize=14,\n",
")\n",
"plot"
]
},
{
"cell_type": "markdown",
"id": "23e701cf-e443-42fa-8296-761dbaffcc3f",
"metadata": {},
"source": [
"A more common task is to simply increase the minimum size of text shown, or allow for extra large labels for the very large clusters. There are a number of other options for fine-tuning, including ``text_outline_width`` and ``linespacing``, which we will not go into here. One further option that may be useful is ``text_collision_size_scale``. This controls how collision detection between labels is handled, with higher priority labels (those from larger clusters) being shown and those they overlap or collide with being hidden. The default value is 3, which provides a good amount of spacing between labels -- but you can set it to other values:"
]
},
{
"cell_type": "code",
"execution_count": 6,
"id": "89f04852",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
" \n",
" "
],
"text/plain": [
""
]
},
"execution_count": 6,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"plot = datamapplot.create_interactive_plot(\n",
" arxivml_data_map,\n",
" arxivml_label_layers[0],\n",
" arxivml_label_layers[2],\n",
" arxivml_label_layers[4],\n",
" min_fontsize=14,\n",
" max_fontsize=14,\n",
" text_collision_size_scale=2,\n",
")\n",
"plot"
]
},
{
"cell_type": "markdown",
"id": "ef33578c-42b2-4b38-9e79-75d8373fb96f",
"metadata": {},
"source": [
"This results in more overlaps among the text elements; in contrast increasing this value above 3 will result in sparser labels appearing.\n",
"\n",
"Other than text, the major plot elements whose size we can control are the points in the scatterplot. It is possible to provide different sizes for individual points via the ``marker_size_array`` keyword, much like with static plots. To demonstrate this we'll need a little more data."
]
},
{
"cell_type": "code",
"execution_count": 7,
"id": "f28bbc94",
"metadata": {},
"outputs": [],
"source": [
"hover_data_file = requests.get(\n",
" f\"{base_url}/raw/interactive/examples/arxiv_ml_hover_data.npz\"\n",
")\n",
"arxiv_hover_data = np.load(io.BytesIO(hover_data_file.content), allow_pickle=True)[\"arr_0\"]\n",
"arxiv_marker_size_array = np.asarray([len(x) for x in arxiv_hover_data], dtype=np.float32)"
]
},
{
"cell_type": "markdown",
"id": "84603cd7-20ec-4e55-b6d9-1eebf78eb4be",
"metadata": {},
"source": [
"Now we can individually size the points based on, in this case, the length of the title of the paper. We do this by passing the array through via the ``marker_size_array`` keyword. Note that we have not scaled these values -- DataMapPlot will do that for us. The only situation where you would want to scale values prior to passing them in is if they are not roughly linearly distributed; for example, if we were to use citation counts, most of which are small but a few of which are very large, we would likely want to log-scale that data."
]
},
{
"cell_type": "code",
"execution_count": 8,
"id": "fdadbbcc",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"