Interactive DataMapPlot Selection and Filtering

This notebook will get you started on using selection and filtering tools in DataMapPlot interactive plots. There is a lot that can be done, so this notebook will instead highlight some of the major options and walk through some simple examples to get you started on building your own selection options. To get started we’ll need to import DataMapPlot.

[1]:

import datamapplot

/work/home/lmmcinn/.conda/envs/datamapplot/lib/python3.12/site-packages/tqdm/auto.py:21: TqdmWarning: IProgress not found. Please update jupyter and ipywidgets. See https://ipywidgets.readthedocs.io/en/stable/user_install.html
  from .autonotebook import tqdm as notebook_tqdm

To demonstrate what DataMapPlot interactive plots can do we’ll need some data. The examples directory of the DataMapPlot repository contains some pre-prepared datasets for experimenting with. We’ll grab one of those. Much like the static plotting we need a data map – as set of 2d coordinates, one per data sample we are mapping – and at least one set of labels idenityfing the “topic” of a data sample, usually based on clusters in the data map. In this case we’ll use a data map derived from the CORD-19 dataset – a dataset of papers and scientific articles related to COVID-19 that was curated by Allen AI. To show off some of the filtering options we’ll be sure to load some other associated metadata for the papers such as the primary field of research, the publication date, and the citation counts. All of this data is also available in the examples directory.

[2]:

import numpy as np
import pandas as pd
import requests
import io

base_url = "https://github.com/TutteInstitute/datamapplot"

data_map_file = requests.get(
    f"{base_url}/raw/main/examples/cord19_umap_vectors.npz"
)
cord19_data_map = np.load(io.BytesIO(data_map_file.content))["arr_0"]
cord19_label_layers = []
for i in range(6):
    label_file = requests.get(
        f"{base_url}/raw/main/examples/cord19_layer{i}_cluster_labels.npz"
    )
    cord19_label_layers.append(
        np.load(io.BytesIO(label_file.content), allow_pickle=True)["arr_0"]
    )
hover_text_file = requests.get(
    f"{base_url}/raw/main/examples/cord19_large_hover_text.npz"
)
cord19_hover_text = np.load(io.BytesIO(hover_text_file.content), allow_pickle=True)["arr_0"]
cord19_extra_data = pd.read_feather(f"{base_url}/raw/main/examples/cord19_extra_data.arrow")
date_file = requests.get(
    f"{base_url}/raw/main/examples/CORD19-subset-publish-dates.npz"
)
cord19_publication_date = np.load(io.BytesIO(date_file.content))["arr_0"]
cord19_extra_data["publication_date"] = pd.Series(cord19_publication_date)
cord19_extra_data["publication_date"] = cord19_extra_data.publication_date.fillna(cord19_extra_data.publication_date.max()).dt.as_unit('s')

To begin let’s make a basic interactive plot. By this point we’ll assume you are more familiar with some of the interactive plotting options, so we’ll make use of them to make a good looking plot that includes hover information and is searchable. This will provide the baseline that we’ll add first filtering and later selection options to.

Tip: Due to the larger datasets and a desire to keep the documentation pages smaller, this tutorial makes use of the inline_data=False directive to ensure all the data for all the plots are not embedded in a single page. This can cause issues, and does not work (for example) on Colab. If you are having issues running this notebook, try removing the inline_data=False keyword arguments and see if that helps! Also note that this means some of the exampes on the documentation page can take quite a while to load. Please be patient.

[3]:

plot = datamapplot.create_interactive_plot(
    cord19_data_map,
    *cord19_label_layers,
    hover_text=cord19_hover_text,
    font_family="Cinzel",
    enable_search=True,
    inline_data=False,
    offline_data_prefix="cord-large-1",
    initial_zoom_fraction=0.9,
)
plot

[3]:

Suppose we want to look at the data for varying citation counts. We could colour the data according to citation count, but that will likely be messy and hard to see. Likewise sizing points in relation to citation count can work, but does a poorer job at letting you quickly see where all the high citation count papers reside. A better approach would be to provide a filtering tool that shows the distribution of citation counts and lets you select from there. To do this we can add the citation count information as histogram_data. Since citation counts have a power-law-like distribution we’ll make things a little easier by looking at the log of the citation counts.

[4]:

plot = datamapplot.create_interactive_plot(
    cord19_data_map,
    *cord19_label_layers,
    hover_text=cord19_hover_text,
    font_family="Cinzel",
    enable_search=True,
    inline_data=False,
    offline_data_prefix="cord-large-2",
    initial_zoom_fraction=0.9,
    histogram_data=np.log10(1 + cord19_extra_data.citation_count),
)
plot

[4]:

The resulting interactive plot has a histogram in the bottom left corner, showing the distirbution of (log of) citation counts. If you hover your mouse over individual bins in the histogram the data map will update to display only the points in that bin, letting you quickly view the relative positions in the datamap. To create a persistent selection you can hold shift and drag in the histogram to create a selection box covering multiple bins. The selection box can be resized (by dragging the edges) or moved (by dragging from the center) while holding shift, with the selection updating in realtime. You can also click on individual bins to persistently select just that bin – clicking the same bin again will deselect it.

It is also worth noting that we have the seach feature in the upper right – we can do a text search on the titles of the papers. This search actually interacts with the histogram. If no histogram selection is made the histogram will update to show the distribution of items matching the search. Now hovering selects only items in the bix that also match the search. Similarly selections within the histogram result in only results that match both the text search and the histogram selection filtering. It is worth playing around a little with this, to see how the interaction between the two ways of selecting/filtering points in the data map works.

Of course we may wish to look at distributions of something that isn’t a standard numeric value. What if we want to look at dates or times? This would be a very common use case – being able to filter a data map by the date or time a certain data sample was created or measured is very valuable. Let’s see what happens when we pass in the publication dates of the papers as histogram_data. While we are at it, let’s use some of the other historgam options, such as setting the number of histogram bins to use.

[5]:

datamapplot.create_interactive_plot(
    cord19_data_map,
    *cord19_label_layers,
    hover_text=cord19_hover_text,
    font_family="Cinzel",
    enable_search=True,
    inline_data=False,
    offline_data_prefix="cord-large-3",
    initial_zoom_fraction=0.9,
    histogram_data=cord19_extra_data.publication_date,
    histogram_n_bins=12
)

[5]:

We see that the dates were handled just fine – the histogram is relatively data aware, and can work with dates and times natively. Again it is worth verifying that filtering works as before, and that interactions with the text search are indeed easy.

We do have an obvious problem though. Since the data collected any papers relevant to COVID-19 it includes some very old papers dating back to the 70s. The reality is that almost all of the actual data lands in a range much closer to 2019-2021. It would be good if we could actually pick a range for the hostogram. Fortunately we can via the histogram_range option which takes a tuple of values – the start and end for the histogram. Items that fall outside the histogram range specified will be placed in either the first, or last, bin in the historgam, depending on whether they are below or above the range respectively. In our case, since the data is actually dates we will need to generate pandas datetime objects for the start and end of our range (if the data was numeric we could simply specify numeric values directly).

Another issue was choosing a number of bins – while that can work for arbitrary numeric ranges, dates and times are special, so it would be nice to be able to have the data bins meet with those categories. It would be good, for example, to bin dates by the month or year, or times by the hour, minute or second (depeding on the time range of the data). For this we have the histogram_group_datetime_by option which takes a time unit ("year", "quarter", "month", "day", "hour", "minute" or "second") to group by. Give the kind of time range we are looking at grouping by quarter makes sense.

[6]:

datamapplot.create_interactive_plot(
    cord19_data_map,
    *cord19_label_layers,
    hover_text=cord19_hover_text,
    font_family="Cinzel",
    enable_search=True,
    inline_data=False,
    offline_data_prefix="cord-large-4",
    initial_zoom_fraction=0.9,
    histogram_data=cord19_extra_data.publication_date,
    histogram_group_datetime_by="quarter",
    histogram_range=(pd.to_datetime("2019-09-01"), pd.to_datetime("2022-01-01"))
)

[6]:

We get a better looking histogram thsi tiem around, showing more of the range of data. We do have a little problem with pre-2019 data being piled into the bottom-most bin, but we can overlook that for now. We see that there was a major peak in publication in the last quarter of 2020, and selection shows it ws pretty evenly distribuetd over all the different fields and topics. Hopefully this suffices as a demo of the date-time handlign capabilities of histograms.

Let’s move on to another data type – purely categorical data. What happens if we provide the primary field of research of the papers as histogram_data? There are actually a surprisingly diverse range of fields included in the dataset. Let’s keep the number of bins down to 15. When we specify the number of bins for categorical data like this it will take the top 15 categories, and lump the rest of the data into a special “Other” category. Since this is the last histogram example we’re goign to do, let’s also make use of it to show some of the further options with the histogram_settings which allow you control over the size and presentation of the histogram. We’ll even put the y-axis of the histogram on a log-scale (since many of the fields have relatively few papers, and it would be hard to see them on a linear scale).

[7]:

datamapplot.create_interactive_plot(
    cord19_data_map,
    *cord19_label_layers,
    hover_text=cord19_hover_text,
    font_family="Cinzel",
    enable_search=True,
    inline_data=False,
    offline_data_prefix="cord-large-5",
    initial_zoom_fraction=0.9,
    histogram_data=cord19_extra_data.primary_field,
    histogram_n_bins=15,
    histogram_settings={
        "histogram_log_scale":True,
        "histogram_title":"Primary Field",
        "histogram_bin_fill_color":"#a64531",
        "histogram_bin_unselected_fill_color":"#d6a591",
        "histogram_bin_selected_fill_color":"#f68571",
        "histogram_width":400,
        "histogram_height":125,
    }
)

[7]:

That was a whirlwind tour of the filtering options available via histograms in DataMapPlot. Hopefully it is enough to get you started. Please consult the doctsrings or API docs for more information about the parameters and options available to get the most out of your filtering histograms.

Now let’s turn to other ways to select data. DataMapPlot supports lasso-style selection of points. This is turned off unless a way to handle such a selection is provided. For that we’ll need to load an extra module of selection handlers.

[8]:

import datamapplot.selection_handlers

The selection_handlers module provides some pre-built tools for reacting to lasso selections in an interactive data map. Currently the list is small, but hopefully it will grow over time. If a selection handler is instatiated and passed to create_interactive_plot as a selection_handler it will enable lasso selection for the plot, with lasso selections resulting in callbacks in the handler being called. A very simple selection handling option is to simply open a side-pane that provides a list of a sample of (hover_text of) items in the selection. The selection_handlers module has one of these called DisplaySample which we can instantiate, setting the number of items to randomly sample to 50. To make a selection in the resulting data map, once the map is loaded, hold down the shift key and draw a lasso with your mouse.

[9]:

datamapplot.create_interactive_plot(
    cord19_data_map,
    *cord19_label_layers,
    hover_text=cord19_hover_text,
    font_family="Cinzel",
    enable_search=True,
    inline_data=False,
    offline_data_prefix="cord-large-6",
    initial_zoom_fraction=0.9,
    selection_handler=datamapplot.selection_handlers.DisplaySample(n_samples=50)
)

[9]:

There are a number of built in selection handlers available:

Handler Class	Use case	Description
DisplaySample	Show a sampling of the selected data	You’ve already seen it, but it allows you to resample as required. This is good for text data which can be quite long
TagSelection	Tag data samples based on selections	Allows a basic tagging system where resulting tags can be download as JSON
ExportSelection	A simple export of all metadata associated to selected points	Adds a simple button to allow the download of a all the metadata associated to a given selection
DataTable	Show a table of associated metadata	Similar to the “DisplaySample” but much better for complex tabular metadata. It comes with a data download option
Histogram	Show histograms of metadata associated to a selection	Supports categorical and continuous data, and can show multiple fields at once for comparison
Statistics	Simple statistics about the selection	A simple handler that can display basic ststaistics about metadata associated to the selection
WordCloud	Show a word cloud of text from the selection	For text data this can provide a word cloud visualization to give a sense of the major themes/topics of the text in the selected region
CohereSummary	Use and LLM to summarize the selection	This is really more of an example/placeholder to show how a selection handler can reach out to an LLM API to summarize the content of a cluster

That’s very nice, and can provide some utility, but it is likely that you have a rather more particular task in mind for acting on selections. That’s okay, because you can build your own selection handler. It will involve a little javascript (and possibly HTML and CSS), but hopefully the results will be worth it. To understand what is required let’s start with the just about the simplest possible selection handler. We need to create a new class that inherits from datamapplot.selection_handlers.SelectionHandlerBase. We can then add various properties to it to enable it to do things upon selection. The most critical of these is javascript which should be a property that returns a string of javascript code to be used for handling selections. The critical thing here is that it must define a function to act as a callback upon selection, taking the indices of the points selected, and you must register that handler with the datamap via await datamap.addSelectionHandler(callbackFunction);. For our simple case we can simply raise an alert that gives the number of items in the selection.

[10]:

class NewSelectionHandler (datamapplot.selection_handlers.SelectionHandlerBase):

    @property
    def javascript(self):
        return """
function customCallback(selectedIndices) {
    alert("Lasso selection selected " + selectedIndices.length + " total points");
}

await datamap.addSelectionHandler(customCallback);
"""

Now that we have created a new selection handler class we can pass that in as a selection_handler to create_interactive_plot.

[11]:

plot = datamapplot.create_interactive_plot(
    cord19_data_map,
    *cord19_label_layers,
    hover_text=cord19_hover_text,
    font_family="Cinzel",
    enable_search=True,
    inline_data=False,
    offline_data_prefix="cord-large-6",
    initial_zoom_fraction=0.9,
    selection_handler=NewSelectionHandler()
)
plot

[11]:

The result is that when you use shift-lasso-drag to select points in the plot a popup message in the browser will inform your of the number of points selected. This is often a modal dialog, so you’ll have to clear it by hitting “okay” or similar before seelecting again. Nonetheless this gives the core skeleton for creating a selection handler – we need a javascript function lassoSelectionCallback that can take a array of indices and do something with it. As long as you can write a little javascript you can arrange for interesting things to happen upon selection.

Let’s get a little more fancy. Using the alert function is not ideal. It would be better if we could have some HTML element on the page show the information we want. So let’s do that. As before we’ll need to create a class inheriting from datamapplot.selection_handlers.SelectionHandlerBase, and a javascript function lassoSelectionCallback to operate upon the selection. Now, however, since we want to display results in the actual HTML output, we’ll also need some HTML and CSS elements. These can be added as properties for the class, similarly to the javascript property.

We’ll need a div to put any output in, ideally with some id or class so we can style it, and reference it from the javascript. In this case I’ll use an id but I’ll set the class to be 'container-box' which will ensure it is styled to match other items such as the title, search box, or logo – this is a class defined by datamapplot, but worth using here. For the moment the div is simply empty. It will be up to the javascript function to add or alter the content of the div. All of this simply goes into the string returned by the html property.

Similarly we can create a css property that returns a string of CSS we wish to add to the eventual output. In this case we’ll simply place the div, based on its id, in the bottom right, set a font-size, and ensure it resizes itself to fit whatever the content of the div is.

Lastly we need to have our lassoSelectionCallback actually do something. The first step is to get the relevant div based on the id we set for it in the HTML. Next we set the innerHTML to be a giving the count of how many points were selected. And that’s it – the content of the div will now be reactive to the lasso selection, updating the text with each new selection made.

[12]:

class NewSelectionHandler (datamapplot.selection_handlers.SelectionHandlerBase):

    @property
    def javascript(self):
        return """
function customCallback(selectedIndices) {
    const countDisplayContainer = document.getElementById("count-display");
    countDisplayContainer.innerHTML = "Lasso selection selected " + selectedIndices.length + " total points";
}

await datamap.addSelectionHandler(customCallback);
"""
    @property
    def html(self):
        return "<div id='count-display' class='container-box'></div>"

    @property
    def css(self):
        return """
#count-display {
    position: absolute;
    right: 0;
    bottom: 0;
    width: fit-content;
    height: fit-content;
    font-size: 24;
}
"""

Now we simply add this new selection handler to a plot using the selection_handler parameter and we are good to go. You can test out that lasso selection indeed updates the text.

[13]:

plot = datamapplot.create_interactive_plot(
    cord19_data_map,
    *cord19_label_layers,
    hover_text=cord19_hover_text,
    font_family="Cinzel",
    enable_search=True,
    inline_data=False,
    offline_data_prefix="cord-large-6",
    initial_zoom_fraction=0.9,
    selection_handler=NewSelectionHandler()
)
plot

[13]:

That demonstrates the basic concepts involved in writing your own selection handler – using javascript to update or change HTML elements based on the selection. And hopefully you now have some idea of how you might go abaout writing one of your own. It isn’t a great demonstration of what can be done, however, nor how to access data beyond the specific indices of the selected points. To remedy that let’s build something more complicated that actually does something interesting. Out of necessity this is going to be more complicated and involve more javascript, but I’ll try to highlight the core elements that matter versus aspects that are just javascript to do useful work that are specific to this example.

First we need a goal. For a relatively simple demonstration that still does something let’s try to create a word-cloud based on the selected items. That could easily turn into a lot of javascript code, but we can save ourselves a lot of work by making use of existing libraries. In this case there is a d3-cloud library that allows for dynamic word-clouds using the d3 data visualization library. A catch is that we’ll need those libraries loaded into the HTML file. We can do this by passing in a set of dependencies to the base class. With the visualization side hopefully made simpler, we just need some more basic javascript functions – the ability to count occurrences of words in arrays of text, and the ability to remove stop-words (common words such as “the”, “and”, “or” etc.) from the output of that. We can use the English language stop word list from sklearn, and write a short function called wordCounter to handle the word counting. You can, of course, write whatever helper functions you need for your use case.

This brings us to actually writing the lassoSelectionCallback. First we need to get the text associated to the selection. That kind data, along with any other “extra_point_data” is stored in datamap.metaData, keyed by the column name. In our case we want the hover_text which has the text we get when hovering over items. We can translate from indices to actual content by mapping each index to the text in the datamap.metaData.hover_text array at that index via const selectedText = selectedPoints.map(i => datamap.metaData.hover_text[i]);. Now we simply build word-counts from the array of texts using our helper function, and pass the results to another function that uses d3-cloud to make a word cloud.

As long as we ensure we have a div to place the word-cloud in (via the html property) and style it appropriately (via the css property) then we are essentially done.

[14]:

import sklearn.feature_extraction.text
import string

class WordCloud (datamapplot.selection_handlers.SelectionHandlerBase):

    def __init__(self, n_words=256, width=300, height=400, stop_words=None, **kwargs):
        super().__init__(
            dependencies=[
                "https://d3js.org/d3.v6.min.js",
                "https://unpkg.com/d3-cloud@1.2.7/build/d3.layout.cloud.js",
            ],
            **kwargs
        )
        self.n_words = n_words
        self.width = width
        self.height = height
        self.stop_words = stop_words or list(sklearn.feature_extraction.text.ENGLISH_STOP_WORDS)

    @property
    def javascript(self):
        return f"""
const _STOPWORDS = new Set({self.stop_words});
const svg = d3.select("#word-cloud").append("svg")
    .attr("width", {self.width})
    .attr("height", {self.height})
    .append("g")
    .attr("transform", "translate(" + {self.width} / 2 + "," + {self.height} / 2 + ")");
const wordCloudItem = document.getElementById("word-cloud");

function wordCounter(textItems) {{
    // Split words
    const words = textItems.join(' ').toLowerCase().split(/\s+/);

    // Count unique words
    const wordCounts = new Map();
    words.forEach(word => {{
        wordCounts.set(word, (wordCounts.get(word) || 0) + 1);
    }});

    // Remove stopwords, trim to the top words and rescale
    _STOPWORDS.forEach(stopword => wordCounts.delete(stopword));
    const result = Array.from(wordCounts, ([word, frequency]) => ({{ text: word, size: Math.sqrt(frequency) }}))
                        .sort((a, b) => b.size - a.size).slice(0, {self.n_words});
    const maxSize = Math.max(...(result.map(x => x.size)));
    return result.map(({{text, size}}) => ({{ text: text, size: (size / maxSize)}}));
}}

function generateWordCloud(words) {{
  const width = {self.width};
  const height = {self.height};

  // Configure a cloud layout
  const layout = d3.layout.cloud()
    .size([width, height])
    .words(words.map(d => ({{text: d.text, size: 5 + d.size * width / 10}})))
    .rotate(() => 0)
    .fontSize(d => d.size)
    .font("Impact")
    .on("end", draw);

  layout.start();

  function draw(words) {{
    // Update existing words
    const text = svg.selectAll("text")
      .data(words, d => d.text);

    // Remove old words
    text.exit()
      .remove();

    // Add new words
    text.enter()
      .append("text")
      .attr("text-anchor", "middle")
      .text(d => d.text)
      .merge(text)
      .attr("transform", d => "translate(" + [d.x, d.y] + ")rotate(" + d.rotate + ")")
      .attr("font-size", d => d.size)
      .attr("font-family", "Impact");
  }}
}}

function wordCloudCallback(selectedPoints) {{
    const selectedText = selectedPoints.map(i => datamap.metaData.hover_text[i]);
    const wordCounts = wordCounter(selectedText);
    generateWordCloud(wordCounts);
}}

await datamap.addSelectionHandler(wordCloudCallback);
        """

    @property
    def html(self):
        return """<div id="word-cloud" class="container-box"></div>"""

    @property
    def css(self):
        return f"""
#word-cloud {{
    position: absolute;
    right: 0;
    bottom: 0;
    width: {self.width}px;
    height: {self.height}px;
    font-family: Impact;
    z-index: 10;
}}
"""

<>:93: SyntaxWarning: invalid escape sequence '\s'
<>:93: SyntaxWarning: invalid escape sequence '\s'
/tmp/ipykernel_60011/1619870544.py:93: SyntaxWarning: invalid escape sequence '\s'
  @property

A much more involved example, but mostly it is javascript code to do the work of parsing the text and building the word-cloud. We can now hand this off to create_interactive_plot and verify that it works. If you want to make adjustments to the above and try them out feel free. Be aware, however, that errors produced internally to the javascript will only show up in the javascript console (accessible in the “Developer Tools” for your browser), so be sure to check there if things seem to be silently failing to work as intended.

[15]:

plot = datamapplot.create_interactive_plot(
    cord19_data_map,
    *cord19_label_layers,
    hover_text=cord19_hover_text,
    font_family="Cinzel",
    enable_search=True,
    inline_data=False,
    offline_data_prefix="cord-large-6",
    initial_zoom_fraction=0.9,
    selection_handler=WordCloud(256)
)
plot

[15]:

That’s a decent word cloud, but of course we can likely do better, and provide more options. Fortunately datamapplot.selection_handlers has a WordCloud handler pre-built that does he same as the above, but also much more, with more flexibility and better aesthetics. Here’s an example of what we can do:

[16]:

plot = datamapplot.create_interactive_plot(
    cord19_data_map,
    *cord19_label_layers,
    hover_text=cord19_hover_text,
    font_family="Cinzel",
    enable_search=True,
    inline_data=False,
    offline_data_prefix="cord-large-6",
    initial_zoom_fraction=0.9,
    selection_handler=datamapplot.selection_handlers.WordCloud(256, width=500, height=300, font_family="Cinzel", color_scale="viridis_r", n_rotations=0)
)
plot

[16]: