Unsupervised Name-Matching: Introducing John to Dr. Watson
Dr. Philipp Warmer
In everyday life, individuals are referred to by a variety of names, and which name is used largely depends on the context. ‘Dr. John H. Watson’ might be called ‘John’ in an intimate setting or ‘Dr. Watson’ in a professional one. Although the two variants have nothing in common, they refer to the same person. This is obvious to a human, but making the connection is challenging for a computer because it lacks the context.
Here we showcase an unsupervised workflow based on dirty_cat (https://dirty-cat.github.io/), spaCy (https://spacy.io/) and scikit-learn (https://scikit-learn.org/stable/) to determine the underlying name-groups of different name variants or, in layman’s terms, to teach the computer that ‘John’ and ‘Dr. Watson’ both refer to the name-group ‘Dr. John H. Watson’. This is done in 4 steps: 1) extracting the names from the text, 2) determining the number of name clusters based on the name similarities, 3) using the number of clusters to determine meaningful name-groups and 4) assigning each name to a name-group.
Setting up the environment
Before we get started let’s make sure we have all required libraries and the respective spaCy language model installed.
Package          Version
---------------  -------
dirty-cat        0.2.0
en-core-web-sm   3.2.0
matplotlib       3.5.1
numpy            1.22.3
pandas           1.4.2
scikit-learn     1.0.2
seaborn          0.11.2
spacy            3.2.4
After installing spaCy, the language model can be downloaded like this:
> python -m spacy download en_core_web_sm
Let’s load the required packages
Now that all the libraries are installed, let us import them. I ran all of the following code with Python 3.9.11.
import numpy as np
import spacy
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.cluster import AffinityPropagation
from dirty_cat import SimilarityEncoder, GapEncoder
If you can import all of the libraries without an error, you are ready to go. Let’s start!
1. Detecting and extracting names from text
First we need a list of names. The get_names function extracts them from an input string using named entity recognition. This step is included to make the workflow complete; in the next step we continue with a predefined list of names.
Let’s get started!
> names = get_names(some_book)
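The body of get_names is not shown here. As a minimal sketch of what such a function might look like, assuming spaCy’s named entity recognizer and its PERSON label, consider the following. The optional nlp parameter is my own addition for illustration: it lets you pass in an already loaded pipeline (or a stand-in for testing) instead of loading en_core_web_sm on every call.

```python
def get_names(text, nlp=None):
    """Return the unique PERSON entities found in `text`, in order of appearance.

    Sketch only: `nlp` is a hypothetical convenience parameter. If it is
    omitted, the small English spaCy model is loaded on demand.
    """
    if nlp is None:
        import spacy  # imported lazily so a custom pipeline can be injected
        nlp = spacy.load("en_core_web_sm")

    doc = nlp(text)
    unique_names = {}  # dicts keep insertion order, so this de-duplicates in order
    for ent in doc.ents:
        if ent.label_ == "PERSON":
            unique_names.setdefault(ent.text.strip(), None)
    return list(unique_names)
```

Which spans get tagged as PERSON depends entirely on the language model, so titles such as ‘Dr.’ may or may not be part of the extracted name.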
2. Determine the similarity of names
To learn how many individuals our names should be mapped onto, we first need to estimate the number of people to whom the names belong. For illustration purposes I’ve chosen two people, ‘Dr. John H. Watson’ and ‘Sherlock Holmes’. For each of them I’ve selected multiple name variants: first name only, last name only, with and without title, and with a typo. Using this example list as input, we compute the n-gram similarity with the get_encodings function. Under the hood it uses the similarity encoder from the dirty_cat library. The get_clusters function then clusters the similarities with affinity propagation. This yields an automated cluster assignment for each name and thus also the total number of clusters.
> names = ['Watson', 'John', 'John H. Watson', 'Sherlock Holmes', 'olmes', 'Sherlock', 'Holmes', 'Dr. Watson']
> encodings = get_encodings(names)
> clusters, n_clusters = get_clusters(encodings)
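To make the idea of n-gram similarity more tangible, here is a small pure-Python sketch. It is not dirty_cat’s implementation, only a simplified stand-in that I assume is close in spirit: each name is split into character 3-grams and two names are compared by the Jaccard overlap of their 3-gram sets. The helper names char_ngrams and ngram_similarity are hypothetical.

```python
def char_ngrams(text, n=3):
    """Return the set of character n-grams of a lower-cased, space-padded string."""
    padded = f" {text.lower()} "
    return {padded[i:i + n] for i in range(len(padded) - n + 1)}


def ngram_similarity(a, b, n=3):
    """Jaccard overlap of the character n-gram sets of `a` and `b`, in [0, 1]."""
    grams_a, grams_b = char_ngrams(a, n), char_ngrams(b, n)
    union = grams_a | grams_b
    if not union:  # both strings too short to produce any n-gram
        return 0.0
    return len(grams_a & grams_b) / len(union)


names = ["Watson", "John", "John H. Watson", "Sherlock Holmes",
         "olmes", "Sherlock", "Holmes", "Dr. Watson"]

# Pairwise similarity matrix: one row and one column per name.
similarity_matrix = [[ngram_similarity(a, b) for b in names] for a in names]
```

With this measure ‘John’ and ‘Dr. Watson’ share no 3-gram at all, so their direct similarity is zero. They are only connected indirectly, because both overlap with ‘John H. Watson’, which is why the names are clustered as a group rather than compared pair by pair.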
Let’s see whether we retrieved the two expected name clusters for Dr. John H. Watson and Sherlock Holmes.
> print(n_clusters)
>> 2
Hurray! We got the right number of name clusters. Let’s next figure out the underlying name groups.
3. Let’s determine the name-groups
Now that we have automatically determined the number of name clusters (i.e. number of people) we next use the function get_name_groups to determine the underlying name constituents as well as the activations for each name-group. Why do we want the activations? In short, an activation can be understood as how strongly a given name responds to a name-group, thereby giving us a measure of relatedness.
> name_groups, name_activations = get_name_groups(names, n_clusters)
So let’s check if the name-groups make sense to us.
> print(name_groups)
>> [‘sherlock, holmes’, ‘watson, john’]
That looks promising. We determined two name-groups: ‘sherlock, holmes’, ‘watson, john’. Let us now connect them to our list of names.
4. Let’s map each name to a name-group
Now we can map the name-group activation to each name. The figure below, generated with plot_topic_activations, shows the activation value for each name / name-group pair. Do we see activations that make sense to us?
> plot_topic_activations(name_activations, name_groups, names)
To turn these activations into an assignment, we use the get_clean_names function to select the largest activation value for each name. This way every name is assigned its most strongly associated name-group. Let’s check if we can connect John and Dr. Watson to the same name-group.
> matched_names = get_clean_names(name_activations, name_groups, names)
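The selection step itself is just an argmax over the activations of each name. The sketch below shows the idea under a few assumptions of mine: the activations come as one row per name with one value per name-group, the function name assign_name_groups is hypothetical, and the numbers are made up purely for illustration and are not results from the workflow.

```python
def assign_name_groups(activations, name_groups, names):
    """Map each name to the name-group with the largest activation value."""
    matched = {}
    for name, row in zip(names, activations):
        # Index of the largest activation in this row (a plain argmax).
        best_index = max(range(len(row)), key=row.__getitem__)
        matched[name] = name_groups[best_index]
    return matched


# Toy input with invented activation values, one row per name.
toy_names = ["John", "Dr. Watson", "Sherlock"]
toy_groups = ["sherlock, holmes", "watson, john"]
toy_activations = [
    [0.1, 2.3],  # John
    [0.2, 1.7],  # Dr. Watson
    [3.0, 0.1],  # Sherlock
]

matched = assign_name_groups(toy_activations, toy_groups, toy_names)
```

In this toy example both ‘John’ and ‘Dr. Watson’ end up in the name-group ‘watson, john’, which is the behaviour we are after.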
Voilà, let’s wrap it up.
In this automated workflow we started out by highlighting how names can be extracted using named entity recognition. Next we selected a set of names and computed their n-gram similarity, from which we determined the number of meaningful name clusters. Afterwards we pulled out their underlying name-groups and mapped them back to the initial names using topic activations. This way we performed an unsupervised mapping of both ‘John’ and ‘Dr. Watson’ to the same underlying name-group: ‘watson, john’. In other words, we mapped ambiguous names to distinct name-groups.
This workflow runs smoothly for our test cases, but the real world of natural language processing is arguably messier. If it doesn’t work well for your use case, here are a couple of knobs that can be adjusted to make the pipeline more robust. Setting the obvious preprocessing of names aside, the main ones are i) the determination of name clusters and ii) the generation of name-groups.
i) One way to improve the unsupervised determination of name clusters is to rely on the consensus of multiple orthogonal approaches. The similarity encoding approach from dirty_cat could be supplemented with the word2vec algorithm [https://arxiv.org/abs/1301.3781] or the transformer-based universal sentence encoder [https://arxiv.org/abs/1803.11175]. The number of name clusters is then determined by majority vote, which should make it less sensitive to the failure of any single method.
ii) The second knob is the generation of name-groups. While content such as name-groups can be generated by reconstructing a latent space, for example with convolutional variational autoencoders or generative adversarial networks, any of those would require massive pre-training to be useful for our workflow. A more lightweight option is to bundle potentially ambiguous name-groups together and use them as the input to the whole workflow, in other words to run it recursively, either for a predefined number of iterations or until the activations converge. Those are some of the ideas out there.
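The majority vote over several cluster-count estimates could look roughly like this. It is a sketch under my own assumptions: each method has already produced its own estimate of the number of clusters, ties are broken in favour of the smaller number, and both the function name majority_vote and the example estimates are hypothetical.

```python
from collections import Counter


def majority_vote(cluster_counts):
    """Return the most frequent cluster count; ties go to the smaller number."""
    if not cluster_counts:
        raise ValueError("need at least one cluster-count estimate")
    tally = Counter(cluster_counts)
    top_frequency = max(tally.values())
    return min(count for count, freq in tally.items() if freq == top_frequency)


# Invented estimates from three methods, for illustration only.
estimates = {
    "similarity_encoder": 2,
    "word2vec": 2,
    "universal_sentence_encoder": 3,
}
n_clusters = majority_vote(list(estimates.values()))
```

Breaking ties toward the smaller number is a deliberate but arbitrary choice; it favours merging name variants over splitting them, and you may prefer the opposite for your data.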
I hope you took something away from my assembly of spaCy, dirty_cat and sklearn. If you have any questions or input please feel free to get in touch.