Matching strings that are similar but not exactly the same is a fairly common problem - think of matching people's names that may be spelt slightly differently, or that use abbreviated forms, e.g. William vs. Bill. Based on this SO post about matching strings using Apache Spark, I wanted to understand the approach in more depth through an example.
Approximately (or fuzzily) matching two strings means calculating how similar the two strings are. One approach is to calculate the edit distance between the strings - that is, the number of changes (insertions, deletions, substitutions) that would need to be made to one string to make it equal to the other. A popular algorithm for calculating this distance is the Levenshtein distance algorithm; however, it is too computationally expensive (slow) for large datasets. Spark's machine learning capabilities offer a different approach to this problem.
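To make the cost concrete, here is a minimal plain-Python sketch of the Levenshtein edit distance, computed with the standard dynamic-programming recurrence. It illustrates why the approach is slow at scale: every pair of strings costs O(len(a) × len(b)) time, so comparing every name against every other name quickly becomes infeasible.

```python
def levenshtein(a: str, b: str) -> int:
    # prev[j] holds the edit distance between a[:i-1] and b[:j]
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,               # deletion from a
                curr[j - 1] + 1,           # insertion into a
                prev[j - 1] + (ca != cb),  # substitution (free if chars match)
            ))
        prev = curr
    return prev[-1]

print(levenshtein("John Smith", "John Smyth"))  # 1 (substitute i -> y)
```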
When applying a machine learning approach to string matching, we first need to transform our strings into a numerical representation. An effective way to represent a string as a set is to calculate the set of substrings of length n (n-grams) that appear within it. The table below shows two similar names and their 1-, 2- and 3-gram representations.
| String | Value | 1-grams | 2-grams | 3-grams |
|--------|-------|---------|---------|---------|
| S1 | John Smith | J,o,h,n, ,S,m,i,t,h | Jo,oh,hn,n , S,Sm,mi,it,th | Joh,ohn,hn ,n S, Sm,Smi,mit,ith |
| S2 | John Smyth | J,o,h,n, ,S,m,y,t,h | Jo,oh,hn,n , S,Sm,my,yt,th | Joh,ohn,hn ,n S, Sm,Smy,myt,yth |
Spark's NGram feature transformer converts input strings into arrays of n-grams.
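The underlying computation is just a sliding window over the tokens. A plain-Python sketch of what this looks like for character tokens is below; note that in the actual Spark pipeline the tokens are single characters and NGram joins each n-gram's tokens with a space (producing e.g. "J o h"), whereas this sketch joins without spaces to match the tables in this post.

```python
def char_ngrams(s: str, n: int) -> list:
    # Slide a window of length n over the string; one n-gram per position.
    return [s[i:i + n] for i in range(len(s) - n + 1)]

print(char_ngrams("John Smith", 3))
# ['Joh', 'ohn', 'hn ', 'n S', ' Sm', 'Smi', 'mit', 'ith']
```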
After converting each string to its constituent n-grams, we can convert each string into a fixed-length feature vector using the hashing trick, i.e. applying a hash function to each of the n-grams and using the hash value as an index directly into the vector.
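The hashing trick can be sketched in a few lines of plain Python. This is an illustration only: Spark's HashingTF uses MurmurHash 3, so the indices below (computed here with `zlib.crc32` as a stand-in) will not match HashingTF's output.

```python
import zlib

def hashing_tf(terms, dim=262144):
    # Map each term to an index via hash(term) mod dim, counting occurrences.
    # Returns a sparse representation: {index: count}.
    vec = {}
    for t in terms:
        idx = zlib.crc32(t.encode()) % dim
        vec[idx] = vec.get(idx, 0.0) + 1.0
    return vec

grams = ['Joh', 'ohn', 'hn ', 'n S', ' Sm', 'Smi', 'mit', 'ith']
print(hashing_tf(grams))  # 8 distinct 3-grams -> 8 indices, each with count 1.0
```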
Spark's HashingTF class is a transformer that takes sets of terms (the n-grams in this example) and converts them into fixed-length feature vectors using the MurmurHash 3 hash function. The default feature dimension in HashingTF is 2^18 = 262,144.
Taking the two example strings above, the following feature vectors are obtained from the 3-grams:
| String | Value | 3-grams | Feature vector |
|--------|-------|---------|----------------|
| S1 | John Smith | Joh,ohn,hn ,n S, Sm,Smi,mit,ith | [89578,131746,138261,155335,203211,205508,236449,247475], [1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0] |
| S2 | John Smyth | Joh,ohn,hn ,n S, Sm,Smy,myt,yth | [89578,98162,137482,138261,155335,203211,247475,249580], [1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0] |
The Jaccard index is a measure of the similarity of two sets: the size of their intersection divided by the size of their union. Taking the 3-gram sets of our two sample strings above, five 3-grams appear in both S1 and S2, out of eleven distinct 3-grams in total, giving a Jaccard index of 5/11 = 0.454545.
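This calculation falls directly out of Python's set operations:

```python
# The 3-gram sets from the table above.
s1 = {'Joh', 'ohn', 'hn ', 'n S', ' Sm', 'Smi', 'mit', 'ith'}
s2 = {'Joh', 'ohn', 'hn ', 'n S', ' Sm', 'Smy', 'myt', 'yth'}

# Jaccard index: |intersection| / |union|.
jaccard = len(s1 & s2) / len(s1 | s2)
print(jaccard)  # 5 shared 3-grams out of 11 distinct -> 0.4545...
```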
Once we have converted our strings to feature vectors, our next goal is to replace these potentially large sets (not so large when matching short strings, but consider matching large blocks of text) with a small representation, or signature, with the property that comparing the signatures of two strings estimates the Jaccard similarity of the two strings.
We can represent the two strings in our example as a characteristic matrix, with one row per distinct 3-gram and one column per string, where a 1 indicates that the 3-gram appears in that string:

| 3-gram | S1 | S2 |
|--------|----|----|
| Joh | 1 | 1 |
| ohn | 1 | 1 |
| hn  | 1 | 1 |
| n S | 1 | 1 |
|  Sm | 1 | 1 |
| Smi | 1 | 0 |
| mit | 1 | 0 |
| ith | 1 | 0 |
| Smy | 0 | 1 |
| myt | 0 | 1 |
| yth | 0 | 1 |
Whilst the way in which min hashing works is beyond the scope of this post (for an excellent explanation see the MMDS book, Chapter 3, available here), the key concept is that a min hash function based on a random permutation of the rows of the characteristic matrix produces the same value for two sets with probability equal to the Jaccard similarity of those two sets.
Spark's MinHashLSH class is an estimator that takes a DataFrame and produces a Model, which is a Transformer. To use MinHashLSH, we fit a model on the featurized data obtained from a HashingTF transformer; running transform() on the resultant model then provides the hash values.
As described on SO, we can use Spark's machine learning pipeline to chain together the transformers and the estimator (together with an initial transformer to tokenize the input). This gives the following method, which takes two DataFrames containing names, fits the model to one of them, transforms both DataFrames with the resultant model, and performs an approximate similarity join.
```python
from pyspark.ml import Pipeline
from pyspark.ml.feature import HashingTF, MinHashLSH, NGram, RegexTokenizer
from pyspark.sql.functions import col


def match_names(df_1, df_2):
    pipeline = Pipeline(stages=[
        RegexTokenizer(
            pattern="", inputCol="name", outputCol="tokens", minTokenLength=1
        ),
        NGram(n=3, inputCol="tokens", outputCol="ngrams"),
        HashingTF(inputCol="ngrams", outputCol="vectors"),
        MinHashLSH(inputCol="vectors", outputCol="lsh")
    ])
    model = pipeline.fit(df_1)
    stored_hashed = model.transform(df_1)
    landed_hashed = model.transform(df_2)
    matched_df = model.stages[-1].approxSimilarityJoin(
        stored_hashed, landed_hashed, 1.0, "confidence"
    ).select(
        col("datasetA.name"),
        col("datasetB.name"),
        col("confidence")
    )
    matched_df.show(20, False)
```
Testing with two DataFrames, each containing 10 names:
| # | df_1 | df_2 |
|---|------|------|
| 1 | John Smyth | Bob Jones |
| 2 | John Smith | Ned Flanders |
| 3 | Jo Smith | Lisa Short |
| 4 | Bob Jones | Joe Tan |
| 5 | Tim Jones | Jim Jones |
| 6 | Laura Tully | John Smith |
| 7 | Sheena Easton | John Smith |
| 8 | Hilary Jones | Jon Smithers |
| 9 | Hannah Short | Chris Smith |
| 10 | Greg Norman | Norm Smith |
Example output (a lower value in the confidence column indicates a stronger match):
```
+------------+----------+------------------+
|name        |name      |confidence        |
+------------+----------+------------------+
|John Smith  |John Smyth|0.5454545454545454|
|Bob Jones   |Bob Jones |0.0               |
|Bob Jones   |Tim Jones |0.6               |
|Jim Jones   |Tim Jones |0.25              |
|John Smith  |John Smyth|0.5454545454545454|
|Norm Smith  |Jo Smith  |0.6               |
|John Smith  |John Smith|0.0               |
|Jon Smithers|Jo Smith  |0.6666666666666667|
|Chris Smith |Jo Smith  |0.6363636363636364|
|Jim Jones   |Bob Jones |0.6               |
|John Smith  |John Smith|0.0               |
+------------+----------+------------------+
```