The data
I found a dataset about COVID cases in the different communes (column LK) in Germany in an Excel file online (Link). It shows the number of cases (Fälle) and the 7-day incidence (Inzidenz) per 100,000 inhabitants.
LK | Fälle | Inzidenz |
---|---|---|
SK Augsburg | 999 | 336.8377 |
SK Herne | 464 | 296.5823 |
SK Berlin Neukölln | 917 | 285.5604 |
SK Duisburg | 1401 | 280.9383 |
LK Rottal-Inn | 335 | 275.7156 |
LK Bautzen | 780 | 260.2099 |
The prefixes SK (Stadtkreis) and LK (Landkreis) are important, as they indicate whether an entry refers to the city itself or to the county around it. In some cases both exist.
LK | Fälle | Inzidenz |
---|---|---|
LK Würzburg | 237 | 146.02408 |
SK Würzburg | 163 | 127.40945 |
SK Aschaffenburg | 59 | 83.09625 |
LK Aschaffenburg | 130 | 74.62687 |
Now, if I want to show this information on a map, I need a shapefile with the communes. The good news: there is one (Link). The bad news: the names come in a different format than in the Excel file with the COVID cases.
GEN | BEZ |
---|---|
Flensburg | Kreisfreie Stadt |
Kiel | Kreisfreie Stadt |
Lübeck | Kreisfreie Stadt |
Neumünster | Kreisfreie Stadt |
Dithmarschen | Kreis |
Herzogtum Lauenburg | Kreis |
A first fix to bring the two formats closer together is to convert BEZ into the prefixes SK and LK and paste them together with GEN, as in the sketch below.
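In R this could look as follows; a minimal sketch, assuming the shapefile’s attribute table is loaded as a data frame `shp` with the columns GEN and BEZ from above (the exact set of BEZ values is an assumption, so check `unique(shp$BEZ)` against your data):

```r
library(dplyr)

# Everything that is a city ("Kreisfreie Stadt" / "Stadtkreis") becomes SK,
# the county types ("Landkreis" / "Kreis") become LK
shp <- shp %>%
  mutate(
    prefix = if_else(BEZ %in% c("Kreisfreie Stadt", "Stadtkreis"), "SK", "LK"),
    name   = paste(prefix, GEN)
  )
```

The new column then looks like this: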
name |
---|
SK Flensburg |
SK Kiel |
SK Lübeck |
SK Neumünster |
LK Dithmarschen |
LK Herzogtum Lauenburg |
A map with missing values
This already helps a lot. Let’s see if we can join the COVID incidence values to the shapefile with the communes.
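The join itself is short; a sketch, assuming the Excel data is loaded as a data frame `covid` with the column LK, and `shp` carries the `name` column built above:

```r
# Attach the COVID values to the communes by the constructed name
shp_covid <- shp %>%
  left_join(covid, by = c("name" = "LK"))
```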
The resulting map already looks quite good, but a lot of communes are missing.
Let’s look into the Excel file with the COVID cases to see which entries did not find a match.
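An anti join lists them; again a sketch with the assumed data frames from above:

```r
# Rows of the COVID data without a counterpart among the shapefile names,
# sorted by incidence as in the table below
unmatched <- covid %>%
  anti_join(shp, by = c("LK" = "name")) %>%
  arrange(desc(Inzidenz))
```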
LK | Fälle | Inzidenz |
---|---|---|
SK Berlin Neukölln | 917 | 285.5604 |
SK Offenbach | 330 | 253.3006 |
SK Mülheim a.d.Ruhr | 378 | 221.5294 |
SK Berlin Friedrichshain-Kreuzberg | 617 | 218.7486 |
SK Berlin Reinickendorf | 554 | 213.3321 |
LK Lindau | 162 | 197.6068 |
Can we find equivalent names in the shapefile? We search for Berlin, Offenbach, Mülheim and Lindau.
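A quick pattern search over the constructed names does the trick (a sketch, reusing `shp$name` from above):

```r
# Find shapefile communes whose name contains one of the search terms
shp %>%
  filter(grepl("Berlin|Offenbach|Mülheim|Lindau", name))
```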
name |
---|
SK Mülheim an der Ruhr |
SK Offenbach am Main |
LK Offenbach |
LK Lindau (Bodensee) |
SK Berlin |
We notice a few things: Berlin has only one entry in the shapefile but a more detailed breakdown in the COVID cases document. SK Offenbach is called SK Offenbach am Main, LK Lindau is called LK Lindau (Bodensee), and SK Mülheim a.d.Ruhr is written SK Mülheim an der Ruhr.
One option would be to find all the matches manually and replace them. There are around 40, so this would be feasible but a little annoying. Maybe we can save ourselves the time of comparing each element and let the computer find the best match for us.
Fuzzy matching does exactly this: based on string distances, it finds the closest match in the other source.
Fuzzy matching
I wrote a function that calculates the string distances and finds the best fit. Additionally, there is the option to clean the input beforehand, e.g. transform all letters to lowercase and remove or replace certain words that disturb the process. Usually such words become apparent after running the matching once and noticing some undesired results.
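A minimal sketch of such a function, built on the stringdist package; the Jaro-Winkler ("jw") distance and the cleaning steps shown here are assumptions, and the original function may differ. Note that the reported score is a normalized string distance, so lower values mean closer matches:

```r
library(stringdist)

find_best_fit <- function(original, candidates, clean = TRUE) {
  # optional cleaning before measuring distances
  prep <- function(x) {
    if (clean) {
      x <- tolower(x)
      x <- gsub("\\s+", " ", x)  # collapse repeated whitespace
    }
    x
  }
  # distance between every original and every candidate
  d <- stringdistmatrix(prep(original), prep(candidates), method = "jw")
  best <- apply(d, 1, which.min)
  data.frame(
    original   = original,
    best_fit   = candidates[best],
    similarity = apply(d, 1, min)  # lower = closer match
  )
}

# match the unmatched COVID names against the shapefile names
matches <- find_best_fit(unmatched$LK, shp$name)
```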
original | best_fit | similarity |
---|---|---|
StadtRegion Aachen | LK Städteregion Aachen | 0.2315310 |
SK Berlin Friedrichshain-Kreuzberg | SK Berlin | 0.1470588 |
SK Berlin Steglitz-Zehlendorf | SK Berlin | 0.1379310 |
Region Hannover | LK Region Hannover | 0.1333333 |
SK Berlin Reinickendorf | SK Berlin | 0.1217391 |
LK Lindau | LK Lindau (Bodensee) | 0.1100000 |
LK Sankt Wendel | LK St. Wendel | 0.0970513 |
SK Berlin Spandau | SK Berlin | 0.0941176 |
SK Neustadt a.d.Weinstraße | SK Neustadt an der Weinstraße | 0.0915340 |
SK Mülheim a.d.Ruhr | SK Mülheim an der Ruhr | 0.0841542 |
SK Ludwigshafen | SK Ludwigshafen am Rhein | 0.0750000 |
SK Frankenthal | SK Frankenthal (Pfalz) | 0.0727273 |
SK Landau i.d.Pfalz | SK Landau in der Pfalz | 0.0665072 |
SK Halle | SK Halle (Saale) | 0.0545455 |
LK Neustadt a.d.Aisch-Bad Windsheim | LK Neustadt a.d. Aisch-Bad Windsheim | 0.0341270 |
SK Freiburg i.Breisgau | SK Freiburg im Breisgau | 0.0264822 |
LK Mühldorf a.Inn | LK Mühldorf a. Inn | 0.0111111 |
LK Neumarkt i.d.OPf. | LK Neumarkt i.d. OPf. | 0.0095238 |
LK Pfaffenhofen a.d.Ilm | LK Pfaffenhofen a.d. Ilm | 0.0083333 |
LK Wunsiedel i.Fichtelgebirge | LK Wunsiedel i. Fichtelgebirge | 0.0066667 |
This table serves as a control point: we can check that all items were matched correctly (remember that lower scores mean closer matches). It also serves as a dictionary which we will use to replace the original names before joining.
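The replacement itself can then be a join against this dictionary; a sketch, reusing the assumed data frames from above:

```r
# Swap in the shapefile spelling wherever a fuzzy match was found,
# then join the COVID values onto the communes again
covid_fixed <- covid %>%
  left_join(matches, by = c("LK" = "original")) %>%
  mutate(LK = coalesce(best_fit, LK)) %>%
  select(-best_fit, -similarity)

shp_covid <- shp %>%
  left_join(covid_fixed, by = c("name" = "LK"))
```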
Closing comments
- The code behind this example was written in R and can be found here (under Day 19 - NULL).
- I previously used fuzzy matching in an example from the banking world (matching company names) and talked about it at Latin-R 2019; you can find the presentation and the deck here.
- If you wonder what happened to the Berlin case in our example above: I did not take extra care of it. In practice one would have to sum up the cases and compute an average incidence value to replace the values from the different parts of the city. As this was not directly related to the fuzzy matching exercise, I omitted it.
- I am very curious if there are other applications for fuzzy matching in the GIS world, or if there are other methods to overcome a problem like this. Feel free to reach out to me.