The data

I found a dataset about COVID cases in the different communes (LK) in Germany in an Excel file online (Link). It shows cases (Fälle) and the 7-Day-Incidence value (per 100,000 inhabitants).

LK Fälle Inzidenz
SK Augsburg 999 336.8377
SK Herne 464 296.5823
SK Berlin Neukölln 917 285.5604
SK Duisburg 1401 280.9383
LK Rottal-Inn 335 275.7156
LK Bautzen 780 260.2099

The SK (Stadt) and LK (Landkreis) are important, as they indicate whether it is about the city or the county around the city. In some cases you have both.

LK Fälle Inzidenz
LK Würzburg 237 146.02408
SK Würzburg 163 127.40945
SK Aschaffenburg 59 83.09625
LK Aschaffenburg 130 74.62687

Now, if I want to show this information on a map, I need a shapefile with the communes. The good news is, there is one (Link). The bad news is, the names are in a different format than in the Excel file with the COVID cases.

GEN BEZ
Flensburg Kreisfreie Stadt
Kiel Kreisfreie Stadt
Lübeck Kreisfreie Stadt
Neumünster Kreisfreie Stadt
Dithmarschen Kreis
Herzogtum Lauenburg Kreis

A first fix to bring the two formats closer together would be to convert BEZ into LK and SK and paste it together with GEN.

name
SK Flensburg
SK Kiel
SK Lübeck
SK Neumünster
LK Dithmarschen
LK Herzogtum Lauenburg

A map with missing values

This already helps a lot. Let’s see if we can join the COVID indicence values to the shapefile with the communes.

This already looks quite good, but there are a lot of missing communes.

Let’s look into the Excel file with COVID cases which ones did not have a match.

LK Fälle Inzidenz
SK Berlin Neukölln 917 285.5604
SK Offenbach 330 253.3006
SK Mülheim a.d.Ruhr 378 221.5294
SK Berlin Friedrichshain-Kreuzberg 617 218.7486
SK Berlin Reinickendorf 554 213.3321
LK Lindau 162 197.6068

Can we find equivalent names in the shapefile? We search for Berlin, Offenbach, Mülheim and Lindau.

name
SK Mülheim an der Ruhr
SK Offenbach am Main
LK Offenbach
LK Lindau (Bodensee)
SK Berlin

We notice a few things: Berlin has only one entry in the shapefile and a more detailed breakdown in the COVID cases document. SK Offenbach is called Offenbach am Main, Lindau is called Lindau (Bodensee). And Mülheim a.d.Ruhr is written Mülheim an der Ruhr.

One option would be to find all the matches manually and replace them. There are around 40, so this would be feasible but a little annoying. Maybe we can save this time of comparing each element and let the computer find the best match for us.

Fuzzy matching is doing exactly this: Based on string distances, it finds the closest match in the other source.

Fuzzy matching

I wrote a function to do the calculation of the string distances and find the best fit. Additionally, we have the option to clean the input before, i.e. transform all letters to lowercase, remove or replace certain words which disturb the process. Usually such words would become clear after running the matching process once and noticing some undesired results.

original best_fit similarity
StadtRegion Aachen LK Städteregion Aachen 0.2315310
SK Berlin Friedrichshain-Kreuzberg SK Berlin 0.1470588
SK Berlin Steglitz-Zehlendorf SK Berlin 0.1379310
Region Hannover LK Region Hannover 0.1333333
SK Berlin Reinickendorf SK Berlin 0.1217391
LK Lindau LK Lindau (Bodensee) 0.1100000
LK Sankt Wendel LK St. Wendel 0.0970513
SK Berlin Spandau SK Berlin 0.0941176
SK Neustadt a.d.Weinstraße SK Neustadt an der Weinstraße 0.0915340
SK Mülheim a.d.Ruhr SK Mülheim an der Ruhr 0.0841542
SK Ludwigshafen SK Ludwigshafen am Rhein 0.0750000
SK Frankenthal SK Frankenthal (Pfalz) 0.0727273
SK Landau i.d.Pfalz SK Landau in der Pfalz 0.0665072
SK Halle SK Halle (Saale) 0.0545455
LK Neustadt a.d.Aisch-Bad Windsheim LK Neustadt a.d. Aisch-Bad Windsheim 0.0341270
SK Freiburg i.Breisgau SK Freiburg im Breisgau 0.0264822
LK Mühldorf a.Inn LK Mühldorf a. Inn 0.0111111
LK Neumarkt i.d.OPf. LK Neumarkt i.d. OPf. 0.0095238
LK Pfaffenhofen a.d.Ilm LK Pfaffenhofen a.d. Ilm 0.0083333
LK Wunsiedel i.Fichtelgebirge LK Wunsiedel i. Fichtelgebirge 0.0066667

This table serves as a control point. We can check that all the items were matched correctly. It also serves as a dictionary which we will use to replace the original names before joining.

Closing comments