The data
I found a dataset about COVID cases in the different communes (column LK) in Germany in an Excel file online (Link). It shows the number of cases (Fälle) and the 7-day incidence (Inzidenz) per 100,000 inhabitants.
LK | Fälle | Inzidenz |
---|---|---|
SK Augsburg | 999 | 336.8377 |
SK Herne | 464 | 296.5823 |
SK Berlin Neukölln | 917 | 285.5604 |
SK Duisburg | 1401 | 280.9383 |
LK Rottal-Inn | 335 | 275.7156 |
LK Bautzen | 780 | 260.2099 |
The prefixes SK (Stadtkreis) and LK (Landkreis) are important, as they indicate whether an entry refers to the city itself or to the county around it. In some cases both exist.
LK | Fälle | Inzidenz |
---|---|---|
LK Würzburg | 237 | 146.02408 |
SK Würzburg | 163 | 127.40945 |
SK Aschaffenburg | 59 | 83.09625 |
LK Aschaffenburg | 130 | 74.62687 |
Now, if I want to show this information on a map, I need a shapefile with the communes. The good news: there is one (Link). The bad news: the names come in a different format than in the Excel file with the COVID cases.
GEN | BEZ |
---|---|
Flensburg | Kreisfreie Stadt |
Kiel | Kreisfreie Stadt |
Lübeck | Kreisfreie Stadt |
Neumünster | Kreisfreie Stadt |
Dithmarschen | Kreis |
Herzogtum Lauenburg | Kreis |
A first fix to bring the two formats closer together is to convert BEZ into the prefixes SK and LK and paste them together with GEN, as in the sketch below.
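In R this could look as follows; a minimal sketch, assuming the shapefile’s attribute table is loaded as a data frame `shp` with the columns GEN and BEZ from above (the exact set of BEZ values is an assumption, so check `unique(shp$BEZ)` against your data):

```r
library(dplyr)

# Everything that is a city ("Kreisfreie Stadt" / "Stadtkreis") becomes SK,
# the county types ("Landkreis" / "Kreis") become LK
shp <- shp %>%
  mutate(
    prefix = if_else(BEZ %in% c("Kreisfreie Stadt", "Stadtkreis"), "SK", "LK"),
    name   = paste(prefix, GEN)
  )
```

The new column then looks like this: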
name |
---|
SK Flensburg |
SK Kiel |
SK Lübeck |
SK Neumünster |
LK Dithmarschen |
LK Herzogtum Lauenburg |
A map with missing values
This already helps a lot. Let’s see if we can join the COVID incidence values to the shapefile with the communes.
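The join itself is short; a sketch, assuming the Excel data is loaded as a data frame `covid` with the column LK, and `shp` carries the `name` column built above:

```r
# Attach the COVID values to the communes by the constructed name
shp_covid <- shp %>%
  left_join(covid, by = c("name" = "LK"))
```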
The resulting map already looks quite good, but a lot of communes are missing.
Let’s look into the Excel file with the COVID cases to see which entries did not find a match.
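An anti join lists them; again a sketch with the assumed data frames from above:

```r
# Rows of the COVID data without a counterpart among the shapefile names,
# sorted by incidence as in the table below
unmatched <- covid %>%
  anti_join(shp, by = c("LK" = "name")) %>%
  arrange(desc(Inzidenz))
```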
LK | Fälle | Inzidenz |
---|---|---|
SK Berlin Neukölln | 917 | 285.5604 |
SK Offenbach | 330 | 253.3006 |
SK Mülheim a.d.Ruhr | 378 | 221.5294 |
SK Berlin Friedrichshain-Kreuzberg | 617 | 218.7486 |
SK Berlin Reinickendorf | 554 | 213.3321 |
LK Lindau | 162 | 197.6068 |
Can we find equivalent names in the shapefile? We search for Berlin, Offenbach, Mülheim and Lindau.
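A quick pattern search over the constructed names does the trick (a sketch, reusing `shp$name` from above):

```r
# Find shapefile communes whose name contains one of the search terms
shp %>%
  filter(grepl("Berlin|Offenbach|Mülheim|Lindau", name))
```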
name |
---|
SK Mülheim an der Ruhr |
SK Offenbach am Main |
LK Offenbach |
LK Lindau (Bodensee) |
SK Berlin |
We notice a few things: Berlin has only one entry in the shapefile but a more detailed breakdown in the COVID cases document. SK Offenbach is called SK Offenbach am Main, LK Lindau is called LK Lindau (Bodensee), and SK Mülheim a.d.Ruhr is written SK Mülheim an der Ruhr.
One option would be to find all the matches manually and replace them. There are around 40, so this would be feasible but a little annoying. Maybe we can save ourselves the time of comparing each element and let the computer find the best match for us.
Fuzzy matching does exactly this: based on string distances, it finds the closest match in the other source.
Fuzzy matching
I wrote a function that calculates the string distances and finds the best fit. Additionally, there is the option to clean the input beforehand, e.g. transform all letters to lowercase and remove or replace certain words that disturb the process. Usually such words become apparent after running the matching once and noticing some undesired results.
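A minimal sketch of such a function, built on the stringdist package; the Jaro-Winkler ("jw") distance and the cleaning steps shown here are assumptions, and the original function may differ. Note that the reported score is a normalized string distance, so lower values mean closer matches:

```r
library(stringdist)

find_best_fit <- function(original, candidates, clean = TRUE) {
  # optional cleaning before measuring distances
  prep <- function(x) {
    if (clean) {
      x <- tolower(x)
      x <- gsub("\\s+", " ", x)  # collapse repeated whitespace
    }
    x
  }
  # distance between every original and every candidate
  d <- stringdistmatrix(prep(original), prep(candidates), method = "jw")
  best <- apply(d, 1, which.min)
  data.frame(
    original   = original,
    best_fit   = candidates[best],
    similarity = apply(d, 1, min)  # lower = closer match
  )
}

# match the unmatched COVID names against the shapefile names
matches <- find_best_fit(unmatched$LK, shp$name)
```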
original | best_fit | similarity |
---|---|---|
StadtRegion Aachen | LK Städteregion Aachen | 0.2315310 |
SK Berlin Friedrichshain-Kreuzberg | SK Berlin | 0.1470588 |
SK Berlin Steglitz-Zehlendorf | SK Berlin | 0.1379310 |
Region Hannover | LK Region Hannover | 0.1333333 |
SK Berlin Reinickendorf | SK Berlin | 0.1217391 |
LK Lindau | LK Lindau (Bodensee) | 0.1100000 |
LK Sankt Wendel | LK St. Wendel | 0.0970513 |
SK Berlin Spandau | SK Berlin | 0.0941176 |
SK Neustadt a.d.Weinstraße | SK Neustadt an der Weinstraße | 0.0915340 |
SK Mülheim a.d.Ruhr | SK Mülheim an der Ruhr | 0.0841542 |
SK Ludwigshafen | SK Ludwigshafen am Rhein | 0.0750000 |
SK Frankenthal | SK Frankenthal (Pfalz) | 0.0727273 |
SK Landau i.d.Pfalz | SK Landau in der Pfalz | 0.0665072 |
SK Halle | SK Halle (Saale) | 0.0545455 |
LK Neustadt a.d.Aisch-Bad Windsheim | LK Neustadt a.d. Aisch-Bad Windsheim | 0.0341270 |
SK Freiburg i.Breisgau | SK Freiburg im Breisgau | 0.0264822 |
LK Mühldorf a.Inn | LK Mühldorf a. Inn | 0.0111111 |
LK Neumarkt i.d.OPf. | LK Neumarkt i.d. OPf. | 0.0095238 |
LK Pfaffenhofen a.d.Ilm | LK Pfaffenhofen a.d. Ilm | 0.0083333 |
LK Wunsiedel i.Fichtelgebirge | LK Wunsiedel i. Fichtelgebirge | 0.0066667 |
This table serves as a control point: we can check that all items were matched correctly (remember that lower scores mean closer matches). It also serves as a dictionary which we will use to replace the original names before joining.
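The replacement itself can then be a join against this dictionary; a sketch, reusing the assumed data frames from above:

```r
# Swap in the shapefile spelling wherever a fuzzy match was found,
# then join the COVID values onto the communes again
covid_fixed <- covid %>%
  left_join(matches, by = c("LK" = "original")) %>%
  mutate(LK = coalesce(best_fit, LK)) %>%
  select(-best_fit, -similarity)

shp_covid <- shp %>%
  left_join(covid_fixed, by = c("name" = "LK"))
```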
Closing comments
- The code behind this example was written in R and can be found here (under Day 19 - NULL).
- I previously used fuzzy matching in an example from the banking world (matching company names) and talked about it at Latin-R 2019; you can find the presentation and the deck here.
- If you wonder what happened to the Berlin case in our example above: I did not take extra care of it. In practice one would have to sum up the cases and compute an average incidence value to replace the values from the different parts of the city. As this was not directly related to the fuzzy matching exercise, I omitted it.
- I am very curious if there are other applications for fuzzy matching in the GIS world, or if there are other methods to overcome a problem like this. Feel free to reach out to me.