I'm parsing a lot of .csv files right now, and I'm running into issues where one .csv will identify, say, a column that holds the name of a candidate running for office with the header candidate_name, while another will use CANDIDATE_FULL_NAME.
I'm updating dictionaries with the values from the columns like this, except I'm constantly changing the row[...] key for each file's different header.
dict.update({
    'candidate': row['column_header']
})
Is there a way to fuzzy match this? Preferably something that I can use almost drop-in so that I don't have to set up a class/method that regex tests each column for its similarity.
I already set up a class that tests whether a value matches a list of values, but I feel as if this is something I shouldn't have to write myself. Unfortunately, my google-fu has returned nothing.
I'd use the column number, but unfortunately the columns aren't always in the same order. Additionally, I can't alter the original .csv files (or else I'd definitely normalize them).
No "fuzzy" matching built-in to pandas as far as I know. If there is some common denominator, e.g. the word "name" is only and always in the column that contains the candidate's name, you could use it to rename the name column. For example:
import pandas as pd

def fuzzymatch(df, string, stname):
    # Rename the first column whose lowercased name contains `string` to `stname`.
    for col in df.columns:
        if col.lower().find(string) > -1:
            df.rename(columns={col: stname}, inplace=True)
            break
    return df
df = pd.DataFrame({"CANDIDATE_NAME_HERE": ["Ted","Fred","Sally","John","Jane"], "B": [20, 30, 10, 40, 50], "C": [32, 234, 23, 23, 42523]})
#pd.read_csv('filename.csv') will load your csv file
string = 'name'
stname = 'candidate_name'
df = fuzzymatch(df, string, stname)
print(df)
B C candidate_name
0 20 32 Ted
1 30 234 Fred
2 10 23 Sally
3 40 23 John
4 50 42523 Jane
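If you want something fuzzier than a plain substring test, the standard library's difflib.get_close_matches could work as a near drop-in; the header list and cutoff below are only illustrative:
import difflib

# Illustrative headers; in practice these come from the csv itself.
headers = ['CANDIDATE_FULL_NAME', 'PARTY', 'VOTES']
lowered = [h.lower() for h in headers]

# Find the header closest to the canonical name we care about.
match = difflib.get_close_matches('candidate_name', lowered, n=1, cutoff=0.6)
if match:
    original = headers[lowered.index(match[0])]  # map back to the original spelling
    print(original)  # CANDIDATE_FULL_NAME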
I have a large-ish csv file that I want to split into separate data files based on the data in one of the columns, so that all related data can be analyzed together.
ie. [name, color, number, state;
bob, green, 21, TX;
joe, red, 33, TX;
sue, blue, 22, NY;
....]
I'd like to have it put each state's worth of data into its own data sub-file
df[1] = [bob, green, 21, TX] [joe, red, 33, TX]
df[2] = [sue, blue, 22, NY]
Pandas seems like the best option for this, as the given csv file is only about 500 lines long.
You could try something like:
import pandas as pd

for state, df in pd.read_csv("file.csv").groupby("state"):
    df.to_csv(f"file_{state}.csv", index=False)
Here file.csv is your base file. If it looks like
name,color,number,state
bob,green,21,TX
joe,red,33,TX
sue,blue,22,NY
the output would be 2 files:
file_TX.csv:
name,color,number,state
bob,green,21,TX
joe,red,33,TX
file_NY.csv:
name,color,number,state
sue,blue,22,NY
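If you would rather keep each state's rows in memory (closer to the df[1], df[2] idea in the question) instead of writing files, a minimal sketch under the same assumptions about file.csv could be:
import pandas as pd

# Build a dict keyed by state; each value is that state's sub-dataframe.
dfs = {state: group.reset_index(drop=True)
       for state, group in pd.read_csv("file.csv").groupby("state")}

print(dfs["TX"])  # rows for TX only
print(dfs["NY"])  # rows for NY only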
There are different methods for reading csv files; you can find an overview of them at the following link:
(https://www.analyticsvidhya.com/blog/2021/08/python-tutorial-working-with-csv-file-for-data-science/)
Since you want to work with dataframes, pandas is indeed a practical choice. To start, you may do:
import pandas as pd
df = pd.read_csv(r"file_path")
Now let's assume after these lines, you have the following dataframe:
  name  color  number state
0  bob  green      21    TX
1  joe    red      33    TX
2  sue   blue      22    NY
..  ...    ...     ...   ...
From your question, I understand that you want to dissect the information based on different states. The state data may be mixed (e.g. TX-NY-TX-DZ-TX etc.), so sorting alphabetically and resetting the index may be the first step:
df = df.sort_values(by=['state'])
df.reset_index(drop=True, inplace=True)
Now, there are several methods we may use. From your question, I did not fully understand the df[1] = two lists, df[2] = one list notation; I am assuming you want a list of row-lists for each state. In that case, let's use the following method:
Method 1- Making List of Lists for Different States
First, let's get state list without duplicates:
s_list = list(dict.fromkeys(df.loc[:,"state"].tolist()))
Now we can use a list comprehension:
lol = [[df.iloc[i2, :].tolist() for i2 in range(df.shape[0])
        if state == df.loc[i2, "state"]] for state in s_list]
The lol (list of lists) variable contains one inner list per state, and each inner list holds that state's rows as lists. So you can reach a state by writing lol[0], lol[1], etc.
Method 2- Making Different Dataframes for Different States
In this method, if there are 20 states, we get 20 dataframes, and we can collect those dataframes in a list. First, we need the state names again:
s_list = list(dict.fromkeys(df.loc[:,"state"].tolist()))
We need to get the row index values (as a list of lists) for the different states (for example, NY is in rows 3, 6, 7, ...):
r_index = [[i for i in range(df.shape[0])
            if df.loc[i, "state"] == state] for state in s_list]
Let's make different dataframes for different states: (and reset index)
dfs = [df.loc[rows,:] for rows in r_index]
for df in dfs: df.reset_index(drop = True, inplace = True)
Now you have a list which contains n (the number of states) dataframes. After this point, you may for example sort each dataframe by name, as sketched below.
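For instance, sorting each of those per-state dataframes by the name column might look like this (just an illustrative follow-up step):
# Sort each state's dataframe by name and reset its index.
dfs = [d.sort_values(by='name').reset_index(drop=True) for d in dfs]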
Method 3 - My Recommendation
Firstly, I would recommend splitting the data based on name, since it is a good identifier. But I am assuming you need to use the state information. I would add the state column as an index and build a nested dictionary:
import pandas as pd
df = pd.read_csv(r"path")
df = df.sort_values(by=['state'])
df.reset_index(drop = True, inplace = True)
# we know state is in column 3
states = list(dict.fromkeys(df.iloc[:,3].tolist()))
rows = [[i for i in range(df.shape[0]) if df.iloc[i,3]==s] for s in states]
temp = [[i2 for i2 in range(len(rows[i]))] for i in range(len(rows))]
into = [inner for outer in temp for inner in outer]
df.insert(4, "No", into)
df.set_index(pd.MultiIndex.from_arrays([df.iloc[:,no] for no in [3,4]]),inplace=True)
df.drop(df.columns[[3,4]], axis=1, inplace=True)
dfs = [df.iloc[row,:] for row in rows]
for i in range(len(dfs)): dfs[i] = dfs[i]\
    .melt(var_name="app", ignore_index=False).set_index("app", append=True)

def call(df):
    if df.index.nlevels == 1: return df.to_dict()[df.columns[0]]
    return {key: call(df_gr.droplevel(0, axis=0)) for key, df_gr in df.groupby(level=0)}
data = {}
for i in range(len(states)): data.update(call(dfs[i]))
I may have made some typos, but I hope you understand the idea.
This code gives a nested dictionary such as:
first choice is state (TX,NY...)
next choice is state number index (0,1,2...)
next choice is name or color or number
Looking back at the number column in the csv file, you may be able to avoid making the new "No" column by using number directly, if the number column has no duplicates.
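Assuming the snippet above runs as intended (melt(..., ignore_index=False) needs pandas 1.1 or newer), lookups into the resulting nested dictionary would look roughly like:
# data[state][row_number_within_state][column_name]
print(data["TX"][0]["name"])   # 'bob'
print(data["NY"][0]["color"])  # 'blue'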
I have an excel file that contains 1000+ company names in one column and about 20,000 company names in another column.
The goal is to match as many names as possible. The problem is that the names in column one (1000+) are poorly formatted, meaning that a "Company Name" string can look something like "9Com(panynAm9e00". I'm trying to figure out the best way to solve this (only 12 names match exactly).
After trying different methods, I've ended up attempting to match 4-5 or more characters in each name, depending on the length of each string, using regex. But I'm struggling to find the most efficient way to do this.
For instance:
Column 1
1. 9Com(panynAm9e00
2. NikE4
3. Mitrosof2
Column 2
1. Microsoft
2. Company Name
3. Nike
Take the first element in Column 1 and look for a match in Column 2. If there is no exact match, then look for a string with 4-5 of the same characters.
Any suggestions?
I would suggest reading your Excel file with pandas and pd.read_excel(), and then using fuzzywuzzy to perform your matching, for example:
import pandas as pd
from fuzzywuzzy import process, fuzz
df = pd.DataFrame([['9Com(panynAm9e00'],
                   ['NikE4'],
                   ['Mitrosof2']],
                  columns=['Name'])
known_list = ['Microsoft', 'Company Name', 'Nike']

def find_match(x):
    match = process.extractOne(x, known_list, scorer=fuzz.partial_token_sort_ratio)[0]
    return match

df['match found'] = [find_match(row) for row in df['Name']]
Yields:
Name match found
0 9Com(panynAm9e00 Company Name
1 NikE4 Nike
2 Mitrosof2 Microsoft
I imagine numbers are not very common in actual company names, so an initial filter step will help immensely going forward, but here is one approach that should work relatively well even without it. A bag-of-letters (bag-of-words) approach, if you will:
1. Convert everything (col 1 and 2) to lowercase.
2. For each known company in column 2, store each unique letter and how many times it appears (its count) in a dictionary.
3. Do the same (step 2) for each entry in column 1.
4. For each entry in col 1, find the closest bag-of-letters (dictionary from step 2) from the list of real company names.
The dictionary-distance implementation is up to you; a rough sketch is shown below.
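One way to implement that distance is a simple sum of absolute letter-count differences; the names, helper functions, and distance choice below are only illustrative:
from collections import Counter

known = ['Microsoft', 'Company Name', 'Nike']          # column 2 (clean names)
messy = ['9Com(panynAm9e00', 'NikE4', 'Mitrosof2']     # column 1 (dirty names)

def bag(s):
    # Lowercase and count letters only, ignoring digits and punctuation.
    return Counter(ch for ch in s.lower() if ch.isalpha())

def bag_distance(a, b):
    # Sum of absolute count differences over all letters seen in either bag.
    return sum(abs(a[ch] - b[ch]) for ch in set(a) | set(b))

known_bags = {name: bag(name) for name in known}

for dirty in messy:
    best = min(known, key=lambda name: bag_distance(bag(dirty), known_bags[name]))
    print(dirty, '->', best)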
Problem:
I am trying to read in a csv to a pandas dataframe that contains data of different column sizes.
Example & Description:
Code:
df = pd.read_csv(input, error_bad_lines=False)
input:
ID, Time, Val
15, 18:00:01, 4
15, 18:00:02, 6
15, 18:00:03, 5
ID, Time, Val1, Val2
16 18:00:03, 1, 43
ID, Time, Val
15, 18:00:04, 8
and this pattern continues for the entirety of the file. Originally I was thinking about throwing away the extra columns: since the read_csv option throws an error and doesn't read them, I just started to ignore them. However, I then get duplicate headers in my dataframe... To combat this I tried drop_duplicates(), but found out that the keep=False option was only added in pandas v0.17. I eventually convinced myself to try to keep the data. So here is my question. Based on the dataset above, I was hoping I might be able to create two unique dataframes. You can assume that the ID will always be unique, so you can create N frames for the number of different IDs you have. Each ID will not have the same set of headers. Once a different ID is encountered, its header is printed again; so for example, if we hit another ID 16, its header will be printed prior to the data, and if we hit another ID 15, its header will be printed prior to its data.
I was thinking maybe to just preprocess the data before I start using dataframes, since that is an option. But since I am still fairly new to all that pandas can do, I was hoping some people here would have suggestions before I went ahead and wrote some nasty preprocessing code :). The other thought I had, which turns into a question: for error_bad_lines, is there a way to save those lines to another dataframe or something else? Additionally, can I tell pandas in read_csv to only look for items that have an ID of X, and just do that for all my IDs? I will add that the number of IDs is finite and known.
My current version of pandas is 0.14.
Note I corrected what I think is a typo in your sample data.
I split your data with a lookahead regular expression. I look for newline characters that are followed by ID.
Then parse each element of the list and concatenate.
from io import StringIO
import pandas as pd
import re
txt = """ID, Time, Val
15, 18:00:01, 4
15, 18:00:02, 6
15, 18:00:03, 5
ID, Time, Val1, Val2
16, 18:00:03, 1, 43
ID, Time, Val
15, 18:00:04, 8"""
def parse(s):
    return pd.read_csv(StringIO(s), skipinitialspace=True)

pd.concat([parse(s) for s in re.split('\n(?=ID)', txt)])
ID Time Val Val1 Val2
0 15 18:00:01 4.0 NaN NaN
1 15 18:00:02 6.0 NaN NaN
2 15 18:00:03 5.0 NaN NaN
0 16 18:00:03 NaN 1.0 43.0
0 15 18:00:04 8.0 NaN NaN
The above works with the sample data provided by the OP. If the data were in a csv file, the solution would look like this:
from io import StringIO
import pandas as pd
import re
with open('myfile.csv') as f:
    txt = f.read()

def parse(s):
    return pd.read_csv(StringIO(s), skipinitialspace=True)

pd.concat([parse(s) for s in re.split('\n(?=ID)', txt)])
You can treat the file as having four columns:
df = pd.read_csv(input, names=['id', 'time', 'v1', 'v2'])
and filter out the extra headers:
df = df[df.id != 'ID']
Then your two data sets are simply df[pd.isnull(df.v2)] and df[~pd.isnull(df.v2)].
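Putting those pieces together on the question's sample layout (the StringIO stand-in is just for illustration; pd.read_csv('myfile.csv', ...) would work the same way):
from io import StringIO
import pandas as pd

txt = """ID, Time, Val
15, 18:00:01, 4
15, 18:00:02, 6
ID, Time, Val1, Val2
16, 18:00:03, 1, 43
ID, Time, Val
15, 18:00:04, 8"""

# Force four columns; short rows are padded with NaN.
df = pd.read_csv(StringIO(txt), names=['id', 'time', 'v1', 'v2'],
                 skipinitialspace=True)

# Drop the repeated header rows.
df = df[df.id != 'ID']

# Split into the two layouts.
three_col = df[pd.isnull(df.v2)]
four_col = df[~pd.isnull(df.v2)]
print(three_col)
print(four_col)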
I am working with a large df (nearly 2 million rows) and need to create a new column from another one. The task seems easy: the starting column, called "PTCODICEFISCALE", contains a string made of either 11 or 16 characters, no other possibilities, no NaN.
The new column I have to create ("COGNOME") must contain the first 3 characters of "PTCODICEFISCALE" ONLY IF the length of the "PTCODICEFISCALE" value in that row is 16; when the length is 11, the new column should contain nothing, which means "NaN" I think.
I have tried this:
csv.loc[len(csv['PTCODICEFISCALE']) == 16, 'COGNOME'] = csv.loc[csv.PTCODICEFISCALE.str[:3]]
In the output this error message appears:
ValueError: cannot index with vector containing NA / NaN values
Which I don't understand.
I am sure there are no NA /NaN in "PTCODICEFISCALE" column.
Any help? Thanks!
P.S.: "csv" is the name of the DataFrame
I think you need numpy.where with a str.len condition:
import numpy as np

csv['COGNOME'] = np.where(csv.PTCODICEFISCALE.str.len() == 16, csv.PTCODICEFISCALE.str[:3], np.nan)
Sample:
csv = pd.DataFrame({'PTCODICEFISCALE':['0123456789123456','1','01234567891234']})
print (csv)
PTCODICEFISCALE
0 0123456789123456
1 1
2 01234567891234
csv['COGNOME'] = np.where(csv.PTCODICEFISCALE.str.len() == 16, csv.PTCODICEFISCALE.str[:3], np.nan)
print (csv)
PTCODICEFISCALE COGNOME
0 0123456789123456 012
1 1 NaN
2 01234567891234 NaN
I have around 100 csv files. Each of them is read into its own pandas dataframe; the dataframes are then merged later on and finally written into a database.
Each csv file contains 1000 rows and 816 columns.
Here is the problem:
Each of the csv files contains the 816 columns, but not all of the columns contain data. As a result, some of the csv files are misaligned - the data has been moved left, but the column has not been deleted.
Here's a made-up example:
CSV file A (which is correct):
Name Age City
Joe 18 London
Kate 19 Berlin
Math 20 Paris
CSV file B (with misalignment):
Name Age City
Joe 18 London
Kate Berlin
Math 20 Paris
I would like to merge A and B, but my current solution results in a misalignment.
I'm not sure whether this is easier to deal with in SQL or Python, but I hoped some of you could come up with a good solution.
The current solution to merge the dataframes is as follows:
def merge_pandas(csvpaths):
    frames = []
    for path in csvpaths:
        frame = pd.read_csv(sMainPath + path, header=0, index_col=None)
        frames.append(frame)
    return pd.concat(frames)
Thanks in advance.
A generic solution for these types of problems is most likely overkill. Note that the only possible mistake is a value written into a column to the left of where it belongs.
If your problem is more complex than the two column example you gave, you should have an array that contains the expected column type for your convenience.
types = ['string', 'int']
Next, I would set up a marker to identify flaws:
df['error'] = 0
df.loc[df.City.isnull(), 'error'] = 1
The script can detect the error with certainty
In your simple scenario, whenever there is an error, we can simply check the value in the first column.
If it's a number, ignore and move on (keep NaN on second value)
If it's a string, move it to the right
In your trivial example, that would be
import numpy as np

def checkRow(row):
    try:
        # If Age parses as an integer, the row is aligned correctly.
        row['Age'] = int(row['Age'])
    except ValueError:
        # Otherwise the City value has slipped into the Age column; shift it right.
        row['City'] = row['Age']
        row['Age'] = np.NaN
    return row

df = df.apply(checkRow, axis=1)
In case you have more than two columns, use your types variable to do iterated checks to find out where the NaN belongs; a rough sketch of that idea is below.
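A minimal sketch of that iterated check, assuming the expected types run left to right and at most one value per row has slipped one column to the left (the column names, the fits helper, and the realign function are illustrative, not part of the code above):
import numpy as np
import pandas as pd

# Expected types, left to right, matching the example columns.
types = {'Name': str, 'Age': int, 'City': str}

def fits(value, typ):
    # Does this value plausibly belong in a column of the given type?
    if typ is int:
        try:
            int(value)
            return True
        except (ValueError, TypeError):
            return False
    return isinstance(value, str)

def realign(row):
    cols = list(types)
    # Walk from the right; if a cell is empty, pull in the left neighbour
    # when that neighbour does not fit its own column's expected type.
    for i in range(len(cols) - 1, 0, -1):
        left, here = cols[i - 1], cols[i]
        if pd.isnull(row[here]) and not pd.isnull(row[left]) and not fits(row[left], types[left]):
            row[here] = row[left]
            row[left] = np.nan
    return row

df = pd.DataFrame({'Name': ['Joe', 'Kate', 'Math'],
                   'Age': [18, 'Berlin', 20],
                   'City': ['London', np.nan, 'Paris']})
df = df.apply(realign, axis=1)
print(df)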
The script cannot know the error with certainty
For example, when two adjacent columns both hold string values. In that case you're stuck: use a second marker to flag these rows and fix them manually. You could of course do advanced checks (if the column should hold a city name, check whether the value is a city name), but this is probably overkill and doing it manually is faster.