Loop over different DataFrames in pandas - Python

Hello World,
I have 52 dataframes (df and data versions), each containing only the Names that start with one letter of the alphabet.
Below is an example for letters A & B (for both df and data):
dfA                         dfB
    Code   Name                 Code   Name
    15     Amiks                68     Bernard
    157    Alis                 14     Barpti

dataA                       dataB
    Code   Name                 Code   Name
           Amiks                       Berti
           Alis                        Bernard
           Anatole                     Barpti
Question:
I am not an expert in Python. How can I loop over the dataframes that share the same letter, comparing only same-letter pairs rather than all dataframes?
For example, check whether:
dataA.Name is in dfA.Name?
dataB.Name is in dfB.Name?
...
dataZ.Name is in dfZ.Name?
Edit
The original dataframes are below:
df                          data
    Code   Name                 Code   Name
    15     Amiks                       Amiks
    157    Alis                        Alis
    14     Barpti                      Bernard
    68     Bernard                     Barpti
I just created one dataframe per first letter.
The aim is to speed up the computation and avoid searching the whole dataframe when we only need to check rows that share the same first letter.
Thanks to anyone helping.

You should consider variable names as meaningful only to you and not to the Python interpreter. Of course, there are hacky ways using locals() or globals(), but they should never be used in normal code.
That means that if your dataframes are related, you should use a data structure to keep track of the relation. You could use 2 lists or a list of pairs, for example:
dataframes = [ (dfA, dataA), (dfB, dataB), ...]
If the letter itself matters, you could use a dictionary:
dataframes = { 'A': (dfA, dataA), 'B': (dfB, dataB), ...}
Then you can easily iterate:
for letter in dataframes:
    commons = dataframes[letter][0].merge(dataframes[letter][1], on='Name',
                                          suffixes=('_x', ''))[['Code', 'Name']]
    # process commons, which contains the Code and Name where dataX.Name is in dfX.Name
    ...
But unless the original dataframe is really huge, I would suggest benchmarking the single-dataframe approach against the 52-dataframe one. Pandas is rather efficient provided everything fits in memory...
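For reference, a minimal sketch of that single-dataframe check (column names are taken from the question; the rest is an assumed illustration, not the original poster's code):
import pandas as pd

# Assumed baseline: check all names at once on the un-split dataframes
# (df and data are the original dataframes from the question).
data['found'] = data['Name'].isin(df['Name'])

# Or keep only the matching rows together with their codes:
commons = df.merge(data[['Name']], on='Name')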

For those who might be interested, below is what I have done:
Step 1: re-merge all df's and data's.
The output is :
Df (containing all letters)
Data (containing all letters)
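A minimal sketch of that re-merge, assuming the per-letter frames are already collected in lists (the list names here are hypothetical):
import pandas as pd

# per_letter_dfs and per_letter_datas are assumed lists such as [dfA, dfB, ..., dfZ]
df = pd.concat(per_letter_dfs, ignore_index=True)
data = pd.concat(per_letter_datas, ignore_index=True)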
Step 2: Retrieve First letter of Name
import numpy as np

# Function to retrieve the unique first letters of a column
def first_letter(data, col):
    a = data[col].map(lambda x: x[0])
    b = np.unique(a)
    c = b.tolist()
    return c
Step 3: Loop over df and data
Compare only Names whose first letter is the same.
# Apply the function to retrieve the unique first letters of the Name column
first = first_letter(df, 'Name')

# Loop over each first letter and match only rows starting with that letter in both dataframes
for letter in first:
    df_first = df[df['Name'].str.startswith(letter)]
    data_first = data[data['Name'].str.startswith(letter)]
    # Process
    ...
With this code, I match Names only against those that share the same first letter, instead of searching the whole dataframe.
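Just as an illustrative variant (not part of the original solution), the same per-letter grouping can also be written with groupby on the first character:
# Group df by the first letter of Name and compare against the matching slice of data
for letter, df_first in df.groupby(df['Name'].str[0]):
    data_first = data[data['Name'].str[0] == letter]
    # Process df_first against data_first here
    ...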

Related

Python - how to store multiple variables in one column of a .csv file and then read those variables into a list

I think this issue is pretty basic, but I haven't seen it answered online. I'm using Python with pandas installed to make things easier (if there's a way to do it without pandas, that would be awesome too!). I'm coding a node-connected map, and I want to read in a .csv file with a 'previous' and a 'next' node list and store that data in lists. For example:
.csv file:
Name      Previous   Next
Alpha     one two    Three
Beta      four       five
Charlie   six        seven eight
what I want in my program:
alpha, [one, two], [three]
beta, [four], [five]
charlie, [six], [seven, eight]
I have heard about two ways to write multiple variables in one .csv column.
One way was placing a space in between the two values/variables:
alpha,one two,three
and another way I've heard of is to use quotation (") marks and separate with a comma:
alpha,"one,two",three
Although I have heard of these approaches, I haven't been able to implement them: when my program reads the data, it treats the space or the comma as part of the string.
file = pd.read_csv("connections.csv")
previous_alpha = []
previous_alpha.append(file.previous[0])
So, instead of a list of two strings like [one, two], my program ends up with a list containing one string, e.g. ["one,two"] or ["one two"].
I can change the way the variables are structured in the .csv file or the code reading in the data. Thanks for all the help in advance!
There are multiple ways of doing this, each for a different way the CSV data can be laid out.
The first method has the data as a single row per name, with quoted lists of values:
Name,Previous,Next
Alpha,"One,Two",Three
Beta,Four,Five
Charlie,Six,"Seven,Eight"
Note the quotation marks around the lists. We can use apply to change the values; the convert function simply splits the string on the , delimiter.
import pandas as pd

def convert(x):
    return x.split(',')

df = pd.read_csv('file.csv')
df['Previous'] = df['Previous'].apply(convert)
df['Next'] = df['Next'].apply(convert)
In the second layout, each row is repeated per Name, with the individual values in the CSV:
Name,Previous,Next
Alpha,One,Three
Alpha,Two,Three
Beta,Four,Five
Charlie,Six,Seven
Charlie,Six,Eight
We can use the agg function to aggregate. The convert function drops the duplicates and returns the values as a list.
import pandas as pd

def convert(x):
    return x.drop_duplicates().to_list()

df = pd.read_csv('file.csv')
df = df.groupby('Name').agg({'Previous': convert, 'Next': convert})
The results should look like this:
Previous Next
Name
Alpha [One, Two] [Three]
Beta [Four] [Five]
Charlie [Six] [Seven, Eight]
If you have this DataFrame:
Name Previous Next
0 Alpha one two Three
1 Beta four five
2 Charlie six seven eight
Then you can split the strings in required columns and save the CSV normally:
df["Previous"] = df["Previous"].str.split()
df["Next"] = df["Next"].str.split()
print(df)
df.to_csv("data.csv", index=False)
Name Previous Next
0 Alpha [one, two] [Three]
1 Beta [four] [five]
2 Charlie [six] [seven, eight]
To load the data back, you can use pd.read_csv with the converters= parameter:
from ast import literal_eval

df = pd.read_csv(
    "data.csv", converters={"Previous": literal_eval, "Next": literal_eval}
)
print(df)
Prints:
Name Previous Next
0 Alpha [one, two] [Three]
1 Beta [four] [five]
2 Charlie [six] [seven, eight]

Splitting a large pandas data file based on the data in one column

I have a large-ish csv file that I want to split into separate data files based on the data in one of the columns, so that all related data can be analyzed.
ie. [name, color, number, state;
bob, green, 21, TX;
joe, red, 33, TX;
sue, blue, 22, NY;
....]
I'd like to have it put each state's worth of data into its own data sub-file
df[1] = [bob, green, 21, TX] [joe, red, 33, TX]
df[2] = [sue, blue, 22, NY]
Pandas seems like the best option for this as the csv file given is about 500 lines long
You could try something like:
import pandas as pd

for state, df in pd.read_csv("file.csv").groupby("state"):
    df.to_csv(f"file_{state}.csv", index=False)
Here file.csv is your base file. If it looks like
name,color,number,state
bob,green,21,TX
joe,red,33,TX
sue,blue,22,NY
the output would be 2 files:
file_TX.csv:
name,color,number,state
bob,green,21,TX
joe,red,33,TX
file_NY.csv:
name,color,number,state
sue,blue,22,NY
There are different methods for reading CSV files; you can find an overview of them at the following link:
(https://www.analyticsvidhya.com/blog/2021/08/python-tutorial-working-with-csv-file-for-data-science/)
Since you want to work with a dataframe, using pandas is indeed a practical choice. To start, you may do:
import pandas as pd
df = pd.read_csv(r"file_path")
Now let's assume after these lines, you have the following dataframe:
name     color   number   state
bob      green   21       TX
joe      red     33       TX
sue      blue    22       NY
...      ...     ...      ...
From your question, I understand that you want to split the information by state. The state data may be mixed (e.g. TX-NY-TX-DZ-TX), so sorting alphabetically and resetting the index can be a first step:
df = df.sort_values(by=['state'])
df.reset_index(drop=True, inplace=True)
Now, there are several methods we may use. From your question I did not quite understand the notation df[1] = 2 lists, df[2] = list; I am assuming you meant a list of lists per state. In that case, let's use the following method:
Method 1- Making List of Lists for Different States
First, let's get state list without duplicates:
s_list = list(dict.fromkeys(df.loc[:,"state"].tolist()))
Now we need to use list comprehension.
lol = [[df.iloc[i2, :].tolist() for i2 in range(df.shape[0])
        if state == df.loc[i2, "state"]] for state in s_list]
The lol (list of lists) variable contains one inner list per state, and each inner list holds that state's rows as lists. So you may reach a state by writing lol[0], lol[1], etc.
Method 2- Making Different Dataframes for Different States
In this method, if there are 20 states we build 20 dataframes and collect them in a list. First, we need the state names again:
s_list = list(dict.fromkeys(df.loc[:,"state"].tolist()))
We need the row index values (as a list of lists) for the different states (for example, NY is in rows 3, 6, 7, ...):
r_index = [[i for i in range(df.shape[0])
            if df.loc[i, "state"] == state] for state in s_list]
Let's make different dataframes for different states (and reset the index):
dfs = [df.loc[rows, :] for rows in r_index]
for df_state in dfs:
    df_state.reset_index(drop=True, inplace=True)
Now you have a list containing n dataframes (one per state). After this point you may, for example, sort each dataframe by name, as shown below.
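For instance, sorting one of the per-state frames by name might look like this (an illustrative one-liner, assuming the column is called name as in the sample):
dfs[0] = dfs[0].sort_values(by="name").reset_index(drop=True)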
Method 3 - My Recommendation
Firstly, I would recommend splitting the data based on name, since it is a good identifier, but I am assuming you need to use the state information. I would add the state column to the index and build a nested dictionary:
import pandas as pd

df = pd.read_csv(r"path")
df = df.sort_values(by=['state'])
df.reset_index(drop=True, inplace=True)

# we know state is in column 3
states = list(dict.fromkeys(df.iloc[:, 3].tolist()))
rows = [[i for i in range(df.shape[0]) if df.iloc[i, 3] == s] for s in states]
temp = [[i2 for i2 in range(len(rows[i]))] for i in range(len(rows))]
into = [inner for outer in temp for inner in outer]
df.insert(4, "No", into)
df.set_index(pd.MultiIndex.from_arrays([df.iloc[:, no] for no in [3, 4]]), inplace=True)
df.drop(df.columns[[3, 4]], axis=1, inplace=True)

dfs = [df.iloc[row, :] for row in rows]
for i in range(len(dfs)):
    dfs[i] = dfs[i].melt(var_name="app", ignore_index=False).set_index("app", append=True)

def call(df):
    if df.index.nlevels == 1:
        return df.to_dict()[df.columns[0]]
    return {key: call(df_gr.droplevel(0, axis=0)) for key, df_gr in df.groupby(level=0)}

data = {}
for i in range(len(states)):
    data.update(call(dfs[i]))
I may have made some typos, but I hope you get the idea.
This code gives a nested dictionary where:
the first key is the state (TX, NY, ...),
the next key is the within-state index (0, 1, 2, ...),
and the last key is name, color, or number.
Looking back at the number column in the CSV file, you could avoid creating a new column by using number directly, provided it has no duplicates.
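For illustration, accessing the nested dictionary built above would look roughly like this (the values assume the sample rows shown earlier):
# state -> within-state index -> column name
data["TX"][0]["name"]    # e.g. 'bob'
data["TX"][1]["number"]  # e.g. 33
data["NY"][0]["color"]   # e.g. 'blue'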

Find similarity between two dataframes, row by row

I have two dataframes, df1 and df2, with the same columns. I would like to find the similarity between these two datasets. I have been following one of two approaches.
The first one is to append one of the two dataframes to the other and select the duplicates:
df = pd.concat([df1, df2], join='inner')
mask = df.Check.duplicated(keep=False)
df[mask]  # it gives me the duplicated rows
The second one is to consider a threshold value which, for each row from df1, finds a potential match in rows in df2.
Sample data (please note that the datasets have different lengths):
For df1
Check
how to join to first row
large data work flows
I have two dataframes
fix grammatical or spelling errors
indent code by 4 spaces
why are you posting here?
add language identifier
my dad loves watching football
and for df2
Check
small data work flows
I have tried to puzze out an answer
mix grammatical or spelling errors
indent code by 2 spaces
indent code by 8 spaces
put returns between paragraphs
add curry on the chicken curry
mom!! mom!! mom!!
create code fences with backticks
are you crazy?
Trump did not win the last presidential election
In order to do this, I am using the following function:
def check(df1, thres, col):
    matches = df1.apply(lambda row: (fuzz.ratio(row['Check'], col) / 100.0) >= thres, axis=1)
    return [df1.Check[i] for i, x in enumerate(matches) if x]
This should allow me to find rows which match.
The problem with the second approach (the one I am most interested in) is that it does not actually take both dataframes into account.
My expected output would be two dataframes, one for df1 and one for df2, each with an extra column listing the similar rows found in the other dataframe; with the second approach, I would only assign a similarity value to them (I should have as many columns as the number of rows).
Please let me know if you need any further information or if you need more code.
Maybe there is an easier way to determine this similarity, but unfortunately I have not found it yet.
Any suggestion is welcome.
Expected output:
(it is an example; since I am setting a threshold the output may change)
df1
Check sim
how to join to first row []
large data work flows [small data work flows]
I have two dataframes []
fix grammatical or spelling errors [mix grammatical or spelling errors]
indent code by 4 spaces [indent code by 2 spaces, indent code by 8 spaces]
why are you posting here? []
add language identifier []
my dad loves watching football []
df2
Check sim
small data work flows [large data work flows]
I have tried to puzze out an answer []
mix grammatical or spelling errors [fix grammatical or spelling errors]
indent code by 2 spaces [indent code by 4 spaces]
indent code by 8 spaces [indent code by 4 spaces]
put returns between paragraphs []
add curry on the chicken curry []
mom!! mom!! mom!! []
create code fences with backticks []
are you crazy? []
Trump did not win the last presidential election []
I think your fuzzywuzzy solution is pretty good. I've implemented what you're looking for below. Note that this grows as len(df1)*len(df2), so it is pretty memory intensive, but at least it should be reasonably clear. You may find the output of gen_scores useful as well.
from itertools import product

import pandas as pd
from fuzzywuzzy import fuzz

def gen_scores(df1, df2):
    # generates a score for all row combinations between the dfs
    df_score = pd.DataFrame(product(df1.Check, df2.Check), columns=["c1", "c2"])
    df_score["score"] = df_score.apply(lambda row: fuzz.ratio(row["c1"], row["c2"]) / 100.0, axis=1)
    return df_score

def get_matches(df1, df2, thresh=0.5):
    # get all matches above a threshold, appended as a list to each df
    df = gen_scores(df1, df2)
    df = df[df.score > thresh]

    matches = df.groupby("c1").c2.apply(list)
    df1 = pd.merge(df1, matches, how="left", left_on="Check", right_on="c1")
    df1 = df1.rename(columns={"c2": "matches"})
    df1.loc[df1.matches.isnull(), "matches"] = df1.loc[df1.matches.isnull(), "matches"].apply(lambda x: [])

    matches = df.groupby("c2").c1.apply(list)
    df2 = pd.merge(df2, matches, how="left", left_on="Check", right_on="c2")
    df2 = df2.rename(columns={"c1": "matches"})
    df2.loc[df2.matches.isnull(), "matches"] = df2.loc[df2.matches.isnull(), "matches"].apply(lambda x: [])

    return (df1, df2)

# call code:
df1_match, df2_match = get_matches(df1, df2, thresh=0.5)
output:
Check matches
0 how to join to first row []
1 large data work flows [small data work flows]
2 I have two dataframes []
3 fix grammatical or spelling errors [mix gramma... [mix grammatical or spelling errors]
4 indent code by 4 spaces [indent code by 2 spaces, indent code by 8 spa...
5 why are you posting here? [are you crazy?]
6 add language identifier []
7 my dad loves watching football []

Python: Match string and replace in place

I have two dataframes (DF1 & DF2) with multiple columns of text information. I need to match and update one column in DF1
DF1:
Code Name
A A: Andrew
B B: Bill
C C: Chuck
DF2:
Number Codes
1 A
2 B;C
3 A;C
My required output is to transform DF2 as follows:
DF2:
Number Codes
1 A: Andrew
2 B: Bill;C: Chuck
3 A: Andrew;C: Chuck
So far I have tried to use:
df2['Codes'] = df2['Codes'].replace(to_replace="A", value="A: Andrew", regex=True)
But this is not practical for larger datasets.
Do I use the same df.replace function and do some looping to find every code and replace it? Or are there better ways to do it?
One option I'm trying to learn about is using sub() with regex, but I'm new to regex and still learning the basics of it.
You can split the code prefix out of Name, build a dict with zip, and then use replace:
di = dict(zip(df1.Name.str.split(":").str[0], df1.Name))
df2["Codes"] = df2["Codes"].replace(di, regex=True)

Match similar column elements using pandas and fuzzywuzzy

I have an Excel file that contains 1000+ company names in one column and about 20,000 company names in another column.
The goal is to match as many names as possible. The problem is that the names in column one (1000+) are poorly formatted, meaning that a string like "Company Name" can end up looking like "9Com(panynAm9e00". I'm trying to figure out the best way to solve this (only 12 names match exactly).
After trying different methods, I've ended up attempting to match 4-5 or more characters in each name, depending on the length of each string, using regex. But I'm struggling to find the most efficient way to do this.
For instance:
Column 1
1. 9Com(panynAm9e00
2. NikE4
3. Mitrosof2
Column 2
1. Microsoft
2. Company Name
3. Nike
Take the first element in Column 1 and look for a match in Column 2. If there is no exact match, then look for a string with 4-5 of the same characters.
Any suggestions?
I would suggest reading your Excel file with pandas and pd.read_excel(), and then using fuzzywuzzy to perform your matching, for example:
import pandas as pd
from fuzzywuzzy import process, fuzz

df = pd.DataFrame([['9Com(panynAm9e00'],
                   ['NikE4'],
                   ['Mitrosof2']],
                  columns=['Name'])

known_list = ['Microsoft', 'Company Name', 'Nike']

def find_match(x):
    match = process.extractOne(x, known_list, scorer=fuzz.partial_token_sort_ratio)[0]
    return match

df['match found'] = [find_match(row) for row in df['Name']]
Yields:
Name match found
0 9Com(panynAm9e00 Company Name
1 NikE4 Nike
2 Mitrosof2 Microsoft
I imagine numbers are not very common in actual company names, so an initial filtering step will help immensely, but here is one approach that should work relatively well even without it. A bag-of-letters (cf. bag-of-words) approach, if you will:
Convert everything (columns 1 and 2) to lowercase.
For each known company in column 2, store each unique letter and how many times it appears (its count) in a dictionary.
Do the same (step 2) for each entry in column 1.
For each entry in column 1, find the closest bag-of-letters (a dictionary from step 2) among the real company names.
The dictionary-distance implementation is up to you.
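As an illustration of that bag-of-letters idea (a sketch, not the answerer's code; the distance used here is simply the sum of absolute count differences):
from collections import Counter

def bag_of_letters(name):
    # lowercase the name, keep only alphabetic characters, and count each letter
    return Counter(c for c in name.lower() if c.isalpha())

def bag_distance(bag_a, bag_b):
    # sum of absolute differences of letter counts (smaller = more similar)
    letters = set(bag_a) | set(bag_b)
    return sum(abs(bag_a[ch] - bag_b[ch]) for ch in letters)

def closest_company(messy_name, known_names):
    messy_bag = bag_of_letters(messy_name)
    return min(known_names, key=lambda known: bag_distance(messy_bag, bag_of_letters(known)))

# Example with the sample data from the question
known = ['Microsoft', 'Company Name', 'Nike']
for messy in ['9Com(panynAm9e00', 'NikE4', 'Mitrosof2']:
    print(messy, '->', closest_company(messy, known))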
