python pandas how to merge/join two tables based on substring?

Let's say I have two dataframes, and the column names for both are:
table 1 columns:
[ShipNumber, TrackNumber, Comment, ShipDate, Quantity, Weight]
table 2 columns:
[ShipNumber, TrackNumber, AmountReceived]
I want to merge the two tables when either 'ShipNumber' or 'TrackNumber' from table 2 can be found in 'Comment' from table 1.
Also, I'll explain why
merged = pd.merge(df1,df2,how='left',left_on='Comment',right_on='ShipNumber')
does not work in this case.
"Comment" column is a block of texts that can contain anything, so I cannot do an exact match like tab2.ShipNumber == tab1.Comment, because tab2.ShipNumber or tab2.TrackNumber can be found as a substring in tab1.Comment.
The desired output table should have all the unique columns from two tables:
output table column names:
[ShipNumber, TrackNumber, Comment, ShipDate, Quantity, Weight, AmountReceived]
I hope my question makes sense...
Any help is really really appreciated!
note
The ultimate goal is to merge the two sets with (shipnumber == shipnumber | tracknumber == tracknumber | shipnumber in comments | tracknumber in comments), but I've created two subsets for the first two conditions, and now I'm working on the 3rd and 4th conditions.

why not do something like
Count = 0
def MergeFunction(rowElement):
    # assumes df1 and df2 have the same number of rows, aligned by position
    global Count
    df2_row = df2.iloc[Count]
    if (str(df2_row['ShipNumber']) in rowElement['Comment']
            or str(df2_row['TrackNumber']) in rowElement['Comment']):
        rowElement['AmountReceived'] = df2_row['AmountReceived']
    Count += 1
    return rowElement

df1['AmountReceived'] = 0  # fill with zeros first
new_df = df1.apply(MergeFunction, axis=1)

Here is an example based on some made up data. Ignore the complete nonsense I've put in the dataframes; I was just typing random stuff to get a sample df to play with.
import pandas as pd

x = pd.DataFrame({'Location': ['Chicago','Houston','Los Angeles','Boston','NYC','blah'],
                  'Comments': ['chicago is winter','la is summer','boston is winter',
                               'dallas is spring','NYC is spring','seattle foo'],
                  'Dir': ['N','S','E','W','S','E']})
y = pd.DataFrame({'Location': ['Miami','Dallas'],
                  'Season': ['Spring','Fall']})

def findval(row):
    comment, location, season = map(lambda v: str(v).lower(), row)
    return location in comment or season in comment

merged = pd.concat([x, y])
merged['Helper'] = merged[['Comments','Location','Season']].apply(findval, axis=1)
print(merged)

filtered = merged[merged['Helper'] == True]
print(filtered)
Rather than joining, you can concatenate the dataframes and then create a helper column that checks whether the string in one column is found in another. Once you have that helper column, just filter down to the rows where it is True.
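Applied more directly to the ShipNumber/TrackNumber case from the question, the same kind of helper check can also be run after a cross join. This is only a rough sketch, assuming the df1/df2 names and the column names listed in the question; a cross join is only practical for modestly sized tables:
# pair every df1 row with every df2 row (dummy-key cross join works on any pandas version)
pairs = (df1.assign(_k=1)
            .merge(df2.assign(_k=1), on='_k', suffixes=('', '_t2'))
            .drop(columns='_k'))

# helper: True when either identifier from df2 appears inside df1's Comment text
helper = pairs.apply(lambda r: str(r['ShipNumber_t2']) in str(r['Comment'])
                               or str(r['TrackNumber_t2']) in str(r['Comment']), axis=1)
matched = pairs[helper]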

You could index the comments field using a library like Whoosh and then do a text search for each shipment number that you want to search by.
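For completeness, a minimal sketch of that idea using Whoosh's usual Schema/writer/QueryParser workflow (untested; the index directory name and field names here are made up, and the df1/df2 names come from the question):
import os
from whoosh.fields import Schema, TEXT, ID
from whoosh.index import create_in
from whoosh.qparser import QueryParser

schema = Schema(row_id=ID(stored=True), comment=TEXT(stored=True))
os.makedirs("comment_index", exist_ok=True)
ix = create_in("comment_index", schema)

writer = ix.writer()
for i, c in df1['Comment'].items():          # index every Comment from table 1
    writer.add_document(row_id=str(i), comment=str(c))
writer.commit()

with ix.searcher() as searcher:
    parser = QueryParser("comment", ix.schema)
    for ship in df2['ShipNumber']:
        hits = searcher.search(parser.parse(str(ship)))
        matching_rows = [hit['row_id'] for hit in hits]   # map hits back to df1 rows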

Related

Splitting a large pandas data file based on the data in one column

I have a large-ish csv file that I want to split into separate data files based on the data in one of the columns so that all related data can be analyzed.
ie. [name, color, number, state;
bob, green, 21, TX;
joe, red, 33, TX;
sue, blue, 22, NY;
....]
I'd like to have it put each state's worth of data into its own data sub-file
df[1] = [bob, green, 21, TX] [joe, red, 33, TX]
df[2] = [sue, blue, 22, NY]
Pandas seems like the best option for this, as the csv file given is only about 500 lines long.
You could try something like:
import pandas as pd
for state, df in pd.read_csv("file.csv").groupby("state"):
    df.to_csv(f"file_{state}.csv", index=False)
Here file.csv is your base file. If it looks like
name,color,number,state
bob,green,21,TX
joe,red,33,TX
sue,blue,22,NY
the output would be 2 files:
file_TX.csv:
name,color,number,state
bob,green,21,TX
joe,red,33,TX
file_NY.csv:
name,color,number,state
sue,blue,22,NY
There are different methods for reading csv files. You can find an overview of them at the following link:
https://www.analyticsvidhya.com/blog/2021/08/python-tutorial-working-with-csv-file-for-data-science/
Since you want to work with dataframe, using pandas is indeed a practical choice. At start you may do:
import pandas as pd
df = pd.read_csv(r"file_path")
Now let's assume after these lines, you have the following dataframe:
name    color   number  state
bob     green   21      TX
joe     red     33      TX
sue     blue    22      NY
...     ...     ...     ...
From your question, I understand that you want to dissect the information based on different states. The state data may be mixed (e.g. TX-NY-TX-DZ-TX). So, sorting alphabetically and resetting the index may be the first step:
df.sort_values(by=['state'])
df.reset_index(drop = True, inplace = True)
Now, there are several methods we may use. From your question, I did not fully understand the df[1] = two lists, df[2] = one list notation; I am assuming you meant df as a list of lists for each state. In that case, let's use the following method:
Method 1- Making List of Lists for Different States
First, let's get state list without duplicates:
s_list = list(dict.fromkeys(df.loc[:,"state"].tolist()))
Now we need to use list comprehension.
lol = [[df.iloc[i2, :].tolist() for i2 in range(df.shape[0])
        if state == df.loc[i2, "state"]] for state in s_list]
The lol (list of lists) variable is a list that contains one inner list per state. Each inner list holds one or more row lists, so you can reach a state's rows by writing lol[0], lol[1], etc.
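As a side note, a groupby can probably build the same list of lists in one line (a sketch, assuming the sorted df above):
lol = [g.values.tolist() for _, g in df.groupby("state", sort=False)]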
Method 2- Making Different Dataframes for Different States
In this method, if there are 20 states, we need to get 20 dataframes. And we may combine dataframes in a list. First, we need state names again:
s_list = list(dict.fromkeys(df.loc[:,"state"].tolist()))
We need to get row index values (as list of lists) for different states. (For ex. NY is in row 3,6,7,...)
r_index = [[i for i in range(df.shape[0])
            if df.loc[i, "state"] == state] for state in s_list]
Let's make different dataframes for different states: (and reset index)
dfs = [df.loc[rows,:] for rows in r_index]
for df in dfs: df.reset_index(drop = True, inplace = True)
Now you have a list which contains n (the number of states) dataframes inside. After this point, you may sort the dataframes by name, for example.
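A groupby-based shortcut should give the same per-state dataframes without the manual index bookkeeping (a sketch, not tested against the data above):
dfs = [g.reset_index(drop=True) for _, g in df.groupby("state", sort=False)]
# or keep them addressable by state name
dfs_by_state = {state: g.reset_index(drop=True) for state, g in df.groupby("state")}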
Method 3 - My Recommendation
Firstly, I would recommend splitting the data based on name, since it is a great identifier. But I am assuming you need to use the state information. I would add the state column as an index and make a nested dictionary:
import pandas as pd
df = pd.read_csv(r"path")
df = df.sort_values(by=['state'])
df.reset_index(drop = True, inplace = True)
# we know state is in column 3
states = list(dict.fromkeys(df.iloc[:,3].tolist()))
rows = [[i for i in range(df.shape[0]) if df.iloc[i,3]==s] for s in states]
temp = [[i2 for i2 in range(len(rows[i]))] for i in range(len(rows))]
into = [inner for outer in temp for inner in outer]
df.insert(4, "No", into)
df.set_index(pd.MultiIndex.from_arrays([df.iloc[:,no] for no in [3,4]]),inplace=True)
df.drop(df.columns[[3,4]], axis=1, inplace=True)
dfs = [df.iloc[row,:] for row in rows]
for i in range(len(dfs)):
    dfs[i] = dfs[i].melt(var_name="app", ignore_index=False).set_index("app", append=True)

def call(df):
    if df.index.nlevels == 1:
        return df.to_dict()[df.columns[0]]
    return {key: call(df_gr.droplevel(0, axis=0)) for key, df_gr in df.groupby(level=0)}

data = {}
for i in range(len(states)):
    data.update(call(dfs[i]))
I may have made some typos, but I hope you get the idea.
This code gives a nested dictionary such as:
first choice is state (TX,NY...)
next choice is state number index (0,1,2...)
next choice is name or color or number
Now that I look back at the number column in the csv file: you may be able to avoid making the new "No" column by using number directly, if the number column has no duplicates.
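For comparison, the same state -> row number -> column nesting can likely be built straight from the plain sorted dataframe with groupby, before any index manipulation (a hedged sketch):
data = {
    state: {i: row.drop(labels="state").to_dict()
            for i, (_, row) in enumerate(g.iterrows())}
    for state, g in df.groupby("state")
}
# e.g. data["TX"][0]["name"] would be "bob" for the sample above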

Python pandas filtering dataframe column by list of conditions

What is the best way to filter multiple columns in a dataframe?
For example I have this sample from my data:
Index Tag_Number
666052 A1-1100-XY-001
666382 B1-001-XX-X-XX
666385 **FROM** C1-0001-XXX-100
666620 D1-001-XX-X-HP some text
"Tag_Number" column contains Tags but I need to get rid of texts before or after the tag. The common delimeter is "space". My idea was to divide up the column into multiple and filter each of these columns that start with either of these "A1-, B1-, C1-, D1-", i.e. if cell does not start with condition it is False, else True and the apply to the table, so that True Values remain as before, but if False, we get empty values. Finally, once the tags are cleaned up, combine them into one single column. I know this might be complicated and I'm really open to any suggestions.
What I have already tried:
Splitted = df.Tag_Number.str.split(" ",expand=True)
Splitted.columns = Splitted.columns.astype(str)
Splitted = Splitted.rename(columns=lambda s: "Tag"+s)
col_names = list(Splitted.columns)
Splitted
I got this Tag_Number column split into 30 columns, but now I'm struggling to filter each of them.
I have created a conditions to filter each column by:
asset = ('A1-','B1-','C1-','D1-')
yet this did not help; I only got an array for the last column instead of all of them, which is expected I guess.
for col in col_names:
    Splitted_filter = Splitted[col].str.startswith(asset, na=False)
Splitted_filter
Is there a way to filter each column by this 'asset' filter?
Many Thanks
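A rough sketch of that column-by-column filtering, reusing the Splitted frame and asset tuple above (the recombined column name Tag_Clean is made up):
mask = Splitted.apply(lambda col: col.str.startswith(asset, na=False))  # True where a cell starts with a prefix
cleaned = Splitted.where(mask)                                          # non-matching cells become NaN
df['Tag_Clean'] = cleaned.apply(lambda row: ' '.join(row.dropna()), axis=1)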
If you want to clean out the text that does not match the asset prefixes, then I think this would work.
from io import StringIO
import pandas as pd

sample = pd.read_csv(StringIO("""Index,Tag_Number
666052,A1-1100-XY-001
666382,B1-001-XX-X-XX
666385,**FROM** C1-0001-XXX-100
666620,D1-001-XX-X-HP some text"""))

asset = ('A1-', 'B1-', 'C1-', 'D1-')

def asset_filter(tag_n):
    tags = tag_n.split()  # the common delimiter is a space
    tags = [t for t in tags if len([a for a in asset if t.startswith(a)]) >= 1]
    return tags  # can " ".join(tags) if a str type is desired

sample['Filtered_Tag_Number'] = sample.Tag_Number.astype(str).apply(asset_filter)
See that it is possible to define a custom function asset_filter and then apply it to the column you wish to transform.
Result is this:
Index Tag_Number Filtered_Tag_Number
0 666052 A1-1100-XY-001 [A1-1100-XY-001]
1 666382 B1-001-XX-X-XX [B1-001-XX-X-XX]
2 666385 **FROM** C1-0001-XXX-100 [C1-0001-XXX-100]
3 666620 D1-001-XX-X-HP some text [D1-001-XX-X-HP]
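A vectorized alternative, not from the answer above, is to pull the first token that starts with one of the asset prefixes using a regular expression (a sketch; the Tag_Clean name is made up):
prefix = '|'.join(a.rstrip('-') for a in asset)          # "A1|B1|C1|D1"
sample['Tag_Clean'] = sample['Tag_Number'].str.extract(rf'((?:{prefix})-\S+)', expand=False)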

how to divide a pandas dataframe into different dataframes based on unique values from one column and iterate over them?

I have a dataframe with three columns
The first column has 3 unique values. I used the code below to create a dataframe per unique value; however, I am unable to iterate over those dataframes programmatically and I'm not sure how to do so.
df = pd.read_excel("input.xlsx")
unique_groups = list(df.iloc[:,0].unique()) ### lets assume Unique values are 0,1,2
mtlist = []
for index, value in enumerate(unique_groups):
    globals()['df%s' % index] = df[df.iloc[:, 0] == value]
    mtlist.append('df%s' % index)
print(mtlist)
O/P
['df0', 'df1', 'df2']
for example lets say I want to find out the length of the first unique dataframe
if I manually type the name of the DF I get the correct output
len(df0)
O/P
35
But I am trying to automate the code, so I want to find the length and iterate over that dataframe the same way I would by typing its name.
What I'm looking for is
if I try the below code
len('df%s' % 0)
I want to get the actual length of the dataframe instead of the length of the string.
Could someone please guide me how to do this?
I have also tried to create a dictionary using the code below, where the key is the unique group and the value contains the remaining columns, but I can't figure out how to iterate over the dictionary when the DF has more than two columns.
df = pd.read_excel("input.xlsx")
unique_groups = list(df["Assignment Group"].unique())
length_of_unique_groups = len(unique_groups)
mtlist = []
df_dict = {name: df.loc[df['Assignment Group'] == name] for name in unique_groups}
Can someone please provide a better solution?
UPDATE
SAMPLE DATA
Assignment_group Description Document
Group A Text to be updated on the ticket 1 doc1.pdf
Group B Text to be updated on the ticket 2 doc2.pdf
Group A Text to be updated on the ticket 3 doc3.pdf
Group B Text to be updated on the ticket 4 doc4.pdf
Group A Text to be updated on the ticket 5 doc5.pdf
Group B Text to be updated on the ticket 6 doc6.pdf
Group C Text to be updated on the ticket 7 doc7.pdf
Group C Text to be updated on the ticket 8 doc8.pdf
Lets assume there are 100 rows of data
I'm trying to automate ServiceNow ticket creation with the above data.
So my end goal is that Group A tickets should go to one group; for each description a unique task has to be created, but we can club 10 tasks together and submit them as one request. If I divide the df into different dfs based on the Assignment_group, it should be easier to iterate over (that's the only idea I could think of).
For example lets say we have REQUEST001
within that request it will have multiple sub tasks such as STASK001,STASK002 ... STASK010.
hope this helps
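For what it's worth, the df_dict you already built can be iterated directly regardless of how many columns the frame has; a minimal sketch, assuming the Description and Document column names from the sample data:
for group_name, group_df in df_dict.items():
    print(group_name, len(group_df))                  # e.g. "Group A", 35
    for _, row in group_df.iterrows():
        description, document = row['Description'], row['Document']
        # build the ticket payload for this row here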
Your problem is easily solved by groupby, one of the most useful tools in pandas:
length_of_unique_groups = df.groupby('Assignment Group').size()
You can do all kinds of operations (sum, count, std, etc.) on your remaining columns, like getting the mean value of a price column for each group if you had one.
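If you also need the "10 tasks per request" batching, that fits naturally on top of groupby. A sketch, assuming the Assignment_group column name from the sample data and placeholder ticket-creation logic:
for group_name, group_df in df.groupby('Assignment_group'):
    for start in range(0, len(group_df), 10):
        chunk = group_df.iloc[start:start + 10]
        # create one request for this chunk of up to 10 rows,
        # then one sub-task per row
        for _, row in chunk.iterrows():
            print(group_name, row['Description'], row['Document'])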
I think you want to try something like len(eval('df%s' % 0))

Calculating column values for a Dataframe by looking up on another dataframe- Very slow with for loop

I have two dataframes, let's call them Train and LogItem. There is a column called user_id in both of them.
For each row in Train, I pick the user_id and a date field and pass them to a function that calculates some values from the LogItem dataframe; I then use those values to populate the columns LogEntries_7days and Sessioncounts_7days in Train at the location of that particular row.
def ServerLogData(user_id, threshold, threshold7, dataframe):
    dataframe = LogItem[LogItem['user_id'] == user_id]
    UserData = dataframe.loc[(dataframe['user_id'] == user_id) &
                             (dataframe['server_time'] < threshold) &
                             (dataframe['server_time'] > threshold7)]
    entries = len(UserData)
    Unique_Session_Count = UserData.session_id.nunique()
    return entries, Unique_Session_Count
for id in Train.index:
    print(id)
    user_id = (Train.loc[[id], ['user_id']].values[0])[0]
    threshold = (Train.loc[[id], ['impression_time']].values[0])[0]
    threshold7 = (Train.loc[[id], ['AdThreshold_date']].values[0])[0]
    dataframe = []
    Train.loc[[id], 'LogEntries_7days'], Train.loc[[id], 'Sessioncounts_7days'] = \
        ServerLogData(user_id, threshold, threshold7, dataframe)
This approach is incredibly slow. Can we use the apply method here, or something else fast enough, in the way set-based operations replace row-by-row loops in databases?
Please suggest me a better approach
Edit: Based on suggestions from super-helpful colleagues here, I am putting some data images for both dataframes and some explanation.
In dataframe Train, there will be user actions with some date values and there will be multiple rows for a user_id.
For each row, I pass the user_id and dates to another dataframe and calculate some values. Please note that the second dataframe also has multiple rows per user_id for different dates, so grouping them does not seem to be an option here.
I pass the user_id and dates, the flow goes to the second dataframe, and it finds the rows that match the user_id and also fit the dates I passed.
If you have a really large dataframe, printing each row is going to eat up a lot of time, and it's not like you'll be able to read through thousands of lines of output anyway.
If you have a lot of rows for each id, then you can speed it up quite a bit by processing each id only once. There's a question that discusses filtering a dataframe to unique indices. The top rated answer, adjusted for this case, would be unique_id_df = Train.loc[~Train.index.duplicated(keep='first')]. That creates a dataframe with only one row for each id. It takes the first row for each id, which seems to be what you're doing as well.
You can then create a dataframe from applying your function to unique_id_df. There are several ways to do this. One is to create a series entries_counts_series = unique_id_df.apply(ServerLogData, axis=1) and then turn it into a dataframe with entries_counts_df = pd.DataFrame(entries_counts_series.tolist(), index = entries_counts_series.index). You could also put the data into unique_id_df with unique_id_df['LogEntries_7days'], unique_id_df['Sessioncounts_7days'] = zip(*unique_id_df.apply(ServerLogData, axis=1)), but then you would have a bunch of extra columns to get rid of.
Once you have your data, you can merge it with your original dataframe: Train_with_data = Train.merge(entries_counts_df, left_index = True, right_index = True). If you put the data into unique_id_df, you could do something such as Train_with_data = Train.merge(unique_id_df[['LogEntries_7days','Sessioncounts_7days']], left_index = True, right_index = True).
Try out different variants of this and the other answers, and see how long each of them take on a subset of your data.
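Pulling the steps above together, a sketch of the unique-id variant might look like this (untested; it keeps the original four-argument ServerLogData from the question and assumes Train's index identifies the rows you want to deduplicate):
unique_id_df = Train.loc[~Train.index.duplicated(keep='first')]

entries_counts_series = unique_id_df.apply(
    lambda r: ServerLogData(r['user_id'], r['impression_time'], r['AdThreshold_date'], []),
    axis=1)
entries_counts_df = pd.DataFrame(entries_counts_series.tolist(),
                                 index=entries_counts_series.index,
                                 columns=['LogEntries_7days', 'Sessioncounts_7days'])

Train_with_data = Train.merge(entries_counts_df, left_index=True, right_index=True)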
Also, some notes on ServerLogData:
dataframe is passed as a parameter, but then immediately overwritten.
You subset LogItem to where LogItem['user_id']==user_id, but then you check that condition again. Unless I'm missing something, you can get rid of the dataframe = LogItem[LogItem['user_id']==user_id] line.
You've split the line that sets UserData up, which is good, but standard style is to indent the lines in this sort of situation.
You're only using session_id, so you only need to take that part of the dataframe.
So:
def ServerLogData(user_id, threshold, threshold7):
    UserData = LogItem.session_id.loc[(LogItem['user_id'] == user_id) &
                                      (LogItem['server_time'] < threshold) &
                                      (LogItem['server_time'] > threshold7)]
    entries = len(UserData)
    Unique_Session_Count = UserData.nunique()
    return entries, Unique_Session_Count
I did some quite-possibly-not-representative tests, and subsetting the column directly, rather than subsetting the entire dataframe and then taking the column out of it, sped things up significantly.
Try a groupby on user_id, then pass each user's history as a dataframe; I think it will give you faster results than processing Train line by line. I have used this method on log-file data and it wasn't slow. I don't know if it is the optimal solution, but I found the results satisfying and it was quite easy to implement. Something like this:
group_user = LogItem.groupby('user_id')
group_train = Train.groupby('user_id')
user_ids = Train['user_id'].unique().tolist()
for x in user_ids:
    df_user = group_user.get_group(x)
    df_train = group_train.get_group(x)
    # do your thing here
    processing_function(df_user, df_train)
Write a function doing the calculation you want (I named it processing_function). I hope it helps.
EDIT: here is how your code becomes
def ServerLogData(threshold, threshold7, df_user):
    UserData = df_user[(df_user['server_time'] < threshold) & (df_user['server_time'] > threshold7)]
    entries = len(UserData)
    Unique_Session_Count = UserData.session_id.nunique()
    return entries, Unique_Session_Count

group_user = LogItem.groupby('user_id')
group_train = Train.groupby('user_id')
user_ids = Train['user_id'].unique().tolist()
for x in user_ids:
    df_user = group_user.get_group(x)
    df_train = group_train.get_group(x)
    for id in df_train.index:
        user_id = (df_train.loc[[id], ['user_id']].values[0])[0]
        threshold = (df_train.loc[[id], ['impression_time']].values[0])[0]
        threshold7 = (df_train.loc[[id], ['AdThreshold_date']].values[0])[0]
        df_train.loc[[id], 'LogEntries_7days'], df_train.loc[[id], 'Sessioncounts_7days'] = \
            ServerLogData(threshold, threshold7, df_user)

How do you compare the values in two different dataframe groupby results?

I have two different dataframes populated with different name sets. For example:
t1 = pd.DataFrame(['Abe, John Doe', 'Smith, Brian', 'Lin, Sam', 'Lin, Greg'], columns=['t1'])
t2 = pd.DataFrame(['Abe, John', 'Smith, Brian', 'Lin, Sam', 'Lu, John'], columns=['t2'])
I need to find the intersection between the two data sets. My solution was to split by comma and then group by last name. Then I'll be able to compare last names and see if the first names of t2 are contained within t1. ['Lu, John'] is the only one that should be returned in the above example.
What I need help on is how to compare values within two different dataframes that are grouped by a common column. Is there a way to intersect the results of a groupby for two different dataframes and then compare the values within each key value pair? I need to extract the names in t2 that are not in t1.
I have an idea that it should look something like this:
for last in t1:
    print(t2.get_group(last))  # BUT run a compare between the two different lists here
The only problem is that if the last name doesn't exist in the second groupby, it throws an error, so I can't even proceed to the next step mentioned in the comment: comparing the values (first names) within the groups.
This isn't pandas specific, but Python has a built-in set class with an intersection operation; here's the documentation: https://docs.python.org/3/library/stdtypes.html?highlight=set#set
It works like so
set1 = set(my_list_of_elements)
set2 = set(my_other_list_of_elements)
intersecting_elements = set1 & set2
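Applied to the example frames in the question (a sketch; note this compares whole strings, so 'Abe, John' and 'Abe, John Doe' would not match):
set1 = set(t1['t1'])
set2 = set(t2['t2'])
common = set1 & set2          # {'Smith, Brian', 'Lin, Sam'}
only_in_t2 = set2 - set1      # {'Abe, John', 'Lu, John'}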
It's hard to tell if this is what you are looking for, though; please update with a minimal, complete, and verifiable example as the comments say for a more accurate answer.
Update - based on comment
for last in t1:
    try:
        t2_last_group = t2.get_group(last)
        # perform compare here
    except:
        pass
I ended up figuring this out. The pandas string comparison calls for contains(...).any(), so I included those. To handle the case where a last name cannot be found in the second groupby dataframe, I surrounded the code with a try/except. The solution is outlined below.
t1final = []
for index, row in t1.iterrows():
    t1lastname = row['last']
    t1firstname = row['first']
    try:
        x = t2groupby.get_group(t1lastname)
        if ~x['first'].str.contains(t1firstname, case=False).any():
            t1final.append(t1lastname + ', ' + t1firstname)
    except:
        t1final.append(t1lastname + ', ' + t1firstname)
