I have several dataframes. Some of their column names are the same and some are different, and columns with either the same or different names may hold the same values. I want to find the columns in one dataframe whose values match the values of columns in another dataframe (whether or not the column names match). Is there an efficient way to do that in Python?
For example:
df1: ID count Name
0 1 A
1 2 B
2 3 C
df2: person_id count_number Name Value
0 1 A 11
2 3 C 22
3 4 D 33
df3: key Value
11 11
22 22
33 33
I tried isin(), which is not efficient, and datacompy, which I don't think can be used because I have different column names.
My expected output: the column names that have matches, and ideally how many matches each pair has.
For example: here I want to find the matching columns among df1, df2 and df3. The output I want is their pairwise matches: for df1 and df2: ID & person_id, count & count_number, Name; for df2 and df3: Value; and so on.
As you have no expected output, it's hard to answer. A first proposition:
>>> df1.merge(df2, left_on='ID', right_on='person_id').merge(df3, on='Value')
ID count Name_x person_id count_number Name_y Value key
0 0 1 A 0 1 A 11 11
1 2 3 C 2 3 C 22 22
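To address the actual question of which columns share values across dataframes regardless of their names, here is a minimal brute-force sketch (my own helper, not a library function) that counts overlapping values for every column pair; it assumes plain value equality is the right notion of a match:
import pandas as pd

# hypothetical reconstruction of two of the example frames
df1 = pd.DataFrame({'ID': [0, 1, 2], 'count': [1, 2, 3], 'Name': ['A', 'B', 'C']})
df2 = pd.DataFrame({'person_id': [0, 2, 3], 'count_number': [1, 3, 4],
                    'Name': ['A', 'C', 'D'], 'Value': [11, 22, 33]})

def matching_columns(a, b):
    """Return (column of a, column of b, number of shared values) for every overlapping pair."""
    pairs = []
    for ca in a.columns:
        vals_a = set(a[ca].dropna())
        for cb in b.columns:
            shared = vals_a & set(b[cb].dropna())
            if shared:
                pairs.append((ca, cb, len(shared)))
    return pairs

for left, right, n in matching_columns(df1, df2):
    print(f'df1.{left} <-> df2.{right}: {n} shared values')
Repeating this for every pair of dataframes (df1/df2, df2/df3, ...) gives the pairwise report described in the question; note it also lists coincidental overlaps between unrelated columns (for example, ID and count_number both contain 1).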
Here I have an example dataframe:
import pandas as pd

dfx = pd.DataFrame({
    'name': ['alex', 'bob', 'jack'],
    'age': ["0,26,4", "1,25,4", "5,30,2"],
    'job': ["x,abc,0", "y,xyz,1", "z,pqr,2"],
    'gender': ["0,1", "0,1", "0,1"]
})
I want to first split column dfx['age'] and insert 3 separate columns for it, one for each substring in the age value, naming them dfx['age1'], dfx['age2'], dfx['age3']. I used the following code for this:
dfx = dfx.assign(**{'age1': dfx['age'].str.split(',', expand=True)[0],
                    'age2': dfx['age'].str.split(',', expand=True)[1],
                    'age3': dfx['age'].str.split(',', expand=True)[2]})
dfx = dfx[['name', 'age','age1', 'age2', 'age3', 'job', 'gender']]
dfx
So far so good!
Now, I want to repeat the same operations on the other columns, job and gender.
Desired Output
name age age1 age2 age3 job job1 job2 job3 gender gender1 gender2
0 alex 0,26,4 0 26 4 x,abc,0 x abc 0 0,1 0 1
1 bob 1,25,4 1 25 4 y,xyz,1 y xyz 1 0,1 0 1
2 jack 5,30,2 5 30 2 z,pqr,2 z pqr 2 0,1 0 1
I have no problem doing it individually for a small dataframe like this. But the actual data file has many such columns, so I need to iterate.
The difficulty I ran into is iterating over the columns and naming the individual new columns.
I would be very glad to have a better solution for it.
Thanks!
Use a list comprehension to split each column named in a list, giving a list of DataFrames with the new age1, job1, ... names; concatenate these with the original columns using concat and sort the column names, then join the remaining (non-split) columns back on with DataFrame.join:
cols = ['age', 'job', 'gender']
# split each listed column and rename the pieces to age1, age2, ..., job1, ...
L = [dfx[x].str.split(',', expand=True).rename(columns=lambda y: f'{x}{y+1}') for x in cols]
# columns that are not split (here only 'name')
df1 = dfx[dfx.columns.difference(cols)]
# concatenate the originals with their split pieces, sort the names, and join back
df = df1.join(pd.concat([dfx[cols]] + L, axis=1).sort_index(axis=1))
print (df)
name age age1 age2 age3 gender gender1 gender2 job job1 job2 job3
0 alex 0,26,4 0 26 4 0,1 0 1 x,abc,0 x abc 0
1 bob 1,25,4 1 25 4 0,1 0 1 y,xyz,1 y xyz 1
2 jack 5,30,2 5 30 2 0,1 0 1 z,pqr,2 z pqr 2
Thanks again @jezrael for your answer. Inspired by the use of f-strings, I have solved the problem using iteration as follows:
for col in dfx.columns[1:]:
    for i in range(len(dfx[col][0].split(','))):
        dfx[f'{col}{i+1}'] = dfx[col].str.split(',', expand=True)[i]

dfx = dfx[['name', 'age', 'age1', 'age2', 'age3', 'job', 'job1', 'job2', 'job3',
           'gender', 'gender1', 'gender2']]
dfx
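As a side note, both versions above end with a hard-coded list of column names for the reordering, which gets tedious with many columns. A possible sketch that builds the order while iterating (starting from the original dfx, and assuming every column after 'name' holds comma-separated strings):
ordered = ['name']
for col in dfx.columns[1:]:
    parts = dfx[col].str.split(',', expand=True)
    parts.columns = [f'{col}{i + 1}' for i in range(parts.shape[1])]
    dfx = dfx.join(parts)
    ordered += [col] + list(parts.columns)   # keep each original column next to its pieces

dfx = dfx[ordered]
dfx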
I have two dataframes df1 and df2.
df1
index emp_id name code
0 07 emp07 'A'
1 11 emp11 'B'
2 30 emp30 'C'
df2
index emp_id salary
0 06 1000
1 17 2000
2 11 3000
I want to store a map from df1['emp_id'] to df2.index.
Example: input array - ['emp11','B'] (from df1)
Expected output: [11, 2] # this is df1['emp_id'], df2.index
Code I am trying:
columns_to_idx = {emp_id: i for i, emp_id in
                  enumerate(list(DF1.set_index('emp_id').loc[DF2.index][['name', 'code']]))}
I think you need DataFrame.merge with an inner join, plus DataFrame.reset_index to turn the index into a column so it isn't lost:
df = df1.merge(df2.reset_index(), on='emp_id')
print (df)
emp_id name code index salary
0 11 emp11 B 2 3000
Then it is possible to create a MultiIndex and select by tuple:
df2 = (df1.merge(df2.reset_index(), on='emp_id')
.set_index(['name','code'])[['emp_id','index']])
print (df2)
emp_id index
name code
emp11 B 11 2
print (df2.loc[('emp11','B')].tolist())
[11, 2]
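If the end goal is a plain dictionary, as in the original attempt, the same merge can feed a dict comprehension. A sketch under the assumption that (name, code) tuples are the lookup keys (based on the example input ['emp11', 'B']):
import pandas as pd

# hypothetical reconstruction of the example frames
df1 = pd.DataFrame({'emp_id': [7, 11, 30], 'name': ['emp07', 'emp11', 'emp30'],
                    'code': ['A', 'B', 'C']})
df2 = pd.DataFrame({'emp_id': [6, 17, 11], 'salary': [1000, 2000, 3000]})

# {(name, code): [emp_id, df2 index]} built from the same merge as above
merged = df1.merge(df2.reset_index(), on='emp_id')
lookup = {(r['name'], r['code']): [r['emp_id'], r['index']] for _, r in merged.iterrows()}

print(lookup[('emp11', 'B')])   # [11, 2]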
I am looking to increase the speed of an operation in pandas, and I have learned that it is generally best to do so via vectorization. The problem I am looking for help with is vectorizing the following operation.
Setup:
df1 = a table with a date-time column, and city column
df2 = another (considerably larger) table with a date-time column, and city column
The Operation:
for i, row in df2.iterrows():
    for x, row2 in df1.iterrows():
        if row['date-time'] - row2['date-time'] > pd.Timedelta('8 hours') and row['city'] == row2['city']:
            df2.at[i, 'result'] = True
            break
As you might imagine, this operation is insanely slow on any dataset of a decent size. I am also just beginning to learn pandas vector operations and would like some help in figuring out a more optimal way to solve this problem.
I think what you need is merge() with numpy.where() to achieve the same result.
Since you don't have a reproducible sample in your question, kindly consider this:
>>> df1 = pd.DataFrame({'time':[24,20,15,10,5], 'city':['A','B','C','D','E']})
>>> df2 = pd.DataFrame({'time':[2,4,6,8,10,12,14], 'city':['A','B','C','F','G','H','D']})
>>> df1
time city
0 24 A
1 20 B
2 15 C
3 10 D
4 5 E
>>> df2
time city
0 2 A
1 4 B
2 6 C
3 8 F
4 10 G
5 12 H
6 14 D
From what I understand, you only need the rows in df2 whose city value also appears in df1's city column and where the difference between the times is greater than 8 hours.
To do that, we need to merge on your city column:
>>> new_df = df2.merge(df1, how = 'inner', left_on = 'city', right_on = 'city')
>>> new_df
time_x city time_y
0 2 A 24
1 4 B 20
2 6 C 15
3 14 D 10
time_x basically is the time in your df2 dataframe, and time_y is from your df1.
Now we need to check the difference between those times and keep the rows where it is greater than 8, using numpy.where() to flag them so we can filter later:
>>> new_df['flag'] = np.where(new_df['time_y'] - new_df['time_x'] > 8, ['Retain'], ['Remove'])
>>> new_df
time_x city time_y flag
0 2 A 24 Retain
1 4 B 20 Retain
2 6 C 15 Retain
3 14 D 10 Remove
Now that you have that, you can simply filter new_df by the flag column, dropping the flag from the final output:
>>> final_df = new_df[new_df['flag'].isin(['Retain'])][['time_x', 'city', 'time_y']]
>>> final_df
time_x city time_y
0 2 A 24
1 4 B 20
2 6 C 15
And there you go, no looping needed. Hope this helps :D
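As a follow-up: if, like in the original loop, the goal is a boolean result column on df2 itself rather than a filtered copy of the merge, the merged rows can be mapped back through df2's index. A sketch under that assumption (reusing the same toy data and keeping the answer's direction of the subtraction):
import pandas as pd

df1 = pd.DataFrame({'time': [24, 20, 15, 10, 5], 'city': ['A', 'B', 'C', 'D', 'E']})
df2 = pd.DataFrame({'time': [2, 4, 6, 8, 10, 12, 14], 'city': ['A', 'B', 'C', 'F', 'G', 'H', 'D']})

# keep df2's original row labels through the merge so results can be mapped back
merged = df2.reset_index().merge(df1, on='city', suffixes=('_df2', '_df1'))

# df2 rows that have at least one same-city df1 row more than 8 apart
hits = merged.loc[merged['time_df1'] - merged['time_df2'] > 8, 'index'].unique()

# boolean column equivalent to what the original double loop was filling in
df2['result'] = df2.index.isin(hits)
print(df2)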
I have a Pandas df:
Name No
A 1
A 2
B 2
B 2
B 3
I want to group by column Name, sum column No and then return a 2-column dataframe like this:
Name No
A 3
B 7
I tried:
df.groupby(['Name'])['No'].sum()
but it does not return my desired dataframe, and I can't add the result to the dataframe as a column.
I'd really appreciate any help.
Add parameter as_index=False to groupby:
print (df.groupby(['Name'], as_index=False)['No'].sum())
Name No
0 A 3
1 B 7
Or call reset_index:
print (df.groupby(['Name'])['No'].sum().reset_index())
Name No
0 A 3
1 B 7
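If the part about not being able to "add the result to a dataframe as a column" means the per-group sum should sit next to every original row, groupby(...).transform keeps the original shape. A sketch under that reading of the question:
import pandas as pd

df = pd.DataFrame({'Name': ['A', 'A', 'B', 'B', 'B'], 'No': [1, 2, 2, 2, 3]})

# per-group sum broadcast back onto every row of the original frame
df['No_sum'] = df.groupby('Name')['No'].transform('sum')
print(df)
#   Name  No  No_sum
# 0    A   1       3
# 1    A   2       3
# 2    B   2       7
# 3    B   2       7
# 4    B   3       7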