I have 2 data frames, one is around 4.5 million rows while another is 1200 rows. I want to find the values of the smaller data frame in the column of the bigger data frame and eventually drop those records based on true/false.
df1 = { ‘id’:[‘1234’,’4566’,’6789’], ‘Name’:[‘Sara’, ‘Iris’,’Jeff’], ‘Age’:[10,12,47]}
df2 = { ‘id’:[‘1234’,’4566’,’1080’]}
The function I wrote:
def find_match(row):
if (row.column in df1.column.values) == (row.column in df2.column.values):
return “True”
else:
return “False”
df1[” flag”] = df1.apply(find_match, axis=1)
once I run the .apply(), it runs for a long time since the data frame is huge.
You can try concatenating the two df's using pandas.concat, then dropping the duplicate rows.
import pandas as pd
df1 = pd.DataFrame({"colA":["a1", "a1", "a2", "a3"], "colB":[0,1,1,1], "colC":["A","A","B","B"]})
df2 = pd.DataFrame({"colA":["a1", "a1", "a2", "a3"], "colB":[1,1,1,1], "colC":["A","B","B","B"]})
df = pd.concat([df1, df2])
print("df: \n", df)
df_dropped = df.drop_duplicates()
print("df_dropped: \n", df_dropped)
This code returns the values from df1 that matches df2.
df1 = pd.DataFrame({"id":["1234","4566","6789"], "Name":["Sara", "Iris","Jeff"], "Age":[10,12,47]})
id Name Age
0 1234 Sara 10
1 4566 Iris 12
2 6789 Jeff 47
df2 = pd.DataFrame({ "id":["1234","4566","1080"]})
id
0 1234
1 4566
2 1080
new_df = df2.merge(df1, on = "id", how = "outer")
This will return the ones that do match, also the ones that do not match with nan values for the name and age colmuns. than you can drop the ones that match and keep only the Nan ones
df_not_match = new_df[new_df["Name"].isna()] # will return the row id : 1080
Related
I have a dataframe which has the words Due Date written differently but it all means the same. The problem is in my master data(xls file), one due date has an extra space or doesnt and i cant change that.All i can change is my final output.
Sr no Due Date Due Date DueDate
1 1/2/22
2 1/5/22
3
4
5 ASAP
I just want that column 2 and 3 all combine under column one at the same location they were
Sr No. Due Date
1 1/2/22
2 1/5/22
3
4
5 ASAP
You can use filter with a regex to get similar names, then bfill and get the first. Finally join to original devoid of the found columns:
d = df.filter(regex=r'(?i)due\s*date')
df2 = (df
.drop(columns=list(d.columns))
.join(d.bfill(1).iloc[:,0])
)
Output:
Sr no Due Date
0 1 1/2/22
1 2 1/5/22
2 3 None
3 4 None
4 5 ASAP
Try with bfill
out = df.bfill(axis = 1)[['Sr No','Due Date']]
Possible solution is the following:
import pandas as pd
# set test data
data = {"Sr no": [1,2,3,4,5],
"Due Date": ["1/2/22", "", "", "", ""],
"Due Date ": ["", "1/2/22", "", "", ""],
" Due Date": ["", "", "", "", "ASAP"]
}
# create pandas dataframe
df = pd.DataFrame(data)
# clean up column names
df.columns = [col.strip() for col in df.columns]
# group data
df = df.groupby(df.columns, axis=1).agg(lambda x: x.apply(lambda y: ''.join([str(l) for l in y if str(l) != "nan"]), axis=1))
# reorder column
df = df[['Sr no', 'Due Date']]
df
Returns
I have two dataframes and wanted to check if they contain the same data or not.
df1:
df1 = [['tom', 10],['nick',15], ['juli',14]]
df1 = pd.DataFrame(df1, columns = ['Name', 'Age'])
df2:
df2 = [['nick', 15],['tom', 10], ['juli',14]]
df2 = pd.DataFrame(df2, columns = ['Name', 'Age'])
Note that the information between them are exactly the same. The only difference is the row order.
I've created a code to match both dataframes, but it's showing that the dataframes are different on the first two rows:
ne = (df != df2).any(1)
ne_stacked = (df != df2).stack()
changed = ne_stacked[ne_stacked]
changed.index.names = ['id', 'col']
difference_locations = np.where(df != df2)
changed_from = df.values[difference_locations]
changed_to = df2.values[difference_locations]
divergences = pd.DataFrame({'df1': changed_from, "df2": changed_to}, index=changed.index)
print(divergences)
I am receiving the below result:
GRID SPX RECAP
id col
0 Name tom nick
Age 10 15
1 Name nick tom
Age 15 10
I was expecting to receive:
Empty DataFrame
Columns: [df1, df2]
Index: []
How I change the code so they can test each row on dataframes to check if they are matched?
And if I was comparing two data frames with different number of rows?
When I merge two dataframes, it keeps the columns from the left and the right dataframes
with a _x and _y appended.
But I want it to make it one column and 'merge' the values of the two columns such that:
when the values are the same it just puts that one value
when the values are different it keeps the value based on another column called 'date'
and takes the value which is the 'latest' based on the date.
I also tried doing it using concatenate and in this case it does 'merge' the two columns, but it just seems to 'append' the two rows.
In the code below for example, I would like to get as output the dataframe df_desired. How can I get that?
import pandas as pd
import numpy as np
np.random.seed(30)
company1 = ('comA','comB','comC','comD')
df1 = pd.DataFrame(columns=None)
df1['company'] = company1
df1['clv']=[100,200,300,400]
df1['date'] = [20191231,20191231,20191001,20190931]
print("\ndf1:")
print(df1)
company2 = ('comC','comD','comE','comF')
df2 = pd.DataFrame(columns=None)
df2['company'] = company2
df2['clv']=[300,450,500,600]
df2['date'] = [20191231,20191231,20191231,20191231]
print("\ndf2:")
print(df2)
df_desired = pd.DataFrame(columns=None)
df_desired['company'] = ('comA','comB','comC','comD','comE','comF')
df_desired['clv']=[100,200,300,450,500,600]
df_desired['date'] = [20191231,20191231,20191231,20191231,20191231,20191231]
print("\ndf_desired:")
print(df_desired)
df_merge = pd.merge(df1,df2,left_on = 'company',
right_on = 'company',how='outer')
print("\ndf_merge:")
print(df_merge)
# alternately
df_concat = pd.concat([df1, df2], ignore_index=True, sort=False)
print("\ndf_concat:")
print(df_concat)
One approach is to concat the two dataframes then sort the concatenated dataframe on date in ascending order and drop the duplicate entries(while keeping the latest entry) based on company:
df = pd.concat([df1, df2])
df['date'] = pd.to_datetime(df['date'], format='%Y%m%d', errors='coerce')
df = df.sort_values('date', na_position='first').drop_duplicates('company', keep='last', ignore_index=True)
Result:
company clv date
0 comA 100 2019-12-31
1 comB 200 2019-12-31
2 comC 300 2019-12-31
3 comD 450 2019-12-31
4 comE 500 2019-12-31
5 comF 600 2019-12-31
I want to know if this is possible with pandas:
From df2, I want to create new1 and new2.
new1 as the latest date that can find from df1 that match column A
and B.
new2 as the latest date that can find from df1 that match column A
but not B.
I managed to get new1 but not new2.
Code:
import pandas as pd
d1 = [['1/1/19', 'xy','p1','54'], ['1/1/19', 'ft','p2','20'], ['3/15/19', 'xy','p3','60'],['2/5/19', 'xy','p4','40']]
df1 = pd.DataFrame(d1, columns = ['Name', 'A','B','C'])
d2 =[['12/1/19', 'xy','p1','110'], ['12/10/19', 'das','p10','60'], ['12/20/19', 'fas','p50','40']]
df2 = pd.DataFrame(d2, columns = ['Name', 'A','B','C'])
d3 = [['12/1/19', 'xy','p1','110','1/1/19','3/15/19'], ['12/10/19', 'das','p10','60','0','0'], ['12/20/19', 'fas','p50','40','0','0']]
dfresult = pd.DataFrame(d3, columns = ['Name', 'A','B','C','new1','new2'])
Updated!
IIUC, you want to add two columns to df2 : new1 and new2.
First I modified two things:
df1 = pd.DataFrame(d1, columns = ['Name1', 'A','B','C'])
df2 = pd.DataFrame(d2, columns = ['Name2', 'A','B','C'])
df1.Name1 = pd.to_datetime(df1.Name1)
Renamed Name into Name1 and Name2 for ease of use. Then I turned Name1 into a real date, so we can get the maximum date by group.
Then, We merge df2 with df1 on A column. This will give us rows that match on that column
aux = df2.merge(df1, on='A')
Then when the B columns is the same on both dataframes, we get Name1 out of it:
df2['new1'] = df2.index.map(aux[aux.B_x==aux.B_y].Name1).fillna(0)
If they're different we get the maximum date for every A group:
df2['new2'] = df2.A.map(aux[aux.B_x!=aux.B_y].groupby('A').Name1.max()).fillna(0)
Ouput:
Name2 A B C new1 new2
0 12/1/19 xy p1 110 2019-01-01 00:00:00 2019-03-15 00:00:00
1 12/10/19 das p10 60 0 0
2 12/20/19 fas p50 40 0 0
You can do this by:
standard merge based on A
removing all entries which match B values
sorting for dates
dropping duplicates on A, keeping last date (n.b. assumes dates are in date format, not as strings!)
merging back on id
Thus:
source = df1.copy() # renamed
v = df2.merge(source, on='A', how='left') # get all values where df2.A == source.A
v = v[v['B_x'] != v['B_y']] # drop entries where B values are the same
nv = v.sort_values(by=['Name_y']).drop_duplicates(subset=['Name_x'], keep='last')
df2.merge(nv[['Name_y', 'Name_x']].rename(columns={'Name_y': 'new2', 'Name_x': 'Name'}),
on='Name', how='left') # keeps non-matching, consider inner
This yields:
Out[94]:
Name A B C new2
0 12/1/19 xy p1 110 3/15/19
1 12/10/19 das p10 60 NaN
2 12/20/19 fas p50 40 NaN
My initial thought was to do something like the below. Sadly, it is not elegant. Generally, this sort of way to determining some value are frowned upon mostly because it fails to scale and with large data, gets especially slow.
def find_date(row, source=df1): # renamed df1 to source
t = source[source['B'] != row['B']]
t = t[t['A'] == row['A']]
return t.sort_values(by='date', ascending=False).iloc[0]
df2['new2'] = df2.apply(find_date, axis=1)
I have a dataframe (df1) of 5 columns (a,b,c,d,e) with 6 rows and another dataframe (df2) with 2 columns (a,z) with 20000 rows.
How do I map and merge those dataframes using ('a') value.
So that df1 having 5 columns should map values in df2 having 2 columns with 'a' value and return a new df which has 6 columns (5 from df1 and 1 mapped row in df2) with 6 rows.
By using pd.concat:
import pandas as pd
import numpy as np
columns_df1 = ['a','b','c','d']
columns_df2 = ['a','z']
data_df1 = [['abc','def','ghi','xyz'],['abc2','def2','ghi2','xyz2'],['abc3','def3','ghi3','xyz3'],['abc4','def4','ghi4','xyz4']]
data_df2 = [['a','z'],['a2','z2']]
df_1 = pd.DataFrame(data_df1, columns=columns_df1)
df_2 = pd.DataFrame(data_df2, columns=columns_df2)
print(df_1)
print(df_2)
frames = [df_1, df_2]
print (pd.concat(frames))
OUTPUT:
Edit:
To replace NaN values you could use pandas.DataFrame.fillna:
print (pd.concat(frames).fillna("NULL"))
Replcae NULL with anything you want e.g. 0
OUTPUT: