This has been bugging me for a while now. How can I achieve =INDEX(A:A,MATCH(E1&F1,B:B&C:C,0)) in Python? In Excel this returns an error if no match is found.
So I started playing with pd.merge_asof, but every way I try it, it only returns errors.
df_3 = pd.merge_asof(df_1, df_2, on=['x', 'y'], allow_exact_matches=False)
Would give the error:
pandas.tools.merge.MergeError: can only asof on a key for left
Edit:
import pandas as pd
df_1 = pd.DataFrame({'x': ['1', '1', '2', '2', '3', '3', '4', '5', '5', '5'],
                     'y': ['smth1', 'smth2', 'smth1', 'smth2', 'smth1', 'smth2', 'smth1', 'smth1', 'smth2', 'smth3']})
df_2 = pd.DataFrame({'x': ['1', '2', '2', '3', '4', '5', '5'],
                     'y': ['smth1', 'smth1', 'smth2', 'smth3', 'smth1', 'smth1', 'smth3'],
                     'z': ['other1', 'other1', 'other2', 'other3', 'other1', 'other1', 'other3']})
So that's a sample where, in Excel, I could simply use the above formula and get something like this:
x y z
1 smth1 other1
1 smth2 #NA
2 smth1 other1
2 smth2 other2
3 smth1 #NA
3 smth2 #NA
4 smth1 other1
5 smth1 other1
5 smth2 #NA
5 smth3 other3
So, is there an easy way to achieve the INDEX MATCH formula in excel in pandas?
merge_asof does an ordered, nearest-key join on a single key, so it isn't the right tool for an exact multi-column lookup. Let's try merge with how='left' instead:
df_1.merge(df_2, on=['x','y'], how='left')
Output:
x y z
0 1 smth1 other1
1 1 smth2 NaN
2 2 smth1 other1
3 2 smth2 other2
4 3 smth1 NaN
5 3 smth2 NaN
6 4 smth1 other1
7 5 smth1 other1
8 5 smth2 NaN
9 5 smth3 other3
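If you only need the looked-up column back as a Series, which is closer to what INDEX/MATCH returns, mapping over the key columns is one alternative. A minimal sketch, assuming each (x, y) pair is unique in df_2:
# build a lookup Series keyed by (x, y), then map df_1's keys onto it
lookup = df_2.set_index(['x', 'y'])['z']
df_1['z'] = df_1.set_index(['x', 'y']).index.map(lookup)  # NaN where no match, like Excel's #N/A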
Related
I want to compare two dataframes that have similar (but not identical) columns, and print a new dataframe that shows the rows of df1 missing from df2, and a second dataframe that shows the rows of df2 missing from df1, based on given key columns.
Here, the key columns are named key_column1 and key_column2.
import pandas as pd
data1 = {'first_column': ['4', '2', '7', '2', '2'],
         'second_column': ['1', '2', '2', '2', '2'],
         'key_column1': ['1', '3', '2', '6', '4'],
         'key_column2': ['1', '2', '2', '1', '1'],
         'fourth_column': ['1', '2', '2', '2', '2'],
         'other': ['1', '2', '3', '2', '2'],
         }
df1 = pd.DataFrame(data1)
data2 = {'first': ['1', '2', '2', '2', '2'],
         'second_column': ['1', '2', '2', '2', '2'],
         'key_column1': ['1', '3', '2', '6', '4'],
         'key_column2': ['1', '5', '2', '2', '2'],
         'fourth_column': ['1', '2', '2', '2', '2'],
         'other2': ['1', '4', '3', '2', '2'],
         'other3': ['6', '8', '1', '4', '2'],
         }
df2 = pd.DataFrame(data2)
I have modified the data1 and data2 dictionaries so that the resulting dataframes share all their columns, to demonstrate that the solution in the answer by Emi OB relies on one dataframe having columns the other lacks (if a common column is used instead, the code fails with a KeyError on the column chosen to collect NaNs). Below is an improved version that does not suffer from that limitation, creating its own columns for collecting NaNs:
df1['df1_NaNs'] = '' # create additional column to collect NaNs
df2['df2_NaNs'] = '' # create additional column to collect NaNs
df1_s = df1.merge(df2[['key_column1', 'key_column2', 'df2_NaNs']], on=['key_column1', 'key_column2'], how='outer')
df2 = df2.drop(columns=["df2_NaNs"]) # clean up df2
df1_s = df1_s.loc[df1_s['df2_NaNs'].isna(), df1.columns]
df1_s = df1_s.drop(columns=["df1_NaNs"]) # clean up df1_s
print(df1_s)
print('--------------------------------------------')
df2_s = df2.merge(df1[['key_column1', 'key_column2', 'df1_NaNs']], on=['key_column1', 'key_column2'], how='outer')
df1 = df1.drop(columns=["df1_NaNs"]) # clean up df1
df2_s = df2_s.loc[df2_s['df1_NaNs'].isna(), df2.columns] # df2.columns no longer contains 'df2_NaNs', so df2_s needs no further cleanup
print(df2_s)
gives:
first second_column key_column1 key_column2 fourth_column
1 2 2 3 2 2
3 2 2 6 1 2
4 2 2 4 1 2
--------------------------------------------
first second_column key_column1 key_column2 fourth_column
1 2 2 3 5 3
3 2 2 6 2 5
4 2 2 4 2 6
The code below also works when both dataframes have exactly the same columns, and in addition saves memory and computation time by not creating the temporary full-sized dataframes needed to achieve the final result:
""" I want to compare two dataframes that have similar columns(not all)
and print a new dataframe that shows the missing rows of df1 compare to
df2 and a second dataframe that shows this time the missing values of
df2 compare to df1 based on given columns. Here the "key_columns"
"""
import pandas as pd
#data1 = {'first_column': ['4', '2', '7', '2', '2'],
data1 = {'first': ['4', '2', '7', '2', '2'],
         'second_column': ['1', '2', '2', '2', '2'],
         'key_column1': ['1', '3', '2', '6', '4'],
         'key_column2': ['1', '2', '2', '1', '1'],
         'fourth_column': ['1', '2', '2', '2', '2'],
         #'other': ['1', '2', '3', '2', '2'],
         }
df1 = pd.DataFrame(data1)
#print(df1)
data2 = {'first': ['1', '2', '2', '2', '2'],
         'second_column': ['1', '2', '2', '2', '2'],
         'key_column1': ['1', '3', '2', '6', '4'],
         'key_column2': ['1', '5', '2', '2', '2'],
         #'fourth_column': ['1', '2', '2', '2', '2'],
         'fourth_column': ['2', '3', '4', '5', '6'],
         #'other2': ['1', '4', '3', '2', '2'],
         #'other3': ['6', '8', '1', '4', '2'],
         }
df2 = pd.DataFrame(data2)
#print(df2)
data1_key_cols = dict.fromkeys( zip(data1['key_column1'], data1['key_column2']) )
data2_key_cols = dict.fromkeys( zip(data2['key_column1'], data2['key_column2']) )
# for Python versions < 3.7 (dictionaries are not ordered):
#data1_key_cols = list(zip(data1['key_column1'], data1['key_column2']))
#data2_key_cols = list(zip(data2['key_column1'], data2['key_column2']))
from collections import defaultdict
missing_data2_in_data1 = defaultdict(list)
missing_data1_in_data2 = defaultdict(list)
for indx, key_pair in enumerate(data1_key_cols.keys()):
#for indx, key_pair in enumerate(data1_key_cols): # for Python versions < 3.7
    if key_pair not in data2_key_cols:
        for col in data1:
            missing_data1_in_data2[col].append(data1[col][indx])
for indx, key_pair in enumerate(data2_key_cols.keys()):
#for indx, key_pair in enumerate(data2_key_cols): # for Python versions < 3.7
    if key_pair not in data1_key_cols:
        for col in data2:
            missing_data2_in_data1[col].append(data2[col][indx])
df1_s = pd.DataFrame(missing_data1_in_data2)
df2_s = pd.DataFrame(missing_data2_in_data1)
print(df1_s)
print('--------------------------------------------')
print(df2_s)
prints
first second_column key_column1 key_column2 fourth_column
0 2 2 3 2 2
1 2 2 6 1 2
2 2 2 4 1 2
--------------------------------------------
first second_column key_column1 key_column2 fourth_column
0 2 2 3 5 3
1 2 2 6 2 5
2 2 2 4 2 6
If you outer merge on the 2 key columns, bringing along a column that exists only in the other dataframe, that column will show NaN wherever a row is in one dataframe but not the other. For example:
df2.merge(df1[['key_column1', 'key_column2', 'first_column']], on=['key_column1', 'key_column2'], how='outer')
gives:
first second_column key_column1 ... other2 other3 first_column
0 1 1 1 ... 1 6 4
1 2 2 3 ... 4 8 NaN
2 2 2 2 ... 3 1 7
3 2 2 6 ... 2 4 NaN
4 2 2 4 ... 2 2 NaN
5 NaN NaN 3 ... NaN NaN 2
6 NaN NaN 6 ... NaN NaN 2
7 NaN NaN 4 ... NaN NaN 2
Here the NaNs in 'first_column' correspond to the rows in df2 that are not in df1. You can then use this fact with .loc[] to filter on those NaN rows, keeping only the columns of df2, like so:
df2_outer.loc[df2_outer['first_column'].isna(), df2.columns]
Output:
first second_column key_column1 key_column2 fourth_column other2 other3
1 2 2 3 5 2 4 8
3 2 2 6 2 2 2 4
4 2 2 4 2 2 2 2
Full code for both tables is:
df2_outer = df2.merge(df1[['key_column1', 'key_column2', 'first_column']], on=['key_column1', 'key_column2'], how='outer')
print('missing values of df1 compare df2')
df2_output = df2_outer.loc[df2_outer['first_column'].isna(), df2.columns]
print(df2_output)
df1_outer = df1.merge(df2[['key_column1', 'key_column2', 'first']], on=['key_column1', 'key_column2'], how='outer')
print('missing values of df2 compare df1')
df1_output = df1_outer.loc[df1_outer['first'].isna(), df1.columns]
print(df1_output)
Which outputs:
missing values of df1 compare df2
first second_column key_column1 key_column2 fourth_column other2 other3
1 2 2 3 5 2 4 8
3 2 2 6 2 2 2 4
4 2 2 4 2 2 2 2
missing values of df2 compare df1
first_column second_column key_column1 key_column2 fourth_column other
1 2 2 3 2 2 2
3 2 2 6 1 2 2
4 2 2 4 1 2 2
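As an aside, merge's indicator parameter can express the same idea without a marker column. A sketch for the rows-of-df2-missing-from-df1 direction, assuming the key pairs are unique:
flags = df2.merge(df1[['key_column1', 'key_column2']].drop_duplicates(),
                  on=['key_column1', 'key_column2'], how='left', indicator=True)
# '_merge' is 'left_only' for rows of df2 whose keys never appear in df1
print(flags.loc[flags['_merge'] == 'left_only', df2.columns])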
>>> df = pd.DataFrame({'id': ['1', '1', '2', '2', '3', '4', '4', '5', '5'],
... 'value': ['keep', 'y', 'x', 'keep', 'x', 'Keep', 'x', 'y', 'x']})
>>> print(df)
id value
0 1 keep
1 1 y
2 2 x
3 2 keep
4 3 x
5 4 Keep
6 4 x
7 5 y
8 5 x
In this example, the idea would be to keep index values 0, 3, 4, and 5, since they are associated with a duplicate id and a particular value == 'Keep', and index 7 (since it is the first of the duplicates for id 5).
In your case, try with idxmax:
out = df.loc[df['value'].eq('keep').groupby(df.id).idxmax()]
Out[24]:
id value
0 1 keep
3 2 keep
4 3 x
5 4 Keep
7 5 y
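To see why this works, print the intermediate Series: within each id group, idxmax returns the index of the first True, or the group's first row when every value is False (which is why index 5, whose value is 'Keep' with a capital K, still survives):
df['value'].eq('keep').groupby(df.id).idxmax()
id
1    0
2    3
3    4
4    5
5    7
Name: value, dtype: int64
If 'Keep' should match as well, compare case-insensitively with df['value'].str.lower().eq('keep').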
I have a data set where there are name and id columns. In theory the name should always correspond to the same id, but due to some system errors and data quality issues in practice this is not always the case.
Generally the scenario is that the wrong ids occur at an extremely negligible rate compared to the right ids. So, for example, there will be 1000 rows where the name 'a' and id '1' match, but 2 rows where the name is 'a' and the id is '7'.
So the logic to resolve the proper id would simply be to find the most frequently occurring id for each name.
d = {'id': ['1', '1', '2', '2'], 'name': ['a', 'a', 'a', 'b'], 'value': ['1', '2', '3', '4']}
df = pd.DataFrame(data=d)
print(df)
id name value
0 1 a 1
1 1 a 2
2 2 a 3
3 2 b 4
The first question is: what is the best way to find the proper id for each name and drop the rows where the proper id does not occur? The result would be the following:
id name value
0 1 a 1
1 1 a 2
2 2 b 4
The second part is: in the scenarios where the mismatched id is actually the id of another name, fix the name to match the proper id. Example output:
id name value
0 1 a 1
1 1 a 2
2 2 b 3
3 2 b 4
The actual data has thousands of names/ids, the example is just a simplification.
Here is my solution. It's a bit of a makeshift job, but it should work as a temporary solution.
d = {'id': ['1', '1', '2', '2', '2', '3', '3', '4', '4'],
     'name': ['a', 'a', 'a', 'b', 'b', 'b', 'c', 'c', 'c'],
     'value': ['1', '2', '3', '4', '5', '6', '7', '8', '9']}
df = pd.DataFrame(data=d)
Below is the raw DataFrame, before any id corrections:
id name value
0 1 a 1
1 1 a 2
2 2 a 3
3 2 b 4
4 2 b 5
5 3 b 6
6 3 c 7
7 4 c 8
8 4 c 9
Workflow:
# convert id and value from string to float
df['id'] = [float(i) for i in df['id']]
df['value'] = [float(v) for v in df['value']]

# find the most frequently occurring id for each name
def most_common(lst):
    return max(set(lst), key=lst.count)

count = dict()
for name in pd.unique(df['name']):
    count[name] = most_common(list(df[df['name'] == name]['id']))

# correct wrong ids
replace = [[count[name], name] if i != count[name] else [i, name]
           for i, name in zip(df['id'], df['name'])]
df['id'] = [item[0] for item in replace]
df['name'] = [item[1] for item in replace]
output:
In [3]: count
Out[3]: {'a': 1.0, 'b': 2.0, 'c': 4.0}
In [1]: df
Out[1]:
id name value
0 1.0 a 1.0
1 1.0 a 2.0
2 1.0 a 3.0
3 2.0 b 4.0
4 2.0 b 5.0
5 2.0 b 6.0
6 4.0 c 7.0
7 4.0 c 8.0
8 4.0 c 9.0
This solution might not work if you have the exact same count of two different ids for the same name.
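For reference, the same most-frequent-id mapping can be built without Python-level loops, using groupby and value_counts. A minimal sketch on the same df (this replicates the id correction step, not the name fix):
# map each name to its most frequent id, then overwrite the id column
proper_id = df.groupby('name')['id'].agg(lambda s: s.value_counts().idxmax())
df['id'] = df['name'].map(proper_id)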
I want to replace columns in dataframe after the first column based on the first column. Suppose we have:
df = {'Z': ['1', '0', '1', '1', '0'],
      'A': ['1', '1', '1', '0', '0'],
      'B': ['0', '0', '1', '0', '0'],
      'C': ['1', '0', '0', '0', '1']}
df = pd.DataFrame(df, columns=['Z', 'A', 'B', 'C'])
df
I want to replace each value in the columns after Z with 1 if it equals the value in column Z, else 0.
The desired outcome is the following:
df2 = {'Z': ['1', '0', '1', '1', '0'],
       'A': ['1', '0', '1', '0', '1'],
       'B': ['0', '1', '1', '0', '1'],
       'C': ['1', '1', '0', '0', '0']}
df2 = pd.DataFrame(df2, columns=['Z', 'A', 'B', 'C'])
df2
The problem is that I have 60 columns (A, B, C, D, ...) and I want to be able to do them all at the same time.
Use numpy broadcasting:
# Z column
z = df.iloc[:, 0].values
# rest of columns
rest = df.iloc[:, 1:].values
# do comparison and set values
df.iloc[:, 1:] = (z[:, None] == rest).astype(int)
print(df)
Output
Z A B C
0 1 1 0 1
1 0 0 1 1
2 1 1 1 0
3 1 0 0 0
4 0 1 1 0
If you need a new DataFrame, do the following:
z = df.iloc[:, 0].values
rest = df.iloc[:, 1:].values
df2 = pd.DataFrame(data=(z[:, None] == rest).astype(int), columns=df.columns[1:], index=df['Z']).reset_index()
print(df2)
Output
Z A B C
0 1 1 0 1
1 0 0 1 1
2 1 1 1 0
3 1 0 0 0
4 0 1 1 0
You can use DataFrame.eq along axis=0 to compare column Z with the rest of the columns, then join the result back to column Z and mask any NaN values:
df[['Z']].join(df.drop(columns='Z').eq(df['Z'], axis=0).astype(int)).mask(df.isna())
Z A B C
0 1 1 0 1
1 0 0 1 1
2 1 1 1 0
3 1 0 0 0
4 0 1 1 0
I think there's an easy way to do that by checking for equality and converting to integer.
z = df["Z"]
others = [c for c in df.columns if c != "Z"] # all columns but 'Z'
df[others] = df[others].transform(lambda x : x.eq(z).astype(int))
Output:
Z A B C
0 1 1 0 1
1 0 0 1 1
2 1 1 1 0
3 1 0 0 0
4 0 1 1 0
Note that there is a way to keep the NAs, but you must use pandas' nullable datatypes; see nullable data types and text data types.
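A minimal sketch of that NA-preserving variant, assuming the columns are first converted to the nullable Int64 dtype:
# nullable integers keep missing values as <NA> instead of coercing to float NaN
df_n = df.apply(pd.to_numeric, errors='coerce').astype('Int64')
others = df_n.columns.drop('Z')
# the comparison yields the nullable 'boolean' dtype, so <NA> survives the cast back
df_n[others] = df_n[others].eq(df_n['Z'], axis=0).astype('Int64')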
I have data frames from dumps every 10 minutes in the day. Example:
2019-08-28 06:00:13 SCHOOL_20190828...
2019-08-28 06:10:15 SCHOOL_20190828...
2019-08-28 06:20:14 SCHOOL_20190828...
2019-08-28 06:30:13 SCHOOL_20190828...
2019-08-28 06:40:15 SCHOOL_20190828...
... ...
2019-09-28 05:10:13 SCHOOL_20190928...
2019-09-28 05:20:13 SCHOOL_20190928...
2019-09-28 05:30:13 SCHOOL_20190928...
2019-09-28 05:40:14 SCHOOL_20190928...
2019-09-28 05:50:13 SCHOOL_20190928...
Each successive dataframe differs by about 2 rows (if they happen to be from the same day).
I want to read the first data frame of a day (A), compare it to the next data frame (B), and append the new rows to A. I want to continue until I have read all the data frames for that day, then move on to the next day and do the same. I will then append all the outputs from the various days.
Examples of data frames
import pandas as pd
import dask.dataframe as dd
df_A = pd.DataFrame([{'a': 1, 'b': 2, 'c':3}, {'a':10, 'b': 20, 'c': 30},{'a':2,'b':4,'c':6}])
df_B = pd.DataFrame([{'a': 1, 'b': 2, 'c':3}, {'a':10, 'b': 20, 'c': 30},{'a':2,'b':4,'c':6},{'a':0,'b':12,'c':16}])
df_C = pd.DataFrame([{'a': 1, 'b': 2, 'c':3},{'a':21,'b':12,'c':9}])
df_A
Out[3]:
a b c
0 1 2 3
1 10 20 30
2 2 4 6
df_B
Out[8]:
a b c
0 1 2 3
1 10 20 30
2 2 4 6
3 0 12 16
df_C
Out[9]:
a b c
0 1 2 3
1 21 12 9
I want my final data frame to be
df
Out[10]:
a b c
0 1 2 3
1 10 20 30
2 2 4 6
3 0 12 16
4 21 12 9
I want the most time-efficient way to do this, since there are quite a lot of data frames (about 5000).
Currently, I just read all the dumps using dask and drop duplicates.
ddf = dd.read_csv(path, storage_options=storage_opts, assume_missing=True).drop_duplicates().compute()
You can use pd.concat and drop_duplicates to do that, like below:
df1 = pd.DataFrame([['0', '1', '2', '3'], ['1', '10', '20', '30'], ['2', '2', '4', '6']], columns=('id', 'a', 'b', 'c'))
df2 = pd.DataFrame([['0', '1', '2', '3'], ['1', '10', '20', '30'], ['2', '2', '4', '6'], ['3', '0', '12', '16']], columns=('id', 'a', 'b', 'c'))
df3 = pd.DataFrame([['0', '1', '2', '3'], ['1', '21', '12', '9']], columns=('id', 'a', 'b', 'c'))
df = pd.concat([df1,df2,df3]).drop_duplicates().reset_index(drop=True)
print(df)
Result
id a b c
0 0 1 2 3
1 1 10 20 30
2 2 2 4 6
3 3 0 12 16
4 1 21 12 9
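If dask feels heavy for this, the same read-everything-then-deduplicate approach works in plain pandas. A sketch, assuming the dumps are CSV files under a hypothetical dumps/ directory:
import glob
import pandas as pd

files = sorted(glob.glob('dumps/*.csv'))   # hypothetical location of the 10-minute dumps
frames = (pd.read_csv(f) for f in files)
df = pd.concat(frames, ignore_index=True).drop_duplicates().reset_index(drop=True)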