Total meltdown here, need some assistance.
I have a DataFrame with 10m+ rows and some 150 columns, two of which are ids, looking like below:
df = pd.DataFrame({'id1': [1, 2, 5, 3, 6, 4],
                   'id2': [2, 1, np.nan, 4, np.nan, 3],
                   'num': [123, 3231, 123, 231, 6534, 2394]})
id1 id2 num
0 1 2.0 123
1 2 1.0 3231
2 5 NaN 123
3 3 4.0 231
4 6 NaN 6534
5 4 3.0 2394
Rows 0 and 1 form a pair given id1 and id2, and rows 3 and 5 form a pair in the same way. I want the table below, where the second row of each pair is merged into the first:
df = pd.DataFrame({'id1': [1, 5, 3, 6],
                   'id2': [2, np.nan, 4, np.nan],
                   'num': [123, 123, 231, 6534],
                   '2_num': [3231, np.nan, 2394, np.nan]})
id1 id2 num 2_num
0 1 2.0 123 3231.0
1 5 NaN 123 NaN
2 3 4.0 231 2394.0
3 6 NaN 6534 NaN
How can this be achieved using id1 and id2, labeling all columns that come from the pair's second row with the prefix "2_"?
Here's a merge-based approach (thanks @piRSquared for the improvement), i.e.
ndf = df.merge(df, 'left', left_on=['id1', 'id2'], right_on=['id2', 'id1'],
               suffixes=['', '_2']).drop(['id1_2', 'id2_2'], axis=1)
cols = ['id1','id2']
ndf[cols] = np.sort(ndf[cols], axis=1)
new = ndf.drop_duplicates(subset=['id1','id2'],keep='first')
id1 id2 num num_2
0 1.0 2.0 123 3231.0
2 5.0 NaN 123 NaN
3 3.0 4.0 231 2394.0
4 6.0 NaN 6534 NaN
The idea is to sort each pair of ids so that we group by them.
cols = ['id1', 'id2']
df[cols] = np.sort(df[cols], axis=1)
df.set_index(
    cols + [df.fillna(-1).groupby(cols).cumcount() + 1]
).num.unstack().add_suffix('_num').reset_index()
id1 id2 1_num 2_num
0 1.0 2.0 123.0 3231.0
1 3.0 4.0 231.0 2394.0
2 5.0 NaN 123.0 NaN
3 6.0 NaN 6534.0 NaN
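Only num is reshaped above; for the real frame with ~150 value columns, here is a sketch of the same idea applied to every remaining column at once (assuming the sorted-id setup from this answer):
import numpy as np
import pandas as pd

df = pd.DataFrame({'id1': [1, 2, 5, 3, 6, 4],
                   'id2': [2, 1, np.nan, 4, np.nan, 3],
                   'num': [123, 3231, 123, 231, 6534, 2394]})
cols = ['id1', 'id2']
df[cols] = np.sort(df[cols], axis=1)

# Rank rows within each (id1, id2) pair: 1 for the first occurrence, 2 for the second.
rank = df.fillna(-1).groupby(cols).cumcount() + 1

# Unstack every remaining value column at once; columns become (name, rank) pairs.
wide = df.set_index(cols + [rank]).unstack()
wide.columns = [f'{r}_{name}' for name, r in wide.columns]
wide = wide.reset_index()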
Use:
df[['id1','id2']] = pd.DataFrame(np.sort(df[['id1','id2']].values, axis=1),
                                 index=df.index, columns=['id1','id2']).fillna('tmp')
print (df)
id1 id2 num
0 1.0 2 123
1 1.0 2 3231
2 5.0 tmp 123
3 3.0 4 231
4 6.0 tmp 6534
5 3.0 4 2394
df1 = df.groupby(['id1','id2'])['num'].apply(list)
print (df1)
id1 id2
1.0 2.0 [123, 3231]
3.0 4.0 [231, 2394]
5.0 tmp [123]
6.0 tmp [6534]
Name: num, dtype: object
df2 = (pd.DataFrame(df1.values.tolist(),
                    index=df1.index,
                    columns=['num','2_num'])
         .reset_index().replace('tmp', np.nan))
print (df2)
id1 id2 num 2_num
0 1.0 2.0 123 3231.0
1 3.0 4.0 231 2394.0
2 5.0 NaN 123 NaN
3 6.0 NaN 6534 NaN
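Note that columns=['num','2_num'] hard-codes a maximum of two rows per id pair. Here is a hedged sketch that derives the width from the data instead (the columns then come out as 1_num, 2_num, ...):
# Width = size of the largest group; pd.DataFrame pads shorter lists with NaN.
width = df1.str.len().max()
df2 = pd.DataFrame(df1.tolist(), index=df1.index,
                   columns=[f'{i}_num' for i in range(1, width + 1)])
df2 = df2.reset_index().replace('tmp', np.nan)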
I have 2 DataFrames, which I want to combine as follows:
df1 = pd.DataFrame({"a": [1,2], "b":['A','B'], "c":[3,2]})
df2 = pd.DataFrame({"a": [1,1,1, 2,2,2, 3, 4], "b":['A','A','A','B','B', 'B','C','D'], "c":[3, None,None,2,None,None,None,None]})
Desired output:
a b c
1 A 3.0
1 A NaN
1 A NaN
2 B 2.0
2 B NaN
2 B NaN
I had an earlier version of this question that only involved df2 and was solved with
df.groupby(['a','b']).filter(lambda g: any(~g['c'].isna()))
but now I need to run it only for rows that appear in df1 (df2 contains the rows from df1 plus some extra rows, which should not be included).
Thanks!
You can turn on the indicator in merge:
out = df2.merge(df1,indicator=True,how='outer',on=['a','b'])
Out[91]:
a b c_x c_y _merge
0 1 A 3.0 3.0 both
1 1 A NaN 3.0 both
2 1 A NaN 3.0 both
3 2 B 2.0 2.0 both
4 2 B NaN 2.0 both
5 2 B NaN 2.0 both
6 3 C NaN NaN left_only
7 4 D NaN NaN left_only
out = out[out['_merge']=='both']
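To end up with the exact shape from the question, here is a sketch of the cleanup: keep the rows flagged 'both', drop the helper columns (c_y came from df1), and restore the original column name:
out = (df2.merge(df1, indicator=True, how='outer', on=['a', 'b'])
          .query("_merge == 'both'")
          .drop(columns=['c_y', '_merge'])
          .rename(columns={'c_x': 'c'}))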
IIUC, you could merge:
out = df2.merge(df1[['a','b']])
or you could use chained isin:
out1 = df2[df2['a'].isin(df1['a']) & df2['b'].isin(df1['b'])]
Output:
a b c
0 1 A 3.0
1 1 A NaN
2 1 A NaN
3 2 B 2.0
4 2 B NaN
5 2 B NaN
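One caveat on the chained isin: it checks each column independently, so a row like (1, 'B') would pass whenever 1 and 'B' each appear somewhere in df1, even if the pair itself never does. A sketch that matches the (a, b) pairs jointly:
# Build an index of the valid (a, b) pairs and test df2's pairs against it.
pairs = pd.MultiIndex.from_frame(df1[['a', 'b']])
out = df2[pd.MultiIndex.from_frame(df2[['a', 'b']]).isin(pairs)]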
I have 2 dataframes:
dfA = pd.DataFrame({'label':[1,5,2,4,2,3],
'group':['A']*3 + ['B']*3,
'x':[np.nan]*3 + [1,2,3],
'y':[np.nan]*3 + [1,2,3]})
dfB = pd.DataFrame({'uniqid':[1,2,3,4,5,6,7],
'horizontal':[34,41,23,34,23,43,22],
'vertical':[98,67,19,57,68,88,77]})
...which look like:
label group x y
0 1 A NaN NaN
1 5 A NaN NaN
2 2 A NaN NaN
3 4 B 1.0 1.0
4 2 B 2.0 2.0
5 3 B 3.0 3.0
uniqid horizontal vertical
0 1 34 98
1 2 41 67
2 3 23 19
3 4 34 57
4 5 23 68
5 6 43 88
6 7 22 77
Basically, dfB contains 'horizontal' and 'vertical' values for all unique IDs. I want to populate the 'x' and 'y' columns in dfA with the 'horizontal' and 'vertical' values in dfB but only for group A; data for group B should remain unchanged.
The desired output would be:
label group x y
0 1 A 34.0 98.0
1 5 A 23.0 68.0
2 2 A 41.0 67.0
3 4 B 1.0 1.0
4 2 B 2.0 2.0
5 3 B 3.0 3.0
I've used .merge() to add dfB's columns to the DataFrame for both groups, then copied the data into the x and y columns for group A only, and finally dropped the columns that came from dfB.
dfA = dfA.merge(dfB, how = 'left', left_on = 'label', right_on = 'uniqid')
dfA.loc[dfA['group'] == 'A','x'] = dfA.loc[dfA['group'] == 'A','horizontal']
dfA.loc[dfA['group'] == 'A','y'] = dfA.loc[dfA['group'] == 'A','vertical']
dfA = dfA[['label','group','x','y']]
The correct output is produced:
label group x y
0 1 A 34.0 98.0
1 5 A 23.0 68.0
2 2 A 41.0 67.0
3 4 B 1.0 1.0
4 2 B 2.0 2.0
5 3 B 3.0 3.0
...but this is a really, really ugly solution. Is there a better solution?
combine_first
dfA.set_index(['label', 'group']).combine_first(
dfB.set_axis(['label', 'x', 'y'], axis=1).set_index(['label'])
).reset_index()
label group x y
0 1 A 34.0 98.0
1 5 A 23.0 68.0
2 2 A 41.0 67.0
3 4 B 1.0 1.0
4 2 B 2.0 2.0
5 3 B 3.0 3.0
fillna
Works as well
dfA.set_index(['label', 'group']).fillna(
dfB.set_axis(['label', 'x', 'y'], axis=1).set_index(['label'])
).reset_index()
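A map-based sketch is another option; unlike combine_first/fillna, it keys off the group label explicitly rather than off which cells happen to be NaN (it assumes uniqid is unique in dfB):
lookup = dfB.set_index('uniqid')
mask = dfA['group'] == 'A'
# Look up each label's horizontal/vertical value; group B rows are untouched.
dfA.loc[mask, 'x'] = dfA.loc[mask, 'label'].map(lookup['horizontal'])
dfA.loc[mask, 'y'] = dfA.loc[mask, 'label'].map(lookup['vertical'])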
We can use loc to extract/update only the part we want. Since you are merging on one column, which has unique values in dfB, you can use set_index and loc/reindex:
mask = dfA['group']=='A'
dfA.loc[ mask, ['x','y']] = (dfB.set_index('uniqid')
.loc[dfA.loc[mask,'label'],
['horizontal','vertical']]
.values
)
Output:
label group x y
0 1 A 34.0 98.0
1 5 A 23.0 68.0
2 2 A 41.0 67.0
3 4 B 1.0 1.0
4 2 B 2.0 2.0
5 3 B 3.0 3.0
Note that the above would fail if some of dfA.label is not in dfB.uniqid, in which case we need reindex:
dfA.loc[mask, ['x', 'y']] = (dfB.set_index('uniqid')
                                .reindex(dfA.loc[mask, 'label'])
                                [['horizontal', 'vertical']].values
                             )
I've got 3 DataFrames that I would like to merge or join on "Label" so that I can compare all columns.
Examples of df are below:
df1
Label,col1,col2,col3
NF1,1,1,6
NF2,3,2,8
NF3,4,5,4
NF4,5,7,2
NF5,6,2,2
df2
Label,col1,col2,col3
NF1,8,4,5
NF2,4,7,8
NF3,9,7,8
df3
Label,col1,col2,col3
NF1,2,8,8
NF2,6,2,0
NF3,2,2,5
NF4,2,4,9
NF5,2,5,8
and what I'd like to see is similar to:
Label,df1_col1,df2_col1,df3_col1,df1_col2,df2_col2,df3_col2,df1_col3,df2_col3,df3_col3
NF1,1,8,2,1,4,8,6,5,8
NF2,3,4,6,2,7,2,8,8,0
NF3,4,9,2,5,7,2,4,8,5
NF4,5,,2,7,,4,2,,9
NF5,6,,2,2,,5,2,,8
but I'm open to suggestions on how to make the comparisons more readable.
Thanks!
Use concat with a list of DataFrames, add the keys parameter for prefixes, and sort by column names:
dfs = [df1, df2, df3]
k = ('df1','df2','df3')
df = (pd.concat([x.set_index('Label') for x in dfs], axis=1, keys=k)
.sort_index(axis=1, level=1)
.rename_axis('Label')
.reset_index())
df.columns = df.columns.map('_'.join).str.strip('_')
print (df)
Label df1_col1 df2_col1 df3_col1 df1_col2 df2_col2 df3_col2 \
0 NF1 1 8.0 2 1 4.0 8
1 NF2 3 4.0 6 2 7.0 2
2 NF3 4 9.0 2 5 7.0 2
3 NF4 5 NaN 2 7 NaN 4
4 NF5 6 NaN 2 2 NaN 5
df1_col3 df2_col3 df3_col3
0 6 5.0 8
1 8 8.0 0
2 4 8.0 5
3 2 NaN 9
4 2 NaN 8
You can use df.merge:
In [1965]: res = df1.merge(df2, on='Label', how='left', suffixes=('_df1', '_df2')).merge(df3, on='Label', how='left').rename(columns={'col1': 'col1_df3','col2':'col2_df3','col3':'col3_df3'})
In [1975]: res = res.reindex(sorted(res.columns), axis=1)
In [1976]: res
Out[1976]:
Label col1_df1 col1_df2 col1_df3 col2_df1 col2_df2 col2_df3 col3_df1 col3_df2 col3_df3
0 NF1 1 8.00 2 1 4.00 8 6 5.00 8
1 NF2 3 4.00 6 2 7.00 2 8 8.00 0
2 NF3 4 9.00 2 5 7.00 2 4 8.00 5
3 NF4 5 nan 2 7 nan 4 2 nan 9
4 NF5 6 nan 2 2 nan 5 2 nan 8
We can use pandas' join method by setting the Label column as the index and joining the DataFrames:
dfs = [df1,df2,df3]
keys = ['df1','df2','df3']
#set Label as index
df1, *others = [frame.set_index("Label").add_prefix(f"{prefix}_")
for frame,prefix in zip(dfs,keys)]
#join df1 with others
outcome = df1.join(others,how='outer').rename_axis(index='Label').reset_index()
outcome
Label df1_col1 df1_col2 df1_col3 df2_col1 df2_col2 df2_col3 df3_col1 df3_col2 df3_col3
0 NF1 1 1 6 8.0 4.0 5.0 2 8 8
1 NF2 3 2 8 4.0 7.0 8.0 6 2 0
2 NF3 4 5 4 9.0 7.0 8.0 2 2 5
3 NF4 5 7 2 NaN NaN NaN 2 4 9
4 NF5 6 2 2 NaN NaN NaN 2 5 8
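Since the stated goal is readable comparisons, a long-format sketch may also help: one row per (source, Label, column), which is easy to filter, sort, or pivot (reusing dfs and keys from the answers above):
long = (pd.concat([x.set_index('Label') for x in dfs], keys=keys, names=['source'])
          .rename_axis(columns='col')
          .stack()
          .rename('value')
          .reset_index())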
This is a follow-up question to Append any further columns to the first three columns.
I start out with about 120 columns, which always come in groups of three that belong together. Instead of being 120 columns side by side, they should be stacked on top of each other, so we end up with three columns. This has already been solved (see the link above).
Sample data:
df = pd.DataFrame({
"1": np.random.randint(900000000, 999999999, size=5),
"2": np.random.choice( ["A","B","C", np.nan], 5),
"3": np.random.choice( [np.nan, 1], 5),
"4": np.random.randint(900000000, 999999999, size=5),
"5": np.random.choice( ["A","B","C", np.nan], 5),
"6": np.random.choice( [np.nan, 1], 5)
})
Working solution for the initial question, as suggested by Jezrael:
arr = np.arange(len(df.columns))
df.columns = [arr // 3, arr % 3]
df = df.stack(0).sort_index(level=[1, 0]).reset_index(drop=True)
df.columns = ['A','B','C']
This transforms this:
1 2 3 4 5 6
0 960189042 B NaN 991581392 A 1.0
1 977655199 nan 1.0 964195250 A 1.0
2 961771966 A NaN 969007327 B 1.0
3 955308022 C 1.0 973316485 A NaN
4 933277976 A 1.0 976749175 A NaN
to this:
A B C
0 960189042 B NaN
1 977655199 nan 1.0
2 961771966 A NaN
3 955308022 C 1.0
4 933277976 A 1.0
5 991581392 A 1.0
6 964195250 A 1.0
7 969007327 B 1.0
8 973316485 A NaN
9 976749175 A NaN
Follow-up question:
Now, if I need an indicator of which triple each block comes from, how could this be done? The result could look like:
A B C D
0 960189042 B NaN 0
1 977655199 nan 1.0 0
2 961771966 A NaN 0
3 955308022 C 1.0 0
4 933277976 A 1.0 0
5 991581392 A 1.0 1
6 964195250 A 1.0 1
7 969007327 B 1.0 1
8 973316485 A NaN 1
9 976749175 A NaN 1
These blocks can be of different lengths! So I cannot simply add a counter.
Use reset_index to remove only the first level of the MultiIndex, and convert the second level to a column:
arr = np.arange(len(df.columns))
df.columns = [arr // 3, arr % 3]
df = df.stack(0).sort_index(level=[1, 0]).reset_index(level=0, drop=True).reset_index()
df.columns = ['D','A','B','C']
print (df)
D A B C
0 0 960189042 B NaN
1 0 977655199 nan 1.0
2 0 961771966 A NaN
3 0 955308022 C 1.0
4 0 933277976 A 1.0
5 1 991581392 A 1.0
6 1 964195250 A 1.0
7 1 969007327 B 1.0
8 1 973316485 A NaN
9 1 976749175 A NaN
Then, if you need to change the order of the columns:
cols = df.columns[1:].tolist() + df.columns[:1].tolist()
df = df[cols]
print (df)
A B C D
0 960189042 B NaN 0
1 977655199 nan 1.0 0
2 961771966 A NaN 0
3 955308022 C 1.0 0
4 933277976 A 1.0 0
5 991581392 A 1.0 1
6 964195250 A 1.0 1
7 969007327 B 1.0 1
8 973316485 A NaN 1
9 976749175 A NaN 1
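An alternative sketch that avoids the column MultiIndex entirely: slice the original wide frame into its triples and let pd.concat's keys supply the block indicator (this assumes df is the original frame with columns in groups of three):
blocks = [df.iloc[:, i:i + 3].set_axis(['A', 'B', 'C'], axis=1)
          for i in range(0, df.shape[1], 3)]
out = (pd.concat(blocks, keys=list(range(len(blocks))), names=['D'])
         .reset_index(level='D')
         .reset_index(drop=True))
out = out[['A', 'B', 'C', 'D']]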
Suppose I have a DataFrame:
df = pd.DataFrame({'CATEGORY': ['a','b','c','b','b','a','b'],
                   'VALUE': [np.nan, 1, 0, 0, 5, 0, 4]})
which looks like
CATEGORY VALUE
0 a NaN
1 b 1
2 c 0
3 b 0
4 b 5
5 a 0
6 b 4
I group it:
df = df.groupby(by='CATEGORY')
Now, let me show what I want with the help of an example on one group, 'b':
df.get_group('b')
group b:
CATEGORY VALUE
1 b 1
3 b 0
4 b 5
6 b 4
I need: within the scope of each group, compute diff() between VALUE entries, skipping all NaNs and 0s. So the result should be:
CATEGORY VALUE DIFF
1 b 1 -
3 b 0 -
4 b 5 4
6 b 4 -1
You can use diff to subtract values after dropping 0 and NaN values:
import numpy as np
import pandas as pd

df = pd.DataFrame({'CATEGORY': ['a','b','c','b','b','a','b'],
                   'VALUE': [np.nan, 1, 0, 0, 5, 0, 4]})
grouped = df.groupby("CATEGORY")
# define diff func
diff = lambda x: x["VALUE"].replace(0, np.nan).dropna().diff()
df["DIFF"] = grouped.apply(diff).reset_index(0, drop=True)
print(df)
CATEGORY VALUE DIFF
0 a NaN NaN
1 b 1.0 NaN
2 c 0.0 NaN
3 b 0.0 NaN
4 b 5.0 4.0
5 a 0.0 NaN
6 b 4.0 -1.0
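Here is a sketch of the same computation without apply: filter out the unwanted rows first, diff within each group, and let the assignment realign on the index (filtered rows come back as NaN):
# Keep only rows whose VALUE is neither 0 nor NaN, then diff per CATEGORY.
valid = df[df['VALUE'].replace(0, np.nan).notna()]
df['DIFF'] = valid.groupby('CATEGORY')['VALUE'].diff()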
Sounds like a job for a pd.Series.shift() operation along with a notnull mask.
First we remove the unwanted values before we group the data (.copy() avoids a SettingWithCopyWarning when adding columns below):
nonull_df = df[(df['VALUE'] != 0) & df['VALUE'].notnull()].copy()
groups = nonull_df.groupby(by='CATEGORY')
Now we can shift within the groups and calculate the diff:
# shift(1) pulls the previous surviving VALUE within each group
nonull_df['prev_value'] = groups['VALUE'].shift(1)
nonull_df['diff'] = nonull_df['VALUE'] - nonull_df['prev_value']
Lastly, and optionally, you can copy the new columns back to the original DataFrame; joining on the index leaves the filtered-out rows as NaN:
df = df.join(nonull_df[['prev_value', 'diff']])
df
CATEGORY VALUE prev_value diff
0 a NaN NaN NaN
1 b 1.0 NaN NaN
2 c 0.0 NaN NaN
3 b 0.0 NaN NaN
4 b 5.0 1.0 4.0
5 a 0.0 NaN NaN
6 b 4.0 5.0 -1.0