combine and group rows from 2 dfs - python

I have 2 dfs, which I want to combine as the following:
df1 = pd.DataFrame({"a": [1,2], "b":['A','B'], "c":[3,2]})
df2 = pd.DataFrame({"a": [1,1,1, 2,2,2, 3, 4], "b":['A','A','A','B','B', 'B','C','D'], "c":[3, None,None,2,None,None,None,None]})
Output:
a b c
1 A 3.0
1 A NaN
1 A NaN
2 B 2.0
2 B NaN
2 B NaN
I had an earlier version of this question that only involved df2 and was solved with
df.groupby(['a','b']).filter(lambda g: any(~g['c'].isna()))
but now I need to run it only for rows that appear in df1 (df2 contains rows from df1 but some extra rows which I want to not be included.
Thanks!

You can turn the indicator on with merge
out = df2.merge(df1,indicator=True,how='outer',on=['a','b'])
Out[91]:
a b c_x c_y _merge
0 1 A 3.0 3.0 both
1 1 A NaN 3.0 both
2 1 A NaN 3.0 both
3 2 B 2.0 2.0 both
4 2 B NaN 2.0 both
5 2 B NaN 2.0 both
6 3 C NaN NaN left_only
7 4 D NaN NaN left_only
out = out[out['_merge']=='both']

IIUC, you could merge:
out = df2.merge(df1[['a','b']])
or you could use chained isin:
out1 = df2[df2['a'].isin(df1['a']) & df2['b'].isin(df1['b'])]
Output:
a b c
0 1 A 3.0
1 1 A NaN
2 1 A NaN
3 2 B 2.0
4 2 B NaN
5 2 B NaN

Related

Python How to drop rows of Pandas DataFrame whose value in a certain column is NaN

I have this DataFrame and want only the records whose "Total" column is not NaN ,and records when A~E has more than two NaN:
A B C D E Total
1 1 3 5 5 8
1 4 3 5 5 NaN
3 6 NaN NaN NaN 6
2 2 5 9 NaN 8
..i.e. something like df.dropna(....) to get this resulting dataframe:
A B C D E Total
1 1 3 5 5 8
2 2 5 9 NaN 8
Here's my code
import pandas as pd
dfInputData = pd.read_csv(path)
dfInputData = dfInputData.dropna(axis=1,how = 'any')
RowCnt = dfInputData.shape[0]
But it looks like no modification has been made even error
Please help!! Thanks
Use boolean indexing with count all columns without Total for number of missing values and not misisng values in Total:
df = df[df.drop('Total', axis=1).isna().sum(axis=1).le(2) & df['Total'].notna()]
print (df)
A B C D E Total
0 1 1 3.0 5.0 5.0 8.0
3 2 2 5.0 9.0 NaN 8.0
Or filter columns between A:E:
df = df[df.loc[:, 'A':'E'].isna().sum(axis=1).le(2) & df['Total'].notna()]
print (df)
A B C D E Total
0 1 1 3.0 5.0 5.0 8.0
3 2 2 5.0 9.0 NaN 8.0

Append any further columns to the first three columns AND indicate the triple column it comes from

This is a follow-up question to Append any further columns to the first three columns.
I start out with about 120 columns. It is always three columns that belong to each other. Instead of being 120 columns side by side, they should be stacked on top of each other, so we end up with three columns. This has already been solved (see link above).
Sample data:
df = pd.DataFrame({
"1": np.random.randint(900000000, 999999999, size=5),
"2": np.random.choice( ["A","B","C", np.nan], 5),
"3": np.random.choice( [np.nan, 1], 5),
"4": np.random.randint(900000000, 999999999, size=5),
"5": np.random.choice( ["A","B","C", np.nan], 5),
"6": np.random.choice( [np.nan, 1], 5)
})
Working solution for initial question as suggested by Jezrael:
arr = np.arange(len(df.columns))
df.columns = [arr // 3, arr % 3]
df = df.stack(0).sort_index(level=[1, 0]).reset_index(drop=True)
df.columns = ['A','B','C']
This transforms this:
1 2 3 4 5 6
0 960189042 B NaN 991581392 A 1.0
1 977655199 nan 1.0 964195250 A 1.0
2 961771966 A NaN 969007327 B 1.0
3 955308022 C 1.0 973316485 A NaN
4 933277976 A 1.0 976749175 A NaN
to this:
A B C
0 960189042 B NaN
1 977655199 nan 1.0
2 961771966 A NaN
3 955308022 C 1.0
4 933277976 A 1.0
5 991581392 A 1.0
6 964195250 A 1.0
7 969007327 B 1.0
8 973316485 A NaN
9 976749175 A NaN
Follow Up Question:
Now, if I'd need an indicator from which triple each block comes from, how could this be done? So a result could look like:
A B C D
0 960189042 B NaN 0
1 977655199 nan 1.0 0
2 961771966 A NaN 0
3 955308022 C 1.0 0
4 933277976 A 1.0 0
5 991581392 A 1.0 1
6 964195250 A 1.0 1
7 969007327 B 1.0 1
8 973316485 A NaN 1
9 976749175 A NaN 1
These blocks can be of different lengths! So I cannot simply add a counter.
Use reset_index for remove only first level, second level of MultiIndex convert to column:
arr = np.arange(len(df.columns))
df.columns = [arr // 3, arr % 3]
df = df.stack(0).sort_index(level=[1, 0]).reset_index(level=0, drop=True).reset_index()
df.columns = ['D','A','B','C']
print (df)
D A B C
0 0 960189042 B NaN
1 0 977655199 nan 1.0
2 0 961771966 A NaN
3 0 955308022 C 1.0
4 0 933277976 A 1.0
5 1 991581392 A 1.0
6 1 964195250 A 1.0
7 1 969007327 B 1.0
8 1 973316485 A NaN
9 1 976749175 A NaN
Then if need change order of columns:
cols = df.columns[1:].tolist() + df.columns[:1].tolist()
df = df[cols]
print (df)
A B C D
0 960189042 B NaN 0
1 977655199 nan 1.0 0
2 961771966 A NaN 0
3 955308022 C 1.0 0
4 933277976 A 1.0 0
5 991581392 A 1.0 1
6 964195250 A 1.0 1
7 969007327 B 1.0 1
8 973316485 A NaN 1
9 976749175 A NaN 1

Python pandas.DataFrame: Make whole row NaN according to condition

I want to make the whole row NaN according to a condition, based on a column. For example, if B > 5, I want to make the whole row NaN.
Unprocessed data frame looks like this:
A B
0 1 4
1 3 5
2 4 6
3 8 7
Make whole row NaN, if B > 5:
A B
0 1.0 4.0
1 3.0 5.0
2 NaN NaN
3 NaN NaN
Thank you.
Use boolean indexing for assign value per condition:
df[df['B'] > 5] = np.nan
print (df)
A B
0 1.0 4.0
1 3.0 5.0
2 NaN NaN
3 NaN NaN
Or DataFrame.mask which add by default NaNs by condition:
df = df.mask(df['B'] > 5)
print (df)
A B
0 1.0 4.0
1 3.0 5.0
2 NaN NaN
3 NaN NaN
Thank you Bharath shetty:
df = df.where(~(df['B']>5))
You can also use df.loc[df.B > 5, :] = np.nan
Example
In [14]: df
Out[14]:
A B
0 1 4
1 3 5
2 4 6
3 8 7
In [15]: df.loc[df.B > 5, :] = np.nan
In [16]: df
Out[16]:
A B
0 1.0 4.0
1 3.0 5.0
2 NaN NaN
3 NaN NaN
in human language df.loc[df.B > 5, :] = np.nan can be translated to:
assign np.nan to any column (:) of the dataframe ( df ) where the
condition df.B > 5 is valid.
Or using reindex
df.loc[df.B<=5,:].reindex(df.index)
Out[83]:
A B
0 1.0 4.0
1 3.0 5.0
2 NaN NaN
3 NaN NaN

DataFrameGroupBy diff() on condition

Suppose i have a DataFrame:
df = pd.DataFrame({'CATEGORY':['a','b','c','b','b','a','b'],
'VALUE':[pd.np.NaN,1,0,0,5,0,4]})
which looks like
CATEGORY VALUE
0 a NaN
1 b 1
2 c 0
3 b 0
4 b 5
5 a 0
6 b 4
I group it:
df = df.groupby(by='CATEGORY')
And now, let me show, what i want with the help of example on one group 'b':
df.get_group('b')
group b:
CATEGORY VALUE
1 b 1
3 b 0
4 b 5
6 b 4
I need: In the scope of each group, count diff() between VALUE values, skipping all NaNs and 0s. So the result should be:
CATEGORY VALUE DIFF
1 b 1 -
3 b 0 -
4 b 5 4
6 b 4 -1
You can use diff to subtract values after dropping 0 and NaN values:
df = pd.DataFrame({'CATEGORY':['a','b','c','b','b','a','b'],
'VALUE':[pd.np.NaN,1,0,0,5,0,4]})
grouped = df.groupby("CATEGORY")
# define diff func
diff = lambda x: x["VALUE"].replace(0, np.NaN).dropna().diff()
df["DIFF"] = grouped.apply(diff).reset_index(0, drop=True)
print(df)
CATEGORY VALUE DIFF
0 a NaN NaN
1 b 1.0 NaN
2 c 0.0 NaN
3 b 0.0 NaN
4 b 5.0 4.0
5 a 0.0 NaN
6 b 4.0 -1.0
Sounds like a job for a pd.Series.shift() operation along with a notnull mask.
First we remove the unwanted values, before we group the data
nonull_df = df[(df['VALUE'] != 0) & df['VALUE'].notnull()]
groups = nonull_df.groupby(by='CATEGORY')
Now we can shift internally in the groups and calculate the diff
nonull_df['next_value'] = groups['VALUE'].shift(1)
nonull_df['diff'] = nonull_df['VALUE'] - nonull_df['next_value']
Lastly and optionally you can copy the data back to the original dataframe
df.loc[nonull_df.index] = nonull_df
df
CATEGORY VALUE next_value diff
0 a NaN NaN NaN
1 b 1.0 NaN NaN
2 c 0.0 NaN NaN
3 b 0.0 1.0 -1.0
4 b 5.0 1.0 4.0
5 a 0.0 NaN NaN
6 b 4.0 5.0 -1.0

create a dict of changes of 2 dataframes

I have to generate and update a list based on a database extract and a csv file.
I'm planning to do that using 2 pandas dataframes.
I'm able to generate the inserts (new items within the csv file based df) and the deletes (items not existing in the csv file based df) but I don't know how to generate and update the list. The dict should only contain the columns where the values are changed and the key column
The result of the operation should be something like this:
{'key': 10,
'column1': 'abc',
'column6': 10.8
}
Any idea on how to achieve this?
you can do it this way:
In [424]: df
Out[424]:
a b c d
0 7 5 1 3
1 1 8 6 1
2 9 6 5 2
3 5 5 4 2
4 7 1 4 6
In [425]: df2
Out[425]:
a b c d
0 -1 5 1 -1
1 1 8 6 1
2 9 6 5 2
3 5 5 -1 2
4 7 1 4 6
In [426]: df.index.name = 'key'
In [427]: df2.index.name = 'key'
In [430]: (df2[df2 != df]
.....: .dropna(how='all')
.....: .dropna(axis=1, how='all')
.....: .reset_index()
.....: .apply(lambda x: x.dropna().to_dict(), axis=1)
.....: )
Out[430]:
0 {'a': -1.0, 'd': -1.0, 'key': 0.0}
1 {'c': -1.0, 'key': 3.0}
dtype: object
explanation:
In [441]: df2[df2 != df]
Out[441]:
a b c d
key
0 -1.0 NaN NaN -1.0
1 NaN NaN NaN NaN
2 NaN NaN NaN NaN
3 NaN NaN -1.0 NaN
4 NaN NaN NaN NaN
In [443]: df2[df2 != df].dropna(how='all')
Out[443]:
a b c d
key
0 -1.0 NaN NaN -1.0
3 NaN NaN -1.0 NaN
In [444]: df2[df2 != df].dropna(how='all').dropna(axis=1, how='all')
Out[444]:
a c d
key
0 -1.0 NaN -1.0
3 NaN -1.0 NaN

Categories