Compare two columns in two dataframes with a condition on another column - python

I have a multilevel dataframe and I want to compare the values in the secret column, with a condition on the group column: if group is A, the value in the other dataframe is allowed to be empty or na; otherwise, output INVALID for the mismatching values.
multilevel dataframe:

     name         secret          group
     df1   df2    df1     df2     df1  df2
id
1    Tim   Tim    random  na      A    A
2    Tom   Tom    tree            A    A
3    Alex  Alex   apple   apple   B    B
4    May   May    file    cheese  C    C
expected output for secret:

id   name   secret    group
1    Tim    na        A
2    Tom              A
3    Alex   apple     B
4    May    INVALID   C
so far I have:
result_df['result'] = multilevel_df.groupby(level=0, axis=0).apply(lambda x: secret_check(x))
# take care of the rest by comparing column by column
result_df = multilevel_df.groupby(level=0, axis=1).apply(lambda x: validate(x))

def validate(x):
    if x[0] == x[1]:
        return x[1]
    else:
        return 'INVALID'

def secret_check(x):
    if x['group'] == 'A' and pd.isnull(x['secret']):  # this line is off
        return x[1]
    elif x[0] == x[1]:
        return x[1]
    else:
        return 'INVALID'

Assuming we have the following dataframe:
import pandas as pd

df = pd.DataFrame({0: {0: 1, 1: 2, 2: 3, 3: 4},
                   1: {0: 'Tim', 1: 'Tom', 2: 'Alex', 3: 'May'},
                   2: {0: 'Tim', 1: 'Tom', 2: 'Alex', 3: 'May'},
                   3: {0: 'random', 1: 'tree', 2: 'apple', 3: 'file'},
                   4: {0: 'na', 1: '', 2: 'apple', 3: 'cheese'},
                   5: {0: 'A', 1: 'A', 2: 'B', 3: 'C'},
                   6: {0: 'A', 1: 'A', 2: 'B', 3: 'C'}})
df.columns = pd.MultiIndex.from_tuples([('id', ''), ('name', 'df1'), ('name', 'df2'),
                                        ('secret', 'df1'), ('secret', 'df2'),
                                        ('group', 'df1'), ('group', 'df2')])
df
   id  name        secret          group
       df1   df2   df1     df2     df1  df2
0   1  Tim   Tim   random  na      A    A
1   2  Tom   Tom   tree            A    A
2   3  Alex  Alex  apple   apple   B    B
3   4  May   May   file    cheese  C    C
You can use np.select() to return results based on conditions,
.droplevel() to flatten the MultiIndex,
and df.loc[:, ~df.columns.duplicated()] to drop the duplicated columns. Since we are writing the answer into the df1 columns, the df2 columns are no longer needed.
import numpy as np

df[('secret', 'df1')] = np.select(
    [(df[('group', 'df2')] != 'A') &
     (df[('secret', 'df1')] != df[('secret', 'df2')])],       # condition 1
    [df[('secret', 'df1')] + ' > ' + df[('secret', 'df2')]],  # result 1
    df[('secret', 'df2')])  # alternative if the condition is not met
df.columns = df.columns.droplevel(1)
df = df.loc[:, ~df.columns.duplicated()]
df
Out[1]:
   id  name  secret         group
0   1  Tim   na             A
1   2  Tom                  A
2   3  Alex  apple          B
3   4  May   file > cheese  C
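If you want the literal INVALID marker from the expected output rather than the 'df1 > df2' string, the same np.select call can return a constant. A sketch, run against the original frame in place of the first np.select assignment above:

# same condition as above, but emit the question's 'INVALID' marker
df[('secret', 'df1')] = np.select(
    [(df[('group', 'df2')] != 'A') &
     (df[('secret', 'df1')] != df[('secret', 'df2')])],
    ['INVALID'],            # constant result where the condition holds
    df[('secret', 'df2')])  # otherwise keep df2's value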

If I understand you right, you want to mark "secret" in df2 as invalid if the secrets in df1 and df2 differ and the group is not A. There you go:
condition = (df[('secret', 'df1')] != df[('secret', 'df2')]) & \
            (df[('group', 'df1')] != 'A')
df.loc[condition, ('secret', 'df2')] = 'INVALID'
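A minimal self-contained sketch of that approach, rebuilding only the secret and group sub-columns from the question (the layout is assumed; empty strings stand in for missing secrets):

import pandas as pd

df = pd.DataFrame({('secret', 'df1'): ['random', 'tree', 'apple', 'file'],
                   ('secret', 'df2'): ['na', '', 'apple', 'cheese'],
                   ('group', 'df1'): ['A', 'A', 'B', 'C'],
                   ('group', 'df2'): ['A', 'A', 'B', 'C']})

condition = (df[('secret', 'df1')] != df[('secret', 'df2')]) & \
            (df[('group', 'df1')] != 'A')
df.loc[condition, ('secret', 'df2')] = 'INVALID'

print(df[('secret', 'df2')].tolist())  # ['na', '', 'apple', 'INVALID']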

Related

Merging two multiindex dataframes

I have 2 dataframes:
df1 = pd.DataFrame.from_dict({
    ('category', ''): {0: 'A', 1: 'B', 2: 'C', 3: 'D', 4: 'E', 5: 'F', 6: 'G'},
    (pd.Timestamp('2021-06-28 00:00:00'), 'metric_1'): {
        0: 4120.549999999999, 1: 11226.016666666665, 2: 25049.443333333333,
        3: 18261.083333333332, 4: 2553.1208333333334, 5: 2843.01, 6: 73203.51333333334},
    (pd.Timestamp('2021-06-28 00:00:00'), 'metric_2'): {
        0: 9907.79, 1: 7614.650000000001, 2: 13775.259999999998,
        3: 13158.250000000004, 4: 1457.85, 5: 1089.5600000000002, 6: 38864.9},
    (pd.Timestamp('2021-07-05 00:00:00'), 'metric_1'): {
        0: 5817.319999999998, 1: 10799.45, 2: 23521.51,
        3: 22062.350833333334, 4: 1249.5974999999999, 5: 3229.77, 6: 52796.06083333332},
    (pd.Timestamp('2021-07-05 00:00:00'), 'metric_2'): {
        0: 6321.21, 1: 5606.01, 2: 10239.689999999999,
        3: 17476.600000000002, 4: 943.7199999999999, 5: 1410.33, 6: 29645.45},
}).set_index('category')
df2 = pd.DataFrame.from_dict({
    'category': {0: 'A', 1: 'B', 2: 'C', 3: 'D', 4: 'E', 5: 'F', 6: 'G'},
    1: {0: 36234.035577957984, 1: 69078.07089184562, 2: 128879.5397517309,
        3: 178376.63536908248, 4: 9293.956915067887, 5: 8184.780211399392,
        6: 177480.74540313095},
    2: {0: 37887.581678419825, 1: 72243.67956241772, 2: 134803.02342121338,
        3: 186603.8963173654, 4: 9716.385738295368, 5: 8555.606693927,
        6: 185658.87577993725},
}).set_index('category')
First I change the column names of df2 to be the same as in df1:
date_mappings = {1: '2021-06-28',
                 2: '2021-07-05'}
df2 = df2.rename(columns=date_mappings)
Then I try to merge it:
f = lambda x: pd.to_datetime(x)
df = (df2.merge(df1.unstack(), left_index=True, right_index=True)
         .sort_index(axis=1))
But I get an error:
ValueError: Cannot merge a Series without a name
What is my mistake?
My goal is to add the df2 column to df1 under each week, so that each week in df1 has 3 metric columns instead of 2.
After using
c = [df2.columns.map(date_mappings.get), df2.columns]
df1.join(df2.set_axis(c, axis=1)).sort_index(axis=1)
I get the values appended to the end of the dataframe rather than into the same columns under the same week naming. Maybe the issue is that df2 holds dates from 2021-06-28 to 2022-06-27 while df1 holds dates from 2020 to today.
(screenshot: the unwanted appending to the end of the dataframe)
The idea is to create a MultiIndex in both DataFrames:
date_mappings = {1: '2021-06-28',
                 2: '2021-07-05'}

# create a MultiIndex in df2 with datetimes in the first level
df2.columns = pd.MultiIndex.from_product([pd.to_datetime(df2.columns.map(date_mappings)),
                                          ['metric_3']])

# remove unused levels (here category) so the first level can be converted to datetimes
df1.columns = df1.columns.remove_unused_levels()
df1.columns = df1.columns.set_levels(pd.to_datetime(df1.columns.levels[0]), level=0)

# join together and sort the MultiIndex
df = df1.join(df2).sort_index(axis=1)
print (df)
2021-06-28 2021-07-05 \
metric_1 metric_2 metric_3 metric_1 metric_2
category
A 4120.550000 9907.79 36234.035578 5817.320000 6321.21
B 11226.016667 7614.65 69078.070892 10799.450000 5606.01
C 25049.443333 13775.26 128879.539752 23521.510000 10239.69
D 18261.083333 13158.25 178376.635369 22062.350833 17476.60
E 2553.120833 1457.85 9293.956915 1249.597500 943.72
F 2843.010000 1089.56 8184.780211 3229.770000 1410.33
G 73203.513333 38864.90 177480.745403 52796.060833 29645.45
metric_3
category
A 37887.581678
B 72243.679562
C 134803.023421
D 186603.896317
E 9716.385738
F 8555.606694
G 185658.875780
If you need to remove datetimes greater than the maximal df1 datetime, use:
# change the mapping for testing
date_mappings = {1: '2021-06-28',
                 2: '2022-07-05'}

df2.columns = pd.MultiIndex.from_product([pd.to_datetime(df2.columns.map(date_mappings)),
                                          ['metric_3']])
df1.columns = df1.columns.remove_unused_levels()
df1.columns = df1.columns.set_levels(pd.to_datetime(df1.columns.levels[0]), level=0)

df2 = df2.loc[:, df2.columns.get_level_values(0) <= df1.columns.get_level_values(0).max()]
print (df2)
print (df2)
2021-06-28
metric_3
category
A 36234.035578
B 69078.070892
C 128879.539752
D 178376.635369
E 9293.956915
F 8184.780211
G 177480.745403
# join together and sort the MultiIndex
df = df1.join(df2).sort_index(axis=1)
print (df)
2021-06-28 2021-07-05
metric_1 metric_2 metric_3 metric_1 metric_2
category
A 4120.550000 9907.79 36234.035578 5817.320000 6321.21
B 11226.016667 7614.65 69078.070892 10799.450000 5606.01
C 25049.443333 13775.26 128879.539752 23521.510000 10239.69
D 18261.083333 13158.25 178376.635369 22062.350833 17476.60
E 2553.120833 1457.85 9293.956915 1249.597500 943.72
F 2843.010000 1089.56 8184.780211 3229.770000 1410.33
G 73203.513333 38864.90 177480.745403 52796.060833 29645.45
Use pd.DataFrame.reindex + pd.DataFrame.join.
reindex has a convenient level parameter that allows you to expand onto the index levels that are not present:
df1.join(df2.reindex(df1.index, level=0))
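To illustrate what reindex with level= does in general, here is a toy sketch (the frames and names below are illustrative, not the question's data): a flat-indexed frame is broadcast onto a two-level MultiIndex by matching the first level.

import pandas as pd

idx = pd.MultiIndex.from_product([['A', 'B'], [1, 2]], names=['cat', 'n'])
left = pd.DataFrame({'x': range(4)}, index=idx)
right = pd.DataFrame({'y': [10, 20]}, index=pd.Index(['A', 'B'], name='cat'))

# right has one row per 'cat'; reindex repeats it for every (cat, n) pair
print(left.join(right.reindex(left.index, level=0)))
#        x   y
# cat n
# A   1  0  10
#     2  1  10
# B   1  2  20
#     2  3  20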
I am not sure if this is what you want, but you might need to_frame: unstack() on a DataFrame returns a Series, and merge cannot use a Series without a name, so convert it to a one-column frame first:
f = lambda x: pd.to_datetime(x)
df = (df2.merge(df1.unstack().to_frame(), left_index=True, right_index=True)
         .sort_index(level=0))
print(df)

How to drop rows of dataframe inside of the function python

I have a dataframe and I want to drop some of its rows inside of a function:
def IncomeToGo(dataframe, mainCatName):
    for k in dataframe.name:
        if mainCatName in k:
            dataframe = dataframe.drop(dataframe.loc[dataframe.name == k].index)
This is how I use the function:
print(len(df1))  # len = 21
IncomeToGo(df1, 'Apple')
print(len(df1))  # len = 21
but the drop doesn't do anything and nothing is removed from my dataframe.
IIUC, here's one way:
def IncomeToGo(dataframe, mainCatName):
    return dataframe[dataframe.name.ne(mainCatName)]
Example:
Initial df:
name menu
0 A cheese
1 A cake
2 A sausage
3 B chicken
4 B cake
5 B water
6 C chicken
7 C sausage
8 C water
9 D water
10 D cheese
11 D sausage
df = pd.DataFrame({'name': {0: 'A', 1: 'A', 2: 'A', 3: 'B', 4: 'B', 5: 'B',
                            6: 'C', 7: 'C', 8: 'C', 9: 'D', 10: 'D', 11: 'D'},
                   'menu': {0: 'cheese', 1: 'cake', 2: 'sausage', 3: 'chicken',
                            4: 'cake', 5: 'water', 6: 'chicken', 7: 'sausage',
                            8: 'water', 9: 'water', 10: 'cheese', 11: 'sausage'}})

def IncomeToGo(dataframe, mainCatName):
    return dataframe[dataframe.name.ne(mainCatName)]

IncomeToGo(df, 'A')
OUTPUT df:
name menu
3 B chicken
4 B cake
5 B water
6 C chicken
7 C sausage
8 C water
9 D water
10 D cheese
11 D sausage
You have 2 errors in your code:
You don't return anything from your function.
You remove rows from the dataframe whose column you are looping through, which is bad practice.
Try just filtering those rows out:
def IncomeToGo(dataframe, mainCatName):
    return dataframe[dataframe.name != mainCatName]
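With either version the caller has to capture the return value, since the original frame is not modified in place. A hypothetical usage matching the question:

df1 = IncomeToGo(df1, 'Apple')  # reassign; IncomeToGo returns a new frame
print(len(df1))                 # now reflects the dropped rows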
You can try something like the below if you want to use a for loop over the column values:
def IncomeToGo(dataframe, mainCatName):
    for k in dataframe.name.unique():
        if mainCatName == k:
            dataframe = dataframe.loc[dataframe.name != mainCatName].copy()
    return dataframe
I would advise not to hardcode column names in the function. Write it in such a way that the function can be reused dynamically in multiple places, as in the sketch below.
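A minimal generic version of that advice (the function and parameter names here are illustrative, not from the thread):

def drop_rows(dataframe, column, value):
    # return a copy without the rows where `column` equals `value`
    return dataframe[dataframe[column] != value].copy()

df1 = drop_rows(df1, 'name', 'Apple')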

Increasing a value during merge in pandas

I have 2 dataframes
df1
product_id value name
abc 10 a
def 20 b
ggg 10 c
df2, which I get after using df2.groupby(['prod_id'])['code'].count().reset_index():
prod_id code
abc 10
def 20
ggg 10
ooo 5
hhh 1
I want to merge values from df2 to df1 left on product_id, right on prod_id.
To get:
product_id value name
abc 20 a
def 40 b
ggg 20 c
I tried:
pd.merge(df1, df2.groupby(['prod_id'])['code'].count().reset_index(),
         left_on='product_id', right_on='prod_id', how='left')
This returns df1 with 2 additional columns, prod_id and code, where code holds the amount by which I would like to increase value in df1. I could now just add the two value columns together, but I would like to avoid that.
Here's one alternative:
df1['value'] = df1.product_id.map(dict(df2.values)).fillna(0).add(df1.value)
Complete example:
df1 = pd.DataFrame({'product_id': {0: 'abc', 1: 'def', 2: 'ggg'},
                    'value': {0: 10, 1: 20, 2: 10},
                    'name': {0: 'a', 1: 'b', 2: 'c'}})
df2 = pd.DataFrame({'prod_id': {0: 'abc', 1: 'def', 2: 'ggg', 3: 'ooo', 4: 'hhh'},
                    'code': {0: 10, 1: 20, 2: 10, 3: 5, 4: 1}})

# dict(df2.values) builds the prod_id -> code mapping
# (this assumes df2 has exactly those two columns, in that order)
df1['value'] = df1.product_id.map(dict(df2.values)).fillna(0).add(df1.value)
OUTPUT:
product_id value name
0 abc 20 a
1 def 40 b
2 ggg 20 c
You could use reindex on df2 with the order of df1's product_id, after the groupby count (without the reset_index), like:
df1['value'] += (
    df2.groupby(['prod_id'])
       ['code'].count()
       .reindex(df1['product_id'], fill_value=0)
       .to_numpy()
)
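Note the .to_numpy() at the end: it strips the prod_id index off the reindexed counts, so the addition aligns positionally with df1 rather than by index label (aligning by label would produce NaN everywhere here).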

Merging multiple dataframe lines into aggregate lines

For the following dataframe:
import numpy as np
import pandas as pd

df = pd.DataFrame({'Name': {0: 'A', 1: 'A', 2: 'A', 3: 'B'},
                   'Spec1': {0: '1', 1: '3', 2: '5', 3: '1'},
                   'Spec2': {0: '2a', 1: np.nan, 2: np.nan, 3: np.nan}},
                  columns=['Name', 'Spec1', 'Spec2'])
Name Spec1 Spec2
0 A 1 2a
1 A 3 NaN
2 A 5 NaN
3 B 1 NaN
I would like to aggregate the columns into:
Name Spec
0 A 1,3,5,2a
1 B 1
Is there a more "pandas" way of doing this than just looping and keeping track of the values?
Or using melt:
(df.melt('Name')
   .groupby('Name').value
   .apply(lambda x: ','.join(pd.Series(x).dropna()))
   .reset_index()
   .rename(columns={'value': 'spec'}))
Out[2226]:
Name spec
0 A 1,3,5,2a
1 B 1
Another way:
In [966]: (df.set_index('Name').unstack()
              .dropna().reset_index()
              .groupby('Name')[0].apply(','.join))
Out[966]:
Name
A 1,3,5,2a
B 1
Name: 0, dtype: object
Group rows by name, combine column values as a list, dropping NaN:
df = df.groupby('Name').agg(lambda x: list(x.dropna()))
Spec1 Spec2
Name
A [1, 3, 5] [2a]
B [1] []
Now merge Spec1 and Spec2 lists. Bring Name back as a column. Name the new Spec column.
df = (df.Spec1 + df.Spec2).reset_index().rename(columns={0:"Spec"})
Name Spec
0 A [1, 3, 5, 2a]
1 B [1]
Finally, convert Spec lists to string representations:
df.Spec = df.Spec.apply(','.join)
Name Spec
0 A 1,3,5,2a
1 B 1

Explode a row to multiple rows in pandas dataframe

I have a dataframe with the following header:
id, type1, ..., type10, location1, ..., location10
and I want to convert it as follows:
id, type, location
I managed to do this using nested for loops but it's very slow:
new_format_columns = ['ID', 'type', 'location']
new_format_dataframe = pd.DataFrame(columns=new_format_columns)
print(data.head())

new_index = 0
for index, row in data.iterrows():
    ID = row["ID"]
    if pd.isnull(row["type" + str(i)]):  # note: comparing with == np.nan is always False
        continue
    else:
        new_row = pd.Series([ID, row["type" + str(i)], row["location" + str(i)]])
        new_format_dataframe.loc[new_index] = new_row.values
        new_index += 1
Any suggestions for improvement using native pandas features?
You can use lreshape:
types = [col for col in df.columns if col.startswith('type')]
location = [col for col in df.columns if col.startswith('location')]
print(pd.lreshape(df, {'Type': types, 'Location': location}, dropna=False))
Sample:
import pandas as pd

df = pd.DataFrame({'type1': {0: 1, 1: 4},
                   'id': {0: 'a', 1: 'a'},
                   'type10': {0: 1, 1: 8},
                   'location1': {0: 2, 1: 9},
                   'location10': {0: 5, 1: 7}})
print (df)
id location1 location10 type1 type10
0 a 2 5 1 1
1 a 9 7 4 8
types = [col for col in df.columns if col.startswith('type')]
location = [col for col in df.columns if col.startswith('location')]
print(pd.lreshape(df, {'Type': types, 'Location': location}, dropna=False))
id Location Type
0 a 2 1
1 a 9 4
2 a 5 1
3 a 7 8
Another solution with a double melt:
print(pd.concat([pd.melt(df, id_vars='id', value_vars=types, value_name='type'),
                 pd.melt(df, value_vars=location, value_name='Location')], axis=1)
        .drop('variable', axis=1))
id type Location
0 a 1 2
1 a 4 9
2 a 1 5
3 a 8 7
EDIT:
lreshape is now undocumented, and it is possible it will be removed in the future (along with pd.wide_to_long).
A possible solution is merging all 3 functions into one - probably melt - but that is not implemented yet. Maybe in some new version of pandas; then my answer will be updated.
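In the meantime, pd.wide_to_long still works for this shape of data. A sketch, assuming the id column uniquely identifies rows (wide_to_long requires that) and the suffixes are numeric:

import pandas as pd

df = pd.DataFrame({'id': ['a', 'b'],
                   'type1': [1, 4], 'type10': [1, 8],
                   'location1': [2, 9], 'location10': [5, 7]})

long_df = (pd.wide_to_long(df, stubnames=['type', 'location'], i='id', j='num')
             .reset_index()
             .drop(columns='num'))
print(long_df)  # columns: id, type, location - one row per (id, suffix) pair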
