Pandas merging value of two rows in columns of a single row - python

I have data like this; it's the output of a groupby:
numUsers = df.groupby(["user","isvalid"]).count()
              count
user isvalid
5    0.0       1336
     1.0        387
But I need the counts as count_valid and count_invalid columns for each user, like this:
      count_valid  count_invalid
user
5             387           1336
How can I do this in an optimized way in Pandas?

You can use:
out = (df.groupby(["user","isvalid"]).count()
         .rename({0: 'count_invalid', 1: 'count_valid'}, level=1)
         ['count'].unstack()
       )
Output:
isvalid  count_invalid  count_valid
user
5                 1336          387
Or, more generic if you have multiple columns, using a MultiIndex:
out = (df.groupby(["user","isvalid"]).count()
         .unstack().rename(columns={0: 'invalid', 1: 'valid'}, level=1)
       )
out.columns = out.columns.map('_'.join)
Output:
      count_invalid  count_valid
user
5              1336          387
Or from the original dataset with a crosstab:
pd.crosstab(df['user'], df['isvalid'].map({0: 'count_invalid', 1: 'count_valid'}))
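For reference, a minimal reproduction sketch of the crosstab approach; the raw rows below are hypothetical (only the user/isvalid layout is taken from the question), so the counts differ from 1336/387:

import pandas as pd

# Hypothetical raw data with the question's layout: one row per event, isvalid is 0.0/1.0
df = pd.DataFrame({"user": [5, 5, 5, 5, 5],
                   "isvalid": [0.0, 0.0, 0.0, 1.0, 1.0]})

out = pd.crosstab(df["user"], df["isvalid"].map({0: "count_invalid", 1: "count_valid"}))
# isvalid  count_invalid  count_valid
# user
# 5                    3            2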

You can replace the groupby + count with value_counts:
>>> (df.replace({'isvalid': {0: 'count_invalid', 1: 'count_valid'}})
       .value_counts(['user', 'isvalid']).unstack('isvalid')
       .rename_axis(columns=None))
      count_invalid  count_valid
user
5              1336          387
Another version with pivot_table:
>>> (df.replace({'isvalid': {0: 'count_invalid', 1: 'count_valid'}}).assign(count=1)
       .pivot_table(index='user', columns='isvalid', values='count', aggfunc='count')
       .rename_axis(columns=None))
      count_invalid  count_valid
user
5              1336          387

Related

Pandas Groupby and generate "duplicate" columns for each groupby value

I have a vertical data frame that I am looking to make more horizontal by "duplicating" columns for each item in the groupby column.
I have the following data frame:
pd.DataFrame({'posteam': {0: 'ARI', 1: 'ARI', 2: 'ARI', 3: 'ARI', 4: 'ARI'},
'offense_grouping': {0: 'personnel_00',
1: 'personnel_01',
2: 'personnel_02',
3: 'personnel_10',
4: 'personnel_11'},
'snap_ct': {0: 1, 1: 6, 2: 4, 3: 396, 4: 1441},
'personnel_epa': {0: 0.1539720594882965,
1: 0.7805194854736328,
2: -0.2678736448287964,
3: 0.1886662095785141,
4: 0.005721719935536385}})
And in its current state, there are 5 duplicate values in the 'posteam' column and 5 different values in the 'offense_grouping' column. Ideally, I would like to group by 'posteam' (so the team only has one row) and by 'offense_grouping'. Each 'offense_grouping' value corresponds to 'snap_ct' and 'personnel_epa' values. I would like the end result of this grouping to be something like this:
posteam  personnel_00_snap_ct  personnel_00_personnel_epa  personnel_01_snap_ct  personnel_01_personnel_epa  personnel_02_snap_ct  personnel_02_personnel_epa
ARI                         1                    .1539...                     6                    .7805...                     4                      -.2679
And so on. How can this be achieved?
Given the data you provide, the following would give the expected result. But there might be more complex cases in your data.
z = (
    df
    .set_index(['posteam', 'offense_grouping'])
    .unstack('offense_grouping')
    .swaplevel(axis=1)
    .sort_index(axis=1, ascending=[True, False])
)
# or, alternatively (might be better if you have multiple values
# for some given indices/columns):
z = (
    df
    .pivot_table(index='posteam', columns='offense_grouping', values=['snap_ct', 'personnel_epa'])
    .swaplevel(axis=1)
    .sort_index(axis=1, ascending=[True, False])
)
>>> z
offense_grouping personnel_00               personnel_01                \
                      snap_ct personnel_epa      snap_ct personnel_epa
posteam
ARI                         1      0.153972            6      0.780519

offense_grouping personnel_02               personnel_10                \
                      snap_ct personnel_epa      snap_ct personnel_epa
posteam
ARI                         4     -0.267874          396      0.188666

offense_grouping personnel_11
                      snap_ct personnel_epa
posteam
ARI                      1441      0.005722
Then you can join the two levels of columns:
res = z.set_axis([f'{b}_{a}' for a, b in z.columns], axis=1)
>>> res
snap_ct_personnel_00 personnel_epa_personnel_00 snap_ct_personnel_01 personnel_epa_personnel_01 snap_ct_personnel_02 personnel_epa_personnel_02 snap_ct_personnel_10 personnel_epa_personnel_10 snap_ct_personnel_11 personnel_epa_personnel_11
posteam
ARI 1 0.153972 6 0.780519 4 -0.267874 396 0.188666 1441 0.005722
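If you prefer the personnel_00_snap_ct naming from the question (grouping first, metric second) rather than snap_ct_personnel_00, a small variation of the same set_axis line (using the same z as above) would be:

res = z.set_axis([f'{a}_{b}' for a, b in z.columns], axis=1)
# -> personnel_00_snap_ct, personnel_00_personnel_epa, personnel_01_snap_ct, ...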

Pandas assign a column value basis previous row different column value

I have a df like this:
and the resultDF I want needs to be like this:
So, except for the first row, I want the Supply value to be the previous row's Available value added to the current Supply value, and then Available is Supply minus Order. E.g. for row 3 in resultDF, the Supply value (2306) is generated by adding the Available value from row 2 of resultDF (145) and the Supply value from row 3 of df (2161). Then the Available value is simply calculated as Supply - Order. Can anyone help me with how to generate resultDF?
Use cumsum:
df["Available"] = df["Supply"].cumsum() - df["Order"].cumsum()
df["Supply"] = df["Available"] + df["Order"]
>>> df
product Month Order Supply Available
0 xx-xxx 202107 718 1531.0 813.0
1 None 202108 668 813.0 145.0
2 None 202109 5030 2306.0 -2724.0
3 None 202110 667 -2724.0 -3391.0
Use cumsum to compute the right values.
Assuming:
- you want to fix your rows per product
- your rows are already ordered by (product, month)
# Setup
data = {'Product': ['xx-xxx', 'xx-xxx', 'xx-xxx', 'xx-xxx'],
'Month': [202107, 202108, 202109, 202110],
'Order': [718, 668, 5030, 667],
'Supply': [1531, 0, 2161, 0],
'Available': [813, -668, -2869, -667]}
df = pd.DataFrame(data)
df[['Supply', 'Available']] = df.groupby('Product').apply(lambda x: \
    pd.DataFrame({'Supply': x['Order'] + x['Supply'].cumsum() - x['Order'].cumsum(),
                  'Available': x['Supply'].cumsum() - x['Order'].cumsum()}))
Output:
>>> df
Product Month Order Supply Available
0 xx-xxx 202107 718 1531 813
1 xx-xxx 202108 668 813 145
2 xx-xxx 202109 5030 2306 -2724
3 xx-xxx 202110 667 -2724 -3391
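If there are several products, a minimal sketch of the same cumsum idea applied per group (using the column names from the setup above, and assuming rows are already ordered by month within each product):

# Sketch: per-product version of the cumsum approach above
g = df.groupby('Product')
df['Available'] = g['Supply'].cumsum() - g['Order'].cumsum()
df['Supply'] = df['Available'] + df['Order']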

Pandas - select all rows between two values when a string is a match

I have two DataFrames:
import pandas as pd
import numpy as np
d = {'fruit': ['apple', 'pear', 'peach'] * 5, 'values': np.random.randint(0,1000,15)}
df = pd.DataFrame(data=d)
d2 = {'fruit': ['apple', 'pear', 'peach'] * 2, 'min': [43, 196, 143, 174, 510, 450], 'max': [120, 310, 311, 563, 549, 582]}
df2 = pd.DataFrame(data=d2)
I'd like to select all the rows in df with matching fruit to df2 and values between min and max.
I'm trying something like:
df.loc[df['fruit'].isin(df2['fruit'])].loc[df['values'].between(df2['min'], df2['max'])]
But predictably this is returning a ValueError: Can only compare identically-labeled Series objects.
EDIT: you'll notice that fruit is repeated in df2. This is intentional. I still am trying to grab the rows between min and max as above, but I don't just want to collapse the fruits and take the rows between the absolute min and max.
For example, in df1 where fruit == 'apple' I'd like all the rows with values between 43-120 and 174-563.
df3 = df.merge(df2, on='fruit', how='inner')  # thanks to Henry Ecker for suggesting the inner join
df3 = df3.loc[(df3['min'] < df3['values']) & (df3['max'] > df3['values'])]
df3
Output
fruit values min max
3 apple 883 467 947
6 apple 805 467 947
9 apple 932 467 947
11 peach 331 307 618
12 apple 665 467 947
If we don't want the min and max columns in the output:
df3 = df3.drop(columns=['min', 'max'])
df3
Output
fruit values
3 apple 883
6 apple 805
9 apple 932
11 peach 331
12 apple 665
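The same idea can also be written as one chain with Series.between (strict bounds, matching the comparison above; inclusive='neither' needs pandas >= 1.3). Note that if df2 has overlapping ranges for the same fruit, a df row can appear once per matching range:

out = (df.merge(df2, on='fruit', how='inner')
         .loc[lambda d: d['values'].between(d['min'], d['max'], inclusive='neither')]
         .drop(columns=['min', 'max']))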

Calculate and add up Data from a reference dataframe

I have two pandas dataframes. The first one contains some data I want to multiply by the second dataframe, which is a reference table.
So in my example I want to get a new column in df1 for every column in my reference table, where the products are summed up for each row.
Like this (Index 205368421 with R21 17): (1205 * 0.526499) + (7562 * 0.003115) + (1332 * 0.000267) = 658
In Excel VBA I iterated through both tables and did it that way, but it took very long. I've read that pandas is way better for this without iterating.
df1 = pd.DataFrame({'Index': ['205368421', '206321177','202574796','200212811', '204376114'],
'L1.09A': [1205,1253,1852,1452,1653],
'L1.10A': [7562,7400,5700,4586,4393],
'L1.10C': [1332, 0, 700,1180,290]})
df2 = pd.DataFrame({'WorkerID': ['L1.09A', 'L1.10A', 'L1.10C'],
'R21 17': [0.526499,0.003115,0.000267],
'R21 26': [0.458956,0,0.001819]})
Index      L1.09A  L1.10A  L1.10C
205368421    1205    7562    1332
206321177    1253    7400       0
202574796    1852    5700     700
200212811    1452    4586    1180
204376114    1653    4393     290

WorkerID    R21 17    R21 26
L1.09A    0.526499  0.458956
L1.10A    0.003115  0
L1.10C    0.000267  0.001819
I want this:
Index      L1.09A  L1.10A  L1.10C  R21 17  R21 26
205368421    1205    7562    1332     658     555
206321177    1253    7400       0     683     575
202574796    1852    5700     700     993     851
200212811    1452    4586    1180     779     669
204376114    1653    4393     290     884     759
I would be okay with some hints. Someone told me this might be matrix multiplication, so .dot() would be helpful. Is this the right direction?
Edit:
I have now done the following:
df1 = df1.set_index('Index')
df2 = df2.set_index('WorkerID')
common_cols = list(set(df1.columns).intersection(df2.index))
df2 = df2.loc[common_cols]
df1_sorted = df1.reindex(sorted(df1.columns), axis=1)
df2_sorted = df2.sort_index(axis=0)
df_multiplied = df1_sorted @ df2_sorted
This works with my example dataframes, but not with my real dataframes.
My real ones have these dimensions: df1_sorted (10429, 69) and df2_sorted (69, 18).
It should work, but my df_multiplied is full of NaN.
Alright, I did it!
I had to replace all NaN values with 0.
So the final solution is:
df1 = df1.set_index('Index')
df2 = df2.set_index('WorkerID')
common_cols = list(set(df1.columns).intersection(df2.index))
df2 = df2.loc[common_cols]
df1_sorted = df1.reindex(sorted(df1.columns), axis=1)
df2_sorted = df2.sort_index(axis=0)
df1_sorted = df1_sorted.fillna(0)
df2_sorted = df2_sorted.fillna(0)
df_multiplied = df1_sorted @ df2_sorted
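For the .dot() question itself, a minimal sketch with the example df1/df2 from above (DataFrame.dot aligns the left frame's columns with the right frame's index by label, so the label sets must match; the common_cols intersection above handles extra reference rows):

a = df1.set_index('Index').fillna(0)     # worker columns, NaN filled as in the final solution
b = df2.set_index('WorkerID').fillna(0)  # reference rates
result = a.join(a.dot(b).round(0))       # products summed per row, appended as R21 columns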

ValueError: You are trying to merge on object and float64 columns. If you wish to proceed you should use pd.concat

I want to find the FRAUD rate for each feature separately and then replace the feature's values with those rates.
For example, with my sample data below, I first want to find the fraud rate of Model like STEP1, then replace the Model values with it like STEP2.
My code to find these values is below, but it is not working. The error is also below. Can someone help me?
for i in df2_a.columns:
    grp1 = df2.groupby(i, as_index=False, sort=True, group_keys=True)[['EXT_REFERENCE']].count()
    df3 = df2[df2.FRAUD == 0]
    grp2 = df3.groupby(i, as_index=False, sort=True, group_keys=True)[['EXT_REFERENCE']].count()
    df4 = df2[df2.FRAUD == 1]
    grp3 = df4.groupby(i, as_index=False, sort=True, group_keys=True)[['EXT_REFERENCE']].count()
    grp4 = grp1.merge(grp2, how='left', on=i)
    grp5 = grp4.merge(grp3, how='left', on=i)
    grp6 = grp5.fillna(0)
    grp6[i+'_New'] = grp5.EXT_REFERENCE / grp5.EXT_REFERENCE_x
    grp7 = grp6.fillna(0)
    grp8 = grp7.drop(['EXT_REFERENCE', 'EXT_REFERENCE_x', 'EXT_REFERENCE_y'], axis=1)
    df5 = pd.merge(df2_a, grp8, on=i, how='left')
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-39-225543878353> in <module>
6 grp3 = df4.groupby(i, as_index=False, sort=True, group_keys=True)[['EXT_REFERENCE']].count()
7 grp4 = grp1.merge(grp2, how = 'left', on=i )
----> 8 grp5 = grp4.merge(grp3, how = 'left', on=i )
9 grp6 = grp5.fillna(0)
10 grp6[i+'_New'] = grp5.EXT_REFERENCE / grp5.EXT_REFERENCE_x
/opt/anaconda/envs/env_python/lib/python3.6/site-packages/pandas/core/frame.py in merge(self, right,
how, on, left_on, right_on, left_index, right_index, sort, suffixes, copy, indicator, validate)
6866 right_on=right_on, left_index=left_index,
6867 right_index=right_index, sort=sort, suffixes=suffixes,
-> 6868 copy=copy, indicator=indicator, validate=validate)
6869
6870 def round(self, decimals=0, *args, **kwargs):
/opt/anaconda/envs/env_python/lib/python3.6/site-packages/pandas/core/reshape/merge.py in merge(left,
right, how, on, left_on, right_on, left_index, right_index, sort, suffixes, copy, indicator,
validate)
45 right_index=right_index, sort=sort, suffixes=suffixes,
46 copy=copy, indicator=indicator,
---> 47 validate=validate)
48 return op.get_result()
49
/opt/anaconda/envs/env_python/lib/python3.6/site-packages/pandas/core/reshape/merge.py in
__init__(self, left, right, how, on, left_on, right_on, axis, left_index, right_index, sort,
suffixes, copy, indicator, validate)
531 # validate the merge keys dtypes. We may need to coerce
532 # to avoid incompat dtypes
--> 533 self._maybe_coerce_merge_keys()
534
535 # If argument passed to validate,
/opt/anaconda/envs/env_python/lib/python3.6/site-packages/pandas/core/reshape/merge.py in
_maybe_coerce_merge_keys(self)
978 (inferred_right in string_types and
979 inferred_left not in string_types)):
--> 980 raise ValueError(msg)
981
982 # datetimelikes must match exactly
ValueError: You are trying to merge on object and float64 columns. If you wish to proceed you should
use pd.concat
EDIT
Thanks for your comments, but my sample has just one feature because it is an example. In my real data I have 120 features, so I am using a for loop in my code to do the calculation for all columns. Please don't consider just the two features (Model, Age) in the example below; think of the same calculation for 120 features.
Here you go:
import pandas as pd
df = pd.DataFrame({'Model': ['audi', 'audi', 'bmw', 'bmw', 'ford', 'ford'],'Age':[1,2,1,2,1,2] , 'Fraud': [1,1,0,0,1,0]})
# group df by Age
grouped_age = df.groupby('Age', as_index=False).mean()
merged_df = pd.merge(df, grouped_age, on=['Age'], how='inner')
df = merged_df.rename({'Age': 'x', 'Fraud_x': 'Fraud', 'Fraud_y':'Age'}, axis='columns')
df = df.drop('x', axis=1)
# group df by Model
grouped_df = df.groupby('Model', as_index=False).mean()
merged_df = pd.merge(df, grouped_df, on=['Model'], how='inner')
# some display corrections
df = merged_df.rename({'Model': 'x', 'Fraud_x': 'Fraud', 'Fraud_y':'Model', 'Age_x':'Age'}, axis='columns')
df = df.drop(['x', 'Age_y'], axis=1)
df = df[['Model', 'Age', 'Fraud']]
df['Model'] = df['Model'] * 100
df['Age'] = (df['Age'] * 100).round(0)
Output:
Model Age Fraud
0 100.0 67.0 1
1 100.0 33.0 1
2 0.0 67.0 0
3 0.0 33.0 0
4 50.0 67.0 1
5 50.0 33.0 0
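As for the ValueError in the title: it is raised because the merge key column has an object dtype in one frame and float64 in the other. A generic, hedged fix (sketched with the variable names from the question's loop) is to cast both key columns to a common dtype before merging:

# Sketch: make the merge key dtypes match before merging (i is the loop's column name)
grp3[i] = grp3[i].astype(str)
grp4[i] = grp4[i].astype(str)
grp5 = grp4.merge(grp3, how='left', on=i)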
I am not sure I understand your code, but here is how I would do it:
for col in df.iloc[:, :-1]:
    group_df = df.groupby(col).mean()*100
    df[col] = df[col].map(group_df['Fraud'])
Result
Model Age Fraud
0 100.0 66.666667 1
1 100.0 33.333333 1
2 0.0 66.666667 0
3 0.0 33.333333 0
4 50.0 66.666667 1
5 50.0 33.333333 0
It assumes the Fraud column is the last column.
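If the Fraud column is not guaranteed to be last, a small variation of the same idea (same column names as the sample data) selects the feature columns explicitly:

# Sketch: map each feature value to its fraud rate (in percent), whatever the column order
feature_cols = df.columns.difference(['Fraud'])
for col in feature_cols:
    rates = df.groupby(col)['Fraud'].mean() * 100
    df[col] = df[col].map(rates)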
