The dataset looks like this:
id col2 col3
0 1 1 123
1 1 1 234
2 1 0 345
3 2 1 456
4 2 0 1243
5 2 0 346
6 3 0 888
7 3 0 999
8 3 0 777
I would like to aggregate the data by id and append the values of col3 to a list, but only when the corresponding value in col2 is 1. Additionally, for ids that have only 0 in col2, I would like the aggregated value to be 0 for col2 and an empty list for col3.
Here is the current code:
import numpy as np
import pandas as pd

df_test = pd.DataFrame({'id':[1, 1, 1, 2, 2, 2, 3, 3, 3], 'col2':[1, 1, 0, 1, 0, 0, 0, 0, 0], 'col3':[123, 234, 345, 456, 1243, 346, 888, 999, 777]})
df_test_agg = pd.pivot_table(df_test, index=['id'], values=['col2', 'col3'], aggfunc={'col2': np.max, 'col3': lambda x: list(x)})
print (df_test_agg)
col2 col3
id
1 1 [123, 234, 345]
2 1 [456, 1243, 346]
3 0 [888, 999, 777]
The desired output should be (ideally in one-step in Pandas):
col2 col3
id
1 1 [123, 234]
2 1 [456]
3 0 []
///////////////////////////////////////////////////////////////////////////////////////
Edit - Trying out ColdSpeed's solution
df_test = pd.DataFrame({'id':[1, 1, 1, 2, 2, 2, 3, 3, 3], 'col2':[1, 1, 0, 1, 0, 0, 0, 0, 0], 'col3':[123, 234, 345, 456, 1243, 346, 888, 999, 777]})
print (df_test)
df_test_agg = (df_test.where(df_test.col2 > 0)
.assign(id=df_test.id)
.groupby('id')
.agg({'col2': 'max', 'col3': lambda x: x.dropna().tolist()}))
print (df_test_agg)
id col2 col3
0 1 1 123
1 1 1 234
2 1 0 345
3 2 1 456
4 2 0 1243
5 2 0 346
6 3 0 888
7 3 0 999
8 3 0 777
col2 col3
id
1 1.0 [123.0, 234.0]
2 1.0 [456.0]
3 NaN []
///////////////////////////////////////////////////////////////////////////////////////
Edited original post to present more scenarios.
You can filter beforehand, then use groupby:
df_test.query('col2 > 0').groupby('id').agg({'col2': 'max', 'col3': list})
col2 col3
id
1 1 [123, 234]
2 1 [456]
The caveat here is that if a group has only zeros, that group will be missing in the result. So, to fix that, you can mask with where:
(df_test.where(df_test.col2 > 0)
.assign(id=df_test.id)
.groupby('id')
.agg({'col2': 'max', 'col3': lambda x: x.dropna().tolist()}))
col2 col3
id
1 1.0 [123.0, 234.0]
2 1.0 [456.0]
To also keep ids whose col2 values are all 0 (aggregating col2 to 0 and col3 to an empty list), mask only col3 instead of the whole frame:
(df_test.assign(col3=df_test.col3.where(df_test.col2.astype(bool)))
 .groupby('id')
 .agg({'col2': 'max', 'col3': lambda x: x.dropna().astype(int).tolist()}))
col2 col3
id
1 1 [123, 234]
2 1 [456]
3 0 []
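A variation on the same idea, shown here only as a hedged sketch: mask col3 up front with Series.where and use named aggregation (this assumes pandas 0.25+; out is just an illustrative name):
out = (df_test.assign(col3=df_test['col3'].where(df_test['col2'] == 1))
       .groupby('id')
       .agg(col2=('col2', 'max'),
            col3=('col3', lambda s: s.dropna().astype(int).tolist())))
print (out)
    col2        col3
id
1      1  [123, 234]
2      1       [456]
3      0          []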
Related
I have a pandas dataframe with columns that, themselves, contain np.array. Imagine having something like this:
import random
import pandas as pd

df = pd.DataFrame(data=[[[random.randint(1,7) for _ in range(10)] for _ in range(5)]], index=["col1"])
df = df.transpose()
which will result in a dataframe like this:
col1
0 [7, 7, 6, 7, 6, 5, 5, 1, 7, 4]
1 [4, 7, 5, 5, 6, 6, 5, 4, 7, 5]
2 [7, 2, 7, 7, 2, 7, 6, 7, 1, 2]
3 [5, 7, 1, 2, 6, 5, 4, 3, 5, 2]
4 [2, 3, 2, 6, 3, 3, 1, 1, 7, 7]
I want to expand the dataframe to a dataframe with columns ["col1", ..., "col7"] and count, for each row, the number of occurrences of each value.
The desired result should be an extended dataframe, containing integer values only.
col1 col2 col3 col4 col5 col6 col7
0 1 0 0 1 2 2 4
1 0 0 0 2 3 2 2
2 1 3 0 0 0 1 5
My approach so far is pretty hard coded. I created col1, ..., col7 filled with 0 and then used iterrows() to count the occurrences. This works, but it is quite a lot of code and I am sure there is a more elegant way to do it. Maybe something with .value_counts() for each array in a row?
Maybe someone can help me find it. Thanks
from collections import Counter
import numpy as np
import pandas as pd

np.random.seed(2022)
df = pd.DataFrame(data=[[[np.random.randint(1,7) for _ in range(10)] for _ in range(5)]],
                  index=["col1"])
df = df.transpose()
You can use Series.explode with SeriesGroupBy.value_counts and reshape by Series.unstack:
df1 = (df['col1'].explode()
.groupby(level=0)
.value_counts()
.unstack(fill_value=0)
.add_prefix('col')
.rename_axis(None, axis=1))
print (df1)
col1 col2 col3 col4 col5 col6
0 4 2 1 0 1 2
1 3 2 0 4 0 1
2 3 1 3 2 0 1
3 1 1 3 0 1 4
4 1 1 1 1 3 3
Or use list comprehension with Counter and DataFrame constructor:
df1 = (pd.DataFrame([Counter(x) for x in df['col1']])
.sort_index(axis=1)
.fillna(0)
.astype(int)
.add_prefix('col'))
print (df1)
col1 col2 col3 col4 col5 col6
0 4 2 1 0 1 2
1 3 2 0 4 0 1
2 3 1 3 2 0 1
3 1 1 3 0 1 4
4 1 1 1 1 3 3
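If performance matters, a fully vectorized alternative with np.bincount is also possible. This is only a sketch under the assumption that every list has the same length and contains values in 1..7; arr and df1 are illustrative names:
import numpy as np
import pandas as pd

arr = np.vstack(df['col1'].to_numpy())                                   # shape (n_rows, 10)
counts = np.apply_along_axis(np.bincount, 1, arr, minlength=8)[:, 1:]    # drop the unused 0 bin
df1 = pd.DataFrame(counts, index=df.index,
                   columns=['col' + str(i) for i in range(1, 8)])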
Is there a way to convert MultiIndex levels into normal value columns? I have a multi-indexed table like this:
   level_0  level_1 Value
0        0        0     A
1        0        1     B
2        1        0     C
3        1        1     D
I want to convert level_0 and level_1 to normal columns:
ID  col0  col1 Value
 0     0     0     A
 1     0     1     B
 2     1     0     C
 3     1     1     D
Any suggestions?
Thank you!
You can use reset_index followed by rename.
# Setup
import pandas as pd
my_index = pd.MultiIndex.from_arrays([(0, 1, 2, 3),
(0, 0, 1, 1),
(0, 1, 0, 1)],
names=[None, 'level_0', 'level_1'])
df = pd.DataFrame({'Value': ['A', 'B', 'C', 'D']}, index=my_index)
>>> # level=['level_0', 'level_1'] works, too
>>> df = df.reset_index(level=[1, 2])
>>> df
level_0 level_1 Value
0 0 0 A
1 0 1 B
2 1 0 C
3 1 1 D
To rename the columns, you can do
>>> df.rename(columns={'level_0': 'col0', 'level_1': 'col1'})
col0 col1 Value
0 0 0 A
1 0 1 B
2 1 0 C
3 1 1 D
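Both steps can also be chained. As a hedged sketch, starting again from the original multi-indexed df (out is just an illustrative name), rename the index levels first and then reset them:
out = (df.rename_axis([None, 'col0', 'col1'])
         .reset_index(level=['col0', 'col1']))
This yields col0, col1 and Value directly, with the unnamed level kept as the index.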
In a dataframe, I want to iterate over the same-named columns (col1 to col14) and accumulate their sum until it exceeds the "val_n" value. I want 4 things:
1) exceed_when (the iteration at which the running sum exceeds the "val_n" value)
2) sum_col (the sum of the same-named columns)
3) At the exceed_when point, replace the corresponding column value with (col - (sum_col - val_n))
4) After the exceed_when point, replace the values of the remaining columns with 0.
Dataframe look like:
id col1 col2 col3 col4 col5 col6 col7 col8 col9 col10 col11 col12 col13 col14 val_n
1 350 350 350 350 350 350 350 350 350 350 0 0 0 0 3105.61
2 50 50 55 105 50 0 50 100 50 50 50 50 1025 1066.86 3185.6
3 0 0 0 0 0 3495.1 0 0 0 0 0 0 0 3495.1 3477.76
Required Dataframe:
id col1 col2 col3 col4 col5 col6 col7 col8 col9 col10 col11 col12 col13 col14 val_n exceed_when sum_col
1 350 350 350 350 350 350 350 350 305.61 0 0 0 0 0 3105.61 9 3500
2 50 50 55 105 50 0 50 100 50 50 50 50 1025 1066.86 3185.6 2751.86
3 0 0 0 0 0 3477.76 0 0 0 0 0 0 0 0 3477.76 6 6990.2
This is what I have tried:
def trans(row):
row['sum_col'] = 0
row['exceed_ind'] = 0
for i in range(1, 15):
row['sum_col'] += row['col' + str(i)]
if ((row['exceed_ind'] == 0) &
(row['sum_col'] >= row['val_n'])):
row['exceed_ind'] = 1
row['exceed_when'] = i
else:
continue
if row['exceed_when'] == i:
row['col' + str(i)] = (
row['col' + str(i)] - (
row['sum_col'] - row['val_n']))
elif row['exceed_when'] < i:
row['col' + str(i)] = 0
else:
row['col' + str(i)] = row['col' + str(i)]
return row
df1 = df.apply(trans, axis=1)
I am getting the right results for sum_col and exceed_when, but the condition elif row['exceed_when'] < i doesn't seem to work, and the expected 4th point (replacing the remaining column values with 0) is not applied. I am not sure what I am missing.
Code to generate the DataFrame:
import pandas as pd
df = pd.DataFrame({'id': [1, 2, 3],
'col1': [350, 50, 0],
'col2': [350, 50, 0],
'col3': [350, 55, 0],
'col4': [350, 105, 0],
'col5' : [350, 50, 0],
'col6': [350, 0, 3495.1],
'col7': [350, 50, 0],
'col8': [350, 100, 0],
'col9': [350, 50, 0],
'col10': [350, 50, 0],
'col11': [0, 50, 0],
'col12': [0, 50, 0],
'col13': [0, 1025, 0],
'col14': [0, 1066.86, 3495.1],
'val_n': [3105.61, 3185.6, 3477.76]
})
Thanks!
To my knowledge, the .apply function will only pass a copy of the row and all updates happen on the copy only, not the original DataFrame itself. In this case, you have to loop through the rows and update them using the index.
df['sum_col'] = 0
df['exceed_ind'] = 0
df['exceed_when'] = 0
for idx, row in df.iterrows():
sum_col = 0
exceed_ind = 0
exceed_when = 0
for i in range(1, 15):
sum_col += row['col' + str(i)]
if ((exceed_ind == 0) &
(sum_col >= row['val_n'])):
exceed_ind = 1
exceed_when = i
df.loc[idx, 'exceed_ind'] = exceed_ind
df.loc[idx, 'exceed_when'] = exceed_when
df.loc[idx, 'col' + str(i)] = (row['col' + str(i)] - (sum_col - row['val_n']))
elif (exceed_ind==1) & (exceed_when < i):
df.loc[idx, 'col' + str(i)] = 0
df.loc[idx, 'sum_col'] = sum_col
print(df)
Result:
col1 col2 col3 col4 col5 col6 col7 col8 col9 col10 col11 \
id
1 350 350 350 350 350 350.00 350 350 305.61 0 0
2 50 50 55 105 50 0.00 50 100 50.00 50 50
3 0 0 0 0 0 3477.76 0 0 0.00 0 0
col12 col13 col14 val_n sum_col exceed_ind exceed_when
id
1 0 0 0.00 3105.61 3500.00 1 9
2 50 1025 1066.86 3185.60 2751.86 0 0
3 0 0 0.00 3477.76 6990.20 1 6
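For larger frames, the row loop can be avoided entirely. The following is a hedged, vectorized sketch using NumPy cumulative sums; it assumes the column layout from the question and starts from the freshly constructed df (not the frame already modified by the loop above); names such as cols, csum and df_vec are only illustrative:
import numpy as np

cols = ['col' + str(i) for i in range(1, 15)]
vals = df[cols].to_numpy(dtype=float)
csum = vals.cumsum(axis=1)
val_n = df['val_n'].to_numpy()[:, None]

exceeded = csum >= val_n                      # True from the crossing column onwards
first = exceeded.argmax(axis=1)               # position of the crossing column
has_exceed = exceeded.any(axis=1)

out = np.where(exceeded, 0.0, vals)           # zero everything at and after the crossing
rows = np.flatnonzero(has_exceed)
# at the crossing column, keep only the remainder needed to reach val_n
out[rows, first[rows]] = vals[rows, first[rows]] - (csum[rows, first[rows]] - val_n[rows, 0])

df_vec = df.copy()
df_vec[cols] = out
df_vec['sum_col'] = csum[:, -1]
df_vec['exceed_when'] = np.where(has_exceed, first + 1, 0)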
Here's the problem. Imagine the following dataframe as an example:
df = pd.DataFrame({'col1': [1, 2, 3, 4, 5], 'col2': [3, 4, 5, 6, 7],'col3': [3, 4, 5, 6, 7],'col4': [1, 2, 3, 3, 2]})
Now, I would like to add another column "col 5" which is calculated as follows:
if the value of "col4" is 1, then give me the corresponding value in the column with index 1 (i.e. "col2" in this case), if "col4" is 2 give me the corresponding value in the column with index 2 (i.e. "col3" in this case), etc.
I have tried the below and variations of it, but I can't seem to get the right result
df["col5"] = df.apply(lambda x: df.iloc[x,df[df.columns[df["col4"]]]])
Any help is much appreciated!
If your 'col4' is the indicator of column index, this will work:
df['col5'] = df.apply(lambda x: x[df.columns[x['col4']]], axis=1)
df
# col1 col2 col3 col4 col5
#0 1 3 3 1 3
#1 2 4 4 2 4
#2 3 5 5 3 3
#3 4 6 6 3 3
#4 5 7 7 2 7
You can use fancy indexing with NumPy and avoid a Python-level loop altogether:
df['col5'] = df.iloc[:, :4].values[np.arange(df.shape[0]), df['col4']]
print(df)
col1 col2 col3 col4 col5
0 1 3 3 1 3
1 2 4 4 2 4
2 3 5 5 3 3
3 4 6 6 3 3
4 5 7 7 2 7
You should see significant performance benefits for larger dataframes:
df = pd.concat([df]*10**4, ignore_index=True)
%timeit df.apply(lambda x: x[df.columns[x['col4']]], axis=1) # 2.36 s per loop
%timeit df.iloc[:, :4].values[np.arange(df.shape[0]), df['col4']] # 1.01 ms per loop
I have a data frame that looks like this:
df = pd.DataFrame({"value": [4, 5, 3], "item1": [0, 1, 0], "item2": [1, 0, 0], "item3": [0, 0, 1]})
df
value item1 item2 item3
0 4 0 1 0
1 5 1 0 0
2 3 0 0 1
Basically what I want to do is replace the value of the one hot encoded elements with the value from the "value" column and then delete the "value" column. The resulting data frame should be like this:
df_out = pd.DataFrame({"item1": [0, 5, 0], "item2": [4, 0, 0], "item3": [0, 0, 3]})
item1 item2 item3
0 0 4 0
1 5 0 0
2 0 0 3
Why not just multiply?
df.mul(df.pop('value'), axis=0)
   item1  item2  item3
0      0      4      0
1      5      0      0
2      0      0      3
DataFrame.pop has the nice effect of in-place removing and returning a column, so you can do this in a single step.
If the "item" columns can contain values other than 1, convert them to bools first so only the hot positions keep the value:
value = df.pop('value')
df.astype(bool).mul(value, axis=0)
   item1  item2  item3
0      0      4      0
1      5      0      0
2      0      0      3
If your DataFrame has other columns, then do this:
df
value name item1 item2 item3
0 4 John 0 1 0
1 5 Mike 1 0 0
2 3 Stan 0 0 1
# cols = df.columns[df.columns.str.startswith('item')]
cols = df.filter(like='item').columns
df[cols] = df[cols].mul(df.pop('value'), axis=0)
df
name item1 item2 item3
0  John      0      4      0
1  Mike      5      0      0
2 Stan 0 0 3
You could do something like:
df = pd.DataFrame([df['value']*df['item1'], df['value']*df['item2'], df['value']*df['item3']]).T
df.columns = ['item1','item2','item3']
EDIT:
As #coldspeed comments, this answer will not scale well to many columns, so it is better done by iterating over the columns in a loop:
cols = ['item1','item2','item3']
for c in cols:
df[c] *= df['value']
df.drop('value',axis=1,inplace=True)
You need:
col = ['item1','item2','item3']
for c in col:
df[c] = df[c] * df['value']
df.drop(['value'], axis=1, inplace=True)
pd.DataFrame.mul
You can use mul, or equivalently multiply, either using labels or integer positional indexing:
# label-based indexing
res = df.filter(regex='^item').mul(df['value'], axis='index')
# integer positional indexing
res = df.iloc[:, 1:].mul(df.iloc[:, 0], axis='index')
print(res)
# item1 item2 item3
# 0 0 4 0
# 1 5 0 0
# 2 0 0 3
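If you also want to drop "value" from df in the same step, as the question asks, one hedged option building on the snippet above is to select only the item columns and rebind df:
df = df.filter(regex='^item').mul(df['value'], axis='index')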