pandas - split column with arrays into multiple columns and count values - python

I have a pandas DataFrame with columns that, themselves, contain arrays (np.array). Imagine having something like this:
import random
import pandas as pd

df = pd.DataFrame(data=[[[random.randint(1, 7) for _ in range(10)] for _ in range(5)]],
                  index=["col1"])
df = df.transpose()
which will result in a dataframe like this:
col1
0 [7, 7, 6, 7, 6, 5, 5, 1, 7, 4]
1 [4, 7, 5, 5, 6, 6, 5, 4, 7, 5]
2 [7, 2, 7, 7, 2, 7, 6, 7, 1, 2]
3 [5, 7, 1, 2, 6, 5, 4, 3, 5, 2]
4 [2, 3, 2, 6, 3, 3, 1, 1, 7, 7]
I want to expand the DataFrame into one with columns ["col1", ..., "col7"] that counts, for each row, the number of occurrences of each value.
The desired result should be an extended dataframe, containing integer values only.
col1 col2 col3 col4 col5 col6 col7
0 1 0 0 1 2 2 4
1 0 0 0 2 4 2 2
2 1 3 0 0 0 1 5
My approach so far is pretty hard-coded: I create col1, ..., col7 filled with 0 and then use iterrows() to count the occurrences. This works, but it's quite a lot of code and I'm sure there is a more elegant way to do it. Maybe something with .value_counts() for each array in a row?
Maybe someone can help me find it. Thanks.
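For reference, the per-row .value_counts() idea can be sketched like this (the counts name and the 1..7 range are my own assumptions):
# count each row's array with pd.Series.value_counts, then align to values 1..7
counts = pd.DataFrame(
    {i: pd.Series(a).value_counts() for i, a in df['col1'].items()}
).T
counts = (counts.reindex(columns=range(1, 8))  # make sure columns 1..7 all exist
                .fillna(0)
                .astype(int)
                .add_prefix('col'))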

import numpy as np
import pandas as pd
from collections import Counter

np.random.seed(2022)
# note: np.random.randint(1, 7) draws values 1..6 (the upper bound is exclusive),
# which is why the output below only has col1..col6
df = pd.DataFrame(data=[[[np.random.randint(1, 7) for _ in range(10)] for _ in range(5)]],
                  index=["col1"])
df = df.transpose()
You can use Series.explode with SeriesGroupBy.value_counts and reshape by Series.unstack:
df1 = (df['col1'].explode()
                 .groupby(level=0)
                 .value_counts()
                 .unstack(fill_value=0)
                 .add_prefix('col')
                 .rename_axis(None, axis=1))
print(df1)
col1 col2 col3 col4 col5 col6
0 4 2 1 0 1 2
1 3 2 0 4 0 1
2 3 1 3 2 0 1
3 1 1 3 0 1 4
4 1 1 1 1 3 3
Or use list comprehension with Counter and DataFrame constructor:
df1 = (pd.DataFrame([Counter(x) for x in df['col1']])
         .sort_index(axis=1)
         .fillna(0)
         .astype(int)
         .add_prefix('col'))
print(df1)
col1 col2 col3 col4 col5 col6
0 4 2 1 0 1 2
1 3 2 0 4 0 1
2 3 1 3 2 0 1
3 1 1 3 0 1 4
4 1 1 1 1 3 3
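For completeness, pd.crosstab on the exploded Series produces the same table; this is a variation of mine, not part of the answer above:
s = df['col1'].explode()
df1 = (pd.crosstab(s.index, s)        # rows = original index, columns = values 1..6
         .add_prefix('col')
         .rename_axis(None, axis=0)
         .rename_axis(None, axis=1))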

Related

How to remove duplicate rows with a condition in pandas

I want to drop duplicate pairs, using col1 and col2 as the subset, only if the values in col3 are opposite (one negative and one positive). It's similar to the drop_duplicates function, but I want to impose a condition and only remove the first pair (i.e. if there are 3 duplicates, just remove 2 and leave 1).
my dataset (df):
col1 col2 col3
0 1 1 1
1 2 2 2
2 1 1 1
3 3 5 7
4 1 2 -1
5 1 2 1
6 1 2 1
I want:
col1 col2 col3
0 1 1 1
1 2 2 2
2 1 1 1
3 3 5 7
6 1 2 1
Rows 4 and 5 are duplicated in col1 and col2 but their col3 values are opposite, therefore we remove both. Rows 0 and 2 have duplicate values in col1 and col2 but the same col3, so we don't remove those rows.
I've tried using drop_duplicates but realised it wouldn't work, as it only removes all duplicates and doesn't consider anything else.
We can do this with transform:
out = df[df.groupby(['col1','col2']).col3.transform('sum').ne(0) & df.col3.ne(0)]
Out[252]:
col1 col2 col3
0 1 1 1
1 2 2 2
2 1 1 1
3 3 5 7
Recreating the dataset:
import pandas as pd
data = [
    [1, 1, 1],
    [2, 2, 2],
    [1, 1, 1],
    [3, 5, 7],
    [1, 2, -1],
    [1, 2, 1],
    [1, 2, 1],
]
df = pd.DataFrame(data, columns=['col1', 'col2', 'col3'])
If your data is not massive, you can use iterrows() on a subset of the data.
The subset contains all duplicate rows after all values have been turned into absolute values.
Next, we check if col3 is negative and if the opposite of col3 is in the duplicate subset.
If so, we drop the row from df.
df_dupes = df[df.abs().duplicated(keep=False)]
df_dupes_list = df_dupes.to_numpy().tolist()
for i, row in df_dupes.iterrows():
    if row.col3 < 0 and [row.col1, row.col2, -row.col3] in df_dupes_list:
        df.drop(labels=i, axis=0, inplace=True)
This code should remove row 4.
In your desired output, you left out row 5 for some reason.
If you can explain why you dropped row 5 but kept row 0, then I can adjust my code to match your desired output more accurately.
I used @Petar Luketina's code here with an adjustment and it worked. However, I would like to use it on a massive dataset (1 million rows and 43 columns), and this code takes forever:
df_dupes = df[df['col3'].abs().duplicated(keep=False)]
df_dupes_list = df_dupes.to_numpy().tolist()
for i, row in df_dupes.iterrows():
    if row.col3 < 0 and [row.col1, row.col2, -row.col3] in df_dupes_list:
        print(row.col3)
        try:
            c = np.where((df['col1'] == row.col1) & (df['col2'] == row.col2) &
                         (df['col3'] == -row.col3))[0][0]
            df.drop(labels=[i, df.index.values[c]], axis=0, inplace=True)
        except:
            pass
I know this is an old question, but for anyone interested, here is an alternative that avoids iterating over the rows.
First, use a flag to identify the pair of rows to be removed (a row plus the next row, when col1 and col2 are the same and the col3 values are the negative of each other):
df.loc[(df.col1 == df.col1.shift(1)) & (df.col2 == df.col2.shift(1)) & (df.col3 == -df.col3.shift(1)), 'removeFlag'] = True
df.loc[df.removeFlag.shift(-1) == True, 'removeFlag'] = True
col1 col2 col3 removeFlag
0 1 1 1 NaN
1 2 2 2 NaN
2 1 1 1 NaN
3 3 5 7 NaN
4 1 2 -1 True
5 1 2 1 True
6 1 2 1 NaN
Then use this flag to delete the offending rows:
df = df[~(df.removeFlag == True)]
df.drop(columns=['removeFlag'], inplace=True)
col1 col2 col3
0 1 1 1
1 2 2 2
2 1 1 1
3 3 5 7
6 1 2 1
This approach would probably need a little more refinement if row 6 had been the same as row 4 (i.e. the first half of a repeated identical pair), but you get the idea; one possible generalisation is sketched below.
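For larger data (like the million-row case mentioned above), a fully vectorised sketch of that generalisation could look like this (my own names and approach, not taken from the answers): within each (col1, col2, |col3|) group, pair off min(#negative, #positive) rows of each sign and drop them, keeping any unpaired leftovers.
import numpy as np
import pandas as pd

def drop_opposite_pairs(df):
    # group key: col1, col2 and the magnitude of col3
    abs3 = df['col3'].abs().rename('abs3')
    sign = np.sign(df['col3']).rename('sign')
    key = [df['col1'], df['col2'], abs3]
    # position of each row within its (key, sign) subgroup: 0, 1, 2, ...
    rank = df.groupby(key + [sign]).cumcount()
    # number of positive / negative rows sharing the same key
    n_pos = sign.gt(0).groupby(key).transform('sum')
    n_neg = sign.lt(0).groupby(key).transform('sum')
    pairs = np.minimum(n_pos, n_neg)
    # a row is dropped if its within-sign position falls inside the paired range
    return df[rank >= pairs]
On the example frame, drop_opposite_pairs(df) returns rows 0, 1, 2, 3 and 6, matching the desired output above.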

2nd largest value in each row

How can I create a column col4 that contains the 2nd largest value in each row?
import numpy as np
import pandas as pd

df = pd.DataFrame([[4, 1, 5],
                   [5, 2, 9],
                   [2, 9, 3],
                   [8, 5, 4]],
                  columns=["col_A", "col_B", "col_C"])
cols = np.array(df.columns)
df['col4'] = df.nlargest(2, columns=cols)  # wrong
You can use indexing on the output of np.sort:
N = 2
df['col4'] = np.sort(df)[:, -N]
Alternative with apply:
df['col4'] = df.apply(lambda r: r.nlargest(2).iloc[-1], axis=1)
output:
col_A col_B col_C col4
0 4 1 5 4
1 5 2 9 5
2 2 9 3 3
3 8 5 4 5
For each row, you could also sort the values and take the second-to-last one as follows:
df["col4"] = df.apply(lambda x: sorted(x)[-2], axis=1)

Convert column values to column of list based on condition

The dataset:
id col2 col3
0 1 1 123
1 1 1 234
2 1 0 345
3 2 1 456
4 2 0 1243
5 2 0 346
6 3 0 888
7 3 0 999
8 3 0 777
I would like to aggregate the data by id and append the values of col3 into a list only if the corresponding value in col2 is 1. Additionally, for ids that only have 0 in col2, I would like the aggregated value to be 0 for col2 and an empty list for col3.
Here is the current code:
import numpy as np
import pandas as pd

df_test = pd.DataFrame({'id': [1, 1, 1, 2, 2, 2, 3, 3, 3],
                        'col2': [1, 1, 0, 1, 0, 0, 0, 0, 0],
                        'col3': [123, 234, 345, 456, 1243, 346, 888, 999, 777]})
df_test_agg = pd.pivot_table(df_test, index=['id'], values=['col2', 'col3'],
                             aggfunc={'col2': np.max, 'col3': lambda x: list(x)})
print(df_test_agg)
col2 col3
id
1 1 [123, 234, 345]
2 1 [456, 1243, 346]
3 0 [888, 999, 777]
The desired output should be (ideally in one step in pandas):
col2 col3
id
1 1 [123, 234]
2 1 [456]
3 0 []
///////////////////////////////////////////////////////////////////////////////////////
Edit - Trying out ColdSpeed's solution
df_test = pd.DataFrame({'id': [1, 1, 1, 2, 2, 2, 3, 3, 3],
                        'col2': [1, 1, 0, 1, 0, 0, 0, 0, 0],
                        'col3': [123, 234, 345, 456, 1243, 346, 888, 999, 777]})
print(df_test)
df_test_agg = (df_test.where(df_test.col2 > 0)
                      .assign(id=df_test.id)
                      .groupby('id')
                      .agg({'col2': 'max', 'col3': lambda x: x.dropna().tolist()}))
print(df_test_agg)
id col2 col3
0 1 1 123
1 1 1 234
2 1 0 345
3 2 1 456
4 2 0 1243
5 2 0 346
6 3 0 888
7 3 0 999
8 3 0 777
col2 col3
id
1 1.0 [123.0, 234.0]
2 1.0 [456.0]
3 NaN []
///////////////////////////////////////////////////////////////////////////////////////
Edited original post to present more scenarios.
You can filter beforehand, then use groupby:
df_test.query('col2 > 0').groupby('id').agg({'col2': 'max', 'col3': list})
col2 col3
id
1 1 [123, 234]
2 1 [456]
The caveat here is that if a group has only zeros, that group will be missing in the result. So, to fix that, you can mask with where:
(df_test.where(df_test.col2 > 0)
        .assign(id=df_test.id)
        .groupby('id')
        .agg({'col2': 'max', 'col3': lambda x: x.dropna().tolist()}))
col2 col3
id
1 1.0 [123.0, 234.0]
2 1.0 [456.0]
To also handle all-zero groups in "col2" and keep integer lists, we can use:
(df_test.assign(col3=df_test.col3.where(df_test.col2.astype(bool)))
        .groupby('id')
        .agg({'col2': 'max', 'col3': lambda x: x.dropna().astype(int).tolist()}))
col2 col3
id
1 1 [123, 234]
2 1 [456]
3 0 []
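The same result can also be written with named aggregation after masking col3; a sketch using an intermediate column name (col3_masked) of my own:
out = (df_test.assign(col3_masked=df_test['col3'].where(df_test['col2'].eq(1)))
              .groupby('id')
              .agg(col2=('col2', 'max'),
                   col3=('col3_masked', lambda s: s.dropna().astype(int).tolist())))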

Pandas - calculate new column with variable column input

Here's the problem. Imagine the following DataFrame as an example:
import pandas as pd

df = pd.DataFrame({'col1': [1, 2, 3, 4, 5],
                   'col2': [3, 4, 5, 6, 7],
                   'col3': [3, 4, 5, 6, 7],
                   'col4': [1, 2, 3, 3, 2]})
Now, I would like to add another column "col5", calculated as follows: if the value of "col4" is 1, give me the corresponding value in the column with index 1 (i.e. "col2" in this case); if "col4" is 2, give me the corresponding value in the column with index 2 (i.e. "col3" in this case), and so on.
I have tried the below and variations of it, but I can't seem to get the right result:
df["col5"] = df.apply(lambda x: df.iloc[x,df[df.columns[df["col4"]]]])
Any help is much appreciated!
If your 'col4' is the indicator of column index, this will work:
df['col5'] = df.apply(lambda x: x[df.columns[x['col4']]], axis=1)
df
# col1 col2 col3 col4 col5
#0 1 3 3 1 3
#1 2 4 4 2 4
#2 3 5 5 3 3
#3 4 6 6 3 3
#4 5 7 7 2 7
You can use fancy indexing with NumPy and avoid a Python-level loop altogether:
import numpy as np

df['col5'] = df.iloc[:, :4].values[np.arange(df.shape[0]), df['col4']]
print(df)
col1 col2 col3 col4 col5
0 1 3 3 1 3
1 2 4 4 2 4
2 3 5 5 3 3
3 4 6 6 3 3
4 5 7 7 2 7
You should see significant performance benefits for larger dataframes:
df = pd.concat([df]*10**4, ignore_index=True)
%timeit df.apply(lambda x: x[df.columns[x['col4']]], axis=1) # 2.36 s per loop
%timeit df.iloc[:, :4].values[np.arange(df.shape[0]), df['col4']] # 1.01 ms per loop

Pandas: consolidating columns in DataFrame

Using the DataFrame below as an example:
import pandas as pd
df = pd.DataFrame({'col1':[1, 2, 3, 2, 1] , 'col2':['A', 'A', 'B', 'B','C']})
col1 col2
0 1 A
1 2 A
2 3 B
3 2 B
4 1 C
how can I get:
col1 col2
0 1 A,C
1 2 A,B
2 3 B
You can groupby on 'col1' and then apply a lambda that joins the values:
In [88]:
df = pd.DataFrame({'col1':[1, 2, 3, 2, 1] , 'col2':['A', 'A', 'B', 'B','C']})
df.groupby('col1')['col2'].apply(lambda x: ','.join(x)).reset_index()
Out[88]:
col1 col2
0 1 A,C
1 2 A,B
2 3 B
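A minor variation (mine, not from the answer): since str.join accepts any iterable of strings, it can be passed to agg directly, skipping the lambda:
df.groupby('col1', as_index=False)['col2'].agg(','.join)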
