Python Dataframe- count occurrences of list element


I have the following dataframe:
The basket_new column contains numbers from 0 to 5 in a list (the amount can vary for each number and transaction). I would like to count the occurrences of every number for each transaction and save that number in another DataFrame like this:
I created a lambda function for Cat_0 to test it. Unfortunately it's not working: it creates "None" entries (see picture 2).
This is the function:
df_cat["Cat_0"] = df_train["basket_new"].map(lambda x: df_cat["Cat_0"]+1 if "0" in x else None)
Can you please just tell me what I'm doing wrong / how to fix my issue?

Use explode and crosstab.
Let's say you have a df like this:
df = pd.DataFrame({'a':[1,2,3,4], 'b':[[1,2],[0],[3,1,2,3],[4,2,2,2,1]]})
df:
   a                b
0  1           [1, 2]
1  2              [0]
2  3     [3, 1, 2, 3]
3  4  [4, 2, 2, 2, 1]
df1 = df['b'].explode()
df[['a', 'b']].join(pd.crosstab(df1.index, df1))
   a                b  0  1  2  3  4
0  1           [1, 2]  0  1  1  0  0
1  2              [0]  1  0  0  0  0
2  3     [3, 1, 2, 3]  0  1  1  2  0
3  4  [4, 2, 2, 2, 1]  0  1  3  0  1
If you want to rename columns:
df[['a', 'b']].join(pd.crosstab(df1.index, df1, colnames=['b']).add_prefix('cat_'))
   a                b  cat_0  cat_1  cat_2  cat_3  cat_4
0  1           [1, 2]      0      1      1      0      0
1  2              [0]      1      0      0      0      0
2  3     [3, 1, 2, 3]      0      1      1      2      0
3  4  [4, 2, 2, 2, 1]      0      1      3      0      1
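Applied to the question's column names, the explode + crosstab idea might look like the sketch below. The basket values are invented for illustration (the original data was only posted as a picture), and the code assumes no basket is empty, since explode turns an empty list into NaN. The reindex step is an addition that guarantees a column for every category 0..5 even when one never occurs.

```python
import pandas as pd

# Hypothetical stand-in for the asker's data: lists of category ids 0..5.
df_train = pd.DataFrame({
    "basket_new": [[3], [0, 3, 3, 5], [1, 3, 4, 5], [2], [4, 4, 4, 4]],
})

# One row per list element; the original row label is repeated per element.
exploded = df_train["basket_new"].explode().astype(int)

# Cross-tabulate row label against category id to get per-row counts.
counts = pd.crosstab(exploded.index, exploded)

# Ensure every category 0..5 has a column, then name the columns Cat_0..Cat_5.
counts = counts.reindex(columns=range(6), fill_value=0).add_prefix("Cat_")

df_cat = df_train.join(counts)
print(df_cat)
```

Because the counts are joined back on the row index, df_cat keeps the original basket_new column next to the new Cat_ columns.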

Using the list.count() method:
df_cat["Cat_0"] = df_train["basket_new"].map(lambda x: x.count(0))
Using the first 5 rows of your data, for all six categories:
for i in range(6):
    df_cat["Cat_{0}".format(i)] = df_train["basket_new"].map(lambda x: x.count(i))
   Cat_0  Cat_1  Cat_2  Cat_3  Cat_4  Cat_5
0      0      0      0      1      0      0
1      1      0      0      2      0      1
2      0      1      0      1      1      1
3      0      0      1      0      0      0
4      0      0      0      0      4      0
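As a side note on the original attempt: if the lists hold integers, `"0" in x` tests for the string "0" and is never true, so the lambda always falls through to None. The sketch below is a variation on the count idea that uses collections.Counter (a swapped-in technique, not part of the answer) so that each basket is scanned once rather than once per category. The data is hypothetical.

```python
from collections import Counter

import pandas as pd

# Hypothetical baskets; the real data was only shown as a picture.
df_train = pd.DataFrame({
    "basket_new": [[3], [0, 3, 3, 5], [1, 3, 4, 5], [2], [4, 4, 4, 4]],
})

# A string is never equal to an integer, so this membership test fails.
assert ("0" in [0, 3]) is False

# Counter tallies a basket in one pass; a missing key reads as 0.
tallies = [Counter(basket) for basket in df_train["basket_new"]]

df_cat = pd.DataFrame(
    [[tally[i] for i in range(6)] for tally in tallies],
    columns=[f"Cat_{i}" for i in range(6)],
    index=df_train.index,
)
print(df_cat)
```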

This is a rather lengthy one but it works
df.explode('basket_new').groupby(['transaction_id','customerType','basket_new']).agg(count = ('basket_new','count'))\
.reset_index().pivot_table(index=['transaction_id','customerType'], columns='basket_new', values='count', fill_value=0)\
.reset_index()
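A shorter variant of the same idea, as a sketch: groupby(...).size() followed by unstack replaces the agg + pivot_table pair. The column names transaction_id and customerType come from the answer above; the values are invented for illustration.

```python
import pandas as pd

# Hypothetical frame using the column names from the answer above.
df = pd.DataFrame({
    "transaction_id": [101, 102, 103],
    "customerType": ["new", "existing", "new"],
    "basket_new": [[3], [0, 3, 3, 5], [4, 4]],
})

counts = (
    df.explode("basket_new")
      .groupby(["transaction_id", "customerType", "basket_new"])
      .size()                  # occurrences of each category per transaction
      .unstack(fill_value=0)   # categories become columns, gaps become 0
      .add_prefix("Cat_")
      .reset_index()
)
print(counts)
```

Only categories that occur somewhere in the data get a column here; a reindex over the full category range would be needed to force the rest.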

Related

pandas one-hot-encoding column containing a list of feature and each feature can be negative

So I have the following dataset:
d = {'user': ['a','a','b','b'], 'item':[1, 2, 1, 3], 'features': [[2], [-2, -1], [-137, -1, 2], [-137, 2, 1]]}
df = pd.DataFrame(data=d)
user item features
0 a 1 [2]
1 a 2 [-2, -1]
2 b 1 [-137, -1, 2]
3 b 3 [-137, 2, 1]
I'm trying to obtain the following dataset:
user item '2' '1' '137'
0 a 1 1 0 0
1 a 2 -1 -1 0
2 b 1 1 -1 -1
3 b 3 1 1 -1
I tried to use:
dataset = load_dataset()
mlb = MultiLabelBinarizer()
dataset = dataset.join(pd.DataFrame(mlb.fit_transform(dataset.pop('features')),
columns=mlb.classes_,
index=dataset.index))
but I obtained this:
user item '-1' '-137' '-2' '1' '2'
0 a 1 0 0 0 0 1
1 a 2 1 0 1 0 0
2 b 1 1 1 0 0 1
3 b 3 0 1 0 1 1
Can someone please help me ?

In pandas this can be done as follows:
df1 = df.explode('features')
df1['f1'] = abs(df1.features)       # column label: magnitude of the feature
df1['f2'] = np.sign(df1.features)   # cell value: sign of the feature
df1.pivot(index=['user', 'item'], columns='f1', values='f2').fillna(0).reset_index()
f1 user  item   1   2  137
0     a     1   0   1    0
1     a     2  -1  -1    0
2     b     1  -1   1   -1
3     b     3   1   1   -1
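For reference, a self-contained version of the steps above might look like this. It assumes a feature never appears with both signs in the same row (pivot raises on duplicate entries). The names long, feature, sign and signed are chosen for this sketch only.

```python
import numpy as np
import pandas as pd

d = {"user": ["a", "a", "b", "b"], "item": [1, 2, 1, 3],
     "features": [[2], [-2, -1], [-137, -1, 2], [-137, 2, 1]]}
df = pd.DataFrame(d)

long = df.explode("features")
long["features"] = long["features"].astype(int)  # explode leaves object dtype
long["feature"] = long["features"].abs()         # column label: |feature|
long["sign"] = np.sign(long["features"])         # cell value: +1 or -1

signed = (
    long.pivot(index=["user", "item"], columns="feature", values="sign")
        .fillna(0)      # features absent from a row become 0
        .astype(int)
        .reset_index()
)
print(signed)
```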

other than max value, replace all value to 0 in each column (pandas, python)

I have a df in which I want to replace every value that is not the maximum of its column with 0.
code:
data = {
"A": [1, 2, 3],
"B": [3, 5, 1],
"C": [9, 0, 1]
}
df = pd.DataFrame(data)
sample df:
A B C
0 1 3 9
1 2 5 0
2 3 1 1
result trying to get:
A B C
0 0 0 9
1 0 5 0
2 3 0 0
kindly advise.
many thanks

Try:
df[df!=df.max()] = 0
Output:
A B C
0 0 0 9
1 0 5 0
2 3 0 0

Use DataFrame.where and compare each value with its column maximum:
df = df.where(df.eq(df.max()), 0)
print(df)
A B C
0 0 0 9
1 0 5 0
2 3 0 0
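A runnable sketch of the same approach, with one extra case to show how ties behave: every cell equal to the column maximum is kept, so a column with a repeated maximum keeps more than one non-zero value. The tied frame is made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({"A": [1, 2, 3], "B": [3, 5, 1], "C": [9, 0, 1]})

# df.max() is a Series indexed by column name, so eq() compares each cell
# with the maximum of its own column.
is_max = df.eq(df.max())
result = df.where(is_max, 0)
print(result)

# Hypothetical column with a repeated maximum: both 2s survive.
tied = pd.DataFrame({"A": [2, 2, 1]})
tied_result = tied.where(tied.eq(tied.max()), 0)
print(tied_result)
```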

Pandas: How can I add many new columns (one-hot encode table) based on one column (which contains list)?

Here's my problem. Given a sample pandas dataframe (lists in colB can contain only numbers from 0 to n, in this example 5):
colA colB
0 0 [2, 4, 5]
1 1 [0, 1]
2 2 [0]
3 4 [5]
4 4 [2, 5]
I am trying to do something like this: (in my example, 5 new columns with 0 or 1 based on colB)
colA colB colB0 colB1 colB2 colB4 colB5
0 0 [2, 4, 5] 0 0 1 1 1
1 1 [0, 1] 1 1 0 0 0
2 2 [0] 1 0 0 0 0
3 4 [5] 0 0 0 0 1
4 4 [2, 5] 0 0 1 0 1
I have done it with iterrows(), but it is really slow on my 900k rows. Is there a more efficient solution? Thanks for your answers.

You can expand the lists in colB into a one-hot encoded table in three steps:
join the elements of each list into a '|'-separated string using .map() and str.join
generate the one-hot encoded table with .str.get_dummies()
add a prefix to the generated column labels with .add_prefix()
df_exp = (df['colB'].map(lambda x: '|'.join(map(str, x)))
.str.get_dummies()
.add_prefix('colB')
)
Alternatively, you can generate the one-hot encoded table with:
df_exp = (pd.get_dummies(df['colB'].explode())
.groupby(level=0).max()
.add_prefix('colB')
)
Data Input:
data = {'colA': [0, 1, 2, 4, 4], 'colB': [[2, 4, 5], [0, 1], [0], [5], [2, 5]]}
df = pd.DataFrame(data)
colA colB
0 0 [2, 4, 5]
1 1 [0, 1]
2 2 [0]
3 4 [5]
4 4 [2, 5]
Result:
print(df_exp)
colB0 colB1 colB2 colB4 colB5
0 0 0 1 1 1
1 1 1 0 0 0
2 1 0 0 0 0
3 0 0 0 0 1
4 0 0 1 0 1
Finally, attach the generated one-hot encoded table to the original dataframe:
df_out = df.join(df_exp)
Result:
print(df_out)
colA colB colB0 colB1 colB2 colB4 colB5
0 0 [2, 4, 5] 0 0 1 1 1
1 1 [0, 1] 1 1 0 0 0
2 2 [0] 1 0 0 0 0
3 4 [5] 0 0 0 0 1
4 4 [2, 5] 0 0 1 0 1
To do it in one step:
Use either:
df_out = df.join(df['colB'].map(lambda x: '|'.join(map(str, x)))
.str.get_dummies()
.add_prefix('colB')
)
Or, use:
df_out = df.join(pd.get_dummies(df['colB'].explode())
.groupby(level=0).max()
.add_prefix('colB')
)
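If columns are also wanted for categories that never occur (3 in this data), one possible extension of the explode + get_dummies variant is to reindex over the full range. This is a sketch: n_categories is an assumed name for the known number of categories, and the astype(int) is there because get_dummies may return booleans depending on the pandas version.

```python
import pandas as pd

df = pd.DataFrame({"colA": [0, 1, 2, 4, 4],
                   "colB": [[2, 4, 5], [0, 1], [0], [5], [2, 5]]})

n_categories = 6  # assumption: values range over 0..5, as in the question

one_hot = (
    pd.get_dummies(df["colB"].explode().astype(int))
      .groupby(level=0).max()   # collapse back to one row per original row
      .reindex(columns=range(n_categories), fill_value=0)  # add missing categories
      .astype(int)              # normalise booleans to 0/1
      .add_prefix("colB")
)
df_out = df.join(one_hot)
print(df_out)
```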

Here is an alternative, based on explode+unstack (the reindex covers the full range 0 to 5 and fills missing categories with 0):
df.join(df.explode('colB')
          .set_index('colB', append=True)
          .unstack()
          .notna()
          .astype(int)
          .droplevel(0, axis=1)
          .reindex(range(6), axis=1, fill_value=0)
          .add_prefix('colB')
       )
Output:
   colA       colB  colB0  colB1  colB2  colB3  colB4  colB5
0     0  [2, 4, 5]      0      0      1      0      1      1
1     1     [0, 1]      1      1      0      0      0      0
2     2        [0]      1      0      0      0      0      0
3     4        [5]      0      0      0      0      0      1
4     4     [2, 5]      0      0      1      0      0      1

Create new columns in pandas from python nested lists

I have a pandas data frame. One of the columns has a nested list. I would like to create new columns from the nested list
Example:
L = [[1,2,4],
[5,6,7,8],
[9,3,5]]
I want all the elements in the nested lists as columns. The value should be one if the list has the element and zero if it does not.
1 2 4 5 6 7 8 9 3
1 1 1 0 0 0 0 0 0
0 0 0 1 1 1 1 0 0
0 0 0 1 0 0 0 1 1

You can try the following:
df = pd.DataFrame({"A": L})
df
# A
#0 [1, 2, 4]
#1 [5, 6, 7, 8]
#2 [9, 3, 5]
# for each cell, use `pd.Series(1, x)` to create a Series object with the elements in the
# list as the index which will become the column headers in the result
df.A.apply(lambda x: pd.Series(1, x)).fillna(0).astype(int)
# 1 2 3 4 5 6 7 8 9
#0 1 1 0 1 0 0 0 0 0
#1 0 0 0 0 1 1 1 1 0
#2 0 0 1 0 1 0 0 0 1

pandas
Very similar to @Psidom's answer. However, I use pd.value_counts, which also handles repeated elements.
Use @Psidom's df
df = pd.DataFrame({'A': L})
df.A.apply(pd.value_counts).fillna(0).astype(int)
numpy
More involved, but speedy
lst = df.A.values.tolist()
n = len(lst)
lengths = [len(sub) for sub in lst]
flat = np.concatenate(lst)
u, inv = np.unique(flat, return_inverse=True)
rng = np.arange(n)
slc = np.hstack([
rng.repeat(lengths)[:, None],
inv[:, None]
])
data = np.zeros((n, u.shape[0]), dtype=np.uint8)
data[slc[:, 0], slc[:, 1]] = 1
pd.DataFrame(data, df.index, u)
Results
1 2 3 4 5 6 7 8 9
0 1 1 0 1 0 0 0 0 0
1 0 0 0 0 1 1 1 1 0
2 0 0 1 0 1 0 0 0 1
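The NumPy version above can be condensed a little, and extended to count repeats, roughly as follows. This is a sketch rather than the answerer's code: np.repeat builds the row positions directly, and np.add.at is an addition that accumulates repeated (row, column) pairs instead of just flagging presence.

```python
import numpy as np
import pandas as pd

L = [[1, 2, 4], [5, 6, 7, 8], [9, 3, 5]]

lengths = [len(sub) for sub in L]
flat = np.concatenate(L)

# u: sorted unique values (column labels); inv: column position of each element
u, inv = np.unique(flat, return_inverse=True)
# rows: row position of each element of the flattened array
rows = np.repeat(np.arange(len(L)), lengths)

# Presence/absence indicator.
indicator = np.zeros((len(L), len(u)), dtype=np.uint8)
indicator[rows, inv] = 1

# Occurrence counts: add.at is unbuffered, so duplicate index pairs accumulate.
counts = np.zeros((len(L), len(u)), dtype=np.int64)
np.add.at(counts, (rows, inv), 1)

result = pd.DataFrame(indicator, columns=u)
print(result)
```

With this input no list contains a repeated element, so counts and indicator hold the same values.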

How to get DataFrame indices for multiple rows configurations using Pandas in Python?

Consider the DataFrame P1 and P2:
P1 =
A B
0 0 0
1 0 1
2 1 0
3 1 1
P2 =
A B C
0 0 0 0
1 0 0 1
2 0 1 0
3 0 1 1
4 1 0 0
5 1 0 1
6 1 1 0
7 1 1 1
I would like to know if there is a concise and efficient way of getting the indices in P1 for the row (tuple/configurations/assignments) of columns ['A','B'] in P2.
That is, given P2[['A','B']]:
P2[['A','B']] =
A B
0 0 0
1 0 0
2 0 1
3 0 1
4 1 0
5 1 0
6 1 1
7 1 1
I would like to get [0, 0, 1, 1, 2, 2, 3, 3], since the first and second rows in P2[['A','B']] correspond to the first row in P1, and so on.

You could use merge and extract the overlapping keys
In [3]: tmp = P2[['A', 'B']].merge(P1.reset_index())
In [4]: tmp
Out[4]:
A B index
0 0 0 0
1 0 0 0
2 0 1 1
3 0 1 1
4 1 0 2
5 1 0 2
6 1 1 3
7 1 1 3
Get the values.
In [5]: tmp['index'].values
Out[5]: array([0, 0, 1, 1, 2, 2, 3, 3], dtype=int64)
However, there could be a native NumPy method to do this as well.
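A self-contained sketch of the merge approach, building P1 and P2 with itertools.product. The explicit on= and how="left" are additions: a left merge keeps the rows of P2 in their original order, which matters when P2 is not already sorted the same way as P1.

```python
from itertools import product

import pandas as pd

P1 = pd.DataFrame(list(product([0, 1], repeat=2)), columns=["A", "B"])
P2 = pd.DataFrame(list(product([0, 1], repeat=3)), columns=["A", "B", "C"])

# reset_index() adds a column named "index" holding P1's row labels.
lookup = P1.reset_index()

# For each row of P2, look up the P1 row with the same (A, B) values.
matched = P2[["A", "B"]].merge(lookup, on=["A", "B"], how="left")
indices = matched["index"].tolist()
print(indices)
```

A row of P2 with no counterpart in P1 would get NaN in the index column, so the result would need checking before being used as integer positions.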
