Python DataFrame: count occurrences of list elements

I have the following dataframe:
The basket_new column contains numbers from 0 to 5 in a list (the amount can vary for each number and transaction). I would like to count the occurrences of every number for each transaction and save that number in another DataFrame like this:
I just created a lambda function for Cat_0 to test it, unfortunately it's not working as it is creating "None" entries (see picture 2).
This is the function:
df_cat["Cat_0"] = df_train["basket_new"].map(lambda x: df_cat["Cat_0"]+1 if "0" in x else None)
Can you please just tell me what I'm doing wrong / how to fix my issue?

Use explode and crosstab.
Let's say you have a df like this:
df = pd.DataFrame({'a':[1,2,3,4], 'b':[[1,2],[0],[3,1,2,3],[4,2,2,2,1]]})
df:
a b
0 1 [1, 2]
1 2 [0]
2 3 [3, 1, 2, 3]
3 4 [4, 2, 2, 2, 1]
df1 = df['b'].explode()
df[['a', 'b']].join(pd.crosstab(df1.index, df1))
a b 0 1 2 3 4
0 1 [1, 2] 0 1 1 0 0
1 2 [0] 1 0 0 0 0
2 3 [3, 1, 2, 3] 0 1 1 2 0
3 4 [4, 2, 2, 2, 1] 0 1 3 0 1
If you want to rename columns:
df[['a', 'b']].join(pd.crosstab(df1.index, df1, colnames=['b']).add_prefix('cat_'))
a b cat_0 cat_1 cat_2 cat_3 cat_4
0 1 [1, 2] 0 1 1 0 0
1 2 [0] 1 0 0 0 0
2 3 [3, 1, 2, 3] 0 1 1 2 0
3 4 [4, 2, 2, 2, 1] 0 1 3 0 1
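One caveat with crosstab: it only creates columns for values that actually occur, so a category that is missing from every list gets no column. A minimal sketch, assuming the categories are known to be 0 to 5 as in the question and that no list is empty (the sample lists are the ones from this answer, where 5 never appears), reindexes the columns to force all six:

```python
import pandas as pd

# Sample data from the answer above; category 5 never occurs.
df = pd.DataFrame({'a': [1, 2, 3, 4],
                   'b': [[1, 2], [0], [3, 1, 2, 3], [4, 2, 2, 2, 1]]})

# One row per list element; the original row label is kept as the index.
# astype(int) assumes there are no empty lists (those would explode to NaN).
exploded = df['b'].explode().astype(int)

counts = (pd.crosstab(exploded.index, exploded)
            .reindex(columns=range(6), fill_value=0)  # assumed category range 0..5
            .add_prefix('cat_'))

out = df.join(counts)
print(out)
```

The reindex step is what guarantees a fixed set of columns, which matters if the result feeds into a model that expects the same features every time.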

Using the list.count() method:
df_cat["Cat_0"] = df['basket_new'].map(lambda x: x.count(0))
Using the first 5 rows of your data:
for i in range(6):
    df_cat["Cat_{0}".format(i)] = df['basket_new'].map(lambda x: x.count(i))
Cat_0 Cat_1 Cat_2 Cat_3 Cat_4 Cat_5
0 0 0 0 1 0 0
1 1 0 0 2 0 1
2 0 1 0 1 1 1
3 0 0 1 0 0 0
4 0 0 0 0 4 0
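As for why the lambda in the question produces None: it returns None for every list that does not contain "0", and if the lists hold integers, the string "0" never matches the integer 0. A minimal sketch of an alternative that counts all categories in one pass with collections.Counter, assuming integer categories 0 to 5 (the three sample baskets are invented for illustration):

```python
from collections import Counter
import pandas as pd

# Invented sample data standing in for df_train["basket_new"].
df_train = pd.DataFrame({"basket_new": [[3], [0, 3, 3, 5], [1, 3, 4, 5]]})

# Counter(list) maps each value to its number of occurrences in that list.
per_row = [Counter(basket) for basket in df_train["basket_new"]]

df_cat = (pd.DataFrame(per_row, index=df_train.index)
            .reindex(columns=range(6))   # assumed categories 0..5
            .fillna(0)                   # categories absent from a row count as 0
            .astype(int)
            .add_prefix("Cat_"))
print(df_cat)
```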

This is a rather lengthy one, but it works:
df.explode('basket_new').groupby(['transaction_id','customerType','basket_new']).agg(count = ('basket_new','count'))\
.reset_index().pivot_table(index=['transaction_id','customerType'], columns='basket_new', values='count', fill_value=0)\
.reset_index()

Related

pandas: one-hot encoding a column containing a list of features, where each feature can be negative

So I have the following dataset:
d = {'user': ['a','a','b','b'], 'item':[1, 2, 1, 3], 'features': [[2], [-2, -1], [-137, -1, 2], [-137, 2, 1]]}
df = pd.DataFrame(data=d)
user item features
0 a 1 [2]
1 a 2 [-2, -1]
2 b 1 [-137, -1, 2]
3 b 3 [-137, 2, 1]
I'm trying to obtain the following dataset:
user item '2' '1' '137'
0 a 1 1 0 0
1 a 2 -1 -1 0
2 b 1 1 -1 -1
3 b 3 1 1 -1
I tried to use:
dataset = load_dataset()
mlb = MultiLabelBinarizer()
dataset = dataset.join(pd.DataFrame(mlb.fit_transform(dataset.pop('features')),
columns=mlb.classes_,
index=dataset.index))
but I obtained this:
user item '-1' '-137' '-2' '1' '2'
0 a 1 0 0 0 0 1
1 a 2 1 0 1 0 0
2 b 1 1 1 0 0 1
3 b 3 0 1 0 1 1
Can someone please help me?
In pandas this can be done as follows:
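import numpy as np
df1 = df.explode('features')
df1['features'] = df1['features'].astype(int)  # explode leaves object dtype
df1['f1'] = df1['features'].abs()  # magnitude, used as the column label
df1['f2'] = np.sign(df1['features'])  # sign, used as the cell value
df1.pivot(index=['user', 'item'], columns='f1', values='f2').fillna(0).astype(int).reset_index()
f1 user item 1 2 137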
0 a 1 0 1 0
1 a 2 -1 -1 0
2 b 1 -1 1 -1
3 b 3 1 1 -1
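Note that pivot raises a ValueError if a (user, item) pair lists the same feature magnitude twice, for example [2, -2]. A hedged variant uses pivot_table instead, assuming that summing the signs is an acceptable way to resolve such repeats (that choice is an assumption, not something the question specifies; the names `tall`, `magnitude` and `sign` are illustrative):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'user': ['a', 'a', 'b', 'b'],
    'item': [1, 2, 1, 3],
    'features': [[2], [-2, -1], [-137, -1, 2], [-137, 2, 1]],
})

tall = df.explode('features')
tall['features'] = tall['features'].astype(int)   # explode leaves object dtype
tall['magnitude'] = tall['features'].abs()        # becomes the column label
tall['sign'] = np.sign(tall['features'])          # becomes the cell value

# pivot_table aggregates duplicates (here by summing) instead of raising.
wide = (tall.pivot_table(index=['user', 'item'], columns='magnitude',
                         values='sign', aggfunc='sum', fill_value=0)
            .astype(int)
            .reset_index())
print(wide)
```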

Replace all values other than the column max with 0 (pandas, Python)

I have a df in which I want to replace every value that is not the max of its column with 0.
code:
data = {
"A": [1, 2, 3],
"B": [3, 5, 1],
"C": [9, 0, 1]
}
df = pd.DataFrame(data)
sample df:
A B C
0 1 3 9
1 2 5 0
2 3 1 1
result trying to get:
A B C
0 0 0 9
1 0 5 0
2 3 0 0
Kindly advise.
Many thanks.
Try:
df[df!=df.max()] = 0
Output:
A B C
0 0 0 9
1 0 5 0
2 3 0 0
Use DataFrame.where, comparing each value against its column's max:
df = df.where(df.eq(df.max()), 0)
print(df)
A B C
0 0 0 9
1 0 5 0
2 3 0 0
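Both versions keep every cell that equals its column's maximum, so ties within a column all survive. A small sketch of the where form, with a tie added to column A purely for illustration:

```python
import pandas as pd

# Same shape as the question's data, but column A now holds its max (3) twice.
df = pd.DataFrame({"A": [3, 2, 3], "B": [3, 5, 1], "C": [9, 0, 1]})

# df.max() is a Series of per-column maxima; eq() compares it against every row.
result = df.where(df.eq(df.max()), 0)
print(result)
```

If only one cell per column should survive a tie, an extra rule (such as keeping the first occurrence) would be needed; the question does not say which is wanted.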

Pandas: How can I add many new columns (one-hot encode table) based on one column (which contains list)?

Here's my problem. Given a sample pandas dataframe (the lists in colB can contain only numbers from 0 to n, here n = 5):
colA colB
0 0 [2, 4, 5]
1 1 [0, 1]
2 2 [0]
3 4 [5]
4 4 [2, 5]
I am trying to do something like this: (in my example, 5 new columns with 0 or 1 based on colB)
colA colB colB0 colB1 colB2 colB4 colB5
0 0 [2, 4, 5] 0 0 1 1 1
1 1 [0, 1] 1 1 0 0 0
2 2 [0] 1 0 0 0 0
3 4 [5] 0 0 0 0 1
4 4 [2, 5] 0 0 1 0 1
I have done it with iterrows(), but it is really slow on 900k rows. Is there a more efficient solution? Thanks for any answers.
You can expand the lists in colB into a one-hot encoded table as follows:
join the elements of each list into a '|'-separated string using .map() and join
generate the one-hot encoded table with .str.get_dummies()
add a prefix to the generated column labels with .add_prefix()
df_exp = (df['colB'].map(lambda x: '|'.join(map(str, x)))
.str.get_dummies()
.add_prefix('colB')
)
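Alternatively, you can generate the one-hot encoded table with explode and pd.get_dummies (dtype=int keeps the 0/1 output on pandas versions where get_dummies returns booleans by default):
df_exp = (pd.get_dummies(df['colB'].explode(), dtype=int)
    .groupby(level=0).max()
    .add_prefix('colB')
)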
Data Input:
data = {'colA': [0, 1, 2, 4, 4], 'colB': [[2, 4, 5], [0, 1], [0], [5], [2, 5]]}
df = pd.DataFrame(data)
colA colB
0 0 [2, 4, 5]
1 1 [0, 1]
2 2 [0]
3 4 [5]
4 4 [2, 5]
Result:
print(df_exp)
colB0 colB1 colB2 colB4 colB5
0 0 0 1 1 1
1 1 1 0 0 0
2 1 0 0 0 0
3 0 0 0 0 1
4 0 0 1 0 1
Finally, attach the generated one-hot encoded table to the original dataframe:
df_out = df.join(df_exp)
Result:
print(df_out)
colA colB colB0 colB1 colB2 colB4 colB5
0 0 [2, 4, 5] 0 0 1 1 1
1 1 [0, 1] 1 1 0 0 0
2 2 [0] 1 0 0 0 0
3 4 [5] 0 0 0 0 1
4 4 [2, 5] 0 0 1 0 1
To do it in one step:
Use either:
df_out = df.join(df['colB'].map(lambda x: '|'.join(map(str, x)))
.str.get_dummies()
.add_prefix('colB')
)
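Or, use (dtype=int keeps the 0/1 output on pandas versions where get_dummies returns booleans by default):
df_out = df.join(pd.get_dummies(df['colB'].explode(), dtype=int)
    .groupby(level=0).max()
    .add_prefix('colB')
)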
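The two approaches should produce the same table. A quick sketch that checks this on the sample data; the variable names `via_str` and `via_explode` are just illustrative:

```python
import pandas as pd

df = pd.DataFrame({'colA': [0, 1, 2, 4, 4],
                   'colB': [[2, 4, 5], [0, 1], [0], [5], [2, 5]]})

# Approach 1: join each list into a '|'-separated string, then split into dummies.
via_str = (df['colB'].map(lambda x: '|'.join(map(str, x)))
                     .str.get_dummies()
                     .add_prefix('colB'))

# Approach 2: one row per element, dummy-encode, then collapse back per original row.
via_explode = (pd.get_dummies(df['colB'].explode(), dtype=int)
                 .groupby(level=0).max()
                 .add_prefix('colB'))

same_columns = list(via_str.columns) == list(via_explode.columns)
same_values = via_str.to_numpy().tolist() == via_explode.to_numpy().tolist()
print(same_columns, same_values)
```

One difference to keep in mind: .str.get_dummies() works on strings, so its columns are ordered as text. With values of 10 or more, "10" would sort before "2", whereas the explode version orders the columns numerically.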
Here is an alternative, based on explode + unstack (reindex over range(6) so that every value from 0 to 5 gets a column):
df.join(df.explode('colB')
    .set_index('colB', append=True)
    .unstack()
    .notna()
    .astype(int)
    .droplevel(0, axis=1)
    .reindex(range(6), axis=1, fill_value=0)
    .add_prefix('colB')
)
Output:
colA colB colB0 colB1 colB2 colB3 colB4 colB5
0 0 [2, 4, 5] 0 0 1 0 1 1
1 1 [0, 1] 1 1 0 0 0 0
2 2 [0] 1 0 0 0 0 0
3 4 [5] 0 0 0 0 0 1
4 4 [2, 5] 0 0 1 0 0 1

Create new columns in pandas from python nested lists

I have a pandas data frame. One of the columns has a nested list. I would like to create new columns from the nested list
Example:
L = [[1,2,4],
[5,6,7,8],
[9,3,5]]
I want all the elements in the nested lists as columns. The value should be one if the list has the element and zero if it does not.
1 2 4 5 6 7 8 9 3
1 1 1 0 0 0 0 0 0
0 0 0 1 1 1 1 0 0
0 0 0 1 0 0 0 1 1
You can try the following:
df = pd.DataFrame({"A": L})
df
# A
#0 [1, 2, 4]
#1 [5, 6, 7, 8]
#2 [9, 3, 5]
# for each cell, use `pd.Series(1, x)` to create a Series object with the elements in the
# list as the index which will become the column headers in the result
df.A.apply(lambda x: pd.Series(1, x)).fillna(0).astype(int)
# 1 2 3 4 5 6 7 8 9
#0 1 1 0 1 0 0 0 0 0
#1 0 0 0 0 1 1 1 1 0
#2 0 0 1 0 1 0 0 0 1
pandas
Very similar to @Psidom's answer. However, I use value_counts, which also handles repeats (each cell holds the number of occurrences).
Use @Psidom's df:
df = pd.DataFrame({'A': L})
df.A.apply(lambda x: pd.Series(x).value_counts()).fillna(0).astype(int)
numpy
More involved, but speedy
lst = df.A.values.tolist()
n = len(lst)
lengths = [len(sub) for sub in lst]
flat = np.concatenate(lst)
u, inv = np.unique(flat, return_inverse=True)
rng = np.arange(n)
slc = np.hstack([
rng.repeat(lengths)[:, None],
inv[:, None]
])
data = np.zeros((n, u.shape[0]), dtype=np.uint8)
data[slc[:, 0], slc[:, 1]] = 1
pd.DataFrame(data, df.index, u)
Results
1 2 3 4 5 6 7 8 9
0 1 1 0 1 0 0 0 0 0
1 0 0 0 0 1 1 1 1 0
2 0 0 1 0 1 0 0 0 1
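Note that the NumPy version writes 1 into each cell, so a repeated element still shows up as 1. If counts are wanted instead, one possible variation (not part of the answer above) is to accumulate with np.add.at, which handles repeated index pairs. A sketch, with a repeated 4 added to the first list for illustration:

```python
import numpy as np
import pandas as pd

# Illustrative input: 4 appears twice in the first list.
L = [[1, 2, 4, 4], [5, 6, 7, 8], [9, 3, 5]]

lengths = [len(sub) for sub in L]
flat = np.concatenate(L)
u, inv = np.unique(flat, return_inverse=True)   # inv: column position of each element
rows = np.arange(len(L)).repeat(lengths)        # row position of each element

counts = np.zeros((len(L), u.shape[0]), dtype=np.int64)
np.add.at(counts, (rows, inv), 1)               # unbuffered add, so repeats accumulate

result = pd.DataFrame(counts, columns=u)
print(result)
```

A plain `counts[rows, inv] += 1` would not work here, because fancy-indexed assignment applies each distinct index pair only once.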

How to get DataFrame indices for multiple rows configurations using Pandas in Python?

Consider the DataFrames P1 and P2:
P1 =
A B
0 0 0
1 0 1
2 1 0
3 1 1
P2 =
A B C
0 0 0 0
1 0 0 1
2 0 1 0
3 0 1 1
4 1 0 0
5 1 0 1
6 1 1 0
7 1 1 1
I would like to know if there is a concise and efficient way of getting, for each row (tuple/configuration/assignment) of columns ['A','B'] in P2, the index of the matching row in P1.
That is, given P2[['A','B']]:
P2[['A','B']] =
A B
0 0 0
1 0 0
2 0 1
3 0 1
4 1 0
5 1 0
6 1 1
7 1 1
I would like to get [0, 0, 1, 1, 2, 2, 3, 3], since the first and second rows in P2[['A','B']] correspond to the first row in P1, and so on.
You could use merge and extract the overlapping keys:
In [3]: tmp = P2[['A', 'B']].merge(P1.reset_index())
In [4]: tmp
Out[4]:
A B index
0 0 0 0
1 0 0 0
2 0 1 1
3 0 1 1
4 1 0 2
5 1 0 2
6 1 1 3
7 1 1 3
Get the values.
In [5]: tmp['index'].values
Out[5]: array([0, 0, 1, 1, 2, 2, 3, 3], dtype=int64)
However, there could be a native NumPy method to do this as well.
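Two caveats on the merge approach: it relies on the default inner join keeping the row order of the left frame, and it silently drops rows of P2 that have no match in P1. An alternative sketch that stays within pandas, assuming the rows of P1 are unique, looks each (A, B) pair up in a MultiIndex built from P1 (the names `keys` and `pos` are illustrative):

```python
import itertools
import pandas as pd

# Rebuild the frames from the question: all 0/1 combinations of 2 and 3 columns.
P1 = pd.DataFrame(list(itertools.product([0, 1], repeat=2)), columns=['A', 'B'])
P2 = pd.DataFrame(list(itertools.product([0, 1], repeat=3)), columns=['A', 'B', 'C'])

# Treat each row of P1 as a key; get_indexer returns the position of each
# P2 key within P1, or -1 where there is no match.
keys = pd.MultiIndex.from_frame(P1[['A', 'B']])
pos = keys.get_indexer(pd.MultiIndex.from_frame(P2[['A', 'B']]))

print(pos.tolist())
```

These are positions rather than labels. If P1 does not have the default 0..n-1 index, `P1.index[pos]` maps them to labels, provided no entry is -1.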
