pandas: transform row data into column data - python

I have a dataframe like :
user_id category view collect
1 1 a 2 3
2 1 b 5 9
3 2 a 8 6
4 3 a 7 3
5 3 b 4 2
6 3 c 3 0
7 4 e 1 4
How can I change it to a new dataframe in which each user_id appears once and each category's view and collect values become columns, filling missing combinations with 0, like this:
user_id a_view a_collect b_view b_collect c_view c_collect d_view d_collect e_view e_collect
1 2 3 5 9 0 0 0 0 0 0
2 8 6 0 0 0 0 0 0 0 0
3 7 3 4 2 3 0 0 0 0 0
4 0 0 0 0 0 0 0 0 1 4

The desired result can be obtained by pivoting df, with values from user_id becoming the index and values from category becoming a column level:
import numpy as np
import pandas as pd
df = pd.DataFrame({'category': ['a', 'b', 'a', 'a', 'b', 'c', 'e'],
'collect': [3, 9, 6, 3, 2, 0, 4],
'user_id': [1, 1, 2, 3, 3, 3, 4],
'view': [2, 5, 8, 7, 4, 3, 1]})
result = (df.pivot(index='user_id', columns='category')
            .swaplevel(axis=1)
            .sort_index(axis=1)  # sort_index replaces the removed sortlevel
            .reindex(['view', 'collect'], axis=1, level=1)
            .fillna(0))
yields
category a b c e
view collect view collect view collect view collect
user_id
1 2.0 3.0 5.0 9.0 0.0 0.0 0.0 0.0
2 8.0 6.0 0.0 0.0 0.0 0.0 0.0 0.0
3 7.0 3.0 4.0 2.0 3.0 0.0 0.0 0.0
4 0.0 0.0 0.0 0.0 0.0 0.0 1.0 4.0
Above, result has a MultiIndex. In general I think this should be preferred over a flattened single index, since it retains more of the structure of the data.
However, the MultiIndex can be flattened into a single index:
result.columns = ['{}_{}'.format(cat,col) for cat, col in result.columns]
print(result)
yields
a_view a_collect b_view b_collect c_view c_collect e_view \
user_id
1 2.0 3.0 5.0 9.0 0.0 0.0 0.0
2 8.0 6.0 0.0 0.0 0.0 0.0 0.0
3 7.0 3.0 4.0 2.0 3.0 0.0 0.0
4 0.0 0.0 0.0 0.0 0.0 0.0 1.0
e_collect
user_id
1 0.0
2 0.0
3 0.0
4 4.0
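If you want the zeros without the float upcast caused by fillna, a pivot_table sketch can fill the gaps in one step (this is an alternative to the approach above, not part of the original answer; aggfunc='sum' is safe here only because each (user_id, category) pair occurs once):

```python
import pandas as pd

df = pd.DataFrame({'category': ['a', 'b', 'a', 'a', 'b', 'c', 'e'],
                   'collect': [3, 9, 6, 3, 2, 0, 4],
                   'user_id': [1, 1, 2, 3, 3, 3, 4],
                   'view': [2, 5, 8, 7, 4, 3, 1]})

# fill_value=0 plugs the missing (user_id, category) combinations directly,
# so the result keeps an integer dtype instead of the floats fillna produces
result = df.pivot_table(index='user_id', columns='category',
                        values=['view', 'collect'], aggfunc='sum',
                        fill_value=0)
```

The MultiIndex columns can then be flattened with the same list comprehension shown above.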


How to count number of rows dropped in a pandas dataframe

How do I print the number of rows dropped while executing the following code in python:
df.dropna(inplace = True)
# making new data frame with dropped NA values
new_data = data.dropna(axis = 0, how ='any')
len(new_data)
Use:
import numpy as np
import pandas as pd

np.random.seed(2022)
df = pd.DataFrame(np.random.choice([0, np.nan, 1], size=(10, 3)))
print (df)
0 1 2
0 NaN 0.0 NaN
1 0.0 NaN NaN
2 0.0 0.0 1.0
3 0.0 0.0 NaN
4 NaN NaN 1.0
5 1.0 0.0 0.0
6 1.0 0.0 1.0
7 NaN 0.0 1.0
8 1.0 1.0 NaN
9 1.0 0.0 NaN
You can count the rows containing missing values beforehand with DataFrame.isna, DataFrame.any and sum:
count = df.isna().any(axis=1).sum()
df.dropna(inplace = True)
print (df)
0 1 2
2 0.0 0.0 1.0
5 1.0 0.0 0.0
6 1.0 0.0 1.0
print (count)
7
Or take the difference between the DataFrame's number of rows before and after dropna:
orig = df.shape[0]
df.dropna(inplace = True)
count = orig - df.shape[0]
print (df)
0 1 2
2 0.0 0.0 1.0
5 1.0 0.0 0.0
6 1.0 0.0 1.0
print (count)
7
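Either approach can be wrapped in a small helper (the function name here is illustrative, not a pandas API):

```python
import numpy as np
import pandas as pd

def dropna_report(df, **kwargs):
    # drop rows with NaN and report how many were removed
    cleaned = df.dropna(**kwargs)
    return cleaned, len(df) - len(cleaned)

df = pd.DataFrame({'a': [1, np.nan, 3], 'b': [4, 5, np.nan]})
cleaned, n_dropped = dropna_report(df)  # n_dropped is 2
```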

Combine list with dataframe and extend the multi-values cells into columns

Assuming I have the following input:
table = pd.DataFrame({'a': [0, 0, 0, 0], 'b': [1, 1, 1, 3],
                      'c': [2, 2, 5, 4], 'd': [3, 6, 6, 6]}, dtype='float64')
list = [[55, 66],
        [77]]
#output of the table
a b c d
0 0.0 1.0 2.0 3.0
1 0.0 1.0 2.0 6.0
2 0.0 1.0 5.0 6.0
3 0.0 3.0 4.0 6.0
I want to combine list with table so the final shape would be like:
a b c d ID_0 ID_1
0 0.0 1.0 2.0 3.0 55.0 66.0
1 0.0 1.0 2.0 6.0 77.0 NaN
2 0.0 1.0 5.0 6.0 NaN NaN
3 0.0 3.0 4.0 6.0 NaN NaN
I found a way, but it looks a bit long; there might be a shorter way to do it.
Step1:
x = pd.Series(list, name ="ID")
new = pd.concat([table, x], axis=1)
# output
a b c d ID
0 0.0 1.0 2.0 3.0 [55, 66]
1 0.0 1.0 2.0 6.0 [77]
2 0.0 1.0 5.0 6.0 NaN
3 0.0 3.0 4.0 6.0 NaN
step2:
ID = new['ID'].apply(pd.Series)
ID = ID.rename(columns = lambda x : 'ID_' + str(x))
new_x = pd.concat([new[:], ID[:]], axis=1)
# output
a b c d ID ID_0 ID_1
0 0.0 1.0 2.0 3.0 [55, 66] 55.0 66.0
1 0.0 1.0 2.0 6.0 [77] 77.0 NaN
2 0.0 1.0 5.0 6.0 NaN NaN NaN
3 0.0 3.0 4.0 6.0 NaN NaN NaN
step3:
new_x = new_x.drop(columns=['ID'])
Any shorter way to achieve the same result?
Assuming a default index on table (as shown in the question), we can simply create a DataFrame (either with from_records or the constructor) and join it back to table, letting the indexes align. add_prefix is an easy way to add the 'ID_' prefix to the default numeric column names.
new_df = table.join(
pd.DataFrame.from_records(lst).add_prefix('ID_')
)
new_df:
a b c d ID_0 ID_1
0 0.0 1.0 2.0 3.0 55.0 66.0
1 0.0 1.0 2.0 6.0 77.0 NaN
2 0.0 1.0 5.0 6.0 NaN NaN
3 0.0 3.0 4.0 6.0 NaN NaN
Working with 2 DataFrames is generally easier than a DataFrame and a list. Here is what from_records does to lst:
pd.DataFrame.from_records(lst)
0 1
0 55 66.0
1 77 NaN
Index (rows) 0 and 1 will now align with the corresponding index values in table (0 and 1 respectively).
add_prefix fixes the column names before joining:
pd.DataFrame.from_records(lst).add_prefix('ID_')
ID_0 ID_1
0 55 66.0
1 77 NaN
Setup and imports:
import pandas as pd # v1.4.4
table = pd.DataFrame({
'a': [0, 0, 0, 0],
'b': [1, 1, 1, 3, ],
'c': [2, 2, 5, 4],
'd': [3, 6, 6, 6]
}, dtype='float64')
lst = [[55, 66],
[77]]
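For comparison, a sketch using the plain constructor instead of from_records; pd.DataFrame(lst) also pads the shorter row with NaN, so it joins the same way:

```python
import pandas as pd

table = pd.DataFrame({'a': [0, 0, 0, 0], 'b': [1, 1, 1, 3],
                      'c': [2, 2, 5, 4], 'd': [3, 6, 6, 6]}, dtype='float64')
lst = [[55, 66], [77]]

# the constructor pads ragged rows with NaN, just like from_records
new_df = table.join(pd.DataFrame(lst).add_prefix('ID_'))
```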

Python: how to select indexes from pandas dataframe?

I have a dataframe that looks like the following:
df
0 1 2 3 4
0 0.0 NaN NaN NaN NaN
1 NaN 0.0 0.0 NaN 4.0
2 NaN 2.0 0.0 NaN 5.0
3 NaN NaN NaN 0.0 NaN
4 NaN 0.0 3.0 NaN 0.0
I would like a dataframe listing every (i, j) pair whose value is not NaN. The dataframe should look like the following:
df
    i  j  val
0   0  0  0.0
1   1  1  0.0
2   1  2  0.0
3   1  4  4.0
4   2  1  2.0
5   2  2  0.0
6   2  4  5.0
7   3  3  0.0
8   4  1  0.0
9   4  2  3.0
10  4  4  0.0
Use DataFrame.stack with DataFrame.rename_axis and DataFrame.reset_index:
df = df.stack().rename_axis(('i','j')).reset_index(name='val')
print (df)
i j val
0 0 0 0.0
1 1 1 0.0
2 1 2 0.0
3 1 4 4.0
4 2 1 2.0
5 2 2 0.0
6 2 4 5.0
7 3 3 0.0
8 4 1 0.0
9 4 2 3.0
10 4 4 0.0
Like this:
In [379]: df.stack().reset_index(name='val').rename(columns={'level_0':'i', 'level_1':'j'})
Out[379]:
i j val
0 0 0 0.0
1 1 1 0.0
2 1 2 0.0
3 1 4 4.0
4 2 1 2.0
5 2 2 0.0
6 2 4 5.0
7 3 3 0.0
8 4 1 0.0
9 4 2 3.0
10 4 4 0.0
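A NumPy-based sketch of the same idea (not from either answer), in case you want positional arrays rather than stacking:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([[0.0, np.nan, np.nan, np.nan, np.nan],
                   [np.nan, 0.0, 0.0, np.nan, 4.0],
                   [np.nan, 2.0, 0.0, np.nan, 5.0],
                   [np.nan, np.nan, np.nan, 0.0, np.nan],
                   [np.nan, 0.0, 3.0, np.nan, 0.0]])

# positions of the non-NaN cells in row-major order, then values by position
i, j = np.nonzero(df.notna().to_numpy())
out = pd.DataFrame({'i': df.index[i], 'j': df.columns[j],
                    'val': df.to_numpy()[i, j]})
```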

pandas: count the number of values different from zero between two zeros

I have the following dataframe
0 0 0
1 0 0
1 1 0
1 1 1
1 1 1
0 0 0
0 1 0
0 1 0
0 0 0
How do I get a dataframe where every nonzero value is replaced by the length of its nonzero run, like this:
0 0 0
4 0 0
4 3 0
4 3 2
4 3 2
0 0 0
0 2 0
0 2 0
0 0 0
Thank you for your help.
You may need a for loop here with transform: cumsum over the zero positions creates the group key, and the run counts are assigned back to the nonzero positions of your original df.
for x in df.columns:
df.loc[df[x]!=0,x]=df[x].groupby(df[x].eq(0).cumsum()[df[x]!=0]).transform('count')
df
Out[229]:
1 2 3
0 0.0 0.0 0.0
1 4.0 0.0 0.0
2 4.0 3.0 0.0
3 4.0 3.0 2.0
4 4.0 3.0 2.0
5 0.0 0.0 0.0
6 0.0 2.0 0.0
7 0.0 2.0 0.0
8 0.0 0.0 0.0
Or, without a for loop:
s=df.stack().sort_index(level=1)
s2=s.groupby([s.index.get_level_values(1),s.eq(0).cumsum()]).transform('count').sub(1).unstack()
df=df.mask(df!=0).combine_first(s2)
df
Out[255]:
1 2 3
0 0.0 0.0 0.0
1 4.0 0.0 0.0
2 4.0 3.0 0.0
3 4.0 3.0 2.0
4 4.0 3.0 2.0
5 0.0 0.0 0.0
6 0.0 2.0 0.0
7 0.0 2.0 0.0
8 0.0 0.0 0.0
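For reference, a self-contained run of the loop idea (column names 0-2 and a float dtype are assumed, since the question shows neither; the groupby is restricted to the nonzero positions up front, a slight restructuring of the answer's code):

```python
import pandas as pd

df = pd.DataFrame({0: [0, 1, 1, 1, 1, 0, 0, 0, 0],
                   1: [0, 0, 1, 1, 1, 0, 1, 1, 0],
                   2: [0, 0, 0, 1, 1, 0, 0, 0, 0]}, dtype='float64')

for x in df.columns:
    mask = df[x] != 0
    # the zero count increments at every zero, so each nonzero run gets one id
    run_id = df[x].eq(0).cumsum()
    df.loc[mask, x] = df[x][mask].groupby(run_id[mask]).transform('count')
```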

How to pivot with binning with complicated condition in pandas

I have dataframe like below
age type days
1 a 1
2 b 3
2 b 4
3 a 5
4 b 2
6 c 1
7 f 0
7 d 4
10 e 2
14 a 1
First I would like to bin by age:
age
[0~4]
age type days
1 a 1
2 b 3
2 b 4
3 a 5
4 b 2
Then sum and count days, grouping by type:
sum count
a 6 2
b 9 3
c 0 0
d 0 0
e 0 0
f 0 0
Then I would like to apply this method to the other bins:
[5~9]
[10~14]
My desired result is below
[0~4] [5~9] [10~14]
sum count sum count sum count
a 6 2 0 0 1 1
b 9 3 0 0 0 0
c 0 0 1 1 0 0
d 0 0 4 1 0 0
e 0 0 0 0 2 1
f 0 0 0 1 0 0
How can this be done? It is very complicated for me.
Consider pivot_table with pd.cut, if you do not care too much about column ordering: count and sum are not paired together under each bin, though with some manipulation you can change that ordering.
df['bin'] = pd.cut(df.age, [0,4,9,14])
pvtdf = df.pivot_table(index='type', columns=['bin'], values='days',
                       aggfunc=['count', 'sum']).fillna(0)
# count sum
# bin (0, 4] (4, 9] (9, 14] (0, 4] (4, 9] (9, 14]
# type
# a 2.0 0.0 1.0 6.0 0.0 1.0
# b 3.0 0.0 0.0 9.0 0.0 0.0
# c 0.0 1.0 0.0 0.0 1.0 0.0
# d 0.0 1.0 0.0 0.0 4.0 0.0
# e 0.0 0.0 1.0 0.0 0.0 2.0
# f 0.0 1.0 0.0 0.0 0.0 0.0
We'll use some stacking and groupby operations to get us to the desired output.
import io
import numpy as np
import pandas as pd

string_ = io.StringIO('''age type days
1 a 1
2 b 3
2 b 4
3 a 5
4 b 2
6 c 1
7 f 0
7 d 4
10 e 2
14 a 1''')
df = pd.read_csv(string_, sep=r'\s+')
df['age_bins'] = pd.cut(df['age'], [0,4,9,14])
df_stacked = df.groupby(['age_bins', 'type']).agg({'days': np.sum,
'type': 'count'}).transpose().stack().fillna(0)
df_stacked.rename(index={'days': 'sum', 'type': 'count'}, inplace=True)
>>> df_stacked
age_bins (0, 4] (4, 9] (9, 14]
type
sum a 6.0 0.0 1.0
b 9.0 0.0 0.0
c 0.0 1.0 0.0
d 0.0 4.0 0.0
e 0.0 0.0 2.0
f 0.0 0.0 0.0
count a 2.0 0.0 1.0
b 3.0 0.0 0.0
c 0.0 1.0 0.0
d 0.0 1.0 0.0
e 0.0 0.0 1.0
f 0.0 1.0 0.0
This doesn't produce the exact output you listed, but it's similar, and I think it will be easier to index and retrieve data from. Alternatively, you could use the following to get something like the desired output.
>>> df_stacked.unstack(level=0)
age_bins (0, 4] (4, 9] (9, 14]
count sum count sum count sum
type
a 2.0 6.0 0.0 0.0 1.0 1.0
b 3.0 9.0 0.0 0.0 0.0 0.0
c 0.0 0.0 1.0 1.0 0.0 0.0
d 0.0 0.0 1.0 4.0 0.0 0.0
e 0.0 0.0 0.0 0.0 1.0 2.0
f 0.0 0.0 1.0 0.0 0.0 0.0
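To get sum and count paired under each bin exactly as in the desired output, the pivot_table column levels can be swapped and reordered. A sketch, with the sample data rebuilt inline (observed=False keeps any empty bin categories):

```python
import pandas as pd

df = pd.DataFrame({'age': [1, 2, 2, 3, 4, 6, 7, 7, 10, 14],
                   'type': list('abbabcfdea'),
                   'days': [1, 3, 4, 5, 2, 1, 0, 4, 2, 1]})
df['bin'] = pd.cut(df.age, [0, 4, 9, 14])

pvt = df.pivot_table(index='type', columns='bin', values='days',
                     aggfunc=['sum', 'count'], fill_value=0, observed=False)
# move the bins to the top level, then order sum before count within each bin
pvt = pvt.swaplevel(axis=1).sort_index(axis=1)
pvt = pvt.reindex(['sum', 'count'], axis=1, level=1)
```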
