Insert values into columns without NaN - python

I'm trying to calculate the count of some values in a data frame like
user_id  event_type
1        a
1        a
1        b
2        a
2        b
2        c
and I want to get a table like
user_id  event_type  event_type_a  event_type_b  event_type_c
1        a           2             1             0
1        a           2             1             0
1        b           2             1             0
2        a           1             1             1
2        b           1             1             1
2        c           1             1             1
I've tried code like
df['event_type_a'] = df[['user_id', 'event_type']].where(df['event_type']=='a').groupby(['user_id']).count()
and get a table like
user_id  count_a
1        2
2        1
How should I insert these values into the original df so that all rows are filled without NaN items?
Maybe there exists a method like, for example, "insert into df_1['column'] from df_2['column'] where df_1['user_id'] == df_2['user_id']"

Use crosstab with add_prefix for the new column names, then join:
df2 = pd.crosstab(df['user_id'],df['event_type'])
#alternatives
#df2 = df.groupby(['user_id','event_type']).size().unstack(fill_value=0)
#df2 = df.pivot_table(index='user_id', columns='event_type', fill_value=0, aggfunc='size')
df = df.join(df2.add_prefix('event_type_'), on='user_id')
print (df)
   user_id event_type  event_type_a  event_type_b  event_type_c
0        1          a             2             1             0
1        1          a             2             1             0
2        1          b             2             1             0
3        2          a             1             1             1
4        2          b             1             1             1
5        2          c             1             1             1
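
If you only need one indicator column at a time, the original attempt can also be salvaged with groupby and transform, which broadcasts the per-user count back to every row so no NaN appears (a minimal sketch of that approach):
# count 'a' events per user and broadcast the result to all of that user's rows
df['event_type_a'] = (df['event_type'].eq('a')
                        .groupby(df['user_id'])
                        .transform('sum'))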

Here is another way to get df2, as Jez showed, but slightly different: since I use transform and do not aggregate, df2 has the same length as the original df.
df2 = df.set_index('user_id').event_type.str.get_dummies().groupby(level=0).transform('sum')
df2
Out[11]:
         a  b  c
user_id
1        2  1  0
1        2  1  0
1        2  1  0
2        1  1  1
2        1  1  1
2        1  1  1
Then use concat:
df2.index=df.index
pd.concat([df,df2],axis=1)
Out[19]:
   user_id event_type  a  b  c
0        1          a  2  1  0
1        1          a  2  1  0
2        1          b  2  1  0
3        2          a  1  1  1
4        2          b  1  1  1
5        2          c  1  1  1
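
The index assignment is needed because concat(axis=1) aligns on the index, and df2 still carries user_id as its index. An equivalent spelling without mutating df2 (a hedged variant using set_axis):
pd.concat([df, df2.set_axis(df.index)], axis=1)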

Related

Transform cell values as column headers and fill it with 1 if matching in python

I have a dataframe:
df
ID   0   1    2   3   4  ....
1   10  20    5   1   2  ....
2    3   4  NaN  10   1  ....
And I need to turn the cell values of columns 0,1,2,3,4... into column headers, and for each ID fill with 1 if that value is present for the respective ID.
Desired output:
ID  1  2  3  4  5 ... 10 20 ..
1   1  1  0  0  1 ...  1  1 ..
2   1  0  1  1  0 ...  1  0 ..
Note that some entries can be NaN.
How can I get the desired output?
Use DataFrame.set_index with DataFrame.stack to remove the missing values, then create indicators with get_dummies, reduce them to 1/0 with max over the first level, and finally convert the columns to integers:
df1 = (pd.get_dummies(df.set_index('ID').stack())
         .max(level=0)
         .rename(columns=int)
         .reset_index())
print (df1)
   ID  1  2  3  4  5  10  20
0   1  1  1  0  0  1   1   1
1   2  1  0  1  1  0   1   0
EDIT:
print (df)
   ID   0   1    2   3  4  5
0   1  10  20  5.0   1  2  5
1   2   3   4  NaN  10  1  2
If you use max, the output always contains 0/1 values (check column 5):
df1 = (pd.get_dummies(df.set_index('ID').stack())
         .max(level=0)
         .rename(columns=int)
         .reset_index())
print (df1)
   ID  1  2  3  4  5  10  20
0   1  1  1  0  0  1   1   1
1   2  1  1  1  1  0   1   0
But if you use sum, it counts the values (check column 5):
df2 = (pd.get_dummies(df.set_index('ID').stack())
         .sum(level=0)
         .rename(columns=int)
         .reset_index())
print (df2)
   ID  1  2  3  4  5  10  20
0   1  1  1  0  0  2   1   1
1   2  1  1  1  1  0   1   0
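
Note: Series.max(level=0) / sum(level=0) were deprecated in pandas 1.3 and later removed; on recent versions (an equivalent sketch, assuming pandas >= 2.0) group explicitly instead:
df1 = (pd.get_dummies(df.set_index('ID').stack())
         .groupby(level=0).max()     # same reduction as the old .max(level=0)
         .rename(columns=int)
         .reset_index())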
Another way using melt and pd.crosstab
df1 = df.melt('ID')
df_final = pd.crosstab(index=df1.ID, columns=df1.value).reset_index()
Out[673]:
value  ID  1.0  2.0  3.0  4.0  5.0  10.0  20.0
0       1    1    1    0    0    1     1     1
1       2    1    0    1    1    0     1     0
Note: by default pd.crosstab counts frequencies, so duplicate values are counted as such. If you want only a 1/0 indicator, chain ge(1) and astype as follows:
pd.crosstab(index=df1.ID, columns=df1.value).ge(1).astype(int).reset_index()

Concat() alternate group by python3.0

My goal here is to concat() alternate groups between two dataframes.
Desired result:
group  ordercode  quantity
0      A          1
       B          1
       C          1
       D          1
0      A          1
       B          3
1      A          1
       B          2
       C          1
1      A          1
       B          1
       C          2
My dataframes:
import pandas as pd
df1=pd.DataFrame([[0,"A",1],[0,"B",1],[0,"C",1],[0,"D",1],[1,"A",1],[1,"B",2],[1,"C",1]],columns=["group","ordercode","quantity"])
df2=pd.DataFrame([[0,"A",1],[0,"B",3],[1,"A",1],[1,"B",1],[1,"C",2]],columns=["group","ordercode","quantity"])
print(df1)
print(df2)
I have used dfff = pd.concat([df1, df2]).sort_index(kind="merge")
but I got the result below:
   group ordercode  quantity
0      0         A         1
0      0         A         1
1                B         1
1                B         3
2                C         1
3                D         1
4      1         A         1
4      1         A         1
5                B         2
5                B         1
6                C         1
6                C         2
You can see that the concatenation interleaves individual rows, not whole groups.
It has to print like:
group 0 of df1
group 0 of df2
group 1 of df1
group 1 of df2, and so on.
Note:
I have created these DataFrames using the groupby() function:
import numpy as np

df = pd.DataFrame(np.concatenate(df.apply(lambda x: [x[0]] * x[1], 1).to_numpy()),
                  columns=['ordercode'])
df['quantity'] = 1
df['group'] = sorted(list(range(0, len(df)//3, 1)) * 4)[0:len(df)]
df = df.groupby(['group', 'ordercode']).sum()
Question:
Where did I go wrong? The sort is done by index, not by group.
I have used .set_index("group"), but it didn't work either.
Use cumcount to create a helper column, then sort by it with sort_values:
df1['g'] = df1.groupby('ordercode').cumcount()
df2['g'] = df2.groupby('ordercode').cumcount()
dfff = pd.concat([df1,df2]).sort_values(['group','g']).reset_index(drop=True)
print (dfff)
    group ordercode  quantity  g
0       0         A         1  0
1       0         B         1  0
2       0         C         1  0
3       0         D         1  0
4       0         A         1  0
5       0         B         3  0
6       1         C         2  0
7       1         A         1  1
8       1         B         2  1
9       1         C         1  1
10      1         A         1  1
11      1         B         1  1
and finally remove the helper column:
dfff = dfff.drop('g', axis=1)
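
An alternative that sidesteps the helper column: tag each frame with a source marker and sort on (group, source). A multi-key sort_values uses a stable lexsort, so rows inside each block keep their original order (a sketch, assuming the df1/df2 defined above; 'src' is a hypothetical helper name):
dfff = (pd.concat([df1.assign(src=0), df2.assign(src=1)])
          .sort_values(['group', 'src'])   # stable: df1's rows stay before df2's
          .drop(columns='src')
          .reset_index(drop=True))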

How to apply cumulative count on multiple columns of a dataframe

Dataframe:
   a  b  c
0  0  1  1
1  0  1  1
2  0  0  1
3  0  0  1
4  1  1  0
5  1  1  1
6  1  1  1
7  0  0  1
I am trying to apply a cumulative count (cumcount) to multiple columns of a dataframe. I have tried applying the cumulative count by grouping on each column. Is there an easy way to achieve the expected output?
I have tried this code, but it is not working:
li = []
for column in df.columns:
    li.append(df.groupby(column)[column].cumcount())
pd.concat(li, axis=1)
Expected output:
   a  b  c
0  1  1  1
1  1  2  2
2  1  1  3
3  1  1  4
4  1  1  1
5  2  2  1
6  3  3  2
7  1  1  3
Your attempt groups all equal values of a column together, but you need runs of consecutive values. Create consecutive groups by comparing each column with its shifted values, apply cumcount per column, and finally set 1 where the original value is 0 using a boolean mask:
df = (df.ne(df.shift()).cumsum()
        .apply(lambda x: df.groupby(x).cumcount() + 1)
        .mask(df == 0, 1))
print (df)
   a  b  c
0  1  1  1
1  1  2  2
2  1  1  3
3  1  1  4
4  1  1  1
5  2  2  1
6  3  3  2
7  1  1  3
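
Detail - the consecutive-group identifiers created by the first step on the original frame (shown here for reference):
print (df.ne(df.shift()).cumsum())
   a  b  c
0  1  1  1
1  1  1  1
2  1  2  1
3  1  2  1
4  2  3  2
5  2  3  3
6  2  3  3
7  3  4  3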
Another solution if performance is important - count only the 1 values and set 1 elsewhere with a mask via np.where:
import numpy as np

a = df == 1
b = a.cumsum()
arr = np.where(a, b - b.mask(a).ffill().fillna(0).astype(int), 1)
df = pd.DataFrame(arr, index=df.index, columns=df.columns)
print (df)
   a  b  c
0  1  1  1
1  1  2  2
2  1  1  3
3  1  1  4
4  1  1  1
5  2  2  1
6  3  3  2
7  1  1  3

Find first row with condition after each row satisfying another condition

In pandas I have the following data frame:
a  b
0  0
1  1
2  1
0  0
1  0
2  1
Now I want to do the following:
Create a new column c, and for each row where a = 0 fill c with 1. Then c should stay 1 up to and including the first following row where b = 1 (and here I'm stuck), so the output should look like this:
a  b  c
0  0  1
1  1  1
2  1  0
0  0  1
1  0  1
2  1  1
Thanks!
It seems you need:
df['c'] = df.groupby(df.a.eq(0).cumsum())['b'].cumsum().le(1).astype(int)
print (df)
   a  b  c
0  0  0  1
1  1  1  1
2  2  1  0
3  0  0  1
4  1  0  1
5  2  1  1
Detail:
print (df.a.eq(0).cumsum())
0    1
1    1
2    1
3    2
4    2
5    2
Name: a, dtype: int32
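
And the per-group cumulative sum of b: within each group the sum exceeds 1 only after the first b = 1, so le(1) marks exactly the rows that get c = 1 (shown for reference):
print (df.groupby(df.a.eq(0).cumsum())['b'].cumsum())
0    0
1    1
2    2
3    0
4    0
5    1
Name: b, dtype: int64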

How to set index to an existing dataframe in the form of cartesian product?

I have a list, and I want to set the index of a dataframe as the cartesian product of the list values with the dataframe's rows, i.e.
li = ['A','B']
df = pd.DataFrame([[0,0,0],[1,1,1],[2,2,2]])
I want the resulting dataframe to be like
   0  1  2
A  0  0  0
A  1  1  1
A  2  2  2
B  0  0  0
B  1  1  1
B  2  2  2
How can I do this?
Option 1
pd.concat with keys argument
pd.concat([df] * len(li), keys=li)
     0  1  2
A 0  0  0  0
  1  1  1  1
  2  2  2  2
B 0  0  0  0
  1  1  1  1
  2  2  2  2
To replicate your output exactly:
pd.concat([df] * len(li), keys=li).reset_index(1, drop=True)
   0  1  2
A  0  0  0
A  1  1  1
A  2  2  2
B  0  0  0
B  1  1  1
B  2  2  2
Option 2
np.tile and np.repeat
pd.DataFrame(np.tile(df, [len(li), 1]), np.repeat(li, len(df)), df.columns)
   0  1  2
A  0  0  0
A  1  1  1
A  2  2  2
B  0  0  0
B  1  1  1
B  2  2  2
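
For reference, np.repeat stretches each label over len(df) rows while np.tile stacks the frame's values len(li) times (a quick illustration of the two building blocks):
np.repeat(li, len(df))
array(['A', 'A', 'A', 'B', 'B', 'B'], dtype='<U1')
np.tile(df, [len(li), 1])
array([[0, 0, 0],
       [1, 1, 1],
       [2, 2, 2],
       [0, 0, 0],
       [1, 1, 1],
       [2, 2, 2]])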
Use MultiIndex.from_product with reindex:
mux = pd.MultiIndex.from_product([li, df.index])
df = df.reindex(mux, level=1).reset_index(level=1, drop=True)
print (df)
   0  1  2
A  0  0  0
A  1  1  1
A  2  2  2
B  0  0  0
B  1  1  1
B  2  2  2
Or you can use the following:
li = [['A','B']]
df['New'] = li * len(df)
df.set_index([0,1,2])['New'].apply(pd.Series).stack().to_frame().rename(columns={0:'keys'})\
  .reset_index().drop('level_3', axis=1).sort_values('keys')
Out[698]:
   0  1  2 keys
0  0  0  0    A
2  1  1  1    A
4  2  2  2    A
1  0  0  0    B
3  1  1  1    B
5  2  2  2    B
