>>> df = pd.DataFrame({'a': [1,1,1,1,2,2,2,2,3,3,3,3],
'b': [0,0,1,1,0,0,1,1,0,0,1,1],
'c': [5,5,5,8,9,9,6,6,7,8,9,9]})
>>> df
a b c
0 1 0 5
1 1 0 5
2 1 1 5
3 1 1 8
4 2 0 9
5 2 0 9
6 2 1 6
7 2 1 6
8 3 0 7
9 3 0 8
10 3 1 9
11 3 1 9
Is there an alternative way to get this output?
>>> pd.pivot_table(df, index=['a','b'], columns='c', aggfunc=len, fill_value=0).reset_index()
c a b 5 6 7 8 9
0 1 0 2 0 0 0 0
1 1 1 1 0 0 1 0
2 2 0 0 0 0 0 2
3 2 1 0 2 0 0 0
4 3 0 0 0 1 1 0
5 3 1 0 0 0 0 2
I have a large df (over ~1M rows) where len(df.c.unique()) is 134, so the pivot is taking forever.
I was thinking that, given that this result is returned within a second in my actual df:
>>> df.groupby(by = ['a', 'b', 'c']).size().reset_index()
a b c 0
0 1 0 5 2
1 1 1 5 1
2 1 1 8 1
3 2 0 9 2
4 2 1 6 2
5 3 0 7 1
6 3 0 8 1
7 3 1 9 2
whether I could manually construct the desired output from the result above.
1. Here's one:
df.groupby(by = ['a', 'b', 'c']).size().unstack(fill_value=0).reset_index()
Output:
c a b 5 6 7 8 9
0 1 0 2 0 0 0 0
1 1 1 1 0 0 1 0
2 2 0 0 0 0 0 2
3 2 1 0 2 0 0 0
4 3 0 0 0 1 1 0
5 3 1 0 0 0 0 2
2. Here's another way:
pd.crosstab([df.a,df.b], df.c).reset_index()
Output:
c a b 5 6 7 8 9
0 1 0 2 0 0 0 0
1 1 1 1 0 0 1 0
2 2 0 0 0 0 0 2
3 2 1 0 2 0 0 0
4 3 0 0 0 1 1 0
5 3 1 0 0 0 0 2
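The two routes can be checked against each other on the sample data; here's a self-contained sketch (recreating the df from the question) confirming that the groupby/unstack and crosstab answers agree:

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3],
                   'b': [0, 0, 1, 1, 0, 0, 1, 1, 0, 0, 1, 1],
                   'c': [5, 5, 5, 8, 9, 9, 6, 6, 7, 8, 9, 9]})

# Route 1: count rows per (a, b, c), then pivot the 'c' level into columns
out1 = df.groupby(['a', 'b', 'c']).size().unstack(fill_value=0).reset_index()

# Route 2: crosstab builds the same contingency table directly
out2 = pd.crosstab([df.a, df.b], df.c).reset_index()

# Both yield one row per (a, b) pair and one count column per c value
```

Since groupby already runs in about a second on the large frame, route 1 in particular should sidestep the pivot_table slowdown.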
I want to label each value with the length of its consecutive run.
a
---
1
0
1
1
0
1
1
1
0
1
1
I want :
a | c
--------
1 1
0 0
1 2
1 2
0 0
1 3
1 3
1 3
0 0
1 2
1 2
Then I can calculate the mean of the "b" column grouped by "c". I tried shift, cumsum, and cumcount, but none of them worked.
Use GroupBy.transform over consecutive groups, then set 0 wherever the column is not 1:
df['c1'] = (df.groupby(df.a.ne(df.a.shift()).cumsum())['a']
.transform('size')
.where(df.a.eq(1), 0))
print (df)
a b c c1
0 1 1 1 1
1 0 2 0 0
2 1 3 2 2
3 1 2 2 2
4 0 1 0 0
5 1 3 3 3
6 1 1 3 3
7 1 3 3 3
8 0 2 0 0
9 1 2 2 2
10 1 1 2 2
If there are only 0 and 1 values, it is possible to multiply by a instead:
df['c1'] = (df.groupby(df.a.ne(df.a.shift()).cumsum())['a']
.transform('size')
.mul(df.a))
print (df)
a b c c1
0 1 1 1 1
1 0 2 0 0
2 1 3 2 2
3 1 2 2 2
4 0 1 0 0
5 1 3 3 3
6 1 1 3 3
7 1 3 3 3
8 0 2 0 0
9 1 2 2 2
10 1 1 2 2
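Putting the pieces together as a runnable sketch (the frame below reconstructs the sample from the answer's printout, including its b column):

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 0, 1, 1, 0, 1, 1, 1, 0, 1, 1],
                   'b': [1, 2, 3, 2, 1, 3, 1, 3, 2, 2, 1]})

# A new run starts whenever 'a' differs from the previous row
run_id = df.a.ne(df.a.shift()).cumsum()

# Each row gets its run's length; rows where a != 1 are zeroed out
df['c'] = df.groupby(run_id)['a'].transform('size').where(df.a.eq(1), 0)

# The stated goal: mean of 'b' per run-length group
means = df.groupby('c')['b'].mean()
```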
Below are the code and output; what I'm trying to get is shown in the "exp" column. As you can see, the "countif" column just counts all 5 columns, but I want it to count only the negative values.
So, for example, at index 0 the count should equal 2.
What am I doing wrong?
Python
import pandas as pd
import numpy as np
a = ['A','B','C','B','C','A','A','B','C','C','A','C','B','A']
b = [2,4,1,1,2,5,-1,2,2,3,4,3,3,3]
c = [-2,4,1,-1,2,5,1,2,2,3,4,3,3,3]
d = [-2,-4,1,-1,2,5,1,2,2,3,4,3,3,3]
exp = [2,1,0,2,0,0,1,0,0,0,0,0,0,0]
df1 = pd.DataFrame({'b':b,'c':c,'d':d,'exp':exp}, columns=['b','c','d','exp'])
df1['sumif'] = df1.where(df1<0,0).sum(1)
df1['countif'] = df1.where(df1<0,0).count(1)
df1
# df1.sort_values(['a','countif'], ascending=[True, True])
You don't need where here; you can simply use df.lt with sum(axis=1):
In [1329]: df1['exp'] = df1.lt(0).sum(1)
In [1330]: df1
Out[1330]:
b c d exp
0 2 -2 -2 2
1 4 4 -4 1
2 1 1 1 0
3 1 -1 -1 2
4 2 2 2 0
5 5 5 5 0
6 -1 1 1 1
7 2 2 2 0
8 2 2 2 0
9 3 3 3 0
10 4 4 4 0
11 3 3 3 0
12 3 3 3 0
13 3 3 3 0
EDIT: As per OP's comment, here is a solution restricted to the first three columns with iloc and .lt:
In [1609]: df1['exp'] = df1.iloc[:, :3].lt(0).sum(1)
First, DataFrame.where works differently: it replaces values where the condition is False (here, values greater than or equal to 0) with 0, so it cannot be used for counting:
print (df1.iloc[:, :3].where(df1<0,0))
b c d
0 0 -2 -2
1 0 0 -4
2 0 0 0
3 0 -1 -1
4 0 0 0
5 0 0 0
6 -1 0 0
7 0 0 0
8 0 0 0
9 0 0 0
10 0 0 0
11 0 0 0
12 0 0 0
13 0 0 0
You need to compare the first 3 columns against 0 and sum:
df1['exp1'] = (df1.iloc[:, :3] < 0).sum(1)
#If need compare all columns
#df1['exp1'] = (df1 < 0).sum(1)
print (df1)
b c d exp exp1
0 2 -2 -2 2 2
1 4 4 -4 1 1
2 1 1 1 0 0
3 1 -1 -1 2 2
4 2 2 2 0 0
5 5 5 5 0 0
6 -1 1 1 1 1
7 2 2 2 0 0
8 2 2 2 0 0
9 3 3 3 0 0
10 4 4 4 0 0
11 3 3 3 0 0
12 3 3 3 0 0
13 3 3 3 0 0
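To see concretely why the original countif failed, here's a minimal self-contained contrast using the question's data: where(df1 < 0, 0) writes 0 into the non-negative cells rather than NaN, and count only skips NaN.

```python
import pandas as pd

df1 = pd.DataFrame({'b': [2, 4, 1, 1, 2, 5, -1, 2, 2, 3, 4, 3, 3, 3],
                    'c': [-2, 4, 1, -1, 2, 5, 1, 2, 2, 3, 4, 3, 3, 3],
                    'd': [-2, -4, 1, -1, 2, 5, 1, 2, 2, 3, 4, 3, 3, 3]})

# where keeps negatives and writes 0 elsewhere -- no NaNs appear,
# so count(axis=1) still reports 3 non-null cells in every row
broken = df1.where(df1 < 0, 0).count(axis=1)

# lt(0) yields a boolean mask; summing across columns counts the Trues
fixed = df1.lt(0).sum(axis=1)
```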
I have a data frame with four columns: track, num_tracks, playlist, and cluster. My goal is to create a new data frame with a row containing the track and pid, plus a column for each unique value in cluster holding its corresponding count.
Here is a sample dataframe:
pid track cluster num_track
0 1 6 4
0 2 1 4
0 3 6 4
0 4 3 4
1 5 10 3
1 6 10 3
1 7 1 4
2 8 9 5
2 9 11 5
2 10 2 5
2 11 2 5
2 12 2 5
So my desired output would be:
pid track cluster num_track c1 c2 c3 c4 c5 c6 c7 ... c12
0 1 6 4 1 0 1 0 0 2 0 0
0 2 1 4 1 0 1 0 0 2 0 0
0 3 6 4 1 0 1 0 0 2 0 0
0 4 3 4 1 0 1 0 0 2 0 0
1 5 10 3 1 0 0 0 0 0 0 0
1 6 10 3 1 0 0 0 0 0 0 0
1 7 1 3 1 0 0 0 0 0 0 0
2 8 9 5 0 3 0 0 0 0 0 0
2 9 11 5 0 3 0 0 0 0 0 0
2 10 2 5 0 3 0 0 0 0 0 0
2 11 2 5 0 3 0 0 0 0 0 0
2 12 2 5 0 3 0 0 0 0 0 0
I hope I have presented my question correctly; if anything is unclear, tell me! I don't have enough rep to set up a bounty yet, but I could repost when I do.
Any help would be appreciated!!
You can use crosstab with reindex, then concat back to the original df:
s = pd.crosstab(df.pid, df.cluster).reindex(df.pid)
s.index = df.index
df = pd.concat([df, s.add_prefix('c')], axis=1)
df
Out[209]:
pid track cluster num_track c1 c2 c3 c6 c9 c10 c11
0 0 1 6 4 1 0 1 2 0 0 0
1 0 2 1 4 1 0 1 2 0 0 0
2 0 3 6 4 1 0 1 2 0 0 0
3 0 4 3 4 1 0 1 2 0 0 0
4 1 5 10 3 1 0 0 0 0 2 0
5 1 6 10 3 1 0 0 0 0 2 0
6 1 7 1 4 1 0 0 0 0 2 0
7 2 8 9 5 0 3 0 0 1 0 1
8 2 9 11 5 0 3 0 0 1 0 1
9 2 10 2 5 0 3 0 0 1 0 1
10 2 11 2 5 0 3 0 0 1 0 1
11 2 12 2 5 0 3 0 0 1 0 1
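If aligning indexes by hand feels fragile, an equivalent route is to merge the per-pid counts back onto the rows; a self-contained sketch on the sample data (column values reconstructed from the question):

```python
import pandas as pd

df = pd.DataFrame({'pid':       [0, 0, 0, 0, 1, 1, 1, 2, 2, 2, 2, 2],
                   'track':     list(range(1, 13)),
                   'cluster':   [6, 1, 6, 3, 10, 10, 1, 9, 11, 2, 2, 2],
                   'num_track': [4, 4, 4, 4, 3, 3, 4, 5, 5, 5, 5, 5]})

# One row per pid, one count column per cluster value
counts = pd.crosstab(df.pid, df.cluster).add_prefix('c').reset_index()

# merge broadcasts each pid's counts onto all of that pid's tracks
out = df.merge(counts, on='pid', how='left')
```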
I have a dataset with multiple IDs and dates where I have created a column for Cumulative supply in python.
My data is as follows
SKU Date Demand Supply Cum_Supply
1 20160207 6 2 2
1 20160214 5 0 2
1 20160221 1 0 2
1 20160228 6 0 2
1 20160306 1 0 2
1 20160313 101 0 2
1 20160320 1 0 2
1 20160327 1 0 2
2 20160207 0 0 0
2 20160214 0 0 0
2 20160221 2 0 0
2 20160228 2 0 0
2 20160306 2 0 0
2 20160313 1 0 0
2 20160320 1 0 0
2 20160327 1 0 0
Where Cum_supply was calculated by
idx = pd.MultiIndex.from_product([np.unique(data.Date), data.SKU.unique()])
data2 = data.set_index(['Date', 'SKU']).reindex(idx).fillna(0)
data2 = pd.concat([data2, data2.groupby(level=1).cumsum().add_prefix('Cum_')], axis=1).sort_index(level=1).reset_index()
I want to create a column 'True_Demand': the maximum unfulfilled demand up to that date, max(Demand - Supply), plus Cum_Supply.
So my output would be something this:
SKU Date Demand Supply Cum_Supply True_Demand
1 20160207 6 2 2 6
1 20160214 5 0 2 7
1 20160221 1 0 2 7
1 20160228 6 0 2 8
1 20160306 1 0 2 8
1 20160313 101 0 2 103
1 20160320 1 0 2 103
1 20160327 1 0 2 103
2 20160207 0 0 0 0
2 20160214 0 0 0 0
2 20160221 2 0 0 2
2 20160228 2 0 0 2
2 20160306 2 0 0 2
2 20160313 1 0 0 2
2 20160320 1 0 0 2
2 20160327 1 0 0 2
So for the 3rd record (20160221), the max unfulfilled demand before 20160221 was 5, so the true demand is 5 + 2 = 7, even though the unfulfilled demand on that date itself was only 1 + 2.
Code for the dataframe
data = pd.DataFrame({'SKU':[1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2],
'Date':[20160207,20160214,20160221,20160228,20160306,20160313,20160320,20160327,20160207,20160214,20160221,20160228,20160306,20160313,20160320,20160327],
'Demand':[6,5,1,6,1,101,1,1,0,0,2,2,2,1,1,1],
'Supply':[2,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0]}
,columns=['Date', 'SKU', 'Demand', 'Supply'])
Would you try this pretty fun one-liner?
(data.groupby('SKU',
as_index=False,
group_keys=False)
.apply(lambda x:
x.assign(Cum_Supply=x.Supply.cumsum())
.pipe(lambda x:
x.assign(True_Demand = (x.Demand - x.Supply + x.Cum_Supply).cummax()))))
Output:
Date SKU Demand Supply Cum_Supply True_Demand
0 20160207 1 6 2 2 6
1 20160214 1 5 0 2 7
2 20160221 1 1 0 2 7
3 20160228 1 6 0 2 8
4 20160306 1 1 0 2 8
5 20160313 1 101 0 2 103
6 20160320 1 1 0 2 103
7 20160327 1 1 0 2 103
8 20160207 2 0 0 0 0
9 20160214 2 0 0 0 0
10 20160221 2 2 0 0 2
11 20160228 2 2 0 0 2
12 20160306 2 2 0 0 2
13 20160313 2 1 0 0 2
14 20160320 2 1 0 0 2
15 20160327 2 1 0 0 2
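If the apply/pipe chain is hard to read, the same result can be sketched with plain grouped cumsum and cummax (using the question's data):

```python
import pandas as pd

data = pd.DataFrame({'SKU': [1]*8 + [2]*8,
                     'Date': [20160207, 20160214, 20160221, 20160228,
                              20160306, 20160313, 20160320, 20160327] * 2,
                     'Demand': [6, 5, 1, 6, 1, 101, 1, 1, 0, 0, 2, 2, 2, 1, 1, 1],
                     'Supply': [2, 0, 0, 0, 0, 0, 0, 0] + [0]*8})

# Running supply per SKU
data['Cum_Supply'] = data.groupby('SKU')['Supply'].cumsum()

# True demand: running max of (unfulfilled demand + cumulative supply), per SKU
shortfall = data['Demand'] - data['Supply'] + data['Cum_Supply']
data['True_Demand'] = shortfall.groupby(data['SKU']).cummax()
```

Avoiding apply keeps everything vectorized, which matters on larger frames.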
I have
{"A":[0,1], "B":[4,5], "C":[0,1], "D":[0,1]}
what I want is
A B C D
0 4 0 0
0 4 0 1
0 4 1 0
0 4 1 1
1 4 0 1
...and so on. Basically all the combinations of values for each of the categories.
What would be the best way to achieve this?
If x is your dict:
>>> pandas.DataFrame(list(itertools.product(*x.values())), columns=x.keys())
A C B D
0 0 0 4 0
1 0 0 4 1
2 0 0 5 0
3 0 0 5 1
4 0 1 4 0
5 0 1 4 1
6 0 1 5 0
7 0 1 5 1
8 1 0 4 0
9 1 0 4 1
10 1 0 5 0
11 1 0 5 1
12 1 1 4 0
13 1 1 4 1
14 1 1 5 0
15 1 1 5 1
If you want the columns in a particular order you'll need to switch them afterwards (with, e.g., df[["A", "B", "C", "D"]]).
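One detail worth noting: on Python 3.7+, dicts preserve insertion order, so fixing the key order up front avoids the column switch entirely; a self-contained sketch:

```python
import itertools
import pandas as pd

x = {"A": [0, 1], "B": [4, 5], "C": [0, 1], "D": [0, 1]}

keys = list(x)  # dict order is insertion order on Python 3.7+
rows = list(itertools.product(*(x[k] for k in keys)))
combos = pd.DataFrame(rows, columns=keys)
```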