I have a dataframe that looks like
day type col d_1 d_2 d_3 d_4 d_5...
1 A 1 1 0 1 0
1 A 2 1 0 1 0
2 B 1 1 1 0 0
That is, I have one normal column (col) and many columns prefixed by d_.
I need to group by day and type and compute the sum of the values in every d_ column for every day-type combination. I also need to apply other aggregation functions to the other columns in my data (such as col in the example).
I can use:
agg_df=df.groupby(['day','type']).agg({'d_1': 'sum', 'col': 'mean'})
but this computes the sum only for one d_ column. How can I specify all the possible d_ columns in my data?
In other words, I would like to write something like
agg_df=df.groupby(['day','type']).agg({'d_*': 'sum', 'col': 'mean'})
so that the expected output is:
day type col d_1 d_2 d_3 d_4 d_5...
1 A 1.5 2 0 2 0 ...
2 B 1 1 1 0 0
As you can see, col is aggregated by mean, while the d_ columns are summed.
Thanks for your help!
IIUC you need to subset your groupby with the d_* columns. You can find those columns with str.contains and pass them to the groupby:
cols = df.columns[df.columns.str.contains('(d_)+|col')]
agg_df=df.groupby(['day','type'])[cols].sum()
In [150]: df
Out[150]:
day type col d_1 d_2 d_3 d_4
0 1 A 1 1 0 1 0
1 1 A 2 1 0 1 0
2 2 B 1 1 1 0 0
In [155]: agg_df
Out[155]:
col d_1 d_2 d_3 d_4
day type
1 A 3 2 0 2 0
2 B 1 1 1 0 0
Note: I added the col column to the contains pattern as you requested. You can put whatever regex you want there and combine patterns with the | symbol.
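If you also want col aggregated by mean rather than summed, as in your expected output, a minimal sketch (assuming the d_ columns can be identified with str.startswith) is to build the aggregation dict programmatically:
d_cols = df.columns[df.columns.str.startswith('d_')]
agg_dict = {c: 'sum' for c in d_cols}
agg_dict['col'] = 'mean'
agg_df = df.groupby(['day', 'type']).agg(agg_dict)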
You can use filter:
In [23]: df.groupby(['day','type'], as_index=False)[df.filter(regex='d_.*').columns].sum()
Out[23]:
day type d_1 d_2 d_3 d_4
0 1 A 2 0 2 0
1 2 B 1 1 0 0
If you want to apply all the functions in one shot:
import numpy as np

dic = {}
dic.update({i: np.sum for i in df.filter(regex='d_.*').columns})
dic.update({'col': np.mean})
In [48]: df.groupby(['day','type'], as_index=False).agg(dic)
Out[48]:
   day type  d_2  d_3  d_1  col  d_4
0    1    A    0    2    2  1.5    0
1    2    B    1    0    1  1.0    0
Related
I have two pandas dataframes that share key columns; the first has columns id, month, c and the second has id, month, price.
df1
id month c
1 1 TE
2 1 TE
1 1 NTE
2 1 NTE
df2
id month price
1 1 4
2 1 6
I want to merge these dataframes on the id and month columns combined, so I did the following:
df1.merge(df2, how='left', left_on=['id', 'month'], right_on=['id', 'month'])
The result of the above code is as expected. Now, what I want is that, after the merge, only one row per id & month pair keeps its price; the other rows for that pair should have price 0.
So the result should look like
id month c price
1 1 TE 4
2 1 TE 6
1 1 NTE 0
2 1 NTE 0
This could be done by checking each row one at a time, but I think that is very expensive (n*n complexity).
Any leads on a cheaper approach are most welcome.
df.price *= ~df.groupby(["id", "month"]).cumcount().astype(bool)
I use .cumcount() as an "is this the first row in the group" mask:
In [89]: df
Out[89]:
id month c price
0 1 1 TE 4
1 2 1 TE 6
2 1 1 NTE 4
3 2 1 NTE 6
In [90]: df.groupby(["id", "month"]).cumcount()
Out[90]:
0 0
1 0
2 1
3 1
dtype: int64
In [91]: ~_.astype(bool)
Out[91]:
0 True
1 True
2 False
3 False
dtype: bool
In [92]: df.price *= _
In [93]: df
Out[93]:
id month c price
0 1 1 TE 4
1 2 1 TE 6
2 1 1 NTE 0
3 2 1 NTE 0
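The same "keep only the first row per group" logic can also be sketched with duplicated, which marks every occurrence of an (id, month) pair after the first:
# rows marked True are the second, third, ... occurrence of each (id, month) pair
df.loc[df.duplicated(["id", "month"]), "price"] = 0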
merged = df1.merge(df2, how='left', on=['id', 'month'])

def f(x):
    # zero out every row in the group except the first one
    x.iloc[1:] = 0
    return x

merged['price'] = merged.groupby(['id', 'month'])['price'].transform(f)
Note: if the key column names are the same in both dataframes, you don't have to specify left_on and right_on separately; on=['id', 'month'] is enough.
You can use a cumcount as an additional merge key and fill the unmatched rows:
cols = ['id', 'month']
(df1.assign(rank=df1.groupby(cols).cumcount())
    .merge(df2.assign(rank=0), how='left',
           on=cols + ['rank'])
    .fillna({'price': 0}, downcast='infer')
    .drop(columns='rank')
)
output:
id month c price
0 1 1 TE 4
1 2 1 TE 6
2 1 1 NTE 0
3 2 1 NTE 0
If the key columns have different names in the two dataframes:
cols_left = ['id', 'month']
cols_right = ['id', 'month']
(df1.assign(rank=df1.groupby(cols_left).cumcount())
    .merge(df2.assign(rank=0), how='left',
           left_on=cols_left + ['rank'],
           right_on=cols_right + ['rank'])
    .fillna({'price': 0}, downcast='infer')
    .drop(columns='rank')
)
I have data that looks like:
index stringColumn
0 A_B_B_B_C_C_D
1 A_B_C_D
2 B_C_D_E_F
3 A_E_F_F_F
I need to vectorize this stringColumn with counts, ending up with:
index A B C D E F
0 1 3 2 1 0 0
1 1 1 1 1 0 0
2 0 1 1 1 1 1
3 1 0 0 0 1 3
Therefore I need to do both counting and splitting. Pandas' str.get_dummies() lets me split the string with the sep='_' argument, but it does not count repeated values. pd.get_dummies() does the counting but does not accept a separator.
My data is huge so I am looking for vectorized solutions, rather than for loops.
You can use Series.str.split with get_dummies and sum:
df1 = (pd.get_dummies(df['stringColumn'].str.split('_', expand=True),
                      prefix='', prefix_sep='')
         .sum(level=0, axis=1))
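Note that DataFrame.sum(level=...) has since been removed from pandas; a sketch of an equivalent for recent versions (assuming pandas 2.x) groups the duplicated column names after transposing:
df1 = (pd.get_dummies(df['stringColumn'].str.split('_', expand=True),
                      prefix='', prefix_sep='')
       .T.groupby(level=0).sum().T)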
Or count the values per row with value_counts, replace missing values with DataFrame.fillna and convert to integers:
df1 = (df['stringColumn'].str.split('_', expand=True)
         .apply(pd.value_counts, axis=1)
         .fillna(0)
         .astype(int))
Or use collections.Counter, performance should be very good:
from collections import Counter
df1 = (pd.DataFrame([Counter(x.split('_')) for x in df['stringColumn']])
         .fillna(0)
         .astype(int))
Or reshape by DataFrame.stack and count by SeriesGroupBy.value_counts:
df1 = (df['stringColumn'].str.split('_', expand=True)
         .stack()
         .groupby(level=0)
         .value_counts()
         .unstack(fill_value=0))
print (df1)
A B C D E F
0 1 3 2 1 0 0
1 1 1 1 1 0 0
2 0 1 1 1 1 1
3 1 0 0 0 1 3
We have two data sets with one variable, col1.
Some levels are missing in the second data set. For example, let
import pandas as pd
df1 = pd.DataFrame({'col1':["A","A","B","C","D","E"]})
df2 = pd.DataFrame({'col1':["A","B","D","E"]})
When we factorize df1
df1["f_col1"]= pd.factorize(df1.col1)[0]
df1
we get
col1 f_col1
0 A 0
1 A 0
2 B 1
3 C 2
4 D 3
5 E 4
But when we do it for df2
df2["f_col1"]= pd.factorize(df2.col1)[0]
df2
we get
col1 f_col1
0 A 0
1 B 1
2 D 2
3 E 3
This is not what I want. I want to keep the same factorization across both data sets, i.e. in df2 we should have something like
col1 f_col1
0 A 0
1 B 1
2 D 3
3 E 4
Thanks.
PS: The two data sets are not always available at the same time, so I cannot concat them. The mapping should be stored from df1 and applied to df2 when it becomes available.
You could concatenate the two DataFrames, then apply pd.factorize once to the entire column:
import pandas as pd
df1 = pd.DataFrame({'col1':["A","B","C","D","E"]})
df2 = pd.DataFrame({'col1':["A","B","D","E"]})
df = pd.concat({'df1':df1, 'df2':df2})
df['f_col1'], uniques = pd.factorize(df['col1'])
print(df)
yields
col1 f_col1
df1 0 A 0
1 B 1
2 C 2
3 D 3
4 E 4
df2 0 A 0
1 B 1
2 D 3
3 E 4
To extract df1 and df2 from df you could use df.loc:
In [116]: df.loc['df1']
Out[116]:
col1 f_col1
0 A 0
1 B 1
2 C 2
3 D 3
4 E 4
In [117]: df.loc['df2']
Out[117]:
col1 f_col1
0 A 0
1 B 1
2 D 3
3 E 4
(But note that since the performance of vectorized operations improves if you can apply them once to a large DataFrame instead of multiple times to smaller DataFrames, you might be better off keeping df and ditching df1 and df2...)
Alternatively, if you must generate df1['f_col1'] first, and then compute
df2['f_col1'] later, you could use merge to join df1 and df2 on col1:
import pandas as pd
df1 = pd.DataFrame({'col1':["A","B","C","D","E"]})
df2 = pd.DataFrame({'col1':["A","B","D","E"]})
df1['f_col1'], uniques = pd.factorize(df1['col1'])
df2 = pd.merge(df2, df1, how='left')
print(df2)
yields
col1 f_col1
0 A 0
1 B 1
2 D 3
3 E 4
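If df1 may not be available at all when df2 arrives, a minimal sketch of that deferred workflow is to persist only the mapping implied by pd.factorize's uniques and reapply it later (the name mapping is illustrative):
codes, uniques = pd.factorize(df1['col1'])
df1['f_col1'] = codes
mapping = {val: code for code, val in enumerate(uniques)}  # store this, e.g. to disk

# later, when df2 becomes available:
df2['f_col1'] = df2['col1'].map(mapping)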
You could reuse the f_col1 column of df1 and map the values of df2.col1 after setting df1's index to col1:
In [265]: df2.col1.map(df1.set_index('col1').f_col1)
Out[265]:
0 0
1 1
2 3
3 4
Details
In [266]: df2['f_col1'] = df2.col1.map(df1.set_index('col1').f_col1)
In [267]: df2
Out[267]:
col1 f_col1
0 A 0
1 B 1
2 D 3
3 E 4
In case df1 has duplicate records, drop them first using drop_duplicates:
In [290]: df1
Out[290]:
col1 f_col1
0 A 0
1 A 0
2 B 1
3 C 2
4 D 3
5 E 4
In [291]: df2.col1.map(df1.drop_duplicates().set_index('col1').f_col1)
Out[291]:
0 0
1 1
2 3
3 4
Name: col1, dtype: int32
You want to get unique values across both sets of data. Then create a series or a dictionary. This is your factorization that can be used across both data sets. Use map to get the output you are looking for.
import numpy as np

u = np.unique(np.append(df1.col1.values, df2.col1.values))
f = pd.Series(range(len(u)), u)  # this is the factorization
Assign with map
df1['f_col1'] = df1.col1.map(f)
df2['f_col1'] = df2.col1.map(f)
print(df1)
col1 f_col1
0 A 0
1 A 0
2 B 1
3 C 2
4 D 3
5 E 4
print(df2)
col1 f_col1
0 A 0
1 B 1
2 D 3
3 E 4
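Another way to pin the encoding, sketched here under the assumption that the full set of levels is known up front, is an explicit Categorical; values outside the known levels get code -1:
levels = ["A", "B", "C", "D", "E"]  # assumed known ahead of time
df2["f_col1"] = pd.Categorical(df2["col1"], categories=levels).codes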
I am joining two dataframes (a, b) that have identical column names, using the user ID key, and while joining I had to supply a suffix for the join to work. The following is the command I used:
a.join(b,how='inner', on='userId',lsuffix="_1")
If I don't use the suffix, I get an error. But I don't want the column names to change, because that causes problems in other analyses. So I want to remove this "_1" suffix from all the column names of the resulting dataframe. Can anybody suggest an efficient way to remove the last two characters from the names of all the columns in a pandas dataframe?
Thanks
This snippet should get the job done:
df.columns = pd.Index(map(lambda x: str(x)[:-2], df.columns))
Edit: this is a better way to do it (note that rename returns a new dataframe, so assign the result back):
df = df.rename(columns=lambda x: str(x)[:-2])
In both cases, all we're doing is iterating through the columns and applying a function. In this case, the function converts the name to a string and drops the last two characters.
I'm sure there are a few other ways you could do this.
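A safer variant, sketched under the assumption that the suffix is literally '_1', removes it only where it actually appears at the end of a name, so columns that never received the suffix are left untouched:
df.columns = df.columns.str.replace(r'_1$', '', regex=True)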
You could use str.rstrip like so (note that rstrip strips a set of trailing characters rather than a literal suffix, so '_1' means "strip any trailing '_' or '1' characters"; it works for this example but could over-strip a name like 'col1'):
In [214]: import functools as ft
In [215]: f = ft.partial(np.random.choice, *[5, 3])
In [225]: df = pd.DataFrame({'a': f(), 'b': f(), 'c': f(), 'a_1': f(), 'b_1': f(), 'c_1': f()})
In [226]: df
Out[226]:
a b c a_1 b_1 c_1
0 4 2 0 2 3 2
1 0 0 3 2 1 1
2 4 0 4 4 4 3
In [227]: df.columns = df.columns.str.rstrip('_1')
In [228]: df
Out[228]:
a b c a b c
0 4 2 0 2 3 2
1 0 0 3 2 1 1
2 4 0 4 4 4 3
However, if you need something more flexible (albeit probably a bit slower), you can use str.extract, which, with the power of regexes, lets you select exactly which part of the column name you would like to keep:
In [216]: df = pd.DataFrame({f'{c}_{i}': f() for i in range(3) for c in 'abc'})
In [217]: df
Out[217]:
a_0 b_0 c_0 a_1 b_1 c_1 a_2 b_2 c_2
0 0 1 0 2 2 4 0 0 3
1 0 0 3 1 4 2 4 3 2
2 2 0 1 0 0 2 2 2 1
In [223]: df.columns = df.columns.str.extract(r'(.*)_\d+')[0]
In [224]: df
Out[224]:
0 a b c a b c a b c
0 0 1 0 2 2 4 0 0 3
1 0 0 3 1 4 2 4 3 2
2 2 0 1 0 0 2 2 2 1
Idea to use df.columns.str came from this answer
We have one dataframe like
-0.140447131 0.124802527 0.140780106
0.062166349 -0.121484447 -0.140675515
-0.002989106 0.13984927 0.004382326
and the other as
1
1
2
We need to concat both dataframes like
-0.140447131 0.124802527 0.140780106 1
0.062166349 -0.121484447 -0.140675515 1
-0.002989106 0.13984927 0.004382326 2
Let's say your first dataframe is like
In [281]: df1
Out[281]:
a b c
0 -0.140447 0.124803 0.140780
1 0.062166 -0.121484 -0.140676
2 -0.002989 0.139849 0.004382
And the second one like
In [283]: df2
Out[283]:
d
0 1
1 1
2 2
Then you could create a new column in df1 using df2:
In [284]: df1['d_new'] = df2['d']
In [285]: df1
Out[285]:
a b c d_new
0 -0.140447 0.124803 0.140780 1
1 0.062166 -0.121484 -0.140676 1
2 -0.002989 0.139849 0.004382 2
The assumption, however, is that both dataframes share a common index.
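If the two dataframes do not share a common index (for example after filtering one of them), a sketch that sidesteps index alignment is to reset both indexes first:
df1 = df1.reset_index(drop=True)
df2 = df2.reset_index(drop=True)
df1['d_new'] = df2['d']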
Use pd.concat and specify axis=1 (columns):
df_new = pd.concat([df1, df2], axis=1)
>>> df_new
          0         1         2  0
0 -0.140447  0.124803  0.140780  1
1  0.062166 -0.121484 -0.140676  1
2 -0.002989  0.139849  0.004382  2