Pandas groupby sum changes the key, why? - python

I have this dataset called 'event'
 id  event_type_1  event_type_2  event_type_3
234             0             1             0
234             1             0             0
345             0             0             0
and I want to produce this
 id  event_type_1  event_type_2  event_type_3
234             1             1             0
345             0             0             0
I tried using
event.groupby('id').sum()
but that just produced
id  event_type_1  event_type_2  event_type_3
 1             1             1             0
 2             0             0             0
The id has been replaced with an incremental value starting at 1. Why does this happen, and how do I get my desired result?

Use the as_index=False parameter:
In [163]: event.groupby('id', as_index=False).sum()
Out[163]:
    id  event_type_1  event_type_2  event_type_3
0  234             1             1             0
1  345             0             0             0
From the docs:
as_index : boolean, default True
For aggregated output, return object with group labels as the index.
Only relevant for DataFrame input. as_index=False is effectively
“SQL-style” grouped output
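For reference, a minimal runnable sketch that rebuilds the question's frame; groupby('id').sum().reset_index() is an equivalent way to get id back as a column:
import pandas as pd

# Rebuild the question's frame
event = pd.DataFrame({
    'id': [234, 234, 345],
    'event_type_1': [0, 1, 0],
    'event_type_2': [1, 0, 0],
    'event_type_3': [0, 0, 0],
})

# Keep 'id' as a column instead of moving it into the index
print(event.groupby('id', as_index=False).sum())

# Equivalent: group with the default behaviour, then move the index back out
print(event.groupby('id').sum().reset_index())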

Related

How to print the value counts of all the columns in a dataframe

I have a dataset with 24 dependent variables. I want to count the occurrences of each value (class) in all the columns.
I have used the following code:
for i in target_c:
    print(f'{target_col[i].value_counts}')
The output is as follows:
<bound method IndexOpsMixin.value_counts of 0 0
1 0
2 0
3 0
4 0
..
13647304 0
13647305 0
13647306 0
13647307 0
13647308 0
Name: saving_account, Length: 13647309, dtype: int64>
<bound method IndexOpsMixin.value_counts of 0 0
1 0
2 0
3 0
4 0
..
13647304 0
13647305 0
13647306 0
13647307 0
13647308 0
The expected output is like this for all the columns.
saving_account
0 13645913
1 1396
dtype: int64
As @Karl Knechtel said, value_counts is a method, so it has to be called with parentheses; printing target_col[i].value_counts without them just prints the bound method itself. On a whole dataframe df you can also use:
df.value_counts()
Or group by the class column and sum the rest:
import pandas as pd
df = pd.read_csv('sample.csv')
df.groupby(['Target']).sum()
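For per-column counts (the output the question asks for), the key fix is calling value_counts with parentheses; a sketch on a small stand-in frame (the data here is made up for illustration):
import pandas as pd

# Hypothetical stand-in for the question's data
df = pd.DataFrame({'saving_account': [0, 0, 0, 0, 1],
                   'current_account': [1, 1, 0, 0, 0]})

# Per-column class counts: note the () that actually calls the method
for col in df.columns:
    print(df[col].value_counts())

# Or all columns at once; NaN marks classes absent from a column
print(df.apply(pd.Series.value_counts))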

Convert Dictionary to Pandas in Python

I have a dict as follows:
data_dict = {'1.160.139.117': ['712907','742068'],
'1.161.135.205': ['667386','742068'],
'1.162.51.21': ['326136', '663056', '742068']}
I want to convert the dict into a dataframe:
df= pd.DataFrame.from_dict(data_dict, orient='index')
How can I create a dataframe whose columns represent the values of the dictionary and whose rows represent its keys, as below?
              326136 663056 667386 712907 742068
1.160.139.117      0      0      0      1      1
1.161.135.205      0      0      1      0      1
1.162.51.21        1      1      0      0      1
The best option is Option 4:
pd.get_dummies(df.stack()).sum(level=0)
Option 1:
One way you could do it:
df.stack().reset_index(level=1)\
.set_index(0,append=True)['level_1']\
.unstack().notnull().mul(1)
Output:
326136 663056 667386 712907 742068
1.160.139.117 0 0 0 1 1
1.161.135.205 0 0 1 0 1
1.162.51.21 1 1 0 0 1
Option 2
Or with a little reshaping and pd.crosstab:
df2 = df.stack().reset_index(name='Values')
pd.crosstab(df2.level_0,df2.Values)
Output:
Values 326136 663056 667386 712907 742068
level_0
1.160.139.117 0 0 0 1 1
1.161.135.205 0 0 1 0 1
1.162.51.21 1 1 0 0 1
Option 3
df.stack().reset_index(name="Values")\
.pivot(index='level_0',columns='Values')['level_1']\
.notnull().astype(int)
Output:
Values 326136 663056 667386 712907 742068
level_0
1.160.139.117 0 0 0 1 1
1.161.135.205 0 0 1 0 1
1.162.51.21 1 1 0 0 1
Option 4 (@Wen pointed out this short solution, the fastest so far)
pd.get_dummies(df.stack()).sum(level=0)
Output:
326136 663056 667386 712907 742068
1.160.139.117 0 0 0 1 1
1.161.135.205 0 0 1 0 1
1.162.51.21 1 1 0 0 1
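A self-contained version of Option 4, built from the question's data_dict. One caveat: sum(level=0) was deprecated in pandas 1.3 and removed in 2.0, so on current versions the equivalent groupby(level=0).sum() is needed:
import pandas as pd

data_dict = {'1.160.139.117': ['712907', '742068'],
             '1.161.135.205': ['667386', '742068'],
             '1.162.51.21': ['326136', '663056', '742068']}

df = pd.DataFrame.from_dict(data_dict, orient='index')

# One-hot encode the stacked values, then collapse back to one row per key
# (.sum(level=0) on older pandas; .groupby(level=0).sum() on pandas >= 2.0)
result = pd.get_dummies(df.stack()).groupby(level=0).sum()
print(result)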

Using two different data frames to compute new variable

I have two dataframes of the same dimensions that look like:
df1
ID flag
0 1
1 0
2 1
df2
ID flag
0 0
1 1
2 0
In both dataframes I want to create a new variable that denotes an additive flag. So the new variable will look like this:
df1
ID flag new_flag
0 1 1
1 0 1
2 1 1
df2
ID flag new_flag
0 0 1
1 1 1
2 0 1
So if either flag column is a 1, the new flag will be a 1.
I tried this code:
df1['new_flag']= 1
df2['new_flag']= 1
df1['new_flag'][(df1['flag']==0)&(df1['flag']==0)]=0
df2['new_flag'][(df2['flag']==0)&(df2['flag']==0)]=0
I would expect the same number of 1s in both new_flag columns, but they differ. Is this because I'm not going row by row, as in this question?
pandas create new column based on values from other columns
If so, how do I include criteria from both dataframes?
You can use np.logical_or to achieve this (here df1 is set to all 0s except the last row, so we don't just get a column of 1s); casting the result with astype(int) converts the boolean array to 1s and 0s:
In [108]:
import numpy as np
df1['new_flag'] = np.logical_or(df1['flag'], df2['flag']).astype(int)
df2['new_flag'] = np.logical_or(df1['flag'], df2['flag']).astype(int)
df1
df1
Out[108]:
ID flag new_flag
0 0 0 0
1 1 0 1
2 2 1 1
In [109]:
df2
Out[109]:
ID flag new_flag
0 0 0 0
1 1 1 1
2 2 0 1
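A numpy-free sketch of the same idea: compare each flag column to 1 and combine the boolean Series with the element-wise | operator (the data below matches the answer's modified df1):
import pandas as pd

df1 = pd.DataFrame({'ID': [0, 1, 2], 'flag': [0, 0, 1]})
df2 = pd.DataFrame({'ID': [0, 1, 2], 'flag': [0, 1, 0]})

# Element-wise OR across the two frames, cast back to 0/1
new_flag = ((df1['flag'] == 1) | (df2['flag'] == 1)).astype(int)
df1['new_flag'] = new_flag
df2['new_flag'] = new_flag
print(df1)
print(df2)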

pandas - groupby by partial string

I would like to group a DataFrame by partial substrings. This is a sample .csv file:
GridCode,Key
1000,Colour
1000,Colours
1001,Behaviours
1001,Behaviour
1002,Favourite
1003,COLORS
1004,Honours
What I did so far is import the file with df = pd.read_csv('sample.csv') and lowercase all the strings with df['Key'] = df['Key'].str.lower(). The first thing I tried is grouping by GridCode and Key with:
g = df.groupby([df['GridCode'],df['Key']]).size()
then unstack and fill:
d = g.unstack().fillna(0)
and the resulting DataFrame is:
Key behaviour behaviours colors colour colours favourite honours
GridCode
1000 0 0 0 1 1 0 0
1001 1 1 0 0 0 0 0
1002 0 0 0 0 0 1 0
1003 0 0 1 0 0 0 0
1004 0 0 0 0 0 0 1
Now what I would like to do is count only the strings containing the substring 'our' (which in this case excludes only the colors key), creating a new column for the desired substring.
The expected result would be like:
Key 'our'
GridCode
1000 2
1001 2
1002 1
1003 0
1004 1
I also tried masking the DataFrame with mask = df['Key'].str.contains('our') and then df1 = df[mask], but I can't figure out how to make a new column with the new groupby counts. Any help would be really appreciated.
>>> import re  # for the re.IGNORECASE flag
>>> # flags= must be passed by name; positionally it would bind to the case parameter
>>> df['Key'].str.contains('our', flags=re.IGNORECASE).groupby(df['GridCode']).sum()
GridCode
1000 2
1001 2
1002 1
1003 0
1004 1
Name: Key, dtype: float64
Also, instead of
df.groupby([df['GridCode'],df['Key']])
it is better to do:
df.groupby(['GridCode', 'Key'])
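Putting it together as a runnable sketch on the question's data; case=False is an equivalent way to ignore case (COLORS still contains no 'our', so the counts match the expected output):
import pandas as pd

df = pd.DataFrame({'GridCode': [1000, 1000, 1001, 1001, 1002, 1003, 1004],
                   'Key': ['Colour', 'Colours', 'Behaviours', 'Behaviour',
                           'Favourite', 'COLORS', 'Honours']})

# True where Key contains 'our' (case-insensitive), counted per GridCode
counts = df['Key'].str.contains('our', case=False).groupby(df['GridCode']).sum()
print(counts)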

retaining order of columns after pivot

I have an N x 3 DataFrame called A that looks like this:
_Segment _Article Binaire
0 550 5568226 1
1 550 5612047 1
2 550 5909228 1
3 550 5924375 1
4 550 5924456 1
5 550 6096557 1
....
The variable _Article is uniquely defined in A (there are N unique values of _Article in A).
I do a pivot:
B=A.pivot(index='_Segment', columns='_Article')
then replace the missing NaN values with zeros:
B[np.isnan(B)]=0
and get:
Binaire \
_Article 2332299 2332329 2332337 2932377 2968223 3195643 3346080
_Segment
550 0 0 0 0 0 0 0
551 0 0 0 0 0 0 0
552 0 0 0 0 0 0 0
553 1 1 1 0 0 0 1
554 0 0 0 1 0 1 0
where columns were sorted lexicographically during the pivot.
My question is: how do I retain the sort order of _Article in A in the columns of B?
Thanks!
I think I got it. This works:
First, store the _Article column:
order_art = A['_Article']
In the pivot, add the values argument to avoid hierarchical columns (see http://pandas.pydata.org/pandas-docs/stable/reshaping.html), which would prevent reindex from working properly:
B = A.pivot(index='_Segment', columns='_Article', values='Binaire')
then, as before, replace the NaNs with zeros:
B[np.isnan(B)]=0
and finally use reindex to restore the original order of variable _Article across columns:
B=B.reindex(columns=order_art)
Are there more elegant solutions?
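One more compact variant, as a sketch on a small slice of the question's data: chaining reindex with fillna does the nan replacement in one step, and drop_duplicates keeps reindex safe should an _Article value ever repeat:
import pandas as pd

A = pd.DataFrame({'_Segment': [550, 550, 551],
                  '_Article': [5612047, 5568226, 5909228],
                  'Binaire':  [1, 1, 1]})

# values= keeps the columns flat so reindex can match them directly
B = A.pivot(index='_Segment', columns='_Article', values='Binaire')

# Restore the original _Article order and replace NaN with 0
B = B.reindex(columns=A['_Article'].drop_duplicates()).fillna(0).astype(int)
print(B)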
