Does groupby concatenate the columns? - Python

I have a "1000 rows * 4 columns" DataFrame:

   a   b   c  d
   1  aa  93  4
   2  bb  32  3
...
1000  nn  78  2
**[1283 rows x 4 columns]**
and I use groupby to group them based on 3 of the columns:

df = df.groupby(['a','b','c']).sum()
print(df)
              d
a    b  c
1    aa 93   12
2    bb 32   53
...
1000 nn 78   38
**[1283 rows x 1 columns]**
However, the result gives me a "1000 rows * 1 columns" DataFrame. So my question is: does groupby concatenate the columns into one column? If yes, how can I prevent that? I want to plot my data after grouping it, but I can't, since it only sees one column instead of all 4.
Edit: when I call the columns I only get the last column; that means it can't read 'a', 'b', 'c' as columns. Why is that, and how can I mark them as columns again?
df.columns
Index([u'd'], dtype='object')

By default, groupby moves the grouping keys into the (row) index, so after the aggregation only d is left as a column; nothing is concatenated away. You can do it this way:
df.groupby(['a','b','c'], as_index=False).sum()
or:
df.groupby(['a','b','c']).sum().reset_index()
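For instance, a minimal sketch on a toy frame (column names as in the question, values made up) showing that the grouping keys come back as ordinary columns:

import pandas as pd

df = pd.DataFrame({'a': [1, 1, 2], 'b': ['aa', 'aa', 'bb'],
                   'c': [93, 93, 32], 'd': [4, 8, 3]})

out = df.groupby(['a', 'b', 'c'], as_index=False).sum()
print(out.columns)
# Index(['a', 'b', 'c', 'd'], dtype='object')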

Related

Python - Column in CSV file contains multiple delimiters and results

I have quite a large CSV file that has multiple columns (no delimiters) and one column which contains results that use three delimiters.
The main delimiter is ";", which separates days of results.
The second delimiter is ":", which separates results per day (I am only using 2 results out of a possible of 6).
The third delimiter is "/", which separates the result day and the calendar value of the result.
I want to avoid looping through the "X&Y" column as much as possible as the column itself contains many delimited results, and there are a lot of rows.
Col1  Col2  X&Y
A     B     20200331/1D::::1:2;20200401/2D::::3:4;20200402/3D::::5:6
AA    BB    20210330/1Y::::11:22;20220330/2Y::::33:44;20230330/3Y::::55:66
I want to see:
Col1  Col2  Date      CalendarValue  X   Y
A     B     20200331  1D             1   2
A     B     20200401  2D             3   4
A     B     20200402  3D             5   6
AA    BB    20210330  1Y             11  22
AA    BB    20220330  2Y             33  44
AA    BB    20230330  3Y             55  66
import pandas as pd

df = pd.DataFrame({'Col1': ['A', 'AA'], 'Col2': ['B', 'BB'],
                   'Col3': ['20200331/1D::::1:2;20200401/2D::::3:4;20200402/3D::::5:6',
                            '20210330/1Y::::11:22;20220330/2Y::::33:44;20230330/3Y::::55:66']})
Here is a solution you can try: split on the delimiter (;), then explode to transform the lists into rows, extract the features with a named-group regex, and finally concat the frames to get the resulting frame.
import pandas as pd
import re

df = pd.DataFrame({'Col1': ['A', 'AA'], 'Col2': ['B', 'BB'],
                   'Col3': ['20200331/1D::::1:2;20200401/2D::::3:4;20200402/3D::::5:6',
                            '20210330/1Y::::11:22;20220330/2Y::::33:44;20230330/3Y::::55:66']})

df['Col3'] = df['Col3'].str.split(";")

# extract features from the string
extract_ = re.compile(r"(?P<Date>\w+)/(?P<CalendarValue>\w+):+(?P<X>.+):(?P<Y>.+)")

pd.concat([
    df.drop(columns='Col3'),
    df['Col3'].explode().str.extract(extract_, expand=True)
], axis=1)
Out[*]:
  Col1 Col2      Date CalendarValue   X   Y
0    A    B  20200331            1D   1   2
0    A    B  20200401            2D   3   4
0    A    B  20200402            3D   5   6
1   AA   BB  20210330            1Y  11  22
1   AA   BB  20220330            2Y  33  44
1   AA   BB  20230330            3Y  55  66
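As a quick sanity check of the pattern itself (a minimal sketch on one sample value from the question):

m = extract_.match('20200331/1D::::1:2')
print(m.groupdict())
# {'Date': '20200331', 'CalendarValue': '1D', 'X': '1', 'Y': '2'}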

Retain the values only in those rows of the column based on the condition on other columns in pandas

I have a dataframe df_in, which contains column names that start with pi and pm.
df_in = pd.DataFrame([[1, 2, 3, 4, "", 6, 7, 8, 9],
                      ["", 1, 32, 43, 59, 65, "", 83, 97],
                      ["", 51, 62, 47, 58, 64, 74, 86, 99],
                      [73, 51, 42, 67, 54, 65, "", 85, 92]],
                     columns=["piabc","pmed","pmrde","pmret","pirtc","pmere","piuyt","pmfgf","pmthg"])
If a row is blank in a column whose name starts with pi, make the same row blank in the pm columns that follow, up until the next column that starts with pi. Repeat the same process for the other pi columns.
Expected Output:
df_out = pd.DataFrame([[1, 2, 3, 4, "", "", 7, 8, 9],
                       ["", "", "", "", 59, 65, "", "", ""],
                       ["", "", "", "", 58, 64, 74, 86, 99],
                       [73, 51, 42, 67, 54, 65, "", "", ""]],
                      columns=["piabc","pmed","pmrde","pmret","pirtc","pmere","piuyt","pmfgf","pmthg"])
How to do it?
You can create groups by testing the column names with str.startswith and taking a cumulative sum; then, per group, broadcast the emptiness test of the leading pi column with groupby + transform, and use that boolean mask in DataFrame.mask to blank out the values:
g = df_in.columns.str.startswith('pi').cumsum()
df = df_in.mask(df_in.eq('').groupby(g, axis=1).transform(lambda x: x.iat[0]), '')

# 'first' failed for me in pandas 1.2.3:
# df = df_in.mask(df_in.eq('').groupby(g, axis=1).transform('first'), '')
print(df)
  piabc pmed pmrde pmret pirtc pmere piuyt pmfgf pmthg
0     1    2     3     4                7     8     9
1                         59    65
2                         58    64    74    86    99
3    73   51    42    67    54    65
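For reference, the group labels g contain one run per leading pi column (a quick check; note that groupby(..., axis=1) is deprecated in recent pandas, where transposing first is the usual workaround):

print(g)
# [1 1 1 1 2 2 3 3 3]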

Stacking columns one below other when the column names are same

I have a huge data set in a pandas data frame. It looks something like this
df = pd.DataFrame([[1,2,3,4],[31,14,13,11],[115,613,1313,1]], columns=['c1','c1','c2','c2'])
Here the first two columns have the same name, so they should be concatenated into a single column so that the values sit one below another. The dataframe should look something like this:
df1 = pd.DataFrame([[1,3],[31,13],[115,1313],[2,4],[14,11],[613,1]], columns=['c1','c2'])
Note: my original dataframe has many columns, so I cannot use a simple concat to stack them. I also tried the stack function, apart from concat. What can I do?
Use groupby + cumcount to create a pd.MultiIndex. Reassign the columns with the new pd.MultiIndex and stack:
df = pd.DataFrame(
    [[1, 2, 3, 4], [31, 14, 13, 11], [115, 613, 1313, 1]],
    columns=['c1', 'c1', 'c2', 'c2'])

df1 = df.copy()
df1.columns = [df.columns, df.columns.to_series().groupby(level=0).cumcount()]
print(df1.stack().reset_index(drop=True))
    c1    c2
0    1     3
1    2     4
2   31    13
3   14    11
4  115  1313
5  613     1
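For reference, the reassigned columns form a two-level MultiIndex pairing each name with its occurrence count (exact repr varies by pandas version):

print(df1.columns)
# MultiIndex([('c1', 0), ('c1', 1), ('c2', 0), ('c2', 1)])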
Or, with a bit of creativity, in one line:
df.T.set_index(
    df.T.groupby([df.columns]).cumcount(),
    append=True
).unstack().T.reset_index(drop=True)
    c1    c2
0    1     3
1    2     4
2   31    13
3   14    11
4  115  1313
5  613     1
You could melt the dataframe, then count entries within each column to use as an index for the new dataframe, and then unstack it back, like this:
import pandas as pd

df = pd.DataFrame(
    [[1, 2, 3, 4], [31, 14, 13, 11], [115, 613, 1313, 1]],
    columns=['c1', 'c1', 'c2', 'c2'])

df1 = (pd.melt(df, var_name='column')
         .assign(n=lambda x: x.groupby('column').cumcount())
         .set_index(['n', 'column'])
         .unstack())
df1.columns = df1.columns.get_level_values(1)
print(df1)
Which produces
column   c1    c2
n
0         1     3
1        31    13
2       115  1313
3         2     4
4        14    11
5       613     1
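If the leftover n index and the column axis name are unwanted, they can be dropped afterwards (a small follow-up sketch):

df1 = df1.reset_index(drop=True)
df1.columns.name = None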

Get rid of excess Labels on Pandas DataFrames

So I got a DataFrame by doing:

dfgrp = df.groupby(['CCS_Category_ICD9','Gender'])['f0_'].sum()
ndf = pd.DataFrame(dfgrp)
ndf
                            f0_
CCS_Category_ICD9 Gender
1                 F         889
                  M         796
                  U           2
2                 F       32637
                  M       33345
                  U          34
where f0_ is the sum of the counts by Gender.
All I really want is a simple one-level dataframe similar to this, which I got via:

ndf = ndf.unstack(level=1)
ndf
                        f0_
Gender                    F         M     U
CCS_Category_ICD9
1                     889.0     796.0   2.0
2                   32637.0   33345.0  34.0
3                    2546.0    1812.0   NaN
4                  347284.0  213782.0  34.0
But what I want is:
CCS_Category_ICD9         F         M     U
1                     889.0     796.0   2.0
2                   32637.0   33345.0  34.0
3                    2546.0    1812.0   NaN
4                  347284.0  213782.0  34.0
I cannot figure out how to flatten or get rid of the levels associated with f0_ and Gender. All I need is the "M", "F", "U" column headings, so I have a simple one-level dataframe. I have tried reset_index and set_index along with several other variations, with no luck.
In the end I want a simple crosstab with row and column totals (which my example does not show).
Well, I did (as suggested in one answer):
ndf = ndf.f0_.unstack()
ndf
Which gave me:
Gender                    F         M     U
CCS_Category_ICD9
1                     889.0     796.0   2.0
2                   32637.0   33345.0  34.0
3                    2546.0    1812.0   NaN
4                  347284.0  213782.0  34.0
Followed by:
nndf = ndf.reset_index()
nndf
Gender  CCS_Category_ICD9         F         M     U
0                       1     889.0     796.0   2.0
1                       2   32637.0   33345.0  34.0
2                       3    2546.0    1812.0   NaN
3                       4  347284.0  213782.0  34.0
4                       5    3493.0    7964.0   1.0
5                       6   12295.0    9998.0   4.0
Which just about does it, but I cannot change the index name from Gender to something like Idx; no matter what I do, I get an extra row added with the new name, i.e. a row titled Idx just under Gender. Also, is there a more straightforward solution?
You can use

df.loc[:, 'f0_']

on the DataFrame resulting from .unstack(), i.e., select the first level of your MultiIndex columns, which leaves only the Gender level; or alternatively

df.columns = df.columns.droplevel()

see the MultiIndex.droplevel docs.
Because ndf is a pd.DataFrame, it has a column index. When you perform unstack(), it appends the last level of the row index to the column index. Since the columns already had f0_, you got a second level. To flatten it the way you'd like, call unstack() on the column instead.
ndf = ndf.f0_.unstack()
The text Gender is the name of the column index. If you want to get rid of it, you have to overwrite the name attribute of that object:

ndf.columns.name = None

Use this right after the ndf.f0_.unstack() call.
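In newer pandas versions (0.24+, if I recall correctly), rename_axis can do the same in one chained call:

ndf = ndf.f0_.unstack().rename_axis(columns=None)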
Generally, use df.pivot when you want to use a column as the row index and another column as the column index. Use df.pivot_table when you need to aggregate values due to rows with duplicate (row, column) pairs.
In this case, instead of df.groupby(...)[...].sum().unstack() you could use df.pivot_table:
import numpy as np
import pandas as pd

N = 100
df = pd.DataFrame({'CCS': np.random.choice([1, 2], size=N),
                   'Gender': np.random.choice(['F', 'M', 'U'], size=N),
                   'f0': np.random.randint(10, size=N)})

result = df.pivot_table(index='CCS', columns='Gender', values='f0', aggfunc='sum')
result.columns.name = None
result = result.reset_index()
yields

   CCS   F    M   U
0    1  89  104  90
1    2  66   65  65
Notice that after calling pivot_table(), the DataFrame result has named
index and column Indexes:
In [176]: result = df.pivot_table(index='CCS', columns='Gender', values='f0', aggfunc='sum'); result
Out[176]:
Gender    F    M   U
CCS
1        89  104  90
2        66   65  65
The index is named CCS:
In [177]: result.index
Out[177]: Int64Index([1, 2], dtype='int64', name='CCS')
and the columns index is named Gender:
In [178]: result.columns
Out[178]: Index(['F', 'M', 'U'], dtype='object', name='Gender') # <-- notice the name='Gender'
To remove the name from an Index, assign None to the name attribute:
In [179]: result.columns.name = None
In [180]: result
Out[180]:
      F    M   U
CCS
1    89  104  90
2    66   65  65
Though it's not needed here, to remove names from the levels of a MultiIndex,
assign a list of Nones to the names (plural) attribute:
result.columns.names = [None]*numlevels
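Since the original goal was a simple crosstab with row and column totals, pd.crosstab with margins=True may also be worth noting (a sketch on the same random data as above):

totals = pd.crosstab(df['CCS'], df['Gender'], values=df['f0'],
                     aggfunc='sum', margins=True)
print(totals)  # adds an 'All' row and column holding the totals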

Best possible way to merge two pandas dataframe using a key and divide it

I have two pandas dataframes.
df1

unique  numerator
23      4
29      10

df2

unique  denominator
23      2
29      5
Now I want something like this:

unique  result
23      2
29      2
Without using loops... or whichever is the most efficient way. It's a division: numerator/denominator.
If you set the index to 'unique' for both dfs, then you can just divide the two columns:

In [6]:
df1.set_index('unique')['numerator'] / df2.set_index('unique')['denominator']

Out[6]:
unique
23    2
29    2
dtype: float64
or merge on 'unique' and then do the calculation as normal:

In [9]:
merged = df1.merge(df2, on='unique')
merged

Out[9]:
   unique  numerator  denominator
0      23          4            2
1      29         10            5

In [10]:
merged['result'] = merged['numerator'] / merged['denominator']
merged

Out[10]:
   unique  numerator  denominator  result
0      23          4            2       2
1      29         10            5       2
EdChum has provided 2 good options. An alternative is using the div() (or divide()) function.
import pandas as pd

df1 = pd.DataFrame({'unique': [23, 29], 'numerator': [4, 10]})
df2 = pd.DataFrame({'unique': [23, 29], 'denominator': [2, 5]})

df1.set_index('unique', inplace=True)
df2.set_index('unique', inplace=True)

print(df1.div(df2['denominator'], axis=0))
An important thing to note is that you need to divide by a Series, i.e. df2['denominator'].

df1.div(df2, axis=0) will produce

        denominator  numerator
unique
23              NaN        NaN
29              NaN        NaN
This is because the label 'denominator' in df2 does not match 'numerator' in df1. A Series, however, has no column label, so its values are broadcast across the columns of df1.
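To make the frame-by-frame division work, the column labels can be aligned first, e.g. by renaming (a minimal sketch under the same setup):

print(df1.div(df2.rename(columns={'denominator': 'numerator'})))
#         numerator
# unique
# 23            2.0
# 29            2.0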
