pandas groupby on multiple columns - python

I have a data set that contains state codes and their status.
   code status
1  AZ   a
2  CA   b
3  KS   c
4  MO   c
5  NY   d
6  AZ   d
7  MO   a
8  MO   b
9  MN   b
10 NV   a
11 NV   e
12 MO   f
13 NY   a
14 NY   a
15 NY   b
I want to filter this data set down to the codes whose status is 'a' and count how many such rows each code has. Example output would be:
  code status
1 AZ   a
2 MO   a
3 NY   a
AZ = 1, MO = 1, NY = 2
I used df.groupby("code").loc[df.status == 'a'] but didn't have any luck.
Any help appreciated!

Let's filter the dataframe for status 'a' first, then group by code and count.
df[df.status == 'a'].groupby('code').size()
Output:
code
AZ 1
MO 1
NV 1
NY 2
dtype: int64
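For reference, value_counts collapses the filter-and-count into one step (a sketch, not from the original answer; note it sorts by count rather than by code):
df.loc[df.status == 'a', 'code'].value_counts()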

I've recreated your dataset:
import pandas as pd

data = [["AZ", "CA", "KS", "MO", "NY", "AZ", "MO", "MO", "MN", "NV", "NV", "MO", "NY", "NY", "NY"],
        ["a", "b", "c", "c", "d", "d", "a", "b", "b", "a", "e", "f", "a", "a", "b"]]
df = pd.DataFrame(data).T          # transpose so each list becomes a column
df.columns = ["code", "status"]
df[df["status"] == "a"].groupby(["code", "status"]).size()
gives
code status
AZ a 1
MO a 1
NV a 1
NY a 2
dtype: int64
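If you'd rather have the counts back as ordinary columns, matching the tabular shape in the question, a small follow-up sketch (the column name 'count' is my choice):
counts = df[df["status"] == "a"].groupby(["code", "status"]).size().reset_index(name="count")
print(counts)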

Related

How to join two dataframes and create a table counting corresponding values using Pandas?

I have Table 1 which looks like this:
State District ID Race Party
0 GA 1 White Dem
1 SC 5 Black Dem
2 VA 4 Black Ind
3 VA 4 White Repub
4 NY 2 White Dem
5 GA 1 Black Dem
Then table 2 which looks like this:
State District ID Event Type
0 GA 1 A; B; C
1 SC 5 B; A
2 VA 4 A; C
3 NY 2 B
4 GA 1 A; C
And I want the resulting dataset to look like this:
State District ID # Event A # Event B # Event C # White # Black # Dem # Repub # Ind
0 GA 1 2 1 2 1 1 2 0 0
1 SC 5 1 1 0 0 1 1 0 0
2 VA 4 1 0 1 1 1 0 1 1
3 NY 2 0 1 0 1 0 1 0 0
I'm very shaky when it comes to joins and creating a result table that counts corresponding rows, and I've never done it using Pandas, so I'm not quite sure how to start. Which table would even be considered the left or right table? This is probably a very common use case; I just can't wrap my head around what the line of code (or the SQL query, if I were using Postgres) would look like.
First we load the sample data. Note that I replaced the spaces in the column names with underscores to make parsing a bit easier.
from io import StringIO
import pandas as pd
df1 = pd.read_csv(StringIO(
"""
State District_ID Race Party
0 GA 1 White Dem
1 SC 5 Black Dem
2 VA 4 Black Ind
3 VA 4 White Repub
4 NY 2 White Dem
5 GA 1 Black Dem
"""), delim_whitespace=True)
df2 = pd.read_csv(StringIO(
"""
State District_ID Event_Type
0 GA 1 A
1 SC 5 B
2 VA 4 A
3 NY 2 B
4 GA 1 A
"""), delim_whitespace=True)
Then we create three pivot tables, one each for Race, Party, Event_Type:
dfa = df1.assign(count=1).pivot_table(index=['State', 'District_ID'], columns=['Race'],
                                      values='count', fill_value=0, aggfunc='sum')
dfb = df1.assign(count=1).pivot_table(index=['State', 'District_ID'], columns=['Party'],
                                      values='count', fill_value=0, aggfunc='sum')
dfc = df2.assign(count=1).pivot_table(index=['State', 'District_ID'], columns=['Event_Type'],
                                      values='count', fill_value=0, aggfunc='sum')
Finally, we join them together:
dfa.join(dfb).join(dfc)
output
Black White Dem Ind Repub A B
State District_ID
GA 1 1 1 2 0 0 2 0
NY 2 0 1 1 0 0 0 1
SC 5 1 0 1 0 0 0 1
VA 4 1 1 0 1 1 1 0
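As an aside, pd.crosstab builds the same count tables without the assign(count=1) helper; a sketch of the equivalent for dfa (dfb and dfc follow the same pattern):
dfa = pd.crosstab([df1['State'], df1['District_ID']], df1['Race'])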
edit after change to df2 by OP
If the second dataframe has lists separated by ';' in Event Type, it can be converted back to the original long form using split and explode:
df2 = pd.read_csv(StringIO(
"""
State District ID Event Type
0 GA 1 A; B; C
1 SC 5 B; A
2 VA 4 A; C
3 NY 2 B
4 GA 1 A; C
"""), sep=r'\s\s+', engine='python')   # regex separator requires the python engine
df2['Event Type'] = df2['Event Type'].str.split(r';\s*', regex=True)   # also swallows the space after each ';'
df2.explode('Event Type')
output
State District ID Event Type
-- ------- ------------- ------------
0 GA 1 A
0 GA 1 B
0 GA 1 C
1 SC 5 B
1 SC 5 A
2 VA 4 A
2 VA 4 C
3 NY 2 B
4 GA 1 A
4 GA 1 C
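To tie the edit back to the main answer, the exploded frame can feed the same pivot_table; a sketch (the rename is mine, aligning the column names with df1's underscore versions so the final join lines up):
dfc = (df2.explode('Event Type')
          .rename(columns={'District ID': 'District_ID', 'Event Type': 'Event_Type'})
          .assign(count=1)
          .pivot_table(index=['State', 'District_ID'], columns=['Event_Type'],
                       values='count', fill_value=0, aggfunc='sum'))
dfa.join(dfb).join(dfc)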

DataFrame condition on multiple values python

I need to drop some rows from a dataframe with Python, based on multiple values.
Code Names Country
1 a France
2 b France
3 c USA
4 d Canada
5 e TOTO
6 f TITI
7 g Corona
I need to have this
Code Names Country
1 a France
4 d Canada
5 e TOTO
7 g Corona
I do this:
df.drop(df[('f','b','c')in df['names']].index)
But it doesn't work: KeyError: False
It works for only one key, like this: df.drop(df['f' in df['names']].index)
Do you have any idea?
To remove rows with certain values:
indexNames = df[df['Names'].isin(['f', 'b', 'c'])].index
df.drop(indexNames, inplace=True)
print(df)
Output:
Code Names Country
0 1 a France
3 4 d Canada
4 5 e TOTO
6 7 g Corona
Based on your example, I think this may be what you are looking for.
new_df = df.loc[~df.Names.isin(['f','b','c'])].copy()
new_df
Output:
Code Names Country
0 1 a France
3 4 d Canada
4 5 e TOTO
6 7 g Corona
In pandas, we can use the .drop() function to drop columns and rows.
For dropping specific rows, we need to use axis=0 (the default).
So your required output can be achieved by the following line of code:
df4.drop([1, 2, 5], axis=0)   # 1, 2 and 5 are the index labels of the rows to drop
The output will be:
code Names Country
1 a France
4 d Canada
5 e TOTO
7 g Corona
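For completeness, the same filter can also be written with query, which keys on the values rather than on hard-coded index labels (a sketch):
df.query("Names not in ['f', 'b', 'c']")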

Merge multiple tables and join the same column with comma split

I have about 15 csv files with the same set of unique IDs, and for each file col1 contains different text. How can I join them together to create a new table that contains all the information from those 15 files? I tried pd.merge, then building a new comma-separated col1 from the split text and deleting the duplicate columns, but that leaves columns named col1_x, col1_y, etc. Is there any better way to implement this?
My input is,
df1:
ID col1 location gender
1 Airplane NY F
2 Bus CA M
3 NaN FL M
4 Bus WA F
df2:
ID col1 location gender
1 Apple NY F
2 Peach CA M
3 Melon FL M
4 Banana WA F
df3:
ID col1 location gender
1 NaN NY F
2 Football CA M
3 Boxing FL M
4 Running WA F
Expected output is,
ID col1 location gender
1 Airplane,Apple NY F
2 Bus,Peach,Football CA M
3 Melon,Boxing FL M
4 Bus,Banana,Running WA F
You could use concat + groupby:
merged = pd.concat([df1, df2, df3], sort=False)
result = (merged.dropna()
                .groupby(['location', 'gender'], as_index=False)
                .agg({'col1': ','.join}))
print(result)
Output
location gender col1
0 CA M Bus,Peach,Football
1 FL M Melon,Boxing
2 NY F Airplane,Apple
3 WA F Bus,Banana,Running
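If you also need the ID column from your expected output, grouping on it as well should work, since ID, location and gender move together in these frames (a sketch):
result = (merged.dropna(subset=['col1'])
                .groupby(['ID', 'location', 'gender'], as_index=False)
                .agg({'col1': ','.join}))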
For your data, you can do:
(pd.concat(df.melt(id_vars='ID').dropna() for df in [df1,df2,df3])
.groupby(['ID','variable'])['value'].apply(lambda x: ','.join(x.unique()))
.unstack()
)
Output:
variable col1 gender location
ID
1 Airplane,Apple F NY
2 Bus,Peach,Football M CA
3 Melon,Boxing M FL
4 Bus,Banana,Running F WA

Dividing each row by the previous one

I have pandas dataframe:
df = pd.DataFrame()
df['city'] = ['NY','NY','LA','LA']
df['hour'] = ['0','12','0','12']
df['value'] = [12,24,3,9]
city hour value
0 NY 0 12
1 NY 12 24
2 LA 0 3
3 LA 12 9
I want, for each city, to divide each row by the previous one and write the result into a new dataframe. The desired output is:
city ratio
NY 2
LA 3
What's the most pythonic way to do this?
First, divide by the shifted values per group:
df['ratio'] = df['value'].div(df.groupby('city')['value'].shift(1))
print (df)
city hour value ratio
0 NY 0 12 NaN
1 NY 12 24 2.0
2 LA 0 3 NaN
3 LA 12 9 3.0
Then remove the NaNs and select only the city and ratio columns:
df = df.dropna(subset=['ratio'])[['city', 'ratio']]
print (df)
city ratio
1 NY 2.0
3 LA 3.0
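If you prefer, the two steps collapse into a single chain (a sketch of the same logic):
df = (df.assign(ratio=df['value'].div(df.groupby('city')['value'].shift()))
        .dropna(subset=['ratio'])[['city', 'ratio']])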
You can use pct_change:
In [20]: df[['city']].assign(ratio=df.groupby('city').value.pct_change().add(1)).dropna()
Out[20]:
city ratio
1 NY 2.0
3 LA 3.0
This'll do it, using named aggregation (it divides each city's max by its min, which matches the example because every city has exactly two rows):
df.groupby('city')['value'].agg(ratio=lambda x: x.max() / x.min()).reset_index()
# city ratio
#0 LA 3
#1 NY 2
This is one way using a custom function. It assumes you want to ignore the NaN rows in the result of dividing one series by a shifted version of itself.
def divider(x):
    return x['value'] / x['value'].shift(1)

res = (df.groupby('city').apply(divider)
         .dropna().reset_index()
         .rename(columns={'value': 'ratio'})
         .loc[:, ['city', 'ratio']])
print(res)
city ratio
0 LA 3.0
1 NY 2.0
One way is:
df.groupby(['city']).apply(lambda x: x['value'] / x['value'].shift(1))
and, polishing it further to get one row per city:
print(df.groupby(['city'])
        .apply(lambda x: (x['value'] / x['value'].shift(1)).bfill())
        .reset_index()
        .drop_duplicates(subset=['city'])
        .drop('level_1', axis=1))
city value
0 LA 3.0
2 NY 2.0

Pandas Dataframe: Select row where column contains X, in multiindex

So I am struggling with this data processing. I have a data file with ~100 rows * 24 columns.
I load the file with
data = pd.read_csv(fileName, header=[0,1])
df = pd.DataFrame(data=data)
Then I only want to work on a part of it so I select only those columns:
postSurvey = questionaireData[['QUESTNNR','PS01_01', 'PS02_01', 'PS02_02', 'PS03_01', 'PS03_02', 'PS03_03', 'PS04', 'PS05', 'PS06', 'PS07_01']]
Problem: Now I want to select the rows which contain 'PS' in 'QUESTNNR'.
I can create this "list" of True/False values, but when I try to use it I get:
onlyPS = postSurvey['QUESTNNR'] == 'PS'
postSurvey = postSurvey[onlyPS]
ValueError: cannot join with no level specified and no overlapping names
With this I get:
postSurvey.xs('PS', level='QUESTNNR')
AttributeError: 'RangeIndex' object has no attribute 'get_loc_level'
I have tried all sorts of solutions from stackoverflow and other sources, but need help.
dataframe (note the two header rows, which become a column MultiIndex):
A B C D E F G H Q W E R T Y U I O P S J K L Z X C V N M A1 A2 A3 S4 F4 G5
ASDF1 ASDF2 ASDF3 ASDF4 ASDF5 ASDF6 ASDF7 ASDF8 ASDF9 ASDF10 ASDF11 ASDF12 ASDF13 ASDF14 ASDF15 ASDF16 ASDF17 ASDF18 ASDF19 ASDF20 ASDF21 ASDF22 ASDF23 ASDF24 ASDF25 ASDF26 ASDF27 ASDF28 ASDF29 ASDF30 ASDF31 ASDF32 ASDF33 ASDF34
138 PS interview eng date 10 2 5 7 2012 10 1 13 1 26 1 0 1 1
129 QB2 interview eng date 4 6 5 56,10,34,7,20 1 0 2 2
130 QC1 interview eng date 6 2 6 7 1 0 2 2
131 QD2 interview eng date 3 8 6 5,8,15 1 0 2 2
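For reference, a minimal sketch of one likely fix, not from the original thread: with header=[0,1] the columns form a MultiIndex, so postSurvey['QUESTNNR'] selects a one-column sub-DataFrame and the == 'PS' comparison yields a DataFrame mask, which is what triggers the join error. Flattening the columns down to the first header row restores plain Series selection (fileName and the column list are taken from the question):
import pandas as pd

df = pd.read_csv(fileName, header=[0, 1])
df.columns = df.columns.get_level_values(0)   # keep only the first header row

postSurvey = df[['QUESTNNR', 'PS01_01', 'PS02_01', 'PS02_02', 'PS03_01',
                 'PS03_02', 'PS03_03', 'PS04', 'PS05', 'PS06', 'PS07_01']]
onlyPS = postSurvey[postSurvey['QUESTNNR'] == 'PS']   # the mask is a plain boolean Series now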
