I have a dataframe with four columns: id, Opposition, Innings and Wickets. I want to group by Innings and Opposition and get the sum of Wickets and the count of Opposition.
Consider this as my dataframe.
My required output should be a dataframe in which the Wickets column is the sum of Wickets grouped by Innings and Opposition, and the Match_play column is the count of Opposition grouped by Opposition and Innings.
I have tried a pivot table but got an "'Opposition' not 1-dimensional" error:
table = inn.pivot_table(values=['Opposition', 'Wickets'], index=['Opposition', 'Inning_no'],
                        aggfunc=['count', 'sum'])
Just use .groupby() on the dataframe, and reset_index() to convert Opposition and Innings back to normal columns (they are moved into a MultiIndex during the groupby):
import pandas as pd
df = pd.DataFrame({'id':[1,2,3,4,5], 'Opposition':['Sri Lanka', 'Sri Lanka', 'UAE','UAE','Sri Lanka'],
'Innings':[1,2,1,2,1], 'Wickets':[13,17,14,18,29]})
t = df.groupby(['Opposition', 'Innings'])['Wickets'].agg(Wickets='sum',
                                                          Match_play='count').reset_index()
print(t)
Output:
Opposition Innings Wickets Match_play
0 Sri Lanka 1 42 2
1 Sri Lanka 2 17 1
2 UAE 1 14 1
3 UAE 2 18 1
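For completeness, a pivot_table variant can also work; the error above most likely came from putting Opposition in both values and index. A minimal sketch, reusing the df defined above:
# aggregate Wickets twice per (Opposition, Innings) group: once as a sum, once as a count
table = df.pivot_table(index=['Opposition', 'Innings'], values='Wickets',
                       aggfunc=['sum', 'count']).reset_index()
table.columns = ['Opposition', 'Innings', 'Wickets', 'Match_play']
print(table)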
I have a dataset where I have tried to use the pandas groupby function to group selected columns. I would like to get the count of items in a particular column as part of the same dataframe, but I can't seem to find a way. I am new to Python and pandas. Thanks for the help.
Example:
country customer_no Treatment_Group Open
Atlantis 1352202109 Group A 1
Atlantis 1354540751 Group B 1
Atlantis 1354849289 Group A 1
Oceania 1356553036 Group A 1
Oceania 1356553036 Group A 1
Oceania 1356553036 Group A 1
Oceania 1356883118 Group B 0
Oceania 1356883118 Group B 0
Desired output (the rate is the share of distinct customer numbers who opened):
Group  Country   Rate (count opened / total unique customers)
A      Atlantis  (2/2)*100
A      Oceania   (1/1)*100
B      Atlantis  (1/1)*100
B      Oceania   (0/1)*100
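A possible approach, sketched here under the assumption that the data sits in a dataframe df with exactly the columns shown above, is to count distinct customers and distinct customers who opened per group and country, then divide:
import pandas as pd

df = pd.DataFrame({
    'country': ['Atlantis', 'Atlantis', 'Atlantis', 'Oceania', 'Oceania', 'Oceania', 'Oceania', 'Oceania'],
    'customer_no': [1352202109, 1354540751, 1354849289, 1356553036, 1356553036, 1356553036, 1356883118, 1356883118],
    'Treatment_Group': ['Group A', 'Group B', 'Group A', 'Group A', 'Group A', 'Group A', 'Group B', 'Group B'],
    'Open': [1, 1, 1, 1, 1, 1, 0, 0]})

# distinct customers per (group, country), and distinct customers who opened
total = df.groupby(['Treatment_Group', 'country'])['customer_no'].nunique()
opened = df[df['Open'] == 1].groupby(['Treatment_Group', 'country'])['customer_no'].nunique()

# align the two counts, fill groups with no opens with 0, and express as a percentage
rate = (opened.reindex(total.index, fill_value=0) / total * 100).reset_index(name='Rate')
print(rate)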
I'm facing the following situation. I have a dataframe which looks like this (due to sensitivity of the data I have to paraphrase it)
Column A Column B
A1B12C123 Japan
A2B34C456 Switzerland
A3B45C789 Japan
A1B15C729 Japan
My goal is to group Column A by the recurring pattern, which describes a certain property.
Meaning: group by A1, group by B12, group by C123.
In order to do that, I split the column and created new ones for each level of the hierarchy, e.g.:
Column A Column B Column C
A1 B12 C123
A2 B34 C456
A3 B45 C789
A1 B15 C729
I would then have to add those columns to my existing dataframe, and then I'd be able to group the way I wanted to.
I think this can work, but it seems a bit tedious and inelegant.
Is there a way in pandas to do this more elegantly?
I'd be happy about any input on that matter.
Best regards
Taking Seyi Daniel's idea from the comments, you can use the extractall() string method on Column A to explode it based on regex groups and then join Column B onto it.
import pandas as pd
from io import StringIO
data = StringIO("""
Column_A Column_B
A1B12C123 Japan
A2B34C456 Switzerland
A3B45C789 Japan
A1B15C729 Japan
""")
df = pd.read_csv(data, delim_whitespace=True)
# extract each hierarchy level (A…, B…, C…) into its own regex group
regex_df = df["Column_A"].str.extractall(r"(A\d*)|(B\d*)|(C\d*)")
# stack the matched groups into a single column and drop the extra index levels
regex_s = regex_df.stack().reset_index((1, 2), drop=True)
# give the new column a name
regex_s.name = "group"
# add column B
result = pd.merge(regex_s, df["Column_B"], left_index=True, right_index=True)
print(result)
group Column_B
0 A1 Japan
0 B12 Japan
0 C123 Japan
1 A2 Switzerland
1 B34 Switzerland
1 C456 Switzerland
2 A3 Japan
2 B45 Japan
2 C789 Japan
3 A1 Japan
3 B15 Japan
3 C729 Japan
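As an alternative sketch (using the same df as above, with illustrative column names), str.extract() with one capture group per level builds the three separate columns the question describes in a single step, which can then be joined back and grouped on:
# split Column_A into one column per hierarchy level, assuming the A/B/C prefix pattern
levels = df["Column_A"].str.extract(r"(?P<level_A>A\d+)(?P<level_B>B\d+)(?P<level_C>C\d+)")
df_levels = df.join(levels)
print(df_levels)
# now e.g. df_levels.groupby('level_A')['Column_B'].count() groups on the A1/A2/... part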
I have a pandas dataframe which looks like this:
Country Sold
Japan 3432
Japan 4364
Korea 2231
India 1130
India 2342
USA 4333
USA 2356
USA 3423
I have used the code below to get the sum of the "Sold" column:
df1 = df.groupby(df['Country'])
df2 = df1.sum()
I want to ask how to calculate each country's percentage of the total of the "Sold" column.
You can get the percentage by adding this line of code:
df2["percentage"] = df2['Sold']*100 / df2['Sold'].sum()
In the output dataframe, a column with the percentage of each country is added.
We can divide the original Sold column by a new column consisting of the grouped sums, broadcast back to the length of the original DataFrame, by using transform:
df.assign(
    pct_per=df['Sold'] / df.groupby('Country')['Sold'].transform('sum')
)
Country Sold pct_per
0 Japan 3432 0.440226
1 Japan 4364 0.559774
2 Korea 2231 1.000000
3 India 1130 0.325461
4 India 2342 0.674539
5 USA 4333 0.428501
6 USA 2356 0.232991
7 USA 3423 0.338509
Simple Solution
You were almost there.
First you need to group by country, then create the new percentage column by dividing the grouped sales by the sum of all sales:
# reset_index() is only there because the groupby makes the grouped column the index
df_grouped_countries = df.groupby(df.Country).sum().reset_index()
df_grouped_countries['pct_sold'] = df_grouped_countries.Sold / df.Sold.sum()
Are you looking for the percentage after or before aggregation?
import pandas as pd
countries = [['Japan',3432],['Japan',4364],['Korea',2231],['India',1130], ['India',2342],['USA',4333],['USA',2356],['USA',3423]]
df = pd.DataFrame(countries,columns=['Country','Sold'])
df1 = df.groupby(df['Country'])
df2 = df1.sum()
df2['percentage'] = (df2['Sold']/df2['Sold'].sum()) * 100
df2
I have the following code:
import pandas as pd
df1 = pd.DataFrame({'Counterparty':['Bank','Bank','GSE','PSE'],
'Sub Cat':['Tier1','Small','Small', 'Small'],
'Location':['US','US','UK','UK'],
'Amount':[50, 55, 65, 55],
'Amount1':[1,2,3,4]})
df2=df1.groupby(['Counterparty','Location'])[['Amount']].sum()
df2.dtypes
df1.dtypes
The df2 dataframe does not have the columns that I am aggregating across (Counterparty and Location). Any ideas why this is the case? Both Amount and Amount1 are numeric fields; I just want to sum across Amount and aggregate across Amount1.
To get the grouping columns back from the index, add the as_index=False parameter or use reset_index():
df2=df1.groupby(['Counterparty','Location'])[['Amount']].sum().reset_index()
print (df2)
Counterparty Location Amount
0 Bank US 105
1 GSE UK 65
2 PSE UK 55
df2=df1.groupby(['Counterparty','Location'], as_index=False)[['Amount']].sum()
print (df2)
Counterparty Location Amount
0 Bank US 105
1 GSE UK 65
2 PSE UK 55
If you aggregate over all columns, nuisance (non-numeric) columns are automatically excluded, so the Sub Cat column is omitted:
df2=df1.groupby(['Counterparty','Location']).sum().reset_index()
print (df2)
Counterparty Location Amount Amount1
0 Bank US 105 3
1 GSE UK 65 3
2 PSE UK 55 4
df2=df1.groupby(['Counterparty','Location'], as_index=False).sum()
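As a further sketch (not part of the original answers): since the question also mentions treating Amount and Amount1 differently, named aggregation lets you pick a function per column while keeping the group keys as columns; the output column names here are only illustrative:
df2 = df1.groupby(['Counterparty', 'Location'], as_index=False).agg(
    Amount=('Amount', 'sum'),        # sum Amount per group
    Amount1_max=('Amount1', 'max'))  # a different aggregation for Amount1, for illustration
print(df2)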
Remove the double brackets around 'Amount' and make them single brackets. You're telling it to select only one column.
I want to mutate the dataframe object: make the 1st row the column index and the 1st column the row index.
import pandas as pd
wiki = "https://en.wikipedia.org/wiki/List_of_state_and_union_territory_capitals_in_India"
df = pd.read_html(wiki)[1]
df2 = df.copy()
df2.head()
Currently I'm doing it like this (I'm losing the row index name this way):
df2.columns = df.iloc[0]
df2.drop(0, inplace=True)
df2.drop('No.', axis=1, inplace=True)
df2.head()
How can I do it in a more Pythonic way preserving the row index name?
You can specify this directly in read_html, with header specifying which row to use as the column names and index_col which column to use as the index:
In [16]: df = pd.read_html(wiki,header=0,index_col=0)[1]
In [17]: df.head()
Out[17]:
State or union territory Administrative capitals Legislative capitals \
No.
1 Andaman and Nicobar Islands Port Blair Port Blair
2 Andhra Pradesh Hyderabad[a] Hyderabad
3 Arunachal Pradesh Itanagar Itanagar
4 Assam Dispur Guwahati
5 Bihar Patna Patna
Judiciary capitals Year capital was established The Former capital
No.
1 Kolkata 1955 Calcutta (1945–1956)
2 Hyderabad 1959 Kurnool (1953-1956)
3 Guwahati 1986 NaN
4 Guwahati 1975 Shillong[b] (1874–1972)
5 Patna 1912 NaN
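If the table is already loaded without header/index_col, a minimal sketch of the same fix after the fact (assuming, as the question's code implies, that the header text ended up in the first data row) is to promote that row to the header and set the 'No.' column as the index, which preserves its name:
df2 = df.copy()
# use the first row as the column names, then drop that row
df2.columns = df2.iloc[0]
df2 = df2.drop(0)
# make the 'No.' column the row index; its name becomes the index name
df2 = df2.set_index('No.')
df2.head()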