Drop rows in pandas dataframe based on fraction of total - python

country state year area
usa iowa 2000 30
usa iowa 2001 30
usa iowa 2002 30
usa iowa 2003 30
usa kansas 2000 500
usa kansas 2001 500
usa kansas 2002 500
usa kansas 2003 500
usa washington 2000 245
usa washington 2001 245
usa washington 2002 245
usa washington 2003 245
In the dataframe above, I want to drop the rows where a state's share of the total area is less than 10%. In this case that would be all rows with state iowa. What is the best way to do this in pandas? I tried groupby but I'm not sure how to proceed:
df.groupby('area').sum()

Another solution with drop_duplicates and double boolean indexing:
a = df.drop_duplicates(['state','area'])
print (a)
country state year area
0 usa iowa 2000 30
4 usa kansas 2000 500
8 usa washington 2000 245
states = a.loc[a.area.div(a.area.sum()) >.1, 'state']
print (states)
4 kansas
8 washington
Name: state, dtype: object
print (df[df.state.isin(states)])
country state year area
4 usa kansas 2000 500
5 usa kansas 2001 500
6 usa kansas 2002 500
7 usa kansas 2003 500
8 usa washington 2000 245
9 usa washington 2001 245
10 usa washington 2002 245
11 usa washington 2003 245

Each state repeats the same area value in every row, so take one representative value per state (I take the first) and sum those: groupby('state').area.first().sum() is the total we normalize by.
df[df.area.div(df.groupby('state').area.first().sum()) >= .1]
country state year area
4 usa kansas 2000 500
5 usa kansas 2001 500
6 usa kansas 2002 500
7 usa kansas 2003 500
8 usa washington 2000 245
9 usa washington 2001 245
10 usa washington 2002 245
11 usa washington 2003 245
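Since the question asked how to proceed with groupby, here is a minimal sketch of a transform-based variant of the same filter, using a small frame rebuilt from the question's data. It computes each state's area once per row via transform, then compares it against 10% of the total across states:

```python
import pandas as pd

df = pd.DataFrame({
    'country': ['usa'] * 12,
    'state': ['iowa'] * 4 + ['kansas'] * 4 + ['washington'] * 4,
    'year': [2000, 2001, 2002, 2003] * 3,
    'area': [30] * 4 + [500] * 4 + [245] * 4,
})

# Each state's area is constant across years, so the per-state mean
# broadcast back to every row is just that state's area.
state_area = df.groupby('state')['area'].transform('mean')

# Total area counted once per state, not once per row.
total = df.drop_duplicates('state')['area'].sum()

# Keep rows whose state holds at least 10% of the total.
out = df[state_area / total >= 0.1]
```

This avoids the intermediate deduplicated frame and the isin lookup, at the cost of recomputing the per-state value on every row.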

Related

how to merge Two datasets with different time ranges?

I have two datasets that look like this:
df1:
Date     City     State  Quantity
2019-01  Chicago  IL     35
2019-01  Orlando  FL     322
...      ...      ...    ...
2021-07  Chicago  IL     334
2021-07  Orlando  FL     4332
df2:
Date     City     State  Sales
2020-03  Chicago  IL     30
2020-03  Orlando  FL     319
...      ...      ...    ...
2021-07  Chicago  IL     331
2021-07  Orlando  FL     4000
My Date column has dtype period[M] in both datasets. I have tried df1.join(df2, how='outer') and df2.join(df1, how='outer'), but the rows don't line up correctly: for 2019-01 I end up with sales from 2020-03. I have not been able to use merge() because I would have to merge on a combination of City, State and Date. How can I join these two datasets so that my output is as follows:
Date     City     State  Quantity  Sales
2019-01  Chicago  IL     35        NaN
2019-01  Orlando  FL     322       NaN
...      ...      ...    ...       ...
2021-07  Chicago  IL     334       331
2021-07  Orlando  FL     4332      4000
You can outer-merge. By not specifying the columns to merge on, you merge on the intersection of the columns in both DataFrames (in this case, Date, City and State).
out = df1.merge(df2, how='outer').sort_values(by='Date')
Output:
Date City State Quantity Sales
0 2019-01 Chicago IL 35.0 NaN
1 2019-01 Orlando FL 322.0 NaN
4 2020-03 Chicago IL NaN 30.0
5 2020-03 Orlando FL NaN 319.0
2 2021-07 Chicago IL 334.0 331.0
3 2021-07 Orlando FL 4332.0 4000.0
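For readers who prefer the keys spelled out, here is a minimal sketch with hypothetical frames mirroring the question's structure; passing on= explicitly is equivalent to the implicit intersection merge above:

```python
import pandas as pd

# Hypothetical miniature versions of df1 and df2 from the question.
df1 = pd.DataFrame({
    'Date': ['2019-01', '2019-01', '2021-07', '2021-07'],
    'City': ['Chicago', 'Orlando', 'Chicago', 'Orlando'],
    'State': ['IL', 'FL', 'IL', 'FL'],
    'Quantity': [35, 322, 334, 4332],
})
df2 = pd.DataFrame({
    'Date': ['2020-03', '2020-03', '2021-07', '2021-07'],
    'City': ['Chicago', 'Orlando', 'Chicago', 'Orlando'],
    'State': ['IL', 'FL', 'IL', 'FL'],
    'Sales': [30, 319, 331, 4000],
})

# Explicit keys: rows match only when Date, City and State all agree;
# how='outer' keeps unmatched rows from both sides with NaN fill.
out = df1.merge(df2, on=['Date', 'City', 'State'], how='outer').sort_values('Date')
```

Being explicit about the keys also guards against accidental extra common columns (e.g. an index column saved to both files) silently joining the join.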

Identify which country won the most gold, in each olympic games

When I run the code below in pandas
gold.groupby(['Games','country'])['Medal'].value_counts()
I get the result below. How do I extract the top medal winner for each Games? The result should list, for every Games, the country with the most medals and its medal tally.
Games country Medal
1896 Summer Australia Gold 2
Austria Gold 2
Denmark Gold 1
France Gold 5
Germany Gold 25
...
2016 Summer UK Gold 64
USA Gold 139
Ukraine Gold 2
Uzbekistan Gold 4
Vietnam Gold 1
Name: Medal, Length: 1101, dtype: int64
ID Name Sex Age Height Weight Team NOC Games Year Season City Sport Event Medal country notes
68 17294 Cai Yalin M 23.0 174.0 60.0 China CHN 2000 Summer 2000 Summer Sydney Shooting Shooting Men's Air Rifle, 10 metres Gold China NaN
77 17299 Cai Yun M 32.0 181.0 68.0 China-1 CHN 2012 Summer 2012 Summer London Badminton Badminton Men's Doubles Gold China NaN
87 17995 Cao Lei F 24.0 168.0 75.0 China CHN 2008 Summer 2008 Summer Beijing Weightlifting Weightlifting Women's Heavyweight Gold China NaN
104 18005 Cao Yuan M 17.0 160.0 42.0 China CHN 2012 Summer 2012 Summer London Diving Diving Men's Synchronized Platform Gold China NaN
105 18005 Cao Yuan M 21.0 160.0 42.0 China CHN 2016 Summer 2016 Summer Rio de Janeiro Diving Diving Men's Springboard Gold China NaN
The data: your sample only included Chinese gold medal winners, so I added a row:
ID Name Sex Age Height Weight Team NOC \
0 17294 Cai Yalin M 23.0 174.0 60.0 China CHN
1 17299 Cai Yun M 32.0 181.0 68.0 China-1 CHN
2 17995 Cao Lei F 24.0 168.0 75.0 China CHN
3 18005 Cao Yuan M 17.0 160.0 42.0 China CHN
4 18005 Cao Yuan M 21.0 160.0 42.0 China CHN
5 292929 Serge de Gosson M 52.0 178.0 69.0 France FR
Games Year Season City Sport \
0 2000 Summer 2000 Summer Sydney Shooting
1 2012 Summer 2012 Summer London Badminton
2 2008 Summer 2008 Summer Beijing Weightlifting
3 2012 Summer 2012 Summer London Diving
4 2016 Summer 2016 Summer Rio de Janeiro Diving
5 2022 Summer 2022 Summer Stockholm Calisthenics
Event Medal country notes
0 Shooting Men's Air Rifle, 10 metres Gold China NaN
1 Badminton Men's Doubles Gold China NaN
2 Weightlifting Women's Heavyweight Gold China NaN
3 Diving Men's Synchronized Platform Gold China NaN
4 Diving Men's Springboard Gold China NaN
5 Planche Gold France NaN
You want to do exactly what you did, but keep only the top row per Games (value_counts already sorts counts in descending order within each group):
gold.groupby(['Games','country'])['Medal'].value_counts().groupby(level=0, group_keys=False).head(1)
Which returns:
Games country Medal
2000 Summer China Gold 1
2008 Summer China Gold 1
2012 Summer China Gold 2
2016 Summer China Gold 1
2022 Summer France Gold 1
Name: Medal, dtype: int64
or as a dataframe:
GOLD_TOP = pd.DataFrame(gold.groupby(['Games','country'])['Medal'].value_counts().groupby(level=0, group_keys=False).head(1))
df_gold = df[df["Medal"]=="Gold"].groupby("Team").Medal.count().reset_index()
df_gold = df_gold.sort_values(by="Medal",ascending=False)[:8]
df_gold
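An alternative sketch that does not rely on value_counts ordering: count golds per (Games, country) pair and pick each Games' largest tally with idxmax. The mini-frame below is a hypothetical sample, not the real Olympics data:

```python
import pandas as pd

# Hypothetical gold-medal rows mirroring the answer's sample data.
gold = pd.DataFrame({
    'Games': ['2000 Summer', '2012 Summer', '2012 Summer',
              '2016 Summer', '2022 Summer'],
    'country': ['China', 'China', 'China', 'China', 'France'],
    'Medal': ['Gold'] * 5,
})

# Medal tally per (Games, country) pair.
counts = gold.groupby(['Games', 'country'])['Medal'].count()

# idxmax within each Games returns the full (Games, country) index
# tuple of the largest tally; .loc then selects those rows.
top = counts.loc[counts.groupby(level=0).idxmax().tolist()]
```

This makes the "largest per group" intent explicit rather than depending on the sort order of an intermediate result.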

How to convert a string datatype column with more than 185 unique values in a pandas dataframe [duplicate]

This question already has answers here:
Running get_dummies on several DataFrame columns?
(5 answers)
Closed 1 year ago.
I have a pandas dataframe with more than 17 columns; one particular column, Country, contains more than 180 unique values. How can I perform one-hot encoding on a column with that many unique values?
Providing a sample of the entire dataframe
Region Country
0 Sub-Saharan Africa Cote d'Ivoire
1 Sub-Saharan Africa Ethiopia
2 Central America and Caribbean Panama
3 Europe Sweden
4 Europe Romania
5 Asia Maldives
6 Sub-Saharan Africa Tanzania
7 Australia Tonga
8 Middle East Pakistan
9 Sub-Saharan Africa Chad
10 Central America Costa Rica
11 Sub-Saharan Africa Malawi
12 Asia Kyrgyzstan
13 Asia Maldives
14 Australia Fiji
15 Middle East Lebanon
16 Australia East Timor
17 Central America Guatemala
18 Europe Denmark
19 Europe Andorra
Pandas has the get_dummies function to perform one-hot encoding.
pd.get_dummies(df['Country'])
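A minimal sketch on a few rows of the sample data: get_dummies produces one indicator column per unique value, and the optional prefix parameter keeps the new column names recognizable (with 180+ countries you would get 180+ columns):

```python
import pandas as pd

# A small slice of the question's Country column.
df = pd.DataFrame({'Country': ['Sweden', 'Romania', 'Sweden', 'Chad']})

# One indicator column per unique country; prefix labels the new columns.
dummies = pd.get_dummies(df['Country'], prefix='Country')
```

To keep the rest of the frame, concatenate the result back: pd.concat([df, dummies], axis=1).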

For Loop Throwing Me For A Loop [duplicate]

This question already has answers here:
How to iterate over rows in a DataFrame in Pandas
(32 answers)
Closed 2 years ago.
I have a loop cycling through the length of a data frame, going through a list of teams. The loop should cover 41 rows, but it stops after only two, and I have no idea why it is stalling out.
import pandas as pd

excel_data_df = pd.read_excel('New_Schedule.xlsx', sheet_name='Sheet1', engine='openpyxl')
print(excel_data_df)
print('Data Frame Above')

yahoot = len(excel_data_df)
print('Length Of Dataframe Below')
print(yahoot)

for games in excel_data_df:
    yahoot -= 1
    print(yahoot)
    searching = excel_data_df.iloc[yahoot, 0]
    print(searching)
    excel_data_df2 = pd.read_excel('allstats.xlsx', sheet_name='Sheet1', engine='openpyxl')
    print(excel_data_df2)
    finding = excel_data_df2[excel_data_df2['TEAM:'] == searching].index
    print(finding)
Here is the run log
HOME TEAM: AWAY TEAM:
0 Portland St. Weber St.
1 Nevada Air Force
2 Utah Idaho
3 San Jose St. Santa Clara
4 Southern Utah SAGU American Indian
5 West Virginia Iowa St.
6 Missouri Prairie View
7 Southeast Mo. St. UT Martin
8 Little Rock Champion Chris.
9 Tennessee St. Belmont
10 Wichita St. Emporia St.
11 Tennessee Tennessee Tech
12 FGCU Webber Int'l
13 Jacksonville St. Ga. Southwestern
14 Northern Ill. Chicago St.
15 Col. of Charleston Western Caro.
16 Georgia Tech Florida A&M
17 Rider Iona
18 Tulsa Northwestern St.
19 Rhode Island Davidson
20 Washington St. Montana St.
21 Montana Dickinson St.
22 Robert Morris Bowling Green
23 South Dakota Drake
24 Richmond Loyola Chicago
25 Coastal Carolina Alice Lloyd
26 Presbyterian South Carolina St.
27 Morehead St. SIUE
28 San Diego St. BYU
29 Siena Canisius
30 Monmouth Saint Peter's
31 Howard Hampton
32 App State Columbia Int'l
33 Southern Ill. North Dakota
34 Norfolk St. UNCW
35 Niagara Fairfield
36 N.C. A&T Greensboro
37 Western Mich. Central Mich.
38 DePaul Xavier
39 Georgia St. Carver
40 Northern Ariz. Eastern Wash.
41 Gardner-Webb VMI
Data Frame Above
Length Of Dataframe Below
42
41
Gardner-Webb
TEAM: TOTAL POINTS: ... TURNOVER RATIO: ASSIST TO TURNOVER RANK
0 Mount St. Marys 307 ... 65 239.0
1 Saint Josephs 163 ... 28 81.0
2 Saint Marys (CA) 518 ... 78 114.0
3 Saint Peters 399 ... 86 145.0
4 St. John's (NY) 656 ... 115 73.0
.. ... ... ... ... ...
314 Wofford 327 ... 54 113.0
315 Wright St. 220 ... 47 206.0
316 Wyoming 517 ... 64 27.0
317 Xavier 582 ... 84 12.0
318 Youngstown St. 231 ... 30 79.0
[319 rows x 18 columns]
Int64Index([85], dtype='int64')
40
Northern Ariz.
TEAM: TOTAL POINTS: ... TURNOVER RATIO: ASSIST TO TURNOVER RANK
0 Mount St. Marys 307 ... 65 239.0
1 Saint Josephs 163 ... 28 81.0
2 Saint Marys (CA) 518 ... 78 114.0
3 Saint Peters 399 ... 86 145.0
4 St. John's (NY) 656 ... 115 73.0
.. ... ... ... ... ...
314 Wofford 327 ... 54 113.0
315 Wright St. 220 ... 47 206.0
316 Wyoming 517 ... 64 27.0
317 Xavier 582 ... 84 12.0
318 Youngstown St. 231 ... 30 79.0
[319 rows x 18 columns]
Int64Index([180], dtype='int64')
Iterating over a DataFrame directly, as in for games in excel_data_df:, iterates over its column labels, and your frame has only two columns (HOME TEAM: and AWAY TEAM:), which is why the loop stops after two passes. Use for index, data in excel_data_df.iterrows(): instead.
pandas.DataFrame.iterrows
DataFrame.iterrows()
Iterate over DataFrame rows as (index, Series) pairs.
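A minimal sketch of the difference, using a two-row frame shaped like the question's schedule data:

```python
import pandas as pd

# Miniature version of the schedule frame: two columns, two rows.
df = pd.DataFrame({'HOME TEAM:': ['Utah', 'Montana'],
                   'AWAY TEAM:': ['Idaho', 'Dickinson St.']})

# Iterating the DataFrame directly yields its COLUMN labels --
# here that is two iterations regardless of how many rows exist.
cols = [c for c in df]

# iterrows yields one (index, Series) pair per ROW instead.
rows = [row['HOME TEAM:'] for _, row in df.iterrows()]
```

With 41 rows and 2 columns, the original loop therefore ran exactly twice; iterrows would run 41 times.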

Web scraping the second of two tables on a page in Python 3 with BeautifulSoup

I'm working on my Python skills and I'm trying to scrape only the "Results" table from this page: https://en.wikipedia.org/wiki/List_of_Wales_national_rugby_union_team_results . I'm new to web scraping; could anyone help me with an elegant solution for scraping the Results wikitable? Thanks!
The easiest way is to use Pandas to load the tables:
import pandas as pd
tables = pd.read_html('https://en.wikipedia.org/wiki/List_of_Wales_national_rugby_union_team_results')
# print second table (index 1):
print(tables[1])
Prints:
Date Venue Home team Away team Score Competition Winner Match report
0 7 March 2020 Twickenham Stadium England Wales 33–30 2020 Six Nations England BBC
1 22 February 2020 Principality Stadium Wales France 23–27 2020 Six Nations France BBC
2 8 February 2020 Aviva Stadium Ireland Wales 24–14 2020 Six Nations Ireland BBC
3 1 February 2020 Principality Stadium Wales Italy 42–0 2020 Six Nations Wales BBC
4 30 November 2019 Principality Stadium Wales Barbarians 43–33 Tour Match Wales BBC
.. ... ... ... ... ... ... ... ...
741 5 January 1884 Cardigan Fields England Wales 1G 2T–1G 1884 Home Nations Championship England NaN
742 8 January 1883 Raeburn Place Scotland Wales 3G–1G 1883 Home Nations Championship Scotland NaN
743 16 December 1882 St Helen's Wales England 0–2G 4T 1883 Home Nations Championship England NaN
744 28 January 1882 Lansdowne Road Ireland Wales 0–2G 2T NaN Wales NaN
745 19 February 1881 Richardson's Field England Wales 7G 6T 1D–0 NaN England NaN
[746 rows x 8 columns]
