creating new column by merging on column name and other column value - python

Trying to create a new column in DF1 that lists the home teams number of allstars for that year.
DF1
Date Visitor V_PTS Home H_PTS \
0 2012-10-30 19:00:00 Washington Wizards 84 Cleveland Cavaliers 94
1 2012-10-30 19:30:00 Dallas Mavericks 99 Los Angeles Lakers 91
2 2012-10-30 20:00:00 Boston Celtics 107 Miami Heat 120
3 2012-10-31 19:00:00 Dallas Mavericks 94 Utah Jazz 113
4 2012-10-31 19:00:00 San Antonio Spurs 99 New Orleans Pelicans 95
Attendance Arena Location Capacity \
0 20562 Quicken Loans Arena Cleveland, Ohio 20562
1 18997 Staples Center Los Angeles, California 18997
2 20296 American Airlines Arena Miami, Florida 19600
3 17634 Vivint Smart Home Arena Salt Lake City, Utah 18303
4 15358 Smoothie King Center New Orleans, Louisiana 16867
Yr Arena Opened Season
0 1994 2012-13
1 1992 2012-13
2 1999 2012-13
3 1991 2012-13
4 1999 2012-13
DF2
2012-13 2013-14 2014-15 2015-16 2016-17
Cleveland Cavaliers 1 1 2 1 3
Los Angeles Lakers 2 1 1 1 0
Miami Heat 3 3 2 2 1
Chicago Bulls 2 1 2 2 1
Detroit Pistons 0 0 0 1 1
Los Angeles Clippers 2 2 2 1 1
New Orleans Pelicans 0 1 1 1 1
Philadelphia 76ers 1 0 0 0 0
Phoenix Suns 0 0 0 0 0
Portland Trail Blazers 1 2 2 0 0
Toronto Raptors 0 1 1 2 2
DF1['H_Allstars']=DF2[DF1['Season'],DF1['Home']])
results in TypeError: 'Series' objects are mutable, thus they cannot be hashed
I understand the error just am not sure how else to do it.

I've removed the extra columns and just focused on the necessary ones for demonstration.
Input:
df1
Home 2012-13 2013-14 2014-15 2015-16 2016-17
0 Cleveland Cavaliers 1 1 2 1 3
1 Los Angeles Lakers 2 1 1 1 0
2 Miami Heat 3 3 2 2 1
3 Chicago Bulls 2 1 2 2 1
4 Detroit Pistons 0 0 0 1 1
5 Los Angeles Clippers 2 2 2 1 1
6 New Orleans Pelicans 0 1 1 1 1
7 Philadelphia 76ers 1 0 0 0 0
8 Phoenix Suns 0 0 0 0 0
9 Portland Trail Blazers 1 2 2 0 0
10 Toronto Raptors 0 1 1 2 2
df2
Visitor Home Season
0 Washington Wizards Cleveland Cavaliers 2012-13
1 Dallas Mavericks Los Angeles Lakers 2012-13
2 Boston Celtics Miami Heat 2012-13
3 Dallas Mavericks Utah Jazz 2012-13
4 San Antonio Spurs New Orleans Pelicans 2012-13
Step 1: Melt df1 to get the allstars column
df3 = pd.melt(df1, id_vars='Home', value_vars = df1.columns[df.columns.str.contains('20')], var_name = 'Season', value_name='H_Allstars')
Ouput:
Home Season H_Allstars
0 Cleveland Cavaliers 2012-13 1
1 Los Angeles Lakers 2012-13 2
2 Miami Heat 2012-13 3
3 Chicago Bulls 2012-13 2
4 Detroit Pistons 2012-13 0
5 Los Angeles Clippers 2012-13 2
6 New Orleans Pelicans 2012-13 0
7 Philadelphia 76ers 2012-13 1
8 Phoenix Suns 2012-13 0
...
Step 2: Merge this new dataframe with df2 to get the H_Allstars and V_Allstars columns
df4 = pd.merge(df2, df3, how='left', on=['Home', 'Season'])
Output:
Visitor Home Season H_Allstars
0 Washington Wizards Cleveland Cavaliers 2012-13 1.0
1 Dallas Mavericks Los Angeles Lakers 2012-13 2.0
2 Boston Celtics Miami Heat 2012-13 3.0
3 Dallas Mavericks Utah Jazz 2012-13 NaN
4 San Antonio Spurs New Orleans Pelicans 2012-13 0.0
Step 3: Add the V_Allstars column
# renaming column as required
df3.rename(columns={'Home': 'Visitor', 'H_Allstars': 'V_Allstars'}, inplace=True)
df5 = pd.merge(df4, df3, how='left', on=['Visitor', 'Season'])
Output:
Visitor Home Season H_Allstars V_Allstars
0 Washington Wizards Cleveland Cavaliers 2012-13 1.0 NaN
1 Dallas Mavericks Los Angeles Lakers 2012-13 2.0 NaN
2 Boston Celtics Miami Heat 2012-13 3.0 NaN
3 Dallas Mavericks Utah Jazz 2012-13 NaN NaN
4 San Antonio Spurs New Orleans Pelicans 2012-13 0.0 NaN

You can use pandas.melt . Bring your data df2 to long format, i.e. Home and season as columns and Allstars as values and then merge to df1 on 'Home' and 'Season'.
import pandas as pd
df2['Home'] = df2.index
df2 = pd.melt(df2, id_vars = 'Home', value_vars = ['2012-13', '2013-14', '2014-15', '2015-16', '2016-17'], var_name = 'Season', value_name='H_Allstars')
df = df1.merge(df2, on=['Home','Season'], how='left')

Related

how to merge Two datasets with different time ranges?

I have two datasets that look like this:
df1:
Date
City
State
Quantity
2019-01
Chicago
IL
35
2019-01
Orlando
FL
322
...
....
...
...
2021-07
Chicago
IL
334
2021-07
Orlando
FL
4332
df2:
Date
City
State
Sales
2020-03
Chicago
IL
30
2020-03
Orlando
FL
319
...
...
...
...
2021-07
Chicago
IL
331
2021-07
Orlando
FL
4000
My date is in format period[M] for both datasets. I have tried using the df1.join(df2,how='outer') and (df2.join(df1,how='outer') commands but they don't add up correctly, essentially, in 2019-01, I have sales for 2020-03. How can I join these two datasets such that my output is as follows:
I have not been able to use merge() because I would have to merge with a combination of City and State and Date
Date
City
State
Quantity
Sales
2019-01
Chicago
IL
35
NaN
2019-01
Orlando
FL
322
NaN
...
...
...
...
...
2021-07
Chicago
IL
334
331
2021-07
Orlando
FL
4332
4000
You can outer-merge. By not specifying the columns to merge on, you merge on the intersection of the columns in both DataFrames (in this case, Date, City and State).
out = df1.merge(df2, how='outer').sort_values(by='Date')
Output:
Date City State Quantity Sales
0 2019-01 Chicago IL 35.0 NaN
1 2019-01 Orlando FL 322.0 NaN
4 2020-03 Chicago IL NaN 30.0
5 2020-03 Orlando FL NaN 319.0
2 2021-07 Chicago IL 334.0 331.0
3 2021-07 Orlando FL 4332.0 4000.0

Identify which country won the most gold, in each olympic games

When I write the below codes in pandas
gold.groupby(['Games','country'])['Medal'].value_counts()
I get the below result, how to extract the top medal winner for each Games,The result should be all the games,country with most medal,medal tally
Games country Medal
1896 Summer Australia Gold 2
Austria Gold 2
Denmark Gold 1
France Gold 5
Germany Gold 25
...
2016 Summer UK Gold 64
USA Gold 139
Ukraine Gold 2
Uzbekistan Gold 4
Vietnam Gold 1
Name: Medal, Length: 1101, dtype: int64
ID Name Sex Age Height Weight Team NOC Games Year Season City Sport Event Medal country notes
68 17294 Cai Yalin M 23.0 174.0 60.0 China CHN 2000 Summer 2000 Summer Sydney Shooting Shooting Men's Air Rifle, 10 metres Gold China NaN
77 17299 Cai Yun M 32.0 181.0 68.0 China-1 CHN 2012 Summer 2012 Summer London Badminton Badminton Men's Doubles Gold China NaN
87 17995 Cao Lei F 24.0 168.0 75.0 China CHN 2008 Summer 2008 Summer Beijing Weightlifting Weightlifting Women's Heavyweight Gold China NaN
104 18005 Cao Yuan M 17.0 160.0 42.0 China CHN 2012 Summer 2012 Summer London Diving Diving Men's Synchronized Platform Gold China NaN
105 18005 Cao Yuan M 21.0 160.0 42.0 China CHN 2016 Summer 2016 Summer Rio de Janeiro Diving Diving Men's Springboard Gold China NaN
The data Your data only included Chinese gold medal winners so I added a row:
ID Name Sex Age Height Weight Team NOC \
0 17294 Cai Yalin M 23.0 174.0 60.0 China CHN
1 17299 Cai Yun M 32.0 181.0 68.0 China-1 CHN
2 17995 Cao Lei F 24.0 168.0 75.0 China CHN
3 18005 Cao Yuan M 17.0 160.0 42.0 China CHN
4 18005 Cao Yuan M 21.0 160.0 42.0 China CHN
5 292929 Serge de Gosson M 52.0 178.0 69.0 France FR
Games Year Season City Sport \
0 2000 Summer 2000 Summer Sydney Shooting
1 2012 Summer 2012 Summer London Badminton
2 2008 Summer 2008 Summer Beijing Weightlifting
3 2012 Summer 2012 Summer London Diving
4 2016 Summer 2016 Summer Rio de Janeiro Diving
5 2022 Summer 2022 Summer Stockholm Calisthenics
Event Medal country notes
0 Shooting Men's Air Rifle, 10 metres Gold China NaN
1 Badminton Men's Doubles Gold China NaN
2 Weightlifting Women's Heavyweight Gold China NaN
3 Diving Men's Synchronized Platform Gold China NaN
4 Diving Men's Springboard Gold China NaN
5 Planche Gold France NaN
YOu want to de exactly what you did but sort the data and keep the top row:
gold.groupby(['Games','country'])['Medal'].value_counts().groupby(level=0, group_keys=False).head(1)
Which returns:
Games country Medal
2000 Summer China Gold 1
2008 Summer China Gold 1
2012 Summer China Gold 2
2016 Summer China Gold 1
2022 Summer France Gold 1
Name: Medal, dtype: int64
or as a dataframe:
GOLD_TOP = pd.DataFrame(gold.groupby(['Games','country'])['Medal'].value_counts().groupby(level=0, group_keys=False).head(1))
df_gold = df[df["Medal"]=="Gold"].groupby("Team").Medal.count().reset_index()
df_gold = df_gold.sort_values(by="Medal",ascending=False)[:8]
df_gold

python selenium: <input> element not interactable

I was trying to crawl down nba player info from https://nba.com/players and click the button "Show Historic" on the webpage
nba_webpage_picture
part of the html code for the input button shows below:
<div aria-label="Show Historic Toggle" class="Toggle_switch__2e_90">
<input type="checkbox" class="Toggle_input__gIiFd" name="showHistoric">
<span class="Toggle_slider__hCMQQ Toggle_sliderActive__15Jrf Toggle_slidercerulean__1UnnV">
</span>
</div>
I simply use find_element_by_xpath to locate the input button and click
button_show_historic = driver.find_element_by_xpath("//input[#name='showHistoric']")
button_show_historic.click()
However it says:
Exception has occurred: ElementNotInteractableException
Message: element not interactable
(Session info: chrome=88.0.4324.192)
Could anyone help on solving this issue? Is this because the input is not visible?
Simply wait for the span element not the input element and click.
wait = WebDriverWait(driver, 30)
driver.get('https://www.nba.com/players')
wait.until(EC.element_to_be_clickable((By.XPATH,"//button[.='I Accept']"))).click()
wait.until(EC.element_to_be_clickable((By.XPATH,"//input[#name='showHistoric']/preceding::span[1]"))).click()
Import
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
Also to find an api just look under Developer tools ->Network->Headers
and Response to find if it gets populated.
Most probably problem is you don't have any wait code. You should wait until page is loaded. You can use simple python sleep function:
import time
time.sleep(3) #it will wait 3 seconds
##Do your action
Or You can use explicit wait. Check this page: selenium.dev
No need to use selenium when there's an api. Try this:
import requests
import pandas as pd
url = 'https://stats.nba.com/stats/playerindex'
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/88.0.4324.150 Safari/537.36',
'Referer': 'http://stats.nba.com'}
payload = {
'College': '',
'Country': '',
'DraftPick': '',
'DraftRound': '',
'DraftYear': '',
'Height': '' ,
'Historical': '1',
'LeagueID': '00',
'Season': '2020-21',
'SeasonType': 'Regular Season',
'TeamID': '0',
'Weight': ''}
jsonData = requests.get(url, headers=headers, params=payload).json()
cols = jsonData['resultSets'][0]['headers']
data = jsonData['resultSets'][0]['rowSet']
df = pd.DataFrame(data, columns=cols)
Output: [4589 rows x 26 columns]
print(df.head(20).to_string())
PERSON_ID PLAYER_LAST_NAME PLAYER_FIRST_NAME PLAYER_SLUG TEAM_ID TEAM_SLUG IS_DEFUNCT TEAM_CITY TEAM_NAME TEAM_ABBREVIATION JERSEY_NUMBER POSITION HEIGHT WEIGHT COLLEGE COUNTRY DRAFT_YEAR DRAFT_ROUND DRAFT_NUMBER ROSTER_STATUS PTS REB AST STATS_TIMEFRAME FROM_YEAR TO_YEAR
0 76001 Abdelnaby Alaa alaa-abdelnaby 1.610613e+09 blazers 0 Portland Trail Blazers POR 30 F 6-10 240 Duke USA 1990.0 1.0 25.0 NaN 5.7 3.3 0.3 Career 1990 1994
1 76002 Abdul-Aziz Zaid zaid-abdul-aziz 1.610613e+09 rockets 0 Houston Rockets HOU 54 C 6-9 235 Iowa State USA 1968.0 1.0 5.0 NaN 9.0 8.0 1.2 Career 1968 1977
2 76003 Abdul-Jabbar Kareem kareem-abdul-jabbar 1.610613e+09 lakers 0 Los Angeles Lakers LAL 33 C 7-2 225 UCLA USA 1969.0 1.0 1.0 NaN 24.6 11.2 3.6 Career 1969 1988
3 51 Abdul-Rauf Mahmoud mahmoud-abdul-rauf 1.610613e+09 nuggets 0 Denver Nuggets DEN 1 G 6-1 162 Louisiana State USA 1990.0 1.0 3.0 NaN 14.6 1.9 3.5 Career 1990 2000
4 1505 Abdul-Wahad Tariq tariq-abdul-wahad 1.610613e+09 kings 0 Sacramento Kings SAC 9 F-G 6-6 235 San Jose State France 1997.0 1.0 11.0 NaN 7.8 3.3 1.1 Career 1997 2003
5 949 Abdur-Rahim Shareef shareef-abdur-rahim 1.610613e+09 grizzlies 0 Memphis Grizzlies MEM 3 F 6-9 245 California USA 1996.0 1.0 3.0 NaN 18.1 7.5 2.5 Career 1996 2007
6 76005 Abernethy Tom tom-abernethy 1.610613e+09 warriors 0 Golden State Warriors GSW 5 F 6-7 220 Indiana USA 1976.0 3.0 43.0 NaN 5.6 3.2 1.2 Career 1976 1980
7 76006 Able Forest forest-able 1.610613e+09 sixers 0 Philadelphia 76ers PHI 6 G 6-3 180 Western Kentucky USA 1956.0 NaN NaN NaN 0.0 1.0 1.0 Career 1956 1956
8 76007 Abramovic John john-abramovic 1.610610e+09 None 1 Pittsburgh Ironmen PIT None F 6-3 195 Salem USA NaN NaN NaN NaN 9.5 NaN 0.7 Career 1946 1947
9 203518 Abrines Alex alex-abrines 1.610613e+09 thunder 0 Oklahoma City Thunder OKC 8 G 6-6 190 FC Barcelona Spain 2013.0 2.0 32.0 NaN 5.3 1.4 0.5 Career 2016 2018
10 1630173 Achiuwa Precious precious-achiuwa 1.610613e+09 heat 0 Miami Heat MIA 5 F 6-8 225 Memphis Nigeria 2020.0 1.0 20.0 1.0 5.9 3.9 0.6 Season 2020 2020
11 101165 Acker Alex alex-acker 1.610613e+09 clippers 0 LA Clippers LAC 3 G 6-5 185 Pepperdine USA 2005.0 2.0 60.0 NaN 2.7 1.0 0.5 Career 2005 2008
12 76008 Ackerman Donald donald-ackerman 1.610613e+09 knicks 0 New York Knicks NYK G 6-0 183 Long Island-Brooklyn USA 1953.0 2.0 NaN NaN 1.5 0.5 0.8 Career 1953 1953
13 76009 Acres Mark mark-acres 1.610613e+09 magic 0 Orlando Magic ORL 42 C 6-11 220 Oral Roberts USA 1985.0 2.0 40.0 NaN 3.6 4.1 0.5 Career 1987 1992
14 76010 Acton Charles charles-acton 1.610613e+09 rockets 0 Houston Rockets HOU 24 F 6-6 210 Hillsdale USA NaN NaN NaN NaN 3.3 2.0 0.5 Career 1967 1967
15 203112 Acy Quincy quincy-acy 1.610613e+09 kings 0 Sacramento Kings SAC 13 F 6-7 240 Baylor USA 2012.0 2.0 37.0 NaN 4.9 3.5 0.6 Career 2012 2018
16 76011 Adams Alvan alvan-adams 1.610613e+09 suns 0 Phoenix Suns PHX 33 C 6-9 210 Oklahoma USA 1975.0 1.0 4.0 NaN 14.1 7.0 4.1 Career 1975 1987
17 76012 Adams Don don-adams 1.610613e+09 pistons 0 Detroit Pistons DET 10 F 6-7 210 Northwestern USA 1970.0 8.0 120.0 NaN 8.7 5.6 1.8 Career 1970 1976
18 200801 Adams Hassan hassan-adams 1.610613e+09 nets 0 Brooklyn Nets BKN 8 F 6-4 220 Arizona USA 2006.0 2.0 54.0 NaN 2.5 1.2 0.2 Career 2006 2008
19 1629121 Adams Jaylen jaylen-adams 1.610613e+09 bucks 0 Milwaukee Bucks MIL 6 G 6-0 225 St. Bonaventure USA NaN NaN NaN 1.0 0.3 0.4 0.3 Season 2018 2020

VLookup in Pandas using merge

I have 2 dataframes:
df_dict:
Bet365 Team (Dataset) Record ID
-- -------------------- ---------------- -----------
0 Lincoln City Lincoln 50
1 Peterborough Peterboro 65
2 Cambridge Utd Cambridge 72
3 Harrogate Town Harrogate 87
4 Cologne FC Koln 160
5 Hertha Berlin Hertha 167
6 Arminia Bielefeld Bielefeld 169
7 Schalke Schalke 04 173
8 TSG Hoffenheim Hoffenheim 174
9 SC Freiburg Freiburg 175
10 Zulte-Waregem Waregem 320
11 Royal Excel Mouscron Mouscron 325
Other dataframe:
df_odds:
DateTime League HomeTeam AwayTeam B365H B365D B365A
-- -------------------------- ---------------------- ----------------- -------------------- ------- ------- -------
0 2021-01-09 12:30:00.000001 England League 1 Lincoln City Peterborough 2.29 3.4 3.1
1 2021-01-09 15:00:00 England League 2 Cambridge Utd Harrogate Town 2.29 3.2 3.25
2 2021-01-09 15:14:59.999999 Belgium First Division Zulte-Waregem Royal Excel Mouscron 1.85 3.75 3.8
3 2021-01-09 14:29:59.999999 Germany Bundesliga 1 SC Freiburg Cologne 1.9 3.75 3.75
4 2021-01-09 14:29:59.999999 Germany Bundesliga 1 Schalke TSG Hoffenheim 3.8 3.8 1.85
5 2021-01-10 17:00:00.000001 Germany Bundesliga 1 Arminia Bielefeld Hertha Berlin 4 3.5 1.9
6 2021-01-16 14:29:59.999999 Germany Bundesliga 1 Cologne Hertha Berlin 3.2 3.3 2.25
I would like to merge the dataset to get the final dataframe as:
df_expected
DateTime League HomeTeam AwayTeam B365H B365D B365A
-- -------------------------- ---------------------- ---------- ---------- ------- ------- -------
0 2021-01-09 12:30:00.000001 England League 1 Lincoln Peterboro 2.29 3.4 3.1
1 2021-01-09 15:00:00 England League 2 Cambridge Harrogate 2.29 3.2 3.25
2 2021-01-09 15:14:59.999999 Belgium First Division Waregem Mouscron 1.85 3.75 3.8
3 2021-01-09 14:29:59.999999 Germany Bundesliga 1 Freiburg FC Koln 1.9 3.75 3.75
4 2021-01-09 14:29:59.999999 Germany Bundesliga 1 Schalke 04 Hoffenheim 3.8 3.8 1.85
5 2021-01-10 17:00:00.000001 Germany Bundesliga 1 Bielefeld Hertha 4 3.5 1.9
6 2021-01-16 14:29:59.999999 Germany Bundesliga 1 FC Koln Hertha 3.2 3.3 2.25
The common key is the df_dict.Bet365
I am trying merge pd.merge but I am unable to get the right keys and the correct join
Help would be greatly appreciated
Use Series.map for both columns by Series with Bet365 column converted to index:
s = df_dict.set_index('Bet365')['Team (Dataset)']
df_odds['HomeTeam'] = df_odds['HomeTeam'].map(s)
df_odds['AwayTeam'] = df_odds['AwayTeam'].map(s)

Problem with joining dataframes using pd.merge

I have the following formula:
import pandas as pd
houses = houses.reset_index()
houses['difference'] = houses[start]-houses[bottom]
towns = pd.merge(houses, unitowns, how='inner', on=['State','RegionName'])
print(towns)
However, the output is a dataframe 'towns' with 0 rows.
I can't understand why this is given that the dataframe 'houses' looks like this:
State RegionName
0 Alabama Auburn
1 Alabama Florence
2 Alabama Jacksonville
3 Alabama Livingston
4 Alabama Montevallo
5 Alabama Troy
6 Alabama Tuscaloosa
7 Alabama Tuskegee
8 Alaska Fairbanks
And the dataframe 'unitowns' looks like this:
State RegionName 2000q1 2000q2 2000q3 2000q4 2001q1 2001q2 2001q3 2001q4 ... 2014q2 2014q3 2014q4 2015q1 2015q2 2015q3 2015q4 2016q1 2016q2 2016q3
0 New York New York NaN NaN NaN NaN NaN NaN NaN NaN ... 5.154667e+05 5.228000e+05 5.280667e+05 5.322667e+05 5.408000e+05 5.572000e+05 5.728333e+05 5.828667e+05 5.916333e+05 587200.0
1 California Los Angeles 2.070667e+05 2.144667e+05 2.209667e+05 2.261667e+05 2.330000e+05 2.391000e+05 2.450667e+05 2.530333e+05 ... 4.980333e+05 5.090667e+05 5.188667e+05 5.288000e+05 5.381667e+05 5.472667e+05 5.577333e+05 5.660333e+05 5.774667e+05 584050.0
2 Illinois Chicago 1.384000e+05 1.436333e+05 1.478667e+05 1.521333e+05 1.569333e+05 1.618000e+05 1.664000e+05 1.704333e+05 ... 1.926333e+05 1.957667e+05 2.012667e+05 2.010667e+05 2.060333e+05 2.083000e+05 2.079000e+05 2.060667e+05 2.082000e+05 212000.0
3 Pennsylvania Philadelphia 5.300000e+04 5.363333e+04 5.413333e+04 5.470000e+04 5.533333e+04 5.553333e+04 5.626667e+04 5.753333e+04 ... 1.137333e+05 1.153000e+05 1.156667e+05 1.162000e+05 1.179667e+05 1.212333e+05 1.222000e+05 1.234333e+05 1.269333e+05 128700.0
4 Arizona Phoenix 1.118333e+05 1.143667e+05 1.160000e+05 1.174000e+05 1.196000e+05 1.215667e+05 1.227000e+05 1.243000e+05 ... 1.642667e+05 1.653667e+05 1.685000e+05 1.715333e+05 1.741667e+05 1.790667e+05 1.838333e+05 1.879000e+05 1.914333e+05 195200.0

Categories