Reshape pivot table in pandas - python

I need to reshape a csv pivot table. A small extract looks like:
country location confirmedcases_10-02-2020 deaths_10-02-2020 confirmedcases_11-02-2020 deaths_11-02-2020
0 Australia New South Wales 4.0 0.0 4 0.0
1 Australia Victoria 4.0 0.0 4 0.0
2 Australia Queensland 5.0 0.0 5 0.0
3 Australia South Australia 2.0 0.0 2 0.0
4 Cambodia Sihanoukville 1.0 0.0 1 0.0
5 Canada Ontario 3.0 0.0 3 0.0
6 Canada British Columbia 4.0 0.0 4 0.0
7 China Hubei 31728.0 974.0 33366 1068.0
8 China Zhejiang 1177.0 0.0 1131 0.0
9 China Guangdong 1177.0 1.0 1219 1.0
10 China Henan 1105.0 7.0 1135 8.0
11 China Hunan 912.0 1.0 946 2.0
12 China Anhui 860.0 4.0 889 4.0
13 China Jiangxi 804.0 1.0 844 1.0
14 China Chongqing 486.0 2.0 505 3.0
15 China Sichuan 417.0 1.0 436 1.0
16 China Shandong 486.0 1.0 497 1.0
17 China Jiangsu 515.0 0.0 543 0.0
18 China Shanghai 302.0 1.0 311 1.0
19 China Beijing 342.0 3.0 352 3.0
Is there a ready-to-use pandas tool to reshape it into something like:
country location date confirmedcases deaths
0 Australia New South Wales 2020-02-10 4.0 0.0
1 Australia Victoria 2020-02-10 4.0 0.0
2 Australia Queensland 2020-02-10 5.0 0.0
3 Australia South Australia 2020-02-10 2.0 0.0
4 Cambodia Sihanoukville 2020-02-10 1.0 0.0
5 Canada Ontario 2020-02-10 3.0 0.0
6 Canada British Columbia 2020-02-10 4.0 0.0
7 China Hubei 2020-02-10 31728.0 974.0
8 China Zhejiang 2020-02-10 1177.0 0.0
9 China Guangdong 2020-02-10 1177.0 1.0
10 China Henan 2020-02-10 1105.0 7.0
11 China Hunan 2020-02-10 912.0 1.0
12 China Anhui 2020-02-10 860.0 4.0
13 China Jiangxi 2020-02-10 804.0 1.0
14 China Chongqing 2020-02-10 486.0 2.0
15 China Sichuan 2020-02-10 417.0 1.0
16 China Shandong 2020-02-10 486.0 1.0
17 China Jiangsu 2020-02-10 515.0 0.0
18 China Shanghai 2020-02-10 302.0 1.0
19 China Beijing 2020-02-10 342.0 3.0
20 Australia New South Wales 2020-02-11 4.0 0.0
21 Australia Victoria 2020-02-11 4.0 0.0
22 Australia Queensland 2020-02-11 5.0 0.0
23 Australia South Australia 2020-02-11 2.0 0.0
24 Cambodia Sihanoukville 2020-02-11 1.0 0.0
25 Canada Ontario 2020-02-11 3.0 0.0
26 Canada British Columbia 2020-02-11 4.0 0.0
27 China Hubei 2020-02-11 33366.0 1068.0
28 China Zhejiang 2020-02-11 1131.0 0.0
29 China Guangdong 2020-02-11 1219.0 1.0
30 China Henan 2020-02-11 1135.0 8.0
31 China Hunan 2020-02-11 946.0 2.0
32 China Anhui 2020-02-11 889.0 4.0
33 China Jiangxi 2020-02-11 844.0 1.0
34 China Chongqing 2020-02-11 505.0 3.0
35 China Sichuan 2020-02-11 436.0 1.0
36 China Shandong 2020-02-11 497.0 1.0
37 China Jiangsu 2020-02-11 543.0 0.0
38 China Shanghai 2020-02-11 311.0 1.0
39 China Beijing 2020-02-11 352.0 3.0

Use pd.wide_to_long:
print (pd.wide_to_long(df,stubnames=["confirmedcases","deaths"],
i=["country","location"],j="date",sep="_",
suffix=r'\d{2}-\d{2}-\d{4}').reset_index())
country location date confirmedcases deaths
0 Australia New South Wales 10-02-2020 4.0 0.0
1 Australia New South Wales 11-02-2020 4.0 0.0
2 Australia Victoria 10-02-2020 4.0 0.0
3 Australia Victoria 11-02-2020 4.0 0.0
4 Australia Queensland 10-02-2020 5.0 0.0
5 Australia Queensland 11-02-2020 5.0 0.0
6 Australia South Australia 10-02-2020 2.0 0.0
7 Australia South Australia 11-02-2020 2.0 0.0
8 Cambodia Sihanoukville 10-02-2020 1.0 0.0
9 Cambodia Sihanoukville 11-02-2020 1.0 0.0
10 Canada Ontario 10-02-2020 3.0 0.0
11 Canada Ontario 11-02-2020 3.0 0.0
12 Canada British Columbia 10-02-2020 4.0 0.0
13 Canada British Columbia 11-02-2020 4.0 0.0
14 China Hubei 10-02-2020 31728.0 974.0
15 China Hubei 11-02-2020 33366.0 1068.0
16 China Zhejiang 10-02-2020 1177.0 0.0
17 China Zhejiang 11-02-2020 1131.0 0.0
18 China Guangdong 10-02-2020 1177.0 1.0
19 China Guangdong 11-02-2020 1219.0 1.0
20 China Henan 10-02-2020 1105.0 7.0
21 China Henan 11-02-2020 1135.0 8.0
22 China Hunan 10-02-2020 912.0 1.0
23 China Hunan 11-02-2020 946.0 2.0
24 China Anhui 10-02-2020 860.0 4.0
25 China Anhui 11-02-2020 889.0 4.0
26 China Jiangxi 10-02-2020 804.0 1.0
27 China Jiangxi 11-02-2020 844.0 1.0
28 China Chongqing 10-02-2020 486.0 2.0
29 China Chongqing 11-02-2020 505.0 3.0
30 China Sichuan 10-02-2020 417.0 1.0
31 China Sichuan 11-02-2020 436.0 1.0
32 China Shandong 10-02-2020 486.0 1.0
33 China Shandong 11-02-2020 497.0 1.0
34 China Jiangsu 10-02-2020 515.0 0.0
35 China Jiangsu 11-02-2020 543.0 0.0
36 China Shanghai 10-02-2020 302.0 1.0
37 China Shanghai 11-02-2020 311.0 1.0
38 China Beijing 10-02-2020 342.0 3.0
39 China Beijing 11-02-2020 352.0 3.0
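The date column produced above still holds strings such as 10-02-2020, while the desired output shows 2020-02-10 with rows ordered by date. A minimal sketch of that last step, assuming a two-row stand-in for the real data (the small frame is built here only for illustration):

```python
import pandas as pd

# Two-row stand-in for the question's frame (values copied from its sample).
df = pd.DataFrame({
    "country": ["Australia", "China"],
    "location": ["Victoria", "Hubei"],
    "confirmedcases_10-02-2020": [4.0, 31728.0],
    "deaths_10-02-2020": [0.0, 974.0],
    "confirmedcases_11-02-2020": [4, 33366],
    "deaths_11-02-2020": [0.0, 1068.0],
})

long_df = pd.wide_to_long(
    df,
    stubnames=["confirmedcases", "deaths"],
    i=["country", "location"],
    j="date",
    sep="_",
    suffix=r"\d{2}-\d{2}-\d{4}",
).reset_index()

# Parse the day-month-year suffix into timestamps, then order by date first.
long_df["date"] = pd.to_datetime(long_df["date"], format="%d-%m-%Y")
long_df = long_df.sort_values(["date", "country", "location"], ignore_index=True)
print(long_df)
```

Passing an explicit format avoids any day/month ambiguity when the strings are parsed.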

Yes, you can achieve it by reshaping the dataframe in three steps.
First, melt the measure columns so that their names become values:
df = df.melt(['country', 'location'],
[ p for p in df.columns if p not in ['country', 'location'] ],
'key',
'value')
#> country location key value
#> 0 Australia New South Wales confirmedcases_10-02-2020 4
#> 1 Australia Victoria confirmedcases_10-02-2020 4
#> 2 Australia Queensland confirmedcases_10-02-2020 5
#> 3 Australia South Australia confirmedcases_10-02-2020 2
#> 4 Cambodia Sihanoukville confirmedcases_10-02-2020 1
#> .. ... ... ... ...
#> 75 China Sichuan deaths_11-02-2020 1
#> 76 China Shandong deaths_11-02-2020 1
#> 77 China Jiangsu deaths_11-02-2020 0
#> 78 China Shanghai deaths_11-02-2020 1
#> 79 China Beijing deaths_11-02-2020 3
After that you need to separate the values in the column key:
key_split_series = df.key.str.split("_", expand=True)
df["key"] = key_split_series[0]
df["date"] = key_split_series[1]
#> country location key value date
#> 0 Australia New South Wales confirmedcases 4 10-02-2020
#> 1 Australia Victoria confirmedcases 4 10-02-2020
#> 2 Australia Queensland confirmedcases 5 10-02-2020
#> 3 Australia South Australia confirmedcases 2 10-02-2020
#> 4 Cambodia Sihanoukville confirmedcases 1 10-02-2020
#> .. ... ... ... ... ...
#> 75 China Sichuan deaths 1 11-02-2020
#> 76 China Shandong deaths 1 11-02-2020
#> 77 China Jiangsu deaths 0 11-02-2020
#> 78 China Shanghai deaths 1 11-02-2020
#> 79 China Beijing deaths 3 11-02-2020
In the end, you just need to pivot the table to have confirmedcases and deaths back as columns:
df = df.set_index(["country", "location", "date", "key"])["value"].unstack().reset_index()
#> key country location date confirmedcases deaths
#> 0 Australia New South Wales 10-02-2020 4 0
#> 1 Australia New South Wales 11-02-2020 4 0
#> 2 Australia Queensland 10-02-2020 5 0
#> 3 Australia Queensland 11-02-2020 5 0
#> 4 Australia South Australia 10-02-2020 2 0
#> .. ... ... ... ... ...
#> 35 China Shanghai 11-02-2020 311 1
#> 36 China Sichuan 10-02-2020 417 1
#> 37 China Sichuan 11-02-2020 436 1
#> 38 China Zhejiang 10-02-2020 1177 0
#> 39 China Zhejiang 11-02-2020 1131 0
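The three steps above can be wrapped into one helper. This is only a sketch: the function name wide_to_tidy and the two-row sample frame are assumptions made for illustration, not part of the original answer.

```python
import pandas as pd

def wide_to_tidy(df, id_cols):
    """Melt, split 'measure_date' column names, and pivot the measures back."""
    long_df = df.melt(id_vars=id_cols, var_name="key", value_name="value")
    # Split on the first underscore only, so the date part stays intact.
    parts = long_df["key"].str.split("_", n=1, expand=True)
    long_df["key"] = parts[0]
    long_df["date"] = parts[1]
    tidy = (
        long_df.set_index(id_cols + ["date", "key"])["value"]
        .unstack("key")
        .reset_index()
    )
    tidy.columns.name = None  # drop the leftover 'key' label on the columns
    return tidy

df = pd.DataFrame({
    "country": ["Australia", "China"],
    "location": ["Victoria", "Hubei"],
    "confirmedcases_10-02-2020": [4.0, 31728.0],
    "deaths_10-02-2020": [0.0, 974.0],
    "confirmedcases_11-02-2020": [4, 33366],
    "deaths_11-02-2020": [0.0, 1068.0],
})
tidy = wide_to_tidy(df, ["country", "location"])
print(tidy)
```

Clearing the columns name removes the stray "key" header that appears in the printed output above.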

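Use {array}.reshape((-1, 1)) if there is only one feature and {array}.reshape((1, -1)) if there is only one sample. Note that reshape is a NumPy array method; a DataFrame does not have it, so it has to be called on df.to_numpy().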

Related

Assign column names into rows on the condition of value presence in Pandas

I am trying to assign the column name on the condition that the value is not zero. How do I do this in Pandas?
import pandas as pd
df = pd.DataFrame({
'date' : ['2021-08-01', '2021-08-01', '2021-08-01', '2021-08-02', '2021-08-02', '2021-08-02'],
'person': ['type_A', 'type_C', 'type_C', 'type_A', 'type_B', 'type_D'],
'London' : [1, 4, 0, 5, 7, 9],
'New York' : [0.0, 0.0, 3, 0.0, 0.0, 0.0],
'Boston' : [3, 6, 9, 1, 0.0, 7],
'Hong Kong' : [0.0, 0.0, 0.0, 0.0, 3, 0.0]
})
date person London New York Boston Hong Kong
0 2021-08-01 type_A 1 0.0 3.0 0.0
1 2021-08-01 type_C 4 0.0 6.0 0.0
2 2021-08-01 type_C 0 3.0 9.0 0.0
3 2021-08-02 type_A 5 0.0 1.0 0.0
4 2021-08-02 type_B 7 0.0 0.0 3.0
5 2021-08-02 type_D 9 0.0 7.0 0.0
Expected output:
date person London New York Boston Hong Kong Col_name
0 2021-08-01 type_A 1 0.0 3.0 0.0 London, Boston
1 2021-08-01 type_C 4 0.0 6.0 0.0 London, Boston
2 2021-08-01 type_C 0 3.0 9.0 0.0 New York, Boston
3 2021-08-02 type_A 5 0.0 1.0 0.0 London, Boston
4 2021-08-02 type_B 7 0.0 0.0 3.0 London, Hong Kong
5 2021-08-02 type_D 9 0.0 7.0 0.0 London, Boston
Try:
country_cols = ['London', 'New York', 'Boston', 'Hong Kong']
df['Col_name'] = df[country_cols].apply(lambda x: ', '.join(x[x!=0].index.to_list()), axis=1)
Output:
date person London New York Boston Hong Kong Col_name
0 2021-08-01 type_A 1 0.0 3.0 0.0 London, Boston
1 2021-08-01 type_C 4 0.0 6.0 0.0 London, Boston
2 2021-08-01 type_C 0 3.0 9.0 0.0 New York, Boston
3 2021-08-02 type_A 5 0.0 1.0 0.0 London, Boston
4 2021-08-02 type_B 7 0.0 0.0 3.0 London, Hong Kong
5 2021-08-02 type_D 9 0.0 7.0 0.0 London, Boston
Adapting an answer to a similar question, you could avoid apply:
df["not_zeros"] = (
df[df.columns[2:]].ne(0).astype("int")).dot(
df.columns[2:] + ', ').str.rstrip(', ')
date person London New York Boston Hong Kong not_zeros
0 2021-08-01 type_A 1 0.0 3.0 0.0 London, Boston
1 2021-08-01 type_C 4 0.0 6.0 0.0 London, Boston
2 2021-08-01 type_C 0 3.0 9.0 0.0 New York, Boston
3 2021-08-02 type_A 5 0.0 1.0 0.0 London, Boston
4 2021-08-02 type_B 7 0.0 0.0 3.0 London, Hong Kong
5 2021-08-02 type_D 9 0.0 7.0 0.0 London, Boston
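To see why the dot product above yields the column names, here is a small sketch on two made-up rows. It names the city columns explicitly instead of slicing df.columns[2:], an assumption made so that a helper column added earlier (such as Col_name) cannot slip into the mask.

```python
import pandas as pd

df = pd.DataFrame({
    "date": ["2021-08-01", "2021-08-02"],
    "person": ["type_A", "type_B"],
    "London": [1, 7],
    "New York": [0.0, 0.0],
    "Boston": [3, 0.0],
    "Hong Kong": [0.0, 3],
})

city_cols = ["London", "New York", "Boston", "Hong Kong"]  # named explicitly

mask = df[city_cols].ne(0)           # True where the value is non-zero
labels = pd.Index(city_cols) + ", "  # "London, ", "New York, ", ...

# True * "London, " is "London, " and False * "London, " is "", so the
# dot product concatenates the labels of the non-zero columns in each row.
df["not_zeros"] = mask.dot(labels).str.rstrip(", ")
print(df)
```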
Try the following code, which does the same with an explicit loop:
country_names = ['London', 'New York', 'Boston', 'Hong Kong']
# main list to save the country names for each row of the dataframe
col_name = []
for index, row in df.iterrows():
    # temporary list to save the country names for a row
    temp = []
    for country_name in country_names:
        if float(row[country_name]) != 0.0:
            temp.append(country_name)
    # append the country names to col_name as one comma-separated string
    col_name.append(", ".join(temp))
df["col_name"] = col_name

Simple web scrape issues

Apologies if this question is elementary, but I'm a newbie to scraping and am trying to perform a simple scrape of NFL Future prices off of a website, but am not having any luck. My code is below. At this point, I'm just trying to get something/anything to return (ultimately will pull the text of the team names and futures prices), but this code returns "None" and "[]" (an empty list) for the find and find_all functions, respectively. I get the find/find_all parameters by inspecting the first line of the page (Baltimore Ravens) when I see that the team names are held in a span with the class of "style_label__2KJur".
I suspect this has something to do with how the html is loaded. When I print(nfl_futures), I don't see any of the html that I inspected for the first line which is presumably why I get no results. If this is true, how do I expose all of the html I need in order to scrape this data?
Appreciate the help.
import requests
from bs4 import BeautifulSoup
url = "https://www.pinnacle.com/en/football/nfl/matchups#futures"
r = requests.get(url).content
nfl_futures = BeautifulSoup(r, "lxml")
first_line = nfl_futures.find('span', class_="style_label__2KJur")
lines = nfl_futures.find_all('span', class_="style_label__2KJur")
print(first_line)
print(lines)
Output:
None
[]
Process finished with exit code 0
This site is hardly a simple scrape. The page is dynamic. You could use selenium to first render the page, then grab the html to parse with bs4. Or, as stated, grab the data from the API, but then you need to do a little data manipulation to join the results. I always prefer the API method as it's more robust and more efficient.
import requests
import pandas as pd
url = 'https://www.pinnacle.com/config/app.json'
jsonData = requests.get(url).json()
x_api_key = jsonData['api']['haywire']['apiKey']
headers = {
'X-API-Key': x_api_key}
matchups_url = "https://guest.api.arcadia.pinnacle.com/0.1/leagues/889/matchups"
jsonData_matchups = requests.get(matchups_url, headers=headers).json()
df = pd.json_normalize(jsonData_matchups,
record_path = ['participants'],
meta = ['id','type',['special', 'category'],['special', 'description']],
meta_prefix = 'participants.',
errors='ignore')
df['id'] = df['id'].fillna(0).astype(int).astype(str)
df['participants.id'] = df['participants.id'].fillna(0).astype(int).astype(str)
df = df.rename(columns={'id':'participantId','participants.id':'matchupId'})
df_matchups = df[df['participants.type'] == 'matchup']
df_special = df[df['participants.type'] == 'special']
straight_url = 'https://guest.api.arcadia.pinnacle.com/0.1/leagues/889/markets/straight'
jsonData_straight = requests.get(straight_url, headers=headers).json()
df_straight = pd.json_normalize(jsonData_straight,
record_path = ['prices'],
meta = ['type', 'matchupId'],
errors='ignore')
df_straight['matchupId'] = df_straight['matchupId'].fillna(0).astype(int).astype(str)
df_straight['participantId'] = df_straight['participantId'].fillna(0).astype(int).astype(str)
df_filter = df_straight[df_straight['designation'].isin(['home','away','over','under'])]
df_filter = df_filter.pivot_table(index=['matchupId', 'participantId'],
columns='designation',
values=['points','price']).reset_index(drop=False)
df_filter.columns = ['.'.join(x) if x[-1] != '' else x[0] for x in df_filter.columns]
nfl_futures = pd.merge(df_special, df_straight, how='left', left_on=['matchupId', 'participantId'], right_on=['matchupId', 'participantId'])
nfl_matchups = pd.merge(df_matchups, df_filter, how='left', left_on=['matchupId', 'participantId'], right_on=['matchupId', 'participantId'])
Output:
Here's what the first 10 of 324 rows look like for futures:
print(nfl_futures.head(10).to_string())
alignment participantId name order rotation matchupId participants.type participants.special.category participants.special.description points price designation type
0 neutral 1326753860 Over 0 3017.0 1326753859 special Regular Season Wins Dallas Cowboys Regular Season Wins? 9.5 108 NaN total
1 neutral 1326753861 Under 0 3018.0 1326753859 special Regular Season Wins Dallas Cowboys Regular Season Wins? 9.5 -129 NaN total
2 neutral 1336218775 Trevor Lawrence 0 5801.0 1336218774 special NFL Offensive Rookie of the Year NFL Offensive Rookie of the Year 2021-22? NaN 312 NaN moneyline
3 neutral 1336218776 Justin Fields 0 5802.0 1336218774 special NFL Offensive Rookie of the Year NFL Offensive Rookie of the Year 2021-22? NaN 461 NaN moneyline
4 neutral 1336218777 Zach Wilson 0 5803.0 1336218774 special NFL Offensive Rookie of the Year NFL Offensive Rookie of the Year 2021-22? NaN 790 NaN moneyline
5 neutral 1336218778 Trey Lance 0 5804.0 1336218774 special NFL Offensive Rookie of the Year NFL Offensive Rookie of the Year 2021-22? NaN 655 NaN moneyline
6 neutral 1336218779 Mac Jones 0 5805.0 1336218774 special NFL Offensive Rookie of the Year NFL Offensive Rookie of the Year 2021-22? NaN 807 NaN moneyline
7 neutral 1336218780 Kyle Pitts 0 5806.0 1336218774 special NFL Offensive Rookie of the Year NFL Offensive Rookie of the Year 2021-22? NaN 1095 NaN moneyline
8 neutral 1336218781 Najee Harris 0 5807.0 1336218774 special NFL Offensive Rookie of the Year NFL Offensive Rookie of the Year 2021-22? NaN 1015 NaN moneyline
9 neutral 1336218782 DeVonta Smith 0 5808.0 1336218774 special NFL Offensive Rookie of the Year NFL Offensive Rookie of the Year 2021-22? NaN 1903 NaN moneyline
And here are the week 1 matchup lines:
print(nfl_matchups.to_string())
alignment participantId name order rotation matchupId participants.type participants.special.category participants.special.description points.away points.home points.over points.under price.away price.home price.over price.under
0 home 0 Tampa Bay Buccaneers 1 NaN 1327265167 matchup NaN NaN 6.5 -6.5 51.5 51.5 107.0 -118.0 101.0 -112.0
1 away 0 Dallas Cowboys 0 NaN 1327265167 matchup NaN NaN 6.5 -6.5 51.5 51.5 107.0 -118.0 101.0 -112.0
2 home 0 Washington Football Team 1 NaN 1327265554 matchup NaN NaN 0.0 0.0 44.5 44.5 -115.0 104.0 -106.0 -106.0
3 away 0 Los Angeles Chargers 0 NaN 1327265554 matchup NaN NaN 0.0 0.0 44.5 44.5 -115.0 104.0 -106.0 -106.0
4 home 0 Detroit Lions 1 NaN 1327265774 matchup NaN NaN -7.5 7.5 46.0 46.0 101.0 -111.0 -106.0 -106.0
5 away 0 San Francisco 49ers 0 NaN 1327265774 matchup NaN NaN -7.5 7.5 46.0 46.0 101.0 -111.0 -106.0 -106.0
6 home 0 Las Vegas Raiders 1 NaN 1327266134 matchup NaN NaN -4.5 4.5 51.0 51.0 -110.0 -100.0 -106.0 -106.0
7 away 0 Baltimore Ravens 0 NaN 1327266134 matchup NaN NaN -4.5 4.5 51.0 51.0 -110.0 -100.0 -106.0 -106.0
8 home 0 Los Angeles Rams 1 NaN 1327266054 matchup NaN NaN 7.5 -7.5 45.0 45.0 -114.0 103.0 -106.0 -106.0
9 away 0 Chicago Bears 0 NaN 1327266054 matchup NaN NaN 7.5 -7.5 45.0 45.0 -114.0 103.0 -106.0 -106.0
10 home 0 Kansas City Chiefs 1 NaN 1327265828 matchup NaN NaN 6.0 -6.0 52.5 52.5 102.0 -112.0 -106.0 -106.0
11 away 0 Cleveland Browns 0 NaN 1327265828 matchup NaN NaN 6.0 -6.0 52.5 52.5 102.0 -112.0 -106.0 -106.0
12 home 0 Carolina Panthers 1 NaN 1327265337 matchup NaN NaN 4.0 -4.0 43.0 43.0 -105.0 -105.0 -106.0 -106.0
13 away 0 New York Jets 0 NaN 1327265337 matchup NaN NaN 4.0 -4.0 43.0 43.0 -105.0 -105.0 -106.0 -106.0
14 home 0 Cincinnati Bengals 1 NaN 1327265711 matchup NaN NaN -3.5 3.5 48.0 48.0 -105.0 -105.0 -106.0 -106.0
15 away 0 Minnesota Vikings 0 NaN 1327265711 matchup NaN NaN -3.5 3.5 48.0 48.0 -105.0 -105.0 -106.0 -106.0
16 home 0 New Orleans Saints 1 NaN 1327266000 matchup NaN NaN -2.5 2.5 50.0 50.0 -118.0 107.0 -106.0 -106.0
17 away 0 Green Bay Packers 0 NaN 1327266000 matchup NaN NaN -2.5 2.5 50.0 50.0 -118.0 107.0 -106.0 -106.0
18 home 0 Buffalo Bills 1 NaN 1327265283 matchup NaN NaN 7.0 -7.0 50.0 50.0 -116.0 105.0 -106.0 -106.0
19 away 0 Pittsburgh Steelers 0 NaN 1327265283 matchup NaN NaN 7.0 -7.0 50.0 50.0 -116.0 105.0 -106.0 -106.0
20 home 0 Tennessee Titans 1 NaN 1327265444 matchup NaN NaN 3.0 -3.0 51.0 51.0 -102.0 -108.0 -116.0 104.0
21 away 0 Arizona Cardinals 0 NaN 1327265444 matchup NaN NaN 3.0 -3.0 51.0 51.0 -102.0 -108.0 -116.0 104.0
22 home 0 New York Giants 1 NaN 1327265931 matchup NaN NaN -1.0 1.0 42.5 42.5 -110.0 100.0 -106.0 -106.0
23 away 0 Denver Broncos 0 NaN 1327265931 matchup NaN NaN -1.0 1.0 42.5 42.5 -110.0 100.0 -106.0 -106.0
24 home 0 Atlanta Falcons 1 NaN 1327265598 matchup NaN NaN 3.5 -3.5 48.0 48.0 -108.0 -102.0 -106.0 -105.0
25 away 0 Philadelphia Eagles 0 NaN 1327265598 matchup NaN NaN 3.5 -3.5 48.0 48.0 -108.0 -102.0 -106.0 -105.0
26 home 0 Indianapolis Colts 1 NaN 1327265657 matchup NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
27 away 0 Seattle Seahawks 0 NaN 1327265657 matchup NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
28 home 0 New England Patriots 1 NaN 1327265876 matchup NaN NaN 2.5 -2.5 45.5 45.5 104.0 -115.0 103.0 -115.0
29 away 0 Miami Dolphins 0 NaN 1327265876 matchup NaN NaN 2.5 -2.5 45.5 45.5 104.0 -115.0 103.0 -115.0
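The json_normalize call in the answer above may be easier to follow on a tiny input. The payload below is hypothetical: it only mimics the fields the answer reads (id, type, special.category, participants) and is not real API output.

```python
import pandas as pd

# Hypothetical payload shaped like the fields used above; not real API data.
payload = [
    {
        "id": 100,
        "type": "special",
        "special": {"category": "Regular Season Wins"},
        "participants": [
            {"id": 101, "name": "Over"},
            {"id": 102, "name": "Under"},
        ],
    },
    {
        "id": 200,
        "type": "matchup",  # no "special" key, so that meta field becomes NaN
        "participants": [{"id": 201, "name": "Dallas Cowboys"}],
    },
]

flat = pd.json_normalize(
    payload,
    record_path=["participants"],                  # one output row per participant
    meta=["id", "type", ["special", "category"]],  # parent fields copied to each row
    meta_prefix="participants.",
    errors="ignore",                               # missing meta keys give NaN
)
print(flat)
```

Each participant becomes a row, so the unprefixed id is the participant's id and participants.id is the parent's id, which is why the answer renames them to participantId and matchupId.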
Try using html.parser instead of lxml. Also, print your nfl_futures variable to check whether you are actually getting an HTML page.
If you are, check whether the element(s) you are looking for exist in that HTML.

python selenium: <input> element not interactable

I was trying to crawl down nba player info from https://nba.com/players and click the button "Show Historic" on the webpage
Part of the HTML for the input button is shown below:
<div aria-label="Show Historic Toggle" class="Toggle_switch__2e_90">
<input type="checkbox" class="Toggle_input__gIiFd" name="showHistoric">
<span class="Toggle_slider__hCMQQ Toggle_sliderActive__15Jrf Toggle_slidercerulean__1UnnV">
</span>
</div>
I simply use find_element_by_xpath to locate the input button and click
button_show_historic = driver.find_element_by_xpath("//input[@name='showHistoric']")
button_show_historic.click()
However it says:
Exception has occurred: ElementNotInteractableException
Message: element not interactable
(Session info: chrome=88.0.4324.192)
Could anyone help on solving this issue? Is this because the input is not visible?
Simply wait for the span element, not the input element, and click it.
wait = WebDriverWait(driver, 30)
driver.get('https://www.nba.com/players')
wait.until(EC.element_to_be_clickable((By.XPATH,"//button[.='I Accept']"))).click()
wait.until(EC.element_to_be_clickable((By.XPATH,"//input[@name='showHistoric']/preceding::span[1]"))).click()
Import
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
Also, to find an API, look under Developer Tools -> Network, then check the Headers and Response tabs of each request to see whether the data is populated there.
Most probably the problem is that you don't have any wait code. You should wait until the page has loaded. You can use Python's simple sleep function:
import time
time.sleep(3) #it will wait 3 seconds
##Do your action
Or you can use an explicit wait. Check this page: selenium.dev
No need to use selenium when there's an api. Try this:
import requests
import pandas as pd
url = 'https://stats.nba.com/stats/playerindex'
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/88.0.4324.150 Safari/537.36',
'Referer': 'http://stats.nba.com'}
payload = {
'College': '',
'Country': '',
'DraftPick': '',
'DraftRound': '',
'DraftYear': '',
'Height': '' ,
'Historical': '1',
'LeagueID': '00',
'Season': '2020-21',
'SeasonType': 'Regular Season',
'TeamID': '0',
'Weight': ''}
jsonData = requests.get(url, headers=headers, params=payload).json()
cols = jsonData['resultSets'][0]['headers']
data = jsonData['resultSets'][0]['rowSet']
df = pd.DataFrame(data, columns=cols)
Output: [4589 rows x 26 columns]
print(df.head(20).to_string())
PERSON_ID PLAYER_LAST_NAME PLAYER_FIRST_NAME PLAYER_SLUG TEAM_ID TEAM_SLUG IS_DEFUNCT TEAM_CITY TEAM_NAME TEAM_ABBREVIATION JERSEY_NUMBER POSITION HEIGHT WEIGHT COLLEGE COUNTRY DRAFT_YEAR DRAFT_ROUND DRAFT_NUMBER ROSTER_STATUS PTS REB AST STATS_TIMEFRAME FROM_YEAR TO_YEAR
0 76001 Abdelnaby Alaa alaa-abdelnaby 1.610613e+09 blazers 0 Portland Trail Blazers POR 30 F 6-10 240 Duke USA 1990.0 1.0 25.0 NaN 5.7 3.3 0.3 Career 1990 1994
1 76002 Abdul-Aziz Zaid zaid-abdul-aziz 1.610613e+09 rockets 0 Houston Rockets HOU 54 C 6-9 235 Iowa State USA 1968.0 1.0 5.0 NaN 9.0 8.0 1.2 Career 1968 1977
2 76003 Abdul-Jabbar Kareem kareem-abdul-jabbar 1.610613e+09 lakers 0 Los Angeles Lakers LAL 33 C 7-2 225 UCLA USA 1969.0 1.0 1.0 NaN 24.6 11.2 3.6 Career 1969 1988
3 51 Abdul-Rauf Mahmoud mahmoud-abdul-rauf 1.610613e+09 nuggets 0 Denver Nuggets DEN 1 G 6-1 162 Louisiana State USA 1990.0 1.0 3.0 NaN 14.6 1.9 3.5 Career 1990 2000
4 1505 Abdul-Wahad Tariq tariq-abdul-wahad 1.610613e+09 kings 0 Sacramento Kings SAC 9 F-G 6-6 235 San Jose State France 1997.0 1.0 11.0 NaN 7.8 3.3 1.1 Career 1997 2003
5 949 Abdur-Rahim Shareef shareef-abdur-rahim 1.610613e+09 grizzlies 0 Memphis Grizzlies MEM 3 F 6-9 245 California USA 1996.0 1.0 3.0 NaN 18.1 7.5 2.5 Career 1996 2007
6 76005 Abernethy Tom tom-abernethy 1.610613e+09 warriors 0 Golden State Warriors GSW 5 F 6-7 220 Indiana USA 1976.0 3.0 43.0 NaN 5.6 3.2 1.2 Career 1976 1980
7 76006 Able Forest forest-able 1.610613e+09 sixers 0 Philadelphia 76ers PHI 6 G 6-3 180 Western Kentucky USA 1956.0 NaN NaN NaN 0.0 1.0 1.0 Career 1956 1956
8 76007 Abramovic John john-abramovic 1.610610e+09 None 1 Pittsburgh Ironmen PIT None F 6-3 195 Salem USA NaN NaN NaN NaN 9.5 NaN 0.7 Career 1946 1947
9 203518 Abrines Alex alex-abrines 1.610613e+09 thunder 0 Oklahoma City Thunder OKC 8 G 6-6 190 FC Barcelona Spain 2013.0 2.0 32.0 NaN 5.3 1.4 0.5 Career 2016 2018
10 1630173 Achiuwa Precious precious-achiuwa 1.610613e+09 heat 0 Miami Heat MIA 5 F 6-8 225 Memphis Nigeria 2020.0 1.0 20.0 1.0 5.9 3.9 0.6 Season 2020 2020
11 101165 Acker Alex alex-acker 1.610613e+09 clippers 0 LA Clippers LAC 3 G 6-5 185 Pepperdine USA 2005.0 2.0 60.0 NaN 2.7 1.0 0.5 Career 2005 2008
12 76008 Ackerman Donald donald-ackerman 1.610613e+09 knicks 0 New York Knicks NYK G 6-0 183 Long Island-Brooklyn USA 1953.0 2.0 NaN NaN 1.5 0.5 0.8 Career 1953 1953
13 76009 Acres Mark mark-acres 1.610613e+09 magic 0 Orlando Magic ORL 42 C 6-11 220 Oral Roberts USA 1985.0 2.0 40.0 NaN 3.6 4.1 0.5 Career 1987 1992
14 76010 Acton Charles charles-acton 1.610613e+09 rockets 0 Houston Rockets HOU 24 F 6-6 210 Hillsdale USA NaN NaN NaN NaN 3.3 2.0 0.5 Career 1967 1967
15 203112 Acy Quincy quincy-acy 1.610613e+09 kings 0 Sacramento Kings SAC 13 F 6-7 240 Baylor USA 2012.0 2.0 37.0 NaN 4.9 3.5 0.6 Career 2012 2018
16 76011 Adams Alvan alvan-adams 1.610613e+09 suns 0 Phoenix Suns PHX 33 C 6-9 210 Oklahoma USA 1975.0 1.0 4.0 NaN 14.1 7.0 4.1 Career 1975 1987
17 76012 Adams Don don-adams 1.610613e+09 pistons 0 Detroit Pistons DET 10 F 6-7 210 Northwestern USA 1970.0 8.0 120.0 NaN 8.7 5.6 1.8 Career 1970 1976
18 200801 Adams Hassan hassan-adams 1.610613e+09 nets 0 Brooklyn Nets BKN 8 F 6-4 220 Arizona USA 2006.0 2.0 54.0 NaN 2.5 1.2 0.2 Career 2006 2008
19 1629121 Adams Jaylen jaylen-adams 1.610613e+09 bucks 0 Milwaukee Bucks MIL 6 G 6-0 225 St. Bonaventure USA NaN NaN NaN 1.0 0.3 0.4 0.3 Season 2018 2020
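The headers/rowSet layout read in the answer above can be tried offline. The payload below is a hypothetical, heavily trimmed stand-in, and it assumes the year fields arrive as strings, which is not confirmed by the output shown.

```python
import pandas as pd

# Hypothetical, trimmed payload in the headers/rowSet layout used above.
json_data = {
    "resultSets": [
        {
            "headers": ["PERSON_ID", "PLAYER_LAST_NAME", "FROM_YEAR", "TO_YEAR"],
            "rowSet": [
                [76001, "Abdelnaby", "1990", "1994"],
                [1630173, "Achiuwa", "2020", "2020"],
            ],
        }
    ]
}

result = json_data["resultSets"][0]
players = pd.DataFrame(result["rowSet"], columns=result["headers"])

# Assumption: years are strings; convert them before comparing numerically.
players[["FROM_YEAR", "TO_YEAR"]] = players[["FROM_YEAR", "TO_YEAR"]].astype(int)

# Keep only players whose last season ended before 2020.
historic = players[players["TO_YEAR"] < 2020]
print(historic)
```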

VLookup in Pandas using merge

I have 2 dataframes:
df_dict:
Bet365 Team (Dataset) Record ID
-- -------------------- ---------------- -----------
0 Lincoln City Lincoln 50
1 Peterborough Peterboro 65
2 Cambridge Utd Cambridge 72
3 Harrogate Town Harrogate 87
4 Cologne FC Koln 160
5 Hertha Berlin Hertha 167
6 Arminia Bielefeld Bielefeld 169
7 Schalke Schalke 04 173
8 TSG Hoffenheim Hoffenheim 174
9 SC Freiburg Freiburg 175
10 Zulte-Waregem Waregem 320
11 Royal Excel Mouscron Mouscron 325
Other dataframe:
df_odds:
DateTime League HomeTeam AwayTeam B365H B365D B365A
-- -------------------------- ---------------------- ----------------- -------------------- ------- ------- -------
0 2021-01-09 12:30:00.000001 England League 1 Lincoln City Peterborough 2.29 3.4 3.1
1 2021-01-09 15:00:00 England League 2 Cambridge Utd Harrogate Town 2.29 3.2 3.25
2 2021-01-09 15:14:59.999999 Belgium First Division Zulte-Waregem Royal Excel Mouscron 1.85 3.75 3.8
3 2021-01-09 14:29:59.999999 Germany Bundesliga 1 SC Freiburg Cologne 1.9 3.75 3.75
4 2021-01-09 14:29:59.999999 Germany Bundesliga 1 Schalke TSG Hoffenheim 3.8 3.8 1.85
5 2021-01-10 17:00:00.000001 Germany Bundesliga 1 Arminia Bielefeld Hertha Berlin 4 3.5 1.9
6 2021-01-16 14:29:59.999999 Germany Bundesliga 1 Cologne Hertha Berlin 3.2 3.3 2.25
I would like to merge the dataset to get the final dataframe as:
df_expected
DateTime League HomeTeam AwayTeam B365H B365D B365A
-- -------------------------- ---------------------- ---------- ---------- ------- ------- -------
0 2021-01-09 12:30:00.000001 England League 1 Lincoln Peterboro 2.29 3.4 3.1
1 2021-01-09 15:00:00 England League 2 Cambridge Harrogate 2.29 3.2 3.25
2 2021-01-09 15:14:59.999999 Belgium First Division Waregem Mouscron 1.85 3.75 3.8
3 2021-01-09 14:29:59.999999 Germany Bundesliga 1 Freiburg FC Koln 1.9 3.75 3.75
4 2021-01-09 14:29:59.999999 Germany Bundesliga 1 Schalke 04 Hoffenheim 3.8 3.8 1.85
5 2021-01-10 17:00:00.000001 Germany Bundesliga 1 Bielefeld Hertha 4 3.5 1.9
6 2021-01-16 14:29:59.999999 Germany Bundesliga 1 FC Koln Hertha 3.2 3.3 2.25
The common key is df_dict.Bet365.
I am trying to use pd.merge, but I am unable to get the right keys and the correct join.
Help would be greatly appreciated.
Use Series.map on both columns, with a Series built by setting the Bet365 column as the index:
s = df_dict.set_index('Bet365')['Team (Dataset)']
df_odds['HomeTeam'] = df_odds['HomeTeam'].map(s)
df_odds['AwayTeam'] = df_odds['AwayTeam'].map(s)
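One thing to watch with map is that a team missing from df_dict becomes NaN. A minimal sketch of keeping the original name in that case, using two rows from the lookup table and a made-up team ("Unknown FC") that is deliberately absent from it:

```python
import pandas as pd

df_dict = pd.DataFrame({
    "Bet365": ["Lincoln City", "Peterborough"],
    "Team (Dataset)": ["Lincoln", "Peterboro"],
})
df_odds = pd.DataFrame({
    "HomeTeam": ["Lincoln City", "Unknown FC"],  # "Unknown FC" is made up
    "AwayTeam": ["Peterborough", "Lincoln City"],
})

lookup = df_dict.set_index("Bet365")["Team (Dataset)"]

for col in ["HomeTeam", "AwayTeam"]:
    # map() yields NaN for names missing from the lookup; keep the original there.
    df_odds[col] = df_odds[col].map(lookup).fillna(df_odds[col])
print(df_odds)
```

Whether to keep the original name or leave NaN as a visible signal depends on how strictly the dictionary is expected to cover the data.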

Pandas: Fill missing date with information from other rows

Suppose I have the following pandas dataframe:
Date Region Country Cases Deaths Lat Long
2020-03-08 Northern Territory Australia 27 49 -12.4634 130.8456
2020-03-09 Northern Territory Australia 80 85 -12.4634 130.8456
2020-03-12 Northern Territory Australia 35 73 -12.4634 130.8456
2020-03-08 Western Australia Australia 48 20 -31.9505 115.8605
2020-03-09 Western Australia Australia 70 12 -31.9505 115.8605
2020-03-10 Western Australia Australia 66 95 -31.9505 115.8605
2020-03-11 Western Australia Australia 31 38 -31.9505 115.8605
2020-03-12 Western Australia Australia 40 83 -31.9505 115.8605
I need to update the dataframe with the missing dates for the Northern Territory, 2020-03-10 and 2020-03-11. However, I want to copy all the information except for cases and deaths, which should be 0. Like this:
Date Region Country Cases Deaths Lat Long
2020-03-08 Northern Territory Australia 27 49 -12.4634 130.8456
2020-03-09 Northern Territory Australia 80 85 -12.4634 130.8456
2020-03-10 Northern Territory Australia 0 0 -12.4634 130.8456
2020-03-11 Northern Territory Australia 0 0 -12.4634 130.8456
2020-03-12 Northern Territory Australia 35 73 -12.4634 130.8456
2020-03-08 Western Australia Australia 48 20 -31.9505 115.8605
2020-03-09 Western Australia Australia 70 12 -31.9505 115.8605
2020-03-10 Western Australia Australia 66 95 -31.9505 115.8605
2020-03-11 Western Australia Australia 31 38 -31.9505 115.8605
2020-03-12 Western Australia Australia 40 83 -31.9505 115.8605
The only way I can think of doing this is to iterate through all combinations of dates and countries.
EDIT
Erfan seems to be on the right track, but I can't get it to work. Here is the actual data I'm working with instead of a toy example.
import pandas as pd
unique_group = ['province','country','county']
csbs_df = pd.read_csv(
'https://jordansdatabucket.s3-us-west-2.amazonaws.com/covid19data/csbs_df.csv.gz', index_col=0)
csbs_df['Date'] = pd.to_datetime(csbs_df['Date'], infer_datetime_format=True)
new_df = (
csbs_df.set_index('Date')
.groupby(unique_group)
.resample('D').first()
.fillna(dict.fromkeys(['confirmed', 'deaths'], 0))
.ffill()
.reset_index(level=3)
.reset_index(drop=True))
new_df.head()
Date id lat lon Timestamp province country_code country county confirmed deaths source Date_text
0 2020-03-25 1094.0 32.534893 -86.642709 2020-03-25 00:00:00+00:00 Alabama US US Autauga 1.0 0.0 CSBS 03/25/20
1 2020-03-26 901.0 32.534893 -86.642709 2020-03-26 00:00:00+00:00 Alabama US US Autauga 4.0 0.0 CSBS 03/26/20
2 2020-03-24 991.0 30.735891 -87.723525 2020-03-24 00:00:00+00:00 Alabama US US Baldwin 3.0 0.0 CSBS 03/24/20
3 2020-03-25 1080.0 30.735891 -87.723525 2020-03-25 00:00:00+00:00 Alabama US US Baldwin 4.0 0.0 CSBS 03/25/20
4 2020-03-26 1139.0 30.735891 -87.723525 2020-03-26 16:52:00+00:00 Alabama US US Baldwin 4.0 0.0 CSBS 03/26/20
You can see that it is not inserting the missing days as the daily resample specifies. I'm not sure what's wrong.
Edit 2
Here is my solution based on Erfan's answer.
import pandas as pd
csbs_df = pd.read_csv(
'https://jordansdatabucket.s3-us-west-2.amazonaws.com/covid19data/csbs_df.csv.gz', index_col=0)
# parse the dates first so the index matches the datetime range used in reindex
csbs_df['Date'] = pd.to_datetime(csbs_df['Date'])
date_range = pd.date_range(csbs_df['Date'].min(),csbs_df['Date'].max(),freq='1D')
unique_group = ['country','province','county']
gb = csbs_df.groupby(unique_group)
sub_dfs = []
for g in gb.groups:
    sub_df = gb.get_group(g)
    sub_df = (
        sub_df.set_index('Date')
        .reindex(date_range)
        .fillna(dict.fromkeys(['confirmed', 'deaths'], 0))
        .bfill()
        .ffill()
        .reset_index()
        .rename({'index':'Date'},axis=1)
        .drop(columns='id'))
    sub_df['Date_text'] = sub_df['Date'].dt.strftime('%m/%d/%y')
    sub_df['Timestamp'] = pd.to_datetime(sub_df['Date'],utc=True)
    sub_dfs.append(sub_df)
all_concat = pd.concat(sub_dfs)
assert((all_concat.groupby(['province','country','county']).count() == 3).all().all())
Using GroupBy.resample, ffill and fillna:
The idea here is that we want to fill the gaps in the dates for each group of Region and Country. This is called resampling a time series.
That's why we use GroupBy.resample instead of DataFrame.resample here. Furthermore, fillna and ffill are needed to fill the data according to your logic.
df['Date'] = pd.to_datetime(df['Date'], infer_datetime_format=True)
dfn = (
df.set_index('Date')
.groupby(['Region', 'Country'])
.resample('D').first()
.fillna(dict.fromkeys(['Cases', 'Deaths'], 0))
.ffill()
.reset_index(level=2)
.reset_index(drop=True)
)
Date Region Country Cases Deaths Lat Long
0 2020-03-08 Northern Territory Australia 27.0 49.0 -12.4634 130.8456
1 2020-03-09 Northern Territory Australia 80.0 85.0 -12.4634 130.8456
2 2020-03-10 Northern Territory Australia 0.0 0.0 -12.4634 130.8456
3 2020-03-11 Northern Territory Australia 0.0 0.0 -12.4634 130.8456
4 2020-03-12 Northern Territory Australia 35.0 73.0 -12.4634 130.8456
5 2020-03-08 Western Australia Australia 48.0 20.0 -31.9505 115.8605
6 2020-03-09 Western Australia Australia 70.0 12.0 -31.9505 115.8605
7 2020-03-10 Western Australia Australia 66.0 95.0 -31.9505 115.8605
8 2020-03-11 Western Australia Australia 31.0 38.0 -31.9505 115.8605
9 2020-03-12 Western Australia Australia 40.0 83.0 -31.9505 115.8605
Edit:
It seems indeed that not all places have the same start and end date, so we have to take that into account. The following works:
csbs_df = pd.read_csv(
'https://jordansdatabucket.s3-us-west-2.amazonaws.com/covid19data/csbs_df.csv.gz'
).iloc[:, 1:]
csbs_df['Date_text'] = pd.to_datetime(csbs_df['Date_text'])
date_range = pd.date_range(csbs_df['Date_text'].min(), csbs_df['Date_text'].max(), freq='D')
def reindex_dates(data, dates):
    # csbs_df names the count columns 'confirmed' and 'deaths'
    data = data.reindex(dates).fillna(dict.fromkeys(['confirmed', 'deaths'], 0)).ffill().bfill()
    return data
dfn = (
csbs_df.set_index('Date_text')
.groupby('id').apply(lambda x: reindex_dates(x, date_range))
.reset_index(level=0, drop=True)
.reset_index()
.rename(columns={'index': 'Date'})
)
print(dfn.head())
Date id lat lon Timestamp \
0 2020-03-24 0.0 40.714550 -74.007140 2020-03-24 00:00:00+00:00
1 2020-03-25 0.0 40.714550 -74.007140 2020-03-25 00:00:00+00:00
2 2020-03-26 0.0 40.714550 -74.007140 2020-03-26 00:00:00+00:00
3 2020-03-24 1.0 41.163198 -73.756063 2020-03-24 00:00:00+00:00
4 2020-03-25 1.0 41.163198 -73.756063 2020-03-25 00:00:00+00:00
Date province country_code country county confirmed deaths \
0 2020-03-24 New York US US New York 13119.0 125.0
1 2020-03-25 New York US US New York 15597.0 192.0
2 2020-03-26 New York US US New York 20011.0 280.0
3 2020-03-24 New York US US Westchester 2894.0 0.0
4 2020-03-25 New York US US Westchester 3891.0 1.0
source
0 CSBS
1 CSBS
2 CSBS
3 CSBS
4 CSBS
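The reindex idea can also be tried on a toy frame without downloading the CSV. The sketch below assumes a shortened, made-up variant of the question's example (three dates instead of five) and uses a plain loop over the groups; it is one possible arrangement, not the answer's exact code.

```python
import pandas as pd

# Shortened, made-up variant of the question's toy data.
df = pd.DataFrame({
    "Date": ["2020-03-08", "2020-03-10", "2020-03-08", "2020-03-09", "2020-03-10"],
    "Region": ["Northern Territory"] * 2 + ["Western Australia"] * 3,
    "Country": ["Australia"] * 5,
    "Cases": [27, 35, 48, 70, 66],
    "Deaths": [49, 73, 20, 12, 95],
    "Lat": [-12.4634] * 2 + [-31.9505] * 3,
    "Long": [130.8456] * 2 + [115.8605] * 3,
})
df["Date"] = pd.to_datetime(df["Date"])
all_dates = pd.date_range(df["Date"].min(), df["Date"].max(), freq="D")

pieces = []
for _, group in df.groupby(["Region", "Country"], sort=False):
    filled = group.set_index("Date").reindex(all_dates)
    # Counts for the inserted dates become 0 ...
    filled[["Cases", "Deaths"]] = filled[["Cases", "Deaths"]].fillna(0)
    # ... while the descriptive columns are copied from neighbouring rows.
    filled = filled.ffill().bfill()
    pieces.append(filled.rename_axis("Date").reset_index())

result = pd.concat(pieces, ignore_index=True)
print(result)
```

Filling the count columns before the forward and backward fill matters: otherwise the counts of the previous day would be copied into the inserted rows.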
