How Do I Merge DFs with a for loop - python

Below is my code. What I want to do is merge the spread and total values for each week, which I have saved in separate files. It works perfectly for individual weeks, but doesn't when I introduce the for loop. I assume it's overwriting each time it merges, but when I place the .merge code outside the for loop, it only writes the last iteration to the Excel file.
year = 2015
weeks = np.arange(1,18)
for week in weeks:
    odds = pd.read_excel(fr'C:\Users\logan\Desktop\Gambling_Scraper\Odds_{year}\Odds{year}Wk{week}.xlsx')
    odds['Favorite'] = odds['Favorite'].map(lambda x: x.lstrip('at '))
    odds['Underdog'] = odds['Underdog'].map(lambda x: x.lstrip('at '))
    odds['UD_Spread'] = odds['Spread'] * -1
    #new df to add spread
    new_df = pd.DataFrame(odds['Favorite'].append(odds['Underdog']))
    new_df['Tm'] = new_df
    new_df['Wk'] = new_df['Tm'] + str(week)
    new_df['Spread'] = odds['Spread'].append(odds['UD_Spread'])
    #new df to add total
    total_df = pd.DataFrame(odds['Favorite'].append(odds['Underdog']))
    total_df['Tm'] = total_df
    total_df['Wk'] = total_df['Tm'] + str(week)
    total_df['Total'] = pd.DataFrame(odds['Total'].append(odds['Total']))
    df['Week'] = df['Week'].astype(int)
    df['Merge'] = df['Tm'].astype(str) + df['Week'].astype(str)
    df = df.merge(new_df['Spread'], left_on='Merge', right_on=new_df['Wk'], how='left')
    df = df.merge(total_df['Total'], left_on='Merge', right_on=total_df['Wk'], how='left')
    df['Implied Tm Pts'] = df['Total'].astype(float) / 2 - df['Spread'].astype(float) / 2
    df.to_excel('DFS2015.xlsx')
What I get:
Name Position Week Tm Merge Spread Total Implied Tm Pts
Devonta Freeman RB 1 Falcons Falcons1 3 55 26
Devonta Freeman RB 2 Falcons Falcons2
Devonta Freeman RB 3 Falcons Falcons3
Devonta Freeman RB 4 Falcons Falcons4
Devonta Freeman RB 5 Falcons Falcons5
Devonta Freeman RB 6 Falcons Falcons6
Devonta Freeman RB 7 Falcons Falcons7
Devonta Freeman RB 8 Falcons Falcons8
Devonta Freeman RB 9 Falcons Falcons9
Devonta Freeman RB 11 Falcons Falcons11
Devonta Freeman RB 13 Falcons Falcons13
Devonta Freeman RB 14 Falcons Falcons14
Devonta Freeman RB 15 Falcons Falcons15
Devonta Freeman RB 16 Falcons Falcons16
Devonta Freeman RB 17 Falcons Falcons17
Antonio Brown WR 1 Steelers Steelers1 7 51 22
But I need a value in each row.
Trying to merge 'Spread' and 'Total' from this data:
Date Favorite Spread Underdog Spread2 Total Away Money Line Home Money Line Week Favs Spread Uds Spread2
September 10, 2015 8:30 PM Patriots -7.0 Steelers 7 51.0 +270 -340 1 Patriots1 -7.0 Steelers1 7
September 13, 2015 1:00 PM Packers -6.0 Bears 6 48.0 -286 +230 1 Packers1 -6.0 Bears1 6
September 13, 2015 1:00 PM Chiefs -1.0 Texans 1 40.0 -115 -105 1 Chiefs1 -1.0 Texans1 1
September 13, 2015 1:00 PM Jets -4.0 Browns 4 40.0 +170 -190 1 Jets1 -4.0 Browns1 4
September 13, 2015 1:00 PM Colts -1.0 Bills 1 44.0 -115 -105 1 Colts1 -1.0 Bills1 1
September 13, 2015 1:00 PM Dolphins -4.0 Football Team 4 46.0 -210 +175 1 Dolphins1 -4.0 Football Team1 4
September 13, 2015 1:00 PM Panthers -3.0 Jaguars 3 41.0 -150 +130 1 Panthers1 -3.0 Jaguars1 3
September 13, 2015 1:00 PM Seahawks -4.0 Rams 4 42.0 -185 +160 1 Seahawks1 -4.0 Rams1 4
September 13, 2015 4:05 PM Cardinals -2.0 Saints 2 49.0 +120 -140 1 Cardinals1 -2.0 Saints1 2
September 13, 2015 4:05 PM Chargers -4.0 Lions 4 46.0 +160 -180 1 Chargers1 -4.0 Lions1 4
September 13, 2015 4:25 PM Buccaneers -3.0 Titans 3 40.0 +130 -150 1 Buccaneers1 -3.0 Titans1 3
September 13, 2015 4:25 PM Bengals -3.0 Raiders 3 43.0 -154 +130 1 Bengals1 -3.0 Raiders1 3
September 13, 2015 4:25 PM Broncos -4.0 Ravens 4 46.0 +180 -220 1 Broncos1 -4.0 Ravens1 4
September 13, 2015 8:30 PM Cowboys -7.0 Giants 7 52.0 +240 -300 1 Cowboys1 -7.0 Giants1 7
September 14, 2015 7:10 PM Eagles -3.0 Falcons 3 55.0 -188 +150 1 Eagles1 -3.0 Falcons1 3
September 14, 2015 10:20 PM Vikings -2.0 49ers 2 42.0 -142 +120 1 Vikings1 -2.0 49ers1 2
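A possible fix (not from the original post, just a hedged sketch, assuming df is the player DataFrame already loaded as in the code above): build a small spread/total lookup for every week inside the loop, collect the pieces, and merge once after the loop, so later weeks do not overwrite earlier ones.

import numpy as np
import pandas as pd

year = 2015
pieces = []
for week in np.arange(1, 18):
    odds = pd.read_excel(fr'C:\Users\logan\Desktop\Gambling_Scraper\Odds_{year}\Odds{year}Wk{week}.xlsx')
    odds['Favorite'] = odds['Favorite'].map(lambda x: x.lstrip('at '))
    odds['Underdog'] = odds['Underdog'].map(lambda x: x.lstrip('at '))
    odds['UD_Spread'] = odds['Spread'] * -1
    # one row per team per week: the team's spread and the game total
    teams = pd.concat([odds['Favorite'], odds['Underdog']], ignore_index=True)
    pieces.append(pd.DataFrame({
        'Merge': teams + str(week),
        'Spread': pd.concat([odds['Spread'], odds['UD_Spread']], ignore_index=True),
        'Total': pd.concat([odds['Total'], odds['Total']], ignore_index=True),
    }))

lookup = pd.concat(pieces, ignore_index=True)

# single merge against the full-season lookup
df['Merge'] = df['Tm'].astype(str) + df['Week'].astype(int).astype(str)
df = df.merge(lookup, on='Merge', how='left')
df['Implied Tm Pts'] = df['Total'].astype(float) / 2 - df['Spread'].astype(float) / 2
df.to_excel('DFS2015.xlsx')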

Related

How to duplicate rows on DataFrame based on most recent row date

My data looks something like this:
Report Date   Location     Data
8/6/2021      St. Louis    100
8/1/2021      St. Louis    89
7/29/2021     St. Louis    85
7/24/2021     St. Louis    80
7/30/2021     Louisville   92
7/25/2021     Louisville   79
But when I plot the data in plotly using the built-in animation_group and animation_frame, the slider bar jumps from row to row by nature, which doesn't make for an intuitive animation when each 'jump' is not the same number of days.
What I'm trying to do as a work-around is create a new table that duplicates rows and keeps the true report data, but adds an 'animation date' to keep the slider transitions intuitive. I'd like the new table to look something like the below. Assume the date the code was run was 8/6/2021.
Report Date   Animation Date   Location     Data   Days Since Most Recent Report
8/6/2021      8/6/2021         St. Louis    100    0
8/1/2021      8/5/2021         St. Louis    89     4
8/1/2021      8/4/2021         St. Louis    89     3
8/1/2021      8/3/2021         St. Louis    89     2
8/1/2021      8/2/2021         St. Louis    89     1
8/1/2021      8/1/2021         St. Louis    89     0
7/29/2021     7/30/2021        St. Louis    85     1
7/29/2021     7/29/2021        St. Louis    85     0
7/24/2021     7/28/2021        St. Louis    80     4
7/24/2021     7/27/2021        St. Louis    80     3
7/24/2021     7/26/2021        St. Louis    80     2
7/24/2021     7/25/2021        St. Louis    80     1
7/24/2021     7/24/2021        St. Louis    80     0
7/30/2021     8/6/2021         Louisville   92     7
7/30/2021     8/5/2021         Louisville   92     6
7/30/2021     8/4/2021         Louisville   92     5
7/30/2021     8/3/2021         Louisville   92     4
7/30/2021     8/2/2021         Louisville   92     3
7/30/2021     8/1/2021         Louisville   92     2
7/30/2021     7/31/2021        Louisville   92     1
7/30/2021     7/30/2021        Louisville   92     0
7/25/2021     7/29/2021        Louisville   79     4
7/25/2021     7/28/2021        Louisville   79     3
7/25/2021     7/27/2021        Louisville   79     2
7/25/2021     7/26/2021        Louisville   79     1
7/25/2021     7/25/2021        Louisville   79     0
By doing this, the animation could display 'Days Since Most Recent Report' or 'Report Date' to show that, as the animation plays, some of the data displayed might be somewhat stale, but the animation traverses time appropriately and there is data displayed throughout. Each time the 'Animation Date' matches up with a 'Report Date', a new bit of data is displayed for each 'Animation Date' until a new 'Report Date' is hit, and the cycle repeats itself until the animation is brought up to the present day.
If there is an easier way to work around this in plotly, please let me know! Otherwise, I'm having trouble getting off the ground with the logic of creating a new DataFrame while iterating through the old one.
IIUC you can reindex through pd.MultiIndex.from_tuples:
df["Animation Date"] = pd.to_datetime(df["Report Date"])
max_date = df["Report Date"].max()
idx = pd.MultiIndex.from_tuples([[x, d] for x, y in df.groupby("Location")["Animation Date"]
for d in pd.date_range(min(y), max_date)],
names=["Location", "Animation Date"])
s = df.set_index(["Location", "Animation Date"]).reindex(idx).reset_index()
s["Days Since"] = s.groupby(["Location", s.Data.notnull().cumsum()]).cumcount()
print (s.ffill())
Location Animation Date Report Date Data Days Since
0 Louisville 2021-07-25 7/25/2021 79.0 0
1 Louisville 2021-07-26 7/25/2021 79.0 1
2 Louisville 2021-07-27 7/25/2021 79.0 2
3 Louisville 2021-07-28 7/25/2021 79.0 3
4 Louisville 2021-07-29 7/25/2021 79.0 4
5 Louisville 2021-07-30 7/30/2021 92.0 0
6 Louisville 2021-07-31 7/30/2021 92.0 1
7 Louisville 2021-08-01 7/30/2021 92.0 2
8 Louisville 2021-08-02 7/30/2021 92.0 3
9 Louisville 2021-08-03 7/30/2021 92.0 4
10 Louisville 2021-08-04 7/30/2021 92.0 5
11 Louisville 2021-08-05 7/30/2021 92.0 6
12 Louisville 2021-08-06 7/30/2021 92.0 7
13 St. Louis 2021-07-24 7/24/2021 80.0 0
14 St. Louis 2021-07-25 7/24/2021 80.0 1
15 St. Louis 2021-07-26 7/24/2021 80.0 2
16 St. Louis 2021-07-27 7/24/2021 80.0 3
17 St. Louis 2021-07-28 7/24/2021 80.0 4
18 St. Louis 2021-07-29 7/29/2021 85.0 0
19 St. Louis 2021-07-30 7/29/2021 85.0 1
20 St. Louis 2021-07-31 7/29/2021 85.0 2
21 St. Louis 2021-08-01 8/1/2021 89.0 0
22 St. Louis 2021-08-02 8/1/2021 89.0 1
23 St. Louis 2021-08-03 8/1/2021 89.0 2
24 St. Louis 2021-08-04 8/1/2021 89.0 3
25 St. Louis 2021-08-05 8/1/2021 89.0 4
26 St. Louis 2021-08-06 8/6/2021 100.0 0
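For completeness (my addition, not part of the answer above), the filled frame could then drive a plotly express animation; this is a rough sketch that assumes a bar per location is the intended view.

import plotly.express as px

filled = s.ffill()
# frames label and sort more predictably as strings than as raw Timestamps
filled["Animation Date"] = filled["Animation Date"].dt.strftime("%Y-%m-%d")
fig = px.bar(filled, x="Location", y="Data",
             animation_frame="Animation Date",
             animation_group="Location",
             hover_data=["Report Date", "Days Since"])
fig.show()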

python selenium: <input> element not interactable

I was trying to scrape NBA player info from https://nba.com/players and click the "Show Historic" button on the webpage.
[screenshot of the NBA players page]
Part of the HTML for the input button is shown below:
<div aria-label="Show Historic Toggle" class="Toggle_switch__2e_90">
<input type="checkbox" class="Toggle_input__gIiFd" name="showHistoric">
<span class="Toggle_slider__hCMQQ Toggle_sliderActive__15Jrf Toggle_slidercerulean__1UnnV">
</span>
</div>
I simply used find_element_by_xpath to locate the input button and click it:
button_show_historic = driver.find_element_by_xpath("//input[@name='showHistoric']")
button_show_historic.click()
However it says:
Exception has occurred: ElementNotInteractableException
Message: element not interactable
(Session info: chrome=88.0.4324.192)
Could anyone help me solve this issue? Is it because the input is not visible?
Simply wait for the span element, not the input element, and click it.
wait = WebDriverWait(driver, 30)
driver.get('https://www.nba.com/players')
wait.until(EC.element_to_be_clickable((By.XPATH, "//button[.='I Accept']"))).click()
wait.until(EC.element_to_be_clickable((By.XPATH, "//input[@name='showHistoric']/preceding::span[1]"))).click()
Import
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
Also, to find an API, look under Developer Tools -> Network, and check the Headers and Response tabs to see where the data gets populated.
Most probably the problem is that you don't have any wait in your code. You should wait until the page is loaded. You can use the simple Python sleep function:
import time
time.sleep(3)  # wait 3 seconds
## Do your action
Or you can use an explicit wait. Check this page: selenium.dev
No need to use Selenium when there's an API. Try this:
import requests
import pandas as pd

url = 'https://stats.nba.com/stats/playerindex'
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/88.0.4324.150 Safari/537.36',
           'Referer': 'http://stats.nba.com'}
payload = {
    'College': '',
    'Country': '',
    'DraftPick': '',
    'DraftRound': '',
    'DraftYear': '',
    'Height': '',
    'Historical': '1',
    'LeagueID': '00',
    'Season': '2020-21',
    'SeasonType': 'Regular Season',
    'TeamID': '0',
    'Weight': ''}

jsonData = requests.get(url, headers=headers, params=payload).json()
cols = jsonData['resultSets'][0]['headers']
data = jsonData['resultSets'][0]['rowSet']
df = pd.DataFrame(data, columns=cols)
Output: [4589 rows x 26 columns]
print(df.head(20).to_string())
PERSON_ID PLAYER_LAST_NAME PLAYER_FIRST_NAME PLAYER_SLUG TEAM_ID TEAM_SLUG IS_DEFUNCT TEAM_CITY TEAM_NAME TEAM_ABBREVIATION JERSEY_NUMBER POSITION HEIGHT WEIGHT COLLEGE COUNTRY DRAFT_YEAR DRAFT_ROUND DRAFT_NUMBER ROSTER_STATUS PTS REB AST STATS_TIMEFRAME FROM_YEAR TO_YEAR
0 76001 Abdelnaby Alaa alaa-abdelnaby 1.610613e+09 blazers 0 Portland Trail Blazers POR 30 F 6-10 240 Duke USA 1990.0 1.0 25.0 NaN 5.7 3.3 0.3 Career 1990 1994
1 76002 Abdul-Aziz Zaid zaid-abdul-aziz 1.610613e+09 rockets 0 Houston Rockets HOU 54 C 6-9 235 Iowa State USA 1968.0 1.0 5.0 NaN 9.0 8.0 1.2 Career 1968 1977
2 76003 Abdul-Jabbar Kareem kareem-abdul-jabbar 1.610613e+09 lakers 0 Los Angeles Lakers LAL 33 C 7-2 225 UCLA USA 1969.0 1.0 1.0 NaN 24.6 11.2 3.6 Career 1969 1988
3 51 Abdul-Rauf Mahmoud mahmoud-abdul-rauf 1.610613e+09 nuggets 0 Denver Nuggets DEN 1 G 6-1 162 Louisiana State USA 1990.0 1.0 3.0 NaN 14.6 1.9 3.5 Career 1990 2000
4 1505 Abdul-Wahad Tariq tariq-abdul-wahad 1.610613e+09 kings 0 Sacramento Kings SAC 9 F-G 6-6 235 San Jose State France 1997.0 1.0 11.0 NaN 7.8 3.3 1.1 Career 1997 2003
5 949 Abdur-Rahim Shareef shareef-abdur-rahim 1.610613e+09 grizzlies 0 Memphis Grizzlies MEM 3 F 6-9 245 California USA 1996.0 1.0 3.0 NaN 18.1 7.5 2.5 Career 1996 2007
6 76005 Abernethy Tom tom-abernethy 1.610613e+09 warriors 0 Golden State Warriors GSW 5 F 6-7 220 Indiana USA 1976.0 3.0 43.0 NaN 5.6 3.2 1.2 Career 1976 1980
7 76006 Able Forest forest-able 1.610613e+09 sixers 0 Philadelphia 76ers PHI 6 G 6-3 180 Western Kentucky USA 1956.0 NaN NaN NaN 0.0 1.0 1.0 Career 1956 1956
8 76007 Abramovic John john-abramovic 1.610610e+09 None 1 Pittsburgh Ironmen PIT None F 6-3 195 Salem USA NaN NaN NaN NaN 9.5 NaN 0.7 Career 1946 1947
9 203518 Abrines Alex alex-abrines 1.610613e+09 thunder 0 Oklahoma City Thunder OKC 8 G 6-6 190 FC Barcelona Spain 2013.0 2.0 32.0 NaN 5.3 1.4 0.5 Career 2016 2018
10 1630173 Achiuwa Precious precious-achiuwa 1.610613e+09 heat 0 Miami Heat MIA 5 F 6-8 225 Memphis Nigeria 2020.0 1.0 20.0 1.0 5.9 3.9 0.6 Season 2020 2020
11 101165 Acker Alex alex-acker 1.610613e+09 clippers 0 LA Clippers LAC 3 G 6-5 185 Pepperdine USA 2005.0 2.0 60.0 NaN 2.7 1.0 0.5 Career 2005 2008
12 76008 Ackerman Donald donald-ackerman 1.610613e+09 knicks 0 New York Knicks NYK G 6-0 183 Long Island-Brooklyn USA 1953.0 2.0 NaN NaN 1.5 0.5 0.8 Career 1953 1953
13 76009 Acres Mark mark-acres 1.610613e+09 magic 0 Orlando Magic ORL 42 C 6-11 220 Oral Roberts USA 1985.0 2.0 40.0 NaN 3.6 4.1 0.5 Career 1987 1992
14 76010 Acton Charles charles-acton 1.610613e+09 rockets 0 Houston Rockets HOU 24 F 6-6 210 Hillsdale USA NaN NaN NaN NaN 3.3 2.0 0.5 Career 1967 1967
15 203112 Acy Quincy quincy-acy 1.610613e+09 kings 0 Sacramento Kings SAC 13 F 6-7 240 Baylor USA 2012.0 2.0 37.0 NaN 4.9 3.5 0.6 Career 2012 2018
16 76011 Adams Alvan alvan-adams 1.610613e+09 suns 0 Phoenix Suns PHX 33 C 6-9 210 Oklahoma USA 1975.0 1.0 4.0 NaN 14.1 7.0 4.1 Career 1975 1987
17 76012 Adams Don don-adams 1.610613e+09 pistons 0 Detroit Pistons DET 10 F 6-7 210 Northwestern USA 1970.0 8.0 120.0 NaN 8.7 5.6 1.8 Career 1970 1976
18 200801 Adams Hassan hassan-adams 1.610613e+09 nets 0 Brooklyn Nets BKN 8 F 6-4 220 Arizona USA 2006.0 2.0 54.0 NaN 2.5 1.2 0.2 Career 2006 2008
19 1629121 Adams Jaylen jaylen-adams 1.610613e+09 bucks 0 Milwaukee Bucks MIL 6 G 6-0 225 St. Bonaventure USA NaN NaN NaN 1.0 0.3 0.4 0.3 Season 2018 2020

VLookup in Pandas using merge

I have 2 dataframes:
df_dict:
Bet365 Team (Dataset) Record ID
-- -------------------- ---------------- -----------
0 Lincoln City Lincoln 50
1 Peterborough Peterboro 65
2 Cambridge Utd Cambridge 72
3 Harrogate Town Harrogate 87
4 Cologne FC Koln 160
5 Hertha Berlin Hertha 167
6 Arminia Bielefeld Bielefeld 169
7 Schalke Schalke 04 173
8 TSG Hoffenheim Hoffenheim 174
9 SC Freiburg Freiburg 175
10 Zulte-Waregem Waregem 320
11 Royal Excel Mouscron Mouscron 325
Other dataframe:
df_odds:
DateTime League HomeTeam AwayTeam B365H B365D B365A
-- -------------------------- ---------------------- ----------------- -------------------- ------- ------- -------
0 2021-01-09 12:30:00.000001 England League 1 Lincoln City Peterborough 2.29 3.4 3.1
1 2021-01-09 15:00:00 England League 2 Cambridge Utd Harrogate Town 2.29 3.2 3.25
2 2021-01-09 15:14:59.999999 Belgium First Division Zulte-Waregem Royal Excel Mouscron 1.85 3.75 3.8
3 2021-01-09 14:29:59.999999 Germany Bundesliga 1 SC Freiburg Cologne 1.9 3.75 3.75
4 2021-01-09 14:29:59.999999 Germany Bundesliga 1 Schalke TSG Hoffenheim 3.8 3.8 1.85
5 2021-01-10 17:00:00.000001 Germany Bundesliga 1 Arminia Bielefeld Hertha Berlin 4 3.5 1.9
6 2021-01-16 14:29:59.999999 Germany Bundesliga 1 Cologne Hertha Berlin 3.2 3.3 2.25
I would like to merge the dataset to get the final dataframe as:
df_expected
DateTime League HomeTeam AwayTeam B365H B365D B365A
-- -------------------------- ---------------------- ---------- ---------- ------- ------- -------
0 2021-01-09 12:30:00.000001 England League 1 Lincoln Peterboro 2.29 3.4 3.1
1 2021-01-09 15:00:00 England League 2 Cambridge Harrogate 2.29 3.2 3.25
2 2021-01-09 15:14:59.999999 Belgium First Division Waregem Mouscron 1.85 3.75 3.8
3 2021-01-09 14:29:59.999999 Germany Bundesliga 1 Freiburg FC Koln 1.9 3.75 3.75
4 2021-01-09 14:29:59.999999 Germany Bundesliga 1 Schalke 04 Hoffenheim 3.8 3.8 1.85
5 2021-01-10 17:00:00.000001 Germany Bundesliga 1 Bielefeld Hertha 4 3.5 1.9
6 2021-01-16 14:29:59.999999 Germany Bundesliga 1 FC Koln Hertha 3.2 3.3 2.25
The common key is df_dict.Bet365.
I am trying to use pd.merge, but I am unable to get the right keys and the correct join.
Help would be greatly appreciated.
Use Series.map on both columns, with a Series built from df_dict by setting the Bet365 column as the index:
s = df_dict.set_index('Bet365')['Team (Dataset)']
df_odds['HomeTeam'] = df_odds['HomeTeam'].map(s)
df_odds['AwayTeam'] = df_odds['AwayTeam'].map(s)
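One caveat worth adding (my note, not part of the answer above): map returns NaN for any team name that is missing from df_dict, so a fillna keeps the original name as a fallback for unmatched teams.

s = df_dict.set_index('Bet365')['Team (Dataset)']
# keep the original name when a team has no entry in the lookup
df_odds['HomeTeam'] = df_odds['HomeTeam'].map(s).fillna(df_odds['HomeTeam'])
df_odds['AwayTeam'] = df_odds['AwayTeam'].map(s).fillna(df_odds['AwayTeam'])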

Reshape pivot table in pandas

I need to reshape a csv pivot table. A small extract looks like:
country location confirmedcases_10-02-2020 deaths_10-02-2020 confirmedcases_11-02-2020 deaths_11-02-2020
0 Australia New South Wales 4.0 0.0 4 0.0
1 Australia Victoria 4.0 0.0 4 0.0
2 Australia Queensland 5.0 0.0 5 0.0
3 Australia South Australia 2.0 0.0 2 0.0
4 Cambodia Sihanoukville 1.0 0.0 1 0.0
5 Canada Ontario 3.0 0.0 3 0.0
6 Canada British Columbia 4.0 0.0 4 0.0
7 China Hubei 31728.0 974.0 33366 1068.0
8 China Zhejiang 1177.0 0.0 1131 0.0
9 China Guangdong 1177.0 1.0 1219 1.0
10 China Henan 1105.0 7.0 1135 8.0
11 China Hunan 912.0 1.0 946 2.0
12 China Anhui 860.0 4.0 889 4.0
13 China Jiangxi 804.0 1.0 844 1.0
14 China Chongqing 486.0 2.0 505 3.0
15 China Sichuan 417.0 1.0 436 1.0
16 China Shandong 486.0 1.0 497 1.0
17 China Jiangsu 515.0 0.0 543 0.0
18 China Shanghai 302.0 1.0 311 1.0
19 China Beijing 342.0 3.0 352 3.0
Is there any ready-to-use pandas tool to reshape it into something like this?
country location date confirmedcases deaths
0 Australia New South Wales 2020-02-10 4.0 0.0
1 Australia Victoria 2020-02-10 4.0 0.0
2 Australia Queensland 2020-02-10 5.0 0.0
3 Australia South Australia 2020-02-10 2.0 0.0
4 Cambodia Sihanoukville 2020-02-10 1.0 0.0
5 Canada Ontario 2020-02-10 3.0 0.0
6 Canada British Columbia 2020-02-10 4.0 0.0
7 China Hubei 2020-02-10 31728.0 974.0
8 China Zhejiang 2020-02-10 1177.0 0.0
9 China Guangdong 2020-02-10 1177.0 1.0
10 China Henan 2020-02-10 1105.0 7.0
11 China Hunan 2020-02-10 912.0 1.0
12 China Anhui 2020-02-10 860.0 4.0
13 China Jiangxi 2020-02-10 804.0 1.0
14 China Chongqing 2020-02-10 486.0 2.0
15 China Sichuan 2020-02-10 417.0 1.0
16 China Shandong 2020-02-10 486.0 1.0
17 China Jiangsu 2020-02-10 515.0 0.0
18 China Shanghai 2020-02-10 302.0 1.0
19 China Beijing 2020-02-10 342.0 3.0
20 Australia New South Wales 2020-02-11 4.0 0.0
21 Australia Victoria 2020-02-11 4.0 0.0
22 Australia Queensland 2020-02-11 5.0 0.0
23 Australia South Australia 2020-02-11 2.0 0.0
24 Cambodia Sihanoukville 2020-02-11 1.0 0.0
25 Canada Ontario 2020-02-11 3.0 0.0
26 Canada British Columbia 2020-02-11 4.0 0.0
27 China Hubei 2020-02-11 33366.0 1068.0
28 China Zhejiang 2020-02-11 1131.0 0.0
29 China Guangdong 2020-02-11 1219.0 1.0
30 China Henan 2020-02-11 1135.0 8.0
31 China Hunan 2020-02-11 946.0 2.0
32 China Anhui 2020-02-11 889.0 4.0
33 China Jiangxi 2020-02-11 844.0 1.0
34 China Chongqing 2020-02-11 505.0 3.0
35 China Sichuan 2020-02-11 436.0 1.0
36 China Shandong 2020-02-11 497.0 1.0
37 China Jiangsu 2020-02-11 543.0 0.0
38 China Shanghai 2020-02-11 311.0 1.0
39 China Beijing 2020-02-11 352.0 3.0
Use pd.wide_to_long:
print(pd.wide_to_long(df, stubnames=["confirmedcases", "deaths"],
                      i=["country", "location"], j="date", sep="_",
                      suffix=r'\d{2}-\d{2}-\d{4}').reset_index())
country location date confirmedcases deaths
0 Australia New South Wales 10-02-2020 4.0 0.0
1 Australia New South Wales 11-02-2020 4.0 0.0
2 Australia Victoria 10-02-2020 4.0 0.0
3 Australia Victoria 11-02-2020 4.0 0.0
4 Australia Queensland 10-02-2020 5.0 0.0
5 Australia Queensland 11-02-2020 5.0 0.0
6 Australia South Australia 10-02-2020 2.0 0.0
7 Australia South Australia 11-02-2020 2.0 0.0
8 Cambodia Sihanoukville 10-02-2020 1.0 0.0
9 Cambodia Sihanoukville 11-02-2020 1.0 0.0
10 Canada Ontario 10-02-2020 3.0 0.0
11 Canada Ontario 11-02-2020 3.0 0.0
12 Canada British Columbia 10-02-2020 4.0 0.0
13 Canada British Columbia 11-02-2020 4.0 0.0
14 China Hubei 10-02-2020 31728.0 974.0
15 China Hubei 11-02-2020 33366.0 1068.0
16 China Zhejiang 10-02-2020 1177.0 0.0
17 China Zhejiang 11-02-2020 1131.0 0.0
18 China Guangdong 10-02-2020 1177.0 1.0
19 China Guangdong 11-02-2020 1219.0 1.0
20 China Henan 10-02-2020 1105.0 7.0
21 China Henan 11-02-2020 1135.0 8.0
22 China Hunan 10-02-2020 912.0 1.0
23 China Hunan 11-02-2020 946.0 2.0
24 China Anhui 10-02-2020 860.0 4.0
25 China Anhui 11-02-2020 889.0 4.0
26 China Jiangxi 10-02-2020 804.0 1.0
27 China Jiangxi 11-02-2020 844.0 1.0
28 China Chongqing 10-02-2020 486.0 2.0
29 China Chongqing 11-02-2020 505.0 3.0
30 China Sichuan 10-02-2020 417.0 1.0
31 China Sichuan 11-02-2020 436.0 1.0
32 China Shandong 10-02-2020 486.0 1.0
33 China Shandong 11-02-2020 497.0 1.0
34 China Jiangsu 10-02-2020 515.0 0.0
35 China Jiangsu 11-02-2020 543.0 0.0
36 China Shanghai 10-02-2020 302.0 1.0
37 China Shanghai 11-02-2020 311.0 1.0
38 China Beijing 10-02-2020 342.0 3.0
39 China Beijing 11-02-2020 352.0 3.0
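A small follow-up (my addition): wide_to_long leaves the date level as the raw '10-02-2020' strings; if proper datetimes are wanted, as in the expected output, they can be parsed afterwards.

out = pd.wide_to_long(df, stubnames=["confirmedcases", "deaths"],
                      i=["country", "location"], j="date", sep="_",
                      suffix=r'\d{2}-\d{2}-\d{4}').reset_index()
# day-month-year strings -> datetimes (e.g. 10-02-2020 -> 2020-02-10)
out["date"] = pd.to_datetime(out["date"], format="%d-%m-%Y")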
Yes, and you can achieve it by reshaping the dataframe.
First you have to melt the columns to have them as values:
df = df.melt(['country', 'location'],
             [p for p in df.columns if p not in ['country', 'location']],
             'key',
             'value')
#> country location key value
#> 0 Australia New South Wales confirmedcases_10-02-2020 4
#> 1 Australia Victoria confirmedcases_10-02-2020 4
#> 2 Australia Queensland confirmedcases_10-02-2020 5
#> 3 Australia South Australia confirmedcases_10-02-2020 2
#> 4 Cambodia Sihanoukville confirmedcases_10-02-2020 1
#> .. ... ... ... ...
#> 75 China Sichuan deaths_11-02-2020 1
#> 76 China Shandong deaths_11-02-2020 1
#> 77 China Jiangsu deaths_11-02-2020 0
#> 78 China Shanghai deaths_11-02-2020 1
#> 79 China Beijing deaths_11-02-2020 3
After that you need to separate the values in the column key:
key_split_series = df.key.str.split("_", expand=True)
df["key"] = key_split_series[0]
df["date"] = key_split_series[1]
#> country location key value date
#> 0 Australia New South Wales confirmedcases 4 10-02-2020
#> 1 Australia Victoria confirmedcases 4 10-02-2020
#> 2 Australia Queensland confirmedcases 5 10-02-2020
#> 3 Australia South Australia confirmedcases 2 10-02-2020
#> 4 Cambodia Sihanoukville confirmedcases 1 10-02-2020
#> .. ... ... ... ... ...
#> 75 China Sichuan deaths 1 11-02-2020
#> 76 China Shandong deaths 1 11-02-2020
#> 77 China Jiangsu deaths 0 11-02-2020
#> 78 China Shanghai deaths 1 11-02-2020
#> 79 China Beijing deaths 3 11-02-2020
In the end, you just need to pivot the table to have confirmedcases and deaths back as columns:
df = df.set_index(["country", "location", "date", "key"])["value"].unstack().reset_index()
#> key country location date confirmedcases deaths
#> 0 Australia New South Wales 10-02-2020 4 0
#> 1 Australia New South Wales 11-02-2020 4 0
#> 2 Australia Queensland 10-02-2020 5 0
#> 3 Australia Queensland 11-02-2020 5 0
#> 4 Australia South Australia 10-02-2020 2 0
#> .. ... ... ... ... ...
#> 35 China Shanghai 11-02-2020 311 1
#> 36 China Sichuan 10-02-2020 417 1
#> 37 China Sichuan 11-02-2020 436 1
#> 38 China Zhejiang 10-02-2020 1177 0
#> 39 China Zhejiang 11-02-2020 1131 0
Use df.values.reshape((-1, 1)) if there is only one feature and df.values.reshape((1, -1)) if there is only one sample.

pandasql EOL error while scanning string literal

I have the code below, where I'm trying to use pandasql to run a SQL query with sqldf. I'm doing some division and aggregation. The query runs just fine when I run it in R with sqldf. I'm totally new to pandasql and I'm getting the error below; can anyone see what my issue is and suggest how to fix it? I've also included some sample data.
Code:
import pandasql
from pandasql import sqldf
pysqldf = lambda q: sqldf(q, globals())
ExampleDf=pysqldf("select sum(lastSaleAmount-priorSaleAmount)/sum(squareFootage) as AvgPric
,zipcode
from data
where priorSaleDate between '2010-01-01' and '2011-01-01'
group by zipcode
order by
sum(lastSaleAmount-priorSaleAmount)/sum(squareFootage) desc")
Error:
File "<ipython-input-100-679165684772>", line 1
ExampleDf=pysqldf("select sum(lastSaleAmount-priorSaleAmount)/sum(squareFootage) as AvgPric
^
SyntaxError: EOL while scanning string literal
Sample Data:
print(data.iloc[:50])
id address city state zipcode latitude \
0 39525749 8171 E 84th Ave Denver CO 80022 39.849160
1 184578398 10556 Wheeling St Denver CO 80022 39.888020
2 184430015 3190 Wadsworth Blvd Denver CO 80033 39.761710
3 155129946 3040 Wadsworth Blvd Denver CO 80033 39.760780
4 245107 5615 S Eaton St Denver CO 80123 39.616181
5 3523925 6535 W Sumac Ave Denver CO 80123 39.615136
6 30560679 6673 W Berry Ave Denver CO 80123 39.616350
7 39623928 5640 S Otis St Denver CO 80123 39.615213
8 148975825 5342 S Gray St Denver CO 80123 39.620158
9 184623176 4967 S Wadsworth Blvd Denver CO 80123 39.626770
10 39811456 6700 W Dorado Dr # 11 Denver CO 80123 39.614540
11 39591617 4956 S Perry St Denver CO 80123 39.628740
12 39577604 4776 S Gar Way Denver CO 80123 39.630547
13 153665665 8890 W Tanforan Dr Denver CO 80123 39.630738
14 39868673 5538 W Prentice Cir Denver CO 80123 39.620625
15 184328555 4254 W Monmouth Ave Denver CO 80123 39.629000
16 30554949 6600 W Berry Ave Denver CO 80123 39.616165
17 24157982 6560 W Sumac Ave Denver CO 80123 39.614712
18 51335315 5655 S Fenton St Denver CO 80123 39.615488
19 152799217 5626 S Fenton St Denver CO 80123 39.616153
20 51330641 5599 S Fenton St Denver CO 80123 39.616514
21 15598828 6595 W Sumac Ave Denver CO 80123 39.615144
22 49360310 6420 W Sumac Ave Denver CO 80123 39.614531
23 39777745 4962 S Field Ct Denver CO 80123 39.625819
24 18021201 9664 W Grand Ave Denver CO 80123 39.625826
25 39776096 4881 S Jellison St Denver CO 80123 39.628401
26 29850085 5012 S Field Ct Denver CO 80123 39.625537
27 51597934 4982 S Field Ct Denver CO 80123 39.625757
28 39563379 4643 S Hoyt St Denver CO 80123 39.632457
29 18922140 5965 W Sumac Ave Denver CO 80123 39.615199
30 39914328 9740 W Chenango Ave Denver CO 80123 39.627226
31 51323181 5520 W Prentice Cir Denver CO 80123 39.620548
32 3493378 4665 S Garland Way Denver CO 80123 39.632063
33 4115341 5466 W Prentice Cir Denver CO 80123 39.619027
34 39639069 5735 W Berry Ave Denver CO 80123 39.617727
35 184333944 9015 W Tanforan Dr Denver CO 80123 39.631178
36 18197471 4977 S Garland St Denver CO 80123 39.626080
37 49430482 9540 W Bellwood Pl Denver CO 80123 39.624558
38 39868648 5535 S Fenton St Denver CO 80123 39.617145
39 143684222 3761 W Wagon Trail Dr Denver CO 80123 39.631251
40 152898579 4850 S Yukon St Denver CO 80123 39.629025
41 43174426 4951 S Ammons St Denver CO 80123 39.626582
42 39615194 7400 W Grant Ranch Blvd # 31 Denver CO 80123 39.618440
43 184340029 7400 W Grant Ranch Blvd # 7 Denver CO 80123 39.618440
44 3523919 5425 S Gray St Denver CO 80123 39.618265
45 151444231 6610 W Berry Ave Denver CO 80123 39.616148
46 19150871 4756 S Perry St Denver CO 80123 39.630389
47 39545155 4328 W Bellewood Dr Denver CO 80123 39.627883
48 3523923 6585 W Sumac Ave Denver CO 80123 39.615145
49 51337334 5737 W Alamo Dr Denver CO 80123 39.615881
longitude bedrooms bathrooms rooms squareFootage lotSize yearBuilt \
0 -104.893468 3 2.0 6 1378 9968 2003.0
1 -104.830930 2 2.0 6 1653 6970 2004.0
2 -105.081070 3 1.0 0 1882 23875 1917.0
3 -105.081060 4 3.0 0 2400 11500 1956.0
4 -105.058812 3 4.0 8 2305 5600 1998.0
5 -105.069018 3 5.0 7 2051 6045 1996.0
6 -105.070760 4 4.0 8 2051 6315 1997.0
7 -105.070617 3 3.0 7 2051 8133 1997.0
8 -105.063094 3 3.0 7 1796 5038 1999.0
9 -105.081990 3 3.0 0 2054 4050 2007.0
10 -105.071350 3 4.0 7 2568 6397 2000.0
11 -105.040126 3 2.0 6 1290 9000 1962.0
12 -105.100242 3 4.0 6 1804 6952 1983.0
13 -105.097718 3 3.0 6 1804 7439 1983.0
14 -105.059503 4 5.0 8 3855 9656 1998.0
15 -105.042330 2 2.0 4 1297 16600 1962.0
16 -105.069424 4 4.0 9 2321 5961 1996.0
17 -105.069264 4 4.0 8 2321 6337 1997.0
18 -105.060173 3 3.0 7 2321 6151 1998.0
19 -105.059696 3 3.0 7 2071 6831 1999.0
20 -105.060193 3 3.0 7 2071 6050 1998.0
21 -105.069803 3 3.0 7 2074 6022 1996.0
22 -105.067815 4 4.0 9 2588 6432 1996.0
23 -105.099825 3 2.0 7 1567 6914 1980.0
24 -105.106423 3 2.0 5 1317 9580 1983.0
25 -105.108440 3 3.0 5 1317 6718 1982.0
26 -105.099012 2 2.0 6 808 8568 1980.0
27 -105.099484 2 1.0 6 808 6858 1980.0
28 -105.104752 3 2.0 6 1321 6000 1978.0
29 -105.062378 3 4.0 8 2350 6839 1997.0
30 -105.107806 2 2.0 5 1586 6510 1982.0
31 -105.058600 2 4.0 6 2613 8250 1998.0
32 -105.101493 3 2.0 8 1590 7044 1977.0
33 -105.057427 3 5.0 7 2614 9350 1999.0
34 -105.059123 3 4.0 7 2107 6491 1998.0
35 -105.099179 2 1.0 5 1340 6741 1982.0
36 -105.103470 3 2.0 6 1085 6120 1985.0
37 -105.104316 3 1.0 6 1085 13500 1981.0
38 -105.060195 4 3.0 8 2365 6050 1998.0
39 -105.036567 3 2.0 5 1344 9240 1959.0
40 -105.081998 2 3.0 5 1601 6660 1986.0
41 -105.087250 3 2.0 8 1858 6890 1986.0
42 -105.079900 2 2.0 5 1603 5742 1997.0
43 -105.079900 2 2.0 5 1603 6168 1997.0
44 -105.061397 3 3.0 7 1860 6838 1998.0
45 -105.069618 3 4.0 8 2376 5760 1996.0
46 -105.038707 3 2.0 5 1355 9600 1960.0
47 -105.042611 2 2.0 6 1867 11000 1973.0
48 -105.069604 3 3.0 7 2382 5830 1996.0
49 -105.059085 3 3.0 6 1872 5500 1999.0
lastSaleDate lastSaleAmount priorSaleDate priorSaleAmount \
0 2009-12-17 75000 2004-05-13 165700.0
1 2004-09-23 216935 NaN NaN
2 2008-04-03 330000 NaN NaN
3 2008-12-02 185000 2008-06-27 0.0
4 2012-07-18 308000 2011-12-29 0.0
5 2006-09-12 363500 2005-05-16 339000.0
6 2014-12-15 420000 2006-07-07 345000.0
7 2004-03-15 328700 1998-04-09 225200.0
8 2011-08-16 274900 2011-01-10 0.0
9 2015-12-01 407000 2012-10-30 312000.0
10 2014-11-12 638000 2005-03-22 530000.0
11 2004-02-02 235000 2000-10-12 171000.0
12 2004-07-19 247000 1999-06-07 187900.0
13 2013-08-14 249700 2000-09-07 217900.0
14 2004-08-17 580000 1999-01-11 574000.0
15 2011-11-07 150000 NaN NaN
16 2006-01-18 402800 2004-08-16 335000.0
17 2013-12-31 422000 2012-11-05 399000.0
18 1999-12-02 277900 NaN NaN
19 2000-02-04 271800 NaN NaN
20 1999-10-20 274400 NaN NaN
21 2007-11-30 314500 NaN NaN
22 2001-12-31 342500 NaN NaN
23 2016-12-02 328000 2016-08-02 231200.0
24 2017-06-21 376000 2008-02-29 244000.0
25 2004-08-31 225000 NaN NaN
26 2016-09-06 310000 2015-09-15 258900.0
27 1999-12-06 128000 NaN NaN
28 2004-04-28 197000 NaN NaN
29 2011-08-11 365000 2004-08-04 365000.0
30 2015-07-08 302000 2004-07-15 210000.0
31 2000-02-10 425000 1999-04-08 396500.0
32 2016-02-26 275000 2004-12-03 204000.0
33 2005-08-29 580000 1999-09-10 398200.0
34 2004-06-30 355000 2001-02-22 320000.0
35 2015-05-26 90000 1983-06-01 80000.0
36 2017-06-08 312500 2017-05-12 258000.0
37 2001-04-27 184000 1999-11-10 164900.0
38 2004-02-08 335000 2001-05-08 339950.0
39 2016-10-17 290000 NaN 70200.0
40 2010-09-02 260000 1998-04-14 189900.0
41 2012-07-30 231600 2012-03-30 0.0
42 2013-10-24 400000 2004-08-04 388400.0
43 2004-11-19 350000 1998-10-05 292400.0
44 2005-06-23 295000 2004-07-26 300000.0
45 2009-06-24 404500 2000-05-04 304900.0
46 1999-12-14 153500 1999-12-14 153500.0
47 2004-05-25 208000 NaN NaN
48 2016-10-20 502000 2005-05-31 357000.0
49 2013-04-05 369000 2000-08-07 253000.0
estimated_value
0 239753
1 343963
2 488840
3 494073
4 513676
5 496062
6 514953
7 494321
8 496079
9 424514
10 721350
11 331915
12 389415
13 386694
14 784587
15 354031
16 515537
17 544960
18 504791
19 495121
20 495894
21 496281
22 528343
23 349041
24 367754
25 356934
26 346001
27 342927
28 337969
29 500105
30 353827
31 693035
32 350857
33 716655
34 493156
35 349355
36 348079
37 343957
38 504705
39 311996
40 391469
41 418814
42 502894
43 478049
44 475615
45 521467
46 366187
47 386913
48 527104
49 497239
Just change the quotes to triple quotes so Python can read the multiline string:
ExampleDf=pysqldf("""select sum(lastSaleAmount-priorSaleAmount)/sum(squareFootage) as AvgPric
,zipcode
from data
where priorSaleDate between '2010-01-01' and '2011-01-01'
group by zipcode
order by
sum(lastSaleAmount-priorSaleAmount)/sum(squareFootage) desc""")
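If triple quotes are not wanted, an equivalent alternative (my sketch, reusing the same query) is implicit string concatenation inside parentheses, which also avoids the EOL error because each line is a complete string literal:

query = ("select sum(lastSaleAmount-priorSaleAmount)/sum(squareFootage) as AvgPric, zipcode "
         "from data "
         "where priorSaleDate between '2010-01-01' and '2011-01-01' "
         "group by zipcode "
         "order by sum(lastSaleAmount-priorSaleAmount)/sum(squareFootage) desc")
ExampleDf = pysqldf(query)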
