Below is my code. What I want to do is merge the spread and total values for each week, which I have saved in separate files. It works perfectly for individual weeks, but not when I introduce the for loop. I assume it's overwriting each time it merges, but when I place the .merge code outside the for loop, only the last iteration is written to the Excel file.
import numpy as np
import pandas as pd

year = 2015
weeks = np.arange(1, 18)
for week in weeks:
    odds = pd.read_excel(fr'C:\Users\logan\Desktop\Gambling_Scraper\Odds_{year}\Odds{year}Wk{week}.xlsx')
    odds['Favorite'] = odds['Favorite'].map(lambda x: x.lstrip('at '))
    odds['Underdog'] = odds['Underdog'].map(lambda x: x.lstrip('at '))
    odds['UD_Spread'] = odds['Spread'] * -1
    # new df to add spread
    new_df = pd.DataFrame(odds['Favorite'].append(odds['Underdog']))
    new_df['Tm'] = new_df
    new_df['Wk'] = new_df['Tm'] + str(week)
    new_df['Spread'] = odds['Spread'].append(odds['UD_Spread'])
    # new df to add total
    total_df = pd.DataFrame(odds['Favorite'].append(odds['Underdog']))
    total_df['Tm'] = total_df
    total_df['Wk'] = total_df['Tm'] + str(week)
    total_df['Total'] = pd.DataFrame(odds['Total'].append(odds['Total']))

    df['Week'] = df['Week'].astype(int)
    df['Merge'] = df['Tm'].astype(str) + df['Week'].astype(str)
    df = df.merge(new_df['Spread'], left_on='Merge', right_on=new_df['Wk'], how='left')
    df = df.merge(total_df['Total'], left_on='Merge', right_on=total_df['Wk'], how='left')
    df['Implied Tm Pts'] = df['Total'].astype(float) / 2 - df['Spread'].astype(float) / 2
    df.to_excel('DFS2015.xlsx')
What I get:
Name Position Week Tm Merge Spread Total Implied Tm Pts
Devonta Freeman RB 1 Falcons Falcons1 3 55 26
Devonta Freeman RB 2 Falcons Falcons2
Devonta Freeman RB 3 Falcons Falcons3
Devonta Freeman RB 4 Falcons Falcons4
Devonta Freeman RB 5 Falcons Falcons5
Devonta Freeman RB 6 Falcons Falcons6
Devonta Freeman RB 7 Falcons Falcons7
Devonta Freeman RB 8 Falcons Falcons8
Devonta Freeman RB 9 Falcons Falcons9
Devonta Freeman RB 11 Falcons Falcons11
Devonta Freeman RB 13 Falcons Falcons13
Devonta Freeman RB 14 Falcons Falcons14
Devonta Freeman RB 15 Falcons Falcons15
Devonta Freeman RB 16 Falcons Falcons16
Devonta Freeman RB 17 Falcons Falcons17
Antonio Brown WR 1 Steelers Steelers1 7 51 22
But I need a value in each row.
Trying to merge 'Spread' and 'Total' from this data:
Date Favorite Spread Underdog Spread2 Total Away Money Line Home Money Line Week Favs Spread Uds Spread2
September 10, 2015 8:30 PM Patriots -7.0 Steelers 7 51.0 +270 -340 1 Patriots1 -7.0 Steelers1 7
September 13, 2015 1:00 PM Packers -6.0 Bears 6 48.0 -286 +230 1 Packers1 -6.0 Bears1 6
September 13, 2015 1:00 PM Chiefs -1.0 Texans 1 40.0 -115 -105 1 Chiefs1 -1.0 Texans1 1
September 13, 2015 1:00 PM Jets -4.0 Browns 4 40.0 +170 -190 1 Jets1 -4.0 Browns1 4
September 13, 2015 1:00 PM Colts -1.0 Bills 1 44.0 -115 -105 1 Colts1 -1.0 Bills1 1
September 13, 2015 1:00 PM Dolphins -4.0 Football Team 4 46.0 -210 +175 1 Dolphins1 -4.0 Football Team1 4
September 13, 2015 1:00 PM Panthers -3.0 Jaguars 3 41.0 -150 +130 1 Panthers1 -3.0 Jaguars1 3
September 13, 2015 1:00 PM Seahawks -4.0 Rams 4 42.0 -185 +160 1 Seahawks1 -4.0 Rams1 4
September 13, 2015 4:05 PM Cardinals -2.0 Saints 2 49.0 +120 -140 1 Cardinals1 -2.0 Saints1 2
September 13, 2015 4:05 PM Chargers -4.0 Lions 4 46.0 +160 -180 1 Chargers1 -4.0 Lions1 4
September 13, 2015 4:25 PM Buccaneers -3.0 Titans 3 40.0 +130 -150 1 Buccaneers1 -3.0 Titans1 3
September 13, 2015 4:25 PM Bengals -3.0 Raiders 3 43.0 -154 +130 1 Bengals1 -3.0 Raiders1 3
September 13, 2015 4:25 PM Broncos -4.0 Ravens 4 46.0 +180 -220 1 Broncos1 -4.0 Ravens1 4
September 13, 2015 8:30 PM Cowboys -7.0 Giants 7 52.0 +240 -300 1 Cowboys1 -7.0 Giants1 7
September 14, 2015 7:10 PM Eagles -3.0 Falcons 3 55.0 -188 +150 1 Eagles1 -3.0 Falcons1 3
September 14, 2015 10:20 PM Vikings -2.0 49ers 2 42.0 -142 +120 1 Vikings1 -2.0 49ers1 2
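For what it's worth, the loop-overwrite problem above usually disappears if each week's spread/total rows are accumulated in a list, concatenated into one long frame after the loop, and merged into the player frame once. A sketch with toy stand-ins for the weekly Excel files and the player frame, since neither is reproducible here:

```python
import pandas as pd

# Toy stand-ins for the per-week Odds files (hypothetical data)
weekly = {
    1: pd.DataFrame({'Favorite': ['Patriots'], 'Underdog': ['Steelers'],
                     'Spread': [-7.0], 'Total': [51.0]}),
    2: pd.DataFrame({'Favorite': ['Steelers'], 'Underdog': ['Bengals'],
                     'Spread': [-3.0], 'Total': [44.0]}),
}

frames = []
for week, odds in weekly.items():  # in the real code: read each Excel file here
    fav = pd.DataFrame({'Tm': odds['Favorite'], 'Spread': odds['Spread'],
                        'Total': odds['Total']})
    ud = pd.DataFrame({'Tm': odds['Underdog'], 'Spread': -odds['Spread'],
                       'Total': odds['Total']})
    wk = pd.concat([fav, ud], ignore_index=True)
    wk['Merge'] = wk['Tm'] + str(week)   # e.g. 'Steelers1'
    frames.append(wk)

lines = pd.concat(frames, ignore_index=True)  # every week in one frame

# Toy stand-in for the player frame df
df = pd.DataFrame({'Name': ['Antonio Brown'] * 2,
                   'Tm': ['Steelers'] * 2, 'Week': [1, 2]})
df['Merge'] = df['Tm'] + df['Week'].astype(str)

# One merge, outside the loop, so no iteration overwrites the previous one
df = df.merge(lines[['Merge', 'Spread', 'Total']], on='Merge', how='left')
df['Implied Tm Pts'] = df['Total'] / 2 - df['Spread'] / 2
print(df)
```

Merging once against the concatenated frame avoids both failure modes: nothing is overwritten per iteration, and no week is lost.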
I would like to apply a rolling median to replace NaN values in the following dataframe, with a window size of 3:
1990 1991 1992 1993 1994 1995 1996 1997 1998 1999 2000 2001 2002 2003 2004 ... 2007 2008 2009 2010 2011 2012 2013 2014 2015 2016 2017 2018 2019 2020 2021
17 366000.0 278000.0 330000.0 NaN 434000.0 470600.0 433000.0 456000.0 556300.0 580200.0 635300.0 690600.0 800000.0 NaN 821500.0 ... 850800.0 905000.0 947500.0 1016500.0 1043900.0 1112800.0 1281900.0 1312700.0 1422000.0 1526900.0 1580000.0 1599000.0 1580000.0 NaN NaN
However, pandas' rolling function seems to work down columns, not along a row. How can I fix this? Also, the solution should NOT change any of the non-NaN values in that row.
First compute the rolling medians using rolling() with axis=1 (row-wise), min_periods=0 (so the NaN entries don't empty the window), and closed='both' (otherwise the left edge of the window gets excluded).
Then replace only the NaN entries with these medians by using fillna().
medians = df.rolling(3, min_periods=0, closed='both', axis=1).median()
df = df.fillna(medians)
# 1990 1991 1992 1993 1994 1995 1996 1997 1998 1999 ... 2012 2013 2014 2015 2016 2017 2018 2019 2020 2021
# 17 366000.0 278000.0 330000.0 330000.0 434000.0 470600.0 433000.0 456000.0 556300.0 580200.0 ... 1112800.0 1281900.0 1312700.0 1422000.0 1526900.0 1580000.0 1599000.0 1580000.0 1580000.0 1589500.0
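Note that on pandas 2.x the axis=1 argument to rolling() is deprecated/removed; if that bites, the same row-wise result can be had by transposing, rolling down the (now) column, and transposing back. A small sketch on a cut-down version of the row above:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([[366000.0, 278000.0, 330000.0, np.nan, 434000.0]],
                  columns=['1990', '1991', '1992', '1993', '1994'], index=[17])

# Transpose so the row becomes a column, roll down it, transpose back
medians = df.T.rolling(3, min_periods=0, closed='both').median().T
df = df.fillna(medians)  # only the NaN cells change
print(df)
```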
I have a large panel dataset in a pandas DataFrame:
import pandas as pd
df = pd.read_csv('Qs_example_data.csv')
df.head()
ID Year DOB status YOD
223725 1991 1975.0 No 2021
223725 1992 1975.0 No 2021
223725 1993 1975.0 No 2021
223725 1994 1975.0 No 2021
223725 1995 1975.0 No 2021
I want to drop the rows based on the following condition:
If the value in YOD matches the value in Year, then all rows after that matching row for that ID are dropped; the same applies once a Yes is observed in the status column for that ID.
For example, ID 68084329 has the value 2012 in both the Year and YOD columns on row 221930. All rows after 221930 for 68084329 should be dropped.
df.loc[df['ID'] == 68084329]
ID Year DOB status YOD
221910 68084329 1991 1942.0 No 2012
221911 68084329 1992 1942.0 No 2012
221912 68084329 1993 1942.0 No 2012
221913 68084329 1994 1942.0 No 2012
221914 68084329 1995 1942.0 No 2012
221915 68084329 1996 1942.0 No 2012
221916 68084329 1997 1942.0 No 2012
221917 68084329 1998 1942.0 No 2012
221918 68084329 1999 1942.0 No 2012
221919 68084329 2000 1942.0 No 2012
221920 68084329 2001 1942.0 No 2012
221921 68084329 2002 1942.0 No 2012
221922 68084329 2003 1942.0 No 2012
221923 68084329 2004 1942.0 No 2012
221924 68084329 2005 1942.0 No 2012
221925 68084329 2006 1942.0 No 2012
221926 68084329 2007 1942.0 No 2012
221927 68084329 2008 1942.0 No 2012
221928 68084329 2010 1942.0 No 2012
221929 68084329 2011 1942.0 No 2012
221930 68084329 2012 1942.0 Yes 2012
221931 68084329 2013 1942.0 No 2012
221932 68084329 2014 1942.0 No 2012
221933 68084329 2015 1942.0 No 2012
221934 68084329 2016 1942.0 No 2012
221935 68084329 2017 1942.0 No 2012
I have a lot of IDs that have rows which need to be dropped in accordance with the above condition. How do I do this?
The following code should also work:
result = df[0:0]
ids = []
for i in df.ID:
    if i not in ids:
        ids.append(i)
for k in ids:
    temp = df[df.ID == k]
    for j in range(len(temp)):
        result = pd.concat([result, temp.iloc[j:j + 1, :]])
        if temp.iloc[j, :]['status'] == 'Yes':
            break
print(result)
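On a large frame, the nested loops above get slow. A loop-free alternative (a sketch on toy data, with the same keep-through-the-first-Yes behaviour as the loop above): count the 'Yes' rows cumulatively within each ID, and keep a row only if no 'Yes' occurred strictly before it.

```python
import pandas as pd

df = pd.DataFrame({
    'ID':     [1, 1, 1, 1, 2, 2, 2],
    'status': ['No', 'No', 'Yes', 'No', 'No', 'No', 'No'],
})

# Number of 'Yes' rows seen so far within each ID (including the current row)
cum = df['status'].eq('Yes').groupby(df['ID']).cumsum()
# Keep a row only if no 'Yes' occurred strictly before it in its ID
keep = (cum - df['status'].eq('Yes')).eq(0)
result = df[keep]
print(result)
```

Subtracting the current row's own 'Yes' flag is what keeps the first 'Yes' row itself while dropping everything after it.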
This should do it. From your wording, it wasn't clear whether you need to "drop all the rows after you encounter a Yes for that ID", or "just the rows in which you encounter a Yes". I assumed the former.
import pandas as pd
def __get_nos__(df):
    # Keep rows up to the first 'Yes'; if an ID has no 'Yes', keep all of its rows
    mask = (df['Status'] != 'Yes').values
    return df if mask.all() else df.iloc[0:mask.argmin(), :]
df = pd.DataFrame()
df['ID'] = [12345678]*10 + [13579]*10
df['Year'] = list(range(2000, 2010))*2
df['DOB'] = list(range(2000, 2010))*2
df['YOD'] = list(range(2000, 2010))*2
df['Status'] = ['No']*5 + ['Yes']*5 + ['No']*7 + ['Yes']*3
""" df
ID Year DOB YOD Status
0 12345678 2000 2000 2000 No
1 12345678 2001 2001 2001 No
2 12345678 2002 2002 2002 No
3 12345678 2003 2003 2003 No
4 12345678 2004 2004 2004 No
5 12345678 2005 2005 2005 Yes
6 12345678 2006 2006 2006 Yes
7 12345678 2007 2007 2007 Yes
8 12345678 2008 2008 2008 Yes
9 12345678 2009 2009 2009 Yes
10 13579 2000 2000 2000 No
11 13579 2001 2001 2001 No
12 13579 2002 2002 2002 No
13 13579 2003 2003 2003 No
14 13579 2004 2004 2004 No
15 13579 2005 2005 2005 No
16 13579 2006 2006 2006 No
17 13579 2007 2007 2007 Yes
18 13579 2008 2008 2008 Yes
19 13579 2009 2009 2009 Yes
"""
df.groupby('ID').apply(lambda x: __get_nos__(x)).reset_index(drop=True)
""" Output
ID Year DOB YOD Status
0 13579 2000 2000 2000 No
1 13579 2001 2001 2001 No
2 13579 2002 2002 2002 No
3 13579 2003 2003 2003 No
4 13579 2004 2004 2004 No
5 13579 2005 2005 2005 No
6 13579 2006 2006 2006 No
7 12345678 2000 2000 2000 No
8 12345678 2001 2001 2001 No
9 12345678 2002 2002 2002 No
10 12345678 2003 2003 2003 No
11 12345678 2004 2004 2004 No
"""
I am trying to web scrape, using Python 3, a table off of this website into a .csv file: 2015 NBA National TV Schedule
The chart starts out like:
Date Teams Network
Oct. 27, 8:00 p.m. ET Cleveland # Chicago TNT
Oct. 27, 10:30 p.m. ET New Orleans # Golden State TNT
Oct. 28, 8:00 p.m. ET San Antonio # Oklahoma City ESPN
Oct. 28, 10:30 p.m. ET Minnesota # L.A. Lakers ESPN
I am using these packages:
import re
import requests
import pandas as pd
from bs4 import BeautifulSoup
from itertools import groupby
The output I want in a .csv file looks like this:
These are the first four lines of the chart from the website, as they should appear in the .csv file. Notice how some dates are used more than once, and the time is in a separate column. How do I implement the scraper to get this output?
pd.read_html will get most of the way there:
In [73]: pd.read_html("https://deadspin.com/nba-national-tv-espn-tnt-abc-nba-tv-1723767952")[0]
Out[73]:
0 1 2
0 Date Teams Network
1 Oct. 27, 8:00 p.m. ET Cleveland # Chicago TNT
2 Oct. 27, 10:30 p.m. ET New Orleans # Golden State TNT
3 Oct. 28, 8:00 p.m. ET San Antonio # Oklahoma City ESPN
4 Oct. 28, 10:30 p.m. ET Minnesota # L.A. Lakers ESPN
.. ... ... ...
139 Apr. 9, 8:30 p.m. ET Cleveland # Chicago ABC
140 Apr. 12, 8:00 p.m. ET Oklahoma City # San Antonio TNT
141 Apr. 12, 10:30 p.m. ET Memphis # L.A. Clippers TNT
142 Apr. 13, 8:00 p.m. ET Orlando # Charlotte ESPN
143 Apr. 13, 10:30 p.m. ET Utah # L.A. Lakers ESPN
You'd just need to parse out the date into columns and separate the teams.
You'll use pandas to grab the table with .read_html(), and then continue with pandas to manipulate the data:
import pandas as pd
import numpy as np
df = pd.read_html('https://deadspin.com/nba-national-tv-espn-tnt-abc-nba-tv-1723767952', header=0)[0]
# Split the Date column at the comma into Date, Time columns
df[['Date','Time']] = df.Date.str.split(',',expand=True)
# Replace the 'p.m. ET' suffix in the Time column (literal match, not a regex)
df['Time'] = df['Time'].str.replace('p.m. ET', 'PM', regex=False)
# Can't convert to datetime as there is no year. One way to do it here: anything
# before Jan gets the suffix ', 2015', else ', 2016'
# If you have more than one season, you'd have to work this out another way
df['Date'] = np.where(df.Date.str.startswith(('Oct.', 'Nov.', 'Dec.')), df.Date + ', 2015', df.Date + ', 2016')
# If you want 0 padding for the day, remove '#' from %#d below (%#d is
# Windows-specific; on Linux/macOS the equivalent is %-d)
# Change the date format from abbreviated month to full name (ie Oct. -> October)
df['Date'] = pd.to_datetime(df['Date'].astype(str)).dt.strftime('%B %#d, %Y')
# Split the Teams column
df[['Team 1','Team 2']] = df.Teams.str.split('#',expand=True)
# Remove any leading/trailing whitespace
df = df.applymap(lambda x: x.strip() if type(x) is str else x)
# Final dataframe with desired columns
df = df[['Date','Time','Team 1','Team 2','Network']]
Output:
Date Time Team 1 Team 2 Network
0 October 27, 2015 8:00 PM Cleveland Chicago TNT
1 October 27, 2015 10:30 PM New Orleans Golden State TNT
2 October 28, 2015 8:00 PM San Antonio Oklahoma City ESPN
3 October 28, 2015 10:30 PM Minnesota L.A. Lakers ESPN
4 October 29, 2015 8:00 PM Atlanta New York TNT
5 October 29, 2015 10:30 PM Dallas L.A. Clippers TNT
6 October 30, 2015 7:00 PM Miami Cleveland ESPN
7 October 30, 2015 9:30 PM Golden State Houston ESPN
8 November 4, 2015 8:00 PM New York Cleveland ESPN
9 November 4, 2015 10:30 PM L.A. Clippers Golden State ESPN
10 November 5, 2015 8:00 PM Oklahoma City Chicago TNT
11 November 5, 2015 10:30 PM Memphis Portland TNT
12 November 6, 2015 8:00 PM Miami Indiana ESPN
13 November 6, 2015 10:30 PM Houston Sacramento ESPN
14 November 11, 2015 8:00 PM L.A. Clippers Dallas ESPN
15 November 11, 2015 10:30 PM San Antonio Portland ESPN
16 November 12, 2015 8:00 PM Golden State Minnesota TNT
17 November 12, 2015 10:30 PM L.A. Clippers Phoenix TNT
18 November 18, 2015 8:00 PM New Orleans Oklahoma City ESPN
19 November 18, 2015 10:30 PM Chicago Phoenix ESPN
20 November 19, 2015 8:00 PM Milwaukee Cleveland TNT
21 November 19, 2015 10:30 PM Golden State L.A. Clippers TNT
22 November 20, 2015 8:00 PM San Antonio New Orleans ESPN
23 November 20, 2015 10:30 PM Chicago Golden State ESPN
24 November 24, 2015 8:00 PM Boston Atlanta TNT
25 November 24, 2015 10:30 PM L.A. Lakers Golden State TNT
26 December 3, 2015 7:00 PM Oklahoma City Miami TNT
27 December 3, 2015 9:30 PM San Antonio Memphis TNT
28 December 4, 2015 7:00 PM Brooklyn New York ESPN
29 December 4, 2015 9:30 PM Cleveland New Orleans ESPN
.. ... ... ... ... ...
113 March 10, 2016 10:30 PM Cleveland L.A. Lakers TNT
114 March 12, 2016 8:30 PM Oklahoma City San Antonio ABC
115 March 13, 2016 3:30 PM Cleveland L.A. Clippers ABC
116 March 14, 2016 8:00 PM Memphis Houston ESPN
117 March 14, 2016 10:30 PM New Orleans Golden State ESPN
118 March 16, 2016 7:00 PM Oklahoma City Boston ESPN
119 March 16, 2016 9:30 PM L.A. Clippers Houston ESPN
120 March 19, 2016 8:30 PM Golden State San Antonio ABC
121 March 22, 2016 8:00 PM Houston Oklahoma City TNT
122 March 22, 2016 10:30 PM Memphis L.A. Lakers TNT
123 March 23, 2016 8:00 PM Milwaukee Cleveland ESPN
124 March 23, 2016 10:30 PM Dallas Portland ESPN
125 March 29, 2016 8:00 PM Houston Cleveland TNT
126 March 29, 2016 10:30 PM Washington Golden State TNT
127 March 31, 2016 7:00 PM Chicago Houston TNT
128 March 31, 2016 9:30 PM L.A. Clippers Oklahoma City TNT
129 April 1, 2016 8:00 PM Cleveland Atlanta ESPN
130 April 1, 2016 10:30 PM Boston Golden State ESPN
131 April 3, 2016 3:30 PM Oklahoma City Houston ABC
132 April 5, 2016 8:00 PM Chicago Memphis TNT
133 April 5, 2016 10:30 PM L.A. Lakers L.A. Clippers TNT
134 April 6, 2016 7:00 PM Cleveland Indiana ESPN
135 April 6, 2016 9:30 PM Houston Dallas ESPN
136 April 7, 2016 8:00 PM Chicago Miami TNT
137 April 7, 2016 10:30 PM San Antonio Golden State TNT
138 April 9, 2016 8:30 PM Cleveland Chicago ABC
139 April 12, 2016 8:00 PM Oklahoma City San Antonio TNT
140 April 12, 2016 10:30 PM Memphis L.A. Clippers TNT
141 April 13, 2016 8:00 PM Orlando Charlotte ESPN
142 April 13, 2016 10:30 PM Utah L.A. Lakers ESPN
[143 rows x 5 columns]
I have a first dataframe containing a series called 'Date' and a variable number n of series called 'People_1' to 'People_n':
Id Date People_1 People_2 People_3 People_4 People_5 People_6 People_7
12.0 Sat Dec 19 00:00:00 EST 1970 Loretta Lynn Owen Bradley
13.0 Sat Jun 07 00:00:00 EDT 1980 Sissy Spacek Loretta Lynn Owen Bradley
14.0 Sat Dec 04 00:00:00 EST 2010 Loretta Lynn Sheryl Crow Miranda Lambert
15.0 Sat Aug 09 00:00:00 EDT 1969 Charley Pride Dallas Frazier A.L. "Doodle" Chet Atkins Jack Clement Bob Ferguson Felton Jarvis
I also have another dataframe containing a list of names and biographic data:
People Birth_date Birth_state Sex Ethnicity
Charles Kelley Fri Sep 11 00:00:00 EDT 1981 GA Male Caucasian
Hillary Scott Tue Apr 01 00:00:00 EST 1986 TN Female Caucasian
Reba McEntire Mon Mar 28 00:00:00 EST 1955 OK Female Caucasian
Wanda Jackson Wed Oct 20 00:00:00 EST 1937 OK Female Caucasian
Carrie Underwood Thu Mar 10 00:00:00 EST 1983 OK Female Caucasian
Toby Keith Sat Jul 08 00:00:00 EDT 1961 OK Male Caucasian
David Bellamy Sat Sep 16 00:00:00 EDT 1950 FL Male Caucasian
Howard Bellamy Sat Feb 02 00:00:00 EST 1946 FL Male Caucasian
Keith Urban Thu Oct 26 00:00:00 EDT 1967 Northland Male Caucasian
Miranda Lambert Thu Nov 10 00:00:00 EST 1983 TX Female Caucasian
Sam Hunt Sat Dec 08 00:00:00 EST 1984 GA Male Caucasian
Johnny Cash Fri Feb 26 00:00:00 EST 1932 AR Male Caucasian
June Carter Sun Jun 23 00:00:00 EDT 1929 VA Female Caucasian
Merle Haggard Tue Apr 06 00:00:00 EST 1937 CA Male Caucasian
Waylon Jennings Tue Jun 15 00:00:00 EDT 1937 TX Male Caucasian
Willie Nelson Sat Apr 29 00:00:00 EST 1933 TX Male Caucasian
Loretta Lynn Thu Apr 14 00:00:00 EST 1932 KY Female Caucasian
Sissy Spacek Sun Dec 25 00:00:00 EST 1949 TX Female Caucasian
Sheryl Crow Sun Feb 11 00:00:00 EST 1962 MO Female Caucasian
Charley Pride Sun Mar 18 00:00:00 EST 1934 MS Male African American
Rodney Clawon ? TX Male Caucasian
Nathan Chapman ? TN Male Caucasian
I want to get, for each date, the biographic data of each person involved that day:
Date Birth_state Sex Ethnicity
Sat Dec 19 00:00:00 EST 1970 KY Female Caucasian
Sat Jun 07 00:00:00 EDT 1980 TX Female Caucasian
Sat Jun 07 00:00:00 EDT 1980 KY Female Caucasian
Sat Dec 04 00:00:00 EST 2010 KY Female Caucasian
Sat Dec 04 00:00:00 EST 2010 MO Female Caucasian
Sat Dec 04 00:00:00 EST 2010 TX Female Caucasian
Sat Aug 09 00:00:00 EDT 1969 MS Male African American
Note:
Consider that my bio data isn't complete yet; some names are missing, which explains why I don't have a row for each person.
So, is there a way to perform this task in Python?
Lionel
You may want to use a left join in pandas to combine the two tables, then select the columns you need.
For example, first unpivot the People_1 to People_n columns into a single 'People' column, so that each person involved on a date gets their own row. Once that's done, join the two dataframes on that column.
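That idea can be sketched as follows, with toy frames modeled on the question's data; an inner merge is used here so that people whose bios are missing simply drop out, which matches the desired output:

```python
import pandas as pd

events = pd.DataFrame({
    'Id': [12.0, 13.0],
    'Date': ['Sat Dec 19 00:00:00 EST 1970', 'Sat Jun 07 00:00:00 EDT 1980'],
    'People_1': ['Loretta Lynn', 'Sissy Spacek'],
    'People_2': ['Owen Bradley', 'Loretta Lynn'],
})
bios = pd.DataFrame({
    'People': ['Loretta Lynn', 'Sissy Spacek'],
    'Birth_state': ['KY', 'TX'],
    'Sex': ['Female', 'Female'],
    'Ethnicity': ['Caucasian', 'Caucasian'],
})

# Unpivot People_1..People_n so each (Date, person) pair is one row
people_cols = [c for c in events.columns if c.startswith('People_')]
long = events.melt(id_vars=['Id', 'Date'], value_vars=people_cols,
                   value_name='People').dropna(subset=['People'])

# Join the bios; 'inner' keeps only the people we have data for
out = long.merge(bios, on='People', how='inner')[
    ['Date', 'Birth_state', 'Sex', 'Ethnicity']]
print(out)
```

Here Owen Bradley has no bio row, so his (Date, person) pair vanishes in the merge, exactly as in the desired output above.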