I'm using Python, and I have data with a team name and the dates of games that have been played. It looks something like this (except there are a few hundred rows):
team date
0 TOR 2016/10/15
1 LAK 2016/10/20
2 CGY 2016/11/03
3 BUF 2016/10/30
4 PIT 2016/10/27
5 CHI 2016/11/05
6 VAN 2016/10/20
7 BUF 2016/10/16
8 STL 2016/10/13
9 BUF 2016/10/29
10 MIN 2016/10/29
11 PIT 2016/11/05
12 CHI 2016/10/18
13 BOS 2016/10/29
14 PIT 2016/10/20
15 COL 2016/10/20
16 MTL 2016/10/20
17 MTL 2016/11/05
18 BOS 2016/11/03
19 EDM 2016/11/05
20 NSH 2016/11/01
I would like to add indicator columns showing which are the most recent 10 games for each team, as well as the most recent 5 games for each team, with a 1 if the game is in the group and a 0 if it is not.
I'm stumped. Any ideas would be much appreciated!
I think you can use SeriesGroupBy.nlargest with numpy.where, selecting the matching rows by isin:
df.date = pd.to_datetime(df.date)
# with the real data use nlargest(10)
idx = df.groupby('team')['date'].nlargest(2).index.get_level_values(1)
df['indicator'] = np.where(df.index.isin(idx), 1, 0)
print(df)
team date indicator
0 TOR 2016-10-15 1
1 LAK 2016-10-20 1
2 CGY 2016-11-03 1
3 BUF 2016-10-30 1
4 PIT 2016-10-27 1
5 CHI 2016-11-05 1
6 VAN 2016-10-20 1
7 BUF 2016-10-16 0
8 STL 2016-10-13 1
9 BUF 2016-10-29 1
10 MIN 2016-10-29 1
11 PIT 2016-11-05 1
12 CHI 2016-10-18 1
13 BOS 2016-10-29 1
14 PIT 2016-10-20 0
15 COL 2016-10-20 1
16 MTL 2016-10-20 1
17 MTL 2016-11-05 1
18 BOS 2016-11-03 1
19 EDM 2016-11-05 1
20 NSH 2016-11-01 1
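The question actually asks for two indicator columns, one for the last 10 games and one for the last 5; the same pattern applied twice covers both. A minimal sketch, with hypothetical column names last10 and last5:
import numpy as np
import pandas as pd

df['date'] = pd.to_datetime(df['date'])

# row labels of each team's 10 and 5 most recent games
idx10 = df.groupby('team')['date'].nlargest(10).index.get_level_values(1)
idx5 = df.groupby('team')['date'].nlargest(5).index.get_level_values(1)

df['last10'] = np.where(df.index.isin(idx10), 1, 0)
df['last5'] = np.where(df.index.isin(idx5), 1, 0)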
For my Python code, I have been trying to scrape data from NCAAF Stats. I have been having issues extracting a td's text after checking whether the anchor tag 'a' contains the text I want. I want to be able to find a team's number of TDs, points, and PPG. I have been able to successfully find the school by text in Selenium, but after that I am unable to extract the info I want. Here is what I have coded so far.
from selenium import webdriver
driver = webdriver.Chrome('C:\\Users\\Carl\\Downloads\\chromedriver.exe')
driver.get('https://www.ncaa.com/stats/football/fbs/current/team/27')
# I plan to make a while or for loop later, that is why I used f strings
team = "Coastal Carolina"
first = driver.find_element_by_xpath(f'//a[text()="{team}"]')
# This was the way another similarly asked question was answered, but it did not work
#tds = driver.find_element_by_xpath(f'//td//a[text()="{apples}"]/../td[4]').text
# This grabs data from the very first row of data... not the one I want
tds = first.find_element_by_xpath('//following-sibling::td[4]').text
total_points = first.find_element_by_xpath('//following-sibling::td[10]').text
ppg = first.find_element_by_xpath('//following-sibling::td[11]').text
print(tds, total_points, ppg)
driver.quit()
I have tried to look around for a similarly asked question and was able to find this snippet:
tds = driver.find_element_by_xpath(f'//td//a[text()="{apples}"]/../td[4]').text
Unfortunately it did not help me out much. The HTML structure looks like this. I appreciate any help, and thank you!
No need to use Selenium, the page isn't dynamic. Just use pandas to parse the table for you:
import pandas as pd

url = 'https://www.ncaa.com/stats/football/fbs/current/team/27'
df = pd.read_html(url)[0]
Output:
print(df)
Rank Team G TDs PAT 2PT Def Pts FG Saf Pts PPG
0 1 Ohio St. 6 39 39 0 0 6 0 291.0 48.5
1 2 Pittsburgh 6 40 36 0 0 4 1 290.0 48.3
2 3 Coastal Carolina 7 43 42 0 0 6 1 320.0 45.7
3 4 Alabama 7 41 40 1 0 9 0 315.0 45.0
4 5 Ole Miss 6 35 30 1 0 6 1 262.0 43.7
5 6 Cincinnati 6 36 34 1 0 3 0 261.0 43.5
6 7 Oklahoma 7 35 34 1 1 17 0 299.0 42.7
7 - SMU 7 40 36 1 0 7 0 299.0 42.7
8 9 Texas 7 38 37 0 0 8 1 291.0 41.6
9 10 Western Ky. 6 31 27 1 0 10 0 245.0 40.8
10 11 Tennessee 7 36 36 0 0 7 1 275.0 39.3
11 12 Wake Forest 6 28 24 2 0 12 0 232.0 38.7
12 13 UTSA 7 33 33 0 0 13 0 270.0 38.6
13 14 Michigan 6 28 25 1 0 12 0 231.0 38.5
14 15 Georgia 7 34 33 0 0 10 1 269.0 38.4
15 16 Baylor 7 35 35 0 0 7 1 268.0 38.3
16 17 Houston 6 30 28 0 0 5 0 223.0 37.2
17 - TCU 6 29 28 0 0 7 0 223.0 37.2
18 19 Marshall 7 34 33 0 0 7 0 258.0 36.9
19 - North Carolina 7 34 32 2 0 6 0 258.0 36.9
20 21 Nevada 6 26 24 1 0 12 0 218.0 36.3
21 22 Virginia 7 31 29 2 0 10 2 253.0 36.1
22 23 Fresno St. 7 32 27 1 0 10 0 251.0 35.9
23 - Memphis 7 33 26 3 0 7 0 251.0 35.9
24 25 Texas Tech 7 32 31 0 0 9 0 250.0 35.7
25 26 Auburn 7 29 28 1 0 12 1 242.0 34.6
26 27 Florida 7 33 29 1 0 4 0 241.0 34.4
27 - Missouri 7 31 31 0 0 8 0 241.0 34.4
28 29 Liberty 7 33 29 1 0 3 1 240.0 34.3
29 - Michigan St. 7 30 30 0 0 10 0 240.0 34.3
30 31 UCF 6 28 26 0 0 3 1 205.0 34.2
31 32 Oregon St. 6 27 27 0 0 5 0 204.0 34.0
32 33 Oregon 6 26 26 0 0 7 0 203.0 33.8
33 34 Iowa St. 6 23 22 0 0 14 0 202.0 33.7
34 35 UCLA 7 30 28 0 0 9 0 235.0 33.6
35 36 San Diego St. 6 25 24 1 0 7 0 197.0 32.8
36 37 LSU 7 29 29 0 0 8 0 227.0 32.4
37 38 Louisville 6 24 23 0 0 9 0 194.0 32.3
38 - Miami (FL) 6 24 22 1 0 8 1 194.0 32.3
39 - NC State 6 25 24 0 0 6 1 194.0 32.3
40 41 Southern California 6 22 19 3 0 12 0 193.0 32.2
41 42 Tulane 7 31 23 4 0 2 0 223.0 31.9
42 43 Arizona St. 7 30 25 2 0 4 0 221.0 31.6
43 44 Utah 6 25 22 1 0 5 0 189.0 31.5
44 45 Air Force 7 29 27 1 0 5 1 220.0 31.4
45 46 App State 7 27 24 0 0 11 0 219.0 31.3
46 47 Arkansas 7 27 25 0 0 10 0 217.0 31.0
47 - Army West Point 6 25 22 0 0 4 1 186.0 31.0
48 - Notre Dame 6 23 20 2 0 8 0 186.0 31.0
49 - Western Mich. 7 28 25 0 0 8 0 217.0 31.0
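With the table in a DataFrame, one team's row can then be pulled out with ordinary boolean indexing; a small sketch, assuming the column labels shown above:
# select the Coastal Carolina row and squeeze it down to a Series
row = df.loc[df['Team'] == 'Coastal Carolina'].squeeze()
print(row['TDs'], row['Pts'], row['PPG'])
# 43 320.0 45.7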
I have two data frames. I have to compare the two data frames and get the positions of the unmatched data using Python.
Note:
The first column will not always be unique.
Data Frame 1:
0 1 2 3 4
0 1 Dhoni 24 Kota 60000.0
1 2 Raina 90 Delhi 41500.0
2 3 Kholi 67 Ahmedabad 20000.0
3 4 Ashwin 45 Bhopal 8500.0
4 5 Watson 64 Mumbai 6500.0
5 6 KL Rahul 19 Indore 4500.0
6 7 Hardik 24 Bengaluru 1000.0
Data Frame 2
0 1 2 3 4
0 3 Kholi 67 Ahmedabad 20000.0
1 7 Hardik 24 Bengaluru 1000.0
2 4 Ashwin 45 Bhopal 8500.0
3 2 Raina 90 Delhi 41500.0
4 6 KL Rahul 19 Chennai 4500.0
5 1 Dhoni 24 Kota 60000.0
6 5 Watson 64 Mumbai 6500.0
I expect output like (3,5) - (Indore - Chennai).
df1=pd.DataFrame({'A':['Dhoni','Raina','KL Rahul'],'B':[24,90,67],'C':['Kota','Delhi','Indore'],'D':[6000.0,41500.0,4500.0]})
df2=pd.DataFrame({'A':['Dhoni','Raina','KL Rahul'],'B':[24,90,67],'C':['Kota','Delhi','Chennai'],'D':[6000.0,41500.0,4500.0]})
df1['df']='df1'
df2['df']='df2'
df=pd.concat([df1,df2],sort=False).drop_duplicates(subset=['A','B','C','D'],keep=False)
print(df)
A B C D df
2 KL Rahul 67 Indore 4500.0 df1
2 KL Rahul 67 Chennai 4500.0 df2
I have added a df column to show which DataFrame each difference comes from.
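If the two frames can be aligned row by row (with the question's shuffled data you would first sort both on a shared key), the cell positions of the mismatches can also be read straight off a boolean mask; a sketch using the sample frames above:
import numpy as np

# compare the aligned frames cell by cell (drop the helper 'df' column first)
mask = df1.drop(columns='df') != df2.drop(columns='df')
for row, col in zip(*np.nonzero(mask.to_numpy())):
    print((int(row), int(col)), '-', f'{df1.iat[row, col]} - {df2.iat[row, col]}')
# (2, 2) - Indore - Chennai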
I have a dataset that I would like to order by date, with a secondary ordering on the 'pass' value, lowest inside of highest. The reason I don't have any code is that I just have no idea where to begin.
dataframe input:
index date pass
0 11/14/2014 1
1 3/13/2015 1
2 3/20/2015 1
3 5/1/2015 2
4 5/1/2015 1
5 5/22/2015 3
6 5/22/2015 1
7 5/22/2015 2
8 9/25/2015 1
9 9/25/2015 2
10 9/25/2015 3
11 12/4/2015 2
12 12/4/2015 1
13 2/12/2016 2
14 2/12/2016 1
15 5/27/2016 1
16 6/10/2016 1
17 9/23/2016 1
18 12/23/2016 1
19 11/24/2017 1
20 12/29/2017 1
21 1/26/2018 2
22 1/26/2018 1
23 2/9/2018 1
24 3/16/2018 1
25 4/6/2018 2
26 4/6/2018 1
27 6/15/2018 1
28 6/15/2018 2
29 10/26/2018 1
30 11/30/2018 1
31 12/21/2018 1
Expected output:
index date pass
0 11/14/2014 1
1 3/13/2015 1
2 3/20/2015 1
3 5/1/2015 2
4 5/1/2015 1
5 5/22/2015 3
6 5/22/2015 2
7 5/22/2015 1
8 9/25/2015 3
9 9/25/2015 2
10 9/25/2015 1
11 12/4/2015 2
12 12/4/2015 1
13 2/12/2016 2
14 2/12/2016 1
15 5/27/2016 1
16 6/10/2016 1
17 9/23/2016 1
18 12/23/2016 1
19 11/24/2017 1
20 12/29/2017 1
21 1/26/2018 1
22 1/26/2018 2
23 2/9/2018 1
24 3/16/2018 1
25 4/6/2018 1
26 4/6/2018 2
27 6/15/2018 1
28 6/15/2018 2
29 10/26/2018 1
30 11/30/2018 1
31 12/21/2018 1
I have spaced out the results that would change: index 5,6,7 and 21,22 and 25,26. All the bigger pass numbers should be inside the lower pass numbers when the dates are the same.
So if you look at index 5,6,7 the pass order is changed to 3,2,1, and if you look at index 25,26 the pass order is changed to 1,2. Hope you understand.
Sort first by pass, then do a stable sort by date. This way you can be sure your df ends up ordered the way you want.
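A minimal sketch of that suggestion (the date column is parsed first so it sorts chronologically; sorting by pass descending and then stably by date collapses into a single sort_values call):
import pandas as pd

df['date'] = pd.to_datetime(df['date'], format='%m/%d/%Y')
df = df.sort_values(['date', 'pass'], ascending=[True, False]).reset_index(drop=True)
This puts the higher pass values first within each date, e.g. 3,2,1 for the 5/22/2015 rows in the expected output.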
A B C D E
0 2002-01-13 Dan 2002-01-15 26 -1
1 2002-01-13 Dan 2002-01-15 10 0
2 2002-01-13 Dan 2002-01-15 16 1
3 2002-01-13 Vic 2002-01-17 14 0
4 2002-01-13 Vic 2002-01-03 18 0
5 2002-01-28 Mel 2002-02-08 37 0
6 2002-01-28 Mel 2002-02-06 29 0
7 2002-01-28 Mel 2002-02-10 20 0
8 2002-01-28 Rob 2002-02-12 30 -1
9 2002-01-28 Rob 2002-02-12 48 1
10 2002-01-28 Rob 2002-02-12 0 1
11 2002-01-28 Rob 2002-02-01 19 0
Wen answered a very similar question an hour ago, but I forgot to include some conditions. I'll write them down in bold style:
I want to create a new df['F'] column with the following conditions, per each B group and ignoring zeros in the D column:
F = the D value where the C date is nearest to 10 days after the A date and where E = 0.
If no E = 0 row exists at that nearest date (the case of 2002-01-28 Rob), F will be the mean of the D values where E = -1 and E = 1.
If there are two C dates at the same distance from the 10-days-after-A target (the case of 2002-01-28 Mel), F will be the mean of those tied rows' D values.
Output should be:
A B C D E F
0 2002-01-13 Dan 2002-01-15 26 -1 10
1 2002-01-13 Dan 2002-01-15 10 0 10
2 2002-01-13 Dan 2002-01-15 16 1 10
3 2002-01-13 Vic 2002-01-17 14 0 14
4 2002-01-13 Vic 2002-01-03 18 0 14
5 2002-01-28 Mel 2002-02-08 37 0 33
6 2002-01-28 Mel 2002-02-06 29 0 33
7 2002-01-28 Mel 2002-02-10 20 0 33
8 2002-01-28 Rob 2002-02-12 30 -1 39
9 2002-01-28 Rob 2002-02-12 48 1 39
10 2002-01-28 Rob 2002-02-12 0 1 39
11 2002-01-28 Rob 2002-02-01 19 0 39
Wen answered:
df['F']=abs((df.C-df.A).dt.days-10)  # absolute distance, in days, from the 10-day target
df['F']=df.B.map(df.loc[df.F==df.groupby('B').F.transform('min')].groupby('B').D.mean())  # keep the rows at the minimum distance per group and map the mean of D back
df
But now I can't figure out how to add the new conditions (the ones I've put in bold style).
Change the mapper to
# df['F'] must already hold the day distances computed above
m = (df.loc[(df.F == df.groupby('B').F.transform('min')) & (df.D != 0)]  # rows at the minimum distance, ignoring D == 0
       .groupby('B')
       .apply(lambda x: x['D'][x['E'] == 0].mean()  # prefer the E == 0 rows
                        if (x['E'] == 0).any()
                        else x['D'].mean()))        # otherwise average the remaining rows
df['F'] = df.B.map(m)
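For the 2002-01-28 Rob group this keeps only the two non-zero D values at the minimum distance (30 and 48, with E = -1 and E = 1); neither has E = 0, so their mean, 39, is mapped into F, matching the expected output above.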
I am using pandas for some data processing. My pandas statement looks like this:
yearage.groupby(['year', 'Tm']).size()
It gives me data like this:
2014 ATL 9
BOS 9
BRK 7
CHI 10
CHO 9
CLE 8
DAL 9
DEN 8
DET 9
GSW 8
When I convert it into a DataFrame, I get only two columns: the compound key and the count. What I actually want is three columns:
year, Tm, Size
How do I separate out the compound key after the groupby?
You specify as_index=False in your groupby statement. As a side note, you probably want to use count (which excludes NaNs) instead of size.
>>> df.groupby(['year', 'Tm'], as_index=False).count()
year Tm a
0 2014 ATL 4
1 2014 BOS 4
2 2014 BRK 1
3 2014 CHI 1
4 2014 CHO 1
5 2014 CLE 1
6 2014 DAL 1
7 2014 DEN 1
8 2014 DET 1
9 2014 GSW 1
For size:
Another simple aggregation example is to compute the size of each group. This is included in GroupBy as the size method. It returns a Series whose index are the group names and whose values are the sizes of each group.
For count:
Compute count of group, excluding missing values
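The difference only shows up once the counted column holds missing values; a small illustration on made-up data:
import numpy as np
import pandas as pd

df = pd.DataFrame({'year': [2014, 2014, 2014],
                   'Tm': ['ATL', 'ATL', 'BOS'],
                   'a': [1.0, np.nan, 2.0]})

print(df.groupby(['year', 'Tm'], as_index=False).count())  # ATL -> 1, the NaN is excluded
print(df.groupby(['year', 'Tm']).size())                   # ATL -> 2, all rows counted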
I think you can use reset_index with the parameter name for the new column name Size:
yearage.groupby(['year','Tm']).size().reset_index(name='Size')
Sample:
print(yearage)
year Tm a
0 2014 ATL 9
1 2014 ATL 9
2 2014 ATL 9
3 2014 ATL 9
4 2014 BOS 9
5 2014 BRK 7
6 2014 BOS 9
7 2014 BOS 9
8 2014 BOS 9
9 2014 CHI 10
10 2014 CHO 9
11 2014 CLE 8
12 2014 DAL 9
13 2014 DEN 8
14 2014 DET 9
15 2014 GSW 8
print(yearage.groupby(['year','Tm']).size().reset_index(name='Size'))
year Tm Size
0 2014 ATL 4
1 2014 BOS 4
2 2014 BRK 1
3 2014 CHI 1
4 2014 CHO 1
5 2014 CLE 1
6 2014 DAL 1
7 2014 DEN 1
8 2014 DET 1
9 2014 GSW 1
Without the name parameter you get a new column named 0:
print(yearage.groupby(['year','Tm']).size().reset_index())
year Tm 0
0 2014 ATL 4
1 2014 BOS 4
2 2014 BRK 1
3 2014 CHI 1
4 2014 CHO 1
5 2014 CLE 1
6 2014 DAL 1
7 2014 DEN 1
8 2014 DET 1
9 2014 GSW 1