I have a df that looks like this:
data start stop
10 1.0 1.5
10 2.0 2.5
10 3.0 3.5
10 4.0 4.5
10 5.0 5.5
10 6.0 6.5
10 7.0 7.5
10 8.0 8.5
14 9.0 9.5
14 10.0 10.5
10 11.0 11.5
10 12.0 12.5
10 13.0 13.5
10 14.0 14.5
14 15.0 15.5
10 16.0 16.5
10 17.0 17.5
11 18.0 18.5
11 19.0 19.5
11 20.0 20.5
I want to group the df by consecutive runs of df.data and aggregate the df.start and df.stop times for each group. It should look like this:
data start stop
10 1.0 8.5
14 9.0 10.5
10 11.0 14.5
14 15.0 15.5
10 16.0 17.5
11 18.0 20.5
Use ne + shift + cumsum to build a grouper for consecutive runs of values, then choose the proper aggregation for each column. Given the ordering of your data, you could equally use 'first' and 'last' to aggregate start and stop respectively (a sketch of that variant follows the output below).
d = {'data': 'first', 'start': 'min', 'stop': 'max'} # How to aggregate
s = df.data.ne(df.data.shift(1)).cumsum().rename(None) # How to group
df.groupby(s).agg(d)
# data start stop
#1 10 1.0 8.5
#2 14 9.0 10.5
#3 10 11.0 14.5
#4 14 15.0 15.5
#5 10 16.0 17.5
#6 11 18.0 20.5
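A minimal sketch of that 'first'/'last' variant; it relies only on the rows within each run already being ordered by time, as they are here:
d = {'data': 'first', 'start': 'first', 'stop': 'last'}  # order-based aggregation
s = df.data.ne(df.data.shift(1)).cumsum().rename(None)   # same consecutive grouper
df.groupby(s).agg(d)                                     # same output as above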
i have the following dataframe in pandas:
Race_ID Athlete_ID Finish_time
0 1.0 1.0 56.1
1 1.0 3.0 60.2
2 1.0 2.0 57.1
3 1.0 4.0 57.2
4 2.0 2.0 56.2
5 2.0 1.0 56.3
6 2.0 3.0 56.4
7 2.0 4.0 56.5
8 3.0 1.0 61.2
9 3.0 2.0 62.1
10 3.0 3.0 60.4
11 3.0 4.0 60.0
12 4.0 2.0 55.0
13 4.0 1.0 54.0
14 4.0 3.0 53.0
15 4.0 4.0 52.0
where Race_ID is in descending order of time (i.e. 1 is the most recent race and 4 is the oldest race).
And I want to add a new column Relative_time#t-1 which is the Athlete's Finish_time in the last race relative to the fastest time in the last race. Hence the output would look something like
Race_ID Athlete_ID Finish_time Relative_time#t-1
0 1.0 1.0 56.1 56.3/56.2
1 1.0 3.0 60.2 56.4/56.2
2 1.0 2.0 57.1 56.2/56.2
3 1.0 4.0 57.2 56.5/56.2
4 2.0 2.0 56.2 62.1/60
5 2.0 1.0 56.3 61.2/60
6 2.0 3.0 56.4 60.4/60
7 2.0 4.0 56.5 60/60
8 3.0 1.0 61.2 54/52
9 3.0 2.0 62.1 55/52
10 3.0 3.0 60.4 53/52
11 3.0 4.0 60.0 52/52
12 4.0 2.0 55.0 0
13 4.0 1.0 54.0 0
14 4.0 3.0 53.0 0
15 4.0 4.0 52.0 0
Here's the code:
data = [[1,1,56.1,'56.3/56.2'],
[1,3,60.2,'56.4/56.2'],
[1,2,57.1,'56.2/56.2'],
[1,4,57.2,'56.5/56.2'],
[2,2,56.2,'62.1/60'],
[2,1,56.3,'61.2/60'],
[2,3,56.4,'60.4/60'],
[2,4,56.5,'60/60'],
[3,1,61.2,'54/52'],
[3,2,62.1,'55/52'],
[3,3,60.4,'53/52'],
[3,4,60,'52/52'],
[4,2,55,'0'],
[4,1,54,'0'],
[4,3,53,'0'],
[4,4,52,'0']]
df = pd.DataFrame(data,columns=['Race_ID','Athlete_ID','Finish_time','Relative_time#t-1'],dtype=float)
I intentionally made Relative_time#t-1 a str instead of a float to show the formula.
Here is what I have tried:
df.sort_values(by = ['Race_ID', 'Athlete_ID'], ascending=[True, True], inplace=True)
df['Finish_time#t-1'] = df.groupby('Athlete_ID')['Finish_time'].shift(-1)
df['Finish_time#t-1'] = df['Finish_time#t-1'].replace(np.nan, 0, regex = True)
So I get the numerator for the new column but I don't know how to get the minimum time for each Race_ID (i.e. the value in the denominator)
Thank you in advance.
Try this:
(df.groupby('Athlete_ID')['Finish_time']
   .shift(-1)
   .div(df['Race_ID'].map(
       df.groupby('Race_ID')['Finish_time']
         .min()
         .shift(-1)))
   .fillna(0))
Output:
0 1.001779
1 1.003559
2 1.000000
3 1.005338
4 1.035000
5 1.020000
6 1.006667
7 1.000000
8 1.038462
9 1.057692
10 1.019231
11 1.000000
12 0.000000
13 0.000000
14 0.000000
15 0.000000
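To unpack the denominator: df.groupby('Race_ID')['Finish_time'].min() is the fastest time per race, shift(-1) aligns each race with the next (i.e. previous-in-time) race's fastest time, and map broadcasts that back onto every row; fillna(0) then covers the oldest race, which has no previous race. A step-by-step sketch using the same df:
fastest = df.groupby('Race_ID')['Finish_time'].min()   # fastest time per race, indexed by Race_ID
prev_fastest = fastest.shift(-1)                       # race 1 gets race 2's minimum, race 4 gets NaN
denominator = df['Race_ID'].map(prev_fastest)          # one value per row, via each row's Race_ID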
Usually, to avoid SettingWithCopyWarning, I replace values using .loc or .iloc, but this does not work when I want to forward fill my column (from the first to the last non-NaN value).
Do you know why it does that and how to bypass it?
My test dataframe :
df3 = pd.DataFrame({'Timestamp': [11.1, 11.2, 11.3, 11.4, 11.5, 11.6, 11.7, 11.8, 11.9, 12.0, 12.1, 12.2, 12.3, 12.4, 12.5, 12.6, 12.7, 12.8, 12.9],
                    'test': [np.nan, np.nan, np.nan, 2, 22, 8, np.nan, 4, 5, 4, 5, np.nan, -3, -54, -23, np.nan, 89, np.nan, np.nan]})
and the code that raises me a warning :
df3['test'].iloc[df3['test'].first_valid_index():df3['test'].last_valid_index()+1] = df3['test'].iloc[df3['test'].first_valid_index():df3['test'].last_valid_index()+1].fillna(method="ffill")
I would like something like the result shown below in the end.
Use first_valid_index and last_valid_index to determine the range you want to ffill, then select that range of your DataFrame. The warning comes from the chained indexing in df3['test'].iloc[...] = ..., which assigns into what may be a temporary copy; assigning the filled slice back through df['test'] = ... avoids it.
import numpy as np
import pandas as pd

df = pd.DataFrame({'Timestamp': [11.1, 11.2, 11.3, 11.4, 11.5, 11.6, 11.7, 11.8, 11.9, 12.0, 12.1, 12.2, 12.3, 12.4, 12.5, 12.6, 12.7, 12.8, 12.9],
                   'test': [np.nan, np.nan, np.nan, 2, 22, 8, np.nan, 4, 5, 4, 5, np.nan, -3, -54, -23, np.nan, 89, np.nan, np.nan]})
first = df['test'].first_valid_index()
last = df['test'].last_valid_index() + 1
df['test'] = df['test'][first:last].ffill()
print(df)
Timestamp test
0 11.1 NaN
1 11.2 NaN
2 11.3 NaN
3 11.4 2.0
4 11.5 22.0
5 11.6 8.0
6 11.7 8.0
7 11.8 4.0
8 11.9 5.0
9 12.0 4.0
10 12.1 5.0
11 12.2 5.0
12 12.3 -3.0
13 12.4 -54.0
14 12.5 -23.0
15 12.6 -23.0
16 12.7 89.0
17 12.8 NaN
18 12.9 NaN
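A hedged alternative sketch that keeps the assignment on the DataFrame itself, so there is no chained indexing at all; it assumes the default RangeIndex, where .loc labels match positions (note .loc slices are end-inclusive, so last_valid_index() is used without the +1):
first = df['test'].first_valid_index()
last = df['test'].last_valid_index()
df.loc[first:last, 'test'] = df.loc[first:last, 'test'].ffill()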
I'm trying to scrape a table from this link:
https://www.espn.com/nba/stats/player/_/table/offensive/sort/avgPoints/dir/desc
When scraping the table, the names and stats categories align, but the numbers themselves don't.
import csv
from bs4 import BeautifulSoup
import requests
soup = BeautifulSoup(
    requests.get("https://www.espn.com/nba/stats/player/_/table/offensive/sort/avgPoints/dir/desc", timeout=30).text,
    'lxml')

def scrape_data(url):
    # the categories of stats (first row)
    ct = soup.find_all('tr', class_="Table__TR Table__even")
    # player's stats table (the names and numbers)
    st = soup.find_all('tr', class_="Table__TR Table__TR--sm Table__even")
    header = [th.text.rstrip() for th in ct[1].find_all('th')]
    with open('s espn.csv', 'w') as csv_file:
        writer = csv.writer(csv_file)
        writer.writerow(header)
        for row in st[1:]:
            data = [th.text.rstrip() for th in row.find_all('td')]
            writer.writerow(data)

scrape_data(soup)
https://imgur.com/UFHC8wf
That's because those are under 2 separate table tags: a table for the names, then a table for the stats. You'll need to merge them.
Use pandas; it's a lot easier for parsing tables (it actually uses beautifulsoup under the hood). pd.read_html() will return a list of dataframes, one per <table> tag. I just merged them together. Then you can also manipulate the tables any way you need.
import pandas as pd

url = 'https://www.espn.com/nba/stats/player/_/table/offensive/sort/avgPoints/dir/desc'

dfs = pd.read_html(url)   # one dataframe per <table> tag on the page
df = dfs[0].join(dfs[1])  # names table + stats table, aligned on the index
# split the fused 'Trae YoungATL' strings into name and trailing team abbreviation
df[['Name', 'Team']] = df['Name'].str.extract('^(.*?)([A-Z]+)$', expand=True)
df.to_csv('s espn.csv', index=False)
Output:
print (df.head(10).to_string())
RK Name Team POS GP MIN PTS FGM FGA FG% 3PM 3PA 3P% FTM FTA FT% REB AST STL BLK TO DD2 TD3 PER
0 1 Trae Young ATL PG 2 36.5 38.5 13.5 23.0 58.7 5.5 10.0 55.0 6.0 8.0 75.0 7.0 9.0 1.5 0.0 5.5 0 0 40.95
1 2 Kyrie Irving BKN PG 3 34.7 37.7 12.0 26.3 45.6 4.7 11.3 41.2 9.0 9.7 93.1 5.7 6.3 1.7 0.7 2.0 0 0 37.84
2 3 Karl-Anthony Towns MIN C 3 33.7 32.0 10.7 20.3 52.5 5.0 9.7 51.7 5.7 9.0 63.0 13.3 5.0 3.0 2.0 2.3 3 0 40.60
3 4 Damian Lillard POR PG 3 36.7 31.7 10.7 20.3 52.5 2.3 8.0 29.2 8.0 9.0 88.9 4.0 6.0 1.7 0.3 2.7 0 0 32.86
4 5 Giannis Antetokounmpo MIL PF 2 32.5 29.5 11.5 19.0 60.5 1.0 5.0 20.0 5.5 10.0 55.0 15.0 10.0 2.0 1.5 5.5 2 1 34.55
5 6 Luka Doncic DAL SF 3 36.3 29.3 10.0 20.0 50.0 3.0 9.7 31.0 6.3 8.0 79.2 10.3 7.3 2.3 0.0 4.3 2 1 29.45
6 7 Pascal Siakam TOR PF 3 34.0 28.7 9.7 21.0 46.0 2.7 5.7 47.1 6.7 7.0 95.2 10.7 3.7 0.3 0.3 4.3 1 0 24.56
7 8 Brandon Ingram NO SF 3 35.0 27.3 10.7 20.3 52.5 3.3 6.3 52.6 2.7 3.3 80.0 9.3 4.3 0.3 1.7 2.7 1 0 25.72
8 9 Kristaps Porzingis DAL PF 3 31.0 26.3 8.7 18.7 46.4 3.0 7.3 40.9 6.0 8.3 72.0 5.7 3.3 0.3 2.7 2.3 0 0 27.15
9 10 Russell Westbrook HOU PG 2 33.0 26.0 8.0 17.0 47.1 2.0 5.0 40.0 8.0 10.5 76.2 13.0 10.0 1.5 0.5 3.5 2 1 30.96
...
Explained line by line:
dfs = pd.read_html(url)
This will return a list of dataframes. Essentially it parses every <table> tag within the html. When you do this with the given URL, you will see it returns 2 dataframes:
The dataframe in index position 0 has the ranking and player name. If I look at that, I notice the text is the player name with the team abbreviation fused onto the end of the string.
One way to split that is by position: dfs[0]['Team'] = dfs[0]['Name'].str[-3:] creates a column called 'Team' from the last 3 characters of each Name string, and dfs[0]['Name'] = dfs[0]['Name'].str[:-3] keeps the string up to the last 3 characters, essentially splitting the Name column into 2 columns. The regex used in the code above does the same split on the trailing run of capital letters instead, as shown below.
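For comparison, a sketch of the two approaches; this assumes fused strings like the 'Brandon IngramNO' row above, where the team abbreviation can be 2 or 3 letters, which is presumably why the final code uses the regex:
dfs[0]['Name'].str[-3:]  # 'Brandon IngramNO' -> 'mNO': wrong for 2-letter teams
dfs[0]['Name'].str.extract('^(.*?)([A-Z]+)$', expand=True)  # -> 'Brandon Ingram', 'NO'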
df = dfs[0].merge(dfs[1], left_index=True, right_index=True)
The last part then takes those 2 dataframes (index positions 0 and 1 in the dfs list) and merges them together on the index values; dfs[0].join(dfs[1]), as used in the code above, is equivalent, since join aligns on the index by default.
I am working with wind speed (sknt) and visibility (vsby) data in hourly intervals from weather stations. I was able to calculate the joint probability for both wind speed and visibility using this:
df1 = df.groupby('vsby').size().div(len(df))
df2 = df.groupby(['vsby', 'sknt']).size().div(len(df)).div(df1, axis=0, level='vsby')
vsby sknt 0
0 6.0 15.0 1.000000
1 10.0 0.0 1.000000
2 11.0 7.0 0.500000
3 11.0 16.0 0.500000
4 13.0 12.0 1.000000
5 14.0 3.0 0.500000
6 14.0 4.0 0.250000
7 14.0 12.0 0.250000
8 16.0 0.0 0.099796
9 16.0 2.0 0.209776
10 16.0 3.0 0.173116
11 16.0 4.0 0.134420
12 16.0 5.0 0.175153
13 16.0 6.0 0.024440
14 16.0 7.0 0.032587
15 16.0 8.0 0.018330
16 16.0 9.0 0.024440
17 16.0 10.0 0.024440
18 16.0 11.0 0.026477
19 16.0 12.0 0.016293
20 16.0 13.0 0.014257
21 16.0 14.0 0.008147
22 16.0 15.0 0.008147
23 16.0 16.0 0.004073
24 16.0 17.0 0.004073
25 16.0 18.0 0.002037
I am interested in finding the probability of wind speed >= x for all visibility recorded. For example, vsby 16, probability = (0.018330 + 0.024440 + 0.024440 + 0.026477 + 0.016293 + 0.014257 + 0.008147 + 0.008147 + 0.004073 + 0.004073 + 0.002037)
I tried,
df2.loc[df2.sknt >= 7, df2.vsby].sum()
but it's not working.
Try the below. To select a column using .loc it is sufficient to just provide its name; your attempt passed the Series df2.vsby as the column selector. Note also that the values you want to sum live in the probability column, which is labelled 0 after the reset_index:
df2 = df2.reset_index()
df2.loc[df2['sknt'] >= 7, 0].sum()
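To get that sum separately for every visibility level (e.g. the vsby 16 example above), a sketch under the same assumption that the probability column is labelled 0:
df2.loc[df2['sknt'] >= 7].groupby('vsby')[0].sum()  # P(sknt >= 7) within each vsby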
Given the following data frame:
index value
1 0.8
2 0.9
3 1.0
4 0.9
5 nan
6 nan
7 nan
8 0.4
9 0.9
10 nan
11 0.8
12 2.0
13 1.4
14 1.9
15 nan
16 nan
17 nan
18 8.4
19 9.9
20 10.0
…
in which 'value' is separated into a number of clusters by NaN values. Is there any way I can calculate statistics such as the cumulative sum, or the mean, of the clustered data? For example, I want to calculate the cumulative sum and generate the following data frame:
index value cumsum
1 0.8 0.8
2 0.9 1.7
3 1.0 2.7
4 0.9 3.6
5 nan 0
6 nan 0
7 nan 0
8 0.4 0.4
9 0.9 1.3
10 nan 0
11 0.8 0.8
12 2.0 2.8
13 1.4 4.2
14 1.9 6.1
15 nan 0
16 nan 0
17 nan 0
18 8.4 8.4
19 9.9 18.3
20 10.0 28.3
…
Any suggestions?
Also, as a simple extension of the problem: if two clusters of data are close enough, such that only a single NaN separates them, we consider them as one cluster, so that we would have the following data frame:
index value cumsum
1 0.8 0.8
2 0.9 1.7
3 1.0 2.7
4 0.9 3.6
5 nan 0
6 nan 0
7 nan 0
8 0.4 0.4
9 0.9 1.3
10 nan 1.3
11 0.8 2.1
12 2.0 4.1
13 1.4 5.5
14 1.9 7.4
15 nan 0
16 nan 0
17 nan 0
18 8.4 8.4
19 9.9 18.3
20 10.0 28.3
Thank you for the help!
You can do the first part using the compare-cumsum-groupby pattern. Your "simple extension" isn't quite so simple, but we can still pull it off, by finding the parts of value that we want to treat as zero:
n = df["value"].isnull()
clusters = (n != n.shift()).cumsum()
df["cumsum"] = df["value"].groupby(clusters).cumsum().fillna(0)
to_zero = n & (df["value"].groupby(clusters).transform('size') == 1)
tmp_value = df["value"].where(~to_zero, 0)
n2 = tmp_value.isnull()
new_clusters = (n2 != n2.shift()).cumsum()
df["cumsum_skip1"] = tmp_value.groupby(new_clusters).cumsum().fillna(0)
produces
>>> df
index value cumsum cumsum_skip1
0 1 0.8 0.8 0.8
1 2 0.9 1.7 1.7
2 3 1.0 2.7 2.7
3 4 0.9 3.6 3.6
4 5 NaN 0.0 0.0
5 6 NaN 0.0 0.0
6 7 NaN 0.0 0.0
7 8 0.4 0.4 0.4
8 9 0.9 1.3 1.3
9 10 NaN 0.0 1.3
10 11 0.8 0.8 2.1
11 12 2.0 2.8 4.1
12 13 1.4 4.2 5.5
13 14 1.9 6.1 7.4
14 15 NaN 0.0 0.0
15 16 NaN 0.0 0.0
16 17 NaN 0.0 0.0
17 18 8.4 8.4 8.4
18 19 9.9 18.3 18.3
19 20 10.0 28.3 28.3