Auto Search list and Scrape table

Auto Search list and Scrape table - python

I want to automate the search process on a website and scrape the table of individual players (I'm getting the players' names from an Excel sheet). I want to add that scraped information to an existing Excel sheet with the list of players. For each year that player has been in the league, the player's name needs to be in the first column. So far, I was able to grab the information from the existing Excel sheet, but I'm not sure how to automate the search process using that. I'm not sure if Selenium can help. The website is https://basketball.realgm.com/.
import openpyxl
path = r"C:\Users\Name\Desktop\NBAPlayers.xlsx"
workbook = openpyxl.load_workbook(path)
sheet = workbook.active
rows = sheet.max_row
cols = sheet.max_column
print(rows)
print(cols)
for r in range(2, rows+1):
for c in range(2,cols+1):
print(sheet.cell(row=r,column=c).value, end=" ")
print()

I presume you have got the names from excel sheet so I used a name list and using python request module and get the page text and then use beautiful soup to get table content and Then I have use pandas to get the info in dataframe.
Code:
import requests
import pandas as pd
from bs4 import BeautifulSoup
playernames=['Dominique Jones', 'Joe Young', 'Darius Adams', 'Lester Hudson', 'Marcus Denmon', 'Courtney Fortson']
for name in playernames:
fname=name.split(" ")[0]
lname=name.split(" ")[1]
url="https://basketball.realgm.com/search?q={}+{}".format(fname,lname)
print(url)
r=requests.get(url)
soup=BeautifulSoup(r.text,'html.parser')
table=soup.select_one(".tablesaw ")
dfs=pd.read_html(str(table))
for df in dfs:
print(df)
Output:
https://basketball.realgm.com/search?q=Dominique+Jones
Player Pos HT ... Draft Year College NBA
0 Dominique Jones G 6-4 ... 2010 South Florida Dallas Mavericks
1 Dominique Jones G 6-2 ... 2009 Liberty -
2 Dominique Jones PG 5-9 ... 2011 Fort Hays State -
[3 rows x 8 columns]
https://basketball.realgm.com/search?q=Joe+Young
Player Pos HT ... Draft Year College NBA
0 Joe Young F 6-6 ... 2007 Holy Cross -
1 Joe Young G 6-0 ... 2009 Canisius -
2 Joe Young G 6-2 ... 2015 Oregon Indiana Pacers
3 Joe Young G 6-2 ... 2009 Central Missouri -
[4 rows x 8 columns]
https://basketball.realgm.com/search?q=Darius+Adams
Player Pos HT ... Draft Year College NBA
0 Darius Adams PG 6-1 ... 2011 Indianapolis -
1 Darius Adams G 6-0 ... 2018 Coast Guard Academy -
[2 rows x 8 columns]
https://basketball.realgm.com/search?q=Lester+Hudson
Season Team GP GS MIN ... STL BLK PF TOV PTS
0 2009-10 * All Teams 25 0 5.3 ... 0.32 0.12 0.48 0.56 2.32
1 2009-10 * BOS 16 0 4.4 ... 0.19 0.12 0.44 0.56 1.38
2 2009-10 * MEM 9 0 6.8 ... 0.56 0.11 0.56 0.56 4.00
3 2010-11 WAS 11 0 6.7 ... 0.36 0.09 0.91 0.64 1.64
4 2011-12 * All Teams 16 0 20.9 ... 0.88 0.19 1.62 2.00 10.88
5 2011-12 * CLE 13 0 24.2 ... 1.08 0.23 2.00 2.31 12.69
6 2011-12 * MEM 3 0 6.5 ... 0.00 0.00 0.00 0.67 3.00
7 2014-15 LAC 5 0 11.1 ... 1.20 0.20 0.80 0.60 3.60
8 CAREER NaN 57 0 10.4 ... 0.56 0.14 0.91 0.98 4.70
[9 rows x 23 columns]
https://basketball.realgm.com/search?q=Marcus+Denmon
Season Team Location GP GS ... STL BLK PF TOV PTS
0 2012-13 SAN Las Vegas 5 0 ... 0.4 0.0 1.60 0.20 5.40
1 2013-14 SAN Las Vegas 5 1 ... 0.8 0.0 2.20 1.20 10.80
2 2014-15 SAN Las Vegas 6 2 ... 0.5 0.0 1.50 0.17 5.00
3 2015-16 SAN Salt Lake City 2 0 ... 0.0 0.0 0.00 0.00 0.00
4 CAREER NaN NaN 18 3 ... 0.5 0.0 1.56 0.44 6.17
[5 rows x 24 columns]
https://basketball.realgm.com/search?q=Courtney+Fortson
Season Team GP GS MIN FGM ... AST STL BLK PF TOV PTS
0 2011-12 * All Teams 10 0 9.5 1.10 ... 1.00 0.3 0.0 0.50 1.00 3.50
1 2011-12 * HOU 6 0 8.2 1.00 ... 0.83 0.5 0.0 0.33 0.83 3.00
2 2011-12 * LAC 4 0 11.5 1.25 ... 1.25 0.0 0.0 0.75 1.25 4.25
3 CAREER NaN 10 0 9.5 1.10 ... 1.00 0.3 0.0 0.50 1.00 3.50
[4 rows x 23 columns]

You have to have url list with players and scrape the pages using beautiful soup.
import urllib2
from bs4 import BeautifulSoup
soup = BeautifulSoup(urllib2.urlopen('http://example.com').read())

Related

how to compute mean absolute deviation row wise in pandas

snippet of the dataframe is as follows. but actual dataset is 200000 x 130.
ID 1-jan 2-jan 3-jan 4-jan
1. 4 5 7 8
2. 2 0 1 9
3. 5 8 0 1
4. 3 4 0 0
I am trying to compute Mean Absolute Deviation for each row value like this.
ID 1-jan 2-jan 3-jan 4-jan mean
1. 4 5 7 8 12.5
1_MAD 8.5 7.5 5.5 4.5
2. 2 0 1 9 6
2_MAD.4 6 5 3
.
.
I tried this,
new_df = pd.DataFrame()
for rows in (df['ID']):
new_df[str(rows) + '_mad'] = mad(df3.loc[row_value][1:])
new_df.T
where mad is a function that compares the mean to each value.
But, this is very time consuming since i have a large dataset and i need to do in a quickest way possible.

pd.concat([df1.assign(mean1=df1.mean(axis=1)).set_index(df1.index.astype('str'))
,df1.assign(mean1=df1.mean(axis=1)).apply(lambda ss:ss.mean1-ss,axis=1)
.T.add_suffix('_MAD').T.assign(mean1='')]).sort_index().pipe(print)
1-jan 2-jan 3-jan 4-jan mean1
ID
1.0 4.00 5.00 7.00 8.00 6.0
1.0_MAD 2.00 1.00 -1.00 -2.00
2.0 2.00 0.00 1.00 9.00 3.0
2.0_MAD 1.00 3.00 2.00 -6.00
3.0 5.00 8.00 0.00 1.00 3.5
3.0_MAD -1.50 -4.50 3.50 2.50
4.0 3.00 4.00 0.00 0.00 1.75
4.0_MAD -1.25 -2.25 1.75 1.75

IIUC use:
#convert ID to index
df = df.set_index('ID')
#mean to Series
mean = df.mean(axis=1)
from toolz import interleave
#subtract all columns by mean, add suffix
df1 = df.sub(mean, axis=0).abs().rename(index=lambda x: f'{x}_MAD')
#join with original with mean and interleave indices
df = pd.concat([df.assign(mean=mean), df1]).loc[list(interleave([df.index, df1.index]))]
print (df)
1-jan 2-jan 3-jan 4-jan mean
ID
1.0 4.00 5.00 7.00 8.00 6.00
1.0_MAD 2.00 1.00 1.00 2.00 NaN
2.0 2.00 0.00 1.00 9.00 3.00
2.0_MAD 1.00 3.00 2.00 6.00 NaN
3.0 5.00 8.00 0.00 1.00 3.50
3.0_MAD 1.50 4.50 3.50 2.50 NaN
4.0 3.00 4.00 0.00 0.00 1.75
4.0_MAD 1.25 2.25 1.75 1.75 NaN

It's possible to specify axis=1 to apply the mean calculation across columns:
df['mean_across_cols'] = df.mean(axis=1)

fill NaN with values of a similar row

My dataframe has these 2 columns: "reach" and "height". the column "reach" has a lot of missing value. But the column 'height' have all the value needed. What I see is that reach is often a function of height. Therefore, for the rows with NaN, I want to look at the height, then find another row with the same "height" and that has "reach" available, then copy this value to the 1 with missing value
name
SApM
SLpM
height
reach
record
stance
strAcc
strDef
subAvg
tdAcc
tdAvg
tdDef
weight
born_year
win
lose
draw
nc
Justin Frazier
6.11
1.11
6' 0"
75
10-3-0
Southpaw
0.66
0.04
0
0
0
0
265
1989
10
3
0
Gleidson Cutis
8.28
2.99
5' 9"
nan
7-4-0
Orthodox
0.52
0.59
0
0
0
0
155
1989
7
4
0
Xavier Foupa-Pokam
2.5
1.47
6' 1"
nan
32-22-0
Open Stance
0.43
0.49
0
0
0
0.16
185
1982
32
22
0
Mirko Filipovic
1.89
2.11
6' 2"
73
35-11-2-(1 NC)
Southpaw
0.5
0.63
0.3
0.4
0.19
0.78
230
1974
35
11
2
1
Jordan Johnson
2.64
3.45
6' 2"
79
10-0-0
Orthodox
0.47
0.53
1.2
0.42
3.25
1
205
1988
10
0
0
Martin Kampmann
3.28
3.22
6' 0"
72
20-7-0
Orthodox
0.42
0.62
2
0.41
1.86
0.78
170
1982
20
7
0
Darren Elkins
3.05
3.46
5' 10"
71
27-9-0
Orthodox
0.38
0.52
1.1
0.33
2.67
0.56
145
1984
27
9
0
Austen Lane
6.32
5.26
6' 6"
nan
2-1-0
Orthodox
0.35
0.6
0
0
0
0
245
1987
2
1
0
Rachael Ostovich
3.97
2.54
5' 3"
62
4-6-0
Orthodox
0.43
0.57
0.8
0.83
2.03
0.66
125
1991
4
6
0
Travis Lutter
2.42
0.41
5' 11"
75
10-6-0
Orthodox
0.32
0.42
0.7
0.24
1.95
0.3
185
1973
10
6
0
Tom Murphy
0.17
2.5
6' 2"
nan
8-0-0
Southpaw
0.71
0.84
2.5
0.85
7.51
0
227
1974
8
0
0
Darrell Montague
5.38
1.92
5' 6"
67
13-5-0
Southpaw
0.25
0.52
1.4
0.25
0.72
0.33
125
1987
13
5
0
Lauren Murphy
4.25
3.95
5' 5"
67
15-4-0
Orthodox
0.4
0.61
0.1
0.34
1.16
0.7
125
1983
15
4
0
Bill Mahood
3.59
1.54
6' 3"
nan
20-7-1-(1 NC)
Orthodox
0.85
0.17
3.9
0
0
0
200
1967
20
7
1
1
Nate Marquardt
2.32
2.71
6' 0"
74
35-19-2
Orthodox
0.49
0.55
0.8
0.51
1.87
0.7
185
1979
35
19
2
Mike Polchlopek
1.33
2
6' 4"
nan
1-1-0
Orthodox
0.38
0.57
0
0
0
0
285
1965
1
1
0
Harvey Park
7.21
3.77
6' 0"
70
12-3-0
Orthodox
0.5
0.33
0
0
0
0
155
1986
12
3
0
Junyong Park
3.17
4.37
5' 10"
73
13-4-0
Orthodox
0.47
0.58
0.6
0.57
3.02
0.46
185
1991
13
4
0
Ricco Rodriguez
1.15
1.85
6' 4"
nan
53-25-0-(1 NC)
Orthodox
0.51
0.61
1
0.39
2.3
0.4
265
1977
53
25
0
1
Aaron Riley
3.78
3.45
5' 8"
69
30-14-1
Southpaw
0.34
0.61
0.1
0.34
1.18
0.6
155
1980
30
14
1

You can create a height reference dataframe with .groupby() and fetch the first non-NaN entry of a height (if any) by .first(), as follows:
height_ref = df.groupby('height')['reach'].first()
height
5' 10" 71.0
5' 11" 75.0
5' 3" 62.0
5' 5" 67.0
5' 6" 67.0
5' 8" 69.0
5' 9" NaN
6' 0" 75.0
6' 1" NaN
6' 2" 73.0
6' 3" NaN
6' 4" NaN
6' 6" NaN
Name: reach, dtype: float64
Then, you can fill up the NaN values of column reach by looking up the height reference dataframe by .map() and use .fillna() to fill-up values, as follows:
df['reach2'] = df['reach'].fillna(df['height'].map(height_ref))
For demo purpose, I update to a new column reach2. You can overwrite the original column reach as appropriate.
Result:
print(df[['height', 'reach', 'reach2']])
height reach reach2
0 6' 0" 75.0 75.0
1 5' 9" NaN NaN
2 6' 1" NaN NaN
3 6' 2" 73.0 73.0
4 6' 2" 79.0 79.0
5 6' 0" 72.0 72.0
6 5' 10" 71.0 71.0
7 6' 6" NaN NaN
8 5' 3" 62.0 62.0
9 5' 11" 75.0 75.0
10 6' 2" NaN 73.0 <======= filled up with referenced height from other row
11 5' 6" 67.0 67.0
12 5' 5" 67.0 67.0
13 6' 3" NaN NaN
14 6' 0" 74.0 74.0
15 6' 4" NaN NaN
16 6' 0" 70.0 70.0
17 5' 10" 73.0 73.0
18 6' 4" NaN NaN
19 5' 8" 69.0 69.0

I think that a method does not exist to do that in a simple step.
If i were in your shoes I would:
Create a support dataset made up of height|reach fully populated, in which I would store my best guess values
Join the support dataframe with the existing ones, using height as key
Coalesce the values where NaN appears: df.reach = df.reach.fillna(df.from_support_dataset_height)

Pandas - replacing NaN values

In the following df_goals, I see that all NaNs could be replaced by the cell value either above or below it, eliminatig duplicate row values 'For Team':
For Team Goals Home Goals Away Color
0 Arsenal NaN 1.17 #EF0107
1 Arsenal 1.70 NaN #EF0107
2 Aston Villa NaN 1.10 #770038
3 Aston Villa 1.45 NaN #770038
4 Bournemouth NaN 0.77 #D3151B
5 Bournemouth 1.17 NaN #D3151B
6 Brighton and Hove Albion NaN 1.00 #005DAA
7 Brighton and Hove Albion 1.45 NaN #005DAA
8 Burnley NaN 1.25 #630F33
9 Burnley 1.33 NaN #630F33
10 Chelsea NaN 1.82 #034694
11 Chelsea 1.11 NaN #034694
12 Crystal Palace NaN 0.89 #C4122E
13 Crystal Palace 0.79 NaN #C4122E
14 Everton NaN 1.30 #274488
15 Everton 1.40 NaN #274488
16 Leicester City NaN 2.25 #0053A0
17 Leicester City 2.00 NaN #0053A0
18 Liverpool NaN 2.00 #CE1317
19 Liverpool 2.62 NaN #CE1317
20 Manchester City NaN 2.25 #97C1E7
21 Manchester City 2.73 NaN #97C1E7
22 Manchester United NaN 0.92 #E80909
23 Manchester United 1.82 NaN #E80909
24 Newcastle United NaN 0.67 #231F20
25 Newcastle United 1.10 NaN #231F20
26 Norwich City NaN 0.56 #00A14E
27 Norwich City 1.36 NaN #00A14E
28 Sheffield United NaN 0.88 #E52126
29 Sheffield United 0.83 NaN #E52126
30 Southampton NaN 1.42 #ED1A3B
31 Southampton 1.15 NaN #ED1A3B
32 Tottenham Hotspur NaN 1.20 #132257
33 Tottenham Hotspur 1.90 NaN #132257
34 Watford NaN 0.83 #FBEE23
35 Watford 0.90 NaN #FBEE23
36 West Ham United NaN 1.09 #7C2C3B
37 West Ham United 1.83 NaN #7C2C3B
38 Wolverhampton Wanderers NaN 1.40 #FDB913
39 Wolverhampton Wanderers 1.33 NaN #FDB913
How do I do this?

Check with groupby + first
df=df.groupby('For Team').first()

If you want to drop all the NaN rows:
df_goals = df_goals.dropna(axis=0)
You could set all Nan values to 0:
no_nan_list = df_goals.index.tolist()
nan_list = df_goals.drop(no_nan_list).apply(lambda x: to_numeric(x,errors='coerce'))
df_goals = nan_list.fillna(0)
df.head()

Altair - highlight single element in bar chart

I have this dagtaframe, talismen_players:
Team TotalShots TotalCreated Goals ... WeightBetweenness %ShotInv Player Color
0 Aston Villa 55.0 68.0 7.0 ... 0.45 0.36 Jack Grealish #770038
1 Manchester City 76.0 96.0 8.0 ... 0.44 0.32 Kevin De Bruyne #97C1E7
2 Watford 62.0 43.0 4.0 ... 0.37 0.34 Gerard Deulofeu #FBEE23
3 Leicester City 60.0 67.0 6.0 ... 0.32 0.34 James Maddison #0053A0
4 Norwich City 29.0 69.0 0.0 ... 0.31 0.32 Emiliano Buendia #00A14E
5 Chelsea 63.0 40.0 5.0 ... 0.28 0.23 Mason Mount #034694
6 Tottenham Hotspur 64.0 30.0 9.0 ... 0.28 0.29 Son Heung-Min #132257
7 Everton 66.0 30.0 10.0 ... 0.22 0.26 Richarlison #274488
8 Arsenal 64.0 18.0 17.0 ... 0.21 0.27 Pierre-Emerick Aubameyang #EF0107
9 Bournemouth 25.0 40.0 1.0 ... 0.21 0.23 Ryan Fraser #D3151B
10 Crystal Palace 42.0 20.0 6.0 ... 0.20 0.24 Jordan Ayew #C4122E
11 Burnley 33.0 40.0 2.0 ... 0.20 0.27 Dwight McNeil #630F33
12 Newcastle United 41.0 24.0 1.0 ... 0.20 0.24 Joelinton #231F20
13 Liverpool 89.0 41.0 14.0 ... 0.18 0.31 Mohamed Salah #CE1317
14 Brighton and Hove Albion 27.0 52.0 2.0 ... 0.16 0.23 Pascal Groß #005DAA
15 Wolverhampton Wanderers 86.0 38.0 11.0 ... 0.16 0.38 Raul Jimenez #FDB913
16 Sheffield United 31.0 35.0 5.0 ... 0.15 0.24 John Fleck #E52126
17 West Ham United 48.0 18.0 6.0 ... 0.15 0.25 Sebastien Haller #7C2C3B
18 Southampton 64.0 21.0 15.0 ... 0.11 0.24 Danny Ings #ED1A3B
19 Manchester United 37.0 31.0 1.0 ... 0.10 0.17 Andreas Pereira #E80909
And I have singled out one element in a series being plotted by altair, like so:
target_team = 'Liverpool'
# the following prints 'Mohamed Salah'
target_player = talismen_players.loc[talismen_players['Team']==target_team, 'Player'].item()
# all elements
talisman_chart = alt.Chart(talismen_players).mark_bar().encode(
alt.Y('Player:N', title="", sort='-x'),
alt.X('WeightBetweenness:Q', title="Interconectividade do craque com o resto do time"),
alt.Color('Color', legend=None, scale=None),
tooltip = [alt.Tooltip('Player:N'),
alt.Tooltip('Team:N'),
alt.Tooltip('TotalShots:Q'),
alt.Tooltip('%ShotInv:Q')],
).properties(
width=800
).configure_axis(
grid=False
).configure_view(
strokeWidth=0
)
_________
This plots all elements, but I want to highlight one of them:
This is the code for achieving the result using lines: Multiline highlight
I've tried with this code:
highlight = alt.selection(type='single', on='mouseover',
fields=['Player'], nearest=True)
base = alt.Chart(talismen_players).encode(
alt.X('WeightBetweenness:Q'),
alt.Y('Player:N', title=''),
alt.Color('Color', legend=None, scale=None),
tooltip = [alt.Tooltip('Player:N'),
alt.Tooltip('Team:N'),
alt.Tooltip('TotalShots:Q'),
alt.Tooltip('%ShotInv:Q')],
)
points = base.mark_circle().encode(
opacity=alt.value(0)
).add_selection(
highlight
).properties(
width=700
)
lines = base.mark_bar().encode(
size=alt.condition(~highlight, alt.value(18), alt.value(20))
)
But the result is not ideal:
Highlight is not working.
QUESTION:
Ideally, I'd like to have target_player bar highlighted on default. and THEN be able to highlight the other bars, with mouseover.
But having target_player bar statically highlighted suffices.
Highlight should not change bar colors, but rather have a focus quality, while other bars are out of focus.

You can do this using an initialized selection, along with an opacity for highlighting. For example:
import altair as alt
import pandas as pd
df = pd.DataFrame({
'x': [1, 2, 3, 4],
'y': ['A', 'B', 'C', 'D'],
'color': ['W', 'X', 'Y', 'Z'],
})
hover = alt.selection_single(fields=['y'], on='mouseover', nearest=True, init={'y': 'C'})
alt.Chart(df).mark_bar().encode(
x='x',
y='y',
color='color',
opacity=alt.condition(hover, alt.value(1.0), alt.value(0.3))
).add_selection(hover)

Python Pandas pivot table - how do to tell if differences between means is significant within pivot table?

I have written the following:
ax = df.pivot_table(index=['month'], columns='year', values='sale_amount_usd', margins=True,fill_value=0).round(2).plot(kind='bar',colormap=('Blues'),figsize=(18,15))
plt.legend(loc='best')
plt.ylabel('Average Sales Amount in USD')
plt.xlabel('Month')
plt.xticks(rotation=0)
plt.title('Average Sales Amount in USD by Month/Year')
for p in ax.patches:
ax.annotate(str(p.get_height()), (p.get_x() * 1.001, p.get_height() * 1.005))
plt.show();
Which returns a nice bar chart:
I'd now like to be able to tell whether the differences in means within each month, between years, is significant. In other words, is the jump from $321 in March 2013 to $365 in March 2014 a significant increase in average sales amount?
How would I do this? Is there a way to overlay a marker on the pivot table that tells me, visually, when a difference is significant?
edited to add sample data:
event_id event_date week_number week_of_month holiday month day year pub_organization_id clicks sales click_to_sale_conversion_rate sale_amount_usd per_sale_amount_usd per_click_sale_amount pub_commission_usd per_sale_pub_commission_usd per_click_pub_commission_usd
0 3365 1/11/13 2 2 NaN 1. January 11 2013 214 11945 754 0.06 40311.75 53.46 3.37 2418.71 3.21 0.20
1 13793 2/12/13 7 3 NaN 2. February 12 2013 214 11711 1183 0.10 73768.54 62.36 6.30 4426.12 3.74 0.38
2 4626 1/15/13 3 3 NaN 1. January 15 2013 214 11561 1029 0.09 70356.46 68.37 6.09 4221.39 4.10 0.37
3 10917 2/3/13 6 1 NaN 2. February 3 2013 167 11481 0 0.00 0.00 0.00 0.00 0.00 0.00 0.00
4 14653 2/15/13 7 3 NaN 2. February 15 2013 214 11268 795 0.07 37262.56 46.87 3.31 2235.77 2.81 0.20
5 18448 2/27/13 9 5 NaN 2. February 27 2013 214 11205 504 0.04 48773.71 96.77 4.35 2926.43 5.81 0.26
6 11382 2/5/13 6 2 NaN 2. February 5 2013 214 11166 1324 0.12 93322.84 70.49 8.36 5599.38 4.23 0.50
7 14764 2/16/13 7 3 NaN 2. February 16 2013 214 11042 451 0.04 22235.51 49.30 2.01 1334.14 2.96 0.12
8 17080 2/23/13 8 4 NaN 2. February 23 2013 214 10991 248 0.02 14558.85 58.71 1.32 873.53 3.52 0.08
9 21171 3/8/13 10 2 NaN 3. March 8 2013 214 10910 1081 0.10 52005.12 48.11 4.77 3631.28 3.36 0.33
10 16417 2/21/13 8 4 NaN 2. February 21 2013 214 10826 507 0.05 44907.20 88.57 4.15 2694.43 5.31 0.25
11 13399 2/11/13 7 3 NaN 2. February 11 2013 214 10772 1142 0.11 38549.55 33.76 3.58 2312.97 2.03 0.21
12 1532 1/5/13 1 1 NaN 1. January 5 2013 214 10750 610 0.06 29838.49 48.92 2.78 1790.31 2.93 0.17
13 22500 3/13/13 11 3 NaN 3. March 13 2013 214 10743 821 0.08 47310.71 57.63 4.40 3688.83 4.49 0.34
14 5840 1/19/13 3 3 NaN 1. January 19 2013 214 10693 487 0.05 28427.35 58.37 2.66 1705.64 3.50 0.16
15 19566 3/3/13 10 1 NaN 3. March 3 2013 214 10672 412 0.04 15722.29 38.16 1.47 1163.16 2.82 0.11
16 26313 3/25/13 13 5 NaN 3. March 25 2013 214 10629 529 0.05 21946.51 41.49 2.06 1589.84 3.01 0.15
17 19732 3/4/13 10 2 NaN 3. March 4 2013 214 10619 1034 0.10 37257.20 36.03 3.51 2713.71 2.62 0.26
18 18569 2/28/13 9 5 NaN 2. February 28 2013 214 10603 414 0.04 40920.28 98.84 3.86 2455.22 5.93 0.23
19 8704 1/28/13 5 5 NaN 1. January 28 2013 214 10548 738 0.07 29041.87 39.35 2.75 1742.52 2.36 0.17

Although not conclusive, you could use error bars (through the yerr argument in plt.plot) that represent one standard deviation of uncertainty, and then just eyeball the overlap of the intervals. Something like (not tested)...
stds = df.groupby(['month', 'year'])['sale_amount_usd'].std().to_frame()
stds.columns = ['std_sales']
df_stds = df.pivot_table(index=['month'], columns='year',\
values='sale_amount_usd', \
margins=True,fill_value=0).round(2).join(stds)
ax = df_stds.plot(kind='bar', yerr = 'std_sales', colormap=('Blues'),figsize=(18,15))

We Keep Coding

Python is a programming language that lets you work quickly and integrate systems more effectively.

Auto Search list and Scrape table - python

You have to have url list with players and scrape the pages using beautiful soup. import urllib2 from bs4 import BeautifulSoup soup = BeautifulSoup(urllib2.urlopen('http://example.com').read())

Related

how to compute mean absolute deviation row wise in pandas

fill NaN with values of a similar row

Pandas - replacing NaN values

Altair - highlight single element in bar chart

Python Pandas pivot table - how do to tell if differences between means is significant within pivot table?

Categories

Resources