Array into dataframe interpolation - python

I have the following array:
[299.13953679 241.1902389 192.58645951 ... 8.53750551 24.38822528
71.61117789]
For each value in the array I want to get the interpolated wind speed based on the values in the column power in the following pd.DataFrame:
wind speed power
5 2.5 0
6 3.0 25
7 3.5 82
8 4.0 154
9 4.5 244
10 5.0 354
11 5.5 486
12 6.0 643
13 6.5 827
14 7.0 1038
15 7.5 1272
16 8.0 1525
17 8.5 1794
18 9.0 2037
19 9.5 2211
20 10.0 2362
21 10.5 2386
22 11.0 2400
So basically I'd like to retrieve the following array:
[4.7 4.5 4.3 ... 2.6 3.0 3.4]
Any suggestions on where to start? I was looking at the pd.DataFrame.interpolate function, but reading through its documentation it does not seem to help with my problem. Or am I wrong?

Use interp from NumPy (note that the second argument, the x-coordinates, must be increasing, which the power column is):
np.interp(ary, df['power'].values, df['wind speed'].values)
Out[202]:
array([4.75063426, 4.48439022, 4.21436922, 2.67075011, 2.98776451,
3.40886998])
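For a self-contained version of this answer, the sketch below uses the power-curve values from the question's dataframe; the short ary is a made-up stand-in for the full array:

```python
import numpy as np
import pandas as pd

# Power curve from the question: wind speed (m/s) vs. power (kW)
df = pd.DataFrame({
    'wind speed': [2.5, 3.0, 3.5, 4.0, 4.5, 5.0, 5.5, 6.0, 6.5, 7.0,
                   7.5, 8.0, 8.5, 9.0, 9.5, 10.0, 10.5, 11.0],
    'power': [0, 25, 82, 154, 244, 354, 486, 643, 827, 1038,
              1272, 1525, 1794, 2037, 2211, 2362, 2386, 2400],
})

ary = np.array([299.14, 241.19, 192.59, 8.54, 24.39, 71.61])

# np.interp(x, xp, fp) requires xp to be increasing; 'power' is, so the
# curve can be inverted: given a power value, look up the wind speed.
speeds = np.interp(ary, df['power'].values, df['wind speed'].values)
print(speeds.round(2))  # -> [4.75 4.48 4.21 2.67 2.99 3.41]
```

Values outside the power range are clamped to the endpoint speeds (2.5 and 11.0), since np.interp does not extrapolate.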


Getting table from webpage: Problem getting full html

I need to get the table from this page: https://stats.nba.com/teams/traditional/?sort=GP&dir=-1. From the html of the page one can see that the table is encoded in the descendants of the tag
<nba-stat-table filters="filters" ... >
<div class="nba-stat-table">
<div class="nba-stat-table__overflow" data-fixed="2" role="grid">
<table>
...
</nba-stat-table>
(I cannot add a screenshot since I am new to Stack Overflow, but if you right-click -> Inspect Element anywhere in the table you will see what I mean.)
I've tried some different approaches, such as the first and second answers to the question How to extract tables from websites in Python, as well as those to the question pandas read_html ValueError: No tables found (trying the first solution gave me an error that is essentially this second question).
First try using pandas:
import requests
import pandas as pd
url = 'http://stats.nba.com/teams/traditional/?sort=GP&dir=-1'
html = requests.get(url).content
df_list = pd.read_html(html)
df = df_list[-1]
Or another try with BeautifulSoup:
import requests
from bs4 import BeautifulSoup
url = "https://stats.nba.com/teams/traditional/?sort=GP&dir=-1"
page = requests.get(url)
soup = BeautifulSoup(page.content, 'html.parser')
stats_table = soup.find('nba-stat-table')
for child in stats_table.descendants:
    print(child)
The first gives the "pandas read_html ValueError: No tables found" error. The second raises no error but prints nothing. I then tried to see what was actually happening by writing the page to a file:
with open('html.txt', 'w') as fout:
    fout.write(str(page.content))
and/or:
with open('html.txt', 'w') as fout:
    fout.write(str(soup))
and I get in the text file in the part of the html in which the table should be:
<nba-stat-table filters="filters"
ng-if="!isLoading && !noData"
options="options"
params="params"
rows="teamStats.rows"
template="teams/teams-traditional">
</nba-stat-table>
So it appears that I am not getting the descendants of this tag, which actually contain the table's data. Does anyone have a solution that obtains the whole rendered html of the page so I can parse it, or an alternative way of getting the table?
Here's what I try when attempting to scrape data. (By the way I LOVE scraping/working with sports data.)
1) Pandas pd.read_html(). (BeautifulSoup actually works under the hood here.) I like this method as it's rather easy and quick, and it usually only requires a small amount of manipulation if it does return what I want. However, pd.read_html() only works if the data is within <table> tags in the html. Since there are no <table> tags here, it returns the "ValueError: No tables found" you stated. So good work on trying that first; it's the easiest method when it works.
2) The other "go to" method I'll use is to see if the data is pulled through XHR. Actually, this might be my first choice, as it can give you options for filtering what is returned, but it requires a little more (not much) investigative work to find the correct request url and query parameters. (This is the route I went for this solution.)
3) If it is generated through javascript, sometimes you can find the data in json format within <script> tags using BeautifulSoup. This requires a bit more investigation to pull out the right <script> tag, then some string manipulation to get the string into a valid json format so json.loads() can read in the data.
4a) Use BeautifulSoup to pull out the data elements if they are present in other tags and not rendered by javascript.
4b) Selenium is an option to allow the page to render first, then go into the html and parse with BeautifulSoup (in some cases, allow Selenium to render and then use pd.read_html() if it renders <table> tags), but it is usually my last choice. It's not that it doesn't work or is bad; it's just slow and unnecessary if any of the above choices work.
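As a toy illustration of option 3, everything below (the page snippet and the window.__DATA__ variable name) is made up for the example; the idea is just to pull a json literal out of a <script> tag and parse it:

```python
import json
import re

# Hypothetical page source with the data embedded in a <script> tag,
# as some javascript-rendered sites do.
html = '''
<html><body>
<script>window.__DATA__ = {"rows": [{"team": "Hawks", "gp": 4}]};</script>
</body></html>
'''

# Pull out the json literal with a regex, then parse it.
match = re.search(r'window\.__DATA__\s*=\s*(\{.*?\});', html, re.DOTALL)
data = json.loads(match.group(1))
print(data['rows'][0]['team'])  # -> Hawks
```

On a real page you would fetch the html first and may need more careful string surgery to isolate the json, but the loads() step is the same.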
So I went with option 2. Here's the code and output:
import requests
import pandas as pd
url = 'https://stats.nba.com/stats/leaguedashteamstats'
headers = {'User-Agent': 'Mozilla/5.0 (Linux; Android 6.0; Nexus 5 Build/MRA58N) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/77.0.3865.120 Mobile Safari/537.36'}
payload = {
    'Conference': '',
    'DateFrom': '',
    'DateTo': '',
    'Division': '',
    'GameScope': '',
    'GameSegment': '',
    'LastNGames': '82',
    'LeagueID': '00',
    'Location': '',
    'MeasureType': 'Base',
    'Month': '0',
    'OpponentTeamID': '0',
    'Outcome': '',
    'PORound': '0',
    'PaceAdjust': 'N',
    'PerMode': 'PerGame',
    'Period': '0',
    'PlayerExperience': '',
    'PlayerPosition': '',
    'PlusMinus': 'N',
    'Rank': 'N',
    'Season': '2019-20',
    'SeasonSegment': '',
    'SeasonType': 'Regular Season',
    'ShotClockRange': '',
    'StarterBench': '',
    'TeamID': '0',
    'TwoWay': '0',
    'VsConference': '',
    'VsDivision': '',
}
jsonData = requests.get(url, headers=headers, params=payload).json()
df = pd.DataFrame(jsonData['resultSets'][0]['rowSet'], columns=jsonData['resultSets'][0]['headers'])
Output:
print (df.to_string())
TEAM_ID TEAM_NAME GP W L W_PCT MIN FGM FGA FG_PCT FG3M FG3A FG3_PCT FTM FTA FT_PCT OREB DREB REB AST TOV STL BLK BLKA PF PFD PTS PLUS_MINUS GP_RANK W_RANK L_RANK W_PCT_RANK MIN_RANK FGM_RANK FGA_RANK FG_PCT_RANK FG3M_RANK FG3A_RANK FG3_PCT_RANK FTM_RANK FTA_RANK FT_PCT_RANK OREB_RANK DREB_RANK REB_RANK AST_RANK TOV_RANK STL_RANK BLK_RANK BLKA_RANK PF_RANK PFD_RANK PTS_RANK PLUS_MINUS_RANK CFID CFPARAMS
0 1610612737 Atlanta Hawks 4 2 2 0.500 48.0 39.5 84.3 0.469 10.0 31.8 0.315 16.0 23.0 0.696 8.5 34.5 43.0 25.0 18.0 10.0 5.3 7.3 23.8 21.5 105.0 1.0 1 11 14 14 10 14 27 5 23 21 21 24 21 27 24 21 25 9 19 3 15 29 17 25 22 15 10 Atlanta Hawks
1 1610612738 Boston Celtics 3 2 1 0.667 48.0 39.0 97.3 0.401 11.7 35.0 0.333 18.0 26.3 0.684 14.7 33.0 47.7 21.0 11.3 9.3 6.3 5.7 25.0 29.3 107.7 5.0 19 11 4 11 10 18 3 28 13 12 18 19 12 28 2 25 13 22 1 7 7 19 20 1 16 10 10 Boston Celtics
2 1610612751 Brooklyn Nets 3 1 2 0.333 51.3 43.0 93.3 0.461 15.3 38.7 0.397 22.7 32.3 0.701 10.7 38.3 49.0 22.7 19.7 8.3 5.3 6.0 26.0 27.3 124.0 0.7 19 18 14 18 1 4 9 8 3 7 4 7 3 24 11 9 7 19 26 15 13 22 21 3 1 16 10 Brooklyn Nets
3 1610612766 Charlotte Hornets 4 1 3 0.250 48.0 38.3 86.5 0.442 14.8 36.8 0.401 14.3 19.8 0.722 10.0 31.5 41.5 24.3 19.3 5.3 4.0 6.5 22.5 21.8 105.5 -13.8 1 18 23 23 10 23 23 16 4 9 3 26 26 23 14 28 29 14 24 30 24 25 10 22 21 28 10 Charlotte Hornets
4 1610612741 Chicago Bulls 4 1 3 0.250 48.0 38.5 95.0 0.405 9.8 35.5 0.275 17.5 23.0 0.761 12.3 32.0 44.3 20.0 12.8 10.0 4.5 7.0 21.3 20.8 104.3 -6.0 1 18 23 23 10 20 6 27 24 11 29 20 21 15 7 26 24 26 2 3 20 27 7 26 24 23 10 Chicago Bulls
5 1610612739 Cleveland Cavaliers 3 1 2 0.333 48.0 39.3 89.3 0.440 10.7 34.7 0.308 13.0 18.7 0.696 10.7 38.0 48.7 20.7 15.7 6.3 4.0 4.7 19.0 19.3 102.3 -5.0 19 18 14 18 10 17 16 19 19 13 24 29 27 26 11 10 10 23 13 25 24 11 3 30 26 22 10 Cleveland Cavaliers
6 1610612742 Dallas Mavericks 4 3 1 0.750 48.0 39.5 86.8 0.455 12.8 40.8 0.313 23.0 31.0 0.742 9.8 36.0 45.8 24.0 13.0 6.8 5.0 2.8 19.3 27.0 114.8 4.0 1 1 4 4 10 14 21 10 8 5 22 5 4 19 17 15 21 15 3 19 19 1 4 7 10 12 10 Dallas Mavericks
7 1610612743 Denver Nuggets 4 3 1 0.750 49.3 37.3 90.5 0.412 11.5 31.8 0.362 19.8 24.3 0.814 13.0 35.5 48.5 22.0 14.3 8.0 5.5 4.5 22.8 23.8 105.8 3.3 1 1 4 4 4 27 13 25 14 21 11 13 20 7 4 19 11 20 7 16 11 9 13 14 20 13 10 Denver Nuggets
8 1610612765 Detroit Pistons 4 2 2 0.500 48.0 38.5 80.0 0.481 10.5 26.0 0.404 19.0 25.3 0.752 8.3 33.5 41.8 21.8 18.8 6.0 5.3 3.8 21.8 21.8 106.5 -3.0 1 11 14 14 10 20 29 3 20 28 2 15 15 17 26 24 28 21 21 27 15 4 9 22 18 21 10 Detroit Pistons
9 1610612744 Golden State Warriors 3 1 2 0.333 48.0 40.0 98.3 0.407 11.3 36.7 0.309 24.7 28.3 0.871 15.3 32.0 47.3 27.0 15.3 9.3 1.3 5.7 19.3 23.3 116.0 -12.0 19 18 14 18 10 10 2 26 15 10 23 3 8 2 1 26 14 4 10 7 30 19 5 17 9 27 10 Golden State Warriors
10 1610612745 Houston Rockets 3 2 1 0.667 48.0 38.3 91.3 0.420 13.0 45.7 0.285 28.0 34.0 0.824 9.3 38.0 47.3 24.3 15.7 6.3 5.3 5.0 23.7 28.0 117.7 0.3 19 11 4 11 10 22 11 23 6 3 27 1 1 5 21 10 14 12 13 25 13 15 15 2 8 18 10 Houston Rockets
11 1610612754 Indiana Pacers 3 0 3 0.000 48.0 39.7 90.0 0.441 8.0 23.3 0.343 13.7 16.7 0.820 9.7 29.3 39.0 24.3 13.3 8.7 4.3 5.3 23.7 19.7 101.0 -7.3 19 28 23 28 10 13 14 18 29 30 14 27 29 6 19 30 30 12 5 10 23 18 15 28 27 26 10 Indiana Pacers
12 1610612746 LA Clippers 4 3 1 0.750 48.0 43.0 82.8 0.520 13.0 32.0 0.406 22.5 28.5 0.789 8.3 34.0 42.3 25.0 17.0 8.5 5.5 3.3 26.3 25.5 121.5 9.0 1 1 4 4 10 4 28 1 6 19 1 8 7 11 26 22 26 9 18 11 11 2 23 10 3 3 10 LA Clippers
13 1610612747 Los Angeles Lakers 4 3 1 0.750 48.0 40.0 87.5 0.457 9.8 29.0 0.336 19.5 24.5 0.796 10.0 36.0 46.0 23.5 15.3 8.5 8.0 3.5 21.5 24.3 109.3 11.8 1 1 4 4 10 10 17 9 24 25 17 14 17 8 14 15 19 17 9 11 1 3 8 12 15 1 10 Los Angeles Lakers
14 1610612763 Memphis Grizzlies 4 1 3 0.250 49.3 39.5 95.3 0.415 9.0 32.0 0.281 19.0 24.5 0.776 11.3 36.5 47.8 24.8 18.8 9.0 6.5 7.0 27.0 23.8 107.0 -13.8 1 18 23 23 4 14 5 24 27 19 28 15 17 14 10 14 12 11 21 9 5 27 26 14 17 28 10 Memphis Grizzlies
15 1610612748 Miami Heat 4 3 1 0.750 49.3 40.3 86.0 0.468 12.8 32.3 0.395 24.8 33.8 0.733 9.8 39.0 48.8 23.8 22.5 8.5 6.5 4.8 27.0 27.3 118.0 8.0 1 1 4 4 4 9 25 6 8 17 5 2 2 20 17 6 9 16 30 11 5 12 26 6 7 6 10 Miami Heat
16 1610612749 Milwaukee Bucks 3 2 1 0.667 49.7 45.0 95.0 0.474 16.7 46.0 0.362 17.3 25.7 0.675 6.3 43.7 50.0 27.3 13.7 8.0 7.0 4.0 24.7 25.7 124.0 6.0 19 11 4 11 2 2 6 4 2 1 10 21 14 29 29 2 3 3 6 16 2 5 19 9 1 9 10 Milwaukee Bucks
17 1610612750 Minnesota Timberwolves 3 3 0 1.000 49.7 42.7 96.7 0.441 12.7 42.0 0.302 23.3 30.7 0.761 13.0 37.0 50.0 25.7 15.3 10.7 3.7 7.7 20.0 27.3 121.3 10.0 19 1 1 1 2 6 4 17 10 4 25 4 5 15 4 13 3 5 10 1 28 30 6 3 4 2 10 Minnesota Timberwolves
18 1610612740 New Orleans Pelicans 4 0 4 0.000 49.3 45.5 100.8 0.452 16.8 45.8 0.366 13.3 18.3 0.726 12.0 34.0 46.0 30.8 16.3 8.0 5.3 4.0 26.5 21.8 121.0 -7.3 1 28 29 28 4 1 1 13 1 2 8 28 28 21 8 22 19 1 17 16 15 5 25 22 5 24 10 New Orleans Pelicans
19 1610612752 New York Knicks 4 1 3 0.250 48.0 37.8 87.0 0.434 10.8 27.8 0.387 18.8 28.0 0.670 13.8 35.3 49.0 18.8 20.3 10.0 3.8 5.3 27.0 23.0 105.0 -7.3 1 18 23 23 10 24 19 20 17 27 6 17 10 30 3 20 7 27 27 3 27 17 26 18 22 24 10 New York Knicks
20 1610612760 Oklahoma City Thunder 4 1 3 0.250 48.0 37.5 84.5 0.444 10.8 29.3 0.368 17.3 24.8 0.697 9.5 40.3 49.8 18.8 18.5 6.8 4.5 4.8 23.5 22.8 103.0 1.8 1 18 23 23 10 25 26 15 17 23 7 22 16 25 20 3 5 27 20 19 20 12 14 19 25 14 10 Oklahoma City Thunder
21 1610612753 Orlando Magic 3 1 2 0.333 48.0 35.3 91.3 0.387 8.7 33.3 0.260 16.7 21.0 0.794 10.7 35.7 46.3 20.3 13.0 9.7 5.7 4.3 17.7 20.3 96.0 -1.3 19 18 14 18 10 28 11 30 28 16 30 23 25 9 11 18 17 24 3 6 9 8 1 27 29 20 10 Orlando Magic
22 1610612755 Philadelphia 76ers 3 3 0 1.000 48.0 38.7 86.7 0.446 10.3 34.7 0.298 22.0 30.3 0.725 10.0 39.7 49.7 25.3 20.3 10.7 7.0 4.0 29.7 27.3 109.7 7.3 19 1 1 1 10 19 22 14 22 13 26 11 6 22 14 5 6 7 29 1 2 5 29 3 14 7 10 Philadelphia 76ers
23 1610612756 Phoenix Suns 4 2 2 0.500 49.3 39.8 87.5 0.454 12.3 34.5 0.355 22.3 26.8 0.832 7.8 39.0 46.8 27.8 16.0 8.5 4.0 6.5 31.3 27.0 114.0 8.8 1 11 14 14 4 12 17 11 12 15 13 10 11 4 28 6 16 2 15 11 24 25 30 7 11 4 10 Phoenix Suns
24 1610612757 Portland Trail Blazers 4 2 2 0.500 48.0 41.5 89.8 0.462 9.3 28.3 0.327 21.0 24.5 0.857 8.5 37.8 46.3 17.0 15.5 6.8 5.3 4.5 26.3 22.5 113.3 0.3 1 11 14 14 10 7 15 7 26 26 20 12 17 3 24 12 18 30 12 19 15 9 23 20 12 19 10 Portland Trail Blazers
25 1610612758 Sacramento Kings 4 0 4 0.000 48.0 34.3 86.5 0.396 11.0 32.3 0.341 16.0 21.5 0.744 11.5 30.8 42.3 18.8 18.8 6.5 4.5 5.0 22.5 22.0 95.5 -19.5 1 28 29 28 10 30 23 29 16 17 15 24 24 18 9 29 26 27 21 23 20 15 10 21 30 30 10 Sacramento Kings
26 1610612759 San Antonio Spurs 3 3 0 1.000 48.0 44.3 92.0 0.482 8.0 23.7 0.338 22.3 28.3 0.788 12.7 38.7 51.3 25.3 16.0 5.7 7.0 5.7 18.7 24.7 119.0 4.7 19 1 1 1 10 3 10 2 29 29 16 9 8 12 6 8 2 7 15 29 2 19 2 11 6 11 10 San Antonio Spurs
27 1610612761 Toronto Raptors 4 3 1 0.750 49.3 37.5 87.0 0.431 14.3 39.3 0.363 22.8 25.8 0.883 9.3 44.3 53.5 22.8 20.3 6.8 5.8 6.3 24.3 24.0 112.0 8.8 1 1 4 4 4 25 19 22 5 6 9 6 13 1 22 1 1 18 27 19 8 24 18 13 13 4 10 Toronto Raptors
28 1610612762 Utah Jazz 4 3 1 0.750 48.0 35.0 77.3 0.453 10.5 29.3 0.359 18.3 23.0 0.793 5.5 39.8 45.3 20.3 19.5 6.5 3.3 4.8 26.0 23.5 98.8 7.3 1 1 4 4 10 29 30 12 20 23 12 18 21 10 30 4 22 25 25 23 29 12 21 16 28 8 10 Utah Jazz
29 1610612764 Washington Wizards 3 1 2 0.333 48.0 41.0 95.0 0.432 12.7 38.7 0.328 11.7 15.0 0.778 9.0 36.0 45.0 25.7 15.0 6.0 5.7 6.0 22.7 19.7 106.3 0.7 19 18 14 18 10 8 6 21 10 7 19 30 30 13 23 15 23 5 8 27 9 22 12 28 19 16 10 Washington Wizards
Using Selenium would be another way to do it: it lets the page render first, so you can then get the whole content, including what is rendered by javascript.
https://towardsdatascience.com/simple-web-scraping-with-pythons-selenium-4cedc52798cd

iterating over loc on dataframes

I'm trying to extract row ranges from a list of dataframes. Each dataframe might not have the same data, so I have a list of possible index ranges that I would like loc to loop over; i.e., from the code sample below I might want CIN to LAN, but on another dataframe the CIN row doesn't exist, so I will want DET to LAN or HOU to LAN.
so I was thinking putting them in a list and iterating over the list, i.e.
for df in dfs:
    ranges = [df.loc["CIN":"LAN"], df.loc["DET":"LAN"]]
    extracted_ranges = (i for i in ranges)
I'm not sure how you would iterate over a list and feed into loc, or perhaps .query().
df1 stint g ab r h X2b X3b hr rbi sb cs bb \
year team
2007 CIN 6 379 745 101 203 35 2 36 125.0 10.0 1.0 105
DET 5 301 1062 162 283 54 4 37 144.0 24.0 7.0 97
HOU 4 311 926 109 218 47 6 14 77.0 10.0 4.0 60
LAN 11 413 1021 153 293 61 3 36 154.0 7.0 5.0 114
NYN 13 622 1854 240 509 101 3 61 243.0 22.0 4.0 174
SFN 5 482 1305 198 337 67 6 40 171.0 26.0 7.0 235
TEX 2 198 729 115 200 40 4 28 115.0 21.0 4.0 73
TOR 4 459 1408 187 378 96 2 58 223.0 4.0 2.0 190
df2 so ibb hbp sh sf gidp
year team
2008 DET 176.0 3.0 10.0 4.0 8.0 28.0
HOU 212.0 3.0 9.0 16.0 6.0 17.0
LAN 141.0 8.0 9.0 3.0 8.0 29.0
NYN 310.0 24.0 23.0 18.0 15.0 48.0
SFN 188.0 51.0 8.0 16.0 6.0 41.0
TEX 140.0 4.0 5.0 2.0 8.0 16.0
TOR 265.0 16.0 12.0 4.0 16.0 38.0
Here is a solution:
import pandas as pd
# Prepare a list of ranges
ranges = [('CIN','LAN'), ('DET','LAN')]
# Declare an empty list of data frames and a list with the existing data frames
df_ranges = []
df_list = [df1, df2]
# Loop over multi-indices
for i, idx_range in enumerate(ranges):
    df = df_list[i]
    row1, row2 = idx_range
    df_ranges.append(df.loc[(slice(None), slice(row1, row2)), :])
# Print the extracted data
print('Extracted data:\n')
print(df_ranges)
Output:
[ stint g ab r h X2b X3b hr rbi sb cs bb
year team
2007 CIN 6 379 745 101 203 35 2 36 125 10 1 105
DET 5 301 1062 162 283 54 4 37 144 24 7 97
HOU 4 311 926 109 218 47 6 14 77 10 4 60
LAN 11 413 1021 153 293 61 3 36 154 7 5 114
so ibb hbp sh sf gidp
year team
2008 DET 176 3 10 4 8 28
HOU 212 3 9 16 6 17
LAN 141 8 9 3 8 29]
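A minimal, self-contained version of the slicing idea, using a small made-up frame with the same (year, team) MultiIndex shape as the question's data:

```python
import pandas as pd

# Small frame indexed like the question's df1 (labels sorted on each level,
# which label slicing on a MultiIndex requires)
idx = pd.MultiIndex.from_tuples(
    [(2007, 'CIN'), (2007, 'DET'), (2007, 'HOU'), (2007, 'LAN'), (2007, 'NYN')],
    names=['year', 'team'])
df1 = pd.DataFrame({'g': [379, 301, 311, 413, 622]}, index=idx)

# slice(None) selects every year; slice('CIN', 'LAN') selects the label
# range on the 'team' level (both endpoints inclusive).
sub = df1.loc[(slice(None), slice('CIN', 'LAN')), :]
print(sub)
```

Because label slices select whatever falls inside the range rather than exact matches, slice('CIN', 'LAN') also works on a frame where the CIN row is missing; the slice then simply starts at the first existing label inside the range.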

Difference between dates in Pandas dataframe

This is related to this question, but now I need to find the difference between dates that are stored as 'YYYY-MM-DD' strings. Essentially, the difference between values in the count column is what we need, but normalized by the number of days between each row.
My dataframe is:
date,site,country_code,kind,ID,rank,votes,sessions,avg_score,count
2017-03-20,website1,US,0,84,226,0.0,15.0,3.370812,53.0
2017-03-21,website1,US,0,84,214,0.0,15.0,3.370812,53.0
2017-03-22,website1,US,0,84,226,0.0,16.0,3.370812,53.0
2017-03-23,website1,US,0,84,234,0.0,16.0,3.369048,54.0
2017-03-24,website1,US,0,84,226,0.0,16.0,3.369048,54.0
2017-03-25,website1,US,0,84,212,0.0,16.0,3.369048,54.0
2017-03-27,website1,US,0,84,228,0.0,16.0,3.369048,58.0
2017-02-15,website2,AU,1,91,144,4.0,148.0,4.727272,521.0
2017-02-16,website2,AU,1,91,144,3.0,147.0,4.727272,524.0
2017-02-20,website2,AU,1,91,100,4.0,148.0,4.727272,531.0
2017-02-21,website2,AU,1,91,118,6.0,149.0,4.727272,533.0
2017-02-22,website2,AU,1,91,114,4.0,151.0,4.727272,534.0
And I'd like to find the difference between each date after grouping by date+site+country+kind+ID tuples.
[date,site,country_code,kind,ID,rank,votes,sessions,avg_score,count,day_diff
2017-03-20,website1,US,0,84,226,0.0,15.0,3.370812,0,0
2017-03-21,website1,US,0,84,214,0.0,15.0,3.370812,0,1
2017-03-22,website1,US,0,84,226,0.0,16.0,3.370812,0,1
2017-03-23,website1,US,0,84,234,0.0,16.0,3.369048,0,1
2017-03-24,website1,US,0,84,226,0.0,16.0,3.369048,0,1
2017-03-25,website1,US,0,84,212,0.0,16.0,3.369048,0,1
2017-03-27,website1,US,0,84,228,0.0,16.0,3.369048,4,2
2017-02-15,website2,AU,1,91,144,4.0,148.0,4.727272,0,0
2017-02-16,website2,AU,1,91,144,3.0,147.0,4.727272,3,1
2017-02-20,website2,AU,1,91,100,4.0,148.0,4.727272,7,4
2017-02-21,website2,AU,1,91,118,6.0,149.0,4.727272,3,1
2017-02-22,website2,AU,1,91,114,4.0,151.0,4.727272,1,1]
One option would be to convert the date column to a Pandas datetime one using pd.to_datetime() and use the diff function, but that results in values like "x days" of type timedelta64. I'd like to use this difference to find the daily average count, so if this can be accomplished in even a single/less painful step, that would work well.
You can use the .dt.days accessor:
In [72]: df['date'] = pd.to_datetime(df['date'])
In [73]: df['day_diff'] = df.groupby(['site','country_code','kind','ID'])['date'] \
.diff().dt.days.fillna(0)
In [74]: df
Out[74]:
date site country_code kind ID rank votes sessions avg_score count day_diff
0 2017-03-20 website1 US 0 84 226 0.0 15.0 3.370812 53.0 0.0
1 2017-03-21 website1 US 0 84 214 0.0 15.0 3.370812 53.0 1.0
2 2017-03-22 website1 US 0 84 226 0.0 16.0 3.370812 53.0 1.0
3 2017-03-23 website1 US 0 84 234 0.0 16.0 3.369048 54.0 1.0
4 2017-03-24 website1 US 0 84 226 0.0 16.0 3.369048 54.0 1.0
5 2017-03-25 website1 US 0 84 212 0.0 16.0 3.369048 54.0 1.0
6 2017-03-27 website1 US 0 84 228 0.0 16.0 3.369048 58.0 2.0
7 2017-02-15 website2 AU 1 91 144 4.0 148.0 4.727272 521.0 0.0
8 2017-02-16 website2 AU 1 91 144 3.0 147.0 4.727272 524.0 1.0
9 2017-02-20 website2 AU 1 91 100 4.0 148.0 4.727272 531.0 4.0
10 2017-02-21 website2 AU 1 91 118 6.0 149.0 4.727272 533.0 1.0
11 2017-02-22 website2 AU 1 91 114 4.0 151.0 4.727272 534.0 1.0
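Building on that, the asker's end goal (the change in count per day) can be sketched as follows; this is a minimal example with a subset of the question's website2 rows and the same grouping columns:

```python
import pandas as pd

df = pd.DataFrame({
    'date': ['2017-02-15', '2017-02-16', '2017-02-20', '2017-02-21', '2017-02-22'],
    'site': 'website2', 'country_code': 'AU', 'kind': 1, 'ID': 91,
    'count': [521.0, 524.0, 531.0, 533.0, 534.0],
})
df['date'] = pd.to_datetime(df['date'])

grp = df.groupby(['site', 'country_code', 'kind', 'ID'])
df['day_diff'] = grp['date'].diff().dt.days     # days between consecutive rows
df['count_diff'] = grp['count'].diff()          # change in count between rows

# Daily average change in count, normalized by the gap between rows
df['daily_avg'] = df['count_diff'] / df['day_diff']
print(df[['date', 'day_diff', 'count_diff', 'daily_avg']])
```

The first row of each group is NaN (there is no previous row to diff against); whether to fillna(0) or drop those rows depends on how the averages are used downstream.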

Need to compare one Pandas (Python) dataframe with values from another dataframe

So I've pulled data from an SQL server and put it into a dataframe. All the data is of discrete form and increases in 0.1 steps in one direction (0.0, 0.1, 0.2, ..., 9.8, 9.9, 10.0), with multiple power values for each step (e.g. 1000, 1412, 134.5, 657.1 at 0.1; 14.5, 948.1, 343.8 at 5.5) - hopefully you see what I'm trying to say.
I've managed to group the data into these individual steps using the following, and have then taken the mean and standard deviation for each group.
group = df.groupby('step').power.mean()
group2 = df.groupby('step').power.std().fillna(0)
This results in two data frames (group and group2) which have the mean and standard deviation for each of the 0.1 steps. It's then easy to create an upper and lower limit for each step using the following:
upperlimit = group + 3*group2
lowerlimit = group - 3*group2
lowerlimit[lowerlimit<0] = 0
Now comes the bit I'm confused about! I need to go back into the original dataframe and remove rows/instances where the power value is outside these calculated limits (note there is a different upper and lower limit for each 0.1 step).
Here's 50 lines of the sample data:
Index Power Step
0 106.0 5.0
1 200.4 5.5
2 201.4 5.6
3 226.9 5.6
4 206.8 5.6
5 177.5 5.3
6 124.0 4.9
7 121.0 4.8
8 93.9 4.7
9 135.6 5.0
10 211.1 5.6
11 265.2 6.0
12 281.4 6.2
13 417.9 6.9
14 546.0 7.4
15 619.9 7.9
16 404.4 7.1
17 241.4 5.8
18 44.3 3.9
19 72.1 4.6
20 21.1 3.3
21 6.3 2.3
22 0.0 0.8
23 0.0 0.9
24 0.0 3.2
25 0.0 4.6
26 33.3 4.2
27 97.7 4.7
28 91.0 4.7
29 105.6 4.8
30 97.4 4.6
31 126.7 5.0
32 134.3 5.0
33 133.4 5.1
34 301.8 6.3
35 298.5 6.3
36 312.1 6.5
37 505.3 7.5
38 491.8 7.3
39 404.6 6.8
40 324.3 6.6
41 347.2 6.7
42 365.3 6.8
43 279.7 6.3
44 351.4 6.8
45 350.1 6.7
46 573.5 7.9
47 490.1 7.5
48 520.4 7.6
49 548.2 7.9
To put your goal another way: you want to perform some manipulations on grouped data, and then project the results of those manipulations back to the ungrouped rows so you can use them for filtering those rows. One way to do this is with transform:
The transform method returns an object that is indexed the same (same size) as the one being grouped. Thus, the passed transform function should return a result that is the same size as the group chunk.
You can then create the new columns directly. Note that inside the lambda, p.std() returns a scalar (NaN for single-row groups), so use np.nan_to_num() rather than .fillna():
import numpy as np
df['upper'] = df.groupby('step').power.transform(lambda p: p.mean() + 3*np.nan_to_num(p.std()))
df['lower'] = df.groupby('step').power.transform(lambda p: p.mean() - 3*np.nan_to_num(p.std()))
df.loc[df['lower'] < 0, 'lower'] = 0
And filter accordingly:
df = df[(df.power <= df['upper']) & (df.power >= df['lower'])]
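Putting the whole pipeline together on a small made-up sample (the power values below are invented; a Series-returning transform('std') is used so .fillna(0) applies cleanly):

```python
import pandas as pd

# Made-up sample: twelve readings at step 5.0 (one obvious outlier)
# and two readings at step 5.6
df = pd.DataFrame({
    'power': [100.0] * 11 + [10000.0, 200.0, 205.0],
    'step':  [5.0] * 12 + [5.6, 5.6],
})

grouped = df.groupby('step').power
mean = grouped.transform('mean')
std = grouped.transform('std').fillna(0)  # a Series, so .fillna works here

df['upper'] = mean + 3 * std
df['lower'] = (mean - 3 * std).clip(lower=0)

# Keep only rows whose power lies inside the per-step 3-sigma band
filtered = df[(df['power'] <= df['upper']) & (df['power'] >= df['lower'])]
print(len(df), '->', len(filtered))  # 14 -> 13: the 10000.0 outlier is dropped
```

One caveat worth knowing: with very small groups a 3-sigma band can never exclude anything (for n points the largest possible z-score is (n-1)/sqrt(n)), so the filter only bites once each step has a dozen or more readings.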

How to retrieve one column from csv file using python?

I'm trying to retrieve the age column from one of the csv files; here is what I've coded so far.
df = pd.read_csv('train.csv', index_col=0)
result = df[(df.Sex == 'female') & (df.Pclass == 3)]
print(result.Age)
# finding the average age of all people who survived
print(len(result))
num_rows = len(result)
I printed out the ages because I wanted to see all the ages belonging to rows where the Sex column has the value "female" and the Pclass column has the value 3.
For some reason the printed result shows the index number with the age next to it; I just want it to print the list of ages, that's all.
PassengerId
3 26.0
9 27.0
11 4.0
15 14.0
19 31.0
20 NaN
23 15.0
25 8.0
26 38.0
29 NaN
33 NaN
39 18.0
40 14.0
41 40.0
45 19.0
48 NaN
50 18.0
69 17.0
72 16.0
80 30.0
83 NaN
86 33.0
101 28.0
107 21.0
110 NaN
112 14.5
114 20.0
115 17.0
120 2.0
129 NaN
...
658 32.0
678 18.0
679 43.0
681 NaN
692 4.0
698 NaN
703 18.0
728 NaN
730 25.0
737 48.0
768 30.5
778 5.0
781 13.0
787 18.0
793 NaN
798 31.0
800 30.0
808 18.0
814 6.0
817 23.0
824 27.0
831 15.0
853 9.0
856 18.0
859 24.0
864 NaN
876 15.0
883 22.0
886 39.0
889 NaN
Name: Age, dtype: float64
This is what my program prints; I just want the list of ages in the right-hand column, not the PassengerId index on the left.
Thank you
result.Age is a pandas Series object, so when you print it, the index, name, and dtype are shown as well. This is a good thing, because it makes the printed representation of the object much more useful.
If you want to control exactly how the data is displayed, you will need to do some string formatting. Something like this should do what you're asking for:
print('\n'.join(str(x) for x in result.Age))
If you want access to the raw data underlying that column for some reason (usually you can work with the Series just as well), without indices or headers, you can get a numpy array with
result.Age.values
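For instance, with a small made-up Series standing in for result.Age, a few ways to show just the values:

```python
import pandas as pd

# Hypothetical stand-in for result.Age (index mimics PassengerId)
ages = pd.Series([26.0, 27.0, 4.0], index=[3, 9, 11], name='Age')

# One value per line, without the index, name, or dtype
print('\n'.join(str(x) for x in ages))

# Or get the values as a plain Python list
print(ages.tolist())  # -> [26.0, 27.0, 4.0]

# Or let pandas format it, suppressing only the index
print(ages.to_string(index=False))
```

tolist() (or .values for a numpy array) is usually the right choice if the ages feed into further computation rather than display.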
