Beginner here,
I scraped a table using requests from nhl.com, and I'd like to send it to Excel.
import requests
url = 'https://api.nhle.com/stats/rest/en/team/powerplay?isAggregate=false&isGame=false&sort=%5B%7B%22property%22:%22powerPlayPct%22,%22direction%22:%22DESC%22%7D,%7B%22property%22:%22teamId%22,%22direction%22:%22ASC%22%7D%5D&start=0&limit=50&factCayenneExp=gamesPlayed%3E=1&cayenneExp=gameTypeId=2%20and%20homeRoad=%22H%22%20and%20seasonId%3C=20212022%20and%20seasonId%3E=20212022'
data = requests.get(url).json()
for i in data['data']:
    print('{:<30} {:.1f}'.format(i['teamFullName'], i['powerPlayPct']*100))
I used requests instead of pandas because the tables on nhl.com are loaded dynamically, but as far as I can tell requests doesn't give me a DataFrame (as pandas would) that I could write out with df.to_excel.
How could I do that?
Try using pd.json_normalize and pass 'data' as the record_path parameter:
import requests
import pandas as pd
url = 'https://api.nhle.com/stats/rest/en/team/powerplay?isAggregate=false&isGame=false&sort=%5B%7B%22property%22:%22powerPlayPct%22,%22direction%22:%22DESC%22%7D,%7B%22property%22:%22teamId%22,%22direction%22:%22ASC%22%7D%5D&start=0&limit=50&factCayenneExp=gamesPlayed%3E=1&cayenneExp=gameTypeId=2%20and%20homeRoad=%22H%22%20and%20seasonId%3C=20212022%20and%20seasonId%3E=20212022'
data = requests.get(url).json()
df = pd.json_normalize(data, record_path='data')
# Do whatever math you want here
df.to_excel('nhl_data.xlsx', index=False)
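As a minimal sketch of the "do whatever math you want" step, assuming the payload keeps the shape used in the question (a top-level 'data' list whose items carry teamFullName and powerPlayPct). The sample values below are made up for illustration and are not real NHL numbers:

```python
import pandas as pd

# Made-up payload mimicking the shape of the API response (illustrative values only)
sample = {
    "data": [
        {"teamFullName": "Team A", "powerPlayPct": 0.25},
        {"teamFullName": "Team B", "powerPlayPct": 0.2},
    ],
    "total": 2,
}

# record_path="data" turns each element of the "data" list into one row
df = pd.json_normalize(sample, record_path="data")

# Same math as the print loop in the question: fraction -> percentage, one decimal
df["powerPlayPct"] = (df["powerPlayPct"] * 100).round(1)

print(df)
# df.to_excel("nhl_data.xlsx", index=False)  # needs an Excel writer such as openpyxl
```

Doing the conversion on the column keeps the DataFrame ready for to_excel without a separate formatting loop.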
I want to scrape multiple pages of a website using Python, but I'm getting a "Remote end closed connection" error.
Here is my code
import pandas as pd
url_link = 'https://www.taneps.go.tz/epps/viewAllAwardedContracts.do?d-3998960-p={}&selectedItem=viewAllAwardedContracts.do'
LIST = []
for number in range(1,5379):
    url = url_link.format(number)
    dframe = pd.read_html(url, header=None)[0]
    LIST.append(dframe)
Result_df = pd.concat(LIST)
Result_df.to_csv('Taneps_contracts.csv')
Any idea how to solve it?
For me, just using requests to fetch the html before passing to read_html is getting the data. I just edited your code to
import pandas as pd
import requests
from io import StringIO
url_link = 'https://www.taneps.go.tz/epps/viewAllAwardedContracts.do?d-3998960-p={}&selectedItem=viewAllAwardedContracts.do'
LIST = []
for number in range(1,5379):
    url = url_link.format(number)
    r = requests.get(url) # getting page -> html in r.text
    # wrap the HTML text in StringIO: pandas 2.1+ deprecates passing literal HTML strings
    dframe = pd.read_html(StringIO(r.text), header=None)[0]
    LIST.append(dframe)
Result_df = pd.concat(LIST)
Result_df.to_csv('Taneps_contracts.csv')
I didn't even have to add headers, but if this isn't enough for you (i.e., if the program breaks or if you don't end up with 53770+ rows), try adding convincing headers or using something like HTMLSession instead of directly calling requests.get...
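If plain requests.get is not enough, one possible way to add headers and survive dropped connections is a shared Session with automatic retries. This is only a sketch: the User-Agent value, the retry count, the backoff and the status codes are assumptions chosen for illustration, and build_session is a hypothetical helper name.

```python
import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry


def build_session():
    """Return a requests.Session with a browser-like User-Agent and automatic retries."""
    session = requests.Session()
    # Assumed header value; any realistic browser User-Agent string could be used
    session.headers.update({"User-Agent": "Mozilla/5.0"})
    retry = Retry(
        total=5,                                     # up to 5 retries per request
        backoff_factor=1,                            # growing pause between attempts
        status_forcelist=[429, 500, 502, 503, 504],  # also retry on these statuses
    )
    session.mount("https://", HTTPAdapter(max_retries=retry))
    return session


session = build_session()
# Inside the loop, replace requests.get(url) with:
# r = session.get(url, timeout=30)
```

Reusing one session also keeps the underlying connection open between pages, which tends to matter when looping over several thousand URLs.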
I'm trying to read in a table from a website, but when I do this, I am getting a result from the website that says: "It appears your browser may be outdated. For the best website experience, we recommend updating your browser."
I am able to use requests.get on the Stats portion of this same PGA website without issue, but the way these historical results tables are displayed seems to cause problems. One interesting detail: the website lets you select different years for the displayed table, but doing so doesn't change the address, so I suspect the table is built in a way that read_html can't handle. Any other suggestions? Code below.
import pandas as pd
import requests
farmers_url = 'https://www.pgatour.com/tournaments/farmers-insurance-open/past-results.html'
farmers = pd.read_html(requests.get(farmers_url).text, header=0)[0]
farmers.head()
I see a request to the following file for the content you want. This would otherwise be an additional request made by the browser from your start url. What you are currently getting is the actual content of a table at the requested url prior to any updates which would happen dynamically with a browser.
import requests
import pandas as pd
from io import StringIO
headers = {'User-Agent': 'Mozilla/5.0'}
r = requests.get('https://www.pgatour.com/tournaments/farmers-insurance-open/past-results/jcr:content/mainParsys/pastresults.selectedYear.2021.004.html', headers=headers).text
pd.read_html(StringIO(r)) # StringIO wrapper: pandas 2.1+ deprecates passing literal HTML strings
If you want to tidy the result so it looks like the table on the actual webpage, something like the following transformations and cleaning would work:
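import requests
import pandas as pd
from io import StringIO
headers = {'User-Agent': 'Mozilla/5.0'}
r = requests.get('https://www.pgatour.com/tournaments/farmers-insurance-open/past-results/jcr:content/mainParsys/pastresults.selectedYear.2021.004.html', headers=headers).text
t = pd.read_html(StringIO(r))[0] # StringIO wrapper: pandas 2.1+ deprecates passing literal HTML strings
t = t.reset_index(drop=True) # reset_index returns a new frame, so the result has to be assigned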
t.columns = [':'.join([i[0], i[1]]) if 'ROUNDS' in i else i[0] for i in t.columns]
t.POS = t.POS.map(lambda x: x.split(' ')[-1])
round_columns = [i for i in t.columns if 'ROUNDS' in i]
t[round_columns] = t[round_columns].applymap(lambda x: x.split(' ')[0])
t.drop('TO PAR', inplace = True, axis = 1)
t.rename(columns={"TOTALSCORE": "TOTAL SCORE", "OFFICIALMONEY": "OFFICIAL MONEY", "FEDEXCUPPOINTS":"FEDEX CUP POINTS"}, inplace = True)
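To see what the column flattening and the split-based cleaning do without hitting the site, here is a small sketch on a made-up frame. The two-level header and the cell values are assumptions that only imitate the kind of table read_html returns here; they are not copied from the real page.

```python
import pandas as pd

# Made-up two-level header resembling the scraped table (values are illustrative)
columns = pd.MultiIndex.from_tuples(
    [("POS", "POS"), ("PLAYER", "PLAYER"), ("ROUNDS", "1"), ("ROUNDS", "2")]
)
t = pd.DataFrame([["T1 1", "Some Player", "70 T5", "68 T2"]], columns=columns)

# Keep both levels for the ROUNDS group, only the top level elsewhere
t.columns = [":".join(c) if "ROUNDS" in c else c[0] for c in t.columns]

# Keep the last token of POS and the first token of each round cell
t["POS"] = t["POS"].map(lambda x: x.split(" ")[-1])
round_columns = [c for c in t.columns if "ROUNDS" in c]
for col in round_columns:
    t[col] = t[col].map(lambda x: x.split(" ")[0])

print(t.columns.tolist())  # ['POS', 'PLAYER', 'ROUNDS:1', 'ROUNDS:2']
```

The per-column Series.map loop is used here instead of applymap only because applymap is deprecated in recent pandas releases; the effect on the round columns is the same.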
Detail:
I was asked to create a DataFrame from JSON using looping only, passing specific headers, headers = {'User-agent': 'CS6400'}, to requests.get(). In addition, the DataFrame should have 5 rows (one for each of the top 25 posts).
Since I can only use looping, I am not able to use "read_json" or "json_normalize". I have been stuck for several hours; any support is much appreciated.
Here is my buggy code (error message: ERROR! Session/line number was not unique in database. History logging moved to new session 122):
import urllib, json
import requests
import gzip
import json
url = "http://www.reddit.com/r/popular/top.json"
headers = {'User-agent': 'DS6001'}
response = requests.get(url,headers=headers)
data = response.read().decode("utf-8")
data = json.dumps(response)
dataframe = pd.DataFrame.from_dict(data, orient="index")
dataframe
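Two things stand out in the code above: a requests response has no read() method (response.json() or response.text give the body), and json.dumps serializes an object to text rather than parsing it. A loop-only approach could look roughly like the sketch below. It assumes the usual Reddit listing layout (posts under data -> children -> data); posts_to_frame and the chosen fields are hypothetical, and the sample listing is made up so the example runs offline.

```python
import pandas as pd


def posts_to_frame(listing, fields=("title", "score", "author", "subreddit")):
    """Build a DataFrame from a Reddit listing using plain loops only."""
    rows = []
    for child in listing["data"]["children"]:  # one entry per post
        post = child["data"]
        row = {}
        for field in fields:
            row[field] = post.get(field)       # None if a field is missing
        rows.append(row)
    return pd.DataFrame(rows, columns=list(fields))


# Made-up listing with the same nesting as a Reddit .json response
sample = {"data": {"children": [
    {"kind": "t3", "data": {"title": "First", "score": 10, "author": "a", "subreddit": "x"}},
    {"kind": "t3", "data": {"title": "Second", "score": 7, "author": "b", "subreddit": "y"}},
]}}
print(posts_to_frame(sample))

# With the real endpoint (network access required):
# import requests
# response = requests.get(url, headers=headers)
# dataframe = posts_to_frame(response.json())
```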
I tried to download specific data as part of my work;
the data is located at this link.
The source indicates how to download it through the GET method, but when I make my request:
import requests
import pandas as pd
url="https://estadisticas.bcrp.gob.pe/estadisticas/series/api/PN01210PM/csv/2015-01/2019-01"
r=pd.to_csv(url)
it doesn't read the data as it should (open the link in a browser to compare).
When I try
s=requests.get(url,verify=False) # you can set verify=True
df=pd.DataFrame(s)
the data isn't right either.
What else can I do? It is supposed to download the data as CSV, saving me from having to clean it.
To get the content as CSV, you can replace all HTML line breaks with newline characters.
Please let me know if this works for you:
import requests
import pandas as pd
from io import StringIO
url = "https://estadisticas.bcrp.gob.pe/estadisticas/series/api/PN01210PM/csv/2015-01/2019-01"
content = requests.get(url,verify=False).text.replace("<br>","\n").strip()
csv = StringIO(content)
r = pd.read_csv(csv)
print(r)
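The same replacement can be tried offline on a small string. The sample body below is made up and only imitates a response whose CSV rows are separated by br tags; the column names are placeholders.

```python
from io import StringIO

import pandas as pd

# Made-up response body: CSV rows separated by <br> instead of newlines
raw = "month,value<br>2015-01,1.5<br>2015-02,2.0<br>"

# Turn every <br> into a newline, then drop the trailing blank line
content = raw.replace("<br>", "\n").strip()

df = pd.read_csv(StringIO(content))
print(df)
```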
Is it possible to create a dataframe from JSON formatted as text, not as Python object?
With Python object, I could for example do:
from pandas.io.json import json_normalize
import requests
response = requests.get(url, params).json()
df = json_normalize(response)
but I want to achieve the same with response = requests.get(url,params).text (flattening is not required though).
If your response = requests.get(url,params).text is guaranteed to give you a valid JSON string, then all you need to do is as follows:
import json
import pandas as pd
import requests
response = requests.get(url, params).text
df = pd.json_normalize(json.loads(response)) # json_normalize is available as pd.json_normalize since pandas 1.0
Here we use loads from the standard-library json module to convert the JSON string to a Python object before passing it to json_normalize.
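I usually create a DataFrame from JSON text using "read_json":
import pandas as pd
import requests
from io import StringIO
data = requests.get(url, params).text
df = pd.read_json(StringIO(data)) # wrap the text in StringIO; pandas 2.1+ deprecates passing literal JSON strings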
df.head()
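Both routes can be compared on a small JSON string without any network call. The text below is a made-up stand-in for requests.get(url, params).text:

```python
import json
from io import StringIO

import pandas as pd

# Made-up JSON text standing in for the body of an HTTP response
text = '[{"id": 1, "name": "a"}, {"id": 2, "name": "b"}]'

df_loads = pd.json_normalize(json.loads(text))  # parse first, then normalize
df_read = pd.read_json(StringIO(text))          # let pandas parse the text itself

print(df_loads)
print(df_read)
```

json_normalize is the better fit when the records contain nested objects that should be flattened into columns; read_json is enough for flat records like these.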