web scraping using pandas - python

I want to scrape multiple pages of a website using Python, but I'm getting a "Remote Connection closed" error.
Here is my code:
import pandas as pd
url_link = 'https://www.taneps.go.tz/epps/viewAllAwardedContracts.do?d-3998960-p={}&selectedItem=viewAllAwardedContracts.do'
LIST = []
for number in range(1,5379):
    url = url_link.format(number)
    dframe = pd.read_html(url, header=None)[0]
    LIST.append(dframe)
Result_df = pd.concat(LIST)
Result_df.to_csv('Taneps_contracts.csv')
Any idea how to solve it?

For me, just using requests to fetch the HTML before passing it to read_html gets the data. I just edited your code to:
import pandas as pd
import requests

url_link = 'https://www.taneps.go.tz/epps/viewAllAwardedContracts.do?d-3998960-p={}&selectedItem=viewAllAwardedContracts.do'
LIST = []
for number in range(1,5379):
    url = url_link.format(number)
    r = requests.get(url)  # getting page -> html in r.text
    dframe = pd.read_html(r.text, header=None)[0]
    LIST.append(dframe)
Result_df = pd.concat(LIST)
Result_df.to_csv('Taneps_contracts.csv')
I didn't even have to add headers, but if this isn't enough for you (i.e., if the program breaks or if you don't end up with 53770+ rows), try adding convincing headers or using something like HTMLSession instead of directly calling requests.get...
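If headers do turn out to be necessary, a minimal sketch of the request would look something like this; the User-Agent string below is just an example of a browser-like header, not something this particular site is known to require:

import pandas as pd
import requests

# example browser-like header; the exact string is an assumption, not a site requirement
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64)'}

url_link = 'https://www.taneps.go.tz/epps/viewAllAwardedContracts.do?d-3998960-p={}&selectedItem=viewAllAwardedContracts.do'
r = requests.get(url_link.format(1), headers=headers)  # first page, as an example
dframe = pd.read_html(r.text, header=None)[0]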

Related

list index out of range - beautiful soup

New to Python: below is the code I am using to pull a zip file from a website, but I am getting the error "list index out of range". I was given this code by someone else who wrote it, but I had to change the URL, and now I am getting the error. When I print(list_of_documents), it is blank.
Can someone help me with this? The URL requires access, so you won't be able to run this code directly. I am trying to understand how Beautiful Soup is used here and how I can get the list to populate correctly.
import datetime
import requests
import csv
from zipfile import ZipFile as zf
import os
import pandas as pd
import time
from bs4 import BeautifulSoup
import pyodbc
import re
#set download location
downloads_folder = r"C:\Scripts"  # note: a raw string literal cannot end with a backslash
##### Creating outage dataframe
#Get list of download links
res = requests.get('https://www.ercot.com/mp/data-products/data-product-details?id=NP3-233-CD')
ercot_soup = BeautifulSoup(res.text, "lxml")
list_of_documents = ercot_soup.findAll('td', attrs={'class': 'labelOptional_ind'})
list_of_links = ercot_soup.select('a')
##create the url for the download
loc = str(list_of_links[0])[9:len(str(list_of_links[0]))-9]
link = 'http://www.ercot.com' + loc
link = link.replace('amp;','')
# Define file name and set download path
file_name = str(list_of_documents[0])[30:len(str(list_of_documents[0]))-5]
file_path = downloads_folder + '/' + file_name
You can't expect code tailored to scrape one website to work for a different link! You should always inspect and explore your target site, especially the parts you need to scrape, so you know the tag names [like td and a here] and identifying attributes [like name, id, class, etc.] of the elements you need to extract data from.
With this site, the info you want from the reportTable is generated by JavaScript after the page loads, so it won't show up in the requests response. You could either try something like Selenium, or you could try retrieving the data from its source.
If you inspect the site and look at the network tab, you'll find the request that actually retrieves the data for the table; and if you inspect the table's HTML, you'll find the scripts that generate the data just above it.
In the suggested solution below, getReqUrl scrapes your link to get the URL for requesting the reports (and also the template of the URL for downloading the documents).
def getReqUrl(scrapeUrl):
    res = requests.get(scrapeUrl)
    ercot_soup = BeautifulSoup(res.text, "html.parser")
    script = [l.split('"') for l in [
        s for s in ercot_soup.select('script')
        if 'reportListUrl' in s.text
        and 'reportTypeID' in s.text
    ][0].text.split('\n') if l.count('"') == 2]
    rtID = [l[1] for l in script if 'reportTypeID' in l[0]][0]
    rlUrl = [l[1] for l in script if 'reportListUrl' in l[0]][0]
    rdUrl = [l[1] for l in script if 'reportDownloadUrl' in l[0]][0]
    return f'{rlUrl}{rtID}&_={int(time.time())}', rdUrl
(I couldn't figure out how to scrape the last query parameter [the &_=... part] from the site exactly, but {int(time.time())} seems to get close enough; the results are the same even when that last bit is omitted entirely, so it's optional.)
The url returned can be used to request the documents:
import json

url = 'https://www.ercot.com/mp/data-products/data-product-details?id=NP3-233-CD'
reqUrl, ddUrl = getReqUrl(url)
reqRes = requests.get(reqUrl).text
rsJson = json.loads(reqRes)
for doc in rsJson['ListDocsByRptTypeRes']['DocumentList']:
    d = doc['Document']
    downloadLink = ddUrl + d['DocID']
    # print(f"{d['FriendlyName']} {d['PublishDate']} {downloadLink}")
    print(f"Download '{d['ConstructedName']}' at\n\t {downloadLink}")
print(len(rsJson['ListDocsByRptTypeRes']['DocumentList']))
The printed output lists each document's name and its download link.
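If you also want to download the files themselves (the original goal here), a minimal sketch continuing from the snippet above would look like the following; it assumes a plain GET on the constructed link returns the file content and that ConstructedName works as a file name, so check both against what the API actually returns:

# hedged sketch: download the first document in the list to the downloads folder
first = rsJson['ListDocsByRptTypeRes']['DocumentList'][0]['Document']
resp = requests.get(ddUrl + first['DocID'])  # assumes the link serves the file directly
with open(downloads_folder + '/' + first['ConstructedName'], 'wb') as f:
    f.write(resp.content)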

What to do when Python requests.get gets a browser error from the website?

I'm trying to read in a table from a website, but when I do this, I am getting a result from the website that says: "It appears your browser may be outdated. For the best website experience, we recommend updating your browser."
I am able to use requests.get on the Stats portion of this same PGA website without issue, but for some reason the way these historical results tables are displayed is causing issues. Interestingly, the site lets you select different years for the displayed table, but doing so doesn't change the address at all, so I suspect the table is rendered in a way that read_html can't handle. Any other suggestions? Code below.
import pandas as pd
import requests
farmers_url = 'https://www.pgatour.com/tournaments/farmers-insurance-open/past-results.html'
farmers = pd.read_html(requests.get(farmers_url).text, header=0)[0]
farmers.head()
I see a request to the following file for the content you want; this is an additional request the browser would make from your start URL. What you are currently getting is the content of the table at the requested URL before any of the updates that would happen dynamically in a browser.
import requests
import pandas as pd
headers = {'User-Agent': 'Mozilla/5.0'}
r = requests.get('https://www.pgatour.com/tournaments/farmers-insurance-open/past-results/jcr:content/mainParsys/pastresults.selectedYear.2021.004.html', headers=headers).text
pd.read_html(r)
If you want to tidy it up to look like the actual webpage, then apply something like the following transformations and cleaning:
import requests
import pandas as pd
headers = {'User-Agent': 'Mozilla/5.0'}
r = requests.get('https://www.pgatour.com/tournaments/farmers-insurance-open/past-results/jcr:content/mainParsys/pastresults.selectedYear.2021.004.html', headers=headers).text
t = pd.read_html(r)[0]
t.reset_index()
t.columns = [':'.join([i[0], i[1]]) if 'ROUNDS' in i else i[0] for i in t.columns]
t.POS = t.POS.map(lambda x: x.split(' ')[-1])
round_columns = [i for i in t.columns if 'ROUNDS' in i]
t[round_columns] = t[round_columns].applymap(lambda x: x.split(' ')[0])
t.drop('TO PAR', inplace = True, axis = 1)
t.rename(columns={"TOTALSCORE": "TOTAL SCORE", "OFFICIALMONEY": "OFFICIAL MONEY", "FEDEXCUPPOINTS":"FEDEX CUP POINTS"}, inplace = True)

Send a Table scraped by requests to an Excel Workbook/sheet

Beginner here,
I scraped a table using requests from nhl.com, and I'd like to send it to Excel.
import requests
url = 'https://api.nhle.com/stats/rest/en/team/powerplay?isAggregate=false&isGame=false&sort=%5B%7B%22property%22:%22powerPlayPct%22,%22direction%22:%22DESC%22%7D,%7B%22property%22:%22teamId%22,%22direction%22:%22ASC%22%7D%5D&start=0&limit=50&factCayenneExp=gamesPlayed%3E=1&cayenneExp=gameTypeId=2%20and%20homeRoad=%22H%22%20and%20seasonId%3C=20212022%20and%20seasonId%3E=20212022'
data = requests.get(url).json()
for i in data['data']:
    print('{:<30} {:.1f}'.format(i['teamFullName'], i['powerPlayPct']*100))
I used requests instead of pandas because of the dynamic way nhl.com renders its tables, but that means I don't end up with a dataframe (as I would with pandas) that I could send to Excel with df.to_excel.
How could I do that?
Try using pd.json_normalize and pass the record_path parameter as 'data'
import requests
import pandas as pd
url = 'https://api.nhle.com/stats/rest/en/team/powerplay?isAggregate=false&isGame=false&sort=%5B%7B%22property%22:%22powerPlayPct%22,%22direction%22:%22DESC%22%7D,%7B%22property%22:%22teamId%22,%22direction%22:%22ASC%22%7D%5D&start=0&limit=50&factCayenneExp=gamesPlayed%3E=1&cayenneExp=gameTypeId=2%20and%20homeRoad=%22H%22%20and%20seasonId%3C=20212022%20and%20seasonId%3E=20212022'
data = requests.get(url).json()
df = pd.json_normalize(data, record_path='data')
# Do whatever math you want here
df.to_excel('nhl_data.xlsx', index=False)
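If you only need the two fields you were printing, a minimal follow-up sketch (assuming the API keeps returning the teamFullName and powerPlayPct columns; the output file name is just an example) could be:

# keep only the two columns from the original print loop before exporting
df['powerPlayPct'] = df['powerPlayPct'] * 100
df[['teamFullName', 'powerPlayPct']].to_excel('nhl_powerplay.xlsx', index=False)

Note that writing .xlsx files with to_excel requires an Excel engine such as openpyxl to be installed.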

requests_html render scrolldown, script not working

I need to crawl data from a website where the data is loaded by scrolling down.
The website returns 5 items before scrolling down, and I expect 80 items to be returned after scrolling down is done.
I'm using the requests_html module and tried this:
from requests_html import HTML, HTMLSession
keyword = '유산균'
n = 1
url = f'https://search.shopping.naver.com/search/all?frm=NVSHATC&origQuery={keyword}&pagingIndex={n}&pagingSize=80&productSet=total&query={keyword}&sort=rel&timestamp=&viewType=list'
session = HTMLSession()
ses = session.get(url)
html = HTML(html=ses.text)
item_list = html.find('div.basicList_title__3P9Q7')
print(len(item_list))
ses.html.render(scrolldown=100, sleep=.1)
'''
ses.html.render(script="window.scrollTo(0, 99999)", sleep=10)
also tried this; it didn't work either
'''
print(len(item_list))
I expected 5 and 80 as the results, but both prints returned the same value: 5 and 5.
What is wrong with my code?
If you monitor the network activity while loading the site, you'll see that it loads the search results from an API. This means you can retrieve the data directly from the API without scraping. Here is an example that loads the first page as a pandas dataframe:
import requests
import pandas as pd
keyword = '유산균'
n = 1
r = requests.get(f'https://search.shopping.naver.com/api/search/all?sort=rel&pagingIndex={n}&pagingSize=80&viewType=list&productSet=total&deliveryFee=&deliveryTypeValue=&frm=NVSHATC&query={keyword}&origQuery={keyword}').json()
df = pd.DataFrame(r['shoppingResult']['products'])
You can add a loop to retrieve next pages, etc.
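A minimal sketch of such a loop might look like the following; the stopping condition (an empty products list marking the last page) is an assumption about the API's behaviour, not something confirmed here:

import requests
import pandas as pd

keyword = '유산균'
frames = []
for n in range(1, 6):  # first five pages, as an example
    r = requests.get(
        'https://search.shopping.naver.com/api/search/all?sort=rel'
        f'&pagingIndex={n}&pagingSize=80&viewType=list&productSet=total'
        f'&deliveryFee=&deliveryTypeValue=&frm=NVSHATC&query={keyword}&origQuery={keyword}'
    ).json()
    products = r['shoppingResult']['products']
    if not products:  # assumption: an empty list marks the last page
        break
    frames.append(pd.DataFrame(products))

df = pd.concat(frames, ignore_index=True)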

Pull data from a website with multiple tabs

I'm trying to pull data from a website which is dynamically updated (every few hours or so). It is the website of a transport service and it has a few pages/tabs.
All I've managed so far is to pull the first page, no matter what I try, so I can't pull the data from the other tabs.
The code:
from bs4 import BeautifulSoup, SoupStrainer
import requests
import pandas as pd
# For establishing connection
proxies = {'http': 'http:...'}
url = 'http://yit.maya-tour.co.il/yit-pass/Drop_Report.aspx?client_code=2660&coordinator_code=2669'
page = requests.get(url, proxies=proxies)
data = page.text
soup = BeautifulSoup(data, "lxml")
for link in soup.find_all('a'):
    print(link.get('href'))
html = requests.get(url, proxies=proxies).text
df_list = pd.read_html(html)
df = df_list[1]
df.to_csv('my data.csv')
I also tried doing it by parsing the HTML source code, but only got the first page as well. Any ideas?
You should extract the first page's hyperlinks and use them in your code! (If there are no hyperlinks, put the other URLs in the loop like below.)
import pandas as pd

df_list = []
# call each page here; I assume you have the page number at the end of the main url
for p in range(1, n):  # n = number of pages + 1
    url = 'http://yit.maya-tour.co.il/yit-pass/Drop_Report.aspx?client_code=2660&coordinator_code=2669&pNumber=%d' % p
    df_list.append(pd.read_html(url)[0])

df = pd.concat(df_list)
print(df)
df.to_csv('my data.csv')
Every 15 seconds the webpage calls the JavaScript code below:
function __doPostBack(eventTarget, eventArgument) {
    if (!theForm.onsubmit || (theForm.onsubmit() != false)) {
        theForm.__EVENTTARGET.value = eventTarget;
        theForm.__EVENTARGUMENT.value = eventArgument;
        theForm.submit();
    }
}
Open your browser dev tools and put a breakpoint in this function. Once you understand the arguments submitted by the code, use requests (or another HTTP client) to submit the form from your Python code.
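A minimal sketch of that approach, assuming the page is a standard ASP.NET form whose hidden fields (__VIEWSTATE, __EVENTVALIDATION, ...) must be echoed back, might look like this; the __EVENTTARGET and __EVENTARGUMENT values are placeholders for whatever you observe at the breakpoint:

import requests
from bs4 import BeautifulSoup

url = 'http://yit.maya-tour.co.il/yit-pass/Drop_Report.aspx?client_code=2660&coordinator_code=2669'
session = requests.Session()
soup = BeautifulSoup(session.get(url).text, 'lxml')

# copy every named hidden field into the form payload
payload = {i['name']: i.get('value', '') for i in soup.select('input[type=hidden]') if i.get('name')}
payload['__EVENTTARGET'] = 'EVENT_TARGET_FROM_BREAKPOINT'  # placeholder - use the value seen in dev tools
payload['__EVENTARGUMENT'] = ''                            # placeholder - use the value seen in dev tools

resp = session.post(url, data=payload)
# if the postback worked, resp.text should contain the next tab's table for pd.read_html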
