How do I limit the rate of a scraper? - python

So I am trying to create a table by scraping hundreds of similar pages at a time and then saving them into the same Excel table, with something like this:
# let urls be a list of hundreds of different URLs
def save_table(urls):
    <define columns and parameters of the dataframe to be saved, df>
    writer = pd.ExcelWriter(<address>, engine='xlsxwriter')
    for i in range(0, len(urls)):
        # here, return_html_soup is the function returning the html soup of any individual URL
        soup = return_html_soup(urls[i])
        temp_table = some_function(soup)
        df = df.append(temp_table, ignore_index=True)
    # I chose to_excel instead of to_csv here because there are certain letters on the
    # original website that don't show up in a CSV
    df.to_excel(writer, sheet_name=<some name>)
    writer.save()
    writer.close()
I now hit HTTP Error 429: Too Many Requests, without any Retry-After header.
Is there a way for me to get around this? I know this error happens because I've asked to scrape too many pages in too short an interval. Is there a way to limit the rate at which my code opens links?

The official Python documentation is the best place to start:
https://docs.python.org/3/library/time.html#time.sleep
Here is an example using a 5-second delay. You can adjust it according to what you need and the restrictions you have.
import time

# let urls be a list of hundreds of different URLs
def save_table(urls):
    <define columns and parameters of the dataframe to be saved, df>
    writer = pd.ExcelWriter(<address>, engine='xlsxwriter')
    for i in range(0, len(urls)):
        # here, return_html_soup is the function returning the html soup of any individual URL
        soup = return_html_soup(urls[i])
        temp_table = some_function(soup)
        df = df.append(temp_table, ignore_index=True)
        # new code to wait for some time
        time.sleep(5)
    # I chose to_excel instead of to_csv here because there are certain letters on the
    # original website that don't show up in a CSV
    df.to_excel(writer, sheet_name=<some name>)
    writer.save()
    writer.close()
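Since the server returns a 429 without a Retry-After header, a fixed delay may still occasionally trip the limit. Below is a minimal sketch of a fetch helper with exponential backoff; fetch_soup is a hypothetical name for something you could use in place of your return_html_soup:

import time
import requests
from bs4 import BeautifulSoup

def fetch_soup(url, max_retries=5, base_delay=5):
    # hypothetical replacement for return_html_soup: retry with
    # exponential backoff whenever the server answers 429
    for attempt in range(max_retries):
        response = requests.get(url)
        if response.status_code != 429:
            response.raise_for_status()
            return BeautifulSoup(response.text, 'html.parser')
        # no Retry-After header to honor, so back off 5s, 10s, 20s, ...
        time.sleep(base_delay * (2 ** attempt))
    raise RuntimeError("still rate-limited after %d retries: %s" % (max_retries, url))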

The way I did it: scrape the whole element once, then parse it by header, tag, or name. I used bs4 with robin_stocks for market data; it runs every 10 minutes or so and works fine, specifically the get-element-by-name functionality. Or just use a time delay from the time library, as sketched below.
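For the simple time-delay approach, here is a sketch of a loop that scrapes on a fixed 10-minute cadence; scrape_market_data is a hypothetical stand-in for your own scraping code:

import time

def scrape_market_data():
    # hypothetical stand-in for your own bs4/robin_stocks scraping code
    pass

while True:
    scrape_market_data()
    time.sleep(600)  # wait 10 minutes between runs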

Related

Scraping Public Tableau

I am collecting Indian renewable energy data, and I am stuck on gathering it.
From the public Tableau view, it seems that I can select the year and the tab.
But when I try to find the parameter or filter values to reach the yearly value, the results return nothing.
Here is my Python code. I select only the worksheet "Capacity by EnergySource".
(There are three worksheets in this data; I select only one for reference.)
But ultimately, I want to scrape the yearly data for all worksheets in each tab.
from tableauscraper import TableauScraper as TS
url = "https://public.tableau.com/views/RenewableCapacity/CapacitybySource"
ts = TS()
ts.loads(url)
workbook = ts.getWorkbook()
ws = ts.getWorksheet("Capacity by EnergySource")
print(ws.data)
ws.data.to_csv('table.csv',index=False)
parameters = workbook.getParameters()
print(parameters)
filters = ws.getFilters()
print(filters)
I am following each step of this thread and learning new ways to download data from public Tableau:
https://github.com/bertrandmartel/tableau-scraping
Could anybody help with this issue?
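One avenue, sketched below, is the setParameter API shown in the linked tableau-scraping repo. The parameter name "Year" and the value "2020" are hypothetical and would need to be confirmed against the output of getParameters():

from tableauscraper import TableauScraper as TS

ts = TS()
ts.loads("https://public.tableau.com/views/RenewableCapacity/CapacitybySource")
workbook = ts.getWorkbook()

# inspect the available parameters first; the names below are assumptions
for param in workbook.getParameters():
    print(param)

# select a year via a workbook parameter ("Year" is a hypothetical name)
wb = workbook.setParameter("Year", "2020")
ws = wb.getWorksheet("Capacity by EnergySource")
print(ws.data)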

Populating an Excel File Using an API to track Card Prices in Python

I'm a novice when it comes to Python, and in order to learn it, I was working on a side project. My goal is to track the prices of my Yu-Gi-Oh! cards using the yugiohprices API: https://yugiohprices.docs.apiary.io/#
I am attempting to manually enter the print tag for each card and then have the API pull the data and populate the spreadsheet, such as the name of the card and its trait, in addition to the price data, so that anytime I run the code it is updated.
My idea was to use a for loop to have the API look up each print tag, store the information in an empty dictionary, and then post the results to the Excel file. I added an example of the spreadsheet.
Please let me know if I can clarify further. Any suggestions to the code that would help me achieve the goal of this project would be appreciated. Thanks in advance.
import requests
import response as rsp
import urllib3
import urlopen
import json
import pandas as pd

df = pd.read_excel("api_ygo.xlsx")
print(df[:5])  # see the first 5 rows

response = requests.get('http://yugiohprices.com/api/price_for_print_tag/print_tag')
print(response.json())

data = []
for i in df:
    print_tag = i[2]
    request = requests.get('http://yugiohprices.com/api/price_for_print_tag/print_tag' + print_tag)
    data.append(print_tag)
print(data)

def jprint(obj):
    text = json.dumps(obj, sort_keys=True, indent=4)
    print(text)

jprint(response.json())
[Image: example spreadsheet]
Iterating over a pandas dataframe can be done using df.apply(). This has the added advantage that you can store the results directly in your dataframe.
First define a function that returns the desired result. Then apply the relevant column to that function while assigning the output to a new column:
import requests
import pandas as pd
import time

df = pd.DataFrame(['EP1-EN002', 'LED6-EN007', 'DRL2-EN041'], columns=['print_tag'])  # just dummy data, in your case this is pd.read_excel

def get_tag(print_tag):
    request = requests.get('http://yugiohprices.com/api/price_for_print_tag/' + print_tag)  # this url works, the one in your code wasn't correct
    time.sleep(1)  # sleep for a second to prevent sending too many API calls per minute
    return request.json()

df['result'] = df['print_tag'].apply(get_tag)
You can now export this column to a list of dictionaries with df['result'].tolist(). Or even better, you can flatten the results into a new dataframe with pd.json_normalize:
df2 = pd.json_normalize(df['result'])
df2.to_excel('output.xlsx') # save dataframe as new excel file
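If you want to keep each print tag next to its flattened price data, one option (a sketch; the exact columns depend on what the API actually returns) is to concatenate the tag column back in before exporting:

import pandas as pd

# keep the original tag alongside the flattened API fields
df2 = pd.concat([df[['print_tag']], pd.json_normalize(df['result'])], axis=1)
df2.to_excel('output.xlsx', index=False)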

Is it possible to extract a specific table with its format from a PDF?

I am trying to extract a specific table from a PDF; the PDF looks like the image below.
I tried different libraries in Python.
With tabula-py
from tabula import read_pdf
from tabulate import tabulate
df = read_pdf("./tmp/pdf/Food Calories List.pdf")
df
With PyPDF2
import PyPDF2
import pandas as pd

pdf_file = open("./tmp/pdf/Food Calories List.pdf", 'rb')
read_pdf = PyPDF2.PdfFileReader(pdf_file)
number_of_pages = read_pdf.getNumPages()
page = read_pdf.getPage(0)
page_content = page.extractText()
data = page_content
df = pd.DataFrame([x.split(';') for x in data.split('\n')])
Even with textract and Beautiful Soup, the issue I am facing is that the output format is a mess. Is there any way to extract this table with a better format?
I suspect the issues stem from the fact that the table has merged cells (on the left), and reading data from a table only works when the rows and cells are consistent, rather than some merged and some not.
I'd skip over the first two columns and then recreate/populate them on the left-hand side once you have the table loaded (as a pandas dataframe, for example), as in the sketch below.
Then you can have one label per row and work with the data consistently; otherwise your cells per column will be inconsistently numbered.
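A sketch of that recreate-the-labels step in pandas, assuming the merged cells arrive as blanks that should repeat downward (the dummy frame here stands in for whatever tabula returned):

import pandas as pd

# dummy stand-in for the extracted table: merged labels arrive only on the
# first row of each block, with blanks below
df = pd.DataFrame({'food': ['Dairy', '', '', 'Meat', ''],
                   'item': ['Milk', 'Cheese', 'Yogurt', 'Beef', 'Pork']})

# turn blanks into NA and forward-fill so every row carries its label
df['food'] = df['food'].replace('', pd.NA).ffill()
print(df)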
I would look into using Tabula templates, which you can dynamically generate based on word locations on the page. This will give Tabula more guidance on which area to consider and lead to more accurate extraction. See tabula.read_pdf_with_template as documented here: https://tabula-py.readthedocs.io/en/latest/tabula.html#tabula.io.read_pdf_with_template.
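A minimal sketch of the template approach, assuming you have exported a template JSON from the Tabula desktop app (the template path here is hypothetical):

import tabula

# template JSON exported from the Tabula desktop app (hypothetical path)
dfs = tabula.read_pdf_with_template(
    "./tmp/pdf/Food Calories List.pdf",
    "./tmp/pdf/food_calories_template.json",
)
print(dfs[0])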
Camelot is another Python library to try. Its advanced settings suggest that it can handle merged cells, although this will likely require adjusting certain settings such as copy_text and shift_text.
Note: Camelot can only read text-based tables. If the table is inside an image, it won't be able to extract it.
If the above is not an issue, try the sample code below:
import camelot
tables = camelot.read_pdf('./tmp/pdf/Food Calories List.pdf', pages='1', copy_text=['v'])
print(tables[0].df)

Need help creating a loop that will go through row by row in Excel

I'm a beginner at Python, and I have been trying my hand at some projects. I have an Excel spreadsheet that contains a column of URLs that I want to open, pull some data from, output to a different column on my spreadsheet, and then go down to the next URL and repeat.
I was able to write code that completes almost the entire process if I enter a single URL, but I struggle with creating loops.
My list is only 10 cells long.
My question is: what code can I use that will loop through the column until it hits a stopping point?
import urllib.request, csv, pandas as pd
from openpyxl import load_workbook
xl = pd.ExcelFile("filename.xlsx")
ws = xl.parse("Sheet1")
i = 0 # This is where I insert the row number for a specific URL
urlpage = str(ws['URLPage'][i]) # 'URLPage' is the name of the column in Excel
p = urlpage.replace(" ", "") # This line is for deleting whitespace in my URL
response = urllib.request.urlopen(p)
Also as stated, I'm newer at Python, so if you see where I can improve the code I already have, please let me know.
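A sketch of one way to wrap that code in a loop, assuming the URLPage column may contain empty cells marking the end of the list and that the pulled data should land in a new column:

import urllib.request
import pandas as pd

xl = pd.ExcelFile("filename.xlsx")
ws = xl.parse("Sheet1")

results = []
for url in ws['URLPage']:
    if pd.isna(url):  # stop at the first empty cell
        break
    url = str(url).replace(" ", "")  # strip whitespace as in the original code
    with urllib.request.urlopen(url) as response:
        # placeholder: replace with whatever parsing pulls your actual data
        results.append(response.read().decode('utf-8', 'replace'))

# write the pulled data into a new column and save a copy
ws['Result'] = pd.Series(results)
ws.to_excel("filename_out.xlsx", index=False)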

Python: Issue with rapidly reading and writing excel files after web scraping? Works for a bit then weird issues come up

So I developed a script that pulls data from a live-updated site tracking coronavirus data. I set it up to pull data every 30 minutes but recently tested it with updates every 30 seconds.
The idea is that it creates the request to the site, pulls the HTML, creates a list of all of the data I need, then restructures it into a dataframe (basically the country, the cases, deaths, etc.).
Then it takes each row and appends it to the rows of each of the 123 Excel files for the various countries. This works well for, I believe, somewhere in the range of 30-50 iterations before it either causes file corruption or weird data entries.
I have my code below. I know it's poorly written (my initial reasoning was that I felt confident I could set it up quickly and I wanted to collect data quickly; unfortunately I overestimated my abilities, but now I want to learn what went wrong). Below my code I'll include sample output.
PLEASE note that the 30-second interval is only for quick testing; I don't normally send that many requests for months on end. I just wanted to see what the issue was. Originally it was set to pull every 30 minutes when I detected this issue.
See below for the code:
import schedule
import time

def RecurringProcess2():
    import requests
    from bs4 import BeautifulSoup
    import pandas as pd
    import datetime
    import numpy as np
    from os import listdir
    import os
    try:
        extractTime = datetime.datetime.now()
        extractTime = str(extractTime)
        print("Access Initiated at " + extractTime)
        link = 'https://www.worldometers.info/coronavirus/'
        response = requests.get(link)
        soup = BeautifulSoup(response.text, 'html.parser').findAll('td')  # [1107].get_text()
        table = pd.DataFrame(columns=['Date and Time','Country','Total Cases','New Cases','Total Deaths','New Deaths','Total Recovered','Active Cases','Serious Critical','Total Cases/1M pop'])
        soupList = []
        for i in range(1107):
            value = soup[i].get_text()
            soupList.insert(i, value)
        table = np.reshape(soupList, (123, -1))
        table = pd.DataFrame(table)
        table.columns = ['Country','Total Cases','New Cases (+)','Total Deaths','New Deaths (+)','Total Recovered','Active Cases','Serious Critical','Total Cases/1M pop']
        table['Date & Time'] = extractTime
        # Below code is run once to generate the initial files. That's it.
        # for i in range(122):
        #     fileName = table.iloc[i,0] + '.xlsx'
        #     table.iloc[i:i+1,:].to_excel(fileName)
        FilesDirectory = 'D:\\Professional\\Coronavirus'
        fileType = '.csv'
        filenames = listdir(FilesDirectory)
        DataFiles = [filename for filename in filenames if filename.endswith(fileType)]
        for file in DataFiles:
            countryData = pd.read_csv(file, index_col=0)
            MatchedCountry = table.loc[table['Country'] == str(file)[:-4]]
            if file == ' USA .csv':
                print("Country Data Rows: ", len(countryData))
                if os.stat(file).st_size < 1500:
                    print("File Size under 1500")
            countryData = countryData.append(MatchedCountry)
            countryData.to_csv(FilesDirectory + '\\' + file, index=False)
    except:
        pass
    print("Process Complete!")
    return

schedule.every(30).seconds.do(RecurringProcess2)

while True:
    schedule.run_pending()
    time.sleep(1)
When I check the files after some number of iterations (usually successful for about 30-50), they have either kept only 2 rows and lost all the others, or they keep appending while deleting a single entry in the row above, while two rows above loses 2 entries, etc. (essentially forming a triangle of sorts).
Above that, the files contain a few hundred empty rows. Does anyone have an idea of what is going wrong here? I'd consider this a failed attempt but would still like to learn from it. I appreciate any help in advance.
Hi, as per my understanding the webpage only has one table element. My suggestion would be to use pandas' read_html method, as it provides a clean and structured table.
Try the code below; you can modify it to run on a schedule:
import requests
import pandas as pd
url = 'https://www.worldometers.info/coronavirus/'
html = requests.get(url).content
df_list = pd.read_html(html)
df = df_list[-1]
print(df)
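Building on that, here is a sketch of how the per-country CSVs could be appended without the read-modify-write cycle that seems to be corrupting them: appending in 'a' mode writes each new row once instead of rewriting the whole file. The country-name file naming and the choice of the last table on the page are assumptions:

import os
import time
import pandas as pd
import requests
import schedule

def pull_and_append():
    html = requests.get('https://www.worldometers.info/coronavirus/').content
    table = pd.read_html(html)[-1]  # assume the last table is the country table
    table['Date & Time'] = pd.Timestamp.now()
    for _, row in table.iterrows():
        fname = str(row.iloc[0]).strip() + '.csv'  # one CSV per country (assumed naming)
        # append a single row; write the header only when the file doesn't exist yet
        row.to_frame().T.to_csv(fname, mode='a', header=not os.path.exists(fname), index=False)

schedule.every(30).minutes.do(pull_and_append)
while True:
    schedule.run_pending()
    time.sleep(1)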
Disclaimer: I'm still evaluating this solution. So far it works almost perfectly for 77 rows.
Originally I had set the script up to run for .xlsx files. I converted everything to .csv but retained the index column code:
countryData = pd.read_csv(file,index_col=0)
I started realizing that things were being ordered differently every time the script ran. I have since removed that from the code and so far it works. Almost.
Unnamed: 0 Unnamed: 0.1
0 7
7
For some reason I have the above output in every file. I don't know why, but it's in the first two columns, and yet the data still seems to be read and written correctly. Not sure what's going on here.
