Web scraping - Inconsistent results due to request timeouts - Python

I am scraping some public data from a website using Python 3.6.
I created a long list of page URLs I need to scrape (10k+).
I parse each one, produce a list with all the relevant information, and then append this to a comprehensive list.
I was getting some request timeout errors, so I tried to handle them using try/except.
The code runs without apparent errors, but re-running it gives very inconsistent results: the length of the final list changes substantially, and I can prove that not all the pages have been parsed.
So my code stops at some point and I cannot tell where.
The timed_out variable is always zero at the end, no matter how long the produced list is.
Any help appreciated!
Best
Here is what I believe is the relevant part of the code
import requests
from bs4 import BeautifulSoup

headers = {'user-agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/53.0.2785.143 Safari/537.36'}

LIST_OF_URLS = ['URL', 'URL', 'URL']
FINAL_LIST = []
timed_out = 0

for URL in LIST_OF_URLS:
    try:
        result_page = BeautifulSoup(requests.get(URL, headers=headers, timeout=10).text, 'html.parser')
    except requests.exceptions.Timeout:
        timed_out += 1
    # The loop produces a LIST
    FINAL_LIST.append(LIST)
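One thing worth noting from the snippet above: only requests.exceptions.Timeout is caught, so any other request failure (connection errors, HTTP error pages, etc.) is not handled and could stop the loop or skew the results. A minimal sketch with broader exception handling and per-URL failure logging, assuming failed pages should simply be skipped and counted, might look like this (the LIST-building step is elided here just as in the original):

import requests
from bs4 import BeautifulSoup

headers = {'user-agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/53.0.2785.143 Safari/537.36'}

LIST_OF_URLS = ['URL', 'URL', 'URL']
FINAL_LIST = []
failed = []

for URL in LIST_OF_URLS:
    try:
        response = requests.get(URL, headers=headers, timeout=10)
        response.raise_for_status()  # surface 4xx/5xx responses instead of parsing an error page
    except requests.exceptions.RequestException as exc:
        # catches Timeout, ConnectionError, HTTPError, etc., not only Timeout
        failed.append((URL, repr(exc)))
        continue  # skip this page; there is nothing to parse or append
    result_page = BeautifulSoup(response.text, 'html.parser')
    # ... build LIST from result_page here, then:
    # FINAL_LIST.append(LIST)

print(len(FINAL_LIST), 'pages parsed,', len(failed), 'failed')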

Related

How to loop through URLs hosted in a Google Sheet

It's been a step-by-step process getting the code to this point; the goal was to visit a list of URLs and scrape specific data. This has been accomplished with the script below:
import requests
from bs4 import BeautifulSoup as bs
import json

headers = {
    'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/104.0.5112.79 Safari/537.36'
}

urls = ['https://www.nba.com/game/bkn-vs-phi-0022100993',
        'https://www.nba.com/game/was-vs-lac-0022100992']

for url in urls:
    r = requests.get(url, headers=headers)
    soup = bs(r.text, 'html.parser')
    page_obj = soup.select_one('script#__NEXT_DATA__')
    json_obj = json.loads(page_obj.text)
    print('Title:', json_obj['props']['pageProps']['story']['header']['headline'])
    print('Date:', json_obj['props']['pageProps']['story']['date'])
    print('Content:', json_obj['props']['pageProps']['story']['content'])
I had an idea I hoped to implement -- I feel I'm very close but I'm not sure why it's not running. Basically, rather than having the static list of URLs, I wanted to use a Google Sheet as the source of URLs. Meaning, a column on the first tab will have the URL list that needs to be scraped.
From there, when run, the script will pull the URLs from the first tab, the data will get scraped, and the info will be pushed to the second tab.
I've been able to print the URLs in the terminal with the code above - basically, getting to the source and requesting all records.
I thought I'd then be able to loop through those links in the same way (new code):
from unittest import skip
import requests
from bs4 import BeautifulSoup as bs
import json
import gspread

gc = gspread.service_account(filename='creds.json')
sh = gc.open_by_key('1NFrhsJT7T0zm3dRaP5J8OY0FryBHy5W_wEEGvwBg58I')
worksheet = sh.sheet1
freshurls = gc.open("NBA Stories").get_worksheet(1)

headers = {
    'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/104.0.5112.79 Safari/537.36'
}

urls = freshurls.get_all_records()

for url in urls:
    try:
        r = requests.get(url, headers=headers)
        soup = bs(r.text, 'html.parser')
        page_obj = soup.select_one('script#__NEXT_DATA__')
        json_obj = json.loads(page_obj.text)
        title = (json_obj['props']['pageProps']['story']['header']['headline'])
        date = (json_obj['props']['pageProps']['story']['date'])
        content = str(json_obj['props']['pageProps']['story']['content'])
        AddData = [url, title, date, content]
        worksheet.append_row(AddData)
    except:
        skip
Even if I switch the ending actions (AddData & append rows) to just print the results, I'm not seeing anything.
Seems like I'm missing a step? Is there something I could do differently here to leverage those URLs right from the sheet, instead of having to paste them in the script every time?
SUGGESTION
You can try using the batch_get method in a separate script file to get the URL data from a sheet tab, and then import that URL data into your scraping script file to use in your loop. This reduces complexity and improves the readability of your script. For more context, see the sample script below.
In my understanding, here is your goal:
Put a list of URLs on a specific sheet tab in a spreadsheet file.
Get the URL data from that sheet tab in Python.
Loop through it in your Python script and scrape the data per URL.
Append each scraped result to a second sheet tab.
Sample Script
The getURLsFromSpreadsheet.py file
import gspread
gc = gspread.service_account(filename='creds.json')
# Open a spreadsheet by ID
sh = gc.open_by_key('1XXXXXXXXXXXXXXXXXXXXXXXXXXXXXX')
# Get the sheets
wk = sh.worksheet("Sheet1")
apprendWk = sh.worksheet("Sheet2")
# E.G. the URLs are listed on Sheet 1 on Column A
urls = wk.batch_get(('A2:A',) )[0]
The scrapeScript.py file
from getURLsFromSpreadsheet import *
import requests
from bs4 import BeautifulSoup as bs
import json
headers = {
    'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/104.0.5112.79 Safari/537.36'
}

for url in urls:
    r = requests.get(url[0], headers=headers)
    soup = bs(r.text, 'html.parser')
    page_obj = soup.select_one('script#__NEXT_DATA__')
    json_obj = json.loads(page_obj.text)
    samplelist = [[str(json_obj['props']['pageProps']['story']['header']['headline']),
                   str(json_obj['props']['pageProps']['story']['date']),
                   str(json_obj['props']['pageProps']['story']['content'])[2:-1]
                  ]]
    apprendWk.append_rows(samplelist)
Demonstration
(Screenshots: the sample spreadsheet file with the URLs listed in Column A, and the Sheet 2 tab after running the scrapeScript.py file.)
Reference
GSpread Samples
Python – Call function from another file
According to the gspread documentation, it seems that get_all_records() returns a list of dictionaries. (Ref) Under this condition, when for url in urls: is run, each url is a dictionary like {"header1": "value1", ...}. I thought that this might be the reason for your issue.
Unfortunately, I couldn't tell from your question which column the URLs are in. For example, when column "A" has the URLs you want to use, how about the following modification?
From:
urls = freshurls.get_all_records()
for url in urls:
To:
column = 1  # This means column "A".
urls = freshurls.get_all_values()
for e in urls[1:]:
    url = e[column - 1]
    # print(url)  # You can check the URL.
In this modification, the values are retrieved from column "A" using get_all_values, and the first header row is skipped. get_all_values returns the values as a 2-dimensional array.
If your actual sheet uses a different column than column "A", please modify the script accordingly.
Reference:
get_all_values(**kwargs)
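As a rough illustration of how that modification slots into the rest of the question's loop, a minimal sketch might look like the following. It assumes the URLs sit in column "A" of the second tab of the "NBA Stories" spreadsheet and that the first tab receives the scraped rows, as in the question:

import json

import gspread
import requests
from bs4 import BeautifulSoup as bs

gc = gspread.service_account(filename='creds.json')
worksheet = gc.open("NBA Stories").sheet1            # tab that receives the scraped rows
freshurls = gc.open("NBA Stories").get_worksheet(1)  # tab that holds the URLs

headers = {'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/104.0.5112.79 Safari/537.36'}

column = 1  # URLs assumed to be in column "A"
rows = freshurls.get_all_values()  # 2-dimensional list: one inner list per sheet row

for row in rows[1:]:  # skip the header row
    url = row[column - 1]
    r = requests.get(url, headers=headers)
    soup = bs(r.text, 'html.parser')
    json_obj = json.loads(soup.select_one('script#__NEXT_DATA__').text)
    story = json_obj['props']['pageProps']['story']
    worksheet.append_row([url, story['header']['headline'], story['date'], str(story['content'])])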

API instead of BeautifulSoup?

This document defines some URLs and IPs of MS services:
https://learn.microsoft.com/en-us/microsoft-365/enterprise/urls-and-ip-address-ranges?view=o365-worldwide#exchange-online
My goal is to write a Python script that checks the last updated date of this document.
If the date changes (meaning that some IPs changed), I need to know immediately. I couldn't find any API for this, so I wrote this script:
from bs4 import BeautifulSoup
import requests
import time
import re

url = "https://learn.microsoft.com/en-us/microsoft-365/enterprise/urls-and-ip-address-ranges?view=o365-worldwide#exchange-online"
# set the headers as a browser
headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36'}

while True:
    response = requests.get(url, headers=headers)
    soup = BeautifulSoup(response.text, "html.parser")
    last_update_fra = soup.find(string=re.compile("01/04/2021"))
    time.sleep(60)
    soup = BeautifulSoup(requests.get(url, headers=headers).text, "html.parser")
    if soup.find(string=re.compile("01/04/2021")) == last_update_fra:
        print(last_update_fra)
        continue
    else:
        # send an email for notification
        pass
I'm not sure if this is the best way to do it, since if the date changes, I would also need to update my script to look for the new date.
In addition, is it OK to do this with BeautifulSoup, or is there another, better way?
BeautifulSoup is fine here. I don't even see an XHR request with the data there anyway.
A couple of things I noted:
Do you really want/need it checked every minute? Maybe better every day/24 hours, or every 12 or 6 hours?
If at any point it crashes, i.e. you lose your internet connection or get a 400 response, you'll need to restart the script and lose whatever the last date was. So maybe consider a) writing the date to disk somewhere so it's not just stored in memory, and b) putting some try/excepts in there so that if it does encounter an error, it'll keep running and just try again at the next interval (or however you decide it should retry); a sketch of both is shown after the code below.
Code:
import requests
import time
from bs4 import BeautifulSoup

url = 'https://learn.microsoft.com/en-us/microsoft-365/enterprise/urls-and-ip-address-ranges?view=o365-worldwide#exchange-online'
# set the headers as a browser
headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36'}

last_update_fra = ''

while True:
    time.sleep(60)
    response = requests.get(url, headers=headers)  # pass the browser headers defined above
    soup = BeautifulSoup(response.text, 'html.parser')
    found_date = soup.find('time').text
    if found_date == last_update_fra:
        print(last_update_fra)
        continue
    else:
        # store new date
        last_update_fra = found_date
        # send an email for notification
        pass
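As a follow-up to points a) and b) above, here is a minimal sketch of what persisting the date to disk and wrapping the request in try/except could look like. The last_date.txt filename and the 6-hour interval are just illustrative assumptions:

import time
from pathlib import Path

import requests
from bs4 import BeautifulSoup

url = 'https://learn.microsoft.com/en-us/microsoft-365/enterprise/urls-and-ip-address-ranges?view=o365-worldwide#exchange-online'
headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36'}

state_file = Path('last_date.txt')  # hypothetical file used to survive restarts
last_update_fra = state_file.read_text().strip() if state_file.exists() else ''

while True:
    try:
        response = requests.get(url, headers=headers, timeout=30)
        response.raise_for_status()
        soup = BeautifulSoup(response.text, 'html.parser')
        found_date = soup.find('time').text
        if found_date != last_update_fra:
            last_update_fra = found_date
            state_file.write_text(found_date)  # persist so a restart keeps the last seen date
            # send an email for notification here
    except (requests.RequestException, AttributeError):
        # network error, bad status code, or the <time> tag was not found:
        # skip this round and try again at the next interval
        pass
    time.sleep(6 * 60 * 60)  # check every 6 hours instead of every minute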

How do I send an embed message that contains multiple links parsed from a website to a webhook?

I want my embed message to look like this, but mine only returns one link.
Here's my code:
import requests
from bs4 import BeautifulSoup
from discord_webhook import DiscordWebhook, DiscordEmbed

url = 'https://www.solebox.com/Footwear/Basketball/Lebron-X-JE-Icon-QS-variant.html'
headers = {'user-agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/77.0.3865.120 Safari/537.36'}

r = requests.get(url, headers=headers)
soup = BeautifulSoup(r.content, "lxml")

for tag in soup.find_all('a', class_="selectSize"):
    # There's multiple 'id' resulting in more than one link
    aid = tag.get('id')
    # There's also multiple sizes
    size = tag.get('data-size-us')
    # These are the links that need to be shown in the embed message
    product_links = "https://www.solebox.com/{0}".format(aid)

webhook = DiscordWebhook(url='WebhookURL')
embed = DiscordEmbed(title='Title')
embed.set_author(name='Brand')
embed.set_thumbnail(url="Image")
embed.set_footer(text='Footer')
embed.set_timestamp()
embed.add_embed_field(name='Sizes', value='US{0}'.format(size))
embed.add_embed_field(name='Links', value='[Links]({0})'.format(product_links))
webhook.add_embed(embed)
webhook.execute()
This will most likely get you the results you want. product_links is a string, meaning that every iteration of your for loop just overwrites the product_links variable with a new string. If you declare a list before the loop and append each link to that list, it will most likely result in what you wanted.
Note: I had to use a different URL from that site. The one specified in the question was no longer available. I also had to use a different header, as the one the asker put up continuously fed me a 403 error.
Additional note: the URLs that are returned via your code logic lead nowhere. I feel that you'll need to work that one through, since I don't know exactly what you're trying to do; however, I feel this answers the question of why you were only getting one link.
import requests
from bs4 import BeautifulSoup

url = 'https://www.solebox.com/Footwear/Basketball/Air-Force-1-07-PRM-variant-2.html'
headers = {"User-Agent": "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2228.0 Safari/537.3"}

r = requests.get(url=url, headers=headers)
soup = BeautifulSoup(r.content, "lxml")

product_links = []  # Create our product links list
for tag in soup.find_all('a', class_="selectSize"):
    # There's multiple 'id' resulting in more than one link
    aid = tag.get('id')
    # There's also multiple sizes
    size = tag.get('data-size-us')
    # These are the links that need to be shown in the embed message
    product_links.append("https://www.solebox.com/{0}".format(aid))
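To then get all of those links into a single embed message, one option is to join the collected values after the loop. Here is a rough, untested sketch; the example size/link pairs are placeholders, and Discord's field length limits are not handled:

from discord_webhook import DiscordWebhook, DiscordEmbed

# In the loop above, collect (size, link) pairs instead of a single string,
# e.g. sizes_and_links.append((size, "https://www.solebox.com/{0}".format(aid)))
sizes_and_links = [("8", "https://www.solebox.com/example-id-1"),    # placeholder data
                   ("9.5", "https://www.solebox.com/example-id-2")]  # placeholder data

webhook = DiscordWebhook(url='WebhookURL')
embed = DiscordEmbed(title='Title')
embed.add_embed_field(name='Sizes', value='\n'.join('US{0}'.format(s) for s, _ in sizes_and_links))
embed.add_embed_field(name='Links', value='\n'.join('[US{0}]({1})'.format(s, l) for s, l in sizes_and_links))
webhook.add_embed(embed)
webhook.execute()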

BeautifulSoup and Requests Module NoneType Error

I've been experimenting with the requests and bs4 modules for a couple of days now. I wanted to make a simple program similar to Google's 'I'm Feeling Lucky'.
Here's my code:
import requests, bs4, webbrowser

source = requests.get('https://www.google.com/search?q=facebook').text
exsoup = bs4.BeautifulSoup(source, 'lxml')

# <cite class="iUh30">https://www.facebook.com/</cite>
match = exsoup.find('cite', class_='iUh30')
print(match.text)
But when I run this I get the following error:
print(match.text)
AttributeError: 'NoneType' object has no attribute 'text'
How can I make this work?
Try iterating over something like this, excluding the class_ attribute:
match = exsoup.find_all('cite')
for i in match:
    if 'http' in i.text:
        print(i.text)
The issue seems to be that you are getting different results when visiting the site with a browser than when you visit using the requests library. You could try specifying a header (I took this example from the following: https://stackoverflow.com/a/27652558/9742036):
headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36'}
source = requests.get('https://www.google.com/search?q=facebook', headers=headers).text
and the source code should look more like your browser visit.
Otherwise, your code works fine. You're just getting no results in the original request, so you should code to handle that case (for example by using the iteration suggestion in the other answer).
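A minimal sketch of that None handling, combining the header above with the fallback iteration from the other answer (just an illustration, not tested against current Google markup):

import requests
import bs4

headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36'}
source = requests.get('https://www.google.com/search?q=facebook', headers=headers).text
exsoup = bs4.BeautifulSoup(source, 'lxml')

match = exsoup.find('cite', class_='iUh30')
if match is not None:
    print(match.text)
else:
    # Fall back to scanning all <cite> tags, as suggested in the other answer
    for cite in exsoup.find_all('cite'):
        if 'http' in cite.text:
            print(cite.text)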

get empty list when parsing instagram following with beautifulsoup

I'm trying to make a program to get the list of accounts a user is following on Instagram. Here's the code:
import urllib.request
from bs4 import BeautifulSoup
import requests

def get_html(url):
    headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36'}
    response = requests.get(url, headers=headers)
    return response.text

def parse(html):
    soup = BeautifulSoup(html, "html.parser")
    fol = soup.find_all('a', class_='_2g7d5')
    print(fol)

parse(get_html('https://www.instagram.com/any_user/following/'))
But I get an empty list as a result. The code works correctly when parsing any other website. What's wrong?
P.S. the class has a really weird name
This wouldn't work because you need to be a valid, authorized Instagram user to access that page. In your code, there is no authentication happening, so from Instagram's point of view you're just a ghost trying to access their data, and they don't let you in. Google for some third-party Python wrappers around Instagram's API.
The first Google search led me here.
Also, you can't scrape websites like that; to be honest, it is illegal. You need to use their developer API with a valid token.
