Web scraping of hyperlinks is very slow - Python

I am using the following function to scrape the Twitter URLs from a list of websites.
import httplib2
import bs4 as bs
from bs4 import BeautifulSoup, SoupStrainer
from urllib.parse import urlparse
import pandas as pd
import swifter
def twitter_url(website):  # the website address is passed to the function as a string
    try:
        http = httplib2.Http()
        status, response = http.request('https://' + website)
        url = 'https://twitter.com'
        search_domain = urlparse(url).hostname
        l = []
        for link in bs.BeautifulSoup(response, 'html.parser',
                                     parse_only=SoupStrainer('a')):
            if link.has_attr('href'):
                if search_domain in link['href']:
                    l.append(link['href'])
        return list(set(l))
    except ConnectionRefusedError:
        pass
I then apply the function to the dataframe, which contains the website addresses:
df['twitter_id'] = df.swifter.apply(lambda x: twitter_url(x['Website address']), axis=1)
The dataframe has about 100,000 website addresses. Even when I run the code on a 10,000-row sample, it is very slow. Is there any way to make this faster?

The issue is almost certainly the time taken to retrieve the HTML for each website.
Since the URLs are processed one after the other, even at 100 ms per request the 10,000 samples would still take about 1,000 s (~16 minutes) to finish.
If you instead process each URL in a separate thread, that should significantly cut down the time taken.
You can check out the threading library to accomplish that.
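As a minimal sketch of that idea (not from the original answer), concurrent.futures.ThreadPoolExecutor gives you a pool of threads without managing them by hand; the worker count below is an arbitrary assumption to tune against your bandwidth and the target sites:
from concurrent.futures import ThreadPoolExecutor

# Run twitter_url concurrently over the website column instead of row by row.
# executor.map preserves input order, so the results line up with the dataframe.
websites = df['Website address'].tolist()
with ThreadPoolExecutor(max_workers=32) as executor:
    results = list(executor.map(twitter_url, websites))
df['twitter_id'] = results
Because the work is I/O-bound, threads help even under the GIL; for 100,000 URLs you may also want a per-request timeout so one slow site doesn't stall a worker.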

Related

Recursive crawling with BeautifulSoup really slow

I'm building a crawler that downloads all .pdf files of a given website and its subpages. I've built it around the simplified recursive function below, which retrieves all links of a given URL.
However, the longer it crawls a given website, the slower it becomes (it may take 2 minutes or more per URL).
I can't quite figure out what's causing this and would really appreciate suggestions on what to change in order to increase the speed.
import re
import requests
from bs4 import BeautifulSoup
pages = set()

def get_links(page_url):
    global pages
    pattern = re.compile("^(/)")
    html = requests.get(f"https://www.srs-stahl.de/{page_url}").text
    soup = BeautifulSoup(html, "html.parser")
    for link in soup.find_all("a", href=pattern):
        if "href" in link.attrs:
            if link.attrs["href"] not in pages:
                new_page = link.attrs["href"]
                print(new_page)
                pages.add(new_page)
                get_links(new_page)

get_links("")
It is not that easy to figure out what exactly slows down your crawling - it may be the way you crawl, the website's server, ...
In your code, you request a URL, grab its links, and immediately recurse into each new link, so the crawl runs depth-first and you only ever track URLs that have already been requested.
You may want to work with "queues" to keep the process more transparent.
One advantage is that if the script aborts, the queues are still stored, so you can resume from the URLs you have already collected but not yet visited. Your recursion, by contrast, may have to restart from an earlier point to make sure it gets all URLs.
Another point: you request the PDF files but never use the response in any way. Wouldn't it make more sense to either download and save them directly, or to skip that request and at least keep the links in a separate "queue" for post-processing?
Collected information in comparison (same number of iterations):
Code in question:
pages --> 24
Example code (without delay):
urlsVisited --> 24
urlsToVisit --> 87
urlsToDownload --> 67
Example
Just to demonstrate - feel free to add functions, classes, and structure to suit your needs. Note that I added some delay, but you can skip it if you like. The "queues" used to demonstrate the process are plain lists, but they should be files, a database, ... to store your data safely.
import requests, time
from bs4 import BeautifulSoup

baseUrl = 'https://www.srs-stahl.de'

urlsToDownload = []
urlsToVisit = ["https://www.srs-stahl.de/"]
urlsVisited = []

def crawl(url):
    html = requests.get(url).text
    soup = BeautifulSoup(html, "html.parser")

    for a in soup.select('a[href^="/"]'):
        url = f"{baseUrl}{a['href']}"

        if '.pdf' in url and url not in urlsToDownload:
            urlsToDownload.append(url)
        else:
            if url not in urlsToVisit and url not in urlsVisited:
                urlsToVisit.append(url)

while urlsToVisit:
    url = urlsToVisit.pop(0)
    try:
        crawl(url)
    except Exception as e:
        print(f'Failed to crawl: {url} -> error {e}')
    finally:
        urlsVisited.append(url)
    time.sleep(2)
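As a follow-up sketch (not part of the answer above), the urlsToDownload queue collected by the crawl could then be post-processed, which is what the answer hints at; the target folder name and filename scheme are assumptions:
import os
import requests

# Download everything gathered in urlsToDownload once the crawl has finished.
os.makedirs('pdfs', exist_ok=True)
for pdf_url in urlsToDownload:
    filename = os.path.join('pdfs', pdf_url.rsplit('/', 1)[-1])
    try:
        with open(filename, 'wb') as f:
            f.write(requests.get(pdf_url).content)
    except Exception as e:
        print(f'Failed to download: {pdf_url} -> error {e}')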

Web Scraping Stock Ticker Price from Yahoo Finance using BeautifulSoup

I'm trying to scrape Gold stock ticker from Yahoo! Finance.
from bs4 import BeautifulSoup
import requests, lxml
response = requests.get('https://finance.yahoo.com/quote/GC=F?p=GC=F')
soup = BeautifulSoup(response.text, 'lxml')
gold_price = soup.findAll("div", class_='My(6px) Pos(r) smartphone_Mt(6px)')[2].find_all('p').text
Whenever I run this it returns: list index out of range.
When I do print(len(ssoup)) it returns 4.
Any ideas?
Thank you.
You can make a direct request to the Yahoo server. To locate the query URL, open the Network tab via Dev tools (F12) -> Fetch/XHR -> find the name spark?symbols= (refresh the page if you don't see it), pick the needed symbol, and inspect the response on the Preview tab that opens on the right.
You can make direct requests to any of these endpoints as long as the request method is GET; POST requests are much more complicated.
You only need the json and requests libraries; there is no need for bs4. Note that making a lot of such requests might get your IP blocked or rate-limited, or you might stop getting responses altogether, because their system can detect bot-like traffic (a regular user would not hit the server repeatedly like this), so you need to figure out how to work around that.
Update:
There's possibly a hard limit on how many requests can be made in a given period of time.
Code example (the full JSON response contains far more data):
import requests, json
response = requests.get('https://query1.finance.yahoo.com/v7/finance/spark?symbols=GC%3DF&range=1d&interval=5m&indicators=close&includeTimestamps=false&includePrePost=false&corsDomain=finance.yahoo.com&.tsrc=finance').text
data_1 = json.loads(response)
gold_price = data_1['spark']['result'][0]['response'][0]['meta']['previousClose']
print(gold_price)
# 1830.8
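If you do run into blocking, a minimal, hypothetical mitigation (not something Yahoo documents, and the helper name is made up) is to send a browser-like User-Agent and back off between retries; the header value, retry count, and backoff delays below are all assumptions:
import time
import requests

def fetch_previous_close(symbol, retries=3):
    # Same spark endpoint as above, with the query string built from params.
    url = 'https://query1.finance.yahoo.com/v7/finance/spark'
    params = {'symbols': symbol, 'range': '1d', 'interval': '5m', 'indicators': 'close'}
    headers = {'User-Agent': 'Mozilla/5.0'}  # avoid the default python-requests UA
    for attempt in range(retries):
        resp = requests.get(url, params=params, headers=headers)
        if resp.status_code == 200:
            data = resp.json()
            return data['spark']['result'][0]['response'][0]['meta']['previousClose']
        time.sleep(2 ** attempt)  # simple exponential backoff between attempts
    resp.raise_for_status()

print(fetch_previous_close('GC=F'))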
P.S. I have written a blog post about scraping the Yahoo! Finance home page, which may also be relevant.

Web scraping lazy list (lazy loading) using python request (without selenium/scarpy)

As practice, I have written a simple script to find out who has bought the same tracks as me on Bandcamp, ideally to find accounts with similar tastes and discover more music through them.
The problem is that the fan list on an album/track page is lazy-loaded. Using Python's requests and bs4 I only get 60 results out of a potential 700.
I am trying to figure out how to send a request, e.g. here https://pitp.bandcamp.com/album/fragments-distancing, to load more of the list. After finding what request is sent when I click "more" in the browser's network inspector, I tried sending that JSON request with requests, but without any result:
res = requests.get(track_link)
open_more = {"tralbum_type":"a","tralbum_id":3542956135,"token":"1:1609185066:1714678:0:1:0","count":100}
for i in range(0, 3):
    requests.post(track_link, json=open_more)
Will appreciate any help!
I think that just using a ridiculously large number for count will do. I also automated your script a bit, in case you want to get data on other albums:
from urllib.parse import urlsplit
import json
import requests
from bs4 import BeautifulSoup

# build the post link
get_link = "https://pitp.bandcamp.com/album/fragments-distancing"
link = urlsplit(get_link)
base_link = f'{link.scheme}://{link.netloc}'
post_link = f"{base_link}/api/tralbumcollectors/2/thumbs"

with requests.session() as s:
    res = s.get(get_link)
    soup = BeautifulSoup(res.text, 'lxml')

    # the data for tralbum_type and tralbum_id
    # are stored in a script tag attribute
    key = "data-band-follow-info"
    data = soup.select_one(f'script[{key}]')[key]
    data = json.loads(data)

    open_more = {
        "tralbum_type": data["tralbum_type"],
        "tralbum_id": data["tralbum_id"],
        "count": 1000}

    r = s.post(post_link, json=open_more).json()
    print(r['more_available'])  # if not false, use a bigger count
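A hedged follow-up to actually list the collectors: the answer doesn't show the JSON layout, so the 'results' key and per-fan fields below are assumptions; print(r.keys()) and inspect one entry to confirm them before relying on this.
# Key names here are assumptions about the response structure, not documented API.
for fan in r.get('results', []):
    print(fan.get('name'), fan.get('url'))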

How do I get the requests module in python to have a clock time?

I'm trying to scrape some data off of a website and print it to the console. However, I've run into an issue: the data I'm trying to scrape changes depending on the clock of the computer that's accessing the page. Right now the data just isn't showing up at all when I send a request (the rest of the page is, though). I think the issue is that the Python requests module isn't giving the page a clock time. If I'm right, is there a way to fix this, and if not, is there any other reason it might not be working?
My code so far if it helps:
from bs4 import BeautifulSoup
import time
import requests
while True:
    html_doc = requests.get('website removed for privacy').text
    soup = BeautifulSoup(html_doc, 'html.parser')
    print(soup.find(id='timeLeft').contents)
    time.sleep(1)

When I take html from a website using urllib2, the inner html is empty. Anyone know why?

I am working on a project and one of the steps includes getting a random word which I will use later. When I try to grab the random word, it gives me '<span id="result"></span>' but as you can see, there is no word inside.
Code:
import urllib2
from bs4 import BeautifulSoup
quote_page = 'http://watchout4snakes.com/wo4snakes/Random/RandomWord'
page = urllib2.urlopen(quote_page)
soup = BeautifulSoup(page, 'html.parser')
name_box = soup.find("span", {"id": "result"})
print name_box
name = name_box.text.strip()
print name
I am thinking that maybe it might need to wait for a word to appear, but I'm not sure how to do that.
This word is added to the page using JavaScript. We can verify this by looking at the actual HTML that is returned in the request and comparing it with what we see in the web browser DOM inspector. There are two options:
1. Use a library capable of executing JavaScript and giving you the resulting HTML.
2. Try a different approach that doesn't require JavaScript support.
For 1, we can use something like requests_html. This would look like:
from requests_html import HTMLSession
url = 'http://watchout4snakes.com/wo4snakes/Random/RandomWord'
session = HTMLSession()
r = session.get(url)
# Some sleep required since the default of 0.2 isn't long enough.
r.html.render(sleep=0.5)
print(r.html.find('#result', first=True).text)
For 2, if we look at the network requests that the page makes, we can see that it retrieves random words via a POST request to http://watchout4snakes.com/wo4snakes/Random/RandomWord. Making that request directly with a library like requests (which the standard library documentation itself recommends) looks like:
import requests
url = 'http://watchout4snakes.com/wo4snakes/Random/RandomWord'
print(requests.post(url).text)
So the way the site works is that it sends you the page with no word in the span, then fills it in later through JavaScript; that's why you get a span box with nothing inside.
However, since all you want is the word, I'd suggest a different method: rather than scraping it off the page, you can simply send a POST request to http://watchout4snakes.com/wo4snakes/Random/RandomWord with no body and receive the word in the response.
You're using Python 2 but in Python 3 (for example, so I can show this works) you can do:
>>> import requests
>>> r = requests.post('http://watchout4snakes.com/wo4snakes/Random/RandomWord')
>>> print(r.text)
doom
You can do something similar using urllib in Python 2 as well.
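For completeness, a rough Python 2 sketch of the same idea using urllib2 (passing a data argument, even an empty string, makes urlopen issue a POST rather than a GET):
import urllib2

url = 'http://watchout4snakes.com/wo4snakes/Random/RandomWord'
# An empty data string is enough to switch the request method to POST.
print urllib2.urlopen(url, data='').read()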
