Is this actually using threading to scrape for urls? - python

Thanks in advance for your help. I'm new to Python and trying to figure out how to use the threading module to scrape the NY Daily News site for URLs. I put the following together and the script is scraping, but it doesn't seem to be any faster than it was before, so I'm not sure the threading is actually happening. Can you let me know if it is? Can I add anything so that I can tell? And do you have any other tips about threading?
Thank you.
from bs4 import BeautifulSoup, SoupStrainer
import urllib2
import os
import io
import threading

def fetch_url():
    for i in xrange(15500, 6100, -1):
        page = urllib2.urlopen("http://www.nydailynews.com/search-results/search-results-7.113?kw=&tfq=&afq=&page={}&sortOrder=Relevance&selecturl=site&q=the&sfq=&dtfq=seven_years".format(i))
        soup = BeautifulSoup(page.read())
        snippet = soup.find_all('h2')
        for h2 in snippet:
            for link in h2.find_all('a'):
                logfile.write("http://www.nydailynews.com" + link.get('href') + "\n")
        print "finished another url from page {}".format(i)

with open("dailynewsurls.txt", 'a') as logfile:
    threads = threading.Thread(target=fetch_url())
    threads.start()

Below is a naive implementation (which will very quickly get you blacklisted from nydailynews.com):
def fetch_url(i, logfile):
    page = urllib2.urlopen("http://www.nydailynews.com/search-results/search-results-7.113?kw=&tfq=&afq=&page={}&sortOrder=Relevance&selecturl=site&q=the&sfq=&dtfq=seven_years".format(i))
    soup = BeautifulSoup(page.read())
    snippet = soup.find_all('h2')
    for h2 in snippet:
        for link in h2.find_all('a'):
            logfile.write("http://www.nydailynews.com" + link.get('href') + "\n")
    print "finished another url from page {}".format(i)

with open("dailynewsurls.txt", 'a') as logfile:
    threads = []
    for i in xrange(15500, 6100, -1):
        t = threading.Thread(target=fetch_url, args=(i, logfile))
        t.start()
        threads.append(t)
    for t in threads:
        t.join()
Note that fetch_url takes the number to substitute in the URL as an argument, and each possible value for that argument is started in its own, separate thread.
I would strongly suggest dividing the job into smaller batches, and running one batch at a time.
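A minimal sketch of that batching idea, reusing fetch_url from above; the batch size of 20 is an assumption you should tune, not something the site documents:

def run_batch(page_numbers, logfile):
    threads = []
    for i in page_numbers:
        t = threading.Thread(target=fetch_url, args=(i, logfile))
        t.start()
        threads.append(t)
    # wait for the whole batch to finish before starting the next one
    for t in threads:
        t.join()

with open("dailynewsurls.txt", 'a') as logfile:
    all_pages = range(15500, 6100, -1)   # a plain list in Python 2
    batch_size = 20                      # assumed; pick whatever the site tolerates
    for start in xrange(0, len(all_pages), batch_size):
        run_batch(all_pages[start:start + batch_size], logfile)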

No, you're not using threads. threads = threading.Thread(target=fetch_url()) calls fetch_url() in your main thread, waits for it to complete and passes its return value (None) to the threading.Thread constructor.
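In other words, the difference is whether you hand threading.Thread the function object or the result of calling it:

threads = threading.Thread(target=fetch_url())  # runs fetch_url() immediately, in the main thread
threads = threading.Thread(target=fetch_url)    # passes the function itself for the new thread to run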

Related

Recursive crawling with BeautifulSoup really slow

I'm building a crawler that downloads all .pdf files of a given website and its subpages. For this, I've built functionality around the simplified recursive function below, which retrieves all links of a given URL.
However, this becomes quite slow the longer it crawls a given website (it may take 2 minutes or longer per URL).
I can't quite figure out what's causing this and would really appreciate suggestions on what needs to be changed in order to increase the speed.
import re
import requests
from bs4 import BeautifulSoup

pages = set()

def get_links(page_url):
    global pages
    pattern = re.compile("^(/)")
    html = requests.get(f"https://www.srs-stahl.de/{page_url}").text
    soup = BeautifulSoup(html, "html.parser")
    for link in soup.find_all("a", href=pattern):
        if "href" in link.attrs:
            if link.attrs["href"] not in pages:
                new_page = link.attrs["href"]
                print(new_page)
                pages.add(new_page)
                get_links(new_page)

get_links("")
It is not that easy to figure out what is actively slowing down your crawling - it may be the way you crawl, the website's server, ...
In your code, you request a URL, grab its links, and call the function itself within the first iteration, so you only record URLs you have already requested.
You may want to work with "queues" to keep the process more transparent.
One advantage is that if the script aborts, you have this information stored and can use it to resume from the URLs you have already collected but not yet visited. Quite the opposite of your for loop, which may have to restart from an earlier point to ensure it gets all URLs.
Another point is that you request the PDF files without using the response in any way. Wouldn't it make more sense to either download and save them directly, or skip the request and at least keep the links in a separate "queue" for post-processing?
Collected information in comparison - based on iterations:

Code in question:
    pages --> 24

Example code (without delay):
    urlsVisited --> 24
    urlsToVisit --> 87
    urlsToDownload --> 67
Example

Just to demonstrate - feel free to create defs, classes, and structure to your needs. Note that I added some delay, but you can skip it if you like. The "queues" used to demonstrate the process are lists, but they should be files, a database, ... to store your data safely.
import requests, time
from bs4 import BeautifulSoup

baseUrl = 'https://www.srs-stahl.de'

urlsToDownload = []
urlsToVisit = ["https://www.srs-stahl.de/"]
urlsVisited = []

def crawl(url):
    html = requests.get(url).text
    soup = BeautifulSoup(html, "html.parser")
    for a in soup.select('a[href^="/"]'):
        url = f"{baseUrl}{a['href']}"
        if '.pdf' in url and url not in urlsToDownload:
            urlsToDownload.append(url)
        else:
            if url not in urlsToVisit and url not in urlsVisited:
                urlsToVisit.append(url)

while urlsToVisit:
    url = urlsToVisit.pop(0)
    try:
        crawl(url)
    except Exception as e:
        print(f'Failed to crawl: {url} -> error {e}')
    finally:
        urlsVisited.append(url)
    time.sleep(2)
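As a small follow-up sketch for the post-processing step mentioned above, downloading everything collected in urlsToDownload; saving under the last path segment of each URL is an assumption:

import requests

for url in urlsToDownload:
    filename = url.split('/')[-1]  # assumes the last path segment is a usable file name
    try:
        response = requests.get(url)
        with open(filename, 'wb') as f:
            f.write(response.content)
    except Exception as e:
        print(f'Failed to download: {url} -> error {e}')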

Simultaneous Requests in Selenium Python

I am learning web scraping for a hobby project (Using Selenium in python).
I'm trying to get the information on product listings. There are about 100 web pages, and I am sequentially loading each one, processing its data, then moving to the next.
This process takes over 5 minutes, with the major bottleneck being the loading time of each page.
As I am only "reading" from the pages (not interacting with any of them), I would like to know whether it is possible to send requests for all the pages together (as opposed to waiting for one page to load, then requesting the next) and process the data as it arrives.
PS: Please tell me if there are other solutions to reduce the loading time.
You could use the Python Requests Module and the built-in threading module to make it faster. So for example:
import threading
import requests

list_of_links = [
    # All your links here
]
threads = []
all_html = {
    # This will be a dictionary with the links as keys and the HTML of the links as values
}

def get_html(link):
    r = requests.get(link)
    all_html[link] = r.text

for link in list_of_links:
    thread = threading.Thread(target=get_html, args=(link,))
    thread.start()
    threads.append(thread)

for t in threads:
    t.join()

print(all_html)
print("DONE")

Python returning values from infinite loop thread

For my program I need to check a client on my local network, which has a Flask server running. This Flask server returns a number that can change.
To retrieve that value, I use the requests library and BeautifulSoup. I want to use the retrieved value in another part of my script (while continuously checking the other client). For this I thought I could use the threading module. The problem is, however, that the thread only returns its values when it's done with the loop, but the loop needs to be infinite. This is what I've got so far:
import threading
import requests
from bs4 import BeautifulSoup

def checkClient():
    while True:
        page = requests.get('http://192.168.1.25/8080')
        soup = BeautifulSoup(page.text, 'html.parser')
        value = soup.find('div', class_='valueDecibel')
        print(value)

t1 = threading.Thread(target=checkClient, name=checkClient)
t1.start()
Does anyone know how to return the printed values to another function here? Of course you can replace the requests.get url with some kind of API where the values change a lot.
You need a Queue and something listening on the queue:
import queue
import threading
import requests
from bs4 import BeautifulSoup

def checkClient(q):
    while True:
        page = requests.get('http://192.168.1.25/8080')
        soup = BeautifulSoup(page.text, 'html.parser')
        value = soup.find('div', class_='valueDecibel')
        q.put(value)

q = queue.Queue()
t1 = threading.Thread(target=checkClient, name=checkClient, args=(q,))
t1.start()

while True:
    value = q.get()
    print(value)
The Queue is thread-safe and allows you to pass values back and forth. In your case, they are only being sent from the thread to a receiver.
See: https://docs.python.org/3/library/queue.html
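A small follow-up sketch, continuing from the q defined above: if the rest of your script should not block on q.get(), you can poll the queue with get_nowait instead (do_other_work is a hypothetical placeholder):

import queue  # for the queue.Empty exception
import time

def do_other_work(latest_value):
    pass  # placeholder for whatever uses the value

latest = None
while True:
    try:
        latest = q.get_nowait()   # grab a new value if one is waiting
    except queue.Empty:
        pass                      # no new value yet; keep the previous one
    do_other_work(latest)
    time.sleep(0.1)               # avoid spinning at full speed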

How to gather specific information using urllib2 in python

As of now I have created a basic program in Python 2.7 using urllib2 and re that gathers the HTML code of a website and prints it out, as well as indexing a keyword. I would like to create a much more complex and dynamic program that could gather data from websites, such as sports or stock statistics, and aggregate it into lists that could then be used for analysis in something like an Excel document. I'm not asking for someone to literally write the code; I simply need help understanding how I should approach it: whether I require extra libraries, etc. Here is the current code. It is very simplistic as of now:
import urllib2
import re

y = 0
while(y == 0):
    x = str(raw_input("[[[Enter URL]]]"))
    keyword = str(raw_input("[[[Enter Keyword]]]"))
    wait = 0
    try:
        req = urllib2.Request(x)
        response = urllib2.urlopen(req)
        page_content = response.read()
        idall = [m.start() for m in re.finditer(keyword, page_content)]
        wait = raw_input("")
        print(idall)
        wait = raw_input("")
        print(page_content)
    except urllib2.HTTPError as e:
        print e.reason
You can use requests to handle the interaction with the website. Here is a link to it: http://docs.python-requests.org/en/latest/
You can then use BeautifulSoup to handle the HTML content. Here is a link to it: http://www.crummy.com/software/BeautifulSoup/bs4/doc/index.zh.html
They're easier to use than urllib2 and re.
Hope it helps.
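A minimal sketch of what that switch could look like, keeping the same keyword-indexing idea as the original code but searching the visible text instead of the raw HTML:

import requests
from bs4 import BeautifulSoup

url = raw_input("[[[Enter URL]]]")
keyword = raw_input("[[[Enter Keyword]]]")

response = requests.get(url)
soup = BeautifulSoup(response.text, "html.parser")

# Text of the page without the HTML tags
page_text = soup.get_text()

# Same keyword indexing as before, but against the visible text
positions = [i for i in range(len(page_text)) if page_text.startswith(keyword, i)]
print(positions)
print(page_text)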

Website Downloader using Python

I am trying to create a website downloader using Python. I have the code for:
Finding all URLs from a page
Downloading a given URL
What I have to do is recursively download a page, and if there is any other link on that page, I need to download it too. I tried combining the above two functions, but the recursion doesn't work.
The code is given below:
1)

from sgmllib import SGMLParser

class URLLister(SGMLParser):
    def reset(self):
        SGMLParser.reset(self)
        self.urls = []

    def start_a(self, attrs):
        href = [v for k, v in attrs if k == 'href']
        if href:
            self.urls.extend(href)

if __name__ == "__main__":
    import urllib
    wanted_url = raw_input("Enter the URL: ")
    usock = urllib.urlopen(wanted_url)
    parser = URLLister()
    parser.feed(usock.read())
    parser.close()
    usock.close()
    for url in parser.urls:
        download(url)
2) where the download(url) function is defined as follows:

def download(url):
    import urllib
    webFile = urllib.urlopen(url)
    localFile = open(url.split('/')[-1], 'w')
    localFile.write(webFile.read())
    webFile.close()
    localFile.close()

a = raw_input("Enter the URL")
download(a)
print "Done"
Kindly help me with how to combine these two pieces of code to "recursively" download the new links from a webpage that is being downloaded.
You may want to look into the Scrapy library.
It would make a task like this pretty trivial, and allow you to download multiple pages concurrently.
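A minimal Scrapy sketch of that idea (the spider name, start URL, and file-naming scheme are placeholders, not anything from the question):

import scrapy

class SiteDownloadSpider(scrapy.Spider):
    name = "site_downloader"
    start_urls = ["http://example.com/"]  # placeholder starting page

    def parse(self, response):
        # Save the current page under a name derived from its URL (assumption)
        filename = response.url.rstrip("/").split("/")[-1] or "index"
        with open(filename + ".html", "wb") as f:
            f.write(response.body)

        # Follow every link on the page; Scrapy handles deduplication
        # and concurrency for you
        for href in response.css("a::attr(href)").getall():
            yield response.follow(href, callback=self.parse)

# Run with: scrapy runspider thisfile.py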
done_url = []

def download(url):
    if url in done_url:
        return
    # ...download url code...
    done_url.append(url)
    urls = some_function_to_fetch_urls_from_this_page()
    for url in urls:
        download(url)
This is very rough code. For example, you will need to check whether the URL is within the domain you want to crawl or not. However, you asked for recursion.
Be mindful of the recursion depth.
There are just so many things wrong with my solution. :P
You should try a crawling library like Scrapy or something similar.
Generally, the idea is this:
def get_links_recursive(document, current_depth, max_depth):
    links = document.get_links()
    for link in links:
        downloaded = link.download()
        if current_depth < max_depth:
            get_links_recursive(downloaded, current_depth + 1, max_depth)
Call get_links_recursive(document, 0, 3) to get things started.
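A concrete sketch of that idea in the same Python 2 style as the question, using urllib and BeautifulSoup; download_page, the file-naming scheme, and the same-domain check are my additions, not part of the original answer:

import urllib
import urlparse
from bs4 import BeautifulSoup

def download_page(url):
    html = urllib.urlopen(url).read()
    filename = url.rstrip('/').split('/')[-1] or 'index'  # assumed naming scheme
    with open(filename, 'w') as f:
        f.write(html)
    return html

def get_links_recursive(url, current_depth, max_depth, seen):
    if url in seen:
        return
    seen.add(url)
    html = download_page(url)
    soup = BeautifulSoup(html, "html.parser")
    for a in soup.find_all('a', href=True):
        link = urlparse.urljoin(url, a['href'])
        # Stay inside the starting domain, as suggested above
        if urlparse.urlparse(link).netloc == urlparse.urlparse(url).netloc:
            if current_depth < max_depth:
                get_links_recursive(link, current_depth + 1, max_depth, seen)

get_links_recursive("http://example.com/", 0, 3, set())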
