I've written a script in python using proxies to scrape the links of different posts traversing different pages of a webpage. I've tried to make use of proxies from a list. The script is supposed to take random proxies from the list and send request to that website and finally parse the items. However, if any proxy is not working then it should be kicked out from the list.
I thought the way I've used number of proxies and list of urls within ThreadPool(10).starmap(make_requests, zip(proxyVault,lead_url)) is accurate but it doesn't produce any results; rather, the script gets stuck.
How can I pass the proxies and the links to the ThreadPool in order for the script to produce results?
import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin
from multiprocessing.pool import ThreadPool
from itertools import cycle
import random
base_url = 'https://stackoverflow.com/questions/tagged/web-scraping'
lead_url = ["https://stackoverflow.com/questions/tagged/web-scraping?sort=newest&page={}&pagesize=15".format(page) for page in range(1,6)]
proxyVault = ['104.248.159.145:8888', '113.53.83.252:54356', '206.189.236.200:80', '218.48.229.173:808', '119.15.90.38:60622', '186.250.176.156:42575']
def make_requests(proxyVault,lead_url):
while True:
random.shuffle(proxyVault)
global pitem
pitem = cycle(proxyVault)
proxy = {'https':'http://{}'.format(next(pitem))}
try:
res = requests.get(lead_url,proxies=proxy)
soup = BeautifulSoup(res.text,"lxml")
[get_title(proxy,urljoin(base_url,item.get("href"))) for item in soup.select(".summary .question-hyperlink")]
except Exception:
try:
proxyVault.pop(0)
make_requests(proxyVault,lead_url)
except Exception:pass
def get_title(proxy,itemlink):
res = requests.get(itemlink,proxies=proxy)
soup = BeautifulSoup(res.text,"lxml")
print(soup.select_one("h1[itemprop='name'] a").text)
if __name__ == '__main__':
ThreadPool(10).starmap(make_requests, zip(proxyVault,lead_url))
Btw, the proxies used above are just placeholders.
The problems with your code was that it was creating a lot of endless loops in the thread. Also they way you handled the proxies was a bit strange to me, so I changed it.
I also think you had misunderstood how data was sent to the threads, they get one one element of the iterable, not the whole thing. So I changed some names to reflect that.
The way it works now is that each thread gets their own url from lead_url, then they choose a random proxy from the proxyVault.
They fetch the webpage and parse it and calls get_title on each of the parsed links.
If the request fails because of the proxy, that proxy is removed from the list so its not used again and make_requests is called again, which will randomly choose a new proxy from the ones that are still available.
I did not change the actual parsing, because I can't judge if it's what you want or not.
Runnable code:
https://repl.it/#zlim00/unable-to-pass-proxies-and-links-to-the-threadpool-to-get-re
from bs4 import BeautifulSoup
from multiprocessing.pool import ThreadPool
from random import choice
import requests
from urllib.parse import urljoin
base_url = 'https://stackoverflow.com/questions/tagged/web-scraping'
lead_url = [f'https://stackoverflow.com/questions/tagged/web-scraping?sort='
f'newest&page={page}&pagesize=15' for page in range(1, 6)]
proxyVault = ['36.67.57.45:53367', '5.202.150.233:42895',
'85.187.184.129:8080', '109.195.23.223:45947']
def make_requests(url):
proxy_url = choice(proxyVault)
proxy = {'https': f'http://{proxy_url}'}
try:
res = requests.get(url, proxies=proxy)
soup = BeautifulSoup(res.text, "lxml")
[get_title(proxy, urljoin(base_url, item.get("href")))
for item in soup.select(".summary .question-hyperlink")]
except requests.exceptions.ProxyError:
# Check so that the bad proxy was not removed by another thread
if proxy_url in proxyVault:
proxyVault.remove(proxy_url)
print(f'Removed bad proxy: {proxy_url}')
return make_requests(url)
def get_title(proxy, itemlink):
res = requests.get(itemlink, proxies=proxy)
soup = BeautifulSoup(res.text, "lxml")
print(soup.select_one("h1[itemprop='name'] a").text)
if __name__ == '__main__':
ThreadPool(10).map(make_requests, lead_url)
Maybe you can use another approach to get proxies like this
def get_proxy():
url = 'https://free-proxy-list.net/anonymous-proxy.html'
response = requests.get(url)
soup = BeautifulSoup(response.text, 'lxml')
table = soup.find('table', attrs={'id': 'proxylisttable'})
table_body = table.find('tbody')
proxies = table_body.find_all('tr')
proxy_row = random.choice(proxies).find_all('td')
return proxy_row[0].text + ':' + proxy_row[1].text
Related
I've been trying to scrape Bandcamp fan pages to get a list of the albums they have purchased and I'm having trouble efficiently doing it. I wrote something with Selenium but it's mildly slow so I'd like to learn a solution that'd maybe send a POST request to the site and parse the JSON from there.
Here's a sample collection page: https://bandcamp.com/nhoward
Here's the Selenium code:
def scrapeFanCollection(url):
browser = getBrowser()
setattr(threadLocal, 'browser', browser)
#Go to url
browser.get(url)
try:
#Click show more button
browser.find_element_by_class_name('show-more').click()
#Wait two seconds
time.sleep(2)
#Scroll to the bottom loading full collection
scroll(browser, 2)
except Exception:
pass
#Return full album collection
soup_a = BeautifulSoup(browser.page_source, 'lxml', parse_only=SoupStrainer('a', {"class": "item-link"}))
#Empty array
urls = []
# Looping through all the a elements in the page source
for item in soup_a.find_all('a', {"class": "item-link"}):
url = item.get('href')
if(url != None):
urls.append(url)
return urls
The API can be accessed as follows:
$ curl -X POST -H "Content-Type: Application/JSON" -d \
'{"fan_id":82985,"older_than_token":"1586531374:1498564527:a::","count":10000}' \
https://bandcamp.com/api/fancollection/1/collection_items
I didn't encounter a scenario where a "older_than_token" was stale, so the problem boils down to getting the "fan_id" given a URL.
This information is located in a blob in the id="pagedata" element.
>>> import json
>>> import requests
>>> from bs4 import BeautifulSoup
>>> res = requests.get("https://www.bandcamp.com/ggorlen")
>>> soup = BeautifulSoup(res.text, "lxml")
>>> user = json.loads(soup.find(id="pagedata")["data-blob"])
>>> user["fan_data"]["fan_id"]
82985
Putting it all together (building upon this answer):
import json
import requests
from bs4 import BeautifulSoup
fan_page_url = "https://www.bandcamp.com/ggorlen"
collection_items_url = "https://bandcamp.com/api/fancollection/1/collection_items"
res = requests.get(fan_page_url)
soup = BeautifulSoup(res.text, "lxml")
user = json.loads(soup.find(id="pagedata")["data-blob"])
data = {
"fan_id": user["fan_data"]["fan_id"],
"older_than_token": user["wishlist_data"]["last_token"],
"count": 10000,
}
res = requests.post(collection_items_url, json=data)
collection = res.json()
for item in collection["items"][:10]:
print(item["album_title"], item["item_url"])
I'm using user["wishlist_data"]["last_token"] which has the same format as the "older_than_token" just in case this matters.
In order to get the entire collection i changed the previous code from
"older_than_token": user["wishlist_data"]["last_token"]
to
user["collection_data"]["last_token"]
which contained the right token
Unfortunately for you, this particular Bandcamp site doesn't seem to make any HTTP API call to fetch the list of albums. You can check that by using your browser developer tools, Network tab, click on XHR filter. The only call being made seems to be fetching your collection details.
I want to scrape a website and its sub-pages, but it is taking too long. How can I optimize the request or use an alternative solution?
Below is the code I am using. It takes 10s for just loading the Google home page. So it's clearly not scalable if I were to give it 280 links
from selenium import webdriver
import time
# prepare the option for the chrome driver
options = webdriver.ChromeOptions()
options.add_argument('headless')
# start chrome browser
browser = webdriver.Chrome("/usr/lib/chromium-browser/chromedriver" ,chrome_options=options)
start=time.time()
browser.get('http://www.google.com/xhtml')
print(time.time()-start)
browser.quit()
Use python requests and Beautiful soup module.
import requests
from bs4 import BeautifulSoup
url="https://tajinequiparle.com/dictionnaire-francais-arabe-marocain/"
url1="https://tajinequiparle.com/dictionnaire-francais-arabe-marocain/{}/"
req = requests.get(url,verify=False)
soup = BeautifulSoup(req.text, 'html.parser')
print("Letters : A")
print([item['href'] for item in soup.select('.columns-list a[href]')])
letters=['B','C','D','E','F','G','H','I','J','K','L','M','N','O','P','Q','R','S','T','U','V','W','X','Y','Z']
for letter in letters:
req = requests.get(url1.format(letter), verify=False)
soup = BeautifulSoup(req.text, 'html.parser')
print('Letters : ' + letter)
print([item['href'] for item in soup.select('.columns-list a[href]')])
you can use that script for the speed. multithread crawler better than all:
https://edmundmartin.com/multi-threaded-crawler-in-python/
After that you must change that code:
def run_scraper(self):
with open("francais-arabe-marocain.csv", 'a') as file:
file.write("url")
file.writelines("\n")
for i in range(50000):
try:
target_url = self.to_crawl.get(timeout=600)
if target_url not in self.scraped_pages and "francais-arabe-marocain" in target_url:
self.scraped_pages.add(target_url)
job = self.pool.submit(self.scrape_page, target_url)
job.add_done_callback(self.post_scrape_callback)
df = pd.DataFrame([{'url': target_url}])
df.to_csv(file, index=False, header=False)
print(target_url)
except Empty:
return
except Exception as e:
print(e)
continue
If url include "francais-arabe-marocain" save urls in a csv file.
After that you can scrape that urls in one for loop reading csv line by line with same way
try to use urllib just like this
import urllib.request
start=time.time()
page = urllib.request.urlopen("https://google.com/xhtml")
print(time.time()-start)
it took only 2s. However, it depends also on the quality of connection you have
My Threading works fine only when fetch() is not called from validate(). But in this scenario below it returns
RuntimeError: There is no current event loop in thread
in all the threads 0 - 99 what am I doing wrong here ?
from threading import Thread
import requests
from bs4 import BeautifulSoup
from requests_html import HTMLSession
def fetch():
#fetch data from another site
session = HTMLSession()
url='http://url'
data = session.get(url)
data.html.render()
content = data.html.html
soup = BeautifulSoup(content, "html.parser")
iban = soup.find('p',{"id":"demo"})
return result.text
def validate():
url = "https://url"
payload = {
"data" : fetch(),
"veto" : "fi"
}
response = requests.post(url, data=payload)
soup = BeautifulSoup(response.text, "html.parser")
data = soup.body.find(text='contact')
if(data):
print (data)
else:
print ("no data")
if __name__ == "__main__":
threads = []
for i in range(100):
# We start one thread per url present.
process = Thread(target=validate)
process.start()
threads.append(process)
From a quick search of the error, I found this Github issue that seems to show your problem and its solution.
It looks like you need to use asyncio and, at the beginning of each running thread, call asyncio.set_event_loop(asyncio.new_event_loop()).
I'm writing a web scraper. I could've just used scrapy but decided to write it from scratch so I can practice.
I've created a scraper that works successfully using requests and BeautifulSoup. It navigates through about 135 pages with 12 items on each, grabs the link and then grabs the information from the link destination. At the end it writes everything in a CSV file. It only grabs strings and it doesn't download any images or anything like that… for now.
Problem? It's quite slow. It takes about 5 secs to grab the everything just from contents of one page so that times 135 is about 11 minutes.
So my question is how do I implement threading in my code so it gets data way faster.
Here's the code:
import requests
from bs4 import BeautifulSoup
import re
import csv
def get_actor_dict_from_html(url, html):
soup = BeautifulSoup(html, "html.parser")
#There must be a better way to handle this, but let's assign a NULL value to all upcoming variables.
profileName = profileImage = profileHeight = profileWeight = 'NULL'
#Let's get the name and image..
profileName = str.strip(soup.find('h1').get_text())
profileImage = "http://images.host.com/actors/" + re.findall(r'\d+', url)[0] + "/actor-large.jpg"
#Now the rest of the stuff..
try:
profileHeight = soup.find('a', {"title": "Height"}).get_text()
except:
pass
try:
profileWeight = soup.find('a', {"title": "Weight"}).get_text()
except:
pass
return {
'Name': profileName,
'ImageUrl': profileImage,
'Height': profileHeight,
'Weight': profileWeight,
}
def lotta_downloads():
output = open("/tmp/export.csv", 'w', newline='')
wr = csv.DictWriter(output, ['Name','ImageUrl','Height','Weight'], delimiter=',')
wr.writeheader()
for i in range(135):
url = "http://www.host.com/actors/all-actors/name/{}/".format(i)
response = requests.get(url)
html = response.content
soup = BeautifulSoup(html, "html.parser")
links = soup.find_all("div", { "class" : "card-image" })
for a in links:
for url in a.find_all('a'):
url = "http://www.host.com" + url['href']
print(url)
response = requests.get(url)
html = response.content
actor_dict = get_actor_dict_from_html(url, html)
wr.writerow(actor_dict)
print('All Done!')
if __name__ == "__main__":
lotta_downloads()
Thanks!
Why don't you try to use gevent library?
gevent library has monkey patch making blocking function to non-blocking function.
Maybe wait time of requests is too much and so slow.
So I think that Making request as non-blocking function make your program fast.
On python 2.7.10
example:
import gevent
from gevent import monkey; monkey.patch_all() # Fix import code
import reqeusts
actor_dict_list = []
def worker(url):
content = requests.get(url).content
bs4.BeautifulSoup(content)
links = soup.find_all('div', {'class': 'card-image'})
for a in links:
for url in a.find_all('a'):
response = requests.get(url) # You can also use gevent spawn function on this line
...
actor_dict_list.append(get_actor_dict_from_html(url, html)) # Because of preventing race condition
output = open("/tmp/export.csv", "w", newline='')
wr = csv.DictWriter(output, ['Name', 'ImageUrl', 'Height', 'Weight'], delimiter=',')
wr.writeheader()
urls = ["http://www.host.com/actors/all-actors/name/{}/".format(i) for i in range(135)]
jobs = [gevent.spawn(worker, url) for url in urls]
gevent.joinall(jobs)
for i in actor_dict_list:
wr.writerow(actor_dict)
public gevent document: doc
P.S.
You must install python-gevent If you have ubuntu OS
sudo apt-get install python-gevent
I try to organize pool with maximum 10 concurrent downloads. The function should download base url, then parser all urls on this page and download each of them, but OVERALL number of concurrent downloads should not exceed 10.
from lxml import etree
import gevent
from gevent import monkey, pool
import requests
monkey.patch_all()
urls = [
'http://www.google.com',
'http://www.yandex.ru',
'http://www.python.org',
'http://stackoverflow.com',
# ... another 100 urls
]
LINKS_ON_PAGE=[]
POOL = pool.Pool(10)
def parse_urls(page):
html = etree.HTML(page)
if html:
links = [link for link in html.xpath("//a/#href") if 'http' in link]
# Download each url that appears in the main URL
for link in links:
data = requests.get(link)
LINKS_ON_PAGE.append('%s: %s bytes: %r' % (link, len(data.content), data.status_code))
def get_base_urls(url):
# Download the main URL
data = requests.get(url)
parse_urls(data.content)
How can I organize it to go concurrent way, but to keep the general global Pool limit for ALL web requests?
I think the following should get you what you want. I'm using BeautifulSoup in my example instead the link striping stuff you had.
from bs4 import BeautifulSoup
import requests
import gevent
from gevent import monkey, pool
monkey.patch_all()
jobs = []
links = []
p = pool.Pool(10)
urls = [
'http://www.google.com',
# ... another 100 urls
]
def get_links(url):
r = requests.get(url)
if r.status_code == 200:
soup = BeautifulSoup(r.text)
links.extend(soup.find_all('a'))
for url in urls:
jobs.append(p.spawn(get_links, url))
gevent.joinall(jobs)
gevent.pool will limit the concurrent greenlets, not the connections.
You should use session with HTTPAdapter
connection_limit = 10
adapter = requests.adapters.HTTPAdapter(pool_connections=connection_limit,
pool_maxsize=connection_limit)
session = requests.session()
session.mount('http://', adapter)
session.get('some url')
# or do your work with gevent
from gevent.pool import Pool
# it should bigger than connection limit if the time of processing data
# is longer than downings,
# to give a change run processing.
pool_size = 15
pool = Pool(pool_size)
for url in urls:
pool.spawn(session.get, url)
You should use gevent.queue to do it in the right way.
Also this(eventlet examples) will be helpful for you to understand the basic idea.
Gevent solution is similar to the eventlet.
Keep in mind that will have somewhere to store visited URLs, so as not to get cycling, so you do not get out of memory error, you need to introduce some restrictions.