Python multiprocessing scraping, duplicate results

I'm building a scraper that needs to be fairly fast over a large number of webpages. The result of the code below is a CSV file with a list of links (and other things).
Basically, I create a list of webpages that each contain several links, and for each of these pages I collect those links.
Implementing multiprocessing leads to some weird results that I haven't been able to explain.
If I run this code with the pool size set to 1 (hence, without multithreading), the final result contains about 0.5% duplicated links (which is fair enough).
As soon as I speed it up by setting the value to 8, 12 or 24, I get around 25% duplicate links in the final results.
I suspect my mistake is in the way I write the results to the CSV file or in the way I use the imap() function (the same happens with imap_unordered, map, etc.), which somehow leads the threads to access the same elements of the iterable passed in. Any suggestions?
#!/usr/bin/env python
# coding: utf8
import sys
import requests, re, time
from bs4 import BeautifulSoup
from lxml import etree
from lxml import html
import random
import unicodecsv as csv
import progressbar
import multiprocessing
from multiprocessing.pool import ThreadPool

keyword = "keyword"

def openup():
    global crawl_list
    try:
        ### Generate the list of URLs based on the number of results for the keyword;
        ### each of these pages contains other links. The list is subsequently randomized.
        startpage = 1
        ## Get endpage
        url0 = myurl0
        r0 = requests.get(url0)
        print "First request: " + str(r0.status_code)
        tree = html.fromstring(r0.content)
        endpage = tree.xpath("//*[@id='habillagepub']/div[5]/div/div[1]/section/div/ul/li[@class='adroite']/a/text()")
        print str(endpage[0]) + " pages found"
        ### Generate a random sequence for crawling
        crawl_list = random.sample(range(1, int(endpage[0]) + 1), int(endpage[0]))
        return crawl_list
    except Exception as e:
        ### Catch openup errors and return an empty crawl list, then break
        print e
        crawl_list = []
        return crawl_list

def worker_crawl(x):
    ### Open page
    url_base = myurlbase
    r = requests.get(url_base)
    print "Connecting to page " + str(x) + " ..." + str(r.status_code)
    while True:
        if r.status_code == 200:
            tree = html.fromstring(r.content)
            ### Get data
            titles = tree.xpath('//*[@id="habillagepub"]/div[5]/div/div[1]/section/article/div/div/h3/a/text()')
            links = tree.xpath('//*[@id="habillagepub"]/div[5]/div/div[1]/section/article/div/div/h3/a/@href')
            abstracts = tree.xpath('//*[@id="habillagepub"]/div[5]/div/div[1]/section/article/div/div/p/text()')
            footers = tree.xpath('//*[@id="habillagepub"]/div[5]/div/div[1]/section/article/div/div/span/text()')
            dates = []
            pagenums = []
            for f in footers:
                pagenums.append(x)
                match = re.search(r'\| .+$', f)
                if match:
                    date = match.group()
                    dates.append(date)
            pageindex = zip(titles, links, abstracts, footers, dates, pagenums)  # what if there is a missing value?
            return pageindex
        else:
            pageindex = [[str(r.status_code), "", "", "", "", str(x)]]
            return pageindex
        continue

def mp_handler():
    ### Write down:
    with open(keyword + '_results.csv', 'wb') as outcsv:
        wr = csv.DictWriter(outcsv, fieldnames=["title", "link", "abstract", "footer", "date", "pagenum"])
        wr.writeheader()
        results = p.imap(worker_crawl, crawl_list)
        for result in results:
            for x in result:
                wr.writerow({
                    #"keyword": str(keyword),
                    "title": x[0],
                    "link": x[1],
                    "abstract": x[2],
                    "footer": x[3],
                    "date": x[4],
                    "pagenum": x[5],
                })

if __name__ == '__main__':
    p = ThreadPool(4)
    openup()
    mp_handler()
    p.terminate()
    p.join()

Are you sure the page responds with the correct response in a fast sequence of requests? I have been in situations where the scraped site responded differently when the requests were fast versus when they were spaced out in time. Meaning, everything went perfectly while debugging, but as soon as the requests were fast and in sequence, the website decided to give me a different response.
Besides this, I would ask whether the fact that you are writing in a non-thread-safe way might have an impact. To minimize interactions on the final CSV output and issues with the data, you might:
use wr.writerows with a chunk of rows to write
use a threading.Lock, as in the sketch below, or as in: Multiple threads writing to the same CSV in Python
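For illustration, here is a minimal sketch of that second idea (not from the original answer): the workers write their own chunks of rows while holding a shared threading.Lock. It uses Python 3's built-in csv module and a dummy scrape() stand-in for worker_crawl(); the field names and pool size are just placeholders to adapt to your code.

import csv
import threading
from multiprocessing.pool import ThreadPool

csv_lock = threading.Lock()                       # one lock shared by all worker threads

def scrape(page_number):
    # hypothetical stand-in for worker_crawl(): return a list of row dicts
    return [{"title": "t", "link": "l", "abstract": "", "footer": "",
             "date": "", "pagenum": page_number}]

def scrape_and_write(page_number):
    rows = scrape(page_number)
    with csv_lock:                                # only one thread touches the writer at a time
        wr.writerows(rows)                        # write the whole chunk in one call

with open("results.csv", "w") as outcsv:
    wr = csv.DictWriter(outcsv, fieldnames=["title", "link", "abstract",
                                            "footer", "date", "pagenum"])
    wr.writeheader()
    pool = ThreadPool(8)
    pool.map(scrape_and_write, range(100))        # workers write their own rows
    pool.close()
    pool.join()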

Related

How can I make use of a dictionary in order to store multiple values in two different arrays

I want to store the last traded price for the derivative quotes of INFOSYS and RELIANCE in two different lists. After that, I want my program to subtract the two latest values from the respective list and output the difference between them. The given code produces output for one derivative quote.
How can I use a single piece of code to get the desired output from multiple lists? Can I make use of a dictionary to solve the problem?
import requests
import json
import time
from bs4 import BeautifulSoup as bs
import datetime, threading

LTP_arr = [0]
url = 'https://nseindia.com/live_market/dynaContent/live_watch/get_quote/GetQuoteFO.jsp?underlying=INFY&instrument=FUTSTK&expiry=27JUN2019&type=-&strike=-'

def ltpwap():
    resp = requests.get(url)
    soup = bs(resp.content, 'lxml')
    data = json.loads(soup.select_one('#responseDiv').text.strip())
    LTP = data['data'][0]['lastPrice']
    n2 = float(LTP.replace(',', ''))
    LTP_arr.append(n2)
    LTP1 = LTP_arr[-1] - LTP_arr[-2]
    print("Difference between the latest two values of INFY is ", LTP1)
    threading.Timer(1, ltpwap).start()

ltpwap()
Which produces:
Difference between the latest two values of INFY is 4.
The expected outcome is:
INFY_list = (729, 730, 731, 732, 733)
RELIANCE_list = (1330, 1331, 1332, 1333, 1334)
A better approach than keeping lists is to make a generator that produces the desired values from the URLs at some interval. Here is an implementation with time.sleep(), but with many URLs I would suggest looking at asyncio:
from bs4 import BeautifulSoup as bs
import json
import requests
from collections import defaultdict
from time import sleep

urls_to_watch = {
    'INFOSYS': 'https://nseindia.com/live_market/dynaContent/live_watch/get_quote/GetQuoteFO.jsp?underlying=INFY&instrument=FUTSTK&expiry=27JUN2019&type=-&strike=-',
    'RELIANCE': 'https://nseindia.com/live_market/dynaContent/live_watch/get_quote/GetQuoteFO.jsp?underlying=RELIANCE&instrument=FUTSTK&expiry=27JUN2019&type=-&strike=-'
}

def get_values(urls):
    for name, url in urls.items():
        resp = requests.get(url)
        soup = bs(resp.content, 'lxml')
        data = json.loads(soup.select_one('#responseDiv').text.strip())
        LTP = data['data'][0]['lastPrice']
        n2 = float(LTP.replace(',', ''))
        yield name, n2

def check_changes(urls):
    last_values = defaultdict(int)
    current_values = {}
    while True:
        current_values = dict(get_values(urls))
        for name in urls.keys():
            if current_values[name] - last_values[name]:
                yield name, current_values[name], last_values[name]
        last_values = current_values
        sleep(1)

for name, current, last in check_changes(urls_to_watch):
    # here you can just print the values, or store the current value in a list
    # and periodically persist it to a DB, etc.
    print(name, current, last, current - last)
Prints:
INFOSYS 750.7 0 750.7
RELIANCE 1284.4 0 1284.4
RELIANCE 1284.8 1284.4 0.3999999999998636
INFOSYS 749.8 750.7 -0.900000000000091
RELIANCE 1285.4 1284.8 0.6000000000001364
...and it waits indefinitely for any change to occur, then prints it.
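Since the answer points at asyncio for many URLs, here is a rough stdlib-only sketch of how the fetching could be fanned out concurrently. This is not part of the original answer: asyncio.to_thread needs Python 3.9+, only the downloading is shown, and the parsing would stay the same as in get_values().

import asyncio
import requests

async def fetch_one(name, url):
    # run the blocking requests call in a worker thread (Python 3.9+)
    resp = await asyncio.to_thread(requests.get, url)
    return name, resp.status_code

async def fetch_all(urls):
    # fire all requests concurrently and gather (name, status) pairs
    pairs = await asyncio.gather(*(fetch_one(n, u) for n, u in urls.items()))
    return dict(pairs)

# statuses = asyncio.run(fetch_all(urls_to_watch))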

Q: How to write a function's output to a .CSV file with multi-threading / multiprocessing? (Using a string array as input)

I am coding a little web scraper where I would like to implement multiprocessing / multi-threading.
I have written my function webScraper(), which receives a string with a website URL as input, scrapes some domain data and writes that data to a CSV file, line by line (one line per domain).
The input data with all the URLs is saved in a string array like this:
urls = ["google.com", "yahoo.com", "bing.com"]. (I'm considering switching to importing the URLs from a CSV file.)
How can I use multiprocessing and write the function output to a CSV file without inconsistencies and index-out-of-bounds errors? I found a nice-looking script, which seems to be exactly what I need. Unfortunately, I just switched to Python from Java a few days ago and can't figure out what exactly I need to change.
So basically, I just want to change the script below so that it calls my function webScraper(url) for each URL in my string array urls or input CSV file. The script should then write the function output for each array item, line by line, into my CSV (if I understood the code correctly).
Here's the code I am working with (thanks to hbar for the nice code!):
#!/usr/bin/env python
# -*- coding: UTF-8 -*-
# multiproc_sums.py
"""A program that reads integer values from a CSV file and writes out their
sums to another CSV file, using multiple processes if desired.
"""

import csv
import multiprocessing
import optparse
import sys

NUM_PROCS = multiprocessing.cpu_count()

def make_cli_parser():
    """Make the command line interface parser."""
    usage = "\n\n".join(["python %prog INPUT_CSV OUTPUT_CSV",
            __doc__,
            """
ARGUMENTS:
    INPUT_CSV: an input CSV file with rows of numbers
    OUTPUT_CSV: an output file that will contain the sums\
"""])
    cli_parser = optparse.OptionParser(usage)
    cli_parser.add_option('-n', '--numprocs', type='int',
            default=NUM_PROCS,
            help="Number of processes to launch [DEFAULT: %default]")
    return cli_parser

class CSVWorker(object):
    def __init__(self, numprocs, infile, outfile):
        self.numprocs = numprocs
        self.infile = open(infile)
        self.outfile = outfile
        self.in_csvfile = csv.reader(self.infile)
        self.inq = multiprocessing.Queue()
        self.outq = multiprocessing.Queue()

        self.pin = multiprocessing.Process(target=self.parse_input_csv, args=())
        self.pout = multiprocessing.Process(target=self.write_output_csv, args=())
        self.ps = [ multiprocessing.Process(target=self.sum_row, args=())
                        for i in range(self.numprocs)]

        self.pin.start()
        self.pout.start()
        for p in self.ps:
            p.start()

        self.pin.join()
        i = 0
        for p in self.ps:
            p.join()
            print "Done", i
            i += 1

        self.pout.join()
        self.infile.close()

    def parse_input_csv(self):
        """Parses the input CSV and yields tuples with the index of the row
        as the first element, and the integers of the row as the second
        element.

        The index is zero-index based.

        The data is then sent over inqueue for the workers to do their
        thing. At the end the input process sends a 'STOP' message for each
        worker.
        """
        for i, row in enumerate(self.in_csvfile):
            row = [ int(entry) for entry in row ]
            self.inq.put( (i, row) )

        for i in range(self.numprocs):
            self.inq.put("STOP")

    def sum_row(self):
        """
        Workers. Consume inq and produce answers on outq
        """
        tot = 0
        for i, row in iter(self.inq.get, "STOP"):
            self.outq.put( (i, sum(row)) )
        self.outq.put("STOP")

    def write_output_csv(self):
        """
        Open outgoing csv file then start reading outq for answers
        Since I chose to make sure output was synchronized to the input there
        is some extra goodies to do that.

        Obviously your input has the original row number so this is not
        required.
        """
        cur = 0
        stop = 0
        buffer = {}
        # For some reason csv.writer works badly across processes so open/close
        # and use it all in the same process or else you'll have the last
        # several rows missing
        outfile = open(self.outfile, "w")
        self.out_csvfile = csv.writer(outfile)

        # Keep running until we see numprocs STOP messages
        for works in range(self.numprocs):
            for i, val in iter(self.outq.get, "STOP"):
                # verify rows are in order, if not save in buffer
                if i != cur:
                    buffer[i] = val
                else:
                    # if yes then write it out and make sure no waiting rows exist
                    self.out_csvfile.writerow( [i, val] )
                    cur += 1
                    while cur in buffer:
                        self.out_csvfile.writerow([ cur, buffer[cur] ])
                        del buffer[cur]
                        cur += 1

        outfile.close()

def main(argv):
    cli_parser = make_cli_parser()
    opts, args = cli_parser.parse_args(argv)
    if len(args) != 2:
        cli_parser.error("Please provide an input file and output file.")
    c = CSVWorker(opts.numprocs, args[0], args[1])

if __name__ == '__main__':
    main(sys.argv[1:])
The whole thing wouldn't really be a problem for me if there were no writing to a CSV file involved in the multiprocessing. I already tried a different solution, Python Map Pool (link), but without success. I think there were inconsistencies among the pools, which led to errors.
Thanks for your ideas!
The way I would handle this is to use multiprocessing to do the web scraping and then a single process to write out to a CSV. I'm willing to bet that the scraping is the time-consuming part and that the I/O is quick. Below is a snippet of code that uses Pool.map to multiprocess your function.
import multiprocessing as mp
import csv

pool = mp.Pool( processes=mp.cpu_count() )
# or however many processors you can support

scraped_data = pool.map( webScraper, urls )

with open('out.csv', 'w') as outfile:       # open for writing
    wr = csv.writer(outfile)
    wr.writerows(scraped_data)              # one CSV row per scraped result
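One caveat worth adding, not stated in the original answer: with the 'spawn' start method (Windows in particular), the Pool has to be created under an if __name__ == '__main__': guard, and the worker function must be defined at module level. A rough Python 3 skeleton, with webScraper left as a stub for your own function:

import csv
import multiprocessing as mp

def webScraper(url):
    # hypothetical stub: replace with your real scraping logic,
    # returning one list of fields per URL
    return [url, "scraped", "fields"]

if __name__ == '__main__':
    urls = ["google.com", "yahoo.com", "bing.com"]
    with mp.Pool(processes=mp.cpu_count()) as pool:
        scraped_data = pool.map(webScraper, urls)    # one result per input URL
    with open('out.csv', 'w', newline='') as outfile:
        csv.writer(outfile).writerows(scraped_data)  # one CSV line per result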

How To Make A Web Crawler More Efficient?

Here is the code:
str_regex = '(https?:\/\/)?([a-z]+\d\.)?([a-z]+\.)?activeingredients\.[a-z]+(/?(work|about|contact)?/?([a-zA-z-]+)*)?/?'

import urllib.request
from Stacks import Stack
import re
import functools
import operator as op
from nary_tree import *

url = 'http://www.activeingredients.com/'
s = set()
List = []
url_list = []

def f_go(List, s, url):
    try:
        if url in s:
            return
        s.add(url)
        with urllib.request.urlopen(url) as response:
            html = response.read()
            #print(url)
            h = html.decode("utf-8")
            lst0 = prepare_expression(list(h))
            ntr = buildNaryParseTree(lst0)
            lst2 = nary_tree_tolist(ntr)
            lst3 = functools.reduce(op.add, lst2, [])
            str2 = ''.join(lst3)
            List.append(str2)
            f1 = re.finditer(str_regex, h)
            l1 = []
            for tok in f1:
                ind1 = tok.span()
                l1.append(h[ind1[0]:ind1[1]])
            for exp in l1:
                length = len(l1)
                if (exp[-1] == 'g' and exp[length - 2] == 'p' and exp[length - 3] == 'j') or \
                   (exp[-1] == 'p' and exp[length - 2] == 'n' and exp[length - 3] == 'g'):
                    pass
                else:
                    f_go(List, s, exp, iter_cnt + 1, url_list)
    except:
        return
Basically, using urllib.request.urlopen, it opens URLs recursively in a loop; it does this within a certain domain (in this case activeingredients.com); link extraction from a page is done with a regular expression. Once a page is open, it parses it and appends the result to a list as a string. So what this is supposed to do is go through the given domain, extract information (meaningful text in this case) and add it to a list. The try/except block just returns in the case of any HTTP error (and all other errors too, but this is tested and working).
It works, for example, for this small page, but for bigger ones it is extremely slow and eats memory.
Parsing and preparing the page more or less does the right job, I believe.
The question is: is there an efficient way to do this? How do web search engines crawl through the network so fast?
First: I don't think Google's web crawler runs on one laptop or one PC, so don't worry if you can't get results like the big companies do.
Points to consider:
You could start with a big list of words you can download from many websites. That sorts out some useless combinations of URLs. After that you could crawl just with letters to get uselessly-named sites into your index as well.
You could start with a list of all registered domains on DNS servers, i.e. something like this: http://www.registered-domains-list.com
Use multiple threads (a minimal sketch follows this list)
Have plenty of bandwidth
Consider buying Google's data center
These points are just ideas to give you a basic idea of how you could improve your crawler.
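As a concrete illustration of the "multiple threads" point, here is a minimal sketch using concurrent.futures to download many pages in parallel. The fetch and crawl names are placeholders, not from the question's code; the parsing and link extraction stay whatever you already have.

from concurrent.futures import ThreadPoolExecutor
import urllib.request

def fetch(url):
    # network wait dominates, so many downloads can overlap in threads
    with urllib.request.urlopen(url, timeout=10) as response:
        return url, response.read().decode("utf-8", errors="replace")

def crawl(seed_urls, max_workers=16):
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        for url, html in executor.map(fetch, seed_urls):
            yield url, html   # hand each page to your existing parser/regex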

Maintaining Dictionary Integrity While Running it Through Multithread Process

I sped up a process by using a multithreaded function; however, I need to maintain the relationship between the output and the input.
import requests
import pprint
import threading

ticker = ['aapl', 'googl', 'nvda']
url_array = []
for i in ticker:
    url = 'https://query2.finance.yahoo.com/v10/finance/quoteSummary/' + i + '?formatted=true&crumb=8ldhetOu7RJ&lang=en-US&region=US&modules=defaultKeyStatistics%2CfinancialData%2CcalendarEvents&corsDomain=finance.yahoo.com'
    url_array.append(url)

ev_array = []  # holds the fetched enterprise values

def fetch_ev(url):
    urlHandler = requests.get(url)
    data = urlHandler.json()
    ev_single = data['quoteSummary']['result'][0]['defaultKeyStatistics']['enterpriseValue']['raw']
    ev_array.append(ev_single)  # makes array of enterprise values

threads = [threading.Thread(target=fetch_ev, args=(url,)) for url in
           url_array]  # multiple threads that pull the enterprise values
for thread in threads:
    thread.start()
for thread in threads:
    thread.join()
pprint.pprint(dict(zip(ticker, ev_array)))
Sample output of the code:
1) {'aapl': '30.34B', 'googl': '484.66B', 'nvda': '602.66B'}
2) {'aapl': '484.66B', 'googl': '30.34B', 'nvda': '602.66B'}
I need the value to be matched up with the correct ticker.
Edit: I know dictionaries do not preserve order. Sorry, perhaps I was a little (very) unclear in my question. I have an array of ticker symbols that matches the order of my URL inputs. After running fetch_ev, I want to combine these ticker symbols with the matching enterprise value (ev_single). The order they are stored in does not matter; however, the pairings (key/value pairs), i.e. which value ends up with which ticker, are very important.
Edit 2 (MCVE): I changed the code to a simpler version of what I had that shows the problem better. Sorry it's a little more complicated than I would have wanted.
To make it easy to maintain the correspondence between input and output, the ev_array can be preallocated so it's the same size as the tickers array, and the fetch_ev() thread function can be given an extra argument specifying the index of the location in that array in which to store the value fetched.
To maintain the integrity of the ev_array, a threading.RLock was added to prevent concurrent access to the shared resource, which might otherwise be written to simultaneously by more than one thread. (Since its contents are now referenced directly through the index passed to fetch_ev(), this may not be strictly necessary.)
I don't know the proper ticker ↔ enterprise value correspondence to be able to verify the results this produces:
{'aapl': 602658308096L, 'googl': 484659986432L, 'nvda': 30338199552L}
but at least they're now the same each time it's run.
import requests
import pprint
import threading

def fetch_ev(index, url):  # index parameter added
    response = requests.get(url)
    response.raise_for_status()
    data = response.json()
    ev_single = data['quoteSummary']['result'][0][
        'defaultKeyStatistics']['enterpriseValue']['raw']
    with ev_array_lock:
        ev_array[index] = ev_single  # store enterprise value obtained

tickers = ['aapl', 'googl', 'nvda']
ev_array = [None] * len(tickers)  # preallocate to hold results
ev_array_lock = threading.RLock()  # to synchronize concurrent array access
urls = ['https://query2.finance.yahoo.com/v10/finance/quoteSummary/{}'
        '?formatted=true&crumb=8ldhetOu7RJ&lang=en-US&region=US'
        '&modules=defaultKeyStatistics%2CfinancialData%2CcalendarEvents'
        '&corsDomain=finance.yahoo.com'.format(symbol)
        for symbol in tickers]

threads = [threading.Thread(target=fetch_ev, args=(i, url))
           for i, url in enumerate(urls)]  # activities to obtain ev's
for thread in threads:
    thread.start()
for thread in threads:
    thread.join()
pprint.pprint(dict(zip(tickers, ev_array)))
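An alternative worth noting, not part of the original answer: concurrent.futures.ThreadPoolExecutor.map returns results in the same order as its inputs, so the index bookkeeping and the lock can be dropped entirely. A sketch reusing the tickers and urls lists defined above:

from concurrent.futures import ThreadPoolExecutor
import requests

def fetch_ev_value(url):
    data = requests.get(url).json()
    return data['quoteSummary']['result'][0][
        'defaultKeyStatistics']['enterpriseValue']['raw']

with ThreadPoolExecutor(max_workers=len(tickers)) as executor:
    ev_values = list(executor.map(fetch_ev_value, urls))  # same order as urls

print(dict(zip(tickers, ev_values)))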

RuntimeError: maximum recursion depth exceeded with Python 3.2 pickle.dump

I'm getting the above error with the code below. The error occurs at the last line. Please excuse the subject matter; I'm just practicing my Python skills. =)
from urllib.request import urlopen
from bs4 import BeautifulSoup
from pprint import pprint
from pickle import dump

moves = dict()
moves0 = set()

url = 'http://www.marriland.com/pokedex/1-bulbasaur'
print(url)

# Open url
with urlopen(url) as usock:
    # Get url data source
    data = usock.read().decode("latin-1")

# Soupify
soup = BeautifulSoup(data)

# Find move tables
for div_class1 in soup.find_all('div', {'class': 'listing-container listing-container-table'}):
    div_class2 = div_class1.find_all('div', {'class': 'listing-header'})
    if len(div_class2) > 1:
        header = div_class2[0].find_all(text=True)[1]

        # Take only moves from Level Up, TM / HM, and Tutor
        if header in ['Level Up', 'TM / HM', 'Tutor']:
            # Get rows
            for row in div_class1.find_all('tbody')[0].find_all('tr'):
                # Get cells
                cells = row.find_all('td')

                # Get move name
                move = cells[1].find_all(text=True)[0]

                # If move is new
                if not move in moves:
                    # Get type
                    typ = cells[2].find_all(text=True)[0]

                    # Get category
                    cat = cells[3].find_all(text=True)[0]

                    # Get power if not Status or Support
                    power = '--'
                    if cat != 'Status or Support':
                        try:
                            # not STAB
                            power = int(cells[4].find_all(text=True)[1].strip(' \t\r\n'))
                        except ValueError:
                            try:
                                # STAB
                                power = int(cells[4].find_all(text=True)[-2])
                            except ValueError:
                                # Moves like Return, Frustration, etc.
                                power = cells[4].find_all(text=True)[-2]

                    # Get accuracy
                    acc = cells[5].find_all(text=True)[0]

                    # Get pp
                    pp = cells[6].find_all(text=True)[0]

                    # Add move to dict
                    moves[move] = {'type': typ,
                                   'cat': cat,
                                   'power': power,
                                   'acc': acc,
                                   'pp': pp}

                # Add move to pokemon's move set
                moves0.add(move)

pprint(moves)
dump(moves, open('pkmn_moves.dump', 'wb'))
I have reduced the code as much as possible while still producing the error. The fault may be simple, but I just can't find it. In the meantime, I made a workaround by setting the recursion limit to 10000.
Just want to contribute an answer for anyone else who may have this issue. Specifically, I was having it while caching BeautifulSoup objects fetched from a remote API in a Django session.
The short answer is that pickling BeautifulSoup nodes is not supported. Instead, I opted to store the original string data in my object and expose an accessor method that parses it on the fly, so that only the original string data is pickled.
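A minimal sketch of that pattern, assuming a small wrapper class of my own rather than anything from BeautifulSoup's API: pickle only the raw HTML string and rebuild the soup lazily when it is accessed.

import pickle
from bs4 import BeautifulSoup

class PageCache(object):
    """Keeps only the raw HTML string for pickling; parses lazily on access."""

    def __init__(self, html):
        self.html = html
        self._soup = None

    @property
    def soup(self):
        if self._soup is None:                     # parse on first access only
            self._soup = BeautifulSoup(self.html, "html.parser")
        return self._soup

    def __getstate__(self):
        return {'html': self.html}                 # drop the parsed tree when pickling

    def __setstate__(self, state):
        self.html = state['html']
        self._soup = None

# page = PageCache(data)
# blob = pickle.dumps(page)    # safe: only the HTML string is serialized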
