I'm trying to speed up a script that scrapes an XML file obtained by making a request to an API with urllib. I have to make ~2.3 million requests, so it takes ~8 hours without multiprocessing.
Without applying multiprocessing:
from urllib import request as rq
from lxml import etree

def download_data(id):
    data = []
    xml = etree.iterparse(rq.urlretrieve(url + id + ".xml")[0], events=('start', 'end'))
    for event, id_data in xml:
        if event == "start":
            try:
                data.append(id_data.get('value'))
            except:
                pass
    return data

with open("/path/to/file", "rt") as ids_file:
    ids = ids_file.read().splitlines()

data_dict = {id: download_data(id) for id in ids}
I've tried the following code:
from urllib import request as rq
from lxml import etree
from multiprocessing import Pool, cpu_count

def download_data(id):
    data = []
    xml = etree.iterparse(rq.urlretrieve(url + id + ".xml")[0], events=('start', 'end'))
    for event, id_data in xml:
        if event == "start":
            try:
                data.append(id_data.get('value'))
            except:
                pass
    return (id, data)

with open("/path/to/file", "rt") as ids_file:
    ids = ids_file.read().splitlines()

with Pool(processes=cpu_count()*2) as pool:
    dt = pool.map(download_data, ids)

data_dict = dict(dt)
I get the following error:
An attempt has been made to start a new process before the
current process has finished its bootstrapping phase.
This probably means that you are not using fork to start your
child processes and you have forgotten to use the proper idiom
in the main module:
if __name__ == '__main__':
freeze_support()
The "freeze_support()" line can be omitted if the program
is not going to be frozen to produce an executable.
Any suggestions?
Thank you in advance!
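The traceback points at the standard idiom: create the pool only under an if __name__ == '__main__': guard, so that child processes started with spawn can re-import the module safely. A minimal sketch of the guarded version, keeping download_data as in the attempt above (note that url is assumed to be the API base address, which is not shown in the question either):

from urllib import request as rq
from lxml import etree
from multiprocessing import Pool, cpu_count

# url is the API base address from the question (not shown there either)

def download_data(id):
    data = []
    xml = etree.iterparse(rq.urlretrieve(url + id + ".xml")[0], events=('start', 'end'))
    for event, id_data in xml:
        if event == "start":
            try:
                data.append(id_data.get('value'))
            except:
                pass
    return (id, data)

if __name__ == '__main__':
    with open("/path/to/file", "rt") as ids_file:
        ids = ids_file.read().splitlines()
    # everything that starts worker processes now runs only in the main module
    with Pool(processes=cpu_count() * 2) as pool:
        data_dict = dict(pool.map(download_data, ids))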
Can you help me find a bug in my code?
I'm trying to speed up web scraping URLs gathered by using googlesearch.search.
Note: this seems to be a similar issue to the one described in this post:
Concurrent.futures + requests_html's render() = "There is no current event loop in thread 'ThreadPoolExecutor-0_0'."
But after attempting to implement it the way it was described there, I still can't get rid of my issue.
Here's my original code so far:
from requests_html import HTMLSession
import multiprocessing as mp
import concurrent.futures
from googlesearch import search

# get 20 urls for "funny cats"
def getURLs():
    urls = list(search("funny cats", tld='com', num=20, stop=20, pause=2))
    return urls

# divide list of 20 urls into list of 4 lists x 5 urls
# each sub-list will be processed on one processor (I have 4 cores)
def fillContainer(some_iterable):
    my_gen = iter(some_iterable)
    cores = mp.cpu_count()
    container = [[] for n in range(cores)]
    while True:
        for a in container:
            try:
                a.append(next(my_gen))
            except StopIteration:
                return container

def processURL(urls):
    with HTMLSession() as session:
        for u in urls:
            try:
                response = session.get(u)
                response.raise_for_status()
                response.html.render()
                # plus some regex to process html, but that's not the point
            except Exception as e:
                print(f"ERROR !!! {e} , accessing URL: {u} , Moving on ...")

def main():
    URLs = getURLs()
    container = fillContainer(URLs)
    with concurrent.futures.ThreadPoolExecutor() as executor:
        futures = [executor.submit(processURL, url) for url in container]

if __name__ == '__main__':
    main()
I get the:
There is no current event loop in thread 'ThreadPoolExecutor'
error for each URL I try to process using my processURL() function. I also tried using executor.map(processURL, URLs), but with no success.
Thank you for your help.
#EDIT #1
It seems that there's a problem with the line response.html.render(), however I don't know how to deal with it.
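One possible workaround, assuming the cause is that requests_html's render() needs an asyncio event loop and ThreadPoolExecutor worker threads don't get one by default, is to create a loop at the top of processURL() before the session is used (this mirrors the fix described in the linked post):

import asyncio

def processURL(urls):
    # worker threads spawned by ThreadPoolExecutor have no event loop,
    # so give each one its own before render() is called
    asyncio.set_event_loop(asyncio.new_event_loop())
    with HTMLSession() as session:
        for u in urls:
            try:
                response = session.get(u)
                response.raise_for_status()
                response.html.render()
            except Exception as e:
                print(f"ERROR !!! {e} , accessing URL: {u} , Moving on ...")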
Good day
I am working on a directory scanner and trying to speed it up as much as possible. I have been looking into using multiprocessing, however I do not believe I am using it correctly.
from multiprocessing import Pool
import requests
import sys

def dir_scanner(wordlist=sys.argv[1], dest_address=sys.argv[2], file_ext=sys.argv[3]):
    print(f"Scanning Target: {dest_address} looking for files ending in {file_ext}")
    # read a wordlist
    dir_file = open(f"{wordlist}").read()
    dir_list = dir_file.splitlines()
    # empty list for discovered dirs
    discovered_dirs = []
    # make requests for each potential dir location
    for dir_item in dir_list:
        req_url = f"http://{dest_address}/{dir_item}.{file_ext}"
        req_dir = requests.get(req_url)
        print(req_url)
        if req_dir.status_code == 404:
            pass
        else:
            print("Directory Discovered ", req_url)
            discovered_dirs.append(req_url)
    with open("discovered_dirs.txt", "w") as f:
        for directory in discovered_dirs:
            print(directory, file=f)

if __name__ == '__main__':
    with Pool(processes=4) as pool:
        dir_scanner(sys.argv[1], sys.argv[2], sys.argv[3])
Is the above example the correct usage of Pool? Ultimately I am attempting to speed up the requests that are being made to the target.
UPDATE: Perhaps not the most elegant solution, but:
from multiprocessing import Pool
import requests
import sys

# USAGE EXAMPLE: python3 dir_scanner.py <wordlist> <target address> <file extension>

discovered_dirs = []

# read in the wordlist
dir_file = open(f"{sys.argv[1]}").read()
dir_list = dir_file.splitlines()

def make_request(dir_item):
    # create a GET request URL based on items in the wordlist
    req_url = f"http://{sys.argv[2]}/{dir_item}.{sys.argv[3]}"
    return req_url, requests.get(req_url)

# map the requests made by make_request to speed things up
with Pool(processes=4) as pool:
    for req_url, req_dir in pool.map(make_request, dir_list):
        # if the request resp is a 404 move on
        if req_dir.status_code == 404:
            pass
        # if not a 404 resp then add it to the list
        else:
            print("Directory Discovered ", req_url)
            discovered_dirs.append(req_url)

# create a new file and append it with directories that were discovered
with open("discovered_dirs.txt", "w") as f:
    for directory in discovered_dirs:
        print(directory, file=f)
Right now, you are creating a pool and not using it.
You can use pool.map to distribute the requests across multiple processes:
...

def make_request(dir_item):
    req_url = f"http://{dest_address}/{dir_item}.{file_ext}"
    return req_url, requests.get(req_url)

with Pool(processes=4) as pool:
    for req_url, req_dir in pool.map(make_request, dir_list):
        print(req_url)
        if req_dir.status_code == 404:
            pass
        else:
            print("Directory Discovered ", req_url)
            discovered_dirs.append(req_url)

...
In the example above, the function make_request is executed in subprocesses.
The Python documentation gives a lot of examples.
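For completeness, a sketch of how the whole script could be put together, with make_request at module level (so it pickles cleanly) and the fixed arguments bound via functools.partial; the command-line argument handling is kept from the question:

from functools import partial
from multiprocessing import Pool
import sys
import requests

def make_request(dest_address, file_ext, dir_item):
    # build the candidate URL and fetch it
    req_url = f"http://{dest_address}/{dir_item}.{file_ext}"
    return req_url, requests.get(req_url)

if __name__ == "__main__":
    wordlist, dest_address, file_ext = sys.argv[1:4]
    with open(wordlist) as f:
        dir_list = f.read().splitlines()

    discovered_dirs = []
    with Pool(processes=4) as pool:
        # bind the fixed arguments so workers only receive the wordlist item
        worker = partial(make_request, dest_address, file_ext)
        for req_url, req_dir in pool.map(worker, dir_list):
            if req_dir.status_code != 404:
                print("Directory Discovered", req_url)
                discovered_dirs.append(req_url)

    with open("discovered_dirs.txt", "w") as f:
        for directory in discovered_dirs:
            print(directory, file=f)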
I'm trying to parse a huge XML file with the code below, and whenever I run it from the terminal it just runs without any errors and does nothing. I need it to parse the file incrementally and delete the parent element after checking whether SubmissionTime is older than a specific number of days.
For example, the XML structure is like this:
<Feed>
  <Reviews>
    <Review>
      <SubmissionTime>2015-06-16T19:00:00.000-05:00</SubmissionTime>
    </Review>
  </Reviews>
</Feed>
from lxml import etree, objectify
import logging, sys, iso8601
from datetime import datetime, timedelta
from dateutil.relativedelta import relativedelta
import re

def remove_per_age(file):
    datestring = datetime.now().strftime("%Y%m%d-%H%M%S")
    full_data = ""
    for event, elem in etree.iterparse(file, events=("end",)):
        if elem.tag == 'SubmissionTime':
            element_datetime = iso8601.parse_date(elem.text)
            element_date = element_datetime.date()
            if element_date < datetime.now(element_datetime.tzinfo).date() - relativedelta(days=180):
                elem.getparent().remove(elem)
            else:
                # encoding="unicode" so tostring() returns str, not bytes
                full_data += etree.tostring(elem, encoding="unicode")
        else:
            elem.clear()
    with open("output.xml", 'w') as f:
        f.write(full_data)

def strip_tag_name(tag):
    pattern = re.compile(r'\{.+\}')
    clean_tag = pattern.sub(r'', tag)
    return clean_tag

if __name__ == "__main__":
    remove_per_age(sys.argv[1])
    # Reviews/Review/SubmissionTime
The way to handle a huge XML file incrementally is to use SAX.
You will need to extend xml.sax.ContentHandler and add your logic there.
See https://www.tutorialspoint.com/parsing-xml-with-sax-apis-in-python for an example
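A minimal sketch of that approach for the structure shown in the question (assumption: the element names match the sample XML above; this handler only counts reviews newer than a 180-day cutoff rather than rewriting the file):

import sys
import xml.sax
import iso8601
from datetime import datetime
from dateutil.relativedelta import relativedelta

class ReviewAgeHandler(xml.sax.ContentHandler):
    def __init__(self, max_age_days=180):
        super().__init__()
        self.cutoff = datetime.now().date() - relativedelta(days=max_age_days)
        self.in_submission_time = False
        self.text = ""
        self.recent_reviews = 0

    def startElement(self, name, attrs):
        if name == "SubmissionTime":
            self.in_submission_time = True
            self.text = ""

    def characters(self, content):
        if self.in_submission_time:
            self.text += content

    def endElement(self, name):
        if name == "SubmissionTime":
            self.in_submission_time = False
            submitted = iso8601.parse_date(self.text).date()
            if submitted >= self.cutoff:
                self.recent_reviews += 1  # keep / write out this review

if __name__ == "__main__":
    handler = ReviewAgeHandler()
    xml.sax.parse(sys.argv[1], handler)
    print(handler.recent_reviews, "reviews newer than the cutoff")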
The following code ingests 10k-20k records per second, and I want to improve its performance. I am reading a JSON format and ingesting it into a database using Kafka.
I am running it on a cluster of five nodes with ZooKeeper and Kafka installed on it.
Can you give me some tips to improve it?
import os
import json
from multiprocessing import Pool
from kafka.client import KafkaClient
from kafka.producer import SimpleProducer

def process_line(line):
    producer = SimpleProducer(client)
    try:
        jrec = json.loads(line.strip())
        producer.send_messages('twitter2613', json.dumps(jrec))
    except ValueError, e:
        {}

if __name__ == "__main__":
    client = KafkaClient('10.62.84.35:9092')
    myloop = True
    pool = Pool(30)
    direcToData = os.listdir("/FullData/RowData")
    for loop in direcToData:
        mydir2 = os.listdir("/FullData/RowData/" + loop)
        for i in mydir2:
            if myloop:
                with open("/FullData/RowData/" + loop + "/" + i) as source_file:
                    # chunk the work, 30 lines per task
                    results = pool.map(process_line, source_file, 30)
You can maybe import only the function that you need from os; it can be a first optimization.
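Another likely bottleneck is that process_line builds a new SimpleProducer for every line. A minimal sketch of one way around that, using a Pool initializer so each worker process creates its client and producer once (assuming the same old kafka-python SimpleProducer API used in the question):

import os
import json
from multiprocessing import Pool
from kafka.client import KafkaClient
from kafka.producer import SimpleProducer

producer = None  # one producer per worker process, set up in init_worker()

def init_worker():
    # runs once in each worker: build the Kafka client/producer a single time
    global producer
    client = KafkaClient('10.62.84.35:9092')
    producer = SimpleProducer(client)

def process_line(line):
    try:
        jrec = json.loads(line.strip())
        producer.send_messages('twitter2613', json.dumps(jrec))
    except ValueError:
        pass

if __name__ == "__main__":
    pool = Pool(30, initializer=init_worker)
    for loop in os.listdir("/FullData/RowData"):
        for i in os.listdir("/FullData/RowData/" + loop):
            with open("/FullData/RowData/" + loop + "/" + i) as source_file:
                pool.map(process_line, source_file, 30)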
I use the Python Requests library to download a big file, e.g.:
r = requests.get("http://bigfile.com/bigfile.bin")
content = r.content
The big file downloads at +- 30 Kb per second, which is a bit slow. Every connection to the bigfile server is throttled, so I would like to make multiple connections.
Is there a way to make multiple connections at the same time to download one file?
You can use the HTTP Range header to fetch just part of the file (already covered for Python here).
Just start several threads, fetch a different range with each, and you're done ;)
import threading
import urllib2

chunk_size = 1024 * 1024  # size of each ranged request, in bytes

def download(url, start):
    req = urllib2.Request(url)
    # ranges are inclusive, so request exactly chunk_size bytes starting at `start`
    req.headers['Range'] = 'bytes=%s-%s' % (start, start + chunk_size - 1)
    f = urllib2.urlopen(req)
    parts[start] = f.read()

threads = []
parts = {}
url = 'http://bigfile.com/bigfile.bin'  # the file from the question

# Initialize threads
for i in range(0, 10):
    t = threading.Thread(target=download, args=(url, i * chunk_size))
    t.start()
    threads.append(t)

# Join threads back (order doesn't matter, you just want them all)
for i in threads:
    i.join()

# Sort parts and you're done
result = ''.join(parts[i] for i in sorted(parts.keys()))
Also note that not every server supports the Range header (servers whose data fetching is handled by PHP scripts, in particular, often don't implement it).
Here's a Python script that saves a given URL to a file and uses multiple threads to download it:
#!/usr/bin/env python
import sys
from functools import partial
from itertools import count, izip
from multiprocessing.dummy import Pool  # use threads
from urllib2 import HTTPError, Request, urlopen

def download_chunk(url, byterange):
    req = Request(url, headers=dict(Range='bytes=%d-%d' % byterange))
    try:
        return urlopen(req).read()
    except HTTPError as e:
        return b'' if e.code == 416 else None  # treat range error as EOF
    except EnvironmentError:
        return None

def main():
    url, filename = sys.argv[1:]
    pool = Pool(4)  # define number of concurrent connections
    chunksize = 1 << 16
    ranges = izip(count(0, chunksize), count(chunksize - 1, chunksize))
    with open(filename, 'wb') as file:
        for s in pool.imap(partial(download_chunk, url), ranges):
            if not s:
                break  # error or EOF
            file.write(s)
            if len(s) != chunksize:
                break  # EOF (servers with no Range support end up here)

if __name__ == "__main__":
    main()
The end of file is detected if a server returns an empty body or a 416 HTTP code, or if the response size is not exactly chunksize.
It supports servers that don't understand the Range header (everything is downloaded in a single request in this case; to support large files, change download_chunk() to save to a temporary file and return the filename to be read in the main thread instead of the file content itself).
It lets you change the number of concurrent connections (pool size) and the number of bytes requested in a single HTTP request independently.
To use multiple processes instead of threads, change the import:
from multiprocessing.pool import Pool # use processes (other code unchanged)
This solution requires the Linux utility named "aria2c", but it has the advantage of easily resuming downloads.
It also assumes that all the files you want to download are listed in the HTTP directory listing for the location MY_HTTP_LOC. I tested this script on an instance of the lighttpd/1.4.26 HTTP server, but you can easily modify it to work for other setups.
#!/usr/bin/python
import os
import urllib
import re
import subprocess

MY_HTTP_LOC = "http://AAA.BBB.CCC.DDD/"

# retrieve webpage source code
f = urllib.urlopen(MY_HTTP_LOC)
page = f.read()
f.close()

# extract relevant URL segments from source code
rgxp = '(\<td\ class="n"\>\<a\ href=")([0-9a-zA-Z\(\)\-\_\.]+)(")'
results = re.findall(rgxp, str(page))
files = []
for match in results:
    files.append(match[1])

# download (using aria2c) files
for afile in files:
    if os.path.exists(afile) and not os.path.exists(afile + '.aria2'):
        print 'Skipping already-retrieved file: ' + afile
    else:
        print 'Downloading file: ' + afile
        subprocess.Popen(["aria2c", "-x", "16", "-s", "20", MY_HTTP_LOC + str(afile)]).wait()
You could use a module called pySmartDL for this. It uses multiple threads, can do a lot more, and also shows a download progress bar by default.
For more info check this answer.
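A minimal usage sketch of pySmartDL as its documentation typically shows it (the URL is the one from the question; exact options may vary between versions):

from pySmartDL import SmartDL

url = "http://bigfile.com/bigfile.bin"
dest = "./bigfile.bin"

obj = SmartDL(url, dest)  # downloads with multiple connections and shows a progress bar
obj.start()               # blocks until the download finishes
print(obj.get_dest())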