I'm trying to grab some information off a page and then insert it into MongoDB. There is a list of pages, and I want to use multiprocessing because the pages can take time to load. Once the webdriver returns a result, I want to insert it into the db. The problem is that I'm only getting 1/4 of the results I expect in the db, so I imagine the way I'm managing the results and the inserts isn't working. I was hoping someone could show me where I've gone wrong. The following is an example of the code:
from multiprocessing.dummy import Pool
from multiprocessing import cpu_count
from selenium import webdriver
import timeit
from pymongo import MongoClient

def mp_worker(urls):
    driver = webdriver.Chrome(chromedriver,
                              chrome_options=options)
    url = "http://website"+urls
    driver.get(url)
    return what_you_want
    driver.quit() #do I do this here, close or quit?

def mp_handler():
    urls= ["14360705","4584061","13788961","6877217","13194596","13400479","9868014","8524704","16394198","16315464"]
    client = MongoClient()
    db = client.test
    collection = db['test-collection']
    p = Pool(cpu_count()*2)
    for result in p.imap(mp_worker, urls):
        db.restaurants.update(result,{"upsert":"True"})

if __name__=='__main__':
    start = timeit.default_timer()
    mp_handler()
    stop = timeit.default_timer()
    print (stop - start)
This syntax is incorrect:
db.restaurants.update(result,{"upsert":"True"})
You likely want:
db.restaurants.insert(result)
Or:
db.restaurants.update(filter, result, upsert=True)
Where "filter" is a MongoDB query (expressed as a Python dict) that uniquely matches the document you want to update or create.
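As a minimal sketch of the upsert form, assuming PyMongo 3.0 or later (where update_one is available) and assuming each scraped result is a dict that carries a unique key. The helper name upsert_result and the field name url_id are hypothetical and do not come from the question:

```python
def upsert_result(collection, result, key="url_id"):
    """Insert `result`, or update the stored document that has the same key.

    `collection` is expected to be a pymongo Collection, for example
    MongoClient().test.restaurants. `key` is a hypothetical field that
    uniquely identifies one scraped page.
    """
    query = {key: result[key]}   # the "filter" document
    update = {"$set": result}    # update_one requires update operators
    return collection.update_one(query, update, upsert=True)
```

Calling it inside the `for result in p.imap(...)` loop would write one document per URL, and re-running the scraper would overwrite documents rather than duplicate them.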
I have written a function based on Selenium, and I want it to parse multiple webpages simultaneously. I have a list of URLs that I pass to the function, and I want to scrape them at the same time to save time.
I created a scraper.py file where I put the scraper function:
def parser_od(url):
    price=[]
    url_of = url
    driver.get(url_of)
    try:
        price.append(browser.find_element_by_xpath("//*[#id='root']/article/header/div[2]/div[1]/div[2]").text.replace(" ","").replace("zł","").replace(",","."))
    except NoSuchElementException:
        price.append("")
Now I want to use the function to parse multiple URLs from my list at the same time using the multiprocessing library:
from scraper import *
url_list=['https://www.otodom.pl/oferta/2-duze-pokoje-we-wrzeszczu-do-zamieszania-ID42f6s',
          'https://www.otodom.pl/oferta/mieszkanie-na-zamknietym-osiedlu-z-ogrodkiem-ID40ZxM',
          'https://www.otodom.pl/oferta/zaciszna-nowe-mieszkanie-3-pokoje-0-ID41UaX',
          'https://www.otodom.pl/oferta/dwupoziomowe-dewel-mieszkanie-101-m2-lebork-i-p-ID3JEcQ']
driver = webdriver.Chrome(executable_path=r"C:\Users\Admin\chromedriver.exe")
from multiprocessing import Pool
with Pool(4) as p:
    price = p.map(parser_od, url_list)
But I get the following error:
NameError: name 'driver' is not defined
Which is weird, because Chrome does open.
Edit:
I need the browser(s) to stay open while the scraper runs, so that the driver is opened once beforehand rather than every time the function is invoked.
You should probably split the list of URLs you want to process into 4 roughly equal parts and give each process in the Pool its own driver, which works through one of the parts.
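from functools import partial
from multiprocessing import Pool

from selenium import webdriver
from selenium.common.exceptions import NoSuchElementException

PRICE_XPATH = "//*[@id='root']/article/header/div[2]/div[1]/div[2]"

def parser_od(worker_index, urls, n_workers=4):
    # Each worker process opens its own browser once and reuses it.
    driver = webdriver.Chrome(executable_path=r"C:\Users\Admin\chromedriver.exe")
    prices = []
    try:
        # Worker k handles urls[k], urls[k + n_workers], ...
        for url in urls[worker_index::n_workers]:
            driver.get(url)
            try:
                text = driver.find_element_by_xpath(PRICE_XPATH).text
                prices.append(text.replace(" ", "").replace("zł", "").replace(",", "."))
            except NoSuchElementException:
                prices.append("")
    finally:
        driver.quit()
    return prices

if __name__ == "__main__":
    with Pool(4) as p:
        # A lambda cannot be pickled for worker processes, so use functools.partial.
        price = p.map(partial(parser_od, urls=url_list), range(4))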
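Handing every fourth URL to the same worker is a round-robin split, which also means the per-worker results come back grouped by worker rather than in the original URL order. A minimal sketch of the split and of restoring the order afterwards, using placeholder strings instead of real URLs (the helper names split_round_robin and merge_round_robin are hypothetical):

```python
def split_round_robin(items, n_parts):
    """Return n_parts lists; part k holds items k, k + n_parts, k + 2*n_parts, ..."""
    return [items[k::n_parts] for k in range(n_parts)]


def merge_round_robin(parts):
    """Undo split_round_robin: interleave the parts back into the original order."""
    merged = []
    longest = max((len(part) for part in parts), default=0)
    for i in range(longest):
        for part in parts:
            if i < len(part):
                merged.append(part[i])
    return merged


urls = ["u0", "u1", "u2", "u3", "u4", "u5"]  # placeholder URLs
parts = split_round_robin(urls, 4)
print(parts)  # [['u0', 'u4'], ['u1', 'u5'], ['u2'], ['u3']]
```

Applying merge_round_robin to the list returned by p.map would line the prices up with url_list again, assuming each worker returns exactly one price per URL it was given.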
I need help with a feature I'm trying to implement; unfortunately I'm not very comfortable with multithreading.
My script downloads 4 different files from the internet and calls a dedicated function for each one, then saves them all.
The problem is that I'm doing it step by step, so I have to wait for each download to finish before proceeding to the next one.
I can see what I should do to solve this, but I haven't managed to code it.
Current behaviour:
url_list = [Url1, Url2, Url3, Url4]
files_list = []
files_list.append(downloadFile(Url1))
handleFile(files_list[-1], type=0)
...
files_list.append(downloadFile(Url4))
handleFile(files_list[-1], type=3)
saveAll(files_list)
Desired behaviour:
url_list = [Url1, Url2, Url3, Url4]
files_list = []
for url in url_list:
    callThread(files_list.append(downloadFile(url)), # function
               handleFile(files_list[url.index], type=url.index) # trigger
    #use a thread for downloading
    #once a file is downloaded, it triggers its associated function
    #wait for all files to be treated
saveAll(files_list)
Thanks for your help!
The typical approach is to put the I/O-heavy part, such as fetching data over the internet, and the data processing into the same function:
import random
import threading
import time
from concurrent.futures import ThreadPoolExecutor

import requests

def fetch_and_process_file(url):
    thread_name = threading.current_thread().name
    print(thread_name, "fetch", url)
    data = requests.get(url).text
    # "process" result
    time.sleep(random.random() / 4)  # simulate work
    print(thread_name, "process data from", url)
    result = len(data) ** 2
    return result

threads = 2
urls = ["https://google.com", "https://python.org", "https://pypi.org"]
executor = ThreadPoolExecutor(max_workers=threads)
with executor:
    results = executor.map(fetch_and_process_file, urls)
print()
print("results:", list(results))
outputs:
ThreadPoolExecutor-0_0 fetch https://google.com
ThreadPoolExecutor-0_1 fetch https://python.org
ThreadPoolExecutor-0_0 process data from https://google.com
ThreadPoolExecutor-0_0 fetch https://pypi.org
ThreadPoolExecutor-0_0 process data from https://pypi.org
ThreadPoolExecutor-0_1 process data from https://python.org
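For the four-file case in the question, where each file is handled according to its position in the list, the same idea might look like the sketch below. The functions download_file and handle_file are stand-ins for the asker's downloadFile and handleFile (their real behaviour is not shown in the question), and the URLs are placeholders:

```python
from concurrent.futures import ThreadPoolExecutor


def download_file(url):
    # Stand-in for a real download (e.g. requests.get(url).content).
    return "contents of " + url


def handle_file(data, file_type):
    # Stand-in for the per-file processing step.
    return (file_type, data.upper())


def download_and_handle(indexed_url):
    # Download one file, then run its handler in the same thread.
    file_type, url = indexed_url
    return handle_file(download_file(url), file_type)


url_list = ["url1", "url2", "url3", "url4"]  # placeholders
with ThreadPoolExecutor(max_workers=4) as executor:
    # map() yields results in input order, whichever download finishes first.
    files_list = list(executor.map(download_and_handle, enumerate(url_list)))
```

Leaving the with block waits for all four workers, so files_list is complete before anything like saveAll(files_list) runs.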
To scrape a pool of URLs, I am parallel processing Selenium with joblib. In this context, I am facing two challenges:
Challenge 1 is to speed up this process. At the moment, my code opens and closes a driver instance for every URL (ideally there would be one per process).
Challenge 2 is to get rid of the CPU-intensive while loop that I think I need in order to continue on empty results (I know that this is most likely wrong).
Pseudocode:
URL_list = [URL1, URL2, URL3, ..., URL100000]  # List of URLs to be scraped
def scrape(URL):
    while True:  # Loop needed to use continue
        try:  # Try scraping
            driver = webdriver.Firefox(executable_path=path)  # Set up driver
            website = driver.get(URL)  # Get URL
            results = do_something(website)  # Get results from URL content
            driver.close()  # Close worker
            if len(results) == 0:  # If do_something() failed:
                continue  # THEN Worker to skip URL
            else:  # If do_something() worked:
                safe_results("results.csv")  # THEN Save results
                break  # Go to next worker/URL
        except Exception as e:  # If something weird happens:
            save_exception(URL, e)  # THEN Save error message
            break  # Go to next worker/URL
Parallel(n_jobs = 40)(delayed(scrape)(URL) for URL in URL_list)  # Run in 40 processes
My understanding is that in order to re-use a driver instance across iterations, the # Set up driver-line needs to be placed outside scrape(URL). However, everything outside scrape(URL) will not find its way to joblib's Parallel(n_jobs = 40). This would imply that you can't reuse driver instances while scraping with joblib which can't be true.
Q1: How to reuse driver instances during parallel processing in the above example?
Q2: How to get rid of the while-loop while maintaining functionality in the above-mentioned example?
Note: Flash and image loading is disabled in firefox_profile (code not shown)
1) You should first create a set of drivers, one for each process, and pass an instance to each worker. I don't know how to pass drivers to a Parallel object, but you could use threading.current_thread().name as a key to identify drivers. To do that, use backend="threading". Each thread will then have its own driver.
2) You don't need a loop at all. The Parallel object itself iterates over all your URLs (I hope I have understood why you wanted the loop).
import threading
from joblib import Parallel, delayed
from selenium import webdriver

def scrape(URL):
    try:
        driver = drivers[threading.current_thread().name]
    except KeyError:
        drivers[threading.current_thread().name] = webdriver.Firefox()
        driver = drivers[threading.current_thread().name]
    driver.get(URL)
    results = do_something(driver)
    if results:
        safe_results("results.csv")

drivers = {}
Parallel(n_jobs=-1, backend="threading")(delayed(scrape)(URL) for URL in URL_list)
for driver in drivers.values():
    driver.quit()
But I don't really think you gain anything by using more n_jobs than you have CPUs, so n_jobs=-1 is probably best (of course I may be wrong, so try it).
I am trying to extract the purchase_state from Google Play, following the steps below:
import base64
import requests
import smtplib
from collections import OrderedDict
import mysql.connector
from mysql.connector import errorcode
......
1) Query the db, returning thousands of rows with the purchase_id field from my table.
2) For each row from the db, extract purchase_id and then query Google Play for it. For example, if the previous step returns 1000 rows, Google is queried 1000 times (refresh token + query).
3) Add a new field, the purchase status from Google Play, to a new dictionary alongside some other fields which are taken from the MySQL query.
4) Finally, loop over my dict as follows and prepare the desired report.
After editing:
def build_dic_from_db(data,access_token):
    dic = {}
    for row in data:
        product_id = row['product_id']
        purchase_id = row['purchase_id']
        status = check_purchase_status(access_token, product_id,purchase_id)
        cnt = 1
        if row['user'] not in dic:
            dic[row['user']] = {'id':row['user_id'],'country': row['country_name'],'reg_ts': row['user_registration_timestamp'],'last_active_ts': row['user_last_active_action_timestamp'],
                                'total_credits': row['user_credits'],'total_call_sec_this_month': row['outgoing_call_seconds_this_month'],'user_status':row['user_status'],'mobile':row['user_mobile_phone_number_num'],'plus':row['user_assigned_msisdn_num'],
                                row['product_id']:{'tAttemp': cnt,'tCancel': status}}
        else:
            if row['product_id'] not in dic[row['user']]:
                dic[row['user']][row['product_id']] = {'tAttemp': cnt,'tCancel':status}
            else:
                dic[row['user']][row['product_id']]['tCancel'] += status
                dic[row['user']][row['product_id']]['tAttemp'] += cnt
    return dic
The problem is that my code runs slowly (total execution time: 448.7483880519867), and I am wondering whether there is a way to improve my script. Are there any suggestions?
I hope I'm right about this, but the bottleneck seems to be the connection to the Play Store. Making the requests sequentially takes a long time, whereas the server can handle many requests concurrently. So here's a way to process your jobs with executors (the concurrent.futures module is in the standard library from Python 3.2; on Python 2 you need the futures backport installed).
In this example, up to 100 requests can be in flight at the same time.
from concurrent import futures

EXECUTORS = futures.ThreadPoolExecutor(max_workers=100)
jobs = dict()
for row in data:
    product_id = row['product_id']
    purchase_id = row['purchase_id']
    job = EXECUTORS.submit(check_purchase_status,
                           access_token, product_id,purchase_id)
    jobs[job] = row
for job in futures.as_completed(jobs.keys()):
    # here collect your results and do something useful with them :)
    status = job.result()
    # make the connection with current row
    row = jobs[job]
    # now you have your status and the row
And by the way, try to use temporary variables; otherwise you are constantly accessing your dictionary with the same keys, which is bad for both the performance and the readability of your code.
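The temporary-variable advice could be applied to the dictionary-building loop roughly as follows. This is a sketch under assumptions: the status is taken to be a number (for example 0 or 1) that has already been fetched, only the 'id' field is copied for brevity, and the function name build_dic is hypothetical:

```python
def build_dic(rows_with_status):
    """Build the per-user report from (row, status) pairs.

    Each status is assumed to be numeric and already fetched, for example
    from the as_completed loop.
    """
    dic = {}
    for row, status in rows_with_status:
        # Look each nested dict up once and keep it in a local name.
        user = dic.setdefault(row["user"], {"id": row["user_id"]})
        product = user.setdefault(row["product_id"], {"tAttemp": 0, "tCancel": 0})
        product["tAttemp"] += 1
        product["tCancel"] += status
    return dic
```

Using setdefault replaces the three-way if/else: a missing user or product entry is created with zeroed counters, and the increments are the same in every case.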
I'm trying to create a Reddit scraper that takes the first 100 pages from the Reddit home page and stores them into MongoDB. I keep getting the error:
TypeError: 'Collection' object is not callable. If you meant to call the 'insert_one' method on a 'Collection' object it is failing because no such method exists.
Here is my code
import os
import sys
import pymongo
import praw
import time

def main():
    fpid = os.fork()
    if fpid!=0:
        # Running as daemon now. PID is fpid
        sys.exit(0)
    user_agent = ("Python Scraper by djames v0.1")
    r = praw.Reddit(user_agent = user_agent) #Reddit API requires user agent
    conn=pymongo.MongoClient()
    db = conn.reddit
    threads = db.threads
    while 1==1: #Runs in an infinite loop, loop repeats every 30 seconds
        frontpage_pull = r.get_front_page(limit=100) #get first 100 posts from reddit.com
        for posts in frontpage_pull: #repeats for each of the 100 posts pulled
            data = {}
            data['title'] = posts.title
            data['text'] = posts.selftext
            threads.insert_one(data)
        time.sleep(30)

if __name__ == "__main__":
    main()
insert_one() was not added to pymongo until version 3.0. If you try calling it on a version before that, you will get the error you are seeing.
To check your version of pymongo, open a Python interpreter and enter:
import pymongo
pymongo.version
The legacy way of inserting documents with pymongo is just with Collection.insert(). So in your case you can change your insert line to:
threads.insert(data)
For more info, see the pymongo 2.8 documentation.
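If the same script has to run against both old and new installations, one option is to pick the method from the installed version. This is only a sketch: the helper name insert_document is hypothetical, and the caller is assumed to pass in pymongo.version_tuple.

```python
def insert_document(collection, document, version_tuple):
    """Insert with insert_one on PyMongo >= 3.0, else with the legacy insert.

    Pass pymongo.version_tuple as `version_tuple`. A hasattr() check would
    not help here: on old versions, attribute access on a Collection returns
    a sub-collection instead of raising AttributeError.
    """
    if tuple(version_tuple[:2]) >= (3, 0):
        return collection.insert_one(document)
    return collection.insert(document)
```

In the scraper above, the call would then read insert_document(threads, data, pymongo.version_tuple).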