I am data mining a website using Beautiful Soup. The first page is Scoutmob's map, where I grab each city, open up the page, and grab the URL of each deal in that city.
Currently I'm not using threads and everything is processed serially. For all ~500 deals (across all the cities), my program currently takes about 400 seconds.
For practice, I wanted to modify my code to use threading. I have read some tutorials and examples on how to create queues in Python, but I don't want to create 500 threads to process 500 URLs.
Instead I want to create about 20 worker threads to process all the URLs. Can someone show me an example of how 20 threads can process 500 URLs in a queue?
I want each worker to grab an unprocessed URL from the queue, data mine it, and then, once finished, work on another unprocessed URL. Each worker should only exit when there are no more URLs in the queue.
By the way, while each worker is data mining, it also writes the content to a database. So there might be issues with threading in the database, but that is another question for another day :-).
Thanks in advance!
For your example, creating worker queues is probably overkill. You might have better luck if you grab the RSS feed published for each of the pages rather than parsing the HTML, which is slower. I slapped together the quick little script below that parses it all in a total of ~13 seconds: ~8 seconds to grab the cities and ~5 seconds to parse all the RSS feeds.
In today's run it grabs 310 total deals from 13 cities (there are a total of 20 cities listed, but 7 of them are listed as "coming soon").
#!/usr/bin/env python
from lxml import etree, html
from urlparse import urljoin
import time
t = time.time()
base = 'http://scoutmob.com/'
main = html.parse(base)
cities = [x.split('?')[0] for x in main.xpath("//a[starts-with(@class, 'cities-')]/@href")]
urls = [urljoin(base, x + '/today') for x in cities]
docs = [html.parse(url) for url in urls]
feeds = [doc.xpath("//link[@rel='alternate']/@href")[0] for doc in docs]
# filter out the "coming soon" feeds
feeds = [x for x in feeds if x != 'http://feeds.feedburner.com/scoutmob']
print time.time() - t
print len(cities), cities
print len(feeds), feeds
t = time.time()
items = [etree.parse(x).xpath("//item") for x in feeds]
print time.time() - t
count = sum(map(len, items))
print count
Yields this output:
7.79690480232
20 ['/atlanta', '/new-york', '/san-francisco', '/washington-dc', '/charlotte', '/miami', '/philadelphia', '/houston', '/minneapolis', '/phoenix', '/san-diego', '/nashville', '/austin', '/boston', '/chicago', '/dallas', '/denver', '/los-angeles', '/seattle', '/portland']
13 ['http://feeds.feedburner.com/scoutmob/atl', 'http://feeds.feedburner.com/scoutmob/nyc', 'http://feeds.feedburner.com/scoutmob/sf', 'http://scoutmob.com/washington-dc.rss', 'http://scoutmob.com/nashville.rss', 'http://scoutmob.com/austin.rss', 'http://scoutmob.com/boston.rss', 'http://scoutmob.com/chicago.rss', 'http://scoutmob.com/dallas.rss', 'http://scoutmob.com/denver.rss', 'http://scoutmob.com/los-angeles.rss', 'http://scoutmob.com/seattle.rss', 'http://scoutmob.com/portland.rss']
4.76977992058
310
Just implement it. You've pretty much talked yourself through the answer right there:
I want each worker to grab an unprocessed URL from the queue, data mine it, and then, once finished, work on another unprocessed URL. Each worker should only exit when there are no more URLs in the queue.
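For completeness, here is a minimal sketch of that pattern using the standard library threading and queue modules (the module is named Queue on Python 2). deal_urls stands in for the ~500 URLs you collected and scrape() for your Beautiful Soup / database work; both are placeholders, not part of your code:

import threading
from queue import Queue   # the module is named "Queue" on Python 2

NUM_WORKERS = 20

def scrape(url):
    # placeholder for your Beautiful Soup parsing and database write
    pass

def worker(q):
    while True:
        url = q.get()
        if url is None:        # sentinel value: no more URLs, so the worker exits
            q.task_done()
            break
        try:
            scrape(url)
        finally:
            q.task_done()

q = Queue()
for url in deal_urls:          # deal_urls: the list of ~500 deal URLs
    q.put(url)
for _ in range(NUM_WORKERS):   # one sentinel per worker so each can exit cleanly
    q.put(None)

threads = [threading.Thread(target=worker, args=(q,)) for _ in range(NUM_WORKERS)]
for t in threads:
    t.start()
for t in threads:
    t.join()                   # every worker has exited once the queue is drained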
I have a nice URL structure and I want to iterate through the URLs and download all the images from them. I am trying to use BeautifulSoup to get the job done, along with the requests library.
Here is the URL - https://sixmorevodka.com/#&gid=0&pid={i}, and I want 'i' to iterate from say 1 to 100 for this example.
from bs4 import BeautifulSoup as soup
import requests, contextlib, re, os

@contextlib.contextmanager
def get_images(url:str):
    d = soup(requests.get(url).text, 'html.parser')
    yield [[i.find('img')['src'], re.findall('(?<=\.)\w+$', i.find('img')['alt'])[0]] for i in d.find_all('a') if re.findall('/image/\d+', i['href'])]

n = 100 #end value
for i in range(n):
    with get_images(f'https://sixmorevodka.com/#&gid=0&pid={i}') as links:
        print(links)
        for c, [link, ext] in enumerate(links, 1):
            with open(f'ART/image{i}{c}.{ext}', 'wb') as f:
                f.write(requests.get(f'https://sixmorevodka.com{link}').content)
I think I either messed something up in the yield line or in the very last write line. Could someone help me out, please? I am using Python 3.7.
In looking at the structure of that webpage, your gid parameter is invalid. To see for yourself, open a new tab and navigate to https://sixmorevodka.com/#&gid=0&pid=22.
You'll notice that none of the portfolio images are displayed. gid can be a value 1-5, denoting the grid to which an image belongs.
Regardless, your current scraping methodology is inefficient, and puts undue traffic on the website. Instead, you only need to make this request once, and extract the urls actually containing the images using the ilb portfolio__grid__item class selector.
Then, you can iterate and download those urls, which are directly the source of the images.
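A rough sketch of that approach (the class selector comes from the answer above; the site's markup, and whether the hrefs are absolute or relative, are assumptions, so adjust as needed):

from bs4 import BeautifulSoup
from urllib.parse import urljoin
import os
import requests

base = 'https://sixmorevodka.com/'
os.makedirs('ART', exist_ok=True)

# a single request to the page; the grid anchors point at the image files themselves
page = BeautifulSoup(requests.get(base).text, 'html.parser')
image_urls = [urljoin(base, a['href']) for a in page.select('a.ilb.portfolio__grid__item')]

for i, img_url in enumerate(image_urls, 1):
    ext = os.path.splitext(img_url)[1] or '.jpg'   # fall back to .jpg if the URL has no extension
    with open(f'ART/image{i}{ext}', 'wb') as f:
        f.write(requests.get(img_url).content)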
So I developed a script that would pull data from a live-updated site tracking coronavirus data. I set it up to pull data every 30 minutes but recently tested it on updates every 30 seconds.
The idea is that it creates the request to the site, pulls the html, creates a list of all of the data I need, then restructures into a dataframe (basically it's the country, the cases, deaths, etc.).
Then it will take each row and append to the rows of each of the 123 excel files that are for the various countries. This will work well for, I believe, somewhere in the range of 30-50 iterations before it either causes file corruptions or weird data entries.
I have my code below. I know it's poorly written (my initial reasoning was that I felt confident I could set it up quickly and I wanted to collect data quickly... unfortunately I overestimated my abilities, but now I want to learn what went wrong). Below my code I'll include sample output.
PLEASE note that the 30-second pull interval is only for quick testing; I wouldn't normally send requests this frequently for months on end. I just wanted to see what the issue was. Originally it was set to pull every 30 minutes when I detected this issue.
See below for the code:
import schedule
import time

def RecurringProcess2():
    import requests
    from bs4 import BeautifulSoup
    import pandas as pd
    import datetime
    import numpy as np
    from os import listdir
    import os
    try:
        extractTime = datetime.datetime.now()
        extractTime = str(extractTime)
        print("Access Initiated at " + extractTime)
        link = 'https://www.worldometers.info/coronavirus/'
        response = requests.get(link)
        soup = BeautifulSoup(response.text,'html.parser').findAll('td')#[1107].get_text()
        table = pd.DataFrame(columns=['Date and Time','Country','Total Cases','New Cases','Total Deaths','New Deaths','Total Recovered','Active Cases','Serious Critical','Total Cases/1M pop'])
        soupList = []
        for i in range(1107):
            value = soup[i].get_text()
            soupList.insert(i,value)
        table = np.reshape(soupList,(123,-1))
        table = pd.DataFrame(table)
        table.columns=['Country','Total Cases','New Cases (+)','Total Deaths','New Deaths (+)','Total Recovered','Active Cases','Serious Critical','Total Cases/1M pop']
        table['Date & Time'] = extractTime
        #Below code is run once to generate the initial files. That's it.
        # for i in range(122):
        #     fileName = table.iloc[i,0] + '.xlsx'
        #     table.iloc[i:i+1,:].to_excel(fileName)
        FilesDirectory = 'D:\\Professional\\Coronavirus'
        fileType = '.csv'
        filenames = listdir(FilesDirectory)
        DataFiles = [ filename for filename in filenames if filename.endswith(fileType) ]
        for file in DataFiles:
            countryData = pd.read_csv(file,index_col=0)
            MatchedCountry = table.loc[table['Country'] == str(file)[:-4]]
            if file == ' USA .csv':
                print("Country Data Rows: ",len(countryData))
            if os.stat(file).st_size < 1500:
                print("File Size under 1500")
            countryData = countryData.append(MatchedCountry)
            countryData.to_csv(FilesDirectory+'\\'+file, index=False)
    except:
        pass
    print("Process Complete!")
    return

schedule.every(30).seconds.do(RecurringProcess2)

while True:
    schedule.run_pending()
    time.sleep(1)
When I check the files after some number of iterations (usually successful for around 30-50), a file has either been reduced to only 2 rows, losing all the others, or it keeps appending while deleting a single entry from the row above, two entries from the row above that, and so on (essentially forming a triangle of sorts).
Above that data there would also be a few hundred empty rows. Does anyone have an idea of what is going wrong here? I'd consider this a failed attempt, but would still like to learn from it. I appreciate any help in advance.
As per my understanding, the webpage only has one table element. My suggestion would be to use pandas' read_html method, as it returns a clean, structured table.
Try the code below; you can modify it to run on your schedule:
import requests
import pandas as pd
url = 'https://www.worldometers.info/coronavirus/'
html = requests.get(url).content
df_list = pd.read_html(html)  # read_html returns a list of all tables on the page as DataFrames
df = df_list[-1]  # take the last table found
print(df)
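If you keep the one-CSV-per-country layout, one way to wire this into your existing schedule loop is to append each country's row with mode='a' instead of re-reading and rewriting whole files. This is only a sketch: the column layout of the live table changes over time, so check df.columns before relying on the first column being the country name, and the directory path is just reused from your code.

import datetime
import os
import pandas as pd
import requests

FILES_DIRECTORY = 'D:\\Professional\\Coronavirus'

def pull_and_append():
    html = requests.get('https://www.worldometers.info/coronavirus/').content
    df = pd.read_html(html)[-1]
    df['Date & Time'] = str(datetime.datetime.now())
    for _, row in df.iterrows():
        # assumes the first column holds the country name; verify against df.columns
        path = os.path.join(FILES_DIRECTORY, str(row.iloc[0]) + '.csv')
        # append one row per run; write the header only when the file is first created
        row.to_frame().T.to_csv(path, mode='a', header=not os.path.exists(path), index=False)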
Disclaimer: I'm still evaluating this solution. So far it works almost perfectly for 77 rows.
Originally I had set the script up to run for .xlsx files. I converted everything to .csv but retained the index column code:
countryData = pd.read_csv(file,index_col=0)
I started realizing that things were being ordered differently every time the script ran. I have since removed that from the code and so far it works. Almost.
Unnamed: 0 Unnamed: 0.1
0 7
7
For some reason I have the above output in every file. I don't know why. But it's in the first 2 columns yet it still seems to be reading and writing correctly. Not sure what's going on here.
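One hedged guess at the cause: pandas writes the index as an unnamed first column by default, and if that file is read back without index_col=0 and written again, the next read turns the blank header into "Unnamed: 0" and renames the now-duplicate column "Unnamed: 0.1". A tiny reproduction (hypothetical file name):

import pandas as pd

df = pd.DataFrame({'Country': ['USA'], 'Total Cases': [100]})
df.to_csv('USA.csv')              # default index=True -> a blank first header cell

df = pd.read_csv('USA.csv')       # the blank header comes back as "Unnamed: 0"
df.to_csv('USA.csv')              # the index is written again -> another blank header cell

print(pd.read_csv('USA.csv').columns.tolist())
# ['Unnamed: 0', 'Unnamed: 0.1', 'Country', 'Total Cases']

Writing with index=False on every to_csv (or consistently reading with index_col=0) avoids the extra columns.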
I'm iterating over M dataframes, each containing a column with N URLs. For each URL, I extract paragraph text, then conduct standard cleaning for textual analysis before calculating "sentiment" scores.
Is it more efficient for me to:
Continue as it is (compute scores in the URL for-loop itself)
Extract all of the text from the URLs first, and then separately iterate over the list / column of text?
Or does it not make any difference?
Currently running calculations within the loop itself. Each DF has about 15,000 - 20,000 URLs so it's taking an insane amount of time too!
# DFs are stored on a website
# I extract links to each .csv file and store it as a list in "df_links"
for link in df_links:
    cleaned_articles = []
    df = pd.read_csv(link, sep="\t", header=None)
    # Conduct df cleaning
    # URLs for articles to scrape are stored in 1 column, which I iterate over as...
    for url in df['article_url']:
        response = requests.get(url)
        soup = BeautifulSoup(response.text, 'html.parser')
        para_text = [text.get_text() for text in soup.findAll('p')]
        text = " ".join(para_text)
        words = text.split()
        if len(words) > 500:
            # Conduct Text Cleaning & Scores Computations
            # Cleaned text stored as a variable "clean_text"
            cleaned_articles.append(clean_text)
    df['article_text'] = cleaned_articles
    df.to_csv('file_name.csv')
To answer the question: it shouldn't make much of a difference whether you download all the data first and then apply the analysis, or do both in the same loop. You'd just be rearranging the order of a set of tasks that would effectively take the same time.
The only difference may be if the text corpora are rather large, in which case read/write time to disk starts to play a part, so it could be a little faster to run the analytics all in memory. But this still isn't going to really solve your problem.
May I be so bold as to reinterpret your question as: "My analysis is taking too long, help me speed it up!"
This sounds like a perfect use case for multiprocessing! Since this sounds like a data science project, you'll need to pip install multiprocess if you're using an IPython notebook (like Jupyter) or import multiprocessing if using a Python script. This is because of the way Python passes information between processes; don't worry though, the APIs for both multiprocess and multiprocessing are identical!
A basic and easy way to speed up your analysis is to indent your for loop and put it in a function. That function can then be passed to a multiprocessing map, which can spawn multiple processes and run the analysis on several urls all at once:
from multiprocess import Pool
import numpy as np
import os
import pandas as pd

num_cpus = os.cpu_count()

def analytics_function(*args):
    # Your full function, including fetching the data, goes here; it accepts an array of links
    return something

df_links_split = np.array_split(df_links, num_cpus * 2) # I normally just use 2 as a rule of thumb
pool = Pool(num_cpus * 2) # Start a pool with num_cpus * 2 processes
list_of_returned = pool.map(analytics_function, df_links_split)
This will spin up a load of processes and utilise your full cpu. You'll not be able to do much else on your computer, and you'll need to have your resource monitor open to check you're not maxing out your memory and slowing down/crashing the process. But it should significantly speed up your analysis, roughly by a factor of num_cpus * 2!
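Since pool.map hands back one result per chunk of links, you will usually want to recombine them afterwards. Assuming analytics_function returns a dataframe per chunk (that return type is an assumption), something like:

# recombine the per-chunk results and shut the pool down cleanly
combined = pd.concat(list_of_returned, ignore_index=True)
pool.close()
pool.join()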
Extracting all of the texts and then processing them, or extracting one text and processing it before extracting the next, won't make any difference.
Doing ABABAB takes as much time as doing AAABBB.
You might however be interested in using threads or asynchronous requests to fetch all of the data in parallel.
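As a minimal sketch with the standard library's concurrent.futures, assuming a fetch_and_clean(url) helper (a hypothetical name) that wraps the requests / BeautifulSoup / cleaning steps you already have:

from concurrent.futures import ThreadPoolExecutor

def fetch_and_clean(url):
    # your existing requests + BeautifulSoup + cleaning code for one url goes here
    ...

with ThreadPoolExecutor(max_workers=20) as executor:
    # executor.map preserves input order, so the results line up with df['article_url']
    cleaned_articles = list(executor.map(fetch_and_clean, df['article_url']))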
Here is my code:
from download1 import download
import threading,lxml.html

def getInfo(initial,ending):
    for Number in range(initial,ending):
        Fields = ['country', 'area', 'population', 'iso', 'capital', 'continent', 'tld', 'currency_code',
                  'currency_name', 'phone',
                  'postal_code_format', 'postal_code_regex', 'languages', 'neighbours']
        url = 'http://example.webscraping.com/places/default/view/%d'%Number
        html=download(url)
        tree = lxml.html.fromstring(html)
        results=[]
        for field in Fields:
            x=tree.cssselect('table > tr#places_%s__row >td.w2p_fw' % field)[0].text_content()
            results.append(x)#should i start writing here?

downloadthreads=[]
for i in range(1,252,63): #create 4 threads
    downloadThread=threading.Thread(target=getInfo,args=(i,i+62))
    downloadthreads.append(downloadThread)
    downloadThread.start()

for threadobj in downloadthreads:
    threadobj.join() #end of each thread

print "Done"
So results will have the values of Fields. I need to write the data with Fields as the top row (only once), and then the values in results, to a CSV file.
I am not sure I can open the file inside the function, because the threads would open it multiple times simultaneously.
Note: I know threading isn't desirable when crawling, but I am just testing.
I think you should consider using some kind of queue or a thread pool. Thread pools are really useful if you want to create several threads (not just 4; I think you would use more than 4 threads in total, but only 4 at a time).
An example of Queue technique can be found here.
Of course, you can label the files with their thread ids, for example: "results_1.txt", "results_2.txt", and so on. Then you can merge them after all the threads have finished.
You can use the basic concepts of Lock, Monitor, and so on; however, I am not the biggest fan of them. An example of locking can be found here.
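As a rough sketch of the queue idea applied to your code: each thread only puts its results list onto a shared queue, and the main thread writes the CSV (header included, once) after the joins. Module names below are Python 3 (queue); on Python 2 the module is Queue, and the output file name is just an example:

import csv
import queue          # named "Queue" on Python 2

results_queue = queue.Queue()

# inside getInfo(), after the inner for-loop, instead of writing to a file:
#     results_queue.put(results)

# after all the threadobj.join() calls in the main code:
# Fields here is the same list of column names used inside getInfo()
with open('results.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    writer.writerow(Fields)                    # header row, written exactly once
    while not results_queue.empty():
        writer.writerow(results_queue.get())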
I need to shorten multiple urls using the Google Shortener API. Since the shortening of each url is independent of the others, I decided to use the multiprocessing library in Python.
raw_data is a data frame which contains my long urls. Api_Key is a list which contains API keys from multiple Google accounts (as I don't want to hit the daily API usage limit).
Here is the underlying code.
import pandas as pd
import time
import json   # needed for json.dumps / json.loads below
from multiprocessing import Pool
import math
import requests

raw_data["Shortened"] = "xx"

Api_Key_List1 = ['Key1',
                 'Key2',
                 'Key3']

raw_data = raw_data[0:3]
LongUrl = raw_data['SMS Click traker'].tolist()
args = [[LongUrl[index], Api_Key_List1[index]] for index,value in enumerate(LongUrl)]
print("xxxx")

def goo_shorten_url(url,key):
    post_url = 'https://www.googleapis.com/urlshortener/v1/url?key=%s'%key
    payload = {'longUrl': url}
    headers = {'content-type': 'application/json'}
    time.sleep(2)
    r = requests.post(post_url, data=json.dumps(payload), headers=headers)
    return(json.loads(r.text)["id"])

def helper(args):
    return goo_shorten_url(*args)

if __name__ == '__main__':
    p = Pool(processes = 3)
    data = p.map(helper,args)
    print("main")
    raw_data["Shortened"]=data
    p.close()
    print(raw_data)
Using this code I can successfully shorten multiple urls in one go, thus saving a lot of time. Here are some questions I am having difficulty finding answers to:
What is the optimal number of sub-processes that I should be running? (I am using a simple quad-core, 8 GB RAM laptop running Windows 7.)
Sometimes, due to some issue, the API does not return a shortened url and the code gets stuck forever. This was also my experience before using multiprocessing. What happens to a process in that case?
Assuming a subprocess gets stuck on a long set of urls (70,000), how do I kill it and still ensure the code safely returns all the remaining shortened urls, properly matched in the data frame?
How can I identify the urls which were not shortened?
I am a newbie to Python programming, so bear with me. This will help me better understand the inner workings of multiprocessing.