There are exactly 100 items per page. I'm assuming it is some type of memory limit that's causing it to be killed. Also I have a feeling appending the items to a list variable is most likely not best practice when it comes to memory efficiency. Would opening a text file and writing to it be better? I've done a test with 10 pages and it creates the list successfully taking about 12 seconds to do so. When I try with 9500 pages however, the process gets automatically killed in about an hour.
import requests
from bs4 import BeautifulSoup
import timeit

def lol_scrape():
    start = timeit.default_timer()
    summoners_listed = []
    for i in range(9500):
        URL = "https://www.op.gg/leaderboards/tier?region=na&page=" + str(i + 1)
        user_agent = {user-agent}  # placeholder for a real User-Agent header dict
        page = requests.get(URL, headers=user_agent)
        soup = BeautifulSoup(page.content, "html.parser")
        results = soup.find('tbody')
        summoners = results.find_all('tr')
        for i in range(len(summoners)):
            name = summoners[i].find('strong')
            summoners_listed.append(name.string)
    stop = timeit.default_timer()
    print('Time: ', stop - start)
    return summoners_listed
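As a side note on the "write to a text file instead" idea raised above, here is a minimal sketch that streams each name straight to disk so nothing accumulates in one giant list (the User-Agent value is a placeholder, and the selectors are the ones from the code above):

import requests
from bs4 import BeautifulSoup

headers = {"User-Agent": "..."}  # placeholder, same as in the original code

with open("summoners.txt", "w", encoding="utf-8") as out:
    for i in range(9500):
        url = "https://www.op.gg/leaderboards/tier?region=na&page=" + str(i + 1)
        page = requests.get(url, headers=headers)
        soup = BeautifulSoup(page.content, "html.parser")
        for row in soup.find('tbody').find_all('tr'):
            name = row.find('strong')
            if name and name.string:
                out.write(name.string + "\n")  # one name per line, nothing kept in memory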
Credit to #1extralime
All I did was make a csv for every page instead of continually appending to one super long list.
import requests
from bs4 import BeautifulSoup
import timeit
import pandas as pd

def lol_scrape():
    start = timeit.default_timer()
    for i in range(6500):
        # Moved variable inside loop to reset it every iteration
        summoners_listed = []
        URL = "https://www.op.gg/leaderboards/tier?region=na&page=" + str(i + 1)
        user_agent = {user-agent}  # placeholder for a real User-Agent header dict
        page = requests.get(URL, headers=user_agent)
        soup = BeautifulSoup(page.content, "html.parser")
        results = soup.find('tbody')
        summoners = results.find_all('tr')
        for x in range(len(summoners)):
            name = summoners[x].find('strong')
            summoners_listed.append(name.string)
        # Make a new df with the list values then save to a new csv
        df = pd.DataFrame(summoners_listed)
        df.to_csv('all_summoners/summoners_page' + str(i + 1))
    stop = timeit.default_timer()
    print('Time: ', stop - start)
Also, as a note to my future self or anyone else reading: this method is way superior, because if the process had failed at any time I would still have had all the successful CSVs saved and could just restart where it left off.
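To make the combine step concrete, here is a minimal sketch that stitches the per-page CSVs back into a single frame (the file-name pattern follows the code above; the combined file name is made up). To resume an interrupted run, you would skip any page whose file already appears in paths.

import glob
import pandas as pd

# Gather every per-page CSV written by lol_scrape() and concatenate them.
# The files were written with a default integer index, hence index_col=0.
paths = sorted(glob.glob('all_summoners/summoners_page*'))
frames = [pd.read_csv(p, index_col=0) for p in paths]

all_summoners = pd.concat(frames, ignore_index=True)
all_summoners.to_csv('all_summoners_combined.csv', index=False)
print(len(all_summoners), 'summoner names combined from', len(paths), 'pages')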
[Image: the web page being scraped]
[Image: the wrong output I get]
So basically I am trying to scrape the rows of streamers on each page, which use the tag name "tr". In each row there are multiple columns that I want to include in my output. I was able to include almost all of those columns, but two of them share the same tag name and frustrated me a lot (the two columns about followers). I tried indexing them to get only odd or even entries, but the result, shown in the second picture, did not work out well: the numbers just keep repeating themselves instead of going down the way they should. Is there some way to get the "followers gained" column correctly into the output?
It's my first time asking here, so I am not sure if this is enough. I am glad to add more info later if needed.
import requests
from bs4 import BeautifulSoup

for i in range(30):  # Number of pages plus one
    url = "https://twitchtracker.com/channels/viewership?page={}&searchbox=Course+Name&searchbox_zip=ZIP&distance=50&price_range=0&course_type=both&has_events=0".format(i)
    headers = {'User-agent': 'Mozilla/5.0'}
    page = requests.get(url, headers=headers)
    soup = BeautifulSoup(page.content)
    channels = soup.find_all('tr')
    for idx, channel in enumerate(channels):
        if idx % 2 == 1:
            idx += 1
        Name = ", ".join([p.get_text(strip=True) for p in channel.find_all('a', attrs={'style': 'color:inherit'})])
        Avg = ", ".join([p.get_text(strip=True) for p in channel.find_all('td', class_='color-viewers')])
        Time = ", ".join([p.get_text(strip=True) for p in channel.find_all('td', class_='color-streamed')])
        All = ", ".join([p.get_text(strip=True) for p in channel.find_all('td', class_='color-viewersMax')])
        HW = ", ".join([p.get_text(strip=True) for p in channel.find_all('td', class_='color-watched')])
        FG = ", ".join([soup.find_all('td', class_='color-followers hidden-sm')[idx].get_text(strip=True)])
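For what it's worth, a minimal sketch of a per-row alternative for the followers columns: look the cells up inside each channel row instead of indexing into a page-wide find_all. Which followers cell is "gained" versus "total", and the exact class names, are assumptions about the page layout that I have not verified.

# Sketch only: inspect the followers cells row by row rather than across the whole page.
for channel in channels:
    follower_cells = channel.find_all('td', class_='color-followers')
    follower_values = [cell.get_text(strip=True) for cell in follower_cells]
    print(follower_values)  # check the per-row order to see which value is "followers gained"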
Maybe an alternative approach?
It uses pandas to read the tables; you just have to clean the ads out.
I also used time.sleep() to delay the loops and be gentle to the server.
Example
import requests, time
import pandas as pd

df_list = []

for i in range(30):  # Number of pages plus one
    url = "https://twitchtracker.com/channels/viewership?page={}&searchbox=Course+Name&searchbox_zip=ZIP&distance=50&price_range=0&course_type=both&has_events=0".format(i)
    headers = {'User-agent': 'Mozilla/5.0'}
    page = requests.get(url, headers=headers)
    df_list.append(pd.read_html(page.text)[0])
    time.sleep(1.5)

df = pd.concat(df_list).reset_index(drop=True)
df.rename(columns={'Unnamed: 2': 'Name'}, inplace=True)
df.drop(df.columns[[0, 1]], axis=1, inplace=True)
df[~df.Rank.str.contains(".ads { display:")].to_csv('table.csv', mode='w', index=False)
I'm trying to scrape https://arxiv.org/search/?query=healthcare&searchtype=all with Selenium and Python. The for loop takes too long to execute. I tried scraping with headless browsers and PhantomJS, but they don't scrape the abstract field (I need the abstract field expanded, i.e. with the "More" button clicked).
import pandas as pd
import selenium
import re
import time
from selenium.common.exceptions import NoSuchElementException
from selenium.webdriver import Firefox

browser = Firefox()
url_healthcare = 'https://arxiv.org/search/?query=healthcare&searchtype=all'
browser.get(url_healthcare)

dfs = []

for i in range(1, 39):
    articles = browser.find_elements_by_tag_name('li[class="arxiv-result"]')
    for article in articles:
        title = article.find_element_by_tag_name('p[class="title is-5 mathjax"]').text
        arxiv_id = article.find_element_by_tag_name('a').text.replace('arXiv:', '')
        arxiv_link = article.find_elements_by_tag_name('a')[0].get_attribute('href')
        pdf_link = article.find_elements_by_tag_name('a')[1].get_attribute('href')
        authors = article.find_element_by_tag_name('p[class="authors"]').text.replace('Authors:', '')
        try:
            link1 = browser.find_element_by_link_text('▽ More')
            link1.click()
        except:
            time.sleep(0.1)
        abstract = article.find_element_by_tag_name('p[class="abstract mathjax"]').text
        date = article.find_element_by_tag_name('p[class="is-size-7"]').text
        date = re.split(r"Submitted|;", date)[1]
        tag = article.find_element_by_tag_name('div[class="tags is-inline-block"]').text.replace('\n', ',')
        try:
            doi = article.find_element_by_tag_name('div[class="tags has-addons"]').text
            doi = re.split(r'\s', doi)[1]
        except NoSuchElementException:
            doi = 'None'
        all_combined = [title, arxiv_id, arxiv_link, pdf_link, authors, abstract, date, tag, doi]
        dfs.append(all_combined)
    print('Finished Extracting Page:', i)
    try:
        link2 = browser.find_element_by_class_name('pagination-next')
        link2.click()
    except:
        browser.close()  # close the browser when there is no next page
    time.sleep(0.1)
The following implementation achieves this in 16 seconds.
To speed up the execution process I have taken the following measures:
Removed Selenium entirely (No clicking required)
For abstract, used BeautifulSoup's output and processed it later
Added multiprocessing to speed up the process significantly
from multiprocessing import Process, Manager
import requests
from bs4 import BeautifulSoup
import re
import time

start_time = time.time()

def get_no_of_pages(showing_text):
    no_of_results = int(re.findall(r"(\d+,*\d+) results for all", showing_text)[0].replace(',', ''))
    pages = no_of_results // 200 + 1
    print("total pages:", pages)
    return pages

def clean(text):
    return text.replace("\n", '').replace(" ", '')

def get_data_from_page(url, page_number, data):
    print("getting page", page_number)
    response = requests.get(url + "start=" + str(page_number * 200))
    soup = BeautifulSoup(response.content, "lxml")
    arxiv_results = soup.find_all("li", {"class", "arxiv-result"})
    for arxiv_result in arxiv_results:
        paper = {}
        paper["titles"] = clean(arxiv_result.find("p", {"class", "title is-5 mathjax"}).text)
        links = arxiv_result.find_all("a")
        paper["arxiv_ids"] = links[0].text.replace('arXiv:', '')
        paper["arxiv_links"] = links[0].get('href')
        paper["pdf_link"] = links[1].get('href')
        paper["authors"] = clean(arxiv_result.find("p", {"class", "authors"}).text.replace('Authors:', ''))
        split_abstract = arxiv_result.find("p", {"class": "abstract mathjax"}).text.split("▽ More\n\n\n", 1)
        if len(split_abstract) == 2:
            paper["abstract"] = clean(split_abstract[1].replace("△ Less", ''))
        else:
            paper["abstract"] = clean(split_abstract[0].replace("△ Less", ''))
        paper["date"] = re.split(r"Submitted|;", arxiv_result.find("p", {"class": "is-size-7"}).text)[1]
        paper["tag"] = clean(arxiv_result.find("div", {"class": "tags is-inline-block"}).text)
        doi = arxiv_result.find("div", {"class": "tags has-addons"})
        if doi is None:
            paper["doi"] = "None"
        else:
            paper["doi"] = re.split(r'\s', doi.text)[1]
        data.append(paper)
    print(f"page {page_number} done")

if __name__ == "__main__":
    url = 'https://arxiv.org/search/?searchtype=all&query=healthcare&abstracts=show&size=200&order=-announced_date_first&'
    response = requests.get(url + "start=0")
    soup = BeautifulSoup(response.content, "lxml")
    with Manager() as manager:
        data = manager.list()
        processes = []
        get_data_from_page(url, 0, data)
        showing_text = soup.find("h1", {"class": "title is-clearfix"}).text
        for i in range(1, get_no_of_pages(showing_text)):
            p = Process(target=get_data_from_page, args=(url, i, data))
            p.start()
            processes.append(p)
        for p in processes:
            p.join()
        print("Number of entries scraped:", len(data))
        stop_time = time.time()
        print("Time taken:", stop_time - start_time, "seconds")
Output:
>>> python test.py
getting page 0
page 0 done
total pages: 10
getting page 1
getting page 4
getting page 2
getting page 6
getting page 5
getting page 3
getting page 7
getting page 9
getting page 8
page 9 done
page 4 done
page 1 done
page 6 done
page 2 done
page 7 done
page 3 done
page 5 done
page 8 done
Number of entries scraped: 1890
Time taken: 15.911492586135864 seconds
Note:
Please write the above code in a .py file. For Jupyter notebooks, refer to this.
The multiprocessing code was taken from here.
The ordering of entries in the data list won't match the ordering on the website, as the Manager appends dictionaries to it as they arrive.
The above code finds the number of pages on its own and is thus generalized to work on any arXiv search result. Unfortunately, to do this it first fetches page 0, then calculates the number of pages, and only then starts multiprocessing for the remaining pages. The disadvantage is that while page 0 is being worked on, no other process is running. If you remove that part and simply run the loop for 10 pages, the time taken should drop to around 8 seconds.
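For instance, a minimal sketch of that variant, assuming the page count (10 here) is known or hard-coded up front, reusing get_data_from_page and url from the code above:

from multiprocessing import Process, Manager

# Sketch only: launch every page, including page 0, as its own worker process.
# Assumes a fixed, known page count instead of deriving it from page 0 first.
PAGES = 10

if __name__ == "__main__":
    with Manager() as manager:
        data = manager.list()
        processes = [Process(target=get_data_from_page, args=(url, i, data)) for i in range(PAGES)]
        for p in processes:
            p.start()
        for p in processes:
            p.join()
        print("Number of entries scraped:", len(data))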
You can try a requests and BeautifulSoup approach. There is no need to click the "More" link.
from requests import get
from bs4 import BeautifulSoup

# You can change the size parameter to retrieve all the results in one shot.
url = 'https://arxiv.org/search/?query=healthcare&searchtype=all&abstracts=show&order=-announced_date_first&size=50&start=0'
response = get(url, verify=False)
soup = BeautifulSoup(response.content, "lxml")
# print(soup)

queryresults = soup.find_all("li", attrs={"class": "arxiv-result"})

for result in queryresults:
    title = result.find("p", attrs={"class": "title is-5 mathjax"})
    print(title.text)

# If you need the full abstract content, try this (you do not need to click on the "More" button).
for result in queryresults:
    abstractFullContent = result.find("span", attrs={"class": "abstract-full has-text-grey-dark mathjax"})
    print(abstractFullContent.text)
Output:
Interpretable Deep Learning for Automatic Diagnosis of 12-lead Electrocardiogram
Leveraging Technology for Healthcare and Retaining Access to Personal Health Data to Enhance Personal Health and Well-being
Towards new forms of particle sensing and manipulation and 3D imaging on a smartphone for healthcare applications
With a CSV of 20k+ URLs, I want to scrape each one and find the HTML element "super-attribute-select". If it is found, write the URL to column A and the product number (SKU) to column B. If not found, write the URL to column C and the SKU to column D. Finally, save the dataframe to a CSV file.
If I run the following code it works, but my program runs out of memory. I'd like to find a way to optimize this. Right now ~1500 URLs take 5 hours to process, while the entire CSV is 20k.
import urllib.request
from bs4 import BeautifulSoup
import pandas as pd
import numpy as np
from pandas import Series

urlList = pd.read_csv(r"url.csv")
urlList = urlList.url.tolist()

notfound = []
found = []
skulist = []
skumissinglist = []

# Function scrape: pass url, open with soup, and find class
def scrape(url):
    tag = 'select'
    classused = "super-attribute-select"
    d = dict(A=np.array(found), B=np.array(skulist), C=np.array(notfound), D=np.array(skumissinglist))
    try:
        content = urllib.request.urlopen(url)
        soup = BeautifulSoup(content, features="html.parser")
        sku = soup.find("div", {"itemprop": "sku"}).string
        result = soup.find(tag, class_=classused)
        # soup returns None if it can't find anything
        if result == None:
            notfound.append(url)
            skumissinglist.append(sku)
        else:
            found.append(url)
            skulist.append(sku)
    except:
        result = print("Some extraction went wrong")
    df = pd.DataFrame(dict([(k, Series(v)) for k, v in d.items()]))
    df = df.to_csv('Test.csv')

for i in urlList:
    scrape(i)
If I were doing this, I would try a few things:
(1) Update a dictionary instead of appending to a list. I think dictionaries are faster and more memory-efficient than lists.
(2) Rather than export each URL result as a CSV with the same name, either (a) preferred: wait until you are done to export all results as a single CSV, or (b) worse: maybe export them to different filenames by using f-strings instead of overwriting 'Test.csv' every time.
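A minimal sketch of suggestion (2a): accumulate results in a plain dict keyed by URL and write the CSV once at the end. The check_url helper and the column names here are illustrative, not from the original code, and urlList is the list read from url.csv in the question.

import urllib.request
from bs4 import BeautifulSoup
import pandas as pd

results = {}  # url -> (sku, found_flag); filled as we go, written to disk once at the end

def check_url(url):
    # Illustrative helper: returns (sku, found_flag) for one URL.
    content = urllib.request.urlopen(url)
    soup = BeautifulSoup(content, features="html.parser")
    sku_tag = soup.find("div", {"itemprop": "sku"})
    sku = sku_tag.string if sku_tag else None
    found = soup.find('select', class_="super-attribute-select") is not None
    return sku, found

for url in urlList:
    try:
        results[url] = check_url(url)
    except Exception:
        results[url] = (None, False)

# Single write at the end instead of rewriting Test.csv on every URL.
rows = [{'url': u, 'sku': s, 'found': f} for u, (s, f) in results.items()]
pd.DataFrame(rows).to_csv('Test.csv', index=False)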
You could use a pool, either with gevent or the built-in one from urllib3 (or requests). Then you could do 10 or 100 at a time depending on the pool size, and use an async queue to pick up the remaining URLs as the pool frees up.
from gevent import monkey, spawn, joinall
monkey.patch_all()
from gevent.pool import Pool as GeventPool
import pandas as pd
from pandas import Series
import numpy as np
import requests
from bs4 import BeautifulSoup

urlList = pd.read_csv(r"url.csv")
urlList = urlList.url.tolist()

pool = GeventPool(10)

notfound = []
found = []
skulist = []
skumissinglist = []
count = len(urlList)

# Function scrape: pass url, open with soup, and find class
def scrape(url):
    tag = 'select'
    classused = "super-attribute-select"
    d = dict(A=np.array(found), B=np.array(skulist), C=np.array(notfound), D=np.array(skumissinglist))
    try:
        content = requests.get(url).text
        soup = BeautifulSoup(content, features="html.parser")
        sku = soup.find("div", {"itemprop": "sku"}).string
        result = soup.find(tag, class_=classused)
        # soup returns None if it can't find anything
        if result == None:
            notfound.append(url)
            skumissinglist.append(sku)
        else:
            found.append(url)
            skulist.append(sku)
    except:
        print("Some extraction went wrong")
    df = pd.DataFrame(dict([(k, Series(v)) for k, v in d.items()]))
    return df.to_csv('Test.csv')

pool.map(scrape, urlList)
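One possible refinement (not part of the original answer): if scrape() is trimmed so it only appends to the four lists, the CSV can be written once after the pool has drained rather than on every call. A minimal sketch under that assumption:

# Assumes scrape() has been trimmed to only append to found/skulist/notfound/skumissinglist
# and no longer builds a DataFrame or writes Test.csv itself.
pool.map(scrape, urlList)  # gevent's Pool.map blocks until every greenlet has finished

d = dict(A=Series(found), B=Series(skulist), C=Series(notfound), D=Series(skumissinglist))
pd.DataFrame(d).to_csv('Test.csv', index=False)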
I am trying to extract data from a website and have the following code, which extracts all URLs from the main category and its sub-category links.
I am now stuck on saving the extracted output to a file, Medical.tsv, with a line separator so that each URL goes on a separate line.
I need help with this.
Code is given below:
from bs4 import BeautifulSoup
import requests
import time
import random

def write_to_file(file, mode, data, newline=None, with_tab=None):  # **
    with open(file, mode, encoding='utf-8') as l:
        if with_tab == True:
            data = ''.join(data)
        if newline == True:
            data = data + '\n'
        l.write(data)

def get_soup(url):
    return BeautifulSoup(requests.get(url).content, "lxml")

url = 'http://www.medicalexpo.com/'
soup = get_soup(url)
raw_categories = soup.select('div.univers-main li.category-group-item a')
category_links = {}

for cat in raw_categories:
    t0 = time.time()
    soup = get_soup(cat['href'])
    response_delay = time.time() - t0  # Wait 10x longer than it took the site to respond,
    time.sleep(10 * response_delay)    # so the code automatically backs off if the site gets overwhelmed and slows down.
    time.sleep(random.randint(2, 5))   # Random 2-5 second pauses, crawling more like a human than a bot.
    links = soup.select('#category-group li a')
    category_links[cat.text] = [link['href'] for link in links]

print(category_links)
You wrote the write_to_file function, but you never call it. Also, the mode has to be 'w' or 'w+' if you want to overwrite the file in case it already exists.
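For example, a minimal sketch that calls the question's own write_to_file helper after the loop, putting each URL on its own line of Medical.tsv (the tab-separated category/URL layout is an assumption):

# Sketch: one tab-separated "category<TAB>url" line per URL, using the question's helper as-is.
# Use mode 'w' for the very first write (or delete an old Medical.tsv beforehand), 'a' afterwards.
for category, links in category_links.items():
    for link in links:
        write_to_file('Medical.tsv', 'a', [category, '\t', link], newline=True, with_tab=True)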
I am working on a web scraper at the moment. Right now I have it grabbing a list of URLs. I need it to feed each URL from that list into the soup function, one at a time, to get my desired HTML output from each individual page.
Example:
import datetime
import cfscrape
from bs4 import BeautifulSoup

my_list = ['www.google1213.com', 'www.yahoo123.com', 'www.apples123.com']

def main():
    url = input('URL: ')  # List goes here
    currentDT = datetime.datetime.now()
    scraper = cfscrape.create_scraper()
    response = scraper.get(url).content
    soup = BeautifulSoup(response, "lxml")
    # etc...

while True:
    main()
If anyone can help me get my list to feed its contents in so I can scrape each URL one at a time, I would be very grateful!
from datetime import datetime
import cfscrape
from bs4 import BeautifulSoup

def main():
    for url in my_list:
        currentDT = datetime.now()
        scraper = cfscrape.create_scraper()
        response = scraper.get(url).content
        soup = BeautifulSoup(response, "lxml")
You can use a simple for loop:
for url in my_list:
    print(url)
    # do your scraping stuff...
PS: maybe you should also limit your requests per second. Otherwise some websites will block you after a few tries.
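For instance, a minimal sketch of that rate limiting, assuming my_list and the soup-building code from the question:

import time
import cfscrape
from bs4 import BeautifulSoup

scraper = cfscrape.create_scraper()

for url in my_list:
    response = scraper.get(url).content
    soup = BeautifulSoup(response, "lxml")
    # ... do your scraping stuff ...
    time.sleep(1)  # simple throttle: at most one request per second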