I'm trying to scrape some blogs with Python and Selenium.
However, the page source only contains the first few articles, so I need to scroll down to trigger the AJAX that loads the rest.
Is there a way to get the full page source in one call with Selenium?
The code would be something like:
# url and page source generating
url = url_constructor_medium_news(blog_name)
content = social_data_tools.selenium_page_source_generator(driver, url)
try:
    # construct soup
    soup = BeautifulSoup(content, "html.parser").rss.channel
    # break condition
    divs = soup.find_all('item')
except AttributeError as e:
    print(e.__cause__)

# friendly
time.sleep(3 + random.randint(1, 5))
I don't believe there is a way to populate the driver with unloaded data that would otherwise be obtained by scrolling.
An alternative solution for getting the data to load would be driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
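For example, here is a minimal sketch (assuming driver has already loaded the page) that keeps scrolling until the page height stops growing and then grabs the full source; the 3-second pause is a guess you may need to tune:

import time

# keep scrolling until the page height stops growing, i.e. no more AJAX content loads
last_height = driver.execute_script("return document.body.scrollHeight")
while True:
    driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
    time.sleep(3)  # give the AJAX content time to load
    new_height = driver.execute_script("return document.body.scrollHeight")
    if new_height == last_height:
        break
    last_height = new_height

content = driver.page_source  # now contains the fully loaded articles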
I've previously used this as a reference.
I hope this helps!
I'm building a crawler that downloads all .pdf files of a given website and its subpages. For this, I've used built-in functionality around the simplified recursive function below, which retrieves all links of a given URL.
However, it becomes quite slow the longer it crawls a given website (it may take 2 minutes or longer per URL).
I can't quite figure out what's causing this and would really appreciate suggestions on what needs to be changed in order to increase the speed.
import re
import requests
from bs4 import BeautifulSoup

pages = set()

def get_links(page_url):
    global pages
    pattern = re.compile("^(/)")
    html = requests.get(f"https://www.srs-stahl.de/{page_url}").text
    soup = BeautifulSoup(html, "html.parser")

    for link in soup.find_all("a", href=pattern):
        if "href" in link.attrs:
            if link.attrs["href"] not in pages:
                new_page = link.attrs["href"]
                print(new_page)
                pages.add(new_page)
                get_links(new_page)

get_links("")
It is not easy to tell what exactly is slowing your crawl down; it may be the way you crawl, the website's server, or something else.
In your code, you request a URL, grab its links and immediately call the function recursively, so the only thing you keep track of is the URLs you have already requested.
You may want to work with "queues" to keep the process more transparent.
One advantage is that if the script aborts, you still have this information stored and can resume from the URLs you have already collected but not yet visited. Your recursion, in contrast, may have to start again from an earlier point to ensure it gets all URLs.
Another point is that you request the PDF files without using the response in any way. Wouldn't it make more sense to either download and save them directly, or skip the request and at least keep the links in a separate "queue" for post-processing? (A sketch of that download step follows after the example below.)
Collected information after the same number of iterations, for comparison:
Code in question:
pages --> 24
Example code (without delay):
urlsVisited --> 24
urlsToVisit --> 87
urlsToDownload --> 67
Example
This is just to demonstrate the approach; feel free to add functions, classes and structure to suit your needs. Note that I added some delay, but you can skip it if you like. The "queues" used to demonstrate the process are plain lists here, but they should be files, a database, etc. if you want to store your data safely.
import requests, time
from bs4 import BeautifulSoup

baseUrl = 'https://www.srs-stahl.de'

urlsToDownload = []
urlsToVisit = ["https://www.srs-stahl.de/"]
urlsVisited = []

def crawl(url):
    html = requests.get(url).text
    soup = BeautifulSoup(html, "html.parser")

    for a in soup.select('a[href^="/"]'):
        url = f"{baseUrl}{a['href']}"

        if '.pdf' in url and url not in urlsToDownload:
            urlsToDownload.append(url)
        else:
            if url not in urlsToVisit and url not in urlsVisited:
                urlsToVisit.append(url)

while urlsToVisit:
    url = urlsToVisit.pop(0)
    try:
        crawl(url)
    except Exception as e:
        print(f'Failed to crawl: {url} -> error {e}')
    finally:
        urlsVisited.append(url)
    time.sleep(2)
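To pick up the earlier point about actually saving the PDFs, a minimal follow-up sketch could look like this. It relies on the urlsToDownload list from the example above, and deriving the file name from the last URL segment is an assumption on my part that may need adjusting:

import requests

# download every collected PDF; the file name is simply the last URL segment
for pdf_url in urlsToDownload:
    file_name = pdf_url.rsplit('/', 1)[-1] or 'unnamed.pdf'
    response = requests.get(pdf_url)
    if response.ok:
        with open(file_name, 'wb') as f:
            f.write(response.content)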
I am trying to get a list of all the movies/series on my personal IMDb watchlist. I am using Selenium to click the load more button so everything shows up in the HTML code. However, when I try to scrape that data, only the first 100 movies show up.
Nothing past 'page3' shows up.
The image below shows the part of the HTML that denotes page 3:
After clicking the load button with Selenium, all the movies are shown in the Chrome pop-up. However, only the first 100/138 are printed to my console.
Here is the URL: https://www.imdb.com/user/ur130279232/watchlist
This is my current code:
URL = "https://www.imdb.com/user/ur130279232/watchlist"
driver = webdriver.Chrome()
wait = WebDriverWait(driver,20)
driver.get(URL)
while True:
    try:
        watchlist = driver.find_element_by_xpath("//div[@class='lister-list mode-detail']")
        watchlistHTML = watchlist.get_attribute('innerHTML')
        loadMoreButton = driver.find_element_by_xpath("//button[@class='load-more']")
        soup = BeautifulSoup(watchlistHTML, 'html.parser')
        content = soup.find_all('h3', class_ ='lister-item-header')
        #pdb.set_trace()
        print('length: ',len(content))
        for elem in content:
            print(elem.find('a').contents[0])
        time.sleep(2)
        loadMoreButton.click()
        time.sleep(5)
    except Exception as e:
        print(e)
        break
Why is that, even though after clicking the load more button "lister-list mode-detail" includes everything up until Sound of Music?
The rest of the data is returned by an HTTP GET request (triggered when you scroll down and hit Load more) to
https://www.imdb.com/title/data?ids=tt0144117,tt0116996,tt0106179,tt0118589,tt11252090,tt13132030,tt6083778,tt0106611,tt0115685,tt1959563,tt8385148,tt0118971,tt0340855,tt8629748,tt13932270,tt11185940,tt5580390,tt4975722,tt2024544,tt1024648,tt1504320,tt1010048,tt0169547,tt0138097,tt0112573,tt0109830,tt0108052,tt0097239,tt0079417,tt0071562,tt0068646,tt0070735,tt0067116,tt0059742,tt0107207,tt0097937&tracking_tag=&pageId=ls089853956&pageType=list&subpageType=watchlist
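If that request works for you, a minimal sketch of replaying it with requests might look like the following. The browser-like User-Agent header is an assumption, I have shortened the ids list here for readability, and you would need to inspect the response yourself to see how the remaining titles are encoded:

import requests

# replay the "Load more" request directly; inspect the payload to see how titles are encoded
data_url = ("https://www.imdb.com/title/data"
            "?ids=tt0144117,tt0116996,tt0106179"  # shortened list of title ids
            "&tracking_tag=&pageId=ls089853956&pageType=list&subpageType=watchlist")
response = requests.get(data_url, headers={"User-Agent": "Mozilla/5.0"})
print(response.status_code)
print(response.text[:500])  # peek at the beginning of the payload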
What #balderman mentioned works if you can access the HTTP GET.
The main thing is that the titles load lazily, and the later ones don't load until the earlier ones have. I don't know if they only load if you're in the right region, but a janky way to get around it is to programmatically scroll through the page and let everything load.
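For illustration, a rough sketch of that scrolling approach might look like this, reusing the driver from the code in the question; the step size and pause are guesses you may need to tune:

import time

# step through the page so each batch of titles comes into view and gets loaded
position = 0
page_height = driver.execute_script("return document.body.scrollHeight")
while position < page_height:
    driver.execute_script(f"window.scrollTo(0, {position});")
    time.sleep(1)  # let the lazily loaded titles render before moving on
    position += 800
    page_height = driver.execute_script("return document.body.scrollHeight")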
I'm trying to write a .txt file for each article in the 'Capitalism' section on this page. But it stops after the 7th article, because the link to the 8th won't load. How do I skip it then?
res = session.get('https://www.theschooloflife.com/thebookoflife/category/work/?index')
soup = BeautifulSoup(res.text, 'lxml')
sections = soup.select('section ')
my_section = sections[7]
cat = my_section.select('.category_title')[0].text
titles = [title.text for title in my_section.select('.title')]
links = [link['href'] for link in my_section.select('ul.indexlist a')]
path = '{}'.format(cat)
os.mkdir(path)
for n,(title,link) in list(enumerate(zip(titles,links), start=1)):
    # ...and then I make a numbered .txt file containing the text found in each link. Image below.
You haven't provided the most important parts: the exception you are getting and the code responsible for retrieving each URL. Without that, the only recommendation is to wrap the body of your for loop in a try/except and continue the loop whenever an error related to URL retrieval occurs. Assuming you are using the requests library (as session.get suggests), you should end up with something like this:
for n,(title,link) in list(enumerate(zip(titles,links), start=1)):
    try:
        # ...and then I make a numbered .txt file containing the text found in each link. Image below.
    except requests.exceptions.RequestException:
        continue
requests.exceptions.RequestException is the general exception for the requests module; you can find a more suitable one for your case here: https://requests.readthedocs.io/en/latest/user/quickstart/#errors-and-exceptions
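For illustration, a complete version of that loop might look like the sketch below. The body is purely hypothetical, since the code that fetches each article and writes the .txt file wasn't shown; session, path, titles and links come from the code in the question:

import requests
from bs4 import BeautifulSoup

for n, (title, link) in enumerate(zip(titles, links), start=1):
    try:
        res = session.get(link, timeout=10)
        res.raise_for_status()  # treat HTTP error codes as failures too
    except requests.exceptions.RequestException:
        continue  # skip articles whose link won't load
    article_soup = BeautifulSoup(res.text, 'lxml')
    safe_title = title.strip().replace('/', '-')  # avoid slashes in file names
    # hypothetical: dump the article text into a numbered .txt file
    with open('{}/{}. {}.txt'.format(path, n, safe_title), 'w', encoding='utf-8') as f:
        f.write(article_soup.get_text())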
The original code is here : https://github.com/amitabhadey/Web-Scraping-Images-using-Python-via-BeautifulSoup-/blob/master/code.py
So I am trying to adapt a Python script to collect pictures from a website to get better at web scraping.
I tried to get images from "https://500px.com/editors"
The first error was:
The code that caused this warning is on line 12 of the file /Bureau/scrapper.py. To get rid of this warning, pass the additional argument 'features="lxml"' to the BeautifulSoup constructor.
So I did:
soup = BeautifulSoup(plain_text, features="lxml")
I also adapted the class to reflect the tag used on 500px.
But now the script just finishes without doing anything.
In the end it looks like this:
import requests
from bs4 import BeautifulSoup
import urllib.request
import random

url = "https://500px.com/editors"

source_code = requests.get(url)
plain_text = source_code.text
soup = BeautifulSoup(plain_text, features="lxml")

for link in soup.find_all("a",{"class":"photo_link "}):
    href = link.get('href')
    print(href)
    img_name = random.randrange(1,500)
    full_name = str(img_name) + ".jpg"
    urllib.request.urlretrieve(href, full_name)

print("loop break")
What did I do wrong?
Actually the website is loaded via JavaScript, using an XHR request to the following API, so you can reach the data directly via the API.
Note that you can increase the rpp=50 parameter to any number you want in order to get more than 50 results.
import requests

r = requests.get("https://api.500px.com/v1/photos?rpp=50&feature=editors&image_size%5B%5D=1&image_size%5B%5D=2&image_size%5B%5D=32&image_size%5B%5D=31&image_size%5B%5D=33&image_size%5B%5D=34&image_size%5B%5D=35&image_size%5B%5D=36&image_size%5B%5D=2048&image_size%5B%5D=4&image_size%5B%5D=14&sort=&include_states=true&include_licensing=true&formats=jpeg%2Clytro&only=&exclude=&personalized_categories=&page=1&rpp=50").json()

for item in r['photos']:
    print(item['url'])
You can also access the image URL itself in order to write the file directly:
import requests

r = requests.get("https://api.500px.com/v1/photos?rpp=50&feature=editors&image_size%5B%5D=1&image_size%5B%5D=2&image_size%5B%5D=32&image_size%5B%5D=31&image_size%5B%5D=33&image_size%5B%5D=34&image_size%5B%5D=35&image_size%5B%5D=36&image_size%5B%5D=2048&image_size%5B%5D=4&image_size%5B%5D=14&sort=&include_states=true&include_licensing=true&formats=jpeg%2Clytro&only=&exclude=&personalized_categories=&page=1&rpp=50").json()

for item in r['photos']:
    print(item['image_url'][-1])
Note that the image_url key holds different image sizes, so you can choose your preferred one and save it. Here I've taken the big one.
Saving directly:
import requests

with requests.Session() as req:
    r = req.get("https://api.500px.com/v1/photos?rpp=50&feature=editors&image_size%5B%5D=1&image_size%5B%5D=2&image_size%5B%5D=32&image_size%5B%5D=31&image_size%5B%5D=33&image_size%5B%5D=34&image_size%5B%5D=35&image_size%5B%5D=36&image_size%5B%5D=2048&image_size%5B%5D=4&image_size%5B%5D=14&sort=&include_states=true&include_licensing=true&formats=jpeg%2Clytro&only=&exclude=&personalized_categories=&page=1&rpp=50").json()
    result = []
    for item in r['photos']:
        print(f"Downloading {item['name']}")
        save = req.get(item['image_url'][-1])
        name = save.headers.get("Content-Disposition")[9:]
        with open(name, 'wb') as f:
            f.write(save.content)
Looking at the page you're trying to scrape, I noticed something: the data doesn't appear to load until a few moments after the page finishes loading. This tells me that they're using a JS framework to load the images after page load.
Your scraper will not work with this page because it does not run the JS on the pages it pulls. Running your script and printing out what plain_text contains proves this:
<a class='photo_link {{#if hasDetailsTooltip}}px_tooltip{{/if}}' href='{{photoUrl}}'>
If you look at the href attribute on that tag you'll see it's actually a templating tag used by JS UI frameworks.
Your options now are to either see what APIs they're calling to get this data (check the inspector in your web browser for network calls, if you're lucky they may not require authentication) or to use a tool that runs JS on pages. One tool I've seen recommended for this is selenium, though I've never used it so I'm not fully aware of its capabilities; I imagine the tooling around this would drastically increase the complexity of what you're trying to do.
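For what it's worth, a rough, untested sketch of the Selenium route might look like this; the photo_link class name is taken from the question's code and may not match what the rendered page actually uses:

from selenium import webdriver

# render the page in a real browser so the JS framework fills in the real hrefs
driver = webdriver.Chrome()
driver.get("https://500px.com/editors")
driver.implicitly_wait(10)  # wait up to 10s for elements to appear

# read the hrefs from the rendered DOM instead of the raw template markup
for link in driver.find_elements_by_css_selector("a.photo_link"):
    print(link.get_attribute("href"))

driver.quit()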
I'm fairly new to coding and Python, so I apologize if this is a silly question. I'd like a script that goes through all 19,000 search result pages and scrapes each page for all of the URLs. I've got all of the scraping working but can't figure out how to deal with the fact that the page uses AJAX to paginate. Usually I'd just make a loop over the URL to capture each search result page, but that's not possible here. Here's the page: http://www.heritage.org/research/all-research.aspx?nomobile&categories=report
This is the script I have so far:
with io.open('heritageURLs.txt', 'a', encoding='utf8') as logfile:
    page = urllib2.urlopen("http://www.heritage.org/research/all-research.aspx?nomobile&categories=report")
    soup = BeautifulSoup(page)
    snippet = soup.find_all('a', attrs={'item-title'})
    for a in snippet:
        logfile.write ("http://www.heritage.org" + a.get('href') + "\n")

print "Done collecting urls"
Obviously, it scrapes the first page of results and nothing more.
And I have looked at a few related questions but none seem to use Python or at least not in a way that I can understand. Thank you in advance for your help.
For the sake of completeness: while you may try accessing the POST request and finding a way to reach the next page, as I suggested in my comment, if an alternative is acceptable, using Selenium makes it quite easy to achieve what you want.
Here is a simple solution using Selenium for your question:
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
from time import sleep

# uncomment if using Firefox web browser
driver = webdriver.Firefox()

# uncomment if using Phantomjs
#driver = webdriver.PhantomJS()

url = 'http://www.heritage.org/research/all-research.aspx?nomobile&categories=report'
driver.get(url)

# set initial page count
pages = 1

with open('heritageURLs.txt', 'w') as f:
    while True:
        try:
            # sleep here to allow time for page load
            sleep(5)
            # grab the Next button if it exists
            btn_next = driver.find_element_by_class_name('next')
            # find all item-title a href and write to file
            links = driver.find_elements_by_class_name('item-title')
            print "Page: {} -- {} urls to write...".format(pages, len(links))
            for link in links:
                f.write(link.get_attribute('href')+'\n')
            # Exit if no more Next button is found, ie. last page
            if btn_next is None:
                print "crawling completed."
                exit(-1)
            # otherwise click the Next button and repeat crawling the urls
            pages += 1
            btn_next.send_keys(Keys.RETURN)
        # you should specify the exception here
        except:
            print "Error found, crawling stopped"
            exit(-1)
Hope this helps.