I wrote the following Python code to extract 'odor' information from PubChem for a particular molecule; in this case, nonanal (CID=31289). The webpage for this molecule is: https://pubchem.ncbi.nlm.nih.gov/compound/31289#section=Odor
import requests
from bs4 import BeautifulSoup
url = 'https://pubchem.ncbi.nlm.nih.gov/compound/31289#section=Odor'
page = requests.get(url)
soup = BeautifulSoup(page.content, 'html.parser')
odor_section = soup.find('section', {'id': 'Odor'})
odor_info = odor_section.find('div', {'class': 'section-content'})
print(odor_info.text.strip())
I get the following error.
AttributeError: 'NoneType' object has no attribute 'find'
It seems that BeautifulSoup does not extract the whole page's information.
I expect the following output:
Orange-rose odor, Floral, waxy, green
The page in question makes an AJAX request to load its data. We can see this in a web browser by looking at the Network tab of the dev tools (F12 in many browsers).
That is to say, the data simply isn't there when the initial page loads - so it isn't found by BeautifulSoup.
To solve the problem:
use Selenium, which can actually run the JavaScript code and thus populate the page with the desired data; or
simply query the API according to the request seen when loading the page in the browser. Thus:
PubChem_Nonanal_CID = 31289
compound_data_url = 'https://pubchem.ncbi.nlm.nih.gov/rest/pug_view/data/compound/{}/JSON/'
compound_info = requests.get(compound_data_url.format(PubChem_Nonanal_CID))
print(compound_info.json())
Parsing the JSON Reply
Parsing it proves a bit of a challenge, as the reply consists of many nested lists.
If the order of properties isn't guaranteed, you could opt for a solution like this:
for section in compound_info.json()['Record']['Section']:
    if section['TOCHeading'] == "Chemical and Physical Properties":
        for sub_section in section['Section']:
            if sub_section['TOCHeading'] == 'Experimental Properties':
                for sub_sub_section in sub_section['Section']:
                    if sub_sub_section['TOCHeading'] == "Odor":
                        print(sub_sub_section['Information'][0]['Value']['StringWithMarkup'][0]['String'])
                        break
Otherwise, follow the schema from a JSON-parsing website like jsonformatter.com
# object►Record►Section►3►Section►1►Section►2►Information►0►Value►StringWithMarkup►0►String
odor = compound_info.json()['Record']['Section'][3]['Section'][1]['Section'][2]['Information'][0]['Value']['StringWithMarkup'][0]['String']
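If the nesting depth itself may vary, the lookup can also be written as a small recursive helper. The sample_record below is a hand-trimmed stand-in for the real PUG View reply, included only so the sketch is self-contained and runnable:

```python
def find_section(node, heading):
    """Recursively search a PUG View 'Section' tree for a given TOCHeading."""
    for section in node.get('Section', []):
        if section.get('TOCHeading') == heading:
            return section
        found = find_section(section, heading)
        if found is not None:
            return found
    return None

# Hand-trimmed stand-in for the JSON returned by the PUG View API.
sample_record = {
    'Record': {
        'Section': [
            {'TOCHeading': 'Chemical and Physical Properties', 'Section': [
                {'TOCHeading': 'Experimental Properties', 'Section': [
                    {'TOCHeading': 'Odor', 'Information': [
                        {'Value': {'StringWithMarkup': [
                            {'String': 'Orange-rose odor, Floral, waxy, green'}]}}
                    ]}
                ]}
            ]}
        ]
    }
}

odor_section = find_section(sample_record['Record'], 'Odor')
print(odor_section['Information'][0]['Value']['StringWithMarkup'][0]['String'])
```

With the real reply, the same call would be made on `compound_info.json()['Record']` instead of the sample dictionary.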
I'm still learning Python and thought a good project would be to make an Instagram scraper. First I thought of trying to scrape Kylie Jenner's profile picture. I thought I would use BS4 to search, but then I ran into an issue.
import requests
from bs4 import BeautifulSoup as bs
instagramUser = input('Input Instagram Username: ')
url = 'https://instagram.com/' + instagramUser
r = requests.get(url)
soup = bs(r.text, 'html.parser')
profile_image = soup.find('img', class_ = "_6q-tv")['src']
print(profile_image)
On the line where I declare profile_image I get an error saying:
line 12, in
profile_image = soup.find('img', class_ = "_6q-tv")['src']
TypeError: 'NoneType' object is not subscriptable
I'm not sure why it doesn't work; my guess is that I'm reading Instagram's HTML wrong and searching incorrectly. I wanted to ask people more experienced than me what I'm doing wrong; any help would be appreciated :)
You can dissect the contents of line 12 into two commands:
image_tag = soup.find('img', class_ = "_6q-tv")
profile_image = image_tag['src']
The error
line 12, in profile_image = soup.find('img', class_ = "_6q-tv")['src'] TypeError: 'NoneType' object is not subscriptable
indicates that the result of the first command is None, Python's null value, which represents the absence of a value. None does not implement the subscript operator ([]); thus, it's not subscriptable.
The reason probably is that soup.find didn't find any tag that matches your search criteria and returned None.
To debug this issue, I suggest you write the page source to a file and inspect that file with a text editor of your choice (or directly in an interactive Python console). That way, you see what your Python program 'sees'. If you use the developer tools in the browser instead, you see the state of the web page after it has executed a bunch of JavaScript, but BeautifulSoup is oblivious to that JavaScript code. It just fetches the document as-is from the server.
As the answer of bushcat69 suggests, it's probably hard to scrape content from Instagram, so you may be better off with a simpler website that doesn't use as much JavaScript and has fewer protective measures against web scraping.
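More generally, whenever soup.find can come back empty, a None check keeps the TypeError from surfacing. A minimal sketch on an inline document (the class name is the one from the question; the HTML itself is made up and deliberately contains no matching tag):

```python
from bs4 import BeautifulSoup

# Made-up HTML with no <img> tag at all, standing in for a
# JavaScript-rendered page whose images are absent from the raw source.
html = '<html><body><p>No images here at all.</p></body></html>'
soup = BeautifulSoup(html, 'html.parser')

# find() returns None when nothing matches, so guard before subscripting.
image_tag = soup.find('img', class_='_6q-tv')
if image_tag is not None:
    print(image_tag['src'])
else:
    print('No matching <img> tag found; the page may be rendered by JavaScript')
```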
Instagram's content is loaded via JavaScript, so scraping it like this won't work. It also has many ways of stopping scraping, so you will have a tough time scraping it without automating a browser with something like Selenium.
You can see what is happening when you navigate to a page by opening your browser's Developer Tools - Network - Fetch/XHR and reloading the page. There you can see all the other content that is loaded; sometimes an easily accessible backend API is visible which loads the data you want and can be scraped (sadly not the case with Instagram, which is heavily protected).
The original code is here: https://github.com/amitabhadey/Web-Scraping-Images-using-Python-via-BeautifulSoup-/blob/master/code.py
So I am trying to adapt a Python script that collects pictures from a website, to get better at web scraping.
I tried to get images from "https://500px.com/editors"
The first warning was:
The code that caused this warning is on line 12 of the file /Bureau/scrapper.py. To get rid of this warning, pass the additional argument
'features="lxml"' to the BeautifulSoup constructor.
So I did:
soup = BeautifulSoup(plain_text, features="lxml")
I also adapted the class to reflect the tag in 500px.
But now the script stopped running and nothing happened.
In the end it looks like this:
import requests
from bs4 import BeautifulSoup
import urllib.request
import random
url = "https://500px.com/editors"
source_code = requests.get(url)
plain_text = source_code.text
soup = BeautifulSoup(plain_text, features="lxml")
for link in soup.find_all("a", {"class": "photo_link "}):
    href = link.get('href')
    print(href)
    img_name = random.randrange(1, 500)
    full_name = str(img_name) + ".jpg"
    urllib.request.urlretrieve(href, full_name)
print("loop break")
What did I do wrong?
Actually the website is loaded via JavaScript, using an XHR request to the following API, so you can reach the API directly.
Note that you can increase the parameter rpp=50 to any number you want in order to get more than 50 results.
import requests
r = requests.get("https://api.500px.com/v1/photos?rpp=50&feature=editors&image_size%5B%5D=1&image_size%5B%5D=2&image_size%5B%5D=32&image_size%5B%5D=31&image_size%5B%5D=33&image_size%5B%5D=34&image_size%5B%5D=35&image_size%5B%5D=36&image_size%5B%5D=2048&image_size%5B%5D=4&image_size%5B%5D=14&sort=&include_states=true&include_licensing=true&formats=jpeg%2Clytro&only=&exclude=&personalized_categories=&page=1&rpp=50").json()
for item in r['photos']:
    print(item['url'])
You can also access the image URL itself in order to write it directly:
import requests
r = requests.get("https://api.500px.com/v1/photos?rpp=50&feature=editors&image_size%5B%5D=1&image_size%5B%5D=2&image_size%5B%5D=32&image_size%5B%5D=31&image_size%5B%5D=33&image_size%5B%5D=34&image_size%5B%5D=35&image_size%5B%5D=36&image_size%5B%5D=2048&image_size%5B%5D=4&image_size%5B%5D=14&sort=&include_states=true&include_licensing=true&formats=jpeg%2Clytro&only=&exclude=&personalized_categories=&page=1&rpp=50").json()
for item in r['photos']:
    print(item['image_url'][-1])
Note that the image_url key holds different image sizes, so you can choose your preferred one and save it. Here I've taken the big one.
Saving directly:
import requests
with requests.Session() as req:
    r = req.get("https://api.500px.com/v1/photos?rpp=50&feature=editors&image_size%5B%5D=1&image_size%5B%5D=2&image_size%5B%5D=32&image_size%5B%5D=31&image_size%5B%5D=33&image_size%5B%5D=34&image_size%5B%5D=35&image_size%5B%5D=36&image_size%5B%5D=2048&image_size%5B%5D=4&image_size%5B%5D=14&sort=&include_states=true&include_licensing=true&formats=jpeg%2Clytro&only=&exclude=&personalized_categories=&page=1&rpp=50").json()
    result = []
    for item in r['photos']:
        print(f"Downloading {item['name']}")
        save = req.get(item['image_url'][-1])
        name = save.headers.get("Content-Disposition")[9:]
        with open(name, 'wb') as f:
            f.write(save.content)
Looking at the page you're trying to scrape I noticed something. The data doesn't appear to load until a few moments after the page finishes loading. This tells me that they're using a JS framework to load the images after page load.
Your scraper will not work with this page because it does not run the JS on the pages it's pulling. Running your script and printing out what plain_text contains proves this:
<a class='photo_link {{#if hasDetailsTooltip}}px_tooltip{{/if}}' href='{{photoUrl}}'>
If you look at the href attribute on that tag you'll see it's actually a templating tag used by JS UI frameworks.
Your options now are to either see what APIs they're calling to get this data (check the inspector in your web browser for network calls, if you're lucky they may not require authentication) or to use a tool that runs JS on pages. One tool I've seen recommended for this is selenium, though I've never used it so I'm not fully aware of its capabilities; I imagine the tooling around this would drastically increase the complexity of what you're trying to do.
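One quick heuristic before reaching for an API or Selenium: check the raw HTML for unrendered template placeholders such as {{...}}. A minimal sketch, using an inline copy of the markup quoted above instead of a live request:

```python
# Inline copy of the unrendered markup quoted above, standing in for
# the plain_text the script would fetch from the live site.
plain_text = """
<a class='photo_link {{#if hasDetailsTooltip}}px_tooltip{{/if}}' href='{{photoUrl}}'>
"""

# A literal '{{' in the raw HTML is a strong hint that the page is
# rendered client-side and the data is absent from the initial response.
if '{{' in plain_text:
    print('Template placeholders found: the page is likely rendered by JavaScript')
```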
I'm a beginner in the world of Python and web scrapers. I am used to making scrapers with dynamic URLs, where the URI changes when I input specific parameters in the URL itself.
Ex: Wikipedia.
(if I input a search for "Stack Overflow" I will get a URI that looks like this: https://en.wikipedia.org/wiki/Stack_Overflow)
At the moment I have been challenged to develop a web scraper to collect data from this page.
The field "Texto/Termos a serem pesquisados" ("Text/terms to be searched") corresponds to a search field, but when I submit the search the URL stays the same, which keeps me from getting the right HTML code for my research.
I am used to working with BeautifulSoup and Requests to do the scraping, but in this case they are of no use, since the URL stays the same after the search.
import requests
from bs4 import BeautifulSoup
url = 'http://comprasnet.gov.br/acesso.asp?url=/ConsultaLicitacoes/ConsLicitacao_texto.asp'
html = requests.get(url)
bsObj = BeautifulSoup(html.content, 'html.parser')
print(bsObj)
# And from now on I can't go any further
Usually i would do something like
url = 'https://en.wikipedia.org/wiki/'
search_term = input('Input your search :) ')
search = url + search_term
And then do all the BeautifulSoup thing, and findAll thing to get my data from the HTML code.
I have tried to use Selenium too, but I'm looking for something different than that due to all the webdriver setup. With the following piece of code I have achieved some odd results, but I still can't scrape the HTML in a good way.
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
import requests
from bs4 import BeautifulSoup
# Access the page and input the search in the field
driver = webdriver.Chrome()
driver.get('http://comprasnet.gov.br/acesso.asp?url=/ConsultaLicitacoes/ConsLicitacao_texto.asp')
driver.switch_to.frame('main2')
busca = driver.find_element_by_id("txtTermo")
busca.send_keys("GESTAO DE PESSOAS")
#data_inicio = driver.find_element_by_id('dt_publ_ini')
#data_inicio.send_keys("01/01/2018")
#data_fim = driver.find_element_by_id('dt_publ_fim')
#data_fim.send_keys('20/12/2018')
botao = driver.find_element_by_id('ok')
botao.click()
So given all that:
Is there a way to scrape data from these static URLs?
Can I input a search in the field via code?
Why can't I scrape the right source code?
The problem is that your initial search page is using frames for the searching & results, which makes it harder for BeautifulSoup to work with it. I was able to obtain the search results by using a slightly different URL and MechanicalSoup instead:
>>> from mechanicalsoup import StatefulBrowser
>>> sb = StatefulBrowser()
>>> sb.open('http://comprasnet.gov.br/ConsultaLicitacoes/ConsLicitacao_texto.asp')
<Response [200]>
>>> sb.select_form() # select the search form
<mechanicalsoup.form.Form object at 0x7f2c10b1bc18>
>>> sb['txtTermo'] = 'search text' # input the text to search for
>>> sb.submit_selected() # submit the form
<Response [200]>
>>> page = sb.get_current_page() # get the returned page in BeautifulSoup form
>>> type(page)
<class 'bs4.BeautifulSoup'>
Note that the URL I'm using here is that of the frame that has the search form and not the page you provided that was inlining it. This removes one layer of indirection.
MechanicalSoup is built on top of BeautifulSoup and provides some tools for interacting with websites in a similar way to the old mechanize library.
I am working on a project and one of the steps includes getting a random word which I will use later. When I try to grab the random word, it gives me '<span id="result"></span>' but as you can see, there is no word inside.
Code:
import urllib2
from bs4 import BeautifulSoup
quote_page = 'http://watchout4snakes.com/wo4snakes/Random/RandomWord'
page = urllib2.urlopen(quote_page)
soup = BeautifulSoup(page, 'html.parser')
name_box = soup.find("span", {"id": "result"})
print name_box
name = name_box.text.strip()
print name
I am thinking that maybe it might need to wait for a word to appear, but I'm not sure how to do that.
This word is added to the page using JavaScript. We can verify this by looking at the actual HTML that is returned in the request and comparing it with what we see in the web browser DOM inspector. There are two options:
Use a library capable of executing JavaScript and giving you the resulting HTML
Try a different approach that doesn't require JavaScript support
For 1, we can use something like requests_html. This would look like:
from requests_html import HTMLSession
url = 'http://watchout4snakes.com/wo4snakes/Random/RandomWord'
session = HTMLSession()
r = session.get(url)
# Some sleep required since the default of 0.2 isn't long enough.
r.html.render(sleep=0.5)
print(r.html.find('#result', first=True).text)
For 2, if we look at the network requests that the page is making, we can see that it retrieves random words by making a POST request to http://watchout4snakes.com/wo4snakes/Random/RandomWord. Making a direct request with a library like requests (recommended in the standard library documentation) looks like:
import requests
url = 'http://watchout4snakes.com/wo4snakes/Random/RandomWord'
print(requests.post(url).text)
So the way the site works is that it sends you the page with no word in the span box and edits it in later through JavaScript; that's why you get a span box with nothing inside.
However, since you're trying to get the word, I'd suggest a different method: rather than scraping the word off the page, you can simply send a POST request to http://watchout4snakes.com/wo4snakes/Random/RandomWord with no body and receive the word in response.
You're using Python 2 but in Python 3 (for example, so I can show this works) you can do:
>>> import requests
>>> r = requests.post('http://watchout4snakes.com/wo4snakes/Random/RandomWord')
>>> print(r.text)
doom
You can do something similar using urllib in Python 2 as well.
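A sketch of that stdlib route, using Python 3's urllib.request (the same idea applies to urllib2 in Python 2). Passing a data argument, even an empty one, is what turns the request into a POST; the actual fetch is left commented out since it needs network access:

```python
import urllib.request

url = 'http://watchout4snakes.com/wo4snakes/Random/RandomWord'

# A (possibly empty) data payload makes urllib issue a POST request.
request = urllib.request.Request(url, data=b'')
print(request.get_method())  # POST

# Actually fetching a word requires network access:
# with urllib.request.urlopen(request) as response:
#     print(response.read().decode())
```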
I am trying to get some data from this website:
http://www.espn.com.br/futebol/resultados/_/liga/BRA.1/data/20181018
When I inspect the page in my browser I can see all the values I need in the HTML. I want to fetch the game results and the players' names (for each date, in this example 2018-10-18).
On days with no games the website shows "Sem jogos nesta data" ("No games on this date"), which is easy to find on browser inspection.
But when using
url = 'http://www.espn.com.br/futebol/resultados/_/liga/todos/data/20181018'
page = requests.get(url, "lxml")
The output is basically the website, but I can't find the phrase "Sem jogos nesta data" in it.
How can I fetch the HTML containing the rendered results? Is it possible with requests? urllib?
Looks like the data you are looking for comes from their backend API. I would use the selenium Python package instead of requests.
Here is an example:
from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Firefox()
driver.get("http://www.espn.com.br/futebol/resultados/_/liga/todos/data/20181018")
value = driver.find_elements(By.XPATH, '//*[@id="events"]/div')
driver.close()
I didn't check the code, but it should work.