I'm still learning Python and thought a good project would be to make an Instagram scraper. First I thought of trying to scrape Kylie Jenner's profile picture; I thought I would use BS4 to search for it, but then I ran into an issue.
import requests
from bs4 import BeautifulSoup as bs
instagramUser = input('Input Instagram Username: ')
url = 'https://instagram.com/' + instagramUser
r = requests.get(url)
soup = bs(r.text, 'html.parser')
profile_image = soup.find('img', class_ = "_6q-tv")['src']
print(profile_image)
On the line where I assign profile_image I get an error saying:
line 12, in
profile_image = soup.find('img', class_ = "_6q-tv")['src']
TypeError: 'NoneType' object is not subscriptable
I'm not sure why it doesn't work. My guess is that I'm reading Instagram's HTML wrong and searching incorrectly. I wanted to ask people more experienced than me what I'm doing wrong; any help would be appreciated :)
You can dissect the contents of line 12 into two commands:
image_tag = soup.find('img', class_ = "_6q-tv")
profile_image = image_tag['src']
The error
line 12, in profile_image = soup.find('img', class_ = "_6q-tv")['src'] TypeError: 'NoneType' object is not subscriptable
indicates that the result of the first command is None, Python's null value, which represents the absence of a value. None does not implement the subscript operator ([]), so it is not subscriptable.
The reason is probably that soup.find didn't find any tag matching your search criteria and therefore returned None.
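As a minimal sketch, reusing the class name from your own code (which Instagram may well have changed by now), you can guard against None before subscripting:

image_tag = soup.find('img', class_='_6q-tv')
if image_tag is not None:
    profile_image = image_tag['src']
    print(profile_image)
else:
    print("No matching <img> tag in the HTML that requests received")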
To debug this issue, I suggest you write the page source to a file and inspect that file with a text editor of your choice (or directly in an interactive Python console). That way, you see what your Python program 'sees'. If you use the developer tools in the browser instead, you see the state of the web page after it has executed a bunch of JavaScript, but BeautifulSoup is oblivious to that JavaScript. It just fetches the document as-is from the server.
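For example, something along these lines (the file name is just an illustration):

# Dump the raw HTML that requests received, then open it in an editor
with open('instagram_raw.html', 'w', encoding='utf-8') as f:
    f.write(r.text)

# Quick sanity check: is the class you search for even present in the raw HTML?
print('_6q-tv' in r.text)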
As the answer of bushcat69 suggests, it's probably hard to scrape content from Instagram, so you may be better off with a simpler website that doesn't use as much JavaScript and as many protective measures against web scraping.
Instagram's content is loaded via JavaScript, so scraping it like this won't work. It also has many ways of stopping scraping, so you will have a tough time scraping it without automating a browser with something like Selenium.
You can see what happens when you navigate to a page by opening your browser's Developer Tools - Network - Fetch/XHR and reloading the page. There you can see all the other content that is loaded; sometimes an easily accessible backend API is visible which loads the data you want and can be scraped (not the case with Instagram sadly, it is heavily protected).
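If you do want to try a browser, a rough sketch with Selenium could look like the following. Note that the class name is copied from the question, Instagram changes these generated class names regularly, and a login wall may appear anyway:

import time
from bs4 import BeautifulSoup
from selenium import webdriver

driver = webdriver.Chrome()  # assumes a matching chromedriver is installed
driver.get('https://instagram.com/' + instagramUser)
time.sleep(5)  # crude wait for the JavaScript to render the page

soup = BeautifulSoup(driver.page_source, 'html.parser')
image_tag = soup.find('img', class_='_6q-tv')  # class copied from the question; may have changed
print(image_tag['src'] if image_tag is not None else 'profile image not found')
driver.quit()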
I wrote the following Python code to extract 'odor' information from PubChem for a particular molecule, in this case nonanal (CID=31289). The webpage for this molecule is: https://pubchem.ncbi.nlm.nih.gov/compound/31289#section=Odor
import requests
from bs4 import BeautifulSoup
url = 'https://pubchem.ncbi.nlm.nih.gov/compound/31289#section=Odor'
page = requests.get(url)
soup = BeautifulSoup(page.content, 'html.parser')
odor_section = soup.find('section', {'id': 'Odor'})
odor_info = odor_section.find('div', {'class': 'section-content'})
print(odor_info.text.strip())
I get the following error.
AttributeError: 'NoneType' object has no attribute 'find'
It seems that BeautifulSoup does not extract the whole page.
I expect the following output:
Orange-rose odor, Floral, waxy, green
The page in question makes an AJAX request to load its data. We can see this in a web browser by looking at the Network tab of the dev tools (F12 in many browsers).
That is to say, the data simply isn't there when the initial page loads - so it isn't found by BeautifulSoup.
To solve the problem:
use Selenium, which can actually run the JavaScript code and thus populate the page with the desired data; or
simply query the API according to the request seen when loading the page in the browser. Thus:
import requests

PubChem_Nonanal_CID = 31289
compound_data_url = 'https://pubchem.ncbi.nlm.nih.gov/rest/pug_view/data/compound/{}/JSON/'
compound_info = requests.get(compound_data_url.format(PubChem_Nonanal_CID))
print(compound_info.json())
Parsing the JSON Reply
Parsing it proves a bit of a challenge, as the reply consists of many nested lists.
If the order of properties isn't guaranteed, you could opt for a solution like this:
for section in compound_info.json()['Record']['Section']:
    if section['TOCHeading'] == "Chemical and Physical Properties":
        for sub_section in section['Section']:
            if sub_section['TOCHeading'] == 'Experimental Properties':
                for sub_sub_section in sub_section['Section']:
                    if sub_sub_section['TOCHeading'] == "Odor":
                        print(sub_sub_section['Information'][0]['Value']['StringWithMarkup'][0]['String'])
                        break
Otherwise, follow the schema from a JSON-parsing website like jsonformatter.com
# object►Record►Section►3►Section►1►Section►2►Information►0►Value►StringWithMarkup►0►String
odor = compound_info.json()['Record']['Section'][3]['Section'][1]['Section'][2]['Information'][0]['Value']['StringWithMarkup'][0]['String']
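As an alternative sketch (the helper function below is my own, not part of PubChem's API), a small recursive search over the nested 'Section' lists avoids hard-coding either the order or the nesting depth:

def find_section(sections, heading):
    # Recursively search PUG View 'Section' lists for a given TOCHeading
    for section in sections:
        if section.get('TOCHeading') == heading:
            return section
        found = find_section(section.get('Section', []), heading)
        if found is not None:
            return found
    return None

odor_section = find_section(compound_info.json()['Record']['Section'], 'Odor')
if odor_section is not None:
    print(odor_section['Information'][0]['Value']['StringWithMarkup'][0]['String'])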
Hi there, I am trying to scrape the URL from the a href tag, but I am getting this error:
Input In [162] in <cell line: 1>
link = post.find('a', class_ = 'ln2bl2p dir dir-lt').get('href')
AttributeError: 'NoneType' object has no attribute 'get'
Here is my code below. Line 24 is returning the error.
Website link: https://www.airbnb.co.uk/s/Honolulu--HI--United-States/homes?tab_id=home_tab&refinement_paths%5B%5D=%2Fhomes&flexible_trip_lengths%5B%5D=one_week&date_picker_type=calendar&place_id=ChIJTUbDjDsYAHwRbJen81_1KEs&checkin=2022-08-08&checkout=2022-08-14&source=structured_search_input_header&search_type=autocomplete_click&federated_search_session_id=82d7df97-e5c9-48d5-9dfe-ca1006489343&pagination_search=true
Are you sure that you can get the content with this request? Add the HTML you got from the request to your question, so we can understand it. It looks like you need to use another request to get this data.
If you save the HTML you got from the request to a file (and don't use screenshots of code, by the way):
page = requests.get(url)
with open('test.html', 'w') as f:
    f.write(page.text)
I am pretty sure that you will discover the information you need is not there.
And after that, try to understand how you get this info when using your own browser:
Go to the website
Open the dev tools (F12 in Chrome)
Open the Network tab
If you filter requests by Doc, you will see the HTML pages the server sent to you. That page is empty, so you are getting an empty page with this request.
To find the information you need, look for it among the other responses the server sent to you. Usually it's under JS or Fetch/XHR.
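If no convenient backend request turns up, one fallback sketch is to let a real browser render the page and then reuse the selector from your error message. Note that 'ln2bl2p dir dir-lt' is an auto-generated Airbnb class and may change at any time:

import time
from bs4 import BeautifulSoup
from selenium import webdriver

driver = webdriver.Chrome()  # assumes chromedriver is available on PATH
driver.get(url)   # the Airbnb search URL from the question
time.sleep(5)     # crude wait for the JavaScript-rendered listings

soup = BeautifulSoup(driver.page_source, 'html.parser')
for a in soup.find_all('a', class_='ln2bl2p dir dir-lt'):
    print(a.get('href'))
driver.quit()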
I'm trying to extract a simple product title from amazon.com using the id of the span that contains the title.
This is what I wrote:
import requests
from bs4 import BeautifulSoup
url = 'https://www.amazon.com/Acer-SB220Q-Ultra-Thin-Frame-Monitor/dp/B07CVL2D2S/ref=lp_16225007011_1_7'
res = requests.get(url)
soup = BeautifulSoup(res.content, 'html.parser')
title = soup.find(id='productTitle').get_text()
print(title)
I keep getting either None or an empty list, or I can't extract anything and get an AttributeError saying that the object I used doesn't have a get_text attribute, which raised another question: how do I get the text of this simple span?
I'd really appreciate it if someone could figure it out and help me.
Thanks in advance.
Problem
Running your code and checking the res value, you get a 503 error. This means that the service is unavailable (HTTP status 503).
Solution
Following up, using this SO post, it seems that adding headers={"User-Agent": "Defined"} to the GET request does work.
res = requests.get(url, headers={"User-Agent": "Defined"})
Will return a 200 (OK) response.
The Twist
Amazon actually checks for web scrapers, and even though you will get a page back, printing the result (print(soup)) will likely show you the following:
<body>
<!--
To discuss automated access to Amazon data please contact api-services-support@amazon.com.
For information about migrating to our APIs refer to our Marketplace APIs at https://developer.amazonservices.com/ref=rm_c_sv, or our Product Advertising API at https://affiliate-program.amazon.com/gp/advertising/api/detail/main.html/ref=rm_c_ac for advertising use cases.
-->
...
<h4>Enter the characters you see below</h4>
<p class="a-last">Sorry, we just need to make sure you're not a robot. For best results, please make sure your browser is accepting cookies.</p>
</div>
</div>
The counter
But you can use selenium to simulate a human. A minimal working example for me was the following:
import selenium.webdriver
url = 'http://www.amazon.com/Acer-SB220Q-Ultra-Thin-Frame-Monitor/dp/B07CVL2D2S/ref=lp_16225007011_1_7'
driver = selenium.webdriver.Firefox()
driver.get(url)
title = driver.find_element_by_id('productTitle').text
print(title)
Which prints out
Acer SB220Q bi 21.5 Inches Full HD (1920 x 1080) IPS Ultra-Thin Zero Frame Monitor (HDMI & VGA Port), Black
A small thing when using Selenium is that it is much slower than the requests library. Also, a new window will pop up that shows the page, but luckily we can do something about that by using a headless driver.
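A minimal sketch of the headless variant, assuming the same Firefox/geckodriver setup as above:

import selenium.webdriver
from selenium.webdriver.firefox.options import Options

url = 'http://www.amazon.com/Acer-SB220Q-Ultra-Thin-Frame-Monitor/dp/B07CVL2D2S/ref=lp_16225007011_1_7'

options = Options()
options.add_argument('--headless')  # run Firefox without opening a window

driver = selenium.webdriver.Firefox(options=options)
driver.get(url)
title = driver.find_element_by_id('productTitle').text
print(title)
driver.quit()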
It's my first time here! I'm new to Python and I'm getting the error: "'NoneType' object has no attribute getText."
I'm working with the Requests and BeautifulSoup libraries. It's about chess.com, a chess website where all your game data can be downloaded. I'm learning about web scraping and data visualization, and the idea is to work with my own info. The code is:
import re
import requests
from bs4 import BeautifulSoup

text = page.text  # 'page' is the requests response for my games archive page
b = BeautifulSoup(text, 'html.parser')
content = b.find('span', attrs={'class': re.compile("archive-games-game-time")})
content.getText().strip()
"massarov" is my username in the page. I dont´know what´s wrong. Could anyone help me please?????.
If you are logging in, it may be better to use a Session, as it keeps your cookies:
import requests

session = requests.Session()
session.post(post_link, data=yourdata)  # post_link: the login URL; yourdata: your login form data
data = session.get(link)                # subsequent requests reuse the same cookies
This will keep you logged in when you change URLs (go to a different page on the website), so whenever you need to keep cookies, use a Session.
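A rough sketch tying this back to the question; the login endpoint, form field names, and archive URL pattern here are assumptions I have not verified against chess.com:

import re
import requests
from bs4 import BeautifulSoup

session = requests.Session()
# Hypothetical login step; the real endpoint and field names may differ
session.post('https://www.chess.com/login', data={'username': 'massarov', 'password': '...'})

page = session.get('https://www.chess.com/games/archive/massarov')
soup = BeautifulSoup(page.text, 'html.parser')
span = soup.find('span', attrs={'class': re.compile('archive-games-game-time')})
print(span.getText().strip() if span is not None else 'game time span not found')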
The original code is here : https://github.com/amitabhadey/Web-Scraping-Images-using-Python-via-BeautifulSoup-/blob/master/code.py
So I am trying to adapt a Python script that collects pictures from a website, in order to get better at web scraping.
I tried to get images from "https://500px.com/editors"
The first error was
The code that caused this warning is on line 12 of the file /Bureau/scrapper.py. To get rid of this warning, pass the additional argument 'features="lxml"' to the BeautifulSoup constructor.
So I did:
soup = BeautifulSoup(plain_text, features="lxml")
I also adapted the class to reflect the tag in 500px.
But now the script stopped running and nothing happened.
In the end it looks like this:
import requests
from bs4 import BeautifulSoup
import urllib.request
import random

url = "https://500px.com/editors"
source_code = requests.get(url)
plain_text = source_code.text
soup = BeautifulSoup(plain_text, features="lxml")

for link in soup.find_all("a", {"class": "photo_link "}):
    href = link.get('href')
    print(href)
    img_name = random.randrange(1, 500)
    full_name = str(img_name) + ".jpg"
    urllib.request.urlretrieve(href, full_name)

print("loop break")
What did I do wrong?
Actually the website's content is loaded via JavaScript, using an XHR request to the following API, so you can reach the data directly via that API.
Note that you can increase the rpp=50 parameter to any number you want in order to get more than 50 results.
import requests

r = requests.get("https://api.500px.com/v1/photos?rpp=50&feature=editors&image_size%5B%5D=1&image_size%5B%5D=2&image_size%5B%5D=32&image_size%5B%5D=31&image_size%5B%5D=33&image_size%5B%5D=34&image_size%5B%5D=35&image_size%5B%5D=36&image_size%5B%5D=2048&image_size%5B%5D=4&image_size%5B%5D=14&sort=&include_states=true&include_licensing=true&formats=jpeg%2Clytro&only=&exclude=&personalized_categories=&page=1&rpp=50").json()
for item in r['photos']:
    print(item['url'])
You can also access the image URL itself in order to save the file directly:
import requests

r = requests.get("https://api.500px.com/v1/photos?rpp=50&feature=editors&image_size%5B%5D=1&image_size%5B%5D=2&image_size%5B%5D=32&image_size%5B%5D=31&image_size%5B%5D=33&image_size%5B%5D=34&image_size%5B%5D=35&image_size%5B%5D=36&image_size%5B%5D=2048&image_size%5B%5D=4&image_size%5B%5D=14&sort=&include_states=true&include_licensing=true&formats=jpeg%2Clytro&only=&exclude=&personalized_categories=&page=1&rpp=50").json()
for item in r['photos']:
    print(item['image_url'][-1])
Note that the image_url key holds different image sizes, so you can choose your preferred one and save it. Here I've taken the big one.
Saving directly:
import requests

with requests.Session() as req:
    r = req.get("https://api.500px.com/v1/photos?rpp=50&feature=editors&image_size%5B%5D=1&image_size%5B%5D=2&image_size%5B%5D=32&image_size%5B%5D=31&image_size%5B%5D=33&image_size%5B%5D=34&image_size%5B%5D=35&image_size%5B%5D=36&image_size%5B%5D=2048&image_size%5B%5D=4&image_size%5B%5D=14&sort=&include_states=true&include_licensing=true&formats=jpeg%2Clytro&only=&exclude=&personalized_categories=&page=1&rpp=50").json()
    result = []
    for item in r['photos']:
        print(f"Downloading {item['name']}")
        save = req.get(item['image_url'][-1])
        name = save.headers.get("Content-Disposition")[9:]
        with open(name, 'wb') as f:
            f.write(save.content)
Looking at the page you're trying to scrape, I noticed something. The data doesn't appear to load until a few moments after the page finishes loading. This tells me that they're using a JS framework to load the images after page load.
Your scraper will not work with this page because it does not run JS on the pages it's pulling. Running your script and printing out what plain_text contains proves this:
<a class='photo_link {{#if hasDetailsTooltip}}px_tooltip{{/if}}' href='{{photoUrl}}'>
If you look at the href attribute on that tag you'll see it's actually a templating tag used by JS UI frameworks.
Your options now are either to see what APIs they're calling to get this data (check the inspector in your web browser for network calls; if you're lucky they may not require authentication) or to use a tool that runs JS on pages. One tool I've seen recommended for this is Selenium, though I've never used it, so I'm not fully aware of its capabilities; I imagine the tooling around this would drastically increase the complexity of what you're trying to do.
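For completeness, a rough Selenium sketch of the second option; the 'photo_link ' class is taken from the original script and may no longer match what 500px renders today:

import time
from bs4 import BeautifulSoup
from selenium import webdriver

driver = webdriver.Chrome()  # assumes a matching chromedriver is installed
driver.get("https://500px.com/editors")
time.sleep(5)  # crude wait for the JS framework to render the photo grid

soup = BeautifulSoup(driver.page_source, features="lxml")
for link in soup.find_all("a", {"class": "photo_link "}):
    print(link.get('href'))
driver.quit()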