Hi there, I am trying to scrape the URL from an a tag's href attribute, but I am getting this error:
Input In [162], in <cell line: 1>
link = post.find('a', class_ = 'ln2bl2p dir dir-lt').get('href')
AttributeError: 'NoneType' object has no attribute 'get'
Here is my code below. Line 24 is returning the error.
Website link: https://www.airbnb.co.uk/s/Honolulu--HI--United-States/homes?tab_id=home_tab&refinement_paths%5B%5D=%2Fhomes&flexible_trip_lengths%5B%5D=one_week&date_picker_type=calendar&place_id=ChIJTUbDjDsYAHwRbJen81_1KEs&checkin=2022-08-08&checkout=2022-08-14&source=structured_search_input_header&search_type=autocomplete_click&federated_search_session_id=82d7df97-e5c9-48d5-9dfe-ca1006489343&pagination_search=true
Are you sure that you can get the content with this request? Add the HTML you got from the request to your question, so we can see it. It looks like you need to use another request to get this data.
Save the HTML you got from the request to a file (and don't use screenshots of code anymore):
import requests

page = requests.get(url)
with open('test.html', 'w') as f:
    f.write(page.text)
I am pretty sure you will discover that the information you need is not there.
After that, try to understand how you get this info when using your own browser:
Go to the website.
Open dev tools (F12 in Chrome).
Open the Network tab.
If you filter requests with Doc, you will see the HTML pages the server sent to you. There is an empty page there, so you are getting an empty page with this request.
To find the information you need, look for it in the other responses the server sent to you. Usually it's under JS or Fetch/XHR.
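As a rough illustration of replaying such a request (the endpoint below is hypothetical; copy the real URL and headers of the Fetch/XHR request you find in the Network tab):
import requests

# Hypothetical endpoint: replace with the real request URL from dev tools.
xhr_url = 'https://www.airbnb.co.uk/api/some_search_endpoint'
headers = {'User-Agent': 'Mozilla/5.0'}
response = requests.get(xhr_url, headers=headers)
data = response.json()  # such responses are usually JSON rather than HTML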
Related
I'm still learning Python and thought a good project would be to make an Instagram scraper. First I thought I would try to scrape Kylie Jenner's profile picture using BS4 to search for it, but then I ran into an issue.
import requests
from bs4 import BeautifulSoup as bs
instagramUser = input('Input Instagram Username: ')
url = 'https://instagram.com/' + instagramUser
r = requests.get(url)
soup = bs(r.text, 'html.parser')
profile_image = soup.find('img', class_ = "_6q-tv")['src']
print(profile_image)
On the line where I declare profile_image I get an error saying:
line 12, in <module>
    profile_image = soup.find('img', class_ = "_6q-tv")['src']
TypeError: 'NoneType' object is not subscriptable
I'm not sure why it doesn't work; my guess is that I'm reading Instagram's HTML wrong and searching incorrectly. I wanted to ask people more experienced than me what I'm doing wrong; any help would be appreciated :)
You can dissect the contents of line 12 into two commands:
image_tag = soup.find('img', class_ = "_6q-tv")
profile_image = image_tag['src']
The error
line 12, in <module>
    profile_image = soup.find('img', class_ = "_6q-tv")['src']
TypeError: 'NoneType' object is not subscriptable
indicates that the result of the first command is None, Python's null value, which represents the absence of a value. None does not implement the subscript operator ([]); thus, it's not subscriptable.
The reason probably is that soup.find didn't find any tag matching your search criteria and therefore returned None.
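A minimal defensive version of that lookup (the class name is taken from your code):
image_tag = soup.find('img', class_ = "_6q-tv")
if image_tag is None:
    print('No matching img tag found; inspect the HTML you actually received')
else:
    profile_image = image_tag['src']
    print(profile_image)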
To debug this issue, I suggest you write the source code into a file and inspect that file with a text editor of your choice (or directly in an interactive Python console). That way, you see what your Python program 'sees'. If you use the developer tools in the browser instead, you see the state of a web page after it has executed a bunch of JavaScript, but BeautifulSoup is oblivious to that JavaScript: your script just fetches the document as-is from the server.
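A minimal sketch, reusing the r response object from your code:
with open('instagram_source.html', 'w', encoding='utf-8') as f:
    f.write(r.text)  # the raw HTML your program actually received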
As bushcat69's answer suggests, it's hard to scrape content from Instagram, so you may be better off with a simpler website that doesn't use as much JavaScript or as many protective measures against web scraping.
Instagram's content is loaded via JavaScript, so scraping it like this won't work. It also has many ways of stopping scrapers, so you will have a tough time without automating a browser with something like Selenium.
You can see what is happening when you navigate to a page by opening your browser's Developer Tools - Network - Fetch/XHR and reloading the page. There you can see all the other content that gets loaded; sometimes an easily accessible backend API that serves the data you want is visible and can be scraped (not the case with Instagram, sadly; it is heavily protected).
Let me briefly describe the problem. When I use urllib3 to scrape the HTML from a website, it isn't the same as the HTML I get when I visit the website in Chrome and use 'inspect element'.
Here is an example from my code. The problem is that the HTML I get here is different from the HTML I would get via inspect element in Chrome.
import urllib3
from bs4 import BeautifulSoup

# myUrl is the url of the website I'm trying to scrape
http = urllib3.PoolManager()
response = http.request('GET', myUrl)
soup = BeautifulSoup(response.data.decode('utf-8'), features="html.parser")
m = str(soup)
That problem is probably due to the content of the page being loaded with JavaScript. To get all the data, you have to use some library that runs JavaScript. I recommend using Selenium.
To verify this, you can disable the browser's JavaScript and try to load the page.
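A minimal Selenium sketch (assuming Chrome and the selenium package are installed; myUrl is the variable from the question):
from bs4 import BeautifulSoup
from selenium import webdriver

driver = webdriver.Chrome()
driver.get(myUrl)
# page_source holds the DOM after the page's JavaScript has run
soup = BeautifulSoup(driver.page_source, 'html.parser')
driver.quit()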
It's my first time here! I'm new to Python and I'm getting the error: "'NoneType' object has no attribute 'getText'."
I'm working with the Requests and BeautifulSoup libraries. It's about chess.com, a chess website where all your game data can be downloaded. I'm learning about web scraping and data visualization, and the idea is to work with my own info. The code is:
import re
import requests
from bs4 import BeautifulSoup

text = page.text  # page is the response returned by requests.get() earlier in the script
b = BeautifulSoup(text, 'html.parser')
content = b.find('span', attrs={'class': re.compile("archive-games-game-time")})
content.getText().strip()
"massarov" is my username in the page. I dont´know what´s wrong. Could anyone help me please?????.
if you are logging in it may be better to use session as it keeps your cookies:
import requests

session = requests.Session()
session.post(post_link, data=yourdata)  # log in; post_link is the login URL, yourdata your form data
data = session.get(link)  # later requests through the session reuse its cookies
This will keep you logged in when you change the URL (go to a different page on the website), so whenever there is a need to keep cookies, use a session. A sketch for your case follows.
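Putting it together (a sketch only; the login URL, form fields, and archive URL here are assumptions, so check the real login form in your browser):
import re
import requests
from bs4 import BeautifulSoup

session = requests.Session()
# Hypothetical login URL and form fields; inspect the real login form.
session.post('https://www.chess.com/login', data={'username': 'massarov', 'password': 'your_password'})

page = session.get('https://www.chess.com/games/archive/massarov')  # assumed archive URL
soup = BeautifulSoup(page.text, 'html.parser')
content = soup.find('span', attrs={'class': re.compile('archive-games-game-time')})
if content is not None:
    print(content.getText().strip())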
I want to scrape data from a <span> element on a given website using BeautifulSoup. You can see in the screenshot where it is located. However, the code I'm using just returns an empty list; the data I want isn't in it. What am I doing wrong?
import urllib.request
from bs4 import BeautifulSoup

url = "http://144.122.167.229"
opener = urllib.request.build_opener()
opener.addheaders = [('User-agent', 'Mozilla/5.0')]
data = opener.open(url).read()
soup = BeautifulSoup(data, 'html.parser')
your_data = list()
for line in soup.findAll('span', attrs={'id': 'mc1_legend_value'}):
    your_data.append(line.text)
for line in soup.findAll('span'):
    your_data.append(line.text)
Screenshot: https://imgur.com/a/z0vNh
Thank you.
The dashboard in the screenshot looks to me like something JavaScript would generate. If you can't find the tag in the page source, that means it was added later by some JavaScript code, or your browser tried to fix some HTML it considered broken or out of place.
Keep in mind that right now you're sending a request to a server and it serves the plain HTML back. A browser would parse the HTML and execute any JavaScript code it finds. In your case, Beautiful Soup and urllib don't execute any JavaScript code: urllib fetches the HTML, and Beautiful Soup makes it easier to parse and to extract relevant information.
If you want to get the value from that tag, I recommend using a headless browser to render the page and only after that parse its HTML with Beautiful Soup or any other parser.
Give a try to selenium: http://selenium-python.readthedocs.io/.
You can control your own browser programmatically. You can make it request the page for you, render it, save the new HTML in a variable, parse it using Beautiful Soup, and extract the values you're interested in. I believe it already has its own parser implemented, which you can use directly to search for that tag.
Or maybe even Scrapinghub's Splash: https://github.com/scrapinghub/splash
If the dashboard communicates with a server in real time and that value is continuously received from the server, you could take a look at what requests are sent to the server in order to get that value. Take a look in the developer console under the Network tab: press F12 to open the developer console and click on Network. Refresh the page and you should see all the requests sent to the server along with their responses. Requests sent by the JavaScript are usually XMLHttpRequests; click on XHR in the Network tab to filter out other requests. (These are instructions for Google Chrome; Firefox might differ a bit.)
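If you find such a request, you can often replay it directly (a sketch; the endpoint below is hypothetical, so copy the real URL and headers from the Network tab):
import requests

# Hypothetical XHR endpoint observed in the Network tab.
resp = requests.get('http://144.122.167.229/api/legend_value',
                    headers={'User-Agent': 'Mozilla/5.0'})
print(resp.json())  # real-time dashboard values typically come back as JSON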
I am trying to do a scraping exercise using Python requests and BeautifulSoup.
Basically I am crawling an Amazon web page.
I am able to crawl the first page without any issues.
r = requests.get("http://www.amazon.in/gp/bestsellers/books/ref=nav_shopall_books_bestsellers")
# do something
But when I try to crawl the 2nd page, with "#2" in the URL:
r = requests.get("http://www.amazon.in/gp/bestsellers/books/ref=nav_shopall_books_bestsellers#2")
I see r still has the same value, equivalent to the value for page 1:
r = requests.get("http://www.amazon.in/gp/bestsellers/books/ref=nav_shopall_books_bestsellers")
I don't know if #2 is causing trouble while making the request to the second page.
I also googled the issue but could not find a fix.
What is the right way to make a request to a URL with # values? How do I address this issue? Please advise.
"#2" is an fragment identifier, it's not visible on the server-side. Html content that you get, opening "http://someurl.com/page#123" is same as content for "http://someurl.com/page".
In browser you see second page because page's javascript see fragment identifier, create ajax request and inject new content into page. You should find ajax request's url and use it:
Looks like our url is:
http://www.amazon.in/gp/bestsellers/books/ref=zg_bs_books_pg_2?ie=UTF8&pg=2&aj
Easily we can understand that all we need is to change "pg" param value to get another pages.
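For example, a minimal sketch using that pattern (the page count here is a guess):
import requests

for pg in range(1, 6):  # pages 1-5
    url = ("http://www.amazon.in/gp/bestsellers/books/"
           "ref=zg_bs_books_pg_{0}?ie=UTF8&pg={0}".format(pg))
    r = requests.get(url)
    # ... parse r.text with BeautifulSoup ...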
You need to request the URL in the href attribute of the anchor tags that describe the pagination; they're at the bottom of the page. If I inspect the page in the developer console in Google Chrome, I find the first page's URL is like:
http://www.amazon.in/gp/bestsellers/books/ref=zg_bs_books_pg_1?ie=UTF8&pg=1
and the second page's URL is like this:
http://www.amazon.in/gp/bestsellers/books/ref=zg_bs_books_pg_2?ie=UTF8&pg=2
The a tag for the second page is like this:
<a page="2" ajaxUrl="http://www.amazon.in/gp/bestsellers/books/ref=zg_bs_books_pg_2?ie=UTF8&pg=2&ajax=1" href="http://www.amazon.in/gp/bestsellers/books/ref=zg_bs_books_pg_2?ie=UTF8&pg=2">21-40</a>
So you need to change the request URL.
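A sketch that follows those pagination links (the page attribute selector matches the anchor shown above):
import requests
from bs4 import BeautifulSoup

r = requests.get("http://www.amazon.in/gp/bestsellers/books/ref=nav_shopall_books_bestsellers")
soup = BeautifulSoup(r.text, 'html.parser')
# Anchors that carry a page attribute are the pagination links.
for a in soup.find_all('a', attrs={'page': True}):
    page_resp = requests.get(a['href'])
    # ... parse page_resp.text for that page's items ...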