Is it possible to find this link text with requests? - python

At the URL https://www.airbnb.com/rooms/3093543, a map loads near the bottom of the page containing a ‘neighborhood’ box that says Presidio. The text is stored in an anchor tag as Presidio.
I'm trying to get it with this:
import requests
from bs4 import BeautifulSoup

profile = BeautifulSoup(requests.get("https://www.airbnb.com/rooms/3093543").content, "html.parser")
print profile.select('div[id="hover-card"]')[0].find('a').text
# div[id="hover-card"] is not found
I’m not sure if this is a dynamic variable that could only be retrieved with another module, or whether it is possible to get with requests.

You can get that data via another element.
Try this:
import requests
from bs4 import BeautifulSoup

profile = BeautifulSoup(requests.get("https://www.airbnb.com/rooms/3093543").content, "html.parser")
print profile.select('meta[id="_bootstrap-neighborhood_card"]')[0]
And if needed, request the map data via:
https://www.airbnb.pt/locations/api/neighborhood_tiles.json?ids%5B%5D=ID
where ID in the above URL is given by the neighborhood_basic_info attribute in the first print.
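A minimal sketch of that flow, assuming the meta tag's content attribute holds JSON whose neighborhood_basic_info object carries the id (the exact structure may differ):
import json

import requests
from bs4 import BeautifulSoup

profile = BeautifulSoup(requests.get("https://www.airbnb.com/rooms/3093543").content, "html.parser")
meta = profile.select('meta[id="_bootstrap-neighborhood_card"]')[0]

# Assumed structure: the "content" attribute is JSON with a
# "neighborhood_basic_info" object that includes the neighborhood id
data = json.loads(meta["content"])
neighborhood_id = data["neighborhood_basic_info"]["id"]

# requests encodes "ids[]" to ids%5B%5D, matching the URL above
tiles = requests.get(
    "https://www.airbnb.pt/locations/api/neighborhood_tiles.json",
    params={"ids[]": neighborhood_id},
).json()
print(tiles)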

Related

How to retrieve page content from a list of website links using a for loop in python and lxml?

I am scraping data from a website and I have retrieved a list of URLs from which I will be getting the final data I need. How do I retrieve the html from this list of addresses using a loop?
Using xpath in lxml I have a list of URLs. Now I need to retrieve page content for each of these URLs and then use xpath once again to get the final data from each of these pages. I am able to individually get data from each page if I use
pagecontent=requests.get(linklist[1])
then I am able to get the content of one URL, but if I use a for loop
for i in range(0,8):
    pagecontent[i] = requests.get(linklist[i])
I get the error "list assignment index out of range". I have also tried using
pagecontent=[requests.get(linklist) for s in linklist]
the error I see is No connection adapters were found for '['http...(list of links)...]'
I am trying to get a list pagecontent where each item in the list has html of the respective URLs. What is the best way to achieve this?
In light of your comment, I believe this (or something like this) may be what you're looking for; I can't try it myself since I don't have your linklist, but you should be able to modify the code to fit your situation. It uses Python f-strings to accomplish what you need.
linklist = ['www.example_1.com', 'www.example_2.com', 'www.example_3.com']

pages = {}  # initialize an empty dictionary to house your name/link entries
for i in range(len(linklist)):
    pages[f'pagecontent[{i+1}]'] = linklist[i]  # the '+1' is needed because Python counts from 0

for name, link in pages.items():
    print(name, link)
Output:
pagecontent[1] www.example_1.com
pagecontent[2] www.example_2.com
pagecontent[3] www.example_3.com
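That said, to actually fetch the pages into a list (a minimal sketch; it assumes the entries in linklist are full URLs with an http:// scheme, which requests requires):
import requests

linklist = ['http://www.example_1.com', 'http://www.example_2.com', 'http://www.example_3.com']

# Append to an empty list instead of assigning by index
# (pagecontent[i] = ... on an empty list raises "list assignment index out of range")
pagecontent = []
for link in linklist:
    pagecontent.append(requests.get(link))

# Equivalent list comprehension; note it passes each link, not the whole list,
# which avoids the "No connection adapters" error
pagecontent = [requests.get(link) for link in linklist]

# Each item is a Response object; the HTML is in .text
print(pagecontent[0].text)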

Python scraping deep nested divs whose classes change

I'm somewhat new to Python and working on the first part of a project where I need to get the link(s) on a FanDuel page, and I've been spinning my tires trying to get the 'href'.
Here's what Inspect Element shows (screenshot not reproduced here); the link I'm trying to get is the highlighted element.
I can see what seems to be the parent element, but as you go down the tree, the classes listed with lettering (i.e. "_a _ch _al _nr _dq _ns _nt _nu") change from day to day.
What I noticed is that the 'href' I need has a constant "data-test-id" that does not change, so I was trying to use that as my way to find what I need, but it does not seem to be working.
I'm not sure how much farther, if at all, I need to drill down to get what I need, or if my code is totally off. Thanks for your help in advance!
import requests
from bs4 import BeautifulSoup

url = "https://www.fanduel.com/contests/mlb/96"

# Authentication might not be necessary; it was a test, still getting the same results
site = requests.get(url, cookies={'X-Auth-Token': 'MY TOKEN IS HERE'})
soup = BeautifulSoup(site.content, 'lxml')
game = soup.find_all('a', {'data-test-id': "ContestCardEnterLink"})

# If I use this, I get an error
game = soup.find_all('a', {'data-test-id': "ContestCardEnterLink"})[('href')]
print(game)
The HTML is constructed by JavaScript. To check this, instead of using Inspect Element, use View Page Source and see whether the HTML is already constructed there (this is the HTML you get when you do requests.get()); I've already checked this, and it is not. To resolve it, use Selenium to render the JavaScript on the page; then you can get the page source from Selenium after it has constructed the elements in the DOM.
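A minimal sketch of that approach, assuming Selenium plus a matching ChromeDriver are installed; the selector comes from the question, and the fixed sleep is a crude stand-in for an explicit wait:
import time

from bs4 import BeautifulSoup
from selenium import webdriver

url = "https://www.fanduel.com/contests/mlb/96"

driver = webdriver.Chrome()  # assumes chromedriver is on the PATH
driver.get(url)
time.sleep(5)  # crude wait for the JS to build the DOM
html = driver.page_source
driver.quit()

soup = BeautifulSoup(html, 'lxml')

# find_all returns a list of tags; read 'href' from each tag, not from the list
for a in soup.find_all('a', {'data-test-id': "ContestCardEnterLink"}):
    print(a.get('href'))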

Getting the XPath from an HTML document

https://next.newsimpact.com/NewsWidget/Live
I am trying to write a Python script that will grab a value from an HTML table at the link above. The link above is the site I am trying to grab from, and this is the code I have written. I think my XPath may be incorrect, because it's been doing fine on other elements, but the path I'm using is not returning/printing anything.
from lxml import html
import requests
page = requests.get('https://next.newsimpact.com/NewsWidget/Live')
tree = html.fromstring(page.content)
# This should return the target cell's text as a list:
value = tree.xpath('//*[@id="table9521"]/tr[1]/td[4]/text()')
print('Value: ', value)
What is strange is that when I view the page source, I can't find the table I am trying to pull from.
Thank you for your help!
The required data is absent from the initial page source; it comes from an XHR request. You can get it as below:
import requests
response = requests.get('https://next.newsimpact.com/NewsWidget/GetNextEvents?offset=-120').json()
first_previous = response['Items'][0]['Previous'] # Current output - "2.632"
second_previous = response['Items'][1]['Previous'] # Currently - "0.2"
first_forecast = response['Items'][0]['Forecast'] # ""
second_forecast = response['Items'][1]['Forecast'] # "0.3"
You can parse the response as a simple Python dict and get all the required data.
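For example, to walk every event in the feed (a sketch; only the Previous and Forecast keys shown above are assumed):
for item in response['Items']:
    print('%s -> %s' % (item.get('Previous'), item.get('Forecast')))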
Your problem is simple: requests doesn't handle JavaScript at all. The values are generated by JS!
If you really need to run this XPath, you need to use a module capable of understanding JS, like spynner.
You can test whether you need JS by first using curl, or by disabling JS in your browser. With Firefox: type about:config in the navigation bar, search for javascript.enabled, then double-click it to toggle between true and false.
In Chrome, open the dev tools; the option is in the settings.
Check https://github.com/makinacorpus/spynner
Another (possible) problem: use tree = html.fromstring(page.text), not tree = html.fromstring(page.content).

Xpath in python not getting data

I'm trying to request data from Wikipedia in Python using XPath.
I'm getting an empty list. What am I doing wrong?
import requests
from lxml import html
pageContent=requests.get(
'https://en.wikipedia.org/wiki/List_of_Olympic_medalists_in_judo'
)
tree = html.fromstring(pageContent.content)
name = tree.xpath('//*[@id="mw-content-text"]/div/table[1]/tbody/tr[2]/td[2]/a[1]/text()')
print name
This is a very common mistake when copying an XPath from the browser for table tags: the browser is the one that normally adds a tbody tag inside them, which doesn't actually exist in the response body.
So just remove it, and it should look like:
'//*[@id="mw-content-text"]/div/table[1]//tr[2]/td[2]/a[1]/text()'
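Putting it together (a sketch; the table index assumes the page layout at the time of the question):
import requests
from lxml import html

pageContent = requests.get('https://en.wikipedia.org/wiki/List_of_Olympic_medalists_in_judo')
tree = html.fromstring(pageContent.content)

# Same path as before, with the tbody step dropped
name = tree.xpath('//*[@id="mw-content-text"]/div/table[1]//tr[2]/td[2]/a[1]/text()')
print(name)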

soup.find("div", id = "tournamentTable"), None returned - python 2.7 - BS 4.5.1

I'm trying to parse the following page: http://www.oddsportal.com/soccer/france/ligue-1-2015-2016/results/
The part I'm interested in is getting the table along with the scores and odds.
The code I have so far:
url = "http://www.oddsportal.com/soccer/france/ligue-1-2015-2016/results/"
req = requests.get(url, timeout = 9)
soup = BeautifulSoup(req.text)
print soup.find("div", id = "tournamentTable"), soup.find("#tournamentTable")
>>> <div id="tournamentTable"></div> None
Very simple, but I'm weirdly stuck at finding the table in the tree. Although I found already-prepared datasets, I would like to know why the printed results are an empty tag and None.
Any ideas?
Thanks
First, this page uses JavaScript to fetch data. If you disable JS in your browser, you will notice that the div tag exists but has nothing in it, so the first find() prints an empty tag.
Second, # is a CSS selector; you cannot use it in find(). As the documentation puts it:
Any argument that's not recognized will be turned into a filter on one of a tag's attributes.
So the second find() searches for a tag matching #tournamentTable, and since nothing matches, it returns None.
It looks like the table gets populated with an Ajax call back to the server. That is why, when you print soup.find("div", id="tournamentTable"), you get only the empty tag. When you print soup.find("#tournamentTable"), you get None because that tries to find an element whose tag name is "#tournamentTable". If you want to use CSS selectors, you should use soup.select(), like this: soup.select('#tournamentTable'), or soup.select('div#tournamentTable') if you want to be even more particular.
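For instance (a sketch; select() always returns a list, so the empty div shows up as a one-element result):
import requests
from bs4 import BeautifulSoup

url = "http://www.oddsportal.com/soccer/france/ligue-1-2015-2016/results/"
soup = BeautifulSoup(requests.get(url, timeout=9).text, "html.parser")

print(soup.select('#tournamentTable'))     # [<div id="tournamentTable"></div>]
print(soup.select('div#tournamentTable'))  # same match, restricted to div tags
print(soup.find("#tournamentTable"))       # None: find() treats this as a tag name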
