XPath in Python not getting data - python

I'm trying to request data from Wikipedia in Python using XPath, but I'm getting an empty list. What am I doing wrong?
import requests
from lxml import html
pageContent=requests.get(
'https://en.wikipedia.org/wiki/List_of_Olympic_medalists_in_judo'
)
tree = html.fromstring(pageContent.content)
name = tree.xpath('//*[@id="mw-content-text"]/div/table[1]/tbody/tr[2]/td[2]/a[1]/text()')
print name

This is a very common mistake when copying an XPath for a table from the browser: the browser itself normally adds a tbody tag inside tables, so that tag doesn't actually exist in the response body.
So just remove the tbody, and it should look like:
'//*[@id="mw-content-text"]/div/table[1]//tr[2]/td[2]/a[1]/text()'
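A minimal sketch of the fix, using an inline HTML snippet instead of the live Wikipedia page (the table content below is illustrative, not the real article):

```python
from lxml import html

# The HTML as actually served contains no <tbody>; the browser's
# inspector adds one. lxml parses what the server sent.
snippet = """
<html><body>
<div id="mw-content-text"><div>
<table>
  <tr><td>Year</td><td>Gold</td></tr>
  <tr><td>1964</td><td><a href="/wiki/Takehide_Nakatani">Takehide Nakatani</a></td></tr>
</table>
</div></div>
</body></html>
"""
tree = html.fromstring(snippet)
# Using // before tr matches the rows whether or not a tbody is present
name = tree.xpath('//*[@id="mw-content-text"]/div/table[1]//tr[2]/td[2]/a[1]/text()')
print(name)  # ['Takehide Nakatani']
```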

Related

Data scraper: the contents of the div tag is empty (??)

I am scraping a website to get a number. This number changes dynamically every split second, and upon inspection the number is shown, but the div wrapper that contains it returns no value. What am I missing? (Please go easy on me, as I am quite new to Python and data scraping.)
I have some code that works and returns the piece of HTML that supposedly contains the data I want, but no joy: the div wrapper comes back empty.
import requests
from bs4 import BeautifulSoup
r = requests.get('https://deuda-publica-espana.com')
deuda = BeautifulSoup(r.text, 'html.parser')
deuda = deuda.findAll('div', {'id': 'contador_PDEH'})
print(deuda)
I don't receive any errors, I am just getting [<div class="contador_xl contador_verde" id="contador_PDEH"></div>] with no value!
Indeed, it is easy with Selenium. I suspect there is a JavaScript counter supplying the number, which is why you can't find it with your method (as mentioned in the comments):
from selenium import webdriver
d = webdriver.Chrome(r'C:\Users\User\Documents\chromedriver.exe')
d.get('https://deuda-publica-espana.com/')
print(d.find_element_by_id('contador_PDEH').text)
d.quit()

Python scraping deep nested divs whose classes change

I'm somewhat new to Python, and working on the first part of a project where I need to get the link(s) on a FanDuel page, and I've been spinning my tires trying to get the 'href'.
Here's what Inspect Element shows (screenshot omitted):
What I'm trying to get is highlighted there.
I can see the parent element, but as you go down the tree, the classes with lettering (i.e. "_a _ch _al _nr _dq _ns _nt _nu") change from day to day.
What I noticed is that the 'href' I need has a constant "data-test-id" that does not change, so I was trying to use that as my way to find what I need, but it does not seem to be working.
I'm not sure how much farther, or whether, I need to drill down to get what I need, or if my code is totally off. Thanks for your help in advance!
import requests
from bs4 import BeautifulSoup
url = "https://www.fanduel.com/contests/mlb/96"
# Authentication might not be necessary; it was a test, still getting the same results
site = requests.get(url, cookies={'X-Auth-Token': 'MY TOKEN IS HERE'})
soup = BeautifulSoup(site.content, 'lxml')
game = soup.find_all('a', {'data-test-id': "ContestCardEnterLink"})
# If I use this instead, I get an error:
game = soup.find_all('a', {'data-test-id': "ContestCardEnterLink"})[('href')]
print(game)
print(game)
The HTML is constructed by JavaScript. To check this, instead of using Inspect Element, use View Source (that is the HTML you get from requests.get()) and see whether the element is already there; I've checked, and it isn't. To resolve this, use Selenium to render the JavaScript on the page; you can then get the page source from Selenium after it has constructed the elements in the DOM.
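Once Selenium has rendered the page, its driver.page_source can be handed to BeautifulSoup and searched by the stable data-test-id. A sketch using a stand-in string for the rendered source (the class names and contest URLs below are made up for illustration):

```python
from bs4 import BeautifulSoup

# Stand-in for driver.page_source after the JavaScript has run;
# the class string and hrefs are illustrative only.
rendered = """
<div class="_a _ch _al _nr">
  <a data-test-id="ContestCardEnterLink" href="/contests/mlb/96/111">Enter</a>
  <a data-test-id="ContestCardEnterLink" href="/contests/mlb/96/222">Enter</a>
</div>
"""
soup = BeautifulSoup(rendered, 'html.parser')
# Collect every href from the anchors with the stable data-test-id
links = [a['href'] for a in soup.find_all('a', {'data-test-id': 'ContestCardEnterLink'})]
print(links)  # ['/contests/mlb/96/111', '/contests/mlb/96/222']
```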

why do python and BS4 return only one 'href' when called specifically, but all values when called as text?

Scraping a page and trying to get all the urls from the first column. When I call as text I get everything in the div, which I get. But, when I specifically target the URL, I only get the first one. How do I get all of them - separated for storage?
from bs4 import BeautifulSoup
from urllib import urlopen
base_url = "http://www.heavyliftpfi.com/news/"
html = urlopen(base_url)
soup = BeautifulSoup(html.read().decode('latin-1', 'ignore'),"lxml")
main_div = soup.select_one("div.fullWidth")
div_sub = main_div.select_one("div.leftcol")
print (div_sub).text # I get that this gets everything as .text
print (div_sub).h2.a['href'] # alternate - with only one 'href' return
Since you are navigating the parse tree via tag names, only the first element with a matching name is returned. This is expected behavior. Try using find_all() to search for them instead.
from the BS4 docs:
"Using a tag name as an attribute will give you only the first tag by
that name."
"If you need to get all the tags, or anything more complicated
than the first tag with a certain name, you’ll need to use one of the
methods described in Searching the tree, such as find_all()"
see: https://www.crummy.com/software/BeautifulSoup/bs4/doc/#navigating-using-tag-names
It was find_all(), but I needed to move up the tree:
for a in main_div.findAll('a', href=True):
print a['href']
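The difference is easy to demonstrate on a toy document (the markup below is made up for illustration):

```python
from bs4 import BeautifulSoup

html_doc = """
<div class="leftcol">
  <h2><a href="/news/first">First story</a></h2>
  <h2><a href="/news/second">Second story</a></h2>
</div>
"""
soup = BeautifulSoup(html_doc, 'html.parser')
div_sub = soup.select_one('div.leftcol')

# Tag-name navigation stops at the first match
print(div_sub.h2.a['href'])  # /news/first
# find_all() returns every match
print([a['href'] for a in div_sub.find_all('a', href=True)])  # ['/news/first', '/news/second']
```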

Website scraping with python3 & beautifulsoup 4

I'm starting to make progress on a website scraper, but I've run into two snags. Here is the code first:
import requests
from bs4 import BeautifulSoup
r=requests.get("http://www.nytimes.com")
soup=BeautifulSoup(r.text)
headlines=soup.find_all(class_="story-heading")
for headline in headlines:
print (headline)
Questions
Why do you have to use find_all(class_="story-heading")
instead of just find_all("story-heading")? I realize that story-heading is a class of its own, but can't I just search all the HTML using find_all and get the same results? The notes for BeautifulSoup show find_all("a") returning all the anchor tags in an HTML document; why won't find_all("story-heading") do the same?
Is it because if I try and do that, it will just find all the instances of "story-heading" within the HTML and return those? I am trying to get python to return everything in that tag. That's my best guess.
Why do I get all this extra junk code? Should my requests to find all just show me everything within the story-header tag? I'm getting a lot more text than what I am just trying to specify.
Beautiful Soup lets you use CSS selectors through the select() method (look in the docs for "CSS selectors").
You can find all elements with class "story-heading" like so:
soup.select(".story-heading")
If instead you're looking for an id, just do:
soup.select("#id-name")
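A quick sketch of both approaches on a cut-down document (the markup is illustrative, not the real nytimes.com page):

```python
from bs4 import BeautifulSoup

page = """
<article class="story-heading"><a href="/a">Headline A</a></article>
<article class="story-heading"><a href="/b">Headline B</a></article>
"""
soup = BeautifulSoup(page, 'html.parser')

by_selector = soup.select('.story-heading')       # CSS selector
by_class = soup.find_all(class_='story-heading')  # keyword argument
print(len(by_selector), len(by_class))  # 2 2
```

Both calls match on the class attribute; plain find_all("story-heading") instead looks for a tag literally named story-heading, which doesn't exist, hence the empty result.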

BeautifulSoup Django Parse for Links

I'm trying to get all the a links with class=fl. I'm using mechanize to get the raw HTML output, and then BeautifulSoup to try to parse out the links.
The value of rawGatheredGoogleOutput is a Google result page (screenshot omitted); the highlighted portion showed what I'm trying to grab, which is the a.fl links.
To find a elements with a class=fl attribute, you call find_all like this:
getAdditionalGooglePages = beautifulSoupObj.find_all('a', attrs={"class": "fl"})
For other attributes, it's simpler - for example, with id=fl it would be:
getAdditionalGooglePages = beautifulSoupObj.find_all('a', id="fl")
... but that doesn't work with class, because it's a Python reserved word.
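Both spellings can be checked on a toy snippet (the markup below is made up; class_ is Beautiful Soup's documented workaround for the reserved word):

```python
from bs4 import BeautifulSoup

html_doc = ('<p><a class="fl" href="/page1">1</a>'
            '<a class="fl" href="/page2">2</a>'
            '<a href="/other">x</a></p>')
soup = BeautifulSoup(html_doc, 'html.parser')

# attrs dict and the class_ keyword find the same elements
by_attrs = soup.find_all('a', attrs={'class': 'fl'})
by_keyword = soup.find_all('a', class_='fl')
print([a['href'] for a in by_keyword])  # ['/page1', '/page2']
```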