from lxml import html
import requests
url = 'https://www.data.gov/'
r = requests.get(url)
doc = html.fromstring(r.content)
link = doc.xpath('/html/body/header/div[4]/div/div/h4/label/small/a')
print(link)
This keeps giving me:
[<Element a at 0x1c64c963f48]>
instead of the actual number I am seeking on the page. Any idea why?
Also, why can't I call type(link) to see the type?
Try the code below to get "192,322" as output:
from lxml import html
import requests
url = 'https://www.data.gov/'
r = requests.get(url)
doc = html.fromstring(r.content)
try:
    link = doc.xpath('//a[@href="/metrics"]/text()')[0]
    print(link.split()[0])
except IndexError:
    print("No link found")
Your XPath gives you <a> elements. You want their text. So... print their text.
link = doc.xpath("//label[@for='search-header']//a")
for a in link:
    print(a.text)
Notes
/html/body/header/div[4]/div/div/h4/label/small/a is way too specific. It will break very easily when they make even the slightest change to their HTML layout. Don't use auto-generated XPath expressions. Write all your XPath expressions yourself.
XPath always returns a list of nodes, even if there is only one hit. Use a loop or pick a specific list item (like link[0]).
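Both notes (and the type(link) question from the original post) can be illustrated with a minimal, self-contained example; the inline HTML here is a made-up stand-in for the real page:

```python
from lxml import html

# A tiny inline document standing in for the real page
doc = html.fromstring('<h4><small><a href="/metrics">192,322 DATASETS</a></small></h4>')

result = doc.xpath('//a/text()')
print(type(result))  # <class 'list'> -- xpath() always returns a list
print(result[0])     # 192,322 DATASETS
```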
You can extract the link's text instead of the element itself by appending text() to your XPath expression. See below:
from lxml import html
import requests
url = 'https://www.data.gov/'
r = requests.get(url)
doc = html.fromstring(r.content)
link = doc.xpath('/html/body/header/div[4]/div/div/h4/label/small/a/text()')
print(link)
Example in Chrome Developer Tools:
> $x("/html/body/header/div[4]/div/div/h4/label/small/a/text()")[0]
> 192,322 DATASETS
Related
I am trying to retrieve URLs under a certain property. The current code I have is:
import urllib.request
import lxml.html
url = 'https://play.acast.com/s/jeg-kan-ingenting-om-vin/33.hvorforercheninblancfraloireogsor-afrikaikkelengerpafolksradar-'
connection = urllib.request.urlopen(url)
dom = lxml.html.fromstring(connection.read())
links = []
for link in dom.xpath('//meta/@content'):  # select the content attribute of every meta tag
    if 'mp3' in link:
        links.append(link)
output = set(links)
for i in output:
    print(i)
This outputs two links, which is not what I want:
https://sphinx.acast.com/jeg-kan-ingenting-om-vin/33.hvorforercheninblancfraloireogsor-afrikaikkelengerpafolksradar-/media.mp3
https://sphinx.acast.com/jeg-kan-ingenting-om-vin/33.hvorforercheninblancfraloireogsor-afrikaikkelengerpafolksradar-r/media.mp3
What I would like to do is to get only the URL that is under the og:audio property, not the og:audio:secure_url property.
How do I accomplish this?
To only select a tag where the property="og:audio" and not property="og:audio:secure_url", you can use an [attribute=value]
CSS selector. In your case it would be: [property="og:audio"].
Since you tagged beautifulsoup, you can do it as follows:
soup = BeautifulSoup(connection.read(), "html.parser")
for tag in soup.select('[property="og:audio"]'):
    print(tag["content"])
Output:
https://sphinx.acast.com/jeg-kan-ingenting-om-vin/33.hvorforercheninblancfraloireogsor-afrikaikkelengerpafolksradar-/media.mp3
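To see why the exact-match selector excludes the secure_url variant, here is a small self-contained check; the inline HTML and URLs are made up:

```python
from bs4 import BeautifulSoup

html_doc = '''
<meta property="og:audio" content="https://example.com/media.mp3">
<meta property="og:audio:secure_url" content="https://example.com/media.mp3">
'''
soup = BeautifulSoup(html_doc, "html.parser")

# [property="og:audio"] is an exact value match, so the
# og:audio:secure_url tag is not selected
tags = soup.select('[property="og:audio"]')
print(len(tags))            # 1
print(tags[0]["content"])   # https://example.com/media.mp3
```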
A better way would be to study the XHR calls in the Network tab when you inspect the page. In the response of https://feeder.acast.com/api/v1/shows/jeg-kan-ingenting-om-vin/episodes/33.hvorforercheninblancfraloireogsor-afrikaikkelengerpafolksradar-?showInfo=true the url key is what you are looking for.
I am confused as to how I would scrape all the links (those containing the string "mp3") off a given XML page. The following code only returns an empty list:
# Import required modules
from lxml import html
import requests
# Request the page
page = requests.get('https://feeds.megaphone.fm/darknetdiaries')
# Parsing the page
# (We need to use page.content rather than
# page.text because html.fromstring implicitly
# expects bytes as input.)
tree = html.fromstring(page.content)
# Get element using XPath
buyers = tree.xpath('//enclosure[@url="mp3"]/text()')
print(buyers)
Am I using @url wrong?
The links I am looking for:
Any help would be greatly appreciated!
What happens?
The following XPath won't work; as you suspected, the problem is the use of @url (an exact match against "mp3") together with text():
//enclosure[@url="mp3"]/text()
Solution
The url attribute of any //enclosure should contain mp3, and the attribute itself is then returned with /@url.
Change this line:
buyers = tree.xpath('//enclosure[@url="mp3"]/text()')
to
buyers = tree.xpath('//enclosure[contains(#url,"mp3")]/#url')
Output
['https://www.podtrac.com/pts/redirect.mp3/traffic.megaphone.fm/ADV9231072845.mp3?updated=1610644901',
'https://www.podtrac.com/pts/redirect.mp3/traffic.megaphone.fm/ADV2643452814.mp3?updated=1609788944',
'https://www.podtrac.com/pts/redirect.mp3/traffic.megaphone.fm/ADV5381316822.mp3?updated=1607279433',
'https://www.podtrac.com/pts/redirect.mp3/traffic.megaphone.fm/ADV9145504181.mp3?updated=1607280708',
'https://www.podtrac.com/pts/redirect.mp3/traffic.megaphone.fm/ADV4345070838.mp3?updated=1606110384',
'https://www.podtrac.com/pts/redirect.mp3/traffic.megaphone.fm/ADV8112097820.mp3?updated=1604866665',
'https://www.podtrac.com/pts/redirect.mp3/traffic.megaphone.fm/ADV2164178070.mp3?updated=1603781321',
'https://www.podtrac.com/pts/redirect.mp3/traffic.megaphone.fm/ADV1107638673.mp3?updated=1610220449',
...]
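The contains() predicate can also be verified offline against a minimal feed snippet; the feed structure below mimics the real one, but the URLs are made up:

```python
from lxml import etree

feed = b'''<rss><channel>
<item><enclosure url="https://example.com/ep1.mp3?updated=1" type="audio/mpeg"/></item>
<item><enclosure url="https://example.com/ep2.mp3?updated=2" type="audio/mpeg"/></item>
</channel></rss>'''

tree = etree.fromstring(feed)
# contains(@url, "mp3") matches both enclosures; /@url returns the attribute values
urls = tree.xpath('//enclosure[contains(@url, "mp3")]/@url')
print(urls)
```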
It does not directly answer your question, but you could check out BeautifulSoup as an alternative (and it has an option to use lxml under the hood too).
import lxml # early failure if not installed
from bs4 import BeautifulSoup
import requests
# Request the page
page = requests.get('https://feeds.megaphone.fm/darknetdiaries')
# Parse
soup = BeautifulSoup(page.text, 'lxml')
# Find
#mp3 = [link['href'] for link in soup.find_all('a') if 'mp3' in link['href']]
# UPDATE - correct tag and attribute
mp3 = [link['url'] for link in soup.find_all('enclosure') if 'mp3' in link['url']]
I am new to web scraping and this is one of my first web scraping projects. I can't find the right selector for my soup.select("").
I want to get the "data-phone" value (see picture below to understand), but it is in a div class and then inside an <a href>, which makes it a little complicated for me!
I searched online and I found that I have to use soup.find_all, but this is not very helpful. Can anyone help me or give me a quick tip? Thank you!
my code:
import webbrowser, requests, bs4, os
url = "https://www.pagesjaunes.ca/search/si/1/electricien/Montreal+QC"
res = requests.get(url)
res.raise_for_status()
soup = bs4.BeautifulSoup(res.text)
result = soup.find('a', {'class': 'mlr__item__cta jsMlrMenu'})
Phone = result['data-phone']
print(Phone)
I think one of the simplest way is to use the soup.select which allows the normal css selectors.
https://www.crummy.com/software/BeautifulSoup/bs4/doc/#css-selectors
soup.select('a.mlr__item_cta.jsMlrMenu')
This should return the entire list of anchors from which you can pick the data attribute.
Note I just tried it in the terminal:
from bs4 import BeautifulSoup
import requests
url = 'https://en.wikipedia.org/wiki/Web_scraping'
r = requests.get(url)
soup = BeautifulSoup(r.text, 'html.parser')
result = soup.select('a.mw-jump-link') # or any other selector
print(result)
print(result[0].get("href"))
You will have to loop over the result of soup.select and just collect the data-phone value from the attribute.
UPDATE
Ok I have searched in the DOM myself, and here is how I managed to retrieve all the phone data:
anchors = soup.select('a[data-phone]')
for a in anchors:
    print(a.get('data-phone'))
It works also with only data selector like this: soup.select('[data-phone]')
Here is real proof:
Surprisingly, this one with classes also works for me:
for a in soup.select('a.mlr__item__cta.jsMlrMenu'):
    print(a.get('data-phone'))
There is no surprise, we just had a typo in our first selector...
Find the difference :)
GOOD: a.mlr__item__cta.jsMlrMenu
BAD : a.mlr__item_cta.jsMlrMenu
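The effect of that one missing underscore is easy to reproduce on a snippet; the inline HTML and phone number below are made up:

```python
from bs4 import BeautifulSoup

html_doc = '<a class="mlr__item__cta jsMlrMenu" data-phone="514-555-0100">Call</a>'
soup = BeautifulSoup(html_doc, "html.parser")

# Double underscore: matches the element
print(soup.select('a.mlr__item__cta.jsMlrMenu'))
# Single underscore: matches nothing, prints []
print(soup.select('a.mlr__item_cta.jsMlrMenu'))
```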
I am trying to extract the gallery link of the first result on an imgur search.
theurl = "https://imgur.com/search?q=" + text
thepage = urlopen(theurl)
soup = BeautifulSoup(thepage, "html.parser")
link = soup.findAll('a', {"class": "image-list-link"})[0].decode_contents()
Here is what is being displayed for link:
I am mainly trying to get the href value from only this section (the first result for the search)
Here is what the inspect element looks like:
Actually, it's pretty easy to accomplish what you're trying to do. As shown in the image, the href of the first image (or of any image, for that matter) is located inside the <a> tag with the attribute class="image-list-link". So, you can use the find() function, which returns the first match found, and then use ['href'] to get the link.
Code:
import requests
from bs4 import BeautifulSoup
r = requests.get('https://imgur.com/search?q=python')
soup = BeautifulSoup(r.text, 'lxml')
first_image_link = soup.find('a', class_='image-list-link')['href']
print(first_image_link)
# /gallery/AxKwQ2c
If you want to get the links for all the images, you can use a list comprehension.
all_image_links = [a['href'] for a in soup.find_all('a', class_='image-list-link')]
I am trying to automate the process of obtaining the number of followers of different Twitter accounts using the page source.
I have the following code for one account
from bs4 import BeautifulSoup
import requests
username='justinbieber'
url = 'https://www.twitter.com/'+username
r = requests.get(url)
soup = BeautifulSoup(r.content)
for tag in soup.findAll('a'):
    if tag.has_attr('class'):
        if tag['class'] == 'ProfileNav-stat ProfileNav-stat--link u-borderUserColor u-textCenter js-tooltip js-nav u-textUserColor':
            if tag['href'] == '/justinbieber/followers':
                print(tag.title)
                break
I am not sure where I went wrong. I understand that we can use the Twitter API to obtain the number of followers. However, I wish to try obtaining it through this method as well. Any suggestions?
I've modified the code from here
If I were you, I'd pass the class name as an argument to the find() function instead of find_all(), and I'd first look for the <li> element that contains the anchor you're looking for. It'd look something like this:
from bs4 import BeautifulSoup
import requests
username='justinbieber'
url = 'https://www.twitter.com/'+username
r = requests.get(url)
soup = BeautifulSoup(r.content)
f = soup.find('li', class_="ProfileNav-item--followers")
title = f.find('a')['title']
print(title)
# 81,346,708 Followers
num_followers = int(title.split(' ')[0].replace(',', ''))
print(num_followers)
# 81346708
PS findAll() was renamed to find_all() in bs4