from lxml import html
import requests
url = 'https://www.data.gov/'
r = requests.get(url)
doc = html.fromstring(r.content)
link = doc.xpath('/html/body/header/div[4]/div/div/h4/label/small/a')
print(link)
This keeps giving me:
[<Element a at 0x1c64c963f48]>
instead of the actual number I am seeking on the page. Any idea why?
Also, why can't I call type(link) to see the type?
Try the code below to get "192,322" as output:
from lxml import html
import requests
url = 'https://www.data.gov/'
r = requests.get(url)
doc = html.fromstring(r.content)
try:
    link = doc.xpath('//a[@href="/metrics"]/text()')[0]
    print(link.split()[0])
except IndexError:
    print("No link found")
Your XPath gives you <a> elements. You want their text. So... print their text.
link = doc.xpath("//label[@for='search-header']//a")
for a in link:
    print(a.text)
Notes
/html/body/header/div[4]/div/div/h4/label/small/a is way too specific. It will break very easily when they make even the slightest change to their HTML layout. Don't use auto-generated XPath expressions. Write all your XPath expressions yourself.
XPath always returns a list of nodes, even if there is only one hit. Use a loop or pick a specific list item (like link[0]).
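Both notes (and the type(link) question from the original post) can be illustrated with a minimal, self-contained example; the inline HTML here is a made-up stand-in for the real page:

```python
from lxml import html

# A tiny inline document standing in for the real page
doc = html.fromstring('<h4><small><a href="/metrics">192,322 DATASETS</a></small></h4>')

result = doc.xpath('//a/text()')
print(type(result))  # <class 'list'> -- xpath() always returns a list
print(result[0])     # 192,322 DATASETS
```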
You can extract the link's text instead of the element itself by appending text() to your XPath expression. See below:
from lxml import html
import requests
url = 'https://www.data.gov/'
r = requests.get(url)
doc = html.fromstring(r.content)
link = doc.xpath('/html/body/header/div[4]/div/div/h4/label/small/a/text()')
print(link)
Example in Chrome Developer Tools:
> $x("/html/body/header/div[4]/div/div/h4/label/small/a/text()")[0]
> 192,322 DATASETS
Related
I am trying to retrieve URLs under a certain property. The current code I have is:
import urllib.request
import lxml.html
url = 'https://play.acast.com/s/jeg-kan-ingenting-om-vin/33.hvorforercheninblancfraloireogsor-afrikaikkelengerpafolksradar-'
connection = urllib.request.urlopen(url)
dom = lxml.html.fromstring(connection.read())
links = []
for link in dom.xpath('//meta/@content'):  # select the content attribute of every meta tag
    if 'mp3' in link:
        links.append(link)
output = set(links)
for i in output:
    print(i)
This outputs two links, which is not what I want:
https://sphinx.acast.com/jeg-kan-ingenting-om-vin/33.hvorforercheninblancfraloireogsor-afrikaikkelengerpafolksradar-/media.mp3
https://sphinx.acast.com/jeg-kan-ingenting-om-vin/33.hvorforercheninblancfraloireogsor-afrikaikkelengerpafolksradar-r/media.mp3
What I would like to do is to get only the URL that is under the og:audio property, not the og:audio:secure_url property.
How do I accomplish this?
To only select a tag where the property="og:audio" and not property="og:audio:secure_url", you can use an [attribute=value]
CSS selector. In your case it would be: [property="og:audio"].
Since you tagged beautifulsoup, you can do it as follows:
soup = BeautifulSoup(connection.read(), "html.parser")
for tag in soup.select('[property="og:audio"]'):
    print(tag["content"])
Output:
https://sphinx.acast.com/jeg-kan-ingenting-om-vin/33.hvorforercheninblancfraloireogsor-afrikaikkelengerpafolksradar-/media.mp3
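To see why the exact-match selector excludes the secure_url variant, here is a small self-contained check; the inline HTML and URLs are made up:

```python
from bs4 import BeautifulSoup

html_doc = '''
<meta property="og:audio" content="https://example.com/media.mp3">
<meta property="og:audio:secure_url" content="https://example.com/media.mp3">
'''
soup = BeautifulSoup(html_doc, "html.parser")

# [property="og:audio"] is an exact value match, so the
# og:audio:secure_url tag is not selected
tags = soup.select('[property="og:audio"]')
print(len(tags))            # 1
print(tags[0]["content"])   # https://example.com/media.mp3
```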
A better way would be to study the XHR calls in the Network tab when you inspect the page. In the response of https://feeder.acast.com/api/v1/shows/jeg-kan-ingenting-om-vin/episodes/33.hvorforercheninblancfraloireogsor-afrikaikkelengerpafolksradar-?showInfo=true the url key is what you are looking for.
I am confused as to how I would scrape all the links (those containing the string "mp3") off a given XML page. The following code only returns an empty list:
# Import required modules
from lxml import html
import requests
# Request the page
page = requests.get('https://feeds.megaphone.fm/darknetdiaries')
# Parsing the page
# (We need to use page.content rather than
# page.text because html.fromstring implicitly
# expects bytes as input.)
tree = html.fromstring(page.content)
# Get element using XPath
buyers = tree.xpath('//enclosure[@url="mp3"]/text()')
print(buyers)
Am I using @url wrong?
The links I am looking for:
Any help would be greatly appreciated!
What happens?
The following XPath won't work; as you suspected, the problem is the use of @url (an exact match against "mp3") together with text():
//enclosure[@url="mp3"]/text()
Solution
The url attribute of any //enclosure should contain mp3, and the attribute itself is then returned with /@url.
Change this line:
buyers = tree.xpath('//enclosure[@url="mp3"]/text()')
to
buyers = tree.xpath('//enclosure[contains(#url,"mp3")]/#url')
Output
['https://www.podtrac.com/pts/redirect.mp3/traffic.megaphone.fm/ADV9231072845.mp3?updated=1610644901',
'https://www.podtrac.com/pts/redirect.mp3/traffic.megaphone.fm/ADV2643452814.mp3?updated=1609788944',
'https://www.podtrac.com/pts/redirect.mp3/traffic.megaphone.fm/ADV5381316822.mp3?updated=1607279433',
'https://www.podtrac.com/pts/redirect.mp3/traffic.megaphone.fm/ADV9145504181.mp3?updated=1607280708',
'https://www.podtrac.com/pts/redirect.mp3/traffic.megaphone.fm/ADV4345070838.mp3?updated=1606110384',
'https://www.podtrac.com/pts/redirect.mp3/traffic.megaphone.fm/ADV8112097820.mp3?updated=1604866665',
'https://www.podtrac.com/pts/redirect.mp3/traffic.megaphone.fm/ADV2164178070.mp3?updated=1603781321',
'https://www.podtrac.com/pts/redirect.mp3/traffic.megaphone.fm/ADV1107638673.mp3?updated=1610220449',
...]
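The contains() predicate can also be verified offline against a minimal feed snippet; the feed structure below mimics the real one, but the URLs are made up:

```python
from lxml import etree

feed = b'''<rss><channel>
<item><enclosure url="https://example.com/ep1.mp3?updated=1" type="audio/mpeg"/></item>
<item><enclosure url="https://example.com/ep2.mp3?updated=2" type="audio/mpeg"/></item>
</channel></rss>'''

tree = etree.fromstring(feed)
# contains(@url, "mp3") matches both enclosures; /@url returns the attribute values
urls = tree.xpath('//enclosure[contains(@url, "mp3")]/@url')
print(urls)
```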
It does not directly answer your question, but you could check out BeautifulSoup as an alternative (and it has an option to use lxml under the hood too).
import lxml # early failure if not installed
from bs4 import BeautifulSoup
import requests
# Request the page
page = requests.get('https://feeds.megaphone.fm/darknetdiaries')
# Parse
soup = BeautifulSoup(page.text, 'lxml')
# Find
#mp3 = [link['href'] for link in soup.find_all('a') if 'mp3' in link['href']]
# UPDATE - correct tag and attribute
mp3 = [link['url'] for link in soup.find_all('enclosure') if 'mp3' in link['url']]
I am new to web scraping and this is one of my first web scraping projects. I can't find the right selector for my soup.select("").
I want to get the "data-phone" value (see picture below to understand), but it is in a div class and then inside an <a href>, which makes it a little complicated for me!
I searched online and I found that I have to use soup.find_all, but this is not very helpful. Can anyone help me or give me a quick tip? Thank you!
my code:
import webbrowser, requests, bs4, os
url = "https://www.pagesjaunes.ca/search/si/1/electricien/Montreal+QC"
res = requests.get(url)
res.raise_for_status()
soup = bs4.BeautifulSoup(res.text)
result = soup.find('a', {'class': 'mlr__item__cta jsMlrMenu'})
Phone = result['data-phone']
print(Phone)
I think one of the simplest way is to use the soup.select which allows the normal css selectors.
https://www.crummy.com/software/BeautifulSoup/bs4/doc/#css-selectors
soup.select('a.mlr__item_cta.jsMlrMenu')
This should return the entire list of anchors from which you can pick the data attribute.
Note I just tried it in the terminal:
from bs4 import BeautifulSoup
import requests
url = 'https://en.wikipedia.org/wiki/Web_scraping'
r = requests.get(url)
soup = BeautifulSoup(r.text, 'html.parser')
result = soup.select('a.mw-jump-link') # or any other selector
print(result)
print(result[0].get("href"))
You will have to loop over the result of soup.select and just collect the data-phone value from the attribute.
UPDATE
Ok I have searched in the DOM myself, and here is how I managed to retrieve all the phone data:
anchors = soup.select('a[data-phone]')
for a in anchors:
    print(a.get('data-phone'))
It works also with only data selector like this: soup.select('[data-phone]')
Here is real proof:
Surprisingly, this one with classes also works for me:
for a in soup.select('a.mlr__item__cta.jsMlrMenu'):
    print(a.get('data-phone'))
There is no surprise, we just had a typo in our first selector...
Find the difference :)
GOOD: a.mlr__item__cta.jsMlrMenu
BAD : a.mlr__item_cta.jsMlrMenu
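The effect of that one missing underscore is easy to reproduce on a snippet; the inline HTML and phone number below are made up:

```python
from bs4 import BeautifulSoup

html_doc = '<a class="mlr__item__cta jsMlrMenu" data-phone="514-555-0100">Call</a>'
soup = BeautifulSoup(html_doc, "html.parser")

# Double underscore: matches the element
print(soup.select('a.mlr__item__cta.jsMlrMenu'))
# Single underscore: matches nothing, prints []
print(soup.select('a.mlr__item_cta.jsMlrMenu'))
```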
I am trying to extract the gallery link of the first result on an imgur search.
theurl = "https://imgur.com/search?q=" + text
thepage = urlopen(theurl)
soup = BeautifulSoup(thepage, "html.parser")
link = soup.findAll('a', {"class": "image-list-link"})[0].decode_contents()
Here is what is being displayed for link:
I am mainly trying to get the href value from only this section (the first result for the search)
Here is what the inspect element looks like:
Actually, it's pretty easy to accomplish what you're trying to do. As shown in the image, the href of the first image (or of any image, for that matter) is located inside the <a> tag with the attribute class="image-list-link". So, you can use the find() function, which returns the first match found, and then use ['href'] to get the link.
Code:
import requests
from bs4 import BeautifulSoup
r = requests.get('https://imgur.com/search?q=python')
soup = BeautifulSoup(r.text, 'lxml')
first_image_link = soup.find('a', class_='image-list-link')['href']
print(first_image_link)
# /gallery/AxKwQ2c
If you want to get the links for all the images, you can use a list comprehension.
all_image_links = [a['href'] for a in soup.find_all('a', class_='image-list-link')]
I am trying to automate the process of obtaining the number of followers of different Twitter accounts using the page source.
I have the following code for one account
from bs4 import BeautifulSoup
import requests
username='justinbieber'
url = 'https://www.twitter.com/'+username
r = requests.get(url)
soup = BeautifulSoup(r.content)
for tag in soup.findAll('a'):
    if tag.has_attr('class'):
        if tag['class'] == 'ProfileNav-stat ProfileNav-stat--link u-borderUserColor u-textCenter js-tooltip js-nav u-textUserColor':
            if tag['href'] == '/justinbieber/followers':
                print(tag.title)
                break
I am not sure where I went wrong. I understand that we can use the Twitter API to obtain the number of followers. However, I wish to try obtaining it through this method as well. Any suggestions?
I've modified the code from here
If I were you, I'd pass the class name as an argument to the find() function instead of find_all(), and I'd first look for the <li> element that contains the anchor you're looking for. It'd look something like this:
from bs4 import BeautifulSoup
import requests
username='justinbieber'
url = 'https://www.twitter.com/'+username
r = requests.get(url)
soup = BeautifulSoup(r.content)
f = soup.find('li', class_="ProfileNav-item--followers")
title = f.find('a')['title']
print(title)
# 81,346,708 Followers
num_followers = int(title.split(' ')[0].replace(',', ''))
print(num_followers)
# 81346708
PS findAll() was renamed to find_all() in bs4