Web scraping an xml page in python?

Web scraping an xml page in python? - python

I am confused as to how I would scrape all the links (that only contain the string "mp3") off a given xml page. The following code only returns empty brackets:
# Import required modules
from lxml import html
import requests
# Request the page
page = requests.get('https://feeds.megaphone.fm/darknetdiaries')
# Parsing the page
# (We need to use page.content rather than
# page.text because html.fromstring implicitly
# expects bytes as input.)
tree = html.fromstring(page.content)
# Get element using XPath
buyers = tree.xpath('//enclosure[#url="mp3"]/text()')
print(buyers)
Am I using #url wrong?
The links I am looking for:
Any help would be greatly appreciated!

What happens?
The following xpath wont work, as you mentioned it is the use of #url and also text()
//enclosure[#url="mp3"]/text()
Solution
The attribute url in any //enclosure should contain mp3 and then returned /#url
Change this line:
buyers = tree.xpath('//enclosure[#url="mp3"]/text()')
to
buyers = tree.xpath('//enclosure[contains(#url,"mp3")]/#url')
Output
['https://www.podtrac.com/pts/redirect.mp3/traffic.megaphone.fm/ADV9231072845.mp3?updated=1610644901',
'https://www.podtrac.com/pts/redirect.mp3/traffic.megaphone.fm/ADV2643452814.mp3?updated=1609788944',
'https://www.podtrac.com/pts/redirect.mp3/traffic.megaphone.fm/ADV5381316822.mp3?updated=1607279433',
'https://www.podtrac.com/pts/redirect.mp3/traffic.megaphone.fm/ADV9145504181.mp3?updated=1607280708',
'https://www.podtrac.com/pts/redirect.mp3/traffic.megaphone.fm/ADV4345070838.mp3?updated=1606110384',
'https://www.podtrac.com/pts/redirect.mp3/traffic.megaphone.fm/ADV8112097820.mp3?updated=1604866665',
'https://www.podtrac.com/pts/redirect.mp3/traffic.megaphone.fm/ADV2164178070.mp3?updated=1603781321',
'https://www.podtrac.com/pts/redirect.mp3/traffic.megaphone.fm/ADV1107638673.mp3?updated=1610220449',
...]

It does not directly answer your question, but you could check out BeautifulSoup as an alternative (and it has an option to use lxml under the hoop too).
import lxml # early failure if not installed
from bs4 import BeautifulSoup
import requests
# Request the page
page = requests.get('https://feeds.megaphone.fm/darknetdiaries')
# Parse
soup = BeautifulSoup(page.text, 'lxml')
# Find
#mp3 = [link['href'] for link in soup.find_all('a') if 'mp3' in link['href']]
# UPDATE - correct tag and attribute
mp3 = [link['url'] for link in soup.find_all('enclosure') if 'mp3' in link['url']]

Related

How to retrieve URLs under a certain property using BeautifulSoup in Python?

I am trying to retrieve urls under a certain property. The current code I have is
import urllib
import lxml.html
url = 'https://play.acast.com/s/jeg-kan-ingenting-om-vin/33.hvorforercheninblancfraloireogsor-afrikaikkelengerpafolksradar-'
connection = urllib.urlopen(url)
dom = lxml.html.fromstring(connection.read())
links = []
for link in dom.xpath('//meta/#content'): # select the url in href for all a tags(links)
if 'mp3' in link:
links.append(link)
output = set(links)
for i in output:
print(i)
This outputs 2 links which is not what I want.
https://sphinx.acast.com/jeg-kan-ingenting-om-vin/33.hvorforercheninblancfraloireogsor-afrikaikkelengerpafolksradar-/media.mp3
https://sphinx.acast.com/jeg-kan-ingenting-om-vin/33.hvorforercheninblancfraloireogsor-afrikaikkelengerpafolksradar-r/media.mp3
What I would like to do is to get 'only' the URL link that is under og:audio property. Not og:audio:secure_url property.
How do I accomplish this?

To only select a tag where the property="og:audio" and not property="og:audio:secure_url", you can use an [attribute=value]
CSS selector. In your case it would be: [property="og:audio"].
Since you tagged beautifulsoup, you can do it as follows:
soup = BeautifulSoup(connection.read(), "html.parser")
for tag in soup.select('[property="og:audio"]'):
print(tag["content"])
Output:
https://sphinx.acast.com/jeg-kan-ingenting-om-vin/33.hvorforercheninblancfraloireogsor-afrikaikkelengerpafolksradar-/media.mp3

A better way would be to study the XHR calls in the Network tab when you inspect the page. In the response of https://feeder.acast.com/api/v1/shows/jeg-kan-ingenting-om-vin/episodes/33.hvorforercheninblancfraloireogsor-afrikaikkelengerpafolksradar-?showInfo=true the url key is what you are looking for.

Treating a list of items like a single item error: how to find links within each 'link' within string already scraped

I am writing a python code to scrape the pdfs of meetings off this website: https://www.gmcameetings.co.uk
The pdf links are within links, which are also within links. I have the first set of links off the page above, then I need to scrape links within the new urls.
When I do this I get the following error:
AttributeError: ResultSet object has no attribute 'find_all'. You're
probably treating a list of items like a single item. Did you call
find_all() when you meant to call find()?
This is my code so far which is all fine and checked in jupyter notebook:
# importing libaries and defining
import requests
import urllib.request
import time
from bs4 import BeautifulSoup as bs
# set url
url = "https://www.gmcameetings.co.uk/"
# grab html
r = requests.get(url)
page = r.text
soup = bs(page,'lxml')
# creating folder to store pfds - if not create seperate folder
folder_location = r'E:\Internship\WORK'
# getting all meeting href off url
meeting_links = soup.find_all('a',href='TRUE')
for link in meeting_links:
print(link['href'])
if link['href'].find('/meetings/')>1:
print("Meeting!")
This is the line that then receives the error:
second_links = meeting_links.find_all('a', href='TRUE')
I have tried the find() as python suggests but that doesn't work either. But I understand that it can't treat meeting_links as a single item.
So basically, how do you search for links within each bit of the new string variable (meeting_links).
I already have code to get the pdfs once I have the second set of urls which seems to work fine but need to obviously get these first.
Hopefully this makes sense and I've explained ok - I only properly started using python on Monday so I'm a complete beginner.

To get all meeting links try
from bs4 import BeautifulSoup as bs
import requests
# set url
url = "https://www.gmcameetings.co.uk/"
# grab html
r = requests.get(url)
page = r.text
soup = bs(page,'lxml')
# Scrape to find all links
all_links = soup.find_all('a', href=True)
# Loop through links to find those containing '/meetings/'
meeting_links = []
for link in all_links:
href = link['href']
if '/meetings/' in href:
meeting_links.append(href)
print(meeting_links)
The .find() function that you use in your original code is specific to beautiful soup objects. To find a substring within a string, just use native Python: 'a' in 'abcd'.
Hope that helps!

xpath selection of html results in element response

from lxml import html
import requests
url = 'https://www.data.gov/'
r = requests.get(url)
doc = html.fromstring(r.content)
link = doc.xpath('/html/body/header/div[4]/div/div/h4/label/small/a')
print(link)
This keeps giving me:
[Element a at 0x1c64c963f48]
response instead the actual number I am seeking in the page? Any idea why?
Also, why can't I get a type(link) value to see the type?

Try below code to get "192,322" as output:
from lxml import html
import requests
url = 'https://www.data.gov/'
r = requests.get(url)
doc = html.fromstring(r.content)
try:
link = doc.xpath('//a[#href="/metrics"]/text()')[0]
print(link.split()[0])
except IndexError:
print("No link found")

Your XPath gives you <a> elements. You want their text. So... print their text.
link = doc.xpath("//label[#for='search-header']//a")
for a in link:
print( a.text )
Notes
/html/body/header/div[4]/div/div/h4/label/small/a is way too specific. It will break very easily when they make even the slightest change to their HTML layout. Don't use auto-generated XPath expressions. Write all your XPath expressions yourself.
XPath always returns a list of nodes, even if there is only one hit. Use a loop or pick a specific list item (like link[0]).

You can use the feature for extracting the href by changing your code to use the text(). See below:
from lxml import html
import requests
url = 'https://www.data.gov/'
r = requests.get(url)
doc = html.fromstring(r.content)
link = doc.xpath('/html/body/header/div[4]/div/div/h4/label/small/a/text()')
print(link)
Example in Chrome Developer Tools:
> $x("/html/body/header/div[4]/div/div/h4/label/small/a/text()")[0]
> 192,322 DATASETS

Extract Link URL After Specified Element with Python and Beautifulsoup4

I'm trying to extract a link from a page with python and the beautifulsoup library, but I'm stuck. The link is on the following page, on the sidebar area, directly underneath the h4 subtitle "Original Source:
http://www.eurekalert.org/pub_releases/2016-06/uonc-euc062016.php
I've managed to isolate the link (mostly), but I'm unsure of how to further advance my targeting to actually extract the link. Here's my code so far:
import requests
from bs4 import BeautifulSoup
url = "http://www.eurekalert.org/pub_releases/2016-06/uonc-euc062016.php"
data = requests.get(url)
soup = BeautifulSoup(data.text, 'lxml')
source_url = soup.find('section', class_='widget hidden-print').find('div', class_='widget-content').findAll('a')[-1]
print(source_url)
I am currently getting the full html of the last element in which I've isolated, where I'm trying to simply get the link. Of note, this is the only link on the page I'm trying to get.

You're looking for the link which is the href html attribute. source_url is a bs4.element.Tag which has the get method like:
source_url.get('href')

You almost got it!!
SOLUTION 1:
You just have to run the .text method on the soup you've assigned to source_url.
So instead of:
print(source_url)
You should use:
print(source_url.text)
Output:
http://news.unchealthcare.org/news/2016/june/e-cigarette-use-can-alter-hundreds-of-genes-involved-in-airway-immune-defense
SOLUTION 2:
You should call source_url.get('href') to get only the specific href tag related to your soup.findall element.
print source_url.get('href')
Output:
http://news.unchealthcare.org/news/2016/june/e-cigarette-use-can-alter-hundreds-of-genes-involved-in-airway-immune-defense

How to find specific video html tag using beautiful soup?

Does anyone know how to use beautifulsoup in python.
I have this search engine with a list of different urls.
I want to get only the html tag containing a video embed url. and get the link.
example
import BeautifulSoup
html = '''https://archive.org/details/20070519_detroit2'''
#or this.. html = '''http://www.kumby.com/avatar-the-last-airbender-book-3-chapter-5/'''
#or this... html = '''https://www.youtube.com/watch?v=fI3zBtE_S_k'''
soup = BeautifulSoup.BeautifulSoup(html)
what should I do next . to get the html tag of video, or object or the exact link of the video..
I need it to put it on my iframe. i will integrate the python to my php. so getting the link of the video and outputting it using the python then i will echo it on my iframe.

You need to get the html of the page not just the url
use the built-in lib urllib like this:
import urllib
from bs4 import BeautifulSoup as BS
url = '''https://archive.org/details/20070519_detroit2'''
#open and read page
page = urllib.urlopen(url)
html = page.read()
#create BeautifulSoup parse-able "soup"
soup = BS(html)
#get the src attribute from the video tag
video = soup.find("video").get("src")
also with the site you are using i noticed that to get the embed link just change details in the link to embed so it looks like this:
https://archive.org/embed/20070519_detroit2
so if you want to do it to multiple urls without having to parse just do something like this:
url = '''https://archive.org/details/20070519_detroit2'''
spl = url.split('/')
spl[3] = 'embed'
embed = "/".join(spl)
print embed
EDIT
to get the embed link for the other links you provided in your edit you need to look through the html of the page you are parsing, look until you fint the link then get the tag its in then the attribute
for
'''http://www.kumby.com/avatar-the-last-airbender-book-3-chapter-5/'''
just do
soup.find("iframe").get("src")
the iframe becuase the link is in the iframe tag and the .get("src") because the link is the src attribute
You can try the next one because you should learn how to do it if you want to be able to do it in the future :)
Good luck!

You can't parse a URL. BeautifulSoup is used to parse an html page. Retrieve the page first:
import urllib2
data = urllib2.ulropen("https://archive.org/details/20070519_detroit2")
html = data.read()
Then you can use find, and then take the src attribute:
soup = BeautifulSoup(html)
video = soup.find('video')
src = video['src']

this is a one liner to get all the downloadable MP4 file in that page, in case you need it.
import bs4, urllib2
url = 'https://archive.org/details/20070519_detroit2'
soup = bs4.BeautifulSoup(urllib2.urlopen(url))
links = [a['href'] for a in soup.find_all(lambda tag: tag.name == "a" and '.mp4' in tag['href'])]
print links
Here are the output:
['/download/20070519_detroit2/20070519_detroit_jungleearth.mp4',
'/download/20070519_detroit2/20070519_detroit_sweetkissofdeath.mp4',
'/download/20070519_detroit2/20070519_detroit_goodman.mp4',
...
'/download/20070519_detroit2/20070519_detroit_wilson_512kb.mp4']
These are relative links and you and put them together with the domain and you get absolute path.

We Keep Coding

Python is a programming language that lets you work quickly and integrate systems more effectively.

Web scraping an xml page in python? - python

Related

How to retrieve URLs under a certain property using BeautifulSoup in Python?

Treating a list of items like a single item error: how to find links within each 'link' within string already scraped

xpath selection of html results in element response

Extract Link URL After Specified Element with Python and Beautifulsoup4

How to find specific video html tag using beautiful soup?

Categories

Resources