Scraping links in Pattern library for Python

I found code similar to this in a course I was taking. The code collects all of the links of a certain format that are mentioned in the source code of a webpage. I understand everything except for the last line, which says:
print link.attrs.get('href', '')
This works, but I'm unsure how the instructor figured out how to do this. I've looked through the documentation and I can't figure out what .get does. Could someone please let me know how I can find this information?
Documentation for Pattern Library: http://www.clips.ua.ac.be/pages/pattern-web
import requests
from pattern import web

xml = requests.get("https://www.realclearpolitics.com/epolls/2010/governor/2010_elections_governor_map.html").text
pattern = 'http://www.realclearpolitics.com/epolls/????/governor/??/*-*.html'
dom = web.Element(xml)
all_links = dom.by_tag('a')
for link in all_links:
    print link.attrs.get('href', '')

That loop prints the href attribute of every <a> hyperlink on the page. The .get you're asking about isn't part of Pattern at all: link.attrs is a plain Python dictionary, and dict.get(key, default) returns the value stored under key, or the default ('' here) if the key is missing. That's why it isn't in Pattern's documentation; it's in the Python docs for dict. As an alternative, you can use the BeautifulSoup package, which is more convenient:
import requests
from bs4 import BeautifulSoup

xml = requests.get("https://www.realclearpolitics.com/epolls/2010/governor/2010_elections_governor_map.html")
soup = BeautifulSoup(xml.text, "lxml")  # lxml is just the parser used to read the HTML
for link in soup.find_all('a'):         # find_all('a') returns every <a> tag on the page
    print(link.get('href', ''))         # Tag.get works like dict.get here
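As for .get itself, a quick illustration with a made-up attribute dictionary:
attrs = {'href': 'http://example.com', 'class': 'nav'}
print(attrs.get('href', ''))   # 'http://example.com' - the key exists
print(attrs.get('title', ''))  # ''  - the key is missing, so the default is returned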

Related

Web scraping an xml page in python?

I am confused as to how I would scrape all the links (only those containing the string "mp3") off a given XML page. The following code only returns an empty list:
# Import required modules
from lxml import html
import requests
# Request the page
page = requests.get('https://feeds.megaphone.fm/darknetdiaries')
# Parsing the page
# (We need to use page.content rather than
# page.text because html.fromstring implicitly
# expects bytes as input.)
tree = html.fromstring(page.content)
# Get element using XPath
buyers = tree.xpath('//enclosure[@url="mp3"]/text()')
print(buyers)
Am I using @url wrong?
Any help would be greatly appreciated!
What happens?
The following XPath won't work; as you suspected, the problem is both the @url comparison and the text() step (an enclosure element has no text content, and its url attribute is never exactly "mp3"):
//enclosure[@url="mp3"]/text()
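You can see both failures on a tiny hand-made document (the enclosure below is made up, but shaped like the feed's entries):
from lxml import etree

tree = etree.fromstring('<rss><enclosure url="https://example.com/ep1.mp3"/></rss>')
print(tree.xpath('//enclosure/text()'))  # [] - an enclosure element has no text
print(tree.xpath('//enclosure/@url'))    # ['https://example.com/ep1.mp3']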
Solution
The url attribute of any //enclosure should contain mp3, and the attribute itself is then returned with /@url.
Change this line:
buyers = tree.xpath('//enclosure[@url="mp3"]/text()')
to
buyers = tree.xpath('//enclosure[contains(@url,"mp3")]/@url')
Output
['https://www.podtrac.com/pts/redirect.mp3/traffic.megaphone.fm/ADV9231072845.mp3?updated=1610644901',
'https://www.podtrac.com/pts/redirect.mp3/traffic.megaphone.fm/ADV2643452814.mp3?updated=1609788944',
'https://www.podtrac.com/pts/redirect.mp3/traffic.megaphone.fm/ADV5381316822.mp3?updated=1607279433',
'https://www.podtrac.com/pts/redirect.mp3/traffic.megaphone.fm/ADV9145504181.mp3?updated=1607280708',
'https://www.podtrac.com/pts/redirect.mp3/traffic.megaphone.fm/ADV4345070838.mp3?updated=1606110384',
'https://www.podtrac.com/pts/redirect.mp3/traffic.megaphone.fm/ADV8112097820.mp3?updated=1604866665',
'https://www.podtrac.com/pts/redirect.mp3/traffic.megaphone.fm/ADV2164178070.mp3?updated=1603781321',
'https://www.podtrac.com/pts/redirect.mp3/traffic.megaphone.fm/ADV1107638673.mp3?updated=1610220449',
...]
It does not directly answer your question, but you could check out BeautifulSoup as an alternative (and it has an option to use lxml under the hood too).
import lxml # early failure if not installed
from bs4 import BeautifulSoup
import requests
# Request the page
page = requests.get('https://feeds.megaphone.fm/darknetdiaries')
# Parse
soup = BeautifulSoup(page.text, 'lxml')
# Find
#mp3 = [link['href'] for link in soup.find_all('a') if 'mp3' in link['href']]
# UPDATE - correct tag and attribute
mp3 = [link['url'] for link in soup.find_all('enclosure') if 'mp3' in link['url']]
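To quickly inspect what came back:
print(len(mp3))  # how many links were found
print(mp3[:3])   # first few entries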

How to check whether the url is of format "/askwiki/questions/<any number>" in Python

I am trying to learn web scraping using BeautifulSoup and Python.
I scraped a list of URLs from a website and I want to display the text of all the links that are in the format "/askwiki/questions/<any number>", like "/askwiki/questions/4" or "/askwiki/questions/123".
import requests
from bs4 import BeautifulSoup

url = 'http://unistd.herokuapp.com/askrec'
r = requests.get(url)
soup = BeautifulSoup(r.content, "lxml")
links = soup.find_all("a")
for link in links:
    if # url is of my desired format
        print link.text
What should I write in the if statement? I am new to Python as well as web scraping; this may be a really stupid question, but I am not getting what to write there.
I tried things like:
if "/askwiki/questions/[0-9]+ " in link.get("href"):
if "/askwiki/questions/[0-9]?" in link.get("href"):
but it's not working.
P.S. - There are other links too, like '/askwiki/questions/tags' and '/askwiki/questions/users'.
Edit: Using regex to identify only those with numbers at the end.
import re

for link in links:
    url = str(link.get('href'))
    if re.findall(r'/askwiki/questions/[\d]+', url):
        print(link)
You're on the right track! The missing component is the re module.
I think what you want is something like this:
import re

matcher = re.compile(r"/askwiki/questions/[0-9]+")
for link in links:
    if matcher.search(link.get("href", "")):  # '' default guards anchors without an href
        print(link.text)
Alternatively, you can just drop the number component if you're only really looking for links with "/askwiki/questions" in them:
if "/askwiki/questions" in link.get("href", ""):
    print(link.text)
try something like:
for link in links:
    href = link.get("href") or ""  # anchors without an href would otherwise give None
    if href.startswith("/askwiki/questions/"):
        print(link.text)
Note that this also matches links like /askwiki/questions/tags, so combine it with a regex check if you only want the numeric ones.
If you want to use regex (i.e. what you have, [0-9]+), you have to import the re library. Check out the re documentation (https://docs.python.org/3/library/re.html) on using it to find patterns!

How do I filter out .mp3 links using beautifulsoup from (possibly) broken html? (JSON)

I want to build small tool to help a family member download podcasts off a site.
In order to get the links to the files I first need to filter them out (with bs4 + python3).
The files are on this website (Estonian): Download Page "Laadi alla" = "Download"
So far my code is as follows:
(most of it is from examples on stackoverflow)
from bs4 import BeautifulSoup
import urllib.request
import re
url = urllib.request.urlopen("http://vikerraadio.err.ee/listing/mystiline_venemaa#?page=1&pagesize=902&phrase=&from=&to=&path=mystiline_venemaa&showAll")
content = url.read()
soup = BeautifulSoup(content, "lxml")
links = [a['href'] for a in soup.find_all('a', href=re.compile(r'http.*\.mp3'))]
print ("Links:", links)
Unfortunately I always get only two results.
Output:
Links: ['http://heli.err.ee/helid/exp/ERR_raadiouudised.mp3', 'http://heli.err.ee/helid/exp/ERR_raadiouudised.mp3']
These are not the ones I want.
My best guess is that the page has somewhat broken html and bs4 / the parser is not able to find anything else.
I've tried different parsers, with no change in the result.
Maybe I'm doing something else wrong too.
My goal is to have the individual links in a list for example.
I'll filter out any duplicates / unwanted entries later myself.
Just a quick note, just in case: This is a public radio and all the content is legally hosted.
My new code is:
for link in soup.find_all('d2p1:DownloadUrl'):
    print(link.text)
I am very unsure if the tag is selected correctly.
None of the examples listed in this question are actually working. See the answer below for working code.
Please be aware that the listings on the page are fetched through an API, so instead of requesting the HTML page, I suggest you request the API link, which returns 200 .mp3 links.
Please follow the steps below:
Request the API link, not the HTML page link.
Check the response; it's JSON, so extract the fields you need.
Help your family, any time :)
Solution
import requests, json

myurl = 'http://vikerraadio.err.ee/api/listing/bypath?path=mystiline_venemaa&page=1&pagesize=200&phrase=&from=&to=&showAll=false'
r = requests.get(myurl)
abc = json.loads(r.text)  # the API responds with JSON, so no HTML parsing is needed
all_mp3 = {}
for lstngs in abc['ListItems']:
    for asd in lstngs['Podcasts']:
        all_mp3[asd['DownloadUrl']] = lstngs['Header']
print(all_mp3)
all_mp3 is what you need. all_mp3 is a dictionary with download urls as keys and mp3 names as the values.
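For example, to list every episode name next to its download link (variable names follow the code above):
for download_url, episode_name in all_mp3.items():
    print(episode_name, download_url)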

Opening webpage and returning a dict of all the links and their text

I'm trying to open a webpage and return all the links as a dictionary that would look like this.
{"http://my.computer.com/some/file.html" : "link text"}
So the link would be after the href= and the text would be between the > and the </a>
I'm using https://www.yahoo.com/ as my test website
I keep getting this error:
'href=' in line:
TypeError: a bytes-like object is required, not 'str'
Here's my code:
def urlDict(myUrl):
    url = myUrl
    page = urllib.request.urlopen(url)
    pageText = page.readlines()
    urlList = {}
    for line in pageText:
        if '<a href=' in line:
            try:
                url = line.split('<a href="')[-1].split('">')[0]
                txt = line.split('<a href="')[-1].split('">')[-1].split('< /a>')[0]
                urlList[url] = txt
            except:
                pass
    return urlList
What am I doing wrong? I've looked around and people have mostly suggested this BeautifulSoup parser thing. I'd use it, but I don't think that would fly with my teacher.
The issue is that you're attempting to compare a byte string to a regular string. If you add print(line) as the first command in your for loop, you'll see that it prints a line of HTML, but with a b' at the beginning, indicating it's a bytes object rather than a decoded string. This makes things difficult. The proper way to use urllib here is the following:
def url_dict(myUrl):
    with urllib.request.urlopen(myUrl) as f:
        s = f.read().decode('utf-8')
This will have the s variable hold the entire text of the page. You can then use a regular expression to parse out the links and the link target. Here is an example which will pull the link targets without the HTML.
import urllib.request
import re
def url_dict():
    # url = myUrl
    with urllib.request.urlopen('http://www.yahoo.com') as f:
        s = f.read().decode('utf-8')
    r = re.compile('(?<=href=").*?(?=")')
    print(r.findall(s))
url_dict()
Using regex to get both the html and the link itself in a dictionary is outside the scope of where you are in your class, so I would absolutely not recommend submitting it for the assignment, although I would recommend learning it for later use.
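For later, then, here is a rough sketch of that regex approach, capturing both the href and the link text in one pass (regex parsing of real-world HTML is fragile; this is for learning only, and the function name just mirrors the one above):
import re
import urllib.request

def url_dict(my_url):
    with urllib.request.urlopen(my_url) as f:
        s = f.read().decode('utf-8')
    # Capture the href value and the text up to the closing </a>.
    pairs = re.findall(r'<a[^>]*href="([^"]*)"[^>]*>(.*?)</a>', s, re.S)
    return {href: text for href, text in pairs}

print(url_dict('https://www.yahoo.com/'))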
You'll want to use BeautifulSoup as suggested, as it makes this entire thing extremely easy. There is an example in the docs that you can cut and paste to extract the URLs.
For what it's worth, here is a BeautifulSoup and requests approach.
Feel free to replace requests with urllib, but BeautifulSoup doesn't really have a nice replacement.
import requests
from bs4 import BeautifulSoup
def get_links(url):
    page = requests.get(url)
    soup = BeautifulSoup(page.text, "html.parser")
    # href=True skips anchors that have no href attribute
    return {a_tag['href']: a_tag.text for a_tag in soup.find_all('a', href=True)}

for link, text in get_links('https://www.yahoo.com/').items():
    print(text.strip(), link)

Python: Print Specific line of text out of TD tag

This is an easy one, I am sure. I am parsing a website and I am trying to get the specific text in between tags. The text will be one of [revoked, Active, Default]. I am using Python. I have been able to print out all the inner-text results, but I have not been able to find a good solution on the web for getting specific text. Here is my code:
from BeautifulSoup import BeautifulSoup
import urllib2
import re

url = urllib2.urlopen("Some URL")
content = url.read()
soup = BeautifulSoup(content)
for tag in soup.findAll(re.compile("^a")):
    print(tag.text)
I'm still not sure I understand what you are trying to do, but I'll try to help.
soup.find_all('a', text=['revoked', 'Active', 'Default'])
This will select only those <a> tags whose text is exactly one of the given strings; note the match is case-sensitive, so the list has to match the page's capitalisation.
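A self-contained illustration of that filter (the HTML here is made up for the example):
from bs4 import BeautifulSoup

html = '<a href="#">Active</a><a href="#">revoked</a><a href="#">other</a>'
soup = BeautifulSoup(html, 'html.parser')
print(soup.find_all('a', text=['revoked', 'Active', 'Default']))
# [<a href="#">Active</a>, <a href="#">revoked</a>]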
I've used the snippet below on a similar occasion. See if it works for your goal:
table = soup.find(id="Table3")
for i in table.stripped_strings:
    print(i)
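For reference, here is a self-contained version with made-up HTML; stripped_strings yields each text fragment with surrounding whitespace removed:
from bs4 import BeautifulSoup

html = '<table id="Table3"><tr><td> Active </td><td> Default </td></tr></table>'
soup = BeautifulSoup(html, 'html.parser')
table = soup.find(id="Table3")
for s in table.stripped_strings:
    print(s)  # prints "Active", then "Default"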
