Python, Limiting search at a specific hyperlink on webpage - python

I am finding a way to download .pdf file through hyperlinks on a webpage.
Learned from How can i grab pdf links from website with Python script, the way is:
import lxml.html, urllib2, urlparse
base_url = 'http://www.renderx.com/demos/examples.html'
res = urllib2.urlopen(base_url)
tree = lxml.html.fromstring(res.read())
ns = {'re': 'http://exslt.org/regular-expressions'}
for node in tree.xpath('//a[re:test(#href, "\.pdf$", "i")]', namespaces=ns):
print urlparse.urljoin(base_url, node.attrib['href'])
The question is, how can I only find the .pdf under a specific hyperlink, instead of listing all the .pdf(s) on the webpage?
A way is, I can limit the print when it contains certain words like:
If ‘CA-Personal.pdf’ in node:
But what if the .pdf file name is changing? Or I just want to limit the searching on the webpage, at the hyperlink of “Applications”? thanks.

well, not the best way but no harm to do:
from bs4 import BeautifulSoup
import urllib2
domain = 'http://www.renderx.com'
url = 'http://www.renderx.com/demos/examples.html'
page = urllib2.urlopen(url)
soup = BeautifulSoup(page.read())
app = soup.find_all('a', text = "Applications")
for aa in app:
print domain + aa['href']

Related

Download multiple csv files from a web directory using Python and store them in disk, using as filename the anchor text

From this URL:
http://vs-web-fs-1.oecd.org/piaac/puf-data/CSV
I want to download all the files and save them with the text of the anchor tag. I guess my main struggle is to retrieve the text of the anchor tag right now:
from bs4 import BeautifulSoup
import requests
import urllib.request
url_base = "http://vs-web-fs-1.oecd.org"
url_dir = "http://vs-web-fs-1.oecd.org/piaac/puf-data/CSV"
r = requests.get(url_dir)
data = r.text
soup = BeautifulSoup(data,features="html5lib")
for link in soup.find_all('a'):
if link.get('href').endswith(".csv"):
print(link.find("a"))
urllib.request.urlretrieve(url_base+link.get('href'), "test.csv")
Line print(link.find("a")) returns None. How can I retrieve the text?
You get the text accessing the content, like this:
link.contents[0]
or
link.string

How to extract the main text body of a url, discarding all irrelevant data

import urllib
from bs4 import BeautifulSoup
import urlparse
import mechanize
url = "http://www.wholefoodsmarket.com/forums"
br = mechanize.Browser()
urls = [url]
visited = [url]
while len(urls)>0:
try:
br.open(urls[0])
urls.pop(0)
for link in br.links():
newurl = urlparse.urljoin(link.base_url,link.url)
b1 = urlparse.urlparse(newurl).hostname
b2 = urlparse.urlparse(newurl).path
newurl = "http://"+b1+b2
if newurl not in visited and urlparse.urlparse(url).hostname in newurl:
urls.append(newurl)
visited.append(newurl)
ur = urllib.urlopen(newurl)
soup = BeautifulSoup(ur.read())
html = soup.find_all()
print html
f = open('content.txt', 'a')
f.write(newurl)
f.write("\n")
print >>f.write(soup.title.string)
f.write("\n")
f.write(soup.head)
f.write("\n")
f.write(soup.body)
print >>f, "Next Link\n"
f.close()
except:
print "error"
urls.pop(0)
I am trying to recursively crawl html pages data upto 1 GB and then extract the relevant text data i.e discarding all code, html tags. Can someone suggest some link I can follow.
You could try using the get_text method.
Relevant code snippet:
soup = BeautifulSoup(html_doc)
print(soup.get_text())
Hope it gets you started in the right direction
In the case you are not limited to BeautifulSoup I would suggest you exploring xpath capabilities.
As an example to get all the text from a page you would need an expression as simple as this one:
//*/text()
The text from all links will be:
//a/text()
Similar expressions can be used to extract all info you need.
More info on XPath here:https://stackoverflow.com/tags/xpath/info
In the case you got problems building up the crawler from the scratch think about using an already implemented one (as Scrapy)

Trying to scrape information from an iterated list of links using Beautiful Soup or ElementTree

I'm attempting to scrape an xml database list of links for these addresses. (The 2nd link is an example page that actually contains some addresses. Many of the links don't.)
I'm able to retrieve the list of initial links I'd like to crawl through, but I can't seem to go one step further and extract the final information I'm looking for (addresses).
I assume there's an error with my syntax, and I've tried scraping it using both beautiful soup and Python's included library, but it doesn't work.
BSoup:
from bs4 import BeautifulSoup
import requests
import re
resultsdict = {}
companyname = []
url1 = 'http://www.agenzia-interinale.it/sitemap-5.xml'
html = requests.get(url1).text
bs = BeautifulSoup(html)
# find the links to companies
company_menu = bs.find_all('loc')
for company in company_menu:
data = bs.find("html",{"i"})
print data
Non 3rd Party:
import requests
import xml.etree.ElementTree as et
req = requests.get('http://www.agenzia-interinale.it/sitemap-5.xml')
root = et.fromstring(req.content)
for i in root:
print i[0].text
Any input is appreciated! Thanks.
Your syntax is ok. You need to simply follow those links in the first page, here's how it will look like for the Milano page:
from bs4 import BeautifulSoup
import requests
import re
resultsdict = {}
companyname = []
url1 = 'http://www.agenzia-interinale.it/sitemap-5.xml'
html = requests.get(url1).text
bs = BeautifulSoup(html)
company_menu = bs.find_all('loc')
for item in company_menu:
if 'milano' in item.text:
subpage = requests.get(item.text)
subsoup = BeautifulSoup(subpage.text)
adresses = subsoup.find_all(class_='riquadro_agenzia_off')
for adress in adresses:
companyname.append(adress.text)
print companyname
To get all addresses you can simply remove if 'milano' block in the code. I don't know if they are all formatted according to coherent rules, for milano addresses are under div with class="riquandro_agenzia_off", if other subpages are also formatted in this way then it should work. Anyway this should get you started. Hope it helps.

How to find specific video html tag using beautiful soup?

Does anyone know how to use beautifulsoup in python.
I have this search engine with a list of different urls.
I want to get only the html tag containing a video embed url. and get the link.
example
import BeautifulSoup
html = '''https://archive.org/details/20070519_detroit2'''
#or this.. html = '''http://www.kumby.com/avatar-the-last-airbender-book-3-chapter-5/'''
#or this... html = '''https://www.youtube.com/watch?v=fI3zBtE_S_k'''
soup = BeautifulSoup.BeautifulSoup(html)
what should I do next . to get the html tag of video, or object or the exact link of the video..
I need it to put it on my iframe. i will integrate the python to my php. so getting the link of the video and outputting it using the python then i will echo it on my iframe.
You need to get the html of the page not just the url
use the built-in lib urllib like this:
import urllib
from bs4 import BeautifulSoup as BS
url = '''https://archive.org/details/20070519_detroit2'''
#open and read page
page = urllib.urlopen(url)
html = page.read()
#create BeautifulSoup parse-able "soup"
soup = BS(html)
#get the src attribute from the video tag
video = soup.find("video").get("src")
also with the site you are using i noticed that to get the embed link just change details in the link to embed so it looks like this:
https://archive.org/embed/20070519_detroit2
so if you want to do it to multiple urls without having to parse just do something like this:
url = '''https://archive.org/details/20070519_detroit2'''
spl = url.split('/')
spl[3] = 'embed'
embed = "/".join(spl)
print embed
EDIT
to get the embed link for the other links you provided in your edit you need to look through the html of the page you are parsing, look until you fint the link then get the tag its in then the attribute
for
'''http://www.kumby.com/avatar-the-last-airbender-book-3-chapter-5/'''
just do
soup.find("iframe").get("src")
the iframe becuase the link is in the iframe tag and the .get("src") because the link is the src attribute
You can try the next one because you should learn how to do it if you want to be able to do it in the future :)
Good luck!
You can't parse a URL. BeautifulSoup is used to parse an html page. Retrieve the page first:
import urllib2
data = urllib2.ulropen("https://archive.org/details/20070519_detroit2")
html = data.read()
Then you can use find, and then take the src attribute:
soup = BeautifulSoup(html)
video = soup.find('video')
src = video['src']
this is a one liner to get all the downloadable MP4 file in that page, in case you need it.
import bs4, urllib2
url = 'https://archive.org/details/20070519_detroit2'
soup = bs4.BeautifulSoup(urllib2.urlopen(url))
links = [a['href'] for a in soup.find_all(lambda tag: tag.name == "a" and '.mp4' in tag['href'])]
print links
Here are the output:
['/download/20070519_detroit2/20070519_detroit_jungleearth.mp4',
'/download/20070519_detroit2/20070519_detroit_sweetkissofdeath.mp4',
'/download/20070519_detroit2/20070519_detroit_goodman.mp4',
...
'/download/20070519_detroit2/20070519_detroit_wilson_512kb.mp4']
These are relative links and you and put them together with the domain and you get absolute path.

Write a python script that goes through the links on a page recursively

I'm doing a project for my school in which I would like to compare scam mails. I found this website: http://www.419scam.org/emails/
Now what I would like to do is to save every scam in apart documents then later on I can analyse them.
Here is my code so far:
import BeautifulSoup, urllib2
address='http://www.419scam.org/emails/'
html = urllib2.urlopen(address).read()
f = open('test.txt', 'wb')
f.write(html)
f.close()
This saves me the whole html file in a text format, now I would like to strip the file and save the content of the html links to the scams:
01
02
03
etc.
If i get that, I would still need to go a step further and open save another href. Any idea how do I do it in one python code?
Thank you!
You picked the right tool in BeautifulSoup. Technically you could do it all do it in one script, but you might want to segment it, because it looks like you'll be dealing with tens of thousands of e-mails, all of which are seperate requests - and that will take a while.
This page is gonna help you a lot, but here's just a little code snippet to get you started. This gets all of the html tags that are index pages for the e-mails, extracts their href links and appends a bit to the front of the url so they can be accessed directly.
from bs4 import BeautifulSoup
import re
import urllib2
soup = BeautifulSoup(urllib2.urlopen("http://www.419scam.org/emails/"))
tags = soup.find_all(href=re.compile("20......../index\.htm")
links = []
for t in tags:
links.append("http://www.419scam.org/emails/" + t['href'])
're' is a Python's regular expressions module. In the fifth line, I told BeautifulSoup to find all the tags in the soup whose href attribute match that regular expression. I chose this regular expression to get only the e-mail index pages rather than all of the href links on that page. I noticed that the index page links had that pattern for all of their URLs.
Having all the proper 'a' tags, I then looped through them, extracting the string from the href attribute by doing t['href'] and appending the rest of the URL to the front of the string, to get raw string URLs.
Reading through that documentation, you should get an idea of how to expand these techniques to grab the individual e-mails.
You might also find value in requests and lxml.html. Requests is another way to make http requests and lxml is an alternative for parsing xml and html content.
There are many ways to search the html document but you might want to start with cssselect.
import requests
from lxml.html import fromstring
url = 'http://www.419scam.org/emails/'
doc = fromstring(requests.get(url).content)
atags = doc.cssselect('a')
# using .get('href', '') syntax because not all a tags will have an href
hrefs = (a.attrib.get('href', '') for a in atags)
Or as suggested in the comments using .iterlinks(). Note that you will still need to filter if you only want 'a' tags. Either way the .make_links_absolute() call is probably going to be helpful. It is your homework though, so play around with it.
doc.make_links_absolute(base_url=url)
hrefs = (l[2] for l in doc.iterlinks() if l[0].tag == 'a')
Next up for you... how to loop through and open all of the individual spam links.
To get all links on the page you could use BeautifulSoup. Take a look at this page, it can help. It actually tells how to do exactly what you need.
To save all pages, you could do the same as what you do in your current code, but within a loop that would iterate over all links you'll have extracted and stored, say, in a list.
Heres a solution using lxml + XPath and urllib2 :
#!/usr/bin/env python2 -u
# -*- coding: utf8 -*-
import cookielib, urllib2
from lxml import etree
cj = cookielib.CookieJar()
opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cj))
page = opener.open("http://www.419scam.org/emails/")
page.addheaders = [('User-agent', 'Mozilla/5.0')]
reddit = etree.HTML(page.read())
# XPath expression : we get all links under body/p[2] containing *.htm
for node in reddit.xpath('/html/body/p[2]/a[contains(#href,".htm")]'):
for i in node.items():
url = 'http://www.419scam.org/emails/' + i[1]
page = opener.open(url)
page.addheaders = [('User-agent', 'Mozilla/5.0')]
lst = url.split('/')
try:
if lst[6]: # else it's a "month" link
filename = '/tmp/' + url.split('/')[4] + '-' + url.split('/')[5]
f = open(filename, 'w')
f.write(page.read())
f.close()
except:
pass
# vim:ts=4:sw=4
You could use HTML parser and specify the type of object you are searching for.
from HTMLParser import HTMLParser
import urllib2
class MyHTMLParser(HTMLParser):
def handle_starttag(self, tag, attrs):
if tag == 'a':
for attr in attrs:
if attr[0] == 'href':
print attr[1]
address='http://www.419scam.org/emails/'
html = urllib2.urlopen(address).read()
f = open('test.txt', 'wb')
f.write(html)
f.close()
parser = MyHTMLParser()
parser.feed(html)

Categories