Python3 scraper. Doesn't parse the xpath till the end - python

I'm using lxml.html module
from lxml import html
page = html.parse('http://directory.ccnecommunity.org/reports/rptAccreditedPrograms_New.asp?sort=institution')
# print(page.content)
unis = page.xpath('//tr/td[#valign="top" and #style="width: 50%;padding-right:15px"]/h3/text()')
print(unis.__len__())
with open('workfile.txt', 'w') as f:
for uni in unis:
f.write(uni + '\n')
The website right here (http://directory.ccnecommunity.org/reports/rptAccreditedPrograms_New.asp?sort=institution#Z) is full of universities.
The problem is that it parses till the letter 'H' (244 unis).
I can't understand why, as I see it parses all the HTML till the end.
I also documented my self that 244 is not a limit of a list or anything in python3.

That HTML page simply isn't HTML, it's totally broken. But the following will do what you want. It uses the BeautifulSoup parser.
from lxml.html.soupparser import parse
import urllib
url = 'http://directory.ccnecommunity.org/reports/rptAccreditedPrograms_New.asp?sort=institution'
page = parse(urllib.request.urlopen(url))
unis = page.xpath('//tr/td[#valign="top" and #style="width: 50%;padding-right:15px"]/h3/text()')
See http://lxml.de/lxmlhtml.html#really-broken-pages for more info.

For web-scraping i recommend you to use BeautifulSoup 4
With bs4 this is easily done:
from bs4 import BeautifulSoup
import urllib.request
universities = []
result = urllib.request.urlopen('http://directory.ccnecommunity.org/reports/rptAccreditedPrograms_New.asp?sort=institution#Z')
soup = BeautifulSoup(result.read(),'html.parser')
table = soup.find_all(lambda tag: tag.name=='table')
for t in table:
rows = t.find_all(lambda tag: tag.name=='tr')
for r in rows:
# there are also the A-Z headers -> check length
# there are also empty headers -> check isspace()
headers = r.find_all(lambda tag: tag.name=='h3' and tag.text.isspace()==False and len(tag.text.strip()) > 2)
for h in headers:
universities.append(h.text)

Related

cannot figure beautifulsoup string sum

Hello I want to scrape a webpage. I posted my code but the line which I targeted is important. It doesn't worked. I mean there is no error but also no output. My code is there. I need to sum to strings and there is the problem.
import requests
from bs4 import BeautifulSoup
import pandas as pd
url='http://www.sis.itu.edu.tr/tr/ders_programlari/LSprogramlar/prg.php'
html_content = requests.get(url).text
soup = BeautifulSoup(html_content, "lxml")
url_course_main='http://www.sis.itu.edu.tr/tr/ders_programlari/LSprogramlar/prg.php?fb='
url_course=url_course_main+soup.find_all('option')[1].get_text() <---this line
html_content_course=requests.get(a).text
soup_course=BeautifulSoup(html_content_course,'lxml')
for j in soup_course.find_all('td'):
print(j.get_text())
When I am changing the line which I showed to
url_course=url_course_main+'AKM'
it worked.
Also soup.find_all('option')[1].get_text() is equal to AKM.
Can you guess where the mistake is.
Instead of requests module, try Python's standard urllib.request. It seems that requests module has problem opening the page:
import urllib.request
from bs4 import BeautifulSoup
url='http://www.sis.itu.edu.tr/tr/ders_programlari/LSprogramlar/prg.php'
html_content = urllib.request.urlopen(url).read()
soup = BeautifulSoup(html_content, "lxml")
url_course_main='http://www.sis.itu.edu.tr/tr/ders_programlari/LSprogramlar/prg.php?fb='
url_course=url_course_main+soup.find_all('option')[1].get_text()
html_content_course=urllib.request.urlopen(url_course).read()
soup_course=BeautifulSoup(html_content_course,'lxml')
for j in soup_course.find_all('td'):
print(j.get_text(strip=True))
Prints:
2019-2020 Yaz Dönemi AKM Kodlu Derslerin Ders Programı
...
Problem is that get_text() gives 'AKM ' with space at the end and requests sends url with this space - and server can't find file 'AKM ' with space.
I used >< in string '>{}<'.format(param) to show this space - >AKM < - because without >< it seems OK.
Code needs get_text(strip=True) or get_text().strip() to remove this space.
import requests
from bs4 import BeautifulSoup
url = 'http://www.sis.itu.edu.tr/tr/ders_programlari/LSprogramlar/prg.php'
html_content = requests.get(url).text
soup = BeautifulSoup(html_content, 'lxml')
url_course_main = 'http://www.sis.itu.edu.tr/tr/ders_programlari/LSprogramlar/prg.php?fb='
param = soup.find_all('option')[1].get_text()
print('>{}<'.format(param)) # I use `> <` to show spaces
param = soup.find_all('option')[1].get_text(strip=True)
print('>{}<'.format(param)) # I use `> <` to show spaces
url_course = url_course_main + param
html_content_course = requests.get(url_course).text
soup_course = BeautifulSoup(html_content_course, 'lxml')
for j in soup_course.find_all('td'):
print(j.get_text())

Web scraping using Python and Beautiful Soup for /post-sitemap.xml/

I am trying to scrape a page website/post-sitemap.xml which contains all url's posted for a wordpress website. In the first step, I need to make a list of all the url's present in post-sitemap. When I use requests.get and I check the output, it opens all of the internal urls as well, which is weird. My intention is to make a list of all url's first and then using a loop, I will scrape individual url's in the next function. Below is the code I have done so far. I would need all url's as a list as my final output if python gurus can help.
I have tried using requests.get and openurl but nothing seems to open only the base url for /post-sitemap.xml
import pandas as pd
import numpy as np
from urllib.request import urlopen
from bs4 import BeautifulSoup
import requests
import re
class wordpress_ext_url_cleanup(object):
def __init__(self,wp_url):
self.wp_url_raw = wp_url
self.wp_url = wp_url + '/post-sitemap.xml/'
def identify_ext_url(self):
html = requests.get(self.wp_url)
print(self.wp_url)
print(html.text)
soup = BeautifulSoup(html.text,'lxml')
#print(soup.get_text())
raw_data = soup.find_all('tr')
print (raw_data)
#for link in raw_data:
#print(link.get("href"))
def main():
print ("Inside Main Function");
url="http://punefirst dot com" #(knowingly removed the . so it doesnt look spammy)
first_call = wordpress_ext_url_cleanup(url)
first_call.identify_ext_url()
if __name__ == '__main__':
main()
I would need all 548 url's present in the post sitemap as a list which I will use it for the next function for further scraping.
The document that is returned from the server is XML and transformed with XSLT to HTML form (more info here). To parse all links from this XML, you can use this script:
import requests
from bs4 import BeautifulSoup
url = 'http://punefirst.com/post-sitemap.xml/'
soup = BeautifulSoup(requests.get(url).text, 'lxml')
for loc in soup.select('url > loc'):
print(loc.text)
Prints:
http://punefirst.com
http://punefirst.com/hospitals/pcmc-hospitals/aditya-birla-memorial-hospital-chinchwad-pune
http://punefirst.com/hospitals/pcmc-hospitals/saijyoti-hospital-and-icu-chinchwad-pune
http://punefirst.com/hospitals/pcmc-hospitals/niramaya-hospital-chinchwad-pune
http://punefirst.com/hospitals/pcmc-hospitals/chetna-hospital-chinchwad-pune
http://punefirst.com/hospitals/hadapsar-hospitals/pbmas-h-v-desai-eye-hospital
http://punefirst.com/hospitals/punecentral-hospitals/shree-sai-prasad-hospital
http://punefirst.com/hospitals/punecentral-hospitals/sadhu-vaswani-missions-medical-complex
http://punefirst.com/hospitals/katraj-kondhwa-hospitals/shivneri-hospital
http://punefirst.com/hospitals/punecentral-hospitals/kelkar-nursing-home
http://punefirst.com/hospitals/pcmc-hospitals/shrinam-hospital
http://punefirst.com/hospitals/pcmc-hospitals/dhanwantari-hospital-nigdi
http://punefirst.com/hospitals/punecentral-hospitals/dr-tarabai-limaye-hospital
http://punefirst.com/hospitals/katraj-kondhwa-hospitals/satyanand-hospital-kondhwa-pune
...and so on.

Need to take 'href' value from 'a' tag in following output [duplicate]

Closed. This question needs details or clarity. It is not currently accepting answers.
Want to improve this question? Add details and clarify the problem by editing this post.
Closed 2 years ago.
Improve this question
How can I retrieve the links of a webpage and copy the url address of the links using Python?
Here's a short snippet using the SoupStrainer class in BeautifulSoup:
import httplib2
from bs4 import BeautifulSoup, SoupStrainer
http = httplib2.Http()
status, response = http.request('http://www.nytimes.com')
for link in BeautifulSoup(response, parse_only=SoupStrainer('a')):
if link.has_attr('href'):
print(link['href'])
The BeautifulSoup documentation is actually quite good, and covers a number of typical scenarios:
https://www.crummy.com/software/BeautifulSoup/bs4/doc/
Edit: Note that I used the SoupStrainer class because it's a bit more efficient (memory and speed wise), if you know what you're parsing in advance.
For completeness sake, the BeautifulSoup 4 version, making use of the encoding supplied by the server as well:
from bs4 import BeautifulSoup
import urllib.request
parser = 'html.parser' # or 'lxml' (preferred) or 'html5lib', if installed
resp = urllib.request.urlopen("http://www.gpsbasecamp.com/national-parks")
soup = BeautifulSoup(resp, parser, from_encoding=resp.info().get_param('charset'))
for link in soup.find_all('a', href=True):
print(link['href'])
or the Python 2 version:
from bs4 import BeautifulSoup
import urllib2
parser = 'html.parser' # or 'lxml' (preferred) or 'html5lib', if installed
resp = urllib2.urlopen("http://www.gpsbasecamp.com/national-parks")
soup = BeautifulSoup(resp, parser, from_encoding=resp.info().getparam('charset'))
for link in soup.find_all('a', href=True):
print link['href']
and a version using the requests library, which as written will work in both Python 2 and 3:
from bs4 import BeautifulSoup
from bs4.dammit import EncodingDetector
import requests
parser = 'html.parser' # or 'lxml' (preferred) or 'html5lib', if installed
resp = requests.get("http://www.gpsbasecamp.com/national-parks")
http_encoding = resp.encoding if 'charset' in resp.headers.get('content-type', '').lower() else None
html_encoding = EncodingDetector.find_declared_encoding(resp.content, is_html=True)
encoding = html_encoding or http_encoding
soup = BeautifulSoup(resp.content, parser, from_encoding=encoding)
for link in soup.find_all('a', href=True):
print(link['href'])
The soup.find_all('a', href=True) call finds all <a> elements that have an href attribute; elements without the attribute are skipped.
BeautifulSoup 3 stopped development in March 2012; new projects really should use BeautifulSoup 4, always.
Note that you should leave decoding the HTML from bytes to BeautifulSoup. You can inform BeautifulSoup of the characterset found in the HTTP response headers to assist in decoding, but this can be wrong and conflicting with a <meta> header info found in the HTML itself, which is why the above uses the BeautifulSoup internal class method EncodingDetector.find_declared_encoding() to make sure that such embedded encoding hints win over a misconfigured server.
With requests, the response.encoding attribute defaults to Latin-1 if the response has a text/* mimetype, even if no characterset was returned. This is consistent with the HTTP RFCs but painful when used with HTML parsing, so you should ignore that attribute when no charset is set in the Content-Type header.
Others have recommended BeautifulSoup, but it's much better to use lxml. Despite its name, it is also for parsing and scraping HTML. It's much, much faster than BeautifulSoup, and it even handles "broken" HTML better than BeautifulSoup (their claim to fame). It has a compatibility API for BeautifulSoup too if you don't want to learn the lxml API.
Ian Blicking agrees.
There's no reason to use BeautifulSoup anymore, unless you're on Google App Engine or something where anything not purely Python isn't allowed.
lxml.html also supports CSS3 selectors so this sort of thing is trivial.
An example with lxml and xpath would look like this:
import urllib
import lxml.html
connection = urllib.urlopen('http://www.nytimes.com')
dom = lxml.html.fromstring(connection.read())
for link in dom.xpath('//a/#href'): # select the url in href for all a tags(links)
print link
import urllib2
import BeautifulSoup
request = urllib2.Request("http://www.gpsbasecamp.com/national-parks")
response = urllib2.urlopen(request)
soup = BeautifulSoup.BeautifulSoup(response)
for a in soup.findAll('a'):
if 'national-park' in a['href']:
print 'found a url with national-park in the link'
The following code is to retrieve all the links available in a webpage using urllib2 and BeautifulSoup4:
import urllib2
from bs4 import BeautifulSoup
url = urllib2.urlopen("http://www.espncricinfo.com/").read()
soup = BeautifulSoup(url)
for line in soup.find_all('a'):
print(line.get('href'))
Links can be within a variety of attributes so you could pass a list of those attributes to select.
For example, with src and href attributes (here I am using the starts with ^ operator to specify that either of these attributes values starts with http):
from bs4 import BeautifulSoup as bs
import requests
r = requests.get('https://stackoverflow.com/')
soup = bs(r.content, 'lxml')
links = [item['href'] if item.get('href') is not None else item['src'] for item in soup.select('[href^="http"], [src^="http"]') ]
print(links)
Attribute = value selectors
[attr^=value]
Represents elements with an attribute name of attr whose value is prefixed (preceded) by value.
There are also the commonly used $ (ends with) and * (contains) operators. For a full syntax list see the link above.
Under the hood BeautifulSoup now uses lxml. Requests, lxml & list comprehensions makes a killer combo.
import requests
import lxml.html
dom = lxml.html.fromstring(requests.get('http://www.nytimes.com').content)
[x for x in dom.xpath('//a/#href') if '//' in x and 'nytimes.com' not in x]
In the list comp, the "if '//' and 'url.com' not in x" is a simple method to scrub the url list of the sites 'internal' navigation urls, etc.
just for getting the links, without B.soup and regex:
import urllib2
url="http://www.somewhere.com"
page=urllib2.urlopen(url)
data=page.read().split("</a>")
tag="<a href=\""
endtag="\">"
for item in data:
if "<a href" in item:
try:
ind = item.index(tag)
item=item[ind+len(tag):]
end=item.index(endtag)
except: pass
else:
print item[:end]
for more complex operations, of course BSoup is still preferred.
This script does what your looking for, But also resolves the relative links to absolute links.
import urllib
import lxml.html
import urlparse
def get_dom(url):
connection = urllib.urlopen(url)
return lxml.html.fromstring(connection.read())
def get_links(url):
return resolve_links((link for link in get_dom(url).xpath('//a/#href')))
def guess_root(links):
for link in links:
if link.startswith('http'):
parsed_link = urlparse.urlparse(link)
scheme = parsed_link.scheme + '://'
netloc = parsed_link.netloc
return scheme + netloc
def resolve_links(links):
root = guess_root(links)
for link in links:
if not link.startswith('http'):
link = urlparse.urljoin(root, link)
yield link
for link in get_links('http://www.google.com'):
print link
To find all the links, we will in this example use the urllib2 module together
with the re.module
*One of the most powerful function in the re module is "re.findall()".
While re.search() is used to find the first match for a pattern, re.findall() finds all
the matches and returns them as a list of strings, with each string representing one match*
import urllib2
import re
#connect to a URL
website = urllib2.urlopen(url)
#read html code
html = website.read()
#use re.findall to get all the links
links = re.findall('"((http|ftp)s?://.*?)"', html)
print links
Why not use regular expressions:
import urllib2
import re
url = "http://www.somewhere.com"
page = urllib2.urlopen(url)
page = page.read()
links = re.findall(r"<a.*?\s*href=\"(.*?)\".*?>(.*?)</a>", page)
for link in links:
print('href: %s, HTML text: %s' % (link[0], link[1]))
Here's an example using #ars accepted answer and the BeautifulSoup4, requests, and wget modules to handle the downloads.
import requests
import wget
import os
from bs4 import BeautifulSoup, SoupStrainer
url = 'https://archive.ics.uci.edu/ml/machine-learning-databases/eeg-mld/eeg_full/'
file_type = '.tar.gz'
response = requests.get(url)
for link in BeautifulSoup(response.content, 'html.parser', parse_only=SoupStrainer('a')):
if link.has_attr('href'):
if file_type in link['href']:
full_path = url + link['href']
wget.download(full_path)
I found the answer by #Blairg23 working , after the following correction (covering the scenario where it failed to work correctly):
for link in BeautifulSoup(response.content, 'html.parser', parse_only=SoupStrainer('a')):
if link.has_attr('href'):
if file_type in link['href']:
full_path =urlparse.urljoin(url , link['href']) #module urlparse need to be imported
wget.download(full_path)
For Python 3:
urllib.parse.urljoin has to be used in order to obtain the full URL instead.
BeatifulSoup's own parser can be slow. It might be more feasible to use lxml which is capable of parsing directly from a URL (with some limitations mentioned below).
import lxml.html
doc = lxml.html.parse(url)
links = doc.xpath('//a[#href]')
for link in links:
print link.attrib['href']
The code above will return the links as is, and in most cases they would be relative links or absolute from the site root. Since my use case was to only extract a certain type of links, below is a version that converts the links to full URLs and which optionally accepts a glob pattern like *.mp3. It won't handle single and double dots in the relative paths though, but so far I didn't have the need for it. If you need to parse URL fragments containing ../ or ./ then urlparse.urljoin might come in handy.
NOTE: Direct lxml url parsing doesn't handle loading from https and doesn't do redirects, so for this reason the version below is using urllib2 + lxml.
#!/usr/bin/env python
import sys
import urllib2
import urlparse
import lxml.html
import fnmatch
try:
import urltools as urltools
except ImportError:
sys.stderr.write('To normalize URLs run: `pip install urltools --user`')
urltools = None
def get_host(url):
p = urlparse.urlparse(url)
return "{}://{}".format(p.scheme, p.netloc)
if __name__ == '__main__':
url = sys.argv[1]
host = get_host(url)
glob_patt = len(sys.argv) > 2 and sys.argv[2] or '*'
doc = lxml.html.parse(urllib2.urlopen(url))
links = doc.xpath('//a[#href]')
for link in links:
href = link.attrib['href']
if fnmatch.fnmatch(href, glob_patt):
if not href.startswith(('http://', 'https://' 'ftp://')):
if href.startswith('/'):
href = host + href
else:
parent_url = url.rsplit('/', 1)[0]
href = urlparse.urljoin(parent_url, href)
if urltools:
href = urltools.normalize(href)
print href
The usage is as follows:
getlinks.py http://stackoverflow.com/a/37758066/191246
getlinks.py http://stackoverflow.com/a/37758066/191246 "*users*"
getlinks.py http://fakedomain.mu/somepage.html "*.mp3"
There can be many duplicate links together with both external and internal links. To differentiate between the two and just get unique links using sets:
# Python 3.
import urllib
from bs4 import BeautifulSoup
url = "http://www.espncricinfo.com/"
resp = urllib.request.urlopen(url)
# Get server encoding per recommendation of Martijn Pieters.
soup = BeautifulSoup(resp, from_encoding=resp.info().get_param('charset'))
external_links = set()
internal_links = set()
for line in soup.find_all('a'):
link = line.get('href')
if not link:
continue
if link.startswith('http'):
external_links.add(link)
else:
internal_links.add(link)
# Depending on usage, full internal links may be preferred.
full_internal_links = {
urllib.parse.urljoin(url, internal_link)
for internal_link in internal_links
}
# Print all unique external and full internal links.
for link in external_links.union(full_internal_links):
print(link)
import urllib2
from bs4 import BeautifulSoup
a=urllib2.urlopen('http://dir.yahoo.com')
code=a.read()
soup=BeautifulSoup(code)
links=soup.findAll("a")
#To get href part alone
print links[0].attrs['href']

Extracting a specific substring from a specific hyper-reference using Python

I'm new to Python, and for my second attempt at a project, I wanted to extract a substring – specifically, an identifying number – from a hyper-reference on a url.
For example, this url is the result of my search query, giving the hyper-reference http://www.chessgames.com/perl/chessgame?gid=1012809. From this I want to extract the identifying number "1012809" and append it to navigate to the url http://www.chessgames.com/perl/chessgame?gid=1012809, after which I plan to download the file at the url http://www.chessgames.com/pgn/alekhine_naegeli_1932.pgn?gid=1012809 . But I am currently stuck a few steps behind this because I can't figure out a way to extract the identifier.
Here is my MWE:
from bs4 import BeautifulSoup
url = 'http://www.chessgames.com/perl/chess.pl?yearcomp=exactly&year=1932&playercomp=white&pid=&player=Alekhine&pid2=&player2=Naegeli&movescomp=exactly&moves=&opening=&eco=&result=1%2F2-1%2F2'
page = urllib2.urlopen(url)
soup = BeautifulSoup(page, 'html.parser')
import re
y = str(soup)
x = re.findall("gid=[0-9]+",y)
print x
z = re.sub("gid=", "", x(1)) #At this point, things have completely broken down...
As Albin Paul commented, re.findall return a list, you need to extract elements from it. By the way, you don't need BeautifulSoup here, use urllib2.urlopen(url).read() to get the string of the content, and the re.sub is also not needed here, one regex pattern (?:gid=)([0-9]+) is enough.
import re
import urllib2
url = 'http://www.chessgames.com/perl/chess.pl?yearcomp=exactly&year=1932&playercomp=white&pid=&player=Alekhine&pid2=&player2=Naegeli&movescomp=exactly&moves=&opening=&eco=&result=1%2F2-1%2F2'
page = urllib2.urlopen(url).read()
result = re.findall(r"(?:gid=)([0-9]+)",page)
print(result[0])
#'1012809'
You don't need regex here at all. Css selector along with string manipulation will lead you to the right direction. Try the below script:
import requests
from bs4 import BeautifulSoup
page_link = 'http://www.chessgames.com/perl/chess.pl?yearcomp=exactly&year=1932&playercomp=white&pid=&player=Alekhine&pid2=&player2=Naegeli&movescomp=exactly&moves=&opening=&eco=&result=1%2F2-1%2F2'
soup = BeautifulSoup(requests.get(page_link).text, 'lxml')
item_num = soup.select_one("[href*='gid=']")['href'].split("gid=")[1]
print(item_num)
Output:
1012809

Trying to scrape information from an iterated list of links using Beautiful Soup or ElementTree

I'm attempting to scrape an xml database list of links for these addresses. (The 2nd link is an example page that actually contains some addresses. Many of the links don't.)
I'm able to retrieve the list of initial links I'd like to crawl through, but I can't seem to go one step further and extract the final information I'm looking for (addresses).
I assume there's an error with my syntax, and I've tried scraping it using both beautiful soup and Python's included library, but it doesn't work.
BSoup:
from bs4 import BeautifulSoup
import requests
import re
resultsdict = {}
companyname = []
url1 = 'http://www.agenzia-interinale.it/sitemap-5.xml'
html = requests.get(url1).text
bs = BeautifulSoup(html)
# find the links to companies
company_menu = bs.find_all('loc')
for company in company_menu:
data = bs.find("html",{"i"})
print data
Non 3rd Party:
import requests
import xml.etree.ElementTree as et
req = requests.get('http://www.agenzia-interinale.it/sitemap-5.xml')
root = et.fromstring(req.content)
for i in root:
print i[0].text
Any input is appreciated! Thanks.
Your syntax is ok. You need to simply follow those links in the first page, here's how it will look like for the Milano page:
from bs4 import BeautifulSoup
import requests
import re
resultsdict = {}
companyname = []
url1 = 'http://www.agenzia-interinale.it/sitemap-5.xml'
html = requests.get(url1).text
bs = BeautifulSoup(html)
company_menu = bs.find_all('loc')
for item in company_menu:
if 'milano' in item.text:
subpage = requests.get(item.text)
subsoup = BeautifulSoup(subpage.text)
adresses = subsoup.find_all(class_='riquadro_agenzia_off')
for adress in adresses:
companyname.append(adress.text)
print companyname
To get all addresses you can simply remove if 'milano' block in the code. I don't know if they are all formatted according to coherent rules, for milano addresses are under div with class="riquandro_agenzia_off", if other subpages are also formatted in this way then it should work. Anyway this should get you started. Hope it helps.

Categories