I need to parse a page to get a list of urls that link to detail pages, and then scrape the details from each of those pages. I need to do it this way because the detail-page urls are not regularly incremented and change, while the event list page stays the same.
Basically:
example.com/events/
Event 1
Event 2
example.com/events/1
...some detail stuff I need
example.com/events/2
...some detail stuff I need
import urllib2
from BeautifulSoup import BeautifulSoup
page = urllib2.urlopen('http://yahoo.com').read()
soup = BeautifulSoup(page)
# soup.prettify() returns the parsed tree as a string if you want to inspect it
for anchor in soup.findAll('a', href=True):
    print anchor['href']
That will give you the list of urls. Now you can iterate over those urls and parse the data.
inner_div = soup.findAll("div", {"id": "y-shade"})
This is an example. You can go through the BeautifulSoup tutorials.
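Putting the two steps together for the events case, here is a minimal offline sketch; the list page, the detail pages, and the `detail` div are made-up stand-ins for what you would actually fetch with `urlopen(url).read()`:

```python
from bs4 import BeautifulSoup

# Made-up list page: anchors pointing at the detail pages.
list_html = """
<html><body>
  <a href="/events/1">Event 1</a>
  <a href="/events/2">Event 2</a>
</body></html>
"""

# Stand-ins for the detail pages you would normally fetch over HTTP.
detail_pages = {
    "/events/1": "<html><body><div id='detail'>Detail for event 1</div></body></html>",
    "/events/2": "<html><body><div id='detail'>Detail for event 2</div></body></html>",
}

# Step 1: collect the detail-page urls from the list page.
soup = BeautifulSoup(list_html, "html.parser")
urls = [a["href"] for a in soup.find_all("a", href=True)]

# Step 2: iterate over those urls and parse each detail page.
details = []
for url in urls:
    detail_soup = BeautifulSoup(detail_pages[url], "html.parser")
    details.append(detail_soup.find("div", {"id": "detail"}).text)

print(urls)     # ['/events/1', '/events/2']
print(details)  # ['Detail for event 1', 'Detail for event 2']
```

Because the list page stays the same, only step 1 needs a fixed url; the changing detail urls are discovered on each run.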
For the next group of people that come across this: BeautifulSoup has been upgraded to v4 as of this post, as v3 is no longer being updated.
$ easy_install beautifulsoup4
$ pip install beautifulsoup4
To use in Python...
from bs4 import BeautifulSoup
Use urllib2 to get the page, then use Beautiful Soup to get the list of links; you could also try scraperwiki.com.
Edit:
Recent discovery: Using BeautifulSoup through lxml with
from lxml.html.soupparser import fromstring
is miles better than just BeautifulSoup. It lets you do dom.cssselect('your selector'), which is a life saver (cssselect needs the cssselect package installed). Just make sure you have a good version of BeautifulSoup installed; 3.2.1 works a treat.
dom = fromstring('<html... ...')
navigation_links = [a.get('href') for a in dom.cssselect('#navigation a')]
FULL PYTHON 3 EXAMPLE
Packages
# urllib (comes with standard python distribution)
# pip3 install beautifulsoup4
Example:
import urllib.request
from bs4 import BeautifulSoup
with urllib.request.urlopen('https://www.wikipedia.org/') as f:
    data = f.read().decode('utf-8')
d = BeautifulSoup(data, 'html.parser')
print(d.title.string)
The above should print out 'Wikipedia'.
Related
I want to get the message The following term was not found in PubMed: SNP5265 from
<em class="altered-search-explanation query-error-message">The following term was not found in PubMed: SNP5265</em>
as shown in the picture. Is this possible?
Thanks.
Use Beautiful Soup to parse the page. This answer assumes there may be multiple instances. First install requests and Beautiful Soup with pip install beautifulsoup4 requests,
then:
#!/usr/bin/env python
from bs4 import BeautifulSoup as BS
import requests
url = 'https://pubmed.ncbi.nlm.nih.gov/?term=SNP653+fever'
soup = BS(requests.get(url).text, 'html.parser')
# print(soup)  # uncomment to inspect the full parsed page
# find all <em> elements in the soup matching the {attribute: value} condition
for x in soup.findAll('em', {"class": "altered-search-explanation query-error-message"}):
    if x.text != "":   # if text is not empty
        print(x.text)  # or collect it: results.append(x.text)
I am trying to get all the urls on a website using Python. At the moment I am just copying the website's html into the Python program and then using code to extract all the urls. Is there a way I could do this straight from the web without having to copy the entire html?
In Python 2, you can use urllib2.urlopen:
import urllib2
response = urllib2.urlopen('http://python.org/')
html = response.read()
In Python 3, you can use urllib.request.urlopen:
import urllib.request
with urllib.request.urlopen('http://python.org/') as response:
    html = response.read()
If you have to perform more complicated tasks like authentication or passing parameters, I suggest having a look at the requests library.
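For comparison, here is what passing parameters looks like with plain urllib versus requests; the endpoint, parameters, and User-Agent value below are made up for illustration:

```python
import urllib.parse
import urllib.request

# Hypothetical endpoint and parameters, for illustration only.
base = "http://example.com/search"
params = {"q": "beautiful soup", "page": 2}

# With urllib you encode the query string yourself and append it to the url.
url = base + "?" + urllib.parse.urlencode(params)
print(url)  # http://example.com/search?q=beautiful+soup&page=2

# Custom headers (e.g. a User-Agent) go into a Request object.
req = urllib.request.Request(url, headers={"User-Agent": "my-scraper/0.1"})

# The requests library does both in one call:
#   requests.get(base, params=params, headers={"User-Agent": "my-scraper/0.1"})
```

requests also keeps cookies across calls (via a Session) and has first-class auth support, which is where it really pays off over hand-rolled urllib code.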
The most straightforward would probably be urllib.urlopen if you're using Python 2, or urllib.request.urlopen if you're using Python 3 (you have to do import urllib or import urllib.request first, of course). That way you get a file-like object from which you can read (i.e. f.read()) the html document.
Example for Python 2:
import urllib
f = urllib.urlopen("http://stackoverflow.com")
http_document = f.read()
f.close()
The good news is that you seem to have done the hard part, which is analyzing the html document for links.
You might want to use the bs4 (BeautifulSoup) library.
Beautiful Soup is a Python library for pulling data out of HTML and XML files.
You can download bs4 with the following command at the cmd line: pip install beautifulsoup4
import urllib2
import urlparse
from bs4 import BeautifulSoup
url = "http://www.google.com"
response = urllib2.urlopen(url)
content = response.read()
soup = BeautifulSoup(content, "html.parser")
for link in soup.find_all('a', href=True):
    print urlparse.urljoin(url, link['href'])
You can simply use the combination of requests and BeautifulSoup.
First make an HTTP request using requests to get the HTML content. You will get it as a Python string, which you can manipulate as you like.
Take the HTML content string and supply it to BeautifulSoup, which does all the work of parsing the DOM, and get all URLs, i.e. <a> elements.
Here is an example of how to fetch all links from StackOverflow:
import requests
from bs4 import BeautifulSoup, SoupStrainer
response = requests.get('http://stackoverflow.com')
html_str = response.text
bs = BeautifulSoup(html_str, 'html.parser', parse_only=SoupStrainer('a'))
for a_element in bs:
    if a_element.has_attr('href'):
        print(a_element['href'])
Sample output:
/questions/tagged/facebook-javascript-sdk
/questions/31743507/facebook-app-request-dialog-keep-loading-on-mobile-after-fb-login-called
/users/3545752/user3545752
/questions/31743506/get-nuspec-file-for-existing-nuget-package
/questions/tagged/nuget
...
I'm using Python3, BeautifulSoup4
When I run the code below, it just gives back the url "www.google.com", not the page's HTML.
I couldn't find what is wrong.
from bs4 import BeautifulSoup
import urllib
html = "www.google.com"
soup = BeautifulSoup(html)
print (soup.prettify())
You need to use urllib.request or a similar library to fetch the HTML first; BeautifulSoup only parses the string you give it. Note that the url also needs a scheme:
import urllib.request
html = urllib.request.urlopen("http://www.google.com")
soup = BeautifulSoup(html)
print (soup.prettify())
EDIT: Just as a side note to clarify why urllib.request: the old Python 2 modules urllib and urllib2 were merged and renamed to urllib.request in Python 3. Given that you have tagged Python 3, urllib.request.urlopen is the equivalent of the old urllib2.urlopen and would probably be your best option.
I'm not sure if I'm approaching this correctly. I'm using requests to make a GET:
con = s.get(url)
when I call con.content, the whole page is there. But when I pass con into BS:
soup = BeautifulSoup(con.content)
print(soup.a)
I get None. There are lots of tags in there, not behind any JS, that are present when I call con.content, but when I try to parse with BS most of the page is not there.
Change the parser to html5lib
pip install html5lib
And then,
soup = BeautifulSoup(con.content, 'html5lib')
The a tags are probably not on the top level.
soup.find_all('a')
is probably what you wanted.
In general, I found lxml to be more reliable, consistent in the API and faster. Yes, even more reliable - I have repeatedly had documents where BeautifulSoup failed to parse them, but lxml in its robust mode lxml.html.soupparser still worked well. And there is the lxml.etree API which is really easy to use.
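A small sketch of that etree API; lxml.etree deliberately mirrors the standard library's xml.etree.ElementTree, so this falls back to the stdlib module if lxml is not installed (the snippet is well-formed, which the strict stdlib parser requires):

```python
# lxml.etree mirrors the stdlib ElementTree API, so this sketch works with either.
try:
    from lxml import etree            # fast C implementation
except ImportError:
    import xml.etree.ElementTree as etree  # stdlib fallback, same core API

html = "<html><body><a href='/a'>A</a><a href='/b'>B</a></body></html>"
root = etree.fromstring(html)

# iter('a') walks the whole tree, not just the top level
hrefs = [a.get('href') for a in root.iter('a')]
print(hrefs)  # ['/a', '/b']
```

For real-world (often malformed) pages you would use lxml.html.fromstring or the soupparser mentioned above instead of the strict etree.fromstring.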
Without being able to see the html you're getting, I just did this on the Hacker News site and it returns all the a tags as expected.
import requests
from bs4 import BeautifulSoup
s = requests.session()
con = s.get('https://news.ycombinator.com/')
soup = BeautifulSoup(con.text, 'html.parser')
links = soup.findAll('a')
for link in links:
    print(link)
I am trying to do some web scraping and I wrote a simple script that aims to print all URLs present in the webpage. I don't know why it skips many URLs and prints the list from the middle instead of from the first URL.
from urllib import request
from bs4 import BeautifulSoup
source = request.urlopen("http://www.bda-ieo.it/test/Alphabetical.aspx?Lan=Ita&FL=%25")
soup = BeautifulSoup(source, "html.parser")
for links in soup.select('a'):
    print(links['href'])
Why is that? Could anyone explain to me what is happening?
I am using Python 3.7.1, OS Windows 10 - Visual Studio Code
Often, hrefs only provide part of a url, not the complete one. No worries:
open the page in a new tab/browser, find the missing part of the url, and add it to the href as a string.
In this case, that must be 'http://www.bda-ieo.it/test/'.
Here is your code:
from urllib import request
from bs4 import BeautifulSoup
source = request.urlopen("http://www.bda-ieo.it/test/Alphabetical.aspx?Lan=Ita&FL=%25")
soup = BeautifulSoup(source, "html.parser")
for links in soup.select('a'):
    print('http://www.bda-ieo.it/test/' + links['href'])
And this is the result:
http://www.bda-ieo.it/test/Alphabetical.aspx?Lan=Ita&FL=A
http://www.bda-ieo.it/test/Alphabetical.aspx?Lan=Ita&FL=B
http://www.bda-ieo.it/test/Alphabetical.aspx?Lan=Ita&FL=C
http://www.bda-ieo.it/test/Alphabetical.aspx?Lan=Ita&FL=D
http://www.bda-ieo.it/test/Alphabetical.aspx?Lan=Ita&FL=E
...
http://www.bda-ieo.it/test/ComponentiAlimento.aspx?Lan=Ita&foodid=8721_2
http://www.bda-ieo.it/test/ComponentiAlimento.aspx?Lan=Ita&foodid=347_1
http://www.bda-ieo.it/test/ComponentiAlimento.aspx?Lan=Ita&foodid=2021_1
http://www.bda-ieo.it/test/ComponentiAlimento.aspx?Lan=Ita&foodid=805958_1
http://www.bda-ieo.it/test/ComponentiAlimento.aspx?Lan=Ita&foodid=349_1
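Hard-coding the 'http://www.bda-ieo.it/test/' prefix works for this page, but the standard library's urljoin resolves relative hrefs against the page url in general, including absolute paths that plain string concatenation would get wrong:

```python
from urllib.parse import urljoin

base = "http://www.bda-ieo.it/test/Alphabetical.aspx?Lan=Ita&FL=%25"

# Relative hrefs resolve against the directory of the base url.
print(urljoin(base, "Alphabetical.aspx?Lan=Ita&FL=A"))
# http://www.bda-ieo.it/test/Alphabetical.aspx?Lan=Ita&FL=A

# Absolute paths replace the whole path component.
print(urljoin(base, "/other/page.aspx"))
# http://www.bda-ieo.it/other/page.aspx
```

Inside the loop above you would write print(urljoin(base, links['href'])) instead of concatenating the prefix by hand.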