I've been trying to get information from a site, and recently found out that it is stored in childNodes[0].data.
I'm pretty new to python and never tried scripting against websites.
Somebody told me I could write the page to a tmp.xml file and extract the information from there, but since the file only contains the source code (which I don't think is of any use to me), I don't get any results.
Current code:
import urllib2
from xml.dom.minidom import parse

response = urllib2.urlopen(get_link)
html = response.read()
with open("tmp.xml", "w") as f:
    f.write(html)
dom = parse("tmp.xml")
name = dom.getElementsByTagName("name[0].firstChild.nodeValue")
I've also tried dom = parse(html), with no better result.
getElementsByTagName() takes an element name, not an expression. It is highly unlikely that the page you are loading contains any <name[0].firstChild.nodeValue> elements.
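If the document really did contain <name> elements, the call you were aiming for would put only the tag name in the string; the indexing and attribute access happen in Python (a sketch, assuming such elements exist):

from xml.dom.minidom import parse

dom = parse("tmp.xml")
# only the tag name goes in the string; indexing and attribute
# access happen in Python, outside the string
name = dom.getElementsByTagName("name")[0].firstChild.nodeValue
print name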
If you are loading HTML, use an HTML parser instead, like BeautifulSoup. For XML, the ElementTree API is a lot easier to use than the (archaic and very verbose) DOM API.
Neither approach requires you to first save the source to disk; both APIs can parse directly from the response object returned by urllib2.
# HTML
import urllib2
from bs4 import BeautifulSoup
response = urllib2.urlopen(get_link)
soup = BeautifulSoup(response.read(), from_encoding=response.headers.getparam('charset'))
print soup.find('title').text
or
# XML
import urllib2
from xml.etree import ElementTree as ET
response = urllib2.urlopen(get_link)
tree = ET.parse(response)
print tree.find('elementname').text
I want to open a website and download a resume from it, but the following code tries to open an absolute file path instead of the URL:
import webbrowser
from bs4 import BeautifulSoup

soup = BeautifulSoup(webbrowser.open('www.indeed.com/r/Prabhanshu-Pandit/dee64d1418e20069?sp=0'), "lxml")
generates the following error:
gvfs-open: /home/utkarsh/Documents/Extract_Resume/www.indeed.com/r/Prabhanshu-Pandit/dee64d1418e20069?sp=0:
error opening location: Error when getting information for file '/home/utkarsh/Documents/Extract_Resume/www.indeed.com/r/Prabhanshu-Pandit/dee64d1418e20069?sp=0': No such file or directory
Clearly it is treating the URL as a relative file path under my home directory instead of opening it on the web. What am I doing wrong here? Thanks in advance.
I suppose you are confusing the usage of Beautiful Soup and webbrowser. The webbrowser module is not needed to access the page; it merely opens a URL in your desktop browser (and returns a boolean, not the page contents), and a string without an http:// scheme gets treated as a local file path, which is what produced your error.
From the documentation:
Beautiful Soup provides a few simple methods and Pythonic idioms for navigating, searching, and modifying a parse tree: a toolkit for dissecting a document and extracting what you need. It doesn't take much code to write an application
Adapting the tutorial example to your task, this prints the resume:
from bs4 import BeautifulSoup
import requests

url = "www.indeed.com/r/Prabhanshu-Pandit/dee64d1418e20069?sp=0"
r = requests.get("http://" + url)  # fetch the page over HTTP
data = r.text
soup = BeautifulSoup(data, "html.parser")
print soup.find("div", {"id": "resume"})  # the element that holds the resume
I'm trying to create a basic scraper that will scrape the username and song title from a search on Soundcloud. By inspecting the element I needed (using Chrome), I found that I had to get the string associated with every 'span' tag with title="soundTitle__usernameText". Using BeautifulSoup, urllib2, and lxml, I have the following code for a search on 'robert delong':
from lxml import html
from bs4 import BeautifulSoup
from urllib2 import urlopen
import requests
def search_results(url):
    html = urlopen(url).read()
    # html = requests.get(url)  # I've tried this also
    soup = BeautifulSoup(html, "lxml")
    usernames = [span.string for span in soup.find_all("span", "soundTitle__usernameText")]
    return usernames

print search_results('http://soundcloud.com/search?q=robert%20delong')
This returns an empty list. However, when I save the complete web page in Chrome (File > Save > Format: Webpage, Complete) and use that saved HTML file instead of the one obtained with urlopen, the code prints
[u'Two Door Cinema Club', u'whatever-28', u'AWOLNATION', u'Two Door Cinema Club', u'Sean Glass', u'Capital Cities', u'Robert DeLong', u'RAC', u'JR JR']
which is the ideal outcome. It appears that urlopen retrieves only a truncated version of the HTML, which is why the search returns an empty list.
Any thoughts on how I may be able to access the same HTML obtained by manually saving the webpage, but using Python/Terminal? Thank you.
You guessed right. The downloaded HTML does not contain all the data. JavaScript is used to request the information in JSON format, which is then inserted into the document.
By looking at the request Chrome made (ctrl+shift+i, "Network"), I see that it requested https://api-v2.soundcloud.com/search?q=robert%20delong. I believe the response to that has the information you need.
Actually, this is good for you. Reading JSON should be much more straightforward than parsing HTML ;)
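For example, a minimal sketch of that route (the JSON key names below are assumptions; print the parsed response once to confirm the structure, and note the endpoint may require extra query parameters such as a client id):

import json
import urllib2

url = 'https://api-v2.soundcloud.com/search?q=robert%20delong'
data = json.load(urllib2.urlopen(url))
# 'collection', 'title' and 'user'/'username' are assumed key names;
# inspect the real response before relying on them
for item in data.get('collection', []):
    print item.get('title'), '-', item.get('user', {}).get('username')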
Alternatively, this command downloads the HTML of the web page from the terminal, along with its linked resources and images (-p fetches the page requisites, --convert-links rewrites links for local viewing):
wget -p --convert-links http://www.website.com/directory/webpage.html
I am trying to get a set of URLs (which are web pages) from The New York Times, but I get a different result. I am sure that I specified the correct class, yet it extracts different classes. My ny_url.txt contains:
http://query.nytimes.com/search/sitesearch/?action=click&region=Masthead&pgtype=SectionFront&module=SearchSubmit&contentCollection=us&t=qry900#/isis
http://query.nytimes.com/search/sitesearch/?action=click&region=Masthead&pgtype=SectionFront&module=SearchSubmit&contentCollection=us&t=qry900#/isis/since1851/allresults/2/
Here is my code:
import urllib2
import urllib
from cookielib import CookieJar
from bs4 import BeautifulSoup
cj = CookieJar()
opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cj))

text_file = open('ny_url.txt', 'r')
for line in text_file:
    print line
    soup = BeautifulSoup(opener.open(line))
    links = soup.find_all('div', attrs={'class': 'element2'})
    for href in links:
        print href
Well, it's not that simple.
The data you are looking for is not in the page source downloaded by urllib2.
Try printing opener.open(line).read(); you will find the data missing.
This is because the site makes another GET request to http://query.nytimes.com/svc/cse/v2pp/sitesearch.json?query=isis&page=1, where your query parameters are passed within the URL as query=isis and page=1.
The data fetched is in JSON format; try opening the URL above in the browser manually and you will find your data there.
So a purely Pythonic way would be to call this URL and parse the JSON to get what you want.
No rocket science needed: just parse the dict using the proper keys.
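A minimal sketch of that route (the key names in the response are unknown here, so the snippet only fetches the JSON and inspects its structure):

import json
import urllib2

url = 'http://query.nytimes.com/svc/cse/v2pp/sitesearch.json?query=isis&page=1'
data = json.load(urllib2.urlopen(url))
print data.keys()  # inspect the structure first, then drill into the proper keys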
OR
An easier way would be to use a webdriver like Selenium: navigate to the page, then parse the rendered page source with BeautifulSoup. That should fetch the entire content, JavaScript-generated parts included.
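A minimal sketch of the Selenium route (assuming Firefox and its driver are installed locally; any browser driver works the same way):

from selenium import webdriver
from bs4 import BeautifulSoup

driver = webdriver.Firefox()
driver.get('http://query.nytimes.com/search/sitesearch/?action=click&region=Masthead&pgtype=SectionFront&module=SearchSubmit&contentCollection=us&t=qry900#/isis')
soup = BeautifulSoup(driver.page_source)  # the source now includes the JS-rendered results
links = soup.find_all('div', attrs={'class': 'element2'})
print links
driver.quit()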
Hope that helps. Let me know if you need more insights.
Using Python, how would you go about scraping both pictures and text from a website? For example, say I wanted to scrape both the pictures and the text here, what Python tools/libraries would I use? Any tutorials?
Please never use regular expressions; they are not made for parsing HTML.
Normally I make use of the following combination of tools:
requests module
lxml.html
beautifulsoup4 to detect the website encoding
An approach would look like this, and I hope you get the idea (the code illustrates the concept; adapt the selector to your target page):
import lxml.html
import requests
from cssselect import HTMLTranslator, SelectorError
from bs4 import UnicodeDammit

def scrape(url, css_selector):
    # First do the HTTP request with the requests module
    r = requests.get(url)
    html = r.content

    # Try to parse/decode the HTML result with lxml and beautifulsoup4
    try:
        doc = UnicodeDammit(html, is_html=True)
        parser = lxml.html.HTMLParser(encoding=doc.declared_html_encoding)
        dom = lxml.html.document_fromstring(html, parser=parser)
        dom.resolve_base_href()
    except Exception as e:
        print('Some error occurred while lxml tried to parse: {}'.format(e))
        return False

    # Try to extract all the data we are interested in with CSS selectors
    try:
        results = dom.xpath(HTMLTranslator().css_to_xpath(css_selector))
        for e in results:
            # access elements like
            print(e.get('href'))  # access the href attribute
            print(e.text_content())  # the content as text
            # or process further
            found = e.xpath(HTMLTranslator().css_to_xpath('h3.r > a:first-child'))
    except SelectorError as e:
        print('Invalid CSS selector: {}'.format(e))

scrape('http://example.com', 'a')  # replace 'a' with a selector for your target elements
requests, Scrapy, and BeautifulSoup.
Scrapy is optional, but requests is becoming the unofficial standard, and I haven't seen a better parsing tool than BeautifulSoup.
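As an illustration, a minimal sketch with requests + BeautifulSoup that grabs a page's text and downloads every image it references (the URL is a placeholder):

import os
import urlparse

import requests
from bs4 import BeautifulSoup

r = requests.get('http://example.com')  # placeholder URL
soup = BeautifulSoup(r.text, 'html.parser')

print soup.get_text()  # all visible text on the page

for img in soup.find_all('img', src=True):
    img_url = urlparse.urljoin(r.url, img['src'])  # resolve relative src attributes
    filename = os.path.basename(urlparse.urlparse(img_url).path) or 'image'
    with open(filename, 'wb') as f:
        f.write(requests.get(img_url).content)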
I'm trying to scrape the data from a table, namely http://stats.nba.com/leagueTeamGeneral.html?pageNo=1&rowsPerPage=30. I am having difficulty using the right commands; I've tried various parameters and none worked. Ideally I'd have the data returned in this format,
example:
Atlanta Hawks,32, 48.8, 18, 14, .563, etc
I can format the data no problem; getting the required data in the first place is what's causing me grief.
import urllib2
from bs4 import BeautifulSoup

page = 'http://stats.nba.com/leagueTeamGeneral.html?pageNo=1&rowsPerPage=30'
page = urllib2.urlopen(page)
soup = BeautifulSoup(page)

for dS in soup.find_all(???):
    print(dS.get(???))
Use a tool like Firefox's Firebug to track down the HTTP call you need. Looking at the link you shared in Firebug's 'Net' tab shows that the data you are after is in a subsequent request to http://www.nba.com/cmsinclude/desktopWrapperHeader_jsonp.html,
which actually contains JSON data. I'm not sure BeautifulSoup will be handy here; try loading it using Python's json module.
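A hedged sketch of that: if the response turns out to be JSONP (JSON wrapped in a callback(...) call), strip the padding before parsing; the wrapper format here is an assumption:

import json
import urllib2

raw = urllib2.urlopen('http://www.nba.com/cmsinclude/desktopWrapperHeader_jsonp.html').read()
# If the body is callback({...}), cut out the braces; plain JSON passes through unchanged
start, end = raw.find('{'), raw.rfind('}') + 1
data = json.loads(raw[start:end])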
Thanks for the suggestion, it worked rather nicely. What I ended up using was something like:
import json
from pprint import pprint

with open('NBA_DATA.json') as data_file:
    data = json.load(data_file)

# Have this here for debugging, just to see the output
pprint(data["resultSets"])

for hed in data["resultSets"]:
    s1 = hed["headers"]
    s2 = hed["rowSet"]
    # more debugging
    # pprint(hed["headers"])
    # pprint(hed["rowSet"])
    list_of_s1 = list(hed["headers"])
    list_of_s2 = list(hed["rowSet"])
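From there, producing the CSV-style lines from the question could look like this (assuming each list in "rowSet" is aligned with "headers", which the debug output suggests):

# Hypothetical continuation: print each row CSV-style
for hed in data["resultSets"]:
    for row in hed["rowSet"]:
        print ','.join(str(v) for v in row)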