Parsing from a website -- source code does not contain the info I need - python

I'm a little new to web crawlers and such, though I've been programming for a year already. So please bear with me as I try to explain my problem here.
I'm parsing info from Yahoo! News, and I've managed to get most of what I want, but there's a little portion that has stumped me.
For example: http://news.yahoo.com/record-nm-blaze-test-forest-management-225730172.html
I want to get the numbers beside the thumbs up and thumbs down icons in the comments. When I use "Inspect Element" in my Chrome browser, I can clearly see the things that I have to look for - namely, an em tag under the div class 'ugccmt-rate'. However, I'm not able to find this in my python program. In trying to track down the root of the problem, I clicked to view source of the page, and it seems that this tag is not there. Do you guys know how I should approach this problem? Does this have something to do with the javascript on the page that displays the info only after it runs? I'd appreciate some pointers in the right direction.
Thanks.

The page is being generated via JavaScript.
Check if there is a mobile version of the website first. If not, check for any APIs or RSS/Atom feeds. If there's nothing else, you'll either have to manually figure out what the JavaScript is loading and from where, or use Selenium to automate a browser that renders the JavaScript for you for parsing.
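A minimal sketch of the Selenium route, assuming Firefox with its driver is installed; the `ugccmt-rate` class comes from the question above, and the fixed five-second wait is a crude placeholder for a proper explicit wait:

```python
import time
from bs4 import BeautifulSoup

def extract_ratings(html):
    """Pull the <em> rating counts out of every div.ugccmt-rate element."""
    soup = BeautifulSoup(html, "html.parser")
    return [em.get_text(strip=True)
            for div in soup.find_all("div", class_="ugccmt-rate")
            for em in div.find_all("em")]

def fetch_rendered(url, wait=5):
    """Load the page in a real browser so its JavaScript runs, then return the DOM."""
    from selenium import webdriver  # needs Selenium and a browser driver on PATH
    browser = webdriver.Firefox()
    try:
        browser.get(url)
        time.sleep(wait)            # crude wait for the scripts to finish
        return browser.page_source  # the DOM *after* JavaScript has run
    finally:
        browser.quit()

if __name__ == "__main__":
    html = fetch_rendered("http://news.yahoo.com/record-nm-blaze-test-forest-management-225730172.html")
    print(extract_ratings(html))
```

Once the rendered HTML is in hand, the parsing is ordinary BeautifulSoup work; only the rendering step needs a real browser.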

Using the Web Console in Firefox you can pretty easily see what requests the page is actually making as it runs its scripts, and figure out what URI returns the data you want. Then you can request that URI directly in your Python script and tease the data out of it. It is probably in a format that Python already has a library to parse, such as JSON.
Yahoo! may have some checks on their server side to try to prevent you from accessing these data files from a script, such as checking the browser (User-Agent header), cookies, or referrer. These can all be faked with enough perseverance, but you should take their existence as a sign that you should tread lightly. (They may also limit the number of requests you can make in a given time period, which is much harder to get around.)
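A hedged sketch of that approach; the endpoint URL and JSON shape below are placeholders you would replace with whatever the Web Console actually shows:

```python
import json

# Headers that make the request look like it came from a browser; some
# servers check these before serving the data files.
HEADERS = {
    "User-Agent": "Mozilla/5.0",
    "Referer": "http://news.yahoo.com/",
}

def vote_counts(payload):
    """Extract (up, down) pairs from a decoded payload of the assumed shape."""
    return [(c["thumbs_up"], c["thumbs_down"]) for c in payload.get("comments", [])]

if __name__ == "__main__":
    import requests
    # Placeholder URI -- substitute the one the Web Console shows.
    resp = requests.get("http://example.com/comments.json", headers=HEADERS)
    print(vote_counts(json.loads(resp.text)))
```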

Related

How can I convert a scraping script into a web service?

I want to build an API that accepts a string and returns HTML code.
Here is the scraping code that I want to turn into a web service.
Code
from selenium import webdriver
import bs4
import time

url = "https://www.pnrconverter.com/"
browser = webdriver.Firefox()
browser.get(url)
string = ("3 PS 232 M 03FEB 7 JFKKBP HK2 1230A 420P 03FEB E\n"
          "PS/JPIX8U")
textarea = browser.find_element_by_xpath("//textarea[@class='dataInputChild']")
textarea.send_keys(string)  # accept string
textarea.submit()
time.sleep(5)
soup = bs4.BeautifulSoup(browser.page_source, 'html.parser')
html = soup.find('div', class_="main-content")  # returns html
print(html)
Can anyone tell me the best possible way to wrap up my code as an API/web service?
There's no best possible solution in general, because a solution has to fit the problem and the available resources.
Right now it seems like you're trying to wrap someone else's website. If that's the problem you're actually trying to solve, and you want to give credit, you should probably just forward people to their site. Have your site return a 302 Redirect with their URL in the Location header of your response.
If what you're trying to do is get the response from this one sample check you have hardcoded, and make that result available, I would suggest you put it in a static file behind nginx.
If what you're trying to do is use their backend to turn itineraries you have into responses you can return, you can do that by using their backend API, once that becomes available. Read the documentation, use the requests library to hit the API endpoint that you want, and get the JSON result back, and format it to your desires.
If you're trying to duplicate their site by making yourself a man-in-the-middle, that may be illegal and you should reconsider what you're doing.
For hosting purposes, you need to figure out how often your API will be hit. You can probably start on Heroku or something similar fairly easily, and scale up if you need to. You'll probably want WebObj or Flask or something similar sitting at the website where you intend to host this application. You can use those to process what I presume will be a simple request into the string you wish to hit their API with.
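As a sketch of the Flask option mentioned above, with the Selenium logic stubbed out; the route name and `pnr` parameter are made up for illustration:

```python
from flask import Flask, request

app = Flask(__name__)

def scrape(pnr_string):
    # Stub: in the real service this would drive Selenium as in the
    # question's code and return the rendered "main-content" HTML.
    return "<div class='main-content'>result for %s</div>" % pnr_string

@app.route("/convert")
def convert():
    pnr = request.args.get("pnr", "")
    return scrape(pnr)

if __name__ == "__main__":
    app.run(port=5000)  # GET /convert?pnr=... returns the HTML
```

Keeping the scraping logic in its own function makes it easy to swap the stub for the real Selenium code, or later for a call to their official API.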
I am the owner of PNR Converter, so I can shed some light on your attempt to scrape content from our site. Unfortunately scraping from PNR Converter is not recommended. We are developing an API which looks like it would suit your needs, and should be ready in the not too distant future. If you contact us through the site we would be happy to work with you should you wish to use PNR Converter legitimately. PNR Converter gets at least one complete update per year and as such we change all the code on a regular basis. We also monitor all requests to our site, and we will block any requests which are deemed as improper usage. Our filter has already picked up your IP address (ends in 250.144) as potential misuse.
Like I said, should you wish to work with us at PNR Converter legitimately and not scrape our content, we would be happy to do so! Please keep checking https://www.pnrconverter.com/api-introduction for information relating to our API.
We are releasing a backend upgrade this weekend, which will have a different HTML structure, and dynamically named elements which will cause a serious issue for web scrapers!

How to read a HTML page that takes some time to load? [duplicate]

I am trying to scrape a web site using Python and Beautiful Soup. I have found that on some sites, image links that are visible in the browser cannot be seen in the source code. However, using Chrome's Inspect or Fiddler, I can see the corresponding code.
What I see in the source code is:
<div id="cntnt"></div>
But in Chrome's Inspect, I can see a whole bunch of HTML/CSS code generated within this div. Is there a way to also load the generated content within Python? I am using the regular urllib in Python, and I am able to get the source, but without the generated part.
I am not a web developer hence I am not able to express the behaviour in better terms. Please feel free to clarify if my question seems vague !
You need a JavaScript engine to parse and run the JavaScript code inside the page.
There are a number of headless browsers that can help you:
http://code.google.com/p/spynner/
http://phantomjs.org/
http://zombie.labnotes.org/
http://github.com/ryanpetrello/python-zombie
http://jeanphix.me/Ghost.py/
http://webscraping.com/blog/Scraping-JavaScript-webpages-with-webkit/
The content of the website may be generated after load via JavaScript. In order to obtain the generated content via Python, refer to this answer.
A regular scraper gets just the HTML document. To get any content generated by JavaScript logic, you need a headless browser instead, one that builds the DOM and loads and runs the scripts like a regular browser would. The Wikipedia article and some other pages on the Net have lists of those and their capabilities.
Keep in mind when choosing that some previously major products of those are abandoned now.
TRY THIS FIRST!
Perhaps the data technically could be embedded in the JavaScript itself, in which case all this JavaScript-engine business is needed. (Some GREAT links here!)
But from experience, my first guess is that the JS is pulling the data in via an ajax request. If you can get your program simulate that, you'll probably get everything you need handed right to you without any tedious parsing/executing/scraping involved!
It will take a little detective work, though. I suggest turning on your network traffic logger (such as the "Web Developer Toolbar" in Firefox) and then visiting the site. Focus your attention on any and all XmlHttpRequests. The data you need should be found somewhere in one of these responses, probably in the middle of some JSON text.
Now, see if you can re-create that request and get the data directly. (NOTE: You may have to set the User-Agent of your request so the server thinks you're a "real" web browser.)
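A standard-library sketch of re-creating such a request by hand; the endpoint below is a placeholder for whatever URL the network logger actually shows:

```python
import json
from urllib.request import Request, urlopen

def fetch_json(url):
    """Re-create the XHR by hand, with a browser-like User-Agent set."""
    req = Request(url, headers={"User-Agent": "Mozilla/5.0"})
    with urlopen(req) as resp:
        return json.loads(resp.read().decode("utf-8"))

if __name__ == "__main__":
    # Placeholder -- substitute the URL your network logger showed for the XHR.
    print(fetch_json("http://example.com/data.json"))
```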

Change website text with python

This is my first StackOverflow post so please bear with me.
What I'm trying to accomplish is a simple program written in python which will change all of a certain html tag's content (ex. all <h1> or all <p> tags) to something else. This should be done on an existing web page which is currently open in a web browser.
In other words, I want to be able to automate the inspect element function in a browser which will then let me change elements however I wish. I know these changes will just be on my side, but that will serve my larger purpose.
I looked at Beautiful Soup and couldn't find anything in the documentation which will let me change the website as seen in a browser. If someone could point me in the right direction, I would be greatly appreciative!
What you are talking about seems to be much more the job of a browser extension. JavaScript will be much more appropriate, as @brbcoding said. Beautiful Soup is for scraping web pages, not for modifying them on the client side in a browser. To be honest, I don't think you can use Python for that.

How can I use python to get rendered ASP pages?

I'm trying to use python to navigate through a website that have auth forms on its landing page, rendered by ASP scripts.
But when I use Python (with mechanize, requests, or urllib) to get the HTML of that site, I always end up with a semi-blank HTML file because of those ASP scripts.
Would anyone know any method that I can use to get the final (as displayed on a browser) version of an ASP site?
Your target page is a frameset. There is nothing fancy going on from the server side that I can tell. When I use requests or urllib to download it, even sending no headers at all, I get exactly the same HTML that I see in Chrome or Firefox. There is some embedded JS, but it doesn't do anything. Basically, all there is here is a frameset with a single frame in it.
The frame target is also a perfectly normal page with nothing fancy going on from the server side that I can tell. Again, if I fetch it with no headers, I get the exact same contents as in Chrome or Firefox. There is plenty of embedded JS here, but it's not building the DOM from scratch or anything; the static contents that I get from the server have the whole page contents in them. I can strip out all the JS and render it, and it looks exactly the same.
There is a minor problem that neither the server nor the HTML specifies a charset anywhere, and yet the contents aren't ASCII, which means you need to guess what charset to decode if you want to process it as Unicode. But if you're in Python 2.x, and just planning to grab things out of the DOM by ID or something, that won't matter.
I suspect your real problem is just that you don't know how HTML framesets work. You're downloading the frameset, not downloading the referenced frame, and wondering why the resulting page looks like an empty frameset.
Frames are an obsolete feature that nobody uses anymore for anything but a common trick for letting the user pop up a new window even in ancient browsers, and some obscure tricks for fooling popup blockers. In HTML 5 they're finally gone. But as long as ancient websites are out there and need to be scraped, you need to know how they work.
This isn't a substitute for the full documentation, but here's the short version of what a web browser does with a frameset: For each frame tag, it follows the src attribute, then it replaces the contents of the frame tag with a #document tag with no attributes, with the results of reading the src URL as its contents. Beyond that, of course, frames affect layout, but that probably doesn't affect you.
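That frame-following step can be done by hand in a few lines; this sketch assumes BeautifulSoup is available, and `example.com` stands in for the real frameset page:

```python
from urllib.parse import urljoin
from urllib.request import urlopen
from bs4 import BeautifulSoup

def frame_urls(base_url, frameset_html):
    """Resolve every <frame src> in a frameset against the page's own URL."""
    soup = BeautifulSoup(frameset_html, "html.parser")
    return [urljoin(base_url, f["src"])
            for f in soup.find_all("frame") if f.get("src")]

if __name__ == "__main__":
    base = "http://example.com/login.asp"  # placeholder frameset page
    outer = urlopen(base).read().decode("utf-8", "replace")
    for url in frame_urls(base, outer):
        print(url)  # these are the pages you actually want to fetch and parse
```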
Meanwhile, if you're trying to learn web scraping, you really want to install your browser's "Web Developer Tools" (different browsers have different names), or a full-on debugger like Firebug. That way, you can inspect the live tree that your browser is rendering, and compare it to what you get from your script (or, more simply, from wget). So, next time you can say "In Chrome's Inspect Page, I see a #document under the frame, with a whole bunch of stuff underneath that, but when I try to read the same page myself, the frame has no children".

transferring real time data from a website in python

I am programming in Python.
I would like to extract real time data from a webpage without refreshing it:
http://www.fxstreet.com/rates-charts/currency-rates/
I think the real-time data on the page is loaded via AJAX, but I am not quite sure.
I thought about opening an internet browser with the program but I do not really know/like this way... Is there an other way to do it?
I would like to fill a dictionary in my program (or even a SQL database) with the latest numbers each second.
Please help me do this in Python, thanks!
To get the data, you'll need to look through the JavaScript and HTML source to find what URL it's hitting to get the data it's displaying. Then you can call that URL with urllib or your favorite Python library and parse it.
Also, it may be easier if you use a plugin like Firebug that lets you watch the AJAX requests.
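A sketch of the polling approach, assuming you have already found the AJAX endpoint with Firebug and that it returns a flat JSON object of rates; both the URL and that shape are placeholders:

```python
import json
import time
from urllib.request import Request, urlopen

def update_rates(rates, payload):
    """Merge one decoded response (assumed {'pair': price} shape) into rates."""
    rates.update(payload)
    return rates

def poll(url, seconds=1.0, iterations=10):
    """Hit the endpoint once per interval, keeping only the latest numbers."""
    rates = {}
    for _ in range(iterations):
        req = Request(url, headers={"User-Agent": "Mozilla/5.0"})
        with urlopen(req) as resp:
            update_rates(rates, json.loads(resp.read().decode("utf-8")))
        time.sleep(seconds)
    return rates

if __name__ == "__main__":
    print(poll("http://example.com/rates.json"))  # placeholder endpoint
```

Swapping the dict for SQL inserts is a small change: `update_rates` is the one place where each fresh batch of numbers lands.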
