Querying web pages with Python - python

I am learning web programming with Python, and one of the exercises I am working on is the following: I am writing a Python program to query the website "orbitz.com" and return the lowest airfare. The departure and arrival cities and dates are used to construct the URL.
I am doing this using the urlopen command, as follows:
(search_str contains the URL)
from lxml.html import parse
from urllib2 import urlopen

parsed = parse(urlopen(search_str))
doc = parsed.getroot()
links = doc.findall('.//a')
for j in range(len(links)):
    the_link = links[j].text_content().strip()
The idea is to retrieve all the links from the query results, search for strings such as "Delta", "United", etc., and read off the dollar amount next to the links.
This worked successfully until today; it looks like orbitz.com has changed their output page. Now, when you enter the travel details on the orbitz.com website, a page appears showing a spinning wheel that says "looking up itineraries" or something to that effect. This is just a filler page and contains no real information. After a few seconds, the real results page is displayed. Unfortunately, the Python code returns the links for the filler page every time, and I never obtain the real results.
How can I get around this? I am a relative beginner to web programming, so any help is greatly appreciated.

This kind of thing is normal in the world of crawlers.
What you need to do is figure out what URL it redirects to after the "itinerary" page, and hit that URL directly from your script.
Then figure out whether they have changed the final search results page too; if so, modify your script to accommodate those changes.
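A minimal sketch of that approach, keeping the same lxml/urllib2 setup from the question; the results URL below is a placeholder for whatever final URL you discover in your browser's developer tools (Network tab), not a real endpoint:

from lxml.html import parse
from urllib2 import urlopen

# Placeholder: replace with the real results URL found in the Network tab.
results_url = "https://www.orbitz.com/your-final-results-url"

doc = parse(urlopen(results_url)).getroot()
for link in doc.findall('.//a'):
    text = link.text_content().strip()
    if "Delta" in text or "United" in text:
        print text  # Python 2 print, matching the urllib2 import above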

Related

Python - Scrapy ecommerce website

I'm trying to scrape the price of this product
http://www.asos.com/au/fila/fila-vintage-plus-ringer-t-shirt-with-small-logo-in-green/prd/9065343?clr=green&SearchQuery=&cid=7616&gridcolumn=2&gridrow=1&gridsize=4&pge=1&pgesize=72&totalstyles=4699
with the following code, but it returns an empty array:
response.xpath('//*[@id="product-price"]/div/span[2]/text()').extract()
Any help is appreciated, Thanks.
That's because the site is dynamic: if you use the view(response) command in the Scrapy shell, you can see that the price info doesn't come out in the HTML that Scrapy receives.
Solutions:
1. Splash
2. Selenium + PhantomJS
It might also help to check this answer: Empty List From Scrapy When Using Xpath to Extract Values.
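As a rough sketch of solution 2 above (Selenium + PhantomJS, which must be installed separately): render the page in a headless browser, then run the same XPath from the question against the rendered HTML.

from selenium import webdriver
from lxml import html

# Render the dynamic page in a headless browser.
driver = webdriver.PhantomJS()
driver.get("http://www.asos.com/au/fila/fila-vintage-plus-ringer-t-shirt-"
           "with-small-logo-in-green/prd/9065343")

# Run the original XPath against the rendered HTML instead of the raw source.
tree = html.fromstring(driver.page_source)
print(tree.xpath('//*[@id="product-price"]/div/span[2]/text()'))
driver.quit()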
The price is added later by the browser, which renders the page using JavaScript code found in the HTML. If you disable JavaScript in your browser, you will notice that the page looks a bit different. Also, take a look at the page source, which is usually unaltered, to see that the tag you're looking for doesn't exist (yet).
Scrapy doesn't execute any JavaScript code. It receives the plain HTML, and that's what you have to work with.
If you want to extract data from pages which look the same as they do in the browser, I recommend using a headless browser like Splash (if you're already using Scrapy): https://github.com/scrapinghub/splash
You can programmatically tell it to download your page, render it, and select the data points you're interested in.
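For instance, a minimal scrapy-splash sketch, assuming a Splash instance is running locally and the scrapy-splash middleware is enabled in settings.py; the spider name and wait time are arbitrary:

import scrapy
from scrapy_splash import SplashRequest

class AsosSpider(scrapy.Spider):
    name = "asos"

    def start_requests(self):
        url = ("http://www.asos.com/au/fila/fila-vintage-plus-ringer-t-shirt-"
               "with-small-logo-in-green/prd/9065343")
        # Let Splash render the JavaScript before the response reaches parse().
        yield SplashRequest(url, self.parse, args={"wait": 2})

    def parse(self, response):
        # With the page rendered, the original XPath can find the price.
        yield {
            "price": response.xpath(
                '//*[@id="product-price"]/div/span[2]/text()'
            ).extract_first()
        }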
The other way is to look for the request made to the ASOS API that asks for the product data. In your case, for this product:
http://www.asos.com/api/product/catalogue/v2/stockprice?productIds=9065343&currency=AUD&keyStoreDataversion=0ggz8b-4.1&store=AU
I got this URL by looking at the XMLHttpRequest (XHR) requests sent in the Network tab of the Developer Tools (in Google Chrome).
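A hedged sketch of calling that endpoint directly with requests; the exact shape of the JSON response is an assumption here, so print it and inspect it before relying on specific keys:

import requests

api_url = ("http://www.asos.com/api/product/catalogue/v2/stockprice"
           "?productIds=9065343&currency=AUD"
           "&keyStoreDataversion=0ggz8b-4.1&store=AU")

data = requests.get(api_url).json()
# Assumed structure: a list with one entry per product id. Print it first
# to confirm the actual field names before extracting the price.
print(data)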
You can also try to find the JSON inside the HTML (using a regular expression) and parse it:
import json

json_string = response.xpath(
    '//script[contains(., "function (view) {")]/text()'
).re_first(r'view\(\'([^\']+)')
data = json.loads(json_string)
price = data["price"]["current"]

Python- Downloading a file from a webpage by clicking on a link

I've looked around the internet for a solution to this, but none has really seemed applicable here. I'm writing a Python program to predict the next day's stock price using historical data. I don't need all the historical data since inception, as Yahoo Finance provides, but only the last 60 days or so. The NASDAQ website provides just the right amount of historical data, and I wanted to use that website.
What I want to do is go to a particular stock's profile on NASDAQ, for example (www.nasdaq.com/symbol/amd/historical), and click on the "Download this File in Excel Format" link at the very bottom. I inspected the page's HTML to see if there was an actual link I could just use with urllib to get the file, but all I got was:
<a id="lnkDownLoad" href="javascript:getQuotes(true);">
Download this file in Excel Format
</a>
No link. So my question is: how can I write a Python script that goes to a given stock's NASDAQ page, clicks on the "Download this file in Excel Format" link, and actually downloads the file? Most solutions online require you to know the URL where the file is stored, but in this case I don't have access to that. So how do I go about doing this?
Using Chrome, go to View > Developer > Developer Tools
In this new developer tools UI, change to the Network tab
Navigate to the place where you would need to click, and click the ⃠ symbol to clear all recent activity.
Click the link, and see if there was any requests made to the server
If there was, click it, and see if you can reverse-engineer the API of its endpoint (a rough sketch follows the note below)
Please be aware that this may be against the website's Terms of Service!
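A heavily hedged sketch of that last step: the endpoint and parameters below are invented purely for illustration, so substitute whatever request actually shows up in your Network tab.

import requests

# Hypothetical endpoint and parameters -- use the real ones from the Network tab.
resp = requests.get(
    "https://www.nasdaq.com/example/historical-endpoint",
    params={"symbol": "AMD", "timeframe": "3m"},
)
with open("amd_historical.csv", "wb") as f:
    f.write(resp.content)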
It appears that BeautifulSoup might be the easiest way to do this. I've made a cursory check that the results of the following script are the same as those that appear on the page. You would just have to write the results to a file, rather than print them. However, the columns are ordered differently.
import requests
from bs4 import BeautifulSoup

URL = 'http://www.nasdaq.com/symbol/amd/historical'
page = requests.get(URL).text
soup = BeautifulSoup(page, 'lxml')

tableDiv = soup.find_all('div', id="historicalContainer")
tableRows = tableDiv[0].findAll('tr')

for tableRow in tableRows[2:]:
    row = tuple(tableRow.getText().split())
    print('"%s",%s,%s,%s,%s,"%s"' % row)
Output:
"03/24/2017",14.16,14.18,13.54,13.7,"50,022,400"
"03/23/2017",13.96,14.115,13.77,13.79,"44,402,540"
"03/22/2017",13.7,14.145,13.55,14.1,"61,120,500"
"03/21/2017",14.4,14.49,13.78,13.82,"72,373,080"
"03/20/2017",13.68,14.5,13.54,14.4,"91,009,110"
"03/17/2017",13.62,13.74,13.36,13.49,"224,761,700"
"03/16/2017",13.79,13.88,13.65,13.65,"44,356,700"
"03/15/2017",14.03,14.06,13.62,13.98,"55,070,770"
"03/14/2017",14,14.15,13.6401,14.1,"52,355,490"
"03/13/2017",14.475,14.68,14.18,14.28,"72,917,550"
"03/10/2017",13.5,13.93,13.45,13.91,"62,426,240"
"03/09/2017",13.45,13.45,13.11,13.33,"45,122,590"
"03/08/2017",13.25,13.55,13.1,13.22,"71,231,410"
"03/07/2017",13.07,13.37,12.79,13.05,"76,518,390"
"03/06/2017",13,13.34,12.38,13.04,"117,044,000"
"03/03/2017",13.55,13.58,12.79,13.03,"163,489,100"
"03/02/2017",14.59,14.78,13.87,13.9,"103,970,100"
"03/01/2017",15.08,15.09,14.52,14.96,"73,311,380"
"02/28/2017",15.45,15.55,14.35,14.46,"141,638,700"
"02/27/2017",14.27,15.35,14.27,15.2,"95,126,330"
"02/24/2017",14,14.32,13.86,14.12,"46,130,900"
"02/23/2017",14.2,14.45,13.82,14.32,"79,900,450"
"02/22/2017",14.3,14.5,14.04,14.28,"71,394,390"
"02/21/2017",13.41,14.1,13.4,14,"66,250,920"
"02/17/2017",12.79,13.14,12.6,13.13,"40,831,730"
"02/16/2017",13.25,13.35,12.84,12.97,"52,403,840"
"02/15/2017",13.2,13.44,13.15,13.3,"33,655,580"
"02/14/2017",13.43,13.49,13.19,13.26,"40,436,710"
"02/13/2017",13.7,13.95,13.38,13.49,"57,231,080"
"02/10/2017",13.86,13.86,13.25,13.58,"54,522,240"
"02/09/2017",13.78,13.89,13.4,13.42,"72,826,820"
"02/08/2017",13.21,13.75,13.08,13.56,"75,894,880"
"02/07/2017",14.05,14.27,13.06,13.29,"158,507,200"
"02/06/2017",12.46,13.7,12.38,13.63,"139,921,700"
"02/03/2017",12.37,12.5,12.04,12.24,"59,981,710"
"02/02/2017",11.98,12.66,11.95,12.28,"116,246,800"
"02/01/2017",10.9,12.14,10.81,12.06,"165,784,500"
"01/31/2017",10.6,10.67,10.22,10.37,"51,993,490"
"01/30/2017",10.62,10.68,10.3,10.61,"37,648,430"
"01/27/2017",10.6,10.73,10.52,10.67,"32,563,480"
"01/26/2017",10.35,10.66,10.3,10.52,"35,779,140"
"01/25/2017",10.74,10.975,10.15,10.35,"61,800,440"
"01/24/2017",9.95,10.49,9.95,10.44,"43,858,900"
"01/23/2017",9.68,10.06,9.68,9.91,"27,848,180"
"01/20/2017",9.88,9.96,9.67,9.75,"27,936,610"
"01/19/2017",9.92,10.25,9.75,9.77,"46,087,250"
"01/18/2017",9.54,10.1,9.42,9.88,"51,705,580"
"01/17/2017",10.17,10.23,9.78,9.82,"70,388,000"
"01/13/2017",10.79,10.87,10.56,10.58,"38,344,340"
"01/12/2017",10.98,11.0376,10.33,10.76,"75,178,900"
"01/11/2017",11.39,11.41,11.15,11.2,"39,337,330"
"01/10/2017",11.55,11.63,11.33,11.44,"29,122,540"
"01/09/2017",11.37,11.64,11.31,11.49,"37,215,840"
"01/06/2017",11.29,11.49,11.11,11.32,"34,437,560"
"01/05/2017",11.43,11.69,11.23,11.24,"38,777,380"
"01/04/2017",11.45,11.5204,11.235,11.43,"40,742,680"
"01/03/2017",11.42,11.65,11.02,11.43,"55,114,820"
"12/30/2016",11.7,11.78,11.25,11.34,"44,033,460"
"12/29/2016",11.24,11.62,11.01,11.59,"50,180,310"
"12/28/2016",12.28,12.42,11.46,11.55,"71,072,640"
"12/27/2016",11.65,12.08,11.6,12.07,"44,168,130"
The script quotes the dates and the thousands-separated numbers so that each stays a single CSV field.
Dig a little deeper and find out what the JS function getQuotes() does. You should get a good clue from that.
If it all seems too complicated, you can always use Selenium. It simulates a real browser, but it is much slower than making native network calls. You can find the official documentation here.
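A rough Selenium sketch of that route, assuming Chrome and chromedriver are installed: load the page and click the link so the browser runs getQuotes() itself. The ten-second pause is an arbitrary allowance for the download to finish.

import time
from selenium import webdriver

driver = webdriver.Chrome()
driver.get("http://www.nasdaq.com/symbol/amd/historical")

# Click the "Download this file in Excel Format" link; the browser executes
# the getQuotes(true) JavaScript and starts the download.
driver.find_element_by_id("lnkDownLoad").click()

time.sleep(10)  # crude wait for the download to complete
driver.quit()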

Python - How to scrape Tr/Td table data using 'requests & BeautifulSoup'

I'm new to programming. I'm trying out my first Web Crawler program that will help me with my job. I'm trying to build a program that will scrape tr/td table data from a web page, but am having difficulties succeeding. Here is what I have so far:
import requests
from bs4 import BeautifulSoup

def start(url):
    source_code = requests.get(url).text
    soup = BeautifulSoup(source_code)
    for table_data in soup.find_all('td', {'class': 'sorting_1'}):
        print(table_data)

start('http://www.datatables.net/')
My goal is to print out each line and then export it to an excel file.
Thank you,
-Cire
My recommendation is that, if you are new to Python, you play with things in an IPython notebook (interactive prompt) to get things working first and get a feel for them before you try writing a script or a function. On the plus side, all variables stick around and it is much easier to see what is going on.
From the screenshot, you can see immediately that the find_all function is not finding anything: an empty list [] is being returned. By using IPython you can easily try other variants of a call on a previously defined variable, for example soup.find_all('td').
Looking at the source of http://www.datatables.net, I do not see any instances of the text sorting_1, so I wouldn't expect a search for all table cells of that class to return anything.
Perhaps that class appeared on a different URL associated with the DataTables website, in which case you would need to use that URL in your code. It's also possible that that class only appears after certain JavaScript has been run client-side (i.e. after certain actions with the sample tables, perhaps), and not on the initially loaded page.
I'd recommend starting with tags you know are on the initial page (seen by looking at the page source in your browser).
For example, currently, I can see a div with class="content". So the find_all code could be changed to the following:
for table_data in soup.find_all('div', {'class': 'content'}):
    print(table_data)
And that should find something.
Response to comments from OP:
The precise reason why you're not finding that tag/class pairing in this case is that DataTables renders the table client-side via JavaScript, generally after the DOM has finished loading (although it depends on the page and where the DataTables init code is placed). That means the HTML associated with the base URL does not contain this content. You can see this if you curl the base URL and look at the output.
However, when the page is loaded in a browser, once the JavaScript for DataTables fires, the DOM is dynamically modified to add the table, including cells with the class you're looking for.
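A quick Python equivalent of the curl check mentioned above, just to confirm the class is absent from the raw HTML (it prints 0 if the table really is rendered client-side):

import requests

html = requests.get('http://www.datatables.net/').text
print(html.count('sorting_1'))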

Using BeautifulSoup to parse Facebook

So I'm trying to parse public Facebook pages using BeautifulSoup. I've managed to successfully scrape LinkedIn, but I've spent hours trying to get it to work on Facebook with no luck. The code I'm trying to use looks like this:
for urls in my_urls:
    try:
        page = urllib2.urlopen(urls)
        soup = BeautifulSoup(page)
        info = soup.find_all("div", class_="fsl fwb fcb")
        info2 = info.findall('a')
The part that's frustrating me is that I can get the title element out, and I can even get pretty far down the document, but I can't get to the part I need.
This line successfully grabs the pageTitle:
info = soup.find_all("title", attrs={"id": "pageTitle"})
This line can get pretty far down the list of elements, but can't go any farther.
info = soup.find_all(id="pagelet_timeline_main_column")
Here's a sample page that I'm trying to parse; I want the current city from it:
https://www.facebook.com/100004210542493
and here's a quick screenshot of what the part I want looks like:
http://prntscr.com/1t8xx6
I feel like I'm really close, but I just can't figure it out. Thanks in advance for any help!
EDIT 2: I should also mention that I can successfully print the whole soup and visually find the part I need, but for whatever reason the parsing just won't work the way it should.
Try looking at the content returned by curl or wget. What you see in the browser is what has been rendered after the JavaScript has been executed.
wget https://www.facebook.com/100004210542493
You might want to use mechanize or Selenium, since you want to simulate a client browser (instead of handling the raw content).
Another related issue might be that Beautiful Soup cannot find a CSS class if the element has other classes, too.
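A small sketch of that multi-class point, using a made-up snippet rather than real Facebook markup: matching on a single class (class_="fsl") still finds an element that also carries the fwb and fcb classes.

from bs4 import BeautifulSoup

html = '<div class="fsl fwb fcb"><a href="#">Current City</a></div>'
soup = BeautifulSoup(html, 'html.parser')

# Match on one class; BeautifulSoup checks membership in the class list,
# so elements with class="fsl fwb fcb" are still found.
for div in soup.find_all("div", class_="fsl"):
    for a in div.find_all("a"):
        print(a.get_text())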

Identify Webpage

Hi, I am trying to parse a webpage in Python. The webpage is in a restricted area, so I cannot give the link. On this webpage you can run queries, and the results are published in a table that is added to the same webpage, but with a new URL. When I parse the page, I get everything except the table.
I have noticed that no matter what my queries are, the URL is always the same. So I always get the same result from my parser, which is the webpage without the query result (the table). But if I inspect the webpage (in Chrome), the table and its results are included in the HTML. My parser just looks like this:
import urllib.request

with urllib.request.urlopen("http://www.home_page.com") as url:
    s = url.read()
    # I'm guessing this would output the html source code?
    print(s)
So my question is: is there some other way to identify the webpage so that I will receive everything that is published on it?
Well, based on your question, I think you are looking for web scraping techniques.
Here is what I'm suggesting: you could use regular expressions to get data that can be expressed in specific patterns.
For example:
import urllib, re

siteContent = urllib.urlopen("http://example.com").read()
getBoldWords = re.findall(r"<b>[\w\d ]+", siteContent)
print "Bold words are:"
print getBoldWords
So in this case you have to learn more about regex (regular expressions) and write your own pattern.
In some specific cases you may also have to deal with the client side (for example, you have to submit queries through JavaScript pop-up pages, or dismiss a JavaScript alert); then you have to use a web browser API. You could use Selenium to deal with this kind of issue, as in the sketch below.
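A minimal Selenium sketch for that client-side case; the URL is the placeholder from the question, and the field and button names are hypothetical, since the real page cannot be shared:

from selenium import webdriver
from bs4 import BeautifulSoup

driver = webdriver.Chrome()
driver.get("http://www.home_page.com")  # placeholder URL from the question

# Hypothetical form fields -- adapt to the real query form on the page.
driver.find_element_by_name("query").send_keys("my search")
driver.find_element_by_name("submit").click()

# The JavaScript has now built the results table, so parse the rendered HTML.
soup = BeautifulSoup(driver.page_source, "html.parser")
for row in soup.find_all("tr"):
    print(row.get_text(strip=True))

driver.quit()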
