How to fetch data from Skyscanner? - Python

I am new to Python, and I have been asked to grab the dynamic data from www.skyscanner.net.
Can someone guide me on how to do so?
import requests
import lxml.html as lh
url = 'http://www.skyscanner.net/transport/flights/sin/lhr/131231/140220/'
response = requests.get(url)  # GET, not POST, to fetch a page
tree = lh.document_fromstring(response.content)
print(lh.tostring(tree, pretty_print=True))
All I did was find the pattern in the URL and attempt to grab the data from there. However, no data were successfully pulled. I have heard that Python is the best language for this kind of task, but the ecosystem seems huge and I do not know where to start.

My name is Piotr. I work for Skyscanner, in the Data Acquisition team, which I assume you are applying to join :-) As this is part of your task I wouldn't like to give you a straight answer, however you might consider the following:
Understand how our site works: how the requests are built and what data you can find in the HTTP response.
You could use some libraries that will help you parse XML/JSON responses (a sketch along those lines follows below).
I think that's all I can say :-)
Cheers,
piotr
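Following those hints, here is a minimal sketch of the parsing step. It assumes you have already located a JSON endpoint in your browser's network tab while the flight results load; the endpoint URL below is a placeholder, not a real Skyscanner address:
import requests
# Hypothetical endpoint: find the real one by watching the XHR requests
# in your browser's developer tools while the results load.
api_url = 'http://www.skyscanner.net/some/json/endpoint'
response = requests.get(api_url, headers={'User-Agent': 'Mozilla/5.0'})
response.raise_for_status()
data = response.json()  # parse the JSON body into Python dicts/lists
print(data)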

Related

How can I convert a scraping script into a web service?

I want to build an API that accepts a string and returns HTML code.
Here is my scraping code that I want to expose as a web service.
Code
from selenium import webdriver
import bs4
import time
url = "https://www.pnrconverter.com/"
browser = webdriver.Firefox()
browser.get(url)
# The PNR string to convert (the line break is part of the record).
string = "3 PS 232 M 03FEB 7 JFKKBP HK2 1230A 420P 03FEB E\nPS/JPIX8U"
# XPath attribute tests use '@', not '#'.
textarea = browser.find_element_by_xpath("//textarea[@class='dataInputChild']")
textarea.send_keys(string) #accept string
textarea.submit()
time.sleep(5) # give the page time to render the result
soup = bs4.BeautifulSoup(browser.page_source, 'html.parser')
html = soup.find('div', class_="main-content") #returns html
print(html)
Can anyone tell me the best possible way to wrap up my code as an API/web service?
There's no best possible solution in general, because a solution has to fit the problem and the available resources.
Right now it seems like you're trying to wrap someone else's website. If that's the problem you're actually trying to solve, and you want to give credit, you should probably just forward people to their site: have your site return a 302 Redirect with their URL in the Location header.
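In Flask, for instance, that redirect is nearly a one-liner; a minimal sketch (the route name is made up here):
from flask import Flask, redirect
app = Flask(__name__)
@app.route('/convert')
def forward():
    # 302 redirect: the Location header points visitors at the original site.
    return redirect('https://www.pnrconverter.com/', code=302)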
If what you're trying to do is get the response from this one sample check you have hardcoded, and make that result available, I would suggest you put it in a static file behind nginx.
If what you're trying to do is use their backend to turn itineraries you have into responses you can return, you can do that by using their backend API, once that becomes available. Read the documentation, use the requests library to hit the API endpoint that you want, get the JSON result back, and format it to your desires.
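Once that API exists, the shape of the code might look something like this (the endpoint and field names are entirely hypothetical, since the API is not published yet):
import requests
# Hypothetical endpoint and payload; check the official documentation
# once the PNR Converter API is released.
response = requests.post('https://www.pnrconverter.com/api/convert',
                         data={'pnr': '3 PS 232 M 03FEB 7 JFKKBP HK2 1230A 420P 03FEB E'})
response.raise_for_status()
result = response.json()  # the JSON result comes back as a Python dict
print(result)             # format it to your needs from here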
If you're trying to duplicate their site by making yourself a man-in-the-middle, that may be illegal and you should reconsider what you're doing.
For hosting purposes, you need to figure out how often your API will be hit. You can probably start on Heroku or something similar fairly easily, and scale up if you need to. You'll probably want WebObj or Flask or something similar running wherever you intend to host this application, to process what I presume will be a simple request into the string you wish to hit their API with.
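As a concrete starting point, here is a minimal Flask sketch of that idea. It assumes the scraping logic from the question is wrapped in a convert_pnr function (a name made up for illustration):
from flask import Flask, jsonify, request
app = Flask(__name__)
def convert_pnr(pnr_string):
    # Placeholder for the Selenium logic from the question, or ideally
    # a call to the site's official API once it becomes available.
    raise NotImplementedError
@app.route('/convert', methods=['POST'])
def convert():
    pnr = request.form.get('pnr', '')  # accept the string from the client
    html = convert_pnr(pnr)            # produce the HTML result
    return jsonify({'html': html})
if __name__ == '__main__':
    app.run()  # behind gunicorn or similar on Heroku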
I am the owner of PNR Converter, so I can shed some light on your attempt to scrape content from our site. Unfortunately scraping from PNR Converter is not recommended. We are developing an API which looks like it would suit your needs, and should be ready in the not too distant future. If you contact us through the site we would be happy to work with you should you wish to use PNR Converter legitimately. PNR Converter gets at least one complete update per year and as such we change all the code on a regular basis. We also monitor all requests to our site, and we will block any requests which are deemed as improper usage. Our filter has already picked up your IP address (ends in 250.144) as potential misuse.
Like I said, should you wish to work with us at PNR Converter legitimately and not scrape our content, then we would be happy to do so! Please keep checking https://www.pnrconverter.com/api-introduction for information relating to our API.
We are releasing a backend upgrade this weekend, which will have a different HTML structure, and dynamically named elements which will cause a serious issue for web scrapers!

How can I load all of a site's resources, including AJAX requests, etc., in Python?

I know how to request a web site and read its text with Python. In the past, I've tried using a library like BeautifulSoup to make all of the requests to links on a site, but that doesn't catch things that don't look like full URLs, such as AJAX requests and most requests to the original domain (since "http://example.com" will be missing, and, more importantly, they aren't in an <a href='url'>Link</a> format, so BeautifulSoup will miss them).
How can I load all of a site's resources in Python? Will it require interacting with something like Selenium, or is there a way that's not too difficult to implement without that? I haven't used Selenium much, so I'm not sure how difficult that will be.
Thanks
It all depends on what you want and how you want it. The closest thing that may work for you is Ghost:
from ghost import Ghost
ghost = Ghost()
# open() loads the page in a headless WebKit browser and returns the main
# page plus every extra resource (scripts, AJAX calls, images) it fetched.
page, extra_resources = ghost.open("http://jeanphi.fr")
assert page.http_status == 200 and 'jeanphix' in ghost.content
You can learn more at: http://jeanphix.me/Ghost.py/
Mmm, that's a pretty interesting question. For those resources whose URLs are not fully identifiable because they are generated at runtime or something like that (such as those used in scripts, not only AJAX), you'd need to actually run the website, so that scripts get executed and dynamic URLs get created.
One option is using something like what this answer describes, which is using a third-party library, like Qt, to actually run the website. To collect all URLs, you need some way of monitoring all requests made by the website, which could be done like this (although it's C++, the code is essentially the same).
Finally, once you have the URLs, you can use something like Requests to download the external resources.
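For that last step, a small sketch of fetching a list of collected resource URLs with Requests (the URL list here is just an illustration; in practice it would come from whatever monitored the page's requests):
import os
import requests
from urllib.parse import urlparse
resource_urls = [
    'http://example.com/static/app.js',
    'http://example.com/static/style.css',
]
for url in resource_urls:
    response = requests.get(url)
    if response.ok:
        # Save each resource under its original filename.
        filename = os.path.basename(urlparse(url).path) or 'index.html'
        with open(filename, 'wb') as f:
            f.write(response.content)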
I would love to hear other ways of doing this, especially if they're more concise (easier to remember), but I think this accomplishes my goal. It does not fully answer my original question, though; it just gets more of the stuff than using requests.get(url) would, which was enough for me in this case:
import urllib2
url = 'http://example.com'
headers = {'User-Agent': 'Mozilla/5.0'}  # some sites block urllib2's default agent
request = urllib2.Request(url, None, headers)
sock = urllib2.urlopen(request)
html = sock.read()  # the raw page source
sock.close()

Getting information from a webpage for an application using Python

I am currently trying to create a bot for the Betfair trading site. It involves using the Betfair API, which uses SOAP, and the new API-NG, which will use JSON, so I can understand how to access the information I need from those.
My question is: using Python, what would be the best way to get information from a website that uses just HTML? Can I convert it somehow, maybe to XML, or what is the best/easiest way?
JSON, XML and basically all of this is new to me, so any help will be appreciated.
This is one of the websites I am trying to access to get horse names and prices:
http://www.oddschecker.com/horse-racing-betting/chepstow/14:35/winner
I know there are some similar questions, but looking at the answers and the source of the above page, I am no nearer to figuring out how to get the info I need.
For getting HTML from a website there are two well-used options:
urllib2: this is built in.
requests: this is third-party but really easy to use.
If you then need to parse your HTML, I would suggest using Beautiful Soup.
Example:
import requests
from bs4 import BeautifulSoup
url = 'http://www.example.com'
page_request = requests.get(url)
page_source = page_request.text
# Passing an explicit parser avoids a warning and keeps results consistent.
soup = BeautifulSoup(page_source, 'html.parser')
The page_source is just the raw HTML of the page, which is not much use on its own; the soup object, on the other hand, can be used to access different parts of the page automatically.
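For example, to pull the horse names out of a page like the one above, something along these lines should work (the tag and class name here are guesses; inspect the actual page source for the real ones):
# Hypothetical selector: check the real markup in your browser first.
for row in soup.find_all('a', class_='selection-name'):
    print(row.get_text(strip=True))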

Extracting and parsing HTML from a secure website with Python?

Let's dive into this, shall we?
Ok, I need to write a script (I don't care what language; I prefer something like Python or JavaScript, but whatever works I will take the time to learn). The script will access multiple URLs, extract text from each site and store it in a folder on my PC. (From there I am manipulating the data with Python, which I know how to do.)
EDIT:
Currently I am using Python's NLTK module. Here is a simple version of my code:
import nltk
from urllib2 import urlopen  # Python 2; on Python 3 use urllib.request instead
url = "<URL HERE>"
html = urlopen(url).read()
raw = nltk.clean_html(html)
print(raw)
This code works fine for both HTTP and HTTPS, but not for instances where authentication is required.
Is there a Python module which deals with secure authentication?
Thanks in advance for the help! And to the mods who will view this as a bad question, please just give me ways to make it better. I need ideas from people, not Google.
Mechanize is one option; another is to do it with urllib2 directly.
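For example, if the site uses HTTP basic authentication, requests makes it a one-liner (a sketch; form-based logins would instead need a session and a POST to the login page):
import requests
# Assumes HTTP basic auth; the URL and credentials are placeholders.
response = requests.get('https://example.com/protected/page',
                        auth=('username', 'password'))
response.raise_for_status()
html = response.text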

Proper way to extract JSON data from the web given an API

I have a URL of the form
http://site.com/source.json?s=
And I wish to use Python to create a class that will allow me to pass in my "s" query, send it to that site, and extract the JSON results.
I've tried importing json and setting up the class, but nothing ever really works, and I'm trying to learn good practices at the same time. Can anyone help me out?
Ideally, you should (especially when starting out) use the requests library. This would enable your code to be:
import requests
r = requests.get('http://site.com/source.json', params={'s': 'somevalue/or other here'})
json_result = r.json()
This automatically escapes the parameters, and automatically converts your JSON result into a Python dict.
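If you want to wrap that in a class, as the question asks, a minimal sketch might look like this (the class and method names are just illustrative):
import requests
class JsonSource:
    """Tiny client for the source.json endpoint described above."""
    def __init__(self, base_url='http://site.com/source.json'):
        self.base_url = base_url
    def query(self, s):
        # requests URL-encodes the 's' parameter for us.
        response = requests.get(self.base_url, params={'s': s})
        response.raise_for_status()
        return response.json()
# Usage: JsonSource().query('some search term')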
