I have a URL of the form
http://site.com/source.json?s=
And I wish to use Python to create a class that will let me pass in my "s" query, send it to that site, and extract the JSON results.
I've tried importing json/setting up the class, but nothing ever really works and I'm trying to learn good practices at the same time. Can anyone help me out?
Ideally (especially when starting out) you should use the requests library. With it, your code becomes:
import requests
r = requests.get('http://site.com/source.json', params={'s': 'somevalue/or other here'})
json_result = r.json()
This automatically escapes the parameters and converts the JSON result into a Python dict.
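Since the question asks for a class, a thin wrapper around that call could look like the sketch below; the class and method names are illustrative, not part of the answer:

```python
import requests

class SourceClient:
    """Minimal sketch of a client for the JSON endpoint (names are hypothetical)."""

    BASE_URL = "http://site.com/source.json"

    def __init__(self, session=None):
        # Accepting a session makes the class easy to test with a stub.
        self.session = session or requests.Session()

    def search(self, query):
        # requests URL-encodes the "s" parameter for us.
        r = self.session.get(self.BASE_URL, params={"s": query})
        r.raise_for_status()
        return r.json()
```

Injecting the session also lets you reuse connections or swap in a mock during tests.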
Related
I want to build an API that accepts a string and returns HTML code.
Here is my scraping code that I want to expose as a web service.
Code
from selenium import webdriver
import bs4
import requests
import time
url = "https://www.pnrconverter.com/"
browser = webdriver.Firefox()
browser.get(url)
string = "3 PS 232 M 03FEB 7 JFKKBP HK2 1230A 420P 03FEB E\nPS/JPIX8U"
button = browser.find_element_by_xpath("//textarea[@class='dataInputChild']")
button.send_keys(string) #accept string
button.submit()
time.sleep(5)
soup = bs4.BeautifulSoup(browser.page_source,'html.parser')
html = soup.find('div',class_="main-content") #returns html
print(html)
Can anyone tell me the best possible solution to wrap up my code as an API/web service?
There's no best possible solution in general, because a solution has to fit the problem and the available resources.
Right now it seems like you're trying to wrap someone else's website. If that's the problem you're actually trying to solve, and you want to give credit, you should probably just forward people to their site. Have your site return a 302 Redirect with their URL in the Location field in your header.
If what you're trying to do is get the response from this one sample check you have hardcoded, and make that result available, I would suggest you put it in a static file behind nginx.
If what you're trying to do is use their backend to turn itineraries you have into responses you can return, you can do that by using their backend API, once that becomes available. Read the documentation, use the requests library to hit the API endpoint that you want, and get the JSON result back, and format it to your desires.
If you're trying to duplicate their site by making yourself a man-in-the-middle, that may be illegal and you should reconsider what you're doing.
For hosting purposes, you need to figure out how often your API will be hit. You can probably start on Heroku or something similar fairly easily, and scale up if you need to. You'll probably want WebObj or Flask or something similar sitting at the website where you intend to host this application. You can use those to process what I presume will be a simple request into the string you wish to hit their API with.
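As a rough sketch of that last step, the scraping logic could be factored into a function and put behind a tiny WSGI app (a Flask route would look almost identical); `convert()` here is just a placeholder for the selenium/bs4 code in the question:

```python
from urllib.parse import parse_qs

def convert(pnr_string):
    # Placeholder: in the real service this would run the selenium/bs4
    # scraping from the question and return the resulting HTML.
    return "<div>result for %s</div>" % pnr_string

def app(environ, start_response):
    # Read the PNR string from ?s=... and return the converted HTML.
    qs = parse_qs(environ.get("QUERY_STRING", ""))
    pnr = qs.get("s", [""])[0]
    body = convert(pnr).encode("utf-8")
    start_response("200 OK", [("Content-Type", "text/html")])
    return [body]
```

Any WSGI server (`wsgiref.simple_server` locally, gunicorn behind nginx in production) can then host `app`.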
I am the owner of PNR Converter, so I can shed some light on your attempt to scrape content from our site. Unfortunately scraping from PNR Converter is not recommended. We are developing an API which looks like it would suit your needs, and should be ready in the not too distant future. If you contact us through the site we would be happy to work with you should you wish to use PNR Converter legitimately. PNR Converter gets at least one complete update per year and as such we change all the code on a regular basis. We also monitor all requests to our site, and we will block any requests which are deemed as improper usage. Our filter has already picked up your IP address (ends in 250.144) as potential misuse.
Like I said, should you wish to work with us at PNR Converter legitimately and not scrape our content, we would be happy to do so! Please keep checking https://www.pnrconverter.com/api-introduction for information relating to our API.
We are releasing a backend upgrade this weekend, which will have a different HTML structure, and dynamically named elements which will cause a serious issue for web scrapers!
My input is the URL of a page. I want to get the HTML of the page, then parse it for a specific JSON response and grab a Product ID and another URL. In the next step, I would like to append the Product ID to the URL found.
Any advice on how to achieve this?
As far as retrieving the page, the requests library is a great tool, and much more sanity-friendly than cURL.
I'm not sure based on your question, but if you're getting JSON back, just import the native JSON library (import json) and use json.loads(data) to get a dictionary (or list) provided the response is valid JSON.
If you're parsing HTML, there are several good choices, including BeautifulSoup and lxml. The former is easier to use but doesn't run as quickly or efficiently; the latter can be a bit obtuse but it's blazingly fast. Which is better depends on your app's requirements.
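Putting those pieces together might look like the sketch below; the page layout, the script tag, and the key names (`productId`, `nextUrl`) are all hypothetical stand-ins for whatever the real page contains:

```python
import json
import re

# Stand-in for the HTML you'd fetch with requests.get(url).text
html = '<script id="data">{"productId": "12345", "nextUrl": "https://example.com/product"}</script>'

# Pull the JSON blob out of the page, then parse it with the stdlib json module
match = re.search(r'<script id="data">(.*?)</script>', html, re.S)
data = json.loads(match.group(1))

# Append the Product ID to the URL found in the JSON
full_url = "%s/%s" % (data["nextUrl"], data["productId"])
print(full_url)  # https://example.com/product/12345
```

For anything less regular than a single script tag, swap the regex for BeautifulSoup or lxml as described above.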
I am new to Python, and I am currently working through all the solved examples from the book "Python for Data Analysis".
In one example, I need to access the Twitter search API. Here is the code:
import requests
url='https://twitter.com/search?q=python%20pandas&src=typd'
resp=requests.get(url)
Everything works okay up to here; problems come when I use the json module:
import json
data=json.loads(resp.text)
Then I receive the error message ValueError: No JSON object could be decoded.
I tried switching to another URL instead of Twitter, but I still receive the same error message.
Does anyone have any ideas? Thanks.
You need a response with JSON content. To search Twitter this requires use of api.twitter.com, but you need to get an OAuth key. It's more complicated than the 5 lines of code you have. See http://nbviewer.ipython.org/github/chdoig/Mining-the-Social-Web-2nd-Edition/blob/master/ipynb/Chapter%201%20-%20Mining%20Twitter.ipynb
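To see what's happening: twitter.com/search returns an HTML page, not JSON, so json.loads has nothing to decode. A small guard makes the failure explicit instead of crashing:

```python
import json

# Stand-in for resp.text from a browser-facing search page: HTML, not JSON.
body = "<html><body>search results</body></html>"

try:
    data = json.loads(body)
except ValueError as exc:  # json.JSONDecodeError subclasses ValueError
    data = None
    print("Not JSON:", exc)
```

With the real API endpoint (and OAuth credentials) the response body would be JSON and the `try` branch would succeed.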
I am new to Python and there has been a request for grabbing the dynamic data from www.skyscanner.net.
Can someone guide me on doing so?
import requests
import lxml.html as lh

url = 'http://www.skyscanner.net/transport/flights/sin/lhr/131231/140220/'
response = requests.get(url)  # GET, not POST, to fetch a page
tree = lh.document_fromstring(response.content)
print(tree)
All I did was find the pattern in the URL and attempt to grab data from there. However, no data were successfully pulled. I learnt that Python is the best language for such a task, but the library ecosystem seems huge and I do not know where to start from.
My name is Piotr - I work for Skyscanner, in the Data Acquisition team - which I assume you are applying to join :-) As this is part of your task I wouldn't like to give you a straight answer, but you might consider:
Understand how our site works - how the requests are built and what data you can find in the http response.
You could use some libraries that will help you parse XML/JSON responses.
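That parsing step could look like the self-contained sketch below; the two payloads are made up for illustration and are NOT Skyscanner's actual response format — inspect the site's real HTTP responses first:

```python
import json
import xml.etree.ElementTree as ET

# Made-up example payloads, not real responses from the site.
json_body = '{"itineraries": [{"price": 450}, {"price": 512}]}'
xml_body = '<itineraries><itinerary price="450"/><itinerary price="512"/></itineraries>'

# JSON: stdlib json turns the body into dicts/lists
json_prices = [i["price"] for i in json.loads(json_body)["itineraries"]]

# XML: ElementTree walks the elements and reads attributes
xml_prices = [int(el.get("price")) for el in ET.fromstring(xml_body).iter("itinerary")]

print(json_prices)  # [450, 512]
print(xml_prices)   # [450, 512]
```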
I think that's all I can say :-)
Cheers,
piotr
Let's dive into this, shall we?
Ok, I need to write a script (I don't care what language; I prefer something like Python or JavaScript, but whatever works I will take time to learn). The script will access multiple URLs, extract text from each site, and store it in a folder on my PC. (From there I am manipulating the data with Python, which I know how to do.)
EDIT:
Currently I am using python's NLTK module. Here is a simple version of my code:
import nltk
from urllib.request import urlopen  # urllib2.urlopen on Python 2

url = "<URL HERE>"
html = urlopen(url).read()
raw = nltk.clean_html(html)
print(raw)
This code works fine for both http and https, but not for instances where authentication is required.
Is there a Python module which deals with secure authentication?
Thanks in advance for the help! And to the mods who will view this as a bad question, please just give me ways to make it better. I need ideas from people, not Google.
Mechanize is one option; the other is just urllib2.
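With plain urllib (urllib2 on Python 2), HTTP basic auth boils down to sending a base64-encoded Authorization header; the URL and credentials below are made up, and the actual network call is left out:

```python
import base64
import urllib.request  # urllib2 on Python 2

# Hypothetical endpoint and credentials, for illustration only.
url = "https://example.com/protected"
user, password = "user", "secret"

# Basic auth: base64("user:password") in an Authorization header.
token = base64.b64encode(("%s:%s" % (user, password)).encode()).decode()
req = urllib.request.Request(url, headers={"Authorization": "Basic " + token})

# urllib.request.urlopen(req).read() would then fetch the protected page.
print(req.get_header("Authorization"))  # Basic dXNlcjpzZWNyZXQ=
```

For anything beyond basic auth (login forms, cookies, redirects), Mechanize or a requests.Session is much less painful than hand-rolling headers.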