Okay, so I've got a website where I can check my university timetable. I've decided to write an application in Python which logs in, scrapes the timetable (classrooms, hours, lecturers, etc.), and displays it, let's say for a start, in a .txt file. So far I've done some Basic Authentication with HTTP requests; it looked like this:
url = "http://httpbin.org"
authURL = "http://httpbin.org/basic-auth/user/passwd"
r=requests.get(authURL, auth=HTTPBasicAuth("user", "passwd"))
print (r.content)
It's a freely hosted service, just for practice. Okay, but there are many other types of authentication. And here's my question: how can I actually determine which one this website is using, and then use that information in my application?
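For instance, is something like this a reasonable first check? A minimal sketch (assuming the server follows the standard 401 challenge flow) that requests the page without credentials and looks at the WWW-Authenticate header it sends back:

import requests

# Probe the protected page without credentials and inspect the challenge.
resp = requests.get("http://httpbin.org/basic-auth/user/passwd")
print(resp.status_code)                      # 401 for HTTP-auth protected pages
print(resp.headers.get("WWW-Authenticate"))  # e.g. 'Basic realm="..."' or 'Digest ...'
# If there is no WWW-Authenticate header at all, the site probably uses a
# form/cookie- or token-based login rather than an HTTP authentication scheme.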
I would like to create a tool with Python and the Twitter API to be able to create lists of tweets that match certain criteria like "contains the word Python" or "has at least 2 likes". Or simple stats like top posters, most liked, etc.
All of my searching pointed me to the Tweepy project. But for that I need OAuth tokens. So I applied for a developer account and was denied with the comment "we are unable to serve your use case".
Do I have any alternatives?
Well, as a general answer for these situations, you can always use a browser-automation tool, which is basically a library that drives a browser's remote-control features and replicates what you would do by hand: opening the website, logging in, etc. It can then parse all the data from the rendered elements.
Try looking at Selenium; I've used that library in the past to scrape Facebook directly, and it worked flawlessly.
Edit: note that this isn't a Twitter-specific library. You will have to find the HTML tags on the login page and use them to log in, and the same goes for parsing data, etc.
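For illustration, here's a rough sketch of that approach; the login URL and element names below are hypothetical placeholders, so you'd need to inspect the real page to find the right ones:

from selenium import webdriver
import time

browser = webdriver.Firefox()
browser.get("https://example.com/login")  # placeholder login URL

# Placeholder field names/selectors; find the real ones in the page's HTML.
browser.find_element_by_name("username").send_keys("my_user")
browser.find_element_by_name("password").send_keys("my_password")
browser.find_element_by_xpath("//button[@type='submit']").click()

time.sleep(5)  # crude wait for the page to render; explicit waits are more robust

# The rendered page source can now be parsed, e.g. with BeautifulSoup.
print(browser.page_source[:500])
browser.quit()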
I want to build an API that accepts a string and returns HTML code.
Here is my scraping code that I want to expose as a web service.
Code
from selenium import webdriver
import bs4
import requests
import time
url = "https://www.pnrconverter.com/"
browser = webdriver.Firefox()
browser.get(url)
string = "3 PS 232 M 03FEB 7 JFKKBP HK2 1230A 420P 03FEB E
PS/JPIX8U"
button =
browser.find_element_by_xpath("//textarea[#class='dataInputChild']")
button.send_keys(string) #accept string
button.submit()
time.sleep(5)
soup = bs4.BeautifulSoup(browser.page_source,'html.parser')
html = soup.find('div',class_="main-content") #returns html
print(html)
Can anyone tell me the best possible solution to wrap up my code as an API/web service?
There's no best possible solution in general, because a solution has to fit the problem and the available resources.
Right now it seems like you're trying to wrap someone else's website. If that's the problem you're actually trying to solve, and you want to give credit, you should probably just forward people to their site: have your site return a 302 Redirect with their URL in the Location header.
If what you're trying to do is get the response from this one sample check you have hardcoded and make that result available, I would suggest you put it in a static file behind nginx.
If what you're trying to do is use their backend to turn itineraries you have into responses you can return, you can do that by using their backend API, once that becomes available. Read the documentation, use the requests library to hit the API endpoint that you want, and get the JSON result back, and format it to your desires.
If you're trying to duplicate their site by making yourself a man-in-the-middle, that may be illegal and you should reconsider what you're doing.
For hosting purposes, you need to figure out how often your API will be hit. You can probably start on Heroku or something similar fairly easily, and scale up if you need to. You'll probably want WebObj or Flask or something similar sitting at the website where you intend to host this application. You can use those to process what I presume will be a simple request into the string you wish to hit their API with.
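To make that last part concrete, here is a minimal sketch of one way to wrap the scraping logic in a small Flask service; the endpoint name, query parameter, and scrape_pnr helper are made up for illustration (scrape_pnr stands in for the existing Selenium/BeautifulSoup code refactored into a function):

from flask import Flask, request

app = Flask(__name__)

def scrape_pnr(pnr_string):
    """Hypothetical helper: the existing Selenium/BeautifulSoup code, refactored
    to take the PNR string and return the rendered HTML fragment."""
    raise NotImplementedError

@app.route("/convert")
def convert():
    # e.g. GET /convert?pnr=3+PS+232+M+03FEB+...
    pnr_string = request.args.get("pnr", "")
    if not pnr_string:
        return "Missing 'pnr' query parameter", 400
    return str(scrape_pnr(pnr_string))

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=5000)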
I am the owner of PNR Converter, so I can shed some light on your attempt to scrape content from our site. Unfortunately scraping from PNR Converter is not recommended. We are developing an API which looks like it would suit your needs, and should be ready in the not too distant future. If you contact us through the site we would be happy to work with you should you wish to use PNR Converter legitimately. PNR Converter gets at least one complete update per year and as such we change all the code on a regular basis. We also monitor all requests to our site, and we will block any requests which are deemed as improper usage. Our filter has already picked up your IP address (ends in 250.144) as potential misuse.
Like I said, should you wish to work with us at PNR Converter legitimately and not scrape our content, then we would be happy to do so! Please keep checking https://www.pnrconverter.com/api-introduction for information relating to our API.
We are releasing a backend upgrade this weekend, which will have a different HTML structure, and dynamically named elements which will cause a serious issue for web scrapers!
I use a webapp that can generate a PDF report of some data stored in the app. To get to that report, however, requires several clicks and monkeying around with the app.
I support a group of users of this app (we use the app, we don't create the app) and I'd like them to be able to generate and view this report with as few clicks as possible. Thankfully, this web app provides a lot of data via a RESTful API. So I did some scripting.
I have a Python script that makes an HTTP GET request, processes the JSON results, and uses that resultant data to dynamically build a URL. Here's a simplified version of my Python code:
#!/usr/bin/env python
import requests
app_id="12345"
secret="67890"
api_url='https://api.webapp.example/some_endpoint'
resp = requests.get(api_url, auth=(app_id,secret))
json_data = resp.json()
# Simplification of the data processing I'm doing
my_data = json_data['attr1']['attr2'] + my_data_processing
# Result of the script is a link to a dynamically generated PDF
pdf_url = 'https://pdf.webapp.example/items/' + my_data
The above is a simplification of the code I actually have, but it shows the relevant points. In my actual script, I continue by doing another GET with the dynamically built URL. The web app generates a PDF based on the my_data portion of the URL, and I write that PDF to a file. This works very well today.
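For completeness, that follow-up step looks roughly like this (simplified again; the output filename is just an example):

# Fetch the dynamically generated PDF and save it locally.
pdf_resp = requests.get(pdf_url, auth=(app_id, secret))
with open("report.pdf", "wb") as f:
    f.write(pdf_resp.content)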
Currently, this is a python script that runs on my local machine on-demand. However, I'd like to host this somewhere on the web so that when a user hits a URL in their browser it runs and generates the pdf_url, instead of having to install this script on each user's local machine, and so that the PDF can be generated and viewed on a mobile device.
The thought is that the user can open http://example.com/report-shortcut, the python script would run server-side, dynamically build the URL, and redirect the user to that URL, which would then show the PDF in the browser (assuming the user is using a browser that shows PDFs like Chrome, Safari, etc). Alternately, if a redirect is problematic, going to http://example.com/report-shortcut could just show an HTML page with a link to the URL generated by the Python script.
I'm looking for a solution on how to host this Python script and have it run when a user accesses a webpage. I've looked into AWS Lambda and Django, but both seem like overkill for such a simple script (~20 lines of code, plus comments and whitespace). I've also looked at Python CGI scripting, which looks promising, but I have no experience setting up something like that.
Looking for suggestions on how best to host and run this code when a user goes to the example URL.
PS: I thought about just re-implementing this in JavaScript, but I'd rather the API key not be publicly accessible.
I suggest building the script in AWS Lambda and using the API Gateway to invoke it.
You could create the PDF, store it in S3, and generate a pre-signed URL, then return a 302 response to the user to redirect them to the pre-signed URL. This will display the PDF in their browser.
It's very quick to set up, and with Boto3, getting the PDF into S3 and generating the URL is simple.
It will be much simpler than some of your other suggestions.
See the API Gateway and Boto3 documentation.
I want to read the HTML contents of a site on Google's Play Store developer backend from Python.
The URL is
https://play.google.com/apps/publish/?dev_acc=1234567890#AppListPlace
The site is of course only accessible if you're logged in.
I naively tried:
import requests
from requests.auth import HTTPBasicAuth

url = "https://play.google.com/apps/publish/?dev_acc=1234567890#AppListPlace"
response = requests.get(url, auth=HTTPBasicAuth('username@gmail.com', 'mypassword'))
which yielded only the default 'you need to be logged in to view this page' HTML content.
Any way to do this?
Trying to read the HTML contents of the page is not the way to go.
Basic HTTP authentication is not something you will see very often these days. It's the kind which pops up a browser alert message asking you for your username and password. Google, like most other websites, uses their own more sophisticated system. That system is not designed to be accessed by anyone but humans. Not to mention that storing your Google account password in your source code is a terrible idea.
Instead, you should look into the Google Play Developer API, which is designed to be accessed by machines, and uses OAuth2 authentication.
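As a rough sketch of what that looks like, access usually goes through a service account rather than your personal Google password; this assumes the google-api-python-client and google-auth packages, a service-account key linked to your Play Console account, and a placeholder package name:

from google.oauth2 import service_account
from googleapiclient.discovery import build

SCOPES = ["https://www.googleapis.com/auth/androidpublisher"]

# Placeholder path to a service-account key with access to your Play Console account.
creds = service_account.Credentials.from_service_account_file(
    "service-account.json", scopes=SCOPES
)
service = build("androidpublisher", "v3", credentials=creds)

# Example call: list recent reviews for one of your apps (package name is a placeholder).
reviews = service.reviews().list(packageName="com.example.myapp").execute()
print(reviews)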
I am trying to crawl a website for the first time. I am using urllib2 in Python.
I am currently trying to log into the Foursquare social networking site using Python's urllib2 and BeautifulSoup. To view a particular page, I need to provide a username and password.
So I followed the Basic Authentication described on the documentation page.
I guess everything worked well, but the site throws up a security check asking me to type in some text (a captcha) before sending me the required page. It obviously looks like the site is detecting that the page is being requested not by a human but by a crawler.
So, what is the way to avoid being detected? How can I make urllib2 get the desired page without having to stop at the security check? Please help.
You probably want to use the Foursquare API instead.
You have to use the Foursquare API. I guess there is no other way; APIs are designed for such purposes.
Crawlers that depend solely on the HTML format of the page will fail in the future when the HTML page changes.
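For example, here is a minimal sketch of hitting the v2 venues search endpoint with requests; the credentials and query values are placeholders you'd replace after registering an app with Foursquare:

import requests

# Placeholder credentials from your registered Foursquare app.
params = {
    "client_id": "YOUR_CLIENT_ID",
    "client_secret": "YOUR_CLIENT_SECRET",
    "v": "20180323",        # API version date
    "near": "New York, NY",
    "query": "coffee",
}

resp = requests.get("https://api.foursquare.com/v2/venues/search", params=params)
for venue in resp.json()["response"]["venues"]:
    print(venue["name"])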