The goal here is, given a user's Facebook profile URL, to access and open the profile page. Some simple Python code:
from urllib2 import urlopen
url = "http://www.facebook.com/username"
page = urlopen(url)
The problem is that for some usernames this causes an HTTP 404 error. I noticed this error happening only when the path contains a name rather than the "profile.php?id=XXX" format.
Note that we only have the URL here, not the user ID.
UPDATE:
This turned out to also happen for some of the "profile.php?id=XXX" and other username formats.
This is a privacy feature of Facebook. Users have the ability to hide their profile page so that only logged-in users can view it. Accessing the page with /profile.php?id=XXX or with /username makes no difference: you must be logged in to view the HTML page.
In your context, you'd have to log in to a valid Facebook account before requesting the page, and you should no longer receive the 404s.
One way to check this is via the Graph API: graph.facebook.com/USERNAME returns a link property in the resulting JSON if the user has a public page, and omits it for private pages.
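As a rough sketch of that check, in the same urllib2 style as the question (has_public_page is a hypothetical helper; the unauthenticated Graph API behaved this way at the time):
import json
from urllib2 import urlopen, HTTPError

def has_public_page(username):
    # The 'link' property is present in the JSON only for public profiles.
    try:
        data = json.load(urlopen("http://graph.facebook.com/" + username))
    except HTTPError:
        return False  # the endpoint itself can 404 for invalid names
    return "link" in data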
Not every Facebook account is accessible as FIRST.LAST, so you won't be able to reliably do this.
There is currently no guarantee that an account is accessible with a vanity name.
Works perfectly fine as long as the username exists.
Are you trying to open the page in a Web Browser or access the HTML source generated by the page?
If the latter, have you thought of using the Facebook Graph API to achieve whatever it is that you are doing? This will be much faster and the API is all documented. Plus the page's HTML source could change at any point in time, whereas the Graph API will not.
Edit
You can use the Graph API to get the user ID without even having to create an application, by going to http://graph.facebook.com/username and parsing the JSON response. You can then access the profile HTML using http://www.facebook.com/profile.php?id=userId
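For example, a minimal sketch of that flow (assuming the profile is public, so the Graph API response contains an id):
import json
from urllib2 import urlopen

# Resolve the vanity username to a numeric user ID via the Graph API.
profile = json.load(urlopen("http://graph.facebook.com/username"))
user_id = profile["id"]

# Then request the profile HTML by ID instead of by vanity name.
page = urlopen("http://www.facebook.com/profile.php?id=" + user_id)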
I am trying to write a Python script to log in to the following site in order to automatically keep an eye on some account details: https://gateway.usps.com/eAdmin/view/signin
I have the right credentials, but something isn't quite working correctly; I don't know if it is because of the hidden inputs that exist on the form.
import requests
from bs4 import BeautifulSoup

user = 'myusername'
passwd = 'mypassword'

s = requests.Session()

# Fetch the login page and scrape the hidden inputs from its form
r = s.get("https://gateway.usps.com/eAdmin/view/signin")
soup = BeautifulSoup(r.content, "html.parser")
sp = soup.find("input", {"name": "_sourcePage"})['value']
fp = soup.find("input", {"name": "__fp"})['value']
si = soup.find("input", {"name": "securityId"})['value']

data = {
    "securityId": si,
    "username": user,
    "password": passwd,
    "_sourcePage": sp,
    "__fp": fp}
headers = {
    "Content-Type": "application/x-www-form-urlencoded",
    "Host": "gateway.usps.com",
    "Origin": "https://gateway.usps.com",
    "Referer": "https://gateway.usps.com/eAdmin/view/signin"}
login_url = "https://gateway.usps.com/eAdmin/view/signin"

# Post the credentials together with the scraped hidden values
r = s.post(login_url, headers=headers, data=data, cookies=r.cookies)
print(r.content)
_sourcePage, securityId and __fp are all hidden input values from the page source. I am scraping them from the page, but obviously when I get to the POST request I'm opening the URL again, so these values change and are no longer valid. However, I'm unsure how to rewrite the POST line to ensure that I extract the correct hidden values for submission.
I don't think this is relevant only to this site, but to any site with hidden random values.
You can't do that.
You are trying to authenticate with an HTTP POST request from outside the application's own scope, i.e. its login page and web form.
For security reasons the page implements several techniques, among them an anti-CSRF token (which is probably _sourcePage), to ensure that the login request comes exclusively from the web page itself.
For this reason, every time you scrape the page and grab the contents of the hidden security inputs, the web application generates fresh values; when you then reuse the old ones to craft the final request, they are of course no longer valid.
See also: https://www.owasp.org/index.php/Cross-Site_Request_Forgery_(CSRF)
I am attempting to scrape some data from a website which requires a login. To complicate matters, I am scraping data from three different accounts. So in other words, I need to log in to the site, scrape the data and then log out, three times.
The html behind the logout button looks like this:
The (very simplified) code I've tried is below:
import requests

for account in [account1, account2, account3]:
    with requests.Session() as session:
        [[login code here]]
        [[scraping code here]]
        session.get(url + "/logout")
The scraping using the first account works fine, but after that it doesn't. I'm assuming this is because I'm not logging out properly. What can I do to fix this?
It's quite simple:
You need to forge a correct login request.
To do this, go to the login page:
Open the browser's 'Inspect' tool, 'Network' tab. Checking the 'Preserve log' option is quite useful as well.
Log in to the site, and you'll see the login request appear in the Network tab (usually it's a POST request).
Right-click the request, select Copy -> Copy as cURL, and then just use this brilliant tool
Usually you can trim down the headers and cookies of the code produced by the tool (but be careful trimming the Content-Type header, it can break your code).
Replace requests.[get|post](...) with session.[get|post](...)
Profit. You'll have a logged-in session after executing the code above; a rough sketch of the result follows. Logging out and any form submission are done in pretty much the same way.
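Every URL, header and form field in this sketch is a placeholder, to be replaced with what the cURL conversion gives you:
import requests

session = requests.Session()

# Placeholder values: the real URL, headers and form fields come from the
# request you copied as cURL in the steps above.
login_url = "https://example.com/login"
headers = {"Content-Type": "application/x-www-form-urlencoded"}
data = {"username": "myusername", "password": "mypassword"}

session.post(login_url, headers=headers, data=data)

# Subsequent requests on this session carry the logged-in cookies.
r = session.get("https://example.com/account")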
So I'm trying to generate a PDF of a view that I have in a django web application. This view is protected, meaning the user has to be logged in and have specific permission to view the page. I also have some attachments (stored in the database as FileFields) that I would like to append to the end of the PDF.
I've read most of the posts I could find on how to generate PDFs from a webpage using pdfkit or reportlab, but all of them fail for me for some reason or another.
Currently, the closest I've gotten is successfully generating a PDF of the page using pdfkit, but this requires me to remove the restrictions that require the user to be logged in and have page permissions, which really isn't an option long term. I found a couple of posts that discuss printing PDFs of protected pages and providing login information, but I couldn't get any of that to work.
I haven't found anything on how to include attachments, and don't really know where to start with that.
I'm more than happy to update this question with more information or snippets of code if need be, but there's quite a few moving parts here and I don't want to flood people with useless information. Let me know if there's any other information I should provide, and thanks in advance for any help.
I got it working! Through a combination of PyPDF2 and pdfkit, I got this to work pretty simply. It works on protected pages because Django takes care of rendering the complete HTML as a string, which I just pass to pdfkit. It also supports appending attachments, but I doubt (though I haven't tested) that it works with anything other than PDFs.
from django.template.loader import get_template
from PyPDF2 import PdfFileWriter, PdfFileReader
import pdfkit

def append_pdf(pdf, output):
    # Copy every page of the source PDF into the output writer
    for page_num in range(pdf.numPages):
        output.addPage(pdf.getPage(page_num))

def render_to_pdf(context_data):
    # Render the protected view's template directly to an HTML string;
    # Django does this in-process, so no logged-in HTTP request is needed
    t = get_template('app/template.html')
    html = t.render({'context_data': context_data})
    pdfkit.from_string(html, 'path/to/file.pdf')

    # Start the combined document with the rendered page...
    output = PdfFileWriter()
    append_pdf(PdfFileReader(open('path/to/file.pdf', "rb")), output)

    # ...then append each attachment (Attachment is this app's FileField
    # model; the attachments are assumed to be PDFs themselves)
    attaches = Attachment.objects.all()
    for attach in attaches:
        append_pdf(PdfFileReader(open(attach.file.path, "rb")), output)

    output.write(open('path/to/file_with_attachments.pdf', "wb"))
If you just want to secure it, you could write a custom authentication backend that lets your server spoof users. Way overkill, but it would solve your problem and at least you get to learn about custom auth backends! (Note: you should be using HTTPS.)
https://docs.djangoproject.com/en/1.11/topics/auth/customizing/#writing-an-authentication-backend
Create auth backend in app/auth_backends.py
Add app.auth_backends.SpoofAuthBackend backend to settings.py that takes a shared_secret and user_id.
Create a URL route like url(r'^spoof-user/(?P<user_id>\d+)/$', 'app.views.spoof_user', name="spoof-user")
Add the view spoof_user, which must invoke django.contrib.auth.authenticate (which invokes the backend from step 1 above) and then, after getting the user from authenticate(...), attach that user to the request with django.contrib.auth.login(request, user). Finally, this view should return HttpResponseForbidden if the shared secret is wrong, or HttpResponseRedirect to the PDF URL you actually want (after programmatically logging in the spoofed user via authenticate and login).
You would probably want to create a random secret key for each request using something like cache.set('spoof-user-%s' % user_id, RANDOM_STRING, 30), which persists the shared secret for 30 seconds to allow time for the request. Then perform pdf_response = requests.get("%s?shared_secret=1a2b3c&redirect_uri=/path/to/pdf/" % reverse('spoof-user', kwargs={'user_id': 1234})). Your new view will test the provided shared_secret in the auth backend, log the user in to the request, and redirect to request.GET.get('redirect_uri'). A minimal sketch of the backend follows.
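Here is a rough sketch of what such a backend might look like, assuming the default User model and the cache-based secret from the last paragraph:
# app/auth_backends.py
from django.contrib.auth.models import User
from django.core.cache import cache

class SpoofAuthBackend(object):

    def authenticate(self, request=None, user_id=None, shared_secret=None):
        # Compare against the short-lived secret stored via cache.set(...)
        expected = cache.get('spoof-user-%s' % user_id)
        if expected is None or shared_secret != expected:
            return None
        return self.get_user(user_id)

    def get_user(self, user_id):
        try:
            return User.objects.get(pk=user_id)
        except User.DoesNotExist:
            return None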
You can use pdfkit to do that. You can retrieve the page using the url and pdfkit will handle the rest:
pdfkit.from_url('http://website.com/somepage', 'somepage.pdf')
You will of course have to access the page with the appropriate headers, since it is protected:
options = {
'cookie': [
('cookie-name1', 'cookie-value1'),
('cookie-name2', 'cookie-value2'),
]
}
pdfkit.from_url('http://website.com/somepage', 'somepage.pdf', options=options)
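Since the protected page in this question is a Django view, one option is to forward the requesting user's own session cookie (Django's session cookie is named sessionid by default):
# Inside a Django view: reuse the caller's session cookie so wkhtmltopdf
# fetches the page as that logged-in user.
options = {'cookie': [('sessionid', request.COOKIES['sessionid'])]}
pdfkit.from_url('http://website.com/somepage', 'somepage.pdf', options=options)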
I want to read the HTML contents of a site on Google's Play Store developer backend from Python.
The URL is
https://play.google.com/apps/publish/?dev_acc=1234567890#AppListPlace
The site is of course only accessible if you're logged in.
I naively tried:
response = requests.get(url, auth=HTTPBasicAuth('username#gmail.com', 'mypassword'))
which yielded only the default 'you need to be logged in to view this page' html content.
Any way to do this?
Trying to read the HTML contents of the page is not the way to go.
Basic HTTP authentication is not something you will see very often these days. It's the kind which pops up a browser alert message asking you for your username and password. Google, like most other websites, uses their own more sophisticated system. That system is not designed to be accessed by anyone but humans. Not to mention that storing your Google account password in your source code is a terrible idea.
Instead, you should look into the Google Play Developer API, which is designed to be accessed by machines, and uses OAuth2 authentication.
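For instance, here is a minimal sketch using the google-api-python-client library with a service account (the key file path and package name are placeholders):
from google.oauth2 import service_account
from googleapiclient.discovery import build

# A service account key granted access in the Play Console replaces storing
# your Google password anywhere in the source code.
credentials = service_account.Credentials.from_service_account_file(
    'service-account.json',
    scopes=['https://www.googleapis.com/auth/androidpublisher'])
service = build('androidpublisher', 'v3', credentials=credentials)

# Example call: list recent reviews for one of your apps.
reviews = service.reviews().list(packageName='com.example.app').execute()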
I'm working on an app which saves things from many different domains via Ajax POST to my server/app. I need to find a solution for how to send a POST and verify whether the user who sent it is already signed in on my site, and then save it to the database.
I am pretty sure that I need a Chrome/Firefox extension to do it, because I need to embed my JS on every page my users surf. The thing is, I don't know where to start and how it should work. I could set up a proxy to make the JSON POST work, but I don't know how to verify whether the user is signed in on my site.
Should I get my users' cookies from the browser via the Chrome API and send them in the POST, then authenticate the cookie/session in Django? What do you suggest?
Thank you for your help. I appreciate every hint.
When the user logs in at http://yourserver.com, you can set a permanent cookie to identify him (see the SESSION_EXPIRE_AT_BROWSER_CLOSE and SESSION_COOKIE_AGE settings in Django).
Then, when another site embeds any JS from the yourserver.com domain, the cookies are automatically sent for that domain, and you can check on the Django side for the cookie's existence and validity and serve the good JS.
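On the Django side, that check could be as small as this sketch (embedded_js is a hypothetical view name):
from django.http import HttpResponse, HttpResponseForbidden

def embedded_js(request):
    # The browser sends yourserver.com's cookies along with the script
    # request, so the session middleware resolves the logged-in user as usual.
    if not request.user.is_authenticated:  # is_authenticated() on older Django
        return HttpResponseForbidden()
    return HttpResponse("/* the good JS */",
                        content_type="application/javascript")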
Because of cross-domain issues, you would be better off using a form POST as an alternative to AJAX, since it is not subject to the same security restrictions. You can then make the two domains communicate by playing with iframes and JavaScript.
To embed the JS in another website, you can use a browser extension, or a simple bookmarklet, which will load your code in the current page when the user clicks it from any webpage.
My 2 cents.