Setting aside the fact that Zalando usually blocks automated request traffic (I already know how to get around that): how can I find the POST request that Zalando uses to log me in via their form at zalando.de/myaccount/? With DevTools I can't seem to find the specific POST request.
As far as I can see, once I have the required data I could then do something like this: How to "log in" to a website using Python's Requests module?
Can anyone show me what such a request would look like? Thanks.
Firstly, please respect their wishes to not send direct requests to their site.
For any other site that allows it:
Make sure you set up your DevTools not to clear the network log on redirects
Check the response headers (or your browser's storage) for which cookies have been set, since those are used to identify your session (e.g. Service-Client-Id, ...)
Make a request with those cookies from your script to act as a logged-in user (see the sketch after this list)
NEVER SHARE OR POST THOSE COOKIES ANYWHERE and read up on session-hijacking
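As a rough illustration, a minimal sketch with the requests library (the URL and the cookie names/values below are placeholders; copy the real cookies from your own browser after logging in manually):

import requests

# Placeholder cookie names and values; copy the real ones from your browser
# (DevTools -> Application -> Cookies) after logging in manually.
cookies = {
    'Service-Client-Id': '...',
    'session-id': '...',
}
response = requests.get('https://example.com/myaccount/', cookies=cookies)
print(response.status_code)  # with valid cookies you get the account page; without them you land on the login page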
I would like to scrape a website that has no API and is an "infinite scroller". I have been using Selenium for this, but now I need to scrape many more pages, all at once. The problem is that Selenium is very resource-hungry, since each instance runs a full (headless) Chrome browser, and it is not stable at all (probably because of the limited resources, but still). I know the usual alternative is to find the AJAX requests the site makes and replay them with the requests library, but I have two issues:
I can't seem to find the desired request
The ones that I try to use with the requests library require the user to be logged in, and I have no idea how to do that (maybe pass cookies and whatnot; I am not a web developer).
Let me take Twitter as an example, since it is exactly what I am describing (except that it has an API). You have to log in, and then the feed loads infinitely. So the goal is to "scroll" and take the content of each tweet. How can this be done? If you can, please provide a working example.
Thank you.
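(A rough sketch of the usual approach follows. Every endpoint, parameter, cookie name, and JSON field here is hypothetical; the real ones come from DevTools' Network tab while scrolling, with session cookies copied from a logged-in browser.)

import requests

session = requests.Session()
session.cookies.update({'auth_token': '...'})  # hypothetical cookie, copied from the logged-in browser

cursor = None
while True:
    params = {'count': 20}
    if cursor:
        params['cursor'] = cursor
    # hypothetical endpoint and response layout
    data = session.get('https://example.com/api/timeline', params=params).json()
    for item in data['items']:
        print(item['text'])
    cursor = data.get('next_cursor')
    if not cursor:
        break  # no more pages to "scroll" through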
I am trying to crawl a website for the first time, using urllib2 in Python.
I am currently trying to log into the Foursquare social networking site using Python's urllib2 and BeautifulSoup. To view a particular page, I need to provide a username and password.
So, I followed the Basic Authentication described on the documentation page.
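(For reference, urllib2 basic auth typically looks something like the following; the URLs and credentials here are placeholders:)

import urllib2

# Placeholder URLs and credentials; basic-auth setup as shown in the urllib2 docs.
password_mgr = urllib2.HTTPPasswordMgrWithDefaultRealm()
password_mgr.add_password(None, 'https://foursquare.com/', 'username', 'password')
opener = urllib2.build_opener(urllib2.HTTPBasicAuthHandler(password_mgr))
html = opener.open('https://foursquare.com/some-page').read()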
I guess everything worked well, but the site throws up a security check asking me to type in some text (a captcha) before sending me the required page. It obviously looks like the site is detecting that the page is being requested not by a human but by a crawler.
So, what is the way to avoid being detected? How do I make urllib2 get the desired page without having to stop at the security check? Please help.
You probably want to use the Foursquare API instead.
You have to use the Foursquare API; I guess there is no other way. APIs are designed for exactly such purposes.
Crawlers that depend solely on the HTML structure of a page will break in the future, when the HTML changes.
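As a rough illustration of what an API call looks like (the endpoint and parameters follow Foursquare's v2 API as documented at the time; you have to register an app with Foursquare to get a real client_id and client_secret):

import json
import urllib2

# CLIENT_ID and CLIENT_SECRET are placeholders for your registered app's credentials.
url = ('https://api.foursquare.com/v2/venues/search'
       '?ll=40.7,-74&client_id=CLIENT_ID&client_secret=CLIENT_SECRET&v=20120101')
data = json.loads(urllib2.urlopen(url).read())
print(data['response'])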
In my Django application, I would like to know whether the client's browser supports AJAX. This is because I have, for example, profile editing: one version edits the user's profile in place, and another redirects you to an edit page.
I know that most browsers support AJAX nowadays, but just to make sure, how can I check that in a Django application?
I believe the correct approach would be some sort of graceful degradation: check whether the request was made via AJAX using Django's request.is_ajax() method:
https://docs.djangoproject.com/en/dev/ref/request-response/#django.http.HttpRequest.is_ajax
In your view, there would be something like:

from django.http import HttpResponse
from django.shortcuts import redirect
from django.utils import simplejson

if form.is_valid():
    if request.is_ajax():
        return HttpResponse(simplejson.dumps(something))
    return redirect('/some-url/')
User-agent sniffing and the like is not considered the best solution. If you can afford it, rather use a project like has.js on the client side to check what the user's browser is really capable of, and send that information to the server somehow (for example: serve a capability-checking page when there is no session, let it run the checks and POST the results to the server, which then creates a session and remembers the capabilities for that session, or something similar).
If you want to know whether a browser supports AJAX, you need to know the browser's capabilities; for that, have a look at this project:
https://github.com/clement/django-wurfl/
I haven't found a way to do this, so what I did was prepare a JavaScript-free version and a JavaScript version of my template.
I load a .js file that replaces all the links to other pages with AJAX links. So if the user doesn't have JavaScript, he sees all the original links and functionality, and if he does, he gets all the AJAX functionality.
It's my first question here.
Today I wrote a little application using wxPython: a simple Megaupload downloader. However, it doesn't yet support premium accounts.
Now I would like to know how to download from MU with a login (as a free or premium user).
I'm very new to Python, so please keep it simple rather than "professional".
I used to download files with urlretrieve, but is there a way to pass "arguments" or something so I can log in as a premium user?
Thank you. :D
EDIT: new help needed xD
After trying pycurl, httplib2 and mechanize, I got the login working with urllib2 and a cookiejar (the returned HTML shows my username).
But when I start downloading a file, the server apparently doesn't keep my login, and in fact the downloaded file seems corrupted (I changed the wait time from 45 to 25 seconds).
How can I download a file from Megaupload while keeping my previously established login? Thanks for your patience. :D
Questions like this are usually frowned upon; they are very broad, and there is already an abundance of answers if you just search on Google.
You can use urllib, or mechanize, or any library that lets you make an HTTP POST request.
Megaupload looks to have these form values:
login:1
redir:1
username:
password:
Just POST those values to http://megaupload.com/?c=login.
All you should have to do is set your username and password to the correct values!
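A minimal sketch of that POST with urllib2 and a cookiejar, matching what the asker ended up using (the URL and field names are simply the ones quoted above; username and password are placeholders):

import urllib
import urllib2
import cookielib

# Keep cookies so the login survives into the download request.
cj = cookielib.CookieJar()
opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cj))

data = urllib.urlencode({
    'login': '1',
    'redir': '1',
    'username': 'your_username',  # placeholder
    'password': 'your_password',  # placeholder
})
opener.open('http://megaupload.com/?c=login', data)
# Reuse the same opener for the file download so the session cookies are sent.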
For logging in using Python, follow these steps:
Find the list of parameters to be sent in the POST request, and the URL the request has to be made to, by viewing the source of the login form. You may use a browser with an "Inspect Element" feature to find it easily; just check each input tag's name attribute (parameter name examples: userid, password).
Most sites set a cookie on login, and that cookie has to be sent along with subsequent requests. To handle this, download httplib2 (http://code.google.com/p/httplib2/) and read the wiki page on the link given; it shows how to log in, with examples.
Now you can make subsequent requests for files, passing the cookie along as shown in those examples.
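A sketch of those steps with httplib2 (the URL and field names are placeholders; use the ones you found in the login form):

import urllib
import httplib2

http = httplib2.Http()
body = urllib.urlencode({'userid': 'me', 'password': 'secret'})  # placeholder fields
headers = {'Content-Type': 'application/x-www-form-urlencoded'}
response, content = http.request('http://example.com/login', 'POST', body=body, headers=headers)

# The session cookie comes back in the Set-Cookie header;
# send it back with subsequent requests.
cookie = response.get('set-cookie')
response, content = http.request('http://example.com/files', 'GET', headers={'Cookie': cookie})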
I do a lot of web work with Python; I prefer pycurl, which you can get here.
It is very simple to POST data and log in with curl; I've used it across many languages, such as PHP, Python, and C++. Hope this helps.
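For example, a login POST with pycurl might look roughly like this (the URL and form fields are placeholders):

import pycurl
from StringIO import StringIO  # Python 2, as in the rest of this thread

buf = StringIO()
c = pycurl.Curl()
c.setopt(pycurl.URL, 'http://example.com/login')  # placeholder URL
c.setopt(pycurl.POSTFIELDS, 'username=me&password=secret')  # placeholder fields
c.setopt(pycurl.COOKIEFILE, '')  # empty string enables the in-memory cookie engine
c.setopt(pycurl.FOLLOWLOCATION, True)
c.setopt(pycurl.WRITEFUNCTION, buf.write)
c.perform()
c.close()
print(buf.getvalue())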
You can use urllib; this is a good example.
I'm trying to read information that is constantly changing on a website.
For example, say I wanted to read in the artist name that is playing on an online radio site.
I can grab the current artist's name, but when the song changes, the HTML updates itself, and I've already opened the page via:
f = urllib.urlopen("SITE")
so I can't see the updated artist name for the new song.
Can I keep closing and reopening the URL in a while(1) loop to get the updated HTML, or is there a better way to do this? Thanks!
You'll have to periodically re-download the page. Don't do it constantly, because that would be too hard on the server.
This is because HTTP, by nature, is not a streaming protocol. Once you connect to the server, it expects you to send an HTTP request, and it sends an HTTP response back containing the page. If your initial request uses keep-alive (the default as of HTTP/1.1), you can send the same request again over the same connection and get the page up to date.
What would I recommend? Depending on your needs, fetch the page every n seconds and extract the data you need. If the site provides an API, you can possibly capitalize on that. Also, if it's your own site, you could implement comet-style Ajax over HTTP and get a true stream.
Also note that if it's someone else's page, the site may use Ajax via JavaScript to keep itself up to date; this means there are other requests causing the updates, and you may need to dissect the website to figure out which requests you need to make to get the data.
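A minimal polling sketch along those lines ("SITE" is the placeholder URL from the question; the interval is arbitrary):

import time
import urllib

while True:
    html = urllib.urlopen("SITE").read()
    # ... parse the artist name out of html here ...
    time.sleep(30)  # be gentle on the server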
If you use urllib2, you can read the headers when you make the request. If you send a conditional request (e.g. with an If-Modified-Since header) and the server replies with "304 Not Modified", the content hasn't changed.
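A sketch of such a conditional GET ("SITE" again stands in for the real URL, and the sketch assumes the server sends a Last-Modified header):

import urllib2

# First fetch: remember the Last-Modified header for later conditional requests.
response = urllib2.urlopen("SITE")
last_modified = response.info().getheader('Last-Modified')
html = response.read()

# Later: ask the server whether anything changed since then.
request = urllib2.Request("SITE", headers={'If-Modified-Since': last_modified})
try:
    response = urllib2.urlopen(request)
    html = response.read()  # content changed, re-read it
    last_modified = response.info().getheader('Last-Modified')
except urllib2.HTTPError as e:
    if e.code != 304:
        raise
    # 304 Not Modified: keep the previous html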
Yes, that is the correct approach. To get changes from the web, you have to send a new query each time; live AJAX sites do exactly the same internally.
Some sites provide an additional API, including long polling. Look for documentation on the site, or ask their developers whether one exists.