I am not experienced in web development and am trying to use requests.get to fetch some authenticated data. So far the internet appears to tell me to just do it, and I think I am formatting it wrong, but I am unsure how. After some trial and error, I was able to grab my cookie for the website. The following is a made-up version of what I grabbed, with similar formatting.
cookie = "s:abcDEfGHIJ12k34LMNopqRst5UvW-6xy.ZAbCd/eFGhi7j8KlmnoPqrstUvWXYZ90a1BCDE2fGH3"
Then, in Python, I am trying to send a request. The following is a bit more pseudocode for what I am doing:
r = requests.get('https://www.website.com/api/getData', cookies={"connect.sid": cookie})
After all this, the site keeps sending me a 400 error. I am wondering if you guys have any idea whether I am putting in the wrong cookie or the wrong part of the cookie, or whether everything looks right and the site is probably at fault.
I grabbed a Wireshark capture and found that the cookie sent by the browser contained other fields that I had not filled in:
_ga
_gid
___gads
Filled those out with the relevant values, and it works.
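As a minimal sketch of the fix described above, the snippet below builds one request that carries all four cookies. It uses only the standard library so that it runs without extra packages; with requests, the same dictionary can be passed as cookies=... to requests.get. Every value except the connect.sid sample from the question is a placeholder, and the helper names cookie_header and build_request are made up for illustration.

```python
import urllib.request

# Placeholder values: copy the real ones from your browser or capture.
cookies = {
    "connect.sid": "s:abcDEfGHIJ12k34LMNopqRst5UvW-6xy.ZAbCd/eFGhi7j8KlmnoPqrstUvWXYZ90a1BCDE2fGH3",
    "_ga": "GA_VALUE",
    "_gid": "GID_VALUE",
    "___gads": "GADS_VALUE",
}


def cookie_header(cookies):
    """Join name=value pairs the way a browser does in the Cookie header."""
    return "; ".join(f"{name}={value}" for name, value in cookies.items())


def build_request(url, cookies):
    """Build (but do not send) a GET request carrying all the cookies."""
    return urllib.request.Request(url, headers={"Cookie": cookie_header(cookies)})


req = build_request("https://www.website.com/api/getData", cookies)
print(req.get_header("Cookie"))
# To actually send it: urllib.request.urlopen(req).read()
```

Printing the header before sending makes it easy to compare against what the browser sent in the capture.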
My question is so superficial, it will probably be deleted right away. But I just can't get this to work. I'm sorry.
I am trying to work with an api (https://api.immobilienscout24.de/). But I can't get anything to work.
There are two ways to authenticate. I figure that for my needs (searching for and viewing listings) two-legged OAuth should be sufficient. The way I understand it, I then simply need to supply my username and user secret with each request.
But I can't just give those in the request-headers or anything. I somehow have to sign the request with my credentials. But I don't understand what that means and how to do that.
There is even a Postman collection, but it must be faulty. The request doesn't use any kind of authentication. It's just a request without any headers or parameters, which of course does not work.
All the github projects, that use python and work with this website are just webcrawlers. Thus I couldn't find anything helpful there either.
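To make "signing the request" concrete, here is a rough sketch of how an OAuth 1.0 signature is computed with HMAC-SHA1, following the general scheme in RFC 5849. Whether this particular API expects HMAC-SHA1, and which parameters it requires, is an assumption to check against its documentation. The URL, key, secret, and the helper names pct and sign_request are all hypothetical.

```python
import base64
import hashlib
import hmac
import secrets
import time
from urllib.parse import quote


def pct(value):
    """Percent-encode as OAuth 1.0 requires: only unreserved characters stay literal."""
    return quote(value, safe="-._~")


def sign_request(method, url, params, consumer_secret, token_secret=""):
    """Return the base64 HMAC-SHA1 signature for a request (sketch, not a full client)."""
    # 1. Encode, sort, and join all parameters as name=value pairs.
    pairs = sorted((pct(k), pct(v)) for k, v in params.items())
    normalized = "&".join(f"{k}={v}" for k, v in pairs)
    # 2. Signature base string: METHOD & encoded URL & encoded parameter string.
    base_string = "&".join([method.upper(), pct(url), pct(normalized)])
    # 3. Key: consumer secret & token secret (empty in the two-legged case).
    key = f"{pct(consumer_secret)}&{pct(token_secret)}"
    digest = hmac.new(key.encode(), base_string.encode(), hashlib.sha1).digest()
    return base64.b64encode(digest).decode()


# Hypothetical credentials and endpoint, for illustration only.
params = {
    "oauth_consumer_key": "my-key",
    "oauth_nonce": secrets.token_hex(8),
    "oauth_signature_method": "HMAC-SHA1",
    "oauth_timestamp": str(int(time.time())),
    "oauth_version": "1.0",
}
signature = sign_request("GET", "https://api.example.invalid/search", params, "my-secret")
print(signature)
```

The resulting value would be sent as oauth_signature alongside the other oauth_ parameters, typically in the Authorization header. In practice an existing OAuth library is usually a safer choice than hand-rolled signing.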
I am currently pulling data from a public data series at https://www3.bcb.gov.br/expectativas/publico/en/serieestatisticas
This is a public page that uses Apache Wicket, I believe.
I am usually OK with scraping, whether GET or POST. Here my colleagues and I are stuck. Can anyone help us understand what URL needs to be used to actually make the request? Here's what I've got so far:
The form with inputs:
The Fiddler capture manually executed:
Text View:
form19_hf_0=&indicador=0&calculo=0&linhaPeriodicidade%3Aperiodicidade=0&tfDataInicial=11%2F10%2F2015&tfDataFinal=11%2F24%2F2015&divPeriodoRefereEstatisticas%3AgrupoAnoReferencia%3AanoReferenciaInicial=16&divPeriodoRefereEstatisticas%3AgrupoAnoReferencia%3AanoReferenciaFinal=16&btnCSV=Generate+CSV
Form data I'm passing in the request:
Summary:
I need some help. I can't seem to get the POST working correctly: it takes me to a different page, and I'm not sure how to work through this one.
NB: I'm trying to grab back a CSV.
The library I'm primarily using is Requests (I was going to use lxml, but I don't think it's going to be applicable here).
I've been trying to figure out the right form with Postman and Fiddler to understand what the request needs to be.
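As a small aid for comparing captures, the form body from the Fiddler text view above can be decoded into readable field names and values with the standard library. This is only a sketch for inspecting the payload; it does not address which URL the form must be posted to.

```python
from urllib.parse import parse_qsl, urlencode

# The body exactly as captured in the question, split only for readability.
captured_body = (
    "form19_hf_0=&indicador=0&calculo=0"
    "&linhaPeriodicidade%3Aperiodicidade=0"
    "&tfDataInicial=11%2F10%2F2015&tfDataFinal=11%2F24%2F2015"
    "&divPeriodoRefereEstatisticas%3AgrupoAnoReferencia%3AanoReferenciaInicial=16"
    "&divPeriodoRefereEstatisticas%3AgrupoAnoReferencia%3AanoReferenciaFinal=16"
    "&btnCSV=Generate+CSV"
)

# keep_blank_values=True preserves the empty hidden field form19_hf_0.
fields = dict(parse_qsl(captured_body, keep_blank_values=True))

for name, value in fields.items():
    print(f"{name!r}: {value!r}")

# Re-encoding the decoded fields gives the body a client would send.
reencoded = urlencode(fields)
```

Decoding shows that the %3A sequences are colons inside Wicket component paths and that the dates are plain MM/DD/YYYY strings.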
So,
The solution to this was somewhat indirect. We were not able to do a straight POST because the page incremented the actual POST URL in a way that was generally impossible to predict.
The solution that we used was installing Selenium web driver and using that to simulate the dropdown visible values and button clicks.
This worked out very cleanly.
Thanks and HTH anyone else who might have a similar problem.
I've been trying to get my Python code to handle a POST. I have tried using the Postman plugin to test the POST method, and I get a 405 method error. I am planning to have the user post the information and have it displayed.
Currently, if I press submit, I get an error loading the page. Changing the form to GET results in the submit button working and returning to the previous page. If I change the handler to POST, the screen instantly displays '405 Method Not Allowed'. I've looked through the Google App Engine logs and there are no errors. Can someone help me with what I have done wrong and advise me on how to get the POST method functioning?
Thanks for the time.
You're getting '405 method not allowed' because the POST is going to the same url that served up the page, but the handler for that path (MainPage) does not have a post method.
That's the same diagnosis that you were given when you asked this question two hours earlier under a different user id.
Stepping back further to address the "what have I done wrong" question: It seems to me that you've gotten quite far along before discovering that what you have doesn't work. So far along that the example is cluttered with irrelevant code that's unrelated to the failure. That makes it harder for you to figure this out for yourself, and it makes it more difficult for people here to help you.
In this situation, you help yourself (and help others help you) by seeing how much code you can eliminate while still demonstrating the failure.
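The 405 diagnosis above can be illustrated with a stripped-down stand-in for a web framework's dispatcher. This is not App Engine or webapp2 code; the dispatch function and the MainPageWithPost class are invented for the sketch. It only shows why a handler class without a post method answers a POST with 405.

```python
class MainPage:
    """Serves the form; defines get() only, as described in the answer."""

    def get(self):
        return 200, '<form method="post"><input name="msg"><input type="submit"></form>'


class MainPageWithPost(MainPage):
    """Same page, plus a post() method that accepts the submitted form."""

    def post(self, form):
        return 200, "You posted: " + form.get("msg", "")


def dispatch(handler, method, *args):
    """Hypothetical dispatcher: look up a method named after the HTTP verb."""
    func = getattr(handler, method.lower(), None)
    if func is None:
        # No matching method on the handler for this verb.
        return 405, "Method Not Allowed"
    return func(*args)


print(dispatch(MainPage(), "POST", {"msg": "hi"}))
print(dispatch(MainPageWithPost(), "POST", {"msg": "hi"}))
```

The first call returns the 405 because MainPage has no post attribute. The second succeeds once the method exists on the handler mapped to that URL.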
I'm trying to read in info that is constantly changing from a website.
For example, say I wanted to read in the artist name that is playing on an online radio site.
I can grab the current artist's name but when the song changes, the HTML updates itself and I've already opened the file via:
f = urllib.urlopen("SITE")
So I can't see the updated artist name for the new song.
Can I keep closing and opening the URL in a while(1) loop to get the updated HTML code or is there a better way to do this? Thanks!
You'll have to periodically re-download the website. Don't do it constantly because that will be too hard on the server.
This is because HTTP, by nature, is not a streaming protocol. Once you connect to the server, it expects you to throw an HTTP request at it, then it will throw an HTTP response back at you containing the page. If your connection is keep-alive (the default as of HTTP/1.1), you can send the same request again over that connection and get the up-to-date page.
What I'd recommend? Depending on your needs, get the page every n seconds, get the data you need. If the site provides an API, you can possibly capitalize on that. Also, if it's your own site, you might be able to implement comet-style Ajax over HTTP and get a true stream.
Also note if it's someone else's page, it's possible the site uses Ajax via Javascript to make it up to date; this means there's other requests causing the update and you may need to dissect the website to figure out what requests you need to make to get the data.
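The "get the page every n seconds" advice above might look roughly like this. The function poll_for_changes and its parameters are made up for the sketch. The fetch callable stands for whatever downloads the page and extracts the artist name, and the demo substitutes a fake one so that it runs without a network.

```python
import time


def poll_for_changes(fetch, interval_seconds, max_polls, sleep=time.sleep):
    """Call fetch() repeatedly and yield each value that differs from the previous one."""
    last = None
    for attempt in range(max_polls):
        if attempt:
            # Wait between polls so the server is not hammered.
            sleep(interval_seconds)
        current = fetch()
        if current != last:
            yield current
            last = current


# Demo with a fake fetch and a no-op sleep instead of a real site.
fake_pages = iter(["Artist A", "Artist A", "Artist B"])
changes = list(
    poll_for_changes(lambda: next(fake_pages), 30, 3, sleep=lambda seconds: None)
)
print(changes)  # ['Artist A', 'Artist B']
```

In real use, fetch would re-open the URL on every call, and the interval would be chosen to match how often the page actually changes.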
If you use urllib2 you can read the status and headers when you make the request. If you send a conditional request (with If-Modified-Since or If-None-Match) and the server responds with "304 Not Modified", then the content hasn't changed.
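A sketch of that check using urllib.request, the Python 3 successor to urllib2: a 304 is only returned when the request carries a validator from an earlier response, and urllib reports it as an HTTPError. The function names are made up, and whether a given site sends ETag or Last-Modified headers at all is an assumption.

```python
import urllib.error
import urllib.request


def conditional_request(url, etag=None, last_modified=None):
    """Build a GET request that asks the server to skip the body if nothing changed."""
    headers = {}
    if etag:
        headers["If-None-Match"] = etag
    if last_modified:
        headers["If-Modified-Since"] = last_modified
    return urllib.request.Request(url, headers=headers)


def fetch_if_changed(url, etag=None, last_modified=None):
    """Return (body, etag, last_modified), or None when the server answers 304."""
    req = conditional_request(url, etag, last_modified)
    try:
        with urllib.request.urlopen(req) as resp:
            return resp.read(), resp.headers.get("ETag"), resp.headers.get("Last-Modified")
    except urllib.error.HTTPError as err:
        if err.code == 304:
            return None  # Not Modified: keep using the copy already held.
        raise
```

The validators returned by one call would be passed back into the next, so an unchanged page costs only a short header exchange.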
Yes, this is the correct approach. To see changes on the web, you have to send a new request each time. Live AJAX sites do exactly the same thing internally.
Some sites provide an additional API, which may include long polling. Look for documentation on the site or ask its developers whether one exists.
I asked a question about this a month ago; it's here: "post" method to communicate directly with a server.
I still don't understand why I sometimes get a 404 error and sometimes everything works fine. I've tried the code with several different WordPress blogs. Using Firefox or IE, you can post a comment without any problem on any WordPress blog, but using Python and the "post" method to communicate directly with the server, I get a 404 with several blogs. I've tried spoofing the headers and adding cookies in the code, but the result remains the same. It's been bugging me for quite a while... Does anybody know the reason? Or what code should I add to make the program work just like a browser such as Firefox or IE? Hopefully you guys can help me out!
You should use something like mechanize.
The blog may have some spam protection against this kind of posting. (A programmatic post made without first accessing and reading the page can easily be detected using JavaScript protection.)
But if it's the case, I'm surprised you receive a 404...
Anyway, if you wanna simulate a real browser, the best way is to use a real browser remote controlled by python.
Check out WebDriver (http://seleniumhq.org/docs/09_webdriver.html). It has a Python implementation and can drive HtmlUnit, Chrome, IE and Firefox browsers.