I have this little website where I want to fill in a form with the requests library. The problem is that I can't get to the next page after filling in the form data and hitting the button (pressing Enter does not work).
The important thing is that I can't do it via a clicking bot of some kind. This needs to be done so I can run it without graphics.
info = {'name': 'JohnJohn',
        'message': 'XXX',
        'sign': 'XXX',
        'step': '1'}
The first three entries (name, message, sign) are the text areas, and step is, I think, the button.
import requests

r = requests.get(url)              # url is the form page's address
r = requests.post(url, data=info)
print(r.text)
The form data looks like this when I send a request via Chrome manually:
name:JohnJohn
message:XXX
sign:XXX
step:1
The button element looks like this:
<td colspan="2" style="text-align: center;">
<input name="step" type="hidden" value="1">
<button id="button" type="button" onclick="myClick();"
style="background-color: #ef4023; width: 80px; font-face: times; font-size: 14pt;">
Wyślij
</button>
</td>
The next page, if I do this manually, has the same address.
As you can see from the snippet you posted, clicking the button triggers some JavaScript code, namely a function called myClick().
It is not straightforward to click on this thing using Python's requests library. You might have more luck trying to find out what happens inside myClick(). My guess would be that at some point a POST request is made to an HTTP endpoint. If you can figure this out, you can translate it into your Python code.
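Here is a minimal sketch of that translation, assuming myClick() simply POSTs the same form fields back to the page's own URL; the URL below is a placeholder you would confirm in the browser's developer tools:

import requests

url = 'http://example.com/form-page'    # placeholder: the real form page

info = {'name': 'JohnJohn',
        'message': 'XXX',
        'sign': 'XXX',
        'step': '1'}

# A Session keeps cookies from the initial GET, which many forms require.
with requests.Session() as session:
    session.get(url)                    # load the page first (may set cookies)
    r = session.post(url, data=info)    # forge the request myClick() would make
    print(r.status_code)
    print(r.text)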
If that does not work, another option would be to use something like Selenium/PhantomJS, which gives you a real, headless, scriptable browser. Using such a tool, you can actually have it fill out forms and click buttons. You can have a look at this SO answer, which shows you how to use Selenium+PhantomJS from Python.
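For illustration, a minimal sketch of the browser route with Selenium in headless mode (PhantomJS is deprecated nowadays, so headless Firefox stands in for it; the URL is a placeholder, the element names come from your snippet):

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.firefox.options import Options

options = Options()
options.add_argument('--headless')               # no graphics needed

driver = webdriver.Firefox(options=options)
try:
    driver.get('http://example.com/form-page')   # placeholder URL
    driver.find_element(By.NAME, 'name').send_keys('JohnJohn')
    driver.find_element(By.NAME, 'message').send_keys('XXX')
    driver.find_element(By.NAME, 'sign').send_keys('XXX')
    driver.find_element(By.ID, 'button').click() # runs myClick() for you
    print(driver.current_url)
finally:
    driver.quit()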
Please make sure not to abuse such methods by spamming forums or [insert illegal or otherwise abusive activity here].
In such a situation, when you need to forge a scripted button's request, it may be easier not to guess the logic of the JavaScript but instead to perform a physical click and watch Chrome DevTools' network sniffer. It shows you the plain request that was made, which in turn can easily be forged in Python.
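Chrome's Network tab even offers "Copy as cURL" on any captured request, which you can translate field by field. A hedged sketch of such a replay (every value below is a placeholder to be read off the captured request):

import requests

url = 'http://example.com/form-page'      # placeholder: copy from DevTools
headers = {
    'User-Agent': 'Mozilla/5.0',
    'X-Requested-With': 'XMLHttpRequest', # often added by JS-made requests
}
data = {'name': 'JohnJohn', 'message': 'XXX', 'sign': 'XXX', 'step': '1'}

r = requests.post(url, data=data, headers=headers)
print(r.status_code, r.url)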
Related
I'm trying to write a small Python script that will take a list of URLs (mostly pages with just a search bar and some other elements), "write" something in the search bar, and "press" Enter.
My goal is to get to the next page, where all the search results are, and look at the new URL.
The sites are different, so I can't just hard-code the query parameter, since I don't know it and it varies from site to site.
I was thinking about searching for the "input" part of the page (since the search bar is supposed to be the only input there), sending something to it, and then waiting for the new URL.
Is that possible? Is there a smarter way?
I'm trying to avoid using anything other than Python for now (Selenium, etc.).
I searched every possible answer here and on the web, but nothing worked, so I was thinking about using the input element somehow...
Without Selenium and similar software, you'll have to understand what is actually happening when you click such a button.
I'll take an example from a famous site (hint: if you read this, you'll know which site I mean). In the HTML source I can see this (truncated) piece of code:
<form id="search" role="search" action=/search method="get" class="grid--cell fl-grow1 searchbar px12 js-searchbar " autocomplete="off">
<div class="ps-relative">
<input name="q"
type="text"
...
What this means is that when you submit the form, the browser hits the URL /search (which here means https://stackoverflow.com/search) with a GET request, wrapping all of the form's fields into the URL after a ?. If I searched for the term python, it would lead to http://stackoverflow.com/search?q=python.
Note that, depending on their content, the parameters need to be URL-encoded.
If the form contains more input fields, you'll have to wrap them too, separating them with & signs, like this: param1=value1&param2=value2&...
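With requests you don't need to build that query string by hand; the params argument does the URL-encoding for you (the action URL and the field name q are taken from the form above):

import requests

r = requests.get('https://stackoverflow.com/search', params={'q': 'python'})
print(r.url)    # https://stackoverflow.com/search?q=python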
To sum up, searching only for inputs won't be sufficient; you'll have to parse the forms.
Not knowing more about your data, I cannot elaborate, but I think you might be able to do something with that.
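As a rough sketch of that parsing step, here is one way to do it with requests and BeautifulSoup; the assumption that the first form on the page is the search form, and that its first text input takes the search term, is a heuristic that won't hold on every site:

import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin

def search(page_url, term):
    """Find the page's first form, put term in its first text input, submit."""
    html = requests.get(page_url).text
    form = BeautifulSoup(html, 'html.parser').find('form')

    action = urljoin(page_url, form.get('action', ''))
    method = form.get('method', 'get').lower()

    # Keep hidden/pre-filled fields; drop the term into the first text input.
    data = {}
    for inp in form.find_all('input'):
        name = inp.get('name')
        if not name:
            continue
        if inp.get('type', 'text') == 'text' and term is not None:
            data[name] = term
            term = None
        else:
            data[name] = inp.get('value', '')

    if method == 'post':
        r = requests.post(action, data=data)
    else:
        r = requests.get(action, params=data)
    return r.url    # the "new URL" after the search

print(search('https://stackoverflow.com/', 'python'))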
So there is a page with the following contents:
<input type="text" name="input1">
<input type="text" name="input2">
<button onclick="dosomething()" id="check">Check</button>
This page gets loaded in with JavaScript, so you need to wait until it's fully loaded. After that, the text fields have to be filled in and the submit button clicked. If it succeeds, it sets a cookie and redirects you. If it doesn't succeed, nothing happens.
How can I make this work in Python? I couldn't really find an answer that would work for me. This Python code will be run on Windows.
Most likely you can just send a direct request with the form's fields. Based on the form's method it will be a GET or POST request. Afterwards you can catch and save the cookies and follow the redirect.
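A minimal sketch, assuming the JavaScript ends up POSTing input1 and input2 back to the same page (the endpoint is a placeholder to verify in the Network tab):

import requests

url = 'http://example.com/check'   # placeholder: the form's real endpoint

with requests.Session() as session:              # the Session stores cookies
    r = session.post(url, data={'input1': 'foo', 'input2': 'bar'})
    # requests follows redirects by default, so r.url is the redirected page
    print(r.url)
    print(session.cookies.get_dict())            # the cookie set on success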
If that doesn't work, you could also try Selenium to emulate the user's behaviour.
If you go to the site, you'll notice that there is an age confirmation window, which I want to bypass with Scrapy. I messed that up and had to move on to Selenium WebDriver, and now I'm using
driver.find_element_by_xpath('xpath').click()
to bypass that age confirmation window. Honestly, I don't want to go with Selenium WebDriver because of how slow it is. Is there any way to bypass that window?
I searched a lot on Stack Overflow and Google but didn't get any answer that resolves my problem. If you have any link or idea of resolving it with Scrapy, that'd be appreciated. A single helpful comment will be up-voted!
To expand on Chillie's answer.
The age verification is irrelevant here. The data you are looking for is loaded via an AJAX request:
See the related question Can scrapy be used to scrape dynamic content from websites that are using AJAX? to understand how such requests work.
You need to figure out how the https://ns5bwtai8m-dsn.algolia.net/1/indexes/*/queries?x-algolia-agent=Algolia%20for%20vanilla%20JavaScript%203.19.1&x-algolia-application-id=NS5BWTAI8M&x-algolia-api-key=e676b05f3844d3adf54a29732af6e43c URL works and how you can retrieve it in Scrapy.
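As a hedged sketch, a spider could hit that endpoint directly; the JSON body below follows Algolia's usual multi-query format, but the index name and params are assumptions you would copy from the captured request:

import json
import scrapy

ALGOLIA_URL = ('https://ns5bwtai8m-dsn.algolia.net/1/indexes/*/queries'
               '?x-algolia-agent=Algolia%20for%20vanilla%20JavaScript%203.19.1'
               '&x-algolia-application-id=NS5BWTAI8M'
               '&x-algolia-api-key=e676b05f3844d3adf54a29732af6e43c')

class ProductsSpider(scrapy.Spider):
    name = 'products'

    def start_requests(self):
        # Hypothetical payload: copy the real one from the Network tab.
        payload = {'requests': [{'indexName': 'products', 'params': 'query='}]}
        yield scrapy.Request(ALGOLIA_URL, method='POST',
                             body=json.dumps(payload),
                             headers={'Content-Type': 'application/json'},
                             callback=self.parse_products)

    def parse_products(self, response):
        data = json.loads(response.text)
        for hit in data['results'][0]['hits']:
            yield hit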
But the age verification "window" is just a div that gets hidden when you press the button, not a real separate window:
<div class="age-check-modal" id="age-check-modal">
You can use the browser's Network tab in developer tools to see that no new info is uploaded or sent when you press the button. So everything is already loaded when you request a page. The "popup" is not even a popup, just an element whose display is changed to none when you click the button.
So Scrapy doesn't really care what's meant to be displayed, as long as all the HTML is loaded. If the elements are loaded, they are accessible. Or have you seen some information being unavailable without pressing the button?
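You can check this quickly in scrapy shell: a hidden element is selected just like a visible one (the class name comes from the snippet above):

# inside `scrapy shell <url>`
response.css('div.age-check-modal').get()             # the "popup" is right there
response.css('div.age-check-modal *::text').getall()  # and so is its text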
You should inspect the HTML code more to see what each website does; this might make your scraping tasks easier.
Edit: After inspecting the original HTML you can see the following:
<div class="products-list">
  <div class="products-container-block">
    <div class="products-container">
      <div id="hits" class="row">
      </div>
    </div>
  </div>
</div>
You can also see a lot of JS script tags.
The browser element inspector, however, shows that #hits div populated with content, including ::before pseudo-elements. The ::before part gives away that this was manipulated by JS, as you cannot do this with simple CSS. See Granitosaurus' answer for details on this.
What this means is that you need to somehow execute the arbitrary JS code on those pages. So you either need a solution with Scrapy, or just use Selenium, as many do, and as you already have.
Is it possible to call a JavaScript function on a website that I'm web scraping and save the result of the function?
I'm using Requests to establish a connection and save certain pages that I need, and BeautifulSoup to make them readable and access certain parts.
There is one part that I'm not sure how to call, or even if it's possible:
<tr class=TRDark>
<td width=100% colspan=3>
<a href="" onclick="OpenPayPlan('payplan.asp?app=******');return false;">
Betalingsplan
</a>
</td>
</tr>
This function will open a new window and calculate some data that I need. Is this possible to do with Python?
I cannot use Selenium or similar programs for this. This must be executed in the terminal and only the terminal.
Python does not include a JavaScript interpreter, so you may need to find one with Python bindings. When you've found one that fits your needs, you can read its documentation to see how the interpreter works. An example could be PyV8.
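As a hedged illustration with js2py (another pure-Python JavaScript interpreter), you could extract the function's source from a scraped <script> tag and evaluate it; this only works if the function doesn't touch browser-only objects like window or document, and the function below is purely hypothetical:

import js2py

# Hypothetical: script text scraped from the page, containing the function.
js_source = """
function calcPlan(amount, months) {
    return Math.round(amount / months * 100) / 100;
}
"""

calc_plan = js2py.eval_js(js_source + '\ncalcPlan')   # returns a callable
print(calc_plan(10000, 12))                           # 833.33

In this particular case, though, the onclick suggests that OpenPayPlan() just opens payplan.asp?app=..., so it may be simpler to fetch that URL directly with Requests and parse the result.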
Almost all examples online for POST and GET requests in Python use the same URL (https://api.github.com/events or similar). I'd like to have a real, concrete example to understand how it works.
My aim is to download stock exchange data from this website:
https://www.abcbourse.com/download/historiques.aspx
By looking at the HTML code I found the "download button":
<input type="submit" name="ctl00$BodyABC$Button1" value="Télécharger" id="ctl00_BodyABC_Button1" />
and one of the checkboxes: <input id="ctl00_BodyABC_xcac40p" type="checkbox" name="ctl00$BodyABC$xcac40p" />
I have no idea how to make a script that says 'check the box ***, press the download button'.
Any help would be appreciated.
It's not so simple to explain in detail what I would do here, but I can show you a way forward: Python's Selenium library makes it possible to build bots that perform actions on webpages. Through a find_element call you can locate the checkbox and the button and click them. I did a recent job using this library, and I know it is well suited to solving your problem.
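For instance, a minimal sketch using the ids visible in the snippets above (headless mode, so no graphics; where the downloaded file lands depends on the browser's download settings, which are left out here):

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.firefox.options import Options

options = Options()
options.add_argument('--headless')

driver = webdriver.Firefox(options=options)
try:
    driver.get('https://www.abcbourse.com/download/historiques.aspx')
    # ids taken from the question's HTML snippets
    driver.find_element(By.ID, 'ctl00_BodyABC_xcac40p').click()   # tick the box
    driver.find_element(By.ID, 'ctl00_BodyABC_Button1').click()   # download
finally:
    driver.quit()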