I'm trying to build a web scraper in Python. The website has a login form, though, and I've been trying to log in for a few days with no success. The code looks like this:
import requests
from lxml import html

# login_url points at the site's login page (see the edit below)
session_requests = requests.Session()
r = session_requests.get(login_url, headers=dict(referer=login_url))
print(r.content)

# pull the CSRF token out of the login form (XPath attribute syntax uses @)
tree = html.fromstring(r.text)
authenticity_token = list(set(tree.xpath('//input[@name="_csrf_token"]/@value')))[0]

payload = {"_csrf_token": authenticity_token, "_username": "-username-", "_password": "-password-"}
r = session_requests.post(login_url, data=payload, headers=dict(referer=login_url))
print(r.content)
You can see I print out r.content both before and after posting to the login page. In theory the two outputs should differ (the second one should be the content of the page you land on after logging in), but unfortunately I get exactly the same output both times.
(The question included a screenshot of the login form showing the fields it requires.)
Also, I know for sure that the _csrf_token is correct because I have tested it a few times, so no doubts about that part.
Another thing that might be useful: I don't think the headers are strictly necessary, because the output is exactly the same with or without them; I include them just in case. Thanks in advance.
Edit: the URL is https://nuvola.madisoft.it/login
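One way to check whether the POST actually logged me in, beyond eyeballing the raw body (the markers below are just guesses about what a logged-in page would contain):

r = session_requests.post(login_url, data=payload, headers=dict(referer=login_url))
print(r.status_code)               # often 200 even when the login failed
print(r.url)                       # did the server redirect away from /login?
print("logout" in r.text.lower())  # hypothetical marker for a logged-in page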
Here is my code:
response = requests.get('URL HERE', headers=header, params=param, cookies=cookie)
print(response.content)
I also tried this:
print(response.text)
But both return this:
'<script>window.location="URL HERE";</script>'
All I want is to get the html of the page.
Any ideas?
EDIT:
I don't know exactly how I fixed it, but I grabbed the header and the cookie from the website again, plugged them in, and it worked. I think it might be because I use Firefox and it had just updated, so maybe the header changed.
That is the HTML of the whole page. Unfortunately, the page seems to require JavaScript, which your approach does not support. I don't know much about JS, but it seems to be a function that redirects you to a different page.
You could use an approach based on Selenium if the website only works when JS is enabled.
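A minimal Selenium sketch, assuming Chrome and the selenium package are installed (the URL is a placeholder):

from selenium import webdriver

driver = webdriver.Chrome()
driver.get("URL HERE")
print(driver.page_source)  # the HTML after JavaScript has run
driver.quit()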
The website redirected me when I went to the link, so when you get the URL, make sure redirects are allowed (requests already follows them by default for GET, but you can make it explicit).
Do this:
response = requests.get('URL HERE', headers=header, params=param, cookies=cookie, allow_redirects=True)
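You can then inspect the redirect chain on the response to see what happened:

print(response.history)  # any intermediate redirect responses
print(response.url)      # the final URL after redirects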
import requests
url = "https://stackoverflow.com/"
payload = {"q": "python"}
s = requests.Session()
r = s.post(url, data=payload)
print(r.text)
I wish to use a POST request in order to obtain the subsequent webpage. However, the above code prints the source code of the home page and not the next page. Can someone tell me what I should do to obtain the source code of the next page? I have searched through many related questions on Stack Overflow and haven't found a solution.
Thanks in advance.
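One thing to check: the Stack Overflow search box submits a GET request to /search with the query in the URL, not a POST to the home page, which would explain why the home page comes back. A sketch assuming that endpoint:

import requests

r = requests.get("https://stackoverflow.com/search", params={"q": "python"})
print(r.text)  # the search results page rather than the home page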
I am attempting to scrape a website using the following code:
import re
import requests

def get_csrf(page):
    matchme = r'name="csrfToken" value="(.*)" /'
    csrf = re.search(matchme, str(page))
    csrf = csrf.group(1)
    return csrf

def login():
    login_url = 'https://www.edline.net/InterstitialLogin.page'
    with requests.Session() as s:
        login_page = s.get(login_url)
        csrf = get_csrf(login_page.text)
        username = 'USER'
        password = 'PASS'
        login = {'screenName': username,
                 'kclq': password,
                 'csrfToken': csrf,
                 'TCNK': 'authenticationEntryComponent',
                 'submitEvent': '1',
                 'enterClicked': 'true',
                 'ajaxSupported': 'yes'}
        page = s.post(login_url, data=login)
        r = s.get("https://www.edline.net/UserDocList.page?")
        print(r.text)

login()
This logs in to https://www.edline.net/InterstitialLogin.page successfully, but the problem comes when I then try to do
r = s.get("https://www.edline.net/UserDocList.page?")
print(r.text)
It doesn't print the expected page; instead it throws an error. Upon further testing I discovered that it throws this error even if you try to go to the page directly from a browser. When I investigated the page source, I found that the button used to link to the page I'm trying to scrape is a JavaScript link labelled "Private Reports".
So essentially I am looking for a way to trigger that JavaScript code from Python in order to scrape the resulting page.
It is impossible to answer this question without more context than this single link.
However, in the case of JavaScript-driven content generation, the first thing you want to check is the requests your web page makes when you click on that link.
To do this, take a look at the network panel in your browser's developer console. Record the requests being made, looking especially for XHR requests. Then you can try to replicate them, e.g. with the requests library.
content = requests.get('xhr-url')
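For example, a sketch of replaying a recorded XHR inside the question's logged-in session s (the endpoint is a placeholder; copy the real one from the network panel):

xhr = s.get("https://www.edline.net/SOME_XHR_ENDPOINT",  # placeholder URL
            headers={"X-Requested-With": "XMLHttpRequest"})
print(xhr.text)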
I am trying to use the requests library in Python to post the text content of a text file to a website, submit that text for analysis, and pull the results back into Python. I have read through a number of answers here and on other websites, but have not yet figured out how to adapt the code to a new website.
I'm familiar with Beautiful Soup, so pulling in webpage content and removing HTML isn't an issue; it's the submitting of the data that I don't understand.
My code currently is:
import requests

fileName = "texttoAnalyze.txt"
fileHandle = open(fileName, 'r')
url_text = fileHandle.read()

url = "http://www.webpagefx.com/tools/read-able/"
payload = {'value': url_text}
r = requests.post(url, data=payload)
print(r.text)
This code comes back with the HTML of the website, but it hasn't recognized that I'm trying to submit a form.
Any help is appreciated. Thanks so much.
You need to send the same request the website itself sends; usually you can capture these with web debugging tools (like the Chrome/Firefox developer tools).
In this case the url the request is being sent to is: http://www.webpagefx.com/tools/read-able/check.php
With the following params: tab=Test+by+Direct+Link&directInput=SOME_RANDOM_TEXT
So your code should look like this:
url = "http://www.webpagefx.com/tools/read-able/check.php"
payload = {'directInput':url_text, 'tab': 'Test by Direct Link'}
r = requests.post(url, data=payload)
print(r.text)
Good luck!
There are two post parameters, tab and directInput:
import requests

post = "http://www.webpagefx.com/tools/read-able/check.php"

with open("in.txt") as f:
    data = {"tab": "Test by Direct Link",
            "directInput": f.read()}
    r = requests.post(post, data=data)
    print(r.content)
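If you only want the readable text of the result, Beautiful Soup can strip the markup; a sketch, assuming the response is an HTML fragment:

from bs4 import BeautifulSoup

soup = BeautifulSoup(r.content, "html.parser")
print(soup.get_text(" ", strip=True))  # drop the tags, keep the text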
Using mechanize (and Python) I can go to a website, log in, find a form, fill in some answers, and submit that form. However, I don't know how I can open the "response" page - that is, the page that automatically loads once you've submitted the form.
Here's the Python code:
br.select_form(name="simTrade")
br.form["symbolTextbox"] = "KO"
br.form["quantityTextbox"] = "10"
br.form["previewOrderButton"]  # this only looks the control up; it doesn't click it
preview = br.submit()
print(preview.read())
With the above code, I can see what the response page holds. But I want to actually open that page and interact with it. How can I do that with mechanize? Thank you.
EDIT: So I answered my own question soon after posting this. Here's the code:
br.select_form(name="simTrade")
br.form["symbolTextbox"] = symbol
br.form["transactionTypeDropDown"] = [order_type]
br.form["quantityTextbox"] = amount
br.form["previewOrderButton"]
no_url = br.submit()
final = no_url.geturl()
x = br.open(final)
print(x.read())
To get the HTML source code of the response page (the page that loads when you submit a form), I simply had to get the URL of br.submit(), and there's a built-in mechanize function for that: geturl().
The OP's answer is a bit convoluted and resulted in an AttributeError for me. This worked better:
br.submit()
base_url = br.geturl()
print(base_url)
Getting the URL of the new page and opening it isn't necessary. Once the form has been submitted the new page opens automatically and you can start interacting with it using the same mechanize browser object.
Using the original code from your question, if you wanted to submit the form and store all links on the new page in a list:
br.select_form(name="simTrade")
br.form["symbolTextbox"] = "KO"
br.form["quantityTextbox"] = "10"
br.form["previewOrderButton"]
br.submit()
# Here we store all links on the new page
# but we can use br do any necessary processing.
links = [link for link in br.links()]
# This will take us back to the original page with the "simTrade" form.
br.back()