Python scraper with POST request doesn't bring any results - python

I've written a script to scrape the "First Name" from a webpage using a POST request in Python. However, running my script I get neither any results nor any error. It seems to me that I'm doing things the right way, so I hope somebody can point me in the right direction and show me what I'm missing here:
import requests
from lxml import html
payload = {'ScriptManager1':'UpdatePanel1|btnProceed','__EVENTTARGET':'','__EVENTARGUMENT':'','__VIEWSTATE':'/wEPDwULLTE2NzQxNDczNTcPZBYCAgQPZBYCAgMPZBYCZg9kFgQCAQ9kFgQCAQ9kFgICAQ9kFg4CBQ8QZGQWAGQCFQ8QZGQWAWZkAiEPEGRkFgFmZAI3DxBkZBYAZAI7DxBkZBYAZAJvDw9kFgIeBXZhbHVlZWQCew8PZBYCHwBlZAICD2QWAgIBD2QWAgIBD2QWAmYPZBYSAgcPEGRkFgBkAi0PEGRkFgFmZAJFDxYCHgdFbmREYXRlBmYcik5ut9RIZAJNDxBkZBYBZmQCZQ8WAh8BBmYcik5ut9RIZAJ7DxBkZBYAZAKBAQ8QZGQWAGQCyAEPD2QWAh8AZWQC1AEPD2QWAh8AZWQCBw9kFgICAw88KwARAgEQFgAWABYADBQrAABkGAMFHl9fQ29udHJvbHNSZXF1aXJlUG9zdEJhY2tLZXlfXxYDBQxyZG9QZXJtYW5lbnQFDHJkb1Byb3Zpc2lvbgUMcmRvUHJvdmlzaW9uBQlHcmlkVmlldzEPZ2QFCk11bHRpVmlldzEPD2RmZFSgnfO4lYFs09JWdr2kB8ZwSO3808nJf+616Y8YJ3UF','__VIEWSTATEGENERATOR':'5629D98D','__EVENTVALIDATION':'/wEdAAekSVFWk+dy9X9XnzfYeR4NT1Z25jJdJ6rNAjXmHpbD+Q8ekkJ2enuXq0jY/CeUlod/njRPjRiZUniYWoSlesZ/+0XiOc/vwjI5jxqS0D5ang1Wtvp3KMocxPzInS3xjMbN+DvxnwFeFeJ9MIBWR693SSiBqUlIhPoALKQ2G08CpjEhrdvaa2JXqLbLG45vzvU=','r1':'rdoPermanent','txtRegistNo':'SRO0394294','__ASYNCPOST':'true','btnProceed':'Proceed'}
headers = {'User-Agent':'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.81 Safari/537.36'}
response = requests.post("https://www.icaionlineregistration.org/StudentRegistrationForCaNo.aspx", params=payload, headers=headers).text
tree = html.fromstring(response)
item = tree.xpath('//div[@class="div_input_place"]/input[@id="txt_name"]/@value')
print(item)
The URL is given in my script, and the reg number to get the "First Name" is "SRO0394294". The xpath I've used above is the correct one.

The __VIEWSTATE input is always changing; this input could be used to protect the registration form from bots.

The problem is probably that the __EVENTTARGET field is empty; it may be needed in order to submit your request. In most cases you can find the value to set by inspecting the form's submit button.
Also, since the __VIEWSTATE is regenerated on every request, you'll need to grab it: first do a GET request and save the __VIEWSTATE input, then do the POST request with that __VIEWSTATE value.
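For illustration, here is a minimal sketch of that flow. The hidden-field scraping is generic; the visible field names and the registration number are taken from the question's payload. Note also that form fields belong in the request body, so they go in data=, not params=:

import requests
from lxml import html

url = "https://www.icaionlineregistration.org/StudentRegistrationForCaNo.aspx"
headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.81 Safari/537.36'}
session = requests.Session()
# Step 1: GET the form so the hidden ASP.NET fields (__VIEWSTATE etc.) are fresh
page = session.get(url, headers=headers)
tree = html.fromstring(page.text)
payload = {inp.get('name'): inp.get('value', '')
           for inp in tree.xpath('//input[@type="hidden"]')
           if inp.get('name')}
# Step 2: add the visible fields from the question's payload
payload.update({'r1': 'rdoPermanent', 'txtRegistNo': 'SRO0394294', 'btnProceed': 'Proceed'})
# Step 3: POST with data= so the fields travel in the request body
response = session.post(url, data=payload, headers=headers)
print(html.fromstring(response.text).xpath('//input[@id="txt_name"]/@value'))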

Related

Website Login Not Accepting POST Method Python Script Login

I am trying to log in to a site with the POST method, then navigate to another page and scrape the HTML data from that second page.
However, the website is not accepting the payload I am pushing to it, and the script is returning the data for a non-member landing page instead of the member page I want.
Below is the current code, which does not work.
#Import Packages
import requests
from bs4 import BeautifulSoup
# Login Data
url = "https://WEBSITE.com/ajax/ajax.login.php"
data ={'username':'NAME%40MAIL.com','password':'PASSWORD%23','token':'ea83a09716ffea1a3a34a1a2195af96d2e91f4a32e4b00544db3962c33d82c40'}
# note that in the HTML the '@' is encoded, like NAME%40MAIL.com
# note that in the HTML the '#' is encoded as %23
headers = {'user-agent': "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/57.0.2987.98 Safari/537.36"}
# Post data package and log in
session = requests.Session()
r = session.post(url, headers=headers, data=data)
# Navigate to the next page and scrape the data
s = session.get('https://WEBSITE.com/page/93/EXAMPLE-PAGE')
soup = BeautifulSoup(s.text, 'html.parser')
print(soup)
I have inspected the elements on the login page; the AJAX URL for the login action is correct, and there are three form fields that need to be filled, as seen in the screenshot below. I pulled the hidden token value from the Inspect Element panel and passed it along with the username/e-mail and password:
[Screenshot: Inspect Element panel]
I really have no clue what the issue might be, but there is a boolean variable IS_GUEST returning TRUE in the returned HTML, which tells me I have done something wrong and the script has not been granted access.
This is also puzzling to troubleshoot, since there is a redirect landing page and no server error codes to analyze or give me a hint.
I am using a different user agent than my actual machine's, but that has never stopped me before on simpler logins.
I have encoded the '@' in the login e-mail as '%40', and the special character required in the password was encoded as '%23' for '#' (i.e. NAME@MAIL.COM = 'NAME%40MAIL.COM' and PASSWORD# = 'PASSWORD%23'). Whenever I change the e-mail to use the plain '@' I get a garbage response, and I tried putting the plain '#' back in the password, but that changed nothing either.
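One thing worth checking: when you pass a dict through data=, requests URL-encodes the values itself, so pre-encoded values like '%40' end up encoded a second time. A minimal sketch with the plain characters (placeholder credentials, reusing the names from the snippet above):

data = {'username': 'NAME@MAIL.com',  # requests itself encodes '@' as %40
        'password': 'PASSWORD#',      # and '#' as %23
        'token': 'ea83a09716ffea1a3a34a1a2195af96d2e91f4a32e4b00544db3962c33d82c40'}
r = session.post(url, headers=headers, data=data)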

Adding a product to cart on Supreme via POST requests in Python - Request not working

I am trying to make a bot in Python that can add a product to my cart on Supreme as soon as it is detected. I want this to be efficient, but when I try to use HTTP POST requests to get the job done, I receive response code 200 (OK) and yet the product isn't added to my basket.
I have tried this with both the Python requests module and the selenium requests module. The code is below:
post_headers = {'User-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.132 Safari/537.36', 'x-requested-with': 'XMLHttpRequest', 'content-type': 'application/x-www-form-urlencoded'}
post_data = {"utf-8": "%E2%9C%93", 's': size_id, 'st': style_id, "X-CSRF-Token": csrf, "commit": "add to cart"}
url = "https://www.supremenewyork.com/shop/{productid}/add".format(productid=id)
add_to_cart = session.post(url, headers=post_headers, data=post_data)
The response for add_to_cart is HTTP 200 (OK), but when I run print(add_to_cart.text), expecting to see the product I added, I just see [] (mobile user agent) or the Supreme homepage HTML (desktop user agent), and I can tell that there is nothing in the basket. I have also tried using a mobile user agent to get it working (JSON), and that failed too.
When I try to use selenium requests, I am using Google Chrome (otherwise I am using custom user agents).
I would appreciate any suggestion or way to fix this and be able to add products to my basket via HTTP POST requests.
In order to see what you get in the response, you can also use .content:
add_to_cart = session.post(url, headers=post_headers, data=post_data)
print(add_to_cart.content)
From what I see being returned in that content, only var h = {"76049":1,"cookie":"1 item--76049,26482"} is useful for verifying that the item was added.
Per what I see on that site, to get the full contents of the cart you should also make another API call: a GET to https://www.supremenewyork.com/shop/cart with your headers.
Hopefully, this is helpful. Good luck!
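For instance, a minimal sketch of that follow-up call, reusing the session and post_headers from the question:

cart_page = session.get("https://www.supremenewyork.com/shop/cart", headers=post_headers)
print(cart_page.content)  # the full cart as the site serves it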
Why are you expecting to see your cart in the response to that POST? I know it seems logical that it perhaps would, but many websites are built in strange and mysterious ways.
Are you using the Chrome Developer Tools? If you look in the Network tab for a request to add something to the cart, you'll see under the response tab you just get a load of JavaScript back. However, if you look under the response cookies, you'll see something like this:
cart 1+item--62197%2C28449
which looks like the product IDs for what's in the cart, stored in a cookie. You could then look for that in your response by calling:
add_to_cart.cookies["cart"]
Alternatively, you could do a GET on:
https://www.supremenewyork.com/shop/cart
but you would then need to parse the HTML you get back... probably easier to check the cookies.
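Putting that together with the question's snippet, a short sketch of the cookie check (using .get() so a missing cookie doesn't raise a KeyError):

add_to_cart = session.post(url, headers=post_headers, data=post_data)
cart_cookie = add_to_cart.cookies.get("cart")  # None if nothing was added
if cart_cookie:
    print("In cart:", cart_cookie)  # e.g. "1+item--62197%2C28449"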

Validate if login happened using requests module in python 2.7

I am trying to log in to a website using the Python requests module.
Website : http://www.way2sms.com/
I use POST to submit the form data. Following is the code that I use.
import requests as r
URL = "http://www.way2sms.com"
data = {'mobileNo': '###', 'password': '#####'}
header = {'User-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/68.0.3440.42 Safari/537.36'}
sess = r.Session()
x = sess.post(URL, data=data, headers=header)
print x.status_code
I can't seem to find a way to validate whether the login was successful or not. Also, the response is always 200 whether I enter the right login details or not.
My whole intention is to log in and then send text messages using this website (I know that I could have used some API), but I am unable to tell if I have logged in successfully or not.
Also, this website uses some kind of JSESSIONID (I don't know much about that) to maintain the session.
If you watch the network traffic when logging in through a browser, the site submits an AJAX request to www.way2sms.com/re-login, so it would be better to submit your request directly there and then check the response (returned content).
Something like this would help:
import requests

session = requests.Session()
URL = 'http://www.way2sms.com/re-login'
data = {'mobileNo': '94########', 'password': 'pass'}  # Make sure to remove '+' from your number
post = session.post(URL, data=data)
if post.text != 'login-reg':  # This was returned when I input invalid credentials
    print('Login successful')
else:
    print(post.text)
Since I don't have an account there, you may also need to check what a successful response looks like.
Check if the response object contains the cookie you're looking for, namely JSESSIONID.
if x.cookies.get('JSESSIONID'):
    print 'Login successful.'

Scraping Data from website with a login page

I am trying to log in to my university website with Python and the requests library using the following code; nonetheless, I am not able to.
import requests
payloads = {"User_ID": <username>,
"Password": <passwrord>,
"option": "credential",
"Log in":"Log in"
}
with requests.Session() as session:
session.post('', data=payloads)
get = session.get("")
print(get.text)
Does anyone have any idea on what I am doing wrong?
In order to log in you will need to post all the information requested by the <input> tags. In your case you will also have to provide the hidden inputs. You can do this by scraping those values and then posting them. You might also need to send some headers to simulate browser behaviour.
from lxml import html
import requests
s = requests.Session()
login_url = "https://intranet.cardiff.ac.uk/students/applications"
session_url = "https://login.cardiff.ac.uk/nidp/idff/sso?sid=1&sid=1"
to_get = s.get(login_url)
tree = html.fromstring(to_get.text)
hidden_inputs = tree.xpath(r'//form//input[@type="hidden"]')
payloads = {x.attrib["name"]: x.attrib["value"] for x in hidden_inputs}
payloads["Ecom_User_ID"] = "<username>"
payloads["Ecom_Password"] = "<password>"
headers = {'User-Agent':'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.132 Safari/537.36'}
result = s.post(session_url, data=payloads, headers=headers)
Hope this works
In order to log in to a website with Python, you will have to use a more involved method than the requests library, because you need to simulate a browser in your code and have it make the login requests to the school's servers. The school's server has to believe it is getting the request from a browser; it then returns the resulting page, and you have that page rendered so that you can scrape it. Luckily, a great way to do this is with the selenium module in Python.
I would recommend googling around to learn more about selenium. This blog post is a good example of using selenium to log into a web page, with detailed explanations of what each line of code is doing. This SO answer on using selenium to log in to a website is also a good entry point.
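As a rough sketch of what that looks like, reusing the field names from the answer above (the button ID here is a hypothetical placeholder you would read off the real login form):

from selenium import webdriver

driver = webdriver.Chrome()
driver.get("https://intranet.cardiff.ac.uk/students/applications")
# Field names reused from the answer above; 'loginButton' is a made-up ID
driver.find_element_by_name("Ecom_User_ID").send_keys("<username>")
driver.find_element_by_name("Ecom_Password").send_keys("<password>")
driver.find_element_by_id("loginButton").click()
html = driver.page_source  # the rendered page, ready to scrape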

requests & urllib failed get complete html

I was trying to get the comments and authors. The authors are chained so that I know who was replying to whom, so it is important to capture all the comments; otherwise, replies to a missing comment have nothing to chain to. (I know it is kind of confusing, but on this website replies are also comments, just special ones that also indicate the author of the comment they reply to.)
The data comes from a Chinese website (https://www.zhihu.com/node/AnswerCommentListV2?params=%7B%22answer_id%22%3A%2215184366%22%7D), fetched using requests.
import requests
from bs4 import BeautifulSoup

user_agent = 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2272.101 Safari/537.36'
headers = {'User-Agent': user_agent}
url = "https://www.zhihu.com/node/AnswerCommentListV2?params=%7B%22answer_id%22%3A%"+"2215184366"+"%22%7D"
r = requests.get(url, headers=headers, allow_redirects=True)
soup = BeautifulSoup(r.text, "lxml")
for comment in soup.find_all("div", "zm-item-comment"):
    p = comment.find("a", "zg-link author-link")
    print(p)
However, I found that the code above gets me most of the content I want, but with some "holes". Most of the comments are nicely listed, but some are missing. While debugging, I found that the response from requests was incomplete: the response itself was missing some comments for unknown reasons.
[Console output: every "None" should be a comment]
I also tried a similar approach using urllib, with no luck.
Could you please help me get the complete HTML, as the browser does?
Update:
I think the problem has to do with the response from the website. A simple requests.get cannot fetch the full page the way Chrome does. I am wondering if a fundamental solution for getting the complete HTML exists.
I have tried @eLRuLL's code. It does get the lost authors' names. However, the lost authors all appear as "知乎用户" ("Zhihu user", the site's generic user name), whereas I am expecting different, specific user names. By comparison, the Chrome browser displays the specific user names fine.
Try this. You will have all the authors and comments.
import requests
from bs4 import BeautifulSoup

url = "https://www.zhihu.com/node/AnswerCommentListV2?params=%7B%22answer_id%22%3A%"+"2215184366"+"%22%7D"
res = requests.get(url, headers={'User-Agent': 'Mozilla/5.0'})
soup = BeautifulSoup(res.text, "lxml")
for item in soup.select(".zm-item-comment"):
    try:
        author = item.select(".author-link")[0].text
        comment = item.select(".zm-comment-content")[0].text
        print(author, comment)
    except IndexError:
        pass  # skip comments missing an author link or body
The problem seems to be that you assume all the commenters' names are inside an a tag, but if you check, the comments you are missing are exactly the ones that don't have a link on the user's name (so you can't use the a tag to find them). To get that author's name you'd have to use:
p = comment.find("div", "zm-comment-hd").text
print(p)
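Combining that with the loop from the question, a sketch that falls back to the header div when there is no author link (class names taken from the snippets above, assuming the soup object built in the question):

for comment in soup.find_all("div", "zm-item-comment"):
    link = comment.find("a", "zg-link author-link")
    if link is not None:
        print(link.text)  # commenters with a profile link
    else:
        # unlinked/anonymous commenters only have their name in the header div
        print(comment.find("div", "zm-comment-hd").text)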
