I am attempting to write a script with the SharePoint package to access files on my company's SharePoint. The tutorial states:
First, you need to create a SharePointSite object. We’ll assume you’re using basic auth; if you’re not, you’ll need to create an appropriate urllib2 Opener yourself.
However, after several attempts, I've concluded that basic auth is not sufficient. While researching how to make it work, I came upon this article, which gives a good overview of the general authentication scheme. What I'm struggling with is implementing it in Python.
I've managed to hijack the basic auth in the SharePoint module. To do this, I took the XML message in the linked article and used it to replace the XML generated by the SharePoint module. After making a few other changes, I now receive a token as described in Step 2 of the linked article.
Now, in Step 3, I need to send that token to SharePoint with a POST. Below is a sample of what it should look like:
POST http://yourdomain.sharepoint.com/_forms/default.aspx?wa=wsignin1.0 HTTP/1.1
Host: yourdomain.sharepoint.com
User-Agent: Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Win64; x64; Trident/5.0)
Content-Length: [calculate]
t=EwBgAk6hB....abbreviated
I currently use the following code to generate my POST. With guidance from a few other questions, I've omitted the Content-Length header, since that should be calculated automatically. I was unsure where to put the token, so I just shoved it into data.
from urllib import urlencode
from urllib2 import Request, urlopen

headers = {
    'Host': 'mydomain.sharepoint.com',
    'Connection': 'keep-alive',
    'User-Agent': 'Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Win64; x64; Trident/5.0)'
}
# the security token from Step 2, with its leading characters stripped
data = urlencode({'t': '{}'.format(token[2:])})
postURL = "https://mydomain.sharepoint.com/_forms/default.aspx?wa=wsignin1.0"
req = Request(postURL, data, headers)
response = urlopen(req)
However, this produces the following error message:
urllib2.HTTPError: HTTP Error 302: The HTTP server returned a redirect error that would lead to an infinite loop.
The last 30x error message was:
Found
How do I generate a POST which will correctly return the authentication cookies I need?
According to the Remote Authentication in SharePoint Online Using Claims-Based Authentication and SharePoint Online authentication articles:
The Federation Authentication (FedAuth) cookie is for each top level
site in SharePoint Online such as the root site, the MySite, the Admin
site, and the Public site. The root Federation Authentication (rtFA)
cookie is used across all of SharePoint Online. When a user visits a
new top level site or another company’s page, the rtFA cookie is used
to authenticate them silently without a prompt.
To summarize, to acquire authentication cookies the request needs to be sent to the following endpoint:
url: https://tenant.sharepoint.com/_forms/default.aspx?wa=wsignin1.0
method: POST
data: security token
Once the request is validated, the response will contain the authentication cookies (FedAuth and rtFa) in the HTTP headers, as explained in the article you mentioned.
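For illustration, here is a minimal sketch of that step using the requests library (this is not the client library discussed below; "tenant" and the security_token variable holding the token from Step 2 are placeholders, and allow_redirects=False avoids the redirect loop reported in the question):

import requests

# Sketch only: POST the security token obtained from the STS (Step 2) as the raw
# request body, then read the authentication cookies from the response.
signin_url = "https://tenant.sharepoint.com/_forms/default.aspx?wa=wsignin1.0"
session = requests.Session()
response = session.post(signin_url,
                        data=security_token,  # the token string from Step 2 (placeholder)
                        headers={"Content-Type": "application/x-www-form-urlencoded"},
                        allow_redirects=False)  # do not follow the 302; just collect cookies
fed_auth = response.cookies.get("FedAuth")
rt_fa = response.cookies.get("rtFa")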
SharePoint Online REST client for Python
As a proof of concept, the SharePoint Online REST client for Python has been released, which shows how to:
perform remote authentication in SharePoint Online
perform basic CRUD operations against SharePoint resources such as Web, List or List Item using the REST API
Implementation details:
The AuthenticationContext.py class contains the SharePoint Online remote authentication flow implementation; in particular, the acquireAuthenticationCookie function demonstrates how to handle authentication cookies.
The ClientRequest.py class shows how to consume the SharePoint Online REST API.
Examples
The example shows how to read Web client object properties:
from client.AuthenticationContext import AuthenticationContext
from client.ClientRequest import ClientRequest

url = "https://contoso.sharepoint.com/"
username = "jdoe@contoso.onmicrosoft.com"
password = "password"

ctxAuth = AuthenticationContext(url)
if ctxAuth.acquireTokenForUser(username, password):
    request = ClientRequest(url, ctxAuth)
    requestUrl = "/_api/web/"  # Web resource endpoint
    data = request.executeQuery(requestUrl=requestUrl)
    webTitle = data['d']['Title']
    print "Web title: {0}".format(webTitle)
else:
    print ctxAuth.getLastErrorMessage()
More examples can be found under the examples folder of the GitHub repository.
Related
I'm trying to scrape a website which uses DataDome, and after some requests I have to complete a GeeTest (slider captcha puzzle).
Here is a sample link to it:
captcha link
I've decided not to use Selenium (at least for now) and I'm trying to solve my problem with the Python Requests module.
My idea was to complete the GeeTest myself, then have my program send the same request that my web browser sends after completing the slider.
To start, I scraped the HTML that I got on the website after the captcha prompt:
<head><title>allegro.pl</title><style>#cmsg{animation: A 1.5s;}@keyframes A{0%{opacity:0;}99%{opacity:0;}100%{opacity:1;}}</style></head><body style="margin:0"><p id="cmsg">Please enable JS and disable any ad blocker</p><script>var dd={'cid':'AHrlqAAAAAMAsB0jkXrMsMAsis8SQ==','hsh':'77DC0FFBAA0B77570F6B414F8E5BDB','t':'fe','s':29701,'host':'geo.captcha-delivery.com'}</script><script src="https://ct.captcha-delivery.com/c.js"></script></body></html>
I couldn't access the iframe where the most important info is, but I found out that the link to that iframe can be built with info from the HTML code above. As you can see in the link above:
cid is initialCid, hsh is hash, etc.; one part of the link, cid, is a cookie that I got at the moment the captcha appeared.
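For example, this is roughly how I pull those pieces out of that HTML; the exact query parameter names of the final iframe URL are just my guess from watching the browser, so treat this as a sketch rather than working code:

import re

html = blocked_page_html  # the HTML shown above
cid = re.search(r"'cid':'([^']+)'", html).group(1)
hsh = re.search(r"'hsh':'([^']+)'", html).group(1)
# guessed shape of the iframe URL, built from those values plus the datadome cookie
iframe_url = ("https://geo.captcha-delivery.com/captcha/?initialCid={0}&hash={1}&cid={2}"
              .format(cid, hsh, datadome_cookie))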
I've seen there are services available which can solve captchas for you, but I've decided to complete the captcha myself, then copy the exact request, including cookies and headers, into my program and send it with Requests. For now I'm doing it by hand, but it doesn't work: the response is 403, while manually it's 200 and a redirect.
Here is a sample request that my browser is sending after completing captcha:
sample request
I'm sending it in my program with:
import requests

s = requests.Session()
s.headers = headers                       # headers copied from the browser request
s.cookies.set(cookie_name, cookie_value)  # name and value of the cookie copied from the browser
captcha = s.get(request_url)              # the URL the browser requested after solving the slider
The response is 403 and I have no idea how to make it work; please help me.
Captchas are really tricky in the web scraping world; most of the time you can bypass them by solving the captcha and then manually taking the returned cookie and plugging it into your script. Depending on the website, the cookie could hold for 15 minutes, a day, or even longer.
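A minimal sketch of that manual approach, assuming the anti-bot cookie is named datadome and that you copy its value from your browser's devtools after solving the captcha by hand:

import requests

session = requests.Session()
session.headers.update({"User-Agent": "same User-Agent string your browser sent"})
# paste in the cookie your browser received after you solved the captcha manually
session.cookies.set("datadome", "value-copied-from-browser", domain=".allegro.pl")
response = session.get("https://allegro.pl/")  # later requests reuse the cookie automatically
print(response.status_code)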
The other alternative is to use captcha-solving services such as https://www.scraperapi.com/, where you pay a fee for x number of requests, but you won't run into the captcha issue as they solve it for you.
Use a headers parameter to solve this problem, like so:
import requests

header = {
    "user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.169 Safari/537.36",
    "referer": "https://www.google.com/"
}
r = requests.get("http://webcache.googleusercontent.com/search?q=cache:www.naukri.com/jobs-in-andhra-pradesh", headers=header)
Test it with the web cache before running it against the real URL.
General Problem
I am trying to gather stock data from Investors Business Daily (IBD) using Python. My goal is to take a stock list and get data for each ticker in the list. The specific page I am interested in is what IBD calls the stock checkup. This page is only viewable for paid subscribers (I am using a free trial).
Specifics
I am trying to use the requests library to log in to a session and then use .get to access the stock checkup page. Python version: 3.7.
import requests

headers = {"User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.14; rv:66.0) Gecko/20100101 Firefox/66.0", "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8", "Accept-Language": "en-US,en;q=0.5", "Accept-Encoding": "gzip, deflate", "DNT": "1", "Connection": "close", "Upgrade-Insecure-Requests": "1"}

LOGIN = 'https://myibd.investors.com/secure/signin.aspx?login'
PROTECTED_PAGE = 'https://research.investors.com/stock-checkup/nasdaq-applied-materials-inc-amat.aspx'

payload = {
    'username': 'blahblah@gmail.com',
    'password': 'secretpw'
}

with requests.session() as s:
    s.post(LOGIN, data=payload, headers=headers)
    response = s.get(PROTECTED_PAGE, headers=headers)
    print(response.text)
From other posts I've learned to look for the login form in the HTML to find the specific tag names for the username, password, and other inputs. This is how I found the information for the payload above. I believe there are also hidden inputs that are making this more difficult than I had hoped (e.g. __VIEWSTATE). The response text that I receive back indicates that I have not logged in, and there is no specific information on the stock of interest (AMAT).
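For what it's worth, this is roughly what I had in mind for picking up those hidden fields (a sketch only; it assumes BeautifulSoup is available, that the hidden fields are plain <input type="hidden"> tags, and that it runs inside the with-block above so it can reuse s, headers, and payload):

from bs4 import BeautifulSoup

login_page = s.get(LOGIN, headers=headers)
soup = BeautifulSoup(login_page.text, "html.parser")
# collect every hidden input (e.g. __VIEWSTATE) and merge it into the login payload
hidden = {tag["name"]: tag.get("value", "")
          for tag in soup.find_all("input", type="hidden") if tag.get("name")}
payload.update(hidden)
s.post(LOGIN, data=payload, headers=headers)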
Is there a better way to do this? I have also tried using Selenium, but there were issues with that as well. Is the IBD website really that tricky to log in to, or am I missing something? I've spent countless hours on this, so any help is much appreciated!
For starters, you are using an incorrect endpoint for authentication.
If the website does not provide a dedicated developer's API, you are restricted to whatever the webapp itself uses.
LOGIN = 'https://myibd.investors.com/secure/signin.aspx?login' is the page where the user enters their credentials. It does not, and is not expected to, accept POST requests.
Whenever the user enters the credentials and presses the SUBMIT button, the webapp makes a request to the actual authentication endpoint.
You can view the requests issued by your browser in the developer panel, in the "Network" section. For IBD the request seems to go to
https://login.investors.com/accounts.login with the following payload (form data):
loginID: afds#asdf.vsd
password: asdfasdgasdfasdfasdf
sessionExpiration: 31536000
targetEnv: jssdk
include: profile,data,emails,subscriptions,preferences,
includeUserInfo: true
loginMode: standard
lang: en
riskContext: {"yadda": "yadda-yadda"}
APIKey: XXXXXXXXXXXXXXX
source: showScreenSet
sdk: js_latest
authMode: cookie
pageURL: https://myibd.investors.com/register/signin-iframe.aspx?checkauth=true&display=&t=1660650471
sdkBuild: 13318
format: json
Now, as to what the fields in this payload mean, your guess is probably better than mine. But you have to nail down the exact request-response ping-pong and the meaning of all the fields before trying to build a program around it. Your code does not even check the return code of the authentication attempt, so you don't know whether you've logged in successfully or just sent your credentials somewhere.
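As a rough sketch (and only a sketch, under the assumption that the captured fields above are what the endpoint requires, which you would have to verify against your own capture), the attempt could look something like this:

import requests

session = requests.Session()
login_payload = {
    "loginID": "blahblah@gmail.com",
    "password": "secretpw",
    "APIKey": "XXXXXXXXXXXXXXX",   # copied from your own captured request
    "format": "json",
    # ...plus the remaining fields from the captured form data...
}
auth_response = session.post("https://login.investors.com/accounts.login", data=login_payload)
print(auth_response.status_code)
print(auth_response.text[:200])    # inspect the body to see whether the login actually succeeded
protected = session.get("https://research.investors.com/stock-checkup/nasdaq-applied-materials-inc-amat.aspx")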
I am trying to scrape a table in a webpage, or even download the .xlsx file of this table, using the requests library.
Normal workflow:
I log into the site, go to my reporting page, choose a report, and click a button that says "Test"; a second window opens up with my table and gives me the option to download the .xlsx file.
I can copy and paste this URL into any Chrome browser where I am currently logged in and it works. When I try with requests, even when passing auth into my get(), I get a 200 response, but it is a simple page with one line of text telling me to "contact my tech staff to receive the proper url to enter your username and password". This is the same as when I paste the URL into a browser where I am not logged into the site, except that there I am redirected to a new URL with the same sentence.
So I imagine there is a slug for the organization that is passed not in the URL but somewhere in the headers or cookies when I access this site in my browser. How do I identify this parameter in the HTTP header? And how do I send it with requests so I can get my table and move on to trying to automate the .xlsx download?
import requests
url = 'myorganization.com/adhocHTML.xsl?x=adhoc.AdHocFilter-listAdhocData&filterID=45678&source=live'
headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.97 Safari/537.36'}
data = requests.get(url, headers=headers, auth=('username', 'Password'))
Any help would be greatly appreciated, as I am new to the requests library and am just trying to automate some data flow before I ever get to analyzing it.
You need to log in with requests. You can do this by making a session and then making your other requests with that session (it will save all cookies and other state).
Before going to code, you should do a few steps:
Make sure you are logged out. Open the browser inspector on the login page and go to the Network tab. Log in and find the POST request in the Network tab that corresponds to your login. At the end of that request's details you will find the parameters used for login. Make those parameters a dictionary (login_data) in your code and proceed as below:
import requests

session = requests.Session()
session.post('url_to_login_page', data=login_data)   # log in once; cookies are kept on the session
data = session.get(url, headers=headers)             # later requests reuse those cookies
Login data differs from website to website, so I can't give you a specific example; you should be able to find it as I described above. If you have problems with that, tell me.
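Put together with your original snippet, a rough end-to-end sketch could look like this (url_to_login_page and login_data are whatever you find in the Network tab, and whether the report URL returns the table HTML or the .xlsx directly depends on the site):

import requests

headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.97 Safari/537.36'}

session = requests.Session()
session.post('url_to_login_page', data=login_data, headers=headers)  # cookies are stored on the session

report_url = 'https://myorganization.com/adhocHTML.xsl?x=adhoc.AdHocFilter-listAdhocData&filterID=45678&source=live'
response = session.get(report_url, headers=headers)

with open('report.xlsx', 'wb') as f:   # only if the endpoint returns the file itself
    f.write(response.content)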
I work in an environment where we occasionally have to bulk-configure TP-Link ADSL routers. As one can understand, this causes productivity issues. I solved the issue using Python, in particular its requests.Session() functionality. It worked tremendously well, especially for older TP-Link models such as the TP-Link Archer D5.
Reference: How to control a TPLINK router with a python script
The method I used was to do the configuration via the browser, capture the packets using Wireshark, and replicate them using Python. The Archer VR600 introduces a new method. When starting configuration via the browser, the main page asks for a new password. Once that is done, it generates a long random string (KEY) which is sent to the router. This key is random and unique; based on this random string, a JSESSIONID is generated and used throughout the session.
AC1600 IP Address: 192.168.1.1
PC IP Address: 192.168.1.100
KEY and SESSIONID when configured via Browser.
KEY and SESSIONID when configured via Python Script.
As you can see, I am trying to replicate the steps via script but failing because I am not able to create a unique key that the router will accept, so I fail to generate a SESSIONID and cannot proceed with the rest of the configuration.
Code:
import base64
import requests

def configure_tplink_archer_vr600():
    user = 'admin'
    salt = '%3D'
    default_password = 'admin:admin'
    password = "admin"

    base_url = 'http://192.168.1.1'
    setPwd_url = 'http://192.168.1.1/cgi/setPwd?pwd='
    login_url = "http://192.168.1.1/cgi/login?UserName=0f98175e8bd1c9297fc22ec6a47fa4824bfb3c8c73141acd7b46db283557d229c9783f409690c9af5e87055608b358ab4d1dfc45f17e6261daabd3e042d7aee92aa1d8829a8d5a69eb641dcc103b17c4f443a96800c8c523b911589cf7e6164dbc1001194"
    get_busy_url = "http://192.168.1.1/cgi/getBusy"

    authorization = base64.b64encode((default_password).encode()).decode('ascii')
    salted_password = base64.b64encode((password).encode()).decode('ascii')
    salted_password = salted_password.replace("=", "%3D")
    print("Salted Password" + salted_password)
    setPwd_url = setPwd_url + salted_password

    rs = requests.session()
    rs.headers['Cookie'] = 'Authorization=Basic ' + authorization
    rs.headers['Referer'] = base_url
    rs.headers['User-Agent'] = "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/78.0.3904.108 Safari/537.36"
    print("This is authorization string: " + authorization)

    response = rs.post(setPwd_url)
    print(response)
    print(response.text.encode("utf-8"))

    response = rs.post(get_busy_url)
    print(response)
    print(response.text.encode("utf-8"))

    response = rs.post(login_url)
    print(response)
    print(response.text.encode("utf-8"))
Use the Python requests library to log in to the router; this removes the need for any manual work:
Go to the login page and right click + inspect element.
Navigate to the resources tab; here you can see HTTP requests as they happen.
Log in with some username and password and you should see the corresponding GET/POST request in the network tab.
Click on it and find the payload it sends to the router; this is usually in JSON format, and you'll need to build it in your Python script and send it as input to the webpage (luckily there are many tutorials for this out there).
Note that sometimes the payload for the script is actually generated by some JavaScript, but in most cases it's just some string crammed into the HTML source. If you see a payload value you don't understand, search for it in the page source; then you'll have to extract it with something like a regex and add it to your payload.
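A sketch of that last case, with a made-up field name (csrf_token) purely for illustration:

import re
import requests

session = requests.Session()
page = session.get("http://192.168.1.1/")                   # the router's login page
match = re.search(r'name="csrf_token"\s+value="([^"]+)"', page.text)
payload = {"username": "admin", "password": "admin"}
if match:
    payload["csrf_token"] = match.group(1)                  # add the extracted string to the payload
response = session.post("http://192.168.1.1/cgi/login", data=payload)
print(response.status_code)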
I am looking at a workflow to try to automate Twitter authentication of an application acting as a specific user. In order to achieve this (it seems to me; I am a beginner with OAuth workflows) I have to create a session logged in to Twitter through the web.
The application-only workflow is relatively easy to establish with the following code:
import base64
import json
import requests

consumer_key = API_key = "app API key"
consumer_secret = API_secret = "app API secret"
encoded_secrets = base64.b64encode(consumer_key + ":" + consumer_secret)  # key & secret should be URL-encoded
r = requests.post("https://api.twitter.com/oauth2/token",
                  headers={"authorization": "Basic " + encoded_secrets},
                  data={"grant_type": "client_credentials"})
This retrieves the bearer token that can be used to authenticate an application to Twitter without any user interaction. For example,
access_data = json.loads(r.content)
bearer_auth_header = {"authorization": "Bearer " + access_data['access_token']}
timeline_endpoint = "https://api.twitter.com/1.1/trends/available.json"
response = requests.get(timeline_endpoint, headers=bearer_auth_header)
This code works fine (and fails due to authentication failure when the headers are omitted).
When it comes to authorizing an application with a user's identity, however, things are more complex, as an account identity has to be authenticated. The standard workflow is to bring up a link in the user's web browser that allows them to authorize the application to access their details through the web (by clicking a button and submitting a form).
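For reference, the browser-driven version of that workflow (the one I would like to avoid doing by hand) is commonly sketched with the requests_oauthlib package; this is an assumption on my part, not the code I am actually using, with the same consumer_key/consumer_secret as above and the user still approving the app in a browser and pasting back a PIN:

from requests_oauthlib import OAuth1Session

oauth = OAuth1Session(consumer_key, client_secret=consumer_secret, callback_uri="oob")
request_token = oauth.fetch_request_token("https://api.twitter.com/oauth/request_token")
print "Visit and authorize:", oauth.authorization_url("https://api.twitter.com/oauth/authorize")
verifier = raw_input("PIN from the browser: ")
oauth = OAuth1Session(consumer_key,
                      client_secret=consumer_secret,
                      resource_owner_key=request_token["oauth_token"],
                      resource_owner_secret=request_token["oauth_token_secret"],
                      verifier=verifier)
access = oauth.fetch_access_token("https://api.twitter.com/oauth/access_token")
# 'access' then holds the user-context oauth_token / oauth_token_secret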
The issue comes when trying to automate this login using a request from a program rather than a web browser. My current attempt looks like this.
# access the Twitter login page to receive a '_twitter_sess' session cookie.
twit_response = requests.get("https://twitter.com/login")

# Submit sign-on form through the web
TW_ID = "username"
TW_PW = "password"
login_data = {
    "session[username_or_email]": TW_ID,
    "session[password]": TW_PW,
    "return_to_ssl": True,
    "scribe_log": "",
    "redirect_after_login": "http://holdenweb.com/contact",
    "force_login": True
}
login_response = requests.post("https://twitter.com/sessions",
                               data=login_data,
                               cookies=twit_response.cookies,
                               headers={"User-Agent": "Mozilla/5.0 "
                                        "(Macintosh; Intel Mac OS X 10.8; rv:24.0) "
                                        "Gecko/20100101 Firefox/24.0"})
This fails (though, because this is not web authentication, it receives a 200 response). I also observe that, despite the cookies=twit_response.cookies argument to requests.post(), the request did not seem to include the session cookie, and the User-Agent header is still incorrect. However, I am unsure whether this is the cause of the failure, so I only mention it in case it's relevant.
trrh = twit_response.request.headers
for header in trrh:
print "%s: %s" % (header, trrh[header])
prints
Accept-Encoding: gzip, deflate, compress
Accept: */*
User-Agent: python-requests/2.2.1 CPython/2.7.2 Darwin/12.5.0
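One variation I have considered (a sketch only, reusing the same login_data as above) is routing everything through a requests.Session, so that the cookie from the first GET and my custom User-Agent are carried over to the POST automatically:

import requests

session = requests.Session()
session.headers["User-Agent"] = ("Mozilla/5.0 (Macintosh; Intel Mac OS X 10.8; rv:24.0) "
                                 "Gecko/20100101 Firefox/24.0")
twit_response = session.get("https://twitter.com/login")        # stores _twitter_sess on the session
login_response = session.post("https://twitter.com/sessions", data=login_data)
print login_response.status_code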
Can anyone help me make progress, please? It's getting to the stage where I feel I am overlooking some fundamental, and possibly quite simple, mistake(s).