Python - Download files from SharePoint site

I need to download files from and upload files to SharePoint sites, and this has to be done using Python.
My site URL looks like https://ourOrganizationName.sharepoint.com/ followed by further path segments.
Initially I thought I could do this with Requests, BeautifulSoup, etc., but I cannot even use "Inspect Element" on the body of the site.
I have tried libraries such as Sharepoint, HttpNtlmAuth, office365, etc., without success. The request always returns 403.
I searched Google as much as I could, again without success. Even YouTube hasn't helped.
Could anyone help me with this? Suggestions for libraries, with links to their documentation, would be really appreciated.
Thanks

Have you tried the Office365-REST-Python-Client library? It supports SharePoint Online authentication and lets you download and upload files, as demonstrated below:
Download a file
from office365.runtime.auth.authentication_context import AuthenticationContext
from office365.sharepoint.client_context import ClientContext
from office365.sharepoint.files.file import File
ctx_auth = AuthenticationContext(url)
ctx_auth.acquire_token_for_user(username, password)
ctx = ClientContext(url, ctx_auth)
response = File.open_binary(ctx, "/Shared Documents/User Guide.docx")
with open("./User Guide.docx", "wb") as local_file:
    local_file.write(response.content)
Upload a file
import os  # needed for os.path.basename below
ctx_auth = AuthenticationContext(url)
ctx_auth.acquire_token_for_user(username, password)
ctx = ClientContext(url, ctx_auth)
path = "./User Guide.docx"  # local path
with open(path, 'rb') as content_file:
    file_content = content_file.read()
target_url = "/Shared Documents/{0}".format(os.path.basename(path))  # target URL of the file
File.save_binary(ctx, target_url, file_content)  # upload the file
Usage
Install the latest version (from GitHub):
pip install git+https://github.com/vgrem/Office365-REST-Python-Client.git
Refer to /examples/sharepoint/files/* for more details.

You can also try this solution to upload a file. For me, the upload in the first solution did not work.
First step: pip3 install Office365-REST-Python-Client==2.3.11
import os
from office365.sharepoint.client_context import ClientContext
from office365.runtime.auth.user_credential import UserCredential
def print_upload_progress(offset):
    print("Uploaded '{0}' bytes from '{1}'...[{2}%]".format(offset, file_size, round(offset / file_size * 100, 2)))
# Load file to upload:
path = './' + filename  # if the file to upload is in the same directory
try:
    with open(path, 'rb') as content_file:
        file_content = content_file.read()
except Exception as e:
    print(e)
file_size = os.path.getsize(path)
site_url = "https://YOURDOMAIN.sharepoint.com"
user_credentials = UserCredential('user_login', 'user_password') # this user must login to space
ctx = ClientContext(site_url).with_credentials(user_credentials)
size_chunk = 1000000
target_url = "/sites/folder1/folder2/folder3/"
target_folder = ctx.web.get_folder_by_server_relative_url(target_url)
# Upload file to SharePoint:
try:
    uploaded_file = target_folder.files.create_upload_session(path, size_chunk, print_upload_progress).execute_query()
    print('File {0} has been uploaded successfully'.format(uploaded_file.serverRelativeUrl))
except Exception as e:
    print("Error while uploading to SharePoint:\n", e)
Based on: https://github.com/vgrem/Office365-REST-Python-Client/blob/e2b089e7a9cf9a288204ce152cd3565497f77215/examples/sharepoint/files/upload_large_file.py
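To make the role of size_chunk and the progress callback more concrete, here is a minimal standard-library sketch of reading a file in fixed-size chunks and reporting the running offset. It only illustrates the idea and is not how Office365-REST-Python-Client implements create_upload_session; the names iter_chunks, report_progress and demo are made up for this example.

```python
import os
import tempfile


def iter_chunks(path, chunk_size):
    """Yield (offset, chunk) pairs; offset is the number of bytes read so far."""
    offset = 0
    with open(path, "rb") as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                break
            offset += len(chunk)
            yield offset, chunk


def report_progress(offset, total):
    """Build a progress line in the same style as print_upload_progress."""
    percent = round(offset / total * 100, 2)
    return "Uploaded '{0}' bytes from '{1}'...[{2}%]".format(offset, total, percent)


def demo():
    # Create a throwaway 2500-byte file so the example is self-contained.
    with tempfile.NamedTemporaryFile(delete=False) as tmp:
        tmp.write(b"x" * 2500)
    try:
        total = os.path.getsize(tmp.name)
        return [report_progress(offset, total)
                for offset, _ in iter_chunks(tmp.name, 1000)]
    finally:
        os.remove(tmp.name)


if __name__ == "__main__":
    for line in demo():
        print(line)
```

With a 1000-byte chunk size the 2500-byte file is reported in three steps, the last one being the partial chunk. A real upload would send each chunk to the server instead of discarding it.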

Related

Read pdf file from storage account (Azure Data lake) without downloading it using python

I am trying to read a PDF file that I have uploaded to an Azure storage account, using Python.
I have tried passing the SAS token/URL of the file to PDFMiner, but I cannot get a file path that PDFMiner will accept. I am using something like the code below:
from azure.storage.filedatalake import DataLakeServiceClient
from azure.storage.filedatalake import generate_file_sas
import os
storage_account_name = "mystorageaccount"
storage_account_key = "mystoragekey"
container_name = "mycontainer"
directory_name = 'mydirectory'
service_client = DataLakeServiceClient(
    account_url="{}://{}.dfs.core.windows.net".format("https", storage_account_name),
    credential=storage_account_key)
file_system_client = service_client.get_file_system_client(file_system=container_name)
directory_client = file_system_client.get_directory_client(directory_name)
file_client = directory_client.get_file_client('XXX.pdf')
download = file_client.download_file()
downloaded_bytes = download.readall()
file_sas = generate_file_sas(account_name= storage_account_name,file_system_name= container_name,directory_name= directory_name,file_name= dir_name,credential= storage_account_key)
from pdfminer.pdfpage import PDFPage
# Attempt 1: pass the downloaded bytes to open()
with open(downloaded_bytes, 'rb') as infile:
    PDFPage.get_pages(infile, check_extractable=False)
# Attempt 2: pass the SAS token to open()
with open(file_sas, 'rb') as infile:
    PDFPage.get_pages(infile, check_extractable=False)
Neither option works.
Initially input_dir was set up locally, so the code could fetch the PDF file and read it.
Is there a different way to pass the URL/path of the file in the storage account to the PDF read function?
Any help is appreciated.
I tried this in my environment and got the results below.
Initially, I tried the same process without downloading the PDF file from the Azure Data Lake storage account and got no results. AFAIK, downloading the file first and then reading it is a workable approach.
I used the code below to read the PDF file with the PyPDF2 module, and it ran and printed the content successfully.
Code:
from azure.storage.filedatalake import DataLakeFileClient
import PyPDF2
service_client = DataLakeFileClient.from_connection_string("<your storage connection string>", file_system_name="test", file_path="dem.pdf")
# Download the file from the Data Lake into a local copy
with open("dem.pdf", 'wb') as file:
    data = service_client.download_file()
    data.readinto(file)
# Read the local copy (PdfReader is the newer PyPDF2 name for PdfFileReader)
with open("dem.pdf", 'rb') as pdf_file:
    pdfread = PyPDF2.PdfReader(pdf_file)
    print("Number of pages:", len(pdfread.pages))
    page_obj = pdfread.pages[0]
    print(page_obj.extract_text())
You can also open the PDF file in a browser using the file URL with the SAS token appended:
https://<storage account name>.dfs.core.windows.net/test/dem.pdf?<sas-token>
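As a side note on the question's first attempt: open() expects a file path, so open(downloaded_bytes, 'rb') cannot work. Bytes that are already in memory, such as the result of download.readall(), can instead be wrapped in io.BytesIO, which gives a seekable file-like object of the kind PDF parsers usually accept. The sketch below uses only the standard library and a stand-in byte string rather than a real download; open_pdf_bytes is a hypothetical helper name, and this is an alternative to the answer above, not something it tested.

```python
import io


def open_pdf_bytes(pdf_bytes):
    """Wrap in-memory PDF bytes in a seekable, file-like object."""
    # PDF files start with the marker "%PDF-"; anything else is probably
    # an error page or the wrong blob.
    if not pdf_bytes.startswith(b"%PDF-"):
        raise ValueError("payload does not start with a PDF header")
    return io.BytesIO(pdf_bytes)


if __name__ == "__main__":
    # Stand-in for download.readall(); a real payload would be a full PDF.
    fake_download = b"%PDF-1.7\n...rest of the file..."
    infile = open_pdf_bytes(fake_download)
    print(infile.read(8))
```

The returned object could then be passed wherever the original code passed infile, without writing a local copy first.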

Error while reading PDF files from SharePoint Online

I am using the SharePoint Office365 Python library to read the PDF files in a folder and copy them to S3, but I am getting this error:
b'The length of the URL for this request exceeds the configured maxUrlLength value.'
Here is my code
from office365.runtime.auth.authentication_context import AuthenticationContext
from office365.sharepoint.client_context import ClientContext
from office365.sharepoint.files.file import File
from office365.runtime.auth.user_credential import UserCredential
def sharepoint_connection(username, password, site_url, relative_url):
    ctx = ClientContext(site_url).with_credentials(UserCredential(username, password))
    web = ctx.web.get().execute_query()
    return ctx
def sharepoint_files(relative_url):
    file_names = []
    file_details = {}
    ctx = sharepoint_connection(username, password, site_url, relative_url)
    files = ctx.web.get_folder_by_server_relative_url(relative_url).files
    ctx.load(files)
    ctx.execute_query()
    for file in files:
        file_names.append(file.properties['ServerRelativeUrl'])
        file_url = file.properties['ServerRelativeUrl']
        file_name = file_url[file_url.rfind("/")+1:]
        file_details[file_name] = file_url
    # print(file_details)
    return file_details
site_url = "https://account.sharepoint.com/sites/ExternalSharing"
relative_url = "/sites/ExternalSharing/Shared Documents/OCE********"
ctx = sharepoint_connection(username, password, site_url, relative_url)
file_url = file_details[file]
response = File.open_binary(ctx, file_url)
print(response.content)
I understand that the URL is too long. So I tried mapping the SharePoint folder to OneDrive and uploading from there, but I get the same issue.
Is there a way to handle this scenario?
Thanks in advance. Please let me know if any more information is needed.
Thanks,
Ashish

ValueError when reading excel from sharepoint to python

I am trying to read an excel file from sharepoint to python.
Q1: There are two URLs for the file. If I directly copy the link of the file, I get:
https://company.sharepoint.com/:x:/s/project/letters-numbers?e=lettersnumbers
If I click into folders from the webpage one after another, until I click and open the excel file, the URL now is:
https://company.sharepoint.com/:x:/r/sites/project/_layouts/15/Doc.aspx?sourcedoc=letters-numbers&file=Table.xlsx&action=default&mobileredirect=true
Which one should I use?
Q2: My code below:
import io
import pandas as pd
from office365.runtime.auth.authentication_context import AuthenticationContext
from office365.sharepoint.client_context import ClientContext
from office365.sharepoint.files.file import File
URL = "https://company.sharepoint.com/:x:/s/project/letters-numbers?e=lettersnumbers"
USERNAME = "abc#a.com"
PASSWORD = "abcd"
ctx_auth = AuthenticationContext(URL)
if ctx_auth.acquire_token_for_user(USERNAME, PASSWORD):
    ctx = ClientContext(URL, ctx_auth)
    web = ctx.web
    ctx.load(web)
    ctx.execute_query()
    print("Authentication successful")
else:
    print(ctx_auth.get_last_error())
response = File.open_binary(ctx, URL)
bytes_file_obj = io.BytesIO()
bytes_file_obj.write(response.content)
bytes_file_obj.seek(0)
df = pd.read_excel(bytes_file_obj, sheet_name="Sheet2")
It works until the pd.read_excel(), where I get ValueError.
ValueError: Excel file format cannot be determined, you must specify an engine manually.
I don't know where it went wrong and if there will be further problems with loading. It will be highly appreciated if someone could warn me of the problems or leave an example.
If you take a look at the pandas documentation for ‘read_excel’ (https://pandas.pydata.org/docs/reference/api/pandas.read_excel.html), you’ll see that there is an ‘engine’ parameter.
Try the different options and see which one works, since your error is saying that an engine has to be specified manually.
If this is correct, in the future, take the error messages literally and check the documentation
I have tried different URLs (and different ways of obtaining them) and received different binary content. It is either a status-code line (like 403), a warning, or something that looks like a header. So I believe the problem is the URL format.
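One way to check this before calling pd.read_excel() is to look at the first bytes of response.content. The sketch below is a rough diagnostic using only the standard library; sniff_payload is a hypothetical helper name and the returned labels are guesses based on file signatures (.xlsx files are ZIP archives, legacy .xls files use the OLE2 container format), not a full validation.

```python
ZIP_SIGNATURE = b"PK\x03\x04"                         # .xlsx is a ZIP archive
OLE2_SIGNATURE = b"\xd0\xcf\x11\xe0\xa1\xb1\x1a\xe1"  # legacy .xls container


def sniff_payload(data):
    """Guess what kind of content a downloaded payload holds."""
    if data.startswith(ZIP_SIGNATURE):
        return "xlsx"
    if data.startswith(OLE2_SIGNATURE):
        return "xls"
    head = data.lstrip()[:15].lower()
    if head.startswith((b"<!doctype", b"<html")):
        # Typically a login page or an error page, not a workbook.
        return "html"
    return "unknown"


if __name__ == "__main__":
    print(sniff_payload(b"PK\x03\x04rest-of-zip"))
    print(sniff_payload(b"<!DOCTYPE html><html>403 FORBIDDEN</html>"))
```

If the result is "html" or "unknown", the ValueError from pandas most likely means the download did not return the workbook at all, which points back to the URL rather than to the engine parameter.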
Here (github.com/vgrem) I found the answer.
It basically says that for ClientContext you need an absolute URL,
URL = "https://company.sharepoint.com/:x:/r/sites/project"
And for File you need a relative path, but with overlap with the URL:
RELATIVE_PATH = "/sites/project/Shared%20Documents/Folder/Table.xlsx"
The RELATIVE_PATH can be found like this:
Go to the folder of the file in Teams (or on the webpage).
Choose the file, Open in app (Excel).
In Excel, File -> Property, copy the path and adapt to the above format.
Replace Space with "%20".
ctx_auth = AuthenticationContext(URL)
if ctx_auth.acquire_token_for_user(USERNAME, PASSWORD):
    ctx = ClientContext(URL, ctx_auth)
    web = ctx.web
    ctx.load(web)
    ctx.execute_query()
    print("Authentication successful")
else:
    print(ctx_auth.get_last_error())
response = File.open_binary(ctx, RELATIVE_PATH)
bytes_file_obj = io.BytesIO()
bytes_file_obj.write(response.content)
bytes_file_obj.seek(0)
df = pd.read_excel(bytes_file_obj, sheet_name='Sheet2')
Note that if sheet_name=None is passed and the original .xlsx has multiple sheets, pd.read_excel() returns a dict of DataFrames keyed by sheet name rather than a single DataFrame; the default (sheet_name=0) reads only the first sheet.
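The split between an absolute site URL and a server-relative file path can also be derived programmatically. The sketch below uses only the standard library and assumes a direct file URL of the form https://<tenant>.sharepoint.com/sites/<site>/<library>/...; it does not handle sharing links containing markers such as /:x:/s/ or /:x:/r/, and split_sharepoint_file_url is a hypothetical helper name.

```python
from urllib.parse import quote, unquote, urlsplit


def split_sharepoint_file_url(file_url):
    """Split a direct file URL into (site_url, server_relative_path).

    Assumes the file lives under /sites/<site name>/...
    """
    parts = urlsplit(file_url)
    # Decode first so that already-encoded input is not encoded twice.
    segments = [unquote(s) for s in parts.path.split("/") if s]
    if len(segments) < 3 or segments[0] != "sites":
        raise ValueError("expected a path like /sites/<site>/<library>/...")
    site_url = "{0}://{1}/sites/{2}".format(
        parts.scheme, parts.netloc, quote(segments[1]))
    # The relative path repeats the /sites/<site> prefix and encodes spaces as %20.
    relative_path = "/" + "/".join(quote(s) for s in segments)
    return site_url, relative_path


if __name__ == "__main__":
    site, rel = split_sharepoint_file_url(
        "https://company.sharepoint.com/sites/project/Shared Documents/Folder/Table.xlsx")
    print(site)
    print(rel)
```

The second value has the same shape as RELATIVE_PATH above, including the overlap with the site URL and the "%20" encoding.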

Flask send_file() working with pdf but not with html

I'm having trouble with a Flask + Azure app. I have some files saved in storage (PDFs and HTML files) and I need to return these files when the get_file_safe endpoint is invoked. This method takes a file_id parameter, looks it up in the database, fetches the blob from Azure Blob Storage, creates a temporary file and returns that file. When I pass IDs that refer to PDF files, it works perfectly and the file is displayed on the screen. When the ID matches an HTML file, the response is blank. Does anyone have any idea what might be causing this? Thank you very much! (Note: this worked when I used GCP, but I had to migrate, which is why I mention Azure.)
import tempfile
from flask import Flask, flash, jsonify, session, redirect, url_for, escape, request, render_template, send_file
from azure.storage.blob import BlobServiceClient, BlobClient, ContainerClient, __version__, ContentSettings
def get_file_safe():
    # login and security stuff (...) Logic goes here ->>>
    file_id = request.args.get('file_id')
    cursor.execute(
        """SELECT link, mimetype from TABLE where id = %s """, (file_id,))
    rows = cursor.fetchall()
    link = rows[0][0]
    mimetype = rows[0][1]
    filename = link.split("/")[-1]
    print("Filename{}".format(filename))
    print("Mimetype {}".format(mimetype))
    # google cloud version, commented
    #client = storage.Client()
    #bucket = client.get_bucket('BUCKET_NAME')
    #blob = bucket.blob(link)
    #with tempfile.NamedTemporaryFile() as temp:
    #    blob.download_to_filename(temp.name)
    #    return send_file(temp.name, attachment_filename=filename)
    # azure version
    bucket_name = 'BUCKET-NAME'
    blob_service_client = BlobServiceClient.from_connection_string(connect_str)
    blob_client = blob_service_client.get_blob_client(container=bucket_name, blob=link)
    with tempfile.NamedTemporaryFile() as temp:
        temp.write(blob_client.download_blob().readall())
        #return send_file(temp.name, attachment_filename=filename, mimetype=mimetype)
        return send_file(temp.name, download_name=filename)
As you mentioned, only the HTML files fail to display, so I tried writing an HTML blob to a temporary file and opening it in the browser.
With tempfile.NamedTemporaryFile() as temp: I get a blank page.
I then tried with tempfile.NamedTemporaryFile('w', delete=False, suffix='.html') as f: and wrote the data as a string, and the page displayed.
Can you try with tempfile.NamedTemporaryFile('w', delete=False, suffix='.html') as f: for the HTML files?
from azure.storage.blob import BlobServiceClient
import tempfile
import webbrowser
blob_service_client = BlobServiceClient.from_connection_string("Connection String ")
# Initialise container
blob_container_client = blob_service_client.get_container_client("test")
# Get blob
blob_client = blob_container_client.get_blob_client("test.html")
print("downloaded the blob ")
# Download
data = blob_client.download_blob().readall()
print(data)
print(data.decode("utf-8"))
# Getting the blank page
with tempfile.NamedTemporaryFile() as temp:
    url = 'file://' + temp.name
    temp.write(blob_client.download_blob().readall())
    #temp.write(data)
    webbrowser.open(url)
# Getting the page
html = data.decode("utf-8")
with tempfile.NamedTemporaryFile('w', delete=False, suffix='.html') as f:
    url = 'file://' + f.name
    f.write(html)
webbrowser.open(url)
Here is the OUTPUT how it looks
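One plausible explanation for the blank page, which is my assumption and is not confirmed in this thread, is write buffering: a small payload written to a NamedTemporaryFile can still sit in Python's in-memory buffer when another reader opens the file by name, whereas a large PDF exceeds the buffer and reaches the disk. The standard-library sketch below shows the effect by checking the on-disk size before and after flush(); size_on_disk_after_write is a hypothetical helper name.

```python
import os
import tempfile


def size_on_disk_after_write(data, flush):
    """Write data to a NamedTemporaryFile and return the size seen via its path."""
    with tempfile.NamedTemporaryFile() as temp:
        temp.write(data)
        if flush:
            # Push the buffered bytes out so other readers of temp.name see them.
            temp.flush()
        return os.path.getsize(temp.name)


if __name__ == "__main__":
    page = b"<html><body>hello</body></html>"
    print("without flush:", size_on_disk_after_write(page, flush=False))
    print("with flush:   ", size_on_disk_after_write(page, flush=True))
```

If this is what is happening, calling temp.flush() (and temp.seek(0) when passing the file object itself) before handing the file to send_file or webbrowser.open should have the same effect as the delete=False variant, which works because the file is flushed when the with block closes it.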

Upload and delete file in Sharepoint with python automatically

I need to put together the following scenario.
What libraries or frameworks should I use to complete this scenario?
I have basic knowledge of Python.
I found the following way to implement a file upload and delete process in SharePoint with a few lines of Python code.
You will need the two Python libraries 'sharepoint' and 'shareplum'.
To install 'sharepoint': pip install sharepoint
To install 'shareplum': pip install SharePlum
Then you can implement the main code to upload and delete files as follows:
sharepoint.py
from shareplum import Site, Office365
from shareplum.site import Version
import json, os
ROOT_DIR = os.path.dirname(os.path.abspath(__file__))
# os.path.join works on every OS, unlike joining with a hard-coded backslash
config_path = os.path.join(ROOT_DIR, 'config.json')
# read config file
with open(config_path) as config_file:
    config = json.load(config_file)
config = config['share_point']
USERNAME = config['user']
PASSWORD = config['password']
SHAREPOINT_URL = config['url']
SHAREPOINT_SITE = config['site']
SHAREPOINT_DOC = config['doc_library']
class SharePoint:
    def auth(self):
        self.authcookie = Office365(SHAREPOINT_URL, username=USERNAME, password=PASSWORD).GetCookies()
        self.site = Site(SHAREPOINT_SITE, version=Version.v365, authcookie=self.authcookie)
        return self.site
    def connect_folder(self, folder_name):
        self.auth_site = self.auth()
        self.sharepoint_dir = '/'.join([SHAREPOINT_DOC, folder_name])
        self.folder = self.auth_site.Folder(self.sharepoint_dir)
        return self.folder
    def upload_file(self, file, file_name, folder_name):
        self._folder = self.connect_folder(folder_name)
        with open(file, mode='rb') as file_obj:
            file_content = file_obj.read()
        self._folder.upload_file(file_content, file_name)
    def delete_file(self, file_name, folder_name):
        self._folder = self.connect_folder(folder_name)
        self._folder.delete_file(file_name)
I saved the above code in a file named sharepoint.py.
Then I import the SharePoint class from sharepoint.py and use its methods as follows:
updelsharepoint.py
from sharepoint import SharePoint
#i.e - file_dir_path = r'E:\project\file_to_be_uploaded.xlsx'
file_dir_path = r'E:\File_Path\File_Name_with_extension'
# this will be the file name that it will be saved in SharePoint as
file_name = 'File_Name_with_extension'
# The folder in SharePoint that it will be saved under
folder_name = 'SampleUploads'
# upload file
SharePoint().upload_file(file_dir_path, file_name, folder_name)
# delete file
SharePoint().delete_file(file_name, folder_name)
Finally, to configure your email, password, and SharePoint account, create a 'config.json' file as follows.
config.json
{
    "share_point":
    {
        "user": "{email}",
        "password": "{password}",
        "url": "https://{domain}.sharepoint.com",
        "site": "https://{domain}.sharepoint.com/sites/{Name}/",
        "doc_library": "Shared Documents/{Document}"
    }
}
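Since a missing or misspelled key in config.json only shows up as a KeyError at import time, it can help to validate the file up front. The following is a small sketch using only the standard library; the key names come from the config.json above, while load_sharepoint_config and REQUIRED_KEYS are hypothetical names introduced for this example.

```python
import json

# Keys that sharepoint.py reads from the "share_point" section.
REQUIRED_KEYS = ("user", "password", "url", "site", "doc_library")


def load_sharepoint_config(config_path):
    """Load config.json and return its 'share_point' section after a basic check."""
    with open(config_path, encoding="utf-8") as config_file:
        config = json.load(config_file)
    section = config["share_point"]
    # Treat absent and empty values alike as missing.
    missing = [key for key in REQUIRED_KEYS if not section.get(key)]
    if missing:
        raise KeyError("config.json is missing: " + ", ".join(missing))
    return section
```

It could be used in sharepoint.py in place of the manual json.load call, for example config = load_sharepoint_config(config_path).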
I hope this will help you to solve your problem. You can further improve the above sample code and implement a better code than what I shared.
