I am an absolute beginner when it comes to working with REST APIs in Python. We have received a SharePoint URL which has multiple folders, and multiple files inside those folders, in the 'Documents' section. I have been provided an 'app_id' and a 'secret_token'.
I am trying to access the .csv files, read them as dataframes, and perform operations on them.
The code for the operations is ready (I downloaded the .csv files and worked on them locally), but I need help connecting to SharePoint from Python so that I don't have to download such heavy files ever again.
I know there have already been multiple questions about this on Stack Overflow, but none of them got me to where I want.
I did the following and I am unsure of what to do next:
import json
from office365.runtime.auth.user_credential import UserCredential
from office365.sharepoint.client_context import ClientContext
from office365.runtime.http.request_options import RequestOptions
site_url = "https://<company-name>.sharepoint.com"
ctx = ClientContext(site_url).with_credentials(UserCredential("{app_id}", "{secret_token}"))
For site_url above, should I use the whole URL, or is it fine up to ####.com?
This is what I have so far. Next, I want to read files from the respective folders and convert them into dataframes; the files will always be in .csv format.
An example hierarchy of the folders is as follows:
Documents --> Folder A, Folder B
Folder A --> a1.csv, a2.csv
Folder B --> b1.csv, b2.csv
I should be able to move to whichever folder I want and read the files based on my requirement.
Thanks for the help.
This works for me, using a SharePoint app identity with an associated client ID and client secret.
First, I demonstrate authenticating and reading a specific file, then getting a list of files from a folder and reading the first one.
import pandas as pd
import json
import io
from office365.runtime.auth.client_credential import ClientCredential
from office365.sharepoint.client_context import ClientContext
from office365.sharepoint.files.file import File
#Authentication (shown for a 'modern teams site', but I think it should also work for a company.sharepoint.com site):
site="https://<myteams.companyname.com>/sites/<site name>/<sub-site name>"
#Read credentials from a json configuration file:
spo_conf = json.load(open(r"conf\spo.conf", "r"))
client_credentials = ClientCredential(spo_conf["RMAppID"]["clientId"], spo_conf["RMAppID"]["clientSecret"])
ctx = ClientContext(site).with_credentials(client_credentials)
#Read a specific CSV file into a dataframe:
folder_relative_url = "/sites/<site name>/<sub site>/<Library Name>/<Folder Name>"
filename = "MyFileName.csv"
response = File.open_binary(ctx, "/".join([folder_relative_url, filename]))
df = pd.read_csv(io.BytesIO(response.content))
#Get a list of file objects from a folder and read one into a DataFrame:
def getFolderContents(relativeUrl):
    # Return the File objects for all files in the library ("FSObjType eq 0" filters out folders)
    contents = []
    library = ctx.web.get_list(relativeUrl)
    all_items = library.items.filter("FSObjType eq 0").expand(["File"]).get().execute_query()
    for item in all_items:  # type: ListItem
        cur_file = item.file
        contents.append(cur_file)
    return contents
fldrContents = getFolderContents('/sites/<site name>/<sub site>/<Library Name>')
response2 = File.open_binary(ctx, fldrContents[0].serverRelativeUrl)
df2 = pd.read_csv(io.BytesIO(response2.content))
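If you want every CSV returned for the library as a dataframe (matching the Folder A / Folder B layout from the question), a small loop over the returned file objects works. A sketch, keyed by each file's server-relative URL and assuming the files are all .csv as in the question:
# Read every CSV returned by getFolderContents into a dict of DataFrames,
# keyed by each file's server-relative URL
dataframes = {}
for f in getFolderContents('/sites/<site name>/<sub site>/<Library Name>'):
    resp = File.open_binary(ctx, f.serverRelativeUrl)
    dataframes[f.serverRelativeUrl] = pd.read_csv(io.BytesIO(resp.content))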
Some References:
Related SO thread.
Office365 library github site.
Getting a list of contents in a doc library folder.
Additional notes following up on comments:
The site path does not include the full URL for the site home page (ending in .aspx) - it just ends with the name of the site (or sub-site, if relevant to your case), e.g. https://<company-name>.sharepoint.com/sites/<site name>.
You don't need to use a configuration file to store your authentication credentials for the SharePoint application identity - you could just replace spo_conf["RMAppID"]["clientId"] with the value of the SharePoint-generated client ID, and do similarly for the client secret. But this is a simple example of what the text of such a JSON file could look like:
{
    "MyAppName": {
        "clientId": "my-client-id",
        "clientSecret": "my-client-secret",
        "title": "name_for_application"
    }
}
My problem is the following:
I am sending queries via the Google Drive API that fetch all files matching certain criteria. I won't post the entire code here, as it's quite extensive, but the criterion is simply to get all files that reside in folders with a certain name (for example: "I want all files that reside in folders whose name contains the string 'meet'").
The code I have written for this particular part, is the following:
import json
import environ
import os
import google.auth
import io
from apiclient import discovery
from httplib2 import Http
from google.cloud import secretmanager
from googleapiclient.http import MediaIoBaseDownload
from oauth2client.service_account import ServiceAccountCredentials
# Imported functions from a local file. Just writing to database and establishing connection
from firestore_drive import add_file, establish_db_connection
.... some other code here ...
def update_files_via_parent_folder(self, parent_id, parent_name):
    page_token = None
    # Set a query that fetches all files based on the ID of their parent folder
    # E.g. "get all files from the folder whose ID is parent_id"
    query = f"'{parent_id}' in parents"
    response = self.execute_query(query, page_token)
    files = response.get('files', [])
    while True:
        # Iterate over all files returned for the current page of results
        for file in files:
            file_id = file['id']
            filename = file['name']
            # Request the current file from Drive, and download it through a byte stream
            request = self.service.files().get_media(fileId=file_id)
            fh = io.BytesIO()
            downloader = MediaIoBaseDownload(fh, request)
            done = False
            while done is False:
                # Download the next chunk of the file from Drive
                status, done = downloader.next_chunk()
            # Convert the downloaded bytes to JSON (a dictionary)
            prefab_json = json.loads(fh.getvalue())
            # Find the proper collection name and then add the file to the database
            collection_name = next(type_name for type_name in self.possible_types if type_name in parent_name)
            add_file(self.db, collection_name, filename, file_content=prefab_json)
        # Find out if there are more files to download in the same folder
        page_token = response.get('nextPageToken', None)
        if page_token is None:
            if len(files) == 0:
                print('Folder found, but contained no files.')
            break
        response = self.execute_query(query, page_token)
        files = response.get('files', [])

def execute_query(self, query, page_token):
    """
    Helper function for executing a query to Google Drive. Implemented as a function due to repeated usage.
    """
    return self.service.files().list(
        q=query,
        spaces='drive',
        fields='nextPageToken, files(id, name)',
        pageToken=page_token).execute()
Now my question is this:
Is there a way to download the files asynchronously or in parallel in the following section?
for file in files:
    file_id = ...
    filename = ...
    # Same as above; start download and write to database...
For reference, the point of the code is to extract files that are located on Google Drive and copy them over to another database. I'm not concerned with local storage, only fetching from Drive and writing to the database (if this is even possible to do in parallel).
I've tried various options such as multiprocessing.Pool, multiprocessing.pool.ThreadPool, and asyncio, but I'm not sure I actually used them correctly; a thread-pool version of what I have in mind is sketched below. I can also mention that the database used is Firestore.
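For illustration, here is a minimal sketch of the thread-pool idea, under my own assumptions (download_and_store is a hypothetical helper wrapping the per-file logic above, and the worker count is arbitrary). One caveat I'm aware of: the googleapiclient service object is built on httplib2, which is not thread-safe, so each worker should ideally build its own service or http object.
from concurrent.futures import ThreadPoolExecutor, as_completed

def download_and_store(self, file, parent_name):
    # Hypothetical helper: the same per-file download + Firestore write as in the loop above
    request = self.service.files().get_media(fileId=file['id'])
    fh = io.BytesIO()
    downloader = MediaIoBaseDownload(fh, request)
    done = False
    while not done:
        _, done = downloader.next_chunk()
    prefab_json = json.loads(fh.getvalue())
    collection_name = next(t for t in self.possible_types if t in parent_name)
    add_file(self.db, collection_name, file['name'], file_content=prefab_json)

# Inside update_files_via_parent_folder, replacing the sequential for-loop:
with ThreadPoolExecutor(max_workers=8) as pool:
    futures = [pool.submit(self.download_and_store, f, parent_name) for f in files]
    for future in as_completed(futures):
        future.result()  # re-raises any exception from a worker thread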
Additional note: the reason I want to do this is that the sequential operation is extremely slow, and I want to deploy it as a cloud function (which has a maximum time limit of 540 seconds, i.e. 9 minutes).
Any feedback is welcome :)
I am trying to save data pulled from a PostgreSQL db into a designated MS SharePoint folder. To do so, I first retrieved the data from the local db; now I need to store/save it in the SharePoint folder. I tried using the office365 API to do this, but no data was saved in the SharePoint folder. Does anyone have similar experience doing this in Python? Is there any workaround? Any thoughts?
My current attempt:
First, I pulled the data from the local PostgreSQL db as follows:
from sqlalchemy import create_engine
import pandas as pd
import os.path
hostname = 'localhost'
database_name = 'postgres'
user = 'kim'
pw = 'password123'
engine = create_engine('postgresql+psycopg2://'+user+':'+pw+'@'+hostname+'/'+database_name)
sql = """ select * from mytable """
with engine.connect() as conn:
    df = pd.read_sql_query(sql, con=conn)
then, I tried to store/save the data to designated sharepoint folder as follow:
from office365.runtime.auth.authentication_context import AuthenticationContext
from office365.sharepoint.client_context import ClientContext
from office365.sharepoint.files.file import File
url_shrpt = 'https://xyzcompany.sharepoint.com/_layouts/15/sharepoint.aspx?'
username_shrpt = 'kim@xyzcompany.com'
password_shrpt = 'password123'
folder_url_shrpt = 'https://xyzcompany.sharepoint.com/:f:/g/EnIh9jxkDVpOsTnAUbo-LvIBdsN0X_pJifX4_9Rx3rchnQ'
ctx_auth = AuthenticationContext(url_shrpt)
if ctx_auth.acquire_token_for_user(username_shrpt, password_shrpt):
    ctx = ClientContext(url_shrpt, ctx_auth)
    web = ctx.web
    ctx.load(web)
    ctx.execute_query()
else:
    print(ctx_auth.get_last_error())

response = File.open_binary(ctx, df)
with open("Your_Offline_File_Path", 'wb') as output_file:
    output_file.write(response.content)
But the file was not saved in the SharePoint folder. How should I save the data from PostgreSQL to the SharePoint folder using Python? Is there any workaround? Any thoughts?
Objective:
I want to write the data pulled from the PostgreSQL db to a SharePoint folder. The attempt above didn't save any data there. Can anyone suggest a possible way of doing this?
I think you should write the CSV files locally first, then try the following to upload them to the SharePoint folder:
from shareplum import Site
from shareplum import Office365
from shareplum.site import Version
from pathlib import Path
import os

UN = "myself@xyzcompany.com"
PW = "hello#"
authcookie = Office365('https://xyzcompany.sharepoint.com', username=UN, password=PW).GetCookies()
site = Site('https://xyzcompany.sharepoint.com/sites/sample_data/', version=Version.v365, authcookie=authcookie)
folder = site.Folder('Shared Documents/New Folder')

files = Path(os.getcwd()).glob('*.csv')
for file in files:
    with open(file, mode='rb') as rowFile:
        fileContent = rowFile.read()
    folder.upload_file(fileContent, os.path.basename(file))
This is an error-free, working solution for uploading files to a SharePoint folder.
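Since the question starts from a pandas dataframe, the local CSV can be produced directly from it before running the upload loop above (the file name here is just an example):
# Write the dataframe pulled from PostgreSQL to a local CSV so the loop above can upload it
df.to_csv('mytable.csv', index=False)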
A slight variation on what @Jared had above, for example if one wants to create a folder based on a date and upload files to it from a location other than the root folder on the user's computer. This will be handy for people interested in such a solution, a problem I had.
from shareplum import Site
from shareplum import Office365
from shareplum.site import Version
import pendulum  # Install it for manipulation of dates
import glob
import os

todaysdate = pendulum.now()  # Get today's date
foldername1 = todaysdate.strftime('%d-%m-%Y')  # Folder name in a format such as 19-06-2021

UN = "myself@xyzcompany.com"
PW = "hello#"
path = r"C:\xxxx\xxx\xxx"  # Path where the files to be uploaded are stored
doc_library = "xxxxx/yyyy"  # Folder where the new folder (foldername1) will be stored

authcookie = Office365('https://xyzcompany.sharepoint.com', username=UN, password=PW).GetCookies()
site = Site('https://xyzcompany.sharepoint.com/sites/sample_data/', version=Version.v365, authcookie=authcookie)
folder = site.Folder(doc_library + '/' + foldername1)  # Creates the new folder matching today's date

files = glob.glob(path + "\\*.csv")
for file in files:
    with open(file, mode='rb') as rowFile:
        fileContent = rowFile.read()
    folder.upload_file(fileContent, os.path.basename(file))
That's a solution that will work well for anyone looking around for such code.
I'm currently using shareplum and was able to do the download using the code below:
from shareplum import Site
from shareplum import Office365
from shareplum.site import Version
import csv
authcookie = Office365('https://bboxxeng.sharepoint.com/', username='---', password='---').GetCookies()
site = Site('https://bboxxeng.sharepoint.com/sites/TESTIAN', version=Version.v2016, authcookie=authcookie)
folder = site.Folder('Shared%20Documents/Test')
data = folder.get_file('Office ss E1.csv')
with open('asas.csv', 'wb') as f:
    f.write(data)
I tried using list_data = sp_list.GetListItems() but had no luck extracting the file names. I've also read around and tried googling, but still no luck.
I understand you want to list all the files in a folder so that you can download them or do other modifications via the file name. If so, you can get them via the folder's files attribute:
folder = site.Folder('Shared Documents/test')
allfiles= folder.files
print(allfiles)
Update:
The result contains the file name and other properties.
For example, to get the name of the first file from the returned result:
allfiles= folder.files
demofile= allfiles[0]
print(demofile['Name'])
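Putting the two together (listing the folder, then downloading each file by name with get_file as in the question), a short sketch that saves every listed file locally under its own name:
# Download every file listed in the folder, writing each one to the local working directory
for f in folder.files:
    data = folder.get_file(f['Name'])
    with open(f['Name'], 'wb') as out:
        out.write(data)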
I am a Python developer and somewhat new to using Google's Gmail API to import .eml files into a Gmail account.
I've gotten all of the groundwork done getting my OAuth credentials working, etc.
However, I am stuck where I load in the data file. I need help loading the message data into a variable.
How do I create the message_data variable - in the appropriate format - from my sample email file (which is stored in rfc822 format) on disk?
Assuming I have a file on disk at /path/to/file/sample.eml, how do I load it into message_data in the proper format for the Gmail API import call?
...
# how do I properly load message_data from the rfc822 disk file?
media = MediaIoBaseUpload(message_data, mimetype='message/rfc822')
message_response = service.users().messages().import_(
userId='me',
fields='id',
neverMarkSpam=True,
processForCalendar=False,
internalDateSource='dateHeader',
media_body=media).execute(num_retries=2)
...
You want to import an eml file using the Gmail API.
You have already been able to get and put values with the Gmail API.
You want to achieve this using google-api-python-client.
service in your script can be used for uploading the eml file.
If my understanding is correct, how about this answer? Please think of this as just one of several possible answers.
Modification point:
In this case, the method of "Users.messages: insert" is used.
Modified script:
Before you run the script, please set the filename with the path of the eml file.
eml_file = "###" # Please set the filename with the path of the eml file.
user_id = "me"
f = open(eml_file, "r", encoding="utf-8")
eml = f.read()
f.close()
message_data = io.BytesIO(eml.encode('utf-8'))
media = MediaIoBaseUpload(message_data, mimetype='message/rfc822', resumable=True)
metadata = {'labelIds': ['INBOX']}
res = service.users().messages().insert(userId=user_id, body=metadata, media_body=media).execute()
print(res)
In the above script, the following modules are also required.
import io
from googleapiclient.http import MediaIoBaseUpload
Note:
In the above modified script, {'labelIds': ['INBOX']} is used as the metadata. In this case, the imported eml file can be seen in the INBOX of Gmail. If you want to change this, please modify it accordingly.
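If you specifically want the import_ endpoint from the question rather than insert, the same media object can, as far as I know, be passed there as well; a sketch reusing the variables above and the parameters from the question's snippet:
res = service.users().messages().import_(
    userId=user_id,
    body=metadata,
    internalDateSource='dateHeader',
    neverMarkSpam=True,
    processForCalendar=False,
    media_body=media).execute()
print(res)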
Reference:
Users.messages: insert
If I misunderstood your question and this was not the result you want, I apologize.
I'm looking for a way to use the Azure Files SDK to upload files to my Azure Databricks blob storage.
I tried many things using functions from this page, but nothing worked. I don't understand why.
Example:
file_service = FileService(account_name='MYSECRETNAME', account_key='mySECRETkey')
generator = file_service.list_directories_and_files('MYSECRETNAME/test') #listing file in folder /test, working well
for file_or_dir in generator:
    print(file_or_dir.name)
file_service.get_file_to_path('MYSECRETNAME','test/tables/input/referentials/','test.xlsx','/dbfs/FileStore/test6.xlsx')
where test.xlsx is the name of the file in my Azure file share, and /dbfs/FileStore/test6.xlsx is the path where the file should be uploaded in my DBFS system.
I get the error message:
Exception=The specified resource name contains invalid characters
I tried changing the name, but that doesn't seem to work.
Edit: I'm not even sure the function is doing what I want. What is the best way to load a file from Azure Files?
From my experience, I think the best way to load a file from Azure Files is to read it directly via its URL with a SAS token.
For example, as shown in the figures below, there is a file named test.xlsx in my test file share; I viewed it using Azure Storage Explorer and then generated its URL with a SAS token.
Fig 1. Right-click the file, then click Get Shared Access Signature...
Fig 2. You must select the Read permission option to read the file content directly.
Fig 3. Copy the URL with the SAS token
Here is my sample code; you can run it with the SAS token URL of your file in your Azure Databricks.
import pandas as pd
url_sas_token = 'https://<my account name>.file.core.windows.net/test/test.xlsx?st=2020-01-27T10%3A16%3A12Z&se=2020-01-28T10%3A16%3A12Z&sp=rl&sv=2018-03-28&sr=f&sig=XXXXXXXXXXXXXXXXX'
# Directly read the file content from its url with sas token to get a pandas dataframe
pdf = pd.read_excel(url_sas_token)
# Then, to convert the pandas dataframe to a PySpark dataframe in Azure Databricks
df = spark.createDataFrame(pdf)
Alternatively, to use the Azure File Storage SDK to generate the URL with a SAS token for your file, or to get the bytes of your file for reading, please refer to the official document Develop for Azure Files with Python and my sample code below.
# Create a client of Azure File Service as same as yours
from azure.storage.file import FileService
account_name = '<your account name>'
account_key = '<your account key>'
share_name = 'test'
directory_name = None
file_name = 'test.xlsx'
file_service = FileService(account_name=account_name, account_key=account_key)
To generate the SAS token URL of a file:
from azure.storage.file import FilePermissions
from datetime import datetime, timedelta
sas_token = file_service.generate_file_shared_access_signature(share_name, directory_name, file_name, permission=FilePermissions.READ, expiry=datetime.utcnow() + timedelta(hours=1))
url_sas_token = f"https://{account_name}.file.core.windows.net/{share_name}/{file_name}?{sas_token}"
import pandas as pd
pdf = pd.read_excel(url_sas_token)
df = spark.createDataFrame(pdf)
Or use the get_file_to_stream function to read the file content:
from io import BytesIO
import pandas as pd
stream = BytesIO()
file_service.get_file_to_stream(share_name, directory_name, file_name, stream)
pdf = pd.read_excel(stream)
df = spark.createDataFrame(pdf)
Just as an addition to @Peter Pan's answer, here is an alternative approach without using pandas, with the Python azure-storage-file-share library.
Very detailed documentation: https://pypi.org/project/azure-storage-file-share/#downloading-a-file
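A minimal sketch of that approach, under my own assumptions (share and file names taken from the example above, authentication via a storage-account connection string, and the DBFS target path from the question):
from azure.storage.fileshare import ShareFileClient

# Assumption: authenticate with the storage account's connection string
conn_str = "<your storage account connection string>"
file_client = ShareFileClient.from_connection_string(
    conn_str, share_name="test", file_path="test.xlsx")

# Stream the file from the Azure file share and write it to DBFS
with open("/dbfs/FileStore/test6.xlsx", "wb") as fh:
    downloaded = file_client.download_file()
    downloaded.readinto(fh)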