I have a simple task: a script reads a free-text string that will contain SharePoint URLs. Those URLs are provided by the users, basically copy-pasted from their browsers. What my app has to do is go to those links and check whether there are any files under them.
So from what I can gather, there are many possible SharePoint URLs, for example:
<host>/sites/<site_name>/SitePages/something.aspx - for example a simple post
<host>/:w/r/sites/<site_name>/_layouts/15/something.aspx (like a shortcut URL) - for example a MS Office Word document
<host>/sites/<site_name>/<drive_name>/Forms/something.aspx?[...]&id=%2Fsites%2F<site_name>%2F<drive_name>%2F<path> - a URL to a file tree view of some files on a drive
<host>/:f:/r/sites/<site_name>/<drive_name>/<path_to_a_file>
The last one is perfect, because it contains the path to the directory right in the URL path. The 3rd one has it as well, but in the URL-encoded query params part.
What I do in this scenario is I parse the URL, extracting:
site name
drive name (not ID)
path (from the URL path or from the encoded &id= part)
Then I can connect to SharePoint, get the site, list all the site drives (/drives), and check whether each drive's "web_url" is a substring of my SharePoint URL (I could search for the appropriate drive by name, but what the API returns is the "display name", while my URL contains the "actual drive name"). Okay, so I've got my drive, and now I can get my item by path. All of this can be done via the regular MS Graph API (each step is needed to get the next object's ID - site/drive) or via a Python wrapper (I use python-o365).
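For reference, a rough sketch of that flow with plain requests against the Graph API; the regex, helper names and path handling are my own simplifications, and token acquisition is assumed to happen elsewhere:

import re
import urllib.parse
import requests

GRAPH = "https://graph.microsoft.com/v1.0"

def parse_sharepoint_url(url):
    # Best-effort extraction of host, site name and item path from a pasted link.
    parsed = urllib.parse.urlparse(url)
    path = re.sub(r"^/:[a-z]:?/r", "", urllib.parse.unquote(parsed.path))  # drop /:w:/r style shortcut prefixes
    site = re.search(r"/sites/([^/]+)", path).group(1)
    qs = urllib.parse.parse_qs(parsed.query)
    # Tree-view links keep the path in the URL-encoded id= parameter; otherwise use the URL path itself.
    item_path = qs["id"][0] if "id" in qs else path
    return parsed.netloc, site, item_path

def get_item_from_link(url, token):
    headers = {"Authorization": "Bearer " + token}
    host, site, item_path = parse_sharepoint_url(url)
    # 1) resolve the site ID from host + site name
    site_id = requests.get(f"{GRAPH}/sites/{host}:/sites/{site}", headers=headers).json()["id"]
    # 2) list the site drives and keep the one whose webUrl appears in the pasted link
    drives = requests.get(f"{GRAPH}/sites/{site_id}/drives", headers=headers).json()["value"]
    drive = next(d for d in drives if d["webUrl"] in url or urllib.parse.urlparse(d["webUrl"]).path in item_path)
    # 3) address the item by its path relative to the drive root
    drive_path = urllib.parse.urlparse(drive["webUrl"]).path
    rel = item_path[len(drive_path):].lstrip("/") if item_path.startswith(drive_path) else item_path.lstrip("/")
    return requests.get(f"{GRAPH}/drives/{drive['id']}/root:/{rel}", headers=headers).json()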
As you can see, this is a real pain. Is there a standard way to deal with this? I mean, if I had the site and drive IDs, I could do it in a single API call, but given that I only have a SharePoint link, I can't get those two, right? And what about the URL parsing?
The Link URL in the Object details view of the Google Cloud Storage browser follows this template:
https://storage.cloud.google.com/<project_name>/<file_name>?authuser=0&organizationId=<org_id>
I'm trying to get the exact same link using the Python package google-cloud-storage. Diving into the blob properties, I've found the following (none of which is exactly what I need):
self_link: https://www.googleapis.com/storage/v1/b/<project_name>/o/<file_name>
media_link: https://storage.googleapis.com/download/storage/v1/b/<project_name>/o/<file_name>?generation=<gen_id>&alt=media
Note: if I replace storage.googleapis.com with storage.cloud.google.com in the media_link, I get to download the file as I expect (after being asked for a valid Google Account with the required permissions).
Is there any way to get the link directly from the blob object?
Here is the pattern:
https://storage.googleapis.com/<BUCKET_NAME>/path/to/file
For example, for a bucket my_bucket and a file stored in this path folder_save/2020-01-01/backup.zip, you have this url https://storage.googleapis.com/my_bucket/folder_save/2020-01-01/backup.zip
I think the best approach is to manually generate the URL that you need by replacing the domain of the URL.
In the client library source I couldn't find any reference to a method/property in the Blob class that uses the domain "storage.cloud.google.com".
Even when using the public_url property, the result is a URL that points to googleapis.
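For what it's worth, a minimal sketch of that manual workaround with google-cloud-storage; the bucket and object names are placeholders, and public_url is the property mentioned above, so the domain swap is the manual step:

from google.cloud import storage

client = storage.Client()
bucket = client.bucket("my_bucket")                      # placeholder bucket name
blob = bucket.blob("folder_save/2020-01-01/backup.zip")  # placeholder object path

# public_url points at storage.googleapis.com; swap the domain to get the
# authenticated browser link served from storage.cloud.google.com.
browser_url = blob.public_url.replace(
    "https://storage.googleapis.com", "https://storage.cloud.google.com"
)
print(browser_url)
# https://storage.cloud.google.com/my_bucket/folder_save/2020-01-01/backup.zip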
I am trying to produce a list of files that branch from a certain URL.
Is this possible? Is it possible with Python??
The URL is: https://pace.oregonstate.edu/courses/sites/default/files/resources/pdf/
I want to produce a list of all the pdf files that are at this location.
One PDF is as follows:
https://pace.oregonstate.edu/courses/sites/default/files/resources/pdf/ch01_botany.pdf
How do I produce a list of other PDF files within the folder that contains "ch01_botany.pdf"?
Marshall
The top level link that you gave redirects onto itself so, without any other information to go on, you will not be able to discover the other files within that location.
If you happen to know the scheme that is being used to name the files then you could guess, e.g. ch01_botany.pdf is chapter one of a botany text, so you could guess that another file might be ch02_botany.pdf for the second chapter. However, accessing that URL redirects to a login page, so I guess that most of the content requires you to be a registered user. If you are a registered user, you could login and then perhaps guess, or even see, the list of files.
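If you are logged in (or the files turn out to be publicly reachable), a small sketch of that guessing approach; the chapter range and the chNN_botany.pdf naming scheme are pure assumptions:

import requests

BASE = "https://pace.oregonstate.edu/courses/sites/default/files/resources/pdf/"

found = []
for chapter in range(1, 51):                      # assumed upper bound on chapters
    name = "ch{:02d}_botany.pdf".format(chapter)  # assumed naming scheme
    resp = requests.head(BASE + name, allow_redirects=False)
    if resp.status_code == 200:                   # a redirect to the login page is not a hit
        found.append(BASE + name)

print(found)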
Here's what I want to do: generate a URL that I can put in my WordPress blog which will let users view a big text file. I don't know how I can generate this URL. I was inspired by websites like Flickr, which generate URLs for images, and was hoping there is a corollary for plain text files.
I was taking the MITx 6.00.1x Python course, and one assignment had us refer to a text file that the professor had uploaded onto his course site. So the text file has a url:
https://courses.edx.org/c4x/MITx/6.00.1x_5/asset/words.txt
Not sure if this url is available to non members.
Is there a way I can upload this file to a universal URL that anyone can access for free?
Kind regards,
Spencer
The way this works is by supplying the path to the text file (in this case words.txt) on the web server. When you click that link, you're going from the root through several directories and accessing that file.
If you have access to the actual files on your WordPress blog, you can add a text file there and then give people the path to that file.
Otherwise, use a service such as Pastebin to provide the text file.
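Whichever host you pick, once the file sits at a public URL (e.g. Pastebin's raw view or a file in your WordPress uploads folder), anyone can read it from Python; a minimal sketch with a made-up URL:

import urllib.request

# Hypothetical public URL; use the raw/direct link, not an HTML preview page.
URL = "https://example.com/uploads/words.txt"

with urllib.request.urlopen(URL) as resp:
    words = resp.read().decode("utf-8").split()

print(len(words), "words loaded")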
So I've scraped websites before, but this time I am stumped. I am attempting to search for a person on Biography.com and retrieve his/her biography. But whenever I search the site using urllib2 and query the URL http://www.biography.com/search/, I get a blank page with no data in it.
When I look into the source generated in the browser by clicking View Source, I still do not see any data. When I use Chrome's developer tools, I find some data but still no links leading to the biography.
I have tried changing the User Agent, adding referrers, using cookies in Python but to no avail. If someone could help me out with this task it would be really helpful.
I am planning to use this text for my NLP project and worst case, I'll have to manually copy-paste the text. But I hope it doesn't come to that.
Chrome/Chromium's Developer Tools (or Firebug) is definitely your friend here. I can see that the initial search on Biography's site is made via a call to a Google API, e.g.
https://www.googleapis.com/customsearch/v1?q=Barack%20Obama&key=AIzaSyCMGfdDaSfjqv5zYoS0mTJnOT3e9MURWkU&cx=011223861749738482324%3Aijiqp2ioyxw&num=8&callback=angular.callbacks._0
The search term I used is in the q= part of the query string: q=Barack%20Obama.
This returns JSON, inside of which there is a key "link" whose value is the URL of the article of interest.
"link": "http://www.biography.com/people/barack-obama-12782369"
Visiting that page shows me that this is generated by a request to:
http://api.saymedia-content.com/:apiproxy-anon/content-sites/cs01a33b78d5c5860e/content-customs/#published/#by-custom-type/ContentPerson/#by-slug/barack-obama-12782369
which returns JSON containing HTML.
So, replacing the last part of the link barack-obama-12782369 with the relevant info for the person of interest in the saymedia-content link may well pull out what you want.
To implement:
1. Use urllib2 (or requests) to do the search via their Google API call, using urllib2.urlopen(url) or requests.get(url). Replace Barack%20Obama with a URL-escaped search string, e.g. Bill%20Clinton.
2. Parse the JSON using Python's json module to extract the string that gives you the http://www.biography.com/people link. From this, extract the part of the link of interest (barack-obama-12782369 above).
3. Use urllib2 or requests to make the saymedia-content API request, replacing barack-obama-12782369 after #by-slug/ with whatever you extracted in step 2; i.e. do another urllib2.urlopen on this URL.
4. Parse the JSON from the response of this second request to extract the content you want.
(Caveat: This is provided that there are no session-based strings in those two API calls that might expire.)
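A rough sketch of those steps with requests; the key and cx values are copied from the captured URL above (they may have expired), the callback parameter is dropped so the response comes back as plain JSON, and the '#' characters in the saymedia-content path are percent-encoded so they aren't treated as a URL fragment:

import requests

def get_biography(person):
    # Step 1: the Google Custom Search call observed in the browser.
    search = requests.get(
        "https://www.googleapis.com/customsearch/v1",
        params={
            "q": person,
            "key": "AIzaSyCMGfdDaSfjqv5zYoS0mTJnOT3e9MURWkU",
            "cx": "011223861749738482324:ijiqp2ioyxw",
            "num": 8,
        },
    ).json()
    link = search["items"][0]["link"]        # e.g. http://www.biography.com/people/barack-obama-12782369
    slug = link.rstrip("/").split("/")[-1]   # -> barack-obama-12782369

    # Step 2: the saymedia-content request, with the slug swapped in after #by-slug/.
    content_url = (
        "http://api.saymedia-content.com/:apiproxy-anon/content-sites/"
        "cs01a33b78d5c5860e/content-customs/"
        "%23published/%23by-custom-type/ContentPerson/%23by-slug/" + slug
    )
    return requests.get(content_url).json()

bio = get_biography("Bill Clinton")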
Alternatively, you can use Selenium to visit the website, do the search and then extract the content.
You will most likely need to manually copy and paste, as biography.com is a completely JavaScript-based site that can't be scraped with traditional methods.
You can discover an API URL with HttpFox (a Firefox add-on), e.g. http://www.biography.com/.api/item/search?config=published&query=marx
This returns JSON that you can process, searching for /people/ to retrieve biography links.
Or you can use a screen scraper like Selenium.
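A short sketch of that variant; the endpoint is the one shown above, and since the JSON layout isn't documented here, the /people/ links are simply pulled out of the raw response text:

import re
import requests

resp = requests.get(
    "http://www.biography.com/.api/item/search",
    params={"config": "published", "query": "marx"},
)

# Pull every /people/ link out of the raw response rather than relying on the JSON layout.
links = sorted(set(re.findall(r"/people/[\w-]+", resp.text)))
print(links)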
I have a situation where I need to upload a file to my Dropbox Public Folder and, once it's uploaded, store the file's public URL. I am using Python, and any help on this would be great.
Thanks.
Use this to set up the Python SDK in your program:
https://www.dropbox.com/developers/start/setup#python
This will give you all of the file information:
folder_metadata = client.metadata('/')
I believe you are talking about these short links. Just so you know, every short link from the Public folder is generated only by a special request and has an expiration date.
If you want a permanent link skip to step 2.
STEP 1
This information was taken from: https://www.dropbox.com/developers/reference/api
/shares
DESCRIPTION
Creates and returns a shareable link to files or folders.
Note: Links created by the /shares API call expire after thirty days.
URL STRUCTURE
https://api.dropbox.com/1/shares/<root>/<path>
root The root relative to which path is specified. Valid values are sandbox and dropbox.
path The path to the file or folder you want a shareable link to.
VERSIONS
0, 1
METHOD
POST
PARAMETERS
locale Used to specify language settings for user error messages and other language-specific text. See the notes above for more information about supported locales.
RETURNS
A shareable link to the file or folder. The link can be used publicly and directs to a preview page of the file. Also returns the link's expiration date in Dropbox's usual date format.
Sample JSON return value for a file
{
"url": "http://db.tt/APqhX1",
"expires": "Wed, 17 Aug 2011 02:34:33 +0000"
}
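For reference, a rough sketch of the upload-then-share flow using the legacy v1 Python SDK linked in the other answer (DropboxClient); the token and paths are placeholders:

from dropbox.client import DropboxClient

client = DropboxClient("YOUR_ACCESS_TOKEN")   # placeholder token

# Upload the file...
with open("backup.zip", "rb") as f:
    client.put_file("/backup.zip", f)

# ...then ask for a shareable link (the /shares call described above).
share = client.share("/backup.zip")
print(share["url"], share["expires"])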
If you did step 1 don't do step 2.
STEP 2
/files (GET)
DESCRIPTION
Downloads a file. Note that this call goes to the api-content server.
URL STRUCTURE
https://api-content.dropbox.com/1/files/<root>/<path>
root The root relative to which path is specified. Valid values are sandbox and dropbox.
path The path to the file you want to retrieve.
VERSIONS
0, 1
METHOD
GET
PARAMETER
rev The revision of the file to retrieve. This defaults to the most recent revision.
RETURNS
The specified file's contents at the requested revision.
The HTTP response contains the content metadata in JSON format within an x-dropbox-metadata header.
ERRORS
404 The file wasn't found at the specified path, or wasn't found at the specified rev.
NOTES
This method also supports HTTP Range Retrieval Requests to allow retrieving partial file contents.
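And a similarly hedged sketch for STEP 2, using the legacy SDK's get_file, which wraps the /files (GET) call on api-content.dropbox.com; again, the token and paths are placeholders:

from dropbox.client import DropboxClient

client = DropboxClient("YOUR_ACCESS_TOKEN")   # placeholder token

f = client.get_file("/backup.zip")            # hypothetical path
with open("backup_local.zip", "wb") as out:
    out.write(f.read())
f.close()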
DONE