Download from private GitLab without API - python

I need to download a file from a private GitLab.
I already saw this post:
Download from a GitLab private repository
But I cannot use the API, since I don't have the IDs needed to download the files.
In fact, I need to download them by their raw HTTP URLs, like:
http://gitlab.private.com/group/repo_name/raw/master/diagrams/test.plantuml
Since I turned on authentication, every time I try to access something programmatically I am redirected to the login page.
I wrote a Python script to mimic the login process and obtain the authenticity_token and the _gitlab_session cookie, but it still doesn't work.
If I grab a session cookie from my Chrome browser after a successful login, everything works like a charm (from the file download perspective) in Python and even curl.
So, any help is appreciated, either to obtain this cookie or with a different approach. To use the API I would first have to trawl through all the repos doing string matches to find the proper IDs. That is the last option.
Thanks,
Marco

First generate a Personal Access Token from your settings page: /profile/personal_access_tokens. Make sure it has the read_repository scope.
Then you can use one of two methods, replacing PRIVATETOKEN with the token you acquired:
Pass a Private-Token header in your request. E.g.
curl --header "Private-Token: PRIVATETOKEN" http://gitlab.private.com/group/repo_name/raw/master/diagrams/test.plantuml
Add a private_token query string to your request. E.g.
curl 'http://gitlab.private.com/group/repo_name/raw/master/diagrams/test.plantuml?private_token=PRIVATETOKEN'
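If you are calling this from Python rather than curl, a minimal sketch with the requests library might look like the following (the URL is the one from the question and PRIVATETOKEN is a placeholder for your token):
import requests

RAW_URL = "http://gitlab.private.com/group/repo_name/raw/master/diagrams/test.plantuml"
PRIVATE_TOKEN = "PRIVATETOKEN"  # personal access token with read_repository scope

# Send the token as the Private-Token header instead of a session cookie.
response = requests.get(RAW_URL, headers={"Private-Token": PRIVATE_TOKEN})
response.raise_for_status()

with open("test.plantuml", "wb") as fh:
    fh.write(response.content)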

Related

Artifactory REST query returns URIs which don't work in a browser

I want a list of URLs that will download artifacts from Artifactory. The list of URLs should come from a REST query.
I am successfully calling the Artifactory (3.2.2) REST API for various data I need. In one case, I am doing a property search, searching for artifacts with an "application_name" and "release_version". These properties were added by TeamCity when the artifacts were deployed. I can successfully search Artifactory using the Artifactory Property Search tool in the web console, and I can successfully search with those same terms from my python script.
The REST call returns json. Within that json is an array of dicts, and each of those is a {uri: url}. All good, but not quite.
The URL returns a 404 when pasted into a web browser. By walking through the URL, I discover that the /api/storage part is what's throwing off the browser. I suspect that's because this URI is not meant for browsers, but for another REST query. Sheesh.
The documentation is not clear on this. It sure seems like I should be able to get a proper browser URL from a REST call.
Example URL: "http://ourserver.org:8081/artifactory/api/storage/our-releases/com/companyname/Training/1.7.4/Training.ipa"
It's easy to replace "/api/storage" with "/simple" in that URL string and that makes the URL work in a browser. I just think it's an ugly solution. I mostly think I'm missing something, perhaps obvious.
Suggestions welcome!
OK, got it figured.
The Artifactory REST call for Property Search does indeed return REST like URIs. There's a different REST call for File Info. The results from the File Info call include a downloadUri.
So, I first have to use the Property Search REST call, take the results from that, massage the resulting URI a bit to get the file path. That path is like "com/companyname/appgroup/artifactname/version/filename". I used urllib.parse to help with that, but still had to massage the result a bit as it still included "artifactory/repo-key".
I use the repo key and file path to make another REST call for File Info. The results of that include the downloadUri.
This seems like the long way around, but it works. It's still not as elegant as I'd like. Two REST calls take some time to finish.
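For what it's worth, a rough sketch of that two-call approach with the requests library might look like this; the server URL and property names from the question are used as placeholders, and since each search result URI already points at the File Info endpoint, the second call can simply fetch it directly:
import requests

BASE = "http://ourserver.org:8081/artifactory"

# 1) Property search returns storage-API ("REST-like") URIs.
search = requests.get(
    BASE + "/api/search/prop",
    params={"application_name": "Training", "release_version": "1.7.4"},
)
search.raise_for_status()

for result in search.json().get("results", []):
    # 2) Each result URI is a File Info endpoint; its JSON includes the
    #    browser-friendly downloadUri.
    info = requests.get(result["uri"]).json()
    print(info["downloadUri"])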
As explained in the Artifactory REST API documentation, optional headers can be used to request extra information about the found artifacts. For instance, you could try adding the header 'X-Result-Detail: info', which will return the downloadUri property as well:
$ wget --header 'X-Result-Detail: info' 'http://<URL>/artifactory/api/search/artifact?name=dracut-fips' -O /dev/stdout
$ curl -X GET -H 'X-Result-Detail: info' 'http://<URL>/artifactory/api/search/artifact?name=dracut-fips'
Using either of the above commands you can get the downloadUri, and you can even parse it easily with Python:
$ read -r -d '' py_get_rpm_packages << EOF
import sys, json
for obj in json.load(sys.stdin)["results"]:
    print(obj["downloadUri"])
EOF
export py_get_rpm_packages
$ wget --header 'X-Result-Detail: info' 'http://<URL>/artifactory/api/search/artifact?name=dracut-fips' -O /dev/stdout | python -c "$py_get_rpm_packages"
https://<URL>/artifactory/centos7/7.1-os/dracut-fips-aesni-033-240.el7.x86_64.rpm
https://<URL>/artifactory/centos63-dvd/dracut-fips-aesni-004-283.el6.noarch.rpm
Artifactory REST API:
https://www.jfrog.com/confluence/display/RTF/Artifactory+REST+API

Error when POSTing from python requests library; does not occur through alfresco share UI

Using OOTB Alfresco 5 Community Edition running on Ubuntu 14.04.
Steps:
Create site through the share UI.
Copy request as curl from Chromium developer tools.
Reconstructed the request with the Python requests library as:
s = requests.post('http://<IP>:8080/share/service/modules/create-site', data=site_data, cookies=THE_cookie)
Where THE_cookie was obtained via a POST to http://<IP>:8080/share/page/dologin, which gave a 200, and site_data has different names from the site created through the Share UI.
That request gave a 500 error stating that
u'freemarker.core.InvalidReferenceException: The following has evaluated to null or missing:\n==> success [in template "org/alfresco/modules/create-site.post.json.ftl" at line 2, column 17]\n\nTip: If the failing expression is known to be legally null/missing, either specify a default value with myOptionalVar!myDefault, or use <#if myOptionalVar??>when-present<#else>when-missing</#if>. (These only cover the last step of the expression; to cover the whole expression, use parenthessis: (myOptionVar.foo)!myDefault, (myOptionVar.foo)??\n\nThe failing instruction:\n==> ${success?string} [in template "org/alfresco/modules/create-site.post.json.ftl" at line 2, column 15]', ...
When in Chromium, there is no response, but a site is created successfully.
I've also not got the curl request from the command line to work -- it needs the CSRF token removed, then gives a 200 and does nothing; no logs. My understanding is that Alfresco always gives a 200 on a successful request regardless of whether it's a GET or POST.
If anyone has any ideas that would be amazing. There doesn't seem to be anything that we can do to get create-site to work outside of the share UI, but we absolutely need it to do so.
Since the script is expecting JSON, you need to set the HTTP header "Content-Type: application/json".
Have a look at Requests session objects, which are designed to persist session cookies between requests (like a browser does). You can try an approach similar to this:
s = requests.session()
s.post('http://<IP>:8080/share/page/dologin', data=login_data)
r = s.post('http://<IP>:8080/share/service/modules/create-site', data=site_data)
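Putting the two suggestions together (a session for the login cookie, plus a JSON body with a matching Content-Type header), a sketch might look like this; login_data and site_data are placeholders for the payloads you already captured from the Share UI:
import json
import requests

s = requests.session()
s.post('http://<IP>:8080/share/page/dologin', data=login_data)

# create-site expects JSON, so serialize the payload and say so in the header.
r = s.post(
    'http://<IP>:8080/share/service/modules/create-site',
    data=json.dumps(site_data),
    headers={'Content-Type': 'application/json'},
)
# Depending on the Share CSRF filter configuration, the CSRF token mentioned
# in the question may also need to be passed along.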

Download from Megaupload with login - Python

It's my first question here.
Today I put together a little application using wxPython: a simple Megaupload downloader, but it doesn't support premium accounts yet.
Now I would like to know how to download from MU with a login (free or premium user).
I'm very new to Python, so please keep it simple rather than too "professional".
I used to download files with urlretrieve, but is there a way to pass "arguments" or something so I can log in as a premium user?
Thank you. :D
EDIT:
News: new help needed xD
After trying PycURL, httplib2 and mechanize, I've managed the login with urllib2 and cookiejar (the returned HTML shows my username).
But when I start downloading a file, the server apparently doesn't keep my login; in fact the downloaded file seems corrupted (I changed the wait time from 45 to 25 seconds).
How can I download a file from MegaUpload while keeping the login I've already done? Thanks for your patience :D
Questions like this are usually frowned upon; they are very broad, and there is already an abundance of answers if you just search on Google.
You can use urllib, or mechanize, or any library you can make an HTTP POST request with.
Megaupload's login form looks to have these values:
login:1
redir:1
username:
password:
just POST those values to http://megaupload.com/?c=login
all you should have to do is set your username and password to the correct values!
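As a minimal sketch of that POST (using the requests library here; any HTTP client with cookie support would do), with the form field names taken from the list above and the credentials and download URL left as placeholders:
import requests

session = requests.Session()

# Log in; the session keeps whatever cookies Megaupload sets on success.
session.post(
    "http://megaupload.com/?c=login",
    data={"login": "1", "redir": "1", "username": "<your username>", "password": "<your password>"},
)

# Later requests made through the same session reuse the login cookies.
response = session.get("<file page URL>")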
For logging in using Python, follow these steps.
Find the list of parameters to be sent in the POST request, and the URL the request has to be made to, by viewing the source of the login form. You may use a browser with an "Inspect Element" feature to find them easily [parameter name examples: userid, password]. Just check the tags' name attributes.
Most sites set a cookie on logging in, and the cookie has to be sent along with subsequent requests. To handle this, download httplib2 (http://code.google.com/p/httplib2/) and read the wiki page at that link. It shows how to log in, with examples.
Now you can make subsequent requests for files; the cookies etc. will be handled automatically by httplib2.
I do a lot of web stuff with Python; I prefer using pycurl, which you can get here.
It is very simple to post data and log in with curl. I've used it across many languages such as PHP, Python, and C++. Hope this helps.
You can use urllib; this is a good example.

Using Python to download a document that's not explicitly referenced in a URL

I wrote a web crawler in Python 2.6 using the Bing API that searches for certain documents and then downloads them for classification later. I've been using string methods and urllib.urlretrieve() to download results whose URL ends in .pdf, .ps etc., but I run into trouble when the document is 'hidden' behind a URL like:
http://www.oecd.org/officialdocuments/displaydocument/?cote=STD/CSTAT/WPNA(2008)25&docLanguage=En
So, two questions. Is there a way, in general, to tell whether a URL links to a PDF/DOC etc. file when it doesn't say so explicitly (e.g. www.domain.com/file.pdf)? And is there a way to get Python to snag that file?
Edit:
Thanks for the replies, several of which suggest downloading the file to see if it's of the correct type. The only problem is... I don't know how to do that (see question #2 above). urlretrieve(<above url>) gives only an HTML file with an href containing that same URL.
There's no way to tell from the URL what it's going to give you. Even if it ends in .pdf it could still give you HTML or anything it likes.
You could do a HEAD request and look at the content-type, which, if the server isn't lying to you, will tell you if it's a PDF.
Alternatively you can download it and then work out whether what you got is a PDF.
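A small sketch of that HEAD-then-download idea, assuming the requests library (and, as noted, trusting the server's Content-Type header); the URL is the one from the question, and as the next answer explains, this particular server may still redirect in a loop until you enable cookies:
import requests

url = "http://www.oecd.org/officialdocuments/displaydocument/?cote=STD/CSTAT/WPNA(2008)25&docLanguage=En"

# HEAD fetches only the headers, not the document body.
head = requests.head(url, allow_redirects=True)
content_type = head.headers.get("Content-Type", "")

if "application/pdf" in content_type:
    # Looks like a PDF, so download the body and save it.
    with open("document.pdf", "wb") as fh:
        fh.write(requests.get(url).content)
else:
    print("Server reports Content-Type:", content_type)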
In this case, what you refer to as "a document that's not explicitly referenced in a URL" seems to be what is known as a "redirect". Basically, the server tells you that you have to get the document at another URL. Normally, python's urllib will automatically follow these redirects, so that you end up with the right file. (and - as others have already mentioned - you can check the response's mime-type header to see if it's a pdf).
However, the server in question is doing something strange here. You request the url, and it redirects you to another url. You request the other url, and it redirects you again... to the same url! And again... And again... At some point, urllib decides that this is enough already, and will stop following the redirect, to avoid getting caught in an endless loop.
So how come you are able to get the pdf when you use your browser? Because apparently, the server will only serve the pdf if you have cookies enabled. (why? you have to ask the people responsible for the server...) If you don't have the cookie, it will just keep redirecting you forever.
(check the urllib2 and cookielib modules to get support for cookies, this tutorial might help)
At least, that is what I think is causing the problem. I haven't actually tried doing it with cookies yet. It could also be that the server does not "want" to serve the pdf because it detects you are not using a "normal" browser (in which case you would probably need to fiddle with the User-Agent header), but that would be a strange way of doing it. So my guess is that it is using a "session cookie" somewhere, and if you haven't got one yet, it keeps trying to redirect you.
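A minimal sketch of that cookie-enabled fetch, using the urllib2 and cookielib modules mentioned above (Python 2, to match the question); whether it actually satisfies this particular server is untested, as said:
import cookielib
import urllib2

url = "http://www.oecd.org/officialdocuments/displaydocument/?cote=STD/CSTAT/WPNA(2008)25&docLanguage=En"

# An opener with a cookie jar keeps the session cookie across redirects,
# so the redirect chain can finally end at the document itself.
cookie_jar = cookielib.CookieJar()
opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cookie_jar))

response = opener.open(url)
with open("document.pdf", "wb") as fh:
    fh.write(response.read())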
As has been said, there is no way to tell the content type from the URL. But if you don't mind fetching the headers for every URL you can do this:
import urllib

obj = urllib.urlopen(URL)
headers = obj.info()
if headers['Content-Type'].find('pdf') != -1:
    # we have a pdf file, download the whole thing
    ...
This way you won't have to download each URL, just its headers. It's still not exactly saving network traffic, but you won't get better than that.
Also, you should use MIME types instead of my crude find('pdf').
No. It is impossible to tell what kind of resource is referenced by a URL just by looking at it. It is totally up to the server to decide what it gives you when you request a certain URL.
Check the MIME type with the info() method of the object urllib returns. This might not be 100% accurate; it really depends on what the site returns as a Content-Type header. If it's well behaved it'll return the proper MIME type.
A PDF should return application/pdf, but that may not be the case.
Otherwise you might just have to download it and try it.
You can't see it from the url directly. You could try to only download the header of the HTTP response and look for the Content-Type header. However, you have to trust the server on this - it could respond with a wrong Content-Type header not matching the data provided in the body.
To detect the file type in Python 3.x (for example in a web app) from a URL where the file may have no extension or a fake extension, you can use python-magic. Install it with
pip3 install python-magic
For Mac OS X, you should also install libmagic using
brew install libmagic
Code snippet
import magic
from urllib.request import Request, urlopen

url = "http://...url to the file ..."
request = Request(url)
response = urlopen(request)

# The first couple of kilobytes are enough for libmagic; pass mime=True as
# well if you want "application/pdf" rather than a textual description.
mime_type = magic.from_buffer(response.read(2048))
print(mime_type)

Programmatic login and use of non-api-supported Google services

Google provides APIs for a number of their services and bindings for several languages. However, not everything is supported. So this question comes from my incomplete understanding of things like wget, curl, and the various web programming libraries.
How can I authenticate programmatically to Google?
Is it possible to leverage the existing APIs to gain access to the unsupported parts of Google?
Once I have authenticated, how do I use that to access my restricted pages? It seems like the API could be used to do the login and get a token, but I don't understand what I'm supposed to do next to fetch a restricted webpage.
Specifically, I am playing around with Android and want to write a script to grab my app usage stats from the Android Market once or twice a day so I can make pretty charts. My most likely target is python, but code in any language illustrating non-API use of Google's services would be helpful. Thanks folks.
You can get the auth tokens by authenticating a particular service against https://www.google.com/accounts/ClientLogin
E.g.
curl -d "Email=youremail" -d "Passwd=yourpassword" -d "service=blogger" "https://www.google.com/accounts/ClientLogin"
Then you can just pass the auth tokens and cookies along when accessing the service. You can use Firebug or the Tamper Data Firefox plugin to find out the parameter names etc.
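As a rough sketch of that flow in Python with the requests library (note that a comment further down points out ClientLogin is now deprecated); the credentials, service name and target URL are placeholders:
import requests

# Exchange the account credentials for auth tokens via ClientLogin.
resp = requests.post(
    "https://www.google.com/accounts/ClientLogin",
    data={"Email": "youremail", "Passwd": "yourpassword", "service": "blogger"},
)
resp.raise_for_status()

# The response body is a set of key=value lines (SID, LSID, Auth).
tokens = dict(line.split("=", 1) for line in resp.text.splitlines() if "=" in line)

# Pass the Auth token on subsequent requests to the service.
headers = {"Authorization": "GoogleLogin auth=" + tokens["Auth"]}
restricted = requests.get("<restricted service URL>", headers=headers)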
You can use something like mechanize, or even urllib to achieve this sort of thing. As a tutorial, you can check out my article here about programmatically submitting a form .
Once you authenticate, you can use the cookie to access restricted pages.
ClientLogin is now deprecated: https://developers.google.com/accounts/docs/AuthForInstalledApps
How can we authenticate programmatically to Google with OAuth2?
I can't find an example of a request with user and password parameters as in ClientLogin :(
Is there a solution?
