How to retrieve a file from GitHub through a Python program - python

I am trying to retrieve an XML file from GitHub using a Python program. I have tried different URLs, but for each of them it says the content was not found. What are the different options to retrieve a file from GitHub? Any help on this would be much appreciated!!
import requests
from github import Github
import base64

g = Github("access_token")
#url = 'https://api.github.com/repos/{username}/{repos_name}/contents/{path}/{filename}.xml'
#url = 'https://git.<company domain>.com/raw/IT/{repos_name}/{path}/{filename}.xml?token<value>'
url = 'https://git.<company domain>.com/raw/IT/{repos_name}/{path}/{filename}.xml?token=<value>'
req = requests.get(url)
#print ('Keep Going!', req.content)
if req.status_code == requests.codes.ok:
    req = req.json()  # the response is JSON
    # req is now a dict with keys: name, encoding, url, size ...
    # and content, but content is base64-encoded
    content = base64.b64decode(req['content'])
else:
    print('Content was not found.')
output:
Keep Going! b'{"message":"Not Found","documentation_url":"https://developer.github.com/v3/repos/contents/#get-contents"}'
Content was not found.

Replace all the <> and {} variables in your URL with the actual path to the file you're trying to retrieve.
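For github.com specifically, the contents API plus base64 decoding is usually enough. A minimal sketch, assuming placeholder owner/repo/path values that you would substitute with your own (for a GitHub Enterprise host such as git.<company domain>.com, the API root is typically https://git.<company domain>.com/api/v3 instead):
import base64
import requests

# Placeholder values -- substitute your own owner, repo, and file path.
owner, repo, path = "<username>", "<repos_name>", "<path>/<filename>.xml"
url = "https://api.github.com/repos/%s/%s/contents/%s" % (owner, repo, path)

resp = requests.get(url, headers={"Authorization": "token <access_token>"})
if resp.status_code == requests.codes.ok:
    payload = resp.json()  # dict with keys: name, size, url, content, ...
    content = base64.b64decode(payload["content"])  # content is base64-encoded
    print(content.decode("utf-8"))
else:
    print("Content was not found:", resp.status_code)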

Related

Download text data from HTML link in Python

Hi, I want to download delimited text which is hosted at an HTML link. (The link is accessible on a private network only, so I can't share it here.)
In R, the following solves the purpose (all other functions gave an "Unauthorized access" or "401" error):
url = 'https://dw-results.ansms.com/dw-platform/servlet/results?job_id=13802737&encoding=UTF8&mimeType=plain'
download.file(url, "~/insights_dashboard/testing_file.tsv")
a = read.csv("~/insights_dashboard/testing_file.tsv", header = T, stringsAsFactors = F, sep = '\t')
I want to do the same thing in Python, for which I used:
(A) urllib and requests.get():
import urllib.request
url_get = requests.get(url, verify=False)
urllib.request.urlretrieve(url_get, 'C:\\Users\\cssaxena\\Desktop\\24.tsv')
(B) requests.get() and pd.read_html():
url = 'https://dw-results.ansms.com/dw-platform/servlet/results?job_id=13802737&encoding=UTF8&mimeType=plain'
s = requests.get(url, verify=False)
a = pd.read_html(io.StringIO(s.decode('utf-8')))
(C) Using wget:
import wget
url = 'https://dw-results.ansms.com/dw-platform/servlet/results?job_id=13802737&encoding=UTF8&mimeType=plain'
wget.download(url, --auth-no-challenge, 'C:\\Users\\cssaxena\\Desktop\\24.tsv')
OR
wget --server-response -owget.log "https://dw-results.ansms.com/dw-platform/servlet/results?job_id=13802737&encoding=UTF8&mimeType=plain"
NOTE: The URL doesn't ask for any credentials; it is accessible from a browser and can be downloaded in R with download.file. I am looking for a solution in Python.
def geturls(path):
    # read the raw HTML and split it on anchor tags
    yy = open(path, 'rb').read()
    yy = "".join(str(yy))
    yy = yy.split('<a')
    out = []
    for d in yy:
        z = d.find('href="')
        if z > -1:
            x = d[z + 6:len(d)]  # text after href="
            r = x.find('"')
            x = x[:r]            # up to the closing quote
            x = x.strip(' ./')
            if (len(x) > 2) and (x.find(";") == -1):
                out.append(x.strip(" /"))
    out = set(out)
    return(out)

pg = "./test.html"  # your html
url = geturls(pg)
print(url)
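For the download itself, the closest Python analogue of R's download.file is a streamed requests call written straight to disk, followed by pandas for parsing. A minimal sketch using the URL and output path from the question:
import requests
import pandas as pd

url = 'https://dw-results.ansms.com/dw-platform/servlet/results?job_id=13802737&encoding=UTF8&mimeType=plain'

# Stream the body to disk, analogous to R's download.file()
resp = requests.get(url, verify=False, stream=True)
resp.raise_for_status()
with open('C:\\Users\\cssaxena\\Desktop\\24.tsv', 'wb') as f:
    for chunk in resp.iter_content(chunk_size=8192):
        f.write(chunk)

# Parse the tab-separated text, analogous to read.csv(..., sep='\t')
a = pd.read_csv('C:\\Users\\cssaxena\\Desktop\\24.tsv', sep='\t')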

Extract images from an HTML file using Python standard libraries

So I'm trying to write a script that parses through an HTML file, finds all the images, and saves those images into another folder. How would one accomplish this using only the libraries that come with Python 3 when you install it on your computer? I currently have this script that I would like to incorporate this into.
date = datetime.date.today()
backup_path = os.path.join(str(date), language)
if not os.path.exists(backup_path):
    os.makedirs(backup_path)
log = []
endpoint = zendesk + '/api/v2/help_center/en-us/articles.json'
while endpoint:
    response = requests.get(endpoint, auth=credentials)
    if response.status_code != 200:
        print('Failed to retrieve articles with error {}'.format(response.status_code))
        exit()
    data = response.json()
    for article in data['articles']:
        if article['body'] is None:
            continue
        title = '<h1>' + article['title'] + '</h1>'
        filename = '{id}.html'.format(id=article['id'])
        with open(os.path.join(backup_path, filename), mode='w', encoding='utf-8') as f:
            f.write(title + '\n' + article['body'])
        print('{id} copied!'.format(id=article['id']))
        log.append((filename, article['title'], article['author_id']))
    endpoint = data['next_page']
This is a script I found on a zendesk forum that basically backs up our articles on Zendesk.
Try using Beautiful Soup to retrieve all the img nodes, and for each node use urllib to fetch the picture.
from bs4 import BeautifulSoup
import urllib.request

# note: using response.text here to get the raw HTML
soup = BeautifulSoup(response.text, 'html.parser')
# get the src of all images
img_source = [x.get('src') for x in soup.find_all('img')]
# fetch the images
images = [urllib.request.urlretrieve(x) for x in img_source]
You'll probably need to add some error handling and tweak it a bit to fit your page, but the idea remains the same.
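Note that Beautiful Soup is a third-party package. If the standard-library-only constraint is firm, the same idea can be sketched with html.parser and urllib.request alone; everything below (the ImgParser helper, the images folder, the filename scheme) is an illustrative assumption, not part of the original script:
import os
import urllib.request
from html.parser import HTMLParser

class ImgParser(HTMLParser):
    # collect the src attribute of every <img> tag
    def __init__(self):
        super().__init__()
        self.sources = []

    def handle_starttag(self, tag, attrs):
        if tag == 'img':
            src = dict(attrs).get('src')
            if src:
                self.sources.append(src)

parser = ImgParser()
parser.feed(article['body'])  # the HTML string from the backup script

os.makedirs('images', exist_ok=True)
for i, src in enumerate(parser.sources):
    # assumes src is an absolute URL; relative paths would need urljoin()
    urllib.request.urlretrieve(src, os.path.join('images', 'img_{}'.format(i)))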

Download a binary file using Python requests module

I need to download a file from an external source, and I am using Basic authentication to log in to the URL:
import requests

response = requests.get('<external url>', auth=('<username>', '<password>'))
data = response.json()
html = data['list'][0]['attachments'][0]['url']
print(html)
data = requests.get('<API URL to download the attachment>', auth=('<username>', '<password>'), stream=True)
print(data.content)
I am getting the output below:
<url to download the binary data>
\x00\x00\x13\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x0f\xcb\x00\x00\x1e\x00\x1e\x00\xbe\x07\x00\x00.\xcf\x05\x00\x00\x00'
I am expecting to be able to download the Word document from that URL within the same session.
Working solution
import requests
import shutil

response = requests.get('<url>', auth=('<username>', '<password>'))
data = response.json()
html = data['list'][0]['attachments'][0]['url']
print(html)
data = requests.get('<url>', auth=('<username>', '<password>'), stream=True)
with open("C:/myfile.docx", 'wb') as f:
    data.raw.decode_content = True
    shutil.copyfileobj(data.raw, f)
I am able to download the file as it is.
When you want to download a file directly, you can use shutil.copyfileobj():
https://docs.python.org/2/library/shutil.html#shutil.copyfileobj
You are already passing stream=True to requests, which is what you need to get a file-like object back. Just pass that as the source to copyfileobj().
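An equivalent pattern that avoids reaching into response.raw, shown here only as a sketch of the same idea, is to iterate over the body in chunks:
import requests

response = requests.get('<url>', auth=('<username>', '<password>'), stream=True)
with open("C:/myfile.docx", 'wb') as f:
    # iter_content() handles decompression, so no decode_content flag is needed
    for chunk in response.iter_content(chunk_size=8192):
        f.write(chunk)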

Send a string of OCR text to a REST API

I am trying to work with RESTful APIs in Python.
After OCRing a PDF, I want to send the text to a RESTful API to retrieve specific words back along with their positions within the text. I have not managed to send the string of text to the API yet.
Code follows:
import requests
import PyPDF2
import json
url = "http://xxapi.xxapi.org/xxx.util.json"
pdfFileObj = open('/Users/xxx/pdftoOCR.pdf','rb')
pdfReader = PyPDF2.PdfFileReader(pdfFileObj)
pageObj = pdfReader.getPage(1) # To try with the text found in the first page
data = {"text": pageObj.extractText()}
data_json = json.dumps(data)
params = {'text':'string'}
r = requests.post(url, data=data_json, params=params)
r1 = json.loads(r.text)
Although I get a 200 response from the request, the data should come back in JSON format, and I apparently need to poll some token URL (which I don't know how to do either). I also don't think the request is correct: when I paste the token URL into the browser I see an empty JSON file (no words, no positions), even though I know the piece of text I'm trying to send contains the desired words.
Thanks in advance! I work with OS X, Python 3.5.
Well, many thanks to @Jose.Cordova.Alvear for resolving this issue:
import json
import requests

pdf = open('test.pdf', 'rb')
url = "http://xxapi.xxapi.org/xxx.util.json"
payload = {
    'file': pdf
}
response = requests.post(url, files=payload)
print(response.json())
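If the API hands back a token URL that has to be polled until the OCR job finishes, a generic loop along these lines might work; the 'url' and 'status' key names and the 'done' value are assumptions about this particular API, not documented behavior:
import time

job = response.json()
token_url = job['url']  # assumed: the URL to poll for the result

while True:
    result = requests.get(token_url).json()
    if result.get('status') == 'done':  # assumed status field
        break
    time.sleep(2)  # wait before polling again

print(result)  # the finished payload with words and positions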

Python Google Drive API v2: search for a file

I want to search Google Drive, to check that there is not already a folder with the name of the folder I want to create.
I have this code:
request_url = "https://www.googleapis.com/drive/v2/files?access_token=%s" % (access_token)
data = {"q": "title=filename"}
data_json = json.dumps(data)
print data_json
req = urllib2.Request(request_url,data_json, headers)
print request_url
print data_json
content = urllib2.urlopen(req).read()
print content
content = json.loads(content)
Unfortunately, no search happens; instead it creates a file with no name.
Thank you in advance for your help.
Basically, with urllib2, when you specify data in Request(), it automatically treats your request as a POST, and that's why it creates a file with no name.
If you want it to be a GET request, you have to encode the query into the URL before calling Request().
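A sketch of the GET version in the question's urllib2 style; the quoted title in the q parameter follows the Drive v2 query syntax, and access_token and headers are the variables from the question:
import json
import urllib
import urllib2

# encode the search query into the URL instead of passing it as a request body
params = urllib.urlencode({
    "q": "title = 'filename'",  # Drive v2 query syntax
    "access_token": access_token,
})
request_url = "https://www.googleapis.com/drive/v2/files?%s" % params

req = urllib2.Request(request_url, headers=headers)  # no data argument -> GET
content = json.loads(urllib2.urlopen(req).read())
print content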
