Get arbitrary resource content in Python 3 - python

I need to get the content of the resources passed on the command line. The user can supply either a relative path to a file or a URL. Is it possible to read from this resource regardless of whether it is a file path or a URL?
In Ruby I have something like the following, but I'm having trouble finding a Python equivalent:
content = open(path_or_url) { |io| io.read }

I don't know of a nice built-in way to do it. However, urllib.request.urlopen() supports opening normal URLs (http, https, ftp, etc.) as well as files on the file system, so you could assume a file whenever the URL is missing a scheme component:
from urllib.parse import urlparse
from urllib.request import urlopen

resource = input('Enter a URL or relative file path: ')
if urlparse(resource).scheme == '':
    # assume that it is a file, use the "file:" scheme
    resource = 'file:{}'.format(resource)
data = urlopen(resource).read()
This works for the following user input:
http://www.blah.com
file:///tmp/x/blah
file:/tmp/x/blah
file:x/blah # assuming cwd is /tmp
/tmp/x/blah
x/blah # assuming cwd is /tmp
Note that file: (without slashes) might not be a valid URI, but it is the only way to open a file specified by a relative path, and urlopen() works with such URIs.
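If you want something closer to the Ruby one-liner, you can wrap this logic in a small helper. A minimal sketch (the name read_resource is mine, not part of any standard API):

from urllib.parse import urlparse
from urllib.request import urlopen

def read_resource(path_or_url):
    """Read bytes from a URL or a local file path, like Ruby's open(...) { |io| io.read }."""
    if urlparse(path_or_url).scheme == '':
        # no scheme: assume a local file path
        path_or_url = 'file:{}'.format(path_or_url)
    with urlopen(path_or_url) as response:
        return response.read()

content = read_resource('x/blah')               # relative file path
content = read_resource('http://www.blah.com')  # URL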

Related

How to download a file with .torrent extension from link with Python

I tried using wget:
url = 'https://yts.lt/torrent/download/A4A68F25347C709B55ED2DF946507C413D636DCA'
wget.download(url, 'c:/path/')
The result was that I got a file with the name A4A68F25347C709B55ED2DF946507C413D636DCA and without any extension.
Whereas when I put the link in the browser's address bar and press enter, a torrent file gets downloaded.
EDIT:
The answer must be generic, not case-dependent.
There must be a way to download .torrent files with their original names.
You can get the filename inside the content-disposition header, i.e.:
import re, requests, traceback

try:
    url = "https://yts.lt/torrent/download/A4A68F25347C709B55ED2DF946507C413D636DCA"
    r = requests.get(url)
    d = r.headers['content-disposition']
    fname = re.findall('filename="(.+)"', d)
    if fname:
        with open(fname[0], 'wb') as f:
            f.write(r.content)
except:
    print(traceback.format_exc())
The code above is for Python 3. I don't have Python 2 installed, and I normally don't post code without testing it.
Have a look at https://stackoverflow.com/a/11783325/797495, the method is the same.
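To make this fully generic, you could fall back to the last segment of the URL path when the server sends no content-disposition header. A rough sketch (the helper name download and its header parsing are mine, deliberately simplified):

import os
import re
import requests

def download(url, folder='.'):
    r = requests.get(url)
    r.raise_for_status()
    # prefer the server-supplied filename, fall back to the URL's last segment
    cd = r.headers.get('content-disposition', '')
    match = re.search('filename="?([^";]+)"?', cd)
    fname = match.group(1) if match else url.rstrip('/').rsplit('/', 1)[-1]
    path = os.path.join(folder, fname)
    with open(path, 'wb') as f:
        f.write(r.content)
    return path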
I found a way to get the torrent files downloaded with their original names, just as if they were downloaded by putting the link in the browser's address bar.
The solution consists of opening the user's browser from Python:
import webbrowser
url = "https://yts.lt/torrent/download/A4A68F25347C709B55ED2DF946507C413D636DCA"
webbrowser.open(url, new=0, autoraise=True)
Read more:
Call to operating system to open url?
However, the downsides are:
I don't get the option to choose the folder where I want to save the file (unless I change it in the browser; but even then, if I want to save torrents matching some criteria to a different path, it won't be possible).
And of course, your browser goes insane opening all those links XD

access remote files on server with smb protocol python3

I have a remote server with some files.
smb://ftpsrv/public/
I can be authorized there as an anonymous user. In Java I could simply write this code:
SmbFile root = new SmbFile(SMB_ROOT);
And get the ability to work with the files inside (it is all I need, one line!), but I can't find out how to manage this task in Python 3. There are a lot of resources, but I think they are not relevant to my problem, because they are frequently tailored to Python 2 or other outdated approaches. Is there some simple way, similar to the Java code above?
Or can somebody provide a real working solution if, for example, I want to access the file fgg.txt in the smb://ftpsrv/public/ folder? Is there really a handy lib to tackle this problem?
For example, from the pysmb site:
import tempfile
from smb.SMBConnection import SMBConnection
# There will be some mechanism to capture userID, password, client_machine_name, server_name and server_ip
# client_machine_name can be an arbitrary ASCII string
# server_name should match the remote machine name, or else the connection will be rejected
conn = SMBConnection(userID, password, client_machine_name, server_name, use_ntlm_v2 = True)
assert conn.connect(server_ip, 139)
file_obj = tempfile.NamedTemporaryFile()
file_attributes, filesize = conn.retrieveFile('smbtest', '/rfc1001.txt', file_obj)
# Retrieved file contents are inside file_obj
# Do what you need with the file_obj and then close it
# Note that the file obj is positioned at the end-of-file,
# so you might need to perform a file_obj.seek() if you need
# to read from the beginning
file_obj.close()
Do I seriously need to provide all of these details: conn = SMBConnection(userID, password, client_machine_name, server_name, use_ntlm_v2 = True)?
A simple example of opening a file using urllib and pysmb in Python 3
import urllib.request
from smb.SMBHandler import SMBHandler

opener = urllib.request.build_opener(SMBHandler)
fh = opener.open('smb://host/share/file.txt')
data = fh.read()
fh.close()
I haven't got an anonymous SMB share ready to test it with, but this code should work.
urllib2 was the Python 2 package; in Python 3 it was renamed to just urllib and some things were moved around.
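If you would rather stick with SMBConnection, anonymous access usually just means passing empty credentials. A minimal sketch based on the question's server and share names (untested; the share name 'public' is an assumption):

import tempfile
from smb.SMBConnection import SMBConnection

# empty userID/password for anonymous access; machine names are placeholders
conn = SMBConnection('', '', 'client_machine', 'ftpsrv', use_ntlm_v2=True)
assert conn.connect('ftpsrv', 139)

# list the share, then fetch one file
for entry in conn.listPath('public', '/'):
    print(entry.filename)

with tempfile.NamedTemporaryFile() as file_obj:
    conn.retrieveFile('public', '/fgg.txt', file_obj)
    file_obj.seek(0)  # rewind before reading
    print(file_obj.read())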
I think you were asking for Linux, but for completeness I'll share how it works on Windows.
On Windows, it seems that Samba access is supported out of the box with Python's standard library functions:
import glob, os

with open(r'\\USER1-PC\Users\Public\test.txt', 'w') as f:
    f.write('hello')  # write a file on a distant Samba share

for f in glob.glob(r'\\USER1-PC\Users\**\*', recursive=True):
    print(f)  # glob works too
    if os.path.isfile(f):
        print(os.path.getmtime(f))  # we can get filesystem information

PyQT | QDesktopServices.openUrl Doesn't work if path has spaces

I am trying to use QDesktopServices to have the system open the files or folders specified.
The code below works perfectly for paths that don't have spaces in them, but fails to execute otherwise:
def openFile(self):
    print self.oVidPath
    print "\n"
    url = QUrl(self.oVidPath)
    QDesktopServices.openUrl(url)
    self.Dialog.close()
and the output for paths with spaces is
/home/kerneldev/Documents/Why alcohol doesn't come with nutrition facts.mp4
gvfs-open: /home/kerneldev/Documents/Why%20alcohol%20doesn't%20come%20with%20nutrition%20facts.mp4: error opening location: Error when getting information for file '/home/kerneldev/Documents/Why%20alcohol%20doesn't%20come%20with%20nutrition%20facts.mp4': No such file or directory
I have verified that the path specified exists.
Please Help
You need to use a file:// URL, otherwise QUrl will treat the path as a network URL and encode it for use in that context. So try this instead:
url = QUrl.fromLocalFile(self.oVidPath)
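The difference is easy to see if you print both forms. A quick illustration (assuming PyQt5; the path is taken from the question):

from PyQt5.QtCore import QUrl

path = "/home/kerneldev/Documents/Why alcohol doesn't come with nutrition facts.mp4"
print(QUrl(path).toString())                # treated as a generic URL, not a local file
print(QUrl.fromLocalFile(path).toString())  # file:///home/kerneldev/... with proper encoding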

unotools - try to convert ods or excel files to csv using python

What I need is a command-line tool to convert Excel and ODS spreadsheet files to CSV, which I can use on a web server (Ubuntu 16.04).
I already read this: https://pypi.python.org/pypi/unotools
which works fine for the given examples.
And this: http://www.linuxjournal.com/content/convert-spreadsheets-csv-files-python-and-pyuno-part-1v2
which should do what I want, but does not in my environment.
My problem, I think, is in the method Calc.store_to_url:
Line throwing exception
component.store_to_url(url,'FilterName','Text - txt - csv (StarCalc)')
I really would appreciate a hint.
Exception
unotools.unohelper.ErrorCodeIOException: SfxBaseModel::impl_store failed: 0x81a
Full source
import sys
from os.path import basename, join as pathjoin, splitext
from unotools import Socket, connect
from unotools.component.calc import Calc
from unotools.unohelper import convert_path_to_url
from unotools import parse_argument

def get_component(args, context):
    _, ext = splitext(args.file_)
    url = convert_path_to_url(args.file_)
    component = Calc(context, url)
    return component

def convert_csv(args, context):
    component = get_component(args, context)
    url = 'out/result.csv'
    component.store_to_url(url, 'FilterName', 'Text - txt - csv (StarCalc)')
    component.close(True)

args = parse_argument(sys.argv[1:])
context = connect(Socket(args.host, args.port), option=args.option)
convert_csv(args, context)
The URL must be in file:// format.
url = convert_path_to_url('out/result.csv')
See the store_to_url example at https://pypi.python.org/pypi/unotools.
EDIT:
To use the absolute path, choose one of these; there is no need to combine them.
url = 'file:///home/me/out/result.csv'
url = convert_path_to_url('/home/me/out/result.csv')
To use the relative path, first verify that the working directory is '/home/me' by calling os.getcwd().
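Putting it together, a minimal sketch of the corrected convert_csv (assuming an out/ directory exists under the current working directory):

from unotools.unohelper import convert_path_to_url

def convert_csv(args, context):
    component = get_component(args, context)
    # convert the plain path to a file:// URL before storing
    url = convert_path_to_url('out/result.csv')
    component.store_to_url(url, 'FilterName', 'Text - txt - csv (StarCalc)')
    component.close(True)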

save url as a file name in python

I have a url such as
http://example.com/here/there/index.html
now I want to save a file and its content in a directory. I want the name of the file to be:
http://example.com/here/there/index.html
but I get an error; I'm guessing the error is the result of the / characters in the URL.
This is what I'm doing at the moment.
with open('~/' + response.url, 'w') as f:
    f.write(response.body)
any ideas how I should do it instead?
You could use reversible base64 encoding:
>>> import base64
>>> base64.b64encode(b'http://example.com/here/there/index.html')
b'aHR0cDovL2V4YW1wbGUuY29tL2hlcmUvdGhlcmUvaW5kZXguaHRtbA=='
>>> base64.b64decode(b'aHR0cDovL2V4YW1wbGUuY29tL2hlcmUvdGhlcmUvaW5kZXguaHRtbA==')
b'http://example.com/here/there/index.html'
or perhaps binascii:
>>> import binascii
>>> binascii.hexlify(b'http://example.com/here/there/index.html')
b'687474703a2f2f6578616d706c652e636f6d2f686572652f74686572652f696e6465782e68746d6c'
>>> binascii.unhexlify('687474703a2f2f6578616d706c652e636f6d2f686572652f74686572652f696e6465782e68746d6c')
b'http://example.com/here/there/index.html'
You have several problems. One of them is that Unix shell abbreviations (~) are not going to be auto-interpreted by Python as they are in Unix shells.
The second is that you're not going to have good luck writing a file path in Unix that has embedded slashes. You will need to convert them to something else if you're going to have any luck retrieving the files later. You could do that with something as simple as response.url.replace('/', '_'), but that would leave many other characters that are also potentially problematic. You may wish to "sanitize" all of them in one shot. For example:
import os
import urllib

def write_response(response, filedir='~'):
    filedir = os.path.expanduser(filedir)  # expand '~' to the user's home directory
    filename = urllib.quote(response.url, '')  # percent-encode everything, including '/'
    filepath = os.path.join(filedir, filename)
    with open(filepath, "w") as f:
        f.write(response.body)
This uses os.path functions to clean up the file paths, and urllib.quote to sanitize the URL into something that could work for a file name. There is a corresponding unquote to reverse that process.
Finally, when you write to a file, you may need to tweak that a bit depending on what the responses are, and how you want them written. If you want them written in binary, you'll need "wb" not just "w" as the file mode. Or if it's text, it might need some sort of encoding first (e.g., to utf-8). It depends on what your responses are, and how they are encoded.
Edit: In Python 3, urllib.quote is now urllib.parse.quote.
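For example, the Python 3 round trip looks like this:
>>> from urllib.parse import quote, unquote
>>> quote('http://example.com/here/there/index.html', '')
'http%3A%2F%2Fexample.com%2Fhere%2Fthere%2Findex.html'
>>> unquote('http%3A%2F%2Fexample.com%2Fhere%2Fthere%2Findex.html')
'http://example.com/here/there/index.html'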
This is a bad idea, as you will hit the 255-byte filename limit that most filesystems impose; URLs tend to be very long, and even longer when b64-encoded!
You can compress and b64-encode, but it won't get you very far:

from base64 import b64encode
import zlib
import bz2
from urllib.parse import quote

def url_strategies(url):
    url = url.encode('utf8')
    print(url.decode())
    print(f'normal  : {len(url)}')
    print(f'quoted  : {len(quote(url, ""))}')
    b64url = b64encode(url)
    print(f'b64     : {len(b64url)}')
    zlib_url = b64encode(zlib.compress(b64url))
    print(f'b64+zlib: {len(zlib_url)}')
    bz2_url = b64encode(bz2.compress(b64url))
    print(f'b64+bz2 : {len(bz2_url)}')
Here's an average url I've found on angel.co:
URL = 'https://angel.co/job_listings/browse_startups_table?startup_ids%5B%5D=972887&startup_ids%5B%5D=365478&startup_ids%5B%5D=185570&startup_ids%5B%5D=32624&startup_ids%5B%5D=134966&startup_ids%5B%5D=722477&startup_ids%5B%5D=914250&startup_ids%5B%5D=901853&startup_ids%5B%5D=637842&startup_ids%5B%5D=305240&tab=find&page=1'
And even with b64+zlib it doesn't fit into 255 limit:
normal : 316
quoted : 414
b64 : 424
b64+zlib: 304
b64+bz2 : 396
Even with the best strategy of zlib compression plus b64 encoding, you'd still be in trouble.
Proper Solution
Alternatively, what you should do is hash the url and attach the url itself as a file attribute:
import os
from hashlib import sha256

def save_file(url, content, char_limit=13):
    # hash url as sha256, use the first 13 characters as the filename
    hash = sha256(url.encode()).hexdigest()[:char_limit]
    filename = f'{hash}.html'
    # e.g. 93fb17b5fb81b.html
    with open(filename, 'w') as f:
        f.write(content)
    # attach the original url as an extended attribute
    os.setxattr(filename, 'user.url', url.encode())
and then you can retrieve the url attribute:
print(os.getxattr(filename, 'user.url').decode())
'https://angel.co/job_listings/browse_startups_table?startup_ids%5B%5D=972887&startup_ids%5B%5D=365478&startup_ids%5B%5D=185570&startup_ids%5B%5D=32624&startup_ids%5B%5D=134966&startup_ids%5B%5D=722477&startup_ids%5B%5D=914250&startup_ids%5B%5D=901853&startup_ids%5B%5D=637842&startup_ids%5B%5D=305240&tab=find&page=1'
Note: setxattr and getxattr require the user. namespace prefix in Python, and extended attributes are only supported on Linux.
For file attributes in Python, see the related answer here: https://stackoverflow.com/a/56399698/3737009
Using urllib.request.urlretrieve:
import urllib.request

urllib.request.urlretrieve("http://example.com/here/there/index.html", "/tmp/index.txt")
You may want to look into restricted characters.
I would use a typical folder structure for this task. If you use it with a lot of URLs it will become a mess sooner or later, and you will run into filesystem performance issues or limits as well.
