Open online txt file using python codecs.open

I am trying to open an online txt file using codecs.open. The code I have now is:
url = r'https://www.sec.gov/Archives/edgar/data/20/0000893220-96-000500.txt'
soup = BeautifulSoup(codecs.open(url, 'r',encoding='utf-8'), "lxml")
However, Python keeps raising an OSError:
OSError: [Errno 22] Invalid argument: 'https://www.sec.gov/Archives/edgar/data/20/0000893220-96-000500.txt'
I tried replacing "/" with "\", but it still does not work. Is there any way to solve this? Since I have thousands of links to open, I would rather not download the text files to my local drive.
I will appreciate it very much if someone can help here.
Thanks!

Is it something like this you're thinking of?
from urllib.request import urlopen

url = urlopen('https://www.sec.gov/Archives/edgar/data/20/0000893220-96-000500.txt')
html = url.read().decode('utf-8')
file = open('yourfile.txt', 'w')
file.write(html)
file.close()
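Since the asker wants to avoid saving thousands of files locally, the file can also be read without touching disk. A minimal sketch (the helper name `read_remote_text` is hypothetical): `codecs.open` only accepts local filesystem paths, which is why it raises `OSError` for a URL; fetch the bytes over HTTP and decode them instead.

```python
from urllib.request import urlopen

def read_remote_text(url, encoding='utf-8'):
    # codecs.open() cannot open a URL; fetch the raw bytes
    # and decode them manually instead of going through a file.
    with urlopen(url) as resp:
        return resp.read().decode(encoding)
```

The returned string can then be passed straight to `BeautifulSoup(text, "lxml")`, so nothing needs to be written to the local drive.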

Related

How to download a file with .torrent extension from link with Python

I tried using wget:
url = "https://yts.lt/torrent/download/A4A68F25347C709B55ED2DF946507C413D636DCA"
wget.download(url, 'c:/path/')
The result was that I got a file with the name A4A68F25347C709B55ED2DF946507C413D636DCA and without any extension.
Whereas when I put the link in the navigator bar and click enter, a torrent file gets downloaded.
EDIT:
Answer must be generic not case dependent.
It must be a way to download .torrent files with their original name.
You can get the filename inside the content-disposition header, i.e.:
import re, requests, traceback
try:
    url = "https://yts.lt/torrent/download/A4A68F25347C709B55ED2DF946507C413D636DCA"
    r = requests.get(url)
    d = r.headers['content-disposition']
    fname = re.findall('filename="(.+)"', d)
    if fname:
        with open(fname[0], 'wb') as f:
            f.write(r.content)
except:
    print(traceback.format_exc())
Py3 Demo
The code above is for python3. I don't have python2 installed and I normally don't post code without testing it.
Have a look at https://stackoverflow.com/a/11783325/797495, the method is the same.
I found a way to download the torrent files with their original names, just as if they had been downloaded by putting the link in the browser's navigation bar.
The solution consists of opening the user's browser from Python:
import webbrowser
url = "https://yts.lt/torrent/download/A4A68F25347C709B55ED2DF946507C413D636DCA"
webbrowser.open(url, new=0, autoraise=True)
Read more:
Call to operating system to open url?
However, the downsides are:
I don't get the option to choose the folder where I want to save the file (unless I change it in the browser; but if I want to save torrents that match some criteria to another path, it won't be possible).
And of course, your browser goes insane opening all those links XD
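Both downsides can be avoided by combining the content-disposition approach above with an explicit target directory. A sketch, assuming `requests` is installed; the helper names `filename_from_disposition` and `download_to` are hypothetical:

```python
import os
import re
import requests

def filename_from_disposition(header, fallback):
    # Pull the server-suggested name out of a Content-Disposition
    # header, e.g. 'attachment; filename="Movie.torrent"'.
    m = re.search(r'filename="?([^";]+)"?', header or '')
    return m.group(1) if m else fallback

def download_to(url, dest_dir):
    r = requests.get(url)
    r.raise_for_status()
    # Fall back to the last URL segment when the header is missing.
    name = filename_from_disposition(r.headers.get('content-disposition'),
                                     url.rstrip('/').rsplit('/', 1)[-1])
    path = os.path.join(dest_dir, name)
    with open(path, 'wb') as f:
        f.write(r.content)
    return path
```

This keeps the original server-given filename while letting you pick any destination folder, and no browser windows are opened.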

Open an XML file through URL and save it

With Python 3, I want to read an XML web page and save it in my local drive.
Also, if the file already exists, it must be overwritten.
I tested some script like :
import urllib.request
xml = urllib.request.urlopen('URL')
data = xml.read()
file = open("file.xml","wb")
file.writelines(data)
file.close()
But I have an error :
TypeError: a bytes-like object is required, not 'int'
First suggestion: do what even the official urllib docs says and don't use urllib, use requests instead.
Your problem is that you use .writelines() and it expects a list of lines, not a bytes object (for once in Python the error message is not very helpful). Use .write() instead:
import requests
resp = requests.get('URL')
with open('file.xml', 'wb') as foutput:
    foutput.write(resp.content)
I found a solution :
from urllib.request import urlopen
xml = open("import.xml", "w")
xml.write(urlopen('URL').read().decode('utf-8'))
xml.close()
Thanks for your help.
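For the overwrite requirement, the key detail is the file mode: opening with "wb" truncates any existing file, so re-running the script cleanly replaces the old copy. A stdlib-only sketch (the helper name `save_url` is hypothetical):

```python
import shutil
from urllib.request import urlopen

def save_url(url, path):
    # 'wb' truncates an existing file, so each run overwrites it;
    # copyfileobj streams in chunks instead of loading it all in memory.
    with urlopen(url) as resp, open(path, 'wb') as out:
        shutil.copyfileobj(resp, out)
```

Writing bytes directly also sidesteps the decode/encode round trip of the "r+" approach above.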

Errno socket error occurs when I retrieve file from url with python urllib

I try to download a .tiff file from NASA. When doing it in the browser it works out fine. When trying it with the following python code
import urllib
f = urllib.FancyURLopener()
url = "https://neo.sci.gsfc.nasa.gov/servlet/RenderData?si=1696692&cs=gs&format=TIFF&width=3600&height=1800"
f.retrieve(url, "test.TIFF")
I get the error
IOError: [Errno socket error] [SSL: UNKNOWN_PROTOCOL] unknown protocol (_ssl.c:590)
I found one similar question here that solves the error by creating a new SSLContext. However, I cannot figure out how to save the downloaded file, as my case requires.
This seems to work:
from urllib.request import urlretrieve
url = 'https://neo.sci.gsfc.nasa.gov/servlet/RenderData?si=1696692&cs=gs&format=TIFF&width=3600&height=1800'
urlretrieve(url, 'result.TIFF')
Not sure if this will work in Python 2. Will update my answer later.
I found a solution with python 2 using urllib2 that works for me:
import urllib2
url = "https://neo.sci.gsfc.nasa.gov/servlet/RenderData?si=1696692&cs=gs&format=TIFF&width=3600&height=1800"
f = urllib2.urlopen(url)
data = f.read()
with open("img.TIFF", "wb") as imgfile:
    imgfile.write(data)
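The SSLContext idea from the question can also be combined with saving the file directly. A Python 3 sketch (the helper name `save_with_context` is hypothetical):

```python
import ssl
from urllib.request import urlopen

def save_with_context(url, path):
    # An explicit default SSLContext negotiates modern TLS, avoiding
    # the UNKNOWN_PROTOCOL handshake failure seen with FancyURLopener.
    ctx = ssl.create_default_context()
    with urlopen(url, context=ctx) as resp, open(path, 'wb') as out:
        out.write(resp.read())
```

Opening the local file in "wb" mode is what the asker was missing: the bytes from the response are simply written out as-is.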

Read file using urllib and write adding extra characters

I have a script that regularly reads a text file on a server and overwrites a local copy of it. The problem is that the process adds extra carriage returns and an extra invisible character after the last character. How do I make an identical copy of the server file?
I use the following to read the file
for link in links:
    try:
        f = urllib.urlopen(link)
        myfile = f.read()
    except IOError:
        pass
and to write it to the local file
f = open("C:\\localfile.txt", "w")
try:
    f.write(myfile)
except NameError:
    pass
finally:
    f.close()
This is how the file looks on the server:
http://i.imgur.com/rAnUqmJ.jpg
and this is how the file looks locally, with an additional invisible character after the last 75:
http://i.imgur.com/xfs3E8D.jpg
I have seen quite a few similar questions, but I am not sure how to make urllib read in binary. Any solution?
If you want to copy a remote file denoted by a URL to a local file, I would use urllib.urlretrieve:
import urllib
urllib.urlretrieve("http://anysite.co/foo.gz", "foo.gz")
I think urllib is reading binary.
Try changing
f = open("C:\\localfile.txt", "w")
to
f = open("C:\\localfile.txt", "wb")

How to crawl 'this' url use urllib?

I am trying to use urllib to crawl this file: http://www.anzhi.com/dl_app.php?s=68611, but it always downloads a wrong file (smaller size). However, if I open the link in Chrome, it works and the downloaded file size is correct. The code is attached; what's the problem?
import urllib
apk = "http://sc.hiapk.com/Download.aspx?aid=294091"
local=r'x.apk'
webFile = urllib.urlopen(apk)
localFile = open(local, "w")
realurl = webFile.geturl()
print realurl
realFile = urllib.urlopen(realurl)
localFile.write(realFile.read())
webFile.close()
realFile.close()
localFile.close()
What OS are you on? This line of code:
localFile = open(local, "w")
opens a text-mode file on Windows, which will do things that you don't want. Does changing that to
localFile = open(local, "wb")
(opening the file in binary mode) make things work correctly?
You're not using the same URL in your code that you're asking about in the question. Use the anzhi.com URL and you'll get the file you want. :)
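Combining both answers, a corrected Python 3 sketch (the helper name `download_apk` is hypothetical): open the local file in binary mode, and use the final redirected URL that the server actually serves the file from.

```python
import urllib.request

def download_apk(url, local_path):
    with urllib.request.urlopen(url) as resp:
        final_url = resp.geturl()            # reflects any server redirect
        with open(local_path, 'wb') as out:  # binary mode keeps the APK intact
            out.write(resp.read())
    return final_url
```

On Windows, text mode ("w") corrupts binary data by translating newlines, which is why the downloaded file came out smaller or broken.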
