Reading non-text files into Python - python

I want to read in a non-text file. It has the extension ".map" but can be opened with Notepad. How should I open this file in Python?
file = open("path-to-file","r") doesn't work for me. It returns a "No such file or directory" error.
Here's what my file looks like:
111 + gi|89106884|ref|AC_000091.1| 725803 TCGAGATCGACCATGTTGCCCGCCT IIIIIIIIIIIIIIIIIIIIIIIII 0 14:A>G
457 + gi|89106884|ref|AC_000091.1| 32629 CCGTGTCCACCGACTACGACACCTC IIIIIIIIIIIIIIIIIIIIIIIII 0 4:C>G,22:T>C
779 + gi|89106884|ref|AC_000091.1| 483582 GATCACCCACGCAAAGATGGGGCGA IIIIIIIIIIIIIIIIIIIIIIIII 0 15:A>G,18:C>G
784 + gi|89106884|ref|AC_000091.1| 226200 ACCGATAGTGAACCAGTACCGTGAG IIIIIIIIIIIIIIIIIIIIIIIII 1
If I do the following:
file = open("D:\bowtie-0.12.7-win32\bowtie-0.12.7\output_635\results_NC_000117.fna.1.ebwt.map","rb")
It still gives me a No such file or directory: 'D:\x08owtie-0.12.7-win32\x08owtie-0.12.7\\output_635\results_NC_000117.fna.1.ebwt.map' error. Is this because the file isn't binary, or because I don't have the right permissions?
Would appreciate help with this!

Binary files should use a binary mode.
f = open("path-to-file","rb")
But that won't help if you don't have the appropriate permissions or don't know the format of the file itself.
EDIT:
Look closely at the error message: the filename it reports is not the one you expected, because backslash escapes such as '\b' and '\r' in your string literal were interpreted as control characters. Either escape the backslashes or use a raw string:
f = open("D:\\bowtie-0.12.7-win32\\bowtie-0.12.7\\output_635\\results_NC_000117.fna.1.ebwt.map","rb")
f = open(r"D:\bowtie-0.12.7-win32\bowtie-0.12.7\output_635\results_NC_000117.fna.1.ebwt.map","rb")

You have hit upon a minor difference between Unix and Windows here.
Since you mentioned Notepad, you must be running this on Windows. In DOS/Windows land, opening a binary file requires specifying attribute 'b' for binary, as others have already indicated. Unix/Linux are a bit more relaxed about this. Omitting attribute 'b' will still open a binary file.
The same behavior is exhibited in the C library's fopen() call.
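A quick way to see the difference on Windows (a sketch; "demo.bin" is just a throwaway file name):
# Write the two bytes CR LF with no translation.
with open("demo.bin", "wb") as f:
    f.write("\r\n")

# Text mode on Windows translates "\r\n" to "\n" on read...
print repr(open("demo.bin", "r").read())    # '\n'
# ...while binary mode returns the bytes exactly as stored.
print repr(open("demo.bin", "rb").read())   # '\r\n'
# On Unix/Linux both calls return '\r\n'; no translation is done.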

If it's a non-text file, you could try opening it in binary mode. Try this:
with open("path-to-file", "rb") as f:
byte = f.read(1)
while byte != "":
byte = f.read(1) # Do stuff with byte.
The with statement handles opening and closing the file, including when an exception is raised in the inner block.
Of course, since the format is binary, you need to know what you are going to do with the data after you read it. Also, here I read 1 byte at a time; you can use bigger chunk sizes too, as in the sketch below.
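For larger files, reading fixed-size chunks is usually faster; a minimal sketch (the 4096-byte chunk size is an arbitrary choice):
CHUNK_SIZE = 4096  # arbitrary; tune to taste

with open("path-to-file", "rb") as f:
    chunk = f.read(CHUNK_SIZE)
    while chunk:
        # Do stuff with chunk.
        chunk = f.read(CHUNK_SIZE)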
UPDATE: Maybe this is not a binary file. You might be having problems with the file's encoding; the characters might not be ASCII, or they might belong to a Unicode charset. Try this:
import codecs
f = codecs.open(u'path-to-file','r','utf-8')
print f.read()
f.close()
If you print this out in the terminal, you might still get gibberish, since the terminal might not support this charset. I would advise going ahead and processing the text, assuming it was opened properly.
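If UTF-8 turns out to be the wrong guess and the read raises a UnicodeDecodeError, codecs.open also accepts an errors argument; errors='replace' substitutes U+FFFD for undecodable bytes instead of failing (a sketch, assuming some data loss is acceptable):
import codecs

f = codecs.open(u'path-to-file', 'r', 'utf-8', errors='replace')
text = f.read()   # undecodable bytes become u'\ufffd' instead of raising
f.close()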

Related

python 2.7 cpickle.load adds \r to strings in dictionaries [duplicate]

I have a pickled object (a list with a few numpy arrays in it) that was created on Windows and apparently saved to a file opened in text mode rather than binary mode (i.e. with open(filename, 'w') instead of open(filename, 'wb')). The result is that now I can't unpickle it (not even on Windows) because it's infected with \r characters (and possibly more). The main complaint is
ImportError: No module named multiarray
supposedly because it's looking for numpy.core.multiarray\r, which of course doesn't exist. Simply removing the \r characters didn't do the trick (I tried both sed -e 's/\r//g' and, in Python, s = file.read().replace('\r', ''), but both break the file and yield a cPickle.UnpicklingError later on).
The problem is that I really need to get the data out of these objects. Any ideas how to fix the files?
Edit: On request, here is the repr of the first few hundred bytes of my file:
\x80\x02]q\x01(}q\x02(U\r\ntotal_timeq\x03G?\x90\x15r\xc9(s\x00U\rreaction_timeq\x04NU\x0ejump_directionq\x05cnumpy.core.multiarray\r\nscalar\r\nq\x06cnumpy\r\ndtype\r\nq\x07U\x02f8K\x00K\x01\x87Rq\x08(K\x03U\x01<NNNJ\xff\xff\xff\xffJ\xff\xff\xff\xffK\x00tbU\x08\x025\x9d\x13\xfc#\xc8?\x86Rq\tU\x14normalised_directionq\r\nh\x06h\x08U\x08\xf0\xf9,\x0eA\x18\xf8?\x86Rq\x0bU\rjump_distanceq\x0ch\x06h\x08U\x08\x13\x14\xea&\xb0\x9b\x1a#\x86Rq\rU\x04jumpq\x0ecnumpy.core.multiarray\r\n_reconstruct\r\nq\x0fcnumpy\r\nndarray\r\nq\x10K\x00\x85U\x01b\x87Rq\x11(K\x01K\x02\x85h\x08\x89U\x10\x87\x16\xdaEG\xf4\xf3?\x06`OC\xe7"\x1a#tbU\x0emovement_speedq\x12h\x06h\x08U\x08\\p\xf5[2\xc2\xef?\x86Rq\x13U\x0ctrial_lengthq\x14G#\t\x98\x87\xf8\x1a\xb4\xbaU\tconditionq\x15U\x0bhigh_mentalq\x16U\x07subjectq\x17K\x02U\x12movement_directionq\x18h\x06h\x08U\x08\xde\x06\xcf\x1c50\xfd?\x86Rq\x19U\x08positionq\x1ah\x0fh\x10K\x00\x85U\x01b\x87Rq\x1b(K\x01K\x02\x85h\x08\x89U\x10K\xb7\xb4\x07q=\x1e\xc0\xf2\xc2YI\xb7U&\xc0tbU\x04typeq\x1ch\x0eU\x08movementq\x1dh\x0fh\x10K\x00\x85U\x01b\x87Rq\x1e(K\x01K\x02\x85h\x08\x89U\x10\xad8\x9c9\x10\xb5\xee\xbf\xffa\xa2hWR\xcf?tbu}q\x1f(h\x03G#\t\xba\xbc\xb8\xad\xc8\x14h\x04G?\xd9\x99%]\xadV\x00h\x05h\x06h\x08U\x08\xe3X\xa9=\xc1\xb1\xeb?\x86Rq h\r\nh\x06h\x08U\x08\x88\xf7\xb9\xc1\t\xd6\xff?\x86Rq!h\x0ch\x06h\x08U\x08v\x7f\xeb\x11\xea5\r#\x86Rq"h\x0eh\x0fh\x10K\x00\x85U\x01b\x87Rq#(K\x01K\x02\x85h\x08\x89U\x10\xcd\xd9\x92\x9a\x94=\x06#]C\xaf\xef\xeb\xef\x02#tbh\x12h\x06h\x08U\x08-\x9c&\x185\xfd\xef?\x86Rq$h\x14G#\r\xb8W\xb2`V\xach\x15h\x16h\x17K\x02h\x18h\x06h\x08U\x08\x8e\x87\xd1\xc2
You may also download the whole file (22k).
Presuming that the file was created with the default protocol=0 ASCII-compatible method, you should be able to load it anywhere by using open('pickled_file', 'rU'), i.e. universal-newlines mode.
If this doesn't work, show us the first few hundred bytes: print repr(open('pickled_file', 'rb').read(200)) and paste the results into an edit of your question.
Update after file contents were published:
Your file starts with '\x80\x02'; it was dumped with protocol 2, the latest/best. Protocols 1 and 2 are binary protocols. Your file was written in text mode on Windows. This has resulted in each '\n' being converted to '\r\n' by the C runtime. Files should be opened in binary mode like this:
import pickle

with open('result.pickle', 'wb') as f:   # b for binary
    pickle.dump(obj, f, pickle.HIGHEST_PROTOCOL)

with open('result.pickle', 'rb') as f:   # b for binary
    obj = pickle.load(f)
See the pickle docs for details. This code will work portably on both Windows and non-Windows systems.
You can recover the original pickle image by reading the file in binary mode and then reversing the damage by replacing all occurrences of '\r\n' by '\n'. Note: This recovery procedure is necessary whether you are trying to read it on Windows or not.
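A minimal recovery sketch along those lines (assuming the damaged file is called 'result.pickle'; the name is only for illustration):
import cPickle

# Read the damaged file in binary mode so nothing is translated on the way in,
# then undo the text-mode damage: every '\n' was written out as '\r\n'.
with open('result.pickle', 'rb') as f:
    damaged = f.read()
repaired = damaged.replace('\r\n', '\n')

obj = cPickle.loads(repaired)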
Newlines on Windows aren't just '\r'; they're CRLF, i.e. '\r\n'.
Give file.read().replace('\r\n', '\n') a try. You were previously deleting carriage returns that may not actually have been part of newlines.
Can't you -- on Windows -- just open the file in text mode, the same way it was written, read it in and then write it out to another file opened properly in binary mode?
Have you tried unpickling in text mode? That is,
x = pickle.load(open(filename, 'r'))
(On Windows, of course.)

Python fails to open 11gb csv in r+ mode but opens in r mode

I'm having problems with some code that loops through a bunch of .csv files and deletes the final line if there's nothing in it (i.e. files that end with the \n newline character).
My code works successfully on all files except one, which is the largest file in the directory at 11gb. The second largest file is 4.5gb.
The line it fails on is simply:
with open(path_str,"r+") as my_file:
and I get the following message:
IOError: [Errno 22] invalid mode ('r+') or filename: 'F:\\Shapefiles\\ab_premium\\processed_csvs\\a.csv'
I create path_str using os.path.join to avoid errors, and I tried renaming the file to a.csv just to make sure there wasn't anything odd going on with the filename. This made no difference.
Even more strangely, the file is happy to open in r mode. I.e. the following code works fine:
with open(path_str,"r") as my_file:
I have tried navigating around the file in read mode, and it's happy to read characters at the start, end, and in the middle of the file.
Does anyone know of any limits on the size of file that Python can deal with or why I might be getting this error? I'm on Windows 7 64bit and have 16gb of RAM.
The default I/O stack in Python 2 is layered over CRT FILE streams. On Windows these are built on top of a POSIX emulation API that uses file descriptors (which in turn is layered over the user-mode Windows API, which is layered over the kernel-mode I/O system, which itself is a deeply layered system based on I/O request packets; the hardware is down there somewhere...). In the POSIX layer, opening a file in _O_RDWR | _O_TEXT mode (as in "r+") requires seeking to the end of the file to remove a trailing CTRL+Z, if present. Here's a quote from the CRT's fopen documentation:
Open in text (translated) mode. In this mode, CTRL+Z is interpreted as
an end-of-file character on input. In files opened for reading/writing
with "a+", fopen checks for a CTRL+Z at the end of the file and
removes it, if possible. This is done because using fseek and ftell to
move within a file that ends with a CTRL+Z, may cause fseek to behave
improperly near the end of the file.
The problem here is that the above check calls the 32-bit _lseek (bear in mind that sizeof long is 4 bytes on 64-bit Windows, unlike most other 64-bit platforms), instead of _lseeki64. Obviously this fails for an 11 GB file. Specifically, SetFilePointer fails because it gets called with a NULL value for lpDistanceToMoveHigh. Here's the return value and LastErrorValue for the latter call:
0:000> kc 2
Call Site
KERNELBASE!SetFilePointer
MSVCR90!lseek_nolock
0:000> r rax
rax=00000000ffffffff
0:000> dt _TEB #$teb LastErrorValue
ntdll!_TEB
+0x068 LastErrorValue : 0x57
The error code 0x57 is ERROR_INVALID_PARAMETER. This is referring to lpDistanceToMoveHigh being NULL when trying to seek from the end of a large file.
To work around this problem with CRT FILE streams, I recommend opening the file using io.open instead. This is a backported implementation of Python 3's I/O stack. It always opens files in raw binary mode (_O_BINARY), and it implements its own buffering and text-mode layers on top of the raw layer.
>>> import io, os
>>> f = io.open('a.csv', 'r+')
>>> f
<_io.TextIOWrapper name='a.csv' encoding='cp1252'>
>>> f.buffer
<_io.BufferedRandom name='a.csv'>
>>> f.buffer.raw
<_io.FileIO name='a.csv' mode='rb+'>
>>> f.seek(0, os.SEEK_END)
11811160064L
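For the original task, once the file is open via io.open you can work on it in binary mode end to end; a rough sketch of my own (not from the answer above), assuming an "empty final line" simply means the file ends with a trailing newline:
import io
import os

def strip_trailing_newline(path):
    # Read/write in binary mode: no newline translation, no CTRL+Z handling,
    # and seeks work correctly beyond the 32-bit limit.
    with io.open(path, 'r+b') as f:
        f.seek(0, os.SEEK_END)
        size = f.tell()
        if size == 0:
            return
        f.seek(max(size - 2, 0))
        tail = f.read()
        if tail.endswith('\r\n'):
            f.truncate(size - 2)
        elif tail.endswith('\n'):
            f.truncate(size - 1)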

Writing files in Python and carriage return in Windows

I'm using OpenCV Python library to extract descriptors and write them to file. Each descriptor is 32 bytes and I only save 80 of them. Meaning that, the final file must be exactly 2560 bytes. But it's 2571 bytes.
I also have another file which had been written using the same Python script (Not on Windows but I guess it was on Linux) and it's exactly 2560 bytes.
Using WinMerge, I tried to compare them and it gave me a warning that the carriage return is different in two files and asked me if I wanted to treat them equally. If I say "yes", then both files are identical but if I say "no" then they are different.
I was wondering if there is anyway in Python to write binary files which produce identical result on both Windows and Linux?
For reference, this is the relevant part of the script:
f = open("something", "w+")
f.write(descriptors)
f.close()
Yes, there's a way: open the file in binary mode by adding the 'b' character to the mode string passed to open.
f = open("something", "wb+")
If you don't do that on Windows, every linefeed byte '\n' in the data will be converted on write to the two-character line ending that Windows uses, '\r\n'. (The 11 extra bytes in your 2571-byte file suggest the descriptor data happened to contain eleven 0x0A bytes.)
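A quick way to convince yourself of the difference on Windows (a sketch; the file names and sample bytes are arbitrary):
import os

data = "\x00\x0a\xff" * 10   # 30 bytes of made-up "descriptor" data containing 0x0A

f = open("text_mode.bin", "w+")
f.write(data)
f.close()

f = open("binary_mode.bin", "wb+")
f.write(data)
f.close()

print os.path.getsize("text_mode.bin")    # 40 on Windows: each 0x0A became 0x0D 0x0A
print os.path.getsize("binary_mode.bin")  # 30 everywhere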

How to process huge text files that contain EOF / Ctrl-Z characters using Python on Windows?

I have a number of large comma-delimited text files (the biggest is about 15GB) that I need to process using a Python script. The problem is that the files sporadically contain DOS EOF (Ctrl-Z) characters in the middle of them. (Don't ask me why, I didn't generate them.) The other problem is that the files are on a Windows machine.
On Windows, when my script encounters one of these characters, it assumes it is at the end of the file and stops processing. For various reasons, I am not allowed to copy the files to any other machine. But I still need to process them.
Here are my ideas so far:
Read the file in binary mode, throwing out bytes that equal chr(26). This would work, but it would take approximately forever.
Use something like sed to eliminate the EOF characters. Unfortunately, as far as I can tell, sed on Windows has the same problem and will quit when it sees the EOF.
Use some kind of Notepad program and do a find-and-replace. But it turns out that Notepad-type programs don't cope well with 15GB files.
My IDEAL solution would be some way to just read the file as text and simply ignore the Ctrl-Z characters. Is there a reasonable way to accomplish this?
It's easy to use Python to delete the DOS EOF chars; for example,
def delete_eof(fin, fout):
    BUFSIZE = 2**15
    EOFCHAR = chr(26)
    data = fin.read(BUFSIZE)
    while data:
        fout.write(data.translate(None, EOFCHAR))
        data = fin.read(BUFSIZE)

import sys
ipath = sys.argv[1]
opath = ipath + ".new"
with open(ipath, "rb") as fin, open(opath, "wb") as fout:
    delete_eof(fin, fout)
This takes a file path as its first argument, and copies the file, minus any chr(26) bytes, to the same file path with .new appended. Fiddle to taste.
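If preprocessing into a .new copy is not practical, a thin wrapper that strips the chr(26) bytes while you read is another option (my own sketch, not part of the answer above; it reads in binary mode precisely so Windows never sees a text-mode EOF):
def lines_without_eof(path, bufsize=2**15):
    """Yield lines from `path`, dropping any stray chr(26) bytes."""
    eofchar = chr(26)
    leftover = ""
    with open(path, "rb") as f:
        data = f.read(bufsize)
        while data:
            data = leftover + data.translate(None, eofchar)
            pieces = data.split("\n")
            leftover = pieces.pop()          # possibly incomplete last line
            for line in pieces:
                # Note: lines keep any '\r' from CRLF endings; strip it if needed.
                yield line + "\n"
            data = f.read(bufsize)
        if leftover:
            yield leftover

# Usage (path and process() are hypothetical):
# for line in lines_without_eof(r"D:\big_file.csv"):
#     process(line)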
By the way, are you sure that DOS EOF characters are your only problem? It's hard to conceive of a sane way in which they could end up in files intended to be treated as text files.

Winzip cannot open an archive created by python shutil.make_archive on windows. On ubuntu archive manager does fine

I am trying to return a zip file in a Django HTTP response; the code goes something like this...
archive = shutil.make_archive('testfolder', 'zip', MEDIA_ROOT, 'testfolder')
response = HttpResponse(FileWrapper(open(archive)),
                        content_type=mimetypes.guess_type(archive)[0])
response['Content-Length'] = getsize(archive)
response['Content-Disposition'] = "attachment; filename=test %s.zip" % datetime.now()
return response
Now when this code is executed on Ubuntu, the resulting downloaded file opens without any issue, but when it's executed on Windows the file created does not open in WinZip (it gives the error 'Unsupported Zip Format').
Is there something very obvious I am missing here? Isn't python code supposed to be portable?
EDIT:
Thanks to J.F. Sebastian for his comment...
There was no problem in creating the archive; the problem was in reading it back to build the response. So the solution is to change the second line of my code from
response = HttpResponse(FileWrapper(open(archive)),
                        content_type=mimetypes.guess_type(archive)[0])
to,
response = HttpResponse(FileWrapper(open(archive, 'rb')), # notice extra 'rb'
                        content_type=mimetypes.guess_type(archive)[0])
Check out my answer to this question for more details...
The code you have written should work correctly. I've just run the following line from your snippet to generate a zip file and was able to extract it on both Linux and Windows.
archive = shutil.make_archive('testfolder', 'zip', MEDIA_ROOT, 'testfolder')
There is something funny and specific going on. I recommend you check the following:
Generate the zip file outside of Django with a script that contains just that one-liner, then try to extract it on a Windows machine. This will help you rule out anything related to Django, the web server, or the browser.
If that works, look at exactly what is in the folder you compressed. Do the files have any funny characters in their names? Are there strange file types or super-long filenames?
Run an MD5 checksum on the zip file on both Windows and Linux, just to make absolutely sure the two files are byte-for-byte identical, to rule out any file corruption that might have occurred (see the sketch below).
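A minimal checksum sketch that runs the same way on both machines (the archive name is just a placeholder):
import hashlib

def md5sum(path, bufsize=2**20):
    # Read in binary mode and in chunks so large archives don't need to fit in memory.
    md5 = hashlib.md5()
    with open(path, "rb") as f:
        chunk = f.read(bufsize)
        while chunk:
            md5.update(chunk)
            chunk = f.read(bufsize)
    return md5.hexdigest()

print md5sum("testfolder.zip")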
Thanks to J.F. Sebastian for his comment...
I'll still write the solution here in detail...
There was no problem in creating the archive; the problem was in reading it back to build the response. So the solution is to change the second line of my code from
response = HttpResponse(FileWrapper(open(archive)),
                        content_type=mimetypes.guess_type(archive)[0])
to,
response = HttpResponse(FileWrapper(open(archive, 'rb')), # notice extra 'rb'
                        content_type=mimetypes.guess_type(archive)[0])
because apparently, hidden somewhere in the Python documentation on open():
The most commonly-used values of mode are 'r' for reading, 'w' for
writing (truncating the file if it already exists), and 'a' for
appending (which on some Unix systems means that all writes append to
the end of the file regardless of the current seek position). If mode
is omitted, it defaults to 'r'. The default is to use text mode, which
may convert '\n' characters to a platform-specific representation on
writing and back on reading. Thus, when opening a binary file, you
should append 'b' to the mode value to open the file in binary mode,
which will improve portability. (Appending 'b' is useful even on
systems that don’t treat binary and text files differently, where it
serves as documentation.) See below for more possible values of mode.
So, in simple terms: when reading binary files, using open(file, 'rb') increases the portability of your code (it certainly did in this case).
Now it extracts without trouble on Windows...
