Python - Open subset of a binary file

Python - Open subset of a binary file - python

In python if I want to open something in binary I simply do:
f = open("myfile.bin", "rb")
Then f is an handler, and I can use it to read, detect end of file, seek etc.
Now, what if I wanted to do this on a subset of myfile.bin? Something like:
f = open("myfile.bin", "rb", beg=1000, end=1100)
So that I end up with an handler to some sort of virtual file, of length 100 bytes, that actually maps to a subset of the original file?
I know that if I was to work on the file's content it would be easy: I could just seek around and be careful. But there are several libraries that ask for file handles as inputs, and I would like them to work only on subset of files. Again, I know that I could make a copy of it and make the library read that, but I am working on quite huge files and I would prefer not to have to duplicate files.
Thank you very much!

Related

How to not have set written on my file- python 2

So I basically just want to have a list of all the pixel colour values that overlap written in a text file so I can then access them later.
The only problem is that the text file is having (set([ or whatever written with it.
Heres my code
import cv2
import numpy as np
import time
om=cv2.imread('spectrum1.png')
om=om.reshape(1,-1,3)
om_list=om.tolist()
om_tuple={tuple(item) for item in om_list[0]}
om_set=set(om_tuple)
im=cv2.imread('RGB.png')
im=cv2.resize(im,(100,100))
im= im.reshape(1,-1,3)
im_list=im.tolist()
im_tuple={tuple(item) for item in im_list[0]}
ColourCount= om_set & set(im_tuple)
File= open('Weedlist', 'w')
File.write(str(ColourCount))
Also, if I run this program again but with a different picture for comparison, will it append the data or overwrite it? It's kinda hard to tell when just looking at numbers.

If you replace these lines:
im=cv2.imread('RGB.png')
File= open('Weedlist', 'w')
File.write(str(ColourCount))
with:
import sys
im=cv2.imread(sys.argv[1])
open(sys.argv[1]+'Weedlist', 'w').write(str(list(ColourCount)))
you will get a new file for each input file and also you don't have to overwrite the RGB.png every time you want to try something new.
Files opened with mode 'w' will be overwritten. You can use 'a' to append.

You opened the file with the 'w' mode, write mode, which will truncate (empty) the file when you open it. Use 'a' append mode if you want data to be added to the end each time
You are writing the str() conversion of a set object to your file:
ColourCount= om_set & set(im_tuple)
File= open('Weedlist', 'w')
File.write(str(ColourCount))
Don't use str to convert the whole object; format your data to a string you find easy to read back again. You probably want to add a newline too if you want each new entry to be added on a new line. Perhaps you want to sort the data too, since a set lists items in an ordered determined by implementation details.
If comma-separated works for you, use str.join(); your set contains tuples of integer numbers, and it sounds as if you are fine with the repr() output per tuple, so we can re-use that:
with open('Weedlist', 'a') as outputfile:
output = ', '.join([str(tup) for tup in sorted(ColourCount)])
outputfile.write(output + '\n')
I used with there to ensure that the file object is automatically closed again after you are done writing; see Understanding Python's with statement for further information on what this means.
Note that if you plan to read this data again, the above is not going to be all that efficient to parse again. You should pick a machine-readable format. If you need to communicate with an existing program, you'll need to find out what formats that program accepts.
If you are programming that other program as well, pick a format that other programming language supports. JSON is widely supported for example (use the json module and convert your set to a list first; json.dump(sorted(ColourCount), fileobj), then `fileobj.write('\n') to produce newline-separated JSON objects could do).
If that other program is coded in Python, consider using the pickle module, which writes Python objects to a file efficiently in a format the same module can load again:
with open('Weedlist', 'ab') as picklefile:
pickle.dump(ColourCount, picklefile)
and reading is as easy as:
sets = []
with open('Weedlist', 'rb') as picklefile:
while True:
try:
sets.append(pickle.load(output))
except EOFError:
break
See Saving and loading multiple objects in pickle file? as to why I use a while True loop there to load multiple entries.

How would you like the data to be written? Replace the final line by
File.write(str(list(ColourCount)))
Maybe you like that more.
If you run that program, it will overwrite the previous content of the file. If you prefer to apprend the data open the file with:
File= open('Weedlist', 'a')

Python basics - request data from API and write to a file

I am trying to use "requests" package and retrieve info from Github, like the Requests doc page explains:
import requests
r = requests.get('https://api.github.com/events')
And this:
with open(filename, 'wb') as fd:
for chunk in r.iter_content(chunk_size):
fd.write(chunk)
I have to say I don't understand the second code block.
filename - in what form do I provide the path to the file if created? where will it be saved if not?
'wb' - what is this variable? (shouldn't second parameter be 'mode'?)
following two lines probably iterate over data retrieved with request and write to the file
Python docs explanation also not helping much.
EDIT: What I am trying to do:
use Requests to connect to an API (Github and later Facebook GraphAPI)
retrieve data into a variable
write this into a file (later, as I get more familiar with Python, into my local MySQL database)

Filename
When using open the path is relative to your current directory. So if you said open('file.txt','w') it would create a new file named file.txt in whatever folder your python script is in. You can also specify an absolute path, for example /home/user/file.txt in linux. If a file by the name 'file.txt' already exists, the contents will be completely overwritten.
Mode
The 'wb' option is indeed the mode. The 'w' means write and the 'b' means bytes. You use 'w' when you want to write (rather than read) froma file, and you use 'b' for binary files (rather than text files). It is actually a little odd to use 'b' in this case, as the content you are writing is a text file. Specifying 'w' would work just as well here. Read more on the modes in the docs for open.
The Loop
This part is using the iter_content method from requests, which is intended for use with large files that you may not want in memory all at once. This is unnecessary in this case, since the page in question is only 89 KB. See the requests library docs for more info.
Conclusion
The example you are looking at is meant to handle the most general case, in which the remote file might be binary and too big to be in memory. However, we can make your code more readable and easy to understand if you are only accessing small webpages containing text:
import requests
r = requests.get('https://api.github.com/events')
with open('events.txt','w') as fd:
fd.write(r.text)

filename is a string of the path you want to save it at. It accepts either local or absolute path, so you can just have filename = 'example.html'
wb stands for WRITE & BYTES, learn more here
The for loop goes over the entire returned content (in chunks incase it is too large for proper memory handling), and then writes them until there are no more. Useful for large files, but for a single webpage you could just do:
# just W becase we are not writing as bytes anymore, just text.
with open(filename, 'w') as fd:
fd.write(r.content)

How can a Python program load and read specific lines from a file?

I have a huge file of numbers in binary format, and only certain parts of it needs to be parsed into an array. I looked into numpy.fromfile and open, but they don't have the option to read from location A to location B in the file. Can this be done?

If you're dealing with "huge files", I would not simply read-ignore everything up until the point where you actually need the data.
Instead: file objects in Python have a .seek() method which you can use to jump right where you need to start parsing the data efficiently bypassing everything before.
with open('huge_file.dat', 'rb') as f:
f.seek(1024 * 1024 * 1024) # skip 1GB
...
See also: http://docs.python.org/2/tutorial/inputoutput.html#methods-of-file-objects

If you know about the precise location of the data you are interested in, you could just use the seek(<n bytes>) method on the file object as documented. Just call it once (with the given offset) before you start to read.

Python Disk Imaging

Trying to make a script for disk imaging (such as .dd format) in python. Originally started as a project to get another hex debugger and kinda got more interested in trying to get raw data from the drive. which turned into wanting to be able to image the drive first. Anyways, I've been looking around for about a week or so and found the best way get get information from the drive on smaller drives appears to be something like:
with file("/dev/sda") as f:
i=file("~/imagingtest.dd", "wb")
i.write(f.read(SIZE))
with size being the disk size. Problem is, which seems to be a well known issue, trying to use large disks shows up as (even in my case total size of 250059350016 bytes):
"OverflowError: Python int too large to convert to C long"
Is there a more appropriate way to get around this issue? As it works fine for a small flash drive, but trying to image a drive, fails.
I've seen mention of possibly just iterating by sector size (512) per the number of sectors (in my case 488397168) however would like to verify exactly how to do this in a way that would be functional.
Thanks in advance for any assistance, sorry for any ignorance you easily notice.

Yes, that's how you should do it. Though you could go higher than the sector size if you wanted.
with open("/dev/sda",'rb') as f:
with open("~/imagingtest.dd", "wb") as i:
while True:
if i.write(f.read(512)) == 0:
break

Read the data in blocks. When you reach the end of the device, .read(blocksize) will return the empty string.
You can use iter() with a sentinel to do this easily in a loop:
from functools import partial
blocksize = 12345
with open("/dev/sda", 'rb') as f:
for block in iter(partial(f.read, blocksize), ''):
# do something with the data block
You really want to open the device in binary mode, 'rb' if you want to make sure no line translations take place.
However, if you are trying to create copy into another file, you want to look at shutil.copyfile():
import shutil
shutil.copyfile('/dev/sda', 'destinationfile')
and it'll take care of the opening, reading and writing for you. If you want to have more control of the blocksize used for that, use shutil.copyfileobj(), open the file objects yourself and specify a blocksize:
import shutil
blocksize = 12345
with open("/dev/sda", 'rb') as f, open('destinationfile', 'wb') as dest:
shutil.copyfileobj(f, dest, blocksize)

Reading lines in text files using python

I am currently programming a game that requires reading and writing lines in a text file. I was wondering if there is a way to read a specific line in the text file (i.e. the first line in the text file). Also, is there a way to write a line in a specific location (i.e. change the first line in the file, write a couple of other lines and then change the first line again)? I know that we can read lines sequentially by calling:
f.readline()
Edit: Based on responses, apparently there is no way to read specific lines if they are different lengths. I am only working on a small part of a large group project and to change the way I'm storing data would mean a lot of work.
But is there a method to change specifically the first line of the file? I know calling:
f.write('text')
Writes something into the file, but it writes the line at the end of the file instead of the beginning. Is there a way for me to specifically rewrite the text at the beginning?

If all your lines are guaranteed to be the same length, then you can use f.seek(N) to position the file pointer at the N'th byte (where N is LINESIZE*line_number) and then f.read(LINESIZE). Otherwise, I'm not aware of any way to do it in an ordinary ASCII file (which I think is what you're asking about).
Of course, you could store some sort of record information in the header of the file and read that first to let you know where to seek to in your file -- but at that point you're better off using some external library that has already done all that work for you.
Unless your text file is really big, you can always store each line in a list:
with open('textfile','r') as f:
lines=[L[:-1] for L in f.readlines()]
(note I've stripped off the newline so you don't have to remember to keep it around)
Then you can manipulate the list by adding entries, removing entries, changing entries, etc.
At the end of the day, you can write the list back to your text file:
with open('textfile','w') as f:
f.write('\n'.join(lines))
Here's a little test which works for me on OS-X to replace only the first line.
test.dat
this line has n characters
this line also has n characters
test.py
#First, I get the length of the first line -- if you already know it, skip this block
f=open('test.dat','r')
l=f.readline()
linelen=len(l)-1
f.close()
#apparently mode='a+' doesn't work on all systems :( so I use 'r+' instead
f=open('test.dat','r+')
f.seek(0)
f.write('a'*linelen+'\n') #'a'*linelen = 'aaaaaaaaa...'
f.close()

These days, jumping within files in an optimized fashion is a task for high performance applications that manage huge files.
Are you sure that your software project requires reading/writing random places in a file during runtime? I think you should consider changing the whole approach:
If the data is small, you can keep / modify / generate the data at runtime in memory within appropriate container formats (list or dict, for instance) and then write it entirely at once (on change, or only when your program exits). You could consider looking at simple databases. Also, there are nice data exchange formats like JSON, which would be the ideal format in case your data is stored in a dictionary at runtime.
An example, to make the concept more clear. Consider you already have data written to gamedata.dat:
[{"playtime": 25, "score": 13, "name": "rudolf"}, {"playtime": 300, "score": 1, "name": "peter"}]
This is utf-8-encoded and JSON-formatted data. Read the file during runtime of your Python game:
with open("gamedata.dat") as f:
s = f.read().decode("utf-8")
Convert the data to Python types:
gamedata = json.loads(s)
Modify the data (add a new user):
user = {"name": "john", "score": 1337, "playtime": 1}
gamedata.append(user)
John really is a 1337 gamer. However, at this point, you also could have deleted a user, changed the score of Rudolf or changed the name of Peter, ... In any case, after the modification, you can simply write the new data back to disk:
with open("gamedata.dat", "w") as f:
f.write(json.dumps(gamedata).encode("utf-8"))
The point is that you manage (create/modify/remove) data during runtime within appropriate container types. When writing data to disk, you write the entire data set in order to save the current state of the game.

We Keep Coding

Python is a programming language that lets you work quickly and integrate systems more effectively.

Python - Open subset of a binary file - python

Related

How to not have set written on my file- python 2

Python basics - request data from API and write to a file

How can a Python program load and read specific lines from a file?

Python Disk Imaging

Reading lines in text files using python

Categories

Resources