I am trying to learn Python as I go, but I have hit a brick wall.
I am just trying to extract certain offsets from a .bin file.
I have a bin file with a length of "00FFFFF0".
Let's say I want to extract from offset "0x3F000" with a block size of "0x800" and then put that in a file. How would I go about it? I don't have any code yet and am hoping I will get some good input. I am a beginner to Python (been doing it for a few months) and would like to learn how to do this, really just for educational purposes.
The point is that I want to be able to extract any specific (offset; block size) pair.
I hope you understand what I mean, and I very much appreciate any help I am given. Thanks.
It's pretty self-explanatory, actually:
# Use the with statement to open a file so it will later be closed automatically
with open("in.bin", "rb") as infile:  # rb = read binary
    infile.seek(0x3F000, 0)  # 0 = start of file, optional in this case
    data = infile.read(0x800)
with open("out.bin", "wb") as outfile:
    outfile.write(data)
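If you later want to pull out several different (offset; block size) pairs, you can wrap the same three steps in a small helper. This is just a sketch; the function name and the example call are mine, not part of any library:
def extract_block(src_path, dest_path, offset, size):
    # Open the source, jump to the offset, and grab `size` bytes
    with open(src_path, "rb") as src:
        src.seek(offset)
        data = src.read(size)
    # Write the extracted block out to its own file
    with open(dest_path, "wb") as dest:
        dest.write(data)

extract_block("in.bin", "out.bin", 0x3F000, 0x800)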
I am a little ashamed that I have to ask this question because I feel like I should know this. I haven't been programming long, but I am trying to apply what I learn to a project I'm working on, and that is how I got to this question. fastText has a library of words and associated points https://fasttext.cc/docs/en/english-vectors.html . It is used to find the vector of a word. I just want to look a word or two up and see what the result is, in order to see if it is useful for my project. They have provided a list of vectors and then a small code chunk. I cannot make heads or tails out of it. Some of it I get, but I do not see a print function - is it returning the data to a different part of your own code? I also am not sure where the chunk of code opens the data file; usually fname is a handle, right? Or are they expecting you to type your file's path there? I also am not familiar with io. I googled the word but didn't find anything useful. Is this something I need to download, or is it already a part of Python? I know I might be a little out of my league, but I learn best by doing, so please don't hate on me.
import io
def load_vectors(fname):
    fin = io.open(fname, 'r', encoding='utf-8', newline='\n', errors='ignore')
    n, d = map(int, fin.readline().split())
    data = {}
    for line in fin:
        tokens = line.rstrip().split(' ')
        data[tokens[0]] = map(float, tokens[1:])
    return data
Try the following:
my_file_name = 'C:/path/to/file.txt'  # Use the path to the vectors file you downloaded
my_data = load_vectors(my_file_name)  # The function returns the data as a dict
print(my_data)  # To see the output
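One caveat about the snippet itself: in Python 3, map() returns a lazy iterator, so printing the whole dict will show map objects rather than numbers. To inspect the vector for a single word, convert it to a list first; 'king' below is just a guess at a word that exists in the file:
vector = list(my_data['king'])  # materialize the lazy map into a list of floats
print(vector)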
This is my very first question on Stack Overflow, so I must beg your patience.
I believe there is something wrong with the format of a CSV file I need to load into Python. I'm using a Jupyter Notebook. The link to the file is here.
It is from the World Inequality Database data portal.
I'm pretty sure the delimiter is a semicolon (sep=";"), because the bottom half of the data renders neatly when I specify this argument. However, the first half of the text in the file seems to make no sense. I have no idea how to tell the pd.read_csv() function how to read it. I suspect the first half of the data simply has terrible formatting. I've also tried header=None and sep="|" to no avail.
Any ideas or suggestions would be very helpful. Thank you very much!
This is common with spreadsheets. You may have some commentary, and tables may be inserted all over the place. It looks great to the content creator, but the CSV is a mess. You need to preprocess the CSV to create clean content for your analysis. In this case, it's easy. The content starts at a canned header line, and you can split the file there. If that header ever changes, you'll get an error, and then it's just one more sleepless night figuring out what they've done.
import itertools
import os

canned_header_line = "Variable Code;country;year;perc;agdpro999i;"\
    "npopul999i;mgdpro999i;inyixx999i;xlceux999i;xlcusx999i;xlcyux999i"

def scrub_WID_file(in_csv_filename, out_csv_filename):
    with open(in_csv_filename) as in_file,\
            open(out_csv_filename, 'w') as out_file:
        # Skip every line until the known header, then write out the rest
        out_file.writelines(itertools.dropwhile(
            lambda line: line.strip() != canned_header_line,
            in_file))
    if not os.stat(out_csv_filename).st_size:
        raise ValueError("No recognized header in " + in_csv_filename)
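A hypothetical usage, with made-up file names, showing the scrub followed by the pandas load you were attempting:
import pandas as pd

scrub_WID_file('WID_original.csv', 'WID_clean.csv')
df = pd.read_csv('WID_clean.csv', sep=';')  # semicolon delimiter, as you found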
I have a task in a training where I have to read and filter the 'good' reads of big fastq files. Each record contains a header, a DNA string, a + sign, and some symbols (the quality of each DNA string). Example:
@hhhhhhhh
ATGCGTAGGGG
+
IIIIIIIIIII
I down-sampled, got the code working, and saved the reads in a Python dictionary. But it turns out the original files are huge, so I rewrote the code as a generator. That worked for the down-sampled file. But I was wondering: is it a good idea to pull out all the data and filter it in a dictionary? Does anybody here have a better idea?
I am asking because I am doing this by myself. I started learning Python a few months ago and I am still learning, but I am doing it alone. Because of this I am asking for tips and help here, and sorry if I sometimes ask silly questions.
Thanks in advance.
Paulo
I got some ideas from code on Biostar:
import sys
import gzip

filename = sys.argv[1]

def parsing_fastq_files(filename):
    with gzip.open(filename, "rb") as infile:
        count_lines = 0
        for line in infile:
            line = line.decode()
            if count_lines % 4 == 0:    # header line of each 4-line record
                ids = line[1:].strip()
                yield ids
            elif count_lines % 4 == 1:  # sequence line
                reads = line.rstrip()
                yield reads
            count_lines += 1

total_reads = parsing_fastq_files(filename)
print(next(total_reads))
print(next(total_reads))
I now need to figure out how to filter the data, using something like if value.endswith('expression'):. I could use a dict for that, but that's my doubt, because of the sheer number of keys and values.
Since this training forces you to code this manually, and you have code that reads the fastq as a generator, you can now use whatever metric you have (Phred score, maybe?) for determining the quality of a read. You can append each "good" read to a new file, so you don't hold much in working memory even if almost all reads turn out to be good.
Writing to a file is a slow operation, so you could wait until you have, say, 50000 good sequences and then write them to the file in one go.
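A rough sketch of that buffering idea, built on your generator; is_good_read() is a placeholder for whatever quality test you settle on:
def filter_reads(filename, out_path, batch_size=50000):
    records = parsing_fastq_files(filename)
    buffer = []
    with open(out_path, "w") as out_file:
        # The generator alternates id, sequence, id, sequence, ...;
        # zip() over the same iterator pairs them back up.
        for read_id, sequence in zip(records, records):
            if is_good_read(sequence):  # e.g. sequence.endswith('expression')
                # FASTA-style output, since the generator drops the quality lines
                buffer.append(">{0}\n{1}\n".format(read_id, sequence))
            if len(buffer) >= batch_size:
                out_file.writelines(buffer)
                buffer = []
        out_file.writelines(buffer)  # flush whatever is left over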
Check out https://bioinformatics.stackexchange.com/ if you do a lot of bioinformatics programming.
Trying to make a script for disk imaging (such as the .dd format) in Python. It originally started as a project to get another hex debugger, and I kind of got more interested in trying to get raw data from the drive, which turned into wanting to be able to image the drive first. Anyway, I've been looking around for about a week or so, and the best way to get information from smaller drives appears to be something like:
with file("/dev/sda") as f:
    i = file("~/imagingtest.dd", "wb")
    i.write(f.read(SIZE))
with SIZE being the disk size. The problem, which seems to be a well-known issue, is that trying this with large disks (in my case a total size of 250059350016 bytes) fails with:
"OverflowError: Python int too large to convert to C long"
Is there a more appropriate way to get around this issue? It works fine for a small flash drive, but trying to image a full disk fails.
I've seen mention of possibly just iterating by sector size (512) per the number of sectors (in my case 488397168), but I would like to verify exactly how to do this in a way that would be functional.
Thanks in advance for any assistance, and sorry for any ignorance you easily notice.
Yes, that's how you should do it. Though you could go higher than the sector size if you wanted.
with open("/dev/sda",'rb') as f:
with open("~/imagingtest.dd", "wb") as i:
while True:
if i.write(f.read(512)) == 0:
break
Read the data in blocks. When you reach the end of the device, .read(blocksize) will return an empty bytes object (b'').
You can use iter() with a sentinel to do this easily in a loop:
from functools import partial

blocksize = 12345

with open("/dev/sda", 'rb') as f:
    for block in iter(partial(f.read, blocksize), b''):
        pass  # do something with the data block
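For the imaging use case specifically, the loop body would just write each block out, something like this (the 1 MiB block size and the output name are arbitrary choices):
from functools import partial

blocksize = 1024 * 1024  # 1 MiB per read; anything sensible works

with open("/dev/sda", 'rb') as f, open("imagingtest.dd", 'wb') as out:
    for block in iter(partial(f.read, blocksize), b''):
        out.write(block)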
You really want to open the device in binary mode, 'rb', if you want to make sure no line translations take place.
However, if you are trying to copy the device into another file, you may want to look at shutil.copyfile():
import shutil
shutil.copyfile('/dev/sda', 'destinationfile')
and it'll take care of the opening, reading and writing for you. If you want to have more control of the blocksize used for that, use shutil.copyfileobj(), open the file objects yourself and specify a blocksize:
import shutil
blocksize = 12345
with open("/dev/sda", 'rb') as f, open('destinationfile', 'wb') as dest:
    shutil.copyfileobj(f, dest, blocksize)
I guess I am doing something wrong.
I am not sure what it is though, but I keep getting TypeError: expected a character buffer object
I just want to open a file, seek to certain offsets and overwrite data from patch1 and patch2.
Here is the code I am using, please help me and show me what I am doing wrong:
patch1 = open("patch1", "r");
patch2 = open("patch2", "r");
main = open("patchthis.bin", "w");
main.seek(0xC0010);
main.write(patch1);
main.seek(0x7C0010);
main.write(patch1);
main.seek(0x40000);
main.write(patch2);
main.close();
I am a noob when it comes to file handling with Python, even though I have read up about it.
I really want to start learning more, but I need some good examples, and any help sure would be appreciated :)
You are trying to write a file object into the file, not a string.
Try:
patch1_text = patch1.read()
main.write(patch1_text)
and so on.
Also, use the with statement when operating on files:
with open('patch1', 'r') as patch1:
    patch1_text = patch1.read()
# no explicit close() needed; the with block closes the file automatically
And don't use semicolons at the end of lines!
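Putting it all together, a minimal sketch of the whole patch. One more problem your original code will hit: opening patchthis.bin with "w" truncates it to zero bytes; to overwrite in place you want update mode, "r+b" (and "rb" for the patches, since they hold raw bytes):
# Read both patches fully into memory as bytes
with open("patch1", "rb") as f:
    patch1_data = f.read()
with open("patch2", "rb") as f:
    patch2_data = f.read()

# "r+b" opens an existing file for reading and writing without truncating it
with open("patchthis.bin", "r+b") as main:
    main.seek(0xC0010)
    main.write(patch1_data)
    main.seek(0x7C0010)
    main.write(patch1_data)
    main.seek(0x40000)
    main.write(patch2_data)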