Python - Parse file into outputs based on magic number/length - python

I'm a complete beginner to coding - only started 3 weeks ago, and really only have codecademy's Python course under my belt - so simple explanations would be really appreciated!
I'm trying to write a python script that reads a file as a HEX string, and then parses the file into individual output files based on finding a "magic number" within the HEX string.
EG: if my HEX string were "0011AABB00BBAACC00223344", I might want to parse this string into new output files based on the magic number "00", and telling python that each output should be 8 characters long. The output for the example string above should be 3 files that contain the HEX values:
"0011AABB"
"00BBAACC"
"00223344"
Here's what I have so far (assuming in this case that the string above is contained within the "hextests" file):
import os
import binascii
filename = "hextests"
# read file as a binary string
with open(filename, 'rb') as f:
    content = f.read()
# convert binary string to hex string
hexString = binascii.hexlify(content)
# define magic number as "00"
magic_N = "00"
# attempting to create a new substring called newFile that is equal to each instance magic_N repeats throughout the file for a length of 8 characters
for chars in hexString:
    newFile = ""
    if chars == magic_N:
        newFile += chars.len(9)
# attempting to create a series of new output files for each instance of newFile - while incrementing the output file name
if newFile != "":
    i = 0
    while os.path.exists("output_file%s.xyz" % i):
        i += 1
    fh = with open("output_file%s.xyz" % i, "wb"):
        newFile
I'm sure I have a lot of errors to work through on this - and it's likely more complicated than I think .... but my main question has to do with the proper way to define the chars and newFile variables. I'm pretty sure python sees chars as only single characters in the string, so it's failing because I'm attempting to search for a magic_N that is longer than 1 character. Am I correct that that is part of the issue?
Also, if you understand the main goal of this script, any other thoughts about things I should be doing differently?
Thanks so much for the help!

You can try something like this:
filename = "hextests"
# read file as a binary string
with open(filename, "rb") as f:
    content = f.read()
# You don't need this part if you want
# to parse the hex string as it is given in the file
# convert binary string to hex string
# hexString = binascii.hexlify(content)
# Remove the newline at the end of the string
hexString = content.strip()
# define magic number as b"00" (bytes, because the file was opened in binary mode)
magic_N = b"00"
i = 0
j = 0
while i < len(hexString) - 1:
    # find the next magic number at or after position i
    index = hexString.find(magic_N, i)
    if index == -1:
        break
    # This is the part which was incorrect in your code.
    with open("output_file_%s.xyz" % j, "wb") as output:
        output.write(hexString[index:index+8])
    i = index + 8
    j += 1
Note that you need to call the write method explicitly to write the data to the output file.
Here it is assumed that the chunks of data are exactly 8 hex symbols long and they always start with 00. So it's not a flexible solution but it gives you an idea on how to tackle the problem.
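A slightly more flexible variant, as a sketch only: the function name split_on_magic and its parameters are made up for illustration, and it assumes the magic number should only be recognized at byte boundaries (even offsets in the hex string).

```python
def split_on_magic(hex_string, magic="00", length=8):
    """Return every chunk of `length` hex characters that starts with `magic`."""
    chunks = []
    i = 0
    while i < len(hex_string):
        index = hex_string.find(magic, i)
        if index == -1:
            break  # no more magic numbers
        if index % 2:
            # the match straddles two bytes (e.g. "1001"), so skip past it
            i = index + 1
            continue
        chunks.append(hex_string[index:index + length])
        i = index + length
    return chunks


chunks = split_on_magic("0011AABB00BBAACC00223344")
print(chunks)
```

Each element of chunks can then be written to its own output file in a loop, numbering the files with enumerate.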

Related

How to store first N strings from a txt file in Python?

I'm trying to figure out how to get the first N strings from a txt file, and store them into an array. Right now, I have code that gets every string from a txt file, separated by a space delimiter, and stores it into an array. However, I want to be able to only grab the first N number of strings from it, not every single string. Here is my code (and I'm doing it from a command prompt):
import sys
f = open(sys.argv[1], "r")
contents = f.read().split(' ')
f.close()
I'm sure that the only line I need to fix is:
contents = f.read().split(' ')
I'm just not sure how to limit it here to N number of strings.
If the file is really big, but not too big--that is, big enough that you don't want to read the whole file (especially in text mode or as a list of lines), but not so big that you can't page it into memory (which means under 2GB on a 32-bit OS, but a lot more on 64-bit), you can do this:
import itertools
import mmap
import re
import sys
n = 5
# Notice that we're opening in binary mode. We're going to do a
# bytes-based regex search. This is only valid if (a) the encoding
# is ASCII-compatible, and (b) the spaces are ASCII whitespace, not
# other Unicode whitespace.
with open(sys.argv[1], 'rb') as f:
    # map the whole file into memory (a length of 0 means the whole file)--this
    # won't actually read more than a page or so beyond the last space
    m = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
    # match and decode all space-separated words, but do it lazily...
    # (the pattern has to be a bytes pattern because m holds bytes)
    matches = re.finditer(rb'(.*?)\s', m)
    bytestrings = (match.group(1) for match in matches)
    strings = (b.decode() for b in bytestrings)
    # ... so we can stop after 5 of them ...
    nstrings = itertools.islice(strings, n)
    # ... and turn that into a list of the first 5
    contents = list(nstrings)
Obviously you can combine steps together, even cramming the whole thing into a giant one-liner if you want. (An idiomatic version would be somewhere between that extreme and this one.)
If you're fine with reading the whole file (assuming it's not memory prohibitive to do so) you can just do this:
strings_wanted = 5
strings = open('myfile').read().split()[:strings_wanted]
That works like this:
>>> s = 'this is a test string with more than five words.'
>>> s.split()[:5]
['this', 'is', 'a', 'test', 'string']
If you actually want to stop reading exactly as soon as you've reached the nth word, you pretty much have to read a byte at a time. But that's going to be slow, and complicated. Plus, it's still not really going to stop reading after the nth word, unless you're reading in binary mode and decoding manually, and you disable buffering.
As long as the text file has line breaks (as opposed to being one giant 80MB line), and it's acceptable to read a few bytes past the nth word, a very simple solution will still be pretty efficient: just read and split line by line:
import sys
n = 5  # number of words wanted
contents = []
with open(sys.argv[1], "r") as f:
    for line in f:
        contents += line.split()
        if len(contents) >= n:
            # drop any extra words picked up from the last line read
            del contents[n:]
            break
what about just:
output=input[:3]
output will contain the first three strings in input

Same value in list keeps getting repeated when writing to text file

I'm a total noob to Python and need some help with my code.
The code is meant to take Input.txt [http://pastebin.com/bMdjrqFE], split it into separate Pokemon (in a list), and then split that into separate values which I use to reformat the data and write it to Output.txt.
However, when I run the program, only the last Pokemon gets outputted, 386 times. [http://pastebin.com/wkHzvvgE]
Here's my code:
f = open("Input.txt", "r")#opens the file (input.txt)
nf = open("Output.txt", "w")#opens the file (output.txt)
pokeData = []
for line in f:
#print "%r" % line
pokeData.append(line)
num = 0
tab = """ """
newl = """NEWL
"""
slash = "/"
while num != 386:
current = pokeData
current.append(line)
print current[num]
for tab in current:
words = tab.split()
print words
for newl in words:
nf.write('%s:{num:%s,species:"%s",types:["%s","%s"],baseStats:{hp:%s,atk:%s,def:%s,spa:%s,spd:%s,spe:%s},abilities:{0:"%s"},{1:"%s"},heightm:%s,weightkg:%s,color:"Who cares",eggGroups:["%s"],["%s"]},\n' % (str(words[2]).lower(),str(words[1]),str(words[2]),str(words[3]),str(words[4]),str(words[5]),str(words[6]),str(words[7]),str(words[8]),str(words[9]),str(words[10]),str(words[12]).replace("_"," "),str(words[12]),str(words[14]),str(words[15]),str(words[16]),str(words[16])))
num = num + 1
nf.close()
f.close()
There are quite a few problems with your program starting with the file reading.
To read the lines of a file to an array you can use file.readlines().
So instead of
f = open("Input.txt", "r")#opens the file (input.txt)
pokeData = []
for line in f:
#print "%r" % line
pokeData.append(line)
You can just do this
pokeData = open("Input.txt", "r").readlines() # This will return each line within an array.
Next, you are misunderstanding the uses of for and while.
A for loop in Python iterates over a list (or any other sequence), as shown below. It creates a new variable and assigns each element of the list to it in turn, so I don't know what you were trying to do with for newl in words: that loop overwrites your newl variable with each word.
array = ["one", "two", "three"]
for i in array: # i is created
    print(i)
The output will be:
one
two
three
So to fix a lot of this code, you can replace the whole while loop with something like this.
(The code below assumes your input file is formatted so that all the words are separated by tabs.)
for line in pokeData:
    words = line.split("\t") # Split the line by tabs
    nf.write('your very long and complicated string')
Other helpers
The formatted string that you write to the output file looks very similar to the JSON format. There is a built-in Python module called json that can convert a native Python dict to a JSON string. This will probably make things a lot easier for you, but either way works.
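As a rough sketch of the json suggestion: the field names below are copied from the format string in the question, while the sample words list and its column positions are assumptions made up for illustration.

```python
import json

# hypothetical split line; the real column order comes from Input.txt
words = ["x", "1", "Bulbasaur", "Grass", "Poison",
         "45", "49", "49", "65", "65", "45"]

entry = {
    "num": int(words[1]),
    "species": words[2],
    "types": [words[3], words[4]],
    "baseStats": {
        "hp": int(words[5]), "atk": int(words[6]), "def": int(words[7]),
        "spa": int(words[8]), "spd": int(words[9]), "spe": int(words[10]),
    },
}

# json.dumps takes care of the quoting, commas and nesting
line = json.dumps({words[2].lower(): entry})
print(line)
```

Building a dict first and serializing it once avoids the long hand-written format string, and the output is guaranteed to be valid JSON.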
Hope this helps

How to loop over every 2 characters in a file in python

I'm trying to loop over every 2 character in a file, do some tasks on them and write the result characters into another file.
So I tried to open the file and read the first two characters. Then I set the pointer to the 3rd character in the file, but it gives me the following error:
'bytes' object has no attribute 'seek'
This is my code:
the_file = open('E:\\test.txt',"rb").read()
result = open('E:\\result.txt',"w+")
n = 0
s = 2
m = len(the_file)
while n < m :
    chars = the_file.seek(n)
    chars.read(s)
    #do something with chars
    result.write(chars)
    n =+ 1
    m =+ 2
I have to mention that inside test.txt is only integers (numbers).
The content of test.txt is a series of binary data (0's and 1's) like this:
01001010101000001000100010001100010110100110001001011100011010000001010001001
It's not the point here, but I just want to replace every 2 characters with something else and write the result into result.txt.
Call seek on the file object, not on its contents.
Use an if statement to break out of the loop, since you no longer have the length.
Use n += rather than n =+.
Finally, advance the position by 2 and read 2 characters each time.
Hopefully this will get you close to what you want.
Note: I changed the file names for the example
the_file = open('test.txt', "rb")
# open the output in binary mode too, because read() returns bytes here
result = open('result.txt', "wb")
n = 0
s = 2
while True:
    the_file.seek(n)
    chars = the_file.read(s)
    if not chars:
        break
    #do something with chars
    print(chars)
    result.write(chars)
    n += s
the_file.close()
result.close()
Note that because, in this case, you are reading the file sequentially in chunks (i.e. read(2) rather than read()), the seek is superfluous.
The seek() would only be required if you wished to alter the position pointer within the file, say for example you wanted to start reading at the 100th byte (seek(99)).
The above could be written as:
the_file = open('test.txt', "rb")
# open the output in binary mode too, because read() returns bytes here
result = open('result.txt', "wb")
while True:
    chars = the_file.read(2)
    if not chars:
        break
    #do something with chars
    print(chars)
    result.write(chars)
the_file.close()
result.close()
You were trying to call the .seek() method on the file's contents because you thought you had a file object, but open(...).read() returns the contents (a bytes object in "rb" mode), not the file.
Here's a general approach I might take to what you were going for:
# open the file and load its contents as a string file_contents
with open('E:\\test.txt', "r") as f:
    file_contents = f.read()
# do the stuff you were doing
s = 2
m = len(file_contents)
# initialize a result string
result = ""
# iterate over the file_contents, incrementing by 2, adding to results
# (this keeps the first character of each pair)
for i in range(0, m, s):
    result += file_contents[i]
# write to results.txt (text mode, because result is a str)
with open('E:\\result.txt', 'w') as f:
    f.write(result)
Edit: It seems like there was a change to the question. If you want to change every second character, you'll need to make some adjustments.
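If the goal is to replace every 2-character chunk with something else, one possible sketch is shown below. The replacements table and the replace_pairs helper are hypothetical; the question does not say what each pair should become.

```python
# hypothetical mapping from each 2-character chunk to its replacement
replacements = {"00": "A", "01": "B", "10": "C", "11": "D"}


def replace_pairs(text, table, size=2):
    """Replace each `size`-character chunk of `text` using `table`."""
    pieces = []
    for i in range(0, len(text), size):
        chunk = text[i:i + size]
        # leave a chunk unchanged if the table has no entry for it
        pieces.append(table.get(chunk, chunk))
    return "".join(pieces)


result = replace_pairs("0100101010", replacements)
print(result)
```

To use it on a file, read the contents with f.read().strip(), pass them to replace_pairs, and write the returned string to result.txt in text mode.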

Python: write and read blocks of binary data to a file

I am working on a script that breaks another Python script down into blocks and uses pycrypto to encrypt the blocks (all of this I have done successfully so far). Now I am storing the encrypted blocks in a file so that the decrypter can read it and execute each block. The final result of the encryption is a list of binary outputs (something like blocks=[b'\xa1\r\xa594\x92z\xf8\x16\xaa',b'xfbI\xfdqx|\xcd\xdb\x1b\xb3',etc...]).
When writing the output to a file, they all end up into one giant line, so that when reading the file, all the bytes come back in one giant line, instead of each item from the original list. I also tried converting the bytes into a string, and adding a '\n' at the end of each one, but the problem there is that I still need the bytes, and I can't figure out how to undo the string to get the original byte.
To summarize, I am looking to either write each binary item to a separate line in a file so I can easily read the data and use it in the decryption, or translate the data to a string and, in the decryption, undo the string to get back the original binary data.
Here is the code for writing to the file:
new_file = open('C:/Python34/testfile.txt','wb')
for byte_item in byte_list:
    # This or for the string i just replaced wb with w and
    # byte_item with ascii(byte_item) + '\n'
    new_file.write(byte_item)
new_file.close()
and for reading the file:
# Or 'r' instead of 'rb' if using string method
byte_list = open('C:/Python34/testfile.txt','rb').readlines()
A file is a stream of bytes without any implied structure. If you want to load a list of binary blobs then you should store some additional metadata to restore the structure e.g., you could use the netstring format:
#!/usr/bin/env python
blocks = [b'\xa1\r\xa594\x92z\xf8\x16\xaa', b'xfbI\xfdqx|\xcd\xdb\x1b\xb3']
# save blocks
with open('blocks.netstring', 'wb') as output_file:
    for blob in blocks:
        # [len]":"[string]","
        output_file.write(str(len(blob)).encode())
        output_file.write(b":")
        output_file.write(blob)
        output_file.write(b",")
Read them back:
#!/usr/bin/env python3
import re
from mmap import ACCESS_READ, mmap
blocks = []
match_size = re.compile(br'(\d+):').match
with open('blocks.netstring', 'rb') as file, \
        mmap(file.fileno(), 0, access=ACCESS_READ) as mm:
    position = 0
    for m in iter(lambda: match_size(mm, position), None):
        i, size = m.end(), int(m.group(1))
        blocks.append(mm[i:i + size])
        position = i + size + 1 # shift to the next netstring
print(blocks)
As an alternative, you could consider BSON format for your data or ascii armor format.
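For the text-based route, a minimal sketch is to base64-encode each block and write one block per line. Plain base64 is used here as a simple stand-in for an "ascii armor" format (it is not the OpenPGP armor format), and the file name and sample blocks are made up for illustration.

```python
import base64
import os
import tempfile

blocks = [b'\xa1\r\xa594\x92z\xf8\x16\xaa', b'\x00\n\xff']

# hypothetical output location
path = os.path.join(tempfile.gettempdir(), 'blocks.b64')

# save: one base64-encoded block per line
with open(path, 'wb') as output_file:
    for blob in blocks:
        output_file.write(base64.b64encode(blob) + b'\n')

# load: decode each line back into the original bytes
with open(path, 'rb') as input_file:
    restored = [base64.b64decode(line.strip()) for line in input_file]

print(restored == blocks)
```

Base64 output never contains a newline byte, so splitting the file on lines is safe even when the raw blocks contain \n or \r.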
I think what you're looking for is byte_list=open('C:/Python34/testfile.txt','rb').read()
If you know how many bytes each item is, you can use read(number_of_bytes) to process one item at a time.
read() will read the entire file, but then it is up to you to decode that entire list of bytes into their respective items.
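A small sketch of the fixed-size approach, assuming every item has the same known length. BLOCK_SIZE and the sample data are made up, and io.BytesIO stands in for a real file opened with 'rb'.

```python
import io

BLOCK_SIZE = 4  # assumed size of each item in bytes

# io.BytesIO stands in for a file opened with open(path, 'rb')
stream = io.BytesIO(b'aaaabbbbcccc')

items = []
while True:
    item = stream.read(BLOCK_SIZE)
    if not item:
        break  # read() returns b'' at end of file
    items.append(item)

print(items)
```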
In general, since you're using Python 3, you will be working with bytes objects (which are immutable) and/or bytearray objects (which are mutable).
Example:
b1 = bytearray('hello', 'utf-8')
print(b1)
b1 += bytearray(' goodbye', 'utf-8')
print(b1)
with open('temp.bin', 'wb') as f:
    f.write(b1)
#------
with open('temp.bin', 'rb') as f:
    b2 = f.read()
print(b2)
Output:
bytearray(b'hello')
bytearray(b'hello goodbye')
b'hello goodbye'

Combine strings, extract substrings

(I'm using python)
I'm working with a large file of RNA sequences, and I'm trying to reformat it to use in a clustering program. My file is made up of two types of 'lines.' 1) Accession numbers for bacteria, (period) the nucleotide this sequence starts at, (period) the nucleotide it ends at. 2) lines of the actual sequence itself (across multiple lines, even though it's a continuous sequence):
>A45315.1.1521\n
GACGAACGCUGGCGGCGUGCCUAAUACAUGCAAGUCGAGCGCAGGAAGCCGGCGGAUCCC\n
UUCGGGGUGAANCCGGUGGAAUGAGCGGCGGACGGGUGAGUAACACGUGGGCAACCUACC\n
UUGUAGACUGGGAUAACUCCGGGAAACCGGGGCUAAUACCGGAUGAUCAUUUGGAUCGCAU\n
GAUCCGAAUGUAAAAGUGGGGAUUUAUCCUCACACUGCAAGAUGGGCCCGCGGCGCA…..
>A93610.15.1301\n
CCACUGCUAUGGGGGUCCGACUAAGCCAUGCGAGUCAUGGGGUCCCUCUGGGACACCACC\n
GGCGGACGGCUCAGUAACACGUCGGUAACCUACCCUCGGGAGGGGGAUAACCCCGGGAAA\n
CUGGGGCUAAUCCCCCAUAGGCCUGAGGUACUGGAAGGUCCUCAGGCCGAAAGGGGCUU….
I need to create something that looks at the lines that start with >, and go to the number after the first decimal (so above that would be 1 and 15). Starting a count at that number (so 1 or 15 in the above example), it needs to extract the nucleotides (As,Cs,Gs or Us) that start at 69 and go to 497 (note for this example I took out a bunch of the nucleotides).
So, for my attempt, I thought it would make sense to make the nucleotide sequences into one long string, and then try to extract the nucleotides. But I can't seem to make the lines of RNA sequences into one long string (see below for what I tried). And once I have the large string, I'm not sure how to extract the right nucleotides. I need to write something like s = [x:497], where x is 69-(insert that number before the first decimal).
#!/usr/bin/env python
#Make a program that takes SSURef_NR99 file of sequences, makes a new file of
#Accession numbers and size of 16S.
import re
infilename = 'SSUtestdata.txt'
outfilename = 'SSUtestdata3.txt'
#Here I'm trying to search for one of the nucleotides, an end of line character and another nucleotide, trying to make a long string.
replace = re.compile(r'([A|C|G|U])(\n)([A|C|G|U])')
#remove extra letters and spaces
with open(infilename, 'r') as infile, open(outfilename, 'w') as outfile:
    for line in infile:
        line = replace.sub(r'\1\3', line)
        #Write to OutFile
        outfile.write(line)
Thank you for any ideas you might have!
If I understand your problem correctly, this should do it:
with open('path/to/input') as infile:
    while True:
        try:
            line = infile.readline()
            _, start, end = line.strip().split('.')
            start, end = int(start), int(end)
            beg = infile.read(start-1)
            infile.read(beg.count('\n'))
            seq = infile.read(end-start)
            extra = infile.read(seq.count('\n'))
            # str.replace needs both the old and the new substring
            seq = seq.replace('\n', '') + extra
            print(seq)
        except ValueError:
            # stop at end of file or at a line that is not an accession header
            break
Perhaps something like this, although not as elegant as @inspectorG4dget's solution.
nucStart = 69
nucStop = 497
with open(infilename) as infile, open(outfilename, 'w') as outfile:
    nucleotides = []
    for line in infile:
        line = line.strip()  # drop the newline so it is not counted as a nucleotide
        if line.startswith(">"):
            # process the previous list if populated
            if nucleotides:
                sequence = ''.join(nucleotides) # make a single string
                # write out the accession information and the nucleotides we want
                outfile.write("%s %s\n" % (accession_line,
                                           sequence[nucStart-start-1:nucStop-start]))
                nucleotides = [] # clear it for the next run
            # this is the start of the next sequence
            accession_line = line
            start = int(line.split('.')[1])
        elif line:
            # this is a line containing a partial nucleotide sequence, so add it
            nucleotides.append(line)
    # the last sequence has no ">" line after it, so write it out here
    if nucleotides:
        sequence = ''.join(nucleotides)
        outfile.write("%s %s\n" % (accession_line,
                                   sequence[nucStart-start-1:nucStop-start]))
