Parsing a string in Python with variable field sizes - python

I have this string of length 66:
RP000729SP001CT087ET02367EL048TP020DS042MF0220LT9.300000LN4.500000
Two letters (a keyword such as RP) indicate that what follows is a value, up to the next keyword.
At the moment I parse the string assuming the number of bytes between two keywords is constant, e.g. 000729 between RP and SP. The following is the code that parses it that way:
msgStr = "RP000729SP001CT087ET02367EL048TP020DS042MF0220LT9.300000LN4.500000"
Ppm = msgStr[msgStr.find("RP")+2:msgStr.find("SP")]
Speed = msgStr[msgStr.find("SP")+2:msgStr.find("CT")]
Coolent_temp = msgStr[msgStr.find("CT")+2:msgStr.find("ET")]
ETime = msgStr[msgStr.find("ET")+2:msgStr.find("EL")]
E_load = msgStr[msgStr.find("EL")+2:msgStr.find("TP")]
Throttle_pos = msgStr[msgStr.find("TP")+2:msgStr.find("DS")]
Distance = msgStr[msgStr.find("DS")+2:msgStr.find("MF")]
MAF = msgStr[msgStr.find("MF")+2:msgStr.find("LT")]
Lat = msgStr[msgStr.find("LT")+2:msgStr.find("LN")]
Lon = msgStr[msgStr.find("LN")+2:]
print(Ppm, Speed, Coolent_temp, ETime, E_load, Throttle_pos, Distance, MAF, Lat, Lon)
Output:
000729 001 087 02367 048 020 042 0220 9.300000 4.500000
Now I want to handle any number of bytes between two keywords. Examples are given below.
Example No. 1:
Example1_msgStr = "RP729SP14CT087ET2367EL48TP20DS42MF0220LT0.000000LN0.000000"
Expected Output 1:
729 14 087 2367 48 20 42 0220 0.000000 0.000000
Example No. 2:
Example2_msgStr = "RP72956SP134CT874ET02367EL458TP20DS042MF0220LT53.000LN45.00"
Expected Output 2:
72956 134 874 02367 458 20 042 0220 53.000 45.00

You should use a regular expression to find variable-length matches between two delimiters:
import re

regex = r'RP(\d+)SP'
strings = ['RP729SP14CT087ET2367EL48TP20DS42MF0220LT0.000000LN0.000000',
           'RP72956SP134CT874ET02367EL458TP20DS042MF0220LT53.000LN45.00']
for string in strings:
    match = re.search(regex, string)
    print('Matched:', match.group(1))
In the regex, the brackets () specify a group to capture, and \d+ means one or more digit characters. So the whole regex RP(\d+)SP will find a variable-length run of digits between RP and SP.
This shows you how to do one case; you'll need to loop through your delimiters (RP, SP, CT, etc.) to capture all the information you want. If the delimiters always come in the same order, you can build one enormous regex to capture all the groups at once ...
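Following that suggestion, here is a sketch of one combined pattern, assuming the delimiters always appear in this fixed order (the pattern itself is an illustration, not part of the original answer):

```python
import re

# One capture group per field; [\d.]+ allows the decimal points in LT/LN.
pattern = re.compile(
    r'RP(\d+)SP(\d+)CT(\d+)ET(\d+)EL(\d+)TP(\d+)DS(\d+)MF(\d+)'
    r'LT([\d.]+)LN([\d.]+)'
)
m = pattern.match('RP729SP14CT087ET2367EL48TP20DS42MF0220LT0.000000LN0.000000')
print(m.groups())
# ('729', '14', '087', '2367', '48', '20', '42', '0220', '0.000000', '0.000000')
```

Since each keyword is a literal in the pattern, every field can shrink or grow without affecting the others.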

If you want to check whether there are any characters between your delimiters, you can keep your code and convert each variable to bool. If the string is not empty, there is something in it and the conversion returns True; if the string is empty, it returns False:
msgStr = 'RP000729SP001CT087ET02367EL048TP020DS042MF0220LT9.300000LN4.500000'
rpm = msgStr[msgStr.find("RP")+2:msgStr.find("SP")] # Outputs '000729'
Speed = msgStr[msgStr.find("SP")+2:msgStr.find("CT")] # Outputs '001'
coolent_temp = msgStr[msgStr.find("CT")+2:msgStr.find("ET")] # Outputs '087'
ETime = msgStr[msgStr.find("ET")+2:msgStr.find("EL")] # '02367'
e_load = msgStr[msgStr.find("EL")+2:msgStr.find("TP")] # '048'
throttle_pos = msgStr[msgStr.find("TP")+2:msgStr.find("DS")] # '020'
Distance = msgStr[msgStr.find("DS")+2:msgStr.find("MF")] # Outputs '042'
MAF = msgStr[msgStr.find("MF")+2:msgStr.find("LT")] # Outputs '0220'
Lat = msgStr[msgStr.find("LT")+2:msgStr.find("LN")] # Outputs '9.300000'
Lon = msgStr[msgStr.find("LN")+2:] # Outputs '4.500000'
bool(rpm) # Outputs True
bool(Speed) # Outputs True
bool(coolent_temp) # Outputs True
bool(ETime) # Outputs True
bool(e_load) # Outputs True
bool(throttle_pos) # Outputs True
bool(Distance) # Outputs True
bool(MAF) # Outputs True
bool(Lat) # Outputs True
bool(Lon) # Outputs True
You could check several fields being not empty at the same time:
all_filled = bool(rpm) and bool(Speed) and bool(coolent_temp) and \
bool(ETime) and bool(e_load) and bool(throttle_pos) and bool(Distance) \
and bool(MAF) and bool(Lat) and bool(Lon)
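The same check can be written more compactly with the built-in all(), which is True only when every element is truthy (a small illustration using values like those parsed above):

```python
# all() is True only if every string in the list is non-empty (truthy).
fields = ['000729', '001', '087', '02367', '048', '020',
          '042', '0220', '9.300000', '4.500000']
all_filled = all(fields)
print(all_filled)  # True
print(all(['000729', '', '087']))  # False: an empty field fails the check
```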
The way your code is written, it already gives you the separation you want when you try it with the example strings:
Example1_msgStr= "RP729SP14CT087ET2367EL48TP20DS42MF0220LT0.000000LN0.000000"
# ... your code ...
print (rpm, Speed, coolent_temp, ETime, e_load, throttle_pos, Distance, MAF, Lat, Lon)
#> 729 14 087 2367 48 20 42 0220 0.000000 0.000000
Example2_msgStr= "RP72956SP134CT874ET02367EL458TP20DS042MF0220LT53.000LN45.00"
# ... your code ...
print (rpm, Speed, coolent_temp, ETime, e_load, throttle_pos, Distance, MAF, Lat, Lon)
#> 72956 134 874 02367 458 20 042 0220 53.000 45.00

Related

How to reformat a text file data formatting in python (reading a csv/txt with multiple delimiters and rows)

Coding noob here with a question. I have a text file that has the following format:
img1.jpg 468,3,489,16,5 510,37,533,51,2 411,3,433,17,5 ....
img2.jpg 255,397,267,417,2 ....
.
.
.
The data is a series of images with information on co-ordinates where there are 5 variables separated by commas, and then a new set of co-ordinates is separated by a space. There are about 500 files and for each file there are variable numbers of co-ordinate groups. I'm wanting to convert this text file into the following kind of format:
File name  Co-ord 1  Co-ord 2  Co-ord 3  Co-ord 4  Co-ord 5
img1.jpg   468       3         489       16        5
img1.jpg   510       37        533       51        2
img1.jpg   411       3         433       17        5
img2.jpg   255       397       267       417       2
How can I do this in python?
Since each image name and group of co-ordinates are separated by spaces, we can use split() to break the line into a list, which is essentially your expected output.
Here is an example that splits your input into a list of lists; each inner list represents one row of your example output, with the first element indicating which image the co-ordinates belong to:
test = "img1.jpg 468,3,489,16,5 510,37,533,51,2 411,3,433,17,5"
str_list = test.split()
res = list()
recent_img = ''
for item in str_list:
    if item.endswith("jpg"):
        # found a new image name
        recent_img = item
        continue
    co_ordinates_list = item.split(",")
    if len(co_ordinates_list) == 5:
        co_ordinates_list.insert(0, recent_img)
        res.append(co_ordinates_list)
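Building on that, here is a sketch that applies the same loop to whole lines of a file and writes the table out as CSV; the in-memory text and the csv.writer usage are illustrative assumptions, not part of the original answer:

```python
import csv
import io

def parse_line(line):
    """Split one 'imgN.jpg c,c,c,c,c ...' line into rows of [name, c1..c5]."""
    name, rows = '', []
    for token in line.split():
        if token.endswith('.jpg'):
            name = token  # remember which image the following groups belong to
        else:
            coords = token.split(',')
            if len(coords) == 5:
                rows.append([name] + coords)
    return rows

# Demo on in-memory text standing in for the real file.
text = "img1.jpg 468,3,489,16,5 510,37,533,51,2\nimg2.jpg 255,397,267,417,2\n"
buf = io.StringIO()
writer = csv.writer(buf)
writer.writerow(['File name', 'Co-ord 1', 'Co-ord 2', 'Co-ord 3', 'Co-ord 4', 'Co-ord 5'])
for row_line in text.splitlines():
    writer.writerows(parse_line(row_line))
print(buf.getvalue())
```

For the real data, replace the in-memory text with the contents of your file and write `buf` (or a real file handle) to disk.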

Python convert HEX number digits ASCI representation into string

I want to make a function that does exactly this (note: 138 is 0x8a; the checksum example below is where these digits come from):
#This is my input number
MyNumberDec = 138
MyNumberHex = hex(MyNumberDec)
print(MyNumberDec)
print(MyNumberHex)
#Output looks exactly like this:
#138
#0x8a
HexFirstDigitCharacter = MagicFunction(MyNumberHex)
HexSecondDigitCharacter = MagicFunction(MyNumberHex)
#print(HexFirstDigitCharacter)
#print(HexSecondDigitCharacter)
#I want to see this in output
#8
#A
What is that function?
Why do I need this?
For calculating a check-sum in messages sent to some industrial equipment.
For example, the command R8:
N | HEX | ASC
1 | 52  | R
2 | 38  | 8
3 | 38  | 8
4 | 41  | A
Bytes 1 and 2 are the command, bytes 3 and 4 are the checksum.
Way of calculating the checksum: 0x52 + 0x38 = 0x8A
I have to send ASCII 8 as the third byte and ASCII A as the fourth byte.
Maybe I don't need MagicFunction but some other solution?
You can convert an integer to a hex string without the leading '0x' by using the string formatter:
MyNumberDec = 138
MyNumberHex = '%02x' % MyNumberDec
print(MyNumberHex[0])
print(MyNumberHex[1])
This outputs:
8
a
(Use '%02X' instead if you need the uppercase A.)
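Putting this together with the checksum scheme from the question (a sketch; the command "R8" and the sum-then-hex rule come from the example above):

```python
# Compute the checksum for command "R8": 0x52 ('R') + 0x38 ('8') = 0x8A.
# '%02X' formats it as two uppercase ASCII hex digits, zero-padded.
command = "R8"
checksum = sum(ord(c) for c in command) & 0xFF  # keep it to one byte
digits = '%02X' % checksum
print(digits[0])  # 8
print(digits[1])  # A
```

The two characters of `digits` are exactly the ASCII bytes 3 and 4 of the message.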

Filtering and parsing text over Solar Region Summary files

I was trying to filter some .txt files that are named after a date in YYYYMMDD format and contain data about active regions on the Sun. I wrote code that, given a date in YYYYMMDD format, lists the files within the time range in which I expect the active region I am looking for to appear, and parses the information for that entry. An example of these txts can be seen below, and more information about them (if you feel curious) can be found on the SWPC website.
:Product: 0509SRS.txt
:Issued: 2012 May 09 0030 UTC
# Prepared jointly by the U.S. Dept. of Commerce, NOAA,
# Space Weather Prediction Center and the U.S. Air Force.
#
Joint USAF/NOAA Solar Region Summary
SRS Number 130 Issued at 0030Z on 09 May 2012
Report compiled from data received at SWO on 08 May
I. Regions with Sunspots. Locations Valid at 08/2400Z
Nmbr Location Lo Area Z LL NN Mag Type
1470 S19W68 284 0030 Cro 02 02 Beta
1471 S22W60 277 0120 Cso 05 03 Beta
1474 N14W13 229 0010 Axx 00 01 Alpha
1476 N11E35 181 0940 Fkc 17 33 Beta-Gamma-Delta
1477 S22E73 144 0060 Hsx 03 01 Alpha
IA. H-alpha Plages without Spots. Locations Valid at 08/2400Z May
Nmbr Location Lo
1472 S28W80 297
1475 N05W05 222
II. Regions Due to Return 09 May to 11 May
Nmbr Lat Lo
1460 N16 126
1459 S16 110
The code I am using to parse these txt files is:
import glob

def seeker(noaa_number, t_start, path=None):
    '''
    This function will open each SRS file
    and check line by line whether the given AR
    (specified by its NOAA number) is there.
    If so, it grabs the entries and prints them.
    '''
    # default the path if none is given
    if path is None:
        path = 'defaultpath'
    # list the items within the directory
    files = sorted(glob.glob(path + '*.txt'))
    # find the index of the starting time in the list
    index = files.index(path + str(t_start) + 'SRS.txt')
    # loop over each file
    for file in files[index:index + 20]:
        # open the file and read the lines
        with open(file, 'r') as f:
            text = f.readlines()
        # loop over each line in the text
        for line in text:
            # check whether the NOAA number is mentioned
            # in the given line
            if noaa_number in line:
                # test print
                print('Original line: ', line)
                # slice the text to get the column values
                nbr = line[:4]
                Location = line[5:11]
                Lo = line[14:18]
                Area = line[19:23]
                Z = line[24:28]
                LL = line[29:31]
                NN = line[34:36]
                MagType = line[37:]
                # test prints
                print('nbr: ', nbr)
                print('location: ', Location)
                print('Lo: ', Lo)
                print('Area: ', Area)
                print('Z: ', Z)
                print('LL: ', LL)
                print('NN: ', NN)
                print('MagType: ', MagType)
    return
I tested this and it is working, but I feel a bit dumb for two reasons:
Despite these files following a standard, one extra space is all it takes to crash the code, given the way I am slicing the strings by index. Is there a better option?
The information in tables IA and II is not relevant for me, so ideally I would like to prevent my code from scanning them. Since the number of lines in the first section varies, is it possible to tell the code when to stop reading a given document?
Thanks for your time!
Robustness:
Instead of slicing by absolute position you could split the lines up into a list using the .split() method. This will be robust against extra spaces.
So instead of
Location = line[5:11]
Lo = line[14:18]
Area = line[19:23]
Z = line[24:28]
LL = line[29:31]
NN = line[34:36]
You could use
Location = line.split()[1]
Lo = line.split()[2]
Area = line.split()[3]
Z = line.split()[4]
LL = line.split()[5]
NN = line.split()[6]
If you wanted it to be faster you could split the list once and then just pull the relevant data from the same list rather than splitting it every time:
data = line.split()
Location = data[1]
Lo = data[2]
Area = data[3]
Z = data[4]
LL = data[5]
NN = data[6]
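For instance, applying this to the first data row of the sample file shown in the question:

```python
# One region row from the SRS sample; split() collapses any run of spaces.
line = '1470 S19W68 284 0030 Cro 02 02 Beta'
data = line.split()
print(data)
# ['1470', 'S19W68', '284', '0030', 'Cro', '02', '02', 'Beta']
```

Extra or missing spaces between columns no longer matter, since split() treats any whitespace run as one separator.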
Stopping:
To stop reading the file once it has passed the relevant data, you can exit the loop as soon as it no longer finds the noaa_number in a line:
# In the file loop, before looping through the lines:
started_reading = False  # set this to False so that we don't
                         # exit before we reach the relevant data
for line in text:
    if noaa_number in line:
        started_reading = True
        # parsing stuff
    elif started_reading:
        break  # exits the loop
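Alternatively, since the irrelevant sections always begin with fixed headers in these files, you could stop as soon as the "IA." header appears. A sketch over an abridged copy of the sample file (the header text is taken from the example above; the digit check for region rows is an assumption):

```python
# Abridged copy of the sample SRS file from the question.
sample = """I. Regions with Sunspots. Locations Valid at 08/2400Z
Nmbr Location Lo Area Z LL NN Mag Type
1470 S19W68 284 0030 Cro 02 02 Beta
1471 S22W60 277 0120 Cso 05 03 Beta
IA. H-alpha Plages without Spots. Locations Valid at 08/2400Z May
1472 S28W80 297"""

rows = []
for line in sample.splitlines():
    if line.startswith('IA.'):
        break  # everything from table IA onwards is irrelevant
    data = line.split()
    if data and data[0].isdigit():  # keep only numbered region rows
        rows.append(data)
print(rows)  # only the two region rows before the IA. header remain
```

This way tables IA and II are never scanned at all, regardless of how many lines section I contains.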

printing column numbers in python

How can I print the first 52 numbers in one column, the next 52 in a second column, and so on (six columns in total, repeating)? I have lots of float numbers, and I want to keep each run of 52 numbers in the same column before starting a new column that will likewise contain the next 52 numbers. The numbers are listed on lines separated by single spaces in a file.txt document. So in the end I want to have:
1 53 105 157 209 261
2
...
52 104 156 208 260 312
313 ... ... ... ... ...
...(another 52 numbers and so on)
I have tried this:
with open('file.txt') as f:
    line = f.read().split()
line1 = "\n".join("%-20s %s" % (line[i + len(line)//52], line[i + len(line)//6]) for i in range(len(line)//6))
print(line1)
However, this of course only prints two columns of numbers. I have tried adding line[i + len(line)//52] six times, but the code is still not working.
for row in range(52):
    for col in range(6):
        print(line[row + 52*col], end=' ')  # end=' ' stays on this line
    print()  # now go to the next line
Granted, you can do this in more Pythonic ways, but this shows you the algorithm structure and lets you tighten the code as you wish.
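One of those more Pythonic ways can be sketched with zip(), which transposes the six column slices into printable rows (the 312 stand-in numbers replace the real tokens read from file.txt):

```python
# Stand-in data: the integers 1..312 as strings, replacing the file contents.
line = [str(n) for n in range(1, 313)]
columns = [line[c * 52:(c + 1) * 52] for c in range(6)]  # six runs of 52
for row in zip(*columns):  # zip transposes columns into rows
    print(' '.join(row))
```

The first printed row is `1 53 105 157 209 261` and the last is `52 104 156 208 260 312`, matching the layout asked for above. Note that zip() stops at the shortest column, so a final partial column would need padding.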

parsing pdb file with format change

I have a file that looks something like this:
ATOM 7748 CG2 ILE A 999 53.647 54.338 82.768 1.00 82.10 C
ATOM 7749 CD1 ILE A 999 51.224 54.016 84.367 1.00 83.16 C
ATOM 7750 N ASN A1000 55.338 57.542 83.643 1.00 80.67 N
ATOM 7751 CA ASN A1000 56.604 58.163 83.297 1.00 80.45 C
ATOM 7752 C ASN A1000 57.517 58.266 84.501 1.00 80.30 C
As you can see, the space disappears between columns 4 and 5 (counting from 0), so the code below fails. I'm new to Python (total time now a whole 3 days!) and was wondering what's the best way to handle this. As long as there is a space, line.split() works. Do I have to do a character count and then parse the string with absolute references?
visited = {}
outputfile = open(file_output_location, "w")
for line in open(file_input_location, "r"):
    list = line.split()
    id = list[0]
    if id == "ATOM":
        type = list[2]
        if type == "CA":
            residue = list[3]
            if len(residue) == 4:
                residue = residue[1:]
            type_of_chain = list[4]
            atom_count = int(list[5])
            position = list[6:9]
            if atom_count >= 1:
                if atom_count not in visited and type_of_chain == chain_required:
                    visited[atom_count] = 1
                    result_line = " ".join([residue, str(atom_count), type_of_chain, " ".join(position)])
                    print(result_line)
                    print(result_line, file=outputfile)
outputfile.close()
PDB files appear to be fixed-column-width files, not space-delimited. So if you must parse them manually (rather than using an existing tool like pdb-tools), you'll need to chop the line up along the lines of:
id = line[0:4]
type = line[4:9].strip()
# ... and so on, ad nauseam
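A fuller sketch of that approach, using the ATOM record column ranges from the wwPDB format specification (0-based Python slices); double-check the boundaries against your own files, since transcription can shift alignment:

```python
def parse_atom_line(line):
    """Slice one strictly formatted ATOM record by fixed columns."""
    return {
        'record':  line[0:6].strip(),    # record name, e.g. "ATOM"
        'serial':  int(line[6:11]),      # atom serial number
        'name':    line[12:16].strip(),  # atom name
        'resName': line[17:20].strip(),  # residue name
        'chainID': line[21:22],          # chain identifier
        'resSeq':  int(line[22:26]),     # residue sequence number
        'x':       float(line[30:38]),   # orthogonal coordinates
        'y':       float(line[38:46]),
        'z':       float(line[46:54]),
    }

# Demo on a strictly formatted ATOM record, built piecewise so each
# fixed-width field is visible.
record = ("ATOM  " " 7750" " " " N  " " " "ASN" " " "A" "1000" "    "
          "  55.338" "  57.542" "  83.643")
atom = parse_atom_line(record)
print(atom['resName'], atom['chainID'], atom['resSeq'])  # ASN A 1000
```

Because residue number and chain live in fixed columns, the missing space between A and 1000 no longer causes any trouble.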
Use string slicing:
print('0123456789'[3:6])  # 345
There's an asymmetry there: the first number is the 0-based index of the first character you need; the second number is the 0-based index of the first character you no longer need.
It may be worth installing Biopython, as it has a module to parse PDBs.
I used the following code on your example data:
from Bio.PDB.PDBParser import PDBParser

pdb_reader = PDBParser(PERMISSIVE=1)
structure_id = "Test"
filename = "Test.pdb"  # enter the file name or path to the file here
structure = pdb_reader.get_structure(structure_id, filename)
model = structure[0]
for chain in model:  # this will loop over every chain in the model
    for residue in chain:
        for atom in residue:
            # get_name() strips spaces; use it over get_fullname() or get_id()
            if atom.get_name() == 'CA':
                # prints atom name, residue name, residue number, chain name, atom co-ordinates
                print(atom.get_id(), residue.get_resname(), residue.get_id()[1], chain.get_id(), atom.get_coord())
This prints out:
CA ASN 1000 A [ 56.60400009 58.1629982 83.29699707]
I then tried it on a larger protein which has 14 chains (1aon.pdb) and it worked fine.
