Fastest way to tokenize a signal? - python

I need to find the fastest way to tokenize a signal. The signal is of the form:
identifier:value identifier:value identifier:value ...
An identifier consists only of alphanumerics and underscores, and is separated from the previous value by a space. A value may contain alphanumerics, various braces/brackets, and spaces.
e.g.
signal_id:debug_word12_ind data:{ } virtual_interface_index:0x0000 module_id:0x0001 module_sub_id:0x0016 timestamp:0xcc557366 debug_words:[0x0006 0x0006 0x0000 0x0000 0x0000 0x0000 0xcc55 0x70a9 0x4c55 0x7364 0x0000 0x0000] sequence_number:0x0174
The best I've come up with is below. Ideally I'd like to halve the time it takes. I've tried various things with regexes but they're no better. Any suggestions?
# Convert data to dictionary. Expect data to be something like
# parameter_1:a b c d parameter_2:false parameter_3:0xabcd parameter_4:-56
# Split at colons. The first part will be just a parameter name, the last just a value;
# everything in between will be <value><space><next parameter name>.
parts1 = data.split(":")
parts2 = []
for part in parts1:
    # Copy first and last 'as is'
    if part in (parts1[0], parts1[-1]):
        parts2.append(part)
    # Split everything in between at the last space
    # (parameter names are not expected to contain spaces)
    else:
        parts2.extend(part.rsplit(' ', 1))
# Expect to now have [parameter name, value, parameter name, value, ...].
# Convert to a dict
self.data_dict = {}
for i in range(0, len(parts2), 2):
    self.data_dict[parts2[i]] = parts2[i + 1]

I have optimized your solution a little:
1) Removed the check from the loop.
2) Changed the dictionary-creation code: pairs are built from a single list.
parts1 = data.split(":")
parts2 = []
parts2.append(parts1.pop(0))
for part in parts1[0:-1]:
    parts2.extend(part.rsplit(' ', 1))
parts2.append(parts1.pop())
data_dict = {k: v for k, v in zip(parts2[::2], parts2[1::2])}
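For reference, here is a minimal timing harness for comparing the two versions with timeit (the wrapper function names and the shortened sample line are mine, not from the original):

import timeit

data = ("signal_id:debug_word12_ind data:{ } virtual_interface_index:0x0000 "
        "module_id:0x0001 sequence_number:0x0174")

def tokenize_original(data):
    parts1 = data.split(":")
    parts2 = []
    for part in parts1:
        if part in (parts1[0], parts1[-1]):
            parts2.append(part)
        else:
            parts2.extend(part.rsplit(' ', 1))
    data_dict = {}
    for i in range(0, len(parts2), 2):
        data_dict[parts2[i]] = parts2[i + 1]
    return data_dict

def tokenize_optimized(data):
    parts1 = data.split(":")
    parts2 = [parts1.pop(0)]
    for part in parts1[0:-1]:
        parts2.extend(part.rsplit(' ', 1))
    parts2.append(parts1.pop())
    return {k: v for k, v in zip(parts2[::2], parts2[1::2])}

# Each call parses the sample line 100,000 times
print(timeit.timeit(lambda: tokenize_original(data), number=100_000))
print(timeit.timeit(lambda: tokenize_optimized(data), number=100_000))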

Related

Split out-of-order string in python

I'm reading text files which may look like this:
file1.txt:
Header A
blab
iuyt
Header B
bkas
rtyu
Header C
asdf
file2.txt:
Header B
asdw
Header A
hufd
ousu
Header C
dfsn
At the end of the file might be a newline, space, or nothing at all. The headers are the same in all the files but may be ordered differently as above.
I would like to map this so that a = blab\niuyt for the first input or a = hufd\nousu for the second.
I'm not sure I fully understand your question. It sounds to me as though you want to take an input:
XABCDE
or, equivalently (at least as far as I can tell in your notation):
BCXADE
DEBCXA
and return a mapping like
{"x": "A", "b": "C", "d": "E"}
(which is one way of representing the name-value pairs).
Is that correct? If so:
# This is the input.
c = "XABCDE"
# This is a dictionary comprehension, one way
# of creating a set of key-value pairs.
{
    c[idx].lower(): c[idx + 1]  # Map the first element of each pair to the second.
    for idx in range(0, len(c), 2)  # Iterate over the pairs in the string.
}
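For c = "XABCDE", this evaluates to {"x": "A", "b": "C", "d": "E"}, matching the mapping above.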
The question's been edited materially since my original answer, so I'm adding a separate answer.
The OP's input is given as follows: there is a file foo.txt with the following contents:
Header A
blab
iuyt
Header B
bkas
rtyu
Header C
asdf
The OP's expected output is a dictionary mapping header values to the contents (not lines) following the header, i.e.:
{
    "A": "blab\niuyt",
    "B": "bkas\nrtyu",
    "C": "asdf"
}
Note that the trailing line delimiter (\n) before each new header should not be included.
One approach:
import re
from collections import defaultdict

# Given something like "Header A" with a trailing newline,
# this will match "A" under group "key". The header formats
# in the example are simple enough that you could fetch the
# value using _, group = line.split(" "), but this accommodates
# more complex formats. Note that this regular expression
# assumes each header will be followed by AT LEAST ONE line
# of data in a file!
PATTERN = re.compile(r"^Header\s*(?P<key>.+)(\r\n|\r|\n)")

# Using defaultdict with an str constructor means we don't have to check
# for key existence before attempting an append. Check the standard library
# documentation for more info:
# https://docs.python.org/3/library/collections.html#collections.defaultdict
structured_output = defaultdict(str)

with open("foo.txt", "r") as handle:
    last_match = None  # Track the previous header match we made.
    for line in handle:
        maybe_match = PATTERN.match(line)
        if maybe_match:
            # We've matched a new header group, so strip the trailing newline
            # from the preceding group, if any. This is either (a) the FIRST
            # header we're matching or (b) the n-th header. In the first case,
            # structured_output[key] returns "" (see defaultdict), and
            # "".rstrip("\n") is "". In the second case, we strip the last
            # newline from the previous group (per the spec).
            group = (last_match or maybe_match).group("key")
            structured_output[group] = structured_output[group].rstrip()
            # Move the last_match "pointer" forward to the new header.
            last_match = maybe_match
        else:
            # This is not a header, it's a line of data: append it.
            structured_output[last_match.group("key")] += line

# Once we've run off the end of the file, we should still rstrip the _last_ group.
structured_output[last_match.group("key")] = structured_output[last_match.group("key")].rstrip()
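For the foo.txt contents above, this leaves structured_output holding the expected mapping (converting the defaultdict to a plain dict for a quick check):

assert dict(structured_output) == {"A": "blab\niuyt", "B": "bkas\nrtyu", "C": "asdf"}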
Use an iterator:
it = iter(string1)
res = {c: next(it) for c in it}
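For example, with the paired string from earlier:

it = iter("XABCDE")
res = {c: next(it) for c in it}
# res == {'X': 'A', 'B': 'C', 'D': 'E'}

(Unlike the comprehension above, this keeps the keys in their original case.)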

In Python, how to match a string to a dictionary item (like 'Bra*')

I'm a complete novice at Python so please excuse me for asking something stupid.
From a textfile a dictionary is made to be used as a pass/block filter.
The textfile contains addresses and either a block or allow like "002029568,allow" or "0011*,allow" (without the quotes).
The search-input is a string with a complete code like "001180000".
How can I evaluate if the search-item is in the dictionary and make it match the "0011*,allow" line?
Thank you very much for your effort!
The filter-dictionary is made with:
def loadFilterDict(filename):
    global filterDict
    try:
        with open(filename, "r") as text_file:
            lines = text_file.readlines()
            for s in lines:
                fields = s.split(',')
                if len(fields) == 2:
                    filterDict[fields[0]] = fields[1].strip()
            text_file.close()
    except:
        pass
Check if the code (ccode) is in the dictionary:
if ccode in filterDict:
    if filterDict[ccode] in ['block']:
        continue
else:
    if filterstat in ['block']:
        continue
The filters-file is like:
002029568,allow
000923993,allow
0011*, allow
If you can use re, you don't have to worry about the wildcard; just let re.match do the hard work for you:
# Rules input (this could also be read from a file)
lines = """002029568,allow
0011*,allow
001180001,block
"""

# Parse rules from the string
rules = []
for line in lines.split("\n"):
    line = line.strip()
    if not line:
        continue
    identifier, ruling = line.split(",")
    rules += [(identifier, ruling)]

# Get rulings for a specific number
def rule(number):
    from re import match
    rulings = []
    for identifier, ruling in rules:
        # Replace the wildcard with the regex .*
        identifier = identifier.replace("*", ".*")
        if match(identifier, number):
            rulings += [ruling]
    return rulings

print(rule("001180000"))
print(rule("001180001"))
Which prints:
['allow']
['allow', 'block']
The function will return a list of rulings. Their order is the same order as they appear in your config lines. So you could easily just pick the last or first ruling whichever is the one you're interested in.
Or break the loop prematurely if you can assume that no two rulings will interfere.
Examples:
001180000 is matched by 0011*,allow only, so the only ruling which applies is allow.
001180001 is matched by 0011*,allow at first, so you'll get allow as before. However, it is also matched by 001180001,block, so a block will get added to the rulings, too.
If the wildcard entries in the file have a fixed length (for example, you only need to support lines like 0011*,allow and not 00110*,allow or 0*,allow or any other arbitrary number of digits followed by *) you can use a nested dictionary, where the outer keys are the known parts of the wildcarded entries.
d = {'0011': {'001180000': 'value', '001180001': 'value'}}
Then when you parse the file and get to the line 0011*,allow you do not need to do any matching. All you have to do is check if '0011' is present. Crude example:
d = {'0011': {'001180000': 'value', '001180001': 'value'}}
line = '0011*,allow'
prefix = line.split(',')[0][:-1]
if prefix in d:
    # there is a "match"; you can then deal with all the entries that match,
    # in this case the items in the inner dictionary
    # {'001180000': 'value', '001180001': 'value'}
    print('match')
else:
    print('no match')
If you do need to support arbitrary lengths of wildcarded entries, you will have to resort to a loop iterating over the dictionary (thereby defeating the point of using a dictionary to begin with):
d = {'001180000': 'value', '001180001': 'value'}
line = '0011*,allow'
prefix = line.split(',')[0][:-1]
for k, v in d.items():
    if k.startswith(prefix):
        # found a matching key-value pair
        print(k, v)
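As an aside, the standard library's fnmatch module understands * wildcards directly, so a sketch along these lines (the rule list and function name are mine) avoids hand-building regexes:

from fnmatch import fnmatch

filter_rules = [("002029568", "allow"), ("0011*", "allow"), ("001180001", "block")]

def rulings_for(code):
    # fnmatch treats "*" like shell globbing: it matches any run of characters
    return [ruling for pattern, ruling in filter_rules if fnmatch(code, pattern)]

print(rulings_for("001180000"))  # ['allow']
print(rulings_for("001180001"))  # ['allow', 'block']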

Trying to split a txt file into multiple variables

So I'm making a program that reads a text file, and I need to separate all the info into its own variables. It looks like this:
>1EK9:A.41,52; B.61,74; C.247,257; D.279,289
ENLMQVYQQARLSNPELRKSAADRDAAFEKINEARSPLLPQLGLGAD
YTYSNGYRDANGINSNATSASLQLTQSIFDMSKWRALTLQEKAAGIQ
DVTYQTDQQTLILNTATAYFNVLNAIDVLSYTQAQKEAIYRQLDQTT
QRFNVGLVAITDVQNARAQYDTVLANEVTARNNLDNAVEQLRQITGN
YYPELAALNVENFKTDKPQPVNALLKEAEKRNLSLLQARLSQDLARE
QIRQAQDGHLPTLDLTASTGISDTSYSGSKTRGAAGTQYDDSNMGQN
KVGLSFSLPIYQGGMVNSQVKQAQYNFVGASEQLESAHRSVVQTVRS
SFNNINASISSINAYKQAVVSAQSSLDAMEAGYSVGTRTIVDVLDAT
TTLYNAKQELANARYNYLINQLNIKSALGTLNEQDLLALNNALSKPV
STNPENVAPQTPEQNAIADGYAPDSPAPVVQQTSARTTTSNGHNPFRN
The code after the > is a title, the next bits that look like "A.41,52" are numbered positions in the sequence that I need to save for later use, and everything after that is an amino acid sequence. I know how to deal with the amino acid sequence, I just need to know how to separate the important numbers in the first line.
In the past when I just had a title and sequence I did something like this:
for line in nucfile:
    if line.startswith(">"):
        headerline = line.strip("\n")[1:]
    else:
        nucseq += line.strip("\n")
Am I on the right track here? This is my first time, any advice would be fantastic and thanks for reading :)
I suggest you use the split() method.
split() allows you to specify the separator of your choice. Provided the sequence title (here 1EK9) is always separated from the rest of the sequence by a colon, you could first pass ":" as your separator. You could then split the remainder of the sequence to recover the numbered positions (e.g. A.41,52) using ";" as a separator.
I hope this helps!
I think what you are trying to do is extract certain parts of the sequence based on their identifiers given to you on the first line (the line starting with >).
This line contains your title, then a sequence name and the data range you need to extract.
Try this:
sequence_pairs = {}
with open('somefile.txt') as f:
    header_line = next(f)
    sequence = f.read()
title, components = header_line.split(':')
pairs = components.split(';')
for pair in pairs:
    # each pair looks like 'A.41,52' (possibly with leading whitespace)
    key, positions = pair.strip().split('.')
    start, end = positions.split(',')
    sequence_pairs[key] = sequence[int(start):int(end) + 1]
for sequence, data in sequence_pairs.items():
    print('{} - {}'.format(sequence, data))
The other answer may well tackle the assumed problem in its entirety, but the OP has asked for pointers or an example of the typical split-unsplit transform, which is often very effective, so here are some ideas and working code (based on the example in the question).
So let us focus on the else branch below:
from __future__ import print_function

nuc_seq = []  # a list
title_token = '>'
with open('some_file_of_a_kind.txt', 'rt') as f:
    for line in f.readlines():
        s_line = line.strip()  # this strips whitespace
        if line.startswith(title_token):
            headerline = line.strip("\n")[1:]
        else:
            nuc_seq.append(s_line)  # build the list

# now nuc_seq is a list of strings like:
# ['ENLMQVYQQARLSNPELRKSAADRDAAFEKINEARSPLLPQLGLGAD',
#  'YTYSNGYRDANGINSNATSASLQLTQSIFDMSKWRALTLQEKAAGIQ',
#  ...
# ]
demo_nuc_str = ''.join(nuc_seq)
# now:
# demo_nuc_str == 'ENLMQVYQQARLSNPELRKSAADRDAAFEKINEARSPLLPQLGLGADYTYSNGYR ...'
That is a fast and widely deployed paradigm in Python programming (and in programming with powerful datatypes in general).
If the split-unsplit (a.k.a. join) method is still unclear, just ask, or search SO for excellent answers to related questions.
Also note that there is no need for line.strip('\n'), as \n is considered whitespace just like ' ' (a single space) or a tab '\t'. Sample:
>>> a = ' \t \n '
>>> '+'.join(a.split())
''
So the "joining character" only appears, if there are at least two element sto join and in this case, strip removed all whits space and left us with the empty string.
Update:
As requested, here is a further analysis of the "coordinate part" of the line called headerline in the question:
>1EK9:A.41,52; B.61,74; C.247,257; D.279,289
If you want to retrieve the:
A.41,52; B.61,74; C.247,257; D.279,289
and assume you have (as above) the complete line in the headerline string:
title, coordinates = headerline.split(':')
# so now title == '1EK9' and
# coordinates == 'A.41,52; B.61,74; C.247,257; D.279,289'
Now split on the semicolons and trim the entries:
het_seq = [z.strip() for z in coordinates.split(';')]
# now het_seq == ['A.41,52', 'B.61,74', 'C.247,257', 'D.279,289']
If 'A', 'B', 'C', and 'D' are well-known dimensions, then you can "lose" the ordering info from the input file (you can always reapply what you already know ;-) and map the coordinates as key: (ordered coordinate pair):
>>> coord_map = dict(
...     (a, tuple(int(k) for k in bc.split(',')))
...     for a, bc in (abc.split('.') for abc in het_seq))
>>> coord_map
{'A': (41, 52), 'C': (247, 257), 'B': (61, 74), 'D': (279, 289)}
In the context of a micro program:
#!/usr/bin/env python
from __future__ import print_function

het_seq = ['A.41,52', 'B.61,74', 'C.247,257', 'D.279,289']

coord_map = dict(
    (a, tuple(int(k) for k in bc.split(',')))
    for a, bc in (abc.split('.') for abc in het_seq))

print(coord_map)
yields:
{'A': (41, 52), 'C': (247, 257), 'B': (61, 74), 'D': (279, 289)}
Here one might write this as an explicit nested for loop, but it is a late European evening, so the trick is to read it from the right:
for all elements of het_seq,
split on the dot and store the left part in a and the right part in bc,
then further split bc on the comma into a sequence of k's, convert each to an integer, and put them into a tuple (an ordered pair of integer coordinates),
and, arriving on the left, build a pair of a (the dimension, like 'A') and the coordinate tuple from the previous step.
In the end, call the dict() function, which constructs a dictionary from an iterable of (key, value) pairs, giving {key_1: value_1, ...}.
So all coordinates are integers, stored as ordered pairs in tuples.
I'd prefer tuples here, although split() generates lists, because:
you will keep those two coordinates, not extend or append to that pair;
in Python, mapping and remapping is performed often, and a hashable (that is, immutable) type is ready to become a key in a dict.
One last variant (with no knotted comprehensions):
coord_map = {}
for abc in het_seq:
    a, bc = abc.split('.')
    coord_map[a] = tuple(int(k) for k in bc.split(','))
print(coord_map)
The first four lines produce the same result as the mildly obnoxious "one-liner" above (which had already been written across three lines, kept together within parentheses).
HTH.
So I'm assuming you are trying to process a FASTA-like file, and the way I would do it is to first get the header and separate the pieces with a regex. Following that, you can store the A.41,52, B.61,74, ... entries in a list for easy access. The code is as follows.
import re

def processHeader(line):
    positions = re.search(r':(.*)', line).group(1)
    positions = positions.split('; ')
    return positions

dnaSeq = ''
positions = []
with open('myFasta', 'r') as infile:
    for line in infile:
        if '>' in line:
            positions = processHeader(line)
        else:
            dnaSeq += line.strip()
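For the header line from the question, processHeader returns the list of position entries:

processHeader('>1EK9:A.41,52; B.61,74; C.247,257; D.279,289')
# ['A.41,52', 'B.61,74', 'C.247,257', 'D.279,289']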
I am not sure I completely understand the goal (and I think this post is more suitable as a comment, but I do not have enough privileges), but I think the key to your solution is using .split(). You can then join the elements of the resulting list just by using +, similar to this:
>>> result = line.split(' ')
>>> result
['1EK9:A.41,52;', 'B.61,74;', 'C.247,257;', 'D.279,289', 'ENLMQVYQQARLSNPELRKSAADRDAAFEKINEARSPLLPQLGLGAD', 'YTYSNGYRDANGINSNATSASLQLTQSIFDMSKWRALTLQEKAAGIQ', 'DVTYQTDQQTLILNTATAYFNVLNAIDVLSYTQAQKEAIYRQLDQTT', 'QRFNVGLVAITDVQNARAQYDTVLANEVTARNNLDNAVEQLRQITGN',
'YYPELAALNVENFKTDKPQPVNALLKEAEKRNLSLLQARLSQDLARE', 'QIRQAQDGHLPTLDLTASTGISDTSYSGSKTRGAAGTQYDDSNMGQN', 'KVGLSFSLPIYQGGMVNSQVKQAQYNFVGASEQLESAHRSVVQTVRS', 'SFNNINASISSINAYKQAVVSAQSSLDAMEAGYSVGTRTIVDVLDAT', 'TTLYNAKQELANARYNYLINQLNIKSALGTLNEQDLLALNNALSKPV', 'STNPENVAPQTPEQNAIADGYAPDSPAPVVQQTSARTTTSNGHNPFRN']
>>> result[3]+result[4]
'D.279,289ENLMQVYQQARLSNPELRKSAADRDAAFEKINEARSPLLPQLGLGAD'
>>>
etc. You can also use the usual slicing syntax to extract the elements of the list that you need:
>>> result[5:]
['YTYSNGYRDANGINSNATSASLQLTQSIFDMSKWRALTLQEKAAGIQ', 'DVTYQTDQQTLILNTATAYFNVLNAIDVLSYTQAQKEAIYRQLDQTT', 'QRFNVGLVAITDVQNARAQYDTVLANEVTARNNLDNAVEQLRQITGN', 'YYPELAALNVENFKTDKPQPVNALLKEAEKRNLSLLQARLSQDLARE', 'QIRQAQDGHLPTLDLTASTGISDTSYSGSKTRGAAGTQYDDSNMGQN', 'KVGLSFSLPIYQGGMVNSQVKQAQYNFVGASEQLESAHRSVVQTVRS', 'SFNNINASISSINAYKQAVVSAQSSLDAMEAGYSVGTRTIVDVLDAT', 'TTLYNAKQELANARYNYLINQLNIKSALGTLNEQDLLALNNALSKPV', 'STNPENVAPQTPEQNAIADGYAPDSPAPVVQQTSARTTTSNGHNPFRN']
and join them together (note that in Python 3, reduce must be imported from functools):
>>> reduce(lambda x, y: x+y, result[5:])
'YTYSNGYRDANGINSNATSASLQLTQSIFDMSKWRALTLQEKAAGIQDVTYQTDQQTLILNTATAYFNVLNAIDVLSYTQAQKEAIYRQLDQTTQRFNVGLVAITDVQNARAQYDTVLANEVTARNNLDNAVEQLRQITGNYYPELAALNVENFKTDKPQPVNALLKEAEKRNLSLLQARLSQDLAREQIRQAQDGHLPTLDLTASTGISDTSYSGSKTRGAAGTQYDDSNMGQNKVGLSFSLPIYQGGMVNSQVKQAQYNFVGASEQLESAHRSVVQTVRSSFNNINASISSINAYKQAVVSAQSSLDAMEAGYSVGTRTIVDVLDATTTLYNAKQELANARYNYLINQLNIKSALGTLNEQDLLALNNALSKPVSTNPENVAPQTPEQNAIADGYAPDSPAPVVQQTSARTTTSNGHNPFRN'
remember that + on lists produces a list.
By the way, I would not remove '\n' to start with, as you may want to use it to extract the first line, similar to using the space above to extract "words".
UPDATE (starting from result):
# getting the A indexes
letter_seq = result[5:]
ind = result[:4]
Aind = ind[0].split('.')[1].replace(';', '')
# getting one long letter sequence
long_letter_seq = reduce(lambda x, y: x + y, letter_seq)
# extracting the final seq from long_letter_seq using Aind
output = long_letter_seq[int(Aind.split(',')[0]):int(Aind.split(',')[1])]
The last line just combines several operations that were also used earlier.
The same goes for B, C, D, etc. -- so a lot of manual work and calculations...
BE CAREFUL with the indexes of A -- numbering in Python starts from 0, which may not be the case in your numbering system.
The more elegant solution would be to use re (https://docs.python.org/2/library/re.html) to find the pattern using a mask, but this requires very well-defined rules for how to look up the needed sequence.
UPDATE 2: It is also not clear to me what the role of spaces is -- so far I have removed them, but they may matter when counting the letters in the original string.

Pyparsing: Parsing semi-JSON nested plaintext data to a list

I have a bunch of nested data in a format that loosely resembles JSON:
company="My Company"
phone="555-5555"
people=
{
person=
{
name="Bob"
location="Seattle"
settings=
{
size=1
color="red"
}
}
person=
{
name="Joe"
location="Seattle"
settings=
{
size=2
color="blue"
}
}
}
places=
{
...
}
There are many different parameters with varying levels of depth--this is just a very small subset.
It also might be worth noting that when a new sub-array is created, there is always an equals sign followed by a line break followed by the open bracket (as seen above).
Is there any simple looping or recursion technique for converting this data to a system-friendly data format such as arrays or JSON? I want to avoid hard-coding the names of properties. I am looking for something that will work in Python, Java, or PHP. Pseudo-code is fine, too.
I appreciate any help.
EDIT: I discovered the Pyparsing library for Python and it looks like it could be a big help. I can't find any examples for how to use Pyparsing to parse nested structures of unknown depth. Can anyone shed light on Pyparsing in terms of the data I described above?
EDIT 2: Okay, here is a working solution in Pyparsing:
def parse_file(fileName):
    # get the input text file
    file = open(fileName, "r")
    inputText = file.read()

    # define the elements of our data pattern
    name = Word(alphas, alphanums + "_")
    EQ, LBRACE, RBRACE = map(Suppress, "={}")
    value = Forward()  # this tells pyparsing that values can be recursive
    entry = Group(name + EQ + value)  # this is the basic name-value pair

    # define data types that might be in the values
    real = Regex(r"[+-]?\d+\.\d*").setParseAction(lambda x: float(x[0]))
    integer = Regex(r"[+-]?\d+").setParseAction(lambda x: int(x[0]))
    quotedString.setParseAction(removeQuotes)

    # declare the overall structure of a nested data element
    struct = Dict(LBRACE + ZeroOrMore(entry) + RBRACE)  # we will turn the output into a Dictionary

    # declare the types that might be contained in our data value -
    # string, real, int, or the struct we declared
    value << (quotedString | struct | real | integer)

    # parse our input text and return it as a Dictionary
    result = Dict(OneOrMore(entry)).parseString(inputText)
    return result.dump()
This works, but when I try to write the results to a file with json.dump(result), the contents of the file are wrapped in double quotes. Also, there are \n characters between many of the data pairs. I tried suppressing them in the code above with LineEnd().suppress(), but I must not be using it correctly.
Parsing an arbitrarily nested structure can be done with pyparsing by defining a placeholder to hold the nested part, using the Forward class. In this case, you are just parsing simple name-value pairs, where then value could itself be a nested structure containing name-value pairs.
name :: word of alphanumeric characters
entry :: name '=' value
struct :: '{' entry* '}'
value :: real | integer | quotedstring | struct
This translates to pyparsing almost verbatim. To define value, which can recursively contain values, we first create a Forward() placeholder, which can be used as part of the definition of entry. Then once we have defined all the possible types of values, we use the '<<' operator to insert this definition into the value expression:
EQ,LBRACE,RBRACE = map(Suppress,"={}")
name = Word(alphas, alphanums+"_")
value = Forward()
entry = Group(name + EQ + value)
real = Regex(r"[+-]?\d+\.\d*").setParseAction(lambda x: float(x[0]))
integer = Regex(r"[+-]?\d+").setParseAction(lambda x: int(x[0]))
quotedString.setParseAction(removeQuotes)
struct = Group(LBRACE + ZeroOrMore(entry) + RBRACE)
value << (quotedString | struct | real | integer)
The parse actions on real and integer will convert these elements from strings to float or ints at parse time, so that the values can be used as their actual types immediately after parsing (no need to post-process to do string-to-other-type conversion).
Your sample is a collection of one or more entries, so we use that to parse the total input:
result = OneOrMore(entry).parseString(sample)
We can access the parsed data as a nested list, but it is not so pretty to display. This code uses pprint to pretty-print a formatted nested list:
from pprint import pprint
pprint(result.asList())
Giving:
[['company', 'My Company'],
 ['phone', '555-5555'],
 ['people',
  [['person',
    [['name', 'Bob'],
     ['location', 'Seattle'],
     ['settings', [['size', 1], ['color', 'red']]]]],
   ['person',
    [['name', 'Joe'],
     ['location', 'Seattle'],
     ['settings', [['size', 2], ['color', 'blue']]]]]]]]
Notice that all the strings are just strings with no enclosing quotation marks, and the ints are actual ints.
We can do just a little better than this, by recognizing that the entry format actually defines a name-value pair suitable for accessing like a Python dict. Our parser can do this with just a few minor changes:
Change the struct definition to:
struct = Dict(LBRACE + ZeroOrMore(entry) + RBRACE)
and the overall parser to:
result = Dict(OneOrMore(entry)).parseString(sample)
The Dict class treats the parsed contents as a name followed by a value, which can be done recursively. With these changes, we can now access the data in result like elements in a dict:
print(result['phone'])
or like attributes in an object:
print(result.company)
Use the dump() method to view the contents of a structure or substructure:
for person in result.people:
    print(person.dump())
    print()
prints:
['person', ['name', 'Bob'], ['location', 'Seattle'], ['settings', ['size', 1], ['color', 'red']]]
- location: Seattle
- name: Bob
- settings: [['size', 1], ['color', 'red']]
  - color: red
  - size: 1

['person', ['name', 'Joe'], ['location', 'Seattle'], ['settings', ['size', 2], ['color', 'blue']]]
- location: Seattle
- name: Joe
- settings: [['size', 2], ['color', 'blue']]
  - color: blue
  - size: 2
There is no "simple" way, but there are harder and not-so-hard ways. If you don't want to hardcode things, then at some point you're going to have to parse it as a structured format. That would involve parsing each line one-by-one, tokenizing it appropriately (for example, separating the key from the value correctly), and then determining how you want to deal with the line.
You may need to store your data in an intermediary format such as a (parse) tree in order to account for the arbitrary nesting relationships (represented by indents and braces), and then after you have finished parsing the data, take your resulting tree and then go through it again to get your arrays or JSON.
There are libraries available, such as ANTLR, that handle some of the manual work of figuring out how to write the parser.
Take a look at this code:
import re

still_not_valid_json = re.sub(r'(\w+)=', r'"\1":', pseudo_json)  # 1
this_one_is_tricky = re.compile(r'("|\d)\n(?!\s+})', re.M)
that_one_is_tricky_too = re.compile(r'(})\n(?=\s+\")', re.M)
nearly_valid_json = this_one_is_tricky.sub(r'\1,\n', still_not_valid_json)  # 2
nearly_valid_json = that_one_is_tricky_too.sub(r'\1,\n', nearly_valid_json)  # 3
valid_json = '{' + nearly_valid_json + '}'  # 4
You can convert your pseudo_json into parseable JSON via some substitutions:
1) Replace '=' with ':'.
2) Add missing commas between a simple value (like "2" or "Joe") and the next field.
3) Add missing commas between the closing brace of a complex value and the next field.
4) Wrap the whole thing in braces.
Still, there are issues. In your example, the 'people' dictionary contains two similar keys, 'person'. After parsing, only one key remains in the dictionary. This is what I've got after parsing:
{u'phone': u'555-5555', u'company': u'My Company', u'people': {u'person': {u'settings': {u'color': u'blue', u'size': 2}, u'name': u'Joe', u'location': u'Seattle'}}}
If only you could replace the second occurrence of 'person=' with 'person1=' and so on...
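You don't have to, though: json.loads accepts an object_pairs_hook callable that sees every key-value pair, duplicates included, so a sketch along these lines (the hook name is mine) can merge repeated keys into lists instead of dropping them:

import json

def merge_duplicates(pairs):
    # Collect values of repeated keys into a list instead of overwriting them
    merged = {}
    for key, value in pairs:
        if key in merged:
            if not isinstance(merged[key], list):
                merged[key] = [merged[key]]
            merged[key].append(value)
        else:
            merged[key] = value
    return merged

parsed = json.loads(valid_json, object_pairs_hook=merge_duplicates)
# parsed['people']['person'] is now a list holding both person objects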
Replace the '=' with ':', then just read it as JSON, adding in trailing commas.
Okay, I came up with a final solution that actually transforms this data into a JSON-friendly dict, as I originally wanted. It first uses Pyparsing to convert the data into a series of nested lists, then loops through the list and transforms it into JSON. This allows me to overcome the issue where Pyparsing's toDict() method was not able to handle cases where the same object has two properties of the same name. To determine whether a list is a plain list or a property/value pair, the prependPropertyToken method adds the string __property__ in front of property names when Pyparsing detects them.
def parse_file(self, fileName):
    # get the input text file
    file = open(fileName, "r")
    inputText = file.read()

    # define data types that might be in the values
    real = Regex(r"[+-]?\d+\.\d*").setParseAction(lambda x: float(x[0]))
    integer = Regex(r"[+-]?\d+").setParseAction(lambda x: int(x[0]))
    yes = CaselessKeyword("yes").setParseAction(replaceWith(True))
    no = CaselessKeyword("no").setParseAction(replaceWith(False))
    quotedString.setParseAction(removeQuotes)
    unquotedString = Word(alphanums + "_-?\"")
    comment = Suppress("#") + Suppress(restOfLine)
    EQ, LBRACE, RBRACE = map(Suppress, "={}")
    data = (real | integer | yes | no | quotedString | unquotedString)

    # define structures
    value = Forward()
    object = Forward()
    dataList = Group(OneOrMore(data))
    simpleArray = (LBRACE + dataList + RBRACE)
    propertyName = Word(alphanums + "_-.").setParseAction(self.prependPropertyToken)
    property = dictOf(propertyName + EQ, value)
    properties = Dict(property)
    object << (LBRACE + properties + RBRACE)
    value << (data | object | simpleArray)
    dataset = properties.ignore(comment)

    # parse it
    result = dataset.parseString(inputText)

    # turn it into a JSON-like object
    dict = self.convert_to_dict(result.asList())
    return json.dumps(dict)
def convert_to_dict(self, inputList):
    dict = {}
    for item in inputList:
        # determine the key and value to be inserted into the dict
        dictval = None
        key = None
        if isinstance(item, list):
            try:
                key = item[0].replace("__property__", "")
                if isinstance(item[1], list):
                    try:
                        if item[1][0].startswith("__property__"):
                            dictval = self.convert_to_dict(item)
                        else:
                            dictval = item[1]
                    except AttributeError:
                        dictval = item[1]
                else:
                    dictval = item[1]
            except IndexError:
                dictval = None
        # determine whether to insert the value into the key or to merge
        # the value with existing values at this key
        if key:
            if key in dict:
                if isinstance(dict[key], list):
                    dict[key].append(dictval)
                else:
                    old = dict[key]
                    new = [old]
                    new.append(dictval)
                    dict[key] = new
            else:
                dict[key] = dictval
    return dict
def prependPropertyToken(self, t):
    return "__property__" + t[0]

quick data processing with python?

I have a file in the following format:
[s1,s2,s3,s4,...] SOME_TEXT
(per line)
For example:
[dog,cat,monkey] 1,2,3
[a,b,c,d,e,f] 13,4,6
the brackets are included.
let's say I have another file like this, which contains two lines:
[banana,cat2,monkey2] 1,2,3
[a2,b2,c2,d,e,f] 13,4,6
I want to take two files of this form and align them the following way:
[dog^banana,cat^cat2,monkey^monkey2] 1,2,3
[a^a2,b^b2,c^c2,d^d2,e^e2,f^f2] 13,4,6
while making sure that "SOME TEXT" in corresponding lines (such as 1,2,3 and 13,4,6) is the same, and that the number of elements in the brackets in each corresponding line is the same. What would be a quick, compact way to do it?
Thanks.
def read_file(fp, hash):
    for l in fp:
        p = l[1:].find(']')
        k = l[p + 3:-1]
        v = l[1:p + 1].split(",")
        if k not in hash:
            hash[k] = v
        else:
            hash[k] = zip(hash[k], v)

hash = {}
for fname in ('f1.txt', 'f2.txt'):
    with open(fname) as fp:
        read_file(fp, hash)

for k, v in hash.items():
    print("[{0}] {1}".format(",".join("^".join(vv) for vv in v), k))
This is a basic way to do it, if you need the lines in the files in the order they were read you'll have to do a bit more work.
Here's the output I get:
[a^a2,b^b2,c^c2,d^d,e^e,f^f] 13,4,6
[dog^banana,cat^cat2,monkey^monkey2] 1,2,3
Edit:
This also assumes that each key, e.g. 13,4,6, appears only once per file. If it can appear multiple times, you'll have to change hash[k] = zip(hash[k], v) to something more elaborate, such as
if k not in hash:
    hash[k] = [[vv] for vv in v]
else:
    for i, vv in enumerate(v):
        hash[k][i].append(vv)
I'd use a regex to chop off everything after the first ] (and hang on to it). Then another regex to explode the string into an array. Then do whatever you need to do with regard to merging the different arrays from the different files; piecing it all back together shouldn't be too hard. I'll leave the regexes as an exercise for the reader :-)
for l, m in zip(f1, f2):
    l_head, l_tail = l.strip("[ ").split("]")
    m_head, m_tail = m.strip("[ ").split("]")
    l_head = l_head.split(",")
    m_head = m_head.split(",")
    assert len(l_head) == len(m_head)
    l_tail = l_tail.split(",")
    m_tail = m_tail.split(",")
    assert len(l_tail) == len(m_tail)
    ...
I haven't given your variables good names because I don't know what they are. I would name them something more useful.
I also haven't written the code for reassembling the lines. It shouldn't be too hard...
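A minimal sketch of that reassembly step (values hard-coded for illustration), assuming the heads get merged element-wise with '^' and the tails are identical:

l_head = ["dog", "cat", "monkey"]
m_head = ["banana", "cat2", "monkey2"]
tail = "1,2,3"
merged = ",".join(a + "^" + b for a, b in zip(l_head, m_head))
print("[{}] {}".format(merged, tail))  # [dog^banana,cat^cat2,monkey^monkey2] 1,2,3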