Python regular expression to split parameterized text file

I'm trying to split a file that repeatedly contains 'string = float' pairs.
Below is what the file looks like.
+name1 = 32 name2= 4
+name3 = 2 name4 = 5
+name5 = 2e+23
...
And I want to put them into a dictionary.
Like...
a={name1:32, name2:4, name3:2, name4:5, name5:2e+23}
I'm new to regular expression and having a hard time figuring out what to do.
After some googling, I tried the following to remove the "+" character and whitespace:
p = re.compile(r'[^+\s]+')
splitted_list = p.findall(lineof_file)
But this gave me two problems:
1. When there is no space between the name and the "=" sign, it doesn't split.
2. For numbers like 2e+23, it splits at the + sign in the middle.
I managed to parse the file as I wanted after some modification of depperm's code.
But I'm facing another problem.
To better explain my problem, below is how my file can look.
After the + sign, multiple parameter and value pairs can appear with the '=' sign.
A parameter name can contain letters and numbers in any position. A value can contain a +/- sign and scientific notation (E/e+-). And sometimes a value can be a math expression if it is single-quoted.
+ abc2dfg3 = -2.3534E-03 dfe4c3= 2.000
+ abcdefg= '1.00232e-1*x' * bdfd=1e-3
I managed to parse the above using the below regex.
re.findall("(\w+)\s*=\s*([+-]?[\d+.Ee+-]+|'[^']+')",eachline)
But now my problem is that sometimes, as in "* bdfd=1e-3", there can be a comment. Anything after a * (asterisk) in my file should be treated as a comment, but not if the * appears inside a single-quoted string.
With the above regex, "bdfd=1e-3" is parsed as well, but I want it left out.
I have searched for hours but couldn't find a solution so far.
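One way to handle the comment case (my own sketch, not from the answers below): strip everything after an unquoted * before applying the pair regex. The `strip_comment` helper is a hypothetical name; it walks the line and tracks whether it is inside single quotes.

```python
import re

# name = number (optional sign / scientific notation) or 'quoted expression'
pair_re = re.compile(r"(\w+)\s*=\s*('[^']*'|[+-]?[\d.]+(?:[Ee][+-]?\d+)?)")

def strip_comment(line):
    # Everything after a '*' that lies outside single quotes is a comment
    in_quote = False
    for i, c in enumerate(line):
        if c == "'":
            in_quote = not in_quote
        elif c == '*' and not in_quote:
            return line[:i]
    return line

line = "+ abcdefg= '1.00232e-1*x' * bdfd=1e-3"
print(pair_re.findall(strip_comment(line)))
# [('abcdefg', "'1.00232e-1*x'")]
```

The * inside the quoted expression is kept, while the trailing "* bdfd=1e-3" is discarded before matching.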

I would suggest just grabbing the name and the value instead of worrying about the spaces or unwanted characters. I'd use this regex: (name\d+)\s*=\s*([\de+]+), which captures the name and then groups the number even if it contains an e or a + sign.
import re
p = re.compile(r'(name\d+)\s*=\s*([\de+]+)')
a = {}
with open("file.txt", "r") as ins:
    for line in ins:
        splitted_list = p.findall(line)
        # splitted_list looks like: [('name1', '32'), ('name2', '4')]
        for group in splitted_list:
            a[group[0]] = group[1]
print(a)
# {'name1': '32', 'name2': '4', 'name3': '2', 'name4': '5', 'name5': '2e+23'}
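Note that the values above are still strings. If numeric values are wanted, as in the question's target dict, a float() conversion can be added; a minimal sketch (the generalized \w+ name pattern is my assumption, not part of the answer above):

```python
import re

# Match any word-like name and a numeric value (incl. scientific notation)
p = re.compile(r'(\w+)\s*=\s*([\d.eE+-]+)')
a = {}
for line in ["+name1 = 32 name2= 4", "+name3 = 2 name4 = 5", "+name5 = 2e+23"]:
    for name, value in p.findall(line):
        a[name] = float(value)   # convert the captured string to a number
print(a)
# {'name1': 32.0, 'name2': 4.0, 'name3': 2.0, 'name4': 5.0, 'name5': 2e+23}
```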

You don't need a regular expression to accomplish your goal. You can use built-in Python methods.
your_dictionary = {}
# Read the file
with open('file.txt', 'r') as fin:
    lines = fin.readlines()
# iterate over each line (this assumes a single name = value pair per line)
for line in lines:
    splittedLine = line.split('=')
    your_dictionary[splittedLine[0].strip()] = splittedLine[1].strip()
print(your_dictionary)
Hope it helps!

You can combine regex with string splitting:
Create the file:
t ="""
+name1 = 32 name2= 4
+name3 = 2 name4 = 5
+name5 = 2e+23"""
fn = "t.txt"
with open(fn, "w") as f:
    f.write(t)
Split the file:
import re
d = {}
with open(fn, "r") as f:
    for line in f:  # process each line
        g = re.findall(r'(\w+ ?= ?[^ ]*)', line)  # find all name = something
        for hit in g:  # something != space
            hit = hit.strip()  # remove spaces
            if hit:
                key, val = hit.split("=")  # split and strip and convert
                d[key.rstrip()] = float(val.strip())  # put into dict
print(d)
Output:
{'name4': 5.0, 'name5': 2e+23, 'name2': 4.0, 'name3': 2.0, 'name1': 32.0}

Related

How to split a comma-separated line if the chunk contains a comma in Python?

I'm trying to split the current line into 3 chunks.
The Title column contains a comma, which is also the delimiter:
1,"Rink, The (1916)",Comedy
My current code is not working:
id, title, genres = line.split(',')
Expected result
id = 1
title = 'Rink, The (1916)'
genres = 'Comedy'
Any thoughts how to split it properly?
Ideally, you should use a proper CSV parser and tell it that the double quote is the quoting character. If you must proceed with the current string as the starting point, here is a regex trick which should work:
import re
inp = '1,"Rink, The (1916)",Comedy'
parts = re.findall(r'".*?"|[^,]+', inp)
print(parts) # ['1', '"Rink, The (1916)"', 'Comedy']
The regex pattern works by first trying to find a term "..." in double quotes. That failing, it falls back to finding a CSV term which is defined as a sequence of non comma characters (leading up to the next comma or end of the line).
Let's talk about why your code does not work:
id, title, genres = line.split(',')
Here line.split(',') returns 4 values (since you have 3 commas), while you are expecting only 3 values, hence you get:
ValueError: too many values to unpack (expected 3)
My advice would be to not use commas but another delimiter character:
"1#\"Rink, The (1916)\"#Comedy"
and then
id, title, genres = line.split('#')
Use the csv package from the standard library:
>>> import csv, io
>>> s = """1,"Rink, The (1916)",Comedy"""
>>> # Load the string into a buffer so that csv reader will accept it.
>>> reader = csv.reader(io.StringIO(s))
>>> next(reader)
['1', 'Rink, The (1916)', 'Comedy']
Well, you can do it by making it a tuple:
line = (1, "Rink, The (1916)", "Comedy")
id, title, genres = line

How to extract values from a csv splitting in the correct place (no imports)?

How can I read a csv file without using any external import (e.g. csv or pandas) and turn it into a list of lists? Here's the code I worked out so far:
m = []
for line in myfile:
    m.append(line.split(','))
Using this for loop works pretty well, but if one of the fields in the csv contains a ',', it wrongly breaks the line there.
So, for example, if one of the lines I have in the csv is:
12,"This is a single entry, even if there's a coma",0.23
The relative element of the list is the following:
['12', '"This is a single entry', 'even if there is a coma"', '0.23\n']
While I would like to obtain:
['12', '"This is a single entry, even if there is a coma"', '0.23']
I would avoid trying to use a regular expression here; you would need to process the text one character at a time to determine where the quote characters are. Also note that normally the quote characters are not included as part of a field.
A quick example approach would be the following:
def split_row(row, quote_char='"', delim=','):
    in_quote = False
    fields = []
    field = []
    for c in row:
        if c == quote_char:
            in_quote = not in_quote
        elif c == delim:
            if in_quote:
                field.append(c)
            else:
                fields.append(''.join(field))
                field = []
        else:
            field.append(c)
    if field:
        fields.append(''.join(field))
    return fields
fields = split_row('''12,"This is a single entry, even if there's a coma",0.23''')
print(len(fields), fields)
fields = split_row('''12,"This is a single entry, even if there's a coma",0.23''')
print(len(fields), fields)
Which would display:
3 ['12', "This is a single entry, even if there's a coma", '0.23']
The csv library, though, does a far better job of this. This script does not handle any special cases beyond your test string.
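For comparison (the question forbids imports, but as a sanity check), the standard csv module parses the same test line, stripping the quotes along the way:

```python
import csv
import io

line = '''12,"This is a single entry, even if there's a coma",0.23'''
# csv.reader wants a file-like object, so wrap the string in StringIO
row = next(csv.reader(io.StringIO(line)))
print(row)
# ['12', "This is a single entry, even if there's a coma", '0.23']
```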
Here is my go at it:
line = '12, "This is a single entry, more bits in here ,even if there is a coma",0.23 , 12, "This is a single entry, even if there is a coma", 0.23\n'
line_split = line.replace('\n', '').split(',')
quote_loc = [idx for idx, l in enumerate(line_split) if '"' in l]
quote_loc.reverse()
assert len(quote_loc) % 2 == 0, "value was odd, should be even"
for m, n in zip(quote_loc[::2], quote_loc[1::2]):
    line_split[n] = ','.join(line_split[n:m+1])
    del line_split[n+1:m+1]
print(line_split)

Replace match with next item from a list of replacement strings

Suppose I have a text with some duplicate values that should be changed. Basically, I want to generate an input file for a program from the data I have. The duplicates need to be replaced by string variables from a list.
There are plenty of examples of how to do this, but they mostly loop through lines. However, the text can occur on a line once, several times, or never. It is also important that the first occurrence of the duplicate is replaced by the first element of the list, and so on.
A small example.
Input text:
wefhwefhef AAA fghfhrjgrjr
AAA rhrfgjhrgjrehgj r
grnggrejg
grejren AAA
ff34t r4 43r 43r 43 AAA ff34 f43f3443fgh5 AAA
List of variables:
['FIRST', 'SECOND', 'THIRD', 'FOURTH', 'FIFTH']
Expected output:
wefhwefhef FIRST fghfhrjgrjr
SECOND rhrfgjhrgjrehgj r
grnggrejg
grejren THIRD
ff34t r4 43r 43r 43 FOURTH ff34 f43f3443fgh5 FIFTH
Here is what I have done so far. It does not work correctly since it loops through lines, but I need to loop through occurrences. Additionally, I think I could also use enumerate.
input = open('input.txt', 'r')
output = open('output.txt', 'w')
# output.txt is the empty file
# input.txt is given in the example code
variables = ['FIRST', 'SECOND', 'THIRD', 'FOURTH', 'FIFTH']
index = 0
for line in input:
    output.write(line.replace('AAA', variables[index]))
    index += 1
input.close()
output.close()
To do this you need to tell str.replace that you only want to replace a single instance at a time. Then loop through the possible terms. It might look something like this:
original = open('input.txt', 'r')
output = open('output.txt', 'w')
# output.txt is the empty file
# input.txt is given in the example code
terms = ['FIRST', 'SECOND', 'THIRD', 'FOURTH', 'FIFTH']
text = original.read()
for term in terms:
    text = text.replace('AAA', term, 1)
output.write(text)
original.close()
output.close()
Output:
wefhwefhef FIRST fghfhrjgrjr
SECOND rhrfgjhrgjrehgj r
grnggrejg
grejren THIRD
ff34t r4 43r 43r 43 FOURTH ff34 f43f3443fgh5 FIFTH
str.replace() doesn't allow you to pass in a function to say what the replacement should be, but re.sub does. With that, you can do something like
import sys
import re
def nextrep():
    while True:
        for replacement in ['FIRST', 'SECOND', 'THIRD', 'FOURTH', 'FIFTH']:
            yield replacement
replacement = nextrep()
rx = re.compile("AAA")
for line in sys.stdin:
    line = rx.sub(lambda x: next(replacement), line)
    print(line, end='')
This just uses sys.stdin as the read filehandle, and (implicitly) sys.stdout for writing. You can easily wrap it with different filehandles if you want to. The following would replace the final for loop from the above script:
with open(inputfile) as ifh, open(outputfile, 'w') as ofh:
    for line in ifh:
        ofh.write(rx.sub(lambda x: next(replacement), line))
The while True: makes sure we start over from the beginning of the list of replacements if we find more strings to replace than we have items in the list of replacements. If you want a different behavior, it should be easy enough to modify nextrep to do something else when you run past the end of the list (raise an exception? Start yielding empty strings? Or yield the input string so you replace matches with themselves?)
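As an aside, the hand-rolled nextrep generator is equivalent to itertools.cycle from the standard library, so the single-list version can be sketched more compactly (operating on an in-memory string here rather than stdin):

```python
import re
from itertools import cycle

# cycle() restarts from the beginning once the list is exhausted,
# just like the while True loop in nextrep
replacement = cycle(['FIRST', 'SECOND', 'THIRD', 'FOURTH', 'FIFTH'])
text = ("wefhwefhef AAA fghfhrjgrjr\n"
        "AAA rhrfgjhrgjrehgj r\n"
        "grnggrejg\n"
        "grejren AAA\n"
        "ff34t r4 43r 43r 43 AAA ff34 f43f3443fgh5 AAA")
result = re.sub("AAA", lambda m: next(replacement), text)
print(result)
```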
If you have multiple strings you need to replace, this easily extends to that use case too; just expand the regex, and pass in the string to replace as a parameter to the replacement function so it can know which list to yield from. That way, you only need a single pass over the text. Perhaps something like this:
import sys
import re
def nextrep(items):
    while True:
        for replacement in items:
            yield replacement
replacement = {
    'AAA': nextrep(['FIRST', 'SECOND', 'THIRD', 'FOURTH', 'FIFTH']),
    'BBB': nextrep(['PREMIER', 'DEUXIEME', 'TROISIEME'])
}
rx = re.compile("|".join(replacement.keys()))
for line in sys.stdin:
    line = rx.sub(lambda x: next(replacement[x.group(0)]), line)
    print(line, end='')
This will replace the first AAA with FIRST, the first BBB with PREMIER, etc.

How to parse a value next to a specific substring in Python

I have a log file containing lines formatted as shown below. I want to parse the values right next to the substrings element=(string), time=(guint64) and ts=(guint64) and save them to a list that will contain separate lists for each line:
0:00:00.336212023 62327 0x55f5ca5174a0 TRACE GST_TRACER :0:: element-latency, element-id=(string)0x55f5ca532a60, element=(string)rawvideoparse0, src=(string)src, time=(guint64)852315, ts=(guint64)336203035;
0:00:00.336866520 62327 0x55f5ca5176d0 TRACE GST_TRACER :0:: element-latency, element-id=(string)0x55f5ca53f860, element=(string)nvh264enc0, src=(string)src, time=(guint64)6403181, ts=(guint64)336845676;
The final output would then look like: [['rawvideoparse0', 852315, 336203035], ['nvh264enc0', 6403181, 336845676]].
I should probably use Python's string split or partition methods to obtain the relevant parts of each line, but I can't come up with a short solution that generalises to all the values I'm searching for. I also don't know how to deal with the fact that the element and time values are terminated with a comma whereas ts is terminated with a semicolon (without writing separate conditionals for the two cases). How can I achieve this using Python's string manipulation methods?
Regex was meant for this:
lines = """
0:00:00.336212023 62327 0x55f5ca5174a0 TRACE GST_TRACER :0:: element-latency, element-id=(string)0x55f5ca532a60, element=(string)rawvideoparse0, src=(string)src, time=(guint64)852315, ts=(guint64)336203035;
0:00:00.336866520 62327 0x55f5ca5176d0 TRACE GST_TRACER :0:: element-latency, element-id=(string)0x55f5ca53f860, element=(string)nvh264enc0, src=(string)src, time=(guint64)6403181, ts=(guint64)336845676;
"""
import re
pattern = re.compile(r".*element-id=\(string\)(?P<elt_id>.*), element=\(string\)(?P<elt>.*), src=\(string\)(?P<src>.*), time=\(guint64\)(?P<time>.*), ts=\(guint64\)(?P<ts>.*);")
for l in lines.splitlines():
    match = pattern.match(l)
    if match:
        results = match.groupdict()
        print(results)
yields the following dictionaries (notice that the captured groups have been named in the regex above using (?P<name>...), that's why we have these names):
{'elt_id': '0x55f5ca532a60', 'elt': 'rawvideoparse0', 'src': 'src', 'time': '852315', 'ts': '336203035'}
{'elt_id': '0x55f5ca53f860', 'elt': 'nvh264enc0', 'src': 'src', 'time': '6403181', 'ts': '336845676'}
You can make this regex pattern even more generic, since all the elements share a common structure <name>=(<type>)<value>:
pattern2 = re.compile(r"(?P<name>[^,;\s]*)=\((?P<type>[^,;]*)\)(?P<value>[^,;]*)")
for l in lines.splitlines():
    all_interesting_items = pattern2.findall(l)
    print(all_interesting_items)
it yields:
[]
[('element-id', 'string', '0x55f5ca532a60'), ('element', 'string', 'rawvideoparse0'), ('src', 'string', 'src'), ('time', 'guint64', '852315'), ('ts', 'guint64', '336203035')]
[('element-id', 'string', '0x55f5ca53f860'), ('element', 'string', 'nvh264enc0'), ('src', 'string', 'src'), ('time', 'guint64', '6403181'), ('ts', 'guint64', '336845676')]
Note that in all cases, https://regex101.com/ is your friend for debugging regex :)
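To get the exact nested-list output the question asks for, the captured groups can be post-processed; a small sketch using a narrower pattern of my own (assuming only the three wanted fields):

```python
import re

lines = [
    "0:00:00.336212023 62327 0x55f5ca5174a0 TRACE GST_TRACER :0:: element-latency, element-id=(string)0x55f5ca532a60, element=(string)rawvideoparse0, src=(string)src, time=(guint64)852315, ts=(guint64)336203035;",
    "0:00:00.336866520 62327 0x55f5ca5176d0 TRACE GST_TRACER :0:: element-latency, element-id=(string)0x55f5ca53f860, element=(string)nvh264enc0, src=(string)src, time=(guint64)6403181, ts=(guint64)336845676;",
]
# The leading ", " distinguishes the element field from element-id
pat = re.compile(r", element=\(string\)([^,]+).*?time=\(guint64\)(\d+).*?ts=\(guint64\)(\d+)")
output = []
for line in lines:
    m = pat.search(line)
    if m:
        element, time_, ts = m.groups()
        output.append([element, int(time_), int(ts)])
print(output)
# [['rawvideoparse0', 852315, 336203035], ['nvh264enc0', 6403181, 336845676]]
```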
Here is a possible solution using a series of split commands:
output = []
with open("log.txt") as f:
    for line in f:
        values = []
        line = line.split("element=(string)", 1)[1]
        values.append(line.split(",", 1)[0])
        line = line.split("time=(guint64)", 1)[1]
        values.append(int(line.split(",", 1)[0]))
        line = line.split("ts=(guint64)", 1)[1]
        values.append(int(line.split(";", 1)[0]))
        output.append(values)
This is not the fastest solution, but this is probably how I would code it for readability.
# create empty list for output
list_final_output = []
# filter substrings
list_filter = ['element=(string)', 'time=(guint64)', 'ts=(guint64)']
# open the log file and read in the lines as a list of strings
with open('so_58272709.log', 'r') as f_log:
    string_example = f_log.read().splitlines()
print(f'string_example: \n{string_example}\n')
# loop through each line in the list of strings
for each_line in string_example:
    # split each line by comma
    list_split_line = each_line.split(',')
    # loop through each filter substring, include filter
    filter_string = [x for x in list_split_line if (list_filter[0] in x
                                                    or list_filter[1] in x
                                                    or list_filter[2] in x)]
    # remove the substring
    filter_string = [x.replace(list_filter[0], '') for x in filter_string]
    filter_string = [x.replace(list_filter[1], '') for x in filter_string]
    filter_string = [x.replace(list_filter[2], '') for x in filter_string]
    # store results of each for-loop
    list_final_output.append(filter_string)
# print final output
print(f'list_final_output: \n{list_final_output}\n')

python regex unicode - extracting data from an utf-8 file

I am facing difficulties extracting data from a UTF-8 file that contains Chinese characters.
The file is actually CEDICT (a Chinese-English dictionary) and looks like this:
賓 宾 [bin1] /visitor/guest/object (in grammar)/
賓主 宾主 [bin1 zhu3] /host and guest/
賓利 宾利 [Bin1 li4] /Bentley/
賓士 宾士 [Bin1 shi4] /Taiwan equivalent of 奔馳|奔驰[Ben1 chi2]/
賓夕法尼亞 宾夕法尼亚 [Bin1 xi1 fa3 ni2 ya4] /Pennsylvania/
賓夕法尼亞大學 宾夕法尼亚大学 [Bin1 xi1 fa3 ni2 ya4 Da4 xue2] /University of Pennsylvania/
賓夕法尼亞州 宾夕法尼亚州 [Bin1 xi1 fa3 ni2 ya4 zhou1] /Pennsylvania/
Until now, I managed to get the first two fields using split(), but I can't figure out how to extract the two other fields (say, for the second line, "bin1 zhu3" and "host and guest"). I have been trying to use regex but it doesn't work, for a reason that escapes me.
#!/bin/python
#coding=utf-8
import re
class REMatcher(object):
    def __init__(self, matchstring):
        self.matchstring = matchstring
    def match(self, regexp):
        self.rematch = re.match(regexp, self.matchstring)
        return bool(self.rematch)
    def group(self, i):
        return self.rematch.group(i)

def look(character):
    myFile = open("/home/quentin/cedict_ts.u8","r")
    for line in myFile:
        line = line.rstrip()
        elements = line.split(" ")
        try:
            if line != "" and elements[1] == character:
                myFile.close()
                return line
        except:
            myFile.close()
            break
    myFile.close()
    return "Aucun résultat :("

translation = look("賓主") # translation contains one line of the file
elements = translation.split()
traditionnal = elements[0]
simplified = elements[1]
print "Traditionnal:" + traditionnal
print "Simplified:" + simplified
m = REMatcher(translation)
tr = ""
if m.match(r"\[(\w+)\]"):
    tr = m.group(1)
print "Pronouciation:" + tr
Any help appreciated.
This builds a dictionary to look up translations by either simplified or traditional characters and works in both Python 2.7 and 3.3:
# coding: utf8
import re
import codecs
# Process the whole file decoding from UTF-8 to Unicode
with codecs.open('cedict_ts.u8', encoding='utf8') as datafile:
    D = {}
    for line in datafile:
        # Skip comment lines
        if line.startswith('#'):
            continue
        trad, simp, pinyin, trans = re.match(r'(.*?) (.*?) \[(.*?)\] /(.*)/', line).groups()
        D[trad] = (simp, pinyin, trans)
        D[simp] = (trad, pinyin, trans)
Output (Python 3.3):
>>> D['马克']
('馬克', 'Ma3 ke4', 'Mark (name)')
>>> D['一路顺风']
('一路順風', 'yi1 lu4 shun4 feng1', 'to have a pleasant journey (idiom)')
>>> D['馬克']
('马克', 'Ma3 ke4', 'Mark (name)')
Output (Python 2.7, you have to print strings to see non-ASCII characters):
>>> D[u'马克']
(u'\u99ac\u514b', u'Ma3 ke4', u'Mark (name)')
>>> print D[u'马克'][0]
馬克
I would continue to use splits instead of regular expressions, with the maximum split number given. It depends on how consistent the format of the input file is.
elements = translation.split(' ',2)
traditionnal = elements[0]
simplified = elements[1]
rest = elements[2]
print "Traditionnal:" + traditionnal
print "Simplified:" + simplified
elems = rest.split(']')
tr = elems[0].strip('[')
print "Pronouciation:" + tr
Output:
Traditionnal:賓主
Simplified:宾主
Pronouciation:bin1 zhu3
EDIT: To split the last field into a list, split on the /:
translations = elems[1].strip().strip('/').split('/')
#strip the spaces, then the first and last slash,
#then split on the slashes
Output (for the first line of input):
['visitor', 'guest', 'object (in grammar)']
Heh, I've done this exact same thing before. Basically you just need to use regex with groupings. Unfortunately, I don't know Python regex super well (I did the same thing using C#), but you should really do something like this:
matcher = r"(\b\w+\b) (\b\w+\b) \[(.*?)\] /(.*?)/"
Basically you match the entire line using one expression, but then you use ( ) to separate each item into a regex group. Then you just need to read the groups and voila!
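A Python rendering of that idea (my own sketch; note that in Python 3, \w matches the Chinese characters too):

```python
import re

line = "賓主 宾主 [bin1 zhu3] /bin1 zhu3/".replace("/bin1 zhu3/", "/host and guest/")
line = "賓主 宾主 [bin1 zhu3] /host and guest/"
m = re.match(r"(\w+) (\w+) \[(.*?)\] /(.*)/", line)
if m:
    trad, simp, pinyin, trans = m.groups()
    print(pinyin)  # bin1 zhu3
    print(trans)   # host and guest
```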
