Matching multiple regex groups and removing them


I have been given a file that I would like to extract the useful data from. The format of the file goes something like this:
LINE: 1
TOKENKIND: somedata
TOKENKIND: somedata
LINE: 2
TOKENKIND: somedata
LINE: 3
etc...
What I would like to do is remove LINE: and the line number as well as TOKENKIND: so I am just left with a string that consists of 'somedata somedata somedata...'
I'm using Python to do this, using regular expressions (that I'm not sure are correct) to match the bits of the file I'd like removed.
My question is, how can I get Python to match multiple regex groups and ignore them, adding anything that isn't matched by my regex to my output string? My current code looks like this:
import re
import sys
ignoredTokens = re.compile('''
(?P<WHITESPACE> \s+ ) |
(?P<LINE> LINE:\s[0-9]+ ) |
(?P<TOKEN> [A-Z]+: )
''', re.VERBOSE)
tokenList = open(sys.argv[1], 'r').read()
cleanedList = ''
scanner = ignoredTokens.scanner(tokenList)
for line in tokenList:
    match = scanner.match()
    if match.lastgroup not in ('WHITESPACE', 'LINE', 'TOKEN'):
        cleanedList = cleanedList + match.group(match.lastindex) + ' '
print cleanedList

import re
x = '''LINE: 1
TOKENKIND: somedata
TOKENKIND: somedata
LINE: 2
TOKENKIND: somedata
LINE: 3'''
junkre = re.compile(r'(\s*LINE:\s*\d*\s*)|(\s*TOKENKIND:)', re.DOTALL)
print(junkre.sub('', x))

There's no need to use regex here. It's Python after all, not Perl. Keep it simple and use its string manipulation capabilities:
with open("file") as f:
    for line in f:
        # skip the "LINE: n" markers entirely
        if line.startswith("LINE:"):
            continue
        if "TOKENKIND" in line:
            # keep only what follows the first space
            print(line.split(" ", 1)[-1].strip())
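If you would rather stay close to the named-group approach from the question, one option is to add a catch-all group for the data and keep only the matches that land in it. This is a minimal sketch rather than the asker's final code: the DATA group, the variable names and the sample text are assumptions added for illustration.

```python
import re

# Same three "ignore" groups as in the question, plus a catch-all DATA group.
# DATA is an assumption added here; it grabs any other run of non-whitespace.
token_re = re.compile(r'''
    (?P<WHITESPACE> \s+           ) |
    (?P<LINE>       LINE:\s[0-9]+ ) |
    (?P<TOKEN>      [A-Z]+:       ) |
    (?P<DATA>       \S+           )
''', re.VERBOSE)

sample = (
    "LINE: 1\nTOKENKIND: somedata\nTOKENKIND: somedata\n"
    "LINE: 2\nTOKENKIND: somedata\nLINE: 3\n"
)

# finditer walks the text left to right; lastgroup names the alternative that matched
kept = [m.group() for m in token_re.finditer(sample) if m.lastgroup == "DATA"]
cleaned = " ".join(kept)
print(cleaned)
```

The order of the alternatives matters: LINE and TOKEN are tried before DATA, so the labels never end up in the output.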

How about replacing (^LINE: \d+$)|(^\w+:) with an empty string "", with re.MULTILINE set so that ^ and $ match at every line?
Match the trailing \n instead of $ to also remove the empty lines that would otherwise be left behind.
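A rough sketch of that suggestion in Python, assuming the file content has already been read into a string. The names text, junk and cleaned are made up for this example, and the pattern is a slight variation on the one above that consumes the newline after each LINE entry.

```python
import re

text = """LINE: 1
TOKENKIND: somedata
TOKENKIND: somedata
LINE: 2
TOKENKIND: somedata
LINE: 3"""

# With re.MULTILINE, ^ matches at the start of every line.
# First alternative: a whole "LINE: n" line including its newline.
# Second alternative: any leading "LABEL:" prefix and the spaces after it.
junk = re.compile(r"^LINE: \d+\n?|^\w+:\s*", re.MULTILINE)

# split()/join() turns the remaining newlines into single spaces
cleaned = " ".join(junk.sub("", text).split())
print(cleaned)
```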

Related

Replace decimals within a special character in file

I am currently trying to read in a file and strip the decimal part of every number that sits between two thorn characters (þ), so that:
þ219.91þ
þ122.1919þ
þ467.426þ
þ104.351þ
þ104.0443þ
will become
þ219þ
þ122þ
þ467þ
þ104þ
þ104þ
Below is a regex replacement that works in Notepad++, followed by my attempt to replicate it in Python, which is not working. Any suggestions?
In Notepad++:
Find: (\xFE\d+)\.\d+(\xFE)
Replace: $1$2
Python:
for line in file:
    line = re.sub("(\xFE\d+)\.\d+(\xFE)", "\xFE\d+\xFE", line)

I don't think it would be necessary to have \xFE and this might simply work:
import re
regex = r"(þ\d+)\.\d+(þ)"
test_str = ("þ219.91þ\n"
"þ122.1919þ\n"
"þ467.426þ\n"
"þ104.351þ\n"
"þ104.0443þ")
subst = r"\1\2"
# count=0 replaces every match; pass a positive count to limit the number of replacements
result = re.sub(regex, subst, test_str, count=0, flags=re.MULTILINE)
if result:
    print(result)

You're not replacing decimals: you're truncating the values. Will the mathematical treatment do for you? This assumes that all lines are of the format you show.
for line in file:
    _, val, _ = line.split('þ')   # empty string, value, rest of the line
    line = 'þ' + str(int(float(val))) + 'þ'   # int() alone rejects '219.91'
Note that you could reduce this a little with a single line in the loop:
    line = 'þ' + str(int(float(line.split('þ')[1]))) + 'þ'

You could use a one-liner such as:
f = ["þ219.91þ", "þ122.1919þ", "þ467.426þ", "þ104.351þ", "þ104.0443þ"]
print(["þ{}þ".format(int(float(l.strip("þ")))) for l in f])
Result:
['þ219þ', 'þ122þ', 'þ467þ', 'þ104þ', 'þ104þ']
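For completeness, the attempt in the question most likely fails because the replacement string "\xFE\d+\xFE" is not a pattern: re.sub inserts it as replacement text rather than re-matching the digits. A hedged sketch of the backreference version follows; THORN, pattern and truncate_decimals are hypothetical names, and it assumes the text has been decoded so that þ is a single character.

```python
import re

THORN = "\u00fe"  # þ, the character written as \xFE in the question

# Capture the integer part; the decimal part is matched but not captured.
pattern = re.compile(THORN + r"(\d+)\.\d+" + THORN)

def truncate_decimals(line):
    # \1 puts the captured integer part back between the two thorns
    return pattern.sub(THORN + r"\1" + THORN, line)

lines = ["þ219.91þ", "þ122.1919þ", "no thorn 3.14 here"]
result = [truncate_decimals(line) for line in lines]
print(result)
```

Numbers that are not wrapped in thorns are left untouched, which is the main difference from the split-based answers.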

Finding data in-between two strings in python

I have a text file which contain some format like :
PAGE(leave) 'Data1'
line 1
line 2
line 2
...
...
...
PAGE(enter) 'Data1'
I need to get all the lines between the two keywords and save them to a text file. This is what I have come up with so far, but I have an issue with the single quotes: they are treated as the quotes delimiting the expression rather than as part of the keyword.
My code so far:
log_file = open('messages','r')
data = log_file.read()
block = re.compile(ur'PAGE\(leave\) \'Data1\'[\S ]+\s((?:(?![^\n]+PAGE\(enter\) \'Data1\').)*)', re.IGNORECASE | re.DOTALL)
data_in_home_block=re.findall(block, data)
file = 0
make_directory("home_to_home_data",1)
for line in data_in_home_block:
    file = file + 1
    with open("home_to_home_" + str(file) , "a") as data_in_home_to_home:
        data_in_home_to_home.write(str(line))
It would be great if someone could guide me how to implement it..

As pointed out by @JoanCharmant, it is not necessary to use regex for this task, because the records are delimited by fixed strings.
Something like this should be enough:
with open('messages') as log_file:
    messages = log_file.read()
# plain string methods take the markers literally, so the parentheses need no backslashes
blocks = [block.rpartition("PAGE(enter) 'Data1'")[0]
          for block in messages.split("PAGE(leave) 'Data1'")
          if block and not block.isspace()]
for count, block in enumerate(blocks, 1):
    with open('home_to_home_%d' % count, 'a') as stream:
        stream.write(block)
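If the log is too large to read in one go, the same idea can be written as a small line-by-line state machine. This is only a sketch under the assumption that each marker sits on its own line, as in the question; extract_blocks and the sample data are made up for illustration.

```python
def extract_blocks(lines, start="PAGE(leave) 'Data1'", end="PAGE(enter) 'Data1'"):
    """Return the text found between each start/end marker pair."""
    blocks = []
    current = None          # None means "not inside a block"
    for line in lines:
        if start in line:
            current = []    # open a new block; the marker line itself is skipped
        elif end in line and current is not None:
            blocks.append("".join(current))
            current = None
        elif current is not None:
            current.append(line)
    return blocks

sample = [
    "PAGE(leave) 'Data1'\n", "line 1\n", "line 2\n", "PAGE(enter) 'Data1'\n",
    "noise\n",
    "PAGE(leave) 'Data1'\n", "line 3\n", "PAGE(enter) 'Data1'\n",
]
blocks = extract_blocks(sample)
print(blocks)
```

Because the function only iterates over its argument, an open file object can be passed in place of the list.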

If it's the single quotes that worry you, you can delimit the regular expression string with double quotes:
'hello "howdy"' # Correct
"hello 'howdy'" # Correct
Now, there are more issues here. Even when the pattern is declared as a raw string (r prefix), a literal backslash in the text you are matching must still be escaped in the regular expression as \\ (see What does the "r" in pythons re.compile(r' pattern flags') mean?). It's just that without the r you would need even more backslashes.
I've created a test file with two "sections":
PAGE\(leave\) 'Data1'
line 1
line 2
line 3
PAGE\(enter\) 'Data1'
PAGE\(leave\) 'Data1'
line 4
line 5
line 6
PAGE\(enter\) 'Data1'
The code below will do what you want (I think)
import re
with open('test.txt', 'r') as log_file:
    data = log_file.read()
# every piece is a raw string: \\ matches a literal backslash, \( a literal parenthesis
block = re.compile(
    r"(PAGE\\\(leave\\\) 'Data1'\n)"
    r"(.*?)"
    r"(PAGE\\\(enter\\\) 'Data1')",
    re.IGNORECASE | re.DOTALL | re.MULTILINE
)
data_in_home_block = [result[1] for result in re.findall(block, data)]
for data_block in data_in_home_block:
    print("Found data_block: %s" % (data_block,))
Outputs:
Found data_block: line 1
line 2
line 3
Found data_block: line 4
line 5
line 6

how to manipulate SREC file

I have an S19 file looking something like below:
S0030000FC
S30D0003C0000F0000000000000020
S3FD00000000782EFF1FB58E00003D2B00003D2B00003D2B00003D2B00003D2B00003D
S3ED000000F83D2B00003D2B00003D2B00003D2B00003D2B00003D2B00003D2B00003D
S31500000400FFFFFFFFFFFFFFFFFFFFFFFF7EF9FFFF7D
S3FD0000041010B5DFF828000468012147F22C10C4F20300016047F22010C4F2030000
S70500008EB4B8
I want to separate the first two characters and also the next two characters, and so on... I want it to look like below (last two characters are also to be separated for each line):
S0, 03, 0000, FC
S3, 0D, 0003C000, 0F00000000000000, 20
S3, FD, 00000000, 782EFF1FB58E00003D2B00003D2B00003D2B00003D2B00003D2B0000, 3D
S3, ED, 000000F8, 3D2B00003D2B00003D2B00003D2B00003D2B00003D2B00003D2B0000, 3D
S3, 15, 00000400, FFFFFFFFFFFFFFFFFFFFFFFF7EF9FFFF, 7D
S3, FD, 00000410, 10B5DFF828000468012147F22C10C4F20300016047F22010C4F20300, 00
S7, 05, 00008EB4, B8
How can I do this in Python?
I have something like this:
#!/usr/bin/python
import string,os,sys,re,fileinput
print "hi"
inputfile = "k60.S19"
outputfile = "k60_out.S19"
# open the source file and read it
fh = file(inputfile, 'r')
subject = fh.read()
fh.close()
# create the pattern object. Note the "r". In case you're unfamiliar with Python
# this is to set the string as raw so we don't have to escape our escape characters
pattern2 = re.compile(r'S3')
pattern3 = re.compile(r'S7')
pattern1 = re.compile(r'S0')
# do the replace
result1 = pattern1.sub("S0, ", subject)
result2 = pattern2.sub("S3, ", subject)
result3 = pattern3.sub("S7, ", subject)
# write the file
f_out = file(outputfile, 'w')
f_out.write(result1)
f_out.write(result2)
f_out.write(result3)
f_out.close()
#EoF
but it is not working as I like!! Can someone help me with how to come up with proper regular expression use for this?

Try the bincopy package; it may be what you need.
bincopy - Interpret strings as packed binary data
Mangling of various file formats that convey binary information (Motorola S-Record, Intel HEX and binary files).
import bincopy
f = bincopy.BinFile()
f.add_srec_file("path/to/your/s19/file.s19")
f.as_binary()  # returns the contents as binary data
Or you can simply use open() and slice each line:
with open("path/to/your/s19/file.s19") as s19:
    for line in s19:
        line = line.strip()      # drop the newline so [-2:] is the checksum
        rec_type = line[0:2]
        count = line[2:4]
        address = line[4:12]     # 8 hex digits, as in S3/S7 records; S0 uses only 4
        data = line[12:-2]
        checksum = line[-2:]
        print(", ".join([rec_type, count, address, data, checksum]))
Hope it helps.
Motorola S-record file format

You can do it using a callback function as replacement with re.sub:
#!/usr/bin/python
import re
data = r'''S0030000FC
S30D0003C0000F0000000000000020
S3FD00000000782EFF1FB58E00003D2B00003D2B00003D2B00003D2B00003D2B00003D
S3ED000000F83D2B00003D2B00003D2B00003D2B00003D2B00003D2B00003D2B00003D
S31500000400FFFFFFFFFFFFFFFFFFFFFFFF7EF9FFFF7D
S3FD0000041010B5DFF828000468012147F22C10C4F20300016047F22010C4F2030000
S70500008EB4B8'''
pattern = re.compile(r'^(..)(..)((?:.{4}){1,2})(.*)(?=..)', re.M)
def repl(m):
    repstr = ''
    for g in m.groups():
        if g:
            repstr += g + ', '
    return repstr
print(re.sub(pattern, repl, data))
However, as Mark Setchell points out, there is probably a nice way to do it with slicing.
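A slicing version could look roughly like the sketch below. It relies on the address width of each S-record type (2 bytes for S0/S1/S5/S9, 3 bytes for S2/S6/S8, 4 bytes for S3/S7), which is what makes the S0 line split differently from the S3 lines. The names ADDRESS_HEX_DIGITS and split_srec are hypothetical, and no checksum validation is attempted.

```python
# Number of hex digits in the address field for each record type
ADDRESS_HEX_DIGITS = {
    "S0": 4, "S1": 4, "S2": 6, "S3": 8,
    "S5": 4, "S6": 6, "S7": 8, "S8": 6, "S9": 4,
}

def split_srec(line):
    """Split one S-record into type, count, address, data and checksum."""
    line = line.strip()
    rec_type, count = line[:2], line[2:4]
    addr_end = 4 + ADDRESS_HEX_DIGITS[rec_type]
    address = line[4:addr_end]
    data = line[addr_end:-2]
    checksum = line[-2:]
    # Records such as S0 or S7 may carry no data; drop the empty field
    fields = [rec_type, count, address, data, checksum]
    return ", ".join(f for f in fields if f)

records = [
    "S0030000FC",
    "S30D0003C0000F0000000000000020",
    "S70500008EB4B8",
]
formatted = [split_srec(r) for r in records]
for row in formatted:
    print(row)
```

An unknown record type raises a KeyError here, which seems preferable to silently slicing at the wrong offset.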

I know you are thinking Python and regexes, but this was made for awk and the following will maybe help you work out the way to do it using slicing:
awk '{r=length($0);print substr($0,1,2),substr($0,3,2),substr($0,5,8),substr($0,13,r-14),substr($0,r-1)}' OFS=, k60.s19
That says "get the length of the line in variable r, then print the first two characters, the next two characters, the next 8 characters and so on... and use a comma as the field separator".
EDITED
Here are a few more hints to get you started...
If you want to avoid printing line 1, you can do
awk 'FNR==1{next} ...rest of awk script above ... '
If you want to only process lines longer than 40 characters, you can do
awk 'length($0)>40 {print}' yourfile
If you only want to process lines where the second field is "xx", you can do
awk '$2 ~ "xx" {print}' yourfile

How to remove newlines and indents from a string in Python?

In my Python script, I have a SQL statement that goes on forever like so:
query = """
SELECT * FROM
many_many
tables
WHERE
this = that,
a_bunch_of = other_conditions
"""
What's the best way to get this to read like a single line? I tried this:
def formattedQuery(query):
    lines = query.split('\n')
    for line in lines:
        line = line.lstrip()
        line = line.rstrip()
    return ' '.join(lines)
and it did remove newlines but not spaces from the indents. Please help!

You could do this:
query = " ".join(query.split())
but it will not work very well if your SQL queries contain strings with spaces or tabs (for example select * from users where name = 'Jura X'). This is a problem of other solutions which use string.replace or regular expressions. So your approach is not too bad, but your code needs to be fixed.
What is actually wrong with your function: you return the original lines, and the return values of lstrip and rstrip are discarded. You could fix it like this:
def formattedQuery(query):
    lines = query.split('\n')
    r = []
    for line in lines:
        line = line.lstrip()
        line = line.rstrip()
        r.append(line)
    return ' '.join(r)
Another way of doing it:
def formattedQuery(q):
    return " ".join([s.strip() for s in q.splitlines()])
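To make the caveat about quoted strings concrete, here is a small comparison of the two approaches. The query text and variable names are invented for the example, and the per-line version additionally skips blank lines so the result has no stray leading space.

```python
query = """
    SELECT * FROM users
    WHERE name = 'Jura  X'
"""

# Collapses every run of whitespace, including the two spaces inside the literal
collapsed = " ".join(query.split())

# Strips only the indentation and line breaks; text inside each line is untouched
per_line = " ".join(line.strip() for line in query.splitlines() if line.strip())

print(collapsed)
print(per_line)
```

Neither variant protects a string literal that itself spans several lines; that would need a real SQL tokenizer.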

Another one-liner:
>>> import re
>>> re.sub(r'\s+', ' ', query).strip()
'SELECT * FROM many_many tables WHERE this = that, a_bunch_of = other_conditions'
This replaces each run of whitespace characters in the string query with a single ' ' and trims the ends.

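str.translate can remove characters: list the characters to delete as the third argument of str.maketrans (the first two stay empty so nothing is converted). Note that this deletes newlines outright rather than replacing them with spaces:
query.translate(str.maketrans('', '', '\n\t'))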

delete only lines after match1 up to match2

I have checked and played with various examples and it appears that my problem is a bit more complex than what I have been able to find. What I need to do is search for a particular string and then delete the following line and keep deleting lines until another string is found. So an example would be the following:
a
b
color [
0 0 0,
1 1 1,
3 3 3,
] #color
y
z
Here, "color [" is match1, and "] #color" is match2. So then what is desired is the following:
a
b
color [
] #color
y
z

This "simple to follow" code example will get you started; you can tweak it as needed. Note that it processes the file line by line, so it will work with a file of any size.
start_marker = 'startdel'
end_marker = 'enddel'
with open('data.txt') as inf:
    ignoreLines = False
    for line in inf:
        if start_marker in line:
            print(line, end='')   # keep the start marker line itself
            ignoreLines = True
        if end_marker in line:
            ignoreLines = False
        if not ignoreLines:
            print(line, end='')
It uses startdel and enddel as "markers" for starting and ending the ignoring of data.
Update:
The code was modified based on a request in the comments; it now includes/prints the lines that contain the "markers".
Given this input data (borrowed from @drewk):
Beginning of the file...
stuff
startdel
delete this line
delete this line also
enddel
stuff as well
the rest of the file...
it yields:
Beginning of the file...
stuff
startdel
enddel
stuff as well
the rest of the file...
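Applied to the data from the question, the same loop can be wrapped in a function that returns the surviving lines instead of printing them. This is a sketch: drop_between is a hypothetical name, and it assumes the start and end markers never share a line.

```python
def drop_between(lines, start_marker, end_marker):
    """Keep the marker lines but drop everything between them."""
    kept = []
    skipping = False
    for line in lines:
        if skipping and end_marker in line:
            skipping = False          # the end marker line itself is kept
        if not skipping:
            kept.append(line)
        if start_marker in line:
            skipping = True           # start dropping from the next line on
    return kept

sample = ["a", "b", "color [", "0 0 0,", "1 1 1,", "3 3 3,", "] #color", "y", "z"]
result = drop_between(sample, "color [", "] #color")
print(result)
```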

You can do this with a single regex by using the non-greedy *?. E.g., assuming you want to keep both the "look for this line" and the "until this line is found" lines, and discard only the lines in between, you could do:
>>> import re
>>> my_regex = re.compile("(look for this line)"+
...     ".*?"+ # match as few chars as possible
...     "(until this line is found)",
...     re.DOTALL)
>>> new_str = my_regex.sub(r"\1\n\2", old_str)
A few notes:
The re.DOTALL flag tells Python that "." can match newlines -- by default it matches any character except a newline
The parentheses define "numbered match groups", which are then used later when I say r"\1\n\2" to make sure that we don't discard the first and last line (the \n restores the line break that .*? swallowed). The replacement must be a raw string, otherwise "\1" is read as a control character. If you did want to discard either or both of those lines, just get rid of the \1 and/or the \2. E.g., to keep the first but not the last use my_regex.sub(r"\1", old_str); or to get rid of both use my_regex.sub("", old_str)
For further explanation, see: http://docs.python.org/library/re.html or search for "non-greedy regular expression" in your favorite search engine.
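As an illustration with the markers from the question, the sketch below anchors both marker lines with re.MULTILINE and keeps the newline inside the first group, so the two kept lines stay on separate lines. The variable names are assumptions; the marker text is taken from the question, with the square brackets escaped.

```python
import re

text = "a\nb\ncolor [\n0 0 0,\n1 1 1,\n3 3 3,\n] #color\ny\nz\n"

# Group 1: the "color [" line including its newline.
# .*? lazily skips everything up to the first line that is exactly "] #color".
pattern = re.compile(r"^(color \[\n).*?^(\] #color)$", re.DOTALL | re.MULTILINE)

# Raw string, so \1 and \2 are backreferences rather than control characters
result = pattern.sub(r"\1\2", text)
print(result)
```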

This works:
s="""Beginning of the file...
stuff
look for this line
delete this line
delete this line also
until this line is found
stuff as well
the rest of the file... """
import re
print(re.sub(r'(^look for this line$).*?(^until this line is found$)',
             r'\1\n\2', s, count=1, flags=re.DOTALL | re.MULTILINE))
prints:
Beginning of the file...
stuff
look for this line
until this line is found
stuff as well
the rest of the file...
You can also use list slices to do this:
mStart='look for this line'
mStop='until this line is found'
li=s.split('\n')
print('\n'.join(li[0:li.index(mStart)+1]+li[li.index(mStop):]))
Same output.
I like re for this (being a Perl guy at heart...)
