Implement regular expression in Python to replace every occurence of "meshname = x" in a text file - python

I want to replace every line in a textfile with " " which starts with "meshname = " and ends with any letter/number and underscore combination. I used regex's in CS but I never really understood the different notations in Python. Can you help me with that?
Is this the right regex for my problem and how would i transform that into a Python regex?
m.e.s.h.n.a.m.e.' '.=.' '.{{_}*,{0,...,9}*,{a,...,z}*,{A,...,Z}*}*
x.y = Concatenation of x and y
' ' = whitespace
{x} = set containing x
x* = x.x.x. ... .x or empty word
What would the script look like in order to replace every string/line in a file containing meshname = ... with the Python regex? Something like this?
fin = open("test.txt", 'r')
data = fin.read()
data = data.replace("^meshname = [[a-z]*[A-Z]*[0-9]*[_]*]+", "")
fin.close()
fin = open("test.txt", 'w')
fin.write(data)
fin.close()
or is this completely wrong? I've tried to get it working with this approach, but somehow it never matched the right string: How to input a regex in string.replace?

Following the current code logic, you can use
data = re.sub(r'^meshname = .*\w$', ' ', data, flags=re.M)
The re.sub will replace with a space any line that matches
^ - line start (note the flags=re.M argument that makes sure the multiline mode is on)
meshname - a meshname word
= - a = string
.* - any zero or more chars other than line break chars as many as possible
\w - a letter/digit/_
$ - line end.

Related

Split string if separator is not in-between two characters

I want to write a script that reads from a csv file and splits each line by comma except any commas in-between two specific characters.
In the below code snippet I would like to split line by commas except the commas in-between two $s.
line = "$abc,def$,$ghi$,$jkl,mno$"
output = line.split(',')
for o in output:
print(o)
How do I write output = line.split(',') so that I get the following terminal output?
~$ python script.py
$abc,def$
$ghi$
$jkl,mno$
You can do this with a regular expression:
In re, the (?<!\$) will match a character not immediately following a $.
Similarly, a (?!\$) will match a character not immediately before a dollar.
The | character cam match multiple options. So to match a character where either side is not a $ you can use:
expression = r"(?<!\$),|,(?!\$)"
Full program:
import re
expression = r"(?<!\$),|,(?!\$)"
print(re.split(expression, "$abc,def$,$ghi$,$jkl,mno$"))
One solution (maybe not the most elegant but it will work) is to replace the string $,$ with something like $,,$ and then split ,,. So something like this
output = line.replace('$,$','$,,$').split(',,')
Using regex like mousetail suggested is the more elegant and robust solution but requires knowing regex (not that anyone KNOWS regex)
Try regular expressions:
import re
line = "$abc,def$,$ghi$,$jkl,mno$"
output = re.findall(r"\$(.*?)\$", line)
for o in output:
print('$'+o+'$')
$abc,def$
$ghi$
$jkl,mno$
First, you can identify a character that is not used in that line:
c = chr(max(map(ord, line)) + 1)
Then, you can proceed as follows:
line.replace('$,$', f'${c}$').split(c)
Here is your example:
>>> line = '$abc,def$,$ghi$,$jkl,mno$'
>>> c = chr(max(map(ord, line)) + 1)
>>> result = line.replace('$,$', f'${c}$').split(c)
>>> print(*result, sep='\n')
$abc,def$
$ghi$
$jkl,mno$

How to fix 'replace' keyword when it is not working in python

I am writing a code that needs to get four individual values, and one of the values has the newline character in addition to an extra apostrophe and bracket like so: 11\n']. I only need the 11 and have been able to strip the '], but I am unable to remove the newline character.
I have tried various different set ups of strip and replace, and both strip and replace are not removing the part.
with open('gil200110raw.txt', 'r') as qcfile:
txt = qcfile.readlines()
line1 = txt[1:2]
line2 = txt[2:3]
line1 = str(line1)
line2 = str(line2)
sptline1 = line1.split(' ')
sptline2 = line2.split(' ')
totalobs = sptline1[39]
qccalc1 = sptline2[2]
qccalc2 = sptline2[9]
qccalc3 = sptline2[16]
qccalc4 = sptline2[22]
qccalc4 = qccalc4.strip("\n']")
qccalc4 = qccalc4.replace("\n", "")
I did not get an error, but the output of print(qccalc4) is 11\n. I expect the output to be 11.
Use rstrip instead!
>>> 'test string\n'.rstrip()
'test string'
You can use regex to match the outputs you're looking for.
From your description, I assume it is all integers, consider the following snippet
import re
p = re.compile('[0-9]+')
sample = '11\n\'] dwqed 12 444'
results = p.findall(sample)
results would now contain the array ['11', '12', '444'].
re is the regex package for python and p is the pattern we would like to find in our text, this pattern [0-9]+ simply means match one or more characters 0 to 9
you can find the documentation here

python regex: pattern not found

I have a pattern compiled as
pattern_strings = ['\xc2d', '\xa0', '\xe7', '\xc3\ufffdd', '\xc2\xa0', '\xc3\xa7', '\xa0\xa0', '\xc2', '\xe9']
join_pattern = '|'.join(pattern_strings)
pattern = re.compile(join_pattern)
and then I find pattern in file as
def find_pattern(path):
with open(path, 'r') as f:
for line in f:
print line
found = pattern.search(line)
if found:
print dir(found)
logging.info('found - ' + found)
and my input as path file is
\xc2d
d\xa0
\xe7
\xc3\ufffdd
\xc3\ufffdd
\xc2\xa0
\xc3\xa7
\xa0\xa0
'619d813\xa03697'
When I run this program, nothing happens.
I it not able to catch these patterns, what is am I doing wrong here?
Desired output
- each line because each line has one or the other matching pattern
Update
After changing the regex to
pattern_strings = ['\\xc2d', '\\xa0', '\\xe7', '\\xc3\\ufffdd', '\\xc2\\xa0', '\\xc3\\xa7', '\\xa0\\xa0', '\\xc2', '\\xe9']
It is still the same, no output
UPDATE
after making regex to
pattern_strings = ['\\xc2d', '\\xa0', '\\xe7', '\\xc3\\ufffdd', '\\xc2\\xa0', '\\xc3\\xa7', '\\xa0\\xa0', '\\xc2', '\\xe9']
join_pattern = '[' + '|'.join(pattern_strings) + ']'
pattern = re.compile(join_pattern)
Things started to work, but partially, the patterns still not caught are for line
\xc2\xa0
\xc3\xa7
\xa0\xa0
for which my pattern string is ['\\xc2\\xa0', '\\xc3\\xa7', '\\xa0\\xa0']
escape the \ in the search patterns
either with r"\xa0" or as "\\xa0"
do this ....
['\\xc2d', '\\xa0', '\\xe7', '\\xc3\\ufffdd', '\\xc2\\xa0', '\\xc3\\xa7', '\\xa0\\xa0', '\\xc2', '\\xe9']
like everyones been saying to do except the one guy you listened too...
Does your file actually contain \xc2d --- that is, five characters: a backslash followed by c, then 2, then d? If so, your regex won't match it. Each of your regexes will match one or two characters with certain character codes. If you want to match the string \xc2d your regex needs to be \\xc2d.

Python Regex MULTILINE option not working correctly?

I'm writing a simple version updater in Python, and the regex engine is giving me mighty troubles.
In particular, ^ and $ aren't matching correctly even with re.MULTILINE option. The string matches without the ^ and $, but no joy otherwise.
I would appreciate your help if you can spot what I'm doing wrong.
Thanks
target.c
somethingsomethingsomething
NOTICE_TYPE revision[] = "A_X1_01.20.00";
somethingsomethingsomething
versionUpdate.py
fileName = "target.c"
newVersion = "01.20.01"
find = '^(\s+NOTICE_TYPE revision\[\] = "A_X1_)\d\d+\.\d\d+\.\d\d+(";)$'
replace = "\\1" + newVersion + "\\2"
file = open(fileName, "r")
fileContent = file.read()
file.close()
find_regexp = re.compile(find, re.MULTILINE)
file = open(fileName, "w")
file.write( find_regexp.sub(replace, fileContent) )
file.close()
Update: Thank you John and Ethan for a valid point. However, the regexp still isn't matching if I keep $. It works again as soon as I remove $.
Change your replace to:
replace = r'\g<1>' + newVersion + r'\2'
The problem you're having is your version results in this:
replace = "\\101.20.01\\2"
which is confusing the sub call as there is no field 101. From the documentation for the Python re module:
\g<number> uses the corresponding group number; \g<2> is therefore
equivalent to \2, but isn’t ambiguous in a replacement such as \g<2>0.
\20 would be interpreted as a reference to group 20, not a reference
to group 2 followed by the literal character '0'.
if you do a print replace you'll see the problem...
replace == '\\101.20.01\2'
and since you don't have a 101st match, the first portion of your line gets lost. Try this instead:
newVersion = "_01.20.01"
find = r'^(\s+NOTICE_TYPE revision\[\] = "A_X1)_\d\d+\.\d\d+\.\d\d+(";)$'
replace = "\\1" + newVersion + "\\2"
(moves a portion of the match so there is no conflict)

python regex for repeating string

I am wanting to verify and then parse this string (in quotes):
string = "start: c12354, c3456, 34526; other stuff that I don't care about"
//Note that some codes begin with 'c'
I would like to verify that the string starts with 'start:' and ends with ';'
Afterward, I would like to have a regex parse out the strings. I tried the following python re code:
regx = r"start: (c?[0-9]+,?)+;"
reg = re.compile(regx)
matched = reg.search(string)
print ' matched.groups()', matched.groups()
I have tried different variations but I can either get the first or the last code but not a list of all three.
Or should I abandon using a regex?
EDIT: updated to reflect part of the problem space I neglected and fixed string difference.
Thanks for all the suggestions - in such a short time.
In Python, this isn’t possible with a single regular expression: each capture of a group overrides the last capture of that same group (in .NET, this would actually be possible since the engine distinguishes between captures and groups).
Your easiest solution is to first extract the part between start: and ; and then using a regular expression to return all matches, not just a single match, using re.findall('c?[0-9]+', text).
You could use the standard string tools, which are pretty much always more readable.
s = "start: c12354, c3456, 34526;"
s.startswith("start:") # returns a boolean if it starts with this string
s.endswith(";") # returns a boolean if it ends with this string
s[6:-1].split(', ') # will give you a list of tokens separated by the string ", "
This can be done (pretty elegantly) with a tool like Pyparsing:
from pyparsing import Group, Literal, Optional, Word
import string
code = Group(Optional(Literal("c"), default='') + Word(string.digits) + Optional(Literal(","), default=''))
parser = Literal("start:") + OneOrMore(code) + Literal(";")
# Read lines from file:
with open('lines.txt', 'r') as f:
for line in f:
try:
result = parser.parseString(line)
codes = [c[1] for c in result[1:-1]]
# Do something with teh codez...
except ParseException exc:
# Oh noes: string doesn't match!
continue
Cleaner than a regular expression, returns a list of codes (no need to string.split), and ignores any extra characters in the line, just like your example.
import re
sstr = re.compile(r'start:([^;]*);')
slst = re.compile(r'(?:c?)(\d+)')
mystr = "start: c12354, c3456, 34526; other stuff that I don't care about"
match = re.match(sstr, mystr)
if match:
res = re.findall(slst, match.group(0))
results in
['12354', '3456', '34526']

Categories