How to fix 'replace' keyword when it is not working in python - python

I am writing a code that needs to get four individual values, and one of the values has the newline character in addition to an extra apostrophe and bracket like so: 11\n']. I only need the 11 and have been able to strip the '], but I am unable to remove the newline character.
I have tried various different set ups of strip and replace, and both strip and replace are not removing the part.
with open('gil200110raw.txt', 'r') as qcfile:
txt = qcfile.readlines()
line1 = txt[1:2]
line2 = txt[2:3]
line1 = str(line1)
line2 = str(line2)
sptline1 = line1.split(' ')
sptline2 = line2.split(' ')
totalobs = sptline1[39]
qccalc1 = sptline2[2]
qccalc2 = sptline2[9]
qccalc3 = sptline2[16]
qccalc4 = sptline2[22]
qccalc4 = qccalc4.strip("\n']")
qccalc4 = qccalc4.replace("\n", "")
I did not get an error, but the output of print(qccalc4) is 11\n. I expect the output to be 11.

Use rstrip instead!
>>> 'test string\n'.rstrip()
'test string'

You can use regex to match the outputs you're looking for.
From your description, I assume it is all integers, consider the following snippet
import re
p = re.compile('[0-9]+')
sample = '11\n\'] dwqed 12 444'
results = p.findall(sample)
results would now contain the array ['11', '12', '444'].
re is the regex package for python and p is the pattern we would like to find in our text, this pattern [0-9]+ simply means match one or more characters 0 to 9
you can find the documentation here

Related

Split string if separator is not in-between two characters

I want to write a script that reads from a csv file and splits each line by comma except any commas in-between two specific characters.
In the below code snippet I would like to split line by commas except the commas in-between two $s.
line = "$abc,def$,$ghi$,$jkl,mno$"
output = line.split(',')
for o in output:
print(o)
How do I write output = line.split(',') so that I get the following terminal output?
~$ python script.py
$abc,def$
$ghi$
$jkl,mno$
You can do this with a regular expression:
In re, the (?<!\$) will match a character not immediately following a $.
Similarly, a (?!\$) will match a character not immediately before a dollar.
The | character cam match multiple options. So to match a character where either side is not a $ you can use:
expression = r"(?<!\$),|,(?!\$)"
Full program:
import re
expression = r"(?<!\$),|,(?!\$)"
print(re.split(expression, "$abc,def$,$ghi$,$jkl,mno$"))
One solution (maybe not the most elegant but it will work) is to replace the string $,$ with something like $,,$ and then split ,,. So something like this
output = line.replace('$,$','$,,$').split(',,')
Using regex like mousetail suggested is the more elegant and robust solution but requires knowing regex (not that anyone KNOWS regex)
Try regular expressions:
import re
line = "$abc,def$,$ghi$,$jkl,mno$"
output = re.findall(r"\$(.*?)\$", line)
for o in output:
print('$'+o+'$')
$abc,def$
$ghi$
$jkl,mno$
First, you can identify a character that is not used in that line:
c = chr(max(map(ord, line)) + 1)
Then, you can proceed as follows:
line.replace('$,$', f'${c}$').split(c)
Here is your example:
>>> line = '$abc,def$,$ghi$,$jkl,mno$'
>>> c = chr(max(map(ord, line)) + 1)
>>> result = line.replace('$,$', f'${c}$').split(c)
>>> print(*result, sep='\n')
$abc,def$
$ghi$
$jkl,mno$

Implement regular expression in Python to replace every occurence of "meshname = x" in a text file

I want to replace every line in a textfile with " " which starts with "meshname = " and ends with any letter/number and underscore combination. I used regex's in CS but I never really understood the different notations in Python. Can you help me with that?
Is this the right regex for my problem and how would i transform that into a Python regex?
m.e.s.h.n.a.m.e.' '.=.' '.{{_}*,{0,...,9}*,{a,...,z}*,{A,...,Z}*}*
x.y = Concatenation of x and y
' ' = whitespace
{x} = set containing x
x* = x.x.x. ... .x or empty word
What would the script look like in order to replace every string/line in a file containing meshname = ... with the Python regex? Something like this?
fin = open("test.txt", 'r')
data = fin.read()
data = data.replace("^meshname = [[a-z]*[A-Z]*[0-9]*[_]*]+", "")
fin.close()
fin = open("test.txt", 'w')
fin.write(data)
fin.close()
or is this completely wrong? I've tried to get it working with this approach, but somehow it never matched the right string: How to input a regex in string.replace?
Following the current code logic, you can use
data = re.sub(r'^meshname = .*\w$', ' ', data, flags=re.M)
The re.sub will replace with a space any line that matches
^ - line start (note the flags=re.M argument that makes sure the multiline mode is on)
meshname - a meshname word
= - a = string
.* - any zero or more chars other than line break chars as many as possible
\w - a letter/digit/_
$ - line end.

Re.search for comma/delimiter-removed substring

I have a text and used a function to extract a part of the text. However, in the returned value, delimiters (e.g ',', '-') are removed. I need to find the extracted part in the original text including substring and position.
e.g:
original_text = "xyz, 19900 Praha 9, Letnany"
(or original_text = "xyz, 19900 Praha 9 - Letnany")
extracted_text = "praha 9 letnany" (lower case, delimiters are removed)
I expect the output is the same as the ouput of re.search('praha 9, letnany', original_text) meaning getting the substring 'Praha 9, Letnany' and start of the match: 11.
Is there any regular expression to locate extracted text in the original text?
The output of the function can't be changed (up until now)
I have tried to find problems related to ignoring some character while using regex but their problems are different.
This will locate a span in the original text that matches the extracted text ignoring case & inserting delimiters at will (in this case, comma or dash):
import re
pat = ("[,-]*".join(list(extracted_text))).replace(" ","\\s")
mat = re.search( pat, original_text, re.I )
if mat:
print(mat.span())
else:
print("No match")
Same idea as #ScottHunter but process at word level instead of character level:
import re
ori_txt = '19900, Praha 7, Letnany'
extr_txt = 'praha 7 letnany'
delimiters = [',', '\s', '-']
deli = '|'.join([i for i in delimiters])
extr_arr = re.split(deli, extr_txt)
ins_c = ''.join([i for i in delimiters])
ins_c = ''.join(['[', ins_c, ']', '*'])
pat = ins_c.join(extr_arr)
mat = re.search(pat, ori_txt, re.I)
if mat:
print mat.group()
else:
print('not found')
I first want to find a regular expression to directly search for the extracted text in the original text but there seem to be no such an expression. Here is another way to solve my problem. Thank you.

Splitting a string using re module of python

I have a string
s = 'count_EVENT_GENRE in [1,2,3,4,5]'
#I have to capture only the field 'count_EVENT_GENRE'
field = re.split(r'[(==)(>=)(<=)(in)(like)]', s)[0].strip()
#o/p is 'cou'
# for s = 'sum_EVENT_GENRE in [1,2,3,4,5]' o/p = 'sum_EVENT_GENRE'
which is fine
My doubt is for any character in (in)(like) it is splitting the string s at that character and giving me first slice.(as after "cou" it finds one matching char i:e n). It's happening for any string that contains any character from (in)(like).
Ex : 'percentage_AMOUNT' o/p = 'p'
as it finds a matching char as 'e' after p.
So i want some advice how to treat (in)(like) as words not as characters , when splitting occurs/matters.
please suggest a syntax.
Answering your question, the [(==)(>=)(<=)(in)(like)] is a character class matching single characters you defined inside the class. To match sequences of characters, you need to remove [ and ] and use alternation:
r'==?|>=?|<=?|\b(?:in|like)\b'
or better:
r'[=><]=?|\b(?:in|like)\b'
You code would look like:
import re
ss = ['count_EVENT_GENRE in [1,2,3,4,5]','coint_EVENT_GENRE = "ROMANCE"']
for s in ss:
field = re.split(r'[=><]=?|\b(?:in|like)\b', s)[0].strip()
print(field)
However, there might be other (easier, or safer - depending on the actual specifications) ways to get what you want (splitting with space and getting the first item, use re.match with r'\w+' or r'[a-z]+(?:_[A-Z]+)+', etc.)
If your value is at the start of the string and starts with lowercase ASCII letters, and then can have any amount of sequences of _ followed with uppercase ASCII letters, use:
re.match(r'[a-z]+(?:_[A-Z]+)*', s)
Full demo code:
import re
ss = ['count_EVENT_GENRE in [1,2,3,4,5]','coint_EVENT_GENRE = "ROMANCE"']
for s in ss:
fieldObj = re.match(r'[a-z]+(?:_[A-Z]+)*', s)
if fieldObj:
print(fieldObj.group())
If you want only the first word of your string, then this should do the job:
import re
s = 'count_EVENT_GENRE in [1,2,3,4,5]'
field = re.split(r'\W', s)[0]
# count_EVENT_GENRE
Is there anything wrong with using split?
>>> s = 'count_EVENT_GENRE in [1,2,3,4,5]'
>>> s.split(' ')[0]
'count_EVENT_GENRE'
>>> s = 'coint_EVENT_GENRE = "ROMANCE"'
>>> s.split(' ')[0]
'coint_EVENT_GENRE'
>>>

Extract substrings from logical expressions

Let's say I have a string that looks like this:
myStr = '(Txt_l1 (Txt_l2)) or (Txt2_l1 (Txt2_l2))'
What I would like to obtain in the end would be:
myStr_l1 = '(Txt_l1) or (Txt2_l1)'
and
myStr_l2 = '(Txt_l2) or (Txt2_l2)'
Some properties:
all "Txt_"-elements of the string start with an uppercase letter
the string can contain much more elements (so there could also be Txt3, Txt4,...)
the suffixes '_l1' and '_l2' look different in reality; they cannot be used for matching (I chose them for demonstration purposes)
I found a way to get the first part done by using:
myStr_l1 = re.sub('\(\w+\)','',myStr)
which gives me
'(Txt_l1 ) or (Txt2_l1 )'
However, I don't know how to obtain myStr_l2. My idea was to remove everything between two open parentheses. But when I do something like this:
re.sub('\(w+\(', '', myStr)
the entire string is returned.
re.sub('\(.*\(', '', myStr)
removes - of course - far too much and gives me
'Txt2_l2))'
Does anyone have an idea how to get myStr_l2?
When there is an "and" instead of an "or", the strings look slightly different:
myStr2 = '(Txt_l1 (Txt_l2) and Txt2_l1 (Txt2_l2))'
Then I can still use the command from above:
re.sub('\(\w+\)','',myStr2)
which gives:
'(Txt_l1 and Txt2_l1 )'
but I again fail to get myStr2_l2. How would I do this for these kind of strings?
And how would one then do this for mixed expressions with "and" and "or" e.g. like this:
myStr3 = '(Txt_l1 (Txt_l2) and Txt2_l1 (Txt2_l2)) or (Txt3_l1 (Txt3_l2) and Txt4_l1 (Txt2_l2))'
re.sub('\(\w+\)','',myStr3)
gives me
'(Txt_l1 and Txt2_l1 ) or (Txt3_l1 and Txt4_l1 )'
but again: How would I obtain myStr3_l2?
Regexp is not powerful enough for nested expressions (in your case: nested elements in parentheses). You will have to write a parser. Look at https://pyparsing.wikispaces.com/
I'm not entirely sure what you want but I wrote this to strip everything between the parenthesis.
import re
mystr = '(Txt_l1 (Txt_l2)) or (Txt2_l1 (Txt2_l2))'
sets = mystr.split(' or ')
noParens = []
for line in sets:
mat = re.match(r'\((.* )\((.*\)\))', line, re.M)
if mat:
noParens.append(mat.group(1))
noParens.append(mat.group(2).replace(')',''))
print(noParens)
This takes all the parenthesis away and puts your elements in a list. Here's an alternate way of doing it without using Regular Expressions.
mystr = '(Txt_l1 (Txt_l2)) or (Txt2_l1 (Txt2_l2))'
noParens = []
mystr = mystr.replace(' or ', ' ')
mystr = mystr.replace(')','')
mystr = mystr.replace('(','')
noParens = mystr.split()
print(noParens)

Categories