Extract a string and replace it with something between $$ - python

I have a long string created after parsing a file, and any time I encounter $$ 1903810948 $$, I need to strip the numbers from between the $$, save them separately, and remove the $$ from the string. I am trying to use regex but cannot seem to figure out a way to do it in Python.
Edit: The string is parsed from a PDF file and contains special sections that start and end with either a $ or a $$. I need to remove those sections from the text and store whatever I removed in a separate file. That is why I do not think split is the right way to go about it.

You can use the split() method, which splits a string into a list.
Syntax
string.split(separator, maxsplit)
Parameter Values
separator : Optional. Specifies the separator to use when splitting the string. By default any whitespace is a separator.
maxsplit : Optional. Specifies how many splits to do. Default value is -1, which is "all occurrences".
Here's a solution:
text = '$$ 1903810948 $$'
print(text.split("$$")[1].split()[0])
Output:
1903810948
The inner split() is called without parameters, so it splits on whitespace and drops the spaces around the number.

You can just use replace(), no need for regex.
string = '$$ 1903810948 $$'
print(string.replace('$',''))
This leaves a little whitespace at the start and end of the string. strip() fixes that:
print(string.replace('$','').strip())
output
1903810948

Regex may not be the best option here as replace would work quite well.
However if you wish to use regex you can use
import re
string = '$$ 1903810948 $$'
pattern = r'\$\$ (\d+) \$\$'
re.findall(pattern, string)
#['1903810948']
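The question asks for both halves: save the numbers and drop them from the text. A minimal sketch that combines re.findall and re.sub, assuming the markers always look like $$ digits $$; the sample text and variable names are made up for illustration:

```python
import re

# Hypothetical sample text; the real input would come from the parsed PDF.
text = "Invoice $$ 1903810948 $$ was paid, see also $$ 42 $$ for details."

# \$ matches a literal dollar sign; \s* tolerates optional spaces.
pattern = re.compile(r"\$\$\s*(\d+)\s*\$\$")

numbers = pattern.findall(text)   # the captured digits, in order of appearance
cleaned = pattern.sub("", text)   # the text with each $$ ... $$ span removed

print(numbers)
print(cleaned)
```

The numbers list can then be written to the separate file, for example with "\n".join(numbers).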

Related

force re.search to include # and $

I am trying to get a substring between two markers using re in Python, for example:
import re
test_str = "#$ -N model_simulation 2022"
# these two lines work
# the output is: model_simulation
print(re.search("-N(.*)2022",test_str).group(1))
print(re.search(" -N(.*)2022",test_str).group(1))
# these two lines give the error: 'NoneType' object has no attribute 'group'
print(re.search("$ -N(.*)2022",test_str).group(1))
print(re.search("#$ -N(.*)2022",test_str).group(1))
I read the documentation of re here. It says that "#" is intentionally ignored so that the outputs look neater.
But in my case, I do need to include "#" and "$". I need them to identify the part of the string that I want, because the "-N" is not unique in my entire text string for real work.
Is there a way to force re to include those? Or is there a different way without using re?
Thanks.
You can escape both with \, for example,
print(re.search(r"\#\$ -N(.*)2022",test_str).group(1))
# output model_simulation
You can get rid of the special meaning by using the backslash prefix: \$. This way, you can match the dollar symbol in a given string
# add backslash before # and $
# the output is: model_simulation
print(re.search(r"\$ -N(.*)2022",test_str).group(1))
print(re.search(r"\#\$ -N(.*)2022",test_str).group(1))
In regular expressions, $ signals the end of the string. So 'foo' would match foo anywhere in the string, but 'foo$' only matches foo if it appears at the end. To solve this, you need to escape it by prefixing it with a backslash. That way it will match a literal $ character
# is only the start of a comment in verbose mode using re.VERBOSE (which also ignores spaces), otherwise it just matches a literal #.
In general, it is also good practice to use raw string literals for regular expressions (r'foo'), which means Python will let backslashes alone so it doesn't conflict with regular expressions (that way you don't have to type \\\\ to match a single backslash \).
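To make the points above concrete, here is a small sketch comparing the unescaped pattern, the escaped pattern, and the same pattern in verbose mode. The variable names are only illustrative:

```python
import re

test_str = "#$ -N model_simulation 2022"

# Unescaped: $ is an end-of-string anchor, so nothing after it can match.
unescaped = re.search(r"#$ -N(.*)2022", test_str)

# Escaped: \$ is a literal dollar sign.
escaped = re.search(r"#\$ -N(.*)2022", test_str)

# In verbose mode, whitespace is ignored and # starts a comment,
# so the #, the $ and the space all need a backslash to match literally.
verbose = re.search(r"\#\$\ -N(.*)2022", test_str, re.VERBOSE)

print(unescaped)                   # no match object
print(escaped.group(1).strip())
print(verbose.group(1).strip())
```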
Instead of re.search, it looks like you actually want re.fullmatch, which matches only if the whole string matches.
So I would write your code like this:
print(re.search(r"\$ -N(.*)2022", test_str).group(1)) # This one would not work with fullmatch, because it doesn't match at the start
print(re.fullmatch(r"#\$ -N(.*)2022", test_str).group(1))
In a comment you mentioned that the string you need to match changes all the time. In that case, re.escape may prove useful.
Example:
prefix = '#$ -N'
postfix = '2022'
print(re.fullmatch(re.escape(prefix) + '(.*)' + re.escape(postfix), test_str).group(1))

How to replace '..' and '?.' with single periods and question marks in pandas? df['column'].str.replace not working

This is a follow up to this SO post which gives a solution to replace text in a string column
How to replace text in a column of a Pandas dataframe?
df['range'] = df['range'].str.replace(',','-')
However, this doesn't seem to work with double periods or a question mark followed by a period
testList = ['this is a.. test stence', 'for which is ?. was a time']
testDf = pd.DataFrame(testList, columns=['strings'])
testDf['strings'].str.replace('..', '.').head()
results in
0 ...........e
1 .............
Name: strings, dtype: object
and
testDf['strings'].str.replace('?.', '?').head()
results in
error: nothing to repeat at position 0
Add the regex=False parameter, because, as you can see in the docs, regex is True by default:
regex : bool, default True
Determines if the passed-in pattern is a regular expression. If True, assumes the passed-in pattern is a regular expression.
And ? and . are special characters in regular expressions.
So, one way to do it without regex will be this double replacing:
testDf['strings'].str.replace('..', '.',regex=False).str.replace('?.', '?',regex=False)
Output:
strings
0 this is a. test stence
1 for which is ? was a time
Replace using a regular expression. In this case, remove any '.' that is followed immediately by white space. This is a bit convoluted; I advise you to go with @Mark Reed's answer.
testDf.replace(regex=r'([.](?=\s))', value=r'')
strings
0 this is a. test stence
1 for which is ? was a time
str.replace() works with a Regex where . is a special character which denotes "any" character. If you want a literal dot, you need to escape it: "\.". Same for other special Regex characters like ?.
First, be aware that the Pandas replace method is different from the standard Python one, which operates only on fixed strings. The Pandas one can behave as either the regular string.replace or re.sub (the regular-expression substitute method), depending on the value of a flag, and the default is to act like re.sub. So you need to treat your first argument as a regular expression. That means you do have to change the string, but it also has the benefit of allowing you to do both substitutions in a single call.
A regular expression isn't a string to be searched for literally, but a pattern that acts as instructions telling Python what to look for. Most characters just ask Python to match themselves, but some are special, and both . and ? happen to be in the special category.
The easiest thing to do is to use a character class to match either . or ? followed by a period, and remember which one it was so that it can be included in the replacement, just without the following period. That looks like this:
testDf.replace(regex=r'([.?])\.', value=r'\1')
The [.?] means "match either a period or a question mark"; since they're inside the [...], those normally-special characters don't need to be escaped. The parentheses around the square brackets tell Python to remember which of those two characters is the one it actually found. The next thing that has to be there in order to match is the period you're trying to get rid of, which has to be escaped with a backslash because this one's not inside [...].
In the replacement, the special sequence \1 means "whatever you found that matched the pattern between the first set of parentheses", so that's either the period or question mark. Since that's the entire replacement, the following period is removed.
Now, you'll notice I used raw strings (r'...') for both; that keeps Python from doing its own interpretation of the backslashes before replace can. If the replacement were just '\1' without the r it would replace them with character code 1 (control-A) instead of the first matched group.
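The raw-string point can be checked without pandas. This sketch applies the same pattern with re.sub from the standard library (a stand-in for the DataFrame call, not the pandas API itself) and compares a raw replacement with a non-raw one:

```python
import re

samples = ['this is a.. test stence', 'for which is ?. was a time']

# Raw string: \1 reaches the regex engine as a backreference to group 1.
fixed = [re.sub(r'([.?])\.', r'\1', s) for s in samples]

# Non-raw string: Python turns '\1' into the single character chr(1) first.
broken = [re.sub(r'([.?])\.', '\1', s) for s in samples]

print(fixed)
print([ascii(s) for s in broken])
```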
To handle both ?. and .. at the same time, you can separate the alternatives with | (the regex OR operator) and capture whichever character matched, so it can be kept in the replacement:
testDf['strings'].str.replace(r'(\?|\.)\.', r'\1', regex=True)
Prefix each . with a \, because . is a special regex character and needs to be escaped:
testDf['strings'].str.replace(r'\.\.', '.', regex=True)
You can do the same with the ?, which is another special regex character:
testDf['strings'].str.replace(r'\?\.', '?', regex=True)

Regex End of Line and Specific Characters

So I'm writing a Python program that reads lines of serial data and compares them to a dictionary of line codes to figure out which specific lines are being transmitted. I am attempting to use a regular expression to filter out the extra garbage that the serial read string has on it, but I'm having a bit of an issue.
Every single code in my dictionary looks like this: T12F8B0A22**F8. The asterisks are the two alpha numeric pieces that differentiate each string code.
This is what I have so far as my regex: '/^T12F8B0A22[A-Z0-9]{2}F8$/'
I am getting a few errors with this, however. My first problem is that there are some characters at the end of the string I still need to get rid of, which is odd because I thought $/ denoted the end of the line in regex. However, when I run my code through the debugger, I notice that after running the following code:
#regexString contains the serial read line data
regexString = re.sub('/^T12F8B0A22[A-Z0-9]{2}F8$/', '', regexString)
My string looks something like this: 'T12F8B0A2200F8\\r'
I need to get rid of the \\r.
If for some reason I can't get rid of this with regex, how in python do you send specific string character through an argument? In this case I suppose it would be length - 3?
Your problem is threefold:
1) your string contains extra \r (Carriage Return character) before \n (New Line character); this is common in Windows and in network communication protocols; it is probably best to remove any trailing whitespace from your string:
regexString = regexString.rstrip()
2) as mentioned by Wiktor Stribiżew, your regexp is unnecessarily surrounded with / characters - some languages, like Perl, define regexp as a string delimited by / characters, but Python is not one of them;
3) your instruction using re.sub is actually replacing the matching part of regexString with an empty string - I believe this is the exact opposite of what you want (you want to keep the match and remove everything else, right?); that's why fixing the regexp makes things "even worse".
To summarize, I think you should use this instead of your current code:
m = re.match('T12F8B0A22[A-Z0-9]{2}F8', regexString)
regexString = m.group(0)
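Note that m is None when a line does not contain a code, so m.group(0) would raise an AttributeError. A slightly more defensive sketch follows; extract_code and CODE_RE are hypothetical names, and re.search is swapped in for re.match so that leading garbage is tolerated too:

```python
import re

# Same pattern as above, compiled once for reuse.
CODE_RE = re.compile(r'T12F8B0A22[A-Z0-9]{2}F8')

def extract_code(raw_line):
    """Return the line code found in raw_line, or None if there is none."""
    m = CODE_RE.search(raw_line.strip())
    return m.group(0) if m else None

print(extract_code('T12F8B0A2200F8\r\n'))
print(extract_code('noise'))
```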
There are several ways to get rid of the "\r", but first a little analysis of your code:
1. The special character for the end of the line is just '$', not '$/', in Python.
2. re.sub substitutes the matched pattern with a string ('' in your case), which would replace the string you want to keep with an empty string, and you are left with the \\r.
Possible solutions:
1. Use a simple replace:
regexString = regexString.replace('\\r','')
2. If you want to stick to regex, the approach is the same:
pattern = '\\\\r'
match = re.sub(pattern, '',regexString)
3. If you want to access the different groups, use re.search:
match = re.search('(^T12F8B0A22[A-Z0-9]{2}F8)(.*)',regexString)
match.group(1) # will give you the T12...
match.group(2) # gives you the \\r
Just match what you want to find. Couple of examples:
import re
data = '''lots of
otherT12F8B0A2212F8garbage
T12F8B0A2234F8around
T12F8B0A22ABF8the
stringsT12F8B0A22CDF8
'''
print(re.findall('T12F8B0A22..F8',data))
['T12F8B0A2212F8', 'T12F8B0A2234F8', 'T12F8B0A22ABF8', 'T12F8B0A22CDF8']
m = re.search('T12F8B0A22..F8',data)
if m:
    print(m.group(0))
T12F8B0A2212F8

Replace string between tags if string begins with "1"

I have a huge XML file (about 100MB) and each line contains something along the lines of <tag>10005991</tag>. So for example:
textextextextext<tag>10005991<tag>textextextextext
textextextextext<tag>20005992</tag>textextextextext
textextextextext<tag>10005993</tag>textextextextext
textextextextext<tag>20005994</tag>textextextextext
I want to replace any string that sits between the tags and begins with "1" with a string of my choice, and then write the result back to the file. I've tried using the line.replace function, which works, but only if I specify the exact string.
line=line.replace("<tag>10005991</tag>","<tag>YYYYYY</tag>")
Ideal output:
textextextextext<tag>YYYYYY<tag>textextextextext
textextextextext<tag>20005992</tag>textextextextext
textextextextext<tag>YYYYYY</tag>textextextextext
textextextextext<tag>20005994</tag>textextextextext
I've thought about using an array to pass each string in and then replace but I'm sure there's a much simpler solution.
You can use the re module
>>> text = 'textextextextext<tag>10005991</tag>textextextextext'
>>> re.sub(r'<tag>1(\d+)</tag>','<tag>YYYYY</tag>',text)
'textextextextext<tag>YYYYY</tag>textextextextext'
re.sub will replace the matched text with the second argument.
Quote from the doc
Return the string obtained by replacing the leftmost non-overlapping occurrences of pattern in string by the replacement repl. If the pattern isn’t found, string is returned unchanged.
Usage may be like:
with open("file") as f, open("output", "w") as f2:
    for line in f:
        f2.write(re.sub(r'<tag>1(\d+)</tag>','<tag>YYYYY</tag>',line))
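You can use regex, with positive look-arounds in the pattern to match the string between the tags. Flags such as re.DOTALL must be passed as flags=..., because the fourth positional argument of re.sub is count; this pattern contains no '.', '^' or '$', so neither re.DOTALL nor re.MULTILINE is actually needed, even for a multi-line string:
>>> print(re.sub(r'(?<=<tag>)1\d+(?=</?tag>)', 'YYYYYY', s))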
textextextextext<tag>YYYYYY<tag>textextextextext
textextextextext<tag>20005992</tag>textextextextext
textextextextext<tag>YYYYYY</tag>textextextextext
textextextextext<tag>20005994</tag>textextextextext
re.DOTALL
Make the '.' special character match any character at all, including a newline; without this flag, '.' will match anything except a newline.
Also, as @Bhargav Rao did in his answer, you can use grouping instead of look-arounds; capturing the closing tag lets the replacement reuse it as \1:
>>> print(re.sub(r'<tag>1\d+(</?tag>)', r'<tag>YYYYYY\1', s))
textextextextext<tag>YYYYYY<tag>textextextextext
textextextextext<tag>20005992</tag>textextextextext
textextextextext<tag>YYYYYY</tag>textextextextext
textextextextext<tag>20005994</tag>textextextextext
I think your best bet is to use ElementTree
The main idea:
1) Parse the file
2) Find the elements value
3) Test your condition
4) Replace value if condition met
Here is a good place to start parsing : How do I parse XML in Python?
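The four steps above can be sketched with xml.etree.ElementTree from the standard library. This assumes the file is well-formed XML; the <root> and <row> element names and the sample data are made up for illustration:

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed sample; the real data would be read from the file.
xml_text = """<root>
<row>text<tag>10005991</tag>text</row>
<row>text<tag>20005992</tag>text</row>
<row>text<tag>10005993</tag>text</row>
</root>"""

root = ET.fromstring(xml_text)                    # 1) parse
for elem in root.iter('tag'):                     # 2) find the elements
    if elem.text and elem.text.startswith('1'):   # 3) test the condition
        elem.text = 'YYYYYY'                      # 4) replace the value

values = [elem.text for elem in root.iter('tag')]
print(values)
```

For a file on disk, ET.parse(path) returns a tree whose write() method saves the result; for a 100MB file, ET.iterparse may be worth a look to limit memory use.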

Regular Expression Not matching the value

I have a file saving IP addresses to names in format
<<%#$192.168.8.40$#% %##Name_of_person##% >>
I read this file and now want to extract the list using Python's regular expressions
list=re.findall("<<%#$(\S+)$#%\s%##(\w+\s*\w*)##%\s>>",ace)
print list
But the list is always empty.
Can anyone tell me where the mistake in the regular expression is?
Edit: ace is the variable holding the contents read from the file.
$ is a special character in regular expressions, meaning "end of line" (or "end of string", depending on the flavour). Your regex has other characters following the $, and as such only matches strings that have those characters after the end, which is impossible.
You will need to escape the $, like so: \$
I would suggest the following regular expression (formatted as a raw string since you are using Python):
r"<<%#\$([^$]+)\$#%\s%##([^#]+)##%\s>>"
That is, <<%#$, then one or more non-$ characters, $#%, a whitespace character, %##, one or more non-# characters, ##%, whitespace, >>.
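As a quick check of the suggested expression, here is a sketch that runs it over two entries. The second entry and the ip_to_name dictionary are assumptions added for illustration:

```python
import re

# 'ace' stands in for the file contents; the second line is a made-up entry.
ace = (
    "<<%#$192.168.8.40$#% %##Name_of_person##% >>\n"
    "<<%#$10.0.0.7$#% %##Another name##% >>\n"
)

pattern = re.compile(r"<<%#\$([^$]+)\$#%\s%##([^#]+)##%\s>>")

entries = pattern.findall(ace)   # list of (ip, name) tuples
ip_to_name = dict(entries)       # optional: look names up by IP

print(entries)
```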
Something like:
text = '<<%#$192.168.8.40$#% %##Name_of_person##% >>'
ip, name = [el[1] for el in re.findall(r'%#(.)(.+?)\1#%', text)]
If you can get away with just splitting on '#' and '$', then...
from operator import itemgetter
ip, name = itemgetter(2, 6)(re.split(r'[#\$]', text))
You could also just use built-in string functions:
tmp = text.split('$')
ip, name = tmp[1], tmp[2].split('#')[3]
Your regex pattern cannot match because the $ characters are not escaped.
You may use
r"<<%#\$(\S+)\$#%\s%##(\w+\s*\w*)##%\s>>"
in place of
"<<%#$(\S+)$#%\s%##(\w+\s*\w*)##%\s>>"
in the findall call.
Good luck!
