Regex End of Line and Specific Characters - python

So I'm writing a Python program that reads lines of serial data and compares them to a dictionary of line codes to figure out which specific lines are being transmitted. I am attempting to use a regular expression to filter out the extra garbage that each serial read string has on it, but I'm having a bit of an issue.
Every single code in my dictionary looks like this: T12F8B0A22**F8. The asterisks are the two alphanumeric characters that differentiate each code.
This is what I have so far as my regex: '/^T12F8B0A22[A-Z0-9]{2}F8$/'
I am getting a few errors with this, however. My first problem is that there are some characters at the end of the string that I still need to get rid of, which is odd because I thought $/ denoted the end of the line in regex. However, when I step through my code in the debugger, I notice that after running the following code:
#regexString contains the serial read line data
regexString = re.sub('/^T12F8B0A22[A-Z0-9]{2}F8$/', '', regexString)
My string looks something like this: 'T12F8B0A2200F8\\r'
I need to get rid of the \\r.
If for some reason I can't get rid of this with regex, how do you take only a specific range of a string's characters in Python? In this case I suppose it would be everything up to length - 3?

Your problem is threefold:
1) your string contains an extra \r (carriage return character) before the \n (newline character); this is common on Windows and in network communication protocols; it is probably best to remove any trailing whitespace from your string:
regexString = regexString.rstrip()
2) as mentioned by Wiktor Stribiżew, your regexp is unnecessarily surrounded with / characters - some languages, like Perl, define regexp as a string delimited by / characters, but Python is not one of them;
3) your instruction using re.sub is actually replacing the matching part of regexString with an empty string - I believe this is the exact opposite of what you want (you want to keep the match and remove everything else, right?); that's why fixing the regexp makes things "even worse".
To summarize, I think you should use this instead of your current code:
m = re.match('T12F8B0A22[A-Z0-9]{2}F8', regexString)
regexString = m.group(0)
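As a minimal sketch of the three points above, assuming each serial line arrives as a str with some trailing line-ending characters: the helper name extract_code is hypothetical, and the None check covers lines that do not start with a code (re.match returns None in that case).

```python
import re

# Fixed prefix, two uppercase alphanumeric characters, fixed suffix.
CODE_RE = re.compile(r'T12F8B0A22[A-Z0-9]{2}F8')

def extract_code(raw_line):
    """Return the line code at the start of raw_line, or None if absent."""
    # strip() removes surrounding whitespace such as '\r' and '\n'.
    m = CODE_RE.match(raw_line.strip())
    return m.group(0) if m else None

print(extract_code('T12F8B0A2200F8\r\n'))
print(extract_code('garbage'))
```

Because re.match only anchors at the start, anything left after the code is simply ignored rather than causing the match to fail.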

There are several ways to get rid of the "\r", but first a little analysis of your code:
1. The special character for the end of the line is just '$', not '$/', in Python.
2. re.sub substitutes the matched pattern with a string ('' in your case), which would replace the string you want to keep with an empty string and leave you with only the \\r.
Possible solutions:
1. Use a simple replace:
regexString = regexString.replace('\\r','')
2.1 If you want to stick to regex, the approach is the same:
pattern = '\\\\r'
match = re.sub(pattern, '',regexString)
2.2 If you want to access the different groups, use re.search:
match = re.search('(^T12F8B0A22[A-Z0-9]{2}F8)(.*)',regexString)
match.group(1) # will give you the T12...
match.group(2) # gives you the \\r
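The group-based approach can be sketched as follows. This assumes the debugger output means a literal backslash followed by r; the extra inner group for the two differentiating characters is an illustrative addition, not part of the answer above.

```python
import re

line = 'T12F8B0A2200F8\\r'  # assumption: literal backslash + 'r' at the end

# Group 1: the whole code, group 2: the two variable characters,
# group 3: whatever trails the code.
match = re.search(r'^(T12F8B0A22([A-Z0-9]{2})F8)(.*)$', line)
if match:
    code = match.group(1)
    variant = match.group(2)
    leftover = match.group(3)
    print(code, variant, repr(leftover))
```

The same pattern also works if the trailing character is a real carriage return, since '.' matches '\r'.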

Just match what you want to find. Couple of examples:
import re
data = '''lots of
otherT12F8B0A2212F8garbage
T12F8B0A2234F8around
T12F8B0A22ABF8the
stringsT12F8B0A22CDF8
'''
print(re.findall('T12F8B0A22..F8',data))
['T12F8B0A2212F8', 'T12F8B0A2234F8', 'T12F8B0A22ABF8', 'T12F8B0A22CDF8']
m = re.search('T12F8B0A22..F8',data)
if m:
    print(m.group(0))
T12F8B0A2212F8

Related

Python split by dot and question mark, and keep the character

I have a function:
with open(filename,'r') as text:
    data=text.readlines()
    split=str(data).split('([.|?])')
    for line in split:
        print(line)
This should print the sentences we get after splitting the text on two different marks. I also want to show the split symbol in the output, which is why I use (), but the split does not work properly.
It returns:
['Chapter 16. My new goal. \n','Chapter 17. My new goal 2. \n']
As you can see, the text hasn't been split on every dot.
Try escaping the marks, as both symbols have special meanings in regex. Also, I'm not quite sure whether the str.split method accepts a regex; maybe try split from Python's "re" module.
[\.|\?]
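A small sketch of the re.split suggestion, using a made-up sample string. When the pattern contains a capturing group, re.split keeps the matched delimiters in the result list.

```python
import re

text = "Chapter 16. My new goal. Is it done? Yes."

# The parentheses form a capturing group, so the '.' and '?' delimiters
# are returned as separate list items between the pieces of text.
parts = re.split(r'([.?])', text)
print(parts)
```

Note that the list ends with an empty string when the text ends with a delimiter.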
There are a few distinct problems, here.
1. read vs readlines
data = text.readlines()
This produces a list of str, good.
... str(data) ...
If you print this, you will see it contains
several characters you likely did not want: [, ', ,, ].
You'd be better off with just data = text.read().
2. split on str vs regex
str(data).split('([.|?])')
We are splitting on a string, ok.
Let's consult the fine documents.
Return a list of the words in the string, using sep as the delimiter string.
Notice there's no mention of a regular expression.
That argument does not appear as a sequence of seven characters in the source string.
You were looking for a similar function:
https://docs.python.org/3/library/re.html#re.split
3. char class vs alternation
We can certainly use | vertical bar for alternation,
e.g. r"(cat|dog)".
It works for shorter strings, too, such as r"(c|d)".
But for single characters, a character class is
more convenient: r"[cd]".
It is possible to match three characters,
one of them being vertical bar, with r"[c|d]"
or equivalently r"[cd|]".
A character class can even have just a single character,
so r"[c]" is identical to r"c".
4. escaping
Since r".*" matches the whole string,
there are certainly cases where escaping dot is important,
e.g. r"(cat|dog|\.)".
We can construct a character class with escaping:
r"[cd\.]".
Within [ ] square brackets that \ backwhack is optional.
Better to simply say r"[cd.]", which means the same thing.
pattern = re.compile(r"[.?]")
5. findall vs split
The two functions are fairly similar.
But findall() is about retrieving matching elements,
which your "preserve the final punctuation"
requirement asks for,
while split() pretty much assumes
that the separator is uninteresting.
So findall() seems a better match for your use case.
pattern = re.compile(r"[^.?]+[.?]")
Note that ^ caret usually means "anchor
to start of string", but within a character class
it is negation.
So e.g. r"[^0-9]" means "non-digit".
data = text.readlines()
split = str(data).split('([.|?])')
Putting it all together, try this:
data = text.read()
pattern = re.compile(r"[^.?]+[.?]")
sentences = pattern.findall(data)
If there's no trailing punctuation in the source string,
the final words won't appear in the result.
Consider tacking on a "." period in that case.
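One possible variation on the findall() approach, sketched under the assumption that a trailing fragment without punctuation should be kept rather than dropped. The function name split_sentences and the sample text are illustrative only.

```python
import re

def split_sentences(data):
    """Split data into sentences, keeping each terminating '.' or '?'."""
    # [^.?]+ is a run of non-terminators; (?:[.?]|$) accepts either a
    # terminator or the end of the string, so a trailing fragment survives.
    pattern = re.compile(r"[^.?]+(?:[.?]|$)")
    pieces = (s.strip() for s in pattern.findall(data))
    # Drop pieces that were only whitespace (e.g. a final newline).
    return [s for s in pieces if s]

sample = "Chapter 16. My new goal. Is it done? Not yet"
for sentence in split_sentences(sample):
    print(sentence)
```

This avoids having to append a period to the input, at the cost of a slightly longer pattern.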

Exact search of a string that has parenthesis using regex

I am new to regexes.
I have the following string : \n(941)\n364\nShackle\n(941)\nRivet\n105\nTop
Out of this string, I want to extract Rivet and I already have (941) as a string in a variable.
My thought process was like this:
Find all the (941)s
filter the results by checking if the string after (941) is followed by \n, followed by a word, and ending with \n
I made a regex for the 2nd part: \n[\w\s\'\d\-\/\.]+$\n.
The problem I am facing is that because of the parenthesis in (941) the regex is taking 941 as a group. In the 3rd step the regex may be wrong, which I can fix later, but 1st I needed help in finding the 2nd (941) so then I can apply the 3rd step on that.
PS.
I know I can use python string methods like find and then loop over the searches, but I wanted to see if this can be done directly using regex only.
I have tried the following regexes: (?:...), (941){1}, and escaping the parentheses with \ like this: \(941\), with no useful results. Maybe I am using them wrong.
Just wanted to know if this is possible using regex. I thought it might be useful for others too, or a good share for future viewers.
Thanks!
Assuming:
You want to avoid matching only digits;
Want to match a substring made of word-characters (thus including possible digits);
Try to escape the variable and use it in the regular expression through f-string:
import re
s = '\n(941)\n364\nShackle\n(941)\nRivet\n105\nTop'
var1 = '(941)'
var2 = re.escape(var1)
m = re.findall(fr'{var2}\n(?!\d+\n)(\w+)', s)[0]
print(m)
Prints:
Rivet
If you have text in a variable that should be matched exactly, use re.escape() to escape it when substituting into the regexp.
s = '\n(941)\n364\nShackle\n(941)\nRivet\n105\nTop'
num = '(941)'
re.findall(rf'(?<=\n{re.escape(num)}\n)[\w\s\'\d\-\/\.]+(?=\n)', s)
This puts (941)\n in a lookbehind, so it's not included in the match. This avoids a problem with the \n at the end of one match overlapping with the \n at the beginning of the next.
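To make the effect of re.escape() visible, here is a short sketch that prints the escaped text and then uses it with re.search. Requiring the next line to start with a letter is an assumption made for this example, chosen so the numeric line after the first (941) is skipped.

```python
import re

s = '\n(941)\n364\nShackle\n(941)\nRivet\n105\nTop'
num = '(941)'

# re.escape backslash-escapes the parentheses so they match literally
# instead of opening a capturing group.
escaped = re.escape(num)
print(escaped)

# Assumption for this sketch: the wanted line starts with a letter.
m = re.search(rf'{escaped}\n([A-Za-z][^\n]*)', s)
result = m.group(1) if m else None
print(result)
```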

Regex search fail when input has line breaks [duplicate]

This question already has an answer here:
Why is Python Regex Wildcard only matching newLine
(1 answer)
Closed 1 year ago.
The following regular expression is not returning any match:
import re
regex = '.*match.*fail.*'
pattern = re.compile(regex)
text = '\ntestmatch\ntestfail'
match = pattern.search(text)
I managed to solve the problem by changing text to repr(text) or setting text as a raw string with r'\ntestmatch\ntestfail', but I'm not sure if these are the best approaches. What is the best way to solve this problem?
Using repr or raw string on a target string is a bad idea!
By doing that newline characters are treated as literal '\n'.
This is likely to cause unexpected behavior on other test cases.
The real problem is that . matches any character EXCEPT newline.
If you want to match everything, replace . with [\s\S].
This means "whitespace or not whitespace" = "anything".
Using other character groups like [\w\W] also works,
and it is more efficient for adding exception just for newline.
One more thing: it is good practice to use a raw string for the pattern (not for the match target).
This eliminates the need to escape every character that has special meaning in normal Python strings.
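The behaviour described above can be checked with a short sketch. It also shows the re.DOTALL flag, which is not mentioned in the answer but is the standard library's way of making . match newlines as well.

```python
import re

text = '\ntestmatch\ntestfail'

# By default '.' does not match '\n', so the original pattern finds nothing.
no_match = re.search(r'.*match.*fail.*', text)

# [\s\S] matches any character, including newlines.
m1 = re.search(r'match[\s\S]*fail', text)

# With re.DOTALL, '.' matches newlines too, so the pattern can stay as is.
m2 = re.search(r'.*match.*fail.*', text, flags=re.DOTALL)

print(no_match)
print(repr(m1.group(0)))
print(repr(m2.group(0)))
```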
You could add it as an or, but make sure you escape the \ in the regex string, so the regex engine actually gets the \n and not an actual newline.
Something like this:
regex = '.*match(.|\\n)*fail.*'
This would match anything from the last \n to match, then any mix or number of \n until testfail. You can change this how you want, but the idea is the same. Put what you want into a grouping, and then use | as an or.
On the left is what this regex pattern matched from your example.

How to import Regex from external file with its original format and without extra escape characters

Hello everyone,
I would like to request your support with the following question.
I have recently been working on a Python script that looks for matches for about 15 sentences, using regular expressions, in thousands of files.
The sentences we look for may change over the days/weeks, and the script will be given to users who know regular expressions but have no programming skills.
So, to make this script more scalable, I would like to save the regexes in a separate file where those users can modify the sentences without having to modify the Python script.
Example
This file would be modified continuously to match different sentences.
--- regexs.log ---
Th\w*\s+sen\w*
\d{0,3}
--- matches.py ---
import re
with open("regexs.log", "r") as regexs:
    regex = regexs.readlines()
text = "This sentence"
for reg in regex:
    match = re.search(reg, text)
However, this is not working... when the regexes are read in, Python appears to add extra escape characters to each one. For instance, the two regexes above are imported as:
"Th\\w*\\s+sen\\w*"
"\\d{0,3}"
The backslash is duplicated, so the regexes are no longer useful, since they no longer match the sentences.
Just wondering if there is any way to import those regular expressions in their original state?
The same thing happens if I store the regular expressions in a list:
>>> reg = ["\w+\n"]
>>> reg
['\\w+\n']
Regards.
regex = regex.readlines()
regex = regex.replace("\\", "\") # <= Add this
What this does is say "everywhere there is a \\, replace it with a \". But if you are doing other things with the file contents before they are finalized, you'll want to move the replace to a more appropriate spot.
I tried replace as below:
regex = regex.replace("\\", "\")
but it returns:
SyntaxError: EOL while scanning string literal
It seems Python is interpreting the second argument to replace as an escaped double quote (\") rather than as a backslash, so the string literal is never closed.
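For what it's worth, the doubled backslashes in the output above are only how Python's repr displays a single backslash; the string itself holds one. A likelier culprit is the trailing newline that readlines() keeps on every line, which becomes part of the pattern. A minimal sketch, using io.StringIO as a stand-in for the real file so the example is self-contained:

```python
import io
import re

# Stand-in for open("regexs.log"): two patterns, one per line.
regex_file = io.StringIO("Th\\w*\\s+sen\\w*\n\\d{0,3}\n")

# Strip the trailing newline from each line and skip blank lines.
patterns = [line.rstrip("\n") for line in regex_file if line.strip()]

text = "This sentence"
for pat in patterns:
    m = re.search(pat, text)
    # print(pat) shows single backslashes; repr(pat) would show them doubled.
    print(pat, '->', repr(m.group(0)) if m else None)
```

No replace() call is needed, because the file contents never contained doubled backslashes in the first place.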

Preserve key:value values in text while regex replacing non-word characters in keys (Notepad++)

Trying without luck in Notepad++ to replace any non-word characters \W with underscore _ from a block of multi-line text, with exception to (and right of) a colon : (which doesn't occur on every line- something of space-delineated hierarchy, terminating in a key-value pair). A python solution could be of use as well, as I'm trying to do other things with it once reformatted. Example:
This 100% isn't what I want
Yet, it's-what-I've got currently: D#rnit :(
This_100_is_what_I_d_like: See?
Indentation_isn_t_necessary
_to_maintain_but_would_be_nice: :)<-preserved!
I_m_Mr_Conformist_over_here: |Whereas, I'm like whatever's clever.|
If_you_can_help: Thanks 100.1%!
I admit that I'm answering an off-topic question; I just liked the problem. Press Ctrl+H, enable Regular Expressions in N++, then search for:
(:[^\r\n]*|^\s+)|\W(?<![\r\n])
And replace with:
(?1\1:_)
The regex has two main parts. The first side of the outer alternation matches either the leading spaces of a line (indentation) or everything after the first occurrence of a colon; the second side matches a non-word character other than a carriage return \r or newline \n (excluded by the negative lookbehind) to preserve line breaks. The replacement string is a conditional that says: if the first capturing group matched, replace it with itself; otherwise, replace the match with a _.
Seeing a better description of what you're trying to do, I don't think you'll be able to do it from inside Notepad++ using a single regular expression. However, you could write a Python script that goes through your document, one line at a time, and sanitizes anything to the left of a colon (if one exists).
Here's a quick and dirty example (untested). This assumes doc is an open file pointer to the file you want to sanitize
import re
sanitized_lines = []
for line in doc:
    # Groups: indentation, text left of the colon, the rest of the line
    line_match = re.match(r"^(\s*)([^:\n]*)(.*)", line)
    indentation = line_match.group(1)
    left_of_colon = line_match.group(2)
    remainder = line_match.group(3)
    left_of_colon = re.sub(r"\W", "_", left_of_colon)
    sanitized_lines.append("".join((indentation, left_of_colon, remainder)))
# (.*) stops before the newline, so rejoin the lines with "\n"
sanitized_doc = "\n".join(sanitized_lines)
print(sanitized_doc)
You may try this Python script:
You may try this python script,
ss="""This 100% isn't what I want
Yet, it's-what-I've got currently: D#rnit :(
If you can help: Thanks 100.1%!"""
import re
splitcapture=re.compile(r'(?m)^([^:\n]+)(:[^\n]*|)$')
subregx=re.compile(r'\W+')
print(splitcapture.sub(lambda m: subregx.sub('_', m.group(1))+m.group(2), ss))
Here I first match each line and capture two parts separately: the part that does not contain a ':' character is captured in group 1, and the optional part that starts with ':' and runs to the end of the line is captured in group 2. The replacement is then applied only to the group 1 string, and finally the two parts are joined back together (replaced group 1 + group 2).
And the output is
This_100_isn_t_what_I_want_
_Yet_it_s_what_I_ve_got_currently: D#rnit :(
If_you_can_help: Thanks 100.1%!
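An alternative sketch that avoids a per-line regex match by splitting on the first colon with str.partition. The function name sanitize_keys and the sample text are made up for illustration. It keeps indentation as-is, which is an assumption based on the "would be nice" remark in the question.

```python
import re

def sanitize_keys(text):
    """Replace runs of non-word characters left of the first colon with '_'."""
    out = []
    for line in text.splitlines():
        # partition splits at the first ':' only; colon and value are
        # empty strings when the line has no colon.
        key, colon, value = line.partition(':')
        indent = key[:len(key) - len(key.lstrip())]
        cleaned = re.sub(r'\W+', '_', key.strip())
        out.append(indent + cleaned + colon + value)
    return '\n'.join(out)

sample = "This 100% isn't what I want\n  If you can help: Thanks 100.1%!"
print(sanitize_keys(sample))
```

Everything from the first colon onward is passed through untouched, so later colons and punctuation in the value are preserved.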
