Regex split a string and strip recurring character - python

Using Python, I'm parsing several strings. Sometimes a string has several semicolons appended to it.
Example strings:
s1="1;Some text"
s2="2;Some more text;;;;"
The number of appended semicolons varies, but when they are present there are never fewer than two.
The following pattern matches s1, but with s2 the match includes the appended semicolons.
How do I change it to remove those?
pat=re.compile(r'(?m)^(\d+);(.*)')

You can use str.rstrip([chars]).
This method returns a copy of the string in which all chars have been stripped from the end of the string (by default, whitespace characters).
e.g. you can do:
s2 = s2.rstrip(";")
See the Python documentation for str.rstrip for more information.

pat = re.compile(r'\d+;[^;]*')
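As a sketch of how the suggestions above might look in practice (the helper name parse_line is made up for illustration), you can either let the pattern absorb the trailing semicolons with a lazy group followed by ;*$, or strip them first and keep the original pattern:

```python
import re

# Lazy group (.*?) stops as early as possible; ;*$ then absorbs any trailing semicolons.
pat = re.compile(r'(?m)^(\d+);(.*?);*$')

def parse_line(s):
    """Hypothetical helper: return (number, text) without trailing semicolons."""
    m = pat.match(s)
    return m.groups() if m else None

print(parse_line("1;Some text"))            # ('1', 'Some text')
print(parse_line("2;Some more text;;;;"))   # ('2', 'Some more text')

# Alternative: strip first, then use the original pattern unchanged.
original_pat = re.compile(r'(?m)^(\d+);(.*)')
print(original_pat.match("2;Some more text;;;;".rstrip(";")).groups())
# ('2', 'Some more text')
```

Both variants assume the text itself never needs to end in a semicolon.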

Related

Python split by dot and question mark, and keep the character

I have a function:
with open(filename,'r') as text:
    data=text.readlines()
split=str(data).split('([.|?])')
for line in split:
    print(line)
This should print the sentences we get after splitting a text at two different marks. I also want to show the split symbol in the output, which is why I use (), but the split does not work properly.
It returns:
['Chapter 16. My new goal. \n','Chapter 17. My new goal 2. \n']
As you can see, the text was not split at every dot.
Try escaping the marks, as both symbols have special meanings in regex. Also, str.split does not accept a regex, so use split from Python's "re" module instead.
[\.|\?]
There are a few distinct problems here.
1. read vs readlines
data = text.readlines()
This produces a list of str, good.
... str(data) ...
If you print this, you will see it contains
several characters you likely did not want: [, ', ,, ].
You'd be better off with just data = text.read().
2. split on str vs regex
str(data).split('([.|?])')
We are splitting on a string, ok.
Let's consult the documentation.
Return a list of the words in the string, using sep as the delimiter string.
Notice there's no mention of a regular expression.
The argument is taken literally, and that sequence of seven characters, ([.|?]), does not appear in the source string, so nothing is split.
You were looking for a similar function:
https://docs.python.org/3/library/re.html#re.split
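To make the difference concrete, here is a small sketch with a made-up sample text. str.split looks for the literal argument and finds nothing, while re.split treats it as a pattern, and a capturing group keeps the delimiters in the result:

```python
import re

text = "Chapter 16. My new goal. Is it done? Yes."

# str.split treats its argument as a literal string, so nothing is split here.
print(text.split('([.|?])'))

# re.split treats it as a pattern; the capturing group keeps the delimiters.
parts = re.split(r'([.?])', text)
print(parts)
# ['Chapter 16', '.', ' My new goal', '.', ' Is it done', '?', ' Yes', '.', '']
```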
3. char class vs alternation
We can certainly use | vertical bar for alternation,
e.g. r"(cat|dog)".
It works for shorter strings, too, such as r"(c|d)".
But for single characters, a character class is
more convenient: r"[cd]".
It is possible to match any one of three characters,
one of them being vertical bar, with r"[c|d]"
or equivalently r"[cd|]".
A character class can even have just a single character,
so r"[c]" is identical to r"c".
4. escaping
Since an unescaped dot matches any character, and r".*" matches a whole line,
there are certainly cases where escaping dot is important,
e.g. r"(cat|dog|\.)".
We can construct a character class with escaping:
r"[cd\.]".
Within [ ] square brackets that \ backwhack is optional.
Better to simply say r"[cd.]", which means the same thing.
pattern = re.compile(r"[.?]")
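A quick way to convince yourself of points 3 and 4 is to try the classes on a short string (the sample string here is invented for the purpose):

```python
import re

sample = "c|d."

# Inside a class the vertical bar is matched literally.
print(re.findall(r"[c|d]", sample))   # ['c', '|', 'd']
print(re.findall(r"[cd]", sample))    # ['c', 'd']

# Inside a class, an escaped and an unescaped dot both mean a literal dot.
escaped = re.findall(r"[cd\.]", sample)
unescaped = re.findall(r"[cd.]", sample)
print(escaped == unescaped)           # True
```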
5. findall vs split
The two functions are fairly similar.
But findall() is about retrieving matching elements,
which your "preserve the final punctuation"
requirement asks for,
while split() pretty much assumes
that the separator is uninteresting.
So findall() seems a better match for your use case.
pattern = re.compile(r"[^.?]+[.?]")
Note that ^ caret usually means "anchor
to start of string", but within a character class
it is negation.
So e.g. r"[^0-9]" means "non-digit".
Your original code was:
data = text.readlines()
split = str(data).split('([.|?])')
Putting it all together, try this instead:
data = text.read()
pattern = re.compile(r"[^.?]+[.?]")
sentences = pattern.findall(data)
If there's no trailing punctuation in the source string,
the final words won't appear in the result.
Consider tacking on a "." period in that case.
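Another way to keep an unterminated tail, sketched here with an in-memory string instead of a file (the sample text is invented), is to make the closing mark optional in the pattern:

```python
import re

data = "Chapter 16. My new goal. What now? The end"

# [.?]? makes the closing mark optional, so an unterminated tail is still captured.
pattern = re.compile(r"[^.?]+[.?]?")

# Strip surrounding whitespace and drop pieces that are only whitespace.
sentences = [s.strip() for s in pattern.findall(data) if s.strip()]
print(sentences)
# ['Chapter 16.', 'My new goal.', 'What now?', 'The end']
```

Note that "Chapter 16." is still treated as a sentence of its own, because the pattern knows nothing about abbreviations or numbering.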

Splitting on regex and keep delimiters with following regex match with Python [duplicate]

So, I would like to split this text into sentences.
s = "You! Are you Tom? I am Danny."
so I get:
["You!", "Are you Tom?", "I am Danny."]
That is, I want to split the text by the regex '[.!\?]' without removing the delimiters. What is the most Pythonic way to achieve this in Python?
I am aware of these questions:
JS string.split() without removing the delimiters
Python split() without removing the delimiter
But my problem has various delimiters (.?!) which complicates the problem.
You can use re.findall with regex .*?[.!\?]; the lazy quantifier *? makes sure each pattern matches up to the specific delimiter you want to match on:
import re
s = """You! Are you Tom? I am Danny."""
re.findall(r'.*?[.!\?]', s)
# ['You!', ' Are you Tom?', ' I am Danny.']
Strictly speaking, you don't want to split on '!?.', but rather on the whitespace that follows those characters. The following will work:
>>> import re
>>> re.split(r'(?<=[\.\!\?])\s+', s)
['You!', 'Are you Tom?', 'I am Danny.']
This splits on whitespace, but only if it is preceded by either a ., !, or ? character.
Since Python 3.7, re.split can split on zero-length matches, so you can match an empty string preceded by one of the delimiters:
(?<=[.!?])
Demo: https://regex101.com/r/ZLDXr1/1
Python versions before 3.7 do not split on zero-length matches, and even on newer versions the pieces keep their leading spaces. The same pattern is also useful in other languages that support lookbehinds.
However, based on your input/output samples, you need to split on the whitespace preceded by one of the delimiters instead. So the regex would be:
(?<=[.!?])\s+
Demo: https://regex101.com/r/ZLDXr1/2
Python demo: https://ideone.com/z6nZi5
If the spaces are optional, the re.findall solution suggested by @Psidom is the best one, I believe.
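For what it's worth, here is a quick sketch of both variants. It assumes Python 3.7 or later, where re.split accepts patterns that can match an empty string; the stripping and filtering step is an addition for illustration:

```python
import re

s = "You! Are you Tom? I am Danny."

# Zero-length split right after each delimiter (needs Python 3.7+).
# The pieces keep their leading spaces and an empty tail may appear, so clean up.
pieces = [p.strip() for p in re.split(r'(?<=[.!?])', s) if p.strip()]
print(pieces)   # ['You!', 'Are you Tom?', 'I am Danny.']

# Splitting on the whitespace that follows a delimiter needs no clean-up.
spaced = re.split(r'(?<=[.!?])\s+', s)
print(spaced)   # ['You!', 'Are you Tom?', 'I am Danny.']
```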
If you prefer to use split rather than match, one solution is to split with a capturing group:
splitted = list(filter(None, re.split(r'(.*?[\.!\?])', s)))
filter removes any empty strings.
This works even if there are no spaces between sentences, or if you need to catch a trailing sentence that ends with a different punctuation sign, such as a Unicode ellipsis (or has none at all).
It is even possible to keep your regex as is (with the escaping corrected and parentheses added):
splitted = list(filter(None, re.split(r'([\.!\?])', s)))
Then merge each even-indexed element with the odd-indexed element that follows it and strip the extra spaces (see "Python split() without removing the delimiter").
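The merge step could be sketched as follows. This is one possible way to do it, not necessarily what the answer had in mind, and it uses itertools.zip_longest so that a trailing chunk without punctuation is not lost:

```python
import re
from itertools import zip_longest

s = "You! Are you Tom? I am Danny."

# The capturing group makes re.split return text and delimiters alternately:
# ['You', '!', ' Are you Tom', '?', ' I am Danny', '.', '']
parts = re.split(r'([.!?])', s)

# Pair each text chunk (even index) with the delimiter after it (odd index).
pairs = zip_longest(parts[0::2], parts[1::2], fillvalue='')
sentences = [(text + mark).strip() for text, mark in pairs]
sentences = [x for x in sentences if x]   # drop the empty tail
print(sentences)   # ['You!', 'Are you Tom?', 'I am Danny.']
```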
Easiest way is to use nltk.
import nltk
nltk.sent_tokenize(s)
It will return a list of all your sentences without losing delimiters.

Regex End of Line and Specific Characters

So I'm writing a Python program that reads lines of serial data and compares them to a dictionary of line codes to figure out which specific lines are being transmitted. I am attempting to use a regular expression to filter out the extra garbage that the serial read string has on it, but I'm having a bit of an issue.
Every single code in my dictionary looks like this: T12F8B0A22**F8. The asterisks are the two alpha numeric pieces that differentiate each string code.
This is what I have so far as my regex: '/^T12F8B0A22[A-Z0-9]{2}F8$/'
I am getting a few errors with this, however. My first error is that there are some characters at the end of the string I still need to get rid of, which is odd because I thought $/ denoted the end of the line in regex. However, when I run my code through the debugger, I notice that after running the following code:
#regexString contains the serial read line data
regexString = re.sub('/^T12F8B0A22[A-Z0-9]{2}F8$/', '', regexString)
My string looks something like this: 'T12F8B0A2200F8\\r'
I need to get rid of the \\r.
If for some reason I can't get rid of this with regex, how in Python do you pass only part of a string as an argument? In this case I suppose it would be everything up to length - 3?
Your problem is threefold:
1) your string contains extra \r (Carriage Return character) before \n (New Line character); this is common in Windows and in network communication protocols; it is probably best to remove any trailing whitespace from your string:
regexString = regexString.rstrip()
2) as mentioned by Wiktor Stribiżew, your regexp is unnecessarily surrounded with / characters - some languages, like Perl, define regexp as a string delimited by / characters, but Python is not one of them;
3) your instruction using re.sub is actually replacing the matching part of regexString with an empty string - I believe this is the exact opposite of what you want (you want to keep the match and remove everything else, right?); that's why fixing the regexp makes things "even worse".
To summarize, I think you should use this instead of your current code:
m = re.match('T12F8B0A22[A-Z0-9]{2}F8', regexString)
regexString = m.group(0)
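Note that re.match returns None when nothing matches, in which case m.group(0) raises AttributeError. A slightly more defensive sketch (extract_code and CODE_RE are made-up names) uses re.search, so that anything before or after the code is ignored:

```python
import re

CODE_RE = re.compile(r'T12F8B0A22[A-Z0-9]{2}F8')

def extract_code(line):
    """Hypothetical helper: return the line code found in `line`, or None."""
    # search() scans the whole string, so leading junk and a trailing \r are ignored.
    m = CODE_RE.search(line)
    return m.group(0) if m else None

print(extract_code('T12F8B0A2200F8\r\n'))   # T12F8B0A2200F8
print(extract_code('noise'))                # None
```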
There are several ways to get rid of the "\r", but first a little analysis of your code:
1. The special character for the end of the string is just '$', not '$/', in Python.
2. re.sub replaces the matched pattern with a string ('' in your case), so it would replace the part you want to keep with an empty string and leave you with only the \\r.
Possible solutions:
1. Use a simple replace:
regexString = regexString.replace('\\r', '')
2. If you want to stick to regex, the approach is the same:
pattern = '\\\\r'
regexString = re.sub(pattern, '', regexString)
3. If you want to access the different groups, use re.search:
match = re.search('(^T12F8B0A22[A-Z0-9]{2}F8)(.*)', regexString)
match.group(1) # gives you the T12... code
match.group(2) # gives you the \\r
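Which fix applies depends on whether the string ends with a real carriage return or with the two literal characters backslash and r. A debugger showing '\\r' suggests the latter, but that is an assumption about how the string was produced. A small sketch to tell the two cases apart:

```python
real_cr = 'T12F8B0A2200F8\r'    # 15 characters, ends with a real carriage return
literal = 'T12F8B0A2200F8\\r'   # 16 characters, ends with a backslash and an "r"

print(len(real_cr), len(literal))          # 15 16

# A real carriage return is whitespace, so rstrip() removes it.
print(repr(real_cr.rstrip()))              # 'T12F8B0A2200F8'

# The literal two-character sequence has to be removed explicitly.
print(repr(literal.replace('\\r', '')))    # 'T12F8B0A2200F8'

# rstrip() leaves the literal version unchanged.
print(literal.rstrip() == literal)         # True
```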
Just match what you want to find. A couple of examples:
import re
data = '''lots of
otherT12F8B0A2212F8garbage
T12F8B0A2234F8around
T12F8B0A22ABF8the
stringsT12F8B0A22CDF8
'''
print(re.findall('T12F8B0A22..F8',data))
# ['T12F8B0A2212F8', 'T12F8B0A2234F8', 'T12F8B0A22ABF8', 'T12F8B0A22CDF8']
m = re.search('T12F8B0A22..F8',data)
if m:
    print(m.group(0))
# T12F8B0A2212F8

how do i read a string from a file up to a set character (for example reading the words "hello world ¬ this is a string" up to the¬) in python 3

Sorry if my English is off. I am making something in Python and need help fixing a problem I have encountered. I need to be able to take information from a txt file up to a point signalled by a key character such as ¬, then take the next part of the string from the first ¬ to the next ¬, and so on. The reason is that the strings will be of various lengths that can and will change. So if I have the string
'znPbB t7<)!\oWk_feGTIT:7{.¬ZO9?S9$v9vpd}Z#EMKC¬'
in a Notepad file, I need it to come out as
'znPbB t7<)!\oWk_feGTIT:7{.'
and when I want the second one, it should come out as
'ZO9?S9$v9vpd}Z#EMKC'
I would use split:
s = r'znPbB t7<)!\oWk_feGTIT:7{.¬ZO9?S9$v9vpd}Z#EMKC¬'
s.split('¬')
# returns
['znPbB t7<)!\\oWk_feGTIT:7{.', 'ZO9?S9$v9vpd}Z#EMKC', '']
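Since the data lives in a file, the same idea could be wrapped up roughly like this. It is a sketch only: read_fields is a made-up helper, a temporary file stands in for the Notepad file, and UTF-8 encoding is assumed:

```python
import os
import tempfile

def read_fields(path, sep='¬'):
    """Hypothetical helper: return the pieces of the file's text between separators."""
    with open(path, encoding='utf-8') as f:
        text = f.read()
    # Drop empty pieces, e.g. the one produced by a trailing separator.
    return [piece for piece in text.split(sep) if piece]

# Demo with a temporary file standing in for the Notepad file.
with tempfile.NamedTemporaryFile('w', suffix='.txt', delete=False,
                                 encoding='utf-8') as tmp:
    tmp.write(r'znPbB t7<)!\oWk_feGTIT:7{.¬ZO9?S9$v9vpd}Z#EMKC¬')

fields = read_fields(tmp.name)
os.remove(tmp.name)
print(fields[0])   # znPbB t7<)!\oWk_feGTIT:7{.
print(fields[1])   # ZO9?S9$v9vpd}Z#EMKC
```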
One solution is to split the string. Another is to use the re module. I changed the ¬ character to an ö when testing:
import re
text = r'znPbB t7<)!\oWk_feGTIT:7{.öZO9?S9$v9vpd}Z#EMKCöjndIJ%349HBhslö'
regex_result = re.findall(r"(?!ö).*?(?=ö)", text)
split_result = text.split("ö")
The difference between the results of the two is that str.split includes an empty string if the character (in my example "ö") is last.
split_result[:-1] == regex_result # <--- This is True
The regular expression can be divided into three parts. (?!ö) is a negative lookahead that prevents a match from starting at an "ö", which keeps it out of the results. .*? lazily matches as few characters as possible (any character except a newline). (?=ö) is a positive lookahead which requires that an "ö" follows but does not include it in the match.
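A somewhat simpler pattern, offered here as an alternative sketch rather than as part of the answer above, captures the run of non-delimiter characters before each delimiter. Unlike .*?, the class [^¬] also matches newlines, and empty fields between two adjacent delimiters are kept:

```python
import re

text = r'znPbB t7<)!\oWk_feGTIT:7{.¬ZO9?S9$v9vpd}Z#EMKC¬'

# Capture everything that is not the delimiter, up to each delimiter.
fields = re.findall(r'([^¬]*)¬', text)
print(fields[1])                        # ZO9?S9$v9vpd}Z#EMKC
print(fields == text.split('¬')[:-1])   # True
```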

Repeatedly splitting strings in python

I am looking to write a function that breaks a string into a list of str at various punctuation marks (e.g. , ! ?) that I specify. I know I should use the .split() method with the specific punctuation, but I can't figure out how to run the split repeatedly, once for each punctuation character, so that I end up with a single list of str made up of the original string split at every punctuation character.
To split with multiple delimiters, you should use re.split():
import re
pattern = r"[.,!?]" # etc.
new = re.split(pattern, your_current_string)
Putting that in function form should be simple enough.
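One way the function form might look is sketched below. The name split_on and its default delimiters are invented for this example, and it assumes at least one delimiter is given:

```python
import re

def split_on(text, delimiters=".,!?"):
    """Hypothetical helper: split `text` at any character in `delimiters`."""
    # re.escape keeps characters such as ] or ^ from changing the class's meaning.
    pattern = "[" + re.escape(delimiters) + "]"
    # Strip surrounding spaces and drop empty pieces.
    return [part.strip() for part in re.split(pattern, text) if part.strip()]

print(split_on("Hello, world! How are you? Fine."))
# ['Hello', 'world', 'How are you', 'Fine']
```

Building a single character class avoids looping over the delimiters and re-splitting the intermediate lists.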
