Python split by dot and question mark, and keep the character - python

I have a function:
with open(filename,'r') as text:
    data=text.readlines()
    split=str(data).split('([.|?])')
    for line in split:
        print(line)
This prints the sentences we get after splitting a text on two different marks. I also want to show the split symbol in the output, which is why I use (), but the split does not work properly.
It returns:
['Chapter 16. My new goal. \n','Chapter 17. My new goal 2. \n']
As you can see, the text has not been split on every dot.

Try escaping the marks, as both symbols have special meanings in regex. Also, the str.split method does not take a regex, so try it with split from Python's "re" module.
[\.|\?]

There are a few distinct problems here.
1. read vs readlines
data = text.readlines()
This produces a list of str, good.
... str(data) ...
If you print this, you will see it contains
several characters you likely did not want: [, ', ,, ].
You'd be better off with just data = text.read().
2. split on str vs regex
str(data).split('([.|?])')
We are splitting on a string, ok.
Let's consult the fine documents.
Return a list of the words in the string, using sep as the delimiter string.
Notice there's no mention of a regular expression.
That argument does not appear as a sequence of seven characters in the source string.
You were looking for a similar function:
https://docs.python.org/3/library/re.html#re.split
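As a quick illustration of the difference (a minimal sketch; the sample string is made up for this example), str.split looks for the literal seven-character separator, while re.split treats its argument as a pattern, and a capturing group makes it keep the delimiters in the result:

```python
import re

sample = "One. Two? Three."  # hypothetical sample text

# str.split treats its argument as a literal separator, which never occurs here.
literal_parts = sample.split('([.|?])')

# re.split treats it as a pattern; the capturing group keeps each delimiter.
regex_parts = re.split(r'([.?])', sample)

print(literal_parts)
print(regex_parts)
```

Note that re.split leaves the delimiters as separate list items, so they would still need to be glued back onto the preceding text.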
3. char class vs alternation
We can certainly use | vertical bar for alternation,
e.g. r"(cat|dog)".
It works for shorter strings, too, such as r"(c|d)".
But for single characters, a character class is
more convenient: r"[cd]".
It is possible to match three characters,
one of them being vertical bar, with r"[c|d]"
or equivalently r"[cd|]".
A character class can even have just a single character,
so r"[c]" is identical to r"c".
4. escaping
Since r".*" matches whole string,
there are certainly cases where escaping dot is important,
e.g. r"(cat|dog|\.)".
We can construct a character class with escaping:
r"[cd\.]".
Within [ ] square brackets that \ backwhack is optional.
Better to simply say r"[cd.]", which means the same thing.
pattern = re.compile(r"[.?]")
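The character-class points above can be checked with a short experiment. This is only a sketch, and the sample string is an assumption chosen to contain all four characters of interest:

```python
import re

text = "c|d."  # hypothetical sample containing c, |, d and a dot

print(re.findall(r"[cd]", text))    # character class: c or d
print(re.findall(r"[c|d]", text))   # the bar is a literal member here
print(re.findall(r"[cd\.]", text))  # escaped dot inside the class
print(re.findall(r"[cd.]", text))   # unescaped dot means the same thing
```

The last two calls should return the same list, which is why the backslash can be dropped inside square brackets.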
5. findall vs split
The two functions are fairly similar.
But findall() is about retrieving matching elements,
which your "preserve the final punctuation"
requirement asks for,
while split() pretty much assumes
that the separator is uninteresting.
So findall() seems a better match for your use case.
pattern = re.compile(r"[^.?]+[.?]")
Note that ^ caret usually means "anchor
to start of string", but within a character class
it is negation.
So e.g. r"[^0-9]" means "non-digit".
data = text.readlines()
split = str(data).split('([.|?])')
Putting it all together, try this:
data = text.read()
pattern = re.compile(r"[^.?]+[.?]")
sentences = pattern.findall(data)
If there's no trailing punctuation in the source string,
the final words won't appear in the result.
Consider tacking on a "." period in that case.
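Here is one way the pieces might fit together, including the trailing-period workaround. It is a sketch rather than a definitive implementation: the sample text stands in for the file contents, and stripping the whitespace around each sentence is an extra step that the answer above does not ask for.

```python
import re

# Sample text standing in for the result of text.read(); the content is made up.
data = "Chapter 16. My new goal. \nChapter 17. Is this goal 2? \nThe end"

# Tack on a period if the text does not already end with a delimiter,
# so the final words are not dropped by findall().
stripped = data.rstrip()
if stripped and stripped[-1] not in ".?":
    data = stripped + "."

pattern = re.compile(r"[^.?]+[.?]")
sentences = [s.strip() for s in pattern.findall(data)]

for sentence in sentences:
    print(sentence)
```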

Related

Splitting on regex and keep delimiters with following regex match with Python [duplicate]

So, I would like to split this text into sentences.
s = "You! Are you Tom? I am Danny."
so I get:
["You!", "Are you Tom?", "I am Danny."]
That is I want to split the text by the regex '[.!\?]' without removing the delimiters. What is the most pythonic way to achieve this in python?
I am aware of these questions:
JS string.split() without removing the delimiters
Python split() without removing the delimiter
But my problem has various delimiters (.?!) which complicates the problem.
You can use re.findall with regex .*?[.!\?]; the lazy quantifier *? makes sure each pattern matches up to the specific delimiter you want to match on:
import re
s = """You! Are you Tom? I am Danny."""
re.findall(r'.*?[.!\?]', s)
# ['You!', ' Are you Tom?', ' I am Danny.']
Strictly speaking, you don't want to split on '!?.', but rather on the whitespace that follows those characters. The following will work:
>>> import re
>>> re.split(r'(?<=[.!?])\s+', s)
['You!', 'Are you Tom?', 'I am Danny.']
This splits on whitespace, but only if it is preceded by either a ., !, or ? character.
Since Python 3.7, re.split can split on zero-length matches, so you can also match an empty string preceded by one of the delimiters:
(?<=[.!?])
Demo: https://regex101.com/r/ZLDXr1/1
Before Python 3.7, re.split did not support splitting on zero-length matches, so on older versions this solution is only useful in other languages that support lookbehinds.
However, based on your input/output data samples, you need to split by spaces preceded by one of the delimiters instead. So the regex would be:
(?<=[.!?])\s+
Demo: https://regex101.com/r/ZLDXr1/2
Python demo: https://ideone.com/z6nZi5
If the spaces are optional, the re.findall solution suggested by @Psidom is the best one, I believe.
If you prefer to use the split method rather than match, one solution is to split with a group:
splitted = list(filter(None, re.split(r'(.*?[\.!\?])', s)))
filter removes empty strings, if any.
This will work even if there are no spaces between sentences, or if you need to catch a trailing sentence that ends with a different punctuation sign, such as a Unicode ellipsis (or doesn't have any at all).
It is even possible to keep your regex as is (with the escaping corrected and parentheses added):
splitted = list(filter(None, re.split(r'([\.!\?])', s)))
Then merge the even and odd elements and remove the extra spaces.
Python split() without removing the delimiter
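The merge step is not spelled out above, so here is a minimal sketch of one way to do it. The use of itertools.zip_longest and the final filtering of empty strings are assumptions of this example, not part of the original answer:

```python
import re
from itertools import zip_longest

s = "You! Are you Tom? I am Danny."

# The capturing group makes re.split return text and delimiters alternately.
parts = re.split(r'([.!?])', s)

# Pair each text chunk (even index) with the delimiter after it (odd index).
# zip_longest keeps a trailing chunk that has no delimiter.
pairs = zip_longest(parts[0::2], parts[1::2], fillvalue='')
sentences = [(chunk + delim).strip() for chunk, delim in pairs]
sentences = [sentence for sentence in sentences if sentence]

print(sentences)
```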
Easiest way is to use nltk.
import nltk
nltk.sent_tokenize(s)
It will return a list of all your sentences without losing delimiters.

Remove continuous occurrence of vowels together in a string using Python

I have a string like below:
"i'm just returning from work. *oeee* all and we can go into some detail *oo*. what is it that happened as far as you're aware *aouu*"
with some junk characters like above (highlighted with '*' marks). All I could observe was that the junk characters come as a bunch of vowels knit together. Now, I need to remove any word that has a space before and after it, contains only vowels (like oeee, aouu, etc.), and has a length of 2 or more. How do I achieve this in Python?
Currently, I built a tuple to include replacement words like ((" oeee "," "),(" aouu "," ")) and am sending it through a for loop with replace. But if the word is 'oeeee', I need to add a new item to the tuple. There must be a better way.
P.S: there will be no '*' in the actual text. I just put it here to highlight.
You need to use re.sub to do a regex replacement in python. You should use this regex:
\b[aeiou]{2,}\b
which will match a sequence of 2 or more vowels in a word by themselves. We use \b to match the boundaries of the word so it will match at the beginning and end of the string (in your string, aouu) as well as words adjacent to punctuation (in your string, oo). If your text may include uppercase vowels too, use the re.I flag to ignore case:
import re
text = "i'm just returning from work. oeee all and we can go into some detail oo. what is it that happened as far as you're aware aouu"
print(re.sub(r'\b[aeiou]{2,}\b', '', text, flags=re.I))
Output
i'm just returning from work. all and we can go into some detail . what is it that happened as far as you're aware
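The substitution above leaves the surrounding spaces in place, so a gap remains where each junk word was. A possible variation, offered only as a sketch, lets the pattern also consume the whitespace before the junk word. Be aware that either pattern would also remove any genuine word made up solely of two or more vowels.

```python
import re

text = ("i'm just returning from work. oeee all and we can go into "
        "some detail oo. what is it that happened as far as you're aware aouu")

# \s* also consumes whitespace before the junk word, so no gap is left behind.
cleaned = re.sub(r'\s*\b[aeiou]{2,}\b', '', text, flags=re.I)
print(cleaned)
```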

How to replace '..' and '?.' with single periods and question marks in pandas? df['column'].str.replace not working

This is a follow up to this SO post which gives a solution to replace text in a string column
How to replace text in a column of a Pandas dataframe?
df['range'] = df['range'].str.replace(',','-')
However, this doesn't seem to work with double periods or a question mark followed by a period
testList = ['this is a.. test stence', 'for which is ?. was a time']
testDf = pd.DataFrame(testList, columns=['strings'])
testDf['strings'].str.replace('..', '.').head()
results in
0 ...........e
1 .............
Name: strings, dtype: object
and
testDf['strings'].str.replace('?.', '?').head()
results in
error: nothing to repeat at position 0
Add the regex=False parameter because, as you can see in the docs, regex is True by default (this was the case before pandas 2.0; newer versions default to False):
-regex bool, default True
Determines if assumes the passed-in pattern is a regular expression:
If True, assumes the passed-in pattern is a regular expression.
And ? . are special characters in regular expressions.
So, one way to do it without regex will be this double replacing:
testDf['strings'].str.replace('..', '.',regex=False).str.replace('?.', '?',regex=False)
Output:
strings
0 this is a. test stence
1 for which is ? was a time
Replace using a regular expression. In this case, replace any special character '.' followed immediately by whitespace. This is a bit tricky; I advise you to go with @Mark Reed's answer.
testDf.replace(regex=r'([.](?=\s))', value=r'')
strings
0 this is a. test stence
1 for which is ? was a time
str.replace() works with a Regex where . is a special character which denotes "any" character. If you want a literal dot, you need to escape it: "\.". Same for other special Regex characters like ?.
First, be aware that the Pandas replace method is different from the standard Python one, which operates only on fixed strings. The Pandas one can behave as either the regular string.replace or re.sub (the regular-expression substitute method), depending on the value of a flag, and the default (in pandas versions before 2.0) is to act like re.sub. So you need to treat your first argument as a regular expression. That means you do have to change the string, but it also has the benefit of allowing you to do both substitutions in a single call.
A regular expression isn't a string to be searched for literally, but a pattern that acts as instructions telling Python what to look for. Most characters just ask Python to match themselves, but some are special, and both . and ? happen to be in the special category.
The easiest thing to do is to use a character class to match either . or ? followed by a period, and remember which one it was so that it can be included in the replacement, just without the following period. That looks like this:
testDf.replace(regex=r'([.?])\.', value=r'\1')
The [.?] means "match either a period or a question mark"; since they're inside the [...], those normally-special characters don't need to be escaped. The parentheses around the square brackets tell Python to remember which of those two characters is the one it actually found. The next thing that has to be there in order to match is the period you're trying to get rid of, which has to be escaped with a backslash because this one's not inside [...].
In the replacement, the special sequence \1 means "whatever you found that matched the pattern between the first set of parentheses", so that's either the period or question mark. Since that's the entire replacement, the following period is removed.
Now, you'll notice I used raw strings (r'...') for both; that keeps Python from doing its own interpretation of the backslashes before replace can. If the replacement were just '\1' without the r it would replace them with character code 1 (control-A) instead of the first matched group.
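The same pattern and replacement can be tried with plain re.sub, outside pandas, to see the effect of the raw string. This is a sketch using the standard library only; it assumes nothing about pandas beyond what the answer above states.

```python
import re

samples = ['this is a.. test stence', 'for which is ?. was a time']

# Raw replacement string: \1 refers to the first captured group.
fixed = [re.sub(r'([.?])\.', r'\1', s) for s in samples]

# Non-raw '\1' is the single character chr(1), not a group reference.
broken = [re.sub(r'([.?])\.', '\1', s) for s in samples]

print(fixed)
print(broken)
```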
To replace both '?.' and '..' in one call, you can separate the alternatives with | (the regex OR operator). Note that this replaces either pair with a single period:
testDf['strings'].str.replace(r'\?\.|\.\.', '.', regex=True)
Prefix each . with a \, because an unescaped . is a regex metacharacter that matches any character:
testDf['strings'].str.replace(r'\.\.', '.', regex=True)
You can do the same with the ?, which is another regex metacharacter:
testDf['strings'].str.replace(r'\?\.', '?', regex=True)

Regex End of Line and Specific Characters

So I'm writing a Python program that reads lines of serial data and compares them to a dictionary of line codes to figure out which specific lines are being transmitted. I am attempting to use a regular expression to filter out the extra garbage that the serial read string has on it, but I'm having a bit of an issue.
Every single code in my dictionary looks like this: T12F8B0A22**F8. The asterisks are the two alpha numeric pieces that differentiate each string code.
This is what I have so far as my regex: '/^T12F8B0A22[A-Z0-9]{2}F8$/'
I am getting a few errors with this, however. My first error is that there are some characters at the end of the string I still need to get rid of, which is odd because I thought $/ denoted the end of the line in regex. However, when I run my code through the debugger, I notice that after running the following code:
#regexString contains the serial read line data
regexString = re.sub('/^T12F8B0A22[A-Z0-9]{2}F8$/', '', regexString)
My string looks something like this: 'T12F8B0A2200F8\\r'
I need to get rid of the \\r.
If for some reason I can't get rid of this with regex, how in python do you send specific string character through an argument? In this case I suppose it would be length - 3?
Your problem is threefold:
1) your string contains extra \r (Carriage Return character) before \n (New Line character); this is common in Windows and in network communication protocols; it is probably best to remove any trailing whitespace from your string:
regexString = regexString.rstrip()
2) as mentioned by Wiktor Stribiżew, your regexp is unnecessarily surrounded with / characters - some languages, like Perl, define regexp as a string delimited by / characters, but Python is not one of them;
3) your instruction using re.sub is actually replacing the matching part of regexString with an empty string - I believe this is the exact opposite of what you want (you want to keep the match and remove everything else, right?); that's why fixing the regexp makes things "even worse".
To summarize, I think you should use this instead of your current code:
m = re.match('T12F8B0A22[A-Z0-9]{2}F8', regexString)
if m:  # re.match returns None when the string does not start with a code
    regexString = m.group(0)
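Combining the rstrip() and re.match suggestions, a small helper might look like the sketch below. The function name extract_code and the sample inputs are hypothetical, introduced only for illustration:

```python
import re

# Pattern from the question, without the surrounding / characters.
CODE_PATTERN = re.compile(r'T12F8B0A22[A-Z0-9]{2}F8')

def extract_code(raw_line):
    """Return the line code at the start of raw_line, or None if absent."""
    m = CODE_PATTERN.match(raw_line.rstrip())
    return m.group(0) if m else None

print(extract_code('T12F8B0A2200F8\r\n'))  # real carriage return and newline
print(extract_code('T12F8B0A2200F8\\r'))   # literal backslash followed by r
print(extract_code('noise'))
```

Because re.match only needs the pattern to match at the start of the string, both kinds of trailing garbage are ignored, and a line with no valid code yields None instead of raising an error.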
There are several ways to get rid of the "\r", but first a little analysis of your code :
1. the special character for the end of the string is just '$', not '$/', in Python.
2. re.sub will substitute the matched pattern with a string ('' in your case), which replaces the string you want to keep with an empty string, and you are left with the \\r
possible solutions:
use simple replace:
regexString = regexString.replace('\\r','')
if you want to stick to regex, the approach is the same:
pattern = '\\\\r'
match = re.sub(pattern, '',regexString)
2.2 if you want to access the different groups, use re.search:
match = re.search('(^T12F8B0A22[A-Z0-9]{2}F8)(.*)',regexString)
match.group(1) # will give you the T12...
match.group(2) # gives you the \\r
Just match what you want to find. Couple of examples:
import re
data = '''lots of
otherT12F8B0A2212F8garbage
T12F8B0A2234F8around
T12F8B0A22ABF8the
stringsT12F8B0A22CDF8
'''
print(re.findall('T12F8B0A22..F8',data))
['T12F8B0A2212F8', 'T12F8B0A2234F8', 'T12F8B0A22ABF8', 'T12F8B0A22CDF8']
m = re.search('T12F8B0A22..F8',data)
if m:
    print(m.group(0))
T12F8B0A2212F8

Regex split a string and strip recurring character

Using python I'm parsing several strings. Sometimes the string has appended several semicolons to it.
Example strings:
s1="1;Some text"
s2="2;Some more text;;;;"
The number of appending semicolons varies, but if it's there it's never less than two.
The following pattern matches s1, with s2 it includes the appended semicolons.
How do I redo it to remove those?
pat=re.compile('(?m)^(\d+);(.*)')
You can use the str.rstrip([chars])
This method returns a copy of the string in which all chars have been stripped from the end of the string (default whitespace characters).
e.g. you can do:
s2 = s2.rstrip(";")
You can find more information here.
pat = re.compile(r'\d+;[^;]*')
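To keep the two capture groups from the question while dropping the trailing semicolons, the idea above can be sketched as follows. The first pattern assumes the text itself never contains a semicolon; the second, lazy variant is an alternative suggested here (not in the answers above) for text that might.

```python
import re

s1 = "1;Some text"
s2 = "2;Some more text;;;;"

# [^;]* stops at the first semicolon, so trailing semicolons are never captured.
pat = re.compile(r'(?m)^(\d+);([^;]*)')

# Lazy variant: allows semicolons inside the text and strips only trailing ones.
pat_lazy = re.compile(r'(?m)^(\d+);(.*?);*$')

for s in (s1, s2):
    print(pat.match(s).groups(), pat_lazy.match(s).groups())
```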
