I am looking to write a function that breaks a string into a list of str by splitting it at various punctuation characters (e.g. , ! ?) that I specify. I know I should use the .split() function with the specific punctuation, but I can't figure out how to iterate the split over each specified punctuation character so that I end up with a single list of str made from the original str split at every punctuation character.
To split with multiple delimiters, you should use re.split():
import re
pattern = r"[.,!?]" # etc.
new = re.split(pattern, your_current_string)
Putting that in function form should be simple enough.
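For instance, a minimal sketch of such a function (the name and default punctuation set here are just examples):
import re

def split_on_punctuation(text, punctuation=",!?"):
    # Build a character class such as "[,!?]" from the characters you specify.
    pattern = "[" + re.escape(punctuation) + "]"
    return re.split(pattern, text)

print(split_on_punctuation("Hello, world! How are you?"))
# ['Hello', ' world', ' How are you', '']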
I have a function:
with open(filename, 'r') as text:
    data = text.readlines()
split = str(data).split('([.|?])')
for line in split:
    print(line)
This prints the sentences we get after splitting a text by 2 different marks. I also want to show the split symbol in the output, which is why I use (), but the split does not work correctly.
It returns:
['Chapter 16. My new goal. \n','Chapter 17. My new goal 2. \n']
As you can see, the split hasn't split at all the dots.
Try escaping the marks, as both symbols have functional meanings in regex. Also, I'm not quite sure whether the str.split method accepts a regex; maybe try it with split from Python's "re" module.
[\.|\?]
There are a few distinct problems, here.
1. read vs readlines
data = text.readlines()
This produces a list of str, good.
... str(data) ...
If you print this, you will see it contains
several characters you likely did not want: [, ', ,, ].
You'd be better off with just data = text.read().
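A quick illustration of the difference, using made-up file contents:
data = ["Chapter 16. My new goal.\n", "Chapter 17. My new goal 2.\n"]
print(str(data))
# ['Chapter 16. My new goal.\n', 'Chapter 17. My new goal 2.\n']  -- brackets, quotes and commas included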
2. split on str vs regex
str(data).split('([.|?])')
We are splitting on a string, ok.
Let's consult the fine documents.
Return a list of the words in the string, using sep as the delimiter string.
Notice there's no mention of a regular expression.
That argument does not appear as a sequence of seven characters in the source string.
You were looking for a similar function:
https://docs.python.org/3/library/re.html#re.split
3. char class vs alternation
We can certainly use | vertical bar for alternation,
e.g. r"(cat|dog)".
It works for shorter strings, too, such as r"(c|d)".
But for single characters, a character class is
more convenient: r"[cd]".
It is possible to match three characters,
one of them being vertical bar, with r"[c|d]"
or equivalently r"[cd|]".
A character class can even have just a single character,
so r"[c]" is identical to r"c".
4. escaping
Since r".*" matches whole string,
there are certainly cases where escaping dot is important,
e.g. r"(cat|dog|\.)".
We can construct a character class with escaping:
r"[cd\.]".
Within [ ] square brackets, the \ backwhack is optional.
Better to simply say r"[cd.]", which means the same thing.
pattern = re.compile(r"[.?]")
5. findall vs split
The two functions are fairly similar.
But findall() is about retrieving matching elements,
which your "preserve the final punctuation"
requirement asks for,
while split() pretty much assumes
that the separator is uninteresting.
So findall() seems a better match for your use case.
pattern = re.compile(r"[^.?]+[.?]")
Note that ^ caret usually means "anchor
to start of string", but within a character class
it is negation.
So e.g. r"[^0-9]" means "non-digit".
So instead of the original
data = text.readlines()
split = str(data).split('([.|?])')
Putting it all together, try this:
data = text.read()
pattern = re.compile(r"[^.?]+[.?]")
sentences = pattern.findall(data)
If there's no trailing punctuation in the source string,
the final words won't appear in the result.
Consider tacking on a "." period in that case.
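Here is a minimal end-to-end sketch of that approach, assuming filename holds the path to your text file:
import re

with open(filename) as text:
    data = text.read().strip()

if data and data[-1] not in ".?":
    data += "."  # so the final words still show up in the result

pattern = re.compile(r"[^.?]+[.?]")
sentences = pattern.findall(data)
for sentence in sentences:
    print(sentence.strip())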
So, I would like to split this text into sentences.
s = "You! Are you Tom? I am Danny."
so I get:
["You!", "Are you Tom?", "I am Danny."]
That is, I want to split the text by the regex '[.!\?]' without removing the delimiters. What is the most Pythonic way to achieve this in Python?
I am aware of these questions:
JS string.split() without removing the delimiters
Python split() without removing the delimiter
But my problem has various delimiters (.?!) which complicates the problem.
You can use re.findall with the regex .*?[.!\?]; the lazy quantifier *? makes sure each match extends only up to the first delimiter it reaches:
import re
s = """You! Are you Tom? I am Danny."""
re.findall(r'.*?[.!\?]', s)
# ['You!', ' Are you Tom?', ' I am Danny.']
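If the leading spaces bother you, you could strip each match afterwards:
[x.strip() for x in re.findall(r'.*?[.!\?]', s)]
# ['You!', 'Are you Tom?', 'I am Danny.']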
Strictly speaking, you don't want to split on '!?.', but rather on the whitespace that follows those characters. The following will work:
>>> import re
>>> re.split(r'(?<=[\.\!\?])\s*', s)
['You!', 'Are you Tom?', 'I am Danny.']
This splits on whitespace, but only if it is preceded by either a ., !, or ? character.
If Python supported split by zero-length matches, you could achieve this by matching an empty string preceded by one of the delimiters:
(?<=[.!?])
Demo: https://regex101.com/r/ZLDXr1/1
Unfortunately, Python did not support splitting on zero-length matches prior to version 3.7. Yet the solution may still be useful in other languages that support lookbehinds.
However, based on your input/output data samples, you rather need to split on spaces preceded by one of the delimiters. So the regex would be:
(?<=[.!?])\s+
Demo: https://regex101.com/r/ZLDXr1/2
Python demo: https://ideone.com/z6nZi5
If the spaces are optional, the re.findall solution suggested by #Psidom is the best one, I believe.
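For completeness, the same split in Python (mirroring the linked demos):
import re
s = "You! Are you Tom? I am Danny."
print(re.split(r'(?<=[.!?])\s+', s))
# ['You!', 'Are you Tom?', 'I am Danny.']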
If you prefer to use the split method rather than match, one solution is to split with a group:
splitted = filter(None, re.split( r'(.*?[\.!\?])', s))
Filter removes empty strings if any.
This will work even if there are no spaces between sentences, or if you need to catch a trailing sentence that ends with a different punctuation mark, such as a Unicode ellipsis (or has no punctuation at all).
It is even possible to keep your regex as is (with the escaping corrected and parentheses added).
splitted = filter(None, re.split( r'([\.!\?])', s))
Then merge the even and odd elements and remove the extra spaces.
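For example, a sketch of that merge with the question's sample string, using zip to pair each text chunk with its delimiter:
import re
s = "You! Are you Tom? I am Danny."
parts = re.split(r'([.!?])', s)
# parts alternates text and delimiter: ['You', '!', ' Are you Tom', '?', ' I am Danny', '.', '']
sentences = [(chunk + mark).strip() for chunk, mark in zip(parts[0::2], parts[1::2])]
print(sentences)  # ['You!', 'Are you Tom?', 'I am Danny.']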
Easiest way is to use nltk.
import nltk
nltk.sent_tokenize(s)
It will return a list of all your sentences without losing the delimiters.
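Note that sent_tokenize relies on the punkt tokenizer models, so you may need to run nltk.download('punkt') once before using it.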
I have a long string created after parsing a file, and any time I encounter $$ 1903810948 $$ I need to strip the number from between the $$, save it separately, and remove the $$ from the string. I am trying to use regex but cannot seem to figure out a way to do it in Python.
Edit: The string is basically parsed from a PDF file, and it has special characters at the start and end, which are either a $ or a $$. I need to remove that content from the file and create a separate file where I will store whatever I removed. That is why I do not think split is the right way to go about it.
You can use the split() method, which splits a string into a list.
Syntax
string.split(separator, maxsplit)
Parameter Values
separator: Optional. Specifies the separator to use when splitting the string. By default, any whitespace is a separator.
maxsplit: Optional. Specifies how many splits to do. Default value is -1, which is "all occurrences".
Here's a solution
text = '$$ 1903810948 $$'
print(text.split("$$")[1].split()[0])
output
1903810948
The trailing split() call without parameters removes the surrounding whitespace:
print(text.split("$$")[1].split()[0])
You can just use replace(), no need for regex.
string = '$$ 1903810948 $$'
print(string.replace('$',''))
Also, this will leave a little whitespace at the start and end of the string. This should fix that.
print(string.replace('$','')[1:-1])
output
1903810948
Regex may not be the best option here as replace would work quite well.
However if you wish to use regex you can use
import re
string = '$$ 1903810948 $$'
pattern = r'\$\$ (\d+) \$\$'
re.findall(pattern, string)
#['1903810948']
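Building on that, a sketch of the full task from the question (collect the numbers separately and strip the $$-delimited parts), assuming the markers always look like "$$ ... $$":
import re

text = "some text $$ 1903810948 $$ more text $$ 42 $$ end"  # made-up sample
pattern = re.compile(r'\$\$\s*(\d+)\s*\$\$')

numbers = pattern.findall(text)   # ['1903810948', '42'] -- save these separately
cleaned = pattern.sub('', text)   # the string with the $$ ... $$ blocks removed
print(numbers)
print(cleaned)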
In a program that I am writing, some symbols have to be replaced by others throughout the entire program. I've tried doing it this way, but it didn't work.
for letter in word:
    letter = letter.replace("a","b").replace("c","d").replace("e","f")
Since I'm a beginner, I am asking for a comprehensive solution.
Thank you!
You should apply this chain of replacements to the whole string, not individual characters:
word.replace("Ä","AE").replace("Ü","UE").replace("Ö","OE").replace("ß","SS")
You don't need to split it into words for this, either.
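Note that str.replace returns a new string, so assign the result back; for example, with a made-up word:
word = "Äußerung"  # made-up sample
word = word.replace("Ä", "AE").replace("Ü", "UE").replace("Ö", "OE").replace("ß", "SS")
print(word)  # AEuSSerung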
There is another string method that you could consider when making many replacements at once.
str.replace is better when making one substitution various times in a string.
str.translate uses a mapping of such changes to make them all in one substitution.
teststring = "BAßÜKÖNÄ" # a made-up word
mapping = str.maketrans({"Ä":"AE", "Ü": "UE", "Ö":"OE", "ß":"SS"})
print(teststring.translate(mapping)) # BASSUEKOENAE
I think that using translate makes it easier to check, test and maintain the changes than using multiple replace calls. str.maketrans also allows the use of two strings of equal length that correspond character by character, and even a third argument listing the characters you wish to eliminate from the string.
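A small sketch of those two other forms of str.maketrans (toy characters, just for illustration):
table = str.maketrans("abc", "xyz", "!?")  # a->x, b->y, c->z; '!' and '?' are deleted
print("a!b?c".translate(table))            # xyz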
Using Python, I'm parsing several strings. Sometimes the string has several semicolons appended to it.
Example strings:
s1="1;Some text"
s2="2;Some more text;;;;"
The number of appended semicolons varies, but when they're there, there are never fewer than two.
The following pattern matches s1; with s2, it includes the appended semicolons.
How do I redo it to remove those?
pat = re.compile(r'(?m)^(\d+);(.*)')
You can use str.rstrip([chars]).
This method returns a copy of the string with the given chars stripped from the end (whitespace characters by default).
e.g. you can do:
s2 = s2.rstrip(";")
You can find more information here.
pat = re.compile(r'\d+;[^;]*')
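For illustration, the same idea with capture groups added (mirroring the groups in the question's original pattern): [^;]* stops at the first semicolon, so the trailing ones never make it into the match.
import re
pat = re.compile(r'(?m)^(\d+);([^;]*)')
print(pat.match("2;Some more text;;;;").groups())
# ('2', 'Some more text')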