Python str.maketrans Remove Punctuation with Empty Space - python

I am using str.maketrans in Python 3 to do simple text preprocessing such as lowercasing and removing digits and punctuation. The problem is that when the punctuation is removed, all the words end up joined together with no space between them. For example, let's say I have the following text:
text='[{"Hello":"List:","Test"321:[{"Hello":"Airplane Towel for Kitchen"},{"Hello":2 " Repair massive utilities "2},{"Hello":"Some 3 appliance for our kitchen"2}'
import string
text=text.lower()
text=text.translate(str.maketrans(' ',' ',string.digits))
Works just fine, it gives:
'[{"hello":"list:","test":[{"hello":"airplane towel for kitchen"},{"hello": " repair massives utilities "},{"hello":"some appliance for our kitchen"}'
But once I try to remove the punctuation:
text=text.translate(str.maketrans(' ',' ',string.punctuation))
It gives me this:
'hellolisttesthelloairplane towel for kitchenhello nbsprepair massives utilitiesnbsphellosome appliance for our kitchen'
Ideally it should yield:
'hello list test hello airplane towel for kitchen hello nbsp repair massives utilities nbsp hello some appliance for our kitchen'
There is no specific reason I am doing it with maketrans; I like it because it is fast and easy, but I am stuck on this. Thanks!
Disclaimer: I already know how to do it with re like the following:
import re
s = "string.]With. Punctuation?"
s = re.sub(r'[^\w\s]','',s)

well... this works
txt = text.translate(str.maketrans(string.punctuation, ' ' * len(string.punctuation))).replace(' '*4, ' ').replace(' '*3, ' ').replace(' '*2, ' ').strip()
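A minimal sketch of a tidier variant of the same idea: map each punctuation character to a space with str.maketrans, then let split() and join() collapse the runs of spaces instead of chaining replace() calls. The function name strip_punctuation is made up for illustration.

```python
import string


def strip_punctuation(text):
    # Translate every punctuation character to a space instead of deleting it,
    # so neighbouring words stay separated.
    table = str.maketrans(string.punctuation, " " * len(string.punctuation))
    # split() with no arguments splits on any run of whitespace and drops
    # leading/trailing whitespace; join() puts single spaces back.
    return " ".join(text.translate(table).split())


sample = '[{"hello":"list:","test":[{"hello":"airplane towel"}]'
print(strip_punctuation(sample))
```

Unlike the chained replace() calls, this does not depend on how many punctuation characters happen to sit next to each other.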

Related

How can I split a string into a list by sentences, but keep the \n?

I want to split text into sentences but keep the \n such as:
Civility vicinity graceful is it at. Improve up at to on mention
perhaps raising. Way building not get formerly her peculiar.
Arrived totally in as between private. Favour of so as on pretty
though elinor direct.
into sentences like:
['Civility vicinity graceful is it at.', 'Improve up at to on mention
perhaps raising.', 'Way building not get formerly her peculiar.', '\n
Arrived totally in as between private.', 'Favour of so as on pretty though elinor direct.']
Right now I'm using this code with re to split the sentences:
import re
alphabets= "([A-Za-z])"
prefixes = "(Mr|St|Mrs|Ms|Dr)[.]"
suffixes = "(Inc|Ltd|Jr|Sr|Co)"
starters = r"(Mr|Mrs|Ms|Dr|He\s|She\s|It\s|They\s|Their\s|Our\s|We\s|But\s|However\s|That\s|This\s|Wherever)"
acronyms = "([A-Z][.][A-Z][.](?:[A-Z][.])?)"
websites = "[.](com|net|org|io|gov)"
digits = "([0-9])"
def remove_urls(text):
    text = re.sub(r'http\S+', '', text)
    return text
def split_into_sentences(text):
    print("in")
    print(text)
    text = " " + text + " "
    text = re.sub(prefixes,"\\1<prd>",text)
    text = re.sub(websites,"<prd>\\1",text)
    text = re.sub(digits + "[.]" + digits,"\\1<prd>\\2",text)
    if "Ph.D" in text: text = text.replace("Ph.D.","Ph<prd>D<prd>")
    text = re.sub(r"\s" + alphabets + "[.] "," \\1<prd> ",text)
    text = re.sub(acronyms+" "+starters,"\\1<stop> \\2",text)
    if "..." in text: text = text.replace("...",".<prd>")
    text = re.sub(alphabets + "[.]" + alphabets + "[.]","\\1<prd>\\2<prd>",text)
    text = re.sub(" "+suffixes+"[.] "+starters," \\1<stop> \\2",text)
    text = re.sub(" "+suffixes+"[.]"," \\1<prd>",text)
    text = re.sub(" " + alphabets + "[.]"," \\1<prd>",text)
    if "”" in text: text = text.replace(".”","”.")
    if "\"" in text: text = text.replace(".\"","\".")
    if "!" in text: text = text.replace("!\"","\"!")
    if "?" in text: text = text.replace("?\"","\"?")
    text = text.replace(".",".<stop>")
    text = text.replace("?","?<stop>")
    text = text.replace("!","!<stop>")
    text = text.replace("<prd>",".")
    sentences = text.split("<stop>")
    sentences = sentences[:-1]
    sentences = [s.strip() for s in sentences]
    print(sentences)
    return sentences
However the code gets rid of the \n, which I need. I need the \n because I'm using text in moviepy, and moviepy has no built in functions to space out text with \n, so I must create my own. The only way I can do that is through having \n as a signifier in the text, but when I split my sentences it also gets rid of the \n. What should I do?
You can use a lookbehind, (?<=...), to keep the separator in the result, followed by the characters you want the split to consume:
import re
s='Civility vicinity graceful is it at. Improve up at to on mention perhaps raising. Way building not get formerly her peculiar.\n\nArrived totally in as between private. Favour of so as on pretty though elinor direct.'
re.split(r'(?<=\.)[ \n]', s)
output:
['Civility vicinity graceful is it at.',
'Improve up at to on mention perhaps raising.',
'Way building not get formerly her peculiar.',
'\nArrived totally in as between private.',
'Favour of so as on pretty though elinor direct.']
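A small sketch extending the lookbehind answer above so that "!" and "?" also end a sentence. The function name split_keep_newlines is hypothetical, and this does not handle abbreviations such as "Mr." the way the longer function in the question tries to.

```python
import re


def split_keep_newlines(text):
    # Split on ONE space or newline that directly follows ., ! or ?.
    # The lookbehind keeps the punctuation attached to the sentence; any
    # further "\n" after the consumed separator stays at the start of the
    # next sentence.
    return re.split(r"(?<=[.!?])[ \n]", text)


print(split_keep_newlines("One. Two!\n\nThree? Four."))
```

Because only a single separator character is consumed, a blank line between paragraphs leaves one "\n" at the front of the following sentence, which is the marker the question wants to keep.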
You could also split on '.':
text = '''Civility vicinity graceful is it at. Improve up at to on mention
perhaps raising. Way building not get formerly her peculiar.
Arrived totally in as between private. Favour of so as on pretty though elinor
direct.'''
text.split('.')
>>> ['Civility vicinity graceful is it at', ' Improve up at to on mention
perhaps raising', ' Way building not get formerly her peculiar', '\nArrived
totally in as between private', ' Favour of so as on pretty though elinor
direct', '']
Also see: Split by comma and strip whitespace in Python
I have been able to reproduce your output using this:
txt = 'Civility vicinity graceful is it at. Improve up at to on mention perhaps raising. Way building not get formerly her peculiar. \nArrived totally in as between private. Favour of so as on pretty though elinor direct.'
Code:
updated_text = [a if a.endswith('.') else a+'.' for a in txt.split('. ')]
Output:
['Civility vicinity graceful is it at.', 'Improve up at to on mention perhaps raising.', 'Way building not get formerly her peculiar.', '\nArrived totally in as between private.', 'Favour of so as on pretty though elinor direct.']

Regex noob question: getting several words/sentences from one line, max separation being 1 whitespace?

I'm not terribly familiar with Python regex, or regex in general, but I'm hoping to demystify it all a bit more with time.
My problem is this: given a string like ' Apple Banana Cucumber Alphabetical Fruit Whoops', I'm trying to use Python's re.findall function to produce a list that looks like this: my_list = [' Apple', ' Banana', ' Cucumber', ' Alphabetical Fruit', ' Whoops']. In other words, I'm trying to find a regular expression that can [look for a bunch of whitespace followed by some non-whitespace], and then check whether there is a single space with some more non-whitespace characters after that.
This is the function I've written that gets me cloooose but not quite:
re.findall("\s+\S+\s{1}\S*", my_list)
Which results in:
[' Apple ', ' Banana ', ' Cucumber ', ' Alphabetical Fruit']
I think this result makes sense. It first finds the whitespace, then some non-whitespace, but then it looks for at least one whitespace (which leaves out 'Whoops'), and then looks for any number of other non-whitespace characters (which is why there's no space after 'Alphabetical Fruit'). I just don't know what character combination would give me the intended result.
Any help would be hugely appreciated!
-WW
You can do:
\s+\w+(?:\s\w+)?
\s+\w+ matches one or more whitespace characters, followed by one or more of [A-Za-z0-9_]
(?:\s\w+)? is an optional (?, zero or one) non-capturing group ((?:)) that matches a whitespace character (\s) followed by one or more of [A-Za-z0-9_] (\w+). Essentially this is to match Fruit in Alphabetical Fruit.
Example:
In [701]: text = ' Apple Banana Cucumber Alphabetical Fruit Whoops'
In [702]: re.findall(r'\s+\w+(?:\s\w+)?', text)
Out[702]:
[' Apple',
' Banana',
' Cucumber',
' Alphabetical Fruit',
' Whoops']
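Your pattern already works; just make the second part (the 'compound word' part) optional. With re.findall, use a non-capturing group so that the whole match is returned rather than only the group:
\s+\S+(?:\s\S+)?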
https://regex101.com/r/Ua8353/3/
(fixed \s{1} per @heemayl)
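A runnable sketch of the optional-group pattern with re.findall. The input string here is an assumption: items are separated by runs of three spaces, with a single space inside "Alphabetical Fruit", which is the situation the question describes.

```python
import re

# Assumed input: three spaces between items, one space inside the compound item.
text = "   Apple   Banana   Cucumber   Alphabetical Fruit   Whoops"

# \s+\S+        leading whitespace followed by a word
# (?:\s\S+)?    optionally exactly one whitespace character plus another word
# The group is non-capturing so findall returns whole matches.
matches = re.findall(r"\s+\S+(?:\s\S+)?", text)
print(matches)
```

With a capturing group, (\s\S+)?, re.findall would return only the group contents (mostly empty strings) instead of the full matches.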

How to split a string on commas or periods in nltk

I want to separate a string on commas and/or periods in nltk. I've tried with sent_tokenize() but it separates only on periods.
I've also tried this code
from nltk.tokenize.punkt import PunktSentenceTokenizer, PunktLanguageVars
ex_sent = "This is an example showing sentence filtration.This is how it is done, in case of Python I want to learn more. So, that i can have some experience over it, by it I mean python."
class CommaPoint(PunktLanguageVars):
    sent_end_chars = ('.','?','!',',')
tokenizer = PunktSentenceTokenizer(lang_vars = CommaPoint())
n_w=tokenizer.tokenize(ex_sent)
print(n_w)
The output for the code above is
['This is an example showing sentence filtration.This is how it is done,' 'in case of Python I want to learn more.' 'So,' 'that i can have some experience over it,' 'by it I mean python.\n']
When a '.' is not followed by a space, the tokenizer treats it as part of a word.
I want the output as
['This is an example showing sentence filtration.' 'This is how it is done,' 'in case of Python I want to learn more.' 'So,' 'that i can have some experience over it,' 'by it I mean python.']
How about something simpler with re:
>>> import re
>>> sent = "This is an example showing sentence filtration.This is how it is done, in case of Python I want to learn more. So, that i can have some experience over it, by it I mean python."
>>> re.split(r'[.,]', sent)
['This is an example showing sentence filtration', 'This is how it is done', ' in case of Python I want to learn more', ' So', ' that i can have some experience over it', ' by it I mean python', '']
To keep the delimiter, you can use a capturing group:
>>> re.split(r'([.,])', sent)
['This is an example showing sentence filtration', '.', 'This is how it is done', ',', ' in case of Python I want to learn more', '.', ' So', ',', ' that i can have some experience over it', ',', ' by it I mean python', '.', '']
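If the delimiter should stay attached to its sentence, as in the desired output in the question, one possible sketch uses re.findall instead of re.split. The function name split_on_commas_and_periods is made up for illustration, and this is plain re, not an nltk feature.

```python
import re


def split_on_commas_and_periods(text):
    # Each match is a run of non-delimiter characters plus the delimiter
    # that ends it (the delimiter is optional so trailing text is kept).
    pieces = re.findall(r"[^.,]+[.,]?", text)
    # Trim surrounding whitespace and drop pieces that were only whitespace.
    return [p.strip() for p in pieces if p.strip()]


sent = "This is an example.This is how it is done, in case of Python. So, that is it."
print(split_on_commas_and_periods(sent))
```

This splits on every comma and period, including a period with no space after it, but it knows nothing about abbreviations or decimal numbers.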
In this case you could replace all commas with periods in the string and then tokenize it:
from nltk.tokenize import sent_tokenize
ex_sent = "This is an example showing sentence filtration.This is how it is done, in case of Python I want to learn more. So, that i can have some experience over it, by it I mean python."
ex_sent = ex_sent.replace(",", ".")
n_w = sent_tokenize(ex_sent, 'english')
print(n_w)

Python: How can I use a regex to split sentences to new lines, and then separate punctuation from words using whitespace?

I have the following input:
input = "I love programming with Python-3.3! Do you? It's great... I give it a 10/10. It's free-to-use, no $$$ involved!"
First, every sentence should be moved to a new line. Then, all of the punctuation should be separated from the words EXCEPT for "/", " ' ", "-", "+" and "$".
So the output should be:
"I love programming with Python-3 . 3 !
Do you ?
It's great . . .
I give it a 10/10 .
It's free-to-use , no $$$ involved !"
I used the following code:
>>> import re
>>> re.sub(r"([\w/'+$\s-]+|[^\w/'+$\s-]+)\s*", r"\1 ", input)
"I love programming with Python-3 . 3 ! Do you ? It's great ... I give it a 10/10 . It's free- to-use , no $$$ involved ! "
But the problem is that it does not separate sentences into new lines. How can I use a regex to do that before I create whitespace between punctuation and characters?
([!?.])(?=\s*[A-Z])\s*
You can use this regex to split the text into sentences before applying your own regex. See the demo. Replace the match with \1\n.
https://regex101.com/r/sH8aR8/5
x="I love programming with Python-3.3! Do you? It's great... I give it a 10/10. It's free-to-use, no $$$ involved!"
print(re.sub(r"([!?.])(?=\s*[A-Z])\s*",r"\1\n",x))
EDIT:
(?<![A-Z][a-z])([!?.])(?=\s*[A-Z])\s*
Try this. See the demo for your different set of data.
https://regex101.com/r/sH8aR8/9
Something like
>>> import re
>>> from string import punctuation
>>> print(re.sub(r'(?<=['+punctuation+r'])\s+(?=[A-Z])', '\n', input))
I love programming with Python-3.3!
Do you?
It's great...
I give it a 10/10.
It's free-to-use, no $$$ involved!
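Putting the two steps together, here is a rough sketch: first break after sentence-ending punctuation that is followed by whitespace and a capital letter, then pad every punctuation character except / ' - + $ with spaces. The function name format_sentences is hypothetical, and the "capital letter starts a sentence" rule is the same heuristic used in the answers above.

```python
import re


def format_sentences(text):
    # Step 1: put each sentence on its own line. A sentence ends at ., ! or ?
    # followed by whitespace and an uppercase letter.
    text = re.sub(r"([!?.])\s+(?=[A-Z])", r"\1\n", text)

    lines = []
    for line in text.split("\n"):
        # Step 2: surround every character that is not a word character,
        # whitespace, or one of / ' + $ - with spaces.
        spaced = re.sub(r"\s*([^\w\s/'+$-])\s*", r" \1 ", line)
        # Collapse the extra spaces introduced by the padding.
        lines.append(" ".join(spaced.split()))
    return "\n".join(lines)


sample = ("I love programming with Python-3.3! Do you? It's great... "
          "I give it a 10/10. It's free-to-use, no $$$ involved!")
print(format_sentences(sample))
```

Doing the sentence split first keeps the second regex simple, since it only ever sees one sentence at a time.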

How do I replace a word in a string in python?

Let us say I have a string
c = "a string is like this and roberta a a thanks"
I want the output to be as
' string is like this and roberta thanks'
This is what I am trying
c.replace('a', ' ')
' string is like this nd robert thnks'
But this replaces each 'a' in the string
So I tried this
c.replace(' a ', ' ')
'a string is like this and roberta thanks'
But this leaves the 'a' at the start of the string.
How do I do this?
This looks like a job for re:
import re
while re.subn(r'(\s+a\s+|^a\s+)',' ',txt)[1]!=0:
    txt=re.subn(r'(\s+a\s+|^a\s+)',' ',txt)[0]
I myself figured it out.
c = "a string is like this and roberta a a thanks"
import re
re.sub('\\ba\\b', ' ', c)
' string is like this and roberta thanks'
Here you go myself! Enjoy!
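A small generalisation of the \b approach above, as a sketch: the helper name remove_word is made up, re.escape guards against regex metacharacters in the word, and the whitespace left behind is collapsed. Note that this also drops the leading space shown in the desired output.

```python
import re


def remove_word(text, word):
    # \b matches at word boundaries, so the "a" inside "and", "roberta"
    # or "thanks" is left alone.
    pattern = r"\b" + re.escape(word) + r"\b"
    without_word = re.sub(pattern, " ", text)
    # Collapse the gaps left where the word used to be.
    return " ".join(without_word.split())


c = "a string is like this and roberta a a thanks"
print(remove_word(c, "a"))
```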
