Remove repeating characters from a sentence but retain the words' meaning - python

I want to remove repeated characters from a sentence while keeping each word's meaning (if it has any). For example, I'm so haaappppyyyy about offline school
should become I'm so happy about offline school. Here haaappppyyyy became happy, while offline and school stay the same instead of becoming ofline and schol.
I've tried two solutions, using re and itertools, but neither does what I'm looking for.
Using regex:
import re
tweet = "I'm so haaappppyyyy about offline school"
repeat_char = re.compile(r"(.)\1{1,}", re.IGNORECASE)
tweet = repeat_char.sub(r"\1\1", tweet)
tweet = re.sub("(.)\\1{2,}", "\\1", tweet)
Output:
I'm so haappyy about offline school # every run of repeated chars is cut to 2 chars
Using itertools:
import itertools
tweet = "I'm so happy about offline school"
tweet = ''.join(ch for ch, _ in itertools.groupby(tweet))
Output:
I'm so hapy about ofline schol
How can I fix this? Should I make a list of words I want to exclude?
In addition, I also want to reduce words that repeat a pattern to their base form. For example:
wkwk (base form)
wkwkwkwk
wkwkwkwkwkwkwk
I want to turn the second and third words into the first word, the base form.

You can combine regex and NLP here: iterate over all words in the string, and when you find one with three or more identical consecutive letters, reduce each such run to two occurrences and run an automatic spellcheck on the result to fix the spelling.
See this example Python code:
import re
from textblob import Word
tweet = "I'm so haaappppyyyy about offline school"
rx = re.compile(r'([^\W\d_])\1{2,}') # a letter repeated three or more times in a row
print( re.sub(r'[^\W\d_]+', lambda x: Word(rx.sub(r'\1\1', x.group())).correct() if rx.search(x.group()) else x.group(), tweet) )
# => "I'm so happy about offline school"
The code uses the TextBlob library, but you may use any spellchecker you like.
Note that ([^\W\d_])\1{2,} matches a letter repeated three or more times in a row, and [^\W\d_]+ matches one or more letters.

This answer was originally written for "Regex to reduce repeated chars in a string", which was closed as a duplicate before I could submit my post, so I "recycled" it here.
Regex is not always the best solution
Regex for validation of formats or input
A regex is often used for low-level pattern recognition and substitution.
It may be useful for validating formats. You can see it as "dumb" automation.
Linguistics (NLP)
When it comes to natural language processing (NLP), or here spelling (dictionary lookup), semantics may play a role. Depending on the context, "ass" and "as" may both be correctly spelled, although their meanings are very different.
(I apologize for the rude example, but I am not a native speaker and those two had the most distinct meanings depending on reduplication.)
For those cases a regex or simple pattern recognition may not be sufficient. Applying it correctly can take more effort than researching a language-specific library or solution (including a basic application).
Examples for spelling that a regex may struggle with
Consider the difference between "haappy" (misspelled, but only because of the duplicated vowel "aa", not the consonants "pp"), "yeees" (whose correct spelling contains no duplicates), and "kiss" (correctly spelled with duplicate consonants).
Spelling correction requires more
For example, a dictionary to look up whether duplicate characters (vowels or consonants) are valid in the correct spelling of the word.
Consider a spelling-correction module
You could use the textblob module for spelling correction:
To install:
pip install textblob
Example for some test-cases (independent words):
from textblob import TextBlob
incorrect_words = ["cmputr", "yeees", "haappy"] # incorrect spelling
text = ",".join(incorrect_words) # join them as comma separated list
print(f"original words: {text}")
b = TextBlob(text)
# prints the corrected spelling
print(f"corrected words: {b.correct()}")
Prints:
original words: cmputr,yeees,haappy
corrected words: computer,eyes,happy
Surprise: you might have expected "yes" (so did I). But the correction did not remove the two extra "e"s; it rearranged the letters to keep almost all of them (4 of 5, dropping only one "e").
Example for the given sentence:
from textblob import TextBlob
tweet = "I'm so haaappppyyyy about offline school" # either escape or use different quotes when a single-quote (') is enclosed
print(TextBlob(tweet).correct())
Prints:
I'm so haaappppyyyy about office school
Unfortunately, this is considerably worse:
not "happy"
semantically out of scope, with "office" instead of "offline"
Apparently a preceding cleaning step using regex, as Wiktor suggests, may improve the result.
See also:
Stackabuse: Spelling Correction in Python with TextBlob, tutorial
documentation: TextBlob: Simplified Text Processing

Well, first of all you need a list (or set) of all allowed words to compare against.
I'd approach it with the assumption (which might be wrong) that no word contains a run of more than two repeated characters. So for each word, generate a list of all potential candidates; for example, "haaappppppyyyy" would yield ["haappyy", "happyy", "happy", etc.]. Then it's just a matter of checking which of those candidates actually exists in the allowed word list.
The time complexity of this is quite high, though, so if it needs to be fast, keep the allowed words in a hash table or something similar :)
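The candidate idea above can be sketched roughly as follows. This is only a minimal illustration: ALLOWED_WORDS is a tiny made-up stand-in for a real dictionary, and candidates() and normalize() are hypothetical helper names, not part of any library. Each run of a repeated letter is reduced to either one or two letters, and the first candidate found in the set wins.

```python
from itertools import groupby, product

# Hypothetical stand-in for a real dictionary; swap in a full word list.
ALLOWED_WORDS = {"happy", "offline", "school", "so", "about"}


def candidates(word):
    """Yield every spelling where each run of a repeated letter is cut to 1 or 2."""
    runs = [(ch, len(list(group))) for ch, group in groupby(word)]
    # A single letter can only stay as it is; a longer run may become 1 or 2 letters.
    options = [[ch] if n == 1 else [ch, ch * 2] for ch, n in runs]
    for combo in product(*options):
        yield "".join(combo)


def normalize(word, allowed=ALLOWED_WORDS):
    """Return the first candidate found in the allowed set, else the word unchanged."""
    if word in allowed:
        return word
    for cand in candidates(word):
        if cand in allowed:
            return cand
    return word


print(normalize("haaappppyyyy"))  # happy
print(normalize("offline"))       # offline
```

Because the allowed words live in a set, each lookup is a hash lookup; the expensive part is the number of candidates, which doubles with every repeated run in the word. Note that when several candidates are valid words, this sketch simply returns the first one it generates.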

Related

How to replace misspelled words with words from dictionary while ignoring text reference codes?

Topic modelling case here. I've loaded my first round of preprocessed text data into a document-term matrix. Looking at the DTM, however, I realized that there were words like 'aacc', 'aacct', 'aaccount' and a few other variations that basically all mean account. Is there a way to replace those variations with the word account? I've tried the following code:
from spellchecker import SpellChecker
spell = SpellChecker()
# find those words that may be misspelled
misspelled = spell.unknown(['aacc'])
for word in misspelled:
    # Get the one `most likely` answer
    print(spell.correction(word))
    # Get a list of `likely` options
    print(spell.candidates(word))
but it doesn't actually output the word 'account'. It's a little tricky, as I want to replace actual misspelled words using a dictionary while ignoring other words that appear to be misspelled but are just reference codes.
I'm also trying to remove duplicate characters in a sentence string, e.g. 'tthe aabove aand areply tthankss'.
Hope I'm clear enough, thank you in advance.

How to capitalize every beginning of a sentence in a text in python? [duplicate]

This question already has answers here:
How to capitalize the first letter of every sentence?
(15 answers)
Closed 2 years ago.
I want to create a function that takes a string of text as input and capitalizes every letter that comes after a punctuation mark. The thing is, strings don't work like lists, so I don't really know how to do it. I tried this, but it doesn't seem to work:
def capitalize(strin):
    listrin=list(strin)
    listrin[0]=listrin[0].upper()
    ponctuation=['.','!','?']
    strout=''
    for x in range (len(listrin)):
        if listrin[x] in ponctuation:
            if x!=len(listrin):
                if listrin[x+1]!=" ":
                    listrin[x+1]=listrin[x+1].upper()
                elif listrin[x+2]!=" ":
                    listrin[x+1]=listrin[x+1].upper()
    for y in range(len(listrin)):
        strout=strout+listrin[y]
    return strout
For now, I am trying to solve it with this string: 'hello! how are you? please remember capitalization. EVERY time.'
I use regexp to do this.
>>> import re
>>> line = 'hi. hello! how are you? fine! me too, haha. haha.'
>>> re.sub(r"(?:^|(?:[.!?]\s+))(.)",lambda m: m.group(0).upper(), line)
'Hi. Hello! How are you? Fine! Me too, haha. Haha.'
The most basic approach is to split the text into sentences at the punctuation, which gives you a list. Then loop over the items of the list, strip() them and capitalize() them. Something like the following might solve your problem:
import re
input_sen = 'hello! how are you? please remember capitalization. EVERY time.'
# one possible pattern: split on the whitespace after ., ! or ? so each sentence keeps its punctuation
sentence = re.split(r'(?<=[.!?])\s+', input_sen)
for i in sentence:
    print(i.strip().capitalize(), end=' ')
However, it is better to use the nltk library:
from nltk.tokenize import sent_tokenize
input_sen = 'hello! how are you? please remember capitalization. EVERY time.'
sentences = sent_tokenize(input_sen)
sentences = [sent.capitalize() for sent in sentences]
print(sentences)
It is better to use NLTK or some other NLP library than to write rules and regexes by hand, because it takes care of many cases that we don't account for.
It solves the problem of sentence boundary disambiguation.
Sentence boundary disambiguation (SBD), also known as sentence
breaking, is the problem in natural language processing of deciding
where sentences begin and end. Often natural language processing tools
require their input to be divided into sentences for a number of
reasons. However sentence boundary identification is challenging
because punctuation marks are often ambiguous. For example, a period
may denote an abbreviation, decimal point, an ellipsis, or an email
address – not the end of a sentence. About 47% of the periods in the
Wall Street Journal corpus denote abbreviations. As well, question
marks and exclamation marks may appear in embedded quotations,
emoticons, computer code, and slang. Languages like Japanese and
Chinese have unambiguous sentence-ending markers.
Hope it helps.

Identify Visually Similar Strings in Python

I am working on a Python project in which I need to filter profane words, and I already have a filter in place. The only problem is that if a user swaps a character for a visually similar one (e.g. hello and h311o), the filter does not pick it up. Is there some way that I could detect these words without hard-coding every combination?
What about translating l331sp33ch to leetspeech and applying a simple Levenshtein distance? (You need to pip install editdistance first.)
import editdistance
try:
    from string import maketrans # python 2
except ImportError:
    maketrans = str.maketrans # python 3
t = maketrans("01345", "oleas")
editdistance.eval("h3110".translate(t), 'hello')
results in 0
Maybe build a mapping between the visually similar characters and the letters they can represent, i.e.
substitutions = {'3': 'e', '1': 'l', '0': 'o'} #etc....
and then you can use this to test against your database of forbidden words.
e.g.
input: he11
Check whether each character has an entry in substitutions:
substitutions['h'] #does not exist
substitutions['e'] #does not exist
substitutions['1'] = 'l'
substitutions['1'] = 'l'
Put the results together to form a word and then search your forbidden list. I don't know if this is the fastest way of doing it, but it is "a" way.
I'm interested to see what others come up with.
*disclaimer: I've done a year or so of Perl and am starting out learning Python right now. When I get the time. Which is very hard to come by.
Linear Replacement
You will want something adaptable to innovative orthographers. For a start, pattern-match the alphabetic characters to your lexicon of banned words, using other characters as wild cards. For instance, your example would get translated to "h...o", which you would match to your proposed taboo word, "hello".
Next, you would compare the non-alpha characters to a dictionary of substitutions, allowing common wild-card chars to stand for anything. For instance, asterisk, hyphen, and period could stand for anything; '4' and '#' could stand for 'A', and so on. However, you'll do this checking from the strength of the taboo word, not from generating all possibilities: the translation goes the other way.
You will have a little ambiguity, as some characters stand for multiple letters: "#" can be used in place of 'O' if you're getting crafty. Also note that not all the characters will be in your usual set: you'll want to deal with monetary symbols (Euro, Yen, and Pound are all derived from letters), as well as foreign letters that happen to resemble Latin letters.
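As a rough sketch of the matching direction described above (starting from the taboo word rather than generating every variant), something like the following could work. The BANNED list, the SUBSTITUTES table and the ANY_LETTER wildcards are made-up sample data, and both function names are hypothetical.

```python
# Hypothetical sample data, for illustration only.
BANNED = ["hello"]
# For each letter of a banned word, the look-alike characters that may replace it.
SUBSTITUTES = {"e": "3", "l": "1!|", "o": "0#", "a": "4@#", "s": "5$"}
ANY_LETTER = "*-."  # wildcard characters that may stand for any letter


def matches_banned(token, banned_word):
    """Compare token to one banned word position by position."""
    if len(token) != len(banned_word):
        return False
    for got, want in zip(token.lower(), banned_word):
        if got == want or got in ANY_LETTER:
            continue  # same letter, or a wildcard
        if got in SUBSTITUTES.get(want, ""):
            continue  # an accepted look-alike for this letter
        return False
    return True


def is_banned(token):
    """True if the token matches any banned word under the rules above."""
    return any(matches_banned(token, word) for word in BANNED)


print(is_banned("h311o"))  # True
print(is_banned("h*llo"))  # True
print(is_banned("hallo"))  # False
```

Since "#" appears under both "o" and "a", the ambiguity is handled for free: each position is only checked against the letter the banned word expects there. A token made entirely of wildcards would match every banned word of the same length, so a real filter would probably require a minimum number of literal letters.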
Multi-character replacements
That handles only the words that have the same length as the taboo word. Can you also handle abbreviations? There are a lot of combinations of the form "h-bomb", where the banned word appears as the first letter only: the effect is profane, but the match is more difficult, especially where the 'b's are replaced with a scharfes S (German), the 'm' with a Hebrew or Cyrillic character, and the 'o' with anything round from the entire font.
Context
There is also the problem that some words are perfectly legitimate in one context, but profane in a slang context. Are you also planning to match phrases, perhaps parsing a sentence for trigger words?
Training a solution
If you need a comprehensive solution, consider training a neural network with phrases and words you label as "okay" and "taboo", and let it run for a day. This can take a lot of the adaptation work off your shoulders, and enhancing the model isn't a difficult problem: add your new differentiating text and continue the training from the point where you left off.
Thank you to all who posted an answer to this question. More answers are welcome, as they may help others. I ended up going off of David Zemens' comment on the question.
I'd use a dictionary or list of common variants ("sh1t", etc.) which you could persist as a plain text file or json etc., and read in to memory. This would allow you to add new entries as needed, independently of the code itself. If you're only concerned about profanities, then the list should be reasonably small to maintain, and new variations unlikely. I've used a hard-coded dict to represent statistical t-table (with 1500 key/value pairs) in the past, seems like your problem would not require nearly that many keys.
While this still means that all the words will be hard-coded, it will allow me to update the list more easily.
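For anyone who wants to try the same approach, here is a minimal sketch of keeping the variants in a JSON file and loading them at start-up. The file name, its contents and the helper names are assumptions made up for this example; a temporary directory is used only so the snippet runs on its own.

```python
import json
import os
import tempfile

# Hypothetical variants file: each known variant maps to the word it stands for.
variants = {"h3llo": "hello", "he11o": "hello"}
path = os.path.join(tempfile.mkdtemp(), "variants.json")
with open(path, "w", encoding="utf-8") as fh:
    json.dump(variants, fh)


def load_variants(path):
    """Read the variant-to-word mapping from a JSON file."""
    with open(path, encoding="utf-8") as fh:
        return json.load(fh)


def contains_banned(text, variants, banned):
    """True if any token, after variant lookup, is in the banned set."""
    for token in text.lower().split():
        token = token.strip(".,!?")
        if variants.get(token, token) in banned:
            return True
    return False


loaded = load_variants(path)
print(contains_banned("well h3llo there", loaded, {"hello"}))  # True
print(contains_banned("good morning", loaded, {"hello"}))      # False
```

New variants can then be added by editing the JSON file, without touching the code.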

words.words() from nltk corpus seemingly contains strange non-valid words

This code loops through every word in words.words() from the nltk library and pushes the lowercased word into an array. It then checks every word in the array against the same word list, and somehow many of them are strange words that aren't real at all, like "adighe". What's going on here?
import nltk
from nltk.corpus import words
test_array = []
for i in words.words():
    i = i.lower()
    test_array.append(i)
for i in test_array:
    if i not in words.words():
        print(i)
I don't think there's anything mysterious going on here. The first such example I found is "Aani", "the dog-headed ape sacred to the Egyptian god Thoth". Since it's a proper noun, "Aani" is in the word list and "aani" isn't.
According to dictionary.com, "Adighe" is an alternative spelling of "Adygei", which is another proper noun meaning a region of Russia. Since it's also a language I suppose you might argue that "adighe" should also be allowed. This particular word list will argue that it shouldn't.
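The effect is easy to reproduce without downloading the corpus. In the sketch below, word_list is a tiny made-up stand-in for words.words(); the real list is far larger, but the case-sensitive membership test behaves the same way.

```python
# Made-up stand-in for words.words(); only the capitalization matters here.
word_list = ["Aani", "Adighe", "apple", "zebra"]

# A set makes each membership test a hash lookup instead of a full list scan.
known = set(word_list)

# Lowercasing a proper noun produces a string that is not in the list.
missing = [w for w in word_list if w.lower() not in known]
print(missing)  # ['Aani', 'Adighe']

# Comparing both sides in lowercase removes the mismatch.
known_lower = {w.lower() for w in word_list}
print(all(w.lower() in known_lower for w in word_list))  # True
```

Building the set once also avoids calling words.words() inside the loop, which is what makes the original code slow.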

Justadistraction: tokenizing English without whitespaces. Murakami SheepMan

I wondered how you would go about tokenizing strings in English (or other western languages) if whitespaces were removed?
The inspiration for the question is the Sheep Man character in the Murakami novel 'Dance Dance Dance'
In the novel, the Sheep Man is translated as saying things like:
"likewesaid, we'lldowhatwecan. Trytoreconnectyou, towhatyouwant," said the Sheep Man. "Butwecan'tdoit-alone. Yougottaworktoo."
So, some punctuation is kept, but not all. Enough for a human to read, but somewhat arbitrary.
What would be your strategy for building a parser for this? Common combinations of letters, syllable counts, conditional grammars, look-ahead/behind regexps etc.?
Specifically, python-wise, how would you structure a (forgiving) translation flow? Not asking for a completed answer, just more how your thought process would go about breaking the problem down.
I ask this in a frivolous manner, but I think it's a question that might get some interesting (nlp/crypto/frequency/social) answers.
Thanks!
I actually did something like this for work about eight months ago. I just used a dictionary of English words in a hashtable (for O(1) lookup times). I'd go letter by letter matching whole words. It works well, but there are numerous ambiguities. (asshit can be ass hit or as shit). To resolve those ambiguities would require much more sophisticated grammar analysis.
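A minimal sketch of that dictionary-driven splitting, which also makes the ambiguity visible by returning every possible split. The WORDS set is a tiny made-up dictionary and segmentations() is a hypothetical helper name; a real version would load a full word list.

```python
from functools import lru_cache

# Hypothetical tiny dictionary; a real one would hold a full English word list.
WORDS = frozenset({"like", "we", "said", "no", "now", "here", "where", "nowhere"})


def segmentations(text, words=WORDS):
    """Return every way to split text into dictionary words."""

    @lru_cache(maxsize=None)
    def split_from(start):
        # Reaching the end of the text means one complete (empty) split.
        if start == len(text):
            return [[]]
        results = []
        for end in range(start + 1, len(text) + 1):
            piece = text[start:end]
            if piece in words:
                for rest in split_from(end):
                    results.append([piece] + rest)
        return results

    return split_from(0)


print(segmentations("likewesaid"))  # [['like', 'we', 'said']]
print(segmentations("nowhere"))     # [['no', 'where'], ['now', 'here'], ['nowhere']]
```

The cache keeps repeated suffixes from being re-split, but the number of results can still grow quickly on long inputs, so something has to rank or prune the alternatives.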
First of all, I think you need a dictionary of English words -- you could try some methods that rely solely on some statistical analysis, but I think a dictionary has better chances of good results.
Once you have the words, you have two possible approaches:
You could categorize the words into grammar categories and use a formal grammar to parse the sentences -- obviously, you would sometimes get no match or multiple matches -- I'm not familiar with techniques that would allow you to loosen the grammar rules in case of no match, but I'm sure there must be some.
On the other hand, you could just take some large corpus of English text and compute the relative probabilities of certain words appearing next to each other, giving you a list of pairs and triples of words. Since that data structure would be rather big, you could use word categories (grammatical and/or based on meaning) to simplify it. Then you just build an automaton and choose the most probable transitions between the words.
I am sure there are many more possible approaches. You can even combine the two I mentioned, building some kind of grammar with weight attached to its rules. It's a rich field for experimenting.
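To make the probability idea a bit more concrete, here is a toy sketch that ranks alternative splits by how likely each word is to follow the previous one. All counts are invented for the example (a real system would collect them from a large corpus), and add-one smoothing is just one simple assumption for handling unseen pairs.

```python
import math

# Invented counts, standing in for statistics gathered from a large corpus.
UNIGRAMS = {"we": 60, "are": 55, "no": 50, "now": 40, "here": 30, "where": 20, "nowhere": 5}
BIGRAMS = {("we", "are"): 30, ("are", "now"): 8, ("now", "here"): 6}


def log_score(words):
    """Sum of log P(word | previous word), with add-one smoothing."""
    vocab = len(UNIGRAMS)
    total = 0.0
    for prev, cur in zip(words, words[1:]):
        pair_count = BIGRAMS.get((prev, cur), 0)
        total += math.log((pair_count + 1) / (UNIGRAMS.get(prev, 0) + vocab))
    return total


# Three possible readings of "wearenowhere".
candidates = [
    ["we", "are", "now", "here"],
    ["we", "are", "no", "where"],
    ["we", "are", "nowhere"],
]
best = max(candidates, key=log_score)
print(best)  # ['we', 'are', 'now', 'here']
```

The sketch ignores the probability of the first word and compares splits of different lengths directly, both of which a more careful model would need to address.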
I don't know if this is of much help to you, but you might be able to make use of this spelling corrector in some way.
This is just some quick code I wrote that I think would work fairly well to extract words from a snippet like the one you gave. It's not fully thought out, but I think something along these lines would work if you can't find a pre-packaged solution:
textstring = '''"likewesaid, we'lldowhatwecan. Trytoreconnectyou, towhatyouwant," said the Sheep Man. "Butwecan'tdoit-alone. Yougottaworktoo."'''
english_words = {"like", "we", "said"} # stand-in (assumption) for a real English dictionary or API lookup
indiv_characters = list(textstring) #splits string into individual characters
teststring = ''
sequential_indiv_word_list = []
for cur_char in indiv_characters:
    teststring = teststring + cur_char
    # test teststring against the English dictionary to see if it exists as an entry
    if teststring in english_words:
        sequential_indiv_word_list.append(teststring)
        teststring = ''
#at the end just assemble a sentence from the pieces of sequential_indiv_word_list by putting a space between each word
There are some more issues to work out. For example, if the dictionary never returns a match, this would not work, because it would just keep adding characters. However, since your demo string has some spaces, you could recognize these too and automatically start over at each one.
You also need to account for punctuation with conditionals like
if cur_char == ',' or cur_char == '.':
    teststring = '' # start a new "word" automatically
