Python clean text - remove unknown characters and special characters

I would like to remove unknown words and characters from a sentence. The text is the output of a transformers model, so it sometimes produces unknown repeated words. I have to remove those words to make the sentence readable.
Input
text = "This is an example sentence 098-1832-1133 and this is another sentence.WAA-FAHHaAA. This is the third sentence WA WA WA aZZ aAD"
Expected Output
text = "This is an example sentence and this is another sentence. This is the third sentence"

Related

How to split a paragraph into sentences when they contain words such as "U.S." and "Inc."

I'm writing a celebrity trivia quiz in python that takes clues from Wikipedia.
I'm using the following code to split the paragraphs into sentences:
sentences = line.split(". ")
It works for everything except when there's a word that ends in a period in the sentence. For example, "XXX is a U.S. senator." gets incorrectly split into "XXX is a U.S."
I've created a list of exceptions where I remove the period from such words:
line = line.replace("Dr. ", "Dr ").replace("Mr. ", "Mr ").replace("Gen. ", "Gen ").replace("No. ", "No ").replace("U.S. ", "US ")
But for anything not in the list (e.g. "U.K." or "Inc."), the sentence gets stopped at the word ending in a period.
I'm not sure how else I can approach this. How can I preserve these words while still splitting into sentences?
This might work:
paragraphs = full_content.split("\n\n")
where full_content is the data you want to split into paragraphs.
You can use a list of abbreviations that end with a dot. Store the list in a file and, when a period belongs to a word from that list, skip it as a sentence boundary.
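A minimal sketch of that idea, using an inline set in place of the file (the entries shown are illustrative):
# A sketch of abbreviation-aware splitting. In practice the set would be
# loaded from the abbreviations file mentioned above.
ABBREVIATIONS = {"Dr.", "Mr.", "Gen.", "No.", "U.S.", "U.K.", "Inc."}

def split_sentences(line):
    sentences, current = [], []
    for word in line.split():
        current.append(word)
        # Only treat the period as a boundary if the word is not a
        # known abbreviation.
        if word.endswith(".") and word not in ABBREVIATIONS:
            sentences.append(" ".join(current))
            current = []
    if current:
        sentences.append(" ".join(current))
    return sentences

print(split_sentences("XXX is a U.S. senator. Dr. Smith agrees."))
# ['XXX is a U.S. senator.', 'Dr. Smith agrees.']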

How to generate alignments for word-based translation models if number of words are different in both sentences

I am working on implementing IBM Model 1. I have a parallel corpus of some 2,000,000 sentences (English to Dutch). Also, the sentences of the two docs are already aligned. The aim is to translate a Dutch sentence into English and vice-versa.
The code I am using for generating the alignments is:
from itertools import product  # product() enumerates the candidate alignments

A = pair_sent[0].split()  # split the English sentence into words
B = pair_sent[1].split()  # split the Dutch sentence into words
trips.append([zip(A, p) for p in product(B, repeat=len(A))])
Now, there are sentence pairs with an unequal number of words (like 10 in English and 14 in the Dutch translation). Our professor told us that we should use NULLs or drop a word, but I don't understand how to do that: where do I insert the NULL, and how do I choose which word to drop?
In the end, I need each pair of sentences to have an equal number of words.
The problem is not that the sentences have a different number of words. After all, IBM Model 1 computes, for each word in the source sentence, a probability distribution over all words in the target sentence, and does not care how many words the target sentence has. The problem is that there might be words that do not have a counterpart in the target sentence.
If you append a NULL word to the target sentence (no matter where, because IBM Model 1 does not consider reordering), you can also model the probability that a word has no counterpart in the target sentence.
The actual bilingual alignment is then done using a symmetrization heuristic from a pair of IBM models on both sides.
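A minimal sketch of the NULL-word preprocessing (the token spelling <NULL> is an arbitrary choice, not part of the model):
NULL = "<NULL>"  # reserved token; any symbol outside the vocabulary works

def prepare_pair(english, dutch):
    # Prepend a NULL token to the target side so every source word can
    # align to something, even when it has no real counterpart. The
    # position is irrelevant because IBM Model 1 ignores reordering.
    return english.split(), [NULL] + dutch.split()

src, tgt = prepare_pair("the house is small", "het huis is klein vandaag")
print(tgt)  # ['<NULL>', 'het', 'huis', 'is', 'klein', 'vandaag']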

Python regex - How to look for an arbitrary number of sentences after a digit?

Let's say we had the following string: "1. Sentence 1. Sentence 2? Sentence 3!".
How would I go about looking for (and returning as a string) a pattern that matches all of the following cases:
"1. Sentence 1."
"1. Sentence 1. Sentence 2?"
"1. Sentence 1. Sentence 2? Sentence 3!"
There is always a number in front of the pattern,
but there could be any number of sentences after it.
What I've tried thus far is
pattern = re.compile("\d.(\s[A-Ö][^.!?]+[.!?])+?")
and
assignmentText = "".join(pattern.findall(assignment))
where the join-method is an ugly hack used to extract the string from the list returned by findall, since list[0] doesn't seem to work (I know there will only be a single str in the list).
However, I only ever receive the first sentence, without the digit in front.
How could this be fixed?
You can use (?:(?:\d+\.\s+)?[A-Z].*?[.!?]\s*)+. The inner (?:\d+\.\s+)? optionally matches the leading number, [A-Z].*?[.!?] lazily matches one sentence, and the outer group repeats to pick up however many sentences follow.
import re
print(re.findall(r'(?:(?:\d+\.\s+)?[A-Z].*?[.!?]\s*)+', '1. Sentence 1. Sentence 2? Sentence 3!'))
This outputs:
['1. Sentence 1. Sentence 2? Sentence 3!']
Or, if you prefer separating them as 3 different items in a list:
import re
print(re.findall(r'(?:(?:\d+\.\s+)?[A-Z].*?[.!?])', '1. Sentence 1. Sentence 2? Sentence 3!'))
This outputs:
['1. Sentence 1.', 'Sentence 2?', 'Sentence 3!']

NLTK tokenize text with dialog into sentences

I am able to tokenize non-dialog text into sentences but when I add quotation marks to the sentence the NLTK tokenizer doesn't split them up correctly. For example, this works as expected:
import nltk.data
tokenizer = nltk.data.load('tokenizers/punkt/english.pickle')
text1 = 'Is this one sentence? This is separate. This is a third he said.'
tokenizer.tokenize(text1)
This results in a list of three different sentences:
['Is this one sentence?', 'This is separate.', 'This is a third he said.']
However, if I make it into a dialogue, the same process doesn't work.
text2 = '“Is this one sentence?” “This is separate.” “This is a third” he said.'
tokenizer.tokenize(text2)
This returns it as a single sentence:
['“Is this one sentence?” “This is separate.” “This is a third” he said.']
How can I make the NLTK tokenizer work in this case?
It seems the tokenizer doesn't know what to do with the directed quotes. Replace them with regular ASCII double quotes and the example works fine.
>>> import re
>>> import nltk
>>> text3 = re.sub('[“”]', '"', text2)
>>> nltk.sent_tokenize(text3)
['"Is this one sentence?"', '"This is separate."', '"This is a third" he said.']

getting words between m and n characters

I am trying to get all names that start with a capital letter and end with a full stop on the same line, where the number of characters is between 3 and 5.
My text is as follows:
King. Great happinesse
Rosse. That now Sweno, the Norwayes King,
Craues composition:
Nor would we deigne him buriall of his men,
Till he disbursed, at Saint Colmes ynch,
Ten thousand Dollars, to our generall vse
King. No more that Thane of Cawdor shall deceiue
Our Bosome interest: Goe pronounce his present death,
And with his former Title greet Macbeth
Rosse. Ile see it done
King. What he hath lost, Noble Macbeth hath wonne.
I am testing it out with an online regex tester. I am trying to get all words of between 3 and 5 characters but haven't succeeded.
Does this produce your desired output?
import re
re.findall(r'[A-Z].{2,4}\.', text)
When text contains the text in your question it will produce this output:
['King.', 'Rosse.', 'King.', 'Rosse.', 'King.']
The regex pattern matches an initial capital letter followed by any 2 to 4 characters and a literal dot. You can tighten that up if required, e.g. using [a-z]: the pattern [A-Z][a-z]{2,4}\. would match an uppercase letter followed by 2 to 4 lowercase letters and a literal dot/period.
If you don't want duplicates you can use a set to get rid of them:
>>> set(re.findall(r'[A-Z].{2,4}\.', text))
{'King.', 'Rosse.'}
You may have your own reasons for wanting to use regexes here, but Python provides a rich set of string methods and (IMO) it's easier to understand the code using these:
matched_words = []
for line in open('text.txt'):
    words = line.split()
    for word in words:
        # len(word) - 1 excludes the trailing full stop from the count
        if word[0].isupper() and word[-1] == '.' and 3 <= len(word) - 1 <= 5:
            matched_words.append(word)
print(matched_words)
