How to split string in numpy.ndarray? - python

I have a lot of text in a numpy.ndarray that looks like this:
['This is example sentence 1.|This is example sentence 2.'
'This is example sentence 3.'
'This is example sentence 4.'
'This is example sentence 5.'
'This is example sentence 6.|This is example sentence 7.|This is example sentence 8.|This is example sentence 9.|This is example sentence 10.']
The array can have a large and varying number of elements and individual elements can have many sentences separated with "|".
How do I convert the example above into this:
['This is example sentence 1.'
'This is example sentence 2.'
'This is example sentence 3.'
'This is example sentence 4.'
'This is example sentence 5.'
'This is example sentence 6.'
'This is example sentence 7.'
'This is example sentence 8.'
'This is example sentence 9.'
'This is example sentence 10.']
Basically, I'm trying to create a 1-dimensional array that will split elements with "|" into their own separate elements. I've tried many versions of split and can't get them to work for one reason or another.
Thanks!

You can try np.char.split, then flatten the resulting array of lists with np.concatenate (or np.hstack):
>>> np.concatenate(np.char.split(arr.astype(str), sep='|'))
array(['This is example sentence 1.', 'This is example sentence 2.',
'This is example sentence 3.', 'This is example sentence 4.',
'This is example sentence 5.', 'This is example sentence 6.',
'This is example sentence 7.', 'This is example sentence 8.',
'This is example sentence 9.', 'This is example sentence 10.'],
dtype='<U28')
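If you prefer to stay in plain Python (or the array has object dtype), an equivalent approach is to split each element and flatten with a list comprehension; a small sketch, assuming an input array like the one in the question:

```python
import numpy as np

arr = np.array(['This is example sentence 1.|This is example sentence 2.',
                'This is example sentence 3.'])

# Split every element on '|' and flatten the results into one 1-D array.
flat = np.array([part for element in arr for part in str(element).split('|')])
print(flat)
```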

Related

Python clean text - remove unknown characters and special characters

I would like to remove unknown words and characters from the sentence. The text is the output of a transformers model, so it sometimes produces unknown repeated words. I have to remove those words in order to make the sentence readable.
Input
text = "This is an example sentence 098-1832-1133 and this is another sentence.WAA-FAHHaAA. This is the third sentence WA WA WA aZZ aAD"
Expected Output
text = "This is an example sentence and this is another sentence. This is the third sentence"
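No answer is reproduced here, but one possible heuristic (a sketch, not from the thread; the looks_like_word helper and its rules are assumptions) is to keep only tokens that look like ordinary words and drop tokens containing digits, hyphens, or unusual capitalisation. Note that punctuation fused to a garbage token is dropped with it, so this only approximates the expected output:

```python
import re

text = ("This is an example sentence 098-1832-1133 and this is another "
        "sentence.WAA-FAHHaAA. This is the third sentence WA WA WA aZZ aAD")

def looks_like_word(token):
    # Hypothetical rule: an ordinary word is an optional capital followed
    # by lowercase letters; trailing sentence punctuation is ignored.
    return re.fullmatch(r"[A-Z]?[a-z]+", token.strip(".,!?")) is not None

cleaned = " ".join(t for t in text.split() if looks_like_word(t))
print(cleaned)
```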

Split a string into its sentences using python

I have this following string:
string = 'This is one sentence ${w_{1},..,w_{i}}$. This is another sentence. '
Now, I want to split it into two sentences.
However, when I do:
string.split('.')
I get:
['This is one sentence ${w_{1},',
'',
',w_{i}}$',
' This is another sentence',
' ']
Does anyone have an idea of how to improve this so that the "." inside the $ $ is not treated as a sentence boundary?
Also, how would you go about this:
string2 = 'This is one sentence ${w_{1},..,w_{i}}$! This is another sentence. Is this a sentence? Maybe ! '
EDIT 1:
The desired outputs would be:
For string 1:
['This is one sentence ${w_{1},..,w_{i}}$','This is another sentence']
For string 2:
['This is one sentence ${w_{1},..,w_{i}}$','This is another sentence', 'Is this a sentence', 'Maybe ! ']
For the more general case, you could use re.split like so:
import re
mystr = 'This is one sentence ${w_{1},..,w_{i}}$. This is another sentence. '
re.split(r"[.!?]\s{1,}", mystr)
# ['This is one sentence ${w_{1},..,w_{i}}$', 'This is another sentence', '']
str2 = 'This is one sentence ${w_{1},..,w_{i}}$! This is another sentence. Is this a sentence? Maybe ! '
re.split(r"[.!?]\s{1,}", str2)
# ['This is one sentence ${w_{1},..,w_{i}}$', 'This is another sentence', 'Is this a sentence', 'Maybe ', '']
The characters in the brackets are whatever you pick as your punctuation, and requiring at least one trailing whitespace character (\s{1,}) ignores the other .'s, which have no spacing after them. This also handles your exclamation-point case.
Here's a (somewhat hacky) way to get the punctuation back:
punct = re.findall(r"[.!?]\s{1,}", str2)
punct
# ['! ', '. ', '? ', '! ']
sent = [x + y for x, y in zip(re.split(r"[.!?]\s{1,}", str2), punct)]
sent
# ['This is one sentence ${w_{1},..,w_{i}}$! ', 'This is another sentence. ', 'Is this a sentence? ', 'Maybe ! ']
You can use re.findall with an alternation pattern. To ensure that the sentence starts and ends with a non-whitespace, use a positive lookahead pattern at the start and a positive lookbehind pattern at the end:
re.findall(r'((?=[^.!?\s])(?:\$.*?\$|[^.!?])*(?<=[^.!?\s]))\s*[.!?]', string)
This returns, for the first string:
['This is one sentence ${w_{1},..,w_{i}}$', 'This is another sentence']
and for the second string:
['This is one sentence ${w_{1},..,w_{i}}$', 'This is another sentence', 'Is this a sentence', 'Maybe']
Use '. ' (with a space after the .) because that only exists when a sentence ends, not mid-sentence.
string = 'This is one sentence ${w_{1},..,w_{i}}$. This is another sentence. '
string.split('. ')
This returns:
['This is one sentence ${w_{1},..,w_{i}}$', 'This is another sentence', '']

How do I nextline after a dot in a set of next lined words in .txt file

I have a problem with my code. I have a text file and inside of this text file is a thousand of tabbed/next lined words which came from a sentence. My problem is I want to revert the words inside of this text file and make it a sentence again.
I have thought of a way which is making a for loop statement and if it hits the dot . then it will store the sentence inside the list.
with open('test', 'r') as f:
    text = f.read().split()
sentence = []
sentences = []
for words in text:
    if words != ".":
        sentence.append(words)
    elif words == ".":
        sentence.append(words)
        sentences.append(sentence)
        sentence = []
#Sample output
#[['This', 'is', 'a', 'sentence', '.'], ['This', 'is', 'the', 'second', 'sentence', '.'],
#['This', 'is', 'the', 'third', 'sentence', '.']],
#This is the text file
This
is
a
sentence
.
This
is
the
second
sentence
.
This
is
thr
third
sentence
.
The code kinda works, but it's a little bit complicated. I'm looking for a shorter and less complicated approach. Thank you in advance.
This is pretty straightforward. Read from the file, split into chunks by period, split each chunk by any whitespace, rejoin the chunk with single spaces, and throw the period back at the end of the sentence. Filtering out empty chunks avoids a stray '.' entry after the final period:
sentences = [' '.join(x.split()) + '.' for x in open('test', 'r').read().split('.') if x.strip()]
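A quick end-to-end check of that approach (the file contents below are assumed to mirror the question's example; an `if x.strip()` guard skips the empty chunk after the final period):

```python
# Recreate a word-per-line file like the one in the question.
with open('test', 'w') as f:
    f.write('\n'.join(['This', 'is', 'a', 'sentence', '.',
                       'This', 'is', 'the', 'second', 'sentence', '.']))

# Rebuild sentences: split on '.', normalise whitespace, re-append the period.
sentences = [' '.join(x.split()) + '.'
             for x in open('test', 'r').read().split('.') if x.strip()]
print(sentences)
```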
You could use str.split().
For example:
text = 'First sentence. Second sentence. This is the third sentence. '
text.split('. ')[:-1]
>>> ['First sentence', 'Second sentence', 'This is the third sentence']
If you want to include the . you have to do it like this:
text = 'First sentence. Second sentence. This is the third sentence. '
split_text = [e+'.' for e in text.split('. ')][:-1]
split_text
>>> ['First sentence.', 'Second sentence.', 'This is the third sentence.']
Below is a one-liner for the same; let me know if you need more help (note that str.split takes a literal string, not a regex, so the separator is '.' rather than '\.'):
sentences = open('test', 'r').read().split('.')

Python regex - How to look for an arbitrary number of sentences after a digit?

Let's say we had the following string: "1. Sentence 1. Sentence 2? Sentence 3!".
How would I go about looking for ( and returning as a string) a pattern that matches all of the following cases:
"1. Sentence 1."
"1. Sentence 1. Sentence 2?"
"1. Sentence 1. Sentence 2? Sentence 3!"
There is always a number in front of the pattern,
but there could be any number of sentences after it.
What I've tried thus far is
pattern = re.compile(r"\d.(\s[A-Ö][^.!?]+[.!?])+?")
and
assignmentText = "".join(pattern.findall(assignment))
where the join method is an ugly hack used to extract the string from the list returned by findall, since list[0] doesn't seem to work (I know there will only be a single str in the list).
However, I only ever receive the first sentence, without the digit in front.
How could this be fixed?
You can use (?:(?:\d+\.\s+)?[A-Z].*?[.!?]\s*)+.
import re
print(re.findall(r'(?:(?:\d+\.\s+)?[A-Z].*?[.!?]\s*)+', '1. Sentence 1. Sentence 2? Sentence 3!'))
This outputs:
['1. Sentence 1. Sentence 2? Sentence 3!']
Or, if you prefer separating them as 3 different items in a list:
import re
print(re.findall(r'(?:(?:\d+\.\s+)?[A-Z].*?[.!?])', '1. Sentence 1. Sentence 2? Sentence 3!'))
This outputs:
['1. Sentence 1.', 'Sentence 2?', 'Sentence 3!']
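As for why the original attempt returned only the first sentence without the digit: when a pattern contains a capturing group, re.findall returns the group's content rather than the whole match, and the lazy +? stops after one repetition. A non-capturing group with re.search returns the full match instead. A small sketch (using [A-Z] in place of the original [A-Ö]):

```python
import re

s = '1. Sentence 1. Sentence 2? Sentence 3!'

# A repeated capturing group makes findall return the group's content,
# not the whole match; the lazy +? stops after the first sentence.
print(re.findall(r'\d\.(\s[A-Z][^.!?]+[.!?])+?', s))
# → [' Sentence 1.']

# A non-capturing group plus re.search yields the whole match instead.
m = re.search(r'\d\.(?:\s[A-Z][^.!?]+[.!?])+', s)
print(m.group(0))
# → '1. Sentence 1. Sentence 2? Sentence 3!'
```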

NLTK tokenize text with dialog into sentences

I am able to tokenize non-dialog text into sentences but when I add quotation marks to the sentence the NLTK tokenizer doesn't split them up correctly. For example, this works as expected:
import nltk.data
tokenizer = nltk.data.load('tokenizers/punkt/english.pickle')
text1 = 'Is this one sentence? This is separate. This is a third he said.'
tokenizer.tokenize(text1)
This results in a list of three different sentences:
['Is this one sentence?', 'This is separate.', 'This is a third he said.']
However, if I make it into a dialogue, the same process doesn't work.
text2 = '“Is this one sentence?” “This is separate.” “This is a third” he said.'
tokenizer.tokenize(text2)
This returns it as a single sentence:
['“Is this one sentence?” “This is separate.” “This is a third” he said.']
How can I make the NLTK tokenizer work in this case?
It seems the tokenizer doesn't know what to do with the directed quotes. Replace them with regular ASCII double quotes and the example works fine.
>>> import re
>>> text3 = re.sub('[“”]', '"', text2)
>>> nltk.sent_tokenize(text3)
['"Is this one sentence?"', '"This is separate."', '"This is a third" he said.']
