All,
I have some text that I need to clean up and I have a little algorithm that "mostly" works.
def removeStopwords(self, data):
    with open(r'stopwords.txt') as stopwords:
        wordList = []
        for i in stopwords:
            wordList.append(i.strip())
    charList = list(data)
    cat = ''.join(char for char in charList if char not in wordList).split()
    return ' '.join(cat)
Take the first line on this page, http://en.wikipedia.org/wiki/Paragraph, and remove all the characters that we are not interested in, which in this case are all the non-alphanumeric characters.
A paragraph (from the Greek paragraphos, "to write beside" or "written beside") is a self-contained unit of a discourse in writing dealing with a particular point or idea. A paragraph consists of one or more sentences.[1][2] The start of a paragraph is indicated by beginning on a new line. Sometimes the first line is indented. At various times, the beginning of a paragraph has been indicated by the pilcrow: ¶.
The output looks pretty good except that some of the words are recombined incorrectly and I am unsure how to correct it.
A paragraph from the Greek paragraphos to write beside or written beside is a selfcontained unit
Note the word "selfcontained" was "self-contained".
EDIT: Contents of the stopwords file, which is just a bunch of characters:
!
$
%
^
,
&
*
(
)
{
}
[
]
<
,
.
/
|
\
?
~
`
:
;
"
Turns out I don't need a list of words at all, because I was only really trying to remove characters, which in this case were punctuation marks.
import string

cat = ''.join(data.translate(None, string.punctuation)).split()
print ' '.join(cat).lower()
In Python 2.x:
line = 'hello!'
line.translate(None, '!$%')  # 'hello'
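For what it's worth, in Python 3 str.translate no longer takes two arguments; the same idea would look something like this (a sketch, not part of the original post):
import string

# Python 3: build a translation table that maps every punctuation character
# to None (i.e. deletes it), then translate, split and re-join.
table = str.maketrans('', '', string.punctuation)
cat = 'hello! $%world'.translate(table).split()
print(' '.join(cat).lower())  # hello world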
answers
Load your stopwords/stopchars in a separate function.
Don't hard-code file names/paths.
Your wordList should be a set, not a list.
However, if you are working with chars, not words, investigate str.translate.
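Putting those suggestions together, a minimal sketch might look like this (the file name is just an example):
def load_stopchars(path='stopwords.txt'):
    # one character (or word) per line; a set gives O(1) membership tests
    with open(path) as f:
        return {line.strip() for line in f if line.strip()}

def remove_stopchars(data, stopchars):
    # drop the unwanted characters, then normalise the whitespace
    return ' '.join(''.join(c for c in data if c not in stopchars).split())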
One way to go would be to use the replace method and have an exhaustive list of characters you don't want.
For example:
c = ['a', 'h']
a = 'john'
for item in c:
    a = a.replace(item, '')
    print a
prints the following:
john
jon
Related
Given a string of sentences, I need to extract a list of all of the sentences which start and end with the same word.
e.g.
# sample text
text = "This is a sample sentence. well, I'll check that things are going well. another sentence starting with another. ..."
# required result
[
"well, I'll check that things are going well",
"another sentence starting with another"
]
How can I make the match using back references and also capture the full sentence?
I have tried the following regex but it's not working.
re.findall("^[a-zA-Z](.*[a-zA-Z])?$", text)
text = "This is a sample sentence. going to checking whether it is well going. another
sentence starting with another."
sentences = re.split('[.!?]+', text)
result = []
for s in sentences:
words = s.split()
if len(words) > 0 and words[0] == words[-1]:
result.append(s.strip())
print(result)
You could try using a backreference to reuse the match...
import re
# sample text
text = "This is a sample sentence. Well, I'll check that things are going well. another sentence starting with another. ..."
print([match[0] for match in re.findall(r"((\b\w+\b)[^.?!]+\2[.?!])", text, re.IGNORECASE)])
This prints...
["Well, I'll check that things are going well.", 'another sentence starting with another.']
Note: I changed the case of the first "well" to "Well" for testing purposes.
I have some code where I extract bigrams from a large corpus and concatenate/merge them to get unigrams: 'may', 'be' --> maybe. The corpus contains, of course, a lot of punctuation, but I also discovered that it contains other characters such as emojis... My plan was to put the punctuation marks in a list, and if those characters are not in a line, print the line. Maybe I should change my approach and only print the lines containing letters and no other characters, since I don't know what kinds of characters are in the corpus. How can this be done? I do need to keep these other characters for the first part of the code, so that bigrams that don't actually exist are printed. The last lines of my code are at the moment:
import collections

counted = collections.Counter(grams)
for gram, count in sorted(counted.items()):
    s = ''
    print(s.join(gram))
And the output I get is:
!aku
!bet
!brå
!båda
These lines won't be of any use for me... Would really appreciate some help! :)
If you want to check that each string contains only letters you can probably use the isalpha() method.
>>> '!båda'.isalpha()
False
>>> 'båda'.isalpha()
True
As you can see from the example, this method should recognize any Unicode letter, not just ASCII.
To filter out strings that contain a non-letter character, the code can check for the existence of non-letter characters in each string:
# coding=utf-8
import string
import unicodedata

source_strings = [u'aku', u'bet', u'brå', u'båda', u'!båda']
valid_chars = set(string.ascii_letters)
valid_strings = [s for s in source_strings if
                 set(unicodedata.normalize('NFKD', s).encode('ascii', 'ignore')) <= valid_chars]

# valid_strings == [u'aku', u'bet', u'brå', u'båda']
# "!båda" was not included.
You can use the unicodedata module to classify the characters:
import unicodedata

unigram = ''.join(gram)
if all(unicodedata.category(char) == 'Ll' for char in unigram):
    print(unigram)
If you only want to remove certain characters from your lines, you can filter them out with a simple replace before processing each line:
sourceList = ['!aku', '!bet', '!brå', '!båda']
newList = []
for word in sourceList:
    for special in ['!', '&', 'å']:
        word = word.replace(special, '')
    newList.append(word)
Then you can do what is needed for your bigram exercise. Hope this helps.
Second query: if you have lots of different characters in your strings, you can always use isalpha():
sourceList = ['!aku', '!bet', 'nor mal alpha', '!brå', '!båda']
newList = [word for word in sourceList if word.isalpha()]
In this case you only keep the strings made up entirely of letters. Hope this clarifies the second query.
I just got a giant 1.4M-line dictionary for other programming uses, and I'm sad to see that Notepad++ is not powerful enough to handle the parsing job. The dictionary contains three types of lines:
<ar><k>-aaltoiseen</k>
yks.ill..ks. <kref>-aaltoinen</kref></ar>
yks.nom. -aaltoinen; yks.gen. -aaltoisen; yks.part. -aaltoista; yks.ill. -aaltoiseen; mon.gen. -aaltoisten -aaltoisien; mon.part. -aaltoisia; mon.ill. -aaltoisiinesim. Lyhyt-, pitkäaaltoinen.</ar>
and I want to extract every word from it into a list of words without duplicates. Let's start with my code.
f = open('dic.txt')
p = open('parsed_dic.txt', 'r+')
lines = f.readlines()
for line in lines:
    # <ar><k> lines
    # <kref> lines
    # lines ending with ";"
    for word in listofwordsfromaline:
        p.write(word + "\n")
f.close()
p.close()
I'm not particularly asking you how to do this whole thing, but anything would be helpful. A link to a tutorial or to one type of line-parsing method would be highly appreciated.
For the first two cases, you can see that every word starts and ends with a specific tag; if we look closely, every word has a ">-" string preceding it and a "</" string following it:
# First and second cases
start = line.find(">-") + 1   # +1 keeps the leading "-" of the word
end = line.find("</")
required_word = line[start:end]
In the last case you can use the split method:
word_lst = line.split(";")
ans = []
for word in word_list:
start = word.find("-")
ans.append(word[start:])
ans = set(ans)
First, find what defines a word for you.
Then make a regular expression to capture those matches. For example, the word break '\b' will match word boundaries (non-word characters).
https://docs.python.org/2/howto/regex.html
If the word definition in each type of line is different, match the line type first with if statements, then apply the corresponding regular-expression match for the word, and so on.
Match groups in Python
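As a rough illustration of that approach (Python 3; the word pattern is an assumption, not something given in the question):
import re

# Treat a "word" as an optional leading hyphen followed by letters, and
# strip the <...> tags first so they are not picked up as words.
word_pattern = re.compile(r'-?[^\W\d_]+')

unique_words = set()
with open('dic.txt', encoding='utf-8') as f:
    for line in f:
        line = re.sub(r'<[^>]+>', ' ', line)  # drop <ar>, <k>, <kref>, </ar>, ...
        unique_words.update(word_pattern.findall(line))

with open('parsed_dic.txt', 'w', encoding='utf-8') as p:
    for word in sorted(unique_words):
        p.write(word + '\n')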
I have a file with thousands of sentences, and I want to find the sentence containing a specific character/word.
Originally, I was tokenizing the entire file (using sent_tokenize) and then iterating through the sentences to find the word. However, this is too slow. Since I can quickly find the indices of the words, can I use this to my advantage? Is there a way to just tokenize an area around a word (i.e. figure out which sentence contains a word)?
Thanks.
Edit: I'm in Python and using the NLTK library.
What platform are you using? On unix/linux/macOS/cygwin, you can do the following:
sed 's/[\.\?\!]/\n/' < myfile | grep 'myword'
Which will display just the lines containing your word (and the sed will get a very rough tokenisation into sentences). If you want a solution in a particular language, you should say what you're using!
EDIT for Python:
The following will work---it only calls the tokenization if there's a regexp match on your word (this is a very fast operation). This will mean you only tokenize lines that contain the word you want:
import re
import os.path
from nltk.tokenize import sent_tokenize

myword = 'using'
fname = os.path.abspath('path/to/my/file')
try:
    f = open(fname)
    matching_lines = list(l for l in f if re.search(r'\b' + myword + r'\b', l))
    for match in matching_lines:
        # do something with matching lines
        sents = sent_tokenize(match)
except IOError:
    print "Can't open file " + fname
finally:
    f.close()
Here's an idea that might speed up the search. You create an additional list in which you store the running total of the word counts for each sentence in your big text. Using a generator function that I learned from Alex Martelli, try something like:
def running_sum(a):
    tot = 0
    for item in a:
        tot += item
        yield tot

from nltk.tokenize import sent_tokenize

sen_list = sent_tokenize(bigtext)
wc = [len(s.split()) for s in sen_list]
runningwc = list(running_sum(wc))  # list of the word count for each sentence (running total for the whole text)

word_index =  # some number that you get from word index
for index, w in enumerate(runningwc):
    if w > word_index:
        sentnumber = index - 1  # found the index of the sentence that contains the word
        break

print sen_list[sentnumber]
Hope the idea helps.
UPDATE: If sent_tokenize is what is slow, then you can try avoiding it altogether. Use the known index to find the word in your big text.
Now, move forward and backward, character by character, to detect where sentences end and start. Something like "[.!?] " (a period, exclamation mark or question mark, followed by a space) would signify a sentence end and the start of the next one. You will only be searching in the vicinity of your target word, so it should be much faster than sent_tokenize.
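A small sketch of that idea (the function name and the boundary handling are only illustrative; it will still trip over abbreviations like "Dr."):
def sentence_around(text, idx):
    # scan backward from idx for the nearest sentence-ending punctuation ...
    start = max(text.rfind(ch, 0, idx) for ch in '.!?') + 1
    # ... and forward for the next one (falling back to the end of the text)
    ends = [text.find(ch, idx) for ch in '.!?']
    ends = [e for e in ends if e != -1]
    end = min(ends) + 1 if ends else len(text)
    return text[start:end].strip()

text = "First sentence. The word we want is here. Last sentence."
print(sentence_around(text, text.index('want')))  # The word we want is here.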
I have written the following code to tokenize the input paragraph that comes from the file samp.txt. Can anybody help me out to find and print the number of sentences, words and characters in the file? I have used NLTK in python for this.
>>> import nltk.data
>>> import nltk.tokenize
>>> f = open('samp.txt')
>>> raw = f.read()
>>> tokenized_sentences = nltk.sent_tokenize(raw)
>>> for each_sentence in tokenized_sentences:
...     words = nltk.tokenize.word_tokenize(each_sentence)
...     print each_sentence  # prints tokenized sentences from samp.txt
>>> tokenized_words = nltk.word_tokenize(raw)
>>> for each_word in tokenized_words:
...     words = nltk.tokenize.word_tokenize(each_word)
...     print each_word  # prints tokenized words from samp.txt
Try it this way (this program assumes that you are working with one text file in the directory specified by dirpath):
import nltk
folder = nltk.data.find(dirpath)
corpusReader = nltk.corpus.PlaintextCorpusReader(folder, r'.*\.txt')
print "The number of sentences =", len(corpusReader.sents())
print "The number of paragraphs =", len(corpusReader.paras())
print "The number of words =", len([word for sentence in corpusReader.sents() for word in sentence])
print "The number of characters =", len([char for sentence in corpusReader.sents() for word in sentence for char in word])
Hope this helps
With nltk, you can also use FreqDist (see the O'Reilly book, Ch. 3.1).
And in your case:
import nltk
raw = open('samp.txt').read()
raw = nltk.Text(nltk.word_tokenize(raw.decode('utf-8')))
fdist = nltk.FreqDist(raw)
print fdist.N()
For what it's worth, if someone comes along here: I think this addresses everything the OP asked. If one uses the textstat package, counting sentences and characters is very easy, and the punctuation at the end of each sentence matters for the sentence count.
import textstat
your_text = "This is a sentence! This is sentence two. And this is the final sentence?"
print("Num sentences:", textstat.sentence_count(your_text))
print("Num chars:", textstat.char_count(your_text, ignore_spaces=True))
print("Num words:", len(your_text.split()))
I believe this to be the right solution because it properly counts things like "..." and "??" as a single sentence:
len(re.findall(r"[^?!.][?!.]", paragraph))
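For example (the sample paragraph is made up for illustration):
import re

paragraph = "Wait... is this one sentence?? Yes! And this is another."
print(len(re.findall(r"[^?!.][?!.]", paragraph)))  # 4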
Characters are easy to count.
Paragraphs are usually easy to count too. Whenever you see two consecutive newlines you probably have a paragraph. You might say that an enumeration or an unordered list is a paragraph, even though their entries can be delimited by two newlines each. A heading or a title can also be followed by two newlines, even though they're clearly not paragraphs. Also consider the case of a single paragraph in a file, with one or no newlines following.
Sentences are tricky. You might settle for a period, exclamation mark or question mark followed by whitespace or end-of-file. It's tricky because sometimes a colon marks the end of a sentence and sometimes it doesn't. Usually when it does, the next non-whitespace character would be a capital letter, in the case of English. But sometimes not; for example, if it's a digit. And sometimes an open parenthesis marks the end of a sentence (but that is arguable, as in this case).
Words too are tricky. Usually words are delimited by whitespace or punctuation marks. Sometimes a dash delimits a word, sometimes not. That is the case with a hyphen, for example.
For words and sentences you will probably need to clearly state your definition of a sentence and a word and program for that.
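A very rough sketch of those heuristics (it will mis-count exactly the edge cases described above, so treat it as a starting point only):
import re

def rough_counts(text):
    characters = len(text)
    # paragraphs: blocks separated by two or more newlines
    paragraphs = len([p for p in re.split(r'\n{2,}', text) if p.strip()])
    # sentences: ., ! or ? followed by whitespace or the end of the text
    sentences = len(re.findall(r'[.!?](?:\s|$)', text))
    # words: whitespace-delimited tokens
    words = len(text.split())
    return characters, words, sentences, paragraphs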
Not 100% correct, but I just gave it a try. I have not taken all the points by #wilhelmtell into consideration. I'll try them once I have time...
if __name__ == "__main__":
f = open("1.txt")
c=w=0
s=1
prevIsSentence = False
for x in f:
x = x.strip()
if x != "":
words = x.split()
w = w+len(words)
c = c + sum([len(word) for word in words])
prevIsSentence = True
else:
if prevIsSentence:
s = s+1
prevIsSentence = False
if not prevIsSentence:
s = s-1
print "%d:%d:%d" % (c,w,s)
Here 1.txt is the file name.
The only way you can solve this accurately is by creating an AI program that uses natural language processing, which is not very easy to do.
Input:
"This is a paragraph about the Turing machine. Dr. Alan Turing invented the Turing Machine. It solved a problem that has a .1% chance of being solved."
Check out OpenNLP:
https://sourceforge.net/projects/opennlp/
http://opennlp.apache.org/
There's already a program to count words and characters: wc.