How to separate individual sentences using nltk? - python

Programming noob here, trying to use sent_tokenize to split text into separate sentences. While it appears to be working (in the console, it makes each sentence its own list item), when I append the result to an empty list I end up with a list (well, a list of lists of lists, given the syntax) of length 1 that I cannot iterate through. Basically, I want to be able to extract each individual sentence so that I can compare it with something, e.g. the string "Summer is great." There may be a better way to accomplish this, but please try to give me a simple solution, because noob. I imagine there is a marker at the end of every sentence I could use to append sentences one at a time, so pointing me to that might be enough.
I've reviewed the documentation and tried the following code, but I still end up with listz having length 1 rather than being broken into individual sentences.
import nltk
nltk.download('punkt')
from nltk import sent_tokenize, word_tokenize
listz = []
s = "Good muffins cost $3.88\nin New York. Please buy me two of
them.\n\nThanks."
listz.append([word_tokenize(t) for t in sent_tokenize(s)])
print(listz)
---
# Expected output:
# listz = [["Good muffins cost $3.88 in New York."],
#          ["Please buy me two of them."], ["Thanks."]]

You should use extend instead of append:
listz.extend([word_tokenize(t) for t in sent_tokenize(s)])
But in this case, simple assignment works:
listz = [word_tokenize(t) for t in sent_tokenize(s)]
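If all you need is the sentence strings themselves (so you can compare each one against a target such as "Summer is great." from the question), sent_tokenize alone is enough. A minimal sketch, not part of the original answer:

sentences = sent_tokenize(s)          # list of sentence strings
for sentence in sentences:
    if sentence == "Summer is great.":
        print("Found it:", sentence)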

Related

Is there a way to stop vocabulary creation in gensim.WikiCorpus when it reaches 2000000 tokens?

I downloaded the latest multi-stream bz2 wiki dump. I call the WikiCorpus class from gensim.corpora, and after 90000 documents the vocabulary reaches its maximum size (2000000 tokens).
I got this in terminal:
keeping 2000000 tokens which were in no less than 0 and no more than 580000 (=100.0%) documents
resulting dictionary: Dictionary(2000000 unique tokens: ['ability', 'able', 'abolish', 'abolition', 'about']...)
adding document #580000 to Dictionary(2000000 unique tokens: ['ability', 'able', 'abolish', 'abolition', 'about']...)
The WikiCorpus class keeps working until it reaches the end of the documents in my bz2 file.
Is there a way to stop it, or to split the bz2 file into a sample?
Thanks for the help!
There's no specific parameter to cap the number of tokens. But when you use WikiCorpus.get_texts(), you don't have to read them all: you can stop at any time.
If, as suggested by another question of yours, you plan to use the article texts for Gensim Word2Vec (or a similar model), you don't need the constructor to do its own expensive full-scan vocabulary discovery. If you supply any dummy object (such as an empty dict) as the optional dictionary parameter, it will skip this unnecessary step. For example:
wiki_corpus = WikiCorpus(filename, dictionary={})
If you also want to use some truncated version of the full set of articles, I'd suggest manually iterating over just a subset of the articles. For example if the subset will easily fit as a list in RAM, say 50000 articles, that's as simple as:
import itertools
subset_corpus = list(itertools.islice(wiki_corpus.get_texts(), 50000))
If you want to create a subset larger than RAM, iterate over a set number of articles, writing their tokenized texts to a scratch text file, one article per line. Then use that file as your later input. (By spending the WikiCorpus extraction/tokenization effort only once and reusing the file from disk, this can sometimes offer a performance boost even if you don't strictly need it.)
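A minimal sketch of that larger-than-RAM approach, assuming get_texts() yields lists of string tokens (decode first if your gensim version yields bytes); the file name wiki_subset.txt and the count N are hypothetical:

import itertools

N = 500000  # hypothetical number of articles to keep
with open('wiki_subset.txt', 'w', encoding='utf-8') as out:
    for tokens in itertools.islice(wiki_corpus.get_texts(), N):
        # one tokenized article per line, tokens separated by spaces
        out.write(' '.join(tokens) + '\n')

Later runs can then read the cached file instead of re-extracting the dump, e.g. with gensim's LineSentence (from gensim.models.word2vec import LineSentence) feeding Word2Vec.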

Python3: Use Dateutil to make multiple vars from a long string

I have a program where I would like to randomly pull a line from a song and string it together with lines from other songs. Looking into it, I saw that the dateutil library might be able to help me parse multiple variables from a string, but it doesn't do quite what I want.
I have multiple strings like this, only much longer.
"This is the day\nOf the expanding man\nThat shape is my shade\nThere where I used to stand\nIt seems like only yesterday\nI gazed through the glass\n..."
I want to randomly pull one line from this string (up to the line break) and save it as a variable, but iterate this over multiple strings. Any help would be much appreciated.
Assuming you want to pull one line at random from the string, you can use choice from the random module.
random.choice(seq): Return a random element from the non-empty sequence seq. If seq is empty, raises IndexError.
from random import choice
data = "This is the day\nOf the expanding man\nThat shape is my shade\nThere where I used to stand\nIt seems like only yesterday\nI gazed through the glass\n..."
my_lines = data.splitlines()
my_line = choice(my_lines)
print(my_line)
OUTPUT
That shape is my shade
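To cover the "iterate this over multiple strings" part, a minimal sketch, assuming the songs are simply collected in a list (the songs list and the second lyric are made up for illustration):

from random import choice

songs = [data, "Another song\nWith a few more lines\n..."]  # hypothetical list of lyric strings
picked = [choice(song.splitlines()) for song in songs]      # one random line per song
print(picked)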

Problem with Python/NLTK Stop Words and File Write

I am trying to write to file a list of stop words from NLTK.
So, I wrote this script:
import nltk
from nltk.corpus import stopwords
from string import punctuation
file_name = 'OUTPUT.CSV'
file = open(file_name, 'w+')
_stopwords = set(stopwords.words('english')+list(punctuation))
i = 0
file.write(f'\n\nSTOP WORDS:+++\n\n')
for w in _stopwords:
    i = i + 1
    out1 = f'{i:3}. {w}\n'
    out2 = f'{w}\n'
    out3 = f'{i:3}. {w}'
    file.write(out2)
    print(out3)
file.close()
The original program used file.write(w), but since I encountered problems, I started trying things.
So, I tried using file.write(out1). That works, but the order of the stop words appears to be random.
What's interesting is that if I use file.write(out2), only a random number of stop words get written, in what appears to be random order, always short of 211. I experience the same problem in both Visual Studio 2017 and Jupyter Notebook.
For example, the last run wrote 175 words ending with:
its
wouldn
shan
Using file.write(out1) I get all 211 words and the column ends like this:
209. more
210. have
211. ,
Has anyone run into a similar problem? Any idea of what may be going on?
I'm new to Python/NLTK, so I decided to ask.
The reason you are getting a random order of stop words is due to use of set.
_stopwords = set(stopwords.words('english')+list(punctuation))
A set is an unordered collection with no duplicate elements. Read more here.
Unlike arrays, where the elements are stored as an ordered list, the order of elements in a set is undefined (moreover, the set elements are usually not stored in order of appearance; this allows checking whether an element belongs to a set faster than just going through all the elements of the set).
You can use this simple example to check this:
test = set('abcd')
for i in test:
    print(i)
It outputs a different order each time (I tried it on two different systems; this is what I got):
On the first system:
a
d
b
c
and on the second system:
d
c
a
b
There are alternatives that provide ordered sets. Check here.
Besides, I've checked that all three of out1, out2, and out3 give 211 stop words.
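If a stable order matters for the output file, one option (not from the original answer) is to sort the set before writing; a minimal sketch reusing the question's variables:

for i, w in enumerate(sorted(_stopwords), start=1):
    file.write(f'{i:3}. {w}\n')   # deterministic, alphabetical order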

Python: Most efficient method to find the most common string

I want to find the 20 most common names, and their frequency, in a country.
Let's say I have lists of all residents' first names in 100 cities. Each list might contain a lot of names. Let's say we're talking about 100 lists, each with 1000 strings.
What is the most efficient method to get the 20 most common names, and their frequencies, in the entire country?
This is the direction I started with, assuming each city is a text file in the same directory:
Use the pandas and collections modules for this.
Iterate through each city.txt, reading it into a string. Then turn it into a Counter, and from there into a DataFrame (using to_dict).
Union each DataFrame with the previous one.
Then group by and count(*) over the DataFrame.
But I'm thinking this method might not work, as the DataFrame can get too big.
Would like to hear any advice on that. Thank you.
Here is a sample code:
import os
from collections import Counter
cities = [i for i in os.listdir(".") if i.endswith(".txt")]
d = Counter()
for file in cities:
    with open(file) as f:
        # Adjust the code below to put the strings in a list
        data = f.read().split(",")
        d.update(Counter(data))
out = d.most_common(10)
print(out)
You can also use the NLTK library; I was using the code below for a similar purpose.
from nltk import FreqDist
fd = FreqDist(text)             # text is the flat list of all names
top_20 = fd.most_common(20)     # done, you've got the top 20 tokens :)
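A minimal sketch tying this to the question, assuming text is built as the flat list of all first names read from the city files (comma-separated, as in the previous answer):

import os
from nltk import FreqDist

names = []
for path in (p for p in os.listdir(".") if p.endswith(".txt")):
    with open(path) as f:
        names.extend(f.read().split(","))

fd = FreqDist(names)
print(fd.most_common(20))   # the 20 most common names with their counts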

string comparison for multiple values python

I have two sets of data. The first (A) is a list of equipment with sophisticated names. The second (B) is a list of broader equipment categories, into which I have to group the first list using string comparisons. I'm aware this won't be perfect.
For each entity in List A - I'd like to establish the levenshtein distance for each entity in List B. The record in List B with the highest score will be the group to which I'll assign that data point.
I'm very rusty in Python and am playing around with FuzzyWuzzy to get the distance between two string values. However, I can't quite figure out how to iterate through each list to produce what I need.
I presumed I'd just create a list for each data set and write a pretty basic loop over each, but like I said, I'm a little rusty and not having any luck.
Any help would be greatly appreciated! If there is another package that will allow me to do this (not Fuzzy) - I'm glad to take suggestions.
It looks like the process.extractOne function is what you're looking for. A simple use case is something like:
from fuzzywuzzy import process
from collections import defaultdict
complicated_names = ['leather couch', 'left-handed screwdriver', 'tomato peeler']
generic_names = ['couch', 'screwdriver', 'peeler']
group = defaultdict(list)
for name in complicated_names:
    group[process.extractOne(name, generic_names)[0]].append(name)
defaultdict is a dictionary that supplies a default value (here, an empty list) for any missing key.
We loop over all the complicated names, use fuzzywuzzy to find the closest match, and then add the name to the list associated with that match.
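For the sample lists above, printing the result should give roughly the following (the exact matches depend on the fuzzywuzzy version and scorer):

print(dict(group))
# e.g. {'couch': ['leather couch'],
#       'screwdriver': ['left-handed screwdriver'],
#       'peeler': ['tomato peeler']}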
