Scikit CountVectorizer find least common words - python

I need to extract the top X LEAST common words with CountVectorizer, but I was not able to find a way to do it.
I'm using multiple CountVectorizers in FeatureUnion.
union = FeatureUnion([('words', CountVectorizer(ngram_range=(1, 3), analyzer='word', max_features=200)),
                      ('chars', CountVectorizer(ngram_range=(1, 4), analyzer='char', max_features=200))])
X_train = union.fit_transform(train_texts)
X_test = union.transform(test_texts)
I would need to reverse the order somehow to make CountVectorizer return the least common words. Is there a way to do it? I basically need the 200 least common n-grams from both the word and char n-grams.

Here's an ipython demonstration of how you can determine the least common occurrences of the specified ngrams. Comments in the code describe the methodology.
$ ipython
Python 3.10.9 (main, Dec 7 2022, 00:00:00) [GCC 12.2.1 20221121 (Red Hat 12.2.1-4)]
Type 'copyright', 'credits' or 'license' for more information
IPython 8.9.0 -- An enhanced Interactive Python. Type '?' for help.
In [1]: from faker import Faker
In [2]: faker = Faker()
In [3]: corpus = faker.sentences(10000)
In [4]: corpus[:5] # first 5 sentences of nonsense
Out[4]:
['Drug road condition space dog after key.',
'Piece myself music society.',
'Assume gas evening cut majority own.',
'This both part we.',
'Far life summer those line nature.']
In [5]: from sklearn.pipeline import FeatureUnion
In [6]: from sklearn.feature_extraction.text import CountVectorizer
In [7]: union = FeatureUnion([('words', CountVectorizer(ngram_range=(1, 3), analyzer='word')),
...: ('chars', CountVectorizer(ngram_range=(1, 4), analyzer='char'))])
In [8]: X = union.fit_transform(corpus) # matrix of ngram counts, X.shape == (10000, 91734)
In [9]: ngram_counts = X.sum(axis=0).A1 # vector of counts over all sentences, shape == (91734,)
In [10]: ngram_count_sort_indices = ngram_counts.argsort() # get indices of sort
In [11]: union.get_feature_names_out()[ngram_count_sort_indices[:20]] # show first 20 least common ngrams - change to whatever is needed
And for the faker nonsense sentences, here are the 20 least common ngrams (predictably, they all occur only once). The slice in the line of code above ([:20]) can easily be changed to whatever number you need.
Out[11]:
array(['words__office where candidate', 'words__protect doctor',
       'words__protect do poor', 'words__protect do',
       'words__protect dark according', 'words__protect dark',
       'words__protect create someone', 'words__protect create',
       'words__protect church', 'words__protect charge surface',
       'words__protect charge', 'words__protect chance ever',
       'words__protect chance', 'words__protect can air',
       'words__protect can', 'words__protect author',
       'words__protect doctor long', 'words__protect ago',
       'words__protect drop', 'words__protect factor'], dtype=object)
If you want to strip the pipeline label from the ngrams and only get the words/chars, you could:
In [12]: least_common = union.get_feature_names_out()[ngram_count_sort_indices[:20]]
In [13]: [x.split("__")[1] for x in least_common]
Out[13]:
['office where candidate',
'protect doctor',
'protect do poor',
'protect do',
'protect dark according',
'protect dark',
'protect create someone',
'protect create',
'protect church',
'protect charge surface',
'protect charge',
'protect chance ever',
'protect chance',
'protect can air',
'protect can',
'protect author',
'protect doctor long',
'protect ago',
'protect drop',
'protect factor']
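To tie this back to the train/test setup in the question, one option (a sketch, not the only way) is to fit the union without max_features on the training texts, pick the 200 least common n-grams per vectorizer, and refit with those as fixed vocabularies; train_texts and test_texts are the variables from the question:
from sklearn.pipeline import FeatureUnion
from sklearn.feature_extraction.text import CountVectorizer

# fit without max_features so rare n-grams are not dropped up front
full_union = FeatureUnion([('words', CountVectorizer(ngram_range=(1, 3), analyzer='word')),
                           ('chars', CountVectorizer(ngram_range=(1, 4), analyzer='char'))])
counts = full_union.fit_transform(train_texts).sum(axis=0).A1
names = full_union.get_feature_names_out()[counts.argsort()]   # rarest first

# split the 'words__'/'chars__' prefixes back out and keep 200 of each
rare_words = [n.split('__', 1)[1] for n in names if n.startswith('words__')][:200]
rare_chars = [n.split('__', 1)[1] for n in names if n.startswith('chars__')][:200]

# refit with the rare n-grams as fixed vocabularies
union = FeatureUnion([('words', CountVectorizer(ngram_range=(1, 3), analyzer='word', vocabulary=rare_words)),
                      ('chars', CountVectorizer(ngram_range=(1, 4), analyzer='char', vocabulary=rare_chars))])
X_train = union.fit_transform(train_texts)
X_test = union.transform(test_texts)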

Related

vectorized way to add a calculated row to a multiindex's subindex

I cannot get my head around how to use groupby to solve the following example:
import pandas as pd
import numpy as np

df = pd.DataFrame(
    {
        'kitchen': ['galley', 'house', 'restaurant', 'caterer'] * 3,
        'products': ['chocolate', 'tart', 'pie'] * 4,
        'menu_a': [np.random.randint(100000, 999999) for _ in range(12)],
        'menu_b': [np.random.randint(100000, 999999) for _ in range(12)],
        'menu_c': [np.random.randint(100000, 999999) for _ in range(12)]
    }
).set_index(['kitchen', 'products']).sort_index()
df
What I want to do is replace the "pie" and "tart" rows of each kitchen with a single row holding the sum of pie + tart for that kitchen.
So, for example, in the galley kitchen the new row under products would be pastries and the value under menu_a would be 333163+612456 = 945619, and likewise for each kitchen and menu column.
I've tried many combinations of stack(), unstack() and groupby() mixed together but cannot quite get the result. The alternative is to do this iteratively/apply()'d outside, which is gross, and this is a frequent problem I encounter. I would like to know how to do it right.
Select rows by second level, sum and add second level:
df1 = (df.loc[pd.IndexSlice[:, ['pie','tart']], :]
         .sum(level=0)
         .assign(products='total')
         .set_index('products', append=True))
Then concat to original and remove used values by list:
df = pd.concat([df, df1]).drop(['pie','tart'], level=1).sort_index()
print (df)
                       menu_a   menu_b   menu_c
kitchen    products
caterer    chocolate   907615   167480   921843
           total       749664   786464   872046
galley     chocolate   939850   382545   525525
           total      1204359   907760  1267475
house      chocolate   701797   106570   572014
           total      1215235  1058951   812935
restaurant chocolate   734501   637600   216367
           total      1846097   345020   517969
One way using rename:
new_df = df.rename({'pie':'pastries', 'tart':'pastries'}).sum(level=[0, 1])
print(new_df)
Output:
                       menu_a   menu_b   menu_c
kitchen    products
caterer    chocolate   369612   505912   988729
           pastries   1647943  1119303  1391204
galley     chocolate   128946   196457   669335
           pastries   1215293  1573815  1108319
house      chocolate   397620   167103   193412
           pastries    509144   741824   416904
restaurant chocolate   330240   306817   835125
           pastries    584582  1395824  1098987
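Note that on newer pandas releases the level= argument of sum() has been removed, so both answers would need an explicit groupby over the index levels. A sketch of the rename approach under that assumption, reusing df from the question:
# rename 'pie'/'tart' to 'pastries' in the products level, then collapse
# the now-duplicated index rows by summing
new_df = (df.rename(index={'pie': 'pastries', 'tart': 'pastries'}, level=1)
            .groupby(level=[0, 1])
            .sum())
print(new_df)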

Mapping list of strings to 0 in Pandas Python

I am trying to study the effects of Alcohol and Drugs in car accidents using an Open BigQuery dataset. I have my dataset ready to go and am just refining it further. I want to categorize the string entries in the pandas columns.
The data frame has over 11,000 entries and there are about 44 unique values in each column. However, I want to map only the entries which say 'Alcohol Involvement' and 'Drugs (Illegal)' to 1, and map any other entry to 0.
I have created a list of all the entries which I don't care about and want to get rid of; they are in a list as follows:
list_ign = ['Backing Unsafely',
            'Turning Improperly', 'Other Vehicular',
            'Driver Inattention/Distraction', 'Following Too Closely',
            'Oversized Vehicle', 'Driver Inexperience', 'Brakes Defective',
            'View Obstructed/Limited', 'Passing or Lane Usage Improper',
            'Unsafe Lane Changing', 'Failure to Yield Right-of-Way',
            'Fatigued/Drowsy', 'Prescription Medication',
            'Failure to Keep Right', 'Pavement Slippery', 'Lost Consciousness',
            'Cell Phone (hands-free)', 'Outside Car Distraction',
            'Traffic Control Disregarded', 'Fell Asleep',
            'Passenger Distraction', 'Physical Disability', 'Illness', 'Glare',
            'Other Electronic Device', 'Obstruction/Debris', 'Unsafe Speed',
            'Aggressive Driving/Road Rage',
            'Pedestrian/Bicyclist/Other Pedestrian Error/Confusion',
            'Reaction to Other Uninvolved Vehicle', 'Steering Failure',
            'Traffic Control Device Improper/Non-Working',
            'Tire Failure/Inadequate', 'Animals Action',
            'Driverless/Runaway Vehicle']
What could I do to map just 'Alcohol Involvement' and 'Drugs (Illegal)' to 1 and set everything in the list shown above to 0?
Say your source column is named Crime:
import numpy as np
df['Illegal'] = np.where(df['Crime'].isin(['Alcohol Involvement', 'Drugs']), 1, 0)
Or,
df['Crime'] = df['Crime'].isin(['Alcohol Involvement', 'Drugs']).astype(int)
While the above-mentioned methods work fine, they were not tagging all the categories I wanted to remove later on, so I used this method:
for word in list_ign:
    df = df.replace(str(word), 'Replace')
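The loop over df.replace works, but the same collapsing can be done without a Python-level loop. A sketch, assuming the source column is named Crime as in the first answer:
# replace every category in list_ign with a single placeholder in one vectorized call
df['Crime'] = df['Crime'].where(~df['Crime'].isin(list_ign), 'Replace')

# the 1/0 mapping from above still applies, here with the full label from the question
df['Illegal'] = df['Crime'].isin(['Alcohol Involvement', 'Drugs (Illegal)']).astype(int)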

Filling missing value with word from list on condition

I'm trying to preprocess data, especially dealing with missing values.
I have a list of words and two columns containing text data. If a word from the list is in at least one of the two text columns, I fill the missing value with that word.
import pandas as pd
a = ['coffee', 'milk', 'sugar']
test = pd.DataFrame({'col': ['missing', 'missing', 'missing'],
                     'text1': ['i drink tea', 'i drink coffee', 'i drink whiskey'],
                     'text2': ['i drink juice', 'i drink nothing', 'i drink milk']
                     })
So the dataframe looks like this, and the column "col" has "missing" as a result of applying fillna("missing"):
Out[19]:
       col            text1            text2
0  missing      i drink tea    i drink juice
1  missing   i drink coffee  i drink nothing
2  missing  i drink whiskey     i drink milk
I came up with the following code, applying a loop:
for word in a:
    test.loc[(test["col"] == 'missing') & ((test["text1"].str.count(word) > 0)
             | (test['text2'].str.count(word) > 0)), "col"] = word
With 100,000 rows and 2,000 elements in the list "a" it takes around 870 seconds to finish the job.
Is there any solution to make it faster for a huge dataframe?
Thanks in advance
Some suggestions:
Why use .str.count instead of .str.contains?
Why do the fillna('missing')? pd.isnull(test["col"]) will work faster than test["col"] == 'missing'.
You could also use a test to see whether all the missing fields are filled.
So this can boil down to something like this:
def fill_missing(original_df, column_name, replacements, inplace=True):
    df = original_df if inplace else original_df.copy()
    for word in replacements:
        empty = pd.isnull(df[column_name])
        if not empty.any():
            return df
        contained = (df.loc[empty, "text1"].str.contains(word)) | (df.loc[empty, 'text2'].str.contains(word))
        df.loc[contained[contained].index, column_name] = word
    return df
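Another option worth trying, since looping over 2,000 words still means 2,000 passes over the frame, is to join the word list into one regex alternation and let pandas extract the first match in a single vectorized pass. A sketch using the small test frame above; note it picks the leftmost match in the text rather than following the list order, so it is not an exact drop-in replacement:
import re

# one pattern that matches any word from the list
pattern = '(' + '|'.join(re.escape(w) for w in a) + ')'

# search both text columns at once and keep the first hit per row (NaN if none)
found = (test['text1'] + ' ' + test['text2']).str.extract(pattern, expand=False)

# only fill rows that are still marked as missing; keep 'missing' when nothing matched
mask = test['col'] == 'missing'
test.loc[mask, 'col'] = found[mask].fillna('missing')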

Calculate TF-IDF using sklearn for n-grams in python

I have a vocabulary list that includes n-grams, as follows.
myvocabulary = ['tim tam', 'jam', 'fresh milk', 'chocolates', 'biscuit pudding']
I want to use these words to calculate TF-IDF values.
I also have a dictionary of corpus as follows (key = recipe number, value = recipe).
corpus = {1: "making chocolates biscuit pudding easy first get your favourite biscuit chocolates", 2: "tim tam drink new recipe that yummy and tasty more thicker than typical milkshake that uses normal chocolates", 3: "making chocolates drink different way using fresh milk egg"}
I am currently using the following code.
from sklearn.feature_extraction.text import TfidfVectorizer
tfidf = TfidfVectorizer(vocabulary = myvocabulary, stop_words = 'english')
tfs = tfidf.fit_transform(corpus.values())
Now I am printing the tokens or n-grams of recipe 1 in the corpus along with their TF-IDF values, as follows.
feature_names = tfidf.get_feature_names()
doc = 0
feature_index = tfs[doc,:].nonzero()[1]
tfidf_scores = zip(feature_index, [tfs[doc, x] for x in feature_index])
for w, s in [(feature_names[i], s) for (i, s) in tfidf_scores]:
    print(w, s)
The result I get is chocolates 1.0. However, my code does not detect n-grams (bigrams) such as biscuit pudding when calculating the TF-IDF values. Please let me know where my code goes wrong.
I want to get the TF-IDF matrix for the myvocabulary terms by using the recipe documents in the corpus. In other words, the rows of the matrix represent myvocabulary and the columns represent the recipe documents of my corpus. Please help me.
Try increasing the ngram_range in TfidfVectorizer:
tfidf = TfidfVectorizer(vocabulary = myvocabulary, stop_words = 'english', ngram_range=(1,2))
Edit: The output of TfidfVectorizer is the TF-IDF matrix in sparse format (or actually the transpose of it in the format you seek). You can print out its contents e.g. like this:
feature_names = tfidf.get_feature_names()
corpus_index = [n for n in corpus]
rows, cols = tfs.nonzero()
for row, col in zip(rows, cols):
    print((feature_names[col], corpus_index[row]), tfs[row, col])
which should yield
('biscuit pudding', 1) 0.646128915046
('chocolates', 1) 0.763228291628
('chocolates', 2) 0.508542320378
('tim tam', 2) 0.861036995944
('chocolates', 3) 0.508542320378
('fresh milk', 3) 0.861036995944
If the matrix is not large, it might be easier to examine it in dense form. Pandas makes this very convenient:
import pandas as pd
df = pd.DataFrame(tfs.T.todense(), index=feature_names, columns=corpus_index)
print(df)
This results in
                        1         2         3
tim tam          0.000000  0.861037  0.000000
jam              0.000000  0.000000  0.000000
fresh milk       0.000000  0.000000  0.861037
chocolates       0.763228  0.508542  0.508542
biscuit pudding  0.646129  0.000000  0.000000
@user8566323 try using
df = pd.DataFrame(tfs.todense(), index=feature_names, columns=corpus_index)
instead of
df = pd.DataFrame(tfs.T.todense(), index=feature_names, columns=corpus_index)
i.e. without making a transpose (T) of matrix
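One more small adjustment: on recent scikit-learn versions (1.2 and later) get_feature_names() has been removed in favour of get_feature_names_out(), so the printing snippet becomes:
feature_names = tfidf.get_feature_names_out()   # replaces get_feature_names()
corpus_index = [n for n in corpus]
rows, cols = tfs.nonzero()
for row, col in zip(rows, cols):
    print((feature_names[col], corpus_index[row]), tfs[row, col])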

How to import csv file as a training with label and testing with target data for classifier in scikit-learn?

I have two csv files for training and testing data. Both of them look like this (I only show one of them, but both have the same form of data and the same attribute names):
Full,Id,Id & PPDB,Id & Words Sequence,Id & Synonyms,Id & Hypernyms,Id & Hyponyms,Gold Standard
1.667,0.476,0.952,0.476,1.429,0.952,0.476,2.345
3.056,1.111,1.667,1.111,3.056,1.389,1.111,1.9
1.765,1.176,1.176,1.176,1.765,1.176,1.176,2.2
0.714,0.714,0.714,0.714,0.714,0.714,0.714,0.0
1.538,0.769,0.769,0.769,1.538,0.769,0.769,2.586
2.188,1.875,1.875,1.875,1.875,2.188,1.875,1.667
3.333,1.333,1.333,1.333,3.333,2.0,1.333,2.8
2.5,1.667,1.667,1.667,2.222,1.944,1.667,2.481
I'm a newbie in scikit-learn. I learned from an example where the training+label and testing+target data inputs look like this:
X_train = np.array(["new york is a hell of a town",
                    "new york was originally dutch",
                    "the big apple is great",
                    "new york is also called the big apple",
                    "nyc is nice",
                    "people abbreviate new york city as nyc",
                    "the capital of great britain is london",
                    "london is in the uk",
                    "london is in england",
                    "london is in great britain",
                    "it rains a lot in london",
                    "london hosts the british museum",
                    "new york is great and so is london",
                    "i like london better than new york"])
y_train_text = [["new york"],["new york"],["new york"],["new york"],["new york"],
                ["new york"],["london"],["london"],["london"],["london"],
                ["london"],["london"],["new york","london"],["new york","london"]]
X_test = np.array(['nice day in nyc',
                   'welcome to london',
                   'london is rainy',
                   'it is raining in britian',
                   'it is raining in britian and the big apple',
                   'it is raining in britian and nyc',
                   'hello welcome to new york. enjoy it here and london too'])
target_names = ['New York', 'London']
Is it possible to import my csv files, which contain float numbers, as training data with labels and testing data with targets? Also, I want to use the Gold Standard attribute as the label for my training data and as the target for my testing data. If it's possible, how do I build that input? Thanks
As suggested in @Vivek Kumar's comment, you could get the job done by using pandas' read_csv and iloc like this:
In [12]: import pandas as pd
In [13]: import numpy as np
In [14]: df = pd.read_csv('train.txt')
In [15]: X_train = np.asarray(df.iloc[:, :-1])
In [16]: y_train = np.asarray(df.iloc[:, -1])
In [17]: X_train
Out[17]:
array([[ 1.667,  0.476,  0.952, ...,  1.429,  0.952,  0.476],
       [ 3.056,  1.111,  1.667, ...,  3.056,  1.389,  1.111],
       [ 1.765,  1.176,  1.176, ...,  1.765,  1.176,  1.176],
       ...,
       [ 2.188,  1.875,  1.875, ...,  1.875,  2.188,  1.875],
       [ 3.333,  1.333,  1.333, ...,  3.333,  2.   ,  1.333],
       [ 2.5  ,  1.667,  1.667, ...,  2.222,  1.944,  1.667]])
In [18]: y_train
Out[18]: array([ 2.345, 1.9 , 2.2 , 0. , 2.586, 1.667, 2.8 , 2.481])
Please note that I previously saved the data you provided to the file train.txt.
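From there the arrays plug straight into any scikit-learn estimator. Since the Gold Standard column holds continuous scores, a regressor is the natural fit; here is a minimal sketch, assuming your test file (called test.txt here, with the same column layout) also has a Gold Standard column to evaluate against:
import pandas as pd
from sklearn.linear_model import LinearRegression

train = pd.read_csv('train.txt')
test = pd.read_csv('test.txt')

X_train, y_train = train.iloc[:, :-1].values, train.iloc[:, -1].values
X_test, y_test = test.iloc[:, :-1].values, test.iloc[:, -1].values

model = LinearRegression()
model.fit(X_train, y_train)
print(model.score(X_test, y_test))   # R^2 on the held-out data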
