I want to derive bigrams and used the following code to do so:
from sklearn.feature_extraction.text import CountVectorizer

def create_vectorizer():
    return CountVectorizer(lowercase=False, stop_words=['a', 'an', 'the', 'The'], ngram_range=(1, 3))

reviews_english["Review Gast"] = reviews_english["Review Gast"].astype(str).str.lower()

res = [(x, i.split()[j + 1]) for i in reviews_english["Review Gast"]
       for j, x in enumerate(i.split()) if j < len(i.split()) - 1]
res
I got the following results:
However, I would like to get the bigrams per row rather than for the whole list.
How can I do this?
Thanks
You can use CountVectorizer and call fit_transform per row. However, since it expects a corpus (a list of texts), you will have to wrap the string from each row in a single-element list.
Sample
df = pd.DataFrame({
    'text': ["a cat on the table",
             "a dog under the table",
             "an apple over the tree"]
})

cv = CountVectorizer(analyzer='word', ngram_range=(2, 2))

bigrams = []
for txt in df["text"].astype(str).str.lower():
    # fit on the single-row "corpus", then read back the bigram feature names
    cv.fit_transform([txt])
    bigrams.append(cv.get_feature_names())  # get_feature_names_out() on scikit-learn >= 1.0

df['bigrams'] = bigrams
print(df)
output:
text bigrams
0 a cat on the table [cat on, on the, the table]
1 a dog under the table [dog under, the table, under the]
2 an apple over the tree [an apple, apple over, over the, the tree]
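If you only need the n-gram strings themselves (and not the counts), a shorter sketch is to reuse the vectorizer's analyzer row by row instead of refitting it; this assumes the same toy df as above:

analyzer = CountVectorizer(analyzer='word', ngram_range=(2, 2)).build_analyzer()
# build_analyzer() returns the configured tokenizer/n-gram callable,
# so no vocabulary has to be fitted for each row
df['bigrams'] = df['text'].astype(str).str.lower().apply(analyzer)
print(df)

This avoids rebuilding a vocabulary for every row, which matters once the DataFrame grows.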
I am trying to create a new DataFrame column that contains words that match between a list of keywords and strings in a df column...
data = {
'Sandwich Opinions':['Roast beef is overrated','Toasted bread is always best','Hot sandwiches are better than cold']
}
df = pd.DataFrame(data)
keywords = ['bread', 'bologna', 'toast', 'sandwich']
df['Matches'] = df.apply(lambda x: ' '.join([i for i in df['Sandwich Opinions'].str.split() if i in keywords]), axis=1)
This seems like it should do the job but it's getting stuck in endless processing.
import numpy as np

for kw in keywords:
    df[kw] = np.where(df['Sandwich Opinions'].str.contains(kw), 1, 0)

def add_contain_row(row):
    contains = []
    for kw in keywords:
        if row[kw] == 1:
            contains.append(kw)
    return contains

df['contains'] = df.apply(add_contain_row, axis=1)

# if you want to drop the temp columns
df.drop(columns=keywords, inplace=True)
Create a regex pattern from your list of words:
import re
pattern = fr"\b({'|'.join(re.escape(k) for k in keywords)})\b"
df['contains'] = df['Sandwich Opinions'].str.extract(pattern, re.IGNORECASE)
Output:
>>> df
Sandwich Opinions contains
0 Roast beef is overrated NaN
1 Toasted bread is always best bread
2 Hot sandwiches are better than cold NaN
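If you want every matching keyword per row rather than only the first, a small variation on the same idea is possible (a sketch, assuming the same df, keywords and pattern as above):

# str.findall returns a list of all matches per row instead of just the first one
df['contains_all'] = df['Sandwich Opinions'].str.findall(pattern, flags=re.IGNORECASE)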
I have a csv file that contains a list of sentences, one per row. I want to find out whether each row contains any stopwords, returning 1 if it does and 0 otherwise, and when it returns 1 I also want to count the stopwords. Below is my code so far; I was only able to get all of the stopwords that exist in the whole csv, but not per row.
import pandas as pd
import csv
import nltk
from nltk.tag import pos_tag
from nltk import sent_tokenize,word_tokenize
from collections import Counter
from nltk.corpus import stopwords
nltk.download('stopwords')
top_N = 10
news=pd.read_csv("split.csv",usecols=['STORY'])
newss = news.STORY.str.lower().str.replace(r'\|', ' ').str.cat(sep=' ')
words = nltk.tokenize.word_tokenize(newss)
word_dist = nltk.FreqDist(words)
stopwords = nltk.corpus.stopwords.words('english')
words_except_stop_dist = nltk.FreqDist(w for w in words if w not in stopwords)
rslt = pd.DataFrame(word_dist.most_common(top_N),
                    columns=['Word', 'Frequency'])
print(rslt)
This is the truncated csv file
id STORY
0 In the bag
1 What is your name
2 chips, bag
I would like to save the output to a new csv file, the expected output should look like this
id STORY exist How many
0 In the bag 1 2
1 What is your name 1 4
2 chips bag 0 0
df = pd.DataFrame({"story": ['In the bag', 'what is your name', 'chips, bag']})
stopwords = nltk.corpus.stopwords.words('english')

# lower-case, drop commas, then tokenize each row
df['clean'] = df['story'].apply(lambda x: nltk.tokenize.word_tokenize(x.lower().replace(',', ' ')))
df
story clean
0 In the bag [in, the, bag]
1 what is your name [what, is, your, name]
2 chips, bag [chips, bag]
df['clean'] = df.clean.apply(lambda x: [y for y in x if y in stopwords])
df['exist'] = df.clean.apply(lambda x: 1 if len(x) > 0 else 0)
df['how many'] = df.clean.apply(lambda x: len(x))
df
story clean exist how many
0 In the bag [in, the] 1 2
1 what is your name [what, is, your] 1 3
2 chips, bag [] 0 0
Note: you can adjust the cleaning step (the replace and the tokenizer) to your requirements, and you can drop the clean column or keep it if you need it later.
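To write the result to a new csv file, as asked, a minimal sketch (the output filename is just a placeholder):

# drop the helper column and save the remaining columns
df.drop(columns=['clean']).to_csv('stopword_counts.csv', index=False)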
I have a vocabulary list that include n-grams as follows.
myvocabulary = ['tim tam', 'jam', 'fresh milk', 'chocolates', 'biscuit pudding']
I want to use these words to calculate TF-IDF values.
I also have a dictionary of corpus as follows (key = recipe number, value = recipe).
corpus = {1: "making chocolates biscuit pudding easy first get your favourite biscuit chocolates", 2: "tim tam drink new recipe that yummy and tasty more thicker than typical milkshake that uses normal chocolates", 3: "making chocolates drink different way using fresh milk egg"}
I am currently using the following code.
from sklearn.feature_extraction.text import TfidfVectorizer
tfidf = TfidfVectorizer(vocabulary = myvocabulary, stop_words = 'english')
tfs = tfidf.fit_transform(corpus.values())
Now I am printing the tokens or n-grams of recipe 1 in the corpus along with their TF-IDF values as follows.
feature_names = tfidf.get_feature_names()
doc = 0
feature_index = tfs[doc, :].nonzero()[1]
tfidf_scores = zip(feature_index, [tfs[doc, x] for x in feature_index])
for w, s in [(feature_names[i], s) for (i, s) in tfidf_scores]:
    print(w, s)
The result I get is chocolates 1.0. However, my code does not detect n-grams (bigrams) such as biscuit pudding when calculating TF-IDF values. Please let me know where my code goes wrong.
I want to get the TF-IDF matrix for the myvocabulary terms using the recipe documents in the corpus. In other words, the rows of the matrix should represent myvocabulary and the columns should represent the recipe documents of my corpus. Please help me.
Try increasing the ngram_range in TfidfVectorizer:
tfidf = TfidfVectorizer(vocabulary = myvocabulary, stop_words = 'english', ngram_range=(1,2))
Edit: The output of TfidfVectorizer is the TF-IDF matrix in sparse format (or actually the transpose of it in the format you seek). You can print out its contents e.g. like this:
feature_names = tfidf.get_feature_names()
corpus_index = [n for n in corpus]
rows, cols = tfs.nonzero()
for row, col in zip(rows, cols):
    print((feature_names[col], corpus_index[row]), tfs[row, col])
which should yield
('biscuit pudding', 1) 0.646128915046
('chocolates', 1) 0.763228291628
('chocolates', 2) 0.508542320378
('tim tam', 2) 0.861036995944
('chocolates', 3) 0.508542320378
('fresh milk', 3) 0.861036995944
If the matrix is not large, it might be easier to examine it in dense form. Pandas makes this very convenient:
import pandas as pd
df = pd.DataFrame(tfs.T.todense(), index=feature_names, columns=corpus_index)
print(df)
This results in
1 2 3
tim tam 0.000000 0.861037 0.000000
jam 0.000000 0.000000 0.000000
fresh milk 0.000000 0.000000 0.861037
chocolates 0.763228 0.508542 0.508542
biscuit pudding 0.646129 0.000000 0.000000
@user8566323 try using
df = pd.DataFrame(tfs.todense(), index=feature_names, columns=corpus_index)
instead of
df = pd.DataFrame(tfs.T.todense(), index=feature_names, columns=corpus_index)
i.e. without making a transpose (T) of matrix
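A quick way to check which orientation you have before deciding whether the transpose is needed (the shapes below assume the 3-recipe corpus and 5-term vocabulary from this question):

# tfs is (n_documents x n_features): 3 recipes by 5 vocabulary terms here
print(tfs.shape)           # (3, 5)
print(len(feature_names))  # 5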
I have a dataframe that consists of two columns: ID and TEXT. Pretend data is below:
ID TEXT
265 The farmer plants grain. The fisher catches tuna.
456 The sky is blue.
434 The sun is bright.
921 I own a phone. I own a book.
I know that nltk functions do not operate on DataFrames directly. How could sent_tokenize be applied to the above DataFrame?
When I try:
df.TEXT.apply(nltk.sent_tokenize)
The output is unchanged from the original dataframe. My desired output is:
TEXT
The farmer plants grain.
The fisher catches tuna.
The sky is blue.
The sun is bright.
I own a phone.
I own a book.
In addition, I would like to tie back this new (desired) dataframe to the original ID numbers like this (following further text cleansing):
ID TEXT
265 'farmer', 'plants', 'grain'
265 'fisher', 'catches', 'tuna'
456 'sky', 'blue'
434 'sun', 'bright'
921 'I', 'own', 'phone'
921 'I', 'own', 'book'
This question is related to another of my questions here. Please let me know if I can provide anything to help clarify my question!
Edit: as a result of warranted prodding by @alexis, here is a better response
Sentence Tokenization
This should get you a DataFrame with one row for each ID & sentence:
sentences = []
for row in df.itertuples():
    for sentence in row[2].split('.'):
        if sentence != '':
            sentences.append((row[1], sentence))

new_df = pandas.DataFrame(sentences, columns=['ID', 'SENTENCE'])
Whose output looks like this:
split('.') will quickly break strings up into sentences, provided the sentences really are separated by periods and periods are not used for anything else (e.g. to denote abbreviations); it also removes the periods in the process. This will fail if periods serve multiple purposes or not every sentence ending is marked with one. A slower but much more robust approach is to use, as you had asked, sent_tokenize to split rows up by sentence:
sentences = []
for row in df.itertuples():
    for sentence in sent_tokenize(row[2]):
        sentences.append((row[1], sentence))

new_df = pandas.DataFrame(sentences, columns=['ID', 'SENTENCE'])
This produces the following output:
If you want to quickly remove periods from these lines you could do something like:
new_df['SENTENCE_noperiods'] = new_df.SENTENCE.apply(lambda x: x.strip('.'))
Which would yield:
You can also take the apply -> map approach (df is your original table):
df = df.join(df.TEXT.apply(sent_tokenize).rename('SENTENCES'))
Yielding:
Continuing:
sentences = df.SENTENCES.apply(pandas.Series)
sentences.columns = ['sentence {}'.format(n + 1) for n in sentences.columns]
This yields:
As our indices have not changed, we can join this back into our original table:
df = df.join(sentences)
Word Tokenization
Continuing with df from above, we can extract the tokens in a given sentence as follows:
df['sent_1_words'] = df['sentence 1'].apply(word_tokenize)
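If you are on pandas 0.25 or newer, DataFrame.explode gets you the one-row-per-sentence table with the IDs carried along in a single step. A sketch, assuming the original df with only the ID and TEXT columns:

# each TEXT becomes a list of sentences; explode then emits one row per sentence,
# repeating the matching ID
exploded = (df.assign(SENTENCE=df['TEXT'].apply(sent_tokenize))
              .explode('SENTENCE')[['ID', 'SENTENCE']]
              .reset_index(drop=True))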
This is a little complicated. I apply sentence tokenization first, then go through each sentence, remove the words in the remove_words list, and strip the punctuation from every remaining word.
import pandas as pd
from nltk import sent_tokenize
from string import punctuation
remove_words = ['the', 'an', 'a']
def remove_punctuation(chars):
    return ''.join([c for c in chars if c not in punctuation])

# example dataframe
df = pd.DataFrame([[265, "The farmer plants grain. The fisher catches tuna."],
                   [456, "The sky is blue."],
                   [434, "The sun is bright."],
                   [921, "I own a phone. I own a book."]], columns=['sent_id', 'text'])

df.loc[:, 'text_split'] = df.text.map(sent_tokenize)

sentences = []
for _, r in df.iterrows():
    for s in r.text_split:
        filtered_words = [remove_punctuation(w) for w in s.split() if w.lower() not in remove_words]
        # or using nltk.word_tokenize
        # filtered_words = [w for w in word_tokenize(s) if w.lower() not in remove_words and w not in punctuation]
        sentences.append({'sent_id': r.sent_id,
                          'text': s.strip('.'),
                          'words': filtered_words})

df_words = pd.DataFrame(sentences)
Output
+-------+--------------------+--------------------+
|sent_id| text| words|
+-------+--------------------+--------------------+
| 265|The farmer plants...|[farmer, plants, ...|
| 265|The fisher catche...|[fisher, catches,...|
| 456| The sky is blue| [sky, is, blue]|
| 434| The sun is bright| [sun, is, bright]|
| 921| I own a phone| [I, own, phone]|
| 921| I own a book| [I, own, book]|
+-------+--------------------+--------------------+
I have a large pandas dataframe with a lot of documents:
id text
1 doc1 Google i...
2 doc2 Amazon...
3 doc3 This was...
...
n docN nice camara...
How can I stack all the documents into sentences while carrying over their respective id?:
id text
1 doc1 Google is a great company.
2 doc1 It is in silicon valley.
3 doc1 Their search engine is the best
4 doc2 Amazon is a great store.
5 doc2 it is located in Seattle.
6 doc2 its new product is alexa.
7 doc2 its expensive.
8 doc3 This was a great product.
...
n docN nice camara I really liked it.
I tried to:
import nltk
def sentence(document):
    sentences = nltk.sent_tokenize(document.strip(' '))
    return sentences

df['sentences'] = df['text'].apply(sentence)
df.stack(level=0)

However, it did not work. Any idea how to stack the sentences while carrying over the id of the document they came from?
There is a solution to a problem similar to yours here: pandas: When cell contents are lists, create a row for each element in the list. Here's my interpretation of it with respect to your particular task:
df['sents'] = df['text'].apply(lambda x: nltk.sent_tokenize(x))
s = df.apply(lambda x: pd.Series(x['sents']), axis=1).stack().\
reset_index(level=1, drop=True)
s.name = 'sents'
df = df.drop(['sents','text'], axis=1).join(s)
This iterates over each row with apply so that nltk.sent_tokenize can be used on the text, and then converts all the resulting sentences into their own columns using the Series constructor.
df1 = df['text'].apply(lambda x: pd.Series(nltk.sent_tokenize(x)))
df1.set_index(df['id']).stack()
Example with fake data
df=pd.DataFrame({'id':['doc1', 'doc2'], 'text' :['This is a sentence. And another. And one more. cheers',
'here are more sentences. yipee. woop.']})
df1 = df['text'].apply(lambda x: pd.Series(nltk.sent_tokenize(x)))
df1.set_index(df['id']).stack().reset_index().drop('level_1', axis=1)
id 0
0 doc1 This is a sentence.
1 doc1 And another.
2 doc1 And one more.
3 doc1 cheers
4 doc2 here are more sentences.
5 doc2 yipee.
6 doc2 woop.
I think you would find this a lot easier if you kept your corpus outside of pandas. Here is my solution; I fit it back into a pandas DataFrame at the end. I think this is probably the most scalable approach.
def stack(one, two):
    sp = two.split(".")
    return [(one, a.strip()) for a in sp if len(a.strip()) > 0]

st = sum(map(stack, df['id'].tolist(), df['text'].tolist()), [])
df2 = pd.DataFrame(st)
df2.columns = ['id', 'text']
If you want to add a sentence id column you can make a small tweak.

def stack(one, two):
    sp = two.split(".")
    return [(one, b, a.strip()) for a, b in zip(sp, range(1, len(sp) + 1)) if len(a.strip()) > 0]

st = sum(map(stack, df['id'].tolist(), df['text'].tolist()), [])
df2 = pd.DataFrame(st)
df2.columns = ['id', 'sentence_id', 'text']
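A quick usage sketch with the toy data from the question, just to show the shape of the result:

import pandas as pd

df = pd.DataFrame({'id': ['doc1', 'doc2'],
                   'text': ['Google is a great company. It is in silicon valley.',
                            'Amazon is a great store. It is located in Seattle.']})

st = sum(map(stack, df['id'].tolist(), df['text'].tolist()), [])
df2 = pd.DataFrame(st)
df2.columns = ['id', 'sentence_id', 'text']
print(df2)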