How to get bag of words from textual data? [closed] - python

I am working on a prediction problem using a large textual dataset, and I am implementing a Bag of Words model.
What is the best way to get the bag of words? Right now, I have tf-idf scores for the various words, but the number of words is too large to use for further processing. If I use a tf-idf criterion, what should the tf-idf threshold be for selecting the bag of words? Or should I use some other algorithm? I am using Python.
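One common way to avoid picking a tf-idf threshold by hand is to cap the vocabulary size directly. A minimal sketch, assuming scikit-learn and a toy corpus (the parameter values are only illustrative, and get_feature_names_out needs scikit-learn >= 1.0):
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = ["the cat sat", "the cat ran", "the dog ran"]   # toy documents
vectorizer = TfidfVectorizer(max_features=10000,  # keep at most the 10,000 most frequent terms
                             min_df=2,            # drop terms appearing in fewer than 2 documents
                             max_df=0.95)         # drop terms appearing in over 95% of documents
X = vectorizer.fit_transform(corpus)
print(vectorizer.get_feature_names_out())         # ['cat', 'ran'] for this toy corpus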

Using the collections.Counter class
>>> import collections, re
>>> texts = ['John likes to watch movies. Mary likes too.',
...          'John also likes to watch football games.']
>>> bagsofwords = [collections.Counter(re.findall(r'\w+', txt))
...                for txt in texts]
>>> bagsofwords[0]
Counter({'likes': 2, 'watch': 1, 'Mary': 1, 'movies': 1, 'John': 1, 'to': 1, 'too': 1})
>>> bagsofwords[1]
Counter({'watch': 1, 'games': 1, 'to': 1, 'likes': 1, 'also': 1, 'John': 1, 'football': 1})
>>> sumbags = sum(bagsofwords, collections.Counter())
>>> sumbags
Counter({'likes': 3, 'watch': 2, 'John': 2, 'to': 2, 'games': 1, 'football': 1, 'Mary': 1, 'movies': 1, 'also': 1, 'too': 1})
>>>

A bag of words can be defined as a matrix where each row represents a document and each column represents an individual token. Note also that the sequential order of the text is not preserved. Building a "Bag of Words" involves 3 steps (a minimal sketch follows the list):
tokenizing
counting
normalizing
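A minimal sketch of those three steps done by hand (here "normalizing" simply means dividing by the document length; other normalizations exist):
import re
from collections import Counter

doc = "John likes to watch movies. Mary likes movies too."

tokens = re.findall(r'\w+', doc.lower())                       # 1. tokenizing
counts = Counter(tokens)                                       # 2. counting
total = sum(counts.values())
tf = {word: count / total for word, count in counts.items()}   # 3. normalizing (term frequency)
print(tf['likes'])                                             # 2 of 9 tokens -> ~0.22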
Limitations to keep in mind:
1. Cannot capture phrases or multi-word expressions
2. Sensitive to misspellings; it is possible to work around that using a spell corrector or a character-level representation.
For example:
from sklearn.feature_extraction.text import CountVectorizer
vectorizer = CountVectorizer()
data_corpus = ["John likes to watch movies. Mary likes movies too.",
"John also likes to watch football games."]
X = vectorizer.fit_transform(data_corpus)
print(X.toarray())
print(vectorizer.get_feature_names())

The bag-of-words model is a nice method of text representation that can be applied to different machine learning tasks. But as a first step you need to clean the data of unnecessary content, for example punctuation, HTML tags, and stop words. For these tasks you can easily use libraries like Beautiful Soup (to remove HTML markup) or NLTK (to remove stop words) in Python.
After cleaning your data you need to create feature vectors (a numerical representation of the data for machine learning); this is where bag-of-words plays its role. scikit-learn has a feature_extraction module which can help you create the bag-of-words features.
You may find all you need in this tutorial, and this one can also be very helpful. I found both of them very useful.
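A minimal sketch of that pipeline on toy HTML snippets (an illustrative example, not the tutorials' code; nltk.download('stopwords') is needed once):
import re
from bs4 import BeautifulSoup
from nltk.corpus import stopwords
from sklearn.feature_extraction.text import CountVectorizer

raw_docs = ["<p>John likes to watch movies.</p>",
            "<p>John also likes football games.</p>"]

stop_words = set(stopwords.words('english'))
cleaned = []
for html in raw_docs:
    text = BeautifulSoup(html, 'html.parser').get_text()                         # strip HTML markup
    kept = [w for w in re.findall(r'\w+', text.lower()) if w not in stop_words]  # drop stop words
    cleaned.append(' '.join(kept))

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(cleaned)   # bag-of-words features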

As others have already mentioned, using nltk would be your best option if you want something stable and scalable. It's highly configurable.
However, it has the downside of a quite steep learning curve if you want to tweak the defaults.
I once encountered a situation where I wanted a bag of words. The problem was that it concerned articles about technologies with exotic names full of -, _, etc., such as vue-router or _.js.
The default configuration of nltk's word_tokenize is to split vue-router into two separate words, vue and router, for instance. I'm not even talking about _.js.
So for what it's worth, I ended up writing this little routine to get all the words tokenized into a list, based on my own punctuation criteria.
import re
punctuation_pattern = r' |\.$|\. |, |\/|\(|\)|\'|\"|\!|\?|\+'  # raw string so the regex escapes are passed through unchanged
text = "This article is talking about vue-router. And also _.js."
ltext = text.lower()
wtext = [w for w in re.split(punctuation_pattern, ltext) if w]
print(wtext)
# ['this', 'article', 'is', 'talking', 'about', 'vue-router', 'and', 'also', '_.js']
This routine can be easily combined with Patty3118's answer about collections.Counter, which could let you know how many times _.js was mentioned in the article, for instance.

From a book "Machine learning python":
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
count = CountVectorizer()
docs = np.array(['blablablatext'])
bag = count.fit_transform(docs)
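To inspect what that snippet produces, you can continue the block above with two quick checks (not part of the book excerpt):
print(count.vocabulary_)   # mapping from token to column index
print(bag.toarray())       # dense document-term count matrix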

Related

Python - counting frequency of a non-ngram sequence in a list of strings efficiently

As I've stated in the title, I'm trying to calculate the phrase frequency of a given list of sequences that appear in a list of strings. The problem is that the words in the phrases do not have to appear next to each other; there may be one or more words in between.
Example:
Sequence: ('able', 'help', 'number') in a sentence "Please call us, we may be able to help, our phone number is 1234"
I remove the stopwords (NLTK stopwords), remove punctuation, lowercase all letters and tokenize the sentence, so the processed sequence looks like ['please', 'call', 'us', 'able', 'help', 'phone', 'number', '1234']. I have about 30,000 sequences varying in length from 1 (single words) to 3, and I'm searching in almost 6,000 short sentences. My current approach is presented below:
from collections import Counter
from tqdm import tqdm
import itertools
import nltk

# Get term frequency per sentence
def get_bow(sen, vocab):
    vector = [0] * len(vocab)
    tokenized_sentence = nltk.word_tokenize(sen)
    combined_sentence = list(itertools.chain.from_iterable([itertools.combinations(tokenized_sentence, 1),
                                                            itertools.combinations(tokenized_sentence, 2),
                                                            itertools.combinations(tokenized_sentence, 3)]))
    for el in combined_sentence:
        if el in vocab:
            cnt = combined_sentence.count(el)
            idx = vocab.index(el)
            vector[idx] = cnt
    return vector

sentence_vectors = []
for sentence in tqdm(text_list):
    sentence_vectors.append(get_bow(sentence, phrase_list))
phrase_list is a list of tuples with the sequences, and text_list is a list of strings. Currently, the frequency calculation takes over an hour, and I'm trying to find a more efficient way to get the list of frequencies associated with the given terms. I've also tried sklearn's CountVectorizer, but there's a problem with processing sequences with gaps: they aren't counted at all.
I'd be grateful if anyone would try to give me some insight about how to make my script more efficient. Thanks in advance!
EDIT:
Example of phrase_list: [('able',), ('able', 'us', 'software'), ('able', 'back'), ('printer', 'holidays'), ('printer', 'information')]
Example of text_list: ['able add printer mac still working advise calling support team mon fri excluding bank holidays would able look', 'absolutely one cat coyote peterson', 'accurate customs checks cause delays also causing issues expected delivery dates changing', 'add super mario flair fridge desk supermario dworld bowsersfury magnet set available platinum points shipping costs mynintendo reward get', 'additional information though pass comments team thanks']
Expected output: [2, 0, 0, 1, 0] - a vector with the occurrence count of each phrase; the order of values should be the same as in phrase_list. My code returns a vector of phrase occurrences per sentence, because I was trying to implement something like a bag-of-words.
There are many aspects that could be made faster, but here is the main problem:
combined_sentence = list(itertools.chain.from_iterable([itertools.combinations(tokenized_sentence, 1),
                                                        itertools.combinations(tokenized_sentence, 2),
                                                        itertools.combinations(tokenized_sentence, 3)]))
You generate all potential combinations of 1, 2 or 3 words of the sentence. This is always bad, no matter what you want to do.
Sentence: "Master Yoda about sentence structure care does not."
If you really want to treat this sentence as if it contained "Yoda does not", you should still not generate all combinations. There are much faster ways, but I will only spend time on this if that is indeed your goal.
If you want to treat this sentence as one that does NOT contain "Yoda does not", then I think you can figure out yourself how to speed up your code. Maybe look here.
I hope this helped. Let me know in case you need option 1.
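For what it's worth, here is a minimal sketch of the direct approach hinted at above: check each phrase against each sentence instead of enumerating all combinations. The helper names are hypothetical, and this version counts sentences that contain a phrase as an ordered subsequence with gaps (adapt it if you need per-occurrence counts):
def phrase_in_sentence(phrase, tokens):
    # True if the words of `phrase` appear in `tokens` in this order, possibly with gaps.
    it = iter(tokens)
    return all(word in it for word in phrase)

def count_phrases(phrases, sentences):
    counts = [0] * len(phrases)
    for sentence in sentences:
        tokens = sentence.split()            # sentences are already cleaned and lowercased
        token_set = set(tokens)              # cheap membership pre-filter
        for i, phrase in enumerate(phrases):
            if all(w in token_set for w in phrase) and phrase_in_sentence(phrase, tokens):
                counts[i] += 1
    return counts

phrases = [('able',), ('able', 'us', 'software'), ('printer', 'holidays')]
sentences = ['able add printer mac still working advise calling support team mon fri '
             'excluding bank holidays would able look']
print(count_phrases(phrases, sentences))     # [1, 0, 1] for this single sentence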

Calculate tf-idf in Gensim for my vocabulary

I have a set of words (n-grams) for which I need to calculate tf-idf values. These words are:
myvocabulary = ['tim tam', 'jam', 'fresh milk', 'chocolates', 'biscuit pudding']
My corpus looks as follows.
corpus = {1: "making chocolates biscuit pudding easy first get your favourite biscuit chocolates", 2: "tim tam drink new recipe that yummy and tasty more thicker than typical milkshake that uses normal chocolates", 3: "making chocolates drink different way using fresh milk egg"}
I am currently getting tf-idf values for my n-grams in myvocabulary using sklearn as follows.
from sklearn.feature_extraction.text import TfidfVectorizer

tfidf = TfidfVectorizer(vocabulary=myvocabulary, ngram_range=(1, 3))
tfs = tfidf.fit_transform(corpus.values())
However, I am interested in doing the same in Gensim. All the examples I came across in Gensim:
use only unigrams (I want it for bigrams and trigrams as well)
are calculated for all the words (I only want to calculate it for the words in myvocabulary)
Hence, please help me to find out how to do the above two things in Gensim.
In Gensim, you should use the gensim.corpora.Dictionary class for your dictionary; look at the examples.
Unfortunately, Gensim has no general support for n-grams, only bigrams for words via the Phrases class.
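A minimal sketch of that Gensim workflow, assuming unigrams only (my_unigrams is a simplified vocabulary; multi-word entries like 'tim tam' would first have to be joined into single tokens, e.g. with Phrases or by replacing spaces with underscores):
from gensim.corpora import Dictionary
from gensim.models import TfidfModel

corpus = {1: "making chocolates biscuit pudding easy first get your favourite biscuit chocolates",
          2: "tim tam drink new recipe that yummy and tasty more thicker than typical milkshake that uses normal chocolates",
          3: "making chocolates drink different way using fresh milk egg"}
my_unigrams = ['jam', 'chocolates', 'milk']            # simplified unigram vocabulary

texts = [doc.split() for doc in corpus.values()]
dct = Dictionary(texts)
keep_ids = [dct.token2id[t] for t in my_unigrams if t in dct.token2id]
dct.filter_tokens(good_ids=keep_ids)                   # restrict the dictionary to the vocabulary
bow = [dct.doc2bow(t) for t in texts]
tfidf = TfidfModel(bow)
print([tfidf[b] for b in bow])                         # (token id, tf-idf weight) pairs per document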

String to phrase replacement python

I have a text string and I want to replace two words with a single word. E.g. if the word is artificial intelligence, I want to replace it with artificial_intelligence. This needs to be done for a list of 200 words and on a text file of size 5 mb.
I tried string.replace but it can work only for one element, not for the list.
Example
Text = 'Artificial intelligence is useful for us in every situation of deep learning.'
List a : List b
Artificial intelligence : artificial_intelligence
Deep learning : deep_learning
...
Text.replace('Artificial intelligence', 'Artificial_intelligence') works.
But
for i in range(len(list_a)):
    Text = Text.replace(list_a[i], list_b[i])
doesn't work.
I would suggest using a dict for your replacements:
text = "Artificial intelligence is useful for us in every situation of deep learning."
replacements = {"Artificial intelligence" : "Artificial_intelligence",
"deep learning" : "deep_learning"}
Then your approach works (although it is case-sensitive):
>>> for rep in replacements:
...     text = text.replace(rep, replacements[rep])
>>> print(text)
Artificial_intelligence is useful for us in every situation of deep_learning.
For other approaches (like the suggested regex-approach), have a look at SO: Python replace multiple strings.
Since you have a case mismatch between your list entries and your string, you could use the re.sub() function with the IGNORECASE flag to obtain what you want:
import re
list_a = ['Artificial intelligence', 'Deep learning']
list_b = ['artificial_intelligence', 'deep_learning']
text = 'Artificial intelligence is useful for us in every situation of deep learning.'
for from_, to in zip(list_a, list_b):
    text = re.sub(from_, to, text, flags=re.IGNORECASE)
print(text)
# artificial_intelligence is useful for us in every situation of deep_learning.
Note the use of the zip() function, which allows iterating over the two lists at the same time.
Also note that Christian is right: a dict would be more suitable for your substitution data. The previous code would then be the following, for the exact same result:
import re
subs = {'Artificial intelligence': 'artificial_intelligence',
        'Deep learning': 'deep_learning'}
text = 'Artificial intelligence is useful for us in every situation of deep learning.'
for from_, to in subs.items():
    text = re.sub(from_, to, text, flags=re.IGNORECASE)
print(text)

How to un-stem a word in Python?

I want to know if there is any way that I can un-stem them to a normal form.
The problem is that I have thousands of words in different forms, e.g. eat, eaten, ate, eating and so on, and I need to count the frequency of each word. All of these (eat, eaten, ate, eating, etc.) should count towards eat, and hence I used stemming.
But the next part of the problem requires me to find similar words in the data, and I am using nltk's synsets to calculate the Wu-Palmer similarity among the words. The problem is that nltk's synsets won't work on stemmed words, or at least not in this code: check if two words are related to each other
How should I do it? Is there a way to un-stem a word?
I think an OK approach is something like the one described in https://stackoverflow.com/a/30670993/7127519.
A possible implementation could be something like this:
import re
import string
import nltk
import pandas as pd
stemmer = nltk.stem.porter.PorterStemmer()
That is the stemmer to use; here is a text to work with:
complete_text = ''' cats catlike catty cat
stemmer stemming stemmed stem
fishing fished fisher fish
argue argued argues arguing argus argu
argument arguments argument '''
Create a list with the different words:
my_list = []
for i in complete_text.split():        # .decode() was only needed for Python 2 byte strings
    if i.lower() not in my_list:
        my_list.append(i.lower())
my_list
with output:
['cats',
'catlike',
'catty',
'cat',
'stemmer',
'stemming',
'stemmed',
'stem',
'fishing',
'fished',
'fisher',
'fish',
'argue',
'argued',
'argues',
'arguing',
'argus',
'argu',
'argument',
'arguments']
And now create the dictionary:
aux = pd.DataFrame(my_list, columns =['word'] )
aux['word_stemmed'] = aux['word'].apply(lambda x : stemmer.stem(x))
aux = aux.groupby('word_stemmed').transform(lambda x: ', '.join(x))
aux['word_stemmed'] = aux['word'].apply(lambda x : stemmer.stem(x.split(',')[0]))
aux.index = aux['word_stemmed']
del aux['word_stemmed']
my_dict = aux.to_dict('dict')['word']
my_dict
The output is:
{'argu': 'argue, argued, argues, arguing, argus, argu',
'argument': 'argument, arguments',
'cat': 'cats, cat',
'catlik': 'catlike',
'catti': 'catty',
'fish': 'fishing, fished, fish',
'fisher': 'fisher',
'stem': 'stemming, stemmed, stem',
'stemmer': 'stemmer'}
Companion notebook here.
No, there isn't. With stemming, you lose information, not only about the word form (as in eat vs. eats or eaten), but also about the word itself (as in tradition vs. traditional). Unless you're going to use a prediction method to try and predict this information on the basis of the context of the word, there's no way to get it back.
tl;dr: you could use any stemmer you want (e.g.: Snowball) and keep track of what word was the most popular before stemming for each stemmed word by counting occurrences.
You may like this open-source project which uses Stemming and contains an algorithm to do Inverse Stemming:
https://github.com/ArtificiAI/Multilingual-Latent-Dirichlet-Allocation-LDA
On this page of the project, there are explanations of how to do the inverse stemming. To sum things up, it works as follows.
First, you stem some documents; here, short French-language strings with their stop words removed, for example:
['sup chat march trottoir',
'sup chat aiment ronron',
'chat ronron',
'sup chien aboi',
'deux sup chien',
'combien chien train aboi']
Then the trick is to have kept, for each stemmed word, counts of the original words that produced it:
{'aboi': {'aboie': 1, 'aboyer': 1},
'aiment': {'aiment': 1},
'chat': {'chat': 1, 'chats': 2},
'chien': {'chien': 1, 'chiens': 2},
'combien': {'Combien': 1},
'deux': {'Deux': 1},
'march': {'marche': 1},
'ronron': {'ronronner': 1, 'ronrons': 1},
'sup': {'super': 4},
'train': {'train': 1},
'trottoir': {'trottoir': 1}}
Finally, you can now guess how to implement this yourself: simply take the original word with the highest count for each stemmed word. You can refer to the following implementation, which is available under the MIT License as part of the Multilingual-Latent-Dirichlet-Allocation-LDA project:
https://github.com/ArtificiAI/Multilingual-Latent-Dirichlet-Allocation-LDA/blob/master/lda_service/logic/stemmer.py
Improvements could be made by ditching the non-top reverse words (using a heap, for example), which would yield just one dict in the end instead of a dict of dicts.
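A minimal sketch of that bookkeeping idea (not the project's actual implementation), using NLTK's Snowball stemmer:
from collections import Counter, defaultdict
from nltk.stem.snowball import SnowballStemmer

stemmer = SnowballStemmer("english")
words = ["eat", "eating", "eats", "tradition", "traditional", "traditions"]

# For each stem, count the original surface forms that produced it.
stem_counts = defaultdict(Counter)
for w in words:
    stem_counts[stemmer.stem(w)][w] += 1

def unstem(stem):
    # "Inverse stemming": return the most frequent original form seen for this stem.
    return stem_counts[stem].most_common(1)[0][0]

print(unstem(stemmer.stem("eating")))   # the most common surface form of the 'eat' stem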
I suspect what you really mean by "stem" is "tense", as in you want the different tenses of each word to count towards the "base form" of the verb.
Check out the pattern package:
pip install pattern
Then use the en.lemma function to return a verb's base form.
import pattern.en as en
base_form = en.lemma('ate') # base_form == "eat"
Theoretically, the only way to un-stem is if, prior to stemming, you kept a dictionary of terms or a mapping of some kind, and carried this mapping through the rest of your computations. This mapping should capture the position of each unstemmed token, so that when you need to un-stem a token, knowing the original position of its stemmed form lets you trace back and recover the original unstemmed representation.
For the bag-of-words representation this seems computationally intensive and somewhat defeats the purpose of the statistical nature of the BoW approach.
But again, theoretically I believe it could work. I haven't seen it in any implementation, though.

Is there any classifier which works at both word and sentence level?

In scikit-learn or NLTK, classifiers generally consider term frequency or TF-IDF.
I want to consider term frequency as well as sentence structure for classification. I have 15 categories of questions, each with a text file containing sentences separated by new lines.
The category city contains this sentence:
In which city Obama was born?
If I rely only on term frequency, then the following might not be matched, because "Obama" or "city" in the dataset do not match the query sentence:
1. In which place Hally was born? 2. In which city Hally was born?
Is there any classifier which considers both term frequency and sentence structure, so that when trained it also classifies input queries with a similar sentence structure?
You could train the tf-idf on n-grams as well, in addition to the unigrams.
In scikit-learn you can specify the ngram_range that will be taken into account: if you set it to train on up to 3-grams, you would end up storing the frequency of word combinations such as "In which place", which is pretty indicative of the type of question being asked.
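A minimal sketch of that setting with TfidfVectorizer on toy questions (get_feature_names_out requires scikit-learn >= 1.0):
from sklearn.feature_extraction.text import TfidfVectorizer

questions = ["In which city Obama was born?",
             "In which place Hally was born?"]
vectorizer = TfidfVectorizer(ngram_range=(1, 3))   # unigrams, bigrams and trigrams
X = vectorizer.fit_transform(questions)
print(vectorizer.get_feature_names_out())          # includes n-grams like 'in which city'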
As drekyn said, you can use scikit-learn for feature extraction; here are some examples:
>>> bigram_vectorizer = CountVectorizer(ngram_range=(1, 2),
... token_pattern=r'\b\w+\b', min_df=1)
>>> analyze = bigram_vectorizer.build_analyzer()
>>> analyze('Bi-grams are cool!') == (
... ['bi', 'grams', 'are', 'cool', 'bi grams', 'grams are', 'are cool'])
True
Source
