How to prevent Gensim simple_preprocess from removing digits? - python

I am having some problems preprocessing data with gensim.utils.simple_preprocess.
In short, I noticed that the simple_preprocess function removes the digits from my text, but I don't want it to!
For instance, I have this code:
import gensim
from gensim.utils import simple_preprocess
my_text = ["I am doing activity number 1", "Instead, I am doing the number 2"]
def gen_words(texts):
    final = []
    for text in texts:
        new = gensim.utils.simple_preprocess(text, deacc=True, min_len=1)
        final.append(new)
    return final

solution = gen_words(my_text)
print(solution)
The output is the following:
[['i', 'am', 'doing', 'activity', 'number'], ['instead', 'i', 'am', 'doing', 'the', 'number']]
I would like instead to have this as a solution:
[['i', 'am', 'doing', 'activity', 'number', '1'], ['instead', 'i', 'am', 'doing', 'the', 'number', '2']]
How can I keep the digits from being erased? I have also tried setting min_len=0, but it still doesn't work.

The simple_preprocess() function is just one rather simple convenience option for tokenizing a string of text into a list of tokens.
It's not especially well-tuned for any particular need, and it has no configurable option to retain tokens that don't match its particular hardcoded pattern (PAT_ALPHABETIC), which rules out tokens with leading digits.
Many projects will want to apply their own tokenization/preprocessing instead, better suited to their data & problem domain. If you need ideas for how to start, you can consult the actual source code Gensim uses for simple_preprocess() (and the functions it relies on, like tokenize() & simple_tokenize()):
https://github.com/RaRe-Technologies/gensim/blob/develop/gensim/utils.py
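For example, a minimal replacement (just a sketch, not part of Gensim; my_preprocess is a made-up name) that lowercases and keeps both alphabetic and numeric tokens could be built on a plain regular expression:

import re

def my_preprocess(text):
    # keep runs of letters and digits, lowercased; unlike simple_preprocess,
    # this does not drop tokens that start with (or consist of) digits
    return re.findall(r'[a-z0-9]+', text.lower())

my_text = ["I am doing activity number 1", "Instead, I am doing the number 2"]
print([my_preprocess(t) for t in my_text])
# [['i', 'am', 'doing', 'activity', 'number', '1'], ['instead', 'i', 'am', 'doing', 'the', 'number', '2']]

This intentionally leaves out accent handling (deacc) and length filtering; add those back only if your data needs them.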

Related

Tokenizing words by preserving certain words with arithmetic and logical operators in python 3?

While tokenizing multiple sentences from a large corpus, I need to preserve certain words in their original form, like .Net, C#, C++. I also want to remove the punctuation marks (.,!_-()=*&^%$#~ etc.) but need to preserve words like .net, .htaccess, .htpassword, c++ etc.
I have tried both nltk.word_tokenize and nltk.regexp_tokenize, but I am not getting the expected output.
Please help me in fixing the aforementioned issue.
The code:
import nltk
from nltk import regexp_tokenize
from nltk.corpus import stopwords
def pre_data():
    tokenized_sentences = nltk.sent_tokenize(tokenized_raw_data)
    sw0 = (stopwords.words('english'))
    sw1 = ["i.e", "dxint", "hrangle", "idoteq", "devs", "zero"]
    sw = sw0 + sw1
    tokens = [[word for word in regexp_tokenize(word, pattern=r"\s|\d|[^.+#\w a-z]", gaps=True)] for word in tokenized_sentences]
    print(tokens)

pre_data()
The tokenized_raw_data is a normal text file. It contains multiple sentences separated by white space, with words like .blog, .net, c++, c#, asp.net, .htaccess etc.
Example
['.blog is a generic top-level domain intended for use by blogs.',
 'C# is a general-purpose, multi-paradigm programming language.',
 'C++ is object-oriented programming language.']
This solution covers the given examples and preserves words like C++, C#, asp.net and so on, while removing normal punctuation.
import nltk

paragraph = (
    '.blog is a generic top-level domain intended for use by blogs. '
    'C# is a general-purpose, multi-paradigm programming language. '
    'C++ is object-oriented programming language. '
    'asp.net is something very strange. '
    'The most fascinating language is c#. '
    '.htaccess makes my day!'
)

def pre_data(raw_data):
    tokenized_sentences = nltk.sent_tokenize(raw_data)
    tokens = [nltk.regexp_tokenize(sentence, pattern=r'\w*\.?\w+[#+]*') for sentence in tokenized_sentences]
    return tokens

tokenized_data = pre_data(paragraph)
print(tokenized_data)
Out
[['.blog', 'is', 'a', 'generic', 'top', 'level', 'domain', 'intended', 'for', 'use', 'by', 'blogs'],
['C#', 'is', 'a', 'general', 'purpose', 'multi', 'paradigm', 'programming', 'language'],
['C++', 'is', 'object', 'oriented', 'programming', 'language'],
['asp.net', 'is', 'something', 'very', 'strange'],
['The', 'most', 'fascinating', 'language', 'is', 'c#'],
['.htaccess', 'makes', 'my', 'day']]
However, this simple regular expression will probably not work for all the technical terms in your texts. Provide fuller examples if you need a more general solution.

Regular expression matches a few symbols but not includes some

There is a paragraph, and I want to use a regular expression to extract all the words inside it.
a bdag agasg it's the cookies for dogs',don't you think so? the word 'wow' in english means.you hey b 097 dag final
I have tried several regexes with re.findall(regX,str), and found one that can match most words.
regX = "[ ,\.\?]?([a-z]+'?[a-z]?)[ ,\.\?]?"
['a', 'bdag', 'agasg', "it's", 'the', 'cookies', 'for', "dogs'", "don't", 'you', 'think', 'so', 'the', 'word', "wow'", 'in', 'english', 'means', 'you', 'hey', 'b', 'dag', 'final']
All of them are fine except wow', which keeps the trailing apostrophe.
I wonder whether a regular expression can express the logic "the boundary can be a comma/space/period/etc. but can't be an apostrophe".
Can someone advise?
Try:
[ ,\.\?']?([a-z]*('\w)?)[\' ,\.\?]?
This adds another group, so you'll have to select only group 1.
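As a quick illustration of selecting only group 1 (a sketch using a fragment of the question's sentence; note that re.findall returns tuples once a pattern has more than one group, so re.finditer with m.group(1) is used here):

import re

regX = r"[ ,\.\?']?([a-z]*('\w)?)[\' ,\.\?]?"
s = "the word 'wow' in english means"
# keep only the first capture group, skipping empty matches
words = [m.group(1) for m in re.finditer(regX, s) if m.group(1)]
print(words)
# ['the', 'word', 'wow', 'in', 'english', 'means']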
I didn't fully understand what you wanted the output to be, but try this:
[ ,\.\?]?(["-']?+[a-z]+["-']?[a-z]?)[ ,\.\?]?
Using this regex lets you keep the ' and " within the text.
If this still isn't what you wanted, please let me know so I can update my answer.

Convert list of string representations of sentences into vocabulary set

I have a list of string representations of sentences that looks something like this:
original_format = ["This is a question", "This is another question", "And one more too"]
I want to convert this list into a set of unique words in my corpus. Given the above list, the output would look something like this:
{'And', 'This', 'a', 'another', 'is', 'more', 'one', 'question', 'too'}
I've figured out a way to do this, but it takes a very long time to run. I am interested in a more efficient way of converting from one format to another (especially since my actual dataset contains >200k sentences).
FYI, what I'm doing right now is creating an empty set for the vocab and then looping through each sentence (split by spaces) and unioning with the vocab set. Using the original_format variable as defined above, it looks like this:
vocab = set()
for q in original_format:
    vocab = vocab.union(set(q.split(' ')))
Can you help me run this conversion more efficiently?
You can use itertools.chain with set. This avoids nested for loops and list construction.
from itertools import chain
original_format = ["This is a question", "This is another question", "And one more too"]
res = set(chain.from_iterable(i.split() for i in original_format))
print(res)
{'And', 'This', 'a', 'another', 'is', 'more', 'one', 'question', 'too'}
Or for a truly functional approach:
from itertools import chain
from operator import methodcaller
res = set(chain.from_iterable(map(methodcaller('split'), original_format)))
Using a simple set comprehension:
{j for i in original_format for j in i.split()}
Output:
{'too', 'is', 'This', 'And', 'question', 'another', 'more', 'one', 'a'}
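As a side note, the original loop can also be fixed in place: most of its cost comes from building a brand-new set on every iteration, and set.update avoids that (a small sketch; split() without an argument also handles repeated spaces):

vocab = set()
for q in original_format:
    vocab.update(q.split())  # add words in place instead of rebuilding the set
print(vocab)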

how to write a Python program that reads from a text file, and builds a dictionary which maps each word

I am having difficulties with writing a Python program that reads from a text file, and builds a dictionary which maps each word that appears in the file to a list of all the words that immediately follow that word in the file. The list of words can be in any order and should include duplicates.
For example, the key "and" might have the list ["then", "best", "after", ...] listing all the words which came after "and" in the text.
Any idea would be a great help.
A couple of ideas:
Set up a collections.defaultdict for your output. This is a dictionary with a default value for keys that don't yet exist (in this case, as aelfric5578 suggests, an empty list);
Build a list of all the words in your file, in order; and
You can use zip(lst, lst[1:]) to create pairs of consecutive list elements.
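For instance, the pairing step alone looks like this (a tiny sketch with a made-up word list):

lst = ['and', 'then', 'and', 'best']
# zip the list against itself shifted by one to get (word, next_word) pairs
print(list(zip(lst, lst[1:])))
# [('and', 'then'), ('then', 'and'), ('and', 'best')]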
Welcome to stackoverflow.com.
Are you sure you need a dictionary?
It will take a lot of memory if the text is long, just to repeat the same data across several entries.
Whereas if you use a function, it will give you the desired list(s) on demand.
For example:
s = """In Newtonian physics, free fall is any motion
of a body where its weight is the only force acting
upon it. In the context of general relativity where
gravitation is reduced to a space-time curvature,
a body in free fall has no force acting on it and
it moves along a geodesic. The present article
concerns itself with free fall in the Newtonian domain."""
import re

def say_me(word, li=re.split(r'\s+', s)):
    for i, w in enumerate(li):
        if w == word:
            print('\n%s at index %d followed by\n%s' % (w, i, li[i+1:]))

say_me('free')
result
free at index 3 followed by
['fall', 'is', 'any', 'motion', 'of', 'a', 'body', 'where', 'its', 'weight', 'is', 'the', 'only', 'force', 'acting', 'upon', 'it.', 'In', 'the', 'context', 'of', 'general', 'relativity', 'where', 'gravitation', 'is', 'reduced', 'to', 'a', 'space-time', 'curvature,', 'a', 'body', 'in', 'free', 'fall', 'has', 'no', 'force', 'acting', 'on', 'it', 'and', 'it', 'moves', 'along', 'a', 'geodesic.', 'The', 'present', 'article', 'concerns', 'itself', 'with', 'free', 'fall', 'in', 'the', 'Newtonian', 'domain.']
free at index 38 followed by
['fall', 'has', 'no', 'force', 'acting', 'on', 'it', 'and', 'it', 'moves', 'along', 'a', 'geodesic.', 'The', 'present', 'article', 'concerns', 'itself', 'with', 'free', 'fall', 'in', 'the', 'Newtonian', 'domain.']
free at index 58 followed by
['fall', 'in', 'the', 'Newtonian', 'domain.']
The assignment li=re.split(r'\s+', s) binds the parameter li to the object returned by re.split(r'\s+', s), passed as a default argument.
This binding is done only once: at the moment the function definition is read by the interpreter to create the function object. li is simply a parameter defined with a default argument.
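A toy example of that definition-time binding (not from the answer above; append_to is a made-up name):

def append_to(x, bucket=[]):
    # the default list is created once, when "def" runs, and is then shared
    bucket.append(x)
    return bucket

print(append_to(1))  # [1]
print(append_to(2))  # [1, 2] -- the same list object as in the first call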
Here is what I would do:
from collections import defaultdict

# My example line:
s = 'In the face of ambiguity refuse the temptation to guess'
# The previous string is quite easy to tokenize, but in the real world you'll have to:
# - remove commas, dots, etc.
# - probably encode to ascii (the unidecode 3rd-party module can be helpful)
# - probably normalize case as well
lst = s.lower().split(' ')  # naive tokenizer

ddic = defaultdict(list)
for word1, word2 in zip(lst, lst[1:]):
    ddic[word1].append(word2)

# ddic contains what you want (but is a defaultdict)
# if you want to work with a "classical" dictionary, just cast it
# (often it's not needed):
dic = dict(ddic)
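For the sample sentence above, the resulting mapping comes out roughly like this (key order may vary):

print(dic)
# {'in': ['the'], 'the': ['face', 'temptation'], 'face': ['of'], 'of': ['ambiguity'],
#  'ambiguity': ['refuse'], 'refuse': ['the'], 'temptation': ['to'], 'to': ['guess']}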
Sorry if I seem to be stealing the commentators' ideas, but this is almost the same code I used in some of my projects (pre-computation for similar-document algorithms).

How to format tweets using python through twitter api?

I collected some tweets through twitter api. Then I counted the words using split(' ') in python. However, some words appear like this:
correct!
correct.
,correct
blah"
...
So how can I format the tweets without punctuation? Or maybe I should try another way to split tweets? Thanks.
You can do the split on multiple characters using re.split...
from string import punctuation
import re
puncrx = re.compile(r'[{}\s]'.format(re.escape(punctuation)))
print(list(filter(None, puncrx.split(your_tweet))))
Or, just find words that contain certain contiguous characters:
print(re.findall(r'[\w##]+', your_tweet))
e.g.:
print(re.findall(r'[\w##]+', 'talking about #python with #someone is so much fun! Is there a 140 char limit? So not cool!'))
# ['talking', 'about', '#python', 'with', '#someone', 'is', 'so', 'much', 'fun', 'Is', 'there', 'a', '140', 'char', 'limit', 'So', 'not', 'cool']
I did originally have a smiley in the example, but of course these end up getting filtered out with this method, so that's something to be wary of.
Try removing the punctuation from the string before doing the split.
import string
s = "Some nice sentence. This has punctuation!"
out = s.translate(str.maketrans("", "", string.punctuation))
Then do the split on out.
I would advise cleaning the text of special symbols before splitting it, using this code:
tweet_object["text"] = re.sub(u'[!?##$.,#:\u2026]', '', tweet_object["text"])
You would need to import re before using re.sub:
import re
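Putting it together with the word counting mentioned in the question, a minimal sketch could look like this (the tweet text is made up, and the token pattern is an assumption similar to the one used above):

import re
from collections import Counter

tweet = 'correct! correct. ,correct blah" #python is fun'
# pull out word-like tokens (letters, digits, underscores, hashtags), lowercased
words = re.findall(r'[\w#]+', tweet.lower())
print(Counter(words))
# Counter({'correct': 3, 'blah': 1, '#python': 1, 'is': 1, 'fun': 1})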
