Efficient fuzzy string comparison over thousands of text files - python

I need to search several thousand plain-text files for a set of names. I'm generating trigrams to retain context. I need to account for minor misspellings, so I'm using a Levenshtein distance calculation, function lev(). I need the final result to return the name with a hit, the filename the hit was in, and the trigram that was marked as a hit. My Python program works as expected, but very slowly. I'm searching for a faster way to do this search, preferably in Python, but my Google-fu has failed me. A genericized version of the program is below:
from sklearn.feature_extraction.text import CountVectorizer
import os
textfiles = []
newgrams = set()
ngrams = []
hitlist = []
path = 'path of folder of textfiles'
names = ['john james doe', 'jane jill doe']
vectorizer = CountVectorizer(input='filename', ngram_range=(3, 3),
                             strip_accents='unicode', stop_words='english',
                             token_pattern='[a-zA-Z\-]\\w*',
                             encoding='utf-8', decode_error='replace', lowercase=True)
ngramer = vectorizer.build_analyzer()

for dirpath, dirnames, filenames in os.walk(path):
    for files in filenames:
        if files.endswith('.txt'):
            textfiles.append(files)

ctFiles = len(textfiles)
ctNames = len(names)

for i in range(ctFiles):
    newgrams = set(ngramer(path+'/'+textfiles[i]))
    ngrams.append(newgrams)

for i in range(ctNames):
    splitname = names[i].split()
    for j in range(ctFiles):
        tempset = set()
        for k in range(len(splitname)):
            if k == 0:
                ## subset only the trigrams that "match" first name
                for trigram in ngrams[j]:
                    for word in trigram.split():
                        if lev(splitname[k], word) < 2:
                            tempset.add(trigram)
            else:
                ## search that subset for middle/last name
                if len(tempset) > 0:
                    for trigram in tempset:
                        for word in trigram.split():
                            if lev(splitname[k], word) < 2:
                                hitlist.append([names[i], textfiles[j], trigram])

print(hitlist) ## eventually save to CSV

I am using fuzzywuzzy; it's pretty fast on my dataset (100K sentences): https://github.com/seatgeek/fuzzywuzzy
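For context, here is a minimal sketch of how fuzzywuzzy could be dropped into the trigram loop from the question. The helper name fuzzy_name_hits, the trigrams_by_file mapping and the cutoff of 85 are my own choices, not from the original post, and the cutoff would need tuning on real data:

from fuzzywuzzy import fuzz

def fuzzy_name_hits(names, trigrams_by_file, cutoff=85):
    """Return [name, filename, trigram] rows where the trigram fuzzily matches the name."""
    hits = []
    for name in names:
        for filename, trigrams in trigrams_by_file.items():
            for trigram in trigrams:
                # token_set_ratio ignores word order and duplicate tokens,
                # which suits "first middle last" style names
                if fuzz.token_set_ratio(name, trigram) >= cutoff:
                    hits.append([name, filename, trigram])
    return hits

# trigrams_by_file would map each .txt filename to the set produced by ngramer()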

Levenshtein is very expensive, and I wouldn't recommend using it for fuzzy matching over this many documents (unless you want to build a Levenshtein automaton to generate an index of tokens n edits away from every word in your files).
Trigram indexing should be fast and accurate on its own for words of a certain length. You mention names, though; if those span multiple words, chunks need to be indexed as well, and that has to be implemented separately.
If you try trigram indexing on its own and aren't satisfied with the accuracy, you can add a trigram chunk index, i.e. (Ban, ana, nan) stored as a tuple in addition to Ban, ana, and nan as individual trigrams, but kept in a separate index. Accuracy degrades further as word length decreases, so that should be accounted for.
The key point is that a Levenshtein comparison costs roughly O(length of query × length of word) per word, and you pay that for every word in every file, while token/trigram/chunk index lookups cost roughly O(log(number of indexed terms)) per token/chunk/trigram in the query.
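To make the trade-off concrete, here is a minimal sketch of a character-trigram inverted index; the helper names char_trigrams, build_index and candidates are mine, not from the answer, and the min_shared threshold is an arbitrary example:

from collections import defaultdict

def char_trigrams(word):
    """Character trigrams of a single word, e.g. 'banana' -> {'ban', 'ana', 'nan'}."""
    word = word.lower()
    return {word[i:i + 3] for i in range(len(word) - 2)}

def build_index(words_by_file):
    """Map each trigram to the set of (filename, word) pairs containing it."""
    index = defaultdict(set)
    for filename, words in words_by_file.items():
        for word in words:
            for tri in char_trigrams(word):
                index[tri].add((filename, word))
    return index

def candidates(index, query, min_shared=2):
    """Words sharing at least min_shared trigrams with the query; only these few need a Levenshtein check."""
    counts = defaultdict(int)
    for tri in char_trigrams(query):
        for hit in index[tri]:
            counts[hit] += 1
    return {hit for hit, n in counts.items() if n >= min_shared}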

Related

How to remove a word completely from a Word2Vec model in gensim?

Given a model, e.g.
from gensim.models.word2vec import Word2Vec

documents = ["Human machine interface for lab abc computer applications",
             "A survey of user opinion of computer system response time",
             "The EPS user interface management system",
             "System and human system engineering testing of EPS",
             "Relation of user perceived response time to error measurement",
             "The generation of random binary unordered trees",
             "The intersection graph of paths in trees",
             "Graph minors IV Widths of trees and well quasi ordering",
             "Graph minors A survey"]
texts = [d.lower().split() for d in documents]
w2v_model = Word2Vec(texts, size=5, window=5, min_count=1, workers=10)
It's possible to remove the word from the w2v vocabulary, e.g.
# Originally, it's there.
>>> print(w2v_model['graph'])
[-0.00401433 0.08862179 0.08601206 0.05281207 -0.00673626]
>>> print(w2v_model.wv.vocab['graph'])
Vocab(count:3, index:5, sample_int:750148289)
# Find most similar words.
>>> print(w2v_model.most_similar('graph'))
[('binary', 0.6781558990478516), ('a', 0.6284914612770081), ('unordered', 0.5971308350563049), ('perceived', 0.5612867474555969), ('iv', 0.5470727682113647), ('error', 0.5346164703369141), ('machine', 0.480206698179245), ('quasi', 0.256790429353714), ('relation', 0.2496253103017807), ('trees', 0.2276223599910736)]
# We can delete it from the dictionary
>>> del w2v_model.wv.vocab['graph']
>>> print(w2v_model['graph'])
KeyError: "word 'graph' not in vocabulary"
But when we do a similarity on other words after deleting graph, we see the word graph popping up, e.g.
>>> w2v_model.most_similar('binary')
[('unordered', 0.8710334300994873), ('ordering', 0.8463168144226074), ('perceived', 0.7764195203781128), ('error', 0.7316686511039734), ('graph', 0.6781558990478516), ('generation', 0.5770125389099121), ('computer', 0.40017056465148926), ('a', 0.2762695848941803), ('testing', 0.26335978507995605), ('trees', 0.1948457509279251)]
How to remove a word completely from a Word2Vec model in gensim?
Updated
To answer @vumaasha's comment:
could you give some details as to why you want to delete a word
Let's say my universe of words is all the words in the corpus, used to learn the dense relations between all of them.
But when I want to generate the similar words, they should only come from a subset of domain-specific words.
It's possible to generate more than enough from .most_similar() and then filter the words, but let's say the space of the specific domain is small; I might be looking for a word that's ranked 1000th most similar, which is inefficient.
It would be better if the word were totally removed from the word vectors; then .most_similar() won't return words outside of the specific domain.
I wrote a function which removes words from KeyedVectors which aren't in a predefined word list.
def restrict_w2v(w2v, restricted_word_set):
    new_vectors = []
    new_vocab = {}
    new_index2entity = []
    new_vectors_norm = []

    for i in range(len(w2v.vocab)):
        word = w2v.index2entity[i]
        vec = w2v.vectors[i]
        vocab = w2v.vocab[word]
        vec_norm = w2v.vectors_norm[i]
        if word in restricted_word_set:
            vocab.index = len(new_index2entity)
            new_index2entity.append(word)
            new_vocab[word] = vocab
            new_vectors.append(vec)
            new_vectors_norm.append(vec_norm)

    w2v.vocab = new_vocab
    w2v.vectors = new_vectors
    w2v.index2entity = new_index2entity
    w2v.index2word = new_index2entity
    w2v.vectors_norm = new_vectors_norm
It rewrites all of the word-related attributes of the Word2VecKeyedVectors object.
Usage:
w2v = KeyedVectors.load_word2vec_format("GoogleNews-vectors-negative300.bin.gz", binary=True)
w2v.most_similar("beer")
[('beers', 0.8409687876701355),
('lager', 0.7733745574951172),
('Beer', 0.71753990650177),
('drinks', 0.668931245803833),
('lagers', 0.6570086479187012),
('Yuengling_Lager', 0.655455470085144),
('microbrew', 0.6534324884414673),
('Brooklyn_Lager', 0.6501551866531372),
('suds', 0.6497018337249756),
('brewed_beer', 0.6490240097045898)]
restricted_word_set = {"beer", "wine", "computer", "python", "bash", "lagers"}
restrict_w2v(w2v, restricted_word_set)
w2v.most_similar("beer")
[('lagers', 0.6570085287094116),
('wine', 0.6217695474624634),
('bash', 0.20583480596542358),
('computer', 0.06677375733852386),
('python', 0.005948573350906372)]
There is no direct way to do what you are looking for. However, you are not completely lost. The method most_similar is implemented in the class WordEmbeddingsKeyedVectors (check the link). You can take a look at this method and modify it to suit your needs.
The lines shown below perform the actual logic of computing the similar words; you need to replace the variable limited with the vectors corresponding to the words of your interest, and then you are done:
limited = self.vectors_norm if restrict_vocab is None else self.vectors_norm[:restrict_vocab]
dists = dot(limited, mean)
if not topn:
    return dists
best = matutils.argsort(dists, topn=topn + len(all_words), reverse=True)
Update:
limited = self.vectors_norm if restrict_vocab is None else self.vectors_norm[:restrict_vocab]
Looking at this line: when restrict_vocab is used, it restricts to the top n words in the vocab, which is meaningful only if you have sorted the vocab by frequency. If you don't pass restrict_vocab, self.vectors_norm is what goes into limited.
The method most_similar calls another method, init_sims, which initializes the value of self.vectors_norm as shown below:
self.vectors_norm = (self.vectors / sqrt((self.vectors ** 2).sum(-1))[..., newaxis]).astype(REAL)
So you can pick up the words that you are interested in, prepare their norms, and use them in place of limited. This should work.
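A minimal sketch of that idea, assuming a gensim 3.x-era KeyedVectors object w2v; the names build_limited, domain_words and most_similar_in_domain are mine, not gensim API:

import numpy as np

def build_limited(w2v, domain_words):
    """Stack the unit-normalised vectors of the domain words, to stand in for `limited`."""
    w2v.init_sims()  # make sure w2v.vectors_norm is populated
    rows = [w2v.vocab[w].index for w in domain_words if w in w2v.vocab]
    kept_words = [w2v.index2word[i] for i in rows]
    limited_norms = w2v.vectors_norm[rows]  # shape: (len(kept_words), vector_size)
    return kept_words, limited_norms

def most_similar_in_domain(w2v, query, domain_words, topn=10):
    """Rank only the domain words by cosine similarity to `query`."""
    kept_words, limited_norms = build_limited(w2v, domain_words)
    query_vec = w2v.word_vec(query, use_norm=True)  # unit-normalised query vector
    dists = limited_norms.dot(query_vec)
    best = np.argsort(-dists)[:topn]
    return [(kept_words[i], float(dists[i])) for i in best]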
Note that this does not trim the model per se. It trims the KeyedVectors object that the similarity look-ups are based on.
Suppose you only want to keep the top 5000 words in your model.
wv = w2v_model.wv
words_to_trim = wv.index2word[5000:]
# In op's case
# words_to_trim = ['graph']
ids_to_trim = [wv.vocab[w].index for w in words_to_trim]

for w in words_to_trim:
    del wv.vocab[w]

wv.vectors = np.delete(wv.vectors, ids_to_trim, axis=0)
wv.init_sims(replace=True)

for i in sorted(ids_to_trim, reverse=True):
    del(wv.index2word[i])
This does the job because the BaseKeyedVectors class contains the following attributes: self.vectors, self.vectors_norm, self.vocab, self.vector_size, self.index2word.
The advantage of this is that if you write the KeyedVectors using methods such as save_word2vec_format(), the file is much smaller.
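For example, something along these lines (the output filename is a placeholder of my own):

# Save only the trimmed vectors; the resulting file contains just the kept words.
wv.save_word2vec_format('w2v_trimmed.bin', binary=True)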
I have tried a few approaches and felt that the most straightforward way is as follows:
Get the Word2Vec embeddings in text file format.
Identify the lines corresponding to the word vectors that you would like to keep.
Write a new text file Word2Vec embedding model.
Load model and enjoy (save to binary if you wish, etc.)...
My sample code is as follows:
line_no = 0 # line0 = header
numEntities = 0
targetLines = []

with open(file_entVecs_txt, 'r') as fp:
    header = fp.readline() # header

    while True:
        line = fp.readline()
        if line == '': #EOF
            break
        line_no += 1

        isLatinFlag = True
        for i_l, char in enumerate(line):
            if not isLatin(char): # Care about entity that is Latin-only
                isLatinFlag = False
                break
            if char == ' ': # reached separator
                ent = line[:i_l]
                break
        if not isLatinFlag:
            continue

        # Check for numbers in entity
        if re.search('\d', ent):
            continue
        # Check for entities with subheadings '#' (e.g. 'ENTITY/Stereotactic_surgery#History')
        if re.match('^ENTITY/.*#', ent):
            continue

        targetLines.append(line_no)
        numEntities += 1

# Update header with new metadata
header_new = re.sub('^\d+', str(numEntities), header, count=1)

# Generate the file
txtWrite('', file_entVecs_SHORT_txt)
txtAppend(header_new, file_entVecs_SHORT_txt)

line_no = 0
ptr = 0
with open(file_entVecs_txt, 'r') as fp:
    while ptr < len(targetLines):
        target_line_no = targetLines[ptr]

        while (line_no != target_line_no):
            fp.readline()
            line_no += 1

        line = fp.readline()
        line_no += 1
        ptr += 1

        txtAppend(line, file_entVecs_SHORT_txt)
FYI. FAILED ATTEMPT: I tried out @zsozso's method (with the np.array modifications suggested by @Taegyung), left it to run overnight for at least 12 hrs, and it was still stuck at building the new vocab from the restricted set. This is perhaps because I have a lot of entities... but my text-file method works within an hour.
FAILED CODE
# [FAILED] Stuck at Building new vocab...
def restrict_w2v(w2v, restricted_word_set):
    new_vectors = []
    new_vocab = {}
    new_index2entity = []
    new_vectors_norm = []
    print('Building new vocab..')

    for i in range(len(w2v.vocab)):
        if (i % int(1e6) == 0) and (i != 0):
            print(f'working on {i}')
        word = w2v.index2entity[i]
        vec = np.array(w2v.vectors[i])
        vocab = w2v.vocab[word]
        vec_norm = w2v.vectors_norm[i]
        if word in restricted_word_set:
            vocab.index = len(new_index2entity)
            new_index2entity.append(word)
            new_vocab[word] = vocab
            new_vectors.append(vec)
            new_vectors_norm.append(vec_norm)

    print('Assigning new vocab')
    w2v.vocab = new_vocab
    print('Assigning new vectors')
    w2v.vectors = np.array(new_vectors)
    print('Assigning new index2entity, index2word')
    w2v.index2entity = new_index2entity
    w2v.index2word = new_index2entity
    print('Assigning new vectors_norm')
    w2v.vectors_norm = np.array(new_vectors_norm)

Trying to read text file and count words within defined groups

I'm a novice Python user. I'm trying to create a program that reads a text file and searches that text for certain words that are grouped (and that I predefine by reading from a csv). For example, if I wanted to create my own definition for "positive" containing the words "excited", "happy", and "optimistic", the csv would contain those terms. I know the below is messy - the txt file I am reading from contains 7 occurrences of the three "positive" tester words I read from the csv, yet the result prints out as 25. I think it's returning character count, not word count. Code:
import csv
import string
import re
from collections import Counter

remove = dict.fromkeys(map(ord, '\n' + string.punctuation))

# Read the .txt file to analyze.
with open("test.txt", "r") as f:
    textanalysis = f.read()
    textresult = textanalysis.lower().translate(remove).split()

# Read the CSV list of terms.
with open("positivetest.csv", "r") as senti_file:
    reader = csv.reader(senti_file)
    positivelist = list(reader)

# Convert term list into flat chain.
from itertools import chain
newposlist = list(chain.from_iterable(positivelist))

# Convert chain list into string.
posstring = ' '.join(str(e) for e in newposlist)
posstring2 = posstring.split(' ')
posstring3 = ', '.join('"{}"'.format(word) for word in posstring2)

# Count number of words as defined in list category
def positive(str):
    counts = dict()
    for word in posstring3:
        if word in counts:
            counts[word] += 1
        else:
            counts[word] = 1
    total = sum(counts.values())
    return total

# Print result; will write to CSV eventually
print("Positive: ", positive(textresult))
I'm a beginner as well but I stumbled upon a process that might help. After you read in the file, split the text at every space, tab, and newline. In your case, I would keep all the words lowercase and include punctuation in your split call. Save this as an array and then parse it with some sort of loop to get the number of instances of each 'positive,' or other, word.
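For example, a minimal sketch of that idea; test.txt and the hard-coded positive_words set are placeholders (in practice the words would come from your CSV):

import string
from collections import Counter

positive_words = {"excited", "happy", "optimistic"}  # placeholder; load these from your CSV

with open("test.txt", "r") as f:
    text = f.read().lower()

# Strip punctuation, then split on whitespace (spaces, tabs, newlines).
words = text.translate(str.maketrans("", "", string.punctuation)).split()

counts = Counter(w for w in words if w in positive_words)
print("Positive:", sum(counts.values()))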
Look at this, specifically the "train" function:
https://github.com/G3Kappa/Adjustable-Markov-Chains/blob/master/markovchain.py
Also, this link, ignore the JSON stuff at the beginning, the article talks about sentiment analysis:
https://dev.to/rodolfoferro/sentiment-analysis-on-trumpss-tweets-using-python-
Same applies with this link:
http://adilmoujahid.com/posts/2014/07/twitter-analytics/
Good luck!
I looked at your code and ran some of my own as a sample.
I have two ideas for you, based on what I think you may want.
First assumption: you want a basic sentiment count?
Getting to textresult is great, and you did the same with the positive lexicon to get positivelist, which seemed like the right move; but then you converted positivelist into what is essentially one big sentence.
Would you not just do the following (see the sketch after this list):
1. Pass a stop-words list over textresult
2. Merge the two [textresult (less stop words) and positivelist] on common words, as in an inner join
3. Then basically do your term frequency
4. It is much easier to aggregate the score then
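A rough sketch of that inner-join idea, reusing textresult and newposlist from your code; the stop_words set here is only a stand-in for a real stop-word list:

from collections import Counter

stop_words = {"the", "a", "an", "and", "of", "to"}  # stand-in; use a proper stop-word list

# textresult: tokens from your .txt file; newposlist: the flattened lexicon terms
tokens = [w for w in textresult if w not in stop_words]
lexicon = set(newposlist)

# "Inner join": keep only tokens that appear in the lexicon, then count them.
term_freq = Counter(w for w in tokens if w in lexicon)
print("Positive:", sum(term_freq.values()), term_freq.most_common(10))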
Second assumption: you are focusing on "excited", "happy", and "optimistic", and you are trying to isolate text themes into those three categories?
1. Again, stop at textresult
2. Download the nrc and/or syuzhet emotional valence dictionaries; they break emotive words down into 8 emotional groups, so if you only want 3 of the 8 groups, subset them
3. Process it like you did to get positivelist
4. Do another join
Sorry, this is a bit hashed up, but if I was anywhere near what you were thinking, let me know and we can make contact.
Second apology: I'm also a novice Python user; I am adapting what I use in R to Python in the above (it's not subtle either :) )

How do I optimize the speed of my python compression code?

I have made a compression code and have tested it on 10 KB text files, which took no less than 3 minutes. However, when I tested it with a 1 MB file, which is the assessment assigned by my teacher, it takes longer than half an hour. Compared to my classmates', mine is irregularly long. It might be my computer or my code, but I have no idea. Does anyone know any tips or shortcuts for making my code run faster? My compression code is below; if there are any quicker ways of doing the loops, etc., please send me an answer (:
(by the way my code DOES work, so I'm not asking for corrections, just enhancements, or tips, thanks!)
import re #used to enable functions(loops, etc.) to find patterns in text file
import os #used for anything referring to directories(files)
from collections import Counter #used to keep track on how many times values are added

size1 = os.path.getsize('file.txt') #find the size(in bytes) of your file, INCLUDING SPACES
print('The size of your file is ', size1,)

words = re.findall('\w+', open('file.txt').read())
wordcounts = Counter(words) #turns all words into array, even capitals
common100 = [x for x, it in Counter(words).most_common(100)] #identifies the 100 most common words

keyword = []
kcount = []
z = dict(wordcounts)
for key, value in z.items():
    keyword.append(key) #adds each keyword to the array called keywords
    kcount.append(value)

characters =['$','#','#','!','%','^','&','*','(',')','~','-','/','{','[', ']', '+','=','}','|', '?','cb',
             'dc','fd','gf','hg','kj','mk','nm','pn','qp','rq','sr','ts','vt','wv','xw','yx','zy','bc',
             'cd','df','fg','gh','jk','km','mn','np','pq','qr','rs','st','tv','vw','wx','xy','yz','cbc',
             'dcd','fdf','gfg','hgh','kjk','mkm','nmn','pnp','qpq','rqr','srs','tst','vtv','wvw','xwx',
             'yxy','zyz','ccb','ddc','ffd','ggf','hhg','kkj','mmk','nnm','ppn','qqp','rrq','ssr','tts','vvt',
             'wwv','xxw','yyx''zzy','cbb','dcc','fdd','gff','hgg','kjj','mkk','nmm','pnn','qpp','rqq','srr',
             'tss','vtt','wvv','xww','yxx','zyy','bcb','cdc','dfd','fgf','ghg','jkj','kmk','mnm','npn','pqp',
             'qrq','rsr','sts','tvt','vwv','wxw','xyx','yzy','QRQ','RSR','STS','TVT','VWV','WXW','XYX','YZY',
             'DC','FD','GF','HG','KJ','MK','NM','PN','QP','RQ','SR','TS','VT','WV','XW','YX','ZY','BC',
             'CD','DF','FG','GH','JK','KM','MN','NP','PQ','QR','RS','ST','TV','VW','WX','XY','YZ','CBC',
             'DCD','FDF','GFG','HGH','KJK','MKM','NMN','PNP','QPQ','RQR','SRS','TST','VTV','WVW','XWX',
             'YXY','ZYZ','CCB','DDC','FFD','GGF','HHG','KKJ','MMK','NNM','PPN','QQP','RRQ','SSR','TTS','VVT',
             'WWV','XXW','YYX''ZZY','CBB','DCC','FDD','GFF','HGG','KJJ','MKK','NMM','PNN','QPP','RQQ','SRR',
             'TSS','VTT','WVV','XWW','YXX','ZYY','BCB','CDC','DFD','FGF','GHG','JKJ','KMK','MNM','NPN','PQP',] #characters which I can use

symbols_words = []
char = 0
for i in common100:
    symbols_words.append(characters[char]) #makes the array literally contain 0 values
    char = char + 1

print("Compression has now started")

f = 0
g = 0
no = 0
while no < 100:
    for i in common100:
        for w in words:
            if i == w and len(i)>1: #if the values in common100 are ACTUALLY in words
                place = words.index(i) #find exactly where the most common words are in the text
                symbols = symbols_words[common100.index(i)] #assigns one character with one common word
                words[place] = symbols # replaces the word with the symbol
                g = g + 1
    no = no + 1

string = words
stringMade = ' '.join(map(str, string)) #makes the list into a string so you can put it into a text file

file = open("compression.txt", "w")
file.write(stringMade) #imports everything in the variable 'words' into the new file
file.close()

size2 = os.path.getsize('compression.txt')
no1 = int(size1)
no2 = int(size2)
print('Compression has finished.')
print('Your original file size has been compressed by', 100 - ((100/no1) * no2), 'percent.'
      'The size of your file now is ', size2)
Using something like
word_substitutes = dict(zip(common100, characters))
will give you a dict that maps common words to their corresponding symbol.
Then you can simply iterate over the words:
# Iterate over all the words
# Use enumerate because we're going to modify the word in-place in the words list
for word_idx, word in enumerate(words):
    # If the current word is in the `word_substitutes` dict, then we know it's in the
    # 'common' words, and can be replaced by the symbol
    if word in word_substitutes:
        # Replace the word in-place
        replacement_symbol = word_substitutes[word]
        words[word_idx] = replacement_symbol
This will give much better performance, because the dictionary lookup used for the common-word-to-symbol mapping is (on average) constant time rather than a linear scan. So the overall complexity becomes roughly O(N) in the number of words, instead of the roughly cubic behaviour you get from the two nested loops with the .index() call inside them.
The first thing I see that is bad for performance is:
for i in common100:
    for w in words:
        if i == w and len(i)>1:
            ...
What you are doing is seeing if the word w is in your list of common100 words. However, this check can be done in O(1) time by using a set and not looping through all of your top 100 words for each word.
common_words = set(common100)
for w in words:
    if w in common_words:
        ...
Generally you would do the following:
Measure how much time each "part" of your program needs. You could use a profiler (e.g. the one in the standard library) or simply sprinkle some times.append(time.time()) calls into your code and compute the differences; a small profiling sketch follows after these points. Then you know which part of your code is slow.
See if you can improve the algorithm of the slow part. gnicholas's answer shows one possibility to speed things up. The while no < 100 loop looks suspicious; maybe that can be improved. This step needs understanding of the algorithms you use. Be careful to select the best data structures for your use case.
If you can't use a better algorithm (because you already calculate things in the best known way), you need to speed up the computations themselves. Numerical code benefits from numpy, with cython you can basically compile Python code to C, and numba uses LLVM to compile.
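For the measurement step, a minimal sketch using only the standard library; compress_file is a placeholder name for however you wrap your compression code in a function:

import cProfile
import pstats
import time

def compress_file():
    ...  # placeholder: wrap the compression code above in a function

# Option 1: full profile, sorted by cumulative time
cProfile.run('compress_file()', 'profile.out')
pstats.Stats('profile.out').sort_stats('cumulative').print_stats(10)

# Option 2: coarse timing of a single "part"
start = time.time()
compress_file()
print('compression took', time.time() - start, 'seconds')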

Finding document frequency using Python

Hey everyone, I know that this has been asked a couple of times here already, but I am having a hard time finding document frequency using Python. I am trying to find TF-IDF and then the cosine scores between the documents and a query, but am stuck at finding document frequency. This is what I have so far:
#includes
import re
import os
import operator
import glob
import sys
import math
from collections import Counter

#number of command line argument checker
if len(sys.argv) != 3:
    print 'usage: ./part3_soln2.py "path to folder in quotation marks" query.txt'
    sys.exit(1)

#Read in the directory to the files
path = sys.argv[1]

#Read in the query
y = sys.argv[2]
querystart = re.findall(r'\w+', open(y).read().lower())
query = [Z for Z in querystart]
Query_vec = Counter(query)
print Query_vec

#counts total number of documents in the directory
doccounter = len(glob.glob1(path, "*.txt"))

if os.path.exists(path) and os.path.isfile(y):
    word_TF = []
    word_IDF = {}
    TFvec = []
    IDFvec = []

    #this is my attempt at finding IDF
    for filename in glob.glob(os.path.join(path, '*.txt')):
        words_IDF = re.findall(r'\w+', open(filename).read().lower())
        doc_IDF = [A for A in words_IDF if len(A) >= 3 and A.isalpha()]
        word_IDF = doc_IDF

        #psudocode!!
        """
        for key in word_idf:
            if key in word_idf:
                word_idf[key] =+1
            else:
                word_idf[key] = 1
        print word_IDF
        """

    #goes to that directory and reads in the files there
    for filename in glob.glob(os.path.join(path, '*.txt')):
        words_TF = re.findall(r'\w+', open(filename).read().lower())
        #scans each document for words greater or equal to 3 in length
        doc_TF = [A for A in words_TF if len(A) >= 3 and A.isalpha()]

        #this assigns values to each term this is my TF for each vector
        TFvec = Counter(doc_TF)

        #weighing the Tf with a log function
        for key in TFvec:
            TFvec[key] = 1 + math.log10(TFvec[key])

    #placed here so I dont get a command line full of text
    print TFvec

#Error checker
else:
    print "That path does not exist"
I am using python 2 and so far I don't really have any idea how to count how many documents a term appears in. I can find the total number of documents but I am really stuck on finding the number of documents a term appears in. I was just going to create one large dictionary that held all of the terms from all of the documents that could be fetched later when a query needed those terms. Thank you for any help you can give me.
DF for a term x is the number of documents in which x appears. In order to find that, you need to iterate over all documents first. Only then can you compute IDF from DF.
You can use a dictionary for counting DF:
Iterate over all documents.
For each document, retrieve the set of its words (without repetitions).
Increase the DF count for each word from step 2. You thus increase the count by exactly one, regardless of how many times the word appeared in the document.
Python code could look like this:
from collections import defaultdict
import math

DF = defaultdict(int)
for filename in glob.glob(os.path.join(path, '*.txt')):
    words = re.findall(r'\w+', open(filename).read().lower())
    for word in set(words):
        if len(word) >= 3 and word.isalpha():
            DF[word] += 1  # defaultdict simplifies your "if key in word_idf: ..." part.

# Now you can compute IDF.
IDF = dict()
for word in DF:
    IDF[word] = math.log(doccounter / float(DF[word]))  # Don't forget that python2 uses integer division.
PS It's good for learning to implement things manually, but if you ever get stuck, I suggest you look at the NLTK package. It provides useful functions for working with corpora (collections of texts).
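For illustration, a small sketch of how NLTK's corpus reader could cover the file-iteration part; the directory corpus_dir is a placeholder, and the DF logic itself is the same as above:

from collections import defaultdict
from nltk.corpus import PlaintextCorpusReader

corpus_dir = 'path/to/folder'  # placeholder
corpus = PlaintextCorpusReader(corpus_dir, r'.*\.txt')

DF = defaultdict(int)
for fileid in corpus.fileids():
    # corpus.words() handles tokenisation; set() drops repeats within a document
    for word in set(w.lower() for w in corpus.words(fileid)):
        if len(word) >= 3 and word.isalpha():
            DF[word] += 1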

If I want to access certain comments in a file which have specific words below them, how do I do that?

Basically I want to compare three sets of lists to another file. These lists are differentiated by comments. How do I compare the file to each of these lists? Do I need to make three separate files for these lists?
Example: words have prefixes, roots and suffixes. An example would be contradict: the prefix is con, the suffix is dict. I have a list of these prefixes, suffixes, etc. I need to know how to compare that list to the pile of words and basically count the number of roots, prefixes and suffixes that exist in that file.
The following might help to get you started. It uses Python's ConfigParser to load a file which contains all of your lists. This file needs to be formatted as follows:
vocab.txt
[prefixes]
inter
con
mis
[roots]
cred
duct
equ
[suffixes]
dict
ment
ible
Each list of words gets loaded into the variables prefixes, roots and suffixes accordingly (with any duplicates removed). It then loads a source file called input.txt and splits this into a list of words called words. Each word is lowercased to make sure it matches the prefixes, roots and suffixes.
For each word, a simple test is made to see if it matches any of your lists. Each match is displayed and counted. A total for each is also kept and displayed at the end.
import ConfigParser
import re

vocab = ConfigParser.ConfigParser(allow_no_value=True)
vocab.read('vocab.txt')

def get_section(section):
    return set(v[0].lower() for v in vocab.items(section))

prefixes = get_section('prefixes')
roots = get_section('roots')
suffixes = get_section('suffixes')

total_prefixes = 0
total_roots = 0
total_suffixes = 0

with open('input.txt', 'r') as f_input:
    text = f_input.read()

words = [word.lower() for word in re.findall('\w+', text)]

for word in words:
    word_prefixes = [p for p in prefixes if word.startswith(p)]
    total_prefixes += len(word_prefixes)

    word_roots = [r for r in roots if r in word[1:]]
    total_roots += len(word_roots)

    word_suffixes = [s for s in suffixes if word.endswith(s)]
    total_suffixes += len(word_suffixes)

    print '{:15} Prefixes {} {}, Roots {} {}, Suffixes {} {}'.format(word,
        len(word_prefixes), word_prefixes, len(word_roots), word_roots, len(word_suffixes), word_suffixes)

print
print 'Totals:\n Prefixes {}, Roots {}, Suffixes {}'.format(total_prefixes, total_roots, total_suffixes)
