Finding document frequency using Python

Hey everyone, I know this has been asked a couple of times here already, but I am having a hard time finding document frequency using Python. I am trying to find TF-IDF and then the cosine scores between the documents and a query, but I am stuck at finding document frequency. This is what I have so far:
#includes
import re
import os
import operator
import glob
import sys
import math
from collections import Counter

#number of command line argument checker
if len(sys.argv) != 3:
    print 'usage: ./part3_soln2.py "path to folder in quotation marks" query.txt'
    sys.exit(1)

#Read in the directory to the files
path = sys.argv[1]

#Read in the query
y = sys.argv[2]
querystart = re.findall(r'\w+', open(y).read().lower())
query = [Z for Z in querystart]
Query_vec = Counter(query)
print Query_vec

#counts total number of documents in the directory
doccounter = len(glob.glob1(path, "*.txt"))

if os.path.exists(path) and os.path.isfile(y):
    word_TF = []
    word_IDF = {}
    TFvec = []
    IDFvec = []

    #this is my attempt at finding IDF
    for filename in glob.glob(os.path.join(path, '*.txt')):
        words_IDF = re.findall(r'\w+', open(filename).read().lower())
        doc_IDF = [A for A in words_IDF if len(A) >= 3 and A.isalpha()]
        word_IDF = doc_IDF

    #pseudocode!!
    """
    for key in word_idf:
        if key in word_idf:
            word_idf[key] += 1
        else:
            word_idf[key] = 1
    print word_IDF
    """

    #goes to that directory and reads in the files there
    for filename in glob.glob(os.path.join(path, '*.txt')):
        words_TF = re.findall(r'\w+', open(filename).read().lower())
        #scans each document for words greater or equal to 3 in length
        doc_TF = [A for A in words_TF if len(A) >= 3 and A.isalpha()]
        #this assigns values to each term; this is my TF for each vector
        TFvec = Counter(doc_TF)
        #weighing the TF with a log function
        for key in TFvec:
            TFvec[key] = 1 + math.log10(TFvec[key])
        #placed here so I don't get a command line full of text
        print TFvec
#Error checker
else:
    print "That path does not exist"
I am using Python 2, and so far I don't really have any idea how to count how many documents a term appears in. I can find the total number of documents, but I am really stuck on finding the number of documents each term appears in. I was just going to create one large dictionary that held all of the terms from all of the documents, so the counts could be fetched later when a query needed those terms. Thank you for any help you can give me.

DF for a term x is the number of documents in which x appears. In order to find that, you need to iterate over all documents first. Only then can you compute IDF from DF.
You can use a dictionary for counting DF:
Iterate over all documents.
For each document, retrieve the set of its words (without repetitions).
Increase the DF count for each word from step 2. That way you increase the count by exactly one, regardless of how many times the word appeared in the document.
Python code could look like this:
from collections import defaultdict
import math

DF = defaultdict(int)
for filename in glob.glob(os.path.join(path, '*.txt')):
    words = re.findall(r'\w+', open(filename).read().lower())
    for word in set(words):
        if len(word) >= 3 and word.isalpha():
            DF[word] += 1  # defaultdict simplifies your "if key in word_idf: ..." part.

# Now you can compute IDF.
IDF = dict()
for word in DF:
    IDF[word] = math.log(doccounter / float(DF[word]))  # Don't forget that Python 2 uses integer division.
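If it helps, tying the IDF above back to the log-weighted TF you already build per document is just a multiplication; a minimal sketch (assuming TFvec is the Counter from your per-file loop):
# Hypothetical TF-IDF weights for one document: your log-weighted TF times the IDF above.
TFIDFvec = {word: TFvec[word] * IDF[word] for word in TFvec if word in IDF}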
PS It's good for learning to implement things manually, but if you ever get stuck, I suggest you look at the NLTK package. It provides useful functions for working with corpora (collections of texts).
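For example, NLTK can treat a folder of .txt files as a corpus directly; a minimal sketch (assuming NLTK is installed and path is the folder from the question):
from nltk.corpus import PlaintextCorpusReader

# Every .txt file in the folder becomes one document of the corpus.
corpus = PlaintextCorpusReader(path, r'.*\.txt')
for fileid in corpus.fileids():
    # corpus.words() tokenizes the document for you.
    tokens = [w.lower() for w in corpus.words(fileid) if len(w) >= 3 and w.isalpha()]
    print fileid, len(tokens), len(set(tokens))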

Related

Tokenize a corpus of 10 documents in Python

I am new to coding in Python, so figuring out how to code more advanced actions has become a challenge for me.
My assignment is to compute the TF-IDF of a corpus of 10 documents, but I am stuck on how to tokenize the corpus and print out the number of tokens and the number of unique tokens.
If anyone can help or even guide me step by step in the right direction, it would be greatly appreciated!
This might help.
I have a collection of individual text files that I want to ingest and fit-transform with the TfidfVectorizer. This will walk through the process of ingesting the files and using TfidfVectorizer.
I went to Kaggle to get some example data about movie reviews.
I used the negative (neg) reviews. For my purposes it doesn't matter what the data is; I just need some textual data.
Import required packages
import pandas as pd
import glob
from sklearn.feature_extraction.text import TfidfVectorizer
How will these packages be used?
We're going to use pandas to stage the data for TfidfVectorizer
glob will be used to gather the file directory locations
TfidfVectorizer is the star of the show
Gather the file locations using Glob
ls_documents = []
for name in glob.glob('/location/to/folder/with/document/files/*'):
    ls_documents.append(name)
This will produce a list of file locations.
Read the data from the first 10 files
ls_text = []
for document in ls_documents[:10]:
    with open(document, "r") as f:
        ls_text.append(f.read())
We now have a list of text.
Import into Pandas
df_text = pd.DataFrame(ls_text)
Rename the column to make it easier to work with
df_text.columns = ['raw_text']
Clean the data by replacing any nulls with empty strings
df_text['clean_text'] = df_text['raw_text'].fillna('')
You might choose to do some other cleaning. It is useful to keep the raw data and create a separate 'clean' column.
Create a tfidf object; I'm going to provide it with English stop words
tfidf = TfidfVectorizer(stop_words='english')
Fit and transform the clean_text we created above by passing tfidf the clean_text series
tfidf_matrix = tfidf.fit_transform(df_text['clean_text'])
You can see the feature names from tfidf
tfidf.get_feature_names()
You'll see something which looks like this
['10',
'13',
'14',
'175',
'1960',
'1990s',
'1997',
'20',
'2001',
'20th',
'2176',
'60',
'80',
'8mm',
'90',
'90s',
'_huge_',
'aberdeen',
'able',
'abo',
'accent',
'accentuate',
'accident',
'accidentally',
'accompany',
'accurate',
'accused',
'acting',
'action',
'actor',
....
]
You can look at the shape of the matrix
tfidf_matrix.shape
In my example, I get a shape of
(10, 1733)
Roughly, this means that 1733 words (i.e. tokens) describe the 10 documents
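If you want to inspect those weights, one optional step (a sketch, reusing the tfidf_matrix and tfidf objects from above) is to put the matrix back into a pandas DataFrame, with one row per document and one column per token:
# Dense view of the TF-IDF weights: rows are documents, columns are tokens.
df_tfidf = pd.DataFrame(tfidf_matrix.toarray(), columns=tfidf.get_feature_names())
print(df_tfidf.shape)  # (10, 1733) in this example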
Not being sure what you're looking to do with this, you might find these two articles useful:
This article from DataCamp uses tfidf in a recommendation system
This article from DataCamp covers some general NLP processing techniques
I kind of took a fun approach to this. I'm using the same data as #the_good_pony provided, so I'll use the same path.
We'll use os and re modules, because regular expressions are fun and challenging!
import os
import re

# Path to where our data is located
base_path = r'C:\location\to\folder\with\document\files'

# Instantiate an empty dictionary
ddict = {}

# We're going to walk our directory
for root, subdirs, filenames in os.walk(base_path):
    # For each subdirectory ('neg' and 'pos', in this case)
    for d in subdirs:
        # Create a NEW dictionary with the subdirectory name as key
        ddict[d] = {}
        # Create a path to the subdirectory
        subroot = os.path.join(root, d)
        # Get a list of files for the directory
        # Save time by creating a new path for each file
        file_list = [os.path.join(subroot, i) for i in os.listdir(subroot) if i.endswith('txt')]
        # For each file in the file list, open and read the file into the
        # subdictionary
        for f in file_list:
            # Basename = root name of path to file, i.e. the filename
            fkey = os.path.basename(f)
            # Read file and set as subdictionary value
            with open(f, 'r') as fin:
                ddict[d][fkey] = fin.read()
Sample counts:
len(ddict.keys()) # 2 top-level subdirectories
len(ddict['neg'].keys()) # 1000 files in our 'neg' subdirectory
len(ddict['pos'].keys()) # 1000 files in our 'pos' subdirectory
# sample file content
# use two keys (subdirectory name and filename)
dirkey = 'pos'
filekey = 'cv000_29590.txt'
test1 = ddict[dirkey][filekey]
Output:
'films adapted from comic books have had plenty of success , whether they\'re about superheroes ( batman , superman , spawn ) , o [...]'
### Simple counter dictionary function
def val_counter(iterable, output_dict=None):
    # Instantiate a new dictionary
    if output_dict is None:
        output_dict = dict()
    # Check if element is in dictionary
    # Add 1 if yes, or set to 1 if no
    for i in iterable:
        if i in output_dict.keys():
            output_dict[i] += 1
        else:
            output_dict[i] = 1
    return output_dict
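For example, feeding it any iterable returns a plain count dictionary:
print(val_counter(['dog', 'cat', 'dog']))  # {'dog': 2, 'cat': 1}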
Using regular expressions, which I've admittedly overkilled here, we can clean up the text from each corpus and capture alphanumeric items into a list. I've added an option to include small words (one character, in this case), but filtering out stopwords wouldn't be too tough either.
def wordcounts(corpus, dirname='pos', keep_small_words=False, count_dict=None):
    if count_dict is None:
        count_dict = dict()
    get_words_pat = r'(?:\s*|\n*|\t*)?([\w]+)(?:\s*|\n*|\t*)?'
    p = re.compile(get_words_pat)

    def clean_corpus(x):
        # Replace all whitespace with single spaces
        clear_ws_pat = r'\s+'
        # Find non-alphanumeric characters
        remove_punc_pat = r'[^\w+]'
        tmp1 = re.sub(remove_punc_pat, ' ', x)
        # Re-space whitespace and return
        return re.sub(clear_ws_pat, ' ', tmp1)

    # List of our files from the subdirectory
    keylist = list(corpus[dirname])
    for k in keylist:
        cleand = clean_corpus(corpus[dirname][k])
        # Tokenize based on size
        if keep_small_words:
            tokens = p.findall(cleand)
        else:
            # Limit to results > 1 char in length
            tokens = [i for i in p.findall(cleand) if len(i) > 1]
        for i in tokens:
            if i in count_dict.keys():
                count_dict[i] += 1
            else:
                count_dict[i] = 1
    # Return dictionary once complete
    return count_dict
### Dictionary sorted lambda function
dict_sort = lambda d, descending=True: dict(sorted(d.items(), key=lambda x: x[1], reverse=descending))
# Run our function for positive corpus values
pos_result_dict = wordcounts(ddict, 'pos')
pos_result_dict = dict_sort(pos_result_dict)
Final processing and printing:
# Create a dictionary of how frequent each count value is
freq_dist = val_counter(pos_result_dict.values())
freq_dist = dict_sort(freq_dist)

# Stats functions
k_count = lambda x: len(x.keys())
sum_vals = lambda x: sum([v for k, v in x.items()])
calc_avg = lambda x: sum_vals(x) / k_count(x)

# Get mean (arithmetic average) of word counts
mean_dict = calc_avg(pos_result_dict)

# Top half of results. We could shrink this even further, if necessary
top_dict = {k: v for k, v in pos_result_dict.items() if v >= mean_dict}

# This is probably your TF-IDF part
tot_count = sum([v for v in top_dict.values()])
for k, v in top_dict.items():
    pct_ = round(v / tot_count, 4)
    print('Word: ', k, ', count: ', v, ', %-age: ', pct_)
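The loop above prints relative frequencies rather than TF-IDF proper. If you want to go all the way, one hedged sketch (reusing val_counter and the ddict built earlier; 'pos' is just the corpus I picked) is:
import math
import re

# Per-document term counts for the 'pos' corpus.
doc_counts = {fname: val_counter(re.findall(r'\w+', text.lower()))
              for fname, text in ddict['pos'].items()}
n_docs = len(doc_counts)

# Document frequency: in how many documents each term appears.
df = val_counter(word for counts in doc_counts.values() for word in counts)

# TF-IDF for one (arbitrary) document: raw count weighted by inverse document frequency.
fname = next(iter(doc_counts))
tfidf_doc = {w: c * math.log(n_docs / df[w]) for w, c in doc_counts[fname].items()}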

Running loop over documents in python

I've got one folder on my desktop called 'companyfollowerstweets', which contains 91 folders, each called 'followerstweets(company name)', and each of those contains 200 csv files holding the most recent Tweets of one follower of that company on Twitter. I'm performing sentiment analysis on the first 200 rows of Tweets of each of the 200 followers per company; the results are added to one list, which eventually gives me one outcome per company: the percentage of negative Tweets and positive Tweets out of all 40,000 Tweets (200 Tweets for each of 200 followers). I hope that makes sense. Right now I have only managed to run a loop over the 200 csv files of one folder, where I manually enter the company's name each time. I want it to run over each of the 91 folders without me having to enter the company name. Here's my code:
import nltk
import csv
import sklearn
import nltk, string, numpy
from sklearn.feature_extraction.text import TfidfVectorizer
from collections import defaultdict
from sklearn.feature_extraction.text import CountVectorizer
columns = defaultdict(list)
from nltk.corpus import stopwords
from sklearn.feature_extraction.text import TfidfTransformer
import math
import sentiment_mod as s
import glob
import itertools

lijst = glob.glob('companyfollowerstweets/followerstweetsCisco/*.csv')
tweets1 = []
sent1 = []
print(lijst[0])

for item in lijst:
    stopwords_set = set(stopwords.words("english"))
    with open(item, encoding='latin-1') as d:
        reader1 = csv.reader(d)
        next(reader1)
        for row in itertools.islice(reader1, 200):
            tweets1.extend([row[2]])

words_cleaned = [" ".join([words for words in sentence.split() if 'http' not in words and not words.startswith('#')]) for sentence in tweets1]
words_filtered = [e.lower() for e in words_cleaned]
words_without_stopwords = [word for word in words_filtered if not word in stopwords_set]
tweets1 = words_without_stopwords
tweets1 = list(filter(None, tweets1))

for d in tweets1:
    new1 = s.sentiment(d)
    sent1.extend(new1)

total1 = len(sent1)/2
neg_percentage1 = (sent1.count("neg")/total1)*100
pos_percentage1 = (sent1.count("pos")/total1)*100
res = sum(sent1[1::2])/total1
low = min(sent1[1::2])
high = max(sent1[1::2])

print("% of negative Tweets:", neg_percentage1)
print("% of positive Tweets:", pos_percentage1)
print("Total number of Tweets:", total1)
print("Average confidence:", res)
print("min confidence:", low)
print("max confidence:", high)
This specific example is for the company 'Cisco' as you can see. How do I keep this code running for every one of the 91 folders like this one?
os.walk() on your current directory (os.getcwd()) is what you want. That will recursively iterate over everything in your current working directory.
You can use a nested glob like this:
from glob import glob
[glob(i+'/*.csv') for i in glob('companyfollowerstweets/followerstweets*')]
This will return a list of lists (a list of CSV file paths for each company).
Note that this won't have any particular order.
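If you also need the company name for reporting, a possible variation (a sketch, assuming the folder names really are 'followerstweets' plus the company name) builds a dict instead of a bare list:
import os
from glob import glob

company_files = {}
for folder in glob('companyfollowerstweets/followerstweets*'):
    # Strip the common prefix to recover the company name from the folder name.
    company = os.path.basename(folder).replace('followerstweets', '')
    company_files[company] = glob(os.path.join(folder, '*.csv'))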
You can either do two loops or use os.walk; the latter is cleaner:
import os

company_results = {}
for root, dirs, files in os.walk('companyfollowerstweets'):
    if len(dirs) == 0:
        results = do_analysis(files)
        company_results[root] = results
I suggest putting all your analysis in a function; it makes for much cleaner code. Then you can get a dictionary of all the results with the code above.
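One caveat with the snippet above: os.walk yields bare filenames, so a hypothetical do_analysis needs the root to build full paths. A rough sketch of that shape, with the per-company reading logic borrowed from the question and the scoring left as a placeholder:
import os
import csv
import itertools

def do_analysis(root, files):
    """Hypothetical per-company analysis: collect the tweets, then score them."""
    tweets = []
    for name in files:
        if not name.endswith('.csv'):
            continue
        with open(os.path.join(root, name), encoding='latin-1') as fh:
            reader = csv.reader(fh)
            next(reader)                               # skip the header row
            for row in itertools.islice(reader, 200):  # first 200 tweets per follower
                tweets.append(row[2])
    # ... run the cleaning and sentiment scoring from the question on `tweets` ...
    return tweets

company_results = {}
for root, dirs, files in os.walk('companyfollowerstweets'):
    if not dirs:  # a leaf folder corresponds to one company
        company_results[root] = do_analysis(root, files)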

Efficient fuzzy string comparison over thousands of text files

I need to search several thousand plaintext files for a set of names. I'm generating trigrams to retain context. I need to account for minor misspellings, so I'm using a Levenshtein distance calculation, function lev(). I need the final result to return the name with a hit, the filename the hit was in, and the trigram that was marked as a hit. My Python program works as expected, but very slowly. I'm searching for a faster way to do this search, preferably in Python, but my Google-fu has failed me. A genericized version of the program is below:
from sklearn.feature_extraction.text import CountVectorizer
import os

textfiles = []
newgrams = set()
ngrams = []
hitlist = []
path = 'path of folder of textfiles'
names = ['john james doe', 'jane jill doe']

vectorizer = CountVectorizer(input='filename', ngram_range=(3, 3),
                             strip_accents='unicode', stop_words='english',
                             token_pattern='[a-zA-Z\-]\\w*',
                             encoding='utf-8', decode_error='replace', lowercase=True)
ngramer = vectorizer.build_analyzer()

for dirpath, dirnames, filenames in os.walk(path):
    for files in filenames:
        if files.endswith('.txt'):
            textfiles.append(files)

ctFiles = len(textfiles)
ctNames = len(names)

for i in range(ctFiles):
    newgrams = set(ngramer(path + '/' + textfiles[i]))
    ngrams.append(newgrams)

for i in range(ctNames):
    splitname = names[i].split()
    for j in range(ctFiles):
        tempset = set()
        for k in range(len(splitname)):
            if k == 0:
                ## subset only the trigrams that "match" first name
                for trigram in ngrams[j]:
                    for word in trigram.split():
                        if lev(splitname[k], word) < 2:
                            tempset.add(trigram)
            else:
                ## search that subset for middle/last name
                if len(tempset) > 0:
                    for trigram in tempset:
                        for word in trigram.split():
                            if lev(splitname[k], word) < 2:
                                hitlist.append([names[i], textfiles[j], trigram])

print(hitlist)  ## eventually save to CSV
I am using fuzzywuzzy; it's pretty fast on my dataset (100K sentences): https://github.com/seatgeek/fuzzywuzzy
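For reference, a minimal fuzzywuzzy sketch (assuming the package is installed; the names here are placeholders):
from fuzzywuzzy import fuzz, process

candidates = ['john james doe', 'jon james dough', 'jane jill doe']
# Similarity score between two strings, on a 0-100 scale.
print(fuzz.ratio('john james doe', 'jon james doe'))
# Best matches of a query against a list of candidates.
print(process.extract('john james doe', candidates, limit=2))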
Levenshtein is very expensive, and I wouldn't recommend using it for fuzzy matching across this many documents (unless you want to build a Levenshtein automaton to generate an index of tokens n steps away from every word in your files).
Trigram indexing should be fast and accurate on its own for words of a certain length. You mention names, though; if that means multiple words, chunks need to be indexed as well, and that has to be implemented separately.
If you try trigram indexing on its own and aren't satisfied with the accuracy, you can try adding a trigram chunk index, i.e. ('Ban', 'ana', 'nan') as a tuple, in addition to 'Ban', 'ana', and 'nan' as individual trigrams, but in a separate index. This loses even more accuracy as character length decreases, so that should be accounted for.
The key here is that Levenshtein runs in O(length of query × length of word × number of words in the files), while token/trigram/chunk indexing runs in O(log(number of words in the files) × number of tokens/chunks/trigrams in the query).
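To make the trigram-index idea concrete, here is a rough sketch (not the asker's code; word-level character trigrams only, without the chunk index):
from collections import defaultdict

def char_trigrams(word):
    # Character trigrams of a single word, e.g. 'banana' -> {'ban', 'ana', 'nan'}.
    return {word[i:i + 3] for i in range(len(word) - 2)}

# Build: map each trigram to the set of (file, word) pairs containing it.
index = defaultdict(set)
corpus = {'doc1.txt': ['john', 'james', 'doe'], 'doc2.txt': ['jane', 'jill', 'dough']}
for fname, words in corpus.items():
    for w in words:
        for tri in char_trigrams(w):
            index[tri].add((fname, w))

# Query: only words sharing at least one trigram with the query term are candidates,
# so the expensive Levenshtein check runs on a tiny subset of the vocabulary.
query = 'jamez'
candidates = set()
for tri in char_trigrams(query):
    candidates |= index[tri]
print(candidates)  # {('doc1.txt', 'james')}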

Creating a simple searching program

Decided to delete and ask again, it was just easier! Please do not vote down, as I have taken on board what people have been saying.
I have two nested dictionaries:-
wordFrequency = {'bit':{1:3,2:4,3:19,4:0},'red':{1:0,2:0,3:15,4:0},'dog':{1:3,2:0,3:4,4:5}}
search = {1:{'bit':1},2:{'red':1,'dog':1},3:{'bit':2,'red':3}}
The first dictionary links words to a file number and the number of times they appear in that file. The second contains searches, linking a word to the number of times it appears in the current search.
I want to extract certain values so that for each search I can calculate the scalar product between the number of times words appear in a file and the number of times they appear in the search, divided by their magnitudes, and then see which file is most similar to the current search, i.e. (word 1 appearances in search × word 1 appearances in file) + (word 2 appearances in search × word 2 appearances in file), etc. Then return a dictionary of searches to lists of file numbers, most similar first, least similar last.
Expected output is a dictionary:
{1:[4,3,1,2],2:[1,2,4,3]}
etc.
The key is the search number, the value is a list of files most relevant first.
(These may not actually be right.)
This is what I have:-
def retrieve():
    results = {}
    for word in search:
        numberOfAppearances = wordFrequency.get(word).values()
        for appearances in numberOfAppearances:
            results[fileNumber] = numberOfAppearances.dot()
    return sorted(results.iteritems(), key=lambda (fileNumber, appearances): appearances, reverse=True)
Sorry, no, it just says wdir = and then the directory the .py file is in.
Edit
The entire Retrieve.py file:
from collections import Counter

def retrieve():
    wordFrequency = {'bit': {1: 3, 2: 4, 3: 19, 4: 0}, 'red': {1: 0, 2: 0, 3: 15, 4: 0}, 'dog': {1: 3, 2: 0, 3: 4, 4: 5}}
    search = {1: {'bit': 1}, 2: {'red': 1, 'dog': 1}, 3: {'bit': 2, 'red': 3}}
    results = {}
    for search_number, words in search.iteritems():
        file_relevancy = Counter()
        for word, num_appearances in words.iteritems():
            for file_id, appear_in_file in wordFrequency.get(word, {}).iteritems():
                file_relevancy[file_id] += num_appearances * appear_in_file
        results[search_number] = [file_id for (file_id, count) in file_relevancy.most_common()]
    return results
I am using the Spyder GUI/IDE for Anaconda Python 2.7; I just press the green play button and the output is:
wdir='/Users/danny/Desktop'
Edit 2
In regard to the magnitude: for example, for search number 3 and file 1 it would be:
sqrt (2^2 + 3^2 + 0^2) * sqrt (3^2 + 0^2 + 3^2)
Here is a start:
from collections import Counter

def retrieve():
    results = {}
    for search_number, words in search.iteritems():
        file_relevancy = Counter()
        for word, num_appearances in words.iteritems():
            for file_id, appear_in_file in wordFrequency.get(word, {}).iteritems():
                file_relevancy[file_id] += num_appearances * appear_in_file
        results[search_number] = [file_id for (file_id, count) in file_relevancy.most_common()]
    return results

print retrieve()
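To also divide by the magnitudes, as asked in Edit 2, a hedged extension of the same idea (a sketch, assuming wordFrequency and search are defined as in the question; retrieve_cosine is a name I made up, Python 2 idioms to match):
import math
from collections import Counter

def retrieve_cosine():
    results = {}
    for search_number, words in search.iteritems():
        # Magnitude of the search vector.
        search_mag = math.sqrt(sum(n ** 2 for n in words.itervalues()))
        file_relevancy = Counter()
        for word, num_appearances in words.iteritems():
            for file_id, appear_in_file in wordFrequency.get(word, {}).iteritems():
                file_relevancy[file_id] += num_appearances * appear_in_file
        for file_id in file_relevancy:
            # Magnitude of the file vector, over every word in wordFrequency.
            file_mag = math.sqrt(sum(counts.get(file_id, 0) ** 2 for counts in wordFrequency.itervalues()))
            if file_mag:
                file_relevancy[file_id] /= (search_mag * file_mag)
        results[search_number] = [f for (f, score) in file_relevancy.most_common()]
    return results

print retrieve_cosine()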

UPDATE: Calculate vector length according to str value in specific column in Python

I am trying to measure the length of vectors based on the value in the first column of my input data.
For instance: my input data is as follows:
dog nmod+n+-n 4
dog nmod+n+n-a-commitment-n 6
child into+ns-j+vn-pass-rb-divide-v 3
child nmod+n+ns-commitment-n 5
child nmod+n+n-pledge-n 3
hello nmod+n+ns 2
The value that I want to calculate is grouped by identical values in the first column. For instance, I would calculate a value based on all rows in which dog is in the first column, then a value based on all rows in which child is in the first column, and so on.
I have worked out the mathematics to calculate the vector length (Euclidean norm). However, I am unsure how to group the calculation by identical values in the first column.
So far, this is the code that I have written:
#!/usr/bin/python
import os
import sys
import getopt
import datetime
import math
print "starting:",
print datetime.datetime.now()
def countVectorLength(infile, outfile):
with open(infile, 'rb') as inputfile:
flem, _, fw = next(inputfile).split()
current_lem = flem
weights = [float(fw)]
for line in inputfile:
lem, _, w = line.split()
if lem == current_lem:
weights.append(float(w))
else:
print current_lem,
print math.sqrt(sum([math.pow(weight,2) for weight in weights]))
current_lem = lem
weights = [float(w)]
print current_lem,
print math.sqrt(sum([math.pow(weight,2) for weight in weights]))
print "Finish:",
print datetime.datetime.now()
path = '/Path/to/Input/'
pathout = '/Path/to/Output'
listing = os.listdir(path)
for infile in listing:
outfile = 'output' + infile
print "current file is:" + infile
countVectorLength(path + infile, pathout + outfile)
This code outputs the length of vector of each individual lemma. The above data gives me the following output:
dog 7.211102550927978
child 6.48074069840786
hello 2
UPDATE
I have been working on it and I have managed to get the working code shown in the sample above. However, as you can see, the code has a problem with the output of the very last line of each file, which I have solved rather crudely by adding it manually. Because of this problem, it does not iterate cleanly through the directory; it outputs all of the results of all files into one appended document. Is there a cleaner, more Pythonic way to write each result directly to its corresponding file in the output directory?
First, you need to transform the input into something like
dog => [4, 6]
child => [3, 5, 3]
etc.
It goes like this:
from collections import defaultdict

data = defaultdict(list)
for line in file:
    parts = line.split()  # columns are whitespace-separated, as in the sample data
    data[parts[0]].append(float(parts[2]))
Once this is done, the rest is obvious:
def vector_len(vec):
    pass  # you already got that

vector_lens = {name: vector_len(values) for name, values in data.items()}
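For completeness, a sketch of what vector_len could contain, using the same Euclidean norm as the asker's countVectorLength (the input path is hypothetical; Python 2 style to match the question):
import math
from collections import defaultdict

def vector_len(vec):
    # Euclidean norm: square root of the sum of squared weights.
    return math.sqrt(sum(w ** 2 for w in vec))

data = defaultdict(list)
with open('/Path/to/Input/somefile.txt') as infile:  # hypothetical input file
    for line in infile:
        lem, _, w = line.split()
        data[lem].append(float(w))

vector_lens = {name: vector_len(values) for name, values in data.items()}
for name, length in vector_lens.items():
    print name, length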
