I am new to coding in Python so figuring out how to code more advanced actions has become a challenge for me.
My assignment is to compute the TF-IDF of a corpus of 10 documents. But I am stuck on how to tokenize the corpus and print out the number of tokens and number of unique tokens.
If anyone can help, or even guide me step by step in the right direction, it would be greatly appreciated!
This might help.
I have a collection of individual text files which I want to ingest and fit-transform with TfidfVectorizer. This will walk through the process of ingesting the files and using TfidfVectorizer.
I went to Kaggle to get some example data about movie reviews and used the negative (neg) reviews. For my purposes it doesn't matter what the data is; I just need some textual data.
Import required packages
import pandas as pd
import glob
from sklearn.feature_extraction.text import TfidfVectorizer
How will these packages be used?
We're going to use pandas to stage the data for TfidfVectorizer
glob will be used to gather the file directory locations
TfidfVectorizer is the star of the show
Gather the file locations using glob
ls_documents = []
for name in glob.glob('/location/to/folder/with/document/files/*'):
ls_documents.append(name)
This will produce a list of file locations.
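As an aside, glob.glob() already returns a list, so the loop above could also be written as a one-liner:
ls_documents = glob.glob('/location/to/folder/with/document/files/*')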
Read the data from the first 10 files
ls_text = []
for document in ls_documents[:10]:
    with open(document, "r") as f:
        ls_text.append(f.read())
We now have a list of text.
Import into Pandas
df_text = pd.DataFrame(ls_text)
Rename the column to make it easier to work with
df_text.columns = ['raw_text']
Clean the data by removing any rows with nulls
df_text['clean_text'] = df_text['raw_text'].fillna('')
You might choose to do some other cleaning. It is useful to keep the raw data and create a separate 'clean' column.
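For example, a possible extra cleaning step (just a sketch, using standard pandas string methods) that lower-cases the text and replaces punctuation with spaces, keeping raw_text untouched:
df_text['clean_text'] = (
    df_text['raw_text']
    .fillna('')
    .str.lower()
    .str.replace(r'[^\w\s]', ' ', regex=True)   # replace punctuation with spaces
)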
Create a tfidf object - I'm going to provide it with English stop words
tfidf = TfidfVectorizer(stop_words='english')
Fit and transform the clean_text we created above by passing the clean_text series to tfidf
tfidf_matrix = tfidf.fit_transform(df_text['clean_text'])
You can see the feature names from tfidf (note: in scikit-learn 1.0 and later this method is named get_feature_names_out())
tfidf.get_feature_names()
You'll see something which looks like this
['10',
'13',
'14',
'175',
'1960',
'1990s',
'1997',
'20',
'2001',
'20th',
'2176',
'60',
'80',
'8mm',
'90',
'90s',
'_huge_',
'aberdeen',
'able',
'abo',
'accent',
'accentuate',
'accident',
'accidentally',
'accompany',
'accurate',
'accused',
'acting',
'action',
'actor',
....
]
You can look at the shape of the matrix
tfidf_matrix.shape
In my example, I get a shape of
(10, 1733)
Roughly, this means 1,733 unique words (i.e. tokens) describe the 10 documents.
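Since the original question also asks for the total number of tokens and the number of unique tokens, here is a small sketch using the vectorizer's own analyzer (with the default settings, the unique count should match the vocabulary size above):
analyzer = tfidf.build_analyzer()   # same tokenizer and stop-word settings as tfidf
all_tokens = []
for text in df_text['clean_text']:
    all_tokens.extend(analyzer(text))

print('total tokens:', len(all_tokens))
print('unique tokens:', len(set(all_tokens)))   # should match tfidf_matrix.shape[1]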
Not being sure what you're looking to do with this, you might find these two articles useful.
This article from DataCamp uses TF-IDF in a recommendation system.
This article from DataCamp covers some general NLP processes and techniques.
I kind of took a fun approach to this. I'm using the same data as #the_good_pony provided, so I'll use the same path.
We'll use os and re modules, because regular expressions are fun and challenging!
import os
import re

# Path to where our data is located
base_path = r'C:\location\to\folder\with\document\files'

# Instantiate an empty dictionary
ddict = {}

# We're going to walk our directory
for root, subdirs, filename in os.walk(base_path):
    # For each subdirectory ('neg' and 'pos', in this case)
    for d in subdirs:
        # Create a NEW dictionary with the subdirectory name as key
        ddict[d] = {}

        # Create a path to the subdirectory
        subroot = os.path.join(root, d)

        # Get a list of files for the directory
        # Save time by creating a new path for each file
        file_list = [os.path.join(subroot, i) for i in os.listdir(subroot) if i.endswith('txt')]

        # For each file in the file list, open and read the file into the
        # sub-dictionary
        for f in file_list:
            # Basename = root name of the path to the file, i.e. the filename
            fkey = os.path.basename(f)
            # Read the file and set it as the sub-dictionary value
            with open(f, 'r') as fh:
                ddict[d][fkey] = fh.read()
Sample counts:
len(ddict.keys()) # 2 top-level subdirectories
len(ddict['neg'].keys()) # 1000 files in our 'neg' subdirectory
len(ddict['pos'].keys()) # 1000 files in our 'pos' subdirectory
# sample file content
# use two keys (subdirectory name and filename)
dirkey = 'pos'
filekey = 'cv000_29590.txt'
test1 = ddict[dirkey][filekey]
Output:
'films adapted from comic books have had plenty of success , whether they\'re about superheroes ( batman , superman , spawn ) , o [...]'
### Simple counter dictionary function
def val_counter(iterable, output_dict=None):
    # Instantiate a new dictionary
    if output_dict is None:
        output_dict = dict()
    # Check if the element is in the dictionary:
    # add 1 if yes, or set to 1 if no
    for i in iterable:
        if i in output_dict.keys():
            output_dict[i] += 1
        else:
            output_dict[i] = 1
    return output_dict
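A quick usage example of val_counter (collections.Counter would do the same job):
print(val_counter(['to', 'be', 'or', 'not', 'to', 'be']))
# {'to': 2, 'be': 2, 'or': 1, 'not': 1}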
Using regular expressions, which I've overkilled here, we can clean up our text from each corpus and capture alphanumeric items to a list. I've added an option to include small words (1 character in this case), but getting stopwords wouldn't be too tough.
def wordcounts(corpus, dirname='pos', keep_small_words=False, count_dict=None):
    if count_dict is None:
        count_dict = dict()

    get_words_pat = r'(?:\s*|\n*|\t*)?([\w]+)(?:\s*|\n*|\t*)?'
    p = re.compile(get_words_pat)

    def clean_corpus(x):
        # Replace all whitespace with a single space
        clear_ws_pat = r'\s+'
        # Find non-alphanumeric characters
        remove_punc_pat = r'[^\w+]'
        tmp1 = re.sub(remove_punc_pat, ' ', x)
        # Re-space whitespace and return
        return re.sub(clear_ws_pat, ' ', tmp1)

    # List of our files from the subdirectory
    keylist = list(corpus[dirname])
    for k in keylist:
        cleaned = clean_corpus(corpus[dirname][k])

        # Tokenize based on size
        if keep_small_words:
            tokens = p.findall(cleaned)
        else:
            # Limit to results > 1 char in length
            tokens = [i for i in p.findall(cleaned) if len(i) > 1]

        for i in tokens:
            if i in count_dict.keys():
                count_dict[i] += 1
            else:
                count_dict[i] = 1

    # Return dictionary once complete
    return count_dict
### Dictionary sorted lambda function
dict_sort = lambda d, descending=True: dict(sorted(d.items(), key=lambda x: x[1], reverse=descending))
# Run our function for positive corpus values
pos_result_dict = wordcounts(ddict, 'pos')
pos_result_dict = dict_sort(pos_result_dict)
Final processing and printing:
# Create dictionary of how frequent each count value is
freq_dist = val_counter(pos_result_dict.values())
freq_dist = dict_sort(freq_dist)
# Stats functions
k_count = lambda x: len(x.keys())
sum_vals = lambda x: sum([v for k, v in x.items()])
calc_avg = lambda x: sum_vals(x) / k_count(x)
# Get mean (arithmetic average) of word counts
mean_dict = calc_avg(pos_result_dict)
# Top half of results. We could shrink this even further, if necessary
top_dict = {k: v for k, v in pos_result_dict.items() if v >= mean_dict}

# This is probably where your TF-IDF part comes in
tot_count = sum([v for v in top_dict.values()])
for k, v in top_dict.items():
    pct_ = round(v / tot_count, 4)
    print('Word: ', k, ', count: ', v, ', %-age: ', pct_)
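As a rough sketch of getting from these counts to actual TF-IDF (not something the code above does, and with simplified tokenization), you could treat each file as a document, compute how many documents contain each token, and weight each document's counts by log(N / DF):
import math

N = len(ddict['pos'])   # number of documents
df = {}                 # token -> number of documents containing it
doc_tokens = {}         # filename -> list of tokens
for fname, text in ddict['pos'].items():
    tokens = [t.lower() for t in re.findall(r'\w+', text) if len(t) > 1]
    doc_tokens[fname] = tokens
    for t in set(tokens):
        df[t] = df.get(t, 0) + 1

# TF-IDF for one document: raw term count times log(N / DF)
some_file = next(iter(doc_tokens))
counts = val_counter(doc_tokens[some_file])
tfidf_scores = {t: c * math.log(N / df[t]) for t, c in counts.items()}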
Related
This is my code below. I would like to write new columns to my original CSV: the columns are supposed to contain the values of each dictionary created during my code, and for the last dictionary, since it contains 3 values, each value should be inserted in its own column. The code that writes to the CSV is at the end, but maybe there is a way to write the values each time I produce a new dictionary.
My attempt at the CSV route is below: I cannot figure out how to add the new columns without deleting the content of the original file.
# -*- coding: UTF-8 -*-
import codecs
import re
import os
import sys, argparse
import subprocess
import pprint
import csv
from itertools import islice
import pickle
import nltk
from nltk import tokenize
from nltk.tokenize import sent_tokenize, word_tokenize
from nltk.corpus import stopwords
import pandas as pd
try:
    import treetaggerwrapper
    from treetaggerwrapper import TreeTagger, make_tags
    print("import TreeTagger OK")
except:
    print("Import TreeTagger pas Ok")
from itertools import islice
from collections import defaultdict
# load the sentiment lexicon
pickle_in = open("dict_pickle", "rb")
dico_lexique = pickle.load(pickle_in)
# extract the verbatim column
d_verbatim = {}
with open(sys.argv[1], 'r', encoding='cp1252') as csv_file:
    csv_file.readline()
    for line in csv_file:
        token = line.split(';')
        try:
            d_verbatim[token[0]] = token[1]
        except:
            print(line)
#print(d_verbatim)
#Using treetagger
tagger = treetaggerwrapper.TreeTagger(TAGLANG='fr')
d_tag = {}
for key, val in d_verbatim.items():
    newvalues = tagger.tag_text(val)
    d_tag[key] = newvalues
#print(d_tag)
#lemmatisation
d_lemma = defaultdict(list)
for k, v in d_tag.items():
    for p in v:
        parts = p.split('\t')
        try:
            if parts[2] == '':
                d_lemma[k].append(parts[0])
            else:
                d_lemma[k].append(parts[2])
        except:
            print(parts)
#print(d_lemma)
stopWords = set(stopwords.words('french'))
d_filtered_words = {k: [w for w in l if w not in stopWords and w.isalpha()] for k, l in d_lemma.items()}
print(d_filtered_words)
d_score = {k: [0, 0, 0] for k in d_filtered_words.keys()}
for k, v in d_filtered_words.items():
    for word in v:
        if word in dico_lexique:
            if word
                print(word, dico_lexique[word])
Your edit seemed to make things worse; you've ended up deleting a lot of relevant context. I think I've pieced together what you are trying to do: the core of it seems to be a routine that performs sentiment analysis on text.
I'd start by creating a class that keeps track of this, e.g:
class Sentiment:
    __slots__ = ('positive', 'neutral', 'negative')

    def __init__(self, positive=0, neutral=0, negative=0):
        self.positive = positive
        self.neutral = neutral
        self.negative = negative

    def __repr__(self):
        return f'<Sentiment {self.positive} {self.neutral} {self.negative}>'

    def __add__(self, other):
        return Sentiment(
            self.positive + other.positive,
            self.neutral + other.neutral,
            self.negative + other.negative,
        )
This will allow you to replace convoluted bits of code like [a + b for a, b in zip(map(int, dico_lexique[word]), d_score[k])] with score += sentiment in the function below, and lets us refer to the various values by name.
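For illustration, accumulating scores with the class would look something like this (a toy example, not from your code):
score = Sentiment()
score += Sentiment(positive=2, neutral=1)   # += falls back to __add__ here
score += Sentiment(negative=3)
print(score)                                # <Sentiment 2 1 3>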
I'd then suggest preprocessing your pickled data so you don't have to convert things to ints in the middle of unrelated code, e.g.:
with open("dict_pickle", "rb") as fd:
    dico_lexique = {}
    for word, (pos, neu, neg) in pickle.load(fd).items():
        dico_lexique[word] = Sentiment(int(pos), int(neu), int(neg))
This puts them directly into the above class and seems to match up with other constraints in your code, but I don't have your data, so I can't check.
After pulling apart all your comprehensions and loops, we are left with a single nice routine for processing a single piece of text:
def process_text(text):
    """process the specified text

    returns (words, filtered words, total sentiment score)
    """
    words = []
    filtered = []
    score = Sentiment()
    for tag in make_tags(tagger.tag_text(text)):
        word = tag.lemma
        words.append(word)
        if word not in stopWords and word.isalpha():
            filtered.append(word)
            sentiment = dico_lexique.get(word)
            if sentiment is not None:
                score += sentiment
    return words, filtered, score
We can then put this into a loop that reads lines from the input and sends them to an output file:
filename = sys.argv[1]
tempname = filename + '~'

with open(filename) as fdin, open(tempname, 'w') as fdout:
    inp = csv.reader(fdin, delimiter=';')
    out = csv.writer(fdout, delimiter=';')

    # get the header, and blindly append our column names
    header = next(inp)
    out.writerow(header + [
        'd_lemma', 'd_filtered_words', 'Positive Score', 'Neutral Score', 'Negative Score',
    ])

    for row in inp:
        # assume that the second item contains the text we want to process
        words, filtered, score = process_text(row[1])
        extra_values = [
            words, filtered,
            score.positive, score.neutral, score.negative,
        ]
        # add the values and write out
        assert len(row) == len(header), "code needed to pad the columns out"
        out.writerow(row + extra_values)

# only replace if everything succeeds
os.rename(tempname, filename)
We write out to a different file and only rename on success; this means that if the code crashes it won't leave partially written files around. I'd discourage working like this though, and tend to make my scripts read from stdin and write to stdout. That way I can run it as:
$ python script.py < input.csv > output.csv
when all is OK, but also lets me run as:
$ head input.csv | python script.py
if I just want to test with the first few lines of input, or:
$ python script.py < input.csv | less
if I want to check the output as it's generated.
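A minimal sketch of that stdin/stdout variant (just as untested as the rest, and assuming the process_text helper above):
import sys
import csv

inp = csv.reader(sys.stdin, delimiter=';')
out = csv.writer(sys.stdout, delimiter=';')

header = next(inp)
out.writerow(header + [
    'd_lemma', 'd_filtered_words', 'Positive Score', 'Neutral Score', 'Negative Score',
])
for row in inp:
    words, filtered, score = process_text(row[1])
    out.writerow(row + [words, filtered, score.positive, score.neutral, score.negative])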
Note that none of this code has been run, so there are probably bugs in it, but written this way I can actually see what the code is trying to do. Comprehensions and 'functional'-style code are great, but they can easily get unreadable if you're not careful.
I need to search several thousand plaintext files for a set of names. I'm generating trigrams to retain context. I need to account for minor misspellings, so I'm using a Levenshtein distance calculation, function lev(). I need the final result to return the name with a hit, the filename the hit was in, and the trigram that was marked a hit. My Python program works as expected, but very slowly. I'm searching for a faster way to do this search, preferably in Python, but my Google-fu has failed me. A generic-ified version of the program is below:
from sklearn.feature_extraction.text import CountVectorizer
import os
textfiles = []
newgrams = set()
ngrams = []
hitlist = []
path = 'path of folder of textfiles'
names = ['john james doe', 'jane jill doe']
vectorizer = CountVectorizer(input='filename', ngram_range=(3, 3),
                             strip_accents='unicode', stop_words='english',
                             token_pattern='[a-zA-Z\\-]\\w*',
                             encoding='utf-8', decode_error='replace', lowercase=True)
ngramer = vectorizer.build_analyzer()

for dirpath, dirnames, filenames in os.walk(path):
    for files in filenames:
        if files.endswith('.txt'):
            textfiles.append(files)

ctFiles = len(textfiles)
ctNames = len(names)

for i in range(ctFiles):
    newgrams = set(ngramer(path + '/' + textfiles[i]))
    ngrams.append(newgrams)

for i in range(ctNames):
    splitname = names[i].split()
    for j in range(ctFiles):
        tempset = set()
        for k in range(len(splitname)):
            if k == 0:
                ## subset only the trigrams that "match" first name
                for trigram in ngrams[j]:
                    for word in trigram.split():
                        if lev(splitname[k], word) < 2:
                            tempset.add(trigram)
            else:
                ## search that subset for middle/last name
                if len(tempset) > 0:
                    for trigram in tempset:
                        for word in trigram.split():
                            if lev(splitname[k], word) < 2:
                                hitlist.append([names[i], textfiles[j], trigram])

print(hitlist)  ## eventually save to CSV
I am using fuzzywuzzy; it's pretty fast on my dataset (100K sentences): https://github.com/seatgeek/fuzzywuzzy
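For example, the kind of calls I mean (these are from fuzzywuzzy's documented API; the names below are just example strings):
from fuzzywuzzy import fuzz, process

fuzz.ratio('john james doe', 'jon james doe')                        # similarity score from 0 to 100
process.extractOne('john james doe', ['jane jill doe', 'john james doe'])
# ('john james doe', 100)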
Levenshtein is very expensive, and I wouldn't recommend using it for fuzzy matching on this many documents (unless you want to build a Levenshtein automaton to generate an index of tokens n steps away from every word in your files).
Trigram indexing should be fast and accurate on its own for words of a certain length, although you mention names; if that means multiple words, chunks need to be indexed as well, and that needs to be implemented.
If you try trigram indexing on its own and aren't satisfied with the accuracy, you can try adding a trigram chunk index, i.e. (Ban, ana, nan) stored as a tuple in addition to Ban, ana, and nan as individual trigrams, but in a separate index. This will have an even larger decrease in accuracy as character length decreases, so that should be accounted for.
The key here is that Levenshtein executes in roughly O(length of query × length of word) per comparison, repeated for every word in the files, while token/trigram/chunk indexing executes in O(log(number of words in files) × number of tokens/chunks/trigrams in the query).
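To make the indexing idea concrete, here is a rough sketch of a character-trigram index (untested, just to show the shape of it): candidate words that share enough trigrams with a query name are the only ones that then need a Levenshtein check.
from collections import defaultdict

def char_trigrams(word):
    # All 3-character substrings of a word
    return {word[i:i + 3] for i in range(len(word) - 2)}

index = defaultdict(set)            # trigram -> {(filename, word), ...}

def add_to_index(filename, words):
    for w in words:
        for tri in char_trigrams(w):
            index[tri].add((filename, w))

def candidates(query, min_shared=2):
    # (filename, word) pairs sharing at least min_shared trigrams with the query
    hits = defaultdict(int)
    for tri in char_trigrams(query):
        for item in index[tri]:
            hits[item] += 1
    return [item for item, n in hits.items() if n >= min_shared]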
Hey everyone, I know this has been asked a couple of times here already, but I am having a hard time finding document frequency using Python. I am trying to find TF-IDF and then the cosine scores between the documents and a query, but I am stuck at finding document frequency. This is what I have so far:
#includes
import re
import os
import operator
import glob
import sys
import math
from collections import Counter
#number of command line argument checker
if len(sys.argv) != 3:
    print 'usage: ./part3_soln2.py "path to folder in quotation marks" query.txt'
    sys.exit(1)
#Read in the directory to the files
path = sys.argv[1]
#Read in the query
y = sys.argv[2]
querystart = re.findall(r'\w+', open(y).read().lower())
query = [Z for Z in querystart]
Query_vec = Counter(query)
print Query_vec
#counts total number of documents in the directory
doccounter = len(glob.glob1(path,"*.txt"))
if os.path.exists(path) and os.path.isfile(y):
    word_TF = []
    word_IDF = {}
    TFvec = []
    IDFvec = []

    # this is my attempt at finding IDF
    for filename in glob.glob(os.path.join(path, '*.txt')):
        words_IDF = re.findall(r'\w+', open(filename).read().lower())
        doc_IDF = [A for A in words_IDF if len(A) >= 3 and A.isalpha()]
        word_IDF = doc_IDF
        # pseudocode!!
        """
        for key in word_idf:
            if key in word_idf:
                word_idf[key] += 1
            else:
                word_idf[key] = 1
        print word_IDF
        """

    # goes to that directory and reads in the files there
    for filename in glob.glob(os.path.join(path, '*.txt')):
        words_TF = re.findall(r'\w+', open(filename).read().lower())
        # scans each document for words greater than or equal to 3 in length
        doc_TF = [A for A in words_TF if len(A) >= 3 and A.isalpha()]
        # this assigns values to each term; this is my TF for each vector
        TFvec = Counter(doc_TF)
        # weighting the TF with a log function
        for key in TFvec:
            TFvec[key] = 1 + math.log10(TFvec[key])
        # placed here so I don't get a command line full of text
        print TFvec

# error checker
else:
    print "That path does not exist"
I am using Python 2, and so far I don't really have any idea how to count how many documents a term appears in. I can find the total number of documents, but I am really stuck on finding the number of documents a term appears in. I was just going to create one large dictionary that holds all of the terms from all of the documents, which could be fetched later when a query needs those terms. Thank you for any help you can give me.
DF for a term x is the number of documents in which x appears. In order to find that, you need to iterate over all documents first. Only then can you compute IDF from DF.
You can use a dictionary for counting DF:
Iterate over all documents
For each document, retrieve the set of its words (without repetitions)
Increase the DF count for each word from stage 2. Thus you will increase the count exactly by one, regardless of how many times the word was in document.
Python code could look like this:
from collections import defaultdict
import math
DF = defaultdict(int)
for filename in glob.glob(os.path.join(path, '*.txt')):
    words = re.findall(r'\w+', open(filename).read().lower())
    for word in set(words):
        if len(word) >= 3 and word.isalpha():
            DF[word] += 1  # defaultdict simplifies your "if key in word_idf: ..." part

# Now you can compute IDF.
IDF = dict()
for word in DF:
    IDF[word] = math.log(doccounter / float(DF[word]))  # Don't forget that Python 2 uses integer division.
P.S. It's good for learning to implement things manually, but if you ever get stuck, I suggest you look at the NLTK package. It provides useful functions for working with corpora (collections of texts).
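For instance, a rough sketch using NLTK's PlaintextCorpusReader and FreqDist (untested here, but these are standard NLTK classes) to get token counts over a folder of .txt files:
from nltk.corpus import PlaintextCorpusReader
from nltk import FreqDist

corpus = PlaintextCorpusReader(path, r'.*\.txt')
fd = FreqDist(w.lower() for w in corpus.words())
print(fd.most_common(10))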
Basically I want to compare three sets of lists to another file. These lists are differentiated by comments. How do I compare the file to each of these lists? Do I need to make three separate files for these lists?
Example: words have prefixes, roots and suffixes. An example would be 'contradict': the prefix is 'con', the suffix is 'dict'. I have a list of these prefixes, suffixes, etc. I need to know how to compare that list to the pile of words and basically count the number of roots, prefixes and suffixes that exist in that file.
The following might help to get you started. It uses Python's ConfigParser to load a file which contains all of your lists. This file needs to be formatted as follows:
vocab.txt
[prefixes]
inter
con
mis
[roots]
cred
duct
equ
[suffixes]
dict
ment
ible
Each list of words gets loaded into the variables prefixes, roots and suffixes accordingly (with any duplicates removed). It then loads a source file called input.txt and splits this into a list of words called words. Each word is lowercased to make sure it matches one of the prefixes, roots or suffixes.
For each word a simple test is made to see if it matches any of your lists. Each match is displayed and counted. A total for each is also kept and displayed at the end.
import ConfigParser
import re
vocab = ConfigParser.ConfigParser(allow_no_value=True)
vocab.read('vocab.txt')
def get_section(section):
    return set(v[0].lower() for v in vocab.items(section))
prefixes = get_section('prefixes')
roots = get_section('roots')
suffixes = get_section('suffixes')
total_prefixes = 0
total_roots = 0
total_suffixes = 0
with open('input.txt', 'r') as f_input:
    text = f_input.read()

words = [word.lower() for word in re.findall('\w+', text)]

for word in words:
    word_prefixes = [p for p in prefixes if word.startswith(p)]
    total_prefixes += len(word_prefixes)

    word_roots = [r for r in roots if r in word[1:]]
    total_roots += len(word_roots)

    word_suffixes = [s for s in suffixes if word.endswith(s)]
    total_suffixes += len(word_suffixes)

    print '{:15} Prefixes {} {}, Roots {} {}, Suffixes {} {}'.format(word,
        len(word_prefixes), word_prefixes, len(word_roots), word_roots, len(word_suffixes), word_suffixes)

print
print 'Totals:\n  Prefixes {}, Roots {}, Suffixes {}'.format(total_prefixes, total_roots, total_suffixes)
I am reading a .csv file and saving it to a matrix called csvfile, and the matrix contents look like this (abbreviated, there are dozens of records):
[['411-440854-0', '411-440824-0', '411-441232-0', '394-529791', '394-529729', '394-530626'], <...>, ['394-1022430-0', '394-1022431-0', '394-1022432-0', '***another CN with a switch in between'], ['394-833938-0', '394-833939-0', '394-833940-0'], <...>, ['394-1021830-0', '394-1021831-0', '394-1021832-0', '***Sectionalizer end connections'], ['394-1022736-0', '394-1022737-0', '394-1022738-0'], <...>, ['394-1986420-0', '394-1986419-0', '394-1986416-0', '***weird BN line check'], ['394-1986411-0', '394-1986415-0', '394-1986413-0'], <...>, ['394-529865-0', '394-529686-0', '394-530875-0', '***Sectionalizer end connections'], ['394-830900-0', '394-830904-0', '394-830902-0'], ['394-2350772-0', '394-2350776-0', '394-2350774-0', '***Sectionalizer present but no end break'], <...>]
and I am reading a text file into a variable called textfile and the content looks like this:
...
object underground_line {
name SPU123-394-1021830-0-sectionalizer;
phases AN;
from SPU123-391-670003;
to SPU123-395-899674_sectionalizernode;
length 26.536;
configuration SPU123-1/0CN15-AN;
}
object underground_line {
name SPU123-394-1021831-0-sectionalizer;
phases BN;
from SPU123-391-670002;
to SPU123-395-899675_sectionalizernode;
length 17.902;
configuration SPU123-1/0CN15-BN;
}
object underground_line {
name SPU123-394-1028883-0-sectionalizer;
phases CN;
from SPU123-391-542651;
to SPU123-395-907325_sectionalizernode;
length 771.777;
configuration SPU123-1CN15-CN;
}
...
I want to see if a portion of the name line in textfile (anything after SPU123- and before -0-sectionalizer) exists in the csvfile matrix. If it does not exist, I want to do something (increment a counter). I tried several ways, including the one below:
counter = 0
for noline in textfile:
    if 'name SPU123-' in noline:
        if '-' in noline[23]:
            if ((noline[13:23] not in s[0]) and (noline[13:23] not in s[1]) and (noline[13:23] not in s[2]) for s in csvfile):
                counter = counter + 1
        else:
            if ((noline[13:24] not in s[0]) and (noline[13:24] not in s[1]) and (noline[13:-24] not in s[2]) for s in csvfile):
                counter = counter + 1
print counter
This is not working. I also tried with if any((noline......) in the above code sample and it doesn't work either.
Checking for a string s in a list of lists l:
>>> l = [['str', 'foo'], ['bar', 'so']]
>>> s = 'foo'
>>> any(s in x for x in l)
True
>>> s = 'nope'
>>> any(s in x for x in l)
False
Implementing this into your code (assuming that noline[13:23] is the string you are wanting to search for, and incrementing counter if it is not in csvfile):
counter = 0
for noline in textfile:
    if 'name SPU123-' in noline:
        if '-' in noline[23]:
            if not any(noline[13:23] in x for x in csvfile) and not any(noline[13:23] + '-0' in x for x in csvfile):
                counter += 1
        else:
            if not any(noline[13:24] in x for x in csvfile) and not any(noline[13:24] + '-0' in x for x in csvfile):
                counter += 1
Since your matrix includes loads upon loads of values, it's very slow to iterate over it all each time.
Assemble your values into a mapping instead (a set in this case since there are no associated data) since hash table lookups are very fast:
s = {v for r in matrix for v in r if re.match(r'\d[-\d]+\d$', v)}  # or any filter more appropriate for your notion of valid identifiers
if noline[13:23] in s:  # parsing the identifiers instead would be more fault-tolerant
    # do something
Due to the preliminary step, this will only start outperforming the brute-force approach beyond a certain scale.
import re, itertools
Flatten csvfile -- data is an iterator
data = itertools.chain.from_iterable(csvfile)
Extract relevant items from data and make it a set for performance (avoid iterating over data multiple times)
data_rex = re.compile(r'\d{3}-\d+')
data = {match.group() for match in itertools.imap(data_rex.match, data) if match}
Quantify the names that are not in data.
def predicate(match, data=data):
    '''Return True if match not found in data'''
    return match.group(1) not in data

# after SPU123- and before -0-
name = re.compile(r'name SPU123-(\d{3}-\d+)-')
names = name.finditer(textfile)

# quantify
print sum(itertools.imap(predicate, names))