So, I'm looping through a bunch of docs creating a list of all unique words and sequential groups of words in the documents (obviously, the strings I'm looking at are pretty short).
globallist = []
for filename in glob.glob(os.path.join(path, '*.html')):
    mystr = "some text I want"
    stuff = re.sub(r"[^\w]", " ", mystr).split()
    wordlist = [' '.join(stuff[i:j]) for i in range(len(stuff)) for j in range(i + 1, len(stuff) + 1)]
    globallist = set.union(set(globallist), set(wordlist))
I want to track occurrences with globallist as I go so that at the end I'll already have a count of how many documents contain each string in the list. I plan to delete any element that only occurs in one document. What's the best way to do this?
The script below should help to give you some ideas.
You are attempting to parse HTML files, so ideally you need to extract only the text from each file, without any of the HTML markup. This can be done using a library such as BeautifulSoup. Next, it is best to lowercase all of the words to ensure you catch words that differ only in case. Python's collections.Counter can be used to count all of the words, and from that a list containing only the words with a count of one can be constructed. Finally, a count of your phrases can be made.
All of this information can then be stored on a per-file basis in file_stats. The results are then displayed at the end.
From that, you will be able to see how many of the documents contain the text you were looking for.
from bs4 import BeautifulSoup
import collections
import glob
import re
import os
path = r'mypath'
file_stats = []
search_list = ['some text I want', 'some other text']
search_list = [phrase.lower() for phrase in search_list] # Ensure list is all lowercase
for filename in glob.glob(os.path.join(path, '*.html')):
    with open(filename, 'r') as f_input:
        html = f_input.read()

    soup = BeautifulSoup(html, 'html.parser')

    # Remove style and script sections from the HTML
    for script in soup(["style", "script"]):
        script.extract()

    # Extract all text
    text = soup.get_text()

    # Create a word list in lowercase
    word_list = [word.lower() for word in re.sub(r"[^\w]", " ", text).split()]

    # Search for matching phrases
    phrase_counts = dict()
    text = ' '.join(word_list)
    for search in search_list:
        phrase_counts[search] = text.count(search)

    # Calculate the word counts
    word_counts = collections.Counter(word_list)

    # Filter unique words
    unique_words = sorted(word for word, count in word_counts.items() if count == 1)

    # Create a list of unique words and phrase matches for each file
    file_stats.append([filename, unique_words, phrase_counts])

# Display the results for all files
for filename, unique_words, phrase_counts in file_stats:
    print('{:30} {}'.format(filename, unique_words))
    for phrase, count in phrase_counts.items():
        print('    {} : {}'.format(phrase, count))
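If the goal is the document-level tally described in the question, a short follow-up loop over file_stats (a sketch reusing the names from the script above) can count how many files contained each phrase at least once:
# Tally how many files contained each phrase at least once (sketch)
doc_counts = {phrase: 0 for phrase in search_list}
for filename, unique_words, phrase_counts in file_stats:
    for phrase, count in phrase_counts.items():
        if count > 0:
            doc_counts[phrase] += 1

for phrase, n_docs in doc_counts.items():
    print('{!r} appears in {} document(s)'.format(phrase, n_docs))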
Create a set of words for each document, and update a collections.Counter with the per-file words. The set avoids counting a word more than once per file, while the Counter seamlessly sums across files. For a super simple example counting individual words (without tracking which file they came from):
from collections import Counter

totals = Counter()
for file in allfiles:
    with open(file) as f:
        totals.update(set(f.read().split()))
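The same approach answers the follow-up in the question: because the Counter ends up holding a per-document count for every string, anything seen in only one document can be dropped at the end. A minimal sketch, assuming the glob loop and the wordlist construction from the question:
from collections import Counter

totals = Counter()
for filename in glob.glob(os.path.join(path, '*.html')):
    # ... build `wordlist` for this file as in the question ...
    totals.update(set(wordlist))  # set() so each file counts a string at most once

# keep only the strings that occur in more than one document
multi_doc = {phrase: n for phrase, n in totals.items() if n > 1}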
I would like to:
1. count whether each word in a list of words appears in multiple text files in a directory, and
2. count how many times each word in that list appears across those text files.

I already have a "list_words", which is a list of the word strings I am interested in. Then I read each file in a loop:
for i in range(1, 10):
    filename = open(directory + 'text' + str(i), 'r')
    textfile = filename.read()
    words_in_textfile = re.split(r'\W{1,}', textfile)
and I am stuck on how to go about this. I would really appreciate any help!
I would suggest using the re module, the Counter object from the collections module, and the Path object from the pathlib module.
import re
from collections import Counter
from pathlib import Path

counter = Counter()  # Counter object for keeping track of word counts
for word in your_list_of_words:  # iterate through your list of words, and for each word...
    for file in Path("your_directory").glob("*.txt"):  # ...iterate through all .txt files in "your_directory"
        with open(file, 'r') as stream:  # open the file
            # Update the counter with the count of all instances of `word` found in `file`
            counter.update(re.findall(word, stream.read()))
This will give you a total count of each word across all files. If you want a count of each word for each file you may want to use a dictionary and update it each time. e.g.
counter = {}
for word in your_list_of_words:
    counter[word] = {}
    for file in Path("your_directory").glob("*.txt"):
        with open(file, 'r') as stream:
            counter[word][file] = len(re.findall(word, stream.read()))
Worth noting that this will find all instances of the word, even when it appears in the middle of another word, e.g.
re.findall('cat', "catastrophic catcalling cattery cats")
returns
['cat', 'cat', 'cat', 'cat']
so you may want to play with the regex, e.g.
word = 'cat'
re.findall(fr"\b{word}\b", "catastrophic catcalling cattery cats")
returns []
which may be more what you're looking for.
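Putting the two ideas together, a per-file count with whole-word matching might look like this (a sketch, using the same your_list_of_words and your_directory placeholders as above):
import re
from pathlib import Path

counter = {}
for word in your_list_of_words:
    pattern = re.compile(fr"\b{re.escape(word)}\b")  # match whole words only
    counter[word] = {}
    for file in Path("your_directory").glob("*.txt"):
        with open(file, 'r') as stream:
            counter[word][file] = len(pattern.findall(stream.read()))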
I am working on a script to clean up a .txt file, create a list, count the frequencies of unique words, and output a .csv file with the frequencies. I would like to open multiple files and combine them to still output a single .csv file.
Would it be more efficient to write code that would combine the text across the .txt files first or to read/clean all the unique files and combine the lists/dictionaries afterwards? What would the syntax look like for the optimal scenario?
I have been trying to research it on my own but have very limited coding skills and can't seem to find an answer that fits my specific question. I appreciate any and all input. Thanks!
import re
filename = 'testtext.txt'
file = open(filename, 'rt')
text = file.read()
file.close()
import re
words = re.split(r'\W+', text)
words = [word.lower() for word in words]
import string
table = str.maketrans('', '', string.punctuation)
stripped = [w.translate(table) for w in words]
from collections import Counter
countlist = Counter(stripped)
import csv
w = csv.writer(open("testtext.csv", "w"))
for key, val in countlist.items():
    w.writerow([key, val])
If you want to count the frequencies of words for multiple files and output them into one CSV file, you won't need to change your code much; just add a loop, e.g.:
import re
import string
from collections import Counter
import csv
files = ['testtext.txt', 'testtext2.txt', 'testtext3.txt']
stripped = []

for filename in files:
    file = open(filename, 'rt')
    text = file.read()
    file.close()

    words = re.split(r'\W+', text)
    words = [word.lower() for word in words]

    table = str.maketrans('', '', string.punctuation)
    stripped += [w.translate(table) for w in words]  # concatenating parsed data

countlist = Counter(stripped)

w = csv.writer(open("testtext.csv", "w"))
for key, val in countlist.items():
    w.writerow([key, val])
I don't know if this is the most optimal way of doing it. That will depend on factors such as how big the files are, how many files you want to parse, and how frequently you want to parse x files of size y. Once you have figured that out, you can start thinking of ways to optimize the process.
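One of the alternatives the question raises, cleaning each file separately and combining the results afterwards, is also straightforward because Counter objects can be added together. A rough sketch under the same assumptions as the code above:
total = Counter()
table = str.maketrans('', '', string.punctuation)
for filename in files:
    with open(filename, 'rt') as f:
        words = re.split(r'\W+', f.read().lower())
    total += Counter(w.translate(table) for w in words)  # Counters support +, summing per-file counts
Whether this beats concatenating one big list will mostly depend on the file sizes, so timing both on your own data is the safest way to decide.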
If you need to calculate the frequency, it is better to combine the strings from the multiple .txt files first. To gauge the performance, you can record the time at the beginning and at the end of the processing.
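For example, a minimal way to record the time before and after the processing (a sketch using the standard datetime module):
from datetime import datetime

start = datetime.now()
# ... combine the .txt files and count the word frequencies here ...
end = datetime.now()
print('processing took', end - start)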
I've seen a few people ask how this would be done, but their questions were 'too broad' so I decided to find out how to do it. I've posted below how.
So to do this, first you must open the file (assuming you have a text file called 'text.txt'). We do this by calling the open function.
file = open('text.txt', 'r')
The open function uses the syntax: open(file, mode)
The file is the text document, and the mode is how it's opened ('r' means read only). The read function reads the file, then split separates the contents into a list of words. Lastly, we use the count function to find how many times the word appears.
word = input('word: ')
print(file.read().split().count(word))
And there you have it, counting words in a text file!
Word counts can be tricky. At a minimum, one would like to avoid differences in capitalization and punctuation. A simple way to take the next step is to use regular expressions and to convert the resulting words to lower case before we do the count. We could even use collections.Counter and count all of the words.
import re
# `word_finder(somestring)` emits all words in string as list
word_finder = re.compile(r'\w+').findall
filename = input('filename: ')
word = input('word: ')
# remove case for compare
lword = word.lower()
# `word_finder` emits all of the words excluding punctuation
# `filter` removes the lower cased words we don't want
# `len` counts the result
count = len(list(filter(lambda w: w.lower() == lword,
                        word_finder(open(filename).read()))))
print(count)
# we could go crazy and count all of the words in the file
# and do it line by line to reduce memory footprint.
import collections
import itertools
from pprint import pprint
word_counts = collections.Counter(itertools.chain.from_iterable(
word_finder(line.lower()) for line in open(filename)))
pprint(word_counts)
Splitting on whitespace isn't sufficient -- split on everything you're not counting and get your case under control:
import re
import sys
file = open(sys.argv[1])
word = sys.argv[2]
print(re.split(r"[^a-z]+", file.read().casefold()).count(word.casefold()))
You can add apostrophes to the inverted pattern [^a-z'] or whatever else you want to include in your count.
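For instance, a sketch of the same one-liner with apostrophes kept inside words:
import re
import sys

file = open(sys.argv[1])
word = sys.argv[2]
# the apostrophe in the class keeps words like "don't" as one token
print(re.split(r"[^a-z']+", file.read().casefold()).count(word.casefold()))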
Hogan: Colonel, you're asking and answering your own questions. That's tops in German efficiency.
def words_frequency_counter(filename):
    """Print how many times the word appears in the text."""
    try:
        with open(filename) as file_object:
            contents = file_object.read()
    except FileNotFoundError:
        pass
    else:
        word = input("Give me a word: ")
        print("'" + word + "'" + ' appears ' +
              str(contents.lower().count(word.lower())) + ' times.\n')
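To try it out, call the function with the path to any text file (the filename here is just a placeholder):
words_frequency_counter('sample.txt')  # hypothetical file name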
First, you want to open the file. Do this with:
your_file = open('file.txt', 'r')
Next, you want to count the word. Let's set your word as 'brian' under the variable life. No reason.
life = 'brian'
your_file.read().split().count(life)
What that does is read the file, split it into individual words, and count the instances of the word 'brian'. Hope this helps!
I am relatively new to Python and am trying to write a basic filter for a text file, then count the frequency of the words found in the filtered lines. I've attempted to apply a stopword list to it. So far I have this:
import sys, re
from collections import Counter
from nltk.corpus import stopwords

reload(sys)
sys.setdefaultencoding('utf8')

term = sys.argv[2].lower()
empty = []
count = 0

# filter lines containing term and also add them to empty list
with open(sys.argv[1]) as f:
    for line in f:
        for text in line.lower().split("\n"):
            if term in text:
                empty.append(text)
                count += 1
                print text

# create stopword list from nltk
stop = stopwords.words("english")
stoplist = []

# apply stopword list to items in list containing lines matching term
for y in empty:
    for t in stop:
        if t not in y:
            stoplist.append(y)

# count words that appear in the empty list
words = re.findall(r"\w+", str(stoplist))
wordcount = Counter(words)
print wordcount
print "\n" + "Number of times " + str(term) + " appears in text is: " + str(count)
This works fine (but is probably incredibly messy/inefficient), but it seemingly returns a count of the filtered words that is much too high, about ten times too high really.
I was just wondering if anyone could spot something obvious I'm missing and point me in the right direction of how to fix it. Any help would really be appreciated, thanks!
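One likely culprit is the nested loop over stop: each matching line gets appended to stoplist once for every stopword it does not contain, so most lines are duplicated a hundred or more times before the words are counted. A sketch of that filtering step rewritten to remove stopwords instead of duplicating lines (reusing empty, re, Counter and stopwords from the script above):
stop = set(stopwords.words("english"))
filtered_words = []
for y in empty:
    # keep only the words of each matching line that are not stopwords
    filtered_words.extend(w for w in re.findall(r"\w+", y) if w not in stop)
wordcount = Counter(filtered_words)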
I've written a short program that will take an input file, remove punctuation, sort the contents by number of occurrences per word, and then write the 100 most common results to an output file.
I had some trouble on the last part (writing the results to an output file), and though I've fixed it, I don't know what the problem was.
The full code looks like so:
from collections import Counter
from itertools import chain
import sys
import string
wordList = []
#this file contains text from a number of reviews
file1 = open('reviewfile', 'r+')
reviewWords = file1.read().lower()
#this file contains a list of the 1000 most common English words
file2 = open('commonwordsfile', 'r')
commonWords = file2.read().lower()
#remove punctuation
for char in string.punctuation:
    reviewWords = reviewWords.replace(char, " ")

#create a list of individual words from file1
splitWords = reviewWords.split()
for w in splitWords:
    if w not in commonWords and len(w) > 2:
        wordList.append(w)
#sort the resulting list by length
wordList = sorted(wordList, key=len)
#return a list containing the 100
#most common words and number of occurrences
words_to_count = (word for word in wordList)
c = Counter(words_to_count)
commonHundred = c.most_common(100)
#create new file for results and write
#the 100 most common words to it
fileHandle = open("outcome", 'w' )
for listItem in commonHundred:
    fileHandle.write(str(listItem) + "\n")
fileHandle.close()
I previously had this following code snippet attempting to write the 100 most common terms to a .txt file, but it didn't work. Can anyone explain why not?
makeFile = open("outputfile", "w")
for item in CommonHundred:
    makeFile.write("[0]\n".format(item))
makeFile.close()
Those should be curly braces, like:
makeFile.write("{0}\n".format(item))
Run this and see what happens:
a = "[0]".format("test")
print(a)
b = "{0}".format("test")
print(b)
Then go search for "Format String Syntax" here if you'd like to know more: http://docs.python.org/3/library/string.html.