Python character and word counts

I'm a beginner with Python and would like to know how to use two txt files to count the characters as well as find the 10 most common characters. Also, how do I convert all characters in the files to lower case and eliminate all characters other than a-z?
Here's what I've tried, with no luck:
from string import ascii_lowercase
from collections import Counter
with open('document1.txt', 'document2.txt') as f:
    print Counter(letter for line in f
                  for letter in line.lower()
                  if letter in ascii_lowercase)

Try it like this:
>>> from collections import Counter
>>> import re
>>> words = re.findall(r'\w+', "{} {}".format(open('your_file1').read().lower(), open('your_file2').read().lower()))
>>> Counter(words).most_common(10)
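The question also asks about characters; the same Counter approach works on characters directly (a sketch in the same style, file names hypothetical):
>>> from string import ascii_lowercase
>>> text = "{} {}".format(open('your_file1').read().lower(), open('your_file2').read().lower())
>>> Counter(ch for ch in text if ch in ascii_lowercase).most_common(10)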

Here is a simple example. You can adapt this code to fit your needs.
from string import ascii_lowercase
from collections import Counter
with open('file1.txt', 'r') as file1data:  # open and read file one
    file1 = file1data.read().lower()  # convert the entire file contents to lower case
with open('file2.txt', 'r') as file2data:  # open and read file two
    file2 = file2data.read().lower()
# The contents of both files are now stored in the file1 and file2 variables.
# Example of how to work with one file; repeat for the second file.
file1_list = []
for ch in file1:
    if ch in ascii_lowercase:  # makes sure only lowercase letters are appended; all non-alphabet characters are dropped
        file1_list.append(ch)
    elif ch in [" ", ".", ",", "'"]:  # remove this elif block if you just want the letters
        file1_list.append(ch)  # keeps basic punctuation
print("".join(file1_list))  # this line is not needed; it just shows what the text looks like now
print(Counter(file1_list).most_common(10))  # prints the top ten
print(Counter(file1_list))  # prints each character and how many times it occurs
Now that you've reviewed that mess above and have an idea of what each line is doing, here is a cleaner version that does what you were looking for.
from string import ascii_lowercase
from collections import Counter
with open('file1.txt', 'r') as file1data:
    file1 = file1data.read().lower()
with open('file2.txt', 'r') as file2data:
    file2 = file2data.read().lower()

file1_list = []
for ch in file1:
    if ch in ascii_lowercase:
        file1_list.append(ch)

file2_list = []
for ch in file2:
    if ch in ascii_lowercase:
        file2_list.append(ch)

all_counter = Counter(file1_list + file2_list)
top_ten_counter = all_counter.most_common(10)
print(sorted(all_counter.items()))
print(sorted(top_ten_counter))
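If you ever need more than two files, the same idea generalizes with a loop and Counter.update (a sketch; the file names are placeholders):
from string import ascii_lowercase
from collections import Counter

counts = Counter()
for name in ['file1.txt', 'file2.txt']:  # add more file names as needed
    with open(name, 'r') as f:
        counts.update(ch for ch in f.read().lower() if ch in ascii_lowercase)

print(sorted(counts.items()))
print(counts.most_common(10))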

Unfortunately there is no way to insert into the middle of a file without re-writing it. As previous posters have indicated, you can append to a file or overwrite part of it using seek but if you want to add stuff at the beginning or the middle, you'll have to rewrite it.
This is an operating system thing, not a Python thing. It is the same in all languages.
What I usually do is read from the file, make the modifications and write it out to a new file called myfile.txt.tmp or something like that. This is better than reading the whole file into memory because the file may be too large for that. Once the temporary file is completed, I rename it the same as the original file.
This is a good, safe way to do it because if the file write crashes or aborts for any reason, you still have your untouched original file.
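A minimal sketch of that pattern, assuming a hypothetical modify() function that transforms one line at a time:
import os

def rewrite_file(path, modify):
    tmp_path = path + '.tmp'
    with open(path, 'r') as src, open(tmp_path, 'w') as dst:
        for line in src:
            dst.write(modify(line))  # transform each line as it streams through
    os.replace(tmp_path, path)  # swap the temp file into place, replacing the original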
To find the most common words from multiple files:
from collections import Counter
import re
with open('document1.txt') as f1, open('document2.txt') as f2:
    words = re.findall(r'\w+', f1.read().lower()) + re.findall(r'\w+', f2.read().lower())
>>> Counter(words).most_common(10)
will give you the 10 most common words.
If you want the 10 most common characters:
>>> Counter(f1.read() + f2.read()).most_common(10)
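One caveat with that last snippet: once you've read f1 and f2 for the word count, the file objects are exhausted (and closed outside the with block), so read the text once and reuse it (a sketch):
from collections import Counter
import re

with open('document1.txt') as f1, open('document2.txt') as f2:
    text = f1.read().lower() + f2.read().lower()

print(Counter(re.findall(r'\w+', text)).most_common(10))  # 10 most common words
print(Counter(text).most_common(10))                      # 10 most common characters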


Counting number of words in Python file

I'm trying to count the number of instances several words appear in a file.
Here is my code:
#!/usr/bin/env python
file = open('my_output', 'r')
word1 = 'wordA'
print('wordA', file.read().split().count(word1))
word2 = 'wordB'
print('wordB', file.read().split().count(word2))
word3 = 'wordC'
print('wordC', file.read().split().count(word3))
The issue in the code is that it only counts the number of instances of word1. How can this code be fixed to count word2 and word3?
Thanks!
I think instead of repeatedly reading and splitting the file, this code would work better; this way you can find the term frequency of any number of words that appear in the file:
file = open('my_output', 'r')
s = file.read().split()
w = set(s)  # the distinct words
tf = {}
for i in w:
    tf[i] = s.count(i)
print(tf)
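As an aside, collections.Counter builds the same word-to-count mapping in a single pass, which avoids calling s.count() once per distinct word (a sketch):
from collections import Counter

with open('my_output', 'r') as file:
    tf = Counter(file.read().split())
print(tf)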
The main problem is that file.read() consumes the file. Thus the second time you search you end up searching an empty file. The simplest solution is to read the file once (if it is not too large) and then just search the previously read text:
#!/usr/bin/env python
with open('my_output', 'r') as file:
    text = file.read()
word1 = 'wordA'
print('wordA', text.split().count(word1))
word2 = 'wordB'
print('wordB', text.split().count(word2))
word3 = 'wordC'
print('wordC', text.split().count(word3))
To improve performance it is also possible to split only once:
#!/usr/bin/env python
with open('my_output', 'r') as file:
    split_text = file.read().split()
word1 = 'wordA'
print('wordA', split_text.count(word1))
word2 = 'wordB'
print('wordB', split_text.count(word2))
word3 = 'wordC'
print('wordC', split_text.count(word3))
Using with will also ensure that the file closes correctly after being read.
Can you try this:
file = open('my_output', 'r')
splitFile = file.read().split()
lst = ['wordA', 'wordB', 'wordC']
for wrd in lst:
    print(wrd, splitFile.count(wrd))
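If you'd rather collect the counts in one object instead of printing them, a dict comprehension over the same list works (a sketch):
counts = {wrd: splitFile.count(wrd) for wrd in lst}
print(counts)  # e.g. {'wordA': 3, 'wordB': 1, 'wordC': 2}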
Short solution using collections.Counter object:
import collections
with open('my_output', 'r') as f:
    wordnames = ('wordA', 'wordB', 'wordC')
    counts = (i for i in collections.Counter(f.read().split()).items() if i[0] in wordnames)
    for c in counts:
        print(c[0], c[1])
For the following sample text line:
'wordA some dfasd asdasdword B wordA sdfsd sdasdasdddasd wordB wordC wordC sdfsdfsdf wordA'
we would obtain the output:
wordB 1
wordC 2
wordA 3
In your code the file is consumed (exhausted) by the first line, so the following lines have nothing left to count: the first file.read() reads the whole contents of the file and returns it as a string; the second file.read() has nothing left to read and just returns an empty string '', as does the third file.read().
This is a version that should do what you want:
from collections import Counter
counter = Counter()
with open('my_output', 'r') as file:
    for line in file:
        counter.update(line.split())
print(counter)
You may have to do some preprocessing (in order to get rid of special characters like , and . and so on).
Counter is in the Python standard library and is very useful for exactly this kind of thing.
Note that this way you iterate over the file only once, and you never have to store the whole file in memory.
If you only want to keep track of certain words, you can select just those instead of passing the whole line to the counter:
from collections import Counter
import string
counter = Counter()
words = ('wordA', 'wordB', 'wordC')
chars_to_remove = str.maketrans('', '', string.punctuation)
with open('my_output', 'r') as file:
    for line in file:
        line = line.translate(chars_to_remove)
        w = (word for word in line.split() if word in words)
        counter.update(w)
print(counter)
I also included an example of what I meant by preprocessing: punctuation is removed before counting.
from collections import Counter
# Create an empty word_list that stores the words from each line.
word_list = []
# file_handle refers to the file object.
file_handle = open(r'my_file.txt', 'r+')
# Read all the lines in the file:
# split each line into a list of words
# and extend the word_list with them.
for line in file_handle.readlines():
    word_list.extend(line.split())
# Close the file object.
file_handle.close()
# Pass the word_list to Counter() to get the dictionary of word counts.
dictionary_of_words = Counter(word_list)
print(dictionary_of_words)
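A minor refinement: iterating over the file object directly streams the lines one at a time, so readlines() doesn't have to load them all into memory first (a sketch of the same idea):
from collections import Counter

with open(r'my_file.txt', 'r') as file_handle:
    word_list = [word for line in file_handle for word in line.split()]
print(Counter(word_list))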

Iterating through a file and replacing strings, leaving the number of characters intact

I'm attempting to anonymize a file so that all the content except certain keywords are replaced with gibberish, but the format is kept the same (including punctuation, length of string and capitalization). For example:
I am testing this, check it out! This is a keyword: long
Wow, another line.
should turn in to:
T ad ehistmg ptrs, erovj qo giw! Tgds ar o qpyeogf: long
Yeg, rmbjthe yadn.
I am attempting to do this in Python, but I'm having no luck finding a solution. I have tried replacing via tokenization and writing to another file, but without much success.
Initially let's disregard the fact that we have to preserve some keywords. We will fix that later.
The easiest way to perform this kind of 1-to-1 mapping is to use the method str.translate. The string module also contains constants that contain all ASCII lowercase and uppercase characters, and random.shuffle can be used to obtain a random permutation.
import string
import random
random_caps = list(string.ascii_uppercase)
random_lows = list(string.ascii_lowercase)
random.shuffle(random_caps)
random.shuffle(random_lows)
all_random_chars = ''.join(random_lows + random_caps)
translation_table = str.maketrans(string.ascii_letters, all_random_chars)
with open('the-file-i-want.txt', 'r') as f:
    contents = f.read()

translated_contents = contents.translate(translation_table)

with open('the-file-i-want.txt', 'w') as f:
    f.write(translated_contents)
In Python 2, maketrans is a function in the string module instead of a static method of str.
The translation_table is a mapping from characters to characters, so it will map every single ASCII character to an other one. The translate method simply applies this table to each character in the string.
Important note: the above method is actually reversible, because each letter is mapped to a unique other letter. This means that with a simple analysis of the symbol frequencies it's possible to reverse it.
If you want to make this harder or impossible, you could re-create the translation_table for every line:
import string
import random
random_caps = list(string.ascii_uppercase)
random_lows = list(string.ascii_lowercase)
with open('the-file-i-want.txt', 'r') as f:
    translated_lines = []
    for line in f:
        random.shuffle(random_lows)
        random.shuffle(random_caps)
        all_random_chars = ''.join(random_lows + random_caps)
        translation_table = str.maketrans(string.ascii_letters, all_random_chars)
        translated_lines.append(line.translate(translation_table))

with open('the-file-i-want.txt', 'w') as f:
    f.writelines(translated_lines)
Also note that you could translate and save the file line by line:
with open('the-file-i-want.txt', 'r') as f, open('output.txt', 'w') as o:
    for line in f:
        random.shuffle(random_lows)
        random.shuffle(random_caps)
        all_random_chars = ''.join(random_lows + random_caps)
        translation_table = str.maketrans(string.ascii_letters, all_random_chars)
        o.write(line.translate(translation_table))
This means you can translate huge files with this code, as long as the lines themselves are not insanely long.
The code above scrambles all the characters without taking the keywords into account.
The simplest way to handle that requirement is to check each line for occurrences of the keywords and "reinsert" them afterwards:
import re
import string
import random
random_caps = list(string.ascii_uppercase)
random_lows = list(string.ascii_lowercase)
keywords = ['long'] # add all the possible keywords in this list
keyword_regex = re.compile('|'.join(re.escape(word) for word in keywords))
with open('the-file-i-want.txt', 'r') as f, open('output.txt', 'w') as o:
    for line in f:
        random.shuffle(random_lows)
        random.shuffle(random_caps)
        all_random_chars = ''.join(random_lows + random_caps)
        translation_table = str.maketrans(string.ascii_letters, all_random_chars)
        matches = keyword_regex.finditer(line)
        translated_line = list(line.translate(translation_table))
        for match in matches:
            translated_line[match.start():match.end()] = match.group()
        o.write(''.join(translated_line))
Sample usage (using the version that preserves keywords):
$ echo 'I am testing this, check it out! This is a keyword: long
Wow, another line.' > the-file-i-want.txt
$ python3 trans.py
$ cat output.txt
M vy hoahitc hfia, ufoum ih pzh! Hfia ia v modjpel: long
Ltj, fstkwzb hdsz.
Note how long is preserved.

How to check how many times a word appears in a text file

I've seen a few people ask how this would be done, but their questions were 'too broad', so I decided to find out how to do it. I've posted the method below.
So to do this, first you must open the file (assuming you have a text file called 'text.txt'). We do this by calling the open function.
file = open('text.txt', 'r')
The open function uses the syntax: open(file, mode)
The file is the text document, and the mode is how it's opened ('r' means read-only). The read function reads the file, then split separates each of the words into a list object. Lastly, we use the count function to find how many times the word appears.
word = input('word: ')
print(file.read().split().count(word))
And there you have it, counting words in a text file!
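By the way, a with statement closes the file for you when the block ends (a sketch of the same idea):
word = input('word: ')
with open('text.txt', 'r') as file:
    print(file.read().split().count(word))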
Word counts can be tricky. At a minimum, one would like to avoid differences in capitalization and punctuation. A simple way to take the next step in word counts is to use regular expressions and to convert its resulting words to lower case before we do the count. We could even use collections.Counter and count all of the words.
import re
# `word_finder(somestring)` emits all words in string as list
word_finder = re.compile(r'\w+').findall
filename = input('filename: ')
word = input('word: ')
# remove case for compare
lword = word.lower()
# `word_finder` emits all of the words excluding punctuation
# `filter` removes the lower cased words we don't want
# `len` counts the result
count = len(list(filter(lambda w: w.lower() == lword,
                        word_finder(open(filename).read()))))
print(count)
# we could go crazy and count all of the words in the file
# and do it line by line to reduce memory footprint.
import collections
import itertools
from pprint import pprint
word_counts = collections.Counter(itertools.chain.from_iterable(
    word_finder(line.lower()) for line in open(filename)))
pprint(word_counts)
Splitting on whitespace isn't sufficient; split on everything you're not counting, and get your case under control:
import re
import sys
file = open(sys.argv[1])
word = sys.argv[2]
print(re.split(r"[^a-z]+", file.read().casefold()).count(word.casefold()))
You can add apostrophes to the inverted pattern [^a-z'] or whatever else you want to include in your count.
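For instance, to keep contractions such as don't intact, the split line becomes (a sketch of that variant):
print(re.split(r"[^a-z']+", file.read().casefold()).count(word.casefold()))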
Hogan: Colonel, you're asking and answering your own questions. That's tops in German efficiency.
def words_frequency_counter(filename):
    """Print how many times the word appears in the text."""
    try:
        with open(filename) as file_object:
            contents = file_object.read()
    except FileNotFoundError:
        pass
    else:
        word = input("Give me a word: ")
        print("'" + word + "'" + ' appears ' +
              str(contents.lower().count(word.lower())) + ' times.\n')
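One caveat with this approach: str.count matches substrings, so searching for 'the' also counts 'there' and 'other'. Splitting into words first avoids that (a sketch):
print("'" + word + "' appears " +
      str(contents.lower().split().count(word.lower())) + ' times.\n')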
First, you want to open the file. Do this with:
your_file = open('file.txt', 'r')
Next, you want to count the word. Let's set your word as 'brian' under the variable life. No reason.
life = 'brian'
your_file.read().split().count(life)
That reads the file, splits it into individual words, and counts the instances of the word 'brian'. Hope this helps!

Most frequent word in one file which is not found in another file using Python

I am trying to write a program where I count the most frequently used words from one file but those words should not be available in another file. So basically I am reading data from test.txt file and counting the most frequently used word from that file, but that word should not be found in test2.txt file.
Below are sample data files, test.txt and test2.txt
test.txt:
The Project is for testing. doing some testing to find what's going on. the the the.
test2.txt:
a
about
above
across
after
afterwards
again
against
the
Below is my script, which parses files test.txt and test2.txt. It finds the most frequently used words from test.txt, excluding words found in test2.txt.
I thought I was doing everything right, but when I execute my script, it gives "the" as the most frequent word. But actually, the result should be "testing", as "the" is found in test2.txt but "testing" is not found in test2.txt.
from collections import Counter
import re
dgWords = re.findall(r'\w+', open('test.txt').read().lower())
f = open('test2.txt', 'rb')
sWords = [line.strip() for line in f]
print(len(dgWords))
for sWord in sWords:
    print(sWord)
    print(dgWords)
    while sWord in dgWords: dgWords.remove(sWord)
print(len(dgWords))
mostFrequentWord = Counter(dgWords).most_common(1)
print(mostFrequentWord)
Here's how I'd go about it, using sets:
from collections import Counter
import re

all_words = re.findall(r'\w+', open('test.txt').read().lower())
f = open('test2.txt', 'r')
stop_words = [line.strip() for line in f]
set_all = set(all_words)
set_stop = set(stop_words)
all_only = set_all - set_stop
print(Counter(filter(lambda w: w in all_only, all_words)).most_common(1))
This should be slightly faster as well, since you run the Counter only over the words in all_only.
I simply changed the following line of your original code
f = open('test2.txt', 'rb')
to
f = open('test2.txt', 'r')
and it worked. Simply read your text as a string instead of bytes; otherwise the stopwords won't match the words from the regex. Tested on Python 3.4, Eclipse PyDev, Win7 x64.
OFFTOPIC:
It's more pythonic to open files using with statements. In this case, write
with open('test2.txt', 'r') as f:
and indent the file processing statements accordingly. That keeps you from forgetting to close the file stream.
import re
from collections import Counter
with open('test.txt') as testfile, open('test2.txt') as stopfile:
    stopwords = set(line.strip() for line in stopfile)
    words = Counter(re.findall(r'\w+', testfile.read().lower()))
for word in stopwords:
    if word in words:
        words.pop(word)
print("the most frequent word is", words.most_common(1))

Why did my method of writing list items to a .txt file not work?

I've written a short program that takes an input file, removes punctuation, sorts the contents by number of occurrences per word, and then writes the 100 most common results to an output file.
I had some trouble with the last part (writing the results to an output file), and though I've fixed it, I don't know what the problem was.
The full code looks like so:
from collections import Counter
from itertools import chain
import sys
import string
wordList = []
#this file contains text from a number of reviews
file1 = open('reviewfile', 'r+')
reviewWords = file1.read().lower()
#this file contains a list of the 1000 most common English words
file2 = open('commonwordsfile', 'r')
commonWords = file2.read().lower()
#remove punctuation
for char in string.punctuation:
    reviewWords = reviewWords.replace(char, " ")

#create a list of individual words from file1
splitWords = reviewWords.split()
for w in splitWords:
    if w not in commonWords and len(w) > 2:
        wordList.append(w)
#sort the resulting list by length
wordList = sorted(wordList, key=len)
#return a list containing the 100
#most common words and number of occurrences
words_to_count = (word for word in wordList)
c = Counter(words_to_count)
commonHundred = c.most_common(100)
#create new file for results and write
#the 100 most common words to it
fileHandle = open("outcome", 'w')
for listItem in commonHundred:
    fileHandle.write(str(listItem) + "\n")
fileHandle.close()
I previously had the following code snippet attempting to write the 100 most common terms to a .txt file, but it didn't work. Can anyone explain why not?
makeFile = open("outputfile", "w")
for item in CommonHundred:
    makeFile.write("[0]\n".format(item))
makeFile.close()
Those should be curly braces, like:
makeFile.write("{0}\n".format(item))
Run this and see what happens:
a = "[0]".format("test")
print(a)
b = "{0}".format("test")
print(b)
Then go search for "Format String Syntax" here if you'd like to know more: http://docs.python.org/3/library/string.html.
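On Python 3.6 and newer, an f-string does the same job without the positional index (a sketch):
makeFile.write(f"{item}\n")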
