I want to count a list of smileys in a list of files (.txt) in my path /test/.
Here is my approach to counting a single smiley in all files:
def count_string_occurrence():
    import os
    total = 0
    x = 0
    for file in os.listdir("C:/users/M/Desktop/test"):
        if file.endswith(".txt"):
            string = ":)"  # define search term
            f = open(file, encoding="utf8")
            contents = f.read()
            f.close()
            x = contents.count(string)
            total += int(x)  # accumulate occurrences of the smiley over all files
    print("Number of " + string + " in all files equals " + str(total))

count_string_occurrence()
How can I now loop over different smileys and print the result for each smiley separately? Since I already loop through different files, it gets complicated.
About your question: you can keep a dictionary with the count of each string and return that. But if you keep your current structure, it won't be easy to keep track of.
Which leads to my suggestions:
You're keeping the whole file in memory for no apparent reason; you can go through it line by line and check the strings in the current line.
You're also reading the same files multiple times, while you could read them only once and check if the strings are there.
You're checking the extension of the file, which sounds like a job for glob.
You can use a defaultdict so you don't need to care if the count was initially 0 or not.
Modified code:
from collections import defaultdict
import glob

SMILIES = [':)', ':P', '=]']

def count_in_files(string_list):
    results = defaultdict(int)
    for file_name in glob.iglob('*.txt'):
        print(file_name)
        with open(file_name) as input_file:
            for line in input_file:
                for s in string_list:
                    results[s] += line.count(s)  # count every occurrence on the line
    return results

print(count_in_files(SMILIES))
Lastly, with this approach, if you're using Python >= 3.5, you can change the glob call to for file_name in glob.iglob('**/*.txt', recursive=True) so it will search recursively, in case you need it.
This will print something like:
defaultdict(<class 'int'>, {':P': 2, ':)': 1, '=]': 1})
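If you want one line of output per smiley rather than the raw defaultdict shown above, a small follow-up loop (my addition, reusing the function above) could format the result:

results = count_in_files(SMILIES)
for smiley, count in results.items():
    print("Number of {} in all files equals {}".format(smiley, count))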
You could make your search string a function parameter and then call your function multiple times with different search terms.
def count_string_occurrence(string):
    import os
    total = 0
    x = 0
    for file in os.listdir("C:/users/M/Desktop/test"):
        if file.endswith(".txt"):
            f = open(file, encoding="utf8")
            contents = f.read()
            f.close()
            x = contents.count(string)
            total += int(x)  # accumulate occurrences of the smiley over all files
    return total

smilies = [':)', ':P', '=]']
for s in smilies:
    total = count_string_occurrence(s)
    print("Number of {} in all files equals {}".format(s, total))
A different approach would be to pass a list of smilies to your function and then do the iteration inside the if block, perhaps storing the result in a dict of the form { ':)': 5, ':P': 4, ... }, as sketched below.
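A rough sketch of that idea (untested; returning the counts as a dict is my reading of the suggestion, and the folder path is taken from the question):

import os

FOLDER = "C:/users/M/Desktop/test"

def count_string_occurrences(strings):
    """Count each search string across all .txt files in FOLDER."""
    totals = {s: 0 for s in strings}
    for name in os.listdir(FOLDER):
        if name.endswith(".txt"):
            with open(os.path.join(FOLDER, name), encoding="utf8") as f:
                contents = f.read()
            for s in strings:
                totals[s] += contents.count(s)
    return totals

print(count_string_occurrences([':)', ':P', '=]']))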
Related
I have a basic program that can count the number of words in a given text file. I am trying to turn this into a program that can take in several different .txt files, analyze an arbitrary number of keywords within those files, and output the results as a list of dictionaries (or a similar object).
The output I am looking for is a list of dictionaries, one per file in the filenames list, where each dictionary's keys and values are the arbitrary words passed to the first function and their word counts, respectively.
I have two functions that I have created and cannot seem to get any output whatsoever - which means that something is wrong.
Code:
def word_count(filename, *selected_words):
    """Count the approximate number of words in a file."""
    with open(filename, "r", encoding='utf-8') as f_obj:
        contents = f_obj.read()
    filename = {}
    filename['document'] = filename
    filename['total_words'] = len(contents.split())
    for word in selected_words:
        count = contents.lower().count(word)
        filename[word] = count
    return filename

def analysis_output():
    for file in files:
        word_count(file, 'the', 'yes') #WORD_COUNT FUNCTION

files = ['alice.txt', 'siddhartha.txt',
         'moby_dick.txt', 'little_women.txt']

analysis_output()
When I run this, I am not getting any output - and no errors either, so the code presumably runs, just not properly. Any advice on how to turn this into a list of dictionaries would be helpful!
You simply forgot to define a variable to receive the output from word_count. In fact, you can do it this way:
def word_count(filename, *selected_words):
    """Count the approximate number of words in a file."""
    with open(filename, "r", encoding='utf-8') as f_obj:
        contents = f_obj.read()
    results_dict = {}
    results_dict['document'] = filename
    results_dict['total_words'] = len(contents.split())
    for word in selected_words:
        count = contents.lower().count(word)
        results_dict[word] = count
    return results_dict

def analysis_output():
    output = []
    for file in files:
        output.append(word_count(file, 'the', 'yes')) #WORD_COUNT FUNCTION
    return output

files = ['alice.txt', 'siddhartha.txt',
         'moby_dick.txt', 'little_women.txt']

final_result = analysis_output()
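To see the result, you could simply print each dictionary in the returned list, for example:

for file_results in final_result:
    print(file_results)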
My solution below solves your problem in a slightly different way. I am using lists and strings only, no dictionaries. I've added extra comments where needed - I hope you will find it useful.
def get_words_string(file_name):
    """get a lower-case string of all words from a file"""
    try:
        with open(file_name, "r", encoding='utf-8') as file_object:
            contents = file_object.read().lower()
        return contents
    except FileNotFoundError:
        print(f'File not found: {file_name}')

def count_words(words_string, *words_to_count):
    '''counts a number of each *words_to_count in a words_string'''
    for word in words_to_count:
        print(f'{word} occurs {words_string.count(word)} times')

files = [
    'text files/alice.txt',
    'text files/moby_dick.txt',
    'text files/pride_and_pre.txt',
]

for file in files:
    print(file)
    # try block just in case a file is missing
    # so the program can continue
    try:
        count_words(get_words_string(file), 'yes', 'the', 'cat', 'honour')
    except:
        pass
I have a list of files where each file has two columns.
The 1st column contains words, and the 2nd column contains numbers.
I want to extract all the unique words from the files, and sum the numbers in them. This I was able to do...
The second task is to count the number of files in which the words were found. I am having trouble in this part... I am using a dictionary for this.
Here is my code:
import os
from typing import TextIO

currentdir = " "  # CHANGE INPUT PATH
resultdir = " "   # CHANGE OUTPUT ACCORDINGLY

if not os.path.exists(resultdir):
    os.makedirs(resultdir)

systemcallcount = {}
for root, dirs, files in os.walk(currentdir):
    for name in files:
        outfile2 = open(root + "/" + name, 'r')
        for line in outfile2:
            words = line.split(" ")
            if words[0] not in systemcallcount:
                systemcallcount[words[0]] = int(words[1])
            else:
                systemcallcount[words[0]] += int(words[1])
        outfile2.close()

for keys, values in systemcallcount.items():
    print(keys)
    print(values)
For example, I have two files:
file1 file2
a 2 a 3
b 3 b 1
c 1
so the output would be -
a 5 2
b 4 2
c 1 1
To explain the second column of the output: a is 2 because it occurs in both files, whereas c is 1 as it appears only in file1.
I hope this helps
This code takes a string and checks a folder for files that contain it:
# https://www.opentechguides.com/how-to/article/python/59/files-containing-text.html
import os

search_string = "python"
search_path = r"C:\Users\You\Desktop\Project\Files"
extension = ".txt"   # file extension to look for
files_no = 0         # number of files that contain the string

# loop through files in the path specified
for fname in os.listdir(search_path):
    if fname.endswith(extension):
        # Open file for reading
        fo = open(os.path.join(search_path, fname))
        # Read the first line from the file
        line = fo.readline()
        # Initialize counter for line number
        line_no = 1
        found_in_file = False
        # Loop until EOF
        while line != '':
            # Search for string in line
            index = line.find(search_string)
            if index != -1:
                # print the occurrence
                print(fname, "[", line_no, ",", index, "] ", line, sep="")
                found_in_file = True
            # Read next line
            line = fo.readline()
            # Increment line counter
            line_no += 1
        # Close the file
        fo.close()
        # Increment files counter if the string was found in this file
        if found_in_file:
            files_no += 1
One way is to use collections.defaultdict. You can create a set of words and then increment your dictionary counter for each file, for each word.
from collections import defaultdict

d = defaultdict(int)
for root, dirs, files in os.walk(currentdir):
    for name in files:
        with open(root + '/' + name, 'r') as outfile2:
            words = {line.split()[0] for line in outfile2}
        for word in words:
            d[word] += 1
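If it helps to produce both numbers in one pass, here is a rough sketch (untested; the names totals and file_counts are mine, not from the original code) that sums the second column and counts the files at the same time:

import os
from collections import defaultdict

totals = defaultdict(int)       # sum of the numbers for each word
file_counts = defaultdict(int)  # number of files each word appears in

for root, dirs, files in os.walk(currentdir):
    for name in files:
        seen_in_file = set()
        with open(os.path.join(root, name)) as fh:
            for line in fh:
                word, number = line.split()
                totals[word] += int(number)
                seen_in_file.add(word)
        for word in seen_in_file:
            file_counts[word] += 1

for word in totals:
    print(word, totals[word], file_counts[word])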
Another way is to use Pandas to work on both of your tasks:
1. Read the files into a table.
2. Note the source file in a separate column.
3. Apply functions to get unique words, sum the numbers, and count the source files for each word.
Here is the code:
import pandas as pd
import sys, os

files = os.listdir(currentdir)
dfs = []
for f in files:
    df = pd.read_csv(currentdir + "/" + f, sep='\t', header=None)
    df['source_file'] = f
    dfs.append(df)

def concat(x):
    return pd.Series(dict(A = x[0].unique()[0],
                          B = x[1].sum(),
                          C = len(x['source_file'])))

df = pd.concat(dfs, ignore_index=True).groupby(0).apply(concat)

# Print result to standard output
df.to_csv(sys.stdout, sep='\t', header=None, index=None)
You may refer here: Pandas groupby: How to get a union of strings
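One possible tweak (my suggestion, not part of the original answer): if a word can occur more than once within the same file, counting distinct source files with nunique() is safer than taking the group length:

def concat(x):
    return pd.Series(dict(A = x[0].unique()[0],
                          B = x[1].sum(),
                          C = x['source_file'].nunique()))  # distinct files only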
It appears that you want to parse the file into a dictionary of lists, so that for the input you provided:
file1 file2
a 2 a 3
b 3 b 1
c 1
... you get the following data structure after parsing:
{'a': [2, 3], 'b': [3, 1], 'c': [1]}
From that, you can easily get everything you need.
Parsing this way should be rather simple using a defaultdict:
from collections import defaultdict

parsed_data = defaultdict(list)
for filename in list_of_filenames:
    with open(filename) as f:
        for line in f:
            name, number = line.split()
            parsed_data[name].append(int(number))
After that, printing the data you are interested in should be trivial:
for name, values in parsed_data.items():
    print('{} {} {}'.format(name, sum(values), len(values)))
The solution assumes that the same name will not appear twice in the same file. It is not specified what should happen in that case.
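If the same name could repeat within a file and you want those rows merged first, a possible variant (my sketch, not part of the original answer) aggregates per file before appending:

from collections import defaultdict

parsed_data = defaultdict(list)
for filename in list_of_filenames:
    per_file = defaultdict(int)          # sum duplicates within this file
    with open(filename) as f:
        for line in f:
            name, number = line.split()
            per_file[name] += int(number)
    for name, total in per_file.items():
        parsed_data[name].append(total)  # still one entry per file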
TL;DR: The solution for your problems is defaultdict.
I'm working on a python program that prints the words that are in the last file entered from the command line. The words can't be in any of the preceding files. So for example if I input 2 files from the command line and
File 1 contains: "We are awesome" and File 2(the last file entered) contains: "We are really awesome"
My final list should only contain: "really"
Right now my code is set up to only look at the last file entered. How can I look at all of the preceding files and compare them in the context of what I am trying to do? Here is my code:
UPDATE
import re
import sys

def get_words(filename):
    test_file = open(filename).read()
    lower_split = test_file.lower()
    new_split = re.split("[^a-z']+", lower_split)
    really_new_split = sorted(set(new_split))
    return really_new_split

if __name__ == '__main__':
    bag = []
    for filename in sys.argv[1:]:
        bag.append(get_words(filename))

    unique_words = bag[-1].copy()
    for other in bag[:-1]:
        unique_words -= other

    for word in unique_words:
        print(word)
Also:
>>> set([1,2,3])
{1, 2, 3}
There is really not a lot missing. Step 1: put your code in a function so you can reuse it. You are doing the same thing (parsing a text file) several times, so why not put the corresponding code in a reusable unit.
def get_words(filename):
    test_file = open(filename).read()
    lower_split = test_file.lower()
    new_split = re.split("[^a-z']+", lower_split)
    return set(new_split)
Step 2: Set up a loop to call your function. In this particular case we could use a list comprehension but maybe that's too much for a rookie. You'll come to that in good time:
bag = []
for filename in sys.argv[x:]:  # you'll have to experiment what to put
                               # for x; it will be at least one because
                               # the first argument is the name of your
                               # program
    bag.append(get_words(filename))
Now you have all the words conveniently grouped by file. As I said, you can simply take the set difference. So if you want all the words that are only in the very last file:
unique_words = bag[-1].copy()
for other in bag[:-1]:  # loop over all the other files
    unique_words -= other

for word in unique_words:
    print(word)
I didn't test it, so let me know whether it runs.
Consider simplifying by using Set's difference operation, to 'subtract' the sets of words in your files.
import re
s1 = open('file1.txt', 'r').read()
s2 = open('file2.txt', 'r').read()
set(re.findall(r'\w+',s2.lower())) - set(re.findall(r'\w+',s1.lower()))
result:
{'really'}
I have a dictionary with 400,000 items in it, whose keys are DNA names and values are DNA sequences.
I want to divide the dictionary into 40 text files with 10,000 items in each of the files.
Here is my code:
record_dict    # my DNA dictionary
keys_in_dict   # the list of the keys

for keys in keys_in_dict:
    outhandle = open("D:\\Research\\Transcriptome_sequences\\input{0}.fasta".format(?????), "w")
What should I put in place of (?????)? How do I finish this loop?
UPDATE:
Hey fellows,
Thank you for your help. Now I can make multiple files from a dictionary. However, when I tried to make multiple files directly from the original file instead of making a dictionary first, I had problems. The code only generates one file with the first item in it. What did I do wrong? Here is my code:
from Bio import SeqIO

handle = open("D:/Research/Transcriptome_sequences/differentially_expressed_genes.fasta", "rU")
filesize = 100  # number of entries per file
filenum = 0
itemcount = 0

for record in SeqIO.parse(handle, "fasta"):
    if not itemcount % filesize:
        outhandle = open("D:/Research/Transcriptome_sequences/input{0}.fasta".format(filenum), "w")
        SeqIO.write(record, outhandle, "fasta")
        filenum += 1
        itemcount += 1

outhandle.close()
n = 10000
sections = (record_dict.items()[i:i+n] for i in xrange(0, len(record_dict), n))
for ind, sec in enumerate(sections):
    with open("D:/Research/Transcriptome_sequences/input{0}.fasta".format(ind), "w") as f1:
        for k, v in sec:
            f1.write("{} {}\n".format(k, v))
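Note this snippet is written for Python 2, where dict.items() returns a sliceable list and xrange exists. A minimal Python 3 adaptation of the same idea (my sketch, not part of the original answer) might look like:

n = 10000
items = list(record_dict.items())
sections = (items[i:i+n] for i in range(0, len(items), n))
for ind, sec in enumerate(sections):
    with open("D:/Research/Transcriptome_sequences/input{0}.fasta".format(ind), "w") as f1:
        for k, v in sec:
            f1.write("{} {}\n".format(k, v))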
It will not be the fastest solution, but I think the most straightforward way is to keep a running count of items and open a new file every 10,000 iterations through the loop.
I assume you are writing out fasta or something.
Otherwise, you could slice the list [:10000] beforehand and generate a chunk of output to write all at once with one command (which would be much faster). Even as it is, you might want to build up the string by concatenating through the loop and then writing that one monstrous string out with a single .write command for each file.
itemcount = 0
filesize = 10000
filenum = 0
filehandle = ""

for keys in keys_in_dict:
    # check if it is time to open a new file,
    # whenever itemcount/filesize has no remainder
    if not itemcount % filesize:
        if filehandle:
            filehandle.close()
        filenum += 1
        PathToFile = "D:/Research/Transcriptome_sequences/input{0}.fasta".format(filenum)
        filehandle = open(PathToFile, 'w')
    filehandle.write(">{0}\n{1}\n".format(keys, record_dict[keys]))
    itemcount += 1

filehandle.close()
EDIT: Here is a more efficient way to do it (time-wise, not memory-wise), only writing once per file (40x total) instead of with each line (400,000 times). As always, check your output, especially making sure that the first and last sequences are included in the output and the last file is written properly.
filesize = 10  # number of entries per file
filenum = 0
filehandle = ""
OutString = ""
print record_dict  # debug: show the input dictionary

for itemcount, keys in enumerate(keys_in_dict):
    # check if it is time to open a new file,
    # whenever itemcount/filesize has no remainder
    if not itemcount % filesize:
        if filehandle:
            filehandle.write(OutString)
            filehandle.close()
            OutString = ""
        filenum += 1
        PathToFile = "D:/Research/Transcriptome_sequences/input{0}.fasta".format(filenum)
        filehandle = open(PathToFile, 'w')
    # accumulate this entry after the file check so the first entry is not dropped
    OutString += ">{0}\n{1}\n".format(keys, record_dict[keys])

# write whatever is left over to the last file
filehandle.write(OutString)
filehandle.close()
Making use of the built-in module/function, itertools.tee, could solve this elegantly.
import itertools
for (idx, keys2) in enumerate(itertools.tee(keys_in_dict, 40)):
    with open('filename_prefix_%02d.fasta' % idx, 'w') as fout:
        for key in keys2:
            fout.write(...)
Quoted from the doc for your reference:

itertools.tee(iterable[, n=2]) - Return n independent iterators from a single iterable.

Once tee() has made a split, the original iterable should not be used anywhere else; otherwise, the iterable could get advanced without the tee objects being informed.

This itertool may require significant auxiliary storage (depending on how much temporary data needs to be stored). In general, if one iterator uses most or all of the data before another iterator starts, it is faster to use list() instead of tee().
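If memory is a concern, a different chunking sketch (my suggestion, using itertools.islice rather than tee, so each output file only ever sees its own slice of the data):

import itertools

def chunks(iterable, size):
    """Yield successive lists of at most `size` items from `iterable`."""
    it = iter(iterable)
    while True:
        chunk = list(itertools.islice(it, size))
        if not chunk:
            break
        yield chunk

for idx, chunk in enumerate(chunks(record_dict.items(), 10000)):
    with open('filename_prefix_%02d.fasta' % idx, 'w') as fout:
        for key, seq in chunk:
            fout.write(">{0}\n{1}\n".format(key, seq))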
I am supposed to count the frequency of all the keys of dictionary "d" across all the files in the document "individual-articles".
Here, the document "individual-articles" has around 20,000 txt files, with filenames 1, 2, 3, 4, ...
For example: suppose d[Britain]=[5,76,289]; then I must find the number of times Britain appears in the files 5.txt, 76.txt and 289.txt belonging to the document "individual-articles", and I also need to find its frequency across all the files in the same document. I need to store these values in another dictionary d2.
For the same example,
d2 must contain (Britain,26,1200), where 26 is the frequency of the word Britain in the files 5.txt, 76.txt and 289.txt, and 1200 is the frequency of the word Britain in all the files.
I am a Python newbie and have not gotten very far on my own - please help!
import collections
import sys
import os
import re
from collections import Counter
from glob import glob

sys.stdout = open('dictionary.txt', 'w')

def removegarbage(text):
    text = re.sub(r'\W+', ' ', text)
    text = text.lower()
    sorted(text)
    return text

folderpath = 'd:/individual-articles'
counter = Counter()
filepaths = glob(os.path.join(folderpath, '*.txt'))

d2 = {}
with open('topics.txt') as f:
    d = collections.defaultdict(list)
    for line in f:
        value, *keys = line.strip().split('~')
        for key in filter(None, keys):
            d[key].append(value)

for filepath in filepaths:
    with open(filepath, 'r') as filehandle:
        lines = filehandle.read()
    words = removegarbage(lines).split()
    for k in d.keys():
        d2[k] = words.count(k)

for i in d2.items():
    print(i)
Well, I'm not exactly sure what you mean by all the files in the document "X", but I assume it's analogous to pages in a book. With this interpretation, I would do my best to store the data in the easiest way. Putting the data in an easily manipulable form adds efficiency later, because you can always just add a method for producing any type of output you want.
Since it seems the main key you're looking at is keyword, I would create a nested python dictionary with this structure
{keyword: {file: count}}
Once it's in this form, you can do any type of manipulation on the data really easily.
To create this dict,
import os

# returns the next word in the file
def words_generator(fileobj):
    for line in fileobj:
        for word in line.split():
            yield word

word_count_dict = {}
for dirpath, dnames, fnames in os.walk("./"):
    for file in fnames:
        # join the directory path so files in subfolders open correctly
        with open(os.path.join(dirpath, file), "r") as f:
            words = words_generator(f)
            for word in words:
                if word not in word_count_dict:
                    word_count_dict[word] = {"total": 0}
                if file not in word_count_dict[word]:
                    word_count_dict[word][file] = 0
                word_count_dict[word][file] += 1
                word_count_dict[word]["total"] += 1
This will create an easily parsable dictionary.
Want the number of total words Britain?
word_count_dict["Britain"]["total"]
Want the number of times Britain is in files 74.txt and 75.txt?
sum([word_count_dict["Britain"][file] if file in word_count_dict["Britain"] else 0 for file in ["74.txt", "75.txt"]])
Want to see all files that the word Britain shows up in?
[file for file in word_count_dict["Britain"] if file != "total"]
You can of course write functions that perform these operations with a simple call.
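For example, small helpers along these lines (the function names are mine, built on the word_count_dict structure above) could wrap those lookups:

def total_count(word):
    """Total occurrences of word across all files."""
    return word_count_dict.get(word, {}).get("total", 0)

def count_in_files(word, files):
    """Occurrences of word summed over the given files only."""
    counts = word_count_dict.get(word, {})
    return sum(counts.get(f, 0) for f in files)

def files_containing(word):
    """All files in which word appears at least once."""
    return [f for f in word_count_dict.get(word, {}) if f != "total"]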