I have some code that goes through text files in a folder, looks for matches of specific words, and counts them. For example, in file 1.txt the word 'one' is mentioned two times, so my output should be:
1.txt | 2
print >> out, paper + "|" + str(hit_count)
This does not print anything for me. Maybe str(hit_count) is not the right variable to print?
Any advice? Thanks.
for word in text:
    if re.match("(.*)(one|two)(.*)", word):
        hit_count = hit_count + 1
print >> out, paper + "|" + str(hit_count)
If I understand what you are trying to do, you don't really need a regex.
import glob

# glob the directory to get a list of files - you didn't specify
# the path, so this pattern is a placeholder
file_list = glob.glob('*.txt')

for fname in file_list:
    with open(fname, 'r') as f:
        # if files are very long, consider reading line by line:
        # for line in f:
        file_content = f.read()
    count = file_content.count('one')
    print '{0} | {1}'.format(fname, count)
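If the files really are long, the line-by-line variant hinted at in the comment could look like this sketch (the sample lines are made up). One caveat: str.count matches substrings, so 'one' is also found inside 'bone'; a regex with word boundaries avoids that when whole words are wanted:

```python
import re

# made-up lines standing in for a file read line by line
lines = ['one fish two fish\n', 'no bones about it\n', 'one more\n']

count = 0
for line in lines:
    # \b word boundaries: 'bones' no longer counts as a hit for 'one'
    count += len(re.findall(r'\bone\b', line))
print(count)  # 2
```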
for root, dirs, files in os.walk(path):
    for file in files:
        print(os.path.join(d, file))

for i in xrange(0, len(files)):
    for files[i] in files:
        corpus = open(os.path.join(d, files[i]), 'rb')
        corpus = corpus.read()
        # corpus = [line.lstrip() for line in corpus.split("\n")]
        lne = []
        # print(lne)
        for line in corpus.split("\n"):
            line = re.sub(' +', ' ', line)
            line = line.upper()
            lne.append(line.lstrip())
I tried line2 = next(iter(line)), but it does not produce the result I want. Since I have split the text corpus into lines, I would expect something like next(iter(line)) to work. What I want is to get the line the loop is currently on, but also the line after it.
I start with just two files:
one.text
this + that
then now
and two.text
science poetry
pigs + cows
... both in the folder "C:\scratch\sample".
The main thing I'd like to mention is a relatively new way of processing the contents of files and folders in Python: the module pathlib, which is documented in Chapter 11 of the Library Reference. It usually makes life easier.
>>> from pathlib import Path
>>> for file_name in Path('c:/scratch/sample').glob('*'):
...     with open(str(file_name)) as f:
...         result_line = []
...         for line in f.readlines():
...             result_line.append(' '.join(line.replace('+', ' ').split()).upper())
...         print(' '.join(result_line))
...
THIS THAT THEN NOW
SCIENCE POETRY PIGS COWS
I understood you to mean that you want to squeeze the ' + ' separators down to a single blank and to turn entire lines into uppercase.
I also want to mention that: (a) it's best to avoid names like file that might (or might not) be special words in the Python language, because using them can make debugging difficult; (b) it's a good idea to use with when you open a file, because the system then arranges to close the file when you leave the scope of the with; and (c) the one nuisance I find with pathlib is that one must use something like str on a result (in this case file_name) to turn it into a file name that open can use.
I hope this is useful information.
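As a footnote on point (c): on recent Python versions (3.6 and later), open accepts Path objects directly, and Path.read_text removes the explicit open/read/close entirely. A small sketch, using a throwaway temporary folder so it runs anywhere:

```python
import tempfile
from pathlib import Path

# a throwaway folder and file, so the sketch runs anywhere
folder = Path(tempfile.mkdtemp())
(folder / 'one.txt').write_text('this + that\nthen now\n')

for file_name in folder.glob('*.txt'):
    text = file_name.read_text()  # no str() or explicit open()/close() needed
    print(file_name.name, '|', len(text.splitlines()), 'lines')
```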
Use an index to access the list.
for root, dirs, files in os.walk(path):
    for fname in files:
        print(os.path.join(root, fname))
        corpus = open(os.path.join(root, fname), 'rb')
        corpus = corpus.read()
        lne = []
        lines = corpus.split("\n")
        for i in xrange(0, len(lines) - 1):
            line = re.sub(' +', ' ', lines[i])
            line = line.upper()
            lne.append(line.lstrip())
            line2 = lines[i + 1]
Here i runs from 0 to (number of lines - 2), so inside the loop you can access:
line = lines[i]
line2 = lines[i + 1]
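As an aside, index arithmetic is not the only option: zip can pair each line with the one that follows it. A small sketch with made-up data:

```python
lines = ['first', 'second', 'third']

# zip pairs each line with the line after it; the last line has no successor
pairs = list(zip(lines, lines[1:]))
print(pairs)  # [('first', 'second'), ('second', 'third')]
```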
I have 2 txt files (file_a and file_b).
file_a.txt contains a long list of 4-letter combinations (one combination per line):
aaaa
bcsg
aacd
gdee
aadw
hwer
etc.
file_b.txt contains a list of letter combinations of various length (some with spaces):
aaaibjkes
aaleoslk
abaaaalkjel
bcsgiweyoieotpwe
csseiolskj
gaelsi asdas
aaaloiersaaageehikjaaa
hwesdaaadf wiibhuehu
bcspwiopiejowih
gdeaes
aaailoiuwegoiglkjaaake
etc.
I am looking for a python script that would allow me to do the following:
read file_a.txt line by line
take each 4-letter combination (e.g. aaai)
read file_b.txt and find all the various-length letter combinations starting with the 4-letter combination (e.g. aaaibjkes, aaailoiersaaageehikjaaa, aaailoiuwegoiglkjaaake, etc.)
print the results of each search in a separate txt file named with the 4-letter combination.
File aaai.txt:
aaaibjkes
aaailoiersaaageehikjaaa
aaailoiuwegoiglkjaaake
etc.
File bcsi.txt:
bcspwiopiejowih
bcsiweyoieotpwe
etc.
I'm sorry, I'm a newbie. Can someone point me in the right direction, please? So far I've got only:
#I presume I will have to use regex at some point
import re
file1 = open('file_a.txt', 'r').readlines()
file2 = open('file_b.txt', 'r').readlines()
#Should I look into findall()?
I hope this helps:
file2 = open('file_b.txt', 'r')
# get every item in your second file into a list
mylist = file2.readlines()
file2.close()

file1 = open('file_a.txt', 'r')
# read each line in the first file
for line in file1:
    searchStr = line.strip()
    # find the combinations in your second file that start with this line
    exists = [s for s in mylist if s.startswith(searchStr)]
    if exists:
        # if there are matches, create a file for them
        fileNew = open(searchStr + '.txt', 'w')
        for match in exists:
            fileNew.write(match)
        fileNew.close()
file1.close()
What you can do is open both files and run through both of them line by line with for loops.
You can use two for loops: the first reads file_a.txt, since you will go through it only once; the second reads file_b.txt and looks for the string at the start of each line.
To do the search you can use .find(). Since the match must be at the start, the returned value should be 0.
file_a = open("file_a.txt", "r")
file_b = open("file_b.txt", "r")

for a_line in file_a:
    # This result value will be written into your new file
    result = ""
    # This is what we will search with
    search_val = a_line.strip("\n")
    print "---- Using " + search_val + " from file_a to search. ----"
    for b_line in file_b:
        print "Searching file_b using " + b_line.strip("\n")
        if b_line.strip("\n").find(search_val) == 0:
            result += b_line
    print "---- Search ended ----"
    # Set the read pointer to the start of the file again
    file_b.seek(0, 0)
    if result:
        # Write the contents of "result" into a file named after "search_val"
        with open(search_val + ".txt", "a") as f:
            f.write(result)

file_a.close()
file_b.close()
Test Cases:
I am using the test cases in your question:
file_a.txt
aaaa
bcsg
aacd
gdee
aadw
hwer
file_b.txt
aaaibjkes
aaleoslk
abaaaalkjel
bcsgiweyoieotpwe
csseiolskj
gaelsi asdas
aaaloiersaaageehikjaaa
hwesdaaadf wiibhuehu
bcspwiopiejowih
gdeaes
aaailoiuwegoiglkjaaake
The program produces an output file bcsg.txt as it is supposed to with bcsgiweyoieotpwe inside.
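A small refinement, in case rereading file_b for every search string feels wasteful: read it into a list once and scan the list, which makes the seek(0, 0) reset unnecessary. A sketch with inline data standing in for the two files:

```python
a_lines = ['aaaa', 'bcsg']
b_lines = ['aaaibjkes', 'bcsgiweyoieotpwe', 'csseiolskj']

results = {}
for search_val in a_lines:
    # the in-memory list can be scanned as many times as needed
    matched = [b for b in b_lines if b.startswith(search_val)]
    if matched:
        results[search_val] = matched
print(results)  # {'bcsg': ['bcsgiweyoieotpwe']}
```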
Try this:
f1 = open("a.txt", "r").readlines()
f2 = open("b.txt", "r").readlines()

file1 = [word.replace("\n", "") for word in f1]
file2 = [word.replace("\n", "") for word in f2]

data = []
data_dict = {}

for short_word in file1:
    data += [[short_word, w] for w in file2 if w.startswith(short_word)]

for single_data in data:
    if single_data[0] in data_dict:
        data_dict[single_data[0]].append(single_data[1])
    else:
        data_dict[single_data[0]] = [single_data[1]]

for key, val in data_dict.iteritems():
    open(key + ".txt", "w").writelines("\n".join(val))
    print(key + ".txt created")
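The grouping step can also be done with collections.defaultdict, which removes the if/else. A sketch using the same pair structure (the data here is made up):

```python
from collections import defaultdict

data = [['aaaa', 'aaaabc'], ['bcsg', 'bcsgde'], ['aaaa', 'aaaaxy']]

data_dict = defaultdict(list)
for short_word, long_word in data:
    # missing keys start out as empty lists automatically
    data_dict[short_word].append(long_word)

print(dict(data_dict))  # {'aaaa': ['aaaabc', 'aaaaxy'], 'bcsg': ['bcsgde']}
```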
I'm really new to programming and couldn't find a satisfying answer so far. I'm using Python and I want to merge three text files to get all possible word combinations. I have 3 files:
First file:
line1
line2
line3
Second file(prefix):
pretext1
pretext2
pretext3
Third file(suffix):
suftext1
suftext2
suftext3
I already used .read() and have variables containing the list for each text file. Now I want to write a function that merges these 3 files into 1, and the output should look like this:
outputfile:
pretext1 line1 suftext1 #this is ONE line(str)
pretext2 line1 suftext1
pretext3 line1 suftext1
pretext1 line1 suftext2
pretext1 line1 suftext3
and so on, you get the idea
I want all possible combinations in one text file as output. I guess I have to use a loop within a loop?!
Here it is, if I got your question right.
First you have to move into the folder containing the files, with the os package.
import os
os.chdir("The_path_of_the_folder_containing_the_files")
Then you open your three files and put the words into lists:
file_1 = open("file_1.txt")
file_1 = file_1.read()
file_1 = file_1.split("\n")
file_2 = open("file_2.txt")
file_2 = file_2.read()
file_2 = file_2.split("\n")
file_3 = open("file_3.txt")
file_3 = file_3.read()
file_3 = file_3.split("\n")
You create the text you want in your output file with nested loops:
text_output = ""
for i in range(len(file_2)):
    for j in range(len(file_1)):
        for k in range(len(file_3)):
            text_output += file_2[i] + " " + file_1[j] + " " + file_3[k] + "\n"
And you enter that text into your output file (if that file does not exist, it will be created).
file_output = open("file_output.txt","w")
file_output.write(text_output)
file_output.close()
While the existing answer may be correct, I think this is a case where bringing in a library function is definitely the way to go.
import itertools

with open('lines.txt') as line_file, open('pretext.txt') as prefix_file, open('suftext.txt') as suffix_file:
    lines = [l.strip() for l in line_file.readlines()]
    prefixes = [p.strip() for p in prefix_file.readlines()]
    suffixes = [s.strip() for s in suffix_file.readlines()]

combos = ['%s %s %s' % (x[1], x[0], x[2])
          for x in itertools.product(lines, prefixes, suffixes)]

for c in combos:
    print c
So this is a simple piece of code trying to find the frequency of occurrences of a phrase ("every kind of asset") in a number of files.
import codecs
import glob
import os.path

filelocation = "C:\\Users\\Shoi\\Desktop\\mark project\\BITs\\*.txt"

for filepath in glob.glob(filelocation):  # for each file
    FILE = codecs.open(filepath, 'r', encoding="utf-8")
    if "every kind of asset" in FILE.read().lower():
        print("Found in " + os.path.basename(filepath))
        freq = FILE.read().lower().count("every kind of asset")
        print(freq)
    else:
        print("not found in " + os.path.basename(filepath))
However, even though the phrase is found in some files ("Found in" plus the file name is printed), the count call always returns and prints 0.
This code searches for only a single phrase. When I iterate over a list of phrases, searching for each phrase in all the files, the count call returns perfectly correct frequencies for some phrases but 0 for others, even though the phrase exists in the file and "found" is printed.
Please help.
You've got two calls to FILE.read(). After the first one, the cursor will be at the end of the file, so the second call will return an empty string, which does not contain the string you're looking for at all.
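A quick way to see this behaviour (a throwaway file, made up for the demonstration):

```python
import os
import tempfile

# write a small throwaway file
path = os.path.join(tempfile.mkdtemp(), 'demo.txt')
with open(path, 'w') as f:
    f.write('every kind of asset\n')

f = open(path)
first = f.read()
second = f.read()   # the cursor is at end-of-file now
f.close()
print(repr(first), repr(second))  # 'every kind of asset\n' ''
```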
Read the contents once and store them in a variable instead:
for filepath in glob.glob(filelocation):  # for each file
    FILE = codecs.open(filepath, 'r', encoding="utf-8")
    contents = FILE.read().lower()
    FILE.close()
    if "every kind of asset" in contents:
        print("Found in " + os.path.basename(filepath))
        freq = contents.count("every kind of asset")
        print(freq)
    else:
        print("not found in " + os.path.basename(filepath))
I'm trying to write a script to pull the word count of many files within a directory. I have it working fairly close to what I want, but there is one part that is throwing me off. The code so far is:
import glob

directory = "/Users/.../.../files/*"
output = "/Users/.../.../output.txt"

filepath = glob.glob(directory)

def wordCount(filepath):
    for file in filepath:
        name = file
        fileO = open(file, 'r')
        for line in fileO:
            sentences = 0
            sentences += line.count('.') + line.count('!') + line.count('?')
            tempwords = line.split()
            words = 0
            words += len(tempwords)
            outputO = open(output, "a")
            outputO.write("Name: " + name + "\n" + "Words: " + str(words) + "\n")

wordCount(filepath)
This writes the word counts to a file named "output.txt" and gives me output that looks like this:
Name: /Users/..../..../files/Bush1989.02.9.txt
Words: 10
Name: /Users/..../..../files/Bush1989.02.9.txt
Words: 0
Name: /Users/..../..../files/Bush1989.02.9.txt
Words: 3
Name: /Users/..../..../files/Bush1989.02.9.txt
Words: 0
Name: /Users/..../..../files/Bush1989.02.9.txt
Words: 4821
And this repeats for each file in the directory. As you can see, it gives me multiple counts for each file. The files are formatted such as:
Address on Administration Goals Before a Joint Session of Congress
February 9, 1989
Mr. Speaker, Mr. President, and distinguished Members of the House and
Senate...
So, it seems that the script is giving me a count of each "part" of the file, such as the 10 words on the first line, 0 on the line break, 3 on the next, 0 on the next, and then the count for the body of the text.
What I'm looking for is a single count for each file. Any help/direction is appreciated.
The last two lines of your inner loop, which print out the filename and word count, should be part of the outer loop, not the inner loop - as it is, they're being run once per line.
You're also resetting the sentence and word counts for each line - these should be in the outer loop, before the start of the inner loop.
Here's what your code should look like after the changes:
import glob

directory = "/Users/.../.../files/*"
output = "/Users/.../.../output.txt"

filepath = glob.glob(directory)

def wordCount(filepath):
    for file in filepath:
        name = file
        fileO = open(file, 'r')
        sentences = 0
        words = 0
        for line in fileO:
            sentences += line.count('.') + line.count('!') + line.count('?')
            tempwords = line.split()
            words += len(tempwords)
        outputO = open(output, "a")
        outputO.write("Name: " + name + "\n" + "Words: " + str(words) + "\n")

wordCount(filepath)
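As a side note, opening the output file once in a with block (instead of reopening it in append mode for every input file) also makes sure all handles get closed. A sketch of the same per-file counting logic, with a made-up throwaway input file in place of the real ones:

```python
import glob
import os
import tempfile

# a throwaway input file so the sketch runs anywhere
folder = tempfile.mkdtemp()
with open(os.path.join(folder, 'a.txt'), 'w') as f:
    f.write('Mr. Speaker, Mr. President\nand distinguished Members\n')

counts = {}
# the output file is opened once, outside the loop
with open(os.path.join(folder, 'wordcounts.out'), 'w') as out:
    for path in glob.glob(os.path.join(folder, '*.txt')):
        with open(path) as src:  # both files close automatically
            words = sum(len(line.split()) for line in src)
        counts[os.path.basename(path)] = words
        out.write('Name: %s\nWords: %d\n' % (path, words))

print(counts)  # {'a.txt': 7}
```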
Isn't your indentation wrong? I mean, the last lines are called once per line, but you really mean once per file, don't you?
(Besides, try to avoid "file" as an identifier; it is a Python type.)