Can't get unique word/phrase counter to work - Python - python

I'm having trouble getting anything to write in my outut file (word_count.txt).
I expect the script to review all 500 phrases in my phrases.txt document, and output a list of all the words and how many times they appear.
from re import findall,sub
from os import listdir
from collections import Counter
# path to folder containg all the files
str_dir_folder = '../data'
# name and location of output file
str_output_file = '../data/word_count.txt'
# the list where all the words will be placed
list_file_data = '../data/phrases.txt'
# loop through all the files in the directory
for str_each_file in listdir(str_dir_folder):
if str_each_file.endswith('data'):
# open file and read
with open(str_dir_folder+str_each_file,'r') as file_r_data:
str_file_data = file_r_data.read()
# add data to list
list_file_data.append(str_file_data)
# clean all the data so that we don't have all the nasty bits in it
str_full_data = ' '.join(list_file_data)
str_clean1 = sub('t','',str_full_data)
str_clean_data = sub('n',' ',str_clean1)
# find all the words and put them into a list
list_all_words = findall('w+',str_clean_data)
# dictionary with all the times a word has been used
dict_word_count = Counter(list_all_words)
# put data in a list, ready for output file
list_output_data = []
for str_each_item in dict_word_count:
str_word = str_each_item
int_freq = dict_word_count[str_each_item]
str_out_line = '"%s",%d' % (str_word,int_freq)
# populates output list
list_output_data.append(str_out_line)
# create output file, write data, close it
file_w_output = open(str_output_file,'w')
file_w_output.write('n'.join(list_output_data))
file_w_output.close()
Any help would be great (especially if I'm able to actually output 'single' words within the output list.
thanks very much.

Would be helpful if we got more information such as what you've tried and what sorts of error messages you received. As kaveh commented above, this code has some major indentation issues. Once I got around those, there were a number of other logic errors to work through. I've made some assumptions:
list_file_data is assigned to '../data/phrases.txt' but there is then a
loop through all file in a directory. Since you don't have any handling for
multiple files elsewhere, I've removed that logic and referenced the
file listed in list_file_data (and added a small bit of error
handling). If you do want to walk through a directory, I'd suggest
using os.walk() (http://www.tutorialspoint.com/python/os_walk.htm)
You named your file 'pharses.txt' but then check for if the files
that endswith 'data'. I've removed this logic.
You've placed the data set into a list when findall works just fine with strings and ignores special characters that you've manually removed. Test here:
https://regex101.com/ to make sure.
Changed 'w+' to '\w+' - check out the above link
Converting to a list outside of the output loop isn't necessary - your dict_word_count is a Counter object which has an 'iteritems' method to roll through each key and value. Also changed the variable name to 'counter_word_count' to be slightly more accurate.
Instead of manually generating csv's, I've imported csv and utilized the writerow method (and quoting options)
Code below, hope this helps:
import csv
import os
from collections import Counter
from re import findall,sub
# name and location of output file
str_output_file = '../data/word_count.txt'
# the list where all the words will be placed
list_file_data = '../data/phrases.txt'
if not os.path.exists(list_file_data):
raise OSError('File {} does not exist.'.format(list_file_data))
with open(list_file_data, 'r') as file_r_data:
str_file_data = file_r_data.read()
# find all the words and put them into a list
list_all_words = findall('\w+',str_file_data)
# dictionary with all the times a word has been used
counter_word_count = Counter(list_all_words)
with open(str_output_file, 'w') as output_file:
fieldnames = ['word', 'freq']
writer = csv.writer(output_file, quoting=csv.QUOTE_ALL)
writer.writerow(fieldnames)
for key, value in counter_word_count.iteritems():
output_row = [key, value]
writer.writerow(output_row)

Something like this?
from collections import Counter
from glob import glob
def extract_words_from_line(s):
# make this as complicated as you want for extracting words from a line
return s.strip().split()
tally = sum(
(Counter(extract_words_from_line(line))
for infile in glob('../data/*.data')
for line in open(infile)),
Counter())
for k in sorted(tally, key=tally.get, reverse=True):
print k, tally[k]

Related

Using python and regex to find stings and put them together to replace the filename of a .pdf- the rename fails when using more than one group

I have several thousand pdfs which I need to re-name based on the content. The layouts of the pdfs are inconsistent. To re-name them I need to locate a specific string "MEMBER". I need the value after the string "MEMBER" and the values from the two lines above MEMBER, which are Time and Date values respectively.
So:
STUFF
STUFF
STUFF
DD/MM/YY
HH:MM:SS
MEMBER ######
STUFF
STUFF
STUFF
I have been using regex101.com and have ((.*(\n|\r|\r\n)){2})(MEMBER.\S+) which matches all of the values I need. But it puts them across four groups with group 3 just showing a carriage return.
What I have so far looks like this:
import fitz
from os import DirEntry, curdir, chdir, getcwd, rename
from glob import glob as glob
import re
failed_pdfs = []
count = 0
pdf_regex = r'((.*(\n|\r|\r\n)){2})(MEMBER.\S+)'
text = ""
get_curr = getcwd()
directory = 'PDF_FILES'
chdir(directory)
pdf_list = glob('*.pdf')
for pdf in pdf_list:
with fitz.open(pdf) as pdf_obj:
for page in pdf_obj:
text += page.get_text()
new_file_name = re.search(pdf_regex, text).group().strip().replace(":","").replace("-","") + '.pdf'
text = ""
#clean_new_file_name = new_file_name.translate({(":"): None})
print(new_file_name)
# Tries to rename a pdf. If the filename doesn't already exist
# then rename. If it does exist then throw an error and add to failed list
try:
rename(pdf, new_file_name )
except WindowsError:
count += 1
failed_pdfs.append(str(count) + ' - FAILED TO RENAME: [' + pdf + " ----> " + str(new_file_name) + "]")
If I specify a group in the re.search portion- Like for instance Group 4 which contains the MEMBER ##### value, then the file renames successfully with just that value. Similarly, Group 2 renames with the TIME value. I think the multiple lines are preventing it from using all of the values I need. When I run it with group(), the print value shows as
DATE
TIME
MEMBER ######.pdf
And the log count reflects the failures.
I am very new at this, and stumbling around trying to piece together how things work. Is the issue with how I built the regex or with the re.search portion? Or everything?
I have tried re-doing the Regular Expression, but I end up with multiple lines in the results, which seems connected to the rename failure.
The strategy is to read the page's text by words and sort them adequately.
If we then find "MEMBER", the word following it represents the hashes ####, and the two preceeding words must be date and time respectively.
found = False
for page in pdf_obj:
words = page.get_text("words", sort=True)
# all space-delimited strings on the page, sorted vertically,
# then horizontally
for i, word in enumerate(words):
if word[4] == "MEMBER":
hashes = words[i+1][4] # this must be the word after MEMBER!
time-string = words[i-1][4] # the time
date_string = words[i-2][4] # the date
found = True
break
if found == True: # no need to look in other pages
break

Extract time values from a list and add to a new list or array

I have a script that reads through a log file that contains hundreds of these logs, and looks for the ones that have a "On, Off, or Switch" type. Then I output each log into its own list. I'm trying to find a way to extract the Out and In times into a separate list/array and then subtract the two times to find the duration of each separate log. This is what the outputted logs look like:
['2020-01-31T12:04:57.976Z 1234 Out: [2020-01-31T00:30:20.150Z] Id: {"Id":"4-f-4-9-6a"', '"Type":"Switch"', '"In":"2020-01-31T00:30:20.140Z"']
This is my current code:
logfile = '/path/to/my/logfile'
with open(logfile, 'r') as f:
text = f.read()
words = ["On", "Off", "Switch"]
text2 = text.split('\n')
for l in text.split('\n'):
if (words[0] in l or words[1] in l or words[2] in l):
log = l.split(',')[0:3]
I'm stuck on how to target only the Out and In time values from the logs and put them in an array and convert to a time value to find duration.
Initial log before script: everything after the "In" time is useless for what I'm looking for so I only have the first three indices outputted
2020-01-31T12:04:57.976Z 1234 Out: [2020-01-31T00:30:20.150Z] Id: {"Id":"4-f-4-9-6a","Type":"Switch,"In":"2020-01-31T00:30:20.140Z","Path":"interface","message":"interface changed status from unknown to normal","severity":"INFORMATIONAL","display":true,"json_map":"{\"severity\":null,\"eventId\":\"65e-64d9-45-ab62-8ef98ac5e60d\",\"componentPath\":\"interface_css\",\"displayToGui\":false,\"originalState\":\"unknown\",\"closed\":false,\"eventType\":\"InterfaceStateChange\",\"time\":\"2019-04-18T07:04:32.747Z\",\"json_map\":null,\"message\":\"interface_css changed status from unknown to normal\",\"newState\":\"normal\",\"info\":\"Event created with current status\"}","closed":false,"info":"Event created with current status","originalState":"unknown","newState":"normal"}
Below is a possible solution. The wordmatch line is a bit of a hack, until I find something clearer: it's just a one-liner that create an empty or 1-element set of True if one of the words matches.
(Untested)
import re
logfile = '/path/to/my/logfile'
words = ["On", "Off", "Switch"]
dateformat = r'\d{4}\-\d{2}\-\d{2}T\d{2}:\d{2}:\d{2}\.\d+[Zz]?'
pattern = fr'Out:\s*\[(?P<out>{dateformat})\].*In":\s*\"(?P<in>{dateformat})\"'
regex = re.compile(pattern)
with open(logfile, 'r') as f:
for line in f:
wordmatch = set(filter(None, (word in s for word in words)))
if wordmatch:
match = regex.search(line)
if match:
intime = match.group('in')
outtime = match.group('out')
# whatever to store these strings, e.g., append to list or insert in a dict.
As noted, your log example is very awkward, so this works for the example line, but may not work for every line. Adjust as necessary.
I have also not included (if so wanted), a conversion to a datetime.datetime object. For that, read through the datetime module documentation, in particular datetime.strptime. (Alternatively, you may want to store your results in a Pandas table. In that case, read through the Pandas documentation on how to convert strings to actual datetime objects.)
You also don't need to read nad split on newlines yourself: for line in f will do that for you (provided f is indeed a filehandle).
Regex is probably the way to go (fastness, efficiency etc.) ... but ...
You could take a very simplistic (if very inefficient) approach of cleaning your data:
join all of it into a string
replace things that hinder easy parsing
split wisely and filter the split
like so:
data = ['2020-01-31T12:04:57.976Z 1234 Out: [2020-01-31T00:30:20.150Z] Id: {"Id":"4-f-4-9-6a"', '"Type":"Switch"', '"In":"2020-01-31T00:30:20.140Z"']
all_text = " ".join(data)
# this is inefficient and will create throwaway intermediate strings - if you are
# in a hurry or operate on 100s of MB of data, this is NOT the way to go, unless
# you have time
# iterate pairs of ("bad thing", "what to replace it with") (or list of bad things)
for thing in [ (": ",":"), (list('[]{}"'),"") ]:
whatt = thing[0]
withh = thing[1]
# if list, do so for each bad thing
if isinstance(whatt, list):
for p in whatt:
# replace it
all_text = all_text.replace(p,withh)
else:
all_text = all_text.replace(whatt,withh)
# format is now far better suited to splitting/filtering
cleaned = [a for a in all_text.split(" ")
if any(a.startswith(prefix) or "Switch" in a
for prefix in {"In:","Switch:","Out:"})]
print(cleaned)
Outputs:
['Out:2020-01-31T00:30:20.150Z', 'Type:Switch', 'In:2020-01-31T00:30:20.140Z']
After cleaning your data would look like:
2020-01-31T12:04:57.976Z 1234 Out:2020-01-31T00:30:20.150Z Id:Id:4-f-4-9-6a Type:Switch In:2020-01-31T00:30:20.140Z
You can transform the clean list into a dictionary for ease of lookup:
d = dict( part.split(":",1) for part in cleaned)
print(d)
will produce:
{'In': '2020-01-31T00:30:20.140Z',
'Type': 'Switch',
'Out': '2020-01-31T00:30:20.150Z'}
You can use datetime module to parse the times from your values as shown in 0 0 post.

Can you spot the problem with this REGEX statement?

Im running .txt files through a for loop which should slice out keywords and .append them into lists. For some reason my REGEX statements are returning really odd results.
My first statement which iterates through the full filenames and slices out the keyword works well.
# Creates a workflow list of file names within target directory for further iteration
stack = os.listdir(
"/Users/me/Documents/software_development/my_python_code/random/countries"
)
# declares list, to be filled, and their associated regular expression, to be used,
# in the primary loop
names = []
name_pattern = r"-\s(.*)\.txt"
# PRIMARY LOOP
for entry in stack:
if entry == ".DS_Store":
continue
# extraction of country name from file name into `names` list
name_match = re.search(name_pattern, entry)
name = name_match.group(1)
names.append(name)
This works fine and creates the list that I expect
However, once I move on to a similar process with the actual contents of files, it no longer works.
religions = []
reli_pattern = r"religion\s=\s(.+)."
# PRIMARY LOOP
for entry in stack:
if entry == ".DS_Store":
continue
# opens and reads file within `contents` variable
file_path = (
"/Users/me/Documents/software_development/my_python_code/random/countries" + "/" + entry
)
selection = open(file_path, "rb")
contents = str(selection.read())
# extraction of religion type and placement into `religions` list
reli_match = re.search(reli_pattern, contents)
religion = reli_match.group(1)
religions.append(religion)
The results should be something like: "therevada", "catholic", "sunni" etc.
Instead i'm getting seemingly random pieces of text from the document which have nothing to do with my REGEX like ruler names and stat values that do not contain the word "religion"
To try and figure this out I isolated some of the code in the following way:
contents = "religion = catholic"
reli_pattern = r"religion\s=\s(.*)\s"
reli_match = re.search(reli_pattern, contents)
print(reli_match)
And None is printed to the console so I am assuming the problem is with my REGEX. What silly mistake am I making which is causing this?
Your regular expression (religion\s=\s(.*)\s) requires that there be a trailing whitespace (the last \s there). Since your string doesn't have one, it doesn't find anything when searching thus re.search returns None.
You should either:
Change your regex to be r"religion\s=\s(.*)" or
Change the string you're searching to have a trailing whitespace (i.e 'religion = catholic' to 'religion = catholic ')

How to make searching a string in text files quicker

I want to search a list of strings (having from 2k upto 10k strings in the list) in thousands of text files (there may be as many as 100k text files each having size ranging from 1 KB to 100 MB) saved in a folder and output a csv file for the matched text filenames.
I have developed a code that does the required job but it takes around 8-9 hours for 2000 strings to search in around 2000 text files having size of ~2.5 GB in total.
Also, by using this method, system's memory is consumed and so sometimes need to split the 2000 text files into smaller batches for the code to run.
The code is as below(Python 2.7).
# -*- coding: utf-8 -*-
import pandas as pd
import os
def match(searchterm):
global result
filenameText = ''
matchrateText = ''
for i, content in enumerate(TextContent):
matchrate = search(searchterm, content)
if matchrate:
filenameText += str(listoftxtfiles[i])+";"
matchrateText += str(matchrate) + ";"
result.append([searchterm, filenameText, matchrateText])
def search(searchterm, content):
if searchterm.lower() in content.lower():
return 100
else:
return 0
listoftxtfiles = os.listdir("Txt/")
TextContent = []
for txt in listoftxtfiles:
with open("Txt/"+txt, 'r') as txtfile:
TextContent.append(txtfile.read())
result = []
for i, searchterm in enumerate(searchlist):
print("Checking for " + str(i + 1) + " of " + str(len(searchlist)))
match(searchterm)
df=pd.DataFrame(result,columns=["String","Filename", "Hit%"])
Sample Input below.
List of strings -
["Blue Chip", "JP Morgan Global Healthcare","Maximum Horizon","1838 Large Cornerstone"]
Text file -
Usual text file containing different lines separated by \n
Sample Output below.
String,Filename,Hit%
JP Morgan Global Healthcare,000032.txt;000031.txt;000029.txt;000015.txt;,100;100;100;100;
Blue Chip,000116.txt;000126.txt;000114.txt;,100;100;100;
1838 Large Cornerstone,NA,NA
Maximum Horizon,000116.txt;000126.txt;000114.txt;,100;100;100;
As in the example above, first string was matched in 4 files(seperated by ;), second string was matched in 3 files and third string was not matched in any of the files.
Is there a quicker way to search without any splitting of text files?
Your code does a lot of pushing large amounts of data around in memory because you load all files in memory and then search them.
Performance aside, your code could use some cleaning up. Try to write functions as autonomous as possible, without depending on global variables (for input or output).
I rewrote your code using list comprehensions and it became a lot more compact.
# -*- coding: utf-8 -*-
from os import listdir
from os.path import isfile
def search_strings_in_files(path_str, search_list):
""" Returns a list of lists, where each inner list contans three fields:
the filename (without path), a string in search_list and the
frequency (number of occurences) of that string in that file"""
filelist = listdir(path_str)
return [[filename, s, open(path_str+filename, 'r').read().lower().count(s)]
for filename in filelist
if isfile(path_str+filename)
for s in [sl.lower() for sl in search_list] ]
if __name__ == '__main__':
print search_strings_in_files('/some/path/', ['some', 'strings', 'here'])
Mechanism's that I use in this code:
list comprehension to loop thought search_lists and though the files.
compound statements to loop only through the files in a directory (and not through sub directories).
method chaining to directly call a method of an object that is returned.
Tip for reading the list comprehension: try reading it form bottom to top, so:
I convert all items in search_list to lower using list comprehension.
Then I loop over that list (for s in...)
Then I filter out the directory entries that are not files using a compound statement (if isfile...)
Then I loop over all files (for filename...)
In the top line, I create the sublist containing three items:
filename
s, that is the lower case search string
a method chained call to open the file, read all its contents, convert it to lowercase and count the number of occurrences of s.
This code uses all the power there is in "standard" Python functions. If you need more performance, you should look into specialised libraries for this task.

Python: Group a list of file names according to common name identifier

In a directory I have some files:
temperature_Resu05_les_spec_r0.0300.0
temperature_Resu05_les_spec_r0.0350.0
temperature_Resu05_les_spec_r0.0400.0
temperature_Resu05_les_spec_r0.0450.0
temperature_Resu06_les_spec_r0.0300.0
temperature_Resu06_les_spec_r0.0350.0
temperature_Resu06_les_spec_r0.0400.0
temperature_Resu06_les_spec_r0.0450.0
temperature_Resu07_les_spec_r0.0300.0
temperature_Resu07_les_spec_r0.0350.0
temperature_Resu07_les_spec_r0.0400.0
temperature_Resu07_les_spec_r0.0450.0
temperature_Resu08_les_spec_r0.0300.0
temperature_Resu08_les_spec_r0.0350.0
temperature_Resu08_les_spec_r0.0400.0
temperature_Resu08_les_spec_r0.0450.0
temperature_Resu09_les_spec_r0.0300.0
temperature_Resu09_les_spec_r0.0350.0
temperature_Resu09_les_spec_r0.0400.0
temperature_Resu09_les_spec_r0.0450.0
I need a list of all the files that have the same identifier XXXX as in _rXXXX. For example one such list would be composed of
temperature_Resu05_les_spec_r0.0300.0
temperature_Resu06_les_spec_r0.0300.0
temperature_Resu07_les_spec_r0.0300.0
temperature_Resu08_les_spec_r0.0300.0
temperature_Resu09_les_spec_r0.0300.0
I don't know a priori what the XXXX values are going to be so I can't iterate through them and match like that. Im thinking this might best be handles with a regular expression. Any ideas?
Yes, regular expressions are a fun way to do it! It could look something like this:
results = {}
for fname in fnames:
id = re.search('.*_r(.*)', fname).group(1) # grabs whatever is after the final "_r" as an identifier
if id in results:
results[id] += fname
else:
results[id] = [fname]
The results will be stored in a dictionary, results, indexed by the id.
I should add that this will work as long as all file names reliably have the _rXXXX structure. If there's any chance that a file name will not match that pattern, you will have to check for it and act accordingly.
No a regex is not the best way, you pattern is very straight forward, just str.rsplit on the _r and use the right element of the split as the key to group the data with. A defaultdict will do the grouping efficiently:
from collections import defaultdict
with open("yourfile") as f:
groups = defaultdict(list)
for line in f:
groups[line.rsplit("_r",1)[1]].append(line.rstrip())
from pprint import pprint as pp
pp(groups.values())
Which for your sample will give you:
[['temperature_Resu09_les_spec_r0.0450.0'],
['temperature_Resu05_les_spec_r0.0300.0',
'temperature_Resu06_les_spec_r0.0300.0',
'temperature_Resu07_les_spec_r0.0300.0',
'temperature_Resu08_les_spec_r0.0300.0',
'temperature_Resu09_les_spec_r0.0300.0'],
['temperature_Resu05_les_spec_r0.0400.0',
'temperature_Resu06_les_spec_r0.0400.0',
'temperature_Resu07_les_spec_r0.0400.0',
'temperature_Resu08_les_spec_r0.0400.0',
'temperature_Resu09_les_spec_r0.0400.0'],
['temperature_Resu05_les_spec_r0.0450.0',
'temperature_Resu06_les_spec_r0.0450.0',
'temperature_Resu07_les_spec_r0.0450.0',
'temperature_Resu08_les_spec_r0.0450.0'],
['temperature_Resu05_les_spec_r0.0350.0',
'temperature_Resu06_les_spec_r0.0350.0',
'temperature_Resu07_les_spec_r0.0350.0',
'temperature_Resu08_les_spec_r0.0350.0',
'temperature_Resu09_les_spec_r0.0350.0']]

Categories