I'm working on a project to parse out unique words from a large number of text files. I've got the file handling down, but I'm trying to refine the parsing procedure. Each file has a specific text segment that ends with certain phrases that I'm catching with a regex on my live system.
The parser should walk through each line, and check each word against 3 criteria:
Longer than two characters
Not in a predefined dictionary set dict_file
Not already in the word list
The result should be a 2D array, each row a list of unique words per file, which is written to a CSV file using the .writerow(foo) method after each file is processed.
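A minimal sketch of that CSV step, just to pin down the intended output shape (results_path and all_reports are placeholder names, not variables from the real script):
import csv

# Rough sketch of the .writerow() step described above; results_path and
# all_reports are placeholders, not the real variables.
with open(results_path, 'wb') as out:  # 'wb' because the csv module on Python 2 prefers binary mode
    writer = csv.writer(out)
    for report_list in all_reports:    # one list of unique words per processed file
        writer.writerow(report_list)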
My working code's below, but it's slow and kludgy, what am I missing?
My production system is running Python 2.5.1 with just the default modules (so NLTK is a no-go), and it can't be upgraded to 2.7+.
import string

def process(line):
    # punct: a translation table assumed to be defined during initialization (not shown)
    line_strip = line.strip()
    return line_strip.translate(punct, string.punctuation)
# Directory walking and initialization here
report_set = set()
with open(fullpath, 'r') as report:
    for line in report:
        # Strip out the CR/LF and punctuation from the input line
        line_check = process(line)
        if line_check == "FOOTNOTES":
            break
        for word in line_check.split():
            word_check = word.lower()
            if ((word_check not in report_set) and (word_check not in dict_file)
                    and (len(word) > 2)):
                report_set.add(word_check)
report_list = list(report_set)
Edit: Updated my code based on steveha's recommendations.
One problem is that an in test for a list is slow. You should probably keep a set to keep track of what words you have seen, because the in test for a set is very fast.
Example:
report_set = set()

for line in report:
    for word in line.split():
        if we_want_to_keep_word(word):
            report_set.add(word)
Then when you are done:
report_list = list(report_set)
Anytime you need to force a set into a list, you can. But if you just need to loop over it or do in tests, you can leave it as a set; it's legal to do for x in report_set:
Another problem that might or might not matter is that you are slurping all the lines from the file in one go, using the .readlines() method. For really large files it is better to just use the open file-handle object as an iterator, like so:
with open("filename", "r") as f:
    for line in f:
        ...  # process each line here
A big problem is that I don't even see how this code can work:
while 1:
    lines = report.readlines()
    if not lines:
        break
This cannot be doing what you want. The first call to .readlines() slurps all the input lines, then we loop again; by that point report is already exhausted, so the next call to .readlines() returns an empty list and we break out of the loop. But the lines read on the first pass have now been thrown away, and the rest of the code has to make do with an empty lines variable. How does this even work?
So, get rid of that whole while 1 loop, and change the next loop to for line in report:.
Also, you don't really need to keep a count variable. You can use len(report_set) at any time to find out how many words are in the set.
Also, with a set you don't actually need to check whether a word is in the set; you can just always call report_set.add(word) and if it's already in the set it won't be added again!
Also, you don't have to do it my way, but I like to make a generator that does all the processing. Strip the line, translate the line, split on whitespace, and yield up words ready to use. I would also force the words to lower-case except I don't know whether it's important that FOOTNOTES be detected only in upper-case.
So, put all the above together and you get:
import string

def words(file_object):
    for line in file_object:
        line = line.strip().translate(None, string.punctuation)
        for word in line.split():
            yield word
report_set = set()

with open(fullpath, 'r') as report:
    for word in words(report):
        if word == "FOOTNOTES":
            break
        word = word.lower()
        if len(word) > 2 and word not in dict_file:
            report_set.add(word)

print("Words in report_set: %d" % len(report_set))
Try replacing report_list with a dictionary or set.
word_check not in report_list is slow when report_list is a list, because every element has to be compared in turn; the same test on a set is a single hash lookup.
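A quick, purely illustrative comparison (sizes made up):
import timeit

setup = "words_list = [str(i) for i in range(100000)]; words_set = set(words_list)"
# A membership test near the end of the list scans almost every element;
# the same test on the set is a single hash lookup.
print(timeit.Timer('"99999" in words_list', setup).timeit(100))
print(timeit.Timer('"99999" in words_set', setup).timeit(100))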
I am new to Python. My problem here is that:
I want to match a pattern against a large file and return the matching lines (not just the matched string) from it. I DO NOT want a FOR loop for this as my file is huge. I am using mmap for reading the file.
In the above file, if I search for bhuvi, I should get 2 rows: bhuvi and bhuvi Kumar.
I used re.findall() for this, but it just returns the substrings, not the whole lines.
Can someone please suggest what I can do here?
If your input file is huge, you cannot use readlines, but nothing prevents you from reading one line in a loop.
As the file object is iterable, you can write the loop as:
for line in fh:
and process the content of the input line inside the loop.
The file size is not important, as you do not attempt to read all lines at once.
To check for presence of your string (bhuvi) in the line use re.search, not re.findall. Actually you don't need any list of matches; it is enough to find a single match (it works quicker).
Below you have an example program (Python 3.7), writing the lines containing your string, along with the line number:
import re

cnt = 0
with open('input.txt') as fh:
    for line in fh:
        line = line.rstrip()
        cnt += 1
        if re.search('bhuvi', line):
            print(f'{cnt}: {line}')
Note that I used rstrip() to remove the trailing newline, if any.
Edit after your comment:
You wrote that the file to check is huge, so there is a risk that if you try to read it whole into the computer memory, the program runs out of memory. In such a case you would have to read the file chunk by chunk and perform the search in each chunk separately.
There is also a risk that a row with the text you are looking for will be partially read in one chunk and the rest in the next, so you have to take some measure to avoid this in your program.
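One way to take such a measure (a sketch only, with made-up names and chunk size) is to carry the unfinished last line of each chunk over into the next one:
import re

def search_in_chunks(path, pattern, chunk_size=1024 * 1024):
    regex = re.compile(pattern)
    tail = ''  # partial last line carried over from the previous chunk
    with open(path) as fh:
        while True:
            chunk = fh.read(chunk_size)
            if not chunk:
                break
            buf = tail + chunk
            cut = buf.rfind('\n') + 1  # index just past the last complete line
            for line in buf[:cut].splitlines():
                if regex.search(line):
                    print(line)
            tail = buf[cut:]
    if tail and regex.search(tail):  # the file may not end with a newline
        print(tail)

# search_in_chunks('input.txt', 'bhuvi')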
On the other hand, if there is no other way but using mmap, try something like re.finditer(r'[^\n]*bhuvi[^\n]*', map) (where map is your mmap object), i.e. create an iterator looking for:
A sequence of chars other than \n.
Your string.
Another sequence of chars other than \n.
This way the match object returned by the iterator will match the whole line, not your string alone.
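A sketch of that idea (assuming Python 3, an mmap of a file opened in binary mode, and therefore a bytes pattern):
import mmap
import re

with open('input.txt', 'rb') as fh:
    mm = mmap.mmap(fh.fileno(), 0, access=mmap.ACCESS_READ)
    # Each match spans the whole line around the search string
    for m in re.finditer(rb'[^\n]*bhuvi[^\n]*', mm):
        print(m.group().decode())
    mm.close()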
I have 2 files: one is a text file containing some sentences, and the other is a file containing the words I want to delete from the first file. First I have to omit the special words, and then write the unique words into a new file, each word on its own line. Here is the code I wrote, but it doesn't work. In simple words: I want to omit some words first, then find the unique words.
file1 = open('c:/python34/SimilarityCorpus.txt','r')
file2 = open('c:/python34/ListOfStopWords.txt','r')
file3 = open('c:/python34/Output1.txt','w')
first_words=[]
second_words=[]
z=[]
for line in file1: # to write unique words
    for word in line.split():
        if word not in z:
            z.append(word)
for line in file1:
    words = line.split()
    for w in words:
        first_words.append(w)
for line in file2:
    w = line.split()
    for i in w:
        second_words.append(i)
for word1 in first_words:
    for word2 in second_words:
        if word1==word2:
            first_words.remove(word2)
for word in first_words:
    file3.write(word)
    file3.write(' ')
file1.close()
file2.close()
file3.close()
I know that's basic, but I'm new in programming.
Welcome to programming! It's a fun world here :). I hope the answer below will help you.
Firstly, you are looking to get every unique word. Here, the set object may be useful for you. Using the set, you can iterate over every word and add it to the set, without worrying about duplicates.
z = set()
for line in file1: # to write unique words
    for word in line.split():
        z.add(word)
From my understanding of your code, you want to find the difference between the SimilarityCorpus and the ListOfStopWords, and then write that to disk. Since you are only interested in unique words, and not worried about the counts, then sets can come to your rescue again.
first_words = set()
for line in file1:
    words = line.split()
    first_words = first_words.union(words)
Here, the set.union(other_iterable) operation removes the need to iterate over the new words one by one. You can do likewise for second_words, as in the snippet below.
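For example, the stop-word file can be handled the same way:
second_words = set()
for line in file2:
    second_words = second_words.union(line.split())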
Finally, you want to take the difference between two sets, which is also available in Python. To do that, you either will be looking for:
words in first_words that are absent in second_words, or
words in second_words that are absent in first_words.
In the first case, you would do:
first_words.difference(second_words)
In the second case, you would do:
second_words.difference(first_words)
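As a tiny illustration with made-up words:
first_words = {'the', 'cat', 'sat'}
second_words = {'the', 'a', 'an'}
print(first_words.difference(second_words))  # {'cat', 'sat'} (order may vary)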
More documentation on sets can be found here on the Python docs. I would encourage you to use Python 3 rather than 2, which I see you are, so keep sticking with it!
To write to disk, with each word on a new line, you can do the following:
for word in first_words:
    file3.write(word)
    file3.write('\n')  # this will write a new line.
Currently, you have the following code pattern:
file3 = open('/path/to/your/file.txt', 'w')
# do stuff with file3, e.g. write.
file3.close()
I might suggest that you do, instead:
with open('/path/to/file3.txt', 'w') as file3:
    pass  # do stuff with file3.
In this way, you don't need to explicitly open and close the file; the "with open" line can automatically take care of that for you.
I believe the rest of your code is correct, for reading and writing information from and to the disk.
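Putting the pieces above together, a rough sketch of the whole script (using the paths from your question) might look like:
with open('c:/python34/SimilarityCorpus.txt') as file1, \
        open('c:/python34/ListOfStopWords.txt') as file2, \
        open('c:/python34/Output1.txt', 'w') as file3:
    first_words = set()
    for line in file1:
        first_words = first_words.union(line.split())
    second_words = set()
    for line in file2:
        second_words = second_words.union(line.split())
    for word in first_words.difference(second_words):
        file3.write(word)
        file3.write('\n')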
If you could update your question to include more detail on errors that are cropping up, that would really help! Finally, whatever answer you find most useful here, don't forget to upvote/accept it (it doesn't have to be mine, I'm happy to simply add to the corpus of information and help around here).
I'm making a small dictionary application just to learn Python. I have the function for adding words done (just need to add a check to prevent duplicates) but I'm trying to create the function for looking words up.
This is what my text file looks like when I append the words to it.
{word|Definition}
And I can check if the word exists by doing this,
if word in open("words/text.txt").read():
But how do I get the definition? I assume I need to use regex (which is why I split it up and placed it inside curly braces), I just have no idea how.
read() would read the entire file contents. You could do this instead:
def lookup(word):  # hypothetical wrapper; the `return` implies this runs inside your lookup function
    for line in open("words/text.txt", 'r').readlines():
        split_lines = line.strip().strip('{}').split('|')
        if word == split_lines[0]:  # or `word in line` would look for word anywhere in the line
            return split_lines[1]
You can use a dictionary if you want an effective search.
with open("words/text.txt") as fr:
dictionary = dict(line.strip()[1:-1].split('|') for line in fr)
print(dictionary.get(word))
Also try to avoid syntax like below:
if word in open("words/text.txt").read().
Use context manager (with syntax) to ensure that file will be closed.
To get all definitions:
f = open("words/text.txt")
for line in f:
    print line.split('|')[1]
I am new to Python. I have a list of words and a very large file. I would like to delete the lines in the file that contain a word from the list of words.
The list of words is given as sorted and can be fed during initialization time. I am trying to find the best approach to solve this problem. I'm doing a linear search right now and it is taking too much time.
Any suggestions?
You can use intersection from set theory to check whether the list of words and the words from a line have anything in common.
list_of_words=[]
sett=set(list_of_words)
with open(inputfile) as f1,open(outputfile,'w') as f2:
    for line in f1:
        if len(set(line.split()).intersection(sett))>=1:
            pass
        else:
            f2.write(line)
If the source file contains only words separated by whitespace, you can use sets:
words = set(your_words_list)
for line in infile:
    if words.isdisjoint(line.split()):
        outfile.write(line)
Note that this doesn't handle punctuation, e.g. given words = ['foo', 'bar'] a line like foo, bar,stuff won't be removed. To handle this, you need regular expressions:
import re

rr = r'\b(%s)\b' % '|'.join(your_words_list)
for line in infile:
    if not re.search(rr, line):
        outfile.write(line)
Binary search would only help if the lines or words in the big file were somehow sorted. It does not seem like they are, so the best you can do is a linear search, checking whether each word in the list is in a given line.
import re

contents = file.read()
words = sorted(the_list, key=len, reverse=True)
# (?m) makes ^ match at the start of every line
stripped_contents = re.sub(r'(?m)^.*(%s).*\n' % '|'.join(words), '', contents)
something like that should work... not sure if it will be faster than going through line by line
[edit] this is untested code and may need some slight tweaks
You cannot delete the lines in place; you need to rewrite to a second file. You may overwrite the old one afterwards (see shutil.copy for this).
The rest reads like pseudo-code:
forbidden_words = set("these words shall not occur".split())

with open(inputfile) as infile, open(outputfile, 'w+') as outfile:
    outfile.writelines(line for line in infile
                       if not any(word in forbidden_words for word in line.split()))
See this question for approaches to getting rid of punctuation-induced false negatives.
I have this text file and I need certain parts of it to be inserted into a list.
The file looks like:
blah blah
.........
item: A,B,C.....AA,BB,CC....
Other: ....
....
I only need to rip out the A,B,C.....AA,BB,CC..... parts and put them into a list. That is, everything after "item:" and before "Other:".
This can be easily done with small input, but the problem is that it may contain a large number of items and text file may be pretty huge. Would using rfind and strip be as efficient for huge input as for small input, algorithmically speaking?
What would be an efficient way to do it?
I can see no need for rfind() or strip().
It looks like you're simply trying to do:
start = 'item: '
end = 'Other: '
should_append = False
the_list = []

for line in open('file'):
    if line.startswith(start):
        data = line[len(start):]
        the_list.append(data)
        should_append = True
    elif line.startswith(end):
        should_append = False
        break
    elif should_append:
        the_list.append(line)

print the_list
This doesn't hold the whole file in memory, just the current line and the list of lines found between the start and the end patterns.
To answer the question about efficiency specifically, reading in the file and comparing it line by line will net O(n) average case performance.
Example code:
pattern = "item:"

with open("file.txt", 'r') as f:
    for line in f:
        if line.startswith(pattern):
            # You can do what you like with it; split it along whitespace or a character, then put it into a list.
            items = line[len(pattern):].split(',')  # one possible way to build the list
You're searching the entire file sequentially, and you have to compare some number of elements in the file before you come across the element you're looking for.
You have the option of building a search tree instead. While it costs O(n) to build, it would cost O(log_k n) time to search (resulting in O(n) time overall, again), where k is the number of starting characters you'd have in your list.
Though I usually jump at the chance to employ regular expressions, I feel like for a single occurrence in a large file, it would be much more work and too computationally expensive to use regex. So perhaps the straightforward answer (in python) would be most appropriate:
s = 'item:'
yourlist = next(line[len(s)+1:].split(',') for line in open("c:\zzz.txt") if line.startswith(s))
This, of course, assumes that 'item:' doesn't exist on any other lines that are NOT followed by 'other:', but in the event 'item:' exists only once and at the start of the line, this simple generator should work for your purposes.
This problem is simple enough that it really only has two states, so you could just use a Boolean variable to keep track of what you are doing. But the general case for problems like this is to write a state machine that transitions from one state to the next until it has worked its way through the problem.
I like to use enums for states; unfortunately Python doesn't really have a built-in enum. So I am using a class with some class variables to store the enums.
Using the standard Python idiom for line in f (where f is the open file object) you get one line at a time from the text file. This is an efficient way to process files in Python; your initial lines, which you are skipping, are simply discarded. Then when you collect items, you just keep the ones you want.
This answer is written to assume that "item:" and "Other:" never occur on the same line. If this can ever happen, you need to write code to handle that case.
EDIT: I made the start_code and stop_code into arguments to the function, instead of hard-coding the values from the example.
import sys

class States:
    pass

States.looking_for_item = 1
States.collecting_input = 2

def get_list_from_file(fname, start_code, stop_code):
    lst = []
    state = States.looking_for_item
    with open(fname, "rt") as f:
        for line in f:
            l = line.lstrip()
            # Don't collect anything until after we find "item:"
            if state == States.looking_for_item:
                if not l.startswith(start_code):
                    # Discard input line; stay in same state
                    continue
                else:
                    # Found item! Advance state and start collecting stuff.
                    state = States.collecting_input
                    # chop out start_code
                    l = l[len(start_code):]
                    # Collect everything after "item":
                    # Split on commas to get strings. Strip white-space from
                    # ends of strings. Append to lst.
                    lst += [s.strip() for s in l.split(",")]
            elif state == States.collecting_input:
                if not l.startswith(stop_code):
                    # Continue collecting input; stay in same state
                    # Split on commas to get strings. Strip white-space from
                    # ends of strings. Append to lst.
                    lst += [s.strip() for s in l.split(",")]
                else:
                    # We found our terminating condition! Don't bother to
                    # update the state variable, just return lst and we
                    # are done.
                    return lst
            else:
                print("invalid state reached somehow! state: " + str(state))
                sys.exit(1)

lst = get_list_from_file(sys.argv[1], "item:", "Other:")

# do something with lst; for now, just print
print(lst)
I wrote an answer that assumes that the start code and stop code must occur at the start of a line. This answer also assumes that the lines in the file are reasonably short.
You could, instead, read the file in chunks, and check to see if the start code exists in the chunk. For this simple check, you could use if code in chunk (in other words, use the Python in operator to check for a string being contained within another string).
So, read a chunk, check for start code; if not present discard the chunk. If start code present, begin collecting chunks while searching for the stop code. In a recent Python version you can concatenate the blocks one at a time with reasonable performance. (In an old version of Python you should store the chunks in a list, then use the .join() method to join the chunks together.)
Once you have built a string that holds data from the start code to the end code, you can use .find() and .rfind() to find the start code and end code, and then cut out just the data you want.
If the start code and stop code can occur more than once in the file, wrap all of the above in a loop and loop until end of file is reached.
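Here is a rough sketch of that chunked approach (names and the chunk size are made up, and it does not handle a start or stop code that straddles a chunk boundary):
def extract_between(fname, start_code, stop_code, chunk_size=64 * 1024):
    collected = ""
    collecting = False
    with open(fname, "rt") as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                break
            if not collecting:
                if start_code in chunk:  # note: misses a start_code split across two chunks
                    collecting = True
                    collected = chunk
            else:
                collected += chunk
            if collecting and stop_code in collected:
                break
    start = collected.find(start_code)
    stop = collected.find(stop_code, start)
    if start == -1 or stop == -1:
        return None
    # Cut out just the data between the two codes
    return collected[start + len(start_code):stop]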