I have this text file and I need certain parts of it to be inserted into a list.
The file looks like:
blah blah
.........
item: A,B,C.....AA,BB,CC....
Other: ....
....
I only need to rip out the A,B,C.....AA,BB,CC..... parts and put them into a list. That is, everything after "item:" and before "Other:".
This can be easily done with small input, but the problem is that it may contain a large number of items and the text file may be pretty huge. Would using rfind and strip be as efficient for huge input as for small input, algorithmically speaking?
What would be an efficient way to do it?
I can see no need for rfind() nor strip().
It looks like you're simply trying to do:
start = 'item: '
end = 'Other: '
should_append = False
the_list = []
for line in open('file'):
    if line.startswith(start):
        data = line[len(start):]
        the_list.append(data)
        should_append = True
    elif line.startswith(end):
        should_append = False
        break
    elif should_append:
        the_list.append(line)
print the_list
This doesn't hold the whole file in memory, just the current line and the list of lines found between the start and the end patterns.
To answer the question about efficiency specifically, reading in the file and comparing it line by line will net O(n) average case performance.
Example code:
pattern = "item:"
with open("file.txt", 'r') as f:
    for line in f:
        if line.startswith(pattern):
            # You can do what you like with it; split it along whitespace
            # or a character, then put it into a list.
            items = line[len(pattern):].split(',')
You're searching the entire file sequentially, and you have to compare some number of elements in the file before you come across the element you're looking for.
You have the option of building a search tree instead. While it costs O(n) to build, it would cost O(log_k n) time to search (resulting in O(n) time overall, again), where k is the number of starting characters you'd have in your list.
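As a sketch of that indexing idea, here is a version using a plain dict keyed on each line's first token instead of a literal tree (so lookups are hash-based rather than O(log_k n); the file name is a stand-in):
index = {}
with open('file.txt') as f:          # stand-in file name
    for line in f:
        tokens = line.split(None, 1)
        if tokens:                   # skip blank lines
            index.setdefault(tokens[0], []).append(line)

item_lines = index.get('item:', []) # all lines whose first token is "item:"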
Though I usually jump at the chance to employ regular expressions, I feel like for a single occurrence in a large file, it would be much more work and too computationally expensive to use regex. So perhaps the straightforward answer (in python) would be most appropriate:
s = 'item:'
yourlist = next(line[len(s)+1:].split(',') for line in open(r"c:\zzz.txt") if line.startswith(s))
This, of course, assumes that 'item:' doesn't exist on any other lines that are NOT followed by 'other:', but in the event 'item:' exists only once and at the start of the line, this simple generator should work for your purposes.
This problem is simple enough that it really only has two states, so you could just use a Boolean variable to keep track of what you are doing. But the general case for problems like this is to write a state machine that transitions from one state to the next until it has worked its way through the problem.
I like to use enums for states; unfortunately Python doesn't really have a built-in enum. So I am using a class with some class variables to store the enums.
Using the standard Python idiom for line in f (where f is the open file object) you get one line at a time from the text file. This is an efficient way to process files in Python; your initial lines, which you are skipping, are simply discarded. Then when you collect items, you just keep the ones you want.
This answer is written to assume that "item:" and "Other:" never occur on the same line. If this can ever happen, you need to write code to handle that case.
EDIT: I made the start_code and stop_code into arguments to the function, instead of hard-coding the values from the example.
import sys

class States:
    pass
States.looking_for_item = 1
States.collecting_input = 2

def get_list_from_file(fname, start_code, stop_code):
    lst = []
    state = States.looking_for_item
    with open(fname, "rt") as f:
        for line in f:
            l = line.lstrip()
            # Don't collect anything until after we find "item:"
            if state == States.looking_for_item:
                if not l.startswith(start_code):
                    # Discard input line; stay in same state
                    continue
                else:
                    # Found item! Advance state and start collecting stuff.
                    state = States.collecting_input
                    # chop out start_code
                    l = l[len(start_code):]
                    # Collect everything after "item":
                    # Split on commas to get strings. Strip white-space from
                    # ends of strings. Append to lst.
                    lst += [s.strip() for s in l.split(",")]
            elif state == States.collecting_input:
                if not l.startswith(stop_code):
                    # Continue collecting input; stay in same state
                    # Split on commas to get strings. Strip white-space from
                    # ends of strings. Append to lst.
                    lst += [s.strip() for s in l.split(",")]
                else:
                    # We found our terminating condition! Don't bother to
                    # update the state variable, just return lst and we
                    # are done.
                    return lst
            else:
                print("invalid state reached somehow! state: " + str(state))
                sys.exit(1)

lst = get_list_from_file(sys.argv[1], "item:", "Other:")
# do something with lst; for now, just print
print(lst)
I wrote an answer that assumes that the start code and stop code must occur at the start of a line. This answer also assumes that the lines in the file are reasonably short.
You could, instead, read the file in chunks, and check to see if the start code exists in the chunk. For this simple check, you could use if code in chunk (in other words, use the Python in operator to check for a string being contained within another string).
So, read a chunk, check for start code; if not present discard the chunk. If start code present, begin collecting chunks while searching for the stop code. In a recent Python version you can concatenate the blocks one at a time with reasonable performance. (In an old version of Python you should store the chunks in a list, then use the .join() method to join the chunks together.)
Once you have built a string that holds data from the start code to the end code, you can use .find() and .rfind() to find the start code and end code, and then cut out just the data you want.
If the start code and stop code can occur more than once in the file, wrap all of the above in a loop and loop until end of file is reached.
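A minimal sketch of that chunked approach for a single occurrence (the function name and chunk size are made up, and it glosses over the corner case where a code straddles a chunk boundary):
def extract_between(fname, start_code, stop_code, chunk_size=64 * 1024):
    buf = ''
    collecting = False
    with open(fname, 'r') as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:                    # end of file
                break
            if not collecting:
                if start_code in chunk:      # found the start code in this chunk
                    collecting = True
                    buf = chunk
                # otherwise discard the chunk
            else:
                buf += chunk                 # keep collecting chunks
            if collecting and stop_code in buf:
                break
    # use find()/rfind() to cut out just the data we want
    start = buf.find(start_code)
    stop = buf.rfind(stop_code)
    if start != -1 and stop != -1:
        return buf[start + len(start_code):stop]
    return None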
Related
I'm trying to remove a couple of lines from a text file that I imported from my Kindle. The text looks like:
Shall I come to you?
Nicholls David, One Day, loc. 876-876
Dexter looked up at the window of the flat where Emma used to live.
Nicholls David, One Day, loc. 883-884
I want to grab the bin bag and do a forensics
Sophie Kinsella, I've Got Your Number, loc. 64-64
The complete file is longer; this is just a piece of the document. The aim of my code is to remove every line in which "loc. " appears, so that just the extracts remain. Put another way, I want to remove the line just before each blank line.
My code so far look like this:
f = open('clippings_export.txt', 'r', encoding='utf-8')
message = f.read()
line = message[0:400]
f.close()
key = ["l", "o", "c", ".", " "]
for i in range(0, len(line)-5):
    if line[i] == key[0]:
        if line[i+1] == key[1]:
            if line[i+2] == key[2]:
                if line[i+3] == key[3]:
                    if line[i+4] == key[4]:
                        pass  # index i marks where "loc. " starts
The last if finds exactly the position (index) where each "loc. " is located in the file. Nevertheless, after this stage I do not know how to work backwards to find where the line starts so that it can be completely removed. What could I do next? Do you recommend another way to remove this line?
Thanks in advance!
I think that the question might be a bit misleading!
Anyway, if you simply want to remove those lines, you need to check whether they contain the "loc." substring. Probably the easiest way is to use the in operator.
Instead of getting the whole file from the read() function, read the file line by line (using the readlines() function, for example). You can then check whether each line contains your key and omit it if it does.
Since the result is now a list of strings, you might want to merge them with str.join().
Here I used another list to store the desired lines; you could also use the "more Pythonic" filter() or a list comprehension (there's an example in the similar question I mention below).
f = open('clippings_export.txt', 'r', encoding='utf-8')
lines = f.readlines()
f.close()

filtered_lines = []
for line in lines:
    if "loc." in line:
        continue
    else:
        filtered_lines.append(line)

result = ""
result = result.join(filtered_lines)
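The list-comprehension version mentioned above is a one-liner with the same behaviour:
filtered_lines = [line for line in lines if "loc." not in line]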
By the way, I thought it might be a duplicate - Here's question about the opposite (that is wanting lines which contain the key).
I need the most efficient way to store a long string (~2 MB) during the execution of the program.
There is a loop, and every iteration creates a very long string. Lines of this string appear one at a time and I need to store them. There is also a condition independent of the string: if the condition is true, I want to print the string; if it is false, nothing happens. After the iteration the string is no longer needed, and the next iteration creates another string.
I am considering a variable: when a line appears, I add it to the variable:
big_string = ''
while True:
    big_string += line
    [...]
    big_string += line
    if condition:
        print big_string
        big_string = ''
Is this a good way to handle that?
The main options for data storage would be:
- a variable (as you have done),
- an array,
- or a text file.

If you want to keep the strings separate and independent from one another, I would recommend an array so you can call each index independently; however, if that does not matter and you simply want to use the whole thing, I see no issue with your current structure.
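As a side note, if the repeated += ever becomes a performance concern, the usual idiom is to collect the pieces in a list (Python's array type) and join once. A minimal sketch with made-up stand-in lines:
parts = []
for line in ['first\n', 'second\n', 'third\n']:  # stand-ins for the real lines
    parts.append(line)                           # appending to a list is cheap
big_string = ''.join(parts)                      # single join at the end
print(big_string)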
I have to print something by taking input from a file. The first few lines are empty, so the output is turning out to be empty. It's like someone has pressed the enter key 10 times before writing anything.
I want to ignore those inputs and consider only those which are not empty. What should I do?
Your problem can be solved by checking whether anything apart from a newline character ("\n") is present in a line.
fileObj = open(Filename)
for row in fileObj:
    if len(row.replace("\n", "")) > 0:
        print(row)
        # Do your operations
If you can edit your question to add material, that would be helpful, but here are a few pointers for now.
Assuming you're taking the file in as a string (let's call it "f"), you can loop over the empty lines with a while loop:
while f and f[0] == "\n":
    f = f[1:]
This allows you to chop off only the initial returns while keeping any line breaks later on in the file.
Note that, depending on the system this was written in, the enters may be stored as "\r\n", in which case you could easily alter this loop to remove those characters too, as sketched below. Good luck!
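A sketch of that variant:
while f and f[0] in "\r\n":
    f = f[1:]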
I'm doing this week's 'easy' Daily Programmer Challenge on Reddit. The description is at the link, but essentially the challenge is to read a text file from a url and do a word count. Needless to say the resulting output is a fairly large dictionary object. I have a few questions, mostly regarding accessing or sorting keys according to their value.
First, I developed the code according to what I currently understand about OOP and good Python style. I wanted it to be as robust as possible but I also wanted to use the least amount of imported modules. My goal is to become a good programmer, thus I believe it's important to develop a strong foundation and figure out how to do things myself whenever possible. That being said, the code:
from urllib2 import urlopen

class Word(object):
    def __init__(self):
        self.word_count = {}

    def alpha_only(self, word):
        """Converts word to lowercase and removes any non-alphabetic characters."""
        x = ''
        for letter in word:
            s = letter.lower()
            if s in 'abcdefghijklmnopqrstuvwxyz':
                x += s
        if len(x) > 0:
            return x

    def count(self, line):
        """Takes a line from the file and builds a list of lowercased words containing only alphabetic chars.
        Adds each word to word_count if not already present, if present increases the count by 1."""
        words = [self.alpha_only(x) for x in line.split(' ') if self.alpha_only(x) != None]
        for word in words:
            if word in self.word_count:
                self.word_count[word] += 1
            elif word != None:
                self.word_count[word] = 1

class File(object):
    def __init__(self, book):
        self.book = urlopen(book)
        self.word = Word()

    def strip_line(self, line):
        """Strips newlines, tabs, and return characters from beginning and end of line. If remaining string > 1,
        splits up the line and passes it along to the count method of the word object."""
        s = line.strip('\n\r\t')
        if len(s) > 1:
            self.word.count(s)

    def process_book(self):
        """Main processing loop, will not begin processing until the first line after the line containing "START".
        After processing it will close the file."""
        begin = False
        for line in self.book:
            if begin == True:
                self.strip_line(line)
            elif 'START' in line:
                begin = True
        self.book.close()

book = File('http://www.gutenberg.org/cache/epub/47498/pg47498.txt')
book.process_book()
count = book.word.word_count
So now I have a fairly accurate and robust word count that probably doesn't have any duplicates or blank entries, but it is nevertheless a dict object containing over 3k key/value pairs. I can't iterate over it using for k,v in count without getting the exception ValueError: too many values to unpack, which rules out using a list comprehension or mapping to a function to perform any kind of sorting.
I was reading this HowTo on Sorting and playing with it a few minutes ago and noticed that for x in count.items() lets me iterate through a list of key/value pairs without throwing a ValueError exception, so I removed the line count = book.word.word_count and added the following:
s_count = sorted(book.word.word_count.items(), key=lambda count: count[1], reverse=True)
# Delete the original dict, it is no longer needed
del book.word.word_count
Now I finally have a sorted list of words, s_count. PHEW! So, my questions are:
Is a dict even the best data type to perform the original counting? Would a list of tuples like that returned by count.items() have been preferable? But that would probably slow it down, right?
This seems kind of 'clunky', as I'm building a dict, converting it to a list containing tuples, then sorting the list and returning a new list. However, it is my understanding that dictionaries allow me to perform the fastest lookups, so am I missing something here?
I read briefly about hashing. While I think I understand that the point is that hashing will save space in memory and allow me to perform faster look-ups and comparisons, wouldn't the trade-off be that the program becomes more computationally expensive (higher CPU load) because it would then be calculating hashes for each word? Is hashing relevant here?
Any feedback on naming conventions (which I am terrible at), or any other suggestions about basically anything (including style), would be greatly appreciated.
Are you sure that for k,v in count: gives the exception ValueError: too many values to unpack? I expect it to give ValueError: need more than 1 value to unpack.
When you use a dict as an iterator (eg in a for loop) you just get the keys, you don't get the values. If you want key, value pairs you need to use the dict's iteritems() method as mentioned by figs in the comment (or in Python 3 the items() method).
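For example (Python 2 syntax, matching the urllib2 code in the question):
for k, v in count.iteritems():
    print k, v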
Of course, you can always do something like:
for k in count:
    print k, count[k]
...
I think that most of your questions are more suited to Code Review than to Stack Overflow. But since you've asked so nicely here, I'll mention a few points. :)
It's rather inefficient to build up a string char by char, so your alpha_only() method would be better if it collected chars in a list then used the str.join() method to join them into a single string. The usual Python idiom would do that using a list comprehension.
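For instance, alpha_only() might be rewritten with that idiom like this (a sketch preserving the original behaviour, including returning None for empty results):
def alpha_only(self, word):
    """Converts word to lowercase and removes any non-alphabetic characters."""
    chars = [c for c in word.lower() if c in 'abcdefghijklmnopqrstuvwxyz']
    if chars:
        return ''.join(chars)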
The list comprehension in your count() method calls alpha_only() twice for each word, which is inefficient.
You could make your strip() call simpler by using the default argument, as that strips all white space (and you don't need to preserve space chars in this application). Similarly, using split() with its default arg will split on any runs of blank space, which is probably better in this application, since giving an arg of a single space means that you'll get some empty strings in the list returned by split if there are any runs of multiple spaces within a line.
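To illustrate the default-argument behaviour:
line = '  Shall   I \t come to you?  \r\n'
line.strip()          # 'Shall   I \t come to you?' -- all surrounding whitespace gone
line.strip().split()  # ['Shall', 'I', 'come', 'to', 'you?'] -- no empty strings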
...
You mention hashing in your question, and whether it's useful for this application. Yes, it is. Python dictionaries actually use hashing of their keys, so you don't need to worry about the details. And yes, a dictionary is a good data structure to use for this task. There are fancier forms of dictionary that make things a bit simpler, but to use them does require importing a (standard) module. But using a dictionary (of some flavour or another) to hold data and then generating a list of tuples from it for final sorting is a fairly common practice in Python. And there's no need to specifically delete the dictionary when you've finished with it if the program's about to terminate anyway.
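One such fancier dictionary is collections.Counter from the standard library (available in Python 2.7+, so check that it fits your version):
from collections import Counter

word_count = Counter()
for word in ['one', 'day', 'one']:
    word_count[word] += 1           # missing keys default to 0
print(word_count.most_common(1))    # [('one', 2)]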
...
As for the duplicated call of alpha_only(), whenever you find yourself doing that sort of thing it's a sign that a list comprehension isn't really suitable for the task and that you should just use a normal for loop so that you can save the result of the function call rather than having to recalculate it. Eg,
words = []
for word in line.split():
    word = self.alpha_only(word)
    if word is not None:
        words.append(word)
I'm working on a project to parse out unique words from a large number of text files. I've got the file handling down, but I'm trying to refine the parsing procedure. Each file has a specific text segment that ends with certain phrases that I'm catching with a regex on my live system.
The parser should walk through each line, and check each word against 3 criteria:
Longer than two characters
Not in a predefined dictionary set dict_file
Not already in the word list
The result should be a 2D array, each row a list of unique words per file, which is written to a CSV file using the .writerow(foo) method after each file is processed.
My working code is below, but it's slow and kludgy. What am I missing?
My production system is running Python 2.5.1 with just the default modules (so NLTK is a no-go) and can't be upgraded to 2.7+.
def process(line):
    line_strip = line.strip()
    return line_strip.translate(punct, string.punctuation)

# Directory walking and initialization here
report_set = set()
with open(fullpath, 'r') as report:
    for line in report:
        # Strip out the CR/LF and punctuation from the input line
        line_check = process(line)
        if line_check == "FOOTNOTES":
            break
        for word in line_check.split():
            word_check = word.lower()
            if ((word_check not in report_set) and (word_check not in dict_file)
                    and (len(word) > 2)):
                report_set.add(word_check)
report_list = list(report_set)
Edit: Updated my code based on steveha's recommendations.
One problem is that an in test for a list is slow. You should probably keep a set to keep track of what words you have seen, because the in test for a set is very fast.
Example:
report_set = set()
for line in report:
    for word in line.split():
        if we_want_to_keep_word(word):
            report_set.add(word)
Then when you are done:
report_list = list(report_set)
Anytime you need to force a set into a list, you can. But if you just need to loop over it or do in tests, you can leave it as a set; it's legal to do for x in report_set:
Another problem that might or might not matter is that you are slurping all the lines from the file in one go, using the .readlines() method. For really large files it is better to just use the open file-handle object as an iterator, like so:
with open("filename", "r") as f:
for line in f:
... # process each line here
A big problem is that I don't even see how this code can work:
while 1:
    lines = report.readlines()
    if not lines:
        break
This loop actually runs only twice. The first statement slurps all input lines with .readlines(); then we loop again, and the next call to .readlines() finds report already exhausted, so it returns an empty list, which breaks out of the loop. But the lines we just read have now been lost, and the rest of the code must make do with an empty lines variable. How does this even work?
So, get rid of that whole while 1 loop, and change the next loop to for line in report:.
Also, you don't really need to keep a count variable. You can use len(report_set) at any time to find out how many words are in the set.
Also, with a set you don't actually need to check whether a word is in the set; you can just always call report_set.add(word) and if it's already in the set it won't be added again!
Also, you don't have to do it my way, but I like to make a generator that does all the processing. Strip the line, translate the line, split on whitespace, and yield up words ready to use. I would also force the words to lower-case except I don't know whether it's important that FOOTNOTES be detected only in upper-case.
So, put all the above together and you get:
import string

def words(file_object):
    for line in file_object:
        line = line.strip().translate(None, string.punctuation)
        for word in line.split():
            yield word

report_set = set()
with open(fullpath, 'r') as report:
    for word in words(report):
        if word == "FOOTNOTES":
            break
        word = word.lower()
        if len(word) > 2 and word not in dict_file:
            report_set.add(word)

print("Words in report_set: %d" % len(report_set))
Try replacing report_list with a dictionary or set.
word_check not in report_list is slow if report_list is a list.
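A quick illustration of the difference (membership tests on a set are average O(1); on a list they are O(n)):
report_list = ['alpha', 'beta', 'gamma']
report_set = set(report_list)        # one-time conversion
print('beta' in report_set)          # True; found in constant time on average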