I have an S19-format file with about 25,000 lines.
Each line looks like: S214 780010 00802000000010000000000A508CC78C 7A
There are no spaces in the actual file (they are shown above only for readability). The first part, 780010, is the address of the line, which I want to use as a dict key, and I want the data part, 00802000000010000000000A508CC78C, to be the value for that key. I wrote my code like this:
def __init__(self, filename):
    infile = file(filename, 'r')
    self.all_lines = infile.readlines()
    self.dict_by_address = {}
    for i in range(0, self.get_line_number()):
        self.dict_by_address[self.get_address_of_line(i)] = self.get_data_of_line(i)
    infile.close()
get_address_of_line() and get_data_of_line() are both simple string-slicing functions; get_line_number() iterates over self.all_lines and returns an int.
The problem is that the init process takes over a minute. Is the way I construct the dict wrong, or does Python just need that long for this?
By the way, I'm new to Python :) The code probably looks rather C/C++-like; any advice on how to write it more Pythonically is appreciated :)
How about something like this? (I made a test file with just a line S21478001000802000000010000000000A508CC78C7A so you might have to adjust the slicing.)
>>> with open('test.test') as f:
... dict_by_address = {line[4:10]:line[10:-3] for line in f}
...
>>> dict_by_address
{'780010': '00802000000010000000000A508CC78C'}
This code should be tremendously faster than what you have now. EDIT: As #sth pointed out, this doesn't work because there are no spaces in the actual file. I'll add a corrected version at the end.
def __init__(self, filename):
    self.dict_by_address = {}
    with open(filename, 'r') as infile:
        for line in infile:
            _, key, value, _ = line.split()
            self.dict_by_address[key] = value
Some comments:
Best practice in Python is to use a with statement, unless you are using an old Python that doesn't have it.
Best practice is to use open() rather than file(); I don't think Python 3.x even has file().
You can use the open file object as an iterator, and when you iterate it you get one line from the input. This is better than calling the .readlines() method, which slurps all the data into a list; then you use the data one time and delete the list. Since the input file is large, that means you are probably causing swapping to virtual memory, which is always slow. This version avoids building and deleting the giant list.
Then, having created a giant list of input lines, you use range() to make a big list of integers. Again it wastes time and memory to build a list, use it once, then delete the list. You can avoid this overhead by using xrange() but even better is just to build the dictionary as you go, as part of the same loop that is reading lines from the file.
It might be better to use your special slicing functions to pull out the "address" and "data" fields, but if the input is regular (always follows the pattern of your example) you can just do what I showed here. line.split() splits the line on white space, giving a list of four strings. Then we unpack it into four variables using "destructuring assignment". Since we only want to save two of the values, I used the variable name _ (a single underscore) for the other two. That's not really a language feature, but it is an idiom in the Python community: when you have data you don't care about you can assign it to _. This line will raise an exception if there are ever any number of values other than 4, so if it is possible to have blank lines or comment lines or whatever, you should add checks and handle the error (at least wrap that line in a try:/except).
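For example, a minimal sketch of that defensive check (assuming you simply want to skip malformed lines, which may or may not be the right policy for your data) would make the loop body look like this:
for line in infile:
    try:
        _, key, value, _ = line.split()  # expects exactly four whitespace-separated fields
    except ValueError:
        continue  # blank, comment, or otherwise malformed line: skipping is an assumed policy
    self.dict_by_address[key] = value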
EDIT: corrected version:
def __init__(self, filename):
    self.dict_by_address = {}
    with open(filename, 'r') as infile:
        for line in infile:
            key = extract_address(line)
            value = extract_data(line)
            self.dict_by_address[key] = value
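Here extract_address() and extract_data() stand in for your own slicing functions. As a sketch only: if your records really match the example line, they could be as simple as the slices used in the comprehension above (the exact offsets are an assumption you should check against your file):
def extract_address(line):
    # assumed layout: the address occupies characters 4-9 of the record
    return line[4:10]

def extract_data(line):
    # assumed layout: data runs from character 10 up to the 2-character checksum and newline
    return line[10:-3]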
I am trying to extract certain strings of data from a text file. The code I use is below. I want to read particular strings (all_actions) from the text file, store each one in an array or list when it is found, and then display them in the same order.
import string

solution_path = "/homer/my_dir/solution_detail.txt"
solution = open(solution_path).read()
all_actions = ['company_name','email_address','full_name']
n = 0
sequence_array = []
for line in solution:
    for action in all_actions:
        if action in line:
            sequence_array[n] = action
            n = n+1
for x in range(len(sequence_array)):
    print(sequence_array[x])
But this code does not do anything, although it runs without any errors.
There are multiple problems with the code.
.read() on a file produces a single string. As a result, for line in solution: iterates over each character of the file's text, not over each line. (The name line is not special, in case you thought it was. The iteration depends only on what is being iterated over.) The natural way to get lines from the file is to loop over the file itself, while it is open. To keep the file open and make sure it closes properly, we use a with block.
You may not simply assign to sequence_array[n] unless the list is already at least n+1 elements long. (The reason you don't get an error from this is that if action in line: is never true, because of the first point.) Fortunately, we can simply .append to the end of the list instead.
If a line contains more than one of the all_actions, it would be stored multiple times. This is probably not what you want to happen. The built-in any function makes it easier to deal with this problem; we can supply it with a generator expression for an elegant solution. But if your exact needs are different then of course there are different approaches.
While the last loop is okay in theory, it is better to loop directly, the same way you attempt to loop over solution. But instead of building up a list, we could instead just print the results as they are found.
So, for example:
with open(solution_path) as solution:
    for line in solution:
        if any(action in line for action in all_actions):
            print(line)
What is happening is that solution contains all the text inside the file. Therefore when you are iterating for line in solution you are actually iterating over each and every character separately, which is why you never get any hits.
Try the following code (I can't test it since I don't have your file):
solution_path = "/homer/my_dir/solution_detail.txt"
all_actions = ['company_name','email_address','full_name']
sequence_array = []

with open(solution_path, 'r') as f:
    for line in f.readlines():
        for action in all_actions:
            if action in line:
                sequence_array.append(action)
This will collect all the actions found in the document. If you want to print all of them:
for action in sequence_array:
    print(action)
I'm trying to make a function that, given input from the User, can map input to a list of strings in a text file, and return some integer corresponding to the string in the file. Essentially, I check if what the user input is in the file and return the index of the matching string in the file. I have a working function, but it seems slow and error-prone.
def parseInput(input):
    Gates = []
    try:
        textfile = open("words.txt")
        while nextLine:
            nextLine = textfile.readline()
            Gates[n] = nextLine  # increment n somewhere
    finally:
        textfile.close()
    while n <= len(Gates):
        nextString = Gates[n]
        if input in nextString:
            # Exit loop
    with open("wordsToInts.txt") as textfile:
        # Same procedure as the try loop (why isn't that one a with loop?)
        if (correct):
            return number
This seems rather... bad. I just can't seem to think of a better way to do this though. I have full control over words.txt and wordsToInts.txt(should I combine these?), so I can format them as I please. I'm looking for suggestions re: the function itself, but if a change to the text files would help, I would like to know. My goal is to reduce cause for error, but I will add error checking later. Please, suggest a better way to write this function. If writing in code, please use Python. Pseudocode is fine, however.
I would say to combine the files. You can have your words, and their corresponding values as follows:
words.txt
string1|something here
string2|something here
Then you can store each line as an entry to a dictionary and recall the value based on your input:
def parse_input(input):
    word_dict = {}
    with open('words.txt') as f:
        for line in f.readlines():
            line_key, line_value = line.split('|', 1)
            word_dict[line_key] = line_value.rstrip('\n')
    try:
        return word_dict[input]
    except KeyError:
        return None
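For example, with the two-line words.txt sketched above (hypothetical contents), this would behave like:
print(parse_input('string1'))    # prints: something here
print(parse_input('not_there'))  # prints: None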
I'm trying to make a function that, given input from the User, can map input to a list of strings in a text file, and return some integer corresponding to the string in the file. Essentially, I check if what the user input is in the file and return the index of the matching string in the file
def get_line_number(input):
    """Finds the number of the first line in which `input` occurs.

    If input isn't found, returns -1.
    """
    with open('words.txt') as f:
        for i, line in enumerate(f):
            if input in line:
                return i
    return -1
This function will meet the specification from your description, with the additional assumption that the strings you care about are on separate lines. Notable things:
File objects in Python act as iterators over the lines of their contents. You don't have to read the lines into a list if all you need to do is check each individual line.
The enumerate function takes an iterator and returns a generator which yields a tuple like (index, element), where element is an element in your iterator and index is its position inside the iterator.
The term iterator means any object that's a sequence of things you can access in a for loop.
The term generator means an object which generates elements to iterate through "on-the-fly". What this means in this case is that you can access each line of a file one by one without having to load the entire file into your machine's memory.
This function is written in the standard Pythonic style, with a docstring, appropriate casing on variable names, and a descriptive name.
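A quick usage sketch (the words.txt contents here are made up purely for illustration):
# suppose words.txt contains these three lines:
#   AND
#   OR
#   XOR
print(get_line_number('XOR'))   # prints: 2 (third line, zero-based)
print(get_line_number('NAND'))  # prints: -1 (not found)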
I'm trying to get the set() of all words in a very long database of books (around 60,000 books) and to store in a matrix the 'vocabularies' of each book (the paths of books are in "files"):
for f in files:
    book = open(f, 'r')
    vocabulary = []
    for lines in book.readlines():
        words = string.split(lines)
        vocabulary += set(words)
    matrix.extend([vocabulary])
    V += set(vocabulary)
OK, I solved the (memory) problem by creating a file to store everything, but now I get another memory error when trying to create a matrix with:
entries = numpy.zeros((len(V),a))
I tried to solve this by:
entries = numpy.memmap('matrice.mymemmap', shape=(len(V),a))
but the terminal says:
File "/usr/lib/python2.7/dist-packages/numpy/core/memmap.py", line 193, in __new__
    fid = open(filename, (mode == 'c' and 'r' or mode)+'b')
IOError: [Errno 2] No such file or directory: 'matrice.mymemmap'
Can you help me solve this too?
V = set()
for f in files:
    with open(f, 'r') as book:
        for lines in book.readlines():
            words = lines.split(" ")
            V.update(words)
Here you first create an empty set. Then for each file you iterate through the lines in the file and split each line by the spaces. This gives you a list of words on the line. Then you update the set by the list of words, i.e. only unique words remain in the set.
So, you will end up with V which contains all the words in your library.
Of course, you might want to clean up upper/lower case and punctuation in the words, and remove empty words (""), before updating the set. That should happen before the V.update() statement. Otherwise you end up with both, e.g., It and it, or fortunately and fortunately, (the same word with a trailing comma attached), etc.
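A minimal sketch of that cleanup step, plugged into the loop above (the exact stripping rules are an assumption; adjust them to your needs):
for raw in words:
    w = raw.strip('.,;:!?"()').lower()  # assumed cleanup rules
    if w:                               # skip empty strings
        V.add(w)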
Please note the with statement with file operation. This ensures that whatever happens, the file will be closed before you leave the with-block.
If you want to do this book-by-book, then:
vocabularies = []
for f in files:
    V = set()
    with open(f, 'r') as book:
        for lines in book.readlines():
            words = lines.split(" ")
            V.update(words)
    vocabularies.append(V)
Also, instead of for lines in book.readlines(): you may use just for lines in book:.
I don't think your code does what you think it does:
for f in files:
    book = open(f, 'r')
    vocabulary = []
You've created an empty list called vocabulary.
    for lines in book.readlines():
        words = string.split(lines)
        vocabulary += set(words)
For each line in the file, you're creating a set of the words in that line. But then you add it to vocabulary, which is a list. This just puts the elements on the end of the list. If a word appears on multiple lines, it will appear in vocabulary once for every line. This could make vocabulary very large.
    matrix.extend([vocabulary])
From this, I would assume that matrix is also a list. This will give you one entry in matrix for each book, and that entry will be a huge list as described above.
    V += set(vocabulary)
Is V a list or a set? set(vocabulary) copies the elements of vocabulary into a new set; then all the elements of that set are added to V.
First of all, I think you probably intend for vocabulary to be a set. To create an empty set, use vocabulary = set(). To add one item to a set, use vocabulary.add(word) and to add a collection use vocabulary.update(words). It looks like you mean to do the same with V. That should reduce your memory requirements a lot. That alone might be enough to fix your problem.
If that's not enough to make it work, consider whether you need all of matrix in memory at once. You could write it to a file instead of accumulating it in memory.
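For instance, a rough sketch of streaming each book's vocabulary out to a file as you go (the file name and the one-book-per-line format are assumptions, not anything your code requires):
V = set()
with open('vocabularies.txt', 'w') as out:   # hypothetical output file
    for f in files:
        vocabulary = set()
        with open(f, 'r') as book:
            for line in book:
                vocabulary.update(line.split())
        out.write(' '.join(sorted(vocabulary)) + '\n')   # one book per line
        V.update(vocabulary)                             # overall vocabulary still kept in memory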
You'll probably accumulate lots of extra words due to punctuation and capitalization. Your sets would be smaller if you didn't count 'clearly', 'Clearly', 'Clearly.', 'clearly.', 'clearly,'... as being distinct.
As others have noted, you should use a with statement to make sure your file is closed. However, I doubt this is causing your problem. While it's not guaranteed by all Pythons, in this case the file is probably getting closed automatically quite promptly.
In Python, you can't add values to a list using +=. Instead, use
vocabulary.append(set(words))
EDIT: was wrong.
I have searched and cannot find the answer to this, even though I am sure it is already out there. I am very new to Python, but I have done this kind of thing before in other languages. I am reading lines from a data file and I want to store each line of data in its own tuple, to be accessed outside the for loop.
tup(i) = inLine
where inLine is the line from the file and tup(i) is the tuple it's stored in. i increases as the loop goes round. I can then print any line using something similar to
print tup(100)
Creating a tuple in a loop as you describe isn't a great choice, as tuples are immutable.
This means that every time you added a new line to your tuple, you're really creating a new tuple with the same values as the old one, plus your new line. This is inefficient, and in general should be avoided.
If all you need is refer to the lines by index, you could use a list:
lines = []
for line in inFile:
    lines.append(line)

print lines[3]
If you REALLY need a tuple, you can cast it after you're done:
lines = tuple(lines)
Python file objects support a method called file.readlines([sizehint]), which reads the entire file and returns its contents as a list of lines.
Alternatively, you can pass the file iterator object to tuple() to create a tuple of lines and index it in the manner you want:
# This will create a tuple of file lines
with open("yourfile") as fin:
    tup = tuple(fin)
# This is a straightforward way to create a list of file lines
with open("yourfile") as fin:
    tup = fin.readlines()
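Either way, you can then index the result just as you wanted, using square brackets rather than parentheses:
print(tup[100])   # prints the 101st line of the file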
What is the best way to read a file and break out the lines by a delimiter?
Data returned should be a list of tuples.
Can this method be beaten? Can this be done faster/using less memory?
def readfile(filepath, delim):
    with open(filepath, 'r') as f:
        return [tuple(line.split(delim)) for line in f]
Your posted code reads the entire file and builds a copy of the file in memory as a single list of all the file contents split into tuples, one tuple per line. Since you ask about how to use less memory, you may only need a generator function:
def readfile(filepath, delim):
    with open(filepath, 'r') as f:
        for line in f:
            yield tuple(line.split(delim))
BUT! There is a major caveat! You can only iterate over the tuples returned by readfile once.
lines_as_tuples = readfile(mydata, ',')
for linedata in lines_as_tuples:
    # do something
This is okay so far, and a generator and a list look the same. But let's say your file was going to contain lots of floating point numbers, and your iteration through the file computed an overall average of those numbers. You could use the "# do something" code to calculate the overall sum and number of numbers, and then compute the average. But now let's say you wanted to iterate again, this time to find the differences from the average of each value. You'd think you'd just add another for loop:
for linedata in lines_as_tuples:
    # do another thing
    # BUT - this loop never does anything, because lines_as_tuples has been consumed!
BAM! This is a big difference between generators and lists. At this point in the code now, the generator has been completely consumed - but there is no special exception raised, the for loop simply does nothing and continues on, silently!
In many cases, the list that you would get back is only iterated over once, in which case a conversion of readfile to a generator would be fine. But if what you want is a more persistent list, which you will access multiple times, then just using a generator will give you problems, since you can only iterate over a generator once.
My suggestion? Make readfile a generator, so that in its own little view of the world, it just yields each incremental bit of the file, nice and memory-efficient. Put the burden of retention of the data onto the caller - if the caller needs to refer to the returned data multiple times, then the caller can simply build its own list from the generator - easily done in Python using list(readfile('file.dat', ',')).
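For example, a sketch of that caller-side choice (process() here is a hypothetical stand-in for whatever you do with each row):
# single pass: iterate the generator directly, nothing extra held in memory
for fields in readfile('file.dat', ','):
    process(fields)

# multiple passes: build a list once, then iterate it as many times as needed
rows = list(readfile('file.dat', ','))
for fields in rows:
    process(fields)
for fields in rows:   # a second pass works, because rows is a real list
    process(fields)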
Memory use could be reduced by using a generator instead of a list and a list instead of a tuple, so you don't need to read the whole file into memory at once:
def readfile(path, delim):
    return (ln.split(delim) for ln in open(path, 'r'))
You'll have to rely on the garbage collector to close the file, though. As for returning tuples: don't do it if it's not necessary, since lists are a tiny fraction faster, constructing the tuple has a minute cost and (importantly) your lines will be split into variable-size sequences, which are conceptually lists.
Speed can be improved only by going down to the C/Cython level, I guess; str.split is hard to beat since it's written in C, and list comprehensions are AFAIK the fastest loop construct in Python.
More importantly, this is very clear and Pythonic code. I wouldn't try optimizing this apart from the generator bit.