I'm trying to use mmap to load a dictionary from a file.
I'll explain my problem with a simplified example. In reality I have 10 files which have to be loaded in milliseconds (or at least act as if they were).
So let's say we have a dictionary of 50 MB. My program should find a value by key in under 1 second. Searching the dictionary is not the problem; that can be done far under 1 second. The problem is that when somebody types an input into the text field and presses Enter, the program starts loading the dictionary into memory so it can look up the key. That loading can take several seconds, but I have to return a result in under 1 second (the dictionary can't be loaded before Enter is pressed). So I was recommended the mmap module, which should be far faster.
I can't find a good example by googling. I've tried this (I know it's incorrect usage):
import mmap
import cPickle

def loadDict():
    with open('dict', 'r+b') as f:  # used pickle to save
        fmap = mmap.mmap(f.fileno(), 0)
        dictionary = cPickle.load(fmap)
        return dictionary
def search(pattern):
    dictionary = loadDict()
    return dictionary[pattern]  # look up the argument, not the literal string 'pattern'

search('apple')  # <- it still takes many seconds
Could you give me a good example of proper mmap use?
Using an example file of 2,400,000 key/value pairs (52.7 megabytes) such as:
key1,value1
key2,value2
etc., etc.
Creating example file:
with open("stacktest.txt", "a") as f:
contents = ["key" + str(i) + ",value" + str(i) for i in range(2400000)]
f.write("\n".join(contents) + "\n")
What is actually slow is having to construct the dictionary. Reading a 50 MB file is fast enough. Finding a value in a wall of text of this size is also fast enough. Doing it that way, you will be able to find a single value in under 1 second.
Since I know the structure of my file I am able to use this shortcut. This should be tuned to your exact file structure though:
Reading in the file and manually searching for the known pattern (searching for the unique key string in the whole file, then using the comma and newline delimiters):
with open("stacktest.txt") as f:
bigfile = f.read()
my_key = "key2399999"
start = bigfile.find(my_key)
comma = bigfile[start:start+1000].find(",") + 1
end = bigfile[start:start+1000].find("\n")
print bigfile[start+comma:start+end]
# value2399999
Timing for it all: 0.43s on average
Mission accomplished?
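Since the question specifically asked for proper mmap use, here is a minimal sketch of the same search against a memory-mapped file (same key,value layout assumed). The OS pages the file in lazily, so nothing is read up front:

import mmap

with open("stacktest.txt", "rb") as f:
    mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
    start = mm.find(b"key2399999")  # mmap objects support find() like bytes
    comma = mm.find(b",", start) + 1
    end = mm.find(b"\n", start)
    print(mm[comma:end])  # b'value2399999'
    mm.close()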
Related
I have a task in a training course where I have to read and filter the 'good' reads of big FASTQ files. Each record contains a header, a DNA string, a + sign, and a quality string (one symbol per DNA base). For example:
#hhhhhhhh
ATGCGTAGGGG
+
IIIIIIIIIIIII
I down-sampled, got the code working, and saved the results in a Python dictionary. But it turns out the original files are huge, so I rewrote the code to use a generator. That did work for the down-sampled file. But I was wondering if it's a good idea to pull out all the data and do the filtering in a dictionary. Does anybody here have a better idea?
I ask because I am doing this by myself. I started learning Python a few months ago and I'm still learning, but I'm working alone. For that reason I'm asking for tips and help here, and sorry if I sometimes ask silly questions.
Thanks in advance.
Paulo
I got some ideas from code on Biostars:
import sys
import gzip

filename = sys.argv[1]

def parsing_fastq_files(filename):
    with gzip.open(filename, "rb") as infile:
        count_lines = 0
        for line in infile:
            line = line.decode()
            if count_lines % 4 == 0:  # the header line of each 4-line record
                ids = line[1:].strip()
                yield ids
            if count_lines % 4 == 1:  # the read line of each 4-line record
                reads = line.rstrip()
                yield reads
            count_lines += 1

total_reads = parsing_fastq_files(filename)
print(next(total_reads))
print(next(total_reads))
I now need to figure out how to filter the data, using something like if value.endswith('expression'):. A dict was my first idea, but that's exactly my doubt, given the sheer number of keys and values.
Since this training forces you to code this manually, and you already have code that reads the FASTQ as a generator, you can now use whatever metric you have (Phred score, maybe?) for determining the quality of each read. You can append each "good" read to a new file, so you don't hold much in working memory even if almost all reads turn out to be good.
Writing to file is a slow operation, so you could wait until you have, say, 50000 good sequences and then write them to file.
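For example, here is a minimal sketch of that batching idea, reusing the parsing_fastq_files generator above; is_good_read is a placeholder for whatever quality metric you settle on:

def is_good_read(read):
    return read.endswith("GG")  # placeholder quality check

def filter_reads(filename, outname, batch_size=50000):
    gen = parsing_fastq_files(filename)
    batch = []
    with open(outname, "w") as out:
        for ids in gen:
            read = next(gen)  # the generator alternates ids and reads
            if is_good_read(read):
                batch.append("@" + ids + "\n" + read + "\n")
            if len(batch) >= batch_size:
                out.writelines(batch)
                batch = []
        out.writelines(batch)  # flush whatever is left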
Check out https://bioinformatics.stackexchange.com/ if you do a lot of bioinformatics programming.
So I have code that grabs a list of files from a directory that initially had over 14 million files. This is a six-core machine with 20 GB RAM running Ubuntu 14.04 desktop, and just grabbing the list of files takes hours; I haven't actually timed it.
Over the past week or so I've run code that does nothing more than gather this list of files, open each file to determine when it was created, and move it to a directory based on the month and year it was created. (The files have been both scp'd and rsync'd, so the timestamp the OS provides is meaningless at this point, hence opening the file.)
When I first started running this loop it was moving 1000 files in about 90 seconds. After several hours, those 90 seconds became 2.5 minutes, then 4, then 5, then 9, and eventually 15 minutes. So I shut it down and started over.
I noticed today that, once it was done gathering a list of over 9 million files, moving 1000 files took 15 minutes right off the bat. I just shut the process down again and rebooted the machine, because the time to move 1000 files had climbed to over 90 minutes.
I had hoped to find some means of doing a while + list.pop() style strategy to free memory as the loop progressed. Then found a couple of SO posts that said it could be done with for i in list: ... list.remove(...) but that this was a terrible idea.
Here's the code:
from basicconfig.startup_config import *
arc_dir = '/var/www/data/visits/'
def step1_move_files_to_archive_dirs(files):
    """
    :return:
    """
    cntr = 0
    for f in files:
        cntr += 1
        if php_basic_files.file_exists(f) is False:
            continue
        try:
            visit = json.loads(php_basic_files.file_get_contents(f))
        except:
            continue
        fname = php_basic_files.basename(f)
        try:
            dt = datetime.fromtimestamp(visit['Entrance Time'])
        except KeyError:
            continue
        mYr = dt.strftime("%B_%Y")
        # Move the lead to the monthly archive
        arc_path = arc_dir + mYr + '/'
        if not os.path.exists(arc_path):
            os.makedirs(arc_path, 0777)
        if not os.path.exists(arc_path):
            print "Directory: {} was not created".format(arc_path)
        else:
            # Move the file to the archive
            newFile = arc_path + fname
            #print "File moved to {}".format(newFile)
            os.rename(f, newFile)
        if cntr % 1000 == 0:
            print "{} files moved ({})".format(cntr, datetime.fromtimestamp(time.time()).isoformat())
def step2_combine_visits_into_1_file():
    """
    :return:
    """
    file_dirs = php_basic_files.glob(arc_dir + '*')
    for fd in file_dirs:
        arc_files = php_basic_files.glob(fd + '*.raw')
        arc_fname = arc_dir + php_basic_str.str_replace('/', '', php_basic_str.str_replace(arc_dir, '', fd)) + '.arc'
        try:
            # parse the existing archive into a dict
            arc_file_data = json.loads(php_basic_files.file_get_contents(arc_fname))
        except:
            arc_file_data = {}
        for f in arc_files:
            uniqID = php_adv_str.fetchBefore('.', php_basic_files.basename(f))
            if uniqID not in arc_file_data:
                visit = json.loads(php_basic_files.file_get_contents(f))
                arc_file_data[uniqID] = visit
        php_basic_files.file_put_contents(arc_fname, json.dumps(arc_file_data))
def main():
    """
    :return:
    """
    files = php_basic_files.glob('/var/www/html/ver1/php/VisitorTracking/data/raw/*')
    print "Num of Files: {}".format(len(files))
    step1_move_files_to_archive_dirs(files)
    step2_combine_visits_into_1_file()
Notes:
basicconfig is essentially a bunch of constants I have for the environment and a few commonly used libraries like all the php_basic_* libraries. (I used PHP for years before picking up Python so I built a library to mimic the more common functions I used in order to be up and running with Python faster.)
The step1 def is as far as the program gets so far. The step2 def could, and likely should, be run in parallel. However, I figured I/O was the bottleneck and doing even more of it in parallel would likely slow all functions down a lot more. (I have been tempted to rsync the archive directories to another machine for aggregation thus getting parallel speed without the I/O bottleneck but figured the rsync would also be quite slow.)
The files themselves are all 3 KB each, so not very large.
----- Final Thoughts -------
Like I said, it doesn't appear, to me at least, that any data is being stored from each file opening, so memory should not be an issue. However, I notice that only 1.2 GB of RAM is being used right now, whereas over 12 GB was being used before. A big chunk of that 12 GB could be storing 14 million file names and paths. I've only just started the processing again, so for the next several hours Python will be gathering the list of files, and that list isn't in memory yet.
So I was wondering if there was a garbage collection issue or something else I was missing. Why is it slowing down as it progresses through the loop?
step1_move_files_to_archive_dirs:
Here's some reasons Step 1 might be taking longer than you expected...
The response to any exception during Step 1 is to continue to the next file. If you have any corrupted data files, they will stay in the filesystem forever, increasing the amount of work this function has to do next time (and the next, and the next...).
You are reading in every file and converting it from JSON to a dict, just to extract one date. So everything is read and converted at least once. If you control the creation of these files, it might be worth storing this value in the filename or in a separate index / log, so you don't have to go searching for that value again later.
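For example (a hypothetical naming scheme, not the actual one here), if the epoch timestamp were baked into each filename at creation time, Step 1 could skip reading the JSON entirely:

from datetime import datetime

fname = "visit_1418256000_abc123.raw"  # hypothetical: visit_<epoch>_<id>.raw
epoch = int(fname.split("_")[1])
mYr = datetime.fromtimestamp(epoch).strftime("%B_%Y")  # e.g. "December_2014"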
If the input directories and output / archive directories are on separate filesystems, os.rename(f, newFile) can't just rename the file: on most platforms it fails outright with a cross-device error, and a higher-level move has to copy every byte from the source filesystem to the target filesystem. So either every file is near-instantaneously renamed, or every input file is slowly copied.
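If that's the situation, shutil.move makes the fallback explicit; a sketch with example paths modeled on the question's directories:

import shutil

# shutil.move renames when source and destination share a filesystem,
# and falls back to copy-then-delete when they don't.
src = '/var/www/html/ver1/php/VisitorTracking/data/raw/abc123.raw'
dst = '/var/www/data/visits/December_2014/abc123.raw'
shutil.move(src, dst)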
PS: It's weird that this function double-checks things like whether the input file still exists, or if os.makedirs worked, but then allows any exception from os.rename to crash you mid-loop.
step2_combine_visits_into_1_file:
All your file I/O is hidden inside that PHP library, but it looks to this PHP outsider like you're trying to store in RAM the contents of all the files in each subdirectory. Then, you accumulate all those contents inside some smaller number of archive files, while preserving (most of?) the data that was already there. Not only is that probably slow to begin with, it will get slower as time goes on.
Function code mostly replaced by comments:
file_dirs = # arch_dir/* --- Maybe lots, maybe only a few.
for fd in file_dirs:
arc_files = # arch_dir/subdir*.raw or maybe arch_dir/subdir/*.raw.
arc_fname = # subdir.arc
arc_file_data = # Contents of JSON file subdir.arc, as a dict.
for f in arc_files: # The *.raw files.
uniqID = # String based on f's filename.
if uniqID not in arc_file_data:
# Add to arc_file_data the uniqID key, and the
# _ entire contents_ of the .raw file as its value.
php_basic_files.file_put_contents # (...)
# Convert the arc_file_data dict into one _massive_ string,
# and replace the contents of the subdir.arc file.
Unless you have some maintenance job that periodically trims the *.arc files, you will eventually have the entire contents of all 14 million files (plus any older files) inside the *.arc files. Each of those .arc files gets read into a dict, converted to a mega-string, grown (probably), and then written back to the filesystem. That's a ton of I/O, even if the average .arc file isn't very big (which can only happen if there are lots of them).
Why do all this anyway? By the start of Step 2, you've already got a unique ID for each .raw input file, and it's already in the filename --- so why not use the filesystem itself to store /arch_dir/subdir/unique_id.json?
If you really do need all this data in a few huge archives, that shouldn't require so much work. The .arc files are little more than the unaltered contents of the .raw files, with bits of a JSON dictionary between them. A simple shell script could slap that together without ever interpreting the JSON itself.
(If the values are not just JSON but quoted JSON, you would have to change whatever reads the .arc files to not un-quote those values. But now I'm purely speculating, since I can only see some of what's happening.)
PS: Am I missing something, or is arc_files a list of *.raw filenames? Shouldn't it be raw_files?
Other Comments:
As others have noted, if your file-globbing function returns a mega-list of 14 million filenames, it would be vastly more memory-efficient as a generator that can yield one filename at a time.
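A sketch of that, reusing the raw-file pattern from main(); glob.iglob matches exactly like glob.glob but hands back an iterator (how lazily it walks the directory depends on the Python version):

import glob

def iter_raw_files(pattern='/var/www/html/ver1/php/VisitorTracking/data/raw/*'):
    # Yield one path at a time instead of materializing a giant list.
    for path in glob.iglob(pattern):
        yield path

# step1_move_files_to_archive_dirs only does "for f in files",
# so it can consume this generator directly.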
Finally, you mentioned popping filenames off a list (although I don't see that in your code)... There is a huge time penalty for inserting or removing the first element of a large list --- del my_list[0] or my_list.pop(0) or my_list.insert(0, something) --- because items 1 through n-1 all have to be copied one index toward 0. That turns an O(n) pass into O(n**2)... again, if that's in your code anywhere.
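If you ever do need queue-style consumption, collections.deque pops from the left end in O(1):

from collections import deque

names = deque(['a.raw', 'b.raw', 'c.raw'])
while names:
    f = names.popleft()  # O(1), unlike list.pop(0) which shifts every element
    print f              # process f here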
Most of what I do involves writing simple parsing scripts that read search terms from one file and search, line by line, through another file. Once a search term is found, the line and sometimes the following line are written to another output file. The code I use is rudimentary and likely crude.
#!/usr/bin/env python
data = open("data.txt", "r")
search_terms = ids.read().splitlines()
data.close()

db = open("db.txt", "r")
output = open("output.txt", "w")

for term in search_terms:
    for line in db:
        if line.find(term) > -1:
            next_line = db.next()
            output.write(">" + head + "\n" + next_line)
            print("Found %s" % term)
There are a few problems here. First, I don't think searching line by line is the most efficient and fastest approach, but I'm not exactly sure about that. Second, I often run into issues with cursor placement: the cursor doesn't reset to the beginning of the file after a search term is found. Third, while I am usually confident that all of the terms can be found in the db, there are rare times when I can't be sure, so I would like to write to another file whenever the script iterates through the entire db without finding a term. I've tried adding a snippet that counts the number of lines of the db, so that if find() gets to the last line without a match it outputs to a separate "not found" file, but I haven't been able to get my elif and else branches right.
Overall, I'd just like any hints or corrections that could make this sort of script more efficient and robust.
Thanks.
Unless it's a really big file, why not iterate line by line? If the input file's size is some significant portion of your machine's available resources (memory), then you might want to look into buffered input and other, more low-level abstractions of what the computer is doing. But if you're talking about a few hundred MB or less on a relatively modern machine, let the computer do the computing ;)
Off the bat you might want to get into the habit of using the built-in context manager with. For instance, in your snippet, you don't have a call to output.close().
with open('data.txt', 'r') as f_in:
    search_terms = f_in.read().splitlines()
Now search_terms is a handle to a list that has each line from data.txt as a string (but with the newline characters removed). And data.txt is closed thanks to with.
In fact, I would do that with the db.txt file, also.
with open('db.txt', 'r') as f_in:
    lines = f_in.read().splitlines()
Context managers are cool.
As a side note, you could open your destination file now, and do your parsing and results-tracking with it open the whole time, but I like leaving as many files closed as possible for as long as possible.
I would suggest putting the biggest object on the outside of your loop, which I'm guessing is the db.txt contents. The outer sequence only gets iterated once, so you might as well put the biggest thing there.
results = []
for i, line in enumerate(lines):
    for term in search_terms:
        if term in line:
            # Use something not likely to appear in your line as a separator
            # for these "second lines". I used three pipe characters, but
            # you could just as easily use something even more random.
            # (This assumes a match never lands on the very last line;
            # guard the i+1 lookup if it can.)
            results.append('{}|||{}'.format(line, lines[i+1]))
if results:
    with open('output.txt', 'w') as f_out:
        for result in results:
            # Don't forget to replace your custom field separator
            f_out.write('> {}\n'.format(result.replace('|||', '\n')))
else:
    with open('no_results.txt', 'w') as f_out:
        # This will write an empty file to disk
        pass
The nice thing about this approach is that each line in db.txt is checked once for each search term in search_terms. However, the downside is that any line will be recorded once for each search term it contains, i.e., if it has three search terms in it, that line will appear in your output.txt three times.
And all the files are magically closed.
Context managers are cool.
Good luck!
search_terms keeps the whole of data.txt in memory. That's not good in general, but in this case it's not too bad.
Searching line by line is not efficient, but if the case is simple and the files are not too big it's not a big deal. If you want more efficiency, you should sort the data.txt file and put it into some tree-like structure. It depends on the data inside.
You have to use seek to move the pointer back after using next.
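For example, a sketch using the question's db.txt and search_terms list:

with open("db.txt") as db:
    for term in search_terms:
        db.seek(0)  # rewind before scanning for each new term
        for line in db:
            if term in line:
                print("Found %s" % term)
                break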
Probably the easiest way here is to generate two lists of lines and search using in, like:

db = open('db.txt').readlines()
db_words = [x.split() for x in db]
data = [line.strip() for line in open('data.txt')]

print('Lines in db {}'.format(len(db)))
for item in data:
    for words in db_words:
        if item in words:
            print("Found {}".format(item))
Your key issue is that you may be looping in the wrong order -- in your code as posted, you'll always exhaust the db looking for the first term, so after the first pass of the outer for loop db will be at end, no more lines to read, no other term will ever be found.
Other improvements include using the with statement to guarantee file closure, and a set to track which search terms were not found. (There are also typos in your posted code, such as opening a file as data but then reading it as ids).
So, for example, something like:
with open("data.txt", "r") as data:
search_terms = data.read().splitlines()
missing_terms = set(search_terms)
with open("db.txt", "r") as db, open("output.txt", "w") as output:
for line in db:
for term in search_terms:
if term in line:
missing_terms.discard(term)
next_line = db.next()
output.write(">" + head + "\n" + next_line)
print("Found {}".format(term))
break
if missing_terms:
diagnose_not_found(missing_terms)
where the diagnose_not_found function does whatever you need to do to warn the user about missing terms.
There are assumptions embedded here, such as that you don't care whether some other search term is present in a line where you've already found a previous one, or in the very next line; they might take substantial work to lift if they don't apply, and would require you to edit your question with a very complete and unambiguous list of specifications.
If your db is actually small enough to comfortably fit in memory, slurping it all in as a list of lines once and for all would make it easier to accommodate more demanding specs (in that case you can easily go back and forth, while iterating over a file means you can only go forward one line at a time). So if your specs are indeed more demanding, please also clarify whether this crucial condition holds, or whether you instead need the script to process potentially humongous db files (say gigabyte-plus sizes, so as not to "comfortably fit in memory", depending on your platform of course).
I have a big 20 GB input text file which I process. I create an index which I store in a dict. The problem is that I access this dict for every term in the file, plus for every term I may add it as an item to the dict, so I cannot just write it to disk. When I reach my maximum RAM capacity (8 GB RAM), the system (64-bit Windows 8) starts paging to virtual memory, so I/O is extremely high and the system is unstable (I got a blue screen once). Any idea how I can improve this?
Edit: example pseudocode:
input = open("C:\\input.txt", 'r').read()
text = input.split()
temp_dict = {}
for i, word in enumerate(text):
    if word in temp_dict:
        text[i] = something()
    else:
        temp_dict[word] = hash_function()

print(temp_dict, file=...)
print(text, file=...)
Don't read the entire file into memory; do something like this instead:

with open("/input.txt", 'rU') as file:
    index_dict = {}
    for line in file:
        for word in line.split():
            # note: tell() reports the buffered read position when iterating,
            # so this offset is approximate
            index_dict.setdefault(word, []).append(file.tell() + line.find(word))
To break it down: open the file with a context manager, so that if you hit an error the file is automatically closed for you. I also changed the path to work on Unix, and added the U flag for universal newline mode.
with open("/input.txt",'rU') as file:
Since semantically, an index is a list of words keyed to their location, I'm changing the dict to index_dict:
index_dict = {}
Using the file object directly as an iterator avoids reading the entire file into memory:
for line in file:
Then we can split the line and iterate by word:
for word in line.split():
and using the dict.setdefault method, we append the word's location to its list, creating an empty list first if the key isn't already there:
index_dict.setdefault(word, []).append(file.tell() + line.find(word))
Does that help?
I would recommend simply using a database instead of a dictionary. In its simplest form, a database is a disk-based data structure that is meant to span several gigabytes.
You can have a look at sqlite3 or SQLAlchemy for instance.
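A minimal sketch with the standard sqlite3 module (the table and column names are illustrative, not from the question):

import sqlite3

conn = sqlite3.connect("index.db")  # lives on disk, not in RAM
conn.execute("CREATE TABLE IF NOT EXISTS idx (word TEXT PRIMARY KEY, val TEXT)")

def add_word(word, value):
    conn.execute("INSERT OR IGNORE INTO idx VALUES (?, ?)", (word, value))

def lookup(word):
    row = conn.execute("SELECT val FROM idx WHERE word = ?", (word,)).fetchone()
    return row[0] if row else None

add_word("example", "deadbeef")
print(lookup("example"))  # deadbeef
conn.commit()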
Additionally, you probably don't want to load the whole input file in memory at once either.
I'm trying to replace the zeros with a value. So far this is my code, but what do I do next?
g = open("January.txt", "r+")
for i in range(3):
dat_month = g.readline()
Month: January
Item: Lawn
Total square metres purchased:
0
monthly value = 0
You could do that -- but it is not the usual approach, and certainly is not the correct approach for text files.
The correct way is to write another file, with the information you want updated in place, and then rename the new file over the old one. That is the only sane way of doing this with text files, since the size in bytes of the fields is variable.
As for the impression that you are "writing 200 bytes to the disk" instead of a single byte: don't let that fool you. At the operating-system level, all file access has to be done in blocks, which are usually a couple of kilobytes long (in special cases, and on tuned filesystems, a couple of hundred bytes). A user-space program, much less one in a high-level language like Python, will never trigger a disk write of less than a few hundred bytes.
Now, for the code:
import os

my_number = <number you want to place in the line you want to rewrite>

with open("January.txt", "r") as in_file, open("newfile.txt", "w") as out_file:
    for line in in_file:
        if line.strip() == "0":
            out_file.write(str(my_number) + "\n")
        else:
            out_file.write(line)

os.unlink("January.txt")
os.rename("newfile.txt", "January.txt")
So - that is the general idea -
Of course, you should not write code with all the values hardcoded in that way (i.e. the values to be checked and written, and the filenames, fixed in the program code).
As for the with statement: it is a language construct which is very appropriate for opening files and manipulating them in a block, as in this case, but it is not strictly required.
Programming aside, the concept you have to keep in mind is this:
When you use an application that lets you edit a text file, a spreadsheet, or an image, you as a user may have the impression that after you are done and have saved your work, the updates are committed to the same file. In the vast, vast majority of cases that is not what happens: the application internally uses a pattern like the one presented above; a completely new file is written to disk and the old one is deleted or renamed. The few exceptions might be simple database applications, which could replace fixed-width fields inside the file itself on updates. Modern databases certainly do not do that, resorting instead to appending the most recent, updated information to the end of the file. PDF files are another format that was not designed to be rewritten entirely on each update: but there too, the updated information is written at the end of the file, even if the update applies to a page at the beginning of the rendered document.
dat_month = dat_month.replace("0", "45678")
To write to a file you do:
with open("Outfile.txt", "wt") as outfile:
And then
outfile.write(dat_month)
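Put together, a sketch (45678 stands in for your real value). Note that a bare replace("0", ...) would also hit zeros inside other numbers, so matching whole lines is safer:

with open("January.txt") as infile:
    lines = infile.readlines()

# Replace only lines that consist of a bare 0.
lines = ["45678\n" if line.strip() == "0" else line for line in lines]

with open("Outfile.txt", "wt") as outfile:
    outfile.writelines(lines)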
Try this:
import fileinput
import itertools
import sys

with fileinput.input('January.txt', inplace=True) as file:
    beginning = tuple(itertools.islice(file, 3))
    sys.stdout.writelines(beginning)
    sys.stdout.write(next(file).replace('0', 'a value'))
    sys.stdout.write(next(file).replace('0', 'a value'))
    sys.stdout.writelines(file)