So I have code that grabs a list of files from a directory that initially had over 14 millions files. This is a hex-core machine with 20 GB RAM running Ubuntu 14.04 desktop and just grabbing a list of files takes hours - I haven't actually timed it.
Over the past week or so I've run code that doesn't nothing more than gather this list of files, open each file to determine when it was created, and move it to a directory based on the month and year it was created. (The files have been both scp'd and rsync'd so the timestamp the OS provides is meaningless at this point, hence opening the file.)
When I first started running this loop it was moving 1000 files in about 90 seconds. Then after several hours like this that 90 seconds became 2.5 min, then 4, then 5, then 9, and eventually 15 min. So I shut it down and started over.
I noticed that today once it was done gathering a list of over 9 millions files that moving 1000 files took 15 min right off the bat. I just shut the process down again and rebooted the machine because the time to move 1000 files had climbed to over 90 min.
I had hoped to find some means of doing a while + list.pop() style strategy to free memory as the loop progressed. Then found a couple of SO posts that said it could be done with for i in list: ... list.remove(...) but that this was a terrible idea.
Here's the code:
from basicconfig.startup_config import *
arc_dir = '/var/www/data/visits/'
def step1_move_files_to_archive_dirs(files):
"""
:return:
"""
cntr = 0
for f in files:
cntr += 1
if php_basic_files.file_exists(f) is False:
continue
try:
visit = json.loads(php_basic_files.file_get_contents(f))
except:
continue
fname = php_basic_files.basename(f)
try:
dt = datetime.fromtimestamp(visit['Entrance Time'])
except KeyError:
continue
mYr = dt.strftime("%B_%Y")
# Move the lead to Monthly archive
arc_path = arc_dir + mYr + '//'
if not os.path.exists(arc_path):
os.makedirs(arc_path, 0777)
if not os.path.exists(arc_path):
print "Directory: {} was not created".format(arc_path)
else:
# Move the file to the archive
newFile = arc_path + fname
#print "File moved to {}".format(newFile)
os.rename(f, newFile)
if cntr % 1000 is 0:
print "{} files moved ({})".format(cntr, datetime.fromtimestamp(time.time()).isoformat())
def step2_combine_visits_into_1_file():
"""
:return:
"""
file_dirs = php_basic_files.glob(arc_dir + '*')
for fd in file_dirs:
arc_files = php_basic_files.glob(fd + '*.raw')
arc_fname = arc_dir + php_basic_str.str_replace('/', '', php_basic_str.str_replace(arc_dir, '', fd)) + '.arc'
try:
arc_file_data = php_basic_files.file_get_contents(arc_fname)
except:
arc_file_data = {}
for f in arc_files:
uniqID = moduleName = php_adv_str.fetchBefore('.', php_basic_files.basename(f))
if uniqID not in arc_file_data:
visit = json.loads(php_basic_files.file_get_contents(f))
arc_file_data[uniqID] = visit
php_basic_files.file_put_contents(arc_fname, json.dumps(arc_file_data))
def main():
"""
:return:
"""
files = php_basic_files.glob('/var/www/html/ver1/php/VisitorTracking/data/raw/*')
print "Num of Files: {}".format(len(files))
step1_move_files_to_archive_dirs(files)
step2_combine_visits_into_1_file()
Notes:
basicconfig is essentially a bunch of constants I have for the environment and a few commonly used libraries like all the php_basic_* libraries. (I used PHP for years before picking up Python so I built a library to mimic the more common functions I used in order to be up and running with Python faster.)
The step1 def is as far as the program gets so far. The step2 def could, and likely should, be run in parallel. However, I figured I/O was the bottleneck and doing even more of it in parallel would likely slow all functions down a lot more. (I have been tempted to rsync the archive directories to another machine for aggregation thus getting parallel speed without the I/O bottleneck but figured the rsync would also be quite slow.)
The files themselves are all 3 Kb each so not very large.
----- Final Thoughts -------
Like I said, it doesn't appear, to me at least, that any data is being stored from each file opening. Therefore memory should not be an issue. However, I do notice that only 1.2 GB of RAM is being used right now and over 12 GB of was being used before. A big chunk of that 12 could be storing 14 million file names and paths. I've only just started the processing again so for next several hours python will be gathering a list of files and that list isn't in memory yet.
So I was wondering if there was a garbage collection issue or something else I was missing. Why is it slowing down as it progresses through the loop?
step1_move_files_to_archive_dirs:
Here's some reasons Step 1 might be taking longer than you expected...
The response to any exception during Step 1 is to continue to the next file. If you have any corrupted data files, they will stay in the filesystem forever, increasing the amount of work this function has to do next time (and the next, and the next...).
You are reading in every file and converting it from JSON to a dict, just to extract one date. So everything is read and converted at least once. If you control the creation of these files, it might be worth storing this value in the filename or in a separate index / log, so you don't have to go searching for that value again later.
If the input directories and output / archive directories are on separate filesystems, os.rename(f, newFile) can't just rename the file, but has to copy every byte from the source filesystem to the target filesystem. So either every file is near-instantaneously renamed, or every input file is slowly copied.
PS: It's weird that this function double-checks things like whether the input file still exists, or if os.makedirs worked, but then allows any exception from os.rename to crash you mid-loop.
step2_combine_visits_into_1_file:
All your file I/O is hidden inside that PHP library, but it looks to this PHP outsider like you're trying to store in RAM the contents of all the files in each subdirectory. Then, you accumulate all those contents inside some smaller number of archive files, while preserving (most of?) the data that was already there. Not only is that probably slow to begin with, it will get slower as time goes on.
Function code mostly replaced by comments:
file_dirs = # arch_dir/* --- Maybe lots, maybe only a few.
for fd in file_dirs:
arc_files = # arch_dir/subdir*.raw or maybe arch_dir/subdir/*.raw.
arc_fname = # subdir.arc
arc_file_data = # Contents of JSON file subdir.arc, as a dict.
for f in arc_files: # The *.raw files.
uniqID = # String based on f's filename.
if uniqID not in arc_file_data:
# Add to arc_file_data the uniqID key, and the
# _ entire contents_ of the .raw file as its value.
php_basic_files.file_put_contents # (...)
# Convert the arc_file_data dict into one _massive_ string,
# and replace the contents of the subdir.arc file.
Unless you have some maintenance job that periodically trims the *.arc files, you will eventually have the entire contents of all 14 million files (plus any older files) inside the *.arc files. Each of those .arc files gets read into a dict, converted to a mega-string, grown (probably), and then written back to the filesystem. That's a ton of I/O, even if the average .arc file isn't very big (which can only happen if there are lots of them).
Why do all this anyway? By the start of Step 2, you've already got a unique ID for each .raw input file, and it's already in the filename --- so why not use the filesystem itself to store /arch_dir/subdir/unique_id.json?
If you really do need all this data in a few huge archives, that shouldn't require so much work. The .arc files are little more than the unaltered contents of the .raw files, with bits of a JSON dictionary between them. A simple shell script could slap that together without ever interpreting the JSON itself.
(If the values are not just JSON but quoted JSON, you would have to change whatever reads the .arc files to not un-quote those values. But now I'm purely speculating, since I can only see some of what's happening.)
PS: Am I missing something, or is arc_files a list of *.raw filenames. Shouldn't it be raw_files?
Other Comments:
As others have noted, if your file-globbing function returns a mega-list of 14 million filenames, it would be vastly more memory-efficient as a generator that can yield one filename at a time.
Finally, you mentioned popping filenames off a list (although I don't see that in your code)... There is a huge time penalty for inserting or removing the first element of a large list --- del my_list[0] or my_list.pop(0) or my_list.insert(0, something) --- because items 1 though n-1 all have to be copied one index toward 0. That turns an O(n) operation into O(n**2)... again, if that's in your code anywhere.
Related
Multiple sources lead me to believe that using a del statement is rarely necessary. However, I am working on a program that needs to read in huge files ( 6 GB) with filename getting picked up from from a list, do some transformation, write them to a datastore and pick up the next file.
For instance the reference variables buffer and processed will get overwritten with every iteration of loop - is there a point to deleting them explicitly?
files = list() # contains 1000 filenames
for file in files:
buffer = read_from_s3(file)
processed = process_data(buffer)
del buffer # needed?
write_to_another_s3(processed)
del processed # needed?
I have an automated process and need to perform some operations with files. Another process creates these files and stores them in a directory, I only need to work with recent files but have to leave these files in there and never delete them, because of the amount of files I think the process is starting to use a lot of resources when I get the files needed.
My initial idea was to create another process that copies the most recent files(with an extra day just to be sure) to another folder but I just wondering(or I'm sure hehe) if there's a better way to get these files without reading all of them or if my code can be optimized.
My main issue is that when I get to this part of the code, the CPU usage of the server is getting of the charts and I assume that at some point the process will just break due to some OS error. I just need to get the names of the files needed, which are the ones where the creation date is greater than the last file I used, Every time I perform an operation on a file the name goes to a table in a DB which is where I get the name of the last file. My issue isn't with the queries or the operations performed, the CPU usage it's minimum, just this part where I read all the files and compare the dates of them and add them to an array.
Here's my code(don't get to angry if it's horrendous) the heavy load starts after the for:
def get_ordered_files():
valid_files = []
epoch = datetime.datetime.utcfromtimestamp(0)
get_last_file = check_last_successful_file()
last_date = os.path.getctime(get_last_file)
files = glob.glob(files_location + file_extension)
files.sort(key=os.path.getctime, reverse=False)
for single_file in files:
total_days_file = datetime.datetime.fromtimestamp(os.path.getctime(single_file)) - epoch
total_days_last = datetime.datetime.fromtimestamp(last_date) - epoch
if total_days_file.total_seconds() > total_days_last.total_seconds():
check_empty = get_email_account(single_file)
if check_empty != "" and check_empty is not None:
valid_files.append(single_file)
return valid_files
Thank you very much for all your help(I'm using python 3.8).
There are a lot of redundant operations going on in your code.
For example, the use of fromtimestamp() to calculate total_days_last inside the loop can simply be done once outside of the loop. In fact, the use of datetime functions and mucking about with epoch seems unnecessary because you can simply compare the file ctime values directly.
os.path.getctime() is called twice on every file: once for the sort and a second time to calculate total_days_file.
These repetitive calculations over a large number of files would be part of the performance problem.
Another issue is that, if there are a large number of files, the list files could become very large and require a lot of memory.
if check_empty != "" and check_empty is not None: can simply be written as if check_empty:
Here is a simplified version:
def get_ordered_files():
last_ctime = os.path.getctime(check_last_successful_file())
files = glob.glob(files_location + file_extension)
files.sort(key=os.path.getctime)
return [f for f in files
if os.path.getctime(f) > last_ctime and get_email_account(f)]
This eliminates most of the redundant code but still calls os.path.getctime() twice for each file. To avoid that we can store the ctime for each file on the first occasion it is obtained.
pattern = os.path.join(files_location, file_extension)
def get_ordered_files():
last_ctime = os.path.getctime(check_last_successful_file())
files = ((filename, ctime) for filename in glob.iglob(pattern)
if (ctime := os.path.getctime(filename)) > last_ctime and
get_email_account(filename))
return (filename for filename, _ in sorted(files, key=itemgetter(1)))
Here a generator expression is assigned to files. It uses glob.iglob() which is an iterator version of glob.glob() that does not store all the files at once. Both the file name and its ctime value are stored as tuples. The generator expression filters out files that are too old and files that don't have an associated email account. Finally another generator is returned that sorts the files by ctime. The calling code can then iterate over the generator, or call list() on it to realise it as a list.
I'm developping a python app that deals with big objects, and to avoid filling the pc ram while executing, I chosed to store my temporary objects (created at one step, used by the next step) in files with pickle module.
While trying to optimize memory consumption, I saw a behaviour that I don't understand.
In the first case, I'm opening my temp file, then I loop to do the actions I need and during the loop I regularly dump objects in the file. It works well, but as the file pointer remains open, it consumes a lot of memory. Here is the code example :
tmp_file_path = "toto.txt"
with open(tmp_file_path, 'ab') as f:
p = pickle.Pickler(f)
for filepath in self.file_list: // loop over files to be treated
try:
my_obj = process_file(filepath)
storage_obj = StorageObj()
storage_obj.add(os.path.basename(filepath), my_obj)
p.dump(storage_obj)
[...]
In the second case I'm only opening my temp file when I need to write inside it :
tmp_file_path = "toto.txt"
for filepath in self.file_list: // loop over files to be treated
try:
my_obj = process_file(filepath)
storage_obj = StorageObj()
storage_obj.add(os.path.basename(filepath), my_obj)
with open(tmp_file_path, 'ab') as f:
p = pickle.Pickler(f)
p.dump(storage_obj)
[...]
The code between the two versions is the same except from the block :
with open(tmp_file_path, 'ab') as f:
p = pickle.Pickler(f)
which moves inside/outside the loop.
And for the unpickling part :
with open("toto.txt", 'rb') as f:
try:
u = pickle.Unpickler(f)
storage_obj = u.load()
while storage_obj:
process_my_obj(storage_obj)
storage_obj = u.load()
except EOFError:
pass
When I'm running both codes, in the first case I have a high memory consumption (due to the fact that temp file remains open during the treatment I guess) and in the end, with a set of inputs, the application finds 622 elements in the unpickled data.
In the second case, memory cunsumption is far lower, but in the end , with the same inputs, the application finds 440 elements in the unpickled data, and sometimes crashes with random errors during Unpickler.load() method (for exemple Attribute error, but it's not always reproductible and not always the same error).
With even bigger set of inputs, the first code example often crashes with memory error, so I'd like to use the second code example, but it seems that it doesn't succeed to save all my objects correctly.
Does anyone have an idea of the reason why there is differences between the two behaviour ?
Maybe opening / dumping / closing / reopening /dumping / etc a file in my loop doesn't garanty the content that is dumped ?
EDIT 1 :
All the pickling part is done in a multiprocessing context, with 10 processes writing in their own temp file, and the unpickling is done by the main process, by reading each temp file created.
EDIT 2 :
I can't provide a full reproductible example (company code), but the treatment consists of parsing C files (process_file method, based on pycparser module) and generating an object representing the C file content (fields, functions etc) -> my_obj. Then storing my_obj in an object (StorageObj) that has a a dict as attribute, containing the my_obj object with the file is was extracted from as key.
Thanks in advance if anyone finds the reason behind this, or suggest me a way around to avoid this :)
This has nothing to do with the file. It is that you are using a common Pickler which is retaining its memo table.
The example that does not have the issue creates a new Pickler with a fresh memo table and lets the old one be collected effectively clearing the memo table.
But that doesn't explain why when I create multiple Pickler I retrieve less data than with only one in the end.
Now that is because you have written multiple pickles to the same file and the method where you read one. Only reads the first. As closing and reopening the file resets the file offset. In the reading of multiple objects each time you call load advances the file offset to the start of the next object.
I have numerous large csv files (~400 MB each, need to process thousands of them, at least a hundred per program execution) containing long strings in the first cell of each row (about 100-300 characters per row for 1 million rows per file), and my Python program checks if a substring is in a given string. If so, then I append the row containing the string to a list, to be stored in another series of csv files after all the input files have been processed. For the first dozen input files, the program runs at about 20 seconds per file, which I am satisfied with.
The relevant portion of the code (the string-processing loop) looks as such:
check = set(['a','b','c'])
storage = []
data = glob.glob('data_address/*.csv')
for raw_file in data:
read_file = open(raw_file,'r',newline='',encoding='utf-8')
list_file = list(csv.reader((line.replace('\0','') for line in read_file), delimiter=","))
row_count = sum(1 for row in list_file)
for i in range(1, row_count);
text = set(list_file[i][0].split())
if len(check.intersection(text)) > 0:
storage.append(list_file[i])
The problem is that as the number of processed input files grows, I begin to have certain files that take much longer than 20 seconds. Furthermore, these anomalies take longer and longer to process - the first anomaly takes about 50 seconds to process, and towards the end of the loop, anomalies can take thousands of seconds to process, suggesting that the problem is with the loop itself rather than any individual file. These anomalies are not obviously different from the other files in terms of number of string matches.
What I don't understand is that the increase in processing time is not consistent. I still have plenty of 20-second files in between each anomaly, so it cannot be that the program is simply slowing down as memory storage increases. Does anyone have any idea what's going on? cProfile fails to show any component that might be causing the issue.
I use 64-bit Python 3.8 on Windows 10 with a 1TB hard drive, with about 10,000 MB active memory.
We don't yet have enough information to properly diagnose the issue, so in the meantime I thought I would try to improve the code you shared:
import csv
import pathlib
check = {'a', 'b', 'c'}
data_path = pathlib.Path("data_address")
saved_rows = []
for curr_path in data_path.glob("*.csv"):
with open(curr_path, newline='') as curr_file:
reader = csv.reader((line.replace('\0', '') for line in curr_file), delimiter=",")
for row in reader:
row_text = row[0].split()
if any(elem in check for elem in row_text):
saved_rows.append(row)
Although I'm not able to test it, it should work just fine.
this is similar to the question in merge sort in python
I'm restating because I don't think I explained the problem very well over there.
basically I have a series of about 1000 files all containing domain names. altogether the data is > 1gig so I'm trying to avoid loading all the data into ram. each individual file has been sorted using .sort(get_tld) which has sorted the data according to its TLD (not according to its domain name. sorted all the .com's together, .orgs together, etc)
a typical file might look like
something.ca
somethingelse.ca
somethingnew.com
another.net
whatever.org
etc.org
but obviosuly longer.
I now want to merge all the files into one, maintaining the sort so that in the end the one large file will still have all the .coms together, .orgs together, etc.
What I want to do basically is
open all the files
loop:
read 1 line from each open file
put them all in a list and sort with .sort(get_tld)
write each item from the list to a new file
the problem I'm having is that I can't figure out how to loop over the files
I can't use with open() as because I don't have 1 file open to loop over, I have many. Also they're all of variable length so I have to make sure to get all the way through the longest one.
any advice is much appreciated.
Whether you're able to keep 1000 files at once is a separate issue and depends on your OS and its configuration; if not, you'll have to proceed in two steps -- merge groups of N files into temporary ones, then merge the temporary ones into the final-result file (two steps should suffice, as they let you merge a total of N squared files; as long as N is at least 32, merging 1000 files should therefore be possible). In any case, this is a separate issue from the "merge N input files into one output file" task (it's only an issue of whether you call that function once, or repeatedly).
The general idea for the function is to keep a priority queue (module heapq is good at that;-) with small lists containing the "sorting key" (the current TLD, in your case) followed by the last line read from the file, and finally the open file ready for reading the next line (and something distinct in between to ensure that the normal lexicographical order won't accidentally end up trying to compare two open files, which would fail). I think some code is probably the best way to explain the general idea, so next I'll edit this answer to supply the code (however I have no time to test it, so take it as pseudocode intended to communicate the idea;-).
import heapq
def merge(inputfiles, outputfile, key):
"""inputfiles: list of input, sorted files open for reading.
outputfile: output file open for writing.
key: callable supplying the "key" to use for each line.
"""
# prepare the heap: items are lists with [thekey, k, theline, thefile]
# where k is an arbitrary int guaranteed to be different for all items,
# theline is the last line read from thefile and not yet written out,
# (guaranteed to be a non-empty string), thekey is key(theline), and
# thefile is the open file
h = [(k, i.readline(), i) for k, i in enumerate(inputfiles)]
h = [[key(s), k, s, i] for k, s, i in h if s]
heapq.heapify(h)
while h:
# get and output the lowest available item (==available item w/lowest key)
item = heapq.heappop(h)
outputfile.write(item[2])
# replenish the item with the _next_ line from its file (if any)
item[2] = item[3].readline()
if not item[2]: continue # don't reinsert finished files
# compute the key, and re-insert the item appropriately
item[0] = key(item[2])
heapq.heappush(h, item)
Of course, in your case, as the key function you'll want one that extracts the top-level domain given a line that's a domain name (with trailing newline) -- in a previous question you were already pointed to the urlparse module as preferable to string manipulation for this purpose. If you do insist on string manipulation,
def tld(domain):
return domain.rsplit('.', 1)[-1].strip()
or something along these lines is probably a reasonable approach under this constraint.
If you use Python 2.6 or better, heapq.merge is the obvious alternative, but in that case you need to prepare the iterators yourself (including ensuring that "open file objects" never end up being compared by accident...) with a similar "decorate / undecorate" approach from that I use in the more portable code above.
You want to use merge sort, e.g. heapq.merge. I'm not sure if your OS allows you to open 1000 files simultaneously. If not you may have to do it in 2 or more passes.
Why don't you divide the domains by first letter, so you would just split the source files into 26 or more files which could be named something like: domains-a.dat, domains-b.dat. Then you can load these entirely into RAM and sort them and write them out to a common file.
So:
3 input files split into 26+ source files
26+ source files could be loaded individually, sorted in RAM and then written to the combined file.
If 26 files are not enough, I'm sure you could split into even more files... domains-ab.dat. The point is that files are cheap and easy to work with (in Python and many other languages), and you should use them to your advantage.
Your algorithm for merging sorted files is incorrect. What you do is read one line from each file, find the lowest-ranked item among all the lines read, and write it to the output file. Repeat this process (ignoring any files that are at EOF) until the end of all files has been reached.
#! /usr/bin/env python
"""Usage: unconfuse.py file1 file2 ... fileN
Reads a list of domain names from each file, and writes them to standard output grouped by TLD.
"""
import sys, os
spools = {}
for name in sys.argv[1:]:
for line in file(name):
if (line == "\n"): continue
tld = line[line.rindex(".")+1:-1]
spool = spools.get(tld, None)
if (spool == None):
spool = file(tld + ".spool", "w+")
spools[tld] = spool
spool.write(line)
for tld in sorted(spools.iterkeys()):
spool = spools[tld]
spool.seek(0)
for line in spool:
sys.stdout.write(line)
spool.close()
os.remove(spool.name)