I have several TB of image data, currently stored in many HDF5 files with PyTables, one file per frame. Each file contains two groups, "LabelData" and "SensorData".
I have created a single (small) file that holds all the file names and some metadata, and with the help of that file I can open any needed HDF5 data in a Python generator.
This gives me a lot of flexibility, but it seems quite slow, as every single file has to be opened and closed.
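Roughly, the generator looks something like this (a simplified sketch; the metadata lookup is omitted and the names are illustrative):

import tables as tb

def frame_generator(filenames):
    # filenames come from the small metadata file described above
    for filepath in filenames:
        with tb.open_file(filepath, 'r') as f:
            # the file stays open only while the caller processes this frame
            yield f.root.LabelData, f.root.SensorData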
Now I wanted to create a single HDF5 file with external links to the other files; would that speed up the process?
As I understand it, creating external links requires creating a node for each link. However, I get the following performance warning:
PerformanceWarning: group / is exceeding the recommended maximum number of children (16384); be ready to see PyTables asking for lots of memory and possibly slow I/O.
This is how I have created the file:
import tables as tb

def createLinkFile(linkfile, filenames, linknames):
    # Create a new file
    f1 = tb.open_file(linkfile, 'w')
    for filepath, linkname in zip(filenames, linknames):
        data = f1.create_group('/', linkname)
        # create an external link
        f1.create_external_link(data, 'LabelData', filepath + ':/LabelData')
        f1.create_external_link(data, 'SensorData', filepath + ':/SensorData')
    f1.close()
Is there a better way?
I have millions of files being created each hour. Each file has one line of data. These files need to be merged into a single file.
I have tried doing this in the following way:
Using aws s3 cp to download the files for the hour.
Using a bash command to merge the files,
OR
Using a Python script to merge the files.
This hourly job runs in Airflow on Kubernetes (EKS). It takes more than one hour to complete and is creating a backlog. Another problem is that it often causes the EC2 node to stop responding due to high CPU and memory usage. What is the most efficient way of running this job?
The Python script, for reference:
from os import listdir
import sys
# from tqdm import tqdm

files = listdir('./temp/')
dest = sys.argv[1]

data = []
tot_len = len(files)
percent = tot_len // 100

for i, file in enumerate(files):
    if i % percent == 0:
        print(f'{i / percent}% complete.')
    with open('./temp/' + file, 'r') as f:
        d = f.read()
        data.append(d)

result = '\n'.join(data)
with open(dest, 'w') as f:
    f.write(result)
A scalable and reliable method would be:
Configure the Amazon S3 bucket to trigger an AWS Lambda function whenever a new file arrives.
Within the AWS Lambda function, read the contents of the file and send it to an Amazon Kinesis Data Firehose stream, then delete the input file (see the sketch at the end of this answer).
Configure the Amazon Kinesis Data Firehose stream to buffer the input data and output a new file based either on time period (up to 15 minutes) or on data size (up to 128 MB).
See: Amazon Kinesis Data Firehose Data Delivery - Amazon Kinesis Data Firehose
This will not produce one file per hour -- the number of files will depend upon the size of the incoming data.
If you need to create hourly files, you could consider using Amazon Athena on the output files from Firehose. Athena allows you to run SQL queries on files stored in Amazon S3, so if the input files contain a date column, it can select only the data for a specific hour. (You could code the Lambda function to add a date column for this purpose.)
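For illustration, here is a minimal sketch of such a Lambda handler (untested; the delivery stream name is a placeholder and the event parsing assumes the standard S3 put-event structure):

import boto3

s3 = boto3.client('s3')
firehose = boto3.client('firehose')

STREAM_NAME = 'merged-records'  # placeholder Firehose delivery stream name

def lambda_handler(event, context):
    for record in event['Records']:
        bucket = record['s3']['bucket']['name']
        key = record['s3']['object']['key']
        # Read the one-line input file
        body = s3.get_object(Bucket=bucket, Key=key)['Body'].read()
        if not body.endswith(b'\n'):
            body += b'\n'
        # Forward the line to Kinesis Data Firehose, which buffers and writes the merged output
        firehose.put_record(DeliveryStreamName=STREAM_NAME, Record={'Data': body})
        # Delete the input file once it has been forwarded
        s3.delete_object(Bucket=bucket, Key=key)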
I expect you should very seriously consider following the ideas in the AWS-specific answer you got. I'd add this response as a comment, except there's no way to show indented code coherently in a comment.
With respect to your Python script, you're building a giant string with a number of characters equal to the total number of characters across all your input files. So of course the memory use will grow at least that large.
Much less memory-intensive is to write out each file's content immediately upon reading it (note this code is untested - it may have a typo I'm blind to):
with open(dest, 'w') as fout:
    for i, file in enumerate(files):
        if i % percent == 0:
            print(f'{i / percent}% complete.')
        with open('./temp/' + file, 'r') as fin:
            fout.write(fin.read())
And one more thing to try if you pursue this: open the files in binary mode instead ('wb' and 'rb'). That may save useless layers of text-mode character decoding/encoding. I assume you just want to paste the raw bytes together.
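For example, a sketch of the same loop in binary mode (again untested); copying in chunks with shutil.copyfileobj is an extra choice beyond the binary-mode suggestion, and it avoids holding even a single file fully in memory:

import shutil

with open(dest, 'wb') as fout:
    for i, file in enumerate(files):
        if percent and i % percent == 0:
            print(f'{i // percent}% complete.')
        with open('./temp/' + file, 'rb') as fin:
            # stream the raw bytes across without decoding/encoding
            shutil.copyfileobj(fin, fout)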
Putting this out there in case someone else needs it.
I optimized the merging code to the best of my ability, but the bottleneck was still reading/downloading the S3 files, which is pretty slow even with the official AWS CLI.
I found s5cmd, which is pretty fast because it makes full use of multiprocessing and multithreading, and it solved my problem.
Link: https://github.com/peak/s5cmd
I'm developing a Python app that deals with big objects, and to avoid filling the PC's RAM while executing, I chose to store my temporary objects (created at one step, used by the next step) in files with the pickle module.
While trying to optimize memory consumption, I saw a behaviour that I don't understand.
In the first case, I open my temp file, then I loop over the actions I need, and during the loop I regularly dump objects into the file. It works well, but as the file handle remains open, it consumes a lot of memory. Here is the code example:
tmp_file_path = "toto.txt"
with open(tmp_file_path, 'ab') as f:
    p = pickle.Pickler(f)
    for filepath in self.file_list:  # loop over files to be treated
        try:
            my_obj = process_file(filepath)
            storage_obj = StorageObj()
            storage_obj.add(os.path.basename(filepath), my_obj)
            p.dump(storage_obj)
            [...]
In the second case, I only open my temp file when I need to write to it:
tmp_file_path = "toto.txt"
for filepath in self.file_list:  # loop over files to be treated
    try:
        my_obj = process_file(filepath)
        storage_obj = StorageObj()
        storage_obj.add(os.path.basename(filepath), my_obj)
        with open(tmp_file_path, 'ab') as f:
            p = pickle.Pickler(f)
            p.dump(storage_obj)
            [...]
The code in the two versions is the same except for the block:
with open(tmp_file_path, 'ab') as f:
    p = pickle.Pickler(f)
which moves inside/outside the loop.
And for the unpickling part:
with open("toto.txt", 'rb') as f:
    try:
        u = pickle.Unpickler(f)
        storage_obj = u.load()
        while storage_obj:
            process_my_obj(storage_obj)
            storage_obj = u.load()
    except EOFError:
        pass
When I run both versions, the first case has high memory consumption (because the temp file remains open during the whole processing, I guess) and in the end, with a given set of inputs, the application finds 622 elements in the unpickled data.
In the second case, memory consumption is far lower, but in the end, with the same inputs, the application finds only 440 elements in the unpickled data, and sometimes crashes with random errors during the Unpickler.load() method (for example AttributeError, but it's not always reproducible and not always the same error).
With an even bigger set of inputs, the first version often crashes with a memory error, so I'd like to use the second version, but it doesn't seem to save all my objects correctly.
Does anyone have an idea why the two behave differently?
Maybe opening / dumping / closing / reopening / dumping / etc. a file in my loop doesn't guarantee the content that is dumped?
EDIT 1:
All the pickling is done in a multiprocessing context, with 10 processes each writing to their own temp file, and the unpickling is done by the main process, reading each temp file created.
EDIT 2:
I can't provide a fully reproducible example (company code), but the processing consists of parsing C files (the process_file method, based on the pycparser module) and generating an object representing the C file's content (fields, functions, etc.) -> my_obj, then storing my_obj in an object (StorageObj) that has a dict as an attribute, containing the my_obj object with the file it was extracted from as the key.
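To make the structure concrete, a stripped-down stand-in for StorageObj would look roughly like this (illustrative only, not the real company code):

class StorageObj:
    def __init__(self):
        self.content = {}  # maps the source C file name -> the parsed representation (my_obj)

    def add(self, filename, my_obj):
        self.content[filename] = my_obj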
Thanks in advance if anyone finds the reason behind this, or can suggest a way to work around it :)
This has nothing to do with the file. It is that you are using a common Pickler, which retains its memo table.
The example that does not have the issue creates a new Pickler with a fresh memo table and lets the old one be garbage-collected, effectively clearing the memo table.
But that doesn't explain why, when I create multiple Picklers, I end up retrieving less data than with only one.
That is because you have written multiple pickles to the same file, and a method that reads only one pickle reads just the first: closing and reopening the file resets the file offset. When you read multiple objects from the same open file, each call to load advances the file offset to the start of the next object.
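If it helps, here is an untested sketch of a third variant: it opens the file only once, like your first version, but avoids the memory growth by clearing the Pickler's memo table after each dump (clear_memo() is part of the standard pickle API):

import os
import pickle

with open(tmp_file_path, 'ab') as f:
    p = pickle.Pickler(f)
    for filepath in self.file_list:  # same loop as in your first version
        my_obj = process_file(filepath)
        storage_obj = StorageObj()
        storage_obj.add(os.path.basename(filepath), my_obj)
        p.dump(storage_obj)
        p.clear_memo()  # drop the memo table so it doesn't grow across dumps

Your existing reading loop then works unchanged, since each call to u.load() simply advances to the next pickled object in the same file.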
I use multiple Python scripts that collect data and write it into one single JSON data file.
It is not possible to combine the scripts.
The writing process is fast, and it often happens that errors occur (e.g. some characters at the end get duplicated), which is fatal, especially since I am using the JSON format.
Is there a way to prevent a Python script from writing into a file while other scripts are currently trying to write into it? (It would be absolutely OK if the data the script tries to write gets lost, but it is important that the file syntax does not get corrupted.)
Code snippet:
This opens the file and retrieves the data:
data = json.loads(open("data.json").read())
This appends a new dictionary:
data.append(new_dict)
And the old file is overwritten:
open("data.json","w").write( json.dumps(data) )
Info: data is a list which contains dicts.
Operating system: the whole process takes place on a Linux server.
On Windows, you could try to create the file and bail out if an exception occurs (because the file is locked by another script), but on Linux your approach is bound to fail.
Instead, I would:
write one file per new dictionary, suffixing the filename with the process ID and a counter
have the consuming process(es) read not a single file, but the files sorted by modification time, and build the data from them
So in each script:
filename = "data_{}_{}.json".format(os.getpid(), counter)
counter += 1
open(filename, "w").write(json.dumps(new_dict))
and in the consumer(s), reading each dict from the sorted files in a protected loop:
import glob, json, os

files = sorted(glob.glob("*.json"), key=os.path.getmtime)
data = []
for f in files:
    try:
        with open(f) as fh:
            data.append(json.load(fh))
    except Exception:
        # IO error or malformed json file: ignore
        pass
I will post my own solution, since it works for me:
Every single Python script checks (before opening and writing the data file) whether a file called data_check exists. If so, the script does not try to read and write the data file and dismisses the data that was supposed to be written. If not, the script creates the file data_check and then starts to read and write the data file. After the writing is done, the file data_check is removed.
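Sketched out, it looks roughly like this (untested; note that a separate "does data_check exist?" test followed by a create is not atomic, so this sketch creates the check file with os.O_CREAT | os.O_EXCL, which fails atomically if it already exists):

import json
import os

CHECK_FILE = "data_check"

def try_append(new_dict, path="data.json"):
    try:
        # create data_check only if it does not already exist
        fd = os.open(CHECK_FILE, os.O_CREAT | os.O_EXCL)
    except FileExistsError:
        return False  # another script is writing: dismiss the data
    try:
        data = json.loads(open(path).read())
        data.append(new_dict)
        open(path, "w").write(json.dumps(data))
        return True
    finally:
        os.close(fd)
        os.remove(CHECK_FILE)  # writing done: remove data_check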
I am using the 64-bit version of Enthought Python to process data across multiple HDF5 files, with h5py version 1.3.1 (HDF5 1.8.4) on 64-bit Windows.
I have an object that provides a convenient interface to my specific data hierarchy, but testing h5py.File(fname, 'r') independently yields the same results. I am iterating through a long list (~100 files at a time) and attempting to pull out specific pieces of information from the files. The problem I'm having is that I'm getting the same information out of several files! My loop looks something like:
files = glob(r'path\*.h5')
out_csv = csv.writer(open('output_file.csv', 'wb'))
for filename in files:
    handle = hdf5.File(filename, 'r')
    data = extract_data_from_handle(handle)
    for row in data:
        out_csv.writerow((filename, ) + row)
When I inspect the files using something like hdfview, I know the internals are different. However, the csv I get seems to indicate that all the files contain the same data. Has anyone seen this behavior before? Any suggestions where I could go to start debugging this issue?
I've concluded that this is a strange manifestation of "Perplexing assignment behavior with h5py object as instance variable". I rewrote my code so that each file is handled within a function call and the variable is not reused. With this approach I don't see the same strange behavior, and it seems to work much better. For clarity, the solution looks more like:
files = glob(r'path\*.h5')
out_csv = csv.writer(open('output_file.csv', 'wb'))

def extract_data_from_filename(filename):
    return extract_data_from_handle(hdf5.File(filename, 'r'))

for filename in files:
    data = extract_data_from_filename(filename)
    for row in data:
        out_csv.writerow((filename, ) + row)
I am trying to build an offline Wiktionary using the Wikimedia dump files (.xml.bz2) with Python. I started with this article as the guide. It involves a number of languages, but I wanted to combine all the steps into a single Python project. I have found almost all the libraries required for the process. The only hump now is to effectively split the large .xml.bz2 file into a number of smaller files for quicker parsing during search operations.
I know that the bz2 library exists in Python, but it provides only compress and decompress operations. I need something that can do what bz2recover does from the command line, which splits large files into a number of smaller chunks.
One more important point: the splitting shouldn't split page contents, which start with <page> and end with </page>, in the XML document that has been compressed.
Is there an existing library that can handle this situation, or does the code have to be written from scratch? (Any outline/pseudocode would be greatly helpful.)
Note: I would like to make the resulting package cross-platform compatible, hence I can't use OS-specific commands.
In the end I wrote a Python script myself:
import os
import bz2

def split_xml(filename):
    '''The function gets the filename of the wiktionary.xml.bz2 file as input and creates
    smaller chunks of it in the directory "chunks".
    '''
    # Check for and create the chunk directory
    if not os.path.exists("chunks"):
        os.mkdir("chunks")
    # Counters
    pagecount = 0
    filecount = 1
    # Open the first chunk file in write mode
    chunkname = lambda filecount: os.path.join("chunks", "chunk-" + str(filecount) + ".xml.bz2")
    chunkfile = bz2.BZ2File(chunkname(filecount), 'w')
    # Read line by line
    bzfile = bz2.BZ2File(filename)
    for line in bzfile:
        chunkfile.write(line)
        # the </page> tag marks the end of a wiki page
        if '</page>' in line:
            pagecount += 1
            if pagecount > 1999:
                #print chunkname(filecount)  # For debugging
                chunkfile.close()
                pagecount = 0   # reset pagecount
                filecount += 1  # increment the file counter
                chunkfile = bz2.BZ2File(chunkname(filecount), 'w')
    try:
        chunkfile.close()
    except:
        print 'Files already closed'

if __name__ == '__main__':
    # When the script is run directly
    split_xml('wiki-files/tawiktionary-20110518-pages-articles.xml.bz2')
Well, if you have a command-line tool that offers the functionality you are after, you can always wrap it in a call using the subprocess module.
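For instance (a rough sketch; this assumes bz2recover is installed and on the PATH):

import subprocess

# run bz2recover on the dump; it writes the recovered blocks as separate rec*.bz2 files next to the input
subprocess.check_call(['bz2recover', 'wiki-files/tawiktionary-20110518-pages-articles.xml.bz2'])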
The method you are referencing is quite a dirty hack :)
I wrote an offline Wikipedia tool and just SAX-parsed the dump completely. The throughput is usable if you just pipe the uncompressed XML into stdin from a proper bzip2 decompressor, especially if it's only the Wiktionary.
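A rough sketch of that pipe setup (untested; the handler is just a stub and an external bzcat is assumed to be available):

import subprocess
import xml.sax

class PageHandler(xml.sax.ContentHandler):
    def startElement(self, name, attrs):
        if name == 'page':
            pass  # start collecting a page here

proc = subprocess.Popen(['bzcat', 'wiki-files/tawiktionary-20110518-pages-articles.xml.bz2'],
                        stdout=subprocess.PIPE)
xml.sax.parse(proc.stdout, PageHandler())  # stream the uncompressed XML straight into the SAX parser
proc.wait()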
As a simple approach for testing, I just compressed every page, wrote it into one big file, and saved the offset and length in a cdb (a small key-value store). This may be a valid solution for you.
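The offset/length idea looks roughly like this (untested sketch; iter_pages() is a placeholder for however you iterate over pages, and a pickled dict stands in for cdb, which is not in the standard library):

import bz2
import pickle

index = {}
with open('pages.dat', 'wb') as out:
    for title, page_xml in iter_pages():  # placeholder for your page source
        blob = bz2.compress(page_xml.encode('utf-8'))
        index[title] = (out.tell(), len(blob))  # where the compressed page starts and how long it is
        out.write(blob)

with open('pages.idx', 'wb') as f:
    pickle.dump(index, f)

def get_page(title):
    # seek to the stored offset, read exactly `length` bytes, decompress
    offset, length = index[title]
    with open('pages.dat', 'rb') as f:
        f.seek(offset)
        return bz2.decompress(f.read(length)).decode('utf-8')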
Keep in mind that the MediaWiki markup is the most horrible piece of sh*t I've come across in a long time, but in the case of the Wiktionary it might be feasible to handle.