Parallel processing a set of XML files in Python

I need to process a set (~20) of relatively large (20-300MB) XML files (truncated example below) and I'm looking for a way to speed this up.
The files contain "events" (here named "datasets"), with several relevant properties, most importantly the UID. There are usually multiple events per UID.
I have a unique list of UIDs and for each of them I want to lookup the events and extract the UTCtime property. What would be the best way to parallelize this?
I've tried using threading (see below) but it didn't result in any noticeable speedup. I also tried multiprocessing, but I would need to pass the XML elements between processes, and I got errors about the elements not being 'pickleable'.
Thanks
Python approach using threading
import lxml.etree as et, threading, concurrent.futures, datetime as dt

uidList = ['B0 2B 5C 05 09 00 12 E0',
           'B0 2A 5C 05 09 00 12 E0',
           'AD 2A 5C 05 09 00 12 E0',
           '4F 2D 5C 05 09 00 12 E0']
uidList_split = [uidList[0:2], uidList[2:4]]
xPathFmt = 'Dataset/[UID = "{:s}"]'
timeFmt = '%m/%d/%Y %H:%M:%S.%f'

def thread_function(i):
    scanTimes = {}
    for uid in uidList_split[i]:
        scanTimes[uid] = []
        for e in root.iterfind(xPathFmt.format(uid)):
            scanTimes[uid].append(dt.datetime.strptime(e.findtext('UTCTime'), timeFmt))
    return scanTimes

tree = et.parse('test.xml')
root = tree.getroot()

with concurrent.futures.ThreadPoolExecutor(max_workers=2) as executor:
    scanTimes = list(executor.map(thread_function, range(2)))
scanTimes = {k: v for d in scanTimes for k, v in d.items()}

for uid in scanTimes:
    print(uid, scanTimes[uid])
Example XML file
<?xml version="1.0" encoding="utf-8" standalone="no"?>
<Datasets>
  <Dataset>
    <UTCTime>05/31/2019 03:44:27.737</UTCTime>
    <ReaderID>44252</ReaderID>
    <Address>1</Address>
    <UID>B0 2B 5C 05 09 00 12 E0</UID>
    <ScanCount>1</ScanCount>
    <Type>177</Type>
  </Dataset>
  <Dataset>
    <UTCTime>05/12/2019 02:46:22.737</UTCTime>
    <ReaderID>44252</ReaderID>
    <Address>1</Address>
    <UID>B0 2B 5C 05 09 00 12 E0</UID>
    <ScanCount>1</ScanCount>
    <Type>177</Type>
  </Dataset>
  <Dataset>
    <UTCTime>05/31/2019 03:44:34.215</UTCTime>
    <ReaderID>44251</ReaderID>
    <Address>2</Address>
    <UID>B0 2A 5C 05 09 00 12 E0</UID>
    <ScanCount>1</ScanCount>
    <Type>177</Type>
  </Dataset>
  <Dataset>
    <UTCTime>05/31/2019 04:16:56.957</UTCTime>
    <ReaderID>44252</ReaderID>
    <Address>1</Address>
    <UID>AD 2A 5C 05 09 00 12 E0</UID>
    <ScanCount>1</ScanCount>
    <Type>177</Type>
  </Dataset>
  <Dataset>
    <UTCTime>05/31/2019 04:05:07.705</UTCTime>
    <ReaderID>44252</ReaderID>
    <Address>1</Address>
    <UID>4F 2D 5C 05 09 00 12 E0</UID>
    <ScanCount>1</ScanCount>
    <Type>177</Type>
  </Dataset>
</Datasets>

I figure if you are doing lots of lookups, e.g. thousands as you say, you should spend some time getting the data into a better structure for searching. So, I am suggesting parsing the XML into an "in-memory" data-structure and then doing lookups from memory. You may think it will take too much RAM, but if you look at a typical dataset entry in your XML file, you will see it has around 220 bytes, whereas you only really want around 30 bytes of UTCTime and UID, so it is going to be around 7x smaller.
I came up with 2 methods...
The first one uses xmltodict and loads the XML file into a Python dict. It takes around 18 seconds to load a 200MB dummy XML file on my Mac, but subsequent lookups then take just 3 microseconds each. Its advantage is that it is a standard, tested XML reader, so it should be robust; the downside is that it stores things you probably don't need, so it is heavier on memory.
The second method just parses the XML with Python regexes. It is about the same speed and uses less memory, but it is perhaps less robust.
#!/usr/bin/env python3

def method1():
    import xmltodict
    with open('file.xml') as fd:
        XML = xmltodict.parse(fd.read())
    # Lookup a UID
    for Dataset in XML['Datasets']['Dataset']:
        if Dataset['UID'] == "31 1e 24 81 82 71 6f 1d":
            print(Dataset)

def method2():
    import re
    # Compile the regexes to look for UID and UTCTime for better performance
    UIDre = re.compile("<UID>(.*)</UID>")
    UTCTimere = re.compile("<UTCTime>(.*)</UTCTime>")
    # Parse XML building a dict, indexed by UID, of lists of matching times.
    # Note: this relies on <UTCTime> appearing before <UID> within each <Dataset>.
    d = {}
    UTCTime = None
    with open('file.xml') as fp:
        for lineno, line in enumerate(fp):
            result = UIDre.search(line)
            if result is not None:
                UID = result.group(1)
                # print(f"UID:{UID}")
                if UID not in d:
                    d[UID] = []
                d[UID].append(UTCTime)
                continue
            result = UTCTimere.search(line)
            if result is not None:
                UTCTime = result.group(1)
                # print(f"UTCTime:{UTCTime}")
    # Do a lookup
    print(d["31 1e 24 81 82 71 6f 1d"])

method1()
method2()
In case anyone else fancies testing theories or methods, here is the code I used to generate a 200MB XML file with 1,000,000 dummy datasets:
#!/usr/bin/env python3
import random

print('<?xml version="1.0" encoding="utf-8" standalone="no"?>')
print('<Datasets>')
for d in range(1000000):
    ReaderID = random.randrange(65536)
    UID = "%02x" % random.randrange(256)
    UID += " %02x" % random.randrange(256)
    UID += " %02x" % random.randrange(256)
    UID += " %02x" % random.randrange(256)
    UID += " %02x" % random.randrange(256)
    UID += " %02x" % random.randrange(256)
    UID += " %02x" % random.randrange(256)
    UID += " %02x" % random.randrange(256)
    Type = random.randrange(65536)
    print(f"<Dataset>")
    print(f" <UTCTime>05/31/2019 04:05:07.705</UTCTime>")
    print(f" <ReaderID>{ReaderID}</ReaderID>")
    print(f" <Address>1</Address>")
    print(f" <UID>{UID}</UID>")
    print(f" <ScanCount>1</ScanCount>")
    print(f" <Type>{Type}</Type>")
    print(f"</Dataset>")
print('</Datasets>')
I then manually inserted the UID used in my code into the XML file with a regular editor, for test purposes.

Q : Perhaps I'm just doing it in an inefficient way?
Yes, this is the root-cause.
1st) the problem approach does not run as true-[PARALLEL] computing, but as a "just"-[CONCURRENT] class of process scheduling. Either worker may finish its work independently of the other one; that is not a true-[PARALLEL] flow of processing.
2nd) the use of python threads-based tools (since ever, up to and including 3.7) has never brought any efficient amount of [CONCURRENT]-computing, because of the internal python behaviour of the GIL-interpreter-LOCK.
All, and best repeat this in all caps, ALL THREADS inside the python interpreter sit and wait until, in a round-robin fashion, they acquire ownership of a singleton instance called the GIL-lock (simplified for brevity), and do nothing useful while waiting.
That means all threads wait while one (the one currently holding the GIL-lock ownership) does a small amount of computing, before it stops, releases the GIL-lock ownership, and the others try to gain it next.
This effectively re-[SERIAL]-ises any work loaded onto a pool of python threads into one-and-only-one pure-[SERIAL] sequence of operations, interleaved with small amounts of time spent outside the useful job, fighting for GIL-lock acquisition and release.
The GIL-lock monopoly, orchestrating a pure-[SERIAL] flow of work, has been an intentional part of the python interpreter's design since the beginning. This "strategy" prevents truly [CONCURRENT] execution of python bytecode from ever taking place.
This means one gains only adverse "speedup" effects (more time spent for the same work) from threads-based computing in python, unless the use-case involves such "immense" GUI/disk/network-I/O latencies that they can be masked by interleaving the threads' computing progress via the round-robin stepping of all the threads through the central GIL-lock dancing ball-room. Never anywhere else.
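As a quick, hedged illustration of that point (not from the original post; the work() function, worker counts and job sizes below are arbitrary placeholders), a CPU-bound task run through a thread pool finishes no faster than a serial run, while a process pool, which sidesteps the GIL at the cost of pickling arguments and results, can:

import time
import concurrent.futures as cf

def work(n):
    # purely CPU-bound: no I/O latency for the GIL round-robin to mask
    return sum(i * i for i in range(n))

if __name__ == '__main__':
    jobs = [2_000_000] * 8

    t0 = time.perf_counter()
    with cf.ThreadPoolExecutor(max_workers=4) as ex:
        list(ex.map(work, jobs))
    print('threads  :', time.perf_counter() - t0)   # roughly the serial time

    t0 = time.perf_counter()
    with cf.ProcessPoolExecutor(max_workers=4) as ex:
        list(ex.map(work, jobs))
    print('processes:', time.perf_counter() - t0)   # scales with the core count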
Last, but not least :
Your iteration strategy seems devastatingly wrong as well. Scanning the same file top-to-bottom as many times as there are items in the outer control-list is nowhere near a cost-effective strategy:
def thread_function(i):                          ############## PERFORMANCE-WISE ANTI-PATTERN [DO NOT DO]
    scanTimes = {}                               #-- set empty dict{}
    for uid in uidList_split[i]:                 #-- len(uidList_split[i])-times re-run (!!!!)
        scanTimes[uid] = []                      #-- add an empty [] for the next uid
        for e in root.iterfind(xPathFmt.format(uid)):           # each time start
            scanTimes[uid].append(dt.datetime.strptime(         # from root
                e.findtext('UTCTime'), timeFmt)                 # .iterfind()
            )                                                   # again and again
    return scanTimes
It is far better to iterate once-and-only-once over the slowest resource (the disk-file I/O) and test each element found by a non-UID-specific XPath for a UID-match against the uidList_split list, at a cost way, way cheaper than re-iterating an expensive, disk-file-I/O dependent (yet static) file more than once. Appending to a UID found ad hoc is as easy as:
if uid2seek not in scanTimes.keys():
    scanTimes[uid2seek] = [ ..., ]       # add a not-yet-found UID
else:
    scanTimes[uid2seek].append( ... )    # for already "visited" UID(s)
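A minimal sketch of that single-pass idea, using lxml's iterparse so each <Dataset> is streamed once and then discarded (the names uidList, timeFmt and 'test.xml' follow the question; the set and the defaultdict are additions):

import collections
import datetime as dt
import lxml.etree as et

uidSet = set(uidList)                        # the UIDs to look up
timeFmt = '%m/%d/%Y %H:%M:%S.%f'
scanTimes = collections.defaultdict(list)

# single pass over the slow resource: stream <Dataset> elements as they are parsed
for _, elem in et.iterparse('test.xml', tag='Dataset'):
    uid = elem.findtext('UID')
    if uid in uidSet:                        # cheap in-memory membership test
        scanTimes[uid].append(dt.datetime.strptime(elem.findtext('UTCTime'), timeFmt))
    elem.clear()                             # free the element, keeping memory flat

With ~20 files, running one such single-pass scan per file in separate processes (e.g. multiprocessing.Pool.map over the file names, returning plain dicts of datetimes, which pickle fine) avoids both the GIL and the element-pickling problem.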

Related

What is a faster way to read binary as uint8, create sliding window of uint32, and find indices where the uint32 value == x?

I have some binary data to parse, and need to find where packets start. All packets start with the same header, but packet size is variable. The header is a 32 bit unsigned integer.
Below is my implementation, but it's slow. Is there some numpy functionality or other options to make this operation faster?
"""Example of binary data:
d9 37 b2 a5 08 31 03 ... 46 00 00 01 b9 1e 43 ... d9 37 b2 a5 30 90 06 00 cb... 08 00 30 43 d9 37 b2 a5 ... 04 01 c8 f4 ...
"""
import itertools
import struct
import numpy as np

def sliding_window(iterable, n=2):
    """Return a zipped object where each item is a sliding group of n elements from iterable.
    Example (n=4):
        in  = [1,2,3,4,5,6,7,8]
        out = [[1,2,3,4],[2,3,4,5],[3,4,5,6],[4,5,6,7],[5,6,7,8]]
    """
    iterables = itertools.tee(iterable, n)
    for iterable, num_skipped in zip(iterables, itertools.count()):
        for _ in range(num_skipped):
            next(iterable, None)
    return zip(*iterables)

packet_header = 0xd937b2a5
dat_file = "path/to/file"
dat = np.fromfile(file=dat_file, dtype=np.uint8)
sw_u8s = sliding_window(dat, 4)

# this is really slow
sw_u32s = [struct.unpack('>I', bytes(bb)) for bb in sw_u8s]

# then do something like
# packet_start = np.argwhere(sw_u32s == packet_header)
# to find the indices of the packet headers
Read your array as np.uint8:
import numpy as np
# -v- packet header starts at byte 3
data_bytes = bytes([0x32, 0x12, 0x8a, 0xd9, 0x37, 0xb2, 0xa5, 0xf8, 0x3d])
dat = np.frombuffer(data_bytes, dtype=np.uint8)
Make a view of "windowed int32" values:
dat_32win = np.ndarray((len(dat) - 3,), dtype=np.uint32, buffer=dat, strides=(1,))
Now take your packet header in the byte order of your machine. In the likely case that it is little endian, you need to reverse the order of the bytes (you can do this programmatically with int.to_bytes and int.from_bytes, but it doesn't seem worth the trouble):
packet_header_le = 0xa5b237d9
Now you just need to find the index of that value:
idx = np.argmax(dat_32win == packet_header_le)
print(idx)
# 3
One note: if the packet header is not in the byte sequence, then that last np.argmax would return 0 (because its argument would be an array full of False values), which is the same result you would get if the packet header started at the first byte. You may want to handle that error condition by checking that the packet header value is actually there.
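For example, a small, hedged sketch of that check, which also collects every header position instead of only the first (np.flatnonzero is the only addition to the answer's code above):

# indices of all matching windows; empty array if the header never occurs
idx_all = np.flatnonzero(dat_32win == packet_header_le)
if idx_all.size == 0:
    raise ValueError("packet header not found in data")
print(idx_all)          # -> [3] for the example bytes above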

How to write a command to produce canbus data in a particular format?

I am writing a python program to produce canbus data in the following format.
<0x18eeff01> [8] 05 a0 be 1c 00 a0 a0 c0
I am using the python-can library for this and trying to produce the message format shown above. I couldn't figure out what the first field, <0x18eeff01>, indicates, and I don't know how to produce it in the output.
try:
    for i in range(0, 200):
        msg = bus.recv(timeout=1)
        print("------")
        data = "{} [{}]".format(msg.channel, msg.dlc)
        for i in range(0, msg.dlc):
            data += " {}".format(msg.data[i])
        print(data)
        # Timestamp, Prio, PGN, src, dest, len, data
except can.CanError:
    print("error")
finally:
    bus.shutdown()
    f.close()
Following is the output of this code:
[8] 05 a0 be 1c 00 a0 a0 c0
How can I produce whole string of the data as mentioned earlier?
0x18eeff01 is the arbitration id in hex form. You can get it with msg.arbitration_id.
See here
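For instance, a minimal sketch of building the whole line in that format (the field widths are assumptions inferred from the sample output; msg is a python-can Message, as in the question's loop):

# "<0x18eeff01> [8] 05 a0 be 1c 00 a0 a0 c0"
data = "<0x{:08x}> [{}]".format(msg.arbitration_id, msg.dlc)
data += "".join(" {:02x}".format(b) for b in msg.data)
print(data)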

rand and srand implementation in python

I need an implementation of rand and srand from C++ in Python to re-encrypt a bunch of files, but I can't seem to get it right.
I have an exe that un-encrypts a file into text, and I also have the source code; after editing the file I need to encrypt it again using the same method.
Since I don't know how to write C++ code, I opted to write it in Python, trying first to decrypt a file to confirm the method is the same.
The following is the C++ code that un-encrypted the files, where "buff" is the beginning of the encrypted block and "len" is the length of said block.
static const char KEY[] = "KF4QHcm2";
static const unsigned long KEY_LEN = sizeof(KEY) - 1;

void unobfuscate(unsigned char* buff, unsigned long len) {
    srand((char)buff[len - 1]);
    for (unsigned long i = 0; i < len - 2; i += 2) {
        buff[i] ^= KEY[rand() % KEY_LEN];
    }
}
From what I understand it takes the last character of the encrypted block as the seed, and from the beginning every 2 bytes it xors the value with an element of the KEY array, this index is determined by the remainder of a random number divided by the KEY length.
Searching around the web, I found that C++ uses a simple Linear Congruential Generator that shouldn't be used for encryption, but I can't seem to make it work.
I found one example implementation and tried to write my own, but neither seems to work.
# My try at implementing it
def rand():
    global seed
    seed = ((seed * 1103515245) + 12345) & 0x7FFFFFFF
    return seed
I also read that rand returns a value between 0 and RAND_MAX, but I can't find the value of RAND_MAX; if I knew it, maybe random.randrange() could be used.
It could also be the way I set the seed, since in C++ a char seems to work, but in Python I'm setting it to the numeric value of the character.
Here is what I observe when un-encrypting the file using the various methods. This is just the first 13 bytes, so if someone needs to check if it works it is possible to do so.
The block ends with the sequence 4C 0A 54 C4, which means C4 is the seed.
Example encrypted:
77 43 35 69 26 6B 0C 6E 3A 74 4B 33 71 wC5i&k.n:tK3q
Example un-encrypted using c++:
24 43 6C 69 63 6B 49 6E 69 74 0A 33 34 $ClickInit.34
Example un-encrypted using python example:
1A 43 7E 69 77 6B 38 6E 0E 74 1A 33 3A .C~iwk8n.t.3:
Example un-encrypted using python implementation:
3C 43 73 69 6E 6B 4A 6E 0E 74 1A 33 37 <CsinkJn.t.37
I can also have something wrong in my python script, so here is the file in case it has any errors:
import os

def srand(s):
    global seed
    seed = s

def rand():
    global seed
    # Example code
    #seed = (seed * 214013 + 2531011) % 2**64
    #return (seed >> 16) & 0x7fff
    # Implementation code
    seed = ((seed * 1103515245) + 12345) & 0x7FFFFFFF
    return seed

KEY = ['K', 'F', '4', 'Q', 'H', 'c', 'm', '2']
KEY_LEN = len(KEY) - 1

for filename in os.listdir("."):
    if filename.endswith(".dat"):
        print(" Decoding " + filename)
        # open file
        file = open(filename, "rb")
        # set file attributes
        file_length = os.path.getsize(filename)
        file_buffer = [0] * file_length
        # copy contents of file to array
        for i in range(file_length):
            file_buffer[i] = int.from_bytes(file.read(1), 'big')
        # close file
        file.close()
        print(" Random Seed: " + chr(file_buffer[-1]))
        # set random generator seed
        srand(file_buffer[-1])
        # decrypt the file
        for i in range(3600, file_length, 2):
            file_buffer[i] ^= ord(KEY[rand() % KEY_LEN])
        # print to check if output is un-encrypted
        for i in range(3600, 3613, 1):
            print(file_buffer[i])
            print(chr(file_buffer[i]))
        continue
    else:
        # Do not try to un-encrypt the python script
        print("/!\\ Can't decode " + filename)
        continue
If anyone can help me figure this out I would be grateful. If possible I would love this to work in Python, but from what I can gather it seems like I will have to learn C++ to get it to work.
rand is not a cryptographic function.
rand's algorithm is not stable between systems, compilers, or anything else.
If you have no choice, your best bet is to use python-C/C++ interoperability techniques and actually run rand() and srand(). That will suck, but it will suck as much as the original code did.
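If you go that route, a rough sketch via ctypes might look like the following. This is an assumption-laden illustration, not the original tool's code: it only reproduces the exe's output when run against the same C runtime the exe was built with, and the file name is a placeholder.

import ctypes
import ctypes.util

# Load the platform C runtime; falls back to msvcrt on Windows.
libc = ctypes.CDLL(ctypes.util.find_library("c") or "msvcrt")

KEY = b"KF4QHcm2"
KEY_LEN = len(KEY)                 # 8, i.e. sizeof(KEY) - 1 in the C++ code

def unobfuscate(buff: bytearray) -> None:
    """Mirror of the C++ unobfuscate(), driven by the C runtime's srand()/rand()."""
    # (char)buff[len - 1]: char is usually signed, so bytes >= 0x80 become negative
    seed = buff[-1] - 256 if buff[-1] >= 0x80 else buff[-1]
    libc.srand(seed)
    for i in range(0, len(buff) - 2, 2):
        buff[i] ^= KEY[libc.rand() % KEY_LEN]

data = bytearray(open("some_file.dat", "rb").read())   # placeholder file name
unobfuscate(data)

Since XOR is its own inverse, the same routine both encrypts and decrypts.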

How to solve memory issues while multiprocessing using Pool.map()?

I have written the program (below) to:
read a huge text file as pandas dataframe
then group it by a specific column value to split the data and store it as a list of dataframes.
then pipe the data to multiprocess Pool.map() to process each dataframe in parallel.
Everything is fine, the program works well on my small test dataset. But, when I pipe in my large data (about 14 GB), the memory consumption exponentially increases and then freezes the computer or gets killed (in HPC cluster).
I have added code to clear the memory as soon as the data/variable isn't useful. I am also closing the pool as soon as it is done. Still, with a 14 GB input I was only expecting a 2*14 GB memory burden, but it seems like a lot more is going on. I also tried to tweak using chunkSize and maxTaskPerChild, etc., but I am not seeing any improvement with either the test or the large file.
I think improvements to this code are required at this position, where I start multiprocessing.
p = Pool(3) # number of pool to run at once; default at 1
result = p.map(matrix_to_vcf, list(gen_matrix_df_list.values()))
But I am posting the whole code.
Test example: I created a test file ("genome_matrix_final-chr1234-1mb.txt") of up to 250 MB and ran the program. When I check the system monitor I can see that memory consumption increased by about 6 GB. I am not clear why so much memory space is taken by a 250 MB file plus some outputs. I have shared that file via Dropbox if it helps in seeing the real problem. https://www.dropbox.com/sh/coihujii38t5prd/AABDXv8ACGIYczeMtzKBo0eea?dl=0
Can someone suggest how I can get rid of the problem?
My python script:
#!/home/bin/python3

import pandas as pd
import collections
from multiprocessing import Pool

import io
import time
import resource

print()
print('Checking required modules')
print()

''' change this input file name and/or path as need be '''
genome_matrix_file = "genome_matrix_final-chr1n2-2mb.txt"   # test file 01
genome_matrix_file = "genome_matrix_final-chr1234-1mb.txt"  # test file 02
#genome_matrix_file = "genome_matrix_final.txt"             # large file

def main():
    with open("genome_matrix_header.txt") as header:
        header = header.read().rstrip('\n').split('\t')
    print()

    time01 = time.time()
    print('starting time: ', time01)

    '''load the genome matrix file onto pandas as dataframe.
    This makes is more easy for multiprocessing'''
    gen_matrix_df = pd.read_csv(genome_matrix_file, sep='\t', names=header)

    # now, group the dataframe by chromosome/contig - so it can be multiprocessed
    gen_matrix_df = gen_matrix_df.groupby('CHROM')

    # store the splitted dataframes as list of key, values(pandas dataframe) pairs
    # this list of dataframe will be used while multiprocessing
    gen_matrix_df_list = collections.OrderedDict()
    for chr_, data in gen_matrix_df:
        gen_matrix_df_list[chr_] = data

    # clear memory
    del gen_matrix_df

    '''Now, pipe each dataframe from the list using map.Pool() '''
    p = Pool(3)  # number of pool to run at once; default at 1
    result = p.map(matrix_to_vcf, list(gen_matrix_df_list.values()))
    del gen_matrix_df_list  # clear memory

    p.close()
    p.join()

    # concat the results from pool.map() and write it to a file
    result_merged = pd.concat(result)
    del result  # clear memory
    pd.DataFrame.to_csv(result_merged, "matrix_to_haplotype-chr1n2.txt", sep='\t', header=True, index=False)

    print()
    print('completed all process in "%s" sec. ' % (time.time() - time01))
    print('Global maximum memory usage: %.2f (mb)' % current_mem_usage())
    print()

'''function to convert the dataframe from genome matrix to desired output '''
def matrix_to_vcf(matrix_df):

    print()
    time02 = time.time()

    # index position of the samples in genome matrix file
    sample_idx = [{'10a': 33, '10b': 18}, {'13a': 3, '13b': 19},
                  {'14a': 20, '14b': 4}, {'16a': 5, '16b': 21},
                  {'17a': 6, '17b': 22}, {'23a': 7, '23b': 23},
                  {'24a': 8, '24b': 24}, {'25a': 25, '25b': 9},
                  {'26a': 10, '26b': 26}, {'34a': 11, '34b': 27},
                  {'35a': 12, '35b': 28}, {'37a': 13, '37b': 29},
                  {'38a': 14, '38b': 30}, {'3a': 31, '3b': 15},
                  {'8a': 32, '8b': 17}]

    # sample index stored as ordered dictionary
    sample_idx_ord_list = []
    for ids in sample_idx:
        ids = collections.OrderedDict(sorted(ids.items()))
        sample_idx_ord_list.append(ids)

    # for haplotype file
    header = ['contig', 'pos', 'ref', 'alt']

    # adding some suffixes "PI" to available sample names
    for item in sample_idx_ord_list:
        ks_update = ''
        for ks in item.keys():
            ks_update += ks
        header.append(ks_update+'_PI')
        header.append(ks_update+'_PG_al')

    #final variable store the haplotype data
    # write the header lines first
    haplotype_output = '\t'.join(header) + '\n'

    # to store the value of parsed the line and update the "PI", "PG" value for each sample
    updated_line = ''

    # read the piped in data back to text like file
    matrix_df = pd.DataFrame.to_csv(matrix_df, sep='\t', index=False)

    matrix_df = matrix_df.rstrip('\n').split('\n')
    for line in matrix_df:
        if line.startswith('CHROM'):
            continue

        line_split = line.split('\t')
        chr_ = line_split[0]
        ref = line_split[2]
        alt = list(set(line_split[3:]))

        # remove the alleles "N" missing and "ref" from the alt-alleles
        alt_up = list(filter(lambda x: x!='N' and x!=ref, alt))

        # if no alt alleles are found, just continue
        # - i.e : don't write that line in output file
        if len(alt_up) == 0:
            continue

        #print('\nMining data for chromosome/contig "%s" ' %(chr_ ))
        #so, we have data for CHR, POS, REF, ALT so far
        # now, we mine phased genotype for each sample pair (as "PG_al", and also add "PI" tag)
        sample_data_for_vcf = []
        for ids in sample_idx_ord_list:
            sample_data = []
            for key, val in ids.items():
                sample_value = line_split[val]
                sample_data.append(sample_value)

            # now, update the phased state for each sample
            # also replacing the missing allele i.e "N" and "-" with ref-allele
            sample_data = ('|'.join(sample_data)).replace('N', ref).replace('-', ref)
            sample_data_for_vcf.append(str(chr_))
            sample_data_for_vcf.append(sample_data)

        # add data for all the samples in that line, append it with former columns (chrom, pos ..) ..
        # and .. write it to final haplotype file
        sample_data_for_vcf = '\t'.join(sample_data_for_vcf)
        updated_line = '\t'.join(line_split[0:3]) + '\t' + ','.join(alt_up) + \
                       '\t' + sample_data_for_vcf + '\n'
        haplotype_output += updated_line

    del matrix_df  # clear memory
    print('completed haplotype preparation for chromosome/contig "%s" '
          'in "%s" sec. ' % (chr_, time.time()-time02))
    print('\tWorker maximum memory usage: %.2f (mb)' % (current_mem_usage()))

    # return the data back to the pool
    return pd.read_csv(io.StringIO(haplotype_output), sep='\t')

''' to monitor memory '''
def current_mem_usage():
    return resource.getrusage(resource.RUSAGE_SELF).ru_maxrss / 1024.

if __name__ == '__main__':
    main()
Update for bounty hunters:
I have achieved multiprocessing using Pool.map(), but the code is causing a big memory burden (input test file ~300 MB, but the memory burden is about 6 GB). I was only expecting a 3*300 MB memory burden at most.
Can somebody explain what is causing such a huge memory requirement for such a small file and such a short computation?
Also, I am trying to take the answer and use it to improve multiprocessing in my large program. So, the addition of any method or module that doesn't change the structure of the computation part (the CPU-bound process) too much should be fine.
I have included two test files to play with the code.
The attached code is the full code, so it should work as intended as-is when copy-pasted. Any changes should be used only to improve optimization of the multiprocessing steps.
Prerequisite
In Python (in the following I use a 64-bit build of Python 3.6.5) everything is an object. This has its overhead, and with getsizeof we can see exactly the size of an object in bytes:
>>> import sys
>>> sys.getsizeof(42)
28
>>> sys.getsizeof('T')
50
When the fork system call is used (the default on *nix, see multiprocessing.get_start_method()) to create a child process, the parent's physical memory is not copied and the copy-on-write technique is used.
A forked child process will still report the full RSS (resident set size) of the parent process. Because of this, PSS (proportional set size) is a more appropriate metric to estimate the memory usage of a forking application. Here's an example from the page (a short sketch for reading PSS on Linux follows it):
Process A has 50 KiB of unshared memory
Process B has 300 KiB of unshared memory
Both process A and process B have 100 KiB of the same shared memory region
Since the PSS is defined as the sum of the unshared memory of a process and the proportion of memory shared with other processes, the PSS for these two processes are as follows:
PSS of process A = 50 KiB + (100 KiB / 2) = 100 KiB
PSS of process B = 300 KiB + (100 KiB / 2) = 350 KiB
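As a rough, Linux-specific sketch (the /proc layout and kB units are kernel conventions, not anything from the original post), a process's PSS can be summed from /proc/<pid>/smaps:

def pss_kib(pid='self'):
    """Sum the Pss: entries (in KiB) from /proc/<pid>/smaps on Linux."""
    total = 0
    with open('/proc/{}/smaps'.format(pid)) as f:
        for line in f:
            if line.startswith('Pss:'):
                total += int(line.split()[1])   # second field is the size in kB
    return total

print(pss_kib())        # PSS of the current process
print(pss_kib(12345))   # PSS of a hypothetical worker pid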
The data frame
Now let's look at your DataFrame alone. memory_profiler will help us.
justpd.py
#!/usr/bin/env python3

import pandas as pd
from memory_profiler import profile

@profile
def main():
    with open('genome_matrix_header.txt') as header:
        header = header.read().rstrip('\n').split('\t')

    gen_matrix_df = pd.read_csv(
        'genome_matrix_final-chr1234-1mb.txt', sep='\t', names=header)

    gen_matrix_df.info()
    gen_matrix_df.info(memory_usage='deep')

if __name__ == '__main__':
    main()
Now let's use the profiler:
mprof run justpd.py
mprof plot
We can see the plot:
and line-by-line trace:
Line # Mem usage Increment Line Contents
================================================
6 54.3 MiB 54.3 MiB @profile
7 def main():
8 54.3 MiB 0.0 MiB with open('genome_matrix_header.txt') as header:
9 54.3 MiB 0.0 MiB header = header.read().rstrip('\n').split('\t')
10
11 2072.0 MiB 2017.7 MiB gen_matrix_df = pd.read_csv('genome_matrix_final-chr1234-1mb.txt', sep='\t', names=header)
12
13 2072.0 MiB 0.0 MiB gen_matrix_df.info()
14 2072.0 MiB 0.0 MiB gen_matrix_df.info(memory_usage='deep')
We can see that the data frame takes ~2 GiB with peak at ~3 GiB while it's being built. What's more interesting is the output of info.
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4000000 entries, 0 to 3999999
Data columns (total 34 columns):
...
dtypes: int64(2), object(32)
memory usage: 1.0+ GB
But info(memory_usage='deep') ("deep" means introspection of the data deeply by interrogating object dtypes, see below) gives:
memory usage: 7.9 GB
Huh?! Looking outside of the process we can make sure that memory_profiler's figures are correct. sys.getsizeof also shows the same value for the frame (most probably because of a custom __sizeof__), and so will other tools that build on it to size the objects returned by gc.get_objects(), e.g. pympler.
# added after read_csv
from pympler import tracker
tr = tracker.SummaryTracker()
tr.print_diff()
Gives:
types | # objects | total size
================================================== | =========== | ============
<class 'pandas.core.series.Series | 34 | 7.93 GB
<class 'list | 7839 | 732.38 KB
<class 'str | 7741 | 550.10 KB
<class 'int | 1810 | 49.66 KB
<class 'dict | 38 | 7.43 KB
<class 'pandas.core.internals.SingleBlockManager | 34 | 3.98 KB
<class 'numpy.ndarray | 34 | 3.19 KB
So where do these 7.93 GiB come from? Let's try to explain this. We have 4M rows and 34 columns, which gives us 134M values. They are either int64 or object (which is a 64-bit pointer; see using pandas with large data for detailed explanation). Thus we have 134 * 10 ** 6 * 8 / 2 ** 20 ~1022 MiB only for values in the data frame. What about the remaining ~ 6.93 GiB?
String interning
To understand the behaviour it's necessary to know that Python does string interning. There are two good articles (one, two) about string interning in Python 2. Besides the Unicode change in Python 3 and PEP 393 in Python 3.3, the C structures have changed, but the idea is the same. Basically, every short string that looks like an identifier will be cached by Python in an internal dictionary, and references will point to the same Python object. In other words, we can say it behaves like a singleton. The articles I mentioned above explain the significant memory-profile and performance improvements this gives. We can check whether a string is interned using the interned field of PyASCIIObject:
import ctypes

class PyASCIIObject(ctypes.Structure):
    _fields_ = [
        ('ob_refcnt', ctypes.c_size_t),
        ('ob_type', ctypes.py_object),
        ('length', ctypes.c_ssize_t),
        ('hash', ctypes.c_int64),
        ('state', ctypes.c_int32),
        ('wstr', ctypes.c_wchar_p)
    ]
Then:
>>> a = 'name'
>>> b = '!##$'
>>> a_struct = PyASCIIObject.from_address(id(a))
>>> a_struct.state & 0b11
1
>>> b_struct = PyASCIIObject.from_address(id(b))
>>> b_struct.state & 0b11
0
With two strings we can also do an identity comparison (an address-in-memory comparison in the case of CPython).
>>> a = 'foo'
>>> b = 'foo'
>>> a is b
True
>>> gen_matrix_df.REF[0] is gen_matrix_df.REF[6]
True
Because of that fact, in regard to the object dtype, the data frame allocates at most 20 strings (one per amino acid). It's worth noting, though, that Pandas recommends categorical types for enumerations.
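As a small, hedged sketch of that recommendation (gen_matrix_df and its object columns are taken from the question; the exact savings depend on the data):

# convert repetitive string columns to pandas' categorical dtype
for col in gen_matrix_df.select_dtypes(include='object').columns:
    gen_matrix_df[col] = gen_matrix_df[col].astype('category')

gen_matrix_df.info(memory_usage='deep')   # the "deep" figure should drop sharply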
Pandas memory
Thus we can explain the naive estimate of 7.93 GiB like:
>>> rows = 4 * 10 ** 6
>>> int_cols = 2
>>> str_cols = 32
>>> int_size = 8
>>> str_size = 58
>>> ptr_size = 8
>>> (int_cols * int_size + str_cols * (str_size + ptr_size)) * rows / 2 ** 30
7.927417755126953
Note that str_size is 58 bytes, not 50 as we've seen above for a 1-character literal. It's because PEP 393 defines compact and non-compact strings. You can check it with sys.getsizeof(gen_matrix_df.REF[0]).
Actual memory consumption should be ~1 GiB as reported by gen_matrix_df.info(), yet it's twice as much. We can assume it has something to do with memory (pre)allocation done by Pandas or NumPy. The following experiment shows that this is not without reason (multiple runs show the same picture):
Line # Mem usage Increment Line Contents
================================================
8 53.1 MiB 53.1 MiB @profile
9 def main():
10 53.1 MiB 0.0 MiB with open("genome_matrix_header.txt") as header:
11 53.1 MiB 0.0 MiB header = header.read().rstrip('\n').split('\t')
12
13 2070.9 MiB 2017.8 MiB gen_matrix_df = pd.read_csv('genome_matrix_final-chr1234-1mb.txt', sep='\t', names=header)
14 2071.2 MiB 0.4 MiB gen_matrix_df = gen_matrix_df.drop(columns=[gen_matrix_df.keys()[0]])
15 2071.2 MiB 0.0 MiB gen_matrix_df = gen_matrix_df.drop(columns=[gen_matrix_df.keys()[0]])
16 2040.7 MiB -30.5 MiB gen_matrix_df = gen_matrix_df.drop(columns=[random.choice(gen_matrix_df.keys())])
...
23 1827.1 MiB -30.5 MiB gen_matrix_df = gen_matrix_df.drop(columns=[random.choice(gen_matrix_df.keys())])
24 1094.7 MiB -732.4 MiB gen_matrix_df = gen_matrix_df.drop(columns=[random.choice(gen_matrix_df.keys())])
25 1765.9 MiB 671.3 MiB gen_matrix_df = gen_matrix_df.drop(columns=[random.choice(gen_matrix_df.keys())])
26 1094.7 MiB -671.3 MiB gen_matrix_df = gen_matrix_df.drop(columns=[random.choice(gen_matrix_df.keys())])
27 1704.8 MiB 610.2 MiB gen_matrix_df = gen_matrix_df.drop(columns=[random.choice(gen_matrix_df.keys())])
28 1094.7 MiB -610.2 MiB gen_matrix_df = gen_matrix_df.drop(columns=[random.choice(gen_matrix_df.keys())])
29 1643.9 MiB 549.2 MiB gen_matrix_df = gen_matrix_df.drop(columns=[random.choice(gen_matrix_df.keys())])
30 1094.7 MiB -549.2 MiB gen_matrix_df = gen_matrix_df.drop(columns=[random.choice(gen_matrix_df.keys())])
31 1582.8 MiB 488.1 MiB gen_matrix_df = gen_matrix_df.drop(columns=[random.choice(gen_matrix_df.keys())])
32 1094.7 MiB -488.1 MiB gen_matrix_df = gen_matrix_df.drop(columns=[random.choice(gen_matrix_df.keys())])
33 1521.9 MiB 427.2 MiB gen_matrix_df = gen_matrix_df.drop(columns=[random.choice(gen_matrix_df.keys())])
34 1094.7 MiB -427.2 MiB gen_matrix_df = gen_matrix_df.drop(columns=[random.choice(gen_matrix_df.keys())])
35 1460.8 MiB 366.1 MiB gen_matrix_df = gen_matrix_df.drop(columns=[random.choice(gen_matrix_df.keys())])
36 1094.7 MiB -366.1 MiB gen_matrix_df = gen_matrix_df.drop(columns=[random.choice(gen_matrix_df.keys())])
37 1094.7 MiB 0.0 MiB gen_matrix_df = gen_matrix_df.drop(columns=[random.choice(gen_matrix_df.keys())])
...
47 1094.7 MiB 0.0 MiB gen_matrix_df = gen_matrix_df.drop(columns=[random.choice(gen_matrix_df.keys())])
I want to finish this section with a quote from a fresh article about design issues and future Pandas2 by the original author of Pandas.
pandas rule of thumb: have 5 to 10 times as much RAM as the size of your dataset
Process tree
Let's finally come to the pool and see if it can make use of copy-on-write. We'll use smemstat (available from an Ubuntu repository) to estimate process-group memory sharing and glances to write down system-wide free memory. Both can write JSON.
We'll run original script with Pool(2). We'll need 3 terminal windows.
smemstat -l -m -p "python3.6 script.py" -o smemstat.json 1
glances -t 1 --export-json glances.json
mprof run -M script.py
Then mprof plot produces:
The sum chart (mprof run --nopython --include-children ./script.py) looks like:
Note that the two charts above show RSS. The hypothesis is that because of copy-on-write it doesn't reflect actual memory usage. Now we have two JSON files from smemstat and glances. I'll use the following script to convert the JSON files to CSV.
#!/usr/bin/env python3

import csv
import sys
import json

def smemstat():
    with open('smemstat.json') as f:
        smem = json.load(f)

    rows = []
    fieldnames = set()
    for s in smem['smemstat']['periodic-samples']:
        row = {}
        for ps in s['smem-per-process']:
            if 'script.py' in ps['command']:
                for k in ('uss', 'pss', 'rss'):
                    row['{}-{}'.format(ps['pid'], k)] = ps[k] // 2 ** 20

        # smemstat produces empty samples, backfill from previous
        if rows:
            for k, v in rows[-1].items():
                row.setdefault(k, v)

        rows.append(row)
        fieldnames.update(row.keys())

    with open('smemstat.csv', 'w') as out:
        dw = csv.DictWriter(out, fieldnames=sorted(fieldnames))
        dw.writeheader()
        list(map(dw.writerow, rows))

def glances():
    rows = []
    fieldnames = ['available', 'used', 'cached', 'mem_careful', 'percent',
                  'free', 'mem_critical', 'inactive', 'shared', 'history_size',
                  'mem_warning', 'total', 'active', 'buffers']
    with open('glances.csv', 'w') as out:
        dw = csv.DictWriter(out, fieldnames=fieldnames)
        dw.writeheader()
        with open('glances.json') as f:
            for l in f:
                d = json.loads(l)
                dw.writerow(d['mem'])

if __name__ == '__main__':
    globals()[sys.argv[1]]()
First let's look at free memory.
The difference between the first value and the minimum is ~4.15 GiB. And here is how the PSS figures look:
And the sum:
Thus we can see that because of copy-on-write actual memory consumption is ~4.15 GiB. But we're still serialising data to send it to worker processes via Pool.map. Can we leverage copy-on-write here as well?
Shared data
To use copy-on-write we need to have the list(gen_matrix_df_list.values()) be accessible globally so the worker after fork can still read it.
Let's modify code after del gen_matrix_df in main like the following:
...
global global_gen_matrix_df_values
global_gen_matrix_df_values = list(gen_matrix_df_list.values())
del gen_matrix_df_list
p = Pool(2)
result = p.map(matrix_to_vcf, range(len(global_gen_matrix_df_values)))
...
Remove del gen_matrix_df_list that goes later.
And modify first lines of matrix_to_vcf like:
def matrix_to_vcf(i):
    matrix_df = global_gen_matrix_df_values[i]
Now let's re-run it. Free memory:
Process tree:
And its sum:
Thus we're at maximum of ~2.9 GiB of actual memory usage (the peak main process has while building the data frame) and copy-on-write has helped!
As a side note, there's so-called copy-on-read, the behaviour of Python's reference-cycle garbage collector described by Instagram Engineering (which led to gc.freeze in issue31558). But gc.disable() doesn't have an impact in this particular case.
Update
An alternative to copy-on-write for copy-less data sharing is to delegate it to the kernel from the beginning by using numpy.memmap. Here's an example implementation from the High Performance Data Processing in Python talk. The tricky part is then to make Pandas use the mmapped NumPy array.
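As a minimal sketch of the memmap idea (the file name and shape are placeholders; this only covers homogeneous numeric data, not the object columns in the question's frame):

import numpy as np

# Parent: write the numeric matrix once to a memory-mapped file on disk.
values = np.random.rand(1_000_000, 4)                    # stand-in for real data
mm = np.memmap('matrix.dat', dtype='float64', mode='w+', shape=values.shape)
mm[:] = values
mm.flush()

# Worker: re-open the same file read-only; the kernel shares the pages between
# processes, so no per-process copy of the data is made.
def worker(row_slice):
    shared = np.memmap('matrix.dat', dtype='float64', mode='r', shape=(1_000_000, 4))
    return shared[row_slice].sum()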
When you use multiprocessing.Pool a number of child processes will be created using the fork() system call. Each of those processes starts off with an exact copy of the memory of the parent process at that time. Because you're loading the csv before you create the Pool of size 3, each of those 3 processes in the pool will unnecessarily have a copy of the data frame. (gen_matrix_df as well as gen_matrix_df_list will exist in the current process as well as in each of the 3 child processes, so 4 copies of each of these structures will be in memory.)
Try creating the Pool before loading the file (at the very beginning, actually). That should reduce the memory usage.
If it's still too high, you can:
Dump gen_matrix_df_list to a file, one pickled item at a time, e.g.:

import pickle

with open('tempfile.pkl', 'wb') as f:
    for item in gen_matrix_df_list.items():
        pickle.dump(item, f)    # pickle streams are self-delimiting, no separator needed

Use Pool.imap() on a generator that reads those items back lazily, e.g.:

def load_items(path):
    with open(path, 'rb') as f:
        while True:
            try:
                yield pickle.load(f)
            except EOFError:
                return

p.imap(matrix_to_vcf, load_items('tempfile.pkl'))

(Note that matrix_to_vcf takes a (key, value) tuple in the example above, not just a value.)
I hope that helps.
NB: I haven't tested the code above. It's only meant to demonstrate the idea.
I had the same issue. I needed to process a huge text corpus while keeping a knowledge base of a few DataFrames of millions of rows loaded in memory. I think this issue is common, so I will keep my answer oriented toward general purposes.
A combination of settings solved the problem for me (1 & 3 & 5 alone might do it for you); a rough sketch combining points 1-4 follows the list:
1. Use Pool.imap (or imap_unordered) instead of Pool.map. This will iterate over the data lazily rather than loading all of it into memory before processing starts.
2. Set a value for the chunksize parameter. This will make imap faster too.
3. Set a value for the maxtasksperchild parameter.
4. Append output to disk rather than keeping it in memory, either immediately or whenever it reaches a certain size.
5. Run the code in different batches. You can use itertools.islice if you have an iterator. The idea is to split your list(gen_matrix_df_list.values()) into three or more lists; then you pass only the first third to map or imap, the second third in another run, etc. Since you have a list, you can simply slice it in the same line of code.
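A rough sketch of points 1-4 applied to the question's code (the chunksize and maxtasksperchild values are placeholder guesses to tune; matrix_to_vcf and gen_matrix_df_list are the names from the question):

from multiprocessing import Pool

with Pool(processes=3, maxtasksperchild=10) as p:                      # point 3
    with open("matrix_to_haplotype-chr1n2.txt", "w") as out:
        wrote_header = False
        # imap pulls work lazily and yields results as they complete    # points 1-2
        for df in p.imap(matrix_to_vcf, gen_matrix_df_list.values(), chunksize=1):
            # append each result to disk instead of accumulating it      # point 4
            df.to_csv(out, sep='\t', header=not wrote_header, index=False)
            wrote_header = True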
GENERAL ANSWER ABOUT MEMORY WITH MULTIPROCESSING
You asked: "What is causing so much memory to be allocated?" The answer has two parts.
First, as you already noticed, each multiprocessing worker gets its own copy of the data (quoted from here), so you should chunk large arguments. Or, for large files, read them in a little bit at a time, if possible.
By default the workers of the pool are real Python processes forked
using the multiprocessing module of the Python standard library when
n_jobs != 1. The arguments passed as input to the Parallel call are
serialized and reallocated in the memory of each worker process.
This can be problematic for large arguments as they will be
reallocated n_jobs times by the workers.
Second, if you're trying to reclaim memory, you need to understand that Python works differently than other languages, and that relying on del to release the memory doesn't work the way you might expect. I don't know if it's best, but in my own code, I've overcome this by reassigning the variable to None or an empty object.
FOR YOUR SPECIFIC EXAMPLE - MINIMAL CODE EDITING
As long as you can fit your large data in memory twice, I think you can do what you are trying to do by changing just a single line. I've written very similar code and it worked for me when I reassigned the variable (as opposed to calling del or any kind of garbage collection). If this doesn't work, you may need to follow the suggestions above and use disk I/O:
#### earlier code all the same
# clear memory by reassignment (not del or gc)
gen_matrix_df = {}
'''Now, pipe each dataframe from the list using map.Pool() '''
p = Pool(3) # number of pool to run at once; default at 1
result = p.map(matrix_to_vcf, list(gen_matrix_df_list.values()))
#del gen_matrix_df_list # I suspect you don't even need this, memory will free when the pool is closed
p.close()
p.join()
#### later code all the same
FOR YOUR SPECIFIC EXAMPLE - OPTIMAL MEMORY USAGE
As long as you can fit your large data in memory once, and you have some idea of how big your file is, you can use Pandas read_csv partial file reading: either read in only nrows at a time if you really want to micro-manage how much data is being read in, or a fixed amount of memory at a time using chunksize, which returns an iterator. By that I mean, the nrows parameter is just a single read: you might use that to just get a peek at a file, or if for some reason you wanted each part to have exactly the same number of rows (because, for example, if any of your data is strings of variable length, each row will not take up the same amount of memory). But I think for the purposes of prepping a file for multiprocessing, it will be far easier to use chunks, because that directly relates to memory, which is your concern. It will be easier to use trial & error to fit into memory based on specific-sized chunks than on a number of rows, since the amount of memory used per row depends on how much data is in it. The only other difficult part is that, for some application-specific reason, you're grouping some rows, which makes it a little bit more complicated. Using your code as an example:
'''load the genome matrix file onto pandas as dataframe.
This makes it easier for multiprocessing'''

# store the splitted dataframes as list of key, values(pandas dataframe) pairs
# this list of dataframe will be used while multiprocessing

# not sure why you need the ordered dict here, might add memory overhead
# gen_matrix_df_list = collections.OrderedDict()
# a defaultdict won't throw an exception when we try to append to it the first time;
# if you don't want a defaultdict for some reason, you have to initialize each entry you care about
gen_matrix_df_list = collections.defaultdict(list)

chunksize = 10 ** 6
for chunk in pd.read_csv(genome_matrix_file, sep='\t', names=header, chunksize=chunksize):
    # now, group the dataframe by chromosome/contig - so it can be multiprocessed
    gen_matrix_df = chunk.groupby('CHROM')
    for chr_, data in gen_matrix_df:
        gen_matrix_df_list[chr_].append(data)

'''Having sorted chunks on read to a list of df, now create single data frames for each chr_'''
# The dict contains a list of small df objects, so now concatenate them.
# By reassigning to the same dict, the memory footprint is not increasing.
for chr_ in gen_matrix_df_list.keys():
    gen_matrix_df_list[chr_] = pd.concat(gen_matrix_df_list[chr_])

'''Now, pipe each dataframe from the list using map.Pool() '''
p = Pool(3)  # number of pool to run at once; default at 1
result = p.map(matrix_to_vcf, list(gen_matrix_df_list.values()))
p.close()
p.join()

Convert systemtime to filetime (Python) [duplicate]

Any links for me to convert datetime to filetime using python?
Example: 13 Apr 2011 07:21:01.0874 (UTC) FILETIME=[57D8C920:01CBF9AB]
Got the above from an email header.
My answer in the duplicated question got deleted, so I'll post it here:
Surfing around I found this link: http://cboard.cprogramming.com/windows-programming/85330-hex-time-filetime.html
After that, everything became simple:
>>> import struct
>>> ft = "57D8C920:01CBF9AB"
... # switch parts
... h2, h1 = [int(h, base=16) for h in ft.split(':')]
... # rebuild
... ft_dec = struct.unpack('>Q', struct.pack('>LL', h1, h2))[0]
... ft_dec
... 129471528618740000L
... # use function from iceaway's comment
... print filetime_to_dt(ft_dec)
2011-04-13 07:21:01
Tuning it up is up for you.
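For reference, a hedged sketch of what that helper presumably looks like (the constants are the standard FILETIME-to-Unix-epoch offsets, the same ones used in the next answer):

from datetime import datetime

EPOCH_AS_FILETIME = 116444736000000000    # 1601-01-01 to 1970-01-01, in 100 ns ticks
HUNDREDS_OF_NANOSECONDS = 10000000

def filetime_to_dt(ft):
    """Convert a 64-bit FILETIME value to a naive UTC datetime."""
    return datetime.utcfromtimestamp((ft - EPOCH_AS_FILETIME) / HUNDREDS_OF_NANOSECONDS)

print(filetime_to_dt(129471528618740000))   # 2011-04-13 07:21:01.874000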
Well, here is the solution I ended up with:

import struct
from datetime import datetime

parm3 = 0x57D8C920   # low 32 bits
parm4 = 0x01CBF9AB   # high 32 bits

# Int32x32To64
ft_dec = struct.unpack('>Q', struct.pack('>LL', parm4, parm3))[0]

EPOCH_AS_FILETIME = 116444736000000000
HUNDREDS_OF_NANOSECONDS = 10000000
dt = datetime.fromtimestamp((ft_dec - EPOCH_AS_FILETIME) / HUNDREDS_OF_NANOSECONDS)
print(dt)
Output will be:
2011-04-13 09:21:01 (GMT +1)
13 Apr 2011 07:21:01.0874 (UTC)
Based on David Buxton's 'filetimes.py'.
^ Note that there's a difference in the hours.
Well, I changed two things:
fromtimestamp() fits somewhat better than utcfromtimestamp() since I'm dealing with local file times here.
FAT time resolution is 2 seconds, so I don't care about the sub-second remainder that might be lost.
(Well, actually, since the resolution is 2 seconds there is normally no remainder when dividing by HUNDREDS_OF_NANOSECONDS.)
... and besides the order of parameter passing, pay attention that struct.pack('>LL', ...) is for unsigned 32-bit ints!
If you have signed ints, simply change it to struct.pack('>ll', ...) for signed 32-bit ints!
(or click the struct.pack link above for more info)
