Background
I am trying to speed up a computation by parallelizing it (via joblib) across the available cores in Python 3.8, but observed that it scales poorly.
Trials
I wrote a little script (see below) to test and demonstrate the behavior. It runs a completely independent task consisting of some iterations of dummy operations using NumPy and Pandas. The task has no input and no output, no disk or other I/O, no communication or shared memory, just plain CPU and RAM usage. The processes do not use any other resources either, apart from the occasional request for the current time. Amdahl's Law should not apply here, since there is no common code at all except for process setup.
I ran some experiments with increasing workloads by duplicating the tasks, using sequential vs. parallel processing, and measured the time it takes for each iteration and for the whole (parallel) process to complete. I ran the script on my Windows 10 laptop and on two AWS EC2 Linux (Amazon Linux 2) machines. The number of parallel processes never exceeded the number of available cores.
Observation
I observed the following (see results later for details, duration in seconds):
When the number of parallel processes was less than the number of available cores, the total average CPU utilization (user) was never more than 93%, system calls did not exceed 4%, and there was no iowait (measured with iostat -hxm 10)
The workload seems to be distributed equally over the available cores, though, which might be an indication of frequent switches between processes even though there are plenty of cores available
Interestingly, for sequential processing, the CPU utilization (user) was around 48%
The summed duration of all iterations is only slightly less than the total duration of a process, so process setup does not seem to be a major factor
For each doubling of the number of parallel processes, the speed per iteration/process decreases by 50%
Whereas the duration for sequential processing roughly doubles, as expected, when the workload (total number of iterations) is doubled,
the duration for parallel processing also increases significantly, by approx. 50% per doubling
Findings of this magnitude are unexpected to me.
Questions
What is the cause of this behavior?
Am I missing something?
How can it be remedied in order to make full use of the additional cores?
Detailed results
Windows 10
6 CPUs, 12 cores
Call: python .\time_parallel_processing.py 1,2,4,8 10
Duration/Iter Duration total TotalIterCount
mean std mean mean
Mode ParallelCount
Joblib 1 4.363902 0.195268 43.673971 10
2 6.322100 0.140654 63.870973 20
4 9.270582 0.464706 93.631790 40
8 15.489000 0.222859 156.670544 80
Seq 1 4.409772 0.126686 44.133441 10
2 4.465326 0.113183 89.377296 20
4 4.534959 0.125097 181.528372 40
8 4.444790 0.083315 355.849860 80
AWS c5.4xlarge
8 CPUs, 16 cores
Call: python time_parallel_processing.py 1,2,4,8,16 10
Duration/Iter Duration total TotalIterCount
mean std mean mean
Mode ParCount
Joblib 1 2.196086 0.009798 21.987626 10
2 3.392873 0.010025 34.297323 20
4 4.519174 0.126054 45.967140 40
8 6.888763 0.676024 71.815990 80
16 12.191278 0.156941 123.287779 160
Seq 1 2.192089 0.010873 21.945536 10
2 2.184294 0.008955 43.735713 20
4 2.201437 0.027537 88.156621 40
8 2.145312 0.009631 171.805374 80
16 2.137723 0.018985 342.393953 160
AWS c5.9xlarge
18 CPUs, 36 cores
Call: python time_parallel_processing.py 1,2,4,8,16,32 10
Duration/Iter Duration total TotalIterCount
mean std mean mean
Mode ParCount
Joblib 1 1.888071 0.023799 18.905295 10
2 2.797132 0.009859 28.307708 20
4 3.349333 0.106755 34.199839 40
8 4.273267 0.705345 45.998927 80
16 6.383214 1.455857 70.469109 160
32 10.974141 4.220783 129.671016 320
Seq 1 1.891170 0.030131 18.934494 10
2 1.866365 0.007283 37.373133 20
4 1.893082 0.041085 75.813468 40
8 1.855832 0.007025 148.643725 80
16 1.896622 0.007573 303.828529 160
32 1.864366 0.009142 597.301383 320
Script code
import argparse
import sys
import time
from argparse import Namespace
from typing import List
import numpy as np
import pandas as pd
from joblib import delayed
from joblib import Parallel
from tqdm import tqdm
RESULT_COLUMNS = {"Mode": str, "ParCount": int, "ProcessId": int, "IterId": int, "Duration": float}
def _create_empty_data_frame() -> pd.DataFrame:
return pd.DataFrame({key: [] for key, _ in RESULT_COLUMNS.items()}).astype(RESULT_COLUMNS)
def _do_task() -> None:
for _ in range(10):
array: np.ndarray = np.random.rand(2500, 2500)
_ = np.matmul(array, array)
data_frame: pd.DataFrame = pd.DataFrame(np.random.rand(250, 250), columns=list(map(str, list(range(250)))))
_ = data_frame.merge(data_frame)
def _process(process_id: int, iter_count: int) -> pd.DataFrame:
durations: pd.DataFrame = _create_empty_data_frame()
for i in tqdm(range(iter_count)):
iter_start_time: float = time.time()
_do_task()
durations = durations.append(
{
"Mode": "",
"ParCount": 0,
"ProcessId": process_id,
"IterId": i,
"Duration": time.time() - iter_start_time,
},
ignore_index=True,
)
return durations
def main(args: Namespace) -> None:
"""Execute main script."""
iter_durations: List[pd.DataFrame] = []
mode_durations: List[pd.DataFrame] = []
for par_count in list(map(int, args.par_counts.split(","))):
total_iter_count: int = par_count * int(args.iter_count)
print(f"\nRunning {par_count} processes in parallel and {total_iter_count} iterations in total")
start_time_joblib: float = time.time()
with Parallel(n_jobs=par_count) as parallel:
joblib_durations: List[pd.DataFrame] = parallel(
delayed(_process)(process_id, int(args.iter_count)) for process_id in range(par_count)
)
iter_durations.append(pd.concat(joblib_durations).assign(**{"Mode": "Joblib", "ParCount": par_count}))
end_time_joblib: float = time.time()
print(f"\nRunning {par_count} processes sequentially with {total_iter_count} iterations in total")
start_time_seq: float = time.time()
seq_durations: List[pd.DataFrame] = []
for process_id in range(par_count):
seq_durations.append(_process(process_id, int(args.iter_count)))
iter_durations.append(pd.concat(seq_durations).assign(**{"Mode": "Seq", "ParCount": par_count}))
end_time_seq: float = time.time()
mode_durations.append(
pd.DataFrame(
{
"Mode": ["Joblib", "Seq"],
"ParCount": [par_count] * 2,
"Duration": [end_time_joblib - start_time_joblib, end_time_seq - start_time_seq],
"TotalIterCount": [total_iter_count] * 2,
}
)
)
print("\nDuration in seconds")
grouping_columns: List[str] = ["Mode", "ParCount"]
print(
pd.concat(iter_durations)
.groupby(grouping_columns)
.agg({"Duration": ["mean", "std"]})
.merge(
pd.concat(mode_durations).groupby(grouping_columns).agg({"Duration": ["mean"], "TotalIterCount": "mean"}),
on=grouping_columns,
suffixes=["/Iter", " total"],
how="inner",
)
)
if __name__ == "__main__":
print(f"Command line: {sys.argv}")
parser: argparse.ArgumentParser = argparse.ArgumentParser(description=__doc__)
parser.add_argument(
"par_counts",
help="Comma separated list of parallel processes counts to start trials for (e.g. '1,2,4,8,16,32')",
)
parser.add_argument("iter_count", help="Number of iterations per parallel process to carry out")
args: argparse.Namespace = parser.parse_args()
start_time: float = time.time()
main(args)
print(f"\nTotal elapsed time: {time.time() - start_time:.2f} seconds")
Environment
Created with: conda env create -f environment.time_parallel.yaml
environment.time_parallel.yaml:
name: time_parallel
channels:
- defaults
- conda-forge
dependencies:
- python=3.8.5
- pip=20.3.3
- pandas=1.2.0
- numpy=1.19.2
- joblib=1.0.0
- tqdm=4.55.1
Update 1
Thanks to the comment of @sholderbach I investigated the NumPy/Pandas usage and found out a couple of things.
1)
NumPy uses a linear algebra backend which automatically runs some operations (including matrix multiplication) in parallel threads. The more parallel processes, the more threads altogether, which clog the system and hence increase the duration per iteration.
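One quick way to check this hypothesis without touching the task itself is to cap the BLAS/OpenMP thread pools via environment variables before NumPy is imported. This is only a hedged sketch (not part of the original script); which variable is honoured depends on the backend NumPy was built against (MKL, OpenBLAS, ...):
import os

for var in ("OMP_NUM_THREADS", "MKL_NUM_THREADS",
            "OPENBLAS_NUM_THREADS", "NUMEXPR_NUM_THREADS"):
    os.environ[var] = "1"   # one linear algebra thread per process

import numpy as np  # imported only after the limits are in place

a = np.random.rand(2500, 2500)
_ = np.matmul(a, a)  # now runs single-threaded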
I tested this hypothesis by removing the NumPy and Pandas operations in method _do_task and replacing them with simple math operations only:
def _do_task() -> None:
for _ in range(10):
for i in range(10000000):
_ = 1000 ^ 2 % 200
The results are exactly as expected in that the duration of an iteration does not change when increasing the number of processes (beyond the number of cores available).
Windows 10
Call python time_parallel_processing.py 1,2,4,8 5
Duration in seconds
Duration/Iter Duration total TotalIterCount
mean std mean mean
Mode ParCount
Joblib 1 2.562570 0.015496 13.468393 5
2 2.556241 0.021074 13.781174 10
4 2.565614 0.054754 16.171828 20
8 2.630463 0.258474 20.328055 40
Seq 2 2.576542 0.033270 25.874965 10
AWS c5.9xlarge
Call python time_parallel_processing.py 1,2,4,8,16,32 10
Duration/Iter Duration total TotalIterCount
mean std mean mean
Mode ParCount
Joblib 1 2.082849 0.022352 20.854512 10
2 2.126195 0.034078 21.596860 20
4 2.287874 0.254493 27.420978 40
8 2.141553 0.030316 21.912917 80
16 2.156828 0.137937 24.483243 160
32 3.581366 1.197282 42.884399 320
Seq 2 2.076256 0.004231 41.571033 20
2)
Following the hint of @sholderbach I found a number of other links which cover the topic of linear algebra backends using multiple threads automatically and how to turn this off:
NumPy issue (from @sholderbach)
threadpoolctl package
Nice article
Pinning process to a specific CPU with Python (and package psutil)
Add to _process:
proc = psutil.Process()
proc.cpu_affinity([process_id])
with threadpool_limits(limits=1):
...
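Put together, the modified _process might look roughly like this (a sketch only; it assumes process_id maps to a valid core index, reuses _do_task from the script above, and cpu_affinity is not available on macOS):
import time

import pandas as pd
import psutil
from threadpoolctl import threadpool_limits


def _process(process_id: int, iter_count: int) -> pd.DataFrame:
    proc = psutil.Process()
    proc.cpu_affinity([process_id])      # pin this worker to a single core
    records = []
    with threadpool_limits(limits=1):    # one BLAS/OpenMP thread per worker
        for i in range(iter_count):
            iter_start_time = time.time()
            _do_task()                   # the task from the original script
            records.append({"Mode": "", "ParCount": 0, "ProcessId": process_id,
                            "IterId": i, "Duration": time.time() - iter_start_time})
    return pd.DataFrame(records)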
Add to environment:
- threadpoolctl=2.1.0
- psutil=5.8.0
Note: I had to replace joblib with multiprocessing, since pinning did not work properly with joblib (only one half of the processes got spawned at a time on Linux).
I did some tests with mixed results. Monitoring shows that pinning and restricting to one thread per process works for both Windows 10 and Linux/AWS c5.9xlarge. Unfortunately, the absolute duration per iteration increases with these "fixes".
Also, the duration per iteration still begins to increase at some point of parallelization.
Here are the results:
Windows 10
Call: python time_parallel_processing.py 1,2,4,8 5
Duration/Iter Duration total TotalIterCount
mean std mean mean
Mode ParCount
Joblib 1 9.502184 0.046554 47.542230 5
2 9.557120 0.092897 49.488612 10
4 9.602235 0.078271 50.249238 20
8 10.518716 0.422020 60.186707 40
Seq 2 9.493682 0.062105 95.083382 10
AWS c5.9xlarge
Call python time_parallel_processing.py 1,2,4,8,16,20,24,28,32 5
Duration/Iter Duration total TotalIterCount
mean std mean mean
Mode ParCount
Parallel 1 5.271010 0.008730 15.862883 3
2 5.400430 0.016094 16.271649 6
4 5.708021 0.069001 17.428172 12
8 6.088623 0.179789 18.745922 24
16 8.330902 0.177772 25.566504 48
20 10.515132 3.081697 47.895538 60
24 13.506221 4.589382 53.348917 72
28 16.318631 4.961513 57.536180 84
32 19.800182 4.435462 64.717435 96
Seq 2 5.212529 0.037129 31.332297 6
What is the cause of this behavior?
Very generally, this type of slowdown usually indicates some combination of being blocked by the GIL, context switching between cores, or doing a lot of pickling.
Am I missing something?
You may be missing some small issues - try profiling (some sampling profiler may be much more performant than cProfile) to see where the time is spent!
However, there's still a finite limit to how fast this can be made before you are reimplementing the suggestions below.
How can it be remedied in order to make full use of the additional cores?
Take a look at numba and dask, which can allow you to get tremendous speedups on numpy and pandas code through parallelization that steps outside of the GIL
numba compiles numpy code and caches it for greater speed and practical processor operations
dask is a framework full of good tricks for efficient parallelization on a single and multiple systems
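As a hedged illustration of the numba route (do_work is a hypothetical stand-in for the per-iteration task, not code from the question):
import numpy as np
from numba import njit, prange


@njit(parallel=True, cache=True)
def do_work(n: int) -> float:
    # compiled to machine code; prange splits the loop across threads outside the GIL
    total = 0.0
    for i in prange(n):
        total += np.sin(i) * np.cos(i)
    return total


print(do_work(10_000_000))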
When I had a scaling issue with ipyparallel, it was caused by garbage collection during the run.
The source code of timeit shows how to disable gc properly.
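A minimal sketch of that pattern: remember the collector state, disable it around the measured region, and restore it afterwards.
import gc
import time


def timed_call(func, *args):
    gcold = gc.isenabled()
    gc.disable()
    try:
        start = time.perf_counter()
        result = func(*args)
        elapsed = time.perf_counter() - start
    finally:
        if gcold:
            gc.enable()   # restore the collector only if it was enabled before
    return result, elapsed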
I posted a similar question a few days ago, but without any code; now I have created test code in the hope of getting some help.
Code is at the bottom.
I have a dataset consisting of a bunch of large files (~100), and I want to extract specific lines from those files very efficiently (both in memory and in speed).
My code gets a list of relevant files; it opens each file with [line 1], then maps the file to memory with [line 2]. For each file I receive a list of indices, and going over the indices I retrieve the relevant information (10 bytes for this example) like so: [lines 3-4]. Finally I close the handles with [lines 5-6].
binaryFile = open(path, "r+b")
binaryFile_mm = mmap.mmap(binaryFile.fileno(), 0)
for INDEX in INDEXES:
information = binaryFile_mm[(INDEX):(INDEX)+10].decode("utf-8")
binaryFile_mm.close()
binaryFile.close()
This code runs in parallel, with thousands of indices for each file, and does so continuously, several times a second, for hours.
Now to the problem - The code runs well when I limit the indices to be small (meaning - when I ask the code to get information from the beginning of the file). But! when I increase the range of the indices, everything slows down to (almost) a halt AND the buff/cache memory gets full (I'm not sure if the memory issue is related to the slowdown).
So my question is: why does it matter whether I retrieve information from the beginning or the end of the file, and how do I overcome this in order to get instant access to information from the end of the file without slowing down or filling up the buff/cache memory?
PS - some numbers and sizes: I have ~100 files, each about 1 GB in size. When I limit the indices to the first 0%-10% of the file it runs fine, but when I allow the index to be anywhere in the file it stops working.
Code - tested on Linux and Windows with Python 3.5, requires 10 GB of storage (creates 3 files with random strings inside, 3 GB each)
import os, errno, sys
import random, time
import mmap
def create_binary_test_file():
print("Creating files with 3,000,000,000 characters, takes a few seconds...")
test_binary_file1 = open("test_binary_file1.testbin", "wb")
test_binary_file2 = open("test_binary_file2.testbin", "wb")
test_binary_file3 = open("test_binary_file3.testbin", "wb")
for i in range(1000):
if i % 100 == 0 :
print("progress - ", i/10, " % ")
# efficiently create random strings and write to files
tbl = bytes.maketrans(bytearray(range(256)),
bytearray([ord(b'a') + b % 26 for b in range(256)]))
random_string = (os.urandom(3000000).translate(tbl))
test_binary_file1.write(str(random_string).encode('utf-8'))
test_binary_file2.write(str(random_string).encode('utf-8'))
test_binary_file3.write(str(random_string).encode('utf-8'))
test_binary_file1.close()
test_binary_file2.close()
test_binary_file3.close()
print("Created binary file for testing.The file contains 3,000,000,000 characters")
# Opening binary test file
try:
binary_file = open("test_binary_file1.testbin", "r+b")
except OSError as e: # this would be "except OSError, e:" before Python 2.6
if e.errno == errno.ENOENT: # errno.ENOENT = no such file or directory
create_binary_test_file()
binary_file = open("test_binary_file1.testbin", "r+b")
## example of use - perform 100 times, in each iteration: open one of the binary files and retrieve 5,000 sample strings
## (if code runs fast and without a slowdown - increase the k or other numbers and it should reproduce the problem)
## Example 1 - getting information from start of file
print("Getting information from start of file")
etime = []
for i in range(100):
start = time.time()
binary_file_mm = mmap.mmap(binary_file.fileno(), 0)
sample_index_list = random.sample(range(1,100000-1000), k=50000)
sampled_data = [[binary_file_mm[v:v+1000].decode("utf-8")] for v in sample_index_list]
binary_file_mm.close()
binary_file.close()
file_number = random.randint(1, 3)
binary_file = open("test_binary_file" + str(file_number) + ".testbin", "r+b")
etime.append((time.time() - start))
if i % 10 == 9 :
print("Iter ", i, " \tAverage time - ", '%.5f' % (sum(etime[-9:]) / len(etime[-9:])))
binary_file.close()
## Example 2 - getting information from all of the file
print("Getting information from all of the file")
binary_file = open("test_binary_file1.testbin", "r+b")
etime = []
for i in range(100):
start = time.time()
binary_file_mm = mmap.mmap(binary_file.fileno(), 0)
sample_index_list = random.sample(range(1,3000000000-1000), k=50000)
sampled_data = [[binary_file_mm[v:v+1000].decode("utf-8")] for v in sample_index_list]
binary_file_mm.close()
binary_file.close()
file_number = random.randint(1, 3)
binary_file = open("test_binary_file" + str(file_number) + ".testbin", "r+b")
etime.append((time.time() - start))
if i % 10 == 9 :
print("Iter ", i, " \tAverage time - ", '%.5f' % (sum(etime[-9:]) / len(etime[-9:])))
binary_file.close()
My results (the average time of getting information from all across the file is almost 4 times slower than getting it from the beginning; with ~100 files and parallel computing this difference gets much bigger):
Getting information from start of file
Iter 9 Average time - 0.14790
Iter 19 Average time - 0.14590
Iter 29 Average time - 0.14456
Iter 39 Average time - 0.14279
Iter 49 Average time - 0.14256
Iter 59 Average time - 0.14312
Iter 69 Average time - 0.14145
Iter 79 Average time - 0.13867
Iter 89 Average time - 0.14079
Iter 99 Average time - 0.13979
Getting information from all of the file
Iter 9 Average time - 0.46114
Iter 19 Average time - 0.47547
Iter 29 Average time - 0.47936
Iter 39 Average time - 0.47469
Iter 49 Average time - 0.47158
Iter 59 Average time - 0.47114
Iter 69 Average time - 0.47247
Iter 79 Average time - 0.47881
Iter 89 Average time - 0.47792
Iter 99 Average time - 0.47681
The basic reason why you have this time difference is that you have to seek to the position you need in the file. The further from position 0 you are, the longer it takes.
What might help is, since you know the starting index you need, to seek on the file descriptor to that point and then do the mmap. Or really, why bother with mmap in the first place: just read the number of bytes that you need from the seeked-to position, and put that into your result variable.
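A hedged sketch of that seek/read variant (read_at is a hypothetical helper, not part of the original code; buffering=0 avoids Python's own read-ahead on top of the kernel's):
def read_at(path, indexes, length=10):
    out = []
    with open(path, "rb", buffering=0) as f:   # unbuffered binary reads
        for index in indexes:
            f.seek(index)
            out.append(f.read(length).decode("utf-8"))
    return out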
To determine if you're getting adequate performance, check the memory available for the buffer/page cache (free in Linux), I/O stats - the number of reads, their size and duration (iostat; compare with the specs of your hardware), and the CPU utilization of your process.
[edit] Assuming that you read from a locally attached SSD (without having the data you need in the cache):
When reading in a single thread, you should expect your batch of 50,000 reads to take more than 7 seconds (50000*0.000150). Probably longer because the 50k accesses of a mmap-ed file will trigger more or larger reads, as your accesses are not page-aligned - as I suggested in another Q&A I'd use simple seek/read instead (and open the file with buffering=0 to avoid unnecessary reads for Python buffered I/O).
With more threads/processes reading simultaneously, you can saturate your SSD throughput (how many 4 KB reads/s it can do - anywhere from 5,000 to 1,000,000), and then the individual reads will become even slower.
[/edit]
The first example only accesses 3*100KB of the files' data, so as you have much more than that available for the cache, all of the 300KB quickly end up in the cache, so you'll see no I/O, and your python process will be CPU-bound.
I'm 99.99% sure that if you test reading from the last 100KB of each file, it will perform as well as the first example - it's not about the location of the data, but about the size of the data accessed.
The second example accesses random portions from 9GB, so you can hope to see similar performance only if you have enough free RAM to cache all of the 9GB, and only after you preload the files into the cache, so that the testcase runs with zero I/O.
In realistic scenarios, the files will not be fully in the cache - so you'll see many I/O requests and much lower CPU utilization for python. As I/O is much slower than cached access, you should expect this example to run slower.
I have written the program (below) to:
read a huge text file as pandas dataframe
then groupby using a specific column value to split the data and store as list of dataframes.
then pipe the data to multiprocess Pool.map() to process each dataframe in parallel.
Everything is fine, the program works well on my small test dataset. But when I pipe in my large data (about 14 GB), the memory consumption increases exponentially and then the computer freezes or the job gets killed (on the HPC cluster).
I have added code to clear the memory as soon as the data/variable isn't needed any more. I am also closing the pool as soon as it is done. Still, with a 14 GB input I was only expecting a 2*14 GB memory burden, but it seems like a lot more is going on. I also tried to tweak chunkSize and maxTaskPerChild, etc., but I am not seeing any difference in optimization for either the test or the large file.
I think improvements to this code are required at this code position, where I start multiprocessing.
p = Pool(3) # number of pool to run at once; default at 1
result = p.map(matrix_to_vcf, list(gen_matrix_df_list.values()))
But I am posting the whole code.
Test example: I created a test file ("genome_matrix_final-chr1234-1mb.txt") of up to 250 MB and ran the program. When I check the system monitor I can see that the memory consumption increased by about 6 GB. It is not clear to me why so much memory space is taken up by a 250 MB file plus some outputs. I have shared that file via Dropbox if it helps in seeing the real problem. https://www.dropbox.com/sh/coihujii38t5prd/AABDXv8ACGIYczeMtzKBo0eea?dl=0
Can someone suggest how I can get rid of the problem?
My python script:
#!/home/bin/python3
import pandas as pd
import collections
from multiprocessing import Pool
import io
import time
import resource
print()
print('Checking required modules')
print()
''' change this input file name and/or path as need be '''
genome_matrix_file = "genome_matrix_final-chr1n2-2mb.txt" # test file 01
genome_matrix_file = "genome_matrix_final-chr1234-1mb.txt" # test file 02
#genome_matrix_file = "genome_matrix_final.txt" # large file
def main():
with open("genome_matrix_header.txt") as header:
header = header.read().rstrip('\n').split('\t')
print()
time01 = time.time()
print('starting time: ', time01)
'''load the genome matrix file onto pandas as dataframe.
This makes it easier for multiprocessing'''
gen_matrix_df = pd.read_csv(genome_matrix_file, sep='\t', names=header)
# now, group the dataframe by chromosome/contig - so it can be multiprocessed
gen_matrix_df = gen_matrix_df.groupby('CHROM')
# store the splitted dataframes as list of key, values(pandas dataframe) pairs
# this list of dataframe will be used while multiprocessing
gen_matrix_df_list = collections.OrderedDict()
for chr_, data in gen_matrix_df:
gen_matrix_df_list[chr_] = data
# clear memory
del gen_matrix_df
'''Now, pipe each dataframe from the list using map.Pool() '''
p = Pool(3) # number of pool to run at once; default at 1
result = p.map(matrix_to_vcf, list(gen_matrix_df_list.values()))
del gen_matrix_df_list # clear memory
p.close()
p.join()
# concat the results from pool.map() and write it to a file
result_merged = pd.concat(result)
del result # clear memory
pd.DataFrame.to_csv(result_merged, "matrix_to_haplotype-chr1n2.txt", sep='\t', header=True, index=False)
print()
print('completed all process in "%s" sec. ' % (time.time() - time01))
print('Global maximum memory usage: %.2f (mb)' % current_mem_usage())
print()
'''function to convert the dataframe from genome matrix to desired output '''
def matrix_to_vcf(matrix_df):
print()
time02 = time.time()
# index position of the samples in genome matrix file
sample_idx = [{'10a': 33, '10b': 18}, {'13a': 3, '13b': 19},
{'14a': 20, '14b': 4}, {'16a': 5, '16b': 21},
{'17a': 6, '17b': 22}, {'23a': 7, '23b': 23},
{'24a': 8, '24b': 24}, {'25a': 25, '25b': 9},
{'26a': 10, '26b': 26}, {'34a': 11, '34b': 27},
{'35a': 12, '35b': 28}, {'37a': 13, '37b': 29},
{'38a': 14, '38b': 30}, {'3a': 31, '3b': 15},
{'8a': 32, '8b': 17}]
# sample index stored as ordered dictionary
sample_idx_ord_list = []
for ids in sample_idx:
ids = collections.OrderedDict(sorted(ids.items()))
sample_idx_ord_list.append(ids)
# for haplotype file
header = ['contig', 'pos', 'ref', 'alt']
# adding some suffixes "PI" to available sample names
for item in sample_idx_ord_list:
ks_update = ''
for ks in item.keys():
ks_update += ks
header.append(ks_update+'_PI')
header.append(ks_update+'_PG_al')
#final variable store the haplotype data
# write the header lines first
haplotype_output = '\t'.join(header) + '\n'
# to store the value of parsed the line and update the "PI", "PG" value for each sample
updated_line = ''
# read the piped in data back to text like file
matrix_df = pd.DataFrame.to_csv(matrix_df, sep='\t', index=False)
matrix_df = matrix_df.rstrip('\n').split('\n')
for line in matrix_df:
if line.startswith('CHROM'):
continue
line_split = line.split('\t')
chr_ = line_split[0]
ref = line_split[2]
alt = list(set(line_split[3:]))
# remove the alleles "N" missing and "ref" from the alt-alleles
alt_up = list(filter(lambda x: x!='N' and x!=ref, alt))
# if no alt alleles are found, just continue
# - i.e : don't write that line in output file
if len(alt_up) == 0:
continue
#print('\nMining data for chromosome/contig "%s" ' %(chr_ ))
#so, we have data for CHR, POS, REF, ALT so far
# now, we mine phased genotype for each sample pair (as "PG_al", and also add "PI" tag)
sample_data_for_vcf = []
for ids in sample_idx_ord_list:
sample_data = []
for key, val in ids.items():
sample_value = line_split[val]
sample_data.append(sample_value)
# now, update the phased state for each sample
# also replacing the missing allele i.e "N" and "-" with ref-allele
sample_data = ('|'.join(sample_data)).replace('N', ref).replace('-', ref)
sample_data_for_vcf.append(str(chr_))
sample_data_for_vcf.append(sample_data)
# add data for all the samples in that line, append it with former columns (chrom, pos ..) ..
# and .. write it to final haplotype file
sample_data_for_vcf = '\t'.join(sample_data_for_vcf)
updated_line = '\t'.join(line_split[0:3]) + '\t' + ','.join(alt_up) + \
'\t' + sample_data_for_vcf + '\n'
haplotype_output += updated_line
del matrix_df # clear memory
print('completed haplotype preparation for chromosome/contig "%s" '
'in "%s" sec. ' %(chr_, time.time()-time02))
print('\tWorker maximum memory usage: %.2f (mb)' %(current_mem_usage()))
# return the data back to the pool
return pd.read_csv(io.StringIO(haplotype_output), sep='\t')
''' to monitor memory '''
def current_mem_usage():
return resource.getrusage(resource.RUSAGE_SELF).ru_maxrss / 1024.
if __name__ == '__main__':
main()
Update for bounty hunters:
I have achieved multiprocessing using Pool.map(), but the code is causing a big memory burden (input test file ~300 MB, but the memory burden is about 6 GB). I was only expecting a 3*300 MB memory burden at most.
Can somebody explain what is causing such a huge memory requirement for such a small file and such a short computation?
Also, I am trying to take the answer and use it to improve multiprocessing in my large program. So the addition of any method or module that doesn't change the structure of the computation part (the CPU-bound process) too much should be fine.
I have included two test files for the test purposes to play with the code.
The attached code is the full code, so it should work as intended as-is when copy-pasted. Any changes should only be used to improve optimization in the multiprocessing steps.
Prerequisite
In Python (in the following I use a 64-bit build of Python 3.6.5) everything is an object. This has its overhead, and with getsizeof we can see exactly the size of an object in bytes:
>>> import sys
>>> sys.getsizeof(42)
28
>>> sys.getsizeof('T')
50
When the fork system call is used (the default on *nix, see multiprocessing.get_start_method()) to create a child process, the parent's physical memory is not copied; the copy-on-write technique is used instead.
The forked child process will still report the full RSS (resident set size) of the parent process. Because of this, PSS (proportional set size) is a more appropriate metric to estimate the memory usage of a forking application. Here's an example from the page:
Process A has 50 KiB of unshared memory
Process B has 300 KiB of unshared memory
Both process A and process B have 100 KiB of the same shared memory region
Since the PSS is defined as the sum of the unshared memory of a process and the proportion of memory shared with other processes, the PSS for these two processes are as follows:
PSS of process A = 50 KiB + (100 KiB / 2) = 100 KiB
PSS of process B = 300 KiB + (100 KiB / 2) = 350 KiB
The data frame
Now let's look at your DataFrame alone. memory_profiler will help us.
justpd.py
#!/usr/bin/env python3
import pandas as pd
from memory_profiler import profile
@profile
def main():
with open('genome_matrix_header.txt') as header:
header = header.read().rstrip('\n').split('\t')
gen_matrix_df = pd.read_csv(
'genome_matrix_final-chr1234-1mb.txt', sep='\t', names=header)
gen_matrix_df.info()
gen_matrix_df.info(memory_usage='deep')
if __name__ == '__main__':
main()
Now let's use the profiler:
mprof run justpd.py
mprof plot
We can see the plot:
and line-by-line trace:
Line # Mem usage Increment Line Contents
================================================
6 54.3 MiB 54.3 MiB #profile
7 def main():
8 54.3 MiB 0.0 MiB with open('genome_matrix_header.txt') as header:
9 54.3 MiB 0.0 MiB header = header.read().rstrip('\n').split('\t')
10
11 2072.0 MiB 2017.7 MiB gen_matrix_df = pd.read_csv('genome_matrix_final-chr1234-1mb.txt', sep='\t', names=header)
12
13 2072.0 MiB 0.0 MiB gen_matrix_df.info()
14 2072.0 MiB 0.0 MiB gen_matrix_df.info(memory_usage='deep')
We can see that the data frame takes ~2 GiB, with a peak at ~3 GiB while it's being built. What's more interesting is the output of info.
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4000000 entries, 0 to 3999999
Data columns (total 34 columns):
...
dtypes: int64(2), object(32)
memory usage: 1.0+ GB
But info(memory_usage='deep') ("deep" means introspection of the data deeply by interrogating object dtypes, see below) gives:
memory usage: 7.9 GB
Huh?! Looking at the process from the outside we can make sure that memory_profiler's figures are correct. sys.getsizeof also shows the same value for the frame (most probably because of a custom __sizeof__), and so will other tools that use it to estimate allocated memory by walking gc.get_objects(), e.g. pympler.
# added after read_csv
from pympler import tracker
tr = tracker.SummaryTracker()
tr.print_diff()
Gives:
types | # objects | total size
================================================== | =========== | ============
<class 'pandas.core.series.Series | 34 | 7.93 GB
<class 'list | 7839 | 732.38 KB
<class 'str | 7741 | 550.10 KB
<class 'int | 1810 | 49.66 KB
<class 'dict | 38 | 7.43 KB
<class 'pandas.core.internals.SingleBlockManager | 34 | 3.98 KB
<class 'numpy.ndarray | 34 | 3.19 KB
So where do these 7.93 GiB come from? Let's try to explain this. We have 4M rows and 34 columns, which gives us 136M values. They are either int64 or object (which is a 64-bit pointer; see using pandas with large data for a detailed explanation). Thus we have 136 * 10 ** 6 * 8 / 2 ** 20 ~ 1038 MiB just for the values in the data frame. What about the remaining ~6.9 GiB?
String interning
To understand this behaviour it's necessary to know that Python does string interning. There are two good articles (one, two) about string interning in Python 2. Besides the Unicode change in Python 3 and PEP 393 in Python 3.3, the C structures have changed, but the idea is the same. Basically, every short string that looks like an identifier will be cached by Python in an internal dictionary, and references will point to the same Python objects. In other words, we can say it behaves like a singleton. The articles I mentioned above explain the significant memory and performance improvements this gives. We can check whether a string is interned using the interned field of PyASCIIObject:
import ctypes
class PyASCIIObject(ctypes.Structure):
_fields_ = [
('ob_refcnt', ctypes.c_size_t),
('ob_type', ctypes.py_object),
('length', ctypes.c_ssize_t),
('hash', ctypes.c_int64),
('state', ctypes.c_int32),
('wstr', ctypes.c_wchar_p)
]
Then:
>>> a = 'name'
>>> b = '!##$'
>>> a_struct = PyASCIIObject.from_address(id(a))
>>> a_struct.state & 0b11
1
>>> b_struct = PyASCIIObject.from_address(id(b))
>>> b_struct.state & 0b11
0
With two strings we can also do an identity comparison (which in CPython compares addresses in memory).
>>> a = 'foo'
>>> b = 'foo'
>>> a is b
True
>>> gen_matrix_df.REF[0] is gen_matrix_df.REF[6]
True
Because of that, with regard to the object dtype, the data frame allocates at most 20 strings (one per amino acid). It's worth noting, though, that Pandas recommends categorical types for enumerations.
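For illustration only (a toy frame, not the author's data): converting an object column to a categorical dtype stores small integer codes plus the few unique strings, instead of one pointer-to-object per row.
import pandas as pd

df = pd.DataFrame({"REF": list("ACGT") * 1_000_000})
print(df.memory_usage(deep=True))         # object column: pointers plus string objects
df["REF"] = df["REF"].astype("category")
print(df.memory_usage(deep=True))         # category column: int8 codes plus 4 categories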
Pandas memory
Thus we can explain the naive estimate of 7.93 GiB as follows:
>>> rows = 4 * 10 ** 6
>>> int_cols = 2
>>> str_cols = 32
>>> int_size = 8
>>> str_size = 58
>>> ptr_size = 8
>>> (int_cols * int_size + str_cols * (str_size + ptr_size)) * rows / 2 ** 30
7.927417755126953
Note that str_size is 58 bytes, not 50 as we've seen above for a 1-character literal. That's because PEP 393 defines compact and non-compact strings. You can check it with sys.getsizeof(gen_matrix_df.REF[0]).
Actual memory consumption should be ~1 GiB, as reported by gen_matrix_df.info(), yet it's twice as much. We can assume it has something to do with memory (pre)allocation done by Pandas or NumPy. The following experiment shows that it's not without reason (multiple runs show the same picture):
Line # Mem usage Increment Line Contents
================================================
8 53.1 MiB 53.1 MiB #profile
9 def main():
10 53.1 MiB 0.0 MiB with open("genome_matrix_header.txt") as header:
11 53.1 MiB 0.0 MiB header = header.read().rstrip('\n').split('\t')
12
13 2070.9 MiB 2017.8 MiB gen_matrix_df = pd.read_csv('genome_matrix_final-chr1234-1mb.txt', sep='\t', names=header)
14 2071.2 MiB 0.4 MiB gen_matrix_df = gen_matrix_df.drop(columns=[gen_matrix_df.keys()[0]])
15 2071.2 MiB 0.0 MiB gen_matrix_df = gen_matrix_df.drop(columns=[gen_matrix_df.keys()[0]])
16 2040.7 MiB -30.5 MiB gen_matrix_df = gen_matrix_df.drop(columns=[random.choice(gen_matrix_df.keys())])
...
23 1827.1 MiB -30.5 MiB gen_matrix_df = gen_matrix_df.drop(columns=[random.choice(gen_matrix_df.keys())])
24 1094.7 MiB -732.4 MiB gen_matrix_df = gen_matrix_df.drop(columns=[random.choice(gen_matrix_df.keys())])
25 1765.9 MiB 671.3 MiB gen_matrix_df = gen_matrix_df.drop(columns=[random.choice(gen_matrix_df.keys())])
26 1094.7 MiB -671.3 MiB gen_matrix_df = gen_matrix_df.drop(columns=[random.choice(gen_matrix_df.keys())])
27 1704.8 MiB 610.2 MiB gen_matrix_df = gen_matrix_df.drop(columns=[random.choice(gen_matrix_df.keys())])
28 1094.7 MiB -610.2 MiB gen_matrix_df = gen_matrix_df.drop(columns=[random.choice(gen_matrix_df.keys())])
29 1643.9 MiB 549.2 MiB gen_matrix_df = gen_matrix_df.drop(columns=[random.choice(gen_matrix_df.keys())])
30 1094.7 MiB -549.2 MiB gen_matrix_df = gen_matrix_df.drop(columns=[random.choice(gen_matrix_df.keys())])
31 1582.8 MiB 488.1 MiB gen_matrix_df = gen_matrix_df.drop(columns=[random.choice(gen_matrix_df.keys())])
32 1094.7 MiB -488.1 MiB gen_matrix_df = gen_matrix_df.drop(columns=[random.choice(gen_matrix_df.keys())])
33 1521.9 MiB 427.2 MiB gen_matrix_df = gen_matrix_df.drop(columns=[random.choice(gen_matrix_df.keys())])
34 1094.7 MiB -427.2 MiB gen_matrix_df = gen_matrix_df.drop(columns=[random.choice(gen_matrix_df.keys())])
35 1460.8 MiB 366.1 MiB gen_matrix_df = gen_matrix_df.drop(columns=[random.choice(gen_matrix_df.keys())])
36 1094.7 MiB -366.1 MiB gen_matrix_df = gen_matrix_df.drop(columns=[random.choice(gen_matrix_df.keys())])
37 1094.7 MiB 0.0 MiB gen_matrix_df = gen_matrix_df.drop(columns=[random.choice(gen_matrix_df.keys())])
...
47 1094.7 MiB 0.0 MiB gen_matrix_df = gen_matrix_df.drop(columns=[random.choice(gen_matrix_df.keys())])
I want to finish this section with a quote from a fresh article about design issues and future Pandas2 by the original author of Pandas.
pandas rule of thumb: have 5 to 10 times as much RAM as the size of your dataset
Process tree
Let's finally come to the pool and see whether it can make use of copy-on-write. We'll use smemstat (available from an Ubuntu repository) to estimate process-group memory sharing, and glances to record system-wide free memory. Both can write JSON.
We'll run the original script with Pool(2). We'll need 3 terminal windows.
smemstat -l -m -p "python3.6 script.py" -o smemstat.json 1
glances -t 1 --export-json glances.json
mprof run -M script.py
Then mprof plot produces:
The sum chart (mprof run --nopython --include-children ./script.py) looks like:
Note that the two charts above show RSS. The hypothesis is that, because of copy-on-write, it doesn't reflect actual memory usage. Now we have two JSON files from smemstat and glances. I'll use the following script to convert the JSON files to CSV.
#!/usr/bin/env python3
import csv
import sys
import json
def smemstat():
with open('smemstat.json') as f:
smem = json.load(f)
rows = []
fieldnames = set()
for s in smem['smemstat']['periodic-samples']:
row = {}
for ps in s['smem-per-process']:
if 'script.py' in ps['command']:
for k in ('uss', 'pss', 'rss'):
row['{}-{}'.format(ps['pid'], k)] = ps[k] // 2 ** 20
# smemstat produces empty samples, backfill from previous
if rows:
for k, v in rows[-1].items():
row.setdefault(k, v)
rows.append(row)
fieldnames.update(row.keys())
with open('smemstat.csv', 'w') as out:
dw = csv.DictWriter(out, fieldnames=sorted(fieldnames))
dw.writeheader()
list(map(dw.writerow, rows))
def glances():
rows = []
fieldnames = ['available', 'used', 'cached', 'mem_careful', 'percent',
'free', 'mem_critical', 'inactive', 'shared', 'history_size',
'mem_warning', 'total', 'active', 'buffers']
with open('glances.csv', 'w') as out:
dw = csv.DictWriter(out, fieldnames=fieldnames)
dw.writeheader()
with open('glances.json') as f:
for l in f:
d = json.loads(l)
dw.writerow(d['mem'])
if __name__ == '__main__':
globals()[sys.argv[1]]()
First let's look at free memory.
The difference between the first value and the minimum is ~4.15 GiB. And here is what the PSS figures look like:
And the sum:
Thus we can see that because of copy-on-write actual memory consumption is ~4.15 GiB. But we're still serialising data to send it to worker processes via Pool.map. Can we leverage copy-on-write here as well?
Shared data
To use copy-on-write we need to have the list(gen_matrix_df_list.values()) be accessible globally, so that the worker can still read it after the fork.
Let's modify code after del gen_matrix_df in main like the following:
...
global global_gen_matrix_df_values
global_gen_matrix_df_values = list(gen_matrix_df_list.values())
del gen_matrix_df_list
p = Pool(2)
result = p.map(matrix_to_vcf, range(len(global_gen_matrix_df_values)))
...
Remove the del gen_matrix_df_list that comes later.
And modify first lines of matrix_to_vcf like:
def matrix_to_vcf(i):
matrix_df = global_gen_matrix_df_values[i]
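As a self-contained sketch of the same globals-plus-fork pattern (it assumes the fork start method, so it is not applicable as-is on Windows):
import multiprocessing as mp

import numpy as np

BIG = None  # filled in the parent before the pool forks


def worker(i: int) -> float:
    # the child reads the parent's pages via copy-on-write;
    # only the index and the float result are pickled
    return float(BIG[i].sum())


if __name__ == '__main__':
    BIG = [np.random.rand(1000, 1000) for _ in range(4)]
    with mp.Pool(2) as pool:
        print(pool.map(worker, range(len(BIG))))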
Now let's re-run it. Free memory:
Process tree:
And its sum:
Thus we're at a maximum of ~2.9 GiB of actual memory usage (the peak the main process has while building the data frame), and copy-on-write has helped!
As a side note, there's so-called copy-on-read, the behaviour of Python's reference-cycle garbage collector described by Instagram Engineering (which led to gc.freeze in issue31558). But gc.disable() doesn't have an impact in this particular case.
Update
An alternative to copy-on-write for copy-less data sharing is delegating it to the kernel from the beginning by using numpy.memmap. Here's an example implementation from the High Performance Data Processing in Python talk. The tricky part is then to make Pandas use the mmapped NumPy array.
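A rough sketch of the memmap idea (the file name, dtype and shape here are assumptions, not from the original code):
import numpy as np
import pandas as pd

# done once: dump the numeric values to a flat binary file
values = np.random.rand(4_000_000, 2)
values.tofile('matrix.dat')

# in each worker: map the file instead of receiving a pickled copy
mm = np.memmap('matrix.dat', dtype='float64', mode='r', shape=(4_000_000, 2))
df = pd.DataFrame(mm, columns=['col_a', 'col_b'])  # beware: some pandas operations still copy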
When you use multiprocessing.Pool, a number of child processes will be created using the fork() system call. Each of those processes starts off with an exact copy of the memory of the parent process at that time. Because you're loading the csv before you create the Pool of size 3, each of those 3 processes in the pool will unnecessarily have a copy of the data frame. (gen_matrix_df as well as gen_matrix_df_list will exist in the current process as well as in each of the 3 child processes, so 4 copies of each of these structures will be in memory.)
Try creating the Pool before loading the file (at the very beginning, actually). That should reduce the memory usage.
If it's still too high, you can:
Dump gen_matrix_df_list to a file, one pickled item after another, e.g.:
import pickle

with open('tempfile.pkl', 'wb') as f:
    for item in gen_matrix_df_list.items():
        pickle.dump(item, f)
Use Pool.imap() on a generator that reads those items back lazily, e.g.:
def pickled_items(path):
    # yield the (key, value) pairs one at a time, so only one is in memory at once
    with open(path, 'rb') as f:
        while True:
            try:
                yield pickle.load(f)
            except EOFError:
                return

p.imap(matrix_to_vcf, pickled_items('tempfile.pkl'))
(Note that matrix_to_vcf takes a (key, value) tuple in the example above, not just a value)
I hope that helps.
NB: I haven't tested the code above. It's only meant to demonstrate the idea.
I had the same issue. I needed to process a huge text corpus while keeping a knowledge base of a few DataFrames of millions of rows loaded in memory. I think this issue is common, so I will keep my answer oriented for general purposes.
A combination of settings solved the problem for me (1 & 3 & 5 only might do it for you):
Use Pool.imap (or imap_unordered) instead of Pool.map. This will iterate over your data lazily rather than loading all of it into memory before processing starts (see the sketch after this list).
Set a value for the chunksize parameter. This will make imap faster too.
Set a value for the maxtasksperchild parameter.
Append output to disk rather than keeping it in memory, either immediately or periodically once it reaches a certain size.
Run the code in different batches. You can use itertools.islice if you have an iterator. The idea is to split your list(gen_matrix_df_list.values()) into three or more lists; then you pass only the first third to map or imap, then the second third in another run, etc. Since you have a list, you can simply slice it in the same line of code.
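A hedged sketch combining points 1-4 (gen_matrix_df_list and matrix_to_vcf are assumed to be the objects from the question; the output file name is a placeholder):
from multiprocessing import Pool

if __name__ == '__main__':
    # maxtasksperchild recycles workers so leaked memory is returned to the OS;
    # imap pulls items from the iterable lazily; chunksize batches the dispatch
    with Pool(processes=3, maxtasksperchild=10) as pool:
        for partial in pool.imap(matrix_to_vcf, iter(gen_matrix_df_list.values()), chunksize=1):
            # append each partial result to disk instead of collecting everything in memory
            partial.to_csv('matrix_to_haplotype.txt', sep='\t', mode='a',
                           header=False, index=False)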
GENERAL ANSWER ABOUT MEMORY WITH MULTIPROCESSING
You asked: "What is causing so much memory to be allocated". The answer relies on two parts.
First, as you already noticed, each multiprocessing worker gets it's own copy of the data (quoted from here), so you should chunk large arguments. Or for large files, read them in a little bit at a time, if possible.
By default the workers of the pool are real Python processes forked
using the multiprocessing module of the Python standard library when
n_jobs != 1. The arguments passed as input to the Parallel call are
serialized and reallocated in the memory of each worker process.
This can be problematic for large arguments as they will be
reallocated n_jobs times by the workers.
Second, if you're trying to reclaim memory, you need to understand that Python works differently from other languages: you are relying on del to release the memory, and it doesn't. I don't know if it's the best way, but in my own code I've overcome this by reassigning the variable to None or an empty object.
FOR YOUR SPECIFIC EXAMPLE - MINIMAL CODE EDITING
As long as you can fit your large data in memory twice, I think you can do what you are trying to do by just changing a single line. I've written very similar code and it worked for me when I reassigned the variable (rather than calling del or any kind of garbage collection). If this doesn't work, you may need to follow the suggestions above and use disk I/O:
#### earlier code all the same
# clear memory by reassignment (not del or gc)
gen_matrix_df = {}
'''Now, pipe each dataframe from the list using map.Pool() '''
p = Pool(3) # number of pool to run at once; default at 1
result = p.map(matrix_to_vcf, list(gen_matrix_df_list.values()))
#del gen_matrix_df_list # I suspect you don't even need this, memory will free when the pool is closed
p.close()
p.join()
#### later code all the same
FOR YOUR SPECIFIC EXAMPLE - OPTIMAL MEMORY USAGE
As long as you can fit your large data in memory once, and you have some idea of how big your file is, you can use Pandas read_csv partial file reading to read in only nrows at a time, if you really want to micro-manage how much data is being read in, or a fixed amount of memory at a time using chunksize, which returns an iterator. By that I mean, the nrows parameter is just a single read: you might use that to just get a peek at a file, or if for some reason you wanted each part to have exactly the same number of rows (because, for example, if any of your data is strings of variable length, each row will not take up the same amount of memory). But I think for the purposes of prepping a file for multiprocessing, it will be far easier to use chunks, because that directly relates to memory, which is your concern. It will be easier to use trial and error to fit into memory based on specifically sized chunks than on a number of rows, which would change the amount of memory usage depending on how much data is in the rows. The only other difficult part is that, for some application-specific reason, you're grouping some rows, so it just makes it a little bit more complicated. Using your code as an example:
'''load the genome matrix file onto pandas as dataframe.
This makes it easier for multiprocessing'''
# store the splitted dataframes as list of key, values(pandas dataframe) pairs
# this list of dataframe will be used while multiprocessing
#not sure why you need the ordered dict here, might add memory overhead
#gen_matrix_df_list = collections.OrderedDict()
#a defaultdict won't throw an exception when we try to append to it the first time. if you don't want a default dict for some reason, you have to initialize each entry you care about.
gen_matrix_df_list = collections.defaultdict(list)
chunksize = 10 ** 6
for chunk in pd.read_csv(genome_matrix_file, sep='\t', names=header, chunksize=chunksize):
# now, group the dataframe by chromosome/contig - so it can be multiprocessed
gen_matrix_df = chunk.groupby('CHROM')
for chr_, data in gen_matrix_df:
gen_matrix_df_list[chr_].append(data)
'''Having sorted chunks on read to a list of df, now create single data frames for each chr_'''
#The dict contains a list of small df objects, so now concatenate them
#by reassigning to the same dict, the memory footprint is not increasing
for chr_ in gen_matrix_df_list.keys():
gen_matrix_df_list[chr_]=pd.concat(gen_matrix_df_list[chr_])
'''Now, pipe each dataframe from the list using map.Pool() '''
p = Pool(3) # number of pool to run at once; default at 1
result = p.map(matrix_to_vcf, list(gen_matrix_df_list.values()))
p.close()
p.join()
I would like to parse a text file to extract information into a spreadsheet program. The problem is that the input text file often has multiple separators, delimiters and text characters, like below.
Script Task : dd_new
==========Summary==========
Success Num : 2690
Fail Num : 2342
===========================
==========Fail MML Command==========
MML Command-----LST RACHCFG:;
NE : 042249
Report : Ne is not connected.
MML Command-----LST RACHCFG:;
NE : 073196
Report : +++ 073196 2016-08-11 16:01:00 DST
O&M #22638
%%/*327033029*/LST RACHCFG:;%%
RETCODE = 939589973 Invalid Command.
--- END
MML Command-----LST RACHCFG:;
NE : 236263
Report : +++ 236263 2016-08-11 16:00:59 DST
O&M #807456706
%%/*325939762*/LST RACHCFG:;%%
RETCODE = 0 Operation succeeded.
Display RACHCfg
---------------
Local cell ID Power ramping step(dB) Preamble initial received target power(dBm) Message size of select group A(bit) PRACH Frequency Offset Indication of PRACH Configuration Index PRACH Configuration Index Maximum number of preamble transmission Timer for contention resolution(subframe) Maximum number of Msg3 HARQ transmissions
1 2dB -104dBm 56bits 6 Not configure NULL 10times 64 5
2 2dB -104dBm 56bits 6 Not configure NULL 10times 64 5
3 2dB -104dBm 56bits 6 Not configure NULL 10times 64 5
(Number of results = 3)
--- END
MML Command-----LST RACHCFG:;
NE : 236264
Report : +++ 236264 2016-08-11 16:01:00 DST
O&M #807445026
%%/*325939772*/LST RACHCFG:;%%
RETCODE = 0 Operation succeeded.
Display RACHCfg
---------------
Local cell ID = 1
Power ramping step(dB) = 2dB
Preamble initial received target power(dBm) = -104dBm
Message size of select group A(bit) = 56bits
PRACH Frequency Offset = 6
Indication of PRACH Configuration Index = Not configure
PRACH Configuration Index = NULL
Maximum number of preamble transmission = 10times
Timer for contention resolution(subframe) = 64
Maximum number of Msg3 HARQ transmissions = 5
(Number of results = 1)
--- END
Approach:
The part where we split the line on ":" as in line.split(':') is a clear way to build a dictionary, but I need a way for the program to work its way through the file, making decisions along the way:
Is it a valid block? (The header Script Task : dd_new can be discarded.)
Failed MML commands, both "Ne is not connected" and "Invalid Command", need to be extracted with the corresponding type into a dict:
{'NE': '042249', 'Status': 'Ne is not connected'}, {'NE': '073196', 'Status': 'Invalid Command'}
Successful MML commands should parse into a nested dictionary when either of the two output formats of Display RACHCfg is encountered. Both should yield a nested key-value dictionary for each NE, then each Local cell ID, and then each parameter:
{'236263': {'1': {'Power ramping step': '2dB', 'Preamble initial received target power': '-104dBm', ...}, '2': {'Power ramping step': '2dB', ...}, '3': {'Power ramping step': '2dB', ...}}}
and similarly (with a single Local cell ID) for the second case:
{'236264': {'1': {'Power ramping step': '2dB', 'Preamble initial received target power': '-104dBm', ...}}}
I have searched the forums and seen examples of regex, and ideas of replacing '===' with start/end tags and reading blocks, but if there is a more Pythonic way or an already existing library that deals with such things, it would make my life very easy.
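For what it's worth, here is a minimal sketch of the block-splitting idea described above. It is only an illustration: it handles the 'key = value' flavour of the successful output, while the tabular flavour would need an extra branch.
import re


def parse_mml_log(text):
    results = []
    # each NE report starts with "MML Command-----"
    for block in text.split('MML Command-----')[1:]:
        ne_match = re.search(r'NE\s*:\s*(\S+)', block)
        ne = ne_match.group(1) if ne_match else None
        if 'Ne is not connected' in block:
            results.append({'NE': ne, 'Status': 'Ne is not connected'})
        elif 'Invalid Command' in block:
            results.append({'NE': ne, 'Status': 'Invalid Command'})
        else:
            params = {}
            for line in block.splitlines():
                # keep only "parameter = value" lines, skip RETCODE and command echoes
                if '=' in line and 'RETCODE' not in line:
                    key, _, value = line.partition('=')
                    params[key.strip()] = value.strip()
            results.append({'NE': ne, 'Params': params})
    return results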
For those who want to know what it is, these are MML logs from telecommunication equipment listing its configuration.
I'm running a streaming hadoop job on a single hadoop pseudo-distributed node in python, also using hadoop-lzo to produce splits on a .lzo compressed input file.
Everything works as expected when using small compressed or uncompressed test datasets; the MapReduce output matches that from a simple 'cat | map | sort | reduce' pipeline in Unix, whether the input is compressed or not.
However, once I move to processing the single large .lzo (pre-indexed) dataset (~40GB compressed) and the job is split to multiple mappers, the output looks to be truncated - only the first few key values are present.
The code + outputs follow - as you can see, it's a very simple count for testing the whole process.
Output from a straightforward Unix pipeline on test data (a subset of the large dataset):
lzop -cd objectdata_input.lzo | ./objectdata_map.py | sort | ./objectdata_red.py
3656 3
3671 3
51 6
output from hadoop job on test data (same test data as above)
hadoop jar $HADOOP_INSTALL/contrib/streaming/hadoop-streaming-*.jar -input objectdata_input.lzo -inputformat com.hadoop.mapred.DeprecatedLzoTextInputFormat -output retention_counts -mapper objectdata_map.py -reducer objectdata_red.py -file /home/bob/python-dev/objectdata_map.py -file /home/bob/python-dev/objectdata_red.py
3656 3
3671 3
51 6
Now, the test data is a small subset of lines from the real dataset, so I would at least expect to see the keys from above in the resulting output when the job is run against the full dataset. However, what I get is:
hadoop jar $HADOOP_INSTALL/contrib/streaming/hadoop-streaming-*.jar -input objectdata_input_full.lzo -inputformat com.hadoop.mapred.DeprecatedLzoTextInputFormat -output retention_counts -mapper objectdata_map.py -reducer objectdata_red.py -file /home/bob/python-dev/objectdata_map.py -file /home/bob/python-dev/objectdata_red.py
1 40475582
12 48874
14 8929777
15 219984
16 161340
17 793211
18 78862
19 47561
2 14279960
20 56399
21 3360
22 944639
23 1384073
24 956886
25 9667
26 51542
27 2796
28 336767
29 840
3 3874316
30 1776
33 1448
34 12144
35 1872
36 1919
37 2035
38 291
39 422
4 539750
40 1820
41 1627
42 97678
43 67581
44 11009
45 938
46 849
47 375
48 876
49 671
5 262848
50 5674
51 90
6 6459687
7 4711612
8 20505097
9 135592
...There are far fewer keys than I would expect based on the dataset.
I'm less bothered by the keys themselves - this set could be expected given the input dataset - but I am more concerned that there should be many, many more keys, in the thousands. When I run the code in a Unix pipeline against the first 25 million records in the dataset, I get keys in the range of approximately 1 - 7000.
So this output appears to be just the first few lines of what I would actually expect, and I'm not sure why. Am I missing collating many part-0000# files, or something similar? This is just a single-node pseudo-distributed Hadoop setup I am testing at home, so if there are more part-# files to collect I have no idea where they could be; they do not show up in the retention_counts dir in HDFS.
The mapper and reducer code is as follows - effectively the same as the many word-count examples floating about:
objectdata_map.py
#!/usr/bin/env python
import sys
RETENTION_DAYS=(8321, 8335)
for line in sys.stdin:
line=line.strip()
try:
retention_days=int(line[RETENTION_DAYS[0]:RETENTION_DAYS[1]])
print "%s\t%s" % (retention_days,1)
except:
continue
objectdata_red.py
#!/usr/bin/env python
import sys
last_key=None
key_count=0
for line in sys.stdin:
key=line.split('\t')[0]
if last_key and last_key!=key:
print "%s\t%s" % (last_key,key_count)
key_count=1
else:
key_count+=1
last_key=key
print "%s\t%s" % (last_key,key_count)
This is all on a manually installed hadoop 1.1.2, pseudo-distributed mode, with hadoop-lzo built and installed from
https://github.com/kevinweil/hadoop-lzo