MapReduce output not the complete set expected? - python

I'm running a streaming Hadoop job in Python on a single pseudo-distributed Hadoop node, also using hadoop-lzo to produce splits on an .lzo-compressed input file.
Everything works as expected on small test datasets, whether compressed or not: the MapReduce output matches that of a simple 'cat | map | sort | reduce' pipeline in unix.
However, once I move to processing the single large .lzo (pre-indexed) dataset (~40GB compressed) and the job is split across multiple mappers, the output appears to be truncated - only the first few key values are present.
The code + outputs follow - as you can see, it's a very simple count for testing the whole process.
Output from a straightforward unix pipeline on the test data (a subset of the large dataset):
lzop -cd objectdata_input.lzo | ./objectdata_map.py | sort | ./objectdata_red.py
3656 3
3671 3
51 6
Output from the hadoop job on the same test data as above:
hadoop jar $HADOOP_INSTALL/contrib/streaming/hadoop-streaming-*.jar -input objectdata_input.lzo -inputformat com.hadoop.mapred.DeprecatedLzoTextInputFormat -output retention_counts -mapper objectdata_map.py -reducer objectdata_red.py -file /home/bob/python-dev/objectdata_map.py -file /home/bob/python-dev/objectdata_red.py
3656 3
3671 3
51 6
Now, the test data is a small subset of lines from the real dataset, so I would at least expect to see the keys from above in the output when the job is run against the full dataset. However, what I get is:
hadoop jar $HADOOP_INSTALL/contrib/streaming/hadoop-streaming-*.jar -input objectdata_input_full.lzo -inputformat com.hadoop.mapred.DeprecatedLzoTextInputFormat -output retention_counts -mapper objectdata_map.py -reducer objectdata_red.py -file /home/bob/python-dev/objectdata_map.py -file /home/bob/python-dev/objectdata_red.py
1 40475582
12 48874
14 8929777
15 219984
16 161340
17 793211
18 78862
19 47561
2 14279960
20 56399
21 3360
22 944639
23 1384073
24 956886
25 9667
26 51542
27 2796
28 336767
29 840
3 3874316
30 1776
33 1448
34 12144
35 1872
36 1919
37 2035
38 291
39 422
4 539750
40 1820
41 1627
42 97678
43 67581
44 11009
45 938
46 849
47 375
48 876
49 671
5 262848
50 5674
51 90
6 6459687
7 4711612
8 20505097
9 135592
...There are far fewer keys than I would expect based on the dataset.
I'm less bothered by the keys themselves - this set could be expected given the input dataset - than by the fact that there should be many, many more keys, in the thousands. When I run the code in a unix pipeline against the first 25 million records in the dataset, I get keys in the range of approximately 1 - 7000.
So, this output appears to be just the first few lines of what I would actually expect, and I'm not sure why. Am I missing collating many part-0000# files, or something similar? This is just a single-node pseudo-distributed hadoop setup I am testing on at home, so if there are more part-# files to collect I have no idea where they could be; they do not show up in the retention_counts dir in HDFS.
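For what it's worth, one quick way to rule out missed output is to list and concatenate every part file in the output directory rather than reading a single part-00000. A minimal sketch, assuming the job output really is in retention_counts on HDFS and that the hadoop binary is on the PATH; it just shells out to the standard hadoop fs commands from Python:
# List the job output directory, then merge every part file and count the
# lines, so nothing is missed by looking at only one part-0000# file.
import subprocess
subprocess.check_call(["hadoop", "fs", "-ls", "retention_counts"])
merged = subprocess.check_output(["hadoop", "fs", "-cat", "retention_counts/part-*"])
print("%d total output lines across all part files" % len(merged.splitlines()))
If that count matches what is already visible, the part files are not the issue and the truncation is happening earlier in the job.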
The mapper and reducer code is as follows - effectively the same as the many word-count examples floating about:
objectdata_map.py
#!/usr/bin/env python
import sys
RETENTION_DAYS=(8321, 8335)
for line in sys.stdin:
    line = line.strip()
    try:
        retention_days = int(line[RETENTION_DAYS[0]:RETENTION_DAYS[1]])
        print "%s\t%s" % (retention_days, 1)
    except:
        continue
objectdata_red.py
#!/usr/bin/env python
import sys
last_key=None
key_count=0
for line in sys.stdin:
    key = line.split('\t')[0]
    if last_key and last_key != key:
        print "%s\t%s" % (last_key, key_count)
        key_count = 1
    else:
        key_count += 1
    last_key = key

print "%s\t%s" % (last_key, key_count)
This is all on a manually installed hadoop 1.1.2 in pseudo-distributed mode, with hadoop-lzo built and installed from
https://github.com/kevinweil/hadoop-lzo

Related

Index error while using OligoMiner commands in wsl

I am trying to design oligo probes using the OligoMiner tool in Windows Subsystem for Linux (WSL), working in an activated Anaconda environment. Please have a look at the commands given below. In the middle I am facing an IndexError which I am not able to fix. I copied the FASTA DNA files into the OligoMiner folder before running the commands in WSL. They were as follows:
(ol) rahul#DESKTOP-J3Q9JD9:~/ol_dir/OligoMiner$ python blockParse.py 5.fa
0 of 345789
100000 of 345789
200000 of 345789
6635 candidate probes identified in 345.77 kb yielding 19.19 candidates/kb
Program took 14.599275 seconds
(ol) rahul#DESKTOP-J3Q9JD9:~/ol_dir/OligoMiner$ bowtie2-build 5.fa 5
Settings:
Output files: "5.*.bt2"
Line rate: 6 (line is 64 bytes)
Lines per side: 1 (side is 64 bytes)
Offset rate: 4 (one in 16)
FTable chars: 10
Strings: unpacked
Max bucket size: default
Max bucket size, sqrt multiplier: default
Max bucket size, len divisor: 4
Difference-cover sample period: 1024
Endianness: little
Actual local endianness: little
Sanity checking: disabled
Assertions: disabled
Random seed: 0
Sizeofs: void*:8, int:4, long:8, size_t:8
Input files DNA, FASTA:
5.fa
Building a SMALL index
Reading reference sizes
Time reading reference sizes: 00:00:00
Calculating joined length
Writing header
Reserving space for joined string
Joining reference sequences
Time to join reference sequences: 00:00:00
bmax according to bmaxDivN setting: 86447
Using parameters --bmax 64836 --dcv 1024
Doing ahead-of-time memory usage test
Passed! Constructing with these parameters: --bmax 64836 --dcv 1024
Constructing suffix-array element generator
Building DifferenceCoverSample
Building sPrime
Building sPrimeOrder
V-Sorting samples
V-Sorting samples time: 00:00:00
Allocating rank array
Ranking v-sort output
Ranking v-sort output time: 00:00:00
Invoking Larsson-Sadakane on ranks
Invoking Larsson-Sadakane on ranks time: 00:00:00
Sanity-checking and returning
Building samples
Reserving space for 12 sample suffixes
Generating random suffixes
QSorting 12 sample offsets, eliminating duplicates
QSorting sample offsets, eliminating duplicates time: 00:00:00
Multikey QSorting 12 samples
(Using difference cover)
Multikey QSorting samples time: 00:00:00
Calculating bucket sizes
Splitting and merging
Splitting and merging time: 00:00:00
Avg bucket size: 345789 (target: 64835)
Converting suffix-array elements to index image
Allocating ftab, absorbFtab
Entering Ebwt loop
Getting block 1 of 1
No samples; assembling all-inclusive block
Sorting block of length 345789 for bucket 1
(Using difference cover)
Sorting block time: 00:00:00
Returning block of 345790 for bucket 1
Exited Ebwt loop
fchr[A]: 0
fchr[C]: 82484
fchr[G]: 168573
fchr[T]: 263073
fchr[$]: 345789
Exiting Ebwt::buildToDisk()
Returning from initFromVector
Wrote 4309819 bytes to primary EBWT file: 5.1.bt2
Wrote 86452 bytes to secondary EBWT file: 5.2.bt2
Re-opening _in1 and _in2 as input streams
Returning from Ebwt constructor
Headers:
len: 345789
bwtLen: 345790
sz: 86448
bwtSz: 86448
lineRate: 6
offRate: 4
offMask: 0xfffffff0
ftabChars: 10
eftabLen: 20
eftabSz: 80
ftabLen: 1048577
ftabSz: 4194308
offsLen: 21612
offsSz: 86448
lineSz: 64
sideSz: 64
sideBwtSz: 48
sideBwtLen: 192
numSides: 1801
numLines: 1801
ebwtTotLen: 115264
ebwtTotSz: 115264
color: 0
reverse: 0
Total time for call to driver() for forward index: 00:00:00
Reading reference sizes
Time reading reference sizes: 00:00:00
Calculating joined length
Writing header
Reserving space for joined string
Joining reference sequences
Time to join reference sequences: 00:00:00
Time to reverse reference sequence: 00:00:00
bmax according to bmaxDivN setting: 86447
Using parameters --bmax 64836 --dcv 1024
Doing ahead-of-time memory usage test
Passed! Constructing with these parameters: --bmax 64836 --dcv 1024
Constructing suffix-array element generator
Building DifferenceCoverSample
Building sPrime
Building sPrimeOrder
V-Sorting samples
V-Sorting samples time: 00:00:00
Allocating rank array
Ranking v-sort output
Ranking v-sort output time: 00:00:00
Invoking Larsson-Sadakane on ranks
Invoking Larsson-Sadakane on ranks time: 00:00:00
Sanity-checking and returning
Building samples
Reserving space for 12 sample suffixes
Generating random suffixes
QSorting 12 sample offsets, eliminating duplicates
QSorting sample offsets, eliminating duplicates time: 00:00:00
Multikey QSorting 12 samples
(Using difference cover)
Multikey QSorting samples time: 00:00:00
Calculating bucket sizes
Splitting and merging
Splitting and merging time: 00:00:00
Avg bucket size: 345789 (target: 64835)
Converting suffix-array elements to index image
Allocating ftab, absorbFtab
Entering Ebwt loop
Getting block 1 of 1
No samples; assembling all-inclusive block
Sorting block of length 345789 for bucket 1
(Using difference cover)
Sorting block time: 00:00:00
Returning block of 345790 for bucket 1
Exited Ebwt loop
fchr[A]: 0
fchr[C]: 82484
fchr[G]: 168573
fchr[T]: 263073
fchr[$]: 345789
Exiting Ebwt::buildToDisk()
Returning from initFromVector
Wrote 4309819 bytes to primary EBWT file: 5.rev.1.bt2
Wrote 86452 bytes to secondary EBWT file: 5.rev.2.bt2
Re-opening _in1 and _in2 as input streams
Returning from Ebwt constructor
Headers:
len: 345789
bwtLen: 345790
sz: 86448
bwtSz: 86448
lineRate: 6
offRate: 4
offMask: 0xfffffff0
ftabChars: 10
eftabLen: 20
eftabSz: 80
ftabLen: 1048577
ftabSz: 4194308
offsLen: 21612
offsSz: 86448
lineSz: 64
sideSz: 64
sideBwtSz: 48
sideBwtLen: 192
numSides: 1801
numLines: 1801
ebwtTotLen: 115264
ebwtTotSz: 115264
color: 0
reverse: 1
Total time for backward call to driver() for mirror index: 00:00:01
(ol) rahul#DESKTOP-J3Q9JD9:~/ol_dir/OligoMiner$ bowtie2 -x ~/ol_dir/OligoMiner$ bowtie2 -x ~/ol_dir/OligoMiner/5 -U 5.fastq --no-hd -t -k 100 --very-sensitive-local -S 5_u.sam
(ERR): "/home/rahul/ol_dir/OligoMiner$" does not exist or is not a Bowtie 2 index
Exiting now ...
(ol) rahul#DESKTOP-J3Q9JD9:~/ol_dir/OligoMiner$ bowtie2 -x ~/ol_dir/OligoMiner/5 -U 5.fa
Error: reads file does not look like a FASTQ file
terminate called after throwing an instance of 'int'
Aborted (core dumped)
(ERR): bowtie2-align exited with value 134
(ol) rahul#DESKTOP-J3Q9JD9:~/ol_dir/OligoMiner$ bowtie2 -x ~/ol_dir/OligoMiner/5 -U 5.fastq --no-hd -t -k 100 --very-sensitive-local -S 5_u.sam
Time loading reference: 00:00:00
Time loading forward index: 00:00:00
Time loading mirror index: 00:00:00
Multiseed full-index search: 00:00:00
6635 reads; of these:
6635 (100.00%) were unpaired; of these:
0 (0.00%) aligned 0 times
6578 (99.14%) aligned exactly 1 time
57 (0.86%) aligned >1 times
100.00% overall alignment rate
Time searching: 00:00:00
Overall time: 00:00:00
(ol) rahul#DESKTOP-J3Q9JD9:~/ol_dir/OligoMiner$ python outputClean.py -u -f 5_u.sam
Traceback (most recent call last):
File "outputClean.py", line 486, in <module>
main()
File "outputClean.py", line 480, in main
reportVal, debugVal, metaVal, outNameVal, startTime)
File "outputClean.py", line 70, in cleanOutput
if x[0] is not '#' else ' ' for x in file_read]
IndexError: list index out of range
(ol) rahul#DESKTOP-J3Q9JD9:~/ol_dir/OligoMiner$
I was working in an activated Anaconda environment where the OligoMiner tool was used to create oligo probes from the input DNA FASTA file.
I was expecting to get the probes after the commands above, but could not get them.
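A side note on the traceback itself: the comprehension in outputClean.py's cleanOutput() indexes x[0] for every element of file_read, so it raises exactly this "list index out of range" as soon as one element is empty - a blank or trailing empty line in 5_u.sam is one common way to get there (an assumption here, not something visible in the log). A hypothetical reconstruction of the pattern plus a guard, not OligoMiner's own code:
# An empty line splits into an empty field list, and [] has no element 0,
# which is what triggers the IndexError in the unguarded comprehension.
file_read = [line.split() for line in ["probe1\t0\tchr5\t100", "", "@SQ\tSN:chr5"]]
# guarded version: the trailing "if x" skips empty field lists before x[0] is touched
cleaned = [x if x[0] != '#' else ' '
           for x in file_read
           if x]
print(cleaned)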

Opencv_traincascade - is it not possible to train opencv_traincascade with just 26 samples or fewer?

I am trying to run this command in a terminal:
C:\Ankit\VirEnv\opencv\build\x64\vc15\bin\opencv_traincascade.exe -data cascade/ -vec C:\Ankit\VirEnv\pos3.vec -bg neg.txt -w 24 -h 24 -negPos 20 -numNeg 1000 -minHitRate 0.9
This is my output
(env) PS C:\Ankit\VirEnv> C:\Ankit\VirEnv\opencv\build\x64\vc15\bin\opencv_traincascade.exe -data cascade/ -vec C:\Ankit\VirEnv\pos3.vec -bg neg.txt -w 24 -h 24 -negPos 20 -numNeg 1000 -minHitRate 0.9
PARAMETERS:
cascadeDirName: cascade/
vecFileName: C:\Ankit\VirEnv\pos3.vec
bgFileName: neg.txt
numPos: 2000
numNeg: 1000
numStages: 20
precalcValBufSize[Mb] : 1024
precalcIdxBufSize[Mb] : 1024
acceptanceRatioBreakValue : -1
stageType: BOOST
featureType: HAAR
sampleWidth: 24
sampleHeight: 24
boostType: GAB
minHitRate: 0.9
maxFalseAlarmRate: 0.5
weightTrimRate: 0.95
maxDepth: 1
maxWeakCount: 100
mode: BASIC
Number of unique features given windowSize [24,24] : 162336
===== TRAINING 0-stage =====
<BEGIN
OpenCV: terminate handler is called! The last OpenCV error is:
OpenCV(3.4.16) Error: Bad argument (> Can not get new positive sample. The most possible reason is insufficient count of samples in given vec-file.
> ) in CvCascadeImageReader::PosReader::get, file C:\build\3_4_winpack-build-win64-vc15\opencv\apps\traincascade\imagestorage.cpp, line 158
I am getting this error
OpenCV(3.4.16) Error: Bad argument (> Can not get new positive sample. The most possible reason is insufficient count of samples in given vec-file.
> ) in CvCascadeImageReader::PosReader::get, file C:\build\3_4_winpack-build-win64-vc15\opencv\apps\traincascade\imagestorage.cpp, line 158
I only have 26 samples. I have tried to mess with the numPos value and the minHitRate value, but the error is still there, and I don't know what to do next. Is it not possible to do this with just 26 samples? Is it necessary to have more than 26 samples to make this work?
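For a rough sense of scale: the PARAMETERS block above shows numPos: 2000, which is traincascade's default, so the -negPos 20 flag did not actually lower the positive count, while pos3.vec holds only 26 samples. A commonly quoted rule of thumb (not an official OpenCV formula) is that the vec file needs at least numPos + (numStages - 1) * (1 - minHitRate) * numPos + S samples, where S is the number of positives rejected by earlier stages. A small sketch of what that implies here:
# Back-of-the-envelope check of the unofficial vec-file sizing rule, using the
# values from the run above; S is unknown before training, so take the best case.
vec_count = 26
num_stages = 20
min_hit_rate = 0.9
S = 0
max_num_pos = int(vec_count / (1 + (num_stages - 1) * (1 - min_hit_rate)))
print(max_num_pos)  # prints 8 - far below the numPos of 2000 the run actually used
So with 26 positives the numbers only work out with an explicit -numPos in the single digits and far fewer stages.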

pyamiibo vs amiiboapi.com to lookup amiibo bin dump file

I am trying to use pyamiibo from https://pypi.org/project/pyamiibo/ to read my amiibo bin dump file in python, and attempting to use amiiboapi.com to look up the details...
For Duck Hunt, pyamiibo's uid_hex returns "04 FC 30 82 03 49 80", but amiiboapi.com returns { "head": "07820000", "tail": "002f0002",}...
What should I do to link up the two outputs?
The UID is unique to each tag; the head and tail are in pages 21 and 22 respectively.
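Building on the comment above about pages 21 and 22, one way to line the two outputs up is to read those pages straight out of the raw .bin dump. A minimal sketch, assuming the usual NTAG215 layout of 4 bytes per page (so page 21 starts at byte offset 84); the file name is illustrative and this does not go through pyamiibo's API:
import binascii
# Pages 21 and 22 of the dump together form the identifier that
# amiiboapi.com splits into "head" and "tail".
with open("duck_hunt.bin", "rb") as f:
    dump = f.read()
head = binascii.hexlify(dump[21 * 4:22 * 4]).decode()  # should match "07820000"
tail = binascii.hexlify(dump[22 * 4:23 * 4]).decode()  # should match "002f0002"
print(head, tail)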

How to redirect output from awk as a variable in python?

Say I have the data file given below. The awk command that follows splits the file into multiple parts using the value in the first column and writes each part to a separate file.
chr pos idx len
2 23 4 4
2 25 7 3
2 29 8 2
2 35 1 5
3 37 2 5
3 39 3 3
3 41 6 3
3 45 5 5
4 25 3 4
4 32 6 3
4 38 5 4
awk 'BEGIN {FS=OFS="\t"} {print > "file_"$1".txt"}' write_multiprocessData.txt
The above code will split the file into file_2.txt, file_3.txt, and so on. Since awk loads the file into memory first anyway, I would rather write a python script that calls awk to split the file and loads the parts directly into memory, giving the data unique variable names such as file_1, file_2.
Would this be possible? If not, what other variations can I try?
I think your awk code has a little bug: as written it also sends the header line to a file_chr.txt. If you want to incorporate your awk code into a python script that organizes everything you want to do, try this:
import os
from numpy import *
os.system("awk '{if(NR>1) print >\"file_\"$1\".txt\"}' test.dat")
os.system works very well; however, I did not know it is considered obsolescent. Anyway, as suggested, subprocess works as well:
import subprocess
cmd = "awk '{if(NR>1) print >\"file_\"$1\".txt\"}' test.dat"
p = subprocess.Popen(cmd, stdin=subprocess.PIPE, stdout=subprocess.PIPE, shell=True)
p.communicate()  # wait for awk to finish writing the split files
There is no need for Awk here.
from collections import defaultdict
prefix = defaultdict(list)
with open('Data.txt', 'r') as data:
    for line in data:
        line = line.rstrip('\r\n')
        prefix[line.split()[0]].append(line)
Now you have in the dict prefix all the first fields from all the lines as keys, and the list of lines with that prefix as the value for each key.
If you also wish to write the results into files at this point, that's an easy exercise.
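For completeness, a minimal version of that exercise, continuing from the prefix dict built above and mirroring the file_<prefix>.txt names the awk one-liner produced:
# One output file per key, same naming scheme as the original awk command.
for key, lines in prefix.items():
    with open('file_%s.txt' % key, 'w') as out:
        out.write('\n'.join(lines) + '\n')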
Generally, simple Awk scripts are nearly always easy and natural to reimplement in Python. Because Awk is very specialized for a constrained set of tasks, the Python code will often be less succinct, but with the Python adage "explicit is better than implicit" in mind, this may actually be a feature from a legibility and maintainability point of view.

Python Locust stopped without error message, how should I check?

I run locust using this command:
locust -f locustfile.py --no-web -c10 -r10 &> locust.log &
My understanding is that all output (stdout, stderr) will go to locust.log.
However, when the program stopped without me triggering a stop, the last lines of locust.log were only the stats shown below; no error message could be found:
Name # reqs # fails Avg Min Max | Median req/s
--------------------------------------------------------------------------------------------------------------------------------------------
GET /*******/**********/ 931940 8(0.00%) 45 10 30583 | 23 101.20
GET /**************/************/ 931504 14(0.00%) 47 11 30765 | 24 104.10
GET /**************/***************/ 594 92243(99.36%) 30 12 549 | 23 0.00
--------------------------------------------------------------------------------------------------------------------------------------------
Total 1864038 92265(4.95%) 205.30
Since I didn't set a number of requests, the job should run forever and never stop on its own.
Where and how should I check why the job is stopping?
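One low-cost way to get more signal next time, given that nothing useful reaches locust.log when the process dies abruptly, is to have the interpreter itself dump tracebacks on fatal signals. This is a hypothetical addition at the top of locustfile.py using only the standard library (faulthandler is built in on Python 3; on Python 2 it is a separate backport package), not a Locust feature:
# Write Python tracebacks to stderr - which the &> redirect already captures -
# when the interpreter dies on a fatal signal, and also on a plain SIGTERM.
import faulthandler
import signal
faulthandler.enable()
faulthandler.register(signal.SIGTERM)
If even that prints nothing, the process was most likely killed with SIGKILL (which cannot be trapped), and the system logs are the next place to look rather than locust.log.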
