No performance difference between different numbers of map tasks (1, 2, 4, ...) - Python

I am very new to Hadoop and am testing the performance difference between different numbers of map tasks and reduce tasks. The file size is about 5 GB, and Hadoop is installed on a machine with 4 physical cores / 8 logical cores (hyper-threading).
The mapper and reducer are written in Python, so I specify the number of map tasks with -D mapred.map.tasks=2 and the number of reduce tasks with -D mapred.reduce.tasks=2.
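For reference, the job is submitted with Hadoop Streaming along these lines (a sketch: the streaming jar path, HDFS paths and script names below are illustrative placeholders, not values from the question):

# sketch of the streaming invocation; jar path, paths and script names are placeholders
hadoop jar $HADOOP_HOME/contrib/streaming/hadoop-streaming-*.jar \
    -D mapred.map.tasks=2 -D mapred.reduce.tasks=2 \
    -input /data/input -output /data/output \
    -mapper mapper.py -reducer reducer.py \
    -file mapper.py -file reducer.py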
Problem
The problem is that the results don't show any performance difference between different numbers of map tasks.
Result
+----------+----------+----------+
| map | reduce | time |
+----------+----------+----------+
| 1 | 1 | 47m 09s |
| 2 | 1 | 45m 35s |
| 4 | 1 | 46m 30s |
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
| 1 | 2 | 38m 37s |
| 2 | 2 | 39m 22s |
| 4 | 2 | 39m 29s |
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
| 1 | 4 | 38m 18s |
| 2 | 4 | 38m 48s |
| 4 | 4 | 38m 23s |
+----------+----------+----------+
It seems that there is a few minutes' difference between using 1 reduce task and using 2 reduce tasks, but there is no difference when I change the number of map tasks. Are all the tasks being performed on only one node, so that the map tasks are not running in parallel?
What could be causing this? I would appreciate any information.
Edit
I also tried specifying these values in mapred-site.xml instead of on the command line, but it didn't make any difference.

The mapred.map.tasks option is not a directive but a hint to Hadoop, so how did you check the actual number of map tasks executed? While a job is running, you can monitor running jobs in the JobTracker and running tasks in the TaskTracker. You can also ssh into your Hadoop machine and check for running map/reduce tasks; they will be Java processes.
You can try setting mapreduce.tasktracker.map.tasks.maximum in your mapred-site.xml to control how many map tasks may run in parallel on each node, and then check whether you see any benefit from parallel execution.
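A minimal mapred-site.xml sketch (the value 4 is only an illustrative choice for a 4-core node):

<configuration>
  <!-- illustrative: allow up to 4 concurrent map tasks per TaskTracker -->
  <property>
    <name>mapreduce.tasktracker.map.tasks.maximum</name>
    <value>4</value>
  </property>
</configuration>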
For more performance-monitoring options you might want to install Ganglia; also see this blog entry: Monitoring Hadoop beyond Ganglia

Related

Which methodology of programming technique could I use to solve the workflow optimization with constraints?

So the problem is how to maximize the productivity of a production line under many constraints.
Below is a table of each worker's processing times and of the steps each worker can perform.
The constraints are:
Each product must go through these 6 procedures sequentially (1 to 2 to 3 to 4 to 5 to 6), and each worker is only capable of processing certain steps. All products start in Building A, and after completing all the steps they can be in either building for shipment. Each worker can only process one product at a time and is not allowed to run different procedures concurrently. It is assumed that a product is always available to start at Building X.
Transportation time within the same building is assumed to be negligible. However, cross-building transportation time is 25 minutes. The truck, which has a maximum capacity of 5, can only be at one building at any point in time.
| Worker | Procedure 1 (min) | Procedure 2 (min) | Procedure 3 (min) | Procedure 4 (min) | Procedure 5 (min) | Procedure 6 (min) |
| ------ | ----------------- | ----------------- | ----------------- | ----------------- | ----------------- | ----------------- |
| a      | 5                 |                   | 10                |                   |                   |                   |
| b      |                   | 15                |                   |                   |                   | 10                |
| c      |                   | 15                |                   |                   | 10                |                   |
| d      | 5                 |                   |                   | 15                |                   |                   |
| e      | 5                 |                   | 5                 |                   | 15                |                   |
| f      |                   |                   |                   | 10                |                   | 10                |
The objective is to find the maximum throughput (the total number of products produced) within 168 hours. You also need to be able to list every step that each product went through during the process.
I have tried to split the question into two parts:
First, the workers produce the products normally (I have to list every single step by hand, and I am still not sure this is the best way to optimise the result). Then, at some point in time -- the last stage -- I assume that all the workers are in an equilibrium state, each handling one procedure, so that every procedure produces the same amount of product in the same time. (The idea is to keep all the workers, as well as the truck, busy all the time in order to maximise productivity.) I have tried to solve this second part using linear programming and got a result, but with this methodology I cannot recover the specific steps by which that result would be achieved.
Now I am not sure which methodology I could use to solve this problem. Can someone give me any suggestions? I really appreciate it.
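For the equilibrium part, here is a minimal sketch of how the steady-state throughput bound could be set up with scipy.optimize.linprog. It is only a sketch under strong assumptions: it ignores the truck, the cross-building transport and the sequencing of individual products, and the variable x[w][p] (the fraction of worker w's time spent on procedure p) is a modelling choice, not something from the original question.

# Steady-state throughput LP (sketch): ignores transport and per-product sequencing.
import numpy as np
from scipy.optimize import linprog

# processing time in minutes; 0 means the worker cannot do that procedure (from the table)
times = np.array([
    [5,  0, 10,  0,  0,  0],   # worker a
    [0, 15,  0,  0,  0, 10],   # worker b
    [0, 15,  0,  0, 10,  0],   # worker c
    [5,  0,  0, 15,  0,  0],   # worker d
    [5,  0,  5,  0, 15,  0],   # worker e
    [0,  0,  0, 10,  0, 10],   # worker f
], dtype=float)
W, P = times.shape
rate = np.where(times > 0, 1.0 / np.where(times > 0, times, 1.0), 0.0)  # products per minute

# variables: x[w, p] flattened (W*P values) followed by the common throughput T
n_x = W * P
c = np.zeros(n_x + 1)
c[-1] = -1.0                               # maximize T  <=>  minimize -T

A_ub, b_ub = [], []
for p in range(P):                         # every procedure must keep up with T
    row = np.zeros(n_x + 1)
    for w in range(W):
        row[w * P + p] = -rate[w, p]
    row[-1] = 1.0                          # T - sum_w rate[w, p] * x[w, p] <= 0
    A_ub.append(row)
    b_ub.append(0.0)
for w in range(W):                         # each worker has at most 100% of their time
    row = np.zeros(n_x + 1)
    row[w * P:(w + 1) * P] = 1.0
    A_ub.append(row)
    b_ub.append(1.0)

res = linprog(c, A_ub=np.array(A_ub), b_ub=np.array(b_ub),
              bounds=[(0, None)] * (n_x + 1))
T = res.x[-1]                              # products per minute in steady state
print("upper bound on throughput over 168 h:", T * 60 * 168)

As the question already notes, this only gives the equilibrium throughput; it does not produce the step-by-step schedule for each product and the truck.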

How to process multiple time series with machine learning / deep learning methods (fault diagnosis)

This is an industrial fault diagnosis scenario: a binary classification problem on time series. When a fault occurs, the label changes from zero to one. The data from one machine looks like this:
| time | feature | label |
| ---- | ------- | ----- |
| 1    | 26      | 0     |
| 2    | 29      | 1     |
| 3    | 30      | 1     |
| 4    | 20      | 0     |
The problem is that faults don't happen frequently, so I need to select a sufficient number of slices of the time series for training.
So I want to ask how I should organize these data: should I treat them as one time series, or is there a better choice? How should I organize these data, and what machine learning method should I use for fault diagnosis?
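One common way to organize such data is to cut each machine's series into fixed-length windows and give a window the label 1 if it contains any faulty time steps. The sketch below assumes exactly that; the window length, the step size and the labelling rule are illustrative choices, not part of the original question.

# Sketch: slice one machine's labelled time series into fixed-length windows.
import numpy as np

def make_windows(features, labels, window_len=50, step=10):
    """features: (T, n_features) array; labels: (T,) array of 0/1 per time step.
    Returns one sample per window; a window is labelled 1 if it contains a fault."""
    X, y = [], []
    for start in range(0, len(features) - window_len + 1, step):
        end = start + window_len
        X.append(features[start:end])
        y.append(int(labels[start:end].max() > 0))
    return np.array(X), np.array(y)

# Windows from several machines can then be concatenated into one training set:
# X_all = np.concatenate([X_machine1, X_machine2]); y_all = np.concatenate([y1, y2])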

Is there a way to improve a MERGE query?

I am using this query to insert new entries into my table:
MERGE INTO CLEAN clean USING DUAL ON (clean.id = :id)
WHEN NOT MATCHED THEN INSERT (ID, COUNT) VALUES (:id, :xcount)
WHEN MATCHED THEN UPDATE SET clean.COUNT = clean.count + :xcount
It seems that I do more inserts than updates; is there a way to improve my current performance?
I am using cx_Oracle with Python 3 and Oracle Database 19c.
If you had massive problems with your approach, you would most probably be missing an index on the column clean.id, which this approach requires because the MERGE uses dual as a source for each single row.
That is unlikely here, though, since you say the id is a primary key.
So basically you are doing the right thing, and you will see an execution plan similar to the one below:
---------------------------------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time |
---------------------------------------------------------------------------------------------------
| 0 | MERGE STATEMENT | | | | 2 (100)| |
| 1 | MERGE | CLEAN | | | | |
| 2 | VIEW | | | | | |
| 3 | NESTED LOOPS OUTER | | 1 | 40 | 2 (0)| 00:00:01 |
| 4 | TABLE ACCESS FULL | DUAL | 1 | 2 | 2 (0)| 00:00:01 |
| 5 | VIEW | VW_LAT_A18161FF | 1 | 38 | 0 (0)| |
| 6 | TABLE ACCESS BY INDEX ROWID| CLEAN | 1 | 38 | 0 (0)| |
|* 7 | INDEX UNIQUE SCAN | CLEAN_UX1 | 1 | | 0 (0)| |
---------------------------------------------------------------------------------------------------
Predicate Information (identified by operation id):
---------------------------------------------------
7 - access("CLEAN"."ID"=:ID)
So the execution plan is fine and works effectively, but it has one problem.
Remember: whenever you use an index, you will be happy while processing a few rows, but it will not scale.
If you are processing millions of records, you can fall back to two-step processing:
insert all the rows into a temporary table
perform a single MERGE statement using the temporary table
The big advantage is that Oracle can use a hash join and get rid of the index access for each of the millions of rows.
Here is an example of a test where the clean table is initialized with 1M ids (not shown) and we then perform 1M inserts and 1M updates:
# assumes an existing cx_Oracle connection and cursor, and a temporary table tmp(id, count);
# clean already contains 1M of these ids, so 1M of the 2M rows become updates and 1M become inserts
n = 1000000
data2 = [{"id" : i, "xcount" :1} for i in range(2*n)]
sql3 = """
insert into tmp (id,count)
values (:id,:xcount)"""
sql4 = """MERGE into clean USING tmp on (clean.id = tmp.id)
when not matched then insert (id, count) values (tmp.id, tmp.count)
when matched then update set clean.count= clean.count + tmp.count"""
cursor.executemany(sql3, data2)
cursor.execute(sql4)
The test runs in approx. 10 seconds, which is less than half the time of your approach with MERGE using dual.
If this is still not enough, you'll have to use the parallel option.
MERGE is quite fast. Inserts are usually faster than updates, I'd say.
So, if you're asking how to make inserts faster, then it depends.
If you're inserting one row at a time, there shouldn't be any bottleneck.
If you're inserting millions of rows, see whether there are triggers enabled on the table which fire for each row and do something (slowing the process down).
As for updates, is there an index on the clean.id column? If not, it would probably help.
Otherwise, see what the explain plan says; also collect statistics regularly.

Simple moving average for random related time values

I'm a beginner programmer looking for help with the Simple Moving Average (SMA). I'm working with column files, where the first column is time and the second is a value. The time intervals are random, and so are the values. Usually the files are not big, but the process collects data for a long time. At the end the files look similar to this:
+-----------+-------+
| Time | Value |
+-----------+-------+
| 10 | 3 |
| 1345 | 50 |
| 1390 | 4 |
| 2902 | 10 |
| 34057 | 13 |
| (...) | |
| 898975456 | 10 |
+-----------+-------+
After the whole process the number of rows is around 60k-100k.
Then I try to "smooth" the data within some time window. For this purpose I'm using an SMA. [AWK_method]
awk 'BEGIN{size=$timewindow} {mod=NR%size; if(NR<=size){count++}else{sum-=array[mod]};sum+=$1;array[mod]=$1;print sum/count}' file.dat
To achieve proper working of the SMA with a predefined $timewindow, I create a linearly incrementing time column filled with zero values. Then I run the script with different $timewindow values and observe the results.
+-----------+-------+
| Time | Value |
+-----------+-------+
| 1 | 0 |
| 2 | 0 |
| 3 | 0 |
| (...) | |
| 10 | 3 |
| 11 | 0 |
| 12 | 0 |
| (...) | |
| 1343 | 0 |
| (...) | |
| 898975456 | 10 |
+-----------+-------+
For small data sets it was relatively comfortable, but now it is quite time-consuming, and the created files are starting to get too big. I'm also familiar with Gnuplot, but SMA there is hell...
So here are my questions:
Is it possible to change the awk solution so that it bypasses filling the data with zeros?
Do you recommend any other solution using bash?
I have also considered learning Python, because after 6 months of learning bash I have come to know its limitations. Will I be able to solve this in Python without creating big files?
I'll be glad of any form of help or advice.
Best regards!
[AWK_method] http://www.commandlinefu.com/commands/view/2319/awk-perform-a-rolling-average-on-a-column-of-data
You included a python tag, so check out traces:
http://traces.readthedocs.io/en/latest/
Here are some other insights:
Moving average for time series with unequal intervals
http://www.eckner.com/research.html
https://stats.stackexchange.com/questions/28528/moving-average-of-irregular-time-series-data-using-r
https://en.wikipedia.org/wiki/Unevenly_spaced_time_series
The key phrase for further research:
In statistics, signal processing, and econometrics, an unevenly (or unequally or irregularly) spaced time series is a sequence of observation time and value pairs (tn, Xn) with strictly increasing observation times. As opposed to equally spaced time series, the spacing of observation times is not constant.
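If you move to Python, a time-window SMA over irregularly spaced samples can be computed directly with a two-pointer sliding window, with no zero padding at all. This is only a sketch: the file name, the column layout (time in column 1, value in column 2, as described in the question) and the window size of 1000 time units are assumptions.

# Sketch: time-window moving average for irregularly spaced samples, no zero padding.
def sma_time_window(path, window):
    times, values = [], []
    with open(path) as f:
        for line in f:
            t, v = line.split()[:2]        # column 1 = time, column 2 = value
            times.append(float(t))
            values.append(float(v))
    start = 0
    running_sum = 0.0
    for i, (t, v) in enumerate(zip(times, values)):
        running_sum += v
        while times[start] < t - window:   # drop samples older than t - window
            running_sum -= values[start]
            start += 1
        print(t, running_sum / (i - start + 1))

sma_time_window("file.dat", window=1000)

Each output line is the average of all values whose timestamps fall within the last `window` time units, so the memory use and output size stay proportional to the original 60k-100k rows.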
awk '{Q=$2-last;if(Q>0){while(Q>1){print "| "++i" | 0 |";Q--};print;last=$2;next};last=$2;print}' Input_file

SciPy Optimization algorithm

I need to solve an optimization task with Python.
The task is the following:
A factory produces desks, chairs, bureaus and cupboards. Two types of boards can be used to produce them. The factory has 1500 m of the first type and 1000 m of the second. The factory has 800 employees. What should the factory produce, and in what quantities, to obtain the maximum profit?
The input values are the following:
| Per product  | Desk | Chair | Bureau | Cupboard |
| ------------ | ---- | ----- | ------ | -------- |
| Board 1 type | 5    | 1     | 9      | 12       |
| Board 2 type | 2    | 3     | 4      | 1        |
| Employees    | 3    | 2     | 5      | 10       |
| Profit       | 12   | 5     | 15     | 10       |
Unfortunately I don't have any experience in solving optimization tasks, so I don't even know where to start. What I did:
I found the SciPy optimization package, which is supposed to solve this type of problem.
I have some idea of the input and output of my function. The input should be the amount of each type of product, and the output should be the profit. But the mix of resources (boards, employees) may also vary, and this affects the algorithm implementation.
Could you please give me at least some direction to go in? Thank you!
EDIT:
Basically @Balzola is right: it's the simplex algorithm. The task can be solved with scipy.optimize.linprog, which uses simplex under the hood.
Typical https://en.wikipedia.org/wiki/Simplex_algorithm
Looks like scipy can do it:
https://docs.scipy.org/doc/scipy/reference/tutorial/optimize.html#nelder-mead-simplex-algorithm-method-nelder-mead
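Following the EDIT above, here is a minimal sketch of how the table could be fed to scipy.optimize.linprog. It is the continuous relaxation, so it may return fractional product counts, and profit is negated because linprog minimizes.

# Sketch: LP for the factory problem with scipy.optimize.linprog.
import numpy as np
from scipy.optimize import linprog

profit = np.array([12, 5, 15, 10])    # per unit: desk, chair, bureau, cupboard
A_ub = np.array([
    [5, 1, 9, 12],    # board type 1 used per unit, limit 1500
    [2, 3, 4, 1],     # board type 2 used per unit, limit 1000
    [3, 2, 5, 10],    # employees needed per unit, limit 800
])
b_ub = np.array([1500, 1000, 800])

res = linprog(c=-profit, A_ub=A_ub, b_ub=b_ub, bounds=[(0, None)] * 4)
print("production plan:", res.x)      # quantities of desk, chair, bureau, cupboard
print("maximum profit:", -res.fun)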
