How to inspect the total number of celery workers? - python

I start 10 celery workers with the following command.
celery -A worker.celery worker -l info -c 10
I need to know the total amount of active celery workers. If the total amout of active workers are not bigger than 10, we can handle the new task. If not, the new task has to wait till a worker finished. Here is the code to check the total amount of active workers.
import json
import subprocess

def get_celery_worker():
    bash_command = "celery -A worker inspect active -j"
    process = subprocess.Popen(bash_command.split(), stdout=subprocess.PIPE)
    output, error = process.communicate()
    output_string = output.decode("utf-8")
    output_json = json.loads(output_string)
    number_of_celery_worker = 0
    if len(list(output_json.values())[0]) == 0:
        pass
    else:
        for value in list(output_json.values())[0]:
            for v in value.values():
                if v == 'run_task':  # Here run_task is the task name.
                    number_of_celery_worker += 1
    return int(number_of_celery_worker / 2)  # Every task contains two run_task entries
I start 10 tasks one by one, one every second. The subprocess gives me these worker totals: 0, 0, 1, 1, 2, 3, 4, 5, 6, 6.
Does anyone have an idea how to implement this, or any other way to count the workers?

Ok, first things first.
I am not familiar with Celery and have not written any code with it until now, but I did some reading to help answer this question. I understand that the package can be used both as a command-line tool and as a Python API, though I am still trying to work out whether the two can be mixed in one program, which is why I would like to see your code.
Given that your full code is unavailable and that the snippet you posted raises errors, I assume you can mix command-line usage with the Python API. The class State in the module celery.events.state is fundamental here: if you can find a way to build a State object from the variables and procedures in your environment (which I know nothing about), then getting the number of workers is straightforward via its alive_workers() method.
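If building a State object turns out to be awkward, here is a rough alternative sketch (assuming the Celery app object is importable as worker.celery, matching the command line in the question) that uses the control/inspect API to count the worker nodes that answer a ping:

# Hedged sketch: count the Celery worker nodes that respond to a ping.
# Assumes the app is importable as `worker.celery`.
from worker import celery as celery_app

def count_alive_workers():
    replies = celery_app.control.inspect().ping() or {}  # None when no worker answers
    return len(replies)

print(count_alive_workers())  # e.g. 1 for a single `celery worker` process

Note that this counts Celery worker nodes (the main processes), not the pool processes inside them - a distinction the next answer goes into.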
Should you be new to the API, the following resources might prove helpful:
Celery Doc Homepage
Celery User guide
Celery API reference
celery.events.state documentation
Hope this helps; if it doesn't, leave your question in the comment section below and I will reply as soon as I can. Thank you.

In Celery parlance, a "celery worker" is actually the main process that controls everything, including communication with the broker and management of the underlying "worker-processes".
So, when you run a command like celery -A worker.celery worker -l info -c 10, you actually start a SINGLE process that Celery refers to as the "celery worker". Since you did not specify the concurrency type, the Celery worker will assume prefork concurrency, which is based on processes.
The -c 10 parameter (concurrency level) instructs the Celery worker that you want 10 worker-processes. So, when the Celery worker starts, it creates 10 worker-processes and communicates with them whenever some task needs to be executed.
Sure, there is much more going on underneath... For example, the Celery worker may every now and then kill some worker-processes and replace them with fresh ones, etc.
As for the management and monitoring API, Celery provides the stats inspect command (https://docs.celeryq.dev/en/stable/userguide/workers.html#worker-statistics), giving you statistics for each node (Celery worker). Run something like celery -A worker.celery inspect stats to see what I mean. The statistics are available through the API as well, if you want to do some Python scripting (see the sketch after the sample output below) - once you get the result, look for the "pool" key. It should contain something like:
"pool": {
"max-concurrency": 60,
"max-tasks-per-child": "N/A",
"processes": [
4262,
4263,
... skipped ...
],
"put-guarded-by-semaphore": false,
"timeouts": [
0,
0
],
"writes": {
"all": "0.02, 0.02, 0.02, 0.02, 0.02, 0.01, 0.02, 0.02, 0.01, 0.02, 0.02, 0.02, 0.01, 0.02, 0.01, 0.02, 0.02, 0.02, 0.03, 0.02, 0.01, 0.01, 0.01, 0.01, 0.01, 0.01, 0.02, 0.01, 0.01, 0.01, 0.02, 0.02, 0.02, 0.01, 0.01, 0.03, 0.01, 0.02, 0.03, 0.02, 0.02, 0.02, 0.01, 0.01, 0.02, 0.01, 0.02, 0.02, 0.01, 0.02, 0.02, 0.01, 0.02, 0.02, 0.01, 0.02, 0.02, 0.01, 0.02, 0.02",
"avg": "0.02",
"inqueues": {
"active": 0,
"total": 60
},
"raw": "92, 99, 86, 118, 104, 62, 119, 115, 74, 102, 106, 95, 76, 88, 59, 114, 96, 86, 140, 84, 75, 75, 71, 79, 43, 76, 97, 51, 71, 44, 132, 106, 91, 71, 72, 148, 54, 117, 144, 83, 101, 105, 51, 74, 114, 69, 84, 92, 72, 86, 125, 67, 80, 97, 64, 88, 101, 61, 90, 94",
"strategy": "fair",
"total": 5330
}
},
The "processes" list contains process IDs of all worker-processes spawned by Celery worker. That number is the number you are looking for. It will however be a constant - 10, as that is what you asked from the Celery worker (the -c 10 parameter).

How to quickly find common values in a list of (sorted?) lists?

EDIT:
Originally this was posted with the question below, but since a practical workaround was found, I changed the title to match the workaround.
See the bottom of the details.
Question (OLD)
Similar to this question (which is already solved), I want to create a boolean array from an array of indices, but I need a much faster solution.
The code in the answer linked above takes up to 80% of my code's execution time, so I want to speed it up somehow.
Details
In case you are wondering, here are the details of what I want to do.
I have a table that has about 50 columns and over 2 million rows.
All elements are uint8 in the range of 0 to 200.
import numpy as np
np.random.seed(0)
table = np.random.randint(0, 201, (10, 5), dtype=np.uint8) # rows=10, cols=5
The table looks like this:
array([[172,  10, 127, 140,  47],
       [170, 196, 151, 117, 166],
       [ 22, 183, 192,  33,  67],
       [179,  78, 154,  82, 162],
       [195, 118, 125, 139, 103],
       [125,   9, 164, 116, 108],
       [161, 159,  21,  81,  89],
       [165, 102,  98,  36, 183],
       [  5, 112,  87,  58,  43],
       [ 76,  70,  60,  75, 189]], dtype=uint8)
For each column, a set of accepted row indices is given by the user as follows:
accepted_rows_for_column_0 = [1, 2, 5, 6]
accepted_rows_for_column_1 = [0, 1, 2, 4, 6, 8, 9]
accepted_rows_for_column_2 = [0, 1, 2, 3, 5, 6, 7]
accepted_rows_for_column_3 = [1, 2, 3, 4, 6, 8]
accepted_rows_for_column_4 = [2, 3, 6, 9]
# for convenience
accepted_rows = [accepted_rows_for_column_0, accepted_rows_for_column_1,
                 accepted_rows_for_column_2, accepted_rows_for_column_3,
                 accepted_rows_for_column_4]
# also, all unaccepted row indices are accessible
unaccepted_rows = ...
Here is some code to generate an actual size table for testing (if you want to try it out).
import numpy as np
import random
np.random.seed(0)
random.seed(0)
table = np.random.randint(0, 201, (2 * 10 ** 6, 50), dtype=np.uint8)
accepted_rows = [
    np.array(sorted(random.sample(list(range(table.shape[0])), random.randint(table.shape[0] // 2, table.shape[0]))))
    for _ in range(table.shape[1])
]
Now, I want to extract all rows where all columns are accepted.
In the example above (10x5 table), 2 and 6 are the target row indices.
expected_result = table[[2, 6]]
array([[ 22, 183, 192,  33,  67],
       [161, 159,  21,  81,  89]], dtype=uint8)
The following is a solution using this answer.
def as_boolean_array(indices, size):
    t = np.zeros(size, dtype=bool)
    t[np.array(indices)] = True  # This line is slow.
    return t

indices = np.array([as_boolean_array(idx, len(table)) for idx in accepted_rows]).all(axis=0)
results = table[indices]
This is the fastest way I have found.
The execution time is about 350 ms, even for a table with 2 million rows; however, more than 300 ms of that is spent on the single line doing the fancy indexing.
As for the execution environment: the table is in memory, and there are about 2 GB of free memory and 10 GB of SSD disk.
Since the program itself already runs in multiple processes, parallelization using multiprocessing is ineffective.
Any suggestions?
EDIT:
As @myrtlecat and @Michael Szczesny mentioned in the comments section, if the indices are sorted, the intersection can be computed relatively quickly.
By intersecting them first, I was able to greatly reduce the number of slow fancy-indexing runs.
import sortednp as snp
indices = as_boolean_array(snp.kway_intersect(*accepted_rows), len(table))
results = table[indices]
Note: The execution time of kway_intersect seems to depend on the number of accepted indices. It is actually slower when most of the indices are accepted. In my case, I can easily get around this problem by using the unaccepted indices instead.
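A rough sketch of that variant, assuming unaccepted_rows holds the per-column complements of accepted_rows: start from an all-True mask and knock out only the rejected rows, so the fancy indexing touches far fewer indices.

mask = np.ones(len(table), dtype=bool)
for rejected in unaccepted_rows:
    mask[rejected] = False  # still fancy indexing, but over far fewer indices
results = table[mask]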

How to perform unsupervised clustering on numbers in an array using PyTorch

I have this array and I want to cluster/group the numbers into similar values.
An example of the input array:
array([ 57, 58, 59, 60, 61, 78, 79, 80, 81, 82, 83, 101, 102, 103, 104, 105, 106])
expected result:
array([57,58,59,60,61]), ([78,79,80,81,82,83]), ([101,102,103,104,105,106])
I tried to use clustering, but I don't think it will work if I don't know how many groups I'm going to split the data into.
true = np.where(array >= 1)
-> (array([ 57, 58, 59, 60, 61, 78, 79, 80, 81, 82, 83, 101, 102,
            103, 104, 105, 106], dtype=int64),)
Dynamic binning requires explicit criteria and is not an easy problem to automate, because each array may require a different set of thresholds to bin it efficiently.
I think Gaussian mixtures with a silhouette-score criterion are the best bet you have. Here is code for what you are trying to achieve. The silhouette scores help you determine the number of clusters/Gaussians to use, and they are quite accurate and interpretable for 1D data.
import numpy as np
from sklearn.mixture import GaussianMixture
from sklearn.metrics import silhouette_score
import scipy.stats
import matplotlib.pyplot as plt
%matplotlib inline

# Sample data
x = [57, 58, 59, 60, 61, 78, 79, 80, 81, 82, 83, 101, 102, 103, 104, 105, 106]

# Fit a model onto the data
data = np.array(x).reshape(-1, 1)

# Change the number of clusters to check the best silhouette score
print('Silhouette scores')
scores = []
for n in range(2, 11):
    model = GaussianMixture(n).fit(data)
    preds = model.predict(data)
    score = silhouette_score(data, preds)
    scores.append(score)
    print(n, '->', score)

n_best = np.argmax(scores) + 2  # because clusters start from 2
model = GaussianMixture(n_best).fit(data)  # best model fit

# Get list of means and standard deviations
mu = np.abs(model.means_.flatten())
sd = np.sqrt(np.abs(model.covariances_.flatten()))

# Plotting
extend_window = 50  # for zooming into or out of the graph; the higher it is, the more zoomed out
x_values = np.arange(data.min() - extend_window, data.max() + extend_window, 0.1)  # for plotting smooth curves
plt.plot(data, np.zeros(data.shape), linestyle='None', markersize=10.0, marker='o')  # plot the data on the x axis

# Plot the fitted distributions (one per Gaussian)
for i in range(n_best):
    y_values = scipy.stats.norm(mu[i], sd[i])
    plt.plot(x_values, y_values.pdf(x_values))

# Split the data by clusters
pred = model.predict(data)
output = np.split(x, np.sort(np.unique(pred, return_index=True)[1])[1:])
print(output)
Silhouette scores
2 -> 0.699444729378163
3 -> 0.8962176943475543  # <--- selected as n_best
4 -> 0.7602523591781903
5 -> 0.5835620702692205
6 -> 0.5313888070615105
7 -> 0.4457049486461251
8 -> 0.4355742296918767
9 -> 0.13725490196078433
10 -> 0.2159663865546218
This creates 3 Gaussians whose fitted distributions split the data into clusters.
The output arrays, finally split by similar values:
#output -
[array([57, 58, 59, 60, 61]),
array([78, 79, 80, 81, 82, 83]),
array([101, 102, 103, 104, 105, 106])]
You can perform a kind of differencing on this array so that you can track the changes better. Assume your array is:
A = np.array([ 57, 58, 59, 60, 61, 78, 79, 80, 81, 82, 83, 101, 102, 103, 104, 105, 106])
Then you can make a difference vector by simply convolving your array with [-1, 1]:
A_ = abs(np.convolve(A, np.array([-1, 1])))
then A_ is:
array([ 57,   1,   1,   1,   1,  17,   1,   1,   1,   1,   1,  18,   1,   1,   1,   1,   1, 106])
now you can define a threshold like 5 and find the cluster boundaries.
THRESHOLD = 5
cluster_bounds = np.argwhere(A_ > THRESHOLD)
now cluster_bounds is:
array([[ 0], [ 5], [11], [17]], dtype=int32)
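As a possible follow-up (not part of the original answer), the boundaries can be turned into the clusters themselves by dropping the first and last entries, which come from the array edges, and splitting at the interior ones:

clusters = np.split(A, cluster_bounds.flatten()[1:-1])
print(clusters)
# [array([57, 58, 59, 60, 61]), array([78, 79, 80, 81, 82, 83]),
#  array([101, 102, 103, 104, 105, 106])]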

scipy.stats.binned_statistic_dd() bin numbering has lots of extra bins

I'm struggling to deal with a scipy.stats.binned_statistic_dd() result. I have an array of positions and another array of ids that I'm binning in 3 directions. I'm providing a list of the bin edges as input rather than a number of bins in each direction coupled with a range option. I have 3 bins in x, 2 in y, and 3 in z, or 18 bins.
However, when I check the binnumbers listed, they are all in a range greater than 20. How do I get the bin numbers to reflect the number of bins provided and get rid of all the extra bins?
I've tried to follow what was suggested in this post (Output in scipy.stats.binned_statistic_dd()) which deals with something similar, but I can't understand how to apply this to my case. As usual, the documentation is as cryptic as ever.
Any help on getting my binnumbers between 1 and 18 in this example would be greatly appreciated!
import numpy as np
from scipy import stats

pos = np.array([[-0.02042167, -0.0223282 , 0.00123734],
                [-0.0420364 , 0.01196078, 0.00694259],
                [-0.09625651, -0.00311446, 0.06125461],
                [-0.07693234, -0.02749618, 0.03617278],
                [-0.07578646, 0.01199925, 0.02991888],
                [-0.03258293, -0.00371765, 0.04245596],
                [-0.06765955, 0.02798434, 0.07075846],
                [-0.02431445, 0.02774102, 0.06719837],
                [ 0.02798265, -0.01096739, -0.01658691],
                [-0.00584252, 0.02043389, -0.00827088],
                [ 0.00623063, -0.02642285, 0.03232817],
                [ 0.00884222, 0.01498996, 0.02912483],
                [ 0.07189474, -0.01541584, 0.01916607],
                [ 0.07239394, 0.0059483 , 0.0740187 ],
                [-0.08519159, -0.02894125, 0.10923724],
                [-0.10803509, 0.01365444, 0.09555333],
                [-0.0442866 , -0.00845725, 0.10361843],
                [-0.04246779, 0.00396127, 0.1418258 ],
                [-0.08975861, 0.02999023, 0.12713186],
                [ 0.01772454, -0.0020405 , 0.08824418]])
ids = np.array([16, 9, 6, 19, 1, 4, 10, 5, 18, 11, 2, 12, 13, 8, 3, 17, 14,
                15, 20, 7])
xbinEdges = np.array([-0.15298488, -0.05108961, 0.05080566, 0.15270093])
ybinEdges = np.array([-0.051, 0. , 0.051])
zbinEdges = np.array([-0.053, 0.049, 0.151, 0.253])
ret = stats.binned_statistic_dd(pos, ids, bins=[xbinEdges, ybinEdges, zbinEdges],
                                statistic='count', expand_binnumbers=False)
bincounts = ret.statistic
binnumber = ret.binnumber.T
>>> binnumber = array([46, 51, 27, 26, 31, 46, 32, 52, 46, 51, 46, 51, 66, 72, 27, 32, 47,
                       52, 32, 47], dtype=int64)
ranges = [[-0.15298488071, 0.15270092971],
          [-0.051000000000000004, 0.051000000000000004],
          [-0.0530000000000001, 0.25300000000000006]]
ret3 = stats.binned_statistic_dd(pos, ids, bins=(3, 2, 3), statistic='count',
                                 expand_binnumbers=False, range=ranges)
bincounts = ret3.statistic
binnumber = ret3.binnumber.T
>>> binnumber = array([46, 51, 27, 26, 31, 46, 32, 52, 46, 51, 46, 51, 66, 72, 27, 32, 47,
                       52, 32, 47], dtype=int64)
Ok, after several days of background thinking and a quick scour through the binned_statistic_dd() source code, I think I've come to the correct answer, and it's pretty simple.
It seems binned_statistic_dd() adds an extra set of outlier bins in the binning phase and then removes these when returning the histogram results, but leaves the bin numbers untouched (I think this is in case you want to reuse the result for further stats outputs).
So if you export the expanded bin numbers (expand_binnumbers=True) and then subtract 1 from each bin number to re-adjust the bin indices, you can calculate the "correct" bin ids.
ret2 = stats.binned_statistic_dd(pos, ids, bins=[xbinEdges, ybinEdges, zbinEdges],
                                 statistic='count', expand_binnumbers=True)
bincounts2 = ret2.statistic
binnumber2 = ret2.binnumber
indxnum2 = binnumber2 - 1
numX, numY, numZ = len(xbinEdges) - 1, len(ybinEdges) - 1, len(zbinEdges) - 1  # 3, 2, 3 bins per axis
corrected_bin_ids = np.ravel_multi_index(indxnum2, (numX, numY, numZ))
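A quick sanity check (my addition to the answer): with 3 x 2 x 3 = 18 bins, the corrected ids should all fall between 0 and 17.

print(corrected_bin_ids.min(), corrected_bin_ids.max())  # both within 0..17
assert corrected_bin_ids.max() < numX * numY * numZ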
Quick and simple in the end!

Validating t-test results using Python scipy

I have a simple Python function:
from scipy.stats import ttest_1samp

def tTest( expectedMean, sampleSet, alpha=0.05 ):
    # T-value and P-value
    tv, pv = ttest_1samp(sampleSet, expectedMean)
    print(tv, pv)
    return pv >= alpha

if __name__ == '__main__':
    # Expected mean is 10
    print tTest(10.0, [99, 99, 22, 77, 99, 55, 44, 33, 20, 9999, 99, 99, 99])
My expectation is that the t-test should fail for this sample, as it is nowhere near the expected population mean of 10. However, the program produces this result:
(1.0790344826428238, 0.3017839504736506)
True
I.e. the p-value is ~30%, which is too high to reject the hypothesis. I am not very knowledgeable about the maths behind the t-test, but I don't understand how this result can be correct. Does anyone have any ideas?
I performed the test using R just to check if the results are the same and they are:
t.test(x=c(99, 99, 22, 77, 99, 55, 44, 33, 20, 9999, 99, 99, 99), alternative = "two.sided",
       mu = 10, paired = FALSE, var.equal = FALSE, conf.level = 0.95)
data: c(99, 99, 22, 77, 99, 55, 44, 33, 20, 9999, 99, 99, 99)
t = 1.079, df = 12, p-value = 0.3018
alternative hypothesis: true mean is not equal to 10
95 percent confidence interval:
-829.9978 2498.3055
sample estimates:
mean of x
834.1538
You can see that the p-value is 0.3.
This is a really interesting problem; I have a lot of issues with hypothesis testing myself. First of all, the sample size has a big influence: with a large sample, say 5000 values, even minor deviations from the expected value you are testing will push the p-value down, so you will reject the null hypothesis most of the time; small samples do the opposite.
And what is happening here is that you have a high variance in the data.
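To make that concrete, here is a small by-hand re-computation of the t statistic (my addition; the numbers match the scipy and R output above). The single 9999 value inflates the sample standard deviation, and hence the standard error, so much that a sample mean of ~834 is still only about one standard error away from 10:

import numpy as np

sample = np.array([99, 99, 22, 77, 99, 55, 44, 33, 20, 9999, 99, 99, 99], dtype=float)
mean = sample.mean()                             # ~834.15
se = sample.std(ddof=1) / np.sqrt(len(sample))   # standard error, blown up by the 9999 outlier
t = (mean - 10.0) / se                           # ~1.079, as reported by scipy and R
print(mean, se, t)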
If you replace your data
[99, 99, 22, 77, 99, 55, 44, 33, 20, 9999, 99, 99, 99]
with
[99, 99, 99, 99, 100, 99, 99, 99, 99, 100, 99, 100, 100]
which has a really small variance, your p-value will be a lot smaller, even though the mean of this sample is arguably closer to 10.

binning random data into groups with an equal number of data points by their value

I have a two-column dataframe (volume and price), and I want to create 20 bins based on the volume column with an equal amount of data in each bin.
I.e. if I have volume = [1,6,8,2,6,9,3,6] and 4 bins, I want to cut the data into 1st bin: 1:2, 2nd: 3:6, 3rd: 6:8, 4th: 8:9,
then find the average volume and price within each bin and plot a graph of volume (x-axis) against price (y-axis).
The intervals don't need to be equally spaced. I want the same number of data points in each interval, to determine the range of each interval, and then to find the average value of the data within each interval and plot it.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

data = df['Volume']
discrete_dat, cutoff = discretize(data, 20)  # discretize() is defined elsewhere (not shown)
myList = sorted(set(cutoff))
Cutoff = np.asarray(myList)
df_2 = pd.DataFrame({'X': df['Volume'], 'Y': df['dMidP']})  # we build a dataframe from the data
data_cut = pd.cut(data, Cutoff)  # we cut the data following the bins
grp = df_2.groupby(by=data_cut)  # we group the data by the cut
ret = grp.aggregate(np.mean)  # we produce an aggregate representation (mean) of each bin
plt.loglog(df['Volume'], df['dMidP'], 'o')
plt.loglog(ret.X, ret.Y, 'r-')
plt.title('Price Impact (Sell)')
plt.xlabel('Volume')
plt.ylabel('dMidP')
plt.show()
[plot: my raw data and output]
However, when I use the Counter function, it returns the following, indicating that the number of data points in each interval is different.
Counter({Interval(0.41299999999999998, 0.46400000000000002, closed='right'): 2029,
Interval(0.877, 0.92800000000000005, closed='right'): 543,
Interval(0.050999999999999997, 0.069599999999999995, closed='right'): 93,
Interval(0.60299999999999998, 0.71399999999999997, closed='right'): 99,
Interval(0.46400000000000002, 0.496, closed='right'): 93,
Interval(0.092799999999999994, 0.125, closed='right'): 111,
Interval(0.125, 0.14799999999999999, closed='right'): 86,
Interval(0.0092800000000000001, 0.018599999999999998, closed='right'): 101,
Interval(0.53800000000000003, 0.60299999999999998, closed='right'): 99,
Interval(0.14799999999999999, 0.186, closed='right'): 108,
Interval(0.018599999999999998, 0.023199999999999998, closed='right'): 102,
Interval(0.186, 0.23200000000000001, closed='right'): 134,
Interval(3.246, 4.2670000000000003, closed='right'): 85,
Interval(0.496, 0.53800000000000003, closed='right'): 103,
Interval(1.391, 1.716, closed='right'): 86,
Interval(0.26400000000000001, 0.32500000000000001, closed='right'): 104,
nan: 243,
Interval(0.23200000000000001, 0.26400000000000001, closed='right'): 60,
Interval(0.032500000000000001, 0.046399999999999997, closed='right'): 186,
Interval(0.00464, 0.0092800000000000001, closed='right'): 87,
Interval(0.023199999999999998, 0.032500000000000001, closed='right'): 74,
Interval(0.71399999999999997, 0.877, closed='right'): 101,
Interval(0.97399999999999998, 1.1359999999999999, closed='right'): 92,
Interval(4.2670000000000003, 6.3120000000000003, closed='right'): 100,
Interval(0.046399999999999997, 0.050999999999999997, closed='right'): 33,
Interval(1.716, 1.855, closed='right'): 145,
Interval(0.069599999999999995, 0.092799999999999994, closed='right'): 97,
Interval(1.1359999999999999, 1.391, closed='right'): 319,
Interval(2.319, 2.7829999999999999, closed='right'): 114,
Interval(0.32500000000000001, 0.41299999999999998, closed='right'): 98,
Interval(0.92800000000000005, 0.97399999999999998, closed='right'): 72,
Interval(2.7829999999999999, 3.246, closed='right'): 75,
Interval(2.1429999999999998, 2.319, closed='right'): 128,
Interval(1.855, 2.1429999999999998, closed='right'): 56})
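For reference, a rough sketch of one way to get bins with (roughly) equal counts, which is what the question asks for. This is an addition on my part, assuming pandas' qcut (which cuts on quantiles rather than on the cutoffs computed above) and the column names from the question:

import pandas as pd

# qcut assigns quantile-based bins, so each of the 20 bins holds roughly the
# same number of rows; duplicates='drop' guards against repeated edges.
df['vol_bin'] = pd.qcut(df['Volume'], q=20, duplicates='drop')
binned = df.groupby('vol_bin')[['Volume', 'dMidP']].mean()
print(binned)  # average volume and price per (roughly) equal-count bin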
