Printing out the value from a matrix - python

I have a task that consumes an arbitrary amount of CPU and memory over time. I monitor it with the output of the following Linux command:
mpstat -u 1 -P ALL
The output looks like:
02:22:14 PM CPU %usr %nice %sys %iowait %irq %soft %steal %guest %idle
02:22:15 PM all 4.51 0.00 0.11 0.00 0.00 0.00 0.00 0.00 95.37
02:22:15 PM 0 0.00 0.00 1.00 0.00 0.00 0.00 0.00 0.00 99.00
02:22:15 PM 1 **78.22** 0.00 0.99 0.00 0.00 0.00 0.00 0.00 20.79
02:22:15 PM 2 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 100.00
02:22:15 PM 3 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 100.00
02:22:15 PM 4 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
02:22:15 PM 5 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
02:22:15 PM 6 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 100.00
02:22:15 PM 7 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
02:22:15 PM 8 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
02:22:15 PM 9 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
02:22:15 PM 10 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
02:22:15 PM 11 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
02:22:15 PM 12 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
02:22:15 PM 13 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
02:22:15 PM 14 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
02:22:15 PM 15 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 100.00
02:22:15 PM 16 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
02:22:15 PM 17 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
02:22:15 PM 18 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
02:22:15 PM 19 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 100.00
02:22:15 PM 20 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
02:22:15 PM 21 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
02:22:15 PM 22 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
02:22:15 PM 23 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
I want to grab the value located in the 4th column and 3rd row, i.e. a[3][4] = 78.22, every 20 seconds, in bash/Python/Perl.
So the script I want will execute the mpstat command and print out the value in the specified column, and based on those values it will create a graph. I was thinking of appending the required value to a .dat file and running gnuplot or another app that creates the graph.
Any suggestions on how to proceed?

You can get the 3rd row and 4th cell using awk. The following grabs this cell from mpstat's output and appends it, along with the current UNIX timestamp, to a statistics file. Note the trailing "1 1" (interval and count) so mpstat takes a single sample and exits; systime() is a GNU awk extension, and you may need to adjust NR if your mpstat prints a banner and a blank line first:
mpstat -u -P ALL 1 1 | awk 'NR==4 {print systime(), $4}' >> stats.txt
To run this command every 20 seconds:
watch -n 20 "mpstat -u -P ALL 1 1 | awk 'NR==4 {print systime(), \$4}' >> stats.txt"
Then plot with gnuplot:
cat stats.txt | gnuplot -p -e 'set datafile separator " "; plot "-" using 1:2 with lines'
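An equivalent Python sketch, if you'd rather avoid awk. The parsing assumes the layout shown above, where the CPU id is the 3rd whitespace-separated field and %usr the 4th; the file name and single-sample mpstat invocation are assumptions mirroring the commands above:

```python
import subprocess
import time

def usr_for_cpu(mpstat_output, cpu="1"):
    """Return the %usr value for the given CPU id from `mpstat -P ALL` text.

    Assumes sample rows shaped like: '02:22:15 PM 1 78.22 0.00 ...'.
    """
    for line in mpstat_output.splitlines():
        fields = line.split()
        if len(fields) > 3 and fields[2] == cpu:
            try:
                return float(fields[3])
            except ValueError:
                continue  # not a numeric sample row; keep scanning
    return None

def append_sample(path="stats.txt"):
    """Take one mpstat sample and append 'unix_timestamp value' to `path`."""
    out = subprocess.run(
        ["mpstat", "-u", "-P", "ALL", "1", "1"],
        capture_output=True, text=True,
    ).stdout
    value = usr_for_cpu(out)
    if value is not None:
        with open(path, "a") as f:
            f.write("{} {}\n".format(int(time.time()), value))
```

Call append_sample() from a loop with time.sleep(20), or from cron/watch, then feed stats.txt to gnuplot as above.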

Try the following
#!/bin/bash

function _mpstat() {
    while :; do
        arr=( $(mpstat -P 1 | tail -n 1) )
        echo "${arr[3]}"
        sleep 20
    done >> file.txt
}

_mpstat &
echo "_mpstat PID: $!"
Explanation
while :; do — infinite loop.
$(mpstat -P 1 | tail -n 1) — run mpstat for CPU 1 only (-P 1), take the last line with tail -n 1, and capture the output with $( ).
arr=( ... ) — store the captured output in an array.
echo "${arr[3]}" — print array index 3 (the %usr field).
sleep 20 — sleep for 20 seconds.
>> file.txt — append the loop's stdout to the file.
_mpstat & — run the function as a background process with &.
echo "_mpstat PID: $!" — print the PID of the background function.
You can grep the PID to display its parent and kill both when needed.

How to align 3 dataframes with some threshold in python?

I have some x & y columns in a dataframe such as below:
X-1 X-1_y X-2 X-2_y X-3 X-3_y
0 411.726266 1387.29 437.404307 3755.08 437.273585 3360.85
1 437.692665 677.39 448.557534 1460.70 448.760155 981.45
2 448.596937 2276.35 481.550490 0.00 481.566018 0.00
3 481.634531 0.00 486.966310 0.00 487.208899 0.00
4 486.971163 0.00 492.578155 0.00 492.446192 0.00
5 492.505388 0.00 500.000000 608.22 500.153040 0.00
6 500.030500 810.45 508.218825 0.00 508.315935 0.00
7 508.106596 0.00 513.579177 0.00 513.620953 9582.45
8 513.424161 0.00 515.308245 0.00 515.175867 0.00
9 535.131828 0.00 534.346333 0.00 534.985459 0.00
10 551.779516 3124.92 551.712654 2226.94 551.680943 2522.73
11 559.050425 1081.89 559.084859 984.05 559.087271 1600.48
12 562.108257 3532.11 562.253910 3686.94 562.234223 4495.73
13 591.436797 0.00 590.659433 0.00 591.396752 0.00
and I would like to align all three X columns and merge them into one X column. If the numbers in the X columns are close to each other (i.e. within ±1), take the average of the (up to) three values where available; but if a number is not close to the others, append a new row. The final result would be a new dataframe like this:
avg X X-1_y X-2_y X-3_y
0 411.726266 1387.29 0.00 0.00
1 437.456852 677.39 3755.08 3360.85
2 448.638209 2276.35 1460.70 981.45
3 481.583680 0.00 0.00 0.00
4 487.048791 0.00 0.00 0.00
5 492.509912 0.00 0.00 0.00
6 500.061180 810.45 608.22 0.00
7 508.213785 0.00 0.00 0.00
8 513.541430 0.00 0.00 9582.45
9 515.242056 0.00 0.00 0.00
10 534.821206 0.00 0.00 0.00
11 551.724371 3124.92 2226.94 2522.73
12 559.074185 1081.89 984.05 1600.48
13 562.198797 3532.11 3686.94 4495.73
14 591.164327 0.00 0.00 0.00
example of how the result is created:
If the numbers in a row of X are within ±1 of each other, take their average. If all 3 are mutually outside ±1, append three new rows; if one is off while the other two are within ±1, append 2 rows (one for the new off value and one for the average of the 2 that are within ±1). For example, on the first row of data,
X-1 X-1_y X-2 X-2_y X-3 X-3_y
0 411.726266 1387.29 437.404307 3755.08 437.273585 3360.85
1 437.692665 677.39 448.557534 1460.70 448.760155 981.45
X1 (411.72) is not within ±1 of X2 (437.4) and X3 (437.2), so a new line is appended to the result; but X2 (437.4) and X3 (437.2) are within ±1 of each other, and also within ±1 of X1 in the 2nd row (437.692), so the next line is the average of those 3: avg(X1_row2, X2_row1, X3_row1).
results will be
avg X X-1_y X-2_y X-3_y
0 411.726266 1387.29 0.00 0.00
1 437.456852 677.39 3755.08 3360.85
thanks in advance
You can do the following:
First we want to make a flat list of all the X values, in row order:
items = [[row["X-1"], row["X-2"], row["X-3"]] for index, row in df.iterrows()]
flat_list = [item for sublist in items for item in sublist]
Then:
final = []
x = 0
while x < len(flat_list):
    window = flat_list[x:x + 3]
    if len(window) == 3 and abs(window[0] - window[1]) < 1 and abs(window[0] - window[2]) < 1:
        final.append(sum(window) / 3)
        x += 3
    else:
        final.append(window[0])
        x += 1
Note the explicit length check, so the loop can't run past the end of the list when fewer than 3 values remain.
That produces what you want for the avg X column:
final
[411.726266,
437.4568523333333,
448.6382086666667,
481.5836796666667,
487.04879066666666,
492.5099116666667,
500.06118,
508.21378533333336,
513.5414303333333,
515.308245,
515.175867,
534.8212066666666,
551.7243709999999,
559.0741849999999,
562.1987966666667,
591.1643273333334]
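The same idea packaged as a standalone function (a sketch; the tolerance parameter and the explicit length check are additions over the inline loop above):

```python
def collapse_close(flat_list, tol=1.0):
    """Average each run of 3 consecutive values that lie within `tol`
    of the first value of the run; otherwise emit the single value
    and advance by one."""
    final = []
    x = 0
    while x < len(flat_list):
        window = flat_list[x:x + 3]
        if (len(window) == 3
                and abs(window[0] - window[1]) < tol
                and abs(window[0] - window[2]) < tol):
            final.append(sum(window) / 3)
            x += 3
        else:
            final.append(window[0])
            x += 1
    return final
```

On the first rows of the question's data this yields 411.726266, then the average 437.456852..., then 448.638208..., matching the expected output.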

Return column names for 3 highest values in rows

I'm trying to come up with a way to return the column names for the 3 highest values in each row of the table below. So far I've been able to return the highest value using idxmax but I haven't been able to figure out how to get the 2nd and 3rd highest.
Clust Stat1 Stat2 Stat3 Stat4 Stat5 Stat6
0 9 0.00 0.15 0.06 0.11 0.23 0.01
1 4 0.00 0.25 0.04 0.10 0.10 0.00
2 11 0.00 0.34 0.00 0.09 0.24 0.00
3 12 0.00 0.16 0.00 0.11 0.00 0.00
4 0 0.00 0.35 0.00 0.04 0.02 0.00
5 17 0.01 0.21 0.02 0.18 0.27 0.01
Expected output:
Clust Stat1 Stat2 Stat3 Stat4 Stat5 Stat6 TopThree
0 9 0.00 0.15 0.06 0.11 0.23 0.01 [Stat5,Stat2,Stat4]
1 4 0.00 0.25 0.04 0.10 0.10 0.00 [Stat2,Stat4,Stat5]
2 11 0.00 0.34 0.00 0.09 0.24 0.00 [Stat2,Stat5,Stat4]
3 12 0.00 0.16 0.00 0.19 0.00 0.01 [Stat4,Stat2,Stat6]
4 0 0.00 0.35 0.00 0.04 0.02 0.00 [Stat2,Stat4,Stat5]
5 17 0.01 0.21 0.02 0.18 0.27 0.01 [Stat5,Stat2,Stat4]
If anyone has ideas on how to do this I'd appreciate it.
Use numpy.argsort to get the positions of the sorted values, applied to all columns except the first:
import numpy as np

a = df.iloc[:, 1:].to_numpy()
df['TopThree'] = df.columns[1:].to_numpy()[np.argsort(-a, axis=1)[:, :3]].tolist()
print (df)
Clust Stat1 Stat2 Stat3 Stat4 Stat5 Stat6 TopThree
0 9 0.00 0.15 0.06 0.11 0.23 0.01 [Stat5, Stat2, Stat4]
1 4 0.00 0.25 0.04 0.10 0.10 0.00 [Stat2, Stat4, Stat5]
2 11 0.00 0.34 0.00 0.09 0.24 0.00 [Stat2, Stat5, Stat4]
3 12 0.00 0.16 0.00 0.11 0.00 0.00 [Stat2, Stat4, Stat1]
4 0 0.00 0.35 0.00 0.04 0.02 0.00 [Stat2, Stat4, Stat5]
5 17 0.01 0.21 0.02 0.18 0.27 0.01 [Stat5, Stat2, Stat4]
If performance is not important:
df['TopThree'] = df.iloc[:, 1:].apply(lambda x: x.nlargest(3).index.tolist(), axis=1)
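Both variants can be sanity-checked on a couple of the sample rows; the mini-frame below is a hypothetical reconstruction of the question's layout:

```python
import numpy as np
import pandas as pd

# Two rows mirroring the question's table
df = pd.DataFrame({
    "Clust": [9, 4],
    "Stat1": [0.00, 0.00],
    "Stat2": [0.15, 0.25],
    "Stat3": [0.06, 0.04],
    "Stat4": [0.11, 0.10],
    "Stat5": [0.23, 0.10],
    "Stat6": [0.01, 0.00],
})

# Vectorized: argsort the negated values, keep the first 3 positions per row
a = df.iloc[:, 1:].to_numpy()
df["TopThree"] = df.columns[1:].to_numpy()[np.argsort(-a, axis=1)[:, :3]].tolist()
```

Note that rows with tied values (like 0.10 appearing twice in the second row) may order the tied names either way, since argsort's default sort is not stable.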

Pandas sum over partition by rows following SQL equivalent

I am looking for a way to aggregate (in pandas) a subset of values based on a particular partition, the equivalent of
select table.*,
sum(income) over (order by id, num_yyyymm rows between 3 preceding and 1 preceding) as prev_income_3,
sum(income) over (order by id, num_yyyymm rows between 1 following and 3 following) as next_income_3
from table order by a.id_customer, num_yyyymm;
I tried the following solution, but it has some problems:
1) It takes ages to complete
2) I have to merge all the results at the end
for x, y in df.groupby(['id_customer']):
    print(y[['num_yyyymm', 'income']])
    y['next3'] = y['income'].iloc[::-1].rolling(3).sum()
    print(y[['num_yyyymm', 'income', 'next3']])
    break
Results:
num_yyyymm income next3
0 201501 0.00 0.00
1 201502 0.00 0.00
2 201503 0.00 0.00
3 201504 0.00 0.00
4 201505 0.00 0.00
5 201506 0.00 0.00
6 201507 0.00 0.00
7 201508 0.00 0.00
8 201509 0.00 0.00
9 201510 0.00 0.00
10 201511 0.00 0.00
11 201512 0.00 0.00
12 201601 0.00 0.00
13 201602 0.00 0.00
14 201603 0.00 0.00
15 201604 0.00 0.00
16 201605 0.00 0.00
17 201606 0.00 0.00
18 201607 0.00 0.00
19 201608 0.00 0.00
20 201609 0.00 1522.07
21 201610 0.00 1522.07
22 201611 0.00 1522.07
23 201612 1522.07 0.00
24 201701 0.00 -0.00
25 201702 0.00 1.52
26 201703 0.00 1522.07
27 201704 0.00 1522.07
28 201705 1.52 1520.55
29 201706 1520.55 0.00
30 201707 0.00 NaN
31 201708 0.00 NaN
32 201709 0.00 NaN
Does anybody have an alternative solution?
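For what it's worth, a hedged sketch of one vectorized alternative (assuming the frame is pre-sorted by id_customer and num_yyyymm as in the SQL; shift() excludes the current row, and running the same computation on the reversed frame handles the "following" window):

```python
import pandas as pd

# Toy single-customer frame (column names taken from the SQL above)
df = pd.DataFrame({
    "id_customer": [1, 1, 1, 1, 1, 1],
    "num_yyyymm": [201501, 201502, 201503, 201504, 201505, 201506],
    "income": [10.0, 20.0, 30.0, 40.0, 50.0, 60.0],
})

# Sum of the 3 preceding rows, per customer (current row excluded)
df["prev_income_3"] = (
    df.groupby("id_customer")["income"]
      .transform(lambda s: s.shift(1).rolling(3, min_periods=1).sum())
)
# Sum of the 3 following rows: same computation on the reversed frame;
# assignment re-aligns the result by index
df["next_income_3"] = (
    df.iloc[::-1]
      .groupby("id_customer")["income"]
      .transform(lambda s: s.shift(1).rolling(3, min_periods=1).sum())
)
```

This avoids the per-group Python loop and the manual merge at the end.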

c_type python add values

I am trying to add some values to an existing variable. I am kind of new to Python, and I am using ctypes variables. This is my code, which doesn't work:
rgdSamples = (c_double * 100)()
fSamples = (c_double * 1000)()
for i in range(10):
    fSamples += rgdSamples
any suggestions?
Based on the numeric values from your snippet, I assumed that fSamples should be a 2D array (10 rows, 100 columns) of doubles. It's not clear why ctypes is preferred over plain Python lists, but here's an example (for display purposes, the arrays are much smaller).
code.py:
#!/usr/bin/env python3
import sys
import ctypes

COLS = 10  # Change it to 100
ROWS = 5   # Change it to 10

DoubleArr1D = ctypes.c_double * COLS
DoubleArr2D = (ctypes.c_double * COLS) * ROWS  # Parentheses present for clarity only


def print_matrix(matrix, text=None):  # This function isn't very Pythonic, but keeping it like this for clarity
    if text is not None:
        print(text)
    for row in matrix:
        for element in row:
            print("{:6.2f} ".format(element), end="")
        print()
    print()


def main():
    mat = DoubleArr2D()  # Initialize matrix with 0s
    arr = DoubleArr1D(*range(1, COLS + 1))  # Initialize array with numbers 1..COLS
    print_matrix(mat, text="Initial matrix:")
    for row_idx in range(ROWS):
        for col_idx in range(COLS):
            mat[row_idx][col_idx] += arr[col_idx] * (row_idx + 1)  # Add the values from array (multiplied by a factor) to the current row
    print_matrix(mat, text="Final matrix:")


if __name__ == "__main__":
    print("Python {:s} on {:s}\n".format(sys.version, sys.platform))
    main()
    print("Done.")
Output:
[cfati#CFATI-5510-0:e:\Work\Dev\StackOverflow\q055494830]> "e:\Work\Dev\VEnvs\py_064_03.07.03_test0\Scripts\python.exe" code.py
Python 3.7.3 (v3.7.3:ef4ec6ed12, Mar 25 2019, 22:22:05) [MSC v.1916 64 bit (AMD64)] on win32
Initial matrix:
0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
Final matrix:
1.00 2.00 3.00 4.00 5.00 6.00 7.00 8.00 9.00 10.00
2.00 4.00 6.00 8.00 10.00 12.00 14.00 16.00 18.00 20.00
3.00 6.00 9.00 12.00 15.00 18.00 21.00 24.00 27.00 30.00
4.00 8.00 12.00 16.00 20.00 24.00 28.00 32.00 36.00 40.00
5.00 10.00 15.00 20.00 25.00 30.00 35.00 40.00 45.00 50.00
Done.
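Returning to the original snippet: fSamples += rgdSamples fails because ctypes arrays don't define element-wise arithmetic (and the two arrays have different lengths anyway). If a plain 1D accumulation was the intent, a minimal sketch over two equal-length arrays would be (sizes are hypothetical, chosen small for illustration):

```python
import ctypes

ArrType = ctypes.c_double * 5   # small size for illustration
acc = ArrType()                 # ctypes arrays start zero-initialized
sample = ArrType(*[1.5, 2.5, 3.5, 4.5, 5.5])

for _ in range(10):             # accumulate 10 identical samples
    for i in range(len(acc)):
        acc[i] += sample[i]     # element-wise add; += on the array itself raises TypeError
```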

Delete non-consecutive values from a dataframe column

I have a dataframe like this:
Ind TIME PREC ET PET YIELD
0 1 1.21 0.02 0.02 0.00
1 2 0.00 0.03 0.04 0.00
2 3 0.00 0.03 0.05 0.00
3 4 0.00 0.04 0.05 0.00
4 5 0.00 0.05 0.07 0.00
5 6 0.00 0.03 0.05 0.00
6 7 0.00 0.02 0.04 0.00
7 8 1.14 0.03 0.04 0.00
8 9 0.10 0.02 0.03 0.00
9 10 0.00 0.03 0.04 0.00
10 11 0.10 0.05 0.11 0.00
11 12 0.00 0.06 0.15 0.00
12 13 2.30 0.14 0.44 0.00
13 14 0.17 0.09 0.29 0.00
14 15 0.00 0.13 0.35 0.00
15 16 0.00 0.14 0.39 0.00
16 17 0.00 0.10 0.31 0.00
17 18 0.00 0.15 0.51 0.00
18 19 0.00 0.22 0.58 0.00
19 20 0.10 0.04 0.09 0.00
20 21 0.00 0.04 0.06 0.00
21 22 0.27 0.13 0.43 0.00
22 23 0.00 0.10 0.25 0.00
23 24 0.00 0.03 0.04 0.00
24 25 0.00 0.04 0.05 0.00
25 26 0.43 0.04 0.15 0.00
26 27 0.17 0.06 0.23 0.00
27 28 0.50 0.02 0.04 0.00
28 29 0.00 0.03 0.04 0.00
29 30 0.00 0.04 0.08 0.00
30 31 0.00 0.04 0.08 0.00
31 1 6.48 1.97 5.10 0.03
32 32 0.00 0.22 0.70 0.00
33 33 0.00 0.49 0.88 0.00
In this dataframe the column 'TIME' holds the ordinal day number in the year, and after the end of every month an extra row holds the ordinal number of that month in the year, which messes up all dataframe calculations. For this reason, I would like to drop all rows that contain a month value. First, I tried to use .shift():
df = df.loc[df.TIME == df.TIME.shift() + 1]
However, this deletes twice as many rows as it should. I also tried to delete every value after the end of every month:
for i in indexes:
    df = df.loc[df.index != i]
where indexes is a list containing the row indexes where the day value equals 31, 59, ..., 365, i.e. the end of every month. However, in a leap year these values would be different, and while I could create another list for leap years, that method would be very un-Pythonic. So I wonder: is there a better way to delete the non-consecutive values from the dataframe (excluding the year boundary, where one year ends and another starts: 364, 365, 1, 2)?
EDIT: I should probably add that there are twenty years in this dataframe, so this is how the dataframe looks at the end of each year:
TIME PREC ET PET YIELD
370 360 0.00 0.14 0.26 0.04
371 361 0.00 0.15 0.27 0.04
372 362 0.00 0.14 0.25 0.04
373 363 0.11 0.18 0.32 0.04
374 364 0.00 0.15 0.25 0.04
375 365 0.00 0.17 0.29 0.04
376 12 16.29 4.44 7.74 1.89
377 1 0.00 0.16 0.28 0.03
378 2 0.00 0.18 0.32 0.03
379 3 0.00 0.22 0.40 0.03
df
TIME PREC ET PET YIELD
0 360 0.00 0.14 0.26 0.04
1 361 0.00 0.15 0.27 0.04
2 362 0.00 0.14 0.25 0.04
3 363 0.11 0.18 0.32 0.04
4 364 0.00 0.15 0.25 0.04
5 365 0.00 0.17 0.29 0.04
6 12 16.29 4.44 7.74 1.89
7 1 1.21 0.02 0.02 0.00
8 2 0.00 0.03 0.04 0.00
9 3 0.00 0.03 0.05 0.00
10 4 0.00 0.04 0.05 0.00
11 5 0.00 0.05 0.07 0.00
12 6 0.00 0.03 0.05 0.00
13 7 0.00 0.02 0.04 0.00
14 8 1.14 0.03 0.04 0.00
15 9 0.10 0.02 0.03 0.00
16 10 0.00 0.03 0.04 0.00
17 11 0.10 0.05 0.11 0.00
18 12 0.00 0.06 0.15 0.00
19 13 2.30 0.14 0.44 0.00
20 14 0.17 0.09 0.29 0.00
21 15 0.00 0.13 0.35 0.00
22 16 0.00 0.14 0.39 0.00
23 17 0.00 0.10 0.31 0.00
24 18 0.00 0.15 0.51 0.00
25 19 0.00 0.22 0.58 0.00
26 20 0.10 0.04 0.09 0.00
27 21 0.00 0.04 0.06 0.00
28 22 0.27 0.13 0.43 0.00
29 23 0.00 0.10 0.25 0.00
30 24 0.00 0.03 0.04 0.00
31 25 0.00 0.04 0.05 0.00
32 26 0.43 0.04 0.15 0.00
33 27 0.17 0.06 0.23 0.00
34 28 0.50 0.02 0.04 0.00
35 29 0.00 0.03 0.04 0.00
36 30 0.00 0.04 0.08 0.00
37 31 0.00 0.04 0.08 0.00
38 1 6.48 1.97 5.10 0.03
39 32 0.00 0.22 0.70 0.00
40 33 0.00 0.49 0.88 0.00
Look at the diffs in TIME and drop the rows where the diff is -12 or less. Every month row drops the counter by at least 12, while the only legitimate negative diff is the year boundary, where the 12th-month row is followed by day 1 (a diff of -11).
df[~df.TIME.diff().le(-12)]
TIME PREC ET PET YIELD
0 360 0.00 0.14 0.26 0.04
1 361 0.00 0.15 0.27 0.04
2 362 0.00 0.14 0.25 0.04
3 363 0.11 0.18 0.32 0.04
4 364 0.00 0.15 0.25 0.04
5 365 0.00 0.17 0.29 0.04
7 1 1.21 0.02 0.02 0.00
8 2 0.00 0.03 0.04 0.00
9 3 0.00 0.03 0.05 0.00
10 4 0.00 0.04 0.05 0.00
11 5 0.00 0.05 0.07 0.00
12 6 0.00 0.03 0.05 0.00
13 7 0.00 0.02 0.04 0.00
14 8 1.14 0.03 0.04 0.00
15 9 0.10 0.02 0.03 0.00
16 10 0.00 0.03 0.04 0.00
17 11 0.10 0.05 0.11 0.00
18 12 0.00 0.06 0.15 0.00
19 13 2.30 0.14 0.44 0.00
20 14 0.17 0.09 0.29 0.00
21 15 0.00 0.13 0.35 0.00
22 16 0.00 0.14 0.39 0.00
23 17 0.00 0.10 0.31 0.00
24 18 0.00 0.15 0.51 0.00
25 19 0.00 0.22 0.58 0.00
26 20 0.10 0.04 0.09 0.00
27 21 0.00 0.04 0.06 0.00
28 22 0.27 0.13 0.43 0.00
29 23 0.00 0.10 0.25 0.00
30 24 0.00 0.03 0.04 0.00
31 25 0.00 0.04 0.05 0.00
32 26 0.43 0.04 0.15 0.00
33 27 0.17 0.06 0.23 0.00
34 28 0.50 0.02 0.04 0.00
35 29 0.00 0.03 0.04 0.00
36 30 0.00 0.04 0.08 0.00
37 31 0.00 0.04 0.08 0.00
39 32 0.00 0.22 0.70 0.00
40 33 0.00 0.49 0.88 0.00
df[df['TIME'].shift().fillna(0) <= df['TIME']]
Gives what you're looking for. You were almost there with
df.loc[df.TIME == df.TIME.shift() + 1]
but you don't need to require an exact increment of 1: rows where the shifted value is merely smaller are just the first days of each month, and should be kept.
The addition of .fillna(0) takes care of the NaN in the first row of df['TIME'].shift().
Edit:
For the end-of-year case, also allow a difference of up to 11 the other way, to keep the day-1 row that follows the 12th-month row.
That would give
df[df['TIME'].shift().fillna(0) <= df['TIME'] + 11]
Edit2:
By the by, I checked solution runtimes, and the current version (df[~df.TIME.diff().le(-12)]) of #piRSquared's answer seems to run fastest.
For completeness, comparing the solution in this post with #piRSquared's original version: the former was a bit faster on datasets of roughly 10,000 rows or fewer, the latter somewhat faster on larger ones.
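A quick sanity check of the diff approach on a toy column (hypothetical day numbers with a month number injected after each month's end, plus a year boundary):

```python
import pandas as pd

# 29, 30, 31 are days; the 1 after 31 is a month row; 365 -> 12 is the
# December month row; the 12 -> 1 step is the legitimate year boundary
df = pd.DataFrame({"TIME": [29, 30, 31, 1, 32, 33, 365, 12, 1, 2]})

# Month rows make TIME drop by 12 or more; the year-boundary drop is only -11
cleaned = df[~df.TIME.diff().le(-12)]
```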