How to align 3 dataframes with some threshold in Python?

I have some x & y columns in a dataframe such as below:
X-1 X-1_y X-2 X-2_y X-3 X-3_y
0 411.726266 1387.29 437.404307 3755.08 437.273585 3360.85
1 437.692665 677.39 448.557534 1460.70 448.760155 981.45
2 448.596937 2276.35 481.550490 0.00 481.566018 0.00
3 481.634531 0.00 486.966310 0.00 487.208899 0.00
4 486.971163 0.00 492.578155 0.00 492.446192 0.00
5 492.505388 0.00 500.000000 608.22 500.153040 0.00
6 500.030500 810.45 508.218825 0.00 508.315935 0.00
7 508.106596 0.00 513.579177 0.00 513.620953 9582.45
8 513.424161 0.00 515.308245 0.00 515.175867 0.00
9 535.131828 0.00 534.346333 0.00 534.985459 0.00
10 551.779516 3124.92 551.712654 2226.94 551.680943 2522.73
11 559.050425 1081.89 559.084859 984.05 559.087271 1600.48
12 562.108257 3532.11 562.253910 3686.94 562.234223 4495.73
13 591.436797 0.00 590.659433 0.00 591.396752 0.00
and I would like to align all 3 X columns and merge them into one X column. If the numbers in the X columns are close to each other (i.e. within ±1), take the average of the three where available; but if the numbers are not close to each other, append a new row. The final result would be a new dataframe like this:
avg X X-1_y X-2_y X-3_y
0 411.726266 1387.29 0.00 0.00
1 437.456852 677.39 3755.08 3360.85
2 448.638209 2276.35 1460.70 981.45
3 481.583680 0.00 0.00 0.00
4 487.048791 0.00 0.00 0.00
5 492.509912 0.00 0.00 0.00
6 500.061180 810.45 608.22 0.00
7 508.213785 0.00 0.00 0.00
8 513.541430 0.00 0.00 9582.45
9 515.242056 0.00 0.00 0.00
10 534.821206 0.00 0.00 0.00
11 551.724371 3124.92 2226.94 2522.73
12 559.074185 1081.89 984.05 1600.48
13 562.198797 3532.11 3686.94 4495.73
14 591.164327 0.00 0.00 0.00
Example of how the result is created:
If the numbers in a row of X are within ±1, take the average. If all 3 are not within ±1 of each other, append three new rows; if one value is not within ±1 of the other two, append 2 rows (one for the off value and one for the average of the other 2 that are within ±1). For example, on the first rows of data,
X-1 X-1_y X-2 X-2_y X-3 X-3_y
0 411.726266 1387.29 437.404307 3755.08 437.273585 3360.85
1 437.692665 677.39 448.557534 1460.70 448.760155 981.45
X1 (411.72) is not within ±1 of X2 (437.4) and X3 (437.2), so it appends a new line in the result. But X2 (437.4) and X3 (437.2) are within ±1 of each other, and also within ±1 of the second-row X1 value (437.692), so the next line is the average of the three: avg(X1_row2, X2_row1, X3_row1).
The result will be:
avg X X-1_y X-2_y X-3_y
0 411.726266 1387.29 0.00 0.00
1 437.456852 677.39 3755.08 3360.85
thanks in advance

You can do the following:
First we make a flat list of all the X values, in row order:
items = [[row["X-1"], row["X-2"], row["X-3"]] for index, row in df.iterrows()]
flat_list = [item for sublist in items for item in sublist]
Then:
final = []
x = 0
while x < len(flat_list):
    window = flat_list[x:x+3]
    try:
        if (abs(window[0] - window[1]) < 1) and (abs(window[0] - window[2]) < 1):
            final.append(sum(window) / 3)
            x += 3
        else:
            final.append(window[0])
            x += 1
    except IndexError:  # fewer than 3 values left at the end of the list
        final.append(window[0])
        x += 1
and that will produce roughly what you want for the avg X column. Note that because the window advances in non-overlapping steps of three, boundary values (such as 515.308245 and 515.175867 below) may not get merged the way the desired output shows:
final
[411.726266,
437.4568523333333,
448.6382086666667,
481.5836796666667,
487.04879066666666,
492.5099116666667,
500.06118,
508.21378533333336,
513.5414303333333,
515.308245,
515.175867,
534.8212066666666,
551.7243709999999,
559.0741849999999,
562.1987966666667,
591.1643273333334]

Related

Replace Only Integer Values with a sequence of consecutive numbers in a column

How do I replace only the integer values in the ID column with a sequence of consecutive numbers? I'd like any non-integer or NaN cells skipped.
Current df:
ID AMOUNT
1 0.00
test 5.00
test test 0.00
test 0.00
1 0.00
xx 304.95
x xx 304.95
1 0.00
1 0.00
xxxxx 0.00
1 0.00
xxx 0.00
xx xx 0.00
1 0.00
Desired Outcome:
ID AMOUNT
1 0.00
test 5.00
test test 0.00
test 0.00
2 0.00
xx 304.95
x xx 304.95
3 0.00
4 0.00
xxxxx 0.00
5 0.00
xxx 0.00
xx xx 0.00
6 0.00
I tried making a new column using np.arange(len(df)) and then replacing the ID values with that, but it's not giving me the expected outcome.
Thank you!
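For reference, a sketch reconstructing the sample frame (this assumes the numeric IDs are stored as actual integers in an object column, which matters for the isinstance-based answer below):

```python
import pandas as pd

df = pd.DataFrame({
    "ID": [1, "test", "test test", "test", 1, "xx", "x xx",
           1, 1, "xxxxx", 1, "xxx", "xx xx", 1],
    "AMOUNT": [0.00, 5.00, 0.00, 0.00, 0.00, 304.95, 304.95,
               0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00],
})
```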
You can use:
df['ID'] = (pd
.to_numeric(df['ID'], errors='coerce') # convert to numeric
.cumsum() # increment numbers
.convert_dtypes().astype(object) # back to integer
.fillna(df['ID']) # restore non-numeric
)
Alternative using slicing and updating:
s = pd.to_numeric(df['ID'], errors='coerce')
df['ID'].update(s[s.notna()].cumsum().astype(int).astype(object))
output:
ID AMOUNT
0 1 0.00
1 test 5.00
2 test test 0.00
3 test 0.00
4 2 0.00
5 xx 304.95
6 x xx 304.95
7 3 0.00
8 4 0.00
9 xxxxx 0.00
10 5 0.00
11 xxx 0.00
12 xx xx 0.00
13 6 0.00
Solution 1
Identify numeric values with a regex, then create a range counter and use boolean indexing to update the values:
m = df['ID'].str.match(r'\d+', na=False)
df.loc[m, 'ID'] = range(1, m.sum() + 1)
Solution 2
Identify numeric values with the pandas built-in to_numeric, then create a range counter and use boolean indexing to update the values:
m = pd.to_numeric(df['ID'], errors='coerce').notna()
df.loc[m, 'ID'] = range(1, m.sum() + 1)
Result
ID AMOUNT
0 1 0.00
1 test 5.00
2 test test 0.00
3 test 0.00
4 2 0.00
5 xx 304.95
6 x xx 304.95
7 3 0.00
8 4 0.00
9 xxxxx 0.00
10 5 0.00
11 xxx 0.00
12 xx xx 0.00
13 6 0.00
If you can iterate over the ID column, this can be done via Python's isinstance(object, class) function (this assumes the numeric IDs are stored as actual integers, not strings):
count = 1
for index, value in enumerate(df['ID']):  # iterate over the column (assumes a default RangeIndex)
    if isinstance(value, int):            # check whether the value is an integer
        df.loc[index, 'ID'] = count      # replace it with the counter
        count += 1

New data frame with the % of the debt paid in the month of the payment

I have two dataframes df1 and df2.
One with client debts, the other with client payments and their dates.
I want to create a new data frame with the % of the debt paid in the month of the payment until 01-2017.
import pandas as pd
d1 = {'client number': ['2', '2','3','6','7','7','8','8','8','8','8','8','8','8'],
'month': [1, 2, 3,1,10,12,3,5,8,1,2,4,5,8],
'year':[2013,2013,2013,2019,2013,2013,2013,2013,2013,2014,2014,2015,2016,2017],
'payment' :[100,100,200,10000,200,100,300,500,200,100,200,200,500,50]}
df1 = pd.DataFrame(data=d1).set_index('client number')
df1
d2 = {'client number': ['2','3','6','7','8'],
'debt': [200, 600,10000,300,3000]}
df2 = pd.DataFrame(data=d2)
x=[1,2,3,4,5,6,7,8,9,10]
y=[2013,2014,2015,2016,2017]
for x in month and y in year
if df1['month']=x and df1['year']=year :
df2[month&year] = df1['payment']/df2['debt']
the result needs to be something like this for all the clients
what am I missing?
thank you for your time and help
First set the index of both dataframes df1 and df2 to client number. Then use Index.map to map the client numbers in df1 to their corresponding debts from df2, and Series.div to divide each client's payments by their debt, giving the fraction of the debt paid. Next, build a date column in df1 from the month and year columns. Finally, use DataFrame.join along with DataFrame.pivot_table:
# df1 already has 'client number' as its index from the setup above;
# if not: df1 = df1.set_index('client number')
df2 = df2.set_index('client number')
df1['pct'] = df1['payment'].div(df1.index.map(df2['debt'])).round(2)
df1['date'] = df1['year'].astype(str) + '-' + df1['month'].astype(str).str.zfill(2)
df3 = (
    df2.join(
        df1.pivot_table(index=df1.index, columns='date',
                        values='pct', aggfunc='sum')
           .fillna(0)
    )
    .reset_index()
)
Result:
# print(df3)
client number debt 2013-01 2013-02 2013-03 2013-05 2013-08 ... 2013-12 2014-01 2014-02 2015-04 2016-05 2017-08 2019-01
0 2 200 0.5 0.5 0.00 0.00 0.00 ... 0.00 0.00 0.00 0.00 0.00 0.00 0.0
1 3 600 0.0 0.0 0.33 0.00 0.00 ... 0.00 0.00 0.00 0.00 0.00 0.00 0.0
2 6 10000 0.0 0.0 0.00 0.00 0.00 ... 0.00 0.00 0.00 0.00 0.00 0.00 1.0
3 7 300 0.0 0.0 0.00 0.00 0.00 ... 0.33 0.00 0.00 0.00 0.00 0.00 0.0
4 8 3000 0.0 0.0 0.10 0.17 0.07 ... 0.00 0.03 0.07 0.07 0.17 0.02 0.0

Pandas sum over partition by rows following SQL equivalent

I am looking for a way to aggregate (in pandas) a subset of values based on a particular partition, an equivalent of:
select table.*,
sum(income) over (order by id, num_yyyymm rows between 3 preceding and 1 preceding) as prev_income_3,
sum(income) over (order by id, num_yyyymm rows between 1 following and 3 following) as next_income_3
from table order by a.id_customer, num_yyyymm;
I tried the following solution, but it has some problems:
1) it takes ages to complete;
2) I have to merge all the results back together at the end.
for x, y in df.groupby(['id_customer']):
    print(y[['num_yyyymm', 'income']])
    y['next3'] = y['income'].iloc[::-1].rolling(3).sum()
    print(y[['num_yyyymm', 'income', 'next3']])
    break
Results:
num_yyyymm income next3
0 201501 0.00 0.00
1 201502 0.00 0.00
2 201503 0.00 0.00
3 201504 0.00 0.00
4 201505 0.00 0.00
5 201506 0.00 0.00
6 201507 0.00 0.00
7 201508 0.00 0.00
8 201509 0.00 0.00
9 201510 0.00 0.00
10 201511 0.00 0.00
11 201512 0.00 0.00
12 201601 0.00 0.00
13 201602 0.00 0.00
14 201603 0.00 0.00
15 201604 0.00 0.00
16 201605 0.00 0.00
17 201606 0.00 0.00
18 201607 0.00 0.00
19 201608 0.00 0.00
20 201609 0.00 1522.07
21 201610 0.00 1522.07
22 201611 0.00 1522.07
23 201612 1522.07 0.00
24 201701 0.00 -0.00
25 201702 0.00 1.52
26 201703 0.00 1522.07
27 201704 0.00 1522.07
28 201705 1.52 1520.55
29 201706 1520.55 0.00
30 201707 0.00 NaN
31 201708 0.00 NaN
32 201709 0.00 NaN
Does anybody have an alternative solution?
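One alternative sketch (my own, assuming columns id_customer, num_yyyymm and income): compute both windows per group with shift plus rolling, mirroring the SQL frame clauses, and concatenate once at the end instead of merging inside the loop:

```python
import pandas as pd

def add_window_sums(df):
    df = df.sort_values(["id_customer", "num_yyyymm"])
    parts = []
    for _, s in df.groupby("id_customer")["income"]:
        # SQL: rows between 3 preceding and 1 preceding
        prev3 = s.shift(1).rolling(3, min_periods=1).sum()
        # SQL: rows between 1 following and 3 following
        # (reverse, apply the same "preceding" window, reverse back)
        next3 = s[::-1].shift(1).rolling(3, min_periods=1).sum()[::-1]
        parts.append(pd.DataFrame({"prev_income_3": prev3,
                                   "next_income_3": next3}))
    # one concat at the end; join aligns on the original index
    return df.join(pd.concat(parts))
```

The first/last rows of each group come out as NaN, matching SQL's SUM over an empty frame.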

removing the name of a pandas dataframe index after appending a total row to a dataframe

I have calculated a series of total tips by day of week and appended it to the bottom of the totalpt dataframe.
I have set index.name for the totalpt dataframe to None.
However, while the dataframe displays the default 0,1,2,3 index, it doesn't display the default empty cell in the top-left corner directly above the index.
How can I make this cell empty in the dataframe?
total_bill tip sex smoker day time size tip_pct
0 16.54 1.01 F N Sun D 2 0.061884
1 12.54 1.40 F N Mon D 2 0.111643
2 10.34 3.50 M Y Tue L 4 0.338491
3 20.25 2.50 M Y Wed D 2 0.123457
4 16.54 1.01 M Y Thu D 1 0.061064
5 12.54 1.40 F N Fri L 2 0.111643
6 10.34 3.50 F Y Sat D 3 0.338491
7 23.25 2.10 M Y Sun B 3 0.090323
pivot = tips.pivot_table('total_bill', index=['sex', 'size'],columns=['day'],aggfunc='sum').fillna(0)
print(pivot)
day Fri Mon Sat Sun Thu Tue Wed
sex size
F 2 12.54 12.54 0.00 16.54 0.00 0.00 0.00
3 0.00 0.00 10.34 0.00 0.00 0.00 0.00
M 1 0.00 0.00 0.00 0.00 16.54 0.00 0.00
2 0.00 0.00 0.00 0.00 0.00 0.00 20.25
3 0.00 0.00 0.00 23.25 0.00 0.00 0.00
4 0.00 0.00 0.00 0.00 0.00 10.34 0.00
totals_row = tips.pivot_table('total_bill',columns=['day'],aggfunc='sum').fillna(0).astype('float')
totalpt = pivot.reset_index('sex').reset_index('size')
totalpt.index.name = None
totalpt = totalpt[['Fri', 'Mon','Sat', 'Sun', 'Thu', 'Tue', 'Wed']]
totalpt = totalpt.append(totals_row)
print(totalpt)
**day** Fri Mon Sat Sun Thu Tue Wed #problem text day
0 12.54 12.54 0.00 16.54 0.00 0.00 0.00
1 0.00 0.00 10.34 0.00 0.00 0.00 0.00
2 0.00 0.00 0.00 0.00 16.54 0.00 0.00
3 0.00 0.00 0.00 0.00 0.00 0.00 20.25
4 0.00 0.00 0.00 23.25 0.00 0.00 0.00
5 0.00 0.00 0.00 0.00 0.00 10.34 0.00
total_bill 12.54 12.54 10.34 39.79 16.54 10.34 20.25
That's the columns' name.
In [11]: df = pd.DataFrame([[1, 2]], columns=['A', 'B'])
In [12]: df
Out[12]:
A B
0 1 2
In [13]: df.columns.name = 'XX'
In [14]: df
Out[14]:
XX A B
0 1 2
You can set it to None to clear it.
In [15]: df.columns.name = None
In [16]: df
Out[16]:
A B
0 1 2
An alternative, if you wanted to keep it, is to give the index a name:
In [21]: df.columns.name = "XX"
In [22]: df.index.name = "index"
In [23]: df
Out[23]:
XX A B
index
0 1 2
You can use rename_axis (available since pandas 0.17.0):
In [3939]: df
Out[3939]:
XX A B
0 1 2
In [3940]: df.rename_axis(None, axis=1)
Out[3940]:
A B
0 1 2
In [3942]: df = df.rename_axis(None, axis=1)
In [3943]: df
Out[3943]:
A B
0 1 2
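Applied to the question's case, the fix is the same one-liner; a minimal sketch (using a small stand-in frame, since the full tips data isn't reproduced here):

```python
import pandas as pd

# stand-in for totalpt: a frame whose columns Index carries the name "day"
totalpt = pd.DataFrame([[12.54, 16.54], [10.34, 23.25]],
                       columns=pd.Index(["Fri", "Sun"], name="day"))

# clear the columns name so the top-left cell renders empty
totalpt.columns.name = None
# equivalently: totalpt = totalpt.rename_axis(None, axis=1)
```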

Printing out the value from a matrix

I have a task which consumes arbitrary CPU and memory over time. Executing the following Linux command gives me an output:
mpstat -u 1 -P ALL
The output looks like:
02:22:14 PM CPU %usr %nice %sys %iowait %irq %soft %steal %guest %idle
02:22:15 PM all 4.51 0.00 0.11 0.00 0.00 0.00 0.00 0.00 95.37
02:22:15 PM 0 0.00 0.00 1.00 0.00 0.00 0.00 0.00 0.00 99.00
02:22:15 PM 1 **78.22** 0.00 0.99 0.00 0.00 0.00 0.00 0.00 20.79
02:22:15 PM 2 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 100.00
02:22:15 PM 3 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 100.00
02:22:15 PM 4 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
02:22:15 PM 5 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
02:22:15 PM 6 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 100.00
02:22:15 PM 7 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
02:22:15 PM 8 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
02:22:15 PM 9 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
02:22:15 PM 10 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
02:22:15 PM 11 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
02:22:15 PM 12 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
02:22:15 PM 13 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
02:22:15 PM 14 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
02:22:15 PM 15 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 100.00
02:22:15 PM 16 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
02:22:15 PM 17 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
02:22:15 PM 18 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
02:22:15 PM 19 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 100.00
02:22:15 PM 20 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
02:22:15 PM 21 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
02:22:15 PM 22 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
02:22:15 PM 23 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
I want to grab the value located in the 4th column and 3rd data row (a[3][4], i.e. 78.22) every 20 seconds, in bash/python/perl.
So the script should execute the mpstat command, print the value in the specified cell, and build a graph from the collected values. I was thinking of appending the value to a .dat file and running gnuplot or another app which creates the graph.
Any suggestions on how to go about this?
You can get the desired cell using awk. The following grabs it from mpstat's output and appends it, along with the current UNIX timestamp, to a statistics file (note that systime() requires GNU awk, and that real mpstat output starts with a banner line before the header, so adjust NR to the row you need):
mpstat -u 1 1 -P ALL | awk 'NR==4 {print systime(), $4}' >> stats.txt
To run this command every 20 seconds:
watch -n 20 "mpstat -u 1 1 -P ALL | awk 'NR==4 {print systime(), \$4}' >> stats.txt"
Then plot with gnuplot:
cat stats.txt | gnuplot -p -e 'set datafile separator " "; plot "-" using 1:2 with lines'
Try the following
#!/bin/bash
function _mpstat() {
    while :; do
        arr=( $(mpstat -P 1 | tail -n 1) )
        echo "${arr[3]}"
        sleep 20
    done >> file.txt
}
_mpstat &
echo "_mpstat PID: $!"
Explanation
while :; do - infinite loop
$(mpstat -P 1 | tail -n 1) - run mpstat for CPU 1 only (-P 1), keep the last line with tail -n 1, and capture the command's output with $( )
arr=( ... ) - store the captured fields in an array
echo "${arr[3]}" - echo array index 3 (the fourth field, %usr)
sleep 20 - sleep for 20 seconds
done >> file.txt - send stdout of the whole while loop block to file.txt
_mpstat & - run the function as a background process with &
echo "_mpstat PID: $!" - print the PID of the background function
You can grep the PID to display its parent and kill both when needed.
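A Python variant (my own sketch, not from the answers above): parse the mpstat table and append timestamped samples to a .dat file for gnuplot. The row/column indices assume the table layout shown in the question (header row first, CPU 1 on the fourth line); real mpstat output is preceded by a banner line, which the sampling loop filters out, and the argument order passed to mpstat follows the question's command with a report count of 1 appended.

```python
import subprocess
import time

def cpu1_usr(table: str) -> float:
    """Return %usr for CPU 1 (row 4, column 4 of the table shown above)."""
    rows = [line.split() for line in table.strip().splitlines() if line.strip()]
    # rows[0] = header, rows[1] = "all", rows[2] = CPU 0, rows[3] = CPU 1
    # columns: time, AM/PM, CPU, %usr, ... -> %usr is index 3
    return float(rows[3][3])

def sample_forever(path="stats.dat", interval=20):
    """Append 'unix-timestamp %usr' to path every `interval` seconds."""
    while True:
        out = subprocess.run(["mpstat", "-u", "1", "1", "-P", "ALL"],
                             capture_output=True, text=True).stdout
        # drop the "Linux ..." banner and the trailing "Average:" rows
        table = "\n".join(l for l in out.splitlines()
                          if l.strip() and not l.startswith(("Linux", "Average")))
        with open(path, "a") as f:
            f.write(f"{time.time():.0f} {cpu1_usr(table)}\n")
        time.sleep(interval)
```

The resulting stats.dat can then be plotted with the same gnuplot one-liner as above.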
