Consider the following part of a Pandas dataframe:
0 1 2
12288 1000 45047 0.403
12289 1000 45048 0.334
12290 1000 45101 0.246
12291 1000 45102 0.096
12292 1000 45103 0.096
12293 1000 45104 0.024
12294 1000 45105 0.023
12295 1000 45106 0.023
12296 1000 45107 0.024
12297 1000 45108 0.024
12298 1000 45109 0.024
12299 1000 45110 0.055
12300 1000 45111 0.107
12301 1000 45112 0.024
12302 1000 45113 0.024
12303 1000 45114 0.024
12304 1000 45115 0.060
12305 1000 45116 1.095
12306 1000 45117 1.090
12307 1000 45118 0.418
12308 1000 45119 0.292
12309 1000 45120 0.446
12310 1000 45121 0.121
12311 1000 45122 0.121
12312 1000 45123 0.090
12313 1000 45124 0.031
12314 1000 45125 0.031
12315 1000 45126 0.031
12316 1000 45127 0.031
12317 1000 45128 0.036
12318 1000 45129 0.124
12319 1000 45130 0.069
12320 1000 45131 0.031
12321 1000 45132 0.031
12322 1000 45133 0.031
12323 1000 45134 0.031
12324 1000 45135 0.031
12325 1000 45136 0.059
12326 1000 45137 0.115
12327 1000 45138 0.595
12328 1000 45139 1.375
12329 1000 45140 0.780
12330 1000 45141 0.028
12331 1000 45142 0.029
12332 1000 45143 0.029
12333 1000 45144 0.029
12334 1000 45145 0.028
12335 1000 45146 0.085
12336 1000 45147 0.528
12337 1000 45148 0.107
12338 1000 45201 0.024
12339 1000 45204 0.024
12340 1000 45205 0.024
12341 1000 45206 0.024
12342 1000 45207 0.024
12343 1000 45208 0.024
12344 1000 45209 0.045
12345 1000 45210 0.033
12346 1000 45211 0.025
12347 1000 45212 0.024
12348 1000 45213 0.024
12349 1000 45214 0.024
12350 1000 45215 0.024
12351 1000 45216 0.108
12352 1000 45217 1.109
12353 1000 45218 2.025
12354 1000 45219 2.918
12355 1000 45220 4.130
12356 1000 45221 0.601
12357 1000 45222 0.330
12358 1000 45223 0.400
12359 1000 45224 0.200
12360 1000 45225 0.093
12361 1000 45226 0.023
12362 1000 45227 0.023
12363 1000 45228 0.023
12364 1000 45229 0.024
12365 1000 45230 0.024
12366 1000 45231 0.118
12367 1000 45232 0.064
12368 1000 45233 0.023
12369 1000 45234 0.023
12370 1000 45235 0.023
12371 1000 45236 0.022
12372 1000 45237 0.022
12373 1000 45238 0.022
12374 1000 45239 0.106
12375 1000 45240 0.074
12376 1000 45241 0.105
12377 1000 45242 1.231
12378 1000 45243 0.500
12379 1000 45244 0.382
12380 1000 45245 0.405
12381 1000 45246 0.469
12382 1000 45247 0.173
12383 1000 45248 0.035
12384 1000 45301 0.026
12385 1000 45302 0.027
Column 1 holds a code for when each measurement (the value in column 2) was taken. The first three digits encode the day and the last two digits encode the half-hour slot within that day. We start at day 450 (first two rows), and by the third row we are already in day 451. From index 12290 to index 12337 there are 48 values, i.e. the 48 half-hourly measurements of a single day. So a last-two-digits value of 01 means a measurement between 00:00:00 and 00:29:59, 02 means one between 00:30:00 and 00:59:59, 03 means one between 01:00:00 and 01:29:59, and so on.
For example, there is a discontinuity in column 1 between index 12289 and index 12290, but it happens in the first three digits, between 450 and 451 (a discontinuity between two days, since we moved from one day to the next; the last two digits 48 in 45048 represent the measurement between 23:30:00 and 23:59:59 of day 450), so those rows should not be dropped.
But now, if you look at index 12338 and index 12339, there is a discontinuity within the same day 452: we are missing the measurements for slots 02 and 03 (we have a measurement at 45201 and the next one at 45204). So ALL rows from day 452 should be dropped.
And again a discontinuity happens between index 12383 and index 12384, but since it happens between two different days (452 and 453), nothing should be dropped.
All the values in column 1 are int64.
Sorry if this is long and/or confusing, but any ideas on how I can solve this?
The first mask flags breaks in the half-hour sequence; the second mask then selects every row whose day contains such a break:
# a normal step is 1 (01 -> 02 -> ... -> 48) or 53 (48 -> 01 of the next day)
m1 = ~df[1].diff().fillna(1).isin([1, 53])
m2 = ~df[1].floordiv(100).isin(df.loc[m1, 1].floordiv(100).tolist())
out = df[m2]
Output:
>>> out
0 1 2
12288 1000 45047 0.403
12289 1000 45048 0.334
12290 1000 45101 0.246
12291 1000 45102 0.096
12292 1000 45103 0.096
12293 1000 45104 0.024
12294 1000 45105 0.023
12295 1000 45106 0.023
12296 1000 45107 0.024
12297 1000 45108 0.024
12298 1000 45109 0.024
12299 1000 45110 0.055
12300 1000 45111 0.107
12301 1000 45112 0.024
12302 1000 45113 0.024
12303 1000 45114 0.024
12304 1000 45115 0.060
12305 1000 45116 1.095
12306 1000 45117 1.090
12307 1000 45118 0.418
12308 1000 45119 0.292
12309 1000 45120 0.446
12310 1000 45121 0.121
12311 1000 45122 0.121
12312 1000 45123 0.090
12313 1000 45124 0.031
12314 1000 45125 0.031
12315 1000 45126 0.031
12316 1000 45127 0.031
12317 1000 45128 0.036
12318 1000 45129 0.124
12319 1000 45130 0.069
12320 1000 45131 0.031
12321 1000 45132 0.031
12322 1000 45133 0.031
12323 1000 45134 0.031
12324 1000 45135 0.031
12325 1000 45136 0.059
12326 1000 45137 0.115
12327 1000 45138 0.595
12328 1000 45139 1.375
12329 1000 45140 0.780
12330 1000 45141 0.028
12331 1000 45142 0.029
12332 1000 45143 0.029
12333 1000 45144 0.029
12334 1000 45145 0.028
12335 1000 45146 0.085
12336 1000 45147 0.528
12337 1000 45148 0.107
12384 1000 45301 0.026
12385 1000 45302 0.027
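The same two-mask idea as a self-contained sketch on toy data (the codes are invented for illustration: days 450 and 452 complete, day 451 missing two slots):

```python
import pandas as pd

# Toy codes: day 451 is missing slots 03 and 04
codes = ([450 * 100 + s for s in range(1, 49)]
         + [451 * 100 + s for s in range(1, 49) if s not in (3, 4)]
         + [452 * 100 + s for s in range(1, 49)])
df = pd.DataFrame({1: codes})

# a normal step is 1 (next half-hour) or 53 (48 -> 01 of the next day)
m1 = ~df[1].diff().fillna(1).isin([1, 53])

# drop every row whose day contains a bad step
m2 = ~df[1].floordiv(100).isin(df.loc[m1, 1].floordiv(100))
out = df[m2]

print(sorted(out[1].floordiv(100).unique().tolist()))  # [450, 452]
```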
Split the timestamp column into day and half-hour parts, then group by day and drop any day with fewer than 48 readings:
df["day"] = df[1].floordiv(100)
df["half_hour"] = df[1].mod(100)
df["num_readings_per_day"] = df.groupby("day")["half_hour"].transform('count')
df = df[df.num_readings_per_day == 48]
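A quick check of the count-based approach on toy data (codes invented for illustration; day 451 deliberately has only 46 readings):

```python
import pandas as pd

codes = ([450 * 100 + s for s in range(1, 49)]
         + [451 * 100 + s for s in range(1, 49) if s not in (3, 4)]
         + [452 * 100 + s for s in range(1, 49)])
df = pd.DataFrame({1: codes})

df["day"] = df[1].floordiv(100)
df["half_hour"] = df[1].mod(100)
df["num_readings_per_day"] = df.groupby("day")["half_hour"].transform('count')
df = df[df.num_readings_per_day == 48]

print(df["day"].unique().tolist())  # [450, 452]
```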
A normal jump is either 1 (next half-hour slot) or 53 (day rollover, e.g. 45048 -> 45101). Use diff combined with isin to mark them, and invert with ~ to find the true discontinuities:
import numpy as np

df.loc[~df[1].diff().isin([np.nan, 1, 53])]
Output:
0 1 2
12339 1000 45204 0.024
I created a function that returns the returns of quantile portfolios as a time series.
If I call Quantile_Returns(2014), the result (a DataFrame) looks like this:
Date Q1 Q2 Q3 Q4 Q5
2014-02-28 6.20 4.87 5.41 5.04 4.91
2014-03-31 -0.50 0.05 1.55 1.36 1.49
2014-04-30 -0.17 0.20 0.33 -0.26 1.76
2014-05-30 2.69 1.95 1.95 2.11 2.29
2014-06-30 3.12 3.40 2.81 1.82 2.36
2014-07-31 -2.52 -2.34 -1.92 -2.36 -1.80
2014-08-29 4.60 3.87 4.50 4.65 3.58
2014-09-30 -3.29 -3.25 -3.51 -0.96 -1.76
2014-10-31 2.55 4.63 2.37 3.60 2.10
2014-11-28 0.88 2.08 1.26 4.46 2.83
2014-12-31 0.35 0.20 -0.19 1.01 0.34
2015-01-30 -2.97 -2.63 -3.44 -2.32 -2.61
Now I want to call this function for a number of years, time_period = list(range(1960, 2021)), and get a single time series running from 1960 through 2020.
I tried it like this:
time_period = list(range(1960, 2021))
for j in time_period:
    if j == 1960:
        Quantile = pd.DataFrame(Quantile_Returns(j))
    else:
        Quantile = pd.concat(Quantile, Quantile_Returns(j+1))
But it did not work. The error is:
TypeError: first argument must be an iterable of pandas objects, you passed an object of type "DataFrame"
How can I implement this?
Thank you!
Try replacing the whole loop with
Quantile = pd.concat(Quantile_Returns(j) for j in range(1960, 2021))
pd.concat is expecting a sequence of pandas objects, and in the second pass through your loop you are giving it a DataFrame as the first argument (not a sequence of DataFrames). Also, the second argument should be an axis to concatenate on, not another DataFrame.
Here, I just passed it the sequence of all the DataFrames for different years as the first argument (using a generator expression).
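A minimal sketch of the fixed call, using a hypothetical stand-in for Quantile_Returns (one row per year, five quantile columns, data invented just to show the shape of the call):

```python
import pandas as pd

def quantile_returns(year):
    # hypothetical stand-in for the real Quantile_Returns
    return pd.DataFrame({f"Q{q}": [0.1 * q] for q in range(1, 6)},
                        index=pd.to_datetime([f"{year}-12-31"]))

# one concat over the whole sequence instead of repeated concats in a loop
quantile = pd.concat(quantile_returns(j) for j in range(1960, 2021))
print(quantile.shape)  # (61, 5)
```

Building the full list (or generator) first and concatenating once is also much faster than growing the DataFrame inside the loop, since each pd.concat copies all the data.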
I need to find cases where "price of y" was less than 3.5 until time 30:00, and after that "price of x" jumped above 3.5.
I added a "Demical time" column (elapsed seconds, so before 30:00 means less than 1800) to make this easier for me.
I managed to find all the cases where price of y was under 3.5 (and above 0), but I failed to write code that gives the cases where price of y was under 3.5 before 30:00 AND price of x was greater than 3.5 after 30:00.
df1 = df[(df['price_of_Y'] < 3.5) & (df['price_of_Y'] > 0) & (df['Demical time'] < 1800)]
# the cases for price of y under 3.5 before time 30:00 (Demical time = 1800)
df2 = df[(df['price_of_X'] > 3.5) & (df['Demical time'] > 1800)]
# the cases for price of x above 3.5 after time 30:00 (Demical time = 1800)
# the question is: how do I combine them into one expression?
price_of_X time price_of_Y Demical time
0 3.30 0 4.28 0
1 3.30 0:00 4.28 0
2 3.30 0:00 4.28 0
3 3.30 0:00 4.28 0
4 3.30 0:00 4.28 0
5 3.30 0:00 4.28 0
6 3.30 0:00 4.28 0
7 3.30 0:00 4.28 0
8 3.30 0:00 4.28 0
9 3.30 0:00 4.28 0
10 3.30 0:00 4.28 0
11 3.25 0:26 4.28 26
12 3.40 1:43 4.28 103
13 3.25 3:00 4.28 180
14 3.25 4:16 4.28 256
15 3.40 5:34 4.28 334
16 3.40 6:52 4.28 412
17 3.40 8:09 4.28 489
18 3.40 9:31 4.28 571
19 5.00 10:58 8.57 658
20 5.00 12:13 8.57 733
21 5.00 13:31 7.38 811
22 5.00 14:47 7.82 887
23 5.00 16:01 7.82 961
24 5.00 17:18 7.38 1038
25 5.00 18:33 7.38 1113
26 5.00 19:50 7.38 1190
27 5.00 21:09 7.38 1269
28 5.00 22:22 7.38 1342
29 5.00 23:37 8.13 1417
... ... ... ... ...
18138 7.50 59:03:00 28.61 3543
18139 7.50 60:19:00 28.61 3619
18140 7.50 61:35:00 34.46 3695
18141 8.00 62:48:00 30.16 3768
18142 7.50 64:03:00 34.46 3843
18143 8.00 65:20:00 30.16 3920
18144 7.50 66:34:00 28.61 3994
18145 7.50 67:53:00 30.16 4073
18146 8.00 69:08:00 26.19 4148
18147 7.00 70:23:00 23.10 4223
18148 7.00 71:38:00 23.10 4298
18149 8.00 72:50:00 30.16 4370
18150 7.50 74:09:00 26.19 4449
18151 7.50 75:23:00 25.58 4523
18152 7.00 76:40:00 19.07 4600
18153 7.00 77:53:00 19.07 4673
18154 9.00 79:11:00 31.44 4751
18155 9.00 80:27:00 27.11 4827
18156 10.00 81:41:00 34.52 4901
18157 10.00 82:56:00 34.52 4976
18158 11.00 84:16:00 43.05 5056
18159 10.00 85:35:00 29.42 5135
18160 10.00 86:49:00 29.42 5209
18161 11.00 88:04:00 35.70 5284
18162 13.00 89:19:00 70.38 5359
18163 15.00 90:35:00 70.42 5435
18164 19.00 91:48:00 137.70 5508
18165 23.00 93:01:00 511.06 5581
18166 NaN NaN NaN 0
18167 NaN NaN NaN 0
[18168 rows x 4 columns]
This should solve it.
I have used slightly different data and condition values, but you should get the idea of what I am doing.
import pandas as pd

df = pd.DataFrame({'price_of_X': [3.30, 3.25, 3.40, 3.25, 3.25, 3.40],
                   'price_of_Y': [2.28, 1.28, 4.28, 4.28, 1.18, 3.28],
                   'Decimal_time': [0, 26, 103, 180, 256, 334]})
print(df)

df1 = df.loc[(df['price_of_Y'] < 3.5) & (df['price_of_X'] > 3.3) & (df['Decimal_time'] > 103), :]
print(df1)
output:
df
price_of_X price_of_Y Decimal_time
0 3.30 2.28 0
1 3.25 1.28 26
2 3.40 4.28 103
3 3.25 4.28 180
4 3.25 1.18 256
5 3.40 3.28 334
df1
price_of_X price_of_Y Decimal_time
5 3.4 3.28 334
Similar to what @IMCoins suggested in a comment, use two boolean masks to achieve the selection you require:
mask1 = (df['price_of_Y'] < 3.5) & (df['price_of_Y'] > 0) & (df['Demical time'] < 1800)
mask2 = (df['price_of_X'] > 3.5) & (df['Demical time'] > 1800)
df[mask1 | mask2]
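A runnable sketch of the two masks on a small invented frame (the values are made up; only the column names follow the question):

```python
import pandas as pd

# toy frame with the question's column names, values invented for illustration
df = pd.DataFrame({
    'price_of_X': [3.3, 3.4, 5.0, 3.6, 3.8],
    'price_of_Y': [2.1, 4.0, 3.0, 9.9, 9.9],
    'Demical time': [100, 1700, 1750, 1900, 2000],
})

mask1 = (df['price_of_Y'] < 3.5) & (df['price_of_Y'] > 0) & (df['Demical time'] < 1800)
mask2 = (df['price_of_X'] > 3.5) & (df['Demical time'] > 1800)

out = df[mask1 | mask2]
print(out.index.tolist())  # [0, 2, 3, 4]
```

Rows 0 and 2 satisfy the "y under 3.5 before 30:00" condition, rows 3 and 4 the "x above 3.5 after 30:00" condition; row 1 satisfies neither.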
I have a large file containing information organized by the number of processes and the benchmark case used. The sections follow one after another within the same file.
--
# Benchmarking Allgather
# #processes = 8
# ( 3592 additional processes waiting in MPI_Barrier)
#----------------------------------------------------------------
#bytes #repetitions t_min[usec] t_max[usec] t_avg[usec]
0 1000 0.05 0.05 0.05
1 1000 1.77 2.07 1.97
2 1000 1.79 2.08 1.97
4 1000 1.79 2.07 1.98
8 1000 1.82 2.12 2.01
--
# Benchmarking Allgather
# #processes = 16
# ( 3584 additional processes waiting in MPI_Barrier)
#----------------------------------------------------------------
#bytes #repetitions t_min[usec] t_max[usec] t_avg[usec]
0 1000 0.05 0.05 0.05
1 1000 2.34 2.85 2.73
2 1000 2.36 2.87 2.74
4 1000 2.38 2.90 2.76
8 1000 2.42 2.95 2.79
To plot the information quickly, I was planning to create one file per independent section. For instance, with the information given above I would create two files called "Allgather_8" and "Allgather_16", with the expected content:
$cat Allgather_8
#bytes #repetitions t_min[usec] t_max[usec] t_avg[usec]
0 1000 0.05 0.05 0.05
1 1000 1.77 2.07 1.97
2 1000 1.79 2.08 1.97
4 1000 1.79 2.07 1.98
8 1000 1.82 2.12 2.01
$cat Allgather_16
#bytes #repetitions t_min[usec] t_max[usec] t_avg[usec]
0 1000 0.05 0.05 0.05
1 1000 2.34 2.85 2.73
2 1000 2.36 2.87 2.74
4 1000 2.38 2.90 2.76
8 1000 2.42 2.95 2.79
I could then plot this with gnuplot or matplotlib.
What I have tried so far:
I have been using grep and awk to extract the content, which works for independent sections but I don't know how to automate this.
Any ideas?
Each "Benchmarking" line starts a new output file named <benchmark>_<processes>; the "#bytes" header and the numeric data rows are then appended to it:
awk '
/Benchmarking/ { close(out); out = $NF }
/#processes/   { out = out "_" $NF }
/#bytes/ || /^[[:space:]]*[0-9]/ { print > out }
' file
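If you prefer to stay in Python, here is a sketch of the same splitting logic (assuming the file layout shown above; output filenames follow the Allgather_8 / Allgather_16 convention):

```python
import os
import re

def split_benchmarks(path, outdir='.'):
    """Split an IMB-style log into one file per (benchmark, #processes) section."""
    out = None
    name = None
    with open(path) as fh:
        for line in fh:
            if 'Benchmarking' in line:
                if out:
                    out.close()
                    out = None
                name = line.split()[-1]       # e.g. "Allgather"
            elif '#processes' in line:
                nproc = line.split()[-1]      # e.g. "8"
                out = open(os.path.join(outdir, f"{name}_{nproc}"), 'w')
            elif out and ('#bytes' in line or re.match(r'\s*\d', line)):
                out.write(line)               # header row or numeric data row
    if out:
        out.close()
```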
I have a 100 ns trajectory file from GROMACS with one frame every 10 ns (i.e. t=10.0, t=20.0, ..., t=100.0). How can I write a Python program that calculates, for each frame, the distance from one atom to a set of other atoms (say, between N and all OW atoms), i.e. the N-OW distances at t=10.0, then at t=20.0, and so on up to t=100.0, and writes everything to a single file?
This is my trajectory file
Generated by trjconv : ABC t= 10.00000
1ABC N 1 0.134 0.731 0.816
28SOL OW 115 1.586 0.579 1.240
29SOL OW 118 0.135 0.791 1.373
30SOL OW 121 0.279 0.419 0.486
31SOL OW 124 0.185 1.369 1.168
32SOL OW 127 0.270 1.932 0.692
33SOL OW 130 1.811 1.427 0.103
34SOL OW 133 0.506 1.752 1.413
35SOL OW 136 0.067 0.943 0.328
36SOL OW 139 0.607 0.127 1.843
Generated by trjconv : ABC t= 20.00000
1ABC N 1 0.174 0.862 0.867
28SOL OW 115 1.835 0.664 1.072
29SOL OW 118 0.162 0.991 1.333
30SOL OW 121 1.962 0.302 0.351
31SOL OW 124 1.991 1.557 0.807
32SOL OW 127 0.371 1.974 0.575
33SOL OW 130 0.027 1.951 0.214
35SOL OW 136 0.017 0.962 0.259
36SOL OW 139 0.359 0.315 1.701
.
.
.
Generated by trjconv : ABC t= 100.00000
1ABC N 1 0.436 0.482 0.720
28SOL OW 115 1.617 0.655 0.781
29SOL OW 118 0.444 1.118 0.961
30SOL OW 121 0.563 0.038 1.949
31SOL OW 124 0.101 0.983 0.321
32SOL OW 127 1.243 0.134 0.914
33SOL OW 130 0.765 1.254 0.416
34SOL OW 133 2.072 1.977 1.276
35SOL OW 136 1.030 0.726 0.400
36SOL OW 139 1.905 0.134 1.699
and I want the calculated N-OW distances output as:
For ABC t= 10.00000
28SOL 0.47672738541
29SOL 0.442346018406
30SOL 0.353905354579
31SOL 0.416744526059
32SOL 0.4526643348
33SOL 0.28253495359
For ABC t= 20.00000
28SOL 0.4657273839
29SOL 0.5323460153
30SOL 0.16905354587
31SOL 0.65474452654
32SOL 0.5246643547
33SOL 0.98253495546
.
.
.
.
For ABC t= 100.00000
28SOL 0.1357273845
29SOL 0.2353460160
30SOL 0.15605354068
31SOL 0.56474452705
32SOL 0.5016644010
33SOL 0.240534950236
How can I write a Python program for this? Thank you.
Update: here is the existing script:
import math

myfile = open('abc.txt', 'r')
text = myfile.read()
temp = text.split()
num = len(temp)

# tokens 10-12 are the x, y, z of the N atom; each subsequent atom line
# contributes 6 tokens, with its coordinates at offsets i, i+1, i+2
i = 16
while i < num:
    dist = math.sqrt((float(temp[i]) - float(temp[10]))**2
                     + (float(temp[i+1]) - float(temp[11]))**2
                     + (float(temp[i+2]) - float(temp[12]))**2)
    print(temp[i-3], dist)   # residue name, distance
    i = i + 6
myfile.close()
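The flat token-index script only works for a single frame: it will fail with a ValueError as soon as it reaches the next "Generated by trjconv" header. A sketch that walks every frame instead (assuming the whitespace-separated layout shown above, with residue name, atom name, index, x, y, z on each line; the filename abc.txt is taken from the script):

```python
import math

def frame_distances(path):
    """Return [(frame_title, [(residue, distance_to_N), ...]), ...]."""
    results = []
    ref = None
    with open(path) as fh:
        for line in fh:
            if line.startswith('Generated'):
                results.append((line.strip(), []))
                ref = None                     # new frame: forget the old N
                continue
            parts = line.split()
            if len(parts) != 6:
                continue                       # skip the "." separator lines
            name, atom, _, x, y, z = parts
            pos = (float(x), float(y), float(z))
            if atom == 'N':
                ref = pos                      # reference atom for this frame
            elif atom == 'OW' and ref is not None:
                results[-1][1].append((name, math.dist(ref, pos)))
    return results

# usage: one block per frame, matching the desired output layout
# for title, dists in frame_distances('abc.txt'):
#     print('For', title.split(':', 1)[1].strip())
#     for name, d in dists:
#         print(name, d)
```

Note this ignores periodic boundary conditions; for minimum-image distances, tools like gmx distance or MDAnalysis are a better fit.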
I have a trajectory with one frame every 0.5 ps from an MD simulation in GROMACS. I want to calculate the distance between the oxygen of one water molecule (OW) and the hydrogens of the other water molecules (HW1, HW2), i.e. the distance between the OW of 28SOL and HW1/HW2 of 30SOL, between the OW of 28SOL and HW1/HW2 of 31SOL, and so on for all combinations, at t=0.000, t=0.500, ...
Generated by trjconv : Protein t= 0.00000
28SOL OW 115 0.439 0.940 1.110
28SOL HW1 116 0.462 1.020 1.055
28SOL HW2 117 0.414 0.864 1.050
29SOL OW 118 1.626 1.796 1.779
29SOL HW1 119 1.550 1.763 1.834
29SOL HW2 120 1.594 1.871 1.720
30SOL OW 121 1.022 0.116 0.460
30SOL HW1 122 0.955 0.125 0.533
30SOL HW2 123 1.002 0.182 0.388
31SOL OW 124 1.063 0.349 1.874
31SOL HW1 125 1.028 0.428 1.824
31SOL HW2 126 1.129 0.300 1.816
32SOL OW 127 1.726 0.716 1.886
32SOL HW1 128 1.737 0.680 1.793
32SOL HW2 129 1.799 0.782 1.905
.
.
.
Generated by trjconv : Protein t= 0.50000
28SOL OW 115 0.494 1.029 1.115
28SOL HW1 116 0.529 1.116 1.080
28SOL HW2 117 0.465 0.971 1.039
29SOL OW 118 1.566 1.834 1.772
29SOL HW1 119 1.556 1.767 1.846
29SOL HW2 120 1.476 1.864 1.742
30SOL OW 121 0.913 0.070 0.385
30SOL HW1 122 0.876 0.086 0.477
30SOL HW2 123 0.880 0.142 0.323
31SOL OW 124 1.089 0.344 1.872
31SOL HW1 125 1.028 0.403 1.820
31SOL HW2 126 1.154 0.300 1.809
.
.
.
How can I write Python code to calculate these distances?
This is what gmx distance does, with the right choices of index groups.
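If you do want to stay in Python, here is a sketch along the lines of the previous question (assuming the whitespace-separated frame layout shown above; it ignores periodic boundary conditions, which gmx distance handles, and the reference residue 28SOL is a parameter):

```python
import math
from collections import defaultdict

def ow_hw_distances(path, ref_res='28SOL'):
    """Per frame: distances from ref_res's OW to every other residue's HW1/HW2."""
    frames = []                      # list of (title, {residue: {atom: (x, y, z)}})
    atoms = None
    with open(path) as fh:
        for line in fh:
            if line.startswith('Generated'):
                atoms = defaultdict(dict)
                frames.append((line.strip(), atoms))
                continue
            parts = line.split()
            if atoms is not None and len(parts) == 6:
                res, atom, _, x, y, z = parts
                atoms[res][atom] = (float(x), float(y), float(z))
    results = []
    for title, frame in frames:
        ow = frame[ref_res]['OW']    # reference oxygen for this frame
        pairs = [(res, h, math.dist(ow, coords[h]))
                 for res, coords in frame.items() if res != ref_res
                 for h in ('HW1', 'HW2') if h in coords]
        results.append((title, pairs))
    return results
```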