Comparing many timestamps with pandas - python

I have two dataframes of different sizes containing timestamps, and I need to find the nearest ones: for each timestamp in df B, the first timestamp in df A that comes after it. Each dataframe has around 100,000 rows, so iteration is not an option, and even df.apply() took around 6 minutes.
e.g.:
A:
11
12
15
16
18
20
25
30
50
B:
14
19
22
27
result:
15
20
25
30

Use Series.searchsorted:
out = a.loc[a['A'].searchsorted(b['B']), 'A']
print(out)
2 15
5 20
6 25
7 30
Name: A, dtype: int64
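A self-contained sketch with the sample data from the question (note that searchsorted assumes column A is sorted ascending, and that the positions it returns only line up with .loc because the index is the default RangeIndex):

```python
import pandas as pd

a = pd.DataFrame({'A': [11, 12, 15, 16, 18, 20, 25, 30, 50]})
b = pd.DataFrame({'B': [14, 19, 22, 27]})

# searchsorted returns, for each value in B, the position of the
# first value in A that is >= it (A must be sorted)
out = a.loc[a['A'].searchsorted(b['B']), 'A']
print(out.tolist())  # [15, 20, 25, 30]
```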

Related

Getting all values which match the nsmallest(5) values in a pandas series

I want to pull out all values from a series whose value is among the n smallest, since I may have many tied values (e.g. lots of zeros) but nsmallest(5) only returns 5 rows.
I got this to work, but am wondering if there is a more pythonic way of doing it, like using a lambda, or using a basic in statement?
alcohol[[True if a in alcohol.nsmallest(5).values else False for a in alcohol]] # works, but best way?
IIUC:
>>> alcohol[alcohol <= alcohol.nsmallest(5).max()]
1 46
5 19
6 25
7 17
9 42
dtype: int64
Setup
np.random.seed(2022)
alcohol = pd.Series(np.random.randint(1, 100, 10))
print(alcohol)
# Output
0 93
1 46
2 50
3 56
4 89
5 19
6 25
7 17
8 54
9 42
dtype: int64
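If you prefer something closer to the "in statement" the question asks about, Series.isin() against the nsmallest values works too and likewise keeps ties. A sketch with the same values as the Setup above hard-coded so it is self-contained:

```python
import pandas as pd

alcohol = pd.Series([93, 46, 50, 56, 89, 19, 25, 17, 54, 42])

# keep every row whose value appears among the 5 smallest values,
# including any duplicates tied with them
out = alcohol[alcohol.isin(alcohol.nsmallest(5))]
print(out)
```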

Average for similar looking data in a column using Pandas

I'm working on a large dataset with more than 60K rows.
I have a continuous measurement of current in one column. Each code is measured for about a second, during which the equipment records it 14/15/16/17 times depending on the equipment's speed; the measurement then moves to the next code for another 14-17 readings, and so forth.
Every time the measurement moves from one code to another, there is a jump of more than 0.15 in the current measurement.
The first 48 rows of the data are as follows:
Index   Curr(mA)
0       1.362476
1       1.341721
2       1.362477
3       1.362477
4       1.355560
5       1.348642
6       1.327886
7       1.341721
8       1.334804
9       1.334804
10      1.348641
11      1.362474
12      1.348644
13      1.355558
14      1.334805
15      1.362477
16      1.556172
17      1.542336
18      1.549252
19      1.528503
20      1.549254
21      1.528501
22      1.556173
23      1.556172
24      1.542334
25      1.556172
26      1.542336
27      1.542334
28      1.556170
29      1.535415
30      1.542334
31      1.729109
32      1.749863
33      1.749861
34      1.749861
35      1.736024
36      1.770619
37      1.742946
38      1.763699
39      1.749861
40      1.749861
41      1.763703
42      1.756781
43      1.742946
44      1.736026
45      1.756781
46      1.964308
47      1.957395
I want to write a script where each block of 14/15/16/17 similar readings is averaged into a separate column, one value per code measurement. I have been thinking of doing this with pandas.
I want the data to look like:
Index   Curr(mA)
0       1.34907
1       1.54556
2       1.74986
Need some help to get this done. Please help
First, get the indexes of every row where there's a jump. Use pandas' Series.diff() to get the difference between each row's value and the previous one, then compare it against 0.15 with >. Use that boolean mask to filter the dataframe's index, and save the resulting indices (three, for your sample data) in a variable.
indices = df.index[df['Curr(mA)'].diff() > 0.15]
The next steps depend on whether there are more columns in the source dataframe that you want in the output, or whether it's really just Curr(mA) and the index. In the latter case, you can use np.split() to cut the dataframe into a list of dataframes at the indexes you just pulled, then average each piece in a list comprehension.
[df['Curr(mA)'].mean() for df in np.split(df, indices)]
> [1.3490729374999997, 1.5455638666666667, 1.7498627333333332, 1.9608515]
To match your desired output above (the same values as a one-column structure rather than a list), convert the list to a pd.Series:
pd.Series(
    [df['Curr(mA)'].mean() for df in np.split(df, indices)]
)

0    1.349073
1    1.545564
2    1.749863
3    1.960851
dtype: float64
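Putting the steps together on a small synthetic series (toy values with jumps greater than 0.15 between the three "codes", not the question's real data):

```python
import numpy as np
import pandas as pd

# synthetic data: three groups of readings, with a jump > 0.15 between groups
df = pd.DataFrame({'Curr(mA)': [1.36, 1.34, 1.35, 1.55, 1.54, 1.56, 1.75, 1.74]})

# indices where the value jumps by more than 0.15 from the previous row
indices = df.index[df['Curr(mA)'].diff() > 0.15]

# split at the jump points and average each block
means = pd.Series([part['Curr(mA)'].mean() for part in np.split(df, indices)])
print(means.tolist())  # approximately [1.35, 1.55, 1.745]
```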

Creating Multiple DataFrames from single DataFrame based on different values of single column

I have 3 days of time-series data with multiple columns, all in one single DataFrame. I want 3 different DataFrames split on the values of the column "Dates", i.e. df["Dates"].
For Example:
Available Dataframe is: df
Expected Output: Based on Three different Dates
First DataFrame: df_23
Second DataFrame: df_24
Third DataFrame: df_25
I want to use these all three DataFrames separately for analysis.
I tried the code below, but I am not able to use those three dataframes (rather, I don't know how to use them). Can anybody help me make my code work? Thank you.
The code above just prints the DataFrame as three DataFrames, and not even as expected from the code!
I'm unsure whether you're saving your variable to a CSV or keeping it in memory for further use, but you could put each group into a dict and access it by key:
print(df)
Cal Dates
0 85 23
1 75 23
2 74 23
3 97 23
4 54 24
5 10 24
6 77 24
7 95 24
8 58 25
9 53 25
10 44 25
11 94 25
d = {}
for frame, data in df.groupby('Dates'):
    d[f'df{frame}'] = data
print(d['df23'])
Cal Dates
0 85 23
1 75 23
2 74 23
3 97 23
Edit, for the updated request:
for k, v in d.items():
    i = v['Cal'].loc[v['Cal'] > 70].count()
    print(f"{v['Dates'].unique()[0]} --> {i} times")
23 --> 4 times
24 --> 2 times
25 --> 1 times
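The loop above can also be written as a dict comprehension; a minimal sketch with toy data shaped like the example (column names from the question, values made up):

```python
import pandas as pd

df = pd.DataFrame({'Cal':   [85, 75, 54, 10, 58, 94],
                   'Dates': [23, 23, 24, 24, 25, 25]})

# one sub-DataFrame per unique value of 'Dates', keyed by a generated name
d = {f'df{date}': sub for date, sub in df.groupby('Dates')}

print(list(d))  # ['df23', 'df24', 'df25']
```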

Pandas get monthly open close from daily data?

I have around 700 rows with data starting from Jan 2010.
I am trying to find the monthly movement, i.e. the first recorded open for a month minus the last recorded close for that month.
Groupby allows for sum() and mean(), but I can't figure out how to get the two data points mentioned above.
df
0 2010-04-01 9464.15 9507.75
1 2010-04-05 9593.55 9698.60
2 2010-04-06 9732.60 9728.20
3 2010-04-07 9778.50 9681.05
4 2010-04-08 9676.70 9520.00
5 2010-04-09 9538.00 9656.50
6 2010-04-12 9661.20 9575.45
7 2010-04-13 9565.05 9483.00
8 2010-04-15 9501.60 9344.60
9 2010-04-16 9345.50 9353.75
10 2010-04-19 9273.85 9302.90
11 2010-04-20 9314.55 9446.10
12 2010-04-21 9477.10 9555.30
13 2010-04-22 9534.05 9623.25
14 2010-04-23 9653.15 9813.30
15 2010-04-26 9890.80 9839.15
16 2010-04-27 9827.00 9756.45
17 2010-04-28 9630.35 9634.90
18 2010-04-29 9652.60 9803.80
19 2010-04-30 9809.40 9870.35
20 2010-05-03 9809.40 9775.50
21 2010-05-04 9816.60 9623.70
22 2010-05-05 9461.35 9581.85
23 2010-05-06 9588.85 9582.00
24 2010-05-07 9426.65 9276.10
25 2010-05-10 9419.50 9656.25
26 2010-05-11 9683.20 9626.10
27 2010-05-12 9640.80 9722.20
28 2010-05-13 9788.35 9773.35
29 2010-05-14 9738.15 9589.05
Desired output
df
Date Open Close
0 2010-APR 9464.15 9634.90 # Close is from 2010-04-30
1 2010-MAY 9809.40 9589.05 # Close is from 2010-05-14
It would be great to have two more columns such as Open Date and Close Date.
I think this will do:
df["Date"] = pd.to_datetime(df["Date"])
# group by year+month so the same month from different years isn't mixed together
gb = df.groupby(df.Date.dt.to_period("M"))
pd.DataFrame({'Open': gb.Open.nth(0), 'Close': gb.Close.nth(-1)})
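A sketch that also adds the Open Date / Close Date columns the question mentions, using named aggregation. The column names Date/Open/Close are assumptions (the sample data has no header row), and only a hand-copied subset of the sample rows is used:

```python
import pandas as pd

df = pd.DataFrame({
    'Date':  pd.to_datetime(['2010-04-01', '2010-04-30', '2010-05-03', '2010-05-14']),
    'Open':  [9464.15, 9809.40, 9809.40, 9738.15],
    'Close': [9507.75, 9870.35, 9775.50, 9589.05],
})

# first Open and last Close per calendar month, plus the dates they came from
out = df.groupby(df['Date'].dt.to_period('M')).agg(
    Open=('Open', 'first'),
    Close=('Close', 'last'),
    OpenDate=('Date', 'first'),
    CloseDate=('Date', 'last'),
)
print(out)
```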

Group Daily Data by Week for Python Dataframe

So I have a Python dataframe that is sorted by month and then by day,
In [4]: result_GB_daily_average
Out[4]:
NREL Avert
Month Day
1 1 14.718417 37.250000
2 40.381167 45.250000
3 42.512646 40.666667
4 12.166896 31.583333
5 14.583208 50.416667
6 34.238000 45.333333
7 45.581229 29.125000
8 60.548479 27.916667
9 48.061583 34.041667
10 20.606958 37.583333
11 5.418833 70.833333
12 51.261375 43.208333
13 21.796771 42.541667
14 27.118979 41.958333
15 8.230542 43.625000
16 14.233958 48.708333
17 28.345875 51.125000
18 43.896375 55.500000
19 95.800542 44.500000
20 53.763104 39.958333
21 26.171437 50.958333
22 20.372688 66.916667
23 20.594042 42.541667
24 16.889083 48.083333
25 16.416479 42.125000
26 28.459625 40.125000
27 1.055229 49.833333
28 36.798792 42.791667
29 27.260083 47.041667
30 23.584917 55.750000
This continues on for every month of the year and I would like to be able to sort it by week instead of day, so that it looks something like this:
In [4]: result_GB_week_average
Out[4]:
NREL Avert
Month Week
1 1 Average values from first 7 days
2 Average values from next 7 days
3 Average values from next 7 days
4 Average values from next 7 days
And so forth. What would the easiest way to do this be?
I assume by weeks you don't mean actual calendar weeks! Here is my proposed solution:
# First add a dummy column
result_GB_daily_average['count'] = 1
# Take a cumulative sum, divide by 7 and round up, so each block of 7 consecutive days shares one week number
result_GB_daily_average['Week'] = np.ceil(result_GB_daily_average['count'].cumsum() / 7.0)
# Then group by week and calculate the average
result_GB_week_average = result_GB_daily_average.groupby('Week')[['NREL', 'Avert']].mean()
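A minimal runnable sketch of the same idea, using 14 days of toy data (the NREL/Avert column names are taken from the question; the values are made up so the weekly means are easy to check):

```python
import numpy as np
import pandas as pd

# toy daily data: 14 days
daily = pd.DataFrame({'NREL':  np.arange(14, dtype=float),
                      'Avert': np.arange(14, dtype=float) * 2})

# every block of 7 consecutive rows gets the same week number (1, 2, ...)
daily['Week'] = np.arange(len(daily)) // 7 + 1

weekly = daily.groupby('Week')[['NREL', 'Avert']].mean()
print(weekly)
```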
