How to count NA "blocks" in one column depending on another column - python

Suppose I have this df:
Date     | Time  | DeviceID | Temperature (°C) | Humidity (%)
---------|-------|----------|------------------|-------------
01/01/20 | 12:00 | 567      | 13.1             | 73
01/01/20 | 12:10 | 2543     | 13               | 72.7
01/01/20 | 12:20 | 573      | 13.5             | 70
01/01/20 | 12:30 | 474      | 12               | 75
How can I display the DeviceIDs which had more than, let's say, 3 consecutive NAs in Temperature or Humidity?
Also, I would like to see the length of the consecutive-NA segments each DeviceID has in Temperature and Humidity (i.e. Device 4 has 4 consecutive NAs in Temperature on this day; Device 80 has 6 consecutive NAs in Humidity on that day, and so on).
What can I do?
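One common approach is to flag the missing values, label each run of consecutive NAs with shift/cumsum, and measure the run lengths per device. Below is a minimal sketch assuming the column names from the sample table; na_runs is a hypothetical helper, not an existing API:
import pandas as pd

def na_runs(df, col, min_len=3):
    # Hypothetical helper: report each DeviceID's runs of consecutive
    # NAs in `col` that are longer than `min_len`.
    out = []
    # Sort so "consecutive" means consecutive readings of one device
    for dev, g in df.sort_values(['Date', 'Time']).groupby('DeviceID'):
        is_na = g[col].isna()
        # Label the runs: the id increments whenever the NA state flips
        run_id = is_na.ne(is_na.shift()).cumsum()
        run_lengths = is_na.groupby(run_id).sum()  # NA count per run
        for n in run_lengths[run_lengths > min_len]:
            out.append({'DeviceID': dev, 'column': col, 'run_length': int(n)})
    return pd.DataFrame(out)

print(na_runs(df, 'Temperature (°C)'))
print(na_runs(df, 'Humidity (%)'))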

Take the sum of every N rows per group in a pandas DataFrame

Just want to mention that this question is not a duplicate of this: Take the sum of every N rows in pandas series
My problem is a bit different, as I want to aggregate every N rows per group. My current code looks like this:
import pandas as pd

df = pd.DataFrame({'ID': ['AA', 'AA', 'AA', 'BB', 'BB', 'BB'],
                   'DATE': ['2021-01-01', '2021-01-03', '2021-01-08',
                            '2021-03-04', '2021-03-06', '2021-03-08'],
                   'VALUE': [10, 15, 25, 40, 60, 90]})
df['DATE'] = pd.to_datetime(df['DATE'])
df = df.sort_values(by=['ID', 'DATE'])
df.head(10)
Sample DataFrame:
+----+------------+-------+
| ID | DATE | VALUE |
+----+------------+-------+
| AA | 2021-01-01 | 10 |
+----+------------+-------+
| AA | 2021-01-03 | 15 |
+----+------------+-------+
| AA | 2021-01-08 | 25 |
+----+------------+-------+
| BB | 2021-03-04 | 40 |
+----+------------+-------+
| BB | 2021-03-06 | 60 |
+----+------------+-------+
| BB | 2021-03-08 | 90 |
+----+------------+-------+
I apply this preprocessing based on the post:
#Calculate result
df.groupby(['ID', df.index//2]).agg({'VALUE':'mean', 'DATE':'median'}).reset_index()
I get this:
+----+-------+------------+
| ID | VALUE | DATE |
+----+-------+------------+
| AA | 12.5 | 2021-01-02 |
+----+-------+------------+
| AA | 25 | 2021-01-08 |
+----+-------+------------+
| BB | 40 | 2021-03-04 |
+----+-------+------------+
| BB | 75 | 2021-03-07 |
+----+-------+------------+
But I want this:
+----+-------+------------+
| ID | VALUE | DATE |
+----+-------+------------+
| AA | 12.5 | 2021-01-02 |
+----+-------+------------+
| AA | 25 | 2021-01-08 |
+----+-------+------------+
| BB | 50 | 2021-03-05 |
+----+-------+------------+
| BB | 90 | 2021-03-08 |
+----+-------+------------+
It seems the positional index does not work well when my groups are not aligned to the chunk size: it mixes the end of one group with the start of the next and skews the aggregations. Any suggestions? My dates can be completely irregular, by the way.
You can use groupby.cumcount to form a subgroup:
N = 2
# Number rows within each ID, then integer-divide to form chunks of N
group = df.groupby('ID').cumcount() // N
out = (df.groupby(['ID', group])
         .agg({'VALUE': 'mean', 'DATE': 'median'})
         .droplevel(1)
         .reset_index())
output:
ID VALUE DATE
0 AA 12.5 2021-01-02
1 AA 25.0 2021-01-08
2 BB 50.0 2021-03-05
3 BB 90.0 2021-03-08
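The key difference from the positional df.index // 2 grouping is that groupby('ID').cumcount() restarts at zero within each ID, so chunks of N rows never straddle a group boundary even when group sizes are odd or the dates are irregular.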

How can you take the average of values in a certain time interval on a specific day?

I have created a data frame with multiple days and multiple moments throughout each day where the heart rate was measured:
code:
df = df_HR.set_index('date')
df = df.drop(['2022-05-23', '2022-05-27', '2022-06-10', '2022-06-13'])
The resulting data frame looks like this:
| date | time | heartRate |
| ---- | ---- | --------- |
| 2022-05-24 | 00:00 | 54 |
| 2022-05-24 | 00:01 | 54 |
| 2022-05-24 | 00:02 | 54 |
| 2022-05-24 | 00:03 | 54 |
| 2022-05-24 | 00:04 | 54 |
This goes on for multiple days. Now I want to calculate the average over a specific time interval, let's say 00:01-00:03. I only want to calculate this for 2022-05-24 because, as an example, for 2022-05-26 I would want the average heart rate over a different interval, 00:12-00:28.
How do I write this code?
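A sketch of one way to do this with boolean masks, assuming date and time are plain string columns as in the sample (reset the index first if 'date' has been set as the index; mean_hr is a hypothetical helper):
import pandas as pd

def mean_hr(df, day, start, end):
    # Rows for one day whose time lies in [start, end]; lexicographic
    # comparison is safe because HH:MM strings are zero-padded.
    mask = (df['date'] == day) & df['time'].between(start, end)
    return df.loc[mask, 'heartRate'].mean()

print(mean_hr(df_HR, '2022-05-24', '00:01', '00:03'))
print(mean_hr(df_HR, '2022-05-26', '00:12', '00:28'))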

How do I sum columns on the basis of a condition using pandas?

The df has the following columns,
col1 | col2 | col3 | Jan-19 | Feb-19 | Mar-19 | Apr-19 | May-19 | Jun-19 | Jul-19 | Aug-19 | Sep-19 | Oct-19 | Nov-19 | Dec-19 | Jan-20 | Feb-20 | Mar-20 | Apr-20 | May-20 | Jun-20 | Jul-20 | Aug-20 | Sep-20 | Oct-20 | Nov-20 | Dec-20
ab | cd | | 10 | 12 | 14 | 15 | 16 | 12 | 13 | 7 | 82 | 76 | 100 | 98 | 10 | 12 | 14 | 15 | 16 | 12 | 13 | 7 | 82 | 76 | 100 | 98
The month columns have numbers. I want to sum the month columns on the following condition,
Condition:
If datetime.now().strftime('%b-%Y') is anything from Jun-19 (for example) to Oct-19, then I want to sum the month columns from Oct-19 to Feb-20. If it is anything from Jun-20 to Oct-20, then sum the columns from Oct-20 to Feb-21, and so on.
If datetime.now().strftime('%b-%Y') is anything from Nov-19 to May-19, then I want to sum the month columns from Mar-20 to Sep-20. If it is anything from Nov-20 to May-20, then sum the columns from Mar-21 to Sep-21, and so on.
There should be a Total column at the end.
col1 | col2 | col3 | Jan-19 | Feb-19 | Mar-19 | Apr-19 | May-19 | Jun-19 | Jul-19 | Aug-19 | Sep-19 | Oct-19 | Nov-19 | Dec-19 | Jan-20 | Feb-20 | Mar-20 | Apr-20 | May-20 | Jun-20 | Jul-20 | Aug-20 | Sep-20 | Oct-20 | Nov-20 | Dec-20 | Total
ab | cd | | 10 | 12 | 14 | 15 | 16 | 12 | 13 | 7 | 82 | 76 | 100 | 98 | 10 | 12 | 14 | 15 | 16 | 12 | 13 | 7 | 82 | 76 | 100 | 98 | 296
Is there a way to create a generic condition for this so that it may work for x month and y year?
It is still confusing what you actually want to do.
But for your case, my suggestion is to select the columns by their names and transpose the table.
Then you can sum the values along the row axis.
This is not very costly on a DataFrame.
In my opinion, operating across the column axis of a DataFrame is always harder than across the row axis, since for row operations you can use the .query() function to easily filter the entries you want, but there is no equivalent in the column direction.
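For illustration, a minimal sketch of that transpose-and-sum suggestion, using the Oct-19 to Feb-20 window from the example above (the window bounds would come from whatever date logic is chosen):
# Label-based slice of the month columns, transposed so the months
# become rows, then summed back down into a Total column.
window = df.loc[:, 'Oct-19':'Feb-20']
df['Total'] = window.T.sum(axis=0)  # equivalent to window.sum(axis=1)
For the sample row this gives 76 + 100 + 98 + 10 + 12 = 296, matching the Total shown above.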

Difference of sum of consecutive years pandas

Suppose I have this pandas DataFrame df
Date | Year | Value
2017-01-01 | 2017 | 20
2017-01-12 | 2017 | 40
2018-01-12 | 2018 | 150
2019-10-10 | 2019 | 300
I want to calculate the difference between the total sum of Value per year between consecutive years. To get the total sum of Value per year I can do
df['YearlyValue'] = df.groupby('Year')['Value'].transform('sum')
which gives me
Date | Year | Value | YearlyValue
2017-01-01 | 2017 | 20 | 60
2017-01-12 | 2017 | 40 | 60
2018-01-12 | 2018 | 150 | 150
2019-10-10 | 2019 | 300 | 300
but how can I get a new column 'Increment' that holds the difference between the YearlyValue of consecutive years?
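A sketch of one approach, assuming the years present in the data are consecutive (column names are from the question; 'Increment' is the desired new column):
# Sum Value per year, diff the yearly sums, then broadcast both
# results back onto the rows via their Year.
yearly = df.groupby('Year')['Value'].sum()
df['YearlyValue'] = df['Year'].map(yearly)
df['Increment'] = df['Year'].map(yearly.diff())
With the sample data this yields Increment values of NaN for 2017, 90 for 2018, and 150 for 2019.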

How to determine issue of weirdly shaped hourly PV power output curve from PVlib

My hourly PV power from pvlib is unusually high in the mornings and low in the evenings. It seems like the peak is shifted towards the morning.
This is one random day's output power with corresponding irradiance data (W/m2):
Time | AC Power [kW] | GHI | DHI | DNI |
-------|---------------|-----|-----|-----|
6:00 | 0 | 4 | 1 | 0 |
7:00 | 22 | 161 | 66 | 589 |
8:00 | 29 | 390 | 153 | 608 |
9:00 | 35 | 592 | 220 | 629 |
10:00 | 37 | 754 | 262 | 654 |
11:00 | 36 | 830 | 283 | 635 |
12:00 | 34 | 874 | 291 | 638 |
13:00 | 31 | 894 | 292 | 668 |
14:00 | 24 | 828 | 280 | 659 |
15:00 | 15 | 695 | 251 | 631 |
16:00 | 5 | 514 | 198 | 601 |
17:00 | 3 | 299 | 128 | 550 |
18:00 | 1 | 74 | 39 | 430 |
As can be seen, the power does not really match the irradiance data and seems to be shifted to earlier times. It has to be mentioned that the irradiance data is simulated GHI and DHI and calculated DNI. The maximum AC output of the system is 40 kW, limited by the inverter.
Do you have any idea why this happens? Am I overlooking something obvious?
I have tried changing the timezone declaration, which didn't change anything. I also tried changing the tilt angle from 5 to 45, which weirdly resulted in higher PV output power. That should definitely not be the case at this latitude.
Thanks heaps!
Here is the code for my PVlib model:
'''
TMY DataFrame creation --> uses function from the tmyDataImport module
'''
tmy_df = ti.tmyData('DHI.csv', 'Weather.csv', highres=False)

"""
Location declaration
"""
lat_ref = 0.20
long_ref = 35
tz_ref = 'Africa/Nairobi'
alt_ref = 1155.0
loc = Location(latitude=lat_ref, longitude=long_ref, tz=tz_ref,
               altitude=alt_ref)

"""
PVSystem declaration
"""
cec_modules = pvlib.pvsystem.retrieve_sam('CECMod')
# sandia_modules = pvlib.pvsystem.retrieve_sam('SandiaMod')
cec_inverters = pvlib.pvsystem.retrieve_sam('cecinverter')
tilt_ref = 5
azi_ref = 180  # South
alb_ref = None
surf_type_ref = 'grass'
mod_ref = None
mod_para_ref = cec_modules['Trina_Solar_TSM_325PD14']
mod_p_str_ref = 19
str_p_inv_ref = 1
inv_ref = None
inv_para_ref = cec_inverters['Fronius_USA__IG_Plus_5_0_1_UNI__240V__240V__CEC_2018_']
rack_ref = 'open_rack_cell_glassback'
losses_ref = None
pvsyst = PVSystem(surface_tilt=tilt_ref, surface_azimuth=azi_ref, albedo=alb_ref,
                  surface_type=surf_type_ref, module=mod_ref,
                  module_parameters=mod_para_ref, modules_per_string=mod_p_str_ref,
                  strings_per_inverter=str_p_inv_ref, inverter=inv_ref,
                  inverter_parameters=inv_para_ref, racking_model=rack_ref,
                  losses_parameters=losses_ref)

"""
ModelChain declaration
"""
pvsys_ref = pvsyst
loc_ref = loc
orient_strat_ref = None
sky_mod_ref = 'ineichen'
transp_mod_ref = 'haydavies'
sol_pos_mod_ref = 'nrel_numpy'
airm_mod_ref = 'kastenyoung1989'
dc_mod_ref = 'cec'
ac_mod_ref = None
aoi_mod_ref = 'physical'
spec_mod_ref = 'no_loss'
temp_mod_ref = 'sapm'
loss_mod_ref = 'no_loss'
moch = ModelChain(system=pvsys_ref, location=loc_ref, orientation_strategy=orient_strat_ref,
                  clearsky_model=sky_mod_ref, transposition_model=transp_mod_ref,
                  solar_position_model=sol_pos_mod_ref, airmass_model=airm_mod_ref,
                  dc_model=dc_mod_ref, ac_model=ac_mod_ref, aoi_model=aoi_mod_ref,
                  spectral_model=spec_mod_ref, temp_model=temp_mod_ref,
                  losses_model=loss_mod_ref)
moch.run_model(times=tmy_df.index, weather=tmy_df)
ac_power = moch.ac * 8 / 1000  # scale (x8) and convert W to kW
ac_power = ac_power.reset_index(drop=False)
ac_power = ac_power.rename(columns={0: "PV Power [kW]"})
ac_power.loc[ac_power['PV Power [kW]'] < 0, 'PV Power [kW]'] = 0
ac_power.to_csv('pvPower.csv')
I solved it =D.
The problem was in the TMY file. When I created the timestamp in my tmyData() function, I did not specify tz=pytz.timezone('Africa/Nairobi'), which apparently left it defaulting to UTC. Now the power output makes sense.
Cheers, Axel
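For reference, a hedged sketch of the kind of fix described; the actual tmyData() internals are not shown in the question, so the index construction here is an assumption:
import pandas as pd
import pytz

# Build a timezone-aware index so pvlib interprets the weather data
# in local time instead of an implicit UTC default.
tz = pytz.timezone('Africa/Nairobi')
idx = pd.date_range('2005-01-01 00:00', periods=8760, freq='H', tz=tz)  # hypothetical TMY year
tmy_df.index = idx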
