Concatenate arrays into a single table using pandas - python
I have a .csv file. I group it by year so that it gives me the maximum, minimum and average values as a result:
import pandas as pd

DF = pd.read_csv("PJME_hourly.csv")
for i in range(2002, 2019):
    neblina = DF[DF.Datetime.str.contains(str(i))]
    dateframe = neblina.agg({"PJME_MW": ['max', 'min', 'mean']})
    print(i, pd.concat([dateframe], axis=0, sort=False))
Its output is as follows:
2002 PJME_MW
max 55934.000000
min 19247.000000
mean 31565.617106
2003 PJME_MW
max 53737.000000
min 19414.000000
mean 31698.758621
2004 PJME_MW
max 51962.000000
min 19543.000000
mean 32270.434867
I would like to know how I can join it all into a single column (PJME_MW), with each group of operations (max, min, mean) identified by the year it corresponds to.
If you convert the dates with to_datetime(), you can group them using the dt.year accessor:
df = pd.read_csv('PJME_hourly.csv')
df.Datetime = pd.to_datetime(df.Datetime)
df.groupby(df.Datetime.dt.year).agg(['min', 'max', 'mean'])
Toy example:
df = pd.DataFrame({'Datetime': ['2019-01-01','2019-02-01','2020-01-01','2020-02-01','2021-01-01'], 'PJME_MV': [3,5,30,50,100]})
# Datetime PJME_MV
# 0 2019-01-01 3
# 1 2019-02-01 5
# 2 2020-01-01 30
# 3 2020-02-01 50
# 4 2021-01-01 100
df.Datetime = pd.to_datetime(df.Datetime)
df.groupby(df.Datetime.dt.year).agg(['min', 'max', 'mean'])
# PJME_MV
# min max mean
# Datetime
# 2019 3 5 4
# 2020 30 50 40
# 2021 100 100 100
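If you specifically want everything in one column with each statistic labeled by its year, one option is to stack() the aggregated frame. A sketch building on the toy frame above (so the column is PJME_MV here):

out = df.groupby(df.Datetime.dt.year)['PJME_MV'].agg(['min', 'max', 'mean']).stack()
print(out)
# Datetime
# 2019  min       3.0
#       max       5.0
#       mean      4.0
# 2020  min      30.0
#       max      50.0
#       mean     40.0
# 2021  min     100.0
#       max     100.0
#       mean    100.0
# dtype: float64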
The code could be optimized, but keeping the structure you have now, change this part of your code:
for i in range(2002, 2019):
    neblina = DF[DF.Datetime.str.contains(str(i))]
    dateframe = neblina.agg({"PJME_MW": ['max', 'min', 'mean']})
    print(i, pd.concat([dateframe], axis=0, sort=False))
Use this instead:
aggs = ['max', 'min', 'mean']
df_group = df.groupby('Datetime')['PJME_MW'].agg(aggs).reset_index()
out_columns = ['agg_year', 'PJME_MW']
out = []
for agg in aggs:
    # build a fresh frame on each iteration; reusing one frame would append
    # three references to the same object and concat three copies of the last one
    aux = pd.DataFrame(columns=out_columns)
    aux['agg_year'] = agg + '_' + df_group['Datetime']
    aux['PJME_MW'] = df_group[agg]
    out.append(aux)
df_out = pd.concat(out)
Edit: Concatenation form has been changed
Final edit: I didn't understand the whole problem, sorry. You don't need any of the code after the groupby function.
Related
Pandas: Group by operation on dynamically selected columns with conditional filter
I have a dataframe as follows:

Date        Group  Value  Duration
2018-01-01  A      20     30
2018-02-01  A      10     60
2018-01-01  B      15     180
2018-02-01  B      30     210
2018-03-01  B      25     238
2018-01-01  C      10     235
2018-02-01  C      15     130

I want to use group_by dynamically, i.e. I do not wish to type the column names on which group_by would be applied. Specifically, I want to compute the mean of each Group for the last two months. As we can see, not every Group's data is present in the above dataframe for all dates. So the tasks are as follows:

1. Add a dummy row based on the date, in case data pertaining to Date = 2018-03-01 is not present for a Group (e.g. add rows for A and C).
2. Perform group_by to compute the mean using the last two months' Value and Duration.

My approach is as follows.

For Task 1:

s = pd.MultiIndex.from_product([df['Date'].unique(), df['Group'].unique()], names=['Date', 'Group'])
df = df.set_index(['Date', 'Group']).reindex(s).reset_index().sort_values(['Group', 'Date']).ffill(axis=0)

Can we have a better method for achieving the 'add row' task? The reference is found here.

For Task 2:

def cond_grp_by(df, grp_by: str, cols_list: list, *args):
    df_grp = df.groupby(grp_by)[cols_list].transform(lambda x: x.tail(2).mean())
    return df_grp

df_cols = df.columns.tolist()
df = cond_grp_by(dealer_f_filt, 'Group', df_cols)

Reference for the above approach is found here. The above code throws:

IndexError: Column(s) ['index', 'Group', 'Date', 'Value', 'Duration'] already selected

The expected output is:

Group  Value  Duration
A      10     60        <- since a row is added for 2018-03-01 with the
B      27.5   224          same value as 2018-02-01, we are computing
C      15     130       <- the mean of the last two values
Use GroupBy.agg instead of transform if you need the output filled with aggregate values:

def cond_grp_by(df, grp_by: str, cols_list: list, *args):
    return df.groupby(grp_by)[cols_list].agg(lambda x: x.tail(2).mean()).reset_index()

df = cond_grp_by(df, 'Group', df_cols)
print(df)

  Group  Value  Duration
0     A   10.0      60.0
1     B   27.5     224.0
2     C   15.0     130.0

If you need the last value per group, use GroupBy.last:

def cond_grp_by(df, grp_by: str, cols_list: list, *args):
    return df.groupby(grp_by)[cols_list].last().reset_index()

df = cond_grp_by(df, 'Group', df_cols)
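For context, a minimal sketch of the shape difference between transform and agg on toy data (names are illustrative):

import pandas as pd

toy = pd.DataFrame({'Group': ['A', 'A', 'B', 'B', 'B'],
                    'Value': [1, 3, 10, 20, 30]})

# transform keeps the input shape: one value per original row
print(toy.groupby('Group')['Value'].transform(lambda x: x.tail(2).mean()).tolist())
# [2.0, 2.0, 25.0, 25.0, 25.0]

# agg collapses to one row per group
print(toy.groupby('Group')['Value'].agg(lambda x: x.tail(2).mean()).tolist())
# [2.0, 25.0]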
Python - Count rows between intervals in a dataframe
I have a dataset with date, engine, energy and max power columns. Let's say the dataset is composed of 2 machines with a depth of one month. Each machine has a maximum power (say 100 for simplicity) and 3 operating states: between Pmax and 80% of Pmax it is at nominal power; between 80% and 20% of Pmax it is in a load drop; below 20% of Pmax we consider the machine stopped. The idea is to know, per period and machine, the number of times the machine has operated in the 2nd interval (between 80% and 20% of Pmax). If a machine drops toward a stop it should not be counted, and if it returns from a stop it should not be counted either.

import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
from numpy.ma.extras import _ezclump as ez

# same values as the original explicit lists: one month of daily data per engine
dates = [f'01/{d:02d}/2020' for d in range(1, 32)]
data = {'date': dates * 2,
        'engine': ['a'] * 31 + ['b'] * 31,
        'energy': [100, 100, 100, 100, 100, 80, 80, 60, 60, 60, 60, 60, 90, 100, 100,
                   50, 50, 40, 20, 0, 0, 0, 20, 50, 60, 100, 100, 50, 50, 50, 50,
                   50, 50, 100, 100, 100, 80, 80, 60, 60, 60, 60, 60, 0, 0, 0,
                   50, 50, 100, 90, 50, 50, 50, 50, 50, 60, 100, 100, 50, 50, 100, 100],
        'pmax': [100] * 62}

df = pd.DataFrame(data, columns=['date', 'engine', 'energy', 'pmax'])
df['date'] = df['date'].astype('datetime64[ns]')
df = df.set_index('date')
df['inter'] = df['energy'].apply(lambda x: 2 if x >= 80 else (1 if x < 80 and x >= 20 else 0))

liste = []
engine_off = ez((df['inter'] == 1).to_numpy())
for i in engine_off:
    if df.iloc[(i.start) - 1, 3] == 0:
        engine_off.remove(i)
    elif df.iloc[(i.stop), 3] == 0:
        engine_off.remove(i)
    else:
        liste.append([df['engine'][i.start], df.index[i.start], df.index[i.stop], i.stop - i.start])

dfend = pd.DataFrame(liste, columns=['engine', 'begin', 'end', 'nb_heure'])
dfend['month'] = dfend['begin'].dt.month_name()
dfgroupe = (dfend.set_index('begin')
                 .groupby(['engine', 'month'])
                 .agg(['mean', 'max', 'min', 'std', 'count', 'sum'])
                 .fillna(1))

First I load my data into a DataFrame and classify each row's energy into an interval (2 for nominal operation, 1 for intermediate, 0 for stop). Then I check each row where the inter column == 1, which gives me a list of slices with the start and end of each run. Then I loop to check that the element before and after each slice is different from 0, to exclude the drops toward a stop and the returns from a stop. Then I create a dataframe from the list and compute the average, sum, etc. The problem is that my list has only 4 drops while there are 5. This comes from the 4th slice (27, 33). Can someone help me? Thank you.
Here is one way to do it. I tried to use your way with groups but ended up doing it slightly differently:

# another way to create inter, probably faster on a big dataframe
df['inter'] = pd.cut(df['energy'] / df['pmax'], [-1, 0.2, 0.8, 1.01], labels=[0, 1, 2], right=False)

# mask where inter is equal to 1 and groupby engine
gr = df['inter'].mask(df['inter'].eq(1)).groupby(df['engine'])

# create a mask to get True for the rows you want
m = (df['inter'].eq(1)       # the rows are 1s
     & ~gr.ffill().eq(0)     # the row before the 1s is not 0
     & ~gr.bfill().eq(0)     # the row after the 1s is not 0
     )

# create dfend with a shape similar to yours
dfend = (df.assign(date=df.index)                 # create a column date for the agg
           .where(m)                              # replace the rows not of interest with nan
           .groupby(['engine',                    # groupby per engine
                     m.ne(m.shift()).cumsum()])   # and per group of consecutive 1s
           .agg(begin=('date', 'first'),          # agg date with both the start date
                end=('date', 'last'))             # and the end date
         )

# create the column nb_hours (although here it seems to be nb_days)
dfend['nb_hours'] = (dfend['end'] - dfend['begin']).dt.days + 1
print(dfend)

                  begin        end  nb_hours
engine inter
a      2     2020-01-08 2020-01-12         5
       4     2020-01-28 2020-01-31         4
b      4     2020-01-01 2020-01-02         2
       6     2020-01-20 2020-01-25         6
       8     2020-01-28 2020-01-29         2

And you get the three segments for engine b as required. Then you can:

# create dfgroupe
dfgroupe = (dfend.groupby(['engine',                                    # groupby engine
                           dfend['begin'].dt.month_name()])             # and month name
                 .agg(['mean', 'max', 'min', 'std', 'count', 'sum'])    # agg
                 .fillna(1)
            )
print(dfgroupe)

               nb_hours
                   mean max min       std count sum
engine begin
a      January 4.500000   5   4  0.707107     2   9
b      January 3.333333   6   2  2.309401     3  10
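As an aside, m.ne(m.shift()).cumsum() is a common idiom for labelling runs of consecutive equal values; a minimal sketch on toy values:

import pandas as pd

s = pd.Series([0, 1, 1, 0, 1, 1, 1, 0])
m = s.eq(1)
# every run of identical True/False values gets its own id
print(m.ne(m.shift()).cumsum().tolist())
# [1, 2, 2, 3, 4, 4, 4, 5]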
I am assuming the following terminology:

- 80 <= energy <= 100 ---> df['inter'] == 2, normal mode.
- 20 <= energy < 80 ---> df['inter'] == 1, intermediate mode.
- energy < 20 ---> df['inter'] == 0, stop mode.

I reckon you want to find those periods of time in which (1) the machine is operating in intermediate mode, and (2) the status is not changing from intermediate to stop mode or from stop to intermediate mode.

# df['before']: compares each row of df['inter'] with the previous row
# df['after']: compares each row of df['inter'] with the next row
# df['target'] == 1 when both conditions (1) and (2) above are met.
# Next we mask the original df and keep the times when conditions (1) and (2) are met,
# then group by machine and month, and obtain the min, max, mean, and so on.
df['before'] = df['inter'].shift(periods=1, fill_value=0)
df['after'] = df['inter'].shift(periods=-1, fill_value=0)
df['target'] = np.where((df['inter'] == 1) & (np.sum(df[['inter', 'before', 'after']], axis=1) > 2), 1, 0)
df['month'] = df.index.month  # 'date' is the index after set_index('date') in your code
mask = df['target'] == 1
df_group = df[mask].groupby(['engine', 'month']).agg(['mean', 'max', 'min', 'std', 'count', 'sum'])
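The before/after columns rely on shift() with fill_value; a tiny sketch of what it does on toy values:

import pandas as pd

s = pd.Series([2, 1, 1, 0, 1])
print(s.shift(1, fill_value=0).tolist())   # [0, 2, 1, 1, 0] - previous row, 0 padded at the start
print(s.shift(-1, fill_value=0).tolist())  # [1, 1, 0, 1, 0] - next row, 0 padded at the end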
Python List - set every n-th value None
As the title says, I want to know how to set every n-th value in a Python list to None. I looked for a solution in a lot of forums but didn't find much. I also don't want to overwrite existing values with None; instead I want to insert new entries with the value None. The list contains dates (12 dates = 1 year) and every 13th value should be empty, because that row will hold the average, so I don't need a date there. Here is how I generated the dates with pandas:

import pandas as pd

numdays = 370  # I have 370 values, one per month, starting from 1990 till June 2019
date1 = '1990-01-01'
date2 = '2019-06-01'
mydates = pd.date_range(date1, date2).tolist()
date_all = pd.date_range(start=date1, end=date2, freq='1BMS')
date_lst = [date_all]

The expected output:

01.01.1990
01.02.1990
01.03.1990
01.04.1990
01.05.1990
01.06.1990
01.07.1990
01.08.1990
01.09.1990
01.10.1990
01.11.1990
01.12.1990
None
01.01.1991
.
.
.
If I understood correctly:

import pandas as pd

numdays = 370
date1 = '1990-01-01'
date2 = '2019-06-01'
mydates = pd.date_range(date1, date2).tolist()
date_all = pd.date_range(start=date1, end=date2, freq='1BMS')
date_lst = [date_all]

for i in range(12, len(mydates), 13):  # add this
    mydates.insert(i, None)
I saw some of the answers above, but there's a way of doing this without having to loop over the complete list:

date_lst[12::12] = [None] * len(date_lst[12::12])

The first 12 in [12::12] means that the first item to be changed is item number 12. The second 12 means that from then on, every 12th item is changed.
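A quick toy demo of the slice-assignment idea; note that, unlike the insert() answer above, this replaces items in place rather than inserting new ones:

lst = list(range(1, 10))
lst[2::3] = [None] * len(lst[2::3])
print(lst)  # [1, 2, None, 4, 5, None, 7, 8, None]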
You can add a step in iloc and set values this way. Let's generate some dummy data:

df = pd.DataFrame({'Vals': pd.date_range('01-01-19', '02-02-19', freq='D')})
print(df)

        Vals
0 2019-01-01
1 2019-01-02
2 2019-01-03
3 2019-01-04
4 2019-01-05
5 2019-01-06
6 2019-01-07
7 2019-01-08

Now you can decide your step:

step = 5
new_df = df.iloc[step::step]
print(new_df)

         Vals
5  2019-01-06
10 2019-01-11
15 2019-01-16
20 2019-01-21
25 2019-01-26
30 2019-01-31

Now, if you want to write a value to a specific column:

df['Vals'].iloc[step::step] = pd.NaT
print(df)

        Vals
0 2019-01-01
1 2019-01-02
2 2019-01-03
3 2019-01-04
4 2019-01-05
5        NaT
Here is an example of setting None if the element of the list is in the 3rd position; you can make this the 13th position by changing it to ((index + 1) % 13 == 0):

data = [1, 2, 3, 4, 5, 6, 7, 8, 9]
data = [None if ((index + 1) % 3 == 0) else d for index, d in enumerate(data)]
print(data)

output:

[1, 2, None, 4, 5, None, 7, 8, None]

According to your code, try this:

date_lst = list(date_all)
dateWithNone = [None if ((index + 1) % 13 == 0) else d for index, d in enumerate(date_lst)]
print(dateWithNone)
Average time between timestamps per group not in order
I would like to get the mean time between timestamps per group. However, the groups are not ordered. Code to create the df:

d = {'ID': ['AI100', 'AI200', 'AI200', 'AI100', 'AI200', 'AI100'],
     'Date': ['2019-01-10', '2018-06-01', '2018-06-11', '2019-01-15', '2018-06-21', '2019-01-22']}
data = pd.DataFrame(data=d)
data = data[['ID', 'Date']]
data['Date'] = pd.to_datetime(data['Date'])
data

      ID       Date
0  AI100 2019-01-10
1  AI200 2018-06-01
2  AI200 2018-06-11
3  AI100 2019-01-15
4  AI200 2018-06-21
5  AI100 2019-01-22

I tried the following:

data = data.sort_values(['ID', 'Date'], ascending=True).groupby('ID').head(3)  # group the IDs
data['diffs'] = data['Date'].diff()
data['diffs'] = data['diffs'].apply(lambda x: x.days)
data = data.groupby(['ID'])['diffs'].agg('mean')

However, this yields:

data.add_suffix('ID').reset_index()

        ID      diffs
0  AI100ID   6.000000
1  AI200ID -71.666667

The mean time for group AI100ID is correct, but not for group AI200ID. What is going wrong?
I think the issue you're having here is that you aren't calculating your diffs per group, so it's calculating the difference between the previous group's last value and the new group's first value. Change your line to this and you should get the expected result:

data['diffs'] = data.groupby('ID')['Date'].diff()

Footnote: another tip, unrelated to the main problem, just in case you were unaware:

data['diffs'] = data['diffs'].apply(lambda x: x.days)

can be written to use faster vectorised operations via the .dt accessor:

data['diffs'] = data['diffs'].dt.days
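To see why, here is a small sketch on toy data (not the question's data): a plain diff() crosses the boundary between groups, while the grouped diff() restarts at each group:

import pandas as pd

toy = pd.DataFrame({'ID': ['A', 'A', 'B', 'B'],
                    'Date': pd.to_datetime(['2019-01-01', '2019-01-05',
                                            '2018-06-01', '2018-06-11'])})
toy = toy.sort_values(['ID', 'Date'])
print(toy['Date'].diff().dt.days.tolist())
# [nan, 4.0, -218.0, 10.0]   <- -218 is the jump across the A/B boundary
print(toy.groupby('ID')['Date'].diff().dt.days.tolist())
# [nan, 4.0, nan, 10.0]      <- each group starts fresh with NaT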
Multiple input and multiple output function application to Pandas DataFrame raises shape exception
I have a dataframe with 6 columns (excluding the index), 2 of which are relevant inputs to a function, and that function has two outputs. I'd like to insert these outputs into the original dataframe as columns. I'm following toto_tico's answer here. I'm copying it for convenience (with slight modifications):

import pandas as pd

df = pd.DataFrame({"A": [10, 20, 30], "B": [20, 30, 10], "C": [10, 10, 10], "D": [1, 1, 1]})

def fab(row):
    return row['A'] * row['B'], row['A'] + row['B']

df['newcolumn'], df['newcolumn2'] = zip(*df.apply(fab, axis=1))

This code works without a problem. My code, however, doesn't. My dataframe has the following structure:

        Date  Station  Insolation  Daily Total  Temperature(avg)  Latitude
0 2011-01-01  Aksaray         1.7      72927.6         -0.025000   38.3705
1 2011-01-02  Aksaray         5.6     145874.7          2.541667   38.3705
2 2011-01-03  Aksaray         6.3     147197.8          6.666667   38.3705
3 2011-01-04  Aksaray         2.9     100350.9          5.312500   38.3705
4 2011-01-05  Aksaray         0.7      42138.7          4.639130   38.3705

The function I'm applying takes a row as input and returns two values based on Latitude and Date. Here's that function:

def h0(row):
    # Get a row from a dataframe, give back H0 and day length.
    # Leap years must be taken into account.
    # row['Latitude'] and row['Date'] are the relevant inputs.
    # phi is taken in degrees; all angles in the formulas are assumed
    # to be degrees as well (numpy defaults to radians, however...).
    gsc = 1367
    phi = np.deg2rad(row['Latitude'])
    date = row['Date']
    year = pd.DatetimeIndex([date]).year[0]
    month = pd.DatetimeIndex([date]).month[0]
    day = pd.DatetimeIndex([date]).day[0]
    if year % 4 == 0:
        B = (day - 1) * (360 / 366)
    else:
        B = (day - 1) * (360 / 365)
    B = np.deg2rad(B)
    delta = (0.006918 - 0.399912 * np.cos(B) + 0.070257 * np.sin(B)
             - 0.006758 * np.cos(2 * B) + 0.000907 * np.sin(2 * B)
             - 0.002697 * np.cos(3 * B) + 0.00148 * np.sin(3 * B))
    ws = np.arccos(-np.tan(phi) * np.tan(delta))
    daylength = (2 / 15) * np.rad2deg(ws)
    if year % 4 == 0:
        dayangle = np.deg2rad(360 * day / 366)
    else:
        dayangle = np.deg2rad(360 * day / 365)
    h0 = (24 * 3600 * gsc / np.pi) * (1 + 0.033 * np.cos(dayangle)) * (
         np.cos(phi) * np.cos(delta) * np.sin(ws) + ws * np.sin(phi) * np.sin(delta))
    return h0, daylength

When I use

ak['h0'], ak['N'] = zip(*ak.apply(h0, axis=1))

I get the error:

Shape of passed values is (1816, 2), indices imply (1816, 6)

I'm unable to find what's wrong with my code. Can you help?
So, as mentioned in my previous comment: if you'd like to create multiple NEW columns in the DataFrame based on multiple EXISTING columns of the DataFrame, you can create new fields in the row Series WITHIN your h0 function. Here's an overly simple example to showcase what I mean:

>>> def simple_func(row):
...     row['new_column1'] = row.lat * 1000
...     row['year'] = row.date.year
...     row['month'] = row.date.month
...     row['day'] = row.date.day
...     return row
...
>>> df
        date   lat
0 2018-01-29  1000
1 2018-01-30  5000
>>> df.date
0   2018-01-29
1   2018-01-30
Name: date, dtype: datetime64[ns]
>>> df.apply(simple_func, axis=1)
        date   lat  new_column1  year  month  day
0 2018-01-29  1000      1000000  2018      1   29
1 2018-01-30  5000      5000000  2018      1   30

In your case, inside your h0 function, set row['h0'] = h0 and row['N'] = daylength, then return row. When it comes to calling the function on the DF, your line changes to:

ak = ak.apply(h0, axis=1)
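As a side note, modern pandas (0.23+) also supports result_type='expand' on apply, which avoids both the zip(*...) pattern and mutating the row; a small sketch on the toy frame from the question:

import pandas as pd

df = pd.DataFrame({"A": [10, 20, 30], "B": [20, 30, 10]})

def fab(row):
    return row['A'] * row['B'], row['A'] + row['B']

# the returned tuple is expanded into two new columns
df[['newcolumn', 'newcolumn2']] = df.apply(fab, axis=1, result_type='expand')
print(df)
#     A   B  newcolumn  newcolumn2
# 0  10  20        200          30
# 1  20  30        600          50
# 2  30  10        300          40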