Concatenate arrays into a single table using pandas - python
I have a .csv file. I group it by year so that it gives me the maximum, minimum and average values as a result:
import pandas as pd

DF = pd.read_csv("PJME_hourly.csv")
for i in range(2002, 2019):
    neblina = DF[DF.Datetime.str.contains(str(i))]
    dateframe = neblina.agg({"PJME_MW": ['max', 'min', 'mean']})
    print(i, pd.concat([dateframe], axis=0, sort=False))
Its output is as follows:
2002 PJME_MW
max 55934.000000
min 19247.000000
mean 31565.617106
2003 PJME_MW
max 53737.000000
min 19414.000000
mean 31698.758621
2004 PJME_MW
max 51962.000000
min 19543.000000
mean 32270.434867
I would like to know how I can join it all into a single column (PJME_MW), with each group of operations (max, min, mean) identified by the year it corresponds to.
If you convert the dates with to_datetime(), you can group them using the dt.year accessor:
df = pd.read_csv('PJME_hourly.csv')
df.Datetime = pd.to_datetime(df.Datetime)
df.groupby(df.Datetime.dt.year).agg(['min', 'max', 'mean'])
Toy example:
df = pd.DataFrame({'Datetime': ['2019-01-01','2019-02-01','2020-01-01','2020-02-01','2021-01-01'], 'PJME_MV': [3,5,30,50,100]})
# Datetime PJME_MV
# 0 2019-01-01 3
# 1 2019-02-01 5
# 2 2020-01-01 30
# 3 2020-02-01 50
# 4 2021-01-01 100
df.Datetime = pd.to_datetime(df.Datetime)
df.groupby(df.Datetime.dt.year).agg(['min', 'max', 'mean'])
# PJME_MV
# min max mean
# Datetime
# 2019 3 5 4
# 2020 30 50 40
# 2021 100 100 100
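If you specifically want everything in one column with each statistic labeled by its year, one option is to stack() the aggregated frame. A sketch building on the toy frame above (so the column is PJME_MV here):

out = df.groupby(df.Datetime.dt.year)['PJME_MV'].agg(['min', 'max', 'mean']).stack()
print(out)
# Datetime
# 2019  min       3.0
#       max       5.0
#       mean      4.0
# 2020  min      30.0
#       max      50.0
#       mean     40.0
# 2021  min     100.0
#       max     100.0
#       mean    100.0
# dtype: float64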
The code could be optimized, but keeping the structure you have now, change this part of your code:
for i in range(2002, 2019):
    neblina = DF[DF.Datetime.str.contains(str(i))]
    dateframe = neblina.agg({"PJME_MW": ['max', 'min', 'mean']})
    print(i, pd.concat([dateframe], axis=0, sort=False))
Use this instead:
aggs = ['max', 'min', 'mean']
df_group = df.groupby('Datetime')['PJME_MW'].agg(aggs).reset_index()
out_columns = ['agg_year', 'PJME_MW']
out = []
for agg in aggs:
    # build a fresh frame on each iteration; reusing one frame would append
    # three references to the same object and concat three copies of the last one
    aux = pd.DataFrame(columns=out_columns)
    aux['agg_year'] = agg + '_' + df_group['Datetime']
    aux['PJME_MW'] = df_group[agg]
    out.append(aux)
df_out = pd.concat(out)
Edit: Concatenation form has been changed
Final edit: I didn't understand the whole problem, sorry. You don't need any of the code after the groupby function.
Related
Pandas: Group by operation on dynamically selected columns with conditional filter
I have a dataframe as follows:

Date        Group  Value  Duration
2018-01-01  A      20     30
2018-02-01  A      10     60
2018-01-01  B      15     180
2018-02-01  B      30     210
2018-03-01  B      25     238
2018-01-01  C      10     235
2018-02-01  C      15     130

I want to use group_by dynamically, i.e. I do not wish to type the column names on which group_by would be applied. Specifically, I want to compute the mean of each Group for the last two months. As we can see, not every Group's data is present in the above dataframe for all dates. So the tasks are as follows:

1. Add a dummy row based on the date, in case data pertaining to Date = 2018-03-01 is not present for a Group (e.g. add rows for A and C).
2. Perform group_by to compute the mean using the last two months' Value and Duration.

My approach is as follows.

For Task 1:

s = pd.MultiIndex.from_product([df['Date'].unique(), df['Group'].unique()], names=['Date', 'Group'])
df = df.set_index(['Date', 'Group']).reindex(s).reset_index().sort_values(['Group', 'Date']).ffill(axis=0)

Can we have a better method for achieving the 'add row' task? The reference is found here.

For Task 2:

def cond_grp_by(df, grp_by: str, cols_list: list, *args):
    df_grp = df.groupby(grp_by)[cols_list].transform(lambda x: x.tail(2).mean())
    return df_grp

df_cols = df.columns.tolist()
df = cond_grp_by(dealer_f_filt, 'Group', df_cols)

Reference for the above approach is found here. The above code throws:

IndexError: Column(s) ['index', 'Group', 'Date', 'Value', 'Duration'] already selected

The expected output is:

Group  Value  Duration
A      10     60        <- since a row is added for 2018-03-01 with the
B      27.5   224          same value as 2018-02-01, we are computing
C      15     130       <- the mean of the last two values
Use GroupBy.agg instead of transform if you need the output filled with aggregate values:

def cond_grp_by(df, grp_by: str, cols_list: list, *args):
    return df.groupby(grp_by)[cols_list].agg(lambda x: x.tail(2).mean()).reset_index()

df = cond_grp_by(df, 'Group', df_cols)
print(df)

  Group  Value  Duration
0     A   10.0      60.0
1     B   27.5     224.0
2     C   15.0     130.0

If you need the last value per group, use GroupBy.last:

def cond_grp_by(df, grp_by: str, cols_list: list, *args):
    return df.groupby(grp_by)[cols_list].last().reset_index()

df = cond_grp_by(df, 'Group', df_cols)
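For context, a minimal sketch of the shape difference between transform and agg on toy data (names are illustrative):

import pandas as pd

toy = pd.DataFrame({'Group': ['A', 'A', 'B', 'B', 'B'],
                    'Value': [1, 3, 10, 20, 30]})

# transform keeps the input shape: one value per original row
print(toy.groupby('Group')['Value'].transform(lambda x: x.tail(2).mean()).tolist())
# [2.0, 2.0, 25.0, 25.0, 25.0]

# agg collapses to one row per group
print(toy.groupby('Group')['Value'].agg(lambda x: x.tail(2).mean()).tolist())
# [2.0, 25.0]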
Python - Count rows between intervals in a dataframe
I have a dataset with date, engine, energy and max power columns. Let's say the dataset is composed of 2 machines with a depth of one month. Each machine has a maximum power (say 100 for simplicity) and 3 operating states: between Pmax and 80% of Pmax it is at nominal power; between 80% and 20% of Pmax it is in a load drop; below 20% of Pmax we consider the machine stopped. The idea is to know, per period and machine, the number of times the machine has operated in the 2nd interval (between 80% and 20% of Pmax). If a machine drops toward a stop it should not be counted, and if it returns from a stop it should not be counted either.

import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
from numpy.ma.extras import _ezclump as ez

# same values as the original explicit lists: one month of daily data per engine
dates = [f'01/{d:02d}/2020' for d in range(1, 32)]
data = {'date': dates * 2,
        'engine': ['a'] * 31 + ['b'] * 31,
        'energy': [100, 100, 100, 100, 100, 80, 80, 60, 60, 60, 60, 60, 90, 100, 100,
                   50, 50, 40, 20, 0, 0, 0, 20, 50, 60, 100, 100, 50, 50, 50, 50,
                   50, 50, 100, 100, 100, 80, 80, 60, 60, 60, 60, 60, 0, 0, 0,
                   50, 50, 100, 90, 50, 50, 50, 50, 50, 60, 100, 100, 50, 50, 100, 100],
        'pmax': [100] * 62}

df = pd.DataFrame(data, columns=['date', 'engine', 'energy', 'pmax'])
df['date'] = df['date'].astype('datetime64[ns]')
df = df.set_index('date')
df['inter'] = df['energy'].apply(lambda x: 2 if x >= 80 else (1 if x < 80 and x >= 20 else 0))

liste = []
engine_off = ez((df['inter'] == 1).to_numpy())
for i in engine_off:
    if df.iloc[(i.start) - 1, 3] == 0:
        engine_off.remove(i)
    elif df.iloc[(i.stop), 3] == 0:
        engine_off.remove(i)
    else:
        liste.append([df['engine'][i.start], df.index[i.start], df.index[i.stop], i.stop - i.start])

dfend = pd.DataFrame(liste, columns=['engine', 'begin', 'end', 'nb_heure'])
dfend['month'] = dfend['begin'].dt.month_name()
dfgroupe = (dfend.set_index('begin')
                 .groupby(['engine', 'month'])
                 .agg(['mean', 'max', 'min', 'std', 'count', 'sum'])
                 .fillna(1))

First I load my data into a DataFrame and classify each row's energy into an interval (2 for nominal operation, 1 for intermediate, 0 for stop). Then I check each row where the inter column == 1, which gives me a list of slices with the start and end of each run. Then I loop to check that the element before and after each slice is different from 0, to exclude the drops toward a stop and the returns from a stop. Then I create a dataframe from the list and compute the average, sum, etc. The problem is that my list has only 4 drops while there are 5. This comes from the 4th slice (27, 33). Can someone help me? Thank you.
Here is one way to do it. I tried to use your way with groups but ended up doing it slightly differently:

# another way to create inter, probably faster on a big dataframe
df['inter'] = pd.cut(df['energy'] / df['pmax'], [-1, 0.2, 0.8, 1.01], labels=[0, 1, 2], right=False)

# mask where inter is equal to 1 and groupby engine
gr = df['inter'].mask(df['inter'].eq(1)).groupby(df['engine'])

# create a mask to get True for the rows you want
m = (df['inter'].eq(1)       # the rows are 1s
     & ~gr.ffill().eq(0)     # the row before the 1s is not 0
     & ~gr.bfill().eq(0)     # the row after the 1s is not 0
     )

# create dfend with a shape similar to yours
dfend = (df.assign(date=df.index)                 # create a column date for the agg
           .where(m)                              # replace the rows not of interest with nan
           .groupby(['engine',                    # groupby per engine
                     m.ne(m.shift()).cumsum()])   # and per group of consecutive 1s
           .agg(begin=('date', 'first'),          # agg date with both the start date
                end=('date', 'last'))             # and the end date
         )

# create the column nb_hours (although here it seems to be nb_days)
dfend['nb_hours'] = (dfend['end'] - dfend['begin']).dt.days + 1
print(dfend)

                  begin        end  nb_hours
engine inter
a      2     2020-01-08 2020-01-12         5
       4     2020-01-28 2020-01-31         4
b      4     2020-01-01 2020-01-02         2
       6     2020-01-20 2020-01-25         6
       8     2020-01-28 2020-01-29         2

And you get the three segments for engine b as required. Then you can:

# create dfgroupe
dfgroupe = (dfend.groupby(['engine',                                    # groupby engine
                           dfend['begin'].dt.month_name()])             # and month name
                 .agg(['mean', 'max', 'min', 'std', 'count', 'sum'])    # agg
                 .fillna(1)
            )
print(dfgroupe)

               nb_hours
                   mean max min       std count sum
engine begin
a      January 4.500000   5   4  0.707107     2   9
b      January 3.333333   6   2  2.309401     3  10
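As an aside, m.ne(m.shift()).cumsum() is a common idiom for labelling runs of consecutive equal values; a minimal sketch on toy values:

import pandas as pd

s = pd.Series([0, 1, 1, 0, 1, 1, 1, 0])
m = s.eq(1)
# every run of identical True/False values gets its own id
print(m.ne(m.shift()).cumsum().tolist())
# [1, 2, 2, 3, 4, 4, 4, 5]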
I am assuming the following terminology:

- 80 <= energy <= 100 ---> df['inter'] == 2, normal mode.
- 20 <= energy < 80 ---> df['inter'] == 1, intermediate mode.
- energy < 20 ---> df['inter'] == 0, stop mode.

I reckon you want to find those periods of time in which (1) the machine is operating in intermediate mode, and (2) the status is not changing from intermediate to stop mode or from stop to intermediate mode.

# df['before']: compares each row of df['inter'] with the previous row
# df['after']: compares each row of df['inter'] with the next row
# df['target'] == 1 when both conditions (1) and (2) above are met.
# Next we mask the original df and keep the times when conditions (1) and (2) are met,
# then group by machine and month, and obtain the min, max, mean, and so on.
df['before'] = df['inter'].shift(periods=1, fill_value=0)
df['after'] = df['inter'].shift(periods=-1, fill_value=0)
df['target'] = np.where((df['inter'] == 1) & (np.sum(df[['inter', 'before', 'after']], axis=1) > 2), 1, 0)
df['month'] = df.index.month  # 'date' is the index after set_index('date') in your code
mask = df['target'] == 1
df_group = df[mask].groupby(['engine', 'month']).agg(['mean', 'max', 'min', 'std', 'count', 'sum'])
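The before/after columns rely on shift() with fill_value; a tiny sketch of what it does on toy values:

import pandas as pd

s = pd.Series([2, 1, 1, 0, 1])
print(s.shift(1, fill_value=0).tolist())   # [0, 2, 1, 1, 0] - previous row, 0 padded at the start
print(s.shift(-1, fill_value=0).tolist())  # [1, 1, 0, 1, 0] - next row, 0 padded at the end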
Python List - set every n-th value None
As the title says, I want to know how to set every n-th value in a Python list to None. I looked for a solution in a lot of forums but didn't find much. I also don't want to overwrite existing values with None; instead I want to insert new entries with the value None. The list contains dates (12 dates = 1 year) and every 13th value should be empty, because that row will hold the average, so I don't need a date there. Here is how I generated the dates with pandas:

import pandas as pd

numdays = 370  # I have 370 values, one per month, starting from 1990 till June 2019
date1 = '1990-01-01'
date2 = '2019-06-01'
mydates = pd.date_range(date1, date2).tolist()
date_all = pd.date_range(start=date1, end=date2, freq='1BMS')
date_lst = [date_all]

The expected output:

01.01.1990
01.02.1990
01.03.1990
01.04.1990
01.05.1990
01.06.1990
01.07.1990
01.08.1990
01.09.1990
01.10.1990
01.11.1990
01.12.1990
None
01.01.1991
.
.
.
If I understood correctly:

import pandas as pd

numdays = 370
date1 = '1990-01-01'
date2 = '2019-06-01'
mydates = pd.date_range(date1, date2).tolist()
date_all = pd.date_range(start=date1, end=date2, freq='1BMS')
date_lst = [date_all]

for i in range(12, len(mydates), 13):  # add this
    mydates.insert(i, None)
I saw some of the answers above, but there's a way of doing this without having to loop over the complete list:

date_lst[12::12] = [None] * len(date_lst[12::12])

The first 12 in [12::12] means that the first item to be changed is item number 12. The second 12 means that from then on, every 12th item is changed.
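A quick toy demo of the slice-assignment idea; note that, unlike the insert() answer above, this replaces items in place rather than inserting new ones:

lst = list(range(1, 10))
lst[2::3] = [None] * len(lst[2::3])
print(lst)  # [1, 2, None, 4, 5, None, 7, 8, None]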
You can add a step in iloc and set values this way. Let's generate some dummy data:

df = pd.DataFrame({'Vals': pd.date_range('01-01-19', '02-02-19', freq='D')})
print(df)

        Vals
0 2019-01-01
1 2019-01-02
2 2019-01-03
3 2019-01-04
4 2019-01-05
5 2019-01-06
6 2019-01-07
7 2019-01-08

Now you can decide your step:

step = 5
new_df = df.iloc[step::step]
print(new_df)

         Vals
5  2019-01-06
10 2019-01-11
15 2019-01-16
20 2019-01-21
25 2019-01-26
30 2019-01-31

Now, if you want to write a value to a specific column:

df['Vals'].iloc[step::step] = pd.NaT
print(df)

        Vals
0 2019-01-01
1 2019-01-02
2 2019-01-03
3 2019-01-04
4 2019-01-05
5        NaT
Here is an example of setting None if the element of the list is in the 3rd position; you can make this the 13th position by changing it to ((index + 1) % 13 == 0):

data = [1, 2, 3, 4, 5, 6, 7, 8, 9]
data = [None if ((index + 1) % 3 == 0) else d for index, d in enumerate(data)]
print(data)

output:

[1, 2, None, 4, 5, None, 7, 8, None]

According to your code, try this:

date_lst = list(date_all)
dateWithNone = [None if ((index + 1) % 13 == 0) else d for index, d in enumerate(date_lst)]
print(dateWithNone)
Average time between timestamps per group not in order
I would like to get the mean time between timestamps per group. However, the groups are not ordered. Code to create the df:

d = {'ID': ['AI100', 'AI200', 'AI200', 'AI100', 'AI200', 'AI100'],
     'Date': ['2019-01-10', '2018-06-01', '2018-06-11', '2019-01-15', '2018-06-21', '2019-01-22']}
data = pd.DataFrame(data=d)
data = data[['ID', 'Date']]
data['Date'] = pd.to_datetime(data['Date'])
data

      ID       Date
0  AI100 2019-01-10
1  AI200 2018-06-01
2  AI200 2018-06-11
3  AI100 2019-01-15
4  AI200 2018-06-21
5  AI100 2019-01-22

I tried the following:

data = data.sort_values(['ID', 'Date'], ascending=True).groupby('ID').head(3)  # group the IDs
data['diffs'] = data['Date'].diff()
data['diffs'] = data['diffs'].apply(lambda x: x.days)
data = data.groupby(['ID'])['diffs'].agg('mean')

However, this yields:

data.add_suffix('ID').reset_index()

        ID      diffs
0  AI100ID   6.000000
1  AI200ID -71.666667

The mean time for group AI100ID is correct, but not for group AI200ID. What is going wrong?
I think the issue you're having here is that you aren't calculating your diffs per group, so it's calculating the difference between the previous group's last value and the new group's first value. Change your line to this and you should get the expected result:

data['diffs'] = data.groupby('ID')['Date'].diff()

Footnote: another tip, unrelated to the main problem, just in case you were unaware:

data['diffs'] = data['diffs'].apply(lambda x: x.days)

can be written to use faster vectorised operations via the .dt accessor:

data['diffs'] = data['diffs'].dt.days
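To see why, here is a small sketch on toy data (not the question's data): a plain diff() crosses the boundary between groups, while the grouped diff() restarts at each group:

import pandas as pd

toy = pd.DataFrame({'ID': ['A', 'A', 'B', 'B'],
                    'Date': pd.to_datetime(['2019-01-01', '2019-01-05',
                                            '2018-06-01', '2018-06-11'])})
toy = toy.sort_values(['ID', 'Date'])
print(toy['Date'].diff().dt.days.tolist())
# [nan, 4.0, -218.0, 10.0]   <- -218 is the jump across the A/B boundary
print(toy.groupby('ID')['Date'].diff().dt.days.tolist())
# [nan, 4.0, nan, 10.0]      <- each group starts fresh with NaT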
Multiple input and multiple output function application to Pandas DataFrame raises shape exception
I have a dataframe with 6 columns (excluding the index), 2 of which are relevant inputs to a function, and that function has two outputs. I'd like to insert these outputs into the original dataframe as columns. I'm following toto_tico's answer here. I'm copying it for convenience (with slight modifications):

import pandas as pd

df = pd.DataFrame({"A": [10, 20, 30], "B": [20, 30, 10], "C": [10, 10, 10], "D": [1, 1, 1]})

def fab(row):
    return row['A'] * row['B'], row['A'] + row['B']

df['newcolumn'], df['newcolumn2'] = zip(*df.apply(fab, axis=1))

This code works without a problem. My code, however, doesn't. My dataframe has the following structure:

        Date  Station  Insolation  Daily Total  Temperature(avg)  Latitude
0 2011-01-01  Aksaray         1.7      72927.6         -0.025000   38.3705
1 2011-01-02  Aksaray         5.6     145874.7          2.541667   38.3705
2 2011-01-03  Aksaray         6.3     147197.8          6.666667   38.3705
3 2011-01-04  Aksaray         2.9     100350.9          5.312500   38.3705
4 2011-01-05  Aksaray         0.7      42138.7          4.639130   38.3705

The function I'm applying takes a row as input and returns two values based on Latitude and Date. Here's that function:

def h0(row):
    # Get a row from a dataframe, give back H0 and day length.
    # Leap years must be taken into account.
    # row['Latitude'] and row['Date'] are the relevant inputs.
    # phi is taken in degrees; all angles in the formulas are assumed
    # to be degrees as well (numpy defaults to radians, however...).
    gsc = 1367
    phi = np.deg2rad(row['Latitude'])
    date = row['Date']
    year = pd.DatetimeIndex([date]).year[0]
    month = pd.DatetimeIndex([date]).month[0]
    day = pd.DatetimeIndex([date]).day[0]
    if year % 4 == 0:
        B = (day - 1) * (360 / 366)
    else:
        B = (day - 1) * (360 / 365)
    B = np.deg2rad(B)
    delta = (0.006918 - 0.399912 * np.cos(B) + 0.070257 * np.sin(B)
             - 0.006758 * np.cos(2 * B) + 0.000907 * np.sin(2 * B)
             - 0.002697 * np.cos(3 * B) + 0.00148 * np.sin(3 * B))
    ws = np.arccos(-np.tan(phi) * np.tan(delta))
    daylength = (2 / 15) * np.rad2deg(ws)
    if year % 4 == 0:
        dayangle = np.deg2rad(360 * day / 366)
    else:
        dayangle = np.deg2rad(360 * day / 365)
    h0 = (24 * 3600 * gsc / np.pi) * (1 + 0.033 * np.cos(dayangle)) * (
         np.cos(phi) * np.cos(delta) * np.sin(ws) + ws * np.sin(phi) * np.sin(delta))
    return h0, daylength

When I use

ak['h0'], ak['N'] = zip(*ak.apply(h0, axis=1))

I get the error:

Shape of passed values is (1816, 2), indices imply (1816, 6)

I'm unable to find what's wrong with my code. Can you help?
So, as mentioned in my previous comment: if you'd like to create multiple NEW columns in the DataFrame based on multiple EXISTING columns of the DataFrame, you can create new fields in the row Series WITHIN your h0 function. Here's an overly simple example to showcase what I mean:

>>> def simple_func(row):
...     row['new_column1'] = row.lat * 1000
...     row['year'] = row.date.year
...     row['month'] = row.date.month
...     row['day'] = row.date.day
...     return row
...
>>> df
        date   lat
0 2018-01-29  1000
1 2018-01-30  5000
>>> df.date
0   2018-01-29
1   2018-01-30
Name: date, dtype: datetime64[ns]
>>> df.apply(simple_func, axis=1)
        date   lat  new_column1  year  month  day
0 2018-01-29  1000      1000000  2018      1   29
1 2018-01-30  5000      5000000  2018      1   30

In your case, inside your h0 function, set row['h0'] = h0 and row['N'] = daylength, then return row. When it comes to calling the function on the DF, your line changes to:

ak = ak.apply(h0, axis=1)
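As a side note, modern pandas (0.23+) also supports result_type='expand' on apply, which avoids both the zip(*...) pattern and mutating the row; a small sketch on the toy frame from the question:

import pandas as pd

df = pd.DataFrame({"A": [10, 20, 30], "B": [20, 30, 10]})

def fab(row):
    return row['A'] * row['B'], row['A'] + row['B']

# the returned tuple is expanded into two new columns
df[['newcolumn', 'newcolumn2']] = df.apply(fab, axis=1, result_type='expand')
print(df)
#     A   B  newcolumn  newcolumn2
# 0  10  20        200          30
# 1  20  30        600          50
# 2  30  10        300          40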