Apply a function with multiple input parameters in a pandas groupby - python

I would like to replace the NaN and NaT values in the Value1 column with values calculated by a function that takes as input the Value2 and Value3 (if they exist) of the same row, and to do this for each ID. To achieve that I would use 'groupby' and then 'apply', but I get an error: 'Series' objects are mutable, thus they cannot be hashed. Could you help me? Thanks in advance!
import pandas as pd

ID1 = [2002070, 2002070, 2002740,2002740,2003010]
ID2 = [2002070, 200800, 200800,2002740,2002740]
ID3 = [2002740, 2002740, 2002070, 2002070,2003010]
Value1 = [4.5, 4.2, 3.7, 4.8, 4.4]
Value2 = [7.2, 6.4, 10, 2.3, 1.5]
Value3 = [8.4, 8.4, 8.4, 7.4, 7.4]
date1 = ['2008-05-14', '2005-12-07','2008-10-27', '2009-04-20', '2012-03-01']
date2 = ['2005-12-07','2003-10-10', '2004-05-14', '2011-06-03', '2015-07-05']
date3 = ['2010-10-22', '2012-03-01', '2013-11-28', '2005-12-07', '2012-03-01']
date1=pd.to_datetime(date1)
date2=pd.to_datetime(date2)
date3=pd.to_datetime(date3)
df1=pd.DataFrame({'ID': ID1, 'Value1': Value1, 'Date1':date1}).sort_values('Date1')
df2=pd.DataFrame({'ID': ID2, 'Value2': Value2, 'Date2':date2}).sort_values('Date2')
df3=pd.DataFrame({'ID': ID3, 'Value3': Value3, 'Date3':date3}).sort_values('Date3')
ok = df1.merge(df2, left_on=['ID','Date1'],right_on=['ID','Date2'], how='outer', sort=True)
ok1 = ok.merge(df3, left_on='ID',right_on='ID', how='inner', sort=True )
The df I obtain is this:
ID Value1 Date1 Value2 Date2 Value3 Date3
0 2002070 4.2 2005-12-07 7.2 2005-12-07 7.4 2005-12-07
1 2002070 4.2 2005-12-07 7.2 2005-12-07 8.4 2013-11-28
2 2002070 4.5 2008-05-14 NaN NaT 7.4 2005-12-07
3 2002070 4.5 2008-05-14 NaN NaT 8.4 2013-11-28
4 2002740 3.7 2008-10-27 NaN NaT 8.4 2010-10-22
5 2002740 3.7 2008-10-27 NaN NaT 8.4 2012-03-01
6 2002740 4.8 2009-04-20 NaN NaT 8.4 2010-10-22
7 2002740 4.8 2009-04-20 NaN NaT 8.4 2012-03-01
8 2002740 NaN NaT 2.3 2011-06-03 8.4 2010-10-22
9 2002740 NaN NaT 2.3 2011-06-03 8.4 2012-03-01
10 2002740 NaN NaT 1.5 2015-07-05 8.4 2010-10-22
11 2002740 NaN NaT 1.5 2015-07-05 8.4 2012-03-01
12 2003010 4.4 2012-03-01 NaN NaT 7.4 2012-03-01
This is the function I made:
def func(Value2, Value3):
    return Value2 / ((Value3 / 100) ** 2)
result = ok1.groupby("ID").Value1.apply(func(ok1.Value2, ok1.Value3))
Do you know how to apply this function only to the rows where Value1 is NaN? And how to set the NaT values of Date1 equal to Date2?

The expression func(ok1.Value2, ok1.Value3) is evaluated before apply ever runs, so apply receives a Series rather than a callable, and pandas is not sure what you want to do with it - what would it mean to apply this Series to the groups?
Is it that you want the values of this Series to be assigned wherever there is a missing Value1 in the original DataFrame?
In that case
imputes = ok1.Value2.div(ok1.Value3.div(100).pow(2)) # same as your function
# overwrite missing values with the corresponding imputed values
ok1.Value1.fillna(imputes, inplace=True)
# overwrite missing dates with dates from another column
ok1.Date1.fillna(ok1.Date2, inplace=True)
However, it's not clear to me that this is quite what you wanted, given the presence of the groupby.
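If the row-wise imputation is what you are after but you would rather be explicit about which rows get overwritten (and avoid the chained inplace calls, which newer pandas versions warn about), here is a minimal sketch with boolean masks, using the same arithmetic as your func:
# Sketch only: impute Value1 where it is NaN, using Value2 and Value3 from the same row,
# and copy Date2 into Date1 where Date1 is NaT.
value_mask = ok1['Value1'].isna()
ok1.loc[value_mask, 'Value1'] = ok1.loc[value_mask, 'Value2'] / ((ok1.loc[value_mask, 'Value3'] / 100) ** 2)
date_mask = ok1['Date1'].isna()
ok1.loc[date_mask, 'Date1'] = ok1.loc[date_mask, 'Date2']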

Related

Generate weeks from column with dates

I have a large dataset that contains a date column covering dates from 2019 onwards. Now I want to generate, in a separate column, week numbers for those dates.
Here is what the date column looks like:
import pandas as pd
data = {'date': ['2019-09-10', 'NaN', '2019-10-07', '2019-11-04', '2019-11-28',
'2019-12-02', '2020-01-24', '2020-01-29', '2020-02-05',
'2020-02-12', '2020-02-14', '2020-02-24', '2020-03-11',
'2020-03-16', '2020-03-17', '2020-03-18', '2021-09-14',
'2021-09-30', '2021-10-07', '2021-10-08', '2021-10-12',
'2021-10-14', '2021-10-15', '2021-10-19', '2021-10-21',
'2021-10-26', '2021-10-28', '2021-10-29', '2021-11-02',
'2021-11-15', '2021-11-16', '2021-12-01', '2021-12-07',
'2021-12-09', '2021-12-10', '2021-12-14', '2021-12-15',
'2022-01-13', '2022-01-14', '2022-01-21', '2022-01-24',
'2022-01-25', '2022-01-27', '2022-01-31', '2022-02-01',
'2022-02-10', '2022-02-11', '2022-02-16', '2022-02-24']}
df = pd.DataFrame(data)
Starting from the first day the data was collected, I want to count 7 days using the date column and make a week out of them: for example, if the first week contains the first 7 days, I create a column and call it week one, and I continue the same process up to the last week of data collection.
It might be a good idea to sort the dates in order, from the first date to the current one.
I have tried this, but it does not generate the weeks in order and actually produces repeated week numbers:
pd.to_datetime(df['date'], errors='coerce').dt.week
My intention is to count 7 days from the first date the data was collected and store that as week one, then continue incrementally until the last week, say week number 66.
Here is the expected column of weeks created from the date column
import pandas as pd
week_df = {'weeks': ['1', '2', "3", "5", '6']}
df_weeks = pd.DataFrame(week_df)
IIUC use:
df['date'] = pd.to_datetime(df['date'])
df['week'] = df['date'].sub(df['date'].iat[0]).dt.days // 7 + 1
print (df.head(10))
date week
0 2019-09-10 1.0
1 NaT NaN
2 2019-10-07 4.0
3 2019-11-04 8.0
4 2019-11-28 12.0
5 2019-12-02 12.0
6 2020-01-24 20.0
7 2020-01-29 21.0
8 2020-02-05 22.0
9 2020-02-12 23.0
You have more than 66 weeks here, so either you want the real week count since the beginning or you want a dummy week rank. See below for both solutions:
# convert to week period
s = pd.to_datetime(df['date']).dt.to_period('W')
# get real week number
df['week'] = s.sub(s.iloc[0]).dropna().apply(lambda x: x.n).add(1)
# get dummy week rank
df['week2'] = s.rank(method='dense')
output:
date week week2
0 2019-09-10 1.0 1.0
1 NaN NaN NaN
2 2019-10-07 5.0 2.0
3 2019-11-04 9.0 3.0
4 2019-11-28 12.0 4.0
5 2019-12-02 13.0 5.0
6 2020-01-24 20.0 6.0
7 2020-01-29 21.0 7.0
8 2020-02-05 22.0 8.0
9 2020-02-12 23.0 9.0
10 2020-02-14 23.0 9.0
11 2020-02-24 25.0 10.0
12 2020-03-11 27.0 11.0
13 2020-03-16 28.0 12.0
14 2020-03-17 28.0 12.0
15 2020-03-18 28.0 12.0
16 2021-09-14 106.0 13.0
17 2021-09-30 108.0 14.0
18 2021-10-07 109.0 15.0
19 2021-10-08 109.0 15.0
...
42 2022-01-27 125.0 26.0
43 2022-01-31 126.0 27.0
44 2022-02-01 126.0 27.0
45 2022-02-10 127.0 28.0
46 2022-02-11 127.0 28.0
47 2022-02-16 128.0 29.0
48 2022-02-24 129.0 30.0

Loop through Series to find which have the same index value

I'd like to concatenate/merge my pandas Series together. This is my data structure (for extra information):
dictionary = { 'a':{'1','2','3','4'}, 'b':{'1','2','3','4'} }
There are many more values at both levels, and each number corresponds to a Series that contains timeseries data. I would like to merge all of 'a' together into one dataframe; the only trouble is that some of the data is yearly, some quarterly and some monthly.
so what I'm looking to do is loop through my data, something like this:
for level1 in dictData:
    for level2 in dictData[level1]:
        dictData[level1][level2].index.equals(dictData[level1][level2])
but obviously here I'm just comparing the series to itself! How would I compare each element to all the others? I know I'm missing something fairly fundamental. Thank you.
EDIT:
Here's some samples of actual data:
{'noT10101': {'A191RL': Gross domestic product
1947-01-01 -1.1
1947-04-01 -1.0
1947-07-01 -0.8
1947-10-01 6.4
1948-01-01 4.1
... ...
2020-01-01 -5.0
2020-04-01 -31.4
2020-07-01 33.4
2020-10-01 4.3
2021-01-01 6.4
[370 rows x 1 columns], 'DGDSRL': Goods
1947-01-01 2.9
1947-04-01 7.4
1947-07-01 2.7
1947-10-01 1.5
1948-01-01 2.0
... ...
2020-01-01 0.1
2020-04-01 -10.8
2020-07-01 47.2
2020-10-01 -1.4
2021-01-01 26.6
[370 rows x 1 columns], 'A191RP': Gross domestic product, current dollars
1947-01-01 9.7
1947-04-01 4.7
1947-07-01 6.0
1947-10-01 17.3
1948-01-01 10.0
... ...
2020-01-01 -3.4
2020-04-01 -32.8
2020-07-01 38.3
2020-10-01 6.3
2021-01-01 11.0
[370 rows x 1 columns], 'DSERRL': Services
1947-01-01 0.4
1947-04-01 5.9
1947-07-01 -0.8
1947-10-01 -2.1
1948-01-01 2.7
... ...
2020-01-01 -9.8
2020-04-01 -41.8
2020-07-01 38.0
2020-10-01 4.3
2021-01-01 4.2
[370 rows x 1 columns],
As you can see, dictionary key 'noT10101' corresponds to a set of keys 'A191RL', 'DGDSRL', 'A191RP', etc., whose associated values are Series. So when I access .index I am looking at the index of that Series, i.e. the datetime values. In this example they all match, but in some cases they don't.
You can use the pandas concat function. It would be something like this:
import pandas as pd
import numpy as np
df1 = pd.Series(np.random.random_sample(size=5),
                index=pd.Timestamp("2021-01-01") + np.arange(5) * pd.Timedelta(days=365),
                dtype=float)
df2 = pd.Series(np.random.random_sample(size=12),
                index=pd.Timestamp("2021-01-15") + np.arange(12) * pd.Timedelta(days=30),
                dtype=float)
dictData = {"a": {"series": df1, "same_series": df1}, "b": {"series": df1, "different_series": df2}}
new_dict = {}
for level1 in dictData:
    new_dict[level1] = pd.concat(list(dictData[level1].values()))
Notice that I tried to mimic both yearly and monthly granularity; the point is that the granularity of the series being concatenated doesn't matter.
The result will be something like this:
{'a': 2021-01-01 0.213574
2022-01-01 0.263514
2023-01-01 0.627435
2024-01-01 0.388753
2024-12-31 0.990316
2021-01-01 0.213574
2022-01-01 0.263514
2023-01-01 0.627435
2024-01-01 0.388753
2024-12-31 0.990316
dtype: float64,
'b': 2021-01-01 0.213574
2022-01-01 0.263514
2023-01-01 0.627435
2024-01-01 0.388753
2024-12-31 0.990316
2021-05-01 0.614485
2021-05-31 0.611967
2021-06-30 0.820435
2021-07-30 0.839613
2021-08-29 0.507669
2021-09-28 0.471049
2021-10-28 0.550482
2021-11-27 0.723789
2021-12-27 0.209169
2022-01-26 0.664584
2022-02-25 0.901832
2022-03-27 0.946750
dtype: float64}
Take a look at the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/merging.html
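As for the original loop that tried to compare each Series' index with the others, here is a small illustrative sketch using itertools.combinations (it assumes the nested dict of Series described in the question; the printed labels are just for inspection):
from itertools import combinations

# Compare every pair of series under each top-level key and report
# whether their datetime indexes match exactly.
for level1 in dictData:
    for (name_a, s_a), (name_b, s_b) in combinations(dictData[level1].items(), 2):
        same = s_a.index.equals(s_b.index)
        print(f"{level1}: {name_a} vs {name_b} -> {'same' if same else 'different'} index")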

Reindexing timeseries data

I have an issue similar to "ValueError: cannot reindex from a duplicate axis", but that question doesn't provide a solution for my case.
I have an Excel file containing multiple rows and columns of weather data. The data has missing values at certain intervals, although this is not shown in the sample below. I want to reindex the time column at 5 minute intervals so that I can interpolate the missing values. Data sample:
Date Time Temp Hum Dewpnt WindSpd
04/01/18 12:05 a 30.6 49 18.7 2.7
04/01/18 12:10 a NaN 51 19.3 1.3
04/01/18 12:20 a 30.7 NaN 19.1 2.2
04/01/18 12:30 a 30.7 51 19.4 2.2
04/01/18 12:40 a 30.9 51 19.6 0.9
Here's what I have tried.
import pandas as pd
ts = pd.read_excel('E:\DATA\AP.xlsx')
ts['Time'] = pd.to_datetime(ts['Time'])
ts.set_index('Time', inplace=True)
dt = pd.date_range("2018-04-01 00:00:00", "2018-05-01 00:00:00", freq='5min', name='T')
idx = pd.DatetimeIndex(dt)
ts.reindex(idx)
I just want to have my index at a 5 min frequency so that I can interpolate the NaNs later. Expected output:
Date Time Temp Hum Dewpnt WindSpd
04/01/18 12:05 a 30.6 49 18.7 2.7
04/01/18 12:10 a NaN 51 19.3 1.3
04/01/18 12:15 a NaN NaN NaN NaN
04/01/18 12:20 a 30.7 NaN 19.1 2.2
04/01/18 12:25 a NaN NaN NaN NaN
04/01/18 12:30 a 30.7 51 19.4 2.2
One more approach.
df['Time'] = pd.to_datetime(df['Time'])
df = df.set_index(['Time']).resample('5min').last().reset_index()
df['Time'] = df['Time'].dt.time
df
output
Time Date Temp Hum Dewpnt WindSpd
0 00:05:00 4/1/2018 30.6 49.0 18.7 2.7
1 00:10:00 4/1/2018 NaN 51.0 19.3 1.3
2 00:15:00 NaN NaN NaN NaN NaN
3 00:20:00 4/1/2018 30.7 NaN 19.1 2.2
4 00:25:00 NaN NaN NaN NaN NaN
5 00:30:00 4/1/2018 30.7 51.0 19.4 2.2
6 00:35:00 NaN NaN NaN NaN NaN
7 00:40:00 4/1/2018 30.9 51.0 19.6 0.9
If times from multiple dates have to be re-sampled, you can use the code below.
However, you will have to separate the 'Date' and 'Time' columns again later.
df1['DateTime'] = df1['Date']+df1['Time']
df1['DateTime'] = pd.to_datetime(df1['DateTime'],format='%d/%m/%Y%I:%M %p')
df1 = df1.set_index(['DateTime']).resample('5min').last().reset_index()
df1
Output
DateTime Date Time Temp Hum Dewpnt WindSpd
0 2018-01-04 00:05:00 4/1/2018 12:05 AM 30.6 49.0 18.7 2.7
1 2018-01-04 00:10:00 4/1/2018 12:10 AM NaN 51.0 19.3 1.3
2 2018-01-04 00:15:00 NaN NaN NaN NaN NaN NaN
3 2018-01-04 00:20:00 4/1/2018 12:20 AM 30.7 NaN 19.1 2.2
4 2018-01-04 00:25:00 NaN NaN NaN NaN NaN NaN
5 2018-01-04 00:30:00 4/1/2018 12:30 AM 30.7 51.0 19.4 2.2
6 2018-01-04 00:35:00 NaN NaN NaN NaN NaN NaN
7 2018-01-04 00:40:00 4/1/2018 12:40 AM 30.9 51.0 19.6 0.9
You can try this for example:
import pandas as pd
ts = pd.read_excel('E:\DATA\AP.xlsx')
ts['Time'] = pd.to_datetime(ts['Time'])
ts.set_index('Time', inplace=True)
ts.resample('5T').mean()
More information here: https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.resample.html
Set the Time column as the index, making sure it is DateTime type, then try
ts.asfreq('5T')
use
ts.asfreq('5T', method='ffill')
to pull previous values forward.
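Combining that with the interpolation the question is ultimately after, a minimal sketch (it assumes ts already has a proper DatetimeIndex and uses the column names from the sample):
# Sketch only: insert the missing 5-minute slots, then fill the numeric
# columns by time-weighted interpolation.
ts5 = ts.asfreq('5min')
num_cols = ['Temp', 'Hum', 'Dewpnt', 'WindSpd']
ts5[num_cols] = ts5[num_cols].interpolate(method='time')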
I would take the approach of creating a blank table and filling it in with the data as it comes from your data source. For this example, three observations are read in as NaN, and the rows for 01:15 and 01:25 are missing entirely.
import pandas as pd
import numpy as np
rawpd = pd.read_excel('raw.xlsx')
print(rawpd)
Date Time Col1 Col2
0 2018-04-01 01:00:00 1.0 10.0
1 2018-04-01 01:05:00 2.0 NaN
2 2018-04-01 01:10:00 NaN 10.0
3 2018-04-01 01:20:00 NaN 10.0
4 2018-04-01 01:30:00 5.0 10.0
Now create a dataframe targpd with the ideal structure.
time5min = pd.date_range(start='2018/04/1 01:00',periods=7,freq='5min')
targpd = pd.DataFrame(np.nan,index = time5min,columns=['Col1','Col2'])
print(targpd)
Col1 Col2
2018-04-01 01:00:00 NaN NaN
2018-04-01 01:05:00 NaN NaN
2018-04-01 01:10:00 NaN NaN
2018-04-01 01:15:00 NaN NaN
2018-04-01 01:20:00 NaN NaN
2018-04-01 01:25:00 NaN NaN
2018-04-01 01:30:00 NaN NaN
Now the trick is to update targpd with the data sent to you in rawpd. For this to happen the Date and Time columns have to be combined in rawpd and made into an index.
print(rawpd.Date,rawpd.Time)
0 2018-04-01
1 2018-04-01
2 2018-04-01
3 2018-04-01
4 2018-04-01
Name: Date, dtype: datetime64[ns]
0 01:00:00
1 01:05:00
2 01:10:00
3 01:20:00
4 01:30:00
Name: Time, dtype: object
You can see the trick in all this above: your date data was converted to datetime, but your time data is just a string. Below, a proper index is created with the help of a lambda function.
# pd.datetime was removed in newer pandas; pd.Timestamp.combine does the same job
rawidx = rawpd.apply(lambda r: pd.Timestamp.combine(r['Date'], r['Time']), axis=1)
print(rawidx)
This can be applied to the rawpd database as an index.
rawpd2=pd.DataFrame(rawpd[['Col1','Col2']].values,index=rawidx,columns=['Col1','Col2'])
rawpd2=rawpd2.sort_index()
print(rawpd2)
Once this is in place the update command can get you what you want.
targpd.update(rawpd2,overwrite=True)
print(targpd)
Col1 Col2
2018-04-01 01:00:00 1.0 10.0
2018-04-01 01:05:00 2.0 NaN
2018-04-01 01:10:00 NaN 10.0
2018-04-01 01:15:00 NaN NaN
2018-04-01 01:20:00 NaN 10.0
2018-04-01 01:25:00 NaN NaN
2018-04-01 01:30:00 5.0 10.0
You now have a DataFrame ready for interpolation.
I have got it to work. Thank you, everyone, for your time. Here is the working code:
import pandas as pd
df = pd.read_excel('E:\DATA\AP.xlsx', sheet_name='Sheet1', parse_dates=[['Date', 'Time']])
df = df.set_index(['Date_Time']).resample('5min').last().reset_index()
print(df)

How to flatten a pandas DataFrameGroupBy

I have a grouped object which is of type DataFrameGroupBy. I want to use this to aggregate some data like so:
aggregated = grouped.aggregate([np.sum, np.mean], axis=1)
This returns a DataFrame with the format:
aggregated[:3].to_dict()
"""
{('VALUE1', 'sum'): {
('US10adam034', 'PRCP'): 701,
('US10adam036', 'PRCP'): 1015,
('US10adam036', 'SNOW'): 46},
('VALUE1', 'mean'): {
('US10adam034', 'PRCP'): 100.14285714285714,
('US10adam036', 'PRCP'): 145.0,
('US10adam036', 'SNOW'): 46.0}}
"""
Printing out the head produces this:
VALUE1
sum mean
ID ELEMENT
US10adam034 PRCP 701 100.142857
US10adam036 PRCP 1015 145.000000
SNOW 46 46.000000
US10adam046 PRCP 790 131.666667
US10adam051 PRCP 5 0.555556
US10adam056 PRCP 540 31.764706
SNOW 25 1.923077
SNWD 165 15.000000
This works great. It easily computes sums and means for my sample where the grouped indices are (ID, ELEMENT). However, I'd really like to get this into a single row format where ID is unique and the columns are a combination of ELEMENT & (sum|mean). I can almost get there using apply like so:
def getNewSeries(t):
    # type(t) => Series
    element = t.name[1]  # t.name is a tuple ('ID', 'ELEMENT')
    sum_index = f'{element}sum'
    mean_index = f'{element}mean'
    return pd.Series(t['VALUE1'].values, index=[sum_index, mean_index])
aggregated.apply(getNewSeries, axis=1, result_type='expand')
Printing out the head again I get:
PRCPmean PRCPsum SNOWmean SNOWsum SNWDmean ...
ID ELEMENT
US10adam034 PRCP 100.142857 701.0 NaN NaN NaN
US10adam036 PRCP 145.000000 1015.0 NaN NaN NaN
SNOW NaN NaN 46.000000 46.0 NaN
US10adam046 PRCP 131.666667 790.0 NaN NaN NaN
US10adam051 PRCP 0.555556 5.0 NaN NaN NaN
US10adam056 PRCP 31.764706 540.0 NaN NaN NaN
SNOW NaN NaN 1.923077 25.0 NaN
SNWD NaN NaN NaN NaN 15.0
I would like my final DataFrame to look like this:
PRCPmean PRCPsum SNOWmean SNOWsum SNWDmean ...
ID
US10adam034 100.142857 701.0 NaN NaN NaN
US10adam036 145.000000 1015.0 46.000000 46.0 NaN
US10adam046 131.666667 790.0 NaN NaN NaN
US10adam051 0.555556 5.0 NaN NaN NaN
US10adam056 31.764706 540.0 1.923077 25.0 15.0
Is there a way, using apply, agg or transform, to aggregate this data into single rows? I've also tried creating my own iterator over unique IDs, but it was painfully slow. I like the ease of using agg to compute sum/mean.
I like using f-strings with list comprehensions. Python 3.6+ is required for f-string formatting.
df_out = aggregated.unstack()['VALUE1']
df_out.columns = [f'{j}{i}' for i, j in df_out.columns]
df_out
Output:
PRCPsum SNOWsum PRCPmean SNOWmean
US10adam034 701.0 NaN 100.142857 NaN
US10adam036 1015.0 46.0 145.000000 46.0
You can do:
new_df = agg_df.unstack(level=1)
new_df.columns = [c+b for _,b,c in new_df.columns.values]
Output:
PRCPsum SNOWsum PRCPmean SNOWmean
US10adam034 701.0 NaN 100.142857 NaN
US10adam036 1015.0 46.0 145.000000 46.0
IIUC
aggregated = grouped['VALUE1'].aggregate([np.sum, np.mean], axis=1)
aggregated=aggregated.unstack()
aggregated.columns=aggregated.columns.map('{0[1]}|{0[0]}'.format)
Please check whether reset_index gives you what you need:
aggregated.apply(getNewSeries, axis=1, result_type='expand').reset_index()
I think you can try unstack() to move the innermost row index to become the innermost column index, reshaping your data.
You can also use fill_value to change the NaNs to 0.
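A minimal sketch of that idea, reusing the aggregated frame from the question (fill_value is optional; the desired output in the question actually keeps the NaNs):
# Sketch only: move ELEMENT from the row index into the columns, then
# flatten the resulting column MultiIndex into names like 'PRCPsum'.
flat = aggregated.unstack(level='ELEMENT', fill_value=0)['VALUE1']
flat.columns = [f'{element}{stat}' for stat, element in flat.columns]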

Group by and find sum for groups but return NaN as NaN, not 0

I have a dataframe where each unique group has 4 rows.
So I need to group by the columns that make them unique and do some aggregations such as max, min, sum and average.
But the problem is that for some groups a column contains only NaN values, and the sum comes back as 0. Is it possible to get NaN back instead?
For example:
df
time id el conn column1 column2 column3
2018-02-11 14:00:00 1 a 12 8 5 NaN
2018-02-11 14:00:00 1 a 12 1 NaN NaN
2018-02-11 14:00:00 1 a 12 3 7 NaN
2018-02-11 14:00:00 1 a 12 4 12 NaN
2018-02-11 14:00:00 2 a 5 NaN 5 5
2018-02-11 14:00:00 2 a 5 NaN 3 2
2018-02-11 14:00:00 2 a 5 NaN NaN 6
2018-02-11 14:00:00 2 a 5 NaN 7 NaN
So, for example, I need to group by ('id', 'el', 'conn') and find the sum of column1, column2 and column3. (In the real case I have a lot more columns that need aggregating.)
I have tried a few ways: .sum(), .transform('sum'), but they return zero for the groups with all NaN values.
Desired output:
time id el conn column1 column2 column3
2018-02-11 14:00:00 1 a 12 16 24 NaN
2018-02-11 14:00:00 2 a 5 NaN 15 13
Any help is welcomed.
Change the parameter min_count to 1 - this works as of pandas version 0.22.0:
min_count : int, default 0
The required number of valid values to perform the operation. If fewer than min_count non-NA values are present the result will be NA.
New in version 0.22.0: Added with the default being 1. This means the sum or product of an all-NA or empty series is NaN.
df = df.groupby(['time','id', 'el', 'conn'], as_index=False).sum(min_count=1)
print (df)
time id el conn column1 column2 column3
0 2018-02-11 14:00:00 1 a 12 16.0 24.0 NaN
1 2018-02-11 14:00:00 2 a 5 NaN 15.0 13.0
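If you also need max, min and mean over many columns (as mentioned in the question), here is a hedged sketch using named aggregation (pandas 0.25+); the output column names and the lambda are illustrative, and only the sums need min_count=1:
# Sketch only: keep all-NaN sums as NaN while other statistics use their defaults.
agg_df = df.groupby(['time', 'id', 'el', 'conn'], as_index=False).agg(
    column1_sum=('column1', lambda s: s.sum(min_count=1)),
    column1_mean=('column1', 'mean'),
    column1_max=('column1', 'max'),
    column2_sum=('column2', lambda s: s.sum(min_count=1)),
)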
I think it should be something like this.
df.groupby(['time','id','el','conn']).sum()
Output in Python 2:
Here are some groupby tutorials I find useful for cases like this:
https://chrisalbon.com/python/data_wrangling/pandas_apply_operations_to_groups/
https://www.tutorialspoint.com/python_pandas/python_pandas_groupby.htm
