Python: doing multiple column aggregation in pandas

I have a dataframe where I want to do multiple column aggregations in pandas.
import pandas as pd
import numpy as np
df = pd.DataFrame({'ser_no': [1, 1, 1, 2, 2, 2, 2, 3, 3, 3],
                   'CTRY_NM': ['a', 'a', 'b', 'e', 'e', 'a', 'b', 'b', 'b', 'd'],
                   'lat': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
                   'long': [21, 22, 23, 24, 25, 26, 27, 28, 29, 30]})
df2 = df.groupby(['ser_no', 'CTRY_NM']).lat.agg({'avg_lat': np.mean})
With this code, I get the mean for lat. I would also like to find the mean for long.
I tried df2 = df.groupby(['ser_no', 'CTRY_NM']).lat.agg({'avg_lat': np.mean}).long.agg({'avg_long': np.mean}) but this produces
AttributeError: 'DataFrame' object has no attribute 'long'
If I just compute avg_long on its own, the code works as well.
df2 = df.groupby(['ser_no', 'CTRY_NM']).long.agg({'avg_long': np.mean})
In [2]: df2
Out[2]:
                avg_long
ser_no CTRY_NM
1      a            21.5
       b            23.0
2      a            26.0
       b            27.0
       e            24.5
3      b            28.5
       d            30.0
Is there a way to do this in one step or is this something I have to do separately and join back later?

I think it is simpler to use GroupBy.mean:
print(df.groupby(['ser_no', 'CTRY_NM']).mean())
                 lat  long
ser_no CTRY_NM
1      a         1.5  21.5
       b         3.0  23.0
2      a         6.0  26.0
       b         7.0  27.0
       e         4.5  24.5
3      b         8.5  28.5
       d        10.0  30.0
If you need to define the columns for aggregating:
print(df.groupby(['ser_no', 'CTRY_NM']).agg({'lat': 'mean', 'long': 'mean'}))
                 lat  long
ser_no CTRY_NM
1      a         1.5  21.5
       b         3.0  23.0
2      a         6.0  26.0
       b         7.0  27.0
       e         4.5  24.5
3      b         8.5  28.5
       d        10.0  30.0
More info in docs.
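As a side note (my addition, not part of this answer): on pandas 0.25 and later, named aggregation gives the renamed columns in one step, without the dict-renaming syntax from the question, which has since been deprecated:
df2 = df.groupby(['ser_no', 'CTRY_NM']).agg(avg_lat=('lat', 'mean'),
                                            avg_long=('long', 'mean'))
print(df2)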
EDIT:
If you need to rename the column names, i.e. remove the MultiIndex in the columns, you can use a list comprehension:
import pandas as pd

df = pd.DataFrame({'ser_no': [1, 1, 1, 2, 2, 2, 2, 3, 3, 3],
                   'CTRY_NM': ['a', 'a', 'b', 'e', 'e', 'a', 'b', 'b', 'b', 'd'],
                   'lat': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
                   'long': [21, 22, 23, 24, 25, 26, 27, 28, 29, 30],
                   'date': pd.date_range(pd.to_datetime('2016-02-24'),
                                         pd.to_datetime('2016-02-28'), freq='10H')})
print(df)
  CTRY_NM                date  lat  long  ser_no
0       a 2016-02-24 00:00:00    1    21       1
1       a 2016-02-24 10:00:00    2    22       1
2       b 2016-02-24 20:00:00    3    23       1
3       e 2016-02-25 06:00:00    4    24       2
4       e 2016-02-25 16:00:00    5    25       2
5       a 2016-02-26 02:00:00    6    26       2
6       b 2016-02-26 12:00:00    7    27       2
7       b 2016-02-26 22:00:00    8    28       3
8       b 2016-02-27 08:00:00    9    29       3
9       d 2016-02-27 18:00:00   10    30       3
df2 = df.groupby(['ser_no', 'CTRY_NM']).agg({'lat': 'mean', 'long': 'mean', 'date': [min, max, 'count']})
df2.columns = ['_'.join(col) for col in df2.columns]
print(df2)
                lat_mean            date_min            date_max  date_count  \
ser_no CTRY_NM
1      a             1.5 2016-02-24 00:00:00 2016-02-24 10:00:00           2
       b             3.0 2016-02-24 20:00:00 2016-02-24 20:00:00           1
2      a             6.0 2016-02-26 02:00:00 2016-02-26 02:00:00           1
       b             7.0 2016-02-26 12:00:00 2016-02-26 12:00:00           1
       e             4.5 2016-02-25 06:00:00 2016-02-25 16:00:00           2
3      b             8.5 2016-02-26 22:00:00 2016-02-27 08:00:00           2
       d            10.0 2016-02-27 18:00:00 2016-02-27 18:00:00           1

                long_mean
ser_no CTRY_NM
1      a             21.5
       b             23.0
2      a             26.0
       b             27.0
       e             24.5
3      b             28.5
       d             30.0
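As an aside (my addition, not part of the answer): the same flattening can be written with Index.map, which behaves identically here because every level label is already a string:
df2.columns = df2.columns.map('_'.join)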

You are getting the error because you are first selecting the lat column of the dataframe and doing operations on that column. Getting the long column through that series is not possible; you need the dataframe.
df2 = df.groupby(['ser_no', 'CTRY_NM'])[["lat", "long"]].agg(np.mean)
would do the same operation for both columns. If you want the column names changed, you can rename the columns afterwards:
df2 = df.groupby(['ser_no', 'CTRY_NM'])[["lat", "long"]].agg(np.mean).rename(columns={"lat": "avg_lat", "long": "avg_long"})
In [22]:
df2 = df.groupby(['ser_no', 'CTRY_NM'])[["lat", "long"]].agg(np.mean).rename(columns={"lat": "avg_lat", "long": "avg_long"})
df2
Out[22]:
                avg_lat  avg_long
ser_no CTRY_NM
1      a            1.5      21.5
       b            3.0      23.0
2      a            6.0      26.0
       b            7.0      27.0
       e            4.5      24.5
3      b            8.5      28.5
       d           10.0      30.0

Related

Stick the dataframe rows and columns in one row + replace the NaN values with the day before or after

I have a df and I want to stick its values together into one row. First I want to select the specific time, and replace the NaN values with the value from the day before. Here is a simple example: I only want to choose the values in 2020, stick them together based on time, and also replace each NaN value with the value from the day before.
df = pd.DataFrame()
df['day'] =[ '2020-01-01', '2019-01-01', '2020-01-02','2020-01-03', '2018-01-01', '2020-01-15','2020-03-01', '2020-02-01', '2017-01-01' ]
df['value_1'] = [ 1, np.nan, 32, 48, 5, -1, 5,10,2]
df['value_2'] = [ np.nan, 121, 23, 34, 15, 21, 15, 12, 39]
df
day value_1 value_2
0 2020-01-01 1.0 NaN
1 2019-01-01 NaN 121.0
2 2020-01-02 32.0 23.0
3 2020-01-03 48.0 34.0
4 2018-01-01 5.0 15.0
5 2020-01-15 -1.0 21.0
6 2020-03-01 5.0 15.0
7 2020-02-01 10.0 12.0
8 2017-01-01 2.0 39.0
The output:
_1 _2 _3 _4 _5 _6 _7 _8 _9 _10 _11 _12
0 1 121 1 23 48 34 -1 21 10 12 -1 21
I have tried to use the following code, but it does not solve my problem:
val_cols = df.filter(like='value_').columns
output = (df.pivot('day', val_cols).groupby(level=0, axis=1).apply(lambda x:x.ffill(axis=1).bfill(axis=1)).sort_index(axis=1, level=1))
I don't know exactly what the output is supposed to be, but I think this should do at least part of what you're trying to do:
df['day'] = pd.to_datetime(df['day'], format='%Y-%m-%d')
df = df.sort_values(by=['day'])
filter_2020 = df['day'].dt.year == 2020
val_cols = df.filter(like='value_').columns
df.loc[filter_2020, val_cols] = df.loc[:,val_cols].ffill().loc[filter_2020]
print(df)
day value_1 value_2
8 2017-01-01 2.0 39.0
4 2018-01-01 5.0 15.0
1 2019-01-01 NaN 121.0
0 2020-01-01 1.0 121.0
2 2020-01-02 32.0 23.0
3 2020-01-03 48.0 34.0
5 2020-01-15 -1.0 21.0
7 2020-02-01 10.0 12.0
6 2020-03-01 5.0 15.0
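The single-row expected output in the question is ambiguous, but if the goal is simply to interleave the filled 2020 values row by row into one wide row, a sketch along these lines could work on top of the code above (the _1, _2, ... column naming is my assumption based on the question's sample):
vals = df.loc[filter_2020, val_cols].to_numpy().ravel()  # interleave value_1/value_2 in date order
out = pd.DataFrame([vals], columns=[f'_{i + 1}' for i in range(vals.size)])
print(out)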

Pandas: create new date rows and forward fill column values for at most the next 3 consecutive months

I want to forward fill rows for the next 3 consecutive months, but stop if a new data row is available for the same ID within that 3-month window.
Here is some sample data:
id date value1 value2
1 2016-09-01 5 2
1 2016-11-01 7 15
2 2015-09-01 11 6
2 2015-12-01 13 4
2 2016-05-01 3 5
I would like to get
id date value1 value2
1 2016-09-01 5 2
1 2016-10-01 5 2
1 2016-11-01 7 15
1 2016-12-01 7 15
1 2017-01-01 7 15
1 2017-02-01 7 15
2 2015-09-01 11 6
2 2015-10-01 11 6
2 2015-11-01 11 6
2 2015-12-01 13 4
2 2016-01-01 13 4
2 2016-02-01 13 4
2 2016-03-01 13 4
2 2016-05-01 3 5
...
I tried a bunch of forward-fill methods and cross joins with a calendar, but couldn't figure it out.
Any help will be appreciated!
I think it might be done like this:
import pandas as pd
import datetime as dt
df = pd.DataFrame({
    'id': [1, 1, 2, 2, 2],
    'date': [
        dt.datetime.fromisoformat(s) for s in [
            '2016-09-01',
            '2016-11-01',
            '2015-09-01',
            '2015-12-01',
            '2016-05-01'
        ]
    ],
    'value1': [5, 7, 11, 13, 3],
    'value2': [2, 15, 6, 4, 5]
}).set_index('id')
result = []
for _id, data in df.groupby('id'):
    tmp_df = pd.DataFrame({
        'date': pd.period_range(
            start=min(data.date),
            end=max(data.date + dt.timedelta(days=31 * 3)),
            freq='M'
        ).to_timestamp()
    })
    tmp_df = tmp_df.join(data.set_index('date'), on='date')
    tmp_df['id'] = _id
    result.append(tmp_df.set_index('id'))
result = pd.concat(result).ffill(limit=3).dropna()
print(result)
Result:
date value1 value2
id
1 2016-09-01 5.0 2.0
1 2016-10-01 5.0 2.0
1 2016-11-01 7.0 15.0
1 2016-12-01 7.0 15.0
1 2017-01-01 7.0 15.0
1 2017-02-01 7.0 15.0
2 2015-09-01 11.0 6.0
2 2015-10-01 11.0 6.0
2 2015-11-01 11.0 6.0
2 2015-12-01 13.0 4.0
2 2016-01-01 13.0 4.0
2 2016-02-01 13.0 4.0
2 2016-03-01 13.0 4.0
2 2016-05-01 3.0 5.0
2 2016-06-01 3.0 5.0
2 2016-07-01 3.0 5.0
2 2016-08-01 3.0 5.0
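A tighter variant of the same idea (my sketch, not from the answer above): pd.DateOffset(months=3) avoids the 31 * 3 day approximation, and reindex plus ffill(limit=3) replaces the period_range/join pair:
def expand(g):
    g = g.set_index('date')
    idx = pd.date_range(g.index.min(),
                        g.index.max() + pd.DateOffset(months=3),
                        freq='MS')  # month-start frequency
    return g.reindex(idx).ffill(limit=3).dropna()

result = df.groupby('id', group_keys=True).apply(expand)
print(result)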

Impute missing dates and values using Python

I have a dataframe named df that has the following values:
Date Available Used Total Free
06072019 5 19 24 5
06202019 14 10 24 6
07072019 6 16 24 6
07202019 20 4 24 20
08072019 23 1 24 23
I am missing the date 08202019 and am looking to impute the missing values with the average of the existing data that I have.
This is what I am currently doing:
import numpy as np
import datetime as dt
df.groupby([df.index.date]).transform(lambda x: x.fill(x.mean()))
However, I know that some of the syntax is not correct here and would like some suggestions.
Use a list comprehension to get the means of all the columns excluding the first date column and create a transposed dataframe, which will essentially be one row. Then concat this new 1-row dataframe with the main dataframe.
df = pd.DataFrame({'Date': {0: '06-07-2019',
                            1: '06-20-2019',
                            2: '07-07-2019',
                            3: '07-20-2019',
                            4: '08-07-2019'},
                   'Available': {0: 5, 1: 14, 2: 6, 3: 20, 4: 23},
                   'Used': {0: 19, 1: 10, 2: 16, 3: 4, 4: 1},
                   'Total': {0: 24, 1: 24, 2: 24, 3: 24, 4: 24},
                   'Free': {0: 5, 1: 6, 2: 6, 3: 20, 4: 23}})
s = pd.DataFrame([df[col].mean() for col in df.columns[1:]]).T
s.columns = df.columns[1:]
s['Date'] = '08-20-2019'
df_new = pd.concat([df, s])
df_new
Out[1]:
Date Available Used Total Free
0 06-07-2019 5.0 19.0 24.0 5.0
1 06-20-2019 14.0 10.0 24.0 6.0
2 07-07-2019 6.0 16.0 24.0 6.0
3 07-20-2019 20.0 4.0 24.0 20.0
4 08-07-2019 23.0 1.0 24.0 23.0
0 08-20-2019 13.6 10.0 24.0 12.0
Regarding your comment: you can create a missing_dates list, and the loop makes everything else automatic:
df = pd.DataFrame({'Date': {0: '06-07-2019',
                            1: '06-20-2019',
                            2: '07-07-2019',
                            3: '07-20-2019',
                            4: '08-07-2019'},
                   'Available': {0: 5, 1: 14, 2: 6, 3: 20, 4: 23},
                   'Used': {0: 19, 1: 10, 2: 16, 3: 4, 4: 1},
                   'Total': {0: 24, 1: 24, 2: 24, 3: 24, 4: 24},
                   'Free': {0: 5, 1: 6, 2: 6, 3: 20, 4: 23}})
s = pd.DataFrame([df[col].mean() for col in df.columns[1:]]).T
t = s
missing_dates = ['08-20-2019', '08-30-2019']
for i in range(len(missing_dates) - 1):
    t = t.append(s)
t.columns = df.columns[1:]
t['Date'] = missing_dates
df_new = pd.concat([df, t])
df_new
Out[2]:
Date Available Used Total Free
0 06-07-2019 5.0 19.0 24.0 5.0
1 06-20-2019 14.0 10.0 24.0 6.0
2 07-07-2019 6.0 16.0 24.0 6.0
3 07-20-2019 20.0 4.0 24.0 20.0
4 08-07-2019 23.0 1.0 24.0 23.0
0 08-20-2019 13.6 10.0 24.0 12.0
0 08-30-2019 13.6 10.0 24.0 12.0
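Note that DataFrame.append was removed in pandas 2.0; the same multi-row frame can be built without the loop. A sketch of mine, using the df and missing_dates defined above:
rows = pd.DataFrame([df[df.columns[1:]].mean()] * len(missing_dates))  # one mean-row per missing date
rows['Date'] = missing_dates
df_new = pd.concat([df, rows], ignore_index=True)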
Consider your dataframe:
import pandas as pd
import numpy as np
df = pd.read_clipboard()
df["Date"] = pd.to_datetime(df["Date"], format="%m%d%Y")
df = df.set_index("Date")
print(df)
Available Used Total Free
Date
2019-06-07 5 19 24 5
2019-06-20 14 10 24 6
2019-07-07 6 16 24 6
2019-07-20 20 4 24 20
2019-08-07 23 1 24 23
You can create a list of missing dates, convert it to a pandas datetime index, build a new index as the union with the existing one, reindex, and then fill the NaNs with the column means.
missing_dates = ["2019-08-20", "2019-09-07", "2019-09-20"]
missing_dates = pd.to_datetime(missing_dates)
new_index = df.index.union(missing_dates)
df = df.reindex(new_index).fillna(df.mean(numeric_only=True))
print(df)
Available Used Total Free
2019-06-07 5.0 19.0 24.0 5.0
2019-06-20 14.0 10.0 24.0 6.0
2019-07-07 6.0 16.0 24.0 6.0
2019-07-20 20.0 4.0 24.0 20.0
2019-08-07 23.0 1.0 24.0 23.0
2019-08-20 13.6 10.0 24.0 12.0
2019-09-07 13.6 10.0 24.0 12.0
2019-09-20 13.6 10.0 24.0 12.0
In case you want to check for dates that are non-existent in the data, you can try this:
# suppose the data series goes from '2020-09-30' to '2020-10-04' and data on some dates may be missing
df = pd.DataFrame.from_dict(
    dict(datetime=['09302020', '10012020', '10042020'],
         val1=[1, 3, 5],
         val2=[6, 10, 12]))
df.datetime = pd.to_datetime(df.datetime, format='%m%d%Y')
print(df)
dates_missing = pd.date_range(start='2020-09-30', end='2020-10-04').difference(df.datetime)
val_means = {col: df[col].mean() for col in list(df.columns) if col != 'datetime'}
df = df.append(pd.DataFrame.from_dict(
    dict(datetime=dates_missing, **val_means)))
df = df.sort_values(by=['datetime'])
df

Pandas DataFrame.shift returns incorrect result when shifting by column axis

I am experiencing something really weird; I am not sure if it is a bug (hopefully not). Anyway, when I call the DataFrame.shift method along the column axis, the columns are either shifted incorrectly or the values returned are incorrect (see the output below).
Does anyone know if I am missing something, or if it is simply a bug in the library?
# Example 2
ind = pd.date_range('01/01/2019', periods=5, freq='12H')
df2 = pd.DataFrame({"A": [1, 2, 3, 4, 5],
                    "B": [10, 20, np.nan, 40, 50],
                    "C": [11, 22, 33, np.nan, 55],
                    "D": [-11, -24, -51, -36, -2],
                    'D1': [False] * 5,
                    'E': [True, False, False, True, True]},
                   index=ind)
df2.shift(freq='12H', periods=1, axis=1)
df2.shift(periods=1, axis=1)
print(df2.shift(periods=1, axis=1)) # shift by column -> incorrect
# print(df2.shift(periods=1, axis=0)) # correct
Output:
A B C D D1 E
2019-01-01 00:00:00 1 10.0 11.0 -11 False True
2019-01-01 12:00:00 2 20.0 22.0 -24 False False
2019-01-02 00:00:00 3 NaN 33.0 -51 False False
2019-01-02 12:00:00 4 40.0 NaN -36 False True
2019-01-03 00:00:00 5 50.0 55.0 -2 False True
A B C D D1 E
2019-01-01 00:00:00 NaN NaN 10.0 1.0 NaN False
2019-01-01 12:00:00 NaN NaN 20.0 2.0 NaN False
2019-01-02 00:00:00 NaN NaN NaN 3.0 NaN False
2019-01-02 12:00:00 NaN NaN 40.0 4.0 NaN False
2019-01-03 00:00:00 NaN NaN 50.0 5.0 NaN False
You are right, it is a bug: the problem is that DataFrame.shift with axis=1 shifts each column to the next column with the same dtype.
In the sample, columns A and D are filled with integers, so A is moved to D; columns B and C are filled with floats, so B is moved to C; and similarly for the boolean columns D1 and E.
A workaround is to convert all columns to object, shift, and then use DataFrame.infer_objects:
df3 = df2.astype(object).shift(1, axis=1).infer_objects()
print(df3)
A B C D D1 E
2019-01-01 00:00:00 NaN 1 10.0 11.0 -11 False
2019-01-01 12:00:00 NaN 2 20.0 22.0 -24 False
2019-01-02 00:00:00 NaN 3 NaN 33.0 -51 False
2019-01-02 12:00:00 NaN 4 40.0 NaN -36 False
2019-01-03 00:00:00 NaN 5 50.0 55.0 -2 False
print(df3.dtypes)
A float64
B int64
C float64
D float64
D1 int64
E bool
dtype: object
If you use shift with axis=0, the dtypes all stay in their own columns, so it works correctly.
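An alternative workaround (my own sketch, not from the answer): do the column shift in numpy on an object array, so pandas' per-dtype block handling never gets involved:
import numpy as np

arr = np.roll(df2.to_numpy(dtype=object), 1, axis=1)  # shift every row one column to the right
arr[:, 0] = np.nan                                    # np.roll wraps around, so blank out the first column
shifted = pd.DataFrame(arr, index=df2.index, columns=df2.columns).infer_objects()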

Dropping values before date X for each column in DataFrame

Say I have the following DataFrame:
d = pd.DataFrame({'A': [20, 0.5, 40, 45, 40, 35, 20, 25],
                  'B': [5, 10, 6, 8, 9, 7, 5, 8]},
                 index=pd.date_range(start="2010Q1", periods=8, freq='QS'))
A B
2010-01-01 20.0 5
2010-04-01 0.5 10
2010-07-01 40.0 6
2010-10-01 45.0 8
2011-01-01 40.0 9
2011-04-01 35.0 7
2011-07-01 20.0 5
2011-10-01 25.0 8
Also, assume I have the following series of dates:
D = d.idxmax()
A 2010-10-01
B 2010-04-01
dtype: datetime64[ns]
What I'm trying to do is essentially to "drop" the values in the DataFrame d that occur before the dates in the series D, for each column.
That is, what I'm looking for is:
A B
2010-01-01 NaN NaN
2010-04-01 NaN 10.0
2010-07-01 NaN 6.0
2010-10-01 45.0 8.0
2011-01-01 40.0 9.0
2011-04-01 35.0 7.0
2011-07-01 20.0 5.0
2011-10-01 25.0 8.0
Notice that all the values in column A before 2010-10-01 are dropped, and all the values in column B before 2010-04-01 are dropped.
It's fairly simple to iterate through the columns to do this, but the DataFrame I'm working with is extremely large, and this process takes a lot of time.
Is there a simpler way that does this in bulk, rather than column by column?
Thanks
Not sure that this is the most elegant answer, but since there aren't any other answers yet, I figured I would offer a working solution:
import pandas as pd
import numpy as np
import datetime
d = pd.DataFrame({'A': [20, 0.5, 40, 45, 40, 35, 20, 25],
                  'B': [5, 10, 6, 8, 9, 7, 5, 8]},
                 index=pd.date_range(start="2010Q1", periods=8, freq='QS'))
D = d.idxmax()
for column in D.index:
    d.loc[d.index < D[column], column] = np.nan
Output:
A B
2010-01-01 NaN NaN
2010-04-01 NaN 10.0
2010-07-01 NaN 6.0
2010-10-01 45.0 8.0
2011-01-01 40.0 9.0
2011-04-01 35.0 7.0
2011-07-01 20.0 5.0
2011-10-01 25.0 8.0
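For a bulk version without the Python loop (my sketch, assuming D's index matches d's columns, which idxmax guarantees): a single broadcast comparison between the row index and each column's cutoff date builds the whole mask at once:
mask = d.index.to_numpy()[:, None] < D.to_numpy()[None, :]  # True where a row is before that column's cutoff
d_masked = d.mask(mask)
print(d_masked)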
