Dropping values before date X for each column in DataFrame - python

Say I have the following DataFrame:
d = pd.DataFrame({'A': [20, 0.5, 40, 45, 40, 35, 20, 25],
                  'B': [5, 10, 6, 8, 9, 7, 5, 8]},
                 index=pd.date_range(start="2010Q1", periods=8, freq='QS'))
A B
2010-01-01 20.0 5
2010-04-01 0.5 10
2010-07-01 40.0 6
2010-10-01 45.0 8
2011-01-01 40.0 9
2011-04-01 35.0 7
2011-07-01 20.0 5
2011-10-01 25.0 8
Also, assume I have the following series of dates:
D = d.idxmax()
A 2010-10-01
B 2010-04-01
dtype: datetime64[ns]
What I'm trying to do is essentially "drop" the values in the DataFrame d that occur before the dates in the series D, for each column.
That is, what I'm looking for is:
A B
2010-01-01 NaN NaN
2010-04-01 NaN 10.0
2010-07-01 NaN 6.0
2010-10-01 45.0 8.0
2011-01-01 40.0 9.0
2011-04-01 35.0 7.0
2011-07-01 20.0 5.0
2011-10-01 25.0 8.0
Notice that all the values in column A before 2010-10-01 are dropped and all the values in column B are dropped before 2010-04-01.
It's fairly simple to iterate through the columns to do this, but the DataFrame I'm working with is extremely large and this process takes a lot of time.
Is there a simpler way that does this in bulk, rather than column by column?
Thanks

Not sure that this is the most elegant answer, but since there aren't any other answers yet, I figured I would offer a working solution:
import pandas as pd
import numpy as np
import datetime
d = pd.DataFrame({'A': [20, 0.5, 40, 45, 40, 35, 20, 25],
                  'B': [5, 10, 6, 8, 9, 7, 5, 8]},
                 index=pd.date_range(start="2010Q1", periods=8, freq='QS'))
D = d.idxmax()
for column in D.index:
    d.loc[d.index < D[column], column] = np.nan
Output:
A B
2010-01-01 NaN NaN
2010-04-01 NaN 10.0
2010-07-01 NaN 6.0
2010-10-01 45.0 8.0
2011-01-01 40.0 9.0
2011-04-01 35.0 7.0
2011-07-01 20.0 5.0
2011-10-01 25.0 8.0
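For very wide DataFrames, a fully vectorized alternative may be faster than looping over the columns. This is just a sketch, reusing the d and D (and the numpy import) defined above: broadcast a comparison between the index and each column's cutoff date, then mask with DataFrame.where.
# (n_rows, 1) >= (1, n_cols) broadcasts to a boolean mask of the full shape;
# rows earlier than a column's cutoff date become NaN, everything else is kept
mask = d.index.values[:, None] >= D.values[None, :]
result = d.where(mask)
This avoids the Python-level loop over the columns entirely, so it should scale better when there are many of them.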

Related

Stack the dataframe rows and columns into one row + replace the NaN values with the day before or after

I have a df and I want to stack its values into one row. First I want to select a specific time period and replace the NaN values with the value from the day before. Here is a simple example: I only want to keep the values from 2020, stack them by time, and replace any NaN value with the value from the same column on the day before.
df = pd.DataFrame()
df['day'] =[ '2020-01-01', '2019-01-01', '2020-01-02','2020-01-03', '2018-01-01', '2020-01-15','2020-03-01', '2020-02-01', '2017-01-01' ]
df['value_1'] = [ 1, np.nan, 32, 48, 5, -1, 5,10,2]
df['value_2'] = [ np.nan, 121, 23, 34, 15, 21, 15, 12, 39]
df
day value_1 value_2
0 2020-01-01 1.0 NaN
1 2019-01-01 NaN 121.0
2 2020-01-02 32.0 23.0
3 2020-01-03 48.0 34.0
4 2018-01-01 5.0 15.0
5 2020-01-15 -1.0 21.0
6 2020-03-01 5.0 15.0
7 2020-02-01 10.0 12.0
8 2017-01-01 2.0 39.0
The output:
_1 _2 _3 _4 _5 _6 _7 _8 _9 _10 _11 _12
0 1 121 1 23 48 34 -1 21 10 12 -1 21
I have tried to use the following code, but it does not solve my problem:
val_cols = df.filter(like='value_').columns
output = (df.pivot('day', val_cols).groupby(level=0, axis=1).apply(lambda x:x.ffill(axis=1).bfill(axis=1)).sort_index(axis=1, level=1))
I don't know exactly what the output is supposed to be, but I think this should do at least part of what you're trying to do:
df['day'] = pd.to_datetime(df['day'], format='%Y-%m-%d')
df = df.sort_values(by=['day'])
filter_2020 = df['day'].dt.year == 2020
val_cols = df.filter(like='value_').columns
df.loc[filter_2020, val_cols] = df.loc[:,val_cols].ffill().loc[filter_2020]
print(df)
day value_1 value_2
8 2017-01-01 2.0 39.0
4 2018-01-01 5.0 15.0
1 2019-01-01 NaN 121.0
0 2020-01-01 1.0 121.0
2 2020-01-02 32.0 23.0
3 2020-01-03 48.0 34.0
5 2020-01-15 -1.0 21.0
7 2020-02-01 10.0 12.0
6 2020-03-01 5.0 15.0
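If the final goal really is a single wide row of the 2020 values (the _1 .. _12 layout in the question), a possible follow-up sketch is to flatten the filled 2020 rows afterwards. This reuses filter_2020 and val_cols from above; since the expected ordering in the question isn't entirely clear, treat it only as a starting point.
# flatten the (already sorted and forward-filled) 2020 rows into one wide row
wide = df.loc[filter_2020, val_cols].to_numpy().ravel()
out = pd.DataFrame([wide], columns=[f'_{i + 1}' for i in range(len(wide))])
print(out)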

Pandas create new date rows and forward fill column values for maximum next 3 consecutive months

I want to forward fill rows for the next 3 consecutive months, but stop if a new data row is available for the same ID within that 3 month window.
Here is some sample data:
id date value1 Value2
1 2016-09-01 5 2
1 2016-11-01 7 15
2 2015-09-01 11 6
2 2015-12-01 13 4
2 2016-05-01 3 5
I would like to get
id date value1 value2
1 2016-09-01 5 2
1 2016-10-01 5 2
1 2016-11-01 7 15
1 2016-12-01 7 15
1 2017-01-01 7 15
1 2017-02-01 7 15
2 2015-09-01 11 6
2 2015-10-01 11 6
2 2015-11-01 11 6
2 2015-12-01 13 4
2 2016-01-01 13 4
2 2016-02-01 13 4
2 2016-03-01 13 4
2 2016-05-01 3 5
...
I tried a bunch of forward-fill methods and a cross join with a calendar, but couldn't figure it out.
Any help will be appreciated!
I think it might be done like this:
import pandas as pd
import datetime as dt
df = pd.DataFrame({
    'id': [1, 1, 2, 2, 2],
    'date': [
        dt.datetime.fromisoformat(s) for s in [
            '2016-09-01',
            '2016-11-01',
            '2015-09-01',
            '2015-12-01',
            '2016-05-01'
        ]
    ],
    'value1': [5, 7, 11, 13, 2],
    'value2': [2, 15, 6, 4, 5]
}).set_index('id')
result = []
for _id, data in df.groupby('id'):
    tmp_df = pd.DataFrame({
        'date': pd.period_range(
            start=min(data.date),
            end=max(data.date + dt.timedelta(days=31 * 3)),
            freq='M'
        ).to_timestamp()
    })
    tmp_df = tmp_df.join(data.set_index('date'), on='date')
    tmp_df['id'] = _id
    result.append(tmp_df.set_index('id'))
result = pd.concat(result).fillna(method='ffill', limit=3).dropna()
print(result)
Result:
date value1 value2
id
1 2016-09-01 5.0 2.0
1 2016-10-01 5.0 2.0
1 2016-11-01 7.0 15.0
1 2016-12-01 7.0 15.0
1 2017-01-01 7.0 15.0
1 2017-02-01 7.0 15.0
2 2015-09-01 11.0 6.0
2 2015-10-01 11.0 6.0
2 2015-11-01 11.0 6.0
2 2015-12-01 13.0 4.0
2 2016-01-01 13.0 4.0
2 2016-02-01 13.0 4.0
2 2016-03-01 13.0 4.0
2 2016-05-01 2.0 5.0
2 2016-06-01 2.0 5.0
2 2016-07-01 2.0 5.0
2 2016-08-01 2.0 5.0
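A somewhat shorter variation on the same idea (just a sketch, reusing the df defined above with 'id' as the index): build the extended month-start range inside a groupby().apply(), reindex, and forward-fill at most 3 rows. The result carries an (id, date) MultiIndex instead of a date column, but the filled values come out the same.
def expand(g):
    g = g.set_index('date')
    idx = pd.date_range(g.index.min(),
                        g.index.max() + pd.DateOffset(months=3), freq='MS')
    # reindex to month starts, carry each observation forward at most 3 rows,
    # and drop the months that stayed empty (gaps longer than 3 months)
    return g.reindex(idx).ffill(limit=3).dropna()

result2 = df.groupby(level='id').apply(expand)
print(result2)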

Pandas DataFrame.Shift returns incorrect result when shift by column axis

I am experiencing something really weird; not sure if it is a bug (hopefully not). Anyway, when I perform the DataFrame.shift method along the columns, the columns are either shifted incorrectly or the values returned are incorrect (see output below).
Does anyone know if I am missing something, or is it simply a bug in the library?
# Example 2
ind = pd.date_range('01/01/2019', periods=5, freq='12H')
df2 = pd.DataFrame({"A": [1, 2, 3, 4, 5],
                    "B": [10, 20, np.nan, 40, 50],
                    "C": [11, 22, 33, np.nan, 55],
                    "D": [-11, -24, -51, -36, -2],
                    'D1': [False] * 5,
                    'E': [True, False, False, True, True]},
                   index=ind)
df2.shift(freq='12H', periods=1, axis=1)
df2.shift(periods=1, axis=1)
print(df2.shift(periods=1, axis=1)) # shift by column -> incorrect
# print(df2.shift(periods=1, axis=0)) # correct
Output:
A B C D D1 E
2019-01-01 00:00:00 1 10.0 11.0 -11 False True
2019-01-01 12:00:00 2 20.0 22.0 -24 False False
2019-01-02 00:00:00 3 NaN 33.0 -51 False False
2019-01-02 12:00:00 4 40.0 NaN -36 False True
2019-01-03 00:00:00 5 50.0 55.0 -2 False True
A B C D D1 E
2019-01-01 00:00:00 NaN NaN 10.0 1.0 NaN False
2019-01-01 12:00:00 NaN NaN 20.0 2.0 NaN False
2019-01-02 00:00:00 NaN NaN NaN 3.0 NaN False
2019-01-02 12:00:00 NaN NaN 40.0 4.0 NaN False
2019-01-03 00:00:00 NaN NaN 50.0 5.0 NaN False
[Finished in 0.4s]
You are right, it is a bug: the problem is that DataFrame.shift with axis=1 shifts values into the next column of the same dtype.
In the sample, columns A and D hold integers, so A is moved into D; columns B and C hold floats, so B is moved into C; and similarly for the boolean columns D1 and E.
A workaround is to convert all columns to object, shift, and then use DataFrame.infer_objects:
df3 = df2.astype(object).shift(1, axis=1).infer_objects()
print (df3)
A B C D D1 E
2019-01-01 00:00:00 NaN 1 10.0 11.0 -11 False
2019-01-01 12:00:00 NaN 2 20.0 22.0 -24 False
2019-01-02 00:00:00 NaN 3 NaN 33.0 -51 False
2019-01-02 12:00:00 NaN 4 40.0 NaN -36 False
2019-01-03 00:00:00 NaN 5 50.0 55.0 -2 False
print (df3.dtypes)
A float64
B int64
C float64
D float64
D1 int64
E bool
dtype: object
If you use shift with axis=0, every value stays within its own column, so the dtypes are unchanged and it works correctly.
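An alternative workaround along the same lines (just a sketch): transposing a mixed-dtype frame already upcasts everything to object, so shifting the transposed frame row-wise and transposing back gives the expected column shift, and infer_objects restores the dtypes afterwards.
# T makes all columns object, so shift(1) moves whole rows uniformly;
# transposing back yields the same result as df3 above
df4 = df2.T.shift(1).T.infer_objects()
print(df4)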

Return first matching value/column name in new dataframe

import pandas as pd
import numpy as np
rng = pd.date_range('1/1/2011', periods=6, freq='H')
df = pd.DataFrame({'A': [0, 1, 2, 3, 4, 5],
                   'B': [0, 1, 2, 3, 4, 5],
                   'C': [0, 1, 2, 3, 4, 5],
                   'D': [0, 1, 2, 3, 4, 5],
                   'E': [1, 2, 3, 3, 7, 6],
                   'F': [1, 1, 3, 3, 7, 6],
                   'G': [0, 0, 1, 0, 0, 0]},
                  index=rng)
A simple dataframe to help me explain:
df
A B C D E F G
2011-01-01 00:00:00 0 0 0 0 1 1 0
2011-01-01 01:00:00 1 1 1 1 2 1 0
2011-01-01 02:00:00 2 2 2 2 3 3 1
2011-01-01 03:00:00 3 3 3 3 3 3 0
2011-01-01 04:00:00 4 4 4 4 7 7 0
2011-01-01 05:00:00 5 5 5 5 6 6 0
When I filter for values greater than or equal to 2, I get the following output:
df[df >= 2]
A B C D E F G
2011-01-01 00:00:00 NaN NaN NaN NaN NaN NaN NaN
2011-01-01 01:00:00 NaN NaN NaN NaN 2.0 NaN NaN
2011-01-01 02:00:00 2.0 2.0 2.0 2.0 3.0 3.0 NaN
2011-01-01 03:00:00 3.0 3.0 3.0 3.0 3.0 3.0 NaN
2011-01-01 04:00:00 4.0 4.0 4.0 4.0 7.0 7.0 NaN
2011-01-01 05:00:00 5.0 5.0 5.0 5.0 6.0 6.0 NaN
For each row I want to know which column has the matching value first (working from left to right). So in the row for 2011-01-01 01:00:00 it was column E and the value was 2.0.
Desired output:
What I would like to get is a new dataframe with the first matching value in a column named 'Value' and another column called 'From Col' which captures the column name it came from.
If no match is seen, then use the name of the last column (G in this case). Thanks for any help.
"Value" "From Col"
2011-01-01 00:00:00 NaN G
2011-01-01 01:00:00 2 E
2011-01-01 02:00:00 2 A
2011-01-01 03:00:00 3 A
2011-01-01 04:00:00 4 A
2011-01-01 05:00:00 5 A
Try this:
def get_first_valid(ser):
    if len(ser) == 0:
        return pd.Series([np.nan, np.nan])
    mask = pd.isnull(ser.values)
    i = mask.argmin()
    if mask[i]:
        return pd.Series([np.nan, ser.index[-1]])
    else:
        return pd.Series([ser[i], ser.index[i]])
In [113]: df[df >= 2].apply(get_first_valid, axis=1)
Out[113]:
0 1
2011-01-01 00:00:00 NaN G
2011-01-01 01:00:00 2.0 E
2011-01-01 02:00:00 2.0 A
2011-01-01 03:00:00 3.0 A
2011-01-01 04:00:00 4.0 A
2011-01-01 05:00:00 5.0 A
or:
In [114]: df[df >= 2].T.apply(get_first_valid).T
Out[114]:
0 1
2011-01-01 00:00:00 NaN G
2011-01-01 01:00:00 2 E
2011-01-01 02:00:00 2 A
2011-01-01 03:00:00 3 A
2011-01-01 04:00:00 4 A
2011-01-01 05:00:00 5 A
PS: I took the source code of the Series.first_valid_index() function and made a dirty hack out of it...
Explanation:
In [221]: ser = pd.Series([np.nan, np.nan, 5, 7, np.nan])
In [222]: ser
Out[222]:
0 NaN
1 NaN
2 5.0
3 7.0
4 NaN
dtype: float64
In [223]: mask = pd.isnull(ser.values)
In [224]: mask
Out[224]: array([ True, True, False, False, True], dtype=bool)
In [225]: i = mask.argmin()
In [226]: i
Out[226]: 2
In [227]: ser.index[i]
Out[227]: 2
In [228]: ser[i]
Out[228]: 5.0
First, filter the values according to the criterion and drop rows containing all NaNs. Then use idxmax to return the first occurrence of a True condition per row. This gives our first series.
To create the second series, iterate over the (index, value) tuple pairs of the first series and simultaneously pick up those locations from the original DF.
ser1 = (df[df.ge(2)].dropna(how='all').ge(2)).idxmax(1)
ser2 = pd.concat([pd.Series(df.loc[i, r], pd.Index([i])) for i, r in ser1.items()])
Create a new DF whose index pertains to that of the original DF and fill the missing values in From Col with the name of its last column.
req_df = pd.DataFrame({"From Col": ser1, "Value": ser2}, index=df.index)
req_df['From Col'].fillna(df.columns[-1], inplace=True)
req_df
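Both answers above still do per-row Python work (apply or a list comprehension over ser1). As a purely illustrative sketch, reusing the df and imports from the question, the same result can be computed with a boolean mask and NumPy fancy indexing:
mask = df.ge(2).to_numpy()
pos = mask.argmax(axis=1)        # position of the first True in each row (0 if none)
has_match = mask.any(axis=1)

vec_df = pd.DataFrame({
    'Value': np.where(has_match, df.to_numpy()[np.arange(len(df)), pos], np.nan),
    'From Col': np.where(has_match, df.columns.to_numpy()[pos], df.columns[-1]),
}, index=df.index)
print(vec_df)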
I don't work with pandas, so consider this just a footnote, but in pure Python there is also the possibility of finding the first non-None index using reduce (in Python 3, reduce lives in functools):
>>> a
[None, None, None, None, 6, None, None, None, 3, None]
>>> print( reduce(lambda x, y: (x or y[1] and y[0]), enumerate(a), None))
4

Python: doing multiple column aggregation in pandas

I have a dataframe where I want to do multiple column aggregations in pandas.
import pandas as pd
import numpy as np
df = pd.DataFrame({'ser_no': [1, 1, 1, 2, 2, 2, 2, 3, 3, 3],
                   'CTRY_NM': ['a', 'a', 'b', 'e', 'e', 'a', 'b', 'b', 'b', 'd'],
                   'lat': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
                   'long': [21, 22, 23, 24, 25, 26, 27, 28, 29, 30]})
df2 = df.groupby(['ser_no', 'CTRY_NM']).lat.agg({'avg_lat': np.mean})
With this code, I get the mean for lat. I would also like to find the mean for long.
I tried df2 = df.groupby(['ser_no', 'CTRY_NM']).lat.agg({'avg_lat': np.mean}).long.agg({'avg_long': np.mean}) but this produces
AttributeError: 'DataFrame' object has no attribute 'long'
If I just do avg_long, the code works as well.
df2 = df.groupby(['ser_no', 'CTRY_NM']).long.agg({'avg_long': np.mean})
In[2]: df2
Out[42]:
avg_long
ser_no CTRY_NM
1 a 21.5
b 23.0
2 a 26.0
b 27.0
e 24.5
3 b 28.5
d 30.0
Is there a way to do this in one step or is this something I have to do separately and join back later?
I think it is simpler to use GroupBy.mean:
print(df.groupby(['ser_no', 'CTRY_NM']).mean())
lat long
ser_no CTRY_NM
1 a 1.5 21.5
b 3.0 23.0
2 a 6.0 26.0
b 7.0 27.0
e 4.5 24.5
3 b 8.5 28.5
d 10.0 30.0
Or, if you need to define which columns to aggregate:
print(df.groupby(['ser_no', 'CTRY_NM']).agg({'lat': 'mean', 'long': 'mean'}))
lat long
ser_no CTRY_NM
1 a 1.5 21.5
b 3.0 23.0
2 a 6.0 26.0
b 7.0 27.0
e 4.5 24.5
3 b 8.5 28.5
d 10.0 30.0
More info in docs.
EDIT:
If you need to rename the column names (i.e. remove the MultiIndex in the columns), you can use a list comprehension:
import pandas as pd
df = pd.DataFrame({'ser_no': [1, 1, 1, 2, 2, 2, 2, 3, 3, 3],
                   'CTRY_NM': ['a', 'a', 'b', 'e', 'e', 'a', 'b', 'b', 'b', 'd'],
                   'lat': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
                   'long': [21, 22, 23, 24, 25, 26, 27, 28, 29, 30],
                   'date': pd.date_range(pd.to_datetime('2016-02-24'),
                                         pd.to_datetime('2016-02-28'), freq='10H')})
print(df)
CTRY_NM date lat long ser_no
0 a 2016-02-24 00:00:00 1 21 1
1 a 2016-02-24 10:00:00 2 22 1
2 b 2016-02-24 20:00:00 3 23 1
3 e 2016-02-25 06:00:00 4 24 2
4 e 2016-02-25 16:00:00 5 25 2
5 a 2016-02-26 02:00:00 6 26 2
6 b 2016-02-26 12:00:00 7 27 2
7 b 2016-02-26 22:00:00 8 28 3
8 b 2016-02-27 08:00:00 9 29 3
9 d 2016-02-27 18:00:00 10 30 3
df2 = df.groupby(['ser_no', 'CTRY_NM']).agg({'lat': 'mean', 'long': 'mean',
                                             'date': [min, max, 'count']})
df2.columns = ['_'.join(col) for col in df2.columns]
print(df2)
lat_mean date_min date_max date_count \
ser_no CTRY_NM
1 a 1.5 2016-02-24 00:00:00 2016-02-24 10:00:00 2
b 3.0 2016-02-24 20:00:00 2016-02-24 20:00:00 1
2 a 6.0 2016-02-26 02:00:00 2016-02-26 02:00:00 1
b 7.0 2016-02-26 12:00:00 2016-02-26 12:00:00 1
e 4.5 2016-02-25 06:00:00 2016-02-25 16:00:00 2
3 b 8.5 2016-02-26 22:00:00 2016-02-27 08:00:00 2
d 10.0 2016-02-27 18:00:00 2016-02-27 18:00:00 1
long_mean
ser_no CTRY_NM
1 a 21.5
b 23.0
2 a 26.0
b 27.0
e 24.5
3 b 28.5
d 30.0
You are getting the error because you are first selecting the lat column of the dataframe and then doing operations on that column. Getting the long column through that Series is not possible; you need the dataframe.
df2 = df.groupby(['ser_no', 'CTRY_NM'])[["lat", "long"]].agg(np.mean)
would do the same operation for both columns. If you want the column names changed, you can rename the columns afterwards:
df2 = df.groupby(['ser_no', 'CTRY_NM'])[["lat", "long"]].agg(np.mean).rename(columns={"lat": "avg_lat", "long": "avg_long"})
In [22]:
df2 = df.groupby(['ser_no', 'CTRY_NM'])[["lat", "long"]].agg(np.mean).rename(columns={"lat": "avg_lat", "long": "avg_long"})
df2
Out[22]:
avg_lat avg_long
ser_no CTRY_NM
1 a 1.5 21.5
b 3.0 23.0
2 a 6.0 26.0
b 7.0 27.0
e 4.5 24.5
3 b 8.5 28.5
d 10.0 30.0
