I have a DataFrame and want to multiply all values in a column a for a certain day with the value of a at 6h00m00 of that day. If there is no 6h00m00 entry, that day should stay unchanged.
The code below unfortunately gives an error.
How can I correct this code, or what working solution could replace it?
import pandas as pd
import numpy as np
start = pd.Timestamp('2000-01-01')
end = pd.Timestamp('2000-01-03')
t = np.linspace(start.value, end.value, 9)
datetime1 = pd.to_datetime(t)
df = pd.DataFrame( {'a':[1,3,4,5,6,7,8,9,14]})
df['date']= datetime1
print(df)
def myF(x):
    y = x.set_index('date').between_time('05:59', '06:01').a
    return y
toMultiplyWith = df.groupby(df.date.dt.floor('D')).transform(myF)
a date
0 1 2000-01-01 00:00:00
1 3 2000-01-01 06:00:00
2 4 2000-01-01 12:00:00
3 5 2000-01-01 18:00:00
4 6 2000-01-02 00:00:00
5 7 2000-01-02 06:00:00
6 8 2000-01-02 12:00:00
7 9 2000-01-02 18:00:00
8 14 2000-01-03 00:00:00
AttributeError: ("'Series' object has no attribute 'set_index'", 'occurred at index a')
You should change this line:
toMultiplyWith = df.groupby(df.date.dt.floor('D')).transform(myF)
to this:
toMultiplyWith = df.groupby(df.date.dt.floor('D')).apply(myF)
Using .apply instead of .transform will give you the desired result.
apply is the right choice here since it implicitly passes all the columns of each group to the custom function as a DataFrame. transform calls the function on each column as a Series, which is why set_index is not available and the AttributeError is raised.
To read more about the difference between the two methods, consider this answer
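As a rough illustration of the difference, the sketch below rebuilds the sample data (using date_range instead of linspace, purely for brevity) and compares what each method hands back. The variable names are made up for this example.

```python
import pandas as pd

# Same sample data as in the question, built with date_range for brevity.
df = pd.DataFrame({
    "a": [1, 3, 4, 5, 6, 7, 8, 9, 14],
    "date": pd.date_range("2000-01-01", "2000-01-03", periods=9),
})
day = df["date"].dt.floor("D")

# apply: the function receives each group as a whole DataFrame.
got_frame = df.groupby(day).apply(lambda g: isinstance(g, pd.DataFrame))

# apply with a scalar result: one row per group (3 days here).
first_per_day = df.groupby(day)["a"].apply(lambda s: s.iloc[0])

# transform with a scalar result: broadcast back to the original 9 rows.
first_per_row = df.groupby(day)["a"].transform(lambda s: s.iloc[0])

print(got_frame.all(), len(first_per_day), len(first_per_row))
```

So apply suits functions that need several columns at once, while transform suits per-column results that must line up with the original rows.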
If you want to stick with the between_time(...) function, this is one way to do it:
df = df.set_index('date')
mask = df.between_time('05:59', '06:01').index
df.loc[mask, 'a'] = df.loc[mask, 'a'] ** 2 # the operation you want to perform
df.reset_index(inplace=True)
Outputs:
date a
0 2000-01-01 00:00:00 1
1 2000-01-01 06:00:00 9
2 2000-01-01 12:00:00 4
3 2000-01-01 18:00:00 5
4 2000-01-02 00:00:00 6
5 2000-01-02 06:00:00 49
6 2000-01-02 12:00:00 8
7 2000-01-02 18:00:00 9
8 2000-01-03 00:00:00 14
If I got your goal right, you can use apply to return a dataframe with the same number of rows as the original dataframe (simulating a transform):
def myF(grp):
    time = grp.date.dt.strftime('%T')
    target_idx = time == '06:00:00'
    if target_idx.any():
        grp.loc[~target_idx, 'a_sum'] = grp.loc[~target_idx, 'a'].values * grp.loc[target_idx, 'a'].values
    else:
        grp.loc[~target_idx, 'a_sum'] = np.nan
    return grp
df.groupby(df.date.dt.floor('D')).apply(myF)
Output:
a date a_sum
0 1 2000-01-01 00:00:00 3.0
1 3 2000-01-01 06:00:00 NaN
2 4 2000-01-01 12:00:00 12.0
3 5 2000-01-01 18:00:00 15.0
4 6 2000-01-02 00:00:00 42.0
5 7 2000-01-02 06:00:00 NaN
6 8 2000-01-02 12:00:00 56.0
7 9 2000-01-02 18:00:00 63.0
8 14 2000-01-03 00:00:00 NaN
Note that, for each day, every value with a time other than 06:00:00 is multiplied by the value at 06:00:00. The result is NaN for the 06:00:00 values themselves, as well as for groups without this time.
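If the goal is exactly what the question states (multiply every value of a day by that day's 06:00:00 value and leave days without such an entry unchanged), a vectorized variant without apply might look like the sketch below. It assumes at most one 06:00:00 row per day and that the 06:00:00 row itself is also multiplied. The names is_six, six_value, factor and a_scaled are invented for this example.

```python
import pandas as pd

df = pd.DataFrame({
    "a": [1, 3, 4, 5, 6, 7, 8, 9, 14],
    "date": pd.date_range("2000-01-01", "2000-01-03", periods=9),
})

day = df["date"].dt.normalize()                      # midnight of each row's day
is_six = (df["date"] - day) == pd.Timedelta(hours=6)  # rows at exactly 06:00:00

# One multiplier per day that has a 06:00:00 row (assumed unique per day).
six_value = pd.Series(
    df.loc[is_six, "a"].to_numpy(),
    index=pd.DatetimeIndex(day[is_six]),
)

# Days without a 06:00:00 row get NaN from map, replaced by the neutral factor 1.
factor = day.map(six_value).fillna(1)
df["a_scaled"] = df["a"] * factor
print(df)
```

The last row (2000-01-03) keeps its value 14 because that day has no 06:00:00 entry.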
Related
I have this dataset
Value1 Value2
2000-01-01 12:00:00 1 2
2000-01-02 12:00:00 3 4
2000-01-03 12:00:00 5 6
I want to repeat the same data at 4 different times of each day, for example:
Value1 Value2
2000-01-01 00:00:00 1 2
2000-01-01 06:00:00 1 2
2000-01-01 12:00:00 1 2
2000-01-01 18:00:00 1 2
2000-01-02 00:00:00 3 4
2000-01-02 06:00:00 3 4
2000-01-02 12:00:00 3 4
2000-01-02 18:00:00 3 4
and so on.
Your dates are contiguous, but this solution also works for non-contiguous dates.
Generate a series of the expanded times for each row, then join it back onto the DataFrame:
import pandas as pd
import io
df = pd.read_csv(io.StringIO("""                     Value1  Value2
2000-01-01 12:00:00       1       2
2000-01-02 12:00:00       3       4
2000-01-03 12:00:00       5       6"""), sep=r"\s\s+", engine="python")
df = df.set_index(pd.to_datetime(df.index))
df = df.join(
    pd.Series(df.index, index=df.index)
    .rename("expanded")
    .dt.date.apply(lambda d: pd.date_range(d, freq="6H", periods=4))
    .explode()
).set_index("expanded")
df
                     Value1  Value2
expanded
2000-01-01 00:00:00       1       2
2000-01-01 06:00:00       1       2
2000-01-01 12:00:00       1       2
2000-01-01 18:00:00       1       2
2000-01-02 00:00:00       3       4
2000-01-02 06:00:00       3       4
2000-01-02 12:00:00       3       4
2000-01-02 18:00:00       3       4
2000-01-03 00:00:00       5       6
2000-01-03 06:00:00       5       6
2000-01-03 12:00:00       5       6
2000-01-03 18:00:00       5       6
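For comparison, here is a sketch of a different route to the same expansion that avoids apply and explode by repeating rows positionally. It is not the method from the answer above, and the names offsets, days and expanded are invented for this example.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {"Value1": [1, 3, 5], "Value2": [2, 4, 6]},
    index=pd.to_datetime(["2000-01-01 12:00", "2000-01-02 12:00", "2000-01-03 12:00"]),
)

offsets = pd.to_timedelta([0, 6, 12, 18], unit="h")  # the four times of day
days = df.index.normalize()                          # midnight of each original row

# Repeat every row once per offset, then rebuild the index as day + offset.
expanded = df.iloc[np.repeat(np.arange(len(df)), len(offsets))].copy()
expanded.index = days.repeat(len(offsets)) + pd.TimedeltaIndex(
    np.tile(offsets.to_numpy(), len(df))
)
print(expanded)
```

Because each row is expanded from its own date, this also works when the dates are not contiguous.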
I have a pandas DataFrame called df1, which looks like:
value analysis_date hour error
7 2000-01-01 00:00:00 9 None
8 2000-01-01 00:00:00 10 None
9 2000-01-01 00:00:00 11 None
And a second DataFrame, df2:
value analysis_date hour error
4 2000-01-01 09:00:00 1 None
5 2019-01-01 00:00:00 2 None
6 2000-01-01 08:00:00 3 None
I want to
compare 'corresponding' rows, meaning rows in which analysis_date + hour is equal between df1 and df2; here df1 rows 2 and 3 correspond with df2 rows 1 and 3 respectively
Then, I want to set the error column in df1 to be df1['value'][row] - df2['value'][row] for that corresponding row. So in this case, df1 should end up looking like this:
value analysis_date hour error
7 2000-01-01 00:00:00 9 None
8 2000-01-01 00:00:00 10 4
9 2000-01-01 00:00:00 11 3
Is there a way I can do this beyond looping through every single row and individually comparing them using iterrows()?
You could go about it like this:
df1['analysis_date'] = pd.to_datetime(df1['analysis_date'])
df2['analysis_date'] = pd.to_datetime(df2['analysis_date'])
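# pd.to_timedelta converts the integer hours; astype('timedelta64[h]') is rejected by recent pandas versions
df2['total_date'] = df2.analysis_date + pd.to_timedelta(df2.hour, unit='h')
df1['total_date'] = df1.analysis_date + pd.to_timedelta(df1.hour, unit='h')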
mr_df = df1.merge(df2.loc[:,['value', 'total_date']], on = 'total_date', how = 'left')
df1['error'] = mr_df['value_x'] - mr_df['value_y']
df1
# value analysis_date hour error total_date
# 0 7 2000-01-01 9 NaN 2000-01-01 09:00:00
# 1 8 2000-01-01 10 4.0 2000-01-01 10:00:00
# 2 9 2000-01-01 11 3.0 2000-01-01 11:00:00
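A self-contained variant of the same idea is sketched below, using Series.map as a lookup instead of merge. It assumes the combined timestamps in df2 are unique. The name lookup is invented for this example.

```python
import pandas as pd

df1 = pd.DataFrame({
    "value": [7, 8, 9],
    "analysis_date": ["2000-01-01 00:00:00"] * 3,
    "hour": [9, 10, 11],
})
df2 = pd.DataFrame({
    "value": [4, 5, 6],
    "analysis_date": ["2000-01-01 09:00:00", "2019-01-01 00:00:00", "2000-01-01 08:00:00"],
    "hour": [1, 2, 3],
})

# Build the combined timestamp (analysis_date + hour) in both frames.
for frame in (df1, df2):
    frame["total_date"] = pd.to_datetime(frame["analysis_date"]) + pd.to_timedelta(
        frame["hour"], unit="h"
    )

# Look up df2's value by combined timestamp; unmatched rows become NaN.
lookup = df2.set_index("total_date")["value"]
df1["error"] = df1["value"] - df1["total_date"].map(lookup)
print(df1[["value", "total_date", "error"]])
```

Unlike the merge version, this does not depend on df1 having a default integer index when the result is assigned back.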
I have a data frame like the one below and want to sample it with '3S'.
Some values are NaN. I expect the sampling to run in '3S' steps, but whenever a NaN appears in between, it should stop there and restart the sampling from that index. I tried to achieve this with the dataframe.apply method, but it gets very complex. Is there a shorter way?
df.sample(n=3)
Code to generate Input:
index = pd.date_range('1/1/2000', periods=13, freq='T')
series = pd.DataFrame(range(13), index=index)
print(series)
series.iloc[4] = 'NaN'
series.iloc[10] = 'NaN'
I tried sampling, but after that I have no idea how to proceed.
2015-01-01 00:00:00 0.0
2015-01-01 01:00:00 1.0
2015-01-01 02:00:00 2.0
2015-01-01 03:00:00 2.0
2015-01-01 04:00:00 NaN
2015-01-01 05:00:00 3.0
2015-01-01 06:00:00 4.0
2015-01-01 07:00:00 4.0
2015-01-01 08:00:00 4.0
2015-01-01 09:00:00 NaN
2015-01-01 10:00:00 3.0
2015-01-01 11:00:00 4.0
2015-01-01 12:00:00 4.0
The new data frame should be sampled based on '3S', but should also take NaN records into account: each NaN is kept, and the sampling restarts from the row where it is found.
Expected Output:
2015-01-01 02:00:00 2.0 -- sampled after 3S
2015-01-01 03:00:00 2.0 -- printed because a NaN follows
2015-01-01 04:00:00 NaN -- the NaN record itself
2015-01-01 07:00:00 4.0 -- sampled after 3S
2015-01-01 08:00:00 4.0 -- printed because a NaN follows
2015-01-01 09:00:00 NaN -- the NaN record itself
2015-01-01 12:00:00 4.0 -- sampled after 3S
Use:
index = pd.date_range('1/1/2000', periods=13, freq='H')
df = pd.DataFrame({'col': range(13)}, index=index)
df.iloc[4, 0] = np.nan
df.iloc[9, 0] = np.nan
print (df)
col
2000-01-01 00:00:00 0.0
2000-01-01 01:00:00 1.0
2000-01-01 02:00:00 2.0
2000-01-01 03:00:00 3.0
2000-01-01 04:00:00 NaN
2000-01-01 05:00:00 5.0
2000-01-01 06:00:00 6.0
2000-01-01 07:00:00 7.0
2000-01-01 08:00:00 8.0
2000-01-01 09:00:00 NaN
2000-01-01 10:00:00 10.0
2000-01-01 11:00:00 11.0
2000-01-01 12:00:00 12.0
m = df['col'].isna()
s1 = m.ne(m.shift()).cumsum()
t = pd.Timedelta(2, unit='H')
mask = df.index >= df.groupby(s1)['col'].transform(lambda x: x.index[0]) + t
df1 = df[mask | m]
print (df1)
col
2000-01-01 02:00:00 2.0
2000-01-01 03:00:00 3.0
2000-01-01 04:00:00 NaN
2000-01-01 07:00:00 7.0
2000-01-01 08:00:00 8.0
2000-01-01 09:00:00 NaN
2000-01-01 12:00:00 12.0
Explanation:
Create a mask of the missing values with Series.isna.
Create groups of consecutive values by comparing the mask with its shifted copy using Series.ne (!=) and taking the cumulative sum:
print (s1)
2000-01-01 00:00:00 1
2000-01-01 01:00:00 1
2000-01-01 02:00:00 1
2000-01-01 03:00:00 1
2000-01-01 04:00:00 2
2000-01-01 05:00:00 3
2000-01-01 06:00:00 3
2000-01-01 07:00:00 3
2000-01-01 08:00:00 3
2000-01-01 09:00:00 4
2000-01-01 10:00:00 5
2000-01-01 11:00:00 5
2000-01-01 12:00:00 5
Freq: H, Name: col, dtype: int32
Get the first index value of each group, add a timedelta (2 hours here, to match the expected output) and compare the result with the DatetimeIndex.
Finally, filter by boolean indexing, chaining both masks with | for bitwise OR.
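The same run-grouping idea can also be expressed by position rather than by timestamp. The sketch below assumes evenly spaced rows, so "2 hours after the start of a run" is the same as "position 2 or later within the run". The names run_id, pos and kept are invented for this example.

```python
import numpy as np
import pandas as pd

# Hourly index built from start/end/periods to stay independent of frequency aliases.
index = pd.date_range("2000-01-01 00:00", "2000-01-01 12:00", periods=13)
df = pd.DataFrame({"col": np.arange(13, dtype=float)}, index=index)
df.iloc[[4, 9], 0] = np.nan

m = df["col"].isna()                 # True where the value is missing
run_id = m.ne(m.shift()).cumsum()    # a new id every time the mask flips
pos = df.groupby(run_id).cumcount()  # 0, 1, 2, ... within each run

# Keep rows at least two steps into a run, plus the NaN rows themselves.
kept = df[(pos >= 2) | m]
print(kept)
```

With irregular spacing the timestamp comparison from the answer above is the safer choice, since positions would no longer correspond to elapsed time.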
One way would be to fill the NaNs with 0:
df['Col_of_Interest'] = df['Col_of_Interest'].fillna(0)
Then resample the series
(assuming the datetime is your index):
series.resample('30S').asfreq()
I have a dataframe df as below:
date1 item_id
2000-01-01 00:00:00 0
2000-01-01 10:01:00 1
2000-01-01 00:02:00 2
2000-01-01 00:03:00 3
2000-01-01 00:04:00 4
2000-01-01 00:05:00 5
2000-01-01 00:06:00 6
2000-01-01 12:07:00 7
2000-01-02 00:08:00 8
2000-01-02 00:00:00 0
2000-01-02 00:01:00 1
2000-01-02 03:02:00 2
2000-01-02 00:03:00 3
2000-01-02 00:04:00 4
2000-01-02 00:05:00 5
2000-01-02 04:06:00 6
2000-01-02 00:07:00 7
2000-01-02 00:08:00 8
I need the data for a single day, i.e. 1st Jan 2000. The query below gives me the correct result, but is there a way to do it just by passing "2000-01-01"?
result= df[(df['date1'] > '2000-01-01 00:00') & (df['date1'] < '2000-01-01 23:59')]
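Use partial string indexing, which requires a DatetimeIndex first. Select the rows with .loc, because recent pandas versions no longer select rows with df['2000-01-01']:
df = df.set_index('date1').loc['2000-01-01']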
print (df)
item_id
date1
2000-01-01 00:00:00 0
2000-01-01 10:01:00 1
2000-01-01 00:02:00 2
2000-01-01 00:03:00 3
2000-01-01 00:04:00 4
2000-01-01 00:05:00 5
2000-01-01 00:06:00 6
2000-01-01 12:07:00 7
Another solution is to convert the datetimes to strings with strftime and filter by boolean indexing:
df = df[df['date1'].dt.strftime('%Y-%m-%d') == '2000-01-01']
print (df)
date1 item_id
0 2000-01-01 00:00:00 0
1 2000-01-01 10:01:00 1
2 2000-01-01 00:02:00 2
3 2000-01-01 00:03:00 3
4 2000-01-01 00:04:00 4
5 2000-01-01 00:05:00 5
6 2000-01-01 00:06:00 6
7 2000-01-01 12:07:00 7
The other alternative would be to create a mask:
df[df.date1.dt.date.astype(str) == '2000-01-01']
Full example:
from io import StringIO  # pd.compat.StringIO was removed from pandas
import pandas as pd
data = '''\
date1 item_id
2000-01-01T00:00:00 0
2000-01-01T10:01:00 1
2000-01-01T00:02:00 2
2000-01-01T00:03:00 3
2000-01-01T00:04:00 4
2000-01-01T00:05:00 5
2000-01-01T00:06:00 6
2000-01-01T12:07:00 7
2000-01-02T00:08:00 8
2000-01-02T00:00:00 0
2000-01-02T00:01:00 1
2000-01-02T03:02:00 2'''
df = pd.read_csv(StringIO(data), sep=r'\s+', parse_dates=['date1'])
res = df[df.date1.dt.date.astype(str) == '2000-01-01']
print(res)
Returns:
date1 item_id
0 2000-01-01 00:00:00 0
1 2000-01-01 10:01:00 1
2 2000-01-01 00:02:00 2
3 2000-01-01 00:03:00 3
4 2000-01-01 00:04:00 4
5 2000-01-01 00:05:00 5
6 2000-01-01 00:06:00 6
7 2000-01-01 12:07:00 7
Or
import datetime
df[df.date1.dt.date == datetime.date(2000,1,1)]
Let's look at some one-minute data:
In [513]: rng = pd.date_range('1/1/2000', periods=12, freq='T')
In [514]: ts = Series(np.arange(12), index=rng)
In [515]: ts
Out[515]:
2000-01-01 00:00:00 0
2000-01-01 00:01:00 1
2000-01-01 00:02:00 2
2000-01-01 00:03:00 3
2000-01-01 00:04:00 4
2000-01-01 00:05:00 5
2000-01-01 00:06:00 6
2000-01-01 00:07:00 7
2000-01-01 00:08:00 8
2000-01-01 00:09:00 9
2000-01-01 00:10:00 10
2000-01-01 00:11:00 11
Freq: T
Suppose you wanted to aggregate this data into five-minute chunks or bars by taking
the sum of each group:
In [516]: ts.resample('5min', how='sum')
Out[516]:
2000-01-01 00:00:00 0
2000-01-01 00:05:00 15
2000-01-01 00:10:00 40
2000-01-01 00:15:00 11
Freq: 5T
However, I don't want to use the resample method but still want the same output for the same input. How can I do this with groupby, reindex, or a similar method?
You can use a custom pd.Grouper this way:
In [78]: ts.groupby(pd.Grouper(freq='5min', closed='right')).sum()
Out [78]:
1999-12-31 23:55:00 0
2000-01-01 00:00:00 15
2000-01-01 00:05:00 40
2000-01-01 00:10:00 11
Freq: 5T, dtype: int64
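The closed='right' argument ensures that the sums are the same as in the resample output. The bins are labelled by their left edge here, so the timestamps are shifted by five minutes relative to that output.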
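To get matching timestamps as well, pd.Grouper also accepts a label argument. A minimal sketch, rebuilding the series from the question:

```python
import numpy as np
import pandas as pd

rng = pd.date_range("2000-01-01", periods=12, freq="min")
ts = pd.Series(np.arange(12), index=rng)

# closed='right' puts each bin's right edge inside the bin;
# label='right' names each bin after that right edge.
out = ts.groupby(pd.Grouper(freq="5min", closed="right", label="right")).sum()
print(out)
```

The bins are then labelled 00:00, 00:05, 00:10 and 00:15, with sums 0, 15, 40 and 11.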
However, if your aim is to do more custom grouping, you can use .groupby with your own vector:
In [78]: buckets = (ts.index - ts.index[0]) / pd.Timedelta('5min')
In [79]: grp = ts.groupby(np.ceil(buckets.values))
In [80]: grp.sum()
Out[80]:
0 0
1 15
2 40
3 11
dtype: int64
The output is not exactly the same, but the method is more flexible (e.g. can create uneven buckets).
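As an illustration of uneven buckets, the sketch below assigns each timestamp to a bucket defined by hand-picked edges. The edges (0, 2 and 7 minutes after the start) are hypothetical, and np.searchsorted is one possible way to build the grouping vector.

```python
import numpy as np
import pandas as pd

rng = pd.date_range("2000-01-01", periods=12, freq="min")
ts = pd.Series(np.arange(12), index=rng)

# Minutes elapsed since the first timestamp: 0.0, 1.0, ..., 11.0
minutes = ((ts.index - ts.index[0]) / pd.Timedelta("1min")).to_numpy()

# Hypothetical uneven bucket starts: [0, 2), [2, 7), [7, ...)
edges = [0, 2, 7]
labels = np.searchsorted(edges, minutes, side="right") - 1

out = ts.groupby(labels).sum()
print(out)
```

Here bucket 0 holds minutes 0 to 1, bucket 1 holds minutes 2 to 6, and bucket 2 holds the rest.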