I have a dataframe that looks like this:
Hour Minute Second Value
0 14.0 57.0 17.0 0.0
1 14.0 57.0 18.0 0.0
2 14.0 57.0 19.0 138.6
3 14.0 57.0 20.0 138.6
4 14.0 57.0 21.0 138.6
5 14.0 57.0 22.0 138.6
I want to combine the hour/minute/second columns into a timestamp index. I have a date I want to use. I managed to do this using df.apply with datetime.datetime.combine(mydate, datetime.time(hour, min, sec)), but it is too slow.
Is there a way to do this efficiently using built-in pandas functions?
The idea is to multiply Hour and Minute, sum the parts, and prepend the date string before passing everything to to_datetime:
s = df['Hour'].mul(10000) + df['Minute'].mul(100) + df['Second']
df['date'] = pd.to_datetime('2015-01-01 ' + s.astype(str), format='%Y-%m-%d %H%M%S.%f')
print (df)
Hour Minute Second Value date
0 14.0 57.0 17.0 0.0 2015-01-01 14:57:17
1 14.0 57.0 18.0 0.0 2015-01-01 14:57:18
2 14.0 57.0 19.0 138.6 2015-01-01 14:57:19
3 14.0 57.0 20.0 138.6 2015-01-01 14:57:20
4 14.0 57.0 21.0 138.6 2015-01-01 14:57:21
5 14.0 57.0 22.0 138.6 2015-01-01 14:57:22
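One caveat worth hedging here: if an hour below 10 appears, the concatenated string loses its fixed width and the %H%M%S format may no longer parse it cleanly. A small defensive variant (the zero-padding is my addition, not part of the answer above) pads each string to the expected "HHMMSS.f" width first:
s = df['Hour'].mul(10000) + df['Minute'].mul(100) + df['Second']
# zero-pad to 8 characters ("HHMMSS.f") so e.g. 9:05:07 -> "090507.0"
df['date'] = pd.to_datetime('2015-01-01 ' + s.astype(str).str.zfill(8),
                            format='%Y-%m-%d %H%M%S.%f')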
Another option is to convert Hour and Minute to seconds, turn the summed seconds into a timedelta, and add that to the date:
mydate = pd.to_datetime('2020-02-05')
df['timestamp'] = pd.to_timedelta(df.Hour*3600 + df.Minute*60 + df.Second,
                                  unit='sec').add(mydate)
Output:
Hour Minute Second Value timestamp
0 14.0 57.0 17.0 0.0 2020-02-05 14:57:17
1 14.0 57.0 18.0 0.0 2020-02-05 14:57:18
2 14.0 57.0 19.0 138.6 2020-02-05 14:57:19
3 14.0 57.0 20.0 138.6 2020-02-05 14:57:20
4 14.0 57.0 21.0 138.6 2020-02-05 14:57:21
5 14.0 57.0 22.0 138.6 2020-02-05 14:57:22
The resulting timestamps on their own are a datetime64[ns] Series:
0   2020-02-05 14:57:17
1   2020-02-05 14:57:18
2   2020-02-05 14:57:19
3   2020-02-05 14:57:20
4   2020-02-05 14:57:21
5   2020-02-05 14:57:22
dtype: datetime64[ns]
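Since the goal was a timestamp index, either result can then be moved into the index. A minimal sketch building on the timedelta approach above (dropping the original columns is my assumption, not a requirement):
import pandas as pd

mydate = pd.to_datetime('2020-02-05')
# build the timestamps, promote them to the index, and drop the source columns
ts = mydate + pd.to_timedelta(df['Hour'] * 3600 + df['Minute'] * 60 + df['Second'], unit='s')
df = df.set_index(ts).drop(columns=['Hour', 'Minute', 'Second'])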
I have a large dataset which contains a date column that starts in 2019. Now I want to generate, in a separate column, the week number that each of those dates falls in.
Here is what the date column looks like:
import pandas as pd
data = {'date': ['2019-09-10', 'NaN', '2019-10-07', '2019-11-04', '2019-11-28',
'2019-12-02', '2020-01-24', '2020-01-29', '2020-02-05',
'2020-02-12', '2020-02-14', '2020-02-24', '2020-03-11',
'2020-03-16', '2020-03-17', '2020-03-18', '2021-09-14',
'2021-09-30', '2021-10-07', '2021-10-08', '2021-10-12',
'2021-10-14', '2021-10-15', '2021-10-19', '2021-10-21',
'2021-10-26', '2021-10-28', '2021-10-29', '2021-11-02',
'2021-11-15', '2021-11-16', '2021-12-01', '2021-12-07',
'2021-12-09', '2021-12-10', '2021-12-14', '2021-12-15',
'2022-01-13', '2022-01-14', '2022-01-21', '2022-01-24',
'2022-01-25', '2022-01-27', '2022-01-31', '2022-02-01',
'2022-02-10', '2022-02-11', '2022-02-16', '2022-02-24']}
df = pd.DataFrame(data)
Now, starting from the first day this data was collected, I want to count 7 days using the date column and make a week out of them. For example, if the first week contains the first 7 dates, I create a column entry and call it week one. I want to repeat the same process until the last week the data was collected.
It may be a good idea to sort the dates in order, from the first date to the current one.
I have tried this, but it's not generating the weeks in order; it actually produces repeated week numbers:
pd.to_datetime(df['date'], errors='coerce').dt.week
My intention is, starting from the first date the data was collected, to count 7 days and store those as week one, then continue incrementally until the last week, say week number 66.
Here is the expected column of weeks created from the date column:
import pandas as pd
week_df = {'weeks': ['1', '2', "3", "5", '6']}
df_weeks = pd.DataFrame(week_df)
If I understand correctly (IIUC), use:
df['date'] = pd.to_datetime(df['date'])
df['week'] = df['date'].sub(df['date'].iat[0]).dt.days // 7 + 1
print (df.head(10))
date week
0 2019-09-10 1.0
1 NaT NaN
2 2019-10-07 4.0
3 2019-11-04 8.0
4 2019-11-28 12.0
5 2019-12-02 12.0
6 2020-01-24 20.0
7 2020-01-29 21.0
8 2020-02-05 22.0
9 2020-02-12 23.0
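The floats in the week column only appear because of the missing date. Two optional tweaks, both assumptions layered on the answer above rather than part of it: anchor on the earliest date (in case the column is not sorted) and cast to pandas' nullable integer dtype so the missing row stays missing:
df['date'] = pd.to_datetime(df['date'], errors='coerce')
# anchor on the earliest date and keep missing dates as <NA> instead of floats
df['week'] = (df['date'].sub(df['date'].min()).dt.days // 7 + 1).astype('Int64')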
You have more than 66 weeks here, so either you want the real week count since the beginning or you want a dummy week rank. See below for both solutions:
# convert to week period
s = pd.to_datetime(df['date']).dt.to_period('W')
# get real week number
df['week'] = s.sub(s.iloc[0]).dropna().apply(lambda x: x.n).add(1)
# get dummy week rank
df['week2'] = s.rank(method='dense')
output:
date week week2
0 2019-09-10 1.0 1.0
1 NaN NaN NaN
2 2019-10-07 5.0 2.0
3 2019-11-04 9.0 3.0
4 2019-11-28 12.0 4.0
5 2019-12-02 13.0 5.0
6 2020-01-24 20.0 6.0
7 2020-01-29 21.0 7.0
8 2020-02-05 22.0 8.0
9 2020-02-12 23.0 9.0
10 2020-02-14 23.0 9.0
11 2020-02-24 25.0 10.0
12 2020-03-11 27.0 11.0
13 2020-03-16 28.0 12.0
14 2020-03-17 28.0 12.0
15 2020-03-18 28.0 12.0
16 2021-09-14 106.0 13.0
17 2021-09-30 108.0 14.0
18 2021-10-07 109.0 15.0
19 2021-10-08 109.0 15.0
...
42 2022-01-27 125.0 26.0
43 2022-01-31 126.0 27.0
44 2022-02-01 126.0 27.0
45 2022-02-10 127.0 28.0
46 2022-02-11 127.0 28.0
47 2022-02-16 128.0 29.0
48 2022-02-24 129.0 30.0
The small reproducible example below sets up a dataframe that is 100 years in length containing some randomly generated values. It then inserts three 100-day stretches of missing values. Using this small example, I am attempting to sort out the pandas commands that will fill in the missing days using average values for that day of the year (hence the use of .groupby) with a condition. For example, if April 12th is missing, how can the last line of code be altered such that only the 10 nearest April 12ths are used to fill in the missing value? In other words, a missing April 12th value in 1920 would be filled in using the mean of the April 12th values between 1915 and 1925; a missing April 12th value in 2000 would be filled in with the mean of the April 12th values between 1995 and 2005, etc. I tried playing around with adding a .rolling() to the lambda function in the last line of the script, but was unsuccessful.
Bonus question: The example below extends from 1918 to 2018. If a value is missing on April 12th 1919, for example, it would still be nice if ten April 12ths were used to fill in the missing value even though the window couldn't be 'centered' on the missing day because of its proximity to the beginning of the time series. Is there a solution to the first question above that would be flexible enough to still use a minimum of 10 values when missing values are close to the beginning and ending of the time series?
import pandas as pd
import numpy as np
import random
# create 100 yr time series
dates = pd.date_range(start="1918-01-01", end="2018-12-31").strftime("%Y-%m-%d")
vals = [random.randrange(1, 50, 1) for i in range(len(dates))]
# Create some arbitrary gaps
vals[100:200] = vals[9962:10062] = vals[35895:35995] = [np.nan] * 100
# Create dataframe
df = pd.DataFrame(dict(
list(
zip(["Date", "vals"],
[dates, vals])
)
))
# confirm missing vals
df.iloc[95:105]
df.iloc[35890:35900]
# set a date index (for use by groupby)
df.index = pd.DatetimeIndex(df['Date'])
df['Date'] = df.index
# Need help restricting the mean to the 10 nearest same-days-of-the-year:
df['vals'] = df.groupby([df.index.month, df.index.day])['vals'].transform(lambda x: x.fillna(x.mean()))
This answers both parts:
build a DataFrame dfr that holds the calculation you want
the lambda function returns a dict {year: val, ...}
make sure the indexes are named in a reasonable way
expand the dict out into columns with apply(pd.Series)
reshape by putting the year columns back into the index
merge() the built DataFrame with the original df; the vals column contains the NaNs and column 0 is the value to fill with
finally, fillna()
# create 100 yr time series
dates = pd.date_range(start="1918-01-01", end="2018-12-31")
vals = [random.randrange(1, 50, 1) for i in range(len(dates))]
# Create some arbitrary gaps
vals[100:200] = vals[9962:10062] = vals[35895:35995] = [np.nan] * 100
# Create dataframe - simplified from question...
df = pd.DataFrame({"Date":dates,"vals":vals})
df[df.isna().any(axis=1)]
ystart = df.Date.dt.year.min()
# generate rolling means for month/day. bfill for when it's start of series
dfr = (df.groupby([df.Date.dt.month, df.Date.dt.day])["vals"]
.agg(lambda s: {y+ystart:v for y,v in enumerate(s.dropna().rolling(5).mean().bfill())})
.to_frame().rename_axis(["month","day"])
)
# expand dict into columns and reshape to by indexed by month,day,year
dfr = dfr.join(dfr.vals.apply(pd.Series)).drop(columns="vals").rename_axis("year",axis=1).stack().to_frame()
# get df index back, plus vals & fillna (column 0) can be seen alongside each other
dfm = df.merge(dfr, left_on=[df.Date.dt.month,df.Date.dt.day,df.Date.dt.year], right_index=True)
# finally, what we really want to do - fill the NaNs
df.fillna(dfm[0])
Analysis: taking the NaN for 11-Apr-1918, the fill value defaults to 22 because it is backfilled from the first complete 5-year window:
(12 + 2 + 47 + 47 + 2) / 5 == 22
dfm.query("key_0==4 & key_1==11").head(7)
      key_0  key_1  key_2                 Date  vals     0
100       4     11   1918  1918-04-11 00:00:00   NaN  22.0
465       4     11   1919  1919-04-11 00:00:00  12.0  22.0
831       4     11   1920  1920-04-11 00:00:00   2.0  22.0
1196      4     11   1921  1921-04-11 00:00:00  47.0  27.0
1561      4     11   1922  1922-04-11 00:00:00  47.0  36.0
1926      4     11   1923  1923-04-11 00:00:00   2.0  34.6
2292      4     11   1924  1924-04-11 00:00:00  37.0  29.4
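To stay closer to the literal "ten nearest same days of the year", the question's groupby/transform shape can also be kept and a centered rolling mean applied inside each day-of-year group. This is only a sketch under the question's setup (a DatetimeIndex and a vals column); window=10, center=True and min_periods=1 are assumptions, and near the start and end of the series the window simply uses however many neighbours exist rather than a full ten:
import pandas as pd

def fill_from_nearby_years(group, window=10):
    # each group holds one day-of-year across all years, in chronological order;
    # rolling mean skips the NaNs themselves, and min_periods=1 keeps the edges usable
    nearby_mean = group.rolling(window, center=True, min_periods=1).mean()
    return group.fillna(nearby_mean)

df['vals'] = (df.groupby([df.index.month, df.index.day])['vals']
                .transform(fill_from_nearby_years))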
I'm not sure how well I've captured the intent of your question. The approach I've taken satisfies two requirements:
Need an arbitrary number of averages
Use those averages to fill in the NAs
Simply put, instead of filling in the NAs from the surrounding dates, I fill them in with averages extracted from any number of years.
import pandas as pd
import numpy as np
import random
# create 100 yr time series
dates = pd.date_range(start="1918-01-01", end="2018-12-31").strftime("%Y-%m-%d")
vals = [random.randrange(1, 50, 1) for i in range(len(dates))]
# Create some arbitrary gaps
vals[100:200] = vals[9962:10062] = vals[35895:35995] = [np.nan] * 100
# Create dataframe
df = pd.DataFrame(dict(
list(
zip(["Date", "vals"],
[dates, vals])
)
))
df['Date'] = pd.to_datetime(df['Date'])
# split each date into a month-day key and a year key for pivoting
df['mm-dd'] = df['Date'].apply(lambda x: '{:02}-{:02}'.format(x.month, x.day))
df['yyyy'] = df['Date'].apply(lambda x: '{:04}'.format(x.year))
# pivot to one row per day-of-year and one column per year
df = df.iloc[:, 1:].pivot(index='mm-dd', columns='yyyy')
df.columns = df.columns.droplevel(0)
# count missing years per day-of-year, then take the mean of a random sample of 10 year columns
df['nans'] = df.isnull().sum(axis=1)
df['10n_mean'] = df.iloc[:, :-1].sample(n=10, axis=1).mean(axis=1)
df['10n_mean'] = df['10n_mean'].round(1)
df.loc[df['nans'] >= 1]
yyyy 1918 1919 1920 1921 1922 1923 1924 1925 1926 1927 ... 2011 2012 2013 2014 2015 2016 2017 2018 nans 10n_mean
mm-dd
02-29 NaN NaN 34.0 NaN NaN NaN 2.0 NaN NaN NaN ... NaN 49.0 NaN NaN NaN 32.0 NaN NaN 76 21.6
04-11 NaN 43.0 12.0 28.0 29.0 28.0 1.0 38.0 11.0 3.0 ... 17.0 35.0 8.0 17.0 34.0 NaN 5.0 33.0 3 29.7
04-12 NaN 19.0 38.0 34.0 48.0 46.0 28.0 29.0 29.0 14.0 ... 41.0 16.0 9.0 39.0 8.0 NaN 1.0 12.0 3 21.3
04-13 NaN 33.0 26.0 47.0 21.0 26.0 20.0 16.0 11.0 7.0 ... 5.0 11.0 34.0 28.0 27.0 NaN 2.0 46.0 3 21.3
04-14 NaN 36.0 19.0 6.0 45.0 41.0 24.0 39.0 1.0 11.0 ... 30.0 47.0 45.0 14.0 48.0 NaN 16.0 8.0 3 24.7
df_mean = df.T.fillna(df['10n_mean'], downcast='infer').T
df_mean.loc[df_mean['nans'] >= 1]
yyyy 1918 1919 1920 1921 1922 1923 1924 1925 1926 1927 ... 2011 2012 2013 2014 2015 2016 2017 2018 nans 10n_mean
mm-dd
02-29 21.6 21.6 34.0 21.6 21.6 21.6 2.0 21.6 21.6 21.6 ... 21.6 49.0 21.6 21.6 21.6 32.0 21.6 21.6 76.0 21.6
04-11 29.7 43.0 12.0 28.0 29.0 28.0 1.0 38.0 11.0 3.0 ... 17.0 35.0 8.0 17.0 34.0 29.7 5.0 33.0 3.0 29.7
04-12 21.3 19.0 38.0 34.0 48.0 46.0 28.0 29.0 29.0 14.0 ... 41.0 16.0 9.0 39.0 8.0 21.3 1.0 12.0 3.0 21.3
04-13 21.3 33.0 26.0 47.0 21.0 26.0 20.0 16.0 11.0 7.0 ... 5.0 11.0 34.0 28.0 27.0 21.3 2.0 46.0 3.0 21.3
04-14 24.7 36.0 19.0 6.0 45.0 41.0 24.0 39.0 1.0 11.0 ... 30.0 47.0 45.0 14.0 48.0 24.7 16.0 8.0 3.0 24.7
I have a dataframe 500 rows long by 4 columns. I need to find the proper Python code that would divide the current row by the row below and then multiply that value by the value in the last row, for every value in each column. Basically, I need to replicate this Excel formula.
It's not clear whether your data is stored in a NumPy array; if it is, with the original data contained in an array a, you'd write:
b = a[-1]*(a[:-1]/a[+1:])
a[-1] is the last row, a[:-1] is the array without the last row, and a[+1:] is the array without the first (index zero) row.
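A tiny runnable check of that formula; the array a below is made up purely for illustration:
import numpy as np

a = np.array([[10., 78.],
              [72., 42.],
              [60., 57.]])
# each row divided by the row below it, scaled by the last row
b = a[-1] * (a[:-1] / a[1:])
print(b)  # approximately [[8.33, 105.86], [72., 42.]]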
Assuming you are talking about a pandas DataFrame:
import pandas as pd
import random
# sample DataFrame object
df = pd.DataFrame((float(random.randint(1, 100)),
float(random.randint(1, 100)),
float(random.randint(1, 100)),
float(random.randint(1, 100)))
for _ in range(10))
def function(col):
for i in range(len(col)-1):
col[i] = (col[i]/col[i+1])*col[len(col)-1]
print(df) # before formula apply
df.apply(function)
print(df) # after formula apply
>>>
0 1 2 3
0 10.0 78.0 27.0 23.0
1 72.0 42.0 77.0 86.0
2 82.0 12.0 58.0 98.0
3 27.0 92.0 19.0 86.0
4 48.0 83.0 14.0 43.0
5 55.0 18.0 58.0 77.0
6 20.0 58.0 20.0 22.0
7 76.0 19.0 63.0 82.0
8 23.0 99.0 58.0 15.0
9 60.0 57.0 89.0 100.0
0 1 2 3
0 8.333333 105.857143 31.207792 26.744186
1 52.682927 199.500000 118.155172 87.755102
2 182.222222 7.434783 271.684211 113.953488
3 33.750000 63.180723 120.785714 200.000000
4 52.363636 262.833333 21.482759 55.844156
5 165.000000 17.689655 258.100000 350.000000
6 15.789474 174.000000 28.253968 26.829268
7 198.260870 10.939394 96.672414 546.666667
8 23.000000 99.000000 58.000000 15.000000
9 60.000000 57.000000 89.000000 100.000000
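For completeness, the same Excel-style formula can be written without an explicit loop. A hedged sketch using shift, assuming the df from the answer above; note that the last row comes out as NaN here (it has no row below), whereas the loop above leaves it unchanged:
# divide each row by the row below it, then scale by the last row
result = df.div(df.shift(-1)).mul(df.iloc[-1], axis=1)
print(result)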
I have the following dataframe with hourly data:
tmp min_tmp max_tmp
dates
2017-07-19 14:00:00 19.0 19.0 19.0
2017-07-19 15:00:00 18.0 18.0 18.0
2017-07-19 16:00:00 16.0 16.0 16.0
2017-07-19 17:00:00 16.0 16.0 16.0
2017-07-19 18:00:00 15.0 15.0 15.0
Is there a way to compute the daily minimum and maximum values of tmp into min_tmp and max_tmp, respectively? I tried this:
df['min_temp'] = df['tmp'].min()
but this does not work for data that spans multiple days.
Use resample and transform:
g = df.resample('D')['tmp']
df['min_tmp'] = g.transform('min')
df['max_tmp'] = g.transform('max')
Output
tmp min_tmp max_tmp
dates
2017-07-19 14:00:00 19.0 15.0 19.0
2017-07-19 15:00:00 18.0 15.0 19.0
2017-07-19 16:00:00 16.0 15.0 19.0
2017-07-19 17:00:00 16.0 15.0 19.0
2017-07-19 18:00:00 15.0 15.0 19.0
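A hedged alternative that aggregates once per day and then broadcasts both values back onto the hourly rows; it assumes the DatetimeIndex from the question and uses reindex for the broadcast:
# one aggregation pass per day, then repeat the daily values for every hour
daily = df.groupby(df.index.normalize())['tmp'].agg(['min', 'max'])
daily.columns = ['min_tmp', 'max_tmp']
df[['min_tmp', 'max_tmp']] = daily.reindex(df.index.normalize()).to_numpy()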
Here is the min/max computed by day. You have to group by day, month, and year simultaneously:
df['tmp'].groupby([df.index.day, df.index.month, df.index.year]).min()
Using transform in pandas:
df['Date'] = df.index
df.Date = pd.to_datetime(df.Date)
funcs = {'min_tmp': min, 'max_tmp': max}
for key, value in funcs.items():
    print(key, value)
    df[key] = df.groupby(df.Date.dt.date)['tmp'].transform(func=value)
df.drop(['Date'], axis=1)
Out[469]:
tmp min_tmp max_tmp
Date
7/19/2017 14:00 19 15 19
7/19/2017 15:00 18 15 19
7/19/2017 16:00 16 15 19
7/19/2017 17:00 16 15 19
7/19/2017 18:00 15 15 19
I am too lazy to repeat the loop, but you can simply do this to achieve the same result:
df['max_tmp']=df.groupby(df.Date.dt.date)['tmp'].transform(max)
df['min_tmp']=df.groupby(df.Date.dt.date)['tmp'].transform(min)
A dataframe contains only a few timestamps per day, and I need to select the latest one for each date (not the values, the timestamp itself). The df looks like this:
A B C
2016-12-05 12:00:00+00:00 126.0 15.0 38.54
2016-12-05 16:00:00+00:00 131.0 20.0 42.33
2016-12-14 05:00:00+00:00 129.0 18.0 43.24
2016-12-15 03:00:00+00:00 117.0 22.0 33.70
2016-12-15 04:00:00+00:00 140.0 23.0 34.81
2016-12-16 03:00:00+00:00 120.0 21.0 32.24
2016-12-16 04:00:00+00:00 142.0 22.0 35.20
I managed to achieve what I needed by defining the following function:
def find_last_h(df, column):
    newindex = []
    df2 = df.resample('d').last().dropna()
    for x in df2[column].values:
        newindex.append(df[df[column] == x].index.values[0])
    return pd.DatetimeIndex(newindex)
with which I specify which column's values to use as a filter to get the desired timestamps. The issue here is that, in the case of non-unique values, this might not work as desired.
Another way that is used is:
grouped = df.groupby([df.index.day,df.index.hour])
grouped.groupby(level=0).last()
and then reconstruct the timestamps, but it is even more verbose. What is the smart way?
Use boolean indexing with a mask created by duplicated, and floor to truncate the times:
idx = df.index.floor('D')
df = df[~idx.duplicated(keep='last') | ~idx.duplicated(keep=False)]
print (df)
A B C
2016-12-05 16:00:00 131.0 20.0 42.33
2016-12-14 05:00:00 129.0 18.0 43.24
2016-12-15 04:00:00 140.0 23.0 34.81
2016-12-16 04:00:00 142.0 22.0 35.20
Another solution with reset_index + set_index:
df = df.reset_index().groupby([df.index.date]).last().set_index('index')
print (df)
A B C
index
2016-12-05 16:00:00 131.0 20.0 42.33
2016-12-14 05:00:00 129.0 18.0 43.24
2016-12-15 04:00:00 140.0 23.0 34.81
2016-12-16 04:00:00 142.0 22.0 35.20
resample and groupby on the date alone lose the times:
print (df.resample('1D').last().dropna())
A B C
2016-12-05 131.0 20.0 42.33
2016-12-14 129.0 18.0 43.24
2016-12-15 140.0 23.0 34.81
2016-12-16 142.0 22.0 35.20
print (df.groupby([df.index.date]).last())
A B C
2016-12-05 131.0 20.0 42.33
2016-12-14 129.0 18.0 43.24
2016-12-15 140.0 23.0 34.81
2016-12-16 142.0 22.0 35.20
How about df.resample('24H', kind='period').last().dropna()?
You can groupby the date and just take the max of each datetime to get the last datetime on each date.
This may look like:
df.groupby(df["datetime"].dt.date)["datetime"].max()
or something like
df.groupby(pd.Grouper(freq='D'))["datetime"].max()
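And if what you actually need is the full row at each day's latest timestamp rather than just the timestamps themselves, a short sketch assuming the timestamps live in a sorted DatetimeIndex as in the example:
# the last row within each calendar day is that day's latest timestamp
df_sorted = df.sort_index()
latest_rows = df_sorted.groupby(df_sorted.index.normalize()).tail(1)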