I want to convert my datetime column to be the pandas DataFrame index. This is my dataframe:
Date Observed Min Max Sum Count
0 09/15/2018 12:00:00 AM 2 0 2 10 5
1 09/15/2018 01:00:00 AM 1 0 2 25 20
2 09/15/2018 02:00:00 AM 1 0 1 21 21
3 09/15/2018 03:00:00 AM 1 0 2 23 22
4 09/15/2018 04:00:00 AM 1 0 1 21 21
And I want the Date to be the index for the dataframe.
I've looked for answers and have tried this code
dateparse = lambda dates: pd.datetime.strptime(dates, '%m/%d/%Y %I:%M:%S').strftime('%m/%d/%Y %I:%M:%S %p')
data = pd.read_csv('mandol.csv', sep=';', parse_dates=['Date'], index_col = 'Date', date_parser=dateparse)
data.head()
but it still raises an error: ValueError: unconverted data remains: AM
How can I solve this?
Use pd.to_datetime() to convert the Date column and set_index() to set it as your dataframe index.
import pandas as pd
>>> df
Date Observed Min Max Sum Count
0 09/15/2018 12:00:00 AM 2 0 2 10 5
1 09/15/2018 01:00:00 AM 1 0 2 25 20
2 09/15/2018 02:00:00 AM 1 0 1 21 21
3 09/15/2018 03:00:00 AM 1 0 2 23 22
4 09/15/2018 04:00:00 AM 1 0 1 21 21
df.Date = pd.to_datetime(df.Date)
df.set_index('Date', inplace=True)
>>> df
Unnamed: 0 Observed Min Max Sum Count
Date
2018-09-15 00:00:00 0 2 0 2 10 5
2018-09-15 01:00:00 1 1 0 2 25 20
2018-09-15 02:00:00 2 1 0 1 21 21
2018-09-15 03:00:00 3 1 0 2 23 22
2018-09-15 04:00:00 4 1 0 1 21 21
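For reference, the original ValueError occurs because the format string '%m/%d/%Y %I:%M:%S' has no %p directive, so the trailing AM/PM marker is never consumed. If you prefer to keep the read_csv approach, a minimal sketch of a fixed parser (assuming the same mandol.csv with ';' separators):
import pandas as pd
# %p consumes the AM/PM marker that the original format left unparsed
dateparse = lambda d: pd.to_datetime(d, format='%m/%d/%Y %I:%M:%S %p')
data = pd.read_csv('mandol.csv', sep=';', parse_dates=['Date'],
                   index_col='Date', date_parser=dateparse)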
We can set the index to be the Date column values converted with to_datetime (I'm using pop to get values of the Date column and remove it from the DataFrame at the same time):
df.index = pd.to_datetime(df.pop('Date'))
print(df)
Output:
Observed Min Max Sum Count
Date
2018-09-15 00:00:00 2 0 2 10 5
2018-09-15 01:00:00 1 0 2 25 20
2018-09-15 02:00:00 1 0 1 21 21
2018-09-15 03:00:00 1 0 2 23 22
2018-09-15 04:00:00 1 0 1 21 21
Have a look at the set_index() method.
If you use this code, it sets the second column (Date) as the index and parses it with the standard datetime parser from pandas.to_datetime:
ds = pd.read_csv('mandol.csv', sep=';', index_col=1, parse_dates=True)
parse_dates=True automatically transforms the index to a pandas Datetime object.
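To double-check that the parse worked, you can inspect the index (a quick sketch, assuming the ds from above):
print(ds.index.dtype)  # datetime64[ns] means the Date strings were parsed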
I want to interpolate (linear interpolation) my data, but there are no NA values to fill. Here is my data, with many missing timestamps:
timestamp      id  strength
1383260400000  1   -0.3803901328171995
1383261000000  1   -0.42196042219455937
1383265200000  1   -0.460714706261982
My expected output:
timestamp      id  strength
1383260400000  1   -0.3803901328171995
1383261000000  1   -0.42196042219455937
1383261600000  1   Linear interpolated data
1383262200000  1   Linear interpolated data
1383262800000  1   Linear interpolated data
1383263400000  1   Linear interpolated data
1383264000000  1   Linear interpolated data
1383264600000  1   Linear interpolated data
1383265200000  1   -0.460714706261982
The timestamps start at 1383260400000 and end at 1383343800000, and every other id (from 1 to 2025) has the same issue.
The idea is to convert the timestamps to datetimes and set them as the index, then in a lambda function add the missing datetimes per id with Series.asfreq and fill them with interpolate:
df['timestamp'] = pd.to_datetime(df['timestamp'], unit='ms')
f = lambda x: x.asfreq('10Min').interpolate()
df = df.set_index('timestamp').groupby('id')['strength'].apply(f).reset_index()
print (df)
id timestamp strength
0 1 2013-10-31 23:00:00 -0.380390
1 1 2013-10-31 23:10:00 -0.421960
2 1 2013-10-31 23:20:00 -0.427497
3 1 2013-10-31 23:30:00 -0.433033
4 1 2013-10-31 23:40:00 -0.438569
5 1 2013-10-31 23:50:00 -0.444106
6 1 2013-11-01 00:00:00 -0.449642
7 1 2013-11-01 00:10:00 -0.455178
8 1 2013-11-01 00:20:00 -0.460715
Finally, if you need the original timestamp format:
import numpy as np
# datetime64[ns] values become integer nanoseconds; // 1000000 converts to milliseconds
df['timestamp'] = df['timestamp'].astype(np.int64) // 1000000
print (df)
id timestamp strength
0 1 1383260400000 -0.380390
1 1 1383261000000 -0.421960
2 1 1383261600000 -0.427497
3 1 1383262200000 -0.433033
4 1 1383262800000 -0.438569
5 1 1383263400000 -0.444106
6 1 1383264000000 -0.449642
7 1 1383264600000 -0.455178
8 1 1383265200000 -0.460715
EDIT:
#data from question
df = pd.DataFrame({'timestamp': [1383260400000, 1383261000000, 1383265200000],
                   'id': [1, 1, 1],
                   'strength': [-0.3803901328171995, -0.42196042219455937, -0.460714706261982]})
print (df)
timestamp id strength
0 1383260400000 1 -0.380390
1 1383261000000 1 -0.421960
2 1383265200000 1 -0.460715
The solution creates all datetimes for each id with date_range, creates the missing rows with DataFrame.reindex on a MultiIndex, and finally interpolates per id:
df['timestamp'] = pd.to_datetime(df['timestamp'], unit='ms')
r = pd.date_range(pd.to_datetime(1383260400000, unit='ms'),
                  pd.to_datetime(1383343800000, unit='ms'),
                  freq='10Min')
ids = df['id'].unique()
mux = pd.MultiIndex.from_product([r, ids], names=['timestamp','id'])
f = lambda x: x.interpolate()
df = (df.set_index(['timestamp', 'id'])
        .reindex(mux)
        .groupby('id')['strength']
        .transform(f)
        .reset_index())
print (df)
timestamp id strength
0 2013-10-31 23:00:00 1 -0.380390
1 2013-10-31 23:10:00 1 -0.421960
2 2013-10-31 23:20:00 1 -0.427497
3 2013-10-31 23:30:00 1 -0.433033
4 2013-10-31 23:40:00 1 -0.438569
.. ... .. ...
135 2013-11-01 21:30:00 1 -0.460715
136 2013-11-01 21:40:00 1 -0.460715
137 2013-11-01 21:50:00 1 -0.460715
138 2013-11-01 22:00:00 1 -0.460715
139 2013-11-01 22:10:00 1 -0.460715
[140 rows x 3 columns]
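Note the tail of this output: every row after the last observation repeats -0.460715, because interpolate by default also fills NaNs that come after the last valid value. If you only want values interpolated between known observations, a small variation of the lambda (same setup, one extra argument) leaves those trailing rows as NaN:
# only fill NaNs surrounded by valid values on both sides
f = lambda x: x.interpolate(limit_area='inside')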
This is a real use case that I am trying to implement in my work.
Sample data (fake data but similar data structure)
Lap Starttime Endtime
1 10:00:00 10:05:00
format: hh:mm:ss
Desired output
Lap time
1 10:00:00
1 10:01:00
1 10:02:00
1 10:03:00
1 10:04:00
1 10:05:00
So far I am only trying to think through the logic and techniques required... the code below is not correct:
import re
import pandas as pd
df = pd.read_csv('sample.csv')
#1. to determine how many rows to generate. eg. 1000 to 1005 is 6 rows
df['time'] = df['Endtime'] - df['Startime']
#2. add one new row with 1 added minute. eg. 6 rows
for i in No_of_rows:
    if df['time'] < df['Endtime']:  # if 'time' is still before the end time, continue appending
        df['time'] = df['Startime'] += 1  # not sure how to select the Minute part only
    else:
        continue
Pardon my limited coding skills, and I appreciate all the help from you experts. Thanks!
Try with pd.date_range and explode:
#convert to datetime if needed
df["Starttime"] = pd.to_datetime(df["Starttime"], format="%H:%M:%S")
df["Endtime"] = pd.to_datetime(df["Endtime"], format="%H:%M:%S")
#create list of 1min ranges
df["Range"] = df.apply(lambda x: pd.date_range(x["Starttime"], x["Endtime"], freq="1min"), axis=1)
#explode, drop unneeded columns and keep only time
df = df.drop(["Starttime", "Endtime"], axis=1).explode("Range")
df["Range"] = df["Range"].dt.time
>>> df
Range
Lap
1 10:00:00
1 10:01:00
1 10:02:00
1 10:03:00
1 10:04:00
1 10:05:00
Input df:
df = pd.DataFrame({"Lap": [1],
"Starttime": ["10:00:00"],
"Endtime": ["10:05:00"]}).set_index("Lap")
>>> df
Starttime Endtime
Lap
1 10:00:00 10:05:00
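Since the desired output names the column time rather than Range, an optional final rename (assuming the df built above):
df = df.rename(columns={"Range": "time"})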
You can convert the times to datetimes; that will arbitrarily prepend today's date (whatever date you're running this on), but we can remove it later, and it allows for easier manipulation:
>>> bounds = df[['Starttime', 'Endtime']].transform(pd.to_datetime)
>>> bounds
Starttime Endtime
0 2021-09-29 10:00:00 2021-09-29 10:05:00
1 2021-09-29 10:00:00 2021-09-29 10:02:00
Then we can simply use pd.date_range with a 1 minute frequency:
>>> times = bounds.agg(lambda s: pd.date_range(*s, freq='1min'), axis='columns')
>>> times
0 DatetimeIndex(['2021-09-29 10:00:00', '2021-09...
1 DatetimeIndex(['2021-09-29 10:00:00', '2021-09...
dtype: object
Now joining that with the Lap info and using df.explode():
>>> result = df[['Lap']].join(times.rename('time')).explode('time').reset_index(drop=True)
>>> result
Lap time
0 1 2021-09-29 10:00:00
1 1 2021-09-29 10:01:00
2 1 2021-09-29 10:02:00
3 1 2021-09-29 10:03:00
4 1 2021-09-29 10:04:00
5 1 2021-09-29 10:05:00
6 2 2021-09-29 10:00:00
7 2 2021-09-29 10:01:00
8 2 2021-09-29 10:02:00
Finally, we remove the date part, as planned:
>>> result['time'] = result['time'].dt.time
>>> result
Lap time
0 1 10:00:00
1 1 10:01:00
2 1 10:02:00
3 1 10:03:00
4 1 10:04:00
5 1 10:05:00
6 2 10:00:00
7 2 10:01:00
8 2 10:02:00
The objects in your series are now datetime.time.
Here is another way without using apply/agg:
Convert to datetime first:
df["Starttime"] = pd.to_datetime(df["Starttime"], format="%H:%M:%S")
df["Endtime"] = pd.to_datetime(df["Endtime"], format="%H:%M:%S")
Get the difference between the end and start times, then repeat the rows with index.repeat. Then build a per-row minute offset with groupby & cumcount plus pd.to_timedelta, and add it to the existing start time:
repeats = df['Endtime'].sub(df['Starttime']).dt.total_seconds()//60
out = df.loc[df.index.repeat(repeats+1),['Lap','Starttime']]
out['Starttime'] = (out['Starttime'].add(
    pd.to_timedelta(out.groupby("Lap").cumcount(), 'min')).dt.time)
print(out)
Lap Starttime
0 1 10:00:00
0 1 10:01:00
0 1 10:02:00
0 1 10:03:00
0 1 10:04:00
0 1 10:05:00
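The repeated 0 in the index is a side effect of index.repeat; if you want a clean sequential index afterwards, one extra step (assuming the out from above):
out = out.reset_index(drop=True)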
I have a pandas DataFrame with the columns Time Stamp, Id, Latitude, and Longitude, covering about a one-month period. How do I plot the number of distinct visited locations vs. time (per day or week)?
Time Stamp Id Latitude Longitude
08/10/2016 15:22:51:700 1 23 50
08/10/2016 16:28:08:026 1 23 50
08/10/2016 16:28:09:026 1 12 45
08/10/2016 19:00:08:026 2 23 50
08/10/2016 20:28:08:026 1 23 50
08/10/2016 19:00:08:000 2 23 50
09/10/2016 01:02:33:123 2 23 50
09/10/2016 06:15:08:500 1 23 50
09/10/2016 10:01:07:022 3 28 88
I believe you need:
First create a datetime Series with to_datetime - the time part is split off first, so we get dates with no time component (thanks cᴏʟᴅsᴘᴇᴇᴅ for the idea).
Then groupby, take nunique, and finally plot:
days = pd.to_datetime(df['Time Stamp'].str.split().str[0])
s1 = df['Id'].groupby(days).nunique()
print (s1)
Time Stamp
2016-08-10 2
2016-09-10 3
Name: Id, dtype: int64
s1.plot()
For weeks, convert the dates to week numbers:
weeks = days.dt.week
s2 = df['Id'].groupby(weeks).nunique()
print (s2)
Time Stamp
32 2
36 3
Name: Id, dtype: int64
s2.plot()
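A side note: in pandas 1.1+ Series.dt.week is deprecated in favor of the isocalendar API, so the equivalent would be (assuming the days Series from above):
weeks = days.dt.isocalendar().week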
Another approach that covers every date in the range is resample:
df['Days'] = pd.to_datetime(df['Time Stamp'].str.split().str[0])
s2 = df.resample('D', on='Days')['Id'].nunique()
print (s2)
Days
2016-08-10 2
2016-08-11 0
2016-08-12 0
2016-08-13 0
2016-08-14 0
2016-08-15 0
2016-08-16 0
2016-08-17 0
2016-08-18 0
2016-08-19 0
2016-08-20 0
2016-08-21 0
2016-08-22 0
2016-08-23 0
2016-08-24 0
2016-08-25 0
2016-08-26 0
2016-08-27 0
2016-08-28 0
2016-08-29 0
2016-08-30 0
2016-08-31 0
2016-09-01 0
2016-09-02 0
2016-09-03 0
2016-09-04 0
2016-09-05 0
2016-09-06 0
2016-09-07 0
2016-09-08 0
2016-09-09 0
2016-09-10 3
Freq: D, Name: Id, dtype: int64
s2.plot()
For weeks:
s2 = df.resample('W', on='Days')['Id'].nunique()
print (s2)
Days
2016-08-14 2
2016-08-21 0
2016-08-28 0
2016-09-04 0
2016-09-11 3
Freq: W-SUN, Name: Id, dtype: int64
s2.plot()
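plot() renders through matplotlib, so in a plain script (outside a notebook) you also need to show the figure; a minimal sketch:
import matplotlib.pyplot as plt
s2.plot()
plt.show()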
I have this dataframe (type could be 1 or 2):
user_id | timestamp | type
1 | 2015-5-5 12:30 | 1
1 | 2015-5-5 14:00 | 2
1 | 2015-5-5 15:00 | 1
I want to group my data by six hours and when doing this I want to keep type as:
1 (if there is only 1 within that 6 hour frame)
2 (if there is only 2 within that 6 hour frame) or
3 (if there was both 1 and 2 within that 6 hour frame)
Here is my code:
df = df.groupby(['user_id', pd.TimeGrouper(freq='6H')]).mean()
which produces:
user_id | timestamp | type
1 | 2015-5-5 12:00 | 4
However, I want to get 3 instead of 4. How can I replace the mean() in my groupby code to produce the desired output?
Try this:
In [54]: df.groupby(['user_id', pd.Grouper(key='timestamp', freq='6H')]) \
.agg({'type':lambda x: x.unique().sum()})
Out[54]:
type
user_id timestamp
1 2015-05-05 12:00:00 3
PS: this works only with the given types (1, 2), because the sum of the unique values is 3.
Another data set:
In [56]: df
Out[56]:
user_id timestamp type
0 1 2015-05-05 12:30:00 1
1 1 2015-05-05 14:00:00 1
2 1 2015-05-05 15:00:00 1
3 1 2015-05-05 20:00:00 1
In [57]: df.groupby(['user_id', pd.Grouper(key='timestamp', freq='6H')]).agg({'type':lambda x: x.unique().sum()})
Out[57]:
type
user_id timestamp
1 2015-05-05 12:00:00 1
2015-05-05 18:00:00 1
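If you would rather not depend on the unique values happening to sum to 3, a more explicit mapping works for any pair of codes; a sketch using the same grouping:
# 3 if both types occur in the 6-hour window, otherwise the single type present
df.groupby(['user_id', pd.Grouper(key='timestamp', freq='6H')]) \
  .agg({'type': lambda x: 3 if {1, 2} <= set(x) else x.iloc[0]})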
I have the following data frame:
id datetime interval
0 1 20160101 070000 NaN
1 1 20160101 080000 60
2 1 20160102 070000 NaN
3 1 20160102 073000 30
4 2 20160101 071500 NaN
5 2 20160101 071600 1
I would like to generate the interval column - the minutes between rows, but only within the same id and the same day, just like in the example. In SQL I would partition by id and date and use LAG to get the time interval from the previous row. How can I do it in Pandas?
You can convert the datetime column with to_datetime, then use groupby with diff and convert the timedelta to minutes with astype:
print(df)
id datetime interval
0 1 20160101 070000 NaN
1 1 20160101 080000 60
2 1 20160102 070000 NaN
3 1 20160102 073000 30
4 2 20160101 071500 NaN
5 2 20160101 071600 1
df['datetime'] = pd.to_datetime(df['datetime'])
df['new'] = df.groupby(['id', df['datetime'].dt.day])['datetime'].diff().astype('timedelta64[m]')
print(df)
id datetime interval new
0 1 2016-01-01 07:00:00 NaN NaN
1 1 2016-01-01 08:00:00 60 60
2 1 2016-01-02 07:00:00 NaN NaN
3 1 2016-01-02 07:30:00 30 30
4 2 2016-01-01 07:15:00 NaN NaN
5 2 2016-01-01 07:16:00 1 1
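One caveat: dt.day is the day of the month, so in data spanning several months rows from e.g. 2016-01-01 and 2016-02-01 would land in the same group. Grouping by the full calendar date avoids that; a sketch of the same idea:
# group by id and full date rather than day-of-month
df['new'] = (df.groupby(['id', df['datetime'].dt.date])['datetime']
               .diff().astype('timedelta64[m]'))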