I have a column with hh:mm:ss and a separate column with the decimal seconds.
I have quite a horrible text file to process, and the decimal part of my time is separated out into another column. Now I'd like to concatenate them back together.
For example:
df = pd.DataFrame({'Time':['01:00:00','01:00:00 AM','01:00:01 AM','01:00:01 AM'],
                   'DecimalSecond':['14','178','158','75']})
I tried the following but it didn't work. It gives me "01:00:00 AM.14" LOL
df['Time2'] = df['Time'].map(str) + '.' + df['DecimalSecond'].map(str)
The goal is to come up with one column named "Time2" which has 01:00:00.14 AM in the first row, 01:00:00.178 AM in the second row, etc.
Thank you for the help.
You can convert the output to datetimes and then call Series.dt.time:
# split the Time column by space and keep the value before the first space
s = df['Time'].astype(str).str.split().str[0] + '.' + df['DecimalSecond'].astype(str)
df['Time2'] = pd.to_datetime(s).dt.time
print(df)
Time DecimalSecond Time2
0 01:00:00 14 01:00:00.140000
1 01:00:00 AM 178 01:00:00.178000
2 01:00:01 AM 158 01:00:01.158000
3 01:00:01 AM 75 01:00:01.750000
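Note that dt.time drops the AM/PM marker, while the question asks for e.g. 01:00:00.14 AM. A minimal string-building sketch that keeps the marker (assuming rows without a marker should be treated as AM):
parts = df['Time'].astype(str).str.split(' ', n=1)
marker = parts.str[1].fillna('AM')  # assumption: a missing marker means AM
df['Time2'] = parts.str[0] + '.' + df['DecimalSecond'].astype(str) + ' ' + marker
print(df['Time2'].iloc[0])  # 01:00:00.14 AM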
Please see the Python code below:
In [1]:
import pandas as pd
In [2]:
df = pd.DataFrame({'Time':['01:00:00','01:00:00','01:00:01','01:00:01'],
                   'DecimalSecond':['14','178','158','75']})
In [3]:
df['Time2'] = df[['Time','DecimalSecond']].apply(lambda x: ' '.join(x), axis = 1)
print(df)
Time DecimalSecond Time2
0 01:00:00 14 01:00:00 14
1 01:00:00 178 01:00:00 178
2 01:00:01 158 01:00:01 158
3 01:00:01 75 01:00:01 75
In [4]:
df.iloc[:,2]
Out[4]:
0 01:00:00 14
1 01:00:00 178
2 01:00:01 158
3 01:00:01 75
Name: Time2, dtype: object
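Note that joining with ' ' gives "01:00:00 14" rather than the dotted form the question asks for; swapping in a '.' separator in the same apply is a minimal fix:
df['Time2'] = df[['Time','DecimalSecond']].apply(lambda x: '.'.join(x), axis = 1)
# 0    01:00:00.14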
I have a column of Call Duration formatted as mm.ss and I would like to convert it to all seconds.
It looks like this:
CallDuration
25 29.02
183 5.40
213 3.02
290 10.27
304 2.00
...
4649990 13.02
4650067 5.33
4650192 19.47
4650197 3.44
4650204 14.15
In Excel I would separate the column at the ".", multiply the minutes column by 60, and then add it to the seconds column for my total seconds. I feel like this should be much easier with pandas/Python, but I cannot figure it out.
I tried using pd.to_timedelta, but that did not give me what I need: I can't figure out how to tell it how the time is formatted. When I pass 'm', it does not come back correctly, because everything after the "." is treated as a fraction of a minute rather than as seconds:
pd.to_timedelta(post_group['CallDuration'],'m')
25 0 days 00:29:01.200000
183 0 days 00:05:24
213 0 days 00:03:01.200000
290 0 days 00:10:16.200000
304 0 days 00:02:00
...
4649990 0 days 00:13:01.200000
4650067 0 days 00:05:19.800000
4650192 0 days 00:19:28.200000
4650197 0 days 00:03:26.400000
4650204 0 days 00:14:09
Name: CallDuration, Length: 52394, dtype: timedelta64[ns]
I tried doing it this way, but now I can't get the 'sec' column to convert to an integer because there are blanks, and it won't fill the blanks...
post_duration = post_group['CallDuration'].str.split(".",expand=True)
post_duration.columns = ["min","sec"]
post_duration['min'] = post_duration['min'].astype(int)
post_duration['min'] = 60*post_duration['min']
post_duration.loc['Total', 'min'] = post_duration['min'].sum()
post_duration
min sec
25 1740.0 02
183 300.0 4
213 180.0 02
290 600.0 27
304 120.0 None
... ... ...
4650067 300.0 33
4650192 1140.0 47
4650197 180.0 44
4650204 840.0 15
Total 24902700.0 NaN
post_duration2 = post_group['CallDuration'].str.split(".",expand=True)
post_duration2.columns = ["min","sec"]
post_duration2['sec'].astype(float).astype('Int64')
post_duration2.fillna(0)
post_duration2.loc['Total', 'sec'] = post_duration2['sec'].sum()
post_duration2
TypeError: object cannot be converted to an IntegerDtype
Perhaps there's a more efficient way, but I would still convert to a timedelta format then use apply with the Timedelta.total_seconds() method to get the column in seconds.
import pandas as pd
pd.to_timedelta(post_group['CallDuration'], 'm').apply(pd.Timedelta.total_seconds)
You can find more info on the attributes and methods you can call on timedeltas here.
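As the output in the question shows, pd.to_timedelta(..., 'm') reads 29.02 as 29.02 minutes (29 min 1.2 s). If the dot really separates whole minutes from whole seconds, the split-and-sum recipe described in the question maps to pandas directly; a minimal sketch, assuming string values and treating a missing seconds part as zero:
import pandas as pd

s = pd.Series(['29.02', '5.40', '3.02', '10.27', '2.00'], name='CallDuration')
parts = s.str.split('.', expand=True)  # "mm.ss" -> minutes / seconds columns
total_seconds = parts[0].astype(int) * 60 + parts[1].fillna('0').astype(int)
print(total_seconds.tolist())  # [1742, 340, 182, 627, 120]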
import pandas as pd
import numpy as np
import datetime
def convert_to_seconds(col_data):
    col_data = pd.to_datetime(col_data, format="%M:%S")
    # the line above attaches the default date 1900-01-01, so subtract it back out
    col_data = col_data - datetime.datetime(1900, 1, 1)
    return col_data.dt.total_seconds()

df = pd.DataFrame({'CallDuration': ['2:02',
                                    '5:50',
                                    np.nan,
                                    '3:02']})
df['CallDuration'] = convert_to_seconds(df['CallDuration'])
Here's the result:
CallDuration
0 122.0
1 350.0
2 NaN
3 182.0
You can also use the above code to convert HH:MM strings to total seconds as a float, but only if the number of hours is less than 24 (and the format string becomes "%H:%M").
And if you want to convert multiple columns in your dataframe replace
df['CallDuration'] = convert_to_seconds(df['CallDuration'])
with
new_df = df.apply(lambda col: convert_to_seconds(col) if col.name in colnames_list else col)
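To apply this function to the question's dot-separated mm.ss values, one option (a sketch, assuming the dot always separates whole minutes from whole seconds) is to swap the dot for a colon first so that the "%M:%S" format can parse it:
df['CallDuration'] = convert_to_seconds(
    df['CallDuration'].str.replace('.', ':', regex=False))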
I have a csv file containing 2 columns: id, val
where id is the number of the day (total 365)
Is it possible to convert the number id to dates in format '%d-%m-%Y'?
In fact I want to map the ids to the days of year 2015, e.g. 01-01-2015, etc.
How can I do this with pandas in Python?
Following is part of the file, together with the desired output:
"id" "val"
1 49
2 48
3 46
4 45
"date" "val"
01-01-2015 49
02-01-2015 48
03-01-2015 46
04-01-2015 45
Use pd.tseries.offsets.Day:
df['date'] = pd.Timestamp('2015-01-01') \
+ df['id'].sub(1).apply(pd.tseries.offsets.Day)
An alternative, proposed by @HenryEcker:
df['date'] = pd.Timestamp('2015-01-01') \
- pd.Timedelta(days=1) \
+ df['id'].apply(pd.tseries.offsets.Day)
>>> df['id'].sub(1).apply(pd.tseries.offsets.Day)
0 <0 * Days>
1 <Day>
2 <2 * Days>
3 <3 * Days>
Name: id, dtype: object
>>> df
id val date
0 1 49 2015-01-01
1 2 48 2015-01-02
2 3 46 2015-01-03
3 4 45 2015-01-04
You can convert id to datetime and format the output with strftime:
df['Date'] = pd.to_datetime(df['id'].astype(str)+"-2015", format='%j-%Y').dt.strftime('%d-%m-%Y')
Result:
   id  val        Date
0   1   49  01-01-2015
1   2   48  02-01-2015
2   3   46  03-01-2015
3   4   45  04-01-2015
df.columns = ['date', 'val']
for i, contents in enumerate(df['date']):
    info = str(contents)
    if contents < 10:
        info = str(0) + info
    df['date'][i] = "01-" + info + "-2015"
This iterates through your column and converts it to date formatting.
Or like this:
df['Date'] = pd.Timestamp('2014-12-31') + df['id'].apply(lambda x: pd.Timedelta(days=x))
Output:
id val Date
0 1 49 2015-01-01
1 2 48 2015-01-02
2 3 46 2015-01-03
3 4 45 2015-01-04
You can use pd.to_timedelta() on the id column to turn its values into day offsets for adding to the base date, as follows:
df['date'] = pd.Timestamp('2015-01-01') + pd.to_timedelta(df['id'] - 1, unit='day')
Result:
print(df)
id val date
0 1 49 2015-01-01
1 2 48 2015-01-02
2 3 46 2015-01-03
3 4 45 2015-01-04
If you want the date in dd-mm-YYYY format, you can use it together with .dt.strftime(), as follows:
df['date2'] = (pd.Timestamp('2015-01-01') + pd.to_timedelta(df['id'] - 1, unit='day')).dt.strftime('%d-%m-%Y')
Result:
print(df)
id val date date2
0 1 49 2015-01-01 01-01-2015
1 2 48 2015-01-02 02-01-2015
2 3 46 2015-01-03 03-01-2015
3 4 45 2015-01-04 04-01-2015
The day count doesn't say which year to choose, so I'm not sure about the year, but you can convert it into months and days.
Rename the id column of your csv to Date, then:
df['Date'] = pd.to_datetime(df['Date'], format='%j').dt.strftime('%m-%d')
This will turn it into a month-day date (the year defaults to 1900); you can then add the year manually.
I am having some troubles pivoting a dataframe with a datetime value as the index.
my df looks like this:
Timestamp Value
2016-01-01 00:00:00 16.546900
2016-01-01 01:00:00 16.402375
2016-01-01 02:00:00 16.324250
Where the timestamp is a datetime64[ns]. I am trying to pivot the table so that it looks like this:
Hour 0 1 2 4 ....
Date
2016-01-01 16.5 16.4 16.3 17 ....
....
....
I've tried using the code below but am getting an error when I run it.
df3 = pd.pivot_table(df2,index=np.unique(df2.index.date),columns=np.unique(df2.index.hour),values=df2.Temp)
KeyError Traceback (most recent call last)
<ipython-input> in <module>()
1 # Pivot Table
----> 2 df3 = pd.pivot_table(df2,index=np.unique(df2.index.date),columns=np.unique(df2.index.hour),values=df2.Temp)
~\Anaconda3\lib\site-packages\pandas\core\reshape\pivot.py in pivot_table(data, values, index, columns, aggfunc, fill_value, margins, dropna, margins_name)
56 for i in values:
57 if i not in data:
---> 58 raise KeyError(i)
59
60 to_filter = []
KeyError: 16.5469
Any help or insights would be greatly appreciated.
A different way of accomplishing this without the lambdas is to build the index and columns directly from the DatetimeIndex. Note that values must be a column name, not the column itself; passing values=df2.Temp (a Series) is what raised the KeyError above.
df2 = pd.pivot_table(df, index=df.index.date, columns=df.index.hour, values="Value")
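Applied to the three sample rows from the question (a worked sketch, assuming Timestamp is the index):
import pandas as pd

df = pd.DataFrame(
    {'Value': [16.546900, 16.402375, 16.324250]},
    index=pd.to_datetime(['2016-01-01 00:00:00',
                          '2016-01-01 01:00:00',
                          '2016-01-01 02:00:00']))
df2 = pd.pivot_table(df, index=df.index.date, columns=df.index.hour, values="Value")
print(df2)
#                   0          1         2
# 2016-01-01  16.5469  16.402375  16.32425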
I slightly extended the input data as below (assuming no duplicated entries for the same date/hour):
Timestamp Value
2016-01-01 00:00:00 16.546900
2016-01-01 01:00:00 16.402375
2016-01-01 02:00:00 16.324250
2016-01-01 04:00:00 16.023928
2016-01-03 04:00:00 16.101919
2016-01-05 23:00:00 13.405928
It looks a bit awkward, but something like below works.
df2['Date'] = df2.Timestamp.apply(lambda x: str(x).split(" ")[0])
df2['Hour'] = df2.Timestamp.apply(lambda x: str(x).split(" ")[1].split(":")[0])
df3 = pd.pivot_table(df2, values='Value', index='Date', columns='Hour')
[Output]
Hour 00 01 02 04 23
Date
2016-01-01 16.5469 16.402375 16.32425 16.023928 NaN
2016-01-03 NaN NaN NaN 16.101919 NaN
2016-01-05 NaN NaN NaN NaN 13.405928
Finally, if your columns need to be integers:
df3.columns = [int(x) for x in df3.columns]
Hope this helps.
Adapting @Seanny123's answer above for an arbitrary cadence:
import datetime
from datetime import date

import numpy as np
import pandas as pd
import pytz

start = [2018, 1, 1, 0, 0, 0]
end = [date.today().year, date.today().month, date.today().day]
quant = 'freq'
sTime_tmp = datetime.datetime(start[0], start[1], start[2], tzinfo=pytz.UTC)
eTime_tmp = datetime.datetime(end[0], end[1], end[2], tzinfo=pytz.UTC)
cadence = '5min'

t = pd.date_range(start=sTime_tmp, end=eTime_tmp, freq=cadence)
keo = pd.DataFrame(np.nan, index=t, columns=[quant])
keo[quant] = 0
keo = pd.pivot_table(keo, index=keo.index.time, columns=keo.index.date, values=quant)
keo
I have a dataset of samples covering multiple days, all with a timestamp.
I want to select rows within a specific time window. E.g. all rows that were generated between 1pm and 3 pm every day.
This is a sample of my data in a pandas dataframe:
22 22 2018-04-12T20:14:23Z 2018-04-12T21:14:23Z 0 6370.1
23 23 2018-04-12T21:14:23Z 2018-04-12T21:14:23Z 0 6368.8
24 24 2018-04-12T22:14:22Z 2018-04-13T01:14:23Z 0 6367.4
25 25 2018-04-12T23:14:22Z 2018-04-13T01:14:23Z 0 6365.8
26 26 2018-04-13T00:14:22Z 2018-04-13T01:14:23Z 0 6364.4
27 27 2018-04-13T01:14:22Z 2018-04-13T01:14:23Z 0 6362.7
28 28 2018-04-13T02:14:22Z 2018-04-13T05:14:22Z 0 6361.0
29 29 2018-04-13T03:14:22Z 2018-04-13T05:14:22Z 0 6359.3
.. ... ... ... ... ...
562 562 2018-05-05T08:13:21Z 2018-05-05T09:13:21Z 0 6300.9
563 563 2018-05-05T09:13:21Z 2018-05-05T09:13:21Z 0 6300.7
564 564 2018-05-05T10:13:14Z 2018-05-05T13:13:14Z 0 6300.2
565 565 2018-05-05T11:13:14Z 2018-05-05T13:13:14Z 0 6299.9
566 566 2018-05-05T12:13:14Z 2018-05-05T13:13:14Z 0 6299.6
How do I achieve that? I need to ignore the date and just evaluate the time component. I could traverse the dataframe in a loop and evaluate the datetime that way, but there must be a simpler way to do it.
I converted messageDate, which was read as a string, to a datetime by
df["messageDate"]=pd.to_datetime(df["messageDate"])
But after that I got stuck on how to filter on time only.
Any input appreciated.
datetime columns have a DatetimeProperties object (the .dt accessor), from which you can extract the hour or the datetime.time and filter on it:
import datetime
import pandas as pd
df = pd.DataFrame(
[
'2018-04-12T12:00:00Z', '2018-04-12T14:00:00Z','2018-04-12T20:00:00Z',
'2018-04-13T12:00:00Z', '2018-04-13T14:00:00Z', '2018-04-13T20:00:00Z'
],
columns=['messageDate']
)
df
messageDate
# 0 2018-04-12 12:00:00
# 1 2018-04-12 14:00:00
# 2 2018-04-12 20:00:00
# 3 2018-04-13 12:00:00
# 4 2018-04-13 14:00:00
# 5 2018-04-13 20:00:00
df["messageDate"] = pd.to_datetime(df["messageDate"])
time_mask = (df['messageDate'].dt.hour >= 13) & \
(df['messageDate'].dt.hour <= 15)
df[time_mask]
# messageDate
# 1 2018-04-12 14:00:00
# 4 2018-04-13 14:00:00
I hope the code is self-explanatory. You can always ask questions.
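As an aside, if the datetime is the index, pandas' built-in DataFrame.between_time selects on the time of day directly, ignoring the date; a minimal sketch:
import pandas as pd

df = pd.DataFrame({'messageDate': pd.to_datetime(
        ['2018-04-12 12:00:00', '2018-04-12 14:00:00', '2018-04-13 14:30:00']),
    'val': [1, 2, 3]})
# between_time compares only the time component of a DatetimeIndex
selected = df.set_index('messageDate').between_time('13:00', '15:00')
print(selected)  # keeps the 14:00 and 14:30 rows, from either day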
import pandas as pd
# Prepping data for example
dates = pd.date_range('1/1/2018', periods=7, freq='H')
data = {'A' : range(7)}
df = pd.DataFrame(index = dates, data = data)
print(df)
# A
# 2018-01-01 00:00:00 0
# 2018-01-01 01:00:00 1
# 2018-01-01 02:00:00 2
# 2018-01-01 03:00:00 3
# 2018-01-01 04:00:00 4
# 2018-01-01 05:00:00 5
# 2018-01-01 06:00:00 6
# Creating a mask to filter the values we wish to keep or not.
# Here, we use df.index because the index is our datetime.
# If the datetime is a column, you can always say df['column_name']
mask = (df.index > '2018-1-1 01:00:00') & (df.index < '2018-1-1 05:00:00')
print(mask)
# [False False True True True False False]
df_with_good_dates = df.loc[mask]
print(df_with_good_dates)
# A
# 2018-01-01 02:00:00 2
# 2018-01-01 03:00:00 3
# 2018-01-01 04:00:00 4
df=df[(df["messageDate"].apply(lambda x : x.hour)>13) & (df["messageDate"].apply(lambda x : x.hour)<15)]
You can use x.minute, x.second similarly.
Try this after ensuring messageDate is indeed in datetime format, as you have done:
df.set_index('messageDate', inplace=True)
choseInd = [ind for ind in df.index if (ind.hour >= 13) & (ind.hour <= 15)]
df_select = df.loc[choseInd]
You can do the same even without making the datetime column the index, as the apply/lambda answer above shows;
having the datetime as your index rather than a numerical one just makes your dataframe better looking.
I have extracted the table below from a csv file:
date user_id whole_cost cost1
02/10/2012 00:00:00 1 1790 12
07/10/2012 00:00:00 1 364 15
30/01/2013 00:00:00 1 280 10
02/02/2013 00:00:00 1 259 24
05/03/2013 00:00:00 1 201 39
02/10/2012 00:00:00 3 623 1
07/12/2012 00:00:00 3 90 0
30/01/2013 00:00:00 3 312 90
02/02/2013 00:00:00 5 359 45
05/03/2013 00:00:00 5 301 34
02/02/2013 00:00:00 5 359 1
05/03/2013 00:00:00 5 801 12
For this purpose I used the following statement:
import pandas as pd
newnames = ['date','user_id', 'whole_cost', 'cost1']
df = pd.read_csv('expenses.csv', names = newnames, index_col = 'timestamp')
pivoted = df.pivot('timestamp','user_id')
But this last line generates the error message: no item named timestamp.
Many thanks in advance for your help.
It looks like a column named timestamp is not present in the dataframe.
Try index_col = 'date' instead of index_col = 'timestamp', and also use parse_dates = ['date'] in pd.read_csv.
This should work:
df = pd.read_csv('expenses.csv', header=None, names=newnames, index_col='date', parse_dates=['date'])
Hope this helps.
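As a follow-up, the pivot line in the question also references 'timestamp'; with the column actually named date, a sketch of the whole flow (pivot_table rather than pivot, because the sample data contains duplicate (date, user_id) pairs, which plain pivot rejects; dayfirst=True because the dates are in dd/mm/yyyy form):
import pandas as pd

newnames = ['date', 'user_id', 'whole_cost', 'cost1']
df = pd.read_csv('expenses.csv', header=None, names=newnames,
                 parse_dates=['date'], dayfirst=True)
# pivot_table aggregates duplicate (date, user_id) entries (mean by default)
pivoted = df.pivot_table(index='date', columns='user_id')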