How to create a "duration" column from two "dates" columns? - python

I have two columns in my "expeditions" dataframe: a start date ("basecamp_date") and an end date ("highpoint_date"). I would like to create a new column that expresses the duration between these two dates, but I have no idea how to do it.
import pandas as pd
expeditions = pd.read_csv("https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2020/2020-09-22/expeditions.csv")

Convert the columns to datetimes in read_csv, then subtract the columns and use Series.dt.days to get the difference in days:
file = "https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2020/2020-09-22/expeditions.csv"
expeditions = pd.read_csv(file, parse_dates=['basecamp_date','highpoint_date'])
expeditions['diff'] = expeditions['highpoint_date'].sub(expeditions['basecamp_date']).dt.days

You can convert those columns to datetime and then subtract them to get the duration:
tstart = pd.to_datetime(expeditions['basecamp_date'])
tend = pd.to_datetime(expeditions['highpoint_date'])
# Subtracting two datetime Series already yields a timedelta Series
expeditions['duration'] = tend - tstart
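
Both answers above can be tried without downloading the CSV. The following is a minimal sketch on a tiny hand-made frame; the sample dates and the "duration_days" column name are made up for illustration and are not taken from the real dataset. It also shows what happens when one of the dates is missing.

```python
import pandas as pd

# Hypothetical sample rows standing in for the real expeditions data
expeditions = pd.DataFrame({
    "basecamp_date": ["2020-04-01", "2020-05-10", None],
    "highpoint_date": ["2020-04-20", "2020-05-12", "2020-06-01"],
})

# Parse both columns; missing or unparseable values become NaT
start = pd.to_datetime(expeditions["basecamp_date"], errors="coerce")
end = pd.to_datetime(expeditions["highpoint_date"], errors="coerce")

# Datetime minus datetime gives a timedelta column
expeditions["duration"] = end - start

# Whole days as a number; rows with a missing date end up as NaN
expeditions["duration_days"] = expeditions["duration"].dt.days
```

Keeping the timedelta column is useful if you later need hours or seconds; the numeric day count is usually easier to plot or aggregate.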

Related

Converting unix time into datetime within csv

I have a CSV file with many lines and three columns. The first column is the Unix time, the second column is the price, and the third column is the volume of the symbol that was traded at that specific price. I am calculating OHLC for different time frames (e.g. 1h, 4h, 12h, 1d) from that CSV file. That works very well after first converting the Unix time into datetime.
code:
import pandas as pd
df = pd.read_csv('file.csv', names=['date', 'price', 'volume'])
df['date'] = pd.to_datetime(df['date'], unit='s')
df = df.set_index('date')
df = df['price'].resample('4h').ohlc()
df.to_csv('file_4h_ohlc.csv')
result:
date,open,high,low,close
2017-05-01 20:00:00,0.757881,1.07,0.650011,1.069999
target:
I now want to convert the datetime (2017-05-01 20:00:00) back to Unix time (1493658000) within the same file while keeping the OHLC values or, if that is not possible, save the result into a different file.
Thanks a lot for the support, and sorry if this question has already been answered; I didn't find it.
-hotshot
You can create a separate datetime column for the index instead of overwriting the existing one. Note that the original Unix-time column does not survive the resampling, so convert the resampled datetime index back to Unix seconds before writing the file.
import pandas as pd
df = pd.read_csv('file.csv', names=['date', 'price', 'volume'])
df['datestamp'] = pd.to_datetime(df['date'], unit='s')
df = df.set_index('datestamp')
df = df['price'].resample('4h').ohlc()
# Convert the resampled datetime index back to Unix seconds (after calculating ohlc)
df.index = (df.index - pd.Timestamp('1970-01-01')) // pd.Timedelta(seconds=1)
# Name the index so the CSV header reads "date" again
df.index.name = 'date'
df.to_csv('file_4h_ohlc.csv')
Alternatively, you can convert an existing datetime column to a Unix timestamp like so (this needs the standard datetime module):
import datetime
df['date'].apply(lambda x : (x - datetime.datetime(1970, 1, 1)).total_seconds())
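
To check the round trip from Unix seconds to datetime and back, here is a small self-contained sketch. The four sample ticks are invented for illustration (they start at 2017-05-01 00:00:00 UTC), and the variable name "ohlc" is just a convenience; the real file is not needed.

```python
import pandas as pd

# Hypothetical ticks: Unix seconds and prices, two per 4-hour bucket
df = pd.DataFrame({
    "date": [1493596800, 1493600400, 1493611200, 1493614800],
    "price": [1.0, 2.0, 3.0, 1.5],
})

# Unix seconds -> datetime, then resample to 4-hour OHLC bars
df["date"] = pd.to_datetime(df["date"], unit="s")
ohlc = df.set_index("date")["price"].resample("4h").ohlc()

# Datetime index -> Unix seconds: subtract the epoch, then divide by one second
ohlc.index = (ohlc.index - pd.Timestamp("1970-01-01")) // pd.Timedelta(seconds=1)
ohlc.index.name = "date"
```

Subtracting the epoch and dividing by a one-second Timedelta avoids relying on the internal resolution of the datetime index, which is why it is used here instead of an integer cast.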

Remove time from pandas dataframe datetime64[ns] index

I am trying to merge two pandas dataframes, and to do this I want to make it so that they both have the same index. The problem is, one df has an index of datatype object which just includes the date while the other df has an index of datatype datetime64[ns] which includes the date and time. Is there a way to make these both the same data type so that I can merge the two dataframes?
Convert both date columns to pandas datetimes and then keep only the date part of each:
df['date_only'] = df['dates'].dt.date
You can convert a date-and-time value to just a date as shown below:
import pandas as pd
date_n_time='2015-01-08 22:44:09'
date=pd.to_datetime(date_n_time).date()
Turn your index into a column using
df = df.reset_index()
and, after converting it, set it back as the index using
df = df.set_index('date_only')
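
Putting these pieces together, the following is a rough sketch of aligning the two indexes before joining. The frames "left" and "right" and their values are made up for illustration. Instead of converting to Python date objects, it keeps both indexes as datetimes and sets the time part to midnight, which is one possible way to make the types match.

```python
import pandas as pd

# Hypothetical frame with an object index that holds date strings
left = pd.DataFrame({"a": [1, 2]}, index=["2015-01-08", "2015-01-09"])

# Hypothetical frame with a datetime index that includes times
right = pd.DataFrame(
    {"b": [10, 20]},
    index=pd.to_datetime(["2015-01-08 22:44:09", "2015-01-09 07:15:00"]),
)

# Parse the string index into datetimes
left.index = pd.to_datetime(left.index)

# Drop the time of day (set it to midnight) while staying a DatetimeIndex
right.index = right.index.normalize()

# Both indexes now hold the same kind of values, so the join lines up
merged = left.join(right, how="inner")
```

If several rows share the same day after normalizing, the join will produce one output row per matching pair, so aggregate first if that is not what you want.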

How to join pandas Series of numbers to make it one number

I'm using Pandas library.
I have three columns in dataset named 'hours', 'minutes' and 'seconds'
I want to join the three columns to make it in time format.
For example, the first row should read as 9:33:09
How can I do that?
Convert each column to timedelta and add them:
pd.to_timedelta(df["hours"], unit='h') + pd.to_timedelta(df["minutes"], unit='m') + pd.to_timedelta(df["sec"], unit='s')
Looking at your example, I think the sec column actually holds microseconds. If that is the case, use:
pd.to_timedelta(df["hours"], unit='h') + pd.to_timedelta(df["minutes"], unit='m') + pd.to_timedelta(df["sec"], unit='us')
You can use string operations and pandas for this.
import pandas as pd
# Read csv
data=pd.read_csv("data.csv")
# Create a DataFrame object
df=pd.DataFrame(data,columns=["hour","mins","sec"])
# Iterate through records and print the values.
for ind in df.index:
    hour=str(df['hour'][ind])
    minute=str(df['mins'][ind])
    sec=str(df['sec'][ind])
    # Drop the last four characters of the seconds value
    sec=sec[:len(sec)-4]
    if(len(sec)==1):
        sec="0"+sec
    print(hour+":"+minute+":"+sec)
Output:
HH:MM:SS
It appends 0 if seconds are of 1 digit.
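
Both answers can be combined without a loop. This is a minimal sketch, assuming the columns are named 'hours', 'minutes' and 'seconds' as in the question and hold plain integers; the sample values and the "duration" and "time_text" names are invented for illustration.

```python
import pandas as pd

# Hypothetical sample data with integer hour, minute and second columns
df = pd.DataFrame({"hours": [9, 14], "minutes": [33, 5], "seconds": [9, 42]})

# Option 1: a real timedelta column, useful for arithmetic
df["duration"] = (
    pd.to_timedelta(df["hours"], unit="h")
    + pd.to_timedelta(df["minutes"], unit="m")
    + pd.to_timedelta(df["seconds"], unit="s")
)

# Option 2: a plain H:MM:SS string, zero-padding minutes and seconds
df["time_text"] = (
    df["hours"].astype(str)
    + ":" + df["minutes"].astype(str).str.zfill(2)
    + ":" + df["seconds"].astype(str).str.zfill(2)
)
```

The timedelta column is the better choice if you need to sort or add times; the string column is only for display.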

Python: Reading excel file but Index should be DateTime not Sequential Numbers

Hey, I am loading data from an Excel sheet. The sheet has 5 columns. The first column is a DateTime, and the next 4 are datasets corresponding to that time. Here is the code:
import os
import numpy as np
import pandas as pd
df = pd.read_excel (r'path\test.xlsx', sheet_name='2018')
I thought it would load it in such that the DateTime is the index, but instead it has another column called Index which is just a set of numbers going from 0 up to the end of the array. How do I have the DateTime column be the index and remove the other column?
Try this after you read the Excel file; it takes two extra lines:
# Assuming that Datetime is the name of the datetime column and that its values look like
# 07/15/2020, 12:24:45 (format "%m/%d/%Y, %H:%M:%S"). If the date-time strings are
# formatted differently, change the format accordingly.
df['Datetime'] = pd.to_datetime(df['Datetime'], format="%m/%d/%Y, %H:%M:%S")
# This sets a DatetimeIndex as the index
df = df.set_index(pd.DatetimeIndex(df['Datetime']))
There is a solution for this problem:
import pandas as pd
df = pd.read_excel (r'path\test.xlsx', sheet_name='2018')
df = df.set_index('timestamp')  # Assuming the name of your datetime column is timestamp
You can try this method for setting the Datetime column as the index.
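
To see the effect without an Excel file, here is a small sketch that builds the frame in memory. The column names "Datetime" and "sensor_a" and the sample values are assumptions for illustration, and the date format is assumed to include the comma.

```python
import pandas as pd

# Hypothetical stand-in for the frame returned by pd.read_excel
df = pd.DataFrame({
    "Datetime": ["07/15/2020, 12:24:45", "07/15/2020, 12:25:45"],
    "sensor_a": [1.0, 2.0],
})

# Parse the strings, then move the column into the index
df["Datetime"] = pd.to_datetime(df["Datetime"], format="%m/%d/%Y, %H:%M:%S")
df = df.set_index("Datetime")
```

Passing the column name to set_index removes the column from the data and replaces the default 0..n-1 index in one step. The numbered index shown by pandas is not a real column, so nothing else needs to be dropped.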

Select Pandas dataframe rows based on 'hour' datetime

I have a pandas dataframe 'df' with a column 'DateTimes' of type datetime.time.
The entries of that column are hours of a single day:
00:00:00
.
.
.
23:59:00
Seconds are skipped, it counts by minutes.
How can I choose rows by hour, for example the rows between 00:00:00 and 00:01:00?
If I try this:
df.between_time('00:00:00', '00:00:10')
I get an error that index must be a DateTimeIndex.
I set the index as such with:
df=df.set_index(keys='DateTime')
but I get the same error.
I can't seem to get 'loc' to work either. Any suggestions?
Here is a working example of what you are trying to do:
import numpy as np
import pandas as pd
times = pd.date_range('3/6/2012 00:00', periods=100, freq='s', tz='UTC')
df = pd.DataFrame(np.random.randint(10, size=(100,1)), index=times)
df.between_time('00:00:00', '00:00:30')
Note the index has to be of type DatetimeIndex.
I understand you have a column with your dates/times. The problem probably is that your column is not of this type, so you have to convert it first, before setting it as index:
# Method A
df['column_name'] = pd.to_datetime(df['column_name'])
df = df.set_index('column_name', drop=True)
# Method B
df.index = pd.to_datetime(df['column_name'])
df = df.drop('column_name', axis=1)
(The drop is only necessary if you want to remove the original column after setting it as index)
Check out these links:
convert column to date type: Convert DataFrame column type from string to datetime
filter dataframe on dates: Filtering Pandas DataFrames on dates
Hope this helps
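
Since the question mentions a column of datetime.time objects rather than strings, here is a hedged sketch of two ways to select a time window in that case. The sample frame, the "value" column and the variable names are made up for illustration. The first option compares the time objects directly; the second converts them to strings and parses them so that between_time can be used.

```python
import datetime
import pandas as pd

# Hypothetical frame whose 'DateTimes' column holds datetime.time objects
minutes = pd.date_range("2012-03-06 00:00", periods=5, freq="min")
df = pd.DataFrame({"DateTimes": minutes.time, "value": range(5)})

# Option 1: compare the time objects directly with a boolean mask
start, end = datetime.time(0, 0), datetime.time(0, 1)
by_mask = df[df["DateTimes"].map(lambda t: start <= t <= end)]

# Option 2: build a DatetimeIndex from the times, then use between_time
as_datetime = pd.to_datetime(df["DateTimes"].astype(str), format="%H:%M:%S")
indexed = df.set_index(as_datetime)
by_between = indexed.between_time("00:00:00", "00:01:00")
```

In the second option the parsed values get a placeholder date, which does not matter because between_time only looks at the time of day.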
