I have a Date column in my dataframe having dates with 2 different types (YYYY-DD-MM 00:00:00 and YYYY-DD-MM) :
Date
0 2023-01-10 00:00:00
1 2024-27-06
2 2022-07-04 00:00:00
3 NaN
4 2020-30-06
(you can use pd.read_clipboard(sep='\s\s+') after copying the previous dataframe to get it in your notebook)
I would like to have only a YYYY-MM-DD type. Consequently, I would like to have :
Date
0 2023-10-01
1 2024-06-27
2 2022-04-07
3 NaN
4 2020-06-30
How please could I do ?
Use Series.str.replace with to_datetime and format parameter:
df['Date'] = pd.to_datetime(df['Date'].str.replace(' 00:00:00',''), format='%Y-%d-%m')
print (df)
Date
0 2023-10-01
1 2024-06-27
2 2022-04-07
3 NaT
4 2020-06-30
Another idea with match both formats:
d1 = pd.to_datetime(df['Date'], format='%Y-%d-%m', errors='coerce')
d2 = pd.to_datetime(df['Date'], format='%Y-%d-%m 00:00:00', errors='coerce')
df['Date'] = d1.fillna(d2)
I have a data frame data types like below
usr_id year
0 t961 00:50:03.158000
1 t964 03:25:57
2 t335 00:55:00
3 t829 00:04:25.714000
usr_id object
year object
dtype: object
I want to convert the year column data type to a datetime. I used the below code.
timefmt = "%H:%M"
test['year'] = pd.to_datetime(
test['year'], format=timefmt, errors='coerce').dt.time
I get below output
usr_id year
0 t961 NaT
1 t964 NaT
2 t335 NaT
3 t829 NaT
How can I convert the data time of this column (object to datetime)?
How can I drop seconds & microseconds?
Expected output
usr_id year
0 t961 00:50
1 t964 03:25
2 t335 00:55
3 t829 00:04
Use to_datetime with Series.dt.strftime:
timefmt = "%H:%M"
test['year'] = pd.to_datetime(test['year'], errors='coerce').dt.strftime(timefmt)
print (test)
usr_id year
0 t961 00:50
1 t964 03:25
2 t335 00:55
3 t829 00:04
Or you can use Series.str.rsplit with n=1 for split by last : and select first lists by indexing:
test['year'] = test['year'].str.rsplit(':', n=1).str[0]
print (test)
usr_id year
0 t961 00:50
1 t964 03:25
2 t335 00:55
3 t829 00:04
Or solution by #Akira:
test['year'] = test['year'].astype(str).str[:5]
As there is currently no actual date in your year column, you need to set a default one. Then you you can pass a format to pandas to_datetime function.
This could be done in a one-liner like this:
test['year'] = pd.to_datetime(test['year'].apply(lambda x: '1900-01-01 '+ x),format='%Y-%m-%d %H:%M:%S')
Edit: You can use the alleged duplicate solution with reindex() if your dates don't include times, otherwise you need a solution like the one by #kosnik. In addition, their solution doesn't need your dates to be the index!
I have data formatted like this
df = pd.DataFrame(data=[['2017-02-12 20:25:00', 'Sam', '8'],
['2017-02-15 16:33:00', 'Scott', '10'],
['2017-02-15 16:45:00', 'Steve', '5']],
columns=['Datetime', 'Sender', 'Count'])
df['Datetime'] = pd.to_datetime(df['Datetime'], format='%Y-%m-%d %H:%M:%S')
Datetime Sender Count
0 2017-02-12 20:25:00 Sam 8
1 2017-02-15 16:33:00 Scott 10
2 2017-02-15 16:45:00 Steve 5
I need there to be at least one row for every date, so the expected result would be
Datetime Sender Count
0 2017-02-12 20:25:00 Sam 8
1 2017-02-13 00:00:00 None 0
2 2017-02-14 00:00:00 None 0
3 2017-02-15 16:33:00 Scott 10
4 2017-02-15 16:45:00 Steve 5
I have tried to make datetime the index, add the dates and use reindex() like so
df.index = df['Datetime']
values = df['Datetime'].tolist()
for i in range(len(values)-1):
if values[i].date() + timedelta < values[i+1].date():
values.insert(i+1, pd.Timestamp(values[i].date() + timedelta))
print(df.reindex(values, fill_value=0))
This makes every row forget about the other columns and the same thing happens for asfreq('D') or resample()
ID Sender Count
Datetime
2017-02-12 16:25:00 0 Sam 8
2017-02-13 00:00:00 0 0 0
2017-02-14 00:00:00 0 0 0
2017-02-15 20:25:00 0 0 0
2017-02-15 20:25:00 0 0 0
What would be the appropriate way of going about this?
I would create a new DataFrame column which contains all the required data and then left join with your data frame.
A working code example is the following
df['Datetime'] = pd.to_datetime(df['Datetime']) # first convert to datetimes
datetimes = df['Datetime'].tolist() # these are existing datetimes - will add here the missing
dates = [x.date() for x in datetimes] # these are converted to dates
min_date = min(dates)
max_date = max(dates)
for d in range((max_date - min_date).days):
forward_date = min_date + datetime.timedelta(d)
if forward_date not in dates:
datetimes.append(np.datetime64(forward_date))
# create new dataframe, merge and fill 'Count' column with zeroes
df = pd.DataFrame({'Datetime': datetimes}).merge(df, on='Datetime', how='left')
df['Count'].fillna(0, inplace=True)
I have a csv-file with time series data, the first column is the date in the format %Y:%m:%d and the second column is the intraday time in the format '%H:%M:%S'. I would like to import this csv-file into a multiindex dataframe or panel object.
With this code, it already works:
_file_data = pd.read_csv(_file,
sep=",",
header=0,
index_col=['Date', 'Time'],
thousands="'",
parse_dates=True,
skipinitialspace=True
)
It returns the data in the following format:
Date Time Volume
2016-01-04 2018-04-25 09:01:29 53645
2018-04-25 10:01:29 123
2018-04-25 10:01:29 1345
....
2016-01-05 2018-04-25 10:01:29 123
2018-04-25 12:01:29 213
2018-04-25 10:01:29 123
1st question:
I would like to show the second index as a pure time-object not datetime. To do that, I have to declare two different date-pasers in the read_csv function, but I can't figure out how. What is the "best" way to do that?
2nd question:
After I created the Dataframe, I converted it to a panel-object. Would you recommend doing that? Is the panel-object the better choice for such a data structure? What are the benefits (drawbacks) of a panel-object?
1st question:
You can create multiple converters and define parsers in dictionary:
import pandas as pd
temp=u"""Date,Time,Volume
2016:01:04,09:00:00,53645
2016:01:04,09:20:00,0
2016:01:04,09:40:00,0
2016:01:04,10:00:00,1468
2016:01:05,10:00:00,246
2016:01:05,10:20:00,0
2016:01:05,10:40:00,0
2016:01:05,11:00:00,0
2016:01:05,11:20:00,0
2016:01:05,11:40:00,0
2016:01:05,12:00:00,213"""
def converter1(x):
#convert to datetime and then to times
return pd.to_datetime(x).time()
def converter2(x):
#define format of datetime
return pd.to_datetime(x, format='%Y:%m:%d')
#after testing replace 'pd.compat.StringIO(temp)' to 'filename.csv'
df = pd.read_csv(pd.compat.StringIO(temp),
index_col=['Date','Time'],
thousands="'",
skipinitialspace=True,
converters={'Time': converter1, 'Date': converter2})
print (df)
Volume
Date Time
2016-01-04 09:00:00 53645
09:20:00 0
09:40:00 0
10:00:00 1468
2016-01-05 10:00:00 246
10:20:00 0
10:40:00 0
11:00:00 0
11:20:00 0
11:40:00 0
12:00:00 213
Sometimes is possible use built-in parser, e.g. if format of dates is YY-MM-DD:
import pandas as pd
temp=u"""Date,Time,Volume
2016-01-04,09:00:00,53645
2016-01-04,09:20:00,0
2016-01-04,09:40:00,0
2016-01-04,10:00:00,1468
2016-01-05,10:00:00,246
2016-01-05,10:20:00,0
2016-01-05,10:40:00,0
2016-01-05,11:00:00,0
2016-01-05,11:20:00,0
2016-01-05,11:40:00,0
2016-01-05,12:00:00,213"""
def converter(x):
#define format of datetime
return pd.to_datetime(x).time()
#after testing replace 'pd.compat.StringIO(temp)' to 'filename.csv'
df = pd.read_csv(pd.compat.StringIO(temp),
index_col=['Date','Time'],
parse_dates=['Date'],
thousands="'",
skipinitialspace=True,
converters={'Time': converter})
print (df.index.get_level_values(0))
DatetimeIndex(['2016-01-04', '2016-01-04', '2016-01-04', '2016-01-04',
'2016-01-05', '2016-01-05', '2016-01-05', '2016-01-05',
'2016-01-05', '2016-01-05', '2016-01-05'],
dtype='datetime64[ns]', name='Date', freq=None)
Last possible solution is convert datetime to times in MultiIndex by set_levels - after processing:
df.index = df.index.set_levels(df.index.get_level_values(1).time, level=1)
print (df)
Volume
Date Time
2016-01-04 09:00:00 53645
09:20:00 0
09:40:00 0
10:00:00 1468
2016-01-05 10:00:00 246
10:00:00 0
10:20:00 0
10:40:00 0
11:00:00 0
11:20:00 0
11:40:00 213
2nd question:
Panel in pandas 0.20.+ is deprecated and will be removed in a future version.
To convert to a time series use pd.to_timedelta.
Ex:
import pandas as pd
df = pd.DataFrame({"Time": ["2018-04-25 09:01:29", "2018-04-25 10:01:29", "2018-04-25 10:01:29"]})
df["Time"] = pd.to_timedelta(pd.to_datetime(df["Time"]).dt.strftime('%H:%M:%S'))
print df["Time"]
Output:
0 09:01:29
1 10:01:29
2 10:01:29
Name: Time, dtype: timedelta64[ns]
I have a dataset containing monthly observations of a time-series.
What I want to do is transform the datetime to year/quarter format and then extract the first value DATE[0] as the previous quarter. For example 2006-10-31 belongs to 4Q of 2006. But I want to change it to 2006Q3.
For the extraction of the subsequent values I will just use the last value from each quarter.
So, for 2006Q4 I will keep BBGN, SSD, and QQ4567 values only from DATE[2]. Similarly, for 2007Q1 I will keep only DATE[5] values, and so forth.
Original dataset:
DATE BBGN SSD QQ4567
0 2006-10-31 00:00:00 1.210 22.022 9726.550
1 2006-11-30 00:00:00 1.270 22.060 9891.008
2 2006-12-31 00:00:00 1.300 22.080 10055.466
3 2007-01-31 00:00:00 1.330 22.099 10219.924
4 2007-02-28 00:00:00 1.393 22.110 10350.406
5 2007-03-31 00:00:00 1.440 22.125 10480.888
After processing the DATE
DATE BBGN SSD QQ4567
0 2006Q3 1.210 22.022 9726.550
2 2006Q4 1.300 22.080 10055.466
5 2007Q1 1.440 22.125 10480.888
The steps I have taken so far are:
Turn the values from the yyyy-mm-dd hh format to yyyyQQ format
DF['DATE'] = pd.to_datetime(DF['DATE']).dt.to_period('Q')
and I get this
DATE BBGN SSD QQ4567
0 2006Q4 1.210 22.022 9726.550
1 2006Q4 1.270 22.060 9891.008
2 2006Q4 1.300 22.080 10055.466
3 2007Q1 1.330 22.099 10219.924
4 2007Q1 1.393 22.110 10350.406
5 2007Q1 1.440 22.125 10480.888
The next step is to extract the last values from each quarter. But because I always want to keep the first row I will exclude DATE[0] from the function.
quarterDF = DF.iloc[1:,].drop_duplicates(subset='DATE', keep='last')
Now, my question is how can I change the value in DATE[0] to always be the previous quarter. So, from 2006Q4 to be 2006Q3. Also, how this will work if DATE[0] is 2007Q1, can I change it to 2006Q4?
My suggestion would be to create a new DATE column with a day 3 months in the past. Like this
import pandas as pd
df = pd.DataFrame()
df['Date'] = pd.to_datetime(['2006-10-31', '2007-01-31'])
one_quarter = pd.tseries.offsets.DateOffset(months=3)
df['Last_quarter'] = df.Date - one_quarter
This will give you
Date Last_quarter
0 2006-10-31 2006-07-31
1 2007-01-31 2006-10-31
Then you can do the same process as you described above on Last_quarter
Here is a pivot_table approach
# Subtract the quarter from date save it in a column
df['Q'] = df['DATE'] - pd.tseries.offsets.QuarterEnd()
#0 2006-09-30
#1 2006-09-30
#2 2006-09-30
#3 2006-12-31
#4 2006-12-31
#5 2006-12-31
#Name: Q, dtype: datetime64[ns]
# Drop and pivot for not including the columns
ndf = df.drop(['DATE','Q'],1).pivot_table(index=pd.to_datetime(df['Q']).dt.to_period('Q'),aggfunc='last')
BBGN QQ4567 SSD
Qdate
2006Q3 1.30 10055.466 22.080
2006Q4 1.44 10480.888 22.125