I have a timestamp that looks like this: "1994-10-01:00:00:00". When I try to read this dataset with pd.read_csv or pd.read_table, everything is imported, including the date column ([0]), but that column does not show up in the dtypes, not even as an object. This is part of my code:
namevar = ['timestamp', 'nsub',
'sub_cms', # var 1 [cms]
'sub_gwflow', # var 2 [cfs]
'sub_interflow', # var 3 [cfs]
'sub_sroff', # var 4 [cfs]
....
'subinc_sroff', # var 13
'subinc_tavgc'] # var 14
df = pd.read_csv(root.anima, delimiter='\t', skiprows=1, header=avar+6, index_col=0,
names=namevar, infer_datetime_format=True,
parse_dates=[0])
print(df)
Results in:
nsub sub_cms ... subinc_sroff subinc_tavgc
timestamp
1994-10-01:00:00:00 1 4.4180 ... 0.0 59.11000
1994-10-01:00:00:00 2 2.6690 ... 0.0 89.29000
1994-10-01:00:00:00 3 4.3170 ... 0.0 77.02000
...
2000-09-30:00:00:00 2 2.3580 ... 0.0 0.19570
2000-09-30:00:00:00 3 2.2250 ... 0.0 0.73340
2000-09-30:00:00:00 4 0.8876 ... 0.0 0.07124
[8768 rows x 15 columns]
print(df.dtypes)
Results in:
nsub int64
sub_cms float64
sub_gwflow float64
sub_interflow float64
sub_sroff float64
subinc_actet float64
...
subinc_sroff float64
subinc_tavgc float64
dtype: object
My ultimate goal is that, once the timestamp is in the dataframe, I can modify it to get rid of the time, with:
df['timestamp'] = pd.to_datetime(df['timestamp'], format='%Y%m%d', infer_datetime_format=True)
But when I run this now, I get: KeyError: 'timestamp'
Any help in getting the timestamp in the dataframe is much appreciated.
As pointed out by #s13wr81, the way to bring 'timestamp' into the dataframe as a column was to remove index_col=0 from the read_csv statement.
To edit the timestamp properly, I needed to remove the :Hr:Min:Sec portion of it by using:
df['timestamp'] = df['timestamp'].str.split(":").str[0]
and then, to convert timestamp to a pandas datetime, I used:
df['timestamp'] = pd.to_datetime(df['timestamp'], format='%Y-%m-%d')
I am not sure about the data, but I think timestamp is not a column but the index. This kind of problem sometimes also happens after grouping.
Try:
"timestamp" in df.columns
If the result is False, then:
df = df.reset_index()
Next, to parse the timestamp and drop the time part, try:
df['timestamp'] = pd.to_datetime(df.timestamp, format='%Y-%m-%d:%H:%M:%S').dt.normalize()
df = pd.read_csv(root.anima, delimiter='\t', skiprows=1, header=avar+6, index_col=0,
names=namevar, infer_datetime_format=True,
parse_dates=[0])
I think you are explicitly telling pandas to use column 0 as the index, and that happens to be your datetime column.
Try removing index_col=0 from the pd.read_csv() call; I think it will then work.
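To illustrate the point about index_col, here is a minimal sketch. The two-row sample and the shortened names list are made up for the example (they are not the asker's file); only the read_csv arguments under discussion are kept.

```python
from io import StringIO
import pandas as pd

# Hypothetical two-column sample in the question's timestamp format.
raw = "1994-10-01:00:00:00\t1\n1994-10-01:00:00:00\t2\n"
names = ["timestamp", "nsub"]

# With index_col=0 the first column becomes the index, not a column.
as_index = pd.read_csv(StringIO(raw), sep="\t", names=names, index_col=0)
print("timestamp" in as_index.columns)   # False
print(as_index.index.name)               # timestamp

# Without index_col it stays an ordinary column.
as_column = pd.read_csv(StringIO(raw), sep="\t", names=names)
print("timestamp" in as_column.columns)  # True

# reset_index() recovers the column from a frame that was already read.
recovered = as_index.reset_index()
print(list(recovered.columns))           # ['timestamp', 'nsub']
```

This is also why df.dtypes in the question lists no timestamp entry: dtypes only reports columns, and the index is not one of them.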
I think the issue is that the timestamp is in a non-standard format. There is a colon between the date and time parts. Here is a way to convert the value in the example:
import datetime
# note ':' between date part and time part
raw_timestamp = '1994-10-01:00:00:00'
format_string = '%Y-%m-%d:%H:%M:%S'
result = datetime.datetime.strptime(raw_timestamp, format_string)
print(result)
1994-10-01 00:00:00
You could use pd.to_datetime() with the format_string in this example, to process an entire column of timestamps.
UPDATE
Here is an example that uses a modified version of the original data (timestamp + one column; every entry is unique):
from io import StringIO
import pandas as pd
data = '''timestamp nsub
1994-10-01:00:00:00 1
1994-10-02:00:00:00 2
1994-10-03:00:00:00 3
2000-09-28:00:00:00 4
2000-09-29:00:00:00 5
2000-09-30:00:00:00 6
'''
df = pd.read_csv(StringIO(data), sep=r'\s+')
df['timestamp'] = pd.to_datetime(df['timestamp'],
format='%Y-%m-%d:%H:%M:%S',
errors='coerce')
print(df, end='\n\n')
print(df.dtypes)
timestamp nsub
0 1994-10-01 1
1 1994-10-02 2
2 1994-10-03 3
3 2000-09-28 4
4 2000-09-29 5
5 2000-09-30 6
timestamp datetime64[ns]
nsub int64
dtype: object
Related
I have the following DataFrame with a Date column,
0 2021-12-13
1 2021-12-10
2 2021-12-09
3 2021-12-08
4 2021-12-07
...
7990 1990-01-08
7991 1990-01-05
7992 1990-01-04
7993 1990-01-03
7994 1990-01-02
I am trying to find the index for a specific date in this DataFrame using the following code,
# import raw data into DataFrame
df = pd.DataFrame.from_records(data['dataset']['data'])
df.columns = data['dataset']['column_names']
df['Date'] = pd.to_datetime(df['Date'])
# sample date to search for
sample_date = dt.date(2021,12,13)
print(sample_date)
# return index of sample date
date_index = df.index[df['Date'] == sample_date].tolist()
print(date_index)
The output of the program is,
2021-12-13
[]
I can't understand why. I have cast the Date column in the DataFrame to a DateTime and I'm doing a like-for-like comparison.
I have reproduced your DataFrame with a minimal sample. The Date column holds datetime64 values while sample_date is a datetime.date, so the equality test never matches. Comparing against a datetime instead works, as below.
import pandas as pd
import datetime as dt
df = pd.DataFrame({'Date':['2021-12-13','2021-12-10','2021-12-09','2021-12-08']})
df['Date'] = pd.to_datetime(df['Date'].astype(str), format='%Y-%m-%d')
sample_date = dt.datetime.strptime('2021-12-13', '%Y-%m-%d')
date_index = df.index[df['Date'] == sample_date].tolist()
print(date_index)
output:
[0]
The date being searched for is at index 0 of the DataFrame.
Please let me know if this one has any issues
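If you would rather keep sample_date as a datetime.date, two other ways to make the comparison like-for-like are sketched below. The three-row frame is a stand-in for the real data; the variable names idx_ts and idx_date are just for illustration.

```python
import datetime as dt
import pandas as pd

df = pd.DataFrame({"Date": ["2021-12-13", "2021-12-10", "2021-12-09"]})
df["Date"] = pd.to_datetime(df["Date"])

sample_date = dt.date(2021, 12, 13)

# Option 1: promote the date to a Timestamp (midnight) before comparing.
idx_ts = df.index[df["Date"] == pd.Timestamp(sample_date)].tolist()

# Option 2: reduce the column to datetime.date objects and compare those.
idx_date = df.index[df["Date"].dt.date == sample_date].tolist()

print(idx_ts, idx_date)
```

Option 1 keeps the column as datetime64 and is usually the faster of the two; option 2 builds an object column of Python dates just for the comparison.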
I'm trying to create a new date column based on an existing date column in my dataframe. I want to take all the dates in the first column and make them the first of the month in the second column so:
03/15/2019 = 03/01/2019
I know I can do this using:
df['newcolumn'] = pd.to_datetime(df['oldcolumn'], format='%Y-%m-%d').apply(lambda dt: dt.replace(day=1)).dt.date
My issue is that some of the data in the old column are not valid dates; some rows contain text. So I'm trying to figure out how to clean up the data before I do this, something like:
if oldcolumn isn't a date then make it 01/01/1990 else oldcolumn
Or, is there a way to do this with try/except?
Any assistance would be appreciated.
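First, generate some sample data:
df = pd.DataFrame([['2019-01-03'], ['asdf'], ['2019-11-10']], columns=['Date'])
This can be safely converted to datetime; invalid entries become NaT and are then replaced with the fallback date (this needs import datetime as dt):
df['Date'] = pd.to_datetime(df['Date'], errors='coerce')
mask = df['Date'].isnull()
df.loc[mask, 'Date'] = dt.datetime(1990, 1, 1)
Now you don't need the slow apply. Going through a monthly period gives the first of the month, including for dates that already fall on the 1st (adding pd.offsets.MonthBegin(-1) would move those back a whole month):
df['New'] = df['Date'].dt.to_period('M').dt.to_timestamp()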
Try the argument errors='coerce'.
This will return NaT for the text values.
df['newcolumn'] = pd.to_datetime(df['oldcolumn'],
format='%Y-%m-%d',
errors='coerce').apply(lambda dt: dt.replace(day=1)).dt.date
For example
# We have this dataframe
ID Date
0 111 03/15/2019
1 133 01/01/2019
2 948 Empty
3 452 02/10/2019
# We convert Date column to datetime
df['Date'] = pd.to_datetime(df.Date, format='%m/%d/%Y', errors='coerce')
Output
ID Date
0 111 2019-03-15
1 133 2019-01-01
2 948 NaT
3 452 2019-02-10
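Putting the pieces together for the MM/DD/YYYY data shown above, a sketch of the whole clean-up could look like this. The 1990-01-01 fallback comes from the question; the column name first_of_month is made up for the example.

```python
import pandas as pd

df = pd.DataFrame({"ID": [111, 133, 948, 452],
                   "Date": ["03/15/2019", "01/01/2019", "Empty", "02/10/2019"]})

# Unparseable entries become NaT instead of raising.
parsed = pd.to_datetime(df["Date"], format="%m/%d/%Y", errors="coerce")

# Fallback requested in the question: 01/01/1990 for invalid rows.
parsed = parsed.fillna(pd.Timestamp("1990-01-01"))

# Truncate to the month, then convert back to a timestamp at the month start.
df["first_of_month"] = parsed.dt.to_period("M").dt.to_timestamp()
print(df)
```

The period round trip leaves dates that are already on the 1st unchanged, which matters here because both 01/01/2019 and the fallback date fall on the first of a month.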
I have a dataset which looks like below
[25/May/2015:23:11:15 000]
[25/May/2015:23:11:15 000]
[25/May/2015:23:11:16 000]
[25/May/2015:23:11:16 000]
Now I have made this into a DataFrame, where df[0] holds "[25/May/2015:23:11:15" and df[1] holds "000]". I want to write all rows whose timestamps end with the same seconds value to one file. In the example above the seconds are 15 and 16, so all rows ending in 15 go into one file, those ending in 16 into another, and so on for other values.
I have tried the below code
import pandas as pd
data = pd.read_csv('apache-access-log.txt', sep=" ", header=None)
df = pd.DataFrame(data)
print(df[0],df[1].str[-2:])
Converting that column to a datetime would make it easier to work with, e.g. (assuming the leading '[' has been stripped and the columns are named date and value):
df['date'] = pd.to_datetime(df['date'], format='%d/%b/%Y:%H:%M:%S')
Then you can simply iterate over a groupby(), e.g.:
In []:
for k, frame in df.groupby(df['date'].dt.second):
    #frame.to_csv('file{}.csv'.format(k))
    print('{}\n{}\n'.format(k, frame))
Out[]:
15
date value
0 2015-05-25 23:11:15 0
1 2015-05-25 23:11:15 0
16
date value
2 2015-05-25 23:11:16 0
3 2015-05-25 23:11:16 0
You can set your datetime as the index of the dataframe, and then use pandas' loc and to_csv. As the other answers point out, you should first convert your date column to datetime.
Example:
df = df.set_index(['date'])
df.loc['2015-05-25 23:11:15':'2015-05-25 23:11:15'].to_csv('df_data.csv')
Try this:
## Add a new column holding the seconds value
df['seconds'] = df.apply(lambda row: row[0].split(":")[3].split(" ")[0], axis=1)
for sec in df['seconds'].unique():
    ## filter by seconds
    print("Result ", df[df['seconds'] == sec])
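For reference, here is a self-contained sketch of the datetime route starting from the raw log lines. The column names date and value and the output file name are assumptions for the example, and %b relies on an English locale for month abbreviations such as "May".

```python
from io import StringIO
import pandas as pd

log = StringIO(
    "[25/May/2015:23:11:15 000]\n"
    "[25/May/2015:23:11:15 000]\n"
    "[25/May/2015:23:11:16 000]\n"
    "[25/May/2015:23:11:16 000]\n"
)
df = pd.read_csv(log, sep=" ", header=None, names=["date", "value"])

# Drop the leading '[' and parse; %b is the abbreviated month, %M the minute.
df["date"] = pd.to_datetime(df["date"].str.lstrip("["),
                            format="%d/%b/%Y:%H:%M:%S")

# One sub-frame per distinct seconds value.
by_second = {sec: frame for sec, frame in df.groupby(df["date"].dt.second)}
for sec, frame in by_second.items():
    # frame.to_csv(f"seconds_{sec}.csv", index=False)  # hypothetical file name
    print(sec, len(frame))
```

Grouping on dt.second puts rows from different minutes or hours into the same file as long as their seconds match, which is what the question asks for.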
I have a file with two different dates: one has a timestamp and one does not. I need to read the file, disregard the timestamp, and compare the two dates. If the two dates are the same then I need to spit it to the output file and disregard any other rows.
I'm having trouble knowing if I should be using a datetime function on the input and formatting the date there and then simply seeing if the two are equivalent? Or should I be using a timedelta?
I've tried a couple different ways but haven't had success.
df = pd.read_csv("File.csv", dtype={'DATETIMESTAMP': np.datetime64, 'DATE':np.datetime64})
Gives me : TypeError: the dtype < M8 is not supported for parsing, pass this column using parse_dates instead
I've also tried to just remove the timestamp and then compare, but the strings end up with different date formats and that doesn't work either.
df['RemoveTimestamp'] = df['DATETIMESTAMP'].apply(lambda x: x[:10])
df = df[df['RemoveTimestamp'] == df['DATE']]
Any guidance appreciated.
Here is my sample input CSV file:
"DATE", "DATETIMESTAMP"
"8/6/2014","2014-08-06T10:18:38.000Z"
"1/15/2013","2013-01-15T08:57:38.000Z"
"3/7/2013","2013-03-07T16:57:18.000Z"
"12/4/2012","2012-12-04T10:59:37.000Z"
"5/6/2014","2014-05-06T11:07:46.000Z"
"2/13/2013","2013-02-13T15:51:42.000Z"
import pandas as pd
import numpy as np
# your data, both columns are in string
# ================================================
df = pd.read_csv('sample_data.csv')
df
DATE DATETIMESTAMP
0 8/6/2014 2014-08-06T10:18:38.000Z
1 1/15/2013 2013-01-15T08:57:38.000Z
2 3/7/2013 2013-03-07T16:57:18.000Z
3 12/4/2012 2012-12-04T10:59:37.000Z
4 5/6/2014 2014-05-06T11:07:46.000Z
5 2/13/2013 2013-02-13T15:51:42.000Z
# processing
# =================================================
# convert string to datetime
df['DATE'] = pd.to_datetime(df['DATE'])
df['DATETIMESTAMP'] = pd.to_datetime(df['DATETIMESTAMP'])
# cast timestamp to date
df['DATETIMESTAMP'] = df['DATETIMESTAMP'].values.astype('<M8[D]')
DATE DATETIMESTAMP
0 2014-08-06 2014-08-06
1 2013-01-15 2013-01-15
2 2013-03-07 2013-03-07
3 2012-12-04 2012-12-04
4 2014-05-06 2014-05-06
5 2013-02-13 2013-02-13
# compare
df['DATE'] == df['DATETIMESTAMP']
0 True
1 True
2 True
3 True
4 True
5 True
dtype: bool
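To go from the comparison to the filtered output the question asks for, something along these lines should work. The third sample row is a deliberately made-up mismatch, and output.csv is a hypothetical file name.

```python
from io import StringIO
import pandas as pd

# Sample rows in the question's layout; the last row is a made-up mismatch.
csv = StringIO(
    '"DATE","DATETIMESTAMP"\n'
    '"8/6/2014","2014-08-06T10:18:38.000Z"\n'
    '"1/15/2013","2013-01-15T08:57:38.000Z"\n'
    '"3/7/2013","2013-03-08T16:57:18.000Z"\n'
)
df = pd.read_csv(csv)

date_only = pd.to_datetime(df["DATE"], format="%m/%d/%Y")
# Parse as UTC, drop the timezone, then zero out the time of day.
stamp_day = (pd.to_datetime(df["DATETIMESTAMP"], utc=True)
               .dt.tz_localize(None)
               .dt.normalize())

# Keep only the rows where both dates fall on the same day.
matches = df[date_only == stamp_day]
# matches.to_csv("output.csv", index=False)  # hypothetical output file
print(matches)
```

Because the original string columns are left untouched and only the parsed copies are compared, the rows written out keep the exact text they had in the input file. No timedelta is needed for a plain same-day check.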
How about:
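import time
filename = 'dates.csv'
with open(filename) as f:
    contents = f.readlines()
# skip the header row, then compare the two dates on each line
for line in contents[1:]:
    date1, date2 = line.strip().split(',')
    date1 = date1.strip('"')
    date2 = date2.split('T')[0].strip('"')
    # normalise M/D/YYYY to YYYY-MM-DD so both dates are comparable strings
    date1a = time.strftime("%Y-%m-%d", time.strptime(date1, "%m/%d/%Y"))
    if date1a == date2:
        print(line.strip())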
I have a pandas DataFrame; one of the columns contains date strings in the format YYYY-MM-DD,
e.g. '2013-10-28'.
At the moment the dtype of the column is object.
How do I convert the column values to Pandas date format?
Essentially equivalent to #waitingkuo, but I would use pd.to_datetime here (it seems a little cleaner, and offers some additional functionality e.g. dayfirst):
In [11]: df
Out[11]:
a time
0 1 2013-01-01
1 2 2013-01-02
2 3 2013-01-03
In [12]: pd.to_datetime(df['time'])
Out[12]:
0 2013-01-01 00:00:00
1 2013-01-02 00:00:00
2 2013-01-03 00:00:00
Name: time, dtype: datetime64[ns]
In [13]: df['time'] = pd.to_datetime(df['time'])
In [14]: df
Out[14]:
a time
0 1 2013-01-01 00:00:00
1 2 2013-01-02 00:00:00
2 3 2013-01-03 00:00:00
Handling ValueErrors
If you run into a situation where doing
df['time'] = pd.to_datetime(df['time'])
Throws a
ValueError: Unknown string format
That means you have invalid (non-coercible) values. If you are okay with having them converted to pd.NaT, you can add an errors='coerce' argument to to_datetime:
df['time'] = pd.to_datetime(df['time'], errors='coerce')
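As a quick illustration of what errors='coerce' does, here is a small sketch with one deliberately invalid entry (the sample values are made up):

```python
import pandas as pd

df = pd.DataFrame({"time": ["2013-01-01", "not a date", "2013-01-03"]})

# The invalid string becomes NaT; the valid ones are parsed as usual.
df["time"] = pd.to_datetime(df["time"], errors="coerce")
print(df["time"].isna().tolist())  # [False, True, False]
```

Checking isna() afterwards is a cheap way to see how many values were silently turned into NaT before relying on the column.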
Use astype
In [31]: df
Out[31]:
a time
0 1 2013-01-01
1 2 2013-01-02
2 3 2013-01-03
In [32]: df['time'] = df['time'].astype('datetime64[ns]')
In [33]: df
Out[33]:
a time
0 1 2013-01-01 00:00:00
1 2 2013-01-02 00:00:00
2 3 2013-01-03 00:00:00
I imagine a lot of data comes into Pandas from CSV files, in which case you can simply convert the date during the initial CSV read:
dfcsv = pd.read_csv('xyz.csv', parse_dates=[0])
where the 0 refers to the column the date is in.
You could also add , index_col=0 in there if you want the date to be your index.
See https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html
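A minimal sketch of both options together, using an in-memory stand-in for the CSV file (the contents are invented for the example):

```python
from io import StringIO
import pandas as pd

# Hypothetical file contents; the date is in column 0.
csv = StringIO("time,a\n2013-01-01,1\n2013-01-02,2\n2013-01-03,3\n")

# Parse column 0 as dates and use it as the index in one step.
dfcsv = pd.read_csv(csv, parse_dates=[0], index_col=0)
print(type(dfcsv.index).__name__)  # DatetimeIndex
```

Keep in mind that with index_col=0 the dates live in dfcsv.index, so dfcsv['time'] will raise a KeyError.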
Now you can do df['column'].dt.date
Note that if you don't see the time for datetime values that are all 00:00:00, the data is unchanged; the time part is only omitted from the display.
If you want to get the DATE and not DATETIME format:
df["id_date"] = pd.to_datetime(df["id_date"]).dt.date
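One thing to be aware of with .dt.date is that the result is no longer a datetime64 column. A small sketch of the difference, with made-up values and hypothetical variable names:

```python
import datetime
import pandas as pd

s = pd.to_datetime(pd.Series(["2013-10-28 14:30:00", "2013-10-29 09:00:00"]))

as_date = s.dt.date             # object column of datetime.date values
as_midnight = s.dt.normalize()  # still datetime64, time set to 00:00:00

print(as_date.dtype, as_midnight.dtype)
```

If you still need the .dt accessor, resampling, or fast comparisons afterwards, normalize() is probably the better choice; .dt.date is mainly useful when you want plain Python date objects.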
Another way to do this, which works well if you have multiple columns to convert to datetime:
cols = ['date1','date2']
df[cols] = df[cols].apply(pd.to_datetime)
It may be the case that dates need to be converted to a different frequency. In this case, I would suggest setting an index by dates.
#set an index by dates
df.set_index(['time'], drop=True, inplace=True)
After this, you can more easily convert to the type of date format you will need most. Below, I sequentially convert to a number of date formats, ultimately ending up with a set of daily dates at the beginning of the month.
#Convert to daily dates
df.index = pd.DatetimeIndex(data=df.index)
#Convert to monthly dates
df.index = df.index.to_period(freq='M')
#Convert to strings
df.index = df.index.strftime('%Y-%m')
#Convert to daily dates
df.index = pd.DatetimeIndex(data=df.index)
For brevity, I don't show that I run the following code after each line above:
print(df.index)
print(df.index.dtype)
print(type(df.index))
This gives me the following output:
Index(['2013-01-01', '2013-01-02', '2013-01-03'], dtype='object', name='time')
object
<class 'pandas.core.indexes.base.Index'>
DatetimeIndex(['2013-01-01', '2013-01-02', '2013-01-03'], dtype='datetime64[ns]', name='time', freq=None)
datetime64[ns]
<class 'pandas.core.indexes.datetimes.DatetimeIndex'>
PeriodIndex(['2013-01', '2013-01', '2013-01'], dtype='period[M]', name='time', freq='M')
period[M]
<class 'pandas.core.indexes.period.PeriodIndex'>
Index(['2013-01', '2013-01', '2013-01'], dtype='object')
object
<class 'pandas.core.indexes.base.Index'>
DatetimeIndex(['2013-01-01', '2013-01-01', '2013-01-01'], dtype='datetime64[ns]', freq=None)
datetime64[ns]
<class 'pandas.core.indexes.datetimes.DatetimeIndex'>
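If only the final result matters (daily dates pinned to the start of each month), the string step can be skipped. A sketch under the same sample data as above, with a made-up column a:

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3]},
                  index=pd.Index(["2013-01-01", "2013-01-02", "2013-01-03"],
                                 name="time"))

# strings -> DatetimeIndex -> monthly PeriodIndex -> timestamps at month start
df.index = pd.to_datetime(df.index).to_period("M").to_timestamp()
print(df.index)
```

to_timestamp() defaults to the start of each period, which is what produces the first-of-month dates here.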
For the sake of completeness, another option, which might not be the most straightforward one, a bit similar to the one proposed by #SSS, but using rather the datetime library is:
import datetime
df["Date"] = df["Date"].apply(lambda x: datetime.datetime.strptime(x, '%Y-%m-%d').date())
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 startDay 110526 non-null object
1 endDay 110526 non-null object
import pandas as pd
df['startDay'] = pd.to_datetime(df.startDay)
df['endDay'] = pd.to_datetime(df.endDay)
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 startDay 110526 non-null datetime64[ns]
1 endDay 110526 non-null datetime64[ns]
Try converting one of the rows to a timestamp using the pd.to_datetime function, and then use .map to apply the same function to the entire column.