I have a df of shape (3000, 125).
The first row of my df represents bond tickers.
The 2nd row represents the date each was sold.
My index is a historical time series, and the values within the df represent the daily stock prices.
e.g.:
             AAPL        GOOGLE      IBM
             16/02/2018  15/03/2022  22/08/2020
2019/jan/02  5           4           3
2019/jan/03  4           4           4
2019/jan/04  4           4           5
2019/jan/05  3           5           2
2012/Mar/03  10          20          22
I would like to run a loop over the values; however, to do so, the index and df.iloc[0] (i.e. the first row) need to be in the same format.
I was able to convert the index to datetime format using the following code without issue:
dftest2.index = pd.to_datetime(dftest2.index, format='%Y%m%d')
The problem is that I'd like to convert the first row of the df to match the index format. The first row is in string format of the form '%d/%m/%Y', but in order for it to match the index it needs to be in '%Y%m%d'.
I've used the following code in order for it to match the date format of the index:
dftest2.iloc[0] = pd.to_datetime(dftest2.iloc[0]).dt.strftime('%Y-%m-%d')
Running the below code also produces the following error:
dftest2.iloc[0] = pd.to_datetime(dftest2.iloc[0]).datetime.strptime('%Y-%m-%d')
AttributeError: 'Series' object has no attribute 'datetime'
I'm stuck on how to convert this row into a datetime format matching the index. Previous attempts to convert to datetime have resulted in the row being converted into int format with nonsensical numbers, e.g. 187745300000.
How do I convert the row to match the index? The error I am getting now when running the loop is:
TypeError: '>' not supported between instances of 'numpy.ndarray' and 'str'
I've looked all over Stack Overflow for possible variations of my problem, but without success.
IIUC, you just want to turn the first row into a datetime object to do some further operations?
If so, this worked for me (note regex=False so "*" and "." are treated literally; with the default regex behavior, "." would match every character):
test_ = pd.to_datetime(df.iloc[0].str.replace("*", "", regex=False).str.replace(".", "", regex=False))
print(test_)
AAPL 2017-04-01
Google 2021-02-03
IBM 2020-03-03
Name: 0, dtype: datetime64[ns]
If you pass the result through the .strftime method, you will end up with object dtype (strings) again.
Hope that helps.
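Applied to the question's data, a minimal sketch (assuming the first row holds '%d/%m/%Y' strings such as '16/02/2018'); parsing the row into its own Series avoids writing Timestamps back into the object-dtype frame, which is what caused the mixed-type comparison errors:

# Parse the sale dates from the first row with an explicit day-first format
sold_dates = pd.to_datetime(dftest2.iloc[0], format='%d/%m/%Y')

# Both sides are now datetime64, so comparisons against the index work
for ticker, sold in sold_dates.items():
    after_sale = dftest2.index > sold  # boolean mask per ticker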
Related
I tried to convert two columns to the same format, datetime in this case.
a['sale_date'] = pd.to_datetime(a['sale_date'])
a['last_date'] = pd.to_datetime(a['last'])
a[a.last > a.sale_date]
When I output the dtypes they both show up as the same:
sale_date datetime64[ns]
last_date datetime64[ns]
But I get an error from the comparison of sale_date with last that says:
Invalid comparison between dtype=datetime64[ns] and method
Does this mean they are different types? Why does this not show up when I use .dtypes? Visually the outputs look comparable.
last is the name of an existing pandas method. So, it is better to avoid using last as a column name. If you can't avoid it, then you have to select the column using square brackets.
a = pd.DataFrame({'sale_date': pd.date_range('2018-04-09', periods=4, freq='3D'),
                  'last': pd.date_range('2018-04-12', periods=4, freq='1D')})
a[a["last"] > a.sale_date]
# sale_date last
# 0 2018-04-09 2018-04-12
# 1 2018-04-12 2018-04-13
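To see why the attribute access fails: a.last resolves to the built-in DataFrame.last method, not to the column, so the comparison is made against a bound method rather than a datetime Series.

print(type(a.last))     # <class 'method'> - DataFrame.last, not the column
print(type(a['last']))  # <class 'pandas.core.series.Series'>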
I have a CSV (which I converted to a dataframe) consisting of company/stock data:
Symbol Quantity Price Cost date
0 DIS 9 NaN 20 20180531
1 SBUX 5 NaN 30 20180228
2 PLOW 4 NaN 40 20180731
3 SBUX 2 NaN 50 20191130
4 DIS 11 NaN 25 20171031
I am trying to use the IEX Cloud API to pull in the stock price for a given date, and then ultimately write that back to the dataframe. Per the IEX Cloud API documentation, I can use the get_historical_data function, where the 2nd argument is the date: df = get_historical_data("SBUX", "20190617", close_only=True)
Everything works fine so long as I pass in a raw date directly to the function (e.g., 20190617), but if I try using a variable instead, I get ValueError: year 20180531 is out of range. I'm guessing something is wrong with the date format in my original CSV?
Here is my full code:
import os

import pandas as pd
from iexfinance.stocks import get_historical_data

os.environ['IEX_API_VERSION'] = 'iexcloud-sandbox'
os.environ['IEX_TOKEN'] = 'Tsk_5798c0ab124d49639bb1575b322841c4'

input_df = pd.read_csv("all.csv")

for index, row in input_df.iterrows():
    symbol = row['Symbol']
    date = row['date']
    temp_df = get_historical_data(symbol, date, close_only=True, output_format='pandas')
    price = temp_df['close'].values[0]
    print(temp_df)
Note that this is a public sandbox token, so it's okay to use.
When you called get_historical_data("SBUX", "20190617", close_only=True),
you passed the date as a string.
But when you read a DataFrame using read_csv, this column
(containing 8-digit strings) is converted to an integer.
This difference can be the source of the problem.
Try one of two things:
convert this column to string, or
while reading the DataFrame, pass dtype={'date': str},
so that this column will be read as a string (see the sketch below).
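A minimal sketch of both options, assuming the same all.csv as in the question:

# Option 1: keep the 8-digit values as strings while reading
input_df = pd.read_csv("all.csv", dtype={'date': str})

# Option 2: convert the column after reading
input_df['date'] = input_df['date'].astype(str)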
You should be fine if you transform your date column into datetime.
import pandas as pd
df = pd.DataFrame(['20180531'])
pd.to_datetime(df.values[:, 0])
Out[43]: DatetimeIndex(['2018-05-31'], dtype='datetime64[ns]', freq=None)
Then, your column will be correctly formatted for use elsewhere. You can insert this line below pd.read_csv():
df['date'] = pd.to_datetime(df['date'])
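Continuing from the question's code, a minimal sketch of the loop (assuming get_historical_data accepts datetime objects for its date argument, which the iexfinance documentation indicates):

for index, row in input_df.iterrows():
    symbol = row['Symbol']
    # Parse the 8-digit value into a proper datetime instead of passing a raw int
    date = pd.to_datetime(str(row['date']), format='%Y%m%d')
    temp_df = get_historical_data(symbol, date, close_only=True, output_format='pandas')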
I have tried many things and cannot seem to get this to work. In essence, I want to do this because an error occurs when I'm trying to convert this ndarray to a DataFrame. The following error occurs when finding missing datetime64 values within the DataFrame:
"Out of bounds nanosecond timestamp: 1-01-01 00:00:00"
Therefore I wish to convert these datetime64 columns into strings, recode '1-01-01 00:00:00' within the ndarray, and then convert them back to datetime variables in a DataFrame in order to avoid the error shown above.
with sRW.SavReaderNp('C:/Users/Sam/Downloads/data.sav') as reader:
    record = reader.all()
prints:
[(b'61D8894E-7FB0-3DE6-E053-6C04A8C01207', 250000., '2019-08-05T00:00:00.000000',
(b'61D8894E-7FB0-3DE6-E053-6C04A8C01207', 250000., '2019-08-05T00:00:00.000000',
(b'61D8894E-7FB0-3DE6-E053-6C04A8C01207', 250000., '0001-01-01T00:00:00.000000',)]
First of all, please check that your post is valid, i.e. contains runnable code.
Your example raises a syntax error, and the code where you tried what you explained is simply not there.
However, I assume your data looks like
arr = [(b'61D8894E-7FB0-3DE6-E053-6C04A8C01207', 250000., '2019-08-05T00:00:00.000000'),
(b'61D8894E-7FB0-3DE6-E053-6C04A8C01207', 250000., '2019-08-05T00:00:00.000000'),
(b'61D8894E-7FB0-3DE6-E053-6C04A8C01207', 250000., '0001-01-01T00:00:00.000000')]
which looks converted to a dataframe like
df = pd.DataFrame(arr, columns=['ID', 'value', 'date'])
# ID ... date
# 0 b'61D8894E-7FB0-3DE6-E053-6C04A8C01207' ... 2019-08-05T00:00:00.000000
# 1 b'61D8894E-7FB0-3DE6-E053-6C04A8C01207' ... 2019-08-05T00:00:00.000000
# 2 b'61D8894E-7FB0-3DE6-E053-6C04A8C01207' ... 0001-01-01T00:00:00.000000
Then your attempt to convert the date strings into datetime objects was probably
df.date = pd.to_datetime(df.date)
# OutOfBoundsDatetime: Out of bounds nanosecond timestamp: 1-01-01 00:00:00
which results in the error message you posted in your question.
You can catch these parsing errors with the errors kwarg of pd.to_datetime:
df.date = pd.to_datetime(df.date, errors='coerce')
# ID value date
# 0 b'61D8894E-7FB0-3DE6-E053-6C04A8C01207' 250000.0 2019-08-05
# 1 b'61D8894E-7FB0-3DE6-E053-6C04A8C01207' 250000.0 2019-08-05
# 2 b'61D8894E-7FB0-3DE6-E053-6C04A8C01207' 250000.0 NaT
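If you want to recode the out-of-range dates rather than leave them missing, you can count and replace the NaT values afterwards; the 1900-01-01 placeholder below is just an assumed example:

n_bad = df.date.isna().sum()                          # how many dates failed to parse
df.date = df.date.fillna(pd.Timestamp('1900-01-01'))  # recode NaT to a placeholder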
I have a column of type object that contains 500 rows of dates. I converted the column type to date and I am trying to get a count of the incorrect values, in order to fix them.
Here is a sample of the column; you can see examples of the wrong values in rows 3 and 5:
0 2018-06-14
1 2018-11-12
2 2018-10-09
3 2018-24-08
4 2018-11-12
5 11-02-2018
6 2018-12-31
I can fix the dates if I use this code:
dirtyData['date'] = pd.to_datetime(dirtyData['date'],dayfirst=True)
But I would like to check that the format in every row is '%Y-%m-%d' and get the count of the inconsistent formats first. Then change the values.
Is it possible to achieve this?
The below code will work. However, as Michael Gardner mentioned, it won't distinguish between days and months if the day is 12 or less.
import datetime

import pandas as pd

# Replicate your date series
date_list = ["2018-06-14", "2018-11-12", "2018-10-09", "2018-24-08",
             "2018-11-12", "11-02-2018", "2018-12-31"]
series1 = pd.Series(date_list)
print(series1)

count = 0
for item in series1:
    try:
        # Checks if the date format is Year-Month-Day
        datetime.datetime.strptime(item, "%Y-%m-%d")
    except ValueError:
        # If there is a ValueError, count it as an inconsistent format
        count += 1
print(count)
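A vectorized alternative is a sketch along these lines: parse with an explicit format and errors='coerce', then count the rows that came back as NaT.

parsed = pd.to_datetime(series1, format="%Y-%m-%d", errors="coerce")
count = parsed.isna().sum()  # rows not in %Y-%m-%d format
print(count)                 # 2 for the sample above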
I have a large data set in which users entered data via a CSV. I converted the CSV into a dataframe with pandas. The column has over 1000 entries; here is a sample:
datestart
5/5/2013
6/12/2013
11/9/2011
4/11/2013
10/16/2011
6/15/2013
6/19/2013
6/16/2013
10/1/2011
1/8/2013
7/15/2013
7/22/2013
7/22/2013
5/5/2013
7/12/2013
7/29/2013
8/1/2013
7/22/2013
3/15/2013
6/17/2013
7/9/2013
3/5/2013
5/10/2013
5/15/2013
6/30/2013
6/30/2013
1/1/2006
00/00/0000
7/1/2013
12/21/2009
8/14/2013
Feb 1 2013
Then I tried converting the dates into years using:
df['year']=df['datestart'].astype('timedelta64[Y]')
But it gave me an error:
ValueError: Value cannot be converted into object Numpy Time delta
Using datetime64:
df['year']=pd.to_datetime(df['datestart']).astype('datetime64[Y]')
it gave:
"ValueError: Error parsing datetime string ""03/13/2014"" at position 2"
Since that column was filled in by users, the majority was in the format MM/DD/YYYY, but some data was entered like this: Feb 10 2013, and there was one entry like this: 00/00/0000. I am guessing the different formats screwed up the processing.
Is there a try loop, if statement, or something with which I can skip over problems like these?
If datetime parsing fails, I will be forced to use a str.extract script, which also works:
year=df['datestart'].str.extract("(?P<month>[0-9]+)(-|\/)(?P<day>[0-9]+)(-|\/)(?P<year>[0-9]+)")
del df['month'], df['day']
and use concat to take the year out.
With df['year'] = pd.to_datetime(df['datestart'], coerce=True, errors='ignore').astype('datetime64[Y]'), the error message is:
Message File Name Line Position
Traceback
<module> C:\Users\0\Desktop\python\Example.py 23
astype C:\Python33\lib\site-packages\pandas\core\generic.py 2062
astype C:\Python33\lib\site-packages\pandas\core\internals.py 2491
apply C:\Python33\lib\site-packages\pandas\core\internals.py 3728
astype C:\Python33\lib\site-packages\pandas\core\internals.py 1746
_astype C:\Python33\lib\site-packages\pandas\core\internals.py 470
_astype_nansafe C:\Python33\lib\site-packages\pandas\core\common.py 2222
TypeError: cannot astype a datetimelike from [datetime64[ns]] to [datetime64[Y]]
You first have to convert the column with the date values to datetimes with to_datetime():
df['datestart'] = pd.to_datetime(df['datestart'], errors='coerce')
This should normally parse the different formats flexibly (the errors='coerce' is important here to convert invalid dates to NaT; older pandas spelled this coerce=True).
If you then want the year part of the dates, you can do the following (it seems doing astype directly on the pandas column gives an error, but via values you can get the underlying numpy array):
df['datestart'].values.astype('datetime64[Y]')
The problem with this is that it again gives an error when assigning the result to a column, due to the NaT value (this seems to be a bug; you can work around it by doing df = df.dropna()). But also, when you assign it to a column, it gets converted back to datetime64[ns], as this is the way pandas stores datetimes. So I personally think that if you want a column with the years, you are better off doing the following:
df['year'] = pd.DatetimeIndex(df['datestart']).year
This last one will return the year as an integer.
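Putting the pieces together, a minimal sketch in current pandas (where coerce=True has become errors='coerce' and the .dt accessor replaces the DatetimeIndex detour; truly mixed styles like 'Feb 1 2013' may additionally need format='mixed' in pandas 2.x):

import pandas as pd

df = pd.DataFrame({'datestart': ['5/5/2013', '6/12/2013', '00/00/0000']})

# Invalid entries such as 00/00/0000 become NaT instead of raising
df['datestart'] = pd.to_datetime(df['datestart'], errors='coerce')

# Year per row; NaT rows become NaN, so the column is float64
df['year'] = df['datestart'].dt.year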