I have a CSV (which I converted to a dataframe) consisting of company/stock data:
Symbol Quantity Price Cost date
0 DIS 9 NaN 20 20180531
1 SBUX 5 NaN 30 20180228
2 PLOW 4 NaN 40 20180731
3 SBUX 2 NaN 50 20191130
4 DIS 11 NaN 25 20171031
I am trying to use the IEX Cloud API to pull in the stock price for a given date, and then ultimately write that to the dataframe. Per the IEX Cloud API documentation, I can use the get_historical_data function, where the 2nd argument is the date: df = get_historical_data("SBUX", "20190617", close_only=True)
Everything works fine as long as I pass a raw date directly to the function (e.g., 20190617), but if I use a variable instead, I get ValueError: year 20180531 is out of range. I'm guessing something is wrong with the date format in my original CSV?
Here is my full code:
import os
from iexfinance.stocks import get_historical_data
import pandas as pd
os.environ['IEX_API_VERSION'] = 'iexcloud-sandbox'
os.environ['IEX_TOKEN'] = 'Tsk_5798c0ab124d49639bb1575b322841c4'
input_df = pd.read_csv("all.csv")
for index, row in input_df.iterrows():
    symbol = row['Symbol']
    date = row['date']
    temp_df = get_historical_data(symbol, date, close_only=True, output_format='pandas')
    price = temp_df['close'].values[0]
    print(temp_df)
Note that this is a public token, so it's okay to use
When you called get_historical_data("SBUX", "20190617", close_only=True),
you passed the date as a string.
But when you read the file with read_csv, this column
(containing 8-digit values) is inferred as integers,
and an integer is not a valid date argument, hence the error.
Try either of these:
convert this column to string after reading, or
pass dtype={'date': str} to read_csv,
so that the column is read as strings in the first place.
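For illustration, here is a minimal sketch of the difference (the inline CSV is hypothetical stand-in data for all.csv):

```python
import io
import pandas as pd

# A small inline CSV standing in for all.csv (hypothetical sample data)
csv_data = "Symbol,Quantity,Price,Cost,date\nDIS,9,,20,20180531\nSBUX,5,,30,20180228"

# Without dtype, the 8-digit dates are inferred as integers
df_int = pd.read_csv(io.StringIO(csv_data))
print(df_int['date'].dtype)    # int64

# With dtype={'date': str}, they stay strings, as the API expects
df_str = pd.read_csv(io.StringIO(csv_data), dtype={'date': str})
print(df_str['date'].dtype)    # object
print(df_str['date'].iloc[0])  # 20180531
```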
You should be fine if you transform your date column into datetime.
import pandas as pd
df = pd.DataFrame(['20180531'])
pd.to_datetime(df.values[:, 0])
Out[43]: DatetimeIndex(['2018-05-31'], dtype='datetime64[ns]', freq=None)
Then, your column will be correctly formatted for use elsewhere. You can insert this line below pd.read_csv():
df['date'] = pd.to_datetime(df['date'])
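As a minimal sketch (using stand-in integers for the CSV column), you can parse the column and then format it back to the 'YYYYMMDD' string form that worked when passed directly to the API:

```python
import pandas as pd

df = pd.DataFrame({'date': [20180531, 20180228]})

# Parse the 8-digit integers as dates, then format back to the
# 'YYYYMMDD' strings that get_historical_data accepted in the question
df['date'] = pd.to_datetime(df['date'].astype(str), format='%Y%m%d')
date_strings = df['date'].dt.strftime('%Y%m%d')
print(date_strings.iloc[0])  # 20180531
```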
Related
I would like to create a YrWeek column in YYYY-WW format, e.g. 2022-01, based on a Date column in YYYY-MM-DD format. I would like to keep the YrWeek column in datetime format, so it will make my life easier when I plot it out.
Below are the steps that I tried
First convert the Date to datetime64
df['Date'] = pd.to_datetime(df.Date, format='%Y-%m-%d')
Then I tried the following code that I researched here and there, but I still cannot figure out a way to create the YrWeek column in YYYY-WW as datetime64:
df['YrWeek'] = df.Date.dt.to_period('M') #this show 2021-01-01 to 2021-01-06 in the column and in the plot later
df['YrWeek'] = pd.to_datetime(df.Date.apply(lambda x:'{0}-{1}'.format(x.year, x.isocalendar().week)), format='%Y-%w', errors='coerce') # which return "NAT" in the column
df['Yrweek'] = pd.to_datetime(df.Date.dt.year.astype(str) + '-' + df.Date.isocalendar().week.astype(str), format='%Y-%w') # this seems an unsuccessful operation
Thanks in advance for your help. I am quite sure I've seen this somewhere, but I'm unable to recall it or get my head around the issue at the moment.
From the documentation of the datetime object:
A datetime object is a single object containing all the information from a date object and a time object.
A datetime64 value cannot represent just YYYY-WW, but you can use strftime() (see its documentation) to create an explicit format string.
Here is the sample code:
import pandas as pd
df = pd.DataFrame({'date_time': pd.date_range(start='2022-01-01', end='2022-01-31')})
df['YrWeek'] = df['date_time'].dt.strftime('%Y-%W')
df.head(5) # print sample result
date_time YrWeek
0 2022-01-01 2022-00
1 2022-01-02 2022-00
2 2022-01-03 2022-01
3 2022-01-04 2022-01
4 2022-01-05 2022-01
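If you do need an actual datetime64 column for plotting (strings like '2022-01' will not plot as dates), one alternative, not from the answer above, is to snap each date to the start of its week with to_period('W'):

```python
import pandas as pd

df = pd.DataFrame({'date_time': pd.date_range(start='2022-01-01', end='2022-01-10')})

# Map every date to the first day (Monday) of its weekly period; the result
# stays datetime64, so plotting libraries treat it as a real date axis
df['YrWeek_dt'] = df['date_time'].dt.to_period('W').dt.start_time

print(df['YrWeek_dt'].dtype)  # datetime64[ns]
```

Every date in the same week maps to the same timestamp, so grouping and plotting by week work while the dtype stays datetime64.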
I'm using pandas read_csv to extract data and reformat it. For example, "10/28/2018" from the column "HBE date" will be reformatted to read "eHome 10/2018"
It mostly works except I am getting reformatted values like "ehome 1.0/2015.0"
eHomeHBEdata['HBE date'] = pd.to_datetime(eHomeHBEdata['Course Completed'])
#extract month and year values
eMonths=[]
eYears =[]
eHomeDates = eHomeHBEdata['HBE date']
for eDate in eHomeDates:
    eMonth = eDate.month
    eYear = eDate.year
    eMonths.append(eMonth)
    eYears.append(eYear)
At this point, if I print(type(eMonth)) it returns as 'int.' And if I print the eYears list, I get values like 2013, 2014, 2015 etc.
But then I assign the lists to columns in the data frame . . .
eHomeHBEdata.insert(0,'workshop Month',eMonths)
eHomeHBEdata.insert(1,'workshop Year',eYears)
. . . after which print(eHomeHBEdata['workshop Month']) returns values like 2013.0, 2014.0, 2015.0. That's type float, right?
When I try to use the following code, I get the misformatted values mentioned above:
eHomeHBEdata['course session'] = "ehome " + eHomeHBEdata['workshop Month'].astype(str) + "/" + eHomeHBEdata['workshop Year'].astype(str)
eHomeHBEdata['start'] = eHomeHBEdata['workshop Month'].astype(str) + "/1/" + eHomeHBEdata['workshop Year'].astype(str) + " 12:00 PM"
Could someone explain what's going on here and help me fix it?
Solution
To convert (reformat) your date columns as MM/YYYY, all you need to do is:
df["Your_Column_Name"].dt.strftime('%m/%Y')
See Section-A and Section-B for two different use-cases.
A. Example
I have created some dummy data for this illustration with a column called: Date. To reformat this column as MM/YYYY I am using df.Dates.dt.strftime('%m/%Y') which is equivalent to df["Dates"].dt.strftime('%m/%Y').
import pandas as pd
## Dummy Data
dates = pd.date_range(start='2020/07/01', end='2020/07/07', freq='D')
df = pd.DataFrame(dates, columns=['Dates'])
# Solution
df['Reformatted_Dates'] = df.Dates.dt.strftime('%m/%Y')
print(df)
## Output:
# Dates Reformatted_Dates
# 0 2020-07-01 07/2020
# 1 2020-07-02 07/2020
# 2 2020-07-03 07/2020
# 3 2020-07-04 07/2020
# 4 2020-07-05 07/2020
# 5 2020-07-06 07/2020
# 6 2020-07-07 07/2020
B. If your input data is in the following format
In this case, first convert the column to datetime with pd.to_datetime, passing format='%m/%Y'. This lets you apply pandas datetime-specific methods on the column; for instance, try running pd.to_datetime(df.Dates, format='%m/%Y').dt.to_period(freq='M').
## Dummy Data
dates = [
    '10/2018',
    '11/2018',
    '8/2019',
    '5/2020',
]
df = pd.DataFrame(dates, columns=['Dates'])
print(df.Dates.dtype)
print(df)
## To convert the column to datetime (then .dt.strftime('%m/%Y') reformats it)
df['Dates'] = pd.to_datetime(df.Dates, format='%m/%Y')
print(df.Dates.dtype)
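Putting Section B together as a self-contained sketch (repeating the dummy data), the converted column can then be reformatted with strftime, which also zero-pads single-digit months:

```python
import pandas as pd

dates = ['10/2018', '11/2018', '8/2019', '5/2020']
df = pd.DataFrame(dates, columns=['Dates'])

# Parse the month/year strings, then format back as zero-padded MM/YYYY
df['Dates'] = pd.to_datetime(df['Dates'], format='%m/%Y')
df['Reformatted'] = df['Dates'].dt.strftime('%m/%Y')
print(df['Reformatted'].tolist())  # ['10/2018', '11/2018', '08/2019', '05/2020']
```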
C. Avoid using the for loop
You can use pandas' built-in vectorization on a column instead of looping over each row. Use .dt.month and .dt.year on the column to get the month and year as int:
eHomeHBEdata['HBE date'] = pd.to_datetime(eHomeHBEdata['Course Completed'])
eHomeDates = eHomeHBEdata['HBE date'] # this is now a datetime column
## This is what changed
eMonths = eHomeDates.dt.month
eYears = eHomeDates.dt.year
eHomeHBEdata.insert(0,'workshop Month',eMonths)
eHomeHBEdata.insert(1,'workshop Year',eYears)
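As for why the question saw 2013.0 instead of 2013: most likely some rows had missing dates. A NumPy-backed integer column cannot hold NaN, so pandas promotes the whole column to float. A minimal sketch:

```python
import pandas as pd

# An all-integer column stays int64...
ints_only = pd.Series([2013, 2014, 2015])
print(ints_only.dtype)     # int64

# ...but a single missing value forces promotion to float64,
# which is why 2013 prints as 2013.0
with_missing = pd.Series([2013, 2014, None])
print(with_missing.dtype)  # float64
```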
I am a Python beginner trying to read a csv file with pandas. The issue is that the date column in the csv has the format 2020-03-12 00:00:00+00:00. Already within the read_csv call, I want to change the date format to ISO (%Y-%m-%d). I have tried all the Stack Overflow solutions I could find, but none of them work. This is my code:
import time
from datetime import date
import pandas as pd
url = 'https://www.arcgis.com/sharing/rest/content/items/f10774f1c63e40168479a1feb6c7ca74/data'
countries = pd.read_csv(url,
usecols=[2, 5, 8],
index_col=['Landkreis', 'Meldedatum'],
parse_dates=['Meldedatum'],
squeeze=True
).sort_index()
Current result
The column "Meldedatum" should only show the date, not the hours and minutes. Yet, I can't change the format because it is an index column.
Your help is much appreciated!
Read your csv normally into the dataframe, without parse_dates.
Then do this:
countries['Meldedatum'] = pd.to_datetime(countries['Meldedatum']).dt.date
The .dt.date accessor drops the time component and should give you the format you want.
That's just how pandas displays a datetime object. It always stores fields for hours/minutes/seconds/milliseconds, even if they are all set to zero. You can't change this internal representation.
You can, however, cast datetime objects to string, in order to format their representation the way you want. Keep in mind that you lose all functionality of a datetime object along the way.
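For instance, formatting with strftime turns the column into plain strings:

```python
import pandas as pd

s = pd.Series(pd.to_datetime(['2020-03-12 00:00:00+00:00', '2020-03-13 00:00:00+00:00']))

# strftime returns plain strings, so the zero time-of-day no longer shows,
# but a string column loses the .dt datetime machinery
dates_only = s.dt.strftime('%Y-%m-%d')
print(dates_only.iloc[0])  # 2020-03-12
```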
It looks like you want to count the number of occurrences per day. If that's the case, you should use a groupby object. We don't need to set the index columns or parse dates in this case. We can also convert the representation of the datetime objects to strings, if that's your preference:
import time
from datetime import date
import pandas as pd
# get the data
url = 'https://www.arcgis.com/sharing/rest/content/items/f10774f1c63e40168479a1feb6c7ca74/data'
countries = pd.read_csv(url, usecols=[2, 5, 8], index_col=None, squeeze=True).sort_index()
# modify dates to strings
countries['Meldedatum'] = countries.Meldedatum.astype(str).apply(lambda x: x.split('T')[0])
# group by Landkreis and Meldedatum
grouped_countries = countries.groupby(['Landkreis', 'Meldedatum']).count()
print(grouped_countries)
# output:
AnzahlFall
Landkreis Meldedatum
LK Ahrweiler 2020-03-12 5
2020-03-13 2
2020-03-14 1
2020-03-16 3
2020-03-17 5
... ...
StadtRegion Aachen 2020-04-14 8
2020-04-15 37
2020-04-16 23
2020-04-17 18
2020-04-18 5
My dataframe contains numerous incorrect datetime values that were fat-fingered in by the people who entered the data. Mostly, 2019-11-12 was entered as 0019-11-12, and 2018 as 0018. There are so many of them that I came up with a script to correct them en masse. I used the following code:
df['A'].loc[df.A.dt.year<100]=df.A.dt.year+2000
Basically, I want to tell Python to detect any year less than 100 and add 2000 to it. However, I am getting the error "Out of bounds nanosecond timestamp: 19-11-19 00:00:00". Is there any solution to my problem? Thanks
This is because of the limitations of timestamps : see this post about out of bounds nanosecond timestamp.
Therefore, I suggest correcting the column as a string before turning it into a datetime column, as follows:
import pandas as pd
import re
df = pd.DataFrame({"A": ["2019-10-04", "0019-04-02", "0018-06-08", "2018-07-08"]})
# I look for every date starting with zero and another number and replace by 20
r = re.compile(r"^0[0-9]{1}")
df["A"] = df["A"].apply(lambda x: r.sub('20', x))
# then I transform to datetime
df["A"] = pd.to_datetime(df["A"], format='%Y-%m-%d')
df
Here is the result
A
0 2019-10-04
1 2019-04-02
2 2018-06-08
3 2018-07-08
You need to make sure that you can only have dates in 20XX (where X is any number) and not dates in 19XX or other before applying this.
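The apply with a compiled regex works fine; the same idea can also be written with pandas' vectorized str.replace (same assumption that every true year is 20XX):

```python
import pandas as pd

df = pd.DataFrame({"A": ["2019-10-04", "0019-04-02", "0018-06-08", "2018-07-08"]})

# Replace a leading '00' with '20' across the whole column at once,
# then parse; like the regex above, this assumes all true years are 20XX
df["A"] = pd.to_datetime(df["A"].str.replace(r"^00", "20", regex=True),
                         format="%Y-%m-%d")
print(df["A"].dt.year.tolist())  # [2019, 2019, 2018, 2018]
```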
An option would be to export to csv, make the changes, and import again.
df.to_csv('path/csvfile.csv')
with open("path/csvfile.csv", "r") as f:
    text = f.read().replace("0019-", "2019-")
with open("path/newcsv.csv", "w") as f:
    f.write(text)
df_new = pd.read_csv("path/newcsv.csv")
I am cleaning my database. In one of the tables, the time column has values like 0.013391204, and I am unable to convert these to a [mm:ss] time format. Is there a function to convert them to the required [mm:ss] format?
The head for the column
0 20:00
1 0.013391204
2 0.013333333
3 0.012708333
4 0.012280093
Use the below reproducible data:
import pandas as pd
df = pd.DataFrame({"time": ["20:00", "0.013391204", "0.013333333", "0.012708333", "0.012280093"]})
I expect the output to be like the first row of the column values shown above.
What is the correct time interpretation for, say, the first entry 0.013391204: is it 48 seconds?
Because if we treat it as a fraction of a day, the datetime module can convert the float into a time format.
Updating the answer to add the new information:
import datetime
str(datetime.timedelta(days=0.013391204))
Output: '0:19:17.000026'
Hope this helps :))
First convert the values with to_numeric and errors='coerce' to replace non-floats with missing values, then fill those back in with the original values prefixed by 00: for hours, and last convert with to_timedelta and unit='d':
df = pd.DataFrame({"time": ["20:00", "0.013391204", "0.013333333",
"0.012708333", "0.012280093"]})
s = pd.to_numeric(df['time'], errors='coerce').fillna(df['time'].radd('00:'))
df['new'] = pd.to_timedelta(s, unit='d')
print (df)
time new
0 20:00 00:20:00
1 0.013391204 00:19:17.000025
2 0.013333333 00:19:11.999971
3 0.012708333 00:18:17.999971
4 0.012280093 00:17:41.000035
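If the goal is literally an mm:ss string rather than a timedelta column, one extra step (a sketch assuming every duration is under an hour) is to round to whole seconds and rebuild the string from the timedelta components:

```python
import pandas as pd

df = pd.DataFrame({"time": ["20:00", "0.013391204", "0.013333333"]})

# Numeric entries are day fractions; the rest are treated as mm:ss strings
num = pd.to_numeric(df['time'], errors='coerce')
df['new'] = pd.to_timedelta(num, unit='d').fillna(
    pd.to_timedelta(df['time'].radd('00:'), errors='coerce'))

# Round to whole seconds, then format the components as MM:SS
# (assumes no duration reaches a full hour)
comps = df['new'].dt.round('1s').dt.components
df['mmss'] = (comps['minutes'].map('{:02d}'.format)
              + ':' + comps['seconds'].map('{:02d}'.format))
print(df['mmss'].tolist())  # ['20:00', '19:17', '19:12']
```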