Say I'm looking at the Rdatasets acme.csv found here. How do I import this with an appropriately coarse date? Using parse_dates, it assigns the day to the present day (today being the 18th of July), since no day was specified. Can I make it deal with just month/year, as in the table, but keep using the date functionality of pandas?
import pandas as pd
url = 'http://vincentarelbundock.github.io/Rdatasets/csv/boot/acme.csv'
df = pd.read_csv(url, parse_dates=[1])
df.drop('Unnamed: 0', axis=1, inplace=True)
Don't parse dates in read_csv(); instead, use to_datetime with a format:
df['month'] = pd.to_datetime(df['month'], format='%m/%y')
Or you can use that function within read_csv() via a lambda:
df = pd.read_csv(url, parse_dates=['month'], date_parser=lambda x: pd.to_datetime(x, format='%m/%y'))
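As a version note: in pandas 2.0+ the date_parser argument is deprecated in favour of date_format, which takes the format string directly. A minimal sketch of the same call, assuming a 2.0+ install:
df = pd.read_csv(url, parse_dates=['month'], date_format='%m/%y')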
But you always get some day number in the datetime.
BTW: a datetime always has a time component too, even though pandas sometimes doesn't show it:
print(df['month'].head())
print(df['month'].apply(lambda x: x.time()).head())
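If you want the column itself to be month-coarse while keeping pandas date functionality, one option (a sketch, not part of the original answer) is to convert the parsed column to a monthly Period, which drops the day component but still supports sorting, comparisons, and date arithmetic:
# Convert the parsed datetimes to month periods such as 1986-01
df['month'] = df['month'].dt.to_period('M')
print(df['month'].head())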
I've searched for two hours but can't find an answer for this that works.
I have this dataset I'm working with, and I'm trying to find the latest date, but it seems like my code is not taking the year into account. Here are some of the dates that I have in the dataset.
Date
01/09/2023
12/21/2022
12/09/2022
11/19/2022
Here's a snippet from my code
import pandas as pd
df=pd.read_csv('test.csv')
df['Date'] = pd.to_datetime(df['Date'])
st.write(df['Date'].max())
st.write gives me 12/21/2022 as the output instead of 01/09/2023, as it should be. So it seems like the code is not taking the year into account and is just looking at the month and day.
I tried changing the format to
df['Date'] = df['Date'].dt.strftime('%Y%m%d').astype(int)
but that didn't change anything.
pandas.read_csv allows you to designate columns for conversion into dates. Let test.csv content be
Date
01/09/2023
12/21/2022
12/09/2022
11/19/2022
then
import pandas as pd
df = pd.read_csv('test.csv', parse_dates=["Date"])
print(df['Date'].max())
gives output
2023-01-09 00:00:00
Explanation: I provide a list of names of columns holding dates, which read_csv then parses.
(tested in pandas 1.5.2)
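If you'd rather not rely on the default month-first parsing, you can also pass the format explicitly; a sketch, assuming the mm/dd/yyyy layout shown above:
df = pd.read_csv('test.csv')
# An explicit format makes 01/09/2023 unambiguously January 9, 2023
df['Date'] = pd.to_datetime(df['Date'], format='%m/%d/%Y')
print(df['Date'].max())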
I am pulling a time series from a csv file which has dates in "mm/dd/yyyy" format:
from datetime import datetime
df = pd.read_csv('lib_file.csv')
df['Date'] = df['Date'].apply(lambda x: datetime.strptime(x, '%m/%d/%Y').strftime('%d/%m/%Y'))
Below is the output.
I then convert the dtype of ['Date'] from object to datetime64:
df['Date'] = pd.to_datetime(df['Date'])
but that changes my dates as well
how do I fix it?
Try this:
df['Date'] = pd.to_datetime(df['Date'], infer_datetime_format=True)
This will infer the date format from the first non-NaN element, which is being correctly parsed in your case, and will not re-infer the format for each and every row of the dataframe.
Just using the code below helped:
df = pd.read_csv('lib_file.csv')
df['Date'] = pd.to_datetime(df['Date'])
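More generally, you can skip the round-trip through strings: parse the original mm/dd/yyyy values once with an explicit format, keep the column as datetime64, and render dd/mm/yyyy only for display. A sketch under those assumptions:
df = pd.read_csv('lib_file.csv')
# Parse the original strings directly, no intermediate reformatting
df['Date'] = pd.to_datetime(df['Date'], format='%m/%d/%Y')
# Produce dd/mm/yyyy strings only when displaying or exporting
print(df['Date'].dt.strftime('%d/%m/%Y').head())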
I have a csv file with many lines and three columns: the first column is the unix time, the second column the price, and the third column the volume of the symbol traded at that specific price. What I'm doing is calculating OHLC for different time frames (e.g. 1h, 4h, 12h, 1d) out of that csv file. That works very well after first converting the unix time into datetime.
Code:
import pandas as pd
df = pd.read_csv('file.csv', names=['date', 'price', 'volume'])
df['date'] = pd.to_datetime(df['date'], unit='s')
df = df.set_index('date')
df = df['price'].resample('4h').ohlc()
df.to_csv('file_4h_ohlc.csv')
Result:
date,open,high,low,close
2017-05-01 20:00:00,0.757881,1.07,0.650011,1.069999
Target:
I now want to convert the datetime (2017-05-01 20:00:00) back to unix time (1493658000) within the same file, keeping the OHLC values, or if that's not possible, save it to a different file.
Thanks a lot for the support, and sorry if such a question has already been answered, but I didn't find it.
You can create a new datestamp column instead of overwriting the existing one, resample on that, and then restore unix time on the index afterwards.
import pandas as pd
df = pd.read_csv('file.csv', names=['date', 'price', 'volume'])
df['datestamp'] = pd.to_datetime(df['date'], unit='s')
df = df.set_index('datestamp')
df = df['price'].resample('4h').ohlc()
# After the resample only the datestamp index survives, so rebuild the
# unix-time 'date' index from it (whole seconds since the epoch)
df.index = (df.index - pd.Timestamp('1970-01-01')) // pd.Timedelta('1s')
df.index.name = 'date'
df.to_csv('file_4h_ohlc.csv')
Alternatively, you can convert an existing datetime column to a Unix timestamp like so (this needs the standard library's datetime module imported):
import datetime
df['date'].apply(lambda x: (x - datetime.datetime(1970, 1, 1)).total_seconds())
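A vectorized alternative (a sketch, assuming df['date'] holds datetime64 values; the date_unix column name is just for illustration) avoids the per-row apply:
import pandas as pd
# Column-wise arithmetic: whole seconds since the Unix epoch
df['date_unix'] = (df['date'] - pd.Timestamp('1970-01-01')) // pd.Timedelta('1s')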
I'm having an issue where the date format is not matching up. In my .csv file the dates are in %m/%d/%Y format (e.g. 11/3/2001), but the error says %Y/%m/%d or %Y/%d/%m. I've tried all the possible permutations of year, month and day, and I continue to receive the same error: ValueError: time data '2001-11-03' does not match format '%Y:%m %d %H:%M:%S'. Below is my code. Thanks.
df = pd.read_excel('.xlsx', header=None)
df.to_csv('.csv', header=None, index=False)
df = pd.read_csv('.csv', index_col=[5, 8, 9, 12], date_parser=lambda x: datetime.datetime.strptime(x, '%Y/%m/%d %H:%M:%S').strftime('%m/%d/%Y'))
Note: What I'm trying to do is convert an .xlsx file to .csv and then remove the trailing 0:00 from multiple columns within the .csv file. Hope this helps.
Use parse from dateutil.parser to parse the dates appropriately. It is easy to use and handles many common formats without an explicit format string.
from dateutil.parser import parse
df = pd.read_csv('filename.csv', date_parser = parse, index_..)
Or you can use to_datetime, which is native to pandas:
pd.to_datetime(df['Date Col'])
In order to format the date properly, you should use the following:
date_parser=lambda x: parse(x)
# parse from dateutil.parser
df['Date Col'] = df['Date Col'].dt.strftime('%m/%d/%Y')
df.to_csv('New File.csv')
Note that strftime turns the column back into strings, so do this only as the final step before writing the file.
You can use to_datetime since you are using pandas.
import pandas as pd
df = pd.DataFrame({"a": ["11/3/2001", '2001-11-03']})
df["a"] = pd.to_datetime(df["a"])
print(df["a"])
Output:
0 2001-11-03
1 2001-11-03
Name: a, dtype: datetime64[ns]
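A caveat on newer versions: pandas 2.0+ infers the format from the first element and raises on columns that genuinely mix formats like the one above, so there you may need format="mixed":
# Parse each element independently (slower, but tolerates mixed layouts)
df["a"] = pd.to_datetime(df["a"], format="mixed")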
I want to use time series with Pandas. I read multiple time series one by one, from a csv file which has the date in the column named "Date" as (YYYY-MM-DD):
Date,Business,Education,Holiday
2005-01-01,6665,8511,86397
2005-02-01,8910,12043,92453
2005-03-01,8834,12720,78846
2005-04-01,8127,11667,52644
2005-05-01,7762,11092,33789
2005-06-01,7652,10898,34245
2005-07-01,7403,12787,42020
2005-08-01,7968,13235,36190
2005-09-01,8345,12141,36038
2005-10-01,8553,12067,41089
2005-11-01,8880,11603,59415
2005-12-01,8331,9175,70736
df = pd.read_csv(csv_file, index_col='Date', header=0)
Series_list = df.keys()
The time series can have different frequencies: day, week, month, quarter, year, and I want to index the time series according to a frequency I decide before I generate the ARIMA model. Could someone please explain how I can define the frequency of the series?
stepwise_fit = auto_arima(df[Series_name]....
pandas has a built-in function, pandas.infer_freq():
import pandas as pd
df = pd.DataFrame({'Date': ['2005-01-01', '2005-02-01', '2005-03-01', '2005-04-01'],
                   'Date1': ['2005-01-01', '2005-01-02', '2005-01-03', '2005-01-04'],
                   'Date2': ['2006-01-01', '2007-01-01', '2008-01-01', '2009-01-01'],
                   'Date3': ['2006-01-01', '2006-02-06', '2006-03-11', '2006-04-01']})
df['Date'] = pd.to_datetime(df['Date'])
df['Date1'] = pd.to_datetime(df['Date1'])
df['Date2'] = pd.to_datetime(df['Date2'])
df['Date3'] = pd.to_datetime(df['Date3'])
pd.infer_freq(df.Date)
#'MS'
pd.infer_freq(df.Date1)
#'D'
pd.infer_freq(df.Date2)
#'AS-JAN'
Alternatively, you could also make use of the datetime functionality of the columns:
df.Date.dt.freq
#'MS'
Of course if your data doesn't actually have a real frequency, then you won't get anything.
pd.infer_freq(df.Date3)
#
The frequency descriptions are documented under offset-aliases.
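To actually pin the series to a frequency you decide before fitting, one option (a sketch, assuming the monthly sample above and the csv_file variable from the question) is to parse the dates into the index and declare the frequency with asfreq:
import pandas as pd
df = pd.read_csv(csv_file, index_col='Date', parse_dates=True, header=0)
# Declare month-start frequency; any missing months appear as NaN rows
df = df.asfreq('MS')
print(df.index.freq)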