I am converting my dates into MATLAB datenum values and saving them to a text file. Then I read that CSV and try to add those values to an existing dataframe after the id column. I can add the column at the end or as the first column, but I can't add it after a specific column. The number of rows is equal in both dataframes. It throws the following error:
KeyError: 'datenum'
Here is what I am doing:
import pandas as pd
import numpy as np
from datetime import datetime as dt

df = pd.read_csv(r'C:\example.csv')

def datenum(d):
    # MATLAB datenum: proleptic ordinal + 366, plus the fractional part of the day
    return 366 + d.toordinal() + (d - dt.fromordinal(d.toordinal())).total_seconds() / (24 * 60 * 60)

d = dt.strptime('2021-01-01 00:15:00', '%Y-%m-%d %H:%M:%S')

column = df['date']
new_column = [dt.strptime(i, '%Y-%m-%d %H:%M:%S') for i in column]
end_column = [datenum(i) for i in new_column]
for i in end_column:
    print(i)

df['Datetime'] = pd.to_datetime(df['date'])
df[df['Datetime'].diff() > pd.Timedelta('15min')]
np.savetxt('timePminus.txt', end_column, fmt='% 1.5f')
After that, I read the CSV back in:
import pandas as pd

df = pd.read_csv(r'example.csv')
df1 = pd.read_csv(r'time.csv')

df2 = pd.concat([df, df1], axis=1, join='outer')
print(df2)
print(df2.get('datenum'))

df3 = pd.DataFrame().assign(B=df2['B'], A=df2['A'], datenum=df2['datenum'], D=df2['D'])
print(df3)
It throws KeyError: 'datenum'.
Here is the dataframe / CSV:
A,B,C
2021-01-02 00:15:00,"43289,95698800",236985
2021-01-01 00:30:00,"425962,555555",236985
2021-01-01 00:45:00,"2368,56980000",236985
2021-01-01 01:00:00,"2368,56980000",236985
2021-01-15 01:15:00,"2368,56980000",236985
2021-05-01 01:30:00,"2368,56980000",236985
If I do
print(df2.get('datenum'))
the output is None.
The second dataframe:
datenum
738157.01042
738157.02083
738157.03125
738157.04167
738157.05208
738157.06250
Can someone please guide me on what is wrong? I have been trying for hours.
Thanks in advance.
If you are just looking to re-arrange your dataframe columns after the concat, you can do:
column_order = ['id', 'datenum', 'A', 'B', 'C']
df = df[column_order]
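If you instead want to place the new column at an exact position (e.g. directly after id) without listing every column, DataFrame.insert takes a position argument. A minimal sketch, with hypothetical data standing in for the question's two frames:

```python
import pandas as pd

# Hypothetical frames standing in for the question's data
df = pd.DataFrame({'id': [1, 2], 'A': ['x', 'y'], 'B': [10, 20]})
datenums = [738157.01042, 738157.02083]

# insert() places the column at an explicit position:
# position 1 puts it directly after the 'id' column
df.insert(1, 'datenum', datenums)
print(df.columns.tolist())  # ['id', 'datenum', 'A', 'B']
```

Note that insert modifies the frame in place and raises if a column of that name already exists.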
I am working with parquet and I need to use date32[day] objects for my dates but I am unclear how to use pandas to generate this exact datatype, rather than a timestamp.
Consider this example:
from datetime import datetime, date
import pyarrow.parquet as pq
import pandas as pd
df1 = pd.DataFrame({'date': [date.today()]})
df1.to_parquet('testdates.parquet')
pq.read_table("testdates.parquet") # date32[day]
# pandas version
df2 = pd.DataFrame({'date': [pd.to_datetime('2022-04-07')]})
df2.to_parquet('testdates2.parquet')
pq.read_table("testdates2.parquet") # timestamp[us]
From the pandas integration with pyarrow:
import pandas as pd
import pyarrow as pa
from datetime import date

df2 = pd.Series({'date': [date(2022, 4, 7)]})
df2_dat32 = pa.array(df2)
print("dataframe:", df2)
print("value of dataframe:", df2_dat32[0])
print("datatype:", df2_dat32.type)
Output
dataframe: date [2022-04-07]
dtype: object
value of dataframe: [datetime.date(2022, 4, 7)]
datatype: list<item: date32[day]>
Edit: If you have an entire column of dates, you will first need to convert datetime to date and then use the same method as above. See the example below:
import pandas as pd
import pyarrow as pa

# create a pandas DataFrame with one column holding five
# datetime strings through a dictionary
datetime_df = pd.DataFrame({'DateTime': ['2021-01-15 20:02:11',
                                         '1989-05-24 20:34:11',
                                         '2020-01-18 14:43:24',
                                         '2021-01-15 20:02:10',
                                         '1999-04-04 20:34:11']})
datetime_df['Date'] = pd.to_datetime(datetime_df['DateTime']).dt.date
date_series = pd.Series(datetime_df['Date'])
print(date_series)
Output:
0 2021-01-15
1 1989-05-24
2 2020-01-18
3 2021-01-15
4 1999-04-04
Name: Date, dtype: object
Then use pyarrow for conversion:
df2_dat32 = pa.array(date_series)
print("datatype of values in the dataframe with dates:", type(date_series[0]))
print("value of dataframe after converting using pyarrow:", df2_dat32[0])
print("datatype after converting using pyarrow :", df2_dat32.type)
Output:
datatype of values in the dataframe with dates: <class 'datetime.date'>
value of dataframe after converting using pyarrow: 2021-01-15
datatype after converting using pyarrow : date32[day]
This is how my dataframe looks:
datetime open high low close
2006-01-02 4566.95 4601.35 4542.00 4556.25
2006-01-03 4531.45 4605.45 4531.45 4600.25
2006-01-04 4619.55 4707.60 4616.05 4694.14
.
.
.
I need to calculate the monthly return in %.
Formula: (Month Closing Price - Month Open Price) / Month Open Price
I can't seem to get the open and closing price of a month, because in my df most months don't have a row for the 1st of the month, so I am having trouble calculating it.
Any help would be very much appreciated!
You need to use groupby and the agg function to get the first and last value of each column in each month:
import pandas as pd
df = pd.read_csv("dt.txt")
df["datetime"] = pd.to_datetime(df["datetime"])
df.set_index("datetime", inplace=True)
resultDf = df.groupby([df.index.year, df.index.month]).agg(["first", "last"])
resultDf["new_column"] = (resultDf[("close", "last")] - resultDf[("open", "first")])/resultDf[("open", "first")]
resultDf.index.rename(["year", "month"], inplace=True)
resultDf.reset_index(inplace=True)
resultDf
The code above results in a dataframe with a MultiIndex column. So, if you want to get, for example, the rows for the year 2010, you can do something like:
resultDf[resultDf["year"] == 2010]
You can create a custom grouper as follows:
import pandas as pd
import numpy as np
from io import StringIO
csvfile = StringIO(
"""datetime\topen\thigh\tlow\tclose
2006-01-02\t4566.95\t4601.35\t4542.00\t4556.25
2006-01-03\t4531.45\t4605.45\t4531.45\t4600.25
2006-01-04\t4619.55\t4707.60\t4616.05\t4694.14""")
df = pd.read_csv(csvfile, sep = '\t', engine='python')
df.datetime = pd.to_datetime(df.datetime, format = "%Y-%m-%d")
dg = df.groupby(pd.Grouper(key='datetime', axis=0, freq='M'))
Then each group of dg is a separate month, and since we converted datetime to a pandas datetime we can use classic arithmetic on it:
def monthly_return(datetime, close_value, open_value):
    # positional index of the earliest and latest date in the group
    index_start = np.argmin(datetime)
    index_end = np.argmax(datetime)
    return (close_value.iloc[index_end] - open_value.iloc[index_start]) / open_value.iloc[index_start]

dg.apply(lambda x: monthly_return(x.datetime, x.close, x.open))
Out[97]:
datetime
2006-01-31 0.02785
Freq: M, dtype: float64
Of course, a purely functional approach is possible instead of using the monthly_return function.
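For example, the same result can be sketched with named aggregation over a monthly period key, so no helper function is needed (reusing the sample data from above):

```python
import pandas as pd
from io import StringIO

csvfile = StringIO(
"""datetime\topen\thigh\tlow\tclose
2006-01-02\t4566.95\t4601.35\t4542.00\t4556.25
2006-01-03\t4531.45\t4605.45\t4531.45\t4600.25
2006-01-04\t4619.55\t4707.60\t4616.05\t4694.14""")
df = pd.read_csv(csvfile, sep='\t')
df['datetime'] = pd.to_datetime(df['datetime'], format='%Y-%m-%d')

# sort by date so 'first'/'last' pick the month's opening and closing rows
df = df.sort_values('datetime')
g = df.groupby(df['datetime'].dt.to_period('M')).agg(
    first_open=('open', 'first'), last_close=('close', 'last'))
monthly = (g['last_close'] - g['first_open']) / g['first_open']
print(monthly)  # ~0.02785 for 2006-01, as above
```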
I have the problem that the data from my import (stock prices from Yahoo) is not correct for a specific time period. I want to clear the data from 2010-01-01 until 2017-10-17 for "VAR1.DE" and replace it with NaN. I found the pandas function drop, but that deletes the whole column.
How can I solve this problem?
Here is my code:
from pandas_datareader import data as web
import pandas as pd
import numpy as np
from datetime import datetime
assets = ['1211.HK','BABA','BYND','CAP.DE','JKS','PLUG','QCOM','VAR1.DE']
weights = np.array([0.125,0.125,0.125,0.125,0.125,0.125,0.125,0.125])
stockStartDate='2010-01-01'
today = datetime.today().strftime('%Y-%m-%d')
df = pd.DataFrame()
for stock in assets:
    df[stock] = web.DataReader(stock, data_source='yahoo', start=stockStartDate, end=today)['Adj Close']
Instead of using a for loop, you can simply do:
df = web.DataReader(name=assets, data_source='yahoo', start=stockStartDate, end=today)['Adj Close']
Since the returned dataframe is indexed by datetime (i.e. a pd.DatetimeIndex), you can simply do:
df.loc[:'2017-10-17', 'VAR1.DE'] = np.nan
This reassigns the values in column 'VAR1.DE' on or before '2017-10-17' as NaN.
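As a minimal sketch of that slicing (with a small made-up frame in place of the Yahoo download), note that the label slice on a DatetimeIndex includes the end date:

```python
import pandas as pd
import numpy as np

# Toy frame standing in for the downloaded prices
idx = pd.date_range('2017-10-15', periods=5, freq='D')
df = pd.DataFrame({'VAR1.DE': [1.0, 2.0, 3.0, 4.0, 5.0]}, index=idx)

# Partial-string slicing: everything up to and including 2017-10-17
df.loc[:'2017-10-17', 'VAR1.DE'] = np.nan
print(df['VAR1.DE'].isna().sum())  # 3 rows cleared
```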
I'm creating a dataframe that has a range of dates in datetime. This works, but I know there must be a more elegant way to do it. Any thoughts?
date_range = pd.DataFrame(pd.date_range(date(2019,8,30), date.today(), freq='D'))
date_range.rename(columns = {0:'date'}, inplace=True)
date_range = pd.DataFrame(set(date_range['date'].dt.date))
date_range.rename(columns = {0:'date'}, inplace=True)
To avoid the rename steps, you can name the columns directly:
from datetime import date
import pandas as pd
date_range = pd.DataFrame({'date': pd.date_range(date(2019,8,30), date.today(), freq='D')})
date_range = pd.DataFrame({'date':set(date_range['date'].dt.date)})
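If the goal is just a column of plain datetime.date values, the frame can also be built in one step; note that the set() in the original discards the ordering, while this sketch keeps the dates in order:

```python
from datetime import date
import pandas as pd

# .date on the DatetimeIndex yields datetime.date objects, already unique and ordered
date_range = pd.DataFrame({'date': pd.date_range(date(2019, 8, 30), date.today(), freq='D').date})
print(date_range.head())
```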
I'm fairly new at this. I have a csv that has a string date/time column as shown below. I am trying to average flow values based on month-year.
CSV:
X Flow
6/9/16/ 14:00 15000
Code:
import pandas as pd
from datetime import datetime
#import csv
df = pd.read_csv('monthlyaverage.csv', header=True)
date_object = datetime.strptime('6/9/16 14:00', '%m/%d/%y %H:%M')
df.set_index(pd.DatetimeIndex(df))
df1 = df.groupby(pd.TimeGrouper(freq='%m/%y')).mean()
I think you can use read_csv and set column X as the index via the index_col parameter. Then convert the index to datetimes and then to_period. Last, groupby the index (level=0) and aggregate the mean:
import pandas as pd
import io
temp=u"""X,Flow
6/9/16/ 14:00,15000
6/9/16/ 14:00,55000
6/9/16/ 14:00,35000"""
# after testing, replace io.StringIO(temp) with the filename
df = pd.read_csv(io.StringIO(temp), index_col='X')
df.index = pd.to_datetime(df.index, format='%m/%d/%y/ %H:%M').to_period('M')
print (df)
Flow
2016-06 15000
2016-06 55000
2016-06 35000
print (df.groupby(level=0).mean())
Flow
2016-06 35000