import pandas as pd

def arrange_by_date():
    df = pd.read_excel("logfile_Final_test.xlsx")
    df = df.sort_values(by="Date", ascending=True).set_index('Date').last('3M')
    df = df.sort_values(by="Date", ascending=False)
    df.to_excel("Master_logfile.xlsx", index=True)

arrange_by_date()
The data in the Excel sheet is in the form of:
Database Date Time Description
Some_DB_name 2022-12-25 some_time Some_Description
Some_DB_name 2023-01-14 some_time Some_Description
.. .. .. ..
Some_DB_name 2022-11-19 some_time Some_Description
Expected Output
Database Date Time Description
Some_DB_name 2023-01-14 some_time Some_Description
Some_DB_name 2022-12-25 some_time Some_Description
.. .. .. ..
Some_DB_name 2022-11-19 some_time Some_Description
Output I am getting in the Excel sheet:
Date Database Time Description
2023-01-14 00:00:00 Some_DB_name some_time Some_Description
2022-12-25 00:00:00 Some_DB_name some_time Some_Description
.. .. .. ..
2022-11-19 00:00:00 Some_DB_name some_time Some_Description
My Date column is getting shifted to the first position, with 00:00:00 appended to every date.
As indicated by @roganjosh, it shouldn't make any difference, but if you must have just the date and drop the time element: because your Date column is the df index, you can use:
df.index = df.index.date
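Putting the pieces together, a minimal sketch of the corrected function (file names as in the question). Note that Date appearing first in the output is simply because it is the index, which to_excel writes as the first column:
import pandas as pd

def arrange_by_date():
    df = pd.read_excel("logfile_Final_test.xlsx")
    # keep only the most recent 3 months of rows
    # (.last('3M') is deprecated in newer pandas; slicing the index by date also works)
    df = df.sort_values(by="Date", ascending=True).set_index('Date').last('3M')
    df = df.sort_values(by="Date", ascending=False)
    # replace the DatetimeIndex with plain dates so Excel shows no 00:00:00
    df.index = pd.Index(df.index.date, name='Date')
    df.to_excel("Master_logfile.xlsx", index=True)

arrange_by_date()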
I'm trying to import market data from a csv to run some backtests.
I wrote the following code:
import pandas as pd
import numpy as np
df = pd.read_csv("30mindata.csv")
df = df.drop(columns=['Volume', 'NumberOfTrades', 'BidVolume', 'AskVolume'])
print(df)
I'm getting the error:
KeyError: "['Volume', 'NumberOfTrades', 'BidVolume', 'AskVolume'] not found in axis"
When I remove the line of code containing drop() the dataframe prints as follows:
Date Time Open High Low Last Volume NumberOfTrades BidVolume AskVolume
0 2018/2/18 14:00:00 2734.50 2741.00 2734.00 2739.75 5304 2787 2299 3005
1 2018/2/18 14:30:00 2739.75 2741.00 2739.25 2740.25 1402 815 648 754
2 2018/2/18 15:00:00 2740.25 2743.50 2739.25 2742.00 4536 2301 2074 2462
3 2018/2/18 15:30:00 2742.25 2744.75 2742.25 2744.00 4102 1826 1949 2153
4 2018/2/18 16:00:00 2744.00 2744.25 2742.25 2742.25 2492 1113 1551 941
... ... ... ... ... ... ... ... ... ... ...
59074 2023/2/17 10:30:00 4076.25 4088.00 4076.00 4086.50 92507 54379 44917 47590
59075 2023/2/17 11:00:00 4086.50 4090.50 4079.25 4081.00 107233 67968 55784 51449
59076 2023/2/17 11:30:00 4081.00 4090.50 4079.50 4088.25 171507 92705 86022 85485
59077 2023/2/17 12:00:00 4088.00 4089.00 4085.25 4086.00 41032 17210 21176 19856
59078 2023/2/17 12:30:00 4086.25 4088.00 4085.25 4085.75 5164 2922 2818 2346
I have another file that uses this exact pattern of pd.read_csv() followed by df.drop(columns=[]), and it works just fine. I also tried df.loc[:, 'Volume'] and got the same KeyError saying 'Volume' was not found in the axis. I really don't understand how the labels can be missing from the dataframe when they print correctly without the .drop() call.
It's very likely that you have stray blank spaces in the names of your columns.
Try removing those spaces like this:
import pandas as pd
df = pd.read_csv("30mindata.csv")
df.columns = [col.strip() for col in df.columns]
Then try to drop the columns as before
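For illustration, a small self-contained example (with made-up data) of how a padded column name produces exactly this KeyError, and how stripping fixes it:
import pandas as pd

# ' Volume' has a leading space, as it might after a sloppy export
df = pd.DataFrame({'Date': ['2018/2/18'], ' Volume': [5304]})

try:
    df.drop(columns=['Volume'])
except KeyError as e:
    print(e)  # "['Volume'] not found in axis"

df.columns = [col.strip() for col in df.columns]  # strip the whitespace
df = df.drop(columns=['Volume'])  # works now
print(df)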
I have a data frame (df) that looks like this
timestamp datetime date time open \
0 1667520000000 2022-11-04 00:00:00+00:00 2022-11-04 00:00:00 0.2186
1 1667606400000 2022-11-05 00:00:00+00:00 2022-11-05 00:00:00 0.2589
2 1667692800000 2022-11-06 00:00:00+00:00 2022-11-06 00:00:00 0.2459
3 1667779200000 2022-11-07 00:00:00+00:00 2022-11-07 00:00:00 0.2315
4 1667865600000 2022-11-08 00:00:00+00:00 2022-11-08 00:00:00 0.2353
... ... ... ... ...
15012 1675728000000 2023-02-07 00:00:00+00:00 2023-02-07 00:00:00 0.2449
15013 1675814400000 2023-02-08 00:00:00+00:00 2023-02-08 00:00:00 0.2610
15014 1675900800000 2023-02-09 00:00:00+00:00 2023-02-09 00:00:00 0.2555
15015 1675987200000 2023-02-10 00:00:00+00:00 2023-02-10 00:00:00 0.2288
15016 1676073600000 2023-02-11 00:00:00+00:00 2023-02-11 00:00:00 0.2317
high low close volume symbol
0 0.2695 0.2165 0.2588 1.239168e+09 1000LUNC/USDT:USDT
1 0.2788 0.2414 0.2458 1.147000e+09 1000LUNC/USDT:USDT
2 0.2554 0.2292 0.2315 5.137089e+08 1000LUNC/USDT:USDT
3 0.2398 0.2263 0.2352 4.754763e+08 1000LUNC/USDT:USDT
4 0.2404 0.1320 0.1895 1.618936e+09 1000LUNC/USDT:USDT
... ... ... ... ...
15012 0.2627 0.2433 0.2611 8.097549e+07 ZRX/USDT:USDT
15013 0.2618 0.2432 0.2554 7.009100e+07 ZRX/USDT:USDT
15014 0.2651 0.2209 0.2287 1.217487e+08 ZRX/USDT:USDT
15015 0.2361 0.2279 0.2317 6.072029e+07 ZRX/USDT:USDT
15016 0.2418 0.2300 0.2409 2.178281e+07 ZRX/USDT:USDT
I want to apply a function from pandas ta, called bbands, to each level of symbol, using the column 'close' as the input. The function returns multiple columns, but I only want to keep the one labeled 'BBM_20_2.0' and store it as another column in the df.
If I were to just apply the function to the entire df, ignoring the fact that each symbol has to be treated separately, I would do this:
daily_df['bbm'] = bbands(daily_df.close, 20, 2)['BBM_20_2.0']
I have tried to use groupby like this:
daily_df['bbm'] = daily_df.groupby(["symbol"]).apply(bbands(daily_df.close, 20, 2)['BBM_20_2.0'])
but I'm getting errors. Can anyone help?
Did you try
bbm = daily_df.groupby("symbol", group_keys=False).apply(
    lambda grp: bbands(grp.close, 20, 2)['BBM_20_2.0']
)
daily_df['bbm'] = bbm  # aligns on the original row index
With group_keys=False the per-symbol results keep the original row index, so the assignment lines up each value with its own row. (A plain reset_index() would yield three columns here, and merging on symbol alone would duplicate rows.)
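If pandas_ta isn't installed, the same per-group pattern can be checked with a plain 20-period rolling mean as a stand-in for bbands (the middle band BBM_20_2.0 is itself a 20-period simple moving average of close). A toy sketch with made-up data:
import numpy as np
import pandas as pd

# two symbols, 30 made-up closes each
daily_df = pd.DataFrame({
    'symbol': ['A'] * 30 + ['B'] * 30,
    'close': np.random.default_rng(0).uniform(0.2, 0.3, 60),
})

# per-symbol 20-period SMA, aligned back onto the original rows
daily_df['bbm'] = daily_df.groupby('symbol', group_keys=False)['close'].apply(
    lambda s: s.rolling(20).mean()
)
print(daily_df.groupby('symbol')['bbm'].count())  # 11 non-NaN values per symbol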
I tried to use the join function to combine the close prices of all 500 stocks over a 5-year period (2013-02-08 to 2018-02-07), where each column represents a stock and the index of the dataframe is the dates.
But the join function in pandas seems to automatically change the date format (index), leaving every entry in the combined dataframe as NaN.
The code to import and preview the file:
import pandas as pd
df= pd.read_csv('all_stocks_5yr.csv')
df.head()
# df.info()
df['Name'].unique().shape  # There are 505 stock names in total
dates = pd.date_range(df['date'].min(), df['date'].max())  # the full date range
Single out the close prices:
close_prices = pd.DataFrame(index=dates)  # make the dates the index
# close_prices.head()
symbols = df['Name'].unique()  # the stock names as an array
So I tried to test the result for each stock using the first 3 stocks:
i = 1
for symbol in symbols:
    df_sym = df[df['Name'] == symbol]
    df_tmp = pd.DataFrame(data=df_sym['close'].to_numpy(), index=df_sym['date'], columns=[symbol])
    print(df_tmp)  # print the temporary dataframes
    i += 1
    if i > 3: break
And the results were as expected, a dataframe indexed by date with only one stock:
AAL
date
2013-02-08 14.75
2013-02-11 14.46
2013-02-12 14.27
2013-02-13 14.66
2013-02-14 13.99
... ...
2018-02-01 53.88
2018-02-02 52.10
2018-02-05 49.76
2018-02-06 51.18
2018-02-07 51.40
[1259 rows x 1 columns]
AAPL
date
2013-02-08 67.8542
2013-02-11 68.5614
2013-02-12 66.8428
2013-02-13 66.7156
2013-02-14 66.6556
... ...
2018-02-01 167.7800
2018-02-02 160.5000
2018-02-05 156.4900
2018-02-06 163.0300
2018-02-07 159.5400
[1259 rows x 1 columns]
...
Now here's the part I find very confusing: I checked what happens when combining the first 3 stock dataframes using the join function, with index 'date':
i = 1
for symbol in symbols:
    df_sym = df[df['Name'] == symbol]
    df_tmp = pd.DataFrame(data=df_sym['close'].to_numpy(), index=df_sym['date'], columns=[symbol])
    close_prices = close_prices.join(df_tmp)
    i += 1
    if i > 3: break
close_prices.head()
Somehow the index appears to have changed from date to date-time format, so naturally the join function finds that none of the entries match the new index and puts NaN in every single entry.
What caused the date to change to date-time?
You can use pivot:
df = pd.read_csv('all_stocks_5yr.csv')
out = df.pivot(index='date', columns='Name', values='close')
Output:
>>> out.iloc[:, :8]
Name A AAL AAP AAPL ABBV ABC ABT ACN
date
2013-02-08 45.08 14.75 78.90 67.8542 36.25 46.89 34.41 73.31
2013-02-11 44.60 14.46 78.39 68.5614 35.85 46.76 34.26 73.07
2013-02-12 44.62 14.27 78.60 66.8428 35.42 46.96 34.30 73.37
2013-02-13 44.75 14.66 78.97 66.7156 35.27 46.64 34.46 73.56
2013-02-14 44.58 13.99 78.84 66.6556 36.57 46.77 34.70 73.13
... ... ... ... ... ... ... ... ...
2018-02-01 72.83 53.88 117.29 167.7800 116.34 99.29 62.18 160.46
2018-02-02 71.25 52.10 113.93 160.5000 115.17 96.02 61.69 156.90
2018-02-05 68.22 49.76 109.86 156.4900 109.51 91.90 58.73 151.83
2018-02-06 68.45 51.18 112.20 163.0300 111.20 91.54 58.86 154.69
2018-02-07 68.06 51.40 109.93 159.5400 113.62 94.22 58.67 155.15
[1259 rows x 8 columns]
Source 'all_stocks_5yr.csv': Kaggle
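As for the original question: pd.date_range produces a DatetimeIndex, while df_sym['date'] holds plain strings, so close_prices and df_tmp have indexes of different types and join matches nothing, which is where the NaNs come from. A minimal sketch of the fix if you want to keep the join approach, converting the strings inside the loop:
df_tmp = pd.DataFrame(
    data=df_sym['close'].to_numpy(),
    index=pd.to_datetime(df_sym['date']),  # datetimes now match the `dates` index
    columns=[symbol],
)
close_prices = close_prices.join(df_tmp)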
I was wondering how I could group data by month so I can look at the data on a per-month basis. How would I do that?
For example, put all the data recorded in January into its own dataframe for analysis, and so on.
Here is my current dataframe:
WC_Humidity[%] WC_Htgsetp[C] WC_Clgsetp[C] Date Time
0 55.553640 18 26 2005-01-01 00:10
1 54.204342 18 26 2005-01-01 00:20
2 51.896272 18 26 2005-01-01 00:30
3 49.007770 18 26 2005-01-01 00:40
4 45.825810 18 26 2005-01-01 00:50
Help is much appreciated.
Try this:
df['Date'].to_numpy().astype('datetime64[M]')
You can convert your column with the following code:
df['Date'] = pd.to_datetime(df['Date'].dt.strftime('%d-%m-%Y'), dayfirst=True)
And you can refer to the official documentation for how dayfirst parsing works:
https://pandas.pydata.org/docs/reference/api/pandas.to_datetime.html
If you have strings like 2005-01-01 then you can use
df['year-month'] = df['Date'].str[:7]
and later you can use
df.groupby('year-month')
Minimal working code. I changed the dates to have different months in the data. I use io only to simulate a file in memory.
text = '''WC_Humidity[%],WC_Htgsetp[C],WC_Clgsetp[C],Date,Time
55.553640,18,26,2005-01-01,00:10
54.204342,18,26,2005-01-01,00:20
51.896272,18,26,2005-02-01,00:30
49.007770,18,26,2005-02-01,00:40
45.825810,18,26,2005-03-01,00:50
'''
import pandas as pd
import io
df = pd.read_csv(io.StringIO(text))
df['year-month'] = df['Date'].str[:7]
print(df)
for value, group in df.groupby('year-month'):
    print()
    print('---', value, '---')
    print(group)
    print()
    print('average WC_Humidity[%]:', group['WC_Humidity[%]'].mean())
Result:
WC_Humidity[%] WC_Htgsetp[C] WC_Clgsetp[C] Date Time year-month
0 55.553640 18 26 2005-01-01 00:10 2005-01
1 54.204342 18 26 2005-01-01 00:20 2005-01
2 51.896272 18 26 2005-02-01 00:30 2005-02
3 49.007770 18 26 2005-02-01 00:40 2005-02
4 45.825810 18 26 2005-03-01 00:50 2005-03
--- 2005-01 ---
WC_Humidity[%] WC_Htgsetp[C] WC_Clgsetp[C] Date Time year-month
0 55.553640 18 26 2005-01-01 00:10 2005-01
1 54.204342 18 26 2005-01-01 00:20 2005-01
average WC_Humidity[%]: 54.878991
--- 2005-02 ---
WC_Humidity[%] WC_Htgsetp[C] WC_Clgsetp[C] Date Time year-month
2 51.896272 18 26 2005-02-01 00:30 2005-02
3 49.007770 18 26 2005-02-01 00:40 2005-02
average WC_Humidity[%]: 50.452021
--- 2005-03 ---
WC_Humidity[%] WC_Htgsetp[C] WC_Clgsetp[C] Date Time year-month
4 45.82581 18 26 2005-03-01 00:50 2005-03
average WC_Humidity[%]: 45.82581
If you have datetime objects then you can use
df['year-month'] = df['Date'].dt.strftime('%Y-%m')
and the rest is the same:
text = '''WC_Humidity[%],WC_Htgsetp[C],WC_Clgsetp[C],Date,Time
55.553640,18,26,2005-01-01,00:10
54.204342,18,26,2005-01-01,00:20
51.896272,18,26,2005-02-01,00:30
49.007770,18,26,2005-02-01,00:40
45.825810,18,26,2005-03-01,00:50
'''
import pandas as pd
import io
df = pd.read_csv(io.StringIO(text))
# create datetime objects
df['Date'] = pd.to_datetime(df['Date'], format='%Y-%m-%d')
df['year-month'] = df['Date'].dt.strftime('%Y-%m')
print(df)
for value, group in df.groupby('year-month'):
    print()
    print('---', value, '---')
    print(group)
    print()
    print('average WC_Humidity[%]:', group['WC_Humidity[%]'].mean())
If the column is in a format like 2021-01-29 or 30-12-2024, the line below should take care of it and parse it accordingly:
df1['Date'] = pd.to_datetime(df1['Date'])
Now you can use this code to convert the date column to the format you wanted:
df1['Date'] = df1['Date'].dt.strftime('%d/%m/%Y')
This should get you what you want.
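As another option (my own suggestion, not from the answers above): pandas Periods give you per-month groups directly, without building a string column. A short sketch reusing a few readings from the question:
import pandas as pd

df = pd.DataFrame({
    'WC_Humidity[%]': [55.553640, 54.204342, 51.896272],  # sample readings
    'Date': ['2005-01-01', '2005-01-01', '2005-02-01'],
})
df['Date'] = pd.to_datetime(df['Date'])

# group rows by calendar month via a period ('2005-01', '2005-02', ...)
for month, group in df.groupby(df['Date'].dt.to_period('M')):
    print(month, '->', group['WC_Humidity[%]'].mean())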
I have an observational data set which contains weather information. Each column contains a specific field, and date and time are in two separate columns. The time column contains hourly times like 0000, 0600, ... up to 2300. What I am trying to do is filter the data set on a certain time frame, for example between 0000 UTC and 0600 UTC. When I read the data file into a pandas data frame, the time column is read as float by default. When I try to convert it into a datetime object, it produces a format which I am unable to work with. A code example is given below:
import pandas as pd
import datetime as dt
df = pd.read_excel("test.xlsx")
df.head()
which produces the following result:
tdate itime moonph speed ... qnh windir maxtemp mintemp
0 01-Jan-17 1000.0 NM7 5 ... $1,011.60 60.0 $32.60 $22.80
1 01-Jan-17 1000.0 NM7 2 ... $1,015.40 999.0 $32.60 $22.80
2 01-Jan-17 1030.0 NM7 4 ... $1,015.10 60.0 $32.60 $22.80
3 01-Jan-17 1100.0 NM7 3 ... $1,014.80 999.0 $32.60 $22.80
4 01-Jan-17 1130.0 NM7 5 ... $1,014.60 270.0 $32.60 $22.80
Then I extracted the time column with the following line:
df["time"] = df.itime
df["time"]
0 1000.0
1 1000.0
2 1030.0
3 1100.0
4 1130.0
5 1200.0
6 1230.0
7 1300.0
8 1330.0
.
.
3261 2130.0
3262 2130.0
3263 600.0
3264 630.0
3265 730.0
3266 800.0
3267 830.0
3268 1900.0
3269 1930.0
3270 2000.0
Name: time, Length: 3279, dtype: float64
Then I tried to convert the time column to a datetime object:
df["time"] = pd.to_datetime(df.itime)
which produced the following result:
df["time"]
0 1970-01-01 00:00:00.000001000
1 1970-01-01 00:00:00.000001000
2 1970-01-01 00:00:00.000001030
3 1970-01-01 00:00:00.000001100
It appears that it has successfully converted the data to datetime objects. However, it treated the hour values as nanoseconds since the epoch, which makes filtering difficult.
The final data format I would like to get is either:
1970-01-01 06:00:00
or
06:00
Any help is appreciated.
When you read the Excel file, specify the dtype of the itime column as str:
df = pd.read_excel("test.xlsx", dtype={'itime':str})
then you will have a time column of strings looking like:
df = pd.DataFrame({'itime':['2300', '0100', '0500', '1000']})
Then specify the format and convert to time:
df['Time'] = pd.to_datetime(df['itime'], format='%H%M').dt.time
itime Time
0 2300 23:00:00
1 0100 01:00:00
2 0500 05:00:00
3 1000 10:00:00
Just an add-on to Chris's answer: if you are unable to convert because there is no leading zero, apply the following to the dataframe.
df['itime'] = df['itime'].apply(lambda x: x.zfill(4))
This is because the original format does not always have four digits, e.g. 945 instead of 0945.
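With the Time column built this way, the filtering the question actually asked for (between 0000 UTC and 0600 UTC) can be done with plain datetime.time comparisons. A short sketch with made-up values:
import datetime as dt
import pandas as pd

df = pd.DataFrame({'itime': ['2300', '100', '500', '1000']})  # '100' lacks its leading zero
df['itime'] = df['itime'].str.zfill(4)  # pad to 4 digits: '100' -> '0100'
df['Time'] = pd.to_datetime(df['itime'], format='%H%M').dt.time

# keep rows between 00:00 and 06:00 inclusive
mask = df['Time'].between(dt.time(0, 0), dt.time(6, 0))
print(df[mask])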
Try
df["time"] = pd.to_datetime(df.itime, format='%H%M').dt.strftime('%Y-%m-%d %H:%M:%S')
df["time"] = pd.to_datetime(df.itime, format='%H%M').dt.strftime('%H:%M:%S')
for the first and second output formats you wanted, respectively. (This assumes itime has already been read as zero-padded strings, as shown above; without a format, to_datetime treats the floats as nanoseconds.)
Best!