I am using python to do some data cleaning and i've used the datetime module to split date time and tried to create another column with just the time.
My script works but it just takes the last value of the data frame.
Here is the code:
import datetime
i = 0
for index, row in df.iterrows():
date = datetime.datetime.strptime(df.iloc[i, 0], "%Y-%m-%dT%H:%M:%SZ")
df['minutes'] = date.minute
i = i + 1
This is the dataframe :
Output
df['minutes'] = date.minute reassigns the entire 'minutes' column with the scalar value date.minute from the last iteration.
You don't need a loop, as 99% of the cases when using pandas.
You can use vectorized assignment, just replace 'source_column_name' with the name of the column with the source data.
df['minutes'] = pd.to_datetime(df['source_column_name'], format='%Y-%m-%dT%H:%M:%SZ').dt.minute
It is also most likely that you won't need to specify format as pd.to_datetime is fairly smart.
Quick example:
df = pd.DataFrame({'a': ['2020.1.13', '2019.1.13']})
df['year'] = pd.to_datetime(df['a']).dt.year
print(df)
outputs
a year
0 2020.1.13 2020
1 2019.1.13 2019
Seems like you're trying to get the time column from the datetime which is in string format. That's what I understood from your post.
Could you give this a shot?
from datetime import datetime
import pandas as pd
def get_time(date_cell):
dt = datetime.strptime(date_cell, "%Y-%m-%dT%H:%M:%SZ")
return datetime.strftime(dt, "%H:%M:%SZ")
df['time'] = df['date_time'].apply(get_time)
Related
I have an excel here as shown in this picture:
I am using pandas to read my excel file and it is working fine, this code below can print all the data in my excel:
import pandas as pd
df = pd.read_csv('alpha.csv')
print(df)
I want to get the values from C2 cell to H9 cell which month is October and day is Monday only. And I want to store these values in my python array below:
mynumbers= []
but I am not sure how should I do it, can you please help me?
You should consider slicing your dataframe and then using .values to story them. If you want them as a list, then you can use to_list():
First transform the Date column to a datetime:
df['Date'] = pd.to_datetime(df['Date'],dayfirst=True,infer_datetime_format=True)
Then, slice and return the values for the Column Number 2
mynumbers = df[(df['Date'].dt.month == 10) & \
(df['Date'].dt.weekday == 0)]['Column 2'].values.tolist()
Assigning the following values to mynumbers:
[11,8]
A first step would be to convert your Date column to datetime objects
import datetime
myDate = "10-11-22"
myDate = datetime.datetime.strptime(myDate, '%d-%m-%y')
Then using myDate.month and myDate.weekday() you can select for mondays in October
I have a big excel file with a datetime format column which are in strings. The column looks like this:
ingezameldop
2022-10-10 15:51:18
2022-10-10 15:56:19
I have found two ways of trying to do this, however they do not work.
First (nice way):
import pandas as pd
from datetime import datetime
from datetime import date
dagStart = datetime.strptime(str(date.today())+' 06:00:00', '%Y-%m-%d %H:%M:%S')
dagEind = datetime.strptime(str(date.today())+' 23:00:00', '%Y-%m-%d %H:%M:%S')
data = pd.read_excel('inzamelbestand.xlsx', index_col=9)
data = data.loc[pd.to_datetime(data['ingezameldop']).dt.time.between(dagStart.time(), dagEind.time())]
data.to_excel("oefenexcel.xlsx")
However, this returns me with an excel file identical to the original one. I cant seem to fix this.
Second way (sketchy):
import pandas as pd
from datetime import datetime
from datetime import date
df = pd.read_excel('inzamelbestand.xlsx', index_col=9)
# uitfilteren dag van vandaag
dag = str(date.today())
dag1 = dag[8]+dag[9]
vgl = df['ingezameldop']
vgl2 = vgl.str[8]+vgl.str[9]
df = df.loc[vgl2 == dag1]
# uitfilteren vanaf 6 uur 's ochtends
# str11 str12 = uur
df.to_excel("oefenexcel.xlsx")
This one works for filtering out the exact day. But when I want to filter out the hours it does not. Because I use the same way (getting the 11nd and 12th character from the string) but I cant use logic operators (>=) on strings, so I cant filter out for times >6
You can modify this line of code
data = data.loc[pd.to_datetime(data['ingezameldop']).dt.time.between(dagStart.time(), dagEind.time())]
as
(dagStart.hour, dagStart.minute) <= (data['ingezameldop'].hour, data['ingezameldop'].minute) < (dagEind.hour, dagEind.minute)
to get boolean values that are only true for records within the date range.
dagStart, dagEind and data['ingezameldop'] must be in datetime format.
In order to apply it on individual element of the column, wrap it in a function and use apply as follows
def filter(ingezameldop, dagStart, dagEind):
return (dagStart.hour, dagStart.minute) <= (data['ingezameldop'].hour, data['ingezameldop'].minute) < (dagEind.hour, dagEind.minute)
then apply the filter on the column in this way
data['filter'] = data['ingezameldop'].apply(filter, dagStart=dagStart, dagEind=dagEind)
That will apply the function on individual series element which must be in datetime format
I want to convert the date in a column in a dataframe to a different format. Currently, it has this format: '2019-11-20T01:04:18'. I want it to have this format: 20-11-19 1:04.
I think I need to develop a loop and generate a new column for the new date format. So essentially, in the loop, I would refer to the initial column and then generate the variable for the new column in the format I want.
Can someone help me out to complete this task?
The following code works for one occasion:
import datetime
d = datetime.datetime.strptime('2019-11-20T01:04:18', '%Y-%m-%dT%H:%M:%S')
print d.strftime('%d-%m-%y %H:%M')
From a previous answer in this site , this should be able to help you, comments give explanation
You can read your data into pandas from csv or database or create some test data as shown below for testing.
>>> import pandas as pd
>>> df = pd.DataFrame({'column': {0: '26/1/2016', 1: '26/1/2016'}})
>>> # First convert the column to datetime datatype
>>> df['column'] = pd.to_datetime(df.column)
>>> # Then call the datetime object format() method, set the modifiers you want here
>>> df['column'] = df['column'].dt.strftime('%Y-%m-%dT%H:%M:%S')
>>> df
column
0 2016-01-26T00:00:00
1 2016-01-26T00:00:00
NB. Check to ensure that all your columns have similar date strings
You can either achieve it like this:
from datetime import datetime
df['your_column'] = df['your_column'].apply(lambda x: datetime.strptime(x, '%Y-%m-%dT%H:%M:%S').strftime('%d-%m-%y %H:%M'))
I'am trying to calculate the difference between string time values but i could not read microseconds format. Why i have this type of errors ? and how i can fix my code for it ?
I have already tried "datetime.strptime" method to get string to time format then use pandas.dataframe.diff method to calculate the difference between each item in the list and create a column in excel for it.
```
from datetime import datetime
import pandas as pd
for itemz in time_list:
df = pd.DataFrame(datetime.strptime(itemz, '%H %M %S %f'))
ls_cnv.append(df.diff())
df = pd.DataFrame(time_list)
ls_cnv = [df.diff()]
print (ls_cnv)
```
I expect the output to be
ls_cnv = [NaN, 00:00:00, 00:00:00]
time_list = ['10:54:05.912783', '10:54:05.912783', '10:54:05.912783']
but i have instead (time data '10:54:05.906224' does not match format '%H %M %S %f')
The error you get is because you are using strptime wrong.
df = pd.DataFrame(datetime.strptime(itemz, '%H:%M:%S.%f'))
The above would be the correct form, the one passed from your time_list but that's not the case. You create the DataFrame in the wrong way too. DataFrame is a table if you wish of data. The following lines will create and replace in every loop a new DataFrame for every itemz which is one element of your list at time. So it will create a DataFrame with one element in the first loop which will be '10:54:05.912783' and it will diff() that with itself while there is no other value.
for itemz in time_list:
df = pd.DataFrame(datetime.strptime(itemz, '%H %M %S %f'))
ls_cnv.append(df.diff())
Maybe what you wanted to do is the following:
from datetime import datetime
import pandas as pd
ls_cnv = []
time_list = ['10:54:03.912743', '10:54:05.912783', '10:44:05.912783']
df = pd.to_datetime(time_list)
data = pd.DataFrame({'index': range(len(time_list))}, index=df)
a = pd.Series(data.index).diff()
ls_cnv.append(a)
print (ls_cnv)
Just because your time format must include colons and point like this
"%H:%M:%S.%f"
From the daily stock price data, I want to sample and select end of the month price. I am accomplishing using the following code.
import datetime
from pandas_datareader import data as pdr
import pandas as pd
end = datetime.date.today()
begin=end-pd.DateOffset(365*2)
st=begin.strftime('%Y-%m-%d')
ed=end.strftime('%Y-%m-%d')
data = pdr.get_data_yahoo("AAPL",st,ed)
mon_data=pd.DataFrame(data['Adj Close'].resample('M').apply(lambda x: x[-2])).set_index(data.index)
The line above selects end of the month data and here is the output.
If I want to select penultimate value of the month, I can do it using the following code.
mon_data=pd.DataFrame(data['Adj Close'].resample('M').apply(lambda x: x[-2]))
Here is the output.
However the index shows end of the month value. When I choose penultimate value of the month, I want index to be 2015-12-30 instead of 2015-12-31.
Please suggest the way forward. I hope my question is clear.
Thanking you in anticipation.
Regards,
Abhishek
I am not sure if there is a way to do it with resample. But, you can get what you want using groupby and TimeGrouper.
import datetime
from pandas_datareader import data as pdr
import pandas as pd
end = datetime.date.today()
begin = end - pd.DateOffset(365*2)
st = begin.strftime('%Y-%m-%d')
ed = end.strftime('%Y-%m-%d')
data = pdr.get_data_yahoo("AAPL",st,ed)
data['Date'] = data.index
mon_data = (
data[['Date', 'Adj Close']]
.groupby(pd.TimeGrouper(freq='M')).nth(-2)
.set_index('Date')
)
simplest solution is to take the index of your newly created dataframe and subtract the number of days you want to go back:
n = 1
mon_data=pd.DataFrame(data['Adj Close'].resample('M').apply(lambda x: x[-1-n]))
mon_data.index = mon_data.index - datetime.timedelta(days=n)
also, seeing your data, i think that you should resample not to ' month end frequency' but rather to 'business month end frequency':
.resample('BM')
but even that won't cover it all, because for instance December 29, 2017 is a business month end, but this date doesn't appear in your data (which ends in December 08 2017). so you could add a small fix to that (assuming the original data is sorted by the date):
end_of_months = mon_data.index.tolist()
end_of_months[-1] = data.index[-1]
mon_data.index = end_of_months
so, the full code will look like:
n = 1
mon_data=pd.DataFrame(data['Adj Close'].resample('BM').apply(lambda x: x[-1-n]))
end_of_months = mon_data.index.tolist()
end_of_months[-1] = data.index[-1]
mon_data.index = end_of_months
mon_data.index = mon_data.index - datetime.timedelta(days=n)
btw: your .set_index(data.index) throw an error because data and mon_data are in different dimensions (mon_data is monthly grouped_by)