I have a large Excel file with a datetime column whose values are stored as strings. The column looks like this:
ingezameldop
2022-10-10 15:51:18
2022-10-10 15:56:19
I want to filter the rows by date and time of day. I have tried two approaches, but neither works.
First (nice way):
import pandas as pd
from datetime import datetime
from datetime import date
dagStart = datetime.strptime(str(date.today())+' 06:00:00', '%Y-%m-%d %H:%M:%S')
dagEind = datetime.strptime(str(date.today())+' 23:00:00', '%Y-%m-%d %H:%M:%S')
data = pd.read_excel('inzamelbestand.xlsx', index_col=9)
data = data.loc[pd.to_datetime(data['ingezameldop']).dt.time.between(dagStart.time(), dagEind.time())]
data.to_excel("oefenexcel.xlsx")
However, this produces an Excel file identical to the original one, and I can't seem to fix it.
Second way (sketchy):
import pandas as pd
from datetime import datetime
from datetime import date
df = pd.read_excel('inzamelbestand.xlsx', index_col=9)
# uitfilteren dag van vandaag
dag = str(date.today())
dag1 = dag[8]+dag[9]
vgl = df['ingezameldop']
vgl2 = vgl.str[8]+vgl.str[9]
df = df.loc[vgl2 == dag1]
# uitfilteren vanaf 6 uur 's ochtends
# str11 str12 = uur
df.to_excel("oefenexcel.xlsx")
This one works for filtering on the exact day, but not for filtering on the hour. I tried the same approach (taking the characters at index 11 and 12 of the string), but I can't use comparison operators (>=) on the strings, so I can't filter for times after 6.
You can modify this line of code
data = data.loc[pd.to_datetime(data['ingezameldop']).dt.time.between(dagStart.time(), dagEind.time())]
as
(dagStart.hour, dagStart.minute) <= (ingezameldop.hour, ingezameldop.minute) < (dagEind.hour, dagEind.minute)
where ingezameldop is a single value from data['ingezameldop']. This gives a boolean that is True only for records within the time range.
dagStart, dagEind and the values of data['ingezameldop'] must be datetime objects, not strings.
To apply it to each element of the column, wrap it in a function and use apply as follows
def filter(ingezameldop, dagStart, dagEind):
    # compare the element that was passed in, not the whole column
    return (dagStart.hour, dagStart.minute) <= (ingezameldop.hour, ingezameldop.minute) < (dagEind.hour, dagEind.minute)
then apply the filter to the column in this way
data['filter'] = pd.to_datetime(data['ingezameldop']).apply(filter, dagStart=dagStart, dagEind=dagEind)
This calls the function on each element of the Series, so each element must be a datetime object; pd.to_datetime converts the strings first.
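One possible reason the first attempt returned every row is that it compares only the time of day and ignores the date, so a file whose rows all fall between 06:00 and 23:00 passes through unchanged. The sketch below combines a date check with the time window. The helper name filter_day_window and the sample rows are made up for illustration (only the first timestamp comes from the question), so treat it as a starting point rather than a drop-in fix.

```python
from datetime import time

import pandas as pd


def filter_day_window(df, column, day, start, end):
    """Keep rows whose timestamp is on `day` and between `start` and `end` (inclusive)."""
    stamps = pd.to_datetime(df[column])  # strings -> datetime64
    # normalize() sets the time part to midnight, leaving only the date
    same_day = stamps.dt.normalize() == pd.Timestamp(day).normalize()
    in_window = stamps.dt.time.between(start, end)
    return df.loc[same_day & in_window]


# Hypothetical sample data standing in for inzamelbestand.xlsx
sample = pd.DataFrame({
    "ingezameldop": [
        "2022-10-10 15:51:18",  # right day, inside the window
        "2022-10-10 05:30:00",  # right day, too early
        "2022-10-09 15:56:19",  # wrong day
    ]
})

result = filter_day_window(sample, "ingezameldop", "2022-10-10", time(6, 0), time(23, 0))
print(result)
```

To filter on the current day instead of a fixed date, pass pd.Timestamp.today() as day.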
I am using Python to do some data cleaning. I used the datetime module to parse a date-time string and tried to create another column with just the time.
My script runs, but every row gets the value from the last row of the data frame.
Here is the code:
import datetime
i = 0
for index, row in df.iterrows():
    date = datetime.datetime.strptime(df.iloc[i, 0], "%Y-%m-%dT%H:%M:%SZ")
    df['minutes'] = date.minute
    i = i + 1
df['minutes'] = date.minute reassigns the entire 'minutes' column with the scalar value date.minute from the last iteration.
You don't need a loop here, which is true in the vast majority of cases when using pandas.
You can use vectorized assignment instead; just replace 'source_column_name' with the name of the column that holds the source data.
df['minutes'] = pd.to_datetime(df['source_column_name'], format='%Y-%m-%dT%H:%M:%SZ').dt.minute
You will most likely not need to specify format at all, since pd.to_datetime infers common formats on its own.
Quick example:
df = pd.DataFrame({'a': ['2020.1.13', '2019.1.13']})
df['year'] = pd.to_datetime(df['a']).dt.year
print(df)
outputs
           a  year
0  2020.1.13  2020
1  2019.1.13  2019
Seems like you're trying to get the time column from the datetime which is in string format. That's what I understood from your post.
Could you give this a shot?
from datetime import datetime
import pandas as pd
def get_time(date_cell):
    dt = datetime.strptime(date_cell, "%Y-%m-%dT%H:%M:%SZ")
    return dt.strftime("%H:%M:%SZ")
df['time'] = df['date_time'].apply(get_time)
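For comparison, here is a small self-contained sketch that derives both a minutes column and a time-only column without a loop or apply. The column name date_time and the two sample values are assumptions made up for the example; the format string is the one from the question.

```python
import pandas as pd

# Hypothetical sample data in the question's "%Y-%m-%dT%H:%M:%SZ" layout
df = pd.DataFrame({"date_time": ["2020-01-13T10:15:30Z", "2020-01-13T11:45:00Z"]})

# Parse the whole column once, then derive new columns from the parsed values
parsed = pd.to_datetime(df["date_time"], format="%Y-%m-%dT%H:%M:%SZ")
df["minutes"] = parsed.dt.minute             # integer minute of each row
df["time"] = parsed.dt.strftime("%H:%M:%S")  # time-of-day as a string

print(df)
```

Parsing once and reusing the result avoids converting the same strings twice.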
I am trying to calculate the difference between time values stored as strings, but I cannot parse the microseconds. Why do I get this error, and how can I fix my code?
I have already tried using datetime.strptime to convert the strings to times, and then the pandas DataFrame.diff method to calculate the difference between consecutive items in the list and create an Excel column for it.
```
from datetime import datetime
import pandas as pd
for itemz in time_list:
    df = pd.DataFrame(datetime.strptime(itemz, '%H %M %S %f'))
    ls_cnv.append(df.diff())
df = pd.DataFrame(time_list)
ls_cnv = [df.diff()]
print (ls_cnv)
```
I expect the output to be
ls_cnv = [NaN, 00:00:00, 00:00:00]
time_list = ['10:54:05.912783', '10:54:05.912783', '10:54:05.912783']
but instead I get the error (time data '10:54:05.906224' does not match format '%H %M %S %f')
You get the error because the format string passed to strptime does not match your data: it uses spaces where the time strings have colons and a decimal point.
df = pd.DataFrame(datetime.strptime(itemz, '%H:%M:%S.%f'))
The above uses the correct format for the strings in your time_list. However, you also create the DataFrame in the wrong way. A DataFrame is a table of data, and the following lines create a new DataFrame on every iteration, replacing the previous one, from a single itemz, which is one element of your list at a time. The first iteration therefore creates a DataFrame holding only '10:54:05.912783', and diff() has no other value to compare it with.
for itemz in time_list:
    df = pd.DataFrame(datetime.strptime(itemz, '%H %M %S %f'))
    ls_cnv.append(df.diff())
Maybe what you wanted to do is the following:
from datetime import datetime
import pandas as pd
ls_cnv = []
time_list = ['10:54:03.912743', '10:54:05.912783', '10:44:05.912783']
df = pd.to_datetime(time_list)
data = pd.DataFrame({'index': range(len(time_list))}, index=df)
a = pd.Series(data.index).diff()
ls_cnv.append(a)
print (ls_cnv)
The error occurs simply because your time format must include the colons and the decimal point, like this
"%H:%M:%S.%f"
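As a minimal sketch of the corrected format in use, the strings can be parsed in one call and the differences taken on the resulting Series. The list below is the one from the answer above; the variable names times and deltas are just for illustration. Because the strings carry no date, pandas fills in a default one, which cancels out in the subtraction.

```python
import pandas as pd

time_list = ['10:54:03.912743', '10:54:05.912783', '10:44:05.912783']

# Parse all strings at once with a format that matches the colons and the dot
times = pd.Series(pd.to_datetime(time_list, format="%H:%M:%S.%f"))

# diff() subtracts each element from the next one; the first result is NaT
deltas = times.diff()
print(deltas)
```

The second entry is a Timedelta of about two seconds, and the third is negative because the last time is ten minutes earlier than the one before it.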
I have a data set of JSON dates and am trying to calculate the time difference between two JSON DateTime values.
For example :
'2015-01-28T21:41:38.508275' - '2015-01-28T21:41:34.921589'
Please look at the python code below:
#let's say 'time' is my data frame and JSON formatted time values are under the 'due_date' column
time_spent = time.iloc[2]['due_date'] - time.iloc[10]['due_date']
This doesn't work. I also tried to cast each operand to int, but it also didn't help. What are the different ways to perform this calculation?
I use the parser from dateutil, something like this:
from dateutil.parser import parse
first_date_obj = parse("2015-01-28T21:41:38.508275")
second_date_obj = parse("2015-02-28T21:41:38.508275")
print(second_date_obj - first_date_obj)
You can also access the year, month and day of the date object like this:
print(first_date_obj.year)
print(first_date_obj.month)
print(first_date_obj.day)
# and so on
from datetime import datetime
date_format = '%Y-%m-%dT%H:%M:%S.%f'
d2 = time.iloc[2]['due_date']
d1 = time.iloc[10]['due_date']
time_spent = datetime.strptime(d2, date_format) - datetime.strptime(d1, date_format)
print(time_spent.days) # 0
print(time_spent.microseconds) # 586686
print(time_spent.seconds) # 3
print(time_spent.total_seconds()) # 3.586686
The easiest thing to do is to use the pandas datetime capability (since you are already using iloc I assume you are using pandas). You can convert the entire dataframe column labeled due_date to be a pandas datetime datatype using
import pandas as pd
time['due_date'] = pd.to_datetime(time['due_date'])
then calculate the time difference you want using
time_spent = time.iloc[2]['due_date'] - time.iloc[10]['due_date']
time_spent will be a pandas timedelta object that you can then manipulate as necessary.
See https://pandas.pydata.org/pandas-docs/stable/user_guide/timeseries.html and https://pandas.pydata.org/pandas-docs/stable/user_guide/timedeltas.html.
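Here is a short self-contained sketch of the pandas approach using the two timestamps from the question. The DataFrame name frame and its two-row layout are assumptions for the example; in the real data the values would sit at rows 2 and 10 of the due_date column.

```python
import pandas as pd

# Hypothetical two-row frame holding the timestamps from the question
frame = pd.DataFrame({
    "due_date": ["2015-01-28T21:41:38.508275", "2015-01-28T21:41:34.921589"]
})

# Convert the string column to datetime64 so subtraction is defined
frame["due_date"] = pd.to_datetime(frame["due_date"])

time_spent = frame["due_date"].iloc[0] - frame["due_date"].iloc[1]
print(time_spent)                  # a pandas Timedelta
print(time_spent.total_seconds())  # difference as a float number of seconds
```

The result is roughly 3.59 seconds, which agrees with the strptime-based answer above.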
My DataFrame contains a date field stored as a string, for example
'2015-07-02T11:22:21.050Z'
I need to filter the DataFrame on the date to get only the records in the last week.
So, I was trying a map approach where I transformed the string dates to datetime objects with strptime:
def map_to_datetime(row):
    format_string = '%Y-%m-%dT%H:%M:%S.%fZ'
    row.date = datetime.strptime(row.date, format_string)
df = df.map(map_to_datetime)
and then I would apply a filter as
df.filter(lambda row:
    row.date >= (datetime.today() - timedelta(days=7)))
I managed to get the mapping working, but the filter fails with
TypeError: condition should be string or Column
Is there a way to make this filter work, or should I change my approach, and if so, how?
I figured out a way to solve my problem by using the SparkSQL API with dates in String format.
Here is an example:
last_week = (datetime.today() - timedelta(days=7)).strftime(format='%Y-%m-%d')
new_df = df.where(df.date >= last_week)
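The string comparison works because ISO 8601 timestamps are zero-padded and ordered from year down to fraction of a second, so sorting them as text matches sorting them by time. The plain-Python sketch below illustrates only that property, without Spark. The helper cutoff_string, the fixed now value, and the second sample date are assumptions made up for the example.

```python
from datetime import datetime, timedelta


def cutoff_string(now, days=7):
    """Return the date `days` before `now` as a 'YYYY-MM-DD' string."""
    return (now - timedelta(days=days)).strftime('%Y-%m-%d')


# A fixed "now" keeps the example reproducible; real code would use datetime.today()
now = datetime(2015, 7, 8, 12, 0, 0)
cutoff = cutoff_string(now)

dates = ['2015-07-02T11:22:21.050Z', '2015-06-20T21:00:00.000Z']

# Lexicographic comparison of ISO 8601 strings follows chronological order
recent = [d for d in dates if d >= cutoff]
print(cutoff, recent)
```

This only holds when every value uses the same zero-padded layout and the same time zone; mixed formats would need real parsing.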
Spark >= 1.5
You can use INTERVAL (df_casted is the DataFrame built in the Spark < 1.5 part below):
from pyspark.sql.functions import col, current_date, expr
df_casted.where(col("dt") >= current_date() - expr("INTERVAL 7 days"))
Spark < 1.5
You can solve this without using worker-side Python code or switching to RDDs. First of all, since you use ISO 8601 strings, your data can be cast directly to date or timestamp:
from pyspark.sql.functions import col
df = sc.parallelize([
('2015-07-02T11:22:21.050Z', ),
('2016-03-20T21:00:00.000Z', )
]).toDF(("d_str", ))
df_casted = df.select("*",
col("d_str").cast("date").alias("dt"),
col("d_str").cast("timestamp").alias("ts"))
This will save one round trip between the JVM and Python. There are also a few ways you can approach the second part. Date only:
from pyspark.sql.functions import current_date, datediff, unix_timestamp
df_casted.where(datediff(current_date(), col("dt")) < 7)
Timestamps:
def days(i: int) -> int:
    return 60 * 60 * 24 * i
df_casted.where(unix_timestamp() - col("ts").cast("long") < days(7))
You can also take a look at current_timestamp and date_sub.
Note: I would avoid using DataFrame.map. It is better to use DataFrame.rdd.map instead. It will save you some work when switching to 2.0+
from datetime import datetime, timedelta
last_7_days = (datetime.today() - timedelta(days=7)).strftime(format='%Y-%m-%d')
new_df = signal1.where(signal1.publication_date >= last_7_days)
You need to import datetime and timedelta for this.
I have a column in Excel which has dates in the format "17-12-2015 19:35". How can I extract the first 2 digits as integers and append them to a list? In this case I need to extract 17 and append it to a list. Can this also be done using pandas?
Code thus far:
import pandas as pd
Location = r'F:\Analytics Materials\files\paymenttransactions.csv'
df = pd.read_csv(Location)
time = df['Creation Date'].tolist()
print (time)
You could extract the day of each timestamp like this:
from datetime import datetime
import pandas as pd
location = r'F:\Analytics Materials\files\paymenttransactions.csv'
df = pd.read_csv(location)
timestamps = df['Creation Date'].tolist()
dates = [datetime.strptime(timestamp, '%d-%m-%Y %H:%M') for timestamp in timestamps]
days = [int(date.strftime('%d')) for date in dates]  # int() turns '17' into 17, as the question asks for integers
print(days)
The '%d-%m-%Y %H:%M' and '%d' bits are format specifiers that describe how your timestamp is formatted. See the strftime() and strptime() section of the Python datetime documentation for a complete list of directives.
datetime.strptime parses a string into a datetime object using such a specifier. dates will thus hold a list of datetime instances instead of strings.
datetime.strftime does the opposite: it turns a datetime object into a string, again using a format specifier. %d simply instructs strftime to output only the day of a date.
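Since the question also asks about pandas, here is a minimal sketch of the same extraction done on the column directly. The two sample values are made up (only the first matches the question), and the column name Creation Date is taken from the question's code.

```python
import pandas as pd

# Hypothetical sample standing in for paymenttransactions.csv
df = pd.DataFrame({"Creation Date": ["17-12-2015 19:35", "03-01-2016 08:05"]})

# Parse with an explicit day-first format, then read the day as an integer
parsed = pd.to_datetime(df["Creation Date"], format="%d-%m-%Y %H:%M")
days = parsed.dt.day.tolist()

print(days)
```

Giving the format explicitly avoids any ambiguity between day-first and month-first dates such as 03-01-2016.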