I have written this function:
from datetime import timedelta
import pandas as pd

def time_to_unix(df, dateToday):
    '''This function creates the timestamp column for the dataframe. It takes today's date (e.g. 2022-8-8 0:0:0)
    and adds the seconds that were originally in the timestamp column.
    input: dataframe, dateToday (type: pandas.core.series.Series)
    output: list of times
    '''
    dateTime = dateToday[0]
    times = []
    for i in range(len(df['timestamp'])):
        dateAndTime = dateTime + timedelta(seconds=float(df['timestamp'][i]))
        unix = pd.to_datetime([dateAndTime]).astype(int) / 10**9
        times.append(unix[0])
    return times
So it takes a dataframe, gets today's date, takes each value of the timestamp column (which is in seconds, like 10, 20, ...), applies the function, and returns the times in Unix time.
However, because I have approximately 2 million rows in my dataframe, this code takes a long time to run.
How can I use a lambda function, or something else, to speed up my code?
Something along the lines of:
df['unix'] = df.apply(lambda row: something_in_here, axis=1)
What I think you'll find is that most of the time is spent in the creation and manipulation of the datetime / timestamp objects in the dataframe (see here for more info). I also try to avoid using lambdas like this on large dataframes, as they go row by row, which should be avoided. What I've done when dealing with datetimes / timestamps / timezone changes in the past is to build a dictionary of the possible datetime combinations and then use map to apply them. Something like this:
import datetime as dt
import pandas as pd

# Make a time key column out of your date and timestamp fields
df['time_key'] = df['date'].astype(str) + '#' + df['timestamp'].astype(str)

# Build a dictionary from the unique time keys in the dataframe
time_dict = dict()
for time_key in df['time_key'].unique():
    time_split = time_key.split('#')
    # Create the Unix timestamp based on the values in the key; store it in the dictionary so it can be mapped later
    time_dict[time_key] = (pd.to_datetime(time_split[0]) + dt.timedelta(seconds=float(time_split[1]))).value / 10**9

# Now map the time_key to the unix column in the dataframe from the dictionary
df['unix'] = df['time_key'].map(time_dict)
Note: if all the datetime combinations in the dataframe are unique, this likely won't help.
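For illustration, here is a minimal self-contained sketch of the same dictionary-and-map pattern on a toy frame (the column names date and timestamp, and the sample values, are assumptions mirroring the question):
import datetime as dt
import pandas as pd

# Toy frame with repeated date/timestamp combinations
df = pd.DataFrame({'date': ['2022-08-08'] * 4 + ['2022-08-09'] * 2,
                   'timestamp': [10, 10, 20, 20, 10, 10]})

df['time_key'] = df['date'].astype(str) + '#' + df['timestamp'].astype(str)

# One datetime conversion per unique key instead of one per row
time_dict = {}
for time_key in df['time_key'].unique():
    d, s = time_key.split('#')
    time_dict[time_key] = (pd.to_datetime(d) + dt.timedelta(seconds=float(s))).value / 10**9

df['unix'] = df['time_key'].map(time_dict)
print(df[['time_key', 'unix']])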
I'm not exactly sure what type dateToday[0] has, but you could try a more vectorized approach:
import pandas as pd

df["unix"] = (
    (pd.Timestamp(dateToday[0]) + pd.to_timedelta(df["timestamp"], unit="seconds"))
    .astype("int64")
    .div(10**9)
)
or
df["unix"] = (
(dateTime[0] + pd.to_timedelta(df["timestamp"], unit="seconds"))
.astype("int").div(10**9)
)
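As a quick self-contained check of the vectorized form (a sketch; the start date 2022-08-08 is an assumption standing in for dateToday[0]):
import pandas as pd

df = pd.DataFrame({"timestamp": [10.0, 20.0, 30.0]})
start = pd.Timestamp("2022-08-08 00:00:00")  # stands in for dateToday[0]

df["unix"] = (
    (start + pd.to_timedelta(df["timestamp"], unit="seconds"))
    .astype("int64") / 10**9
)
print(df)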
I have a big Excel file with a datetime column whose values are strings. The column looks like this:
ingezameldop
2022-10-10 15:51:18
2022-10-10 15:56:19
I want to keep only today's records between 06:00 and 23:00. I have found two ways of trying to do this; however, they do not work.
First (nice way):
import pandas as pd
from datetime import datetime
from datetime import date
dagStart = datetime.strptime(str(date.today())+' 06:00:00', '%Y-%m-%d %H:%M:%S')
dagEind = datetime.strptime(str(date.today())+' 23:00:00', '%Y-%m-%d %H:%M:%S')
data = pd.read_excel('inzamelbestand.xlsx', index_col=9)
data = data.loc[pd.to_datetime(data['ingezameldop']).dt.time.between(dagStart.time(), dagEind.time())]
data.to_excel("oefenexcel.xlsx")
However, this returns an Excel file identical to the original one. I can't seem to fix this.
Second way (sketchy):
import pandas as pd
from datetime import datetime
from datetime import date
df = pd.read_excel('inzamelbestand.xlsx', index_col=9)
# filter out today's date
dag = str(date.today())
dag1 = dag[8] + dag[9]
vgl = df['ingezameldop']
vgl2 = vgl.str[8] + vgl.str[9]
df = df.loc[vgl2 == dag1]
# filter from 6 o'clock in the morning onwards
# characters 11 and 12 = hour
df.to_excel("oefenexcel.xlsx")
This one works for filtering out the exact day. But when I want to filter the hours it does not, because I use the same approach (taking the 11th and 12th characters from the string), but I can't use comparison operators (>=) on strings, so I can't filter for times > 6.
You can modify this line of code
data = data.loc[pd.to_datetime(data['ingezameldop']).dt.time.between(dagStart.time(), dagEind.time())]
as
(dagStart.hour, dagStart.minute) <= (ingezameldop.hour, ingezameldop.minute) < (dagEind.hour, dagEind.minute)
to get boolean values that are only True for records within the time range.
dagStart, dagEind and each ingezameldop value must be in datetime format.
In order to apply it to each individual element of the column, wrap it in a function and use apply as follows:
def filter(ingezameldop, dagStart, dagEind):
    return (dagStart.hour, dagStart.minute) <= (ingezameldop.hour, ingezameldop.minute) < (dagEind.hour, dagEind.minute)
then apply the filter on the column in this way
data['filter'] = data['ingezameldop'].apply(filter, dagStart=dagStart, dagEind=dagEind)
That will apply the function to each individual Series element, which must be in datetime format.
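If the column is large, a fully vectorized mask avoids apply altogether. A minimal sketch, assuming data['ingezameldop'] can be parsed by pd.to_datetime:
import pandas as pd

# Convert each time to minutes since midnight, then compare plain integers
tijd = pd.to_datetime(data['ingezameldop'])
minuten = tijd.dt.hour * 60 + tijd.dt.minute
start = dagStart.hour * 60 + dagStart.minute
eind = dagEind.hour * 60 + dagEind.minute
data['filter'] = (minuten >= start) & (minuten < eind)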
I am using Python to do some data cleaning, and I've used the datetime module to split the datetime and tried to create another column with just the time.
My script works, but it only keeps the value from the last row of the dataframe.
Here is the code:
import datetime

i = 0
for index, row in df.iterrows():
    date = datetime.datetime.strptime(df.iloc[i, 0], "%Y-%m-%dT%H:%M:%SZ")
    df['minutes'] = date.minute
    i = i + 1
This is the dataframe: [screenshot in the original post]
df['minutes'] = date.minute reassigns the entire 'minutes' column with the scalar value date.minute from the last iteration.
You don't need a loop here, as in 99% of cases when using pandas.
You can use vectorized assignment, just replace 'source_column_name' with the name of the column with the source data.
df['minutes'] = pd.to_datetime(df['source_column_name'], format='%Y-%m-%dT%H:%M:%SZ').dt.minute
Most likely you won't even need to specify the format, as pd.to_datetime is fairly smart.
Quick example:
df = pd.DataFrame({'a': ['2020.1.13', '2019.1.13']})
df['year'] = pd.to_datetime(df['a']).dt.year
print(df)
outputs
a year
0 2020.1.13 2020
1 2019.1.13 2019
Seems like you're trying to get the time column from the datetime which is in string format. That's what I understood from your post.
Could you give this a shot?
from datetime import datetime
import pandas as pd
def get_time(date_cell):
dt = datetime.strptime(date_cell, "%Y-%m-%dT%H:%M:%SZ")
return datetime.strftime(dt, "%H:%M:%SZ")
df['time'] = df['date_time'].apply(get_time)
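If the frame is large, the same idea can be vectorized so parsing happens once for the whole column rather than once per row (a sketch, assuming the source column is named date_time as above):
import pandas as pd

df['time'] = (pd.to_datetime(df['date_time'], format='%Y-%m-%dT%H:%M:%SZ')
              .dt.strftime('%H:%M:%SZ'))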
I'm trying to calculate the difference between string time values, but I can't get the microseconds format to parse. Why do I get this type of error, and how can I fix my code?
I have already tried the datetime.strptime method to convert the strings to time format, and then the pandas.DataFrame.diff method to calculate the difference between each item in the list and create a column in Excel for it.
```
from datetime import datetime
import pandas as pd
for itemz in time_list:
    df = pd.DataFrame(datetime.strptime(itemz, '%H %M %S %f'))
    ls_cnv.append(df.diff())
df = pd.DataFrame(time_list)
ls_cnv = [df.diff()]
print (ls_cnv)
```
I expect the output to be
ls_cnv = [NaN, 00:00:00, 00:00:00]
time_list = ['10:54:05.912783', '10:54:05.912783', '10:54:05.912783']
but instead I get this error: time data '10:54:05.906224' does not match format '%H %M %S %f'.
The error you get is because you are using strptime wrong.
df = pd.DataFrame(datetime.strptime(itemz, '%H:%M:%S.%f'))
The above would be the correct format, matching the values actually in your time_list, but that's not what you passed. You also create the DataFrame in the wrong way. A DataFrame is, if you like, a table of data. The following lines create and replace a new DataFrame on every loop iteration, one for each itemz, which is a single element of your list at a time. So on the first loop it creates a DataFrame with one element, '10:54:05.912783', and diff() has nothing to compare it against since there is no other value.
for itemz in time_list:
    df = pd.DataFrame(datetime.strptime(itemz, '%H %M %S %f'))
    ls_cnv.append(df.diff())
Maybe what you wanted to do is the following:
from datetime import datetime
import pandas as pd
ls_cnv = []
time_list = ['10:54:03.912743', '10:54:05.912783', '10:44:05.912783']
df = pd.to_datetime(time_list)
data = pd.DataFrame({'index': range(len(time_list))}, index=df)
a = pd.Series(data.index).diff()
ls_cnv.append(a)
print (ls_cnv)
That is simply because your time format must include colons and a point, like this:
"%H:%M:%S.%f"
For my football data analysis, to use the pandas between_time function, I need to convert a list of strings representing fractional seconds since measurement onset into a pandas datetime index.
In order to achieve this I tried the following:
df['Time'] = df['Timestamp']*(1/freq)
df.index = pd.to_datetime(df['Time'], unit='s')
In which freq=600 and Timestamp is the frame number counting up from 0.
I was expecting the new index to show the following format:
%y%m%d-%h%m%s%f
But unfortunately, to_datetime doesn't know how to handle my type of time data (namely, values counting up to 4750 s after the start).
My question is, therefore: how do I convert my time sample data into a datetime index?
Based on this topic I now created the following function:
from datetime import datetime

def timeDelta2DateTime(self, time_delta_list):
    '''This method converts a list containing the time since measurement onset [seconds] into a
    list containing datetime objects counting up from 00:00:00.

    Args:
        time_delta_list (list): List containing the times since the measurement has started.

    Returns:
        list: A list with the times in the datetime format.
    '''
    ### Use divmod to convert seconds to h, m, s.ms ###
    s, fs = list(zip(*[divmod(item, 1) for item in time_delta_list]))
    m, s = list(zip(*[divmod(item, 60) for item in s]))
    h, m = list(zip(*[divmod(item, 60) for item in m]))

    ### Create DateTime list ###
    us = [item * 1_000_000 for item in fs]  # Convert fractional seconds to microseconds (the datetime constructor expects microseconds)
    # Combine h, m, s, us in one list
    time_list_int = list(zip(*[list(map(int, h)), list(map(int, m)), list(map(int, s)), list(map(int, us))]))

    ### Return datetime object list ###
    return [datetime(2018, 1, 1, item[0], item[1], item[2], item[3]) for item in time_list_int]
As it seems to be very slow, feel free to suggest a better option.
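A much faster option is to let pandas build the index in a single vectorized call (a sketch reusing the arbitrary 2018-01-01 base date from the function above; the sample values are made up):
import pandas as pd

time_delta_list = [0.0, 0.5, 4750.25]  # seconds since measurement onset
index = pd.Timestamp('2018-01-01') + pd.to_timedelta(time_delta_list, unit='s')
print(index)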
My dataframe contains one field which is a date, and it appears in string format, for example:
'2015-07-02T11:22:21.050Z'
I need to filter the DataFrame on the date to get only the records in the last week.
So, I was trying a map approach where I transformed the string dates to datetime objects with strptime:
def map_to_datetime(row):
    format_string = '%Y-%m-%dT%H:%M:%S.%fZ'
    row.date = datetime.strptime(row.date, format_string)

df = df.map(map_to_datetime)
and then I would apply a filter as
df.filter(lambda row:
          row.date >= (datetime.today() - timedelta(days=7)))
I managed to get the mapping working, but the filter fails with
TypeError: condition should be string or Column
Is there a way to make the filtering work, or should I change the approach, and how?
I figured out a way to solve my problem by using the SparkSQL API with dates in String format.
Here is an example:
last_week = (datetime.today() - timedelta(days=7)).strftime(format='%Y-%m-%d')
new_df = df.where(df.date >= last_week)
Spark >= 1.5
You can use INTERVAL
from pyspark.sql.functions import col, current_date, expr
df_casted.where(col("dt") >= current_date() - expr("INTERVAL 7 days"))
Spark < 1.5
You can solve this without using worker-side Python code and switching to RDDs. First of all, since you use ISO 8601 strings, your data can be directly cast to date or timestamp:
from pyspark.sql.functions import col
df = sc.parallelize([
    ('2015-07-02T11:22:21.050Z', ),
    ('2016-03-20T21:00:00.000Z', )
]).toDF(("d_str", ))

df_casted = df.select(
    "*",
    col("d_str").cast("date").alias("dt"),
    col("d_str").cast("timestamp").alias("ts"))
This will save one roundtrip between the JVM and Python. There are also a few ways you can approach the second part. Date only:
from pyspark.sql.functions import current_date, datediff, unix_timestamp
df_casted.where(datediff(current_date(), col("dt")) < 7)
Timestamps:
def days(i: int) -> int:
    return 60 * 60 * 24 * i

df_casted.where(unix_timestamp() - col("ts").cast("long") < days(7))
You can also take a look at current_timestamp and date_sub
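For completeness, a sketch of the date_sub variant for the date-only case, reusing the df_casted frame from above:
from pyspark.sql.functions import col, current_date, date_sub

df_casted.where(col("dt") >= date_sub(current_date(), 7))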
Note: I would avoid using DataFrame.map. It is better to use DataFrame.rdd.map instead. It will save you some work when switching to 2.0+
from datetime import datetime, timedelta
last_7_days = (datetime.today() - timedelta(days=7)).strftime(format='%Y-%m-%d')
new_df = signal1.where(signal1.publication_date >= last_7_days)
You need to import datetime and timedelta for this.