I have a pandas dataframe where one of the columns consists of datetime values with varying frequencies.
I want to create a new column which flags whenever the gap between two datetime values is greater than one day (datetime current row + timedelta(days=1) < datetime next row).
However, I would like to do this with an operation on the whole column, rather than a for loop.
Had the values been int values, you could do something like:
df_ship["gap_gt_1"] = (df_ship['datetime']+1).lt(df_ship['datetime'].shift().bfill()).astype(int)
However, lt and similar operators don't work with datetime objects.
I've tried to do the following, but it only returns 'false' values.
df_ship["gap_gt_1"] = ((df_ship['datetime'] + timedelta(days=1)) < (df_ship['datetime'].shift()))
You can try the following:
import numpy as np
import pandas as pd
# Difference between each date and the date in the previous row
df["timedelta"] = df['date'] - df['date'].shift(1)
# Flag the rows whose gap to the previous row exceeds one day
conditions, type_choices = ([df['timedelta'] > pd.Timedelta(days=1)], [1])
df["flag"] = np.select(conditions, type_choices, default=0)
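A likely reason the attempt in the question only returns False: shift() with no argument moves values down, so each row is compared with the previous row, which is earlier in sorted data. Below is a minimal sketch that compares each row with the next one instead, assuming the column is sorted in ascending order. The sample timestamps are made up for illustration; df_ship, datetime and gap_gt_1 are the names from the question.

```python
import pandas as pd

# Made-up sample data, sorted in ascending order
df_ship = pd.DataFrame({
    "datetime": pd.to_datetime([
        "2021-01-01 00:00",
        "2021-01-01 12:00",
        "2021-01-03 00:00",
        "2021-01-03 06:00",
        "2021-01-05 06:00",
    ])
})

# shift(-1) aligns each row with the value of the NEXT row
gap_to_next = df_ship["datetime"].shift(-1) - df_ship["datetime"]

# The last row has no next row, so its gap is NaT and the comparison is False
df_ship["gap_gt_1"] = (gap_to_next > pd.Timedelta(days=1)).astype(int)
print(df_ship)
```

The comparison is strict, so a gap of exactly one day is not flagged.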
I am trying to analyze a CSV file which looks like this:
timestamp        value
1594512094.39    51
1594512094.74    76
1594512098.07    50.9
1594512099.59    76.80000305
1594512101.76    50.9
I am using pandas to import each column:
dataFrame = pandas.read_csv('iot_telemetry_data.csv')
graphDataHumidity: object = dataFrame.loc[:, "humidity"]
graphTime: object = dataFrame.loc[:, "ts"]
My problem is that I need to pair both columns so that I can filter the values by a specific time range. For example, I have my timestampBeginn of "1594512109.13668" and my timestampEnd of "1594512129.37415", and I want to get the corresponding values in order to compute, for example, their mean over that time range.
I didn't find any solutions to this online, and I don't know of any libraries that solve this problem.
You can first filter the rows whose timestamp values are between 'start' and 'end'. Then you can compute the statistic you need on the filtered rows, as follows:
(Note that in the sample data there seems to be no row whose timestamp lies in the range from 1594512109.13668 to 1594512129.37415. You can adjust the range values as needed.)
import pandas as pd
df = pd.read_csv('iot_telemetry_data.csv')
start = 1594512109.13668
end = 1594512129.37415
df = df[(df['timestamp'] >= start) & (df['timestamp'] <= end)]
average = df['value'].mean()
print(average)
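The same filter can also be written with Series.between, which is inclusive on both ends by default. This is only a sketch: it reads the sample rows from the question out of an in-memory string instead of the CSV file, and the start and end values are made up so that some of the sample rows fall inside the range.

```python
import io
import pandas as pd

# Sample rows from the question, inlined so the example runs on its own
csv_text = """timestamp,value
1594512094.39,51
1594512094.74,76
1594512098.07,50.9
1594512099.59,76.80000305
1594512101.76,50.9
"""
df = pd.read_csv(io.StringIO(csv_text))

# Hypothetical range chosen to cover part of the sample data
start, end = 1594512094.5, 1594512100.0

# Boolean mask: True where start <= timestamp <= end
in_range = df["timestamp"].between(start, end)
average = df.loc[in_range, "value"].mean()
print(average)
```

Selecting with df.loc[in_range, "value"] leaves df itself unchanged, so other ranges can be evaluated afterwards.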
I am trying to add a day to my column CutoffSLAII if the time in the column is after 1:59 am. However, I only want to add a day if the column FILE_START_TIME also has a time prior to 12:00 am. If neither of these conditions is met, then the value in CutoffSLAII should be retained.
The code I am using does run, but nothing changes in the dataframe:
from datetime import datetime, time, timedelta
import pandas as pd
def add_a_day(row: pd.Series) -> datetime:
    if pd.isna(x['CutoffSLAII', 'FILE_START_TIME']):
        return x['CutoffSLAII', 'FILE_START_TIME']
    tme = x['CutoffSLAII'].time()
    tme1 = x['FILE_START_TIME'].time()
    if tme < time(12, 0, 0) and tme1 > time(1, 59, 0):
        return x['CutoffSLAII'] + timedelta(days=1)
data: df2['CutoffSLAII'] = df2.apply(add_a_day, axis=1)
Data that I wish to add a day to:
Both FILE_START_TIME and CutoffSLAII have the datetime64[ns] dtype; however, when I access a single value in the columns, it is returned as a Timestamp.
in: df2.iloc[0]['FILE_START_TIME']
out: Timestamp('2020-11-02 19:23:47')
The data is not embedded as I do not have enough reputation points, sorry for that.
The error message is now:
TypeError: string indices must be integers
Your function add_a_day takes a parameter named row but operates on another variable named x.
You may want to fix that first!
I'm a little confused about what's going on. Has x been defined somewhere else, or is it supposed to refer to row? Also, you annotate row as a Series and the return value as a datetime while applying the function to the whole data frame. Lastly, I think you are indexing the row incorrectly; check each column separately:
if pd.isna(x['CutoffSLAII']) or pd.isna(x['FILE_START_TIME']):
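Putting the two answers together, a corrected function might look like the sketch below. It keeps the time comparisons exactly as posted in the question's code (which may not match the written description), uses row consistently, and returns the original value when no day is added, since otherwise the function would return None for those rows. The sample rows are made up for illustration.

```python
from datetime import time, timedelta
import pandas as pd

def add_a_day(row: pd.Series) -> pd.Timestamp:
    cutoff = row["CutoffSLAII"]
    start = row["FILE_START_TIME"]
    # Check each column separately; leave the value untouched if either is missing
    if pd.isna(cutoff) or pd.isna(start):
        return cutoff
    # Comparisons copied from the question's code (an assumption about the intent)
    if cutoff.time() < time(12, 0, 0) and start.time() > time(1, 59, 0):
        return cutoff + timedelta(days=1)
    # Fall through: keep the original value instead of returning None
    return cutoff

# Made-up sample rows
df2 = pd.DataFrame({
    "CutoffSLAII": pd.to_datetime(
        ["2020-11-03 01:00:00", "2020-11-02 15:00:00", None]
    ),
    "FILE_START_TIME": pd.to_datetime(
        ["2020-11-02 19:23:47", "2020-11-02 19:23:47", "2020-11-02 19:23:47"]
    ),
})
df2["CutoffSLAII"] = df2.apply(add_a_day, axis=1)
print(df2)
```

Note that the result of apply has to be assigned back to the column for the dataframe to change.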
I'm trying to split a df by datetime. The df is indexed on the datetime variable. Essentially, I can do:
first = df['2020-04-09':'2020-04-21']
second = df['2020-04-22':'2020-05-08']
and that yields my desired result of 2 dfs, each with their respective datetime range's worth of data.
However, I'd like a way to allow for easier editing at the top of the script by assigning the datetime ranges to local variables. Ideally something like this:
first_dates = '2020-04-09':'2020-04-21'
second_dates = '2020-04-22':'2020-05-08'
Such that later on I'm able to use something like:
first = df[first_dates]
second = df[second_dates]
and yield the same result of 2 dfs with their respective date ranges worth of data.
Is this what you want?
# edit this
date_str = '2020-04-21'
# no need to edit this
date = pd.to_datetime(date_str, utc=True)
first = df[:date]
second = df[date+pd.to_timedelta('1D'):]
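Another option that stays close to the syntax in the question is to store each range as a built-in slice object, because 'a':'b' is only valid inside square brackets. This is a sketch with a made-up daily index; first_dates and second_dates are the names from the question.

```python
import pandas as pd

# Made-up frame indexed by a daily DatetimeIndex
index = pd.date_range("2020-04-01", "2020-05-15", freq="D")
df = pd.DataFrame({"value": range(len(index))}, index=index)

# Editable at the top of the script: slice(a, b) is what a:b means inside []
first_dates = slice("2020-04-09", "2020-04-21")
second_dates = slice("2020-04-22", "2020-05-08")

# Label-based slicing with .loc includes both endpoints
first = df.loc[first_dates]
second = df.loc[second_dates]
print(len(first), len(second))
```

This assumes the index is sorted, which label-based slicing on a DatetimeIndex generally expects.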
Alternatively, you could compare the index against a cutoff to build boolean masks, like:
cutoff = pd.Timestamp('2020-04-21')
mask1 = df.index <= cutoff
mask2 = df.index > cutoff
df1 = df.loc[mask1]
df2 = df.loc[mask2]
I am trying to learn pandas. I have found several examples of how to construct a pandas dataframe and how to add columns, and they work nicely. I would like to learn how to select all rows based on the value of a column. I have found multiple examples of how to select rows where the value of a column is smaller or greater than a certain number, and that also works. My question is how to do a more general selection, where I first compute a function of a column and then select all rows for which the value of that function is greater or smaller than a certain number.
import names
import numpy as np
import pandas as pd
from datetime import date
import random
def randomBirthday(startyear, endyear):
    T1 = date.today().replace(day=1, month=1, year=startyear).toordinal()
    T2 = date.today().replace(day=1, month=1, year=endyear).toordinal()
    return date.fromordinal(random.randint(T1, T2))
def age(birthday):
    today = date.today()
    return today.year - birthday.year - ((today.month, today.day) < (birthday.month, birthday.day))
N_PEOPLE = 20
dict_people = { }
dict_people['gender'] = np.array(['male','female'])[np.random.randint(0, 2, N_PEOPLE)]
dict_people['names'] = [names.get_full_name(gender=g) for g in dict_people['gender']]
peopleFrame = pd.DataFrame(dict_people)
# Example 1: Add new columns to the data frame
peopleFrame['birthday'] = [randomBirthday(1920, 2020) for i in range(N_PEOPLE)]
# Example 2: Select all people with a certain age
peopleFrame.loc[age(peopleFrame['birthday']) >= 20]
This code works except for the last line. Please suggest the correct way to write this line. I have considered adding an extra column with the value of the function age, and then selecting based on its value. That would work, but I am wondering whether I have to do it. What if I don't want to store the age of a person and only want to use it for selection?
Use Series.apply:
peopleFrame.loc[peopleFrame['birthday'].apply(age) >= 20]
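The last line of the question fails because age expects a single date, while it is handed the whole column; apply calls the function once per element and returns a Series that can be compared directly, so no extra column is stored. Here is a small self-contained sketch without the names package. The people, the birthdays, and the today parameter with its fixed default are made up so that the result is reproducible.

```python
from datetime import date
import pandas as pd

def age(birthday, today=date(2024, 6, 15)):
    # 'today' is a hypothetical parameter with a fixed default for reproducibility
    return today.year - birthday.year - (
        (today.month, today.day) < (birthday.month, birthday.day)
    )

# Made-up people
people = pd.DataFrame({
    "names": ["Ann", "Bob", "Cy"],
    "birthday": [date(1990, 1, 1), date(2010, 7, 1), date(2004, 6, 15)],
})

# apply runs age on every birthday; the boolean result is used only as a filter
adults = people.loc[people["birthday"].apply(age) >= 20]
print(adults)
```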
I have the following code snippet, which loads data from a CSV file into a numpy.core.records.recarray:
r = mlab.csv2rec(datafile, delimiter=',', names=('dt', 'val'))
data = zip(date2num(r['dt']),r['val']) # Need to filter for records lying between two dates here ...
I want to only 'zip' records that have dates falling between (say) '2000-01-01' and '2000-03-01'.
I understand the concept of lambda functions - but I haven't used them before. It would be cool if I could use a lambda to filter the records between the required dates (like in pseudocode below):
data = zip(lambda: date2num(r['dt']),r['val'] if r['dt'] > '2000-01-01' and r['dt'] < '2000-03-01' )
What is the Pythonic way to extract a subset of data from the recarray, based on specified indices (i.e. dates)?
Since you are using numpy, you don't need a lambda function to do this kind of thing.
Here is an example. You can compare an array with a value, as in r["dt"] >= date(2000,1,1), which gives a bool array. With the "&" operator you can compute the element-wise AND of two bool arrays. Finally, if you use a bool array as the index, you get the values corresponding to True.
import pylab as pl
import StringIO
from datetime import date
data = """2000-01-01,3
1999-04-01,5
2000-01-11,4
2000-02-21,7
2000-08-12,8
"""
r = pl.csv2rec(StringIO.StringIO(data), delimiter=",", names=("dt","val"))
mask = (r["dt"] >= date(2000,1,1)) & (r["dt"] <= date(2000,3,1))
r2 = r[mask]
print zip(pl.date2num(r2["dt"]), r2["val"])
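The example above is Python 2 code, and as far as I know csv2rec is no longer available in recent matplotlib releases. The same boolean-mask idea can be sketched in Python 3 with plain numpy, assuming the dates are ISO-formatted strings. The csv module is used here instead of csv2rec, and the dates are kept as numpy datetime64 values rather than being passed through date2num.

```python
import csv
import io
import numpy as np

data = """2000-01-01,3
1999-04-01,5
2000-01-11,4
2000-02-21,7
2000-08-12,8
"""

# Parse the rows, then build one array per column
rows = list(csv.reader(io.StringIO(data)))
dt = np.array([row[0] for row in rows], dtype="datetime64[D]")
val = np.array([int(row[1]) for row in rows])

# Element-wise comparisons give bool arrays; & combines them
mask = (dt >= np.datetime64("2000-01-01")) & (dt <= np.datetime64("2000-03-01"))

# Indexing with the mask keeps only the entries where it is True
pairs = list(zip(dt[mask], val[mask]))
print(pairs)
```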
Lambda (often combined with map or filter) is generally a less Pythonic and less clear solution to the same problem that list comprehensions and their cousins solve. Assuming r['dt'] holds date objects and date is imported from datetime:
[(date2num(dt), val) for dt, val in zip(r['dt'], r['val']) if date(2000, 1, 1) < dt < date(2000, 3, 1)]