How do I find all rows with a certain date using Pandas? - python

I have a simple Pandas DataFrame containing columns 'valid_time' and 'value'. The sampling frequency is roughly hourly, but irregular and with some large gaps. I want to be able to efficiently pull out all rows for a given day (i.e. within a calendar day). How can I do this using DataFrame.where() or something else?
I naively want to do something like this (which obviously doesn't work):
dt = datetime.datetime(<someday>)
rows = data.where( data['valid_time'].year == dt.year and
data['valid_time'].day == dt.day and
data['valid_time'].month == dt.month)
There are at least a few problems with the above code. I am new to pandas, so I am fumbling with something that is probably straightforward.

Pandas is absolutely terrific for things like this. I would recommend making your datetime field your index as can be seen here. If you give a little bit more information about the structure of your dataframe, I would be happy to include more detailed directions.
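Then you can easily grab all rows from a date using df.loc['2014-01-12'], which selects everything from Jan 12, 2014. You can edit that to get everything from January by using df.loc['2014-01']. (Older pandas versions also accepted a single date string in plain brackets, such as df['1-12-2014'], but that form was deprecated in favor of .loc because it is ambiguous with column selection.) If you want to grab data from a range of dates and/or times, you can do something like:
df.loc['2014-01':'2014-02']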
Pandas is pretty powerful, especially for time-indexed data.
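As a minimal sketch of the index-based approach, assuming the question's 'valid_time' column holds timestamps (the sample data here is made up for illustration):

```python
import pandas as pd

# Made-up hourly sample standing in for the asker's data
times = pd.date_range("2014-01-01", "2014-01-06", freq="h")
df = pd.DataFrame({"valid_time": times, "value": range(len(times))})

# Make the timestamps the index; sorting keeps slicing well-defined
df = df.set_index("valid_time").sort_index()

one_day = df.loc["2014-01-03"]                 # every row on that calendar day
one_month = df.loc["2014-01"]                  # every row in January 2014
two_days = df.loc["2014-01-02":"2014-01-03"]   # both endpoints are included
```

Partial-string selection works on irregular timestamps as well, so gaps in the data do not matter.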

Try this (it is just a continuation of your idea):
import datetime
import numpy as np
import pandas as pd
times = pd.date_range('2014-01-01', '2014-01-06', freq='h')
# randint excludes the upper bound, so 11 gives integers from 0 to 10
values = np.random.randint(0, 11, times.size)
data = pd.DataFrame({'valid_time': times, 'values': values})
dt = datetime.datetime(2014, 1, 3)
# Boolean mask: True where the timestamp falls on the same calendar day as dt
rows = data['valid_time'].apply(
    lambda x: x.year == dt.year and x.month == dt.month and x.day == dt.day
)
print(data[rows])
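The same mask can probably be built without a per-row lambda. The sketch below is one possible vectorized variant, assuming 'valid_time' is a datetime64 column. It truncates each timestamp to midnight with .dt.normalize() and compares it with the target day:

```python
import datetime
import pandas as pd

# Made-up hourly sample standing in for the asker's data
times = pd.date_range("2014-01-01", "2014-01-06", freq="h")
data = pd.DataFrame({"valid_time": times, "value": range(len(times))})

dt = datetime.datetime(2014, 1, 3)

# normalize() sets the time of day to 00:00, leaving only the calendar date
mask = data["valid_time"].dt.normalize() == pd.Timestamp(dt).normalize()
rows = data[mask]
```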

Related

assign a time period to each value of a column in a Pandas dataframe

I have a pandas dataframe with one of the columns being a date. I need to create another column which would be a start (or end, doesn't matter) of a 2W period containing this date. Ideally this would be generalizable to any offset used by pd.Grouper.
Knowing pd.Grouper I can come up with a hacky solution using .groupby.transform() - but I hope there is a nicer solution.
I tried using pd.Series.dt.to_period() but it does not accept offsets like "2W" and interprets them as a weekly offset. I could not find documentation of dt.to_period() that would explain this.
df = pd.DataFrame({"date":["2022-01-03", "2022-01-10", "2022-01-20"], "data":[1,2,3]})
df["date"] = pd.to_datetime(df["date"])
# Trying to assign a 2W period to a new column
# This is ugly and hacky, and pd.Grouper is deprecated
# can this be made better?
df["2W_date_grouper"] = df.groupby(pd.Grouper(freq="2W", key="date"))["data"].transform(lambda x:[x.name]*len(x))
# using .dt.to_period() seems to ignore "2W" and interpret it as "weekly" - WHY???
df["2W_date_to_period"] = df["date"].dt.to_period("2W")
Try to use the following:
pd.Timedelta(days=14)
df['date'] = df['2W_date_grouper'] + pd.Timedelta(days=14)
https://pandas.pydata.org/pandas-docs/stable/timedeltas.html
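One way to apply the Timedelta idea without going through groupby is plain arithmetic on the dates. This is only a sketch: the origin (a Monday, 2022-01-03) and the column name period_start are assumptions made for illustration, and the resulting bins will not necessarily line up with the ones pd.Grouper(freq="2W") produces.

```python
import pandas as pd

df = pd.DataFrame({"date": ["2022-01-03", "2022-01-10", "2022-01-20"], "data": [1, 2, 3]})
df["date"] = pd.to_datetime(df["date"])

origin = pd.Timestamp("2022-01-03")   # assumed start of the first 2-week period
width = pd.Timedelta(days=14)

# Number of whole 2-week periods between the origin and each date
period_index = (df["date"] - origin) // width

# Start of the period that contains each date
df["period_start"] = origin + pd.to_timedelta(period_index * 14, unit="D")
```

Adding the width to period_start gives the period end instead.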

How to display aggregated and non-aggregated values at the same time?

I've got an hourly time series over the stretch of a year. I'd like to display daily and/or monthly aggregated values along with the source data in a plot. The most solid way would supposedly be to add those aggregated values to the source dataframe and take it from there. I know how to take an hourly series like this:
And show hour by day for the whole year like this:
But what I'm looking for is to display the whole thing like below, where the aggregated data are shown together with the source data. Mock example:
And I'd like to do it for various time aggregations like day, week, month, quarter and year.
I know this question is a bit broad, but I've been banging my head against this problem for longer than I'd like to admit. Thank you for any suggestions!
Code sample:
import pandas as pd
import numpy as np
np.random.seed(1)
# ISO date strings avoid any day-first / month-first ambiguity
time = pd.date_range(start='2020-01-01', end='2020-12-31', freq='1h')
A = np.random.uniform(low=-1, high=1, size=len(time)).tolist()
df1 = pd.DataFrame({'time':time, 'A':np.cumsum(A)})
df1.set_index('time', inplace=True)
df1.plot()
times = pd.DatetimeIndex(df1.index)
# mean of A for each (month, day) pair
df2 = df1.groupby([times.month, times.day]).mean()
df2.plot()
You are looking for a step plot, and also a different way to group:
import matplotlib.pyplot as plt
# replace '7D' with '1D' to match your code,
# but 1 day might be too small to see the steps
df2 = df1.groupby(df1.index.floor('7D')).mean()
# where='post' holds each mean from the start of its 7-day bin until the next bin
plt.step(df2.index, df2.A, c='r', where='post')
plt.plot(df1.index, df1.A)
plt.show()
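The question also mentions adding the aggregated values to the source dataframe. A possible sketch of that, assuming calendar periods are what is wanted, is to group by the index converted to periods and broadcast each group's mean back onto the hourly rows with transform. The column names A_day, A_week and A_month are made up here.

```python
import numpy as np
import pandas as pd

np.random.seed(1)
time = pd.date_range(start="2020-01-01", end="2020-12-31", freq="h")
df1 = pd.DataFrame({"A": np.cumsum(np.random.uniform(-1, 1, len(time)))}, index=time)

# Period aliases: D = day, W = week, M = month ("Q" and "Y" work the same way)
for column, freq in {"A_day": "D", "A_week": "W", "A_month": "M"}.items():
    # transform("mean") returns one value per original row, so the result
    # lines up with df1 and can be stored as a new column
    df1[column] = df1.groupby(df1.index.to_period(freq))["A"].transform("mean")
```

Because every column shares the hourly index, df1.plot(drawstyle="steps-post") would draw the source series and the aggregates on one set of axes.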

PYTHON: Filtering a dataset and truncating a date

I am fairly new to python, so any help would be greatly appreciated. I have a dataset that I need to filter down to specific events. For example, I have a column with dates and I need to know what dates are in the current month and have happened within the past week. The column is called POS_START_DATE with dates formatted like 2019-01-27T00:00:00-0500. I need to truncate that date and compare it to the previous week. No luck so far.
Here is my code so far:
## import data package
import datetime
## assign date variables
today = datetime.date.today()
six_day = datetime.timedelta(days = 6)
## Create week parameter
week = today + six_day
## Statement to extract recent job movements
if fields.POS_START_DATE < week and fields.POS_START_DATE > today:
out1 += in1
Here is sample of the table:
Sample Table
I am looking for the same table filtered down to only rows that happened within one week. The bottom of the sample table(not shown) will have dates in this month. I'd like the final output to only show those rows, and any other rows in the current month of November.
I am not sure I understand your expected output, but this will help you create an extra column that can be used as a flag for the rows that fulfill the condition in your if-statement:
import numpy as np
fields['flag_1'] = np.where(((fields['POS_START_DATE'] < week) & (fields['POS_START_DATE'] > today)),1,0)
This will generate an extra column in your dataframe with a 1 for the cases that meet the criteria you stated. Finally you can perform this calculation to get the total of cases that actually met the criteria:
total_cases = fields['flag_1'].sum()
Edit:
If you need to filter the data with only the cases that meet the criteria you can either use pandas filtering with the original if-statement (without creating the extra flag field) like this:
df_filtered = fields[(fields['POS_START_DATE'] < week) & (fields['POS_START_DATE'] > today)]
Or, if you created the flag column, more simply:
df_filtered = fields[fields['flag_1'] == 1]
Both should work to generate a new dataframe, with only the cases that match your criteria.
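These comparisons only work once POS_START_DATE holds real dates rather than strings such as 2019-01-27T00:00:00-0500. The sketch below is one way to truncate and compare, under a few assumptions: the sample rows are invented, "today" is pinned to a fixed date so the result is reproducible, and "past week" is read as the seven days up to and including today.

```python
import pandas as pd

# Invented rows in the format described in the question
fields = pd.DataFrame({"POS_START_DATE": [
    "2019-01-27T00:00:00-0500",
    "2019-11-04T00:00:00-0500",
    "2019-11-18T00:00:00-0500",
    "2019-11-20T00:00:00-0500",
]})

# Truncate: the first 10 characters are the YYYY-MM-DD part
start = pd.to_datetime(fields["POS_START_DATE"].str[:10], format="%Y-%m-%d")

# Fixed here for reproducibility; pd.Timestamp.today().normalize() in real use
today = pd.Timestamp("2019-11-21")
week_ago = today - pd.Timedelta(days=7)

in_past_week = (start > week_ago) & (start <= today)
in_this_month = (start.dt.year == today.year) & (start.dt.month == today.month)

recent = fields[in_past_week]
this_month = fields[in_this_month]
```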

plot the relationship between two variables with pandas

I am new to python but am aware about the usefulness of pandas, thus I would like to kindly ask if someone can help me to use pandas in order to address the below problem.
I have a dataset with buses, which looks like:
BusModel;BusID;ModeName;Value;Unit;UtcTime
Alpha;0001;Engine hours;985;h;2016-06-22 19:58:09.000
Alpha;0001;Engine hours;987;h;2016-06-22 21:58:09.000
Alpha;0001;Engine hours;989;h;2016-06-22 23:59:09.000
Alpha;0001;Fuel consumption;78;l;2016-06-22 19:58:09.000
Alpha;0001;Fuel consumption;88;l;2016-06-22 21:58:09.000
Alpha;0001;Fuel consumption;98;l;2016-06-22 23:59:09.000
The file is in .csv format and is separated by semicolons (;). Please note that I would like to plot the relationship between ‘Engine hours’ and ‘Fuel consumption’ by calculating the mean value of both for each day, based on the UtcTime. Moreover, I would like to plot graphs for all the buses in the dataset (not only 0001 but also 0002, 0003 etc.). How can I do that with a simple loop?
Start with the following in interactive mode:
import pandas as pd
df = pd.read_csv('bus.csv', sep=";", parse_dates=['UtcTime'])
You should be able to start playing around with the DataFrame and discovering functions you can directly use with the data. To get a list of buses by ID just do:
>>> bus1 = df[df.BusID == 1]
>>> bus1
Substitute 1 with the ID of the bus you require. This will return you a sub-DataFrame. To get BusID 1 and just their engine hours do:
>>> bus1[bus1.ModeName == "Engine hours"]
You can quickly get statistics of columns by doing
>>> bus1.Value.describe()
Once you have selected the data you need, you can start plotting:
>>> import matplotlib.pyplot as plt
>>> bus1[bus1.ModeName == "Engine hours"].plot()
>>> bus1[bus1.ModeName == "Fuel consumption"].plot()
>>> plt.show()
There is more explanation in the docs. Please refer to http://pandas.pydata.org/pandas-docs/stable/.
If you really want to use pandas, remember this simple thing: avoid explicit loops wherever a built-in function does the job. Row-by-row loops scale poorly, so try the built-in functions first. First let's read your dataframe:
import pandas as pd
data = pd.read_csv('bus.csv',sep = ';')
Here is the weak point of my answer: I don't know how to manage dates efficiently. So create a column named day which contains the day from UtcTime (I would use an apply method like this: data['day'] = data['UtcTime'].apply(lambda x: x[:10]), but it's a hidden loop, so don't do that!)
Then to take only the data of a single bus, try a slicing method:
data_bus1 = data[data.BusID == 1]
Finally use the groupby function:
data_bus1[['ModeName','Value','day']].groupby(['ModeName','day'],as_index = False).mean()
Or if you don't need to separate your busses in different dataframes, you can use the groupby on the whole data:
data[['BusID','ModeName','Value','day']].groupby(['BusID','ModeName','day'],as_index = False).mean()
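For the date handling, one option is to let read_csv parse UtcTime and then take the calendar day with the .dt accessor. The sketch below uses the sample rows from the question through io.StringIO in place of the real file. Reading BusID as a string (to keep the leading zeros) and the pivot into one column per ModeName are choices made here for illustration.

```python
import io
import pandas as pd

csv_text = """BusModel;BusID;ModeName;Value;Unit;UtcTime
Alpha;0001;Engine hours;985;h;2016-06-22 19:58:09.000
Alpha;0001;Engine hours;987;h;2016-06-22 21:58:09.000
Alpha;0001;Engine hours;989;h;2016-06-22 23:59:09.000
Alpha;0001;Fuel consumption;78;l;2016-06-22 19:58:09.000
Alpha;0001;Fuel consumption;88;l;2016-06-22 21:58:09.000
Alpha;0001;Fuel consumption;98;l;2016-06-22 23:59:09.000
"""

data = pd.read_csv(io.StringIO(csv_text), sep=";",
                   parse_dates=["UtcTime"], dtype={"BusID": str})

# Vectorized truncation to the calendar day (no apply needed)
data["day"] = data["UtcTime"].dt.normalize()

# One row per (bus, day), one column per ModeName, holding the daily mean
daily = data.pivot_table(index=["BusID", "day"], columns="ModeName",
                         values="Value", aggfunc="mean")
```

With the two quantities side by side, each bus could then be plotted from daily.groupby(level="BusID"), for example as a scatter of 'Engine hours' against 'Fuel consumption'.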

How do I perform math operations on dates and sort them in Python?

I have dates in a Python script that I need to work with that are in a list. I have to keep the format that already exists. The format is YYYY-MM-DD. They are displayed in the form ['2010-05-12', '2011-04-15', 'Date', '2010-04-20', '2010-11-05'] where the order of the dates appears to be random and they are made into lists with seemingly insignificant lengths. The length of this data can get very large. I need to know how to sort these dates into a chronological order and omit the seemingly randomly placed entries of 'Date' from this order. Then I need to be able to perform math operations such as moving up and down the list. For example if I have five dates in order I need to be able to take one date and be able to find a date x spaces ahead or behind that date in the order. I'm very new to Python so simpler explanations and implementations are preferred. Let me know if any clarifications are needed. Thanks.
You are asking several questions at the same time, so I'll answer them in order.
To filter out the "Date" entries, use the filter function like this:
dates = ['2011-06-18', 'Date', '2010-01-13', '1997-12-01', '2007-08-11']
dates_filtered = list(filter(lambda d: d != 'Date', dates))
Or perhaps like this, using Python's list comprehensions, if you find it easier to understand:
dates_filtered = [d for d in dates if d != 'Date']
You might want to convert the data types of the date items in your list to the date class to get access to some date-related methods like this:
from datetime import datetime
date_objects = [datetime.strptime(x,'%Y-%m-%d').date() for x in dates_filtered]
And to sort the dates you simply use the sort method
date_objects.sort()
The syntax in Python for accessing items and ranges of items in lists (or any "sequence type") is quite powerful. You can read more about it here. For example, if you want to access the last two dates in your list you could do something like this:
print(date_objects[-2:])
If you put it all together you'll get something like this:
from datetime import datetime
dates = ['2011-06-18', 'Date', '2010-01-13', '1997-12-01', '2007-08-11']
my_dates = [datetime.strptime(d, '%Y-%m-%d').date()
            for d in dates
            if d != 'Date']
my_dates.sort()
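For the "x spaces ahead or behind" part, and to keep the original string format, note that YYYY-MM-DD strings already sort chronologically when sorted as plain text. A small sketch follows. The helper name neighbor is made up, and raising an error when the step leaves the list is just one possible choice.

```python
dates = ['2010-05-12', '2011-04-15', 'Date', '2010-04-20', '2010-11-05']

# ISO-formatted strings sort chronologically, so no conversion is needed
ordered = sorted(d for d in dates if d != 'Date')

def neighbor(ordered_dates, date, offset):
    """Return the date `offset` positions after (or before, if negative) `date`."""
    position = ordered_dates.index(date) + offset
    if not 0 <= position < len(ordered_dates):
        raise IndexError("offset moves past the end of the list")
    return ordered_dates[position]

two_ahead = neighbor(ordered, '2010-05-12', 2)
one_behind = neighbor(ordered, '2010-05-12', -1)
```

The range check matters because a negative index in Python would otherwise silently wrap around to the end of the list.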
