Subsetting out rows using pandas - python

I have two sets of dataframes: datamax, datamax2015 and datamin, datamin2015.
Snippet of data:
print(datamax.head())
print(datamin.head())
print(datamax2015.head())
print(datamin2015.head())
Date ID Element Data_Value
0 2005-01-01 USW00094889 TMAX 156
1 2005-01-02 USW00094889 TMAX 139
2 2005-01-03 USW00094889 TMAX 133
3 2005-01-04 USW00094889 TMAX 39
4 2005-01-05 USW00094889 TMAX 33
Date ID Element Data_Value
0 2005-01-01 USC00200032 TMIN -56
1 2005-01-02 USC00200032 TMIN -56
2 2005-01-03 USC00200032 TMIN 0
3 2005-01-04 USC00200032 TMIN -39
4 2005-01-05 USC00200032 TMIN -94
Date ID Element Data_Value
0 2015-01-01 USW00094889 TMAX 11
1 2015-01-02 USW00094889 TMAX 39
2 2015-01-03 USW00014853 TMAX 39
3 2015-01-04 USW00094889 TMAX 44
4 2015-01-05 USW00094889 TMAX 28
Date ID Element Data_Value
0 2015-01-01 USC00200032 TMIN -133
1 2015-01-02 USC00200032 TMIN -122
2 2015-01-03 USC00200032 TMIN -67
3 2015-01-04 USC00200032 TMIN -88
4 2015-01-05 USC00200032 TMIN -155
For datamax and datamax2015, I want to compare their Data_Value columns and create a dataframe of the entries in datamax2015 whose Data_Value is greater than all entries in datamax for the same day of the year. The expected output is a dataframe with rows drawn from 2015-01-01 to 2015-12-31, keeping only the dates where the Data_Value exceeds every Data_Value in datamax for that day of the year.
i.e. a dataframe with 4 columns and anywhere from 1 to 364 rows, depending on the condition above.
I want the converse (min) for the datamin and datamin2015 dataframes.
I have tried the following code:
upper = []
for row in datamax.iterrows():
    for j in datamax2015["Data_Value"]:
        if j > row["Data_Value"]:
            upper.append(row)

lower = []
for row in datamin.iterrows():
    for j in datamin2015["Data_Value"]:
        if j < row["Data_Value"]:
            lower.append(row)
Could anyone give me a helping hand as to where I am going wrong?

This code does what you want for datamin. Try to adapt it to the symmetric datamax case as well; leave a comment if you have trouble and I'm happy to help further.
Create Data
from datetime import datetime
import pandas as pd
datamin = pd.DataFrame({"date": pd.date_range(start=datetime(2005, 1, 1), end=datetime(2015, 12, 31)), "Data_Value": 1})
datamin["day_of_year"] = datamin["date"].dt.dayofyear
# Set the value for the 4th day of the year higher in order for the desired result to be non-empty
datamin.loc[datamin["day_of_year"]==4, "Data_Value"] = 2
datamin2015 = pd.DataFrame({"date": pd.date_range(start=datetime(2015, 1, 1), end=datetime(2015, 12, 31)), "Data_Value": 2})
datamin2015["day_of_year"] = datamin["date"].dt.dayofyear
# Set the value for the 4th day of the year lower in order for the desired result to be non-empty
datamin2015.loc[3, "Data_Value"] = 1
The solution
df1 = datamin.groupby("day_of_year").agg({"Data_Value": "min"})
df2 = datamin2015.join(df1, on="day_of_year", how="left", lsuffix="2015")
lower = df2.loc[df2["Data_Value2015"]<df2["Data_Value"]]
lower
We group datamin by day of year (using .dt.dayofyear) to find the minimum across all years for each day of the year. We then join that with datamin2015 and compare Data_Value2015 with Data_Value to find the rows where the 2015 value was below the minimum across all same days of the year in datamin.
In the example above, lower has one row because of how I set up the dataframes.
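For completeness, the symmetric max case might look like this (a sketch assuming datamax and datamax2015 carry the same Data_Value and day_of_year columns as above):
# Sketch of the symmetric max case: keep 2015 rows above the all-years max.
df1 = datamax.groupby("day_of_year").agg({"Data_Value": "max"})
df2 = datamax2015.join(df1, on="day_of_year", how="left", lsuffix="2015")
upper = df2.loc[df2["Data_Value2015"] > df2["Data_Value"]]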

Python code which returns a line graph of the record high and record low temperatures by day of the year over the period 2005-2014. The area between the record high and record low temperatures for each day should be shaded.
Overlay a scatter of the 2015 data for any points (highs and lows) for which the ten year record (2005-2014) record high or record low was broken in 2015.
Remove leap year dates (i.e. 29th February).
from datetime import datetime
import pandas as pd
import matplotlib.pyplot as plt
pd.set_option("display.max_rows",None,"display.max_columns",None)
data = pd.read_csv('data/C2A2_data/BinnedCsvs_d400/fb441e62df2d58994928907a91895ec62c2c42e6cd075c2700843b89.csv')
# Restrict to the 2005-2014 period
newdata = data[(data['Date'] >= '2005-01-01') & (data['Date'] <= '2014-12-31')]
datamax = newdata[newdata['Element']=='TMAX'].copy()
datamin = newdata[newdata['Element']=='TMIN'].copy()
datamax['Date'] = pd.to_datetime(datamax['Date'])
datamin['Date'] = pd.to_datetime(datamin['Date'])
datamax["day_of_year"] = datamax["Date"].dt.dayofyear
datamax = datamax.groupby('day_of_year').max()
datamin["day_of_year"] = datamin["Date"].dt.dayofyear
datamin = datamin.groupby('day_of_year').min()
datamax = datamax.reset_index()
datamin = datamin.reset_index()
datamin['Date'] = datamin['Date'].dt.strftime('%Y-%m-%d')
datamax['Date'] = datamax['Date'].dt.strftime('%Y-%m-%d')
datamax = datamax[~datamax['Date'].str.contains("02-29")]
datamin = datamin[~datamin['Date'].str.contains("02-29")]
# 2015 data, used for the record-breaking scatter points
breakoutdata = data[(data['Date'] > '2014-12-31')]
datamax2015 = breakoutdata[breakoutdata['Element']=='TMAX'].copy()
datamin2015 = breakoutdata[breakoutdata['Element']=='TMIN'].copy()
datamax2015['Date'] = pd.to_datetime(datamax2015['Date'])
datamin2015['Date'] = pd.to_datetime(datamin2015['Date'])
datamax2015["day_of_year"] = datamax2015["Date"].dt.dayofyear
datamax2015 = datamax2015.groupby('day_of_year').max()
datamin2015["day_of_year"] = datamin2015["Date"].dt.dayofyear
datamin2015 = datamin2015.groupby('day_of_year').min()
datamax2015 = datamax2015.reset_index()
datamin2015 = datamin2015.reset_index()
datamin2015['Date'] = datamin2015['Date'].dt.strftime('%Y-%m-%d')
datamax2015['Date'] = datamax2015['Date'].dt.strftime('%Y-%m-%d')
datamax2015 = datamax2015[~datamax2015['Date'].str.contains("02-29")]
datamin2015 = datamin2015[~datamin2015['Date'].str.contains("02-29")]
dataminappend = datamin2015.join(datamin,on="day_of_year",rsuffix="_new")
lower = dataminappend.loc[dataminappend["Data_Value_new"]>dataminappend["Data_Value"]].copy()
datamaxappend = datamax2015.join(datamax,on="day_of_year",rsuffix="_new")
upper = datamaxappend.loc[datamaxappend["Data_Value_new"]<datamaxappend["Data_Value"]].copy()
upper['Date'] = pd.to_datetime(upper['Date'])
lower['Date'] = pd.to_datetime(lower['Date'])
datamax['Date'] = pd.to_datetime(datamax['Date'])
datamin['Date'] = pd.to_datetime(datamin['Date'])
ax = plt.gca()
plt.plot(datamax['day_of_year'],datamax['Data_Value'],color='red')
plt.plot(datamin['day_of_year'],datamin['Data_Value'], color='blue')
plt.scatter(upper['day_of_year'],upper['Data_Value'],color='purple')
plt.scatter(lower['day_of_year'],lower['Data_Value'], color='cyan')
plt.ylabel("Temperature (degrees C)",color='navy')
plt.xlabel("Date",color='navy',labelpad=15)
plt.title('Record high and low temperatures by day (2005-2014)', alpha=1.0,color='brown',y=1.08)
ax.legend(loc='upper center', bbox_to_anchor=(0.5, -0.35),fancybox=False,labels=['Record high','Record low'])
plt.xticks(rotation=30)
plt.fill_between(datamax['day_of_year'], datamax['Data_Value'], datamin['Data_Value'],color='yellow',alpha=0.8)
plt.show()
I converted the 'Date' column to a string with datamin['Date'] = datamin['Date'].dt.strftime('%Y-%m-%d') so the leap days could be filtered out, then converted it back to datetime format with upper['Date'] = pd.to_datetime(upper['Date']).
I then used 'day_of_year' as the x-value.
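As an aside, the string round-trip is only needed for the .str.contains filter; the leap days could instead be dropped directly from the datetime column, along these lines:
# Sketch: drop 29 February without converting Date to string and back.
datamax = datamax[~((datamax['Date'].dt.month == 2) & (datamax['Date'].dt.day == 29))]
datamin = datamin[~((datamin['Date'].dt.month == 2) & (datamin['Date'].dt.day == 29))]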

Related

pandas.Series.apply() lambda function to count data-frame column values with conditions

This post follows on from another one I posted which can be found here:
use groupby() and for loop to count column values with conditions
I am working with the same data again:
import pandas as pd
import numpy as np
from datetime import timedelta

np.random.seed(365)
# some data
start_date = pd.date_range(start="2015-01-09", end="2022-09-11", freq="6D")
end_date = [d + timedelta(days=np.random.exponential(scale=100)) for d in start_date]
df = pd.DataFrame({"start_date": start_date, "end_date": end_date})
# randomly remove some end dates
df["end_date"] = df["end_date"].sample(frac=0.7).reset_index(drop=True)
df["end_date"] = df["end_date"].dt.date.astype("datetime64[ns]")
Like in the previous post, I first created a pd.Series with the 1st day of every month in the entire history of the data
dates = pd.Series(df["start_date"].dt.to_period("M").sort_values(ascending = True).unique()).dt.start_time
What I now want to do is count the number of rows in the data-frame where the df["start_date"] values are less than the 1st day of each month in the series and where the df["end_date"] values are greater than the 1st day of each month in the series
I would think that I would apply a lambda function or use np.logical_and on the dates series to obtain the output I am after - the logic of which would look something like this:
# only obtain those rows with end dates
inactives = df[df["end_date"].notnull()]
dates.apply(
    lambda x: (inactives[inactives["start_date"] < x] & inactives[inactives["end_date"] > x]).count()
)
or like this:
dates.apply(
    lambda x: np.logical_and(
        inactives["start_date"] < x,
        inactives["end_date"] > x
    ).sum())
The resulting output would look like this:
month_first  count
2015-01-01      10
2015-02-01      25
2015-03-01      45
Correct, we can use apply with a lambda for this. First, we create our list of the first day of each month, using freq="MS" (month start) to generate the dates inside our defined interval.
new_df = pd.DataFrame({"month_first": pd.date_range(start="2015-01-01", end="2022-10-01", freq = "MS")})
This will result in this table:
month_first
0 2015-01-01
1 2015-02-01
2 2015-03-01
3 2015-04-01
4 2015-05-01
.. ...
89 2022-06-01
90 2022-07-01
91 2022-08-01
92 2022-09-01
93 2022-10-01
[94 rows x 1 columns]
Then we apply the lambda function below. For each date in our range, we compare it against every row of inactives, keeping those whose start_date is earlier and whose end_date is later. The & operator combines the two boolean Series element-wise, and sum adds up the True values to give the count.
new_df["count"] = new_df["month_first"].apply(
lambda x: ((inactives["start_date"] < x) & (inactives["end_date"] > x)).sum())
This will result in this table:
month_first count
0 2015-01-01 0
1 2015-02-01 4
2 2015-03-01 9
3 2015-04-01 14
4 2015-05-01 19
.. ... ...
89 2022-06-01 25
90 2022-07-01 22
91 2022-08-01 19
92 2022-09-01 13
93 2022-10-01 13
[94 rows x 2 columns]
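For large frames, the same counts can also be computed without apply by broadcasting the comparisons in NumPy; a sketch using the same variable names as above:
# Sketch: vectorised equivalent of the apply/lambda above.
starts = inactives["start_date"].to_numpy()
ends = inactives["end_date"].to_numpy()
months = new_df["month_first"].to_numpy()
# Broadcast to an (n_months, n_rows) boolean grid, then count per month.
active = (starts < months[:, None]) & (ends > months[:, None])
new_df["count"] = active.sum(axis=1)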

use groupby() and for loop to count column values with conditions

The logic of what I am trying to do I think is best explained with code:
import pandas as pd
import numpy as np
from datetime import timedelta

np.random.seed(365)
# some data
start_date = pd.date_range(start="2015-01-09", end="2022-09-11", freq="6D")
end_date = [d + timedelta(days=np.random.exponential(scale=100)) for d in start_date]
df = pd.DataFrame({"start_date": start_date, "end_date": end_date})
# randomly remove some end dates
df["end_date"] = df["end_date"].sample(frac=0.7).reset_index(drop=True)
df["end_date"] = df["end_date"].dt.date.astype("datetime64[ns]")
I first create a pd.Series with the 1st day of every month in the entire history of the data:
dates = pd.Series(df["start_date"].dt.to_period("M").sort_values(ascending = True).unique()).dt.start_time
What I then want to do is count the number of df["start_date"] values which are less than the 1st day of each month in the series and where the df["end_date"] values are null (recorded as NaT)
I would think I would use a for loop to do this and somehow groupby the dates series so that the resulting output looks something like this:
month_start  count
2015-01-01       5
2015-02-01      10
2015-03-01      35
The count column in the resulting output is the number of df rows where the df["start_date"] values are less than the 1st of the month and the df["end_date"] values are null; this is computed for every value in the series.
Here is the logic of what I am trying to do:
df.groupby(by = dates)[["start_date", "end_date"]].apply(
    lambda x: [x["start_date"] < date for date in dates] & x["end_date"].isnull()
)
Is this what you want:
df2 = df[df['end_date'].isnull()]
dates_count = dates.apply(lambda x: df2[df2['start_date'] < x]['start_date'].count())
print(pd.concat([dates, dates_count], axis=1))
IIUC, group by period (shifted by 1 month) and count the NaT, then cumsum to accumulate the counts:
(df['end_date'].isna()
   .groupby(df['start_date'].dt.to_period('M').add(1).dt.start_time)
   .sum()
   .cumsum()
)
Output:
start_date
2015-02-01 0
2015-03-01 0
2015-04-01 0
2015-05-01 0
2015-06-01 0
...
2022-06-01 122
2022-07-01 127
2022-08-01 133
2022-09-01 138
2022-10-01 140
Name: end_date, Length: 93, dtype: int64
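To cross-check this against the counts the question asks for, the same quantity can be computed directly per month start with the dates series from the question; a sketch:
# Sketch: for each month start, count rows begun before it with no end_date.
no_end = df[df["end_date"].isnull()]
counts = dates.apply(lambda d: (no_end["start_date"] < d).sum())
print(pd.concat([dates, counts], axis=1))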

Python Timeseries Conditional Calculations Summarized by Month

I have the timeseries dataframe as:
timestamp            signal_value
2017-08-28 00:00:00            10
2017-08-28 00:05:00             3
2017-08-28 00:10:00             5
2017-08-28 00:15:00             5
I am trying to get the average monthly percentage of the time where "signal_value" is greater than 5. Something like:
Month     metric
January      16%
February      2%
March         8%
April        10%
I tried the following code, which gives the result for the whole dataset, but how can I summarize it per month?
total, count = 0, 0
for index, row in df.iterrows():
    total += 1
    if row["signal_value"] >= 5:
        count += 1
print((count/total)*100)
Thank you in advance.
Let us first generate some random data (the randomtimes helper for generating random dates is taken from here):
import pandas as pd
import numpy as np
import datetime

def randomtimes(start, end, n):
    frmt = '%d-%m-%Y %H:%M:%S'
    stime = datetime.datetime.strptime(start, frmt)
    etime = datetime.datetime.strptime(end, frmt)
    td = etime - stime
    dtimes = [np.random.random() * td + stime for _ in range(n)]
    return [d.strftime(frmt) for d in dtimes]

# Recreate some fake data
timestamp = randomtimes("01-01-2021 00:00:00", "01-01-2023 00:00:00", 10000)
signal_value = np.random.random(len(timestamp)) * 10
df = pd.DataFrame({"timestamp": timestamp, "signal_value": signal_value})
Now we can transform the timestamp column to pandas timestamps to extract month and year per timestamp:
df.timestamp = pd.to_datetime(df.timestamp)
df["month"] = df.timestamp.dt.month
df["year"] = df.timestamp.dt.year
We generate a boolean column whether signal_value is larger than some threshold (here 5):
df["is_larger5"] = df.signal_value > 5
Finally, we can get the average for every month by using pandas.groupby:
>>> df.groupby(["year", "month"])['is_larger5'].mean()
year month
2021 1 0.509615
2 0.488189
3 0.506024
4 0.519362
5 0.498778
6 0.483709
7 0.498824
8 0.460396
9 0.542918
10 0.463043
11 0.492500
12 0.519789
2022 1 0.481663
2 0.527778
3 0.501139
4 0.527322
5 0.486936
6 0.510638
7 0.483370
8 0.521253
9 0.493639
10 0.495349
11 0.474886
12 0.488372
Name: is_larger5, dtype: float64
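If the output should read like the percentage table in the question, the grouped mean just needs formatting; a sketch using month names (note that month_name() groups alphabetically, so reindex by calendar order if needed):
# Sketch: monthly share of readings above the threshold, as percentages.
monthly = df.groupby(df.timestamp.dt.month_name())["is_larger5"].mean()
print((monthly * 100).round(1).astype(str) + "%")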

Create a list of years with pandas

I have a dataframe with a column of dates of the form
2004-01-01
2005-01-01
2006-01-01
2007-01-01
2008-01-01
2009-01-01
2010-01-01
2011-01-01
2012-01-01
2013-01-01
2014-01-01
2015-01-01
2016-01-01
2017-01-01
2018-01-01
2019-01-01
Given an integer number k, let's say k=5, I would like to generate an array of the next k years after the maximum date of the column. The output should look like:
2020-01-01
2021-01-01
2022-01-01
2023-01-01
2024-01-01
Let's use pd.to_datetime + max to compute the largest date in the column, then pd.date_range to generate the dates with a one-year offset frequency and the number of periods equal to k=5:
strt, offs = pd.to_datetime(df['date']).max(), pd.DateOffset(years=1)
dates = pd.date_range(strt + offs, freq=offs, periods=k).strftime('%Y-%m-%d').tolist()
print(dates)
['2020-01-01', '2021-01-01', '2022-01-01', '2023-01-01', '2024-01-01']
Here you go:
import pandas as pd

# this is your k
k = 5
# Creating a test DF
array = {'dt': ['2018-01-01', '2019-01-01']}
df = pd.DataFrame(array)
# Extracting column of year
df['year'] = pd.DatetimeIndex(df['dt']).year
year1 = df['year'].max()
# creating a new DF and populating it with k years
# (DataFrame.append was removed in pandas 2.0, so build rows with pd.concat)
years_df = pd.DataFrame()
for i in range(1, k + 1):
    row = {'dates': [str(year1 + i) + '-01-01']}
    years_df = pd.concat([years_df, pd.DataFrame(row)])
years_df
The output:
dates
2020-01-01
2021-01-01
2022-01-01
2023-01-01
2024-01-01
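The loop above could also be collapsed into a single pd.date_range call with a year-start frequency, similar to the first answer; a minimal sketch reusing year1 and k from above:
# Sketch: build the k year-start dates in one call instead of appending rows.
years_df = pd.DataFrame({"dates": pd.date_range(str(year1 + 1), periods=k, freq="YS").strftime("%Y-%m-%d")})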

Modify Code from Year Timespan to Month/Week Timespans

I am making a stacked bar plot over a year time span where the x-axis is company names, y-axis is the number of calls, and the stacks are the months.
I want to be able to make this plot run for a time span of a month, where the stacks are days, and a time span of a week, where the stacks are days. I am having trouble doing this since my code is built already around the year time span.
My input is a dataframe that looks like this
pivot_table.head(3)
Out[12]:
Month 1 2 3 4 5 6 7 8 9 10 11 12
CompanyName
Customer1 17 30 29 39 15 26 24 12 36 21 18 15
Customer2 4 11 13 22 35 29 15 18 29 31 17 14
Customer3 11 8 25 24 7 15 20 0 21 12 12 17
and my code is this so far.
First I grab a year's worth of data (I would change this to a month or a week for this question):
import datetime
import pandas as pd

df['recvd_dttm'] = pd.to_datetime(df['recvd_dttm'])
# Only retrieve data before now (ignore typos that are future dates)
mask = df['recvd_dttm'] <= datetime.datetime.now()
df = df.loc[mask]
# get first and last datetime for the final year of data
range_max = df['recvd_dttm'].max()
range_min = range_max - pd.DateOffset(years=1)
# take slice with the final year of data
df = df[(df['recvd_dttm'] >= range_min) &
        (df['recvd_dttm'] <= range_max)]
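For the month and week spans, I assume only this offset would change, something like:
# Sketch: swap the one-year offset for a month or a week.
range_min = range_max - pd.DateOffset(months=1)
# or: range_min = range_max - pd.DateOffset(weeks=1)
The part I am stuck on is adapting the monthly pivot and stacking below.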
Then I create the pivot_table shown above.
###########################################################
# Create Dataframe
###########################################################
df = df.set_index('recvd_dttm')
df.index = pd.to_datetime(df.index, format='%m/%d/%Y %H:%M')
result = df.groupby([lambda idx: idx.month, 'CompanyName']).agg(len).reset_index()
result.columns = ['Month', 'CompanyName', 'NumberCalls']
pivot_table = result.pivot(index='Month', columns='CompanyName', values='NumberCalls').fillna(0)
# keep the 30 companies with the most calls
s = pivot_table.sum().sort_values(ascending=False)
pivot_table = pivot_table.loc[:, s.index[:30]]
pivot_table = pivot_table.transpose()
pivot_table = pivot_table.reset_index()
pivot_table['CompanyName'] = [str(x) for x in pivot_table['CompanyName']]
Companies = list(pivot_table['CompanyName'])
pivot_table = pivot_table.set_index('CompanyName')
pivot_table.to_csv('pivot_table.csv')
Then I use the pivot table to create an OrderedDict for plotting.
###########################################################
# Create OrderedDict for plotting
###########################################################
from collections import OrderedDict
from bokeh.charts import Bar
from bokeh.models import HoverTool
from bokeh.palettes import brewer
from bokeh.plotting import output_file, show

months = [pivot_table[m].astype(float).values for m in range(1, 13)]
names = ["Jan", "Feb", "Mar", "Apr", "May", "Jun", "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"]
months_dict = OrderedDict(list(zip(names, months)))
###########################################################
# Plot!
###########################################################
palette = brewer["RdYlGn"][8]
hover = HoverTool(
    tooltips=[
        ("Month", "@months"),
        ("Number of Calls", "@NumberCalls"),
    ]
)
output_file("stacked_bar.html")
bar = Bar(months_dict, Companies, title="Number of Calls Each Month", palette=palette,
          legend="top_right", width=1200, height=900, stacked=True)
bar.add_tools(hover)
show(bar)
Does anyone have ideas on how to approach modifying this code so it can work for shorter time spans? This is what the graph looks like for a year: [screenshot of the year-long stacked bar chart omitted]
EDIT: Added the full code. The input looks like this example:
CompanyName recvd_dttm
Company1 6/5/2015 18:28:50 PM
Company2 6/5/2015 14:25:43 PM
Company3 9/10/2015 21:45:12 PM
Company4 6/5/2015 14:30:43 PM
Company5 6/5/2015 14:32:33 PM
