My SQL table:
SDATETIME FE014BPV FE011BPV
0 2022-05-28 5.770000 13.735000
1 2022-05-30 16.469999 42.263000
2 2022-05-31 56.480000 133.871994
3 2022-06-01 49.779999 133.561996
4 2022-06-02 45.450001 132.679001
.. ... ... ...
93 2022-09-08 0.000000 0.050000
94 2022-09-09 0.000000 0.058000
95 2022-09-10 0.000000 0.051000
96 2022-09-11 0.000000 0.050000
97 2022-09-12 0.000000 0.038000
My code:
import pandas as pd
import pyodbc
monthSQL = pd.read_sql_query('SELECT SDATETIME,max(FE014BPV) as flow,max(FE011BPV) as steam FROM [SCADA].[dbo].[TOTALIZER] GROUP BY SDATETIME ORDER BY SDATETIME ASC', conn)
monthdata = monthSQL.groupby(monthSQL['SDATETIME'].dt.strftime("%b-%Y"), sort=True).sum()
print(monthdata)
Produces this incorrect output
flow steam
SDATETIME
Aug-2022 1800.970001 2580.276996
Jul-2022 1994.300014 2710.619986
Jun-2022 3682.329998 7633.660018
May-2022 1215.950003 3098.273025
Sep-2022 0.000000 1.705000
I want output something like below:
SDATETIME flow steam
May-2022 1215.950003 3098.273025
Jun-2022 3682.329998 7633.660018
Jul-2022 1994.300014 2710.619986
Aug-2022 1800.970001 2580.276996
Sep-2022 0.000000 1.705000
Also, I need a sum of the last 12 months of data.
The output is correct, just not in the order you expect. Try this:
# This keeps SDATETIME as a datetime, not a string
monthdata = monthSQL.groupby(pd.Grouper(key="SDATETIME", freq="MS")).sum()
# Rolling sum of the last 12 months
monthdata = pd.concat(
    [
        monthdata,
        monthdata.add_suffix("_LAST12").rolling("366D").sum(),
    ],
    axis=1,
)
# Keep SDATETIME as datetime for as long as you need to manipulate the
# dataframe in Python. When you need to export it, convert it to
# string
monthdata.index = monthdata.index.strftime("%b-%Y")
About the rolling(...) operation: it's easy to assume that rolling(12) gives you the rolling sum of the last 12 months, given that each row represents a month. In fact, it returns the rolling sum of the last 12 rows. This matters because if there are gaps in your data, 12 rows may cover more than 12 months. rolling("366D") makes sure that it only counts rows within the last 366 days, which is the maximum length of any 12-month period.
We can't use rolling("12M") because months do not have a fixed duration: a month has between 28 and 31 days.
You are sorting the month names in alphabetical order; you need to sort them by actual month order instead. You can see the alphabetical ordering from the starting letters of the dates:
SDATETIME
Aug-2022 # A goes before J, M, S in the alphabet
Jul-2022 # J goes after A, but before M and S in the alphabet
Jun-2022 # J goes after A, but before M and S in the alphabet
May-2022 # M goes after A, J but before S in the alphabet
Sep-2022 # S goes after A, J, M in the alphabet
To sort by the actual month order, you can build a dictionary that maps each month name to its number and pass a key function that uses .apply():
month_dict = {'Jan-2022': 1, 'Feb-2022': 2, 'Mar-2022': 3, 'Apr-2022': 4, 'May-2022': 5, 'Jun-2022': 6, 'Jul-2022': 7, 'Aug-2022': 8, 'Sep-2022': 9, 'Oct-2022': 10, 'Nov-2022': 11, 'Dec-2022': 12}
df = df.sort_values('SDATETIME', key=lambda col: col.apply(lambda m: month_dict[m]))
print(df)
SDATETIME flow steam
May-2022 1215.950003 3098.273025
Jun-2022 3682.329998 7633.660018
Jul-2022 1994.300014 2710.619986
Aug-2022 1800.970001 2580.276996
Sep-2022 0.000000 1.705000
Related
In Python, how can I reference the previous row and calculate something against it? Specifically, I am working with dataframes in pandas. I have a data frame full of stock price information that looks like this:
Date Close Adj Close
251 2011-01-03 147.48 143.25
250 2011-01-04 147.64 143.41
249 2011-01-05 147.05 142.83
248 2011-01-06 148.66 144.40
247 2011-01-07 147.93 143.69
Here is how I created this dataframe:
import pandas
url = 'http://ichart.finance.yahoo.com/table.csv?s=IBM&a=00&b=1&c=2011&d=11&e=31&f=2011&g=d&ignore=.csv'
data = pandas.read_csv(url)
## now I sorted the data frame ascending by date
data = data.sort_values(by='Date')
Starting with row number 2, or in this case, I guess it's 250 (PS - is that the index?), I want to calculate the difference between 2011-01-03 and 2011-01-04, for every entry in this dataframe. I believe the appropriate way is to write a function that takes the current row, figures out the previous row, and calculates the difference between them, then use the pandas apply function to update the dataframe with the value.
Is that the right approach? If so, should I be using the index to determine the difference? (note - I'm still in python beginner mode, so index may not be the right term, nor even the correct way to implement this)
I think you want to do something like this:
In [26]: data
Out[26]:
Date Close Adj Close
251 2011-01-03 147.48 143.25
250 2011-01-04 147.64 143.41
249 2011-01-05 147.05 142.83
248 2011-01-06 148.66 144.40
247 2011-01-07 147.93 143.69
In [27]: data.set_index('Date').diff()
Out[27]:
Close Adj Close
Date
2011-01-03 NaN NaN
2011-01-04 0.16 0.16
2011-01-05 -0.59 -0.58
2011-01-06 1.61 1.57
2011-01-07 -0.73 -0.71
To calculate the difference of one column, here is what you can do.
df=
A B
0 10 56
1 45 48
2 26 48
3 32 65
We want to compute the row difference in A only and keep only the rows where that difference is less than 15.
df['A_dif'] = df['A'].diff()
df=
    A   B  A_dif
0  10  56    NaN
1  45  48   35.0
2  26  48  -19.0
3  32  65    6.0
df = df[df['A_dif']<15]
df=
    A   B  A_dif
2  26  48  -19.0
3  32  65    6.0
I don't know pandas, and I'm pretty sure it has something specific for this; however, I'll give you the pure-Python solution, that might be of some help even if you need to use pandas:
import csv
import urllib

# This basically retrieves the CSV file and loads it in a list, converting
# all numeric values to floats
url = 'http://ichart.finance.yahoo.com/table.csv?s=IBM&a=00&b=1&c=2011&d=11&e=31&f=2011&g=d&ignore=.csv'
reader = csv.reader(urllib.urlopen(url), delimiter=',')

# We sort the output list so the records are ordered by date
cleaned = sorted([[r[0]] + map(float, r[1:]) for r in list(reader)[1:]])

for i, row in enumerate(cleaned):  # enumerate() yields two-tuples: (<id>, <item>)
    # The try..except here is to skip the IndexError for line 0
    try:
        # This will calculate the difference of each numeric field with the
        # same field in the row before this one
        print row[0], [(row[j] - cleaned[i-1][j]) for j in range(1, 7)]
    except IndexError:
        pass
I have two dataframes with particular data that I'm needing to merge.
Date Greenland Antarctica
0 2002.29 0.00 0.00
1 2002.35 68.72 19.01
2 2002.62 -219.32 -59.36
3 2002.71 -242.83 46.55
4 2002.79 -209.12 63.31
.. ... ... ...
189 2020.79 -4928.78 -2542.18
190 2020.87 -4922.47 -2593.06
191 2020.96 -4899.53 -2751.98
192 2021.04 -4838.44 -3070.67
193 2021.12 -4900.56 -2755.94
[194 rows x 3 columns]
and
Date Mean Sea Level
0 1993.011526 -38.75
1 1993.038692 -39.77
2 1993.065858 -39.61
3 1993.093025 -39.64
4 1993.120191 -38.72
... ... ...
1021 2020.756822 62.83
1022 2020.783914 62.93
1023 2020.811006 62.98
1024 2020.838098 63.00
1025 2020.865190 63.00
[1026 rows x 2 columns]
My ultimate goal is to pull out the data from the second dataframe (the Mean Sea Level column) that comes from (roughly) the same time frame as the dates in the first dataframe, and then merge that back in with the first dataframe.
However, the only ways that I can think of for selecting out certain dates involve first converting all of the dates in the Date columns of both dataframes to something pandas recognizes, but I have been unable to figure out how to do that. I figured out some code (below) that can convert individual dates to a more common date format, but it has been difficult to successfully apply it to all of the dates in a dataframe. Also, I'm not sure I can then get pandas to convert that to a date format that pandas recognizes.
from datetime import datetime

def fraction2datetime(year_fraction: float) -> datetime:
    year = int(year_fraction)
    fraction = year_fraction - year
    first = datetime(year, 1, 1)
    aux = datetime(year + 1, 1, 1)
    return first + (aux - first) * fraction
I also looked at pandas.to_datetime but I don't see a way to have it read the format the dates are initially in.
So does anyone have any guidance on this? Firstly with the conversion of dates, but also with the task of picking out the dates from the second dataframe if possible. Any help would be greatly appreciated.
Suppose you have these 2 dataframes:
df1:
Date Greenland Antarctica
0 2020.79 -4928.78 -2542.18
1 2020.87 -4922.47 -2593.06
2 2020.96 -4899.53 -2751.98
3 2021.04 -4838.44 -3070.67
4 2021.12 -4900.56 -2755.94
df2:
Date Mean Sea Level
0 2020.756822 62.83
1 2020.783914 62.93
2 2020.811006 62.98
3 2020.838098 63.00
4 2020.865190 63.00
To convert the dates:
from datetime import datetime

def fraction2datetime(year_fraction: float) -> datetime:
    year = int(year_fraction)
    fraction = year_fraction - year
    first = datetime(year, 1, 1)
    aux = datetime(year + 1, 1, 1)
    return first + (aux - first) * fraction

df1["Date"] = df1["Date"].apply(fraction2datetime)
df2["Date"] = df2["Date"].apply(fraction2datetime)
print(df1)
print(df2)
Prints:
Date Greenland Antarctica
0 2020-10-16 03:21:35.999999 -4928.78 -2542.18
1 2020-11-14 10:04:47.999997 -4922.47 -2593.06
2 2020-12-17 08:38:24.000001 -4899.53 -2751.98
3 2021-01-15 14:23:59.999999 -4838.44 -3070.67
4 2021-02-13 19:11:59.999997 -4900.56 -2755.94
Date Mean Sea Level
0 2020-10-03 23:55:28.012795 62.83
1 2020-10-13 21:54:02.073603 62.93
2 2020-10-23 19:52:36.134397 62.98
3 2020-11-02 17:51:10.195198 63.00
4 2020-11-12 15:49:44.255992 63.00
For the join, you can use pd.merge_asof. For example, this will join on the nearest date within a 30-day tolerance (you can tweak these values as you want):
x = pd.merge_asof(
    df1, df2, on="Date", tolerance=pd.Timedelta(days=30), direction="nearest"
)
print(x)
Will print:
Date Greenland Antarctica Mean Sea Level
0 2020-10-16 03:21:35.999999 -4928.78 -2542.18 62.93
1 2020-11-14 10:04:47.999997 -4922.47 -2593.06 63.00
2 2020-12-17 08:38:24.000001 -4899.53 -2751.98 NaN
3 2021-01-15 14:23:59.999999 -4838.44 -3070.67 NaN
4 2021-02-13 19:11:59.999997 -4900.56 -2755.94 NaN
You can specify a timestamp format in to_datetime(); however, there is no format code for a fractional year like this, so if you need to use a custom function you can use apply(). If performance is a concern, be aware that apply() does not perform as well as built-in vectorized pandas methods.
To combine the DataFrames you can use an outer join on the date column.
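As a rough sketch of that vectorized route (not part of the original answer; the df1/df2 names and the month-level join key are assumptions), the fractional-year conversion can be done on whole columns and the frames combined with an outer join:

import pandas as pd

def fraction_to_datetime(frac: pd.Series) -> pd.Series:
    # Vectorized equivalent of the fraction2datetime helper above
    year = frac.astype(int)
    start = pd.to_datetime(year.astype(str), format="%Y")
    end = pd.to_datetime((year + 1).astype(str), format="%Y")
    return start + (end - start) * (frac - year)

df1["Date"] = fraction_to_datetime(df1["Date"])
df2["Date"] = fraction_to_datetime(df2["Date"])

# Outer join on the calendar month so rows from roughly the same time frame
# line up; the month granularity is an assumption, adjust to taste.
df1["Month"] = df1["Date"].dt.to_period("M")
df2["Month"] = df2["Date"].dt.to_period("M")
merged = df1.merge(df2, on="Month", how="outer", suffixes=("_ice", "_sea"))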
I have a dataframe as follows:
Date Group Value Duration
2018-01-01 A 20 30
2018-02-01 A 10 60
2018-01-01 B 15 180
2018-02-01 B 30 210
2018-03-01 B 25 238
2018-01-01 C 10 235
2018-02-01 C 15 130
I want to use groupby dynamically, i.e. I do not wish to type the column names on which groupby would be applied. Specifically, I want to compute the mean of each Group for the last two months.
As we can see that not each Group's data is present in the above dataframe for all dates. So the tasks are as follows:
Add a dummy row based on the date, in case data pertaining to Date = 2018-03-01 is not present for a Group (e.g. add rows for A and C).
Perform groupby to compute the mean using the last two months' Value and Duration.
So my approach is as follows:
For Task 1:
s = pd.MultiIndex.from_product([df['Date'].unique(), df['Group'].unique()], names=['Date','Group'])
df = df.set_index(['Date','Group']).reindex(s).reset_index().sort_values(['Group','Date']).ffill(axis=0)
Can we have a better method for achieving the 'add row' task? The reference is found here.
For Task 2:
def cond_grp_by(df, grp_by: str, cols_list: list, *args):
    df_grp = df.groupby(grp_by)[cols_list].transform(lambda x: x.tail(2).mean())
    return df_grp
df_cols = df.columns.tolist()
df = cond_grp_by(dealer_f_filt,'Group',df_cols)
Reference of the above approach is found here.
The above code throws IndexError: Column(s) ['index','Group','Date','Value','Duration'] already selected.
The expected output is:
Group  Value  Duration
A       10      60     <--- since a row is added for 2018-03-01 with the same
B       27.5   224          value as 2018-02-01, we are computing the mean
C       15     130     <--- for the last two values
Use GroupBy.agg instead of transform if you need the output filled with aggregate values:
def cond_grp_by(df, grp_by: str, cols_list: list, *args):
    return df.groupby(grp_by)[cols_list].agg(lambda x: x.tail(2).mean()).reset_index()
df = cond_grp_by(df,'Group',df_cols)
print (df)
Group Value Duration
0 A 10.0 60.0
1 B 27.5 224.0
2 C 15.0 130.0
If you need the last value per group, use GroupBy.last:
def cond_grp_by(df, grp_by: str, cols_list: list, *args):
    return df.groupby(grp_by)[cols_list].last().reset_index()
df = cond_grp_by(df,'Group',df_cols)
I have a dataframe and I am struggling to create a column based on other columns. I will share the problem using some sample data.
Date Target1 Close
0 2018-05-25 198.0090 188.580002
1 2018-05-25 197.6835 188.580002
2 2018-05-25 198.0090 188.580002
3 2018-05-29 196.6230 187.899994
4 2018-05-29 196.9800 187.899994
5 2018-05-30 197.1375 187.500000
6 2018-05-30 196.6965 187.500000
7 2018-05-30 196.8750 187.500000
8 2018-05-31 196.2135 186.869995
9 2018-05-31 196.2135 186.869995
10 2018-05-31 196.5600 186.869995
11 2018-05-31 196.7700 186.869995
12 2018-05-31 196.9275 186.869995
13 2018-05-31 196.2135 186.869995
14 2018-05-31 196.2135 186.869995
15 2018-06-01 197.2845 190.240005
16 2018-06-01 197.2845 190.240005
17 2018-06-04 201.2325 191.830002
18 2018-06-04 201.4740 191.830002
I want to create another column for each observation (called days_to_hit_target, for example) that is the difference in days until Close hits (or crosses) the target of that specific day; that count of days goes into the days_to_hit_target column.
The idea is: suppose the close price today, on 2018-05-25, is 188.58 and the target is 198.0090. I want to find the date on which Close reaches that target, which happens somewhere later, around 2018-06-04, where Close has reached the target of the first observation. That number of days will be fed into the first observation of the days_to_hit_target column.
Use a combination of loc and at to find the date at which the target is hit, then subtract the dates.
df['TargetDate'] = 'NA'
for i, row in df.iterrows():
    t = row['Target1']
    d = row['Date']
    targdf = df.loc[df['Close'] >= t]
    if len(targdf) > 0:
        targdt = targdf.at[0, 'Date']
        df.at[i, 'TargetDate'] = targdt
    else:
        df.at[i, 'TargetDate'] = '0'

df['Diff'] = df['Date'].sub(df['TargetDate'], axis=0)
import pandas as pd

csv = pd.read_csv(
    'sample.csv',
    parse_dates=['Date']
)
csv.sort_values('Date', inplace=True)

def find_closest(row):
    target = row['Target1']
    date = row['Date']
    matches = csv[
        (csv['Close'] >= target) &
        (csv['Date'] > date)
    ]
    closest_date = matches['Date'].iloc[0] if not matches.empty else None
    row['days to hit target'] = (closest_date - date).days if closest_date else None
    return row

final = csv.apply(find_closest, axis=1)
It's a bit hard to test because none of the targets appear in Close. But the idea is simple: subset your original frame so that Date is after the current row's date and Close is greater than or equal to Target1, and take the first entry (this is after you've sorted it using df.sort_values).
If the subset is empty, use None. Otherwise, use the Date. Days to hit target is pretty simple at that point.
I have a data frame that consists of time series data with 15-second intervals:
date_time value
2012-12-28 11:11:00 103.2
2012-12-28 11:11:15 103.1
2012-12-28 11:11:30 103.4
2012-12-28 11:11:45 103.5
2012-12-28 11:12:00 103.3
The data spans many years. I would like to group by both year and time to look at the distribution of time-of-day effect over many years. For example, I may want to compute the mean and standard deviation of every 15-second interval across days, and look at how the means and standard deviations change from 2010, 2011, 2012, etc. I naively tried data.groupby(lambda x: [x.year, x.time]) but it didn't work. How can I do such grouping?
In case date_time is not your index, a date_time-indexed DataFrame could be created with:
dfts = df.set_index('date_time')
From there you can group by intervals using
dfts.groupby(lambda x : x.month).mean()
to see mean values for each month. Similarly, you can do
dfts.groupby(lambda x : x.year).std()
for standard deviations across the years.
If I understood the example task you would like to achieve, you could simply split the data into years using xs, group them, concatenate the results, and store this in a new DataFrame.
years = range(2012, 2015)
yearly_month_stats = [dfts.xs(str(year)).groupby(lambda x : x.month).mean() for year in years]
df2 = pd.concat(yearly_month_stats, axis=1, keys = years)
From which you get something like
2012 2013 2014
value value value
1 NaN 5.324165 15.747767
2 NaN -23.193429 9.193217
3 NaN -14.144287 23.896030
4 NaN -21.877975 16.310195
5 NaN -3.079910 -6.093905
6 NaN -2.106847 -23.253183
7 NaN 10.644636 6.542562
8 NaN -9.763087 14.335956
9 NaN -3.529646 2.607973
10 NaN -18.633832 0.083575
11 NaN 10.297902 14.059286
12 33.95442 13.692435 22.293245
You were close:
data.groupby([lambda x: x.year, lambda x: x.time])
Also be sure to set date_time as the index, as in kermit666's answer.
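For example, here is a minimal sketch with made-up values (column names assumed from the question), written against the index attributes so the year/time grouping and the mean/std aggregation are explicit:

import pandas as pd

df = pd.DataFrame({
    "date_time": pd.to_datetime([
        "2011-06-01 11:11:00", "2011-06-01 11:11:15",
        "2011-06-02 11:11:00", "2011-06-02 11:11:15",
        "2012-06-01 11:11:00", "2012-06-01 11:11:15",
        "2012-06-02 11:11:00", "2012-06-02 11:11:15",
    ]),
    "value": [103.2, 103.1, 103.4, 103.5, 104.0, 104.2, 103.9, 104.1],
})
dfts = df.set_index("date_time")

# Group by calendar year and time of day, then aggregate each bucket
stats = dfts.groupby([dfts.index.year, dfts.index.time])["value"].agg(["mean", "std"])
print(stats)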