Implement MSSQL's PARTITION BY windowed clause in Pandas - python

I'm in the process of migrating a MSSQL database to MySQL and have decided to move some stored procedures to Python rather than rewrite them in MySQL. I am using Pandas 0.23 on Python 3.5.4.
The old MSSQL database uses a number of windowed functions. So far I've had success converting them with pandas.DataFrame.rolling, as follows:
MSSQL
AVG([Close]) OVER (ORDER BY DateValue ROWS 13 PRECEDING) AS MA14
Python
df['MA14'] = df.Close.rolling(14).mean()
I'm stuck on a solution for the PARTITION BY part of the MSSQL windowed function in Python. Based on feedback since posting, I am working on a solution with pandas groupby...
https://pandas.pydata.org/pandas-docs/version/0.23.0/groupby.html
For Example let's say MSSQL is:
AVG([Close]) OVER (PARTITION BY myCol ORDER BY DateValue ROWS 13 PRECEDING) AS MA14
What I have worked out so far:
Col1 contains my categorical data, which I wish to group by and apply a function to on a rolling basis. There is also a date column, so Col1 and the date column together represent a unique record in the df.
1. Delivers the mean per Col1 group, albeit fully aggregated (one row per group):
grouped = df.groupby(['Col1']).mean()
print(grouped.tail(20))
2. Appears to apply the rolling mean per categorical group of Col1, which is what I am after:
grouped = df.groupby(['Col1']).Close.rolling(14).mean()
print(grouped.tail(20))
3. Assign to df as a new column RM:
df['RM'] = df.groupby(['Col1']).Close.rolling(14).mean()
print(df.tail(20))
It doesn't like this step, and I get the error...
TypeError: incompatible index of inserted column with frame index
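A note on the error: the grouped rolling result carries a (Col1, original index) MultiIndex, which pandas cannot align with df's plain index on assignment. A sketch of one direct fix, dropping the group level before assigning (the transform approach in the answer below is an alternative):
# Drop the Col1 level so the result's index matches df's plain index again
df['RM'] = df.groupby('Col1').Close.rolling(14).mean().reset_index(level=0, drop=True)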
I've worked up a simple example which may help. How do I get the results of #2 into the df as in #1, or similar?
import numpy as np
import pandas as pd
dta = {'Colour': ['Red','Red','Blue','Blue','Red','Red','Blue','Red','Blue','Blue','Blue','Red'],
       'Year': [2014,2015,2014,2015,2016,2017,2018,2018,2016,2017,2013,2013],
       'Val': [87,78,863,673,74,81,756,78,694,701,804,69]}
df = pd.DataFrame(dta)
df = df.sort_values(by=['Colour','Year'], ascending=True)
print(df)
#1 Add a calculated column to the df. This averages column Val across all rows, ignoring Colour
df['ValMA3'] = df.Val.rolling(3).mean().round(0)
print (df)
#2 Group by Colour. This calculates the average by group correctly.
# But where are the other columns from my original dataframe?
# And what if I have multiple calculated columns to add?
gf = df.groupby(['Colour'])
gf = gf.Val.rolling(3).mean().round(0)
print(gf)

I am pretty sure the transform function can help.
df.groupby('Col1')['Val'].transform(lambda x: x.rolling(3, 2).mean())
where the value 3 is the size of the rolling window, and 2 is the minimum number of periods.
(Just don't forget to sort your data frame before applying the rolling calculation.)
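Applied to the simple example above, a sketch of assigning the result back as a column:
import pandas as pd
dta = {'Colour': ['Red','Red','Blue','Blue','Red','Red','Blue','Red','Blue','Blue','Blue','Red'],
       'Year': [2014,2015,2014,2015,2016,2017,2018,2018,2016,2017,2013,2013],
       'Val': [87,78,863,673,74,81,756,78,694,701,804,69]}
df = pd.DataFrame(dta)
# Sort first so each group's rolling window runs in chronological order
df = df.sort_values(by=['Colour','Year'])
# transform returns a result aligned to df's original index, so plain assignment works
df['ValMA3'] = df.groupby('Colour')['Val'].transform(lambda x: x.rolling(3, 2).mean()).round(0)
print(df)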

Related

Pulling columns of dataframe into separate dataframe, then replacing duplicates with mean values

I'm new to the world of Python, so I apologize in advance if this question seems rudimentary. I'm trying to pull columns of one dataframe into a separate dataframe, replacing any duplicate columns from the first dataframe with a single column of mean values in the second dataframe. I hope this makes sense!
To provide some background, I am tracking gene expression over certain time points. I have a dataframe that is 17 rows x 33 columns. Every row in this dataframe corresponds to a particular exon, and every column corresponds to a time point (age).
Some of these columns contain the same name (age) and I'd like to calculate the mean of ONLY the columns with the same name, so that, for example, I get one column for "12 pcw" rather than three separate columns for "12 pcw." After which I hope to pull these values from the first dataframe into a second dataframe for averaged values.
I'm hoping to use a for loop to loop through each age (column) to get the average expression across the subjects.
I will explain my process so far below:
#1) Get list of UNIQUE string names from the age list
unique_ages = set(column_names)

#2) Create an empty dataframe that gives an outline of what I want my averaged data to fit in
mean_df = pd.DataFrame(index=exons, columns=unique_ages)

#3) Now I want to loop through each age to get the average expression across the donors
# present. This is where I'm trying to utilize a for loop to create a pipeline to process
# other data frames that I will be working with in the future.
for age in unique_ages:
    print(age)
    age_df = df.loc[:, df.columns == age]  # pull columns of df that have this string
    if len(age_df.columns) > 1:  # if df has >1 SAME column, take the avg across them
        mean_df[age] = age_df.mean(axis=1)
    else:
        pass  # just pull out the values and put them into your temp_df
#4) Now, with my new averaged array (or the same array if multiple ages are NOT present), I want to place this array into my 'temp_df' under the appropriate column. I understand that I should use the 'age' variable provided by the for loop to get the proper location/name of the column in my temp df, but I'm not sure how to do this. This has all been quite a steep learning curve, and I feel like it's a simple solution that I can't seem to wrap my head around. Any help would be greatly appreciated.
There is no need for a for loop (there often isn't with Pandas :)). You can simply use df.groupby(lambda x:x, axis=1).mean(). An example:
data = [[1,2,3],[4,5,6]]
cols = ['col1', 'col2', 'col2']
df = pd.DataFrame(data=data, columns=cols)
# col1 col2 col2
# 0 1 2 3
# 1 4 5 6
df = df.groupby(lambda x:x, axis=1).mean()
# col1 col2
# 0 1.0 2.5
# 1 4.0 5.5
The groupby function takes another function (the lambda), which means that each column name is passed in and the function returns the group that column belongs to. In our case, we just want the column name itself to be the group. So, on the third column named col2, it says 'this column belongs to the group named col2', which already exists (because the second column was passed earlier). You then provide the aggregation you want, in this case mean().
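A minimal sketch applying the same idea to a frame shaped like the question's (the exon labels and '12 pcw'/'13 pcw' column names are assumptions based on the description):
import pandas as pd
# Three donor columns for '12 pcw' plus one for '13 pcw'
df = pd.DataFrame([[1.0, 2.0, 3.0, 9.0],
                   [4.0, 5.0, 6.0, 8.0]],
                  index=['exon1', 'exon2'],
                  columns=['12 pcw', '12 pcw', '12 pcw', '13 pcw'])
# Group columns by their own name and average the duplicates
mean_df = df.groupby(lambda c: c, axis=1).mean()
print(mean_df)
#        12 pcw  13 pcw
# exon1     2.0     9.0
# exon2     5.0     8.0
(Note that axis=1 grouping is deprecated in recent pandas; df.T.groupby(level=0).mean().T does the same column-wise averaging there.)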

how to create a dataframe using groupby such that the grouping criteria are contained in the data

I wanted to create a 2D dataframe about coronavirus such that it contains one column of countries and another with the number of deaths. The CSV file I am using is date-oriented, so for some days the number of deaths is 0, so I decided to group the rows by Country and sum them up. Yet it returned a dataframe with only 1 column, but when I write it to a CSV file it creates 2 columns.
here is my code:
import pandas as pd
#import matplotlib.pyplot as plt
covid_data = pd.read_csv('countries-aggregated.csv')
bar_data = pd.DataFrame(covid_data.groupby('Country')['Deaths'].sum())
Difficult to give you a perfect answer without the dataset; however, groupby will set your key as the index, and selecting the 'Deaths' column thus returns a Series (your pd.DataFrame(...) wrapper then has Country as the index, not a column). You can pass as_index=False:
bar_data = covid_data.groupby('Country', as_index=False)['Deaths'].sum()
Or, if you have only one column in the DataFrame to aggregate:
bar_data = covid_data.groupby('Country', as_index=False).sum()
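A toy illustration, since the real CSV isn't shown in the question:
import pandas as pd
covid_data = pd.DataFrame({'Country': ['A', 'A', 'B', 'B'],
                           'Deaths': [0, 3, 2, 5]})
# as_index=False keeps Country as a regular column, giving a two-column result
bar_data = covid_data.groupby('Country', as_index=False)['Deaths'].sum()
print(bar_data)
#   Country  Deaths
# 0       A       3
# 1       B       7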

Python Pandas GroupBy Max Date

I have a very simple dataframe with columns: Index, Person, Item, Date. There are only 4 people, 3 items, and random dates. All person/item/date combinations are unique. I am trying to get a simple pivot-table-like df to print using:
import pandas as pd
mydf = pd.read_csv("Test_Data.csv",index_col=[0])
mydf = mydf.sort_values(by=['Date','Item','Person'], ascending=False)
print(mydf.groupby(['Person','Item'])['Date'].max())
however, I noticed that while the structure is what I want, the data is not. It is not returning the max date for each Person/Item combination. I thought sorting first would help, but it did not. Do I need to create a temp df first and then join to do what I'm trying to do?
Also to be clear, there are 28 rows of data (all test data) with some People/Items being repeated but with different dates. Index is just 0 through 27.
Figured it out! Should have made sure the Date field was actually recognized as a date:
mydf['Date'] = pd.to_datetime(mydf['Date'])
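Putting it together, a sketch of the corrected flow (file and column names as in the question):
import pandas as pd
mydf = pd.read_csv("Test_Data.csv", index_col=[0])
# Parse Date so max() compares dates chronologically rather than as strings
mydf['Date'] = pd.to_datetime(mydf['Date'])
print(mydf.groupby(['Person', 'Item'])['Date'].max())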

Column missing after Pandas GroupBy (not the GroupBy column)

I am using the following source code:
import numpy as np
import pandas as pd
# Load data
data = pd.read_csv('C:/Users/user/Desktop/Daily_to_weekly.csv', keep_default_na=True)
print(data.shape[1])
# 18
# Create weekly data
# Aggregate by calculating the sum per store for every week
data_weekly = data.groupby(['STORE_ID', 'WEEK_NUMBER'], as_index=False).agg('sum')
print(data_weekly.shape[1])
# 17
As you can see, for some reason a column is missing after the aggregation, and this column is neither of the GroupBy columns ('STORE_ID', 'WEEK_NUMBER').
Why is this happening and how can I fix it?
I've run into this problem numerous times before. The problem is that pandas is dropping one of your columns because it has identified it as a "nuisance" column: the aggregation you are attempting cannot be applied to it (typically because the column is non-numeric). If you wish to preserve this column, I would recommend including it in the groupby.
https://pandas.pydata.org/pandas-docs/stable/user_guide/groupby.html#automatic-exclusion-of-nuisance-columns
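A minimal sketch of the behavior with hypothetical column names (mean() is used since a string column clearly can't be averaged; older pandas silently drops such nuisance columns, while pandas 2.x raises an error instead):
import pandas as pd
data = pd.DataFrame({'STORE_ID': [1, 1, 2],
                     'WEEK_NUMBER': [1, 1, 1],
                     'REGION': ['North', 'North', 'South'],  # non-numeric column
                     'SALES': [10, 20, 30]})
# In older pandas, REGION silently disappears from the result as a "nuisance" column
print(data.groupby(['STORE_ID', 'WEEK_NUMBER'], as_index=False).mean())
# Including REGION in the groupby keys preserves it
print(data.groupby(['STORE_ID', 'WEEK_NUMBER', 'REGION'], as_index=False).mean())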

computing rolling averages by integer days in pandas

I have taken some data from a csv and put it into a dataframe:
from pandas import read_csv
df = read_csv('C:\...', delimiter=',', encoding='utf-8')
df2 = df.groupby(['i-j','day'])['i-j'].agg({'count'})
I would like to calculate for each 'i-j' the seven day moving average of their count. First I think I need to add the days with zero count to the table. Is there an easy way to do this by modifying my code above? In other words I would like missing values to count as 0.
Then I would need to add another column to the dataframe that calculates the average of count for each i-j over the previous seven days. Do I need to convert the days to something that pandas recognizes as a date value in order to use some of the rolling statistical functions? Or can I just change the type of the 'date' column and proceed?
Many thanks!
There may be a better way to do this, but given your starting DataFrame of df2 the following should work.
First reindex df2 to fill in the missing days with zeros:
new_index = pd.MultiIndex.from_product([df2.index.get_level_values(0).unique(), range(31)])
df2 = df2.reindex(new_index, fill_value=0)
(I'm assuming you want 31 days, but you can change this as necessary.)
Now if you unstack this reindexed DataFrame and take the transpose, you have a DataFrame where each column is an entry of i-j and contains the counts per day:
df2.unstack().T
You can calculate the rolling mean of this DataFrame (note that the old pd.rolling_mean function has since been removed from pandas; the .rolling method replaces it):
rm = df2.unstack().T.rolling(7).mean()
To finish, you can stack this frame of rolling means to get back to the shape of the original reindexed df2:
rm.T.stack(dropna=False)
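A self-contained sketch of the whole pipeline on toy data (names from the question; five days and a 3-day window just to keep the output small):
import pandas as pd
# Toy stand-in for the question's df2: a 'count' per ('i-j', 'day') pair
df2 = pd.DataFrame({'count': [5, 3, 4]},
                   index=pd.MultiIndex.from_tuples([('a', 0), ('a', 2), ('b', 1)],
                                                   names=['i-j', 'day']))
# Fill in the missing days with zero counts
new_index = pd.MultiIndex.from_product([df2.index.get_level_values(0).unique(), range(5)],
                                       names=['i-j', 'day'])
df2 = df2.reindex(new_index, fill_value=0)
# One column per i-j value, one row per day, then the rolling mean down the days
# (min_periods=1 so the first days aren't NaN)
rm = df2.unstack().T.rolling(3, min_periods=1).mean()
# Stack back to the original (i-j, day) shape
print(rm.T.stack(dropna=False))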
