Column missing after Pandas GroupBy (not the GroupBy column) - python

I am using the following source code:
import numpy as np
import pandas as pd
# Load data
data = pd.read_csv('C:/Users/user/Desktop/Daily_to_weekly.csv', keep_default_na=True)
print(data.shape[1])
# 18
# Create weekly data
# Aggregate by calculating the sum per store for every week
data_weekly = data.groupby(['STORE_ID', 'WEEK_NUMBER'], as_index=False).agg('sum')
print(data_weekly.shape[1])
# 17
As you can see, a column is missing after the aggregation, and it is neither of the GroupBy columns ('STORE_ID', 'WEEK_NUMBER').
Why is this happening and how can I fix it?

I've run into this problem numerous times before. The problem is that pandas is dropping one of your columns because it has identified it as a "nuisance" column: the aggregation you are attempting (here, 'sum') cannot be applied to it, which typically happens with non-numeric columns. If you wish to preserve this column, I would recommend including it in the groupby keys.
https://pandas.pydata.org/pandas-docs/stable/user_guide/groupby.html#automatic-exclusion-of-nuisance-columns
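For illustration, here is a minimal sketch with made-up data and a hypothetical non-numeric column called STORE_NAME (your dropped column will have a different name):
import pandas as pd
data = pd.DataFrame({
    'STORE_ID': [1, 1, 2, 2],
    'WEEK_NUMBER': [1, 1, 1, 1],
    'STORE_NAME': ['A', 'A', 'B', 'B'],  # hypothetical non-numeric column
    'SALES': [10, 20, 30, 40],
})
# STORE_NAME would be silently dropped as a "nuisance" column here:
# data.groupby(['STORE_ID', 'WEEK_NUMBER'], as_index=False).agg('sum')
# Including it in the groupby keys preserves it:
data_weekly = data.groupby(['STORE_ID', 'WEEK_NUMBER', 'STORE_NAME'], as_index=False).agg('sum')
print(data_weekly.shape[1])  # all four columns survive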

Related

How to use apply function here?

import numpy as np
import pandas as pd
PATH = r'C:\Users\ADMIN\Desktop\Net_Present_value.csv'
data1 = pd.read_csv(PATH)
def calc_equity(assets, liabilities):
    return liabilities - assets
data1.apply(calc_equity)
It's giving me an error stating:
calc_equity() missing 1 required positional argument: 'liabilities'
Please help me understand how I can resolve this.
I'm assuming your data has two columns ['assets', 'liabilities'] and you want to calculate the equity as a third column. You don't need the apply function here. You can calculate it as the difference of the two columns:
data1['equity'] = calc_equity(data1['assets'], data1['liabilities'])
This creates a new column 'equity' in your DataFrame.
If you insist on applying a function to the DataFrame, the function in question needs to accept a single argument that is either a column or a row of the DataFrame. In your case you want to take the difference of two values in the same row, so the function to apply needs to take a row as an argument:
def calc_equity(row):
    return row['liabilities'] - row['assets']

data1['equity'] = data1.apply(calc_equity, axis=1)
axis=1 tells apply to pass each row to the function. Inside the function you can access the row's values by column name. Bear in mind that this is slower than the first approach, as it iterates over every row instead of operating on whole columns as NumPy arrays.
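A small self-contained sketch of both approaches, with made-up numbers since the CSV isn't available:
import pandas as pd
data1 = pd.DataFrame({'assets': [100, 250, 80], 'liabilities': [140, 200, 95]})
# Vectorized: operates on whole columns at once
data1['equity'] = data1['liabilities'] - data1['assets']
# Row-wise apply: slower, but shows how axis=1 works
data1['equity_apply'] = data1.apply(lambda row: row['liabilities'] - row['assets'], axis=1)
print(data1)  # both new columns contain 40, -50, 15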

How to use pandas dataframe to add a column to a dataframe that labels data as 1 or 0 based on matching columns in another df

I'm working on labeling some Medicare datasets for a machine learning algorithm as fraudulent or non-fraudulent using Pandas dataframes. The labeling involves matching the NPI numbers in the DMEPOS dataset to the NPI numbers in the LEIE dataset. Each dataset includes a column named "NPI". I need to be able to find out if each row in the DMEPOS dataframe has a matching NPI in the LEIE dataset. Next, I need to add a column to the DMEPOS dataset (maybe named "Fraudulent") that denotes whether or not that row is fraudulent, using 1 as fraudulent and 0 as not fraudulent.
Here is the code that I have written (it isn't much, but it should give the general direction I'm taking with Pandas):
import pandas as pd
import numpy as np
#Read files into df
dmepos = pd.read_csv('dmpoes.csv')
leie = pd.read_csv('leie.csv')
Here are links for downloading the datasets (the NPI columns are labeled differently in each dataset, so I went in and changed them so that the column names matched; I suggest doing that too). I also renamed the files to make them simpler to code with:
DMPOES: https://www.cms.gov/Research-Statistics-Data-and-Systems/Statistics-Trends-and-Reports/Medicare-Provider-Charge-Data/DME2018
LEIE: https://oig.hhs.gov/exclusions/exclusions_list.asp
You can use merge. It's actually cleaner IMO if you don't rename the columns, because otherwise you'll have to deal with suffixes after the merge. Once you merge, you can use np.where to fill the Fraudulent column based on the presence of NaN values where the two merge columns didn't have a match. Not totally sure that is the logic you wanted for the Fraudulent column, but if not, post a comment and I will update as needed.
import pandas as pd
import numpy as np
#Read files into df
dmepos = pd.read_csv('dmpoes.csv')
leie = pd.read_csv('leie.csv')
df_m = dmepos.merge(leie, left_on='REFERRING_NPI', right_on='NPI', how='left')
df_m['Fraudulent'] = np.where(df_m['NPI'].isnull(), 1, 0)
Here we can see which rows didn't have a match in the join columns, as they contain NaN values in the columns brought in from leie.
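A toy-data sketch of the same pattern (made-up NPI values; column names taken from the snippet above):
import pandas as pd
import numpy as np
dmepos = pd.DataFrame({'REFERRING_NPI': [111, 222, 333]})
leie = pd.DataFrame({'NPI': [222]})
# Left merge keeps every dmepos row; unmatched rows get NaN in the 'NPI' column
df_m = dmepos.merge(leie, left_on='REFERRING_NPI', right_on='NPI', how='left')
# Flag rows based on whether a match was found; swap isnull() for notnull()
# if a match in LEIE is what should be labeled 1
df_m['Fraudulent'] = np.where(df_m['NPI'].isnull(), 1, 0)
print(df_m)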

How to select columns in python based on datatype

I'm trying to organize the columns in the dataframe based on datatype. I thought I'd do this by using pandas .loc to isolate the columns of each datatype and then append them to each other to get one large, organized dataset:
import numpy as np
import pandas as pd
control = pd.read_csv(loan_path, chunksize=1000)
control = pd.concat(control, ignore_index=True)
int_columns= control.loc[:, control.dtypes==int]
I expect a new dataset with every row and only the columns that have integer datatypes. Instead I get the index of every row but 0 columns.
I know there are columns with integer datatypes. I've also tried looking for categories and floats and always get the same wrong result
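A minimal sketch of what may help, using made-up loan columns: select_dtypes is the idiomatic way to pick columns by dtype, and one common reason for getting zero integer columns is that read_csv loads whole-number columns as float64 whenever they contain missing values:
import pandas as pd
# Stand-in for the loan data (made-up values)
control = pd.DataFrame({
    'loan_id': [1, 2, 3],              # int64
    'amount': [1000.0, 2500.0, None],  # float64 (missing values force float)
    'grade': ['A', 'B', 'A'],          # object
})
int_columns = control.select_dtypes(include=['int64'])
float_columns = control.select_dtypes(include=['float64'])
numeric_columns = control.select_dtypes(include=['number'])  # ints and floats together
# If the integer selection comes back empty, inspect the actual dtypes
print(control.dtypes)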

Implement MSSQL's partition by windowed clause in Pandas

I'm in the process of moving a MSSQL database to MySQL and have decided to move some stored procedures to Python rather than rewrite them in MySQL. I am using Pandas 0.23 on Python 3.5.4.
The old MSSQL database uses a number of windowed functions. So far I've had success converting them with pandas.DataFrame.rolling, as follows:
MSSQL
AVG([Close]) OVER (ORDER BY DateValue ROWS 13 PRECEDING) AS MA14
Python
df['MA14'] = df.Close.rolling(14).mean()
I'm stuck working on a solution for the PARTITION BY part of the MSSQL windowed function in Python. I am working on a solution with pandas groupby based on feedback since posting...
https://pandas.pydata.org/pandas-docs/version/0.23.0/groupby.html
For Example let's say MSSQL is:
AVG([Close]) OVER (PARTITION BY myCol ORDER BY DateValue ROWS 13 PRECEDING) AS MA14
What I have worked out so far:
Col1 contains my categorical data, which I wish to group by and apply the function to on a rolling basis. There is also a date column, so Col1 and the date column together represent a unique record in the df.
1. Delivers the mean for Col1 albeit aggregated
grouped = df.groupby(['Col1']).mean()
print(grouped.tail(20))
2. Appears to be applying the rolling mean per categorical group of Col1, which is what I am after:
grouped = df.groupby(['Col1']).Close.rolling(14).mean()
print(grouped.tail(20))
3. Assign to df as a new column, RM:
df['RM'] = df.groupby(['Col1']).Close.rolling(14).mean()
print(df.tail(20))
It doesn't like this step, and I get the error...
TypeError: incompatible index of inserted column with frame index
I've worked up a simple example below which may help. How do I get the results of #2 into the df from #1, or similar?
import numpy as np
import pandas as pd
dta = {'Colour': ['Red','Red','Blue','Blue','Red','Red','Blue','Red','Blue','Blue','Blue','Red'],
       'Year': [2014,2015,2014,2015,2016,2017,2018,2018,2016,2017,2013,2013],
       'Val': [87,78,863,673,74,81,756,78,694,701,804,69]}
df = pd.DataFrame(dta)
df = df.sort_values(by=['Colour','Year'], ascending=True)
print(df)
#1 add calculated columns to the df. This averages all of column Val
df['ValMA3'] = df.Val.rolling(3).mean().round(0)
print (df)
#2 Group by Colour. This is calculating average by groups correctly.
# where are the other columns from my original dataframe?
#what if I have multiple calculated columns to add?
gf = df.groupby(['Colour'])
gf = gf.Val.rolling(3).mean().round(0)
print(gf)
I am pretty sure the transform function can help:
df.groupby('Col1')['Val'].transform(lambda x: x.rolling(3, 2).mean())
where the value 3 is the size of the rolling window and 2 is the minimum number of periods.
(Just don't forget to sort your data frame before applying the rolling calculation.)
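Applied to the example DataFrame above, this would look roughly as follows (a sketch using the MWE's Colour and Val columns; because transform returns a Series aligned to df's index, the assignment that failed in step 3 works here):
import pandas as pd
dta = {'Colour': ['Red','Red','Blue','Blue','Red','Red','Blue','Red','Blue','Blue','Blue','Red'],
       'Year': [2014,2015,2014,2015,2016,2017,2018,2018,2016,2017,2013,2013],
       'Val': [87,78,863,673,74,81,756,78,694,701,804,69]}
df = pd.DataFrame(dta)
df = df.sort_values(by=['Colour','Year'], ascending=True)
# transform keeps the original index, so the result can be assigned straight back
df['ValMA3'] = df.groupby('Colour')['Val'].transform(lambda x: x.rolling(3, 2).mean()).round(0)
print(df)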

Manipulate A Group Column in Pandas

I have a data set with columns Dist, Class, and Count.
I want to group that data set by Dist and divide the Count column of each group by the sum of the counts for that group (normalize it to one).
The following MWE demonstrates my approach thus far. But I wonder: is there a more compact/pandaific way of writing this?
import pandas as pd
import numpy as np
a = np.random.randint(0,4,(10,3))
s = pd.DataFrame(a,columns=['Dist','Class','Count'])
def manipcolumn(x):
    csum = x['Count'].sum()
    x['Count'] = x['Count'].apply(lambda x: x / csum)
    return x
s.groupby('Dist').apply(manipcolumn)
One alternative way to get the normalised 'Count' column is to use groupby and transform to get the sums for each group and then divide the 'Count' column by the returned Series. You can reassign the result back to your DataFrame:
s['Count'] = s['Count'] / s.groupby('Dist')['Count'].transform(np.sum)
This avoids the need for a bespoke Python function and the use of apply. Testing it for the small example DataFrame in your question showed that it was around 8 times faster.
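A quick sanity check on the example data (a sketch; the randint range is shifted to 1-3 here so no group total is zero):
import numpy as np
import pandas as pd
a = np.random.randint(1, 4, (10, 3))
s = pd.DataFrame(a, columns=['Dist', 'Class', 'Count'])
s['Count'] = s['Count'] / s.groupby('Dist')['Count'].transform('sum')
print(s.groupby('Dist')['Count'].sum())  # each group sums to 1.0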
