I'm in the process of converting some code from SAS to Python and need some tips on how to build a function that effectively acts as a SAS macro would: running instances of the same code for a set of variables (dataframes) I pass into the function as arguments/parameters.
In the example, I have a dataframe called country_extract. I then subset the dataframe based on the country code field, which results in multiple dataframes (australia_extract, england_extract and india_extract). I then need to apply a set of filters and sum the GDP for each of those dataframes. There will be 20 filters based on multiple conditions for each before I aggregate; in the example below I just list two simple filters to give an idea of how the code is currently structured.
How would I define a function to run step 2 for all the dataframes? Are there resources available where I can look at some working examples? Currently I get errors, I believe on the return, saying there is no data.
#1. Subset Country Dataframe into multiple dataframes
australia_filter = country_extract['country_code'] == 'aus'
australia_extract = country_extract.where(australia_filter, inplace=True)
england_filter = country_extract['country_code'] == 'eng'
england_extract = country_extract.where(england_filter, inplace=True)
india_filter = country_extract['country_code'] == 'ind'
india_extract = country_extract.where(india_filter, inplace=True)
#2. Apply filters for country type and sub-type and then aggregate GDP
def extract_filters(x):
    country_type_filter = x['country_type'].isin('CRTD')
    country_sub_type_filter = (x['country_sub_type'].isin('GLA') &
                               x['continent'].isin('Y') &
                               x['generic'].isin('Y'))
    return country_total
    country_total = [
        [1, x.loc[country_type_filter, 'GDP'].sum()],
        [2, x.loc[country_sub_type_filter, 'GDP'].sum()],
    ]
australia_gdp = extract_filters(australia_extract)
england_gdp = extract_filters(england_extract)
india_gdp = extract_filters(india_extract)
Basically I want the function to run for the 3 dataframes (england_extract,australia_extract and india_extract) and generate a separate list for each. How would I code this?
Yes, that's a very good use of Python functions. However, it looks like a good candidate for .groupby() and .agg(), which would look something like this:
country_extract.groupby(["country_code","country_sub_type"]).agg(sumGDP=('GDP','sum'))
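For example, your country_sub_type filter could be applied as a mask before the groupby, so all three countries are handled in one pass (sub_type_mask is just an illustrative name):
sub_type_mask = (country_extract['country_sub_type'].isin(['GLA']) &
                 country_extract['continent'].isin(['Y']) &
                 country_extract['generic'].isin(['Y']))
country_extract[sub_type_mask].groupby('country_code').agg(sumGDP=('GDP', 'sum'))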
Update: You could also save yourself some typing by doing something like
GDP_dict = {}
country_list = ['aus', 'eng', 'ind']
for country in country_list:
    GDP = extract_filters(country_extract[country_extract.country_code == country])
    GDP_dict[country] = GDP
You'll also need to modify your function so that country_total is defined before the return statement; otherwise nothing is returned.
Suppose I have a survey and I want to calculate the Net Promoter Score (NPS) for different cuts of respondents. My data may look something like this:
import pandas as pd
data = [[1,111,1,1,35,'F','UK','High'], [1,112,0,1,42,'F','Saudi Arabia','Low'], [1,113,1,1,17,'M','Belize','High'],[1,1234,1,1,35,'F','Saudi Arabia','High'],[2,1854,1,1,35,'M','Belize','Low'],[2,1445,1,1,35,'F','UK','Low']]
df = pd.DataFrame(data, columns = ['survey_num','id_num','nps_sum','nps_count','age','gender','country','income_level'])
df
I want to be able to write a function that cycles through this data and does the following each time:
col_list = ['survey_num','nps_sum','nps_count']
df_customname = df[col_list]
df_customname = df_customname.groupby('survey_num').sum()
df_customname['nps_customname'] = (df_customname['nps_sum'] / df_customname['nps_count'])*100
df_customname = df_customname.sort_values(by=['survey_num'],ascending=True)
df_customname= pd.DataFrame(df_customname.drop(['nps_sum','nps_count'], axis=1))
df_customname
The reason I need this to be dynamic is that I need to repeat this process for different cuts of the data. For example, I want to be able to filter for gender = F AND country = Saudi Arabia, or just gender = M, or just income = High. I then want to do a left join of that to the original df that is currently called customname (this would be my base case, so it may just be called 'all').
So the final table after running the function a few times, defining my cuts each time, my final output will look like this:
data = [[1,66.67,83.5,22.5,47.7,74.1],[2,75.67,23.5,24.5,76.7,91.1]]
df_final = pd.DataFrame(data, columns = ['survey_num','nps_all','nps_saudi_f','nps_m','nps_high','nps_40plus'])
df_final
Note there may be better ways to run this, but I'm looking for the quickest/simplest possible way that stays as close to this as possible. I don't yet know what my cuts will be, but there are likely to be a lot of them, so the easier it is to define those outside the function, have the function run the code, and then left join to the original df, the better.
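To make the intent concrete, here is a rough sketch of what I'm imagining (the function name nps_for_cut and the cut_name/cut_filter arguments are just placeholders):
def nps_for_cut(df, cut_name, cut_filter=None):
    # cut_filter is a boolean mask over df; None means the 'all' base case
    cut = df if cut_filter is None else df[cut_filter]
    out = cut[['survey_num', 'nps_sum', 'nps_count']].groupby('survey_num').sum()
    out['nps_' + cut_name] = out['nps_sum'] / out['nps_count'] * 100
    return out.drop(['nps_sum', 'nps_count'], axis=1)

# base case, then left join one cut on survey_num
df_final = nps_for_cut(df, 'all').join(
    nps_for_cut(df, 'saudi_f', (df['gender'] == 'F') & (df['country'] == 'Saudi Arabia')),
    how='left')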
Thank you!
I have the following for loop for a dataframe
# this is my data
import yfinance as yf

df = yf.download('AAPL', period='max', interval='1d')
vwap15 = []
for i in range(0, len(df)-1):
    if i >= 15:
        vwap15.append(sum(df["Close"][i-15:i] * df["Volume"][i-15:i]) / sum(df["Volume"][i-15:i]))
    else:
        vwap15.append(None)
When I created the above for loop, it generated a list.
I actually want to have it as a dataframe that I can join to my original dataframe df.
Any insights would be appreciated.
Thanks.
Maybe you mean something like (right after the loop):
df["vwap15"] = vwap15
Note that you will need to fix your for loop like so (otherwise lengths will not match):
for i in range(len(df)):
Maybe you want to have a look at currently available packages for Technical Analysis indicators in Python with Pandas.
Also, try to use NaN instead of None and consider using the Pandas .rolling method when computing indicators over a time window.
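For example, here is a rough .rolling() sketch of the same 15-bar VWAP (note that the rolling window includes the current row, so it lines up slightly differently from your loop, and the first 14 rows come out as NaN automatically):
pv = df["Close"] * df["Volume"]
df["vwap15"] = pv.rolling(window=15).sum() / df["Volume"].rolling(window=15).sum()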
I'm trying to use the function itis.hierarchy_full of the pytaxize package in order to retrieve information about a biological species from a specific Id.
The function takes only one value/Id and saves all the taxonomic information inside a pandas dataframe that I can edit later.
import pandas as pd
from pytaxize import itis
test1 = itis.hierarchy_full(180530, as_dataframe = True)
I have something like 800 species Ids, and I want to automate the process to obtain 800 different dataframes.
I have somehow created a test with a small list (be aware, I am a biologist, so the code is really basic and maybe inefficient):
species = [180530, 48739, 567823]
tx = {}
for e in species:
    tx[e] = pd.DataFrame(itis.hierarchy_full(e, as_dataframe=True))
Now if I input tx (I'm using a Jupyter Notebook) I obtain a dictionary of pandas dataframes (I think it is a nested dictionary). And if I input tx[180530] I obtain exactly a single dataframe equal to the ones that I can create with the original function.
from pandas.testing import assert_frame_equal
assert_frame_equal(test_180530, sp_180530)
Now I can write something to save each result stored in dictionary as a separate dataframe:
sp_180530 = tx[180530]
sp_48739 = tx[48739]
sp_567823 = tx[567823]
Is there a way to automate the process and save each dataframe to a variable like sp_id? Or even better, is there a way to change the original function where I create tx so that it outputs multiple dataframes directly?
Not exactly what you asked, but to elaborate a bit more on working with the dataframes in the dictionary: loop over the dict and then use every contained dataframe one by one...
for key in tx.keys():
    df_temp = tx[key]
    # < do all your stuff to df_temp .....>
    # Save the dataframe as you want/need (I assume as csv for here)
    df_temp.to_csv(f'sp_{key}.csv')
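As a side note, the same dictionary of dataframes can be built in one line with a dict comprehension (assuming every Id in species is valid for itis.hierarchy_full):
tx = {e: itis.hierarchy_full(e, as_dataframe=True) for e in species}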
I have two dataframes, let's call them Train and LogItem. There is a column called user_id in both of them.
For each row in Train, I pick the user_id and a date field and pass them to a function, which returns some values calculated from the LogItem dataframe. I then use those values to populate two columns in Train (LogEntries_7days, Sessioncounts_7days) at that particular row.
def ServerLogData(user_id, threshold, threshold7, dataframe):
    dataframe = LogItem[LogItem['user_id'] == user_id]
    UserData = dataframe.loc[(dataframe['user_id'] == user_id) &
                             (dataframe['server_time'] < threshold) &
                             (dataframe['server_time'] > threshold7)]
    entries = len(UserData)
    Unique_Session_Count = UserData.session_id.nunique()
    return entries, Unique_Session_Count
for id in Train.index:
    print(id)
    user_id = (Train.loc[[id], ['user_id']].values[0])[0]
    threshold = (Train.loc[[id], ['impression_time']].values[0])[0]
    threshold7 = (Train.loc[[id], ['AdThreshold_date']].values[0])[0]
    dataframe = []
    Train.loc[[id], 'LogEntries_7days'], Train.loc[[id], 'Sessioncounts_7days'] = \
        ServerLogData(user_id, threshold, threshold7, dataframe)
This approach is incredibly slow. Just as in databases, can we use the apply method here, or something else that would be fast enough?
Please suggest a better approach.
Edit: Based on suggestions from super-helpful colleagues here, I am putting some data images for both dataframes and some explanation.
In dataframe Train, there will be user actions with some date values and there will be multiple rows for a user_id.
For each row, I pass user_id and dates to another dataframe and calculate some values. Please note that the second dataframe also has multiple rows per user_id, for different dates, so grouping them does not seem to be an option here.
I pass user_id and dates, the flow goes to the second dataframe, and it finds the rows for that user_id which also fit the dates I passed.
If you have a really large dataframe, printing each row is going to eat up a lot of time, and it's not like you'll be able to read through thousands of lines of output anyway.
If you have a lot of rows for each id, then you can speed it up quite a bit by processing each id only once. There's a question that discusses filtering a dataframe to unique indices. The top rated answer, adjusted for this case, would be unique_id_df = Train.loc[~Train.index.duplicated(keep='first')]. That creates a dataframe with only one row for each id. It takes the first row for each id, which seems to be what you're doing as well.
You can then create a dataframe by applying your function to unique_id_df. There are several ways to do this. One is to create a series entries_counts_series = unique_id_df.apply(ServerLogData, axis=1) and then turn it into a dataframe with entries_counts_df = pd.DataFrame(entries_counts_series.tolist(), index=entries_counts_series.index). You could also put the data into unique_id_df with unique_id_df['LogEntries_7days'], unique_id_df['Sessioncounts_7days'] = zip(*unique_id_df.apply(ServerLogData, axis=1)), but then you would have a bunch of extra columns to get rid of.
Once you have your data, you can merge it with your original dataframe: Train_with_data = Train.merge(entries_counts_df, left_index = True, right_index = True). If you put the data into unique_id_df, you could do something such as Train_with_data = Train.merge(unique_id_df[['LogEntries_7days','Sessioncounts_7days']], left_index = True, right_index = True).
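As a rough sketch of that apply-then-merge route (assuming ServerLogData takes the three-argument form shown further down, and that the thresholds come from the impression_time and AdThreshold_date columns):
entries_counts_series = unique_id_df.apply(
    lambda row: ServerLogData(row['user_id'], row['impression_time'], row['AdThreshold_date']),
    axis=1)
entries_counts_df = pd.DataFrame(entries_counts_series.tolist(),
                                 index=entries_counts_series.index,
                                 columns=['LogEntries_7days', 'Sessioncounts_7days'])
Train_with_data = Train.merge(entries_counts_df, left_index=True, right_index=True)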
Try out different variants of this and the other answers, and see how long each of them takes on a subset of your data.
Also, some notes on ServerLogData:
dataframe is passed as a parameter, but then immediately overwritten.
You subset LogItem to where LogItem['user_id']==user_id, but then you check that condition again. Unless I'm missing something, you can get rid of the dataframe = LogItem[LogItem['user_id']==user_id] line.
You've split the line that sets UserData up, which is good, but standard style is to indent the lines in this sort of situation.
You're only using session_id, so you only need to take that part of the dataframe.
So:
def ServerLogData(user_id, threshold, threshold7):
    UserData = LogItem.session_id.loc[(LogItem['user_id'] == user_id) &
                                      (LogItem['server_time'] < threshold) &
                                      (LogItem['server_time'] > threshold7)]
    entries = len(UserData)
    Unique_Session_Count = UserData.nunique()
    return entries, Unique_Session_Count
I did some quite-possibly-not-representative tests, and subsetting the column, rather than subsetting the entire dataframe and then taking the column out of that dataframe, sped things up significantly.
Try doing a groupby on user_id and then passing each user's history as a dataframe; I think it will get you faster results than passing your Train line by line. I have used this method on log file data and it wasn't slow. I don't know if it is the optimal solution, but I found the results satisfying and quite easy to implement. Something like this:
group_user = LogItem.groupby('user_id')
group_train = Train.groupby('user_id')
user_ids = Train['user_id'].unique().tolist()
for x in user_ids:
    df_user = group_user.get_group(x)
    df_train = group_train.get_group(x)
    # do your thing here
    processing_function(df_user, df_train)
Write a function doing the calculation you want (I named it processing_function). I hope it helps.
EDIT: here is how your code becomes
def ServerLogData(threshold, threshold7, df_user):
    UserData = df_user[(df_user['server_time'] < threshold) & (df_user['server_time'] > threshold7)]
    entries = len(UserData)
    Unique_Session_Count = UserData.session_id.nunique()
    return entries, Unique_Session_Count

group_user = LogItem.groupby('user_id')
group_train = Train.groupby('user_id')
user_ids = Train['user_id'].unique().tolist()
for x in user_ids:
    df_user = group_user.get_group(x)
    df_train = group_train.get_group(x)
    for id in df_train.index:
        user_id = (df_train.loc[[id], ['user_id']].values[0])[0]
        threshold = (df_train.loc[[id], ['impression_time']].values[0])[0]
        threshold7 = (df_train.loc[[id], ['AdThreshold_date']].values[0])[0]
        df_train.loc[[id], 'LogEntries_7days'], df_train.loc[[id], 'Sessioncounts_7days'] = \
            ServerLogData(threshold, threshold7, df_user)
Broadly I have the Smart Meters dataset from Kaggle and I'm trying to get a count of the first and last measure by house, then trying to aggregate that to see how many houses began (or ended) reporting on a given day. I'm open to methods totally different than the line I pursue below.
In SQL, when exploring data I often used something like following:
SELECT Max_DT, COUNT(House_ID) AS HouseCount
FROM
(
    SELECT House_ID, MAX(Date_Time) AS Max_DT
    FROM ElectricGrid
    GROUP BY House_ID
) MeasureMax
GROUP BY Max_DT
I'm trying to replicate this logic in Pandas and failing. I can get the initial aggregation like:
house_max = house_info.groupby('House_Id').agg({'Date_Time' :['max']})
However I'm failing to get the outer query. Specifically I don't know what the aggregated column is called. If I do a describe() it shows as Date_Time in the example above. I tried renaming the columns:
house_max.columns = ['House_Id','Max_Date_Time']
I found a StackOverflow discussion about renaming the results of aggregation and attempted to apply it:
house_max.columns = ["_".join(x) for x in house_max.columns.ravel()]
I still find that a describe() returns Date_Time as the column name.
start_end_collate = house_max.groupby('Date_Time_max')['House_Id'].size()
In the rename example, my second query fails to find Date_Time or Max_Date_Time. In the latter case (the ravel code), it appears not to find House_Id when I run it.
That seems weird; I would think your code would not be able to find the House_Id field: after you perform your groupby on House_Id, it becomes an index, which you cannot reference as a column.
This should work:
house_max = house_info.groupby('House_Id').agg({'Date_Time' :['max']})
house_max.columns = ["_".join(x) for x in house_max.columns.ravel()]
start_end_collate = house_max.groupby('Date_Time_max').size()
Alternatively you can just drop the multilevel column:
house_max.columns = house_max.columns.droplevel(0)
start_end_collate = house_max.groupby('max').size()
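As a further option (assuming pandas 0.25 or later), named aggregation avoids the multilevel columns entirely:
house_max = house_info.groupby('House_Id').agg(Max_Date_Time=('Date_Time', 'max'))
start_end_collate = house_max.groupby('Max_Date_Time').size()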