Suppose you have a survey and you want to calculate the Net Promoter Score (NPS) for different cuts of respondents. My data looks something like this:
import pandas as pd
data = [[1,111,1,1,35,'F','UK','High'],
        [1,112,0,1,42,'F','Saudi Arabia','Low'],
        [1,113,1,1,17,'M','Belize','High'],
        [1,1234,1,1,35,'F','Saudi Arabia','High'],
        [2,1854,1,1,35,'M','Belize','Low'],
        [2,1445,1,1,35,'F','UK','Low']]
df = pd.DataFrame(data, columns = ['survey_num','id_num','nps_sum','nps_count','age','gender','country','income_level'])
df
I want to be able to write a function that cycles through this data and does the following each time:
col_list = ['survey_num','nps_sum','nps_count']
df_customname = df[col_list]
df_customname = df_customname.groupby('survey_num').sum()
df_customname['nps_customname'] = (df_customname['nps_sum'] / df_customname['nps_count'])*100
df_customname = df_customname.sort_values(by=['survey_num'], ascending=True)
df_customname = pd.DataFrame(df_customname.drop(['nps_sum','nps_count'], axis=1))
df_customname
The reason I need this to be dynamic is that I need to repeat the process for different cuts of the data. For example, I want to be able to filter for gender = F AND country = Saudi Arabia, or just gender = M, or just income = High. I then want to left-join each result to the original df above (that is my base case, so it might just be called 'all').
So after running the function a few times, defining my cuts each time, the final output will look like this:
data = [[1,66.67,83.5,22.5,47.7,74.1],[2,75.67,23.5,24.5,76.7,91.1]]
df_final = pd.DataFrame(data, columns = ['survey_num','nps_all','nps_saudi_f','nps_m','nps_high','nps_40plus'])
df_final
Note there may be better ways to do this, but I'm looking for the quickest/simplest approach that stays as close to this as possible. I don't yet know what my cuts will be, but there are likely to be a lot of them, so the easier it is to define the cuts outside the function, have the function run the code, and then left-join the result to the original df, the better.
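To make the ask concrete, here is a rough sketch of the kind of function I have in mind (the name nps_cut and the mask argument are just placeholders, not working code I've settled on):

def nps_cut(df, name, mask=None):
    # optionally filter the respondents, then aggregate per survey
    cut = df if mask is None else df[mask]
    cut = cut[['survey_num','nps_sum','nps_count']].groupby('survey_num').sum()
    cut['nps_' + name] = (cut['nps_sum'] / cut['nps_count']) * 100
    return cut.drop(['nps_sum','nps_count'], axis=1)

# base case first, then left-join each cut onto it
df_final = nps_cut(df, 'all')
df_final = df_final.join(nps_cut(df, 'saudi_f', (df.gender == 'F') & (df.country == 'Saudi Arabia')), how='left')
df_final = df_final.join(nps_cut(df, 'm', df.gender == 'M'), how='left')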
Thank you!
I'm in the process of converting some code from SAS to Python and need some tips on how to build a function that will act the way a macro does in SAS: running instances of the same code for a set of dataframes I pass into the function as arguments/parameters.
In the example I have a dataframe called country_extract. I subset it on the country_code field, which results in multiple dataframes (australia_extract, england_extract and india_extract). I then need to apply a set of filters and sum the GDP for each of those dataframes. There will be 20 filters based on multiple conditions for each before I aggregate; in the example below I list just two simple filters to give an idea of how the code is currently structured.
How would I define a function to run step 2 for all the dataframes? Are there resources with working examples I can look at? Currently I get errors, I believe on the return, saying there is no data.
# 1. Subset the country dataframe into multiple dataframes
australia_filter = country_extract['country_code'] == 'aus'
australia_extract = country_extract.where(australia_filter, inplace=True)
england_filter = country_extract['country_code'] == 'eng'
england_extract = country_extract.where(england_filter, inplace=True)
india_filter = country_extract['country_code'] == 'ind'
india_extract = country_extract.where(india_filter, inplace=True)
# 2. Apply filters for country type and sub-type and then aggregate GDP
def extract_filters(x):
    country_type_filter = x['country_type'].isin('CRTD')
    country_sub_type_filter = (x['country_sub_type'].isin('GLA') &
                               x['continent'].isin('Y') &
                               x['generic'].isin('Y'))
    return country_total
    country_total = [
        [1, x.loc[country_type_filter, 'GDP'].sum()],
        [2, x.loc[country_sub_type_filter, 'GDP'].sum()],
    ]
australia_gdp = extract_filters(australia_extract)
england_gdp = extract_filters(england_extract)
india_gdp = extract_filters(india_extract)
Basically I want the function to run for the 3 dataframes (england_extract, australia_extract and india_extract) and generate a separate list for each. How would I code this?
Yes, that's a very good use of Python functions. However, it looks like a good candidate for .groupby() and .agg(), which would look something like this:
country_extract.groupby(["country_code","country_sub_type"]).agg(sumGDP=('GDP','sum'))
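If some of the 20 filters don't map cleanly onto grouping columns, a boolean mask applied before the groupby does the same job; roughly, reusing the column names from the question:

mask = (country_extract['country_sub_type'].isin(['GLA']) &
        country_extract['continent'].isin(['Y']) &
        country_extract['generic'].isin(['Y']))
country_extract[mask].groupby('country_code').agg(sumGDP=('GDP', 'sum'))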
Update: You could also save yourself some typing by doing something like
GDP_dict = {}
country_list = ['aus', 'eng', 'ind']
for country in country_list:
    GDP = extract_filters(country_extract[country_extract.country_code == country])
    GDP_dict[country] = GDP
You'll need to modify your function so that it actually builds and returns country_total as well.
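For example, the fixed function might look like this sketch: build country_total before the return, and note that .isin() expects a list-like rather than a bare string (a bare string is treated as a sequence of single characters):

def extract_filters(x):
    country_type_filter = x['country_type'].isin(['CRTD'])
    country_sub_type_filter = (x['country_sub_type'].isin(['GLA']) &
                               x['continent'].isin(['Y']) &
                               x['generic'].isin(['Y']))
    country_total = [
        [1, x.loc[country_type_filter, 'GDP'].sum()],
        [2, x.loc[country_sub_type_filter, 'GDP'].sum()],
    ]
    return country_total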
I want to put the std and mean of a specific column of a dataframe, for different days, into a new dataframe. (The data comes from analyses conducted on big data in multiple Excel files.)
I use a for-loop and append(), but the result only contains the last row, not all of them.
Here is my code:
hh = ['01:00','02:00','03:00','04:00','05:00']
for j in hh:
    month = 1
    hour = j
    data = get_data(month, hour)  # works correctly; reads an individual Excel spreadsheet
    data = pd.DataFrame(data, columns=['Flowday','Interval','Demand','Losses (MWh)','Total Load (MWh)'])
    s_td = data.iloc[:,4].std()
    meean = data.iloc[:,4].mean()
    final = pd.DataFrame(columns=['Month','Hour','standard deviation','average'])
    final.append({'Month':j, 'Hour':j, 'standard deviation':s_td, 'average':meean}, ignore_index=True)
I am not sure, but I believe you should assign the result of final.append(...) back to a variable, since append returns a new dataframe rather than modifying final in place:
final = final.append({'Month':j, 'Hour':j, 'standard deviation':s_td, 'average':meean}, ignore_index=True)
Update
If time efficiency is of interest to you, it is better to accumulate your row dicts ({'Month':j, 'Hour':j, 'standard deviation':s_td, 'average':meean}) in a list, and build the dataframe from that list in one go; it performs noticeably better than appending row by row. (Thanks to #stefan_aus_hannover)
This is what I am referring to in the comments on Amirhossein's answer:
hh = ['01:00','02:00','03:00','04:00','05:00']
lister = []
final = pd.DataFrame(columns=['Month','Hour','standard deviation','average'])
for j in hh:
    month = 1
    hour = j
    data = get_data(month, hour)  # works correctly
    data = pd.DataFrame(data, columns=['Flowday','Interval','Demand','Losses (MWh)','Total Load (MWh)'])
    s_td = data.iloc[:,4].std()
    meean = data.iloc[:,4].mean()
    lister.append({'Month':j, 'Hour':j, 'standard deviation':s_td, 'average':meean})
final = final.append(pd.DataFrame(lister), ignore_index=True)
Conceptually you're just aggregating by hour with the two functions std and mean, then appending each result to your output dataframe, something like the following (I'll revise it if you give us reproducible input data). Note that .agg()/.aggregate() on a single column accepts a dict of {'result_col': aggregating_function}, so you can pass multiple aggregating functions and name their result columns directly, with no need to declare temporaries. And if you only care about aggregating column 4 ('Total Load (MWh)'), there is no need to read in columns 0..3.
final = pd.DataFrame(columns=['Month','Hour','standard deviation','average'])
for hour in hh:
    # Read in the columns of interest from the individual Excel sheet for this month and hour...
    data = get_data(1, hour)
    data = pd.DataFrame(data, columns=['Flowday','Interval','Demand','Losses (MWh)','Total Load (MWh)'])
    # Compute the corresponding row of the aggregate, naming the outputs directly in .agg()...
    stats = data['Total Load (MWh)'].agg({'standard deviation': 'std', 'average': 'mean'})
    final = final.append({'Month': 1, 'Hour': hour, **stats.to_dict()}, ignore_index=True)
Notes:
pd.read_excel with usecols=['Flowday','Interval',...] lets you avoid reading in columns you aren't interested in in the first place. You haven't supplied reproducible code for get_data(), but you should parameterize it so you can pass in the list of columns of interest. You seem to only want to aggregate column 4 ('Total Load (MWh)') anyway.
There's no need to store the separate local variables s_td and meean; just use .aggregate() directly.
There's no need to have both lister and final. Just have one results dataframe, final, and append to it, ignoring the index. (If you run into issues with that, post updated code here and make sure it's reproducible.)
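One version note: DataFrame.append was deprecated in pandas 1.4 and removed in 2.0, so on current pandas the same pattern would collect the row dicts in a list and build the frame once at the end, e.g.:

rows = []
for hour in hh:
    data = get_data(1, hour)
    data = pd.DataFrame(data, columns=['Flowday','Interval','Demand','Losses (MWh)','Total Load (MWh)'])
    s = data['Total Load (MWh)']
    rows.append({'Month': 1, 'Hour': hour, 'standard deviation': s.std(), 'average': s.mean()})
final = pd.DataFrame(rows)  # one build instead of repeated .append()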
I have searched and searched and not found what I would think is a common question, which makes me think I'm going about this wrong. So I humbly ask these two versions of the same question.
I have a list of currency names, as strings. A short version would look like this:
col_names = ['australian_dollar', 'bulgarian_lev', 'brazilian_real']
I also have a list of dataframes (df_list). Each one has a column for date, the currency exchange rate, etc. Here's the head of one of them:
I would be stoked to assign each of the strings in col_names as a variable name for a dataframe in df_list. I did make a dictionary where the key/value pairs were currency name and the corresponding df, but I didn't really know how to use it, primarily because it was unordered. Is there a way to zip col_names and df_list together? I could also just unpack each df in df_list and use the title of its second column as the name of the frame. That seems really cool.
So instead I just wrote something that gave me index numbers and then hand-put them into the function I needed. Super kludgy, but I want to make the overall project work for now. I end up with this in my figure code:
for ax, currency in zip((ax1, ax2, ax3, ax4), (df_list[38], df_list[19], df_list[10], df_list[0])):
    ax.plot(currency["date"], currency["rolling_mean_30"])
And that's OK. I'm learning, not delivering something to a client, and I can use it to make eight line plots. But I want to do this with 40 frames so I can get the annual or monthly volatility, and right now I have to take a list of dataframes and unpack them by hand.
Here is the second version of my question. Take df_list and:
def framer(currency):
    index = col_names.index(currency)
    df = df_list[index]  # a dataframe containing a single currency and the columns built in cell 3
    return df
brazilian_real = framer("brazilian_real")
which unpacks a df (but only if I type out the name), and then:
def volatizer(currency):
    all_the_years = [currency[currency['year'] == y] for y in currency['year'].unique()]  # list of dataframes for each year
    c_name = currency.columns[1]
    df_dict = {}
    for frame in all_the_years:
        year_name = frame.iat[0,4]  # the year for each df, becomes the "year" cell for the annual volatility df
        annual_volatility = frame["log_rate"].std()*253**.5  # volatility measured by standard deviation * 253 trading days per year raised to the 0.5 power
        df_dict[year_name] = annual_volatility
    df = pd.DataFrame.from_dict(df_dict, orient="index", columns=[c_name+"_annual_vol"])  # indexing on year, not sure if this is cool
    return df
br_vol = volatizer(brazilian_real)
which returns a df with a row for each year and its annual volatility. Then I want to concatenate them and use the result for more charts, ultimately making a little dashboard that lets you switch between weekly, monthly and annual views and maybe set date limits.
So maybe there's some cool way to run those functions on the original df, or on the lists of dfs, that I don't know about. I have started using df.map and df.apply some.
But it seems to me it would be pretty handy to be able to unpack the one list using the names from the other. Basically the same question: how do I get the dataframes in df_list out and attached to variable names?
Sorry if this is waaaay too long or a really bad way to do this. Thanks ahead of time!
Do you want something like this?
dfs = {df.columns[1]: df for df in df_list}
Then you can reference them like this for example:
dfs['brazilian_real']
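If you'd rather build the same mapping from your existing col_names list, zip works too (this assumes the two lists are in the same order):

dfs = dict(zip(col_names, df_list))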
This is how I took the approach suggested by Kelvin:
def volatizer(currency):
    annual_df_list = [currency[currency['year'] == y] for y in currency['year'].unique()]  # list of annual dfs
    c_name = currency.columns[1]
    row_dict = {}  # dictionary with year:annual_volatility as key:value
    for frame in annual_df_list:
        year_name = frame.iat[0,4]  # first cell of the "year" column, becomes the "year" key for row_dict
        annual_volatility = frame["log_rate"].std()*253**.5  # volatility measured by standard deviation * 253 trading days per year raised to the 0.5 power
        row_dict[year_name] = annual_volatility
    df = pd.DataFrame.from_dict(row_dict, orient="index", columns=[c_name+"_annual_vol"])  # new df from dictionary, indexing on year
    return df
# apply volatizer to each currency df
for key in df_dict:
    df_dict[key] = volatizer(df_dict[key])
It worked fine. I can use a list of strings to access any of the key:value pairs. It feels like a better way than trying to instantiate a bunch of new objects.
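To combine the per-currency results afterwards (the concatenation step mentioned above), one option is a single concat along the columns; a sketch, assuming each volatizer result is indexed on year:

annual_vol = pd.concat(list(df_dict.values()), axis=1).sort_index()  # one *_annual_vol column per currency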
So I'm trying to import a number of Excel files and create a list of all the data. Here is my code for it:
import os
import pandas as pd
cwd = os.path.abspath('')
files = os.listdir(cwd)
df = pd.DataFrame()
for file in files:
    if file.endswith('.XLSX'):
        df = df.append(pd.read_excel(file), ignore_index=True)
df = df.where(df.notnull(), None)
array = df.values.tolist()
print(array)
The Excel files, on the other hand, look something like this:
product cost used_by prime
name price gender yes or no
name price gender yes or no
... and so on
However, not all of them have the order product cost used_by prime (case one order). Some of them, for example, are in the format cost product prime used_by (case two order). Of course, pandas matches each file's data to its own headers, but I run into an issue.
Basically, I run this code on two different devices using the same data and code, but the results are different: one comes out in case one order while the other is in case two order. I want a line of code that makes sure the dataframe is always in the order product cost used_by prime, but I am not sure how.
Can you show me the Python code for it? Thank you in advance.
You can try reordering right after loading each Excel file; note the list needs all four columns, including cost, or that column will be dropped:
df = df[['product', 'cost', 'used_by', 'prime']]
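Applied inside your loop, that reordering might look like this sketch (keeping your df.append pattern):

col_order = ['product', 'cost', 'used_by', 'prime']
for file in files:
    if file.endswith('.XLSX'):
        part = pd.read_excel(file)
        df = df.append(part[col_order], ignore_index=True)  # enforce a consistent column order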
I am new to Python but aware of the usefulness of pandas, so I would like to kindly ask if someone can help me use pandas to address the problem below.
I have a dataset with buses, which looks like:
BusModel;BusID;ModeName;Value;Unit;UtcTime
Alpha;0001;Engine hours;985;h;2016-06-22 19:58:09.000
Alpha;0001;Engine hours;987;h;2016-06-22 21:58:09.000
Alpha;0001;Engine hours;989;h;2016-06-22 23:59:09.000
Alpha;0001;Fuel consumption;78;l;2016-06-22 19:58:09.000
Alpha;0001;Fuel consumption;88;l;2016-06-22 21:58:09.000
Alpha;0001;Fuel consumption;98;l;2016-06-22 23:59:09.000
The file is in .csv format, separated by semicolons (;). Please note that I would like to plot the relationship between 'Engine hours' and 'Fuel consumption' by calculating the mean value of both for each day, based on UtcTime. Moreover, I would like to plot graphs for all the buses in the dataset (not only 0001 but also 0002, 0003, etc.). How can I do that with a simple loop?
Start with the following in interactive mode:
import pandas as pd
df = pd.read_csv('bus.csv', sep=";", parse_dates=['UtcTime'])
You should be able to start playing around with the DataFrame and discovering functions you can use directly on the data. To get the rows for a single bus by ID, just do:
>>> bus1 = df[df.BusID == 1]
>>> bus1
Substitute 1 with the ID of the bus you require. This will return a sub-DataFrame. To get BusID 1 and just its engine hours, do:
>>> bus1[bus1.ModeName == "Engine hours"]
You can quickly get statistics of columns by doing
>>> bus1.Value.describe()
Once you have grouped the data you need, you can start plotting (you'll need matplotlib for the final show):
>>> import matplotlib.pyplot as plt
>>> bus1[bus1.ModeName == "Engine hours"].plot()
>>> bus1[bus1.ModeName == "Fuel consumption"].plot()
>>> plt.show()
There is more explanation in the docs; please refer to http://pandas.pydata.org/pandas-docs/stable/.
If you really want to use pandas, remember this simple thing: avoid writing explicit loops. They don't scale well, so lean on the built-in vectorized functions instead. First let's read your dataframe:
import pandas as pd
data = pd.read_csv('bus.csv', sep=';')
Here is the weak point of my answer: I don't know the most efficient way to manage dates. So create a column named day which contains the day from UtcTime. (I would use an apply method like data['day'] = data['UtcTime'].apply(lambda x: x[:10]), but that's a hidden loop, so don't do that!)
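For what it's worth, a vectorized way to get the day without apply (a sketch) is to parse the column once with pd.to_datetime and take the calendar date:

data['UtcTime'] = pd.to_datetime(data['UtcTime'])
data['day'] = data['UtcTime'].dt.date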
Then to take only the data of a single bus, try a slicing method:
data_bus1 = data[data.BusID == 1]
Finally use the groupby function:
data_bus1[['ModeName','Value','day']].groupby(['ModeName','day'], as_index=False).mean()
Or, if you don't need to separate your buses into different dataframes, you can use the groupby on the whole dataset:
data[['BusID','ModeName','Value','day']].groupby(['BusID','ModeName','day'], as_index=False).mean()
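And for the one-graph-per-bus part of the question, plotting is one place where a small loop over the groups is harmless; a sketch, assuming matplotlib is available:

import matplotlib.pyplot as plt

daily = data[['BusID','ModeName','Value','day']].groupby(['BusID','ModeName','day'], as_index=False).mean()
for bus_id, bus_daily in daily.groupby('BusID'):
    # one line per ModeName ('Engine hours', 'Fuel consumption'), days on the x-axis
    bus_daily.pivot(index='day', columns='ModeName', values='Value').plot(title=f'Bus {bus_id}')
plt.show()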