Apply function to several DataFrames by creating new DataFrame - python

May be Someone can just help me to find the solution:
I have 100 dataframes. Each dataframe contains time / High_Price / Low_price
I would like to create new Dataframe, which contains Gains from each DataFrame.
Example:
df1 = pd.DataFrame({"high":[5,4,5,2],
"low":[1,2,2,1]},
index=["2019-04-06","2019-04-07","2019-04-08","2019-04-09"])
df100 = pd.DataFrame({"high":[7,5,6,7],
"low":[1,2,3,4]},
index=["2019-04-06","2019-04-07","2019-04-08","2019-04-09"])
Functions:
def myfunc(data, amount):
data= data.loc[(data!=0).any(1)]
profit = (amount/data.iloc[0]['low']) * data.iloc[-1]['high']
return profit
Output should be:
output= pd.DataFrame({"Gain":[1,6]},
index=["df1","df100"])
How can I apply function to 100 DataFrames and get from them only Gains by creating the Dataframe, where we see the name of DataFrame and the Gain for this DataFrame?

Put your dataframes in a list and access them by integer index. Having variables named df1 to df100 is bad programming style because a) the dataframes belong together, so put them in a collection (e.g. list) and b) you cannot get "the" name of an object from its value, leading to complications such as the one you are facing now.
So let dfs be your list of 100 dataframes, starting at index 0.
Use
amount = ... # the value you want to use
output = pd.DataFrame([myfunc(df, amount) for df in dfs], columns=['Gain'])
The index of output now corresponds to the index of dfs, starting at 0. There's no reason to rename it to 'df1' ... 'df100', you gain no information and the output becomes harder to handle.
In case of arbitrary dataframe names, use a dictionary that maps name to df. Let's call it dfs again. Then use
amount = ... # the value you want to use
output = pd.DataFrame([myfunc(df, amount) for df in dfs.values()], columns=['Gain'], index=dfs.keys()])
I'm assuming myfunc is correct, I did not debug it.

Related

Append std,mean columns to a DataFrame with a for-loop

I want to put the std and mean of a specific column of a dataframe for different days in a new dataframe. (The data comes from analyses conducted on big data in multiple excel files.)
I use a for-loop and append(), but it returns the last ones, not the whole.
here is my code:
hh = ['01:00','02:00','03:00','04:00','05:00']
for j in hh:
month = 1
hour = j
data = get_data(month, hour) ## it works correctly, reads individual Excel spreadsheet
data = pd.DataFrame(data,columns=['Flowday','Interval','Demand','Losses (MWh)','Total Load (MWh)'])
s_td = data.iloc[:,4].std()
meean = data.iloc[:,4].mean()
final = pd.DataFrame(columns=['Month','Hour','standard deviation','average'])
final.append({'Month':j ,'Hour':j,'standard deviation':s_td,'average':meean},ignore_index=True)
I am not sure, but I believe you should assign the final.append(... to a variable:
final = final.append({'Month':j ,'Hour':j,'standard deviation':x,'average':y},ignore_index=True)
Update
If time efficiency is of interest to you, it is suggested to use a list of your desired values ({'Month':j ,'Hour':j,'standard deviation':x,'average':y}), and assign this list to the dataframe. It is said it has better performance.(Thanks to #stefan_aus_hannover)
This is what I am referring to in the comments on Amirhossein's answer:
hh=['01:00','02:00','03:00','04:00','05:00']
lister = []
final = pd.DataFrame(columns=['Month','Hour','standard deviation','average'])
for j in hh:``
month=1
hour = j
data = get_data(month, hour) ## it works correctly
data=pd.DataFrame(data,columns=['Flowday','Interval','Demand','Losses (MWh)','Total Load (MWh)'])
s_td=data.iloc[:,4].std()
meean=data.iloc[:,4].mean()
lister.append({'Month':j ,'Hour':j,'standard deviation':s_td,'average':meean})
final = final.append(pd.DataFrame(lister),ignore_index=True)
Conceptually you're just doing aggregate by hour, with the two functions std, mean; then appending that to your result dataframe. Something like the following; I'll revise it if you give us reproducible input data. Note the .agg/.aggregate() function accepts a dict of {'result_col': aggregating_function} and allows you to pass multiple aggregating functions, and directly name their result column, so no need to declare temporaries. If you only care about aggregating column 4 ('Total Load (MWh)'), then no need to read in columns 0..3.
for hour in hh:
# Read in columns-of-interest from individual Excel sheet for this month and day...
data = get_data(1, hour)
data = pd.DataFrame(data,columns=['Flowday','Interval','Demand','Losses (MWh)','Total Load (MWh)'])
# Compute corresponding row of the aggregate...
dat_hh_aggregate = pd.DataFrame({['Month':whatever ,'Hour':hour]})
dat_hh_aggregate = dat_hh_aggregate.append(data.agg({'standard deviation':pd.Series.std, 'average':pd.Series.mean)})
final = final.append(dat_hh_aggregate, ignore_index=True)
Notes:
pd.read_excel usecols=['Flowday','Interval',...] allows you to avoid reading in columns that you aren't interested in the first place. You haven't supplied reproducible code for get_data(), but you should parameterize it so you can pass the list of columns-of-interest. But you seem to only want to aggregate column 4 ('Total Load (MWh)') anyway.
There's no need to store separate local variables s_td, meean, just directly use .aggregate()
There's no need to have both lister and final. Just have one results dataframe final, and append to it, ignoring the index. (If you get issues with that, post updated code here, make sure it's reproducible)

assigning list of strings as name for dataframe

I have searched and searched and not found what I would think was a common question. Which makes me think I'm going about this wrong. So I humbly ask these two versions of the same question.
I have a list of currency names, as strings. A short version would look like this:
col_names = ['australian_dollar', 'bulgarian_lev', 'brazilian_real']
I also have a list of dataframes (df_list). Each one is has a column for data, currency exchange rate, etc. Here's the head for one of them (sorry it's blurry, it was fine bigger but I stuck in an m in the URL because it was huge):
I would be stoked to assign each one of those strings col_list as a variable name for a data frame in df_list. I did make a dictionary where key/value was currency name and the corresponding df. But I didn't really know how to use it, primarily because it was unordered. Is there a way to zip col_list and df_list together? I could also just unpack each df in df_list and use the title of the second column be the title of the frame. That seems really cool.
So instead I just wrote something that gave me index numbers and then hand put them into the function I needed. Super kludgy but I want to make the overall project work for now. I end up with this in my figure code:
for ax, currency in zip((ax1, ax2, ax3, ax4), (df_list[38], df_list[19], df_list[10], df_list[0])):
ax.plot(currency["date"], currency["rolling_mean_30"])
And that's OK. I'm learning, not delivering something to a client. I can use it to make eight line plots. But I want to do this with 40 frames so I can get the annual or monthly volatility. I have to take a list of data frames and unpack them by hand.
Here is the second version of my question. Take df_list and:
def framer(currency):
index = col_names.index(currency)
df = df_list[index] # this is a dataframe containing a single currency and the columns built in cell 3
return df
brazilian_real = framer("brazilian_real")
Which unpacks the a df (but only if type out the name) and then:
def volatizer(currency):
all_the_years = [currency[currency['year'] == y] for y in currency['year'].unique()] # list of dataframes for each year
c_name = currency.columns[1]
df_dict = {}
for frame in all_the_years:
year_name = frame.iat[0,4] # the year for each df, becomes the "year" cell for annual volatility df
annual_volatility = frame["log_rate"].std()*253**.5 # volatility measured by standard deviation * 253 trading days per year raised to the 0.5 power
df_dict[year_name] = annual_volatility
df = pd.DataFrame.from_dict(df_dict, orient="index", columns=[c_name+"_annual_vol"]) # indexing on year, not sure if this is cool
return df
br_vol = volatizer(brazilian_real)
which returns a df with a row for each year and annual volatility. Then I want to concatenate them and use that for more charts. Ultimately make a little dashboard that lets you switch between weekly, monthly, annual and maybe set date lims.
So maybe there's some cool way to run those functions on the original df or on the lists of dfs that I don't know about. I have started using df.map and df.apply some.
But it seems to me it would be pretty handy to be able to unpack the one list using the names from the other. Basically same question, how do I get the dataframes in df_list out and attached to variable names?
Sorry if this is waaaay too long or a really bad way to do this. Thanks ahead of time!
Do you want something like this?
dfs = {df.columns[1]: df for df in df_list}
Then you can reference them like this for example:
dfs['brazilian_real']
This is how I took the approach suggested by Kelvin:
def volatizer(currency):
annual_df_list = [currency[currency['year'] == y] for y in currency['year'].unique()] # list of annual dfs
c_name = currency.columns[1]
row_dict = {} # dictionary with year:annual_volatility as key:value
for frame in annual_df_list:
year_name = frame.iat[0,4] # first cell of the "year" column, becomes the "year" key for row_dict
annual_volatility = frame["log_rate"].std()*253**.5 # volatility measured by standard deviation * 253 trading days per year raised to the 0.5 power
row_dict[year_name] = annual_volatility # dictionary with year:annual_volatility as key:value
df = pd.DataFrame.from_dict(row_dict, orient="index", columns=[c_name+"_annual_vol"]) # new df from dictionary indexing on year
return df
# apply volatizer to each currency df
for key in df_dict:
df_dict[key] = volatizer(df_dict[key])
It worked fine. I can use a list of strings to access any of the key:value pairs. It feels like a better way than trying to instantiate a bunch of new objects.

Iterate over loop to get cumsum for a dataframe variable for each different date variable (Not Aggregation)

I am trying to write a simple code in which I have units produced in a dataframe 'Yield' and 'Date' on which they were produced. Multiple records are present for the same date. I am going to use numpy cumsum function to get running total for each row and then subtract the value for the current row. I do not wish to do aggregation for the date since I need the original raw records to remain.
I can do this for one set of date by having .loc variable made for each date and then apply the function. But can't figure out how to do this iteratively.
data_43102 = data['Yield_Done','PDate'].loc[data['PDate'] ==43102]
#gives me Yield Done for only 43102
data_43102['Running_total']= cumsum(data_43102['Yield_Done']) #gives me cumulative total
data_43102['Running_total'] = data_43102['Running_total'] - data_43102['Yield_Done']
Whet I expect after running the code is there to be output like in the case of one I had
You can store all the dates in a list and then use isin to get data filtered for all the dates:
dates = ['43102', '23102', '43102'...]
data_filtered_by_date = data['Yield_Done','PDate'].loc[data['PDate'].isin(dates)]
I hope this helps.

How to maintain lexsort status when adding to a multi-indexed DataFrame?

Say I construct a dataframe with pandas, having multi-indexed columns:
mi = pd.MultiIndex.from_product([['trial_1', 'trial_2', 'trial_3'], ['motor_neuron','afferent_neuron','interneuron'], ['time','voltage','calcium']])
ind = np.arange(1,11)
df = pd.DataFrame(np.random.randn(10,27),index=ind, columns=mi)
Link to image of output dataframe
Say I want only the voltage data from trial 1. I know that the following code fails, because the indices are not sorted lexically:
idx = pd.IndexSlice
df.loc[:,idx['trial_1',:,'voltage']]
As explained in another post, the solution is to sort the dataframe's indices, which works as expected:
dfSorted = df.sortlevel(axis=1)
dfSorted.loc[:,idx['trial_1',:,'voltage']]
I understand why this is necessary. However, say I want to add a new column:
dfSorted.loc[:,('trial_1','interneuron','scaledTime')] = 100 * dfSorted.loc[:,('trial_1','interneuron','time')]
Now dfSorted is not sorted anymore, since the new column was tacked onto the end, rather than snuggled into order. Again, I have to call sortlevel before selecting multiple columns.
I feel this makes for repetitive, bug-prone code, especially when adding lots of columns to the much bigger dataframe in my own project. Is there a (preferably clean-looking) way of inserting new columns in lexical order without having to call sortlevel over and over again?
One approach would be to use filter which does a text filter on the column names:
In [117]: df['trial_1'].filter(like='voltage')
Out[117]:
motor_neuron afferent_neuron interneuron
voltage voltage voltage
1 -0.548699 0.986121 -1.339783
2 -1.320589 -0.509410 -0.529686

Iterate through and overwrite specific values in a pandas dataframe

I have a large dataframe collating a bunch of basketball data (screenshot below). Every column to the right of Opp Lineup is a dummy variable indicating if that player (indicated in the column name) is in the current lineup (the last part of the column name is team name, which needs to be compared to the opponent column to make sure two players with the same number and name on different teams don't mess it up). I know several ways of iterating through a pandas dataframe (iterrows, itertuples, iteritems), but I don't know the way to accomplish what I need to, which is for each line in each column:
Compare the team (columnname.split()[2:]) to the Opponent column (except for LSU players)
See if the name (columnname.split()[:2]) is in Opp Lineup or, for LSU players, lineup
If the above conditions are satisfied, replace that value with 1, otherwise leave it as 0
What is the best method for looping through the dataframe and accomplishing this task? Speed doesn't really matter in this instance. I understand all of the logic involved, except I'm not familiar enough with pandas to know how to loop through it, and trying various things I've seen on Google isn't working.
Consider a reshape/pivot solution as your data is in wide format but you need to compare values row-wise in long format. So, first melt your data so all column headers become an actual column 'Player' and its corresponding value to 'IsInLineup'. Run your conditional comparison for dummy values, and then pivot back to original structure with players across column headers. Of course, I do not have actual data to test this example fully.
# MELT
reshapedf = pd.melt(df, id_vars=['Opponent', 'Lineup', 'Minutes', 'Plus Minus',
'Plus Minus Per Minute', 'Opp Lineup'],
var_name='Player', value_name='IsInLineup')
# APPLY FUNCTION (SPLITTING VALUE AND THEN JOINING FOR SUBSET STRING)
reshapedf['IsInLineup'] = reshapedf.apply(lambda row: (' '.join(row['Player'].split(' ')[:2]) in row['Opp Lineup'] and
' '.join(row['Player'].split(' ')[2:]) in row['Opponent'])*1, axis=1)
# PIVOT (UNMELT)
df2 = reshapedf.pivot_table(index=['Opponent', 'Lineup', 'Minutes', 'Plus Minus',
'Plus Minus Per Minute', 'Opp Lineup'], columns='Player').reset_index()
df2.columns = df2.columns.droplevel(0).rename(None)
df2.columns = df.columns
If above lambda function looks a little complex, try equivalent apply defined function():
# APPLY FUNCTION (SPLITTING VALUE AND THEN JOINING FOR SUBSET STRING)
def f(row):
if (' '.join(row['Player'].split(' ')[:2]) in row['Opp Lineup'] and \
' '.join(row['Player'].split(' ')[2:]) in row['Opponent']):
return 1
else:
return 0
reshapedf['IsInLineup'] = reshapedf.apply(f,axis=1)
I ended up using a work around. I iterated through using df.iterrows and for each one created a list for each iteration where checked for the value I wanted and then appended the 0 or 1 to the temporary list. Then I simply inserted it to the dataframe. Possibly not the most efficient memory-wise, but it worked.

Categories