Suppose you have an array of functions. Each function returns a pandas.Series object of the same indexing and size. Each function takes in the same input, the main dataframe df.
I'm looking for an output that has each of these series as a column of the resulting dataframe.
Currently I have the following:
df_result = [f(df) for f in f_arr]
df_result = pd.DataFrame(df_result)
This takes a long time (there seems to be some overhead on the list operation) and the resulting dataframe is the transpose of what I need. I feel like there should be a clean map/apply way to do this.
Using
df_result = pd.concat(df_result, axis=1)
in place of the second line will avoid getting the transpose.
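A minimal sketch of the pd.concat approach, assuming a toy df and a hypothetical three-function f_arr (neither comes from the question). Passing keys is optional; it is used here only so each column is labelled with the name of the function that produced it.

```python
import pandas as pd

# Toy input; the real df and f_arr come from your own code.
df = pd.DataFrame({"a": [1, 2, 3], "b": [10, 20, 30]})

# Hypothetical functions: each takes df and returns a Series on df's index.
def total(df):
    return df["a"] + df["b"]

def ratio(df):
    return df["b"] / df["a"]

def a_doubled(df):
    return df["a"] * 2

f_arr = [total, ratio, a_doubled]

# One Series per function, joined side by side as columns (no transpose needed).
df_result = pd.concat(
    [f(df) for f in f_arr],
    axis=1,
    keys=[f.__name__ for f in f_arr],
)
print(df_result)
```

The list comprehension itself is cheap; most of the time is likely spent inside the functions, so this mainly fixes the orientation and avoids the row-wise DataFrame constructor.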
Related
Is there a fast way of adding a column to a data frame df with values depending on all the rows of df with smaller index? A very simple example where the new column only depends on the value of one other column would be df["new_col"] = df["old_col"].cumsum() (if df is ordered), but I have something more complicated in mind. Ideally, I'd like to write something like
df["new_col"] = df.[some function here](f),
where [some function] sets the i-th value of df["new_col"] to f(df[df.index <= df.index[i]]). (Ideally [some function] can also be applied to groupby() objects.)
At the moment I loop through rows, add a temporary column containing a dict of relevant values and then apply a function, but this is very slow, memory-inefficient, etc.
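A minimal sketch of the semantics described above, assuming a toy frame and a hypothetical f (range of old_col over the rows seen so far). The first version applies f to every prefix of the frame directly; the second uses an expanding window, which only works when f depends on a single column. Neither is claimed to be fast for arbitrary f.

```python
import pandas as pd

df = pd.DataFrame({"old_col": [3, 1, 4, 1, 5]})

# Hypothetical f: receives the sub-frame of all rows up to and including row i.
def f(sub):
    return sub["old_col"].max() - sub["old_col"].min()

# Direct implementation of the stated semantics (assumes df is sorted by index).
df["new_col"] = [f(df.iloc[: i + 1]) for i in range(len(df))]

# If f only needs one column, an expanding window expresses the same idea.
df["new_col_expanding"] = (
    df["old_col"].expanding().apply(lambda s: s.max() - s.min(), raw=False)
)
print(df)
```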
I have to create a new dataframe in which each column is determined by a function which has two arguments. The problem is that for each column the function needs a different argument which is given by the number of the column.
There are about 6k rows and 200 columns in the dataframe:
The function that defines each column of the new dataframe is defined like this:
def phiNT(M,nT):
    M=M[M.columns[:nT]]
    d=pd.concat([M.iloc[:,nT-1]]*nT,axis=1)
    d.columns=M.columns
    D=M-d
    D=D.mean(axis=1)
    return D
I tried to create an empty dataframe and then add each column using a loop:
A=pd.DataFrame()
for i in range(1,len(M.columns)):
    A[i]=phiNT(M,i)
But this is what pops up:
PerformanceWarning: DataFrame is highly fragmented. This is usually the result of calling `frame.insert` many times, which has poor performance. Consider joining all columns at once using pd.concat(axis=1) instead. To get a de-fragmented frame, use `newframe = frame.copy()`
So I need a way to use pd.concat to create all the columns at once.
You should build all the columns in a list or generator and then call pd.concat once on it to create the new dataframe, instead of inserting one column at a time. Each call to phiNT returns a Series, and pd.concat with axis=1 places them side by side as columns.
The following uses a generator, which avoids building the intermediate list yourself.
results = (phiNT(M,i) for i in range(1,len(M.columns)))
A = pd.concat(results,axis=1)
This is how it would be done with a list.
A = pd.concat([phiNT(M,i) for i in range(1,len(M.columns))],axis=1)
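A runnable sketch of the same idea on a toy 3x4 frame (the data is invented; phiNT is the function from the question). The keys argument is an addition here: it labels the columns 1, 2, 3, matching what the original loop's A[i] assignment produced, because the Series returned by phiNT carry no name of their own.

```python
import numpy as np
import pandas as pd

def phiNT(M, nT):
    # Same computation as in the question: row-wise mean of the first nT
    # columns after subtracting column nT-1 from each of them.
    M = M[M.columns[:nT]]
    d = pd.concat([M.iloc[:, nT - 1]] * nT, axis=1)
    d.columns = M.columns
    return (M - d).mean(axis=1)

# Toy stand-in for the 6k x 200 frame.
M = pd.DataFrame(np.arange(12.0).reshape(3, 4), columns=list("wxyz"))

# Same range as the question's loop, so the last column index is not used.
cols = list(range(1, len(M.columns)))

# Build every column first, then join them in a single concat call.
A = pd.concat([phiNT(M, i) for i in cols], axis=1, keys=cols)
print(A)
```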
I am passing a dataframe as an argument to a function that filters its rows and columns and then returns the dataframe. I store the function's return value in a different variable. In that case, will the original dataframe remain unchanged or not?
def some_function(df):
    # row and column filter on df
    return df
new_df=some_function(original_df)
Will "new_df" be equal to "original_df"?
In my personal experience sometimes the original dataframe remains the same and sometimes it changes. What is the reason behind such behavior?
When you pass a pandas dataframe as an argument, the function receives a reference to the same object, not a copy. This means that the function can change the dataframe (df, in your case). Now:
It can change df, but it doesn't necessarily do so. In the following example, df doesn't change:
def foo(df):
    print(df)
When it returns df at the end, the object it returns may or may not be the same object as the one it received. Operations like df["new_col"] = 7 or df.reset_index(inplace=True) change the original dataframe. Operations such as df = df * 2 create a new dataframe and rebind the local name df to it, so when you return df at the end of the function it is a different object from the one passed in, and the original is untouched.
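The difference can be checked with a small sketch. The function names and the toy frame are hypothetical; filter_safely shows one possible way to filter without touching the caller's object, not the only one.

```python
import pandas as pd

def mutate(df):
    # In-place change: the caller's object is modified.
    df["new_col"] = 7
    return df

def rebind(df):
    # df * 2 builds a new DataFrame; the local name df now points to it.
    df = df * 2
    return df

def filter_safely(df):
    # Boolean-mask filtering returns a new object; .copy() makes the
    # independence from the original explicit.
    return df[df["x"] > 1].copy()

original = pd.DataFrame({"x": [1, 2, 3]})

doubled = rebind(original)          # original is untouched
filtered = filter_safely(original)  # original is untouched
same = mutate(original)             # original gains "new_col"

print(doubled is original)  # False
print(same is original)     # True
print(list(original.columns))
```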
I'm doing:
df.apply(lambda x: x.rename(x.name + "_something"))
I expected this to return the dataframe with _something appended to every column name, but it just returns the same df.
What am I doing wrong?
EDIT: I need to act on the series column by column, not on the dataframe object, as I'll be applying other transformations to x in the lambda, not shown here.
EDIT 2 Full Context:
I've got a time series dataframe, and I'm trying to generate features from the data.
I've written a bunch of primitive functions like:
def sumn(n, s):
    return s.rolling(n).sum().rename(s.name + "_sum_" + str(n))
When I apply those to Series, it renames them well.
When I apply them to columns in a DataFrame, the numerical transformation goes through, but the rename doesn't work.
(I suppose it implies that a DataFrame isn't just a collection of Series, which means in all likelihood, I now have to explicitly rename things on the df)
I think you can do this using pd.concat:
pd.concat([df[e].rename(df[e].name+'_Something') for e in df], axis=1)
Inside the list comprehension, you can add your other logic:
df[e].rename(df[e].name+'_Something').apply(...)
If you use df.apply directly, you can't change the column names; I can't think of a way to do it.
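A small sketch of the concat approach applied to the sumn primitive from the question, on an invented two-column frame. pd.concat takes each column label from the Series' own name, so the rename survives, whereas df.apply labels the results with the original column names.

```python
import pandas as pd

def sumn(n, s):
    # Rolling sum over n rows, renamed as in the question.
    return s.rolling(n).sum().rename(s.name + "_sum_" + str(n))

# Toy frame; column names must be strings for the rename to work.
df = pd.DataFrame({"a": [1.0, 2.0, 3.0], "b": [4.0, 5.0, 6.0]})

# Renamed Series are joined as columns, keeping their new names.
features = pd.concat([sumn(2, df[c]) for c in df], axis=1)

# For comparison: apply performs the transformation but keeps the old labels.
via_apply = df.apply(lambda s: sumn(2, s))

print(list(features.columns))
print(list(via_apply.columns))
```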
I have two objects: a data frame (df1) containing the columns to be transformed, and a pd.Series (a) containing the transformation parameter for each column:
import numpy as np
import pandas as pd
from scipy import stats
p=np.random.rand(5,3) # create data for the data frame
cols=["A","B","C"]
df1=pd.DataFrame(p,columns=cols)
a=np.array([0.3,0.4,0.5]) # create series of transform parameters
a=pd.Series(a,index=cols)
I wonder how to iterate over the df columns and transform each one with its own transform parameter, something like below:
df1.apply(stats.boxcox,lmbda=a)
which of course does not work. My temporary solution is just a brute-force function:
def boxcox_transform(df,lambdas):
    df1=pd.DataFrame(index=df.index)
    for column in list(df):
        df1[column]=stats.boxcox(df[column],lambdas[column])
    return df1
boxcox_transform(df1,a)
I wonder whether there is a more elegant solution, for example something like R's mapply, which can iterate over two lists.
You can use a lambda:
result_df = df1.apply(lambda col: stats.boxcox(col, a.loc[col.name]))
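The key point is that each column passed to the lambda carries its label in col.name, which can be used to look up the matching parameter. The sketch below illustrates this with a hand-written fixed-lambda Box-Cox (boxcox_fixed is a hypothetical stand-in for stats.boxcox so the example needs only NumPy) and invented, strictly positive data.

```python
import numpy as np
import pandas as pd

def boxcox_fixed(x, lmbda):
    # Stand-in for stats.boxcox with a known lambda:
    # log(x) when lambda is 0, otherwise (x**lambda - 1) / lambda.
    # Requires strictly positive x.
    return np.log(x) if lmbda == 0 else (x**lmbda - 1) / lmbda

cols = ["A", "B", "C"]
df1 = pd.DataFrame([[1.0, 4.0, 9.0], [4.0, 9.0, 16.0]], columns=cols)
a = pd.Series([0.5, 0.5, 0.0], index=cols)  # one parameter per column

# col.name is the column label, so a.loc[col.name] picks that column's lambda.
result_df = df1.apply(lambda col: boxcox_fixed(col, a.loc[col.name]))
print(result_df)
```

This is the pandas counterpart of iterating over two aligned lists: the Series index plays the role of the second list, and the alignment is done by label rather than by position.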