I'm doing:
df.apply(lambda x: x.rename(x.name + "_something"))
I expected this to return the dataframe with "_something" appended to every column name, but it just returns the same df.
What am I doing wrong?
EDIT: I need to act on the series column by column, not on the dataframe object, as I'll be applying other transformations to x in the lambda, not shown here.
EDIT 2 Full Context:
I've got a time series dataframe, and I'm trying to generate features from the data.
I've written a bunch of primitive functions like:
def sumn(n, s):
    return s.rolling(n).sum().rename(s.name + "_sum_" + str(n))
When I apply those to Series, it renames them well.
When I apply them to columns in a DataFrame, the numerical transformation goes through, but the rename doesn't work.
(I suppose it implies that a DataFrame isn't just a collection of Series, which means in all likelihood, I now have to explicitly rename things on the df)
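The behavior can be reproduced with a small sketch (using a hypothetical two-column frame): the lambda really does rename each Series, but apply reassembles the result under the original column labels.

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2], "b": [3, 4]})

# The rename itself works on a single Series...
renamed = df["a"].rename(df["a"].name + "_something")
print(renamed.name)  # a_something

# ...but apply rebuilds the result with the original column labels.
out = df.apply(lambda x: x.rename(x.name + "_something"))
print(list(out.columns))  # ['a', 'b']
```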
I think you can do this using pd.concat:
pd.concat([df[e].rename(df[e].name + '_Something') for e in df], axis=1)
Inside the list comprehension, you can add your other logic:
df[e].rename(df[e].name + '_Something').apply(...)
If you use df.apply directly, you can't change the column names; I can't think of any way to do it.
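A sketch of the concat approach using the sumn-style transform from the question (with a hypothetical two-column frame); concat keeps each Series' own name, so the renames survive:

```python
import pandas as pd

def sumn(n, s):
    # rolling sum that also renames the result, as in the question
    return s.rolling(n).sum().rename(s.name + "_sum_" + str(n))

df = pd.DataFrame({"a": [1, 2, 3, 4], "b": [5, 6, 7, 8]})

# concat along axis=1 rebuilds a frame from the renamed Series
out = pd.concat([sumn(2, df[c]) for c in df], axis=1)
print(list(out.columns))  # ['a_sum_2', 'b_sum_2']
```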
I need to add a new column to a pandas dataframe, where the value is calculated from the value of a column in the previous row.
Coming from a non-functional background (C#), I am trying to avoid loops since I read it is an anti-pattern.
My plan is to use series.shift to add a new column to the dataframe for the previous value, call dataframe.apply and finally remove the additional column. E.g.:
def my_function(row):
    # perform complex calculations with row.time, row.time_previous and other values
    # return the result

df["time_previous"] = df.time.shift(1)
df.apply(my_function, axis=1)
df.drop("time_previous", axis=1)
In reality, I need to create four additional columns like this. Is there a better alternative to accomplish this without a loop? Is this a good idea at all?
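The plan above can be sketched end-to-end with a toy time column and a hypothetical delta calculation standing in for the complex one:

```python
import pandas as pd

df = pd.DataFrame({"time": [1.0, 3.0, 6.0, 10.0]})

def my_function(row):
    # hypothetical calculation using the current and previous time
    return row.time - row.time_previous

df["time_previous"] = df.time.shift(1)      # previous row's value
df["delta"] = df.apply(my_function, axis=1)  # row-wise calculation
df = df.drop("time_previous", axis=1)        # remove the helper column
print(df["delta"].tolist())  # [nan, 2.0, 3.0, 4.0]
```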
I have a list of columns in a Pandas DataFrame and am looking to create a list of certain columns without manual entry.
My issue is that I am learning and not knowledgeable enough yet.
I have tried searching around the internet, but nothing quite matched my case. I apologize if this is a duplicate.
The list I am trying to cut from looks like this:
['model',
'displ',
'cyl',
'trans',
'drive',
'fuel',
'veh_class',
'air_pollution_score',
'city_mpg',
'hwy_mpg',
'cmb_mpg',
'greenhouse_gas_score',
'smartway']
Here is the code that I wrote on my own: dataframe.columns.tolist()[:6,8:10,11]
In this scenario I am trying to select everything except 'air_pollution_score' and 'greenhouse_gas_score'.
My ultimate goal is to understand the syntax and how to select pieces of a list.
You could do that, or you could just use drop to remove the columns you don't want:
dataframe.drop(['air_pollution_score', 'greenhouse_gas_score'], axis=1).columns
Note that you need to specify axis=1 so that pandas knows you want to remove columns, not rows.
Even if you wanted to use list syntax, I would say that it's better to use a list comprehension instead; something like this:
exclude_columns = ['air_pollution_score', 'greenhouse_gas_score']
[col for col in dataframe.columns if col not in exclude_columns]
This gets all the columns in the dataframe unless they are present in exclude_columns.
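The resulting list can go straight back into the indexing brackets; a quick sketch with a few of the question's columns:

```python
import pandas as pd

df = pd.DataFrame({"model": ["a"], "air_pollution_score": [5],
                   "city_mpg": [20], "greenhouse_gas_score": [4]})

exclude_columns = ["air_pollution_score", "greenhouse_gas_score"]
keep = [col for col in df.columns if col not in exclude_columns]
print(df[keep].columns.tolist())  # ['model', 'city_mpg']
```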
Let's say df is your dataframe. You can actually use filter and a lambda, though it quickly becomes rather long. I present this as a "one-liner" alternative to @gmds's answer.
df[
    list(filter(
        lambda x: ('air_pollution_score' not in x) and ('greenhouse_gas_score' not in x),
        df.columns.values
    ))
]
Here's what's happening:
filter applies a function to a list and keeps only the elements for which the function returns true.
We defined that function with lambda to check that neither 'air_pollution_score' nor 'greenhouse_gas_score' appears in the column name.
We're filtering the df.columns.values list, so the resulting list retains only the elements that aren't the ones we mentioned.
We're using the df[['column1', 'column2']] syntax, which means "make a new dataframe containing only the columns I list."
Simple solution with pandas
import pandas as pd
data = pd.read_csv('path to your csv file')
df = data[['column1', 'column2', 'column3']]
Note: data is the source you have already loaded with pandas; the selected columns are stored in a new dataframe df. The double brackets are required, since a list of names selects multiple columns.
I have two data frames: one (p) contains the columns to be transformed, and the second (a) contains the transformation parameters as a pd.Series:
import numpy as np
import pandas as pd
from scipy import stats

p = np.random.rand(5, 3)  # create data frame
cols = ["A", "B", "C"]
df1 = pd.DataFrame(p, columns=cols)
a = np.array([0.3, 0.4, 0.5])  # create series of transform parameters
a = pd.Series(a, index=cols)
I wonder how to iterate over the df columns to transform each one with the appropriate transform parameter, something like below:
df1.apply(stats.boxcox,lmbda=a)
which of course does not work. My temporary solution is just a brute-force function:
def boxcox_transform(df, lambdas):
    df1 = pd.DataFrame(index=df.index)
    for column in list(df):
        df1[column] = stats.boxcox(df[column], lambdas[column])
    return df1

boxcox_transform(df1, a)
I wonder whether there is a more elegant solution, something like R's mapply, which can iterate over two lists in parallel.
You can use a lambda:
result_df = df1.apply(lambda col: stats.boxcox(col, a.loc[col.name]))
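A dependency-free sketch of the same pattern, with a simple power function standing in for stats.boxcox (names mirror the question); the key point is that col.name lets each column look up its own parameter:

```python
import pandas as pd

df1 = pd.DataFrame({"A": [1.0, 2.0], "B": [3.0, 4.0]})
a = pd.Series({"A": 2.0, "B": 3.0})  # per-column parameter

# stand-in for stats.boxcox: any function of (column, parameter)
def transform(col, lmbda):
    return col ** lmbda

# each column is transformed with its own parameter, looked up by name
result_df = df1.apply(lambda col: transform(col, a.loc[col.name]))
print(result_df["A"].tolist())  # [1.0, 4.0]
print(result_df["B"].tolist())  # [27.0, 64.0]
```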
I have a Pandas dataframe with a multiindex. Level 0 is 'Strain' and level 1 is 'JGI library.' Each 'Strain' has several 'JGI library' columns associated with it. I would like to use a lambda function to apply a t-test to compare two different strains. To troubleshoot, I have been taking one row of my dataframe using the .iloc[0] command.
row = pvalDf.iloc[0]
parent = 'LL1004'
child = 'LL345'
ttest_ind(row.groupby(level='Strain').get_group(parent), row.groupby(level='Strain').get_group(child))[1]
This works as expected. Now I try to apply it to my whole dataframe
parent = 'LL1004'
child = 'LL345'
pvalDf = countsDf4.apply(lambda row: ttest_ind(row.groupby(level='Strain').get_group(parent), row.groupby(level='Strain').get_group(child))[1])
Now I get an error message saying, "ValueError: ('level name Strain is not the name of the index', 'occurred at index (LL1004, BCHAC)')"
'LL1004' is a 'Strain,' but Pandas doesn't seem to be aware of this. It looks like maybe the multiindex was not passed to the lambda function correctly? Is there a better way to troubleshoot lambda functions than using .iloc[0]?
I put a copy of my Jupyter notebook and an excel file with the countsDf4 dataframe on Github https://github.com/danolson1/pandas_ttest
Thanks,
Dan
How about, more simply:
pvalDf = countsDf4.apply(lambda row: ttest_ind(row[parent], row[child]), axis=1)
I've tested it on your notebook and it works.
Your problem is that DataFrame.apply() by default applies the function to each column, not to each row. So, you need to specify the axis=1 parameter to override the default behavior and apply the function row by row.
Also, there's no reason to use row.groupby(level='Strain').get_group(x) when you could simply index the group of columns by row[x]. :)
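To see why axis=1 plus plain indexing works, here is a toy MultiIndex frame; a mean difference stands in for ttest_ind so the sketch needs only pandas. With axis=1, each row is a Series indexed by the column MultiIndex, so row[parent] selects every column under that level-0 label:

```python
import pandas as pd

cols = pd.MultiIndex.from_tuples(
    [("LL1004", "lib1"), ("LL1004", "lib2"), ("LL345", "lib1"), ("LL345", "lib2")],
    names=["Strain", "JGI library"],
)
countsDf4 = pd.DataFrame([[1.0, 2.0, 5.0, 7.0]], columns=cols)

parent, child = "LL1004", "LL345"

# row[parent] partial-indexes the first level of the column MultiIndex
diffs = countsDf4.apply(lambda row: row[parent].mean() - row[child].mean(), axis=1)
print(diffs.tolist())  # [-4.5]
```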
Is there any way to change some, but not all, column names in a pandas dataframe using a lambda? For example, suppose a data frame has columns named osx, centos, ubunto, windows. If I want to rename every column by appending an x to its name, I can do:
df.rename(columns=lambda x: x+'x')
However, if I want to rename all column names other than ubunto, how can I do it? So what I want is a data frame whose columns are osxx, centosx, ubunto, windowsx. My real data frame has many more columns, so I'd rather not write them out one by one with the usual dictionary syntax; I'd like to lean on a lambda function if feasible.
Thanks.
A ternary operator might achieve your goal:
os_list = ['osx', 'centos', 'windows']
df.rename(columns=lambda x: x + 'x' if x in os_list else x)
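Alternatively, comparing against the one excluded name avoids maintaining a list at all; a quick sketch with the question's columns:

```python
import pandas as pd

df = pd.DataFrame(columns=["osx", "centos", "ubunto", "windows"])

# keep 'ubunto' as-is, append 'x' to everything else
out = df.rename(columns=lambda x: x if x == "ubunto" else x + "x")
print(out.columns.tolist())  # ['osxx', 'centosx', 'ubunto', 'windowsx']
```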