Rename last column in a dataframe passed along in method chain - python

How can I rename the last column in a dataframe, that was passed along in a method chain? Think about the following example (the real use case is more complex). How can the rename function refer to the dataframe that it processes (which is different from the "table" dataframe? Is there something like the following? Unfortunately "self" does not exist.
result = table.iloc[:,2:-1].rename(columns={self.columns[-1]: "Text"})

Use pipe():
result = table.iloc[:,2:-1].pipe(lambda df: df.rename(columns={df.columns[-1]: "Text"}))

I think that you can just do the following:
result = table.iloc[:,2:-1]
result.columns = result.columns[:-1] + ["Text"]

Related

Pandas Dataframe: Function doesn't preserve my custom column order when returning df

I followed this code from user Lala la (https://stackoverflow.com/a/55803252/19896454)
to put 3 columns at the front and leave the rest with no changes. It works well inside the function but when returns the dataframe, it loses column order.
My desperate solution was to put the code on the main program...
Other functions in my code are able to return modified versions of the dataframe with no problem.
Any ideas what is happening?
Thanks!
def define_columns_order(df):
cols_to_order = ['LINE_ID','PARENT.CATEGORY', 'CATEGORY']
new_columns = cols_to_order + (df.columns.drop(cols_to_order).tolist())
df = df[new_columns]
return df
try using return(df.reindex(new_columns, axis=1)) and keep in mind DataFrame modifications are not in place, unless you specify inplace=True, therefore you need to explicitly assign the result returned by your function to your df variable

How to filter multiple dataframes in a loop?

I have a lot of dataframes and I would like to apply the same filter to all of them without having to copy paste the filter condition every time.
This is my code so far:
df_list_2019 = [df_spain_2019,df_amsterdam_2019, df_venice_2019, df_sicily_2019]
for data in df_list_2019:
data = data[['host_since','host_response_time','host_response_rate',
'host_acceptance_rate','host_is_superhost','host_total_listings_count',
'host_has_profile_pic','host_identity_verified',
'neighbourhood','neighbourhood_cleansed','zipcode','latitude','longitude','property_type','room_type',
'accommodates','bathrooms','bedrooms','beds','amenities','price','weekly_price',
'monthly_price','cleaning_fee','guests_included','extra_people','minimum_nights','maximum_nights',
'minimum_nights_avg_ntm','has_availability','availability_30','availability_60','availability_90',
'availability_365','number_of_reviews','number_of_reviews_ltm','review_scores_rating',
'review_scores_checkin','review_scores_communication','review_scores_location', 'review_scores_value',
'instant_bookable','is_business_travel_ready','cancellation_policy','reviews_per_month'
]]
but it doesn't apply the filter to the data frame. How can I change the code to do that?
Thank you
The filter (column selection) is actually applied to every DataFrame, you just throw the result away by overriding what the name data points to.
You need to store the results somewhere, a list for example.
cols = ['host_since','host_response_time', ...]
filtered = [df[cols] for df in df_list_2019]
As soon as you write var = new_value, you do not change the original object but have the variable refering a new object.
If you want to change the dataframes from df_list_2019, you have to use an inplace=True method. Here, you could use drop:
keep = set(['host_since','host_response_time','host_response_rate',
'host_acceptance_rate','host_is_superhost','host_total_listings_count',
'host_has_profile_pic','host_identity_verified',
'neighbourhood','neighbourhood_cleansed','zipcode','latitude','longitude','property_type','room_type',
'accommodates','bathrooms','bedrooms','beds','amenities','price','weekly_price',
'monthly_price','cleaning_fee','guests_included','extra_people','minimum_nights','maximum_nights',
'minimum_nights_avg_ntm','has_availability','availability_30','availability_60','availability_90',
'availability_365','number_of_reviews','number_of_reviews_ltm','review_scores_rating',
'review_scores_checkin','review_scores_communication','review_scores_location', 'review_scores_value',
'instant_bookable','is_business_travel_ready','cancellation_policy','reviews_per_month'
])
for data in df_list_2019:
data.drop(columns=[col for col in data.columns if col not in keep], inplace=True)
But beware, pandas experts recommend to prefere the df = df. ... idiom to the df...(..., inplace=True) because it allows chaining the operations. So you should ask yourself if #timgeb's answer cannot be used. Anyway this one should work for your requirements.

Applying function to dataframe column?

I have the following function (one-hot encoding function that takes a column as an input). I basically want to apply it to a column in my dataframe, but can't seem to understand what's going wrong.
def dummies(dataframe, col):
dataframe[col] = pd.Categorical(dataframe[col])
pd.concat([dataframe,pd.get_dummies(dataframe[col],prefix = 'c')],axis=1)
df1 = df['X'].apply(dummies)
Guessing something is wrong with how I'm calling it?
you need to make sure you're returning a value from the function, currently you are not..also when you apply a function to a column you are basically passing the value of each row in the column into the function, so your function is set up wrong..typically you'd do it like this:
def function1(value):
new_value = value*2 #some operation
return new_value
then:
df['X'].apply(function1)
currently your function is set up to take entire df, and the name of a column, so likely your function might work if you call it like this:
df1 = dummies(df, 'X')
but you still need to add a return statement
If you want to apply it to that one column you don't need to make a new dataframe. This is the correct syntax. Please read the docs.
df['X'] = df['X'].apply(lambda x : dummies(x))

How do you effectively use pd.DataFrame.apply on rows with duplicate values?

The function that I'm applying is a little expensive, as such I want it to only calculate the value once for unique values.
The only solution I've been able to come up with has been as follows:
This step because apply doesn't work on arrays, so I have to convert the unique values into a series.
new_vals = pd.Series(data['column'].unique()).apply(function)
This one because .merge has to be used on dataframes.
new_dataframe = pd.DataFrame( index = data['column'].unique(), data = new_vals.values)
Finally Merging The results
yet_another= pd.merge(data, new_dataframe, right_index = True, left_on = column)
data['calculated_column'] = yet_another[0]
So basically I had to Convert my values to a Series, apply the function, convert to a Dataframe, merge the results and use that column to create me new column.
I'm wondering if there is some one-line solution that isn't as messy. Something pythonic that doesn't involve re-casting object types multiple times. I've tried grouping by but I just can't figure out how to do it.
My best guess would have been to do something along these lines
data[calculated_column] = dataframe.groupby(column).index.apply(function)
but that isn't right either.
This is an operation that I do often enough to want to learn a better way to do, but not often enough that I can easily find the last time I used it, so I end up re-figuring a bunch of things again and again.
If there is no good solution I guess I could just add this function to my library of common tools that I hedonistically > from me_tools import *
def apply_unique(data, column, function):
new_vals = pd.Series(data[column].unique()).apply(function)
new_dataframe = pd.DataFrame( data = new_vals.values, index =
data[column].unique() )
result = pd.merge(data, new_dataframe, right_index = True, left_on = column)
return result[0]
I would do something like this:
def apply_unique(df, orig_col, new_col, func):
return df.merge(df[[orig_col]]
.drop_duplicates()
.assign(**{new_col: lambda x: x[orig_col].apply(func)}
), how='inner', on=orig_col)
This will return the same DataFrame as performing:
df[new_col] = df[orig_col].apply(func)
but will be much more performant when there are many duplicates.
How it works:
We join the original DataFrame (calling) to another DataFrame (passed) that contains two columns; the original column and the new column transformed from the original column.
The new column in the passed DataFrame is assigned using .assign and a lambda function, making it possible to apply the function to the DataFrame that has already had .drop_duplicates() performed on it.
A dict is used here for convenience only, as it allows a column name to be passed in as a str.
Edit:
As an aside: best to drop new_col if it already exists, otherwise the merge will append suffixes to each new_col
if new_col in df:
df = df.drop(new_col, axis='columns')

How to get particular Column of DataFrame in pandas?

I have a data frame name it df I want to have like its first and second colums(series) in variable x and y.
I would have done that by name of the column like df['A'] or df['B'] or something like that.
But problem here is that data is itself header and it has no name.Header is like 2.17 ,3.145 like that.
So my Question is:
a) How to name column and start the data(which starts now from head) right after the name ?
b) How to get particular column's data if we don't know the name or it doesn't have the name ?
Thank you.
You might want to read the
documentation on indexing.
For what you specified in the question, you can use
x, y = df.iloc[:, [0]], df.iloc[:, [1]]
Set the names kwarg when reading the DataFrame (see the read_csv docs.
So instead of pd.read_csv('kndkma') use pd.read_csv('kndkma', names=['a', 'b', ...]).
It is usually easier to name the columns when you read or create the DataFrame, but you can also name (or rename) the columns afterwards with something like:
df.columns = ['A','B', ...]

Categories