applymap() does not work on Pandas MultiIndex Slice - python

I have a hierarchical dataset:
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.rand(6, 6),
                  columns=[['A', 'A', 'A', 'B', 'B', 'B'],
                           ['mean', 'max', 'avg'] * 2],
                  index=pd.date_range('20000103', periods=6))
I want to apply a function to all values under column A. I can set the values to something:
df.loc[slice(None), 'A'] = 1
Easy enough. Now, instead of assigning a value, if I want to apply a mapping to this MultiIndex slice, it does not work.
For example, let me apply a simple formatting statement:
df.loc[slice(None), 'A'].applymap('{:.2f}'.format)
This step works fine. However, I cannot assign this to the original df:
df.loc[slice(None), 'A'] = df.loc[slice(None), 'A'].applymap('{:.2f}'.format)
Everything turns into a NaN. Any help would be appreciated.

You can do it in a couple of ways:
df['A'] = df['A'].applymap('{:.2f}'.format)
or (this will keep the original dtype)
df['A'] = df['A'].round(2)
or as a string
df['A'] = df['A'].round(2).astype(str)
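A likely explanation for the NaN values, offered here as an assumption rather than something the answer states: df.loc[slice(None), 'A'] returns a frame whose columns are only mean, max and avg, and .loc assignment aligns on labels, so nothing lines up with the ('A', ...) columns. The sketch below shows the dropped level and the df['A'] assignment that avoids the problem. It uses a column-wise apply with Series.map in place of applymap, which newer pandas versions deprecate.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    np.random.rand(6, 6),
    columns=[["A"] * 3 + ["B"] * 3, ["mean", "max", "avg"] * 2],
    index=pd.date_range("20000103", periods=6),
)

# Selecting "A" drops the outer column level, leaving only the inner labels.
formatted = df["A"].apply(lambda col: col.map("{:.2f}".format))
print(formatted.columns.tolist())

# Assigning through df["A"] writes the sub-frame back under the "A" block.
df["A"] = formatted
print(df.head())
```

The B columns are left untouched and keep their float dtype.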

Related

difference between df.loc[:, columns] and df.loc[:][columns]

I want to normalize some columns of a pandas data frame using MinMaxScaler in this way:
scaler = MinMaxScaler()
numericals = ["TX_TIME_SECONDS",'TX_Amount']
When I do it this way:
df.loc[:][numericals] = scaler.fit_transform(df.loc[:][numericals])
the assignment is not done in place and df is not changed,
whereas when I do it this way:
df.loc[:, numericals] = scaler.fit_transform(df.loc[:][numericals])
the numerical columns of df are changed in place.
So what is the difference between df.loc[:, ~] and df.loc[:][~]?
df.loc[:][numericals] first selects all rows, then selects the columns "TX_TIME_SECONDS" and 'TX_Amount' of the returned object, and assigns the values to that. The problem is that the returned object might be a copy, so this chained assignment may not change the actual DataFrame.
The correct way of making this assignment is df.loc[:, numericals], because a single .loc call with both the row and column indexers operates on the original DataFrame.
The documentation covers this in more detail:
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.loc.html
https://www.geeksforgeeks.org/python-pandas-dataframe-loc/
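As a small illustration of the single-indexer form, here is a sketch with made-up data. A hand-written min-max formula stands in for MinMaxScaler so the example needs only pandas; the values and the extra label column are assumptions for demonstration.

```python
import pandas as pd

df = pd.DataFrame({
    "TX_TIME_SECONDS": [0.0, 30.0, 60.0],
    "TX_Amount": [10.0, 20.0, 50.0],
    "label": ["a", "b", "c"],
})
numericals = ["TX_TIME_SECONDS", "TX_Amount"]

# Stand-in for MinMaxScaler: rescale each column to the range [0, 1].
block = df[numericals]
scaled = (block - block.min()) / (block.max() - block.min())

# One .loc call with row and column indexers assigns into df itself.
df.loc[:, numericals] = scaled
print(df)
```

The chained form df.loc[:][numericals] = ... is left out on purpose, since whether it reaches the original frame depends on the pandas version.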

Should I redefine a pandas dataframe with every function?

From experience, some pandas functions require that I redefine the dataframe if I intend to keep their result, because by default they return a new object instead of modifying the original. For example: df.drop("ColA", axis=1) will not actually drop the column from df; I need to write df = df.drop("ColA", axis=1) or df.drop("ColA", axis=1, inplace=True) to modify the dataframe.
This seems to be the case with some other pandas functions. Therefore, what I usually do is redefine a dataframe for every function so that I can ensure it is modified. For example:
df = df.set_index("id")
df = df.sort_values(by="Date")
df["B"] = df["B"].fillna(-1)
df = df.reset_index(drop = True)
df["ColA"] = df["ColA"].astype(str)
I know some of these functions do not require redefining the dataframe, but I just do it to make sure the changes are applied. My question is if there is a way to know which functions require redefining the dataframe and which don't need it, and also if there is any computational difference between using df = df.set_index("id") and df.set_index("id") if they have the same output.
Also is there a difference between df["B"] = df["B"].fillna(-1) and df = df["B"].fillna(-1)?
My question is if there is a way to know which functions require redefining the dataframe and which don't need it
It's called the manual.
set_index() has an inplace parameter; if it is set to True, you won't need to reassign.
sort_values() has that too.
fillna() has that too.
reset_index() has that too.
astype() has copy=True by default, but heed the warning about setting it to False:
"be very careful setting copy=False as changes to values then may propagate to other pandas objects"
if there is any computational difference between
Yes – if Pandas is able to make the changes in-place, it won't need to copy the series or dataframe, which could be a significant time and memory expense with large dataframes.
Also is there a difference between df["B"] = df["B"].fillna(-1) and df = df["B"].fillna(-1)?
Yes, there is. The first assigns a series back into a column of the dataframe; the second just binds the single series to the (now misnamed) name df.
There is a long discussion about this on the pandas GitHub.
I also agree that it is best not to use inplace, because it is confusing and it is unclear how or when it actually saves memory.
Should I redefine a pandas dataframe with every function?
I think yes, although there may be exceptions when working with large DataFrames.
There is also a list of methods that have an inplace parameter.
Also is there a difference between df["B"] = df["B"].fillna(-1) and df = df["B"].fillna(-1)
With df["B"] = df["B"].fillna(-1), column B (a Series) is assigned back into the DataFrame with its missing values replaced by -1.
With df = df["B"].fillna(-1), fillna returns a Series with the replaced values, but it is assigned to the name df, so the original DataFrame is overwritten by that Series.
I don't think there is a solution for this. Some methods work in place by default, while others return a copy of the df that you need to reassign, as you usually do. The best option is to check the docs (for the inplace parameter) every time you want to use a method; with practice you will learn at least the most common ones, such as sorting, resetting the index, etc.
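The points above can be checked on a tiny frame. This is only a sketch with invented data; the names dropped and series_only are hypothetical and exist just to keep the three results apart.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"id": [3, 1, 2], "B": [1.0, np.nan, 3.0]})

# drop() returns a new frame; df itself still has column "B".
dropped = df.drop("B", axis=1)
print(list(dropped.columns), list(df.columns))

# Assigning back into the column keeps df a DataFrame with B filled.
df["B"] = df["B"].fillna(-1)

# Binding the result to a plain name gives a Series, not a DataFrame.
series_only = df["B"].fillna(-1)
print(type(df).__name__, type(series_only).__name__)
```

Writing df = df["B"].fillna(-1) would rebind df to that Series and lose the id column.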

Process for multiple columns

I have this code which works for one pandas series. How to apply it to all columns of my large dataset? I have tried many solutions, but none works for me.
c = data["High_banks"]
c2 = pd.to_numeric(c.str.replace(',',''))
data = data.assign(High_banks = c2)
What is the best way to do this?
I think you can do it like this:
df = df.replace(",", "", regex=True)
After that you can convert the datatype.
You can use a combination of the methods apply and applymap.
Take this for an example:
df = pd.DataFrame([['1,', '2,12'], ['3,356', '4,567']], columns=['a', 'b'])
new_df = (df.applymap(lambda x: x.replace(',', ''))
            .apply(pd.to_numeric, axis=1))
new_df.dtypes
# successfully converted to numeric types:
a    int64
b    int64
dtype: object
The first method, applymap, runs element-wise on the dataframe to remove the commas; then apply runs pd.to_numeric on each row (axis=1). Note that applymap is deprecated since pandas 2.1 in favor of DataFrame.map.
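Another option, sketched here under the assumption that every column holds comma-formatted number strings, is to run the single-column code from the question on each column with one apply call. The sample values and the second column name are invented.

```python
import pandas as pd

data = pd.DataFrame({
    "High_banks": ["1,200", "3,356"],
    "Low_banks": ["12", "4,567"],   # hypothetical second column
})

# For each column: strip the commas as plain text, then parse as numbers.
cleaned = data.apply(
    lambda col: pd.to_numeric(col.str.replace(",", "", regex=False))
)
print(cleaned.dtypes)
```

Columns that are not strings would need to be skipped or selected explicitly before this step.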

Using pd.Dataframe.replace with an apply function as the replace value

I have several dataframes that have mixed in some columns with dates in this ASP.NET format "/Date(1239018869048)/". I've figured out how to parse this into python's datetime format for a given column. However I would like to put this logic into a function so that I can pass it any dataframe and have it replace all the dates that it finds that match a regex using pd.Dataframe.replace.
something like:
def pretty_dates():
    #Messy logic here
df.replace(to_replace=r'\/Date(d+)', value=pretty_dates(df), regex=True)
Problem with this is that the df that is being passed to pretty_dates is the whole dataframe not just the cell that is needed to be replaced.
So the concept I'm trying to figure out is if there is a way that the value that should be replaced when using df.replace can be a function instead of a static value.
Thank you so much in advance
EDIT
To try to add some clarity: I have many columns in a dataframe, over a hundred, that contain this date format. I would like not to list out every single column that has a date. Is there a way to apply the function that cleans my dates across all the columns in my dataset? So I do not want to clean one column but all the hundreds of columns of my dataframe.
I'm sure you can use regex to do this in one step, but here is how to apply it to the whole column at once:
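df = pd.Series(['/Date(1239018869048)/',
                '/Date(1239018869048)/'], dtype=str)
# regex=True is needed in pandas 2.0+, where str.replace treats patterns literally by default
df = df.str.replace(r'/Date\(', '', regex=True)
df = df.str.replace(r'\)/', '', regex=True)
print(df)
0    1239018869048
1    1239018869048
dtype: object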
As far as I understand, you need to apply a custom function to selected cells in a specified column. I hope the following example helps:
import pandas as pd
df = pd.DataFrame({'x': ['one', 'two', 'three']})
selection = df.x.str.contains('t', regex=True) # put your regexp here
df.loc[selection, 'x'] = df.loc[selection, 'x'].map(lambda x: x+x) # do some logic instead
You can apply this procedure to all columns of the df in a loop:
for col in df.columns:
    selection = df.loc[:, col].str.contains('t', regex=True) # put your regexp here
    df.loc[selection, col] = df.loc[selection, col].map(lambda x: x+x) # do some logic instead
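Putting the two answers together, a function that scans every column could look like the sketch below. It is one possible shape, not the asker's actual logic: the name pretty_dates is borrowed from the question, the regular expression and the rule "convert a column only when every cell matches" are assumptions, and the result is a pandas datetime column rather than Python datetime objects.

```python
import pandas as pd

# Matches a whole cell such as "/Date(1239018869048)/" and captures the digits.
ASPNET_DATE = r"^/Date\((\d+)\)/$"

def pretty_dates(df):
    """Return a copy of df where each all-ASP.NET-date column is parsed."""
    out = df.copy()
    for col in out.columns:
        # Capture the millisecond count; cells that do not match become NaN.
        millis = out[col].astype(str).str.extract(ASPNET_DATE)[0]
        if len(millis) > 0 and millis.notna().all():
            out[col] = pd.to_datetime(pd.to_numeric(millis), unit="ms")
    return out

raw = pd.DataFrame({
    "created": ["/Date(1239018869048)/", "/Date(1239018869048)/"],
    "name": ["one", "two"],
})
clean = pretty_dates(raw)
print(clean.dtypes)
```

Columns with missing or mixed values are left as they are in this version; handling them would need a per-cell mask like the one in the loop above.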

How do you effectively use pd.DataFrame.apply on rows with duplicate values?

The function that I'm applying is a little expensive, as such I want it to only calculate the value once for unique values.
The only solution I've been able to come up with has been as follows:
This step because apply doesn't work on arrays, so I have to convert the unique values into a series.
new_vals = pd.Series(data['column'].unique()).apply(function)
This one because .merge has to be used on dataframes.
new_dataframe = pd.DataFrame( index = data['column'].unique(), data = new_vals.values)
Finally Merging The results
yet_another= pd.merge(data, new_dataframe, right_index = True, left_on = column)
data['calculated_column'] = yet_another[0]
So basically I had to convert my values to a Series, apply the function, convert the result to a DataFrame, merge the results, and use that column to create my new column.
I'm wondering if there is some one-line solution that isn't as messy. Something pythonic that doesn't involve re-casting object types multiple times. I've tried grouping by but I just can't figure out how to do it.
My best guess would have been to do something along these lines
data[calculated_column] = dataframe.groupby(column).index.apply(function)
but that isn't right either.
This is an operation that I do often enough to want to learn a better way to do, but not often enough that I can easily find the last time I used it, so I end up re-figuring a bunch of things again and again.
If there is no good solution I guess I could just add this function to my library of common tools that I hedonistically "from me_tools import *":
def apply_unique(data, column, function):
    new_vals = pd.Series(data[column].unique()).apply(function)
    new_dataframe = pd.DataFrame(data=new_vals.values,
                                 index=data[column].unique())
    result = pd.merge(data, new_dataframe, right_index=True, left_on=column)
    return result[0]
I would do something like this:
def apply_unique(df, orig_col, new_col, func):
    return df.merge(df[[orig_col]]
                    .drop_duplicates()
                    .assign(**{new_col: lambda x: x[orig_col].apply(func)}),
                    how='inner', on=orig_col)
This will return the same values as performing:
df[new_col] = df[orig_col].apply(func)
but will be much more performant when there are many duplicates. Note that merge returns a new DataFrame with a fresh integer index instead of modifying df.
How it works:
We join the original DataFrame (calling) to another DataFrame (passed) that contains two columns; the original column and the new column transformed from the original column.
The new column in the passed DataFrame is assigned using .assign and a lambda function, making it possible to apply the function to the DataFrame that has already had .drop_duplicates() performed on it.
A dict is used here for convenience only, as it allows a column name to be passed in as a str.
Edit:
As an aside: it is best to drop new_col if it already exists, otherwise the merge will append suffixes to each new_col:
if new_col in df:
    df = df.drop(new_col, axis='columns')
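A lighter variant of the same idea, given as a sketch and not as part of the answer above, computes the function once per distinct value into a dict and lets Series.map do the lookup. The helper name map_unique and the call-counting function are hypothetical, and the values are assumed to be hashable and free of missing entries.

```python
import pandas as pd

def map_unique(series, func):
    """Apply func once per distinct value, then broadcast via a lookup."""
    lookup = {value: func(value) for value in series.unique()}
    return series.map(lookup)

calls = []

def slow_square(x):
    # Stand-in for an expensive function; records each call it receives.
    calls.append(x)
    return x * x

data = pd.DataFrame({"column": [2, 3, 2, 2, 3]})
data["calculated_column"] = map_unique(data["column"], slow_square)
print(data)
print(len(calls))
```

Because the result is assigned as a column, the original index and row order of data are kept.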
