Applying functions declared as strings to a pandas dataframe - python

I have a pandas dataframe. I want to create new columns in the dataframe with
mathematical functional values of the existing columns.
I know how to do it for simple cases:
import pandas as pd
import numpy as np
# Basic dataframe
df = pd.DataFrame(data={'col1': [1,2], 'col2':[3,5]})
for i in df.columns:
    df[f'{i}_sqrt'] = df[i].apply(lambda x: np.sqrt(x))
which produces new col1_sqrt and col2_sqrt columns.
Now I want to extend it to the cases where the functions are written as strings like:
one_func = ['(x)', '(np.sqrt(x))']
two_func = ['*'.join(i) for i in itertools.product(one_func, one_func)]
so that two_func = ['(x)*(x)','(x)*(np.sqrt(x))','(np.sqrt(x))*(x)', '(np.sqrt(x))*(np.sqrt(x))']. Is there any way I can create columns like the first example with these new functions?

That looks like a bad design, but I won't go down that road.
Answering your question, you can use df.eval
First of all, set
one_func = ['{x}', '(sqrt({x}))']
with {} instead of () such that you can replace {x} for your actual column name.
Then, for instance,
expr = two_func[0].format(x='col1')
df.eval(expr)
The full loop would look like:
for col in list(df.columns):
    for func in two_func:
        df[func.format(x=col)] = df.eval(func.format(x=col))
Note the list(df.columns) copy, since the loop adds columns while iterating, and that the formatted expression is used as the new column name so results for different columns don't overwrite each other.
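Putting it together, a minimal sketch of this approach (using **0.5 instead of sqrt so the example does not depend on the optional numexpr engine that df.eval needs for math functions):

```python
import itertools

import pandas as pd

df = pd.DataFrame(data={'col1': [1, 2], 'col2': [3, 5]})

# {x} placeholders are replaced with the actual column name before eval
one_func = ['({x})', '(({x})**0.5)']
two_func = ['*'.join(p) for p in itertools.product(one_func, one_func)]

# Copy the column list first, since the loop adds new columns as it goes
for col in list(df.columns):
    for func in two_func:
        expr = func.format(x=col)   # e.g. '(col1)*((col1)**0.5)'
        df[expr] = df.eval(expr)    # evaluate the expression against df

print(df.columns.tolist())  # 2 original + 8 generated columns
```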

Related

Calling a Python function/class that takes an entire pandas dataframe or series as input, for all rows in another dataframe

I have a Python class that takes a geopandas Series or Dataframe to initialize (specifically working with geopandas, but I imagine it to be the same solution as pandas). This class has attributes/methods that utilize the various columns in the series/dataframe. Outside of this, I have a dataframe with many rows. I would like to iterate through (ideally in an efficient/parallel manner as each row is independent of each other) this dataframe, and call a method in the class for each row (aka Series). And append the results as a column to the dataframe. But I am having trouble with this. With the standard list comprehension/pandas apply() methods, I can call like this e.g.:
gdf1['function_return_col'] = list(map((lambda f: my_function(f)), gdf2['date']))
But if said function (or in my case, class) needs the entire gdf, and I call like this:
gdf1['function_return_col'] = list(map((lambda f: my_function(f)), gdf2))
It does not work because 'my_function()' takes a dataframe or series, while what is being sent to it is the column names (strings) of gdf2.
How can I apply a function to all rows in a dataframe if said function takes an entire dataframe/series and not just select column(s)? In my specific case, since it's a method in a class, I would like to do this, or something similar to call this method on all rows in a dataframe:
gdf1['function_return_col'] = list(map((lambda f: my_class(f).my_method()), gdf2))
Or am I just thinking of this in the entirely wrong way?
Have you tried the pandas DataFrame method apply?
Here is an example of using it along both the row axis and the column axis.
import pandas as pd
import numpy as np
df = pd.DataFrame({'A': [1, 2], 'B': [10, 20]})
df1 = df.apply(np.sum, axis=0)
print(df1)
df1 = df.apply(np.sum, axis=1)
print(df1)
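For the original question, the same idea works with axis=1, which passes each row to the function as a Series, so a class that takes a Series can be instantiated per row. A sketch with a hypothetical RowModel class standing in for my_class:

```python
import pandas as pd

# Hypothetical stand-in for the question's class: it is initialized with an
# entire row (a Series) and a method uses several of its columns.
class RowModel:
    def __init__(self, row):
        self.row = row

    def score(self):
        return self.row['a'] * 2 + self.row['b']

gdf2 = pd.DataFrame({'a': [1, 2, 3], 'b': [10, 20, 30]})

# axis=1 hands each row to the lambda as a pandas Series
gdf2['function_return_col'] = gdf2.apply(lambda row: RowModel(row).score(),
                                         axis=1)
print(gdf2['function_return_col'].tolist())  # [12, 24, 36]
```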

Is there a function for making a new dataframe using pandas to select only part of a word?

I am looking to select all values that include "hennessy" in the name, i.e. "Hennessy Black Cognac", "Hennessy XO". I know it would simply be
trial = Sales[Sales["Description"] == "Hennessy"]
if I wanted only the value "Hennessy", but I want it if it contains the word "Hennessy" at all.
working on python with pandas imported
Thanks :)
You can use the str.contains string method to check whether each value contains the substring, lower-casing first so the match is case-insensitive.
Like this:
trial = Sales[Sales["Description"].str.lower().str.contains("hennessy")]
you can try using str.startswith
import pandas as pd
# initialize list of lists
data = [['Hennessy Black Cognac', 10], ['Hennessy XO', 15], ['julian merger', 14]]
# Create the pandas DataFrame
df = pd.DataFrame(data, columns=['Name', 'Age'])
new_df = df.loc[df.Name.str.startswith('Hennessy', na=False)]
new_df
Or you can use apply to apply any string-matching function to your column elementwise:
df_new = df[df['Name'].apply(lambda x: x.startswith('Hennessy'))]
df_new
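Since the question asks for rows that contain "Hennessy" anywhere in the name (not only at the start), str.contains may be a closer fit; a sketch with made-up sample data:

```python
import pandas as pd

Sales = pd.DataFrame({'Description': ['Hennessy Black Cognac',
                                      'Hennessy XO',
                                      'Remy Martin VSOP']})

# case=False makes the match case-insensitive; na=False treats missing
# descriptions as non-matches instead of propagating NaN into the mask
trial = Sales[Sales['Description'].str.contains('hennessy',
                                                case=False, na=False)]
print(trial['Description'].tolist())
```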

How to extract inside of column to several columns

I have an Excel file that I import into a dataframe. I want to split the contents of one column into several columns.
Here is the original:
After importing it into pandas, I get this data with '\n':
So I want to extract the contents of that column. Could you share an idea or some code?
My expected columns are:
Don't worry, no one is born knowing everything about SO. Considering the data you gave, especially that 'Vector:...' is not separated by '\n', the following works:
import pandas as pd
import numpy as np
data = pd.read_excel("the_data.xlsx")
ok = []
l = len(data['Details'])
for n in range(l):
    x = data['Details'][n].split()
    # removeprefix (Python 3.9+) strips the literal prefix; lstrip('Vector:')
    # would strip any of those characters and could mangle the value
    x[2] = x[2].removeprefix('Vector:')
    x = [v for v in x if v not in ['Type:', 'Mission:']]
    ok += x
values = np.array(ok).reshape(l, 3)
df = pd.DataFrame(values, columns=['Type', 'Vector', 'Mission'])
data.drop('Details', axis=1, inplace=True)
final = pd.concat([data, df], axis=1)
The process goes like this:
First, you split each element of the Details column into a list of strings. Second, you handle the 'Vector:...' special case and filter out the column-name tokens. Third, you collect all the values in a list, which is in turn converted to a NumPy array of shape (length, 3). Finally, you drop the old 'Details' column and concatenate with the dataframe created from the split strings.
You may be able to transform the data more efficiently at read time by applying these ideas inside the pd.read_excel call via its converters argument.
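Alternatively, since the tokens appear in a fixed order, a vectorized str.extract can replace the Python-level loop. A sketch assuming the Details strings look the way the answer above describes ('Type:' and 'Mission:' as separate tokens, 'Vector:' fused to its value; the sample values are made up):

```python
import pandas as pd

# Made-up stand-in for the imported Excel data
data = pd.DataFrame({'Details': ['Type: Recon Vector:Air Mission: Alpha',
                                 'Type: Supply Vector:Sea Mission: Bravo']})

# Named groups become column names; one regex pass instead of a row loop
extracted = data['Details'].str.extract(
    r'Type:\s*(?P<Type>\S+)\s+Vector:\s*(?P<Vector>\S+)'
    r'\s+Mission:\s*(?P<Mission>\S+)')

final = pd.concat([data.drop(columns='Details'), extracted], axis=1)
print(final)
```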

Check Dataframe for certain string and return the column headers of the columns that string is found in

I have a dataframe that looks something like this:
Now I simply want to return the headers of the columns that have the string "worked" to a list.
So that in this case the list only includes lst=["OBE"]
You can obtain it like this:
import pandas as pd
import numpy as np
df = pd.DataFrame({'OBE': ['Worked', 'Worked', np.nan, 'Uploaded'],
                   'TDG': ['Uploaded']*4,
                   'TMA': [np.nan]*4, 'TMCZ': ['Uploaded']*4})
columns_with_worked = (df == 'Worked').any(axis=0)
columns_with_worked[columns_with_worked].index.tolist()
['OBE']
So the solution constructs a boolean Series indicating which columns contain the term "Worked". Then we keep only the portion of the Series where the value is True, select the labels by invoking index, and return that object as a list.

Change one column of a DataFrame only

I'm using Pandas with Python 3. I have a dataframe with a bunch of columns, but I only want to change the data type of all the values in one of the columns and leave the others alone. The only way I could find to accomplish this is to edit the column, remove the original column and then merge the edited one back. I would like to edit the column without having to remove and merge, leaving the the rest of the dataframe unaffected. Is this possible?
Here is my solution now:
import numpy as np
import pandas as pd
from pandas import Series,DataFrame
def make_float(var):
    var = float(var)
    return var
# create a new series with the value type I want
df2 = df1['column'].apply(make_float)
# remove the original column
df3 = df1.drop('column', axis=1)
# merge the dataframes
df1 = pd.concat([df3, df2], axis=1)
It also doesn't work to apply the function to the dataframe directly. For example:
df1['column'].apply(make_float)
print(type(df1.iloc[1]['column']))
yields:
<class 'str'>
df1['column'] = df1['column'].astype(float)
This will raise an error if the conversion fails for any row.
Apply does not work in place, but rather returns a Series that you discard in this line:
df1['column'].apply(make_float)
Apart from Yakym's solution, you can also do this (note that it only works if the column already holds numeric values, e.g. ints):
df['column'] += 0.0
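If some rows cannot be parsed as floats, pd.to_numeric with errors='coerce' converts them to NaN instead of raising; a small sketch with made-up data:

```python
import pandas as pd

df1 = pd.DataFrame({'column': ['1.5', '2', 'oops'], 'other': [1, 2, 3]})

# errors='coerce' turns unparseable values into NaN rather than raising,
# leaving every other column untouched
df1['column'] = pd.to_numeric(df1['column'], errors='coerce')
print(df1['column'].tolist())  # [1.5, 2.0, nan]
```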
