Passing an optional DataFrame parameter in Python

Rather than explicitly passing DataFrame columns in the code below, I'm trying to add the option of passing the DataFrame itself and referring to its columns by name, without much success.
The code below raises a
"ValueError: Wrong number of dimensions" error.
I've tried a couple of other ideas, but they all lead to errors of one form or another.
Apart from this issue, when the arguments are passed as explicit DataFrame columns (p as a single column, q as a list of columns), the code works as desired. Is there a clever (or indeed any) way of passing in the DataFrame so the columns can be resolved against it implicitly?
def cdf(p, q=[], datafr=None):
    if datafr != None:
        p = datafr[p]
        for i in range(len(q)):
            q[i] = datafr[q[i]]
    ...
    # (calculate conditional probability tables for p|q)
To summarize:
Current usage:
cdf(df['var1'], [df['var2'], df['var3']])
Desired usage:
cdf('var1', ['var2', 'var3'], datafr=df)

Change if datafr != None: to if datafr is not None:
With !=, pandas compares every value in the DataFrame against None element-wise, and the resulting boolean DataFrame has no single truth value, so it throws an error. is not checks whether datafr and None point to the same object, which is a more stringent identity check. See this explanation.
Additional tips:
Python iterates over lists directly:
# change this
for i in range(len(q)):
    q[i] = datafr[q[i]]
# to this:
q = [datafr[name] for name in q]
If q is a required parameter, don't use q=[] when defining your function; a mutable default like an empty list is created once and shared across calls. If it is an optional parameter, ignore me.
Python can use position to match the arguments passed in a function call with the parameters in the definition.
cdf('var1', ['var2', 'var3'], datafr=df)
#can be written as:
cdf('var1', ['var2', 'var3'], df)
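Putting the pieces of this answer together, a minimal sketch of the corrected function (the sample data and the return value are assumptions for illustration; the real function would go on to build the conditional probability tables):

```python
import pandas as pd

def cdf(p, q=None, datafr=None):
    # Resolve column names to Series when a DataFrame is passed in.
    q = list(q) if q is not None else []
    if datafr is not None:
        p = datafr[p]
        q = [datafr[name] for name in q]
    # ... conditional probability tables for p|q would be computed here
    return p, q

df = pd.DataFrame({'var1': [1, 2], 'var2': [3, 4], 'var3': [5, 6]})

# Both calling styles from the question now work:
p, q = cdf('var1', ['var2', 'var3'], datafr=df)
p2, q2 = cdf(df['var1'], [df['var2'], df['var3']])
```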

Related

I want to create a DataFrame to check whether random state affects the score

I'm trying to create a DataFrame to check whether different shuffles affect the model's r-squared value for the training as well as the testing dataset.
I tried running a for loop to do so but got errors.
I'm confused about how to create two columns using the pd.DataFrame method, where the index is the random-state value and the columns hold the train and test r-squared for that random state.
The code I'm writing:
%%time
for i in range(0, 100):
    X_train, X_test, y_train, y_test = train_test_split(features, prices, test_size=0.2, random_state=i)
    regr.fit(X_train, y_train)
    random_try = pd.DataFrame(data=[regr.score(X_train, y_train)],
                              [regr.score(X_test, y_test)],
                              index=[i for i in range(1, 100)],
                              columns=['Training score', 'Testing score'])
Just out of curiosity, I want to know. Thank you.
The problem lies in the way you create the DataFrame:
random_try = pd.DataFrame(data=[regr.score(X_train,y_train)],
                          [regr.score(X_test,y_test)],   <====
                          index=[i for i in range(1,100)],
                          columns=['Training score','Testing score'])
By writing data= and index=, you are passing values as keyword arguments. On the line I point out with <====, the value is passed directly, like function(1); an argument passed that way is called a positional argument.
Python doesn't allow a positional argument after a keyword argument, which is why you see the error.
Actually, I think you want brackets to wrap the two score lists into a single data list:
random_try = pd.DataFrame(data=[[regr.score(X_train,y_train)],
[regr.score(X_test,y_test)]],
index=[i for i in range(1,100)],
columns=['Training score','Testing score'])
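Beyond the syntax fix, one working pattern is to collect one score pair per random_state and build the DataFrame once after the loop. A sketch with dummy numbers standing in for regr.score, since the question's model and data aren't shown:

```python
import pandas as pd

# Gather one (train, test) score per random_state, then build the
# DataFrame once. Dummy values stand in for the real regr.score(...) calls.
train_scores, test_scores = [], []
for i in range(5):  # the question uses range(0, 100)
    # train_test_split(..., random_state=i) and regr.fit(...) would run here
    train_scores.append(0.9 - 0.01 * i)
    test_scores.append(0.8 - 0.01 * i)

random_try = pd.DataFrame({'Training score': train_scores,
                           'Testing score': test_scores},
                          index=range(5))
```

Building the frame inside the loop, as the question does, also overwrites random_try on every iteration, so only the last random_state would survive.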

If statement running but not processing

I've been scratching my head all morning trying to understand why the following doesn't seem to work. The idea here is that if there is no column provided, run 1 set of code and if there is, then run another.
Let's say the value of df just holds 1 value = 'A'
def function_name(df, col):
    if col == None:
        df = df.str.lower()
    else:
        df[col] = df[col].str.lower()
function_name(df, None)
Expected Results: 'a'
Current Results: 'A'
If I was to run function_name(df, 'A'):
Expected Results: 'a'
Current Results: 'a'
Ideally, since None was passed in, running the function should apply the commands I wrote, but currently it's acting as if nothing happened. When I debug by printing, I can see the code is doing the 'stuff', but the function itself isn't producing the result of whatever commands were run. Any thoughts?
Since I can't comment, I'll post this as an answer.
df is acting as two different types: in the if block it is treated as a single value, whereas in the else block it is indexed like an array, so please check that. You can do df[0] in the first block.
Moreover, you mentioned that the function isn't returning anything; that is due to the missing return statement. If you intend to return the new df, add return df at the end of the function.
Let's say you're passing a df that just holds one value, 'A'. So, the code:
def function_name(df, col):
    if col == None:
        df = str(df).lower()
    else:
        df[col] = str(df).lower()
    print(df)
function_name('A', None)
You're expecting the output a when you pass 'A', and the function does not return any value, so I tried printing the value of df, and it matches the expected output. The dot operator accesses an object's attributes, and a plain Python str has no .str accessor, so calling df.str.lower() on the string 'A' leads to AttributeError: 'str' object has no attribute 'str'. I hope this answer helps.
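A minimal sketch of the missing-return fix suggested above, assuming df is a pandas Series of strings in the no-column case (the sample data is made up for illustration):

```python
import pandas as pd

def function_name(df, col=None):
    # Rebinding df inside the function only changes the local name,
    # so hand the result back with an explicit return.
    if col is None:
        return df.str.lower()        # df assumed to be a Series of strings
    df[col] = df[col].str.lower()    # mutates the DataFrame column in place
    return df

s = pd.Series(['A', 'B'])
out = function_name(s)               # Series path

frame = pd.DataFrame({'x': ['A']})
out2 = function_name(frame, 'x')     # column path
```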

Passing second argument to function in pool.map

I have a pandas DataFrame with many rows. I am using multiprocessing to process grouped tables from this DataFrame concurrently. It works fine, but I have a problem passing in a second parameter; I have tried passing both arguments as a tuple, but it doesn't work. My code is as follows:
I want to also pass the parameter "col" to the function "process_table":
for col in cols:
    tables = df.groupby('test')
    p = Pool()
    lines = p.map(process_table, tables)
    p.close()
    p.join()

def process_table(t):
    # Bunch of processing to create a line for matplotlib
    return line
You could do this; it takes an iterable and expands it into individual arguments:
def expand(x):
    return process_table(*x)

p.map(expand, tables)
You might be tempted to do this:
p.map(lambda x: process_table(*x), table) # DOES NOT WORK
But it won't work because lambdas are unpickleable (if you don't know what this means, trust me).

Using length of string in each row of column of df as argument in a function

I am having some serious trouble with this! Suppose I have a pandas dataframe like the following:
Name LeftString RightString
nameA AATCGCTGCG TGCTGCTGCTT
nameB GTCGTGBAGB BTGHTAGCGTB
nameC ABCTHJKLAA BFTCHHFCTSH
....
I have a function that takes the following as arguments:
def localAlign(minAlignment, names, string1, string2):
    # do something great
In my function, minAlignment is an integer; names, string1, and string2 are DataFrame columns used as list-like objects by the function.
I then call the function at a later point:
left1_2_compare = localAlign(12, df['Name'], df['LeftString'], df['RightString'])
My function runs with no issues, but the 12 is passed in as a hard-coded value or as a sys argument; what I would rather have is a value that is 60% of the length of the string in df['LeftString'].
So what I have tried in regards to this is to pass in a calculation that would return an int to the function argument:
left1_2_compare = localAlign((int(len(df['LeftString'])*0.6)),
df['Name'], df['LeftString'],
df['RightString'])
The interesting part is that the code doesn't fail or return errors; it just doesn't output anything for that value (the output file is blank for this part). The rest of the data is produced fine.
Since df is defined before the function is called, is there a way to use the length of the string in row 1...row n as the input integer for the function, without defining it inside the function?
You need a Series: create it with str.len, multiply by mul, and cast to integers with astype:
left1_2_compare = localAlign((df['LeftString'].str.len().mul(.6)).astype(int),
df['Name'],
df['LeftString'],
df['RightString'])
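To see what the answer's expression produces row-wise, here is a small check with made-up 10-character strings, so 60% of each length truncates to 6:

```python
import pandas as pd

# Hypothetical sample data shaped like the question's LeftString column
df = pd.DataFrame({'LeftString': ['AATCGCTGCG', 'GTCGTGBAGB', 'ABCTHJKLAA']})

# Per-row 60% of the string length, truncated to int by astype
min_align = df['LeftString'].str.len().mul(0.6).astype(int)
```

Note this produces one integer per row, whereas int(len(df['LeftString']) * 0.6) in the question computes 60% of the number of rows, which is why the original attempt silently produced nothing useful.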

list of functions with parameters

I need to obtain a list of functions, where my function is defined as follows:
import theano.tensor as tt

def tilted_loss(y, f, q):
    e = (y - f)
    return q * tt.sum(e) - tt.sum(e[e < 0])
I attempted to do
qs = np.arange(0.05,1,0.05)
q_loss_f = [tilted_loss(q=q) for q in qs]
however, I get the error TypeError: tilted_loss() missing 2 required positional arguments: 'y' and 'f'. I attempted the simpler a = tilted_loss(q=0.05) with the same result.
How do you go about creating this list of functions when parameters are required? Similar questions on SO consider the case where parameters are not involved.
You can use functools.partial:
q_loss_f = [functools.partial(tilted_loss, q=q) for q in qs]
There are two ways you can solve this problem. Both require knowing sensible default values for y and f.
With the current function, there's simply no way for the Python interpreter to know the values of y and f when you call tilted_loss(q=0.05); y and f are simply undefined and unknown.
Solution (1): Add default values
We can fix this by adding default values to the function. For example, if the default values are y = 0 and f = 1:
def tilted_loss(q, y=0, f=1):
    # original code goes here
Note that arguments with default values have to come AFTER non-default arguments (i.e q).
Solution (2): Specify default values during function call
Alternatively, just specify the default values every time you call that function. (Solution 1 is better)
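The functools.partial approach from the first answer can be tried end to end with a plain-NumPy stand-in for the theano loss (np.sum replacing tt.sum; the qs grid is from the question, and the y/f sample values are made up):

```python
import functools
import numpy as np

def tilted_loss(y, f, q):
    # NumPy stand-in for the theano version in the question
    e = y - f
    return q * np.sum(e) - np.sum(e[e < 0])

qs = np.arange(0.05, 1, 0.05)

# Each list entry is a callable of (y, f) with q already bound
q_loss_f = [functools.partial(tilted_loss, q=q) for q in qs]

y = np.array([1.0, 2.0, 3.0])
f = np.array([1.5, 1.5, 1.5])
loss0 = q_loss_f[0](y, f)  # tilted_loss(y, f, q=0.05)
```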
