I've recently started using vaex because of its great potential for large data sets.
I'm trying to apply the following function:
from typing import List
def get_columns(v: str, table_columns: List, pref: str = '', suff: str = '') -> int:
    # Index of the first column whose name contains pref + v + suff
    return [table_columns.index(i) for i in table_columns if (pref + v + suff) in i][0]
to a df as follows:
df["column_day"] = df.apply(get_columns, arguments=[df.part_day, table.columns.tolist(), "total_4wk_"])
but I get the following error when I evaluate df["column_day"]:
NameError: Column or variable 'total_4wk_' does not exist.
I do not understand what I am doing wrong, since other functions (with only one argument) I used with apply worked fine.
Thanks.
I believe vaex expects the arguments passed to apply to actually be expressions.
In your case table.columns.tolist() and "total_4wk_" are not expressions, so it complains (it apparently tries to look up 'total_4wk_' as a column or variable name). I would rewrite your get_columns function so that it only takes expressions as arguments, and I believe that will work.
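One way to get a function that only takes the per-row value is to bind the non-expression arguments beforehand with functools.partial. The sketch below is plain Python and does not import vaex; the column names are made up for illustration, and get_column_index is a hypothetical rewrite of get_columns. Whether the bound function then works with df.apply(lookup, arguments=[df.part_day]) is an assumption that I have not tested here.

```python
from functools import partial
from typing import List


def get_column_index(v: str, table_columns: List[str], pref: str = "", suff: str = "") -> int:
    """Return the index of the first column whose name contains pref + v + suff."""
    target = pref + v + suff
    for idx, name in enumerate(table_columns):
        if target in name:
            return idx
    raise ValueError(f"no column contains {target!r}")


# Hypothetical column names, standing in for table.columns.tolist()
columns = ["id", "total_4wk_mon", "total_4wk_tue"]

# Bind the list and the prefix once; only the per-row value is left as an argument.
lookup = partial(get_column_index, table_columns=columns, pref="total_4wk_")

print(lookup("tue"))  # 2

# Assumed (untested) vaex usage, passing only an expression:
# df["column_day"] = df.apply(lookup, arguments=[df.part_day])
```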
I'm new to python and pandas but I have a problem I cannot wrap my head around.
I'm trying to add a new column to my DataFrame. To achieve that I use the assign() function.
Most of the examples on the internet are painfully trivial and I cannot find a solution for my problem.
What works:
my_dataset.assign(new_col=lambda x: my_custom_long_function(x['long_column']))
def my_custom_long_function(input):
    return input * 2
What doesn't work:
my_dataset.assign(new_col=lambda x: my_custom_string_function(x['string_column']))
def my_custom_string_function(input):
    return input.upper()
What confuses me is that in the debugger I can see that even for my_custom_long_function the parameter is a Series, not a long.
I just want to use the lambda function and pass each value of the column to my already written, complicated functions. How do I do this?
Edit: The example here is just for demonstration purposes; the real code is basically an existing complex function that does not care about pandas types and needs a str as a parameter.
The column (a Series) doesn't have an upper method, so you need to go through the .str accessor and use str.upper:
my_dataset.assign(new_col=lambda x: my_custom_string_function(x['string_column']))
def my_custom_string_function(input):
    return input.str.upper()
That said, I would use:
my_dataset['new column'] = my_dataset['string_column'].str.upper()
For efficiency.
Edit:
my_dataset['new column'] = my_dataset['string_column'].apply(lambda x: my_custom_string_function(x))
def my_custom_string_function(input):
    return input.upper()
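To see the two approaches side by side, here is a small sketch with a made-up two-row DataFrame (the data is an assumption, the function name is taken from the question). Series.apply calls the function once per value, so the function receives a plain str, while .str.upper() works on the whole column at once.

```python
import pandas as pd


def my_custom_string_function(value: str) -> str:
    # Receives one plain Python str per call, not a Series
    return value.upper()


# Hypothetical sample data
my_dataset = pd.DataFrame({"string_column": ["a", "b"]})

# Element-wise: the existing function is called once per row value
via_apply = my_dataset["string_column"].apply(my_custom_string_function)

# Vectorized: no custom function needed for simple cases
via_str = my_dataset["string_column"].str.upper()

# The same element-wise call, wrapped in assign() as in the question
result = my_dataset.assign(
    new_col=lambda d: d["string_column"].apply(my_custom_string_function)
)

print(via_apply.tolist())  # ['A', 'B']
```

For a function that really needs a str, the apply route is the one that keeps the existing code unchanged.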
I am seeking to be able to use variables within the format() parentheses, in order to parameterize it within a function. Providing an example below:
sample_str = 'sample_str_{nvars}'
nvars_test = 'apple'
sample_str.format(nvars=nvars_test)  # Successful result: 'sample_str_apple'
But the following does not work -
sample_str = 'sample_str_{nvars}'
nvars_test_2 = 'nvars = apple'
sample_str.format(nvars_test_2) # KeyError: 'nvars'
Would anyone know how to do this? Thanks.
Many thanks for guidance. I did a bit more searching. For anyone who may run into the same problem, please see examples here: https://pyformat.info
sample_str = 'sample_str_{nvars}'
nvars_test_2 = {'nvars':'apple'}
sample_str.format(**nvars_test_2)  # Successful result: 'sample_str_apple'
First, I'd recommend checking out the string format examples.
Your first example works as expected. From the documentation, you are permitted to actually name the thing you are passing into {}, and then pass in a same-named variable for str.format():
'Coordinates: {latitude}, {longitude}'.format(latitude='37.24N', longitude='-115.81W')
# returns 'Coordinates: 37.24N, -115.81W'
Your second example doesn't work because you are not passing a variable called nvars in with str.format() - you are passing in a string: 'nvars = apple'.
sample_str = 'sample_str_{nvars}'
nvars_test_2 = 'nvars = apple'
sample_str.format(nvars_test_2) # KeyError: 'nvars'
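If the name and value really do arrive as a single string like 'nvars = apple', one option is to split it yourself and unpack the result as a keyword argument. This is only a sketch: it assumes exactly one '=' separates the name from the value and that surrounding spaces should be dropped.

```python
sample_str = "sample_str_{nvars}"
nvars_test_2 = "nvars = apple"

# Split on the first '=' only, then trim whitespace around both parts
key, value = (part.strip() for part in nvars_test_2.split("=", 1))

# ** unpacks the dict into keyword arguments: format(nvars='apple')
formatted = sample_str.format(**{key: value})
print(formatted)  # sample_str_apple
```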
It's a little more common (I think) to not name those curly-braced parameters - easier to read at least.
print('sample_str_{}'.format("apple")) should print sample_str_apple.
If you're using Python 3.6 or later you also have access to Python's formatted string literals (f-strings).
>>> greeting = 'hello'
>>> name = 'Jane'
>>> f'{greeting} {name}'
'hello Jane'
Note that the literal expects the variables to be already present. Otherwise you get an error.
>>> f'the time is now {time}'
NameError: name 'time' is not defined
I am trying to implement a UDF in spark; that can take both a literal and column as an argument. To achieve this, I believe I can use a curried UDF.
The function is used to match a string literal to each value in the column of a DataFrame. I have summarized the code below:-
def matching(match_string_1):
    def matching_inner(match_string_2):
        return difflib.SequenceMatcher(None, match_string_1, match_string_2).ratio()
    return matching_inner
hc.udf.register("matching", matching)
matching_udf = F.udf(matching, StringType())
df_matched = df.withColumn("matching_score", matching_udf(lit("match_string"))(df.column))
"match_string" is actually a value assigned to a list which I am iterating over.
Unfortunately this is not working as I had hoped; and I am receiving
"TypeError: 'Column' object is not callable".
I believe I am not calling this function correctly.
It should be something like this:
def matching(match_string_1):
    def matching_inner(match_string_2):
        return difflib.SequenceMatcher(
            a=match_string_1, b=match_string_2).ratio()
    # Here create udf.
    return F.udf(matching_inner, StringType())
df.withColumn("matching_score", matching("match_string")(df.column))
If you want to support a Column argument for match_string_1, you'll have to rewrite it like this:
def matching(match_string_1):
    def matching_inner(match_string_2):
        return F.udf(
            lambda a, b: difflib.SequenceMatcher(a=a, b=b).ratio(),
            StringType())(match_string_1, match_string_2)
    return matching_inner
df.withColumn("matching_score", matching(F.lit("match_string"))(df.column))
Your current code doesn't work because matching_udf is a UDF, so matching_udf(lit("match_string")) creates a Column expression instead of calling the inner function. Calling that Column is what raises the TypeError.
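The currying itself can be checked without Spark. The sketch below is plain Python only (no F.udf, no DataFrame); it shows that the outer call fixes the literal and returns the inner function, which is the piece that would be wrapped as a UDF in the answer's first version.

```python
import difflib
from typing import Callable


def matching(match_string_1: str) -> Callable[[str], float]:
    # The outer call fixes the literal and returns a one-argument function
    def matching_inner(match_string_2: str) -> float:
        return difflib.SequenceMatcher(a=match_string_1, b=match_string_2).ratio()
    return matching_inner


score_against_apple = matching("apple")

print(score_against_apple("apple"))  # 1.0
print(score_against_apple("xyz"))    # 0.0
```

Note that ratio() returns a float between 0 and 1.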
Rather than explicitly specifying the DataFrame columns in the code below, I'm trying to give an option of passing the name of the data frame in itself, without much success.
The code below gives a
"ValueError: Wrong number of dimensions" error.
I've tried another couple of ideas but they all lead to errors of one form or another.
Apart from this issue, when the parameters are passed as explicit DataFrame columns, p as a single column, and q as a list of columns, the code works as desired. Is there a clever (or indeed any) way of passing in the data frame so the columns can be assigned to it implicitly?
def cdf(p, q=[], datafr=None):
    if datafr!=None:
        p = datafr[p]
        for i in range(len(q)):
            q[i]=datafr[q[i]]
    ...
    (calculate conditional probability tables for p|q)
to summarize:
current usage:
cdf(df['var1'], [df['var2'], df['var3']])
desired usage:
cdf('var1', ['var2', 'var3'], datafr=df)
Change if datafr != None: to if datafr is not None:
pandas treats datafr != None as an element-wise comparison against the values in the DataFrame rather than as a single yes/no test, so using it as an if condition raises an error. is checks whether datafr and None refer to the same object, which is an identity check and never looks at the DataFrame's contents.
Additional tips:
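Python can iterate over a list directly, so you rarely need range(len(...))
#change this
for i in range(len(q)):
    q[i] = datafr[q[i]]
#to this (builds a new list instead of modifying the caller's list):
q = [datafr[name] for name in q]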
If q is a required parameter don't do q = [ ] when defining your function. If it is an optional parameter, ignore me.
Python can use position to match the arguments passed in the function call with the parameters in the definition.
cdf('var1', ['var2', 'var3'], datafr=df)
#can be written as:
cdf('var1', ['var2', 'var3'], df)
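Putting the tips above together, the front part of the function could look roughly like this. resolve_columns is a hypothetical helper name and the sample data is made up; the actual conditional probability calculation from the question is left out.

```python
import pandas as pd


def resolve_columns(p, q=None, datafr=None):
    """Hypothetical helper: turn column names into Series when datafr is given."""
    # Use None as the default to avoid sharing one list between calls
    q = [] if q is None else list(q)
    if datafr is not None:  # identity check, never compares DataFrame contents
        p = datafr[p]
        q = [datafr[name] for name in q]
    return p, q


# Made-up sample data
df = pd.DataFrame({"var1": [0, 1], "var2": [1, 1], "var3": [0, 0]})

# Desired usage: names plus the DataFrame
p, q = resolve_columns("var1", ["var2", "var3"], datafr=df)
print(p.name, [s.name for s in q])  # var1 ['var2', 'var3']

# Current usage still works: Series passed in directly
p2, q2 = resolve_columns(df["var1"], [df["var2"], df["var3"]])
```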
What does the following code do?
a = lambda _:True
From what I read and tested in the interactive prompt, it seems to be a function that always returns True.
Am I understanding this correctly? I hope to understand why an underscore (_) was used as well.
The _ is a variable name. Try it.
(This variable name is usually a name for an ignored variable. A placeholder so to speak.)
Python:
>>> l = lambda _: True
>>> l()
TypeError: <lambda>() missing 1 required positional argument: '_'
>>> l("foo")
True
So this lambda does require one argument. If you want a lambda with no argument that always returns True, do this:
>>> m = lambda: True
>>> m()
True
Underscore is a Python convention for naming an unused variable (e.g. static analysis tools do not report it as an unused variable). In your case the lambda's argument is unused, but the created object is still a single-argument function that always returns True. So your lambda is somewhat analogous to a constant function in math.
it seems to be a function that returns True regardless.
Yes, it is a function (or lambda) that returns True. The underscore, which is usually a placeholder for an ignored variable, is only needed if the caller passes an argument; when the function is called with no arguments it can be dropped.
An example use case for such a function (that does almost nothing):
dd = collections.defaultdict(lambda: True)
When used as the argument to a defaultdict, you can have True as a general default value.
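A quick sketch of that behaviour (the keys are made up): missing keys get the factory's return value, and the lookup also stores the key in the dictionary.

```python
import collections

# The factory is called with no arguments, so the lambda takes none
dd = collections.defaultdict(lambda: True)

dd["seen"] = False

print(dd["seen"])       # False, an explicit value wins
print(dd["missing"])    # True, created on first access
print("missing" in dd)  # True, the lookup inserted the key
```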
Below is the line of code in question:
a = lambda _:True
It creates a function with one input parameter: _. Underscore is a rather strange choice of variable name, but it is just a variable name. You can use _ anywhere, even when not using lambda functions. For example, instead of:
my_var = 5
print(my_var)
You could write:
_ = 5
print(_)
However, there was a reason that _ was used as the parameter name instead of something like x or input. We'll get to that in a moment.
First, we need to know that the lambda-keyword constructs a function, similar to def, but with different syntax. The definition of the lambda function, a = lambda _:True, is similar to writing:
def a(_):
    return True
It creates a function named a with an input parameter _, and it returns True. One could have just as easily written a = lambda x:True, with an x instead of an underscore. However, the convention is to use _ as a variable name when we do not intend to use that variable. Consider the following:
for _ in range(1, 11):
    print('pear')
Notice that the loop index is never used inside of the loop-body. We simply want the loop to execute a specified number of times. As winklerrr has written, "the variable name _ is [...] like a "throw-away-variable", just a placeholder which is of no use. "
Likewise, with a = lambda _:True the input parameter is not used inside the body of the function. It does not really matter what the input argument is, as long as there is one. The author of that lambda function wrote _ instead of something like x to indicate that the variable would not be used.
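A typical situation where this comes up is a callback that must accept one argument even though it ignores it. In the sketch below, keep_if is a hypothetical helper written only for this illustration; it always calls predicate(item) with exactly one argument.

```python
from typing import Callable, Iterable, List


def keep_if(items: Iterable[int], predicate: Callable[[int], bool]) -> List[int]:
    # Hypothetical helper: calls predicate with one argument per item
    return [item for item in items if predicate(item)]


accept_all = lambda _: True        # must take an argument, but ignores it
is_even = lambda n: n % 2 == 0     # actually uses its argument

print(keep_if([1, 2, 3, 4], accept_all))  # [1, 2, 3, 4]
print(keep_if([1, 2, 3, 4], is_even))     # [2, 4]
```

A zero-argument lambda: True would fail here, because keep_if passes the item to it.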
Note that the lambda does have a parameter, so calling a() with no arguments will raise an error.
If you want a lambda with no argument write something like this:
bar = lambda: True
Now calling bar(), with no args, will work just fine.
A lambda which takes no arguments need not always return the same value:
import random
process_fruit = lambda : random.random()
The lambda function above is more complex than something which always returns the same constant.
One reason that programmers sometimes use the lambda keyword instead of def is for functions which are especially short and simple. Note that a lambda definition can usually fit on one line, whereas it is difficult to do the same with a def statement. Another reason to use lambda instead of def is when the function will not be used again. If we don't want to call the function again later, then there is no need to give the function a name. For example, consider the following code:
def apply_to_each(transform, in_container):
    out_container = list()
    for item in in_container:
        # append, because assigning to an index of an empty list raises IndexError
        out_container.append(transform(item))
    return out_container
Now we make the following call:
squares = apply_to_each(lambda x: x**2, range(0, 101))
Notice that lambda x: x**2 is not given a label. This is because we probably won't call it again later, it was just something short and simple we needed temporarily.
The fact that lambda functions need not be given a name is the source of another name to describe them: "anonymous functions."
Also note that a lambda is an expression: like a function call, it evaluates to a value, namely a reference to the function it creates. A def is a statement and cannot appear inside an argument list, so the following is illegal:
apply_to_each(def foo(x): x**2 , range(0, 101))
Whereas apply_to_each(lambda x: x**2, range(0, 101)) is just fine.
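For comparison, here is a small self-contained sketch with both spellings. It uses its own list-comprehension version of apply_to_each and a shorter range so the output stays readable; the named function square exists only for this illustration.

```python
def apply_to_each(transform, in_container):
    # Compact stand-in for the loop-based version above
    return [transform(item) for item in in_container]


# With def, the function has to be defined first and then passed by name
def square(x):
    return x ** 2


squares_from_def = apply_to_each(square, range(0, 5))

# With lambda, the function is created right inside the call
squares_from_lambda = apply_to_each(lambda x: x ** 2, range(0, 5))

print(squares_from_def)                         # [0, 1, 4, 9, 16]
print(squares_from_def == squares_from_lambda)  # True
```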
So, we use lambda instead of def, and _ instead of a long variable name, when we want something short and sweet that we probably won't want to use again later.
Lambda means a function.
The above statement is the same as writing
def f(_):
    return True
Here the lambda declares one parameter, and that parameter is named _ (you could just as well name it x or y).
Underscore _ is a valid identifier and is used here as a variable name. It will always return True for the argument passed to the function.
>>> a('123')
True