I'm new to python and pandas but I have a problem I cannot wrap my head around.
I'm trying to add a new column to my DataFrame. To achieve that I use the assign() function.
Most of the examples on the internet are painfully trivial and I cannot find a solution for my problem.
What works:
my_dataset.assign(new_col=lambda x: my_custom_long_function(x['long_column']))

def my_custom_long_function(input):
    return input * 2
What doesn't work:
my_dataset.assign(new_col=lambda x: my_custom_string_function(x['string_column']))

def my_custom_string_function(input):
    return input.upper()
What confuses me is that in the debugger I can see that even for my_custom_long_function the parameter is a Series, not a long.
I just want to use the lambda function to pass the column's values into my already-written, complicated functions. How do I do this?
Edit: The example here is just for demonstration purposes; the real code is an existing complex function that does not care about pandas types and needs a str as a parameter.
Because the column (a Series) doesn't have an upper method, you need to go through the .str accessor and use str.upper:

my_dataset.assign(new_col=lambda x: my_custom_string_function(x['string_column']))

def my_custom_string_function(input):
    return input.str.upper()
That said, for efficiency I would use:

my_dataset['new column'] = my_dataset['string_column'].str.upper()
Edit:
my_dataset['new column'] = my_dataset['string_column'].apply(lambda x: my_custom_string_function(x))
def my_custom_string_function(input):
    return input.upper()
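Putting the answer together, a minimal runnable sketch (with a hypothetical two-row DataFrame) showing both the element-wise apply route, which keeps the existing str-based function untouched, and the vectorized .str.upper() route:

```python
import pandas as pd

def my_custom_string_function(value):
    # receives a plain str, as the original complex function expects
    return value.upper()

my_dataset = pd.DataFrame({'string_column': ['foo', 'bar']})

# element-wise: apply passes each scalar value to the existing function
my_dataset['new_col'] = my_dataset['string_column'].apply(my_custom_string_function)

# vectorized alternative when only upper-casing is needed
my_dataset['new_col_fast'] = my_dataset['string_column'].str.upper()

print(my_dataset['new_col'].tolist())  # ['FOO', 'BAR']
```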
I've recently started to use vaex for its potential on large datasets.
I'm trying to apply the following function:
def get_columns(v: str, table_columns: List, pref: str = '', suff: str = '') -> List:
    return [table_columns.index(i) for i in table_columns if (pref + v + suff) in i][0]
to a df as follows:
df["column_day"] = df.apply(get_columns, arguments=[df.part_day, table.columns.tolist(), "total_4wk_"])
but I get the error when I run df["column_day"]:
NameError: Column or variable 'total_4wk_' does not exist.
I do not understand what I am doing wrong, since other functions (with only one argument) that I used with apply worked fine.
Thanks.
I believe vaex expects the arguments passed to apply to actually be expressions.
In your case, table.columns.tolist() and "total_4wk_" are not expressions, so it complains. I would rewrite your get_columns function so that it only takes expressions as arguments, and I believe that will work.
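One way to follow that advice without touching the logic of get_columns is to bind the non-expression arguments up front with functools.partial, so that apply only receives the column expression. A sketch with hypothetical column names; the vaex call itself is shown as a comment, since it depends on your actual df:

```python
from functools import partial
from typing import List

def get_columns(v: str, table_columns: List, pref: str = '', suff: str = '') -> List:
    return [table_columns.index(i) for i in table_columns if (pref + v + suff) in i][0]

cols = ['total_4wk_morning', 'total_4wk_evening']  # hypothetical column names

# bind the constant arguments; only the expression argument remains free
bound = partial(get_columns, table_columns=cols, pref='total_4wk_')

# vaex would then only see expressions:
# df["column_day"] = df.apply(bound, arguments=[df.part_day])
print(bound('evening'))  # 1
```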
I am trying to create a function which resamples time series data in pandas. I would like to have the option to specify the type of aggregation that occurs depending on what type of data I am sending through (i.e. for some data, taking the sum of each bin is appropriate, while for others, taking the mean is needed, etc.). For example data like these:
import pandas as pd
import numpy as np
dr = pd.date_range('01-01-2020', '01-03-2020', freq='1H')
df = pd.DataFrame(np.random.rand(len(dr)), index=dr)
I could have a function like this:
def process(df, freq='3H', method='sum'):
    r = df.resample(freq)
    if method == 'sum':
        r = r.sum()
    elif method == 'mean':
        r = r.mean()
    # ...
    # more options
    # ...
    return r
For a small amount of aggregation methods, this is fine, but seems like it could be tedious if I wanted to select from all of the possible choices.
I was hoping to use getattr to implement something like this post (under "Putting it to work: generalizing method calls"). However, I can't find a way to do this:
def process2(df, freq='3H', method='sum'):
    r = df.resample(freq)
    foo = getattr(r, method)
    return r.foo()
# fails with:
# AttributeError: 'DatetimeIndexResampler' object has no attribute 'foo'

def process3(df, freq='3H', method='sum'):
    r = df.resample(freq)
    foo = getattr(r, method)
    return foo(r)
# fails with:
# TypeError: __init__() missing 1 required positional argument: 'obj'
I get why process2 fails (calling r.foo() looks for the method foo() of r, not the variable foo). But I don't think I get why process3 fails.
I know another approach would be to pass functions as the method parameter and then apply those functions to r. I suspect this would be less efficient, and it still doesn't let me access the built-in Resampler methods directly.
Is there a working, more concise way to achieve this? Thanks!
Try .resample(freq).apply(method); Resampler.apply accepts the aggregation name as a string.
But unless you are planning some more computation inside the function, it will probably be easier to just hard-code this line.
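Both routes can be sketched on the question's own data. Note that getattr(r, method) already returns a bound method, so calling it with no arguments is the fix for both process2 and process3:

```python
import numpy as np
import pandas as pd

def process(df, freq='3H', method='sum'):
    r = df.resample(freq)
    # getattr returns the *bound* aggregation method; just call it
    return getattr(r, method)()

dr = pd.date_range('01-01-2020', '01-03-2020', freq='1H')
df = pd.DataFrame(np.random.rand(len(dr)), index=dr)

out_sum = process(df, method='sum')
out_mean = df.resample('3H').apply('mean')  # string aggregation via apply
print(len(out_sum))  # 17 three-hour bins
```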
Sadly, the answer to this question for datetime.date does not work for datetime.time.
So I implemented a df.apply() function that does what I expect:
import datetime

def get_ts_timeonly_float(timeonly):
    if isinstance(timeonly, datetime.time):
        return timeonly.hour * 3600 + timeonly.minute * 60 + timeonly.second
    elif isinstance(timeonly, pd.Timedelta):
        return timeonly.seconds

fn_get_ts_timeonly_pd_timestamp = lambda row: get_ts_timeonly_float(row.ts_timeonly)
col = df.apply(fn_get_ts_timeonly_pd_timestamp, axis=1)
df = df.assign(ts_timeonly_as_ts=col.values)
Problem:
However, this is not yet "blazingly fast." One reason is that .apply() will try internally to loop over Cython iterators. But in this case, the lambda that you passed isn't something that can be handled in Cython, so it's called in Python, which is consequently not all that fast. (This quote is from a great blog post.)
So is there a faster method to convert datetime.time into some int representation (like total_seconds till start of day)? Thanks!
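One vectorized alternative, as a sketch assuming the column holds datetime.time objects: format them as strings once and let pd.to_timedelta parse the whole column, then take .dt.total_seconds():

```python
import datetime
import pandas as pd

df = pd.DataFrame({'ts_timeonly': [datetime.time(1, 2, 3), datetime.time(10, 0, 0)]})

# str(time) yields 'HH:MM:SS', which to_timedelta parses in one vectorized pass
secs = pd.to_timedelta(df['ts_timeonly'].astype(str)).dt.total_seconds()
print(secs.tolist())  # [3723.0, 36000.0]
```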
For example,
import pandas as pd
weather = pd.read_csv(r'D:\weather.csv')
weather.count
weather.count()
weather is a DataFrame with multiple columns and rows.
What's the difference between asking for weather.count and weather.count()?
It depends. In general this question has nothing to do with pandas; the answer comes down to how Python is designed.
In this case, .count is a method, specifically a method of pandas.DataFrame, and printing it confirms that:
df = pd.DataFrame({'a': []})
print(df.count)
Outputs
<bound method DataFrame.count of Empty DataFrame
Columns: [a]
Index: []>
Adding () will call this method:
print(df.count())
Outputs
a 0
dtype: int64
However, that is not always the case. .count could have been a non-callable attribute (i.e. a string, an int, etc.) or a property.
In this case it's a non-callable attribute:
class Foo:
    def __init__(self, c):
        self.count = c

obj = Foo(42)
print(obj.count)
Will output
42
Adding () in this case will raise an exception because it makes no sense to call an integer:
print(obj.count())
TypeError: 'int' object is not callable
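A quick way to tell the two cases apart is the built-in callable(); a small sketch using the empty DataFrame from above:

```python
import pandas as pd

df = pd.DataFrame({'a': []})

# df.count without parentheses is the bound method object itself
assert callable(df.count)

# calling it produces the per-column counts
counts = df.count()
print(counts['a'])  # 0
```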
The brackets (parentheses) indicate a function call.
If you declare a function like this one, which sums two numbers:
def sum(a: int, b: int) -> int:
    return a + b
Then you can call the function using sum(1, 2).
The function name without brackets is used when you don't need to call the function but instead need to pass a reference to it to another method.
For example, passing a reference to a function is useful for multiprocessing:
from multiprocessing import Process
t = Process(target=my_long_running_function)
Please have a look at roippi's reply for further information:
https://stackoverflow.com/a/21786508/9361998
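The same pattern shows up outside multiprocessing as well; for example, the built-in sorted takes a function reference as its key argument and calls it itself:

```python
words = ['pear', 'fig', 'banana']

# len is passed without parentheses: sorted receives the function object
# and calls it once per element
by_length = sorted(words, key=len)
print(by_length)  # ['fig', 'pear', 'banana']
```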
I am trying to implement a UDF in Spark that can take both a literal and a column as an argument. To achieve this, I believe I can use a curried UDF.
The function is used to match a string literal to each value in a column of a DataFrame. I have summarized the code below:
def matching(match_string_1):
    def matching_inner(match_string_2):
        return difflib.SequenceMatcher(None, match_string_1, match_string_2).ratio()
    return matching

hc.udf.register("matching", matching)
matching_udf = F.udf(matching, StringType())

df_matched = df.withColumn("matching_score", matching_udf(lit("match_string"))(df.column))
"match_string" is actually a value assigned to a list which I am iterating over.
Unfortunately, this is not working as I had hoped, and I am receiving
"TypeError: 'Column' object is not callable".
I believe I am not calling this function correctly.
It should be something like this:
def matching(match_string_1):
    def matching_inner(match_string_2):
        return difflib.SequenceMatcher(
            a=match_string_1, b=match_string_2).ratio()
    # Create the udf here.
    return F.udf(matching_inner, StringType())

df.withColumn("matching_score", matching("match_string")(df.column))
If you want to support Column argument for match_string_1 you'll have to rewrite it like this:
def matching(match_string_1):
    def matching_inner(match_string_2):
        return F.udf(
            lambda a, b: difflib.SequenceMatcher(a=a, b=b).ratio(),
            StringType())(match_string_1, match_string_2)
    return matching_inner

df.withColumn("matching_score", matching(F.lit("match_string"))(df.column))
Your current code doesn't work because matching_udf is a UDF, and matching_udf(lit("match_string")) creates a Column expression instead of calling the inner function.
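The currying itself can be checked without Spark; here is a minimal sketch of the closure pattern the answer relies on (note that the outer function must return matching_inner, not itself):

```python
import difflib

def matching(match_string_1):
    def matching_inner(match_string_2):
        return difflib.SequenceMatcher(a=match_string_1, b=match_string_2).ratio()
    return matching_inner  # return the inner function, not the outer one

score = matching('hello')('hallo')
print(score)  # 0.8
```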