I have a list such that
l = ['xyz','abc','mnq','qpr']
These values are weighted such that xyz>abc>mnq>qpr
I have a pandas dataframe with a column that has sets of values.
COL_NAME
0 set(['xyz', 'abc'])
1 set(['xyz'])
2 set(['mnq','qpr'])
Now, I want to pick the highest values in the sets such that after I apply the custom function I am left with
COL_NAME
0 set(['xyz'])
1 set(['xyz'])
2 set(['mnq'])
Is there an elegant way to do this process without resorting to a dictionary of weights?
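You can use pd.Categorical with ordered=True and categories=l[::-1] (the list reversed, so the lowest-weighted value comes first) to get the order you'd like.
def max_cat(x):
    # categories run from lowest to highest weight, so max() returns the top-ranked value
    return {pd.Categorical(list(x), categories=l[::-1], ordered=True).max()}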
df.COL_NAME.apply(max_cat)
0 {xyz}
1 {xyz}
2 {mnq}
Name: COL_NAME, dtype: object
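A plain-Python alternative, sketched under the assumption that l is ordered from highest to lowest weight and that every set element appears in l: the position of each value in l can act as its rank, so no separate weight dictionary is needed. The helper name top_ranked is made up for this illustration.

```python
import pandas as pd

# Ranking list: earlier position means higher weight (xyz > abc > mnq > qpr).
l = ["xyz", "abc", "mnq", "qpr"]

df = pd.DataFrame({"COL_NAME": [{"xyz", "abc"}, {"xyz"}, {"mnq", "qpr"}]})


def top_ranked(values, order=l):
    """Return a one-element set holding the value that appears earliest in `order`."""
    # list.index gives each value's position in the ranking; the smallest position wins.
    return {min(values, key=order.index)}


result = df["COL_NAME"].apply(top_ranked)
print(result)
```

list.index is a linear scan, which is fine for a short ranking list; for a long one, a precomputed position lookup would be faster.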
I have a problem and I do not know how to solve it.
I have two lists that always have the same length:
max_values = [333,30,10]
min_values = [30,10,0]
Every index of the lists represents a cluster number, and the min and max values at that index give the cluster's range, so:
Index/Cluster 0: 0-10
Index/Cluster 1: 10-30
Index/Cluster 2: 30-333
Furthermore I have one dataframe as follows:
Dataframe
Within the df, I have a column called "AVG_MPH_AREA"
For each row, I want to check which cluster range the value falls into and then set the "Cluster" column to the corresponding list index, overwriting the old values.
In this case the lists describe 3 clusters, but there could be more or fewer.
Any idea how to do that, or which functions to use?
I came up with a small function that does the task:
max_values = [333,30,10]
min_values = [30,10,0]
Make a dictionary that contains Cluster_num as key and (min_values, max_values) as value.
def temp_func(x):
    # the dict is built inside the function so it can be applied directly to the AVG_MPH_AREA column
    dt = {}
    cluster_list = list(zip(min_values, max_values))
    for i in range(len(cluster_list)):
        dt[i] = cluster_list[i]
    # round once, then return the first cluster whose [min, max) range contains the value
    x = int(round(x))
    for key, value in dt.items():
        if value[0] <= x < value[1]:
            return key
    # no range matched (for example, x equals the largest max value)
    return None
Now apply the function to the AVG_MPH_AREA column
df["Cluster"] = df["AVG_MPH_AREA"].apply(temp_func)
Output:
AVG_MPH_AREA Cluster
0 10.770 1
1 10.770 1
2 10.780 1
3 5.780 2
4 24.960 1
5 267.865 0
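A vectorized variant could use pd.cut instead of a row-wise apply. This is only a sketch: it assumes the two lists describe contiguous ranges sorted from the highest cluster down (as in the example), and unlike temp_func it does not round the value before binning. The sample frame below is rebuilt from the output shown above.

```python
import pandas as pd

max_values = [333, 30, 10]
min_values = [30, 10, 0]

df = pd.DataFrame({"AVG_MPH_AREA": [10.77, 10.77, 10.78, 5.78, 24.96, 267.865]})

# Collapse both lists into ascending bin edges: [0, 10, 30, 333].
edges = sorted(set(min_values) | set(max_values))

# labels=False yields the ascending bin number; right=False makes bins [min, max).
bin_idx = pd.cut(df["AVG_MPH_AREA"], bins=edges, right=False, labels=False)

# The lists run from the highest range to the lowest, so flip the bin number
# to recover the list index. Values outside every bin stay NaN.
df["Cluster"] = len(min_values) - 1 - bin_idx
print(df)
```

With right=False a value equal to the largest max (333 here) falls outside every bin and becomes NaN, which mirrors the half-open ranges used in temp_func.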
I have a data frame with a column of lists of strings. I want to find the value of one column in the row where another column matches a given value.
For example:
samples subject trial_num
0 ['aa','bb'] 1 1
1 ['bb','cc'] 1 2
I have ['bb','cc'] and I want to get the value from the trial_num column where this list equals the samples column, in this case 2.
Because the search column (samples) contains lists, this is a little more complicated.
In this case, the apply() function can be used to test the values, and return a boolean mask, which can be applied to the DataFrame to obtain the required value.
Example code:
df.loc[df['samples'].apply(lambda x: x == ['bb', 'cc']), 'trial_num']
Output:
1 2
Name: trial_num, dtype: int64
To only return the required value (2), simply append .iloc[0] to the end of the statement, as:
df.loc[df['samples'].apply(lambda x: x == ['bb', 'cc']), 'trial_num'].iloc[0]
>>> 2
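For reference, here is a self-contained sketch of the same lookup, with the frame rebuilt from the question's example. The names target, mask, matches and result are illustrative, and the empty-match guard is an addition not present in the answer above.

```python
import pandas as pd

df = pd.DataFrame({
    "samples": [["aa", "bb"], ["bb", "cc"]],
    "subject": [1, 1],
    "trial_num": [1, 2],
})

target = ["bb", "cc"]

# Compare each list with the target; the result is a boolean Series (the mask).
mask = df["samples"].apply(lambda x: x == target)

# Select trial_num only for rows where the mask is True.
matches = df.loc[mask, "trial_num"]

# .iloc[0] raises IndexError on an empty selection, so guard against no match.
result = matches.iloc[0] if not matches.empty else None
print(result)
```

List equality is order-sensitive, so ['cc', 'bb'] would not match; comparing set(x) == set(target) is one option if order should be ignored.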
I have a dataframe that looks like:
index data
11727.213152 -62.260842
12144.825397 -26.384420
12566.138322 -47.091084
12981.362812 -74.528391
I would like to calculate the mad() value of every two items of the data column. How can I do that?
Is there a way to group the data column in groups of two (or more)?
Or should I simply iterate through the df and calculate the mad of two consecutive values?
thanks!
I think you need groupby with a helper array created by floor division (//):
s = df.groupby(np.arange(len(df)) // 2)['data'].mad()
print (s)
0 17.938211
1 13.718653
Name: data, dtype: float64
Detail:
print (np.arange(len(df)) // 2)
[0 0 1 1]
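The mad() method is not available in recent pandas releases (it was deprecated and then removed in 2.0), so on a newer version the same grouping can be combined with a hand-written mean absolute deviation. This is a sketch; mean_abs_dev is a helper name introduced here, not a pandas function.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {"data": [-62.260842, -26.384420, -47.091084, -74.528391]},
    index=[11727.213152, 12144.825397, 12566.138322, 12981.362812],
)


def mean_abs_dev(s):
    """Mean absolute deviation around the mean, the quantity mad() used to compute."""
    return (s - s.mean()).abs().mean()


# Integer-divide the row positions by 2 so rows 0-1 form group 0, rows 2-3 group 1.
groups = np.arange(len(df)) // 2
s = df.groupby(groups)["data"].agg(mean_abs_dev)
print(s)
```

Changing the divisor from 2 to n groups the rows n at a time.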
Here is my input data.
df1= pd.DataFrame( np.random.randn(10,3), columns= list("ABC") )
A B C
0 0.557303 1.657976 -0.091638
1 -0.769201 1.305553 -0.248403
2 1.251513 -0.634947 0.100130
3 -1.030045 -0.268972 1.328666
4 0.665483 -0.133410 0.151235
5 0.703294 -0.525490 0.109413
6 0.549441 0.002626 -0.005841
7 0.454866 1.094490 -1.946760
8 -0.152995 -0.736689 -0.367252
9 -0.632906 1.066869 0.303271
I want to create groups based on the value of column A, so I bin A first and define a function, then use the apply method on the GroupBy object. I expect the new column to be the difference between B and C divided by the group mean of A.
b=np.linspace(-1, 1,5)
def tmpF(x):
    x['newCol'] = (x['B'] - x['C']) / df1['A'].mean()
    return x
df1.groupby(np.digitize(df1['A'],b)).apply(tmpF)
However, this only uses the mean of the entire column A. I know df1['A'].mean() is wrong, but I don't know how to access the group mean instead.
How can I solve that?
You can change df1['A'] to x['A'] in function tmpF:
b=np.linspace(-1, 1,5)
def tmpF(x):
    # x is the sub-DataFrame for one group, so x['A'].mean() is the group mean
    x['newCol'] = (x['B'] - x['C']) / x['A'].mean()
    return x
df1.groupby(np.digitize(df1['A'],b)).apply(tmpF)
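The same result can also be reached without apply by broadcasting the group mean back to each row with transform. This is a sketch that uses a seeded random frame in place of the question's data; the names groups and group_mean_a are illustrative.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df1 = pd.DataFrame(rng.standard_normal((10, 3)), columns=list("ABC"))

b = np.linspace(-1, 1, 5)
groups = np.digitize(df1["A"], b)  # bin number of each row's A value

# transform("mean") returns a Series aligned with df1's index, holding
# each row's own group mean of A.
group_mean_a = df1.groupby(groups)["A"].transform("mean")

df1["newCol"] = (df1["B"] - df1["C"]) / group_mean_a
print(df1)
```

Because transform keeps the original index, the division lines up row by row and no custom function is needed.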
To pass multiple variables to a normal python function you can just write something like:
from datetime import timedelta

def a_function(date, string, number):
    # convert the string to an int, then move the date forward by (number * int) days
    days = number * int(string)
    return date + timedelta(days=days)
When using Pandas DataFrames I know you can create a new column based on the contents of one like so:
df['new_col'] = df['column_A'].map(a_function)
# This might return the year from a date column
# return date.year
What I'm wondering is in the same way you can pass multiple pieces of data to a single function (as seen in the first example above), can you use multiple columns in the creation of a new pandas DataFrame column?
For example combining three separate parts of a date Y - M - D into one field.
df['whole_date'] = df['Year','Month','Day'].map(a_function)
I get a key error with the following test.
def combine(one, two, three):
    return one + two + three
df = pd.DataFrame({'a': [1,2,3], 'b': [2,3,4],'c': [4,5,6]})
df['d'] = df['a','b','b'].map(combine)
Is there a way of creating a new column in a pandas DataFrame using .map or something else which takes as input three columns and returns a single column?
-> Example input: 1, 2, 3
-> Example output: 1*2*3
Likewise is there also a way of having a function take in one argument, a date and return three new pandas DataFrame columns; one for the year, month and day?
Is there a way of creating a new column in a pandas dataframe using .MAP or something else which takes as input three columns and returns a single column. For example input would be 1, 2, 3 and output would be 1*2*3
To do that, you can use apply with axis=1. However, instead of being called with three separate arguments (one for each column) your specified function will then be called with a single argument for each row, and that argument will be a Series containing the data for that row. You can either account for this in your function:
def combine(row):
    return row['a'] + row['b'] + row['c']
>>> df.apply(combine, axis=1)
0 7
1 10
2 13
Or you can pass a lambda which unpacks the Series into separate arguments:
def combine(one, two, three):
    return one + two + three
>>> df.apply(lambda x: combine(*x), axis=1)
0 7
1 10
2 13
If you want to pass only specific columns, you need to select them by indexing on the DataFrame with a list:
>>> df[['a', 'b', 'c']].apply(lambda x: combine(*x), axis=1)
0 7
1 10
2 13
Note the double brackets. (This doesn't really have anything to do with apply; indexing with a list is the normal way to access multiple columns from a DataFrame.)
However, it's important to note that in many cases you don't need to use apply, because you can just use vectorized operations on the columns themselves. The combine function above can simply be called with the DataFrame columns themselves as the arguments:
>>> combine(df.a, df.b, df.c)
0 7
1 10
2 13
This is typically much more efficient when the "combining" operation is vectorizable.
Likewise is there also a way of having a function take in one argument, a date and return three new pandas dataframe columns; one for the year, month and day?
As above, there are two basic ways to do this: a general but non-vectorized way using apply, and a faster vectorized way. Suppose you have a DataFrame like this:
>>> df = pandas.DataFrame({'date': pandas.date_range('2015/05/01', '2015/05/03')})
>>> df
date
0 2015-05-01
1 2015-05-02
2 2015-05-03
You can define a function that returns a Series for each value, and then apply it to the column:
def dateComponents(date):
    return pandas.Series([date.year, date.month, date.day], index=["Year", "Month", "Day"])
>>> df.date.apply(dateComponents)
Year Month Day
0 2015 5 1
1 2015 5 2
2 2015 5 3
apply works with any function, but for datetime columns pandas also provides the vectorized .dt accessor (for example df.date.dt.year). More generally, you can often use vectorized operations:
>>> df = pandas.DataFrame({'a': ["Hello", "There", "Pal"]})
>>> df
a
0 Hello
1 There
2 Pal
>>> pandas.DataFrame({'FirstChar': df.a.str[0], 'Length': df.a.str.len()})
FirstChar Length
0 H 5
1 T 5
2 P 3
Here again the operation is vectorized by operating directly on the values instead of applying a function elementwise. In this case, we have two vectorized operations (getting first character and getting the string length), and then we wrap the results in another call to DataFrame to create separate columns for each of the two kinds of results.
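For the date example, a vectorized route is the .dt accessor, assuming the column has a datetime64 dtype (as it does when built with date_range). A minimal sketch, with parts as an illustrative name:

```python
import pandas as pd

df = pd.DataFrame({"date": pd.date_range("2015-05-01", "2015-05-03")})

# Each .dt attribute is computed for the whole column at once,
# so no per-row Python function call is involved.
parts = pd.DataFrame({
    "Year": df["date"].dt.year,
    "Month": df["date"].dt.month,
    "Day": df["date"].dt.day,
})
print(parts)
```

If the column holds strings rather than datetimes, it would first need converting, for example with pd.to_datetime.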
I normally use apply for this kind of thing; it's basically the DataFrame version of map (the axis parameter lets you decide whether to apply your function to rows or columns):
df.apply(lambda row: row.a*row.b*row.c, axis=1)
or
df.apply(np.prod, axis=1)
0 8
1 30
2 72
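For a plain product of columns, DataFrame.prod with axis=1 gives the same numbers without apply. A short sketch using the frame from the question:

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3], "b": [2, 3, 4], "c": [4, 5, 6]})

# Multiply across each row; selecting the columns with a list keeps the
# calculation independent of any other columns the frame may hold.
df["d"] = df[["a", "b", "c"]].prod(axis=1)
print(df)
```

This stays inside pandas' vectorized code path, so it is usually faster than a row-wise lambda on large frames.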