I have a problem that I do not know how to solve.
I have two lists, which always have the same length:
max_values = [333,30,10]
min_values = [30,10,0]
Each index of the lists represents a cluster number, and the min and max values at that index give the range of that cluster, so:
Index/Cluster 0: 30-333
Index/Cluster 1: 10-30
Index/Cluster 2: 0-10
Furthermore, I have a dataframe that contains a column called "AVG_MPH_AREA".
For each row, I need to check which cluster range the value falls into. The "Cluster" column should then be set to the matching index of the lists, overwriting the old values.
In this case there are 3 clusters, but there could also be more or fewer.
Any idea how to do that, or which functions to use?
I came up with a small function that can do the task:
max_values = [333,30,10]
min_values = [30,10,0]
Make a dictionary that contains Cluster_num as key and (min_values, max_values) as value.
def temp_func(x):
    # Build the dict inside the function so that the function can be
    # applied directly to the AVG_MPH_AREA column of the dataframe.
    cluster_list = list(zip(min_values, max_values))
    dt = {i: bounds for i, bounds in enumerate(cluster_list)}
    # Round once; the ranges are integer-based and exclude the upper bound.
    x = int(round(x))
    for key, value in dt.items():
        if value[0] <= x < value[1]:
            return key
    # Falls through (returns None) when x lies outside every range.
Now apply the function to the AVG_MPH_AREA column
df["Cluster"] = df["AVG_MPH_AREA"].apply(temp_func)
Output:
   AVG_MPH_AREA  Cluster
0        10.770        1
1        10.770        1
2        10.780        1
3         5.780        2
4        24.960        1
5       267.865        0
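As a possible alternative to the row-wise function, pd.cut can bin the whole column in one call. The following is only a minimal sketch, assuming the ranges are contiguous and non-overlapping (as in the lists above); the sample frame and the helper names ranges, edges and labels are made up for illustration. Unlike temp_func, it does not round the values first, so results can differ for values within 0.5 of a boundary.

```python
import pandas as pd

max_values = [333, 30, 10]
min_values = [30, 10, 0]

# Sort (min, max, cluster index) triples by lower bound so the bin edges increase.
ranges = sorted(zip(min_values, max_values, range(len(min_values))))
edges = [lo for lo, _, _ in ranges] + [ranges[-1][1]]  # [0, 10, 30, 333]
labels = [idx for _, _, idx in ranges]                 # [2, 1, 0]

# Hypothetical sample data, taken from the output shown above.
df = pd.DataFrame({"AVG_MPH_AREA": [10.77, 10.77, 10.78, 5.78, 24.96, 267.865]})

# right=False makes each bin [lower, upper), like range() in temp_func.
# Values outside every bin become NaN, in which case astype(int) would fail.
df["Cluster"] = pd.cut(
    df["AVG_MPH_AREA"], bins=edges, labels=labels, right=False
).astype(int)
```

Because the labels are derived from the list positions, the same code should work for any number of clusters as long as the contiguity assumption holds.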
I have a data frame that has a column of lists of strings. I want to find the value of one column in the row where another column matches a given value, e.g.:
       samples  subject  trial_num
0  ['aa','bb']        1          1
1  ['bb','cc']        1          2
I have ['bb','cc'] and I want to get the value from the trial_num column in the row where the samples column equals this list, in this case 2.
Given that the search column (samples) contains lists, this makes things a tiny bit more complicated.
In this case, the apply() function can be used to test the values, and return a boolean mask, which can be applied to the DataFrame to obtain the required value.
Example code:
df.loc[df['samples'].apply(lambda x: x == ['bb', 'cc']), 'trial_num']
Output:
1 2
Name: trial_num, dtype: int64
To only return the required value (2), simply append .iloc[0] to the end of the statement, as:
df.loc[df['samples'].apply(lambda x: x == ['bb', 'cc']), 'trial_num'].iloc[0]
>>> 2
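For reference, here is a self-contained sketch of the lookup above. The frame is rebuilt from the question's example, and the intermediate names mask and value are introduced only for illustration.

```python
import pandas as pd

# Sample frame reconstructed from the question.
df = pd.DataFrame({
    "samples": [["aa", "bb"], ["bb", "cc"]],
    "subject": [1, 1],
    "trial_num": [1, 2],
})

# Element-wise list comparison gives one boolean per row.
mask = df["samples"].apply(lambda x: x == ["bb", "cc"])

# Select trial_num where the mask is True and take the first match.
value = df.loc[mask, "trial_num"].iloc[0]
```

Note that .iloc[0] raises an IndexError when no row matches, so a real lookup may want to check mask.any() first.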
I have a dataframe with headers 'Category', 'Factor1', 'Factor2', 'Factor3', 'Factor4', 'UseFactorA', 'UseFactorB'.
The values of 'UseFactorA' and 'UseFactorB' are each one of the strings ['Factor1', 'Factor2', 'Factor3', 'Factor4'], keyed on the value in 'Category'.
I want to generate a column, 'Result', which equals dataframe[UseFactorA]/dataframe[UseFactorB]
Take the below dataframe as an example:
Category  Factor1  Factor2  Factor3  Factor4  UseFactorA  UseFactorB
A               1        2        5        8   'Factor1'   'Factor3'
B               2        7        4        2   'Factor3'   'Factor1'
The 'Result' series should be [0.2, 2] (1/5 for row A, 4/2 for row B).
However, I cannot figure out how to feed the values of UseFactorA and UseFactorB into an index to make this happen. If the columns to use were fixed, I would just write
df['Result'] = df['Factor1']/df['Factor2']
However, when I try
df['Result'] = df[df['UseFactorA']]/df[df['UseFactorB']]
I get the error
ValueError: Wrong number of items passed 3842, placement implies 1
Is there a method for doing what I am trying here?
Probably not the prettiest solution (because of the iterrows), but what comes to mind is to iterate through the pairs of factors and set the 'Result' value at each index:
for i, factors in df[['UseFactorA', 'UseFactorB']].iterrows():
    # look up both factor values in row i only, so that a scalar is assigned
    df.loc[i, 'Result'] = df.loc[i, factors['UseFactorA']] / df.loc[i, factors['UseFactorB']]
Edit:
Another option:
def factor_calc_for_row(row):
    factorA = row['UseFactorA']
    factorB = row['UseFactorB']
    return row[factorA] / row[factorB]

df['Result'] = df.apply(factor_calc_for_row, axis=1)
Here's a one-liner (it assumes the default integer index 0..n-1):
df['Result'] = [df[df['UseFactorA'][x]][x]/df[df['UseFactorB'][x]][x] for x in range(len(df))]
How it works is:
df[df['UseFactorA']]
Returns a data frame,
df[df['UseFactorA'][x]]
Returns a Series
df[df['UseFactorA'][x]][x]
Pulls a single value from the series.
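The row-wise idea can be checked against the question's sample data. This is a minimal sketch: the frame is rebuilt from the example table, and it assumes every name stored in UseFactorA and UseFactorB is an existing column.

```python
import pandas as pd

# Sample frame reconstructed from the question's table.
df = pd.DataFrame({
    "Category": ["A", "B"],
    "Factor1": [1, 2],
    "Factor2": [2, 7],
    "Factor3": [5, 4],
    "Factor4": [8, 2],
    "UseFactorA": ["Factor1", "Factor3"],
    "UseFactorB": ["Factor3", "Factor1"],
})

def factor_calc_for_row(row):
    # row["UseFactorA"] holds a column name; use it to index the same row.
    return row[row["UseFactorA"]] / row[row["UseFactorB"]]

df["Result"] = df.apply(factor_calc_for_row, axis=1)
```

Row A divides Factor1 by Factor3 (1/5) and row B divides Factor3 by Factor1 (4/2).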
EDIT:
I realized I set up my example incorrectly, the corrected version follows:
I have two dataframes:
df1 = pd.DataFrame({'x values': [11, 12, 13], 'time': [1, 2.2, 3.5]})
df2 = pd.DataFrame({'x values': [11, 21, 12, 43], 'time': [1, 2.1, 2.6, 3.1]})
What I need to do is iterate over both of these dataframes, and compute a new value, which is a ratio of the x values in df1 and df2. The difficulty comes in because these dataframes are of different lengths.
If I just wanted to compute values pairwise between the two, I know that I could use something like zip, or even map. Unfortunately, I don't want to drop any values. Instead, I need to compare the time column between the two frames to determine whether or not to carry over a value from a previous time into the computation for the next time period.
So for instance, I would compute the first ratio:
df1["x values"][0]/df2["x values"][0]
Then for the second I check which update happens next, which in this case is to df2, so df1["time"] < df2["time"] and:
df1["x values"][0]/df2["x values"][1]
For the third I would see that df1["time"] > df2["time"], so the third computation would be:
df1["x values"][1]/df2["x values"][1]
The only time both values should be used to compute the ratio from the same "position" is if the times in the two dataframes are equal.
And so on. I'm very confused as to whether or not this is possible to execute using something like a lambda function, or itertools. I've made some attempts, but most have yielded errors. Any help would be appreciated.
Here is what I ended up doing. Hopefully it helps clarify what my question was. Also, if anyone can think of a more pythonic way to do this, I would appreciate the feedback.
# add a column indicating which 'type' of dataframe it is
df1['type'] = pd.Series('type1', index=df1.index)
df2['type'] = pd.Series('type2', index=df2.index)
# concatenate the dataframes
df = pd.concat((df1, df2), axis=0, ignore_index=True)
# sort by time
df = df.sort_values(by='time').reset_index()
# we create empty lists in order to track records
# in a way that will let us compute ratios
x1 = []
x2 = []
# we will iterate through the dataframe line by line
for i in range(0, len(df)):
    # if the row contains data from df1
    if df["type"][i] == "type1":
        # we append the x value for that type
        x1.append(df[df["type"] == "type1"]["x values"][i])
        # if the x2 list contains exactly 1 value
        if len(x2) == 1:
            # we add it to match the number of x1
            # that we have recorded up to that point
            # this is useful if one time starts before the other
            for j in range(1, len(x1) - 1):
                x2.append(x2[0])
        # if the x2 list contains more than 1 value
        # add a copy of the previous x2 record to correspond
        # to the new x1 record
        if len(x2) > 0:
            x2.append(x2[len(x2) - 1])
    # if the row contains data from df2
    if df["type"][i] == "type2":
        # we append the x value for that type
        x2.append(df[df["type"] == "type2"]["x values"][i])
        # if the x1 list contains exactly 1 value
        if len(x1) == 1:
            # we add it to match the number of x2
            # that we have recorded up to that point
            # this is useful if one time starts before the other
            for j in range(1, len(x2) - 1):
                x1.append(x1[0])
        # if the x1 list contains more than 1 value
        # add a copy of the previous x1 record to correspond
        # to the new x2 record
        if len(x1) > 0:
            x1.append(x1[len(x1) - 1])
# combine the records
new_df = pd.DataFrame({'Type 1': x1, 'Type 2': x2})
# compute the ratio
new_df['Ratio'] = new_df['Type 1'] / new_df['Type 2']
You can merge the two dataframes on time and then calculate ratios
new_df = df1.merge(df2, on = 'time', how = 'outer')
new_df['ratio'] = new_df['x values_x'] / new_df['x values_y']
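You get (this output is for integer times that match across the two frames; rows whose times do not match get NaN in one of the value columns)
   time  x values_x  x values_y     ratio
0     1          11          11  1.000000
1     2          12          21  0.571429
2     2          12          12  1.000000
3     3          13          43  0.302326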
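For the corrected example, where the times do not line up, the outer merge can be followed by a sort and a forward fill so that each row carries the most recent value from each frame. This is a sketch under the assumption that both series start at the same time (otherwise the leading rows stay NaN); the suffixes and the name merged are chosen only for illustration.

```python
import pandas as pd

df1 = pd.DataFrame({"x values": [11, 12, 13], "time": [1, 2.2, 3.5]})
df2 = pd.DataFrame({"x values": [11, 21, 12, 43], "time": [1, 2.1, 2.6, 3.1]})

merged = (
    df1.merge(df2, on="time", how="outer", suffixes=("_1", "_2"))
       .sort_values("time")     # put all update times in chronological order
       .ffill()                 # carry the last known value forward
       .reset_index(drop=True)
)
merged["ratio"] = merged["x values_1"] / merged["x values_2"]
```

With this data the first three ratios are 11/11, 11/21 and 12/21, which matches the sequence described in the question.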
I have a list such that
l = ['xyz','abc','mnq','qpr']
These values are weighted such that xyz>abc>mnq>qpr
I have a pandas dataframe with a column that has sets of values.
COL_NAME
0 set(['xyz', 'abc'])
1 set(['xyz'])
2 set(['mnq','qpr'])
Now, I want to pick the highest values in the sets such that after I apply the custom function I am left with
COL_NAME
0 set(['xyz'])
1 set(['xyz'])
2 set(['mnq'])
Is there an elegant way to do this process without resorting to a dictionary of weights?
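You can use pd.Categorical with the parameter ordered=True and set categories=l[::-1] to get the order you'd like.
def max_cat(x):
    # list(x) passes the set's members as a list; categories run from lowest to highest
    return {pd.Categorical(list(x), categories=l[::-1], ordered=True).max()}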
df.COL_NAME.apply(max_cat)
0 {xyz}
1 {xyz}
2 {mnq}
Name: COL_NAME, dtype: object
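A self-contained version of this approach, rebuilt from the question's data, might look as follows. It is a sketch rather than the only way to do it; the second computation uses plain Python (min with l.index as the key) purely as a cross-check, and it assumes every set member appears in l.

```python
import pandas as pd

l = ["xyz", "abc", "mnq", "qpr"]  # highest weight first
df = pd.DataFrame({"COL_NAME": [{"xyz", "abc"}, {"xyz"}, {"mnq", "qpr"}]})

def max_cat(x):
    # Reverse l so the categories run from lowest to highest weight.
    cat = pd.Categorical(list(x), categories=l[::-1], ordered=True)
    return {cat.max()}

result = df["COL_NAME"].apply(max_cat)

# Cross-check: the highest-weighted member has the smallest position in l.
check = df["COL_NAME"].apply(lambda s: {min(s, key=l.index)})
```

Neither variant needs a dictionary of weights, because the position in l already encodes the ordering.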
It seems that applying functions to data frames is typically done with respect to series (e.g. df.apply(my_fun)), so such functions see 'one row at a time'. My question is whether one can get more flexibility in the following sense: for a data frame df, write a function my_fun(row) that can refer to rows above or below the current row.
For example, start with the following:
def row_conditional(df, groupcol, appcol1, appcol2, newcol, sortcol, shift):
    """Input: df (dataframe): input data frame
    groupcol, appcol1, appcol2, sortcol (str): column names in df
    shift (int): integer to point to a row above or below current row
    Output: df with a newcol appended based on conditions
    """
    df[newcol] = ''  # fill new col with blank str
    list_results = []
    members = set(df[groupcol])
    for m in members:
        df_m = df[df[groupcol] == m].sort_values(sortcol, ascending=True)
        df_m = df_m.reset_index(drop=True)
        numrows_m = df_m.shape[0]
        for r in range(numrows_m):
            # CONDITIONS, based on rows above or below
            # (r + shift and r - shift must be valid row labels, else .loc raises KeyError)
            if (df_m.loc[r + shift, appcol1] > 0) and (df_m.loc[r - shift, appcol2] == 'False'):
                df_m.loc[r, newcol] = 'old'
            else:
                df_m.loc[r, newcol] = 'new'
        list_results.append(df_m)
    return pd.concat(list_results).reset_index(drop=True)
Then, I'd like to be able to re-write the above as:
def new_row_conditional(row, shift):
    """apply above conditions to row relative to row[shift, appcol1] and row[shift, appcol2]
    """
    # pseudocode: return the new value at df.loc[row, newcol]
and finally execute:
df.apply(new_row_conditional)
Thoughts/Solutions with 'map' or 'transform' are also very welcome.
From an OO perspective, I might imagine a row of df being treated as an object that has attributes i) a pointer to all rows above it and ii) a pointer to all rows below it. One would then reference row.above and row.below in order to assign the new value at df.loc[row, newcol].
One can always look into the enclosing execution frame (note that this relies on the names of local variables inside pandas' apply implementation, which are internal details and may differ between versions):
import sys
import pandas
dataf = pandas.DataFrame({'a':(1,2,3), 'b':(4,5,6)})
def foo(roworcol):
    # current index along the axis
    axis_i = sys._getframe(1).f_locals['i']
    # data frame the function is applied to
    dataf = sys._getframe(1).f_locals['self']
    axis = sys._getframe(1).f_locals['axis']
    # number of elements along the chosen axis
    n = dataf.shape[(1,0)[axis]]
    # print where we are
    print('index: %i - %i items before, %i items after' % (axis_i,
                                                           axis_i,
                                                           n-axis_i-1))
Within the function foo there is:
roworcol: the current element of the iteration
axis: the chosen axis
axis_i: the index along the chosen axis
dataf: the data frame
This is all that is needed to point before and after in the data frame.
>>> dataf.apply(foo, axis=1)
index: 0 - 0 items before, 2 items after
index: 1 - 1 items before, 1 items after
index: 2 - 2 items before, 0 items after
A complete implementation of the specific example you added in the comments would then be:
import pandas
import sys
df = pandas.DataFrame({'a':(1,2,3,4), 'b':(5,6,7,8)})
def bar(row, k):
    # current index along the axis
    axis_i = sys._getframe(2).f_locals['i']
    # data frame the function is applied to
    dataf = sys._getframe(2).f_locals['self']
    axis = sys._getframe(2).f_locals['axis']
    # number of elements along the chosen axis
    n = dataf.shape[(1,0)[axis]]
    if axis_i == 0 or axis_i == (n-1):
        res = 0
    else:
        res = dataf['a'][axis_i - k] + dataf['b'][axis_i + k]
    return res
You'll note that whenever additional arguments are present in the signature of the mapped function, we need to jump 2 frames up.
>>> df.apply(bar, args=(1,), axis=1)
0 0
1 8
2 10
3 0
dtype: int64
You'll also note that the specific example you have provided can be solved by other, and possibly simpler, means. The solution above is very general in the sense that it lets you use map while breaking out of the row being mapped. However, it may also violate assumptions about what map is doing and, for example, deprive you of the opportunity to parallelize easily by assuming independent computation on the rows.
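One such simpler means, sketched here as an alternative rather than as the method described above, is to pass the data frame to the function explicitly instead of recovering it from the call stack. The three-argument signature of bar and the parameter name frame are assumptions made for this illustration; row.name is the index label that apply attaches to each row.

```python
import pandas as pd

df = pd.DataFrame({"a": (1, 2, 3, 4), "b": (5, 6, 7, 8)})

def bar(row, frame, k):
    # Position of the current row within the frame (assumes a unique index).
    i = frame.index.get_loc(row.name)
    n = len(frame)
    # Rows without a neighbour k steps above and below get 0.
    if i - k < 0 or i + k >= n:
        return 0
    return frame["a"].iloc[i - k] + frame["b"].iloc[i + k]

# Extra positional arguments are forwarded to bar after the row.
result = df.apply(bar, args=(df, 1), axis=1)
```

This avoids any dependence on pandas internals, at the cost of making the function aware of the whole frame.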
Create a duplicate data frame that is shifted by one row, and loop over the rows of both in parallel.
df_pre = df.shift(1)  # row i of df_pre holds row i-1 of df (NaN in the first row)
# iterrows() yields (index, row) pairs; zip pairs them up by position
result = [fun(x1, x2) for (_, x1), (_, x2) in zip(df_pre.iterrows(), df.iterrows())]
This assumes you actually want everything from that row. You can of course also do direct operations, for example
result = df_pre['col'] - df['col']
Also, there are some standard processing functions built in, like diff, shift, cumsum and cumprod, that do operate on adjacent rows, although their scope is limited.
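To illustrate the built-in functions mentioned above, here is a small sketch on a made-up column named col; the variable names are hypothetical.

```python
import pandas as pd

df = pd.DataFrame({"col": [1, 4, 9, 16]})

# Current row minus the previous row (the first entry is NaN).
step_back = df["col"].diff()

# Next row minus the current row (the last entry is NaN).
step_forward = df["col"].shift(-1) - df["col"]

# Running total over all rows up to and including the current one.
running_total = df["col"].cumsum()
```

These operate on whole columns at once, so they are usually preferable to a row loop when the computation only needs a fixed neighbour or a cumulative value.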