apply max to varying-dimension subsets of pandas dataframe - python

For a dataframe with an index column containing repeated values, I'm trying to get the maximum value found in a different column, by index, and assign it to a third column, so that for any given row we can see the maximum value found in any row with the same index.
I'm doing this over a very large data set and would like the solution to be vectorized if possible. For now, I can't get it to work at all.
multiindexDF = pd.DataFrame([[1,2,3,3,4,4,4,4],[5,6,7,10,15,11,25,89]]).transpose()
multiindexDF.columns = ['theIndex','theValue']
multiindexDF['maxValuePerIndex'] = 0
uniqueIndices = multiindexDF['theIndex'].unique()
for i in uniqueIndices:
    matchingIndices = multiindexDF['theIndex'] == i
    maxValue = multiindexDF[matchingIndices == i]['theValue'].max()
    multiindexDF.loc[matchingIndices]['maxValuePerIndex'] = maxValue
This fails with a warning telling me I should use .loc, when I'm already using it. I'm not sure what the warning means, and I'm not sure how to fix this, or how to avoid looping through everything so I can vectorize it instead.
I'm looking for this
targetDF = pd.DataFrame([[1,2,3,3,4,4,4,4],[5,6,10,7,15,11,25,89],[5,6,10,10,89,89,89,89]]).transpose()
targetDF

Looks like this is a good case for groupby + transform: this gets the maximum value per index group and broadcasts it back onto the original index (rather than the grouped index):
multiindexDF['maxValuePerIndex'] = multiindexDF.groupby("theIndex")["theValue"].transform("max")
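On the question's sample data, I'd expect the result to look roughly like this, which matches the per-index maxima in the requested targetDF:
   theIndex  theValue  maxValuePerIndex
0         1         5                 5
1         2         6                 6
2         3         7                10
3         3        10                10
4         4        15                89
5         4        11                89
6         4        25                89
7         4        89                89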
The reason you're getting the SettingWithCopyWarning is that in your .loc call you're taking a slice of a slice and setting the value there; notice the two pairs of square brackets in:
multiindexDF.loc[matchingIndices]['maxValuePerIndex'] = maxValue
So it tries to assign the value to the slice rather than to the original DataFrame: you're doing a .loc and then chaining another [] after it.
So using your original approach:
for i in uniqueIndices:
    matchingIndices = multiindexDF['theIndex'] == i
    maxValue = multiindexDF.loc[matchingIndices, 'theValue'].max()
    multiindexDF.loc[matchingIndices, 'maxValuePerIndex'] = maxValue
(Notice I've also fixed the line that computes maxValue, where you were incorrectly comparing the boolean mask to i a second time.)


How do I pull the index(es) and column(s) of a specific value from a dataframe?

Hello, everyone! New student of Python's Pandas here.
I have a dataframe I artificially constructed here: https://i.stack.imgur.com/cWgiB.png. Below is a text reconstruction.
df_dict = {
    'header0' : [55,12,13,14,15],
    'header1' : [21,22,23,24,25],
    'header2' : [31,32,55,34,35],
    'header3' : [41,42,43,44,45],
    'header4' : [51,52,53,54,33]
}
index_list = {
    0: 'index0',
    1: 'index1',
    2: 'index2',
    3: 'index3',
    4: 'index4'
}
df = pd.DataFrame(df_dict).rename(index = index_list)
GOAL:
I want to pull the index row(s) and column header(s) of any ARBITRARY value(s) (int, float, str, etc.). So, for example, if I look for the value 55, the code should return header0, index0, header2, index2 in some format; a list, a tuple, a print, etc. would all be fine.
CLARIFICATIONS:
Imagine the dataframe is of a large enough size that I cannot "just find it manually"
I do not know how large this value is in comparison to other values (so a "simple .idxmax()" probably won't cut it)
I do not know where this value is column or index wise (so "just .loc,.iloc where the value is" won't help either)
I do not know whether this value has duplicates or not, but if it does, return all its column/indexes.
WHAT I'VE TRIED SO FAR:
I've played around with .columns, .index, .loc, but just can't seem to get the answer. The farthest I've gotten is creating a boolean dataframe with df.values == 55 or df == 55, but cannot seem to do anything with it.
Another "farthest" way I've gotten is using df.unstack.idxmax(), which would return a tuple of the column and header, but has 2 major problems:
Only returns the max/min as per the .idxmax(), .idxmin() functions
Only returns the FIRST column/index matching my value, which doesn't help if there are duplicates
I know I could do a for loop to iterate through the entire dataframe, tracking which column and index I am on in temporary variables. Once I hit the value I am looking for, I'll break and return the current column and index. Was just hoping there was a less brute-force-y method out there, since I'd like a "high-speed calculation" method that would work on any dataframe of any size.
Thanks.
EDIT: Added a text reconstruction of the dataframe, clarified questions.
Use np.where:
r, c = np.where(df == 55)
list(zip(df.index[r], df.columns[c]))
Output:
[('index0', 'header0'), ('index2', 'header2')]
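If this lookup is needed more than once, one way to package it is a small helper; a sketch under the same assumptions (find_locations is a made-up name, not a pandas function):
import numpy as np
import pandas as pd

def find_locations(df, value):
    # Row/column positions of every cell equal to the value.
    rows, cols = np.where(df.values == value)
    # Translate positions back to index labels and column names.
    return list(zip(df.index[rows], df.columns[cols]))

# find_locations(df, 55) -> [('index0', 'header0'), ('index2', 'header2')]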
There is a function in pandas that gives duplicate rows.
duplicate = df[df.duplicated()]
print(duplicate)
Use DataFrame.unstack for Series with MultiIndex and then filter duplicates by Series.duplicated with keep=False:
s = df.unstack()
out = s[s.duplicated(keep=False)].index.tolist()
If you also need the duplicates together with their values:
df1 = (s[s.duplicated(keep=False)]
       .sort_values()
       .rename_axis(index=['cols', 'idx'])
       .reset_index(name='val'))
If you need to test a specific value, change the mask to Series.eq (==):
s = df.unstack()
out = s[s.eq(55)].index.tolist()
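On the sample frame, I'd expect that last version to return roughly the following; note that DataFrame.unstack puts the column label first in each tuple, the reverse order of the np.where answer:
out
[('header0', 'index0'), ('header2', 'index2')]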
So, in the code below there is an iteration; however, it doesn't iterate over the whole DataFrame, it just iterates over the columns and uses .any() to check whether the desired value is present. It then uses pandas' loc to locate the value and finally prints the index.
wanted_value = 55
for col in list(df.columns):
    if df[col].eq(wanted_value).any():
        print("row:", *list(df.loc[df[col].eq(wanted_value)].index), ' col', col)

How to drop a series of rows from dataframe in a faster way

I have a data set and I want to drop some rows in a faster way; I tried the following code but it took a long time.
I want to drop every user who makes fewer than 3 operations.
Every operation is stored in its own row, and user_id is not the index of my data.
undesirable_users = []
for i in range(len(operations_per_user)):
    if operations_per_user.get_value(operations_per_user.index[i]) <= 3:
        undesirable_users.append(operations_per_user.index[i])
for i in range(len(undesirable_users)):
    data = data.drop(data[data.user_id == undesirable_users[i]].index)
data is a DataFrame and operations_per_user is a Series created by: operations_per_user = data['user_id'].value_counts().
Why not just filter them? You don't need to loop at all.
You can get the filtered indexes by:
operations_per_user.index[operations_per_user <= 3]
And then you can filter these indexes from the df, making the solution:
data = data[~data['user_id'].isin(operations_per_user.index[operations_per_user <= 3])]
EDIT
My understanding is that you want to remove any user that occurs less than 3 times in the data. You won't need to create a value_counts series for that; you could do a groupby, find the counts, and then filter on that basis.
filtered_user_ids = data.groupby('user_id').filter(lambda x: len(x) <= 3)['user_id'].tolist()
data = data[~data['user_id'].isin(filtered_user_ids)]
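A note on the design: groupby().filter already returns the surviving rows directly, so (as a hedged shortcut) the isin round trip can be skipped entirely by keeping the users with more than 3 rows:
data = data.groupby('user_id').filter(lambda x: len(x) > 3)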
If data is a pandas DataFrame, and it contains both user_id and operations_per_user as columns, you should perform the drop with:
data = data.drop(data.loc[data['operations_per_user'] <= 3].index)
Edit
Instead of creating a seperate series, you could add operations_per_user to data with:
data['operations_per_user'] = data['user_id'].map(data['user_id'].value_counts())
You could either perform the drop as above or perform the selection with the inverse logical condition:
data = data.loc[data['operations_per_user'] > 3]
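A minimal sketch on made-up toy data (the user_id values below are illustrative only) showing the counts column and the selection:
import pandas as pd

data = pd.DataFrame({'user_id': [1, 1, 1, 1, 2, 2]})
# Count how many rows each user has and align the counts back onto the rows.
data['operations_per_user'] = data['user_id'].map(data['user_id'].value_counts())
# Keep only rows belonging to users with more than 3 operations.
data = data.loc[data['operations_per_user'] > 3]
# Only user 1 (4 operations) survives; user 2 (2 operations) is dropped.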
Original
It would be preferable if you could supply some more information about the variables used in your code.
If operations_per_user is a pandas Series, your first loop could be improved with:
undesirable_users = []
for i in operations_per_user.index:
    if operations_per_user.loc[i] <= 3:
        undesirable_users.append(i)
The function get_value() is deprecated; use loc or iloc instead.
You can iterate over a Python list directly; for your second loop:
for user in undesirable_users:
    data = data.drop(data.loc[data['user_id'] == user].index)
Rather than dropping, you can simply select the rows you want to keep by inverting the logical condition.
First, select the user to keep only.
Then get a boolean list, length equal to data rows.
Finally, select the rows to keep.
keepusers = operations_per_user.loc[operations_per_user > 3]
tokeep = [uid in keepusers for uid in data['user_id']]
newdata = data.loc[tokeep]
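An equivalent formulation that avoids the Python-level list comprehension is Series.isin on the kept user ids; a sketch with the same variable names:
keepusers = operations_per_user.loc[operations_per_user > 3]
newdata = data.loc[data['user_id'].isin(keepusers.index)]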

How to return the index value of an element in a pandas dataframe

I have a dataframe of corporate actions for a specific equity. It looks something like this:
0          Declared Date     Ex-Date Record Date
BAR_DATE
2018-01-17    2017-02-21  2017-08-09  2017-08-11
2018-01-16    2017-02-21  2017-05-10  2017-06-05
except that it has hundreds of rows, but that is unimportant. I created the index "BAR_DATE" from one of the columns, which is where the 0 above BAR_DATE comes from.
What I want to do is to be able to reference a specific element of the dataframe and return the index value, or BAR_DATE. I think it would go something like this:
index_value = cacs.iloc[5, :].index.get_values()
except index_value becomes the column names, not the index. Now, this may stem from a poor understanding of indexing in pandas dataframes, so this may or may not be really easy to solve for someone else.
I have looked at a number of other questions including this one, but it returns column values as well.
Your code is really close, but you took it just one step further than you needed to.
# creates a slice of the dataframe where the row is at iloc 5 (row number 5) and where the slice includes all columns
slice_of_df = cacs.iloc[5, :]
# returns the index of the slice
# this will be an Index() object
index_of_slice = slice_of_df.index
From here we can use the documentation on the Index object: https://pandas.pydata.org/pandas-docs/stable/generated/pandas.Index.html
# turns the index into a list of values in the index
index_list = index_of_slice.to_list()
# gets the first index value
first_value = index_list[0]
The most important thing to remember about the Index is that it is an object of its own, and thus we need to change it to the type we expect to work with if we want something other than an index. This is where documentation can be a huge help.
EDIT: It turns out that the iloc in this case is returning a Series object which is why the solution is returning the wrong value. Knowing this, the new solution would be:
# creates a Series object from row 5 (technically the 6th row)
row_as_series = cacs.iloc[5, :]
# the name of a series relates to its index
index_of_series = row_as_series.name
This would be the approach for single-row indexing. You would use the former approach with multi-row indexing where the return value is a DataFrame and not a Series.
Unfortunately, I don't know how to coerce the Series into a DataFrame for single-row slicing beyond explicit conversion:
row_as_df = pd.DataFrame(cacs.iloc[5, :])
While this will work, and the first approach will happily take this and return the index, there is likely a reason why Pandas doesn't return a DataFrame for single-row slicing so I am hesitant to offer this as a solution.
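If a one-row DataFrame that keeps the row orientation is wanted, a possible alternative (a sketch, not necessarily what the answer's author intended) is Series.to_frame() followed by a transpose:
row_as_df = cacs.iloc[5, :].to_frame().T
# row_as_df.index now contains the single BAR_DATE label for that row.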

Drop Pandas DataFrame lines according to a GroupBy property

I have some DataFrames with information about some elements, for instance:
my_df1=pd.DataFrame([[1,12],[1,15],[1,3],[1,6],[2,8],[2,1],[2,17]],columns=['Group','Value'])
my_df2=pd.DataFrame([[1,5],[1,7],[1,23],[2,6],[2,4]],columns=['Group','Value'])
I have used something like dfGroups = df.groupby('group').apply(my_agg).reset_index(), so now I have DataFrames with information on groups of the previous elements, say
my_df1_Group=pd.DataFrame([[1,57],[2,63]],columns=['Group','Group_Value'])
my_df2_Group=pd.DataFrame([[1,38],[2,49]],columns=['Group','Group_Value'])
Now I want to clean my groups according to properties of their elements. Let's say that I want to discard groups containing an element with Value greater than 16. So in my_df1_Group, there should only be the first group left, while both groups qualify to stay in my_df2_Group.
As I don't know how to get my_df1_Group and my_df2_Group from my_df1 and my_df2 in Python (I know other languages where it would simply be name+"_Group" with name looping in [my_df1,my_df2], but how do you do that in Python?), I build a list of lists:
SampleList = [[my_df1,my_df1_Group],[my_df2,my_df2_Group]]
Then, I simply try this:
my_max=16
Bad=[]
for Sample in SampleList:
    for n in Sample[1]['Group']:
        df = Sample[0].loc[Sample[0]['Group'] == n]  # This is inelegant, but trying to work
                                                     # with Sample[1] in the for doesn't work
        if df['Value'].max() > my_max:
            Bad.append(1)
        else:
            Bad.append(0)
    Sample[1] = Sample[1].assign(Bad_Row=pd.Series(Bad))
    Sample[1] = Sample[1].query('Bad_Row == 0')
This runs without errors, but doesn't work. In particular, it doesn't add the column Bad_Row to my df, nor does it modify my DataFrame (yet the query runs smoothly even though the Bad_Row column doesn't seem to exist...). On the other hand, if I run this technique manually on a df (i.e. not in a loop), it works.
How should I do this?
Based on your comment below, I think you want to check whether a Group in your aggregated data frame has a Value in the input data greater than 16. One solution is to perform a row-wise calculation using a criterion on the input data. To accomplish this, my_func accepts a row from the aggregated data frame plus the input data as a pandas groupby object. For each group in your aggregated data frame, it subsets your initial data and uses boolean logic to see if any of the 'Value' entries meet your specified criterion.
def my_func(row, grouped_df1):
    if (grouped_df1.get_group(row['Group'])['Value'] > 16).any():
        return 'Bad Row'
    else:
        return 'Good Row'
my_df1=pd.DataFrame([[1,12],[1,15],[1,3],[1,6],[2,8],[2,1],[2,17]],columns=['Group','Value'])
my_df1_Group=pd.DataFrame([[1,57],[2,63]],columns=['Group','Group_Value'])
grouped_df1 = my_df1.groupby('Group')
my_df1_Group['Bad_Row'] = my_df1_Group.apply(lambda x: my_func(x,grouped_df1), axis=1)
Returns:
   Group  Group_Value   Bad_Row
0      1           57  Good Row
1      2           63   Bad Row
Based on dubbbdan's idea, here is code that works:
my_max = 16

def my_func(row, grouped_df1):
    if (grouped_df1.get_group(row['Group'])['Value'] > my_max).any():
        return 1
    else:
        return 0

SampleList = [[my_df1, my_df1_Group], [my_df2, my_df2_Group]]
for Sample in SampleList:
    grouped_df = Sample[0].groupby('Group')
    Sample[1]['Bad_Row'] = Sample[1].apply(lambda x: my_func(x, grouped_df), axis=1)
    Sample[1].drop(Sample[1][Sample[1]['Bad_Row'] != 0].index, inplace=True)
    Sample[1].drop(['Bad_Row'], axis=1, inplace=True)
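For completeness, a vectorized sketch of the same cleanup that swaps the row-wise apply for a per-group max; assigning back through the list index avoids the rebinding problem that made the original loop appear to do nothing:
my_max = 16
for i, (df_elements, df_groups) in enumerate(SampleList):
    # Groups whose largest element Value exceeds the threshold.
    too_big = df_elements.groupby('Group')['Value'].max() > my_max
    bad_groups = too_big[too_big].index
    # Reassign through SampleList[i] so the list itself is updated.
    SampleList[i][1] = df_groups[~df_groups['Group'].isin(bad_groups)]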

Modifying entries of a pandas dataframe within a loop

I want to add the probabilities for each record within a dataframe; for that I used a for loop:
def map_score(dataframe, customers, prob):
    dataframe['Propensity'] = 0
    for i in range(len(dataframe)):
        for j in range(len(customers)):
            if dataframe['Client'].iloc[i] == customers[j]:
                dataframe["Propensity"].iloc[i] = prob[j]
I am able to map the Probabilities associated with each client correctly, but Python throws a warning message
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead
See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
from ipykernel import kernelapp as app
When I use the .loc function, the result is erroneous and I get null values.
Kindly suggest a good method to update and add entries conditionally.
You are attempting to make an assignment on a copy.
dataframe["Propensity"] is a single column, and chained indexing through it acts on a "copy" of dataframe.
However, you are tracking the index position with i. So how do you use .loc when you have the column name "Propensity" and an integer location i?
Assign some variable, say idx, equal to dataframe.index at that location
idx = dataframe.index[i]
Then you can use .loc with the assignment and without issues
dataframe.loc[idx, "Propensity"] = prob[j]
def map_score(dataframe, customers, prob):
    dataframe['Propensity'] = 0
    for i in range(len(dataframe)):
        idx = dataframe.index[i]
        for j in range(len(customers)):
            if dataframe['Client'].iloc[i] == customers[j]:
                dataframe.loc[idx, "Propensity"] = prob[j]
