I need to assign a value to all rows where column 'A' is 0 (zero). The new value should be the mean of 'A' taken over all rows that share the same value in another column ('B'); i.e. every row with 'A' equal to 0 should have 'A' replaced by the mean of 'A' among the rows with the same 'B' value. Apparently the following code is not working, because when I call print(df.A) afterwards, some rows still have 'A' equal to 0:
df = df[df.A == 0].groupby('B')['A'].mean().reset_index()
I tried a bunch of one-liners, but some aren't even accepted...
What I expect is that all 0 values in A are replaced by the mean of column A grouped by column B, like this:
Before:
   A   B
1   0  7
2   0  7
3   9  7
4  10  6
5   8  6
6   0  6
7   0  2
After:
   A   B
1   3  7
2   3  7
3   9  7
4  10  6
5   8  6
6   3  6
7   0  2
Thank you for your support.
I think I understand your question now, but then I don't see how you got the '3' for row 6, col A. I am following the logic that matches the 3's in rows 1 and 2 of col A, which I will try to explain in code below. If this isn't quite the correct interpretation, hopefully it still gets you pointed in the right direction.
Your initial df
import pandas as pd

df = pd.DataFrame({
    'A': [0, 0, 9, 10, 8, 0, 0],
    'B': [7, 7, 7, 6, 6, 6, 2]
})
   A   B
1   0  7
2   0  7
3   9  7
4  10  6
5   8  6
6   0  6
7   0  2
Restating the objective:
For each unique value in col B where col A is 0, find the rows where col B has that value and take the mean of those col A values. Then write that mean over the rows in A that are 0 and line up with the selected value in B. For example, the first 3 rows have 7 in col B and 0, 0, 9 in col A. The mean of those three A values is 3, so that is the value written over the 0s in col A, rows 1 and 2.
Steps
Get the unique values from col B where col A is also 0
bvals_when_a_zero = df[df['A'] == 0]['B'].unique()
array([7, 6, 2])
For each of those unique values, calculate the mean of the corresponding values in col A
means = [df[df['B'] == i]['A'].mean() for i in bvals_when_a_zero]
[3.0, 6.0, 0.0]
Loop over the (bval, mean) pairs and overwrite the 0's with the corresponding mean for each bval. The pandas where method keeps the values of the calling Series (here, df['A']) wherever the condition in the first argument is True, and otherwise takes the second argument. Our condition (df['A'] == 0) & (df['B'] == bval) selects the rows where col A is 0 and col B equals the current bval, but we actually want to keep the df['A'] values that do NOT match it, so the condition is negated with the ~ operator.
for bval, mean in zip(bvals_when_a_zero, means):
    df['A'] = df['A'].where(~((df['A'] == 0) & (df['B'] == bval)), mean)
This gives the final df
   A   B
1   3  7
2   3  7
3   9  7
4  10  6
5   8  6
6   6  6
7   0  2
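As a side note (my own addition, not part of the loop above): if the zeros should be included when computing each group's mean, as in the steps above, the whole replacement can be done without an explicit loop using groupby and transform:

# per-row mean of A within each B group (zeros included),
# written over A only where A is 0
groupMeans = df.groupby('B')['A'].transform('mean')
df['A'] = df['A'].mask(df['A'] == 0, groupMeans)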
Related
Assume I have the following dataframe:
df = pd.DataFrame([[1, [2], [3]], [8, [10, 11, 12, 13], [6, 7, 8, 9]]], columns=list("abc"))
   a  b                 c
0  1  [2]               [3]
1  8  [10, 11, 12, 13]  [6, 7, 8, 9]
Columns b and c can consist of 1 to n elements, but within a row they will always have the same number of elements.
I am looking for a way to explode columns b and c while applying a function to the whole row, here for example in order to divide column a by the number of elements in b and c. (I know this is a contrived example, as this one would be easily solvable by first dividing and then exploding; the real use case is a bit more complicated but of no importance here.)
So the result would look something like this:
   a  b   c
0  1  2   3
1  2  10  6
2  2  11  7
3  2  12  8
4  2  13  9
I tried using the apply method as in the following snippet, but it only produced garbage and does not work when the number of elements in the lists does not match the number of columns:
def fun(row):
    if isinstance(row.c, list):
        result = [[row.a, row.b, c] for c in row.c]
        return result
    return row

df.apply(fun, axis=1)
Pandas' explode function also doesn't really fit here for me, because afterwards there is no way of telling anymore whether the rows were exploded or not.
Is there an easier way than iterating through the dataframe, exploding the values and manually building up a new dataframe in the way I need it here?
Thank you already for your help.
Edit:
The real use case is basically a mapping from b+c to a.
So I have another file that looks something like that:
b   c  a
2   3  1
10  6  1
11  7  1
12  8  2
13  9  4
So coming from this example, the result would actually be as follows:
a  b   c
1  2   3
1  10  6
1  11  7
2  12  8
4  13  9
The problem is that there is no 1:1 relation between these two files, as it might seem here.
You say:
Pandas' explode function also doesn't really fit here for me, because afterwards there is no way of telling anymore whether the rows were exploded or not.
This isn't true. Consider this:
import pandas as pd
df = pd.DataFrame([[1, [2], [3]], [8, [10, 11, 12, 13], [6, 7, 8, 9]]], columns=list("abc"))
# explode
df2 = df.explode(['b','c'])
As you can see from the print below, the result also explodes the index:
   a   b  c
0  1   2  3
1  8  10  6
1  8  11  7
1  8  12  8
1  8  13  9
So, we can use the index to track how many elements per row got exploded. Try this:
# reset the index, keeping the old index as a column; map each old index
# value to its frequency, then divide 'a' by that count
df2.reset_index(drop=False, inplace=True)
df2['a'] = df2['a'] / df2['index'].map(df2['index'].value_counts())
# drop the old index, now a column
df2.drop('index', axis=1, inplace=True)
Result:
     a   b  c
0  1.0   2  3
1  2.0  10  6
2  2.0  11  7
3  2.0  12  8
4  2.0  13  9
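For the edited use case (looking up a from the second file by the exploded (b, c) pairs), exploding and then merging might work. A minimal sketch, where mapping is a hypothetical stand-in for that second file:

# hypothetical stand-in for the second file from the edit
mapping = pd.DataFrame({'b': [2, 10, 11, 12, 13],
                        'c': [3, 6, 7, 8, 9],
                        'a': [1, 1, 1, 2, 4]})

# explode b and c, drop the old a, then look the new a up by (b, c)
exploded = df.explode(['b', 'c']).drop(columns='a')
exploded[['b', 'c']] = exploded[['b', 'c']].astype(int)  # explode leaves object dtype
result = exploded.merge(mapping, on=['b', 'c'], how='left')[['a', 'b', 'c']]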
The easiest way to explain this is with an example.
Imagine the following dataframe:
a b
1 5
2 4
3 2
4 2
5 4
6 3
7 2
8 1
9 0
I want to be able to get the average of the top 3 and bottom 3 values (i.e., the 3 rows above and the 3 rows below) for each value in column b.
So it should look like this:
a  b  c
1  5
2  4
3  2
4  2  3.33
5  4  2.33
6  3  1.83
7  2
8  1
9  0
Any help is appreciated.
Thanks
Here's my solution using some help from numpy:
(df is your example dataframe)
import numpy as np

length = df.shape[0]  # number of rows in the dataframe
windowSize = 3        # since we are looking at the 3 values above and below

bVals = df['b'].to_numpy()
for i in range(windowSize, length - windowSize):
    # Get the (0-based) indexes of the 3 rows above and the 3 rows below
    top3Idxs = np.arange(i - windowSize, i)
    bottom3Idxs = np.arange(i + 1, i + 1 + windowSize)
    # Get the values in column b at those indexes
    top3Vals = bVals[top3Idxs]
    bottom3Vals = bVals[bottom3Idxs]
    # Find the average of the top3Vals and bottom3Vals
    avg = np.mean(np.concatenate((top3Vals, bottom3Vals)))
    # Set the average at the proper index in column c
    df.at[i, 'c'] = avg
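A vectorized alternative (my own rewrite, not part of the loop above) that should give the same numbers uses a centered rolling window: sum the 7-row window centered on each row, subtract the row's own value, and divide by the 6 neighbours. Rows without a full window get NaN automatically.

windowSum = df['b'].rolling(7, center=True).sum()
df['c'] = (windowSum - df['b']) / 6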
I don't really understand your question or how you got the values in column 'c'. If you want the top and bottom averages of the two columns, that would be 4 separate values (whereas you only have 3 values in column 'c'). I'm also not sure if by top/bottom you mean highest/lowest 3 values in each column (since you say top 'n' values, I'm guessing not).
The top/bottom averages of col 'a' and col 'b' would just be this:
data = {'a': list(range(1, 10)), 'b': [5, 4, 2, 2, 4, 3, 2, 1, 0]}
df = pd.DataFrame(data)
a b
0 1 5
1 2 4
2 3 2
3 4 2
4 5 4
5 6 3
6 7 2
7 8 1
8 9 0
n = 3
averages = {}
for col in df.columns:
    averages[col + '_bottom_avg'] = df[col][:n].mean()
    averages[col + '_top_avg'] = df[col][-n:].mean()
Output:
averages
{'a_bottom_avg': 2.0,
'a_top_avg': 8.0,
'b_bottom_avg': 3.6666666666666665,
'b_top_avg': 1.0}
If you want the average of the 3 largest/smallest values instead, you can just sort the columns first:
averages = {}
for col in df.columns:
    averages[col + '_bottom_avg'] = df[col].sort_values()[:n].mean()
    averages[col + '_top_avg'] = df[col].sort_values()[-n:].mean()
Output:
averages
{'a_bottom_avg': 2.0,
'a_top_avg': 8.0,
'b_bottom_avg': 1.0,
'b_top_avg': 4.333333333333333}
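The same result can also be had without sorting the full columns, using nsmallest and nlargest:

averages = {}
for col in df.columns:
    averages[col + '_bottom_avg'] = df[col].nsmallest(n).mean()
    averages[col + '_top_avg'] = df[col].nlargest(n).mean()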
Apologies if I have completely misunderstood your question.
How do I find the most frequent value of each row of a dataframe?
For example:
In [14]: df
Out[14]:
a b c
0 2 3 3
1 1 1 2
2 7 7 8
return:
[3,1,7]
Try the .mode() method:
In [88]: df
Out[88]:
a b c
0 2 3 3
1 1 1 2
2 7 7 8
In [89]: df.mode(axis=1)
Out[89]:
0
0 3
1 1
2 7
From docs:
Gets the mode(s) of each element along the axis selected. Adds a row
for each mode per label, fills in gaps with nan.
Note that there could be multiple values returned for the selected
axis (when more than one item share the maximum frequency), which is
the reason why a dataframe is returned. If you want to impute missing
values with the mode in a dataframe df, you can just do this:
df.fillna(df.mode().iloc[0])
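To get the flat list from the question, take the first mode column and convert it:

df.mode(axis=1)[0].tolist()  # [3, 1, 7]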
I have a list, to_delete, of row indexes that I want to delete from both of my two Pandas Dataframes, df1 & df2. They both have 500 rows. to_delete has 50 entries.
I run this:
df1.drop(df1.index[to_delete], inplace=True)
df2.drop(df2.index[to_delete], inplace=True)
But this results in df1 and df2 having 250 rows each. It deletes 250 rows from each, and not the 50 specific rows that I want it to...
to_delete is ordered in descending order.
The full method:
def method(results):
    # results is a 500 x 1 matrix of 1's and -1's
    global df1, df2
    deletions = []
    for i in range(len(results) - 1, -1, -1):
        if results[i] == -1:
            deletions.append(i)
    df1.drop(df1.index[deletions], inplace=True)
    df2.drop(df2.index[deletions], inplace=True)
Any suggestions as to what I'm doing wrong?
(I've also tried using .iloc instead of .index, and deleting inside the if statement instead of appending to a list first.)
Your index values are not unique, and when you use drop it removes all rows with those index values. to_delete may have been of length 50, but there were 250 rows that had those particular index values.
Consider the example
df = pd.DataFrame(dict(A=range(10)), [0, 1, 2, 3, 4] * 2)
df
A
0 0
1 1
2 2
3 3
4 4
0 5
1 6
2 7
3 8
4 9
Let's say you want to remove the first, third, and fourth rows.
to_del = [0, 2, 3]
Using your method
df.drop(df.index[to_del])
A
1 1
4 4
1 6
4 9
Which is a problem
Option 1
use np.in1d to find complement of to_del
This is more self-explanatory than the others. I check, for each position in an array from 0 to n-1, whether it is in to_del. The result is a boolean array the same length as df. I use ~ to negate it and use that to slice the dataframe.
df[~np.in1d(np.arange(len(df)), to_del)]
A
1 1
4 4
0 5
1 6
2 7
3 8
4 9
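As an aside (my addition): newer NumPy releases recommend np.isin over np.in1d; the same slice would be:

df[~np.isin(np.arange(len(df)), to_del)]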
Option 2
use np.bincount to find complement of to_del
This accomplishes the same thing as option 1 by counting the positions defined in to_del. I end up with an array of 0s and 1s, with a 1 in each position defined in to_del and 0 elsewhere. I want to keep the 0s, so I make a boolean array by finding where it is equal to 0, and use it to slice the dataframe.
df[np.bincount(to_del, minlength=len(df)) == 0]
A
1 1
4 4
0 5
1 6
2 7
3 8
4 9
Option 3
use np.setdiff1d to find positions
This uses set logic to find the difference between a full array of positions and just the ones I want to delete. I then use iloc to select.
df.iloc[np.setdiff1d(np.arange(len(df)), to_del)]
A
1 1
4 4
0 5
1 6
2 7
3 8
4 9
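One more variant, not in the original answer: build a boolean keep-mask over positions and select with iloc, which ignores the duplicated index labels entirely.

import numpy as np

# keep everything except the positions listed in to_del
keep = np.ones(len(df), dtype=bool)
keep[to_del] = False
df.iloc[keep]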