Quick sum of all rows that fulfill a condition in a DataFrame - python

I have a pandas dataframe that looks something like this:
df = pd.DataFrame(np.array([[1,1, 0], [5, 1, 4], [7, 8, 9]]),columns=['a','b','c'])
   a  b  c
0  1  1  0
1  5  1  4
2  7  8  9
I want to find the first column in which the majority of elements in that column are equal to 1.0.
I currently have the following code, which works, but in practice my dataframes usually have thousands of columns, and this code sits in a performance-critical part of my application, so I wanted to know if there is a way to do this faster.
for col in df.columns:
    amount_votes = len(df[df[col] == 1.0])
    if amount_votes > len(df) / 2:
        return col
In this case, the code should return 'b', since that is the first column in which the majority of elements are equal to 1.0

Try:
print((df.eq(1).sum() > len(df) // 2).idxmax())
Prints:
b
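As a sketch of why this works (using the sample frame from the question): df.eq(1).sum() counts the ones per column, the comparison yields a boolean Series, and .idxmax() returns the label of the first True.
import numpy as np
import pandas as pd

df = pd.DataFrame(np.array([[1, 1, 0], [5, 1, 4], [7, 8, 9]]), columns=['a', 'b', 'c'])

# Count 1s per column, compare with half the row count, take the first True label.
majority = df.eq(1).sum() > len(df) / 2
print(majority.idxmax())  # b

# Caveat: if no column has a majority of 1s, idxmax still returns the first
# label, so check majority.any() first when that case is possible.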

Find the columns where more than half of the values equal 1.0:
cols = df.eq(1.0).sum().gt(len(df)/2)
Get the first one:
cols[cols].head(1)


Selecting rows based on Boolean values in a non-dangerous way

This is an easy question since it is so fundamental. See, in R, when you want to slice rows from a dataframe based on some condition, you just write the condition and it selects the corresponding rows. For example, if you have a condition such that only the third row in the dataframe meets it, it returns the third row. Easy peasy.
In Python, you have to use loc. If the index matches the row numbers, then everything is great. If you have been removing rows or re-ordering them for any reason, you have to remember that loc is based on the index label, not the row position. So if in your current dataframe the third row matches the boolean conditional in the loc statement, loc will retrieve the row whose index label is 3, which could be the 50th row rather than your current third row. This seems an incredibly dangerous way to select rows, so I know I am doing something wrong.
So what is the best-practice method of ensuring you select the nth row based on a boolean conditional? Is it just to use loc and "always remember to use reset_index, otherwise if you miss it even once your entire dataframe is wrecked"? This can't be it.
Use iloc instead of loc for integer-based (positional) indexing:
data = {'A': [1, 2, 3], 'B': [4, 5, 6], 'C': [7, 8, 9]}
df = pd.DataFrame(data, index=[1, 2, 3])
df
Dataset:
   A  B  C
1  1  4  7
2  2  5  8
3  3  6  9
Label-based:
df.loc[1]
Results:
A 1
B 4
C 7
Integer-based:
df.iloc[1]
Results:
A 2
B 5
C 8
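Tying this back to the boolean-conditional question, here is a minimal sketch (not from the answer above) of picking the nth matching row by position, with no reset_index needed:
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6], 'C': [7, 8, 9]}, index=[1, 2, 3])

# Boolean filtering keeps whatever labels the matching rows have;
# iloc then picks by position within that filtered result.
matches = df[df['A'] > 1]
print(matches.iloc[0])  # first matching row: A=2, B=5, C=8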

Calculating overlap between categorical variables in the same DataFrame in Python

I am working on my undergraduate thesis and I have a question about part of my code. I am trying to calculate the number of times each column pair in a df has the same value. The columns are all binary (0, 1).
The input has this format:
df = pd.DataFrame([[1, 0, 1], [0, 0, 1]], columns = ["col1", "col2"])
For instance, the number of times col1 and col2 had the same value in the snippet above is 3.
This is the code I have so far:
bl1 = []
bl2 = []
overlap = []
for i in df.iterrows():
    for j in range(len(df.columns)):
        for k in range(j):
            a = df.iloc[j]
            b = df.iloc[k]
            comparison_column = np.where(a == b)
            bl1.append(df.columns[j])
            bl2.append(df.columns[k])
            overlap.append(len(comparison_column[0]))
After combining the lists into a pd.DataFrame, the output looks like this:
Base Learner 1   Base Learner 2   overlap
col1             col2             2
I know that the code does not work because I did a count in Excel and got different results for the overlap count. I suspect that the loops fail to sum the number of times each pair was found across the df.iterrows() part, but I do not know how to fix it. Please give me any suggestions you can. Thanks.
Let's try the magic matrix multiplication:
(df.T @ df) + ((1 - df.T) @ (1 - df))
Output:
        Rule21  Rule22  Rule23  Rule24
Rule21       5       5       4       2
Rule22       5       5       4       2
Rule23       4       4       5       3
Rule24       2       2       3       5
Explanation: df.T @ df counts the rows where the corresponding cells in both columns are 1. Similarly, (1 - df.T) @ (1 - df) counts the rows where both columns are 0.
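For illustration, a minimal runnable sketch of that identity on a small made-up binary frame (the data and column names here are hypothetical, not the question's); @ is Python's matrix-multiplication operator:
import pandas as pd

# Hypothetical 3-row binary frame for illustration.
df = pd.DataFrame({'col1': [1, 0, 1], 'col2': [1, 0, 0]})

# df.T @ df counts rows where both columns are 1;
# (1 - df.T) @ (1 - df) counts rows where both columns are 0.
overlap = (df.T @ df) + ((1 - df.T) @ (1 - df))
print(overlap)
# col1 and col2 agree in rows 0 and 1, so overlap.loc['col1', 'col2'] == 2.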

How to remove blanks/NA's from dataframe and shift the values up

I have a huge dataframe which has values and blanks/NAs in it. I want to remove the blanks from the dataframe and move the next values up in the column. Consider the sample dataframe below.
import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.randn(5,4))
df.iloc[1,2] = np.nan
df.iloc[0,1] = np.nan
df.iloc[2,1] = np.nan
df.iloc[2,0] = np.nan
df
          0         1         2         3
0  1.857476       NaN -0.462941 -0.600606
1  0.000267 -0.540645       NaN  0.492480
2       NaN       NaN -0.803889  0.527973
3  0.566922  0.036393 -1.584926  2.278294
4 -0.243182 -0.221294  1.403478  1.574097
I want my output to be as below
          0         1         2         3
0  1.857476 -0.540645 -0.462941 -0.600606
1  0.000267  0.036393 -0.803889  0.492480
2  0.566922 -0.221294 -1.584926  0.527973
3 -0.243182            1.403478  2.278294
4                                1.574097
I want the NaNs to be removed and the next values to move up. df.shift was not helpful. I tried with multiple loops and if statements and achieved the desired result, but is there any better way to get it done?
You can use apply with dropna:
np.random.seed(100)
df = pd.DataFrame(np.random.randn(5,4))
df.iloc[1,2] = np.nan
df.iloc[0,1] = np.nan
df.iloc[2,1] = np.nan
df.iloc[2,0] = np.nan
print (df)
          0         1         2         3
0 -1.749765       NaN  1.153036 -0.252436
1  0.981321  0.514219       NaN -1.070043
2       NaN       NaN -0.458027  0.435163
3 -0.583595  0.816847  0.672721 -0.104411
4 -0.531280  1.029733 -0.438136 -1.118318
df1 = df.apply(lambda x: pd.Series(x.dropna().values))
print (df1)
          0         1         2         3
0 -1.749765  0.514219  1.153036 -0.252436
1  0.981321  0.816847 -0.458027 -1.070043
2 -0.583595  1.029733  0.672721  0.435163
3 -0.531280       NaN -0.438136 -0.104411
4       NaN       NaN       NaN -1.118318
And then, if you need to replace NaN with an empty string, be aware that this creates mixed values (strings with numerics), so some functions can break:
df1 = df.apply(lambda x: pd.Series(x.dropna().values)).fillna('')
print (df1)
          0         1          2          3
0  -1.74977  0.514219    1.15304  -0.252436
1  0.981321  0.816847  -0.458027  -1.070043
2 -0.583595   1.02973   0.672721   0.435163
3  -0.53128            -0.438136  -0.104411
4                                 -1.118318
A numpy approach
The idea is to sort the columns by np.isnan so that NaNs are put last. I use kind='mergesort' to preserve the order within the non-NaN values. Finally, I slice the array and reassign it. I follow this up with a fillna.
v = df.values
i = np.arange(v.shape[1])
a = np.isnan(v).argsort(0, kind='mergesort')
v[:] = v[a, i]
print(df.fillna(''))
          0         1          2         3
0   1.85748 -0.540645  -0.462941 -0.600606
1  0.000267  0.036393  -0.803889  0.492480
2  0.566922 -0.221294   -1.58493  0.527973
3 -0.243182             1.40348   2.278294
4                                 1.574097
If you didn't want to alter the dataframe in place:
v = df.values
i = np.arange(v.shape[1])
a = np.isnan(v).argsort(0, kind='mergesort')
pd.DataFrame(v[a, i], df.index, df.columns).fillna('')
The point of this is to leverage numpy's speed.
Adding on to the solution by piRSquared:
This shifts all the values to the left instead of up.
If not all values are numbers, use pd.isnull instead of np.isnan.
v = df.values
a = [[n]*v.shape[1] for n in range(v.shape[0])]
b = pd.isnull(v).argsort(axis=1, kind = 'mergesort')
# a is a matrix used to reference the row index,
# b is a matrix used to reference the column index
# taking an entry from a and the respective entry from b (Same index),
# we have a position that references an entry in v
v[a, b]
A bit of explanation:
a is a list of length v.shape[0], and it looks something like this:
[[0, 0, 0, 0],
[1, 1, 1, 1],
[2, 2, 2, 2],
[3, 3, 3, 3],
[4, 4, 4, 4],
...
What happens here is that v is m x n, and I have made both a and b m x n, so we are pairing up every entry (i, j) of a with the corresponding entry (i, j) of b: together they give a position that references an entry in v, taking the row from a and the column from b. So if a and b both looked like the matrix above, then v[a, b] would return a matrix whose first row contains n copies of v[0][0], whose second row contains n copies of v[1][1], and so on.
In piRSquared's solution, i is a 1-D array, not a matrix, so it is reused v.shape[0] times, i.e. once for every row. Similarly, we could have done:
a = [[n] for n in range(v.shape[0])]
# which looks like
# [[0],[1],[2],[3]...]
# since we are trying to indicate the row indices of the matrix v as opposed to
# [0, 1, 2, 3, ...] which refers to column indices
Let me know if anything is unclear,
Thanks :)
As a pandas beginner I wasn't immediately able to follow the reasoning behind @jezrael's
df.apply(lambda x: pd.Series(x.dropna().values))
but I figured out that it works by resetting the index of the column. df.apply (by default) works column-by-column, treating each column as a Series. Using x.dropna() removes NaNs but doesn't change the index of the remaining numbers, so when the column is added back to the dataframe the numbers go back to their original positions (their indices are still the same) and the empty spaces are filled with NaN, recreating the original dataframe and achieving nothing.
By resetting the index of the column, in this case by changing the series to an array (using .values) and back to a series (using pd.Series), only the empty spaces after all the numbers (i.e. at the bottom of the column) are filled with NaN. The same can be accomplished by
df.apply(lambda x: x.dropna().reset_index(drop = True))
drop=True for reset_index keeps the old index from becoming a new column.
I would have posted this as a comment on @jezrael's answer but my rep isn't high enough!
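To see the alignment behaviour concretely, here is a tiny sketch on a made-up two-column frame (not the question's data):
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [1.0, np.nan, 3.0], 'b': [np.nan, 5.0, 6.0]})

# Dropping NaN per column keeps the original index labels, so when the
# columns are realigned the gaps come back and nothing changes.
print(df.apply(lambda x: x.dropna()))
# Resetting the index packs the surviving values at the top, which is
# what actually shifts them up.
print(df.apply(lambda x: x.dropna().reset_index(drop=True)))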

Python: Iterate over a data frame column, check for a condition-value stored in array, and get the values to a list

After some help in the forum I managed to do what I was looking for, and now I need to get to the next level (the long explanation is here:
Python Data Frame: cumulative sum of column until condition is reached and return the index):
I have a data frame:
In [3]: df
Out[3]:
   index  Num_Albums  Num_authors
0      0          10            4
1      1           1            5
2      2           4            4
3      3           7         1000
4      4           1           44
5      5           3            8
I add a column with the cumulative sum of another column.
In [4]: df['cumsum'] = df['Num_Albums'].cumsum()
In [5]: df
Out[5]:
   index  Num_Albums  Num_authors  cumsum
0      0          10            4      10
1      1           1            5      11
2      2           4            4      15
3      3           7         1000      22
4      4           1           44      23
5      5           3            8      26
Then I apply a condition to the cumsum column and extract the corresponding values of the row where the condition is met within a given tolerance:
In [18]: tol = 2
In [19]: cond = df.where((df['cumsum']>=15-tol)&(df['cumsum']<=15+tol)).dropna()
In [20]: cond
Out[20]:
   index  Num_Albums  Num_authors  cumsum
2    2.0         4.0          4.0    15.0
Now, what I want to do is substitute the conditions stored in an array for the single condition (15 in the example), check when each condition is met, and retrieve not the entire row, but only the value of the column Num_Albums. Finally, all these retrieved values (one per condition) are stored in an array or list.
Coming from matlab, I would do something like this (I apologize for this mixed matlab/python syntax):
conditions = np.array([10, 15, 23])
for i=0:len(conditions)
retrieved_values(i) = df.where((df['cumsum']>=conditions(i)-tol)&(df['cumsum']<=conditions(i)+tol)).dropna()
So for the data frame above I would get (for tol=0):
retrieved_values = [10, 4, 1]
I would like a solution that lets me keep the .where function if possible.
A quick way to do this would be to leverage NumPy's broadcasting, as an extension of this answer from the same post linked, although an answer related to the use of DF.where was actually asked for.
Broadcasting eliminates the need to iterate through every element of the array and it's highly efficient at the same time.
The only addition to this post is the use of np.argmax to grab the indices of the first True instance along each column (traversing ↓ direction).
conditions = np.array([10, 15, 23])
tol = 0
num_albums = df.Num_Albums.values
num_albums_cumsum = df.Num_Albums.cumsum().values
slices = np.argmax(np.isclose(num_albums_cumsum[:, None], conditions, atol=tol), axis=0)
Retrieved slices:
slices
Out[692]:
array([0, 2, 4], dtype=int64)
Corresponding array produced:
num_albums[slices]
Out[693]:
array([10, 4, 1], dtype=int64)
If you still prefer using DF.where, here is another solution using list-comprehension -
[df.where((df['cumsum'] >= cond - tol) & (df['cumsum'] <= cond + tol), -1)['Num_Albums']
.max() for cond in conditions]
Out[695]:
[10, 4, 1]
The conditions not fulfilling the given criteria would be replaced by -1. Doing this way preserves the dtype at the end.
Well, the output will not always be one number, right?
In case the output is exactly one number, you can write this code:
tol = 0
# conditions
c = [5, 15, 25]
value = []
for i in c:
    if len(df.where((df['a'] >= i - tol) & (df['a'] <= i + tol)).dropna()['a']) > 0:
        value = value + [df.where((df['a'] >= i - tol) & (df['a'] <= i + tol)).dropna()['a'].values[0]]
    else:
        value = value + [[]]
print(value)
The output should look like:
[1, 2, 3]
In case the output can contain multiple numbers and you want it to look like this:
[[1.0, 5.0], [12.0, 15.0], [25.0]]
you can use this code:
tol = 5
c = [5, 15, 25]
value = []
for i in c:
    getdatas = df.where((df['a'] >= i - tol) & (df['a'] <= i + tol)).dropna()['a'].values
    value.append([x for x in getdatas])
print(value)

Python pandas - select by row

I am trying to select rows in a pandas data frame based on its values matching those of another data frame. Crucially, I only want to match values in rows, not throughout the whole series. For example:
df1 = pd.DataFrame({'a':[1, 2, 3], 'b':[4, 5, 6]})
df2 = pd.DataFrame({'a':[3, 2, 1], 'b':[4, 5, 6]})
I want to select rows where both 'a' and 'b' values from df1 match any row in df2. I have tried:
df1[(df1['a'].isin(df2['a'])) & (df1['b'].isin(df2['b']))]
This of course returns all rows, as all the values are present in df2 at some point, but not necessarily in the same row. How can I limit this so the values tested for 'b' are only those in rows where the value of 'a' was found? So with the example above, I am expecting only row index 1 ([2, 5]) to be returned.
Note that data frames may be of different shapes, and contain multiple matching rows.
Similar to this post, here's one using broadcasting -
df1[(df1.values == df2.values[:,None]).all(-1).any(0)]
The idea is:
1) Use np.all for the "both" part in "both 'a' and 'b' values".
2) Use np.any for the "any" part in "from df1 match any row in df2".
3) Use broadcasting to do all of this in a vectorized fashion by extending dimensions with None/np.newaxis.
Sample run -
In [41]: df1
Out[41]:
   a  b
0  1  4
1  2  5
2  3  6

In [42]: df2   # Modified to add another row : [1,4] for variety
Out[42]:
   a  b
0  3  4
1  2  5
2  1  6
3  1  4

In [43]: df1[(df1.values == df2.values[:,None]).all(-1).any(0)]
Out[43]:
   a  b
0  1  4
1  2  5
Use numpy broadcasting:
pd.DataFrame((df1.values[:, None] == df2.values).all(2),
             pd.Index(df1.index, name='df1'),
             pd.Index(df2.index, name='df2'))
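As a follow-up sketch (an assumption about the intended usage, not stated in the answer above): the boolean cross-table can be reduced with any to pull the matching rows out of df1:
import pandas as pd

df1 = pd.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6]})
df2 = pd.DataFrame({'a': [3, 2, 1], 'b': [4, 5, 6]})

# Entry (i, j) is True when row i of df1 equals row j of df2.
matches = pd.DataFrame((df1.values[:, None] == df2.values).all(2),
                       pd.Index(df1.index, name='df1'),
                       pd.Index(df2.index, name='df2'))

# Keep the rows of df1 that match any row of df2.
print(df1[matches.any(axis=1)])  # row index 1: a=2, b=5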
