What does np.mean(data.isnull()) do exactly? - python

While working on a data cleaning project in Python, I've found this code:
# let's see if there is any missing data
for col in df.columns:
    pct_missing = np.mean(df[col].isnull())
    print('{} - {}%'.format(col, round(pct_missing, 2)))
It actually works fine, giving back the proportion of null values per column in the dataframe, but I'm a little confused about how it works:
First we loop over each column of the dataframe, then we compute that mean, but the mean of what exactly? The mean, per column, of the number of null cells, or what?
Just for reference, I've worked around it with this:
NullValues=df.isnull().sum()/len(df)
print('{} - {}%'.format(col, round(NullValues,2)))
that gives me back basically the same results, but I'd just like to understand the mechanism... I'm confused about the first block of code.

It's something that's very intuitive once you're used to it. The steps leading to this kind of code could be like the following:
To get the percentage of null values, we need to count all null rows, and divide the count by the total number of rows.
So, first we need to detect the null rows. This is easy, as there is a provided method: df[col].isnull().
The result of df[col].isnull() is a new column consisting of booleans -- True or False.
Now we need to count the Trues. Here we can realize that counting Trues in a boolean array is the same as summing the array: True can be converted to 1, and False to zero.
So we would be left with df[col].isnull().sum() / len(df[col]).
But summing and dividing by the length is exactly the arithmetic mean! Therefore, we can shorten this to arrive at the final result: np.mean(df[col].isnull()).
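A minimal sketch of that equivalence on a small throwaway Series (the values here are made up purely for illustration):
import numpy as np
import pandas as pd

s = pd.Series([1.0, None, 3.0, None])  # 2 nulls out of 4 rows
mask = s.isnull()                      # [False, True, False, True]

by_hand = mask.sum() / len(s)          # 2 / 4 = 0.5
shortcut = np.mean(mask)               # also 0.5
assert by_hand == shortcut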

df[col].isnull() assigns a boolean (True/False) to each value depending on its NA/null state.
np.mean computes the average of those values, with True counted as 1 and False as 0, which is equivalent to computing the proportion of null values in the column.
np.mean([True, False, False, False])
# equivalent to
np.mean([1, 0, 0, 0])
# 0.25

The first thing that happens is
df[col].isnull()
This creates an array of bool values, with True where the column value is null. So if, for example, the values are [x1, x2, x3, null, x4], it gives the vector [False, False, False, True, False].
The next step is the np.mean function. This function calculates the mean value of the vector, but replaces True with 1 and False with 0, giving the vector [0, 0, 0, 1, 0].
The mean of this vector is equal to the number of nulls divided by the length of the vector, which is the method you are using.
Just a comment: it does not give a percentage; you need to multiply by 100 for that.
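For reference, a minimal sketch of the original loop adjusted so it really prints a percentage (assuming df is the dataframe from the question):
import numpy as np

for col in df.columns:
    pct_missing = np.mean(df[col].isnull()) * 100  # fraction -> percent
    print('{} - {}%'.format(col, round(pct_missing, 2)))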

Related

Interleaving IDs of groups, with IDs coming from 2 separate arrays, defining size of groups

The problem
I have 2 different functions, named update() and reset().
I am referring to these functions as 'IDs' in the title of this question.
They apply successively to groups of contiguous rows from a data array.
The sizes of the groups to which these functions apply are defined in related arrays.
import numpy as np
# 'data' array, size 5.
data = np.array([1., 4., 3., 8., 9.], dtype='float64')
# Group sizes to which the 'update' function will apply.
update_gs = np.array([1, 0, 2, 2], dtype='int64')
# Sum of group sizes is 5, the number of rows in `data`.
# This array means that the 'update' function is to be performed successively on:
# - the 1st row
# - then without any row from data
# - then with the 2nd and 3rd rows from data
# - then with the 4th and 5th rows from data
# Group sizes to which the 'reset' function will apply.
reset_gs = np.array([2, 0, 0, 2, 1], dtype='int64')
# Sum of group sizes is 5, the number of rows in `data`.
# This array means that the 'reset' function is to be performed successively on:
# - the first 2 rows of data
# - a 2nd and 3rd reset will be run without any row from data
# - a 4th reset will be run with 3rd and 4th rows from data
# - a 5th reset will be run with the last row from data
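Just to visualize how such group sizes map onto rows of data, a small sketch (not part of the question) splitting data at the cumulative sizes:
import numpy as np

data = np.array([1., 4., 3., 8., 9.], dtype='float64')
update_gs = np.array([1, 0, 2, 2], dtype='int64')

# Boundaries between groups are the cumulative sizes (the last one is dropped).
groups = np.split(data, np.cumsum(update_gs)[:-1])
print(groups)
# [array([1.]), array([], dtype=float64), array([4., 3.]), array([8., 9.])]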
The result I am looking for from this input data consists of 2 1D arrays:
each row of these result arrays relates to one occurrence of either update or reset.
consequently, these arrays are of size len(update_gs) + len(reset_gs), i.e. here 9
one array is of int, again defining group sizes. In this result array, a group size is the number of rows 'elapsed' since the previous occurrence of update or reset.
the other array is of bool, defining whether a row relates to a reset function (value True) or an update function (value False)
regarding the order of appearance of the update and reset occurrences:
row groups in data of update and reset occurrences overlap. Occurrences are ordered by the last rows of their respective row groups: the occurrence (update or reset) whose row group ends later comes later in the result arrays.
When the row groups of an update and a reset occurrence share the same last row, the update comes first in the result arrays.
With previous data, expected results are:
group_sizes = np.array([1, # 1st action is 'update'. It applies to 1st row in data.
0, # There is a 2nd 'update' on 0 row.
1, # At 2nd row of data, there is the 1st 'reset' action.
0, # There is a 2nd 'reset' on 0 row.
0, # There is a 3rd 'reset' on 0 row.
1, # There is the 3rd 'update' taking place, with 1 row elapsed since previous function.
1, # There is a 4th 'reset', with 1 row elapsed since previous function.
1, # There is the 4th 'update' taking place, with 1 row elapsed since previous function.
0, # Finally, the last 'reset' occurs, with the same ending row in 'data' as the previous 'update'.
], dtype='int64')
# Sum of the values gives 5, the total number of rows in 'data'.
function_ids = np.array([False, # 1st action is 'update'.
False, # There is a 2nd 'update'.
True, # There is the 1st 'reset' action.
True, # There is a 2nd 'reset'.
True, # There is a 3rd 'reset'.
False, # There is the 3rd 'update'.
True, # There is a 4th 'reset'.
False, # There is the 4th 'update'.
True, # Occurs finally the last 'reset'.
], dtype='bool')
Possibly an XY problem?
The above question is raised in the following context:
I have as inputs the 2 arrays mentioned, reset_gs and update_gs. The way the related functions (update or reset) work depends on which function has been applied to the previous group (a reset or an update?) and on its result.
For this reason, I first tried to interleave the respective calls in 2 nested for loops. This results in complex code that I have not yet managed to make work. I believe that, with some time, it might be possible, using several ifs, buffer variables and boolean flags to communicate previous states between the 2 interleaved for loops. In terms of maintenance and simplicity of code, it really does not seem adequate.
For this reason, I believe that opting for a single flat for loop is preferable. The 2 result arrays I am looking for (above question) would enable such a solution.
What about looping over data?
The size of data is several million rows.
The size of update_gs is several thousand rows.
The size of reset_gs varies from several hundred to several thousand rows.
Looking for performance, I have reasons to believe that looping over update_gs & reset_gs (i.e. group definitions, several thousand iterations) instead of data (each row individually, several million iterations) will result in faster code.
This actually turned into 'how do I merge two sorted arrays?'.
Along the way I discovered the sortednp package, which seems to be a fast way to do it.
import numpy as np
from sortednp import merge # merge of sorted numpy arrays
# Input data.
update = np.array([1, 0, 2, 2], dtype='int64')
reset = np.array([2, 0, 0, 2, 1], dtype='int64')
# Switching from group sizes to indices of group last row, in-place.
np.cumsum(update, out=update)
np.cumsum(reset, out=reset)
# Performing a merge of sorted arrays, keeping insertion index for 'update'.
merged_idx, (update_idx, _) = merge(update, reset, indices=True)
# Going back to group sizes.
merged_gs = np.diff(merged_idx, prepend=0)
# Final step to get 'function_ids'.
function_ids = np.ones(len(merged_gs), dtype="bool")
function_ids[update_idx] = False
# Here we are.
merged_gs
Out[9]: array([1, 0, 1, 0, 0, 1, 1, 1, 0])
function_ids
Out[13]: array([False, False, True, True, True, False, True, False, True])
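For reference, a minimal sketch of the single flat loop these two arrays enable (the update and reset functions below are placeholders standing in for the real ones, and state stands in for whatever result is carried over from the previous group):
import numpy as np

def update(rows, state):  # placeholder for the real 'update'
    return state + rows.sum()

def reset(rows, state):   # placeholder for the real 'reset'
    return 0.0

data = np.array([1., 4., 3., 8., 9.], dtype='float64')
merged_gs = np.array([1, 0, 1, 0, 0, 1, 1, 1, 0], dtype='int64')
function_ids = np.array([False, False, True, True, True, False, True, False, True])

state = 0.0
start = 0
for size, is_reset in zip(merged_gs, function_ids):
    rows = data[start:start + size]       # rows 'elapsed' for this occurrence
    state = reset(rows, state) if is_reset else update(rows, state)
    start += size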

How to get the average of only positive numbers in a numpy array?

We were given two text files that contain numbers and we need to load them using numpy.loadtxt and then find the mean of only the positive numbers within the text files.
This is what I've tried so far:
A = np.loadtxt('A.txt')
B = np.loadtxt('B.txt')
A_mean = np.mean(np.where(A>0))
B_mean = np.mean(np.where(B>0))
print(f'A={A_mean}')
print(f'B={B_mean}')
I know this is obviously wrong since it appears to be returning the average of the index numbers, not the values. How can I get the average of the actual values?
np.where(A > 0) returns the indices where A > 0 as you noticed. To get the elements of A, use the indices: np.mean(A[np.where(A > 0)]).
But wait, A > 0 is a boolean array that has True in every element that meets the condition and False elsewhere. Such an array is called a "mask", and can be used for indexing as well:
A[A > 0].mean()
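A minimal runnable sketch with a small made-up array (standing in for the values loaded from A.txt):
import numpy as np

A = np.array([-3.0, 1.0, 4.0, -2.0, 5.0])  # made-up values for illustration

# The boolean mask keeps only the positive elements before averaging.
A_mean = A[A > 0].mean()
print(A_mean)  # (1 + 4 + 5) / 3 = 3.333...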

Pandas Inconsistent Boolean Comparisons

I'm running into an issue which probably has an obvious fix. I have a 2x2 dataframe that contains lists. When I take the first row and compare the entire row against a specific list value I'm looking for, the boolean array that is returned is entirely False. This is incorrect, since the first value of the row is exactly the value I'm looking for. When I instead compare the single value in the dataframe, I get True. Why, when doing the boolean comparison over the entire row, do I get False instead of True for the value in the first column? Thanks in advance!
The comparison below returns False for both values in the first row:
In:
static_levels = pd.DataFrame([[[58, 'Level']], [[24.4, 'Level'], [23.3, 'Level']]], ['Symbol 1', 'Symbol 2'])
print(static_levels.loc['Symbol 1',:]==[58, 'Level'])
Out:
0 False
1 False
Name: Symbol 1, dtype: bool
However, the code below correctly returns True when comparing just the first value in the first row:
In: print(static_levels.loc['Symbol 1',0]==[58, 'Level'])
Out: True

Count occurrences of True/False in column of dataframe

Is there a way to count the number of occurrences of boolean values in a column without having to loop through the DataFrame?
Doing something like
df[df["boolean_column"]==False]["boolean_column"].sum()
will not work, because False has a value of 0, hence a sum of zeros will always return 0.
Obviously you could count the occurrences by looping over the column and checking, but I wanted to know if there's a pythonic way of doing this.
Use pd.Series.value_counts():
>>> df = pd.DataFrame({'boolean_column': [True, False, True, False, True]})
>>> df['boolean_column'].value_counts()
True     3
False    2
Name: boolean_column, dtype: int64
If you want to count False and True separately you can use pd.Series.sum() + ~:
>>> df['boolean_column'].values.sum()  # True
3
>>> (~df['boolean_column']).values.sum()  # False
2
With Pandas, the natural way is using value_counts:
df = pd.DataFrame({'A': [True, False, True, False, True]})
print(df['A'].value_counts())
# True 3
# False 2
# Name: A, dtype: int64
To calculate True or False values separately, don't compare against True / False explicitly, just sum and take the reverse Boolean via ~ to count False values:
print(df['A'].sum()) # 3
print((~df['A']).sum()) # 2
This works because bool is a subclass of int, and the behaviour also holds true for Pandas series / NumPy arrays.
Alternatively, you can calculate counts using NumPy:
print(np.unique(df['A'], return_counts=True))
# (array([False, True], dtype=bool), array([2, 3], dtype=int64))
I couldn't find exactly what I needed here. I needed the number of True and False occurrences for further calculations, so I used:
true_count = df['column'].value_counts()[True]
false_count = df['column'].value_counts()[False]
where df is your DataFrame and 'column' is the column with booleans.
This alternative works for multiple columns and/or rows as well.
df[df==True].count(axis=0)
This will get you the total number of True values per column. For a row-wise count, set axis=1.
df[df==True].count().sum()
Adding a sum() at the end will get you the total number in the entire DataFrame.
You could simply sum:
sum(df["boolean_column"])
This will find the number of "True" elements.
len(df["boolean_column"]) - sum(df["boolean_column"])
will yield the number of "False" elements.
df.isnull()
returns a boolean DataFrame; True indicates a missing value.
df.isnull().sum()
returns the column-wise sum of True values.
df.isnull().sum().sum()
returns the total number of NA elements.
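A small runnable sketch of that chain on a throwaway frame:
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [1.0, np.nan, 3.0], 'b': [np.nan, np.nan, 6.0]})

print(df.isnull().sum())        # a: 1, b: 2 (column-wise)
print(df.isnull().sum().sum())  # 3 (total)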
In case you have a column of boolean values in a DataFrame, or, even more interestingly, in case you do not have one but want to find the number of values in a column satisfying a certain condition, you can try something like this (as an example I used <=):
(df['col']<=value).value_counts()
The parentheses create a boolean Series whose value_counts() gives the number of False and True values, which you can use for other calculations as well; because False == 0 and True == 1, you can index the result with [0] for the False count and [1] for the True count, even without creating an additional variable:
(df['col']<=value).value_counts()[0]  # False count
(df['col']<=value).value_counts()[1]  # True count
Here is an attempt at as literal and brief an answer as possible. The value_counts() strategies are probably more flexible in the end. Summing (sum) and counting (count) are different operations, each expressing its own analytical intent, with sum depending on the type of the data.
"Count occurrences of True/False in column of dataframe"
import pandas as pd
df = pd.DataFrame({'boolean_column': [True, False, True, False, True]})
df[df==True].count()
#boolean_column 3
#dtype: int64
df[df!=False].count()
#boolean_column 3
#dtype: int64
df[df==False].count()
#boolean_column 2
#dtype: int64

Keep indexes intact while doing two separate things with series

I'm trying to create a boolean series in which I treat the data in two different ways. I want to find the local minimum to start a boolean calculation, and return False for anything before it. My problem is that the only way I can think of to do that is to essentially split the resulting series into two: one from the start of the group to the row before the minimum, and one from the minimum to the end of the group, finally concatenating them. You can see below that I create a list of False entries, then concatenate that with the boolean series I create starting at the minimum. This is really kludgy, and it doesn't keep the indexes intact.
ser = pd.concat([pd.Series([False] * (argrelextrema(group['B'].values, np.less)[0][:1][0])), (group[argrelextrema(group['B'].values, np.less)[0][:1][0]:].B.diff().shift(-1) <= -1)])
From this:
B
5876 500.2
5877 500.3
5878 500.4
5879 498.3
5880 499.0
5881 512
...
I end up with something like this for example:
1 False
2 False
3 False
5879 True
5880 False
5881 False
...
To fix it, I figured I could reset the indexes starting with the first one of the group, but that seems even more kludgy.
ser.index = np.arange(group.index[0], len(ser))
Is there a more elegant way to return False for everything before the minimum and combine that with the boolean series I create, keeping all indexes intact?
You can use the operator & to create one Boolean series instead of joining two series. In other words, your Boolean series should satisfy two conditions:
Index of item is greater than or equal to index of minimum value
AND
Item satisfies your other Boolean calculation (which you will then be able to calculate for all items regardless of position)
import pandas as pd

B = pd.Series([10, 9, 7, 7, 12, 14])
# Getting first position of minimum value
minValueIndex = B[B == B.min()].index[0]
# Boolean list for condition 1
isAfterMin = B.index >= minValueIndex
# Boolean list for condition 2: whatever calculation you make, for entire series. Example:
myBoolean = [True, False, True, True, False, True]
# Final Boolean list
ser = isAfterMin & myBoolean
print (ser)
# [False False True True False True]
As you can see, items located before the minimum value are always assigned a False, and the other items are assigned the booleans you calculate.
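Applied to the setup in the question, here is a sketch (assuming group is a DataFrame with a float column 'B', using the first local minimum from argrelextrema as the starting point; the values below are made up to mirror the question's sample):
import numpy as np
import pandas as pd
from scipy.signal import argrelextrema

group = pd.DataFrame({'B': [500.2, 500.3, 500.4, 498.3, 499.0, 512.0]},
                     index=[5876, 5877, 5878, 5879, 5880, 5881])

# Position (not label) of the first local minimum of column 'B'.
min_pos = argrelextrema(group['B'].values, np.less)[0][0]

# Condition 1: row is at or after the first local minimum, built on the original index.
is_after_min = pd.Series(np.arange(len(group)) >= min_pos, index=group.index)

# Condition 2: the original calculation, evaluated over the whole column.
drops_next = group['B'].diff().shift(-1) <= -1

# One combined series; the original index stays intact.
ser = is_after_min & drops_next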
