Counting rows that contain a NULL value - python

I am doing a data quality assessment and I would like to see what proportion of records contain a NULL value somewhere. I know I can use data.isnull() to give a boolean for where there are NULLs, and I can do something like data.isnull().sum() to give me the total number of NULL values per column. If I want to do this per row, however, I imagine I might need a loop, something like this:
for i in data:
    if i.isnull(): ...
This is clearly wrong, however, and I am now aware of my ignorance and a little lost.
It would be good to be able to create a flag column at the end of the dataframe that contains either a boolean or a 1/0 for whether or not a NULL was found in that row.
Thanks

Thank you @EdChum.
I can indeed simply type
data.isnull().sum(1)
This gives the total number of NULL values found in each row, with the axis specified by the 1 in sum(1) (i.e. axis=1).
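If you want the flag column described in the question, a minimal sketch would be the following (the column name has_null is just a placeholder):
# 1 if any value in the row is NULL, else 0
data['has_null'] = data.isnull().any(axis=1).astype(int)
# or keep it as a boolean:
# data['has_null'] = data.isnull().any(axis=1)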

Related

I am not able to correctly assign a value to a df row based on 3 conditions (checking values in 3 other columns)

I am trying to assign a proportion value to a column in a specific row inside my df. Each row represents a unique product's sales in a specific month, in a dataframe (called testingAgain) like this:
Month  ProductID(SKU)  Family  Sales    ProporcionVenta
1      1234            FISH    10000.0  0.0
This row represents product 1234's sales during January. (It is an aggregate, so it represents every January in the DB)
Now I am trying to find the proportion of sales of that unique productid-month in relation to the sum of sales of family-month. For example, the family fish has sold 100,000 in month 1, so in this specific case it would be calculated 10,000/100,000 (productid-month-sales/family-month-sales)
I am trying to do so like this:
for family in uniqueFamilies:
    for month in months:
        salesFamilyMonth = testingAgain[(testingAgain['Family']==family)&(testingAgain['Month']==month)]['Qty'].sum()
        for sku in uniqueSKU:
            salesSKUMonth = testingAgain[(testingAgain['Family']==family)&(testingAgain['Month']==month)&(testingAgain['SKU']==sku)]['Qty'].sum()
            proporcion = salesSKUMonth/salesFamilyMonth
            testingAgain[(testingAgain['SKU']==sku)&(testingAgain['Family']==familia)&(testingAgain['Month']==month)]['ProporcionVenta'] = proporcion
The code works, it runs, and I have even individually printed the proportions and calculated them in Excel and they are correct, but the problem is with the last line. As soon as the code finishes running, I print testingAgain and see all proportions listed as 0.0, even though they should have been assigned the new one.
I'm not completely convinced about my approach, but I think it is decent.
Any ideas on how to solve this problem?
Thanks, appreciate it.
Generally, in Pandas (and even NumPy), unlike general-purpose Python, analysts should avoid using for loops, as there are many vectorized options for running conditional or grouped calculations. In your case, consider groupby().transform(), which returns inline aggregates (i.e., aggregate values without collapsing rows) or, as the docs put it, results broadcast to match the shape of the input array.
Currently, your code attempts to assign a value to a subsetted slice of a DataFrame column, which should raise a SettingWithCopyWarning. Such an operation does not affect the original DataFrame. Your loop can instead use .loc for conditional assignment:
testingAgain.loc[(testingAgain['SKU']==sku) &
                 (testingAgain['Family']==familia) &
                 (testingAgain['Month']==month), 'ProporcionVenta'] = proporcion
However, you can avoid looping entirely, since transform works nicely for assigning new DataFrame columns. Also, div below is the Series division method (functionally equivalent to the / operator):
testingAgain['ProporcionVenta'] = (testingAgain.groupby(['SKU', 'Family', 'Month'])['Qty'].transform('sum')
                                   .div(testingAgain.groupby(['Family', 'Month'])['Qty'].transform('sum'))
                                  )
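As a quick sanity check, here is a tiny self-contained version of the same transform idea, using made-up numbers shaped like the question's example (columns Month, SKU, Family, Qty), so the first row comes out to 10000/100000 = 0.1:
import pandas as pd

demo = pd.DataFrame({'Month':  [1, 1, 1],
                     'SKU':    [1234, 5678, 9999],
                     'Family': ['FISH', 'FISH', 'FISH'],
                     'Qty':    [10000.0, 60000.0, 30000.0]})

# product-month sales divided by family-month sales, no loops
demo['ProporcionVenta'] = (demo.groupby(['SKU', 'Family', 'Month'])['Qty'].transform('sum')
                           .div(demo.groupby(['Family', 'Month'])['Qty'].transform('sum')))
print(demo)  # first row -> 0.1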

Best way to fuzzy match values in a data frame and then replace the value?

I'm working with a dataframe containing various datapoints of customer data. I'm looking to replace any junk phone numbers with a blank value; right now I'm struggling to find an efficient way to identify potential junk values, such as a phone number like 111-111-1111, and replace that specific value with a blank entry.
I currently have a fairly ugly solution where I go through three fields (home phone, cell phone and work phone), locate the index values of the rows in question and the respective column, and then replace those values.
With regards to actually finding junk values in a dataframe, is there a better approach than what I am currently doing?
row_index = dataset[dataset['phone'].str.contains('11111')].index
column_index = dataset.columns.get_loc('phone')
Afterwards, I would zip these up and cycle through a for loop, using dataset.iat[row_index, column_index] = ''. The row and column index variables would also have the junk values in the 'cellphone' and 'workphone' columns appended on as well.
Pandas 'where' function tends to be quick:
dataset['phone'] = dataset['phone'].where(~dataset['phone'].str.contains('11111'),
                                          None)
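If you need to clean all three fields rather than just 'phone', a small sketch along the same lines (the column names below are assumptions, so swap in your real ones; the '11111' pattern simply mirrors the check above):
phone_cols = ['home_phone', 'cell_phone', 'work_phone']  # hypothetical names
for col in phone_cols:
    junk = dataset[col].str.contains('11111', na=False)
    dataset.loc[junk, col] = ''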

Random Forest - make null values always have their own branch in a decision tree

Hi, I am using random forest to build a model and I am trying to deal with null values. Would anyone happen to know how to force the random forest model to treat null values as their own separate band? (That is, null values never get banded up with other value ranges, so in a decision tree the null values of a measure always get their own branch.)
I don't want to impute the mean in place of nulls, because I don't want the model to band nulls up with other values close to the mean, and I don't want to remove the nulls either.
I want the decision tree to always treat the null values of a measure as their own branch.
Thanks :)
You could try these.
Replace null values with a value that drastically varies from any other value in the column.
Example
Let 'feature' be the name of a column with only positive values; then a negative value should suffice for nulls:
dataframe.loc[dataframe['feature'].isna(), 'feature'] = -100
You could add a new null-tracking column to keep track of null values of another column. (Use this if all features are considered for modeling the random forest)
Example
Let 'feature' be the name of a column with null values
dataframe['feature_isnull'] = 0 #null-tracking column
dataframe.loc[dataframe['feature'].isna(),'feature_isnull'] = 1
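For context, a rough sketch of how the two ideas combine before fitting (this assumes scikit-learn's RandomForestClassifier, which the question does not name, and placeholder column names 'feature' and 'target'):
from sklearn.ensemble import RandomForestClassifier

# indicator column plus sentinel value, so nulls can get their own splits
dataframe['feature_isnull'] = dataframe['feature'].isna().astype(int)
dataframe['feature'] = dataframe['feature'].fillna(-100)

model = RandomForestClassifier()
model.fit(dataframe[['feature', 'feature_isnull']], dataframe['target'])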

Check whether dataframe contains any null values

I have a dataframe and need to see if it contains null values. There are plenty of posts on the same topic but nearly all of them use the count action or the show method.
count operations are prohibitively expensive in my case as the data volume is large. Same for the show method.
Is there a way in which I can ask spark to look for null values and raise an error as soon as it encounters the first null value?
The solutions in other posts give the count of missing values in each column. I don't need to know the number of missing values in every column.
I just want to know if there is any cell in the dataframe with a null value.
You can use limit for that:
from pyspark.sql.functions import col

df.select("*").where(col("c").isNull()).limit(1)
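To turn that into the "raise an error as soon as a null is found" behaviour from the question, one sketch (still assuming the column of interest is called "c"):
# head(1) returns at most one Row, so this only materialises a single match
if df.where(col("c").isNull()).limit(1).head(1):
    raise ValueError("DataFrame contains at least one null in column 'c'")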
You have to potentially go through all values and check for null values. This can be done by either traversing the dataframe in a column-wise or row-wise fashion. Which one is best depends on the data (use heuristics).
Row-wise traversal:
import pyspark.sql.functions as f
from functools import reduce

any_null_condition = reduce(lambda x, y: x | y, (f.col(c).isNull() for c in df.columns))
contains_nulls = len(df.where(any_null_condition).limit(1).collect()) > 0
Column-wise traversal (empirically this should be faster, see comment by Clock Slave):
import pyspark.sql.functions as f

contains_nulls = False
for c in df.columns:
    if len(df.where(f.col(c).isNull()).limit(1).collect()) > 0:
        contains_nulls = True
        break
limit(1) is used so Spark can stop as soon as the first null value is found, and checking whether the collected result is empty tells you whether any null was encountered.
As I understand it, your requirement is just to raise a flag if any of the columns has a null; you don't need to know which actual rows contain nulls.
Solution:
The easiest one I can think of is creating a tempView of your DataFrame and checking for nulls across all the columns. Here is the rough code for that (Col1, Col2, Col3 are placeholders):
YourDF.createOrReplaceTempView("tempView")
null_count = sqlContext.sql(
    "SELECT count(*) AS cnt FROM tempView WHERE Col1 IS NULL OR Col2 IS NULL OR Col3 IS NULL"
).collect()[0]['cnt']
flag = null_count > 0
Now use flag as you want.
Regards,
Anupam

Return subset/slice of Pandas dataframe based on matching column of other dataframe, for each element in column?

So I think this is a relatively simple question:
I have a Pandas data frame (A) that has a key column (which is not unique/will have repeats of the key)
I have another Pandas data frame (B) that has a key column, which may have many matching entries/repeats.
So what I'd like is a bunch of data frames (a list, or a bunch of slice parameters, etc.), one for each key in A (regardless of whether it's unique or not)
In [bad] pseudocode:
for each key in A:
resultDF[] = Rows in B where B.key = key
I can easily do this iteratively with loops, but I've read that you're supposed to slice/merge/join data frames holistically, so I'm trying to see if I can find a better way of doing this.
A join will give me all the stuff that matches, but that's not exactly what I'm looking for, since I need a resulting dataframe for each key (i.e. for every row) in A.
Thanks!
EDIT:
I was trying to be brief, but here are some more details:
Eventually, what I need to do is generate some simple statistical metrics for elements in the columns of each row.
In other words, I have a DF, call it A, and it has r rows and c columns, one of which is a key. There may be repeats on the key.
I want to "match" that key against another dataframe, returning however many rows match the key. Then, for that set of rows, I want to, say, determine the min and max of a certain element (and std. dev, variance, etc.) and then determine whether the corresponding element in A falls within that range.
You're absolutely right that it's possible that if row 1 and row 3 of DF A have the same key -- but potentially DIFFERENT elements -- they'd be checked against the same result set (the ranges of which obviously won't change). That's fine. These won't likely ever be big enough to make that an issue (but if there's the better way of doing it, that's great).
The point is that I need to be able to do the "in range" and stat summary computation for EACH key in A.
Again, I can easily do all of this iteratively. But this seems like the sort of thing pandas could do well, and I'm just getting into using it.
Thanks again!
FURTHER EDIT
The DF looks like this:
df = pd.DataFrame([[1,2,3,4,1,2,3,4], [28,15,13,11,12,23,21,15],['keyA','keyB','keyC','keyD', 'keyA','keyB','keyC','keyD']]).T
df.columns = ['SEQ','VAL','KEY']
SEQ VAL KEY
0 1 28 keyA
1 2 15 keyB
2 3 13 keyC
3 4 11 keyD
4 1 12 keyA
5 2 23 keyB
6 3 21 keyC
7 4 15 keyD
Both DF's A and B are of this format.
I can iteratively get the resultant sets by:
loop_iter = len(A) // max(A['SEQ_NUM'])
for start in range(0, loop_iter):
    matchA = A.iloc[start::loop_iter, :]['KEY']
That's simple. But I guess I'm wondering if I can do this "inline". Also, if for some reason the numeric ordering breaks (i.e. the SEQ values get out of order), then this won't work. There seems to be no reason NOT to do it by explicitly splitting on the keys, right? So perhaps I have TWO questions: 1) How to split on keys, iteratively (i.e. accessing a DF one row at a time), and 2) How to match a DF and do summary statistics, etc., on a DF that matches on the key.
So, once again:
1). Iterate through DF A, going one at a time, and grabbing a key.
2). Match the key to the SET (matchB) of keys in B that match
3). Do some stats on "values" of matchB, check to see if val.A is in range, etc.
4). Profit!
Ok, from what I understand, the problem at its most simple is that you have a pd.Series of values (i.e. a["key"], which let's just call keys), which correspond to the rows of a pd.DataFrame (the df called b), such that set(b["key"]).issuperset(set(keys)). You then want to apply some function to each group of rows in b where the b["key"] is one of the values in keys.
I'm purposefully disregarding the other df -- a -- that you mention in your prompt, because it doesn't seem to bear any significance to the problem, other than being the source of keys.
Anyway, this is a fairly standard sort of operation -- it's a groupby-apply.
def descriptive_func(df):
    """
    Takes a df where key is always equal and returns some summary.
    :type df: pd.DataFrame
    :rtype: pd.Series|pd.DataFrame
    """
    pass
# filter down to those rows we're interested in
valid_rows = b[b["key"].isin(set(keys))]
# this groups by the value and applies the descriptive func to each sub df in turn
summary = valid_rows.groupby("key").apply(descriptive_func)
There are a few built-in methods on the groupby object that are useful. For example, check out valid_rows.groupby("key").sum() or valid_rows.groupby("key").describe(). Under the covers, these are really similar uses of apply. The shape of the returned summary is determined by the applied function. The unique grouped-by values (those of b["key"]) always constitute the index, but if the applied function returns a scalar, summary is a Series; if the applied function returns a Series, then summary is a DataFrame with the returned Series as its rows; and if the applied function returns a DataFrame, then the result is a MultiIndex DataFrame. This is a core pattern in Pandas, and there is a whole, whole lot to explore here.
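Tying this back to the SEQ/VAL/KEY example above, a sketch of the "is A's value within B's range?" check might look like this (it assumes B has the same columns as A and that VAL is numeric):
import pandas as pd

# per-key summary statistics computed from B
ranges = B.groupby('KEY')['VAL'].agg(['min', 'max', 'mean', 'std'])

# line the stats up against every row of A (repeated keys just repeat the stats)
checked = A.merge(ranges, left_on='KEY', right_index=True, how='left')

# flag whether each A row's VAL falls inside B's min/max range for that key
checked['in_range'] = checked['VAL'].between(checked['min'], checked['max'])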
