I have a dataframe and need to see if it contains null values. There are plenty of posts on the same topic but nearly all of them use the count action or the show method.
count operations are prohibitively expensive in my case as the data volume is large. Same for the show method.
Is there a way in which I can ask spark to look for null values and raise an error as soon as it encounters the first null value?
The solutions in other posts give the count of missing values in each column. I don't need to know the number of missing values in every column.
I just want to know if there is any cell in the dataframe with a null value.
You can use limit for that
df.select("*").where(col("c").isNull()).limit(1)
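Note that on its own this is only a lazy transformation; to actually trigger the scan and raise an error as soon as a null is found (as the question asks), a small sketch along these lines should work, using the same placeholder column "c" as above:
from pyspark.sql.functions import col

# head(1) returns a (possibly empty) list of Row objects, so Spark stops
# after the first matching row is found.
if df.where(col("c").isNull()).limit(1).head(1):
    raise ValueError("Found a null value in column 'c'")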
You have to potentially go through all values and check for null values. This can be done by either traversing the dataframe in a column-wise or row-wise fashion. Which one is best depends on the data (use heuristics).
Row-wise traversal:
import pyspark.sql.functions as f
from functools import reduce

# collect() returns a Python list, so check its length rather than .isEmpty
any_null = reduce(lambda x, y: x | y, (f.col(c).isNull() for c in df.columns))
contains_nulls = len(df.where(any_null).limit(1).collect()) > 0
Column-wise traversal (empirically this should be faster, see comment by Clock Slave):
import pyspark.sql.functions as f

contains_nulls = False
for c in df.columns:
    if len(df.where(f.col(c).isNull()).limit(1).collect()) > 0:
        contains_nulls = True
        break
limit(1) is used to stop as soon as the first null value is found, and checking whether collect() returns an empty list tells you whether any matching row exists.
As I understand it, your requirement is just to raise a flag if any column has a null. You don't need to know which rows actually contain the nulls.
Solution:
The easiest approach I can think of is to create a temp view of your DataFrame and check for nulls across all the columns. Here is the pseudocode for that:
YourDF.createOrReplaceTempView("tempView")
null_count = sqlContext.sql(
    "SELECT count(*) FROM tempView WHERE Col1 IS NULL OR Col2 IS NULL OR Col3 IS NULL"
).collect()[0][0]

flag = False
if null_count > 0:
    flag = True
Now use flag as you want.
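If you don't want to hard-code the column names, a sketch along these lines (building the condition from YourDF.columns and stopping at the first matching row instead of counting everything) should do the same job:
# Hypothetical sketch: check every column for nulls and stop at the first hit
null_condition = " OR ".join(f"`{c}` IS NULL" for c in YourDF.columns)
first_null = sqlContext.sql(
    f"SELECT * FROM tempView WHERE {null_condition} LIMIT 1"
).collect()
flag = len(first_null) > 0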
I have an 11-column x 13,470,621-row PyTables table. The first column contains a unique identifier for each row (this identifier appears only once in the table).
This is how I select rows from the table at the moment:
my_annotations_table = h5r.root.annotations
# Loop through table and get rows that match gene identifiers (column labeled gene_id).
for record in my_annotations_table.where("(gene_id == b'gene_id_36624' ) | (gene_id == b'gene_id_14701' ) | (gene_id == b'gene_id_14702')"):
    # Do something with the data.
Now this works fine with small datasets, but I will need to routinely perform queries in which I may have many thousands of unique identifiers to match in the table's gene_id column. For these larger queries, the query string quickly gets very large and I get an exception:
File "/path/to/my/software/python/python-3.9.0/lib/python3.9/site-packages/tables/table.py", line 1189, in _required_expr_vars
cexpr = compile(expression, '<string>', 'eval')
RecursionError: maximum recursion depth exceeded during compilation
I've looked at this question (What is the PyTables counterpart of a SQL query "SELECT col2 FROM table WHERE col1 IN (val1, val2, val3...)"?), which is somewhat similar to mine, but it was not satisfactory.
I come from an R background where we often do these kinds of queries (i.e. my_data_frame[my_data_frame$gene_id %in% c("gene_id_1234", "gene_id_1235"),]) and was wondering if there is a comparable solution that I could use with PyTables.
Thanks very much,
Another approach to consider is combining 2 functions: Table.get_where_list() with Table.read_coordinates()
Table.get_where_list(): gets the row coordinates fulfilling the given condition.
Table.read_coordinates(): Gets a set of rows given their coordinates (in a list), and returns as a (record) array.
The code would look something like this:
my_annotations_table = h5r.root.annotations
gene_name_list = ['gene_id_36624', 'gene_id_14701', 'gene_id_14702']
# Loop through gene names and get rows that match gene identifiers (column labeled gene_id)
gene_row_list = []
for gene_name in gene_name_list:
    gene_rows = my_annotations_table.get_where_list("gene_id == gene_name")
    gene_row_list.extend(gene_rows)

# Retrieve all of the data in one call
gene_data_arr = my_annotations_table.read_coordinates(gene_row_list)
Okay, I managed to do some satisfactory improvements on this.
1st: optimize the table (with the help of the documentation - https://www.pytables.org/usersguide/optimization.html)
Create table. Make sure to specify the expectedrows=<int> arg as it has the potential to increase the query speed.
# tb comes from: import tables as tb
table = h5w.create_table("/", 'annotations',
                         DataDescr, "Annotation table unindexed",
                         expectedrows=self._number_of_genes,
                         filters=tb.Filters(complevel=9, complib='blosc'))
I also modified the input data so that the gene_id_12345 fields are simple integers (gene_id_12345 becomes 12345).
Once the table is populated with its 13,470,621 entries (i.e. rows),
I created a complete sorted index based on the gene_id column (Column.create_csindex()) and sorted it.
table.cols.gene_id.create_csindex()
table.copy(overwrite=True, sortby='gene_id', newname="Annotation table", checkCSI=True)
# Just make sure that the index is usable. Will print an empty list if not.
print(table.will_query_use_indexing('(gene_id == 57403)'))
2nd: the table is optimized, but I still can't query thousands of gene_ids at a time. So I simply split them into chunks of 31 gene_ids (yes, 31 was the absolute maximum; 32 was apparently too many).
I did not perform benchmarks, but querying ~8000 gene_ids now takes approximately 10 seconds which is acceptable for my needs.
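For reference, a rough sketch of the chunked querying described above, assuming the identifiers are now plain integers stored in a list called gene_ids and using the empirical chunk size of 31:
chunk_size = 31  # empirical maximum mentioned above
all_rows = []
for i in range(0, len(gene_ids), chunk_size):
    chunk = gene_ids[i:i + chunk_size]
    condition = " | ".join(f"(gene_id == {g})" for g in chunk)
    # read_where returns the matching rows as a record array
    all_rows.extend(table.read_where(condition))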
I am looking at doing a Data Quality assessment and I would like to see what proportion of records contain a NULL value somewhere. I know I can use data.isnull() to give a boolean for where there are NULLs and I can do something like data.isnull().sum() to give me the total number of NULL values per column. If I want to do this per row however, I imagine I might need to use a loop. Something like this
for i in data:
if i isnull() ......
This is clearly wrong however and I am now aware of my ignorance and lost.
It would be good to be able to create a flag at the end of the column that contains either a boolean or a 1 or 0 for whether or not a NULL was found on that row.
Thanks
Thank you @EdChum.
I can indeed simply type
data.isnull().sum(1)
This gives the total number of NULL values found in each row; the 1 in sum(1) is the axis argument (equivalent to data.isnull().sum(axis=1)).
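To get the per-row flag described in the question, a small sketch on the same data object would be:
# True for any row that contains at least one NULL
data['has_null'] = data.isnull().any(axis=1)

# Proportion of records containing a NULL somewhere
proportion_with_null = data['has_null'].mean()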
I have CSV file which has 3 columns.
Here is what I have to do:
I want to write an if condition (or similar): if Divi == 'core', then I need the count of distinct tags, i.e. without redundancy (two 'sand1' values in the tag column for the core division should be counted only once).
One more condition: if Div == 'saturn' or 'core' and type == 'dev', then likewise count the number of distinct tags.
Can anyone help me out with this? Any new ideas are welcome as long as they satisfy the requirement.
First, load up your data with pandas.
import pandas as pd
dataframe = pd.read_csv(path_to_csv)
Second, format your data properly (you might have mixed lower-case/upper-case data, as in the 'Division' column from your example):
for column in dataframe.columns:
    # .str.lower() normalizes each column; this assumes the columns hold strings
    dataframe[column] = dataframe[column].str.lower()
If you want to count frequency just by one column you can:
dataframe['Division'].value_counts()
If you want to count by two columns you can:
dataframe.groupby(['Division','tag']).count()
Hope that helps
edit:
While this won't give you just the count for when the two conditions are met, which is what you asked for, it will give you a more 'complete' answer, showing the count for every combination of values in the two columns.
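For the two specific conditions in the question, a sketch along these lines should give the distinct tag counts (the column names 'Division', 'tag' and 'type' are assumptions taken from the question):
# Distinct tags where Division == 'core'
core_tag_count = dataframe.loc[dataframe['Division'] == 'core', 'tag'].nunique()

# Distinct tags where Division is 'saturn' or 'core' and type == 'dev'
mask = dataframe['Division'].isin(['saturn', 'core']) & (dataframe['type'] == 'dev')
dev_tag_count = dataframe.loc[mask, 'tag'].nunique()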
I am new to Python and SQLAlchemy, and I came across this doubt: can we filter rows of a table based on the cell values of a column of that same table?
example:
Sbranch=value
result = Transaction.query.filter(Transaction.branch == Sbranch) \
    .order_by(desc(Transaction.id)).limit(50).all()
If the value of Sbranch is 0, I want to read all the rows regardless of branch; otherwise I want to filter rows where Transaction.branch == Sbranch.
I know this can be achieved by comparing the value of Sbranch (if-else conditions), but it gets complicated as the number of such columns increases.
Example:
Sbranch=value1
trans_by=value2
trans_to=value3
.
.
result = Transaction.query.filter(Transaction.branch == Sbranch,
                                  Transaction.trans_by == trans_by,
                                  Transaction.trans_to == trans_to) \
    .order_by(desc(Transaction.id)).limit(50).all()
I want to apply similar filter with all 3 columns.
I want to know if there is any built-in function in SQLAlchemy for this problem.
You can optionally add the filter based on the value of SBranch
query = Transaction.query
if SBranch != 0:
    query = query.filter(Transaction.branch == SBranch)
result = query.order_by(Transaction.id.desc()).limit(50).all()
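The same pattern extends to several optional filters. Here is a sketch assuming, as in the question, that a value of 0 means "do not filter on this column" (the trans_by and trans_to names are taken from the question's snippet):
# Apply each filter only when its value is non-zero
query = Transaction.query
for column, value in [(Transaction.branch, Sbranch),
                      (Transaction.trans_by, trans_by),
                      (Transaction.trans_to, trans_to)]:
    if value != 0:
        query = query.filter(column == value)
result = query.order_by(Transaction.id.desc()).limit(50).all()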
I think I found a solution. It's not the best, but it reduces the work for the developer (not the processor).
Sbranch = value
branches = []
if Sbranch == 0:
    # Append all the values for which rows should be returned,
    # for example:
    branches = [1, 2, 4, 7, 3, 8]
else:
    branches.append(Sbranch)
result = Transaction.query.filter(Transaction.branch.in_(branches)) \
    .order_by(desc(Transaction.id)).limit(50).all()
So I think this is a relatively simple question:
I have a Pandas data frame (A) that has a key column (which is not unique/will have repeats of the key)
I have another Pandas data frame (B) that has a key column, which may have many matching entries/repeats.
So what I'd like is a bunch of data frames (a list, or a bunch of slice parameters, etc.), one for each key in A (regardless of whether it's unique or not)
In [bad] pseudocode:
for each key in A:
resultDF[] = Rows in B where B.key = key
I can easily do this iteratively with loops, but I've read that you're supposed to slice/merge/join data frames holistically, so I'm trying to see if I can find a better way of doing this.
A join will give me all the stuff that matches, but that's not exactly what I'm looking for, since I need a resulting dataframe for each key (i.e. for every row) in A.
Thanks!
EDIT:
I was trying to be brief, but here are some more details:
Eventually, what I need to do is generate some simple statistical metrics for elements in the columns of each row.
In other words, I have a DF, call it A, and it has r rows and c columns, one of which is a key. There may be repeats of the key.
I want to "match" that key with another [set of?] dataframe, returning however many rows match the key. Then, for that set of rows, I want to, say, determine the min and max of a certain element (and std. dev., variance, etc.) and then determine whether the corresponding element in A falls within that range.
You're absolutely right that it's possible that if row 1 and row 3 of DF A have the same key -- but potentially DIFFERENT elements -- they'd be checked against the same result set (the ranges of which obviously won't change). That's fine. These won't likely ever be big enough to make that an issue (but if there's the better way of doing it, that's great).
The point is that I need to be able to do the "in range" and stat summary computation for EACH key in A.
Again, I can easily do all of this iteratively. But this seems like the sort of thing pandas could do well, and I'm just getting into using it.
Thanks again!
FURTHER EDIT
The DF looks like this:
df = pd.DataFrame([[1,2,3,4,1,2,3,4], [28,15,13,11,12,23,21,15],['keyA','keyB','keyC','keyD', 'keyA','keyB','keyC','keyD']]).T
df.columns = ['SEQ','VAL','KEY']
SEQ VAL KEY
0 1 28 keyA
1 2 15 keyB
2 3 13 keyC
3 4 11 keyD
4 1 12 keyA
5 2 23 keyB
6 3 21 keyC
7 4 15 keyD
Both DF's A and B are of this format.
I can iteratively get the resultant sets by:
loop_iter = len(A) // max(A['SEQ_NUM'])
for start in range(0, loop_iter):
    matchA = A.iloc[start::loop_iter, :]['KEY']
That's simple. But I guess I'm wondering if I can do this "inline". Also, if for some reason the numeric ordering breaks (i.e. the SEQ values get out of order), this won't work. There seems to be no reason NOT to split explicitly on the keys, right? So perhaps I have TWO questions: 1) how to split on keys iteratively (i.e. accessing a DF one row at a time), and 2) how to match a DF and do summary statistics, etc., on a DF that matches on the key.
So, once again:
1). Iterate through DF A, going one at a time, and grabbing a key.
2). Match the key to the SET (matchB) of keys in B that match
3). Do some stats on "values" of matchB, check to see if val.A is in range, etc.
4). Profit!
Ok, from what I understand, the problem at its most simple is that you have a pd.Series of values (i.e. a["key"], which let's just call keys), which correspond to the rows of a pd.DataFrame (the df called b), such that set(b["key"]).issuperset(set(keys)). You then want to apply some function to each group of rows in b where the b["key"] is one of the values in keys.
I'm purposefully disregarding the other df -- a -- that you mention in your prompt, because it doesn't seem to bear any significance to the problem, other than being the source of keys.
Anyway, this is a fairly standard sort of operation -- it's a groupby-apply.
def descriptive_func(df):
    """
    Takes a df where key is always equal and returns some summary.
    :type df: pd.DataFrame
    :rtype: pd.Series|pd.DataFrame
    """
    pass
# filter down to those rows we're interested in
valid_rows = b[b["key"].isin(set(keys))]
# this groups by the value and applies the descriptive func to each sub df in turn
summary = valid_rows.groupby("key").apply(descriptive_func)
There are a few built-in methods on the groupby object that are useful. For example, check out valid_rows.groupby("key").sum() or valid_rows.groupby("key").describe(). Under the covers, these are really similar uses of apply. The shape of the returned summary is determined by the applied function. The unique grouped-by values (those of b["key"]) always constitute the index, but: if the applied function returns a scalar, summary is a Series; if the applied function returns a Series, those Series become the rows of summary; and if the applied function returns a DataFrame, the result is a multi-index DataFrame. This is a core pattern in Pandas, and there's a whole, whole lot to explore here.
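As a concrete (hypothetical) example using the question's column names 'KEY' and 'VAL', a descriptive_func that summarizes each key's value range in b, joined back onto a for the in-range check described in the question, might look like this:
import pandas as pd

def descriptive_func(df):
    # Summarize the VAL range for one key's group of rows
    return pd.Series({"min": df["VAL"].min(),
                      "max": df["VAL"].max(),
                      "std": df["VAL"].std()})

summary = b[b["KEY"].isin(set(a["KEY"]))].groupby("KEY").apply(descriptive_func)

# Join the per-key ranges back onto A and flag rows whose VAL falls in range
checked = a.join(summary, on="KEY")
checked["in_range"] = checked["VAL"].between(checked["min"], checked["max"])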