I have a table which looks like this:
When I try to look up only the row with case_id = 5 based on pr, sr, sn, I use the following code:
SELECT case_id, dupl_cnt
FROM cases
WHERE pr = NULLIF('', '')::INT AND
sr = NULLIF('CH_REP13702.10000', '')::VARCHAR AND
sn = NULLIF('22155203912', '')::VARCHAR
However, the code above does not yield any result (an empty query result). I have narrowed it down to some sort of issue with the "pr" value being NULL: when the "pr" line is removed from the query above, it works as expected. Can someone explain to me why that is happening? I expect the pr or sr columns to contain NULL values at times, but I still have to be able to look up case_id numbers with them.
(The NULLIF function is in there because this is part of a Python integration using the psycopg2 module, and I have to anticipate that data entry will sometimes supply an empty string for these values.)
NULLIF('', '') returns [null]
That doesn't satisfy the pr = [null] condition because
anything = NULL returns NULL
You need to use IS NOT DISTINCT FROM instead of =
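For example, a minimal psycopg2 sketch along those lines (assuming a connection object named conn and the same cases table and column names as above; the helper name lookup_case is hypothetical), passing the raw values as parameters and comparing with IS NOT DISTINCT FROM so that a NULL pr still matches a NULL column:

import psycopg2

conn = psycopg2.connect("dbname=mydb user=me")  # hypothetical connection settings

def lookup_case(pr, sr, sn):
    # Empty strings become NULL via NULLIF, and IS NOT DISTINCT FROM treats two NULLs as equal.
    sql = """
        SELECT case_id, dupl_cnt
        FROM cases
        WHERE pr IS NOT DISTINCT FROM NULLIF(%s, '')::INT
          AND sr IS NOT DISTINCT FROM NULLIF(%s, '')::VARCHAR
          AND sn IS NOT DISTINCT FROM NULLIF(%s, '')::VARCHAR
    """
    with conn.cursor() as cur:
        cur.execute(sql, (pr, sr, sn))
        return cur.fetchall()

# An empty string for pr still matches the row whose pr column is NULL.
print(lookup_case('', 'CH_REP13702.10000', '22155203912'))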
I have a dataframe like this. For the sake of simplicity I'm just showing 2 columns; both columns are strings, but in real life it will have more columns, each of different types other than string:
SQLText                        TableName
select * from sourceTable;     NewTable
select * from sourceTable1;    NewTable1
I also have a custom function where I want to iterate over the dataframe, get the SQL, and run it to create a table. However, I'm not passing each column individually, but rather the whole row:
def CreateTables(rowp):
    df = spark.sql(rowp.SQLText)
    # code to create table using rowp.TableName
This is my code. I first clean up SQLText, because it's stored in another table, and then I run the UDF on the column:
l = l.withColumn("SQLText", F.lit(F.regexp_replace(F.col("SQLText").cast("string"), "[\n\r]", " ")))
nt = l.select(l["*"]).withColumn("TableName", CreateTables(F.struct(*list(l.columns)))).select("TableName", "SQLText")
nt.show(truncate=False)
So when I run the code above, it errors out because, instead of resolving rowp.SQLText to its literal value, it passes the column expression itself:
Column<'struct(SourceSQL, TableName)[SourceSQL]'>
So in the CreateTables function, when spark.sql(rowp.SQLText) is executed I expect the following:
df = spark.sql("select * from sourceTable;")
but instead this is happening; the column expression is being sent instead of the variable's value:
df = spark.sql("Column<'struct(SourceSQL, TableName)[SourceSQL]'>")
I've tried numerous solutions: getItem, getField, get, getAs but no luck yet.
I've also tried using indexes like rowp[0] but it just changes the variable type passed to the spark.sql function:
Column<'struct(SourceSQL, TableName)[0]'>
If I try rowp(0) it gives me a Column is not callable error.
There are many ways to do this.
Here is one way, which I tested on PySpark 3.2.3:
rows = df.rdd.collect()
for i in range(len(rows)):
    spark.sql(rows[i][0])
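The root cause of the error in the question is that CreateTables is called once while Spark builds the plan, so it receives Column expressions rather than per-row string values; collecting the rows to the driver first gives you plain Python strings. A slightly fuller sketch along the same lines (assuming the cleaned dataframe l from the question, and that saveAsTable is an acceptable way to materialize each result):

# collect() returns plain Row objects on the driver, so row.SQLText and
# row.TableName are ordinary Python strings here.
for row in l.select("SQLText", "TableName").collect():
    sql_text = row.SQLText.strip().rstrip(";")  # spark.sql may reject a trailing semicolon
    result_df = spark.sql(sql_text)
    # Hypothetical way to persist the result; adjust the mode/format as needed.
    result_df.write.mode("overwrite").saveAsTable(row.TableName)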
I have an 11-column x 13,470,621-row PyTables table. The first column of the table contains a unique identifier for each row (this identifier is always present only once in the table).
This is how I select rows from the table at the moment:
my_annotations_table = h5r.root.annotations
# Loop through the table and get rows that match gene identifiers (column labeled gene_id).
for record in my_annotations_table.where("(gene_id == b'gene_id_36624') | (gene_id == b'gene_id_14701') | (gene_id == b'gene_id_14702')"):
    # Do something with the data.
Now this works fine with small datasets, but I will need to routinely perform queries in which I can have many thousands of unique identifiers to match in the table's gene_id column. For these larger queries, the query string can quickly get very large and I get an exception:
File "/path/to/my/software/python/python-3.9.0/lib/python3.9/site-packages/tables/table.py", line 1189, in _required_expr_vars
cexpr = compile(expression, '<string>', 'eval')
RecursionError: maximum recursion depth exceeded during compilation
I've looked at this question (What is the PyTables counterpart of a SQL query "SELECT col2 FROM table WHERE col1 IN (val1, val2, val3...)"?), which is somehow similar to mine, but was not satisfactory.
I come from an R background where we often do these kinds of queries (i.e. my_data_frame[my_data_frame$gene_id %in% c("gene_id_1234", "gene_id_1235"),]) and was wondering if there is a comparable solution that I could use with PyTables.
Thanks very much,
Another approach to consider is combining 2 functions: Table.get_where_list() with Table.read_coordinates()
Table.get_where_list(): gets the row coordinates fulfilling the given condition.
Table.read_coordinates(): Gets a set of rows given their coordinates (in a list), and returns as a (record) array.
The code would look something like this:
my_annotations_table = h5r.root.annotations
gene_name_list = ['gene_id_36624', 'gene_id_14701', 'gene_id_14702']
# Loop through gene names and collect the row coordinates that match (column labeled gene_id)
gene_row_list = []
for gene_name in gene_name_list:
    # The condition string picks up gene_name from the local namespace
    gene_rows = my_annotations_table.get_where_list("gene_id == gene_name")
    gene_row_list.extend(gene_rows)
# Retrieve all of the data in one call
gene_data_arr = my_annotations_table.read_coordinates(gene_row_list)
Okay, I managed to do some satisfactory improvements on this.
1st: optimize the table (with the help of the documentation - https://www.pytables.org/usersguide/optimization.html)
Create table. Make sure to specify the expectedrows=<int> arg as it has the potential to increase the query speed.
# tb comes from: import tables as tb
table = h5w.create_table("/", 'annotations',
                         DataDescr, "Annotation table unindexed",
                         expectedrows=self._number_of_genes,
                         filters=tb.Filters(complevel=9, complib='blosc'))
I also modified the input data so that the gene_id_12345 fields are simple integers (gene_id_12345 becomes 12345).
Once the table is populated with its 13,470,621 entries (i.e. rows),
I created a complete sorted index based on the gene_id column (Column.create_csindex()) and sorted it.
table.cols.gene_id.create_csindex()
table.copy(overwrite=True, sortby='gene_id', newname="Annotation table", checkCSI=True)
# Just make sure that the index is usable. Will print an empty list if not.
print(table.will_query_use_indexing('(gene_id == 57403)'))
2nd: the table is optimized, but I still couldn't query thousands of gene_ids at a time. So I simply split them into chunks of 31 gene_ids (yes, 31 was the absolute maximum; 32 was apparently too many).
I did not perform benchmarks, but querying ~8000 gene_ids now takes approximately 10 seconds which is acceptable for my needs.
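For reference, a minimal sketch of that chunking (the helper name and the gene id list are hypothetical; it assumes the optimized table from above with gene_id stored as an integer, and reuses get_where_list/read_coordinates from the earlier answer):

CHUNK_SIZE = 31  # empirically the largest chunk that still compiled

def query_gene_ids(table, gene_ids, chunk_size=CHUNK_SIZE):
    # Collect the matching row coordinates chunk by chunk, then read them in one call.
    coords = []
    for start in range(0, len(gene_ids), chunk_size):
        chunk = gene_ids[start:start + chunk_size]
        condition = " | ".join("(gene_id == {})".format(g) for g in chunk)
        coords.extend(table.get_where_list(condition))
    return table.read_coordinates(coords)

# e.g. querying ~8000 integer gene ids in chunks of 31
gene_data_arr = query_gene_ids(table, my_gene_id_list)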
I'm brand new to Python and to updating tables using SQL. I would like to ask how to update a certain group of values in a single column using SQL. Please see the example below:
id
123
999991234
235
789
200
999993456
I need to add the missing prefix '99999' to the records without '99999'. The id column has integer data type by default. I've tried the SQL statement below, but I got a conflict between data types, which is why I tried a CAST:
update tablename
set id = concat('99999', cast(id as string))
where id not like '99999%';
To be able to use the LIKE operator and the CONCAT() function, the column data type needs to be STRING or BYTES. In this case, you need to cast the id in the WHERE clause condition as well as in the value assigned by the SET statement.
Using your sample data:
Ran this update script:
UPDATE mydataset.my_table
SET id = CAST(CONCAT('99999', CAST(id AS STRING)) AS INTEGER)
WHERE CAST(id as STRING) NOT LIKE '99999%'
Result:
Rows were updated successfully, and every id in the table ended up carrying the 99999 prefix.
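If you are running this from Python and the table lives in BigQuery (an assumption on my part, but the mydataset.my_table reference and the STRING/INTEGER casts suggest it), a minimal sketch with the google-cloud-bigquery client would be:

from google.cloud import bigquery

client = bigquery.Client()  # assumes project and credentials are already configured

sql = """
    UPDATE mydataset.my_table
    SET id = CAST(CONCAT('99999', CAST(id AS STRING)) AS INTEGER)
    WHERE CAST(id AS STRING) NOT LIKE '99999%'
"""

# Run the DML statement and wait for it to finish.
job = client.query(sql)
job.result()
print(job.num_dml_affected_rows, "rows updated")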
I have a dataframe and need to see if it contains null values. There are plenty of posts on the same topic but nearly all of them use the count action or the show method.
count operations are prohibitively expensive in my case as the data volume is large. Same for the show method.
Is there a way in which I can ask spark to look for null values and raise an error as soon as it encounters the first null value?
The solutions in other posts give the count of missing values in each column. I don't need to know the number of missing values in every column.
I just want to know if there is any cell in the dataframe with a null value.
You can use limit for that
df.select("*").where(col("c").isNull()).limit(1)
You have to potentially go through all values and check for null values. This can be done by either traversing the dataframe in a column-wise or row-wise fashion. Which one is best depends on the data (use heuristics).
Row-wise traversal:
import pyspark.sql.functions as f
from functools import reduce

# A non-empty collected list means at least one row has a null somewhere
contains_nulls = len(df.where(reduce(lambda x, y: x | y, (f.col(x).isNull() for x in df.columns))).limit(1).collect()) > 0
Column-wise traversal (empirically this should be faster, see comment by Clock Slave):
import pyspark.sql.functions as f

contains_nulls = False
for c in df.columns:
    # A non-empty collected list means this column has at least one null
    if df.where(f.col(c).isNull()).limit(1).collect():
        contains_nulls = True
        break
limit(1) is used so that Spark can stop as soon as the first null value is found, and checking whether the collected list is empty tells you whether such a row exists.
As I understand it, your requirement is just to raise a flag if any column has a null. You don't need to know which actual rows contain the nulls.
Solution:
The easiest approach I can think of is creating a temp view of your DataFrame and checking for nulls across all the columns. Here is the pseudocode for that:
YourDF.createOrReplaceTempView("tempView")
tempViewDF = sqlContext.sql("SELECT count(*) FROM tempView WHERE Col1 IS NULL OR Col2 IS NULL OR Col3 IS NULL")

flag = False
if tempViewDF.collect()[0][0] > 0:
    flag = True
Now use flag as you want.
Regards,
Anupam
So I have some fairly sparse data columns where most of the values are blank, but sometimes they hold an integer value. In Python (pandas), if there is a blank then the column is interpreted as a float and there is a .0 at the end of each number.
I tried two things:
Changed all of the columns to text and then stripped the .0 from everything
Filled blanks with 0 and made each column an integer
Stripping the .0 is kind of time-consuming on the 2 million+ rows per day, and then the data is in text format, which means I can't do quick sums and the like.
Filling blanks seems somewhat wasteful because some columns literally have just a few actual values out of millions. My table for just one month is already over 80 GB (200 columns, but many of the columns after about the first 30 are pretty sparse).
What postgres datatype is best for this? There are NO decimals because the columns contain the number of seconds and it must be pre-rounded by the application.
Edit - here is what I am doing currently (but this bloats up the size and seems wasteful):
def create_int(df, col):
    df[col].fillna(0, inplace=True)
    df[col] = df[col].astype(int)
If I try to create the column astype(int) without filling in the 0s I get the error:
error: Cannot convert NA to integer
Here is the link to the gotcha about this:
http://pandas.pydata.org/pandas-docs/stable/gotchas.html#support-for-integer-na
So it makes each int a float. Should I change the datatypes in postgres to numeric or something? I do not need high precision because there are no values after the decimal.
You could take advantage of the fact that you are using PostgreSQL (9.3 or above) and implement a "poor man's sparse row" by converting your data into Python dictionaries and then using a JSON datatype (JSONB is better).
The following Python snippets generate random data in the format you described, convert it to appropriate JSON, and upload it into a PostgreSQL table with a JSONB column.
import psycopg2
import json
import random
def row_factory(n=200, sparcity=0.1):
    return [random.randint(0, 2000) if random.random() < sparcity else None for i in range(n)]
def to_row(data):
    result = {}
    for i, element in enumerate(data):
        if element is not None: result[i] = element
    return result
def from_row(row, lenght=200):
    result = [None] * lenght
    for index, value in row.items():
        result[int(index)] = value
    return result
con = psycopg2.connect("postgresql://...")
cursor = con.cursor()
# "values" is a reserved word in PostgreSQL, so the column name has to be quoted
cursor.execute('CREATE TABLE numbers ("values" JSONB)')
def upload_data(rows=100):
    for i in range(rows):
        cursor.execute("INSERT INTO numbers VALUES(%s)", (json.dumps(to_row(row_factory(sparcity=0.5))),))

upload_data()
con.commit()
# To retrieve the sum of all columns:
cursor.execute("""SELECT {} from numbers limit 10""".format(
    ", ".join("""sum(CAST("values"->>'{}' as int))""".format(i) for i in range(200))))
result = cursor.fetchall()
It took me a while to find out how to perform numeric operations on the JSONB data inside PostgreSQL (if you will be using the data from Python you can just use the from_row snippet above). But the last lines show a SELECT that performs a SUM over all columns; the statement itself is assembled with Python string formatting, and the key to using a JSON value as a number is to select it with the ->> operator and then cast it to a number (the sum(CAST("values"->>'0' as int)) part).
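For completeness, a small sketch of reading the rows back on the Python side with the from_row helper above (psycopg2 decodes JSONB columns to Python dicts automatically, with string keys, which is why from_row casts the index back to int):

cursor.execute('SELECT "values" FROM numbers')
for (row_dict,) in cursor.fetchall():
    dense_row = from_row(row_dict, lenght=200)  # expand the sparse dict back to a 200-element list
    print(dense_row[:10])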