So I have some fairly sparse data columns where most of the values are blank but sometimes have some integer value. In Python, if there is a blank then that column is interpreted as a float and there is a .0 at the end of each number.
I tried two things:
Changed all of the columns to text and then stripped the .0 from everything
Filled blanks with 0 and made each column an integer
Stripping the .0 is kind of time consuming on about 2mil+ rows per day and then the data is in text format which means I can't do quick sums and stuff.
Filling blanks seems somewhat wasteful because some columns literally have just a few actual values out of millions. My table for just one month is already over 80gigs (200 columns, but many of the columns after about 30 or so are pretty sparse).
What postgres datatype is best for this? There are NO decimals because the columns contain the number of seconds and it must be pre-rounded by the application.
Edit - here is what I am doing currently (but this bloats up the size and seems wasteful):
def create_int(df, col):
    df[col].fillna(0, inplace=True)
    df[col] = df[col].astype(int)
If I try to create the column astype(int) without filling in the 0s I get the error:
error: Cannot convert NA to integer
Here is the link to the Gotcha about this:
http://pandas.pydata.org/pandas-docs/stable/gotchas.html#support-for-integer-na
So it makes each int a float. Should I change the datatypes in postgres to numeric or something? I do not need high precision because there are no values after the decimal.
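For reference, here is a minimal reproduction of the behaviour (the column name is made up):
import numpy as np
import pandas as pd

# A sparse column: mostly blank, occasionally an integer number of seconds
df = pd.DataFrame({"seconds": [12, np.nan, 7, np.nan]})

print(df["seconds"].dtype)   # float64 - the NaNs force the upcast, hence the trailing .0
# df["seconds"].astype(int)  # raises, since NaN cannot be converted to an integer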
You could take advantage of the fact that you are using PostgreSQL (9.3 or above; 9.4 for JSONB) and implement a "poor man's sparse row" by converting your data into Python dictionaries and storing them in a JSON column (JSONB is better).
The following Python snippets generate random data in the format you described, convert it to appropriate JSON documents, and upload the rows into a PostgreSQL table with a JSONB column.
import psycopg2
import json
import random

def row_factory(n=200, sparsity=0.1):
    # One row: mostly None, with the occasional random integer
    return [random.randint(0, 2000) if random.random() < sparsity else None for i in range(n)]

def to_row(data):
    # Keep only the non-null positions, keyed by column index
    result = {}
    for i, element in enumerate(data):
        if element is not None:
            result[i] = element
    return result

def from_row(row, length=200):
    # Expand the sparse dict back into a full-length list
    result = [None] * length
    for index, value in row.items():
        result[int(index)] = value
    return result

con = psycopg2.connect("postgresql://...")
cursor = con.cursor()
# "values" has to be quoted because VALUES is a reserved word in PostgreSQL
cursor.execute('CREATE TABLE numbers ("values" JSONB)')

def upload_data(rows=100):
    for i in range(rows):
        cursor.execute("INSERT INTO numbers VALUES (%s)",
                       (json.dumps(to_row(row_factory(sparsity=0.5))),))
    con.commit()

upload_data()

# To retrieve the sum of all columns:
cursor.execute("SELECT {} FROM numbers".format(
    ", ".join("sum(CAST(\"values\"->>'{}' AS int))".format(i) for i in range(200))))
result = cursor.fetchall()
It took me a while to work out how to perform numeric operations on the JSONB data inside PostgreSQL (if you will be using the data from Python you can just use the from_row snippet above). The last statements above show a SELECT that performs a SUM over every column; the statement itself is assembled with Python string formatting. The key to using a JSON value as a number is to select it with the ->> operator and then cast it to a numeric type (the sum(CAST("values"->>'0' AS int)) part).
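For instance, pulling the total of a single sparse column looks like this (same connection and table as above); rows where the key is absent come back as NULL, which sum() simply ignores:
# Sum of one sparse column (key '0' in the JSONB document)
cursor.execute("SELECT sum(CAST(\"values\"->>'0' AS int)) FROM numbers")
print(cursor.fetchone()[0])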
I have a PyTables table with 11 columns and 13,470,621 rows. The first column contains a unique identifier for each row (this identifier is always present only once in the table).
This is how I select rows from the table at the moment:
my_annotations_table = h5r.root.annotations

# Loop through table and get rows that match gene identifiers (column labeled gene_id).
for record in my_annotations_table.where("(gene_id == b'gene_id_36624') | (gene_id == b'gene_id_14701') | (gene_id == b'gene_id_14702')"):
    # Do something with the data.
Now this works fine with small datasets, but I will routinely need to perform queries that match many thousands of unique identifiers in the table's gene_id column. For these larger queries the query string quickly gets very long and I get an exception:
File "/path/to/my/software/python/python-3.9.0/lib/python3.9/site-packages/tables/table.py", line 1189, in _required_expr_vars
cexpr = compile(expression, '<string>', 'eval')
RecursionError: maximum recursion depth exceeded during compilation
I've looked at this question (What is the PyTables counterpart of a SQL query "SELECT col2 FROM table WHERE col1 IN (val1, val2, val3...)"?), which is somewhat similar to mine, but it was not satisfactory.
I come from an R background, where we often do these kinds of queries (i.e. my_data_frame[my_data_frame$gene_id %in% c("gene_id_1234", "gene_id_1235"),]), and was wondering whether there is a comparable solution I could use with PyTables.
Thanks very much,
Another approach to consider is combining 2 functions: Table.get_where_list() with Table.read_coordinates()
Table.get_where_list(): gets the row coordinates fulfilling the given condition.
Table.read_coordinates(): Gets a set of rows given their coordinates (in a list), and returns as a (record) array.
The code would look something like this:
my_annotations_table = h5r.root.annotations
gene_name_list = ['gene_id_36624', 'gene_id_14701', 'gene_id_14702']

# Loop through gene names and get the row coordinates that match each gene identifier
# (column labeled gene_id). get_where_list() picks up gene_name from the local namespace.
gene_row_list = []
for gene_name in gene_name_list:
    gene_rows = my_annotations_table.get_where_list("gene_id == gene_name")
    gene_row_list.extend(gene_rows)

# Retrieve all of the data in one call
gene_data_arr = my_annotations_table.read_coordinates(gene_row_list)
Okay, I managed to do some satisfactory improvements on this.
1st: optimize the table (with the help of the documentation - https://www.pytables.org/usersguide/optimization.html)
Create table. Make sure to specify the expectedrows=<int> arg as it has the potential to increase the query speed.
# tb comes from: import tables as tb
table = h5w.create_table("/", 'annotations',
                         DataDescr, "Annotation table unindexed",
                         expectedrows=self._number_of_genes,
                         filters=tb.Filters(complevel=9, complib='blosc'))
I also modified the input data so that the gene_id_12345 fields are simple integers (gene_id_12345 becomes 12345).
Once the table is populated with its 13,470,621 entries (i.e. rows),
I created a complete sorted index based on the gene_id column (Column.create_csindex()) and sorted it.
table.cols.gene_id.create_csindex()
table.copy(overwrite=True, sortby='gene_id', newname="Annotation table", checkCSI=True)
# Just make sure that the index is usable. Will print an empty list if not.
print(table.will_query_use_indexing('(gene_id == 57403)'))
2nd - The table is optimized, but I still can't query thousands of gene_ids at a time. So I simply split them into chunks of 31 gene_ids (yes, 31 was the absolute maximum; 32 was apparently too much).
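Roughly, the chunked querying looks like this (the list name gene_id_list and the loop are mine, a sketch rather than benchmarked code):
chunk_size = 31   # 32 ids in a single condition was too much, 31 works

row_coords = []
for start in range(0, len(gene_id_list), chunk_size):
    chunk = gene_id_list[start:start + chunk_size]
    # OR together up to 31 equality tests, e.g. "(gene_id == 57403) | (gene_id == 57404) | ..."
    condition = " | ".join("(gene_id == {})".format(g) for g in chunk)
    row_coords.extend(table.get_where_list(condition))

gene_data_arr = table.read_coordinates(row_coords)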
I did not perform benchmarks, but querying ~8000 gene_ids now takes approximately 10 seconds which is acceptable for my needs.
Is there a way to get the number of rows affected as a result of executing session.bulk_insert_mappings() and session.bulk_update_mappings()?
I have seen that you can use ResultProxy.rowcount to get this number with connection.execute(), but how does it work with these bulk operations?
Unfortunately bulk_insert_mappings and bulk_update_mappings do not return the number of rows created/updated.
If your update is the same for all the objects (for example increasing some int field by 1) you could use this:
updatedCount = session.query(People).filter(People.name == "Nadav").update({People.age: People.age + 1})
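Query.update() returns the number of rows matched by the filter, and the change still needs a commit; a minimal usage sketch (the People model and session are assumed to exist already):
updatedCount = session.query(People).filter(People.name == "Nadav").update(
    {People.age: People.age + 1}, synchronize_session=False)
session.commit()
print(updatedCount)   # rows matched by the filter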
df1
[601601,859078]
[601601,137726]
[620253,859078]
This is a pull from a SQL database, returned as a dataframe. Each of these data grabs can be tens of thousands of lines deep. I then need to read the right column and make it a new input to the SQL database. The results then need to be associated correctly, i.e., with the previous input and the input below:
df2
[859078,682084]
[859078,783085]
I want to get an output of (merge df1 right column with df2 left column [outer merge])
[601601,859078,682084]
[601601,859078,783085]
[601601,137726,]
[620253,859078,682084]
[620253,859078,783085]
This is pretty much what pd.merge(df1, df2, how='outer') is for, and I was able to get it to work, but I use a recursive function to make all the calls to the database. I ran across this Stack Overflow question that says this is horribly inefficient, since the dataframe needs to be copied each time, and sure enough, it is painfully slow. So I followed his example and this is what I have:
def sqlFormat(arg, array):
    # Build the "... IN ('id1','id2',...)" tail for the query
    array = removeDuplicates(array)
    for x in range(len(array)):
        array[x] = f"'{array[x]}'"
    return arg + "({})".format(",".join(array))

def recursiveFunction(arg, data, cnxn, counter):
    sql_query = pd.read_sql_query(arg, cnxn).add_suffix('_{}'.format(counter)).astype('Int32')
    if not sql_query.empty:
        data.append(sql_query)
        counter += 1
        # Feed this level's y column back in as the next level's x filter
        next_ids = sql_query['y_{}'.format(counter - 1)].to_numpy(dtype='Int32')
        recursiveFunction(sqlFormat("SELECT x, y FROM SQL.Table WHERE x IN ", next_ids), data, cnxn, counter)
    return data

def readSQL(cnxn, array):
    data = []
    counter = 1
    dfPreLoad =   # dataframe return of SQL Query; takes cnxn, array
    # arg used below modified here
    data = recursiveFunction(arg, data, cnxn, counter)
    dfOutput = pd.concat(data, ignore_index=True)
Basically, I pass data, the list that collects each query result, through the recursive function as it runs, and I can then turn it into a dataframe with pd.concat(data). The matching values don't line up in that output, though: pd.concat effectively places each array in its own columns, starting after the last row of the previous array.
I would like to be able to split data back into dataframes and merge them (roughly as sketched below), though this may have the same issue as before. I could also write a custom merge function after converting the concat back to an array. Any ideas on how to achieve this?
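Something like this is what I have in mind for the chained merge (the column names follow the add_suffix pattern above, so this is only a sketch):
from functools import reduce

def chain_merge(left, right):
    # Join the previous level's right-hand column (y_n) onto the next
    # level's left-hand column (x_n+1), keeping unmatched rows (outer merge)
    return pd.merge(left, right, left_on=left.columns[-1],
                    right_on=right.columns[0], how='outer')

df_chained = reduce(chain_merge, data)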
Py 3.7
Based on your comments & additions to the question, you could generate the results you want using a recursive CTE query which joins each row to its parent row in the CTE, building a comma separated list of f values:
WITH cte AS (
SELECT f, CAST(f AS VARCHAR(255)) AS ids
FROM test
WHERE d IS NULL
UNION ALL
SELECT test.f, CAST(CONCAT(cte.ids, ',', test.f) AS VARCHAR(255))
FROM test
JOIN cte ON cte.f = test.d
)
SELECT ids
FROM cte
For the sample data you provided this will produce the following output:
ids
1
2
1,3
1,4
1,4,6
1,4,7
1,3,5
This can then be readily split into columns and ordered in Python.
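For example, once the CTE result is read into pandas (the query string and connection names here are assumed), the split and ordering could look roughly like this:
cte_df = pd.read_sql_query(cte_sql, cnxn)

# One column per level of the chain; shorter chains get NaN padding on the right
split_df = cte_df['ids'].str.split(',', expand=True)
split_df = split_df.sort_values(by=list(split_df.columns)).reset_index(drop=True)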
I am trying to extract the location codes / product codes from a SQL table using pandas. The field is an array type, i.e. it has multiple values as a list within each row, and I have to extract the product/location codes from those strings.
Here is a sample of the table
df.head()
Target_Type Constraints
45 ti_8188,to_8188,r_8188,trad_8188_1,to_9258,ti_9258,r_9258,trad_9258_1
45 ti_8188,to_8188,r_8188,trad_8188_1,trad_22420_1
45 ti_8894,trad_8894_0.2
Now I want to extract the numeric values of the codes. I also want to ignore the trailing float values after the 2nd underscore in the entries, i.e. ignore the _1, _0.2, etc.
Here is a sample of the output I want to achieve. It should be a unique list/df column of all the extracted values:
Target_Type_45_df.head()
Constraints
8188
9258
22420
8894
I have never worked with nested/array type of column before. Any help would be appreciated.
You can use explode to bring each variable into a single cell, under one column:
df = df.explode('Constraints')
df['newConst'] = df['Constraints'].apply(lambda x: str(x).split('_')[1])
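To get the unique list the question asks for, a drop_duplicates on top of that should do:
# Unique numeric codes as a single column, matching the desired output above
unique_codes = df['newConst'].astype(int).drop_duplicates().reset_index(drop=True)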
I would think the following overall strategy would work well (you'll need to debug):
Define a function that takes a row as input (the idea being to broadcast this function with the pandas .apply method).
In this function, set my_list = row['Constraints'].
Then do my_list = my_list.split(','). Now you have a list, with no commas.
Next, split with the underscore, take the second element (index 1), and convert to int:
numbers = [int(element.split('_')[1]) for element in my_list]
Finally, convert to set: return set(numbers)
The output for each row will be a set - just union all these sets together to get the final result.
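A rough sketch of that strategy (column name as in the sample above):
def extract_codes(row):
    # Split the comma separated constraint string, then take the number
    # right after the first underscore of each piece
    pieces = str(row['Constraints']).split(',')
    return set(int(piece.split('_')[1]) for piece in pieces)

code_sets = df.apply(extract_codes, axis=1)
all_codes = sorted(set().union(*code_sets))   # union of the per-row sets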
I need to compare two DataFrames at a time to find out whether the values match or not. One DataFrame is from an Excel workbook and the other is from a SQL query. The problem is that not only might the columns be out of sequence, but the column headers might have different names as well. This prevents me from simply getting the Excel column headers and using those to rearrange the columns in the SQL DataFrame. In addition, I will be doing this across several tabs in an Excel workbook and against different queries. Not only do the column names differ from Excel to SQL, but they may also differ from Excel to Excel and SQL to SQL.
I did create a solution, but not only is it very choppy, but I'm concerned it will begin to take up a considerable amount of memory to run.
The solution entails using lists in a list. If the excel value is in the same list as the SQL value they are considered a match and the function will return the final order that the SQL DataFrame must change to in order to match the same order that the Excel DataFrame is using. In case I missed some possibilities and the newly created order list has a different length than what is needed, I simply return the original SQL list of headers in the original order.
The example below is barely a fraction of what I will actually be working with. The actual number of variations and column names are much higher than the example below. Any suggestions anyone has on how to improve this function, or offer a better solution to this problem, would be appreciated.
Here is an example:
#Example data
exceltab1 = {'ColA': [1, 2, 3],
             'ColB': [3, 4, 1],
             'ColC': [4, 1, 2]}
exceltab2 = {'cColumn': [10, 15, 17],
             'aColumn': [5, 7, 8],
             'bColumn': [9, 8, 7]}
sqltab1 = {'Col/A': [1, 2, 3],
           'Col/C': [4, 1, 2],
           'Col/B': [3, 4, 1]}
sqltab2 = {'col_banana': [9, 8, 7],
           'col_apple': [5, 7, 8],
           'col_carrot': [10, 15, 17]}
#Code
import pandas as pd
ec1 = pd.DataFrame(exceltab1)
ec2 = pd.DataFrame(exceltab2)
sq1 = pd.DataFrame(sqltab1)
sq2 = pd.DataFrame(sqltab2)
#This will fail because the columns are out of order
result1 = (ec1.values == sq1.values).all()
def translate(excel_headers, sql_headers):
    # Each inner list groups all known aliases for the same logical column
    translator = [["ColA", "aColumn", "Col/A", "col_apple"],
                  ["ColB", "bColumn", "Col/B", "col_banana"],
                  ["ColC", "cColumn", "Col/C", "col_carrot"]]
    order = []
    for header in excel_headers:
        for group in translator:
            for item in sql_headers:
                if header in group and item in group:
                    order.append(item)
                    break
    # Fall back to the original order if anything could not be matched
    if len(order) != len(sql_headers):
        return sql_headers
    return order

sq1 = sq1[translate(list(ec1.columns), list(sq1.columns))]
#This will pass because the columns now line up
result2 = (ec1.values == sq1.values).all()
print(f"Result 1: {result1} , Result 2: {result2}")
Result:
Result 1: False , Result 2: True
No code, but an algorithm.
We have a set of columns A and another B. We can compare a column from A and another from B and see if they're equal. We do that for all combinations of columns.
This can be seen as a bipartite graph where there are two groups of vertices A and B (one vertex for each column), and an edge exists between two vertices if those two columns are equal. Then the problem of translating column names is equivalent to finding a perfect matching in this bipartite graph.
An algorithm for this is Hopcroft-Karp, which has a Python implementation here. It finds maximum matchings, so you still have to check whether the matching it found is perfect (that is, each column from A has an associated column from B).
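A rough sketch of that idea, using the Hopcroft-Karp implementation that ships with networkx rather than the implementation linked above (the column-equality test and all names here are assumptions):
import networkx as nx
import pandas as pd

def match_columns(df_a, df_b):
    # Bipartite graph: one vertex per column, an edge when two columns hold identical values
    graph = nx.Graph()
    a_nodes = [('A', col) for col in df_a.columns]
    graph.add_nodes_from(a_nodes, bipartite=0)
    graph.add_nodes_from([('B', col) for col in df_b.columns], bipartite=1)
    for col_a in df_a.columns:
        for col_b in df_b.columns:
            if len(df_a) == len(df_b) and (df_a[col_a].values == df_b[col_b].values).all():
                graph.add_edge(('A', col_a), ('B', col_b))
    # Hopcroft-Karp gives a maximum matching; verify it is perfect before trusting it
    matching = nx.bipartite.hopcroft_karp_matching(graph, top_nodes=a_nodes)
    pairs = {a[1]: b[1] for a, b in matching.items() if a[0] == 'A'}
    if len(pairs) != len(df_a.columns):
        raise ValueError("No perfect matching - some columns could not be paired")
    return pairs

# e.g. mapping = match_columns(ec1, sq1)  ->  {'ColA': 'Col/A', 'ColB': 'Col/B', 'ColC': 'Col/C'}
# sq1 = sq1[[mapping[c] for c in ec1.columns]]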