Concisely retrieve many variables from a SELECT - python

I have a table that has about 50 columns. To retrieve the variables from the table I am doing:
cursor.execute("SELECT * FROM title WHERE vendor_id = '%s'"%vendor_id)
data=cursor.fetchone()
provider, language, type, subtype, vendor_id = data[0], data[1], data[2], data[3], data[4]
etc...
Is there a way to do this more concisely? The variables I want to define are also the names of the columns. Perhaps something like this (in pseudocode):
values = cursor.fetchone(); columns = cursor.fetchcolumns()
data = zip(columns, values)

select * is going to be particularly challenging here, since you don't know which columns are returned or in what order. Depending on the database and wrapper you're using, you should be able to retrieve your rows as dicts instead, which lets you reference the columns by name as dict keys.
For instance, MySQLdb supports this through a DictCursor. See http://www.kitebird.com/articles/pydbapi.html
For other libraries, they should offer a similar feature.
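A rough sketch with MySQLdb's DictCursor (the connection parameters are placeholders; the table and column names are taken from the question):
import MySQLdb
import MySQLdb.cursors

# connection parameters are placeholders; adjust for your setup
conn = MySQLdb.connect(host="localhost", user="me", passwd="secret", db="mydb",
                       cursorclass=MySQLdb.cursors.DictCursor)
cur = conn.cursor()
vendor_id = 42  # example value
# let the driver do the quoting instead of formatting the string yourself
cur.execute("SELECT * FROM title WHERE vendor_id = %s", (vendor_id,))
row = cur.fetchone()  # a dict keyed by column name
print(row["provider"], row["language"], row["vendor_id"])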

You can use cursor.description and then convert the result to a dict:
import sqlite3
cnx = sqlite3.connect(r"g:\Python\Test\dabo\turnos\turnos.sqlite")
cur = cnx.execute("select * from Paciente")
rec = cur.fetchone()
fields = [i[0] for i in cur.description]
values = dict(zip(fields, rec))
print values["PacID"], values["PacNombre"] # ,...
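Applied to the query from the question, the same idea looks roughly like this (sqlite3 shown here, so the ? placeholder and the database path are assumptions; adjust for your driver):
import sqlite3

cnx = sqlite3.connect("mydb.sqlite")  # placeholder path
cursor = cnx.cursor()
vendor_id = 42  # example value
cursor.execute("SELECT * FROM title WHERE vendor_id = ?", (vendor_id,))
row = cursor.fetchone()
columns = [d[0] for d in cursor.description]  # column names from the cursor metadata
record = dict(zip(columns, row))
print(record["provider"], record["language"], record["vendor_id"])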

Related

psycopg2.extras.DictCursor not returning dict in postgres

I'm using psycopg2 to access a postgres database with the query below. In order to return a dictionary from the executed query, I'm using a DictCursor, but my output is still a list and not a dictionary.
Here is the program and output below.
import psycopg2.extras
try:
    conn = psycopg2.connect("user='postgres' host='localhost' password='postgres'")
except:
    print "I am unable to connect to the database"
cur = conn.cursor(cursor_factory=psycopg2.extras.DictCursor)
cur.execute("""SELECT datname from pg_database""")
rows = cur.fetchall()
print "\nShow me the databases:\n"
print rows
Output:-
[['template1'], ['template0'], ['postgres'], ['iip'], ['test'], ['test_postgres'], ['testdb']]
It looks like a list, smells like a list, but it's a DictRow.
rows = cur.fetchall()
for row in rows:
    print(type(row))
    # >>> <class 'psycopg2.extras.DictRow'>
This means that you can still use the column names as keys to access the data:
rows = cur.fetchall()
print([row['datname'] for row in rows])
This class inherits directly from the built-in list and adds the methods needed to implement dictionary-style access, but it doesn't change the representation (__repr__ or __str__), so the output looks the same as a list.
class DictRow(list):
"""A row object that allow by-column-name access to data."""
fetchall() packs all the queried rows in a list without specifying the exact type.
By the way, maybe you are looking for this kind of cursor: RealDictCursor?
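A minimal sketch of the same query with RealDictCursor (connection string copied from the question):
import psycopg2
import psycopg2.extras

conn = psycopg2.connect("user='postgres' host='localhost' password='postgres'")
cur = conn.cursor(cursor_factory=psycopg2.extras.RealDictCursor)
cur.execute("SELECT datname FROM pg_database")
rows = cur.fetchall()
# each row now behaves (and prints) like a dict, e.g. {'datname': 'template1'}
print([row['datname'] for row in rows])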
For those who came here because they really like the easy column:value reference a dictionary gives for each record: as the answer by PRMoureu notes, DictRow has all the usual dictionary logic, which means you can iterate over a DictRow with .items() and get the key:value pairs.
rows = cur.fetchall()
row_dict = [{k:v for k, v in record.items()} for record in rows]
This will turn your list of DictRow records into a list of dict records.

sqlite selecting multiple tables

I have a database in sqlite with c.300 tables. Currently I am iterating through a list and appending the data.
Is there a faster way / more pythonic way of doing this?
df = []
for i in Ave.columns:
    try:
        df2 = get_mcap(i)
        df.append(df2)
        #print (i)
    except:
        pass
df = pd.concat(df, axis=0)
Ave is a dataframe whose columns make up the list I want to iterate through.
def get_mcap(Ticker):
    cnx = sqlite3.connect('Market_Cap.db')
    df = pd.read_sql_query("SELECT * FROM '%s'" % (Ticker), cnx)
    df.columns = ['Date', 'Mcap-Ave', 'Mcap-High', 'Mcap-Low']
    df = df.set_index('Date')
    df.index = pd.to_datetime(df.index)
    cnx.close()
    return df
Before I post my solution, a quick warning: you should never use string manipulation to generate SQL queries unless it's absolutely unavoidable. In those cases you need to be certain that you control the data used to format the strings, and that it can't make the query do something unintended.
With that said, this seems like one of those situations where you do need to use string formatting, since you cannot pass table names as parameters. Just make sure there's no way that users can alter what is contained within your list of tables.
Onto the solution. It looks like you can get your list of tables using:
tables = Ave.columns.tolist()
For my simple example, I'm going to use:
tables = ['table1', 'table2', 'table3']
Then use the following code to generate a single query:
query_template = 'select * from {}'
query_parts = []
for table in tables:
    query = query_template.format(table)
    query_parts.append(query)
full_query = ' union all '.join(query_parts)
Giving:
'select * from table1 union all select * from table2 union all select * from table3'
You can then simply execute this one query to get your results:
cnx = sqlite3.connect('Market_Cap.db')
df = pd.read_sql_query(full_query, cnx)
Then from here you should be able to set the index, convert to datetime etc, but now you only need to do these operations once rather than 300 times. I imagine the overall runtime of this should now be much faster.
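As a rough sketch of that one-time cleanup (the column layout is assumed to match the original get_mcap, and the table list would really come from Ave.columns):
import sqlite3
import pandas as pd

tables = ['table1', 'table2', 'table3']  # in practice: Ave.columns.tolist()
full_query = ' union all '.join('select * from {}'.format(t) for t in tables)

cnx = sqlite3.connect('Market_Cap.db')
df = pd.read_sql_query(full_query, cnx)
cnx.close()

# the cleanup get_mcap did per table, now done once on the combined frame
df.columns = ['Date', 'Mcap-Ave', 'Mcap-High', 'Mcap-Low']
df = df.set_index('Date')
df.index = pd.to_datetime(df.index)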

Summing database column in python

I have recently encountered the problem of adding the elements of a database column. Here is the following code:
import sqlite3
con = sqlite3.connect("values.db")
cur = con.cursor()
cur.execute('SELECT objects FROM data WHERE firm = "sony"')
As you can see, I connect to the database (sql) and I tell Python to select the column "objects".
The problem is that I do not know the appropriate command for summing the selected objects.
Any ideas or advice would be highly appreciated.
Thank you in advance!!
If you can, have the database do the sum, as that reduces data transfer and lets the database do what it's good at.
cur.execute("SELECT sum(objects) FROM data WHERE firm = 'sony'")
or, if you're really just looking for the total count of objects.
cur.execute("SELECT count(objects) FROM data WHERE firm = 'sony'")
either way, your result is simply:
count = cur.fetchall()[0][0]
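As a side note, sqlite3 also accepts ? placeholders, which avoids quoting the firm name inside the SQL string. A small sketch reusing the question's database:
import sqlite3

con = sqlite3.connect("values.db")
cur = con.cursor()
cur.execute("SELECT sum(objects) FROM data WHERE firm = ?", ("sony",))
total = cur.fetchone()[0]
print(total)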
Try the following line:
print sum([ row[0] for row in cur.fetchall()])
If you want the items instead of adding them together:
print ([ row[0] for row in cur.fetchall()])

Keep smallest value for each unique ID with arcpy/numpy

I've got an ESRI point shapefile with (amongst others) an nMSLINK field and a DIAMETER field. The MSLINK is not unique, because of a spatial join. What I want to achieve is to keep only the features in the shapefile that have a unique MSLINK and the smallest DIAMETER value, together with the corresponding values in the other fields. I can use a search cursor to achieve this (looping through all features and removing each feature that does not comply), but this takes ages (> 75000 features). I was wondering if e.g. numpy could do the trick faster in ArcMap/arcpy.
I think that kind of processing would definitely be a lot faster if you work in memory instead of interacting with ArcGIS for every row. For example, put all the rows into a Python object first (a namedtuple would probably be a good option here). Then you can find out which rows you want to delete or insert.
The fastest approach depends on the data: a) if you have a lot of repeated (MSLINK) rows, then the fastest is to insert just the ones you need into a new layer; or b) if the rows to be deleted are few compared to the total, then deleting is faster.
For a) you'll need to fetch all fields into the tuple, including the point coordinates, so that you can just create a new feature class and insert the new rows.
# Example of Variant a:
import arcpy
from collections import namedtuple

# assuming the following:
source_fc   # contains the name of the feature class
the_path    # contains the path to the shape
cleaned_fc  # the name of the cleaned feature class

# use all fields of source_fc plus the shape token to get a tuple with xy
# coordinates (using 'mslink' and 'diam' here to simplify the example)
fields = ['mslink', 'diam', 'field3', ... ]
all_fields = fields + ['SHAPE@XY']

# define a namedtuple to hold and work with the rows, use the name 'point' to
# hold the coordinates-tuple
Row = namedtuple('Row', fields + ['point'])

data = []
with arcpy.da.SearchCursor(source_fc, all_fields) as sc:
    for r in sc:
        # unpack the values from each row into a new Row (namedtuple) and
        # append it to data
        data.append(Row(*r))

# now just drop the rows we don't want; the easiest way is probably to sort
# the tuples first by MSLINK and then by the diameter...
data = sorted(data, key=lambda x: (x.mslink, x.diam))

# ... and keep only the first one for each mslink
to_keep = []
last_mslink = None
for d in data:
    if last_mslink != d.mslink:
        last_mslink = d.mslink
        to_keep.append(d)

# create a new feature class with the same fields as the source_fc
arcpy.CreateFeatureclass_management(
    out_path=the_path, out_name=cleaned_fc, template=source_fc)

with arcpy.da.InsertCursor(cleaned_fc, all_fields) as ic:
    for r in to_keep:
        ic.insertRow(r)
And for alternative b) I would just fetch 3 fields, a unique ID, MSLINK and the diameter. Then make a delete list (here you only need the unique ids). Then loop again through the feature class and delete the rows with the id on your delete-list. Just to be sure, I would duplicate the feature class first, and work on a copy.
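A rough sketch of that variant, assuming the arcpy.da cursors and the nMSLINK/DIAMETER field names from the question (the path is a placeholder, and OID@ is used as the unique ID):
import arcpy

fc = r"C:\path\to\copy_of_featureclass"  # placeholder: work on a copy

# first pass: remember, per nMSLINK, the OID of the row with the smallest DIAMETER
smallest = {}
with arcpy.da.SearchCursor(fc, ['OID@', 'nMSLINK', 'DIAMETER']) as sc:
    for oid, mslink, diam in sc:
        if mslink not in smallest or diam < smallest[mslink][1]:
            smallest[mslink] = (oid, diam)
keep_oids = set(oid for oid, diam in smallest.values())

# second pass: delete every row whose OID is not on the keep-list
with arcpy.da.UpdateCursor(fc, ['OID@']) as uc:
    for (oid,) in uc:
        if oid not in keep_oids:
            uc.deleteRow()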
There are a few steps you can take to accomplish this task more efficiently. First and foremost, using the data access (arcpy.da) cursors rather than the older cursors will speed up your process. This assumes you are working in 10.1 or later. Then you can employ Summary Statistics, namely its ability to find a minimum value based on a case field. For yours, the case field would be nMSLINK.
The code below first creates a statistics table with all unique 'nMSLINK' values, and its corresponding minimum 'DIAMETER' value. I then use a table select to select out only rows in the table whose 'FREQUENCY' field is not 1. From here I iterate through my new table and start to build a list of strings that will make up a final sql statement. After this iteration, I use the python join function to create an sql string that looks something like this:
("nMSLINK" = 'value1' AND "DIAMETER" <> 624.0) OR ("nMSLINK" = 'value2' AND "DIAMETER" <> 1302.0) OR ("nMSLINK" = 'value3' AND "DIAMETER" <> 1036.0) ...
The sql selects rows where nMSLINK values are not unique and where DIAMETER values are not the minimum. Using this SQL, I select by attribute and delete selected rows.
This SQL statement is written assuming your feature class is in a file geodatabase and that 'nMSLINK' is a string field and 'DIAMETER' is a numeric field.
The code has the following inputs:
Feature: The feature class to be analyzed
Workspace: A folder that will temporarily store a couple of intermediate tables
TempTableName1: A name for the first temporary table
TempTableName2: A name for the second temporary table
Field1: The non-unique field (nMSLINK)
Field2: The field with the numeric values you wish to find the lowest of (DIAMETER)
Code:
# Import modules
from arcpy import *
import os
# Local variables
#Feature to analyze
Feature = r"C:\E1B8\ScriptTesting\Workspace\Workspace.gdb\testfeatureclass"
#Workspace to export table of identicals
Workspace = r"C:\E1B8\ScriptTesting\Workspace"
#Name of temp DBF table file
TempTableName1 = "Table1"
TempTableName2 = "Table2"
#Field names
Field1 = "nMSLINK" #nonunique
Field2 = "DIAMETER" #field with numeric values
#Make layer to allow selection
MakeFeatureLayer_management (Feature, "lyr")
#Path for first temp table
Table = os.path.join (Workspace, TempTableName1)
#Create statistics table with min value
Statistics_analysis (Feature, Table, [[Field2, "MIN"]], [Field1])
#SQL Select rows with frequency not equal to one
sql = '"FREQUENCY" <> 1'
# Path for second temp table
Table2 = os.path.join (Workspace, TempTableName2)
# Select rows with Frequency not equal to one
TableSelect_analysis (Table, Table2, sql)
#Empty list for sql bits
li = []
# Iterate through second table
cursor = da.SearchCursor (Table2, [Field1, "MIN_" + Field2])
for row in cursor:
    # Add SQL bit to list
    sqlbit = '("' + Field1 + '" = \'' + row[0] + '\' AND "' + Field2 + '" <> ' + str(row[1]) + ")"
    li.append (sqlbit)
del row
del cursor
#Create SQL for selection of unwanted features
sql = " OR ".join (li)
print sql
#Select based on SQL
SelectLayerByAttribute_management ("lyr", "", sql)
#Delete selected features
DeleteFeatures_management ("lyr")
#delete temp files
Delete_management ("lyr")
Delete_management (Table)
Delete_management (Table2)
This should be quicker than a straight-up cursor. Let me know if this makes sense. Good luck!

Variable in list name

I have this code :
cur.execute("SELECT * FROM foo WHERE date=?",(date,))
for row in cur:
list_foo.append(row[2])
cur.execute("SELECT * FROM bar WHERE date=?",(date,))
for row in cur:
list_bar.append(row[2])
It works fine, but I’d like to automate this. I have made a list of the tables in my sqlite database, and I’d like something like this:
table_list = ['foo', 'bar']
for t in table_list:
    cur.execute("SELECT * FROM " + t + " WHERE date=?", (date,))
    for row in cur:
        # and here I’d like to append to the list whose name depends on t
        # (list_foo, then list_bar, etc.)
But I don’t know how to do that. Any idea?
Use a dictionary to collect your data. Don't try to set new local names for each list.
You could use string templating too, and a list comprehension to turn your result rows into lists:
data = {}
for t in table_list:
    cur.execute("SELECT * FROM {} WHERE date=?".format(t), (date,))
    data[t] = [row[2] for row in cur]
One caveat: only do this with a pre-defined list of table names; don't ever interpolate untrusted input like that without hefty escaping to prevent SQL injection attacks.
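After the loop, each per-table list is available by its table name, for example:
print(data['foo'])  # what list_foo used to hold
print(data['bar'])  # what list_bar used to hold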
