Retrieve SQLite result as ndarray - python

I am retrieving a set of latitude and longitude points from an SQLite database like this:
cur = con.execute("SELECT DISTINCT latitude, longitude FROM MessageType1 WHERE latitude>{bottomlat} AND latitude<={toplat} AND longitude>{bottomlong} AND longitude<={toplong}".format(bottomlat = bottomlat, toplat = toplat, bottomlong = bottomlong, toplong = toplong))
However, as I am supposed to make a ConvexHull (http://docs.scipy.org/doc/scipy/reference/generated/scipy.spatial.ConvexHull.html#scipy.spatial.ConvexHull), which takes an ndarray as input, I need to save the results from cur as an ndarray. How would I do that?
Right now I fetch the values as:
positions = [[floats(x[0]) floats(x[1])] for x in cur]
Is this correct?

You don't need to convert positions to an ndarray. scipy.spatial.ConvexHull can accept a list of lists as well:
import scipy.spatial as spatial
hull = spatial.ConvexHull(positions)
Also, if your MessageType1 table has latitude, longitude fields of type float, then you should not need to call float explicitly. Instead of
positions = [[floats(x[0]) floats(x[1])] for x in cur]
you could use
positions = cur.fetchall()
Note that if you are using NumPy slicing syntax, such as positions[:, 0],
then you would need to convert the list of lists/tuples to a NumPy array:
positions = np.array(cur.fetchall())
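For completeness, here is a minimal end-to-end sketch, assuming an SQLite file and bounding-box values like those in the question (the file name and the numbers are illustrative, and the ? placeholders are an alternative to the str.format call):
import numpy as np
import scipy.spatial as spatial
import sqlite3

# illustrative bounding box
bottomlat, toplat, bottomlong, toplong = 55.0, 56.0, 12.0, 13.0

con = sqlite3.connect("messages.db")  # hypothetical database file
cur = con.execute(
    "SELECT DISTINCT latitude, longitude FROM MessageType1 "
    "WHERE latitude > ? AND latitude <= ? "
    "AND longitude > ? AND longitude <= ?",
    (bottomlat, toplat, bottomlong, toplong))

positions = np.array(cur.fetchall())  # ndarray of shape (n_points, 2)
hull = spatial.ConvexHull(positions)  # a plain list of lists works too
print(positions[hull.vertices])       # coordinates of the hull corners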

Related

Plotting values against time in MatPlotlib using SQL

I am trying to plot a list of numerical values against a list of timestamps, which are stored in a SQL database called 'scores'. After SELECTing two columns and fetching the data from the database, I convert the tuple of tuples into lists and then remove the "None" records, so that they can be plotted using plt.plot(x, y).
However, when I run the program below, I get this error:
ValueError: x and y must have same first dimension, but have shapes (5,) and (4,)
Here is some sample data for yaxis_list and xaxis_final:
yaxis_list = [20,40,60,80]
xaxis_final = ['2023-02-18 02:09:15', '2023-02-18 02:18:00', '2023-02-18 18:52:10']
my imports:
from matplotlib import pyplot as plt
import _sqlite3
I am a beginner at matplotlib and any other code rebuilding suggestions would really help me :)
with _sqlite3.connect("scores.db") as db:
    cursor = db.cursor()

    cursor.execute("SELECT (phy) FROM scores WHERE phy > 0")
    yaxis_tuple = cursor.fetchall()
    yaxis_list = [item for t in yaxis_tuple for item in t]

    cursor.execute("SELECT (phystamp) FROM scores")
    xaxis_tuple = cursor.fetchall()
    xaxis_list = [item for t in xaxis_tuple for item in t]
    xaxis_final = [i for i in xaxis_list if i is not None]

plt.plot(xaxis_final, yaxis_list)
plt.show()
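The error arises because the two queries apply different filters, so the two lists end up with different lengths. A minimal sketch of one way to keep x and y aligned, assuming phy and phystamp live in the same row of the scores table (and using the standard sqlite3 module), is to select both columns in a single query and filter once:
import sqlite3
from matplotlib import pyplot as plt

with sqlite3.connect("scores.db") as db:
    cursor = db.cursor()
    # one query keeps each timestamp paired with its score
    cursor.execute("SELECT phystamp, phy FROM scores "
                   "WHERE phy > 0 AND phystamp IS NOT NULL")
    rows = cursor.fetchall()

xaxis_final = [row[0] for row in rows]
yaxis_list = [row[1] for row in rows]

plt.plot(xaxis_final, yaxis_list)
plt.show()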

Keep smallest value for each unique ID with arcpy/numpy

I've got an ESRI point shapefile with (amongst others) an nMSLINK field and a DIAMETER field. The MSLINK is not unique, because of a spatial join. What I want to achieve is to keep only the features in the shapefile that have a unique MSLINK and the smallest DIAMETER value, together with the corresponding values in the other fields. I can use a search cursor to achieve this (looping through all features and removing each feature that does not comply), but this takes ages (> 75,000 features). I was wondering if e.g. numpy could do the trick faster in ArcMap/arcpy.
I think that kind of processing would definitely be a lot faster if you work in memory instead of interacting with ArcGIS, for example by first putting all the rows into a Python object (a namedtuple is probably a good option here). Then you can work out which rows you want to delete or insert.
The fastest approach depends on your data: a) if you have a lot of repeated (MSLINK) rows, the fastest option is to insert just the ones you need into a new layer; or b) if the rows to be deleted are only a few compared to the total number of rows, then deleting is faster.
For a) you'll need to fetch all fields into the tuple, including the point coordinates, so that you can just create a new feature class and insert the new rows.
# Example of Variant a:
import arcpy
from collections import namedtuple

# assuming the following:
source_fc   # contains the name of the source feature class
the_path    # contains the path to the shape
cleaned_fc  # the name of the cleaned feature class

# use all fields of source_fc plus the shape token to get a tuple with xy
# coordinates (using 'mslink' and 'diam' here to simplify the example)
fields = ['mslink', 'diam', 'field3', ... ]
all_fields = fields + ['SHAPE@XY']

# define a namedtuple to hold and work with the rows, use the name 'point' to
# hold the coordinates tuple
Row = namedtuple('Row', fields + ['point'])

data = []
with arcpy.da.SearchCursor(source_fc, all_fields) as sc:
    for r in sc:
        # unpack the values from each row into a new Row (namedtuple) and
        # append to data
        data.append(Row(*r))

# now just drop the rows we don't want; for this, the easiest way is probably
# to sort the tuples first by MSLINK and then by diameter...
data = sorted(data, key=lambda x: (x.mslink, x.diam))

# ... now just keep the first one for each mslink
to_keep = []
last_mslink = None
for d in data:
    if last_mslink != d.mslink:
        last_mslink = d.mslink
        to_keep.append(d)

# create a new feature class with the same fields as the source_fc
arcpy.CreateFeatureclass_management(
    out_path=the_path, out_name=cleaned_fc, template=source_fc)

with arcpy.da.InsertCursor(cleaned_fc, all_fields) as ic:
    for r in to_keep:
        ic.insertRow(r)
And for alternative b) I would just fetch 3 fields: a unique ID, MSLINK and the diameter. Then make a delete list (here you only need the unique IDs). Then loop again through the feature class and delete the rows whose ID is on your delete list, as sketched below. Just to be sure, I would duplicate the feature class first and work on a copy.
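A minimal sketch of variant b under those assumptions (the feature class name and the exact field names are illustrative, and as noted above I would run this on a copy):
import arcpy

source_fc = "your_feature_class"  # hypothetical feature class

# first pass: remember the smallest diameter (and its OID) per mslink
best = {}    # mslink -> (smallest diameter, OID of that row)
rows = []    # (OID, mslink, diameter) for every feature
with arcpy.da.SearchCursor(source_fc, ['OID@', 'nMSLINK', 'DIAMETER']) as sc:
    for oid, mslink, diam in sc:
        rows.append((oid, mslink, diam))
        if mslink not in best or diam < best[mslink][0]:
            best[mslink] = (diam, oid)

# every OID that is not the per-mslink minimum goes on the delete list
to_delete = set(oid for oid, mslink, diam in rows if oid != best[mslink][1])

# second pass: delete the unwanted rows
with arcpy.da.UpdateCursor(source_fc, ['OID@']) as uc:
    for (oid,) in uc:
        if oid in to_delete:
            uc.deleteRow()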
There are a few steps you can take to accomplish this task more efficiently. First and foremost, using the data access (arcpy.da) cursors instead of the older cursors will increase the speed of your process. This assumes you are working in 10.1 or later. Then you can employ the Summary Statistics tool, namely its ability to find a minimum value based on a case field. For yours, the case field would be nMSLINK.
The code below first creates a statistics table with all unique 'nMSLINK' values and each one's minimum 'DIAMETER' value. I then use Table Select to keep only the rows whose 'FREQUENCY' field is not 1. From here I iterate through the new table and build a list of strings that will make up the final SQL statement. After this iteration, I use Python's string join to create an SQL string that looks something like this:
("nMSLINK" = 'value1' AND "DIAMETER" <> 624.0) OR ("nMSLINK" = 'value2' AND "DIAMETER" <> 1302.0) OR ("nMSLINK" = 'value3' AND "DIAMETER" <> 1036.0) ...
The sql selects rows where nMSLINK values are not unique and where DIAMETER values are not the minimum. Using this SQL, I select by attribute and delete selected rows.
This SQL statement is written assuming your feature class is in a file geodatabase and that 'nMSLINK' is a string field and 'DIAMETER' is a numeric field.
The code has the following inputs:
Feature: The feature class to be analyzed
Workspace: A folder that will temporarily store a couple of intermediate tables
TempTableName1: A name for the first temporary table
TempTableName2: A name for the second temporary table
Field1: The non-unique field
Field2: The field with the numeric values of which you wish to find the lowest
Code:
# Import modules
from arcpy import *
import os
# Local variables
#Feature to analyze
Feature = r"C:\E1B8\ScriptTesting\Workspace\Workspace.gdb\testfeatureclass"
#Workspace to export table of identicals
Workspace = r"C:\E1B8\ScriptTesting\Workspace"
#Name of temp DBF table file
TempTableName1 = "Table1"
TempTableName2 = "Table2"
#Field names
Field1 = "nMSLINK" #nonunique
Field2 = "DIAMETER" #field with numeric values
#Make layer to allow selection
MakeFeatureLayer_management (Feature, "lyr")
#Path for first temp table
Table = os.path.join (Workspace, TempTableName1)
#Create statistics table with min value
Statistics_analysis (Feature, Table, [[Field2, "MIN"]], [Field1])
#SQL Select rows with frequency not equal to one
sql = '"FREQUENCY" <> 1'
# Path for second temp table
Table2 = os.path.join (Workspace, TempTableName2)
# Select rows with Frequency not equal to one
TableSelect_analysis (Table, Table2, sql)
#Empty list for sql bits
li = []
# Iterate through second table
cursor = da.SearchCursor (Table2, [Field1, "MIN_" + Field2])
for row in cursor:
    # Add SQL bit to list
    sqlbit = '("' + Field1 + '" = \'' + row[0] + '\' AND "' + Field2 + '" <> ' + str(row[1]) + ")"
    li.append (sqlbit)
del row
del cursor
#Create SQL for selection of unwanted features
sql = " OR ".join (li)
print sql
#Select based on SQL
SelectLayerByAttribute_management ("lyr", "", sql)
#Delete selected features
DeleteFeatures_management ("lyr")
#delete temp files
Delete_management ("lyr")
Delete_management (Table)
Delete_management (Table2)
This should be quicker than a straight-up cursor. Let me know if this makes sense. Good luck!

MySQLdb query containing NULLs to Numpy array

I am trying to efficiently read some legacy DB contents into a numpy (rec-)array. I was following these posts: What's the most efficient way to convert a MySQL result set to a NumPy array? and MySQLdb query to Numpy array.
Now it happens that some entries in the DB contain NULL, which are returned as None.
So np.fromiter will react like this, e.g.:
TypeError: long() argument must be a string or a number, not 'NoneType'
I would like to tell it how it should behave in case it encounters None.
Is that even possible?
Here is (something like) my code:
cur = db.cursor()
query = ("SELECT a, b, c from Table;")
cur.execute(query)
dt = np.dtype([
    ('a', int),
    ('b', int),
    ('c', float),
])
r = np.fromiter(cur.fetchall(), count=-1, dtype=dt)
And I would like to be able to specify, that the resulting array should contain np.nan in case None is encountered in column 'c', while it should contain the number 9999 when None is found for column 'a' or 'b'. Is something like that possible?
Or is there another (beautiful) method to get MySQL DB contents into numpy arrays, in case some values are unknown?
I would be very hesitant to suggest that this is the best way of doing this, but np.rec.fromrecords has worked well for me in the past.
The fix_duplicate_field_names function is there to ensure that numpy doesn't bork when MySQL returns multiple columns with the same name (it just fudges new names).
In the get_output function, some info is parsed out of the cursor to get field names for the rec array, after which numpy is allowed to decide the data type of the MySQL data.
def fix_duplicate_field_names(names):
    """Fix duplicate field names by appending an integer to repeated names."""
    used = []
    new_names = []
    for name in names:
        if name not in used:
            new_names.append(name)
        else:
            new_name = "%s_%d" % (name, used.count(name))
            new_names.append(new_name)
        used.append(name)
    return new_names

def get_output(cursor):
    """Get sql data in numpy recarray form."""
    if cursor.description is None:
        return None
    names = [i[0] for i in cursor.description]
    names = fix_duplicate_field_names(names)
    output = cursor.fetchall()
    if not output or len(output) == 0:
        return None
    else:
        return np.rec.fromrecords(output, names=names)
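If you do want the per-column fill values described in the question (9999 for the integer columns and np.nan for the float column), a minimal sketch under the same assumptions is to substitute the defaults in a generator before handing the rows to np.fromiter:
import numpy as np

# cur is an open MySQLdb cursor, as in the question
cur.execute("SELECT a, b, c from Table;")

dt = np.dtype([('a', int), ('b', int), ('c', float)])
fill = (9999, 9999, np.nan)  # illustrative per-column replacements for None

cleaned = (tuple(f if v is None else v for v, f in zip(row, fill))
           for row in cur.fetchall())
r = np.fromiter(cleaned, count=-1, dtype=dt)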

Correct way to fill a list from a SQL query using pyodbc

Is this the correct way to get a list from a SQL query in Python 2.7? Using a loop just seems somehow spurious. Is there a neater, better way?
import numpy as np
import pyodbc as SQL
from datetime import datetime
con = SQL.connect('Driver={SQL Server};Server=MyServer; Database=MyDB; UID=MyUser; PWD=MyPassword')
cursor = con.cursor()
#Function to convert the unicode dates returned by SQL Server into Python datetime objects
ConvertToDate = lambda s:datetime.strptime(s,"%Y-%m-%d")
#Parameters
Code = 'GBPZAR'
date_query = '''
SELECT DISTINCT TradeDate
FROM MTM
WHERE Code = ?
and TradeDate > '2009-04-08'
ORDER BY TradeDate
'''
#Get a list of dates from SQL
cursor.execute(date_query, [Code])
rows = cursor.fetchall()
Dates = [None]*len(rows) #Initialize array
r = 0
for row in rows:
    Dates[r] = ConvertToDate(row[0])
    r += 1
Edit:
What about when I want to put a query into a structured array? At the moment I do something like this:
#Initialize the structured array
AllData = np.zeros(num_rows, dtype=[('TradeDate', datetime),
                                    ('Expiry', datetime),
                                    ('Moneyness', float),
                                    ('Volatility', float)])
#Iterate through the record set using the cursor and populate the structured array
r = 0
for row in cursor.execute(single_date_and_expiry_query, [TradeDate, Code, Expiry]):
    AllData[r] = (ConvertToDate(row[0]), ConvertToDate(row[1])) + row[2:] #Convert the date columns and concatenate the numeric columns
    r += 1
There is no need to pre-create a list; you could use list.append() instead. This also avoids having to keep a counter to index into Dates.
I'd use a list comprehension here, looping directly over the cursor to fetch rows:
cursor.execute(date_query, [Code])
Dates = [datetime.strptime(r[0], "%Y-%m-%d") for r in cursor]
You may want to add .date() to the datetime.strptime() result to get datetime.date objects instead.
Iterating over the cursor is preferable as it avoids loading all rows as a list into memory, only to replace that list with another, processed list of dates. See the cursor.fetchall() documentation:
Since this reads all rows into memory, it should not be used if there are a lot of rows. Consider iterating over the rows instead.
To produce your numpy.array, don't prepopulate. Instead, process the rows with a generator and pass them (as a list) to numpy.asarray() together with the dtype:
dtype = [('TradeDate', datetime), ('Expiry', datetime),
         ('Moneyness', float), ('Volatility', float)]
dt = lambda v: datetime.strptime(v, "%Y-%m-%d")
filtered_rows = ((dt(r[0]), dt(r[1])) + tuple(r[2:]) for r in cursor)
all_values = np.asarray(list(filtered_rows), dtype=dtype)
For future reference, you can use enumerate() to produce a counter with a loop:
for r, row in enumerate(rows):
# r starts at 0 and counts along

Downloading arrays off cur.fetchall() in Python and Oracle 11g

I'm trying to download a single array off of Oracle 11g into Python using the cur.fetchall command. I'm using the following syntax:
con = cx_Oracle.connect('xxx')
print con.version
cur = con.cursor()
cur.execute("select zc.latitude from orders o, zip_code zc where o.date> '24-DEC-12' and TO_CHAR(zc.ZIP_CODE)=o.POSTAL_CODE")
latitudes = cur.fetchall()
cur.close()
print latitudes
when I print latitudes, I get this:
[(-73.98353999999999,), (-73.96565,), (-73.9531,),....]
the problem is that when I try to manipulate the data -- in this case, via:
x,y = map(longitudes,latitudes)
I get the following error -- note that I'm using exactly the same kind of syntax to create 'longitudes':
TypeError: a float is required
I suspect this is because cur.fetchall is returning tuples with commas inside the tuple elements. How do I run the query so I don't get the comma inside the parenthesis, and get an array instead of a tuple? Is there a nice "catch all" command like cur.fetchall, or do I have to manually loop to get the results into an array?
my full code is below:
from mpl_toolkits.basemap import Basemap
import matplotlib.pyplot as plt
import numpy as np
import cx_Oracle
con = cx_Oracle.connect('xxx')
print con.version
cur = con.cursor()
cur.execute("select zc.latitude from orders o, zip_code zc where psh.ship_date> '24-DEC-12' and TO_CHAR(zc.ZIP_CODE)=o.CONSIGNEE_POSTAL_CODE")
latitudes = cur.fetchall()
cur.close()
cur = con.cursor()
cur.execute("select zc.longitude from orders o, zip_code zc where psh.ship_date> '24-DEC-12' and TO_CHAR(zc.ZIP_CODE)=o.CONSIGNEE_POSTAL_CODE")
longitudes = cur.fetchall()
print 'i made it!'
print latitudes
print longitudes
cur.close()
con.close()
map = Basemap(resolution='l',projection='merc', llcrnrlat=25.0,urcrnrlat=52.0,llcrnrlon=-135.,urcrnrlon=-60.0,lat_ts=51.0)
# draw coastlines, country boundaries, fill continents.
map.drawcoastlines(color ='C')
map.drawcountries(color ='C')
map.fillcontinents(color ='k')
# draw the edge of the map projection region (the projection limb)
map.drawmapboundary()
# draw lat/lon grid lines every 30 degrees.
map.drawmeridians(np.arange(0, 360, 30))
map.drawparallels(np.arange(-90, 90, 30))
plt.show()
# compute the native map projection coordinates for the orders.
x,y = map(longitudes,latitudes)
# plot filled circles at the locations of the orders.
map.plot(x,y,'yo')
The trailing commas are fine; that is valid tuple syntax and what you get when you print a tuple.
I don't know what you are trying to achieve, but map is probably not what you want. map takes a function and a list as arguments but you are giving it 2 lists. Something more useful might be to retrieve the latitude and longitude from the database together:
cur.execute("select zc.longitude, zc.latitude from orders o, zip_code zc where o.date> '24-DEC-12' and TO_CHAR(zc.ZIP_CODE)=o.POSTAL_CODE")
Update to Comments
From the original code it looked like you were trying to use the built-in map function, which is not the case in your updated code.
The reason you are getting the TypeError is that matplotlib expects a list of floats but you are providing a list of one-tuples. You can unwrap the tuples from your original latitudes with a simple list comprehension (the map built-in would also do the trick):
[row[0] for row in latitudes]
Using one query to return the latitudes and longitudes:
cur.execute("select zc.longitude, zc.latitude from...")
points = cur.fetchall()
longitudes = [point[0] for point in points]
latitudes = [point[1] for point in points]
Now longitudes and latitudes are lists of floats.
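If you would rather end up with NumPy arrays than lists, a minimal sketch using the combined query (cur is the open cx_Oracle cursor from the question):
import numpy as np

cur.execute("select zc.longitude, zc.latitude from orders o, zip_code zc "
            "where o.date > '24-DEC-12' and TO_CHAR(zc.ZIP_CODE) = o.POSTAL_CODE")
points = np.array(cur.fetchall())  # shape (n_rows, 2) array of floats
longitudes = points[:, 0]
latitudes = points[:, 1]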
Found another way to tackle this -- check out TypeError: "list indices must be integers" looping through tuples in python. Thanks for your hard work in helping me out!
I faced the same issue and one workaround is to clean up the commas from the tuple elements in a loop as you mentioned (not so efficient though), something like this:
cleaned_up_latitudes = [] #initialize the list
for x in xrange(len(latitudes)):
    cleaned_up_latitudes.append(latitudes[x][0]) #add first element of each tuple
print cleaned_up_latitudes
[-73.98353999999999, -73.96565, -73.9531,....]
I hope that helps.
