I'll explain the structure I have, but first let me say what I need. I have one table with forecasts and another with what actually happened, and I need to calculate the difference between the forecast and the observed value. Both tables have coordinate fields (lon, lat), the date, and the precipitation. Forecast has one extra field: the date the forecast was published.
class Real(Base):
    __tablename__ = 'tbl_real'
    id = Column(Integer, primary_key=True, autoincrement=True)
    lon = Column(Integer, index=True)
    lat = Column(Integer, index=True)
    date = Column(DATE, index=True)
    prec = Column(Integer)

class Forecast(Base):
    __tablename__ = 'tbl_forecast'
    id = Column(Integer, primary_key=True, autoincrement=True)
    real_id = Column(Integer, ForeignKey('tbl_real.id'))
    date_pub = Column(DATE, index=True)
    date_prev = Column(DATE, index=True)
    lon = Column(Integer, index=True)
    lat = Column(Integer, index=True)
    prec = Column(Integer)

class Error(Base):
    __tablename__ = 'tbl_error'
    id = Column(Integer, primary_key=True)
    forecast_id = Column(Integer, ForeignKey('tbl_forecast.id'))
    real_id = Column(Integer, ForeignKey('tbl_real.id'))
    error = Column(Integer)
To insert data into Error I'm using:
def insert_error_by_coord_data(self, real_id, lon, lat, date, prec, session):
    ec_ext = session.query(Forecast.id, Forecast.prec).filter((Forecast.lon == lon) &
                                                              (Forecast.lat == lat) &
                                                              (Forecast.date_prev == date)).all()
    data = list()
    for row in ec_ext:
        id = row[0]
        if session.query(Error).get(id) is None:
            prev = row[1]
            error = prev - prec
            data.append(Error(id=id,
                              forecast_id=id,
                              real_id=real_id,
                              error=error))
    if len(data) > 0:
        session.bulk_save_objects(objects=data)
        session.commit()
    session.close()
Each forecast file has 40 date_prev values and 25,000 coordinates, and each real file has 25,000 coordinates. It's been about 2 hours and I've only got 80,000 rows in Error. It started taking 1.03 s to insert a record and is now taking 3.04 s. I'm using 12 CPUs with multiprocessing; if you think the mistake is there, please point it out and I can show the code, but I don't think it is.
The question is: what should I do differently?
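One direction I can sketch (my suggestion, not part of the original post): the session.query(Error).get(id) lookup inside the loop issues a separate SELECT for every forecast row, which is usually what makes each insert slower as tbl_error grows. Fetching only the forecasts that have no error yet with a single anti-join, and writing the results with one bulk call, avoids that. insert_errors_by_coord_date below is a hypothetical replacement for the method above, using the model names from the question.

def insert_errors_by_coord_date(self, real_id, lon, lat, date, prec, session):
    # One round trip: select only forecasts that do not already have an Error row.
    rows = (
        session.query(Forecast.id, Forecast.prec)
        .outerjoin(Error, Error.forecast_id == Forecast.id)
        .filter(Forecast.lon == lon,
                Forecast.lat == lat,
                Forecast.date_prev == date,
                Error.id.is_(None))
        .all()
    )
    # One bulk INSERT of plain dicts, cheaper than building ORM objects.
    session.bulk_insert_mappings(Error, [
        {"id": fid, "forecast_id": fid, "real_id": real_id, "error": fprec - prec}
        for fid, fprec in rows
    ])
    session.commit()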
I am getting my data from my Postgres database but it is truncated. For VARCHAR, I know it's possible to set the max size, but is it possible to do that with JSON too, or is there another way?
Here is my query:
robot_id_cast = cast(RobotData.data.op("->>")("id"), String)
robot_camera_cast = cast(RobotData.data.op("->>")(self.camera_name), JSON)

# Get the last upload time for this robot and this camera
subquery_last_upload = (
    select([func.max(RobotData.time).label("last_upload")])
    .where(robot_id_cast == self.robot_id)
    .where(robot_camera_cast != None)
).alias("subquery_last_upload")

main_query = (
    select([
        subquery_last_upload.c.last_upload,
        RobotData.data.op("->")(self.camera_name).label(self.camera_name),
    ])
    .where(RobotData.time == subquery_last_upload.c.last_upload)
    .where(robot_id_cast == self.robot_id)
    .where(robot_camera_cast != None)
)
The problem is with this part of the select: RobotData.data.op("->")(self.camera_name).label(self.camera_name)
Here is my table:
class RobotData(PGBase):
    __tablename__ = "wr_table"
    time = Column(DateTime, nullable=False, primary_key=True)
    data = Column(JSON, nullable=False)
Edit: My JSON is 429 characters
The limit of the JSON datatype in PostgreSQL is 1 GB, so a 429-character value is nowhere near being cut off by the datatype itself.
Refs:
https://dba.stackexchange.com/a/286357
https://stackoverflow.com/a/12633183
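To double-check that against this table, here is a small sketch (my own, not from the answer): it measures the stored JSON length on the server by casting the column to TEXT, reusing robot_id_cast from the question.

# Hypothetical length check: if this returns 429, the database stores the full
# document and any truncation happens on the client/display side.
from sqlalchemy import Text, cast, func, select

json_length_query = (
    select([func.length(cast(RobotData.data, Text))])
    .where(robot_id_cast == self.robot_id)
)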
I have a table of events generated by devices, with this structure:
class Events(db.Model):
    id = db.Column(db.Integer, primary_key=True, autoincrement=True)
    timestamp_event = db.Column(db.DateTime, nullable=False, index=True)
    device_id = db.Column(db.Integer, db.ForeignKey('devices.id'), nullable=True)
which I have to query joined to:
class Devices(db.Model):
    id = db.Column(db.Integer, primary_key=True, autoincrement=True)
    dev_name = db.Column(db.String(50))
so I can retrieve Device data for every Event.
I'm building a ranking of the top 20 event counts generated in a single hour. It already works, but as my Events table grows (over 1M rows now) the query gets slower and slower. This is my code. Any ideas on how to optimize the query? Maybe a composite index on device_id + timestamp_event? Would that work even when filtering on only part of the datetime column?
pkd = db.session.query(db.func.count(Events.id),
                       db.func.date_format(Events.timestamp_event, '%d/%m %H'),
                       Devices.dev_name)\
    .select_from(Events).join(Devices)\
    .filter(Events.timestamp_event >= (datetime.now() - timedelta(days=peak_days)))\
    .group_by(db.func.date_format(Events.timestamp_event, '%Y%M%D%H'))\
    .group_by(Events.device_id)\
    .order_by(db.func.count(Events.id).desc()).limit(20).all()
Here's sample output for the first 3 rows of the query: number of events, when (DD/MM HH), and which device:
[(2710, '15/01 16', 'Device 002'),
(2612, '11/01 17', 'Device 033'),
(2133, '13/01 15', 'Device 002'),...]
and here's the SQL generated by SQLAlchemy:
SELECT count(events.id) AS count_1,
date_format(events.timestamp_event,
%(date_format_2)s) AS date_format_1,
devices.id AS devices_id,
devices.dev_name AS devices_dev_name
FROM events
INNER JOIN devices ON devices.id = events.device_id
WHERE events.timestamp_event >= %(timestamp_event_1)s
GROUP BY date_format(events.timestamp_event, %(date_format_3)s), events.device_id
ORDER BY count(events.id) DESC
LIMIT %(param_1)s
# This example is for postgresql.
# I'm not sure what db you are using but the date formatting
# is different.
with Session(engine) as session:
    # Use subquery to select top 20 event creating device ids
    # for each hour since the beginning of the peak.
    hour_fmt = "dd/Mon HH24"
    hour_col = func.to_char(Event.created_on, hour_fmt).label('event_hour')
    event_count_col = func.count(Event.id).label('event_count')

    sub_q = select(
        event_count_col,
        hour_col,
        Event.device_id
    ).filter(
        Event.created_on > get_start_of_peak()
    ).group_by(
        hour_col, Event.device_id
    ).order_by(
        event_count_col.desc()
    ).limit(
        20
    ).alias()

    # Now join in the devices to the top ids to get the names.
    results = session.execute(
        select(
            sub_q.c.event_count,
            sub_q.c.event_hour,
            Device.name
        ).join_from(
            sub_q,
            Device,
            sub_q.c.device_id == Device.id
        ).order_by(
            sub_q.c.event_count.desc(),
            Device.name
        )
    ).all()
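On the composite index the question asks about: here is a hedged sketch of how it could be declared on the Flask-SQLAlchemy model (my addition, not from the answer; the index name is made up). It covers the filter on timestamp_event and the grouping on device_id, but whether your database actually uses it for this GROUP BY, and which column order works best, is something to verify with EXPLAIN.

# Hypothetical composite index declaration on the existing Events model.
from sqlalchemy import Index

class Events(db.Model):
    id = db.Column(db.Integer, primary_key=True, autoincrement=True)
    timestamp_event = db.Column(db.DateTime, nullable=False, index=True)
    device_id = db.Column(db.Integer, db.ForeignKey('devices.id'), nullable=True)

    __table_args__ = (
        Index('ix_events_device_ts', 'device_id', 'timestamp_event'),
    )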
I have an SQLite data table created with SQLAlchemy that I would like to display in a PyQt5 QTableWidget.
def createTable(self, tableData):
    self.qTable = session.query(tableData).all()
    self.tableWidget = QTableWidget()
    self.tableWidget.setRowCount(0)
    self.tableWidget.setColumnCount(tableData.cols)
    self.tableWidget.setHorizontalHeaderLabels(tableData.col_headers)
    for row, form in enumerate(self.qTable):
        self.tableWidget.setRowCount(row + 1)
        for col, record in enumerate(form):
            self.tableWidget.setItem(row, col, QTableWidgetItem(record))
This breaks at the line
for col,record in enumerate(form):
with an error
"TypeError: 'Tests' object is not iterable"
The ORM model is built with this code:
class Tests(Base):
    __tablename__ = 'tests'
    id = Column(Integer, primary_key=True)
    current = Column(Boolean)
    temp = Column(Float)
    brine = Column(Float)
    test = Column(String)
    pc = Column(Float)
    wait_time = Column(Integer)
    headers = {"current", "temp", "brine", "test", "pc", "wait time"}
Is there a way to make this iterable, or a neater way of dealing with this?
Thanks @SuperShoot, this worked pretty well for me. Here is the final code I used:
for row, form in enumerate(self.qTable):
    col = 0
    self.tableWidget.setRowCount(row + 1)
    for c in columns:
        for k, v in vars(form).items():
            if k == c:
                self.tableWidget.setItem(row, col, QTableWidgetItem(str(v)))
                col += 1
I added extra logic so I can define the column order.
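A slightly tidier variant (my own sketch, not from the thread): reading the column names straight from the mapped table gives an explicit, stable column order without relying on vars().

# Sketch: fill the widget using the model's mapped columns in a fixed order.
columns = [c.name for c in Tests.__table__.columns]
self.tableWidget.setColumnCount(len(columns))
self.tableWidget.setHorizontalHeaderLabels(columns)
for row, form in enumerate(self.qTable):
    self.tableWidget.setRowCount(row + 1)
    for col, name in enumerate(columns):
        self.tableWidget.setItem(row, col, QTableWidgetItem(str(getattr(form, name))))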
I'm having a problem using the following code to load a large (23,000 records, 10 fields) airport code CSV file into a database with SQLAlchemy:
from numpy import genfromtxt
from time import time
from datetime import datetime
from sqlalchemy import Column, Integer, Float, Date, String
from sqlalchemy.ext.declarative import declarative_base
from sqlalchemy import create_engine
from sqlalchemy.orm import sessionmaker

def Load_Data(file_name):
    f = lambda s: str(s)
    data = genfromtxt(file_name, delimiter=',', skiprows=1,
                      converters={0: f, 1: f, 2: f, 6: f, 7: f, 8: f, 9: f, 10: f})
    return data.tolist()

Base = declarative_base()

class AirportCode(Base):
    #Tell SQLAlchemy what the table name is and if there's any table-specific arguments it should know about
    __tablename__ = 'AirportCode'
    __table_args__ = {'sqlite_autoincrement': True}

    #tell SQLAlchemy the name of column and its attributes:
    id = Column(Integer, primary_key=True, nullable=False)
    ident = Column(String)
    type = Column(String)
    name = Column(String)
    latitude_deg = Column(String)
    longitude_deg = Column(String)
    elevation_ft = Column(String)
    continent = Column(String)
    iso_country = Column(String)
    iso_region = Column(String)
    municipality = Column(String)
    gps_code = Column(String)

    def __repr__(self):
        #return "<AirportCode(name='%s', municipality='%s')>\n" % (self.name, self.municipality)
        return "name:{} municipality:{}\n".format(self.name, self.municipality)

if __name__ == "__main__":
    t = time()

    #Create the database
    engine = create_engine('sqlite:///airport-codes.db')
    Base.metadata.create_all(engine)

    #Create the session
    session = sessionmaker()
    session.configure(bind=engine)
    s = session()

    records_to_commit = 0
    file_name = "airport-codes.csv" #23,000 records fails at next line
    #file_name = "airport-codes.alaska 250 records works fine"
    print file_name #for debugging
    data = Load_Data(file_name) # fails here on large files and triggers the except: below
    print 'file loaded' #for debugging

    for i in data:
        records_to_commit += 1
        record = AirportCode(**{
            'ident': i[0].lower(),
            'type': i[1].lower(),
            'name': i[2].lower(),
            'latitude_deg': i[3],
            'longitude_deg': i[4],
            'elevation_ft': i[5],
            'continent': i[6],
            'iso_country': i[7],
            'iso_region': i[8],
            'municipality': i[9].lower(),
            'gps_code': i[10].lower()
        })
        s.add(record) #Add all the records
        #if records_to_commit == 1000:
        #    s.flush() #Attempt to commit batch of 1000 records
        #    records_to_commit = 0

    s.commit() # flushes everything remaining + commits
    s.close() #Close the connection
    print "Time elapsed: " + str(time() - t) + " s."
I adapted this code from another post on this forum and it works fine if I use a subset of the main csv file (Alaska airports) that is only 250 records.
When I try the entire data base of 23,000 records the program fails to load at this line in the code:
data = Load_Data(file_name)
I am working on a Raspberry Pi 3.
Thanks for the helpful comments. Removing the try/except revealed the issues: there were many international characters, extra commas within fields, special characters, etc. that caused problems when loading the file. The Alaska airport entries were error-free, so that subset loaded fine.
The database now loads 22,000 records in 32 seconds. I deleted about 1,000 entries since they were foreign entries and I want this to be a US airport directory.
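As an aside (my own sketch, written for Python 3, not from the thread): the csv module respects quoted fields, so embedded commas don't shift columns, and the text encoding can be set explicitly, which covers both failure modes described above.

# Hypothetical loader using the csv module instead of numpy.genfromtxt.
import csv

def load_rows(file_name):
    # newline='' as the csv docs recommend; utf-8 handles international characters
    with open(file_name, newline='', encoding='utf-8') as fh:
        return list(csv.DictReader(fh))   # one dict per row, keyed by the header row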
I am trying to build a complex hybrid_property using SQLAlchemy. My model is:
class Consultation(Table):
    patient_id = Column(Integer)
    patient = relationship('Patient', backref=backref('consultations', lazy='dynamic'))

class Exam(Table):
    consultation_id = Column(Integer)
    consultation = relationship('Consultation', backref=backref('exams', lazy='dynamic'))

class VitalSign(Table):
    exam_id = Column(Integer)
    exam = relationship('Exam', backref=backref('vital_signs', lazy='dynamic'))
    sign_type = Column(String)
    value = Column(String)

class Patient(Table):
    patient_data = Column(String)

    @hybrid_property
    def last_consultation_validity(self):
        last_consultation = self.consultations.order_by(Consultation.created_at.desc()).first()
        if last_consultation:
            last_consultation_conclusions = last_consultation.exams.filter_by(exam_type='conclusions').first()
            if last_consultation_conclusions:
                last_consultation_validity = last_consultation_conclusions.vital_signs.filter_by(sign_type='validity_date').first()
                if last_consultation_validity:
                    return last_consultation_validity
        return None

    @last_consultation_validity.expression
    def last_consultation_validity(cls):
        subquery = select([Consultation.id.label('last_consultation_id')]).\
            where(Consultation.patient_id == cls.id).\
            order_by(Consultation.created_at.desc()).limit(1)
        j = join(VitalSign, Exam).join(Consultation)
        return select([VitalSign.value]).select_from(j).select_from(subquery).\
            where(and_(Consultation.id == subquery.c.last_consultation_id, VitalSign.sign_type == 'validity_date'))
As you can see my model is quite complicated.
Patients get Consultations. Exams and VitalSigns are cascading data for the Consultations. The idea is that not every consultation gets a validity, and that a new consultation makes the previous consultations' validity irrelevant: I only want the validity from the last consultation; if a patient has a validity in earlier consultations, I'm not interested.
What I would like to do is to be able to order by the hybrid_property last_consultation_validity.
The output SQL looks ok to me:
SELECT vital_sign.value
FROM (SELECT consultation.id AS last_consultation_id
FROM consultation, patient
WHERE consultation.patient_id = patient.id ORDER BY consultation.created_at DESC
LIMIT ? OFFSET ?), vital_sign JOIN exam ON exam.id = vital_sign.exam_id JOIN consultation ON consultation.id = exam.consultation_id
WHERE consultation.id = last_consultation_id AND vital_sign.sign_type = ?
But when I order the patients by last_consultation_validity, the rows do not get ordered ...
When I execute the same select outside of the hybrid_property, to retrieve the date for each patient (just setting the patient.id), I get the correct values. Surprisingly, the SQL is slightly different: patient is removed from the FROM in the inner SELECT.
So I'm actually wondering if this is a bug in SQLAlchemy or if I'm doing something wrong ... Any help would be greatly appreciated.
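For what it's worth, a sketch of one direction to try (my own, not from the post): build the expression as a single correlated scalar subquery, so that when it is used in order_by() it refers to the outer patient row instead of pulling patient into its own FROM. Names follow the question; correlate() and as_scalar() are standard SQLAlchemy, but whether this resolves the original ordering issue is an assumption, and note that it returns the most recent validity rather than strictly the validity of the most recent consultation.

# Hypothetical variant of the expression: one correlated scalar subquery.
@last_consultation_validity.expression
def last_consultation_validity(cls):
    return (
        select([VitalSign.value])
        .select_from(
            join(VitalSign, Exam, VitalSign.exam_id == Exam.id)
            .join(Consultation, Exam.consultation_id == Consultation.id)
        )
        .where(Consultation.patient_id == cls.id)
        .where(VitalSign.sign_type == 'validity_date')
        .order_by(Consultation.created_at.desc())
        .limit(1)
        .correlate(cls)     # keep the outer Patient out of this subquery's FROM
        .as_scalar()        # so it can be used directly in order_by()
    )

Ordering would then be, for example, session.query(Patient).order_by(Patient.last_consultation_validity.desc()).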