I may be guilty of thinking in too object-oriented a manner here, but I keep coming back to a certain pattern which doesn't look SQL-friendly at all.
Simplified description:
I have an Attendance table, with various attributes including clockin, lunchout, lunchin, and clockout (all of type Time), an employee ID, and a date.
Different employees can have different shift hours, which depend on the day of the week and the day of the month. The employees' categories are summarized in an Employees table, and the shift hours etc. are summarized in an Hours table. A Leave table specifies half/full-day leave taken.
I have raw check-in times from another source (third party; basically consider this a CSV) which I sort and then need to insert into the Attendance table properly, which I imagine means:
a) Take the earliest time as clockin
b) Take the latest time as clockout
c) Compare remaining times with assigned lunch hours according to an employee's shift for the day
c) is the one giving me problems, as some employees check in at odd times (perhaps they need to run out for an hour to meet a client), and some simply don't check in for meals (eating in, etc.). So I need to fuzzy-match the remaining times against the assigned lunch hours, meaning I have to access the Hours table as well.
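To illustrate the kind of fuzzy match I mean (the 30-minute tolerance is an arbitrary number I made up, and the lunch-start times here stand in for values that would really come from the Hours table):

from datetime import datetime, date, time, timedelta

def closest_to(times, target, tolerance=timedelta(minutes=30)):
    """Return the entry in times nearest to target, or None if nothing
    falls within tolerance."""
    def gap(t):
        return abs(datetime.combine(date.min, t)
                   - datetime.combine(date.min, target))
    best = min(times, key=gap, default=None)
    return best if best is not None and gap(best) <= tolerance else None

# Times left over once the earliest (clockin) and latest (clockout) are removed:
remaining = [time(12, 7), time(15, 42)]
print(closest_to(remaining, time(12, 0)))  # 12:07:00 -> plausible lunchout
print(closest_to(remaining, time(13, 0)))  # None -> no lunchin recorded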
I'm trying to do all of this in a function within the Attendance class, so that it looks something like this:
class Attendance(Base):
    # ... my attributes ...

    def calc_attendance(self, listOfTimes):
        # Do the necessary
        ...
However, it looks like I'd have to run more queries within the calc_attendance function. I'm not even sure that's possible, and it certainly doesn't look like something I should be doing.
I could do multiple queries in the calling function, but that seems messier. I've also considered multiple joins so that the requisite information (for each employee per day) is available. Neither feels right; please do correct me, though, if necessary.
In short: yes, you seem to be trying to jam everything into one class, but that isn't "wrong" in itself; the question is which class you're jamming things into.
To answer the actual question: yes, you can perform a query inside another class, and/or make that class turn itself into a dict or other iterable format, which makes comparing things easier in some ways.
However, I have found in practice that it is usually more time-efficient to build a class per table, apply modifications to that table's specific data using that class's internal functions, and have an overriding umbrella "job" class that performs the specific "tasks" (comparison, data input, etc.) for the group, encapsulating the process so it can scale up and be used by higher-level classes in future. This keeps table-specific work in the respective classes, while the group can be compared and fuzzy-matched without cluttering up those table classes. It makes the code easier to maintain and makes it possible to split the work among a group, allows a much sharper focus on each task, and in theory leaves room for asyncio and other optimizations later.
so:
Main Job (class)
(various comparison functions, data input verification, etc...)
|-> table 1 class instance
|-> table 2 class instance
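As a bare-bones illustration of that shape (all names invented, and the bodies left as stubs):

class AttendanceTable:
    """Knows only about the Attendance table's own data."""
    def __init__(self, session):
        self.session = session

    def store_day(self, employee_id, day, clockin, clockout, lunch):
        ...  # insert/update logic for one attendance row

class HoursTable:
    """Knows only about the Hours table's own data."""
    def __init__(self, session):
        self.session = session

    def shift_for(self, employee_id, day):
        ...  # query returning the shift (incl. lunch hours) for that day

class AttendanceJob:
    """Umbrella 'job' class: owns the cross-table comparison."""
    def __init__(self, session):
        self.attendance = AttendanceTable(session)
        self.hours = HoursTable(session)

    def process(self, employee_id, day, raw_times):
        shift = self.hours.shift_for(employee_id, day)
        ...  # fuzzy-match raw_times against shift, then call store_day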
Happy to help further if you need fuller examples.
I'm looking for ideas on how to improve a report that takes up to 30 minutes to process on the server. I'm currently working with Django and MySQL, but if there is a solution that requires changing the language or SQL database, I'm open to it.
The report I'm talking about reads multiple Excel files and inserts all the rows from those files into a table (the report table) of somewhere between 12K and 15K records; the table has around 50 columns. This part doesn't take that much time.
Once I have all the records in the report table, I start applying multiple phases of business logic, so I end up having something like this:
def create_report():
    business_logic_1()
    business_logic_2()
    business_logic_3()
    business_logic_4()
Each business_logic_X function does something very similar: it starts with a ReportModel.objects.all() query, then applies multiple calculations (checking dates, quantities, etc.) and updates each record. Since it's a 12K-record table, this quickly adds time to the complete report.
The reason I'm running the functions separately rather than doing all the processing in one pass is that the logic from the first function needs to be complete before the logic in the next functions can work (e.g. the first function finds all related records and applies the same status to all of them).
The first thing I know could be optimized is somehow caching the objects.all() result instead of calling it in each function, but I'm not sure how to pass it to the next function without saving the records first.
I've already optimized the report a bit by passing update_fields to the save() calls in those functions, and that saved a bit of time.
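Roughly, what I have in mind is something like this (bulk_update needs Django 2.2+, and the field names here are placeholders):

def create_report():
    rows = list(ReportModel.objects.all())  # hit the database once
    business_logic_1(rows)
    business_logic_2(rows)
    business_logic_3(rows)
    business_logic_4(rows)
    # one bulk write instead of a save() per row
    ReportModel.objects.bulk_update(rows, ['status', 'quantity'])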
My question is, is there a better approach to this kind of problem? Is Django/MySQL the right stack for this?
What takes time is the business logic you're doing in Django: it makes several round trips between the database and the application.
It sounds like there are several tables involved, so I would suggest writing your query in raw SQL, and only pulling the results into the application once you have them, if you need them at all.
The ORM has a method, raw, that you can use. Or you could drop down to an even lower level and interface with your database directly.
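For example (the table and column names are placeholders for whatever your schema looks like):

# raw() maps rows onto model instances; the primary key must be selected
for row in ReportModel.objects.raw(
        'SELECT id, status FROM report_table WHERE quantity > %s', [0]):
    print(row.id, row.status)

# or go lower-level and push the whole update into the database
from django.db import connection
with connection.cursor() as cursor:
    cursor.execute(
        'UPDATE report_table SET status = %s WHERE quantity > %s',
        ['done', 0])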
Unless I see more of what you're doing, I can't give any more specific advice.
In my app I have a mixin that defines two fields, start_date and end_date. I've added this mixin to all table declarations that require these fields.
I've also defined a function that returns filters (conditions) to test that a timestamp (e.g. now) is >= start_date and < end_date. Currently I'm manually adding these filters whenever I need to query a table with these fields.
However, sometimes my colleagues or I forget to add the filters, and I wonder whether it is possible to automatically extend any query on such a table, e.g. with an additional function in the mixin that SQLAlchemy invokes whenever it "compiles" the statement. I'm using 'compile' only as an example here; I don't actually know when or how best to hook this in.
Any idea how to achieve this?
In case it works for SELECT, does it also work for INSERT and UPDATE?
thanks a lot for your help
Juergen
Take a look at this example. You can change the criteria expressed in the private method to refer to your start and end dates.
Note that this query will be less efficient because it overrides the get method to bypass the identity map.
I'm not sure what the enable_assertions false call does; I'd recommend understanding that before proceeding.
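If you're on SQLAlchemy 1.4 or later, another option I'm aware of is the do_orm_execute session event combined with with_loader_criteria, which attaches the criteria to every SELECT against classes deriving from your mixin. A minimal sketch, assuming your mixin is called DateRangeMixin:

import datetime
from sqlalchemy import Column, DateTime, event
from sqlalchemy.orm import Session, with_loader_criteria

class DateRangeMixin:
    # stand-in for the mixin from the question
    start_date = Column(DateTime)
    end_date = Column(DateTime)

@event.listens_for(Session, "do_orm_execute")
def _add_date_range_filter(execute_state):
    if execute_state.is_select:
        now = datetime.datetime.utcnow()
        execute_state.statement = execute_state.statement.options(
            with_loader_criteria(
                DateRangeMixin,
                lambda cls: (cls.start_date <= now) & (cls.end_date > now),
                include_aliases=True,
            )
        )

As far as I know this covers SELECTs, including the ones emitted by lazy loading; whether anything comparable applies to your INSERT and UPDATE case I'd verify against the event documentation.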
I tried extending Query but had a hard time. Eventually (and unfortunately) I moved back to my previous approach of little helper functions that return filters, and applying them to queries by hand.
I still wish I could find an approach that automatically adds certain filters whenever a table (Base) has certain columns.
Juergen
For a model like:
class Thing(ndb.Model):
    visible = ndb.BooleanProperty()
    made_by = ndb.KeyProperty(kind=User)
    belongs_to = ndb.KeyProperty(kind=AnotherThing)
Essentially I'm performing an 'or' query, but comparing different properties, so I can't use the built-in OR. I want to get all Thing entities (belonging to a particular AnotherThing) which either have visible set to True, or have visible set to False and made_by set to the current user.
Which would be less demanding on the datastore (ie financially cost less):
Query to get everything, i.e. Thing.query(Thing.belongs_to == some_thing.key), and iterate through the results, keeping the visible ones and the ones that aren't visible but are made_by the current user?
Query to get the visible ones, i.e. Thing.query(Thing.belongs_to == some_thing.key, Thing.visible == True), and query separately to get the non-visible ones by the current user, i.e. Thing.query(Thing.belongs_to == some_thing.key, Thing.visible == False, Thing.made_by == current_user)?
Number 1 would fetch many unneeded results, like non-visible Things made by other users, which I think means many reads of the datastore? Number 2 is two whole queries, though, which also seems unnecessarily heavy. I'm still trying to work out which kinds of interaction with the database incur which kinds of costs.
I'm using ndb, tasklets and memcache where necessary, in case that's relevant.
Number two is going to cost less, for two reasons. First, you pay for each datastore read and for each entity returned by a query, so you would pay more for the first option, where you have to read and return all the data. With the second option you only pay for what you need.
Second, you also pay for backend or frontend time, and you will spend time iterating through all the results in the first method, whereas you spend no extra time with the second.
I can't see a case where the first option is better (maybe if you only have a few entities?).
To understand how reads and queries are charged, scroll down a little on:
https://developers.google.com/appengine/docs/billing
You will see how reads, writes and smalls add up for reads, writes and queries.
I would also just query for the ones that are owned by the current user, instead of visible=False and owner=current; this way you don't need a composite index, which will save some time. You can also make visible a partial index (only index it when True, assuming you never need to query for False ones), saving some space as well. You will need to do a little work to remove duplicates, but that is probably not too bad.
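Untested, but the de-duplication could be as simple as this (current_user_key being the User key for the signed-in user):

visible = Thing.query(Thing.belongs_to == some_thing.key,
                      Thing.visible == True).fetch()
mine = Thing.query(Thing.belongs_to == some_thing.key,
                   Thing.made_by == current_user_key).fetch()

seen, results = set(), []
for thing in visible + mine:
    if thing.key not in seen:      # entity keys are unique, so use them to dedupe
        seen.add(thing.key)
        results.append(thing)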
You are probably best off benchmarking both cases with real-world data. It's hard to determine things like this in the abstract, as there are many subtleties that may affect overall performance.
I would expect option 2 to be better, though. Loading tons of objects that you don't care about puts a burden on the datastore that I don't think an extra query comes close to. Of course, it depends on how many extra things there are, etc.
I'm building a calendar system from the ground up (a requirement, as I'm working with a special type of calendar alongside the Gregorian one), and I need some help with the logic. I'm writing the application in Django and Python.
Essentially, the logical issue I'm running into is how to persist as few objects as possible, as smartly as possible, without running up the tab on CPU cycles. I have a feeling that polymorphism would be a solution, but I'm not exactly sure how to express it here.
I have two basic subsets of events, repeating events and one-shot events.
Repeating events will have subscribers, people who are notified about changes to them. If, for example, a class is canceled or moved to a different address or time, people who have subscribed need to know about it. Some events simply happen every day until the end of time, won't be edited, and "just happen." The problem is that if I have one object that stores the event info and its repeat policy, then canceling or modifying one event in the series really screws things up, and I'll have to account for that somehow, keeping subscribers aware of the change while keeping the series together as a logical group.
Paradox: generating unique event objects for each occurrence in a series until the end of time (if it repeats indefinitely) makes no sense if they all store the same information; however, if any change happens to a single event in the series, I'll almost certainly have to create a separate object in the database to represent the cancellation or modification.
Can someone help me with the logic here? It's really twisting my mind and I can't really think straight anymore. I'd really like some input on how to solve this issue, as repeating events isn't exactly the easiest logical thing either (repeat every other day, or every M/W/F, or on the 1st M of each month, or every 3 months, or once a year on this date, or once a week on this date, or once a month on this date, or at 9:00 am on Tuesdays and 11:00am on Thursdays, etc.) and I'd like help understanding the best route of logic for repeating events as well.
Here's a thought on how to do it:
class EventSeries(models.Model):
    series_name = models.TextField()
    series_description = models.TextField()
    series_repeat_policy = models.IHaveNoIdeaInTheWorldOnHowToRepresentThisField()
    series_default_time = models.TimeField()
    series_start_date = models.DateField()
    series_end_date = models.DateField()
    location = models.ForeignKey('Location')

class EventSeriesAnomaly(models.Model):
    event_series = models.ForeignKey('EventSeries', related_name="exceptions")
    override_name = models.TextField()
    override_description = models.TextField()
    override_time = models.TimeField()
    override_location = models.ForeignKey('Location')
    event_date = models.DateField()

class EventSeriesCancellation(models.Model):
    event_series = models.ForeignKey('EventSeries', related_name="cancellations")
    event_date = models.DateField()  # holds a date, so DateField rather than TimeField
    cancellation_explanation = models.TextField()
This seems to make a bit of sense, but as stated above, this is ruining my brain right now so anything seems like it would work. (Another problem and question, if someone wants to modify all remaining events in the series, what in the heck do I do!?!? I suppose that I could change 'series_default_time' and then generate anomaly instances for all past instances to set them to the original time, but AHHHHHH!!!)
Boiling it down to three simple, concrete questions, we have:
How can I have a series of repeating events, yet allow for cancellations and modifications on individual events and modifications on the rest of the series as a whole, while storing as few objects in the database as absolutely necessary, never generating objects for individual events in advance?
How can I repeat events in a highly customizable way, without losing my mind, in that I can allow events to repeat in a number of ways, but again making things easy and storing as few objects as possible?
How can I do all of the above, allowing for a switch on each event series to make it not happen if it falls out on a holiday?
This could become a heated discussion, as date logic usually is much harder than it first looks and everyone will have her own idea how to make things happen.
I would probably sacrifice some db space and keep the models as dumb as possible (e.g. by not having to define anomalies against a series). The repeat condition could either be some simple terms which would have to be parsed (depending on your requirements) or, KISS, just the interval at which the next event occurs.
From this you can generate the "next" event, which copies the repeat condition, and you generate as many events into the future as practically necessary (define some maximum time window into the future for which to generate events, but generate them only when somebody actually looks at the time interval in question). The events could each have a pointer back to their parent event, so a whole series is identifiable (just like a linked list).
The model should have a flag indicating whether a single event is cancelled. (The event remains in the db, to be able to copy the event into the future.) Cancelling a whole series deletes the list of events.
EDIT: other answers have mentioned the dateutil package for interval building and parsing, which really looks very nice.
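The repeat policies the question lists map pretty directly onto rrule, e.g.:

from datetime import datetime
from dateutil.rrule import rrule, DAILY, WEEKLY, MONTHLY, MO, WE, FR

start = datetime(2012, 1, 2, 9, 0)

every_other_day = rrule(DAILY, interval=2, dtstart=start)
mon_wed_fri = rrule(WEEKLY, byweekday=(MO, WE, FR), dtstart=start)
first_monday_monthly = rrule(MONTHLY, byweekday=MO(1), dtstart=start)

# materialize occurrences only for the window somebody is looking at
print(first_monday_monthly.between(datetime(2012, 1, 1),
                                   datetime(2012, 4, 1)))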
I want to address only question 3, about holidays.
In several reporting databases, I have found it handy to define a table, let's call it "Almanac", that has one row for each date, within a certain range. If the range spans ten years, the table will contain about 3,652 rows. That's small by today's standards. The primary key is the date.
Some other columns are things like whether the date is a holiday, a normal working day, or a weekend day. I know, I know, you could compute the weekend stuff with a built-in function. But it turns out to be convenient to include this as data. It makes your joins simpler and more similar to each other.
Then you have one application program that populates the Almanac. It has all the calendar quirks built into it, including the enterprise rules for figuring out which days are holidays. You can even include columns for which "fiscal month" a given date belongs to, if that's relevant to your case. The rest of the application, both entry programs and extraction programs, all treat the Almanac like plain old data.
This may seem suboptimal because it's not minimal. But trust me, this design pattern is useful in a wide variety of situations. It's up to you to figure how it applies to your case.
The Almanac is really a subset of the principles of data warehousing and star schema design.
If you want to do the same thing inside the CPU, you could have an "Almanac" object with public features such as Almanac.holiday(date).
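Since the question is Django, the table could be as small as this (the column choice is mine; adapt it to your enterprise rules):

from django.db import models

class Almanac(models.Model):
    date = models.DateField(primary_key=True)      # one row per date in the range
    is_holiday = models.BooleanField(default=False)
    is_weekend = models.BooleanField(default=False)
    is_working_day = models.BooleanField(default=True)
    fiscal_month = models.CharField(max_length=7, blank=True)  # e.g. "2012-04"

One program with all the calendar quirks populates it once; everything else just joins against it.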
For an event series model I created, my solution to IHaveNoIdeaInTheWorldOnHowToRepresentThisField was to use a pickled object field to save a recurrence rule (rrule) from dateutil in my event series model.
I faced exactly the same problems as you a while ago. However, my initial solution did not contain any anomalies or cancellations, just repeating events. The way we modeled a set of repeating events was to have a field indicating the interval type (monthly/weekly/daily) and then the distances (every 2nd day, every 2nd week, etc.) starting from a given start day. This simple form of repetition does not cover too many scenarios, but it was very easy to calculate the repeating dates. Other kinds of repetition are also possible, for instance something like the way cronjobs are defined.
To generate the repetitions, we created a table function that, given some user id, generated all the event repetitions on the fly, up to about 5 years into the future, using recursive SQL (so, as in your approach, only one event has to be stored per set of repetitions). This has worked very well so far, and the table function can be queried as if the individual repetitions were actually stored in the database. It could also easily be extended to exclude any cancelled events and to substitute changed events based on dates, also on the fly. I do not know whether this is possible with your database and your ORM.
I'm a beginner/intermediate Python programmer, but I haven't written an application, just scripts. I don't currently use a lot of object-oriented design, so I would like this project to help build my OOD skills. The problem is, I don't know where to start from a design perspective (I know how to create the objects and all that). For what it's worth, I'm also self-taught; no formal CS education.
I'd like to try writing a program to keep track of a portfolio stock/options positions.
I have a rough idea about what would make good object candidates (Portfolio, Stock, Option, etc.) and methods (Buy, Sell, UpdateData, etc.).
A long position is buy-to-open and sell-to-close, while a short position is sell-to-open and buy-to-close.
portfolio.PlaceOrder(type="BUY", symbol="ABC", date="01/02/2009", price=50.00, qty=100)
portfolio.PlaceOrder(type="SELL", symbol="ABC", date="12/31/2009", price=100.00, qty=25)
portfolio.PlaceOrder(type="SELLSHORT", symbol="XYZ", date="1/2/2009", price=30.00, qty=50)
portfolio.PlaceOrder(type="BUY", symbol="XYZ", date="2/1/2009", price=10.00, qty=50)
Then, once this method is called, how do I store the information? At first I thought I would have a Position object with attributes like Symbol, OpenDate, OpenPrice, etc., but updating the position to account for sales becomes tricky because buys and sells happen at different times and in different amounts. For example:
Buy 100 shares to open, 1 time, 1 price. Sell 4 different times, 4 different prices.
Buy 100 shares. Sell 1 share per day, for 100 days.
Buy 4 different times, 4 different prices. Sell entire position at 1 time, 1 price.
A possible solution would be to create an object for each share of stock; this way each share would have its own dates and prices. Would this be too much overhead? The portfolio could have thousands or millions of little Share objects. If you wanted to find out the total market value of a position you'd need something like:
sum([trade.last_price for trade in portfolio.positions if trade.symbol == "ABC"])
If you had a position object the calculation would be simple:
position.last * position.qty
Thanks in advance for the help. Looking at other posts it's apparent SO is for "help" not to "write your program for you". I feel that I just need some direction, pointing down the right path.
ADDITIONAL INFO UPON REFLECTION
The Purpose
The program would keep track of all positions, both open and closed; with the ability to see a detailed profit and loss.
When I think about detailed P&L I want to see...
- all the open dates (and closed dates)
- time held
- open price (and close price)
- P&L since open
- P&L per day
#Senderle
I think perhaps you're taking the "object" metaphor too literally, and so are trying to make a share, which seems very object-like in some ways, into an object in the programming sense of the word. If so, that's a mistake, which is what I take to be juxtapose's point.
This is my mistake. Thinking about "objects", a share object seems a natural candidate. It's only when there may be millions of them that the idea seems crazy. I'll have some free coding time this weekend and will try creating an object with a quantity.
There are two basic precepts you should keep in mind when designing such a system:
Eliminate redundancy from your data. No redundancy ensures integrity.
Keep all the data you need to answer any inquiry, at the lowest level of detail.
Based on these precepts, my suggestion is to maintain a Transaction Log file. Each transaction represents a change of state of some kind, and records all the pertinent facts about it: when, what, buy/sell, how many, how much, etc. Each transaction would be represented by a record (a namedtuple is useful here) in a flat file. A year's worth (or even 5 or 10 years) of transactions should easily fit in a memory-resident list. You can then create functions to select, sort and summarize whatever information you need from this list, and being memory-resident, it will be amazingly fast, much faster than a SQL database.
When and if the Transaction Log becomes too large or too slow, you can compute the state of all your positions as of a particular date (like year-end), use that for the initial state for the following period, and archive your old log file to disc.
You may want some auxiliary information about your holdings such as value/price on any particular date, so you can plot value vs. time for any or all holdings (There are on-line sources for this type of information, yahoo finance for one.) A master database containing static information about each of your holdings would also be useful.
I know this doesn't sound very "object oriented", but OO design could be useful to hide the detailed workings of the system in a TransLog object with methods to save/restore the data to/from disc (save/open methods), enter/change/delete a transaction; and additional methods to process the data into meaningful information displays.
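A minimal sketch of that log (the field names are just an example):

from collections import namedtuple

Transaction = namedtuple(
    "Transaction", "date action symbol qty price commission")

log = [
    Transaction("2009-01-02", "BUY",  "ABC", 100,  50.0, 9.99),
    Transaction("2009-12-31", "SELL", "ABC",  25, 100.0, 9.99),
]

def net_position(log, symbol):
    """Open quantity for one symbol, derived purely from the log."""
    return sum(t.qty if t.action == "BUY" else -t.qty
               for t in log if t.symbol == symbol)

print(net_position(log, "ABC"))  # 75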
First write the API with a command line interface. When this is working to your satisfaction, then you can go on to creating a GUI front end if you wish.
Good luck and have fun!
Avoid objects. Object-oriented design is flawed. Think about your program as a collection of behaviors that operate on data (lists and dictionaries), then group your related behaviors as functions in a module.
Each function should have clear inputs and outputs. Store your data globally in each module.
Why do it without objects? Because it maps more closely to the problem space. Object-oriented programming creates too much indirection when solving a problem, and unnecessary indirection causes software bloat and bugs.
A possible solution would be to create an object for each share of stock; this way each share would have its own dates and prices. Would this be too much overhead? The portfolio could have thousands or millions of little Share objects. If you wanted to find out the total market value of a position you'd need something like:
Yes, it would be too much overhead. The solution here is to store the data in a database. Finding the total market value of a position would be done in SQL, unless you use a NoSQL scheme.
Don't try to design for all possible future outcomes. Just make your program work the way it needs to work now.
I think I'd separate it into
holdings (what you currently own or owe of each symbol)
orders (simple demands to buy or sell a single symbol at a single time)
trades (collections of orders)
This makes it really easy to get a current value, queue orders, and build more complex orders, and maps easily into data objects with a database behind them.
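In skeleton form (modern Python, and purely illustrative):

from dataclasses import dataclass, field
from typing import List

@dataclass
class Order:
    """A simple demand to buy or sell one symbol at one time."""
    symbol: str
    qty: int        # positive = buy, negative = sell
    price: float
    date: str

@dataclass
class Trade:
    """A collection of orders submitted together."""
    orders: List[Order] = field(default_factory=list)

@dataclass
class Holding:
    """What you currently own or owe of one symbol."""
    symbol: str
    qty: int = 0

    def apply(self, order: Order) -> None:
        self.qty += order.qty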
To answer your question: You appear to have a fairly clear idea of your data model already. But it looks to me like you need to think more about what you want this program to do. Will it keep track of changes in stock prices? Place orders, or suggest orders to be placed? Or will it simply keep track of the orders you've placed? Each of these uses may call for different strategies.
That said, I don't see why you would ever need to have an object for every share; I don't understand the reasoning behind that strategy. Even if you want to be able to track your order history in great detail, you could just store aggregate data, as in "x shares at y dollars per share, on date z".
It would make more sense to have a position object (or holding object, in Hugh's terminology) -- one per stock, perhaps with an .order_history attribute, if you really need a detailed history of your holdings in that stock. And yes, a database would definitely be useful for this kind of thing.
To wax philosophical for a moment: I think perhaps you're taking the "object" metaphor too literally, and so are trying to make a share, which seems very object-like in some ways, into an object in the programming sense of the word. If so, that's a mistake, which is what I take to be juxtapose's point.
I disagree with him that object oriented design is flawed -- that's a pretty bold pronouncement! -- but his answer is right insofar as an "object" (a.k.a. a class instance) is almost identical to a module**. It's a collection of related functions linked to some shared state. In a class instance, the state is shared via self or this, while in a module, it's shared through the global namespace.
For most purposes, the only major difference between a class instance and a module is that there can be many class instances, each one with its own independent state, while there can be only one module instance. (There are other differences, of course, but most of the time they involve technical matters that aren't very important for learning OOD.) That means that you can think about objects in a way similar to the way you think about modules, and that's a useful approach here.
**In many compiled languages, the file that results when you compile a module is called an "object" file. I think that's where the "object" metaphor actually comes from. (I don't have any real evidence of that! So anyone who knows better, feel free to correct me.) The ubiquitous toy examples of OOD that one sees -- car.drive(mph=50); car.stop(); car.refuel(unleaded, regular) -- I believe are back-formations that can confuse the concept a bit.
I would love to hear what you came up with. I am about 4 months (part time) into creating an Order Handler, and although it's mostly complete, I still have the same questions as you, as I'd like it to be built properly.
Currently, I save two files:
A "Strategy Hit Log" where each buy/sell signal that comes from any strategy script is saved. For example: when the buy_at_yesterdays_close_price.py strategy is triggered, it saved that buy request in this file, and passes the request to the Order Handler
An "Order Log" which is a single DataFrame - this file fits the purpose you were focusing on.
Each request from a strategy pertains to a single underlying security (for example, AAPL stock) and creates an Order, which is saved as a row in the DataFrame, with columns for the Ticker and the Strategy name that spawned the Order, as well as Suborder and Broker Suborder columns (explained below).
Each Order has a list of Suborders (dicts) stored in the Suborder column. For example: if you are bullish on AAPL, the suborders could be:
[
    {'security': 'equity', 'quantity': 10},
    {'security': 'option call', 'quantity': 10},
]
Each Order also has a list of Broker Suborders (dicts) stored in the Broker Suborder column. Each Broker Suborder is a request to the Broker to buy or sell a security, and is indexed with the "order ID" that the Broker provides for that request. Every new request to the Broker is a new Broker Suborder, and cancelling a Broker Suborder is recorded in that dict. To record a modification to a Broker Suborder, you cancel the old Broker Suborder and send and record a new one (it's the same commission using IBKR).
Improvements
List of Classes instead of DataFrame: I think it'd be much more pythonic to save each Order as an instance of an Order_Class (instead of as a row of a DataFrame), with Suborder and Broker_Suborder attributes that are themselves instances of a Suborder_Class and a Broker_Suborder_Class; see the rough sketch below.
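Something like this is what I picture (nothing final, all names are mine):

from dataclasses import dataclass, field
from typing import List

@dataclass
class BrokerSuborder:
    broker_order_id: str    # the ID the Broker returns for the request
    action: str             # "BUY" or "SELL"
    quantity: int
    cancelled: bool = False

@dataclass
class Suborder:
    security: str           # e.g. "equity", "option call"
    quantity: int

@dataclass
class Order:
    ticker: str
    strategy: str           # strategy script that spawned this Order
    suborders: List[Suborder] = field(default_factory=list)
    broker_suborders: List[BrokerSuborder] = field(default_factory=list)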
My current question is whether saving a list of classes as my record of all open and closed Orders is pythonic or silly.
Visualization Considerations: It seems like Orders should be saved in table form for easier viewing, but maybe it is better to keep them in this "list of class instances" form and use a function to tabularize them at viewing time? Any input would be greatly appreciated. I'd like to move on and start playing with ML, but I don't want to leave the Order Handler unfinished.
Is it all crap?: Should each Broker Suborder (a buy/sell request to a Broker) be attached to an Order (a specific request from a strategy script), or should all Broker Suborders be recorded in chronological order, each simply holding a reference to the strategy Order that spawned it? I don't know, but I would love everyone's input.