Inserting Increasing Numbers in SQL - python

Does SQL have a function similar to Python's range that will generate a set of increasing numbers? I'm aware of IDENTITY columns, but I don't want to keep creating and recreating tables just to get a set of increasing numbers.
Ultimately, I want to be able to dynamically create a range of numbers based on the results of a query, e.g.
SELECT COUNT(*) FROM teams WHERE team = 'knicks'
would give me a number, say 20,
and then I'd dynamically use that number as part of a function:
function(20) ---> 1, 2, 3, 4, 5, 6, ... 20
I want to use the result of the function to have a sorted table where each number corresponds to a player. I can't use identity here, because I'm keeping all the teams in one table, and I'll use the team name to pull out numbered team lists.
Still shaky on how to use Stack Overflow's formatting, so please bear with me.

Try a SQL SEQUENCE. For example, see this guide.
Edit: I've read it again, and maybe you should try to improve the design of your DB. Use normalization... I don't really understand this sentence:
I want to use the result of the function to have a sorted table where
each number corresponds to a player.
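Most engines can also generate the range directly. As a sketch, using SQLite through Python's sqlite3 module (the players table and its columns are invented for illustration), a recursive CTE produces 1..N, where N comes from the COUNT(*) query — the SQL analogue of Python's range():

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE players (name TEXT, team TEXT)")
conn.executemany(
    "INSERT INTO players VALUES (?, ?)",
    [("Anthony", "knicks"), ("Chandler", "knicks"), ("Felton", "knicks")],
)

# The recursive CTE counts up to the team's row count, so no helper
# table needs to be created or recreated.
rows = conn.execute("""
    WITH RECURSIVE seq(n) AS (
        SELECT 1
        UNION ALL
        SELECT n + 1 FROM seq
        WHERE n < (SELECT COUNT(*) FROM players WHERE team = 'knicks')
    )
    SELECT n FROM seq
""").fetchall()
numbers = [n for (n,) in rows]
print(numbers)  # [1, 2, 3]
```

On PostgreSQL, generate_series(1, 20) does the same in one call, and a ROW_NUMBER() window function can number the players per team directly without generating a separate range at all.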

Related

SQLAlchemy: how can I order a table by a column permanently?

I'm not sure if this has been answered before, I didn't get anything on a quick search.
My table is built in a random order, but thereafter it is modified very rarely. I do frequent selects from the table and in each select I need to order the query by the same column. Now is there a way to sort a table permanently by a column so that it does not need to be done again for each select?
You can add an index sorted by the column you want. The data will be presorted according to that index.
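For instance, with SQLAlchemy's declarative syntax (a minimal sketch — the Employee model and its columns are invented here), the index is declared once on the model, and any ORDER BY on that column can then be served from it:

```python
from sqlalchemy import Column, Integer, String, Index, create_engine
from sqlalchemy.orm import Session, declarative_base

Base = declarative_base()

class Employee(Base):
    __tablename__ = "employees"
    id = Column(Integer, primary_key=True)
    name = Column(String)
    rank = Column(Integer)
    # The index keeps rank presorted; queries still say ORDER BY, but
    # the engine can answer it by walking the index.
    __table_args__ = (Index("ix_employees_rank", "rank"),)

engine = create_engine("sqlite://")
Base.metadata.create_all(engine)

with Session(engine) as session:
    session.add_all([Employee(name="b", rank=2), Employee(name="a", rank=1)])
    session.commit()
    names = [e.name for e in session.query(Employee).order_by(Employee.rank)]

print(names)  # ['a', 'b'] -- sorted by rank, not insertion order
```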
You can have only one place where you define it, and re-use that for
every query:
def base_query(session, what_for):
    return session.query(what_for).order_by(what_for.rank_or_whatever)
Expand that as needed, then for all but very complex queries you can use that like so:
some_query = base_query(session(), Employee).filter(Employee.feet > 3)
The resulting query will be ordered by Employee.rank_or_whatever. If you are always querying for the same thing, you won't have to pass it as an argument, of course.
EDIT: If you could somehow define a "permanent" order on your table that the engine observed without an ORDER BY being issued, that would be an implementation feature specific to your RDBMS, and merely a convenience. Internally it makes no sense for a DBMS to be coerced into how it stores the data, since retrieving the data in a specific order is easily and efficiently accomplished by using an INDEX - forcing a specific storage order would probably decrease overall performance.

SQLAlchemy: is it possible to do a select between a certain section of records

I'm not looking for the count or filter expressions, I just want to select from the 5th record to the 10th record, if that makes sense.
I'm working with a very large table but only in small sections at a time, currently each time I need a section I query my entire table and choose my section from the result. But is there a faster way to do it, for example only selecting the records between index 5 and index 10 (My table is indexed by the way)?
Looking at the documentation, it looks as if you could use the slice filter, or use limit and offset.
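A minimal sketch of both spellings against an in-memory SQLite database (the Record model is invented); slice(5, 10) and offset(5).limit(5) return the same rows:

```python
from sqlalchemy import Column, Integer, create_engine
from sqlalchemy.orm import Session, declarative_base

Base = declarative_base()

class Record(Base):
    __tablename__ = "records"
    id = Column(Integer, primary_key=True)

engine = create_engine("sqlite://")
Base.metadata.create_all(engine)

with Session(engine) as session:
    session.add_all([Record() for _ in range(20)])
    session.commit()

    q = session.query(Record).order_by(Record.id)
    # slice(start, stop) issues LIMIT/OFFSET under the hood...
    a = [r.id for r in q.slice(5, 10)]
    # ...so it is equivalent to an explicit offset/limit pair:
    b = [r.id for r in q.offset(5).limit(5)]

print(a)  # [6, 7, 8, 9, 10] -- SQLite autoincrement ids start at 1
assert a == b
```

Either way the database only sends back the five rows you asked for, rather than the whole table.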

"Nested" queries in SQL / SQLAlchemy

I'm using SQLAlchemy (being relatively new both to it and SQL) and I want to get a list of all comments posted to a set of things, but I'm only interested in comments that have been posted since a certain date, and the date is different for each thing:
To clarify, here's what I'm doing now: I begin with a dictionary that maps the ID code of each thing I'm interested in to the date I'm interested in for that thing. I do a quick list comprehension to get a list of just the codes (thingCodes) and then do this query:
things = meta.Session.query(Thing)\
    .filter(Thing.objType.in_(['fooType', 'barType']))\
    .filter(Thing.data.any(and_(Data.key == 'thingCode', Data.value.in_(thingCodes))))\
    .all()
which returns a list of the thing objects (I do need those in addition to the comments). I then iterate through this list, and for each thing do a separate query:
comms = meta.Session.query(Thing) \
    .filter_by(objType='comment').filter(Thing.data.any(wc('thingCode', code))) \
    .filter(Thing.date >= date) \
    .order_by('-date').all()
This works, but it seems horribly inefficient to be doing all these queries separately. So, I have two questions:
a) Rather than running the second query n times for an n-length list of things, is there a way I could do it in a single query while still returning a separate set of results for each ID (presumably in the form of a dictionary of ID's to lists)? I suppose I could do a value_in(listOfIds) to get a single list of all the comments I want and then iterate through that and build the dictionary manually, but I have a feeling there's a way to use JOINs for this.
b) Am I over-optimizing here? Would I be better off with the second approach I just mentioned? And is it even that important that I roll them all into a single transactions? The bulk of my experience is with Neo4j, which is pretty good at transparently nesting many small transactions into larger ones - does SQL/SQLAlchemy have similar functionality, or is it definitely in my interest to minimize the number of queries?
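For (a), the fallback you describe - one IN-based query followed by building the dictionary yourself - is only a few lines. A sketch in plain Python (the row shape and the dateCutoffs dictionary are assumptions for illustration, not taken from your models):

```python
from collections import defaultdict
from datetime import date

# Hypothetical cutoff dates per thing code (the question's dictionary).
dateCutoffs = {"T1": date(2012, 1, 1), "T2": date(2012, 6, 1)}

# Pretend these rows came back from a single IN-based query over all codes,
# each row being (thing_code, date_posted, text).
rows = [
    ("T1", date(2012, 3, 1), "early comment on T1"),
    ("T1", date(2011, 12, 1), "too old for T1"),
    ("T2", date(2012, 7, 1), "recent comment on T2"),
]

comments_by_code = defaultdict(list)
for code, posted, text in rows:
    # Apply each thing's own cutoff while grouping -- this replaces the
    # per-thing date filter from the n separate queries.
    if posted >= dateCutoffs[code]:
        comments_by_code[code].append(text)

print(dict(comments_by_code))
# {'T1': ['early comment on T1'], 'T2': ['recent comment on T2']}
```

The per-thing date cutoffs are the awkward part for a pure-SQL JOIN, since each ID has a different threshold; filtering while grouping in Python keeps the database round trips down to one.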

Get the latest entries django

I currently have a function that writes one to four entries into a database every 12 hours. When certain conditions are met, the function is called again to write another 1-4 entries based on the previous ones. Since time isn't the only factor, I have to check whether the conditions are met, and because the entries are all in the same database I have to differentiate them based on the time they were posted (there is a DateTimeField in the code).
How could I achieve this? Is there a function built into Django that I just couldn't find? Or would I have to resort to a rather complicated solution?
As a sketch, I'd expect something like this:
latest = []
allData = myManyToManyField.objects.get(externalId=2)
for data in allData:
    if data.Timestamp.checkIfLatest():  # checkIfLatest returns True/False
        latest.append(data)
or even better, something like this (although I don't think that's implemented):
latest = myManyToManyField.objects.get.latest.filter(externalId=2)
The django documentation is very very good, especially with regards to querysets and model layer functions. It's usually the first place you should look. It sounds like you want .latest(), but it's hard to tell with your requirements regarding conditions.
latest_entry = m2mfield.objects.latest('mydatefield')
if latest_entry.somefield:
    # do something
Or perhaps you wanted:
latest_entry = m2mfield.objects.filter(somefield=True).latest('mydatefield')
You might also be interested in order_by(), which will order the rows according to a field you specify. You could then iterate on all the m2m fields until you find the one that matches a condition.
But without more information on what these conditions are, it's hard to be more specific.
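For what it's worth, the checkIfLatest idea in the question's sketch reduces to taking the maximum of the date field, which outside the ORM is just max() with a key (field names here are invented):

```python
from datetime import datetime

# Stand-ins for rows sharing externalId=2, each with a timestamp.
entries = [
    {"externalId": 2, "timestamp": datetime(2013, 1, 1, 8, 0)},
    {"externalId": 2, "timestamp": datetime(2013, 1, 1, 20, 0)},
    {"externalId": 2, "timestamp": datetime(2012, 12, 31, 20, 0)},
]

# .latest('mydatefield') is essentially this: pick the row with the
# greatest value of the date field.
latest_entry = max(entries, key=lambda e: e["timestamp"])
print(latest_entry["timestamp"])  # 2013-01-01 20:00:00
```

The difference is that Django pushes this to the database as ORDER BY ... LIMIT 1 instead of loading every row, which matters once the table grows.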
It's just a thought, but you could keep an epoch time (the current time of the entry) field in the database as a primary key and compare it with the previous entries to differentiate them.

Storing and reloading large multidimensional data sets in Python

I'm going to be running a large number of simulations producing a large amount of data that needs to be stored and accessed again later. Output data from my simulation program is written to text files (one per simulation). I plan on writing a Python program that reads these text files and then stores the data in a format more convenient for analyzing later. After quite a bit of searching, I think I'm suffering from information overload, so I'm putting this question to Stack Overflow for some advice. Here are the details:
My data will basically take the form of a multidimensional array where each entry will look something like this:
data[ stringArg1, stringArg2, stringArg3, stringArg4, intArg1 ] = [ floatResult01, floatResult02, ..., floatResult12 ]
Each argument has roughly the following numbers of potential values:
stringArg1: 50
stringArg2: 20
stringArg3: 6
stringArg4: 24
intArg1: 10,000
Note, however, that the data set will be sparse. For example, for a given value of stringArg1, only about 16 values of stringArg2 will be filled in. Also, for a given combination of (stringArg1, stringArg2) roughly 5000 values of intArg1 will be filled in. The 3rd and 4th string arguments are always completely filled.
So, with these numbers my array will have roughly 50*16*6*24*5000 = 576,000,000 result lists.
I'm looking for the best way to store this array such that I can save it and reopen it later to either add more data, update existing data, or query existing data for analysis. Thus far I've looked into three different approaches:
a relational database
PyTables
Python dictionary that uses tuples as the dictionary keys (using pickle to save & reload)
There's one issue I run into in all three approaches, I always end up storing every tuple combination of (stringArg1, stringArg2, stringArg3, stringArg4, intArg1), either as a field in a table, or as the keys in the Python dictionary. From my (possibly naive) point of view, it seems like this shouldn't be necessary. If these were all integer arguments then they would just form the address of each data entry in the array, and there wouldn't be any need to store all the potential address combinations in a separate field. For example, if I had a 2x2 array = [[100, 200] , [300, 400]] you would retrieve values by asking for the value at an address array[0][1]. You wouldn't need to store all the possible address tuples (0,0) (0,1) (1,0) (1,1) somewhere else. So I'm hoping to find a way around this.
What I would love to be able to do is define a table in PyTables, where cells in this first table contain other tables. For example, the top-level tables would have two columns. Entries in the first column would be the possible values of stringArg1. Each entry in the second column would be a table. These sub-tables would then have two columns, the first being all the possible values of stringArg2, the second being another column of sub-sub-tables...
That kind of solution would be straightforward to browse and query (particularly if I could use ViTables to browse the data). The problem is PyTables doesn't seem to support having the cells of one table contain other tables. So I seem to have hit a dead end there.
I've been reading up on data warehousing and the star schema approach, but it still seems like your fact table would need to contain tuples of every possible argument combination.
Okay, so that's pretty much where I am. Any and all advice would be very much appreciated. At this point I've been searching around so much that my brain hurts. I figure it's time to ask the experts.
Why not use a big table to keep all 500 million entries? If you use on-the-fly compression (the Blosc compressor is recommended here), most of the duplicated entries will be deduplicated, so the storage overhead is kept to a minimum. I'd recommend giving this a try; sometimes the simple solution works best ;-)
Is there a reason the basic 6 table approach doesn't apply?
i.e. tables 1-5 would be single-column tables defining the valid values for each of the fields, and the final table would be a 5-column table defining the entries that actually exist.
Alternatively, if every value always exists for the 3rd and 4th string values as you describe, the 6th table could just consist of 3 columns (string1, string2, int1) and you generate the combinations with string3 and string4 dynamically via a Cartesian join.
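A minimal sketch of that layout in SQLite, with two of the five dimension tables shown (all names invented); the fact table stores only the combinations that actually exist, so the sparsity costs nothing:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Dimension tables: one row per valid value (tables for stringArg3,
    -- stringArg4 and intArg1 would follow the same pattern).
    CREATE TABLE string_arg1 (id INTEGER PRIMARY KEY, value TEXT UNIQUE);
    CREATE TABLE string_arg2 (id INTEGER PRIMARY KEY, value TEXT UNIQUE);

    -- Fact table: only existing argument combinations get a row
    -- (result02..result12 omitted here for brevity).
    CREATE TABLE facts (
        arg1_id INTEGER REFERENCES string_arg1(id),
        arg2_id INTEGER REFERENCES string_arg2(id),
        int_arg1 INTEGER,
        result01 REAL,
        PRIMARY KEY (arg1_id, arg2_id, int_arg1)
    );
""")

conn.execute("INSERT INTO string_arg1 (value) VALUES ('alpha')")
conn.execute("INSERT INTO string_arg2 (value) VALUES ('beta')")
conn.execute("INSERT INTO facts VALUES (1, 1, 42, 3.14)")

# Lookup by argument values via joins back to the dimension tables.
row = conn.execute("""
    SELECT f.result01 FROM facts f
    JOIN string_arg1 a1 ON a1.id = f.arg1_id
    JOIN string_arg2 a2 ON a2.id = f.arg2_id
    WHERE a1.value = 'alpha' AND a2.value = 'beta' AND f.int_arg1 = 42
""").fetchone()
print(row)  # (3.14,)
```

Note the composite primary key plays the role of the array address you describe: the argument tuple is the key, but only occupied cells are materialized.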
I'm not entirely sure what you're trying to do here, but it looks like you're trying to create a (potentially) sparse multidimensional array. So I won't go into details for solving your specific problem, but the best package I know of that deals with this is NumPy. NumPy can
be used as an efficient multi-dimensional container of generic data. Arbitrary data-types can be defined. This allows NumPy to seamlessly and speedily integrate with a wide variety of databases.
I've used Numpy many times for simulation data processing and it provides many useful tools including easy file storage/access.
Hopefully you'll find something in its very easy-to-read documentation:
Numpy Documentation with Examples
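As a sketch of the storage side (the dtype and field names are invented), a NumPy structured array holds the arguments and the 12 result floats side by side, and np.save/np.load handle persistence in one line each:

```python
import os
import tempfile

import numpy as np

# A structured dtype mirrors "arguments -> 12 floats"; only combinations
# that actually exist become rows, so the sparsity costs nothing.
dt = np.dtype([
    ("arg1", "U10"), ("arg2", "U10"),
    ("int_arg1", "i4"), ("results", "f8", (12,)),
])
data = np.zeros(2, dtype=dt)
data[0] = ("alpha", "beta", 42, tuple(range(12)))
data[1] = ("alpha", "beta", 43, tuple(range(12, 24)))

path = os.path.join(tempfile.gettempdir(), "sim_data.npy")
np.save(path, data)      # one-line save...
loaded = np.load(path)   # ...and one-line reload

# Boolean masks give array-style lookup by argument values.
mask = (loaded["arg1"] == "alpha") & (loaded["int_arg1"] == 42)
row = loaded["results"][mask][0]
print(row[:3])  # [0. 1. 2.]
```

This trades the hierarchical browsing you wanted from PyTables for flat, vectorized queries, but for 576 million rows you would likely want NumPy on top of PyTables/HDF5 rather than a single .npy file.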