Creating a nested dictionary comprehension in Python 2.7 - python

I have a nested tuple returned from a MySQL cursor.fetchall() containing some results in the form (datetime.date, float). I need to separate these out into a nested dictionary keyed as [month/year][day of month] - so I would like a dictionary (say) readings which I could reference like readings['12/2011'][13] to get the reading for the 13th day of the month '12/2011'. This is with a view to producing graphs showing the daily readings for multiple months overlaid.
My difficulty is that (I believe) I need to set up the first dimension of the dictionary with the unique month/year identifiers. I am currently getting a list of these via:
list(set(["%02d/%04d" % (z[0].month, z[0].year) for z in raw]))
where raw is a list of tuples returned from the database.
Now I can easily do this as a two-stage process - set up the first dimension of the dictionary, then go through the data once more to set up the second. I wondered, though, if there is a readable way to do both steps at once, possibly with nested dictionary/list comprehensions.
I'd be grateful for any advice. Thank you.

It seems difficult to do both levels in one concise one-liner; I would suggest using a defaultdict instead, like this:
from collections import defaultdict

res = defaultdict(dict)
for z in raw:
    # z[0] is the datetime.date, z[1] is the float reading
    res["%02d/%04d" % (z[0].month, z[0].year)][z[0].day] = z[1]

Related

Efficient Data Structure to Save on Disk

I am working with many datasets that are of the structure Key|Date|Value.
The Key values can be variable-length strings or integers. The Value can be any data type. The dates can be non-contiguous. An example set might be:
ABC|12-Dec-2021|1.0
DE|21-Dec-2022|5.0
HIJGSDFSDF|13-Dec-2021|1.0
ABC|15-Dec-2021|5.0
In general there can be ~5000 dates and ~20000 identifiers per dataset. I am trying to store this on disk so that it can be loaded into NumPy arrays in Python efficiently. The modes of access could be:
1. Return all Keys, Dates and Values from a file
2. Return all Dates and Values for a given list of Keys
3. Return all Values for an input list of Keys and Dates (maintaining the order of the inputs). The date lookup can be fuzzy, with lookback and tolerance - e.g. return the most recent value within 10 days
The focus is on fast read speed - writing can be slower.
My idea so far is to lay the file out like:
a) Header information including data types etc
b) Array of Unique Keys, and Offset into the file for Data
c) At each offset, Store (Date, Value) pairs sorted in date order
All reads would be based on a memory map of the file. The three reads would then look like:
1. Read all keys from b), calculate the size of the required arrays from the offsets and data sizes, then allocate the three arrays for Key/Date/Value and iterate through the file, copying across into each array
2. Same as 1, but filter the array of keys based on the input
3. First sort the Key and Date arrays, then iterate through them: for each Key, move to its offset and perform a binary search for each date to get the value. Once this is complete, perform another sort to take the results back to the original order (a sketch of this read is given below).
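A minimal sketch of read type 3 under that layout, assuming the (Date, Value) block for a single key has already been located via its offset and is exposed as sorted NumPy arrays (dates encoded as int64 day numbers); np.searchsorted provides the binary search, and the 10-day lookback is applied afterwards:
import numpy as np

def lookup_with_lookback(block_dates, block_values, query_dates, max_lookback=10):
    # block_dates: sorted int64 day numbers for one key (e.g. read via np.memmap)
    # block_values: values aligned with block_dates
    # query_dates: int64 day numbers to look up, in caller order
    idx = np.searchsorted(block_dates, query_dates, side='right') - 1  # most recent date <= query, or -1
    out = np.full(len(query_dates), np.nan)
    hit = idx >= 0
    within = np.zeros_like(hit)
    within[hit] = (query_dates[hit] - block_dates[idx[hit]]) <= max_lookback
    out[within] = block_values[idx[within]]
    return out

# example: dates encoded as days since some epoch
dates  = np.array([100, 103, 110], dtype=np.int64)
values = np.array([1.0, 5.0, 2.5])
print(lookup_with_lookback(dates, values, np.array([103, 109, 125], dtype=np.int64)))
# -> [ 5.   5.  nan]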
I am wondering if there are better data structures or approaches to this problem.
Edit: I have considered a database, e.g. SQLite, but I do not believe it would be performant for read type 3. E.g. if my input key array was (a, b, a, a, b, b) and my date array was (11-Nov, 11-Nov, 13-Nov, 12-Nov, 12-Nov, 15-Nov), the SQL query would need to build a WHERE clause for each unique key/date pair, extract the rows, then sort them back into the input order.
In addition, the lookback would require even more complexity: if there was no (a, 11-Nov) pair but there was an (a, 5-Nov) pair, the latter should be returned.
I'm no expert, but I use Parquet to improve on-disk storage and read times.
https://www.rstudio.com/blog/speed-up-data-analytics-with-parquet-files/
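As a rough illustration (not taken from the linked post), a Key|Date|Value table can be written to and read back from Parquet with pandas; this sketch assumes pyarrow is installed and uses a hypothetical file name:
import pandas as pd

df = pd.DataFrame({
    'Key':   ['ABC', 'DE', 'HIJGSDFSDF', 'ABC'],
    'Date':  pd.to_datetime(['2021-12-12', '2022-12-21', '2021-12-13', '2021-12-15']),
    'Value': [1.0, 5.0, 1.0, 5.0],
})

# write once (slower), read many times (fast, column-oriented, compressed)
df.to_parquet('dataset.parquet', index=False)

# read type 1: everything
all_rows = pd.read_parquet('dataset.parquet')

# read type 2: only some keys and columns (filter pushdown needs a recent pyarrow)
subset = pd.read_parquet('dataset.parquet',
                         columns=['Key', 'Date', 'Value'],
                         filters=[('Key', 'in', ['ABC', 'DE'])])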

What's the fastest way to do these tasks?

I originally have some time series data, which looks like this, and I have to do the following:
1. First import it as a dataframe
2. Set the date column as a datetime index
3. Add some indicators, such as a moving average, as new columns
4. Do some rounding (on the values of whole columns)
5. Shift a column one row up or down (just to manipulate the data)
6. Then convert the df to a list (because I need to loop over it based on some conditions, and that is a lot faster than looping over a df - I need speed)
7. But now I want to convert the df to a dict instead of a list, because I want to keep the column names; it's more convenient
But I have found that converting to a dict takes a lot longer than converting to a list, even when I do it manually instead of using the built-in method.
My question is: is there a better way to do this? Maybe not importing as a dataframe in the first place, while still being able to do points 2 to 5? At the end I need a dict that lets me do the loop and keeps the column names as keys. Thanks.
P.S. The dict should look something like this: the format is similar to the df, where each row is basically the date with the corresponding data.
On item #7: if you want to convert to a dictionary, you can use df.to_dict().
On item #6: you don't need to convert the df to a list or loop over it; there are better options - look for the second answer (it says DON'T).
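A small sketch of both points with a hypothetical two-column dataframe: to_dict('index') keys each row by its datetime index, and itertuples is usually the fastest way to loop while keeping column names, if looping is really needed:
import pandas as pd

df = pd.DataFrame(
    {'close': [10.0, 10.5], 'ma_3': [9.8, 10.1]},
    index=pd.to_datetime(['2022-01-03', '2022-01-04']),
)

# item 7: one inner dict per row, keyed by the datetime index
as_dict = df.to_dict('index')
# {Timestamp('2022-01-03'): {'close': 10.0, 'ma_3': 9.8}, ...}

# item 6: if you must loop, itertuples keeps the column names and is fast
for row in df.itertuples():
    if row.close > row.ma_3:
        print(row.Index, 'above moving average')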

python - extracting values from dictionary in order

I have a dictionary with some share related data:
share_data = {'2016-06-13': {'open': 2190, 'close': 2200}, '2015-09-10': {'open': 2870, 'close': 2450}}  # and so on, circa 1,500 entries
Is there a way of iterating over the dictionary in date order, so the oldest date is retrieved first, then the next oldest, and so on?
Thanks!
Sure - since your date strings are in YYYY-MM-DD format, their lexicographical order matches chronological order, so it is very easy:
for key in sorted(share_data.keys()):
    # do something with share_data[key]
This post has some nice examples of custom sorting on dictionaries.
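If the keys were ever stored in a format that does not sort chronologically as text (say '13/12/2016'), the same loop can sort on parsed dates instead; this is a sketch assuming a hypothetical '%d/%m/%Y' key format:
from datetime import datetime

for key in sorted(share_data, key=lambda d: datetime.strptime(d, '%d/%m/%Y')):
    print(key, share_data[key]['open'], share_data[key]['close'])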

Bulk MongoDB / Pymongo Insert with Datetime

I have code that looks like the following, which executes every minute:
huge_list = query_results() # Returns a long list of dictionaries.
db.objects.insert(huge_list)
I need the current datetime appended to each object in the list before insertion. Is there some way I can modify the insert command so it also appends a 'datetime' field, and if not, what would be the most efficient way of doing this? There are several thousand records in the response list, so I feel that visiting each element of the list and appending a field may not be the most efficient method.
Later I will need to be able to query for records individually from the whole group, and also for records within a specific datetime range.
Thanks in advance!
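A minimal sketch of the straightforward approach, reusing db.objects and query_results from the question (start and end below are hypothetical datetime bounds): compute a single timestamp once, set it on each dict, and bulk-insert the list with insert_many (the PyMongo 3+ name for bulk insert):
from datetime import datetime

huge_list = query_results()   # returns a long list of dictionaries, as above

now = datetime.utcnow()       # one timestamp for the whole batch
for doc in huge_list:
    doc['datetime'] = now

db.objects.insert_many(huge_list)

# later: everything in a given datetime range (start/end are hypothetical bounds)
in_range = db.objects.find({'datetime': {'$gte': start, '$lt': end}})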

How to store numerical lookup table in Python (with labels)

I have a scientific model which I am running in Python which produces a lookup table as output. That is, it produces a many-dimensional 'table' where each dimension is a parameter in the model and the value in each cell is the output of the model.
My question is how best to store this lookup table in Python. I am running the model in a loop over every possible parameter combination (using the fantastic itertools.product function), but I can't work out how best to store the outputs.
It would seem sensible to simply store the output as an ndarray, but I'd really like to be able to access the outputs based on the parameter values, not just the indices. For example, rather than accessing the values as table[16][5][17][14] I'd prefer to access them somehow using variable names/values, for example:
table[solar_z=45, solar_a=170, type=17, reflectance=0.37]
or something similar to that. It'd be brilliant if I were able to iterate over the values and get their parameter values back - that is, being able to find out that table[16]... corresponds to the outputs for solar_z = 45.
Is there a sensible way to do this in Python?
Why don't you use a database? I have found MongoDB (and the official Python driver, Pymongo) to be a wonderful tool for scientific computing. Here are some advantages:
Easy to install - simply download the executables for your platform (2 minutes tops, seriously).
Schema-less data model
Blazing fast
Provides map/reduce functionality
Very good querying functionalities
So, you could store each entry as a MongoDB entry, for example:
{"_id":"run_unique_identifier",
"param1":"val1",
"param2":"val2" # etcetera
}
Then you could query the entries as you will:
import pymongo

# Connection was the old client class; MongoClient is the current equivalent
data = pymongo.MongoClient("localhost", 27017)["mydb"]["mycollection"]
for entry in data.find():        # this will return all documents
    print(entry["param1"])       # do something with param1
Whether or not MongoDB/pymongo are the answer to your specific question, I don't know. However, you could really benefit from checking them out if you are into data-intensive scientific computing.
If you want to access the results by name, you could use a Python nested dictionary instead of an ndarray, and serialize it to a JSON text file using the json module.
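A minimal sketch of that idea, with hypothetical parameter values; note that JSON object keys are strings, so numeric parameters need converting on the way in and out:
import json

# nested dict keyed by parameter values (as strings, since JSON keys are strings)
table = {'45': {'170': {'17': {'0.37': 0.0123}}}}   # solar_z -> solar_a -> type -> reflectance

with open('lookup_table.json', 'w') as fh:
    json.dump(table, fh)

with open('lookup_table.json') as fh:
    table = json.load(fh)

print(table['45']['170']['17']['0.37'])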
One option is to use a numpy ndarray for the data (as you do now), and write a parser function to convert the query values into row/column indices.
For example:
# dictionaries mapping parameter values to axis indices, e.g. {45: 16, ...}
solar_z_dict = {...}
solar_a_dict = {...}
type_dict = {...}
reflectance_dict = {...}

def lookup(dataArray, solar_z, solar_a, type, reflectance):
    return dataArray[solar_z_dict[solar_z],
                     solar_a_dict[solar_a],
                     type_dict[type],
                     reflectance_dict[reflectance]]
You could also build the index expression as a string and eval it, if you want to allow some of the fields to be given as None and translated to ":" (to return the full table along that axis).
"For example, rather than accessing the values as table[16][5][17][14] I'd prefer to access them somehow using variable names/values"
That's what numpy's dtypes are for:
import numpy as np
from sys import argv

dt = [('L', 'float64'), ('T', 'float64'), ('NMSF', 'float64'), ('err', 'float64')]
data = np.loadtxt(argv[1], dtype=dt)
Now you can access the data columns by name: data['L'], data['T'], data['NMSF'], data['err'].
More info on dtypes:
http://docs.scipy.org/doc/numpy/reference/generated/numpy.dtype.html
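A small sketch of how that looks for the question's parameters (hypothetical column names and values): with a structured array, a labelled lookup becomes a boolean mask over the named columns:
import numpy as np

dt = [('solar_z', 'float64'), ('solar_a', 'float64'),
      ('type', 'int32'), ('reflectance', 'float64'), ('output', 'float64')]

# one row per parameter combination, e.g. filled inside the itertools.product loop
table = np.array([(45.0, 170.0, 17, 0.37, 0.0123),
                  (45.0, 170.0, 17, 0.40, 0.0150)], dtype=dt)

mask = ((table['solar_z'] == 45) & (table['solar_a'] == 170) &
        (table['type'] == 17) & (table['reflectance'] == 0.37))
print(table['output'][mask])   # -> [0.0123]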
