Efficient Data Structure to Save on Disk - python

I am working with many datasets that are of the structure Key|Date|Value.
The Key values can be variable-length strings or integers. The value can be any data type. The dates can be non-continuous. An example set might be:
ABC|12-Dec-2021|1.0
DE|21-Dec-2022|5.0
HIJGSDFSDF|13-Dec-2021|1.0
ABC|15-Dec-2021|5.0
In general there can be ~5000 dates and ~20000 identifiers for each dataset. I am trying to store this on disk so that it can be loaded into NumPy arrays in Python efficiently. The modes of access could be:
Return all Key, Dates and Values from a file
Return all Dates and Values for a given list of Keys
Return all Values, for an input list of Keys and Dates (maintaining order of inputs). The date lookup can be fuzzy, with lookback and tolerance - e.g. return the most recent value within 10 days
The focus is on fast read speed - writing can be slower.
My idea so far is to lay the file out like:
a) Header information including data types etc
b) Array of Unique Keys, and Offset into the file for Data
c) At each offset, Store (Date, Value) pairs sorted in date order
All reads would be based on a memory map of the file. The three reads would then look like:
Read all keys from b), calculate size of required array from offsets and data size, then allocate the three arrays for Key/Date/Value and iterate through the file and copy across to each array
Same as 1, but filter the array of keys based on input
First sort the Key and Date arrays, then iterate through and, for each Key, move to the offset and perform a binary search for each date to get the value. Once this is complete, perform another sort to take the results back to the original order. A rough sketch of this read is below.
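To make read 3 concrete, here is a minimal sketch of what I have in mind, assuming each key's (date, value) block is already exposed as a structured array sorted by date (e.g. sliced out of an np.memmap at the stored offsets); the names and date encoding are only indicative, and results are written back by original index, so the final re-sort is not needed:

import numpy as np

def lookup(blocks, keys, dates, lookback_days=10):
    # blocks[key] is a structured array with fields 'date' (integer day ordinal)
    # and 'value', sorted by 'date'; keys and dates are the query arrays, in input order
    out = np.full(len(keys), np.nan)
    order = np.argsort(keys, kind="stable")  # group queries by key for locality
    for i in order:
        block = blocks[keys[i]]
        # rightmost entry with date <= requested date
        j = np.searchsorted(block["date"], dates[i], side="right") - 1
        if j >= 0 and dates[i] - block["date"][j] <= lookback_days:
            out[i] = block["value"][j]
    return out  # written by original index, so no re-sort is needed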
I am wondering if there are better data structures or approaches to this problem.
Edit: I have considered a database, e.g. SQLite, however I do not believe this is performant for read type 3. E.g. if my input key array was (a, b, a, a, b, b) and my date array was (11-Nov, 11-Nov, 13-Nov, 12-Nov, 12-Nov, 15-Nov), the SQL query would need to build a where clause for each unique key/date pair, extract the results, then sort them again.
In addition, the lookback would add even more complexity: if there was no (a, 11-Nov) pair but there was an (a, 5-Nov) pair, that should be returned.

I'm no expert, but I use Parquet to improve on-disk storage and read times.
https://www.rstudio.com/blog/speed-up-data-analytics-with-parquet-files/
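For example, a minimal sketch using pandas with the pyarrow engine (the file and column names are made up, and pyarrow must be installed):

import pandas as pd

# write the Key|Date|Value table once (writing can be slow)
df = pd.DataFrame({
    "Key": ["ABC", "DE", "HIJGSDFSDF", "ABC"],
    "Date": pd.to_datetime(["2021-12-12", "2022-12-21", "2021-12-13", "2021-12-15"]),
    "Value": [1.0, 5.0, 1.0, 5.0],
})
df.to_parquet("data.parquet", engine="pyarrow", index=False)

# read everything back, or push a predicate down to read only some keys
all_rows = pd.read_parquet("data.parquet", engine="pyarrow")
abc_de = pd.read_parquet("data.parquet", engine="pyarrow",
                         filters=[("Key", "in", ["ABC", "DE"])])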

Related

Applying python dictionary to column of a CSV file

I have a CSV file that includes one column of data that is not user-friendly. I need to translate that data into something that makes sense. Simple find/replace seems bulky, since there are dozens if not hundreds of different possible combinations I want to translate.
For instance: BLK = Black or MNT TP = Mountain Top
There are dozens if not hundreds of translations possible - I have lots of them already in a CSV table. The problem is how to use that dictionary to change the values in another CSV table. It is also important to note that this will (eventually) need to run on its own every few minutes - not just a one time translation.
It would be nice if you could describe in more detail what data you're working with. I'll make my best guess, though.
Let's say you have a CSV file, you use pandas to read it into a data frame named df, and the "not user friendly" column is named col.
To replace all the value in column col, first, you need a dictionary containing all the keys (original texts) and values (new texts):
my_dict = {"BLK": "Black", "MNT TP": Mountain Top,...}
Then, map the dictionary to the column:
df["col"] = df["col"].map(lambda x: my_dict.get(x, x))
If a value appears as a key in the dictionary, it is replaced by the corresponding new value; otherwise, it keeps the original value.
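Putting it together, a minimal sketch of the whole flow (the file names and the lookup-table column names "code" and "meaning" are assumptions):

import pandas as pd

# build the translation dictionary from the lookup CSV (columns assumed: code, meaning)
lookup = pd.read_csv("translations.csv")
my_dict = dict(zip(lookup["code"], lookup["meaning"]))

# apply it to the target file and write the result back out
df = pd.read_csv("data.csv")
df["col"] = df["col"].map(lambda x: my_dict.get(x, x))
df.to_csv("data_translated.csv", index=False)

Since this eventually needs to run on its own every few minutes, the same script can simply be scheduled (e.g. with cron or Task Scheduler).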

How to compute multiple counts with different conditions on a pyspark DataFrame, fast?

Let's say I have this pyspark Dataframe:
data = spark.createDataFrame(schema=['Country'], data=[('AT',), ('BE',), ('France',), ('Latvia',)])
And let's say I want to collect various statistics about this data. For example, I might want to know how many rows use a 2-character country code and how many use longer country names:
count_short = data.where(F.length(F.col('Country')) == 2).count()
count_long = data.where(F.length(F.col('Country')) > 2).count()
This works, but when I want to collect many different counts based on different conditions, it becomes very slow even for tiny datasets. In Azure Synapse Studio, where I am working, every count takes 1-2 seconds to compute.
I need to do 100+ counts, and it takes multiple minutes to compute for a dataset of 10 rows. And before somebody asks, the conditions for those counts are more complex than in my example. I cannot group by length or do other tricks like that.
I am looking for a general way to do multiple counts on arbitrary conditions, fast.
I am guessing that the reason for the slow performance is that for every count call, my pyspark notebook starts some Spark processes that have significant overhead. So I assume that if there was some way to collect these counts in a single query, my performance problems would be solved.
One possible solution I thought of is to build a temporary column that indicates which of my conditions have been matched, and then call countDistinct on it. But then I would have individual counts for all combinations of condition matches. I also noticed that depending on the situation, the performance is a bit better when I do data = data.localCheckpoint() before computing my statistics, but the general problem still persists.
Is there a better way?
Function "count" can be replaced by "sum" with condition (Scala):
data.select(
    sum(
        when(length(col("Country")) === 2, 1).otherwise(0)
    ).alias("two_characters"),
    sum(
        when(length(col("Country")) > 2, 1).otherwise(0)
    ).alias("more_than_two_characters")
)
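Since the question is about pyspark, the same idea in Python would look roughly like this (a sketch, assuming data is the dataframe from the question):

from pyspark.sql import functions as F

row = data.select(
    F.sum(F.when(F.length(F.col("Country")) == 2, 1).otherwise(0)).alias("two_characters"),
    F.sum(F.when(F.length(F.col("Country")) > 2, 1).otherwise(0)).alias("more_than_two_characters"),
).collect()[0]
# all counts come back in a single Spark job,
# e.g. row["two_characters"] and row["more_than_two_characters"]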
While one way is to combine multiple queries into one, another way is to cache the dataframe that is being queried again and again.
By caching the dataframe, we avoid re-evaluating it each time count() is invoked.
data.cache()
A few things to keep in mind: if you are applying multiple actions to your dataframe, there are a lot of transformations, and you are reading the data from an external source, then you should definitely cache the dataframe before applying any action to it.
The answer provided by @pasha701 works, but you will have to keep adding columns for each country-code length you want to analyse.
You can use the code below to get the counts of the different country-code lengths all in one single dataframe.
# import statements
from pyspark.sql.functions import *

# sample dataframe
data = spark.createDataFrame(schema=['Country'], data=[('AT',), ('ACE',), ('BE',), ('France',), ('Latvia',)])

# add a column that gives the length of each country code
data1 = data.withColumn("CountryLength", length(col('Country')))

# column names for the final output
outputcolumns = ["CountryLength", "RecordsCount"]

# select the CountryLength column, convert to an RDD and map/reduce
# to count the occurrences of each length
countrieslength = (data1.select("CountryLength")
                        .rdd.map(lambda word: (word, 1))
                        .reduceByKey(lambda a, b: a + b)
                        .toDF(outputcolumns)
                        .select("CountryLength.CountryLength", "RecordsCount"))

# display or show the dataframe to see the output
display(countrieslength)
The output is one row per CountryLength with its RecordsCount.
If you want to apply multiple filter conditions to this dataframe, you can cache it and get the counts for different combinations of records based on the country-code length.
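For reference, the same per-length counts can likely be obtained without dropping to the RDD API, e.g. with a plain groupBy (a sketch, reusing the data dataframe from above):

from pyspark.sql.functions import col, length

counts_by_length = (data.withColumn("CountryLength", length(col("Country")))
                        .groupBy("CountryLength")
                        .count())
counts_by_length.show()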

What's the fastest way to do these tasks?

I originally have some time series data, which looks like this, and I have to do the following:
1. First import it as a dataframe
2. Set the date column as the datetime index
3. Add some indicators, such as moving averages, as new columns
4. Do some rounding (on the values of whole columns)
5. Shift a column one row up or down (just to manipulate the data)
6. Then convert the df to a list (because I need to loop over it based on some conditions, and that's a lot faster than looping over a df; I need speed)
7. But now I want to convert the df to a dict instead of a list, because I want to keep the column names; it's more convenient
However, I found out that converting to a dict takes a lot longer than converting to a list, even when I do it manually instead of using the built-in method.
My question is: is there a better way to do it? Maybe not import it as a dataframe in the first place, and still be able to do points 2 to 5? At the end I need to convert to a dict that lets me do the loop and keeps the column names as keys. Thanks.
P.S. The dict should look something like this: the format is similar to the df; each row is basically the date with the corresponding data.
On item #7: If you want to convert to a dictionary, you can use df.to_dict()
On item #6: You don't need to convert the df to a list or loop over it: Here are better options. Look for the second answer (it says DON'T)
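For example, a minimal sketch (assuming df is the dataframe from steps 1 to 5 with the date as its index; the column name used below is illustrative):

# one inner dict per row, keyed by the date index, with column names preserved as keys
records = df.to_dict(orient="index")
# e.g. records[some_date]["moving_average"] for an illustrative column name

# if looping is the real goal, itertuples keeps the column names as attributes
# and avoids building a big dict up front
for row in df.itertuples():
    date = row.Index  # the datetime index
    # access columns as attributes, e.g. row.moving_average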

Is it possible to read field names from a compound Dataset in an HDF5 file in Python?

I have an HDF5 file that contains a 2D table with column names. It shows up as such in HDFView when I look at this object, called results.
It turns out that results is a "compound Dataset", a one-dimensional array where each element is a row. Here are its properties as displayed by HDFView:
I can get a handle of this object, let's call it res.
The column names are V2pt, R2pt, etc.
I can read the entire array as data, and I can read one element with
res[0,...,"V2pt"].
This will return the number in the first row of column V2pt. Replacing 0 with 1 will return the second row value, etc.
That works if I know the column name a priori. But I don't.
I simply want to get the whole Dataset and its column names. How can I do that?
I see that there is a get_field_info function in the HDF5 documentation, but I find no such function in h5py.
Am I screwed?
Even better would be a solution to read this table as a pandas DataFrame...
This is pretty easy to do in h5py and works just like compound types in Numpy.
If res is a handle to your dataset, res.dtype.fields.keys() will return a list of all the field names.
If you need to know a specific dtype, something like res.dtype.fields['V2pt'] will give it.
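A minimal sketch, assuming the file is named results.h5 and the dataset path is /results (both names are assumptions):

import h5py
import pandas as pd

with h5py.File("results.h5", "r") as f:
    res = f["results"]                            # the compound dataset
    field_names = list(res.dtype.fields.keys())   # column names, e.g. ['V2pt', 'R2pt', ...]
    data = res[()]                                # whole table as a NumPy structured array

# each field becomes a DataFrame column
df = pd.DataFrame({name: data[name] for name in field_names})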

Creating a nested dictionary comprehension in Python 2.7

I have a nested tuple returned from a MySQL cursor.fetchall() containing some results in the form (datetime.date, float). I need to separate these out into a nested dictionary of the form [month/year][day of month], so I would like to have a dictionary (say) readings which I would reference like readings['12/2011'][13] to get the reading for the 13th day of the month '12/2011'. This is with a view to producing graphs showing the daily readings for multiple months overlaid.
My difficulty is that (I believe) I need to set up the first dimension of the dictionary with the unique month/year identifiers. I am currently getting a list of these via:
list(set(["%02d/%04d" % (z[0].month, z[0].year) for z in raw]))
where raw is a list of tuples returned from the database.
Now I can easily do this as a two-stage process: set up the first dimension of the dictionary, then go through the data once more to set up the second. I wondered, though, if there is a readable way to do both steps at once, possibly with nested dictionary/list comprehensions.
I'd be grateful for any advice. Thank you.
It seems difficult to do both levels in a concise one-liner; I propose instead using defaultdict, like this:
from collections import defaultdict

res = defaultdict(dict)
for z in raw:
    # z is a (datetime.date, float) tuple; use z[1] here if you only want the float reading
    res["%02d/%04d" % (z[0].month, z[0].year)][z[0].day] = z
