WeatherLink File (WLK) to CSV with Python - python

I really need your help because I have been trying to solve this for some time now and cannot get it done!
We have a Davis weather station with the WeatherLink software. It downloads all weather data from the station and saves it to a yyyy-mm.wlk file.
I would like to visualize the data in PowerBI, but the .wlk files appear to be stored in a binary format.
So I started writing Python to convert them to a CSV file somehow. I even found a Java-based solution here: Java Program WLK Reader
I tried to port that code to Python, but nothing worked. Does anyone have an idea how I can convert these files?
So far I have tried this:
# from PyByteBuffer import ByteBuffer  # https://pypi.org/project/PyByteBuffer/
import struct

with open(r"D:\Temp\WeatherStation\2013-08.wlk", "rb") as in_file:
    data = in_file.read()

# content = data.decode('ansi').splitlines()
content = data.decode('ansi', 'slashescape')  # note: 'slashescape' is not a built-in error handler; it must be registered first
# offset = 0
# content = struct.unpack_from("<L", data, offset)

with open(r"D:\Temp\WeatherStation\Output.txt", "w") as text_file:
    for line in content:
        text_file.write(line)

# decoded = data.decode('ansi', 'slashescape')
# offset = 0
# content = struct.unpack_from("<d", data, offset)
# data.decode('ansi', 'slashescape')
I'm completely lost with this, have no idea what else to try, and hope that maybe someone of you can help me...
As soon as I get it done I'll post the code here so everybody can use it within PowerBI to visualize the data more nicely than in WeatherLink :-)
You can find example files right here: Example Files
Thank you very much!
Victor
Here is the detailed file description from the readme, as Ronald mentioned:
Data File Structure
What follows is a technical description of the .WLK weather database
files. This is of interest mostly to programmers who want to write
their own programs to read the data files.
The data filename has the following format: YYYY-MM.wlk where YYYY is
the four digit year and MM is the two digit month of the data
contained in the file.
The structures defined below assume that no bytes are added to the
structures to make the fields fall on the "correct" address boundaries.
With the Microsoft C++ compiler, you can use the directive "#pragma
pack (1)" to enforce this and use "#pragma pack ()" to return the
compiler to its default behavior.
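For Python, the struct module behaves the same way when you use the "<" prefix, which selects little-endian byte order, standard type sizes and no padding, matching "#pragma pack (1)" and the 4-byte long used in these files. A tiny illustration:

import struct

print(struct.calcsize("<hl"))  # 6 : short (2) + long (4), no padding
print(struct.calcsize("@hl"))  # 8 or 16 depending on platform, due to native sizes and alignment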
// Data is stored in monthly files. Each file has the following header.

struct DayIndex
{
   short recordsInDay;    // includes any daily summary records
   long  startPos;        // The index (starting at 0) of the first daily summary record
};

// Header for each monthly file.
// The first 16 bytes are used to identify a weather database file and to identify
// different file formats. (Used for converting older database files.)

class HeaderBlock
{
   char     idCode [16];    // = {'W', 'D', 'A', 'T', '5', '.', '0', 0, 0, 0, 0, 0, 0, 0, 5, 0}
   long     totalRecords;
   DayIndex dayIndex [32];  // index records for each day. Index 0 is not used
                            // (i.e. the 1'st is at index 1, not index 0)
};
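For a Python reading, here is a minimal sketch of unpacking that header with the struct module (untested against real files; the path is just an example):

import struct

HEADER_SIZE = 16 + 4 + 32 * 6   # idCode + totalRecords + 32 DayIndex entries = 212 bytes

with open(r"D:\Temp\WeatherStation\2013-08.wlk", "rb") as f:
    data = f.read()

id_code, total_records = struct.unpack_from("<16sl", data, 0)
print(id_code, total_records)   # id_code should start with b'WDAT5.0'

day_index = []
for day in range(32):           # index 0 is unused; day 1 is at index 1
    records_in_day, start_pos = struct.unpack_from("<hl", data, 20 + day * 6)
    day_index.append((records_in_day, start_pos))

# The 88-byte data records start right after the header:
first_record_offset = HEADER_SIZE   # 212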
// After the Header are a series of 88 byte data records with one of the following
// formats. Note that each day will begin with 2 daily summary records

// Daily Summary Record 1
struct DailySummary1
{
   BYTE  dataType = 2;
   BYTE  reserved;               // this will cause the rest of the fields to start on an even address

   short dataSpan;               // total # of minutes accounted for by physical records for this day
   short hiOutTemp, lowOutTemp;  // tenths of a degree F
   short hiInTemp, lowInTemp;    // tenths of a degree F
   short avgOutTemp, avgInTemp;  // tenths of a degree F (integrated over the day)
   short hiChill, lowChill;      // tenths of a degree F
   short hiDew, lowDew;          // tenths of a degree F
   short avgChill, avgDew;       // tenths of a degree F
   short hiOutHum, lowOutHum;    // tenths of a percent
   short hiInHum, lowInHum;      // tenths of a percent
   short avgOutHum;              // tenths of a percent
   short hiBar, lowBar;          // thousandths of an inch Hg
   short avgBar;                 // thousandths of an inch Hg
   short hiSpeed, avgSpeed;      // tenths of an MPH
   short dailyWindRunTotal;      // 1/10'th of an mile
   short hi10MinSpeed;           // the highest average wind speed record
   BYTE  dirHiSpeed, hi10MinDir; // direction code (0-15, 255)
   short dailyRainTotal;         // 1/1000'th of an inch
   short hiRainRate;             // 1/100'th inch/hr ???
   short dailyUVDose;            // 1/10'th of a standard MED
   BYTE  hiUV;                   // tenth of a UV Index
   BYTE  timeValues[27];         // space for 18 time values (see below)
};
// Daily Summary Record 2
struct DailySummary2
{
   BYTE  dataType = 3;
   BYTE  reserved;               // this will cause the rest of the fields to start on an even address

   // this field is not used now.
   unsigned short todaysWeather; // bitmapped weather conditions (Fog, T-Storm, hurricane, etc)

   short numWindPackets;         // # of valid packets containing wind data,
                                 // this is used to indicate reception quality
   short hiSolar;                // Watts per meter squared
   short dailySolarEnergy;       // 1/10'th Ly
   short minSunlight;            // number of accumulated minutes where the avg solar rad > 150
   short dailyETTotal;           // 1/1000'th of an inch
   short hiHeat, lowHeat;        // tenths of a degree F
   short avgHeat;                // tenths of a degree F
   short hiTHSW, lowTHSW;        // tenths of a degree F
   short hiTHW, lowTHW;          // tenths of a degree F
   short integratedHeatDD65;     // integrated Heating Degree Days (65F threshold)
                                 // tenths of a degree F - Day

   // Wet bulb values are not calculated
   short hiWetBulb, lowWetBulb;  // tenths of a degree F
   short avgWetBulb;             // tenths of a degree F

   BYTE  dirBins[24];            // space for 16 direction bins
                                 // (Used to calculate monthly dominant Dir)
   BYTE  timeValues[15];         // space for 10 time values (see below)

   short integratedCoolDD65;     // integrated Cooling Degree Days (65F threshold)
                                 // tenths of a degree F - Day
   BYTE  reserved2[11];
};
// standard archive record
struct WeatherDataRecord
{
   BYTE  dataType = 1;
   BYTE  archiveInterval;        // number of minutes in the archive
   // (see below for more details about these next two fields)
   BYTE  iconFlags;              // Icon associated with this record, plus Edit flags
   BYTE  moreFlags;              // Tx Id, etc.

   short packedTime;             // minutes past midnight of the end of the archive period
   short outsideTemp;            // tenths of a degree F
   short hiOutsideTemp;          // tenths of a degree F
   short lowOutsideTemp;         // tenths of a degree F
   short insideTemp;             // tenths of a degree F
   short barometer;              // thousandths of an inch Hg
   short outsideHum;             // tenths of a percent
   short insideHum;              // tenths of a percent
   unsigned short rain;          // number of clicks + rain collector type code
   short hiRainRate;             // clicks per hour
   short windSpeed;              // tenths of an MPH
   short hiWindSpeed;            // tenths of an MPH
   BYTE  windDirection;          // direction code (0-15, 255)
   BYTE  hiWindDirection;        // direction code (0-15, 255)
   short numWindSamples;         // number of valid ISS packets containing wind data
                                 // this is a good indication of reception
   short solarRad, hisolarRad;   // Watts per meter squared
   BYTE  UV, hiUV;               // tenth of a UV Index

   BYTE  leafTemp[4];            // (whole degrees F) + 90

   short extraRad;               // used to calculate extra heating effects of the sun in THSW index

   short newSensors[6];          // reserved for future use
   BYTE  forecast;               // forecast code during the archive interval

   BYTE  ET;                     // in thousandths of an inch

   BYTE  soilTemp[6];            // (whole degrees F) + 90
   BYTE  soilMoisture[6];        // centibars of dryness
   BYTE  leafWetness[4];         // Leaf Wetness code (0-15, 255)
   BYTE  extraTemp[7];           // (whole degrees F) + 90
   BYTE  extraHum[7];            // whole percent
};
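For a Python reading of one archive record, here is a rough, hedged sketch (my own transcription of the field order above; the format string and scaling are assumptions to verify against the example files):

import struct

# 88-byte archive record (dataType == 1), little-endian, no padding
ARCHIVE_FMT = "<BBBB8hH3hBBh2hBB4Bh6hBB6B6B4B7B7B"
assert struct.calcsize(ARCHIVE_FMT) == 88

def parse_archive_record(buf, offset):
    v = struct.unpack_from(ARCHIVE_FMT, buf, offset)
    return {
        "dataType":        v[0],
        "archiveInterval": v[1],           # minutes per record
        "packedTime":      v[4],           # minutes past midnight
        "outsideTemp_F":   v[5] / 10.0,    # stored as tenths of a degree F
        "barometer_inHg":  v[9] / 1000.0,  # thousandths of an inch Hg
        "outsideHum_pct":  v[10] / 10.0,   # tenths of a percent
        "rain_raw":        v[12],          # clicks + collector type code (see notes below)
        "windSpeed_mph":   v[14] / 10.0,   # tenths of an MPH
    }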
Notes:
Always check the dataType field to make sure you are reading the
correct record type
There are extra fields that are not used by the current software. For
example, there is space for 7 extra temperatures and Hums, but current
Vantage stations only log data for 3 extra temps and 2 extra hums.
Extra/Soil/Leaf temperatures are in whole degrees with a 90 degree
offset. A database value of 0 = -90 F, 100 = 10 F, etc.
The rain collector type is encoded in the most significant nibble of
the rain field.
rainCollectorType = (rainCode & 0xF000);
rainClicks = (rainCode & 0x0FFF);
Type rainCollectorType
0.1 inch 0x0000
0.01 inch 0x1000
0.2 mm 0x2000
1.0 mm 0x3000
0.1 mm 0x6000 (not fully supported)
Use the rainCollectorType to interpret the hiRainRate field. For
example, if you have a 0.01 in rain collector, a rain rate value of
19 = 0.19 in/hr = 4.8 mm/hr, but if you have a 0.2 mm rain collector,
a rain rate value of 19 = 3.8 mm/hr = 0.15 in/hr.
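A small Python sketch of that decoding (the per-click values for the metric collectors are my own conversion to inches):

# inches of rain per click, keyed by collector type code
RAIN_PER_CLICK = {
    0x0000: 0.1,
    0x1000: 0.01,
    0x2000: 0.2 / 25.4,   # 0.2 mm
    0x3000: 1.0 / 25.4,   # 1.0 mm
    0x6000: 0.1 / 25.4,   # 0.1 mm (not fully supported)
}

def decode_rain(rain_code):
    collector = rain_code & 0xF000
    clicks = rain_code & 0x0FFF
    return clicks * RAIN_PER_CLICK[collector]   # rain in inches for this record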
Format for the iconFlags field
The lower nibble will hold a value that will represent an Icon to
associate with this data record (i.e. snow, rain, sun, lightning,
etc.). This field is not used.
Bit (0x10) is set if the user has used the edit record function to
change a data value. This allows tracking of edited data.
Bit (0x20) is set if there is a data note associated with the archive
record. If there is, it will be found in a text file named
YYYYMMDDmmmm.NOTE. YYYY is the four digit year, MM is the two digit
month (i.e. Jan = 01), DD is the two digit day, and mmmm is the number
of minutes past midnight (i.e. the packedTime field). This file is
found in the DATANOTE subdirectory of the station directory.
Format for the moreFlags field
The lowest 3 bits contain the transmitter ID that is the source of the
wind speed packets recorded in the numWindSamples field. This value is
between 0 and 7. If your ISS is on ID 1, zero will be stored in this
field.
WindTxID = (moreFlags & 0x07);
Time values and Wind direction values in Daily Summary records
These values are between 0 and 1440 and therefore will fit in 1 1/2
bytes, and 2 values fit in 3 bytes. Use this code to extract the i'th
time or direction value. See below for the list of i values.
fieldIndex = (i/2) * 3;   // note this is integer division (rounded down)
if (i is even) value = field[fieldIndex]   + (field[fieldIndex+2] & 0x0F)<<8;
if (i is odd)  value = field[fieldIndex+1] + (field[fieldIndex+2] & 0xF0)<<4;
A value of 0x0FFF or 0x07FF indicates no data available (i.e. invalid
data)
The time value represents the number of minutes after midnight that
the specified event took place (actually the time of the archive
record).
The wind direction bins represent the number of minutes that that
direction was the dominant wind direction for the day.
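In Python that extraction could look like this (a sketch; parentheses added around the shifts, which is the intended reading of the pseudocode above):

def unpack_packed_value(field, i):
    """Extract the i'th 1.5-byte value from a timeValues/dirBins byte array."""
    field_index = (i // 2) * 3
    if i % 2 == 0:
        value = field[field_index] + ((field[field_index + 2] & 0x0F) << 8)
    else:
        value = field[field_index + 1] + ((field[field_index + 2] & 0xF0) << 4)
    return None if value in (0x0FFF, 0x07FF) else value   # minutes after midnight, or no data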
Index values for Daily Summary Record 1 time values
Time of High Outside Temperature     0
Time of Low Outside Temperature      1
Time of High Inside Temperature      2
Time of Low Inside Temperature       3
Time of High Wind Chill              4
Time of Low Wind Chill               5
Time of High Dew Point               6
Time of Low Dew Point                7
Time of High Outside Humidity        8
Time of Low Outside Humidity         9
Time of High Inside Humidity        10
Time of Low Inside Humidity         11
Time of High Barometer              12
Time of Low Barometer               13
Time of High Wind Speed             14
Time of High Average Wind Speed     15
Time of High Rain Rate              16
Time of High UV                     17
Index values for Daily Summary Record 2 time values
Time of High Solar Rad               0
Time of High Outside Heat Index      1
Time of Low Outside Heat Index       2
Time of High Outside THSW Index      3
Time of Low Outside THSW Index       4
Time of High Outside THW Index       5
Time of Low Outside THW Index        6
Time of High Outside Wet Bulb Temp   7
Time of Low Outside Wet Bulb Temp    8
(Time value 9 is not used)

Index values for Dominant Wind direction bins in Daily Summary Record 2
N 0   NNE 1   NE 2   ...   NW 14   NNW 15

Related

Python DOcplex how to get end and start value for interval_var

I am trying to model a scheduling task using IBM's DOcplex Python API. The goal is to optimize EV charging schedules and minimize charging costs. However, I am having problems working with the CPO interval variable.
Charging costs are defined by different price windows, e.g., charging between 00:00 - 06:00 costs 0.10$ per kW while charging between 06:00 - 18:00 costs 0.15$ per kW.
My initial idea was this:
schedule_start = start_of(all_trips[trip_id].interval)
schedule_end = end_of(all_trips[trip_id].interval)

cost_windows = {
    "morning":   {"time": range(0, 44),  "cost": 10},
    "noon":      {"time": range(44, 64), "cost": 15},
    "afternoon": {"time": range(64, 84), "cost": 15},
    "night":     {"time": range(84, 97), "cost": 10}
}

time_low = 0
time_high = 0

for i in range(schedule_start, schedule_end):
    for key in cost_windows.keys():
        if i in cost_windows.get(key).get("time"):
            if cost_windows.get(key).get("cost") == 10:
                time_low += 1
            else:
                time_high += 1

cost_total = ((time_low * 10 * power) + (time_high * 15 * power)) / 400
As seen above, the idea was to loop through the interval start to end (interval size can be a maximum of 96, each unit representing a 15 minute time block) and check in what price window the block is. We later calculate the total cost by multiplying the number of blocks in each window with the power (integer variable) and price.
However, this approach does not work as we cannot use the start_of(interval) like a regular integer. Is there a way to get the start and end values for an interval and use them like regular integers? Or is there another approach that I am missing?
Regards
Have you tried to use overlap_length, as shown in How to initiate the interval variable bounds in docplex (python)?
start_of and end_of do not return plain values; they return expressions whose values are not known until the model is solved.
What you were trying to do is a bit like
using CP;
dvar int l;
dvar interval a in 0..10 size 3;
subject to
{
l==sum(i in 0..10) ((startOf(a)<=i) && (endOf(a)>i));
}
execute
{
writeln("l=",l);
}
in OPL, but that enumerates time, which is not the right way to do it.
Here is a small example with overlapLength and 3 time windows with 3 prices:
using CP;
dvar int l;
tuple pricewindow
{
int s;
int e;
float price;
}
{pricewindow} windows={<0,5,1>,<5,6,0>,<6,10,0.5>};
dvar interval pwit[w in windows] in w.s..w.e size (w.e-w.s);
dvar interval a in 0..10 size 6;
dexpr float cost=sum(w in windows) overlapLength(a,pwit[w])*w.price;
minimize cost;
subject to
{
}
which gives
// solution with objective 3
a = <1 4 10 6>;
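For reference, a rough Python sketch of the same idea with docplex.cp (hedged: the horizon, interval size and power value below are placeholders, and I have not run this against the asker's model):

from docplex.cp.model import CpoModel, interval_var, overlap_length, minimize

mdl = CpoModel()

# charging interval: position is free within the horizon, 24 blocks long (placeholder size)
charge = interval_var(start=(0, 96), end=(0, 96), size=24, name="charge")

# fixed price windows: (start_block, end_block, cost) as in the question
windows = [(0, 44, 10), (44, 64, 15), (64, 84, 15), (84, 96, 10)]
window_vars = [(interval_var(start=s, end=e, size=e - s, name=f"w{i}"), c)
               for i, (s, e, c) in enumerate(windows)]

power = 11  # placeholder charging power
cost = sum(overlap_length(charge, w) * c * power for w, c in window_vars) / 400
mdl.add(minimize(cost))

sol = mdl.solve(TimeLimit=10)
if sol:
    print(sol.get_var_solution("charge"))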

C++ vs. python: daylight saving time not recognized when extracting Unix time?

I'm attempting to calculate the Unix time of a given date and time represented by two integers, e.g.
testdate1 = 20060711 (July 11th, 2006)
testdate2 = 4 (00:00:04, 4 seconds after midnight)
in a timezone other than my local timezone. To calculate the Unix time, I feed testdate1, testdate2 into a function I adapted from Convert date to unix time stamp in c++
int unixtime (int testdate1, int testdate2) {
    time_t rawtime;
    struct tm * timeinfo;

    //time1, ..., time6 are external functions that extract the
    //year, month, day, hour, minute, seconds digits from testdate1, testdate2
    int year   = time1(testdate1);
    int month  = time2(testdate1);
    int day    = time3(testdate1);
    int hour   = time4(testdate2);
    int minute = time5(testdate2);
    int second = time6(testdate2);

    time ( &rawtime );
    timeinfo = localtime ( &rawtime );
    timeinfo->tm_year = year - 1900;
    timeinfo->tm_mon  = month - 1;
    timeinfo->tm_mday = day;
    timeinfo->tm_hour = hour;
    timeinfo->tm_min  = minute;
    timeinfo->tm_sec  = second;

    int date;
    date = mktime(timeinfo);
    return date;
}
Which I call from the main code
using namespace std;

int main(int argc, char* argv[])
{
    int testdate1 = 20060711;
    int testdate2 = 4;

    //switch to CET time zone
    setenv("TZ", "Europe/Berlin", 1);
    tzset();

    cout << testdate1 << "\t" << testdate2 << "\t" << unixtime(testdate1, testdate2) << "\n";
    return 0;
}
With the given example, I get unixtime(testdate1,testdate2) = 1152572404, which according to
https://www.epochconverter.com/timezones?q=1152572404&tz=Europe%2FBerlin
is 1:00:04 am CEST, but I want this to be 0:00:04 CEST.
The code seems to work perfectly well if I choose a testdate1, testdate2 in which daylight saving time (DST) isn't being observed. For example, simply setting the month to February with all else unchanged is accomplished by setting testdate1 = 20060211. This gives
unixtime(testdate1,testdate2) = 1139612404, corresponding to hh:mm:ss = 00:00:04 in CET, as desired.
My impression is that setenv("TZ","Europe/Berlin", 1) is supposed to account for DST when applicable, but perhaps I am mistaken. Can TZ interpret testdate1, testdate2 in such a way that it accounts for DST?
Interestingly, I have a python code that performs the same task by changing the local time via os.environ['TZ'] = 'Europe/Berlin'. Here I have no issues, as it seems to calculate the correct Unix time regardless of DST/non-DST.
localtime sets timeinfo->tm_isdst to that of the current time - not of the date you parse.
Don't call localtime. Set timeinfo->tm_isdst to -1:
The value specified in the tm_isdst field informs mktime() whether or not daylight saving time (DST) is in effect for the time supplied in the tm structure: a positive value means DST is in effect; zero means that DST is not in effect; and a negative value means that mktime() should (use timezone information and system databases to) attempt to determine whether DST is in effect at the specified time.
See the code example in https://en.cppreference.com/w/cpp/chrono/c/mktime
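For comparison, a minimal Python sketch of what the question's working Python version presumably does (the decimal splitting of testdate1/testdate2 mirrors the C++ helpers): time.mktime() takes a 9-tuple whose last element is tm_isdst, and passing -1 lets the library decide whether DST applies.

import os
import time

def unixtime(testdate1, testdate2):
    # split YYYYMMDD and HHMMSS style integers into their parts
    year, month, day = testdate1 // 10000, testdate1 // 100 % 100, testdate1 % 100
    hour, minute, second = testdate2 // 10000, testdate2 // 100 % 100, testdate2 % 100
    # tm_isdst = -1 -> let mktime() determine whether DST is in effect
    return int(time.mktime((year, month, day, hour, minute, second, 0, 0, -1)))

os.environ['TZ'] = 'Europe/Berlin'
time.tzset()
print(unixtime(20060711, 4))   # expected: 1152568804, i.e. 00:00:04 CEST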
Maxim's answer is correct, and I've upvoted it. But I also thought it might be helpful to show how this can be done in C++20 using the newer <chrono> tools. This isn't implemented everywhere yet, but it is here in Visual Studio and will be coming elsewhere soon.
There are two main points I'd like to illustrate here:
<chrono> is convenient for conversions like this, even if neither the input nor the output involves std::chrono types. One can convert the integral input to chrono, do the conversion, and then convert the chrono result back to integral.
There's a thread safety weakness in using the TZ environment variable, as this is a type of global. If another thread is also doing some type of time computation, it may not get the correct answer if the computer's time zone unexpectedly changes out from under it. The <chrono> solution is thread-safe. It doesn't involve globals or environment variables.
The first job is to unpack the integral data. Here I show how to do this, and convert it into chrono types in one step:
std::chrono::year_month_day
get_ymd(int ymd)
{
    using namespace std::chrono;
    day d(ymd % 100);
    ymd /= 100;
    month m(ymd % 100);
    ymd /= 100;
    year y{ymd};
    return y/m/d;
}
get_ymd takes "testdate1", extracts the individual integral fields for day, month, and year, then converts each integral field into the std::chrono types day, month and year, and finally combines these three separate fields into a std::chrono::year_month_day to return it as one value. This return type is simply a {year, month, day} data structure -- like a tuple but with calendrical meaning.
The / syntax is simply a convenient factory function for constructing a year_month_day. And this construction can be done with any of these three orderings: y/m/d, d/m/y and m/d/y. This syntax, when combined with auto, also means that you often don't have to spell out the verbose name year_month_day:
auto
get_ymd(int ymd)
{
    // ...
    return y/m/d;
}
get_hms unpacks the hour, minute and second fields and returns that as a std::chrono::seconds:
std::chrono::seconds
get_hms(int hms)
{
    using namespace std::chrono;
    seconds s{hms % 100};
    hms /= 100;
    minutes m{hms % 100};
    hms /= 100;
    hours h{hms};
    return h + m + s;
}
The code is very similar to that for get_ymd except that the return is the sum of the hours, minutes and seconds. The chrono library does the job for you of converting hours and minutes to seconds while performing the summation.
Next is the function for doing the conversion, and returning the result back as an int.
int
unixtime(int testdate1, int testdate2)
{
    using namespace std::chrono;
    auto ymd = get_ymd(testdate1);
    auto hms = get_hms(testdate2);
    auto ut = locate_zone("Europe/Berlin")->to_sys(local_days{ymd} + hms);
    return ut.time_since_epoch().count();
}
std::chrono::locate_zone is called to get a pointer to the std::chrono::time_zone with the name "Europe/Berlin". The std::lib manages the lifetime of this object, so you don't have to worry about it. It is a const singleton, created on demand. And it has no impact on what time zone your computer considers its "local time zone".
The std::chrono::time_zone has a member function called to_sys that takes a local_time, and converts it to a sys_time, using the proper UTC offset for this time zone (taking into account daylight saving rules when applicable).
Both local_time and sys_time are std::chrono::time_point types. local_time is "some local time", not necessarily your computer's local time. You can associate a local time with a time zone in order to specify the locality of that time.
sys_time is a time_point based on system_clock. This tracks UTC (Unix time).
The expression local_days{ymd} + hms converts ymd and hms to local_time with a precision of seconds. local_days is just another local_time time_point, but with a precision of days.
The type of ut is time_point<system_clock, seconds>, which has a convenience type alias called sys_seconds, though auto makes that name unnecessary in this code.
To unpack the sys_seconds into an integral type, the .time_since_epoch() member function is called which results in the duration seconds, and then the .count() member function is called to extract the integral value from that duration.
When int is 32 bits, this function is susceptible to the year 2038 overflow problem. To fix that, simply change the return type of unixtime to return a 64 bit integral type (or make the return auto). Nothing else needs to change as std::chrono::seconds is already required to be greater than 32 bits and will not overflow at 68 years. Indeed std::chrono::seconds is usually represented by a signed 64 bit integral type in practice, giving it a range greater than the age of the universe (even if the scientists are off by an order of magnitude).

Decoding the received data from ATM90E32AS energy meter IC

I wrote Python code (on a Raspberry Pi) to read voltage, current and power values from the ATM90E32AS energy meter IC. I am using the spidev library for SPI communication with the energy meter IC. I initialized two bytearrays (each 4 bytes wide) for reading from and writing to the energy meter IC like this:
writeBuffer = bytearray([0x00, 0x00, 0x00, 0x00])
readBuffer = bytearray([0x00, 0x00, 0x00, 0x00])
For example, to read the R phase voltage I initialized the register value like this:
VrRead_Reg = bytearray([0x80, 0xD9])
And I write that value to the IC using the following subroutine to read the R phase voltage:
def Vr_read():
    writeBuffer[0] = VrRead_Reg[0]
    writeBuffer[1] = VrRead_Reg[1]
    #print(writeBuffer)
    readBuffer = spi.xfer(writeBuffer)
    print("Vr:", readBuffer)
    time.sleep(0.5)
And I am getting output like this:
Vr: [255,255,89,64]
Vr: [255,255,89,170]
Vr: [255,255,89,220]
Vr: [255,255,89,1]
Vr: [255,255,89,10]
I measured the mains voltage and it shows 230V. Then I tried to match the above output with the measured voltage. Here the third byte, 89, corresponds to 230V. Then I used a variac to change the voltage: for 220V the third byte becomes 85, for 210V it is 81, for 100V it was 39, and so on.
I don't know how to relate 89 to 230V, or what the other bytes mean. Please help me decode the above output.
Do some ratio calculation:
(consider the max value of a byte is 255)
255 / 89 * 230 = 658.99 (approximately 660V)
85 / 255 * 660 = 220(220V)
81 / 255 * 660 = 209.65(210V)
39 / 255 * 660 = 100.94(100V)
But you should consult the device manual to confirm the actual reference.
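A hedged sketch of that ratio approach in Python (the ~660 V full-scale value is inferred from the numbers above, not from the datasheet; check the ATM90E32AS register documentation before relying on it):

FULL_SCALE_V = 660  # inferred: byte value 255 would correspond to roughly 660 V in this setup

def third_byte_to_volts(read_buffer):
    # read_buffer is the 4-byte SPI response, e.g. [255, 255, 89, 64]
    return read_buffer[2] / 255 * FULL_SCALE_V

print(third_byte_to_volts([255, 255, 89, 64]))   # ~230 V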

How to generate a time-ordered uid in Python?

Is this possible? I've heard Cassandra has something similar: https://datastax.github.io/python-driver/api/cassandra/util.html
I have been using an ISO timestamp concatenated with a uuid4, but that ended up way too large (58 characters) and probably overkill.
Keeping a sequential number doesn't work in my context (DynamoDB NoSQL)
Worth noting that for my application it doesn't matter if items created in a batch/the same second end up in a random order, as long as the uids don't collide.
I have no specific restriction on maximum length, ideally I would like to see the different collision chance for different lengths, but it needs to be smaller than 58 (my original attempt)
This is to use with DynamoDB(NoSQL Database) as Sort-key
Why uuid.uuid1 is not sequential
uuid.uuid1(node=None, clock_seq=None) is effectively:
60 bits of timestamp (representing number of 100-ns intervals after 1582-10-15 00:00:00)
14 bits of "clock sequence"
48 bits of "Node info" (generated from network card's mac-address or from hostname or from RNG).
If you don't provide any arguments, then the system function is called to generate the uuid. In that case:
It's unclear if "clock sequence" is sequential or random.
It's unclear if it's safe to be used in multiple processes (can clock_seq be repeated in different processes or not?). In Python 3.7 this info is now available.
If you provide clock_seq or node, then the "pure Python implementation" is used. In this case, even with a "fixed value" for clock_seq:
timestamp part is guaranteed to be sequential for all the calls in current process even in threaded execution.
clock_seq part is randomly generated. But that is not critical anymore because the timestamp is sequential and unique.
It's NOT safe for multiple processes (processes that call uuid1 with the same clock_seq, node might return conflicting values if called during the "same 100-ns time interval")
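As a quick check of the layout described above, uuid.UUID exposes those fields as properties:

from uuid import uuid1

u = uuid1()
print(u.time)        # 60-bit count of 100-ns intervals since 1582-10-15
print(u.clock_seq)   # 14-bit clock sequence
print(u.node)        # 48-bit node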
Solution that reuses uuid.uuid1
It's easy to see that you can make uuid1 sequential by providing clock_seq or node arguments (to use the pure Python implementation).
import time
from random import getrandbits
from uuid import uuid1, getnode

_my_clock_seq = getrandbits(14)
_my_node = getnode()

def sequential_uuid(node=None):
    return uuid1(node=node, clock_seq=_my_clock_seq)
    # .hex attribute of this value is 32-characters long string

def alt_sequential_uuid(clock_seq=None):
    return uuid1(node=_my_node, clock_seq=clock_seq)
if __name__ == '__main__':
    from itertools import count

    old_n = uuid1()            # "Native"
    old_s = sequential_uuid()  # Sequential

    native_conflict_index = None
    t_0 = time.time()
    for x in count():
        new_n = uuid1()
        new_s = sequential_uuid()

        if old_n > new_n and not native_conflict_index:
            native_conflict_index = x

        if old_s >= new_s:
            print("OOops: non-sequential results for `sequential_uuid()`")
            break

        if (x >= 10*0x3fff and time.time() - t_0 > 30) or (native_conflict_index and x > 2*native_conflict_index):
            print('No issues for `sequential_uuid()`')
            break

        old_n = new_n
        old_s = new_s

    print(f'Conflicts for `uuid.uuid1()`: {bool(native_conflict_index)}')
Multiple processes issues
BUT if you are running some parallel processes on the same machine, then:
node which defaults to uuid.get_node() will be the same for all the processes;
clock_seq has small chance to be the same for some processes (chance of 1/16384)
That might lead to conflicts! That is a general concern when using
uuid.uuid1 in parallel processes on the same machine, unless you have access to SafeUUID from Python 3.7.
If you make sure to also set node to a unique value for each parallel process that runs this code, then conflicts should not happen.
Even if you are using SafeUUID, and set a unique node, it's still possible to have non-sequential (but unique) ids if they are generated in different processes.
If some lock-related overhead is acceptable, then you can store clock_seq in some external atomic storage (for example in a "locked" file) and increment it with each call: this allows you to keep the same node value on all parallel processes and also makes the ids sequential. For cases when all parallel processes are subprocesses created using multiprocessing, clock_seq can be "shared" using multiprocessing.Value (see the sketch below).
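A rough sketch of that multiprocessing.Value idea (my own illustration, assuming the Value is created before the workers are forked):

import multiprocessing as mp
from uuid import uuid1, getnode

_node = getnode()
_shared_clock_seq = mp.Value('i', 0)   # shared 14-bit counter across worker processes

def shared_sequential_uuid():
    with _shared_clock_seq.get_lock():
        _shared_clock_seq.value = (_shared_clock_seq.value + 1) & 0x3FFF
        seq = _shared_clock_seq.value
    return uuid1(node=_node, clock_seq=seq)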
As a result you always have to remember:
If you are running multiple processes on the same machine, then you must:
Ensure uniqueness of node. The problem with this approach: you can't be sure to have sequential ids from different processes generated during the same 100-ns interval. But it is a very "light" operation, executed once on process startup, and achieved by "adding" something to the default node, e.g. int(time.time()*1e9) - 0x118494406d1cc000, or by adding some counter from a machine-level atomic db.
Ensure a "machine-level atomic clock_seq" and the same node for all processes on one machine. That way you'll have some overhead for "locking" clock_seq, but ids are guaranteed to be sequential even if generated in different processes during the same 100-ns interval (unless you are calling uuid from several threads in the same process).
For processes on different machines:
either you have to use some "global counter service";
or it's not possible to have sequential ids generated on different machines during the same 100-ns interval.
Reducing size of the id
The general approach to generating UUIDs is quite simple, so it's easy to implement something similar from scratch and, for example, use fewer bits for the node_info part:
import time
from random import getrandbits

_my_clock_seq = getrandbits(14)
_last_timestamp_part = 0
_used_clock_seq = 0

timestamp_multiplier = 1e7  # I'd recommend to use this value

# Next values are enough up to year 2116:
if timestamp_multiplier == 1e9:
    time_bits = 62  # Up to year 2116, also reduces chances for non-sequential id-s generated in different processes
elif timestamp_multiplier == 1e8:
    time_bits = 60  # up to year 2335
elif timestamp_multiplier == 1e7:
    time_bits = 56  # Up to year 2198.
else:
    raise ValueError('Please calculate and set time_bits')

time_mask = 2**time_bits - 1

seq_bits = 16
seq_mask = 2**seq_bits - 1

node_bits = 12
node_mask = 2**node_bits - 1

max_hex_len = len(hex(2**(node_bits+seq_bits+time_bits) - 1)) - 2  # 21

_default_node_number = getrandbits(node_bits)  # or `uuid.getnode() & node_mask`

def sequential_uuid(node_number=None):
    """Return 21-characters long hex string that is sequential and unique for each call in current process.

    Results from different processes may "overlap" but are guaranteed to
    be unique if `node_number` is different in each process.
    """
    global _my_clock_seq
    global _last_timestamp_part
    global _used_clock_seq

    if node_number is None:
        node_number = _default_node_number
    if not 0 <= node_number <= node_mask:
        raise ValueError("Node number out of range")

    timestamp_part = int(time.time() * timestamp_multiplier) & time_mask
    _my_clock_seq = (_my_clock_seq + 1) & seq_mask

    if _last_timestamp_part >= timestamp_part:
        timestamp_part = _last_timestamp_part
        if _used_clock_seq == _my_clock_seq:
            timestamp_part = (timestamp_part + 1) & time_mask
    else:
        _used_clock_seq = _my_clock_seq

    _last_timestamp_part = timestamp_part

    return hex(
        (timestamp_part << (node_bits+seq_bits))
        |
        (_my_clock_seq << (node_bits))
        |
        node_number
    )[2:]
Notes:
Maybe it's better to simply store integer value (not hex-string) in the database
If you are storing it as text/char, then it's better to convert the integer to a base64 string instead of converting it to a hex string. That way it will be shorter (21-char hex string → 16-char b64-encoded string):
from base64 import b64encode

total_bits = time_bits + seq_bits + node_bits
total_bytes = total_bits // 8 + 1 * bool(total_bits % 8)

def int_to_b64(int_value):
    return b64encode(int_value.to_bytes(total_bytes, 'big'))
Collision chances
Single process: collisions not possible
Multiple processes with manually set unique clock_seq or unique node in each process: collisions not possible
Multiple processes with randomly set node (48-bits, "fixed" in time):
Chance to have the node collision in several processes:
in 2 processes out of 10000: ~0.000018%
in 2 processes out of 100000: 0.0018%
Chance to have single collision of the id per second in 2 processes with the "colliding" node:
for "timestamp" interval of 100-ns (default for uuid.uuid1 , and in my code when timestamp_multiplier == 1e7): proportional to 3.72e-19 * avg_call_frequency²
for "timestamp" interval of 10-ns (timestamp_multiplier == 1e8): proportional to 3.72e-21 * avg_call_frequency²
In the article you linked to, cassandra.util.uuid_from_time(time_arg, node=None, clock_seq=None) seems to be exactly what you're looking for.
def uuid_from_time(time_arg, node=None, clock_seq=None):
    """
    Converts a datetime or timestamp to a type 1 :class:`uuid.UUID`.

    :param time_arg:
      The time to use for the timestamp portion of the UUID.
      This can either be a :class:`datetime` object or a timestamp
      in seconds (as returned from :meth:`time.time()`).
    :type datetime: :class:`datetime` or timestamp

    :param node:
      None integer for the UUID (up to 48 bits). If not specified, this
      field is randomized.
    :type node: long

    :param clock_seq:
      Clock sequence field for the UUID (up to 14 bits). If not specified,
      a random sequence is generated.
    :type clock_seq: int

    :rtype: :class:`uuid.UUID`
    """
    if hasattr(time_arg, 'utctimetuple'):
        seconds = int(calendar.timegm(time_arg.utctimetuple()))
        microseconds = (seconds * 1e6) + time_arg.time().microsecond
    else:
        microseconds = int(time_arg * 1e6)

    # 0x01b21dd213814000 is the number of 100-ns intervals between the
    # UUID epoch 1582-10-15 00:00:00 and the Unix epoch 1970-01-01 00:00:00.
    intervals = int(microseconds * 10) + 0x01b21dd213814000

    time_low = intervals & 0xffffffff
    time_mid = (intervals >> 32) & 0xffff
    time_hi_version = (intervals >> 48) & 0x0fff

    if clock_seq is None:
        clock_seq = random.getrandbits(14)
    else:
        if clock_seq > 0x3fff:
            raise ValueError('clock_seq is out of range (need a 14-bit value)')

    clock_seq_low = clock_seq & 0xff
    clock_seq_hi_variant = 0x80 | ((clock_seq >> 8) & 0x3f)

    if node is None:
        node = random.getrandbits(48)

    return uuid.UUID(fields=(time_low, time_mid, time_hi_version,
                             clock_seq_hi_variant, clock_seq_low, node), version=1)
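For example (assuming the cassandra-driver package is installed, since that is where this helper lives):

import time
from cassandra.util import uuid_from_time

u = uuid_from_time(time.time())
print(u)   # a version-1 UUID whose timestamp field is "now"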
There's nothing Cassandra specific to a Type 1 UUID...
You should be able to encode a timestamp precise to the second for a time range of 135 years in 32 bits. That will only take 8 characters to represent in hex. Added to the hex representation of the uuid (32 hex characters) that will amount to only 40 hex characters.
Encoding the time stamp that way requires that you pick a base year (e.g. 2000) and compute the number of days up to the current date (time stamp). Multiply this number of days by 86400, then add the seconds since midnight. This will give you values that are less than 2^32 until you reach year 2135.
Note that you have to keep leading zeroes in the hex encoded form of the timestamp prefix in order for alphanumeric sorting to preserve the chronology.
With a few bits more in the time stamp, you could increase the time range and/or the precision. With 8 more bits (two hex characters), you could go up to 270 years with a precision to the hundredth of a second.
Note that you don't have to model the fraction of seconds in a base 10 range. You will get optimal bit usage by breaking it down in 128ths instead of 100ths for the same number of characters. With the doubling of the year range, this still fits within 8 bits (2 hex characters)
The collision probability, within the time precision (i.e. per second or per 100th or 128th of a second) is driven by the range of the uuid so it will be 1 in 2^128 for the chosen precision. Increasing the precision of the time stamp has the most impact on reducing the collision chances. It is also the factor that has the lowest impact on total size of the key.
More efficient character encoding: 27 to 29 character keys
You could significantly reduce the size of the key by encoding it in base 64 instead of 16 which would give you 27 to 29 characters (depending on you choice of precision)
Note that, for the timestamp part, you need to use an encoding function that takes an integer as input and that preserves the collating sequence of digit characters.
For example:
def encode64(number, size):
    chars = "+-0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz"
    result = list()
    for _ in range(size):
        result.append(chars[number % 64])
        number //= 64
    return "".join(reversed(result))

a = encode64(1234567890, 6)   # '-7ZU9G'
b = encode64(9876543210, 6)   # '7Ag-Pe'
print(a < b)                  # True

u = encode64(int(uuid.uuid4()), 22)   # '1QA2LtMg30ztnugxaokVMk'
key = a + u                           # '-7ZU9G1QA2LtMg30ztnugxaokVMk' (28 characters)
You can save some more characters by combining the time stamp and uuid into a single number before encoding instead of concatenating the two encoded values.
The encode64() function needs one character every 6 bits.
So, for 135 years with precision to the second: (32+128)/6 = 26.7 --> 27 characters
instead of (32/6 = 5.3 --> 6) + (128/6 = 21.3 --> 22) ==> 28 characters
uid = uuid.uuid4()
timeStamp = daysSince2000 * 86400 + int(secondsSinceMidnight)
key = encode64( timeStamp<<128 | int(uid) ,27)
with a 270 year span and 128th of a second precision: (40+128)/6 = 28 characters
uid = uuid.uuid4()
timeStamp = daysSince2000 * 86400 + int(secondsSinceMidnight)
precision = 128
timeStamp = timeStamp * precision + int(fractionOfSecond * precision)
key = encode64( timeStamp<<128 | int(uid) ,28)
With 29 characters you can raise precision to 1024th of a second and year range to 2160 years.
UUID masking: 17 to 19 characters keys
To be even more efficient, you could strip out the first 64 bits of the uuid (which is already a time stamp) and combine it with your own time stamp. This would give you keys with a length of 17 to 19 characters with practically no loss of collision avoidance (depending on your choice of precision).
mask = (1<<64)-1
key = encode64( timeStamp<<64 | (int(uid) & mask) ,19)
Integer/Numeric keys ?
As a final note, if your database supports very large integers or numeric fields (140 bits or more) as keys, you don't have to convert the combined number to a string. Just use it directly as the key. The numerical sequence of timeStamp<<128 | int(uid) will respect the chronology.
The uuid6 module (pip install uuid6) solves the problem. It aims at implementing the corresponding draft for a new uuid variant standard, see here.
Example code:
import time
import uuid6

for i in range(30):
    u = uuid6.uuid7()
    print(u)
    time.sleep(0.1)
The package suggests to use uuid6.uuid7():
Implementations SHOULD utilize UUID version 7 over UUID version 1 and
6 if possible.
UUID version 7 features a time-ordered value field derived from the
widely implemented and well known Unix Epoch timestamp source, the
number of milliseconds since midnight 1 Jan 1970 UTC, leap
seconds excluded. As well as improved entropy characteristics over
versions 1 or 6.
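A quick sanity check that these sort by creation time (e.g. when stored as the 32-character .hex string for a DynamoDB sort key):

import time
import uuid6

ids = []
for _ in range(5):
    ids.append(uuid6.uuid7().hex)   # 32-character hex string with a time-ordered prefix
    time.sleep(0.01)

assert ids == sorted(ids)   # later ids sort after earlier ones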

NetCDF file - why is file 1/3 size after fixing record dimension?

I am struggling to get to grips with this.
I create a netcdf4 file with the following dimensions and variables (note in particular the unlimited point dimension):
dimensions:
    point = UNLIMITED ; // (275935 currently)
    realization = 24 ;
variables:
    short mod_hs(realization, point) ;
        mod_hs:scale_factor = 0.01 ;
    short mod_ws(realization, point) ;
        mod_ws:scale_factor = 0.01 ;
    short obs_hs(point) ;
        obs_hs:scale_factor = 0.01 ;
    short obs_ws(point) ;
        obs_ws:scale_factor = 0.01 ;
    short fchr(point) ;
    float obs_lat(point) ;
    float obs_lon(point) ;
    double obs_datetime(point) ;
}
I have a Python program that populated this file with data in a loop (hence the unlimited record dimension - I don't know a priori how big the file will be).
After populating the file, it is 103MB in size.
My issue is that reading data from this file is quite slow. I guessed that this has something to do with chunking and the unlimited point dimension?
I ran ncks --fix_rec_dmn on the file and (after a lot of churning) it produced a new netCDF file that is only 32MB in size (which is about the right size for the data it contains).
This is a massive difference in size - why is the original file so bloated? Also - accessing the data in this file is orders of magnitude quicker. For example, in Python, to read in the contents of the hs variable takes 2 seconds on the original file and 40 milliseconds on the fixed record dimension file.
The problem I have is that some of my files contain a lot of points and seem to be too big to run ncks on (my machine runs out of memory and I have 8GB), so I can't convert all the data to a fixed record dimension.
Can anyone explain why the file sizes are so different and how I can make the original files smaller and more efficient to read?
By the way - I am not using zlib compression (I have opted for scaling floating point values to an integer short).
Chris
EDIT
My Python code is essentially building up one single timeseries file of collocated model and observation data from multiple individual model forecast files over 3 months. My forecast model runs 4 times a day, and I am aggregating 3 months of data, so that is ~120 files.
The program extracts a subset of the forecast period from each file (e.g. T+24h -> T+48h), so it is not a simple matter of concatenating the files.
This is a rough approximation of what my code is doing (it actually reads/writes more variables, but I am just showing 2 here for clarity):
# Create output file:
dout = nc.Dataset(fn, mode='w', clobber=True, format="NETCDF4")
dout.createDimension('point', size=None)
dout.createDimension('realization', size=24)

for varname in ['mod_hs', 'mod_ws']:
    v = dout.createVariable(varname, np.short,
                            dimensions=('point', 'realization'), zlib=False)
    v.scale_factor = 0.01

# Cycle over dates
date = <some start date>
end_date = <some end date>

# Keep track of record dimension ('point') size:
n = 0
while date < end_date:
    din = nc.Dataset("<path to input file>", mode='r')
    fchr = din.variables['fchr'][:]

    # get mask for specific forecast hour range
    m = np.logical_and(fchr >= 24, fchr < 48)
    sz = np.count_nonzero(m)
    if sz == 0:
        din.close()
        date += dt.timedelta(hours=6)
        continue

    dout.variables['mod_hs'][n:n+sz, :] = din.variables['mod_hs'][:][m, :]
    dout.variables['mod_ws'][n:n+sz, :] = din.variables['mod_wspd'][:][m, :]

    # Increment record dimension count:
    n += sz
    din.close()

    # Go to next file
    date += dt.timedelta(hours=6)

dout.close()
Interestingly, if I make the output file format NETCDF3_CLASSIC rather than NETCDF4, the output is the size that I would expect. NETCDF4 output seems to be bloated.
My experience has been that the default chunksize for record dimensions depends on the version of the netCDF library underneath. For 4.3.3.1, it is 524288. 275935 records is about half a record-chunk. ncks automatically chooses (without telling you) more sensible chunksizes than netCDF defaults, so the output is better optimized. I think this is what is happening. See http://nco.sf.net/nco.html#cnk
Please try to provide code that works without modification if possible; I had to edit yours to get it working, but it wasn't too difficult.
import netCDF4 as nc
import numpy as np

dout = nc.Dataset('testdset.nc4', mode='w', clobber=True, format="NETCDF4")
dout.createDimension('point', size=None)
dout.createDimension('realization', size=24)

for varname in ['mod_hs', 'mod_ws']:
    v = dout.createVariable(varname, np.short,
                            dimensions=('point', 'realization'),
                            zlib=False, chunksizes=[1000, 24])
    v.scale_factor = 0.01

date = 1
end_date = 5000
n = 0
while date < end_date:
    sz = 100
    dout.variables['mod_hs'][n:n+sz, :] = np.ones((sz, 24))
    dout.variables['mod_ws'][n:n+sz, :] = np.ones((sz, 24))
    n += sz
    date += 1

dout.close()
The main difference is in the createVariable command. For file size: without providing "chunksizes" when creating the variable, I also got a file twice as large compared to when I added it. So for file size it should do the trick.
For reading variables from the file, I did not notice any difference, actually; maybe I should add more variables?
Anyway, it should be clear how to set the chunk size now. You probably need to experiment a bit to get a good configuration for your problem. Feel free to ask more if it still does not work for you, and if you want to understand more about chunking, read the HDF5 docs.
I think your problem is that the default chunk size for unlimited dimensions is 1, which creates a huge number of internal HDF5 structures. By setting the chunksize explicitly (obviously ok for unlimited dimensions), the second example does much better in space and time.
Unlimited dimensions require chunking in HDF5/netCDF4, so if you want unlimited dimensions you have to think about chunking performance, as you have discovered.
More here:
https://www.unidata.ucar.edu/software/netcdf/docs/netcdf_perf_chunking.html
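If it helps, a quick way to see what chunking a file actually ended up with (using the netCDF4-python Variable.chunking() method; the filename is just an example):

import netCDF4 as nc

with nc.Dataset('testdset.nc4') as ds:
    for name, var in ds.variables.items():
        print(name, var.chunking())   # 'contiguous' or a list of chunk lengths per dimension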
