Documenting CSV data, using Python

Not sure this is the right place to ask, but it has to be a common problem.
I'm collecting AC voltage (real, apparent, etc.) versus an input parameter to measure the performance of a solar inverter. I read data from a meter, the inverter, and other test equipment and then write the data to a CSV file for later plotting. The details here are not important except to say that part works.
There are various modes available, each with its own values that I want to record with the data. For example, I might set the foo-parameter to 0.5 and then record the complex voltage output versus the DC input. I need to do this for several values of the foo-parameter, and many parameters should be recorded.
I've modified my plot routine to allow #comment lines. This will allow me to record the parameters that were selected for any data set.
Is there a standard way of doing this, or otherwise documenting data collected like this?
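For what it's worth, a minimal sketch of the #comment-line approach described above, writing the selected parameters ahead of the data (the parameter names, column names and file name are just placeholders):
import csv

params = {'foo-parameter': 0.5, 'mode': 'volt-var'}      # hypothetical test settings
with open('inverter_run.csv', 'w', newline='') as f:
    for name, value in params.items():
        f.write(f"# {name} = {value}\n")                 # header comments recording the test conditions
    writer = csv.writer(f)
    writer.writerow(['dc_input', 'v_real', 'v_apparent'])
    writer.writerow([400.0, 229.8, 230.4])               # example data row
On the reading side, pandas.read_csv(..., comment='#') skips those lines, and numpy.genfromtxt has a comments argument that does the same, so the parameter block does not interfere with later plotting.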

Related

MATLAB: Is it possible to extract signals within a dataset into individual variables?

When exporting Simulink Simulation data to .mat files, the data is stored as a Simulink.SimulationData.Dataset class which houses all the recorded signals (of class Simulink.SimulationData.Signal). Is it possible to extract all of the signal value data into new array variables with the same signal names?
For example, DS (1x1 dataset) contains the two signals:
speed (1x1 signal)
command (1x1 signal)
Then I'd like to programmatically create the following variables in my workspace from DS, where each variable contains only its data values:
Speed (100x1 double)
Command (100x1 double)
My initial thought was to write a script to create new variables in a for loop. Something like the following:
NumDatasetElements = data.numElements
for a = 1:NumDatasetElements
    data{a}.Name = data{a}.Values.data
end
This obviously doesn't work, but I think it shows what I'm trying to do. I need to create a variable with the name data{a}.Name and then set it to data{a}.Values.data.
The reason I'm trying to do this is because I've found that a .mat file populated with array variables easily imports into Python as a dictionary using the sio.loadmat function, whereas datasets do not. My end goal is to easily import Simulink Simulation data into Python to utilize matplotlib for data plotting.
Inside your loop you want
assignin('base',data{a}.Name,data{a}.Values.data);
However, there are potentially a few problems that you'll need to deal with. Specifically, what if the signal doesn't have a name, and what if the data isn't an array, i.e. it is a timeseries? (The above code will still work, but won't give you data you can easily read into Python.) You'll need to add some code to handle both of those cases.
There's also the issue of potentially creating lots and lots of variables in your workspace, depending on how much data you are logging.
You may also find that you can just change the format of the saved data to array, in which case none of the above would be required.
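Once the .mat file contains plain array variables, the Python side of the asker's end goal is straightforward; a minimal sketch, assuming the file is called sim_output.mat and the signals were named speed and command (both names are placeholders):
import scipy.io as sio
import matplotlib.pyplot as plt

mat = sio.loadmat('sim_output.mat')     # hypothetical file name; returns a dict of arrays
speed = mat['speed'].squeeze()          # loadmat returns 2-D arrays, e.g. (100, 1)
command = mat['command'].squeeze()

plt.plot(speed, label='speed')
plt.plot(command, label='command')
plt.legend()
plt.show()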

I would like to be able to compare values in one CSV with a nominal set of values in another

I have been given the task of injecting faults into a system and finding deviations from a norm. These deviations will serve as the failures of the system. So far we've had to detect these faults through observation, but I would like to develop a method for:
1.) Uploading each CSV which will include a fault of a certain magnitude.
2.) Comparing the CSV containing the fault with the nominal value.
3.) Being able to print out where the failure occurred, or if no failure occurred at all.
I was wondering which language would make the most sense for these three tasks. We've been given the system in Simulink, and have been able to output well-formatted CSV files containing information about the components which comprise the system. For each component, we have a nominal set of values and a set of values given after injecting a fault. I'd like to be able to compare these two and find where the fault has occurred. So far we've had very little luck in Python or in Matlab itself, and have been strongly considering using C to do this.
Any advice on which software will provide which advantages would be fantastic. Thank you.
If you want to store the outcomes in a database, it might be worth considering a tool like Microsoft SSIS (SQL Server Integration Services), where you could use your CSV files and sets of values as data sources, compare them or perform calculations, and store the outcomes / datasets in tables. SSIS has a gentle enough learning curve and easy-to-use components, as well as support for bespoke SQL / T-SQL, and you can visually separate your components into distinct processes. The package(s) can then be run either manually or in automated batches as desired.
https://learn.microsoft.com/en-us/sql/integration-services/sql-server-integration-services?view=sql-server-2017
Good luck!
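If you do stay in Python, the comparison step itself can be done with pandas; a rough sketch, assuming the nominal and faulted CSVs have identical column layouts and row ordering (the file names and tolerance are placeholders):
import pandas as pd

nominal = pd.read_csv('nominal.csv')
faulted = pd.read_csv('fault_run.csv')

tolerance = 1e-3                                    # acceptable deviation, application-specific
numeric = nominal.select_dtypes('number').columns
deviations = (faulted[numeric] - nominal[numeric]).abs() > tolerance

if deviations.any().any():
    rows, cols = deviations.to_numpy().nonzero()
    for r, c in zip(rows, cols):
        print(f"Deviation at row {r}, column '{numeric[c]}'")
else:
    print("No failure detected")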

Slow Python loop to search for data in another data frame

I have two data frames: one with all my data (called 'data') and one with the latitudes and longitudes of the different stations where each observation starts and ends (called 'info'). I am trying to get a data frame where the latitude and longitude appear next to each station in each observation. My code in Python:
for i in range(0, 15557580):
    for j in range(0, 542):
        if data.year[i] == '2018' and data.station[i] == info.station[j]:
            data.latitude[i] = info.latitude[j]
            data.longitude[i] = info.longitude[j]
            break
but since I have about 15 million observations, doing it this way takes a lot of time. Is there a quicker way of doing it?
Thank you very much (I am still new to this)
Edit:
My info file has about 500 rows, one for each station. My data file has about 15 million rows, one for each trip (there are other variables not shown here). What I am looking for is that, when the station numbers match, the resulting data frame has the latitude and longitude filled in from info.
This is one solution. You can also use pandas.merge to add 2 new columns to data and perform the equivalent mapping.
# create series mappings from info
s_lat = info.set_index('station')['latitude']
s_lon = info.set_index('station')['longitude']
# calculate Boolean mask on year
mask = data['year'] == '2018'
# apply mappings, if no map found use fillna to retrieve original data
data.loc[mask, 'latitude'] = data.loc[mask, 'station'].map(s_lat)\
.fillna(data.loc[mask, 'latitude'])
data.loc[mask, 'longitude'] = data.loc[mask, 'station'].map(s_lon)\
.fillna(data.loc[mask, 'longitude'])
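For completeness, a sketch of the pandas.merge alternative mentioned above; it ignores the year filter from the original loop and assumes info has exactly the columns station, latitude and longitude:
# drop any pre-existing coordinate columns so the join does not create _x/_y duplicates
merged = data.drop(columns=['latitude', 'longitude'], errors='ignore').merge(
    info[['station', 'latitude', 'longitude']],
    on='station',
    how='left',
)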
This is a very common and important issue when anyone starts to deal with large datasets. Big Data is a whole subject in itself; here is a quick introduction to the main concepts.
1. Prepare your dataset
In big data work, 80% to 90% of the time is spent gathering, filtering and preparing the datasets. Create subsets of the data that are optimized for your further processing.
2. Optimize your script
Short code does not always mean optimized code in terms of performance. In your case, without knowing your dataset, it is hard to say exactly how you should process it; you will have to figure out on your own how to avoid as much computation as possible while getting exactly the same result. Try to avoid any unnecessary computation.
You can also consider splitting the work over multiple threads if appropriate.
As a general rule, you should avoid for loops that you break out of from inside. Whenever you don't know in advance precisely how many iterations you will need, you should use while or do...while loops instead.
3. Consider using distributed storage and computing
This is a subject in itself that is far too big to be fully explained here.
Storing, accessing and processing data in a serialized way is fast for small amounts of data but very inappropriate for large datasets. Instead, we use distributed storage and computing frameworks.
The aim is to do everything in parallel, relying on a concept named MapReduce.
The first distributed data storage framework was Hadoop (e.g. the Hadoop Distributed File System, or HDFS). This framework has its advantages and flaws, depending on your application.
In any case, if you are willing to use this framework, it will probably be more appropriate not to use MapReduce directly on top of HDFS, but to use a higher-level, preferably in-memory, framework such as Spark or Apache Ignite on top of HDFS. Also, depending on your needs, have a look at frameworks such as Hive, Pig or Sqoop.
Again, this subject is a whole different world, but it might very well suit your situation. Feel free to read up on all these concepts and frameworks, and leave your questions in the comments if needed.
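As a rough illustration of what the distributed route can look like in practice, a minimal PySpark sketch of the same station join; the paths and column names are placeholders, and it assumes a Spark installation is available:
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName('station-join').getOrCreate()

# read both CSVs; on a cluster these would typically live in HDFS
data = spark.read.csv('hdfs:///data/observations.csv', header=True, inferSchema=True)
info = spark.read.csv('hdfs:///data/stations.csv', header=True, inferSchema=True)

# the join is planned and executed in parallel across the cluster
joined = data.join(info.select('station', 'latitude', 'longitude'),
                   on='station', how='left')
joined.write.csv('hdfs:///data/observations_with_coords', header=True)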

Standardizing GPX traces

I have two GPX files (from a race I ran twice, obtained via the Strava API) and I would like to be able to compare the effort across both. The sampling frequency is irregular, however (i.e. data is not recorded every second or every meter), so a straightforward comparison is not possible and I would need to standardize the data first. Preferably, I would resample the data so that I have a data point every 10 meters, for example.
I'm using Pandas, so I'm currently standardizing a single file by inserting rows for every 10 meters and interpolating the heartrate, duration, lat/lng, etc from the surrounding data points. This works, but doesn't make the data comparable across files, as the recording does not start at the exact same location.
An alternative is first standardizing the course coordinates using something like geohashing and then trying to map both efforts to this standardized course. Since coordinates cannot easily be sorted, however, I'm not sure how to do that correctly.
Any pointers are appreciated, thanks!
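On the resampling side, a sketch of the interpolate-onto-a-distance-grid idea described above, assuming the trace is already in a DataFrame with a strictly increasing cumulative distance column in metres (the column names are placeholders):
import numpy as np
import pandas as pd

def resample_every_10m(df):
    # df has columns: distance (cumulative metres), heartrate, duration, lat, lng
    grid = np.arange(0, df['distance'].iloc[-1], 10.0)          # one sample every 10 m
    combined = np.union1d(df['distance'].to_numpy(), grid)      # keep originals for interpolation
    return (
        df.set_index('distance')
          .reindex(combined)
          .interpolate(method='index')                          # linear in distance
          .loc[grid]                                            # keep only the regular grid points
          .rename_axis('distance')
          .reset_index()
    )
Running the two files through the same grid (starting the cumulative distance from a common reference point, e.g. the race start) would make the rows directly comparable.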

HDF5 Links to Events in Dataset

I'm trying to use HDF5 to store time-series EEG data. These files can be quite large and consist of many channels, and I like the features of the HDF5 file format (lazy I/O, dynamic compression, mpi, etc).
One common thing to do with EEG data is to mark sections of data as 'interesting'. I'm struggling with a good way to store these marks in the file. I see soft/hard links supported for linking the same dataset to other groups, etc -- but I do not see any way to link to sections of the dataset.
For example, let's assume I have a dataset called EEG containing sleep data. Let's say I run an algorithm that takes a while to process the data and generates indices corresponding to periods of REM sleep. What is the best way to store these index ranges in an HDF5 file?
The best I can think of right now is to create a dataset with three columns -- the first column is a string containing a label for the event ("REM1"), and the second/third columns contain the start/end index, respectively. The only reason I don't like this solution is that HDF5 datasets are pretty fixed in size -- if I decide later that a period of REM sleep was mis-identified and I need to add/remove that event, the dataset size would need to change (and deleting the dataset and recreating it with a new size is suboptimal). Compound this with the fact that I may have MANY events (imagine marking eyeblink events), and this becomes more of a problem.
I'm more curious to find out if there's functionality in the HDF5 file that I'm just not aware of, because this seems like a pretty common thing that one would want to do.
I think what you want is a Region Reference: essentially, a way to store a reference to a slice of your data. In h5py, you create them with the regionref property and numpy slicing syntax, so if you have a dataset called ds and the start and end indexes of your REM period, you can do:
rem_ref = ds.regionref[start:end]
ds.attrs['REM1'] = rem_ref
ds[ds.attrs['REM1']] # Will be a 1-d set of values
You can store regionrefs pretty naturally: they can be attributes on a dataset, objects in a group, or you can create a regionref-type dataset and store them in there.
In your case, I might create a group ("REM_periods" or something) and store the references in there. Creating a "REM_periods" dataset and storing the regionrefs there is reasonable too, but you run into the whole "datasets don't handle variable length very well" problem.
Storing them as attrs on the dataset might be OK, too, but it'd get awkward if you wanted to have more than one event type.
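A short sketch of the group-of-references idea in h5py; the file name, dataset name and event labels/indices are just placeholders:
import h5py

with h5py.File('sleep.h5', 'a') as f:
    ds = f['EEG']                                    # the raw recording
    events = f.require_group('REM_periods')          # one group for all marked events

    # store each event as a named attribute holding a region reference
    for label, (start, end) in {'REM1': (1000, 5000), 'REM2': (9000, 12000)}.items():
        events.attrs[label] = ds.regionref[start:end]

    # later: dereference to recover the interesting slice
    rem1 = ds[events.attrs['REM1']]
Attributes can be added or deleted individually, so mis-identified events can be removed without resizing or recreating any dataset.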
