Concatenate or merge .mdf files with different channels - python

When attempting to merge mdf files, I receive some version of the following error:
asammdf.blocks.utils.MdfException: internal structure of file 2 is
different; different channels
I understand the reason for this error. My CAN device is only logging signals whenever it detects a change in their value. Each mdf file in my configuration represents about ten minutes of data collection, so if any particular signal remains unchanged during this period of time, its channel will not be present in the mdf file.
If I were to filter out the channels that are periodically missing, I would effectively be filtering out most of my useful data.
Is there a way to fill in the missing channels with blank data in the MDF files that are missing them? I'm looking for a way to take all the unique channels from all the mdf files and place them in the merged final output. Similar to the stack functionality, but only for adding channel names, with no data and without duplicates.
For further context, I'm a beginner with these traces. I've done most of my logging through MongoDB and saved the CAN traces as backup, and have up until now never needed to actually use them.
Any ideas? Thanks in advance!
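One possible workaround, sketched below rather than an asammdf-native answer: convert each file to a pandas DataFrame and let pandas take the union of the channel names, so channels absent from a given file simply become NaN for that time window. This assumes the files fit in memory; mdf_paths is a placeholder for your list of files.

import pandas as pd
from asammdf import MDF

mdf_paths = ["trace_000.mdf", "trace_001.mdf"]  # placeholder file names

frames = []
for path in mdf_paths:
    with MDF(path) as mdf:
        # one column per channel, indexed by timestamp
        frames.append(mdf.to_dataframe())

# pandas takes the union of all channel names; channels missing from a file become NaN
merged = pd.concat(frames, sort=False).sort_index()

You lose the MDF container this way, but you end up with every unique channel in one table without filtering any of them out.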

Related

Python - Change and update the same header files from two different projects

I am performing data analysis. I want to segment the steps of data analysis into different projects, as the analysis will be performed in the same order, but not usually all at the same time. There is so much code and data cleaning that keeping all of this in the same project may get confusing.
However, I keep the header files that track the columns of information in the data consistent across projects. It is possible I will change the order of these headers at some point and then want to re-run all sequences of code. I also want to make sure that the header used remains the same so I don't erroneously analyze one piece of data instead of another. I use headers so that if a column order changes at any time, I access data by the header name that matches it rather than changing every occurrence of a particular column number throughout my code.
To accomplish this, I would like to version-track multiple projects that access the SAME header files, and update and alter the header files without having to edit them from each project individually.
Finally, I don't want to just store it somewhere on my computer and not track it, because I work from two different workstations.
Any good solutions or best practices for what I want to do? Have I made an error somewhere in project set-up? I am mostly self-taught and have developed my own project organization and sequence of data analysis based on my own ideas and research as I go, so if I've adopted some terribly bad practice, that would be great to know.
I've found a possible solution that uses independent branches in the same repo for tracking two separate projects, but I'm not convinced this is the best solution either.
Thanks!

PyArrow SIGSEGV error when using UnionDatasets

The context:
I am using PyArrow to read a folder structured as exchange/symbol/date.parquet. The folder contains multiple exchanges, multiple symbols and multiple files. At the time of writing, the folder is about 30 GB / 1.85M files.
If I use a single PyArrow Dataset to read/manage the entire folder, the simplest process with just the dataset defined occupies 2.3 GB of RAM. The problem is that I am instantiating this dataset in multiple processes, but since every process only needs some exchanges (typically just one), I don't need to read all folders and files in every single process.
So I tried to use a UnionDataset composed of single-exchange Datasets. This way, every process just loads the required folders/files as a dataset. In a simple test, each process now occupies just 868 MB of RAM, a 63% reduction.
The problem:
When using a single Dataset for the entire folder/files, I have no problem at all. I can read filtered data without problems and it's fast as duck.
But when I read the UnionDataset filtered data, I always get a Process finished with exit code 139 (interrupted by signal 11: SIGSEGV) error. After looking into every possible source of the problem, I noticed that if I create a dummy folder with multiple exchanges but only some symbols, in order to limit the number of files to read, I don't get that error and it works normally. If I then copy in new symbol folders (any of them), I get that error again.
I've come to think that the problem is not in my code, but is instead linked to the number of files that the UnionDataset is able to manage.
Am I correct or am I doing something wrong? Thank you all, have a nice day and good work.
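For reference, a minimal sketch of the two setups described above, assuming parquet files laid out as exchange/symbol/date.parquet under a root folder; the exchange names and the "symbol" column are placeholders.

import pyarrow.dataset as ds

root = "data"  # placeholder root folder

# single dataset over the whole tree (every process pays the full memory cost)
full_ds = ds.dataset(root, format="parquet")

# per-exchange child datasets combined into a UnionDataset (the setup that crashes)
children = [
    ds.dataset(f"{root}/binance", format="parquet"),
    ds.dataset(f"{root}/kraken", format="parquet"),
]
union_ds = ds.dataset(children)  # a list of Datasets yields a UnionDataset

# filtered read, same call on either dataset
table = union_ds.to_table(filter=ds.field("symbol") == "BTCUSDT")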

Is there a way to compare CSVs in Python while ignoring row ordering?

I'm setting up an automatic job that needs to parse CSV files from an FTP site; each file contains several tens of thousands of lines. I want to pre-process the directory to eliminate duplicate files before parsing the remaining ones. The problem is that duplicate files are being pushed to the FTP but with different row ordering (i.e. same data, different order). As a result, the "duplicate files" have different hashes and fail byte-by-byte comparisons, and I want to detect them with minimal processing.
I want to keep file manipulation to the minimum so I tried sorting the CSVs using the csvsort module but this is giving me an index error: IndexError: list index out of range. Here is the relevant code:
from csvsort import csvsort

csvsort(input_filename=file_path, columns=[1, 2])
I tried finding and eliminating empty rows, but this didn't seem to be the problem and, like I said, I want to keep file manipulation to a minimum to retain file integrity. Moreover, I have no control over the creation of the files or over pushing them to the FTP.
I can think of a number of ways to work around this issue, but they would all involve opening the CSVs, reading the contents, manipulating them, etc. Is there any way I can do a lightweight file comparison that ignores row ordering, or will I have to go for heavier processing?
You do not specify how much data you have, and my take on this would differ depending on size. Are we talking hundreds of lines? Or multi-million lines?
If you have "few" lines, you can simply sort them. But if the data gets longer, other strategies make sense.
I have previously solved the problem of "weeding out lines from file A that appear in file B" using AWK, since AWK can do this with just one run through the long file (A), making the process extremely fast. However, you may need to call an external program; not sure if this is ideal for you.
If your lines are not completely identical - let's say you need to compare just one of several fields - AWK can do that as well. Just extract fields into variables.
If you choose to go this way, the script is something along these lines:
# This block runs only while reading the first file (SMALL_LIST): FNR==NR is true
# only then. Remember each line, count it, then skip to the next record.
FNR==NR{
    a[$0]++; cnt[1]+=1; next
}
# Second file (FULL_LIST): a bare pattern prints the lines that were NOT seen above.
!a[$0]
Use with
c:\path\to\awk.exe -f awkscript.awk SMALL_LIST FULL_LIST > DIFF_LIST
DIFF_LIST is items from FULL that are NOT in SMALL.
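If calling an external program is not an option, roughly the same idea works in plain Python: load the smaller file into a set and stream the big one past it. A line-based sketch using the same placeholder file names as above:

# keep only lines from FULL_LIST that do not appear in SMALL_LIST
with open("SMALL_LIST", encoding="utf-8") as small:
    seen = set(small)

with open("FULL_LIST", encoding="utf-8") as full, \
     open("DIFF_LIST", "w", encoding="utf-8") as diff:
    for line in full:
        if line not in seen:
            diff.write(line)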
So, it turns out that pandas has a built-in hash function with an option to ignore the index. Since the hash is calculated per row, you need to run an additional sum so you get a single, order-independent value for the whole file. In terms of code, it is about as lightweight as I could wish for; in terms of runtime, it parses ~15 files in ~5 seconds (~30k rows, 17 columns in each file).
from pandas import read_csv
from pandas.util import hash_pandas_object
from collections import defaultdict

# group files by an order-independent content hash: the per-row hashes are summed,
# so two files with the same rows in a different order end up under the same key
duplicate_check = defaultdict(list)
for f in files:  # `files` is the list of CSV paths to check
    duplicate_check[hash_pandas_object(read_csv(f), index=False).sum()].append(f)
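As a small follow-up, picking which files to actually parse is then just a matter of keeping one path per hash; to_delete here is a hypothetical name for the rest.

# keep the first file for each content hash, treat the others as duplicates
to_delete = []
for content_hash, paths in duplicate_check.items():
    to_delete.extend(paths[1:])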

PDF Grabbing Code is terminated (Cache full trying to find a workaround)

So I just started coding with Python. I have a lot of PDFs which are my target for data grabbing. I have the script finished and it works without errors if I limit it to a small number of PDFs (~200). If I let the script run over 4000 PDFs, it is terminated without an error. A friend of mine told me that this is due to the cache.
I save the grabbed data to lists and in the last step create a DataFrame out of the different lists. The DataFrame is then exported to Excel.
So I tried to export the DataFrame after every 200 PDFs (and then clear all the lists and the DataFrame), but then pandas overwrites the prior results. Is this the right way to go? Or can anyone think of a different approach to get around the termination caused by the large number of PDFs?
Right now I use:
import pandas as pd

MN = list()    # Materialnummer values collected from the PDFs
VdS = list()   # "Verwendung des Stoffs" values collected from the PDFs
# ... the lists are filled while grabbing the PDFs ...
data = {'Materialnummer': MN, 'Verwendung des Stoffs': VdS}
df = pd.DataFrame(data)
df.to_excel('test.xls')
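One possible way around both the memory growth and the overwrite problem, sketched under the assumption that extract_fields and pdf_paths stand in for your own grabbing code: flush the results to a CSV in append mode every 200 PDFs, then convert once at the end if you really need Excel.

import os
import pandas as pd

OUT = "results.csv"
batch_mn, batch_vds = [], []

for i, pdf in enumerate(pdf_paths, start=1):
    mn, vds = extract_fields(pdf)          # placeholder for your grabbing logic
    batch_mn.append(mn)
    batch_vds.append(vds)

    if i % 200 == 0 or i == len(pdf_paths):
        df = pd.DataFrame({'Materialnummer': batch_mn,
                           'Verwendung des Stoffs': batch_vds})
        # append instead of overwrite; write the header only for the first batch
        df.to_csv(OUT, mode='a', header=not os.path.exists(OUT), index=False)
        batch_mn, batch_vds = [], []       # free the lists before the next batch

# optional final conversion to Excel
pd.read_csv(OUT).to_excel('test.xlsx', index=False)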

Why does Spark output a set of csv's instead of just one?

I had a hard time last week getting data out of Spark; in the end I had to simply go with
df.toPandas().to_csv('mycsv.csv')
out of this answer.
I had tested the more native
df.write.csv('mycsv.csv')
for Spark 2.0+, but as per the comment underneath, it drops a set of csv files instead of one, which then need to be concatenated, whatever that means in this context. It also dropped an empty file into the directory called something like 'success'. The directory name was /mycsv/ but the csv itself had an unintelligible name made up of a long string of characters.
This was the first I had heard of such a thing. Well, Excel has multiple tabs which must somehow be reflected in an .xls file, and NumPy arrays can be multidimensional, but I thought a csv file was just a header, values separated into columns by commas in rows.
Another answer suggested:
query.repartition(1).write.csv("cc_out.csv", sep='|')
So this drops just one file plus the blank 'success' file; still, the file does not have the name you want, only the directory does.
Does anyone know why Spark is doing this, why it will not simply output a csv, how it names the csv, what that success file is supposed to contain, and whether concatenating csv files here means joining them vertically, head to tail?
There are a few reasons why Spark outputs multiple CSVs:
- Spark runs on a distributed cluster. For large datasets, all the data may not be able to fit on a single machine, but it can fit across a cluster of machines. To write one CSV, all the data would presumably have to be on one machine and written by one machine, which one machine may not be able to do.
- Spark is designed for speed. If data lives on 5 partitions across 5 executors, it makes sense to write 5 CSVs in parallel rather than move all data to a single executor and have one executor write the entire dataset.
If you need one CSV, my presumption is that your dataset is not super large. My recommendation is to download all the CSV files into a directory, and run cat *.csv > output.csv in the relevant directory. This will join your CSV files head-to-tail. You may need to do more work to strip headers from each part file if you're writing with headers.
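If shelling out to cat is inconvenient, a small pandas sketch does the same concatenation and also deals with the repeated headers, assuming the part files were written with a header row; the directory and output names are placeholders.

import glob
import pandas as pd

# "mycsv.csv" is the output *directory* Spark created, containing the part files
parts = sorted(glob.glob("mycsv.csv/part-*.csv"))
merged = pd.concat((pd.read_csv(p) for p in parts), ignore_index=True)
merged.to_csv("mycsv_single.csv", index=False)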
Does anyone know why Spark is doing this, why will it not simply output a csv,
Because it is designed for distributed computing where each chunk of data (a.k.a. partition) is written independently of others.
how does it name the csv
Name depends on the partition number.
what is that success file supposed to contain
Nothing. It just indicates success.
This basically happens because Spark dumps files based on the number of partitions the data is divided into, so each partition simply dumps its own file separately. You can use the coalesce option to save them to a single file. Check this link for more info.
However, this method has the disadvantage that it needs to collect all the data on the Master Node, hence the Master Node should have enough memory. A workaround for this can be seen in this answer.
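For concreteness, a minimal PySpark sketch of the coalesce approach mentioned above; df is assumed to be an existing DataFrame, and the output still lands as a single part file inside the named directory rather than as a file with that exact name.

(df.coalesce(1)
   .write
   .option("header", True)
   .mode("overwrite")
   .csv("cc_out.csv"))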
This link also sheds some more information about this behavior of Spark:
Spark is like Hadoop - uses Hadoop, in fact - for performing actions like outputting data to HDFS. You'll know what I mean the first time you try to save "all-the-data.csv" and are surprised to find a directory named all-the-data.csv/ containing a 0 byte _SUCCESS file and then several part-0000n files for each partition that took part in the job.
