I have two flavors of DICOM data; the first works with my existing code (what I built it on), but I cannot import the second.
The first style has a bottom-level folder containing all the slices from one scan (as ordered ".dcm" files). I simply point VTK to the directory using this code:
vtkSmartPointer<vtkDICOMImageReader> reader = vtkSmartPointer<vtkDICOMImageReader>::New();
reader->SetDirectoryName(dicomDirectory.c_str());
reader->Update();
vtkSmartPointer<vtkImageData> sliceData = reader->GetOutput();
double tempIntensity = sliceData->GetScalarComponentAsDouble(x, y, z, 0);
This is not the exact source (I check dimensions, set up iteration, and so on), but in short, it works: I have pulled in several different DICOM volumes through this method (and have inspected and manipulated the resulting volume clouds).
This depends on VTK interpreting the directory, though. The vtkDICOMImageReader documentation states (under Detailed Description, Warning) that there are some particulars about what VTK can manage in terms of DICOM data. (I am not sure whether my current data violates this spec.)
The second style of DICOM has a directory structure where the bottom-level folders are named A-Z, and each one contains 25 files (with no suffix) named Z01-Z25.
I can open the files individually using:
reader->SetFileName(tempFile.c_str());
Instead of specifying the directory. If I read all 25 files in one of the bottom folders, I get a mix of different ordered chunks from different scans. I was prepared to write a function to skim all folders and files in the directory to find and assemble all slices associated with one scan, but so far I have been unable to find (or appropriately implement) a function within vtkDICOMImageReader to:
A: detect which unique series set of slices I am in (series label)
nor
B: detect my current slice number in series as well as the total series count (slice count/ series slice total)
I can post more source as necessary, but basically I have tried monitoring all parameters in "reader" and "sliceData" while loading slices from different series, and so far nothing has given me the above data. I assume that either I am not updating appropriately between slice loads, or I am not looking at the right object parameters.
Any information on what I am doing wrong in terms of code or even my poor understanding of DICOM structure would be greatly appreciated!
PS: I am working in C++, but am fairly certain the usage is similar in Python.
Unfortunately, DICOM is horribly complex, and things get implemented slightly differently depending on which company's scanner the data is from and how old the system is. In your first example it sounds like you have a simply formatted directory with individual slice files and no extra scout images, so VTK is able to read in and render the slices and it looks fine.
Your second example sounds like there's a more complex structure that may contain multiple series and possibly things like scout images or even non-image DICOM files. To deal with this type of data you'll need some logic to read the metadata, figure out which files you're interested in, and assemble them. The metadata for the entire set is contained in a single file named DICOMDIR, which should be in the top-level directory. This data is redundant with the .dcm file headers, but reading it saves you the trouble of scanning the header of every file individually.
VTK is an image manipulation/display library, not a DICOM system. I'm not sure it has good support for complex directory structures. You could try reader->SetFileName("DICOMDIR"); and see if there's logic to handle this automatically, but I'd be a bit surprised if that worked.
If you're going to be working with complex multi-series data like this, you'll probably need another library to extract the info you want. I highly recommend DCMTK, a great open-source C++ library for working with DICOM; just don't expect it to be super simple.
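Whichever library ends up reading the headers, the assembly step you describe (skim every folder, bucket slices by series, order them within each series) is plain bookkeeping. Here is a sketch of that logic; the get_slice_info stub stands in for whatever header reader (DCMTK, vtk-dicom, pydicom) you settle on, so its name and return shape are assumptions, not a real API:

```python
from collections import defaultdict

def get_slice_info(path):
    # Placeholder: return (series_uid, instance_number) for one file.
    # In reality you'd read SeriesInstanceUID (0020,000E) and
    # InstanceNumber (0020,0013) from the DICOM header with your
    # library of choice; this stub only fixes the expected shape.
    raise NotImplementedError

def assemble_series(paths, read_info=get_slice_info):
    # Bucket every slice file by its series UID...
    series = defaultdict(list)
    for path in paths:
        uid, instance = read_info(path)
        series[uid].append((instance, path))
    # ...then order each bucket by instance number.
    return {uid: [p for _, p in sorted(group)]
            for uid, group in series.items()}
```

Each resulting list can then be fed to vtkDICOMImageReader one file at a time, in the right order, via SetFileName.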
You should not assume the content of a DICOM file by its name or position in the directory structure.
Instead, the root of the DICOM folder should contain a DICOMDIR file, which contains the list of files and their relationship (e.g., patient objects contain studies objects which contain series and then images).
I don't know if VTK offers a way of reading and interpreting DICOMDIR files; if not then you could try to interpret them with dcmtk or Imebra.
Disclosure: I'm the author of Imebra
As I posted in the comments, vtk-dicom is more suitable for your needs in this case. Here are some useful links:
A tutorial on how to use vtk-dicom API to scan directories, organize your data according to series and studies: https://dgobbi.github.io/vtk-dicom/doc/api/directory.html
The vtk-dicom documentation, with installation instructions on the Appendix: http://dgobbi.github.io/vtk-dicom/doc/vtk-dicom.pdf
Is there a function similar to ncdisp in MATLAB to view .npy files?
Alternatively, it would be helpful to have some command that would spit out header titles in a .npy file. I can see everything in the file, but it is absolutely enormous. There has to be a way to view what categories of data are in this file.
Looking at the code for np.lib.npyio.load, we see that it calls np.lib.format.read_array, which in turn calls np.lib.format._read_array_header.
This can be studied and perhaps even used, but it isn't in the public API.
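The same module also exposes read_magic and read_array_header_1_0, which are enough to peek at a header without touching the private helper. A small self-contained demo (an in-memory buffer stands in for a real .npy file, and these helpers live in np.lib.format rather than the top-level API):

```python
import io
import numpy as np

# An in-memory buffer stands in for a real .npy file on disk.
buf = io.BytesIO()
np.save(buf, np.zeros((3, 4), dtype=np.float32))
buf.seek(0)

# Read just the header: the magic string gives the format version,
# then one call decodes the shape/order/dtype information.
version = np.lib.format.read_magic(buf)
shape, fortran_order, dtype = np.lib.format.read_array_header_1_0(buf)
print(version, shape, fortran_order, dtype)   # -> (1, 0) (3, 4) False float32
```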
But if you are as much of a MATLAB fan as you claim, you already know that you can explore .m files to see the MATLAB code. The same goes for Python/NumPy: read the files/functions until you hit compiled builtins.
Since a .npy file contains only one array, the header isn't that interesting by itself: just the array dtype and shape (and total size). This isn't like a MATLAB save file with lots of variables; scipy.io.loadmat can read those.
But looking up ncdisp, I see that it's part of MATLAB's NetCDF reader. That's a whole different kind of file.
I'm looking at some machine learning/forecasting code using Keras, and the input data sets are stored in npz files instead of the usual csv format.
Why would the authors go with this format instead of csv? What advantages does it have?
It depends on the expected usage. If a file is expected to have broad use cases, including direct access from ordinary client machines, then CSV is fine because it can be loaded directly into Excel or LibreOffice Calc, which are widely deployed. But it is just a good old text file with no indexes or any additional features.
On the other hand, if a file is only expected to be used by data scientists or, generally speaking, numpy-aware users, then npz is a much better choice because of the additional features (compression, lazy loading, etc.).
Long story short, you trade a larger audience for richer features.
From https://kite.com/python/docs/numpy.lib.npyio.NpzFile
A dictionary-like object with lazy-loading of files in the zipped archive provided on construction.
So, it is a zipped archive (smaller on disk than CSV, and more than one file can be stored), and files are loaded from disk only when needed (with CSV, even when you only need one column, you still have to read and parse the whole file).
=> advantages are: performance and more features
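A minimal demonstration of both points, using an in-memory buffer in place of a file on disk:

```python
import io
import numpy as np

# Save two arrays into one compressed archive (in memory here; on disk
# this would be a .npz file).
buf = io.BytesIO()
np.savez_compressed(buf, prices=np.arange(5.0), volumes=np.ones(5))
buf.seek(0)

# np.load on an npz returns an NpzFile: a dict-like object that
# decompresses a member array only when it is actually accessed.
archive = np.load(buf)
print(sorted(archive.files))       # -> ['prices', 'volumes']
print(archive["prices"].sum())     # only this member is read; -> 10.0
```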
I am exploring a comparison between Go and Python, particularly for mathematical computation. I noticed that Go has a matrix package mat64.
1) I wanted to ask someone who uses both Go and Python whether Go has comparable functions/tools for its matrices, equivalent to NumPy's savez_compressed, which stores data in the npz format (i.e. compressed binary, multiple matrices per file)?
2) Also, can Go's matrices handle string types like Numpy does?
1) .npz is a NumPy-specific format. It is unlikely that Go itself will ever support it in the standard library. I also don't know of any third-party library that exists today, and a (10-second) search didn't turn one up. If you need npz specifically, go with Python + NumPy.
If you just want something similar from Go, you can use any format. Binary options include Go's encoding/binary and encoding/gob packages. Depending on what you're trying to do, you could even use a non-binary format like JSON and compress it on your own.
2) Go doesn't have built-in matrices. That library you found is third party and it only handles float64s.
However, if you just need to store strings in a matrix (n-dimensional) format, you would use an n-dimensional slice. For two dimensions it looks like this: var myStringMatrix [][]string.
npz files are zip archives. Archiving and (optional) compression are handled by Python's zipfile module. The npz contains one .npy file for each variable that you save. Any OS-level archiving tool can decompress and extract the component .npy files.
So the remaining question is: can you simulate the npy format? It isn't trivial, but it isn't difficult either. It consists of a header block containing shape, strides, dtype, and order information, followed by a data block, which is effectively a byte image of the array's data buffer.
So the buffer information and data are closely tied to the numpy array content. And if the variable isn't a normal array, save falls back on Python's pickle mechanism.
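To see exactly what a Go writer would have to reproduce, it's easy to dump a small array and inspect the bytes:

```python
import io
import numpy as np

buf = io.BytesIO()
arr = np.array([[1, 2], [3, 4]], dtype="<i4")
np.save(buf, arr)
raw = buf.getvalue()

# 6-byte magic string, 2 version bytes, then a length-prefixed ASCII
# header describing dtype, order, and shape...
print(raw[:6])                       # -> b'\x93NUMPY'

# ...followed by the raw bytes of the array's data buffer.
print(raw.endswith(arr.tobytes()))   # -> True
```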
For a start I'd suggest using the csv format. It's not binary, and not fast, but everyone and his brother can generate and read it. We constantly get SO questions about reading such files using np.loadtxt or np.genfromtxt. Look at the code for np.savetxt to see how numpy produces such files. It's pretty simple.
Another general-purpose choice would be JSON, using the tolist form of an array. That comes to mind because Go is Google's home-grown alternative to Python for web applications. JSON is a cross-language format based on simplified JavaScript syntax.
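A round trip of the tolist/JSON suggestion (the Go side would use encoding/json; shown here from Python):

```python
import json
import numpy as np

a = np.array([[1.5, 2.0], [3.0, 4.5]])

# tolist() converts to nested Python lists, which json can serialize;
# any JSON-capable language (Go's encoding/json included) can read it.
text = json.dumps(a.tolist())
print(text)                        # -> [[1.5, 2.0], [3.0, 4.5]]

# Round-trip back to an array.
b = np.array(json.loads(text))
print(np.array_equal(a, b))        # -> True
```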
I have an HDF5 file with a one-dimensional (N x 1) dataset of compound elements; actually, it's a time series. The data is first collected offline into the HDF5 file and then analyzed. During analysis most of the data turns out to be uninteresting, and only some parts of it are interesting. Since the datasets can be quite big, I would like to get rid of the uninteresting elements while keeping the interesting ones. For instance, keep elements 0-100, 200-300, and 350-400 of a 500-element dataset, and dump the rest. But how?
Does anybody have experience with how to accomplish this in HDF5? Apparently it could be done in several ways; at least:
(Obvious solution), create a new fresh file and write the necessary data there, element by element. Then delete the old file.
Or, into the old file, create a new fresh dataset, write the necessary data there, unlink the old dataset using H5Gunlink(), and get rid of the unclaimed free space by running the file through h5repack.
Or, move the interesting elements within the existing dataset towards the start (e.g. move elements 200-300 to positions 101-201 and elements 350-400 to positions 202-252). Then call H5Dset_extent() to reduce the size of the dataset. Then maybe run through h5repack to release the free space.
Since the files can be quite big even when the uninteresting elements have been removed, I'd rather not rewrite them (it would take a long time), but it seems to be required to actually release the free space. Any hints from HDF5 experts?
HDF5 (at least the version I am used to, 1.6.9) does not allow deletion. Actually, it does, but it does not free the used space, with the result that you still have a huge file. As you said, you can use h5repack, but it's a waste of time and resources.
Something that you can do is to have a lateral dataset containing a boolean value, telling you which values are "alive" and which ones have been removed. This does not make the file smaller, but at least it gives you a fast way to perform deletion.
An alternative is to define a slab on your array, copy the relevant data, then delete the old array; or always access the data through the slab and redefine it as you need. (I've never done this, though, so I'm not sure it's possible, but it should be.)
Finally, you can use the hdf5 mounting strategy to have your datasets in an "attached" hdf5 file you mount on your root hdf5. When you want to delete the stuff, copy the interesting data in another mounted file, unmount the old file and remove it, then remount the new file in the proper place. This solution can be messy (as you have multiple files around) but it allows you to free space and to operate only on subparts of your data tree, instead of using the repack.
Copying the data or using h5repack as you have described are the two usual ways of 'shrinking' the data in an HDF5 file, unfortunately.
The problem, as you may have guessed, is that an HDF5 file has a complicated internal structure (the file format is here, for anyone who is curious), so deleting and shrinking things just leaves holes in an identical-sized file. Recent versions of the HDF5 library can track the freed space and re-use it, but your use case doesn't seem to be able to take advantage of that.
As the other answer has mentioned, you might be able to use external links or the virtual dataset feature to construct HDF5 files that were more amenable to the sort of manipulation you would be doing, but I suspect that you'll still be copying a lot of data and this would definitely add additional complexity and file management overhead.
H5Gunlink() has been deprecated, by the way. H5Ldelete() is the preferred replacement.
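For what it's worth, the copy-to-a-fresh-file option is only a few lines from Python, if h5py is an acceptable tool; the dataset and file names here are made up, and the ranges are the ones from the question:

```python
import os
import tempfile

import h5py
import numpy as np

tmp = tempfile.mkdtemp()
old_path = os.path.join(tmp, "series.h5")
new_path = os.path.join(tmp, "trimmed.h5")

# A throwaway 500-element dataset standing in for the time series.
with h5py.File(old_path, "w") as f:
    f.create_dataset("ts", data=np.arange(500))

# Copy only the interesting element ranges (0-100, 200-300, 350-400,
# inclusive) into a fresh file, then delete the old one.
keep = [(0, 101), (200, 301), (350, 401)]
with h5py.File(old_path, "r") as src, h5py.File(new_path, "w") as dst:
    parts = [src["ts"][a:b] for a, b in keep]
    dst.create_dataset("ts", data=np.concatenate(parts))

os.remove(old_path)  # the new file contains no dead space to repack
```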
In my python environment, the Rpy and Scipy packages are already installed.
The problem I want to tackle is such:
1) A huge set of financial data is stored in a text file. Loading it into Excel is not possible.
2) I need to sum certain fields and get the totals.
3) I need to show the top 10 rows based on the totals.
Which package (Scipy or Rpy) is best suited for this task?
If so, could you provide me some pointers (e.g. documentation or online example) that can help me to implement a solution?
Speed is a concern. Ideally SciPy or Rpy would be able to handle the files even when they are so large that they cannot fit into memory.
Neither Rpy nor SciPy is necessary, although numpy may make it a bit easier.
This problem seems ideally suited to a line-by-line parser.
Simply open the file, read a row into a string, scan the row into an array (see numpy.fromstring), update your running sums and move to the next line.
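A sketch of that loop (the filename and comma delimiter are assumptions about your data):

```python
import numpy as np

def column_sums(path, sep=","):
    """Stream the file once, keeping only the running column sums in memory."""
    totals = None
    with open(path) as f:
        for line in f:
            # Parse one row into a float array, then fold it into the sums.
            row = np.array(line.split(sep), dtype=float)
            totals = row if totals is None else totals + row
    return totals
```

Nothing but one row and the running totals is ever held in memory, so file size is no obstacle.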
Python's file I/O doesn't have bad performance, so you can just use the file object directly. You can see what functions are available on it by typing help(file) in the interactive interpreter. Opening a file is part of the core language and doesn't require an import.
Something like:
f = open(r"C:\BigScaryFinancialData.txt", "r")
for line in f.readlines():
    # line is a string
    # do whatever you want on a per-line basis here, for example:
    print len(line)
Disclaimer: This is a Python 2 answer. I'm not 100% sure this works in Python 3.
I'll leave it to you to figure out how to show the top 10 rows and find the row sums. This can be done with simple program logic that shouldn't be a problem without any special libraries. Of course, if the rows have some kind of complicated formatting that makes it difficult to parse out the values, you might want to use some kind of module for parsing, re for example (type help(re) into the interactive interpreter).
As #gsk3 noted, bigmemory is a great package for this, along with the packages biganalytics and bigtabulate (there are more, but these are worth checking out). There's also ff, though that isn't as easy to use.
Common to both R and Python is support for HDF5 (see the ncdf4 or NetCDF4 packages in R), which makes it very speedy and easy to access massive data sets on disk. Personally, I primarily use bigmemory, though that's R specific. As HDF5 is available in Python and is very, very fast, it's probably going to be your best bet in Python.
How huge is your data? Is it larger than your PC's memory? If it can be loaded into memory, you can use numpy.loadtxt() to load the text data into a numpy array. For example:
import numpy as np
with file("data.csv", "rb") as f:
    title = f.readline()  # if your data has a title line
    data = np.loadtxt(f, delimiter=",")  # if your fields are separated by ","
    print np.sum(data, axis=0)  # sum along axis 0 to get the sum of each column
I don't know anything about Rpy. I do know that SciPy is used to do serious number-crunching with truly large data sets, so it should work for your problem.
As zephyr noted, you may not need either one; if you just need to keep some running sums, you can probably do it in Python. If it is a CSV file or other common file format, check and see if there is a Python module that will parse it for you, and then write a loop that sums the appropriate values.
I'm not sure how to get the top ten rows. Can you gather them on the fly as you go, or do you need to compute the sums and then choose the rows? To gather them you might want to use a dictionary to keep track of the current 10 best rows, and use the keys to store the metric you used to rank them (to make it easy to find and toss out a row if another row supersedes it). If you need to find the rows after the computation is done, slurp all the data into a numpy.array, or else just take a second pass through the file to pull out the ten rows.
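Rather than hand-rolling that dictionary bookkeeping, Python's heapq module already implements keep-the-N-best over a stream; a sketch (the key function is whatever metric you rank rows by):

```python
import heapq

def top_rows(rows, key, n=10):
    """Return the n rows with the largest key value, in descending order.

    heapq.nlargest keeps a small heap of the current best rows and evicts
    the worst of them whenever a better row arrives, so it works in one
    pass over a row iterator without holding the whole file in memory.
    """
    return heapq.nlargest(n, rows, key=key)
```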
Since this has the R tag I'll give some R solutions:
Overview
http://www.r-bloggers.com/r-references-for-handling-big-data/
bigmemory package http://www.cybaea.net/Blogs/Data/Big-data-for-R.html
XDF format http://blog.revolutionanalytics.com/2011/03/analyzing-big-data-with-revolution-r-enterprise.html
Hadoop interfaces to R (RHIPE, etc.)