How to save Numpy array to root file using uproot4 - python

I'm trying to save a large number of Numpy arrays to a ROOT file using uproot. I've read through the uproot documentation, and as far as I can tell this ability was removed after uproot3. Note that I can't save this as a histogram because the order of the data is important (it's a waveform). Does anyone know of a workaround, or something I'm missing, to do this?
e.g. I want something like this:
'''
import numpy as np
import uproot as up
array = np.random.normal(0,1,1000)
file = up.recreate('./test.root')
file['Tree1'] = array
'''
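For what it's worth, newer uproot versions (4.1 and later, as far as I know) reintroduced TTree writing. A minimal sketch, assuming a recent uproot and a hypothetical branch name 'waveform':
import numpy as np
import uproot
array = np.random.normal(0, 1, 1000)
with uproot.recreate('./test.root') as f:
    # assigning a dict of branch-name -> array creates a TTree;
    # each array element becomes one entry, so the sample ordering is preserved
    f['Tree1'] = {'waveform': array}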

Related

Accessing each value in a big h5ad file

I am trying to access an h5ad file which was downloaded from the Human Cell Atlas.
Some packages have been loaded to assist with reading the file.
The first attempt is reading the whole file into memory.
import pandas as pd
import anndata as ad
from scipy.sparse import csr_matrix
adata = ad.read('local.h5ad', backed='r')
The content is 483152 x 58559.
Its obs are mostly cell-related content and the var are genes.
So, I am trying to get each cell x Ensembl_id (gene) value from adata.X.
Since R can use this approach to get data for two columns:
ad$X[,c("var1", "var2")]
I assume Python can use a similar approach:
adata.X['ENSG00000223972.5','macrophage']
or
adata.X['macrophage', 'ENSG00000223972.5']
But both attempts returned nothing.
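For reference, a minimal sketch of how this kind of lookup is usually done with AnnData, which is indexed by obs/var names rather than on .X directly. Here 'cell_type' is a hypothetical obs column holding the cell annotations; check adata.obs.columns for the real name:
import anndata as ad
adata = ad.read_h5ad('local.h5ad', backed='r')
# select all cells annotated as macrophages and one gene (a var_name)
subset = adata[adata.obs['cell_type'] == 'macrophage', 'ENSG00000223972.5']
X = subset.X  # the selected values, possibly as a sparse matrix
print(X.toarray() if hasattr(X, 'toarray') else X)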

Combine two FITS files by taking a 'pizza wedge' slice?

I would like to combine two FITS files, by taking a slice out of one and inserting it into the other. The slice would be based on an angle measured from the centre pixel, see example image below:
Can this be done using Astropy? There are many questions about combining FITS files on the site, but most of these are related to simply adding two files together, rather than combining segments like this.
Here is one recommended approach:
1. Read in your two files
Assuming the data is in an ImageHDU data array:
from astropy.io import fits
# read the numpy arrays out of the files
# assuming they contain ImageHDUs
data1 = fits.getdata('file1.fits')
data2 = fits.getdata('file2.fits')
2. Cut out the sections and put them into a new numpy array
Build up indices1 and indices2 for the desired sections in the new file, then use simple NumPy indexing to fill each section into a new array.
Taking inspiration from https://stackoverflow.com/a/18354475/15531842,
the sector_mask function defined in that answer can be used to get indices for each array via angular slices, as sketched below.
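A minimal sketch of such a sector_mask function, adapted from that linked answer (angles in degrees, centre given as (row, col)):
import numpy as np
def sector_mask(shape, centre, radius, angle_range):
    # boolean mask selecting a circular sector ("pizza wedge") of an image
    x, y = np.ogrid[:shape[0], :shape[1]]
    cx, cy = centre
    tmin, tmax = np.deg2rad(angle_range)
    # ensure the stop angle is greater than the start angle
    if tmax < tmin:
        tmax += 2 * np.pi
    # convert cartesian --> polar coordinates
    r2 = (x - cx) ** 2 + (y - cy) ** 2
    theta = np.arctan2(x - cx, y - cy) - tmin
    # wrap angles between 0 and 2*pi
    theta %= (2 * np.pi)
    # combine the radial and angular masks
    return (r2 <= radius * radius) * (theta <= (tmax - tmin))
With a mask function like this in hand, the wedge and its complement can be selected: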
mask = sector_mask(data1.shape, centre=(53,38), radius=100, angle_range=(280,340))
indices1 = ~mask
indices2 = mask
Then these indices can be used to transfer the data into a new array.
import numpy as np
newdata = np.zeros_like(data1)
newdata[indices1] = data1[indices1]
newdata[indices2] = data2[indices2]
If the coordinate system is well known, it may be possible to use Astropy's Cutout2D class, although I was not able to figure out how to use it fully; it wasn't clear from the example whether it can do an angular slice. See the Astropy example at https://docs.astropy.org/en/stable/nddata/utils.html#d-cutout-using-an-angular-size
3a. Then write out the new array as a new file.
If special header information is not needed in the new file, the numpy array with the new image can be written out to a FITS file with one line of Astropy code.
# this is an easy way to write a numpy array to FITS file
# no header information is carried over
fits.writeto('file_combined.fits', data=newdata)
3b. Carry the FITS header information over to the new file
If there is a desire to carry over header information, an ImageHDU can be built from the numpy array together with the desired header (an astropy.io.fits.Header, for example one read from an input file with fits.getheader).
# e.g. my_header = fits.getheader('file1.fits')
img_hdu = fits.ImageHDU(data=newdata, header=my_header)
hdu_list = fits.HDUList()
hdu_list.append(fits.PrimaryHDU())
hdu_list.append(img_hdu)
hdu_list.writeto('file_combined.fits')

Text transform with sklearn TF-IDF vectorizer generates too big csv file

I have 1000 texts; each text has 200-1000 words. The size of the text CSV file is about 10 MB. When I vectorize them with this code, the size of the output CSV is exceptionally big (2.5 GB). I am not sure what I did wrong. Your help is highly appreciated. Code:
import numpy as np
import pandas as pd
from copy import deepcopy
import glob
from sklearn.feature_extraction.text import TfidfVectorizer
from numpy import savetxt
df = pd.read_csv('data.csv')
#data has two columns: teks and groups
filtered_df = deepcopy(df)
vectorizer = TfidfVectorizer()
vectorizer.fit(filtered_df["teks"])
vector = vectorizer.transform(filtered_df["teks"])
print(vector.shape) # shape (1000, 83000)
savetxt('dataVectorized1.csv', vector.toarray(), delimiter=',')
Sparse matrices (like your vector here) are not supposed to be converted to dense ones (as you do with .toarray()) and saved as CSV files; doing so defeats the whole purpose of sparse matrices. Given that, the big size is not a surprise.
You should seriously consider saving your sparse vector to an appropriate format, e.g. using scipy.sparse:
import scipy.sparse
scipy.sparse.save_npz('dataVectorized1.npz', vector)
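The saved matrix can later be loaded back with scipy.sparse.load_npz, for example:
vector = scipy.sparse.load_npz('dataVectorized1.npz')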
See also Save / load scipy sparse csr_matrix in portable data format for possible other options.
If, for any reason, you must stick to a CSV file for storage, you could try compressing the output file by simply using the .gz extension in the file name; from the np.savetxt() documentation:
If the filename ends in .gz, the file is automatically saved in compressed gzip format. loadtxt understands gzipped files transparently.
So, this should do the job:
np.savetxt('dataVectorized1.csv.gz', vector.toarray(), delimiter=',')
However, I would not really recommend this; keep in mind that:
Beyond their convenience for tutorials and introductory expositions, CSV files do not really hold any "special" status as inputs to ML tasks, as you might seem to believe.
There is absolutely no reason why the much more efficient .npz file cannot be used as input for further downstream tasks, like classification, visualization, and clustering; on the contrary, its use is very much justified and recommended in similar situations.
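As a small illustration (a sketch only, using the 'groups' column mentioned in the code comment above as labels), scikit-learn estimators accept the sparse matrix directly:
import scipy.sparse
from sklearn.linear_model import LogisticRegression
X = scipy.sparse.load_npz('dataVectorized1.npz')
y = filtered_df['groups']  # labels from the original CSV
# no .toarray() needed; sparse input is supported
clf = LogisticRegression(max_iter=1000).fit(X, y)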

How can I save a list/matrix in binary format in KDB?

I am trying to save a matrix to file in binary format in KDB as per below:
matrix: (til 10)*/:til 10;
save matrix;
However, I get the error 'type.
I guess save only works with tables? In which case does anyone know of a workaround?
Finally, I would like to read the matrix from the binary file into Python with NumPy, which I presume is just:
import numpy as np
matrix = np.fromfile('C:/q/w32/matrix', dtype='f')
Is that right?
Note: I'm aware of KDB-Python libraries, but have been unable to install them thus far.
save does work, you just have to reference it by name.
save`matrix
You can also save using
`:matrix set matrix;
`:matrix 1: matrix;
But I don't think you'll be able to read this into Python directly using NumPy, as it is stored in kdb format. It could be read into Python using one of the Python-kdb interfaces (e.g. PyQ) or by storing it in a common format such as CSV.
Another option is to save in KDB+ IPC format and then read it into Python with qPython as a Pandas DataFrame.
On the KDB+ side you can save it with
matrix:(til 10)*/:til 10;
`:matrix.ipc 1: -8!matrix;
On the Python side you do
from pandas import DataFrame
from qpython.qreader import QReader
with open('matrix.ipc', 'rb') as f:
    matrix = DataFrame(QReader(f).read().data)
print(matrix)

Load mat file into pandas data frame

I have a problem loading a specific Matlab file into a pandas data frame (data). Basically, it is a multidimensional array consisting of 2x4 arrays and then either 3 or 4 columns. I searched through and found Convert mat file to pandas dataframe and matlab data file to pandas DataFrame, but neither works for me. Here is what I am using; this creates a pandas DataFrame for one of the nests. I would like to have a column which tells me where I am, i.e. the first column tells whether it is all_params[0] or all_params[1], the second distinguishes the next level for each of those, etc.
import numpy as np
import pandas as pd
import scipy.io as sio
all_params = sio.loadmat(path+'all_params')
all_params = all_params['all_params']
pd.DataFrame(all_params[0][0][:], columns=['par1', 'par2', 'par3'])
It must be simple; I'm just not able to figure it out. Or is there a way to do it directly using scipy or another loading tool?
Thanks
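One possible sketch (assuming the 2x4 nesting described above; the 'nest0'/'nest1' column names and the generated 'parN' names are placeholders) is to loop over the nests, tag each block with its indices, and concatenate:
import pandas as pd
import scipy.io as sio
all_params = sio.loadmat(path + 'all_params')['all_params']  # path as in the question
frames = []
for i, level0 in enumerate(all_params):
    for j, block in enumerate(level0):
        # name the columns by position, since nests have either 3 or 4 columns
        cols = ['par%d' % (k + 1) for k in range(block.shape[1])]
        df = pd.DataFrame(block, columns=cols)
        df.insert(0, 'nest0', i)  # which all_params[i] the rows came from
        df.insert(1, 'nest1', j)  # which sub-array within that nest
        frames.append(df)
data = pd.concat(frames, ignore_index=True)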
