Accessing each value in a big h5ad file - python

I am trying to access an h5ad file downloaded from the Human Cell Atlas.
I loaded a few packages to help with reading the file.
My first attempt is to read the file in backed mode:
import pandas as pd
import anndata as ad
from scipy.sparse import csr_matrix
adata = ad.read('local.h5ad', backed='r')
The resulting object has shape 483152 x 58559.
Its obs holds mostly cell-related annotations, and its var holds the genes.
So I am trying to get each cell x Ensembl ID (gene) value from adata.X.
In R, I can use this approach to get two columns of data:
ad$X[,c("var1", "var2")]
so I assumed Python could use a similar approach:
adata.X['ENSG00000223972.5','macrophage']
or
adata.X['macrophage', 'ENSG00000223972.5']
But both attempts returned nothing.
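For reference, a minimal sketch of the indexing AnnData does support: labels go on the AnnData object itself (obs names on rows, var names on columns), not on adata.X, and a cell type such as 'macrophage' is usually a value stored in an obs column rather than a row label. The 'cell_type' column name below is an assumption; the real name depends on the file.
import anndata as ad
adata = ad.read_h5ad('local.h5ad', backed='r')
# Label-based indexing works on the AnnData object, not on .X;
# rows are obs names and columns are var names. A cell type like
# 'macrophage' usually lives in an obs column, so build a mask first.
# 'cell_type' is an assumed column name; check adata.obs.columns.
mask = adata.obs['cell_type'] == 'macrophage'
subset = adata[mask, 'ENSG00000223972.5']
# subset.X is the (n_macrophages x 1) expression matrix for that gene.
values = subset.X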

Related

Difficulty combining csv files into a single file

My dataset looks at flight delays and cancellations from 2009 to 2018. Here are the important points to consider:
Each year is its own csv file, so '2009.csv', '2010.csv', all the way to '2018.csv'
Each file is roughly 700 MB
I used the following to combine the csv files:
import pandas as pd
import numpy as np
import os, sys
import glob
os.chdir('c:\\folder')
extension = 'csv'
all_filenames = [i for i in glob.glob('*.{}'.format(extension))]
combined_airline_csv = pd.concat([pd.read_csv(f) for f in all_filenames])
combined_airline_csv.to_csv('combined_airline_csv.csv', index=False, encoding='utf-8-sig')
When I run this, I receive the following message:
MemoryError: Unable to allocate 43.3 MiB for an array with shape (5674621,) and data type float64.
I am presuming that my file is too large and that I will need to run this on a virtual machine (e.g., AWS).
Any thoughts?
Thank you!
This is a duplicate of how to merge 200 csv files in Python.
Since you just want to combine them into one file, there is no need to load all the data into a DataFrame at once. Since they all have the same structure, I would advise creating one file writer, then opening each file with a file reader and writing (if we want to be fancy, let's call it streaming) the data line by line. Just be careful not to copy the headers each time, since you only want them once. Pandas is simply not the best tool for this task :)
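A minimal sketch of that streaming approach (assuming every file shares one header row; the output filename is illustrative and is skipped so it never reads itself):
import glob
out_name = 'combined_airline.csv'
header_written = False
with open(out_name, 'w', encoding='utf-8-sig') as out:
    for filename in sorted(glob.glob('*.csv')):
        if filename == out_name:        # don't re-read the output file
            continue
        with open(filename, 'r', encoding='utf-8') as f:
            header = f.readline()
            if not header_written:      # copy the header only once
                out.write(header)
                header_written = True
            for line in f:              # stream the rest line by line
                out.write(line)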
In general, this is a typical task that can also be done easily, and even faster, directly on the command line. (The exact command depends on the OS.)

Load mat file into pandas data frame

I have a problem loading a specific Matlab file into a pandas DataFrame. Basically, it is a multidimensional array consisting of 2x4 arrays, each with either 3 or 4 columns. I searched around and found Convert mat file to pandas dataframe and matlab data file to pandas DataFrame, but neither works for me. Here is what I am using. This creates a DataFrame for one of the nests. I would like to have columns that tell me where I am, i.e., the first column tells whether it is all_params[0] or all_params[1], the second distinguishes the next level for each of those, etc.
import numpy as np
import pandas as pd
import scipy.io as sio
all_params = sio.loadmat(path+'all_params')
all_params = all_params['all_params']
pd.DataFrame(all_params[0][0][:], columns=['par1', 'par2', 'par3'])
It must be simple; I'm just not able to figure it out. Or is there a way to do it directly using scipy or another loading tool?
Thanks
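A hedged sketch of one way to get the "where am I" columns: walk the nested object array, build one DataFrame per leaf, record the two nesting indices as columns, and concatenate. The filename and the par1/par2/... naming are assumptions; leaves with 3 and 4 columns simply leave NaN in the unused column after the concat.
import pandas as pd
import scipy.io as sio
all_params = sio.loadmat('all_params.mat')['all_params']  # assumed filename
frames = []
for i in range(len(all_params)):
    for j in range(len(all_params[i])):
        block = all_params[i][j]
        cols = ['par{}'.format(k + 1) for k in range(block.shape[1])]
        df = pd.DataFrame(block, columns=cols)
        df.insert(0, 'outer', i)   # which top-level nest
        df.insert(1, 'inner', j)   # which second-level nest
        frames.append(df)
data = pd.concat(frames, ignore_index=True, sort=False)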

Pyfits or astropy.io.fits add row to binary table in fits file

How can I add a single row to a binary table inside a large FITS file using pyfits, astropy.io.fits, or maybe some other Python library?
This file is used as a log: every second a single row will be added, and eventually the size of the file will reach gigabytes, so reading the whole file and writing it back, or keeping a copy of the data in memory and writing it to the file every second, is not feasible. With pyfits or astropy.io.fits, so far I could only read everything into memory, add the new row, and then write it back.
Example. I create fits file like this:
import numpy, pyfits
data = numpy.array([1.0])
col = pyfits.Column(name='index', format='E', array=data)
cols = pyfits.ColDefs([col])
tbhdu = pyfits.BinTableHDU.from_columns(cols)
tbhdu.writeto('test.fits')
And I want to add a new value to the column 'index', i.e., add one more row to the binary table.
Solution: This is a trivial task for the cfitsio library (method fits_insert_row(...)), so I used a Python module that is based on it: https://github.com/esheldon/fitsio
And here is the solution using fitsio. To create a new FITS file, one can do:
import fitsio, numpy
from fitsio import FITS,FITSHDR
fits = FITS('test.fits','rw')
data = numpy.zeros(1, dtype=[('index','i4')])
data[0]['index'] = 1
fits.write(data)
fits.close()
To append a row:
fits = FITS('test.fits','rw')
# You can actually reuse the same already-opened FITS file;
# to flush the changes you just need fits.reopen()
data = numpy.zeros(1, dtype=[('index','i4')])
data[0]['index'] = 2
fits[1].append(data)
fits.close()
Thank you for your help.

Extracting data from a large csv file:causes dtype warnings

I work for a company, and I recently switched from a spreadsheet package to Python. Since I am very new to Python, there are a lot of things I have difficulty grasping. Using Python, I am trying to extract data from a large csv file (37791 rows and 316 columns). Here is a piece of code I wrote:
Solution 1
import numpy as np
import pandas as pd
df = pd.read_csv('C:\\Users\\Maxwell\\Desktop\\Test.data.csv', skiprows=1)
data = df.loc[:, ['Steps', 'Parameter']]
This command generates an error, i.e., it gives a DtypeWarning: Columns (0,1,2,3,...,81) have mixed types. Specify dtype option on import or set low_memory=False.
So, I found a workaround.
Solution 2
import pandas as pd
import numpy as np
df = pd.read_csv('C:\\Users\\Maxwell\\Desktop\\Test.data.csv', skiprows=1, error_bad_lines=False, index_col=False, dtype='unicode')
data = df.loc[:, ['Steps', 'Parameter']]
Two questions:
i) I was able to get around the error, but now the columns that I want (Steps and Parameter) have been converted to objects (probably due to the dtype='unicode' option). How can I convert the Steps column to an integer type and Parameter to a float?
ii) Some people say that the dtype warning isn't really an error. But I found that when I use Solution 1 to read the csv file, the Steps column contains some floats. The original csv file doesn't have any floats in the Steps column; it looks as if some floats have been introduced by Python itself! Why does this happen?
(I am not able to upload the original csv file, because my company doesn't allow it!)
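For question i), a hedged sketch of the explicit conversion: read everything as strings (sidestepping the per-chunk type guessing), then convert only the columns you need with pd.to_numeric. The path and column names are taken from the question.
import pandas as pd
# Read everything as strings so pandas does no type inference,
# then convert the two columns of interest explicitly.
df = pd.read_csv('C:\\Users\\Maxwell\\Desktop\\Test.data.csv', skiprows=1, dtype='unicode')
df['Steps'] = pd.to_numeric(df['Steps']).astype('int64')
# Note: .astype('int64') fails if Steps has missing values;
# use the nullable 'Int64' dtype in that case.
df['Parameter'] = pd.to_numeric(df['Parameter']).astype('float64')
data = df.loc[:, ['Steps', 'Parameter']]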

open .dat file in python

I have a .dat file and I want to read its contents in Python, so I use the following code:
import numpy as np
bananayte=np.fromfile("U04_banana-ytest.dat",dtype=float)
print(bananayte)
However, my initial data should look like "1.0000000e+00", while the output looks like "1.39804066e-76". What happened, and what should I do to get the correct values? Thanks!
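np.fromfile with dtype=float interprets the file as raw binary doubles; the quoted value "1.0000000e+00" suggests the file is actually ASCII text. In that case, a hedged fix (assuming whitespace-separated numbers) is to parse it as text:
import numpy as np
# If the .dat file holds text like "1.0000000e+00" rather than raw
# binary doubles, parse it as text instead of using np.fromfile.
bananayte = np.loadtxt("U04_banana-ytest.dat")
print(bananayte)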
