Construct a 1-dimensional array from several data files - Python

I'm trying to write a small Python script to do some data analysis. I have several data files, each one a single column of data. I know how to import each one in Python using numpy.loadtxt, which gives me back an ndarray, but I can't figure out how to concatenate these ndarrays; numpy.concatenate and numpy.append keep giving me error messages, even when I try to flatten the arrays first.
Are you aware of a solution?
OK, since you asked for code and data details, here is what my data files look like:
1.4533423
1.3709900
1.7832323
...
Just a column of floats. I have no problem importing a single file using:
data = numpy.loadtxt("data_filename")
My code for concatenating the arrays currently looks like this (after trying numpy.concatenate and numpy.append, I'm now trying numpy.insert):
data = numpy.zeros(0)  # create an empty array that each file's data will be appended to
for filename in sys.argv[1:]:
    temp = numpy.loadtxt(filename)
    numpy.insert(data, numpy.arange(len(temp), temp))
I'm passing the filenames when running my script with:
./my_script.py ALL_THE_DATAFILES
And the error message I get is:
TypeError: only length-1 arrays can be converted to Python scalars

numpy.concatenate will definitely be a valid choice; without sample data, sample code, and the corresponding error messages we cannot help further.
Alternatives would be numpy.r_ or numpy.hstack (numpy.s_ only builds slice objects and does not concatenate).
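For example, numpy.r_ joins 1-D arrays using indexing syntax; a toy sketch:
import numpy as np

a = np.array([1.0, 2.0])
b = np.array([3.0, 4.0])

# np.r_ concatenates its arguments along the first axis
combined = np.r_[a, b]  # array([1., 2., 3., 4.])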
EDIT
This code snippet reads every file given on the command line and concatenates the resulting arrays:
import sys
import numpy as np
filenames = sys.argv[1:]
# one 1-D array per input file
arrays = [np.loadtxt(filename) for filename in filenames]
# join them end to end into a single 1-D array
final_array = np.concatenate(arrays, axis=0)
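One caveat worth noting: if a data file contains only a single value, np.loadtxt returns a 0-d array, which np.concatenate rejects; passing ndmin=1 to np.loadtxt (or wrapping each result in np.atleast_1d) guards against that edge case.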

Related

Why do Pandas dataframe's data types change after exporting into a CSV file

I exported the following dataframe in Google Colab. Whichever method I use, when I import it later, my dataframe appears as a pandas.core.series.Series, not as an array.
from google.colab import drive
drive.mount('/content/drive')
path = '/content/drive/My Drive/output.csv'
with open(path, 'w', encoding='utf-8-sig') as f:
    df_protein_final.to_csv(f)
After importing, the dataframe looks like this:
pandas.core.series.Series
Note: the first and second images may show the numbers in a different order (they can look like different datasets). Please don't get hung up on this; those images are just examples.
Why does the column, which was originally an array before exporting, convert to a Series after exporting?
The code below gives the same result; I can't export the original structure.
from google.colab import files
df.to_csv('filename.csv')
files.download('filename.csv')
Edit: I am looking for a solution: is there any way to keep the original structure (e.g. an array) while exporting?
Actually, that is how pandas works. When you insert a list or a NumPy array into a pandas DataFrame, it always converts that array to a Series. If you want to turn the Series back into a list or an array, use Series.values, Series.array, or Series.to_numpy().
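A toy sketch of converting a Series back into plain NumPy or Python structures:
import pandas as pd

s = pd.Series([1, 2, 3])   # toy Series
as_array = s.to_numpy()    # numpy array([1, 2, 3])
as_list = s.tolist()       # plain list [1, 2, 3]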
EDIT:
I got an idea from your comments. You are asking how to save a DataFrame to a file while preserving all of its properties; you are actually (intentionally or unintentionally) asking how to SERIALIZE the data frame. You have to use pickle for this.
Note: pandas has built-in pickle support, so you can export a DataFrame directly to a pickle file:
df.to_pickle(file_name)
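A minimal round-trip sketch, using the question's df_protein_final and a hypothetical file name:
import pandas as pd

df_protein_final.to_pickle('output.pkl')    # serialize, preserving dtypes and nested objects
df_restored = pd.read_pickle('output.pkl')  # an identical DataFrame comes back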

Combine two FITS files by taking a 'pizza wedge' slice?

I would like to combine two FITS files, by taking a slice out of one and inserting it into the other. The slice would be based on an angle measured from the centre pixel, see example image below:
Can this be done using Astropy? There are many questions about combining FITS files on the site, but most of them relate to simply adding two files together, rather than combining segments like this.
Here is one recommended approach:
1. Read in your two files
Assuming the data is in an ImageHDU data array:
from astropy.io import fits
# read the numpy arrays out of the files
# assuming they contain ImageHDUs
data1 = fits.getdata('file1.fits')
data2 = fits.getdata('file2.fits')
2. Cut out the sections and put them into a new numpy array
Build up indices1 and indices2 for the desired sections of the new file; a simple numpy index then fills each section into a new numpy array.
Inspired by https://stackoverflow.com/a/18354475/15531842, the sector_mask function defined in that answer gives indices for each array using angular slices.
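For reference, a sketch of such a function, adapted from that linked answer:
import numpy as np

def sector_mask(shape, centre, radius, angle_range):
    """Boolean mask for a circular sector; angle_range is (start, stop) in degrees, clockwise."""
    x, y = np.ogrid[:shape[0], :shape[1]]
    cx, cy = centre
    tmin, tmax = np.deg2rad(angle_range)

    # ensure the stop angle is greater than the start angle
    if tmax < tmin:
        tmax += 2 * np.pi

    # convert cartesian --> polar coordinates
    r2 = (x - cx) ** 2 + (y - cy) ** 2
    theta = np.arctan2(x - cx, y - cy) - tmin

    # wrap angles between 0 and 2*pi
    theta %= (2 * np.pi)

    # keep pixels inside the radius and inside the angular wedge
    return (r2 <= radius * radius) & (theta <= (tmax - tmin))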
mask = sector_mask(data1.shape, centre=(53,38), radius=100, angle_range=(280,340))
indices1 = ~mask
indices2 = mask
Then these indices can be used to transfer the data into a new array.
import numpy as np
newdata = np.zeros_like(data1)
newdata[indices1] = data1[indices1]
newdata[indices2] = data2[indices2]
If the coordinate system is well known, it may be possible to use astropy's Cutout2D class, although I was not able to figure out how to use it fully; from the example given, it wasn't clear whether it can do an angular slice. See the astropy example at https://docs.astropy.org/en/stable/nddata/utils.html#d-cutout-using-an-angular-size
3a. Write out the new array as a new file
If special header information is not needed in the new file, the numpy array with the new image can be written out to a FITS file with one line of astropy code.
# this is an easy way to write a numpy array to FITS file
# no header information is carried over
fits.writeto('file_combined.fits', data=newdata)
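Note that fits.writeto raises an error if the output file already exists; passing overwrite=True tells it to replace the file.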
3b. Carry the FITS header information over to the new file
If there is a desire to carry over header information, an ImageHDU can be built from the numpy array together with the desired header (a fits.Header object, which can also be constructed from a dictionary).
img_hdu = fits.ImageHDU(data=newdata, header=my_header_dict)
hdu_list = fits.HDUList()
hdu_list.append(fits.PrimaryHDU())
hdu_list.append(img_hdu)
hdu_list.writeto('file_combined.fits')
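If the goal is simply to reuse the header of one of the input files, here is a sketch (assuming file1.fits carries the header to keep):
from astropy.io import fits

# read the header from the first input file
header1 = fits.getheader('file1.fits')

# build the HDU with the copied header and write everything out
img_hdu = fits.ImageHDU(data=newdata, header=header1)
fits.HDUList([fits.PrimaryHDU(), img_hdu]).writeto('file_combined.fits', overwrite=True)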

How do I use the numpy.savetxt function to save this array to a specific folder?

I am trying to save this array and then move it to a folder, but it won't let me, because (I think) it is not the right type. I can't seem to figure it out, so any help would be appreciated. Here is the code I am trying to use; the location variable is defined as the path to the folder I want to move the file to after saving.
This worked when the array had only one dimension (e.g. b = np.array([1,2,3,4,5])), but if I add another element in front, as in the code below, it fails.
import numpy as np
import shutil
b = [5,np.array([1,2,2,3,6,7])]
np.savetxt('hi',b)
shutil.move('hi',location)
I got the following error message:
Mismatch between array dtype ('object') and format specifier ('%.18e')
np.savetxt only saves array-like objects, so when b is the original array it works fine. In the code above, however, you set b to a list containing the number 5 and an array.
I'm not quite sure where you want that five to end up, but you will need to insert it into the actual array. You might use np.insert to put the value at the beginning of the array, for example:
b = np.insert(b,0,5)
I might also add that you can save the file directly to the desired location, which eliminates the need for shutil: just give np.savetxt a path that is valid from where the program is run and you should be good to go.
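A short sketch putting both fixes together (location here is a hypothetical destination folder):
import os
import numpy as np

location = '/path/to/folder'  # hypothetical destination

b = np.array([1, 2, 2, 3, 6, 7])
b = np.insert(b, 0, 5)  # put the 5 at the front: [5 1 2 2 3 6 7]

# save straight into the destination; no shutil.move needed
np.savetxt(os.path.join(location, 'hi.txt'), b)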

Trying to size down HDF5 File by changing index field types using h5py

I have a very large CSV file (~12 GB) that looks something like this:
posX,posY,posZ,eventID,parentID,clockTime
-117.9853515625,60.2998046875,0.29499998688697815,0,0,0
-117.9853515625,60.32909393310547,0.29499998688697815,0,0,0
-117.9560546875,60.2998046875,0.29499998688697815,0,0,0
-117.9560546875,60.32909393310547,0.29499998688697815,0,0,0
-117.92676544189453,60.2998046875,0.29499998688697815,0,0,0
-117.92676544189453,60.32909393310547,0.29499998688697815,0,0,0
-118.04051208496094,60.34012985229492,4.474999904632568,0,0,0
-118.04051208496094,60.36941909790039,4.474999904632568,0,0,0
-118.04051208496094,60.39870834350586,4.474999904632568,0,0,0
I want to convert this CSV file into the HDF5 format using the library h5py, while also lowering the total file size by setting the field/index types, e.g.:
Save posX, posY and posZ as float32. Save eventID, parentID and clockTime as int32 or something along those lines.
Note: I need to chunk the data in some form when I read it in, to avoid memory errors.
However, I am unable to get the desired result. What I have tried so far:
Using pandas' own methods, following this guide: How to write a large csv file to hdf5 in python?
This creates the file, but I'm somehow unable to change the types, and the file remains too big (~10.7 GB); the field types are float64 and int64.
I also tried to split the CSV into parts (using split -n x myfile.csv) before working with the increments. I ran into some data errors at the beginning and end of each file, which I was able to fix by removing those lines with sed. Then I tried out the following code:
import pandas as pd
import h5py
PATH_csv = "/home/MYNAME/Documents/Workfolder/xaa" #xaa is my csv increment
DATA_csv = pd.read_csv(PATH_csv)
with h5py.File("pct_data-hdf5.h5", "a") as DATA_hdf:
dset = DATA_hdf.create_dataset("posX", data=DATA_csv["posX"], dtype="float32")
Sadly this created the file and the dataset but didn't write any data into it.
Expectation
Creating an HDF5 file containing the data of a large CSV file, while also changing the variable type of each index.
If something is unclear, please ask me for clarification. I'm still a beginner!
Have you considered the numpy module?
It has a handy function (genfromtxt) to read CSV data with headers into a NumPy array. You define the dtype, and the resulting array is suitable for loading into HDF5 with the create_dataset() method of an h5py File object.
See the code below. I included 2 print statements: the first shows the dtype names created from the CSV headers, and the second shows how you can access the data in the numpy array by field (column) name.
import h5py
import numpy as np
PATH_csv = 'SO_55576601.csv'
# float32 for the positions, int32 for the IDs, matching the sizes requested above
csv_dtype = ('f4', 'f4', 'f4', 'i4', 'i4', 'i4')
csv_data = np.genfromtxt(PATH_csv, dtype=csv_dtype, delimiter=',', names=True)
print(csv_data.dtype.names)
print(csv_data['posX'])
with h5py.File('SO_55576601.h5', 'w') as h5f:
    dset = h5f.create_dataset('CSV_data', data=csv_data)
# no explicit close needed; the with block closes the file
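To process the full 12 GB file without reading it all at once, here is a rough sketch (untested, assuming the column layout shown above) that reads the CSV in chunks with pandas and appends each chunk to a resizable h5py dataset:
import h5py
import numpy as np
import pandas as pd

PATH_csv = 'pct_data.csv'  # hypothetical path to the full CSV

# target compound dtype: float32 positions, int32 IDs
h5_dtype = np.dtype([('posX', 'f4'), ('posY', 'f4'), ('posZ', 'f4'),
                     ('eventID', 'i4'), ('parentID', 'i4'), ('clockTime', 'i4')])

with h5py.File('pct_data-hdf5.h5', 'w') as h5f:
    dset = h5f.create_dataset('CSV_data', shape=(0,), maxshape=(None,), dtype=h5_dtype)
    for chunk in pd.read_csv(PATH_csv, chunksize=1_000_000):
        # convert the chunk to a structured array with the smaller dtype
        rec = np.empty(len(chunk), dtype=h5_dtype)
        for name in h5_dtype.names:
            rec[name] = chunk[name].to_numpy()
        # grow the dataset and append the chunk at the end
        dset.resize(dset.shape[0] + len(rec), axis=0)
        dset[-len(rec):] = rec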

Memory runs out when creating a numpy array from a list of lists in python

I want to read in a tab-separated value file and turn it into a numpy array. The file has 3 lines and looks like this:
Ann1 Bill1 Chris1 Dick1
Ann2 Bill2 Chris2 "Dick2
Ann3 Bill3 Chris3 Dick3
So, I used this simple bit of code:
import csv
import numpy as np

new_list = []
with open('/home/me/my_tsv.tsv') as tsv:
    for line in csv.reader(tsv, delimiter="\t"):
        new_list.append(line)
new = np.array(new_list)
print(new.shape)
And because of that pesky " character, the shape of my fancy new numpy array is
(2,4)
That's not right! So, the solution is to pass the quoting argument in the csv.reader call, as follows:
for line in csv.reader(tsv, delimiter="\t", quoting=csv.QUOTE_NONE):
This is great! Now my dimension is
(3,4)
as I hoped.
Now comes the real problem: in all actuality, I have a 700,000 x 10 .tsv file with long fields. I can read the file into Python with no problems, just like in the above situation. But when I get to the step where I create new = np.array(new_list), my wimpy 16 GB laptop cries and says...
MEMORY ERROR
Obviously, I cannot have these two objects in memory simultaneously: the Python list of lists and the numpy array.
My question is therefore this: how can I read this file directly into a numpy array, perhaps using genfromtxt or something similar, yet also achieve what I achieved by using the quoting=csv.QUOTE_NONE argument in csv.reader?
So far, I've found no analogue of the quoting=csv.QUOTE_NONE option anywhere in the standard numpy ways of reading in a tsv file.
This is a tough little problem. I thought about iteratively building the numpy array during the reading-in process, but I can't figure it out.
I tried
nparray = np.genfromtxt("/home/me/my_tsv.tsv", delimiter="\t")
print(nparray.shape)
and got
(3,0)
If anyone has any advice, I'd greatly appreciate it. Also, I know the real answer is probably to use Pandas... but at this point I'm committed to using numpy for a lot of compelling reasons...
Thanks in advance.
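One way to build the array during reading, sketched under two assumptions that are not in the question: the row count can be counted in a cheap first pass, and a fixed-width string dtype is acceptable for the fields:
import csv
import numpy as np

path = '/home/me/my_tsv.tsv'
n_cols = 10

# first pass: count rows so the array can be preallocated
with open(path) as tsv:
    n_rows = sum(1 for _ in tsv)

# 'U100' assumes no field is longer than 100 characters; adjust as needed
arr = np.empty((n_rows, n_cols), dtype='U100')

# second pass: fill the array row by row, never holding a full list of lists
with open(path) as tsv:
    reader = csv.reader(tsv, delimiter='\t', quoting=csv.QUOTE_NONE)
    for i, row in enumerate(reader):
        arr[i] = row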
