How to import a .dat file in Google Colab (Python)

I am implementing the famous Iris classification problem in Python for the first time. I have a data file named iris.data that I need to import into my Python project. I am trying this in Google Colab.
Sample data
Attributes are:
1. sepal length in cm
2. sepal width in cm
3. petal length in cm
4. petal width in cm
5. class:
5.1,3.5,1.4,0.2,Iris-setosa
4.9,3.0,1.4,0.2,Iris-setosa
4.7,3.2,1.3,0.2,Iris-setosa
4.6,3.1,1.5,0.2,Iris-setosa
5.0,3.6,1.4,0.2,Iris-setosa
5.4,3.9,1.7,0.4,Iris-setosa
4.6,3.4,1.4,0.3,Iris-setosa
5.0,3.4,1.5,0.2,Iris-setosa
I wrote:
import torch
import numpy as np
import matplotlib.pyplot as plt
FILE_PATH = "E:\iris dataset"
MAIN_FILE_NAME = "iris.dat"
data = np.loadtxt(FILE_PATH+MAIN_FILE_NAME, delimiter=",")
But it did not work and threw errors.
It worked when I ran the code on Linux, but I am currently using Windows 10, where it does not work.
Thank you in advance for your help.

When constructing the file name for np.loadtxt, a \ is missing: FILE_PATH+MAIN_FILE_NAME evaluates to 'E:\iris datasetiris.dat'. To avoid adding the \ manually between FILE_PATH and MAIN_FILE_NAME, you could use os.path.join, which inserts the separator for you.
import os
import numpy as np
FILE_PATH = r'E:\iris dataset'  # raw string, so the backslash is kept literally
MAIN_FILE_NAME = 'iris.dat'
# This call still fails, because the last column of the file holds strings.
data = np.loadtxt(os.path.join(FILE_PATH, MAIN_FILE_NAME), delimiter=',')
On the other hand, I am not sure why it worked on Linux, because NumPy cannot convert the string "Iris-setosa" into a number, which is what np.loadtxt tries to do. If you are only interested in the numeric values, you could use the usecols keyword of np.loadtxt:
data = np.loadtxt(os.path.join(FILE_PATH, MAIN_FILE_NAME), delimiter=',', usecols=(0, 1, 2, 3))
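If the class labels are needed as well, one option is to read the file with pandas instead. The following is only a minimal sketch: the eight sample rows from the question are held in an in-memory string so that it does not depend on where the file lives, and the column names are assumptions taken from the attribute list, since the file itself has no header row.

```python
import io

import pandas as pd

# The eight sample rows from the question, kept in memory for illustration.
SAMPLE = """5.1,3.5,1.4,0.2,Iris-setosa
4.9,3.0,1.4,0.2,Iris-setosa
4.7,3.2,1.3,0.2,Iris-setosa
4.6,3.1,1.5,0.2,Iris-setosa
5.0,3.6,1.4,0.2,Iris-setosa
5.4,3.9,1.7,0.4,Iris-setosa
4.6,3.4,1.4,0.3,Iris-setosa
5.0,3.4,1.5,0.2,Iris-setosa
"""

# Assumed column names (not present in the file itself).
COLUMNS = ["sepal_length", "sepal_width", "petal_length", "petal_width", "class"]

# header=None tells pandas that the first line is data, not column names.
iris = pd.read_csv(io.StringIO(SAMPLE), header=None, names=COLUMNS)

# Split into a numeric feature matrix and an array of label strings.
features = iris[COLUMNS[:4]].to_numpy()
labels = iris["class"].to_numpy()

print(features.shape)
print(labels[:3])
```

With a real file, the io.StringIO(SAMPLE) argument would be replaced by the path to the file.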

Related

Data loading in Python

I am trying to read in data, using numpy or pandas, that has previously been apertured by some camera software. The nominal matrix is 160 x 160 pixels with the aperture switched off, but when the auto aperture is on I lose most of my data during import. I assume this is because I am not handling the whitespace or NaNs appropriately. Does anyone have suggestions on how to handle the data with the auto aperture ON?
from tkinter import filedialog
import numpy as np
import pandas as pd
file_path = filedialog.askopenfilename()
try:
    # First attempt: comma-separated values
    my_data2 = np.genfromtxt(file_path, delimiter=',')
except ValueError:
    # Fall back to space-separated values
    my_data2 = np.genfromtxt(file_path, delimiter=' ')
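The exact layout of the camera file is not shown, so the following is only a sketch under an assumption: that apertured pixels are written as empty or whitespace-only fields in a comma-separated file. The SAMPLE string is hypothetical. In that case np.genfromtxt can keep the full matrix shape and mark the missing pixels as NaN.

```python
import io

import numpy as np

# Hypothetical 3 x 4 frame in which apertured pixels were written as empty fields.
SAMPLE = """1.0, 2.0, , 4.0
, , 7.0, 8.0
9.0, 10.0, 11.0,
"""

frame = np.genfromtxt(
    io.StringIO(SAMPLE),
    delimiter=",",
    autostrip=True,         # strip whitespace around every field
    filling_values=np.nan,  # make the value used for empty fields explicit
)

# Boolean mask of the pixels that survived the aperture.
valid = ~np.isnan(frame)

print(frame.shape)
print(int(valid.sum()), "valid pixels")
print(np.nanmean(frame))  # statistics that ignore the missing pixels
```

If the rows of the real file have differing numbers of fields, this approach will not apply as written, and the file would need to be inspected first.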

Read a HDF data to a 3d array and save as a dataframe in python

I am currently working on the NASA aerosol optical depth data (MCD19A2), which is a NASA satellite level-three product. I have uploaded the data. I want to save the data as a dataframe that includes the longitude, latitude, and values. I have successfully converted the 0.47um band file into a three-dimensional array. How can I convert this array into a correct dataframe that includes X, Y and the value?
Below is the code I have tried:
from osgeo import gdal
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
rds = gdal.Open("MCD19A2.A2006001.h26v04.006.2018036214627.hdf")
names=rds.GetSubDatasets()
names[0][0]
# 'HDF4_EOS:EOS_GRID:"MCD19A2.A2006001.h26v04.006.2018036214627.hdf":grid1km:Optical_Depth_047'
aod_047 = gdal.Open(names[0][0])
a47=aod_047.ReadAsArray()
a47[1].shape
# (1200, 1200)
I would like the result to look like this:
X (n=1200)    Y (n=1200)    AOD_047
8896067       5559289       0.0123
I know that in R this can be done by
require('gdalUtils')
require('raster')
require('rgdal')
file.name<-"MCD19A2.A2006001.h26v04.006.2018036214627.hdf"
sds <- get_subdatasets(file.name)
gdal_translate(sds[1], dst_dataset = paste0('tmp047', basename(file.name), '.tiff'), b = nband)
r.047 <- raster(paste0('tmp047', basename(file.name), '.tiff'))
df.047 <- raster::as.data.frame(r.047, xy = T)
names(df.047)[3] <- 'AOD_047'
However, R is memory-hungry, and writing and then reading the 'tif' file uses a lot of memory, so I want to do this task in Python. Thanks a lot for your help.
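You can try pandas, although pd.read_hdf expects an HDF5 file in PyTables layout, so it may not open this HDF4-EOS product: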
import pandas as pd
df=pd.read_hdf('filename.hdf')
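Another possibility is to build the X and Y columns directly from the array and the raster's geotransform. The sketch below assumes a single 2D slice such as a47[1] and a six-element geotransform in the GDAL convention (as returned by GetGeoTransform()). The function name array_to_xy_frame and the small example values are hypothetical, and the coordinates refer to pixel centres.

```python
import numpy as np
import pandas as pd


def array_to_xy_frame(band, geotransform, value_name="AOD_047"):
    """Flatten a 2D raster band into a DataFrame of X, Y and value columns.

    geotransform follows the GDAL convention:
    (x_origin, pixel_width, row_rotation, y_origin, column_rotation, pixel_height)
    """
    x0, dx, rx, y0, ry, dy = geotransform
    rows, cols = np.indices(band.shape)
    # The + 0.5 shifts from the pixel's upper-left corner to its centre.
    x = x0 + (cols + 0.5) * dx + (rows + 0.5) * rx
    y = y0 + (cols + 0.5) * ry + (rows + 0.5) * dy
    return pd.DataFrame(
        {"X": x.ravel(), "Y": y.ravel(), value_name: band.ravel()}
    )


# Hypothetical 2 x 3 band and geotransform, standing in for a47[1] and
# aod_047.GetGeoTransform() from the question.
band = np.arange(6, dtype=float).reshape(2, 3)
geotransform = (100.0, 10.0, 0.0, 500.0, 0.0, -10.0)

df = array_to_xy_frame(band, geotransform)
print(df)
```

For a 1200 x 1200 slice this produces 1,440,000 rows without writing an intermediate file.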

How does numpy.memmap work on HDF5 with multiple datasets?

I'm trying to memory-map individual datasets in an HDF5 file:
import h5py
import numpy as np
import numpy.random as rdm
n = int(1E+8)
rdm.seed(70)
dset01 = rdm.rand(n)
dset02 = rdm.normal(0, 1, size=n).astype(np.float32)
with h5py.File('foo.h5', mode='w') as f0:
    f0.create_dataset('dset01', data=dset01)
    f0.create_dataset('dset02', data=dset02)
fp = np.memmap('foo.h5', mode='r', dtype='double')
print(dset01[:3])
print(fp[:3])
del fp
However, the outputs below indicate that values in fp don't match those in dset01.
[0.92748054 0.87242629 0.58463127]
[5.29239776e-260 1.11688278e-308 5.18067355e-318]
I am guessing that I should have set an 'offset' value when calling np.memmap. Is that the mistake in my code? If so, how do I find out the correct offset of each dataset in an HDF5 file?
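One possible approach, sketched here under the assumption that the datasets use contiguous storage (no chunking or compression), is to ask h5py for each dataset's byte offset through the low-level get_offset() method and pass it to np.memmap together with the dataset's own dtype and shape. The file is written to a temporary directory and the arrays are kept small for illustration.

```python
import os
import tempfile

import h5py
import numpy as np

rng = np.random.default_rng(70)
dset01 = rng.random(1000)
dset02 = rng.normal(0, 1, size=1000).astype(np.float32)

path = os.path.join(tempfile.mkdtemp(), "foo.h5")
with h5py.File(path, mode="w") as f0:
    f0.create_dataset("dset01", data=dset01)
    f0.create_dataset("dset02", data=dset02)

# Collect offset, dtype and shape for each dataset while the file is open.
layout = {}
with h5py.File(path, mode="r") as f0:
    for name in ("dset01", "dset02"):
        ds = f0[name]
        offset = ds.id.get_offset()  # None for chunked or compact datasets
        if offset is None:
            raise ValueError(f"{name} has no single contiguous offset")
        layout[name] = (offset, ds.dtype, ds.shape)

# Map each dataset separately, using its own dtype, shape and offset.
off1, dt1, shp1 = layout["dset01"]
off2, dt2, shp2 = layout["dset02"]
fp01 = np.memmap(path, mode="r", dtype=dt1, shape=shp1, offset=off1)
fp02 = np.memmap(path, mode="r", dtype=dt2, shape=shp2, offset=off2)

print(dset01[:3])
print(fp01[:3])
```

Note that dset02 is float32, so mapping the whole file as 'double' could not have matched it even with a correct offset.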

Writing axes to FITS file

Is there a way to write existing xy axes to a FITS file along with the data itself in Python?
For example here is some simple code saving a matrix to a FITS file named TestFITS:
import numpy as np
from astropy.io import fits
test_matrix = np.random.uniform(0,1,[5,3])
x = np.arange(5,5+len(test_matrix[:,0]))
y = np.arange(5,5+len(test_matrix[0,:]))
hdu = fits.PrimaryHDU(test_matrix)
hdu.writeto('TestFITS')
But if I wished to save x and y to the file as well could that be done?
You could save them as one-dimensional ImageHDUs in two extensions, next to the PrimaryHDU:
import numpy as np
from astropy.io import fits
test_matrix = np.random.uniform(0,1,[5,3])
x = np.arange(5,5+len(test_matrix[:,0]))
y = np.arange(5,5+len(test_matrix[0,:]))
fits.HDUList([
    fits.PrimaryHDU(test_matrix),
    fits.ImageHDU(x, name='X'),
    fits.ImageHDU(y, name='Y'),
]).writeto('testxy.fits')
(The name parameter is not necessary, but can be a nice convenience.)
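Reading the axes back could then look like the following sketch. It repeats the write step so that it is self-contained, and it writes to a temporary directory (an assumption made here so that the example does not fail on an existing testxy.fits). The extensions are looked up by the names given above.

```python
import os
import tempfile

import numpy as np
from astropy.io import fits

test_matrix = np.random.uniform(0, 1, [5, 3])
x = np.arange(5, 5 + test_matrix.shape[0])
y = np.arange(5, 5 + test_matrix.shape[1])

path = os.path.join(tempfile.mkdtemp(), "testxy.fits")
fits.HDUList([
    fits.PrimaryHDU(test_matrix),
    fits.ImageHDU(x, name="X"),
    fits.ImageHDU(y, name="Y"),
]).writeto(path)

# Open the file again and fetch each array; copy so the data
# stays usable after the file is closed.
with fits.open(path) as hdul:
    data_back = hdul[0].data.copy()
    x_back = hdul["X"].data.copy()
    y_back = hdul["Y"].data.copy()

print(x_back)
print(y_back)
```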

Importing the Wisconsin breast cancer dataset in Python with scikit-learn for machine learning problems

Hello, I am trying to import a dataset in Spyder:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
dataset = pd.read_csv('breast-cancer-wisconsin.data1.csv')
X = dataset.iloc[:,0:9].values
y= dataset.iloc[:,9].values
But when I display the X matrix in the Variable Explorer, it says that object arrays are currently not supported.
Try this:
X = dataset.drop('column_9', axis=1).values
y = dataset['column_9'].values
Just replace column_9 with whatever the target column's name is.
Spyder cannot actually display object arrays in the Variable Explorer; it can only show the DataFrame itself. The Spyder team has promised to add object array support in Spyder 4 (to be released later in 2019).
You can also load the data directly from scikit-learn:
from sklearn.datasets import load_breast_cancer
cancer = load_breast_cancer()
X = cancer.data
y = cancer.target
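As for why X ends up as an object array in the first place: a common cause is a non-numeric placeholder somewhere in an otherwise numeric column. The sketch below assumes that missing values in the CSV are marked with "?"; the sample rows and column names are hypothetical, and the real file should be checked before relying on this. Telling pd.read_csv about the placeholder through na_values lets the column be parsed as numbers.

```python
import io

import pandas as pd

# Hypothetical excerpt: three feature columns plus a target,
# with "?" standing in for a missing value.
SAMPLE = """f1,f2,f3,target
5,1,1,2
5,4,?,2
3,1,2,4
"""

# Without na_values, the "?" forces column f3 to a non-numeric dtype.
raw = pd.read_csv(io.StringIO(SAMPLE))
print(raw.dtypes)

# With na_values, "?" becomes NaN and f3 is parsed as a float column.
clean = pd.read_csv(io.StringIO(SAMPLE), na_values="?")
X = clean.iloc[:, 0:3].to_numpy()
y = clean.iloc[:, 3].to_numpy()
print(X.dtype)
```

The resulting NaN entries still need to be dropped or imputed before fitting a model.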
