Export synthetic dataset created with sklearn to csv - python

I need to create a synthetic dataset because I have to fix a clustering algorithm for my university thesis, so I need a small dataset to test the algorithm with.
I managed to create it with sklearn's make_classification, but the program takes as input a CSV file that contains the features of the dataset.
Does anyone know how to create a synthetic dataset directly as CSV, or how to export the one created with sklearn into a CSV file?

You can export a NumPy array to a CSV file using numpy.savetxt.
This example uses a BytesIO instance as output; you would use a file name instead.
In [1]: import io
In [2]: import numpy as np
In [3]: x = np.random.randn(5, 2)
In [4]: x
Out[4]:
array([[-0.13114465, -0.72491874],
       [-0.08375738, -1.23769691],
       [-0.5583027 , -0.24086865],
       [ 0.04590227, -0.6582806 ],
       [-0.21433652, -0.78924272]])
In [5]: buf = io.BytesIO()
In [6]: np.savetxt(buf, x, delimiter=',')
In [7]: print(buf.getvalue().decode())
-1.311446488105691699e-01,-7.249187409818331762e-01
-8.375738326459475358e-02,-1.237696910731503452e+00
-5.583026953882282983e-01,-2.408686450946319058e-01
4.590226685041418758e-02,-6.582805971999975414e-01
-2.143365241670896482e-01,-7.892427231682124233e-01
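To answer the original question directly: a minimal sketch that writes a make_classification dataset straight to CSV, with the class label as the last column (the file name, shapes, and label-last layout are assumptions):
import numpy as np
from sklearn.datasets import make_classification

# Small synthetic dataset: 100 samples, 4 features, 2 classes
X, y = make_classification(n_samples=100, n_features=4, random_state=0)

# Stack the label as the last column and write one row per sample
data = np.column_stack([X, y])
np.savetxt("synthetic.csv", data, delimiter=",")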

Related

Read HDF data into a 3D array and save as a dataframe in Python

I am currently working with the NASA aerosol optical depth data (MCD19A2), which is a NASA satellite level-three product. I have uploaded the data. I want to save the data as a dataframe including all the information on longitude, latitude, and values. I have successfully converted the 0.47 µm band file into a three-dimensional array; I want to ask how to convert this array into a dataframe that includes X, Y, and the value.
Below is the code I have tried:
from osgeo import gdal
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
rds = gdal.Open("MCD19A2.A2006001.h26v04.006.2018036214627.hdf")
names=rds.GetSubDatasets()
names[0][0]
'HDF4_EOS:EOS_GRID:"MCD19A2.A2006001.h26v04.006.2018036214627.hdf":grid1km:Optical_Depth_047'
aod_047 = gdal.Open(names[0][0])
a47=aod_047.ReadAsArray()
a47[1].shape
(1200,1200)
I would like the result to look like:

X (n=1200)    Y (n=1200)    AOD_047
8896067       5559289       0.0123
I know that in R this can be done by
require('gdalUtils')
require('raster')
require('rgdal')
file.name<-"MCD19A2.A2006001.h26v04.006.2018036214627.hdf"
sds <- get_subdatasets(file.name)
gdal_translate(sds[1], dst_dataset = paste0('tmp047', basename(file.name), '.tiff'), b = nband)
r.047 <- raster(paste0('tmp047', basename(file.name), '.tiff'))
df.047 <- raster::as.data.frame(r.047, xy = T)
names(df.047)[3] <- 'AOD_047'
But R is memory-hungry, and saving to and reading back a 'tif' uses a lot of memory, so I want to do this task in Python. Thanks a lot for your help.
You can use pandas:
import pandas as pd
df=pd.read_hdf('filename.hdf')
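Note that pd.read_hdf expects an HDF5/PyTables store, so it may not open this HDF4-EOS product directly. Since the question already has the band as a NumPy array, a minimal sketch can build the dataframe from it, using pixel indices for X and Y (real map coordinates would come from the grid's GeoTransform, which this sketch does not use):
import numpy as np
import pandas as pd

band = a47[1]  # one 1200 x 1200 orbit slice from ReadAsArray() above
ny, nx = band.shape

# Pixel-index coordinate grids matching the band's shape
xs, ys = np.meshgrid(np.arange(nx), np.arange(ny))

# Flatten the grid so each pixel becomes one row: X, Y, AOD_047
df = pd.DataFrame({
    "X": xs.ravel(),
    "Y": ys.ravel(),
    "AOD_047": band.ravel(),
})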

How to import a .dat file in Google Colab

I am implementing the famous Iris classification problem in Python for the first time. I have a data file named iris.data, and I have to import this file into my Python project. I am trying it in Google Colab.
Sample data
Attributes are:
1. sepal length in cm
2. sepal width in cm
3. petal length in cm
4. petal width in cm
5. class:
5.1,3.5,1.4,0.2,Iris-setosa
4.9,3.0,1.4,0.2,Iris-setosa
4.7,3.2,1.3,0.2,Iris-setosa
4.6,3.1,1.5,0.2,Iris-setosa
5.0,3.6,1.4,0.2,Iris-setosa
5.4,3.9,1.7,0.4,Iris-setosa
4.6,3.4,1.4,0.3,Iris-setosa
5.0,3.4,1.5,0.2,Iris-setosa
I wrote:
import torch
import numpy as np
import matplotlib.pyplot as plt
FILE_PATH = "E:\iris dataset"
MAIN_FILE_NAME = "iris.dat"
data = np.loadtxt(FILE_PATH+MAIN_FILE_NAME, delimiter=",")
But it did not work and threw errors.
It worked when I wrote the code on Linux, but currently I am using Windows 10 and it does not work.
Thank you for the help in advance.
When constructing the file name for np.loadtxt, a \ is missing: FILE_PATH + MAIN_FILE_NAME is 'E:\iris datasetiris.dat'. To avoid having to add the \ manually between FILE_PATH and MAIN_FILE_NAME, you can use os.path.join, which does this for you.
import os
import numpy as np
FILE_PATH = 'E:\iris dataset'
MAIN_FILE_NAME = 'iris.dat'
data = np.loadtxt(os.path.join(FILE_PATH, MAIN_FILE_NAME), delimiter=',') # not actually working due to last column of file
On the other hand, I am not sure why it worked on Linux, because NumPy cannot convert the string "Iris-setosa" into a number, which np.loadtxt tries to do. If you are only interested in the numeric values, you can use the usecols keyword of np.loadtxt:
data = np.loadtxt(os.path.join(FILE_PATH, MAIN_FILE_NAME), delimiter=',', usecols=(0, 1, 2, 3))
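If you also need the class column, one hedged option is a converter that maps the label strings to integers (label names assumed from the standard iris.data; np.loadtxt may pass the converter bytes or str depending on the NumPy version):
import os
import numpy as np

LABELS = {"Iris-setosa": 0, "Iris-versicolor": 1, "Iris-virginica": 2}

def label_to_int(s):
    # Older NumPy versions hand converters bytes, newer ones str
    return LABELS[s.decode() if isinstance(s, bytes) else s]

data = np.loadtxt(os.path.join(FILE_PATH, MAIN_FILE_NAME), delimiter=',',
                  converters={4: label_to_int})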

Loading the Iris dataset from a CSV file?

I am practicing data processing with scikit-learn and looking at classification probability. I've successfully run the model using the built-in dataset import; now I want to try the same thing with a CSV file, so I've downloaded the same dataset and am trying to load it into my code.
iris = np.loadtxt('./iris.csv', delimiter=',', skiprows=1)
X = iris.data[:, 0:2]
y = iris.target
However, I get an error stating ValueError: could not convert string to float: 'setosa'. I understand this comes from the CSV, as it is the name of the flower. Is there another way to import this CSV file so that this isn't an issue?
For this you can use pandas:
data = pandas.read_csv("iris.csv")
data.head() # to see first 5 rows
X = data.drop(["target"], axis = 1)
Y = data["target"]
or you can try (I would personally recommend using pandas):
from numpy import genfromtxt
my_data = genfromtxt('my_file.csv', delimiter=',')
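Note that genfromtxt with no dtype turns the string label column into nan; restricting it to the numeric columns avoids that. A minimal sketch (column order and header row assumed from the standard iris.csv):
import numpy as np

# Load only the four numeric feature columns and skip the header row
X = np.genfromtxt('iris.csv', delimiter=',', skip_header=1,
                  usecols=(0, 1, 2, 3))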

Can't export a NumPy array to R dataset

I'm having some difficulty converting a NumPy array to an R data set. I have a 2362 x 2163 grayscale 2D image, and I have to import the gray-scale value of each pixel to do some statistical analysis. First, this is what I imported for the process:
import cv2
import numpy as np
from skimage import img_as_ubyte
from skimage import data
from rpy2.robjects import r
from rpy2.robjects import pandas2ri
from pandas import DataFrame
pandas2ri.activate()
First I import the image with cv2 as a NumPy array and, just in case, convert the array to values between 0 and 255 (256 gray levels).
xchest = cv2.imread("/home/user/xchest.tif", cv2.IMREAD_GRAYSCALE)
xchestsk = img_as_ubyte(xchest)
The output of:
type(xchestsk)
is:
<class 'numpy.ndarray'>
I visually checked the array and, as expected, it is something like:
[[8, 8, 9, ... 200, 234, 245]...[250, 234, 134, ... 67, 8, 8]]
I need all that pixel information on a simple data set that I can use and analyze on RStudio. I tried with:
xchest_R = DataFrame(xchestsk)
xchest_R = r.data('xchestsk')
r.assign("test", xchest_R)
r("save(test, file='/home/user/xchest.gzip', compress=TRUE)")
But when I load it on R:
> load("/home/user/xchest.gzip")
I just get the value "xchestsk" in test, as if I had imported only a string.
I tried with:
np.save("/home/user/xchest.npy", xchestsk)
But when I try to import it on R with:
> library(RcppCNPy)
> xchest_R <- npyLoad("/home/user/xchest.npy", "integer")
RStudio crashes and I have to restart the session.
Finally, I tried converting the NumPy array to a CSV file:
np.savetxt("/home/eera5607/xchest.csv", xchestsk, delimiter=",")
But when I import it to R:
> xchest_data = read.csv(file="/home/eera5607/xchest.csv", header=FALSE, sep=",")
I can't do a simple statistical analysis like:
> mean(xchest_data)
because I get this warning:
> argument is not numeric or logical: returning NA
I tried flattening the data into one variable with 5,000,000+ points:
xchest_list = xchestsk.tolist()
xchest_ov = []
for row in xchest_list:
    xchest_ov += row
Then I converted the xchest_ov list to CSV but I get the same warning in RStudio.
What I need is to import all those values, if possible keeping the matrix structure (but that is not strictly necessary; at minimum, importing the pixel values as a regular R data set would do) so that I can apply some statistical analysis to them. I know I can do some analysis directly in Python, but I would like this data in RStudio. I have very little knowledge of these topics; I'm a radiologist, and this is the first time I'm working with R.
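The line xchest_R = r.data('xchestsk') replaces the DataFrame with the string 'xchestsk', which is why only a string arrives in R. A minimal sketch that assigns the array itself through rpy2's numpy2ri converter (the activate() interface is shown here; newer rpy2 releases prefer local converter contexts):
from rpy2.robjects import r, numpy2ri

numpy2ri.activate()  # let rpy2 convert NumPy arrays to R matrices

r.assign("test", xchestsk)  # test is now an R integer matrix, not a string
r("save(test, file='/home/user/xchest.gzip', compress=TRUE)")
For the CSV route, the warning comes from calling mean() on a data.frame; in R, mean(as.matrix(xchest_data)) computes the mean over all pixels.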

MATLAB data file to pandas DataFrame [duplicate]

Is there a standard way to convert MATLAB .mat (MATLAB-formatted data) files to a pandas DataFrame?
I am aware that a workaround is possible using scipy.io, but I am wondering whether there is a more straightforward way to do it.
I found two ways: scipy or mat4py.
mat4py
Load data from MAT-file
The function loadmat loads all variables stored in the MAT-file into a
simple Python data structure, using only Python’s dict and list
objects. Numeric and cell arrays are converted to row-ordered nested
lists. Arrays are squeezed to eliminate arrays with only one element.
The resulting data structure is composed of simple types that are
compatible with the JSON format.
Example: Load a MAT-file into a Python data structure:
from mat4py import loadmat
data = loadmat('datafile.mat')
From:
https://pypi.python.org/pypi/mat4py/0.1.0
Scipy:
Example:
import numpy as np
from scipy.io import loadmat # this is the SciPy module that loads mat-files
import matplotlib.pyplot as plt
from datetime import datetime, date, time
import pandas as pd
mat = loadmat('measured_data.mat') # load mat-file
mdata = mat['measuredData'] # variable in mat file
mdtype = mdata.dtype # dtypes of structures are "unsized objects"
# * SciPy reads in structures as structured NumPy arrays of dtype object
# * The size of the array is the size of the structure array, not the number
# elements in any particular field. The shape defaults to 2-dimensional.
# * For convenience make a dictionary of the data using the names from dtypes
# * Since the structure has only one element, but is 2-D, index it at [0, 0]
ndata = {n: mdata[n][0, 0] for n in mdtype.names}
# Reconstruct the columns of the data table from just the time series
# Use the number of intervals to test if a field is a column or metadata
columns = [n for n, v in ndata.items() if v.size == ndata['numIntervals']]
# now make a data frame, setting the time stamps as the index
df = pd.DataFrame(np.concatenate([ndata[c] for c in columns], axis=1),
                  index=[datetime(*ts) for ts in ndata['timestamps']],
                  columns=columns)
From:
http://poquitopicante.blogspot.fr/2014/05/loading-matlab-mat-file-into-pandas.html
Finally, you can follow the PyHogs example, which still uses scipy:
Reading complex .mat files.
This notebook shows an example of reading a MATLAB .mat file, converting the data into a usable dictionary with loops, and making a simple plot of the data.
http://pyhogs.github.io/reading-mat-files.html
Ways to do this:
As you mentioned, scipy:
import scipy.io as sio
test = sio.loadmat('test.mat')
Using the MATLAB engine (requires a local MATLAB installation):
import matlab.engine
eng = matlab.engine.start_matlab()
content = eng.load("example.mat", nargout=1)
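Whichever loader you use, the remaining step is turning the loaded dict into a DataFrame. A minimal sketch, assuming the .mat file holds a plain 2-D numeric matrix stored under the variable name 'data' (both the file and variable names are assumptions):
import pandas as pd
from scipy.io import loadmat

mat = loadmat('test.mat')       # dict mapping variable names to ndarrays
df = pd.DataFrame(mat['data'])  # wrap the 2-D numeric array as a DataFrame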
