How do I write scikit-learn dataset to csv file - python

I can load a data set from scikit-learn using
from sklearn import datasets
data = datasets.load_boston()
print(data)
What I'd like to do is write this data set to a flat file (.csv)
Using the open() function,
f = open('boston.txt', 'w')
f.write(str(data))
works, but includes the description of the data set.
I'm wondering if there is some way that I can generate a simple .csv with headers from this Bunch object so I can move it around and use it elsewhere.

data = datasets.load_boston() will generate a dictionary. In order to write the data to a .csv file you need the actual data data['data'] and the columns data['feature_names']. You can use these in order to generate a pandas dataframe and then use to_csv() in order to write the data to a file:
from sklearn import datasets
import pandas as pd
data = datasets.load_boston()
print(data)
df = pd.DataFrame(data=data['data'], columns = data['feature_names'])
df.to_csv('boston.txt', sep = ',', index = False)
and the output boston.txt should be:
CRIM,ZN,INDUS,CHAS,NOX,RM,AGE,DIS,RAD,TAX,PTRATIO,B,LSTAT
0.00632,18.0,2.31,0.0,0.538,6.575,65.2,4.09,1.0,296.0,15.3,396.9,4.98
0.02731,0.0,7.07,0.0,0.469,6.421,78.9,4.9671,2.0,242.0,17.8,396.9,9.14
0.02729,0.0,7.07,0.0,0.469,7.185,61.1,4.9671,2.0,242.0,17.8,392.83,4.03
...

There are various toy datasets in scikit-learn such as Iris and Boston datasets. Let's load Boston dataset:
from sklearn import datasets
boston = datasets.load_boston()
What type of object is this? If we examine its type, we see that this is a scikit-learn Bunch object.
print(type(boston))
Output:
<class 'sklearn.utils.Bunch'>
A scikit-learn Bunch object is a kind of dictionary. So, we should treat it as such. We can use dictionary methods. Let's look at the keys:
print(boston.keys())
output:
dict_keys(['data', 'target', 'feature_names', 'DESCR', 'filename'])
Here we are interested in data, feature_names and target keys. We will import pandas module and use these keys to create a pandas DataFrame.
import pandas as pd
df = pd.DataFrame(data=boston['data'], columns=boston['feature_names'])
We should also add the target variable to the DataFrame. Target variable is what we try to predict. We should learn the target variable's name. It is written in the "DESCR". We can
print(boston["DESCR"]) and read the full description of the dataset.
In the description we see that the name of the target variable is MEDV. Now, we can add the target variable to the DataFrame:
df['MEDV'] = boston['target']
There is only one step left. We are exporting the DataFrame to a csv file without index numbers:
df.to_csv("scikit_learn_boston_dataset.csv", index=False)
BONUS: Iris dataset has additional parameters that we can utilize (look at here). Following code automatically creates the DataFrame with the target variable included:
iris = datasets.load_iris(as_frame=True)
df = iris["frame"]
Note: If we print(iris.keys()), we can see the 'frame' key:
dict_keys(['data', 'target', 'frame', 'target_names', 'DESCR', 'feature_names', 'filename'])
BONUS2: If we print(boston["filename"]) or print(iris["filename"]), we can see the physical locations of the csv files of these datasets. For instance:
C:\Users\user\anaconda3\lib\site-packages\sklearn\datasets\data\boston_house_prices.csv

Just wanted to modify the reply by adding that you should probably include the target variable--"MV"--as well. Added an additional line below:
from sklearn import datasets
import pandas as pd
data = datasets.load_boston()
print(data)
df = pd.DataFrame(data=data['data'], columns = data['feature_names'])
**df['MV'] = data['target']**
df.to_csv('boston.txt', sep = ',', index = False)

Related

UCI dataset: How to extract features and change the data into usable format after reading the data on python

I am looking to apply some ml algorithms on the data set from https://archive.ics.uci.edu/ml/datasets/University.
I noticed that the data is unstructured. Indeed, I want the data to have the features as the columns and have observations as the rows. Therefore, I need help with parsing this dataset.
Any help will be appreciated. Thanks.
Below is what I have tried:
column_names = ["University-name"
,"State"
,"location"
,"Control"
,"number-of-students"
,"male:female (ratio)"
,"student:faculty (ratio)",
"sat-verbal"
,"sat-math"
,"expenses"
,"percent-financial-aid"
,"number-of-applicants"
,"percent-admittance"
,"percent-enrolled"
,"academics"
,"social"
,"quality-of-life"
,"academic-emphasis"]
data_list =[]
data = ['https://archive.ics.uci.edu/ml/machine-learning-
databases/university/university.data','https://archive.ics.uci.edu/ml/machine-
learning-databases/university/university.data',...]'
for file in in data:
df = pd.read_csv(file, names = column_names)
data_list.append(df)
The data is not structured in a way you can parse it using pandas. Do something like this:
import requests
data = "https://archive.ics.uci.edu/ml/machine-learning-databases/university/university.data"
data = requests.get(data)
temp = data.text
import re
fdic = {'def-instance':[], 'state':[]}
for col in fdic.keys():
fdic[col].extend(re.findall(f'\({col} ([^\\\n)]*)' , temp))
import pandas as pd
pd.DataFrame(fdic)
The output:

How to optimize python script to pyspark def function

I am writing a pyspark program that takes a txt file and then add a few columns to the left(beginning) of the columns in the file.
My text file looks like this:
ID,Name,Age
1233,James,15
After I run the program I want it to add two columns named creation_DT and created_By to the left of the table. I am trying to get it to look like this:
Creation_DT,Created_By,ID,Name,Age
"current timestamp", Sean,1233,James,15
This code below get my required output but I was wondering if there was an easier way to do this to optimize my script below using pyspark.
import pandas as pd
import numpy as np
with open
df = pd.read_csv("/home/path/Sample Text Files/sample5.txt", delimiter = ",")
df=pd.DataFrame(df)
df.insert(loc=0, column='Creation_DT', value=pd.to_datetime('today'))
df.insert(loc=1, column='Create_BY',value="Sean")
df.write("/home/path/new/new_file.txt")
Any ideas or suggestions?
yes it is relatively easy to convert to pyspark code
from pyspark.sql import DataFrame, functions as sf
import datetime
# read in using dataframe reader
# path here if you store your csv in local, should use file:///
# or use hdfs:/// if you store your csv in a cluster/HDFS.
spdf = (spark.read.format("csv").option("header","true")
.load("file:///home/path/Sample Text Files/sample5.txt"))
spdf2 = (
spdf
.withColumn("Creation_DT", sf.lit(datetime.date.today().strftime("%Y-%m-%d")))
.withColumn("Create_BY", sf.lit("Sean"))
spdf2.write.csv("file:///home/path/new/new_file.txt")
this code assumes you are appending the creation_dt or create_by using the same value.
I don't see you use any pyspark in your code, so I'll just use pandas this way:
cols = df.columns
df['Creation_DT'] =pd.to_datetime('today')
df['Create_BY']="Sean"
cols = cols.insert(0, 'Create_BY')
cols = cols.insert(0, 'Creation_DT')
df.columns = cols
df.write("/home/path/new/new_file.txt")

Problem either with number of characters exceeding cell limit, or storing lists of variable length

The problem:
I have lists of genes expressed in 53 different tissues. Originally, this data was stored in a maximal array of the genes, with 'NaN' where there was no expression. I am trying to create new lists for each tissue that just have the genes expressed, as it was very inefficient to be searching through this array every time I was running my script. I have a code that finds the genes for each tissue as required, but I do not know how to store the ouptut.
I was using pandas data frame, and then converting to csv. But this does not accept lists of varying length, unless I put this list as a single item. However, then when I save the data frame to a csv, it tries to squeeze this very long list (all genes exprssed for one tissue) into a single cell. I get an error of the string length exceeding the excel character-per-cell limit.
Therefore I need a way of either dealing with this limit, or stroing my lists in a different way. I would rather just have one file for all lists.
My code:
import csv
import pandas as pd
import math
import numpy as np
#Import list of tissues:
df = pd.read_csv(r'E-MTAB-5214-query-results.tsv', skiprows = [0,1,2,3], sep='\t')
tissuedict=df.to_dict()
tissuelist = list(tissuedict.keys())[2:]
all_genes = [gene for key,gene in tissuedict['Gene Name'].items()]
data = []
for tissue in tissuelist:
#Create array to keep track of the protein mRnaS in tissue that are not present in the network
#initiate with first tissue, protein
nanInd = [key for key,value in tissuedict[tissue].items() if math.isnan(value)]
tissueExpression = np.delete(all_genes, nanInd)
datatis = [tissue, tissueExpression.tolist()]
print(datatis)
data.append(datatis)
print(data)
df = pd.DataFrame(data)
df.to_csv(r'tissue_expression_data.csv')
Link to data (either one):
https://github.com/joanna-lada/gene_data/blob/master/E-MTAB-5214-query-results.tsv
https://raw.githubusercontent.com/joanna-lada/gene_data/master/E-MTAB-5214-query-results.tsv
IIUC you need lists of the gene names found in each tissue. This writes these lists as columns into a csv:
import pandas as pd
df = pd.read_csv('E-MTAB-5214-query-results.tsv', skiprows = [0,1,2,3], sep='\t')
df = df.drop(columns='Gene ID').set_index('Gene Name')
res = pd.DataFrame()
for c in df.columns:
res = pd.concat([res, pd.Series(df[c].dropna().index, name=c)], axis=1)
res.to_csv('E-MTAB-5214-query-results.csv', index=False)
(Writing them as rows would have been easier, but Excel can't import so many columns)
Don't open the csv in Excel directly, but use a blank worksheet and import the csv (Data - External data, From text), otherwise you can't separate them into Excel columns in one run (at least in Excel 2010).
create your data variable as a dictionary
you can save the dictionary to a json file using json.dump refer here
import json
data = {}
for tissue in tissuelist:
nanInd = [key for key,value in tissuedict[tissue].items() if math.isnan(value)]
tissueExpression = np.delete(all_genes, nanInd)
data[tissue] = tissueExpression.tolist()
with open('filename.json', 'w') as fp:
json.dump(data, fp)

Numpy: load multiple CSV files as dictionaries

I wanted to use the numpy loadtxt method to read .csv files for my experiment. I have three different time-series data of the following format with different characteristics where the first column is timestamp and the second column is the value.
0.086206438,10
0.086425551,12
0.089227066,20
0.089262508,24
0.089744425,30
0.090036815,40
0.090054172,28
0.090377569,28
0.090514071,28
0.090762872,28
0.090912691,27
For reproducibility, I have shared the three time-series data I am using here.
If I do it like the following
import numpy as np
fname="data1.csv"
col_time,col_window = np.loadtxt(fname,delimiter=',').T
It works fine as intended. However instead of reading only a single file, I want to pass a dictionary to col_time,col_window = np.loadtxt(types,delimiter=',').T as the following
protocols = {}
types = {"data1": "data1.csv", "data2": "data2.csv", "data3": "data3.csv"}
so that I can read multiple csv files and do plot all the results at ones using a one for loop as in the following.
for protname, fname in types.items():
col_time, col_window = protocols[protname]["col_time"], protocols[protname]["col_window"]
rt = np.exp(np.diff(np.log(col_window)))
plt.plot(quotient_times, quotient, ".", markersize=4, label=protname)
plt.title(protname)
plt.xlabel("t")
plt.ylabel("values")
plt.legend()
plt.show()
But it is giving me an error ValueError: could not convert string to float: b'data1'. How can I load multiple csv files as a dictionary?
Assuming that you want to build a protocols dict that will be useable in your code, you can easily build it with a simple loop:
types = {"data1": "data1.csv", "data2": "data2.csv", "data3": "data3.csv"}
protocols = {}
for name, file in types.items():
col_time, col_window = np.loadtxt(file, delimiter=',').T
protocols[name] = {'col_time': col_time, 'col_window': col_window}
You can then successfully plot the 3 graphs:
for protname, fname in types.items():
col_time, col_window = protocols[protname]["col_time"], protocols[protname]["col_window"]
rt = np.exp(np.diff(np.log(col_window)))
plt.plot(col_time, col_window, ".", markersize=4, label=protname)
plt.title(protname)
plt.xlabel("t")
plt.ylabel("values")
plt.legend()
plt.show()
Loading data from multiple CSV files is not supported in pandas and numpy. You can use concat function of pandas DataFrame and load all the files. The example bellow demonstrates using pandas. Replace StringIO with file object.
data="""
0.086206438,10
0.086425551,12
0.089227066,20
0.089262508,24
0.089744425,30
0.090036815,40
0.090054172,28
0.090377569,28
0.090514071,28
0.090762872,28
0.090912691,27
"""
data2="""
0.086206438,29
0.086425551,32
0.089227066,50
0.089262508,54
"""
data3="""
0.086206438,69
0.086425551,72
0.089227066,70
0.089262508,74
"""
import pandas as pd
from io import StringIO
files={"data1":data,"data2":data2,"data3":data3}
# Load the first file into data frame
key=list(files.keys())[0]
df=pd.read_csv(StringIO(files.get(key)),header=None,usecols=[0,1],names=['data1','data2'])
print(df.head())
# remove file from dictionary
files.pop(key,None)
print("final values")
# Efficient :Concat this dataframe with remaining files
df=pd.concat([pd.read_csv(StringIO(files[i]),header=None,usecols=[0,1],names=['data1','data2']) for i in files.keys()],
ignore_index=True)
print(df.tail())
For more insight: pandas append vs concat

How to add LabelBinarizer columns to DataFrame

I have recently started working with LabelBinarizer by running the following code. (here are the first couple of rows of the CSV file that I'm using):
import pandas as pd
from sklearn.preprocessing import LabelBinarizer
#import matplotlib.pyplot as plot
#--------------------------------
label_conv = LabelBinarizer()
appstore_original = pd.read_csv("AppleStore.csv")
#--------------------------------
lb_conv = label_conv.fit_transform(appstore["cont_rating"])
column_names = label_conv.classes_
print(column_names)
print(lb_conv)
I get the lb_conv and the column names. Therefore:
how could I attach label_conv to appstore_original using column_names as the column names?
If anyone could help that would be great.
try this:
lb = LabelBinarizer()
df = pd.read_csv("AppleStore.csv")
df = df.join(pd.DataFrame(lb.fit_transform(df["cont_rating"]),
columns=lb.classes_,
index=df.index))
to make sure that a newly created DF will have the same index elements as the original DF (we need it for joining), we will specify index=df.index in the constructor call.

Categories