How to save a TensorFlow dataset to CSV? - python

I can find a lot of documents/forums explaining how to convert a csv to a TensorFlow dataset, but not a single one saying how to convert a dataset back to a csv. I have a csv with two columns for now (filename, weight - more columns may be added later). I read that into TensorFlow and create a dataset. At the end of the script the 2nd column is modified, and I need to save these columns to a csv. I need them in csv (not a checkpoint) because I may need to do stuff with them in Matlab.
I tried to call the dataset's map function and save to csv inside the map function, but it doesn't work as expected.
import tensorflow as tf

# reading csv to dataset
def map_func1(line):
    FIELD_DEFAULTS = [[""], [0.0]]
    sample, weight = tf.decode_csv(line, FIELD_DEFAULTS)
    return sample, weight

ds = tf.data.TextLineDataset('sample_weights.csv')
ds_1 = ds.map(map_func1)
# the dataset is then modified into ds_2 by another map func (code not included)
# trying to save to csv
import csv

def map_func3(writer, x):
    x0, x1 = x
    writer.writerow([x0, x1])
    return x

with open('sample_weights_mod.csv', 'w') as file:
    writer = csv.writer(file)
    ds_3 = ds_2.map(lambda *x: map_func3(writer, x))
This doesn't work as expected: map only traces the function once with symbolic tensors rather than executing it per element, so it just writes the tensor placeholders to the csv: Tensor("arg0:0", shape=(), dtype=string) Tensor("arg1:0", shape=(), dtype=float32)
This solution is probably a bad one; I really need a neat way to do this.

Though not a good way of doing it, for now I did it as below:
import pandas as pd

type(movies)  # the movies variable is of type tensorflow.python.data.ops.dataset_ops.MapDataset
z = []
for example in movies:  # eager iteration yields concrete tensors
    z.append(example.numpy().decode("utf-8"))
mv = {'movie_title': z}
pd.DataFrame(mv).to_csv('movie.csv')
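A somewhat neater sketch for the original two-column case, assuming eager execution (TF 2.x) and a dataset shaped like ds_2 above (the filenames and weights here are hypothetical placeholders): iterate the dataset directly and write each record with the standard csv module.
import csv
import tensorflow as tf

# hypothetical stand-in for the modified two-column dataset ds_2
ds_2 = tf.data.Dataset.from_tensor_slices(
    (["a.wav", "b.wav"], [1.0, 2.0]))

with open('sample_weights_mod.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    for sample, weight in ds_2:  # eager iteration yields concrete tensors
        writer.writerow([sample.numpy().decode('utf-8'), float(weight.numpy())])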

Related

How can I save my results in the same file as different columns in the case of a 'for-cycle'

def get_df():
    df = pd.DataFrame()
    os.chdir("C:/Users/s/Desktop/P")
    for file in os.listdir():
        if file.endswith('.csv'):
            av_a = np.average(a, axis=0)
            np.savetxt('merged_average.csv', av_a, delimiter=',')
I've tried to save the results, but the output always gets overwritten by the next file, deleting the previous results.
At the moment, your code is a bit hard to read, as you are declaring variables which are not used (df) and using variables which are not declared (a). In the future, try to give a minimal reproducible example of your problematic code.
I'll still try to give you an interpreted answer:
If you want to store multiple columns from different files next to each other, the job becomes simpler if you first acquire all the columns and afterwards save them to the file in a single action.
Here is an interpretation of your code:
import os
import numpy as np

def get_df():
    # create an empty list to collect all results
    average_results = []
    os.chdir("C:/Users/s/Desktop/P")
    for file in os.listdir():
        if file.endswith('.csv'):
            a = something(file)  # unknown to me
            average_results.append(np.average(a, axis=0))
    # convert the results to a 2d numpy matrix,
    # optionally transpose it to get the desired data orientation
    data = np.array(average_results).transpose()
    # save the full dataset
    np.savetxt('merged_average.csv', data, delimiter=',')
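The helper something(file) above is a placeholder for the loading code that was missing from the question; a hypothetical implementation, assuming each csv holds a plain numeric table, could be:
import numpy as np

def something(file):
    # assumption: each csv contains a purely numeric table
    return np.loadtxt(file, delimiter=',')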

Read txt-file with data and labels into tensorflow

I'm relatively new to tensorflow and therefore I'm struggling with the data preparation.
I have a folder with about 500 .txt files. Each of these files contain the data and a label of the data. (The data represents MFCCs, which are audio features that get generated for each "frame" of a .wav audio file.)
Each of these files looks like this:
1
1.013302233064514191e+01
-1.913611804400369110e+01
1.067932213100989847e+00
1.308777013246182364e+01
-3.591032944037165109e+00
1.294307486784356698e+01
5.628056691023937574e+00
5.311223121033092909e+00
1.069261850699697014e+01
4.398722698218969995e+00
5.045254154360372389e+00
7.757820364628694954e+00
-2.666228281486863416e+00
9.236707894117541784e+00
-1.727334954006132151e+01
5.166050472560470119e+00
6.421742650353079007e+00
2.550240091606466031e+00
9.871269941885440602e+00
7.594591526898561984e-01
-2.877228968309437196e+00
5.592507658015017924e-01
8.828475996369435919e+00
2.946838169848354561e+00
8.420693074096489150e-01
7.032494888004835687e+00
...
In the first line of each file, I have the label of the data (in this case 1).
In the rest of the file, I have 13 numbers representing the 13 MFCCs for each frame; each frame's MFCCs are separated by a newline.
So my question would be: what's an easy way of getting the content of all these files into tensors so tensorflow can use them?
Thanks!
Not sure if this is the most optimized way of doing it, but it can be done as explained in the steps below:
Iterate through each text file and append its data to a list.
Replace '\n' in each element with ',' because our goal is to create a CSV out of it.
Write the elements of the list, each separated by commas, to a CSV file.
Finally, convert the CSV file to a TensorFlow dataset using tf.data.experimental.make_csv_dataset (there is a tutorial on how to convert a CSV file to a TensorFlow dataset); a sketch of this step follows the code below.
The code which performs the first three steps mentioned above is given below:
import os
import pandas as pd

# The folder where all the text files are present
Path_Of_Text_Files = '/home/mothukuru/Jupyter_Notebooks/Stack_Overflow/Text_Files'
List_of_Files = os.listdir(Path_Of_Text_Files)
List_Of_Elements = []

# Iterate through each text file and append its data to a list
for EachFile in List_of_Files:
    with open(os.path.join(Path_Of_Text_Files, EachFile), 'r') as FileObj:
        List_Of_Elements.append(FileObj.readlines())

# Below code is to remove '\n' at the end of each column
for i in range(len(List_Of_Elements)):
    List_Of_Elements[i] = [sub.replace('\n', ',') for sub in List_Of_Elements[i]]

Column_Names = ['Label,', 'F1,', 'F2,', 'F3,', 'F4,', 'F5,', 'F6,', 'F7,',
                'F8,', 'F9,', 'F10,', 'F11,', 'F12,', 'F13']

# Write the data in the list, List_Of_Elements, to a CSV file
with open(os.path.join(Path_Of_Text_Files, 'Final_Data.csv'), 'w') as FileObj:
    FileObj.writelines(Column_Names)
for EachElement in List_Of_Elements:
    with open(os.path.join(Path_Of_Text_Files, 'Final_Data.csv'), 'a') as FileObj:
        FileObj.write('\n')
        FileObj.writelines(EachElement)

Path_Of_Final_CSV = os.path.join(Path_Of_Text_Files, 'Final_Data.csv')
Data = pd.read_csv(Path_Of_Final_CSV, index_col=False)
To check that our data is fine, we can print(Data.head()) and inspect the first rows.
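A minimal sketch of the fourth step, assuming the Final_Data.csv produced above and a TensorFlow version that ships tf.data.experimental.make_csv_dataset (the batch size is an arbitrary choice):
import tensorflow as tf

# build a dataset from the CSV written above; 'Label' matches the header row
dataset = tf.data.experimental.make_csv_dataset(
    Path_Of_Final_CSV,
    batch_size=32,
    label_name='Label',
    num_epochs=1)

for features, labels in dataset.take(1):
    print(list(features.keys()))  # ['F1', 'F2', ..., 'F13']
    print(labels.shape)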

How to write continuous outputs in a single txt file

I am working with multiple data files (File_1, File_2, .....). I want the desired outputs for each data file to be saved in the same txt file as row values of a new column.
I tried the following code for my first data file (File_1). The desired outputs (Av_Age_btwn_0_to_5, Av_Age_btwn_5_to_10) are stored as row values of a column in the output txt file (Result.txt). Now, when I work with File_2, I want these outputs stored as row values in the next column of the same txt file; then for File_3 in the next column, and so on.
import numpy as np

data = np.loadtxt('C:/Users/Hrihaan/Desktop/File_1.txt')
Age = data[:, 0]
Age_btwn_0_to_5 = Age[(Age < 5) & (Age > 0)]
Age_btwn_5_to_10 = Age[(Age < 10) & (Age >= 5)]
Av_Age_btwn_0_to_5 = np.mean(Age_btwn_0_to_5)
Av_Age_btwn_5_to_10 = np.mean(Age_btwn_5_to_10)
np.savetxt('/Users/Hrihaan/Desktop/Result.txt', (Av_Age_btwn_0_to_5, Av_Age_btwn_5_to_10), delimiter=',')
Any help would be appreciated.
If I understand correctly, each of your files is a column, and you want to combine them into a matrix (one file per column).
Maybe something like this could work?
import numpy as np

# Simulate some dummy data
def simulate_data(n_files):
    for i in range(n_files):
        ages = np.random.randint(0, 10, 100)
        np.savetxt("/tmp/File_{}.txt".format(i), ages, fmt='%i')

# Your file processing
def process(age):
    age_btwn_0_to_5 = age[(age < 5) & (age > 0)]
    age_btwn_5_to_10 = age[(age < 10) & (age >= 5)]
    av_age_btwn_0_to_5 = np.mean(age_btwn_0_to_5)
    av_age_btwn_5_to_10 = np.mean(age_btwn_5_to_10)
    return (av_age_btwn_0_to_5, av_age_btwn_5_to_10)

n_files = 5
simulate_data(n_files)

results = []
for i in range(n_files):
    # load data
    data = np.loadtxt('/tmp/File_{}.txt'.format(i))
    # process your file and extract your information
    data_processed = process(data)
    # store the result
    results.append(data_processed)

results = np.asarray(results)
np.savetxt('/tmp/Result.txt', results.T, delimiter=',', fmt='%.3f')
In the end, you have something like this:
2.649,2.867,2.270,2.475,2.632
7.080,6.920,7.288,7.231,6.880
Is this what you're looking for?
import numpy as np

# some data
age = np.arange(10)
time = np.arange(10)
mean = np.arange(10)

output = np.array(list(zip(age, time, mean)))
np.savetxt('FooFile.txt', output, delimiter=',', fmt='%s')
# use the fmt keyword argument if you want to save the values as int
# (e.g. fmt='%i'); for simplicity you can just leave it out
output:
0,0,0
1,1,1
2,2,2
3,3,3
4,4,4
5,5,5
6,6,6
7,7,7
8,8,8
9,9,9

reading csv file with tensorflow

In the examples, the variables for the columns in the data set are given manually.
But my data set already has the names as headers, and I want to use them. How do I get the header names of a .csv file with tensorflow using python?
import tensorflow as tf

filename_queue = tf.train.string_input_producer(
    ['final_data1.csv'], num_epochs=1)

# to read the csv file
print(5)
reader = tf.TextLineReader(skip_header_lines=1)
print(4)
_, csv_row = reader.read_up_to(filename_queue)
print(type(csv_row))
print(3)

with tf.Session() as sess:
    print(reader.num_records_produced())
    tf.global_variables_initializer()
    coord = tf.train.Coordinator()
    threads = tf.train.start_queue_runners(coord=coord)
I can't understand exactly what you're trying to do -- you need to give more details.
If you want to work with csv files, I think you'd better use the DataFrame data structure from the pandas library. It has a method read_csv where you can choose which row to treat as the header.
Try it:
from pandas import read_csv
df = read_csv('YOUR_DATAFILE.csv',header=0)
print(df)
With the parameter 'header' you can select which row of your dataset contains the headers; in this case it's the first row.
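If you specifically want the header names available through TensorFlow itself, here is a sketch assuming a TF version that ships tf.data.experimental.make_csv_dataset, which parses the header row on its own:
import tensorflow as tf

# make_csv_dataset reads the header row and keys each feature by column name
dataset = tf.data.experimental.make_csv_dataset(
    'final_data1.csv', batch_size=32, num_epochs=1)

for batch in dataset.take(1):
    print(list(batch.keys()))  # the header names from the csv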

How to overwrite array inside h5 file using h5py

I'm trying to overwrite a numpy array that's a small part of a pretty complicated h5 file.
I'm extracting an array, changing some values, then want to re-insert the array into the h5 file.
I have no problem extracting the array that's nested.
f1 = h5py.File(file_name,'r')
X1 = f1['meas/frame1/data'].value
f1.close()
My attempted code looks something like this with no success:
f1 = h5py.File(file_name,'r+')
dset = f1.create_dataset('meas/frame1/data', data=X1)
f1.close()
As a sanity check, I executed this in Matlab using the following code, and it worked with no problems.
h5write(file1, '/meas/frame1/data', X1);
Does anyone have any suggestions on how to do this successfully?
You want to assign values, not create a dataset:
f1 = h5py.File(file_name, 'r+') # open the file
data = f1['meas/frame1/data'] # load the data
data[...] = X1 # assign new values to data
f1.close() # close the file
To confirm the changes were properly made and saved:
f1 = h5py.File(file_name, 'r')
np.allclose(f1['meas/frame1/data'].value, X1)
#True
askewchan's answer describes the way to do it (you cannot create a dataset under a name that already exists, but you can of course modify the dataset's data). Note, however, that the dataset must have the same shape as the data (X1) you are writing to it. If you want to replace the dataset with some other dataset of different shape, you first have to delete it:
del f1['meas/frame1/data']
dset = f1.create_dataset('meas/frame1/data', data=X1)
Different scenarios:
Partial changes to a dataset:
with h5py.File(file_name, 'r+') as ds:
    ds['meas/frame1/data'][5] = val    # change index 5 to scalar "val"
    ds['meas/frame1/data'][3:7] = vals # change values of indices 3-6 to "vals"
Change each value of a dataset (identical dataset sizes):
with h5py.File(file_name, 'r+') as ds:
    ds['meas/frame1/data'][...] = X1  # change array values to those of "X1"
Overwrite a dataset with one of a different size:
with h5py.File(file_name, 'r+') as ds:
    del ds['meas/frame1/data']  # delete the old, differently sized dataset
    ds.create_dataset('meas/frame1/data', data=X1)  # implant the new-shaped dataset "X1"
Since the File object is a context manager, using with statements is a nice way to package your code, and automatically close out of your dataset once you're done altering it. (You don't want to be in read/write mode if you only need to read off data!)
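As an aside not covered by the answers above (an assumption about the workflow, not part of the original question): if the dataset's shape changes regularly, h5py also lets you create it once with an extensible maxshape and then resize it in place instead of deleting and recreating it.
import h5py
import numpy as np

X1 = np.zeros((100, 13))  # hypothetical data

with h5py.File('example.h5', 'w') as f:
    # unlimited first axis, so the dataset can grow or shrink later
    f.create_dataset('meas/frame1/data', data=X1,
                     maxshape=(None, X1.shape[1]))

with h5py.File('example.h5', 'r+') as f:
    dset = f['meas/frame1/data']
    dset.resize((150, X1.shape[1]))  # new rows are zero-filled by default
    dset[100:] = 1.0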
