Read txt-file with data and labels into tensorflow - python

I'm relatively new to tensorflow and therefore I'm struggling with the data preparation.
I have a folder with about 500 .txt files. Each of these files contains the data and a label for the data. (The data represents MFCCs, which are audio features that get generated for each "frame" of a .wav audio file.)
Each of these files looks like this:
1
1.013302233064514191e+01
-1.913611804400369110e+01
1.067932213100989847e+00
1.308777013246182364e+01
-3.591032944037165109e+00
1.294307486784356698e+01
5.628056691023937574e+00
5.311223121033092909e+00
1.069261850699697014e+01
4.398722698218969995e+00
5.045254154360372389e+00
7.757820364628694954e+00
-2.666228281486863416e+00
9.236707894117541784e+00
-1.727334954006132151e+01
5.166050472560470119e+00
6.421742650353079007e+00
2.550240091606466031e+00
9.871269941885440602e+00
7.594591526898561984e-01
-2.877228968309437196e+00
5.592507658015017924e-01
8.828475996369435919e+00
2.946838169848354561e+00
8.420693074096489150e-01
7.032494888004835687e+00
...
The first line of each file holds the label of the data (in this case 1).
The rest of the file contains 13 numbers per frame, representing the 13 MFCCs; each frame's MFCCs are separated by newlines.
So my question is: what's an easy way of getting the content of all these files into tensors so TensorFlow can use them?
Thanks!

Not sure if this is the most optimized way of doing it, but it can be done as explained in the steps below:
Iterate through each text file and append its data to a list
Replace '\n' in each element with ',' because our goal is to create a CSV out of it
Write the comma-separated elements of the list to a CSV file
Finally, convert the CSV file to a TensorFlow Dataset using tf.data.experimental.make_csv_dataset (a short sketch of this step follows the code below); see the TensorFlow tutorial on loading CSV data for details.
Code which performs the first three steps mentioned above is given below:
import os
import pandas as pd

# The Folder where all the Text Files are present
Path_Of_Text_Files = '/home/mothukuru/Jupyter_Notebooks/Stack_Overflow/Text_Files'
List_of_Files = os.listdir(Path_Of_Text_Files)

List_Of_Elements = []

# Iterate through each Text File and append its data to a List
for EachFile in List_of_Files:
    with open(os.path.join(Path_Of_Text_Files, EachFile), 'r') as FileObj:
        List_Of_Elements.append(FileObj.readlines())

# Below code is to remove '\n' at the end of each Column
for i in range(len(List_Of_Elements)):
    List_Of_Elements[i] = [sub.replace('\n', ',') for sub in List_Of_Elements[i]]

Column_Names = ['Label,', 'F1,', 'F2,', 'F3,', 'F4,', 'F5,', 'F6,', 'F7,',
                'F8,', 'F9,', 'F10,', 'F11,', 'F12,', 'F13']

# Write the Data in the List, List_Of_Elements, to a CSV File
with open(os.path.join(Path_Of_Text_Files, 'Final_Data.csv'), 'w') as FileObj:
    FileObj.writelines(Column_Names)

for EachElement in List_Of_Elements:
    with open(os.path.join(Path_Of_Text_Files, 'Final_Data.csv'), 'a') as FileObj:
        FileObj.write('\n')
        FileObj.writelines(EachElement)

Path_Of_Final_CSV = os.path.join(Path_Of_Text_Files, 'Final_Data.csv')
Data = pd.read_csv(Path_Of_Final_CSV, index_col=False)
To check that the data is fine, print(Data.head()) will show the first few rows of the resulting DataFrame.
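For step 4, here is a minimal sketch, assuming TensorFlow 2.x and the Final_Data.csv generated above with the label stored in the Label column; batch_size is just an illustrative value:

import tensorflow as tf

# Build a tf.data.Dataset from the generated CSV file
dataset = tf.data.experimental.make_csv_dataset(
    Path_Of_Final_CSV,   # path built with os.path.join above
    batch_size=32,       # illustrative value, tune for your model
    label_name='Label',  # first CSV column holds the class label
    num_epochs=1,
    shuffle=True)

# Each element is a (features, labels) pair; features is a dict keyed by column name
for features, labels in dataset.take(1):
    print(labels)
    print(features['F1'])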

Related

how to read data (using pandas?) so that it is correctly formatted?

I have a txt file with the following format:
{"results":[{"statement_id":0,"series":[{"name":"datalogger","columns":["time","ActivePower0","CosPhi0","CurrentRms0","DcAnalog5","FiringAngle0","IrTemperature0","Lage 1_Angle1","Lage 1_Angle2","PotentioMeter0","Rotation0","SNR","TNR","Temperature0","Temperature3","Temperature_MAX31855_0","Temperature_MAX31855_3","Vibra0_X","Vibra0_Y","Vibra0_Z","VoltageAccu0","VoltageRms0"],"values":[["2017-10-06T08:50:25.347Z",null,null,null,null,null,null,null,null,null,null,"41762721","Testcustomer",null,null,null,null,-196,196,-196,null,null],["2017-10-06T08:50:25.348Z",null,null,null,null,null,null,346.2964,76.11179,null,null,"41762721","Testcustomer",null,null,null,null,null,null,null,null,null],["2017-10-06T08:50:25.349Z",null,null,2596,null,null,null,null,null,null,null,"41762721","Testkunde",null,null,null,null,null,null,null,null,80700],["2017-10-06T08:50:25.35Z",null,null,null,null,null,null,null,null,null,1956,"41762721","Testkunde",null,null,null,null,null,null,null,null,null],["2017-10-06T09:20:05.742Z",null,null,null,null,null,67.98999,null,null,null,null,"41762721","Testkunde",null,null,null,null,null,null,null,null,null]]}]}]}
...
So in the text file everything is saved on one line. A CSV file is not available.
I would like to have it as a data frame in pandas. When I use read_csv:
df = pd.read_csv('time-series-data.txt', sep = ",")
the output of print(df) is something like [0 rows x 3455.. columns]
So currently everything is read in as one line. However, I would like to have 22 columns (time, ActivePower0, CosPhi0, ...). Any tips would be appreciated, thank you very much.
Is a pandas dataframe even suitable for this? The text files are up to 2 GB in size.
Here's an example which can read the file you posted.
Here's the test file, named test.json:
{"results":[{"statement_id":0,"series":[{"name":"datalogger","columns":["time","ActivePower0","CosPhi0","CurrentRms0","DcAnalog5","FiringAngle0","IrTemperature0","Lage 1_Angle1","Lage 1_Angle2","PotentioMeter0","Rotation0","SNR","TNR","Temperature0","Temperature3","Temperature_MAX31855_0","Temperature_MAX31855_3","Vibra0_X","Vibra0_Y","Vibra0_Z","VoltageAccu0","VoltageRms0"],
"values":[
["2017-10-06T08:50:25.347Z",null,null,null,null,null,null,null,null,null,null,"41762721","Test-customer",null,null,null,null,-196,196,-196,null,null],
["2017-10-06T08:50:25.348Z",null,null,null,null,null,null,346.2964,76.11179,null,null,"41762721","Test-customer",null,null,null,null,null,null,null,null,null]]}]}]}
Here's the python code used to read it in:
import json
import pandas as pd

# Read test file.
# This reads the entire file into memory at once. If this is not
# possible for you, you may want to look into something like ijson:
# https://pypi.org/project/ijson/
with open("test.json", "rb") as f:
    data = json.load(f)

# Get the first element of the results list, and the first element of the series list.
# You may need a loop here, if your real data has more than one of these.
subset = data['results'][0]['series'][0]

values = subset['values']
columns = subset['columns']

df = pd.DataFrame(values, columns=columns)
print(df)
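If the real file holds more than one entry in "results" or "series", a small loop along the lines of the comment above can collect them all into one frame (an untested sketch):

import json
import pandas as pd

with open("test.json", "rb") as f:
    data = json.load(f)

# Collect every series of every result into one DataFrame
frames = []
for result in data['results']:
    for series in result['series']:
        frames.append(pd.DataFrame(series['values'], columns=series['columns']))

df_all = pd.concat(frames, ignore_index=True)
print(df_all)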

Read CSV with comma delimiter (sorting issue while importing csv)

I am trying to open a csv file, skipping the first 5 rows. The data is not getting aligned in the dataframe. See the screenshot of the file
PO = pd.DataFrame()
PO = pd.read_table('acct.csv', sep='\t', skiprows=5, skip_blank_lines=True)
PO
Try sorting it by date after the import, as shown below.
First import your data with the proper separator / delimiter, because at the moment it is stuck to the index values (see the data image and the data again). Once you have the proper separator / delimiter you can do the following:
do = pd.read_csv('check_test.csv', delimiter='\t', skiprows=range(1, 7), skip_blank_lines=True, encoding="utf8")
d01 = do.iloc[:, 1:7]
d02 = d01.sort_values(['Date', 'Reference', 'Debit'])
This sorts the values in the way you want.

extract specific data from a csv file with specified row and column names

The CSV module of Python is pretty new to me and I would like to get some help with a specific task.
I am looking to extract data (numeric values) from csv-file-1 based on its row and column names. Secondly, I would like to put this data into another csv file, in a new column, at the line corresponding to the row name's data from csv-file-1.
Here are examples of my two dataframes (csv format, sep = ","):
csv-file-1:
seq_label,id3,id4
id1,0.3,0.2
id2,0.4,0.7
csv-file-2:
seq_label,x1,...
id1,id3,...
id2,id4,...
For example, I would like to select values from csv-file-1 which correspond to the intersection of the row names in "seq_label" and the "x1" variables in csv-file-2.
Then, I would like to create a new csv file (csv-file-3) which is the fusion of csv-file-2 and the data extracted from csv-file-1, in this way:
csv-file-3 ("x3" is the new variable or new column with extracted values):
seq_label,x1,...,x3
id1,id3,...,0.3
id2,id4,...,0.7
Could someone give me a hand on this?
Best regards
This is just an example with comments to explain the steps. Hope it'll help you.
import csv

with open("path to file", "r") as f:  # open the file in read mode
    r = csv.reader(f)                 # create a csv reader
    content = list(r)                 # get the content of the file in a list

column = ["x3", 0.3, 0.7, ...]        # prepare the last column
content.append(column)                # add it to the content list

with open("path to file 2", "w", newline='') as f2:  # open file 2 in order to write into it
    w = csv.writer(f2)
    w.writerows(content)              # write the new content
The csv lib will return a list for each row.
What you want to do is:
read the first csv and convert it into something you can use (depends on whether you want row- or column-based access),
do the same for csv2,
for each line of csv1 search for a match in csv2 and add it to your internal data,
and finally write this data to your output file.
You might also want to look at
https://pandas.pydata.org/
since it seems like you could save a lot of time using pandas instead of the plain csv methods.
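For the layout shown in the question, a pandas version could look roughly like this (a sketch only; the file names are placeholders and the real files are assumed to have a seq_label column as in the examples):

import pandas as pd

# csv-file-1: rows indexed by seq_label, columns are the id names (id3, id4, ...)
df1 = pd.read_csv('csv-file-1.csv', index_col='seq_label')
# csv-file-2: one row per seq_label, with the id to look up held in the x1 column
df2 = pd.read_csv('csv-file-2.csv')

# For each row of csv-file-2, pick the value in csv-file-1 at
# (that row's seq_label, that row's x1), e.g. df1.loc['id1', 'id3'] -> 0.3
df2['x3'] = [df1.loc[row['seq_label'], row['x1']] for _, row in df2.iterrows()]

# Write the merged result as csv-file-3
df2.to_csv('csv-file-3.csv', index=False)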

Read multiple txt files into Dict into Pandas dataframe

I am trying to load multiple txt files into a dataframe. I know how to load URLs, csv, and excel, but I couldn't find any reference on how to load multiple txt files into a dataframe and match them with a dictionary, or vice versa.
The text files are not comma or tab separated, just plain text containing song lyrics.
I checked the pandas documentation; any assistance is welcome.
https://pandas.pydata.org/pandas-docs/stable/reference/io.html
Ideally, the dataframe I hope to achieve would look like this example:
| lyrics
-------------+-----------------------------------------------------------------------------------------
bonjovi | some text from the text files HiHello! WelcomeThank you Thank you for coming.
-------------+---------------------------------------------------------------------------------------
lukebryan | some other text from the text files.Hi.Hello WelcomeThank you Thank you for coming.
-------------+-----------------------------------------------------------------------------------------
johnprine | yet some text from the text files. Hi.Hello WelcomeThank you Thank you for coming.
Basic example
folder structure /lyrics/
urls = [
'lyrics/bonjovi.txt',
'lyrics/lukebryan.txt',
'lyrics/johnprine.txt',
'lyrics/brunomars.txt',
'lyrics/methodman.txt',
'lyrics/bobmarley.txt',
'lyrics/nickcannon.txt',
'lyrics/weeknd.txt',
'lyrics/dojacat.txt',
'lyrics/ladygaga.txt',
'lyrics/dualipa.txt',
'lyrics/justinbieber.txt',]
musician names
bands = ['bonjovi', 'lukebryan', 'johnprine', 'brunomars', 'methodman', 'bobmarley', 'nickcannon', 'weeknd', 'dojacat', 'ladygaga', 'dualipa', 'justinbieber']
Open the text files
The files are in the directory lyrics/, from where I am running my Jupyter notebook.
import pickle

for i, c in enumerate(bands):
    with open("lyrics/" + c + ".txt", "wb") as file:
        pickle.dump(lyrics[i], file)
Double check to make sure the data has been loaded properly:
data.keys()
Hopefully I get a result like this:
dict_keys(['bonjovi', 'lukebryan', 'johnprine', 'brunomars', 'methodman', 'bobmarley', 'nickcannon', 'weeknd', 'dojacat', 'ladygaga', 'dualipa', 'justinbieber'])
# We are going to change this to key: artist, value: string format
def combine_text(list_of_text):
    '''Takes a list of text and combines them into one large chunk of text.'''
    combined_text = ' '.join(list_of_text)
    return combined_text

# Combine it!
data_combined = {key: [combine_text(value)] for (key, value) in data.items()}
We can either keep it in dictionary format or put it into a pandas dataframe
import pandas as pd
pd.set_option('max_colwidth',150)
data_df = pd.DataFrame.from_dict(data_combined).transpose()
data_df.columns = ['lyrics']
data_df = data_df.sort_index()
data_df
import os
import re
import pandas as pd

# get full path of each txt file
filePath = []
for file in os.listdir("./lyrics"):
    filePath.append(os.path.join("./lyrics", file))

# pull file name from the path with regex, capturing the text before the .txt
fileName = re.compile(r'\\(.*)\.txt')

# make empty dict data with the key as the file name, and the value as the words in the file
data = {}
for file in filePath:
    # capturing file name
    key = fileName.search(file)
    with open(file, "r") as readFile:
        # note that key[1] is the capture group from our search, and that the text is put into a list
        data[key[1]] = [readFile.read()]

# make dataframe from dict, and rename columns
df = pd.DataFrame(data).T.reset_index().rename(columns={'index': 'bands', 0: 'lyrics'})
This is how I would do it. Notice I generalized the file manipulation, so I don't have to worry about manually making the list for the keys, and ensure everything matches up.

pandas HDF select does not recognise column name

I'm trying to process a large (2 GB) csv file on a machine with only 4 GB of RAM (don't ask) to produce a different, formatted csv containing a subset of data that needs some processing. I'm reading the file and creating an HDFStore that I query later for the data that I require for output. Everything works except that I can't retrieve data from the store using Term - the error message comes back that PLOT is not a column name. Individual variables look fine and the store is what I expect; I just can't see where the error is. (NB pandas 0.14 and numpy 1.9.0.) Very new to this, so apologies for the clunky code.
#wibble wobble -*- coding: utf-8 -*-
# short version
def filesport():
    import pandas as pd
    import numpy as np
    from pandas.io.pytables import Term

    Location = r"CL_short.csv"
    store = pd.HDFStore('blarg.h5')

    maxlines = sum(1 for line in open(Location))
    print maxlines

    #set chunk small for test file
    chunky = 4

    plotty = pd.DataFrame(columns=['PLOT'])
    dfdum = pd.DataFrame(columns=['PLOT', 'mDate', 'D100'])

    #read file in chunks to avoid RAM blowing up
    bucket = pd.read_csv(Location, iterator=True, chunksize=chunky, usecols=['PLOT', 'mDate', 'D100'])
    for chunk in bucket:
        store.append('wibble', chunk, format='table', data_columns=['PLOT', 'mDate', 'D100'], ignore_index=True)

    #retrieve plot numbers and select unique items
    plotty = store.select('wibble', "columns = ['PLOT']")
    plotty.drop_duplicates(inplace=True)

    #iterate through unique plots to retrieve data and put in dataframe for output
    for index, row in plotty.iterrows():
        dfdum = store.select('wibble', [Term('PLOT', '=', plotty.iloc[index]['PLOT'])])
        #process dfdum for output to new csv

    print("successful completion")

filesport()
Final listing for those that wish to fight through the tumbleweed to reach here and are similarly bemused by processing large .csv files and the various methods of trying to retrieve/process data. The biggest problem was getting the syntax of the pytables Term right. Despite several examples indicating that it was possible to use 'A > 20' etc., this never worked for me. I set up a string condition containing the Term query and this worked (it is in the documentation TBF).
Also found it easier to query the HDF to retrieve unique items direct from the store in a list which could then be sorted and iterated through to retrieve data plot by plot. Note that I wanted the final csv file to have plot and then all the D100 data in date order, hence the pivot at the end.
Reading the csv file in chunks meant that each plot retrieved from the store had a header and this got written to the final csv which messed things up. I'm sure there's a more elegant way of only writing one header than the one I've shown here.
It works, takes about 2 hours to process the data and produce the final csv file (initial file 2GB, 30+million lines, data for 100,000+ unique plots, machine has 4GB of RAM but running 32-bit which means that only 2.5GB of RAM was available).
Good luck if you have a similar problem, and I hope you find this useful
#wibble wobble -*- coding: utf-8 -*-
def filesport():
    import pandas as pd
    import numpy as np
    from pandas.io.pytables import Term

    print (pd.__version__)
    print (np.__version__)

    Location = r"conliq_med.csv"
    store = pd.HDFStore('blarg.h5')

    maxlines = sum(1 for line in open(Location))
    print maxlines

    chunky = 100000

    #read file in chunks to avoid RAM blowing up, select only needed columns
    bucket = pd.read_csv(Location, iterator=True, chunksize=chunky, usecols=['PLOT', 'mDate', 'D100'])
    for chunk in bucket:
        store.append('wibble', chunk, format='table', data_columns=['PLOT', 'mDate', 'D100'], ignore_index=True)

    #retrieve unique plots and sort
    plotty = store.select_column('wibble', 'PLOT').unique()
    plotty.sort()

    #set flag for writing file header
    i = 0

    #iterate through unique plots to retrieve data and put in dataframe for output
    for item in plotty:
        condition = 'PLOT =' + str(item)
        dfdum = store.select('wibble', [Term(condition)])
        dfdum["mDate"] = pd.to_datetime(dfdum["mDate"], dayfirst=True)
        dfdum.sort(columns=["PLOT", "mDate"], inplace=True)
        dfdum["mDate"] = dfdum["mDate"].map(lambda x: x.strftime("%Y - %m"))
        dfdum = dfdum.pivot("PLOT", "mDate", "D100")

        #only print one header to file
        if i == 0:
            dfdum.to_csv("CL_OP.csv", mode='a')
            i = 1
        else:
            dfdum.to_csv("CL_OP.csv", mode='a', header=False)

    print("successful completion")

filesport()
