I am trying to load multiple .txt files into a dataframe. I know how to load URLs, CSV, and Excel files, but I couldn't find any reference on how to load multiple .txt files into a dataframe and match them with a dictionary, or vice versa.
The text files are not comma or tab separated; they just contain plain-text song lyrics.
I checked the pandas I/O documentation; any assistance is welcome.
https://pandas.pydata.org/pandas-docs/stable/reference/io.html
The dataframe I hope to achieve would look like this example:
          | lyrics
----------+-------------------------------------------------------------------------------------
bonjovi   | some text from the text files HiHello! WelcomeThank you Thank you for coming.
----------+-------------------------------------------------------------------------------------
lukebryan | some other text from the text files.Hi.Hello WelcomeThank you Thank you for coming.
----------+-------------------------------------------------------------------------------------
johnprine | yet some text from the text files. Hi.Hello WelcomeThank you Thank you for coming.
Basic example
Folder structure: lyrics/
urls = ['lyrics/bonjovi.txt',
        'lyrics/lukebryan.txt',
        'lyrics/johnprine.txt',
        'lyrics/brunomars.txt',
        'lyrics/methodman.txt',
        'lyrics/bobmarley.txt',
        'lyrics/nickcannon.txt',
        'lyrics/weeknd.txt',
        'lyrics/dojacat.txt',
        'lyrics/ladygaga.txt',
        'lyrics/dualipa.txt',
        'lyrics/justinbieber.txt']
Musician names:
bands = ['bonjovi', 'lukebryan', 'johnprine', 'brunomars', 'methodman', 'bobmarley', 'nickcannon', 'weeknd', 'dojacat', 'ladygaga', 'dualipa', 'justinbieber']
Open the text files
The files are in the directory lyrics/, relative to where I am running my Jupyter notebook.

data = {}
for c in bands:
    with open("lyrics/" + c + ".txt", "r") as file:
        data[c] = file.readlines()
Double-check to make sure the data has been loaded properly:
data.keys()
Hopefully I get a result like this:
dict_keys(['bonjovi', 'lukebryan', 'johnprine', 'brunomars', 'methodman', 'bobmarley', 'nickcannon', 'weeknd', 'dojacat', 'ladygaga', 'dualipa', 'justinbieber'])
# Combine it!
# We are going to change this to key: artist, value: string format
def combine_text(list_of_text):
    '''Takes a list of text and combines them into one large chunk of text.'''
    combined_text = ' '.join(list_of_text)
    return combined_text

data_combined = {key: [combine_text(value)] for (key, value) in data.items()}
We can either keep it in dictionary format or put it into a pandas dataframe:

import pandas as pd
pd.set_option('display.max_colwidth', 150)

data_df = pd.DataFrame.from_dict(data_combined).transpose()
data_df.columns = ['lyrics']
data_df = data_df.sort_index()
data_df
import os
import re
import pandas as pd
# get the full path of each txt file
filePath = []
for file in os.listdir("./lyrics"):
    filePath.append(os.path.join("./lyrics", file))

# pull the file name out of the path with a regex, capturing the text before the .txt
# ([\\/] matches the path separator on both Windows and Unix)
fileName = re.compile(r'[\\/]([^\\/]*)\.txt')

# make an empty dict data with the file name as the key and the words in the file as the value
data = {}
for file in filePath:
    # capturing the file name
    key = fileName.search(file)
    with open(file, "r") as readFile:
        # note that key[1] is the capture group from our search, and that the text is put into a list
        data[key[1]] = [readFile.read()]

# make a dataframe from the dict, and rename the columns
df = pd.DataFrame(data).T.reset_index().rename(columns = {'index':'bands', 0:'lyrics'})
This is how I would do it. Notice I generalized the file manipulation, so I don't have to worry about manually making the list for the keys, and everything is guaranteed to match up.
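If you'd rather skip the regex entirely, here is a minimal sketch of the same idea using pathlib (an assumption on my part, not part of the answer above; Path.stem gives the file name without its extension):

from pathlib import Path
import pandas as pd

# build the same {band: [lyrics]} dict without a regex
data = {p.stem: [p.read_text()] for p in Path("lyrics").glob("*.txt")}
df = pd.DataFrame(data).T.reset_index().rename(columns={'index': 'bands', 0: 'lyrics'})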
Related
Using pandas, I'm trying to extract a value using its key, but I keep failing to do so. Could you help me with this?
There's a csv file like below:
value
"{""id"":""1234"",""currency"":""USD""}"
"{""id"":""5678"",""currency"":""EUR""}"
I imported this file in Pandas and made a DataFrame out of it:
[screenshot: the dataframe created from the csv file]
However, when I try to extract a value using a key (e.g. df["id"]), I get an error message.
I'd like to see the value 1234 or 5678 using df["id"]. Which step should I take to get this done? This may be a very basic question, but I need your help. Thanks.
The csv file isn't being read in correctly.
You haven't set a delimiter; pandas can automatically detect a delimiter, but hasn't done so in your case. See the read_csv documentation for more on this. Because no delimiter is detected, the pandas dataframe has a single column, value, which holds entire lines from your file as individual cells - the first entry is "{""id"":""1234"",""currency"":""USD""}". So the file doesn't have a column id, and you can't select data by id.
The data aren't formatted as a pandas df, with row titles and columns of data. One option is to read in this data is to manually process each row, though there may be slicker options.
file = 'test.dat'
f = open(file, 'r')
id_vals = []
currency = []
for line in f.readlines()[1:]:
    ## remove obfuscating characters
    for c in '"{}\n':
        line = line.replace(c, '')
    line = line.split(',')
    ## extract values to the two lists
    id_vals.append(line[0][3:])
    currency.append(line[1][9:])
f.close()
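From there, a minimal sketch of assembling the two lists into a dataframe (assuming pandas is imported as pd), so the lookup from the question works:

import pandas as pd

# build a dataframe from the parsed lists
df = pd.DataFrame({'id': id_vals, 'currency': currency})
print(df['id'])  # 1234, 5678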
You just need to clean up the CSV file a little and you are good. Here is every step:
import re
import pandas as pd

# open your csv and read it as a text string
with open('My_CSV.csv', 'r') as f:
    my_csv_text = f.read()

# remove problematic strings
find_str = ['{', '}', '"', 'id:', 'currency:', 'value']
replace_str = ''
for i in find_str:
    my_csv_text = re.sub(i, replace_str, my_csv_text)

# create a new csv file and save the cleaned text
new_csv_path = './my_new_csv.csv'  # or whatever path and name you want
with open(new_csv_path, 'w') as f:
    f.write(my_csv_text)

# create the pandas dataframe
df = pd.read_csv('my_new_csv.csv', sep=',', names=['ID', 'Currency'])
print(df)
Output df:
ID Currency
0 1234 USD
1 5678 EUR
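As a side note, the intermediate file isn't strictly required; a minimal sketch of the same read straight from memory (assuming the cleaned string from above):

from io import StringIO

# parse the cleaned text without writing a second file
df = pd.read_csv(StringIO(my_csv_text), sep=',', names=['ID', 'Currency'])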
You need to parse each row of your dataframe using json.loads() or eval(),
something like this:

import json
for row in df.itertuples():
    print(json.loads(row.value)["id"])
    # OR (eval also works, but is unsafe on untrusted input)
    print(eval(row.value)["id"])
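If the goal is proper columns rather than printed values, one alternative sketch (my assumption, not part of the answer above; it needs a pandas version with json_normalize, i.e. >= 1.0):

import json
import pandas as pd

# parse every JSON string once and expand into real columns
parsed = pd.json_normalize(df['value'].apply(json.loads).tolist())
print(parsed['id'])        # 1234, 5678
print(parsed['currency'])  # USD, EUR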
Hello, I am trying to find the number of rows for all files within a folder. I am trying to do this for a folder that contains only ".txt" files and for a folder that contains ".csv" files.
I know that the way to get the number of rows for a SINGLE ".txt" file is something like this:
file = open("sample.txt", "r")
Counter = 0
Content = file.read()
CoList = Content.split("\n")
for i in CoList:
    if i:
        Counter += 1
print("This is the number of lines in the file")
print(Counter)
Whereas for a SINGLE ".csv" file it is something like this:

import csv

file = open("sample.csv")
reader = csv.reader(file)
lines = len(list(reader))
print(lines)
But how can I do this for ALL files within a folder? That is, how can I loop each of these procedures across all files within a folder and, ideally, export the output into an excel sheet with columns akin to these:
Filename    Number of Rows
1.txt       900
2.txt       653
and so on and so on.
Thank you so much for your help.
You can use glob to detect the files and then just iterate over them.
Other methods : How do I list all files of a directory?
import glob
# 1. list all text files in the directory
rel_filepaths = glob.glob("*.txt")
# 2. (optional) create a function to count the number of rows in a file
def count_rows(filepath):
    with open(filepath, 'r') as f:
        return len(f.readlines())

# 3. iterate over your files and use the count_rows function
counts = [count_rows(filepath) for filepath in rel_filepaths]
print(counts)
Then, if you want to export this result to a .csv or .xlsx file, I recommend using pandas.
import pandas as pd
# 1. create a new table and add your two columns filled with the previous values
df = pd.DataFrame()
df["Filename"] = rel_filepaths
df["Number of rows"] = counts
# 2. export this dataframe to `.csv`
df.to_csv("results.csv")
You can also use pandas.ExcelWriter() if you want to use the .xlsx format. Link to documentation & examples : Pandas - ExcelWriter doc
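For instance, a minimal sketch of the .xlsx export (assuming an Excel engine such as openpyxl is installed; to_excel uses an ExcelWriter under the hood):

# export the same dataframe to .xlsx instead of .csv
with pd.ExcelWriter("results.xlsx") as writer:
    df.to_excel(writer, sheet_name="row_counts", index=False)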
I'm relatively new to tensorflow and therefore I'm struggling with the data preparation.
I have a folder with about 500 .txt files. Each of these files contains the data and a label for the data. (The data represents MFCCs, which are audio features generated for each "frame" of a .wav audio file.)
Each of these files looks like this:
1
1.013302233064514191e+01
-1.913611804400369110e+01
1.067932213100989847e+00
1.308777013246182364e+01
-3.591032944037165109e+00
1.294307486784356698e+01
5.628056691023937574e+00
5.311223121033092909e+00
1.069261850699697014e+01
4.398722698218969995e+00
5.045254154360372389e+00
7.757820364628694954e+00
-2.666228281486863416e+00
9.236707894117541784e+00
-1.727334954006132151e+01
5.166050472560470119e+00
6.421742650353079007e+00
2.550240091606466031e+00
9.871269941885440602e+00
7.594591526898561984e-01
-2.877228968309437196e+00
5.592507658015017924e-01
8.828475996369435919e+00
2.946838169848354561e+00
8.420693074096489150e-01
7.032494888004835687e+00
...
The first line of each file holds the label of the data (in this case 1).
The rest of the file holds 13 numbers representing the 13 MFCCs for each frame, with the values separated by newlines.
So my question is: what's an easy way of getting the content of all these files into tensors so tensorflow can use them?
Thanks!
Not sure if this is the most optimized way of doing it, but it can be done as explained in the steps below:
Iterate through each text file and append its data to a list
Replace '\n' in each element with ',', because our goal is to create a CSV out of it
Write the elements of the list, separated by commas, to a CSV file
Finally, convert the CSV file to a TensorFlow dataset using tf.data.experimental.make_csv_dataset (see the sketch after the code below). Please see this tutorial on how to convert a CSV file to a TensorFlow dataset.
The code which performs the first three steps mentioned above is given below:
import os
import pandas as pd

# the folder where all the text files are present
Path_Of_Text_Files = '/home/mothukuru/Jupyter_Notebooks/Stack_Overflow/Text_Files'
List_of_Files = os.listdir(Path_Of_Text_Files)
List_Of_Elements = []

# iterate through each text file and append its data to a list
for EachFile in List_of_Files:
    with open(os.path.join(Path_Of_Text_Files, EachFile), 'r') as FileObj:
        List_Of_Elements.append(FileObj.readlines())

# remove the '\n' at the end of each column
for i in range(len(List_Of_Elements)):
    List_Of_Elements[i] = [sub.replace('\n', ',') for sub in List_Of_Elements[i]]

Column_Names = ['Label,', 'F1,', 'F2,', 'F3,', 'F4,', 'F5,', 'F6,', 'F7,',
                'F8,', 'F9,', 'F10,', 'F11,', 'F12,', 'F13']

# write the data in List_Of_Elements to a CSV file, one row per source file
with open(os.path.join(Path_Of_Text_Files, 'Final_Data.csv'), 'w') as FileObj:
    FileObj.writelines(Column_Names)
    for EachElement in List_Of_Elements:
        FileObj.write('\n')
        FileObj.writelines(EachElement)

Path_Of_Final_CSV = os.path.join(Path_Of_Text_Files, 'Final_Data.csv')
Data = pd.read_csv(Path_Of_Final_CSV, index_col = False)
To check that our data is fine, print(Data.head()) will output the first few rows.
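For step 4, a minimal sketch of loading the generated CSV into a tf.data dataset (assuming TensorFlow 2.x; 'Label' is the header column written above, and batch_size is an arbitrary choice):

import tensorflow as tf

# turn the CSV into a dataset of (features, label) batches
dataset = tf.data.experimental.make_csv_dataset(
    Path_Of_Final_CSV,
    batch_size=32,
    label_name='Label',
    num_epochs=1)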
I have almost 1,000,000 files, or even more, in a path.
My final goal is to extract some information from just the names of the files.
So far I have saved the names of the files in a list.
What information is in the names of the files?
The format of the file names is something like this:
09066271_142468576_1_Haha_-Haha-haha_2016-10-07_haha-false_haha2427.txt
All the "haha" parts are other, varying text that does not matter.
I want to extract 09066271 and 2016-10-07 out of the names and save them in a dataframe. The first number is always 8 characters.
So far, I have saved all the txt file names in the list:
path = 'path to the saved txt files/fldr'
file_list = os.listdir(path)
At first I wanted to save all the txt file names in a dataframe and then do these operations on them. It seems I would first have to read them into numpy and then reshape the array to be readable in pandas, but I don't know the reshape numbers in advance:
df = pd.DataFrame(np.array(file_list).reshape(,))
I would appreciate your ideas on an efficient way of doing this :)
You can use os to list all of the files. Then just construct a DataFrame and use the string methods to get the parts of the filenames you need.
import pandas as pd
import os
path = 'path to the saved txt files/fldr'
file_list = os.listdir(path)
df = pd.DataFrame(file_list, columns=['file_name'])
df['data'] = df.file_name.str[0:8]
df['date'] = df.file_name.str.extract(r'(\d{4}-\d{2}-\d{2})', expand=True)
file_name data date
0 09066271_142468576_1_Haha_-Haha-haha_2016-10-0... 09066271 2016-10-07
1 09014271_142468576_1_Haha_-Haha-haha_2013-02-1... 09014271 2013-02-18
I want to make a script that copies the 2nd column from multiple csv files in a folder, adds some text to each value, and saves the result to a single csv file.
Here is what I want to do:
1.) Grab the data in the 2nd column from all csv files
2.) Append the text "Hello" and "welcome" to each row, at the start and the end
3.) Write the data into a single file
I tried creating it using pandas:
import os
import pandas as pd
dataframes = [pd.read_csv(p, index_col=2, header=None) for p in ('1.csv','2.csv','3.csv')]
merged_dataframe = pd.concat(dataframes, axis=0)
merged_dataframe.to_csv("all.csv", index=False)
The problem is:
In the above code I am forced to list the file names manually, which is very difficult; as a solution I need to include all csv files, i.e. *.csv.
I need to use something like writer.writerow("Hello" + r[1] + "welcome").
There are multiple csv files with many rows (around 100k) in each file, so I need to speed things up as well.
Here is a sample of the csv files:
"1.csv" "2.csv" "3.csv"
a,Jac b,William c,James
And here is how I would like the output to look all.csv:
Hello Jac welcome
Hello William welcome
Hello James welcome
Any solution using .merge(), .append(), or .concat()?
How can I achieve this using python?
You don't need pandas for this. Here's a really simple way of doing this with csv
import csv
import glob
with open("path/to/output", 'w') as outfile:
    for fpath in glob.glob('path/to/directory/*.csv'):
        with open(fpath) as infile:
            for row in csv.reader(infile):
                outfile.write("Hello {} welcome\n".format(row[1]))
1) If you would like to import all .csv files in a folder, you can just use
for i in [a for a in os.listdir() if a[-4:] == '.csv']:
    # read in each .csv file and concatenate it to an existing dataframe df
    df = pd.concat([df, pd.read_csv(i)])
2) To append the text and write to a file, you can map a function to each element of the dataframe's column 2 to add the text.
#existing dataframe called df
df[df.columns[1]].map(lambda x: "Hello {} welcome".format(x)).to_csv(<targetpath>)
#replace <targetpath> with your target path
See http://pandas.pydata.org/pandas-docs/version/0.17.0/generated/pandas.Series.to_csv.html for all the various parameters you can pass in to to_csv.
Here is a non-pandas solution using the built-in csv module. Not sure about speed.
import os
import csv

path_to_files = "path to files"
all_csv = os.path.join(path_to_files, "all.csv")
file_list = os.listdir(path_to_files)
names = []

# collect the 2nd column from every csv file in the folder
for file in file_list:
    if file.endswith(".csv"):
        path_to_current_file = os.path.join(path_to_files, file)
        with open(path_to_current_file, "r") as current_csv:
            reader = csv.reader(current_csv, delimiter=',')
            for row in reader:
                names.append(row[1])

# write all collected names to a single output file
with open(all_csv, "w", newline="") as out_csv:
    writer = csv.writer(out_csv, delimiter=',')
    for name in names:
        writer.writerow(["Hello {} welcome".format(name)])