I'm trying to convert multiple text files into a single .csv file using Python. My current code is this:
import pandas
import glob

# Collect the file names of all .txt files in a given directory.
file_names = glob.glob("./*.txt")

# [Middle step] Merge the text files into a single file titled 'output_file.txt'.
with open('output_file.txt', 'w') as out_file:
    for i in file_names:
        with open(i) as in_file:
            for j in in_file:
                out_file.write(j)

# Read the merged file and create a dataframe.
data = pandas.read_csv("output_file.txt", delimiter='/')

# Store the dataframe in a csv file.
data.to_csv("convert_sample.csv", index=None)
So as you can see, I'm reading from all the files and merging them into a single .txt file. Then I convert it into a single .csv file. Is there a way to accomplish this without the middle step? Is it necessary to concatenate all my .txt files into a single .txt to convert it to .csv, or is there a way to directly convert multiple .txt files to a single .csv?
Thank you very much.
Of course it is possible. And you really don't need to involve pandas here, just use the standard library csv module. If you know the column names ahead of time, the most painless way is to use csv.DictWriter and csv.DictReader objects:
import csv
import glob

column_names = ['a', 'b', 'c']  # or whatever

with open("convert_sample.csv", 'w', newline='') as target:
    writer = csv.DictWriter(target, fieldnames=column_names)
    writer.writeheader()  # if you want a header
    for path in glob.glob("./*.txt"):
        with open(path, newline='') as source:
            reader = csv.DictReader(source, delimiter='/', fieldnames=column_names)
            writer.writerows(reader)
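If you don't know the column names ahead of time, the same pattern works with plain csv.reader and csv.writer objects. A minimal sketch, assuming every .txt file shares the same '/'-delimited layout:

import csv
import glob

with open("convert_sample.csv", 'w', newline='') as target:
    writer = csv.writer(target)
    for path in glob.glob("./*.txt"):
        with open(path, newline='') as source:
            # csv.reader needs no field names; each '/'-separated row passes through as-is
            reader = csv.reader(source, delimiter='/')
            writer.writerows(reader)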
There are multiple tsv files in a folder. I want to convert each tsv file into a csv file and merge all the csv files into one mega csv file.
import pandas as pd

customer_data = r"C:\Users\username\Desktop\folder\CustomerData_20201030031520.tsv"
customer_data = pd.read_csv(customer_data, sep='\t', low_memory=False)
This is how I read one file and write it to csv. How can I do this for multiple tsv files efficiently, rather than repeating it manually?
Notice the file name pattern? All the files follow this pattern:
CustomerData_<year><month><day_number><random_digits>.tsv
My objective is to merge all these CSVs into one mega CSV file.
If the need is to merge a pack of similarly formatted files, there is no need to actually load the data into memory; we can dump all files into one directly.
The snippet below checks the directory path for file names matching pattern and sorts the resulting list by name. After that, the files are written to out_file in sorted order.
The outfile.write("\n") is required if the *.tsv files do not end with a blank line; otherwise it should be commented out.
import os
import re

path = "c:\\temp\\1"
out_file = "c:\\temp\\1\\big_file.tsv"
# Match names like CustomerData_20201030031520.tsv (note the escaped dot).
pattern = re.compile(r"^.*_(\d{4})(\d{2})(\d{2})\d{1,10}\.\w{3}$")

matched_files = []
for f in os.listdir(path):
    if os.path.isdir(os.path.join(path, f)):
        continue
    if not pattern.match(f):
        continue
    matched_files.append(f)

matched_files = sorted(matched_files)

with open(out_file, "w+") as outfile:
    for f in matched_files:
        with open(os.path.join(path, f), "r") as infile:
            outfile.writelines(infile.readlines())
        outfile.write("\n")
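If you do want the pandas route from the question (parsing each .tsv and writing one merged .csv), a minimal in-memory sketch could look like this, reusing the path from above:

import glob
import os
import pandas as pd

path = "c:\\temp\\1"
# Read every matching .tsv into a dataframe, then write one merged .csv.
frames = [pd.read_csv(p, sep="\t", low_memory=False)
          for p in sorted(glob.glob(os.path.join(path, "CustomerData_*.tsv")))]
pd.concat(frames, ignore_index=True).to_csv(os.path.join(path, "big_file.csv"), index=False)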
I'm trying to do some data processing. My question is as follows: a folder (C://) contains multiple text files.
Read the 1st text file -> process (get some data inside) into list1
Read the 2nd text file -> process (get some data inside) into list2
.
.
Read the Nth text file -> process into listN
Write ([list1], [list2], ..., [listN]) into one Excel file.
To read N files you need a multidimensional list, i.e. a list of lists.
import os

path = "C://folder/"
files = os.listdir(path)

file_list = []
for file in files:
    with open(path + file, "r") as txt:
        file_list.append(txt.read().splitlines())
If .csv is the format you want to write, you would write the file like this:
from csv import writer

with open("test.csv", "w", newline="") as csv_file:
    write = writer(csv_file, delimiter=';')
    for file in file_list:
        write.writerow(file)
(This way, every csv row is one file, and every column holds one line of that file.)
If you want a .xls/.xlsx file, you could look at the documentation for the xlsxwriter module.
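For instance, a minimal xlsxwriter sketch (the output name is just an example), assuming the file_list built above and mirroring the csv layout with one file per worksheet row:

import xlsxwriter

workbook = xlsxwriter.Workbook("test.xlsx")
worksheet = workbook.add_worksheet()
for row_index, lines in enumerate(file_list):
    # write_row puts the list items into consecutive columns of one row
    worksheet.write_row(row_index, 0, lines)
workbook.close()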
I have some data files, say data1.txt, data2.txt, ... and so on. I want to read all these data files using a single loop structure and append the data values into a single file, say data-all.txt.
I am fine with any of the following programming languages: C, Python, MATLAB.
The pathlib module is great for globbing matching files, and for easy reads/writes:
from pathlib import Path

def all_files(dir, mask):
    # Yield every line of every file matching the mask.
    for path in Path(dir).glob(mask):
        yield from path.open()

Path('data_all.txt').write_text(''.join(all_files('.', 'data*.txt')))
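If the combined data is large, a streaming variant of the same idea avoids building one big string in memory; this sketch also skips the output file itself, since 'data*.txt' would match it once it exists:

from pathlib import Path

with Path('data_all.txt').open('w') as out:
    for path in sorted(Path('.').glob('data*.txt')):
        if path.name == 'data_all.txt':  # don't read the output file we are writing
            continue
        out.write(path.read_text())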
In Windows, use
copy data*.txt data-all.txt
In Unix, use
cat data*.txt >> data-all.txt
(Note that >> appends if data-all.txt already exists; use > to overwrite it.)
Use Python's zip and the csv module to achieve this within a single for loop. For example:
import csv

with open("data_all.csv", "w", newline="") as f:
    csv_writer = csv.writer(f)
    for d1, d2, d3 in zip(open("data1.txt", "r"), open("data2.txt", "r"), open("data3.txt", "r")):
        # strip the trailing newlines so they don't end up inside the csv cells
        csv_writer.writerow([d1.strip(), d2.strip(), d3.strip()])
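Note that zip stops at the shortest file. If the files can have different lengths, itertools.zip_longest pads the missing values instead; a sketch with an empty-string filler:

import csv
from itertools import zip_longest

with open("data_all.csv", "w", newline="") as f:
    csv_writer = csv.writer(f)
    files = [open(name, "r") for name in ("data1.txt", "data2.txt", "data3.txt")]
    for row in zip_longest(*files, fillvalue=""):
        csv_writer.writerow([cell.strip() for cell in row])
    for handle in files:
        handle.close()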
In Python, first you have to create the list of all file paths; you can use the glob library for this.
import glob
import pandas as pd

# Adjust the pattern to match your data files, e.g. data*
path_list = glob.glob('Path/To/Your/DataFolder/data*')
Then you can read the data using a list comprehension. It will give you a list of dataframes, one per data file in your folder:
list_data = [pd.read_csv(x, sep='\t') for x in path_list]
pd.concat will combine the data into a single dataframe:
data_all = pd.concat(list_data, ignore_index=True)
Now you can write the dataframe to a single file:
data_all.to_csv('Path', sep=',')
It can be done by reading the content of each file and writing it to an output file handle. The file names in your description contain numbers, so we might need to call sorted to sort them before we start reading. The files_search_pattern should point to the input directory ('PATH/*.txt'), and the same goes for the output file handle ("data-all.txt").
import glob

files_search_pattern = "*.txt"
files = sorted(glob.glob(files_search_pattern))

with open("data-all.txt", "wb") as output:
    for f in files:
        with open(f, "rb") as inputFile:
            output.write(inputFile.read())
I am trying to append several csv files into a single csv file using Python while adding the file name (or, even better, a sub-string of the file name) as a new variable. All files have headers. The following script does the trick of merging the files, but does not address the file-name-as-variable issue:
import glob

filenames = glob.glob("/filepath/*.csv")
outputfile = open("out.csv", "a")

for line in open(str(filenames[1])):
    outputfile.write(line)

for i in range(1, len(filenames)):
    f = open(str(filenames[i]))
    f.next()
    for line in f:
        outputfile.write(line)

outputfile.close()
I was wondering if there are any good suggestions. I have about 25k small csv files (less than 100KB each).
You can use Python's csv module to parse the CSV files for you, and to format the output. Example code (untested):
import csv

with open(output_filename, "wb") as outfile:
    writer = None
    for input_filename in filenames:
        with open(input_filename, "rb") as infile:
            reader = csv.DictReader(infile)
            if writer is None:
                field_names = ["Filename"] + reader.fieldnames
                writer = csv.DictWriter(outfile, field_names)
                writer.writeheader()
            for row in reader:
                row["Filename"] = input_filename
                writer.writerow(row)
A few notes:
Always use with to open files. This makes sure they will get closed again when you are done with them. Your code doesn't correctly close the input files.
CSV files should be opened in binary mode (in Python 2; in Python 3, open them in text mode with newline='' instead).
Indices start at 0 in Python. Your code skips the first file, and includes the lines from the second file twice. If you just want to iterate over a list, you don't need to bother with indices in Python. Simply use for x in my_list instead.
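In Python 3, the same code would open the files in text mode with newline=''; a sketch, assuming the same filenames list and output_filename as above:

import csv

with open(output_filename, "w", newline="") as outfile:
    writer = None
    for input_filename in filenames:
        with open(input_filename, newline="") as infile:
            reader = csv.DictReader(infile)
            if writer is None:
                # Prepend a Filename column to the header taken from the first file
                field_names = ["Filename"] + reader.fieldnames
                writer = csv.DictWriter(outfile, field_names)
                writer.writeheader()
            for row in reader:
                row["Filename"] = input_filename
                writer.writerow(row)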
Simple changes will achieve what you want:
For the first (header) line, change
outputfile.write(line) -> outputfile.write(line.rstrip('\n') + ',file\n')
and later, for the data lines,
outputfile.write(line.rstrip('\n') + ',' + filenames[i] + '\n')
(The rstrip is needed because each line already ends with a newline; the new column has to go before it.)
I have a set of data saved across multiple .csv files with a fixed number of columns. Each column corresponds to a different measurement.
I would like to add a header to each file. The header will be identical for all files and consists of three rows. Two of these rows are used to identify their corresponding columns.
I am thinking that I could save the header in a separate .csv file, then iteratively merge it with each data file using a for loop.
How can I do this in python? I am new to the language.
Yeah, you can do that easily with pandas. It will be faster and easier than what you're currently planning, which may create problems.
Three simple functions handle reading, merging and writing to a new file:
pandas.read_csv()
pandas.merge()
DataFrame.to_csv()
You can read what arguments they take and more details about them in the pandas documentation.
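A rough sketch of that approach (all file names here are hypothetical). Note that stacking the header rows on top of each data file is done with pandas.concat, since pandas.merge performs a column-wise join rather than a row-wise stack:

import pandas as pd
from pathlib import Path

# header.csv holds the three header rows; the data files match data*.csv
header = pd.read_csv("header.csv", header=None)

for data_path in Path(".").glob("data*.csv"):
    data = pd.read_csv(data_path, header=None)
    combined = pd.concat([header, data], ignore_index=True)
    # header=False because the header rows are already part of the data
    combined.to_csv(data_path.with_name(data_path.stem + "_with_header.csv"),
                    index=False, header=False)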
For your case, you may first need to create new files with the headers in them, then do another loop to add the rows, skipping the header.
with open("data_out.csv", "a") as fout:
    # first file: the header file
    with open("data.csv") as f:  # your header file
        for line in f:
            fout.write(line)
    # second file: skip its first line, then copy the rest
    with open("data_2.csv") as f:
        next(f)  # this will skip the first line
        for line in f:
            fout.write(line)
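To apply this to every data file in a loop, writing one new file per input, a sketch (header.csv and the data_*.csv pattern are hypothetical names; add a next(fin) before the copy loop if your data files carry their own header line to discard):

import glob

with open("header.csv") as f:
    header_lines = f.readlines()  # the header rows to prepend

for name in sorted(glob.glob("data_*.csv")):
    with open("with_header_" + name, "w") as fout:
        fout.writelines(header_lines)
        with open(name) as fin:
            for line in fin:
                fout.write(line)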
Instead of running a for loop that appends two files at a time, an easier solution is to put all the csv files you want to merge into a single folder and feed that path to the program. This will merge all the csv files into a single csv file.
(Note: the attributes of each file must be the same.)
import os
import pandas as pd

# Give the path to the folder containing the multiple csv files
path = "Path/To/Your/DataFolder"
dirList = os.listdir(path)

# Put all their names into a list
filenames = []
for item in dirList:
    if ".csv" in item:
        filenames.append(item)

# Create an empty dataframe to append to
df1 = pd.DataFrame()

# Convert each file to a dataframe and append it to dataframe df1
for f in filenames:
    df = pd.read_csv(os.path.join(path, f))
    df1 = df1.append(df)

# Write the combined dataframe to a single csv file (pick your own output name)
df1.to_csv("merged_data.csv", encoding='utf-8', index=False)
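On pandas 2.0 and later, where DataFrame.append was removed, the same merge is usually written with pd.concat; a sketch with the same placeholder paths:

import glob
import os
import pandas as pd

path = "Path/To/Your/DataFolder"
frames = [pd.read_csv(p) for p in glob.glob(os.path.join(path, "*.csv"))]
pd.concat(frames, ignore_index=True).to_csv("merged_data.csv", encoding='utf-8', index=False)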