I need to unzip multiple files within a directory and convert them to a csv file.
The files are numbered in order within the directory: 1.gz, 2.gz, 3.gz, etc.
Can this be done within a single script, or do I have to do it manually?
Edit: my current code is:
#! /usr/local/bin/python
import gzip
import csv
import os
f = gzip.open('1.gz', 'rb')
file_content = f.read()
filename = '1.txt'
target = open ('1.txt', 'w')
target.write(file_content)
target.close()
filename = '1.csv'
txt_file = '1.txt'
csv_file = '1.csv'
in_txt = csv.reader(open(txt_file, "rb"), delimiter = '\t')
out_csv = csv.writer(open(csv_file, 'wb'))
out_csv.writerows(in_txt)
dirname = '/home/user/Desktop'
filename = "1.txt"
pathname = os.path.abspath(os.path.join(dirname, filename))
if pathname.startswith(dirname):
    os.remove(pathname)
f.close()
My current plan is to count the .gz files in each directory and loop over them, unzipping each file and writing out the txt/csv.
Is that feasible, or is there a better way to do this?
Also, is Python similar to Perl, in which double quotes interpolate the string?
You hardly need Python for this :)
But you can do this in a single Python script. You'll need to use:
os
os.path (possibly)
gzip
glob (will get you a nice glob listing of files, e.g. glob.glob("*.gz"))
Have a read up on these modules over at https://docs.python.org/ and have a go! :)
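As a rough sketch of how those modules fit together, the loop below converts every .gz file in a directory to a CSV file with the same base name. It assumes Python 3 and that each archive holds tab-separated text, as in the question's code; the function name gz_to_csv is made up for illustration.

```python
import csv
import glob
import gzip
import os


def gz_to_csv(directory):
    """Convert every tab-separated N.gz file in `directory` to N.csv."""
    written = []
    for gz_path in sorted(glob.glob(os.path.join(directory, "*.gz"))):
        csv_path = os.path.splitext(gz_path)[0] + ".csv"
        # "rt" decompresses and decodes in one step, so no temporary .txt is needed.
        with gzip.open(gz_path, "rt", newline="") as src, \
                open(csv_path, "w", newline="") as dst:
            csv.writer(dst).writerows(csv.reader(src, delimiter="\t"))
        written.append(csv_path)
    return written
```

Calling gz_to_csv('/home/user/Desktop') would then leave 1.csv, 2.csv and so on next to the original archives, without counting the files first.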
Related
I have many files in a directory, like ['FredrikstadAvst1.dbf', 'FredrikstadAvst2.dbf', ...]. I want to write a Python script to concatenate these files into a new "*.dbf" file.
I have a Python script that almost does the job, but it keeps overwriting the output file, so when the job is finished the output file contains only the last file in my directory.
Here is my script:
import os, glob, shutil
folder_path = r'C:\Tom\Oppdrag_2019\Pendle\2018'
for filename in glob.glob(os.path.join(folder_path, '*.dbf')):
    fd = open(filename, 'r')
    List = []
    List.append(fd)
    print filename
    wfd = open(r"C:\Tom\Oppdrag_2019\Pendle\FredrikstadAvst.dbf",'a')
    shutil.copyfileobj(fd, wfd, 1024*1024*10)
Consider the following:
import os, glob, shutil
folder_path = r'C:\Tom\Oppdrag_2019\Pendle\2018'
# Open the output once, before the loop; binary mode because .dbf files are not plain text.
wfd = open(r"C:\Tom\Oppdrag_2019\Pendle\FredrikstadAvst.dbf", 'wb')
for filename in glob.glob(os.path.join(folder_path, '*.dbf')):
    fd = open(filename, 'rb')
    shutil.copyfileobj(fd, wfd, 1024*1024*10)
    fd.close()
wfd.close()
By opening the output file before the loop and closing it only after iterating over every dbf file, it shouldn't be overwritten. I removed List because I can't see what it is being used for here. (The name is not reserved, but it is easily confused with the built-in list, so it is best avoided.)
This almost works now, but the header is written for every file. I want the header written only the first time. How can I skip the header for the remaining files?
I have compressed a file into several chunks using 7zip:
HAVE:
foo.txt.gz.001
foo.txt.gz.002
foo.txt.gz.003
foo.txt.gz.004
foo.txt.gz.005
WANT:
foo.txt
How do I unzip and combine these chunks to get a single file using python?
First, get the list of all files.
files = ['/path/to/foo.txt.gz.001', '/path/to/foo.txt.gz.002', '/path/to/foo.txt.gz.003']
Then iterate over each file and append to a result file.
with open('./result.gz', 'ab') as result: # append in binary mode
    for f in files:
        with open(f, 'rb') as tmpf: # open in binary mode also
            result.write(tmpf.read())
Then extract it using the zipfile library, or gzip if the joined file is a gzip archive, as the .gz name suggests. You could use tempfile to avoid handling the temporary archive yourself.
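The two steps above could be combined roughly as follows. This is only a sketch: it assumes the chunks are a plain byte-wise split of a single gzip file (so that joining them in name order restores it), and the function name join_and_gunzip is hypothetical.

```python
import glob
import gzip
import shutil


def join_and_gunzip(chunk_pattern, output_path):
    """Concatenate split gzip chunks in name order and decompress the result."""
    joined_path = output_path + ".gz"
    # Zero-padded suffixes (.001, .002, ...) sort correctly as plain text.
    chunks = sorted(glob.glob(chunk_pattern))
    # "wb" rather than "ab", so running the function twice does not append twice.
    with open(joined_path, "wb") as joined:
        for chunk in chunks:
            with open(chunk, "rb") as part:
                shutil.copyfileobj(part, joined)
    with gzip.open(joined_path, "rb") as src, open(output_path, "wb") as dst:
        shutil.copyfileobj(src, dst)
    return chunks
```

For the files in the question this would be called as join_and_gunzip('foo.txt.gz.*', 'foo.txt'). If the chunks were instead written as volumes of a 7z archive, gzip would not be able to read the joined file and a 7-Zip-aware tool would be needed.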
First you must extract all the zip files sequentially:
import zipfile
paths = ["path_to_1", "path_to_2"]
extract_paths = ["path_to_extract1", "path_to_extract2"]
# Pair each archive with its target folder; range(0, paths) would raise a TypeError.
for path, extract_path in zip(paths, extract_paths):
    with zipfile.ZipFile(path, 'r') as zip_ref:
        zip_ref.extractall(extract_path)
Next, go to each extracted location, open() the individual files and read() them into strings. Concatenate those strings and save the result to foo.txt.
import os, gzip, shutil
dir_name = '/Users/username/Desktop/data'
def gz_extract(directory):
    extension = ".gz"
    os.chdir(directory)
    for item in os.listdir(directory): # loop through items in dir
        if item.endswith(extension): # check for ".gz" extension
            gz_name = os.path.abspath(item) # get full path of files
            file_name = (os.path.basename(gz_name)).rsplit('.',1)[0] #get file name for file within
            with gzip.open(gz_name,"rb") as f_in, open(file_name,"wb") as f_out:
                shutil.copyfileobj(f_in, f_out)
            os.remove(gz_name) # delete zipped file
gz_extract(dir_name)
I have a directory /directory/some_directory/ and in that directory I have a set of files. Those files are named in the following format: <letter>-<number>_<date>-<time>_<dataidentifier>.log, for example:
ABC1-123_20162005-171738_somestring.log
DE-456_20162005-171738_somestring.log
ABC1-123_20162005-153416_somestring.log
FG-1098_20162005-171738_somestring.log
ABC1-123_20162005-031738_somestring.log
DE-456_20162005-171738_somestring.log
I would like to read a subset of those files (for example, only the files named ABC1-123*.log) and export all their contents to a single CSV file (for example, output.csv), that is, a CSV file that collects the data from all the individual files.
The code that I have written so far:
#!/usr/bin/env python
import os
file_dir=os.getcwd()
m_class="ABC1"
m_id="123"
device=m_class+"-"+m_id
for data_file in sorted(os.listdir(file_dir)):
    if str(device)+"*" in os.listdir(file_dir):
        print data_file
I don't know how to read only a subset of filtered files, or how to export them to a common CSV file.
How can I achieve this?
Just use the re library to match the file name pattern, and the csv library to export.
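A minimal sketch of that suggestion is below. The question does not say what the CSV columns should be, so the two-column layout (source file name, log line) is an assumption, as is the function name export_logs.

```python
import csv
import os
import re


def export_logs(directory, device, csv_path):
    """Write every line of the matching .log files to one CSV file."""
    # re.escape keeps characters in the device name from acting as regex syntax.
    pattern = re.compile(re.escape(device) + r"_.*\.log$")
    matched = [n for n in sorted(os.listdir(directory)) if pattern.match(n)]
    with open(csv_path, "w", newline="") as out:
        writer = csv.writer(out)
        for name in matched:
            with open(os.path.join(directory, name)) as log_file:
                for line in log_file:
                    # Assumed layout: one column for the file, one for the line.
                    writer.writerow([name, line.rstrip("\n")])
    return matched
```

With the question's values this would be export_logs(os.getcwd(), "ABC1-123", "output.csv"). Going through csv.writer means any commas or quotes inside a log line are escaped rather than breaking the columns.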
Only a few adjustments are needed; you were close:
filesFromDir = os.listdir(os.getcwd())
fileList = [file for file in filesFromDir if file.startswith(device)]
with open("LogOutput.csv", "ab") as f:
    for file in fileList:
        # print("Processing", file)
        with open(file, "rb") as log_file:
            txt = log_file.read()
        f.write(txt)
        f.write(b"\n")  # bytes, because f is opened in binary mode
Your question could be stated more clearly. Based on your current code snippet, I'll assume that you want to:
1. Filter files in a directory based on a glob pattern.
2. Concatenate their contents into a file named output.csv.
In Python you can achieve (1.) by using glob to list filenames.
import glob
for filename in glob.glob('foo*bar'):
    print(filename)
That would print all files starting with foo and ending with bar in
the current directory.
For (2.) you just read the file and write its content to your desired
output, using python's open() builtin function:
open('filename', 'r')
(Using 'r' as the mode you are asking python to open the file for
"reading", using 'w' you are asking python to open the file for
"writing".)
The final code would look like the following:
import glob
device = 'ABC1-123'
with open('output.csv', 'w') as output:
    for filename in glob.glob(device + '*'):
        # named log_file rather than input, to avoid shadowing the built-in
        with open(filename, 'r') as log_file:
            output.write(log_file.read())
You can use the os module to list the files.
import os
files = os.listdir(os.getcwd())
m_class = "ABC1"
m_id = "123"
device = m_class + "-" + m_id
file_extension = ".log"
# filter the files by their extension and the starting name
files = [x for x in files if x.startswith(device) and x.endswith(file_extension)]
f = open("output.csv", "a")
for file in files:
    with open(file, "r") as data_file:
        f.write(data_file.read())
        f.write(",\n")
f.close()
I am trying to extract all the zip files in a directory, giving each output folder the same name as its archive.
I then loop through all the files in the folder, and through the lines within those files, to write them to a different text file.
This is my code so far:
#!/usr/bin/env python3
import glob
import os
import zipfile
zip_files = glob.glob('*.zip')
for zip_filename in zip_files:
    dir_name = os.path.splitext(zip_filename)[0]
    os.mkdir(dir_name)
    zip_handler = zipfile.ZipFile(zip_filename, "r")
    zip_handler.extractall(dir_name)
path = dir_name
fOut = open("Output.txt", "w")
for filename in os.listdir(path):
    for line in filename.read().splitlines():
        print(line)
        fOut.write(line + "\n")
fOut.close()
This is the error that I encounter:
for line in filename.read().splitlines():
AttributeError: 'str' object has no attribute 'read'
You need to open the file, and you need to join the path to the file name. Also, calling splitlines and then adding a newline back to each line is redundant:
path = dir_name
with open("Output.txt", "w") as fOut:
    for filename in os.listdir(path):
        # join filename to path to avoid file not being found
        with open(os.path.join(path, filename)) as f:
            for line in f:
                fOut.write(line)
You should always use with to open your files, as it will close them automatically. If the files are not large, you can simply call fOut.write(f.read()) and remove the inner loop.
You also set path = dir_name after the first loop, which means path will hold whatever the last value of dir_name was; that may or may not be what you want. You can also use iglob to avoid building a full list: zip_files = glob.iglob('*.zip').
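If the intent is to process every extracted folder rather than only the last one, the reading step can move inside the extraction loop. The sketch below assumes that intent, assumes the archives contain plain text files, and uses a made-up function name, extract_and_merge.

```python
import os
import zipfile


def extract_and_merge(zip_paths, output_path):
    """Extract each archive into a folder named after it and merge the contents."""
    with open(output_path, "w") as out:
        for zip_path in zip_paths:
            dir_name = os.path.splitext(zip_path)[0]
            os.makedirs(dir_name, exist_ok=True)  # no error if the folder exists
            with zipfile.ZipFile(zip_path) as archive:
                archive.extractall(dir_name)
            # Read this archive's folder here, inside the loop, not afterwards.
            for name in sorted(os.listdir(dir_name)):
                member = os.path.join(dir_name, name)
                if os.path.isfile(member):  # skip any sub-folders
                    with open(member) as f:
                        out.write(f.read())
```

It could be called as extract_and_merge(glob.glob('*.zip'), 'Output.txt') after importing glob.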
I want to write a program for this: In a folder I have n number of files; first read one file and perform some operation then store result in a separate file. Then read 2nd file, perform operation again and save result in new 2nd file. Do the same procedure for n number of files. The program reads all files one by one and stores results of each file separately. Please give examples how I can do it.
I think what you are missing is how to retrieve all the files in that directory.
To do so, use the glob module.
Here is an example that copies every file with the extension .txt to a file with the extension .out:
import glob
import os
list_of_files = glob.glob('./*.txt') # create the list of files
for file_name in list_of_files:
    # splitext swaps only the extension; replace('txt', 'out') would also hit 'txt' elsewhere in the name
    out_name = os.path.splitext(file_name)[0] + '.out'
    with open(file_name, 'r') as FI, open(out_name, 'w') as FO:
        for line in FI:
            FO.write(line)
import sys
# argv is your commandline arguments, argv[0] is your program name, so skip it
for n in sys.argv[1:]:
    print(n) #print out the filename we are currently processing
    infile = open(n, "r")
    outfile = open(n + ".out", "w")
    # do some processing
    infile.close()
    outfile.close()
Then call it like:
./foo.py bar.txt baz.txt
You may find the fileinput module useful. It is designed for exactly this problem.
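As a small illustration of what fileinput offers, the sketch below streams several files as one sequence of lines while still tracking which file each line came from. The per-file line count is only a stand-in for whatever operation is needed, and count_lines_per_file is a hypothetical name.

```python
import fileinput


def count_lines_per_file(paths):
    """Return {filename: line count} by streaming all files through fileinput."""
    counts = {}
    with fileinput.input(files=paths) as stream:
        for line in stream:
            # filename() reports the file the current line was read from.
            name = stream.filename()
            counts[name] = counts.get(name, 0) + 1
    return counts
```

Because fileinput chains the inputs together, it suits cases where the results are keyed by file name; writing one output file per input is usually simpler with an explicit loop like the ones above.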
I recently learned about the os.walk() function, and it may help you here.
It lets you walk down a directory tree.
import os
OUTPUT_DIR = 'C:\\RESULTS'
for path, dirs, files in os.walk('.'):
    for file in files:
        read_f = open(os.path.join(path, file), 'r')
        write_f = open(os.path.join(OUTPUT_DIR, file), 'w')
        # Do stuff
        read_f.close()
        write_f.close()
A combined answer that accepts either a directory or a specific list of filenames as arguments:
import sys
import os.path
import glob
def processFile(filename):
    fileHandle = open(filename, "r")
    for line in fileHandle:
        # do some processing
        pass
    fileHandle.close()
def outputResults(filename):
    output_filemask = "out"
    fileHandle = open("%s.%s" % (filename, output_filemask), "w")
    # do some processing
    fileHandle.write('processed\n')
    fileHandle.close()
def processFiles(args):
    input_filemask = "log"
    directory = args[1]
    if os.path.isdir(directory):
        print("processing a directory")
        list_of_files = glob.glob('%s/*.%s' % (directory, input_filemask))
    else:
        print("processing a list of files")
        list_of_files = args[1:]
    for file_name in list_of_files:
        print(file_name)
        processFile(file_name)
        outputResults(file_name)
if __name__ == '__main__':
    if (len(sys.argv) > 1):
        processFiles(sys.argv)
    else:
        print('usage message')
from pylab import *
import csv
import os
import glob
import re
x=[]
y=[]
f=open("one.txt",'w')
for infile in glob.glob(('*.csv')):
    # print "" +infile
    csv23=csv2rec(""+infile,'rb',delimiter=',')
    for line in csv23:
        x.append(line[1])
    # print len(x)
    for i in range(3000,8000):
        y.append(x[i])
    print ""+infile,"\t",mean(y)
    print >>f,""+infile,"\t\t",mean(y)
    del y[:len(y)]
    del x[:len(x)]
I know I saw this double with open() somewhere but couldn't remember where, so I built a small example in case someone needs it.
""" A module to clean code (js, py, json or whatever) files saved as .txt files to
be used in HTML code blocks. """
from os import listdir
from os.path import abspath, dirname, join, splitext
from re import sub, MULTILINE
def cleanForHTML():
    """ This function will search a directory for text files to be edited. """
    ## define some regex for our search and replace. We are looking for &, < and >
    ## to be replaced with &amp;, &lt; and &gt;. & goes first so that the & in the
    ## new entities is not escaped a second time. We might want to replace
    ## whitespace chars as well? (r'\t', ' ') and (f'\n', '<br>')
    search_ = ((r'(&)', '&amp;'), (r'(<)', '&lt;'), (r'(>)', '&gt;'))
    ## Read and loop our file location. Our location is the same one that our python file is in.
    here = abspath(dirname(__file__))
    for loc in listdir(here):
        ## Here we split our filename into its parts ('fileName', '.txt')
        name = splitext(loc)
        if name[1] == '.txt':
            ## we found our .txt file so we can start file operations.
            ## join with `here` so this also works when run from another directory.
            with open(join(here, loc), 'r') as file_1, open(join(here, f'{name[0]}(fixed){name[1]}'), 'w') as file_2:
                ## read our first file
                retFile = file_1.read()
                ## find and replace some text.
                for find_ in search_:
                    retFile = sub(find_[0], find_[1], retFile, 0, MULTILINE)
                ## finally we can write to our newly created text file.
                file_2.write(retFile)
This also works for reading multiple files. My files are named federalist_1.txt, federalist_2.txt and so on, up to federalist_84.txt.
I read each file as f.
# the files are numbered 1 to 84
for number in range(1, 85):
    with open(f'federalist_{number}.txt', 'r') as f:
        text = f.read()