Unified opening of .csv and .csv.gz files - python

In Python 2.7, I would like to open a file and do some manipulations with it. The problem is that I do not know beforehand if it has a .csv or a .csv.gz extension. If I knew it was .csv, I would do
with open(filename, "r") as f_in:
    do something
If I knew it was .csv.gz, I could say
import gzip
with gzip.open(filename, "r") as f_in:
    do something
I am curious if there is a way to avoid repetition after figuring out the file extension:
def find_ext(filename):
    return filename.split(".")[-1]

ext = find_ext(filename)
if ext == "csv":
    with open(filename, "r") as f_in:
        do something
elif ext == "gz":
    import gzip
    with gzip.open(filename, "r") as f_in:
        do something

I wouldn't bother looking at the file extension: Files get renamed or use non-standard variations.
Instead, open the raw file and examine the header. A gzip file starts with the magic bytes 1f 8b followed by the compression method 08; read as a little-endian 32-bit word (with an empty flags byte) that is 0x00088b1f. The next 32-bit word is the modification time, which many writers leave as 0.
import struct

with open(filename, 'rb') as f:
    v = f.read(8)

v1, v2 = struct.unpack('<II', v)  # two little-endian 32-bit words
if v1 == 0x00088b1f and v2 == 0:
    pass  # it is gzip

import gzip
import mimetypes
smart_open = lambda fn: gzip.open(fn) if mimetypes.guess_type(fn)[1] == "gzip" else open(fn)
# usage:
f = smart_open("test.csv.gz")
f = smart_open("test.csv")

Since different libraries are used for the different cases, a check of the extension is required. But you can use a construction like this:
try:
    ...
except AnError:
    ...
else:
    ...
finally:
    ...
Either way, one construction or another (if/else or try/finally) is necessary, since the file-opening process is not the same for the different cases. To avoid repetition, use an approach like this (pseudocode):
def readcsv(file):
    ...

def readgzip(file):
    ...

if csv:
    readcsv(file)
elif gzip:
    readgzip(file)
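A concrete way to avoid the duplicated with-block is to pick the opener once and share everything that follows. A minimal sketch (the helper name open_any is my own), assuming the extension is a reliable indicator:
import gzip

def open_any(filename):
    # choose the opener based on the extension; the processing code below is shared
    opener = gzip.open if filename.endswith(".gz") else open
    return opener(filename, "r")

with open_any("test.csv.gz") as f_in:
    for line in f_in:
        pass  # do something with each line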

Related

YAML find and replace text Python

I have some files in YAML format. I need to find the $title text in the file and replace it with what I specified. This is approximately what the configuration file looks like:
JoinGame-MOTD:
  Enabled: true
  Messages:
  - '$title'
The YAML file may look different, so I want to write universal code that does not look for one specific string but replaces every $title with what I specified.
What I was trying to do:
import sys
import yaml

with open(r'config.yml', 'w') as file:
    def tr(s):
        return s.replace('$title', 'Test')
    yaml.dump(file, sys.stdout, transform=tr)
Please help me. It is not necessary to work with my code; I will be happy with any example that suits me.
Might be easier to not use the yaml package at all.
with open("file.yml", "r") as fin:
with open("file_replaced.yml", "w") as fout:
for line in fin:
fout.write(line.replace('$title', 'Test'))
EDIT:
To update in place:
with open("config.yml", "r+") as f:
    contents = f.read()
    f.seek(0)
    f.write(contents.replace('$title', 'Test'))
    f.truncate()
You can also read and write the data in one go. os.path.join is optional; it makes sure the yaml file is read relative to the path where your script is stored.
import re
import os

with open(os.path.join(os.path.dirname(__file__), 'temp.yaml'), 'r+') as f:
    data = f.read()
    f.seek(0)
    new_data = data.replace('$title', 'replaced!')
    f.write(new_data)
    f.truncate()
In case you wish to dynamically replace other keywords besides $title, like $description or $name, you can write a function using regex like this:
def replaceString(text_to_search, keyword, replacement):
    # re.escape makes the leading "$" in the keyword match literally
    return re.sub(re.escape(keyword) + r"\b", replacement, text_to_search)

replaceString('My name is $name', '$name', 'Bob')  # -> 'My name is Bob'

os.write() appends file instead of overwriting, but O_APPEND isn't used [duplicate]

I have the following code:
import re
#open the xml file for reading:
file = open('path/test.xml','r+')
#convert to string:
data = file.read()
file.write(re.sub(r"<string>ABC</string>(\s+)<string>(.*)</string>",r"<xyz>ABC</xyz>\1<xyz>\2</xyz>",data))
file.close()
where I'd like to replace the old content that's in the file with the new content. However, when I execute my code, the file "test.xml" is appended to, i.e. I have the old content followed by the new "replaced" content. What can I do in order to delete the old stuff and only keep the new?
You need to seek to the beginning of the file before writing and then use file.truncate() if you want to do an in-place replace:
import re

myfile = "path/test.xml"

with open(myfile, "r+") as f:
    data = f.read()
    f.seek(0)
    f.write(re.sub(r"<string>ABC</string>(\s+)<string>(.*)</string>", r"<xyz>ABC</xyz>\1<xyz>\2</xyz>", data))
    f.truncate()
The other way is to read the file then open it again with open(myfile, 'w'):
with open(myfile, "r") as f:
data = f.read()
with open(myfile, "w") as f:
f.write(re.sub(r"<string>ABC</string>(\s+)<string>(.*)</string>", r"<xyz>ABC</xyz>\1<xyz>\2</xyz>", data))
Neither truncate nor open(..., 'w') will change the inode number of the file (I tested twice, once with Ubuntu 12.04 NFS and once with ext4).
By the way, this is not really related to Python. The interpreter calls the corresponding low level API. The method truncate() works the same in the C programming language: See http://man7.org/linux/man-pages/man2/truncate.2.html
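A quick sketch of what that claim means in practice (pure stdlib, nothing specific to this answer's code): the inode reported by os.stat() is the same before and after the in-place rewrite.
import os

before = os.stat("path/test.xml").st_ino
with open("path/test.xml", "w") as f:
    f.write("new content")
after = os.stat("path/test.xml").st_ino
assert before == after  # rewriting in place keeps the same inode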
file = 'path/test.xml'

with open(file, 'w') as filetowrite:
    filetowrite.write('new content')
Open the file in 'w' mode and you will be able to replace its current text, saving the file with the new contents.
Using truncate(), the solution could be
import re

# open the xml file for reading:
with open('path/test.xml', 'r+') as f:
    # convert to string:
    data = f.read()
    f.seek(0)
    f.write(re.sub(r"<string>ABC</string>(\s+)<string>(.*)</string>", r"<xyz>ABC</xyz>\1<xyz>\2</xyz>", data))
    f.truncate()
import os  # must import this library

if os.path.exists('TwitterDB.csv'):
    os.remove('TwitterDB.csv')  # this deletes the file
else:
    print("The file does not exist")  # add this to prevent errors
I had a similar problem, and instead of overwriting my existing file using the different 'modes', I just deleted the file before using it again, so that it would be as if I was appending to a new file on each run of my code.
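A small sketch of that delete-then-write pattern (the filename simply mirrors the snippet above): after the remove, even append mode starts from an empty file on each run.
import os

if os.path.exists('TwitterDB.csv'):
    os.remove('TwitterDB.csv')  # start from scratch on every run

# every run now writes into a fresh file, even in append mode
with open('TwitterDB.csv', 'a') as f:
    f.write('fresh contents\n')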
See How to Replace String in File; it works in a simple way and uses replace:
fin = open("data.txt", "rt")
fout = open("out.txt", "wt")
for line in fin:
fout.write(line.replace('pyton', 'python'))
fin.close()
fout.close()
In my case the following code did the trick:
import json

# w+ mode creates the file if it does not exist and overwrites the existing content
with open("output.json", "w+") as outfile:
    json.dump(result_plot, outfile)
Using the Python 3 pathlib library:
import re
from pathlib import Path
import shutil
shutil.copy2("/tmp/test.xml", "/tmp/test.xml.bak") # create backup
filepath = Path("/tmp/test.xml")
content = filepath.read_text()
filepath.write_text(re.sub(r"<string>ABC</string>(\s+)<string>(.*)</string>",r"<xyz>ABC</xyz>\1<xyz>\2</xyz>", content))
A similar method, using a different approach to backups (rename the original out of the way, then write the new content to the original path):
import re
from pathlib import Path

filepath = Path("/tmp/test.xml")
backup = filepath.with_suffix('.bak')  # different approach to backups
filepath.rename(backup)
content = backup.read_text()
filepath.write_text(re.sub(r"<string>ABC</string>(\s+)<string>(.*)</string>", r"<xyz>ABC</xyz>\1<xyz>\2</xyz>", content))

Using Pylzma with streaming and 7Zip compatibility

I have been using pylzma for a little bit, but I have to be able to create files compatible with the 7zip Windows application. The caveat is that some of my files are really large (3 to 4 GB, created by third-party software in a proprietary binary format).
I went over and over the instructions here: https://github.com/fancycode/pylzma/blob/master/doc/USAGE.md
I am able to create compatible files with the following code:
def Compacts(folder, f):
    os.chdir(folder)
    fsize = os.stat(f).st_size
    t = time.clock()
    i = open(f, 'rb')
    o = open(f + '.7z', 'wb')
    i.seek(0)
    s = pylzma.compressfile(i)
    result = s.read(5)
    result += struct.pack('<Q', fsize)
    s = result + s.read()
    o.write(s)
    o.flush()
    o.close()
    i.close()
    os.remove(f)
The smaller files (up to 2 GB) compress well with this code and are compatible with 7Zip, but the larger files just crash Python after some time.
According to the user guide, to compact large files one should use streaming, but then the resulting file is not compatible with 7zip, as in the snippet below.
def Compacts(folder, f):
    os.chdir(folder)
    fsize = os.stat(f).st_size
    t = time.clock()
    i = open(f, 'rb')
    o = open(f + '.7z', 'wb')
    i.seek(0)
    s = pylzma.compressfile(i)
    while True:
        tmp = s.read(1)
        if not tmp:
            break
        o.write(tmp)
    o.flush()
    o.close()
    i.close()
    os.remove(f)
Any ideas on how I can incorporate the streaming technique present in pylzma while keeping the 7zip compatibility?
You still need to correctly write the header (.read(5)) and size, e.g. like so:
import os
import struct
import pylzma

def sevenzip(infile, outfile):
    size = os.stat(infile).st_size
    with open(infile, "rb") as ip, open(outfile, "wb") as op:
        s = pylzma.compressfile(ip)
        op.write(s.read(5))
        op.write(struct.pack('<Q', size))
        while True:
            # Read 128K chunks.
            # Not sure if this has to be 1 instead to trigger streaming in pylzma...
            tmp = s.read(1 << 17)
            if not tmp:
                break
            op.write(tmp)

if __name__ == "__main__":
    import sys
    try:
        _, infile, outfile = sys.argv
    except:
        infile, outfile = __file__, __file__ + u".7z"
    sevenzip(infile, outfile)
    print("compressed {} to {}".format(infile, outfile))

How to read from a text file compressed with 7z?

I would like to read (in Python 2.7), line by line, from a csv (text) file, which is 7z compressed. I don't want to decompress the entire (large) file, but to stream the lines.
I tried pylzma.decompressobj() unsuccessfully. I get a data error. Note that this code doesn't yet read line by line:
input_filename = r"testing.csv.7z"
with open(input_filename, 'rb') as infile:
obj = pylzma.decompressobj()
o = open('decompressed.raw', 'wb')
obj = pylzma.decompressobj()
while True:
tmp = infile.read(1)
if not tmp: break
o.write(obj.decompress(tmp))
o.close()
Output:
o.write(obj.decompress(tmp))
ValueError: data error during decompression
This will allow you to iterate the lines. It's partially derived from some code I found in an answer to another question.
At this point in time (pylzma-0.5.0) the py7zlib module doesn't implement an API that would allow archive members to be read as a stream of bytes or characters — its ArchiveFile class only provides a read() function that decompresses and returns the uncompressed data in a member all at once. Given that, about the best that can be done is return bytes or lines iteratively via a Python generator using that as a buffer.
The following does the latter, but may not help if the problem is the archive member file itself is huge.
The code below should work in Python 3.x as well as 2.7.
import io
import os
import py7zlib


class SevenZFileError(py7zlib.ArchiveError):
    pass


class SevenZFile(object):
    @classmethod
    def is_7zfile(cls, filepath):
        """ Determine if filepath points to a valid 7z archive. """
        is7z = False
        fp = None
        try:
            fp = open(filepath, 'rb')
            archive = py7zlib.Archive7z(fp)
            _ = len(archive.getnames())
            is7z = True
        finally:
            if fp: fp.close()
        return is7z

    def __init__(self, filepath):
        fp = open(filepath, 'rb')
        self.filepath = filepath
        self.archive = py7zlib.Archive7z(fp)

    def __contains__(self, name):
        return name in self.archive.getnames()

    def readlines(self, name, newline=''):
        r""" Iterator of lines from named archive member.

        `newline` controls how line endings are handled.
        It can be None, '', '\n', '\r', and '\r\n' and works the same way as it does
        in StringIO. Note however that the default value is different and is to enable
        universal newlines mode, but line endings are returned untranslated.
        """
        archivefile = self.archive.getmember(name)
        if not archivefile:
            raise SevenZFileError('archive member %r not found in %r' %
                                  (name, self.filepath))

        # Decompress entire member and return its contents iteratively.
        data = archivefile.read().decode()
        for line in io.StringIO(data, newline=newline):
            yield line


if __name__ == '__main__':
    import csv

    if SevenZFile.is_7zfile('testing.csv.7z'):
        sevenZfile = SevenZFile('testing.csv.7z')
        if 'testing.csv' not in sevenZfile:
            print('testing.csv is not a member of testing.csv.7z')
        else:
            reader = csv.reader(sevenZfile.readlines('testing.csv'))
            for row in reader:
                print(', '.join(row))
If you were using Python 3.3+, you might be able to do this using the lzma module which was added to the standard library in that version.
See: lzma Examples
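For what it's worth, here is a minimal sketch of that lzma approach. Note that the lzma module reads raw .xz / .lzma streams, not the .7z container format, so the data would first have to be repacked (the name testing.csv.xz below is made up for illustration):
import lzma

# lzma.open streams the file, so lines are decompressed as they are read
with lzma.open("testing.csv.xz", "rt") as f:
    for line in f:
        print(line, end="")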
If you can use Python 3, there is a useful library, py7zr, which supports partial 7zip decompression, as below:
import py7zr
import re

filter_pattern = re.compile(r'<your/target/file_and_directories/regex/expression>')

with py7zr.SevenZipFile('archive.7z', 'r') as archive:
    allfiles = archive.getnames()
    selective_files = [f for f in allfiles if filter_pattern.match(f)]
    archive.extract(targets=selective_files)
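Once the member has been extracted, the lines can be read with the usual csv machinery. A rough sketch, assuming the archive and member names from the question:
import csv
import py7zr

with py7zr.SevenZipFile('testing.csv.7z', 'r') as archive:
    archive.extract(targets=['testing.csv'])  # writes testing.csv to the current directory

with open('testing.csv') as f:
    for row in csv.reader(f):
        print(row)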

Add file name as last column of CSV file

I have a Python script which modifies a CSV file to add the filename as the last column:
import sys
import glob

for filename in glob.glob(sys.argv[1]):
    file = open(filename)
    data = [line.rstrip() + "," + filename for line in file]
    file.close()
    file = open(filename, "w")
    file.write("\n".join(data))
    file.close()
Unfortunately, it also adds the filename to the header (first) row of the file. I would like the string "ID" added to the header instead. Can anybody suggest how I could do this?
Have a look at the official csv module.
Here are a few minor notes on your current code:
It's a bad idea to use file as a variable name, since that shadows the built-in type.
You can close the file objects automatically by using the with syntax.
Don't you want to add an extra column in the header line, called something like Filename, rather than just omitting a column in the first row?
If your filenames have commas (or, less probably, newlines) in them, you'll need to make sure that the filename is quoted - just appending it won't do.
That last consideration would incline me to use the csv module instead, which will deal with the quoting and unquoting for you. For example, you could try something like the following code:
import glob
import csv
import sys

for filename in glob.glob(sys.argv[1]):
    data = []
    with open(filename) as finput:
        for i, row in enumerate(csv.reader(finput)):
            to_append = "Filename" if i == 0 else filename
            data.append(row + [to_append])
    with open(filename, 'wb') as foutput:
        writer = csv.writer(foutput)
        for row in data:
            writer.writerow(row)
That may quote the data slightly differently from your input file, so you might want to play with the quoting options for csv.reader and csv.writer described in the documentation for the csv module.
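For example, if the filenames may contain commas, forcing every field to be quoted is one option (QUOTE_ALL is just one of the quoting constants; the output name below is a placeholder):
import csv

with open("out.csv", "wb") as foutput:
    # quote every field so filenames containing commas or quotes round-trip safely
    writer = csv.writer(foutput, quoting=csv.QUOTE_ALL)
    writer.writerow(["a,b", "c"])  # written as "a,b","c"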
As a further point, you might have good reasons for taking a glob as a parameter rather than just the files on the command line, but it's a bit surprising - you'll have to call your script as ./whatever.py '*.csv' rather than just ./whatever.py *.csv. Instead, you could just do:
for filename in sys.argv[1:]:
... and let the shell expand your glob before the script knows anything about it.
One last thing - the current approach you're taking is slightly dangerous, in that if anything fails when writing back to the same filename, you'll lose data. The standard way of avoiding this is to instead write to a temporary file, and, if that was successful, rename the temporary file over the original. So, you might rewrite the whole thing as:
import csv
import sys
import tempfile
import shutil

for filename in sys.argv[1:]:
    tmp = tempfile.NamedTemporaryFile(delete=False)
    with open(filename) as finput:
        with open(tmp.name, 'wb') as ftmp:
            writer = csv.writer(ftmp)
            for i, row in enumerate(csv.reader(finput)):
                to_append = "Filename" if i == 0 else filename
                writer.writerow(row + [to_append])
    shutil.move(tmp.name, filename)
You can try:
data = [file.readline().rstrip() + ",id"]
data += [line.rstrip() + "," + filename for line in file]
You can try changing your code, but using the csv module is recommended. This should give you the result you want:
import sys
import glob
import csv

filename = glob.glob(sys.argv[1])[0]
yourfile = csv.reader(open(filename, 'r'))

csv_output = []
for row in yourfile:
    if len(csv_output) != 0:  # skip the header
        row.append(filename)
    csv_output.append(row)

yourfile = csv.writer(open(filename, 'w'), delimiter=',')
yourfile.writerows(csv_output)
Use the CSV module that comes with Python.
import csv
import sys

def process_file(filename):
    # Read the contents of the file into a list of lines.
    f = open(filename, 'r')
    contents = f.readlines()
    f.close()

    # Use a CSV reader to parse the contents.
    reader = csv.reader(contents)

    # Open the output and create a CSV writer for it.
    f = open(filename, 'wb')
    writer = csv.writer(f)

    # Process the header.
    header = reader.next()
    header.append('ID')
    writer.writerow(header)

    # Process each row of the body.
    for row in reader:
        row.append(filename)
        writer.writerow(row)

    # Close the file and we're done.
    f.close()

# Run the function on all command-line arguments. Note that this does no
# checking for things such as file existence or permissions.
map(process_file, sys.argv[1:])
You can run this as follows:
blair#blair-eeepc:~$ python csv_add_filename.py file1.csv file2.csv
You can use fileinput to do in-place editing:
import sys
import glob
import fileinput

for filename in glob.glob(sys.argv[1]):
    for line in fileinput.FileInput(filename, inplace=1):
        if fileinput.lineno() == 1:
            print line.rstrip() + ",ID"
        else:
            print line.rstrip() + "," + filename
