I am trying to calculate the sum of the sizes of various files. This is my script:
import os
date = raw_input('Enter date in format YYYYMMDD ')
file1 = 'p_poupe_' + date + '.tar.gz.done'
file2 = 'p_poupw_' + date + '.tar.gz.done'
file3 = 'p_pojk_' + date + '.tar.gz.done'
a1 = os.system('zcat ' + file1 + '|wc --bytes')
a2 = os.system('zcat ' + file2 + '|wc --bytes')
a3 = os.system('zcat ' + file3 + '|wc --bytes')
print a1,a2,a3
sum = a1 + a2 + a3
print sum
But the values are not being stored in the variables. Can anyone tell me what I am doing wrong? How can I modify the script so that the values are stored in variables instead of just being printed as output?
On Unix, the return value is the exit status of the process encoded in
the format specified for wait(). Note that POSIX does not specify the
meaning of the return value of the C system() function, so the return
value of the Python function is system-dependent.
On Windows, the return value is that returned by the system shell
after running command, given by the Windows environment variable
COMSPEC: on command.com systems (Windows 95, 98 and ME) this is always
0; on cmd.exe systems (Windows NT, 2000 and XP) this is the exit
status of the command run; on systems using a non-native shell,
consult your shell documentation.
https://docs.python.org/2/library/os.html#os.system
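To see the distinction concretely, here is a minimal sketch (the filename is hypothetical): the byte count goes to the terminal, while the variable only receives the exit status.
import os

# wc prints the byte count to the terminal; os.system only returns
# the exit status of the pipeline (wait-encoded on Unix)
status = os.system('zcat some_archive.tar.gz.done | wc --bytes')
print status  # e.g. 0 on success -- not the byte count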
The problem is that you're capturing exit codes rather than stdout data as your "values".
You're probably looking for subprocess.Popen, for instance. Or you can simply code the solution manually by opening the files in Python.
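For instance, a minimal Popen sketch under the question's setup (reusing its file1 variable and the same shell pipeline):
from subprocess import Popen, PIPE

p = Popen('zcat ' + file1 + ' | wc --bytes', shell=True, stdout=PIPE)
out, _ = p.communicate()
a1 = int(out)  # the text wc printed, converted to an integer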
Try using the gzip module: https://docs.python.org/3/library/gzip.html
import gzip
def get_fcont_len(fname):
    with gzip.open(fname) as f:
        return len(f.read())
total = 0
date = raw_input('Enter date in format YYYYMMDD ')
total += get_fcont_len('p_poupe_' + date + '.tar.gz.done')
total += get_fcont_len('p_poupw_' + date + '.tar.gz.done')
total += get_fcont_len('p_pojk_' + date + '.tar.gz.done')
print(total)
os.system returns the exit status of the command, not its output. To capture the output of a command you should look into the subprocess module.
subprocess.check_output("zcat " + file1 + " | wc --bytes", shell=True)
# Output the size in bytes of file1 with a trailing new line character
However, it is probably better to use other Python modules/methods as suggested in the other answers, since it is preferable to do things directly in Python.
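Fitted to the question's three files, a sketch might look like this (check_output returns the text printed by wc, so it still needs an int() conversion):
from subprocess import check_output

sizes = [int(check_output('zcat ' + f + ' | wc --bytes', shell=True))
         for f in (file1, file2, file3)]
print sum(sizes)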
The uncompressed file size is stored in the last 4 bytes of the gzip file. This function will return the size of the uncompressed file, i.e. the "gunzipped" size:
import os
import gzip
import struct
def get_gunzipped_size(filename):
    with gzip.open(filename) as f:
        _ = f.read(1)  # elicit IOError if file is not a gzip file
        f.fileobj.seek(-4, os.SEEK_END)
        # the ISIZE trailer field is unsigned, hence '<I'
        return struct.unpack('<I', f.fileobj.read(4))[0]
On large files this is much faster than reading all of the uncompressed data and counting its length, because the whole file does not need to be decompressed. (One caveat: the gzip trailer stores the uncompressed size modulo 2^32, so the result is only reliable for files smaller than 4 GiB.)
Fitting this into your code:
import os
date = raw_input('Enter date in format YYYYMMDD ')
prefixes = ('p_poupe_', 'p_poupw_', 'p_pojk_')
files = ['{}{}.tar.gz.done'.format(prefix, date) for prefix in prefixes]
total_uncompressed = sum(get_gunzipped_size(f) for f in files)
print total_uncompressed
You can capture the output of a command using the getoutput function from the commands module:
import commands as cm
.
.
.
a1 = cm.getoutput('zcat ' + file1 + '|wc --bytes')
a2 = cm.getoutput('zcat ' + file2 + '|wc --bytes')
a3 = cm.getoutput('zcat ' + file3 + '|wc --bytes')
# Note that the outputs are strings, so you need to convert them to integers
a1, a2, a3 = int(a1), int(a2), int(a3)
print a1,a2,a3
sum = a1 + a2 + a3
print sum
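Be aware that the commands module is Python 2 only; it was removed in Python 3, where subprocess.getoutput is the drop-in replacement. A sketch:
import subprocess

a1 = int(subprocess.getoutput('zcat ' + file1 + ' | wc --bytes'))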
You can use the os module to get the file size. Try this:
import os
import tarfile
tar = tarfile.open("yourFile.tar.gz")
tar.extractall("folderWithExtractedFiles")
print os.path.getsize("folderWithExtractedFiles/yourFileInsideTarGz")
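If you only need the sizes, a sketch like this reads them from the archive's member headers without extracting anything (same hypothetical filename as above):
import tarfile

with tarfile.open("yourFile.tar.gz") as tar:
    # TarInfo.size is the uncompressed size stored in each member's header
    print sum(member.size for member in tar.getmembers())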
I have a huge data frame with columns names:
A,B,C,D,F,G,H,GT_a,N_a_,E_a,GT_b,N_b_,E_b,GT_c,N_c_,E_c,...,GT_n,N_n,E_n
Using unix/bash or python, I want to produce n individual files with the following columns:
A,B,C,D,F,G,H,GT_a,N_a_,E_a
A,B,C,D,F,G,H,GT_b,N_b_,E_b
A,B,C,D,F,G,H,GT_c,N_c_,E_c
....
A,B,C,D,F,G,H,GT_n,N_n_,E_n
Each file should be called: a.txt, b.txt, c.txt,...,n.txt
Here are a couple of solutions with bash tools.
1. bash
Using cut inside a bash loop. This will spawn n processes and parse the file n times.
Updated for the case where the _ids in the column names are not just a sequence of letters but arbitrary string ids, repeating every 3 columns after the first 7. We first have to read the header of the file and extract them; a quick solution is to use awk to print the id of every 3rd column name after the first 7 (the 8th, 11th, and so on) into a bash array.
#!/bin/bash
first=7
#ids=( {a..n} )
ids=( $( head -1 "$1" | awk -F"_" -v RS="," -v f="$first" 'NR>f && (NR+1)%3==0{print $2}' ) )
for i in "${!ids[@]}"; do
    cols="1-$first,$((first+1+3*i)),$((first+2+3*i)),$((first+3+3*i))"
    cut -d, -f"$cols" "$1" > "${ids[i]}.txt"
done
Usage: bash test.sh file
2. awk
Or you can use awk. Here I hard-code just the number of outputs; the ids could also be extracted as in the first solution.
BEGIN { FS=OFS=","; times=14 }
{
    for (i=1; i<=times; i++) {
        print $1,$2,$3,$4,$5,$6,$7,$(5+3*i),$(6+3*i),$(7+3*i) > sprintf("%c.txt", i+96)
    }
}
Usage: awk -f test.awk file.
This solution should be fast, as it parses the file only once. But note that awk keeps its output files open, so for a large number of output files it could throw a "too many open files" error. For the range of letters here, it should be fine.
This should write out the different files, with a different header for each file. You'll have to change COL_NAMES_TO_WRITE to the columns you want.
It uses only the standard library, so no pandas. As written it won't produce more than 26 different files, but the filename generator could be changed to allow more.
If I'm interpreting this question correctly, you want to split this into 14 files (a..n).
Copy the code below into a file, splitter.py, and then run this command:
python3.8 splitter.py --fn largefile.txt -n 14
Where largefile.txt is your huge file that you need to split.
import argparse
import csv
import string

COL_NAMES_TO_WRITE = "A,B,C,D,F,G,H,GT_{letter},N_{letter},E_{letter}"
WRITTEN_HEADERS = set()  # keeps track of which files already have a header row

def output_file_generator(num):
    if num > 26:
        raise ValueError(f"Can only print out 26 different files, not {num}")
    i = 0
    while True:
        prefix = string.ascii_lowercase[i]
        i = (i + 1) % num  # increment modulo number of files we want
        yield f"{prefix}.txt"

def col_name_generator(num):
    i = 0
    while True:
        col_suffix = string.ascii_lowercase[i]
        i = (i + 1) % num  # increment modulo number of files we want
        yield COL_NAMES_TO_WRITE.format(letter=col_suffix).split(',')

def main(filename, num_files=4):
    """Split a file into multiple files

    Args:
        filename (str): large filename that needs to be split into multiple files
        num_files (int): number of files to split filename into
    """
    with open(filename, 'r') as large_file_fp:
        reader = csv.DictReader(large_file_fp)
        output_files = output_file_generator(num_files)
        col_names = col_name_generator(num_files)
        for line in reader:
            # write this row's shared columns plus one letter-group into every output file
            for _ in range(num_files):
                filename_for_this_file = next(output_files)
                column_names_for_this_file = next(col_names)
                with open(filename_for_this_file, 'a') as output_fp:
                    writer = csv.DictWriter(output_fp, fieldnames=column_names_for_this_file)
                    if filename_for_this_file not in WRITTEN_HEADERS:
                        writer.writeheader()
                        WRITTEN_HEADERS.add(filename_for_this_file)
                    just_these_fields = {k: v for k, v in line.items() if k in column_names_for_this_file}
                    writer.writerow(just_these_fields)

if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("-fn", "--filename", required=True, default='large_file.txt', help="filename of large file to be split")
    parser.add_argument("-n", "--num_files", required=False, default=4, help="number of separate files to split large_file into")
    args = parser.parse_args()
    main(args.filename, int(args.num_files))
import pandas as pd
import numpy as np

c = "A,B,C,D,F,G,H,GT_a,N_a_,E_a,GT_b,N_b_,E_b,GT_c,N_c_,E_c,GT_d,N_d_,E_d,GT_e,N_e_,E_e".split(',')
df = pd.DataFrame(np.full((30, 22), c), columns=c)  # dummy frame for the demo

c = list(df.columns)
default = c[:7]  # the shared leading columns A..H
var = pd.DataFrame(np.array(c[7:]).reshape(-1, 3))  # one row per (GT_x, N_x_, E_x) triple

def dump(row):
    cols = default + list(row)
    magic = cols[-1][-1]  # the letter suffix, e.g. 'a' from 'E_a'
    df[cols].to_csv(magic + '.txt')

var.apply(dump, axis=1)
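A more direct variant of the same idea, as a sketch (assuming the same df as above): walk the columns in steps of 3 and slice the frame.
for i in range(7, len(df.columns), 3):
    letter = df.columns[i].split('_')[1]  # 'a' from 'GT_a'
    subset = list(df.columns[:7]) + list(df.columns[i:i + 3])
    df[subset].to_csv(letter + '.txt', index=False)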
I need to construct a chain of text like this:
out = 'ogr:dbname=\'C:\\output\\2020.gpkg\' table=\"2020\" (geom) sql='
Here is my code:
import glob, time, sys, threading, os
from datetime import date, timedelta, datetime
import time, threading
#Parameters
layer = 'C:\\layer.gpkg'
n ='2020'
outdir = 'C:\\output'
#Process
l = os.path.realpath(layer)
pn = os.path.realpath(outdir + '/' + n + '.gpkg')
p = f"'{pn}'"
f = f"'{n}'"
o = f'ogr:dbname={p} table={f} (geom) sql='
#Test
out = 'ogr:dbname=\'C:\\output\\2020.gpkg\' table=\"2020\" (geom) sql='
o == out
The goal is to get o == out.
What do I need to change in the #Process part in order to get this as True ?
Moreover, I need this to run on both Linux and Windows.
My final goal is to create a function that, given 3 strings, returns the complex string shown above.
Assuming you are using Python 3.6 or above, you should use format strings (also known as f-strings) to construct strings from variables. Start the string with the letter f and then put whatever variables you want in curly brackets {}. Also, if you use single quotes as the outer quotes then you don't have to escape double quotes, and vice versa.
Code:
db_name = "'home/user/output/prueba.gpkg'"
table_name = '"prueba"'
outputlayer = f'ogr:dbname={db_name} table={table_name} (geom) sql='
outputlayer
Output:
'ogr:dbname=\'home/user/output/prueba.gpkg\' table="prueba" (geom) sql='
I think one of the reasons this isn't working is your path here: pn = os.path.realpath(outdir + '/' + n + '.gpkg'). This mixes the Unix path separator / with the Windows separator \\. A more robust solution, in terms of portability between Linux and Windows, is the os.path.join function in the os module.
Additionally, python f strings will only add escapes to whichever quote character you used to open the string (' or "). If the escaped quotes around both strings are necessary, you're probably better off hard coding it into an f-string instead of setting 2 new variables with different quote types.
import glob, time, sys, threading, os
from datetime import date, timedelta, datetime
import time, threading
#Parameters
layer = 'C:\\layer.gpkg'
n ='2020'
outdir = 'C:\\output'
#Process
l = os.path.realpath(layer)
pn = os.path.realpath(os.path.join(outdir, f"{n}.gpkg"))
o = f'ogr:dbname=\'{pn}\' table=\"{n}\" (geom) sql='
#Test
out = 'ogr:dbname=\'C:\\output\\2020.gpkg\' table=\"2020\" (geom) sql='
o == out
A version of this (different path) has been tested to work on my linux machine.
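To wrap this up into the function the question ultimately asks for, a minimal sketch (the function name and parameter choice are mine; note that only the output directory and the layer name actually appear in the target string):
import os

def ogr_connection_string(outdir, name):
    pn = os.path.realpath(os.path.join(outdir, name + '.gpkg'))
    return f'ogr:dbname=\'{pn}\' table="{name}" (geom) sql='

print(ogr_connection_string('C:\\output', '2020'))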
Another option is to use a triple quoted string:
dbname = """/home/user/output/prueba.gpkg"""
outputlayer = """ogr:dbname='"""+dbname+"""' table="prueba" (geom) sql="""
which gives:
'ogr:dbname=\'/home/user/output/prueba.gpkg\' table="prueba" (geom) sql='
I have an S19 file looking something like below:
S0030000FC
S30D0003C0000F0000000000000020
S3FD00000000782EFF1FB58E00003D2B00003D2B00003D2B00003D2B00003D2B00003D
S3ED000000F83D2B00003D2B00003D2B00003D2B00003D2B00003D2B00003D2B00003D
S31500000400FFFFFFFFFFFFFFFFFFFFFFFF7EF9FFFF7D
S3FD0000041010B5DFF828000468012147F22C10C4F20300016047F22010C4F2030000
S70500008EB4B8
I want to separate the first two characters, then the next two characters, and so on. I want it to look like below (the last two characters of each line are also separated):
S0, 03, 0000, FC
S3, 0D, 0003C000, 0F00000000000000, 20
S3, FD, 00000000, 782EFF1FB58E00003D2B00003D2B00003D2B00003D2B00003D2B0000, 3D
S3, ED, 000000F8, 3D2B00003D2B00003D2B00003D2B00003D2B00003D2B00003D2B0000, 3D
S3, 15, 00000400, FFFFFFFFFFFFFFFFFFFFFFFF7EF9FFFF, 7D
S3, FD, 00000410, 10B5DFF828000468012147F22C10C4F20300016047F22010C4F20300, 00
S7, 05, 00008EB4, B8
How can I do this in Python?
I have something like this:
#!/usr/bin/python
import string,os,sys,re,fileinput
print "hi"
inputfile = "k60.S19"
outputfile = "k60_out.S19"
# open the source file and read it
fh = file(inputfile, 'r')
subject = fh.read()
fh.close()
# create the pattern object. Note the "r". In case you're unfamiliar with Python
# this is to set the string as raw so we don't have to escape our escape characters
pattern2 = re.compile(r'S3')
pattern3 = re.compile(r'S7')
pattern1 = re.compile(r'S0')
# do the replace
result1 = pattern1.sub("S0, ", subject)
result2 = pattern2.sub("S3, ", subject)
result3 = pattern3.sub("S7, ", subject)
# write the file
f_out = file(outputfile, 'w')
f_out.write(result1)
f_out.write(result2)
f_out.write(result3)
f_out.close()
#EoF
but it is not working as I'd like! Can someone help me come up with the proper regular expressions for this?
Try the package bincopy; it may be what you need.
bincopy - Interpret strings as packed binary data
Mangling of various file formats that conveys binary information (Motorola S-Record, Intel HEX and binary files).
import bincopy
f = bincopy.BinFile()
f.add_srec_file("path/to/your/s19/file.s19")
f.as_binary() # print s19 as binary
or you can simply use open() on the file:
with open("path/to/your/s19/file.s19") as s19:
    for line in s19:
        line = line.strip()   # drop the trailing newline so crc is correct
        type = line[0:2]
        count = line[2:4]
        address = line[4:12]  # note: assumes a 4-byte address field (S3/S7 style)
        data = line[12:-2]
        crc = line[-2:]
        print type + ", " + count + ", " + address + ", " + data + ", " + crc
hope it helps.
Motorola S-record file format
You can do it using a callback function as replacement with re.sub:
#!/usr/bin/python
import re
data = r'''S0030000FC
S30D0003C0000F0000000000000020
S3FD00000000782EFF1FB58E00003D2B00003D2B00003D2B00003D2B00003D2B00003D
S3ED000000F83D2B00003D2B00003D2B00003D2B00003D2B00003D2B00003D2B00003D
S31500000400FFFFFFFFFFFFFFFFFFFFFFFF7EF9FFFF7D
S3FD0000041010B5DFF828000468012147F22C10C4F20300016047F22010C4F2030000
S70500008EB4B8'''
pattern = re.compile(r'^(..)(..)((?:.{4}){1,2})(.*)(?=..)', re.M)

def repl(m):
    repstr = ''
    for g in m.groups():
        if g:
            repstr += g + ', '
    return repstr

print re.sub(pattern, repl, data)
However, as Mark Setchell notes, there is probably a nicer way to do it with slicing.
I know you are thinking Python and regexes, but this was made for awk, and the following may help you work out how to do it with slicing:
awk '{r=length($0);print substr($0,1,2),substr($0,3,2),substr($0,5,8),substr($0,13,r-14),substr($0,r-1)}' OFS=, k60.s19
That says "get the length of the line in variable r, then print the first two characters, the next two characters, the next 8 characters and so on... and use a comma as the field separator".
EDITED
Here are a few more hints to get you started...
If you want to avoid printing line 1, you can do:
awk 'FNR==1{next} ...rest of awk script above ... '
If you want to only process lines longer than 40 characters, you can do
awk 'length($0)>40 {print}' yourfile
If you only want to process lines where the second field is "xx", you can do
awk '$2 ~ "xx" {print}' yourfile
I have the following piece of code that generates plots with gnuplot:
import sys, glob, subprocess, os, time
for file in glob.glob('comb_*.out'):
    fNameParts = file[5:].split('.')[0].split('-')
    gPlotCmd = []
    gPlotCmd = 'unset border; set xl "rotation period #"; set yl "T [K]"\n'
    gPlotCmd += 'set key above\n'
    gPlotCmd += 'set grid xtics ytics\n'
    gPlotCmd += 'set term post eps enh color solid\n'
    gPlotCmd += 'set xrange [20:876]\n'
    gPlotCmd += 'set output "fig_TestRelax-' + fNameParts[1] + '-' + fNameParts[2] + '-' + fNameParts[3] + '-' + fNameParts[4] + '.eps"\n'
    conf = fNameParts[1] + '-' + fNameParts[2] + '-' + fNameParts[3]
    gPlotCmd += 'plot "' + file + '" using ($1/36000):($9-$3) w l lw 5 title "OP3-OP1 ' + conf + '", "' + file + '" using ($1/36000):($6-$3) w l lw 3 title "OP2-OP1 ' + conf + '", "' + file + '" using ($1/36000):($9-$6) w l lw 1 title "OP3-OP2 ' + conf + '"'
    fw = open('temp.plt','w+')
    fw.writelines(gPlotCmd)
    subprocess.call(["gnuplot","temp.plt"])
    print(file)
In the last loop iteration, subprocess.call(["gnuplot","temp.plt"]) does not seem to be executed. At the end of the program, temp.plt exists with the data from the last iteration, and print(file) is executed during the last loop. If I run gnuplot on the temp.plt left behind by the program I get the last plot, so there is no problem on the data side. Only the line subprocess.call(["gnuplot","temp.plt"]) has no effect. I also tried Popen and monitored stdout and stderr, but both were empty (as in all other iterations).
I checked that the problem occurs both on Linux and Windows, and in versions 3.3.5 and 2.7.3.
I conclude that there is something wrong with the script, but I do not know what.
@lvc's answer and yours are correct; it is a buffering issue, and fw.flush() would fix it. But you don't need the temporary file at all: you can pass the commands to gnuplot directly without writing them to disk:
from subprocess import Popen, PIPE
p = Popen('gnuplot', stdin=PIPE)
p.communicate(input=gPlotCmd.encode('ascii'))
One likely error here is that the file temp.plt isn't actually written to disk at the time you run gnuplot. Python doesn't necessarily flush its buffers immediately after a call to writelines. This would mean that when gnuplot is launched from your script, it sees an empty file. It doesn't give an error, because empty input isn't an error, and it has no way of knowing it should expect anything else. By the time you run it outside of your script, Python has exited and hence can't be holding anything in its own buffers anymore.
Use the with statement to ensure that fw is closed once you're done with it:
with open('temp.plt', 'w') as fw:
    fw.writelines(gPlotCmd)
subprocess.call(["gnuplot", "temp.plt"])
It seems I figured it out: I was missing fw.close(). The last lines of code should look like:
fw = open('temp.plt','w+')
fw.writelines(gPlotCmd)
fw.close()
subprocess.call(["gnuplot","temp.plt"])
print(file)
Now the code produces the intended plots.
I'm trying to compare the data within two files, and retrieve a list of offsets of where the differences are.
I tried it on some text files and it worked quite well.
However, on non-text files that still contain ASCII text (I call them binary data files: executables and so on), it doesn't. It seems to think some bytes are the same even though, looking at them in a hex editor, they are obviously not. When I print out the binary data that it thinks is the same, I get blank lines where it should be printed.
Thus, I think this is the source of the problem.
So what is the best way to compare bytes of data that could be both binary and contain ASCII text? I thought using the struct module might be a starting point...
As you can see below, I compare the bytes with the == operator
Here's the code:
import os
import math
#file1 = 'file1.txt'
#file2 = 'file2.txt'
file1 = 'file1.exe'
file2 = 'file2.exe'
file1size = os.path.getsize(file1)
file2size = os.path.getsize(file2)
a = file1size - file2size
end = file1size #if they are both same size
if a > 0:
    # file 2 is smallest
    end = file2size
    big = file1size
elif a < 0:
    # file 1 is smallest
    end = file1size
    big = file2size
f1 = open(file1, 'rb')
f2 = open(file2, 'rb')
readSize = 500
r = readSize
off = 0
data = []
looking = False
d = open('data.txt', 'w')
while off < end:
    f1.seek(off)
    f2.seek(off)
    b1, b2 = f1.read(r), f2.read(r)
    same = b1 == b2
    print ''
    if same:
        print 'Same at: ' + str(off)
        print 'readSize: ' + str(r)
        print b1
        print b2
        print ''
        #save offsets of the section of "different" bytes
        #data.append([diffOff, diffOff+off-1]) #[begin diff off, end diff off]
        if looking:
            d.write(str(diffOff) + " => " + str(diffOff + off - 2) + "\n")
            looking = False
            r = readSize
            off = off + 1
        else:
            off = off + r
    else:
        if r == 1:
            looking = True
            diffOff = off
            off = off + 1  #continue reading 1 at a time, until u find a same reading
        r = 1  #it will shoot back to the last off, since we didn't increment it here
d.close()
f1.close()
f2.close()
#add the diff ending portion to diff data offs, if 1 file is longer than the other
a = int(math.fabs(a)) #get abs val of diff
if a:
    data.append([big-a, big-1])
print data
Did you try the difflib and filecmp modules?
This module provides classes and functions for comparing sequences. It can be used for example, for comparing files, and can produce difference information in various formats, including HTML and context and unified diffs. For comparing directories and files, see also the filecmp module.
The filecmp module defines functions to compare files and directories, with various optional time/correctness trade-offs. For comparing files, see also the difflib module.
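For instance, a minimal filecmp check on the question's files (shallow=False forces a byte-by-byte comparison rather than just comparing os.stat signatures):
import filecmp

same = filecmp.cmp('file1.exe', 'file2.exe', shallow=False)
print same  # True only if the files are byte-for-byte identical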
You are probably encountering encoding/decoding problems. Someone may suggest a better solution, but you could try reading the file into a bytearray so you're working with raw byte values instead of decoded characters.
Here's a crude example:
$ od -Ax -tx1 /tmp/aa
000000 e0 b2 aa 0a
$ od -Ax -tx1 /tmp/bb
000000 e0 b2 bb 0a
$ cat /tmp/diff.py
a = bytearray(open('/tmp/aa', 'rb').read())
b = bytearray(open('/tmp/bb', 'rb').read())
print "%02x, %02x" % (a[2], a[3])
print "%02x, %02x" % (b[2], b[3])
$ python /tmp/diff.py
aa, 0a
bb, 0a
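To get the actual list of differing offsets from two bytearrays, a sketch along these lines should work (variable names follow the crude example above):
a = bytearray(open('/tmp/aa', 'rb').read())
b = bytearray(open('/tmp/bb', 'rb').read())
# offsets where the files differ, over the common length
diff_offsets = [i for i, (x, y) in enumerate(zip(a, b)) if x != y]
# bytes past the end of the shorter file also count as differences
diff_offsets += range(min(len(a), len(b)), max(len(a), len(b)))
print diff_offsets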