pypy memory usage grows forever? - python

I have a complicated Python server app that runs constantly. Below is a very simplified version of it.
When I run the app below with CPython ("python Main.py"), it uses 8 MB of RAM straight away and stays at 8 MB, as it should.
When I run it with PyPy ("pypy Main.py"), it starts at 22 MB of RAM and the usage grows over time: after 30 seconds it's at 50 MB, and after an hour it's at 60 MB.
If I change "b.something()" to "pass", it doesn't gobble up memory like that.
I'm using PyPy 1.9 on OS X 10.7.4.
I'm okay with PyPy using more RAM than CPython.
Is there a way to stop PyPy from eating up memory over long periods of time?
import sys
import time
import traceback
class Box(object):
    def __init__(self):
        self.counter = 0

    def something(self):
        self.counter += 1
        if self.counter > 100:
            self.counter = 0

try:
    print 'starting...'
    boxes = []
    for i in range(10000):
        boxes.append(Box())
    print 'running!'
    while True:
        for b in boxes:
            b.something()
        time.sleep(0.02)
except KeyboardInterrupt:
    print ''
    print '####################################'
    print 'KeyboardInterrupt Exception'
    sys.exit(1)
except Exception as e:
    print ''
    print '####################################'
    print 'Main Level Exception: %s' % e
    print traceback.format_exc()
    sys.exit(1)
Below is a list of timestamps and the RAM usage at each point (I left it running overnight).
Wed Sep 5 22:57:54 2012, 22mb ram
Wed Sep 5 22:57:54 2012, 23mb ram
Wed Sep 5 22:57:56 2012, 24mb ram
Wed Sep 5 22:57:56 2012, 25mb ram
Wed Sep 5 22:57:58 2012, 26mb ram
Wed Sep 5 22:57:58 2012, 27mb ram
Wed Sep 5 22:57:59 2012, 29mb ram
Wed Sep 5 22:57:59 2012, 30mb ram
Wed Sep 5 22:58:00 2012, 31mb ram
Wed Sep 5 22:58:02 2012, 32mb ram
Wed Sep 5 22:58:03 2012, 33mb ram
Wed Sep 5 22:58:05 2012, 34mb ram
Wed Sep 5 22:58:08 2012, 35mb ram
Wed Sep 5 22:58:10 2012, 36mb ram
Wed Sep 5 22:58:12 2012, 38mb ram
Wed Sep 5 22:58:13 2012, 39mb ram
Wed Sep 5 22:58:16 2012, 40mb ram
Wed Sep 5 22:58:19 2012, 41mb ram
Wed Sep 5 22:58:21 2012, 42mb ram
Wed Sep 5 22:58:23 2012, 43mb ram
Wed Sep 5 22:58:26 2012, 44mb ram
Wed Sep 5 22:58:28 2012, 45mb ram
Wed Sep 5 22:58:31 2012, 46mb ram
Wed Sep 5 22:58:33 2012, 47mb ram
Wed Sep 5 22:58:35 2012, 49mb ram
Wed Sep 5 22:58:35 2012, 50mb ram
Wed Sep 5 22:58:36 2012, 51mb ram
Wed Sep 5 22:58:36 2012, 52mb ram
Wed Sep 5 22:58:37 2012, 54mb ram
Wed Sep 5 22:59:41 2012, 55mb ram
Wed Sep 5 22:59:45 2012, 56mb ram
Wed Sep 5 22:59:45 2012, 57mb ram
Wed Sep 5 23:00:58 2012, 58mb ram
Wed Sep 5 23:02:20 2012, 59mb ram
Wed Sep 5 23:02:20 2012, 60mb ram
Wed Sep 5 23:02:27 2012, 61mb ram
Thu Sep 6 00:18:00 2012, 62mb ram

http://doc.pypy.org/en/latest/gc_info.html#minimark-environment-variables shows how to tweak the GC through environment variables.

Compared to CPython, PyPy uses different garbage collection strategies.
If the increase in memory is due to something in your program, you could try forcing a garbage collection every now and then, using the collect function from the gc module. In that case, it might also help to explicitly del large objects that you no longer need and that don't go out of scope.
If it is due to the internal workings of PyPy, it might be worth submitting a bug report, as Mark Dickinson suggested.
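For illustration, a periodic forced collection in the loop from the question could look roughly like the sketch below. It is not a guaranteed fix, and how often to call gc.collect is a trade-off you would need to tune.
import gc
import time

class Box(object):
    def __init__(self):
        self.counter = 0
    def something(self):
        self.counter += 1
        if self.counter > 100:
            self.counter = 0

boxes = [Box() for _ in range(10000)]

while True:
    for b in boxes:
        b.something()
    # Force a full collection once per pass over the boxes. A major collection
    # is comparatively expensive on PyPy, so avoid calling it every iteration.
    gc.collect()
    time.sleep(0.02)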

Related

Python gzip gives null bytes

I'm trying to parse some log files in Python, but reading them always returns only null bytes.
I've confirmed that the file in question does contain data:
$ zcat Events.log.gz | wc -c
188371128
$ zcat Events.log.gz | head
17 Jan 2018 08:10:35,863: {"deviceType":"A16ZV8BU3SN1N3",[REDACTED]}
17 Jan 2018 08:10:35,878: {"deviceType":"A1CTGXB4BA274T",[REDACTED]}
17 Jan 2018 08:10:35,886: {"deviceType":"A1DL2DVDQVK3Q",[REDACTED]}
17 Jan 2018 08:10:35,911: {"deviceType":"A2CZFJ2RKY7SE2",[REDACTED]}
17 Jan 2018 08:10:35,937: {"deviceType":"A2JTEGS8GUPDOF",[REDACTED]}
17 Jan 2018 08:10:35,963: {"appOtaState":"ota",[REDACTED]}
17 Jan 2018 08:10:35,971: {"deviceType":"A1DL2DVDQVK3Q",[REDACTED]}
17 Jan 2018 08:10:36,006: {"deviceType":"A2JTEGS8GUPDOF",[REDACTED]}
17 Jan 2018 08:10:36,013: {"deviceType":"A1CTGXB4BA274T",[REDACTED]}
17 Jan 2018 08:10:36,041: {"deviceType":"A1DL2DVDQVK3Q",[REDACTED]}
But attempting to read it in Python gives only null bytes:
$ python
Python 2.6.9 (unknown, Sep 14 2016, 17:46:59)
[GCC 4.4.6 20110731 (Red Hat 4.4.6-3)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> filename = 'Events.log.gz'
>>> import gzip
>>> content = gzip.open(filename).read()
>>> len(content)
188371128
>>> for i in range(10):
... content[i*10000:(i*10000)+10]
...
'\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00'
'\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00'
'\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00'
'\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00'
'\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00'
'\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00'
'\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00'
'\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00'
'\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00'
'\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00'
I've tried explicitly setting 'mode' to either 'r' or 'rb', with no difference in result.
I've also tried subprocess.Popen(['zcat', filename], stdout=subprocess.PIPE).stdout.read(), with the same response.
Perhaps relevantly, when I tried to zcat the file to another file, the output was a binary file:
$ zcat Events.log.gz > /tmp/logoutput
$ less /tmp/logoutput
"/tmp/logoutput" may be a binary file. See it anyway?
[y]
^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#...
$ head /tmp/logoutput
17 Jan 2018 08:10:35,863: {"deviceType":"A16ZV8BU3SN1N3",[REDACTED]}
17 Jan 2018 08:10:35,878: {"deviceType":"A1CTGXB4BA274T",[REDACTED]}
17 Jan 2018 08:10:35,886: {"deviceType":"A1DL2DVDQVK3Q",[REDACTED]}
17 Jan 2018 08:10:35,911: {"deviceType":"A2CZFJ2RKY7SE2",[REDACTED]}
17 Jan 2018 08:10:35,937: {"deviceType":"A2JTEGS8GUPDOF",[REDACTED]}
17 Jan 2018 08:10:35,963: {"appOtaState":"ota",[REDACTED]}
17 Jan 2018 08:10:35,971: {"deviceType":"A1DL2DVDQVK3Q",[REDACTED]}
17 Jan 2018 08:10:36,006: {"deviceType":"A2JTEGS8GUPDOF",[REDACTED]}
17 Jan 2018 08:10:36,013: {"deviceType":"A1CTGXB4BA274T",[REDACTED]}
17 Jan 2018 08:10:36,041: {"deviceType":"A1DL2DVDQVK3Q",[REDACTED]}
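A couple of quick checks might narrow this down (a diagnostic sketch, not part of the original post): confirm the on-disk file really starts with the gzip magic bytes, and count how many leading null bytes the decompressed stream contains.
import gzip

filename = 'Events.log.gz'

# A genuine gzip file starts with the magic bytes 1f 8b.
raw = open(filename, 'rb')
print(repr(raw.read(2)))  # expect '\x1f\x8b'
raw.close()

# Count leading null bytes in the decompressed data.
gz = gzip.open(filename, 'rb')
data = gz.read()
gz.close()
print(len(data) - len(data.lstrip(b'\x00')))  # offset of the first non-null byte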

Python - Parsing a text file into a csv file

I have a text file that is the output of a command I ran with Netmiko to retrieve data from a Cisco WLC about things that are causing interference on our WiFi network. I stripped out just what I needed from the original 600k lines of output, down to a couple thousand lines like this:
AP Name.......................................... 010-HIGH-FL4-AP04
Microwave Oven 11 10 -59 Mon Dec 18 08:21:23 2017
WiMax Mobile 11 0 -84 Fri Dec 15 17:09:45 2017
WiMax Fixed 11 0 -68 Tue Dec 12 09:29:30 2017
AP Name.......................................... 010-2nd-AP04
Microwave Oven 11 10 -61 Sat Dec 16 11:20:36 2017
WiMax Fixed 11 0 -78 Mon Dec 11 12:33:10 2017
AP Name.......................................... 139-FL1-AP03
Microwave Oven 6 18 -51 Fri Dec 15 12:26:56 2017
AP Name.......................................... 010-HIGH-FL3-AP04
Microwave Oven 11 10 -55 Mon Dec 18 07:51:23 2017
WiMax Mobile 11 0 -83 Wed Dec 13 16:16:26 2017
The goal is to end up with a CSV file that strips out the 'AP Name ...' prefix and puts what's left on the same line as the rest of the information from the following lines. The problem is that some APs have two lines below the name, some have one, and some have none. I have been at it for 8 hours and cannot find the best way to make this happen.
This is the latest version of the code I was trying to use; any suggestions for making it work? I just want something I can load into Excel and create a report with:
with open(outfile_name, 'w') as out_file:
    with open('wlc-interference_raw.txt', 'r') as in_file:
        #Variables
        _ap_name = ''
        _temp = ''
        _flag = False
        for i in in_file:
            if 'AP Name' in i:
                #write whatever was put in the temp file to disk because new ap now
                #add another temp variable in case an ap has more than 1 interferer and check if new AP name
                out_file.write(_temp)
                out_file.write('\n')
                #print(_temp)
                _ap_name = i.lstrip('AP Name.......................................... ')
                _ap_name = _ap_name.rstrip('\n')
                _temp = _ap_name
                #print(_temp)
            elif '----' in i:
                pass
            elif 'Class Type' in i:
                pass
            else:
                line_split = i.split()
                for x in line_split:
                    _temp += ','
                    _temp += x
                _temp += '\n'
I think your best option is to read all lines of the file, then split into sections starting with AP Name. Then you can work on parsing each section.
Example
s = """AP Name.......................................... 010-HIGH-FL4-AP04
Microwave Oven 11 10 -59 Mon Dec 18 08:21:23 2017
WiMax Mobile 11 0 -84 Fri Dec 15 17:09:45 2017
WiMax Fixed 11 0 -68 Tue Dec 12 09:29:30 2017
AP Name.......................................... 010-2nd-AP04
Microwave Oven 11 10 -61 Sat Dec 16 11:20:36 2017
WiMax Fixed 11 0 -78 Mon Dec 11 12:33:10 2017
AP Name.......................................... 139-FL1-AP03
Microwave Oven 6 18 -51 Fri Dec 15 12:26:56 2017
AP Name.......................................... 010-HIGH-FL3-AP04
Microwave Oven 11 10 -55 Mon Dec 18 07:51:23 2017
WiMax Mobile 11 0 -83 Wed Dec 13 16:16:26 2017"""
import re
class AP:
    """
    A class holding each section of the parsed file
    """
    def __init__(self):
        self.header = ""
        self.content = []

sections = []
section = None
for line in s.split('\n'):  # Or 'for line in file:'
    # Starting new section
    if line.startswith('AP Name'):
        # If previously had a section, add to list
        if section is not None:
            sections.append(section)
        section = AP()
        section.header = line
    else:
        if section is not None:
            section.content.append(line)
sections.append(section)  # Add last section outside of loop

for section in sections:
    ap_name = section.header.lstrip("AP Name.")  # lstrip takes all the characters given, not a literal string
    for line in section.content:
        print(ap_name + ",", end="")
        # You can extract the date separately, if needed
        # Splitting on more than one space using a regex
        line = ",".join(re.split(r'\s\s+', line))
        print(line.rstrip(','))  # Remove trailing comma from imperfect split
Output
010-HIGH-FL4-AP04,Microwave Oven,11,10,-59,Mon Dec 18 08:21:23 2017
010-HIGH-FL4-AP04,WiMax Mobile,11,0,-84,Fri Dec 15 17:09:45 2017
010-HIGH-FL4-AP04,WiMax Fixed,11,0,-68,Tue Dec 12 09:29:30 2017
010-2nd-AP04,Microwave Oven,11,10,-61,Sat Dec 16 11:20:36 2017
010-2nd-AP04,WiMax Fixed,11,0,-78,Mon Dec 11 12:33:10 2017
139-FL1-AP03,Microwave Oven,6,18,-51,Fri Dec 15 12:26:56 2017
010-HIGH-FL3-AP04,Microwave Oven,11,10,-55,Mon Dec 18 07:51:23 2017
010-HIGH-FL3-AP04,WiMax Mobile,11,0,-83,Wed Dec 13 16:16:26 2017
Tip:
You don't need Python to write the CSV file; you can redirect the script's output to a file from the command line:
python script.py > output.csv
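If you would rather have Python write the file directly, here is a sketch using the standard csv module that reuses the sections list built above (the output filename is just an example):
import csv
import re

with open('output.csv', 'w') as out_file:  # on Python 2, open with mode 'wb'
    writer = csv.writer(out_file)
    for section in sections:
        ap_name = section.header.lstrip("AP Name.")
        for line in section.content:
            # Same "two or more spaces" split as above; one row per interferer.
            writer.writerow([ap_name] + re.split(r'\s\s+', line.strip()))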

Delete every row of data preceding "MSG" row in txt file?

I have a txt file that has 70K or so rows of data with 8 or so columns. The 2nd column defines the data type (either SMP or MSG). Within that data file, there are 62 rows in total identified as "MSG". I am trying to write a simple awk command, or even a short Python script, that will delete the one row of data that precedes every "MSG" row in the file. Example section out of the actual data file:
976810 SMP 2 144.79 108.25
993461 SMP 2 144.68 108.15
945277 SMP 2 144.90 108.10
945828 SMP 3 144.83 108.31
945237 MSG 3 # Message: 5
943544 SMP 3 144.87 108.58
945209 SMP 3 144.93 108.68
976916 SMP 3 145.17 108.72
997481 SMP 3 140.90 109.33
914197 SMP 4 140.79 109.15
945300 MSG 4 # Message: 0
940848 SMP 4 140.84 109.11
945568 SMP 4 140.91 109.03
945200 SMP 4 141.08 109.01
So in the example above, I need to delete the SMP lines right before every MSG line.
I thought maybe I'd use an awk command to search for $2=='MSG' and then delete row MSG-1 or something.
I very much appreciate any suggestions/help/guidance on this!
Regards
$ awk 'NR>1 && $2!="MSG"{print prev} {prev=$0} END{print prev}' file
976810 SMP 2 144.79 108.25
993461 SMP 2 144.68 108.15
945277 SMP 2 144.90 108.10
945237 MSG 3 # Message: 5
943544 SMP 3 144.87 108.58
945209 SMP 3 144.93 108.68
976916 SMP 3 145.17 108.72
997481 SMP 3 140.90 109.33
945300 MSG 4 # Message: 0
940848 SMP 4 140.84 109.11
945568 SMP 4 140.91 109.03
945200 SMP 4 141.08 109.01
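Since the question also mentions a short Python script, a rough equivalent of the awk one-liner above could look like this (the filenames are placeholders):
with open('file') as in_file, open('file_filtered', 'w') as out_file:
    prev = None
    for line in in_file:
        fields = line.split()
        is_msg = len(fields) > 1 and fields[1] == 'MSG'
        # Emit the previous line only when the current line is not a MSG row,
        # which drops the single line preceding each MSG row.
        if prev is not None and not is_msg:
            out_file.write(prev)
        prev = line
    if prev is not None:
        out_file.write(prev)  # always keep the last line, like the awk END block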

Pandas - Optimal persistence strategy for highest compression ratio?

Question
Given a large series of DataFrames with a small variety of dtypes, what is the optimal design for Pandas DataFrame persistence/serialization if I care about compression ratio first, decompression speed second, and initial compression speed third?
Background:
I have roughly 200k dataframes of shape [2900,8] that I need to store in logical blocks of ~50 data frames per file. The data frame contains variables of type np.int8, np.float64. Most data frames are good candidates for sparse types, but sparse is not supported in HDF 'table' format stores (not that it would even help - see the size below for a sparse gzipped pickle). Data is generated daily and currently adds up to over 20GB. While I'm not bound to HDF, I have yet to find a better solution that allows for reads on individual dataframes within the persistent store, combined with top quality compression. Again, I'm willing to sacrifice a little speed for better compression ratios, especially since I will need to be sending this all over the wire.
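For reference, the kind of per-frame read this relies on is just a keyed lookup against the store, roughly as in the sketch below (the file and key names follow the example code further down):
import pandas as pd

# Open one block file and pull out a single DataFrame without loading the rest.
store = pd.HDFStore('test_table_blosc-9.hdf', mode='r')
df = store.select('id_0')   # 'table' format supports select (and queries)
# df = store['id_0']        # plain keyed access also works, including for 'fixed' format
store.close()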
There are a couple of other SO threads and other links that might be relevant for those in a similar position. However, most of what I've found doesn't focus on minimizing storage size as a priority:
“Large data” work flows using pandas
HDF5 and SQLite. Concurrency, compression & I/O performance [closed]
Environment:
OSX 10.9.5
Pandas 0.14.1
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
PyTables version: 3.1.1
HDF5 version: 1.8.13
NumPy version: 1.8.1
Numexpr version: 2.4 (not using Intel's VML/MKL)
Zlib version: 1.2.5 (in Python interpreter)
LZO version: 2.06 (Aug 12 2011)
BZIP2 version: 1.0.6 (6-Sept-2010)
Blosc version: 1.3.5 (2014-03-22)
Blosc compressors: ['blosclz', 'lz4', 'lz4hc', 'snappy', 'zlib']
Cython version: 0.20.2
Python version: 2.7.8 (default, Jul 2 2014, 10:14:46)
[GCC 4.2.1 Compatible Apple LLVM 5.1 (clang-503.0.40)]
Platform: Darwin-13.4.0-x86_64-i386-64bit
Byte-ordering: little
Detected cores: 8
Default encoding: ascii
Default locale: (en_US, UTF-8)
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
Example:
import pandas as pd
import numpy as np
import random
import cPickle as pickle
import gzip
def generate_data():
    alldfs = {}
    n = 2800
    m = 8
    loops = 50
    idx = pd.date_range('1/1/1980',periods=n,freq='D')
    for x in xrange(loops):
        id = "id_%s" % x
        df = pd.DataFrame(np.random.randn(n,m) * 100,index=idx)
        # adjust data a bit..
        df.ix[:,0] = 0
        df.ix[:,1] = 0
        for y in xrange(100):
            i = random.randrange(n-1)
            j = random.randrange(n-1)
            df.ix[i,0] = 1
            df.ix[j,1] = 1
        df.ix[:,0] = df.ix[:,0].astype(np.int8) # adjust datatype
        df.ix[:,1] = df.ix[:,1].astype(np.int8)
        alldfs[id] = df
    return alldfs

def store_all_hdf(x,format='table',complevel=9,complib='blosc'):
    fn = "test_%s_%s-%s.hdf" % (format,complib,complevel)
    hdfs = pd.HDFStore(fn,mode='w',format=format,complevel=complevel,complib=complib)
    for key in x.keys():
        df = x[key]
        hdfs.put(key,df,format=format,append=False)
    hdfs.close()

alldfs = generate_data()

for format in ['table','fixed']:
    for complib in ['blosc','zlib','bzip2','lzo',None]:
        store_all_hdf(alldfs,format=format,complib=complib,complevel=9)

# pickle, for comparison
with open('test_pickle.pkl','wb') as f:
    pickle.dump(alldfs,f)

with gzip.open('test_pickle_gzip.pklz','wb') as f:
    pickle.dump(alldfs,f)

with gzip.open('test_pickle_gzip_sparse.pklz','wb') as f:
    sparsedfs = {}
    for key in alldfs.keys():
        sdf = alldfs[key].to_sparse(fill_value=0)
        sparsedfs[key] = sdf
    pickle.dump(sparsedfs,f)
Results
-rw-r--r-- 1 bazel staff 10292760 Oct 17 14:31 test_fixed_None-9.hdf
-rw-r--r-- 1 bazel staff 9531607 Oct 17 14:31 test_fixed_blosc-9.hdf
-rw-r--r-- 1 bazel staff 7867786 Oct 17 14:31 test_fixed_bzip2-9.hdf
-rw-r--r-- 1 bazel staff 9506483 Oct 17 14:31 test_fixed_lzo-9.hdf
-rw-r--r-- 1 bazel staff 8036845 Oct 17 14:31 test_fixed_zlib-9.hdf
-rw-r--r-- 1 bazel staff 26627915 Oct 17 14:31 test_pickle.pkl
-rw-r--r-- 1 bazel staff 8752370 Oct 17 14:32 test_pickle_gzip.pklz
-rw-r--r-- 1 bazel staff 8407704 Oct 17 14:32 test_pickle_gzip_sparse.pklz
-rw-r--r-- 1 bazel staff 14464924 Oct 17 14:31 test_table_None-9.hdf
-rw-r--r-- 1 bazel staff 8619016 Oct 17 14:31 test_table_blosc-9.hdf
-rw-r--r-- 1 bazel staff 8154716 Oct 17 14:31 test_table_bzip2-9.hdf
-rw-r--r-- 1 bazel staff 8481631 Oct 17 14:31 test_table_lzo-9.hdf
-rw-r--r-- 1 bazel staff 8047125 Oct 17 14:31 test_table_zlib-9.hdf
Given the results above, the best 'compression-first' solution appears to be to store the data in HDF fixed format, with bzip2. Is there a better way of organising the data, perhaps without HDF, that would allow me to save even more space?
Update 1
Per the comment below from Jeff, I ran ptrepack on the table-format HDF file stored without initial compression, and then recompressed it. Results are below:
-rw-r--r-- 1 bazel staff 8627220 Oct 18 08:40 test_table_repack-blocsc-9.hdf
-rw-r--r-- 1 bazel staff 8627620 Oct 18 09:07 test_table_repack-blocsc-blosclz-9.hdf
-rw-r--r-- 1 bazel staff 8409221 Oct 18 08:41 test_table_repack-blocsc-lz4-9.hdf
-rw-r--r-- 1 bazel staff 8104142 Oct 18 08:42 test_table_repack-blocsc-lz4hc-9.hdf
-rw-r--r-- 1 bazel staff 14475444 Oct 18 09:05 test_table_repack-blocsc-snappy-9.hdf
-rw-r--r-- 1 bazel staff 8059586 Oct 18 08:43 test_table_repack-blocsc-zlib-9.hdf
-rw-r--r-- 1 bazel staff 8161985 Oct 18 09:08 test_table_repack-bzip2-9.hdf
Oddly, recompressing with ptrepack seems to increase total file size (at least in this case using table format with similar compressors).
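For anyone reproducing this, the repack step amounts to an invocation along these lines (a sketch; the flags and filenames here are illustrative rather than copied from the original run):
import subprocess

# Recompress an uncompressed table-format store with the blosc:lz4hc codec at level 9.
subprocess.check_call([
    'ptrepack', '--complevel=9', '--complib=blosc:lz4hc',
    'test_table_None-9.hdf', 'test_table_repack-blosc-lz4hc-9.hdf',
])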

regex to print out lines after hostname until delimiter

I have a list which goes as follows:
-------------------------------------------------------------------------------------------
www.mydomain.de UP Thu May 8 09:10:57 2014
HTTPS OK Thu May 8 09:10:08 2014
HTTPS-Cert OK Thu May 8 09:10:55 2014
-------------------------------------------------------------------------------------------
www.someotherdomain.de UP Thu May 8 09:09:17 2014
HTTPS OK Thu May 8 09:09:30 2014
HTTPS-Cert OK Thu May 8 09:11:10 2014
-------------------------------------------------------------------------------------------
www.somedifferentdomain.at UP Thu May 8 09:08:47 2014
HTTPS OK Thu May 8 09:10:26 2014
HTTPS-Cert OK Thu May 8 09:11:13 2014
-------------------------------------------------------------------------------------------
www.foobladomain.de UP Thu May 8 09:09:17 2014
HTTPS OK Thu May 8 09:09:30 2014
HTTPS-Cert OK Thu May 8 09:11:08 2014
-------------------------------------------------------------------------------------------
www.snafudomain.at UP Thu May 8 09:09:17 2014
HTTP OK Thu May 8 09:09:42 2014
HTTPS OK Thu May 8 09:10:10 2014
HTTPS-Cert OK Thu May 8 09:10:09 2014
-------------------------------------------------------------------------------------------
www.lolnotanotherdomain.de UP Thu May 8 09:06:57 2014
HTTP OK Thu May 8 09:11:10 2014
HTTPS OK Thu May 8 09:11:16 2014
HTTPS-Cert OK Thu May 8 09:11:10 2014
and I have a function which takes the hostname as a parameter and prints it out:
please enter hostname to search for: www.snafudomain.at
www.snafudomain.at UP Thu May 8 09:09:17 2014
but what I want to achieve is that the lines following the hostname are also printed, up to the delimiter line "-----". The function right now looks like this:
def getChecks(self,hostname):
    re0 = "%s" % hostname
    mylist = open('myhostlist', 'r')
    for i in mylist:
        if re.findall("^%s" % re0, str(i)):
            print i
        else:
            continue
Is there some easy way to do this? If something is unclear, please comment. Thanks in advance.
Edit
To clarify, the output should look like this:
www.mydomain.de UP Thu May 8 09:10:57 2014
HTTPS OK Thu May 8 09:10:08 2014
HTTPS-Cert OK Thu May 8 09:10:55 2014
-------------------------------------------------------------------------------------
I just want to print out the lines from the searched domain name up to the line with only minuses.
How about not using regex at all?
def get_checks(self, hostname):
    record = False
    with open('myhostlist', 'r') as file_h:
        for line in file_h:
            if line.startswith(hostname):
                record = True
                print(line)
            elif line.startswith("---"):
                record = False
                print(line)
            elif record:
                print(line)
import re

def get_checks(hostname):
    pattern = re.compile(r"{}.*?(?=---)".format(re.escape(hostname)), re.S)
    with open("Input.txt") as in_file:
        return re.search(pattern, in_file.read())

print get_checks("www.snafudomain.at").group()
This returns all the lines starting with www.snafudomain.at until it finds ---. The generated pattern looks like this:
www\.snafudomain\.at.*?(?=---)
We use re.escape because your hostname has . in it. Since . has special meaning in regular expressions, we want the regex engine to treat it as a literal dot.
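One caveat: re.search returns None when the hostname is not in the file, so calling .group() directly on the result will raise an AttributeError. A slightly safer call:
match = get_checks("www.snafudomain.at")
if match:
    print(match.group())
else:
    print("hostname not found")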
