I have a txt file with 70K or so rows of data and 8 or so columns. The 2nd column defines the data type (either SMP or MSG). Within that data file, there are 62 rows in total identified as "MSG". I am trying to write a simple awk command, or even a short Python script, that deletes the one row of data that precedes every "MSG" row in the file. Example section from the actual data file:
976810 SMP 2 144.79 108.25
993461 SMP 2 144.68 108.15
945277 SMP 2 144.90 108.10
945828 SMP 3 144.83 108.31
945237 MSG 3 # Message: 5
943544 SMP 3 144.87 108.58
945209 SMP 3 144.93 108.68
976916 SMP 3 145.17 108.72
997481 SMP 3 140.90 109.33
914197 SMP 4 140.79 109.15
945300 MSG 4 # Message: 0
940848 SMP 4 140.84 109.11
945568 SMP 4 140.91 109.03
945200 SMP 4 141.08 109.01
So in the example above, I need to delete the SMP lines right before every MSG line.
I thought maybe I'd use an awk command to search for $2=='MSG' and then delete row MSG-1 or something.
I very much appreciate any suggestions/help/guidance on this!
Regards
One approach is to buffer the previous line and only print it once you know the current line is not a MSG record:
$ awk 'NR>1 && $2!="MSG"{print prev} {prev=$0} END{print prev}' file
976810 SMP 2 144.79 108.25
993461 SMP 2 144.68 108.15
945277 SMP 2 144.90 108.10
945237 MSG 3 # Message: 5
943544 SMP 3 144.87 108.58
945209 SMP 3 144.93 108.68
976916 SMP 3 145.17 108.72
997481 SMP 3 140.90 109.33
945300 MSG 4 # Message: 0
940848 SMP 4 140.84 109.11
945568 SMP 4 140.91 109.03
945200 SMP 4 141.08 109.01
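Since the question mentions that a short Python script would also be fine, here is a minimal sketch of the same buffering idea (the input and output filenames are placeholders):

prev = None
with open('data.txt') as src, open('filtered.txt', 'w') as dst:
    for line in src:
        fields = line.split()
        # emit the buffered line only if the current line is not a MSG record
        if prev is not None and not (len(fields) > 1 and fields[1] == 'MSG'):
            dst.write(prev)
        prev = line
    # the final line can never precede a MSG line, so always keep it
    if prev is not None:
        dst.write(prev)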
I have a read.log file that will have lines such as...
10.2.177.170 Tue Jun 19 03:30:55 CDT 2018
10.2.177.170 Tue Jun 19 03:31:03 CDT 2018
10.2.177.170 Tue Jun 19 03:31:04 CDT 2018
10.2.177.170 Tue Jun 19 03:32:04 CDT 2018
10.2.177.170 Tue Jun 19 03:33:04 CDT 2018
My code reads the 3rd-to-last line and combines the strings, so the normal output would be:
2018:19:03:32:04
My problem is that if there are only 4 or fewer lines of data, such as
10.1.177.170 Tue Jun 19 03:30:55 CDT 2018
10.1.177.170 Tue Jun 19 03:31:03 CDT 2018
10.1.177.170 Tue Jun 19 03:31:04 CDT 2018
10.1.177.170 Tue Jun 19 03:32:04 CDT 2018
I get an error
x1 = line.split()[0]
IndexError: list index out of range
How can I error-check this or keep it from happening? I have been trying to count how many lines there are in the log and, if there are fewer than 5, print a notice. Are there better options?
def run():
    f = open('read.log', 'r')
    lnumber = dict()
    for num, line in enumerate(f, 1):
        x1 = line.split()[0]
        log_day = line.split()[3]
        log_time = line.split()[4]
        log_year = line.split()[6]
        if x1 in lnumber:
            lnumber[x1].append((log_year + ":" + log_day + ":" + log_time))
        else:
            lnumber[x1] = [(num, log_time)]
    if x1 in lnumber and len(lnumber.get(x1, None)) > 2:
        # if there are less than 3 lines in document, this will fail
        line_time = (lnumber[x1][-3].__str__())
        print(line_time)
    else:
        print('nothing')
    f.close()

run()
f.readlines() gives you a list of the lines in a file, so you could start by reading them all in:
f = open('read.log', 'r')
lines = f.readlines()
And exiting if there are 4 or fewer lines:
if len(lines) <= 4:
    f.close()
    print("4 or fewer lines in file")
    exit()
That IndexError is happening because split() returns an empty list for a blank line, so index [0] is out of range. I would suggest something like if not line.strip(): continue to skip those lines.
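For example, a guarded version of your loop body might look like this (a sketch that keeps your variable names and simply skips lines with too few fields):

for num, line in enumerate(f, 1):
    fields = line.split()
    if len(fields) < 7:
        # blank or malformed line: skip it rather than raise IndexError
        continue
    x1 = fields[0]
    log_day = fields[3]
    log_time = fields[4]
    log_year = fields[6]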
I'm trying to execute a command in Python like this:
os.system('ls')
What's interesting is that the output length is limited by the size of the terminal window in which I'm running the Python console.
>>>os.system('ls -l')
total 0
-rw-r--r-- 1 hy hy 0 Apr 29 22:30 a.txt
-rw-r--r-- 1 hy hy 0 Apr 29 22:31 b.txt
-rw-r--r-- 1 hy hy 0 Apr 29 22:31 c.txt
-rw-r--r-- 1 hy hy 0 Apr 29 22:31 d.txt
-rw-r--r-- 1 hy hy 0 Apr 29 22:31 e.txt
-rw-r--r-- 1 hy hy 0 Apr 29 22:31 f.txt
-rw-r--r-- 1 hy hy 0 Apr 29 22:31 g.txt
>>>
I did that in a directory containing hundreds of files and intentionally resized the terminal window to be very small; it only outputs the few lines that exactly fill the window. If I use an even smaller terminal window, it outputs even fewer lines. Every time, the output stops at the bottom boundary of my terminal window.
It's not that the Python console hides some of the output when displaying it: I tried using subprocess.Popen() to send the output into a pipe and readlines() on the pipe, and got the same result.
But it seems Python doesn't do this all the time; I don't get this problem on all machines.
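For reference, the subprocess attempt was along these lines (a sketch; 'ls -l' is just the example command):

import subprocess

# read the command's output through a pipe rather than letting it go to the terminal
proc = subprocess.Popen(['ls', '-l'], stdout=subprocess.PIPE)
lines = proc.stdout.readlines()
proc.wait()
print(len(lines))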
Question
Given a large series of DataFrames with a small variety of dtypes, what is the optimal design for Pandas DataFrame persistence/serialization if I care about compression ratio first, decompression speed second, and initial compression speed third?
Background:
I have roughly 200k dataframes of shape [2900,8] that I need to store in logical blocks of ~50 data frames per file. The data frame contains variables of type np.int8, np.float64. Most data frames are good candidates for sparse types, but sparse is not supported in HDF 'table' format stores (not that it would even help - see the size below for a sparse gzipped pickle). Data is generated daily and currently adds up to over 20GB. While I'm not bound to HDF, I have yet to find a better solution that allows for reads on individual dataframes within the persistent store, combined with top quality compression. Again, I'm willing to sacrifice a little speed for better compression ratios, especially since I will need to be sending this all over the wire.
There are a couple of other SO threads and other links that might be relevant for those who are in a similar position. However, most of what I've found doesn't focus on minimizing storage size as a priority:
“Large data” work flows using pandas
HDF5 and SQLite. Concurrency, compression & I/O performance [closed]
Environment:
OSX 10.9.5
Pandas 0.14.1
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
PyTables version: 3.1.1
HDF5 version: 1.8.13
NumPy version: 1.8.1
Numexpr version: 2.4 (not using Intel's VML/MKL)
Zlib version: 1.2.5 (in Python interpreter)
LZO version: 2.06 (Aug 12 2011)
BZIP2 version: 1.0.6 (6-Sept-2010)
Blosc version: 1.3.5 (2014-03-22)
Blosc compressors: ['blosclz', 'lz4', 'lz4hc', 'snappy', 'zlib']
Cython version: 0.20.2
Python version: 2.7.8 (default, Jul 2 2014, 10:14:46)
[GCC 4.2.1 Compatible Apple LLVM 5.1 (clang-503.0.40)]
Platform: Darwin-13.4.0-x86_64-i386-64bit
Byte-ordering: little
Detected cores: 8
Default encoding: ascii
Default locale: (en_US, UTF-8)
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
Example:
import pandas as pd
import numpy as np
import random
import cPickle as pickle
import gzip

def generate_data():
    alldfs = {}
    n = 2800
    m = 8
    loops = 50
    idx = pd.date_range('1/1/1980', periods=n, freq='D')
    for x in xrange(loops):
        id = "id_%s" % x
        df = pd.DataFrame(np.random.randn(n, m) * 100, index=idx)
        # adjust data a bit..
        df.ix[:, 0] = 0
        df.ix[:, 1] = 0
        for y in xrange(100):
            i = random.randrange(n - 1)
            j = random.randrange(n - 1)
            df.ix[i, 0] = 1
            df.ix[j, 1] = 1
        df.ix[:, 0] = df.ix[:, 0].astype(np.int8)  # adjust datatype
        df.ix[:, 1] = df.ix[:, 1].astype(np.int8)
        alldfs[id] = df
    return alldfs

def store_all_hdf(x, format='table', complevel=9, complib='blosc'):
    fn = "test_%s_%s-%s.hdf" % (format, complib, complevel)
    hdfs = pd.HDFStore(fn, mode='w', format=format, complevel=complevel, complib=complib)
    for key in x.keys():
        df = x[key]
        hdfs.put(key, df, format=format, append=False)
    hdfs.close()

alldfs = generate_data()

for format in ['table', 'fixed']:
    for complib in ['blosc', 'zlib', 'bzip2', 'lzo', None]:
        store_all_hdf(alldfs, format=format, complib=complib, complevel=9)

# pickle, for comparison
with open('test_pickle.pkl', 'wb') as f:
    pickle.dump(alldfs, f)

with gzip.open('test_pickle_gzip.pklz', 'wb') as f:
    pickle.dump(alldfs, f)

with gzip.open('test_pickle_gzip_sparse.pklz', 'wb') as f:
    sparsedfs = {}
    for key in alldfs.keys():
        sdf = alldfs[key].to_sparse(fill_value=0)
        sparsedfs[key] = sdf
    pickle.dump(sparsedfs, f)
Results
-rw-r--r-- 1 bazel staff 10292760 Oct 17 14:31 test_fixed_None-9.hdf
-rw-r--r-- 1 bazel staff 9531607 Oct 17 14:31 test_fixed_blosc-9.hdf
-rw-r--r-- 1 bazel staff 7867786 Oct 17 14:31 test_fixed_bzip2-9.hdf
-rw-r--r-- 1 bazel staff 9506483 Oct 17 14:31 test_fixed_lzo-9.hdf
-rw-r--r-- 1 bazel staff 8036845 Oct 17 14:31 test_fixed_zlib-9.hdf
-rw-r--r-- 1 bazel staff 26627915 Oct 17 14:31 test_pickle.pkl
-rw-r--r-- 1 bazel staff 8752370 Oct 17 14:32 test_pickle_gzip.pklz
-rw-r--r-- 1 bazel staff 8407704 Oct 17 14:32 test_pickle_gzip_sparse.pklz
-rw-r--r-- 1 bazel staff 14464924 Oct 17 14:31 test_table_None-9.hdf
-rw-r--r-- 1 bazel staff 8619016 Oct 17 14:31 test_table_blosc-9.hdf
-rw-r--r-- 1 bazel staff 8154716 Oct 17 14:31 test_table_bzip2-9.hdf
-rw-r--r-- 1 bazel staff 8481631 Oct 17 14:31 test_table_lzo-9.hdf
-rw-r--r-- 1 bazel staff 8047125 Oct 17 14:31 test_table_zlib-9.hdf
Given the results above, the best 'compression-first' solution appears to be to store the data in HDF fixed format, with bzip2. Is there a better way of organising the data, perhaps without HDF, that would allow me to save even more space?
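For what it's worth, pulling a single dataframe back out of one of these stores looks roughly like this (the key name comes from the generator above; the filename is one of the test files):

store = pd.HDFStore('test_fixed_bzip2-9.hdf', mode='r')
df = store['id_0']  # read back just one dataframe by key
store.close()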
Update 1
Per the comment below from Jeff, I used ptrepack on the table-format HDF file that had no initial compression, and then recompressed with each compressor (an example invocation is sketched after the listing). Results are below:
-rw-r--r-- 1 bazel staff 8627220 Oct 18 08:40 test_table_repack-blocsc-9.hdf
-rw-r--r-- 1 bazel staff 8627620 Oct 18 09:07 test_table_repack-blocsc-blosclz-9.hdf
-rw-r--r-- 1 bazel staff 8409221 Oct 18 08:41 test_table_repack-blocsc-lz4-9.hdf
-rw-r--r-- 1 bazel staff 8104142 Oct 18 08:42 test_table_repack-blocsc-lz4hc-9.hdf
-rw-r--r-- 1 bazel staff 14475444 Oct 18 09:05 test_table_repack-blocsc-snappy-9.hdf
-rw-r--r-- 1 bazel staff 8059586 Oct 18 08:43 test_table_repack-blocsc-zlib-9.hdf
-rw-r--r-- 1 bazel staff 8161985 Oct 18 09:08 test_table_repack-bzip2-9.hdf
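The repack step was along these lines for each compressor (a sketch; the exact input and output filenames are assumed):

ptrepack --complevel=9 --complib=blosc:lz4hc test_table_None-9.hdf test_table_repack-blosc-lz4hc-9.hdf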
Oddly, recompressing with ptrepack seems to increase total file size (at least in this case using table format with similar compressors).
I have a list which goes as follows:
-------------------------------------------------------------------------------------------
www.mydomain.de UP Thu May 8 09:10:57 2014
HTTPS OK Thu May 8 09:10:08 2014
HTTPS-Cert OK Thu May 8 09:10:55 2014
-------------------------------------------------------------------------------------------
www.someotherdomain.de UP Thu May 8 09:09:17 2014
HTTPS OK Thu May 8 09:09:30 2014
HTTPS-Cert OK Thu May 8 09:11:10 2014
-------------------------------------------------------------------------------------------
www.somedifferentdomain.at UP Thu May 8 09:08:47 2014
HTTPS OK Thu May 8 09:10:26 2014
HTTPS-Cert OK Thu May 8 09:11:13 2014
-------------------------------------------------------------------------------------------
www.foobladomain.de UP Thu May 8 09:09:17 2014
HTTPS OK Thu May 8 09:09:30 2014
HTTPS-Cert OK Thu May 8 09:11:08 2014
-------------------------------------------------------------------------------------------
www.snafudomain.at UP Thu May 8 09:09:17 2014
HTTP OK Thu May 8 09:09:42 2014
HTTPS OK Thu May 8 09:10:10 2014
HTTPS-Cert OK Thu May 8 09:10:09 2014
-------------------------------------------------------------------------------------------
www.lolnotanotherdomain.de UP Thu May 8 09:06:57 2014
HTTP OK Thu May 8 09:11:10 2014
HTTPS OK Thu May 8 09:11:16 2014
HTTPS-Cert OK Thu May 8 09:11:10 2014
and I have a function which takes the hostname as a parameter and prints it out:
please enter hostname to search for: www.snafudomain.at
www.snafudomain.at UP Thu May 8 09:09:17 2014
But what I want to achieve is that the lines following the hostname are printed out as well, up to the delimiter line "-----". The function I have right now looks like this:
def getChecks(self, hostname):
    re0 = "%s" % hostname
    mylist = open('myhostlist', 'r')
    for i in mylist:
        if re.findall("^%s" % re0, str(i)):
            print i
        else:
            continue
Is there some easy way to do this? If something is unclear, please comment. Thanks in advance.
Edit
To clarify, the output should look like this:
www.mydomain.de UP Thu May 8 09:10:57 2014
HTTPS OK Thu May 8 09:10:08 2014
HTTPS-Cert OK Thu May 8 09:10:55 2014
-------------------------------------------------------------------------------------
I just want to print out the lines from the searched domain name down to the line consisting only of minus signs.
How about not using regex at all?
def get_checks(self, hostname):
    record = False
    with open('myhostlist', 'r') as file_h:
        for line in file_h:
            if line.startswith(hostname):
                record = True
                print(line)
            elif line.startswith("---"):
                # only print the delimiter that closes the matched block
                if record:
                    print(line)
                record = False
            elif record:
                print(line)
import re

def get_checks(hostname):
    pattern = re.compile(r"{}.*?(?=---)".format(re.escape(hostname)), re.S)
    with open("Input.txt") as in_file:
        return re.search(pattern, in_file.read())

print get_checks("www.snafudomain.at").group()
This returns all the lines starting from www.snafudomain.at until it finds ---. The generated pattern will look like this:
www\.snafudomain\.at.*?(?=---)
We use re.escape because your hostname has . in it. Since . has a special meaning in regular expressions, we want the regex engine to treat it as a literal dot.
I have a complicated Python server app that runs constantly. Below is a very simplified version of it.
When I run the app below with Python ("python Main.py"), it uses 8 MB of RAM straight away and stays at 8 MB, as it should.
When I run it with PyPy ("pypy Main.py"), it starts out using 22 MB of RAM and the usage grows over time. After 30 seconds it's at 50 MB; after an hour it's at 60 MB.
If I change "b.something()" to "pass", it doesn't gobble up memory like that.
I'm using PyPy 1.9 on OS X 10.7.4.
I'm okay with PyPy using more RAM than Python.
Is there a way to stop PyPy from eating up memory over long periods of time?
import sys
import time
import traceback

class Box(object):
    def __init__(self):
        self.counter = 0

    def something(self):
        self.counter += 1
        if self.counter > 100:
            self.counter = 0

try:
    print 'starting...'
    boxes = []
    for i in range(10000):
        boxes.append(Box())
    print 'running!'
    while True:
        for b in boxes:
            b.something()
        time.sleep(0.02)
except KeyboardInterrupt:
    print ''
    print '####################################'
    print 'KeyboardInterrupt Exception'
    sys.exit(1)
except Exception as e:
    print ''
    print '####################################'
    print 'Main Level Exception: %s' % e
    print traceback.format_exc()
    sys.exit(1)
Below is a list of timestamps and the RAM usage at each time (I left it running overnight).
Wed Sep 5 22:57:54 2012, 22mb ram
Wed Sep 5 22:57:54 2012, 23mb ram
Wed Sep 5 22:57:56 2012, 24mb ram
Wed Sep 5 22:57:56 2012, 25mb ram
Wed Sep 5 22:57:58 2012, 26mb ram
Wed Sep 5 22:57:58 2012, 27mb ram
Wed Sep 5 22:57:59 2012, 29mb ram
Wed Sep 5 22:57:59 2012, 30mb ram
Wed Sep 5 22:58:00 2012, 31mb ram
Wed Sep 5 22:58:02 2012, 32mb ram
Wed Sep 5 22:58:03 2012, 33mb ram
Wed Sep 5 22:58:05 2012, 34mb ram
Wed Sep 5 22:58:08 2012, 35mb ram
Wed Sep 5 22:58:10 2012, 36mb ram
Wed Sep 5 22:58:12 2012, 38mb ram
Wed Sep 5 22:58:13 2012, 39mb ram
Wed Sep 5 22:58:16 2012, 40mb ram
Wed Sep 5 22:58:19 2012, 41mb ram
Wed Sep 5 22:58:21 2012, 42mb ram
Wed Sep 5 22:58:23 2012, 43mb ram
Wed Sep 5 22:58:26 2012, 44mb ram
Wed Sep 5 22:58:28 2012, 45mb ram
Wed Sep 5 22:58:31 2012, 46mb ram
Wed Sep 5 22:58:33 2012, 47mb ram
Wed Sep 5 22:58:35 2012, 49mb ram
Wed Sep 5 22:58:35 2012, 50mb ram
Wed Sep 5 22:58:36 2012, 51mb ram
Wed Sep 5 22:58:36 2012, 52mb ram
Wed Sep 5 22:58:37 2012, 54mb ram
Wed Sep 5 22:59:41 2012, 55mb ram
Wed Sep 5 22:59:45 2012, 56mb ram
Wed Sep 5 22:59:45 2012, 57mb ram
Wed Sep 5 23:00:58 2012, 58mb ram
Wed Sep 5 23:02:20 2012, 59mb ram
Wed Sep 5 23:02:20 2012, 60mb ram
Wed Sep 5 23:02:27 2012, 61mb ram
Thu Sep 6 00:18:00 2012, 62mb ram
http://doc.pypy.org/en/latest/gc_info.html#minimark-environment-variables shows how to tweak the GC.
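For example, the maximum heap size can be capped with one of those environment variables (the value here is only an illustration):

PYPY_GC_MAX=250MB pypy Main.py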
Compared to CPython, PyPy uses different garbage collection strategies.
If the increase in memory is due to something in your program, you could try to run a forced garbage collection every now and then, by using the collect function from the gc module. In this case, it might also help to explicitly del large objects that you don't need anymore and that don't go out of scope.
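A minimal sketch of that idea, applied to the main loop from the question:

import gc

while True:
    for b in boxes:
        b.something()
    # periodically force a full collection so garbage does not pile up between passes
    gc.collect()
    time.sleep(0.02)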
If it is due to the internal workings of PyPy, it might be worth submitting a bug report, as Mark Dickinson suggested.