Parsing members of a variable length python string

I am using sed in python to read the text from a log file into a single string.
Here is the command:
sys_output=commands.getoutput('sed -n "/SYS /,/Tot /p" %s.log' % cim_input_prefix)
and here is a printout of sys_output
SYS SCFTYP METHOD NC NO NU NBS MEMORY CPU TIME
1 RHF CCSD 18 21 59 89 92 1.6163
2 RHF CCSD 4 7 22 36 2 0.0036
Tot 94 1.6199
SYS SCFTYP METHOD NC NO NU NBS MEMORY CPU TIME
1 RHF CCSD 4 4 14 19 1 0.0002
Tot 1 0.0002
SYS SCFTYP METHOD NC NO NU NBS MEMORY CPU TIME
1 RHF CCSD 4 9 36 55 8 0.0416
2 RHF CCSD 18 25 73 108 200 5.3587
3 RHF CCSD 4 10 29 48 6 0.0217
Tot 214 5.4221
Which has three groups, with [2,1,3] rows of interest.
The log files my script will encounter may have a variable number of groups and rows, so I can't simply split the string and pull out the useful information.
I am interested in the index of group and row, and the memory column.
How can I parse this large string to obtain a dictionary such as:
{'1-1': 92, '1-2': 2, '2-1': 1, '3-1': 8, '3-2': 200, '3-3': 6}?
Thank you very much for your time

Some kind of state machine based on the particular traits of the output may make life easier than worrying too much about indices.
This snippet works with the example and could be tailored to handle corner cases.
import collections
with open("cpu_text", "r") as f:
    lines = f.readlines()
lines = [line.strip() for line in lines]
group_id = 0
group_member_id = 0
output_dict = collections.OrderedDict()
for line in lines:
    if line.find("SYS") > -1:
        group_id += 1
    elif line.find("Tot") > -1:
        group_member_id = 0
    else:
        group_member_id += 1
        key = "{0}-{1}".format(group_id, group_member_id)
        memory = line.split()[7]
        output_dict[key] = memory
print(output_dict)
Output:
OrderedDict([('1-1', '92'), ('1-2', '2'), ('2-1', '1'), ('3-1', '8'), ('3-2', '200'), ('3-3', '6')])
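The question starts from a string rather than a file and asks for integer values. A minimal sketch along the same lines, assuming the MEMORY value is always the eighth whitespace-separated field of a data row; the function name parse_memory and the shortened sample are made up for illustration:

```python
def parse_memory(text):
    """Map 'group-row' keys to the integer MEMORY column of each data row."""
    result = {}
    group_id = 0
    row_id = 0
    for line in text.splitlines():
        fields = line.split()
        if not fields:
            continue  # ignore blank lines
        if fields[0] == "SYS":
            # A header line starts a new group and restarts the row counter.
            group_id += 1
            row_id = 0
        elif fields[0] == "Tot":
            continue  # the totals line carries no per-row data
        else:
            row_id += 1
            # Assumption: MEMORY is the eighth field (index 7) of every data row.
            result["{0}-{1}".format(group_id, row_id)] = int(fields[7])
    return result


sample = """SYS SCFTYP METHOD NC NO NU NBS MEMORY CPU TIME
1 RHF CCSD 18 21 59 89 92 1.6163
2 RHF CCSD 4 7 22 36 2 0.0036
Tot 94 1.6199
SYS SCFTYP METHOD NC NO NU NBS MEMORY CPU TIME
1 RHF CCSD 4 4 14 19 1 0.0002
Tot 1 0.0002"""

parsed = parse_memory(sample)
print(parsed)
```

Matching on the first field rather than searching the whole line avoids misreading a data row that happens to contain "SYS" or "Tot" somewhere else.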

Related

How can I print all rows using python's guppy

I am using python's guppy in order to see heap usage in a python program. I do:
h = hpy()
hp = h.heap()
print hp
and this is the produced output:
Partition of a set of 339777 objects. Total size = 51680288 bytes.
Index Count % Size % Cumulative % Kind (class / dict of class)
0 137974 41 17732032 34 17732032 34 str
1 93077 27 8342072 16 26074104 50 tuple
2 992 0 3428864 7 29502968 57 dict of module
3 23606 7 3021568 6 32524536 63 types.CodeType
4 23577 7 2829240 5 35353776 68 function
5 2815 1 2541648 5 37895424 73 type
6 2815 1 2513128 5 40408552 78 dict of type
7 2112 1 2067840 4 42476392 82 dict (no owner)
8 4495 1 1729792 3 44206184 86 unicode
9 4026 1 671376 1 44877560 87 list
<972 more rows. Type e.g. '_.more' to view.>
How can I print all rows?

Use the all attribute, which shows every row:
import decimal
from guppy import hpy
d = {
    "int": 0,
    "float": 0.0,
    "dict": dict(),
    "set": set(),
    "tuple": tuple(),
    "list": list(),
    "str": "a",
    "unicode": u"a",
    "decimal": decimal.Decimal(0),
    "object": object(),
}
hp = hpy()
heap = hp.heap()
print(heap.all)

I used code taken from this model and with the documentation I could get clear in my own mind what the various entities are.
The end result is that the following prints out the report for the whole of the heap, the same sort of report that you only get 10 lines at a time of normally:
from guppy import hpy
h = hpy()
identity_set = h.heap()
stats = identity_set.stat
print()
print("Index Count Size Cumulative Size Object Name")
for row in stats.get_rows():
    print("%5d %5d %8d %8d %30s"%(row.index, row.count, row.size, row.cumulsize, row.name))

Python rapidly leaking memory when Celery retrieves results

The script that I've written to add tasks to my Celery queue is leaking memory (to the point where the kernel kills the process after 20 minutes). In this script, I'm just executing the same 300 tasks repeatedly, every 60 seconds (inside a while True:).
The parameters passed to the task, makeGroupRequest(), are dictionaries containing strings, and according to hpy and objgraph, dicts and strings are also what's growing uncontrollably in memory. I've included the outputs of hpy below on successive iterations of the loop.
I've spent days on this, and I can't understand why memory would grow uncontrollably, considering nothing is re-used between loops. If I skip the retrieval of tasks, the memory doesn't appear to leak (so it's really the .get() call that is leaking memory). How can I determine what's going on and how to stop the growth?
Here is an outline of the code that's executing. I'm using the rpc:// backend.
while True:
    # preparation is done here to set up the arguments for the tasks (processedChains)
    chains = []
    for processedChain in processedChains:
        # shorthanding
        supportingData = processedChain["supportingDataAndCheckedGroups"]
        # init the first element, which includes the supportingData and the first group
        argsList = [(supportingData, processedChain["groups"][0])]
        # add in the rest of the groups
        argsList.extend([(groupInChain,) for groupInChain in processedChain["groups"][1:]])
        # actually create the chain
        chain = celery.chain(*[makeGroupRequest.signature(params, options={'queue':queue}) for params in argsList])
        # add this to the list of chains
        chains.append(chain)
    groupSignature = celery.group(*chains).apply_async()
    # this line appears to cause a large increase in memory each cycle
    results = groupSignature.get(timeout = 2 * acceptableLoopTime)
    time.sleep(60)
Here is the output of hpy on successive runs:
Loop 2:
Partition of a set of 366560 objects. Total size = 57136824 bytes.
Index Count % Size % Cumulative % Kind (class / dict of class)
0 27065 7 17665112 31 17665112 31 dict (no owner)
1 122390 33 11966720 21 29631832 52 unicode
2 89133 24 8291952 15 37923784 66 str
3 45448 12 3802968 7 41726752 73 tuple
4 548 0 1631072 3 43357824 76 dict of module
5 11195 3 1432960 3 44790784 78 types.CodeType
6 9224 3 1343296 2 46134080 81 list
7 11123 3 1334760 2 47468840 83 function
8 1414 0 1274552 2 48743392 85 type
9 1414 0 1240336 2 49983728 87 dict of type
Loop 3:
Index Count % Size % Cumulative % Kind (class / dict of class)
0 44754 9 29240496 37 29240496 37 dict (no owner)
1 224883 44 20946280 26 50186776 63 unicode
2 89104 18 8290248 10 58477024 74 str
3 45455 9 3803288 5 62280312 79 tuple
4 14955 3 2149784 3 64430096 81 list
5 548 0 1631072 2 66061168 83 dict of module
6 11195 2 1432960 2 67494128 85 types.CodeType
7 11122 2 1334640 2 68828768 87 function
8 1402 0 1263704 2 70092472 88 type
9 1402 0 1236976 2 71329448 90 dict of type

Turns out this is a bug in Celery. Switching to the memcache backend completely resolves the memory leak. Hopefully the issue will be resolved in a subsequent version.

Simple way to get current memory usage from Guppy

tl;dr: how do I get the current memory usage of my python program using Guppy? Is there a simple command?
I'm trying to track memory usage in a python program using guppy. This is my first time using guppy, so I'm not sure how it behaves. What I want is to be able to plot the total usage as "time" progresses in a simulation. This is a basic bit of code for what I can do:
from guppy import hpy
import networkx as nx
h = hpy()
L=[1,2,3]
h.heap()
> Partition of a set of 89849 objects. Total size = 12530016 bytes.
> Index Count % Size % Cumulative % Kind (class / dict of class)
> 0 40337 45 3638400 29 3638400 29 str
> 1 21681 24 1874216 15 5512616 44 tuple
> 2 1435 2 1262344 10 6774960 54 dict (no owner)
But I would like to just know what the current size is (the 12530016 bytes). So I'd like to be able to call something like h.total() to get the total size. I'd be shocked if this doesn't exist as a simple command, but so far, looking through the documentation I haven't found it. It's probably documented, just not where I'm looking.

x = h.heap()
x.size
returns the total size. For example:
from guppy import hpy
import networkx as nx
h = hpy()
num_nodes = 1000
num_edges = 5000
G = nx.gnm_random_graph(num_nodes, num_edges)
x = h.heap()
print(x.size)
prints
19820968
which is consistent with the Total size reported by
print(x)
# Partition of a set of 118369 objects. Total size = 19820904 bytes.
# Index Count % Size % Cumulative % Kind (class / dict of class)
# 0 51057 43 6905536 35 6905536 35 str
# 1 7726 7 3683536 19 10589072 53 dict (no owner)
# 2 28416 24 2523064 13 13112136 66 tuple
# 3 516 0 1641312 8 14753448 74 dict of module
# 4 7446 6 953088 5 15706536 79 types.CodeType
# 5 6950 6 834000 4 16540536 83 function
# 6 584 0 628160 3 17168696 87 dict of type
# 7 584 0 523144 3 17691840 89 type
# 8 169 0 461696 2 18153536 92 unicode
# 9 174 0 181584 1 18335120 93 dict of class
# <235 more rows. Type e.g. '_.more' to view.>

How to read a file word by word

I have a PPM file that I need to do certain operations on. The file is structured as in the following example. The first line, 'P3', just says what kind of document it is. The second line gives the pixel dimensions of the image, so in this case it's telling us that the image is 480x640. The third line declares the maximum value any color can take. After that come lines of integers. Every group of three integers gives an RGB value for one pixel. So in this example, the first pixel has RGB value 49, 49, 49. The second pixel has RGB value 48, 48, 48, and so on.
P3
480 640
255
49 49 49 48 48 48 47 47 47 46 46 46 45 45 45 42 42 42 38 38
38 35 35 35 23 23 23 8 8 8 7 7 7 17 17 17 21 21 21 29 29
29 41 41 41 47 47 47 49 49 49 42 42 42 33 33 33 24 24 24 18 18
...
Now as you may notice, this particular picture is supposed to be 640 pixels wide which means 640*3 integers will provide the first row of pixels. But here the first row is very, very far from containing 640*3 integers. So the line-breaks in this file are meaningless, hence my problem.
The main way to read files in Python is line by line. But I need to collect these integers into groups of 640*3 and treat each group like a line. How would one do this? I know I could read the file in line by line and append every line to some list, but then that list would be massive and I assume that doing so would place an unacceptable burden on a device's memory. Other than that, I'm out of ideas. Help would be appreciated.

To read three space-separated words at a time from a file:
with open(filename, 'rb') as file:
    kind, dimensions, max_color = map(next, [file]*3) # read 3 lines
    rgbs = zip(*[(int(word) for line in file for word in line.split())] * 3)
Output
[(49, 49, 49),
(48, 48, 48),
(47, 47, 47),
(46, 46, 46),
(45, 45, 45),
(42, 42, 42),
...
See What is the most “pythonic” way to iterate over a list in chunks?
To avoid building the whole list at once in Python 2, use itertools.izip(), which yields one RGB value at a time; in Python 3 the built-in zip() is already lazy.
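The lazy variant can be sketched for Python 3 as follows. This is only an illustration, assuming a plain-text PPM with a three-line header and no comment lines; the function name iter_pixels and the in-memory sample are hypothetical.

```python
import io


def iter_pixels(stream):
    """Return a lazy iterator of (r, g, b) tuples from a plain-text PPM stream."""
    for _ in range(3):
        next(stream)  # skip the magic number, the dimensions and the max colour value
    words = (int(word) for line in stream for word in line.split())
    # Three references to the same iterator make zip() consume values in threes.
    return zip(words, words, words)


# Line breaks deliberately fall in the middle of a pixel.
sample = io.StringIO("P3\n2 2\n255\n49 49 49 48 48\n48 47 47 47 46\n46 46\n")
pixels = list(iter_pixels(sample))
print(pixels)
```

Because nothing is read until the iterator is consumed, the stream has to stay open while the pixels are being processed.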

Probably not the most 'pythonic' way but...
Iterate through the lines containing integers.
Keep four counts - a count of 3 - color_code_count, a count of 1920 - numbers_processed, a count - col (0-639), and another - rows (0-479).
For each integer you encounter, store it in a temporary list at index color_code_count. Increment color_code_count and numbers_processed.
Once color_code_count is 3, you take your temporary list and create a tuple 3 or triplet (not sure what the term is but your structure will look like (49,49,49) for the first pixel), and add that to a list of 640 columns, and 480 rows - insert your (49, 49, 49) into pixels[col][row].
Increment col.
Reset color_code_count.
'numbers_processed' will continue to increment until you get to 1920.
Once you hit 1920, you've reached the end of the first row.
Reset numbers_processed and col to zero, increment row by 1.
By this point, you should have 640 tuple3s or triplets in the row zero starting with (49,49,49), (48, 48, 48), (47, 47, 47), etc. And you're now starting to insert pixel values in row 1 column 0.
Like I said, probably not the most 'pythonic' way. There are probably better ways of doing this using join and map but I think this might work? This 'solution' if you want to call it that, shouldn't care about number of integers on any line since you're keeping count of how many numbers you expect to run through (1920) before you start a new row.
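The counting scheme described above can be sketched roughly like this. It is not the answerer's exact bookkeeping: it tracks the same state through list lengths instead of separate counters, and the name group_rows and the sample values are invented for the example.

```python
def group_rows(numbers, width):
    """Group a flat stream of colour values into rows of `width` (r, g, b) tuples."""
    pixel = []  # collects up to three colour values
    row = []    # collects up to `width` pixels
    for n in numbers:
        pixel.append(n)
        if len(pixel) == 3:
            # A full triplet is one pixel; start collecting the next one.
            row.append(tuple(pixel))
            pixel = []
            if len(row) == width:
                # A full row is complete; hand it over and start a new row.
                yield row
                row = []


flat = [49, 49, 49, 48, 48, 48, 47, 47, 47, 46, 46, 46]
rows = list(group_rows(flat, width=2))
print(rows)
```

Since the function never looks at line boundaries, it does not matter how many integers each line of the file holds, provided the numbers are fed in file order.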

A possible way to go through each word is to iterate over each line and then .split() it into words.
the_file = open("file.txt", "r")
for line in the_file:
    for word in line.split():
        pass  # your code here
From there you can do whatever you want with your "words". You can add an if-statement to check whether a line contains any digits: (though not very pythonic)
for line in the_file:
    if any(ch.isdigit() for ch in line):
        for word in line.split():
            pass  # your code here
Or you can test whether each word is non-empty: (much more pythonic)
for line in the_file:
    for word in line.split():
        if word:  # split() with no arguments never yields empty strings, so this always passes
            pass  # your code here
I would recommend adding each of your new "lines" to a new document.

I am a C programmer. Sorry if this code looks like C Style:
f = open("pixel.ppm", "r")
type = f.readline()
height, width = f.readline().split()
height, width = int(height), int(width)
max_color = int(f.readline())
colors = []
count = 0
col_count = 0
line = []
while(col_count < height):
    count = 0
    i = 0
    row =[]
    while(count < width * 3):
        temp = f.readline().strip()
        if(temp == ""):
            col_count = height
            break
        temp = temp.split()
        line.extend(temp)
        i = 0
        while(i + 2 < len(line)):
            row.append({'r':int(line[i]),'g':int(line[i+1]),'b':int(line[i+2])})
            i = i+3
            count = count +3
            if(count >= width *3):
                break
        if(i < len(line)):
            line = line[i:len(line)]
        else:
            line = []
    col_count += 1
    colors.append(row)
for row in colors:
    for rgb in row:
        print(rgb)
    print("\n")
You can tweak this according to your needs. I tested it on this file:
P4
3 4
256
4 5 6 4 7 3
2 7 9 4
2 4
6 8 0
3 4 5 6 7 8 9 0
2 3 5 6 7 9 2
2 4 5 7 2
2

This seems to do the trick:
from re import findall
def _split_list(lst, i):
    return lst[:i], lst[i:]
def iter_ppm_rows(path):
    with open(path) as f:
        ftype = f.readline().strip()
        h, w = (int(s) for s in f.readline().split(' '))
        maxcolor = int(f.readline())
        rlen = w * 3
        row = []
        next_row = []
        for line in f:
            # raw string, and no trailing \s+, so a final number without a newline still matches
            line_ints = [int(i) for i in findall(r'\d+', line)]
            if not row:
                row, next_row = _split_list(line_ints, rlen)
            else:
                rest_of_row, next_row = _split_list(line_ints, rlen - len(row))
                row += rest_of_row
            if len(row) == rlen:
                yield row
                row = next_row
                next_row = []
It isn't very pretty, but it allows for varying whitespace between numbers in the file, as well as varying line lengths.
I tested it on a file that looked like the following:
P3
120 160
255
0 1 2 3 4 5 6 7
8 9 10 11 12 13
14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34
[...]
9993 9994 9995 9996 9997 9998 9999
That file used random line lengths, but printed numbers in order so it was easy to tell at what value the rows began and stopped. Note that its dimensions are different than in the question's example file.
Using the following test code...
for row in iter_ppm_rows('mock_ppm.txt'):
print(len(row), row[0], row[-1])
...the result was the following, which seems to not be skipping over any data and returning rows of the right size.
480 0 479
480 480 959
480 960 1439
480 1440 1919
480 1920 2399
480 2400 2879
480 2880 3359
480 3360 3839
480 3840 4319
480 4320 4799
480 4800 5279
480 5280 5759
480 5760 6239
480 6240 6719
480 6720 7199
480 7200 7679
480 7680 8159
480 8160 8639
480 8640 9119
480 9120 9599
As can be seen, trailing data at the end of the file that can't represent a complete row was not yielded, which was expected but you'd likely want to account for it somehow.
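One way to account for an incomplete final row is to raise an error instead of silently dropping it. The following is a hedged sketch rather than a drop-in replacement for the function above: iter_rows is a hypothetical name, it takes an already-open stream positioned after the header, and it slices one flat stream of integers, so a single line may also span several rows.

```python
import io
import re
from itertools import islice


def iter_rows(stream, row_len):
    """Yield lists of `row_len` ints; raise ValueError if the data ends mid-row."""
    ints = (int(m) for line in stream for m in re.findall(r"\d+", line))
    while True:
        row = list(islice(ints, row_len))
        if not row:
            return  # clean end of data
        if len(row) < row_len:
            raise ValueError(
                "trailing data: got %d of %d values" % (len(row), row_len)
            )
        yield row


rows = list(iter_rows(io.StringIO("0 1 2\n3 4 5 6\n7\n"), 4))
print(rows)
```

Whether an error, a warning, or a padded row is appropriate depends on how the file was produced, so the ValueError here is only one possible policy.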

Python: How to write values to a csv file from another csv file

For index.csv file, its fourth column has ten numbers ranging from 1-5. Each number can be regarded as an index, and each index corresponds with an array of numbers in filename.csv.
The row number of filename.csv represents the index, and each row has three numbers. My question is about using a nesting loop to transfer the numbers in filename.csv to index.csv.
from numpy import genfromtxt
import numpy as np
import csv
import collections
data1 = genfromtxt('filename.csv', delimiter=',')
data2 = genfromtxt('index.csv', delimiter=',')
out = np.zeros((len(data2),len(data1)))
for row in data2:
    for ch_row in range(len(data1)):
        if (row[3] == ch_row + 1):
            out = row.tolist() + data1[ch_row].tolist()
            print(out)
writer = csv.writer(open('dn.csv','w'), delimiter=',',quoting=csv.QUOTE_ALL)
writer.writerow(out)
For example, the fourth column of index.csv contains 1,2,5,3,4,1,4,5,2,3 and filename.csv contains:
# filename.csv
20 30 50
70 60 45
35 26 77
93 37 68
13 08 55
What I need is to write the indexed row from filename.csv to index.csv and store these numbers in the 5th, 6th and 7th columns:
# index.csv
# 4 5 6 7
... 1 20 30 50
... 2 70 60 45
... 5 13 08 55
... 3 35 26 77
... 4 93 37 68
... 1 20 30 50
... 4 93 37 68
... 5 13 08 55
... 2 70 60 45
... 3 35 26 77
If I do "print(out)" inside the loop, the output is correct. However, when I enter "out" in the shell afterwards, only one row appears, like [1.0, 1.0, 1.0, 1.0, 20.0, 30.0, 50.0]
What I need is to store all the values in the "out" variable and write them to the dn.csv file.

This ought to do the trick for you:
Code:
from csv import reader, writer
data = list(reader(open("filename.csv", "r"), delimiter=" "))
out = writer(open("output.csv", "w"), delimiter=" ")
for row in reader(open("index.csv", "r"), delimiter=" "):
    out.writerow(row + data[int(row[3])])
index.csv:
0 0 0 1
0 0 0 2
0 0 0 3
filename.csv:
20 30 50
70 60 45
35 26 77
93 37 68
13 08 55
This produces the output:
0 0 0 1 70 60 45
0 0 0 2 35 26 77
0 0 0 3 93 37 68
Note: There's no need to use numpy here. The standard library csv module will do most of the work for you.
I also had to modify your sample datasets a bit, as what you showed had indexes out of bounds of the sample data in filename.csv.
Please also note that Python (like most languages) uses zero-based indexing, so you may have to adjust the above code to fit your needs exactly.
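If the index column is one-based, as in the question, the lookup only needs an offset of one. A minimal sketch under that assumption, using in-memory comma-separated samples instead of real files; append_indexed_rows and the sample data are made up for illustration:

```python
import csv
import io


def append_indexed_rows(index_rows, data_rows, index_col=3):
    """Append to each index row the data row named by its one-based index column."""
    combined = []
    for row in index_rows:
        idx = int(row[index_col])  # one-based index taken from the file
        combined.append(list(row) + list(data_rows[idx - 1]))
    return combined


data_rows = list(csv.reader(io.StringIO("20,30,50\n70,60,45\n35,26,77\n93,37,68\n13,08,55\n")))
index_rows = list(csv.reader(io.StringIO("0,0,0,1\n0,0,0,5\n")))

result = append_indexed_rows(index_rows, data_rows)

# Write every combined row, not just the last one.
buffer = io.StringIO()
csv.writer(buffer).writerows(result)
print(buffer.getvalue())
```

Collecting the rows in a list and writing them with writerows() keeps all of them, which is the part the code in the question was missing.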

with open('dn.csv','w') as f:
    writer = csv.writer(f, delimiter=',',quoting=csv.QUOTE_ALL)
    for row in data2:
        idx = int(row[3])  # genfromtxt returns floats, which cannot be used as indices
        out = [idx] + [x for x in data1[idx-1]]
        writer.writerow(out)
