Numpy count rows in one column in DataFrame without copying - python

source_pos = get_list_index_or_none(col_list, SOURCE_COLUMN)
# The line below makes a copy of the column and allocates
# additional memory, which is not ideal
source_only = filtered_df.iloc[:, source_pos:source_pos + 1].iloc[:, 0]
rows_with_many_words_in_source = (
    source_only.astype(str).str.count(' ') + 1 > SOURCE_WORDS_LIMIT
)
I want to count the number of rows that contain a string consisting of multiple words (i.e. containing a space) in a specific column. How can I do that without allocating memory for a pd.Series, as the example code does?
This is memory usage before that line:
2022-07-29 12:27:56,974 - DEBUG - Total memory usage: 102.21875 MB
2022-07-29 12:27:57,539 - DEBUG - Total objects allocated: 438113
types | # objects | total size
======================================= | =========== | ============
pandas.core.frame.DataFrame | 2 | 24.61 MB
str | 144600 | 23.60 MB
dict | 40957 | 15.36 MB
code | 36800 | 6.26 MB
type | 5342 | 4.92 MB
list | 31955 | 3.04 MB
tuple | 34692 | 2.01 MB
set | 2222 | 1.42 MB
numpy.ndarray | 86 | 752.26 KB
weakref | 9220 | 648.28 KB
collections.OrderedDict | 269 | 459.09 KB
abc.ABCMeta | 421 | 453.01 KB
openpyxl.descriptors.MetaSerialisable | 424 | 440.56 KB
int | 14551 | 426.82 KB
builtin_function_or_method | 5516 | 387.84 KB
This is memory usage after:
2022-07-29 12:28:10,295 - DEBUG - Total memory usage: 111.296875 MB
2022-07-29 12:28:10,793 - DEBUG - Total objects allocated: 439022
types | # objects | total size
======================================= | =========== | ============
pandas.core.series.Series | 48 | 24.93 MB
pandas.core.frame.DataFrame | 2 | 24.61 MB
str | 144605 | 23.60 MB
dict | 41326 | 15.42 MB
code | 36800 | 6.26 MB
type | 5342 | 4.92 MB
list | 31998 | 3.04 MB
tuple | 34785 | 2.02 MB
set | 2223 | 1.42 MB
numpy.ndarray | 133 | 757.03 KB
weakref | 9269 | 651.73 KB
collections.OrderedDict | 269 | 459.09 KB
abc.ABCMeta | 421 | 453.01 KB
openpyxl.descriptors.MetaSerialisable | 424 | 440.56 KB
int | 14597 | 428.08 KB
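One way to avoid the intermediate Series is to work on the column's underlying numpy array and count lazily with a generator. This is a minimal sketch, assuming the column holds plain Python strings; filtered_df, source_pos and SOURCE_WORDS_LIMIT are the names from the question:
# .to_numpy() returns the column's underlying object array without
# copying where possible; the generator keeps one boolean at a time
# instead of materialising a full boolean Series.
values = filtered_df.iloc[:, source_pos].to_numpy()
count = sum(1 for v in values
            if isinstance(v, str) and v.count(' ') + 1 > SOURCE_WORDS_LIMIT)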

Related

How to make sure PyTorch has deallocated GPU memory?

Say we have a function like this:
def trn_l(totall_lc, totall_lw, totall_li, totall_lr):
    self.model_large.cuda()
    self.model_large.train()
    self.optimizer_large.zero_grad()
    for fb in range(self.fake_batch):
        val_x, val_y = next(self.valid_loader)
        val_x, val_y = val_x.cuda(), val_y.cuda()
        logits_main, emsemble_logits_main = self.model_large(val_x)
        cel = self.criterion(logits_main, val_y)
        loss_weight = cel / (self.fake_batch)
        loss_weight.backward(retain_graph=False)
        cel = cel.cpu().detach()
        emsemble_logits_main = emsemble_logits_main.cpu().detach()
        totall_lw += float(loss_weight.item())
        val_x = val_x.cpu().detach()
        val_y = val_y.cpu().detach()
        loss_weight = loss_weight.cpu().detach()
    self._clip_grad_norm(self.model_large)
    self.optimizer_large.step()
    self.model_large.train(mode=False)
    self.model_large = self.model_large.cpu()
    return totall_lc, totall_lw, totall_li, totall_lr
On the first call, it allocates 8 GB of GPU memory. On the next call, no new memory gets allocated, yet the 8 GB are still occupied. After the function has been called and produced its first result, I want the allocated GPU memory to be 0, or as low as possible.
What I have tried: setting retain_graph=False and calling .cpu().detach() everywhere - no positive effect.
Memory snapshot before
|===========================================================================|
| PyTorch CUDA memory summary, device ID 0 |
|---------------------------------------------------------------------------|
| CUDA OOMs: 0 | cudaMalloc retries: 0 |
|===========================================================================|
| Metric | Cur Usage | Peak Usage | Tot Alloc | Tot Freed |
|---------------------------------------------------------------------------|
| Allocated memory | 33100 KB | 33219 KB | 40555 KB | 7455 KB |
| from large pool | 3072 KB | 3072 KB | 3072 KB | 0 KB |
| from small pool | 30028 KB | 30147 KB | 37483 KB | 7455 KB |
|---------------------------------------------------------------------------|
| Active memory | 33100 KB | 33219 KB | 40555 KB | 7455 KB |
| from large pool | 3072 KB | 3072 KB | 3072 KB | 0 KB |
| from small pool | 30028 KB | 30147 KB | 37483 KB | 7455 KB |
|---------------------------------------------------------------------------|
| GPU reserved memory | 51200 KB | 51200 KB | 51200 KB | 0 B |
| from large pool | 20480 KB | 20480 KB | 20480 KB | 0 B |
| from small pool | 30720 KB | 30720 KB | 30720 KB | 0 B |
|---------------------------------------------------------------------------|
| Non-releasable memory | 18100 KB | 20926 KB | 56892 KB | 38792 KB |
| from large pool | 17408 KB | 18944 KB | 18944 KB | 1536 KB |
| from small pool | 692 KB | 2047 KB | 37948 KB | 37256 KB |
|---------------------------------------------------------------------------|
| Allocations | 12281 | 12414 | 12912 | 631 |
| from large pool | 2 | 2 | 2 | 0 |
| from small pool | 12279 | 12412 | 12910 | 631 |
|---------------------------------------------------------------------------|
| Active allocs | 12281 | 12414 | 12912 | 631 |
| from large pool | 2 | 2 | 2 | 0 |
| from small pool | 12279 | 12412 | 12910 | 631 |
|---------------------------------------------------------------------------|
| GPU reserved segments | 16 | 16 | 16 | 0 |
| from large pool | 1 | 1 | 1 | 0 |
| from small pool | 15 | 15 | 15 | 0 |
|---------------------------------------------------------------------------|
| Non-releasable allocs | 3 | 30 | 262 | 259 |
| from large pool | 1 | 1 | 1 | 0 |
| from small pool | 2 | 29 | 261 | 259 |
|===========================================================================|
After calling the function and running
torch.cuda.empty_cache()
torch.cuda.synchronize()
we get:
|===========================================================================|
| PyTorch CUDA memory summary, device ID 0 |
|---------------------------------------------------------------------------|
| CUDA OOMs: 0 | cudaMalloc retries: 0 |
|===========================================================================|
| Metric | Cur Usage | Peak Usage | Tot Alloc | Tot Freed |
|---------------------------------------------------------------------------|
| Allocated memory | 10957 KB | 8626 MB | 272815 MB | 272804 MB |
| from large pool | 0 KB | 8596 MB | 272477 MB | 272477 MB |
| from small pool | 10957 KB | 33 MB | 337 MB | 327 MB |
|---------------------------------------------------------------------------|
| Active memory | 10957 KB | 8626 MB | 272815 MB | 272804 MB |
| from large pool | 0 KB | 8596 MB | 272477 MB | 272477 MB |
| from small pool | 10957 KB | 33 MB | 337 MB | 327 MB |
|---------------------------------------------------------------------------|
| GPU reserved memory | 8818 MB | 9906 MB | 19618 MB | 10800 MB |
| from large pool | 8784 MB | 9874 MB | 19584 MB | 10800 MB |
| from small pool | 34 MB | 34 MB | 34 MB | 0 MB |
|---------------------------------------------------------------------------|
| Non-releasable memory | 5427 KB | 3850 MB | 207855 MB | 207850 MB |
| from large pool | 0 KB | 3850 MB | 207494 MB | 207494 MB |
| from small pool | 5427 KB | 5 MB | 360 MB | 355 MB |
|---------------------------------------------------------------------------|
| Allocations | 3853 | 13391 | 34339 | 30486 |
| from large pool | 0 | 557 | 12392 | 12392 |
| from small pool | 3853 | 12838 | 21947 | 18094 |
|---------------------------------------------------------------------------|
| Active allocs | 3853 | 13391 | 34339 | 30486 |
| from large pool | 0 | 557 | 12392 | 12392 |
| from small pool | 3853 | 12838 | 21947 | 18094 |
|---------------------------------------------------------------------------|
| GPU reserved segments | 226 | 226 | 410 | 184 |
| from large pool | 209 | 209 | 393 | 184 |
| from small pool | 17 | 17 | 17 | 0 |
|---------------------------------------------------------------------------|
| Non-releasable allocs | 46 | 358 | 12284 | 12238 |
| from large pool | 0 | 212 | 7845 | 7845 |
| from small pool | 46 | 279 | 4439 | 4393 |
|===========================================================================|
I don't think the other answer is correct. Allocation and deallocation definitely happen during runtime; the thing to note is that the CPU code runs asynchronously from the GPU code, so you need to wait for any deallocation to happen if you want to reserve more memory afterwards. Take a look at this:
import torch
a = torch.zeros(100,100,100).cuda()
print(torch.cuda.memory_allocated())
del a
torch.cuda.synchronize()
print(torch.cuda.memory_allocated())
Outputs
4000256
0
So you should del the tensors you don't need and call torch.cuda.synchronize() to make sure that the deallocation goes through before your CPU code continues to run.
In your specific case, after your function trn_l returns, any variables that were local to that function, and do not have references elsewhere, will be deallocated along with the corresponding GPU tensors. All you need to do is wait for this to happen by calling torch.cuda.synchronize() after the function call.
So, PyTorch does not allocate and deallocate GPU memory during training.
From https://pytorch.org/docs/stable/notes/faq.html#my-gpu-memory-isn-t-freed-properly:
PyTorch uses a caching memory allocator to speed up memory allocations. As a result, the values shown in nvidia-smi usually don’t reflect the true memory usage. See Memory management for more details about GPU memory management.
If your GPU memory isn’t freed even after Python quits, it is very likely that some Python subprocesses are still alive. You may find them via ps -elf | grep python and manually kill them with kill -9 [pid].
You can call torch.cuda.empty_cache() to free all unused memory (however, that is not really good practice, as memory re-allocation is time-consuming). Docs of empty_cache(): https://pytorch.org/docs/stable/cuda.html#torch.cuda.empty_cache
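Putting the two answers together, the cleanup pattern after the question's function would look roughly like this. A sketch only; trn_l is the question's function and the zero arguments are placeholders:
import torch

result = trn_l(0.0, 0.0, 0.0, 0.0)    # placeholder arguments
del result                            # drop the last CPU-side references
torch.cuda.synchronize()              # wait for pending GPU deallocations
torch.cuda.empty_cache()              # hand cached blocks back to the driver
print(torch.cuda.memory_allocated())  # bytes held by live tensors
print(torch.cuda.memory_reserved())   # bytes still cached by the allocator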

Creating a pandas dataframe produces huge series

I am using the SummaryTracker from pympler.tracker. I am trying to figure out why my script is killed even though the dataframes in my code should not exceed approx. 0.4 GB, as I have a matrix of shape 18125×35232 (if that number is not correct I can change that). While creating this dataframe, pandas Series are present in RAM, which I don't really understand.
Now I am using a tracker to track down which objects make my script stop and this is the result before adding the df:
types | # objects | total size
======================================== | =========== | ============
pandas.core.frame.DataFrame | 2 | 79.57 MB
pandas.core.indexes.base.Index | 3 | 1.60 MB
list | 14620 | 1.35 MB
str | 17814 | 1.25 MB
pandas._libs.index.ObjectEngine | 2 | 640.20 KB
numpy.ndarray | 17 | 145.17 KB
set | 3 | 132.66 KB
int | 3316 | 90.70 KB
pandas.core.series.Series | 1 | 20.15 KB
pandas.core.indexes.numeric.Int64Index | 1 | 20.02 KB
weakref | 101 | 7.89 KB
dict | 22 | 5.85 KB
type | 0 | 1.82 KB
pandas._libs.internals.BlockPlacement | 4 | 320 B
code | 2 | 288 B
After running this code:
import pandas as pd

data = pd.read_csv('../test_data/test.csv',
                   sep=';', encoding=data_encoding)
data = data.pivot_table(index='index_name',
                        columns='col_name',
                        values='col_2')
The result looks like this:
types | # objects | total size
=================================================== | =========== | ============
pandas.core.frame.DataFrame | 3 | 2.92 GB
pandas.core.series.Series | 723 | 1.15 GB
list | 29950 | 2.74 MB
str | 33151 | 2.33 MB
pandas.core.indexes.base.Index | 3 | 1.60 MB
pandas.core.indexes.numeric.Int64Index | 3 | 966.09 KB
numpy.ndarray | 745 | 706.70 KB
pandas._libs.index.ObjectEngine | 3 | 660.23 KB
dict | 1494 | 329.34 KB
int | 7374 | 201.73 KB
set | 3 | 137.16 KB
pandas.core.internals.managers.SingleBlockManager | 723 | 84.73 KB
tuple | 1346 | 76.71 KB
pandas._libs.internals.BlockPlacement | 727 | 56.80 KB
pandas.core.internals.blocks.FloatBlock | 725 | 56.64 KB
I get that I will need more space for the dataframes, but I do not create Series in this code, and I am wondering why 1.15 GB is blocked by pandas.Series objects. Is this a memory leak, and if so, how can I fix it? If not, can I somehow optimize my code so this does not happen?
Or is this normal behaviour of pandas when creating dataframes? I concat two dataframes after this and it gets even worse: 5.69 GB blocked by pandas.Series.
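One thing worth trying is shrinking the value dtype up front and using pivot instead of pivot_table when no aggregation is needed, since pivot_table goes through groupby and can create many intermediate Series. A minimal sketch, assuming each (index, column) pair occurs only once in the CSV; the column names are the question's placeholders:
import numpy as np
import pandas as pd

data = pd.read_csv('../test_data/test.csv', sep=';',
                   encoding=data_encoding,
                   dtype={'col_2': np.float32})   # halves the value storage
# pivot raises if an (index, column) pair is duplicated, but it skips
# pivot_table's aggregation step and its intermediate objects.
data = data.pivot(index='index_name', columns='col_name', values='col_2')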

Python memory leak when using numpy arrays in classes

I'm writing a Python library/package for controlling an oscilloscope. One of the functions downloads the traces from the oscilloscope. The data is sent in binary and converted by numpy into floats. In a test script it will be used in an (almost) infinite loop to save the recorded data (the loop ends when no space is left on the device). But before it can run out of disk space, the Python script runs out of memory.
Here is a broken-down version of my library, which suffers from the same problem of hoarding memory without releasing it, even though it is told to free it:
import time
import gc
import numpy
from pympler import muppy, summary

class Measurement:
    data = None
    def __del__(self):
        del self.data
        print("Deleting Measurement")

class Oscilloscope:
    def GetData(self):
        # Simulating the acquisition of the data by reading it in from a file
        p = open(r"rawdata.bin", mode="rb")
        rawdata = p.read()
        mes = Measurement()
        mes.data = numpy.array(numpy.frombuffer(bytes(rawdata), dtype='B'), dtype=float)
        p.close()
        return mes
    def __del__(self):
        print("Deleting Oscilloscope")

if __name__ == "__main__":
    myosci = Oscilloscope()
    for i in range(0, 10):
        var = myosci.GetData()
        data = var.data
        # In the real script something will be done with the data
        print("Saving and evaluating Traces")
        time.sleep(0.5)
        del var
        gc.collect()
        print("Collected everything")
        # Show memory usage
        all_objects = muppy.get_objects()
        sum1 = summary.summarize(all_objects)
        # Prints out a summary of the large objects
        summary.print_(sum1)
    print("Done")
When executed the output looks like this:
Saving and evaluating Traces
Deleting Measurement
Collected everything
types | # objects | total size
============================ | =========== | ============
numpy.ndarray | 40 | 9.16 MB
str | 18144 | 3.39 MB
dict | 4685 | 2.21 MB
code | 6556 | 925.30 KB
type | 1060 | 868.78 KB
tuple | 4209 | 272.90 KB
wrapper_descriptor | 2209 | 172.58 KB
set | 139 | 131.41 KB
weakref | 1339 | 104.61 KB
method_descriptor | 1402 | 98.58 KB
list | 396 | 88.62 KB
builtin_function_or_method | 1165 | 81.91 KB
abc.ABCMeta | 84 | 81.50 KB
getset_descriptor | 1050 | 73.83 KB
int | 1994 | 60.12 KB
Saving and evaluating Traces
Deleting Measurement
Collected everything
types | # objects | total size
============================ | =========== | ============
numpy.ndarray | 41 | 18.31 MB
str | 22639 | 3.70 MB
dict | 4685 | 2.21 MB
list | 4895 | 959.05 KB
code | 6556 | 925.30 KB
type | 1060 | 868.78 KB
tuple | 4212 | 273.09 KB
wrapper_descriptor | 2209 | 172.58 KB
set | 139 | 131.41 KB
weakref | 1339 | 104.61 KB
method_descriptor | 1402 | 98.58 KB
int | 2940 | 85.98 KB
builtin_function_or_method | 1165 | 81.91 KB
abc.ABCMeta | 84 | 81.50 KB
getset_descriptor | 1050 | 73.83 KB
Saving and evaluating Traces
Deleting Measurement
Collected everything
types | # objects | total size
============================ | =========== | ============
numpy.ndarray | 42 | 27.47 MB
str | 27134 | 4.00 MB
dict | 4685 | 2.21 MB
list | 9392 | 1.90 MB
code | 6556 | 925.30 KB
type | 1060 | 868.78 KB
tuple | 4212 | 273.09 KB
wrapper_descriptor | 2209 | 172.58 KB
set | 139 | 131.41 KB
int | 3879 | 111.66 KB
weakref | 1339 | 104.61 KB
method_descriptor | 1402 | 98.58 KB
builtin_function_or_method | 1165 | 81.91 KB
abc.ABCMeta | 84 | 81.50 KB
getset_descriptor | 1050 | 73.83 KB
Saving and evaluating Traces
Deleting Measurement
Collected everything
types | # objects | total size
============================ | =========== | ============
numpy.ndarray | 43 | 36.63 MB
str | 31629 | 4.31 MB
list | 13889 | 2.94 MB
dict | 4685 | 2.21 MB
code | 6556 | 925.30 KB
type | 1060 | 868.78 KB
tuple | 4212 | 273.09 KB
wrapper_descriptor | 2209 | 172.58 KB
int | 4818 | 137.34 KB
set | 139 | 131.41 KB
weakref | 1339 | 104.61 KB
method_descriptor | 1402 | 98.58 KB
builtin_function_or_method | 1165 | 81.91 KB
abc.ABCMeta | 84 | 81.50 KB
getset_descriptor | 1050 | 73.83 KB
Saving and evaluating Traces
Deleting Measurement
Collected everything
types | # objects | total size
============================ | =========== | ============
numpy.ndarray | 44 | 45.78 MB
str | 36124 | 4.62 MB
list | 18386 | 4.05 MB
dict | 4685 | 2.21 MB
code | 6556 | 925.30 KB
type | 1060 | 868.78 KB
tuple | 4212 | 273.09 KB
wrapper_descriptor | 2209 | 172.58 KB
int | 5757 | 163.01 KB
set | 139 | 131.41 KB
weakref | 1339 | 104.61 KB
method_descriptor | 1402 | 98.58 KB
builtin_function_or_method | 1165 | 81.91 KB
abc.ABCMeta | 84 | 81.50 KB
getset_descriptor | 1050 | 73.83 KB
Saving and evaluating Traces
Deleting Measurement
Collected everything
types | # objects | total size
============================ | =========== | ============
numpy.ndarray | 45 | 54.94 MB
list | 22883 | 5.25 MB
str | 40619 | 4.92 MB
dict | 4685 | 2.21 MB
code | 6556 | 925.30 KB
type | 1060 | 868.78 KB
tuple | 4212 | 273.09 KB
int | 6696 | 188.69 KB
wrapper_descriptor | 2209 | 172.58 KB
set | 139 | 131.41 KB
weakref | 1339 | 104.61 KB
method_descriptor | 1402 | 98.58 KB
builtin_function_or_method | 1165 | 81.91 KB
abc.ABCMeta | 84 | 81.50 KB
getset_descriptor | 1050 | 73.83 KB
Saving and evaluating Traces
Deleting Measurement
Collected everything
types | # objects | total size
============================ | =========== | ============
numpy.ndarray | 46 | 64.09 MB
list | 27380 | 6.55 MB
str | 45114 | 5.23 MB
dict | 4685 | 2.21 MB
code | 6556 | 925.30 KB
type | 1060 | 868.78 KB
tuple | 4212 | 273.09 KB
int | 7635 | 214.36 KB
wrapper_descriptor | 2209 | 172.58 KB
set | 139 | 131.41 KB
weakref | 1339 | 104.61 KB
method_descriptor | 1402 | 98.58 KB
builtin_function_or_method | 1165 | 81.91 KB
abc.ABCMeta | 84 | 81.50 KB
getset_descriptor | 1050 | 73.83 KB
Saving and evaluating Traces
Deleting Measurement
Collected everything
types | # objects | total size
============================ | =========== | ============
numpy.ndarray | 47 | 73.25 MB
list | 31877 | 7.85 MB
str | 49609 | 5.54 MB
dict | 4685 | 2.21 MB
code | 6556 | 925.30 KB
type | 1060 | 868.78 KB
tuple | 4212 | 273.09 KB
int | 8574 | 240.04 KB
wrapper_descriptor | 2209 | 172.58 KB
set | 139 | 131.41 KB
weakref | 1339 | 104.61 KB
method_descriptor | 1402 | 98.58 KB
builtin_function_or_method | 1165 | 81.91 KB
abc.ABCMeta | 84 | 81.50 KB
getset_descriptor | 1050 | 73.83 KB
Saving and evaluating Traces
Deleting Measurement
Collected everything
types | # objects | total size
============================ | =========== | ============
numpy.ndarray | 48 | 82.40 MB
list | 36374 | 9.26 MB
str | 54104 | 5.84 MB
dict | 4685 | 2.21 MB
code | 6556 | 925.30 KB
type | 1060 | 868.78 KB
tuple | 4212 | 273.09 KB
int | 9513 | 265.71 KB
wrapper_descriptor | 2209 | 172.58 KB
set | 139 | 131.41 KB
weakref | 1339 | 104.61 KB
method_descriptor | 1402 | 98.58 KB
builtin_function_or_method | 1165 | 81.91 KB
abc.ABCMeta | 84 | 81.50 KB
getset_descriptor | 1050 | 73.83 KB
Saving and evaluating Traces
Deleting Measurement
Collected everything
types | # objects | total size
============================ | =========== | ============
numpy.ndarray | 49 | 91.56 MB
list | 40871 | 10.79 MB
str | 58599 | 6.15 MB
dict | 4685 | 2.21 MB
code | 6556 | 925.30 KB
type | 1060 | 868.78 KB
int | 10452 | 291.39 KB
tuple | 4212 | 273.09 KB
wrapper_descriptor | 2209 | 172.58 KB
set | 139 | 131.41 KB
weakref | 1339 | 104.61 KB
method_descriptor | 1402 | 98.58 KB
builtin_function_or_method | 1165 | 81.91 KB
abc.ABCMeta | 84 | 81.50 KB
getset_descriptor | 1050 | 73.83 KB
Done
Deleting Oscilloscope
Process finished with exit code 0
The memory is hoarded even though the numpy arrays receive a delete command and the garbage collector is triggered. In the real script, memory usage jumps by around 150 MB with every step. The Raspberry Pi running my library runs out of memory pretty quickly.
If you want to test it yourself, here is the rawdata.bin file used:
https://mega.nz/#!ps4ACAJD!0cqTJJmMU5RjSx5BqggM1afz47PqcR67hzKKnI8LgTs
I don't understand why this memory is leaking, but I found you can clear it up by adding a
del data
where you currently have
del var
I'll update this if I work out what's actually going on here; it's clearly something to do with the lifetime of (multiple copies of) data. But this should get you up and running.
(fwiw you can simplify the test case by doing something like mes.data = numpy.zeros((100,100)) and it still leaks)
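For reference, the fixed loop body would look roughly like this. A sketch only; the point is that the data name otherwise keeps each ndarray alive across iterations:
for i in range(0, 10):
    var = myosci.GetData()
    data = var.data
    # ... work with the data here ...
    del data    # drop the reference that was keeping the array alive
    del var
    gc.collect()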

How to print the line of a file after matching an exact string pattern with Python?

I have a list
list = ['plutino?','res 2:11','Uranus L4','res 9:19','damocloid','cubewano?','plutino']
I want to search for every element of the list in a file with the following format, and print the line when there is a match:
1995QY9 | 1995_QY9 | plutino | 32929 | | 39.445 | 0.260 | 29.193 | 49.696 | 4.8 | 66 | # 0.400 | 1.21 BR-U | ?
1997CU29 | 1997_CU29 | cubewano | 33001 | | 43.534 | 0.039 | 41.815 | 45.253 | 1.5 | 243 | | 1.82 RR |
1998BU48 | 1998_BU48 | Centaur | 33128 | | 33.363 | 0.381 | 20.647 | 46.078 | 14.2 | 213 | # 0.052 | 1.59 RR | ?
1998VG44 | 1998_VG44 | plutino | 33340 | | 39.170 | 0.250 | 29.367 | 48.974 | 3.0 | 398 | # 0.028 | 1.51 IR |
1998SN165 | 1998_SN165 | inner classic | 35671 | | 37.742 | 0.041 | 36.189 | 39.295 | 4.6 | 393 | # 0.060 | 1.13 BB |
2000VU2 | 2000_VU2 | unusual | 37117 | Narcissus | 6.878 | 0.554 | 3.071 | 10.685 | 13.8 | 11 | # 0.088 | |
1999HX11 | 1999_HX11 | plutino? | 38083 | Rhadamanthus | 39.220 | 0.151 | 33.295 | 45.144 | 12.7 | 168 | | 1.18 BR |
1999HB12 | 1999_HB12 | res 2:5 | 38084 | | 56.376 | 0.422 | 32.566 | 80.187 | 13.1 | 176 | | 1.39 BR-IR |
I am using the following code to do that:
import re

for i in list:
    with open("tnolist.txt") as f:
        for line in f:
            if re.search(i, line):
                print(line)
The code works fine for every element except plutino. When the variable i is plutino, the code prints lines for both plutino and plutino?.
This happens because plutino is a substring of plutino?, so the regex engine matches the first part of plutino? and returns a match. Without a whole lot of additional work, you should be able to fix the problem with re.search(i + r'\s', line), which requires a whitespace character after the phrase you're searching for. As the file gets longer and more complicated, you might have more such exceptions to make the regex behave as desired.
Update: I also like visual regex editors for reasons like this. They make it easy to see what matches and what doesn't.
Another option would be something like i==line.split('|')[2].strip() which extracts the portion of your file you seem to care about. The .strip() method can become inefficient on long lines, but this might fit your use case.
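Combining the two suggestions, an exact-match version that reads the file once could look like the sketch below, assuming the classification is always the third pipe-separated field, as in the sample above:
targets = {'plutino?', 'res 2:11', 'Uranus L4', 'res 9:19',
           'damocloid', 'cubewano?', 'plutino'}
with open("tnolist.txt") as f:
    for line in f:
        fields = line.split('|')
        # Field 3 holds the classification; comparing it exactly means
        # 'plutino' no longer matches 'plutino?'.
        if len(fields) > 2 and fields[2].strip() in targets:
            print(line)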

python pyramid garbage collection

While investigating a memory leak issue, I saw these results. As I am using the Pyramid web framework, what is the best way to clean up these objects?
types | # objects | total size
=================================================== | =========== | ============
dict | 957482 | 560.44 MB
unicode | 509861 | 158.82 MB
tuple | 1169055 | 86.54 MB
str | 317844 | 50.17 MB
list | 335534 | 33.62 MB
set | 19873 | 4.52 MB
<class 'psycopg2.Column | 177121 | 18.92 MB
<class 'sqlalchemy.util._collections.PopulateDict | 18229 | 16.68 MB
<class 'sqlalchemy.sql.elements._anonymous_label | 120168 | 15.64 MB
<class 'sqlalchemy.sql.compiler._CompileLabel | 176992 | 13.50 MB
<class 'sqlalchemy.sql.elements.BindParameter | 119670 | 7.30 MB
<class 'sqlalchemy.sql.elements.ClauseList | 90530 | 5.53 MB
<class 'sqlalchemy.sql.elements.Grouping | 74269 | 4.53 MB
psycopg2._psycopg.cursor | 18292 | 4.33 MB
<class 'sqlalchemy.sql.elements.BinaryExpression | 67464 | 4.12 MB
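To narrow down where these objects are allocated, one option is pympler's SummaryTracker (the same tool used in the question above), diffing snapshots around a single request. A minimal sketch; handle_request is a hypothetical stand-in for whatever drives one Pyramid request:
from pympler import tracker

tr = tracker.SummaryTracker()
handle_request()   # hypothetical: run one request through the app
tr.print_diff()    # prints objects allocated since the previous snapshot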
