I am trying to read a PCAP file in Python 2.7.10. The code is:
import dpkt
f = open('testbed-11jun.pcap')
pcap = dpkt.pcap.Reader(f)
for ts, buf in pcap:
    print ts, len(buf)
But I got this error:
1276225266.46 60
1276225266.72 60
1276225266.84 110
1276225266.84 110
1276225266.84 134
277171502.827 132
Traceback (most recent call last):
File "D:/UC subjects/MS Thesis/code/python/readpcap_dpkt.py", line 5, in
for ts, buf in pcap:
File "C:\Python27\lib\site-packages\dpkt\pcap.py", line 159, in iter
buf = self.__f.read(hdr.caplen)
MemoryError
So basically, after reading 6 traces from "testbed-11jun.pcap" it raised a MemoryError. The file is 2 GB and contains hundreds of traces, so the first 6 traces should amount to a few MB at most, yet I still got the error (my laptop has 6 GB of RAM).
Can anybody tell me how to read all of the traces without running into a memory error?
I realize that this question was asked a long time ago, but I thought I should still provide a couple of possible resolutions to the issue, as they may help others.
There could be a couple of reasons for this error:
1: The pcap file was opened for parsing as a text (ASCII) file instead of a binary file. Try opening the file explicitly with the "b" flag, i.e.
f = open('testbed-11jun.pcap','rb')
Note that not specifying a mode defaults to 'r', which is meant for reading text files, as per the Python documentation.
2: The format of the PCAP file cannot be fully parsed by dpkt. Note that there are multiple capture formats, e.g. libpcap and pcap-ng, both of which may carry the same .pcap extension. Ensure that the Wireshark dump was captured in a format dpkt understands. For example, if using Dumpcap, the following command line will produce a capture that dpkt can parse:
dumpcap.exe -P -i "Wireless Network Connection" -w input.pcap -a duration:10
The -P flag ensures that the capture is saved in the libpcap format rather than pcap-ng.
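Putting both points together, a minimal sketch of the loop from the question with the binary-mode fix applied (file name taken from the question):

import dpkt

# Open the capture in binary mode; dpkt expects raw bytes, not text.
with open('testbed-11jun.pcap', 'rb') as f:
    pcap = dpkt.pcap.Reader(f)
    # The Reader yields one (timestamp, raw_frame) pair at a time, so the
    # whole 2 GB capture is never held in memory at once.
    for ts, buf in pcap:
        print ts, len(buf)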
Related
I am reading a pcap file I acquired with tcpdump. The pcap file is ~500 MB. I read the file with FileCapture() and then I want to loop through each packet to extract the TLS payload. When I create the FileCapture object I also pass override_prefs={'tls.keylog_file': os.path.abspath('tlsKey')}, where tlsKey is the file with the master keys to decrypt the traffic. The decryption works just fine; I can extract all the information from each single packet. However, when I loop through the packets to extract some information, the loop stops working at the packet with packet.number = 258. My file contains more than 258 packets. What is going on?
My code
import pyshark
import os
cap = pyshark.FileCapture('traffic.pcap')
for packet in cap:
    print(packet.number)
    if "IP" in packet:
        print(packet)
print('Finished')
The last output I get is shown here. As you can see, the TLS layer does not get printed. Why?
Expected behavior
I would expect my script to print Finished at the end, but it doesn't. The for loop looks stuck. Since the pcap file is large I cannot attach it. Any explanation of what's happening?
Versions (please complete the following information):
OS: MacOS 13.1
pyshark version: 0.5.3
tshark version: TShark (Wireshark) 4.0.2 (v4.0.2-0-g415456d13370)
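For reference, a sketch of the setup described above written out in full; keep_packets=False and the explicit cap.close() are additions of mine (not from the original post) that are sometimes suggested for large captures, and may or may not be related to the hang:

import os
import pyshark

# Decrypt TLS with the key-log file from the question ('tlsKey' is the
# poster's file name); keep_packets=False asks pyshark not to keep every
# parsed packet in memory while iterating.
cap = pyshark.FileCapture(
    'traffic.pcap',
    override_prefs={'tls.keylog_file': os.path.abspath('tlsKey')},
    keep_packets=False,
)

for packet in cap:
    print(packet.number)
    if "IP" in packet:
        print(packet)

# Explicitly shut down the underlying tshark process when done.
cap.close()
print('Finished')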
I'm reading lines from a group of files (log files) following them as they are written using pyinotify.
I'm opening and reading the files with python native methods:
file = open(self.file_path, 'r')
# ... later
line = file.readline()
This is generally stable and can handle the file being deleted and re-created. pyinotify will notify the unlink and subsequent link.
However some log files are not being deleted. Instead they are being truncated and new content written to the beginning of the same file.
I'm having trouble reliably detecting when this has occurred, since pyinotify will simply report only a write. The only evidence I currently get is that pyinotify reports a write and readline() returns an empty string. BUT, it is possible that two subsequent writes could trigger the same behavior.
I have thought of comparing the file's size to file.tell(), but according to the documentation tell() produces an opaque number, and it appears this can't be trusted to be a number of bytes.
Is there a simple way to detect a file has been truncated while reading from it?
Edit:
Truncating a file can be simulated with simple shell commands:
echo hello > test.log
echo hello >> test.log
# Truncate test.log
echo goodbye > test.log
To complement this, a simple Python script can be used to confirm that file.tell() does not decrease when the file is truncated:
foo = open('./test.log', 'r')
line = foo.readline()
while line != '':
    print(foo.tell())
    print(line)
    line = foo.readline()
# Put a breakpoint on the following line and
# truncate the file before it executes
print(foo.tell())
Use os.lseek(file.fileno(),0,os.SEEK_CUR) to obtain a byte offset without moving the file pointer. You can’t really use the regular file interface to find out, not least because it may have buffered text (that no longer exists) that it hasn’t made visible to Python yet. If the file is not a byte stream (e.g., the default open in Python 3), it could even be in the middle of a multibyte character and be unable to proceed even if the file immediately grew back past your file offset.
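As a sketch of how that check might look in practice (appears_truncated is a made-up helper name; it assumes the plain file object from the question): a truncation shows up as the on-disk size dropping below the offset the descriptor has already reached.

import os

def appears_truncated(f):
    fd = f.fileno()
    # Byte offset of the underlying descriptor, without moving it.
    offset = os.lseek(fd, 0, os.SEEK_CUR)
    # Current on-disk size of the (possibly truncated) file.
    size = os.fstat(fd).st_size
    # Note: because of Python's read buffering, `offset` may be ahead of
    # what readline() has returned so far; a size below that point still
    # means the file was truncated.
    return size < offset

On detecting truncation you would typically reopen the file (or seek back to 0) and discard anything still sitting in the old buffer.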
I have a script that loops through a lot of pcap files. For each pcap file I need to read it and then write some information to a txt file. I'm using the rdpcap function from Scapy. Is there any way to close the pcap file once I'm done reading it? My script has a memory leak and I'm worried this may be the culprit (via leaving many pcap files essentially open).
Inspecting Scapy's source code reveals that the rdpcap function neglects to close the pcap file:
@conf.commands.register
def rdpcap(filename, count=-1):
    """Read a pcap file and return a packet list
    count: read only <count> packets"""
    return PcapReader(filename).read_all(count=count)
I suggest you implement your own version of this function as follows:
def rdpcap_and_close(filename, count=-1):
    """Read a pcap file, return a packet list and close the file
    count: read only <count> packets"""
    pcap_reader = PcapReader(filename)
    packets = pcap_reader.read_all(count=count)
    pcap_reader.close()
    return packets
I've created an issue for this problem here.
EDIT: The issue has been resolved in this changeset.
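For completeness, a sketch of the helper above used in the kind of loop described in the question; the directory name, output file, and the per-file information written are placeholders of mine:

import os
from scapy.all import PcapReader  # needed by rdpcap_and_close above

# rdpcap_and_close() is the replacement defined earlier in this answer.
pcap_dir = 'captures'                    # placeholder input directory
with open('summary.txt', 'w') as out:    # placeholder output txt file
    for name in sorted(os.listdir(pcap_dir)):
        if not name.endswith('.pcap'):
            continue
        packets = rdpcap_and_close(os.path.join(pcap_dir, name))
        # Write one line per capture; substitute whatever info you need.
        out.write('%s: %d packets\n' % (name, len(packets)))
        # `packets` is rebound on the next iteration, so the previous
        # capture's packet list can be garbage-collected.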
I'm having problems with some code that loops through a bunch of .csvs and deletes the final line if there's nothing in it (i.e. files that end with the \n newline character).
My code works successfully on all files except one, which is the largest file in the directory at 11gb. The second largest file is 4.5gb.
The line it fails on is simply:
with open(path_str,"r+") as my_file:
and I get the following message:
IOError: [Errno 22] invalid mode ('r+') or filename: 'F:\\Shapefiles\\ab_premium\\processed_csvs\\a.csv'
The path_str I create using os.path.join to avoid errors, and I tried renaming the file to a.csv just to make sure there wasn't anything odd going on with the filename. This made no difference.
Even more strangely, the file is happy to open in r mode. I.e. the following code works fine:
with open(path_str,"r") as my_file:
I have tried navigating around the file in read mode, and it's happy to read characters at the start, end, and in the middle of the file.
Does anyone know of any limits on the size of file that Python can deal with, or why I might be getting this error? I'm on Windows 7 64-bit and have 16 GB of RAM.
The default I/O stack in Python 2 is layered over CRT FILE streams. On Windows these are built on top of a POSIX emulation API that uses file descriptors (which in turn is layered over the user-mode Windows API, which is layered over the kernel-mode I/O system, which itself is a deeply layered system based on I/O request packets; the hardware is down there somewhere...). In the POSIX layer, opening a file in _O_RDWR | _O_TEXT mode (as in "r+") requires seeking to the end of the file to remove CTRL+Z, if it's present. Here's a quote from the CRT's fopen documentation:
Open in text (translated) mode. In this mode, CTRL+Z is interpreted as
an end-of-file character on input. In files opened for reading/writing
with "a+", fopen checks for a CTRL+Z at the end of the file and
removes it, if possible. This is done because using fseek and ftell to
move within a file that ends with a CTRL+Z, may cause fseek to behave
improperly near the end of the file.
The problem here is that the above check calls the 32-bit _lseek (bear in mind that sizeof long is 4 bytes on 64-bit Windows, unlike most other 64-bit platforms), instead of _lseeki64. Obviously this fails for an 11 GB file. Specifically, SetFilePointer fails because it gets called with a NULL value for lpDistanceToMoveHigh. Here's the return value and LastErrorValue for the latter call:
0:000> kc 2
Call Site
KERNELBASE!SetFilePointer
MSVCR90!lseek_nolock
0:000> r rax
rax=00000000ffffffff
0:000> dt _TEB #$teb LastErrorValue
ntdll!_TEB
+0x068 LastErrorValue : 0x57
The error code 0x57 is ERROR_INVALID_PARAMETER. This is referring to lpDistanceToMoveHigh being NULL when trying to seek from the end of a large file.
To work around this problem with CRT FILE streams, I recommend opening the file using io.open instead. This is a backported implementation of Python 3's I/O stack. It always opens files in raw binary mode (_O_BINARY), and it implements its own buffering and text-mode layers on top of the raw layer.
>>> import io
>>> f = io.open('a.csv', 'r+')
>>> f
<_io.TextIOWrapper name='a.csv' encoding='cp1252'>
>>> f.buffer
<_io.BufferedRandom name='a.csv'>
>>> f.buffer.raw
<_io.FileIO name='a.csv' mode='rb+'>
>>> import os
>>> f.seek(0, os.SEEK_END)
11811160064L
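Building on that, a minimal sketch of the original task (dropping the trailing newline) done entirely through io.open in binary mode, so the text-mode large-file seek never comes into play; strip_trailing_newline is a made-up helper name:

import io
import os

def strip_trailing_newline(path):
    # 'rb+' stays in the raw binary layer, avoiding the CRT text-mode
    # end-of-file check described above.
    with io.open(path, 'rb+') as f:
        size = f.seek(0, os.SEEK_END)
        if size == 0:
            return
        f.seek(size - 1)
        if f.read(1) == b'\n':
            # The file ends with an empty final line; drop the newline.
            f.truncate(size - 1)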
Is it possible to read a line from a gzip-compressed text file using Python without extracting the file completely? I have a text.gz file which is around 200 MB. When I extract it, it becomes 7.4 GB. And this is not the only file I have to read; for the total process, I have to read 10 files. Although this will be a sequential job, I think it would be smart to do it without extracting the whole thing. How can this be done using Python? I need to read the text file line by line.
Using gzip.GzipFile:
import gzip
with gzip.open('input.gz', 'rt') as f:
    for line in f:
        print('got line', line)
Note: gzip.open(filename, mode) is an alias for gzip.GzipFile(filename, mode).
I prefer the former, as it looks similar to with open(...) as f: used for opening uncompressed files.
You could use the standard gzip module in Python. Just use:
gzip.open('myfile.gz')
to open the file as any other file and read its lines.
More information here: Python gzip module
Have you tried using gzip.GzipFile? Arguments are similar to open.
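For example, a quick sketch; note that GzipFile itself only accepts binary modes, so it yields bytes lines (gzip.open(..., 'rt') is the text-mode convenience wrapper):

import gzip

# Decompression happens transparently as you iterate; nothing is ever
# extracted to disk.
with gzip.GzipFile('myfile.gz', 'rb') as f:
    for raw_line in f:
        line = raw_line.decode('utf-8')  # adjust the encoding as needed
        print(line, end='')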
The gzip library (obviously) uses gzip, which can be a bit slow. You can speed things up with a system call to pigz, the parallelized version of gzip. The downsides are you have to install pigz and it will take more cores during the run, but it is much faster and not more memory intensive. The call to the file then becomes os.popen('pigz -dc ' + filename) instead of gzip.open(filename,'rt'). The pigz flags are -d for decompress and -c for stdout output which can then be grabbed by os.popen.
The following code takes in a file and a number (1 or 2) and counts the number of lines in the file with the two different calls, while measuring the time the code takes. Defining the following code in unzip-file.py:
#!/usr/bin/python
import os
import sys
import time
import gzip
def local_unzip(obj):
    t0 = time.time()
    count = 0
    with obj as f:
        for line in f:
            count += 1
    print(time.time() - t0, count)

r = sys.argv[1]
if sys.argv[2] == "1":
    local_unzip(gzip.open(r, 'rt'))
else:
    local_unzip(os.popen('pigz -dc ' + r))
Calling these using /usr/bin/time -f %M (which reports the maximum resident set size of the process, in kilobytes) on a 28 GB file, we get:
$ /usr/bin/time -f %M ./unzip-file.py $file 1
(3037.2604110240936, 1223422024)
5116
$ /usr/bin/time -f %M ./unzip-file.py $file 2
(598.771901845932, 1223422024)
4996
This shows that the system call is about five times faster (10 minutes compared to 50 minutes) while using essentially the same maximum memory. It is also worth noting that, depending on what you are doing per line, reading the file might not be the limiting factor, in which case it does not matter which option you take.