I'm writing a userspace driver for accessing FPGA registers in Python 3.5 that mmaps the FPGA's PCI address space, obtains a memoryview to provide direct access to the memory-mapped register space, and then uses struct.pack_into("<I", ...) to write a 32-bit value into the selected 32-bit aligned address.
import mmap
import pathlib
import struct

def write_u32(address, data):
    assert address % 4 == 0, "Address must be 32-bit aligned"
    path = pathlib.Path("/dev/uio0")
    file_size = path.stat().st_size
    with path.open(mode='w+b') as f:
        mv = memoryview(mmap.mmap(f.fileno(), file_size))
        struct.pack_into("<I", mv, address, data)
Unfortunately, it appears that struct.pack_into does a memset(buf, 0, ...) that clears the register before the actual value is written. By examining write operations within the FPGA, I can see that the register is set to 0x00000000 before the true value is set, so there are at least two writes across the PCI bus (in fact, for 32-bit access there are three: two zero writes, then the actual data; 64-bit access involves six writes). This causes side effects with registers that count the number of write operations, that "clear on write", or that trigger some event when written.
I'd like to use an alternative method to write the register data in a single write to the memory-mapped register space. I've looked into ctypes.memmove and it looks promising (not yet working), but I'm wondering if there are other ways to do this.
Note that a register read using struct.unpack_from works perfectly.
Note that I've also eliminated the FPGA from this by using a QEMU driver that logs all accesses - I see the same double zero-write access before data is written.
I revisited this in 2022 and the situation hasn't really changed. If you're considering using memoryview to write blocks of data at once, you may find this interesting.
Perhaps this would work as needed?
mv[address:address+4] = struct.pack("<I", data)
Update:
As seen from the comments, the code above does not solve the problem. The following variation of it does, however:
mv_as_int = mv.cast('I')
mv_as_int[address // 4] = data
Unfortunately, precise understanding of what happens under the hood and why exactly memoryview behaves this way is beyond the capabilities of modern technology and will thus stay open for the researchers of the future to tackle.
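A minimal sketch of the cast-based variant, assuming an anonymous mmap as a stand-in for the device mapping (the real driver would map /dev/uio0 as in the question). The helper name write_u32 follows the question; the 4096-byte region and the test value are illustrative. The 'I' format uses the host's native byte order and size, and the sketch only demonstrates the index arithmetic and the single item assignment; it cannot show what actually appears on the PCI bus.

```python
import mmap
import struct


def write_u32(mv_u32, address, data):
    """Store one 32-bit value through a memoryview that was cast to 'I'."""
    if address % 4 != 0:
        raise ValueError("Address must be 32-bit aligned")
    # Integer division: a float index would raise TypeError in Python 3.
    mv_u32[address // 4] = data


# Anonymous mapping used here in place of the device file (assumption).
region = mmap.mmap(-1, 4096)
regs = memoryview(region).cast('I')

write_u32(regs, 8, 0xDEADBEEF)

# Read the same location back through struct, in native byte order.
readback = struct.unpack_from("=I", region, 8)[0]
print(hex(readback))  # prints 0xdeadbeef
```

Creating the cast view once and reusing it for every access keeps each register write down to a single item assignment in Python code.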
You could try something like this:
import mmap
import os

import numpy as np

class RegisterAccess:  # the class name is not given in the original snippet; placeholder
    def __init__(self, offset, size=0x10000):
        self.offset = offset
        self.size = size
        mmap_file = os.open('/dev/mem', os.O_RDWR | os.O_SYNC)
        mem = mmap.mmap(mmap_file, self.size,
                        mmap.MAP_SHARED,
                        mmap.PROT_READ | mmap.PROT_WRITE,
                        offset=self.offset)
        os.close(mmap_file)
        # View the mapping as an array of 32-bit words
        self.array = np.frombuffer(mem, np.uint32, self.size >> 2)

    def wread(self, address):
        idx = address >> 2
        return_val = int(self.array[idx])
        return return_val

    def wwrite(self, address, data):
        idx = address >> 2
        self.array[idx] = np.uint32(data)
To compare performance of Spark when using Python and Scala I created the same job in both languages and compared the runtime. I expected both jobs to take roughly the same amount of time, but Python job took only 27min, while Scala job took 37min (almost 40% longer!). I implemented the same job in Java as well and it took 37minutes too. How is this possible that Python is so much faster?
Minimal verifiable example:
Python job:
# Configuration
conf = pyspark.SparkConf()
conf.set("spark.hadoop.fs.s3a.aws.credentials.provider", "org.apache.hadoop.fs.s3a.AnonymousAWSCredentialsProvider")
conf.set("spark.executor.instances", "4")
conf.set("spark.executor.cores", "8")
sc = pyspark.SparkContext(conf=conf)
# 960 Files from a public dataset in 2 batches
input_files = "s3a://commoncrawl/crawl-data/CC-MAIN-2019-35/segments/1566027312025.20/warc/CC-MAIN-20190817203056-20190817225056-00[0-5]*"
input_files2 = "s3a://commoncrawl/crawl-data/CC-MAIN-2019-35/segments/1566027312128.3/warc/CC-MAIN-20190817102624-20190817124624-00[0-3]*"
# Count occurrences of a certain string
logData = sc.textFile(input_files)
logData2 = sc.textFile(input_files2)
a = logData.filter(lambda value: value.startswith('WARC-Type: response')).count()
b = logData2.filter(lambda value: value.startswith('WARC-Type: response')).count()
print(a, b)
Scala job:
// Configuration
config.set("spark.executor.instances", "4")
config.set("spark.executor.cores", "8")
val sc = new SparkContext(config)
sc.setLogLevel("WARN")
sc.hadoopConfiguration.set("fs.s3a.aws.credentials.provider", "org.apache.hadoop.fs.s3a.AnonymousAWSCredentialsProvider")
// 960 Files from a public dataset in 2 batches
val input_files = "s3a://commoncrawl/crawl-data/CC-MAIN-2019-35/segments/1566027312025.20/warc/CC-MAIN-20190817203056-20190817225056-00[0-5]*"
val input_files2 = "s3a://commoncrawl/crawl-data/CC-MAIN-2019-35/segments/1566027312128.3/warc/CC-MAIN-20190817102624-20190817124624-00[0-3]*"
// Count occurrences of a certain string
val logData1 = sc.textFile(input_files)
val logData2 = sc.textFile(input_files2)
val num1 = logData1.filter(line => line.startsWith("WARC-Type: response")).count()
val num2 = logData2.filter(line => line.startsWith("WARC-Type: response")).count()
println(s"Lines with a: $num1, Lines with b: $num2")
Just by looking at the code, they seem to be identical. I looked at the DAGs and they didn't provide any insights (or at least I lack the know-how to come up with an explanation based on them).
I would really appreciate any pointers.
Your basic assumption, that Scala or Java should be faster for this specific task, is just incorrect. You can easily verify it with minimal local applications. Scala one:
import scala.io.Source
import java.time.{Duration, Instant}

object App {
  def main(args: Array[String]) {
    val Array(filename, string) = args
    val start = Instant.now()
    Source
      .fromFile(filename)
      .getLines
      .filter(line => line.startsWith(string))
      .length
    val stop = Instant.now()
    val duration = Duration.between(start, stop).toMillis
    println(s"${start},${stop},${duration}")
  }
}
Python one
import datetime
import sys

if __name__ == "__main__":
    _, filename, string = sys.argv
    start = datetime.datetime.now()
    with open(filename) as fr:
        # Not idiomatic or the most efficient but that's what
        # PySpark will use
        sum(1 for _ in filter(lambda line: line.startswith(string), fr))
    end = datetime.datetime.now()
    duration = round((end - start).total_seconds() * 1000)
    print(f"{start},{end},{duration}")
Results in milliseconds (300 repetitions each, Python 3.7.6, Scala 2.11.12), on Posts.xml from the hermeneutics.stackexchange.com data dump with a mix of matching and non-matching patterns:
Python 273.50 (258.84, 288.16)
Scala 634.13 (533.81, 734.45)
As you see Python is not only systematically faster, but also is more consistent (lower spread).
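A sketch of how repeated timings like these could be collected and summarized. The source does not say how its intervals were computed, so the normal-approximation interval with z = 1.96 is an assumption, and the helper names time_runs and mean_with_interval are hypothetical.

```python
import math
import statistics
import time


def time_runs(func, repetitions):
    """Call func repeatedly and return each wall-clock duration in milliseconds."""
    durations = []
    for _ in range(repetitions):
        start = time.perf_counter()
        func()
        durations.append((time.perf_counter() - start) * 1000.0)
    return durations


def mean_with_interval(samples, z=1.96):
    """Return (mean, lower, upper) using a normal approximation (assumption)."""
    mean = statistics.mean(samples)
    half_width = z * statistics.stdev(samples) / math.sqrt(len(samples))
    return mean, mean - half_width, mean + half_width


if __name__ == "__main__":
    samples = time_runs(lambda: sum(range(100000)), 30)
    mean, low, high = mean_with_interval(samples)
    print("{:.2f} ({:.2f}, {:.2f})".format(mean, low, high))
```

Reporting an interval next to the mean makes it possible to see whether two sets of timings overlap, rather than comparing two single numbers.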
The takeaway is: don't believe unsubstantiated FUD. Languages can be faster or slower on specific tasks or in specific environments (for example, here Scala can be hit by JVM startup and/or GC and/or JIT), but if you see claims like "XYZ is X4 faster" or "XYZ is slow as compared to ZYX (..) Approximately, 10x slower", it usually means that someone wrote really bad code to test things.
Edit:
To address some concerns raised in the comments:
In the OP's code, data is passed mostly in one direction (JVM -> Python) and no real serialization is required (this specific path just passes the bytestring as-is and decodes it as UTF-8 on the other side). That's as cheap as it gets when it comes to "serialization".
What is passed back is just a single integer per partition, so in that direction the impact is negligible.
Communication is done over local sockets (all communication on the worker beyond the initial connect and auth is performed using the file descriptor returned from local_connect_and_auth, and it's nothing other than a socket-associated file). Again, as cheap as it gets when it comes to communication between processes.
Considering the difference in raw performance shown above (much higher than what you see in your program), there is a lot of margin for the overheads listed above.
This case is completely different from cases where either simple or complex objects have to be passed to and from Python interpreter in a form that is accessible to both parties as pickle-compatible dumps (most notable examples include old-style UDF, some parts of old-style MLLib).
Edit 2:
Since jasper-m was concerned about startup cost here, one can easily prove that Python has still significant advantage over Scala even if input size is significantly increased.
Here are results for 2003360 lines / 5.6G (the same input, just duplicated multiple times, 30 repetitions), which way exceeds anything you can expect in a single Spark task.
Python 22809.57 (21466.26, 24152.87)
Scala 27315.28 (24367.24, 30263.31)
Please note non-overlapping confidence intervals.
Edit 3:
To address another comment from Jasper-M:
The bulk of all the processing is still happening inside a JVM in the Spark case.
That is simply incorrect in this particular case:
The job in question is a map job with a single global reduce using PySpark RDDs.
PySpark RDDs (unlike, let's say, DataFrame) implement the bulk of their functionality natively in Python, with the exception of input, output, and inter-node communication.
Since it is a single-stage job, and the final output is small enough to be ignored, the main responsibility of the JVM (if one was to nitpick, this is implemented mostly in Java, not Scala) is to invoke the Hadoop input format and push data through a socket file to Python.
The read part is identical for JVM and Python API, so it can be considered as constant overhead. It also doesn't qualify as the bulk of the processing, even for such simple job like this one.
The Scala job takes longer because it is misconfigured, so the Python and Scala jobs were provided with unequal resources.
There are two mistakes in the code:
val sc = new SparkContext(config) // LINE #1
sc.setLogLevel("WARN")
sc.hadoopConfiguration.set("fs.s3a.aws.credentials.provider", "org.apache.hadoop.fs.s3a.AnonymousAWSCredentialsProvider")
sc.hadoopConfiguration.set("spark.executor.instances", "4") // LINE #4
sc.hadoopConfiguration.set("spark.executor.cores", "8") // LINE #5
LINE 1. Once the line has been executed, the resource configuration of the Spark job is already established and fixed. From this point on, there is no way to adjust anything: neither the number of executors nor the number of cores per executor.
LINE 4-5. sc.hadoopConfiguration is the wrong place to set any Spark configuration. It should be set in the config instance you pass to new SparkContext(config).
[ADDED]
Bearing the above in mind, I would propose to change the code of the Scala job to
config.set("spark.executor.instances", "4")
config.set("spark.executor.cores", "8")
val sc = new SparkContext(config) // LINE #1
sc.setLogLevel("WARN")
sc.hadoopConfiguration.set("fs.s3a.aws.credentials.provider", "org.apache.hadoop.fs.s3a.AnonymousAWSCredentialsProvider")
and re-test it. I bet the Scala version is going to be X times faster now.
I recently used a program called Cheat Engine, and what it does is you specify which process you want to work with from a list of currently running processes (usually used for pc gaming), and after you've got one picked out, you can run a search for existing values and keep track of changes made to those values via sequential scans in your memory to help trim your list of results down to a single memory address for the relevant thing you're trying to work with.
Once you've found the memory address and its value you were looking for, you can edit it and return that back to the program which I found to be very interesting. I want to learn how to do low level work like this, and I read this question here which had an answer I thought would put me off to an excellent start:
from ctypes import *
from ctypes.wintypes import *
OpenProcess = windll.kernel32.OpenProcess
ReadProcessMemory = windll.kernel32.ReadProcessMemory
CloseHandle = windll.kernel32.CloseHandle
PROCESS_ALL_ACCESS = 0x1F0FFF
pid = 4044 # I assume you have this from somewhere.
address = 0x1000000 # Likewise; for illustration I'll get the .exe header.
buffer = c_char_p("The data goes here")
bufferSize = len(buffer.value)
bytesRead = c_ulong(0)
processHandle = OpenProcess(PROCESS_ALL_ACCESS, False, pid)
if ReadProcessMemory(processHandle, address, buffer, bufferSize, byref(bytesRead)):
    print "Success:", buffer
else:
    print "Failed."
CloseHandle(processHandle)
It's basic enough for me to get a good grasp of what they're doing, but in their code, they use a hardcoded memory address. A specific memory address is closer to my end point, not my starting point.
What I'd like to do is be able to pass a process ID to my script, have it read all the memory associated with it (cheat engine's memory scan will range from 0000000000000000-7fffffffffffffff), and I'd like to make changes in whichever program is associated with the process, then run queries for the value matching up with the changes made until I've singled it down to an address. Then I'd like to be able to write over that value and pass that back to the program.
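The narrowing step described above can be sketched independently of how the memory is obtained. This example works on plain bytes snapshots instead of another process's memory; the function names first_scan and next_scan, the 4-byte little-endian integer format, and the sample values are all illustrative assumptions.

```python
import struct


def first_scan(snapshot, value, fmt="<i"):
    """Return every offset in snapshot whose bytes encode value."""
    needle = struct.pack(fmt, value)
    offsets = []
    pos = snapshot.find(needle)
    while pos != -1:
        offsets.append(pos)
        pos = snapshot.find(needle, pos + 1)
    return offsets


def next_scan(snapshot, candidates, value, fmt="<i"):
    """Keep only the candidate offsets that now hold the new value."""
    needle = struct.pack(fmt, value)
    size = len(needle)
    return [off for off in candidates if snapshot[off:off + size] == needle]


# Two fake snapshots: the value 100 appears twice, then one copy drops to 95.
before = bytearray(64)
struct.pack_into("<i", before, 8, 100)
struct.pack_into("<i", before, 40, 100)
after = bytearray(before)
struct.pack_into("<i", after, 8, 95)

candidates = first_scan(bytes(before), 100)
print(candidates)                               # [8, 40]
print(next_scan(bytes(after), candidates, 95))  # [8]
```

Each rescan only inspects the surviving candidates, which is why the list shrinks quickly once the value changes a few times.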
From what I've been reading around, it looks like ctypes and kernel32 are the way to go if I'd like to do this in python, but I'm stuck on how to dynamically retrieve memory addresses associated with a PID and can't branch out from this basic script until I've figured it out. How would one go about finding these?
I've attached a screenshot of what this looks like in Cheat Engine in case my explanation wasn't clear enough.
--edit--
I've also read through the ctypes doc and though they had different examples where kernel32 was being used, they didn't have anything that explained how to work with it. One hint I saw on the page was
"To find out the correct calling convention you have to look into the C header file or the documentation for the function you want to call."
I'm having trouble finding this information. What would one typically type in when looking for these?
Is it possible to print the memory address of my environment variable?
With gdb-peda I get a memory address like 0xbffffcd6 using searchmem, and I know it has the right form (0xbfff????), but gdb moved the stack with some other environment variables.
I would like my Python script to get this address, then do my trick and include my shellcode.
I tried (with Python):
print hex(id(os.environ["ENVVAR"]))
print memoryview(os.environ["ENVVAR"])
# output :
# 0xb7b205c0L
# <memory at 0xb7b4dd9c>
With Ruby :
puts (ENV['PATH'].object_id << 1).to_s(16)
# output :
# -4836c38c
If anyone has an idea, with Python or Ruby.
The CPython built-in function id() returns a unique id for any object. In CPython this id is the address of the Python object itself, which is not the address of the underlying C string in the process environment.
For example, we have variable x. id(x) does not return the memory address of the variable x; rather, it returns the memory address of the object that x points to.
There's a strict separation between 'variables' and 'memory objects'. In the standard implementation, python allocates a set of locals and a stack for the virtual machine to operate on. All local slots are disjoint, so if you load an object from local slot x onto the stack and modify that object, the "location" of the x slot doesn't change.
http://docs.python.org/library/functions.html#id
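A small illustration of the point that id() identifies the object rather than the variable name. The variable names are arbitrary; the comparison results hold for any Python implementation while all three objects are alive.

```python
values = [1, 2, 3]
alias = values            # second name bound to the same list object
duplicate = list(values)  # a distinct list object with equal contents

print(id(alias) == id(values))      # True: one object, two names
print(id(duplicate) == id(values))  # False: two objects alive at once

values.append(4)
print(alias)      # [1, 2, 3, 4]
print(duplicate)  # [1, 2, 3]
```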
I suppose you could do that using the ctypes module to call the native getenv directly :
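import ctypes
libc = ctypes.CDLL("libc.so.6")
getenv = libc.getenv
getenv.argtypes = [ctypes.c_char_p]
getenv.restype = ctypes.c_void_p
# Pass bytes: under Python 3 a str would not be passed as a C char string
print('%08x' % getenv(b'PATH'))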
This seems an impossible task at least in python.
There are a few things to take into consideration for this question:
ASLR would make this completely impossible
Every binary can have its own overhead and a different argv, so the only reliable option is to execute the binary and trace its memory until we find the environment variable we are looking for. Basically, even if we can find the environment address in the Python process, it would be at a different position in the binary you are trying to exploit.
The best fit to answer this question is to use http://python3-pwntools.readthedocs.io/en/latest/elf.html, which can take a coredump file in which it's easy to find the address.
Please keep in mind that a system environment variable is not an object you can access by its memory address. Each process, like the Python or Ruby process running your script, will receive its own copy of the environment. That's why the results returned by the Python and Ruby interpreters are so different.
If you would like to modify a system environment variable, you should use the API provided by your programming language.
Please see this or that post for Python solution.
Thanks to @mickael9, I have written a function to calculate the address of an environment variable in a program:
def getEnvAddr(envName, ELFfile):
    import ctypes
    libc = ctypes.CDLL('libc.so.6')
    getenv = libc.getenv
    getenv.restype = ctypes.c_voidp
    ptr = getenv(envName)
    ptr += (len('/usr/bin/python') - len(ELFfile)) * 2
    return ptr
For example:
user@host:~$ ./getenvaddr.elf PATH /bin/ls
PATH will be at 0xbfffff22 in /bin/ls
user@host:~$ python getenvaddr.py PATH /bin/ls
PATH will be at 0xbfffff22 in /bin/ls
user@host:~$
Note: This function only works on Linux systems.
The getenv() function is inherently not reentrant because it returns a value pointing to static data.
In fact, for higher performance of getenv(), the implementation could also maintain a separate copy of the environment in a data structure that could be searched much more quickly (such as an indexed hash table, or a binary tree), and update both it and the linear list at environ when setenv() or unsetenv() is invoked.
So the address returned by getenv is not necessarily from the environment.
Process memory layout (two figures; sources: duartes.org, cloudfront.net)
Memory map
import os

def mem_map():
    path_hex = hex(id(os.getenv('PATH'))).rstrip('L')
    path_address = int(path_hex, 16)
    for line in open('/proc/self/maps'):
        if 'stack' in line:
            line = line.split()
            first, second = line[0].split('-')
            first, second = int(first, 16), int(second, 16)
            #stack grows towards lower memory address
            start, end = max(first, second), min(first, second)
            print('stack:\n\tstart:\t0x{}\n\tend:\t0x{}\n\tsize:\t{}'.format(start, end, start - end))
            if path_address in range(end, start+1):
                print('\tgetenv("PATH") ({}) is in the stack'.format(path_hex))
            else:
                print('\tgetenv("PATH") ({}) is not in the stack'.format(path_hex))
                if path_address > start:
                    print('\tgetenv("PATH") ({}) is above the stack'.format(path_hex))
                else:
                    print('\tgetenv("PATH") ({}) is not above the stack'.format(path_hex))
            print('')
            continue
        if 'heap' in line:
            line = line.split()
            first, second = line[0].split('-')
            first, second = int(first, 16), int(second, 16)
            #heap grows towards higher memory address
            start, end = min(first, second), max(first, second)
            print('heap:\n\tstart:\t0x{}\n\tend:\t0x{}\n\tsize:\t{}'.format(start, end, end - start))
            if path_address in range(start, end+1):
                print('\tgetenv("PATH") ({}) in the heap'.format(path_hex))
            else:
                print('\tgetenv("PATH") ({}) is not in the heap'.format(path_hex))
            print('')
Output;
heap:
start: 0x170364928
end: 0x170930176
size: 565248
getenv("PATH") (0xb74d2330) is not in the heap
stack:
start: 0x0xbffa8000L
end: 0x0xbff86000L
size: 139264
getenv("PATH") (0xb74d2330) is not in the stack
getenv("PATH") (0xb74d2330) is not above the stack
The environment is above the stack, so its address should be higher than the stack's. But the address that id shows is not in the stack, not in the heap, and not above the stack. Is it really an address, or is my calculation wrong?
Here's the code to check where an object lies in memory.
def where_in_mem(obj):
    maps = {}
    for line in open('/proc/self/maps'):
        line = line.split()
        start, end = line[0].split('-')
        key = line[-1] if line[-1] != '0' else 'anonymous'
        maps.setdefault(key, []).append((int(start, 16), int(end, 16)))
    for key, pair in maps.items():
        for start, end in pair:
            # stack starts at higher memory address and grows towards lower memory address
            if 'stack' in key:
                if start >= id(obj) >= end:
                    print('Object "{}" ({}) in the range {} - {}, mapped to {}'.format(obj, hex(id(obj)), hex(start), hex(end), key))
                continue
            if start <= id(obj) <= end:
                print('Object "{}" ({}) in the range {} - {}, mapped to {}'.format(obj, hex(id(obj)), hex(start), hex(end), key))

where_in_mem(1)
where_in_mem(os.getenv('PATH'))
Output;
Object "1" (0xa17f8b0) in the range 0xa173000 - 0xa1fd000, mapped to [heap]
Object "/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/usr/local/games" (0xb74a1330L) in the range 0xb7414000L - 0xb74d6000L, mapped to anonymous
What's anonymous in the above output?
It is also possible to create an anonymous memory mapping that does not correspond to any files, being used instead for program data. In Linux, if you request a large block of memory via malloc(), the C library will create such an anonymous mapping instead of using heap memory. ‘Large’ means larger than MMAP_THRESHOLD bytes, 128 kB by default and adjustable via mallopt().
Anatomy of a Program in Memory
So the os.environ['PATH'] is in the malloced region.
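The lookup above can also be sketched as two small functions that work on the text of /proc/self/maps, which makes them easy to try without a live process. The function names parse_maps and find_region are hypothetical. The address ranges in the sample are taken from the output above, while the permission, offset, and device columns are made up for illustration.

```python
def parse_maps(lines):
    """Parse /proc/<pid>/maps style lines into (start, end, name) tuples."""
    regions = []
    for line in lines:
        fields = line.split()
        if not fields:
            continue
        start, end = (int(part, 16) for part in fields[0].split('-'))
        # Mappings without a backing path have only five fields.
        name = ' '.join(fields[5:]) if len(fields) > 5 else 'anonymous'
        regions.append((start, end, name))
    return regions


def find_region(regions, address):
    """Return the name of the region containing address, or None."""
    for start, end, name in regions:
        if start <= address < end:
            return name
    return None


sample = [
    "0a173000-0a1fd000 rw-p 00000000 00:00 0          [heap]",
    "b7414000-b74d6000 rw-p 00000000 00:00 0",
    "bff86000-bffa8000 rw-p 00000000 00:00 0          [stack]",
]
regions = parse_maps(sample)
print(find_region(regions, 0x0a17f8b0))  # [heap]
print(find_region(regions, 0xb74a1330))  # anonymous
```

On Linux the same functions could be fed open('/proc/self/maps') directly, since a file object iterates over its lines.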
In ruby it's possible - this post covers the general case:
Accessing objects memory address in ruby..? "You can get the actual pointer value of an object by taking the object id, and doing a bitwise shift to the left"
puts (ENV['RAILS_ENV'].object_id << 1).to_s(16)
> 7f84598a8d58
Copying a File using a straight-forward approach in Python is typically like this:
def copyfileobj(fsrc, fdst, length=16*1024):
    """copy data from file-like object fsrc to file-like object fdst"""
    while 1:
        buf = fsrc.read(length)
        if not buf:
            break
        fdst.write(buf)
(This code snippet is from shutil.py, by the way).
Unfortunately, this has drawbacks in my special use-case (involving threading and very large buffers) [Italics part added later]. First, it means that with each call of read() a new memory chunk is allocated and when buf is overwritten in the next iteration this memory is freed, only to allocate new memory again for the same purpose. This can slow down the whole process and put unnecessary load on the host.
To avoid this I'm using the file.readinto() method which, unfortunately, is documented as deprecated and "don't use":
import array

def copyfileobj(fsrc, fdst, length=16*1024):
    """copy data from file-like object fsrc to file-like object fdst"""
    buffer = array.array('c')
    buffer.fromstring('-' * length)
    while True:
        count = fsrc.readinto(buffer)
        if count == 0:
            break
        if count != len(buffer):
            # Last, partial block: write only the bytes actually read
            fdst.write(buffer.tostring()[:count])
        else:
            buffer.tofile(fdst)
My solution works, but there are two drawbacks as well: First, readinto() is not to be used. It might go away (says the documentation). Second, with readinto() I cannot decide how many bytes I want to read into the buffer and with buffer.tofile() I cannot decide how many I want to write, hence the cumbersome special case for the last block (which also is unnecessarily expensive).
I've looked at array.array.fromfile(), but it cannot be used to read "all there is" (reads, then throws EOFError and doesn't hand out the number of processed items). Also it is no solution for the ending special-case problem.
Is there a proper way to do what I want to do? Maybe I'm just overlooking a simple buffer class or similar which does what I want.
This code snippet is from shutil.py
Which is a standard library module. Why not just use it?
First, it means that with each call of read() a new memory chunk is allocated and when buf is overwritten in the next iteration this memory is freed, only to allocate new memory again for the same purpose. This can slow down the whole process and put unnecessary load on the host.
This is tiny compared to the effort required to actually grab a page of data from disk.
Normal Python code would not need tweaks such as this. However, if you really need all that performance tuning to read files from inside Python code (as in, you are rewriting some server code that you wrote and that already works, for performance or memory usage), I'd rather call the OS directly using ctypes, thus having the copy performed at as low a level as I want.
It may even be that simply calling the "cp" executable as an external process is less of a hurdle in your case (and it would take full advantage of all OS and filesystem level optimizations for you).
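For the buffer-reuse concern raised in the question, here is a sketch that assumes Python 3, where readinto() is a regular method of binary file objects in the io module. One bytearray is allocated up front, and a memoryview slice writes only the bytes actually read, so the last block needs no special case. The function name copyfileobj_reuse is hypothetical.

```python
import io


def copyfileobj_reuse(fsrc, fdst, length=16 * 1024):
    """Copy fsrc to fdst through one reusable buffer; return bytes copied."""
    buffer = bytearray(length)
    view = memoryview(buffer)
    total = 0
    while True:
        count = fsrc.readinto(buffer)
        if not count:  # 0 at end of file
            break
        # Slicing a memoryview does not copy the underlying bytes.
        fdst.write(view[:count])
        total += count
    return total


if __name__ == "__main__":
    source = io.BytesIO(b"x" * 40000 + b"tail")
    target = io.BytesIO()
    print(copyfileobj_reuse(source, target))  # 40004
```

The same function should work with files opened in binary mode, for example open(path, 'rb') and open(path, 'wb').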
I have a tab-separated data file with a little over 2 million lines and 19 columns.
You can find it, in US.zip: http://download.geonames.org/export/dump/.
I started to run the following but with for l in f.readlines(). I understand that just iterating over the file is supposed to be more efficient so I'm posting that below. Still, with this small optimization, I'm using 30% of my memory on the process and have only done about 6.5% of the records. It looks like, at this pace, it will run out of memory like it did before. Also, the function I have is very slow. Is there anything obvious I can do to speed it up? Would it help to del the objects with each pass of the for loop?
def run():
    from geonames.models import POI
    f = file('data/US.txt')
    for l in f:
        li = l.split('\t')
        try:
            p = POI()
            p.geonameid = li[0]
            p.name = li[1]
            p.asciiname = li[2]
            p.alternatenames = li[3]
            p.point = "POINT(%s %s)" % (li[5], li[4])
            p.feature_class = li[6]
            p.feature_code = li[7]
            p.country_code = li[8]
            p.ccs2 = li[9]
            p.admin1_code = li[10]
            p.admin2_code = li[11]
            p.admin3_code = li[12]
            p.admin4_code = li[13]
            p.population = li[14]
            p.elevation = li[15]
            p.gtopo30 = li[16]
            p.timezone = li[17]
            p.modification_date = li[18]
            p.save()
        except IndexError:
            pass

if __name__ == "__main__":
    run()
EDIT, More details (the apparently important ones):
The memory consumption is going up as the script runs and saves more lines.
The method, .save() is an adulterated django model method with unique_slug snippet that is writing to a postgreSQL/postgis db.
SOLVED: DEBUG database logging in Django eats memory.
Make sure that Django's DEBUG setting is set to False
This looks perfectly fine to me. Iterating over the file like that or using xreadlines() will read each line as needed (with sane buffering behind the scenes). Memory usage should not grow as you read in more and more data.
As for performance, you should profile your app. Most likely the bottleneck is somewhere in a deeper function, like POI.save().
There's no reason to worry in the data you've given us: is memory consumption going UP as you read more and more lines? Now that would be cause for worry -- but there's no indication that this would happen in the code you've shown, assuming that p.save() saves the object to some database or file and not in memory, of course. There's nothing real to be gained by adding del statements, as the memory is getting recycled at each leg of the loop anyway.
This could be sped up if there's a faster way to populate a POI instance than binding its attributes one by one -- e.g., passing those attributes (maybe as keyword arguments? positional would be faster...) to the POI constructor. But whether that's the case depends on that geonames.models module, of which I know nothing, so I can only offer very generic advice -- e.g., if the module lets you save a bunch of POIs in a single gulp, then making them (say) 100 at a time and saving them in bunches should yield a speedup (at the cost of slightly higher memory consumption).
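The "100 at a time" idea can be sketched without knowing anything about geonames.models. The helpers below only parse the tab-separated lines and group them into batches; how a batch is then saved is left to the caller, and whether the model layer offers a bulk save is an assumption to verify. The names batches and parse_rows are hypothetical.

```python
from itertools import islice


def batches(iterable, size):
    """Yield lists of up to size items, keeping one batch in memory at a time."""
    iterator = iter(iterable)
    while True:
        chunk = list(islice(iterator, size))
        if not chunk:
            return
        yield chunk


def parse_rows(lines, min_columns=19):
    """Split tab-separated lines and skip those with too few columns."""
    for line in lines:
        columns = line.rstrip('\n').split('\t')
        if len(columns) >= min_columns:
            yield columns


if __name__ == "__main__":
    lines = ['\t'.join(str(i) for i in range(19)) + '\n'] * 250
    for batch in batches(parse_rows(lines), 100):
        print(len(batch))  # 100, 100, 50
```

Because both helpers are generators, the file is still read lazily, so memory use is bounded by the batch size rather than by the file size.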