I'm trying to write a basic disassembler for a school project in Python, using the pydasm and capstone libraries. These libraries let me disassemble instruction bytes, but what I don't understand is how to actually get at a program's instructions from Python in the first place. Could anyone give me some direction?
Thanks.
Be careful with the code section: its base might not contain only code (imports or read-only data might be present at that location).
The best way to start a disassembly is to look at the AddressOfEntryPoint field in the IMAGE_OPTIONAL_HEADER, which indicates the first byte executed in the PE file (unless TLS callbacks are present, but that's another subject).
A very good library for browsing PE files in python is pefile.
Here's an example to get the first 10 bytes at the program entry point:
#!/usr/local/bin/python2
# -*- coding: utf8 -*-
from __future__ import print_function

import sys
import os.path

import pefile


def find_entry_point_section(pe, eop_rva):
    for section in pe.sections:
        if section.contains_rva(eop_rva):
            return section
    return None


def main(file_path):
    print("Opening {}".format(file_path))
    try:
        pe = pefile.PE(file_path, fast_load=True)

        # AddressOfEntryPoint is guaranteed to be the first byte executed.
        eop = pe.OPTIONAL_HEADER.AddressOfEntryPoint
        code_section = find_entry_point_section(pe, eop)
        if not code_section:
            return

        print("[+] Code section found at offset: "
              "{:#x} [size: {:#x}]".format(code_section.PointerToRawData,
                                           code_section.SizeOfRawData))

        # get the first 10 bytes at the entry point and dump them
        code_at_oep = code_section.get_data(eop, 10)
        print("[*] Code at EOP:\n{}".
              format(" ".join("{:02x}".format(ord(c)) for c in code_at_oep)))

    except pefile.PEFormatError as pe_err:
        print("[-] error while parsing file {}:\n\t{}".format(file_path,
                                                              pe_err))


if __name__ == '__main__':
    if len(sys.argv) < 2:
        print("[*] {} <PE_Filename>".format(sys.argv[0]))
    else:
        file_path = sys.argv[1]
        if os.path.isfile(file_path):
            main(file_path)
        else:
            print("[-] {} is not a file".format(file_path))
Simply pass the name of your PE file as the first argument.
In the above code the code_at_oep variable holds the first few bytes at the entry point. From there you can pass these bytes to the Capstone engine.
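A minimal sketch of that last step, assuming a 32-bit x86 binary and that the capstone package is installed (code_at_oep, eop and pe come from the script above):

from capstone import Cs, CS_ARCH_X86, CS_MODE_32

# ImageBase + eop is the virtual address the entry point will have once loaded
md = Cs(CS_ARCH_X86, CS_MODE_32)
for insn in md.disasm(code_at_oep, pe.OPTIONAL_HEADER.ImageBase + eop):
    print("{:#x}:\t{}\t{}".format(insn.address, insn.mnemonic, insn.op_str))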
Note that these first bytes might simply be a jmp or call instruction, so you'll have to follow the code flow in order to disassemble correctly. Correctly disassembling an entire program is still an open problem in computer science...
This depends on what OS you're using. You have some other questions about Linux, so I'm assuming you're using that. On Linux, executables are typically in ELF format, so you'll need a python library to read that or else to use some other tool to extract the part of the ELF file that you want.
The actual instructions are stored in the .text section, so if you extract that section's contents, those should be the raw bytes you want to disassemble.
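For instance, a minimal sketch with the third-party pyelftools package (the file name is just a placeholder) might look like this:

from elftools.elf.elffile import ELFFile

with open("a.out", "rb") as f:
    elf = ELFFile(f)
    text = elf.get_section_by_name(".text")
    code = text.data()        # raw instruction bytes of the .text section
    addr = text["sh_addr"]    # virtual address they are loaded at
    print("{} bytes of code loaded at {:#x}".format(len(code), addr))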
Here is my code that disassembles your exe file and gives you output in x86 assembly language. I used the pefile and capstone libraries.
#!/usr/bin/python
import pefile
from capstone import *
# load the target PE file
pe = pefile.PE("/file/path/code.exe")
# get the address of the program entry point from the program header
entrypoint = pe.OPTIONAL_HEADER.AddressOfEntryPoint
# compute memory address where the entry code will be loaded into memory
entrypoint_address = entrypoint+pe.OPTIONAL_HEADER.ImageBase
# get the binary code from the PE file object
binary_code = pe.get_memory_mapped_image()[entrypoint:entrypoint+100]
# initialize disassembler to disassemble 32 bit x86 binary code
disassembler = Cs(CS_ARCH_X86, CS_MODE_32)
# disassemble the code
for instruction in disassembler.disasm(binary_code, entrypoint_address):
print("%s\t%s" %(instruction.mnemonic, instruction.op_str))
Make sure to give the correct path to your exe file. You also need to specify how many bytes you want to disassemble on the binary_code line: here [entrypoint:entrypoint+100] grabs only 100 bytes, but you can change that.
Related
Some disassemblers like IDA or Ghidra take an exe and output the instructions. Other disassemblers require the user to parse the PE header, isolate the bytes that make up the instructions, and pass those in.
I'm trying to learn to use the Capstone Python API, but its documentation only ever shows a buffer of isolated instruction bytes being passed, like so:
# test1.py
from capstone import *
CODE = b"\x55\x48\x8b\x05\xb8\x13\x00\x00"
md = Cs(CS_ARCH_X86, CS_MODE_64)
for i in md.disasm(CODE, 0x1000):
print("0x%x:\t%s\t%s" %(i.address, i.mnemonic, i.op_str))
But I want to do something like:
CODE = open("test.exe", "rb")
Without having to personally parse the PE headers to isolate the instruction data. Does Capstone's API support this?
Capstone is a pure disassembly engine; it doesn't understand container formats like PE or ELF. You just feed it bytes of machine code for whatever processor you have.
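So you still need something like pefile to locate the code for you. A rough sketch of that combination (32-bit x86 assumed; test.exe is just a placeholder name):

import pefile
from capstone import Cs, CS_ARCH_X86, CS_MODE_32

pe = pefile.PE("test.exe")
md = Cs(CS_ARCH_X86, CS_MODE_32)

# find the section that contains the entry point and disassemble its raw bytes
eop = pe.OPTIONAL_HEADER.AddressOfEntryPoint
for section in pe.sections:
    if section.contains_rva(eop):
        code = section.get_data()
        va = pe.OPTIONAL_HEADER.ImageBase + section.VirtualAddress
        for insn in md.disasm(code, va):
            print("0x%x:\t%s\t%s" % (insn.address, insn.mnemonic, insn.op_str))
        break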
I need to read the details for a file in Windows so that I can interrogate the file's 'File version' as displayed in the Details tab of the File Properties window.
I haven't found anything in the standard library that makes this very easy to accomplish but figured if I could find the right windows function, I could probably accomplish it using ctypes.
Does anyone have any example code, or can they point me to a Windows function that would let me read this info? I took a look at GetFileAttributes already, but that wasn't quite right as far as I could tell.
Use the Win32 API Version Information functions from ctypes. The API is a little fiddly to use, but I also wanted this, so I have thrown together a quick script as an example.
usage: version_info.py [-h] [--lang LANG] [--codepage CODEPAGE] path
Can also use as a module, see the VersionInfo class. Checked with Python 2.7 and 3.6 against a few files.
import array
from ctypes import *

def get_file_info(filename, info):
    """
    Extract information from a file.
    """
    # Get size needed for buffer (0 if no info)
    size = windll.version.GetFileVersionInfoSizeA(filename, None)
    # If no info in file -> empty string
    if not size:
        return ''
    # Create buffer
    res = create_string_buffer(size)
    # Load file information into buffer res
    windll.version.GetFileVersionInfoA(filename, None, size, res)
    r = c_uint()
    l = c_uint()
    # Look for codepages
    windll.version.VerQueryValueA(res, '\\VarFileInfo\\Translation',
                                  byref(r), byref(l))
    # If no codepage -> empty string
    if not l.value:
        return ''
    # Take the first codepage (what else ?)
    codepages = array.array('H', string_at(r.value, l.value))
    codepage = tuple(codepages[:2].tolist())
    # Extract information
    windll.version.VerQueryValueA(res, ('\\StringFileInfo\\%04x%04x\\'
                                        + info) % codepage, byref(r), byref(l))
    return string_at(r.value, l.value)
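A quick usage sketch (the path is just an example, and as written the ANSI calls above expect Python 2 byte strings):

print get_file_info(r'C:\Windows\System32\kernel32.dll', 'ProductVersion')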
On a windows machine, I am trying to get a file's mode using the os module in python, like this (short snippet):
import os
from stat import *
file_stat = os.stat(path)
mode = file_stat[ST_MODE]
An example for the mode I got for a file is 33206.
My question is, how can I convert it to the linux-file mode method? (for example, 666).
Thanks to all repliers!
Edit:
found my answer down here :) for all who want to understand this topic further:
understanding and decoding the file mode value from stat function output
Check if this translates properly:
import os
import stat

file_stat = os.stat(path)
mode = file_stat[stat.ST_MODE]
print oct(stat.S_IMODE(mode))
For your example:
>>> print oct(stat.S_IMODE(33206))
0666
Took it from here; read it for more explanation.
One workaround would be to use os.system(r'attrib -h -s d:\your_file.txt'), where you can use the attribute switches listed below (a small subprocess sketch follows the list):
R – This command will assign the “Read-Only” attribute to your selected files or folders.
H – This command will assign the “Hidden” attribute to your selected files or folders.
A – This command will prepare your selected files or folders for “Archiving.”
S – This command will change your selected files or folders by assigning the “System” attribute.
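If you'd rather avoid going through the shell, roughly the same thing can be done with subprocess (the path is just an example, and attrib is Windows-only):

import subprocess

# clear the hidden and system attributes, then hide the file again
subprocess.call(['attrib', '-h', '-s', r'd:\your_file.txt'])
subprocess.call(['attrib', '+h', r'd:\your_file.txt'])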
I came across the following header format for Python source files in a document about Python coding guidelines:
#!/usr/bin/env python
"""Foobar.py: Description of what foobar does."""
__author__ = "Barack Obama"
__copyright__ = "Copyright 2009, Planet Earth"
Is this the standard format of headers in the Python world?
What other fields/information can I put in the header?
Python gurus, share your guidelines for good Python source headers :-)
It's all metadata for the Foobar module.
The first one is the docstring of the module, that is already explained in Peter's answer.
How do I organize my modules (source files)? (Archive)
The first line of each file should be #!/usr/bin/env python. This makes it possible to run the file as a script invoking the interpreter implicitly, e.g. in a CGI context.
Next should be the docstring with a description. If the description is long, the first line should be a short summary that makes sense on its own, separated from the rest by a newline.
All code, including import statements, should follow the docstring. Otherwise, the docstring will not be recognized by the interpreter, and you will not have access to it in interactive sessions (i.e. through obj.__doc__) or when generating documentation with automated tools.
Import built-in modules first, followed by third-party modules, followed by any changes to the path and your own modules. Especially, additions to the path and names of your modules are likely to change rapidly: keeping them in one place makes them easier to find.
Next should be authorship information. This information should follow this format:
__author__ = "Rob Knight, Gavin Huttley, and Peter Maxwell"
__copyright__ = "Copyright 2007, The Cogent Project"
__credits__ = ["Rob Knight", "Peter Maxwell", "Gavin Huttley",
"Matthew Wakefield"]
__license__ = "GPL"
__version__ = "1.0.1"
__maintainer__ = "Rob Knight"
__email__ = "rob#spot.colorado.edu"
__status__ = "Production"
Status should typically be one of "Prototype", "Development", or "Production". __maintainer__ should be the person who will fix bugs and make improvements if imported. __credits__ differs from __author__ in that __credits__ includes people who reported bug fixes, made suggestions, etc. but did not actually write the code.
Here you have more information, listing __author__, __authors__, __contact__, __copyright__, __license__, __deprecated__, __date__ and __version__ as recognized metadata.
I strongly favour minimal file headers, by which I mean just:
#!/usr/bin/env python # [1]
"""\
This script foos the given bars [2]
Usage: myscript.py BAR1 BAR2
"""
import os # standard library, [3]
import sys
import requests # 3rd party packages
import mypackage # local source
[1] The hashbang if, and only if, this file should be able to be directly executed, i.e. run as myscript.py or myscript or maybe even python myscript.py. (The hashbang isn't used in the last case, but providing it gives users the choice of executing it either way.) The hashbang should not be included if the file is a module, intended just to be imported by other Python files.
[2] Module docstring
[3] Imports, grouped in the standard way, i.e. three groups of imports, with a single blank line between them. Within each group, imports are sorted. The final group, imports from local source, can either be absolute imports as shown, or explicit relative imports.
Everything else is a waste of time - both for the author and for subsequent maintainers. It wastes the precious visual space at the top of the file with information that is better tracked elsewhere, and is easy to get out of date and become actively misleading.
If you have legal disclaimers or licensing info, it goes into a separate file. It does not need to infect every source code file. Your copyright should be part of this. People should be able to find it in your LICENSE file, not random source code.
Metadata such as authorship and dates is already maintained by your source control. There is no need to add a less-detailed, erroneous, and out-of-date version of the same info in the file itself.
I don't believe there is any other data that everyone needs to put into all their source files. You may have some particular requirement to do so, but such things apply, by definition, only to you. They have no place in “general headers recommended for everyone”.
The answers above are really complete, but if you want a quick and dirty header to copy'n paste, use this:
#!/usr/bin/env python
# -*- coding: utf-8 -*-
"""Module documentation goes here
and here
and ...
"""
Why this is a good one:
The first line is for *nix users. It picks the Python interpreter from the user's path, so it will automatically choose the user's preferred interpreter.
The second one is the file encoding. Nowadays every file should have an encoding associated with it. UTF-8 will work everywhere; only legacy projects would use a different encoding.
And then a very simple docstring, which can span multiple lines.
See also: https://www.python.org/dev/peps/pep-0263/
If you just write a class in each file, you don't even need the documentation (it would go inside the class doc).
Also see PEP 263 if you are using a non-ASCII character set
Abstract
This PEP proposes to introduce a syntax to declare the encoding of
a Python source file. The encoding information is then used by the
Python parser to interpret the file using the given encoding. Most
notably this enhances the interpretation of Unicode literals in
the source code and makes it possible to write Unicode literals
using e.g. UTF-8 directly in an Unicode aware editor.
Problem
In Python 2.1, Unicode literals can only be written using the
Latin-1 based encoding "unicode-escape". This makes the
programming environment rather unfriendly to Python users who live
and work in non-Latin-1 locales such as many of the Asian
countries. Programmers can write their 8-bit strings using the
favorite encoding, but are bound to the "unicode-escape" encoding
for Unicode literals.
Proposed Solution
I propose to make the Python source code encoding both visible and
changeable on a per-source file basis by using a special comment
at the top of the file to declare the encoding.
To make Python aware of this encoding declaration a number of
concept changes are necessary with respect to the handling of
Python source code data.
Defining the Encoding
Python will default to ASCII as standard encoding if no other
encoding hints are given.
To define a source code encoding, a magic comment must
be placed into the source files either as first or second
line in the file, such as:
# coding=<encoding name>
or (using formats recognized by popular editors)
#!/usr/bin/python
# -*- coding: <encoding name> -*-
or
#!/usr/bin/python
# vim: set fileencoding=<encoding name> :
...
What I use in some projects is this line as the first line, for Linux machines:
# -*- coding: utf-8 -*-
For documentation and author credit, I like a simple multiline string. Here is an example from the Example Google Style Python Docstrings:
# -*- coding: utf-8 -*-
"""Example Google style docstrings.

This module demonstrates documentation as specified by the `Google Python
Style Guide`_. Docstrings may extend over multiple lines. Sections are created
with a section header and a colon followed by a block of indented text.

Example:
    Examples can be given using either the ``Example`` or ``Examples``
    sections. Sections support any reStructuredText formatting, including
    literal blocks::

        $ python example_google.py

Section breaks are created by resuming unindented text. Section breaks
are also implicitly created anytime a new section starts.

Attributes:
    module_level_variable1 (int): Module level variables may be documented in
        either the ``Attributes`` section of the module docstring, or in an
        inline docstring immediately following the variable.

        Either form is acceptable, but the two should not be mixed. Choose
        one convention to document module level variables and be consistent
        with it.

Todo:
    * For module TODOs
    * You have to also use ``sphinx.ext.todo`` extension

.. _Google Python Style Guide:
   http://google.github.io/styleguide/pyguide.html

"""
It can also be nice to add:
"""
#Author: ...
#Date: ....
#Credit: ...
#Links: ...
"""
Additional Formats
Meta-information markup | devguide
"""
:mod:`parrot` -- Dead parrot access
===================================
.. module:: parrot
:platform: Unix, Windows
:synopsis: Analyze and reanimate dead parrots.
.. moduleauthor:: Eric Cleese <eric#python.invalid>
.. moduleauthor:: John Idle <john#python.invalid>
"""
#!/usr/bin/env python3
# -*- coding: utf-8 -*-
# ---------------------------------------------------------------------------
# Created By  : name_of_the_creator
# Created Date: date/month/time ..etc
# version ='1.0'
# ---------------------------------------------------------------------------
I also report metadata similarly to the other answers:
__author__ = "Rob Knight, Gavin Huttley, and Peter Maxwell"
__copyright__ = "Copyright 2007, The Cogent Project"
__credits__ = ["Rob Knight", "Peter Maxwell", "Gavin Huttley",
"Matthew Wakefield"]
__license__ = "GPL"
__version__ = "1.0.1"
__maintainer__ = "Rob Knight"
__email__ = "rob#spot.colorado.edu"
__status__ = "Production"
Basically, I am trying to find out what version of ArcGIS the user currently has installed. I looked through the registry and couldn't find anything related to a version string, but I know it is stored within the .exe.
I've done a fair bit of googling and can't find anything really useful. I tried using GetFileVersionInfo, and I seem to get a random mishmash of stuff.
Any ideas?
EDIT
Sigh....
Turns out pywin32 is not always installed on all machines. Does anyone know if it's possible to do the same thing via ctypes?
Also, this is only for Windows.
If you prefer not to do this using pywin32, you would be able to do this with ctypes, for sure.
The trick will be decoding that silly file version structure that comes back.
There's one old mailing list post that is doing what you're asking. Unfortunately, I don't have a windows box handy to test this myself, right now. But if it doesn't work, it should at least give you a good start.
Here's the code, in case those 2006 archives vanish sometime:
import array
from ctypes import *

def get_file_info(filename, info):
    """
    Extract information from a file.
    """
    # Get size needed for buffer (0 if no info)
    size = windll.version.GetFileVersionInfoSizeA(filename, None)
    # If no info in file -> empty string
    if not size:
        return ''
    # Create buffer
    res = create_string_buffer(size)
    # Load file information into buffer res
    windll.version.GetFileVersionInfoA(filename, None, size, res)
    r = c_uint()
    l = c_uint()
    # Look for codepages
    windll.version.VerQueryValueA(res, '\\VarFileInfo\\Translation',
                                  byref(r), byref(l))
    # If no codepage -> empty string
    if not l.value:
        return ''
    # Take the first codepage (what else ?)
    codepages = array.array('H', string_at(r.value, l.value))
    codepage = tuple(codepages[:2].tolist())
    # Extract information
    windll.version.VerQueryValueA(res, ('\\StringFileInfo\\%04x%04x\\'
                                        + info) % codepage, byref(r), byref(l))
    return string_at(r.value, l.value)
print get_file_info(r'C:\WINDOWS\system32\calc.exe', 'FileVersion')
--
OK, back near a Windows box. I have actually tried this code now. "Works for me".
>>> print get_file_info(r'C:\WINDOWS\system32\calc.exe', 'FileVersion')
6.1.7600.16385 (win7_rtm.090713-1255)
There's a GNU/Linux utility called 'strings' that prints the printable characters in any file (binary or not). Try using that and look for a version-number-like pattern.
On Windows, you can get strings here: http://unxutils.sourceforge.net/
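If you don't have the strings tool handy, a rough Python approximation of it, plus a version-number-style filter, might look like this (the file path and the pattern are just examples):

import re

with open(r'C:\Windows\System32\calc.exe', 'rb') as f:
    data = f.read()

# find runs of 4 or more printable ASCII characters (roughly what 'strings' does)
for match in re.finditer(b'[ -~]{4,}', data):
    # keep only runs containing something like a dotted version number
    if re.search(b'[0-9]+\\.[0-9]+\\.[0-9]+', match.group()):
        print(match.group())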