STDOUT json format with system command - python

I have the following example, where I want to print a variable as JSON using os.system (the output has to come from a system command), but the double quotes are missing from the output.
import json
import os
import requests
PAYLOAD_CONF = {
"cluster": {
"ldap": "string",
"processes": 22
}
}
paystr = (str(PAYLOAD_CONF))
paydic = (json.dumps(PAYLOAD_CONF))
os.system("echo "+paystr+"")
os.system("echo "+paydic+"")
Output:
{cluster: {processes: 22, ldap: string}}
{cluster: {processes: 22, ldap: string}}
Can you help me in a workaround where I can output this with double quotes? It's very important to output with system command.

For cases like this, where you embed variables into a shell command you should use shlex.quote. Using this (and with minor cleanup), the code can be written as:
import json
import os
import shlex
PAYLOAD_CONF = {
"cluster": {
"ldap": "string",
"processes": 22
}
}
paydic = json.dumps(PAYLOAD_CONF)
os.system("echo " + shlex.quote(paydic))
Output:
{"cluster": {"ldap": "string", "processes": 22}}
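To see why this works, it can help to look at the command string that actually reaches the shell. The following is a minimal sketch using the same payload; it only builds and prints the command and does not run it.

```python
import json
import shlex

PAYLOAD_CONF = {"cluster": {"ldap": "string", "processes": 22}}

paydic = json.dumps(PAYLOAD_CONF)
# shlex.quote wraps the JSON in single quotes, so the shell passes the
# inner double quotes through to echo instead of consuming them.
command = "echo " + shlex.quote(paydic)
print(command)
# prints: echo '{"cluster": {"ldap": "string", "processes": 22}}'
```

Without shlex.quote, the shell treats each pair of double quotes as quoting syntax and strips them, which is what produced the output in the question.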
Using subprocess
The subprocess module contains a lot of helper functions for calling external applications. These functions are generally preferable to os.system for security reasons, mainly because arguments passed as a list are not interpreted by a shell.
If there is no hard dependency on os.system, you can also use one of the following, depending on your needs:
subprocess.call -- This will return even if the subprocess fails. If this returns a non-zero value, the process exited abnormally.
subprocess.check_call -- This will raise an exception if the process exits abnormally.
subprocess.check_output -- This will return stdout from the subprocess and raise an exception if it exits abnormally.
... The module contains many other helpful functions for interacting with subprocesses, which you should check out if the above don't suit your needs.
Using check_call, the code becomes:
from subprocess import check_call
import json
PAYLOAD_CONF = {
"cluster": {
"ldap": "string",
"processes": 22
}
}
paydic = json.dumps(PAYLOAD_CONF)
check_call(["echo", paydic])
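If the output needs to be captured rather than just printed, check_output can be used in the same way. The sketch below is an illustration under one assumption: instead of echo, it uses a tiny Python child process as a stand-in so that it behaves the same on every platform.

```python
import json
import subprocess
import sys

PAYLOAD_CONF = {"cluster": {"ldap": "string", "processes": 22}}
paydic = json.dumps(PAYLOAD_CONF)

# Stand-in for `echo` (an assumption for portability): a child process
# that prints its first argument unchanged.
child = [sys.executable, "-c", "import sys; print(sys.argv[1])", paydic]

# Arguments are passed as a list, so no shell is involved and no quoting is needed.
raw = subprocess.check_output(child)

# The captured bytes decode back to the original structure.
round_tripped = json.loads(raw.decode("utf-8"))
print(round_tripped == PAYLOAD_CONF)
```

Because no shell parses the argument, the double quotes arrive in the child process exactly as json.dumps produced them.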

You're not adding quotes; you're adding an empty string to the end.
Also, echo is going to interpret the first set of double quotes as a wrapper around the argument – not as part of the string itself. In the echo command itself, you need to escape double quotes with a backslash, e.g. echo \"hello\" will output "hello", whereas echo "hello" will output hello.
In a Python string, you're going to have to escape the literal backslash in the echo command with another backslash, e.g. os.system('echo \\"hello\\"') for output "hello".
Applying this to your case and using format to make it easy:
import json
import os
import requests
PAYLOAD_CONF = {
"cluster": {
"ldap": "string",
"processes": 22
}
}
paystr = (str(PAYLOAD_CONF))
paydic = (json.dumps(PAYLOAD_CONF))
os.system('echo \\"{}\\"'.format(paystr))
os.system('echo \\"{}\\"'.format(paydic))
Output:
"{cluster: {ldap: string, processes: 22}}"
"{cluster: {ldap: string, processes: 22}}"
Your paystr variable is also unnecessary, since all objects are automatically converted to strings by print and format via their inherited or overridden __str__ methods.
EDIT:
To output the variable as it appears in Python you just need to iterate through the payload dict and render each key-value pair in a type-sensitive way.
import os
import requests
def make_payload_str(payload, nested=1):
    payload_str = "{\n" + "\t" * nested
    for i, k in enumerate(payload.keys()):
        v = payload[k]
        if type(k) is str:
            payload_str += '\\"{}\\"'.format(k)
        else:
            payload_str += str(k)
        payload_str += ": "
        if type(v) is str:
            payload_str += '\\"{}\\"'.format(v)
        elif type(v) is dict:
            payload_str += make_payload_str(v, nested=nested + 1)
        else:
            payload_str += str(v)
        # Only add comma if not last element
        if i < len(payload) - 1:
            payload_str += ",\n" + "\t" * nested
        else:
            payload_str += "\n"
    return payload_str + "\n" + "\t" * (nested - 1) + "}"
PAYLOAD_CONF = {
"cluster": {
"ldap": "string",
"processes": 22
}
}
paystr = make_payload_str(PAYLOAD_CONF)
os.system('echo "{}"'.format(paystr))
Output:
{
"cluster": {
"ldap": "string",
"processes": 22
}
}
If the payload contains a dictionary, as it does in the example you provided, the function calls itself to produce the string for that dictionary, indented the right number of tabs using the nested parameter.
If the payload is also allowed to have lists and other more complex types, you'll have to include cases that account for those, but it's just more of the same.
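As an alternative to extending the hand-written function, a similar indented output can probably be had from the standard library. This is a sketch, not the approach used above: json.dumps with an indent produces the layout, and shlex.quote protects it on its way through the shell. The "nodes" list is a hypothetical extra key added only to show that lists need no special handling.

```python
import json
import shlex

PAYLOAD_CONF = {
    "cluster": {
        "ldap": "string",
        "processes": 22,
        "nodes": ["a", "b"],  # hypothetical key, added to show list support
    }
}

# indent accepts a string, so tabs can be used just like in make_payload_str
pretty = json.dumps(PAYLOAD_CONF, indent="\t")

# Single-quoting keeps newlines, tabs and double quotes intact in the shell
command = "echo " + shlex.quote(pretty)
# os.system(command) would then print the indented JSON
```

The design trade-off is that the layout is fixed by json.dumps, whereas the recursive function can be adapted to any custom formatting.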

Related

howto pass json to docker solc

I used to compile my solidity contracts with compile_standard in python, but it's getting very hard to keep up with the different compiler versions. After migrating to a new system, the new compiler won't work with older contracts.
So I figured, let's do it with docker, so I can choose my compiler version. How hard can it be, right? I tried to pass a string as an argument, which would be my preferred way of working, but this doesn't work. If I try to use files, I get: FileNotFoundError: [Errno 2] No such file or directory, which makes sense, I guess, since I don't mount any paths in docker because, according to the solc docs, this is not necessary. But I can't find more information on this.
So my questions are:
1. If using files, how do I pass the path?
2. Is it possible to pass the JSON as a string and get a string back as a result?
compile_json = {
"language": "Solidity",
"sources": sources,
"settings":
{
"outputSelection": {
"*": {
"*": [
"metadata", "evm.bytecode"
, "evm.bytecode.sourceMap"
]
}
}
}
}
with open('data.json', 'w') as outfile:
json.dump(compile_json, outfile)
docker_cmd = 'docker run ethereum/solc:0.6.10 --standard-json < ' + self.base_path + 'data.json --allow-paths '+ self.base_path +' < out.json'
bincode = subprocess.check_output([docker_cmd]).decode('utf-8')
edit:
docker_cmd = 'docker run ethereum/solc:0.6.0 --standard-json < ' + self.base_path + 'data.json --allow-paths '+ self.base_path +' > out.json'
try:
bincode = subprocess.check_output(docker_cmd,shell=True,stderr=subprocess.STDOUT).decode('utf-8')
except subprocess.CalledProcessError as e:
raise RuntimeError("command '{}' return with error (code {}): {}".format(e.cmd, e.returncode, e.output))
I've managed to get it working through docker, but now solc complains about the json format:
{"errors":[{"component":"general","formattedMessage":"* Line 2, Column 1\n Syntax error: value, object or array expected.\n* Line 1, Column 1\n A valid JSON document must be either an array or an object value.\n","message":"* Line 2, Column 1\n Syntax error: value, object or array expected.\n* Line 1, Column 1\n A valid JSON document must be either an array or an object value.\n","severity":"error","type":"JSONError"}]}
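One possible cause of this error, stated here as an assumption and not verified against this setup, is that the container receives an empty stdin: docker run only forwards stdin when the -i flag is given. The sketch below passes the JSON as a string on stdin and reads the result back from stdout, without a shell or temporary files. The helper names run_with_json_stdin and solc_docker_cmd are hypothetical.

```python
import json
import subprocess


def run_with_json_stdin(cmd, payload):
    """Send `payload` as JSON on the child's stdin and parse its JSON stdout."""
    completed = subprocess.run(
        cmd,
        input=json.dumps(payload).encode("utf-8"),
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
        check=True,  # raise CalledProcessError on a non-zero exit code
    )
    return json.loads(completed.stdout.decode("utf-8"))


def solc_docker_cmd(version):
    # -i keeps stdin attached to the container; --rm removes it afterwards.
    # The image name and --standard-json are taken from the question.
    return ["docker", "run", "--rm", "-i",
            "ethereum/solc:" + version, "--standard-json"]

# Usage (requires docker and the image to be available):
# result = run_with_json_stdin(solc_docker_cmd("0.6.10"), compile_json)
```

Passing the command as a list also avoids the shell redirections in the question, which are easy to get wrong when they are mixed with other arguments.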

How can i run a php function in python?

For some reason, I have to run a PHP function in Python.
However, I realized that it's beyond my ability.
So, I'm asking for help here.
Below is the code:
function munja_send($mtype, $name, $phone, $msg, $callback, $contents) {
$host = "www.sendgo.co.kr";
$id = ""; // id
$pass = ""; // password
$param = "remote_id=".$id;
$param .= "&remote_pass=".$pass;
$param .= "&remote_name=".$name;
$param .= "&remote_phone=".$phone; //cellphone number
$param .= "&remote_callback=".$callback; // my cellphone number
$param .= "&remote_msg=".$msg; // message
$param .= "&remote_contents=".$contents; // image
if ($mtype == "lms") {
$path = "/Remote/RemoteMms.html";
} else {
$path = "/Remote/RemoteSms.html";
}
$fp = @fsockopen($host,80,$errno,$errstr,30);
$return = "";
if (!$fp) {
echo $errstr."(".$errno.")";
} else {
fputs($fp, "POST ".$path." HTTP/1.1\r\n");
fputs($fp, "Host: ".$host."\r\n");
fputs($fp, "Content-type: application/x-www-form-urlencoded\r\n");
fputs($fp, "Content-length: ".strlen($param)."\r\n");
fputs($fp, "Connection: close\r\n\r\n");
fputs($fp, $param."\r\n\r\n");
while(!feof($fp)) $return .= fgets($fp,4096);
}
fclose ($fp);
$_temp_array = explode("\r\n\r\n", $return);
$_temp_array2 = explode("\r\n", $_temp_array[1]);
if (sizeof($_temp_array2) > 1) {
$return_string = $_temp_array2[1];
} else {
$return_string = $_temp_array2[0];
}
return $return_string;
}
I would be glad if anyone can show me a way.
Thank you.
I don't know PHP, but based on my understanding, here should be a raw line-for-line translation of the code you provided, from PHP to python. I've preserved your existing comments, and added new ones for clarification in places where I was unsure or where you might want to change.
It should be pretty straightforward to follow - the difference is mostly in syntax (e.g. + for concatenation instead of .), and in converting str to bytes and vice versa.
import socket
def munja_send(mtype, name, phone, msg, callback, contents):
    host = "www.sendgo.co.kr"
    remote_id = "" # id (changed the variable name, since `id` is also a builtin function)
    password = "" # password (`pass` is a reserved keyword in python)
    param = "remote_id=" + remote_id
    param += "&remote_pass=" + password
    param += "&remote_name=" + name
    param += "&remote_phone=" + phone # cellphone number
    param += "&remote_callback=" + callback # my cellphone number
    param += "&remote_msg=" + msg # message
    param += "&remote_contents=" + contents # image
    if mtype == "lms":
        path = "/Remote/RemoteMms.html"
    else:
        path = "/Remote/RemoteSms.html"
    # change these parameters as necessary for your desired outcome
    fp = socket.socket(family=socket.AF_INET, type=socket.SOCK_STREAM)
    fp.settimeout(30) # the timeout is set on the socket object, not on the module
    errno = fp.connect_ex((host, 80))
    returnstr = b""
    if errno != 0:
        # I'm not sure where errmsg comes from in php or how to get it in python
        # errno should be the same, though, as it refers to the same system call error code
        print("Error(" + str(errno) + ")")
    else:
        # sockets send bytes, not str, so every part of the request is encoded first
        # for accuracy, we convert param to bytes using utf-8 encoding
        # before checking its length. Change the encoding as necessary for accuracy
        body = param.encode('utf-8')
        fp.sendall(("POST " + path + " HTTP/1.1\r\n").encode('utf-8'))
        fp.sendall(("Host: " + host + "\r\n").encode('utf-8'))
        fp.sendall(b"Content-type: application/x-www-form-urlencoded\r\n")
        fp.sendall(("Content-length: " + str(len(body)) + "\r\n").encode('utf-8'))
        fp.sendall(b"Connection: close\r\n\r\n")
        fp.sendall(body + b"\r\n\r\n")
        # the := operator requires Python 3.8 or newer
        while (data := fp.recv(4096)):
            # fp.recv() returns an empty bytes object once eof has been hit
            returnstr += data
    fp.close()
    _temp_array = returnstr.split(b'\r\n\r\n')
    _temp_array2 = _temp_array[1].split(b'\r\n')
    if len(_temp_array2) > 1:
        return_string = _temp_array2[1]
    else:
        return_string = _temp_array2[0]
    # here I'm converting the raw bytes to a python string, using encoding
    # utf-8 by default. Replace with your desired encoding if necessary
    # or just remove the `.decode()` call if you're fine with returning a
    # bytestring instead of a regular string
    return return_string.decode('utf-8')
If possible, you should probably use subprocess to execute your php code directly, as other answers suggest, as straight-up translating code is often error-prone and has slightly different behavior (case in point, the lack of errmsg and probably different error handling in general, and maybe encoding issues in the above snippet). But if that's not possible, then hopefully this will help.
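Since the PHP function is essentially a form-encoded HTTP POST, another option may be to let the standard library build the request instead of writing to a raw socket. The sketch below assumes the endpoint accepts an ordinary urlencoded POST; build_munja_request is a hypothetical helper name, and remote_id and password are keyword arguments here only for illustration.

```python
import urllib.parse
import urllib.request


def build_munja_request(mtype, name, phone, msg, callback, contents,
                        remote_id="", password=""):
    """Build (but do not send) the POST request described by the PHP code."""
    path = "/Remote/RemoteMms.html" if mtype == "lms" else "/Remote/RemoteSms.html"
    fields = {
        "remote_id": remote_id,
        "remote_pass": password,
        "remote_name": name,
        "remote_phone": phone,        # cellphone number
        "remote_callback": callback,  # my cellphone number
        "remote_msg": msg,            # message
        "remote_contents": contents,  # image
    }
    # Unlike the PHP original, urlencode escapes special characters in values
    body = urllib.parse.urlencode(fields).encode("utf-8")
    return urllib.request.Request("http://www.sendgo.co.kr" + path,
                                  data=body, method="POST")


def munja_send(*args, **kwargs):
    """Send the request and return the whole response body as text."""
    request = build_munja_request(*args, **kwargs)
    with urllib.request.urlopen(request, timeout=30) as response:
        return response.read().decode("utf-8", "replace")
```

Note that this returns the full response body, whereas the PHP code picks a single line out of it, so the caller may still need to extract the relevant line.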
According to what I found online, you can use subprocess to execute the PHP script:
import subprocess
# if the script doesn't need output
subprocess.call(["php", "/path/to/your/script.php"])
# if you want output
proc = subprocess.Popen("php /path/to/your/script.php", shell=True, stdout=subprocess.PIPE)
script_response = proc.stdout.read()
PHP code can be executed in python using libraries subprocess or php.py based on the situation.
Please refer to this answer for further details.

How to escape only delimiter and not the newline character in CSV

I am receiving normal comma delimited CSV files with data having new line character.
Input data
I want to convert the input data to:
Pipe (|) delimited
Without any quotes to escape (" or ')
Pipe (|) within data escaped with a caret (^) character
My file may also contain multiple lines on data (or data in newline in a single row).
Expected output data
Output file I was able to generate.
As you can see in the image, the caret (^) correctly escaped all pipes (|) in the data, but it also escaped the newline characters in the 5th and 6th lines, which I don't want.
NOTE: All the carriage returns (\r, or CR) and newline (\n, LF) characters should be as it is just like shown in images.
import csv
import sys
inputPath = sys.argv[1]
outputPath = sys.argv[2]
with open(inputPath, encoding="utf-8") as inputFile:
    with open(outputPath, 'w', newline='', encoding="utf-8") as outputFile:
        reader = csv.DictReader(inputFile, delimiter=',')
        writer = csv.DictWriter(
            outputFile, reader.fieldnames, delimiter='|', quoting=csv.QUOTE_NONE, escapechar='^', doublequote=False, quotechar="")
        writer.writeheader()
        writer.writerows(reader)
print("Formatting complete.")
The above code has been written in Python, it would be great if I can get help in Python.
Answers in other programming languages also accepted.
There are more than 8 million records.
Please find below some sample data:
"VENDOR ID","VENDOR NAME","ORGANIZATION NUMBER","ADDRESS 1","CITY","COUNTRY","ZIP","PRIMARY PHONE","FAX","EMAIL","LMS RECORD CREATED DATE","LMS RECORD MODIFY DATE","DELETE FLAG","LMS RECORD ID"
"a0E6D000001Fag8UAC","Test 'Vendor' 1","","This Vendor contains a single (') quote.","","","","","","test@test.com","2020-4-1 06:32:29","2020-4-1 06:34:43","false",""
"a0E6D000001FagDUAS","Test ""Vendor"" 2","","This Vendor contains a double("") quote.","","","","","","test@test.com","2020-4-1 06:33:38","2020-4-1 06:35:18","false",""
"a0E6D000001FagIUAS","Test Vendor | 3","","This Vendor contains a Pipe (|).","","","","","","test@test.com","2020-4-1 06:38:45","2020-4-1 06:38:45","false",""
"a0E6D000001FagNUAS","Test Vendor 4","","This Vendor contains a
carriage return, i.e
data in new line.","","","","","","test@test.com","2020-4-1 06:43:08","2020-4-1 06:43:08","false",""
NOTE: If you copy the above data, please make sure that the 5th and 6th lines end with only LF (i.e. newline, \n), just as shown in the images, or otherwise try to replicate those 2 lines, because this question is specifically about not escaping those 2 lines, as highlighted in the image below.
The above code is the final outcome of all my findings on the internet. I've even tried the pandas library, and its final output is the same.
The code below is just an alternate way to get my expected output, but the issue remains: this script takes forever (more than 12 hours, and it still does not finish, so I ultimately have to kill the process) when run on 9 million records.
Batch wrapper for JScript code:
0</* :
@echo off
cscript /nologo /E:jscript "%~f0" %*
exit /b %errorlevel% */0;
var ARGS = WScript.Arguments;
if (ARGS.Length < 3 ) {
WScript.Echo("Wrong arguments");
WScript.Echo(WScript.ScriptName + " path_to_file search replace [search replace[search replace [...]]]");
WScript.Echo(WScript.ScriptName + " e?path_to_file search replace [search replace[search replace [...]]]");
WScript.Echo("if filename starts with \"e?\" search and replace string will be evaluated for special characters ")
WScript.Quit(1);
}
if (ARGS.Item(0).toLowerCase() == "-help" || ARGS.Item(0).toLowerCase() == "-h") {
WScript.Echo(WScript.ScriptName + " path_to_file search replace [search replace[search replace [...]]]");
WScript.Echo(WScript.ScriptName + " e?path_to_file search replace [search replace[search replace [...]]]");
WScript.Echo("if filename starts with \"e?\" search and replace string will be evaluated for special characters ")
WScript.Quit(0);
}
if (ARGS.Length % 2 !== 1 ) {
WScript.Echo("Wrong arguments");
WScript.Quit(2);
}
var jsEscapes = {
'n': '\n',
'r': '\r',
't': '\t',
'f': '\f',
'v': '\v',
'b': '\b'
};
//string evaluation
//http://stackoverflow.com/questions/24294265/how-to-re-enable-special-character-sequneces-in-javascript
function decodeJsEscape(_, hex0, hex1, octal, other) {
var hex = hex0 || hex1;
if (hex) { return String.fromCharCode(parseInt(hex, 16)); }
if (octal) { return String.fromCharCode(parseInt(octal, 8)); }
return jsEscapes[other] || other;
}
function decodeJsString(s) {
return s.replace(
// Matches an escape sequence with UTF-16 in group 1, single byte hex in group 2,
// octal in group 3, and arbitrary other single-character escapes in group 4.
/\\(?:u([0-9A-Fa-f]{4})|x([0-9A-Fa-f]{2})|([0-3][0-7]{0,2}|[4-7][0-7]?)|(.))/g,
decodeJsEscape);
}
function convertToPipe(find, replace, str) {
return str.replace(new RegExp('\\|','g'),"^|");
}
function removeStartingQuote(find, replace, str) {
return str.replace(new RegExp('^"', 'g'), '');
}
function removeEndQuote(find, replace, str) {
return str.replace(new RegExp('"\r\n$', 'g'), '\r\n');
}
function removeLeadingAndTrailingQuotes(find, replace, str) {
return str.replace(new RegExp('"\r\n"', 'g'), '\r\n');
}
function replaceDelimiter(find, replace, str) {
return str.replace(new RegExp('","', 'g'), '|');
}
function convertSFDCDoubleQuotes(find, replace, str) {
return str.replace(new RegExp('""', 'g'), '"');
}
function getContent(file) {
// :: http://www.dostips.com/forum/viewtopic.php?f=3&t=3855&start=15&p=28898 ::
var ado = WScript.CreateObject("ADODB.Stream");
ado.Type = 2; // adTypeText = 2
ado.CharSet = "iso-8859-1"; // code page with minimum adjustments for input
ado.Open();
ado.LoadFromFile(file);
var adjustment = "\u20AC\u0081\u201A\u0192\u201E\u2026\u2020\u2021" +
"\u02C6\u2030\u0160\u2039\u0152\u008D\u017D\u008F" +
"\u0090\u2018\u2019\u201C\u201D\u2022\u2013\u2014" +
"\u02DC\u2122\u0161\u203A\u0153\u009D\u017E\u0178" ;
var fs = new ActiveXObject("Scripting.FileSystemObject");
var size = (fs.getFile(file)).size;
var lnkBytes = ado.ReadText(size);
ado.Close();
var chars=lnkBytes.split('');
for (var indx=0;indx<size;indx++) {
if ( chars[indx].charCodeAt(0) > 255 ) {
chars[indx] = String.fromCharCode(128 + adjustment.indexOf(chars[indx]));
}
}
return chars.join("");
}
function writeContent(file,content) {
var ado = WScript.CreateObject("ADODB.Stream");
ado.Type = 2; // adTypeText = 2
ado.CharSet = "iso-8859-1"; // right code page for output (no adjustments)
//ado.Mode=2;
ado.Open();
ado.WriteText(content);
ado.SaveToFile(file, 2);
ado.Close();
}
if (typeof String.prototype.startsWith != 'function') {
// see below for better implementation!
String.prototype.startsWith = function (str){
return this.indexOf(str) === 0;
};
}
var evaluate=false;
var filename=ARGS.Item(0);
if(filename.toLowerCase().startsWith("e?")) {
filename=filename.substring(2,filename.length);
evaluate=true;
}
var content=getContent(filename);
var newContent=content;
var find="";
var replace="";
for (var i=1;i<ARGS.Length-1;i=i+2){
find=ARGS.Item(i);
replace=ARGS.Item(i+1);
if(evaluate){
find=decodeJsString(find);
replace=decodeJsString(replace);
}
newContent=convertToPipe(find,replace,newContent);
newContent=removeStartingQuote(find,replace,newContent);
newContent=removeEndQuote(find,replace,newContent);
newContent=removeLeadingAndTrailingQuotes(find,replace,newContent);
newContent=replaceDelimiter(find,replace,newContent);
newContent=convertSFDCDoubleQuotes(find,replace,newContent);
}
writeContent(filename,newContent);
Execution Steps:
> replace.bat <file_name or full_path_to_file> "." "."
This batch file is meant for manipulating any file according to our requirements.
I've compiled this from a lot of Google searches. It's still a work in progress, as I've hardcoded my regular expressions in the file. You can change the functions I've made according to your needs, or even make your own functions by replicating the others and calling them at the end.
Another alternative for what I want to achieve, which I've done using a Windows PowerShell script:
((Get-Content -path $args[0] -Raw) -replace '\|', '^|') | Set-Content -NoNewline -Force -Path $args[0]
((Get-Content -path $args[0] -Raw) -replace '^"', '') | Set-Content -NoNewline -Force -Path $args[0]
((Get-Content -path $args[0] -Raw) -replace "`"\r\n$", "") | Set-Content -NoNewline -Force -Path $args[0]
((Get-Content -path $args[0] -Raw) -replace '"\r\n"', "`r`n") | Set-Content -NoNewline -Force -Path $args[0]
((Get-Content -path $args[0] -Raw) -replace '","', '|') | Set-Content -NoNewline -Force -Path $args[0]
((Get-Content -path $args[0] -Raw) -replace '""', '"' ) | Set-Content -Path $args[0]
Execution Ways:
Using Powershell
replace.ps1 '< path_to_file >'
Using a Batch Script
C:\Windows\System32\WindowsPowerShell\v1.0\powershell.exe -ExecutionPolicy ByPass -command "& '< path_to_ps_script >\replace.ps1' '< path_to_csv_file >.csv'"
NOTE: PowerShell V5.0 or greater required
This can process 1 million records in a minute or so.
What I've figured out is that we have to split bulky CSV files into multiple files with 1 million records each and then process them all separately.
Please correct me if I'm wrong, or if there's any other alternative to it.
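One way to stay in Python and avoid the unwanted escaping might be to keep csv.reader for parsing but write the output rows by hand, because the csv writer with QUOTE_NONE appears to escape the line terminator characters as well as the delimiter. The following is a minimal sketch under a few assumptions: only the pipe is escaped (a caret already present in the data is left as-is), each record ends with CRLF, and convert is a hypothetical helper name.

```python
import csv
import io


def convert(src, dst, row_end="\r\n"):
    """Read comma-delimited CSV from src; write pipe-delimited text to dst.

    Only '|' inside a field is escaped (as '^|'). Newlines embedded in a
    field are written unchanged. row_end is an assumed record terminator.
    """
    for row in csv.reader(src):
        dst.write("|".join(field.replace("|", "^|") for field in row))
        dst.write(row_end)


# Small in-memory demonstration with an embedded LF and an embedded pipe
src = io.StringIO('"a","b|c","line1\nline2"\r\n"x","say ""hi""","z"\r\n', newline="")
dst = io.StringIO(newline="")
convert(src, dst)
result = dst.getvalue()
```

For real files, both would be opened with newline='' (for example open(inputPath, newline='', encoding='utf-8')) so that embedded line endings are not translated. The loop streams one record at a time, so memory use should stay flat even with millions of records.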

How do I make variables from a Python module available in Go?

I am building a tool in Go which needs to provide a way to resolve variables declared in the global scope of Python scripts. In the future I would like to extend this to Node.js as well. It needs to be cross-platform.
Basically, if someone were to have the following Python code:
#!/usr/bin/env python
hello = "world"
some_var = "one"
another_var = "two"
var_three = some_func()
I would like to have access to these variable keys and values in my Golang code. In case of the function, I would like to have access to the value it returns.
My current idea is to run the script with the Golang exec.Command function and have the variables printed to its stdout in some format (e.g. JSON), which in turn can be parsed with Golang. Thoughts?
They are different runtime environments. Golang cannot directly access variables in Python's runtime, and vice versa. You can, however, program them to pass variable values through standard I/O or environment variables. The key is to determine the proper format for the information exchange.
For example, if the Python script takes arguments as input and prints the result, encoded as JSON, to stdout, then you can call the script with the proper arguments and decode the stdout as JSON.
Such as:
range.py
import json
import sys
def toNum(str):
    return int(str)
def main(argv):
    # Basically called range() with all the arguments from script call
    print(json.dumps(list(range(*map(toNum, argv)))))
if __name__ == '__main__':
    main(sys.argv[1:])
main.go
package main
import (
"encoding/json"
"fmt"
"log"
"os/exec"
)
func pythonRange(start, stop, step int) (c []byte, err error) {
return exec.Command(
"python3",
"./range.py",
fmt.Sprintf("%d", start),
fmt.Sprintf("%d", stop),
fmt.Sprintf("%d", step),
).Output()
}
func main() {
var arr []int
// get the output of the python script
result, err := pythonRange(1, 10, 1)
if err != nil {
log.Fatal(err)
}
// decode the stdout of the python script
// as a json array of integer
err = json.Unmarshal(result, &arr)
if err != nil {
log.Fatal(err)
}
// show the result with log.Printf
log.Printf("%#v", arr)
}
Output global variables
To output global variables in Python as JSON object:
import json
def dump_globals():
    # Collect every JSON-serializable global variable of this script
    vars = dict()
    for (key, value) in globals().items():
        if key.startswith("__") and key.endswith("__"):
            continue # skip __varname__ variables
        try:
            json.dumps(value) # test if value is json serializable
            vars[key] = value
        except (TypeError, ValueError):
            continue # skip modules, functions and other non-serializable values
    print(json.dumps(vars))
foo = "foo"
bar = "bar"
dump_globals()
Output:
{"foo": "foo", "bar": "bar"}
You can use a main() similar to the last one for this script:
package main
import (
"encoding/json"
"fmt"
"log"
"os/exec"
)
func pythonGetVars() (c []byte, err error) {
return exec.Command(
"python3",
"./dump_globals.py",
).Output()
}
func main() {
var vars map[string]interface{}
// get the output of the python script
result, err := pythonGetVars()
if err != nil {
log.Fatal(err)
}
// decode the json object
err = json.Unmarshal(result, &vars)
if err != nil {
log.Fatal(err)
}
// show the result with fmt.Printf
fmt.Printf("%#v", vars)
}
Output:
map[string]interface {}{"bar":"bar", "foo":"foo"}
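If the target scripts cannot be modified to call dump_globals() themselves, a small wrapper script could load them instead. This is a sketch and not part of the answer above: it relies on runpy.run_path from the standard library, which executes a file and returns its global namespace, and collect_globals is a hypothetical helper name.

```python
import json
import runpy


def collect_globals(script_path):
    """Run a Python script and return its JSON-serializable global variables."""
    namespace = runpy.run_path(script_path)  # executes the script
    result = {}
    for key, value in namespace.items():
        if key.startswith("__") and key.endswith("__"):
            continue  # skip __name__, __file__ and similar entries
        try:
            json.dumps(value)  # test if the value is JSON serializable
        except (TypeError, ValueError):
            continue  # skip modules, functions and similar objects
        result[key] = value
    return result

# Usage from a wrapper script that the Go program would call:
# print(json.dumps(collect_globals("/path/to/user_script.py")))
```

Keep in mind that this executes the user's script with all of its side effects, just as running it directly would.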

Convert Outlook PST to json using libpst

I have an Outlook PST file, and I'd like to get a json of the emails, e.g. something like
{"emails": [
{"from": "alice@example.com",
"to": "bob@example.com",
"bcc": "eve@example.com",
"subject": "mitm",
"content": "be careful!"
}, ...]}
I've thought of using readpst to convert to MH format and then scanning it in a ruby/python/bash script. Is there a better way?
Unfortunately the ruby-msg gem doesn't work on my PST files (and looks like it wasn't updated since 2014).
I found a way to do it in 2 stages, first convert to mbox and then to json:
# requires installing libpst
pst2json my.pst
# or you can specify a custom output dir and an outlook mail folder,
# e.g. Inbox, Sent, etc.
pst2json -o email/ -f Inbox my.pst
Where pst2json is my script and mbox2json is slightly modified from Mining the Social Web.
pst2json:
#!/usr/bin/env bash
usage(){
echo "usage: $(basename $0) [-o <output-dir>] [-f <folder>] <pst-file>"
echo "default output-dir: email/mbox-all/<pst-file>"
echo "default folder: Inbox"
exit 1
}
which readpst || { echo "Error: libpst not installed"; exit 1; }
folder=Inbox
while (( $# > 0 )); do
[[ -n "$pst_file" ]] && usage
case "$1" in
-o)
if [[ -n "$2" ]]; then
out_dir="$2"
shift 2
else
usage
fi
;;
-f)
if [[ -n "$2" ]]; then
folder="$2"
shift 2
else
usage
fi
;;
*)
pst_file="$1"
shift
esac
done
default_out_dir="email/mbox-all/$(basename $pst_file)"
out_dir=${out_dir:-"$default_out_dir"}
mkdir -p "$out_dir"
readpst -o "$out_dir" "$pst_file"
[[ -f "$out_dir/$folder" ]] || { echo "Error: folder $folder is missing or empty."; exit 1; }
res="$out_dir"/"$folder".json
mbox2json "$out_dir/$folder" "$res" && echo "Success: result saved to $res"
mbox2json (python 2.7):
# -*- coding: utf-8 -*-
import sys
import mailbox
import email
import quopri
import json
from BeautifulSoup import BeautifulSoup
MBOX = sys.argv[1]
OUT_FILE = sys.argv[2]
SKIP_HTML=True
def cleanContent(msg):
    # Decode message from "quoted printable" format
    msg = quopri.decodestring(msg)
    # Strip out HTML tags, if any are present
    soup = BeautifulSoup(msg)
    return ''.join(soup.findAll(text=True))
def jsonifyMessage(msg):
    json_msg = {'parts': []}
    for (k, v) in msg.items():
        json_msg[k] = v.decode('utf-8', 'ignore')
    # The To, CC, and Bcc fields, if present, could have multiple items
    # Note that not all of these fields are necessarily defined
    for k in ['To', 'Cc', 'Bcc']:
        if not json_msg.get(k):
            continue
        json_msg[k] = json_msg[k].replace('\n', '').replace('\t', '').replace('\r'
            , '').replace(' ', '').decode('utf-8', 'ignore').split(',')
    try:
        for part in msg.walk():
            json_part = {}
            if part.get_content_maintype() == 'multipart':
                continue
            type = part.get_content_type()
            if SKIP_HTML and type == 'text/html':
                continue
            json_part['contentType'] = type
            content = part.get_payload(decode=False).decode('utf-8', 'ignore')
            json_part['content'] = cleanContent(content)
            json_msg['parts'].append(json_part)
    except Exception, e:
        sys.stderr.write('Skipping message - error encountered (%s)\n' % (str(e), ))
    finally:
        return json_msg
# There's a lot of data to process, so use a generator to do it. See http://wiki.python.org/moin/Generators
# Using a generator requires a trivial custom encoder be passed to json for serialization of objects
class Encoder(json.JSONEncoder):
    def default(self, o):
        return {'emails': list(o)}
# The generator itself...
def gen_json_msgs(mb):
    while 1:
        msg = mb.next()
        if msg is None:
            break
        yield jsonifyMessage(msg)
mbox = mailbox.UnixMailbox(open(MBOX, 'rb'), email.message_from_file)
json.dump(gen_json_msgs(mbox),open(OUT_FILE, 'wb'), indent=4, cls=Encoder)
Now, it's possible to process the file easily. E.g. to get just the contents of the emails:
jq '.emails[] | .parts[] | .content' < out/Inbox.json
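The mbox2json script above targets Python 2.7. For Python 3, a rough equivalent can be sketched with only the standard library. This is not a port of the script above: it assumes that the folder file written by readpst is in mbox format, keeps only text/plain parts, and uses the field names from the JSON in the question. mbox_to_records is a hypothetical helper name.

```python
import json
import mailbox


def mbox_to_records(mbox_path):
    """Return {"emails": [...]} built from the messages in an mbox file."""
    records = []
    box = mailbox.mbox(mbox_path)
    try:
        for msg in box:
            texts = []
            for part in msg.walk():
                if part.get_content_maintype() == "multipart":
                    continue  # containers carry no text of their own
                if part.get_content_type() != "text/plain":
                    continue  # skip HTML bodies and attachments
                payload = part.get_payload(decode=True) or b""
                charset = part.get_content_charset() or "utf-8"
                try:
                    texts.append(payload.decode(charset, "ignore"))
                except LookupError:  # unknown charset name in the message
                    texts.append(payload.decode("utf-8", "ignore"))
            records.append({
                "from": str(msg.get("From", "")),
                "to": str(msg.get("To", "")),
                "bcc": str(msg.get("Bcc", "")),
                "subject": str(msg.get("Subject", "")),
                "content": "\n".join(texts),
            })
    finally:
        box.close()
    return {"emails": records}

# Usage:
# with open("Inbox.json", "w", encoding="utf-8") as out:
#     json.dump(mbox_to_records("email/Inbox"), out, indent=4)
```

Unlike the generator-based version above, this builds the whole list in memory, which is simpler but may not suit very large mailboxes.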
