I have a script that scrapes a list of IPs from a router. The final output should look like this:
if net ~ [
12.5.161.0/24,
12.9.242.0/24,
12.11.215.0/24,
12.17.239.0/24,
.... etc etc
216.248.237.0/24,
216.248.238.0/24,
216.248.239.0/24,
216.251.224.0/19,
216.253.79.0/24
] then {
 accept;
} else {
 reject;
}
I've gotten to the point where I can get the list of IPs in the right format, i.e.
12.5.161.0/24,
12.9.242.0/24,
12.11.215.0/24,
12.17.239.0/24,
.... etc etc
216.248.237.0/24,
216.248.238.0/24,
216.248.239.0/24,
216.251.224.0/19,
216.253.79.0/24
The problem I run into is stringing together the string at the beginning, all of my IPs as one batch in the middle, and the four-line string at the end.
So far I have:
routes = get_bird_routes(args.s)
prefixes = parse_routes(routes, args.p)
dropped = drop_prefixes(prefixes, args.d)

for p in dropped:
    lines = [ "if net ~ [", str(p), "] then {", " accept;", "} else {", " reject;", "}\n" ]
    print "\n".join(lines)
but this gives me
if net ~ [
199.89.247.0/24
] then {
 accept;
} else {
 reject;
}
if net ~ [
192.149.228.0/24
] then {
 accept;
} else {
 reject;
}
if net ~ [
206.180.165.0/24
] then {
 accept;
} else {
 reject;
}
instead of all my IPs together, with the opening string at the beginning and the closing string at the end only. I tried to see what type(p) was (before I wrapped it in str(p)) and it came back unicode. Looking at the documentation, I didn't get a clear understanding of what I was doing wrong. Still newish to Python, any help appreciated!
You should be joining dropped, not looping over it.
dropped_lines = ",\n".join(dropped)
lines = [ "if net ~ [", dropped_lines, "] then {", " accept;", "} else {", " reject;", "}\n" ]
print "\n".join(lines)
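For example, with a few sample prefixes standing in for dropped (hypothetical values, not your real data), this prints one block with every prefix inside the brackets:

dropped = [u"12.5.161.0/24", u"12.9.242.0/24", u"216.253.79.0/24"]

dropped_lines = ",\n".join(dropped)
lines = [ "if net ~ [", dropped_lines, "] then {", " accept;", "} else {", " reject;", "}\n" ]
print("\n".join(lines))

join happily accepts unicode items, so the str() call is not needed here.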
Try converting your list to strings with str() and join(). Then merge that one string into your logic statement (string) with .format().
Viz:
dropped_as_strs = map(str, dropped) # "a", "b", "c"
dropped_str = ',\n'.join(dropped_as_strs) # "a,\nb,\nc"
logic = "if net ~ [\n{}\n] then {{\n accept;\n}} ..."
result = logic.format(dropped_str)
(Note the need to double '{' and '}' in str.format() calls.)
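Putting both steps together, a complete sketch with the full template spelled out (the prefixes below are placeholders, not your data):

dropped = ["12.5.161.0/24", "12.9.242.0/24", "216.253.79.0/24"]  # sample values

dropped_str = ',\n'.join(map(str, dropped))
logic = "if net ~ [\n{}\n] then {{\n accept;\n}} else {{\n reject;\n}}\n"
print(logic.format(dropped_str))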
I'm trying to create a re pattern in Python to extract this pattern of text.
contentId: '2301ae56-3b9c-4653-963b-2ad84d06ba08'
contentId: 'a887526b-ff19-4409-91ff-e1679e418922'
The content ID is 36 characters long and is a mix of lowercase letters and numbers, with dashes at positions 9, 14, 19, and 24 (1-indexed).
Any help with this would be much appreciated as I just can't seem to get the results right now.
r1 = re.findall(r'^[a-zA-Z0-9~##$^*()_+=[\]{}|\\,.?: -]*{36}$',f.read())
print(r1)
Below is the file I'm trying to pull from
Object.defineProperty(e, '__esModule', { value: !0 }), e.default = void 0;
var t = r(d[0])(r(d[1])), n = r(d[0])(r(d[2])), o = r(d[0])(r(d[3])), c = r(d[0])(r(d[4])), l = r(d[0])(r(d[5])), u = function (t) {
    return [
        {
            contentId: '2301ae56-3b9c-4653-963b-2ad84d06ba08',
            prettyId: 'super',
            style: { height: 0.5 * t }
        },
        {
            contentId: 'a887526b-ff19-4409-91ff-e1679e418922',
            prettyId: 'zap',
            style: { height: t }
        }
    ];
},
Is there a typo in the regex in your question? *{36} after the bracket ] that closes the character group causes an error: multiple repeat. Did you mean r'^[a-zA-Z0-9~##$^*()_+=[\]{}|\\,.?: -]{36}$'?
Fixing that, you get no results because ^ anchors the match to the start of the line, and $ to the end of the line, so you'd only get results if this pattern was alone on a single line.
Removing these anchors, we get lots of matches, because the pattern matches any 36-character run of those characters:
r1 = re.findall(r'[a-zA-Z0-9~##$^*()_+=[\]{}|\\,.?: -]{36}',t)
r1: ['var t = r(d[0])(r(d[1])), n = r(d[0]',
')(r(d[2])), o = r(d[0])(r(d[3])), c ',
'= r(d[0])(r(d[4])), l = r(d[0])(r(d[',
'2301ae56-3b9c-4653-963b-2ad84d06ba08',
' style: { height: 0.5',
'a887526b-ff19-4409-91ff-e1679e418922',
' style: { height: t }']
To match only your IDs, look only for alphanumeric characters and dashes.
r1 = re.findall(r'[a-zA-Z0-9\-]{36}',t)
r1: ['2301ae56-3b9c-4653-963b-2ad84d06ba08',
'a887526b-ff19-4409-91ff-e1679e418922']
To make it even more specific, you could specify the positions of the dashes:
r1 = re.findall(r'[a-z0-9]{8}-[a-z0-9]{4}-[a-z0-9]{4}-[a-z0-9]{4}-[a-z0-9]{12}', t, re.IGNORECASE)
r1: ['2301ae56-3b9c-4653-963b-2ad84d06ba08',
'a887526b-ff19-4409-91ff-e1679e418922']
Specifying the re.IGNORECASE flag removes the need to look for both upper- and lower-case characters.
Note:
You should read the file into a variable and use that variable if you're going to use its contents more than once, since f.read() won't return anything after the first .read() unless you f.seek(0) first.
To avoid creating a new file on disk with those contents, I just defined
t = """Object.defineProperty(e, '__esModule', { value: !0 }), e.default = void 0;
var t = r(d[0])(r(d[1])), n = r(d[0])(r(d[2])), o = r(d[0])(r(d[3])), c = r(d[0])(r(d[4])), l = r(d[0])(r(d[5])), u = function (t) {
    return [
        {
            contentId: '2301ae56-3b9c-4653-963b-2ad84d06ba08',
            prettyId: 'super',
            style: { height: 0.5 * t }
        },
        {
            contentId: 'a887526b-ff19-4409-91ff-e1679e418922',
            prettyId: 'zap',
            style: { height: t }
        }
    ];
},"""
and used t in place of f.read() from your question.
I'm writing an automation script in Python that makes use of another library. The output I'm given contains the array I need; however, the output also includes irrelevant log messages in string format.
For my script to work, I need to retrieve only the array which is in the file.
Here's an example of the output I'm getting.
Split /adclix.$~image into 2 rules
Split /mediahosting.engine$document,script into 2 rules
[
  {
    "action": {
      "type": "block"
    },
    "trigger": {
      "url-filter": "/adservice\\.",
      "unless-domain": [
        "adservice.io"
      ]
    }
  }
]
Generated a total of 1 rules (1 blocks, 0 exceptions)
How would I get only the array from this file?
FWIW, I'd rather not have the logic based on the strings outside of the array, as they could be subject to change.
UPDATE: Script I'm getting the data from is here: https://github.com/brave/ab2cb/tree/master/ab2cb
My full code is here:
def pipe_in(process, filter_lists):
    try:
        for body, _, _ in filter_lists:
            process.stdin.write(body)
    finally:
        process.stdin.close()

def write_block_lists(filter_lists, path, expires):
    block_list = generate_metadata(filter_lists, expires)
    process = subprocess.Popen(('ab2cb'),
                               cwd=ab2cb_dirpath,
                               stdin=subprocess.PIPE, stdout=subprocess.PIPE)
    threading.Thread(target=pipe_in, args=(process, filter_lists)).start()
    result = process.stdout.read()
    with open('output.json', 'w') as destination_file:
        destination_file.write(result)
    if process.wait():
        raise Exception('ab2cb returned %s' % process.returncode)
The output will ideally be modified in stdout and written later to file as I still need to modify the data within the previously mentioned array.
You can use a regex, too:
import re
input = """
Split /adclix.$~image into 2 rules
Split /mediahosting.engine$document,script into 2 rules
[
  {
    "action": {
      "type": "block"
    },
    "trigger": {
      "url-filter": "/adservice\\.",
      "unless-domain": [
        "adservice.io"
      ]
    }
  }
]
Generated a total of 1 rules (1 blocks, 0 exceptions)
asd
asd
"""
regex = re.compile(r"\[(.|\n)*(?:^\]$)", re.M)
x = re.search(regex, input)
print(x.group(0))
EDIT
re.M turns on 'MultiLine matching'
https://repl.it/repls/InfantileDopeyLink
I have written a library for this purpose. It's not often that I get to plug it!
from jsonfinder import jsonfinder
logs = r"""
Split /adclix.$~image into 2 rules
Split /mediahosting.engine$document,script into 2 rules
[
  {
    "action": {
      "type": "block"
    },
    "trigger": {
      "url-filter": "/adservice\\.",
      "unless-domain": [
        "adservice.io"
      ]
    }
  }
]
Generated a total of 1 rules (1 blocks, 0 exceptions)
Something else that looks like JSON: [1, 2]
"""
for start, end, obj in jsonfinder(logs):
    if (
        obj
        and isinstance(obj, list)
        and isinstance(obj[0], dict)
        and {"action", "trigger"} <= obj[0].keys()
    ):
        print(obj)
Demo: https://repl.it/repls/ImperfectJuniorBootstrapping
Library: https://github.com/alexmojaki/jsonfinder
Install with pip install jsonfinder.
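If you would rather not add a dependency, the standard library can do something similar: json.JSONDecoder.raw_decode parses one JSON value starting at a given index and reports where it ended, so you can scan the log text for candidate brackets. A minimal sketch of that idea (my own, not part of either answer above), reusing the logs string:

import json

def find_json_arrays(text):
    # scan for '[' and try to parse a complete JSON value starting there
    decoder = json.JSONDecoder()
    idx = 0
    while True:
        start = text.find('[', idx)
        if start == -1:
            return
        try:
            obj, end = decoder.raw_decode(text, start)
        except ValueError:
            idx = start + 1  # not valid JSON here; keep scanning
        else:
            yield obj
            idx = end

for obj in find_json_arrays(logs):
    if obj and isinstance(obj[0], dict) and {"action", "trigger"} <= obj[0].keys():
        print(obj)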
I need to enhance my script below, which takes an input file that contains almost a million unique lines. Each line has corresponding values in three lookup files, which I intend to add to my output as comma-separated values.
The script below works fine, but it takes hours to finish the job. I am looking for a really fast solution that would also be less heavy on the system.
#!/bin/bash
while read -r ONT
do
  {
    ONSTATUS=$(grep "$ONT," lookupfile1.csv | cut -d" " -f2)
    CID=$(grep "$ONT." lookupfile3.csv | head -1 | cut -d, -f2)
    line1=$(grep "$ONT.C2.P1," lookupfile2.csv | head -1 | cut -d"," -f2,7 | sed 's/ //')
    line2=$(grep "$ONT.C2.P2," lookupfile2.csv | head -1 | cut -d"," -f2,7 | sed 's/ //')
    echo "$ONT,$ONSTATUS,$CID,$line1,$line2" >> BUwithPO.csv
  } &
done < inputfile.csv
inputfile.csv contains the lines shown below:
343OL5:LT1.PN1.ONT1
343OL5:LT1.PN1.ONT10
225OL0:LT1.PN1.ONT34
225OL0:LT1.PN1.ONT39
343OL5:LT1.PN1.ONT100
225OL0:LT1.PN1.ONT57
lookupfile1.csv contains:
343OL5:LT1.PN1.ONT100, Down,Locked,No
225OL0:LT1.PN1.ONT57, Up,Unlocked,Yes
343OL5:LT1.PN1.ONT1, Down,Unlocked,No
225OL0:LT1.PN1.ONT34, Up,Unlocked,Yes
225OL0:LT1.PN1.ONT39, Up,Unlocked,Yes
lookupfile2.csv contains:
225OL0:LT1.PN1.ONT34.C2.P1, +123125302766,REG,DigitMap,Unlocked,_media_BNT,FD_BSFU.xml,
225OL0:LT1.PN1.ONT57.C2.P1, +123125334019,REG,DigitMap,Unlocked,_media_BNT,FD_BSFU.xml,
225OL0:LT1.PN1.ONT57.C2.P2, +123125334819,REG,DigitMap,Unlocked,_media_BNT,FD_BSFU.xml,
343OL5:LT1.PN1.ONT100.C2.P11, +123128994019,REG,DigitMap,Unlocked,_media_ANT,FD_BSFU.xml,
lookupfile3.csv contains:
343OL5:LT1.PON1.ONT100.SERV1,12-654-0330
343OL5:LT1.PON1.ONT100.C1.P1,12-654-0330
343OL5:LT7.PON8.ONT75.SERV1,12-664-1186
225OL0:LT1.PN1.ONT34.C1.P1.FLOW1,12-530-2766
225OL0:LT1.PN1.ONT57.C1.P1.FLOW1,12-533-4019
the output is:
225OL0:LT1.PN1.ONT57, Up,Unlocked,Yes,12-533-4019,+123125334019,FD_BSFU.xml,+123125334819,FD_BSFU.xml
225OL0:LT1.PN1.ONT34, Up,Unlocked,Yes,12-530-2766,+123125302766,FD_BSFU.xml,
343OL5:LT1.PN1.ONT1, Down,Unlocked,No,,,
343OL5:LT1.PN1.ONT100, Down,Locked,No,,,
343OL5:LT1.PN1.ONT10,,,,
225OL0:LT1.PN1.ONT39, Up,Unlocked,Yes,,,
As you have seen, the bottleneck is executing grep multiple times within the loop. You can increase the efficiency by creating look-up tables with associative arrays.
If awk is available, please try the following:
[Update]
#!/bin/bash
awk '
FILENAME=="lookupfile1.csv" {
    sub(",$", "", $1);
    onstatus[$1] = $2
}
FILENAME=="lookupfile2.csv" {
    split($2, a, ",")
    if (sub("\\.C2\\.P1,$", "", $1)) line1[$1] = a[1]","a[6]
    else if (sub("\\.C2\\.P2,$", "", $1)) line2[$1] = a[1]","a[6]
}
FILENAME=="lookupfile3.csv" {
    split($0, a, ",")
    if (match(a[1], ".+\\.ONT[0-9]+")) {
        ont = substr(a[1], RSTART, RLENGTH)
        cid[ont] = a[2]
    }
}
FILENAME=="inputfile.csv" {
    print $0","onstatus[$0]","cid[$0]","line1[$0]","line2[$0]
}
' lookupfile1.csv lookupfile2.csv lookupfile3.csv inputfile.csv > BUwithPO.csv
[Edit]
If you need to specify absolute paths to the files, please try:
#!/bin/bash
awk '
FILENAME ~ /lookupfile1.csv$/ {
    sub(",$", "", $1);
    onstatus[$1] = $2
}
FILENAME ~ /lookupfile2.csv$/ {
    split($2, a, ",")
    if (sub("\\.C2\\.P1,$", "", $1)) line1[$1] = a[1]","a[6]
    else if (sub("\\.C2\\.P2,$", "", $1)) line2[$1] = a[1]","a[6]
}
FILENAME ~ /lookupfile3.csv$/ {
    split($0, a, ",")
    if (match(a[1], ".+\\.ONT[0-9]+")) {
        ont = substr(a[1], RSTART, RLENGTH)
        cid[ont] = a[2]
    }
}
FILENAME ~ /inputfile.csv$/ {
    print $0","onstatus[$0]","cid[$0]","line1[$0]","line2[$0]
}
' /path/to/lookupfile1.csv /path/to/lookupfile2.csv /path/to/lookupfile3.csv /path/to/inputfile.csv > /path/to/BUwithPO.csv
Hope this helps.
If, as you have indicated in the comments, you are unable to use the solution provided by @tshiono because your awk lacks gensub (a GNU awk extension), you can replace gensub with two calls to sub and a temporary variable to accomplish trimming the needed suffix.
Example:
awk '
FILENAME=="lookupfile1.csv" {
    sub(",$", "", $1);
    onstatus[$1] = $2
}
FILENAME=="lookupfile2.csv" {
    split($2, a, ",")
    if (sub("\\.C2\\.P1,$", "", $1)) line1[$1] = a[1]","a[6]
    else if (sub("\\.C2\\.P2,$", "", $1)) line2[$1] = a[1]","a[6]
}
FILENAME=="lookupfile3.csv" {
    split($0, a, ",")
    # ont = gensub("(\\.ONT[0-9]+).*", "\\1", 1, a[1])
    sfx = a[1]
    sub(/^.*[.]ONT[^.]*/, "", sfx)
    sub(sfx, "", a[1])
    # cid[ont] = a[2]
    cid[a[1]] = a[2]
}
FILENAME=="inputfile.csv" {
    print $0","onstatus[$0]","cid[$0]","line1[$0]","line2[$0]
}
' lookupfile1.csv lookupfile2.csv lookupfile3.csv inputfile.csv > BUwithPO.csv
I have commented out the use of gensub in the portion related to FILENAME=="lookupfile3.csv" and replaced the gensub expression with two calls to sub using sfx (suffix) as the temporary variable.
Give it a try and let me know if you are able to use that.
Perl solution
The following script is similar to the awk solution but written in Perl.
Save it as filter.pl and make it executable.
#!/usr/bin/env perl

use strict;
use warnings;

my %lookup1;
my %lookup2_1;
my %lookup2_2;
my %lookup3;

while( <> ) {
    if ( $ARGV eq 'lookupfile1.csv' ) {
        # 225OL0:LT1.PN1.ONT34, Up,Unlocked,Yes
        # ^^^^^^^^^^^^^^^^^^^^  ^^^^^^^^^^^^^^^
        if (/^([^,]+),\s*(.*)$/) {
            $lookup1{$1} = $2;
        }
    } elsif ( $ARGV eq 'lookupfile2.csv' ) {
        # 225OL0:LT1.PN1.ONT34.C2.P1, +123125302766,REG,DigitMap,Unlocked,_media_BNT,FD_BSFU.xml,
        # ^^^^^^^^^^^^^^^^^^^^        ^^^^^^^^^^^^^                       ^^^^^^^^^^^
        if (/^(.+ONT\d+)\.C2\.P1,\s*([^,]+),(?:[^,]+,){4}([^,]+)/) {
            $lookup2_1{$1} = "$2,$3";
        } elsif (/^(.+ONT\d+)\.C2\.P2,\s*([^,]+),(?:[^,]+,){4}([^,]+)/) {
            $lookup2_2{$1} = "$2,$3";
        }
    } elsif ( $ARGV eq 'lookupfile3.csv' ) {
        # 225OL0:LT1.PN1.ONT34.C1.P1.FLOW1,12-530-2766
        # ^^^^^^^^^^^^^^^^^^^^             ^^^^^^^^^^^
        if (/^(.+ONT\d+)[^,]+,\s*(.*)$/) {
            $lookup3{$1} = $2;
        }
    } else { # assume 'inputfile.csv'
        no warnings 'uninitialized'; # because not all keys ($_) have values in the lookup tables
        # 225OL0:LT1.PN1.ONT34
        chomp;
        print "$_,$lookup1{$_},$lookup3{$_},$lookup2_1{$_},$lookup2_2{$_}\n";
    }
}
Execute it like so:
./filter.pl lookupfile{1,2,3}.csv inputfile.csv > BUwithPO.csv
It is important that the lookup files come first (as in the awk solutions, by the way), because they build the four dictionaries (hashes, in Perl parlance) %lookup1, %lookup2_1, etc., and the values from inputfile.csv are then matched against those dictionaries.
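For comparison, here is the same build-the-dictionaries-first idea in Python; a sketch of my own (not part of either answer) that assumes exactly the field layouts shown in the question:

import re

onstatus, cid, line1, line2 = {}, {}, {}, {}

with open('lookupfile1.csv') as f:
    for row in f:
        key, _, rest = row.partition(',')
        onstatus[key] = rest.rstrip()          # keeps " Up,Unlocked,Yes" etc.

with open('lookupfile2.csv') as f:
    for row in f:
        fields = [x.strip() for x in row.split(',')]
        if fields[0].endswith('.C2.P1'):
            line1[fields[0][:-len('.C2.P1')]] = fields[1] + ',' + fields[6]
        elif fields[0].endswith('.C2.P2'):
            line2[fields[0][:-len('.C2.P2')]] = fields[1] + ',' + fields[6]

with open('lookupfile3.csv') as f:
    for row in f:
        key, _, value = row.partition(',')
        m = re.match(r'.+\.ONT\d+', key)       # same key trimming as the awk version
        if m:
            cid[m.group(0)] = value.strip()

with open('inputfile.csv') as src, open('BUwithPO.csv', 'w') as dst:
    for row in src:
        ont = row.strip()
        dst.write(','.join([ont, onstatus.get(ont, ''), cid.get(ont, ''),
                            line1.get(ont, ''), line2.get(ont, '')]) + '\n')

Like the awk and Perl versions, this reads each file exactly once, so the cost is linear instead of a million grep invocations.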
How can I cut from such a string (JSON) everything before and including the first [ and everything after and including the last ] with Python?
{
  "Customers": [
    {
      "cID": "w2-502952",
      "soldToId": "34124"
    },
    ...
    ...
  ],
  "status": {
    "success": true,
    "message": "Customers: 560",
    "ErrorCode": ""
  }
}
In the end I want to have only:
{
  "cID" : "w2-502952",
  "soldToId" : "34124",
}
...
...
String manipulation is not the way to do this. You should parse your JSON into Python and extract the relevant data using normal data structure access.
obj = json.loads(data)
relevant_data = obj["Customers"]
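A runnable version of that, with the question's sample trimmed down (hypothetical data):

import json

data = '''{
  "Customers": [
    {"cID": "w2-502952", "soldToId": "34124"}
  ],
  "status": {"success": true, "message": "Customers: 560", "ErrorCode": ""}
}'''

obj = json.loads(data)
relevant_data = obj["Customers"]
print(relevant_data)  # [{'cID': 'w2-502952', 'soldToId': '34124'}]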
In addition to @Daniel Rosman's answer, if you want all the lists from the JSON:
result = []
obj = json.loads(data)
for value in obj.values():
    if isinstance(value, list):
        result.extend(value)
While I agree that Daniel's answer is the absolute best way to go, if you must use string splitting, you can try .find()
string = ...  # however you are loading this JSON text into a string
start = string.find('[') + 1
end = string.find(']')
customers = string[start:end]
print(customers)
The output will be everything between the [ and ] braces.
If you really want to do this via string manipulation (which I don't recommend), you can do it this way:
start = s.find('[') + 1
finish = s.find(']')
inner = s[start : finish]
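Since the question asks about the last ], rfind is the safer choice for the closing bracket when the text can contain more than one. A small variation, with a hypothetical one-line sample:

s = '{ "Customers": [ {"cID": "w2-502952"} ], "status": { "ErrorCode": "" } }'

start = s.find('[') + 1   # just past the first '['
finish = s.rfind(']')     # the last ']'
print(s[start:finish].strip())  # {"cID": "w2-502952"}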
I have a non-standard "JSON" file to parse. Each item is semicolon-separated instead of comma-separated. I can't simply replace ; with , because there might be some value containing ;, e.g. "hello; world". How can I parse this into the same structure that a JSON parser would normally produce?
{
  "client" : "someone";
  "server" : ["s1"; "s2"];
  "timestamp" : 1000000;
  "content" : "hello; world";
  ...
}
Use the Python tokenize module to transform the text stream to one with commas instead of semicolons. The Python tokenizer is happy to handle JSON input too, even including semicolons. The tokenizer presents strings as whole tokens, and 'raw' semicolons appear in the stream as single tokenize.OP tokens for you to replace:
import tokenize
import json
corrected = []
with open('semi.json', 'r') as semi:
    for token in tokenize.generate_tokens(semi.readline):
        if token[0] == tokenize.OP and token[1] == ';':
            corrected.append(',')
        else:
            corrected.append(token[1])
data = json.loads(''.join(corrected))
This assumes that the format becomes valid JSON once you've replaced the semicolons with commas; e.g. no trailing commas before a closing ] or } are allowed, although you could even track the last comma added and remove it again if the next non-newline token is a closing brace (a sketch of that idea follows the demo below).
Demo:
>>> import tokenize
>>> import json
>>> open('semi.json', 'w').write('''\
... {
...   "client" : "someone";
...   "server" : ["s1"; "s2"];
...   "timestamp" : 1000000;
...   "content" : "hello; world"
... }
... ''')
>>> corrected = []
>>> with open('semi.json', 'r') as semi:
...     for token in tokenize.generate_tokens(semi.readline):
...         if token[0] == tokenize.OP and token[1] == ';':
...             corrected.append(',')
...         else:
...             corrected.append(token[1])
...
>>> print ''.join(corrected)
{
"client":"someone",
"server":["s1","s2"],
"timestamp":1000000,
"content":"hello; world"
}
>>> json.loads(''.join(corrected))
{u'content': u'hello; world', u'timestamp': 1000000, u'client': u'someone', u'server': [u's1', u's2']}
Inter-token whitespace was dropped, but could be re-instated by paying attention to the tokenize.NL tokens and the (lineno, start) and (lineno, end) position tuples that are part of each token. Since the whitespace around the tokens doesn't matter to a JSON parser, I've not bothered with this.
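To illustrate the trailing-comma handling suggested above, here is a sketch of my own (using the Python 3 tokenize API with token attributes; the answer's code is Python 2) that takes back a substituted comma when the next significant token is a closing brace or bracket:

import json
import tokenize

def semicolons_to_json(path):
    corrected = []
    pending = None  # index in `corrected` of the last ',' we substituted
    with open(path) as semi:
        for tok in tokenize.generate_tokens(semi.readline):
            if tok.type == tokenize.OP and tok.string == ';':
                corrected.append(',')
                pending = len(corrected) - 1
                continue
            if tok.string in ('}', ']') and pending is not None:
                corrected[pending] = ''  # the comma turned out to be trailing
            if tok.type not in (tokenize.NL, tokenize.NEWLINE):
                pending = None           # a real token followed the comma
            corrected.append(tok.string)
    return json.loads(''.join(corrected))

With this, semicolons_to_json('semi.json') also accepts input like {"a": 1; "b": [1; 2;];} where semicolons trail the last element.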
You can do some odd things and get it (probably) right.
Because strings in JSON cannot contain raw control characters such as \t, you could replace every ; with \t, (a tab followed by a comma), so the file will parse correctly if your JSON parser is able to load non-strict JSON (such as Python's).
Afterwards, you only need to dump your data back to JSON so you can replace all these \t, back with ; and use a normal JSON parser to finally load the correct object.
Some sample code in Python:
data = '''{
  "client" : "someone";
  "server" : ["s1"; "s2"];
  "timestamp" : 1000000;
  "content" : "hello; world"
}'''

import json

dec = json.JSONDecoder(strict=False).decode(data.replace(';', '\t,'))
enc = json.dumps(dec)
out = json.loads(enc.replace('\\t,', ';'))
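A quick round-trip check, continuing from the block above:

print(out['content'])  # hello; world
print(out['server'])   # ['s1', 's2']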
Using a simple character state machine, you can convert this text back to valid JSON. The basic thing we need to handle is to determine the current "state" (whether we are escaping a character, or inside a string, list, or dictionary), and replace ';' with ',' when in the right state.
I don't know if this is the proper way to write it; there is probably a way to make it shorter, but I don't have enough programming skills to make an optimal version.
I tried to comment as much as I could:
def filter_characters(text):
    # we use this dictionary to match opening/closing tokens
    STATES = {
        '"': '"', "'": "'",
        "{": "}", "[": "]"
    }
    # these two variables represent the current state of the parser
    escaping = False
    state = list()
    # we iterate through each character
    for c in text:
        if escaping:
            # if we are currently escaping, no special treatment
            escaping = False
        else:
            if c == "\\":
                # character is a backslash, set the escaping flag for the next character
                escaping = True
            elif state and c == state[-1]:
                # character is expected closing token, update state
                state.pop()
            elif c in STATES:
                # character is known opening token, update state
                state.append(STATES[c])
            elif c == ';' and state == ['}']:
                # this is the delimiter we want to change
                c = ','
        yield c
    assert not state, "unexpected end of file"

def filter_text(text):
    return ''.join(filter_characters(text))
Testing with:
{
  "client" : "someone";
  "server" : ["s1"; "s2"];
  "timestamp" : 1000000;
  "content" : "hello; world";
  ...
}
Returns:
{
  "client" : "someone",
  "server" : ["s1"; "s2"],
  "timestamp" : 1000000,
  "content" : "hello; world",
  ...
}
Pyparsing makes it easy to write a string transformer. Write an expression for the string to be changed, and add a parse action (a parse-time callback) to replace the matched text with what you want. If you need to avoid some cases (like quoted strings or comments), then include them in the scanner, but just leave them unchanged. Then, to actually transform the string, call scanner.transformString.
(It wasn't clear from your example whether you might have a ';' after the last element in one of your bracketed lists, so I added a term to suppress these, since a trailing ',' in a bracketed list is also invalid JSON.)
sample = """
{
"client" : "someone";
"server" : ["s1"; "s2"];
"timestamp" : 1000000;
"content" : "hello; world";
}"""
from pyparsing import Literal, replaceWith, Suppress, FollowedBy, quotedString
import json
SEMI = Literal(";")
repl_semi = SEMI.setParseAction(replaceWith(','))
term_semi = Suppress(SEMI + FollowedBy('}'))
qs = quotedString
scanner = (qs | term_semi | repl_semi)
fixed = scanner.transformString(sample)
print(fixed)
print(json.loads(fixed))
prints:
{
  "client" : "someone",
  "server" : ["s1", "s2"],
  "timestamp" : 1000000,
  "content" : "hello; world"}
{'content': 'hello; world', 'timestamp': 1000000, 'client': 'someone', 'server': ['s1', 's2']}