Extract json values using just regex - python

I have a description field that is embedded within json and I'm unable to utilize json libraries to parse this data.
I use {0,23} in order in attempt to extract first 23 characters of string, how to extract entire value associated with description ?
import re
description = "'\description\" : \"this is a tesdt \n another test\" "
re.findall(r'description(?:\w+){0,23}', description, re.IGNORECASE)
For above code just ['description'] is displayed

You could try this code out:
import re
description = "description\" : \"this is a tesdt \n another test\" "
result = re.findall(r'(?<=description")(?:\s*\:\s*)(".{0,23}?(?=")")', description, re.IGNORECASE+re.DOTALL)[0]
print(result)
Which gives you the result of:
"this is a tesdt
another test"
Which is essentially:
\"this is a tesdt \n another test\"
And is what you have asked for in the comments.
Explanation -
(?<=description") is a positive look-behind that tells the regex to match the text preceded by description"
(?:\s*\:\s*) is a non-capturing group that tells the regex that description" will be followed by zero-or-more spaces, a colon (:) and again zero-or-more spaces.
(".{0,23}?(?=")") is the actual match desired, which consists of a double-quotes ("), zero-to-twenty three characters, and a double-quotes (") at the end.

# First just creating some test JSON
import json
data = {
'items': [
{
'description': 'A "good" thing',
# This is ignored because I'm assuming we only want the exact key 'description'
'full_description': 'Not a good thing'
},
{
'description': 'Test some slashes: \\ \\\\ \" // \/ \n\r',
},
]
}
j = json.dumps(data)
print(j)
# The actual code
import re
pattern = r'"description"\s*:\s*("(?:\\"|[^"])*?")'
descriptions = [
# I'm using json.loads just to parse the matched string to interpret
# escapes properly. If this is not acceptable then ast.literal_eval
# will probably also work
json.loads(d)
for d in re.findall(pattern, j)]
# Testing that it works
assert descriptions == [item['description'] for item in data['items']]

Related

Parsing Regular expression from YAML file adds extra \

I have a bunch of regular expression I am using to scrape lot of specific fields from a text document. Those all work fine when used directly inside the python script.
But I thought of putting them in a YAML file and reading from there. Here's how it looks:
# Document file for Regular expression patterns for a company invoice
---
issuer: ABCCorp
fields:
invoice_number: INVOICE\s*(\S+)
invoice_date: INVOICE DATE\s*(\S+)
cusotmer_id: CUSTOMER ID\s*(\S+)
origin: ORIGIN\s*(.*)ETD
destination: DESTINATION\s*(.*)ETA
sub_total: SUBTOTAL\s*(\S+)
add_gst: SUBTOTAL\s*(\S+)
total_cost: TOTAL USD\s*(\S+)
description_breakdown: (?s)(DESCRIPTION\s*GST IN USD\s*.+?TOTAL CHARGES)
package_details_fields: (?s)(WEIGHT\s*VOLUME\s*.+?FLIGHT|ROAD REFERENCE)
mawb_hawb: (?s)((FLIGHT|ROAD REFERENCE).*(MAWB|MASTER BILL)\s*.+?GOODS COLLECTED FROM)
When I retrieve it using pyyml in python, it is adding a string quote around that (which is ok as I can add r'' later) but I see it is also adding extra \ in between the regex. That would make the regex go wrong when used in code now
import yaml
with open(os.path.join(TEMPLATES_DIR,"regex_template.yml")) as f:
my_dict = yaml.safe_load(f)
print(my_dict)
{'issuer': 'ABCCorp', 'fields': {'invoice_number': 'INVOICE\\s*(\\S+)', 'invoice_date': 'INVOICE DATE\\s*(\\S+)', 'cusotmer_id': 'CUSTOMER ID\\s*(\\S+)', 'origin': 'ORIGIN\\s*(.*)ETD', 'destination': 'DESTINATION\\s*(.*)ETA', 'sub_total': 'SUBTOTAL\\s*(\\S+)', 'add_gst': 'SUBTOTAL\\s*(\\S+)', 'total_cost': 'TOTAL USD\\s*(\\S+)', 'description_breakdown': '(?s)(DESCRIPTION\\s*GST IN USD\\s*.+?TOTAL CHARGES)', 'package_details_fields': '(?s)(WEIGHT\\s*VOLUME\\s*.+?FLIGHT|ROAD REFERENCE)', 'mawb_hawb'
How to read the right regex as I have it in yaml file? Does any string written in yaml file gets a quotation mark around that when read in python because that is a string?
EDIT:
The main regex in yaml file is:
INVOICE\s*(\S+)
Output in dict is:
'INVOICE\\s*(\\S+)'
This is too long to do as a comment.
The backslash character is used to escape special characters. For example:
'\n': newline
'\a': alarm
When you use it before a letter that has no special meaning it is just taken to be a backslash character:
'\s': backslash followed by 's'
But to be sure, whenever you want to enter a backslash character in a string and not have it interpreted as the start of an escape sequence, you double it up:
'\\s': also a backslash followed by a 's'
'\\a': a backslash followed by a 'a'
If you use a r'' type literal, then a backslash is never interpreted as the start of an escape sequence:
r'\a': a backslash followed by 'a' (not an alarm character)
r'\n': a backslash followed by n (not a newline -- however when used in a regex. it will match a newline)
Now here is the punchline:
When you print out these Python objects, such as:
d = {'x': 'ab\sd'}
print(d)
Python will print the string representation of the dictionary and the string will print:
'ab\\sd'. If you just did:
print('ab\sd')
You would see ab\sd. Quite a difference.
Why the difference. See if this makes sense:
d = {'x': 'ab\ncd'}
print(d)
print('ab\ncd')
Results:
d = {'x': 'ab\ncd'}
ab
cd
The bottom line is that when you print a Python object other than a string, it prints a representation of the object showing how you would have created it. And if the object contains a string and that string contains a backslash, you would have doubled up on that backslash when entering it.
Update
To process your my_dict: Since you did not provide the complete value of my_dict, I can only use a truncated version for demo purposes. But this will demonstrate that my_dict has perfectly good regular expressions:
import re
my_dict = {'issuer': 'ABCCorp', 'fields': {'invoice_number': 'INVOICE\\s*(\\S+)', 'invoice_date': 'INVOICE DATE\\s*(\\S+)'}}
fields = my_dict['fields']
invoice_number_re = fields['invoice_number']
m = re.search(invoice_number_re, 'blah-blah INVOICE 12345 blah-blah')
print(m[1])
Prints:
12345
If you are going to be using the same regular expressions over and over again, then it is best to compile them:
import re
my_dict = {'issuer': 'ABCCorp', 'fields': {'invoice_number': 'INVOICE\\s*(\\S+)', 'invoice_date': 'INVOICE DATE\\s*(\\S+)'}}
#compile the strings to regular expressions
fields = my_dict['fields']
for k, v in fields.items():
fields[k] = re.compile(v)
invoice_number_re = fields['invoice_number']
m = invoice_number_re.search('blah-blah INVOICE 12345 blah-blah')
print(m[1])

How to remove text before a particular character or string in multi-line text?

I want to remove all the text before and including */ in a string.
For example, consider:
string = ''' something
other things
etc. */ extra text.
'''
Here I want extra text. as the output.
I tried:
string = re.sub("^(.*)(?=*/)", "", string)
I also tried:
string = re.sub(re.compile(r"^.\*/", re.DOTALL), "", string)
But when I print string, it did not perform the operation I wanted and the whole string is printing.
I suppose you're fine without regular expressions:
string[string.index("*/ ")+3:]
And if you want to strip that newline:
string[string.index("*/ ")+3:].rstrip()
The problem with your first regex is that . does not match newlines as you noticed. With your second one, you were closer but forgot the * that time. This would work:
string = re.sub(re.compile(r"^.*\*/", re.DOTALL), "", string)
You can also just get the part of the string that comes after your "*/":
string = re.search(r"(\*/)(.*)", string, re.DOTALL).group(2)
Update: After doing some research, I found that the pattern (\n|.) to match everything including newlines is inefficient. I've updated the answer to use [\s\S] instead as shown on the answer I linked.
The problem is that . in python regex matches everything except newlines. For a regex solution, you can do the following:
import re
strng = ''' something
other things
etc. */ extra text.
'''
print(re.sub("[\s\S]+\*/", "", strng))
# extra text.
Add in a .strip() if you want to remove that remaining leading whitespace.
to keep text until that symbol you can do:
split_str = string.split(' ')
boundary = split_str.index('*/')
new = ' '.join(split_str[0:boundary])
print(new)
which gives you:
something
other things
etc.
string_list = string.split('*/')[1:]
string = '*/'.join(string_list)
print(string)
gives output as
' extra text. \n'

Extract the data specified in brackets '[ ]' from a string message in python

I want to extract fields from below Log message.
Example:
Ignoring entry, Affected columns [column1:column2], reason[some reason], Details[some entry details]
I need to extract the data specified in the brackets [ ] for "Affected columns,reason, Details"
What would be the efficient way to extract these fields in Python?
Note: I can modify the log message format if needed.
If you are free to change the log format, it's easiest to use a common data format - I'd recommend JSON for such data. It is structured, but lightweight enough to write it even from custom bash scripts. The json module allows you to directly convert it to native python objects:
import json # python has a default parser
# assume this is your log message
log_line = '{"Ignoring entry" : {"Affected columns": [1, 3], "reason" : "some reason", "Details": {}}}'
data = json.loads(log_line)
print("Columns to ignore:", data["Ignoring entry"]["Affected columns"])
If you want to work with the current format, you'll have to work with str methods or the re module.
For example, you could do this:
log_msg = "Ignoring entry, Affected columns [column1:column2], reason[some reason], Details[some entry details]"
def parse_log_line(log_line):
if log_line.startswith("Ignoring entry"):
log_data = {
for element in log_line.split(',')[1:]: # parse all elements but the header
key, value = element.partition('[')
if value[-1] != ']':
raise ValueError('Malformed Content. Expected %r to end with "]"' % element)
value = value[:-1]
log_data[key] = value
return log_data
raise ValueError('Unrecognized log line type')
Many parsing tasks are best compactly handled by the re module. It allows you to use regular expressions. They are very powerful, but difficult to maintain if you are not used to it. In your case, the following would work:
log_data = {key: value for key, value in re.findall(',\s?(.+?)\s?\[(.+?)\]', log_line)}
The re works like this:
, a literal comma, separating your entries
\s* an arbitrary sequence of whitespace after the comma, before the next element
(.+?) any non-whitespace characters (the key, captured via '()')
\s* an arbitrary sequence of whitespace between key and value
\[ a literal [
(.+?) the shortest sequence of non-whitespace characters before the next element (the value, captured via '()')
\] a literal ]
The symbols *, + and ? mean "any", "more than one", and "as few as possible".

Python: Can't turn string into JSON

For the past few hours, I've been fighting to get a string into a JSON dict. I've tried everything from json.loads(... which throws an error:
requestInformation = json.loads(entry["request"]["postData"]["text"])
//throws this error
json.decoder.JSONDecodeError: Expecting property name enclosed in double quotes:
to stripping out the slashes using a medley of re.sub('\\','',mystring) ,mystring.sub(... to no effect. My problem string looks like so
'{items:[{n:\\'PackageChannel.GetUnitsInConfigurationForUnitType\\',ps:[{n:\\'unitType\\',v:"ActionTemplate"}]}]}'
The origin of this string is that it's a HAR dump from Google Chrome. I think those backslashes are from it being escaped somewhere along the way because the bulk of the HAR file doesn't contain them, but they do appear commonly in any field labeled "text".
"postData": {
"mimeType": "application/json",
"text": "{items:[{n:'PackageChannel.GetUnitsInConfigurationForUnitType',ps:[{n:'unitType',v:\"Analysis\"}]}]}"
}
EDIT I eventually gave up on turning the text above into JSON and instead opted for regex. Sometimes the slashes showed up, sometimes they didn't based on what I was viewing the text in and that made it difficult to work with.
the json module wants a string where the keys are also wrapped in double quotes
so the string below would work:
mystring = '{"items":[{"n":"PackageChannel.GetUnitsInConfigurationForUnitType", "ps":[{"n":"unitType","v":"ActionTemplate"}]}]}'
myjson = json.loads(mystring)
This function should remove the double backslashes and put double quotes around your keys.
import json, re
def make_jsonable(mystring):
# we'll use this regex to find any key that doesn't contain any of: {}[]'",
key_regex = "([\,\[\{](\s+)?[^\"\{\}\,\[\]]+(\s+)?:)"
mystring = re.sub("[\\\]", "", mystring) # remove any backslashes
mystring = re.sub("\'", "\"", mystring) # replace single quotes with doubles
match = re.search(key_regex, mystring)
while match:
start_index = match.start(0)
end_index = match.end(0)
print(mystring[start_index+1:end_index-1].strip())
mystring = '%s"%s"%s'%(mystring[:start_index+1], mystring[start_index+1:end_index-1].strip(), mystring[end_index-1:])
match = re.search(key_regex, mystring)
return mystring
I couldn't directly test it on the first string you wrote, the double/single quotes don't match up, but on the one in the last code sample it works.
You'll need a r before JSON String, or replace all \ with \\
This works:
import json
validasst_json = r'''{
"postData": {
"mimeType": "application/json",
"text": "{items:[{n:'PackageChannel.GetUnitsInConfigurationForUnitType',ps:[{n:'unitType',v:\"Analysis\"}]}]}"
}
}'''
txt = json.loads(validasst_json)
print(txt["postData"]['mimeType'])
print(txt["postData"]['text'])

Regular expression in python

I am trying to match/sub the following line
line1 = '# Some text\n'
But avoid match/sub lines like this
'# Some text { .blah}\n'
So in other a # followed by any amount of words spaces and numbers (no punctuation) and then the end of line.
line2 = re.sub(r'# (\P+)$', r'# \1 { .text}', line1)
Puts the contents of line1 into line2 unchanged.
(I read somewhere that \P means everything except punctuation)
line2 = re.sub(r'# (\w*\d*\s*)+$', r'# \1 { .text}', line1)
Whereas the above gives
'# { .text}'
Any help is appreciated
Thanks
Tom
Your regex is a bit weird; expanded, it looks like
r"# ([a-zA-Z0-9_]*[0-9]*[ \t\n\r\f\v]*)+$"
Things to note:
It is not anchored to the beginning of the string, meaning it would match
print("Important stuff!") # Very important
The \d* is redundant, because it is already captured by \w*
Looking at your example, it seems you should be less worried about punctuation; the only thing you cannot have is a curly-brace ({).
Try
from functools import partial
def add_text(txt):
return re.sub(r"^#([^{]*)$", r"#\1 { .text }", txt, flags=re.M)
text = "# Some text\n# More text { .blah}\nprint('abc') # but not me!\n# And once again"
print("===before===")
print(text)
print("\n===after===")
print(add_text(text))
which gives
===before===
# Some text
# More text { .blah}
print('abc') # but not me!
# And once again
===after===
# Some text { .text }
# More text { .blah}
print('abc') # but not me!
# And once again { .text }
If you only want lines which start with a # and continue with alphanumeric values, spaces and _, you want this:
/^#[\w ]+$/gm

Categories