Python: Extract numbers after the string foo: with regex - python

I have data like this:
{"address": "1GocfVCWTiRViPtqZetcX4UiCxnKxgTHwK", "balance": 1234, "pending_balance": 0, "paid_out": 0}
I want to extract the number after balance, but it can be anything from 0 to infinity.
So, from the example above, the desired output is:
1234
And, by the way, one more question.
I have data like this:
{"address": "1GocfVCWTiRViPtqZetcX4UiCxnKxgTHwK", "invoice": "invNrKU2ZFMuAJKUiejyVe3X34ybP9awyWZBfUEdY2dZKxYTB8ajW", "redeem_code": "BTCvQDD9xFYHHDYNi1JYeLY1eEkGFBFB49qojETjLBZ2CVYyPm56B"}
What's the normal way of doing this:
strs = repr(s)
address = s[13:47]
invoice = s[62:115]
redeem_code = s[134:187]
print(address)
print(invoice)
print(redeem_code)
Thx for help.

Don't ever use regexes to parse structured data like this. Once parsed by proper means (json.loads or ast.literal_eval both work here), the data becomes a native Python structure that is trivial to access.
In your case, using json.loads in one line:
import json
print(json.loads('{"address": "1GocfVCWTiRViPtqZetcX4UiCxnKxgTHwK", "balance": 1234, "pending_balance": 0, "paid_out": 0}')["balance"])
result:
1234
(same method applies for your second question)
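For the second question, the same approach might look like the sketch below. The variable name s and the sample string are taken from the question; everything else is only illustrative.

```python
import json

# Sample string from the second question.
s = ('{"address": "1GocfVCWTiRViPtqZetcX4UiCxnKxgTHwK", '
     '"invoice": "invNrKU2ZFMuAJKUiejyVe3X34ybP9awyWZBfUEdY2dZKxYTB8ajW", '
     '"redeem_code": "BTCvQDD9xFYHHDYNi1JYeLY1eEkGFBFB49qojETjLBZ2CVYyPm56B"}')

data = json.loads(s)  # parse once, then look fields up by key
address = data["address"]
invoice = data["invoice"]
redeem_code = data["redeem_code"]

print(address)
print(invoice)
print(redeem_code)
```

Unlike fixed slices such as s[13:47], key lookups keep working when the length of a value changes.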

Actually what you are showing us is what in Python is called a dictionary.
That is a set of key-value pairs.
Look here for more info: https://docs.python.org/3.6/tutorial/datastructures.html#dictionaries
Your dictionary has the following keys and values:
"address" --> "1GocfVCWTiRViPtqZetcX4UiCxnKxgTHwK"
"balance" --> 1234
"pending_balance" --> 0
"paid_out" --> 0
Now if what you have is a dictionary:
d = {"address": "1GocfVCWTiRViPtqZetcX4UiCxnKxgTHwK", "balance": 1234, "pending_balance": 0, "paid_out": 0}
print(d.get('balance')) #1234
If however what you have is an external file with that information or you got it from a web service of some sort, you have a string representation of a dictionary. Here is where the JSON-library becomes valuable:
import json
# Assuming you got a string
s = '{"address": "1GocfVCWTiRViPtqZetcX4UiCxnKxgTHwK", "balance": 1234, "pending_balance": 0, "paid_out": 0}'
d = json.loads(s) # <-- converts the string to a dictionary
print(d.get('balance')) #1234

Your data looks like JSON, so the preferable way of dealing with it is to parse it with the json module:
import json
parsed_data = json.loads(data)
balance = parsed_data['balance']
If using regular expressions is a must, you can use the following code:
import re
match = re.search(r'"balance": (\d+)', data)
balance = int(match.group(1))
In this example we use \d+ to match a string of digits and parentheses to create a group. Group 0 is the whole matched string and group 1 is the first group we created.
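One thing the snippet above does not cover is input without a balance field, where re.search returns None. A minimal sketch that guards against that, with extract_balance as a hypothetical helper name and \s* added to tolerate varying whitespace after the colon:

```python
import re

# Hypothetical helper; returns None when no "balance" field is present.
BALANCE = re.compile(r'"balance":\s*(\d+)')

def extract_balance(data):
    match = BALANCE.search(data)
    if match is None:
        return None
    return int(match.group(1))  # group 1 holds only the digits

sample = '{"address": "x", "balance": 1234, "pending_balance": 0, "paid_out": 0}'
print(extract_balance(sample))
print(extract_balance('{"address": "x"}'))
```

The leading quote in the pattern keeps it from matching "pending_balance".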

Related

Parsing data containing escaped quotes and separators in python

I have data that is structured like this:
1661171420, foo="bar", test="This, is a \"TEST\"", count=5, com="foo, bar=blah"
It always starts with a unix timestamp, but then I can't know how many other fields follow and how they are called.
The goal is to parse this into a dictionary as such:
{"timestamp": 1661171420,
"foo": "bar",
"test": 'This, is a "TEST"',
"count": 5,
"com": "foo, bar=blah"}
I'm having trouble parsing this, especially regarding the escaped quotes and commas in the values.
What would be the best way to parse this correctly? preferably without any 3rd party modules.
If changing the format of input data is not an option (JSON would be much easier to handle, but if it is an API as you say then you might be stuck with this) the following would work assuming the file follows given structure more or less. Not the cleanest solution, I agree, but it does the job.
import re
d = r'''1661171420, foo="bar", test="This, is a \"TEST\"", count=5, com="foo, bar=blah", fraction=-0.11'''.replace(r"\"", "'''")
string_pattern = re.compile(r'''(\w+)="([^"]*)"''')
matches = re.finditer(string_pattern, d)
parsed_data = {}
parsed_data['timestamp'] = int(d.partition(", ")[0])
for match in matches:
    parsed_data[match.group(1)] = match.group(2).replace("'''", "\"")
number_pattern = re.compile(r'''(\w+)=([+-]?\d+(?:\.\d+)?)''')
matches = re.finditer(number_pattern, d)
for match in matches:
    try:
        parsed_data[match.group(1)] = int(match.group(2))
    except ValueError:
        parsed_data[match.group(1)] = float(match.group(2))
print(parsed_data)
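As an alternative sketch, a single pattern can read quoted and unquoted values in one pass, so text inside a quoted value (such as bar=blah) is never re-scanned as a separate field. The names FIELD, to_number and parse_line are hypothetical, and the code assumes every field has the form key=value with values either double-quoted or bare.

```python
import re

# key = "quoted value with \-escapes"  OR  key = bare value up to the next comma
FIELD = re.compile(r'(\w+)=(?:"((?:[^"\\]|\\.)*)"|([^,]*))')

def to_number(text):
    """Return int or float when possible, otherwise the text unchanged."""
    for cast in (int, float):
        try:
            return cast(text)
        except ValueError:
            pass
    return text

def parse_line(line):
    timestamp, _, rest = line.partition(",")
    result = {"timestamp": int(timestamp)}
    for match in FIELD.finditer(rest):
        key, quoted, bare = match.groups()
        if quoted is not None:
            # Drop the backslash in front of each escaped character.
            result[key] = re.sub(r'\\(.)', r'\1', quoted)
        else:
            result[key] = to_number(bare.strip())
    return result

line = r'1661171420, foo="bar", test="This, is a \"TEST\"", count=5, com="foo, bar=blah"'
print(parse_line(line))
```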

Converting data inside a JSON file into Python dict

I have a JSON file and the inside of it looks like this:
"{'reviewId': 'gp:AOqpTOGiJUWB2pk4jpWSuvqeXofM9B4LQQ4Iom1mNeGzvweEriNTiMdmHsxAJ0jaJiK7CbjJ_s7YEWKE2DA_Qzo', 'userName': '\u00c0ine Mongey', 'userImage': 'https://play-lh.googleusercontent.com/a-/AOh14GhUv3c6xHP4kvLSJLaRaydi6o2qxp6yZhaLeL8QmQ', 'content': \"Honestly a great game, it does take a while to get money at first, and they do make it easier to get money by watching ads. I'm glad they don't push it though, and the game is super relaxing and fun!\", 'score': 5, 'thumbsUpCount': 2, 'reviewCreatedVersion': '1.33.0', 'at': datetime.datetime(2021, 4, 23, 8, 20, 34), 'replyContent': None, 'repliedAt': None}"
I am trying to convert this into a dict and then to a pandas DataFrame. I tried this but it will just turn this into a string representation of a dict, not a dict itself:
with open('sampledict.json') as f:
    dictdump = json.loads(f.read())
print(type(dictdump))
I feel like I am so close now but I can't find out what I miss to get a dict out of this. Any input will be greatly appreciated!
If I get your data format correctly, this will work:
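import datetime  # eval needs this name: the stored text contains datetime.datetime(...)
import json
with open('sampledict.json') as f:
    d = json.load(f)  # a str holding the repr of a dict
d = eval(d)  # eval runs arbitrary code, so only use it on files you trust
# Or this works as well
with open('sampledict.json') as f:
    d = json.loads(f.read())
d = eval(d)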
>>> d.keys()
['userName', 'userImage', 'repliedAt', 'score', 'reviewCreatedVersion', 'at', 'replyContent', 'content', 'reviewId', 'thumbsUpCount']
Are you sure that you have your source JSON correct? The JSON snippet you have provided is a string; it has a " at the start and end. So in its current form getting a string is correct behaviour.
Note also that it is a string representation of a Python dict rather than a JSON object. This is evidenced by the fact that the strings are denoted by single quotes rather than double, and the use of the Python keyword None rather than the JSON null.
If the JSON file represented a plain object, its content would have the form:
{
"reviewId": "gp:AO...",
"userName": "...",
"replyContent": null,
"repliedAt": null
}
I.e. the first and last characters are curly braces, not double quotes.
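The question does not show how the file was written, so the following is only a guess at the cause: passing str(some_dict) to json.dumps stores one JSON string, while passing the dict itself stores a JSON object. The review dict here is a shortened, made-up stand-in, and default=str is just one way to serialise the datetime.

```python
import datetime
import json

# Shortened stand-in for the review record in the question.
review = {
    "score": 5,
    "at": datetime.datetime(2021, 4, 23, 8, 20, 34),
    "replyContent": None,
}

# Dumping the repr of the dict produces a single JSON *string*.
as_string = json.dumps(str(review))
print(type(json.loads(as_string)))

# Dumping the dict itself produces a JSON *object*; default=str turns
# values json cannot encode (the datetime) into their str() form.
as_object = json.dumps(review, default=str)
loaded = json.loads(as_object)
print(type(loaded), loaded)
```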

How to fix missing double quotes issue when parsing JSON data?

I am running a piece of code in Python3 where I am consuming JSON data from the source. I don't have control over the source. While reading the json data I am getting following error:
simplejson.errors.JSONDecodeError: Expecting property name enclosed in double quotes: line 1 column 2
Here is the code
import logging
import simplejson as json
from kafka import KafkaConsumer
logging.basicConfig(level=logging.INFO)
consumer = KafkaConsumer(
    bootstrap_servers='localhost:9092',
    api_version=(1,0,0))
consumer.subscribe(['Test_Topic-1'])
for message in consumer:
    msg_str=message.value
    y = json.loads(msg_str)
    print(y["city_name"])
As I cannot change the source, I need to fix it at my end. I found this post helpful, as my data contains timestamps with : in them: How to Fix JSON Key Values without double-quotes?
But it also fails for some values in my JSON data, because those values contain : as well, e.g.
address:"1600:3050:rf02:hf64:h000:0000:345e:d321"
Is there any way where I can add double quotes to keys in my json data?
You can try to use module dirtyjson - it can fix some mistakes.
import dirtyjson
d = dirtyjson.loads('{address:"1600:3050:rf02:hf64:h000:0000:345e:d321"}')
print( d['address'] )
d = dirtyjson.loads('{abc:"1:2:3:4", efg:"5:6:7:8", "hij":"foo"}')
print( d['abc'] )
It creates an AttributedDict, so you may need dict() to get a normal dictionary:
d = dirtyjson.loads('{abc:"1:2:3:4", efg:"5:6:7:8", "hij":"foo"}')
print( d )
print( dict(d) )
Result:
AttributedDict([('abc', '1:2:3:4'), ('efg', '5:6:7:8'), ('hij', 'foo')])
{'abc': '1:2:3:4', 'efg': '5:6:7:8', 'hij': 'foo'}
I think your problem is that you have strings like this:
{abc:"1:2:3:4", efg:"5:6:7:8", "hij":"foo"}
which are not valid JSON. You could try to repair it with a regular expression substitution:
import re
jtxt_bad ='{abc:"1:2:3:4", efg:"5:6:7:8", "hij":"foo", klm:"bar"\n}'
jtxt = re.sub(r'\b([a-zA-Z]+):("[^"]+"[,\n}])', r'"\1":\2', jtxt_bad)
print(f'Original: {jtxt_bad}\nRepaired: {jtxt}')
The output of this is:
Original: {abc:"1:2:3:4", efg:"5:6:7:8", "hij":"foo", klm:"bar"
}
Repaired: {"abc":"1:2:3:4", "efg":"5:6:7:8", "hij":"foo", "klm":"bar"
}
The regular expression \b([a-zA-Z]+):("[^"]+"[,\n}]) means: a word boundary, followed by one or more letters, followed by a :, followed by a double-quoted string, followed by one of ,, \n or }. However, this will fail if there is a quote inside the string, such as "1:\"2:3".
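One possible way around the escaped-quote limitation is to let the pattern consume whole string literals first and only quote bare names that are followed by a colon. This is a sketch under the assumption that keys are simple identifiers; TOKEN and quote_keys are hypothetical names, and other defects (single quotes, trailing commas) are not handled.

```python
import json
import re

# Alternative 1: a complete double-quoted string, backslash escapes included.
# Alternative 2: a bare identifier that is directly followed by a colon.
TOKEN = re.compile(r'"(?:[^"\\]|\\.)*"|([A-Za-z_]\w*)(?=\s*:)')

def quote_keys(text):
    def fix(match):
        if match.group(1) is None:
            return match.group(0)        # string literal: leave untouched
        return '"' + match.group(1) + '"'  # bare key: add the quotes
    return TOKEN.sub(fix, text)

bad = r'{abc:"1:\"2:3", address:"1600:3050:rf02:hf64:h000:0000:345e:d321", "hij":"foo"}'
repaired = quote_keys(bad)
print(repaired)
print(json.loads(repaired))
```

Because string literals are matched as a unit, the colons and escaped quotes inside them are never examined as potential keys.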

Importing wrongly concatenated JSONs in python

I have a text document that contains several thousand JSON strings in the form "{...}{...}{...}". This is not valid JSON itself, but each {...} is.
I currently use the following regular expression to split them:
fp = open('my_file.txt', 'r')
raw_dataset = (re.sub('}{', '}\n{', fp.read())).split('\n')
This basically inserts a line break wherever one curly bracket closes and another opens (}{ -> }\n{), so I can split the objects into different lines.
The problem is that a few of them have a tags attribute written as "{tagName1}{tagName2}", which breaks my regular expression.
An example would be:
'{"name":\"Bob Dylan\", "tags":"{Artist}{Singer}"}{"name": "Michael Jackson"}'
Is parsed into
'{"name":"Bob Dylan", "tags":"{Artist}'
'{Singer}"}'
'{"name": "Michael Jackson"}'
instead of
'{"name":"Bob Dylan", "tags":"{Artist}{Singer}"}'
'{"name": "Michael Jackson"}'
What is the proper way to achieve this for further JSON parsing?
Use the raw_decode method of json.JSONDecoder
>>> import json
>>> d = json.JSONDecoder()
>>> x='{"name":\"Bob Dylan\", "tags":"{Artist}{Singer}"}{"name": "Michael Jackson"}'
>>> d.raw_decode(x)
({'tags': '{Artist}{Singer}', 'name': 'Bob Dylan'}, 47)
>>> x=x[47:]
>>> d.raw_decode(x)
({'name': 'Michael Jackson'}, 27)
raw_decode returns a 2-tuple, the first element being the decoded JSON and the second being the index in the string at which the JSON document ended, i.e. where the next one starts.
To loop until the end or until an invalid JSON element is encountered:
>>> while True:
...     try:
...         j,n = d.raw_decode(x)
...     except ValueError:
...         break
...     print(j)
...     x=x[n:]
...
{'name': 'Bob Dylan', 'tags': '{Artist}{Singer}'}
{'name': 'Michael Jackson'}
When the loop breaks, inspection of x will reveal if it has processed the whole string or had encountered a JSON syntax error.
With a very long file of short elements you might read a chunk into a buffer and apply the above loop, concatenating anything that's left over with the next chunk after the loop breaks.
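The same idea can be wrapped in a generator that passes a start index to raw_decode instead of re-slicing the string. This is a minimal sketch for text that fits in memory; iter_json_documents is a hypothetical name, and a JSON syntax error is simply allowed to propagate.

```python
import json

def iter_json_documents(text):
    """Yield each JSON value from a string of concatenated documents."""
    decoder = json.JSONDecoder()
    pos = 0
    end = len(text)
    while True:
        # raw_decode does not skip leading whitespace, so do that here.
        while pos < end and text[pos].isspace():
            pos += 1
        if pos >= end:
            return
        obj, pos = decoder.raw_decode(text, pos)  # pos now points past obj
        yield obj

x = '{"name":"Bob Dylan", "tags":"{Artist}{Singer}"}{"name": "Michael Jackson"}'
for document in iter_json_documents(x):
    print(document)
```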
You can use the jq command line utility to transfer your input to json. Let's say you have the following input:
input.txt:
{"name":"Bob Dylan", "tags":"{Artist}{Singer}"}{"name": "Michael Jackson"}
You can use jq -s, which consumes multiple json documents from input and transfers them into a single output array:
jq -s . input.txt
Gives you:
[
{
"name": "Bob Dylan",
"tags": "{Artist}{Singer}"
},
{
"name": "Michael Jackson"
}
]
I've just realized that there are python bindings for libjq. Meaning you don't need to use the command line, you can use jq directly in python.
https://github.com/mwilliamson/jq.py
However, I've not tried it so far. Let me give it a try :) ...
Update: The above library is nice, but it does not support the slurp mode so far.
You need to write a parser; I don't think a regex can help you with this:
import json

def get_dicts(file_text):
    data = ""
    curlies = []  # stack of currently open braces
    for letter in file_text:
        data += letter
        if letter == "{":
            curlies.append(letter)
        elif letter == "}":
            curlies.pop() # remove last
            if not curlies:
                yield json.loads(data)
                data = ""
Note that this does not actually solve the problem that {name:"bob"} is not valid JSON; {"name":"bob"} is.
This will also break if you have unbalanced braces inside strings, e.g. {"name":"{{}}}"} would break it.
Really, judging by your example, your JSON is so broken that your best bet is probably to edit it by hand and fix the code that generates it. If that is not feasible, you may need to write a more complex parser using pylex or some other grammar library (effectively writing your own language parser).

decoding JSON data with backslash encoding

I have the following JSON data.
"[
\"msgType\": \"0\",
\"tid\": \"1\",
\"data\": \"[
{
\\\"EventName\\\": \\\"TExceeded\\\",
\\\"Severity\\\": \\\"warn\\\",
\\\"Subject\\\": \\\"Exceeded\\\",
\\\"Message\\\": \\\"tdetails: {
\\\\\\\"Message\\\\\\\": \\\\\\\"my page tooktoolong(2498ms: AT: 5ms,
BT: 1263ms,
CT: 1230ms),
andexceededthresholdof5ms\\\\\\\",
\\\\\\\"Referrer\\\\\\\": \\\\\\\"undefined\\\\\\\",
\\\\\\\"Session\\\\\\\": \\\\\\\"None\\\\\\\",
\\\\\\\"ResponseTime\\\\\\\": 0,
\\\\\\\"StatusCode\\\\\\\": 0,
\\\\\\\"Links\\\\\\\": 215,
\\\\\\\"Images\\\\\\\": 57,
\\\\\\\"Forms\\\\\\\": 2,
\\\\\\\"Platform\\\\\\\": \\\\\\\"Linuxx86_64\\\\\\\",
\\\\\\\"BrowserAppname\\\\\\\": \\\\\\\"Netscape\\\\\\\",
\\\\\\\"AppCodename\\\\\\\": \\\\\\\"Mozilla\\\\\\\",
\\\\\\\"CPUs\\\\\\\": 8,
\\\\\\\"Language\\\\\\\": \\\\\\\"en-GB\\\\\\\",
\\\\\\\"isEvent\\\\\\\": \\\\\\\"true\\\\\\\",
\\\\\\\"PageLatency\\\\\\\": 2498,
\\\\\\\"Threshold\\\\\\\": 5,
\\\\\\\"AT\\\\\\\": 5,
\\\\\\\"BT\\\\\\\": 1263,
\\\\\\\"CT\\\\\\\": 1230
}\\\",
\\\"EventTimestamp\\\": \\\"1432514783269\\\"
}
]\",
\"Timestamp\": \"1432514783269\",
\"AppName\": \"undefined\",
\"Group\": \"UndefinedGroup\"
]"
I want to turn this JSON file into a single level of wrapping, i.e. I want to remove the nested structure inside and copy that data over to the top-level JSON structure. How can I do this?
If this structure is named json_data,
I want to be able to access
json_data['Platform']
json_data['BrowserAppname']
json_data['Severity']
json_data['msgType']
Basically some kind of rudimentary normalization. What is the easiest way to do this using Python?
A generally unsafe but probably okay in this case solution would be:
import json
d = json.loads(json_string.replace('\\', ''))
I'm not sure what happened but this doesn't look like valid JSON.
1. You have some double-quotes escaped once, some twice, some three times etc.
2. You have key/value pairs inside of a list-like object []
3. tdetails is missing a trailing quote
4. Even if you fix the above you still have your data list quoted as a multi-line string which is invalid.
It appears to be that this "JSON" was constructed by hand, by someone with no knowledge of JSON.
You can try "massaging" the data into JSON with the following:
import re
x = re.sub(r'\\+', '', js_str)  # drop every run of backslashes
x = re.sub(r'\n', '', x)  # continue from the previous result, not from js_str
x = '{' + x.strip()[1:-1] + '}'
Which would make the string almost json like, but you still need to fix point #3.
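Once the data is well-formed, the flattening asked for in the question could be sketched as follows. This assumes that every nested level is a valid JSON document stored as a string, which is not true of the sample as posted; flatten and decode_if_json are hypothetical names, and a key that appears at several levels is overwritten by the later occurrence.

```python
import json

def decode_if_json(value):
    """Return the parsed value for a JSON string, else the value itself."""
    if isinstance(value, str):
        try:
            return json.loads(value)
        except ValueError:
            return value
    return value

def flatten(node, out=None):
    """Copy every scalar found at any nesting depth into one flat dict."""
    out = {} if out is None else out
    if isinstance(node, list):
        for item in node:
            flatten(item, out)
    elif isinstance(node, dict):
        for key, value in node.items():
            decoded = decode_if_json(value)
            if isinstance(decoded, (dict, list)):
                flatten(decoded, out)   # nested container: descend into it
            else:
                out[key] = value        # scalar: keep the original value
    return out

# Made-up, well-formed miniature of the structure in the question.
inner = {"Platform": "Linuxx86_64", "BrowserAppname": "Netscape", "CPUs": 8}
event = {"Severity": "warn", "Message": json.dumps(inner)}
outer = {"msgType": "0", "data": json.dumps([event])}

json_data = flatten(outer)
print(json_data["Platform"], json_data["Severity"], json_data["msgType"])
```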
