Python regex finding all multi-line occurrences in file - python

I have a text file including tons of output like this:
<Tons-of-random-text>
...
Received Access-Accept (code=2) length=300
Attribute 2 (User-Name) length=16
Value 'myTextValue1'
Attribute 4 (Bla1) length=16
Value 'myTextValue2'
Attribute 6 (Bla2) length=16
Value 0xABCDEFG
<Tons-of-random-text>
At the end of the day I want to use named capture groups to extract:
the code from the first line,
a list of attributes.
Based on the example above, the desired extracted data structure is:
code=2
attributes = [
    {
        "attribute": "2",
        "attribute-name": "User-Name",
        "value": "myTextValue1"
    },
    {
        "attribute": "4",
        "attribute-name": "Bla1",
        "value": "myTextValue2"
    },
    # ...
]
I'm struggling with finditer and findall; I'm only able to match the first or the last attribute block. Does anybody have a hint for a good regex?

This regex will match the line containing (code=\d) and the lines that follow, as long as each line contains "Attribute" or "Value".
/(\(code=\d\).*$(\n^\s*(Attribute|Value).*$)+)/gm
see https://regex101.com/r/HzIW5V/1
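In Python, a minimal sketch along the same lines (assuming the whole file has been read into a string called log_text; the group names are just illustrative) could use re.finditer with named groups to grab each block and then the attribute/value pairs inside it:
import re

# Assumption: log_text holds the full file, e.g. log_text = open("output.txt").read()
block_re = re.compile(r"\(code=(?P<code>\d+)\).*(?:\n\s*(?:Attribute|Value).*)+")
attr_re = re.compile(
    r"Attribute (?P<attribute>\d+) \((?P<name>[^)]+)\).*\n"
    r"\s*Value '?(?P<value>[^'\n]+)'?"
)

for block in block_re.finditer(log_text):
    code = block.group("code")
    attributes = [
        {
            "attribute": m.group("attribute"),
            "attribute-name": m.group("name"),
            "value": m.group("value"),
        }
        for m in attr_re.finditer(block.group(0))
    ]
    print(code, attributes)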

Related

Python file search using regex

I have a file that has many lines. Each line starts with {"id": followed by the id number in quotes (e.g. {"id": "106"). I am trying to use regex to search the whole document line by line and print the lines that match 5 different id values. To do this I made a list with the ids and want to iterate through the list, only matching lines that start with {"id": "(id number from list)". I am really confused about how to do this. Here is what I have so far:
f= "bdata.txt"
statids = ["85", "106", "140", "172" , "337"]
x= re.findall('{"id":', statids, 'f')
for line in open(file):
    print(x)
The error I keep getting is: TypeError: unsupported operand type(s) for &: 'str' and 'int'
I need the whole line to be matched so I can split it and put it into a class.
Any advice? Thanks for your time.
You can retrieve the id from the line using the regex, ^\{\"id\": \"(\d+)\" where the value of group#1 will give you the id. Then, you can check if the id is present in statids.
Demo:
import re
statids = ["85", "106", "140", "172", "337"]
with open("bdata.txt") as file:
    for line in file:
        search = re.search(r'^\{"id": "(\d+)"', line)
        if search:
            id = search.group(1)
            if id in statids:
                print(line.rstrip())
For the following sample content in the file:
{"id": "100" hello
{"id": "106" world
{"id": "2" hi
{"id": "85" bye
{"id": "10" ok
{"id": "140" good
{"id": "165" fine
{"id": "172" great
{"id": "337" morning
{"id": "16" evening
the output will be:
{"id": "106" world
{"id": "85" bye
{"id": "140" good
{"id": "172" great
{"id": "337" morning
I think the issue here is the way you're using re.findall: according to the docs, you have to pass a regular expression as the first argument and the string you want to match against as the second argument. In your case, I think this is how you should do it:
pattern = f'"id": "({"|".join(statids)})"'  # e.g. '"id": "(85|106|140|172|337)"'
with open(f) as file:  # f = "bdata.txt" from the question
    for line in file:
        match = re.search(pattern, line)  # re.findall returns a plain list, so use re.search here
        if match:
            print(match.group(0))
In the regex, the pipe operator "|" works like "or", so joining all the ids into one string with | between them matches any line that contains one of the ids. The match.group(0) line prints the text that was matched.

How to select certain parts of a JSON file using python?

I'm pretty new to programming. How would I select a certain part of the JSON file and have it display the value? For example:
{
    "name": "John",
    "age": 18,
    "state": "New York"
}
What would be the Python code needed to get the value of any of the items by giving it the key? (e.g. I give it "name" and the output displays "John")
Put the key in brackets.
myJSON = {"name": "John", "age": 18, "state": "New York"}  # that JSON, as a Python dict
print(myJSON["name"])   # prints John
print(myJSON["age"])    # prints 18
print(myJSON["state"])  # prints New York
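If what you have is JSON text (a string or a file) rather than a Python dict, parse it first with the standard json module; a small sketch, with the filename only as an example:
import json

# Parse a JSON string into a dict
myJSON = json.loads('{"name": "John", "age": 18, "state": "New York"}')
print(myJSON["name"])  # John

# Or load it from a file (example filename)
with open("person.json") as f:
    myJSON = json.load(f)
print(myJSON["age"])   # 18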

Parsing incomplete json array

I have downloaded 5MB of a very large json file. From this, I need to be able to load that 5MB to generate a preview of the json file. However, the file will probably be incomplete. Here's an example of what it may look like:
[{
    "first": "bob",
    "address": {
        "street": 13301,
        "zip": 1920
    }
}, {
    "first": "sarah",
    "address": {
        "street": 13301,
        "zip": 1920
    }
}, {"first" : "tom"
From here, I'd like to "rebuild it" so that it can parse the first two objects (and ignore the third).
Is there a json parser that can infer or cut off the end of the string to make it parsable? Or perhaps to 'stream' the parsing of the json array, so that when it fails on the last object, I can exit the loop? If not, how could the above be accomplished?
If your data will always look somewhat similar, you could do something like this:
import json
json_string = """[{
    "first": "bob",
    "address": {
        "street": 13301,
        "zip": 1920
    }
}, {
    "first": "sarah",
    "address": {
        "street": 13301,
        "zip": 1920
    }
}, {"first" : "tom"
"""
while True:
    if not json_string:
        raise ValueError("Couldn't fix JSON")
    try:
        data = json.loads(json_string + "]")
    except json.decoder.JSONDecodeError:
        json_string = json_string[:-1]
        continue
    break
print(data)
This assumes that the data is a list of dicts. On each pass, a closing ] is appended and the result is tried as JSON: if it parses, the loop breaks; otherwise the last character is removed and the next pass tries again. If there are no characters left, ValueError("Couldn't fix JSON") is raised.
For the above example, it prints:
[{'first': 'bob', 'address': {'zip': 1920, 'street': 13301}}, {'first': 'sarah', 'address': {'zip': 1920, 'street': 13301}}]
For the specific structure in the example we can walk through the string and track occurrences of curly brackets and their closing counterparts. If at the end one or more curly brackets remain unmatched, we know that this indicates an incomplete object. We can then strip any intermediate characters such as commas or whitespace and close the resulting string with a square bracket.
This method ensures that the string is only parsed twice, once manually and once by the JSON parser, which might be advantageous for large text files (with incomplete objects consisting of many characters).
import json

string = json_string  # the truncated JSON text from the example above
brackets = []
for i, c in enumerate(string):
    if c == '{':
        brackets.append(i)
    elif c == '}':
        brackets.pop()
if brackets:
    # an unmatched '{' marks the start of the incomplete trailing object: cut it off
    string = string[:brackets[0]].rstrip(', \n')
if not string.endswith(']'):
    string += ']'
data = json.loads(string)
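The question also asks about "streaming" the parse and stopping at the first failure. Purely as a sketch of that idea, the standard library's json.JSONDecoder.raw_decode parses one value at a time and reports where it stopped, so the truncated trailing object simply ends the loop:
import json

decoder = json.JSONDecoder()
objects = []
idx = json_string.find('[') + 1  # skip the opening bracket of the array
while idx < len(json_string):
    # skip whitespace and the commas between array elements
    while idx < len(json_string) and json_string[idx] in ' \t\r\n,':
        idx += 1
    try:
        obj, idx = decoder.raw_decode(json_string, idx)
    except json.JSONDecodeError:
        break  # the next element is incomplete: stop here
    objects.append(obj)
print(objects)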

Python JSON parser error: json.decoder.JSONDecodeError: Expecting value: line 1 column 1 (char 0)

I need some help parsing a JSON file. I've tried a couple of different ways to get the data I need. Below is a sample of the code and also a section of the JSON data, but when I run the code I get the error listed above.
There are 500K lines of text in the JSON, and it first fails about 1,400 lines in; I can't see anything in that section to indicate why.
I've run it successfully on blocks of JSON from the first 1,400 lines, and I've used a different parser and got the same error.
I'm debating whether it's an error in the code, an error in the JSON, or a result of the JSON containing different kinds of data: some of it (like the example below) is for a forklift and some is for fixed machines, but it is all structured just like below.
All help is sincerely appreciated.
Code:
import json
file_list = ['filename.txt'] #insert filename(s) here
for x in range(len(file_list)):
    with open(file_list[x], 'r') as f:
        distros_dict = json.load(f)
#list the headlines to be parsed
for distro in distros_dict:
    print(distro['name'], distro['positionTS'], distro['smoothedPosition'][0], distro['smoothedPosition'][1], distro['smoothedPosition'][2])
And here is a section of the JSON:
{
    "id": "b4994c877c9c",
    "name": "Trukki_0001",
    "areaId": "Tracking001",
    "areaName": "Ajoneuvo",
    "color": "#FF0000",
    "coordinateSystemId": "CoordSys001",
    "coordinateSystemName": null,
    "covarianceMatrix": [
        0.47,
        0.06,
        0.06,
        0.61
    ],
    "position": [
        33.86,
        33.07,
        2.15
    ],
    "positionAccuracy": 0.36,
    "positionTS": 1489363199493,
    "smoothedPosition": [
        33.96,
        33.13,
        2.15
    ],
    "zones": [
        {
            "id": "Zone001",
            "name": "Halli1"
        }
    ],
    "direction": [
        0,
        0,
        0
    ],
    "collisionId": null,
    "restrictedArea": "",
    "tagType": "VEHICLE_MANNED",
    "drivenVehicleId": null,
    "drivenByEmployeeIds": null,
    "simpleXY": "33|33",
    "EventProcessedUtcTime": "2017-03-13T00:00:00.3175072Z",
    "PartitionId": 1,
    "EventEnqueuedUtcTime": "2017-03-13T00:00:00.0470000Z"
}
The actual problem was that the JSON file was encoded in UTF rather than plain ASCII. If you change the encoding using something like Notepad++, it will be solved.
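As an alternative to re-saving the file, you can pass the encoding to open() directly; a small sketch, assuming the file is UTF-8 (utf-8-sig also strips a byte-order mark, a common cause of "Expecting value: line 1 column 1"):
import json

# Assumption: the file is UTF-8, possibly with a BOM; use the real encoding (e.g. 'utf-16') if different
with open('filename.txt', 'r', encoding='utf-8-sig') as f:
    distros_dict = json.load(f)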
Using the file provided, I got it to work by changing "distros_dict" to a list. In your code you assign to distros_dict rather than appending to it, so if more than one file were read, only the last one would be kept.
This is my implementation
import json
file_list = ['filename.txt'] #insert filename(s) here
distros_list = []
for x in range(len(file_list)):
    with open(file_list[x], 'r') as f:
        distros_list.append(json.load(f))
#list the headlines to be parsed
for distro in distros_list:
    print(distro['name'], distro['positionTS'], distro['smoothedPosition'][0], distro['smoothedPosition'][1], distro['smoothedPosition'][2])
You will be left with a list of dictionaries
I'm guessing that your JSON is actually a list of objects, i.e. the whole stream looks like:
[
    { "x": 1, "y": 2 },
    { "x": 3, "y": 4 },
    ...
]
... with each element being structured like the section you provided above. This is perfectly valid JSON, and if I store it in a file named file.txt and paste your snippet between a set of [ ], thus making it a list, I can parse it in Python. Note, however, that the result will be again a Python list, not a dict, so you'd iterate like this over each list-item:
import json
import pprint
file_list = ['file.txt']
# Just iterate over the file-list like this, no need for range()
for x in file_list:
    with open(x, 'r') as f:
        # distros is a list!
        distros = json.load(f)
        for distro in distros:
            print(distro['name'])
            print(distro['positionTS'])
            print(distro['smoothedPosition'][1])
            pprint.pprint(distro)
Edit: I moved the second for-loop into the loop over the files. This seems to make more sense, as otherwise you'll iterate once over all files, store the last one in distros, then print elements only from the last one. By nesting the loops, you'll iterate over all files, and for each file iterate over all elements in the list. Hat-tip to the commenters for pointing this out!

Iterating over JSON list in Python

I'm trying to iterate over a JSON list to print out all of the results of the following:
"examples": [
{
"text": "carry all of the blame"
},
{
"text": "she left all her money to him"
},
{
"text": "we all have different needs"
},
{
"text": "he slept all day"
},
{
"text": "all the people I met"
},
{
"text": "10% of all cars sold"
}
],
I've tried to iterate over it by doing:
iterator = 0
json_example = str(json_data['results'][0]['lexicalEntries'][0]['entries'][0]['senses'][0]['examples'][iterator]['text']).capitalize()
for i in json_example:
    print(i)
    iterator += 1
But this is only printing each letter of the first example, as opposed to the entire example followed by the other entire examples.
Can I iterate over these as I would like to, or do I need to create separate variables for each example?
Following your code and example, it looks like what you need is:
for example in json_data['results'][0]['lexicalEntries'][0]['entries'][0]['senses'][0]['examples']:
    print(example["text"])
In your code, by doing json_data['results'][0]['lexicalEntries'][0]['entries'][0]['senses'][0]['examples'][iterator]['text'] you were only accessing the item at index iterator (so always the first one, since iterator is 0 when that string is built), and then iterating over the characters of its "text" string.
Only index the json data out to 'examples':
json_example = json_data['results'][0]['lexicalEntries'][0]['entries'][0]['senses'][0]['examples']
then treat each element of 'examples' like a dictionary:
for dictionary in json_example:
    for key in dictionary:
        print(dictionary[key])
This will print out each value correlated with the key 'text', like you want.
