Python file search using regex - python

I have a file that has many lines. Each line starts with {"id": followed by the id number in quotes (e.g. {"id": "106"). I am trying to use regex to search the whole document line by line and print the lines that match 5 different id values. To do this I made a list with the ids and want to iterate through the list, matching only lines that start with {"id": "(id number from list)". I am really confused about how to do this. Here is what I have so far:
f= "bdata.txt"
statids = ["85", "106", "140", "172" , "337"]
x= re.findall('{"id":', statids, 'f')
for line in open(file):
    print(x)
The error code I keep getting is: TypeError: unsupported operand type(s) for &: 'str' and 'int'
I need the whole line to be matched so I can split it and put it into a class.
Any advice? Thanks for your time.

You can retrieve the id from the line using the regex, ^\{\"id\": \"(\d+)\" where the value of group#1 will give you the id. Then, you can check if the id is present in statids.
Demo:
import re
statids = ["85", "106", "140", "172", "337"]
with open("bdata.txt") as file:
    for line in file:
        # raw string so the backslashes reach the regex engine unchanged
        search = re.search(r'^\{\"id\": \"(\d+)\"', line)
        if search:
            id = search.group(1)
            if id in statids:
                print(line.rstrip())
For the following sample content in the file:
{"id": "100" hello
{"id": "106" world
{"id": "2" hi
{"id": "85" bye
{"id": "10" ok
{"id": "140" good
{"id": "165" fine
{"id": "172" great
{"id": "337" morning
{"id": "16" evening
the output will be:
{"id": "106" world
{"id": "85" bye
{"id": "140" good
{"id": "172" great
{"id": "337" morning
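If the file is large, the same check can be wrapped in a small function that compiles the pattern once and keeps the ids in a set. This is only a sketch: the names ID_PATTERN and matching_lines and the in-memory sample list are made up for illustration, and a real run would pass the open file object instead of a list.

```python
import re

# Compiled once; group 1 captures the digits of the id at the start of a line.
ID_PATTERN = re.compile(r'^\{"id": "(\d+)"')


def matching_lines(lines, wanted_ids):
    """Yield the lines whose leading id is one of wanted_ids (hypothetical helper)."""
    wanted = set(wanted_ids)  # set membership tests are fast
    for line in lines:
        found = ID_PATTERN.match(line)
        if found and found.group(1) in wanted:
            yield line.rstrip("\n")


# In-memory stand-in for the lines of bdata.txt.
sample = ['{"id": "100" hello\n', '{"id": "106" world\n', '{"id": "85" bye\n']
result = list(matching_lines(sample, ["85", "106", "140", "172", "337"]))
print(result)
```

Because the function is a generator, the matched lines can be split and turned into objects one at a time without loading the whole file.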

I think the issue here is the way you're using re.findall. According to the docs, you have to pass a regular expression as the first argument and the string you want to search as the second argument. Since you want the whole line rather than the matched text, re.search is a better fit. In your case I think this is how you should do it:
pattern = r'^\{"id": "(?:' + "|".join(statids) + r')"'
with open(f) as file:
    for line in file:
        match = re.search(pattern, line)
        if match:
            print(line.rstrip())
In the regex, the pipe operator "|" works like "or", so joining all the ids into one string with | between them matches a line that starts with any one of the ids. The closing quote after the group keeps "10" from matching "106". When re.search finds a match, the whole line is printed.
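The alternation idea can be sketched as follows, with the ids escaped and both the start of the line and the closing quote anchored so that an id such as "10" or "1060" is not mistaken for "106". The helper name line_matches is hypothetical.

```python
import re

statids = ["85", "106", "140", "172", "337"]

# re.escape is not strictly needed for digits, but it keeps the pattern
# safe if an id ever contains a regex metacharacter.
alternation = "|".join(re.escape(i) for i in statids)
pattern = re.compile(r'^\{"id": "(?:' + alternation + r')"')


def line_matches(line):
    """Return True if the line starts with {"id": "<one of statids>"."""
    return pattern.match(line) is not None


print(line_matches('{"id": "106" world'))
print(line_matches('{"id": "10" ok'))
```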

Related

Parsing data containing escaped quotes and separators in python

I have data that is structured like this:
1661171420, foo="bar", test="This, is a \"TEST\"", count=5, com="foo, bar=blah"
It always starts with a unix timestamp, but then I can't know how many other fields follow and how they are called.
The goal is to parse this into a dictionary as such:
{"timestamp": 1661171420,
"foo": "bar",
"test": 'This, is a "TEST"',
"count": 5,
"com": "foo, bar=blah"}
I'm having trouble parsing this, especially regarding the escaped quotes and commas in the values.
What would be the best way to parse this correctly? Preferably without any third-party modules.
If changing the format of input data is not an option (JSON would be much easier to handle, but if it is an API as you say then you might be stuck with this) the following would work assuming the file follows given structure more or less. Not the cleanest solution, I agree, but it does the job.
import re
d = r'''1661171420, foo="bar", test="This, is a \"TEST\"", count=5, com="foo, bar=blah", fraction=-0.11'''.replace(r"\"", "'''")
string_pattern = re.compile(r'''(\w+)="([^"]*)"''')
matches = re.finditer(string_pattern, d)
parsed_data = {}
parsed_data['timestamp'] = int(d.partition(", ")[0])
for match in matches:
    parsed_data[match.group(1)] = match.group(2).replace("'''", "\"")
number_pattern = re.compile(r'''(\w+)=([+-]?\d+(?:\.\d+)?)''')
matches = re.finditer(number_pattern, d)
for match in matches:
    try:
        parsed_data[match.group(1)] = int(match.group(2))
    except ValueError:
        parsed_data[match.group(1)] = float(match.group(2))
print(parsed_data)
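An alternative that avoids the placeholder substitution is a single pattern that understands backslash escapes inside quoted values. The sketch below uses only the standard library; the names FIELD and parse_record are made up, and it assumes every field after the timestamp has the form key="quoted" or key=bare.

```python
import re

# key="quoted value, possibly with \" escapes"   or   key=bare_value
FIELD = re.compile(r'(\w+)=(?:"((?:[^"\\]|\\.)*)"|([^,\s]+))')


def parse_record(text):
    """Parse 'timestamp, key=value, ...' into a dict (hypothetical helper)."""
    timestamp, _, rest = text.partition(",")
    record = {"timestamp": int(timestamp)}
    for m in FIELD.finditer(rest):
        key, quoted, bare = m.groups()
        if quoted is not None:
            # drop the backslash in front of each escaped character
            record[key] = re.sub(r"\\(.)", r"\1", quoted)
            continue
        # bare values: try int, then float, otherwise keep the text
        try:
            record[key] = int(bare)
        except ValueError:
            try:
                record[key] = float(bare)
            except ValueError:
                record[key] = bare
    return record


line = r'1661171420, foo="bar", test="This, is a \"TEST\"", count=5, com="foo, bar=blah"'
parsed = parse_record(line)
print(parsed)
```

Because finditer resumes after the end of each match, text such as bar=blah inside a quoted value is consumed as part of that value and never treated as a separate field.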

Python regex finding all multi line occurrences in file

I have a text file including tons of output like this:
<Tons-of-random-text>
...
Received Access-Accept (code=2) length=300
Attribute 2 (User-Name) length=16
Value 'myTextValue1'
Attribute 4 (Bla1) length=16
Value 'myTextValue2'
Attribute 6 (Bla2) length=16
Value 0xABCDEFG
<Tons-of-random-text>
In the end I want to use named capture groups to extract:
the code in the first line shown above
a list of attributes
Based on the example above, the desired data structure is:
code=2
attributes = [
{
"attribute": "2",
"attribute-name": "User-Name",
"value": "myTextValue1"
},
{
"attribute": "4",
"attribute-name": "Bla1",
"value": "myTextValue2"
},
# ...
]
I'm struggling with finditer and findall... at the end of the day I'm only able to match the first or the last attribute block...
Anybody have a hint for a good regex?
This regex will match the line with (code=\d) and the lines that follow, for as long as each line contains "Attribute" or "Value".
/(\(code=\d\).*$(\n^\s*(Attribute|Value).*$)+)/gm
see https://regex101.com/r/HzIW5V/1
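In Python, one way to get from that block match to the desired structure is a two-step approach: first capture each block, then run a second pattern with named groups over the block's body. This is a sketch under assumptions: the names BLOCK, ATTRIBUTE and extract are made up, each Value line is assumed to follow its Attribute line directly, and the key "attribute-name" is filled from a group called name because Python group names cannot contain a hyphen.

```python
import re

# The "(code=N)" line followed by one or more Attribute/Value lines.
BLOCK = re.compile(
    r"\(code=(?P<code>\d+)\).*\n"
    r"(?P<body>(?:[ \t]*(?:Attribute|Value).*(?:\n|$))+)"
)
# One Attribute line plus the Value line right after it.
ATTRIBUTE = re.compile(
    r"Attribute (?P<attribute>\d+) \((?P<name>[^)]+)\).*\n"
    r"[ \t]*Value '?(?P<value>[^'\n]*)'?"
)


def extract(text):
    """Return one dict per block with its code and attribute list."""
    results = []
    for block in BLOCK.finditer(text):
        attributes = [
            {
                "attribute": m.group("attribute"),
                "attribute-name": m.group("name"),
                "value": m.group("value"),
            }
            for m in ATTRIBUTE.finditer(block.group("body"))
        ]
        results.append({"code": block.group("code"), "attributes": attributes})
    return results


sample = """noise
Received Access-Accept (code=2) length=300
Attribute 2 (User-Name) length=16
Value 'myTextValue1'
Attribute 4 (Bla1) length=16
Value 'myTextValue2'
Attribute 6 (Bla2) length=16
Value 0xABCDEFG
more noise
"""
blocks = extract(sample)
print(blocks)
```

Splitting the work this way sidesteps the usual problem with repeated capture groups, where only the last repetition is kept.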

How to select certain parts of a JSON file using python?

I'm pretty new to programming, how would I select a certain part of the JSON file and have it display the value? For example:
{
"name": "John",
"age": 18,
"state": "New York"
}
What would be the code in Python needed to get the value of any of the items by giving it the keyword? (i.e. I give it "name" and the output displays "John")
Put the key in square brackets.
myJSON = {"name": "John", "age": 18, "state": "New York"}  # that JSON, as a dict
print(myJSON["name"])  # prints John
print(myJSON["age"])  # prints 18
print(myJSON["state"])  # prints New York
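If the JSON arrives as text (for example, read from a file), it has to be parsed before it can be indexed. A minimal sketch with the standard json module follows; the variable names and the lookup helper are made up for illustration, and json.load(file_object) would be the equivalent call for an open file.

```python
import json

# The JSON document as a string, as it would look after reading a file.
text = '{"name": "John", "age": 18, "state": "New York"}'
person = json.loads(text)  # parse the text into a Python dict


def lookup(data, key):
    """Return the value for key, or None when the key is missing."""
    return data.get(key)


print(lookup(person, "name"))
print(lookup(person, "age"))
```

Using get instead of square brackets avoids a KeyError when the keyword is not present in the data.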

Python Create a List file and write query

Sorry if my question seems too easy, but I am new to Python and cannot find a way to solve it.
I have a "dish.py" file which includes some sub-lists
Fruits={"Ap":Apple
"Br":Black Mulberry
"Ch":Black Cherry
}
Meals={"BN":Bean
"MT":Meat
"VG":Vegetable
}
Legumes={"LN":Green Lentil
"P": Pea
"PN":Runner Peanut
}
I want to use the dish.py file in a script that, in the end, runs a query against the file's contents:
with open("/home/user/Py_tut/Cond/dish.py", 'r') as dish:
content = dish.read()
print dish.closed
dm=dict([dish])
nl=[x for x in dm if x[0]=='P']
for x in dm:
x=str(raw_input("Enter word:"))
if x in dm:
print dm[x]
elif x[0]==("P"):
nl.append(x)
print .join( nl)
It may look messy, but:
dm=dict([dish]): I want to create a dictionary to query
nl=[x for x in dm if x[0]=='P']: I want to collect the words that begin with the letter "P"
Here are my questions:
1. Q: I suppose there is a problem with my dish.py file. How can I reorganize it?
2. Q: How can I apply a query to the file and extract the words that begin with "P"?
Thank you so much in advance.
dict() can't load strings:
>>> dict("{'a': 1, 'b': 2}")
Traceback (most recent call last):
File "<pyshell#0>", line 1, in <module>
dict("{'a': 1, 'b': 2}")
ValueError: dictionary update sequence element 0 has length 1; 2 is required
As a sequence it would be ("{", "'", "a", "'", ":",...
Instead I would use the json module, change the dish.py format (changing extension to .json and using JSON syntax) and change the code.
dish.json
{
    "Fruits": {
        "Ap": "Apple",
        "Br": "Black Mulberry",
        "Ch": "Black Cherry"
    },
    "Meals": {
        "BN": "Bean",
        "MT": "Meat",
        "VG": "Vegetable"
    },
    "Legumes": {
        "LN": "Green Lentil",
        "P": "Pea",
        "PN": "Runner Peanut"
    }
}
__init__.py
import json
with open("/home/user/Py_tut/Cond/dish.json", 'r') as dish:
    content = dish.read()
print(dish.closed)  # True: the with block has already closed the file
dm = json.loads(content)  # parse the JSON text into nested dicts
nl = [x for x in dm if x[0] == 'P']
for x in dm:
    x = input("Enter word:")  # raw_input() in Python 2
    if x in dm:
        print(dm[x])
    elif x.startswith("P"):
        nl.append(x)
print("".join(nl))
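For the "begins with P" query itself, a sketch over the nested JSON structure could look like the following. It assumes the question is about the food names (the values) rather than the short codes (the keys); the function name words_starting_with is hypothetical, and the JSON is embedded as a string so the example runs without the file.

```python
import json

# Same content as dish.json, inlined so the sketch is self-contained.
dish_json = """
{
    "Fruits": {"Ap": "Apple", "Br": "Black Mulberry", "Ch": "Black Cherry"},
    "Meals": {"BN": "Bean", "MT": "Meat", "VG": "Vegetable"},
    "Legumes": {"LN": "Green Lentil", "P": "Pea", "PN": "Runner Peanut"}
}
"""
dm = json.loads(dish_json)


def words_starting_with(data, letter):
    """Collect every name, across all categories, that starts with letter."""
    return [
        name
        for category in data.values()   # Fruits, Meals, Legumes
        for name in category.values()   # Apple, Black Mulberry, ...
        if name.startswith(letter)
    ]


p_words = words_starting_with(dm, "P")
print(p_words)
```

Swapping category.values() for category.items() and testing the key instead would select by code ("P", "PN") rather than by name.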
Q: How can I apply a query to the file and extract the words begin with "P" Thank you so much in advance
Assuming that you want to get every string separated by either a space or a newline and return them as a list, I'd do this:
import re  # regular expression module
def wordsBeginP():
    with open("words.txt") as wordsfile:  # read the whole file
        words = wordsfile.read()
    parsed = re.sub(r"\n", " ", words)  # replace newlines with spaces
    # split() without arguments skips empty strings
    return [i for i in parsed.split() if i.startswith("P")]  # list of words
So I think you have more than these two issues/questions.
First, if you want to include 'hardcoded' lists, dicts and such, you probably want to import dish, with dish.py being in your working directory.
That only works if the data structures in the Python file are actually in the correct form:
Fruits={"Ap":'Apple',
"Br":'Black Mulberry',
"Ch":'Black Cherry'
}
Meals={"BN":'Bean',
"MT":'Meat',
"VG":'Vegetable'
}
Legumes={"LN":'Green Lentil',
"P":'Pea',
"PN":'Runner Peanut'
}
Finally, you can search all the data structures defined in the file under the namespace created by the import (which is dish).
import dish
for f in [dish.Fruits, dish.Meals, dish.Legumes]:
    for k, v in f.items():
        if k.startswith('P'):
            print(k, v)
Also interesting for you might be pickling (though there are some caveats).

Parsing a file that looks like JSON to a JSON

I'm trying to figure out what is the best way to go about this problem:
I'm reading text lines from a certain buffer that eventually creates a certain log that looks something like this:
Some_Information: here there's some information about date and hour
Additional information: log summary #1234:
details {
name: "John Doe"
address: "myAdress"
phone: 01234567
}
information {
age: 30
height: 1.70
weight: 70
}
I would like to get all the fields in this log to a dictionary which I can later turn into a json file, the different sections in the log are not important so for example if myDictionary is a dictionary variable in python I would like to have:
> myDictionary['age']
will show me 30.
and the same for all other fields.
Speed is very important here that's why I would like to just go through every line once and get it in a dictionary
My approach would be to split every line that contains a colon (":") and store the two parts as a key and a value in the dictionary.
is there a better way to do it?
Is there any python module that would be sufficient?
If more information is needed please let me know.
Edit:
So I've tried something that seems to work best so far.
I am currently reading from a file to simulate reading from the buffer.
My code:
import json
import shlex
newDict = dict()
with open('log.txt') as f:
    for line in f:
        try:
            line = line.replace(" ", "")
            stringSplit = line.split(':')
            key = stringSplit[0]
            value = stringSplit[1]
            value = shlex.split(value)
            newDict[key] = value[0]
        except:
            continue
with open('result.json', 'w') as fp:
    json.dump(newDict, fp)
Resulting in the following .json:
{"name": "JohnDoe", "weight": "70", "Additionalinformation": "logsummary#1234",
"height": "1.70", "phone": "01234567", "address": "myAdress", "age": "30"}
You haven't described exactly what the desired output should be from the sample input, so it's not completely clear what you want done. So I guessed, and the following only extracts data values from the lines between one that contains a '{' and the next one that contains a '}', while ignoring all others.
It uses the re module to isolate the two parts of each dictionary item definition found on the line, and then uses the ast module to convert the value portion of that into a valid Python literal (i.e. string, number, tuple, list, dict, bool, and None). Under Python 3 a value that is not a valid literal, such as the phone number with its leading zero, is kept as a string.
import ast
import json
import re
pat = re.compile(r"""(?P<key>\w+)\s*:\s*(?P<value>.+)$""")
data_dict = {}
with open('log.txt') as f:
    braces = 0  # nesting depth of the { ... } sections
    for line in (line.strip() for line in f):
        if '{' in line:
            braces += 1
        elif '}' in line:
            braces -= 1
        elif braces > 0:
            match = pat.search(line)
            if match:
                try:
                    value = ast.literal_eval(match.group('value'))
                except (ValueError, SyntaxError):
                    # not a valid literal (e.g. 01234567): keep the raw text
                    value = match.group('value')
                data_dict[match.group('key')] = value
        # any other line is ignored
print(json.dumps(data_dict, indent=4))
Output from your example input (Python 3):
{
    "name": "John Doe",
    "address": "myAdress",
    "phone": "01234567",
    "age": 30,
    "height": 1.7,
    "weight": 70
}
