Parsing a large pseudo-CSV log file in Python

I have several very large not quite csv log files.
Given the following conditions:
value fields have unescaped newlines and commas; almost anything can be in a value field, including '='
each valid line has an unknown number of valid value fields
valid value looks like key=value such that a valid line looks like key1=value1, key2=value2, key3=value3 etc.
the start of each valid line should begin with eventId=<some number>,
What is the best way to read a file, split the file into correct lines and then parse each line into correct key value pairs?
I have tried
file_name = 'file.txt'
read_file = open(file_name, 'r').read().split(',\neventId')
This correctly parses the first entry, but every other entry starts with =# instead of eventId=#. Is there a way to keep the delimiter and split on the valid newlines?
Also, speed is very important.
Example Data:
eventId=123, key=value, key2=value2:
this, will, be, a problem,
maybe?=,
anotherkey=anothervalue,
eventId=1234, key1=value1, key2=value2, key3=value3,
eventId=12345, key1=
msg= {this is not a valid key value pair}, key=value, key21=value=,
Yes, the file really is this messy (sometimes). Each event here has 3 key-value pairs, although in reality each event has an unknown number of key-value pairs.
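For the narrow "keep the delimiter" part of the question, re.split with a zero-width lookahead leaves eventId= attached to each chunk. A minimal sketch, assuming each new event starts at the beginning of a physical line:
import re

with open('file.txt') as f:
    text = f.read()

# The lookahead matches without consuming anything, so 'eventId=' stays
# at the start of every chunk instead of being split away.
entries = re.split(r'\n(?=eventId=\d+,)', text)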

This problem is pretty insane, but here's a solution that seems to work. Always use an existing library to output formatted data, kids.
import re

in_string = """eventId=123, goodkey=goodvalue, key2=somestuff:
this, will, be, a problem,
maybe?=,
anotherkey=anothervalue, gotit=see,
the problem===s,
eventId=1234, key1=value1, key2=value2, key3=value3,
eventId=12345, key1=
msg= {this is not a valid key value pair}, validkey=validvalue,"""

# Find the start of every record: an eventId=<digit> that either opens the
# string or follows ',\n'.
line_matches = list(re.finditer(r'(,\n)?eventId=\d', in_string))
lines = []
for i in range(len(line_matches)):
    match_start = line_matches[i].start()
    next_match_start = line_matches[i+1].start() if i < len(line_matches)-1 else len(in_string)-1
    line = in_string[match_start:next_match_start].lstrip(',\n')
    lines.append(line)

lineDicts = []
for line in lines:
    d = {}
    pad_line = ', ' + line
    # A key is a run of word characters preceded by ', ' and followed by '='.
    matches = list(re.finditer(r', [\w\d]+=', pad_line))
    for i in range(len(matches)):
        match = matches[i]
        key = match.group().lstrip(', ').rstrip('=')
        next_match_start = matches[i+1].start() if i < len(matches)-1 else len(pad_line)
        value = pad_line[match.end():next_match_start]
        d[key] = value
    lineDicts.append(d)

print(lineDicts)
Outputs [{'eventId': '123', 'key2': 'somestuff:\nthis, will, be, a problem,\nmaybe?=,\nanotherkey=anothervalue', 'goodkey': 'goodvalue', 'gotit': 'see,\nthe problem===s'}, {'eventId': '1234', 'key2': 'value2', 'key1': 'value1', 'key3': 'value3'}, {'eventId': '12345', 'key1': '\nmsg= {this is not a valid key value pair}', 'validkey': 'validvalue'}]

If "the start of each valid line should begin with eventId=" is correct, you can groupby those lines and find the valid pairs with a regex:
from itertools import groupby
import re

with open("test.txt") as f:
    r = re.compile(r"\w+=\w+")
    grps = groupby(f, key=lambda x: x.startswith("eventId="))
    d = dict(l.split("=") for k, v in grps if k
             for l in r.findall(next(v))[1:])
    print(d)
{'key3': 'value3', 'key2': 'value2', 'key1': 'value1', 'goodkey': 'goodvalue'}
If you want to keep the eventIds:
from itertools import groupby
import re

with open("test.txt") as f:
    r = re.compile(r"\w+=\w+")
    grps = groupby(f, key=lambda x: x.startswith("eventId="))
    d = [r.findall(next(v)) for k, v in grps if k]
    print(d)
[['eventId=123', 'goodkey=goodvalue', 'key2=somestuff'], ['eventId=1234', 'key1=value1', 'key2=value2', 'key3=value3']]
It is not clear from your description exactly what the output should be. If you want all the valid key=value pairs, and "the start of each valid line should begin with eventId=" is not accurate:
from itertools import groupby
import re

def parse(fle):
    with open(fle) as f:
        r = re.compile(r"\w+=\w+")
        grps = groupby(f, key=lambda x: x.startswith("eventId="))
        for k, v in grps:
            if k:
                # Join each eventId line with the continuation lines that follow it.
                sub = "".join(list(v) + list(next(grps, (None, []))[1]))
                yield from r.findall(sub)

print(list(parse("test.txt")))
Output:
['eventId=123', 'key=value', 'key2=value2', 'anotherkey=anothervalue',
'eventId=1234', 'key1=value1', 'key2=value2', 'key3=value3',
'eventId=12345', 'key=value', 'key21=value']
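If you want the pairs regrouped per event rather than as one flat list, you can rebuild dicts afterwards. A minimal sketch on top of the parse() generator above (group_events is an illustrative name, not from the answer):
def group_events(pairs):
    # Start a fresh dict whenever an eventId pair comes through.
    events = []
    for pair in pairs:
        key, _, value = pair.partition('=')
        if key == 'eventId':
            events.append({})
        if events:
            events[-1][key] = value
    return events

print(group_events(parse("test.txt")))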

If your values really can contain anything, there's no unambiguous way of parsing. Any key=value pair could be part of the preceding value. Even an eventId=# pair on a new line could be part of a value from the previous line.
Now, perhaps you can do a "good enough" parse on the data despite the ambiguity, if you assume that values will never contain valid looking key= substrings. If you know the possible keys (or at least, what constraints they have, like being alphanumeric), it will be a lot easier to guess at what is a new key and what is just part of the previous value.
Anyway, if we assume that all alphanumeric strings followed by equals signs are indeed keys, we can do a parse with regular expressions. Unfortunately, there's no easy way to do this line by line, nor is there a good way to capture all the key-value pairs in a single scan. However, it's not too hard to scan once to get the log lines (which may have embedded newlines) and then separately get the key=value, pairs for each one.
with open("my_log_file") as infile:
text = infile.read()
line_pattern = r'(?S)eventId=\d+,.*?(?:$|(?=\neventId=\d+))'
kv_pattern = r'(?S)(\w+)=(.*?),\s*(?:$|(?=\w+=))'
results = [re.findall(kv_pattern, line) for line in re.findall(line_pattern, text)]
I'm assuming that the file is small enough to fit into memory as a string. It would be quite a bit more obnoxious to solve the problem if the file can't all be handled at once.
If we run this regex matching on your example text, we get:
[[('eventId', '123'), ('key', 'value'), ('key2', 'value2:\nthis, will, be, a problem,\nmaybe?='), ('anotherkey', 'anothervalue')],
[('eventId', '1234'), ('key1', 'value1'), ('key2', 'value2'), ('key3', 'value3')],
[('eventId', '12345'), ('key1', '\nmsg= {this is not a valid key value pair}'), ('key', 'value'), ('key21', 'value=')]]
maybe? is not considered a key because of the question mark. msg and the final value are not considered keys because there were no commas separating them from a previous value.
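If the file really can't all be handled at once, a buffered variant can stream one logical line at a time. A sketch, assuming every logical line starts with eventId= at the beginning of a physical line (iter_lines and the chunk size are illustrative, not from the answer):
import re

def iter_lines(path, chunk_size=1 << 20):
    boundary = re.compile(r'\n(?=eventId=\d+,)')
    buf = ''
    with open(path) as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                break
            buf += chunk
            parts = boundary.split(buf)
            # The last piece may be cut off mid-event; keep buffering it.
            buf = parts.pop()
            for part in parts:
                yield part
    if buf:
        yield buf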

Oh! This is an interesting problem. You'll want to process each line, and each part of a line, separately, without iterating through the file more than once.
data_dict = {}
file_lines = open('file.txt', 'r').readlines()
for line in file_lines:
    line_list = line.split(',')
    if len(line_list) >= 1:
        if 'eventId' in line_list[0]:
            for item in line_list:
                # Split on the first '=' only, and skip items that have none.
                pair = item.split('=', 1)
                if len(pair) == 2:
                    data_dict.update({pair[0]: pair[1]})
That should do it. Enjoy!
If there are spaces in the 'pseudo csv', please change the last line to:
data_dict.update({pair[0].strip(): pair[1].strip()})
in order to remove surrounding spaces from the strings for your key and value.
p.p.s. A set of lines from your actual data would be very helpful in writing something to avoid error cases.

Related

write the same element in a list into txt file

I have a list of dicts from which I have extracted the data that I need: 'uni', 'gp', 'fr', 'rn'.
uni:1 gp:CC fr:c2 rn:DS
uni:1 gp:CC fr:c2 rn:PP
uni:1 gp:CC fr:c2 rn:LL
uni:2 gp:CC fr:c2 rn:DS
uni:2 gp:CC fr:c2 rn:LL
.
.
.
Above is the output that I write to a txt file with the code below:
for line in new_l:
    for key, value in line.items():
        if key == 'uni':
            unique.append(value)
        elif key == 'gp':
            pg.append(value)
        elif key == 'fr':
            rf.append(value)
        elif key == 'rn':
            rn.append(value)

with open('sampel1.list', mode='w') as f:
    for uni, gp, fr, rn in zip(unique, pg, rf, rn):
        f.write('uni:{uni}\t,gp:{gp}\t,fr:{fr}\t,rn:{rn}-\n'.format(uni=uni, gp=gp, fr=fr, rn=rn))
The expected output is to merge the 'rn' values for entries that share the same 'uni', 'gp', and 'fr':
unique:1 gp:CC fr:c2 rn:DS+PP+LL
unique:2 gp:CC fr:c2 rn:DS+LL
Here's one way I might do something like this using pure Python. Note: this particular solution is relying on the fact that Python 3.7 dicts preserve insertion order:
from collections import defaultdict

# This will map the (uni, gp, fr) triplets to the list of merged rn values
merged = defaultdict(list)

for l in new_l:
    # Assuming these keys are always present; if not you will need to check
    # that and skip invalid entries
    key = (l['uni'], l['gp'], l['fr'])
    merged[key].append(l['rn'])

# Now if you wanted to write this to a file, say:
with open(filename, 'w') as f:
    for (uni, gp, fr), rn in merged.items():
        f.write(f'uni:{uni}\tgp:{gp}\tfr:{fr}\trn:{"+".join(rn)}\n')
Note, when I wrote "pure Python" I meant just using the standard library. In practice I might use Pandas if I'm working with tabular data.
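For reference, a sketch of that Pandas version, assuming new_l is the same list of dicts as above:
import pandas as pd

df = pd.DataFrame(new_l)
# Group on the triplet and join each group's rn values; sort=False keeps
# the groups in first-seen order, matching the defaultdict version.
merged = (df.groupby(['uni', 'gp', 'fr'], sort=False)['rn']
            .agg('+'.join)
            .reset_index())
print(merged)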
You need to study a little about algorithms and data structures.
In this case you can use the first 3 elements to create a unique hash key, and based on this value either append the last element or not.
Example:
lst = []
lst.append({'uni': 1, 'gp': 'CC', 'fr': 'c2', 'rn': 'DS'})
lst.append({'uni': 1, 'gp': 'CC', 'fr': 'c2', 'rn': 'PP'})
lst.append({'uni': 1, 'gp': 'CC', 'fr': 'c2', 'rn': 'LL'})
lst.append({'uni': 2, 'gp': 'CC', 'fr': 'c2', 'rn': 'DS'})
lst.append({'uni': 2, 'gp': 'CC', 'fr': 'c2', 'rn': 'PP'})
lst.append({'uni': 3, 'gp': 'CC', 'fr': 'c2', 'rn': 'DS'})

hash = {}
for line in lst:
    hashkey = str(line['uni']) + line['gp'] + line['fr']
    if hashkey in hash:
        hash[hashkey]['rn'] += "+" + line['rn']
    else:
        hash[hashkey] = {'uni': line['uni'], 'gp': line['gp'], 'fr': line['fr'], 'rn': line['rn']}
print(hash)
result: {'1CCc2': {'uni': 1, 'gp': 'CC', 'fr': 'c2', 'rn': 'DS+PP+LL'}, '2CCc2': {'uni': 2, 'gp': 'CC', 'fr': 'c2', 'rn': 'DS+PP'}, '3CCc2': {'uni': 3, 'gp': 'CC', 'fr': 'c2', 'rn': 'DS'}}
I thought I'd add how I approach problems like this. You are grouping by the first 3 fields, so I would place them in a tuple (not a list; the dictionary index must be an immutable object) and use that as an index to a dictionary. Then, as you read each line from your input file, test if the tuple is already in the dictionary or not. If it is, concatenate to the previous values already saved.
myDict = {}
f = open("InputData.txt", "r")
for line in f:
    tup = line.strip().split('\t')
    ind = (tup[0], tup[1], tup[2])
    if ind in myDict:
        # Only concatenate an rn value we have not already seen for this triplet.
        if tup[3][3:] not in myDict[ind]:
            myDict[ind] = myDict[ind] + "+" + tup[3][3:]
    else:
        myDict[ind] = tup[3][3:]
f.close()
print(myDict)
Once the data is in the dictionary object, you can iterate over it and write your output like in the other answers above. (My answer assumes your input text file is tab delimited.)
I find dictionaries very helpful in cases like this.

Can I import dictionary items with the same values into a list?

I'm importing data from a text file, and then made a dictionary out of that. I'm now trying to make a separate one, with the entries that have the same value only. Is that possible?
Sorry if that's a little confusing! But basically, the text file looks like this:
"Andrew", "Present"
"Christine", "Absent"
"Liz", "Present"
"James", "Present"
I made it into a dictionary first, so I could group them into keys and values, and now I'm trying to make a list of the people who were 'present' only (I don't want to delete the absent ones, I just want a separate list), and then pick one from that list randomly.
This is what I tried:
import random

d = {}
with open('directory.txt') as f:
    for line in f:
        name, attendance = line.strip().split(',')
        d[name.strip()] = attendance.strip()

present_list = []
present_list.append({"name": str(d.keys), "attendance": "Present"})
print(random.choice(present_list))
When I tried running it, I only get:
{'name': '<built-in method keys of dict object at 0x02B26690>', 'attendance': 'Present'}
Which part should I change? Thank you so much in advance!
You can try this:
present_list = [key for key in d if d[key] == "Present"]
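From there, picking a random present person works as intended; for example:
import random

present_list = [key for key in d if d[key] == "Present"]
if present_list:
    print(random.choice(present_list))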
First, you have to change the way you read the lines; then you can key your initial dict on the attendance:
import random
from collections import defaultdict

d = defaultdict(list)
with open('directory.txt') as f:
    for line in f.readlines():
        name, attendance = line.strip().split(',')
        d[attendance.strip()].append(name.strip())

present_list = d["Present"]
print(random.choice(present_list) if present_list else "All absent")
dict.keys is a method, not a field, so you must call it instead:
d.keys()
This returns a view of the keys. If you want a comma-separated list with square brackets, calling str() on list(d.keys()) does it; if you want a plain comma-separated list with no brackets, use ','.join(d.keys()).
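A quick illustration with made-up data:
d = {'Andrew': 'Present', 'Liz': 'Present'}
print(str(list(d.keys())))   # ['Andrew', 'Liz']
print(', '.join(d.keys()))   # Andrew, Liz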
UPDATE:
You also have no filtering in place. Instead I'd try something like this, where you group the names by attendance status as you read them (the grouping logic is the new code):
d = {}
with open('directory.txt') as f:
    for line in f:
        name, attendance = line.strip().split(',')
        # New code: group each name under its attendance status.
        if attendance.strip() not in d:
            d[attendance.strip()] = [name.strip()]
        else:
            d[attendance.strip()].append(name.strip())
This way you don't need to go through all those intermediate steps, and you will have something like {'Present': ['Andrew', 'Liz', 'James']}.

In Python, how to match a string to a dictionary item (like 'Bra*')

I'm a complete novice at Python so please excuse me for asking something stupid.
From a textfile a dictionary is made to be used as a pass/block filter.
The textfile contains addresses and either a block or allow like "002029568,allow" or "0011*,allow" (without the quotes).
The search-input is a string with a complete code like "001180000".
How can I evaluate if the search-item is in the dictionary and make it match the "0011*,allow" line?
Thank you very much for your effort!
The filter-dictionary is made with:
def loadFilterDict(filename):
    global filterDict
    try:
        with open(filename, "r") as text_file:
            lines = text_file.readlines()
            for s in lines:
                fields = s.split(',')
                if len(fields) == 2:
                    filterDict[fields[0]] = fields[1].strip()
    except:
        pass
Check if the code (ccode) is in the dictionary:
if ccode in filterDict:
    if filterDict[ccode] in ['block']:
        continue
else:
    if filterstat in ['block']:
        continue
The filters-file is like:
002029568,allow
000923993,allow
0011*, allow
If you can use re, you don't have to worry about the wildcard; just let re.match do the hard work for you:
import re

# Rules input (this could also be read from a file)
lines = """002029568,allow
0011*,allow
001180001,block
"""

# Parse rules from the string
rules = []
for line in lines.split("\n"):
    line = line.strip()
    if not line:
        continue
    identifier, ruling = line.split(",")
    rules += [(identifier, ruling)]

# Get the rulings for a specific number
def rule(number):
    rulings = []
    for identifier, ruling in rules:
        # Replace the wildcard with the regex .*
        identifier = identifier.replace("*", ".*")
        if re.match(identifier, number):
            rulings += [ruling]
    return rulings

print(rule("001180000"))
print(rule("001180001"))
Which prints:
['allow']
['allow', 'block']
The function will return a list of rulings. Their order is the same order as they appear in your config lines. So you could easily just pick the last or first ruling whichever is the one you're interested in.
Or break the loop prematurely if you can assume that no two rulings will interfere.
Examples:
001180000 is matched by 0011*,allow only, so the only ruling which applies is allow.
001180001 is matched by 0011*,allow at first, so you'll get allow as before. However, it is also matched by 001180001,block, so a block will get added to the rulings, too.
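Another standard-library option worth knowing: fnmatch understands shell-style * wildcards directly, so no regex translation is needed. A sketch against the same rules list as above:
from fnmatch import fnmatch

rules = [("002029568", "allow"), ("0011*", "allow"), ("001180001", "block")]

def rule(number):
    # fnmatch('001180000', '0011*') is True: '*' matches any run of characters.
    return [ruling for identifier, ruling in rules if fnmatch(number, identifier)]

print(rule("001180000"))  # ['allow']
print(rule("001180001"))  # ['allow', 'block']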
If the wildcard entries in the file have a fixed length (for example, you only need to support lines like 0011*,allow and not 00110*,allow or 0*,allow or any other arbitrary number of digits followed by *) you can use a nested dictionary, where the outer keys are the known parts of the wildcarded entries.
d = {'0011': {'001180000': 'value', '001180001': 'value'}}
Then when you parse the file and get to the line 0011*,allow you do not need to do any matching. All you have to do is check if '0011' is present. Crude example:
d = {'0011': {'001180000': 'value', '001180001': 'value'}}
line = '0011*,allow'
prefix = line.split(',')[0][:-1]
if prefix in d:
    # there is a "match"; you can then deal with all the entries that match,
    # in this case the items in the inner dictionary
    # {'001180000': 'value', '001180001': 'value'}
    print('match')
else:
    print('no match')
If you do need to support arbitrary lengths of wildcarded entries, you will have to resort to a loop iterating over the dictionary (and therefore defeating the point of using a dictionary to begin with):
d = {'001180000': 'value', '001180001': 'value'}
line = '0011*,allow'
prefix = line.split(',')[0][:-1]
for k, v in d.items():
    if k.startswith(prefix):
        # found a matching key-value pair
        print(k, v)

Python replace values in unknown structure JSON file

Say that I have a JSON file whose structure is either unknown or may change over time. I want to replace all values of "REPLACE_ME" with a string of my choice in Python.
Everything I have found assumes I know the structure. For example, I can read the JSON in with json.load, walk through the dictionary to do replacements, and then write it back. This assumes I know key names, structure, etc.
How can I replace ALL of a given string value in a JSON file with something else?
This function recursively replaces all strings which equal the value original with the value new.
It works on the Python structure, but of course you can use it on a JSON file by using json.load.
It doesn't replace keys in the dictionary, just the values.
def nested_replace(structure, original, new):
    if isinstance(structure, list):
        return [nested_replace(item, original, new) for item in structure]
    if isinstance(structure, dict):
        return {key: nested_replace(value, original, new)
                for key, value in structure.items()}
    if structure == original:
        return new
    return structure
d = [ 'replace', {'key1': 'replace', 'key2': ['replace', 'don\'t replace'] } ]
new_d = nested_replace(d, 'replace', 'now replaced')
print(new_d)
['now replaced', {'key1': 'now replaced', 'key2': ['now replaced', "don't replace"]}]
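To apply it to a file, the round trip is just load, replace, dump. A minimal sketch, assuming the data lives in data.json:
import json

with open('data.json') as f:
    data = json.load(f)

data = nested_replace(data, 'REPLACE_ME', 'my new value')

with open('data.json', 'w') as f:
    json.dump(data, f, indent=2)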
I think there's no big risk if you replace whole values enclosed in quotes (since quotes are escaped in JSON unless they are string delimiters).
I would dump the structure, perform a str.replace (including the double quotes), and parse it again:
import json
d = { 'foo': {'bar' : 'hello'}}
d = json.loads(json.dumps(d).replace('"hello"','"hi"'))
print(d)
result:
{'foo': {'bar': 'hi'}}
I wouldn't risk replacing parts of strings or strings without quotes, because it could change other parts of the file. I can't think of an example where replacing a string enclosed in double quotes can change something else.
There are "clean" solutions like adapting from Replace value in JSON file for key which can be nested by n levels but is it worth the effort? Depends on your requirements.
Why not modify the file directly instead of treating it as JSON?
with open('filepath') as f:
    lines = f.readlines()

with open('filepath_new', 'a') as f:
    for line in lines:
        f.write(line.replace('REPLACE_ME', 'whatever'))
You could load the JSON file into a dictionary and recurse through that to find the proper values but that's unnecessary muscle flexing.
The best way is to simply treat the file as a string and do the replacements that way.
import json

json_file = 'my_file.json'
with open(json_file) as f:
    file_data = f.read()

file_data = file_data.replace('REPLACE_ME', 'new string')
<...>
with open(json_file, 'w') as f:
    f.write(file_data)
json_data = json.loads(file_data)
From here the file can be re-written and you can continue to use json_data as a dict.
Well, that depends. If you want to replace all the strings whose value is "REPLACE_ME" with the same string, you can use this. The for loop loops through all the keys in the dictionary, and the keys are used to select each value; if a value is equal to your search string, it is replaced with the string you want.
search_string = "REPLACE_ME"
replacement = "SOME STRING"
test = {"test1":"REPLACE_ME", "test2":"REPLACE_ME", "test3":"REPLACE_ME", "test4":"REPLACE_ME","test5":{"test6":"REPLACE_ME"}}
def replace_nested(test):
for key,value in test.items():
if type(value) is dict:
replace_nested(value)
else:
if value==search_string:
test[key] = replacement
replace_nested(test)
print(test)
To solve this problem in a dynamic way, I ended up using the same JSON file to declare the variables that we want to replace.
JSON file:
{
    "properties": {
        "property_1": "value1",
        "property_2": "value2"
    },
    "json_file_content": {
        "key_to_find": "{{property_1}} is my value",
        "dict1": {
            "key_to_find": "{{property_2}} is my other value"
        }
    }
}
Python code (references Replace value in JSON file for key which can be nested by n levels):
import json
from os import path

def fixup(a_dict: dict, k: str, subst_dict: dict) -> None:
    """
    Function inspired by the answer linked above: wherever the key k appears,
    substitute the {{placeholders}} in its value using subst_dict.
    """
    for key in a_dict.keys():
        if key == k:
            for s_k, s_v in subst_dict.items():
                a_dict[key] = a_dict[key].replace("{{" + s_k + "}}", s_v)
        elif type(a_dict[key]) is dict:
            fixup(a_dict[key], k, subst_dict)

# ...
file_path = "my/file/path"
if path.exists(file_path):
    with open(file_path, 'rt') as f:
        json_dict = json.load(f)
    fixup(json_dict["json_file_content"], "key_to_find", json_dict["properties"])
    print(json_dict)  # json with variables resolved
else:
    print("file not found")
Hope it helps

Searching and writing

I need to write a program which looks for words with the same three middle characters(each word is 5 characters long) in a list, then writes them into a file like this :
wasdy
casde
tasdf
gsadk
csade
hsadi
Between the groups of similar words I need to leave an empty line. I am kinda stuck.
Is there a way to do this? I use Python 3.2 .
Thanks for your help.
I would use the itertools.groupby function for this. Assuming wordlist is a list of the words you want to group, this code does the trick.
import itertools

# Note: groupby only groups consecutive items, so sort the list by the same
# key first if equal keys might not be adjacent.
for k, v in itertools.groupby(wordlist, lambda word: word[1:4]):
    # here, k is the key the words are grouped by, i.e. word[1:4]
    # and v is an iterable of the words in the group
    for word in v:
        print(word)
    print()
itertools.groupby(wordlist, lambda word: word[1:4]) basically takes all the words, and groups them by word[1:4], i.e. the three middle characters. Here's the output of the above code with your sample data:
wasdy
casde
tasdf

gsadk
csade
hsadi
To get you started: try using the builtin sorted function on the list of words, and for the key you should experiment with the slice [1:4].
For example:
some_list = ['wasdy', 'casde', 'tasdf', 'gsadk', 'other', 'csade', 'hsadi']
sorted(some_list, key = lambda x: sorted(x[1:4]))
# outputs ['wasdy', 'casde', 'tasdf', 'gsadk', 'csade', 'hsadi', 'other']
edit: It was unclear to me whether you wanted "same three middle characters, in order" or just "same three middle characters". If the latter, then you could look at sorted(some_list, key = lambda x: x[1:4]) instead.
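Putting the two ideas together (sort by the unordered middle characters, then group on the same key) reproduces the grouping with blank lines between groups; a sketch:
from itertools import groupby

some_list = ['wasdy', 'casde', 'tasdf', 'gsadk', 'other', 'csade', 'hsadi']

def middle(word):
    # Sort the middle characters so 'asd' and 'sad' compare equal.
    return sorted(word[1:4])

# Sort first so equal keys end up adjacent, which groupby requires.
for _, group in groupby(sorted(some_list, key=middle), key=middle):
    for word in group:
        print(word)
    print()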
Try:
from collections import defaultdict

dict_of_words = defaultdict(list)
for word in list_of_words:
    dict_of_words[word[1:-1]].append(word)
then, to write to an output file:
with open('outfile.txt', 'w') as f:
    for key in dict_of_words:
        f.write('\n'.join(dict_of_words[key]))
        f.write('\n\n')  # blank line between groups
word_list = ['wasdy', 'casde', 'tasdf', 'gsadk', 'csade', 'hsadi']

def test_word(word):
    # True when each of 'a', 's', 'd' appears among the middle three characters.
    return all(x in word[1:4] for x in ['a', 's', 'd'])

with open('yourfile.txt', 'w') as f:
    f.write('\n'.join(word for word in word_list if test_word(word)))
returns:
wasdy
casde
tasdf
gsadk
csade
hsadi
