The dictionary is below:
a = {'querystring': {'dataproduct.keyword': 'abc,def'}}
How can I split it into two dictionaries based on the comma-separated value?
a['querystring'] = {'dataproduct.keyword': 'abc,def'}
Expected output when printing:
{'dataproduct.keyword': 'abc'}
{'dataproduct.keyword': 'def'}
Since a dictionary is a hashmap, the result could also be a list of dictionaries:
[{'dataproduct.keyword': 'abc'}, {'dataproduct.keyword': 'def'}]
Disclaimer:
Before executing, check for a comma.
If a['querystring'] = {'dataproduct.keyword': 'abc'}, there is no need to split.
If a['querystring'] = {'dataproduct.keyword': 'abc,def,efg'}, a comma is present, so only then does it need to be split.
[{key: item} for key, value in a['querystring'].items() for item in value.split(',')]
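A minimal runnable sketch of this one-liner, using the dictionary from the question. Note that split(',') on a comma-free string simply yields the whole string, so the comma check from the disclaimer happens for free:

```python
a = {'querystring': {'dataproduct.keyword': 'abc,def'}}

# Each key is paired with every comma-separated piece of its value.
result = [{key: item}
          for key, value in a['querystring'].items()
          for item in value.split(',')]
print(result)  # [{'dataproduct.keyword': 'abc'}, {'dataproduct.keyword': 'def'}]
```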
A solution that works across all top-level entries, not just the entry with key "querystring":
a = {'querystring': {'dataproduct.keyword': 'abc,def'}}
split_a = []
for value in a.values():
    for sub_key, sub_value in value.items():
        for split_sub_value in sub_value.split(","):
            split_a.append({sub_key: split_sub_value})
Resulting value of split_a is [{'dataproduct.keyword': 'abc'}, {'dataproduct.keyword': 'def'}].
I have the following code to create an empty dictionary:
empty_dict = dict.fromkeys(['apple','ball'])
empty_dict = {'apple': None, 'ball': None}
I have this empty dictionary.
Now I want to add the values from value.txt which has the following content:
value.txt
1
2
I want the resultant dictionary to be as:
{
"apple" : 1,
"ball" : 2
}
I'm not sure how to update only the values in the dictionary.
You don't really need to make the dict first — it makes it inconvenient to get the order correct. You can just zip() the keys and the file lines and pass it to the dictionary constructor like:
keys = ['apple','ball']
with open(path, 'r') as file:
    d = dict(zip(keys, map(str.strip, file)))
print(d)
# {'apple': '1', 'ball': '2'}
This uses strip() to remove the \n characters from the lines in the file.
It's not clear what should happen if you have more lines than keys, but the above will ignore them.
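Since the expected output in the question shows integer values, a hedged variant could convert each stripped line with int() as it is zipped. This sketch uses an in-memory list to stand in for the lines read from value.txt:

```python
keys = ['apple', 'ball']
lines = ['1\n', '2\n']  # stand-in for the lines read from value.txt

# zip() pairs each key with one line; int(line.strip()) drops the
# trailing newline and converts the text to a number.
d = dict(zip(keys, (int(line.strip()) for line in lines)))
print(d)  # {'apple': 1, 'ball': 2}
```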
Consider the line below read in from a txt file:
EDIT: The text file has thousands of lines just like the one below: TAG1=1494947148,1,d,ble,0,2,0,0&TAG2[]=0,229109531800552&TAG2[]=0,22910953180055 ...
In the line there would be some data that corresponds to TAG1 and lots of data that have &TAG2 at their start.
I want to make a dictionary that has further dictionaries within it, like
{
{'TAG1':1494947148,1,d,ble,0,2,0,0}
{'TAG2:
{'1': 0, '2':229109531800552}
{'1': 0, '2':22910953180055}
}
.
.
}
How do I split the string starting at TAG1 and stopping just before the ampersand that precedes TAG2? Does Python provide a way to check whether certain characters have been encountered and to stop/start there?
I would turn them into a dictionary mapping each tag (string key) to a list of values. It doesn't matter whether a tag has one item or several; uniform lists make parsing simple. You can further process the resulting dictionary if you find that necessary.
The code discards the [] in tag names, since every tag is turned into a list anyway.
from itertools import groupby
from operator import itemgetter
import re
s = "TAG1=1494947148,1,d,ble,0,2,0,0&TAG2[]=0,229109531800552&TAG2[]=0,22910953180055"
splitted = map(re.compile(r"(?:\[\])?=").split, s.split("&"))
tag_values = groupby(sorted(splitted, key=itemgetter(0)), key=itemgetter(0))
result = {t: [c[1].split(',') for c in v] for t, v in tag_values}
And when you print the result, you get:
print(result)
{'TAG2': [['0', '229109531800552'], ['0', '22910953180055']], 'TAG1': [['1494947148', '1', 'd', 'ble', '0', '2', '0', '0']]}
How it works
splitted = map(re.compile(r"(?:\[\])?=").split, s.split("&"))
First the line is split on &, turning it into small chunks like "TAG2[]=0,229109531800552"; then map splits each chunk into two parts, removing the = or []= between them.
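As a quick check of this step, here is what that regex split does to one chunk of each style (a sketch using the same pattern as the answer):

```python
import re

# "(?:\[\])?=" matches either "=" or "[]=", so both tag styles
# are cut into a [tag, value_string] pair.
splitter = re.compile(r"(?:\[\])?=")
print(splitter.split("TAG1=1494947148,1,d,ble,0,2,0,0"))
# ['TAG1', '1494947148,1,d,ble,0,2,0,0']
print(splitter.split("TAG2[]=0,229109531800552"))
# ['TAG2', '0,229109531800552']
```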
tag_values = groupby(sorted(splitted, key=itemgetter(0)), key=itemgetter(0))
Because of the map function, splitted is now an iterable that yields two-item lists when consumed. We then sort and group them by tag (the string on the left of the =). Now tag_values has keys representing the tags, each paired with all the matching entries (tag included). It is still an iterable, though, which means none of this has actually happened yet, except for s.split("&").
result = {t: [c[1].split(',') for c in v] for t, v in tag_values}
The last line uses both a list and a dictionary comprehension. We want to turn the result into a dict of tag to list of values. The curly brackets are the dictionary comprehension. The variables t and v are extracted from tag_values, where t is the tag and v is the group of matching entries (again, tag included). At the beginning of the curly brackets, t: means use t as the dictionary key; what comes after the colon is that key's value.
We want to turn each dictionary value into a list of lists. The square brackets are the list comprehension that consumes the iterable v and turns it into a list. The variable c represents each item in v, and because c has two items, the tag and the value string, c[1].split(',') takes the value part and splits it straight into a list. And there is your result.
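If the nested comprehension is hard to read, the same final line can be written as plain loops; this sketch repeats the setup from the answer above:

```python
from itertools import groupby
from operator import itemgetter
import re

s = "TAG1=1494947148,1,d,ble,0,2,0,0&TAG2[]=0,229109531800552&TAG2[]=0,22910953180055"
splitted = map(re.compile(r"(?:\[\])?=").split, s.split("&"))
tag_values = groupby(sorted(splitted, key=itemgetter(0)), key=itemgetter(0))

# Loop form of: {t: [c[1].split(',') for c in v] for t, v in tag_values}
result = {}
for t, v in tag_values:
    values = []
    for c in v:  # c is a [tag, value_string] pair
        values.append(c[1].split(','))
    result[t] = values
print(result)
```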
Further Reading
You really ought to get familiar with list/dict comprehensions and generator expressions. Also take a look at yield if you want to get more done with Python, and learn itertools, functools, and operator along the way. This is basically functional-programming material; Python is not a pure functional language, but these are powerful idioms you can use. Reading up on a functional language like Haskell would also improve your Python skills.
I think this might be what you need:
import json
data = "TAG1=1494947148,1,d,ble,0,2,0,0&TAG2[]=0,229109531800552&TAG2[]=0,22910953180055"
items = data.split("&")
res = {}
for item in items:
    key, value = item.split("=")
    key = key.replace("[]", "")
    values = value.split(",")
    if key in res:
        res[key].append(values)
    else:
        res[key] = [values]
print(res)
print(json.dumps(res))
The results:
{'TAG1': [['1494947148', '1', 'd', 'ble', '0', '2', '0', '0']],
'TAG2': [['0', '229109531800552'], ['0', '22910953180055']]}
{"TAG1": [["1494947148", "1", "d", "ble", "0", "2", "0", "0"]],
"TAG2": [["0", "229109531800552"], ["0", "22910953180055"]]}
This may help you:
string = 'TAG1=1494947148,1,d,ble,0,2,0,0&TAG2[]=0,229109531800552'
data = string.split('&')
print(data)
in_data_dic = {}
for i in data:
    in_data = i.split('=')
    in_data_dic[in_data[0]] = in_data[1]
print(in_data_dic)
Output:
{'TAG2[]': '0,229109531800552', 'TAG1': '1494947148,1,d,ble,0,2,0,0'}
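Note that this plain dict keeps only the last value when a tag such as TAG2[] repeats, as it does in the question's full line. A sketch using collections.defaultdict(list) would preserve every occurrence:

```python
from collections import defaultdict

string = 'TAG1=1494947148,1,d,ble,0,2,0,0&TAG2[]=0,229109531800552&TAG2[]=0,22910953180055'
in_data_dic = defaultdict(list)
for chunk in string.split('&'):
    key, value = chunk.split('=')
    in_data_dic[key].append(value)  # repeated tags accumulate instead of overwriting
print(dict(in_data_dic))
# {'TAG1': ['1494947148,1,d,ble,0,2,0,0'],
#  'TAG2[]': ['0,229109531800552', '0,22910953180055']}
```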
How would I remove a \n or newline character from a dict value in Python?
testDict = {'salutations': 'hello', 'farewell': 'goodbye\n'}
testDict.strip('\n') # I know this part is incorrect :)
print(testDict)
To update the dictionary in-place, just iterate over it and apply str.rstrip() to values:
for key, value in testDict.items():
    testDict[key] = value.rstrip()
To create a new dictionary, you can use a dictionary comprehension:
testDict = {key: value.rstrip() for key, value in testDict.items()}
Use dictionary comprehension:
testDict = {key: value.strip('\n') for key, value in testDict.items()}
You're trying to strip a newline from the Dictionary Object.
What you want is to iterate over all Dictionary keys and update their values.
for key in testDict.keys():
    testDict[key] = testDict[key].strip()
That would do the trick.
I have several very large, not-quite-CSV log files.
Given the following conditions:
value fields have unescaped newlines and commas, almost anything can be in the value field including '='
each valid line has an unknown number of valid value fields
valid value looks like key=value such that a valid line looks like key1=value1, key2=value2, key3=value3 etc.
the start of each valid line should begin with eventId=<some number>,
What is the best way to read a file, split the file into correct lines and then parse each line into correct key value pairs?
I have tried
file_name = 'file.txt'
read_file = open(file_name, 'r').read().split(',\neventId')
This correctly parses the first entry, but every other entry starts with =# instead of eventId=#. Is there a way to keep the delimiter and split on the valid newline?
Also, speed is very important.
Example Data:
eventId=123, key=value, key2=value2:
this, will, be, a problem,
maybe?=,
anotherkey=anothervalue,
eventId=1234, key1=value1, key2=value2, key3=value3,
eventId=12345, key1=
msg= {this is not a valid key value pair}, key=value, key21=value=,
Yes, the file really is this messy (sometimes). Each event here has three key-value pairs, although in reality there is an unknown number of key-value pairs per event.
This problem is pretty insane, but here's a solution that seems to work. Always use an existing library to output formatted data, kids.
import re
in_string = """eventId=123, goodkey=goodvalue, key2=somestuff:
this, will, be, a problem,
maybe?=,
anotherkey=anothervalue, gotit=see,
the problem===s,
eventId=1234, key1=value1, key2=value2, key3=value3,
eventId=12345, key1=
msg= {this is not a valid key value pair}, validkey=validvalue,"""
line_matches = list(re.finditer(r'(,\n)?eventId=\d', in_string))
lines = []
for i in range(len(line_matches)):
    match_start = line_matches[i].start()
    next_match_start = line_matches[i+1].start() if i < len(line_matches)-1 else len(in_string)-1
    line = in_string[match_start:next_match_start].lstrip(',\n')
    lines.append(line)
lineDicts = []
for line in lines:
    d = {}
    pad_line = ', ' + line
    matches = list(re.finditer(r', [\w\d]+=', pad_line))
    for i in range(len(matches)):
        match = matches[i]
        key = match.group().lstrip(', ').rstrip('=')
        next_match_start = matches[i+1].start() if i < len(matches)-1 else len(pad_line)
        value = pad_line[match.end():next_match_start]
        d[key] = value
    lineDicts.append(d)
print(lineDicts)
Outputs [{'eventId': '123', 'key2': 'somestuff:\nthis, will, be, a problem,\nmaybe?=,\nanotherkey=anothervalue', 'goodkey': 'goodvalue', 'gotit': 'see,\nthe problem===s'}, {'eventId': '1234', 'key2': 'value2', 'key1': 'value1', 'key3': 'value3'}, {'eventId': '12345', 'key1': '\nmsg= {this is not a valid key value pair}', 'validkey': 'validvalue'}]
If "the start of each valid line should begin with eventId=" is correct, you can groupby those lines and find valid pairs with a regex:
from itertools import groupby
import re
with open("test.txt") as f:
    r = re.compile(r"\w+=\w+")
    grps = groupby(f, key=lambda x: x.startswith("eventId="))
    d = dict(l.split("=") for k, v in grps if k
             for l in r.findall(next(v))[1:])
    print(d)
{'key3': 'value3', 'key2': 'value2', 'key1': 'value1', 'goodkey': 'goodvalue'}
If you want to keep the eventIds:
from itertools import groupby
import re

with open("test.txt") as f:
    r = re.compile(r"\w+=\w+")
    grps = groupby(f, key=lambda x: x.startswith("eventId="))
    d = list(r.findall(next(v)) for k, v in grps if k)
    print(d)
[['eventId=123', 'goodkey=goodvalue', 'key2=somestuff'], ['eventId=1234', 'key1=value1', 'key2=value2', 'key3=value3']]
It is not clear from your description exactly what the output should be. If you want all the valid key=value pairs, and "the start of each valid line should begin with eventId=" is not accurate:
from itertools import groupby,chain
import re
def parse(fle):
    with open(fle) as f:
        r = re.compile(r"\w+=\w+")
        grps = groupby(f, key=lambda x: x.startswith("eventId="))
        for k, v in grps:
            if k:
                sub = "".join(list(v) + list(next(grps)[1]))
                yield from r.findall(sub)
print(list(parse("test.txt")))
Output:
['eventId=123', 'key=value', 'key2=value2', 'anotherkey=anothervalue',
'eventId=1234', 'key1=value1', 'key2=value2', 'key3=value3',
'eventId=12345', 'key=value', 'key21=value']
If your values really can contain anything, there's no unambiguous way of parsing. Any key=value pair could be part of the preceding value. Even an eventId=# pair on a new line could be part of a value from the previous line.
Now, perhaps you can do a "good enough" parse on the data despite the ambiguity, if you assume that values will never contain valid looking key= substrings. If you know the possible keys (or at least, what constraints they have, like being alphanumeric), it will be a lot easier to guess at what is a new key and what is just part of the previous value.
Anyway, if we assume that all alphanumeric strings followed by equals signs are indeed keys, we can do a parse with regular expressions. Unfortunately, there's no easy way to do this line by line, nor is there a good way to capture all the key-value pairs in a single scan. However, it's not too hard to scan once to get the log lines (which may have embedded newlines) and then separately get the key=value, pairs for each one.
import re

with open("my_log_file") as infile:
    text = infile.read()

line_pattern = r'(?S)eventId=\d+,.*?(?:$|(?=\neventId=\d+))'
kv_pattern = r'(?S)(\w+)=(.*?),\s*(?:$|(?=\w+=))'
results = [re.findall(kv_pattern, line) for line in re.findall(line_pattern, text)]
I'm assuming that the file is small enough to fit into memory as a string. It would be quite a bit more obnoxious to solve the problem if the file can't all be handled at once.
If we run this regex matching on your example text, we get:
[[('eventId', '123'), ('key', 'value'), ('key2', 'value2:\nthis, will, be, a problem,\nmaybe?='), ('anotherkey', 'anothervalue')],
[('eventId', '1234'), ('key1', 'value1'), ('key2', 'value2'), ('key3', 'value3')],
[('eventId', '12345'), ('key1', '\nmsg= {this is not a valid key value pair}'), ('key', 'value'), ('key21', 'value=')]]
maybe? is not considered a key because of the question mark. msg and the final value are not considered keys because there were no commas separating them from a previous value.
Oh! This is an interesting problem. You'll want to process each line, and each part of a line, separately, without iterating through the file more than once.
data_dict = {}
with open('file.txt', 'r') as f:
    file_lines = f.readlines()
for line in file_lines:
    line_list = line.split(',')
    if line_list and 'eventId' in line_list[0]:
        for item in line_list:
            pair = item.split('=')
            if len(pair) == 2:
                data_dict.update({pair[0]: pair[1]})
That should do it. Enjoy!
If there are spaces in the 'pseudo csv' please change the last line to:
data_dict.update({pair[0].strip(): pair[1].strip()})
In order to remove spaces from the strings for your key and value.
p.s. If this answers your question, please click the check mark on the left to record this as an accepted answer. Thanks!
p.p.s. A set of lines from your actual data would be very helpful in writing something to avoid error cases.