I have a list of dicts from which I have extracted the data that I need: 'uni', 'gp', 'fr', 'rn'.
uni:1 gp:CC fr:c2 rn:DS
uni:1 gp:CC fr:c2 rn:PP
uni:1 gp:CC fr:c2 rn:LL
uni:2 gp:CC fr:c2 rn:DS
uni:2 gp:CC fr:c2 rn:LL
.
.
.
Above is the output that I write to a txt file with the code below:
for line in new_l:
    for key, value in line.items():
        if key == 'uni':
            uni.append(value)
        elif key == 'gp':
            gp.append(value)
        elif key == 'fr':
            fr.append(value)
        elif key == 'rn':
            rn.append(value)
with open('sampel1.list', mode='w') as f:
    for uni_v, gp_v, fr_v, rn_v in zip(uni, gp, fr, rn):
        f.write('uni:{uni}\t,gp:{gp}\t,fr:{fr}\t,rn:{rn}-\n'.format(uni=uni_v, gp=gp_v, fr=fr_v, rn=rn_v))
The expected output I want is to merge the 'rn' values that differ while the 'uni', 'gp' and 'fr' values are the same:
unique:1 gp:CC fr:c2 rn:DS+PP+LL
unique:2 gp:CC fr:c2 rn:DS+LL
Here's one way I might do something like this using pure Python. Note: this particular solution relies on the fact that Python 3.7 dicts preserve insertion order:
from collections import defaultdict

# This will map the (uni, gp, fr) triplets to the list of merged rn values
merged = defaultdict(list)
for l in new_l:
    # Assuming these keys are always present; if not you will need to check
    # that and skip invalid entries
    key = (l['uni'], l['gp'], l['fr'])
    merged[key].append(l['rn'])

# Now if you wanted to write this to a file, say:
with open(filename, 'w') as f:
    for (uni, gp, fr), rn in merged.items():
        f.write(f'uni:{uni}\tgp:{gp}\tfr:{fr}\trn:{"+".join(rn)}\n')
Note, when I wrote "pure Python" I meant just using the standard library. In practice I might use Pandas if I'm working with tabular data.
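For instance, a pandas sketch of the same grouping might look like this (assuming new_l is the list of dicts from the question; shown here with inline sample data):

```python
import pandas as pd

# Sample data standing in for new_l from the question
new_l = [
    {'uni': 1, 'gp': 'CC', 'fr': 'c2', 'rn': 'DS'},
    {'uni': 1, 'gp': 'CC', 'fr': 'c2', 'rn': 'PP'},
    {'uni': 1, 'gp': 'CC', 'fr': 'c2', 'rn': 'LL'},
    {'uni': 2, 'gp': 'CC', 'fr': 'c2', 'rn': 'DS'},
    {'uni': 2, 'gp': 'CC', 'fr': 'c2', 'rn': 'LL'},
]

df = pd.DataFrame(new_l)
# Group on the three key columns and join each group's rn values with '+'
merged = df.groupby(['uni', 'gp', 'fr'], sort=False)['rn'].agg('+'.join).reset_index()
print(merged)
```

merged['rn'] then holds 'DS+PP+LL' for uni 1 and 'DS+LL' for uni 2, matching the expected output.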
You should study a little about algorithms and data structures.
In this case you can use the first 3 elements to create a unique hash key, and based on this value append the last element or not.
Example:
lst = []
lst.append({'uni':1, 'gp':'CC', 'fr':'c2', 'rn':'DS'})
lst.append({'uni':1, 'gp':'CC', 'fr':'c2', 'rn':'PP'})
lst.append({'uni':1, 'gp':'CC', 'fr':'c2', 'rn':'LL'})
lst.append({'uni':2, 'gp':'CC', 'fr':'c2', 'rn':'DS'})
lst.append({'uni':2, 'gp':'CC', 'fr':'c2', 'rn':'PP'})
lst.append({'uni':3, 'gp':'CC', 'fr':'c2', 'rn':'DS'})

groups = {}  # named 'groups' to avoid shadowing the built-in hash()
for line in lst:
    hashkey = str(line['uni']) + line['gp'] + line['fr']
    if hashkey in groups:
        groups[hashkey]['rn'] += "+" + line['rn']
    else:
        groups[hashkey] = {'uni': line['uni'], 'gp': line['gp'], 'fr': line['fr'], 'rn': line['rn']}
print(groups)
result: {'1CCc2': {'uni': 1, 'gp': 'CC', 'fr': 'c2', 'rn': 'DS+PP+LL'}, '2CCc2': {'uni': 2, 'gp': 'CC', 'fr': 'c2', 'rn': 'DS+PP'}, '3CCc2': {'uni': 3, 'gp': 'CC', 'fr': 'c2', 'rn': 'DS'}}
I thought I'd add how I approach problems like this. You are grouping by the first 3 fields, so I would place them in a tuple (not a list; the dictionary index must be an immutable object) and use that as an index to a dictionary. Then, as you read each line from your input file, test if the tuple is already in the dictionary or not. If it is, concatenate to the previous values already saved.
myDict = {}
f = open("InputData.txt", "r")
for line in f:
    tup = line.strip().split('\t')
    ind = (tup[0], tup[1], tup[2])
    # tup[3] looks like 'rn:DS'; tup[3][3:] drops the 'rn:' prefix
    if ind in myDict:
        if tup[3][3:] not in myDict[ind]:
            myDict[ind] = myDict[ind] + "+" + tup[3][3:]
    else:
        myDict[ind] = tup[3][3:]
f.close()
print(myDict)
Once the data is in the dictionary object, you can iterate over it and write your output like in the other answers above. (My answer assumes your input text file is tab delimited.)
I find dictionaries very helpful in cases like this.
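The write-out step that paragraph refers to could be sketched like this (a sketch; the tuple keys and concatenated rn strings are assumed to match what the loop above produces, and the output filename is made up):

```python
# Hypothetical grouped data, shaped like what the loop above builds:
# (uni, gp, fr) tuple -> concatenated rn string
myDict = {
    ('uni:1', 'gp:CC', 'fr:c2'): 'DS+PP+LL',
    ('uni:2', 'gp:CC', 'fr:c2'): 'DS+LL',
}

with open('OutputData.txt', 'w') as out:
    for (uni, gp, fr), rn in myDict.items():
        out.write('{}\t{}\t{}\trn:{}\n'.format(uni, gp, fr, rn))
```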
Related
I'm importing data from a text file, and then made a dictionary out of that. I'm now trying to make a separate one, with the entries that have the same value only. Is that possible?
Sorry if that's a little confusing! But basically, the text file looks like this:
"Andrew", "Present"
"Christine", "Absent"
"Liz", "Present"
"James", "Present"
I made it into a dictionary first, so I could group them into keys and values, and now I'm trying to make a list of the people who were 'present' only (I don't want to delete the absent ones, I just want a separate list), and then pick one from that list randomly.
This is what I tried:
import random

d = {}
with open('directory.txt') as f:
    for line in f:
        name, attendance = line.strip().split(',')
        d[name.strip()] = attendance.strip()

present_list = []
present_list.append({"name": str(d.keys), "attendance": "Present"})
print(random.choice(present_list))
When I tried running it, I only get:
{'name': '<built-in method keys of dict object at 0x02B26690>', 'attendance': 'Present'}
Which part should I change? Thank you so much in advance!
You can try this:
present_list = [key for key in d if d[key] == "Present"]
First, you have to change the way you read the lines, so that your initial dict has the attendance as key:
import random
from collections import defaultdict

d = defaultdict(list)
with open('directory.txt') as f:
    for line in f:
        name, attendance = line.strip().split(',')
        d[attendance.strip()].append(name.strip())

present_list = d["Present"]
print(random.choice(present_list) if present_list else "All absent")
dict.keys is a method, not a field. So you must instead call it:
d.keys()
This returns a view object; if you want a comma-separated list with square brackets, just calling str() on it is OK. If you want a different formatting, consider ','.join(d.keys()) to get a simple comma-separated list with no square brackets.
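A quick illustration of the difference:

```python
d = {'Andrew': 'Present', 'Liz': 'Present'}

print(str(d.keys))         # <built-in method keys of dict object at 0x...> -- the method itself
print(str(d.keys()))       # dict_keys(['Andrew', 'Liz'])
print(','.join(d.keys()))  # Andrew,Liz
```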
UPDATE:
You also have no filtering in place. Instead I'd try something like this, where you group the names by status as you read (new code in BOLD):
d = {}
with open('directory.txt') as f:
    for line in f:
        name, attendance = line.strip().split(',')
        **if attendance.strip() not in d:
            d[attendance.strip()] = [name.strip()]
        else:
            d[attendance.strip()].append(name.strip())**
This way you don't need to go through all those intermediate steps, and you will have something like {'Present': ['Andrew', 'Liz', 'James']}
Consider the line below read in from a txt file:
EDIT: The text file has thousands of lines just like the one below: TAG1=1494947148,1,d,ble,0,2,0,0&TAG2[]=0,229109531800552&TAG2[]=0,22910953180055 ...
In the line there would be some data that corresponds to TAG1 and lots of data that have &TAG2 at their start.
I want to make a dictionary that has further dictionaries within it, like
{
{'TAG1':1494947148,1,d,ble,0,2,0,0}
{'TAG2:
{'1': 0, '2':229109531800552}
{'1': 0, '2':22910953180055}
}
.
.
}
How do I split the string starting at TAG1 and stopping just before the ampersand before TAG2? Does python allow some way to check if a certain character(s) has been encountered and stop/start there?
I would turn them into a dictionary of string key and list of values. It doesn't matter if a tag has one or more items, just lists would make parsing them simple. You can further process the result dictionary if you find that necessary.
The code will discard the [] in tag names, as they all turned to list anyway.
from itertools import groupby
from operator import itemgetter
import re
s = "TAG1=1494947148,1,d,ble,0,2,0,0&TAG2[]=0,229109531800552&TAG2[]=0,22910953180055"
splitted = map(re.compile(r"(?:\[\])?=").split, s.split("&"))
tag_values = groupby(sorted(splitted, key=itemgetter(0)), key=itemgetter(0))
result = {t: [c[1].split(',') for c in v] for t, v in tag_values}
And when you print the result, you get:
print(result)
{'TAG2': [['0', '229109531800552'], ['0', '22910953180055']], 'TAG1': [['1494947148', '1', 'd', 'ble', '0', '2', '0', '0']]}
How it works
splitted = map(re.compile(r"(?:\[\])?=").split, s.split("&"))
first you split the line with &. That will turn the line into little chunks like "TAG2[]=0,229109531800552", then map turns each chunk into two parts removing the = or []= between them.
tag_values = groupby(sorted(splitted, key=itemgetter(0)), key=itemgetter(0))
Because of the map function, splitted is now an iterable that will return lists of two items when consumed. We further sort, then group them by the tag (the string on the left of =). Now we have tag_values with keys representing tags, each tag paired with all the matching values (including the tag). Still an iterable though, which means none of the things we talked about have actually happened yet, except for s.split("&")
result = {t: [c[1].split(',') for c in v] for t, v in tag_values}
The last line uses both list and dictionary comprehension. We want to turn the result into a dict of tag to list of values. The curly brackets are the dictionary comprehension. The inner variables t and v are extracted from tag_values, where t is the tag and v is the grouped matching values (again, tag included). At the beginning of the curly brackets, t: means use t as a dictionary key; after the colon comes the key's matching value.
We want to turn the dictionary value into a list of lists. The square brackets are the list comprehension that consumes the iterable v and turns it into a list. Variable c represents each item in v, and because c has two items, the tag and the string of values, c[1].split(',') takes the value part and splits it right into a list. And there is your result.
Further Reading
You really ought to get familiar with list/dict comprehension and generator expression, also take a look at yield if you want to get more things done with python, and learn itertools, functools, operator along the way. Basically just functional programming stuff, python is not a pure functional language though, these are just some powerful metaphors you can use. Read up on some functional languages like haskell that would also improve your python skills.
I think this might be what you need:
import json

data = "TAG1=1494947148,1,d,ble,0,2,0,0&TAG2[]=0,229109531800552&TAG2[]=0,22910953180055"
items = data.split("&")
res = {}
for item in items:
    key, value = item.split("=")
    key = key.replace("[]", "")
    values = value.split(",")
    if key in res:
        res[key].append(values)
    else:
        res[key] = [values]
print(res)
print(json.dumps(res))
The results:
{'TAG1': [['1494947148', '1', 'd', 'ble', '0', '2', '0', '0']],
'TAG2': [['0', '229109531800552'], ['0', '22910953180055']]}
{"TAG1": [["1494947148", "1", "d", "ble", "0", "2", "0", "0"]],
"TAG2": [["0", "229109531800552"], ["0", "22910953180055"]]}
This may help you:
string = 'TAG1=1494947148,1,d,ble,0,2,0,0&TAG2[]=0,229109531800552'
data = string.split('&')
print(data)
in_data_dic = {}
for i in data:
    in_data = i.split('=')
    in_data_dic[in_data[0]] = in_data[1]
print(in_data_dic)
output
{'TAG2[]': '0,229109531800552', 'TAG1': '1494947148,1,d,ble,0,2,0,0'}
I have several very large not quite csv log files.
Given the following conditions:
value fields have unescaped newlines and commas, almost anything can be in the value field including '='
each valid line has an unknown number of valid value fields
valid value looks like key=value such that a valid line looks like key1=value1, key2=value2, key3=value3 etc.
the start of each valid line should begin with eventId=<some number>,
What is the best way to read a file, split the file into correct lines and then parse each line into correct key value pairs?
I have tried
file_name = 'file.txt'
read_file = open(file_name, 'r').read().split(',\neventId')
This correctly parses the first entry, but all other entries start with =# instead of eventId=#. Is there a way to keep the delimiter and split on the valid newline?
Also, speed is very important.
Example Data:
eventId=123, key=value, key2=value2:
this, will, be, a problem,
maybe?=,
anotherkey=anothervalue,
eventId=1234, key1=value1, key2=value2, key3=value3,
eventId=12345, key1=
msg= {this is not a valid key value pair}, key=value, key21=value=,
Yes the file really is this messy (sometimes) each event here has 3 key value pairs although in reality there is an unknown number of key value pairs in each event.
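On the sub-question of keeping the delimiter while splitting: a lookahead in re.split matches without consuming, so the eventId= marker stays attached to each chunk. A minimal sketch on the sample data (this alone does not resolve the embedded-comma and newline ambiguity):

```python
import re

text = """eventId=123, key=value, key2=value2:
this, will, be, a problem,
maybe?=,
anotherkey=anothervalue,
eventId=1234, key1=value1, key2=value2, key3=value3,"""

# Split on ',\n' only when the next line starts a new event;
# the lookahead keeps 'eventId=' at the front of each record
records = re.split(r',\n(?=eventId=)', text)
print(records)
```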
This problem is pretty insane, but here's a solution that seems to work. Always use an existing library to output formatted data, kids.
import re

in_string = """eventId=123, goodkey=goodvalue, key2=somestuff:
this, will, be, a problem,
maybe?=,
anotherkey=anothervalue, gotit=see,
the problem===s,
eventId=1234, key1=value1, key2=value2, key3=value3,
eventId=12345, key1=
msg= {this is not a valid key value pair}, validkey=validvalue,"""

line_matches = list(re.finditer(r'(,\n)?eventId=\d', in_string))
lines = []
for i in range(len(line_matches)):
    match_start = line_matches[i].start()
    next_match_start = line_matches[i+1].start() if i < len(line_matches)-1 else len(in_string)-1
    line = in_string[match_start:next_match_start].lstrip(',\n')
    lines.append(line)

lineDicts = []
for line in lines:
    d = {}
    pad_line = ', ' + line
    matches = list(re.finditer(r', [\w\d]+=', pad_line))
    for i in range(len(matches)):
        match = matches[i]
        key = match.group().lstrip(', ').rstrip('=')
        next_match_start = matches[i+1].start() if i < len(matches)-1 else len(pad_line)
        value = pad_line[match.end():next_match_start]
        d[key] = value
    lineDicts.append(d)
print(lineDicts)
Outputs [{'eventId': '123', 'key2': 'somestuff:\nthis, will, be, a problem,\nmaybe?=,\nanotherkey=anothervalue', 'goodkey': 'goodvalue', 'gotit': 'see,\nthe problem===s'}, {'eventId': '1234', 'key2': 'value2', 'key1': 'value1', 'key3': 'value3'}, {'eventId': '12345', 'key1': '\nmsg= {this is not a valid key value pair}', 'validkey': 'validvalue'}]
If "the start of each valid line should begin with eventId=" is correct, you can group those lines with groupby and find the valid pairs with a regex:
from itertools import groupby
import re

with open("test.txt") as f:
    r = re.compile(r"\w+=\w+")
    grps = groupby(f, key=lambda x: x.startswith("eventId="))
    d = dict(l.split("=") for k, v in grps if k
             for l in r.findall(next(v))[1:])
    print(d)
{'key3': 'value3', 'key2': 'value2', 'key1': 'value1', 'goodkey': 'goodvalue'}
If you want to keep the eventIds:
from itertools import groupby
import re

with open("test.txt") as f:
    r = re.compile(r"\w+=\w+")
    grps = groupby(f, key=lambda x: x.startswith("eventId="))
    d = list(r.findall(next(v)) for k, v in grps if k)
    print(d)
[['eventId=123', 'goodkey=goodvalue', 'key2=somestuff'], ['eventId=1234', 'key1=value1', 'key2=value2', 'key3=value3']]
It is not clear from your description exactly what the output should be. If you want all the valid key=value pairs, and "the start of each valid line should begin with eventId=" is not accurate:
from itertools import groupby
import re

def parse(fle):
    with open(fle) as f:
        r = re.compile(r"\w+=\w+")
        grps = groupby(f, key=lambda x: x.startswith("eventId="))
        for k, v in grps:
            if k:
                sub = "".join(list(v) + list(next(grps)[1]))
                yield from r.findall(sub)

print(list(parse("test.txt")))
Output:
['eventId=123', 'key=value', 'key2=value2', 'anotherkey=anothervalue',
'eventId=1234', 'key1=value1', 'key2=value2', 'key3=value3',
'eventId=12345', 'key=value', 'key21=value']
If your values really can contain anything, there's no unambiguous way of parsing. Any key=value pair could be part of the preceding value. Even an eventId=# pair on a new line could be part of a value from the previous line.
Now, perhaps you can do a "good enough" parse on the data despite the ambiguity, if you assume that values will never contain valid looking key= substrings. If you know the possible keys (or at least, what constraints they have, like being alphanumeric), it will be a lot easier to guess at what is a new key and what is just part of the previous value.
Anyway, if we assume that all alphanumeric strings followed by equals signs are indeed keys, we can do a parse with regular expressions. Unfortunately, there's no easy way to do this line by line, nor is there a good way to capture all the key-value pairs in a single scan. However, it's not too hard to scan once to get the log lines (which may have embedded newlines) and then separately get the key=value, pairs for each one.
import re

with open("my_log_file") as infile:
    text = infile.read()

line_pattern = r'(?s)eventId=\d+,.*?(?:$|(?=\neventId=\d+))'
kv_pattern = r'(?s)(\w+)=(.*?),\s*(?:$|(?=\w+=))'
results = [re.findall(kv_pattern, line) for line in re.findall(line_pattern, text)]
I'm assuming that the file is small enough to fit into memory as a string. It would be quite a bit more obnoxious to solve the problem if the file can't all be handled at once.
If we run this regex matching on your example text, we get:
[[('eventId', '123'), ('key', 'value'), ('key2', 'value2:\nthis, will, be, a problem,\nmaybe?='), ('anotherkey', 'anothervalue')],
[('eventId', '1234'), ('key1', 'value1'), ('key2', 'value2'), ('key3', 'value3')],
[('eventId', '12345'), ('key1', '\nmsg= {this is not a valid key value pair}'), ('key', 'value'), ('key21', 'value=')]]
maybe? is not considered a key because of the question mark. msg and the final value are not considered keys because there were no commas separating them from a previous value.
Oh! This is an interesting problem. You'll want to process each line and part of line separately without iterating through the file more than once.
data_dict = {}
file_lines = open('file.txt', 'r').readlines()
for line in file_lines:
    line_list = line.split(',')
    if len(line_list) >= 1:
        if 'eventId' in line_list[0]:
            for item in line_list:
                if '=' not in item:  # skip fragments that have no key=value shape
                    continue
                pair = item.split('=')
                data_dict.update({pair[0]: pair[1]})
That should do it. Enjoy!
If there are spaces in the 'pseudo csv', please change the last line to:
    data_dict.update({pair[0].strip(): pair[1].strip()})
in order to remove spaces from the strings for your key and value. (Note: strip(), not split(); split() would return a list, which can't be used as a dictionary key.)
p.p.s. A set of lines from your actual data would be very helpful in writing something to avoid error cases.
Quick question: in Python 3, if I have the following code
def file2dict(filename):
    dictionary = {}
    data = open(filename, 'r')
    for line in data:
        [ key, value ] = line.split(',')
        dictionary[key] = value
    data.close()
    return dictionary
It means that the file MUST contain exactly 2 comma-separated fields (strings, numbers, or whatever) on every line, because of this line:
[ key, value ] = line.split(',')
So, if in my file I have something like this
John,45,65
Jack,56,442
The function throws an exception.
The question: why key, value are in square brackets? Why, for example,
adr, port = s.accept()
does not use square brackets?
And how to modify this code if I want to attach 2 values to every key in a dictionary? Thank you.
The [ and ] around key, value aren't getting you anything.
The error that you're getting, ValueError: too many values to unpack, is because you are splitting text like John,45,65 by the commas. Do "John,45,65".split(',') in a shell. You get
>>> "John,45,65".split(',')
['John', '45', '65']
Your code is trying to assign 3 values, "John", 45, and 65, to two variables, key and value, thus the error.
There are a few options:
1) str.split has an optional maxsplit parameter:
>>> "John,45,65".split(',', 1)
['John', '45,65']
if "45,65" is the value you want to set for that key in the dictionary.
2) Cut the extra value.
If the 65 isn't what you want, then you can do something either like
>>> name, age, unwanted = "John,45,65".split(',',)
>>> name, age, unwanted
('John', '45', '65')
>>> dictionary[name] = age
>>> dictionary
{'John': '45'}
and just not use the unwanted variable, or split into a list and don't use the last element:
>>> data = "John,45,65".split(',')
>>> dictionary[data[0]] = data[1]
>>> dictionary
{'John': '45'}
You can use three variables instead of two, and make the first one the key:
def file2dict(filename):
    dictionary = {}
    data = open(filename, 'r')
    for line in data:
        key, value1, value2 = line.split(',')
        dictionary[key] = [int(value1), int(value2)]
    data.close()
    return dictionary
When doing a line split to a dictionary, consider limiting the number of splits by specifying maxsplit, and checking to make sure that the line contains a comma:
def file2dict(filename):
    data = open(filename, 'r')
    dictionary = dict(item.split(",", 1) for item in data if "," in item)
    data.close()
    return dictionary
I have a text file which looks like this:
sample.txt
49416 286:25:58 2570460 36252408 04:29:00 R qp256
49486 180:56:21 5714784 7585688 06:44:33 R qp32
49501 58:19:52 36640572 39860816 02:02:09 R qp32
How can I get output in the form of a dictionary, or assign it to a new file or a list (whichever is the better method), so that I can use the output later and access each of those elements?
I found an example which does something like this, but could not match it to my code:
newdict:
{'j_id':'49416','t1':'286:25:58','t2':'2570460','t3':'36252408','ot':'04:29:00','stat':'R','q':'qp256'}
{'j_id':'49486','t1':'180:56:21','t2':'5714784','t3':'7585688','ot':'06:44:33','stat':'R','q':'qp32'}
{'j_id':'49501','t1':'58:19:52','t2':'36640572','t3':'39860816','ot':'02:02:09','stat':'R','q':'qp32'}
The straightforward way would be:
keys = ['j_id', 't1', 't2', 't3', 'ot', 'stat', 'q']
dicts = []
with open('sample.txt') as f:
    for line in f:
        values = line.strip().split()
        dicts.append(dict(zip(keys, values)))
print(dicts)
If you need something more robust use the csv module.
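If the columns were tab-delimited, a csv.DictReader sketch might look like this (the filename sample.tsv and the tab delimiter are assumptions; csv delimiters must be a single character, so this does not handle runs of spaces like the sample above):

```python
import csv

# Write a hypothetical tab-delimited variant of the sample data
with open('sample.tsv', 'w', newline='') as f:
    f.write('49416\t286:25:58\t2570460\t36252408\t04:29:00\tR\tqp256\n')
    f.write('49486\t180:56:21\t5714784\t7585688\t06:44:33\tR\tqp32\n')

keys = ['j_id', 't1', 't2', 't3', 'ot', 'stat', 'q']

# DictReader handles quoting and escaping that plain split() does not
with open('sample.tsv', newline='') as f:
    reader = csv.DictReader(f, fieldnames=keys, delimiter='\t')
    dicts = list(reader)

print(dicts)
```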