I have a list of dicts from which I have extracted the data that I need: 'uni', 'gp', 'fr', 'rn'.
uni:1 gp:CC fr:c2 rn:DS
uni:1 gp:CC fr:c2 rn:PP
uni:1 gp:CC fr:c2 rn:LL
uni:2 gp:CC fr:c2 rn:DS
uni:2 gp:CC fr:c2 rn:LL
.
.
.
Above is the output that I write to a txt file with the code below:
unique, pg, rf, rn = [], [], [], []
for line in new_l:
    for key, value in line.items():
        if key == 'uni':
            unique.append(value)
        elif key == 'gp':
            pg.append(value)
        elif key == 'fr':
            rf.append(value)
        elif key == 'rn':
            rn.append(value)

with open('sampel1.list', mode='w') as f:
    for uni, gp, fr, rn_value in zip(unique, pg, rf, rn):
        f.write('uni:{uni}\t,gp:{gp}\t,fr:{fr}\t,rn:{rn}-\n'.format(uni=uni, gp=gp, fr=fr, rn=rn_value))
The expected output I want is to merge the 'rn' values that differ from each other while 'uni', 'gp' and 'fr' are the same:
unique:1 gp:CC fr:c2 rn:DS+PP+LL
unique:2 gp:CC fr:c2 rn:DS+LL
Here's one way I might do something like this using pure Python. Note: this particular solution relies on the fact that Python 3.7+ dicts preserve insertion order:
from collections import defaultdict

# This will map the (uni, gp, fr) triplets to the list of merged rn values
merged = defaultdict(list)

for l in new_l:
    # Assuming these keys are always present; if not you will need to check
    # that and skip invalid entries
    key = (l['uni'], l['gp'], l['fr'])
    merged[key].append(l['rn'])

# Now if you wanted to write this to a file, say:
with open(filename, 'w') as f:
    for (uni, gp, fr), rn in merged.items():
        f.write(f'uni:{uni}\tgp:{gp}\tfr:{fr}\trn:{"+".join(rn)}\n')
Note, when I wrote "pure Python" I meant just using the standard library. In practice I might use Pandas if I'm working with tabular data.
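As a sketch of what that Pandas variant could look like (assuming the same list-of-dicts input as above; the column handling is illustrative, not a definitive implementation):

```python
import pandas as pd

# Hypothetical records mirroring the question's data
new_l = [
    {'uni': 1, 'gp': 'CC', 'fr': 'c2', 'rn': 'DS'},
    {'uni': 1, 'gp': 'CC', 'fr': 'c2', 'rn': 'PP'},
    {'uni': 1, 'gp': 'CC', 'fr': 'c2', 'rn': 'LL'},
    {'uni': 2, 'gp': 'CC', 'fr': 'c2', 'rn': 'DS'},
    {'uni': 2, 'gp': 'CC', 'fr': 'c2', 'rn': 'LL'},
]

df = pd.DataFrame(new_l)
# Group on the triplet and join each group's rn values with '+'
merged = (df.groupby(['uni', 'gp', 'fr'], sort=False)['rn']
            .agg('+'.join)
            .reset_index())
print(merged)
```

`sort=False` keeps the groups in first-appearance order, matching the insertion-order behavior of the dict solution.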
You should study a little about algorithms and data structures.
In this case you can use the first three elements to create a unique hash key, and based on that value decide whether or not to append the last element.
Example:
lst = []
lst.append({'uni': 1, 'gp': 'CC', 'fr': 'c2', 'rn': 'DS'})
lst.append({'uni': 1, 'gp': 'CC', 'fr': 'c2', 'rn': 'PP'})
lst.append({'uni': 1, 'gp': 'CC', 'fr': 'c2', 'rn': 'LL'})
lst.append({'uni': 2, 'gp': 'CC', 'fr': 'c2', 'rn': 'DS'})
lst.append({'uni': 2, 'gp': 'CC', 'fr': 'c2', 'rn': 'PP'})
lst.append({'uni': 3, 'gp': 'CC', 'fr': 'c2', 'rn': 'DS'})

hash = {}
for line in lst:
    hashkey = str(line['uni']) + line['gp'] + line['fr']
    if hashkey in hash:
        hash[hashkey]['rn'] += "+" + line['rn']
    else:
        hash[hashkey] = {'uni': line['uni'], 'gp': line['gp'], 'fr': line['fr'], 'rn': line['rn']}
print(hash)
result: {'1CCc2': {'uni': 1, 'gp': 'CC', 'fr': 'c2', 'rn': 'DS+PP+LL'}, '2CCc2': {'uni': 2, 'gp': 'CC', 'fr': 'c2', 'rn': 'DS+PP'}, '3CCc2': {'uni': 3, 'gp': 'CC', 'fr': 'c2', 'rn': 'DS'}}
I thought I'd add how I approach problems like this. You are grouping by the first 3 fields, so I would place them in a tuple (not a list; a dictionary key must be an immutable object) and use that as an index into a dictionary. Then, as you read each line from your input file, test whether the tuple is already in the dictionary. If it is, concatenate to the previous values already saved.
myDict = {}
f = open("InputData.txt", "r")
for line in f:
    tup = line.strip().split('\t')
    ind = (tup[0], tup[1], tup[2])
    if ind in myDict:
        # Compare the stripped value (tup[3][3:]) so the "rn:" prefix
        # doesn't defeat the duplicate check
        if tup[3][3:] not in myDict[ind]:
            myDict[ind] = myDict[ind] + "+" + tup[3][3:]
    else:
        myDict[ind] = tup[3][3:]
f.close()
print(myDict)
Once the data is in the dictionary object, you can iterate over it and write your output like in the other answers above. (My answer assumes your input text file is tab delimited.)
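For instance, here is a minimal sketch of that final step. The output filename and the exact shape of the dictionary are assumptions for illustration; the tuple keys below keep the raw `uni:`/`gp:`/`fr:` prefixes exactly as they were read from the tab-delimited input:

```python
# Hypothetical result of the loop above: the tuple key holds the raw
# input fields, the value holds the merged rn string.
myDict = {('uni:1', 'gp:CC', 'fr:c2'): 'DS+PP+LL',
          ('uni:2', 'gp:CC', 'fr:c2'): 'DS+LL'}

# Write one tab-delimited line per group, re-attaching the rn: prefix
with open('OutputData.txt', 'w') as out:
    for (uni, gp, fr), rn in myDict.items():
        out.write('{}\t{}\t{}\trn:{}\n'.format(uni, gp, fr, rn))
```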
I find dictionaries very helpful in cases like this.
Say I have a list of filenames, files, containing data in JSON format. To receive the data in a list with an entry for each file, I use a list comprehension:
>>> import json
>>> data = [json.load(open(file)) for file in files]
Now I was wondering if there is a way to append the filename file to the JSON data, so that it looks like this:
{
'Some': ['data', 'that', 'has', 'already', 'been', 'there'],
'Filename': 'filename'
}
For my case, json.load() returns a dict, so I've tried something similar to this question. This didn't work out for me, because files contains strings and not dictionaries.
Edit
For clarification, if dict.update() didn't return None, this would probably work:
>>> data = [dict([('filename',file)]).update(json.load(open(file))) for file in files]
Yes, you can. Here's one way (requires Python 3.5+):
import json
data = [{**json.load(open(file)), **{'Filename': file}} for file in files]
The syntax {**d1, **d2} combines two dictionaries, with preference for d2. If you wish to add items explicitly, you can simply add an extra item, like so:
data = [{**json.load(open(file)), 'Filename': file} for file in files]
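To see the precedence rule in action (the values here are hypothetical stand-ins for the loaded JSON):

```python
d1 = {'Some': ['data'], 'Filename': 'placeholder'}
d2 = {'Filename': 'data1.json'}

# Later unpackings win on duplicate keys, so d2's value survives
merged = {**d1, **d2}
print(merged)  # {'Some': ['data'], 'Filename': 'data1.json'}
```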
You can merge a custom dictionary into the one being loaded as in this answer.
data = [{**json.loads("{\"el...\": 5}"), **{'filename': file}} for file in files]
I am using extend to append multiple values to a list.
I need every extend to go into the list as a new line.
>>> list1 = []
>>> list1 = (['Para','op','qa', 'reason'])
>>> list1.extend(['Power','pass','ok', 'NA'])
>>> print list1
['Para', 'op', 'qa', 'reason', 'Power', 'pass', 'ok', 'NA']
I need to write this list to a CSV file, and it has to print as two lines:
Para, op, qa, reason
Power, pass, ok, NA
If you wanted separate lists, make them separate. Don't use list.extend(), use appending:
list1 = [['Para','op','qa', 'reason']] # brackets, creating a list with a list
list1.append(['Power','pass','ok', 'NA'])
Now list1 is a list with two objects, each itself a list:
>>> list1
[['Para', 'op', 'qa', 'reason'], ['Power', 'pass', 'ok', 'NA']]
If you are using the csv module to write out your CSV file, use the csvwriter.writerows() method to write each row into a separate line:
>>> import csv
>>> import sys
>>> writer = csv.writer(sys.stdout)
>>> writer.writerows(list1)
Para,op,qa,reason
Power,pass,ok,NA
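When the destination is an actual file rather than sys.stdout, open it with newline='' as the csv documentation recommends, so the writer's own \r\n line endings are not translated a second time (the filename here is just an example):

```python
import csv

rows = [['Para', 'op', 'qa', 'reason'], ['Power', 'pass', 'ok', 'NA']]

# newline='' stops Python's universal-newline translation from
# rewriting the csv module's \r\n terminators
with open('out.csv', 'w', newline='') as f:
    csv.writer(f).writerows(rows)
```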
Your desired result, list1, should be a list of two elements, each of which is itself a list.
list1 = ['Para', 'op', 'qa', 'reason']

# Wrapping list1 with [] creates a new list whose first element is the original list1.
# In your case, this gives a list of lines containing only a single line.
# Only after that can we add a new list of lines that contains another single line.
list1 = [list1] + [['Power', 'pass', 'ok', 'NA']]
print(list1)
I am extracting data from the Google Adwords Reporting API via Python. I can successfully pull the data and then hold it in a variable data.
data = get_report_data_from_google()
type(data)
str
Here is a sample:
data = 'ID,Labels,Date,Year\n3179799191,"[""SKWS"",""Exact""]",2016-05-16,2016\n3179461237,"[""SKWS"",""Broad""]",2016-05-16,2016\n3282565342,"[""SKWS"",""Broad""]",2016-05-16,2016\n'
I need to process this data more, and ultimately output a processed flat file (Google Adwords API can return a CSV, but I need to pre-process the data before loading it into a database.).
If I try to turn data into a csv reader object and print each line, I get one character per line, like:
c = csv.reader(data, delimiter=',')
for i in c:
    print(i)
['I']
['D']
['', '']
['L']
['a']
['b']
['e']
['l']
['s']
['', '']
['D']
['a']
['t']
['e']
So, my idea was to process each column of each line into a list, then add that to a csv object. Trying that:
for line in data.splitlines():
    print(line)
3179799191,"[""SKWS"",""Exact""]",2016-05-16,2016
What I actually find is that inside of the str, there is a list: "[""SKWS"",""Exact""]"
This value is a "label" (see the documentation).
This list is formatted a bit oddly - it has doubled quote characters in the value, so trying to use a quote char, like ", will return something like this: [ SKWS Exact ]. If I could get to [""SKWS"",""Exact""], that would be acceptable.
Is there a good way to extract a list object within a str? Is there a better way to process and output this data to a csv?
You need to split the string first. csv.reader expects something that provides a single line on each iteration, like a standard file object does. If you have a string with newlines in it, split it on the newline character with splitlines():
>>> import csv
>>> data = 'ID,Labels,Date,Year\n3179799191,"[""SKWS"",""Exact""]",2016-05-16,2016\n3179461237,"[""SKWS"",""Broad""]",2016-05-16,2016\n3282565342,"[""SKWS"",""Broad""]",2016-05-16,2016\n'
>>> c = csv.reader(data.splitlines(), delimiter=',')
>>> for line in c:
... print(line)
...
['ID', 'Labels', 'Date', 'Year']
['3179799191', '["SKWS","Exact"]', '2016-05-16', '2016']
['3179461237', '["SKWS","Broad"]', '2016-05-16', '2016']
['3282565342', '["SKWS","Broad"]', '2016-05-16', '2016']
This has to do with how csv.reader works.
According to the documentation:
csvfile can be any object which supports the iterator protocol and returns a string each time its next() method is called
The issue here is that if you pass a string, it supports the iterator protocol, and returns a single character for each call to next. The csv reader will then consider each character as a line.
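You can see this directly: iterating over a string yields one character per step, which is exactly what the reader receives:

```python
data = 'ID,Labels,Date,Year'

# A string's iterator hands back single characters, so csv.reader
# treats each character as a "line"
it = iter(data)
print(next(it))  # 'I'
print(next(it))  # 'D'
```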
You need to provide a list of lines, one for each line of your csv. For example:
c = csv.reader(data.splitlines(), delimiter=',')
for i in c:
    print(i)
# ['ID', 'Labels', 'Date', 'Year']
# ['3179799191', '["SKWS","Exact"]', '2016-05-16', '2016']
# ['3179461237', '["SKWS","Broad"]', '2016-05-16', '2016']
# ['3282565342', '["SKWS","Broad"]', '2016-05-16', '2016']
Now, your list looks like a JSON list. You can use the json module to read it.
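For example, once csv.reader has collapsed the doubled quotes, the Labels field is valid JSON and json.loads turns it into a real Python list:

```python
import json

# The Labels column as it comes out of csv.reader above
label_field = '["SKWS","Exact"]'
labels = json.loads(label_field)
print(labels)  # ['SKWS', 'Exact']
```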
I have several very large, not-quite-CSV log files.
Given the following conditions:
value fields have unescaped newlines and commas; almost anything can be in a value field, including '='
each valid line has an unknown number of valid value fields
a valid value looks like key=value, so a valid line looks like key1=value1, key2=value2, key3=value3 etc.
the start of each valid line should begin with eventId=<some number>,
What is the best way to read a file, split the file into correct lines and then parse each line into correct key value pairs?
I have tried
file_name = 'file.txt'
read_file = open(file_name, 'r').read().split(',\neventId')
This correctly parses the first entry, but every other entry starts with =# instead of eventId=#. Is there a way to keep the delimiter and split on the valid newline?
Also, speed is very important.
Example Data:
eventId=123, key=value, key2=value2:
this, will, be, a problem,
maybe?=,
anotherkey=anothervalue,
eventId=1234, key1=value1, key2=value2, key3=value3,
eventId=12345, key1=
msg= {this is not a valid key value pair}, key=value, key21=value=,
Yes, the file really is this messy (sometimes). Each event here has 3 key-value pairs, although in reality there is an unknown number of key-value pairs in each event.
This problem is pretty insane, but here's a solution that seems to work. Always use an existing library to output formatted data, kids.
import re

in_string = """eventId=123, goodkey=goodvalue, key2=somestuff:
this, will, be, a problem,
maybe?=,
anotherkey=anothervalue, gotit=see,
the problem===s,
eventId=1234, key1=value1, key2=value2, key3=value3,
eventId=12345, key1=
msg= {this is not a valid key value pair}, validkey=validvalue,"""

# Find the start of each logical line (eventId=<digit>, optionally
# preceded by ",\n") and slice the string between consecutive starts
line_matches = list(re.finditer(r'(,\n)?eventId=\d', in_string))
lines = []
for i in range(len(line_matches)):
    match_start = line_matches[i].start()
    next_match_start = line_matches[i + 1].start() if i < len(line_matches) - 1 else len(in_string) - 1
    line = in_string[match_start:next_match_start].lstrip(',\n')
    lines.append(line)

# Within each logical line, treat ", <word>=" as the start of a key and
# take everything up to the next key start as its value
lineDicts = []
for line in lines:
    d = {}
    pad_line = ', ' + line
    matches = list(re.finditer(r', [\w\d]+=', pad_line))
    for i in range(len(matches)):
        match = matches[i]
        key = match.group().lstrip(', ').rstrip('=')
        next_match_start = matches[i + 1].start() if i < len(matches) - 1 else len(pad_line)
        value = pad_line[match.end():next_match_start]
        d[key] = value
    lineDicts.append(d)
print(lineDicts)
Outputs [{'eventId': '123', 'key2': 'somestuff:\nthis, will, be, a problem,\nmaybe?=,\nanotherkey=anothervalue', 'goodkey': 'goodvalue', 'gotit': 'see,\nthe problem===s'}, {'eventId': '1234', 'key2': 'value2', 'key1': 'value1', 'key3': 'value3'}, {'eventId': '12345', 'key1': '\nmsg= {this is not a valid key value pair}', 'validkey': 'validvalue'}]
If "the start of each valid line should begin with eventId=" is correct, you can groupby those lines and find valid pairs with a regex:
from itertools import groupby
import re

with open("test.txt") as f:
    r = re.compile(r"\w+=\w+")
    grps = groupby(f, key=lambda x: x.startswith("eventId="))
    d = dict(l.split("=") for k, v in grps if k
             for l in r.findall(next(v))[1:])
print(d)
{'key3': 'value3', 'key2': 'value2', 'key1': 'value1', 'goodkey': 'goodvalue'}
If you want to keep the eventIds:
from itertools import groupby
import re

with open("test.txt") as f:
    r = re.compile(r"\w+=\w+")
    grps = groupby(f, key=lambda x: x.startswith("eventId="))
    d = list(r.findall(next(v)) for k, v in grps if k)
print(d)
[['eventId=123', 'goodkey=goodvalue', 'key2=somestuff'], ['eventId=1234', 'key1=value1', 'key2=value2', 'key3=value3']]
It is not clear from your description exactly what the output should be. If you want all the valid key=value pairs, and "the start of each valid line should begin with eventId=" is not accurate:
from itertools import groupby
import re

def parse(fle):
    with open(fle) as f:
        r = re.compile(r"\w+=\w+")
        grps = groupby(f, key=lambda x: x.startswith("eventId="))
        for k, v in grps:
            if k:
                sub = "".join(list(v) + list(next(grps)[1]))
                yield from r.findall(sub)

print(list(parse("test.txt")))
Output:
['eventId=123', 'key=value', 'key2=value2', 'anotherkey=anothervalue',
'eventId=1234', 'key1=value1', 'key2=value2', 'key3=value3',
'eventId=12345', 'key=value', 'key21=value']
If your values really can contain anything, there's no unambiguous way of parsing. Any key=value pair could be part of the preceding value. Even an eventID=# pair on a new line could be part of a value from the previous line.
Now, perhaps you can do a "good enough" parse on the data despite the ambiguity, if you assume that values will never contain valid looking key= substrings. If you know the possible keys (or at least, what constraints they have, like being alphanumeric), it will be a lot easier to guess at what is a new key and what is just part of the previous value.
Anyway, if we assume that all alphanumeric strings followed by equals signs are indeed keys, we can do a parse with regular expressions. Unfortunately, there's no easy way to do this line by line, nor is there a good way to capture all the key-value pairs in a single scan. However, it's not too hard to scan once to get the log lines (which may have embedded newlines) and then separately get the key=value, pairs for each one.
import re

with open("my_log_file") as infile:
    text = infile.read()

# (?s) makes . match newlines as well, so values can span lines
line_pattern = r'(?s)eventId=\d+,.*?(?:$|(?=\neventId=\d+))'
kv_pattern = r'(?s)(\w+)=(.*?),\s*(?:$|(?=\w+=))'
results = [re.findall(kv_pattern, line) for line in re.findall(line_pattern, text)]
I'm assuming that the file is small enough to fit into memory as a string. It would be quite a bit more obnoxious to solve the problem if the file can't all be handled at once.
If we run this regex matching on your example text, we get:
[[('eventId', '123'), ('key', 'value'), ('key2', 'value2:\nthis, will, be, a problem,\nmaybe?='), ('anotherkey', 'anothervalue')],
[('eventId', '1234'), ('key1', 'value1'), ('key2', 'value2'), ('key3', 'value3')],
[('eventId', '12345'), ('key1', '\nmsg= {this is not a valid key value pair}'), ('key', 'value'), ('key21', 'value=')]]
maybe? is not considered a key because of the question mark. msg and the final value are not considered keys because there were no commas separating them from a previous value.
Oh! This is an interesting problem. You'll want to process each line, and each part of a line, separately without iterating through the file more than once.
data_dict = {}
with open('file.txt', 'r') as f:
    for line in f:
        line_list = line.strip().split(',')
        if line_list and 'eventId' in line_list[0]:
            for item in line_list:
                pair = item.split('=')
                if len(pair) == 2:  # skip fragments that are not key=value
                    data_dict.update({pair[0]: pair[1]})
That should do it. Enjoy!
If there are spaces in the 'pseudo csv', change the update line to:
data_dict.update({pair[0].strip(): pair[1].strip()})
in order to remove the surrounding whitespace from your key and value (str.strip(), not str.split(), which would return an unhashable list).
p.p.s. A set of lines from your actual data would be very helpful in writing something to avoid error cases.