Parsing colon delimited data


I have the following text chunk:
string = """
apples: 20
oranges: 30
ripe: yes
farmers:
        elmer fudd
                lives in tv
        farmer ted
                lives close
        farmer bill
                lives far
selling: yes
veggies:
        carrots
        potatoes
"""
I am trying to find a good regex that will let me parse out the keys and values. I can grab the single-line key/value pairs with something like:
r'(.+?):\s(.+?)\n'
However, the problem comes when I hit farmers or veggies.
Using the re flags, I need to do something like:
re.findall(r'(.+?):\s(.+?)\n', string, re.S)
However, I am having a heck of a time grabbing all of the values associated with farmers.
There is a newline after each value, and a tab (or a series of tabs) before each value when the values span multiple lines.
My goal is to end up with something like:
{ 'apples': 20, 'farmers': ['elmer fudd', 'farmer ted'] }
etc.
Thank you in advance for your help.

You might look at PyYAML; this text is very close to, if not actually, valid YAML.

Here's a really dumb parser that takes into account your (apparent) indentation rules:
def parse(s):
    d = {}
    lastkey, lastval, lastindent = None, [], None
    for fullline in s:
        line = fullline.strip()
        if not line:
            pass
        elif ':' not in line:
            # A value line: keep it only if it sits at the same indentation
            # as the first value under the current key.
            indent = len(fullline) - len(fullline.lstrip())
            if lastindent is None:
                lastindent = indent
            if lastindent == indent:
                lastval.append(line)
        else:
            # A new key: flush any multi-line value collected so far.
            if lastkey:
                d[lastkey] = lastval
                lastkey = None
            if line.endswith(':'):
                # Key with no inline value: start collecting a list.
                lastkey, lastval, lastindent = line[:-1], [], None
            else:
                key, _, value = line.partition(':')
                d[key] = value.strip()
    if lastkey:
        d[lastkey] = lastval
        lastkey = None
    return d
from pprint import pprint
pprint(parse(string.splitlines()))
The output is:
{'apples': '20',
 'farmers': ['elmer fudd', 'farmer ted', 'farmer bill'],
 'oranges': '30',
 'ripe': 'yes',
 'selling': 'yes',
 'veggies': ['carrots', 'potatoes']}
I think this is already complicated enough that it would look cleaner as an explicit state machine, but I wanted to write this in terms that any novice could understand.
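The explicit state machine mentioned above could look something like the sketch below. This is not the answerer's code: parse_blocks and its variable names are made up for illustration, and it assumes that keys start in column 0, that list items share the indentation of the first item under their key, and that deeper lines are detail to skip.

```python
def parse_blocks(text):
    """Parse 'key: value' lines and indented multi-line values into a dict."""
    result = {}
    current = None       # list being filled for a 'key:' line with no inline value
    item_indent = None   # indentation of the first item under that key

    for raw in text.splitlines():
        stripped = raw.strip()
        if not stripped:
            continue
        indent = len(raw) - len(raw.lstrip())
        if indent == 0 and ':' in stripped:
            # State: top-level key line.
            key, _, value = stripped.partition(':')
            value = value.strip()
            if value:
                result[key] = value
                current = None
            else:
                current = result.setdefault(key, [])
            item_indent = None
        elif current is not None:
            # State: inside a multi-line value.
            if item_indent is None:
                item_indent = indent
            if indent == item_indent:
                current.append(stripped)
            # Deeper lines (for example "lives in tv") are ignored.
    return result


sample = "apples: 20\nfarmers:\n\telmer fudd\n\t\tlives in tv\n\tfarmer ted\nveggies:\n\tcarrots\n"
print(parse_blocks(sample))
```

The two branches correspond to the two states: reading a key line, or collecting items for the most recent key that had no inline value.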

Here's a totally silly way to do it:
import collections
string = """
apples: 20
oranges: 30
ripe: yes
farmers:
        elmer fudd
                lives in tv
        farmer ted
                lives close
        farmer bill
                lives far
selling: yes
veggies:
        carrots
        potatoes
"""
def funky_parse(inval):
    lines = inval.split("\n")
    items = collections.defaultdict(list)
    at_val = False
    key = ''
    val = ''
    last_indent = 0
    for j, line in enumerate(lines):
        indent = len(line) - len(line.lstrip())
        # Skip lines indented deeper than the previous value line.
        if j != 0 and at_val and indent > last_indent > 4:
            continue
        # A line containing a colon starts a new key.
        if j != 0 and ":" in line:
            if val:
                items[key].append(val.strip())
            at_val = False
            key = ''
        line = line.lstrip()
        # Walk the line character by character: everything up to and
        # including the colon is the key, the rest is the value.
        for i, c in enumerate(line, 1):
            if at_val:
                val += c
            else:
                key += c
            if c == ':':
                at_val = True
            if i == len(line) and at_val and val:
                items[key].append(val.strip())
                val = ''
        last_indent = indent
    return items
print(dict(funky_parse(string)))
OUTPUT (key order depends on the Python version)
{'farmers:': ['elmer fudd', 'farmer ted', 'farmer bill'], 'apples:': ['20'], 'veggies:': ['carrots', 'potatoes'], 'ripe:': ['yes'], 'oranges:': ['30'], 'selling:': ['yes']}

Related

Taking values from one file and after some calculation and a bit changes need to print into another file in python

Below is c.txt:
CO11 CSE C1 8
CO12 ETC C1 8
CO13 Electrical C2 12
CO14 Mech E 5
My program needs to print a course summary on screen and save that summary into a file named cr.txt. Given the above c.txt, the program output should look like the listing below. The content of cr.txt should be the same, except for the last line. Course names in the second column use * to indicate a compulsory course and - to indicate an elective course. The fourth column is the number of students enrolled in that course. The fifth column is the average score of the course.
CID    Name            Points.  Enrollment.  Average.
-----------------------------------------------------
CO11   * CSE           8        2            81
CO12   * ETC           8        10           71
CO13   * Electrical    12       8            61
CO14   - Mech          5        4            51
-----------------------------------------------------
poor-performing subject is CO14 with an average 51.
cr.txt generated!
Below is what I've tried:
def read(self):
    ctype = []
    fi = open("c.txt", "r")
    l = fi.readline()
    while l != "":
        fields = l.strip().split(" ")
        self.c.append(fields)
        l = fi.readline().strip()
    fi.close()
    # print(f"{'CID'}{'Name':>20}{'Points.':>16}{'Enrollment.':>18}{'Average.':>10}")
    # print("-" * 67, end="")
    print()
    for i in range(0, len(self.c)):
        for j in range(len(self.c[i])):
            obj = self.c[i][j]
            print(obj.ljust(18), end="")
        print()
    print("-" * 67, end="")
    print()

You can use file.read() or file.readlines() after calling open(). If you choose file.readlines(), you'll have to loop with 'for row in file.readlines()'. Here is my example with file.read():
headers = ['CID', 'Name', 'Points.', 'Enrollment.', 'Average.']
compulsory_course = ['CO11', 'CO12', 'CO13']
elective_course = ['CO14']
count = 0
with open('c.txt', 'r') as file_c:
    file_c.seek(0, 0)
    # Flatten the whole file into one space-separated list of fields.
    file_string = file_c.read().replace('\n', ' ')
    fields = file_string.split(' ')
with open('cr.txt', 'w') as file_cr:
    for field in headers:
        file_cr.write(f'{field} ')
    file_cr.write('\n')
    for v in fields:
        # Every input row has four fields; start a new output line after each row.
        if count == 4:
            file_cr.write('\n')
            count = 0
        count += 1
        if v in compulsory_course:
            file_cr.write(f'{v} * ')
            continue
        elif v in elective_course:
            file_cr.write(f'{v} - ')
            continue
        elif count == 3:
            # The third field is the course type (C1, C2, E); it is not printed.
            file_cr.write(' ')
            continue
        file_cr.write(f'{v} ')
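The readlines() variant mentioned above might look like the sketch below. It uses io.StringIO as a stand-in for the opened c.txt so that it runs on its own, and it assumes that a type code starting with C marks a compulsory course and anything else an elective. format_rows is a hypothetical helper. Enrollment and average are left out because the question does not show where they come from.

```python
import io

# Stand-in for open('c.txt'); the rows are the sample from the question.
sample = io.StringIO(
    "CO11 CSE C1 8\n"
    "CO12 ETC C1 8\n"
    "CO13 Electrical C2 12\n"
    "CO14 Mech E 5\n"
)


def format_rows(lines):
    """Turn 'CID name type points' lines into aligned report rows."""
    rows = []
    for row in lines:
        fields = row.split()
        if len(fields) != 4:
            continue  # skip blank or malformed lines
        cid, name, ctype, points = fields
        # Assumption: a type starting with 'C' is compulsory, anything else elective.
        marker = '*' if ctype.startswith('C') else '-'
        rows.append(f"{cid:<6}{marker} {name:<12}{points:>3}")
    return rows


report = format_rows(sample.readlines())
print("\n".join(report))
```

Working row by row keeps the four fields of a course together, so no running counter is needed to find the line breaks.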

Python retrieving data from a block of lines containing specific characters and appending relevant data into separate lines

I am trying to create a program that takes a bulk paste, extracts the relevant information, and writes that information out as lines.
Here is some example data:
1. Track1 03:01
VOC:PersonA
LYR：LyrcistA
COM：ComposerA
ARR：ArrangerA
ARR：ArrangerB
2. Track2 04:18
VOC:PersonB
VOC:PersonC
LYR：LyrcistA
LYR：LyrcistC
COM：ComposerA
ARR：ArrangerA
I would like output where the relevant data for each track is grouped together on a single line, with a semicolon joining entries of the same type and " - " separating the different types.
LyrcistA - ComposerA - ArrangerA; ArrangerB
LyrcistA; LyrcistC - ComposerA - ArrangerA
I have not gotten very far despite my best efforts:
while True:
    YodobashiData = input("")
    SplitData = YodobashiData.splitlines()
This returns the following:
['1. Track1 03:01']
['VOC:PersonA ']
['LYR：LyrcistA']
['COM：ComposerA']
['ARR：ArrangerA']
['ARR：ArrangerB']
[]
['2. Track2 04:18']
['VOC:PersonB']
['VOC:PersonC']
['LYR：LyrcistA']
['LYR：LyrcistC']
['COM：ComposerA']
['ARR：ArrangerA']
Whilst I have all the data now in separate lists, I have no idea how to identify and extract the information from the list I need from the ones I do not.
Also, it seems I need to have the while loop or else it will only return the first list and nothing else.

Here's a script that doesn't use regular expressions.
It assumes that header lines, and only the header lines, will always start with a digit, and that the overall structure of header line then credit lines is consistent. Empty lines are ignored.
Extraction and formatting of the track data are handled separately, so it's easier to change formats, or use the extracted data in other ways.
import collections
import unicodedata
data_from_question = """\
1. Track1 03:01
VOC:PersonA
LYR：LyrcistA
COM：ComposerA
ARR：ArrangerA
ARR：ArrangerB
2. Track2 04:18
VOC:PersonB
VOC:PersonC
LYR：LyrcistA
LYR：LyrcistC
COM：ComposerA
ARR：ArrangerA
"""
def prepare_data(data):
    # The "colons" in the credits lines are actually
    # "full width colons". Replace them (and other such characters)
    # with their normal width equivalents.
    # If full normalisation is undesirable then we could return
    # data.replace('\N{FULLWIDTH COLON}', ':')
    return unicodedata.normalize('NFKC', data)
def is_new_track(line):
    return line[0].isdigit()
def parse_track_header(line):
    id_, title, duration = line.split()
    return {'id': id_.rstrip('.'), 'title': title, 'duration': duration}
def get_credit(line):
    credit, _, name = line.partition(':')
    return credit.strip(), name.strip()
def format_track_heading(track):
    return 'id: {id} title: {title} length: {duration}'.format(**track)
def format_credits(track):
    order = ['ARR', 'COM', 'LYR', 'VOC']
    parts = ['; '.join(track[k]) for k in order]
    return ' - '.join(parts)
def get_data():
    # The data is expected to be a multiline string.
    return data_from_question
def parse_data(data):
    track = None
    for line in filter(None, data.splitlines()):
        if is_new_track(line):
            if track:
                yield track
            track = collections.defaultdict(list)
            header_data = parse_track_header(line)
            track.update(header_data)
        else:
            role, name = get_credit(line)
            track[role].append(name)
    yield track
def report(tracks):
    for track in tracks:
        print(format_track_heading(track))
        print(format_credits(track))
        print()
def main():
    data = get_data()
    prepared_data = prepare_data(data)
    tracks = parse_data(prepared_data)
    report(tracks)
main()
Output:
id: 1 title: Track1 length: 03:01
ArrangerA; ArrangerB - ComposerA - LyrcistA - PersonA
id: 2 title: Track2 length: 04:18
ArrangerA - ComposerA - LyrcistA; LyrcistC - PersonB; PersonC

Here's another take on an answer to your question:
data = """
1. Track1 03:01
VOC:PersonA
LYR：LyrcistA
COM：ComposerA
ARR：ArrangerA
ARR：ArrangerB

2. Track2 04:18
VOC:PersonB
VOC:PersonC
LYR：LyrcistA
LYR：LyrcistC
COM：ComposerA
ARR：ArrangerA"""
import re
import collections
# Regular expression to pull apart the headline of each entry
headlinePattern = re.compile(r"(\d+)\.\s+(.*?)\s+(\d\d:\d\d)")
def main():
    # break the data into lines
    lines = data.strip().split("\n")
    # while we have more lines...
    while lines:
        # The next line should be a title line
        line = lines.pop(0)
        m = headlinePattern.match(line)
        if not m:
            raise Exception("Unexpected data format")
        track_id = m.group(1)
        title = m.group(2)
        length = m.group(3)
        people = collections.defaultdict(list)
        # Now read person lines until we hit a blank line or the end of the list
        while lines:
            line = lines.pop(0)
            if not line:
                break
            # Break the line into label and name
            label, name = re.split(r"\W+", line, maxsplit=1)
            # Add this entry to a map of lists, where the map's keys are the label and the
            # map's values are all the people who had that label
            people[label].append(name)
        # Now we have everything for one entry in the data. Print everything we got.
        print("id:", track_id, "title:", title, "length:", length)
        print(" - ".join(["; ".join(person) for person in people.values()]))
        # go on to the next entry...
main()
Result:
id: 1 title: Track1 length: 03:01
PersonA - LyrcistA - ComposerA - ArrangerA; ArrangerB
id: 2 title: Track2 length: 04:18
PersonB; PersonC - LyrcistA; LyrcistC - ComposerA - ArrangerA
You can just comment out the line that prints the headline info if you really just want the line with all of the people on it. Just replace the built in data with data = input("") if you want to read the data from a user prompt.

Assuming your data is in the format you specified in a file called tracks.txt, the following code should work:
import re
with open('tracks.txt') as fp:
    tracklines = fp.read().splitlines()
def split_tracks(lines):
    # Group the lines into one list per track; blank lines separate tracks.
    track = []
    all_tracks = []
    while True:
        try:
            if lines[0] != '':
                track.append(lines.pop(0))
            else:
                all_tracks.append(track)
                track = []
                lines.pop(0)
        except IndexError:
            # No lines left: store the last track and stop.
            all_tracks.append(track)
            return all_tracks
def gather_attrs(tracks):
    track_attrs = []
    for track in tracks:
        attrs = {}
        for line in track:
            match = re.match('([A-Z]{3}):', line)
            if match:
                attr = line[:3]
                val = line[4:].strip()
                try:
                    attrs[attr].append(val)
                except KeyError:
                    attrs[attr] = [val]
        track_attrs.append(attrs)
    return track_attrs
if __name__ == '__main__':
    tracks = split_tracks(tracklines)
    attrs = gather_attrs(tracks)
    for track in attrs:
        semicolons = map(lambda va: '; '.join(va), track.values())
        hyphens = ' - '.join(semicolons)
        print(hyphens)
The only thing you may have to change is the colon characters in your data: some of them are ASCII colons (:) and others are fullwidth Unicode colons (：), which the regex does not match.
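One way to sidestep the colon problem without editing the data is to let the pattern accept both characters. A minimal sketch; CREDIT_RE and split_credit are illustrative names, not part of the answer above.

```python
import re

# Accept either an ASCII colon or a fullwidth colon (U+FF1A) after the label.
CREDIT_RE = re.compile(r'([A-Z]{3})\s*[:\uFF1A]\s*(.+)')


def split_credit(line):
    """Return (label, name) for a credit line, or None for anything else."""
    match = CREDIT_RE.match(line)
    if match is None:
        return None
    return match.group(1), match.group(2).strip()


print(split_credit('VOC:PersonA '))
print(split_credit('LYR\uFF1ALyrcistA'))
print(split_credit('1. Track1 03:01'))
```

Header lines such as "1. Track1 03:01" do not start with three capital letters, so they fall through as None and can be skipped by the caller.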

import re
list_ = data_.split('\n') # here data_ is your data
regObj = re.compile(rf'[A-Za-z]+(:|{chr(65306)})[A-Za-z]+')
l = []
pre = ''
for i in list_:
    if regObj.findall(i):
        if i[:3] != 'VOC':
            # Same label as the previous line: join with "; ", otherwise with " - ".
            if pre == i[:3]:
                l.append('; ')
            else:
                l.append(' - ')
            l.append(i[4:].strip())
        else:
            # A VOC line marks the start of a new track.
            l.append(' => ')
        pre = i[:3]
track_list = list(map(lambda item: item.strip(' - '), filter(lambda item: item, ''.join(l).split(' => '))))
print(track_list)
Output (the list of result lines you want):
['LyrcistA - ComposerA - ArrangerA; ArrangerB', 'LyrcistA; LyrcistC - ComposerA - ArrangerA']

Python and PyQt string can't print

So, I'm using Python with PyQt and I have a very strange problem. A string that prints OK at one point doesn't print OK after a few lines of code! Here's my code:
name = str(self.lineEdit.text().toUtf8())
self.let_change = Search()
name_no_ind = self.let_change.indentation(name)
print(name_no_ind)
name_cap = self.let_change.capital(name)
name_low = self.let_change.lower(name)
print(name_no_ind, name_cap, name_low)
col = self.combobox.currentIndex()
row = 0
for i in range(0, self.tableWidget.rowCount()):
    try:
        find_no_ind = self.let_change.indentation(self.tableWidget.item(row, col).text())
        find_cap = self.let_change.capital(self.tableWidget.item(row, col).text())
        find_lower = self.let_change.lower(self.tableWidget.item(row, col).text())
        if name_no_ind or name_cap or name_low in find_no_ind or find_cap or find_lower:
            self.tableWidget.setItemSelected(self.tableWidget.item(row, col), True)
            print("Item found in %d, %d" % (row,col))
        row += 1
    except AttributeError:
        row += 1
And here's what I get:
Αντωνης
('\xce\x91\xce\xbd\xcf\x84\xcf\x89\xce\xbd\xce\xb7\xcf\x82', '\xce\x91\xce\x9d\xce\xa4\xcf\x8e\xce\x9d\xce\x97\xce\xa3', '\xce\xb1\xce\xbd\xcf\x84\xcf\x8e\xce\xbd\xce\xb7\xcf\x82')
Item found in 0, 0
Isn't that strange? It prints OK and then it doesn't. Does anybody know what I can do?
P.S.: Here are the functions:
# -*- coding: utf-8 -*-
class Search():
    # A function that removes accents:
    def indentation(self, name):
        a = name
        b = ["ά", "Ά", "ή", "Ή", "ώ", "Ώ", "έ", "Έ", "ύ", "Ύ", "ί", "Ί", "ό", "Ό"]
        c = ['α', 'Α', 'η', 'Η', 'ω', 'Ω', 'ε', 'Ε', 'υ', 'Υ', 'ι', 'Ι', 'ο', 'Ο']
        for i in b:
            a = a.replace(i, c[b.index(i)])
        return a
    # A function that makes letters capital:
    def capital(self, name):
        a = name
        greek_small = ["α", "β", "γ", "δ", "ε", "ζ", "η", "θ", "ι", "κ", "λ", "μ", "ν", "ξ", "ο", "π", "ρ", "σ", "τ", "υ", "φ", "χ", "ψ", "ω", "ς"]
        greek_capital = ["Α", "Β", "Γ", "Δ", "Ε", "Ζ", "Η", "Θ", "Ι", "Κ", "Λ", "Μ", "Ν", "Ξ", "Ο", "Π", "Ρ", "Σ", "Τ", "Υ", "Φ", "Χ", "Ψ", "Ω", "Σ"]
        for i in greek_small:
            a = a.replace(i, greek_capital[greek_small.index(i)])
        return a
    # A function that makes letters lower:
    def lower(self, name):
        a = name
        greek_small = ["α", "β", "γ", "δ", "ε", "ζ", "η", "θ", "ι", "κ", "λ", "μ", "ν", "ξ", "ο", "π", "ρ", "σ", "τ", "υ", "φ", "χ", "ψ", "ω", "ς"]
        greek_capital = ["Α", "Β", "Γ", "Δ", "Ε", "Ζ", "Η", "Θ", "Ι", "Κ", "Λ", "Μ", "Ν", "Ξ", "Ο", "Π", "Ρ", "Σ", "Τ", "Υ", "Φ", "Χ", "Ψ", "Ω", "Σ"]
        for i in greek_capital:
            a = a.replace(i, greek_small[greek_capital.index(i)])
        return a
Basically, it capitalizes or lowers Greek characters...
SOLUTION!!!:
Steve solved the initial problem and based on what he said, I came up with this that solves everything:
name = str(self.lineEdit.text().toUtf8())
self.let_change = Search()
name_no_ind = self.let_change.indentation(name)
name_cap = self.let_change.capital(name)
name_low = self.let_change.lower(name)
name_list = [name, name_no_ind, name_cap, name_low]
col = self.combobox.currentIndex()
row = 0
for i in range(0, self.tableWidget.rowCount()):
    try:
        item_ = str(self.tableWidget.item(row, col).text().toUtf8())
        find_no_ind = self.let_change.indentation(item_)
        find_cap = self.let_change.capital(item_)
        find_lower = self.let_change.lower(item_)
        item_list = [find_no_ind, find_cap, find_lower]
        for x in name_list:
            for y in item_list:
                if x in y:
                    self.tableWidget.setItemSelected(self.tableWidget.item(row, col), True)
        row += 1
    except AttributeError:
        row += 1

I would say that one or both of self.let_change.capital(name) or self.let_change.lower(name) is overwriting it by using the name of the input parameter, or possibly changing the encoding. Since you have not posted the code for them, I cannot be sure.
Sorry, they are not the problem. The problem is that you are printing them differently:
>>> print(capital(name))
ΑΝΤΩΝΗΣ
>>> print(capital(name), name)
('\xce\x91\xce\x9d\xce\xa4\xce\xa9\xce\x9d\xce\x97\xce\xa3', '\xce\x91\xce\xbd\xcf\x84\xcf\x89\xce\xbd\xce\xb7\xcf\x82')
>>> print(capital(name))
ΑΝΤΩΝΗΣ
>>> print(name, name)
('\xce\x91\xce\xbd\xcf\x84\xcf\x89\xce\xbd\xce\xb7\xcf\x82', '\xce\x91\xce\xbd\xcf\x84\xcf\x89\xce\xbd\xce\xb7\xcf\x82')
>>> print(name,)
('\xce\x91\xce\xbd\xcf\x84\xcf\x89\xce\xbd\xce\xb7\xcf\x82',)
>>> print(name)
Αντωνης
>>> print("%s = %s" % (name, capital(name)))
Αντωνης = ΑΝΤΩΝΗΣ
>>>
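So you either need separate print statements or a format string. In Python 2, print(a, b) prints the tuple (a, b), and a tuple shows the repr of each element, which is why the non-ASCII bytes appear as \x escapes.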
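For the search itself, a different route (not what the posts above use) is to let Python handle accents and case instead of maintaining letter tables. A minimal sketch, assuming Python 3 text strings rather than the UTF-8 byte strings that toUtf8() yields under Python 2; fold_greek and matches are names invented here.

```python
import unicodedata


def fold_greek(text):
    """Strip accents and case so 'Αντώνης' and 'ΑΝΤΩΝΗΣ' compare equal."""
    # NFD splits an accented letter into the base letter plus a combining mark.
    decomposed = unicodedata.normalize('NFD', text)
    no_accents = ''.join(ch for ch in decomposed if not unicodedata.combining(ch))
    # casefold() also maps the final sigma 'ς' to 'σ'.
    return no_accents.casefold()


def matches(query, cell_text):
    """True if the query occurs in the cell text, ignoring accents and case."""
    return fold_greek(query) in fold_greek(cell_text)


print(matches('αντων', 'Αντώνης'))
```

With a helper like this, the table loop would need a single matches(name, item_text) test per cell instead of comparing several variants of each string.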

Index similar entries in Python

I have a column of data (easily imported from Google Docs thanks to gspread) that I'd like to intelligently align. I ingest entries into a dictionary. Input can include email, twitter handle or a blog URL. For example:
mike.j@gmail.com
@mikej45
j.mike@world.eu
_http://tumblr.com/mikej45
Right now, the "dumb" version is:
import re
def NomineeCount(spreadsheet):
    worksheet = spreadsheet.sheet1
    nominees = worksheet.col_values(6) # F = 6
    unique_nominees = {}
    pattern = re.compile(r'\s+')
    for c in nominees:
        # Remove all whitespace before counting
        c = re.sub(pattern, '', c)
        if c in unique_nominees: # If we already have the name
            unique_nominees[c] += 1
        else:
            unique_nominees[c] = 1
    # Print out the alphabetical list of nominees with leading vote count
    for w in sorted(unique_nominees.keys()):
        print(str(unique_nominees[w]).rjust(2) + " " + w)
    return nominees
What's an efficient(-ish) way to add in some smarts during the if process?

You can try with defaultdict:
from collections import defaultdict
unique_nominees = defaultdict(lambda: 0)
unique_nominees[c] += 1
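If the goal is only the counting, collections.Counter from the standard library does the same job in one step. A minimal sketch with made-up entries; count_nominees is a hypothetical helper, and it does not yet link different handles that belong to the same person.

```python
import re
from collections import Counter


def count_nominees(entries):
    """Count entries after removing all whitespace, as the question's loop does."""
    return Counter(re.sub(r'\s+', '', entry) for entry in entries)


counts = count_nominees(['alice@example.com', ' alice@example.com ', 'bob@example.com'])
# Alphabetical list with a right-aligned leading vote count.
for name in sorted(counts):
    print(str(counts[name]).rjust(2) + ' ' + name)
```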

Learning Python: Store values in dict from stdout

How can I do the following in Python:
I have a command output that outputs this:
Datexxxx
Clientxxx
Timexxx
Datexxxx
Client2xxx
Timexxx
Datexxxx
Client3xxx
Timexxx
And I want to work this in a dict like:
Client:(date,time), Client2:(date,time) ...

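After reading the data into a string named subject, you could do this:
import re
d = {}
# The pattern must be a raw string so that \r and \n reach the regex engine
# as escapes; in verbose mode (x) literal whitespace in the pattern is ignored.
for match in re.finditer(
        r"""(?mx)
        ^Date(.*)\r?\n
        Client\d*(.*)\r?\n
        Time(.*)""",
        subject):
    # group(1) is the date, group(2) the client text, group(3) the time
    d[match.group(2)] = (match.group(1), match.group(3))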

How about something like:
rows = {}
thisrow = []
for line in output.split('\n'):
    if line[:4].lower() == 'date':
        thisrow.append(line)
    elif line[:6].lower() == 'client':
        thisrow.append(line)
    elif line[:4].lower() == 'time':
        thisrow.append(line)
    elif line.strip() == '':
        # A blank line ends the record: key it by the client line.
        rows[thisrow[1]] = (thisrow[0], thisrow[2])
        thisrow = []
print(rows)
Assumes a trailing newline, no spaces before lines, etc.
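If the output has no blank line between records, the lines can instead be taken three at a time. A minimal sketch, assuming every record is exactly a Date line, a Client line and a Time line in that order; records_to_dict is a name chosen for illustration.

```python
def records_to_dict(output):
    """Map each Client line to its (Date line, Time line) pair."""
    lines = [line.strip() for line in output.splitlines() if line.strip()]
    result = {}
    # Walk the non-blank lines three at a time: Date, Client, Time.
    for i in range(0, len(lines) - 2, 3):
        date, client, time = lines[i:i + 3]
        result[client] = (date, time)
    return result


sample = "Date2020\nClientA\nTime10\nDate2021\nClient2B\nTime11\n"
print(records_to_dict(sample))
```

Blank lines are dropped before grouping, so the same function handles output with or without separators; an incomplete trailing record is ignored.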

What about using a dict with tuples?
Create a dictionary and add the entries:
clients = {}
clients['Client'] = ('date1','time1')
clients['Client2'] = ('date2','time2')
Accessing the entries:
>>> clients['Client']
('date1', 'time1')
