how to handle date & time when splitting string on ":" - python

I am processing a text file: I read it line by line, split each line, and insert the fields into a database.
Each line looks like this:
3530000000000:100000431506294:Jean:Camargo:male::::Kefron:6/4/2018 12:00:00 AM::11/19
The problem is that the split also breaks up the date-time, so the wrong information ends up in the database columns.
My code looks like this:
with open(filename, encoding="utf-8") as f:
    counter = 0
    for line in f:
        data = line.split(':')
        id = str(counter)
        Phonenumber = data[0].strip()
        profileID = data[1].strip()
        firstname = data[2].strip()
        secondname = data[3].strip()
        gender = data[4].strip()
        LocationWhereLive = data[5].strip()
        LocationWhereFrom = data[6].strip()
        RelationshipStatus = data[7].strip()
        whereWork = data[8].strip()
        AccountCreationDate = data[9].strip()
        Email = data[10].strip()
        Birthdate = data[11].strip()
        mycursor = mydb.cursor()
        # build the statement as a string first, then execute it once
        sql = "insert into dataleads values ('"+id+"','"+Phonenumber+"','"+profileID+"','"+firstname+"','"+secondname+"','"+gender+"','"+LocationWhereLive+"','"+LocationWhereFrom+"','"+RelationshipStatus+"','"+whereWork+"','"+AccountCreationDate+"','"+Email+"','"+Birthdate+"')"
        mycursor.execute(sql)
        mydb.commit()
        counter += 1

As an alternative to splitting on whitespace, you can use the maxsplit argument of the split and rsplit methods:
def make_list(s):
    before = s.split(":", maxsplit=9)  # splits up to the date
    after = before[-1].rsplit(":", maxsplit=2)  # splits the last part up to the date (from the right)
    return [*before[:-1], *after]  # creates a list with both parts
s = "3530000000000:100000431506294:Jean:Camargo:male::::Kefron:6/4/2018 12:00:00 AM::11/19"
make_list(s)
Out:
['3530000000000',
'100000431506294',
'Jean',
'Camargo',
'male',
'',
'',
'',
'Kefron',
'6/4/2018 12:00:00 AM',
'',
'11/19']
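As a follow-up, here is a minimal sketch of how the resulting list might be stored. It rests on several assumptions that are not in the question: the timestamp is taken to be month/day/year with a 12-hour clock, sqlite3 stands in for the real database, and the 13 text columns named c0 to c12 are hypothetical. Passing the values as query parameters rather than concatenating them also avoids trouble with quotes inside a field.

```python
import sqlite3
from datetime import datetime


def make_list(s):
    before = s.split(":", maxsplit=9)            # first 9 fields, then the rest
    after = before[-1].rsplit(":", maxsplit=2)   # date, email, birthdate
    return [*before[:-1], *after]


line = "3530000000000:100000431506294:Jean:Camargo:male::::Kefron:6/4/2018 12:00:00 AM::11/19\n"
fields = [field.strip() for field in make_list(line)]

# Assumption: the date is month/day/year with a 12-hour clock.
created = datetime.strptime(fields[9], "%m/%d/%Y %I:%M:%S %p")
fields[9] = created.isoformat(sep=" ")

# sqlite3 is only a stand-in; the table layout below is hypothetical.
conn = sqlite3.connect(":memory:")
columns = ", ".join(f"c{i} text" for i in range(13))
conn.execute(f"create table dataleads ({columns})")
placeholders = ", ".join("?" * 13)
conn.execute(f"insert into dataleads values ({placeholders})", ["0", *fields])
conn.commit()

row = conn.execute("select * from dataleads").fetchone()
print(row[10])  # the stored creation date
```

The placeholder style differs between database drivers (sqlite3 uses ?, some others use %s), so check the documentation of the driver you actually use.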

As mentioned in the comments, you can split with the whitespace:
s = "3530000000000:100000431506294:Jean:Camargo:male::::Kefron:6/4/2018 12:00:00 AM::11/19"
split_s = s.split() # default split is any whitespace character
print(split_s[0]) # will print "3530000000000:100000431506294:Jean:Camargo:male::::Kefron:6/4/2018"
print(split_s[1]) # will print "12:00:00"
print(split_s[2]) # will print "AM::11/19"

To deal with the original file, you can split it in a loop using the known number of fields, rather than relying on how many separator characters there are:
collection = []
_line = line  # keep a backup of the line to compare and count blocks
for field_index in range(12):
    if field_index < 8:  # get the first 8 fields (or some set)
        prefix, _line = _line.split(":", 1)  # only split once!
        collection.append(prefix)
        continue
    if field_index == 9:  # match date field _line from regex
        if _line.startswith("::"):  # test if field was omitted
            _line = _line[1:]  # truncate the first character
            continue
        r"^\d+/..."  # TODO regex for field
        continue
    ...
This can be tuned or adapted to handle any field which can be absent or which can itself contain the separator.
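The field-count idea above can be sketched in runnable form as follows. This is only one possible reading of it: the date pattern, the names FIELD_COUNT, DATE_INDEX and split_fields, and the choice of field 9 as the only field that may contain colons are all assumptions based on the sample line.

```python
import re

# Assumed shape of the date field, e.g. "6/4/2018 12:00:00 AM"
DATE_FIELD = re.compile(r"\d{1,2}/\d{1,2}/\d{4} \d{1,2}:\d{2}:\d{2} (?:AM|PM)")
FIELD_COUNT = 12
DATE_INDEX = 9


def split_fields(line):
    rest = line.rstrip("\n")
    fields = []
    for index in range(FIELD_COUNT):
        if index == DATE_INDEX:
            match = DATE_FIELD.match(rest)
            if match:
                fields.append(match.group())
                # skip the date and the separator that follows it
                rest = rest[match.end() + 1:]
                continue
            # no date here: treat it like any other (empty) field
        if index == FIELD_COUNT - 1:
            fields.append(rest)  # the last field takes whatever remains
        else:
            field, _, rest = rest.partition(":")  # split once only
            fields.append(field)
    return fields


with_date = "3530000000000:100000431506294:Jean:Camargo:male::::Kefron:6/4/2018 12:00:00 AM::11/19"
without_date = "3530000000000:100000431506294:Jean:Camargo:male::::Kefron:::11/19"
print(split_fields(with_date)[8:])
print(split_fields(without_date)[8:])
```

Both lines yield 12 fields, with an empty string in the date position when the date is missing.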
However, if you can take a moment to explain to the author of this file, nicely, why the format is problematic, they may rewrite their tool to produce something better for you, or give you the input files it works from so that you do not have to munge its output.
Specifically, the tool could either
use a separator unavailable in the resulting data (such as | or ##SEPARATOR##)
escape the fields or swap their separators to another character before writing (.replace(":", "-"))
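On the writing side, a minimal sketch of both options using Python's standard csv module, assuming the producing tool is written in Python (the question does not say). The record below is a shortened, made-up version of the sample line, and io.StringIO stands in for the output file.

```python
import csv
import io

record = ["3530000000000", "Jean", "6/4/2018 12:00:00 AM", "", "11/19"]

# Option 1: a separator that does not occur in the data.
buffer = io.StringIO()
csv.writer(buffer, delimiter="|").writerow(record)
written = buffer.getvalue()
restored = next(csv.reader(io.StringIO(written), delimiter="|"))
print(written.strip())
print(restored == record)

# Option 2: keep ":" but let csv quote any field that contains it.
buffer = io.StringIO()
csv.writer(buffer, delimiter=":").writerow(record)
quoted = buffer.getvalue()
print(quoted.strip())
requoted = next(csv.reader(io.StringIO(quoted), delimiter=":"))
```

With the second option the file can only be read back reliably by a reader that understands the quoting, such as csv.reader with the same delimiter; a plain split(":") would still break the date.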

An alternative solution is to match the field in the line first and transform it, allowing you to deal with the field on its own (perhaps transforming it back via a regex or .replace())
line = re.sub(r"(\d\d?):(\d\d):(\d\d) (AM|PM)", r"\1-\2-\3-\4", line)
# now split out line on :
>>> line = "3530000000000:100000431506294:Jean:Camargo:male::::Kefron:6/4/2018 12:00:00 AM::11/19"
>>> re.sub(r"(\d\d?):(\d\d):(\d\d) (AM|PM)", r"\1-\2-\3-\4", line).split(":")
['3530000000000', '100000431506294', 'Jean', 'Camargo', 'male', '', '', '', 'Kefron', '6/4/2018 12-00-00-AM', '', '11/19']

The structure stays the same; you only join the pieces of the split date-time back together again:
counter = 0
line = "3530000000000:100000431506294:Jean:Camargo:male::::Kefron:6/4/2018 12:00:00 AM::11/19"
data = line.split(':')
id = str(counter)
Phonenumber = data[0].strip()
profileID = data[1].strip()
firstname = data[2].strip()
secondname = data[3].strip()
gender = data[4].strip()
LocationWhereLive = data[5].strip()
LocationWhereFrom = data[6].strip()
RelationshipStatus = data[7].strip()
whereWork = data[8].strip()
AccountCreationDate = data [9].strip() + ':' + data[10].strip() +":" + data[11].strip()
Email = data[12].strip()
Birthdate = data [13].strip()


Python retrieving data from a block of lines containing specific characters and appending relevant data into separate lines

I am trying to create a program which selects specific information from a bulk paste, extract the relevant information and then proceed to paste said information into lines.
Here is some example data;
1. Track1 03:01
VOC:PersonA
LYR:LyrcistA
COM:ComposerA
ARR:ArrangerA
ARR:ArrangerB
2. Track2 04:18
VOC:PersonB
VOC:PersonC
LYR:LyrcistA
LYR:LyrcistC
COM:ComposerA
ARR:ArrangerA
I would like the output to group the relevant data for each track on a single line, with a semicolon joining entries of the same kind and " - " separating the different kinds.
LyrcistA - ComposerA - ArrangerA; ArrangerB
LyrcistA; LyrcistC - ComposerA - ArrangerA
I have not gotten very far despite my best efforts
while True:
    YodobashiData = input("")
    SplitData = YodobashiData.splitlines();
returns the following
['1. Track1 03:01']
['VOC:PersonA ']
['LYR:LyrcistA']
['COM:ComposerA']
['ARR:ArrangerA']
['ARR:ArrangerB']
[]
['2. Track2 04:18']
['VOC:PersonB']
['VOC:PersonC']
['LYR:LyrcistA']
['LYR:LyrcistC']
['COM:ComposerA']
['ARR:ArrangerA']
Whilst I have all the data now in separate lists, I have no idea how to identify and extract the information from the list I need from the ones I do not.
Also, it seems I need to have the while loop or else it will only return the first list and nothing else.
Here's a script that doesn't use regular expressions.
It assumes that header lines, and only the header lines, will always start with a digit, and that the overall structure of header line then credit lines is consistent. Empty lines are ignored.
Extraction and formatting of the track data are handled separately, so it's easier to change formats, or use the extracted data in other ways.
import collections
import unicodedata
data_from_question = """\
1. Track1 03:01
VOC:PersonA
LYR:LyrcistA
COM:ComposerA
ARR:ArrangerA
ARR:ArrangerB
2. Track2 04:18
VOC:PersonB
VOC:PersonC
LYR:LyrcistA
LYR:LyrcistC
COM:ComposerA
ARR:ArrangerA
"""
def prepare_data(data):
    # The "colons" in the credits lines are actually
    # "full width colons". Replace them (and other such characters)
    # with their normal width equivalents.
    # If full normalisation is undesirable then we could return
    # data.replace('\N{FULLWIDTH COLON}', ':')
    return unicodedata.normalize('NFKC', data)

def is_new_track(line):
    return line[0].isdigit()

def parse_track_header(line):
    id_, title, duration = line.split()
    return {'id': id_.rstrip('.'), 'title': title, 'duration': duration}

def get_credit(line):
    credit, _, name = line.partition(':')
    return credit.strip(), name.strip()

def format_track_heading(track):
    return 'id: {id} title: {title} length: {duration}'.format(**track)

def format_credits(track):
    order = ['ARR', 'COM', 'LYR', 'VOC']
    parts = ['; '.join(track[k]) for k in order]
    return ' - '.join(parts)

def get_data():
    # The data is expected to be a multiline string.
    return data_from_question

def parse_data(data):
    track = None
    for line in filter(None, data.splitlines()):
        if is_new_track(line):
            if track:
                yield track
            track = collections.defaultdict(list)
            header_data = parse_track_header(line)
            track.update(header_data)
        else:
            role, name = get_credit(line)
            track[role].append(name)
    yield track

def report(tracks):
    for track in tracks:
        print(format_track_heading(track))
        print(format_credits(track))
        print()

def main():
    data = get_data()
    prepared_data = prepare_data(data)
    tracks = parse_data(prepared_data)
    report(tracks)

main()
Output:
id: 1 title: Track1 length: 03:01
ArrangerA; ArrangerB - ComposerA - LyrcistA - PersonA
id: 2 title: Track2 length: 04:18
ArrangerA - ComposerA - LyrcistA; LyrcistC - PersonB; PersonC
Here's another take on an answer to your question:
data = """
1. Track1 03:01
VOC:PersonA
LYR:LyrcistA
COM:ComposerA
ARR:ArrangerA
ARR:ArrangerB

2. Track2 04:18
VOC:PersonB
VOC:PersonC
LYR:LyrcistA
LYR:LyrcistC
COM:ComposerA
ARR:ArrangerA"""
import re
import collections
# Regular expression to pull apart the headline of each entry
headlinePattern = re.compile(r"(\d+)\.\s+(.*?)\s+(\d\d:\d\d)")
def main():
    # break the data into lines
    lines = data.strip().split("\n")
    # while we have more lines...
    while lines:
        # The next line should be a title line
        line = lines.pop(0)
        m = headlinePattern.match(line)
        if not m:
            raise Exception("Unexpected data format")
        id = m.group(1)
        title = m.group(2)
        length = m.group(3)
        people = collections.defaultdict(list)
        # Now read person lines until we hit a blank line or the end of the list
        while lines:
            line = lines.pop(0)
            if not line:
                break
            # Break the line into label and name
            label, name = re.split(r"\W+", line, maxsplit=1)
            # Add this entry to a map of lists, where the map's keys are the label and the
            # map's values are all the people who had that label
            people[label].append(name)
        # Now we have everything for one entry in the data. Print everything we got.
        print("id:", id, "title:", title, "length:", length)
        print(" - ".join(["; ".join(person) for person in people.values()]))
        # go on to the next entry...
main()
Result:
id: 1 title: Track1 length: 03:01
PersonA - LyrcistA - ComposerA - ArrangerA; ArrangerB
id: 2 title: Track2 length: 04:18
PersonB; PersonC - LyrcistA; LyrcistC - ComposerA - ArrangerA
You can just comment out the line that prints the headline info if you really just want the line with all of the people on it. Just replace the built in data with data = input("") if you want to read the data from a user prompt.
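One caveat when reading from a prompt: input() returns a single line per call, which is why the loop in the question produces one small list per line. A minimal sketch of reading a whole pasted block instead, assuming the data arrives on standard input and is ended with end-of-file (Ctrl-D on Unix-like systems, Ctrl-Z then Enter on Windows). read_pasted_block is a hypothetical helper name, and io.StringIO only simulates the paste here.

```python
import io
import sys


def read_pasted_block(stream=sys.stdin):
    """Return everything available on the stream as one string."""
    return stream.read()


# Simulated paste; call read_pasted_block() with no argument to read real input.
pasted = io.StringIO("1. Track1 03:01\nVOC:PersonA\n\n2. Track2 04:18\nVOC:PersonB\n")
data = read_pasted_block(pasted)
lines = data.splitlines()
print(lines)
```

The resulting string can then be passed to any of the parsers in these answers in place of the built-in data.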
Assuming your data is in the format you specified in a file called tracks.txt, the following code should work:
import re
with open('tracks.txt') as fp:
    tracklines = fp.read().splitlines()

def split_tracks(lines):
    track = []
    all_tracks = []
    while True:
        try:
            if lines[0] != '':
                track.append(lines.pop(0))
            else:
                # a blank line ends the current track
                all_tracks.append(track)
                track = []
                lines.pop(0)
        except IndexError:
            # no lines left: store the last track and stop
            all_tracks.append(track)
            return all_tracks

def gather_attrs(tracks):
    track_attrs = []
    for track in tracks:
        attrs = {}
        for line in track:
            match = re.match('([A-Z]{3}):', line)
            if match:
                attr = line[:3]
                val = line[4:].strip()
                try:
                    attrs[attr].append(val)
                except KeyError:
                    attrs[attr] = [val]
        track_attrs.append(attrs)
    return track_attrs

if __name__ == '__main__':
    tracks = split_tracks(tracklines)
    attrs = gather_attrs(tracks)
    for track in attrs:
        semicolons = map(lambda va: '; '.join(va), track.values())
        hyphens = ' - '.join(semicolons)
        print(hyphens)
The only thing you may have to change is the colon characters in your data: some of them are ASCII colons (:) and others are full-width Unicode colons (：, U+FF1A), which will break the regex.
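A minimal sketch of normalising the colons before parsing, assuming the full-width colon (U+FF1A) is the only non-ASCII variant in the data. normalize_colons is a hypothetical helper name and the sample string is made up.

```python
FULLWIDTH_COLON = "\N{FULLWIDTH COLON}"  # U+FF1A


def normalize_colons(text):
    """Replace full-width colons with plain ASCII colons."""
    return text.replace(FULLWIDTH_COLON, ":")


sample = "VOC\uFF1APersonA\nLYR:LyrcistA"
cleaned = normalize_colons(sample)
print(cleaned.splitlines())
```

Running the file contents through such a function first lets the ASCII-only pattern '([A-Z]{3}):' match every credit line.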
import re
list_ = data_.split('\n')  # here data_ is your data
regObj = re.compile(rf'[A-Za-z]+(:|{chr(65306)})[A-Za-z]+')
l = []
pre = ''
for i in list_:
    if regObj.findall(i):
        if i[:3] != 'VOC':
            if pre == i[:3]:
                l.append('; ')
            else:
                l.append(' - ')
            l.append(i[4:].strip())
    else:
        l.append(' => ')
    pre = i[:3]
track_list = list(map(lambda item: item.strip(' - '), filter(lambda item: item, ''.join(l).split(' => '))))
print(track_list)
OUTPUT : list of result you want
['LyrcistA - ComposerA - ArrangerA; ArrangerB', 'LyrcistA; LyrcistC - ComposerA - ArrangerA']

How to read a csv file and sum values based on user input?

Read a CSV file.
The user has to enter the mobile number.
The program should show the data usage, i.e. add uplink and downlink to get the total data used.
Here is an example of the CSV file:
Time_stamp; Mobile_number; Download; Upload; Connection_start_time; Connection_end_time; location
1/2/2020 10:43:55;7777777;213455;2343;1/2/2020 10:43:55;1/2/2020 10:47:25;09443
1/3/2020 10:33:10;9999999;345656;3568;1/3/2020 10:33:10;1/3/2020 10:37:20;89442
1/4/2020 11:47:57;9123456;345789;7651;1/4/2020 11:11:10;1/4/2020 11:40:22;19441
1/5/2020 11:47:57;9123456;342467;4157;1/5/2020 11:44:10;1/5/2020 11:59:22;29856
1/6/2020 10:47:57;7777777;213455;2343;1/6/2020 10:43:55;1/6/2020 10:47:25;09443
With pandas
import pandas as pd
# read in data
df = pd.read_csv('test.csv', sep=';')
# if there are really spaces at the beginning of the column names, they should be removed
df.columns = [col.strip() for col in df.columns]
# sum Download & Upload for all occurrences of the given number
usage = df[['Download', 'Upload']][df.Mobile_number == 7777777].sum().sum()
print(usage)
>>> 431596
if you want Download and Upload separately
# only 1 sum()
usage = df[['Download', 'Upload']][df.Mobile_number == 7777777].sum()
print(usage)
Download 426910
Upload 4686
with user input
This assumes the Mobile_number column has been read into the dataframe as an int.
input returns a str, so it must be converted to int to match the type in the dataframe:
df.Mobile_number == 7777777, not df.Mobile_number == '7777777'
number = int(input('Please input a phone number (numbers only)'))
usage = df[['Download', 'Upload']][df.Mobile_number == number].sum().sum()
With no imported modules
# read file and create dict of phone numbers
phone_dict = dict()
with open('test.csv') as f:
    for i, l in enumerate(f.readlines()):
        l = l.strip().split(';')
        if (i != 0):  # skip the header row
            mobile = l[1]
            download = int(l[2])
            upload = int(l[3])
            if phone_dict.get(mobile) is None:
                phone_dict[mobile] = {'download': [download], 'upload': [upload]}
            else:
                phone_dict[mobile]['download'].append(download)
                phone_dict[mobile]['upload'].append(upload)
print(phone_dict)
{'+917777777777': {'download': [213455, 213455], 'upload': [2343, 2343]},
'+919999999999': {'download': [345656], 'upload': [3568]},
'+919123456654': {'download': [345789], 'upload': [7651]},
'+919123456543': {'download': [342467], 'upload': [4157]}}
# function to return usage
def return_usage(data: dict, number: str):
    download_usage = sum(data[number]['download'])
    upload_usage = sum(data[number]['upload'])
    return download_usage + upload_usage
# get user input to return usage
number = input('Please input a phone number')
usage = return_usage(phone_dict, number)
print(usage)
>>> Please input a phone number (numbers only) +917777777777
>>> 431596
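The same approach can also be sketched with the standard csv module, which handles the header row and the spaces after each ';' in it. This is a sketch under assumptions: total_usage is a hypothetical function name, and io.StringIO holding the sample rows from the question stands in for open('test.csv', newline='').

```python
import csv
import io
from collections import defaultdict

SAMPLE = """\
Time_stamp; Mobile_number; Download; Upload; Connection_start_time; Connection_end_time; location
1/2/2020 10:43:55;7777777;213455;2343;1/2/2020 10:43:55;1/2/2020 10:47:25;09443
1/3/2020 10:33:10;9999999;345656;3568;1/3/2020 10:33:10;1/3/2020 10:37:20;89442
1/4/2020 11:47:57;9123456;345789;7651;1/4/2020 11:11:10;1/4/2020 11:40:22;19441
1/5/2020 11:47:57;9123456;342467;4157;1/5/2020 11:44:10;1/5/2020 11:59:22;29856
1/6/2020 10:47:57;7777777;213455;2343;1/6/2020 10:43:55;1/6/2020 10:47:25;09443
"""


def total_usage(csv_file):
    """Map each mobile number (kept as a string) to its Download + Upload total."""
    totals = defaultdict(int)
    # skipinitialspace drops the blank that follows each ';' in the header row
    reader = csv.DictReader(csv_file, delimiter=";", skipinitialspace=True)
    for row in reader:
        totals[row["Mobile_number"]] += int(row["Download"]) + int(row["Upload"])
    return dict(totals)


usage = total_usage(io.StringIO(SAMPLE))
print(usage["7777777"])
```

Because the numbers stay strings, the value returned by input() can be used as the lookup key directly, for example usage.get(number, 0).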
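The CSV is not very readable as it stands, but you could take a look at the pandas library: https://pandas.pydata.org/
Once it is installed you could use:
import pandas
# ask for the mobile number here; input() returns a str, so convert it to int
mobile_number = int(input('phone number? '))
# the sample file uses ';' as the separator
df = pandas.read_csv('data.csv', sep=';')
# strip the spaces around the column names in the header row
df.columns = [col.strip() for col in df.columns]
# here you will get the data for that user's phone
user_data = df[df['Mobile_number'] == mobile_number].copy()
# total downloaded by that number
user_data['Download'].sum()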

Extract Entire String Right of the Delimiter from a multi-line string after a match

From the multi-line string, I am attempting to extract the entire string right of = sign after a match. However, only portion of the string is extracted.
How can I rectify this problem? I am open to other implementations search/extraction operations as well.
import re
s = '''jaguar.vintage.aircards = 2
jaguar.vintage.hw.sdb.size = 512.1 GB
jaguar.vintage.hw.tm.firmware = SWI9X15C_05.05.16.02 r21040 carmd-fwbuild1 2014/03/17 23:49:48
jaguar.vintage.hw.tm.hardware = 1.0
jaguar.vintage.hw.tm.iccid = 8901260591783960689
jaguar.vintage.hw.tm.imei = 359225051166726
jaguar.vintage.hw.tm.imsi = 310260598396068
jaguar.vintage.hw.tm.model = MC7354
jaguar.vintage.hw.wifi1.mac = 00:30:1a:4e:06:7a
jaguar.vintage.hw.wifi2.mac = 00:30:1a:4e:06:79
jaguar.vintage.part = P34110-002
jaguar.vintage.product = P34101
jaguar.vintage.psoc = 0.1.16
jaguar.vintage.serial = 34110002T0021
jaguar.vintage.slavepsoc1 = 0.1.5
jaguar.vintage.sw.app.release = 4.0.0.41387-201902131138git367fbda8e
'''
# print(s)
# release = (s.split('jaguar.vintage.sw.app.release =')[1]).strip()
# print(release)
#part_number = jaguar.vintage.part = P34110-002
pnumsrch = r"jaguar.vintage.part =.*?(?=\w)(\w+)"
part_number = re.findall(pnumsrch, s)
print(part_number[0])
# release_number = jaguar.vintage.sw.app.release = 4.0.0.41387-201902131138git367fbda8e
relnumsrch = r"jaguar.vintage.sw.app.release =.*?(?=\w)(\w+)"
rel_number = re.findall(relnumsrch, s)
print(rel_number[0])
Actual:
P34110
4
Expected:
P34110-002
4.0.0.41387-201902131138git367fbda8e
Since . does not match a newline character by default, you can simply use .* to match the rest of the line:
pnumsrch = r"jaguar.vintage.part = (.*)"
and:
relnumsrch = r"jaguar.vintage.sw.app.release = (.*)"
Just capture everything that is not a newline. Demo:
pat = re.compile(r'jaguar\.vintage\.part = ([^\n]+)')
pat2 = re.compile(r'jaguar\.vintage\.sw\.app\.release = ([^\n]+)')
>>> pat.findall(s)
['P34110-002']
>>> pat2.findall(s)
['4.0.0.41387-201902131138git367fbda8e']
You also should escape your periods in your pattern.
As mentioned by @WiktorStribiżew, just . is good enough for the [^\n] portion:
pat = re.compile(r'jaguar\.vintage\.part = (.+)')
pat2 = re.compile(r'jaguar\.vintage\.sw\.app\.release = (.+)')
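Since the question is open to other approaches, here is a minimal sketch without regular expressions that reads every "key = value" line into a dict. parse_settings is a hypothetical helper name, and s below is a shortened copy of the string from the question.

```python
s = """jaguar.vintage.part = P34110-002
jaguar.vintage.hw.tm.firmware = SWI9X15C_05.05.16.02 r21040 carmd-fwbuild1 2014/03/17 23:49:48
jaguar.vintage.hw.wifi1.mac = 00:30:1a:4e:06:7a
jaguar.vintage.sw.app.release = 4.0.0.41387-201902131138git367fbda8e
"""


def parse_settings(text):
    settings = {}
    for line in text.splitlines():
        # partition splits at the first "=" only, so the value is kept whole
        key, sep, value = line.partition("=")
        if sep:  # skip lines without "="
            settings[key.strip()] = value.strip()
    return settings


settings = parse_settings(s)
print(settings["jaguar.vintage.part"])
print(settings["jaguar.vintage.sw.app.release"])
```

This gives access to all keys at once, at the cost of parsing lines you may not need.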

Learning Python: Store values in dict from stdout

How can I do the following in Python:
I have a command output that outputs this:
Datexxxx
Clientxxx
Timexxx
Datexxxx
Client2xxx
Timexxx
Datexxxx
Client3xxx
Timexxx
And I want to work this in a dict like:
Client:(date,time), Client2:(date,time) ...
After reading the data into a string subject, you could do this:
import re
d = {}
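for match in re.finditer(
        # raw string, so that \r, \n and \d reach the regex engine unchanged
        r"""(?mx)
        ^Date(.*)\r?\n
        Client\d*(.*)\r?\n
        Time(.*)""",
        subject):
    # group 1 is the date, group 2 the client, group 3 the time
    d[match.group(2)] = (match.group(1), match.group(3))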
How about something like:
rows = {}
thisrow = []
for line in output.split('\n'):
    if line[:4].lower() == 'date':
        thisrow.append(line)
    elif line[:6].lower() == 'client':
        thisrow.append(line)
    elif line[:4].lower() == 'time':
        thisrow.append(line)
    elif line.strip() == '':
        rows[thisrow[1]] = (thisrow[0], thisrow[2])
        thisrow = []
print(rows)
Assumes a trailing newline, no spaces before lines, etc.
What about using a dict with tuples?
Create a dictionary and add the entries (the name clients avoids shadowing the built-in dict):
clients = {}
clients['Client'] = ('date1','time1')
clients['Client2'] = ('date2','time2')
Accessing the entries:
clients['Client']
>>> ('date1','time1')
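To fill such a dictionary from the command output rather than by hand, a minimal sketch, assuming the output always comes in groups of exactly three lines in the order date, client, time. The sample values and the name records_to_dict are made up for illustration.

```python
output = """Date2020-01-01
ClientA
Time10:00
Date2020-01-02
ClientB
Time11:30
"""


def records_to_dict(text):
    lines = [line.strip() for line in text.splitlines() if line.strip()]
    result = {}
    # walk the lines three at a time: date, client, time
    for date, client, time in zip(lines[0::3], lines[1::3], lines[2::3]):
        result[client] = (date, time)
    return result


clients = records_to_dict(output)
print(clients)
```

If a group is incomplete, zip silently drops it, so this is only suitable when the output format is reliable.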

parsing a text file into lists with python

So I have a generated text file that I'd like to parse into a couple of lists of dates. I had it figured out when there was one date per 'group', but I realized I may have to deal with multiple date values per group.
My .txt file looks like this:
DateGroup1
20191129
20191127
20191126
DateGroup2
20191129
20191127
20191126
DateGroup3
2019-12-02
DateGroup4
2019-11-27
DateGroup5
2019-11-27
Ideally I would be able to parse this out into 5 lists that contain the dates for each group. I am so stumped.
Just loop over each line, check for your key that will group data, remove newlines and store each new date.
DATE_GROUP_SEPARATOR = 'DateGroup'
sorted_data = {}
with open('test.txt') as file:
    last_group = None
    for line in file.readlines():
        line = line.replace('\n', '')
        if DATE_GROUP_SEPARATOR in line:
            sorted_data[line] = []
            last_group = line
        else:
            sorted_data[last_group].append(line)
for date_group, dates in sorted_data.items():
    print(f"{date_group}: {dates}")
Here is an example that you could build on. Every time it reads a line that starts with a non-digit character, it creates a new list, and then puts all the dates under that group into it:
# read file
with open("test.txt") as f:
    lineList = f.readlines()
# make new list to hold variables
lists = []
# loop through and check for numbers and strings
y = -1
for x in range(len(lineList)):
    # check if it is a number or a string
    if not lineList[x][0].isdigit():
        # if it is a string make a new list and push back the name
        lists.append([lineList[x]])
        y += 1
    else:
        # if it is the number append it to the current list
        lists[y].append(lineList[x])
# print the lists
for x in lists:
    print(x)
Start by reading in your whole text file. Then you can count the amount of occurrences of "DateGroup", which seems to be the constant part in your date group separation. You can then parse your file by going through all the data that is in between any two "DateGroup" identifiers or between one "DateGroup" identifier and the end of the file. Try to understand the following piece of code and build your application on top of that:
file = open("dates.txt")
text = file.read()
file.close()
amountGroups = text.count("DateGroup")
list = []
index = 0
i = 0
for i in range(amountGroups):
    list.append([])
    index = text.find("DateGroup", index)
    index = text.find("\n", index) + 1
    indexEnd = text.find("DateGroup", index)
    if(indexEnd == -1):
        indexEnd = len(text)
    while(index < indexEnd):
        indexNewline = text.find("\n", index)
        list[i].append(text[index:indexNewline])
        index = indexNewline + 1
print(list)
This first section just shows how to treat a string of data as if it came from a file. That helps if you don't want to create the OP's actual file but want the data visible in the editor.
import sys
from io import StringIO # allows treating some lines in editor as if they were from a file)
dat=StringIO("""DateGroup1
20191129
20191127
20191126
DateGroup2
20191129
20191127
20191126
DateGroup3
2019-12-02
DateGroup4
2019-11-27
DateGroup5
2019-11-27""")
lines=[ l.strip() for l in dat.readlines()]
print(lines)
output:
['DateGroup1', '20191129', '20191127', '20191126', 'DateGroup2', '20191129', '20191127', '20191126', 'DateGroup3', '2019-12-02', 'DateGroup4', '2019-11-27', 'DateGroup5', '2019-11-27']
Now one possible way to generate your desired list of lists, while ensuring that both possible date formats are covered:
from datetime import datetime
a = None  # dates of the group currently being read
b = []  # list of lists, one per group
for i, line in enumerate(lines):
    try:  # try first dateformat
        do = datetime.strptime(line, '%Y%m%d')
        a.append(datetime.strftime(do, '%Y-%m-%d'))
    except ValueError:
        try:  # try second dateformat
            do = datetime.strptime(line, '%Y-%m-%d')
            a.append(datetime.strftime(do, '%Y-%m-%d'))
        except ValueError:  # if neither date, append old list to list of lists & make a new list
            if a is not None:
                b.append(a)
            a = []
    if i == len(lines) - 1:
        b.append(a)
b
output:
[['2019-11-29', '2019-11-27', '2019-11-26'],
 ['2019-11-29', '2019-11-27', '2019-11-26'],
 ['2019-12-02'],
 ['2019-11-27'],
 ['2019-11-27']]
TTP can help to parse this text as well, here is sample template with code how to run it:
from ttp import ttp
data_to_parse = """
DateGroup1
20191129
20191127
20191126
DateGroup2
20191129
20191127
20191126
DateGroup3
2019-12-02
DateGroup4
2019-11-27
DateGroup5
2019-11-27
"""
ttp_template = """
<group name="date_groups.date_group{{ id }}">
DateGroup{{ id }}
{{ dates | to_list | joinmatches() }}
</group>
"""
parser = ttp(data=data_to_parse, template=ttp_template)
parser.parse()
print(parser.result(format="json")[0])
above code would produce this output:
[
    {
        "date_groups": {
            "date_group1": {
                "dates": [
                    "20191129",
                    "20191127",
                    "20191126"
                ]
            },
            "date_group2": {
                "dates": [
                    "20191129",
                    "20191127",
                    "20191126"
                ]
            },
            "date_group3": {
                "dates": [
                    "2019-12-02"
                ]
            },
            "date_group4": {
                "dates": [
                    "2019-11-27"
                ]
            },
            "date_group5": {
                "dates": [
                    "2019-11-27"
                ]
            }
        }
    }
]
This is my attempt to parse that text data. I deliberately chose parsec.py, a Haskell Parsec-like parser combinator library, because it reads more clearly than regular expressions, so it is easier to debug and test.
The second reason is that it gives much more flexibility in the output data format.
import re
from parsec import *

spaces = regex(r'\s*', re.MULTILINE)

@generate
def getHeader():
    s1 = yield string("DateGroup")
    s2 = ''.join((yield many1(digit())))
    return (s1 + s2)

@generate
def getDataLine():
    s1 = yield digit()
    s2 = ''.join((yield many1(none_of("\r\n"))))
    yield spaces
    return (s1 + s2)

@generate
def getChunk():
    yield spaces
    header = yield getHeader
    yield spaces
    dataList = yield many1(getDataLine)
    return (header, dataList)

@generate
def getData():
    yield spaces
    parsedData = yield many1(getChunk)
    yield eof()
    return parsedData

inputText = """DateGroup1
20191129
20191127
20191126
DateGroup2
20191129
20191127
20191126
DateGroup3
2019-12-02
DateGroup4
2019-11-27
DateGroup5
2019-11-27"""
result = getData.parse(inputText)
for p in result:
    print(p)
Output:
('DateGroup1', ['20191129', '20191127', '20191126'])
('DateGroup2', ['20191129', '20191127', '20191126'])
('DateGroup3', ['2019-12-02'])
('DateGroup4', ['2019-11-27'])
('DateGroup5', ['2019-11-27'])
