Manipulate time-range in a pandas Dataframe - python

Need to clean up a CSV import, which gives me a range of times (in string form). Code is at the bottom; I currently use regular expressions and replace() on the df to convert other chars. I'm just not sure how to:
select the 24-hour-format numbers and add :00
select the 12-hour-format numbers and convert them to 24-hour.
Input (from csv import):
break_notes
0 15-18
1 18.30-19.00
2 4PM-5PM
3 3-4
4 4-4.10PM
5 15 - 17
6 11 - 13
So far I have got it to look like (remove spaces, AM/PM, replace dot with colon):
break_notes
0 15-18
1 18:30-19:00
2 4-5
3 3-4
4 4-4:10
5 15-17
6 11-13
However, I would like it to look like this ('HH:MM-HH:MM' format):
break_notes
0 15:00-18:00
1 18:30-19:00
2 16:00-17:00
3 15:00-16:00
4 16:00-16:10
5 15:00-17:00
6 11:00-13:00
My code is:
data = pd.read_csv('test.csv')
data.break_notes = data.break_notes.str.replace(r'([P].|[ ])', '').str.strip()
data.break_notes = data.break_notes.str.replace(r'([.])', ':').str.strip()

Here is a converter function based on your requested input data. convert_entry takes a complete value entry, splits it on the dash, and passes each half to convert_single, since the two halves of an entry can be converted individually. After each conversion, it joins them back with a dash.
convert_single uses a regex to pick out the important parts of the time string.
It starts with some digits \d+ (representing the hours), then optionally a dot or a colon and some more digits [.:]?(\d+)? (representing the minutes), and after that optionally AM or PM (AM|PM)? (only PM is relevant in this case).
import re

def convert_single(s):
    m = re.search(pattern=r"(\d+)[.:]?(\d+)?(AM|PM)?", string=s)
    hours = m.group(1)
    minutes = m.group(2) or "00"
    if m.group(3) == "PM":
        hours = str(int(hours) + 12)
    return hours.zfill(2) + ":" + minutes.zfill(2)

def convert_entry(value):
    start, end = value.split("-")
    start = convert_single(start)
    end = convert_single(end)
    return "-".join((start, end))

values = ["15-18", "18.30-19.00", "4PM-5PM", "3-4", "4-4.10PM", "15 - 17", "11 - 13"]
for value in values:
    cvalue = convert_entry(value)
    print(cvalue)
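Applied back to the DataFrame from the question, the converter can be mapped over the column with Series.apply. This is a sketch; note that the answer's converter keeps bare afternoon entries such as 3-4 as 03:00-04:00 rather than inferring PM, so it does not reproduce every row of the question's expected output:

```python
import re
import pandas as pd

def convert_single(s):
    m = re.search(r"(\d+)[.:]?(\d+)?(AM|PM)?", s)
    hours = m.group(1)
    minutes = m.group(2) or "00"
    if m.group(3) == "PM":
        hours = str(int(hours) + 12)
    return hours.zfill(2) + ":" + minutes.zfill(2)

def convert_entry(value):
    start, end = value.split("-")
    return "-".join((convert_single(start), convert_single(end)))

data = pd.DataFrame({"break_notes": ["15-18", "18.30-19.00", "4PM-5PM",
                                     "3-4", "4-4.10PM", "15 - 17", "11 - 13"]})
# strip spaces so "15 - 17" splits cleanly, then convert each entry
data["break_notes"] = data["break_notes"].str.replace(" ", "").apply(convert_entry)
print(data["break_notes"].tolist())
```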


Put "0" in front of numeric quantities within a string that are missing a numeric figure following the context of a regex pattern

import re

input_text = '2000_-_9_-_01 8:1 am'        # example 1
input_text = '(2000_-_1_-_01) 18:1 pm'     # example 2
input_text = '(20000_-_12_-_1) (1:1 am)'   # example 3

identificate_hours = r"(?:a\s*las|a\s*la|)\s*(\d{1,2}):(\d{1,2})\s*(?:(am)|(pm)|)"
date_format_00 = r"(\d*)_-_(\d{1,2})_-_(\d{1,2})"
identification_re_0 = r"(?:\(|)\s*" + date_format_00 + r"\s*(?:\)|)\s*(?:a\s*las|a\s*la|)\s*(?:\(|)\s*" + identificate_hours + r"\s*(?:\)|)"

input_text = re.sub(identification_re_0,
                    #lambda m: print(m[2]),
                    lambda m: (f"({m[1]}_-_{m[2]}_-_{m[3]}({m[4] or '00'}:{m[5] or '00'} {m[6] or m[7] or 'am'}))"),
                    input_text, flags=re.IGNORECASE)
print(repr(input_text))  # --> output
Considering that there are 5 numerical values (year, month, day, hour, minutes) that may each need a leading "0", and two possibilities for each (add a "0" or don't), combinatorics gives a total of 32 possible combinations. That is too many to write 32 different regexes that do or don't add a "0" in front of every value that needs it. For this reason, repeating the regex and changing the "(\d{1,2})" parts one by one does not seem like a good solution to this problem.
I am trying to standardize date-time data that is entered by users in natural language so that it can then be processed.
So, once the dates are in this format, I need the numerical values of months, days, hours and/or minutes that are left with a single digit to be standardized to 2 digits, placing a "0" before them to make up the missing digit.
So that in the output the input date-time are expressed in this way:
YYYY_-_MM_-_DD(hh:mm am or pm)
'(2000_-_09_-_01(08:01 am))' #for example 1
'(2000_-_01_-_01(18:01 pm))' #for example 2
'(20000_-_12_-_01(18:01 am))' #for example 3
I have used the re.sub() function because it handles the possibility that the same input_text contains more than one place where this replacement must be carried out. For example, for the input '2000_-_9_-_01 8:1 am 2000_-_9_-_01 8:1 am' the procedure must run twice, since the pattern appears twice, producing '(2000_-_09_-_01(08:01 am)) (2000_-_09_-_01(08:01 am))'.
I'm not sure I fully understood you, but I would solve it with datetime instead of regex. Note that datetime doesn't support the year 20000; typo, or are you planning way ahead? :-D
from datetime import datetime

testDates = [
    '2000_-_9_-_01 8:1 am',      # example 1
    '(2000_-_1_-_01) 18:1 pm',   # example 2
    '(2000_-_12_-_1) (1:1 am)',  # example 3
]
for testDate in testDates:
    testDateClean = testDate
    for rm in ('(', ')'):
        testDateClean = testDateClean.replace(rm, '')
    date = datetime.strptime(testDateClean, '%Y_-_%m_-_%d %H:%M %p')
    print(date.strftime('%Y_-_%m_-_%d(%H:%M %p)'))
A regex solution which can handle all provided example strings:
import re

INPUT_DATES = [
    '(2000_-_09_-_01 (08:01 am)) (2001_-_10_-_01 (09:02 am))',
    '(20000_-_1_-_01) 18:1 pm',
    '2000_-_9_-_01 8:1 am',
    '(2000_-_12_-_1) (1:1 am)',
]
REGEX_SPLIT = re.compile(r'\(([\dpam_\- :\(\)]{10,})\) \(([\dpam_\- :\(\)]{10,})\)')
REGEX_DATE = re.compile(r'(?P<year>\d{4,})_-_(?P<month>\d{1,2})_-_(?P<day>\d{1,2}) (?P<hour>\d{1,2}):(?P<minute>\d{1,2}) (?P<apm>[apm]{2})')

for testDates in INPUT_DATES:
    testDates = REGEX_SPLIT.split(testDates)
    for testDate in testDates:
        if len(testDate) < 10:
            continue
        testDateClean = testDate
        for rm in ('(', ')'):
            testDateClean = testDateClean.replace(rm, '')
        date = REGEX_DATE.match(testDateClean).groupdict()
        print(f'parsed out: {date["year"]}_-_{date["month"]:>02}_-_{date["day"]:>02}({date["hour"]:>02}:{date["minute"]:>02} {date["apm"]}), from in: {testDate}')
output:
parsed out: 2000_-_09_-_01(08:01 am), from in: 2000_-_09_-_01 (08:01 am)
parsed out: 2001_-_10_-_01(09:02 am), from in: 2001_-_10_-_01 (09:02 am)
parsed out: 20000_-_01_-_01(18:01 pm), from in: (20000_-_1_-_01) 18:1 pm
parsed out: 2000_-_09_-_01(08:01 am), from in: 2000_-_9_-_01 8:1 am
parsed out: 2000_-_12_-_01(01:01 am), from in: (2000_-_12_-_1) (1:1 am)
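If the goal is only the zero-padding (rather than full validation), one alternative is a single re.sub whose replacement function pads every captured field, which avoids the 32-combination explosion the question describes. This is a sketch with a simplified pattern, not the question's original regex:

```python
import re

# simplified pattern: optional parens, year, month, day, hour, minute, am/pm
PATTERN = re.compile(
    r"\(?(\d+)_-_(\d{1,2})_-_(\d{1,2})\)?\s*\(?(\d{1,2}):(\d{1,2})\s*(am|pm)\)?",
    re.IGNORECASE,
)

def pad(m):
    year, month, day, hour, minute, apm = m.groups()
    # zfill pads each field to 2 digits; re.sub calls this once per match
    return (f"({year}_-_{month.zfill(2)}_-_{day.zfill(2)}"
            f"({hour.zfill(2)}:{minute.zfill(2)} {apm.lower()}))")

for text in ["2000_-_9_-_01 8:1 am",
             "(2000_-_1_-_01) 18:1 pm",
             "(20000_-_12_-_1) (1:1 am)"]:
    print(PATTERN.sub(pad, text))
```

Because re.sub calls the replacement function for every match, repeated dates in one string are all padded in a single pass.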

Removing signs and repeating numbers

I want to remove all signs from my dataframe to leave each value in one of two formats: 100-200 or 200.
So the salaries should have a single hyphen between two numbers if a range of salaries is given, and otherwise be a clean single number.
I have the following data:
import pandas as pd
import re

df = {'salary': ['£26,768 - £30,136/annum Attractive benefits package',
                 '£26,000 - £28,000/annum plus bonus',
                 '£21,000/annum',
                 '£26,768 - £30,136/annum Attractive benefits package',
                 '£33/hour',
                 '£18,500 - £20,500/annum Inc Bonus - Study Support + Bens',
                 '£27,500 - £30,000/annum £27,500 to £30,000 + Study',
                 '£35,000 - £40,000/annum',
                 '£24,000 - £27,000/annum Study Support (ACCA / CIMA)',
                 '£19,000 - £24,000/annum Study Support',
                 '£30,000 - £35,000/annum',
                 '£44,000 - £66,000/annum + 15% Bonus + Excellent Benefits. L',
                 '£75 - £90/day £75-£90 Per Day']}
data = pd.DataFrame(df)
Here's what I have tried to remove some of the signs:
salary = []
for i in data.salary:
    space = re.sub(" ", '', i)
    lower = re.sub("[a-z]", '', space)
    upper = re.sub("[A-Z]", '', lower)
    bracket = re.sub("/", '', upper)
    comma = re.sub(",", '', bracket)
    plus = re.sub(r"\+", '', comma)
    percentage = re.sub(r"\%", '', plus)
    dot = re.sub(r"\.", '', percentage)
    bracket1 = re.sub(r"\(", '', dot)
    bracket2 = re.sub(r"\)", '', bracket1)
    salary.append(bracket2)
Which gives me:
'£26768-£30136',
'£26000-£28000',
'£21000',
'£26768-£30136',
'£33',
'£18500-£20500-',
'£27500-£30000£27500£30000',
'£35000-£40000',
'£24000-£27000',
'£19000-£24000',
'£30000-£35000',
'£44000-£6600015',
'£75-£90£75-£90'
However, I have some repeating numbers. Essentially I want anything after the first range of values removed, along with any sign besides the hyphen between the two numbers.
Expected output:
'26768-30136',
'26000-28000',
'21000',
'26768-30136',
'33',
'18500-20500',
'27500-30000',
'35000-40000',
'24000-27000',
'19000-24000',
'30000-35000',
'44000-66000',
'75-90'
Another way, using pandas.Series.str.partition with replace:
data["salary"].str.partition("/")[0].str.replace(r"[^\d-]+", "", regex=True)
Output:
0 26768-30136
1 26000-28000
2 21000
3 26768-30136
4 33
5 18500-20500
6 27500-30000
7 35000-40000
8 24000-27000
9 19000-24000
10 30000-35000
11 44000-66000
12 75-90
Name: 0, dtype: object
Explanation:
It assumes that you are only interested in the part up to the /; it extracts everything until the /, then removes anything but digits and the hyphen.
You can use
data['salary'].str.split('/', n=1).str[0].replace(r'[^\d-]+', '', regex=True)
# 0 26768-30136
# 1 26000-28000
# 2 21000
# 3 26768-30136
# 4 33
# 5 18500-20500
# 6 27500-30000
# 7 35000-40000
# 8 24000-27000
# 9 19000-24000
# 10 30000-35000
# 11 44000-66000
# 12 75-90
Here,
.str.split('/', n=1) - splits into two parts at the first / char
.str[0] - gets the first item
.replace(r'[^\d-]+', '', regex=True) - removes all chars other than digits and hyphens.
A more precise solution is to extract the £num(-£num)? pattern and remove all non-digits/hyphens:
data['salary'].str.extract(r'£(\d+(?:,\d+)*(?:\.\d+)?(?:\s*-\s*£\d+(?:,\d+)*(?:\.\d+)?)?)')[0].str.replace(r'[^\d-]+', '', regex=True)
Details:
£ - a literal char
\d+(?:,\d+)* (?:\.\d+)? - one or more digits, followed by zero or more occurrences of a comma and one or more digits, and then an optional sequence of a dot and one or more digits
(?:\s*-\s*£\d+(?:,\d+)*(?:\.\d+)?)? - an optional occurrence of a hyphen enclosed in optional whitespace (\s*-\s*), then a £ char, and the number pattern described above.
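As a quick runnable check of that extract-then-strip approach (a sketch using three of the question's rows):

```python
import pandas as pd

data = pd.DataFrame({"salary": [
    "£26,768 - £30,136/annum Attractive benefits package",
    "£21,000/annum",
    "£75 - £90/day £75-£90 Per Day",
]})
# extract only the first £num(-£num)? occurrence, then drop everything
# except digits and the hyphen
out = (data["salary"]
       .str.extract(r"£(\d+(?:,\d+)*(?:\.\d+)?(?:\s*-\s*£\d+(?:,\d+)*(?:\.\d+)?)?)")[0]
       .str.replace(r"[^\d-]+", "", regex=True))
print(out.tolist())
```

Because str.extract only takes the first match per row, the duplicated range in the last row is dropped automatically.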
You can do it in only two regex passes.
First extract the monetary amounts with a regex, then remove the thousands separators, and finally join the output by group, keeping only the first two occurrences per original row.
The advantage of this solution is that it really only extracts monetary amounts, not other numbers that may be there if the input is not clean.
(data['salary'].str.extractall(r'£([,\d]+)')[0]       # extract the £123,456 digits
 .str.replace(r'\D', '', regex=True)                  # remove the thousands separator
 .groupby(level=0).apply(lambda x: '-'.join(x[:2]))   # join the first two occurrences per row
)
output:
0 26768-30136
1 26000-28000
2 21000
3 26768-30136
4 33
5 18500-20500
6 27500-30000
7 35000-40000
8 24000-27000
9 19000-24000
10 30000-35000
11 44000-66000
12 75-90
You can use replace with a pattern and optional capture groups to match the data format, and use those groups in the replacement.
import pandas as pd

df = {'salary': ['£26,768 - £30,136/annum Attractive benefits package',
                 '£26,000 - £28,000/annum plus bonus',
                 '£21,000/annum',
                 '£26,768 - £30,136/annum Attractive benefits package',
                 '£33/hour',
                 '£18,500 - £20,500/annum Inc Bonus - Study Support + Bens',
                 '£27,500 - £30,000/annum £27,500 to £30,000 + Study',
                 '£35,000 - £40,000/annum',
                 '£24,000 - £27,000/annum Study Support (ACCA / CIMA)',
                 '£19,000 - £24,000/annum Study Support',
                 '£30,000 - £35,000/annum',
                 '£44,000 - £66,000/annum + 15% Bonus + Excellent Benefits. L',
                 '£75 - £90/day £75-£90 Per Day']}
data = pd.DataFrame(df).salary.replace(
    r"^£(\d+)(?:,(\d+))?(?:\s*(-)\s*£(\d+)(?:,(\d+))?)?/.*",
    r"\1\2\3\4\5", regex=True
)
print(data)
The pattern matches
^ Start of string
£ Match literally
(\d+) Capture 1+ digits in group 1
(?:,(\d+))? Optionally capture 1+ digits in group 2, preceded by a comma, to match the data format
(?: Non-capturing group to match as a whole
\s*(-)\s*£ Capture - between optional whitespace chars in group 3, then match £
(\d+)(?:,(\d+))? The same as before, now in group 4 and group 5
)? Close the non-capturing group and make it optional
Output
0 26768-30136
1 26000-28000
2 21000
3 26768-30136
4 33
5 18500-20500
6 27500-30000
7 35000-40000
8 24000-27000
9 19000-24000
10 30000-35000
11 44000-66000
12 75-90

Python Output not Aligned Using Space and Tab

My code is here:
days = int(input("How many days did you work? : "))
totalSalary = 0
print("Day", "\tDaily Salary", "\tTotal Salary")
for day in range(days):
    daily = 2**day
    totalSalary += daily
    print(day+1, "\t ", daily, "\t\t ", totalSalary)
When I enter 6 as input, here is the output:
Day Daily Salary Total Salary
1 1 1
2 2 3
3 4 7
4 8 15
5 16 31
6 32 63
Why are the last 2 lines not aligned?
Edit: I forgot to say that I know there are better solutions, like using format, but I just wanted to understand why there is a problem with tabs and spaces.
Edit2: The visualization of tab stops in Jason Yang's answer satisfied me.
For the statement
print(day+1, "\t ", daily, "\t\t ", totalSalary)
each '\t' stops at columns 1, 9, 17, ..., i.e. at every 8th character.
So it will look like this
1=------____=1=-........____=1
2=------____=2=-........____=3
3=------____=4=-........____=7
4=------____=8=-........____=15
5=------____=16=--------........=31
6=------____=32=--------........=63
12345678123456781234567812345678 <--- Tab stop before each 1
Here
= is the separator space between each two arguments of print
- is the space generated by a not-last TAB
_ is the space specified by you in your print
. is the space generated by the last TAB
From here you can see the difference, and why they stop at different positions.
Try adding the option sep='' to your print, or change the number of spaces you added.
print(day+1, "\t ", daily, "\t\t ", totalSalary, sep='')
then it will be fine.
How many days did you work? : 6
Day Daily Salary Total Salary
1 1 1
2 2 3
3 4 7
4 8 15
5 16 31
6 32 63
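The tab-stop behavior described above can be checked with str.expandtabs, which performs the same expansion a terminal does (a sketch; some consoles use different stop widths, which is one reason output can disagree across environments):

```python
# each "\t" jumps to the next multiple-of-8 column;
# expandtabs(8) makes that expansion explicit as spaces
for prefix in ("1", "16", "ABCDEFGH"):
    print(repr((prefix + "\tZ").expandtabs(8)))
```

The longer the text before the tab, the fewer spaces the tab contributes, until the text crosses a stop and the following column jumps a full 8 characters.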
When the 2nd column has double-digit values, the rest of the tab and the 3rd column get shifted. You must use zero padding for the values if you expect correctly aligned columns.
If your Python version is 3+, and you want 2-digit values, you can call print(f'{n:02}') so as to print 01 when the value of n is 1.
For Python 2.7+, you can use format, like so: print('{:02d}'.format(n)).
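An alternative to zero-padding is to drop tabs entirely and use fixed-width f-string fields, which keep the columns aligned regardless of digit count (a sketch, not from the original answers; the input() call is replaced by a fixed sample value):

```python
days = 6  # sample input instead of input()
total = 0
# :<8 and :<16 left-align each value in a fixed-width field
print(f"{'Day':<8}{'Daily Salary':<16}{'Total Salary'}")
for day in range(days):
    daily = 2 ** day
    total += daily
    print(f"{day + 1:<8}{daily:<16}{total}")
```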
A tab is not just a collection of spaces.
Any characters before the tab fill the space up to the next tab stop.
>>> print("A\tZ")
A       Z
>>> print("AB\tZ")
AB      Z
>>> print("ABC\tZ")
ABC     Z
and if there is no room left before the next tab stop, the output is shifted to the following stop
>>> print("ABCDEFGH\tZ")
ABCDEFGH        Z
I suppose your question is due to a misunderstanding of what a tab character is and how it behaves:
A tab character advances to the next tab stop. Historically tab stops were every 8th character, although smaller values are in common use today and most editors can be configured. Source:
How many spaces for tab character(\t)?
try this and see:
print('123456789')
print('1\t1')
print('12\t1')
print('123\t1')
I think you added too many spaces in print(day+1, "\t ", daily, "\t\t ", totalSalary).
When you remove the extra spaces you will no longer get the alignment problem.
Dynamic space declaration, based on the length of the digits:
days = int(input("How many days did you work? : "))
totalSalary = 0
day_data = []
for day in range(days):
    daily = 2**day
    totalSalary += daily
    day_data.append([day+1, daily, totalSalary])
num_space = len(str(day_data[-1][-1])) + 2
f_space_len, s_space_len = 5 + num_space, 9 + num_space
print(f"Day{num_space*' '}Daily Salary{num_space*' '}Total Salary")
for i in day_data:
    day, daily, totalSalary = map(str, i)
    print(day, (f_space_len-len(day)+1)*' ', daily, (s_space_len-len(daily)+1)*' ', totalSalary)

Use regular expression to extract numbers before specific words

Goal
Extract number before word hours, hour, day, or days
How to use | to match the words?
s = '2 Approximately 5.1 hours 100 ays 1 s'
re.findall(r"([\d.+-/]+)\s*[days|hours]", s) # note I do not know whether string s contains hours or days
return
['5.1', '100', '1']
Since 100 and 1 are not before the exact word hours, they should not show up. Expected
5.1
How to extract the first number from the matched result
s1 = '2 Approximately 10.2 +/- 30hours'
re.findall(r"([\d. +-/]+)\s*hours|\s*hours", s1)
return
['10.2 +/- 30']
Expect
10.2
Note that the special characters +, /, -, . are optional. When a . appears, as in 1.3, the 1.3 should be returned with the dot. But when 1 +/- 0.5 occurs, only the 1 should be extracted, and none of the +/-.
I know I could probably do a split and then take the first number
str(re.findall(r"([\d. +-/]+)\s*hours", s1)[0]).split(" ")[1]
Gives
'10.2'
But some of the results only return one number, so a split will cause an error. Should I do this in another step, or can it be done in one step?
Please note that these strings s1, s2 are values in a dataframe, so iteration using functions like apply with lambda will be needed.
In fact, I would use re.findall here:
units = ["hours", "hour", "days", "day"] # the order matters here: put plurals first
regex = r'(?:' + '|'.join(units) + r')'
s = '2 Approximately 5.1 hours 100 ays 1 s'
values = re.findall(r'\b(\d+(?:\.\d+)?)\s+' + regex, s)
print(values) # prints ['5.1']
If you want to also capture the units being used, then make the units alternation capturing, i.e. use:
regex = r'(' + '|'.join(units) + r')'
Then the output would be:
[('5.1', 'hours')]
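A runnable check of both variants from this answer (a sketch):

```python
import re

units = ["hours", "hour", "days", "day"]  # plurals first so "hours" wins over "hour"
s = "2 Approximately 5.1 hours 100 ays 1 s"

# non-capturing alternation: findall returns just the numbers
non_capturing = r"(?:" + "|".join(units) + r")"
print(re.findall(r"\b(\d+(?:\.\d+)?)\s+" + non_capturing, s))  # ['5.1']

# capturing alternation: findall returns (number, unit) tuples
capturing = r"(" + "|".join(units) + r")"
print(re.findall(r"\b(\d+(?:\.\d+)?)\s+" + capturing, s))  # [('5.1', 'hours')]
```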
Code
import re

units = '|'.join(["hours", "hour", "hrs", "days", "day", "minutes", "minute", "min"])  # possible units
number = r'\d+[.,]?\d*'  # pattern for a number
plus_minus = r'\+\/\-'   # plus/minus
cases = fr'({number})(?:[\s\d\-\+\/]*)(?:{units})'
pattern = re.compile(cases)
Tests
print(pattern.findall('2 Approximately 5.1 hours 100 ays 1 s'))
# Output: ['5.1']
print(pattern.findall('2 Approximately 10.2 +/- 30hours'))
# Output: ['10.2']
print(pattern.findall('The mean half-life for Cetuximab is 114 hours (range 75-188 hours).'))
# Output: ['114', '75']
print(pattern.findall('102 +/- 30 hours in individuals with rheumatoid arthritis and 68 hours in healthy adults.'))
# Output: ['102', '68']
print(pattern.findall("102 +/- 30 hrs"))
# Output: ['102']
print(pattern.findall("102-130 hrs"))
# Output: ['102']
print(pattern.findall("102hrs"))
# Output: ['102']
print(pattern.findall("102 hours"))
# Output: ['102']
Explanation
The above uses the convenience that raw strings (r'...') and f-string interpolation (f'...') can be combined as fr'...', per PEP 498.
The cases string:
fr'({number})(?:[\s\d\-\+\/]*)(?:{units})'
consists of the sequence:
fr'({number})' - capturing group (\d+[.,]?\d*) for integers or floats
r'(?:[\s\d\-\+\/]*)' - non-capturing group for allowable characters between the number and the units (i.e. space, digit, -, +, /)
fr'(?:{units})' - non-capturing group for the units
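Since the question notes these strings are values in a DataFrame, the compiled pattern can be applied per row with Series.str.findall (a sketch; the column name text is invented for illustration):

```python
import re
import pandas as pd

units = "|".join(["hours", "hour", "hrs", "days", "day"])
number = r"\d+[.,]?\d*"
pattern = re.compile(fr"({number})(?:[\s\d\-\+\/]*)(?:{units})")

df = pd.DataFrame({"text": [
    "2 Approximately 5.1 hours 100 ays 1 s",
    "2 Approximately 10.2 +/- 30hours",
]})
# findall returns a list of matches per row; .str[0] would take just the first
df["numbers"] = df["text"].str.findall(pattern)
print(df["numbers"].tolist())
```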

Extract multiple line data between two symbols - Regex and Python3

I have a huge file from which I need data for specific entries. File structure is:
>Entry1.1
#size=1688
704 1 1 1 4
979 2 2 2 0
1220 1 1 1 4
1309 1 1 1 4
1316 1 1 1 4
1372 1 1 1 4
1374 1 1 1 4
1576 1 1 1 4
>Entry2.1
#size=6251
6110 3 1.5 0 2
6129 2 2 2 2
6136 1 1 1 4
6142 3 3 3 2
6143 4 4 4 1
6150 1 1 1 4
6152 1 1 1 4
>Entry3.2
#size=1777
AND SO ON-----------
What I need to achieve is to extract all the lines (the complete record) for certain entries. For example, if I need the record for Entry1.1, I can use the entry name '>Entry1.1' up to the next '>' as markers in a REGEX to extract the lines in between. But I do not know how to build such complex REGEX expressions. Once I have such an expression, I will put it in a FOR loop:
for entry in entrylist:
    GET record from big_file
    DO some processing
    WRITE in result file
What could the REGEX be to perform such an extraction of records for specific entries? Is there a more pythonic way to achieve this? I would appreciate your help on this.
AK
With regex
import re
ss = '''
>Entry1.1
#size=1688
704 1 1 1 4
979 2 2 2 0
1220 1 1 1 4
1309 1 1 1 4
1316 1 1 1 4
1372 1 1 1 4
1374 1 1 1 4
1576 1 1 1 4
>Entry2.1
#size=6251
6110 3 1.5 0 2
6129 2 2 2 2
6136 1 1 1 4
6142 3 3 3 2
6143 4 4 4 1
6150 1 1 1 4
6152 1 1 1 4
>Entry3.2
#size=1777
AND SO ON-----------
'''
patbase = r'(>Entry *%s(?![^\n]+?\d).+?)(?=>|(?:\s*\Z))'
while True:
    x = input('What entry do you want ? : ')
    found = re.findall(patbase % x, ss, re.DOTALL)
    if found:
        print('found ==', found)
        for each_entry in found:
            print('\n%s\n' % each_entry)
    else:
        print('\n ** There is no such an entry **\n')
Explanation of '(>Entry *%s(?![^\n]+?\d).+?)(?=>|(?:\s*\Z))' :
1)
%s receives the reference of entry: 1.1 , 2 , 2.1 etc
2)
The portion (?![^\n]+?\d) performs a verification.
(?![^\n]+?\d) is a negative look-ahead assertion saying that what follows %s must not be [^\n]+?\d, that is, any characters [^\n]+? before a digit \d.
I write [^\n] to mean "any character except a newline \n".
I am obliged to write this instead of simply .+? because I use the flag re.DOTALL, and the pattern portion .+? would then run on to the end of the entry.
However, I only want to verify that after the entered reference (represented by %s in the pattern), there are no additional digits before the end OF THE LINE, entered by mistake.
All this matters because if there is an Entry2.1 but no Entry2, and the user enters just 2 because he wants Entry2 and no other, the regex would otherwise detect the presence of Entry2.1 and yield it, though the user really wanted Entry2.
3)
At the end of (>Entry *%s(?![^\n]+?\d).+?), the part .+? catches the complete block of the entry, because the dot represents any character, including a newline \n.
It is for this purpose that I use the flag re.DOTALL, to make the pattern portion .+? able to pass over newlines until the end of the entry.
4)
I want the matching to stop at the end of the desired entry, not inside the next one, so that the group defined by the parentheses in (>Entry *%s(?![^\n]+?\d).+?) catches exactly what we want.
Hence I put at the end a positive look-ahead assertion (?=>|(?:\s*\Z)) saying that the character before which the running ungreedy .+? must stop matching is either > (the beginning of the next entry) or the end of the string \Z.
As the end of the last entry may not be exactly the end of the entire string, I add \s*, meaning "possible whitespace before the very end".
So \s*\Z means "there can be whitespace before bumping into the end of the string".
Whitespace characters are a blank, \f, \n, \r, \t, \v.
I'm no good with regexes, so I try to look for non-regex solutions whenever I can. In Python, the natural place to store iteration logic is in a generator, and so I'd use something like this (no-itertools-required version):
def group_by_marker(seq, marker):
    group = []
    # advance past negatives at start
    for line in seq:
        if marker(line):
            group = [line]
            break
    for line in seq:
        # found a new group start; yield what we've got
        # and start over
        if marker(line) and group:
            yield group
            group = []
        group.append(line)
    # might have extra bits left..
    if group:
        yield group
In your example case, we get:
>>> with open("entry0.dat") as fp:
...     marker = lambda line: line.startswith(">Entry")
...     for group in group_by_marker(fp, marker):
...         print(repr(group[0]), len(group))
...
'>Entry1.1\n' 10
'>Entry2.1\n' 9
'>Entry3.2\n' 4
One advantage to this approach is that we never have to keep more than one group in memory, so it's handy for really large files. It's not nearly as fast as a regex, although if the file is 1 GB you're probably I/O bound anyhow.
Not entirely sure what you're asking. Does this get you any closer? It will put all your entries as dictionary keys, each mapped to a list of its lines, assuming the file is formatted as I believe it is. Does it have duplicate entries? Here's what I've got:
entries = {}
key = ''
for entry in open('entries.txt'):
    if entry.startswith('>Entry'):
        key = entry[1:].strip()  # removes > and newline
        entries[key] = []
    else:
        entries[key].append(entry)
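The same loop works over any iterable of lines; here is a self-contained check using a trimmed version of the sample data instead of a file (a sketch):

```python
text = """>Entry1.1
#size=1688
704 1 1 1 4
979 2 2 2 0
>Entry2.1
#size=6251
6110 3 1.5 0 2
"""

entries = {}
key = ''
for entry in text.splitlines():
    if entry.startswith('>Entry'):
        key = entry[1:].strip()  # removes > (strip drops no newline here)
        entries[key] = []
    else:
        entries[key].append(entry)

print(entries['Entry1.1'])
```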
