Capturing data regex - python

I am trying to create variables using text imported from .doc files. For the given text:
10 5,476,326.00 6 GRANITE CONSTRUCTION COMPANY 831 724-1011
00000089
P O BOX 50085 FAX 831 768-4021
WATSONVILLE CA 95077-5085
08-0C8104 BID245
08-SBD-15-4 PAGE 3
11/21/08 11/26/08
L I S T O F S U B C O N T R A C T O R S
BIDDER ID NAME AND ADDRESS DESCRIPTION OF PORTION OF WORK SUBCONTRACTED
_________ ____________________________________________________________ ____________________________________________________________
01 C AND W CONSTRUCTION SPECIALTIES INC ITEMS 26, 27, 58 AND 59
2419 PALMA DRIVE
VENTURA CA 93003
CAL STRIPE INC ITEMS 15, 66 AND 67
375 SOUTH G STREET
SAN BERNARDINO CA 92410
INTEGRITY REBAR PLACERS ITEMS 60 THRU 65 (PARTIAL)
23811 WASHINGTON AVE 110 317
MURRIETA CA 92562
J F L ELECTRIC INC ITEMS 68 AND 69
8257 COMPTON
LOS ANGELES CA 90001
MURPHY INDUSTRIAL COATING INC ITEM 47
2704 GUNERLY AVENUE
SIGNAL HILL C 90755
08-0C8104 BID245
08-SBD-15-4 PAGE 4
11/21/08 11/26/08
L I S T O F S U B C O N T R A C T O R S
BIDDER ID NAME AND ADDRESS DESCRIPTION OF PORTION OF WORK SUBCONTRACTED
_________ ____________________________________________________________ ____________________________________________________________
03 C W CONSTRUCTION SPECIALTY INC ITEMS 26, 27, 58 AND 59
VENTURA CA
J F L ELECTRIC INC ITEMS 68 AND 69
LOS ANGELES CA
LUNDENE PAINTING ITEM 47
FONTANA CA
BIDDER ID NAME AND ADDRESS DESCRIPTION OF PORTION OF WORK SUBCONTRACTED
_________ ____________________________________________________________ ____________________________________________________________
10 C AND W CONSTRUCTION SPECIALTY INC ITEMS 26, 27, 58 AND 59 (PARTIAL)
VENTURA CA
FFB VANGUARD CONSTRUCTION ITEMS 60 THRU 65 (PARTIAL)
LIVERMORE CA
J F L ELECTRIC INC ITEMS 68 AND 69 (PARTIAL)
LOS ANGELES CA
PAVEMENT RECYCLING SYSTEM INC ITEM 28 (PARTIAL)
RIVERSIDE CA
VISUAL POLLUTION TECHNOLOGIES INC ITEM 47 (PARTIAL)
SCOTTSDALE AZ
08-0C8104 BID245
08-SBD-15-4 PAGE 5
11/21/08 11/26/08
L I S T O F S U B C O N T R A C T O R S
BIDDER ID NAME AND ADDRESS DESCRIPTION OF PORTION OF WORK SUBCONTRACTED
_________ ____________________________________________________________ ____________________________________________________________
09 INTEGRITY REBAR PLACERS ITEMS 60 THRU 65 (PARTIAL)
MURIETTA CA
J F L ELECTRIC INC ITEMS 68 THRU 69 (PARTIAL)
LOS ANELES CA
MARINA LANDSCAPE INC EROSION CONTROL (PARTIAL)
ANAHEIM CA
PAVEMENT RECYCLING SYSTEMS INC ITEM 28 (PARTIAL)
RIVERSIDE CA
STERNDAHL ENTERPRISES INC STRIPING (PARTIAL)
SUN VALLEY CA
TOOMEY INDUSTRIES TRAFFIC CONTROL (PARTIAL)
LONG BEACH CA
VISUAL POLLUTION TECHNOLOGIES INC ITEM 47 (PARTIAL)
SCOTTSDALE AZ
BIDDER ID NAME AND ADDRESS DESCRIPTION OF PORTION OF WORK SUBCONTRACTED
_________ ____________________________________________________________ ____________________________________________________________
04 CAL STRIPE INC STRIPING (PARTIAL)
SAN BERNARDINO CA
HUBBS CONSTRUCTION ITEMS 26, 27 AND 57 THRU 59 (PARTIAL)
YUCAIPA CA
J F L ELECTRIC INC ELECTRICAL (PARTIAL)
LOS ANGELES CA BID245
08-SBD-15-4 PAGE 7
11/21/08 11/26/08
L I S T O F S U B C O N T R A C T O R S
BIDDER ID NAME AND ADDRESS DESCRIPTION OF PORTION OF WORK SUBCONTRACTED
_________ ____________________________________________________________ ____________________________________________________________
02 C W CONSTRUCTION SPECIALTIES INC ITEMS 26, 27, 58 AND 59
2419 PALMA DRIVE
VENTURA CA 93003
COOPER ENGINEERING INCORORATED ITEMS 60 THRU 65
TUSTIN CA
HIGHLIGHT ELECTRIC ITEMS 2 AND 68 THRU 70 (PARTIAL)
P O BOX 7339
RIVERSIDE CA 92513
P R S I ITEM 28 (PARTIAL)
P O BOX 1266
RIVERSIDE CA 92501
R DUGAN ITEMS 31, 46, 51 AND 56 (PARTIAL)
6157 MARLATT STREET
MIRA LOMA CA 91752
STATEWIDE SAFETY AND SIGNS ITEMS 12, 14, 16, 19 AND 57 (PARTIAL)
POWAY CA
VISUAL POLLUTION TECHNOLOGIES ITEM 47 (PARTIAL)
P O BOX 12833
SCOTTSDALE AZ 85267
CONTINUED ON NEXT PAGE
08-0C8104 BID245
08-SBD-15-4 PAGE 9
11/21/08 11/26/08
L I S T O F S U B C O N T R A C T O R S
BIDDER ID NAME AND ADDRESS DESCRIPTION OF PORTION OF WORK SUBCONTRACTED
_________ ____________________________________________________________ ____________________________________________________________
05 SULLY MILLER ITEMS 40 THRU 45
VICTORVILLE CA
BIDDER ID NAME AND ADDRESS DESCRIPTION OF PORTION OF WORK SUBCONTRACTED
_________ ____________________________________________________________ ____________________________________________________________
06 C AND W CONSTRUCTION SPECIALTIES INC ITEMS 26, 27 AND 57 THRU 59 (PARTIAL)
VENTURA CA
F B D VANGUARD CONSTRUCTION INC ITEMS 60 THRU 65 (PARTIAL)
LIVERMORE CA
VISUAL POLLUTION TECHNOLOGIES INC ITEM 47 (PARTIAL)
SCOTTSDALE AZ
08-0C8104 BID245
08-SBD-15-4 C O N T R A C T P R O P O S A L O F L O W B I D D E R PAGE 10
11/21/08 11/26/08
------------------------------------------------------------------------------------------------------------------------------------
ITEM ITEM UNIT OF ESTIMATED
NO. CODE ITEM DESCRIPTION MEASURE QUANTITY BID AMOUNT
------------------------------------------------------------------------------------------------------------------------------------
1 070012 PROGRESS SCHEDULE (CRITICAL PATH METHOD) LS LUMP SUM 4,000.00 4,000.00
2 070018 TIME-RELATED OVERHEAD WDAY 150 1,000.00 150,000.00
I am trying to build a dataset of the following form (with all bidder IDs as in the text):
bidder-id  number_subcontractors  items
01         5                      26, 27, 58, 59, 15, 66, 67, 60 THRU 65, 68, 69, 47
03         3                      26, 27, 58, 59, 68, 69, 47
I think we can do this by first splitting the text into a separate block for each bidder ID:
text 1:
_________ ____________________________________________________________ ____________________________________________________________
01 C AND W CONSTRUCTION SPECIALTIES INC ITEMS 26, 27, 58 AND 59
2419 PALMA DRIVE
VENTURA CA 93003
CAL STRIPE INC ITEMS 15, 66 AND 67
375 SOUTH G STREET
SAN BERNARDINO CA 92410
INTEGRITY REBAR PLACERS ITEMS 60 THRU 65 (PARTIAL)
23811 WASHINGTON AVE 110 317
MURRIETA CA 92562
J F L ELECTRIC INC ITEMS 68 AND 69
8257 COMPTON
LOS ANGELES CA 90001
MURPHY INDUSTRIAL COATING INC ITEM 47
2704 GUNERLY AVENUE
SIGNAL HILL C 90755
text 2:
03 C W CONSTRUCTION SPECIALTY INC ITEMS 26, 27, 58 AND 59
VENTURA CA
J F L ELECTRIC INC ITEMS 68 AND 69
LOS ANGELES CA
LUNDENE PAINTING ITEM 47
FONTANA CA
Then parse each bidder-specific block into a dataset. The number of subcontractors would be the number of observations for each bidder ID, and we can also concatenate all the item numbers.
I believe I am struggling at step 1 (splitting the big string into the small strings we want). Right now, I have the following code:
# imports
import os
import pandas as pd
import re
import docx2txt
import textract
import antiword
# setting directory
os.chdir('/Users/aartimalik/Documents/GitHub/revenue_procurement/pdfs/small-test')
# import text
txt = "
10 5,476,326.00 6 GRANITE CONSTRUCTION COMPANY 831 724-1011
00000089
P O BOX 50085 FAX 831 768-4021
WATSONVILLE CA 95077-5085
08-0C8104 BID245
08-SBD-15-4 PAGE 3
11/21/08 11/26/08
L I S T O F S U B C O N T R A C T O R S
BIDDER ID NAME AND ADDRESS DESCRIPTION OF PORTION OF WORK SUBCONTRACTED
_________ ____________________________________________________________ ____________________________________________________________
01 C AND W CONSTRUCTION SPECIALTIES INC ITEMS 26, 27, 58 AND 59
2419 PALMA DRIVE
VENTURA CA 93003
CAL STRIPE INC ITEMS 15, 66 AND 67
375 SOUTH G STREET
SAN BERNARDINO CA 92410
INTEGRITY REBAR PLACERS ITEMS 60 THRU 65 (PARTIAL)
23811 WASHINGTON AVE 110 317
MURRIETA CA 92562
J F L ELECTRIC INC ITEMS 68 AND 69
8257 COMPTON
LOS ANGELES CA 90001
MURPHY INDUSTRIAL COATING INC ITEM 47
2704 GUNERLY AVENUE
SIGNAL HILL C 90755
08-0C8104 BID245
08-SBD-15-4 PAGE 4
11/21/08 11/26/08
L I S T O F S U B C O N T R A C T O R S
BIDDER ID NAME AND ADDRESS DESCRIPTION OF PORTION OF WORK SUBCONTRACTED
_________ ____________________________________________________________ ____________________________________________________________
03 C W CONSTRUCTION SPECIALTY INC ITEMS 26, 27, 58 AND 59
VENTURA CA
J F L ELECTRIC INC ITEMS 68 AND 69
LOS ANGELES CA
LUNDENE PAINTING ITEM 47
FONTANA CA
BIDDER ID NAME AND ADDRESS DESCRIPTION OF PORTION OF WORK SUBCONTRACTED
_________ ____________________________________________________________ ____________________________________________________________
10 C AND W CONSTRUCTION SPECIALTY INC ITEMS 26, 27, 58 AND 59 (PARTIAL)
VENTURA CA
FFB VANGUARD CONSTRUCTION ITEMS 60 THRU 65 (PARTIAL)
LIVERMORE CA
J F L ELECTRIC INC ITEMS 68 AND 69 (PARTIAL)
LOS ANGELES CA
PAVEMENT RECYCLING SYSTEM INC ITEM 28 (PARTIAL)
RIVERSIDE CA
VISUAL POLLUTION TECHNOLOGIES INC ITEM 47 (PARTIAL)
SCOTTSDALE AZ
08-0C8104 BID245
08-SBD-15-4 PAGE 5
11/21/08 11/26/08
L I S T O F S U B C O N T R A C T O R S
BIDDER ID NAME AND ADDRESS DESCRIPTION OF PORTION OF WORK SUBCONTRACTED
_________ ____________________________________________________________ ____________________________________________________________
09 INTEGRITY REBAR PLACERS ITEMS 60 THRU 65 (PARTIAL)
MURIETTA CA
J F L ELECTRIC INC ITEMS 68 THRU 69 (PARTIAL)
LOS ANELES CA
MARINA LANDSCAPE INC EROSION CONTROL (PARTIAL)
ANAHEIM CA
PAVEMENT RECYCLING SYSTEMS INC ITEM 28 (PARTIAL)
RIVERSIDE CA
STERNDAHL ENTERPRISES INC STRIPING (PARTIAL)
SUN VALLEY CA
TOOMEY INDUSTRIES TRAFFIC CONTROL (PARTIAL)
LONG BEACH CA
VISUAL POLLUTION TECHNOLOGIES INC ITEM 47 (PARTIAL)
SCOTTSDALE AZ
BIDDER ID NAME AND ADDRESS DESCRIPTION OF PORTION OF WORK SUBCONTRACTED
_________ ____________________________________________________________ ____________________________________________________________
04 CAL STRIPE INC STRIPING (PARTIAL)
SAN BERNARDINO CA
HUBBS CONSTRUCTION ITEMS 26, 27 AND 57 THRU 59 (PARTIAL)
YUCAIPA CA
J F L ELECTRIC INC ELECTRICAL (PARTIAL)
LOS ANGELES CA
CONTINUED ON NEXT PAGE
08-0C8104 BID245
08-SBD-15-4 PAGE 6
11/21/08 11/26/08
L I S T O F S U B C O N T R A C T O R S
BIDDER ID NAME AND ADDRESS DESCRIPTION OF PORTION OF WORK SUBCONTRACTED
_________ ____________________________________________________________ ____________________________________________________________
04 PAVEMENT RECYCLING SYSTEM INC GRINDING (PARTIAL)
RIVERSIDE CA
VANGUARD CONSTRUCTION ITEMS 60 AND 61 (PARTIAL)
OAKLAND CA
VISUAL POLLUTION TECHNOLOGIES PAINTING (PARTIAL)
SCOTTSDALE AZ
BIDDER ID NAME AND ADDRESS DESCRIPTION OF PORTION OF WORK SUBCONTRACTED
_________ ____________________________________________________________ ____________________________________________________________
08 C AND W CONSTRUCTION SPECIALTIES INC ITEMS 57 THRU 59
VENTURA CA
CAL STRIPE INC ITEMS 15, 22, 23, 66 AND 67
SAN BERNARDINO CA
J F L ELECTRIC INC ITEMS 68 AND 69
LOS ANGELES CA
MARINA LANDSCAPE INC ITEM 38
ANAHEIM CA
MATICH CORPORATION ITEMS 40 THRU 43
SAN BERNARDINO CA
VISUAL POLLUTION TECHNOLOGIES ITEM 47
SCOTTSDALE AZ
08-0C8104 BID245
08-SBD-15-4 PAGE 7
11/21/08 11/26/08
L I S T O F S U B C O N T R A C T O R S
BIDDER ID NAME AND ADDRESS DESCRIPTION OF PORTION OF WORK SUBCONTRACTED
_________ ____________________________________________________________ ____________________________________________________________
02 C W CONSTRUCTION SPECIALTIES INC ITEMS 26, 27, 58 AND 59
2419 PALMA DRIVE
VENTURA CA 93003
COOPER ENGINEERING INCORORATED ITEMS 60 THRU 65
TUSTIN CA
HIGHLIGHT ELECTRIC ITEMS 2 AND 68 THRU 70 (PARTIAL)
P O BOX 7339
RIVERSIDE CA 92513
P R S I ITEM 28 (PARTIAL)
P O BOX 1266
RIVERSIDE CA 92501
R DUGAN ITEMS 31, 46, 51 AND 56 (PARTIAL)
6157 MARLATT STREET
MIRA LOMA CA 91752
STATEWIDE SAFETY AND SIGNS ITEMS 12, 14, 16, 19 AND 57 (PARTIAL)
POWAY CA
VISUAL POLLUTION TECHNOLOGIES ITEM 47 (PARTIAL)
P O BOX 12833
SCOTTSDALE AZ 85267
BIDDER ID NAME AND ADDRESS DESCRIPTION OF PORTION OF WORK SUBCONTRACTED
_________ ____________________________________________________________ ____________________________________________________________
07 C AND W FENCE ITEMS 26, 27, 58 AND 59
VENTURA CA
CONTINUED ON NEXT PAGE
08-0C8104 BID245
08-SBD-15-4 PAGE 8
11/21/08 11/26/08
L I S T O F S U B C O N T R A C T O R S
BIDDER ID NAME AND ADDRESS DESCRIPTION OF PORTION OF WORK SUBCONTRACTED
_________ ____________________________________________________________ ____________________________________________________________
07 COOPER ENGINEERING INCORPORATED ITEMS 60 THRU 65
TUSTIN CA
MOORE ELECTRIC ITEMS 68 AND 69
CORONA CA
PRS CONSTRUCTION ITEM 29
RIVERSIDE CA
STERNDAHL ENTERPRISES INC ITEMS 15, 66 AND 67
SUN VALLEY CA
TRAFFIC LOOPS ITEM 69
ANAHEIM CA
BIDDER ID NAME AND ADDRESS DESCRIPTION OF PORTION OF WORK SUBCONTRACTED
_________ ____________________________________________________________ ____________________________________________________________
05 INTEGRITY REBAR PLACERS ITEMS 60 THRU 65
MURRIETA CA
J F L ELECTRIC INC ITEMS 68 AND 69
LOS ANGELES CA
PAVEMENT RECYCLING SYSTEMS INC ITEM 28
RIVERSIDE CA
PERRIS TRAFFIC CONTROL ITEMS 12 AND 13 (PARTIAL)
MURRIETA CA
STERNDAHL ENTERPRISES INC ITEMS 15, 66 AND 67
SUN VALLEY CA
CONTINUED ON NEXT PAGE
08-0C8104 BID245
08-SBD-15-4 PAGE 9
11/21/08 11/26/08
L I S T O F S U B C O N T R A C T O R S
BIDDER ID NAME AND ADDRESS DESCRIPTION OF PORTION OF WORK SUBCONTRACTED
_________ ____________________________________________________________ ____________________________________________________________
05 SULLY MILLER ITEMS 40 THRU 45
VICTORVILLE CA
BIDDER ID NAME AND ADDRESS DESCRIPTION OF PORTION OF WORK SUBCONTRACTED
_________ ____________________________________________________________ ____________________________________________________________
06 C AND W CONSTRUCTION SPECIALTIES INC ITEMS 26, 27 AND 57 THRU 59 (PARTIAL)
VENTURA CA
F B D VANGUARD CONSTRUCTION INC ITEMS 60 THRU 65 (PARTIAL)
LIVERMORE CA
VISUAL POLLUTION TECHNOLOGIES INC ITEM 47 (PARTIAL)
SCOTTSDALE AZ
08-0C8104 BID245
08-SBD-15-4 C O N T R A C T P R O P O S A L O F L O W B I D D E R PAGE 10
11/21/08 11/26/08
------------------------------------------------------------------------------------------------------------------------------------
ITEM ITEM UNIT OF ESTIMATED
NO. CODE ITEM DESCRIPTION MEASURE QUANTITY BID AMOUNT
------------------------------------------------------------------------------------------------------------------------------------
1 070012 PROGRESS SCHEDULE (CRITICAL PATH METHOD) LS LUMP SUM 4,000.00 4,000.00
2 070018 TIME-RELATED OVERHEAD WDAY 150 1,000.00 150,000.00 "
# splitting
txt = txt.split('DESCRIPTION OF PORTION OF WORK SUBCONTRACTED')
del txt[0]
data = txt[0]
Any help or lead would be appreciated! Thank you so much!
Reference regex101

If document.txt contains your document, you can try (regex101):
import re
import pandas as pd
with open("document.txt", "r") as f_in:
    document = f_in.read()
data = []
# each match yields the bidder ID and the block of text that follows it
for id_, group in re.findall(
    r"(?s)BIDDER ID\D+DESCRIPTION OF PORTION OF WORK SUBCONTRACTED\D+(\d+)(.*?)(?=BIDDER ID|-{5,}|\Z)",
    document,
):
    items = re.findall(r"ITEMS? (.*)", group)
    data.append(
        {
            "bidder-id": id_,
            "number_subcontractors": group.count('\n\n'),
            "items": ", ".join(
                i.replace(" (PARTIAL)", "").replace(" AND", ",").strip() for i in items
            ),
        }
    )
df = pd.DataFrame(data)
print(df)
Prints:
bidder-id number_subcontractors items
0 01 5 26, 27, 58, 59, 15, 66, 67, 60 THRU 65, 68, 69, 47
1 03 3 26, 27, 58, 59, 68, 69, 47
2 10 5 26, 27, 58, 59, 60 THRU 65, 68, 69, 28, 47
3 09 7 60 THRU 65, 68 THRU 69, 28, 47
4 04 3 26, 27, 57 THRU 59
5 04 3 60, 61
6 08 6 57 THRU 59, 15, 22, 23, 66, 67, 68, 69, 38, 40 THRU 43, 47
7 02 7 26, 27, 58, 59, 60 THRU 65, 2, 68 THRU 70, 28, 31, 46, 51, 56, 12, 14, 16, 19, 57, 47
8 07 1 26, 27, 58, 59
9 07 5 60 THRU 65, 68, 69, 29, 15, 66, 67, 69
10 05 5 60 THRU 65, 68, 69, 28, 12, 13, 15, 66, 67
11 05 1 40 THRU 45
12 06 3 26, 27, 57 THRU 59, 60 THRU 65, 47
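Bidder IDs 04, 07 and 05 each appear twice in the printed frame, because their subcontractor lists continue on the next page of the document. Below is a minimal sketch of one way to combine such rows, assuming records shaped like the dictionaries appended to data above. merge_by_bidder and the sample rows are hypothetical names introduced here; the values are copied from the printed output.

```python
# Hypothetical sample: records shaped like the dictionaries built in the answer,
# with values copied from the printed output.
rows = [
    {"bidder-id": "03", "number_subcontractors": 3, "items": "26, 27, 58, 59, 68, 69, 47"},
    {"bidder-id": "04", "number_subcontractors": 3, "items": "26, 27, 57 THRU 59"},
    {"bidder-id": "04", "number_subcontractors": 3, "items": "60, 61"},
    {"bidder-id": "07", "number_subcontractors": 1, "items": "26, 27, 58, 59"},
    {"bidder-id": "07", "number_subcontractors": 5, "items": "60 THRU 65, 68, 69, 29, 15, 66, 67, 69"},
]


def merge_by_bidder(records):
    """Combine records that share a bidder ID, keeping first-seen order."""
    merged = {}
    for record in records:
        bidder = record["bidder-id"]
        if bidder not in merged:
            merged[bidder] = {"bidder-id": bidder, "number_subcontractors": 0, "items": []}
        merged[bidder]["number_subcontractors"] += record["number_subcontractors"]
        if record["items"]:
            merged[bidder]["items"].append(record["items"])
    # Join the collected item strings back into one comma-separated string.
    return [{**entry, "items": ", ".join(entry["items"])} for entry in merged.values()]


merged_rows = merge_by_bidder(rows)
for row in merged_rows:
    print(row)
```

The merged list can be passed to pd.DataFrame in place of data if one row per bidder is wanted.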

Related

How do you add or strip a new line based on a condition in Python?

I have the following list:
b = ['B-PER 0 3 Joe', 'B-LOC 13 20 Angola', 'B-ORG 28 35 ABC', 'I-ORG 37 52 Financial', 'I-ORG 54 59 Center', 'B-LOC 72 80 Angola']
I want to write each item in the list to a string and create a new line if the item in the list does not start with "I-". If the item does start with "I-" I want to join that line to the previous line.
The closest I can get is the below:
import re
a = ''
b = ['B-PER 0 3 Joe ', 'B-LOC 13 20 Angola ', 'B-ORG 28 35 ABC ', 'I-ORG 37 52 Financial ', 'I-ORG 54 59 Center ', 'B-LOC 72 80 Angola ']
for item in b:
    if not re.match(r'^I-.*', item):
        a += item + '\n'
    else:
        a += item.strip('\n')
print(a)
OUTPUT Received:
B-PER 0 3 Joe
B-LOC 13 20 Angola
B-ORG 28 35 ABC
I-ORG 37 52 Financial I-ORG 54 59 Center B-LOC 72 80 Angola
OUTPUT Desired:
B-PER 0 3 Joe
B-LOC 13 20 Angola
B-ORG 28 35 ABC I-ORG 37 52 Financial I-ORG 54 59 Center
B-LOC 72 80 Angola
I'm guessing it's a combination of my conditional logic and order. Any solutions would be greatly appreciated!

For stuff like this, it's sometimes easier to create a nested list, add items as appropriate, and then use str.join() afterwards once it's fully-constructed.
import re
a = []
b = ['B-PER 0 3 Joe ', 'B-LOC 13 20 Angola ', 'B-ORG 28 35 ABC ', 'I-ORG 37 52 Financial ', 'I-ORG 54 59 Center ', 'B-LOC 72 80 Angola ']
for item in b:
    if not re.match(r'^I-.*', item):
        a.append([])  # every item that does not start with "I-" opens a new output line
    a[-1].append(item)  # the item joins the most recently opened line
print('\n'.join(''.join(line) for line in a))
# B-PER 0 3 Joe
# B-LOC 13 20 Angola
# B-ORG 28 35 ABC I-ORG 37 52 Financial I-ORG 54 59 Center
# B-LOC 72 80 Angola
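The same grouping can also be sketched without re, since the test is only a fixed prefix. This is a variant under stated assumptions, not the answer's code: join_continuations is a hypothetical helper name, and it strips the trailing spaces and rejoins with a single space, so the input items no longer need to end in a space.

```python
def join_continuations(tags, prefix="I-"):
    """Append every item starting with `prefix` to the line before it."""
    lines = []
    for tag in tags:
        tag = tag.strip()
        if tag.startswith(prefix) and lines:
            # Continuation item: glue it onto the previous output line.
            lines[-1] += " " + tag
        else:
            # Any other item (or a leading continuation) starts a new line.
            lines.append(tag)
    return "\n".join(lines)


b = ['B-PER 0 3 Joe', 'B-LOC 13 20 Angola', 'B-ORG 28 35 ABC',
     'I-ORG 37 52 Financial', 'I-ORG 54 59 Center', 'B-LOC 72 80 Angola']
result = join_continuations(b)
print(result)
```

Unlike the nested-list version, this sketch does not raise an error when the very first item starts with "I-"; it simply begins a line with it.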

Convert a time-range (e.g., 1230-2P) in python to pre-assigned list of bins?

I'm writing a course scheduling algorithm for myself using python.
I was having a lot of trouble working directly with time, so I decided to create "bins" representing each possible half-hour increment in the day. (For example, Monday, 8-830A is 1, while Wednesday 9-930A is 63 and Sunday 6-630P is 195.)
Each room is given a list of all of the bins in the week (which, for my scheduling hours, is 1-203 representing each day, Monday-Sunday, 8A-10P).
Then I just loop through each course to see if its requested "bins" are available in a certain room: If they are, I assign the course to that room (not implemented yet) and remove those bins from the room (so that I won't double book anything).
So far the proof of concept is working fine when I do it manually (e.g., Physics 101 is 1,2,3,62,63,64 instead of MW 8-930A), but ideally I would want to convert the 8-930A request into bins within the program.
I was thinking about using Excel and some vlookups, but I would prefer to do it directly in python if possible so I don't have to do a bunch of manual work on the file each time I want to run this.
This is what I have so far:
from csv import reader
room_input_file = r"C:\Downloads\rooms.csv"
course_input_file = r"C:\Downloads\course.csv"
room_file = reader(open(room_input_file), delimiter='\t')
course_file = reader(open(course_input_file), delimiter='\t')
class Room(object):
    room_instances = [] # list of all Room objects that have been created
    def __init__(self, name, capacity):
        self.times = []
        self.times.extend(range(1,204))
        self.name = name
        self.capacity = int(capacity) # csv values are strings; convert so capacities compare numerically
        Room.room_instances.append(self) # adding this to list of Room objects that have been created
class Course(object):
    course_instances = [] # list of all Course objects that have been created
    def __init__(self, name, capacity):
        self.name = name
        self.times = []
        self.capacity = int(capacity) # same conversion as for rooms
        Course.course_instances.append(self) # adding this to list of Course objects that have been created
for name, cap in room_file:
    x = Room(name,cap)
for name, times, cap in course_file:
    x = Course(name,cap)
    times = times.split(",")
    for i in range(0,len(times)):
        times[i] = int(times[i])
    x.times = times
for course in Course.course_instances:
    print(course.name)
    print(course.times)
    for room in Room.room_instances:
        if set(course.times).issubset(room.times):
            if room.capacity >= course.capacity:
                for x in course.times:
                    room.times.remove(x)
                print(room.name)
                break
The files are:
course.csv # unique course identifier, day/time-bins, course-capacity
00001 1,2,3,62,63,64 71
00002 1,2,3,62,63,64 41
00003 1,2,3,62,63,64 31
rooms.csv # room-name, room-capacity
A110 47
A210 62
A220 62
A230 130
A250 23
A320 31
B100 141
B170 57
B270 57
B300 76
B370 74
B470 74
B570 74
B500 78
Ideally at this stage the course file would include the below instead (obviously there would be a ton of different times, those were just my proof of concept examples to see if it was correctly assigning rooms, then removing list values):
00001 MW 8:00am-9:30AM 71
00002 MW 8:00am-9:30AM 41
00003 MW 8:00am-9:30AM 31
I'm relatively comfortable with Pandas dataframes and numpy (obviously I'm not an expert based on my code above lol), so if those offer solutions that's fine. Doesn't have to be strictly standard library.
Also important to note that the end time should not be inclusive: If a course ends at 930A it DOES NOT remove the 930-10A bin.
Edit: Realized I probably should have included the bins--
Day Times bin
m 8:00am 1
m 8:30am 2
m 9:00am 3
m 9:30am 4
m 10:00am 5
m 10:30am 6
m 11:00am 7
m 11:30am 8
m 12:00pm 9
m 12:30pm 10
m 1:00pm 11
m 1:30pm 12
m 2:00pm 13
m 2:30pm 14
m 3:00pm 15
m 3:30pm 16
m 4:00pm 17
m 4:30pm 18
m 5:00pm 19
m 5:30pm 20
m 6:00pm 21
m 6:30pm 22
m 7:00pm 23
m 7:30pm 24
m 8:00pm 25
m 8:30pm 26
m 9:00pm 27
m 9:30pm 28
m 10:00pm 29
t 8:00am 30
t 8:30am 31
t 9:00am 32
t 9:30am 33
t 10:00am 34
t 10:30am 35
t 11:00am 36
t 11:30am 37
t 12:00pm 38
t 12:30pm 39
t 1:00pm 40
t 1:30pm 41
t 2:00pm 42
t 2:30pm 43
t 3:00pm 44
t 3:30pm 45
t 4:00pm 46
t 4:30pm 47
t 5:00pm 48
t 5:30pm 49
t 6:00pm 50
t 6:30pm 51
t 7:00pm 52
t 7:30pm 53
t 8:00pm 54
t 8:30pm 55
t 9:00pm 56
t 9:30pm 57
t 10:00pm 58
w 8:00am 59
w 8:30am 60
w 9:00am 61
w 9:30am 62
w 10:00am 63
w 10:30am 64
w 11:00am 65
w 11:30am 66
w 12:00pm 67
w 12:30pm 68
w 1:00pm 69
w 1:30pm 70
w 2:00pm 71
w 2:30pm 72
w 3:00pm 73
w 3:30pm 74
w 4:00pm 75
w 4:30pm 76
w 5:00pm 77
w 5:30pm 78
w 6:00pm 79
w 6:30pm 80
w 7:00pm 81
w 7:30pm 82
w 8:00pm 83
w 8:30pm 84
w 9:00pm 85
w 9:30pm 86
w 10:00pm 87
th 8:00am 88
th 8:30am 89
th 9:00am 90
th 9:30am 91
th 10:00am 92
th 10:30am 93
th 11:00am 94
th 11:30am 95
th 12:00pm 96
th 12:30pm 97
th 1:00pm 98
th 1:30pm 99
th 2:00pm 100
th 2:30pm 101
th 3:00pm 102
th 3:30pm 103
th 4:00pm 104
th 4:30pm 105
th 5:00pm 106
th 5:30pm 107
th 6:00pm 108
th 6:30pm 109
th 7:00pm 110
th 7:30pm 111
th 8:00pm 112
th 8:30pm 113
th 9:00pm 114
th 9:30pm 115
th 10:00pm 116
f 8:00am 117
f 8:30am 118
f 9:00am 119
f 9:30am 120
f 10:00am 121
f 10:30am 122
f 11:00am 123
f 11:30am 124
f 12:00pm 125
f 12:30pm 126
f 1:00pm 127
f 1:30pm 128
f 2:00pm 129
f 2:30pm 130
f 3:00pm 131
f 3:30pm 132
f 4:00pm 133
f 4:30pm 134
f 5:00pm 135
f 5:30pm 136
f 6:00pm 137
f 6:30pm 138
f 7:00pm 139
f 7:30pm 140
f 8:00pm 141
f 8:30pm 142
f 9:00pm 143
f 9:30pm 144
f 10:00pm 145
s 8:00am 146
s 8:30am 147
s 9:00am 148
s 9:30am 149
s 10:00am 150
s 10:30am 151
s 11:00am 152
s 11:30am 153
s 12:00pm 154
s 12:30pm 155
s 1:00pm 156
s 1:30pm 157
s 2:00pm 158
s 2:30pm 159
s 3:00pm 160
s 3:30pm 161
s 4:00pm 162
s 4:30pm 163
s 5:00pm 164
s 5:30pm 165
s 6:00pm 166
s 6:30pm 167
s 7:00pm 168
s 7:30pm 169
s 8:00pm 170
s 8:30pm 171
s 9:00pm 172
s 9:30pm 173
s 10:00pm 174
su 8:00am 175
su 8:30am 176
su 9:00am 177
su 9:30am 178
su 10:00am 179
su 10:30am 180
su 11:00am 181
su 11:30am 182
su 12:00pm 183
su 12:30pm 184
su 1:00pm 185
su 1:30pm 186
su 2:00pm 187
su 2:30pm 188
su 3:00pm 189
su 3:30pm 190
su 4:00pm 191
su 4:30pm 192
su 5:00pm 193
su 5:30pm 194
su 6:00pm 195
su 6:30pm 196
su 7:00pm 197
su 7:30pm 198
su 8:00pm 199
su 8:30pm 200
su 9:00pm 201
su 9:30pm 202
su 10:00pm 203
Edit 2:
Some samples of default day/time data that I convert to "bins":
TTh 11:00AM-12:30PM
TTh 12:30PM-2:00PM
MW 4:00PM-5:30PM
TTh 6:00PM-7:30PM
MW 12:30PM-2:00PM
M 12:00PM-2:00PM

This took rather more code than I thought.
import re
tests = """\
TTh 11:00AM-12:30PM
TTh 12:30PM-2:00PM
MW 4:00PM-5:30PM
TTh 6:00PM-7:30PM
MW 12:30PM-2:00PM
M 12:00PM-2:00PM""".splitlines()
pattern = r"([SMTWFhu]*)\s*(\d*):(\d*)([AP]M)-(\d*):(\d*)([AP]M)"
dayorder = ["M","T","W","Th","F","S","Su"]
def convertTime(code):
    m = re.match( pattern, code )
    days, starthh, startmm, startampm, endhh, endmm, endampm = m.groups()
    starthh, startmm, endhh, endmm = (int(k) for k in (starthh, startmm, endhh, endmm))
    # convert to 24-hour time
    if startampm == 'PM' and starthh != 12:
        starthh += 12
    if endampm == 'PM' and endhh != 12:
        endhh += 12
    # half-hour index within the day, counting from 8:00AM
    starthh = (starthh - 8) * 2 + startmm//30
    endhh = (endhh - 8) * 2 + endmm//30
    slots = []
    for day in re.findall("[A-Z][hu]?", days ):
        offset = dayorder.index(day) * 29 + 1
        slots.extend( [offset+k for k in range(starthh,endhh)] )
    return slots
for test in tests:
    print( test, convertTime(test) )
[timr#Tims-Pro:~/src]$ python x.py
TTh 11:00AM-12:30PM [36, 37, 38, 94, 95, 96]
TTh 12:30PM-2:00PM [39, 40, 41, 97, 98, 99]
MW 4:00PM-5:30PM [17, 18, 19, 75, 76, 77]
TTh 6:00PM-7:30PM [50, 51, 52, 108, 109, 110]
MW 12:30PM-2:00PM [10, 11, 12, 68, 69, 70]
M 12:00PM-2:00PM [9, 10, 11, 12]
[timr#Tims-Pro:~/src]$
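The course file in the question writes times as "MW 8:00am-9:30AM", with mixed-case am/pm, which the pattern above would not match. Below is a minimal variant that accepts either case, assuming day codes are written as in the question's samples (M, T, W, Th, F, S, Su) and that every day has the 29 half-hour bins of the question's table. to_bins, half_hour_index and the constant names are hypothetical, introduced only for this sketch.

```python
import re

DAY_ORDER = ["M", "T", "W", "Th", "F", "S", "Su"]
BINS_PER_DAY = 29  # 8:00am through 10:00pm in half-hour steps, as in the bin table
PATTERN = re.compile(
    r"([A-Za-z]+)\s+(\d{1,2}):(\d{2})([AP]M)-(\d{1,2}):(\d{2})([AP]M)",
    re.IGNORECASE,
)


def half_hour_index(hour, minute, ampm):
    """Number of half-hour steps between 8:00am and the given time."""
    if ampm.upper() == "PM" and hour != 12:
        hour += 12
    return (hour - 8) * 2 + minute // 30


def to_bins(code):
    """Convert e.g. 'MW 8:00am-9:30AM' to bin numbers, end time exclusive."""
    match = PATTERN.match(code.strip())
    if match is None:
        raise ValueError(f"unrecognised day/time code: {code!r}")
    days, sh, sm, sap, eh, em, eap = match.groups()
    start = half_hour_index(int(sh), int(sm), sap)
    end = half_hour_index(int(eh), int(em), eap)
    bins = []
    # Two-letter codes are tried first so that "Th" is not read as "T".
    for day in re.findall(r"Th|Su|[MTWFS]", days):
        offset = DAY_ORDER.index(day) * BINS_PER_DAY + 1
        bins.extend(range(offset + start, offset + end))
    return bins


print(to_bins("MW 8:00am-9:30AM"))
```

Because range excludes its upper bound, a course ending at 9:30 does not claim the 9:30-10:00 bin, which is the non-inclusive end the question asks for. The bin numbers follow the table in the question, where Wednesday 8:00am is bin 59.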

Getting the nlargest of each group in a Multiindex Pandas Series

I have a DataFrame that consists of information about every NFL play that has occurred since 2009. My goal is to find out which teams had the most "big plays" in each season. To do this, I found all plays which gained over 20 yards, grouped them by year and team, and got the size of each of those groups.
big_plays = (df[df['yards_gained'] >= 20]
             .groupby([df['game_date'].dt.year, 'posteam'])
             .size())
This results in the following Series:
game_date posteam
2009 ARI 55
ATL 51
BAL 55
BUF 37
CAR 52
CHI 58
CIN 51
CLE 31
DAL 68
DEN 42
DET 42
GB 65
HOU 63
IND 67
JAC 51
KC 44
MIA 34
MIN 64
NE 48
NO 72
NYG 69
NYJ 54
OAK 38
PHI 68
PIT 72
SD 71
SEA 45
SF 51
STL 42
TB 51
..
2018 BAL 44
BUF 55
CAR 64
CHI 66
CIN 69
CLE 70
DAL 51
DEN 59
DET 51
GB 63
HOU 53
IND 57
JAX 51
KC 88
LA 80
LAC 77
MIA 47
MIN 56
NE 64
NO 66
NYG 70
NYJ 49
OAK 63
PHI 54
PIT 66
SEA 62
SF 69
TB 73
TEN 51
WAS 46
Length: 323, dtype: int64
So far, this is exactly what I want. However, I am stuck on the next step. I want the n largest values for each group in the MultiIndex, that is, the n teams with the most "big plays" per season.
I have semi-successfully solved this task in a cumbersome way. If I groupby the 0th level of the MultiIndex, then run the nlargest function on that groupby, I get the following (truncated to the first two years for brevity):
big_plays.groupby(level=0).nlargest(5)
returns
game_date game_date posteam
2009 2009 NO 72
PIT 72
SD 71
NYG 69
DAL 68
2010 2010 PHI 81
NYG 78
PIT 78
SD 75
DEN 73
This (rather inelegantly) solves the problem, but I'm wondering how I can better achieve more or less the same results.

In my opinion your code is fine. The only change I would make is passing group_keys=False to Series.groupby to avoid the duplicated MultiIndex level:
s = big_plays.groupby(level=0, group_keys=False).nlargest(5)
print (s)
game_date posteam
2009 NO 72
PIT 72
SD 71
NYG 69
DAL 68
2018 KC 88
LA 80
LAC 77
TB 73
CLE 70
Name: a, dtype: int64
df = big_plays.groupby(level=0, group_keys=False).nlargest(5).reset_index(name='count')
print (df)
game_date posteam count
0 2009 NO 72
1 2009 PIT 72
2 2009 SD 71
3 2009 NYG 69
4 2009 DAL 68
5 2018 KC 88
6 2018 LA 80
7 2018 LAC 77
8 2018 TB 73
9 2018 CLE 70
An alternative, which is more complicated:
df = (big_plays.reset_index(name='count')
               .sort_values(['game_date','count'], ascending=[True, False])
               .groupby('game_date')
               .head(5))
print (df)
game_date posteam count
19 2009 NO 72
24 2009 PIT 72
25 2009 SD 71
20 2009 NYG 69
8 2009 DAL 68
43 2018 KC 88
44 2018 LA 80
45 2018 LAC 77
57 2018 TB 73
35 2018 CLE 70
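As a cross-check that does not depend on pandas, the same top-n-per-group selection can be sketched with heapq.nlargest from the standard library. The dictionary below is a hand-copied excerpt of the counts shown in the question, and top_n_per_year is a hypothetical helper name used only here.

```python
import heapq
from collections import defaultdict

# Excerpt of the question's counts: (year, team) -> number of big plays.
big_play_counts = {
    (2009, "DAL"): 68, (2009, "NO"): 72, (2009, "NYG"): 69,
    (2009, "PIT"): 72, (2009, "SD"): 71, (2009, "SEA"): 45,
    (2018, "KC"): 88, (2018, "LA"): 80, (2018, "LAC"): 77,
    (2018, "MIA"): 47, (2018, "TB"): 73,
}


def top_n_per_year(counts, n):
    """Return {year: [(team, count), ...]} with the n largest counts per year."""
    by_year = defaultdict(list)
    for (year, team), count in counts.items():
        by_year[year].append((team, count))
    return {
        year: heapq.nlargest(n, teams, key=lambda pair: pair[1])
        for year, teams in sorted(by_year.items())
    }


top3 = top_n_per_year(big_play_counts, 3)
for year, teams in top3.items():
    print(year, teams)
```

Tied counts keep their input order here, so NO is listed before PIT for 2009, matching the pandas output above.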

Removing rows from one DataFrame based on rows from another DataFrame

I have two different dataframes with two different lengths of rows. I want df1 to match df2 but I don't want to create a new dataframe in the process (no merge).
df1
0 Alameda
1 Alpine
2 Amador
3 Butte
4 Calaveras
5 Colusa
6 Contra Costa
7 Del Norte
8 El Dorado
9 Fresno
10 Glenn
11 Humboldt
12 Imperial
13 Inyo
14 Kern
15 Kings
16 Lake
17 Lassen
18 Los Angeles
19 Madera
20 Marin
21 Mariposa
22 Mendocino
23 Merced
24 Modoc
25 Mono
26 Monterey
27 Napa
28 Nevada
29 Orange
30 Placer
31 Plumas
32 Riverside
33 Sacramento
34 San Benito
35 San Bernardino
36 San Diego
37 San Francisco
38 San Joaquin
39 San Luis Obispo
40 San Mateo
41 Santa Barbara
42 Santa Clara
43 Santa Cruz
44 Shasta
45 Sierra
46 Siskiyou
47 Solano
48 Sonoma
49 Stanislaus
50 Sutter
51 Tehama
52 Trinity
53 Tulare
54 Tuolumne
55 Ventura
56 Yolo
57 Yuba
df2
0 Alameda
1 Amador
2 Butte
3 Calaveras
4 Colusa
5 Contra Costa
6 Del Norte
7 El Dorado
8 Fresno
9 Glenn
10 Humboldt
11 Imperial
12 Inyo
13 Kern
14 Kings
15 Lake
16 Lassen
17 Los Angeles
18 Madera
19 Marin
20 Mariposa
21 Mendocino
22 Merced
23 Mono
24 Monterey
25 Napa
26 Nevada
27 Orange
28 Placer
29 Plumas
30 Riverside
31 Sacramento
32 San Benito
33 San Bernardino
34 San Diego
35 San Francisco
36 San Joaquin
37 San Luis Obispo
38 San Mateo
39 Santa Barbara
40 Santa Clara
41 Santa Cruz
42 Shasta
43 Siskiyou
44 Solano
45 Sonoma
46 Stanislaus
47 Sutter
48 Tehama
49 Tulare
50 Ventura
51 Yolo
52 Yuba
Is there a way to modify a column's rows in a dataframe using a column's rows from a different dataframe? Again I want to keep the dataframes separate, but the goal is to get the dataframes to have the same number of rows containing the same values.

Since you just want common rows, you can compute them quickly using np.intersect1d:
i = df1.values.squeeze()
j = df2.values.squeeze()
df1 = pd.DataFrame(np.intersect1d(i, j))
And have df2 just become a copy of df1:
df2 = df1.copy(deep=True)

Using duplicated:
s = pd.concat([df1, df2], keys=[1, 2])
df1, df2 = s[s.duplicated(keep=False)].loc[1], s[s.duplicated(keep=False)].loc[2]
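Since the question asks to avoid building a new dataframe, another option is to drop the non-matching rows in place with isin and drop. This is a minimal sketch under stated assumptions: the column name "county" and the helper keep_common_rows are hypothetical (the frames in the question show a single unnamed column), and only the first few county names are used as sample data.

```python
import pandas as pd

# Small sample taken from the start of the question's two frames.
df1 = pd.DataFrame({"county": ["Alameda", "Alpine", "Amador", "Butte"]})
df2 = pd.DataFrame({"county": ["Alameda", "Amador", "Butte"]})


def keep_common_rows(left, right, column):
    """Drop, in place, rows of each frame whose value is absent from the other."""
    left_only = left.index[~left[column].isin(right[column])]
    right_only = right.index[~right[column].isin(left[column])]
    left.drop(left_only, inplace=True)
    right.drop(right_only, inplace=True)
    # Renumber the rows so both frames end up with the same 0..n-1 index.
    left.reset_index(drop=True, inplace=True)
    right.reset_index(drop=True, inplace=True)


keep_common_rows(df1, df2, "county")
print(df1)
print(df2)
```

Both frames keep their identity and their own row order; only the rows without a counterpart are removed.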

Combine certain rows values of duplicate rows Pandas

I have a dataframe based on football players. I am finding duplicate rows for when a player has transferred mid-season. My aim is to add the points they accumulated in both leagues together to make just one row.
Here is a sample of the data:
name full_name club Points Start Sub
84 S. Mustafi Shkodran Mustafi Arsenal 76 26 1
85 S. Mustafi Shkodran Mustafi Arsenal -2 0 1
89 Bruno Bruno Soriano Llido Villarreal CF 43 15 16
90 Bruno Bruno Gonzalez Cabrera Getafe CF 43 15 16
119 Oscar Oscar dos Santos Emboaba NaN 16 5 8
120 Oscar Oscar dos Santos Emboaba NaN 1 0 2
121 Oscar Oscar Rodriguez Arnaiz Real Madrid CF 16 5 8
122 Oscar Oscar Rodriguez Arnaiz Real Madrid CF 1 0 2
188 C. Bravo Claudio Bravo Manchester City 61 22 8
189 C. Bravo Claudio Bravo Manchester City 1 1 0
193 Naldo Ronaldo Aparecido Rodrigues FC Schalke 04 58 19 1
194 Naldo Edinaldo Gomes Pereira RCD Espanyol 58 19 1
200 G. Castro Gonzalo Castro Borussia Dortmund 79 23 6
201 G. Castro Gonzalo Castro Malaga CF 79 23 6
209 Juanfran Juan Francisco Torres Belen Atletico Madrid 86 21 8
210 Juanfran Juan Francisco Torres Belen Atletico Madrid 74 34 2
211 Juanfran Juan Francisco Moreno Fuertes RC Coruna 86 21 8
212 Juanfran Juan Francisco Moreno Fuertes RC Coruna 74 34 2
In my goal dataframe, a player such as Mustafi would have his Points, Start and Sub values added together to give just one row.
Players like Bruno are clearly not the same person, so I don't want to add the two Brunos together.
name full_name club Points Start Sub
84 S. Mustafi Shkodran Mustafi Arsenal 74 26 2
89 Bruno Bruno Soriano Llido Villarreal CF 43 15 16
90 Bruno Bruno Gonzalez Cabrera Getafe CF 43 15 16
119 Oscar Oscar dos Santos Emboaba NaN 17 5 10
121 Oscar Oscar Rodriguez Arnaiz Real Madrid CF 17 5 10
188 C. Bravo Claudio Bravo Manchester City 62 23 8
193 Naldo Ronaldo Aparecido Rodrigues FC Schalke 04 58 19 1
194 Naldo Edinaldo Gomes Pereira RCD Espanyol 58 19 1
200 G. Castro Gonzalo Castro Borussia Dortmund 158 46 12
209 Juanfran Juan Francisco Torres Belen Atletico Madrid 86 21 8
212 Juanfran Juan Francisco Moreno Fuertes RC Coruna 74 34 2
Any help would be great!

You need:
df[['name','full_name','club']] = df[['name','full_name','club']].fillna('')
d = {'Points':'sum', 'Start':'sum', 'Sub':'sum', 'club':'first'}
df = (df.groupby(['name','full_name'], sort=False, as_index=False)
        .agg(d)
        .reindex(columns=df.columns))
with pd.option_context('display.expand_frame_repr', False):
    print (df)
name full_name club Points Start Sub
0 S. Mustafi Shkodran Mustafi Arsenal 74 26 2
1 Bruno Bruno Soriano Llido Villarreal CF 43 15 16
2 Bruno Bruno Gonzalez Cabrera Getafe CF 43 15 16
3 Oscar Oscar dos Santos Emboaba 17 5 10
4 Oscar Oscar Rodriguez Arnaiz Real Madrid CF 17 5 10
5 C. Bravo Claudio Bravo Manchester City 62 23 8
6 Naldo Ronaldo Aparecido Rodrigues FC Schalke 04 58 19 1
7 Naldo Edinaldo Gomes Pereira RCD Espanyol 58 19 1
8 G. Castro Gonzalo Castro Borussia Dortmund 158 46 12
9 Juanfran Juan Francisco Torres Belen Atletico Madrid 160 55 10
10 Juanfran Juan Francisco Moreno Fuertes RC Coruna 160 55 10
Explanation:
First, replace NaNs with '' using fillna, so that groupby does not omit the rows containing them.
Then aggregate with groupby and agg, passing a dictionary that maps each column to its aggregating function.
Finally, the with block temporarily stops pandas from wrapping the wide frame when it is printed.
