Capturing data regex - python
I am trying to create variables using text imported from .doc files. For the given text:
10 5,476,326.00 6 GRANITE CONSTRUCTION COMPANY 831 724-1011
00000089
P O BOX 50085 FAX 831 768-4021
WATSONVILLE CA 95077-5085
08-0C8104 BID245
08-SBD-15-4 PAGE 3
11/21/08 11/26/08
L I S T O F S U B C O N T R A C T O R S
BIDDER ID NAME AND ADDRESS DESCRIPTION OF PORTION OF WORK SUBCONTRACTED
_________ ____________________________________________________________ ____________________________________________________________
01 C AND W CONSTRUCTION SPECIALTIES INC ITEMS 26, 27, 58 AND 59
2419 PALMA DRIVE
VENTURA CA 93003
CAL STRIPE INC ITEMS 15, 66 AND 67
375 SOUTH G STREET
SAN BERNARDINO CA 92410
INTEGRITY REBAR PLACERS ITEMS 60 THRU 65 (PARTIAL)
23811 WASHINGTON AVE 110 317
MURRIETA CA 92562
J F L ELECTRIC INC ITEMS 68 AND 69
8257 COMPTON
LOS ANGELES CA 90001
MURPHY INDUSTRIAL COATING INC ITEM 47
2704 GUNERLY AVENUE
SIGNAL HILL C 90755
08-0C8104 BID245
08-SBD-15-4 PAGE 4
11/21/08 11/26/08
L I S T O F S U B C O N T R A C T O R S
BIDDER ID NAME AND ADDRESS DESCRIPTION OF PORTION OF WORK SUBCONTRACTED
_________ ____________________________________________________________ ____________________________________________________________
03 C W CONSTRUCTION SPECIALTY INC ITEMS 26, 27, 58 AND 59
VENTURA CA
J F L ELECTRIC INC ITEMS 68 AND 69
LOS ANGELES CA
LUNDENE PAINTING ITEM 47
FONTANA CA
BIDDER ID NAME AND ADDRESS DESCRIPTION OF PORTION OF WORK SUBCONTRACTED
_________ ____________________________________________________________ ____________________________________________________________
10 C AND W CONSTRUCTION SPECIALTY INC ITEMS 26, 27, 58 AND 59 (PARTIAL)
VENTURA CA
FFB VANGUARD CONSTRUCTION ITEMS 60 THRU 65 (PARTIAL)
LIVERMORE CA
J F L ELECTRIC INC ITEMS 68 AND 69 (PARTIAL)
LOS ANGELES CA
PAVEMENT RECYCLING SYSTEM INC ITEM 28 (PARTIAL)
RIVERSIDE CA
VISUAL POLLUTION TECHNOLOGIES INC ITEM 47 (PARTIAL)
SCOTTSDALE AZ
08-0C8104 BID245
08-SBD-15-4 PAGE 5
11/21/08 11/26/08
L I S T O F S U B C O N T R A C T O R S
BIDDER ID NAME AND ADDRESS DESCRIPTION OF PORTION OF WORK SUBCONTRACTED
_________ ____________________________________________________________ ____________________________________________________________
09 INTEGRITY REBAR PLACERS ITEMS 60 THRU 65 (PARTIAL)
MURIETTA CA
J F L ELECTRIC INC ITEMS 68 THRU 69 (PARTIAL)
LOS ANELES CA
MARINA LANDSCAPE INC EROSION CONTROL (PARTIAL)
ANAHEIM CA
PAVEMENT RECYCLING SYSTEMS INC ITEM 28 (PARTIAL)
RIVERSIDE CA
STERNDAHL ENTERPRISES INC STRIPING (PARTIAL)
SUN VALLEY CA
TOOMEY INDUSTRIES TRAFFIC CONTROL (PARTIAL)
LONG BEACH CA
VISUAL POLLUTION TECHNOLOGIES INC ITEM 47 (PARTIAL)
SCOTTSDALE AZ
BIDDER ID NAME AND ADDRESS DESCRIPTION OF PORTION OF WORK SUBCONTRACTED
_________ ____________________________________________________________ ____________________________________________________________
04 CAL STRIPE INC STRIPING (PARTIAL)
SAN BERNARDINO CA
HUBBS CONSTRUCTION ITEMS 26, 27 AND 57 THRU 59 (PARTIAL)
YUCAIPA CA
J F L ELECTRIC INC ELECTRICAL (PARTIAL)
LOS ANGELES CA BID245
08-SBD-15-4 PAGE 7
11/21/08 11/26/08
L I S T O F S U B C O N T R A C T O R S
BIDDER ID NAME AND ADDRESS DESCRIPTION OF PORTION OF WORK SUBCONTRACTED
_________ ____________________________________________________________ ____________________________________________________________
02 C W CONSTRUCTION SPECIALTIES INC ITEMS 26, 27, 58 AND 59
2419 PALMA DRIVE
VENTURA CA 93003
COOPER ENGINEERING INCORORATED ITEMS 60 THRU 65
TUSTIN CA
HIGHLIGHT ELECTRIC ITEMS 2 AND 68 THRU 70 (PARTIAL)
P O BOX 7339
RIVERSIDE CA 92513
P R S I ITEM 28 (PARTIAL)
P O BOX 1266
RIVERSIDE CA 92501
R DUGAN ITEMS 31, 46, 51 AND 56 (PARTIAL)
6157 MARLATT STREET
MIRA LOMA CA 91752
STATEWIDE SAFETY AND SIGNS ITEMS 12, 14, 16, 19 AND 57 (PARTIAL)
POWAY CA
VISUAL POLLUTION TECHNOLOGIES ITEM 47 (PARTIAL)
P O BOX 12833
SCOTTSDALE AZ 85267
CONTINUED ON NEXT PAGE
08-0C8104 BID245
08-SBD-15-4 PAGE 9
11/21/08 11/26/08
L I S T O F S U B C O N T R A C T O R S
BIDDER ID NAME AND ADDRESS DESCRIPTION OF PORTION OF WORK SUBCONTRACTED
_________ ____________________________________________________________ ____________________________________________________________
05 SULLY MILLER ITEMS 40 THRU 45
VICTORVILLE CA
BIDDER ID NAME AND ADDRESS DESCRIPTION OF PORTION OF WORK SUBCONTRACTED
_________ ____________________________________________________________ ____________________________________________________________
06 C AND W CONSTRUCTION SPECIALTIES INC ITEMS 26, 27 AND 57 THRU 59 (PARTIAL)
VENTURA CA
F B D VANGUARD CONSTRUCTION INC ITEMS 60 THRU 65 (PARTIAL)
LIVERMORE CA
VISUAL POLLUTION TECHNOLOGIES INC ITEM 47 (PARTIAL)
SCOTTSDALE AZ
08-0C8104 BID245
08-SBD-15-4 C O N T R A C T P R O P O S A L O F L O W B I D D E R PAGE 10
11/21/08 11/26/08
------------------------------------------------------------------------------------------------------------------------------------
ITEM ITEM UNIT OF ESTIMATED
NO. CODE ITEM DESCRIPTION MEASURE QUANTITY BID AMOUNT
------------------------------------------------------------------------------------------------------------------------------------
1 070012 PROGRESS SCHEDULE (CRITICAL PATH METHOD) LS LUMP SUM 4,000.00 4,000.00
2 070018 TIME-RELATED OVERHEAD WDAY 150 1,000.00 150,000.00
I am trying to build a dataset of the following form (with all bidder IDs as in the text):
bidder-id
number_subcontractors
items
01
5
26, 27, 58, 59, 15, 66, 67, 60 THRU 65, 68, 69, 47
03
3
26, 27, 58, 59, 68, 69, 47
I think we can do this in two steps.
First, split the text into one block per bidder ID:
text 1:
_________ ____________________________________________________________ ____________________________________________________________
01 C AND W CONSTRUCTION SPECIALTIES INC ITEMS 26, 27, 58 AND 59
2419 PALMA DRIVE
VENTURA CA 93003
CAL STRIPE INC ITEMS 15, 66 AND 67
375 SOUTH G STREET
SAN BERNARDINO CA 92410
INTEGRITY REBAR PLACERS ITEMS 60 THRU 65 (PARTIAL)
23811 WASHINGTON AVE 110 317
MURRIETA CA 92562
J F L ELECTRIC INC ITEMS 68 AND 69
8257 COMPTON
LOS ANGELES CA 90001
MURPHY INDUSTRIAL COATING INC ITEM 47
2704 GUNERLY AVENUE
SIGNAL HILL C 90755
text 2:
03 C W CONSTRUCTION SPECIALTY INC ITEMS 26, 27, 58 AND 59
VENTURA CA
J F L ELECTRIC INC ITEMS 68 AND 69
LOS ANGELES CA
LUNDENE PAINTING ITEM 47
FONTANA CA
Then capture each bidder-specific block as rows of a dataset. The number of subcontractors would be the number of observations for each bidder ID, and we can also concatenate all the item numbers.
I am stuck at step 1 (splitting the big string into the smaller strings we want). Right now, I have the following code:
# imports
import os
import pandas as pd
import re
import docx2txt
import textract
import antiword
# setting directory
os.chdir('/Users/aartimalik/Documents/GitHub/revenue_procurement/pdfs/small-test')
# import text
txt = "
10 5,476,326.00 6 GRANITE CONSTRUCTION COMPANY 831 724-1011
00000089
P O BOX 50085 FAX 831 768-4021
WATSONVILLE CA 95077-5085
08-0C8104 BID245
08-SBD-15-4 PAGE 3
11/21/08 11/26/08
L I S T O F S U B C O N T R A C T O R S
BIDDER ID NAME AND ADDRESS DESCRIPTION OF PORTION OF WORK SUBCONTRACTED
_________ ____________________________________________________________ ____________________________________________________________
01 C AND W CONSTRUCTION SPECIALTIES INC ITEMS 26, 27, 58 AND 59
2419 PALMA DRIVE
VENTURA CA 93003
CAL STRIPE INC ITEMS 15, 66 AND 67
375 SOUTH G STREET
SAN BERNARDINO CA 92410
INTEGRITY REBAR PLACERS ITEMS 60 THRU 65 (PARTIAL)
23811 WASHINGTON AVE 110 317
MURRIETA CA 92562
J F L ELECTRIC INC ITEMS 68 AND 69
8257 COMPTON
LOS ANGELES CA 90001
MURPHY INDUSTRIAL COATING INC ITEM 47
2704 GUNERLY AVENUE
SIGNAL HILL C 90755
08-0C8104 BID245
08-SBD-15-4 PAGE 4
11/21/08 11/26/08
L I S T O F S U B C O N T R A C T O R S
BIDDER ID NAME AND ADDRESS DESCRIPTION OF PORTION OF WORK SUBCONTRACTED
_________ ____________________________________________________________ ____________________________________________________________
03 C W CONSTRUCTION SPECIALTY INC ITEMS 26, 27, 58 AND 59
VENTURA CA
J F L ELECTRIC INC ITEMS 68 AND 69
LOS ANGELES CA
LUNDENE PAINTING ITEM 47
FONTANA CA
BIDDER ID NAME AND ADDRESS DESCRIPTION OF PORTION OF WORK SUBCONTRACTED
_________ ____________________________________________________________ ____________________________________________________________
10 C AND W CONSTRUCTION SPECIALTY INC ITEMS 26, 27, 58 AND 59 (PARTIAL)
VENTURA CA
FFB VANGUARD CONSTRUCTION ITEMS 60 THRU 65 (PARTIAL)
LIVERMORE CA
J F L ELECTRIC INC ITEMS 68 AND 69 (PARTIAL)
LOS ANGELES CA
PAVEMENT RECYCLING SYSTEM INC ITEM 28 (PARTIAL)
RIVERSIDE CA
VISUAL POLLUTION TECHNOLOGIES INC ITEM 47 (PARTIAL)
SCOTTSDALE AZ
08-0C8104 BID245
08-SBD-15-4 PAGE 5
11/21/08 11/26/08
L I S T O F S U B C O N T R A C T O R S
BIDDER ID NAME AND ADDRESS DESCRIPTION OF PORTION OF WORK SUBCONTRACTED
_________ ____________________________________________________________ ____________________________________________________________
09 INTEGRITY REBAR PLACERS ITEMS 60 THRU 65 (PARTIAL)
MURIETTA CA
J F L ELECTRIC INC ITEMS 68 THRU 69 (PARTIAL)
LOS ANELES CA
MARINA LANDSCAPE INC EROSION CONTROL (PARTIAL)
ANAHEIM CA
PAVEMENT RECYCLING SYSTEMS INC ITEM 28 (PARTIAL)
RIVERSIDE CA
STERNDAHL ENTERPRISES INC STRIPING (PARTIAL)
SUN VALLEY CA
TOOMEY INDUSTRIES TRAFFIC CONTROL (PARTIAL)
LONG BEACH CA
VISUAL POLLUTION TECHNOLOGIES INC ITEM 47 (PARTIAL)
SCOTTSDALE AZ
BIDDER ID NAME AND ADDRESS DESCRIPTION OF PORTION OF WORK SUBCONTRACTED
_________ ____________________________________________________________ ____________________________________________________________
04 CAL STRIPE INC STRIPING (PARTIAL)
SAN BERNARDINO CA
HUBBS CONSTRUCTION ITEMS 26, 27 AND 57 THRU 59 (PARTIAL)
YUCAIPA CA
J F L ELECTRIC INC ELECTRICAL (PARTIAL)
LOS ANGELES CA
CONTINUED ON NEXT PAGE
08-0C8104 BID245
08-SBD-15-4 PAGE 6
11/21/08 11/26/08
L I S T O F S U B C O N T R A C T O R S
BIDDER ID NAME AND ADDRESS DESCRIPTION OF PORTION OF WORK SUBCONTRACTED
_________ ____________________________________________________________ ____________________________________________________________
04 PAVEMENT RECYCLING SYSTEM INC GRINDING (PARTIAL)
RIVERSIDE CA
VANGUARD CONSTRUCTION ITEMS 60 AND 61 (PARTIAL)
OAKLAND CA
VISUAL POLLUTION TECHNOLOGIES PAINTING (PARTIAL)
SCOTTSDALE AZ
BIDDER ID NAME AND ADDRESS DESCRIPTION OF PORTION OF WORK SUBCONTRACTED
_________ ____________________________________________________________ ____________________________________________________________
08 C AND W CONSTRUCTION SPECIALTIES INC ITEMS 57 THRU 59
VENTURA CA
CAL STRIPE INC ITEMS 15, 22, 23, 66 AND 67
SAN BERNARDINO CA
J F L ELECTRIC INC ITEMS 68 AND 69
LOS ANGELES CA
MARINA LANDSCAPE INC ITEM 38
ANAHEIM CA
MATICH CORPORATION ITEMS 40 THRU 43
SAN BERNARDINO CA
VISUAL POLLUTION TECHNOLOGIES ITEM 47
SCOTTSDALE AZ
08-0C8104 BID245
08-SBD-15-4 PAGE 7
11/21/08 11/26/08
L I S T O F S U B C O N T R A C T O R S
BIDDER ID NAME AND ADDRESS DESCRIPTION OF PORTION OF WORK SUBCONTRACTED
_________ ____________________________________________________________ ____________________________________________________________
02 C W CONSTRUCTION SPECIALTIES INC ITEMS 26, 27, 58 AND 59
2419 PALMA DRIVE
VENTURA CA 93003
COOPER ENGINEERING INCORORATED ITEMS 60 THRU 65
TUSTIN CA
HIGHLIGHT ELECTRIC ITEMS 2 AND 68 THRU 70 (PARTIAL)
P O BOX 7339
RIVERSIDE CA 92513
P R S I ITEM 28 (PARTIAL)
P O BOX 1266
RIVERSIDE CA 92501
R DUGAN ITEMS 31, 46, 51 AND 56 (PARTIAL)
6157 MARLATT STREET
MIRA LOMA CA 91752
STATEWIDE SAFETY AND SIGNS ITEMS 12, 14, 16, 19 AND 57 (PARTIAL)
POWAY CA
VISUAL POLLUTION TECHNOLOGIES ITEM 47 (PARTIAL)
P O BOX 12833
SCOTTSDALE AZ 85267
BIDDER ID NAME AND ADDRESS DESCRIPTION OF PORTION OF WORK SUBCONTRACTED
_________ ____________________________________________________________ ____________________________________________________________
07 C AND W FENCE ITEMS 26, 27, 58 AND 59
VENTURA CA
CONTINUED ON NEXT PAGE
08-0C8104 BID245
08-SBD-15-4 PAGE 8
11/21/08 11/26/08
L I S T O F S U B C O N T R A C T O R S
BIDDER ID NAME AND ADDRESS DESCRIPTION OF PORTION OF WORK SUBCONTRACTED
_________ ____________________________________________________________ ____________________________________________________________
07 COOPER ENGINEERING INCORPORATED ITEMS 60 THRU 65
TUSTIN CA
MOORE ELECTRIC ITEMS 68 AND 69
CORONA CA
PRS CONSTRUCTION ITEM 29
RIVERSIDE CA
STERNDAHL ENTERPRISES INC ITEMS 15, 66 AND 67
SUN VALLEY CA
TRAFFIC LOOPS ITEM 69
ANAHEIM CA
BIDDER ID NAME AND ADDRESS DESCRIPTION OF PORTION OF WORK SUBCONTRACTED
_________ ____________________________________________________________ ____________________________________________________________
05 INTEGRITY REBAR PLACERS ITEMS 60 THRU 65
MURRIETA CA
J F L ELECTRIC INC ITEMS 68 AND 69
LOS ANGELES CA
PAVEMENT RECYCLING SYSTEMS INC ITEM 28
RIVERSIDE CA
PERRIS TRAFFIC CONTROL ITEMS 12 AND 13 (PARTIAL)
MURRIETA CA
STERNDAHL ENTERPRISES INC ITEMS 15, 66 AND 67
SUN VALLEY CA
CONTINUED ON NEXT PAGE
08-0C8104 BID245
08-SBD-15-4 PAGE 9
11/21/08 11/26/08
L I S T O F S U B C O N T R A C T O R S
BIDDER ID NAME AND ADDRESS DESCRIPTION OF PORTION OF WORK SUBCONTRACTED
_________ ____________________________________________________________ ____________________________________________________________
05 SULLY MILLER ITEMS 40 THRU 45
VICTORVILLE CA
BIDDER ID NAME AND ADDRESS DESCRIPTION OF PORTION OF WORK SUBCONTRACTED
_________ ____________________________________________________________ ____________________________________________________________
06 C AND W CONSTRUCTION SPECIALTIES INC ITEMS 26, 27 AND 57 THRU 59 (PARTIAL)
VENTURA CA
F B D VANGUARD CONSTRUCTION INC ITEMS 60 THRU 65 (PARTIAL)
LIVERMORE CA
VISUAL POLLUTION TECHNOLOGIES INC ITEM 47 (PARTIAL)
SCOTTSDALE AZ
08-0C8104 BID245
08-SBD-15-4 C O N T R A C T P R O P O S A L O F L O W B I D D E R PAGE 10
11/21/08 11/26/08
------------------------------------------------------------------------------------------------------------------------------------
ITEM ITEM UNIT OF ESTIMATED
NO. CODE ITEM DESCRIPTION MEASURE QUANTITY BID AMOUNT
------------------------------------------------------------------------------------------------------------------------------------
1 070012 PROGRESS SCHEDULE (CRITICAL PATH METHOD) LS LUMP SUM 4,000.00 4,000.00
2 070018 TIME-RELATED OVERHEAD WDAY 150 1,000.00 150,000.00 "
# splitting
txt = txt.split('DESCRIPTION OF PORTION OF WORK SUBCONTRACTED')
del txt[0]
data = txt[0]
Any help or lead would be appreciated! Thank you so much!
Reference regex101
If document.txt contains your document, you can try (regex101):
import re
import pandas as pd

with open("document.txt", "r") as f_in:
    document = f_in.read()

data = []
for id_, group in re.findall(
    r"(?s)BIDDER ID\D+DESCRIPTION OF PORTION OF WORK SUBCONTRACTED\D+(\d+)(.*?)(?=BIDDER ID|-{5,}|\Z)",
    document,
):
    items = re.findall(r"ITEMS? (.*)", group)
    data.append(
        {
            "bidder-id": id_,
            "number_subcontractors": group.count('\n\n'),
            "items": ", ".join(
                i.replace(" (PARTIAL)", "").replace(" AND", ",").strip() for i in items
            ),
        }
    )

df = pd.DataFrame(data)
print(df)
Prints:
bidder-id number_subcontractors items
0 01 5 26, 27, 58, 59, 15, 66, 67, 60 THRU 65, 68, 69, 47
1 03 3 26, 27, 58, 59, 68, 69, 47
2 10 5 26, 27, 58, 59, 60 THRU 65, 68, 69, 28, 47
3 09 7 60 THRU 65, 68 THRU 69, 28, 47
4 04 3 26, 27, 57 THRU 59
5 04 3 60, 61
6 08 6 57 THRU 59, 15, 22, 23, 66, 67, 68, 69, 38, 40 THRU 43, 47
7 02 7 26, 27, 58, 59, 60 THRU 65, 2, 68 THRU 70, 28, 31, 46, 51, 56, 12, 14, 16, 19, 57, 47
8 07 1 26, 27, 58, 59
9 07 5 60 THRU 65, 68, 69, 29, 15, 66, 67, 69
10 05 5 60 THRU 65, 68, 69, 28, 12, 13, 15, 66, 67
11 05 1 40 THRU 45
12 06 3 26, 27, 57 THRU 59, 60 THRU 65, 47
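Note that bidders whose list continues onto the next page (04, 07 and 05 here) come out as two rows each. If you want a single row per bidder ID, here is a minimal sketch that merges such rows before building the DataFrame. The helper name merge_by_bidder is made up for illustration, and the sample rows are copied from the printed output above:

```python
def merge_by_bidder(rows):
    """Combine rows that share a bidder-id, keeping first-seen order."""
    merged = {}
    for row in rows:
        entry = merged.setdefault(
            row["bidder-id"],
            {"bidder-id": row["bidder-id"], "number_subcontractors": 0, "items": []},
        )
        # Add up the counts and collect the item strings of every partial row.
        entry["number_subcontractors"] += row["number_subcontractors"]
        entry["items"].append(row["items"])
    return [{**e, "items": ", ".join(e["items"])} for e in merged.values()]


# Sample rows in the same shape as the dicts appended to `data`.
data = [
    {"bidder-id": "04", "number_subcontractors": 3, "items": "26, 27, 57 THRU 59"},
    {"bidder-id": "04", "number_subcontractors": 3, "items": "60, 61"},
    {"bidder-id": "06", "number_subcontractors": 3, "items": "26, 27, 57 THRU 59, 60 THRU 65, 47"},
]
merged_rows = merge_by_bidder(data)
for row in merged_rows:
    print(row)
```

Passing the merged list to pd.DataFrame instead of data should then give one row per bidder.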
Related
How do you add or strip a new line based on a condition in Python?
I have the following list:
b = ['B-PER 0 3 Joe', 'B-LOC 13 20 Angola', 'B-ORG 28 35 ABC', 'I-ORG 37 52 Financial', 'I-ORG 54 59 Center', 'B-LOC 72 80 Angola']
I want to write each item in the list to a string and start a new line if the item does not start with "I-". If the item does start with "I-", I want to join it to the previous line. The closest I can get is the below:
a = ''
b = ['B-PER 0 3 Joe ', 'B-LOC 13 20 Angola ', 'B-ORG 28 35 ABC ', 'I-ORG 37 52 Financial ', 'I-ORG 54 59 Center ', 'B-LOC 72 80 Angola ']
for item in b:
    if not re.match(r'^I-.*', item):
        a += item + '\n'
    else:
        a += item.strip('\n')
print(a)
OUTPUT Received:
B-PER 0 3 Joe
B-LOC 13 20 Angola
B-ORG 28 35 ABC
I-ORG 37 52 Financial I-ORG 54 59 Center B-LOC 72 80 Angola
OUTPUT Desired:
B-PER 0 3 Joe
B-LOC 13 20 Angola
B-ORG 28 35 ABC I-ORG 37 52 Financial I-ORG 54 59 Center
B-LOC 72 80 Angola
I'm guessing the problem is a combination of my conditional logic and the order of operations. Any solutions would be greatly appreciated!
For stuff like this, it's sometimes easier to build a nested list, adding items as appropriate, and then use str.join() once it's fully constructed.
import re

a = []
b = ['B-PER 0 3 Joe ', 'B-LOC 13 20 Angola ', 'B-ORG 28 35 ABC ', 'I-ORG 37 52 Financial ', 'I-ORG 54 59 Center ', 'B-LOC 72 80 Angola ']
for item in b:
    # Anything that is not an "I-" item starts a new output line.
    if not re.match(r'^I-.*', item):
        a.append([])
    a[-1].append(item)
print('\n'.join(''.join(line) for line in a))
# B-PER 0 3 Joe
# B-LOC 13 20 Angola
# B-ORG 28 35 ABC I-ORG 37 52 Financial I-ORG 54 59 Center
# B-LOC 72 80 Angola
Convert a time-range (e.g., 1230-2P) in python to pre-assigned list of bins?
I'm writing a course scheduling algorithm for myself using python. I was having a lot of trouble working directly with time, so I decided to create "bins" representing each possible half-hour increment in the day. (For example, Monday, 8-830A is 1, while Wednesday 9-930A is 63 and Sunday 6-630P is 195.) Each room is given a list of all of the bins in the week (which, for my scheduling hours, is 1-203 representing each day, Monday-Sunday, 8A-10P). Then I just loop through each course to see if its requested "bins" are available in a certain room: If they are, I assign the course to that room (not implemented yet) and remove those bins from the room (so that I won't double book anything). So far the proof of concept is working fine when I do it manually (e.g., Physics 101 is 1,2,3,62,63,64 instead of MW 8-930A), but ideally I would want to convert the 8-930A request into bins within the program. I was thinking about using Excel and some vlookups, but I would prefer to do it directly in python if possible so I don't have to do a bunch of manual work on the file each time I want to run this. 
This is what I have so far:
from csv import reader

room_input_file = r"C:\Downloads\rooms.csv"
course_input_file = r"C:\Downloads\course.csv"
room_file = reader(open(room_input_file), delimiter='\t')
course_file = reader(open(course_input_file), delimiter='\t')

class Room(object):
    room_instances = []  # list of all Room objects that have been created
    def __init__(self, name, capacity):
        self.times = []
        self.times.extend(range(1,204))
        self.name = name
        self.capacity = capacity
        Room.room_instances.append(self)  # adding this to list of Room objects that have been created

class Course(object):
    course_instances = []  # list of all Course objects that have been created
    def __init__(self, name, capacity):
        self.name = name
        self.times = []
        self.capacity = capacity
        Course.course_instances.append(self)  # adding this to list of Course objects that have been created

for name, cap in room_file:
    x = Room(name,cap)

for name, times, cap in course_file:
    x = Course(name,cap)
    times = times.split(",")
    for i in range(0,len(times)):
        times[i] = int(times[i])
    x.times = times

for course in Course.course_instances:
    print(course.name)
    print(course.times)
    for room in Room.room_instances:
        if set(course.times).issubset(room.times):
            if room.capacity >= course.capacity:
                for x in course.times:
                    room.times.remove(x)
                print(room.name)
                break
The files are:
course.csv
# unique course identifier, day/time-bins, course-capacity
00001 1,2,3,62,63,64 71
00002 1,2,3,62,63,64 41
00003 1,2,3,62,63,64 31
rooms.csv
# room-name, room-capacity
A110 47
A210 62
A220 62
A230 130
A250 23
A320 31
B100 141
B170 57
B270 57
B300 76
B370 74
B470 74
B570 74
B500 78
Ideally at this stage the course file would include the below instead (obviously there would be a ton of different times, those were just my proof of concept examples to see if it was correctly assigning rooms, then removing list values):
00001 MW 8:00am-9:30AM 71
00002 MW 8:00am-9:30AM 41
00003 MW 8:00am-9:30AM 31
I'm relatively comfortable with Pandas dataframes and
numpy (obviously I'm not an expert based on my code above lol), so if those offer solutions that's fine. Doesn't have to be strictly standard library. Also important to note that the end time should not be inclusive: If a course ends at 930A it DOES NOT remove the 930-10A bin. Edit: Realized I probably should have included the bins-- Day Times bin m 8:00am 1 m 8:30am 2 m 9:00am 3 m 9:30am 4 m 10:00am 5 m 10:30am 6 m 11:00am 7 m 11:30am 8 m 12:00pm 9 m 12:30pm 10 m 1:00pm 11 m 1:30pm 12 m 2:00pm 13 m 2:30pm 14 m 3:00pm 15 m 3:30pm 16 m 4:00pm 17 m 4:30pm 18 m 5:00pm 19 m 5:30pm 20 m 6:00pm 21 m 6:30pm 22 m 7:00pm 23 m 7:30pm 24 m 8:00pm 25 m 8:30pm 26 m 9:00pm 27 m 9:30pm 28 m 10:00pm 29 t 8:00am 30 t 8:30am 31 t 9:00am 32 t 9:30am 33 t 10:00am 34 t 10:30am 35 t 11:00am 36 t 11:30am 37 t 12:00pm 38 t 12:30pm 39 t 1:00pm 40 t 1:30pm 41 t 2:00pm 42 t 2:30pm 43 t 3:00pm 44 t 3:30pm 45 t 4:00pm 46 t 4:30pm 47 t 5:00pm 48 t 5:30pm 49 t 6:00pm 50 t 6:30pm 51 t 7:00pm 52 t 7:30pm 53 t 8:00pm 54 t 8:30pm 55 t 9:00pm 56 t 9:30pm 57 t 10:00pm 58 w 8:00am 59 w 8:30am 60 w 9:00am 61 w 9:30am 62 w 10:00am 63 w 10:30am 64 w 11:00am 65 w 11:30am 66 w 12:00pm 67 w 12:30pm 68 w 1:00pm 69 w 1:30pm 70 w 2:00pm 71 w 2:30pm 72 w 3:00pm 73 w 3:30pm 74 w 4:00pm 75 w 4:30pm 76 w 5:00pm 77 w 5:30pm 78 w 6:00pm 79 w 6:30pm 80 w 7:00pm 81 w 7:30pm 82 w 8:00pm 83 w 8:30pm 84 w 9:00pm 85 w 9:30pm 86 w 10:00pm 87 th 8:00am 88 th 8:30am 89 th 9:00am 90 th 9:30am 91 th 10:00am 92 th 10:30am 93 th 11:00am 94 th 11:30am 95 th 12:00pm 96 th 12:30pm 97 th 1:00pm 98 th 1:30pm 99 th 2:00pm 100 th 2:30pm 101 th 3:00pm 102 th 3:30pm 103 th 4:00pm 104 th 4:30pm 105 th 5:00pm 106 th 5:30pm 107 th 6:00pm 108 th 6:30pm 109 th 7:00pm 110 th 7:30pm 111 th 8:00pm 112 th 8:30pm 113 th 9:00pm 114 th 9:30pm 115 th 10:00pm 116 f 8:00am 117 f 8:30am 118 f 9:00am 119 f 9:30am 120 f 10:00am 121 f 10:30am 122 f 11:00am 123 f 11:30am 124 f 12:00pm 125 f 12:30pm 126 f 1:00pm 127 f 1:30pm 128 f 2:00pm 129 f 2:30pm 130 f 
3:00pm 131 f 3:30pm 132 f 4:00pm 133 f 4:30pm 134 f 5:00pm 135 f 5:30pm 136 f 6:00pm 137 f 6:30pm 138 f 7:00pm 139 f 7:30pm 140 f 8:00pm 141 f 8:30pm 142 f 9:00pm 143 f 9:30pm 144 f 10:00pm 145 s 8:00am 146 s 8:30am 147 s 9:00am 148 s 9:30am 149 s 10:00am 150 s 10:30am 151 s 11:00am 152 s 11:30am 153 s 12:00pm 154 s 12:30pm 155 s 1:00pm 156 s 1:30pm 157 s 2:00pm 158 s 2:30pm 159 s 3:00pm 160 s 3:30pm 161 s 4:00pm 162 s 4:30pm 163 s 5:00pm 164 s 5:30pm 165 s 6:00pm 166 s 6:30pm 167 s 7:00pm 168 s 7:30pm 169 s 8:00pm 170 s 8:30pm 171 s 9:00pm 172 s 9:30pm 173 s 10:00pm 174 su 8:00am 175 su 8:30am 176 su 9:00am 177 su 9:30am 178 su 10:00am 179 su 10:30am 180 su 11:00am 181 su 11:30am 182 su 12:00pm 183 su 12:30pm 184 su 1:00pm 185 su 1:30pm 186 su 2:00pm 187 su 2:30pm 188 su 3:00pm 189 su 3:30pm 190 su 4:00pm 191 su 4:30pm 192 su 5:00pm 193 su 5:30pm 194 su 6:00pm 195 su 6:30pm 196 su 7:00pm 197 su 7:30pm 198 su 8:00pm 199 su 8:30pm 200 su 9:00pm 201 su 9:30pm 202 su 10:00pm 203 Edit 2: Some samples of default day/time data that I convert to "bins": TTh 11:00AM-12:30PM TTh 12:30PM-2:00PM MW 4:00PM-5:30PM TTh 6:00PM-7:30PM MW 12:30PM-2:00PM M 12:00PM-2:00PM
This took rather more code than I thought.
import re

tests = """\
TTh 11:00AM-12:30PM
TTh 12:30PM-2:00PM
MW 4:00PM-5:30PM
TTh 6:00PM-7:30PM
MW 12:30PM-2:00PM
M 12:00PM-2:00PM""".splitlines()

pattern = r"([SMTWFhu]*)\s*(\d*):(\d*)([AP]M)-(\d*):(\d*)([AP]M)"
dayorder = ["M","T","W","Th","F","S","Su"]

def convertTime(code):
    m = re.match( pattern, code )
    days, starthh, startmm, startampm, endhh, endmm, endampm = m.groups()
    starthh, startmm, endhh, endmm = (int(k) for k in (starthh, startmm, endhh, endmm))
    if startampm == 'PM' and starthh != 12:
        starthh += 12
    if endampm == 'PM' and endhh != 12:
        endhh += 12
    starthh = (starthh - 8) * 2 + startmm//30
    endhh = (endhh - 8) * 2 + endmm//30
    slots = []
    for day in re.findall("[A-Z][hu]?", days ):
        offset = dayorder.index(day) * 29 + 1
        slots.extend( [offset+k for k in range(starthh,endhh)] )
    return slots

for test in tests:
    print( test, convertTime(test) )

[timr#Tims-Pro:~/src]$ python x.py
TTh 11:00AM-12:30PM [36, 37, 38, 94, 95, 96]
TTh 12:30PM-2:00PM [39, 40, 41, 97, 98, 99]
MW 4:00PM-5:30PM [17, 18, 19, 75, 76, 77]
TTh 6:00PM-7:30PM [50, 51, 52, 108, 109, 110]
MW 12:30PM-2:00PM [10, 11, 12, 68, 69, 70]
M 12:00PM-2:00PM [9, 10, 11, 12]
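One caveat: the question's sample course file writes times in mixed case, such as 8:00am-9:30AM, which the [AP]M groups in the pattern above would not match. Here is a minimal sketch of a variant that accepts either case and raises a clear error on unparseable input. The names convert_time and to_half_hour_index are made up for illustration, and the sketch assumes the same 8AM to 10PM grid with 29 bins per day:

```python
import re

DAY_ORDER = ["M", "T", "W", "Th", "F", "S", "Su"]
# Same structure as the pattern above, but AM/PM may be upper or lower case.
PATTERN = re.compile(
    r"([SMTWFhu]+)\s*(\d+):(\d+)([AaPp][Mm])-(\d+):(\d+)([AaPp][Mm])"
)


def to_half_hour_index(hh, mm, ampm):
    """Half-hour steps since 8:00AM (8:00AM -> 0, 8:30AM -> 1, ...)."""
    if ampm.upper() == "PM" and hh != 12:
        hh += 12
    return (hh - 8) * 2 + mm // 30


def convert_time(code):
    m = PATTERN.match(code)
    if m is None:
        raise ValueError(f"cannot parse day/time string: {code!r}")
    days, sh, sm, sap, eh, em, eap = m.groups()
    start = to_half_hour_index(int(sh), int(sm), sap)
    end = to_half_hour_index(int(eh), int(em), eap)
    slots = []
    for day in re.findall(r"[A-Z][hu]?", days):
        offset = DAY_ORDER.index(day) * 29 + 1
        # range() excludes `end`, so the bin starting at the end time stays free.
        slots.extend(offset + k for k in range(start, end))
    return slots


print(convert_time("MW 8:00am-9:30AM"))  # [1, 2, 3, 59, 60, 61]
```

The result for MW 8:00am-9:30AM follows the bin table in the question (Wednesday 8:00am is bin 59), not the 62,63,64 used in the hand-written proof of concept.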
Getting the nlargest of each group in a Multiindex Pandas Series
I have a DataFrame that consists of information about every NFL play that has occurred since 2009. My goal is to find out which teams had the most "big plays" in each season. To do this, I found all plays which gained over 20 yards, grouped them by year and team, and got the size of each of those groups.
big_plays = (df[df['yards_gained'] >= 20]
             .groupby([df['game_date'].dt.year, 'posteam'])
             .size())
This results in the following Series:
game_date  posteam
2009       ARI        55
           ATL        51
           BAL        55
           BUF        37
           CAR        52
           CHI        58
           CIN        51
           CLE        31
           DAL        68
           DEN        42
           DET        42
           GB         65
           HOU        63
           IND        67
           JAC        51
           KC         44
           MIA        34
           MIN        64
           NE         48
           NO         72
           NYG        69
           NYJ        54
           OAK        38
           PHI        68
           PIT        72
           SD         71
           SEA        45
           SF         51
           STL        42
           TB         51
                      ..
2018       BAL        44
           BUF        55
           CAR        64
           CHI        66
           CIN        69
           CLE        70
           DAL        51
           DEN        59
           DET        51
           GB         63
           HOU        53
           IND        57
           JAX        51
           KC         88
           LA         80
           LAC        77
           MIA        47
           MIN        56
           NE         64
           NO         66
           NYG        70
           NYJ        49
           OAK        63
           PHI        54
           PIT        66
           SEA        62
           SF         69
           TB         73
           TEN        51
           WAS        46
Length: 323, dtype: int64
So far, this is exactly what I want. However, I am stuck on the next step. I want the n largest values for each group in the MultiIndex, that is, the n teams with the most "big plays" per season.
I have semi-successfully solved this task in a cumbersome way. If I group by the 0th level of the MultiIndex and then run nlargest on that groupby, I get the following (truncated to the first two years for brevity):
big_plays.groupby(level=0).nlargest(5)
returns
game_date  game_date  posteam
2009       2009       NO         72
                      PIT        72
                      SD         71
                      NYG        69
                      DAL        68
2010       2010       PHI        81
                      NYG        78
                      PIT        78
                      SD         75
                      DEN        73
This (rather inelegantly) solves the problem, but I'm wondering how I can better achieve more or less the same results.
In my opinion your code is fine. I would only add group_keys=False to Series.groupby to avoid the duplicated MultiIndex levels:
s = big_plays.groupby(level=0, group_keys=False).nlargest(5)
print (s)
game_date  posteam
2009       NO         72
           PIT        72
           SD         71
           NYG        69
           DAL        68
2018       KC         88
           LA         80
           LAC        77
           TB         73
           CLE        70
Name: a, dtype: int64

df = big_plays.groupby(level=0, group_keys=False).nlargest(5).reset_index(name='count')
print (df)
   game_date posteam  count
0       2009      NO     72
1       2009     PIT     72
2       2009      SD     71
3       2009     NYG     69
4       2009     DAL     68
5       2018      KC     88
6       2018      LA     80
7       2018     LAC     77
8       2018      TB     73
9       2018     CLE     70
An alternative is more complicated:
df = (big_plays.reset_index(name='count')
               .sort_values(['game_date','count'], ascending=[True, False])
               .groupby('game_date')
               .head(5))
print (df)
    game_date posteam  count
19       2009      NO     72
24       2009     PIT     72
25       2009      SD     71
20       2009     NYG     69
8        2009     DAL     68
43       2018      KC     88
44       2018      LA     80
45       2018     LAC     77
57       2018      TB     73
35       2018     CLE     70
Removing rows from one DataFrame based on rows from another DataFrame
I have two different dataframes with different numbers of rows. I want df1 to match df2, but I don't want to create a new dataframe in the process (no merge).
df1
0 Alameda 1 Alpine 2 Amador 3 Butte 4 Calaveras 5 Colusa 6 Contra Costa 7 Del Norte 8 El Dorado 9 Fresno 10 Glenn 11 Humboldt 12 Imperial 13 Inyo 14 Kern 15 Kings 16 Lake 17 Lassen 18 Los Angeles 19 Madera 20 Marin 21 Mariposa 22 Mendocino 23 Merced 24 Modoc 25 Mono 26 Monterey 27 Napa 28 Nevada 29 Orange 30 Placer 31 Plumas 32 Riverside 33 Sacramento 34 San Benito 35 San Bernardino 36 San Diego 37 San Francisco 38 San Joaquin 39 San Luis Obispo 40 San Mateo 41 Santa Barbara 42 Santa Clara 43 Santa Cruz 44 Shasta 45 Sierra 46 Siskiyou 47 Solano 48 Sonoma 49 Stanislaus 50 Sutter 51 Tehama 52 Trinity 53 Tulare 54 Tuolumne 55 Ventura 56 Yolo 57 Yuba
df2
0 Alameda 1 Amador 2 Butte 3 Calaveras 4 Colusa 5 Contra Costa 6 Del Norte 7 El Dorado 8 Fresno 9 Glenn 10 Humboldt 11 Imperial 12 Inyo 13 Kern 14 Kings 15 Lake 16 Lassen 17 Los Angeles 18 Madera 19 Marin 20 Mariposa 21 Mendocino 22 Merced 23 Mono 24 Monterey 25 Napa 26 Nevada 27 Orange 28 Placer 29 Plumas 30 Riverside 31 Sacramento 32 San Benito 33 San Bernardino 34 San Diego 35 San Francisco 36 San Joaquin 37 San Luis Obispo 38 San Mateo 39 Santa Barbara 40 Santa Clara 41 Santa Cruz 42 Shasta 43 Siskiyou 44 Solano 45 Sonoma 46 Stanislaus 47 Sutter 48 Tehama 49 Tulare 50 Ventura 51 Yolo 52 Yuba
Is there a way to modify a column's rows in a dataframe using a column's rows from a different dataframe? Again, I want to keep the dataframes separate, but the goal is to get the dataframes to have the same number of rows containing the same values.
Since you just want the common rows, you can compute them quickly using np.intersect1d:
import numpy as np
import pandas as pd

i = df1.values.squeeze()
j = df2.values.squeeze()
df1 = pd.DataFrame(np.intersect1d(i, j))
And have df2 just become a copy of df1:
df2 = df1.copy(deep=True)
Using duplicated:
s=pd.concat([df1,df2],keys=[1,2])
df1,df2=s[s.duplicated(keep=False)].loc[1],s[s.duplicated(keep=False)].loc[1]
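Another option that avoids a merge is to filter each frame by membership in the other. This is only a sketch under assumptions: it uses tiny stand-in frames, and it assumes each frame has a single column labelled 0, since the question does not show the real column name:

```python
import pandas as pd

# Small stand-ins for the two county frames (column label 0 is assumed).
df1 = pd.DataFrame({0: ["Alameda", "Alpine", "Amador", "Butte"]})
df2 = pd.DataFrame({0: ["Alameda", "Amador", "Butte"]})

# Values present in both frames.
common = set(df1[0]) & set(df2[0])

# Keep only the shared rows in each frame and renumber the index.
df1 = df1[df1[0].isin(common)].reset_index(drop=True)
df2 = df2[df2[0].isin(common)].reset_index(drop=True)

print(df1)
print(df2)
```

Unlike np.intersect1d, which returns sorted unique values, this keeps each frame's original row order and any repeated values.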
Combine certain rows values of duplicate rows Pandas
I have a dataframe of football players. It contains duplicate rows for players who transferred mid-season. My aim is to add together the points they accumulated in both leagues so that each player has just one row. Here is a sample of the data:
     name       full_name                      club               Points  Start  Sub
84   S. Mustafi Shkodran Mustafi               Arsenal                76     26    1
85   S. Mustafi Shkodran Mustafi               Arsenal                -2      0    1
89   Bruno      Bruno Soriano Llido            Villarreal CF          43     15   16
90   Bruno      Bruno Gonzalez Cabrera         Getafe CF              43     15   16
119  Oscar      Oscar dos Santos Emboaba       NaN                    16      5    8
120  Oscar      Oscar dos Santos Emboaba       NaN                     1      0    2
121  Oscar      Oscar Rodriguez Arnaiz         Real Madrid CF         16      5    8
122  Oscar      Oscar Rodriguez Arnaiz         Real Madrid CF          1      0    2
188  C. Bravo   Claudio Bravo                  Manchester City        61     22    8
189  C. Bravo   Claudio Bravo                  Manchester City         1      1    0
193  Naldo      Ronaldo Aparecido Rodrigues    FC Schalke 04          58     19    1
194  Naldo      Edinaldo Gomes Pereira         RCD Espanyol           58     19    1
200  G. Castro  Gonzalo Castro                 Borussia Dortmund      79     23    6
201  G. Castro  Gonzalo Castro                 Malaga CF              79     23    6
209  Juanfran   Juan Francisco Torres Belen    Atletico Madrid        86     21    8
210  Juanfran   Juan Francisco Torres Belen    Atletico Madrid        74     34    2
211  Juanfran   Juan Francisco Moreno Fuertes  RC Coruna              86     21    8
212  Juanfran   Juan Francisco Moreno Fuertes  RC Coruna              74     34    2
In my goal dataframe, a player such as Mustafi would have his Points, Start and Sub values added together to give just one row. Players like Bruno are clearly not the same person, so I don't want to add the two Brunos together.
     name       full_name                      club               Points  Start  Sub
84   S. Mustafi Shkodran Mustafi               Arsenal                74     26    2
89   Bruno      Bruno Soriano Llido            Villarreal CF          43     15   16
90   Bruno      Bruno Gonzalez Cabrera         Getafe CF              43     15   16
119  Oscar      Oscar dos Santos Emboaba       NaN                    17      5   10
121  Oscar      Oscar Rodriguez Arnaiz         Real Madrid CF         17      5   10
188  C. Bravo   Claudio Bravo                  Manchester City        62     23    8
193  Naldo      Ronaldo Aparecido Rodrigues    FC Schalke 04          58     19    1
194  Naldo      Edinaldo Gomes Pereira         RCD Espanyol           58     19    1
200  G. Castro  Gonzalo Castro                 Borussia Dortmund     158     46   12
209  Juanfran   Juan Francisco Torres Belen    Atletico Madrid        86     21    8
212  Juanfran   Juan Francisco Moreno Fuertes  RC Coruna              74     34    2
Any help would be great!
You need:
df[['name','full_name','club']] = df[['name','full_name','club']].fillna('')
d = {'Points':'sum', 'Start':'sum', 'Sub':'sum', 'club':'first'}
df = (df.groupby(['name','full_name'], sort=False, as_index=False)
        .agg(d)
        .reindex(columns=df.columns))

with pd.option_context('display.expand_frame_repr', False):
    print (df)

    name        full_name                      club               Points  Start  Sub
0   S. Mustafi  Shkodran Mustafi               Arsenal                74     26    2
1   Bruno       Bruno Soriano Llido            Villarreal CF          43     15   16
2   Bruno       Bruno Gonzalez Cabrera         Getafe CF              43     15   16
3   Oscar       Oscar dos Santos Emboaba                              17      5   10
4   Oscar       Oscar Rodriguez Arnaiz         Real Madrid CF         17      5   10
5   C. Bravo    Claudio Bravo                  Manchester City        62     23    8
6   Naldo       Ronaldo Aparecido Rodrigues    FC Schalke 04          58     19    1
7   Naldo       Edinaldo Gomes Pereira         RCD Espanyol           58     19    1
8   G. Castro   Gonzalo Castro                 Borussia Dortmund     158     46   12
9   Juanfran    Juan Francisco Torres Belen    Atletico Madrid       160     55   10
10  Juanfran    Juan Francisco Moreno Fuertes  RC Coruna             160     55   10
Explanation:
First, replace NaNs with '' using fillna so that groupby does not omit the rows that contain them.
Then aggregate with groupby and agg, passing a dictionary that maps each column to its aggregating function.
Finally, to display each row on a single line, temporarily use with pd.option_context.