I need to write a program that does the following:
First, find the County that has the highest turnout, i.e. the highest percentage of the
population who voted, using the objects’ population and voters attributes
Then, return a tuple containing the name of the County with the highest turnout and the
percentage of the population who voted, in that order; the percentage should be
represented as a number between 0 and 1.
I took a crack at it, but am getting the following error:
Error on line 19:
allegheny = County("allegheny", 1000490, 645469)
TypeError: object() takes no parameters
Here is what I've done so far. Thank you so much for your help.
class County:
    def __innit__(self, innit_name, innit_population, innit_voters) :
        self.name = innit_name
        self.population = innit_population
        self.voters = innit_voters

def highest_turnout(data) :
    highest_turnout = data[0]
    for County in data:
        if (county.voters / county.population) > (highest_turnout.voters / highest_turnout.population):
            highest_turnout = county
    return highest_turnout
# your program will be evaluated using these objects
# it is okay to change/remove these lines but your program
# will be evaluated using these as inputs
allegheny = County("allegheny", 1000490, 645469)
philadelphia = County("philadelphia", 1134081, 539069)
montgomery = County("montgomery", 568952, 399591)
lancaster = County("lancaster", 345367, 230278)
delaware = County("delaware", 414031, 284538)
chester = County("chester", 319919, 230823)
bucks = County("bucks", 444149, 319816)
data = [allegheny, philadelphia, montgomery, lancaster, delaware, chester, bucks]
result = highest_turnout(data) # do not change this line!
print(result) # prints the output of the function
# do not remove this line!
def __innit__(self, innit_name, innit_population, innit_voters) :
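You misspelled __init__ as __innit__. Python therefore never finds a constructor on County and falls back to object's, which takes no arguments, hence the TypeError.
Two more things to fix once that works: the loop variable is written County but the loop body uses county, and the function has to return a (name, turnout) tuple rather than the County object.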
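Putting those fixes together, a minimal sketch of what the corrected program could look like. The parameter names init_name, init_population and init_voters and the local variable best are my own choices, not something the assignment prescribes:

```python
class County:
    def __init__(self, init_name, init_population, init_voters):
        self.name = init_name
        self.population = init_population
        self.voters = init_voters


def highest_turnout(data):
    # Start with the first county and keep whichever has the larger ratio.
    best = data[0]
    for county in data:
        if county.voters / county.population > best.voters / best.population:
            best = county
    # Return (name, turnout), with turnout as a fraction between 0 and 1.
    return (best.name, best.voters / best.population)


allegheny = County("allegheny", 1000490, 645469)
philadelphia = County("philadelphia", 1134081, 539069)
montgomery = County("montgomery", 568952, 399591)
lancaster = County("lancaster", 345367, 230278)
delaware = County("delaware", 414031, 284538)
chester = County("chester", 319919, 230823)
bucks = County("bucks", 444149, 319816)

data = [allegheny, philadelphia, montgomery, lancaster, delaware, chester, bucks]
result = highest_turnout(data)
print(result)
```

With the inputs from the question, chester has the highest ratio (about 0.72), narrowly ahead of bucks.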
Excel table: this screenshot of the Excel file shows how the final result should look. Please take a closer look at the "Lifestyle" section.
I can't figure out how to make my Python output look like the Excel screenshot. The "Lifestyle" section needs two more sub-columns grouped under it, as in the picture. Any help would be appreciated.
PyCharm screenshot:
Here is my code:
#convert inches to feet-inches
def inch_to_feet(x):
    feet = x // 12
    inch = x % 12
    return str(feet)+"'"+str(inch)+'"'
#file opened
print("Hello")
roster = input("Please enter a roster file: ")
if roster != "roster_extended.csv":
    print("Invalid name")
elif roster == "roster_extended.csv":
    additional_name = input("There are 13 lines in this file. Would you like to enter an additional names? (Y/N): ")
    if additional_name == "Y":
        input("How many more names?: ")
infile = open("roster_extended.csv", 'r')
b = infile.readline()
b = infile.readlines()
header = '{0:>12} {1:>35} {2:>3} {3:>16} {4:>5} {5:>3} {6:>9}'.format("FirstName","LastName","Age","Occupation","Ht","Wt","lifestyle")
print(header)
with open("roster_extended.csv", "a+") as infile:
b = infile.write(input("Enter first name: "))
for person in b:
    newperson = person.replace("\n", "").split(",")
    newperson[4] = eval(newperson[4])
    newperson[4] = inch_to_feet(newperson[4])
    newperson
    formatted='{0:>12} {1:>35} {2:>3} {3:>16} {4:>5} {5:>3} {6:>9}'.format(newperson[0],newperson[1],newperson[2],newperson[3],newperson[4],newperson[5],newperson[6])
    print(formatted)
Here is the output I get:
FirstName LastName Age Occupation Ht Wt lifestyle
Anna Barbara 35 nurse 5'3" 129
Catherine Do 45 physicist 5'5" 135
Eric Frederick 28 teacher 5'5" 140
Gabriel Hernandez 55 surgeon 5'7" 150 x
Ivy Joo 31 engineer 5'2" 126 x
Kelly Marks 21 student 5'4" 132
Nancy Owens 60 immunologist 5'8" 170 x
Patricia Qin 36 dental assistant 4'11" 110 x
Roderick Stevenson 51 bus driver 5'6" 160 x
Tracy Umfreville 42 audiologist 5'7" 156 x
Victoria Wolfeschlegelsteinhausenbergerdorff 38 data analyst 5'8" 158
Lucy Xi 49 professor 5'9" 161
Yolanda Zachary 58 secretary 5'10" 164 x
Brief explanation of the solution:
Your input is tabulated data (there are several ways to tabulate it). Since you're starting out with Python, the solution stays within the standard library and does not resort to external libraries. Only format() and class variables are used to keep track of column widths (if you delete elements, you'll want to update those variables). This automates the tabulation programmatically.
Since you are starting out, I recommend putting a breakpoint in __init__() and __new__() to observe their behavior.
I used Enum because conceptually it's the right tool for the job. You only need to understand Enum.name and Enum.value; for everything else, treat it as a normal class.
There are 2 output files, one in tabulated form and the other in bare-bones CSV.
(For the most part the solution is "canonical", or close to it. The procedural part was rushed, but gives a sufficient idea.)
import csv
import codecs
from enum import Enum
from pathlib import Path
IN_FILE = Path("C:\\your_path\\input.csv")
OUT_FILE = Path("C:\\your_path\\output1.csv")
OUT_FILE_TABULATE = Path("C:\\your_path\\output2.csv")
def read_csv(file) -> list:
    with open(file) as csv_file:
        reader_csv = csv.reader(csv_file, delimiter=',')
        for row in reader_csv:
            yield row


def write_file(file, result_ordered):
    with codecs.open(file, "w+", encoding="utf-8") as file_out:
        for s in result_ordered:
            file_out.write(s + '\n')
class LifeStyle(Enum):
    Sedentary = 1
    Active = 2
    Moderate = 3

    def to_list(self):
        # One cell per lifestyle: 'x' for this member, empty otherwise.
        list_life_style = list()
        for one_style in LifeStyle:
            if one_style is self:
                list_life_style.append('x')
            else:
                list_life_style.append('')
        return list_life_style

    def tabulate(self):
        # Same as to_list(), but each cell is padded to the width of its column name.
        str_list_life_style = list()
        for one_style in LifeStyle:
            if one_style is not self:
                str_list_life_style.append('{: ^{width}}'.format(' ', width=len(one_style.name)))
            else:
                str_list_life_style.append('{: ^{width}}'.format('x', width=len(self.name)))
        return str_list_life_style

    def tabulate_single_column(self):
        return '{: >{width}}'.format(str(self.name), width=len(LifeStyle.Sedentary.name))

    @staticmethod
    def header_single_column():
        return ' {}'.format(LifeStyle.__name__)

    @staticmethod
    def header():
        return ' {} {} {}'.format(
            LifeStyle.Sedentary.name,
            LifeStyle.Active.name,
            LifeStyle.Moderate.name,
        )
class Person:
    _FIRST_NAME = "First Name"
    _LAST_NAME = "Last Name"
    _AGE = "Age"
    _OCCUPATION = "Occupation"
    _HEIGHT = "Height"
    _WEIGHT = "Weight"

    # Class-level column widths, widened as longer values are seen.
    max_len_first_name = len(_FIRST_NAME)
    max_len_last_name = len(_LAST_NAME)
    max_len_occupation = len(_OCCUPATION)

    def __new__(cls, first_name, last_name, age, occupation, height, weight, lifestyle):
        cls.max_len_first_name = max(cls.max_len_first_name, len(first_name))
        cls.max_len_last_name = max(cls.max_len_last_name, len(last_name))
        cls.max_len_occupation = max(cls.max_len_occupation, len(occupation))
        return super().__new__(cls)

    def __init__(self, first_name, last_name, age, occupation, height, weight, lifestyle):
        self.first_name = first_name
        self.last_name = last_name
        self.age = age
        self.occupation = occupation
        self.height = height
        self.weight = weight
        self.lifestyle = lifestyle

    @classmethod
    def _tabulate_(cls, first_name, last_name, age, occupation, height, weight):
        first_part = '{: >{m_first}} {: >{m_last}} {: >{m_age}} {: <{m_occup}} {: <{m_height}} {: >{m_weight}}'.format(
            first_name,
            last_name,
            age,
            occupation,
            height,
            weight,
            m_first=Person.max_len_first_name,
            m_last=Person.max_len_last_name,
            m_occup=Person.max_len_occupation,
            m_age=len(Person._AGE),
            m_height=len(Person._HEIGHT),
            m_weight=len(Person._WEIGHT))
        return first_part

    @classmethod
    def header(cls, header_life_style):
        first_part = Person._tabulate_(Person._FIRST_NAME, Person._LAST_NAME, Person._AGE, Person._OCCUPATION,
                                       Person._HEIGHT, Person._WEIGHT)
        return '{}{}'.format(first_part, header_life_style)

    def __str__(self):
        first_part = Person._tabulate_(self.first_name, self.last_name, self.age, self.occupation, self.height,
                                       self.weight)
        return '{}{}'.format(first_part, ' '.join(self.lifestyle.tabulate()))

    def single_column(self):
        first_part = Person._tabulate_(self.first_name, self.last_name, self.age, self.occupation, self.height,
                                       self.weight)
        return '{} {}'.format(first_part, self.lifestyle.tabulate_single_column())
def populate(persons_populate):
    for line in read_csv(IN_FILE):
        life_style = ''
        if line[6] == 'x':
            life_style = LifeStyle.Sedentary
        elif line[7] == 'x':
            life_style = LifeStyle.Moderate
        elif line[8] == 'x':
            life_style = LifeStyle.Active
        persons_populate.append(Person(line[0], line[1], line[2], line[3], line[4], line[5], life_style))
    return persons_populate


persons = populate(list())
print(Person.header(LifeStyle.header()))
for person in persons:
    print(person)
write_file(OUT_FILE_TABULATE, [str(item) for item in persons])

# add new persons here
persons.append(Person("teste", "teste", "22", "worker", "5'8\"", "110", LifeStyle.Active))

final_list = list()
for person in persons:
    one_list = [person.first_name, person.last_name, person.age, person.occupation, person.height,
                person.weight]
    one_list.extend([item.strip() for item in person.lifestyle.tabulate()])
    final_list.append(','.join(one_list))
write_file(OUT_FILE, final_list)

print("\n", Person.header(LifeStyle.header_single_column()))
for person in persons:
    print(person.single_column())
output1.csv:
Anna,Barbara,35,nurse,5'3",129,,,x
Catherine,Do,45,physicist,5'5",135,,x,
Eric,Frederick,28,teacher,5'5",140,,,x
Gabriel,Hernandez,55,surgeon,5'7",150,x,,
Ivy,Joo,31,engineer,5'2",126,x,,
Kelly,Marks,21,student,5'4",132,,x,
Nancy,Owens,60,immunologist,5'8",170,x,,
Patricia,Qin,36,dental assistant,4'11",110,x,,
Roderick,Stevenson,51,bus driver,5'6",160,x,,
Tracy,Umfreville,42,audiologist,5'7",156,x,,
Victoria,Wolfeschlegelsteinhausenbergerdorff,38,data analyst ,5'8",158,,,x
Lucy,Xi,49,professor,5'9",161,,,x
Yolanda,Zachary,58,secretary,5'10",164,x,,
teste,teste,22,worker,5'8",110,,x,
output2.csv:
Anna Barbara 35 nurse 5'3" 129 x
Catherine Do 45 physicist 5'5" 135 x
Eric Frederick 28 teacher 5'5" 140 x
Gabriel Hernandez 55 surgeon 5'7" 150 x
Ivy Joo 31 engineer 5'2" 126 x
Kelly Marks 21 student 5'4" 132 x
Nancy Owens 60 immunologist 5'8" 170 x
Patricia Qin 36 dental assistant 4'11" 110 x
Roderick Stevenson 51 bus driver 5'6" 160 x
Tracy Umfreville 42 audiologist 5'7" 156 x
Victoria Wolfeschlegelsteinhausenbergerdorff 38 data analyst 5'8" 158 x
Lucy Xi 49 professor 5'9" 161 x
Yolanda Zachary 58 secretary 5'10" 164 x
single_column:
Anna Barbara 35 nurse 5'3" 129 Moderate
Catherine Do 45 physicist 5'5" 135 Active
Eric Frederick 28 teacher 5'5" 140 Moderate
Gabriel Hernandez 55 surgeon 5'7" 150 Sedentary
Ivy Joo 31 engineer 5'2" 126 Sedentary
Kelly Marks 21 student 5'4" 132 Active
Nancy Owens 60 immunologist 5'8" 170 Sedentary
Patricia Qin 36 dental assistant 4'11" 110 Sedentary
Roderick Stevenson 51 bus driver 5'6" 160 Sedentary
Tracy Umfreville 42 audiologist 5'7" 156 Sedentary
Victoria Wolfeschlegelsteinhausenbergerdorff 38 data analyst 5'8" 158 Moderate
Lucy Xi 49 professor 5'9" 161 Moderate
Yolanda Zachary 58 secretary 5'10" 164 Sedentary
teste teste 22 worker 5'8" 110 Active
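The width-tracking idea behind the solution can also be looked at in isolation. This is only a small sketch with two rows taken from the roster; the names rows, headers, widths and format_row are introduced here for illustration and are not part of the code above:

```python
headers = ("First Name", "Last Name", "Occupation")
rows = [
    ("Anna", "Barbara", "nurse"),
    ("Victoria", "Wolfeschlegelsteinhausenbergerdorff", "data analyst"),
]

# Width of each column = longest cell in that column, header included.
widths = [max(len(cell) for cell in column) for column in zip(headers, *rows)]


def format_row(cells, widths):
    # '{:>{width}}' right-aligns a cell in a field whose width is passed at call time.
    return " ".join("{:>{width}}".format(cell, width=w) for cell, w in zip(cells, widths))


table = [format_row(headers, widths)] + [format_row(row, widths) for row in rows]
print("\n".join(table))
```

Because the widths are computed from the data, a longer name only changes the numbers in widths, not the format string.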
I am trying to solve a vehicle routing problem with 5 drivers for deliveries. I am using haversine and lat-long to calculate the distance matrix. I am new to OR-Tools, so I am following the VRP example.
The issue is that, out of 5 drivers, routes are generated for only 2 of them, and these routes are very long. I want to generate multiple shorter routes so that all the drivers are utilized. Can you please check whether I am setting some constraint wrong?
Can someone please explain how to set the "Distance" dimension and SetGlobalSpanCostCoefficient in Google OR-Tools? Here is the code and output.
from __future__ import print_function
import pandas as pd
import numpy as np
import googlemaps
import math
from ortools.constraint_solver import routing_enums_pb2
from ortools.constraint_solver import pywrapcp
gmaps = googlemaps.Client(key='API Key')
def calculate_geocodes():
    df = pd.read_csv("banglore_zone.csv")
    df['lat'] = pd.Series(np.repeat(0, df.size), dtype=float)
    df['long'] = pd.Series(np.repeat(0, df.size), dtype=float)
    result = np.zeros([df.size, 2])
    for index, row in df.iterrows():
        # print(row['Address'])
        geocode_result = gmaps.geocode(row['Address'])[0]
        lat = (geocode_result['geometry']['location']['lat'])
        lng = (geocode_result['geometry']['location']['lng'])
        result[index] = lat, lng
        df.lat[index] = lat
        df.long[index] = lng
    print("First step", df)
    coords = df.as_matrix(columns=['lat', 'long'])
    return coords, df


def calculate_distance_matrix(coordinates, gmaps):
    distance_matrix = np.zeros(
        (np.size(coordinates, 0), np.size(coordinates, 0)))  # create an empty matrix for distance between all locations
    for index in range(0, np.size(coordinates, 0)):
        src = coordinates[index]
        for ind in range(0, np.size(coordinates, 0)):
            dst = coordinates[ind]
            distance_matrix[index, ind] = distance(src[0], src[1], dst[0], dst[1])
    return distance_matrix


def distance(lat1, long1, lat2, long2):
    # Note: The formula used in this function is not exact, as it assumes
    # the Earth is a perfect sphere.
    # Mean radius of Earth in miles
    radius_earth = 3959
    # Convert latitude and longitude to
    # spherical coordinates in radians.
    degrees_to_radians = math.pi / 180.0
    phi1 = lat1 * degrees_to_radians
    phi2 = lat2 * degrees_to_radians
    lambda1 = long1 * degrees_to_radians
    lambda2 = long2 * degrees_to_radians
    dphi = phi2 - phi1
    dlambda = lambda2 - lambda1
    a = haversine(dphi) + math.cos(phi1) * math.cos(phi2) * haversine(dlambda)
    c = 2 * math.atan2(math.sqrt(a), math.sqrt(1 - a))
    d = radius_earth * c
    return d


def haversine(angle):
    h = math.sin(angle / 2) ** 2
    return h
def create_data_model(distance_matrix, number_of_vehicles, depot):
    """Stores the data for the problem."""
    data = {}
    data['distance_matrix'] = distance_matrix
    print(distance_matrix)
    data['num_vehicles'] = number_of_vehicles
    data['depot'] = depot
    return data


def print_solution(data, manager, routing, solution, address_dataframe):
    """Prints solution on console."""
    max_route_distance = 0
    for vehicle_id in range(data['num_vehicles']):
        index = routing.Start(vehicle_id)
        plan_output = 'Route for vehicle {}:\n'.format(vehicle_id)
        route_distance = 0
        while not routing.IsEnd(index):
            plan_output += ' {} ---> '.format(address_dataframe.iloc[manager.IndexToNode(index), 0])
            previous_index = index
            index = solution.Value(routing.NextVar(index))
            route_distance += routing.GetArcCostForVehicle(
                previous_index, index, vehicle_id)
        plan_output += '{}\n'.format(manager.IndexToNode(index))
        plan_output += 'Distance of the route: {}m\n'.format(route_distance)
        print(plan_output)
        max_route_distance = max(route_distance, max_route_distance)
    print('Maximum of the route distances: {}m'.format(max_route_distance))
def main():
    coordinates, address_dataframe = calculate_geocodes()
    distance_matrix = calculate_distance_matrix(coordinates, gmaps)
    data = create_data_model(distance_matrix, 5, 0)

    # Create the routing index manager.
    manager = pywrapcp.RoutingIndexManager(
        len(data['distance_matrix']), data['num_vehicles'], data['depot'])

    # Create Routing Model.
    routing = pywrapcp.RoutingModel(manager)

    # Create and register a transit callback.
    def distance_callback(from_index, to_index):
        """Returns the distance between the two nodes."""
        # Convert from routing variable Index to distance matrix NodeIndex.
        from_node = manager.IndexToNode(from_index)
        to_node = manager.IndexToNode(to_index)
        return data['distance_matrix'][from_node][to_node]

    transit_callback_index = routing.RegisterTransitCallback(distance_callback)

    # Define cost of each arc.
    routing.SetArcCostEvaluatorOfAllVehicles(transit_callback_index)

    # Add Distance constraint.
    dimension_name = 'Distance'
    routing.AddDimension(
        transit_callback_index,
        0,  # no slack
        80,  # vehicle maximum travel distance
        True,  # start cumul to zero
        dimension_name)
    distance_dimension = routing.GetDimensionOrDie(dimension_name)
    distance_dimension.SetGlobalSpanCostCoefficient(100)

    # Setting first solution heuristic.
    search_parameters = pywrapcp.DefaultRoutingSearchParameters()
    search_parameters.local_search_metaheuristic = (
        routing_enums_pb2.LocalSearchMetaheuristic.GUIDED_LOCAL_SEARCH)
    search_parameters.time_limit.seconds = 120
    search_parameters.log_search = False

    # Solve the problem.
    solution = routing.SolveWithParameters(search_parameters)

    # Print solution on console.
    if solution:
        print_solution(data, manager, routing, solution, address_dataframe)


if __name__ == '__main__':
    main()
The distance matrix and the output are:
> [[ 0. 0.31543319 3.36774402 ... 8.79765925 8.94261055
> 8.83759758] [ 0.31543319 0. 3.09418962 ... 8.81074289 8.95034082
> 8.84901702] [ 3.36774402 3.09418962 0. ... 10.87348059 10.97329493
> 10.89962072] ... [ 8.79765925 8.81074289 10.87348059 ... 0. 0.20726879
> 0.06082994] [ 8.94261055 8.95034082 10.97329493 ... 0.20726879 0.
> 0.1465572 ] [ 8.83759758 8.84901702 10.89962072 ... 0.06082994 0.1465572
> 0. ]] Route for vehicle 0: 3Embed software 10th Cross St, RBI Colony, Ganganagar, Bengaluru, Karnataka 560024 ---> 0 Distance of
> the route: 0m
>
> Route for vehicle 1: 3Embed software 10th Cross St, RBI Colony,
> Ganganagar, Bengaluru, Karnataka 560024 ---> 0 Distance of the route:
> 0m
>
> Route for vehicle 2: 3Embed software 10th Cross St, RBI Colony,
> Ganganagar, Bengaluru, Karnataka 560024 ---> Sindhi High School,
> Kempapura ---> Hoppers stop Building No.12, Krishnaja Avenue, Near
> Kogilu Cross, International Airport Road, Yelahanka, Bengaluru --->
> Kempegowda International Airport Bengaluru ---> mvit International
> Airport Road, Hunasamaranahalli, Yelahanka, Krishnadeveraya Nagar,
> Bengaluru ---> Canadian International School 4 & 20, Manchenahalli,
> Yelahanka, Bengaluru ---> brick factory RMZ Galleria, Office Block,
> Ground Floor, B.B. Road, Yelahanka ---> Jakkur Aerodrome Bellary
> Road, Post, Yelahanka, Bengaluru, Karnataka ---> Godrej Platinum
> International Airport Road, Hebbal, Bengaluru, Karnataka ---> Vidya
> Niketan School 30, Kempapura Main Road, Kempapura, Hebbal, Bengaluru,
> Karnataka ---> Atria Institute of Technology, 1st Main Rd, Ags
> Colony, Anandnagar, Hebbal, Bengaluru, Karnataka ---> 0 Distance of
> the route: 26m
>
> Route for vehicle 3: 3Embed software 10th Cross St, RBI Colony,
> Ganganagar, Bengaluru, Karnataka 560024 ---> Caffe cofee day CBI road
> banglore ---> 0 Distance of the route: 0m
>
> Route for vehicle 4: 3Embed software 10th Cross St, RBI Colony,
> Ganganagar, Bengaluru, Karnataka 560024 ---> RT Nagar Police station
> ---> bus stop mekhri circle banglore ---> Truffles 80 Feet Road, Jaladarsini Layout, Sanjaynagar Banglore ---> BEL circle banglore
> ---> Paragon Outlet Shivapura, Peenya, Bengaluru ---> Taj vivanta Yeshwantpur, Bengaluru ---> Orion Mall A Block, Brigade Gateway, Dr
> Rajkumar Rd, Malleshwaram, Bengaluru ---> brand factory Malleshwaram
> Banglore ---> Mantri Square Mall, Sampige Road, Malleshwaram,
> Bengaluru, Karnataka ---> Krantivira Sangolli Rayanna Bengaluru, M.G.
> Railway Colony, Majestic, Bengaluru ---> UB city banglore --->
> Brigade road banglore ---> MG Road metro station Banglore --->
> commercial street bangalore ---> Infantry Road, Beside Prestige
> Building, Tasker Town, Shivaji Nagar, Bengaluru, Karnataka --->
> Garuda Mall Magrath Rd, Ashok Nagar, Bengaluru ---> Brand Factory -
> Home Town Above HomeTown, Vanshee Towers, Survey No.92/4 3rd and 4th
> Floors, Outer Ring Rd, Marathahalli, Bengaluru ---> KLM Shopping
> Mall, Marathahalli Bridge Marathahalli ---> Favourite Shop HAL Old
> Airport Rd, Subbaiah Reddy Colony, Marathahalli Village, Marathahalli,
> Bengaluru ---> Max RPR Plaza, Varthur Rd, Marathahalli, Bengaluru
> ---> Pick 'n' Move Shop No. 102, Ground Floor, Varthur Rd, Marathahalli Village, Marathahalli, Bengaluru ---> Lotto Shoes 45/2,
> Varthur Rd, Marathahalli Village, Marathahalli, Bengaluru, Karnataka
> ---> The Raymond Shop Opp. Mga Hospital, Marathalli Main Road Near Ring Road Junction, Bengaluru, Karnataka ---> chinnaswamy stadium
> banglore ---> fun cinemas cunningham road Banglore ---> Cant station
> banglore ---> Radhakrishna Theatre 25, 1st Main Rd, Mattdahally, RT
> Nagar, Bengaluru ---> Presidency School Near R T Nagar, HMT Layout,
> Bengaluru, Karnataka ---> 0 Distance of the route: 20m
>
> Maximum of the route distances: 26m
You should reduce the vehicle maximum travel distance. You currently set it to 80, while your route distances are only 20 and 26, so the limit never forces the stops to be spread over more vehicles.
routing.AddDimension(
    transit_callback_index,
    0,  # no slack
    80,  # vehicle maximum travel distance
    True,  # start cumul to zero
    dimension_name)
You can use a global span cost, which reduces the longest route traveled by any vehicle.
E.g. set it on the distance dimension with distance_dimension.SetGlobalSpanCostCoefficient().
Pass a high integer value as the argument, one that is greater than the sum of all costs. Try setting it to 1000 or so.
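One more thing that may matter here, offered as an assumption rather than a diagnosis: the OR-Tools routing solver does its computations with integers, while the matrix in the question holds fractional miles (e.g. 0.315), which would be consistent with routes being reported as 0m. Below is a minimal sketch of scaling the matrix to integer metres before it is handed to the distance callback. METERS_PER_MILE and to_integer_meters are names made up for this example, and the three-stop matrix is just the top-left corner of the one printed in the question.

```python
METERS_PER_MILE = 1609.344


def to_integer_meters(matrix_miles):
    """Return a copy of the matrix with every distance rounded to whole metres."""
    return [[int(round(d * METERS_PER_MILE)) for d in row] for row in matrix_miles]


# Top-left corner of the distance matrix shown in the question (miles).
miles = [
    [0.0, 0.31543319, 3.36774402],
    [0.31543319, 0.0, 3.09418962],
    [3.36774402, 3.09418962, 0.0],
]

meters = to_integer_meters(miles)
print(meters)
```

If the matrix is scaled like this, the maximum travel distance passed to AddDimension has to be expressed in the same unit, so a limit chosen in miles would need the same conversion.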
I have the following raw data from an HTML source file:
{$deletedFields:[courses,projects,description,degreeName,recommendations,honors,entityLocale,activities,grade,fieldOfStudyUrn,testScores,degreeUrn],entityUrn:urn:li:fs_education:(ACoAAAIUJvgBC7QTHSmLEjgtomzxvwceeM71E1c,75863717),school:urn:li:fs_miniSchool:11709,timePeriod:urn:li:fs_education:(ACoAAAIUJvgBC7QTHSmLEjgtomzxvwceeM71E1c,75863717),timePeriod,schoolName:Charles University in Prague,fieldOfStudy:Economics, Politics,schoolUrn:urn:li:fs_miniSchool:11709,$type:com.linkedin.voyager.identity.profile.Education},
{$deletedFields:[courses,projects,description,recommendations,honors,entityLocale,activities,grade,fieldOfStudyUrn,testScores,degreeUrn],entityUrn:urn:li:fs_education:(ACoAAAIUJvgBC7QTHSmLEjgtomzxvwceeM71E1c,26812055),school:urn:li:fs_miniSchool:17888,timePeriod:urn:li:fs_education:(ACoAAAIUJvgBC7QTHSmLEjgtomzxvwceeM71E1c,26812055),timePeriod,degreeName:BA,schoolName:Occidental College,fieldOfStudy:Economics,schoolUrn:urn:li:fs_miniSchool:17888,$type:com.linkedin.voyager.identity.profile.Education},
{$deletedFields:[],profileId:ACoAAAIUJvgBC7QTHSmLEjgtomzxvwceeM71E1c,elements:[urn:li:fs_education:(ACoAAAIUJvgBC7QTHSmLEjgtomzxvwceeM71E1c,26812055),urn:li:fs_education:(ACoAAAIUJvgBC7QTHSmLEjgtomzxvwceeM71E1c,75863717)],paging:urn:li:fs_profileView:ACoAAAIUJvgBC7QTHSmLEjgtomzxvwceeM71E1c,educationView,paging,$type:com.linkedin.voyager.identity.profile.EducationView,$id:urn:li:fs_profileView:ACoAAAIUJvgBC7QTHSmLEjgtomzxvwceeM71E1c,educationView},
{$deletedFields:[],start:501,end:1000,$type:com.linkedin.voyager.identity.profile.EmployeeCountRange,$id:urn:li:fs_position:(ACoAAAIUJvgBC7QTHSmLEjgtomzxvwceeM71E1c,323432440),company,employeeCountRange}
{$deletedFields:[month,day],year:2007,$type:com.linkedin.common.Date,$id:urn:li:fs_education:(ACoAAAIUJvgBC7QTHSmLEjgtomzxvwceeM71E1c,75863717),timePeriod,startDate},
{$deletedFields:[month,day],year:2004,$type:com.linkedin.common.Date,$id:urn:li:fs_education:(ACoAAAIUJvgBC7QTHSmLEjgtomzxvwceeM71E1c,26812055),timePeriod,startDate},
{$deletedFields:[month,day],year:2008,$type:com.linkedin.common.Date,$id:urn:li:fs_education:(ACoAAAIUJvgBC7QTHSmLEjgtomzxvwceeM71E1c,26812055),timePeriod,endDate},
{$deletedFields:[month,day],year:2007,$type:com.linkedin.common.Date,$id:urn:li:fs_education:(ACoAAAIUJvgBC7QTHSmLEjgtomzxvwceeM71E1c,75863717),timePeriod,endDate},
I need to extract some data from it, using something like:
schoolname = re.findall(r',schoolname:(.*?),' , page_html)
fieldofstudy = skills = re.findall(r'fieldOfStudy:(.*?),s' , page_html)
degreename = re.findall(r'degreeName:(.*?),' , page_html)
Needed Output
schoolName:Charles University in Prague
fieldOfStudy:Economics, Politics
Start : Year 2007
End : 2007
schoolName:Occidental College
fieldOfStudy:Economics
degreeName:BA
start : 2004
End : 2008
Question: I need to extract some data from it
Define a Data Container class School:
class School(object):
    def __init__(self, raw_data):
        key = None
        year = '?'
        for kv in raw_data:
            i = kv.find(':')
            if i >= 0:
                # 'key:value' item
                key = kv[0:i]
                value = kv[i + 1:]
                if key in ['schoolName', 'fieldOfStudy', 'startDate', 'endDate', 'degreeName']:
                    object.__setattr__(self, key, value)
                if key in ['year']:
                    year = value
            else:
                # Item without ':' continues the previous key or is a bare marker
                if key in ['entityUrn', '$id']:
                    if kv[:-1].isdigit():
                        self.entity = kv[:-1]
                elif key in ['fieldOfStudy']:
                    self.fieldOfStudy += ', '+kv
                elif kv in ['startDate', 'endDate']:
                    object.__setattr__(self, kv, year)
                key = ''
        if not hasattr(self, 'degreeName'):
            self.degreeName = 'unknown'

    def __repr__(self):
        return "entity:\t\t{s.entity:>28}\n" \
               "schoolName:\t{s.schoolName:>28}\n" \
               "fieldOfStudy:{s.fieldOfStudy:>27}\n" \
               "degreeName:\t{s.degreeName:>28}\n" \
               "startDate:\t{s.startDate:>28}\n" \
               "endDate:\t{s.endDate:>28}\n".format(s=self)
Read the file line by line:
import re

with open('<path to file>') as fh:
    degreeUrn = {}
    for line in fh:
        match = re.findall(r'\{(.*?)\:\[(.*?)\],(.*)\}', line)
        m2 = match[0][2].split(',')
        school = School(m2)
        if hasattr(school, 'entity'):
            if hasattr(school, 'startDate'):
                # Date record: copy the year onto the matching School
                degreeUrn[school.entity].startDate = school.startDate
                del school
            elif hasattr(school, 'endDate'):
                degreeUrn[school.entity].endDate = school.endDate
                del school
            elif hasattr(school, 'schoolName'):
                degreeUrn[school.entity] = school
            else:
                del school

for entity in degreeUrn:
    print(degreeUrn[entity])
Output:
entity: 75863717
schoolName: Charles University in Prague
fieldOfStudy: Economics, Politics
degreeName: unknown
startDate: 2007
endDate: 2007
entity: 26812055
schoolName: Occidental College
fieldOfStudy: Economics
degreeName: BA
startDate: 2004
endDate: 2008
Tested with Python: 3.4.2
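If only schoolName, fieldOfStudy and degreeName are needed, a plain regex may be enough. The sketch below assumes the fields always appear in the order seen in the sample (schoolName, then fieldOfStudy, then schoolUrn), which I can't guarantee for other profiles. The two records are shortened copies of the data in the question, and record_1, record_2 and school_pattern are names made up for this example:

```python
import re

# Two records, shortened from the sample data in the question.
record_1 = ("timePeriod,schoolName:Charles University in Prague,"
            "fieldOfStudy:Economics, Politics,schoolUrn:urn:li:fs_miniSchool:11709,"
            "$type:com.linkedin.voyager.identity.profile.Education")
record_2 = ("timePeriod,degreeName:BA,schoolName:Occidental College,"
            "fieldOfStudy:Economics,schoolUrn:urn:li:fs_miniSchool:17888,"
            "$type:com.linkedin.voyager.identity.profile.Education")
page_html = "{" + record_1 + "},{" + record_2 + "},"

# Anchoring on the next field name lets fieldOfStudy contain commas.
school_pattern = re.compile(r'schoolName:(.*?),fieldOfStudy:(.*?),schoolUrn:')
schools = school_pattern.findall(page_html)
degrees = re.findall(r'degreeName:(.*?),', page_html)

for school_name, field_of_study in schools:
    print("schoolName:" + school_name)
    print("fieldOfStudy:" + field_of_study)
print(degrees)
```

Note that the keys are case-sensitive: the data says schoolName, so a pattern written with schoolname will not match.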
Below is the code I was trying out, but I am not getting the expected results.
import re
def multiwordReplace(text, wordDic):
    """
    take a text and replace words that match a key in a dictionary with
    the associated value, return the changed text
    """
    rc = re.compile('|'.join(map(re.escape, wordDic)))
    def translate(match):
        return wordDic[match.group(0)]
    return rc.sub(translate, text)

wordDic = {
    'ANGLO': 'ANGLO IRISH BANK',
    'ANGLO IRISH': 'ANGLO IRISH BANK'
}

def replace(match):
    return wordDic[match.group(0)]
    #return ''.join(y for y in match.group(0).split())

str1 = {'ANGLO IRISH CORP PLC - THIS FOLLOWS THE BANK NATIONALIZATION BY THE GOVT OF THE REPUBLIC OF IRELAND'
        'ANGLO CORP PLC - THIS FOLLOWS THE BANKS NATIONALIZATION BY THE GOVT OF THE REPUBLIC OF IRELAND'}

for item in str1:
    str2 = multiwordReplace(item, wordDic)
    print str2
    print re.sub('|'.join(r'\b%s\b' % re.escape(s) for s in wordDic),
                 replace, item)
Output:
ANGLO IRISH BANK IRISH CORP PLC - THIS FOLLOWS THE BANK NATIONALIZATION BY THE GOVT OF THE REPUBLIC OF IRELAND
ANGLO IRISH BANK CORP PLC - THIS FOLLOWS THE BANKS NATIONALIZATION BY THE GOVT OF THE REPUBLIC OF IRELAND
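The first one should give only 'ANGLO IRISH BANK', not 'ANGLO IRISH BANK IRISH'.
Sort the keys so that the longest possible match appears first. Regex alternation tries the alternatives from left to right, so if 'ANGLO' comes first it matches before 'ANGLO IRISH' is ever tried.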
longest_first = sorted(wordDic, key=len, reverse=True)
rc = re.compile('|'.join(map(re.escape, longest_first)))
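For completeness, here is how that could look as a whole function. This is a sketch written for Python 3 (the question's code is Python 2), and the names multiword_replace and word_dic are just renamed versions of the ones in the question:

```python
import re

word_dic = {
    'ANGLO': 'ANGLO IRISH BANK',
    'ANGLO IRISH': 'ANGLO IRISH BANK',
}


def multiword_replace(text, word_dic):
    # Longest keys first, so 'ANGLO IRISH' is tried before 'ANGLO'.
    longest_first = sorted(word_dic, key=len, reverse=True)
    rc = re.compile('|'.join(map(re.escape, longest_first)))
    return rc.sub(lambda match: word_dic[match.group(0)], text)


first = multiword_replace('ANGLO IRISH CORP PLC', word_dic)
second = multiword_replace('ANGLO CORP PLC', word_dic)
print(first)
print(second)
```

Both inputs come out as 'ANGLO IRISH BANK CORP PLC', because the replacement text itself is not scanned again.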