Python list manipulation based on indexing - python

I have two lists:
The first list consists of the titles of various publications, whereas the second list consists of the author names.
list B = ['Moe Terry M 2005 ', 'March James G and Johan P Olsen 2006 ', 'Kitschelt Herbert 2000 ', 'Bates Robert H 1981 ' , .......]
list A = ['"Linkages between Citizens and Politicians in Democratic Polities,"', '"Winners Take All: The Politics of Partial Reform in Postcommunist \n\nTransitions,"', '"Inequality, Social Insurance, and \n\nRedistribution."', '"Majoritarian Electoral Systems and \nConsumer Power: Price-Level Evidence from the OECD Countries."']
I am running scholar.py as a bash command. The syntax goes like this:
scholar = "python scholar.py -c 1 --author " + str(name) + " --phrase " + str(title)
Now, what I am trying to do is get each title and author in order so that I can use them with scholar.py.
But I cannot figure out how to get the first author name together with the first title.
I would have used indexing if the lists were small.

Is this what you are looking for?
list B = ['Moe Terry M 2005 ', 'March James G and Johan P Olsen 2006 ', 'Kitschelt Herbert 2000 ', 'Bates Robert H 1981 ' , .......]
list A = ['"Linkages between Citizens and Politicians in Democratic Polities,"', '"Winners Take All: The Politics of Partial Reform in Postcommunist \n\nTransitions,"', '"Inequality, Social Insurance, and \n\nRedistribution."', '"Majoritarian Electoral Systems and \nConsumer Power: Price-Level Evidence from the OECD Countries."']
for i, j in zip(B, A):
    print i, j    # Python 2.x
    print(i, j)   # Python 3.x
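To connect this back to the scholar.py command from the question, here is a minimal sketch that pairs each author with the title at the same position and builds one command string per pair. The function name build_commands and the whitespace clean-up are assumptions for illustration. The sample values are copied from the question's lists and only demonstrate the pairing; the flags are the ones the question uses.

```python
import shlex

# Sample values taken from the question; pairing by position is what zip does.
authors = ['Moe Terry M 2005 ', 'Kitschelt Herbert 2000 ']
titles = ['"Linkages between Citizens and Politicians in Democratic Polities,"',
          '"Inequality, Social Insurance, and \n\nRedistribution."']


def build_commands(authors, titles):
    """Return one scholar.py command string per (author, title) pair."""
    commands = []
    for author, title in zip(authors, titles):
        # Collapse embedded newlines/spaces and drop the surrounding quotes.
        clean_title = " ".join(title.split()).strip('"')
        clean_author = author.strip()
        commands.append(
            "python scholar.py -c 1 --author {} --phrase {}".format(
                shlex.quote(clean_author), shlex.quote(clean_title)
            )
        )
    return commands


commands = build_commands(authors, titles)
for command in commands:
    print(command)
```

shlex.quote is used so that spaces and punctuation in the names and titles survive the shell. Note that zip stops at the shorter list, so both lists should have the same length.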

Related

how to assign a variable to each line from a repeating text pattern in python?

I have a Python scraping script that gets information about some upcoming concerts. The text pattern is the same every time, no matter how many concerts appear, which means each line always refers to a certain piece of information, as in this example (please note that there are no blank lines between concerts; my data is exactly in this format):
01/01/99 9PM
Iron Maiden
Madison Square Garden
New York City
01/01/99 9.30PM
The Doors
Staples Center
Los Angeles
01/02/99 8.45PM
Dr Dre & Snoop Dogg
Staples Center
Los Angeles
01/02/99 9PM
Diana Ross
City Hall
New York City etc...
Each line needs to be assigned to a variable, so 4 variables in total:
time = all the 1st lines
name = all the 2nd lines
location = all the 3rd lines
city = all the 4th lines
Then I want to loop through all the lines to collect the information corresponding to each variable, such as getting all the dates from the 1st lines, all the names from the 2nd lines, etc.
So far I haven't found a solution, and I barely know anything about regex.
I hope you see the idea. Don't hesitate to ask if you have any questions, thanks.
No need to use regex:
string = '''01/01/99 9PM
Iron Maiden
Madison Square Garden
New York City
01/01/99 9.30PM
The Doors
Staples Center
Los Angeles
01/02/99 8.45PM
Dr Dre & Snoop Dogg
Staples Center
Los Angeles
01/02/99 9PM
Diana Ross
City Hall
New York City
'''
lines = string.splitlines()  # unlike split('\n'), this leaves no empty trailing item
dates = lines[0::4]
bands = lines[1::4]
places = lines[2::4]
cities = lines[3::4]
This will give you a list of dates/bands/places/cities, which will be easier to work with.
If you want to turn them back into a string, you could do:
'; '.join(dates) #Do the same for all 4 variables
Which gives:
'01/01/99 9PM; 01/01/99 9.30PM; 01/02/99 8.45PM; 01/02/99 9PM'
You could replace '; ' with ' ' if you only want them to be space separated, or with whichever separator you like.
I would personally use namedtuples. Note that I put your data in a file called input.txt.
from collections import namedtuple
Entry = namedtuple("Entry", "time name location city")
with open('input.txt') as f:
    # skip blank lines so a trailing newline does not produce a partial entry
    lines = [line.strip() for line in f if line.strip()]
objects = [Entry(*lines[i:i+4]) for i in range(0, len(lines), 4)]
print(*objects, sep='\n')
for obj in objects:
    print(obj.name)
Output:
Entry(time='01/01/99 9PM', name='Iron Maiden', location='Madison Square Garden', city='New York City')
Entry(time='01/01/99 9.30PM', name='The Doors', location='Staples Center', city='Los Angeles')
Entry(time='01/02/99 8.45PM', name='Dr Dre & Snoop Dogg', location='Staples Center', city='Los Angeles')
Entry(time='01/02/99 9PM', name='Diana Ross', location='City Hall', city='New York City')
Iron Maiden
The Doors
Dr Dre & Snoop Dogg
Diana Ross
This calls for slicing:
times = lines[0::4]
names = lines[1::4]
locations = lines[2::4]
cities = lines[3::4]
And now we can zip those lists into tuples:
events = zip(times, names, locations, cities)
With your sample data, this gives us
>>> list(events)
[('01/01/99 9PM', 'Iron Maiden', 'Madison Square Garden ', 'New York City'), ('01/01/99 9.30PM', 'The Doors', 'Staples Center', 'Los Angeles'), ('01/02/99 8.45PM', 'Dr Dre & Snoop Dogg', 'Staples Center', 'Los Angeles'), ('01/02/99 9PM', 'Diana Ross', 'City Hall', 'New York City')]
You can now process these tuples into any data structure that suits your use case best.
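As one possible next step, the tuples can be turned into dictionaries keyed by field name. The sketch below is only an illustration: the names FIELDS and parse_events are made up here, and the sample text is a shortened copy of the question's data.

```python
# Shortened copy of the sample data from the question.
text = """01/01/99 9PM
Iron Maiden
Madison Square Garden
New York City
01/01/99 9.30PM
The Doors
Staples Center
Los Angeles
"""

# Hypothetical field names, matching the four variables the question asks for.
FIELDS = ("time", "name", "location", "city")


def parse_events(text):
    """Group every four non-blank lines into one dict per concert."""
    lines = [line.strip() for line in text.splitlines() if line.strip()]
    # Slicing with a step of 4 picks the 1st, 2nd, 3rd and 4th line of each group.
    records = zip(lines[0::4], lines[1::4], lines[2::4], lines[3::4])
    return [dict(zip(FIELDS, record)) for record in records]


events = parse_events(text)
for event in events:
    print(event["time"], "-", event["name"], "-", event["city"])
```

A list of dicts like this is convenient if the data is later written out as JSON or loaded into a table, but a plain list of tuples works just as well for simple loops.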

Is there any better way to do name categorisation based on uppercase and lowercase?

I want to categorize names written as free text and create a categorical variable from the result:
Only first : Only the first letter is capital
Standard usage : The first letter of every word is capital
All capital : Every letter is a capital letter
All small : Every letter is in lower case
Unidentified : Not in any of the 4 categories above
Here's my data
Id Name
1 Donald trump
2 Barack Obama
3 Hillary ClintoN
4 BILL GATES
5 jeff bezoz
6 Mark Zuckerberg
What I want
Id Name Category
1 Donald trump Only first
2 Barack Obama Standard usage
3 Hillary ClintoN Unidentified
4 BILL GATES All capital
5 jeff bezoz All small
6 Mark Zuckerberg Standard usage
What I did is
df['Uppercase'] = df['Name'].str.findall(r'[A-Z]').str.len()
df['Lowercase'] = df['Name'].str.findall(r'[a-z]').str.len()
df['WordCount'] = df['Name'].str.count(' ') + 1
Then apply some logic using the map function, such as:
`df['Lowercase'] = 0` for `All capital`
`df['Uppercase'] = 0` for `All small`
`df['Uppercase'] - df['WordCount'] = 0` for `Standard usage`
`df['Uppercase'] = 1` and `df['WordCount']` for `Only first`
If a name doesn't belong to any of these, it is labelled as Unidentified.
But naBih baWazir would be recorded as Standard usage under this rule, not as Unidentified, so I think there must be a better way to do this.
Use the functions Series.str.islower, Series.str.isupper and Series.str.istitle to build the masks, and numpy.select to create the new column:
import numpy as np
# test that everything after the first character is lowercase and the first character is uppercase
m1 = df['Name'].str[1:].str.islower() & df['Name'].str[0].str.isupper()
m2 = df['Name'].str.istitle()
m3 = df['Name'].str.islower()
m4 = df['Name'].str.isupper()
df['Category'] = np.select([m1, m2, m3, m4],
                           ['Only first', 'Standard usage', 'All small', 'All capital'],
                           default='Unidentified')
print(df)
Id Name Category
0 1 Donald trump Only first
1 2 Barack Obama Standard usage
2 3 Hillary ClintoN Unidentified
3 4 BILL GATES All capital
4 5 jeff bezoz All small
5 6 Mark Zuckerberg Standard usage
Idea by @Jon Clements, thank you:
m1 = df['Name'].str[1:].str.islower() & df['Name'].str[0].str.isupper()
df1 = df.Name.agg([str.istitle, str.islower, str.isupper])
df['Category'] = np.select(
    [m1, *df1.values.T],
    ['Only first', 'Standard usage', 'All small', 'All capital'],
    default='Unidentified'
)
You might need to modify this to fit your requirements, but it gives a rough idea of how to do it with Python's built-in string methods.
You can use something like this:
name_list = ['Donald trump','Barack Obama','Hillary Clinton','BILL GATES','jeff bezoz','Mark Zuckerberg']
for name in name_list:
    if name.isupper():
        print(name, 'All capital')
    elif name.islower():
        print(name, 'All small')
    elif name.istitle():
        print(name, 'Standard usage')
    elif (name[0].isupper() and name[1:].islower()):
        print(name, 'Only first')
    else:
        print(name, 'Unidentified')
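The same checks can be wrapped in a function that returns the category instead of printing it, which makes the asker's problem case (naBih baWazir) easy to verify. This is only a sketch; the function name categorize is an assumption, and the category labels are the ones from the question.

```python
def categorize(name):
    """Return the capitalisation category of a name, using only str methods."""
    if not name:
        return 'Unidentified'
    if name.isupper():
        return 'All capital'
    if name.islower():
        return 'All small'
    if name.istitle():
        return 'Standard usage'
    # First character uppercase, every other cased character lowercase.
    if name[0].isupper() and name[1:].islower():
        return 'Only first'
    return 'Unidentified'


names = ['Donald trump', 'Barack Obama', 'Hillary ClintoN',
         'BILL GATES', 'jeff bezoz', 'naBih baWazir']
categories = [categorize(name) for name in names]
for name, category in zip(names, categories):
    print(name, '->', category)
```

With a pandas DataFrame, a function like this could be applied with df['Name'].map(categorize). Note that a single capitalised word such as "Madonna" satisfies both the title-case and the only-first checks; here it is reported as Standard usage because that test runs first.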

Removing \n \t from data scraped from website

I am trying to remove \n and \t that show up in data scraped from a webpage.
I have used the strip() function, however it doesn't seem to work for some reason.
My output still shows up with all the \ns and \ts.
Here's my code :
import urllib.request
from bs4 import BeautifulSoup
import sys
all_comments = []
max_comments = 10
base_url = 'https://www.mygov.in/'
next_page = base_url + '/group-issue/share-your-ideas-pm-narendra-modis-mann-ki-baat-26th-march-2017/'
while next_page and len(all_comments) < max_comments:
    response = urllib.request.urlopen(next_page)
    srcode = response.read()
    soup = BeautifulSoup(srcode, "html.parser")
    all_comments_div = soup.find_all('div', class_="comment_body")
    for div in all_comments_div:
        data = div.find('p').text
        data = data.strip(' \t\n')  # actual comment content
        data = ''.join([i for i in data if ord(i) < 128])
        all_comments.append(data)
    # getting the link of the stream for more comments
    next_page = soup.find('li', class_='pager-next first last')
    if next_page:
        next_page = base_url + next_page.find('a').get('href')
print('comments: {}'.format(len(all_comments)))
print(all_comments)
And here's the output I'm getting:
comments: 10
["Sir my humble submission is that please ask public not to man handle doctors because they work in a very delicate situation, to save a patient is not always in his hand. The incidents of manhandling doctors is increasing day by day and it's becoming very difficult to work in these situatons. Majority are not Opting for medical profession, it will create a crisis in medical field.In foreign no body can dare to manhandle a doctor, nurse, ambulance worker else he will be behind bars for 14 years.", 'Hello\n Sir.... Mera AK idea hai Jese bus ticket ki machine hai aur pata chalta hai ki din me kitni ticket nikali USSI TARH hum traffic police ko bhi aishi machine de to usee (1)JO MEMO DUPLICATE BANATE THE VO BHI NIKL JAYENGE MEANS A SAB LEGAL HO JAYEGA.... AUR HMARI SARKAR K TRAZERY ACCOUNT ME DIRECTLY CREDIT HO JANA CHI A TAKI SAB KO PATA CHALE KI HMARA JO TRAFIC POLICE NE FIND(DAND) LIYA HAI VO LIGALLY HAI... USEE\n1. SAB LOG TRAFIC STRIKLY FOLLOW KARENEGE...\n TAHNKYOU SIR..', 'Respect sir,\nI am Hindi teacher in one of the cbse school of Nagpur city.My question is that in 9th and10th STD. Why the subject HINDI is not compulsory. 
In the present pattern English language is Mandatory for students to learn but Our National Language HINDI is not .\nSir I request to update the pattern such that the Language hindi should be mandatory for the students of 9th and 10th.', 'Sir\nsuggestions AADHAR BASE SYSTEM\n1.Cash Less Education PAN India Centralised System\n2.Cash Less HEALTH POLICY for All & Centralised Rate MRP system\n3.All Private & Govt Hospitals must be CASH LESS\n4.All Toll Booth,Parking Etc CASHLESS Compulsory\n5.Compulsory HEALTH INsurance & AGRICULTURE Insurance for ALL\n6.All Bank, GOVT Sector, PVT Sector should produce Acknowledgements with TAT Mentioned\n7.Municipal Corporations/ZP must be CASH Less System\nAffordable Min Tax Housing\nCancel TDS', 'SIR KINDLY LOOK INTO MARITIME SECTOR SPECIALLY GOVERNMENT MARITIME TRAINING INSTITUTIONS REALLY CONDITIONS GOING WORST IT NEEDS IMMEDIATE CHANGES AND ATTENTION TO PROTECT OUR INDIAN REPUTATION IN MARITIME SECTOR.\nJAI HIND', ' ? ', ' 9 Central Government and Central Autonomous Bodies pensioners/ family pensioners 1 2016 , 1.1 .2017 ?', ' 9 / ', ' 9 Central Government and Central Autonomous Bodies pensioners/ family pensioners 1 2016 , 01.1 .2017 DOPPW/E/2017/03242, DOPPW/E/2017/04085, PMOPG/E/2017/0148952, PMOPG/E/2017/0115720 , PMOPG/E/2017/0123641 ', ' ,\n \n, ,\n\t Central Government and Central Autonomous Bodies pensioners/ family pensioners ? ']
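strip() only removes characters from the two ends of a string (whitespace by default, or the characters you pass in). To remove characters inside a string you need to use either replace or re.sub.
So change:
data = data.strip(' \t\n')
To:
import re
data = re.sub(r'[\t\n ]+', ' ', data).strip()
This collapses each run of \t, \n and space characters into a single space, so the \t and \n characters are removed.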
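The difference is easy to see on a small string. The sketch below uses a made-up sample value (raw) rather than the scraped data and compares the two calls side by side.

```python
import re

# Made-up sample with whitespace at both ends and in the middle.
raw = "  Hello\n Sir....\t1. item\n"

# strip() only trims the ends; the inner \n and \t are untouched.
stripped = raw.strip(" \t\n")

# re.sub() replaces every run of tab/newline/space, wherever it occurs.
collapsed = re.sub(r"[\t\n ]+", " ", raw).strip()

print(repr(stripped))
print(repr(collapsed))
```

Printing with repr makes the remaining escape sequences visible, which is also why they show up as \n and \t when a whole list is printed.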
Use replace instead of strip:
div = "\n blablabla \t blablabla"
div = div.replace('\n', '')
div = div.replace('\t', '')
print(repr(div))
Result:
' blablabla  blablabla'
Some explanation:
strip doesn't work in your case because it only removes the specified characters from the beginning or end of a string; it cannot remove them from the middle.
Some examples:
div = "\nblablabla blablabla"
div = div.strip('\n')
print(repr(div))
Result:
'blablabla blablabla'
A character anywhere between the start and the end will not be removed:
div = "blablabla \n blablabla"
div = div.strip('\n')
print(repr(div))
Result:
'blablabla \n blablabla'
You can split your text (which drops all whitespace, including tabs and newlines) and join the fragments back together, using a single space as the "glue":
data = " ".join(data.split())
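Applied to a whole list of comments, the split/join idiom might look like the sketch below. The helper name normalize_whitespace is an assumption, and the sample strings are short excerpts from the output shown in the question.

```python
def normalize_whitespace(text):
    """Collapse every run of whitespace into one space and trim the ends."""
    # str.split() with no argument splits on any whitespace and drops empty pieces.
    return " ".join(text.split())


# Short excerpts from the scraped comments in the question.
comments = ['Hello\n Sir....',
            ' ,\n \n, ,\n\t Central Government ',
            'Sir\nsuggestions']

cleaned = [normalize_whitespace(comment) for comment in comments]
print(cleaned)
```

Compared with the regex version, this also handles \r and other whitespace characters without having to list them in a character class.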
As others have mentioned, strip only removes characters from the start and end of a string. To remove specific characters (\t and \n in your case) from anywhere in the string, you need something else.
With regex (re) it's easily possible. Specify the pattern (to match the characters you need to replace). The method you need is sub (substitute):
import re
data = re.sub(r'[\t\n ]+', ' ', data)
re.sub(<pattern to replace>, <replacement>, <string>) - above, the pattern is [\t\n ]+ , where [ ] specifies a character class and + means one or more.
To handle sub and strip in a single statement:
data = re.sub(r'[\t\n ]+', ' ', data).strip()
The data: with \t and \n
["Sir my humble submission is that please ask public not to man handle doctors because they work in a very delicate situation, to save a patient is not always in his hand. The incidents of manhandling doctors is increasing day by day and it's becoming very difficult to work in these situatons. Majority are not Opting for medical profession, it will create a crisis in medical field.In foreign no body can dare to manhandle a doctor, nurse, ambulance worker else he will be behind bars for 14 years.", 'Hello\n Sir.... Mera AK idea hai Jese bus ticket ki machine hai aur pata chalta hai ki din me kitni ticket nikali USSI TARH hum traffic police ko bhi aishi machine de to usee (1)JO MEMO DUPLICATE BANATE THE VO BHI NIKL JAYENGE MEANS A SAB LEGAL HO JAYEGA.... AUR HMARI SARKAR K TRAZERY ACCOUNT ME DIRECTLY CREDIT HO JANA CHI A TAKI SAB KO PATA CHALE KI HMARA JO TRAFIC POLICE NE FIND(DAND) LIYA HAI VO LIGALLY HAI... USEE\n1. SAB LOG TRAFIC STRIKLY FOLLOW KARENEGE...\n TAHNKYOU SIR..', 'Respect sir,\nI am Hindi teacher in one of the cbse school of Nagpur city.My question is that in 9th and10th STD. Why the subject HINDI is not compulsory. 
In the present pattern English language is Mandatory for students to learn but Our National Language HINDI is not .\nSir I request to update the pattern such that the Language hindi should be mandatory for the students of 9th and 10th.', 'Sir\nsuggestions AADHAR BASE SYSTEM\n1.Cash Less Education PAN India Centralised System\n2.Cash Less HEALTH POLICY for All & Centralised Rate MRP system\n3.All Private & Govt Hospitals must be CASH LESS\n4.All Toll Booth,Parking Etc CASHLESS Compulsory\n5.Compulsory HEALTH INsurance & AGRICULTURE Insurance for ALL\n6.All Bank, GOVT Sector, PVT Sector should produce Acknowledgements with TAT Mentioned\n7.Municipal Corporations/ZP must be CASH Less System\nAffordable Min Tax Housing\nCancel TDS', 'SIR KINDLY LOOK INTO MARITIME SECTOR SPECIALLY GOVERNMENT MARITIME TRAINING INSTITUTIONS REALLY CONDITIONS GOING WORST IT NEEDS IMMEDIATE CHANGES AND ATTENTION TO PROTECT OUR INDIAN REPUTATION IN MARITIME SECTOR.\nJAI HIND', ' ? ', ' 9 Central Government and Central Autonomous Bodies pensioners/ family pensioners 1 2016 , 1.1 .2017 ?', ' 9 / ', ' 9 Central Government and Central Autonomous Bodies pensioners/ family pensioners 1 2016 , 01.1 .2017 DOPPW/E/2017/03242, DOPPW/E/2017/04085, PMOPG/E/2017/0148952, PMOPG/E/2017/0115720 , PMOPG/E/2017/0123641 ', ' ,\n \n, ,\n\t Central Government and Central Autonomous Bodies pensioners/ family pensioners ? ']
Test Run:
import re
data = ["Sir my humble submission is that please ask public not to man handle doctors because they work in a very delicate situation, to save a patient is not always in his hand. The incidents of manhandling doctors is increasing day by day and it's becoming very difficult to work in these situatons. Majority are not Opting for medical profession, it will create a crisis in medical field.In foreign no body can dare to manhandle a doctor, nurse, ambulance worker else he will be behind bars for 14 years.", 'Hello\n Sir.... Mera AK idea hai Jese bus ticket ki machine hai aur pata chalta hai ki din me kitni ticket nikali USSI TARH hum traffic police ko bhi aishi machine de to usee (1)JO MEMO DUPLICATE BANATE THE VO BHI NIKL JAYENGE MEANS A SAB LEGAL HO JAYEGA.... AUR HMARI SARKAR K TRAZERY ACCOUNT ME DIRECTLY CREDIT HO JANA CHI A TAKI SAB KO PATA CHALE KI HMARA JO TRAFIC POLICE NE FIND(DAND) LIYA HAI VO LIGALLY HAI... USEE\n1. SAB LOG TRAFIC STRIKLY FOLLOW KARENEGE...\n TAHNKYOU SIR..', 'Respect sir,\nI am Hindi teacher in one of the cbse school of Nagpur city.My question is that in 9th and10th STD. Why the subject HINDI is not compulsory. 
In the present pattern English language is Mandatory for students to learn but Our National Language HINDI is not .\nSir I request to update the pattern such that the Language hindi should be mandatory for the students of 9th and 10th.', 'Sir\nsuggestions AADHAR BASE SYSTEM\n1.Cash Less Education PAN India Centralised System\n2.Cash Less HEALTH POLICY for All & Centralised Rate MRP system\n3.All Private & Govt Hospitals must be CASH LESS\n4.All Toll Booth,Parking Etc CASHLESS Compulsory\n5.Compulsory HEALTH INsurance & AGRICULTURE Insurance for ALL\n6.All Bank, GOVT Sector, PVT Sector should produce Acknowledgements with TAT Mentioned\n7.Municipal Corporations/ZP must be CASH Less System\nAffordable Min Tax Housing\nCancel TDS', 'SIR KINDLY LOOK INTO MARITIME SECTOR SPECIALLY GOVERNMENT MARITIME TRAINING INSTITUTIONS REALLY CONDITIONS GOING WORST IT NEEDS IMMEDIATE CHANGES AND ATTENTION TO PROTECT OUR INDIAN REPUTATION IN MARITIME SECTOR.\nJAI HIND', ' ? ', ' 9 Central Government and Central Autonomous Bodies pensioners/ family pensioners 1 2016 , 1.1 .2017 ?', ' 9 / ', ' 9 Central Government and Central Autonomous Bodies pensioners/ family pensioners 1 2016 , 01.1 .2017 DOPPW/E/2017/03242, DOPPW/E/2017/04085, PMOPG/E/2017/0148952, PMOPG/E/2017/0115720 , PMOPG/E/2017/0123641 ', ' ,\n \n, ,\n\t Central Government and Central Autonomous Bodies pensioners/ family pensioners ? ']
data_out = []
for s in data:
    data_out.append(re.sub(r'[\t\n ]+', ' ', s).strip())
The output:
["Sir my humble submission is that please ask public not to man handle
doctors because they work in a very delicate situation, to save a
patient is not always in his hand. The incidents of manhandling
doctors is increasing day by day and it's becoming very difficult to
work in these situatons. Majority are not Opting for medical
profession, it will create a crisis in medical field.In foreign no
body can dare to manhandle a doctor, nurse, ambulance worker else he
will be behind bars for 14 years.", 'Hello Sir.... Mera AK idea hai
Jese bus ticket ki machine hai aur pata chalta hai ki din me kitni
ticket nikali USSI TARH hum traffic police ko bhi aishi machine de to
usee (1)JO MEMO DUPLICATE BANATE THE VO BHI NIKL JAYENGE MEANS A SAB
LEGAL HO JAYEGA.... AUR HMARI SARKAR K TRAZERY ACCOUNT ME DIRECTLY
CREDIT HO JANA CHI A TAKI SAB KO PATA CHALE KI HMARA JO TRAFIC POLICE
NE FIND(DAND) LIYA HAI VO LIGALLY HAI... USEE 1. SAB LOG TRAFIC
STRIKLY FOLLOW KARENEGE... TAHNKYOU SIR..', 'Respect sir, I am Hindi
teacher in one of the cbse school of Nagpur city.My question is that
in 9th and10th STD. Why the subject HINDI is not compulsory. In the
present pattern English language is Mandatory for students to learn
but Our National Language HINDI is not . Sir I request to update the
pattern such that the Language hindi should be mandatory for the
students of 9th and 10th.', 'Sir suggestions AADHAR BASE SYSTEM 1.Cash
Less Education PAN India Centralised System 2.Cash Less HEALTH POLICY
for All & Centralised Rate MRP system 3.All Private & Govt Hospitals
must be CASH LESS 4.All Toll Booth,Parking Etc CASHLESS Compulsory
5.Compulsory HEALTH INsurance & AGRICULTURE Insurance for ALL 6.All Bank, GOVT Sector, PVT Sector should produce Acknowledgements with TAT
Mentioned 7.Municipal Corporations/ZP must be CASH Less System
Affordable Min Tax Housing Cancel TDS', 'SIR KINDLY LOOK INTO MARITIME
SECTOR SPECIALLY GOVERNMENT MARITIME TRAINING INSTITUTIONS REALLY
CONDITIONS GOING WORST IT NEEDS IMMEDIATE CHANGES AND ATTENTION TO
PROTECT OUR INDIAN REPUTATION IN MARITIME SECTOR. JAI HIND', '?', '9
Central Government and Central Autonomous Bodies pensioners/ family
pensioners 1 2016 , 1.1 .2017 ?', '9 /', '9 Central Government and
Central Autonomous Bodies pensioners/ family pensioners 1 2016 , 01.1
.2017 DOPPW/E/2017/03242, DOPPW/E/2017/04085, PMOPG/E/2017/0148952,
PMOPG/E/2017/0115720 , PMOPG/E/2017/0123641', ', , , Central
Government and Central Autonomous Bodies pensioners/ family pensioners
?']
// using javascript
function removeLineBreaks(str = '') {
  return str.replace(/[\r\n\t]+/gm, '');
}
const data = removeLineBreaks('content \n some \t more \t content');
// data -> 'content  some  more  content' (the spaces around each removed character remain)

how to get string values using index method in Python

I need to parse the text of a text file into two categories:
University
Location (Example: Lahore, Peshawar, Jamshoro, Faisalabad)
but the text file contains the following text:
"Imperial College of Business Studies, Lahore"
"Government College University Faisalabad"
"Imperial College of Business Studies Lahore"
"University of Peshawar, Peshawar"
"University of Sindh, Jamshoro"
Code:
for l in content:
    rep = l.replace('"','')
    if ',' in rep:
        uni = rep.split(',')[0]
        loc = rep.split(',')[-1].strip()
    else:
        loc = rep.split(' ')[-1].strip()
        uni = rep.split(' ').index(loc)
It returns the following output, where 3 and 5 are index values rather than the university names:
3 represents Government College University
5 represents Imperial College of Business Studies
Uni: Imperial College of Business Studies Loc: Lahore
Uni: 3 Loc: Faisalabad
Uni: 5 Loc: Lahore
Uni: University of Peshawar Loc: Peshawar
Uni: University of Sindh Loc: Jamshoro
I want the program to return the string value that corresponds to the index values 3 and 5.
In the case where there is no comma, the lines
loc = rep.split(' ')[-1].strip()
uni = rep.split(' ').index(loc)
first find the location as the last element of the string, then tell you at what index the string occurs. What you want is everything but the last word in the string which you can get as
uni = ' '.join(rep.split()[:-1])
It might be better just to replace the , by '' to begin with so that there was only one case to deal with. Also, my inclination is to split the string only once.
words = rep.split() # the default is to split at whitespace
loc = words[-1]
uni = ' '.join(words[:-1])
So, I would write the loop like this:
for l in content:
    rep = l.strip('"').replace(',','')
    words = rep.split()
    loc = words[-1]
    uni = ' '.join(words[:-1])
    print(uni, loc)
This prints
('Imperial College of Business Studies', 'Lahore')
('Government College University', 'Faisalabad')
('Imperial College of Business Studies', 'Lahore')
('University of Peshawar', 'Peshawar')
('University of Sindh', 'Jamshoro')
which I take it is what you want.
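A slightly shorter variant of the same idea, offered only as a sketch, uses str.rsplit with a maximum of one split so the last word is separated from everything before it in a single call. The function name split_university_location is made up for this example; the sample lines are the ones from the question.

```python
def split_university_location(line):
    """Split 'University name[,] City' into (university, city).

    Assumes the city is always the last word on the line, as in the question.
    """
    cleaned = line.strip().strip('"').replace(',', '')
    # rsplit(None, 1) splits once, from the right, on any whitespace.
    university, location = cleaned.rsplit(None, 1)
    return university, location


content = ['"Imperial College of Business Studies, Lahore"',
           '"Government College University Faisalabad"',
           '"University of Sindh, Jamshoro"']

results = [split_university_location(line) for line in content]
for university, location in results:
    print('Uni:', university, 'Loc:', location)
```

Like the loop above, this breaks down for cities with more than one word, and a line containing a single word would raise a ValueError when unpacking.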
Just cycle through the lines and take the last element as the location:
content = [
    "Government College University Faisalabad",
    "Imperial College of Business Studies Lahore",
    "University of Peshawar, Peshawar",
    "University of Sindh, Jamshoro"]
locs = [l.split()[-1] for l in content]
print(locs)
['Faisalabad', 'Lahore', 'Peshawar', 'Jamshoro']

Python Regular Expression to Identify City Names Out Of Strings

Using regular expression in Python 3.4, how would I extract the city names from the following text below?
replacement windows in seattle wa
basement remodeling houston texas
siding contractor new york ny
windows in elk grove village
Sometimes the city name is preceded by \sin\s, sometimes it isn't. Sometimes it is preceded by a general word like 'windows', 'remodeling', ... anything. Sometimes there is no full state name or state abbreviation at the end.
Is there a single regular expression that can capture these above conditions?
Here's what I've tried so far but it only captures 'seattle'.
import re
l = ['replacement windows in seattle wa',
     'basement remodeling houston texas',
     'siding contractor new york ny',
     'windows in elk grove village'
    ]
for i in l:
    m = re.search(r'(?<=\sin\s)(.+)(?=\s(wa|texas|ny))', i)
    m.group(1)
What you are after is not possible with regular expressions. Regular expressions need string patterns to work. In your case, it would seem that the pattern either does not exist or can take a myriad of forms.
What you could do would be to use a search efficient data structure and split your string in words. You would then go through each word and see if it is in your search efficient data structure.
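A minimal sketch of that lookup approach follows, using a Python set as the search-efficient data structure. The set KNOWN_CITIES is a tiny hypothetical stand-in for a real list of city names, and find_city is a made-up helper. Because some city names span several words, the sketch checks word sequences (longest first) rather than single words.

```python
# Hypothetical lookup table; a real one would be loaded from a list of city names.
KNOWN_CITIES = {"seattle", "houston", "new york", "elk grove village"}
MAX_WORDS = max(len(city.split()) for city in KNOWN_CITIES)


def find_city(query, cities=KNOWN_CITIES, max_words=MAX_WORDS):
    """Return the longest known city name found in the query, or None."""
    words = query.lower().split()
    # Try longer word sequences first so 'elk grove village' beats any shorter match.
    for size in range(min(max_words, len(words)), 0, -1):
        for start in range(len(words) - size + 1):
            candidate = " ".join(words[start:start + size])
            if candidate in cities:
                return candidate
    return None


queries = ['replacement windows in seattle wa',
           'basement remodeling houston texas',
           'siding contractor new york ny',
           'windows in elk grove village']
found = [find_city(query) for query in queries]
print(found)
```

Set membership tests are constant time on average, so the cost per query depends only on the number of words in it, not on the size of the city list.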
import re
l = ['replacement windows in seattle wa',
     'basement remodeling houston texas',
     'siding contractor newyork ny',
     'windows in elk grove village']
p = re.compile(r"(\w+)\s(?:(wa | texas | ny | village))", re.VERBOSE)
for words in l:
    print(p.search(words).expand(r"\g<1> <-- the code is --> \g<2>"))
