How to get string values using the index method in Python

I need to parse the text of a text file into two categories:
University
Location (for example: Lahore, Peshawar, Jamshoro, Faisalabad)
but the text file contains the following text:
"Imperial College of Business Studies, Lahore"
"Government College University Faisalabad"
"Imperial College of Business Studies Lahore"
"University of Peshawar, Peshawar"
"University of Sindh, Jamshoro"
Code:
for l in content:
    rep = l.replace('"','')
    if ',' in rep:
        uni = rep.split(',')[0]
        loc = rep.split(',')[-1].strip()
    else:
        loc = rep.split(' ')[-1].strip()
        uni = rep.split(' ').index(loc)
It returns the following output, where 3 and 5 are index values rather than the university names (3 stands for Government College University and 5 for Imperial College of Business Studies):
Uni: Imperial College of Business Studies Loc: Lahore
Uni: 3 Loc: Faisalabad
Uni: 5 Loc: Lahore
Uni: University of Peshawar Loc: Peshawar
Uni: University of Sindh Loc: Jamshoro
I want the program to return the string values instead of the index values 3 and 5.

In the case where there is no comma, the lines
loc = rep.split(' ')[-1].strip()
uni = rep.split(' ').index(loc)
first find the location as the last word of the string, and then tell you at what index that word occurs in the list of words. What you want is everything but the last word, which you can get as
uni = ' '.join(rep.split()[:-1])
It might be better just to replace the ',' with '' to begin with, so that there is only one case to deal with. Also, my inclination is to split the string only once:
words = rep.split() # the default is to split at whitespace
loc = words[-1]
uni = ' '.join(words[:-1])
So, I would write the loop like this:
for l in content:
    rep = l.strip('"').replace(',','')
    words = rep.split()
    loc = words[-1]
    uni = ' '.join(words[:-1])
    print(uni, loc)
This prints
('Imperial College of Business Studies', 'Lahore')
('Government College University', 'Faisalabad')
('Imperial College of Business Studies', 'Lahore')
('University of Peshawar', 'Peshawar')
('University of Sindh', 'Jamshoro')
which I take it is what you want.
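If content is read straight from the text file mentioned in the question, loading it might look like the sketch below (the filename universities.txt is made up for illustration):
with open('universities.txt') as f:
    content = [line.strip() for line in f]   # strip() drops the newline so the quote-stripping above works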

Just cycle through and take the last element as the location:
content = [
    "Government College University Faisalabad",
    "Imperial College of Business Studies Lahore",
    "University of Peshawar, Peshawar",
    "University of Sindh, Jamshoro"]
locs = [l.split()[-1] for l in content]
print locs
['Faisalabad', 'Lahore', 'Peshawar', 'Jamshoro']
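If you also need the university names, the same one-liner idea extends to everything before the last word; a sketch, stripping quotes and commas as in the first answer so that it also handles the comma-separated lines:
unis = [' '.join(l.strip('"').replace(',', '').split()[:-1]) for l in content]
# ['Government College University', 'Imperial College of Business Studies',
#  'University of Peshawar', 'University of Sindh']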

Related

Making a single line of command to correct paragraph upper/lower cases

I've been solving a challenge that requires fixing an input paragraph so that it is properly capitalized. This is my code:
return ''.join([in_text[0].upper()] + [in_text[i].lower() if in_text[i-1] != '.' or in_text[i-2] != ' ' or in_text[i].islower() else in_text[i].upper() for i in range(1, len(in_text))])
However, it was only able to capitalize the first letter of the paragraph and not the first letter of the sentence.
Let's say you had the following paragraph:
text_example = "australia is a country and continent located in the southern hemisphere.it is the world's sixth-largest country by total area and has the world's largest coral reef system, the Great Barrier Reef. australia is known for its diverse landscapes and unique wildlife, such as kangaroos and koalas."
As can be seen, the starting word for each sentence is in lowercase.
We can fix this by:
Splitting the paragraph into sentences (using regex)
Using the capitalize function to capitalize the first word of each sentence
Joining the sentences back into a paragraph
The regex (?<=[.!?])[\s]* will split when it finds either
a period .,
an exclamation mark !,
or a question mark ?,
followed by zero or more whitespace characters [\s]*.
Here is the code:
import re

def fix_paragraph(pgraph):
    # split the paragraph into sentences
    sentences = re.split(r"(?<=[.!?])[\s]*", pgraph)
    # capitalize the first letter of each sentence
    sentences = [sentence[0].capitalize() + sentence[1:] if len(sentence) > 0 else "" for sentence in sentences]
    # join the sentences back into a single paragraph and return
    text = " ".join(sentences)
    return text
text_example = "australia is a country and continent located in the southern hemisphere.it is the world's sixth-largest country by total area and has the world's largest coral reef system, the Great Barrier Reef. australia is known for its diverse landscapes and unique wildlife, such as kangaroos and koalas."
fixed_text = fix_paragraph(text_example)
print(fixed_text)
Output:
Australia is a country and continent located in the southern hemisphere. It is the world's sixth-largest country by total area and has the world's largest coral reef system, the Great Barrier Reef. Australia is known for its diverse landscapes and unique wildlife, such as kangaroos and koalas.

Find whether a sentence has the starting words of another sentence or the ending words of the same sentence

For example, I have a set of sentences like this:
New York is in New York State
D.C. is the capital of United States
The weather is cool in the south of that country.
Lets take a bus to get to point b from point a.
And another sentence like this:
is cool in the south of that country
The output should be: The weather is cool in the south of that country.
If I have an input like "of United States The weather is cool", the output should be:
D.C. is the capital of United States The weather is cool in the south of that country.
So far I tried difflib and got the overlap, but that doesn't quite solve the problem in all cases.
You could build a dictionary of starting expressions and ending expressions from the sentences. Then find a prefix and suffix for the sentence to extend in these dictionaries. In both cases you would need to build/check one key for each substring of words starting from the beginning and from the end:
sentences="""New York is in New York State
D.C. is the capital of United States
The weather is cool in the south of that country
Lets take a bus to get to point b from point a""".split("\n")
ends = { tuple(sWords[i:]):sWords[:i] for s in sentences
for sWords in [s.split()] for i in range(len(sWords)) }
starts = { tuple(sWords[:i]):sWords[i:] for s in sentences
for sWords in [s.split()] for i in range(1,len(sWords)+1) }
def extendSentence(sentence):
sWords = sentence.split(" ")
prefix = next( (ends[p] for i in range(1,len(sWords)+1)
for p in [tuple(sWords[:i])] if p in ends),
[])
suffix = next( (starts[p] for i in range(len(sWords))
for p in [tuple(sWords[i:])] if p in starts),
[])
return " ".join(prefix + [sentence] + suffix)
Output:
print(extendSentence("of United States The weather is cool"))
# D.C. is the capital of United States The weather is cool in the south of that country
print(extendSentence("is cool in the south of that country"))
# The weather is cool in the south of that country
Note that I had to remove the periods at the end of the sentences because they prevent matching. You will need to clean these up in the dictionary-building step, as sketched below.
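For example, one way to do that cleanup (a sketch, assuming the trailing punctuation is limited to '.', '!' and '?') is to strip it before building the dictionaries:
cleaned = [s.rstrip('.!? ') for s in sentences]
ends = { tuple(sWords[i:]): sWords[:i] for s in cleaned
         for sWords in [s.split()] for i in range(len(sWords)) }
starts = { tuple(sWords[:i]): sWords[i:] for s in cleaned
           for sWords in [s.split()] for i in range(1, len(sWords)+1) }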

How to assign a variable to each line from a repeating text pattern in Python?

I have a Python scraping script to get info about some upcoming concerts, and the text follows the same pattern every time, no matter how many concerts appear. Each line always refers to one particular piece of information, as in this example (please note that there are no blank lines between concerts; my data is exactly in this format):
01/01/99 9PM
Iron Maiden
Madison Square Garden
New York City
01/01/99 9.30PM
The Doors
Staples Center
Los Angeles
01/02/99 8.45PM
Dr Dre & Snoop Dogg
Staples Center
Los Angeles
01/02/99 9PM
Diana Ross
City Hall
New York City etc.
For each line I need to assign it to a variable, so 4 variables in total:
time = all the 1st lines
name = all the 2nd lines
location = all the 3rd lines
city = all the 4th lines
Then loop through all the lines to catch the information corresponding to each variable, such as getting all the dates from the 1st lines, all the names from the 2nd lines, etc.
So far I haven't found any solution, and I barely know anything about regex.
I hope you see the idea; don't hesitate if you have any questions. Thanks.
No need to use regex:
string = '''01/01/99 9PM
Iron Maiden
Madison Square Garden
New York City
01/01/99 9.30PM
The Doors
Staples Center
Los Angeles
01/02/99 8.45PM
Dr Dre & Snoop Dogg
Staples Center
Los Angeles
01/02/99 9PM
Diana Ross
City Hall
New York City
'''
lines = string.split('\n')
dates = [i for i in lines[0::4]]
bands = [i for i in lines[1::4]]
places = [i for i in lines[2::4]]
cities = [i for i in lines[3::4]]
This will give you a list of dates/bands/places/cities, which will be easier to work with.
If you want to turn them back into a string, you could do:
'; '.join(dates) #Do the same for all 4 variables
Which gives:
'01/01/99 9PM; 01/01/99 9.30PM; 01/02/99 8.45PM; 01/02/99 9PM; '
You could replace '; ' with ' ' if you only want them space separated, or with whatever separator you like.
I would personally use namedtuples. Note that I put your data in a file called input.txt.
from collections import namedtuple

Entry = namedtuple("Entry", "time name location city")

with open('input.txt') as f:
    lines = [line.strip() for line in f]

objects = [Entry(*lines[i:i+4]) for i in range(0, len(lines), 4)]

print(*objects, sep='\n')
for obj in objects:
    print(obj.name)
Output:
Entry(time='01/01/99 9PM', name='Iron Maiden', location='Madison Square Garden', city='New York City')
Entry(time='01/01/99 9.30PM', name='The Doors', location='Staples Center', city='Los Angeles')
Entry(time='01/02/99 8.45PM', name='Dr Dre & Snoop Dogg', location='Staples Center', city='Los Angeles')
Entry(time='01/02/99 9PM', name='Diana Ross', location='City Hall', city='New York City')
Iron Maiden
The Doors
Dr Dre & Snoop Dogg
Diana Ross
This calls for slicing:
times = lines[0::4]
names = lines[1::4]
locations = lines[2::4]
cities = lines[3::4]
And now we can zip those lists into tuples:
events = zip(*[times, names, locations, cities])
With your sample data, this gives us
>>> list(events)
[('01/01/99 9PM', 'Iron Maiden', 'Madison Square Garden ', 'New York City'), ('01/01/99 9.30PM', 'The Doors', 'Staples Center', 'Los Angeles'), ('01/02/99 8.45PM', 'Dr Dre & Snoop Dogg', 'Staples Center', 'Los Angeles'), ('01/02/99 9PM', 'Diana Ross', 'City Hall', 'New York City')]
You can now process these tuples into any data structure that suits your use case best.
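For example, a sketch of one such structure, a list of dicts keyed with the field names used in the question (time, name, location, city):
keys = ('time', 'name', 'location', 'city')
concerts = [dict(zip(keys, event)) for event in zip(times, names, locations, cities)]
print(concerts[0]['name'])   # Iron Maiden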

regular expressions (regex) save parts of sentence

New to Python and regular expressions, I have been trying to find a way to parse a sentence so that I can take parts of it and assign them to their own variables.
An example sentence is: Laura Compton, a Stock Broker from Los Angeles, California
My objective is to have: name = "Laura Compton" (this one is the easy one, I can target the anchor link no problem), position = "Stock Broker", city = Los Angeles, state = California
All of the sentences I need to iterate over follow the same pattern: the name is always in the anchor tag, the position always follows the comma after the closing anchor (sometimes it uses "a" or "an", which I would like to strip off), and the city and state always follow the word "from".
You can use named groups within patterns to capture substrings, which makes referring to them easier and the code doing so slightly more readable:
import re

data = ['Laura Compton, a Stock Broker from Los Angeles, California',
        'Miles Miller, a Soccer Player from Seattle, Washington']

pattern = (r'^(?P<name>[^,]+)\, an? (?P<position>.+) from '
           r'(?P<city>[^,]+)\, +(?P<state>.+)')

FIELDS = 'name', 'position', 'city', 'state'

for sentence in data:
    matches = re.search(pattern, sentence)
    name, position, city, state = matches.group(*FIELDS)
    print(', '.join([name, position, city, state]))
Output produced from sample data:
Laura Compton, Stock Broker, Los Angeles, California
Miles Miller, Soccer Player, Seattle, Washington
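Since the pattern uses named groups, groupdict() on the match object also returns the captured fields as a dict, if that is more convenient than separate variables (a small variation on the loop above):
for sentence in data:
    fields = re.search(pattern, sentence).groupdict()
    print(fields['name'], fields['position'], fields['city'], fields['state'])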
A.M. Kuchling wrote a good tutorial titled Regular Expression HOWTO that you ought to check out.
You can try this:
import re

s = "Laura Compton, a Stock Broker from Los Angeles, California"
new_s = re.findall(r'^[a-zA-Z\s]+|(?<=a\s)[a-zA-Z\s]+(?=from)|(?<=an\s)[a-zA-Z\s]+(?=from)|(?<=from\s)[a-zA-Z\s]+(?=,)|(?<=,\s)[a-zA-Z\s]+$', s)
headers = ['name', 'title', 'city', 'state']
data = {a: b for a, b in zip(headers, new_s)}
Output:
{'city': 'Los Angeles', 'state': 'California', 'name': 'Laura Compton', 'title': 'Stock Broker '}
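If the trailing space in 'Stock Broker ' is unwanted, a strip() inside the dictionary comprehension removes it (a small tweak to the code above):
data = {a: b.strip() for a, b in zip(headers, new_s)}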

Python list manipulation based on indexing

I have two lists:
The first list consists of all the titles of various publications, whereas the second list consists of all the author names.
list B = ['Moe Terry M 2005 ', 'March James G and Johan P Olsen 2006 ', 'Kitschelt Herbert 2000 ', 'Bates Robert H 1981 ' , .......]
list A = ['"Linkages between Citizens and Politicians in Democratic Polities,"', '"Winners Take All: The Politics of Partial Reform in Postcommunist \n\nTransitions,"', '"Inequality, Social Insurance, and \n\nRedistribution."', '"Majoritarian Electoral Systems and \nConsumer Power: Price-Level Evidence from the OECD Countries."']
I am running scholar.py as a bash command. The syntax goes like this:
scholar = "python scholar.py -c 1 --author " + str(name) + " --phrase " + str(title)
Now, what I am trying to do is get each title and author in order so that I can use them with scholar.
But I am not able to figure out how I can pair the first author name with the first title.
I would have used indexing if the lists were small.
Is this what you are looking for?
B = ['Moe Terry M 2005 ', 'March James G and Johan P Olsen 2006 ', 'Kitschelt Herbert 2000 ', 'Bates Robert H 1981 ', .......]
A = ['"Linkages between Citizens and Politicians in Democratic Polities,"', '"Winners Take All: The Politics of Partial Reform in Postcommunist \n\nTransitions,"', '"Inequality, Social Insurance, and \n\nRedistribution."', '"Majoritarian Electoral Systems and \nConsumer Power: Price-Level Evidence from the OECD Countries."']
for i, j in zip(B, A):
    print i, j    # Python 2.x
    print(i, j)   # Python 3.x
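To actually run scholar.py for each pair, the same zip can feed a subprocess call. This is only a sketch, assuming scholar.py sits in the working directory and takes the -c, --author, and --phrase options exactly as written in the question:
import subprocess

for name, title in zip(B, A):
    # strip stray whitespace, newlines, and quotes so the arguments are clean
    subprocess.call(["python", "scholar.py", "-c", "1",
                     "--author", name.strip(),
                     "--phrase", title.strip('" \n')])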
