I'm very new to web scraping and I'm scraping a Billboard page that lists the top 10 summer songs for each year from 1958 to 2021. My main goal is to end up with a dictionary with the year as the key and a list of that year's 10 songs as the associated value.
{"1958": ["NEL BLU DIPINTO DI BLU (VOLARÉ)", ...], "1959": ["LONELY BOY", ...]}
What I have so far is a list of each year and their songs, where each value in the list is multiple lines and appears as follows:
1958Rank, Title, Artist
1, NEL BLU DIPINTO DI BLU (VOLARÉ), Domenico Modugno
2, POOR LITTLE FOOL, Ricky Nelson
3, PATRICIA, Perez Prado And His Orchestra
4, LITTLE STAR, The Elegants
5, MY TRUE LOVE, Jack Scott
6, JUST A DREAM, Jimmy Clanton And His Rockets
7, WHEN, Kalin Twins
8, BIRD DOG, The Everly Brothers
9, SPLISH SPLASH, Bobby Darin
10, REBEL-‘ROUSER, Duane Eddy His Twangy Guitar And The Rebels
Is there any way to extract just the song titles and add them to a separate list? I'm thinking it could be done either by checking whether a substring is fully capitalized, since the song titles are in all caps, or by checking whether the substring sits between two commas, since each title is placed between the comma after its rank and the comma at the end of the title.
The link for the Billboard website is attached here:
https://www.billboard.com/pro/summer-songs-1985-present-top-10-tunes-each-summer-listen/
There is no need for regex - To get your expected output select only the <p> that has an <strong> and iterate over its texts [s.split(', ')[1] for s in p.find_all(text=True)[2:]]:
from bs4 import BeautifulSoup
import pandas as pd
import requests
url = 'https://www.billboard.com/pro/summer-songs-1985-present-top-10-tunes-each-summer-listen/'
doc = BeautifulSoup(requests.get(url).text, 'html.parser')
data = []
# Each year is a <p> that contains a <strong> with the year
for p in doc.select('.pmc-paywall p:has(strong)'):
    data.append({
        # Skip the year and the header line, then keep the second field (the title)
        p.strong.text: [s.split(', ')[1] for s in p.find_all(text=True)[2:]]
    })
print(data)
Output:
[{'1958': ['NEL BLU DIPINTO DI BLU (VOLARÉ)', 'POOR LITTLE FOOL', 'PATRICIA', 'LITTLE STAR', 'MY TRUE LOVE', 'JUST A DREAM', 'WHEN', 'BIRD DOG', 'SPLISH SPLASH', 'REBEL-‘ROUSER']}, {'1959': ['LONELY BOY', 'THE BATTLE OF NEW ORLEANS', 'A BIG HUNK O’ LOVE', 'MY HEART IS AN OPEN BOOK', 'THE THREE BELLS', 'PERSONALITY', 'THERE GOES MY BABY', 'LAVENDER-BLUE', 'WATERLOO', 'TIGER']}, {'1960': ['I’M SORRY', 'IT’S NOW OR NEVER', 'EVERYBODY’S SOMEBODY’S FOOL', 'ALLEY-OOP', 'ITSY BITSY TEENIE WEENIE YELLOW POLKADOT BIKINI', 'ONLY THE LONELY (KNOW HOW I FEEL)', 'WALK — DON’T RUN', 'CATHY’S CLOWN', 'MULE SKINNER BLUES', 'BECAUSE THEY’RE YOUNG']},...]
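If the per-year strings from the question are already in hand, a single dictionary can also be built from them directly. This is only a minimal sketch and not part of the answer above: `parse_block`, `blocks` and `songs_by_year` are hypothetical names, and it assumes every block starts with the four-digit year fused to the header line and separates fields with ", " as in the 1958 sample.

```python
def parse_block(block):
    """Return (year, titles) for one multi-line block like the 1958 sample."""
    lines = block.strip().splitlines()
    year = lines[0][:4]  # "1958Rank, Title, Artist" -> "1958"
    # Field 1 of "rank, title, artist" is the title. A title that itself
    # contains ", " would be cut short by this split.
    titles = [line.split(", ")[1] for line in lines[1:]]
    return year, titles

# Shortened sample data (assumption: same layout as in the question)
blocks = [
    "1958Rank, Title, Artist\n"
    "1, NEL BLU DIPINTO DI BLU (VOLARÉ), Domenico Modugno\n"
    "2, POOR LITTLE FOOL, Ricky Nelson\n"
    "3, PATRICIA, Perez Prado And His Orchestra",
]

songs_by_year = dict(parse_block(b) for b in blocks)
print(songs_by_year)
```

This yields one dictionary keyed by year instead of a list of single-key dictionaries, which is closer to the shape asked for in the question.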
One approach to get a bit more structured data including rank and artist that you can use to build a dataframe easily could be:
...
data = []
for p in doc.select('.pmc-paywall p:has(strong)'):
    for s in [dict(zip(p.find_all(text=True)[1].split(','), s.strip().split(', '))) for s in p.find_all(text=True)[2:]]:
        s.update({'year': p.strong.text})
        data.append(s)
pd.DataFrame(data)
Rank  Title                            Artist                         year
1     NEL BLU DIPINTO DI BLU (VOLARÉ)  Domenico Modugno               1958
2     POOR LITTLE FOOL                 Ricky Nelson                   1958
3     PATRICIA                         Perez Prado And His Orchestra  1958
4     LITTLE STAR                      The Elegants                   1958
5     MY TRUE LOVE                     Jack Scott                     1958
....
I have this string and want to turn it into two arrays, one has the film title and the other one has the year. Their positions in the array need to correspond with each other. Is there a way to do this?
films = ("""Endless Love (1981), Top Gun (1986), The Color of Money (1986), Rain Man (1988),
Born on the Fourth of July (1989), Interview with the Vampire: The Vampire Chronicles (1994),
Mission: Impossible (1996), Jerry Maguire (1996), The Matrix (1999), Mission: Impossible II (2000),
Vanilla Sky (2001), Cocktail (1988), A Few Good Men (1992), The Firm (1993), Eyes Wide Shut (1999),
Magnolia (1999), Minority Report (2002), Austin Powers in Goldmember (2002), Days of Thunder (1990),
The Powers of Matthew Star (1982), Cold Mountain (2003), The Talented Mr. Ripley (1999),
War of the Worlds (2005), The Oprah Winfrey Show (1986), Far and Away (1992), Taps (1981),
The Last Samurai (2003), Valkyrie (2008), Jack Reacher (2012), Edge of Tomorrow (2014),
Enemy of the State (1998), Mission: Impossible III (2006), Crimson Tide (1995), Reign Over Me (2007),
Batman Forever (1995), Batman Begins (2005), The Simpsons (1989), The Simpsons: Brother from the Same Planet (1993),
The Simpsons: When You Dish Upon a Star (1998), End of Days (1999), House of D (2004), The Indian Runner (1991),
Harry & Son (1984), Mission: Impossible - Ghost Protocol (2011), Aladdin (1992), Pacific Rim (2013),
Oblivion (2013), Knight and Day (2010),
""")
First split the input string on comma to generate a list, then use comprehensions to get the title and year as separate lists.
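import re
# Drop empty pieces: the string ends with a comma, which would otherwise raise an IndexError below
films_list = [x for x in re.split(r',\s*', films.strip()) if x]
titles = [re.split(r'\s*(?=\(\d+\))', x)[0] for x in films_list]
# Each year keeps its parentheses, e.g. '(1981)'
years = [re.split(r'\s*(?=\(\d+\))', x)[1] for x in films_list]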
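A variation on the same idea, offered only as a sketch: `re.findall` with two capture groups can return title and year together, so the two lists stay aligned by construction. The shortened `films` sample and the pattern are assumptions; the pattern expects four-digit years and titles without commas or parentheses, which holds for the list in the question.

```python
import re

# Shortened sample (assumption: same layout as the string in the question)
films = """Endless Love (1981), Top Gun (1986),
Mission: Impossible (1996), Knight and Day (2010),
"""

# Group 1: the title (no commas or parentheses); group 2: a four-digit year
pairs = re.findall(r"\s*([^,()]+?)\s*\((\d{4})\)", films)

titles = [title for title, _ in pairs]
years = [year for _, year in pairs]

print(titles)
print(years)
```

Because only complete "title (year)" pairs match, the trailing comma and the line breaks need no special handling here.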
Tim's answer works well. Here is an alternative for anyone who would like to solve the problem without using regex.
a = films.split(",")
years = []
for i in a:
    years.append(i[i.find("(")+1:i.find(")")])
Same approach can be applied for titles.
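As a rough sketch of how titles and years could be collected together without regex, the loop below uses `str.rpartition` instead of `find`; that substitution, the shortened `films` sample and the variable names are assumptions made for illustration.

```python
# Shortened sample (assumption: same layout as the string in the question)
films = """Endless Love (1981), Top Gun (1986),
Mission: Impossible (1996), Knight and Day (2010),
"""

titles, years = [], []
for entry in films.split(","):
    entry = entry.strip()
    if not entry:
        continue  # skip the empty piece left by the trailing comma
    # Split at the last " (" so the title keeps everything before the year
    title, _, year = entry.rpartition(" (")
    titles.append(title)
    years.append(year.rstrip(")"))

print(titles)
print(years)
```

Splitting at the last " (" keeps this working even if a title contained an earlier parenthesis, although it would still break on titles that contain commas.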
You can do something like this (without any import, extra module, or regex complexity):
delimiter = ", "
# Turn the line breaks into spaces so that every entry is separated by ", "
movies_with_year = films.replace("\n", " ").split(delimiter)
movies = []
years = []
for movie_with_year in movies_with_year:
    if not movie_with_year:
        continue  # skip the empty piece left by the trailing comma
    movie = movie_with_year[:-6]
    year = movie_with_year[-6:].replace("(", "").replace(")", "")
    movies.append(movie)
    years.append(year)
This script will result in something like this:
movies : ['Endless Love ', ...]
years : ['1981', ...]
You should remove all newline characters (\n) and use try/except to get past the problem with the last element.
films = ("""Endless Love (1981), Top Gun (1986), The Color of Money (1986), Rain Man (1988),
Born on the Fourth of July (1989), Interview with the Vampire: The Vampire Chronicles (1994),
Mission: Impossible (1996), Jerry Maguire (1996), The Matrix (1999), Mission: Impossible II (2000),
Vanilla Sky (2001), Cocktail (1988), A Few Good Men (1992), The Firm (1993), Eyes Wide Shut (1999),
Magnolia (1999), Minority Report (2002), Austin Powers in Goldmember (2002), Days of Thunder (1990),
The Powers of Matthew Star (1982), Cold Mountain (2003), The Talented Mr. Ripley (1999),
War of the Worlds (2005), The Oprah Winfrey Show (1986), Far and Away (1992), Taps (1981),
The Last Samurai (2003), Valkyrie (2008), Jack Reacher (2012), Edge of Tomorrow (2014),
Enemy of the State (1998), Mission: Impossible III (2006), Crimson Tide (1995), Reign Over Me (2007),
Batman Forever (1995), Batman Begins (2005), The Simpsons (1989), The Simpsons: Brother from the Same Planet (1993),
The Simpsons: When You Dish Upon a Star (1998), End of Days (1999), House of D (2004), The Indian Runner (1991),
Harry & Son (1984), Mission: Impossible - Ghost Protocol (2011), Aladdin (1992), Pacific Rim (2013),
Oblivion (2013), Knight and Day (2010),
""")
movies = []
years = []
for item in films.replace("\n", "").split("),"):
    try:
        movies.append(item.split(" (")[0])
        years.append(item.split(" (")[-1])
    except:
        ...
I'm trying to append a string of words into a list however when I try to index that list, it gives back individual letters.
For example:
url = 'https://almostginger.com/famous-movie-locations/'
titles = []
r = requests.get(url)
soup = BeautifulSoup(r.content, 'lxml')
titles = soup.select('h3')
for t in titles:
    tt = t.text.strip()
    for s in range(len(tt)):
        print(s)
Shows that only individual letters are indexed, whereas if I'm trying to create a list, I get the error:
titles.append(tt)
AttributeError: 'str' object has no attribute 'text'
Expected outcome:
'Café des Deux Moulins as seen in Amélie (2001)',
'Royal Palace of Caserta as seen in Angels and Demons (2009)'
You get an error simply because of a duplicate variable name. Change one of the titles to something else.
import requests
from bs4 import BeautifulSoup
url = 'https://almostginger.com/famous-movie-locations/'
titles_ = []
r = requests.get(url)
soup = BeautifulSoup(r.content, 'lxml')
titles = soup.select('h3')
for t in titles:
    tt = t.text.strip()
    titles_.append(tt)
titles_
Output:
['Café des Deux Moulins as seen in Amélie (2001)',
'Royal Palace of Caserta as seen in Angels and Demons (2009)',
'Odesa Steps as seen in Battleship Potemkin (1926)',
'Promenade Plantée as seen in Before Sunset (2004)',
'Curracloe\xa0Beach as seen in Brooklyn (2015)',
'Belfry of Bruges as seen in In Bruges (2008)',
'Sirmione as seen in Call Me By Your Name (2017)',
'Villa del Balbianello as seen in Casino Royale (2006)',
'Neuschwanstein Castle as seen in Chitty Chitty Bang Bang (1968)',
'Nyhavn Harbour as seen in The Danish Girl (2015)',
'Rosslyn Chapel as seen in The Da Vinci Code (2006)',
'Highclere Castle as seen in Downton Abbey (2010-2019)',
'Juvet Landscape Hotel as seen in Ex Machina (2014)',
'Mini Hollywood as seen in For a Few Dollars More (1965)',
'The Dark Hedges as seen in Game of Thrones (2011-2019)',
'Kaufhaus Görlitz as seen in The Grand Budapest Hotel (2014)',
'Bar Vitelli as seen in The Godfather (1972)',
'Glenfinnan Viaduct as seen in Harry Potter and the Chamber of Secrets (2002)',
'Old Royal Naval College as seen in The King’s Speech (2010)',
'Trevi Fountain as seen in La Dolce Vita (1960)',
'Juliet’s House as seen in Letters to Juliet (2010)',
'Church of Agios Ioannis Kastri as seen in Mamma Mia! (2008)',
'Palace of Versailles as seen in Marie Antoinette (2006)',
'Shakespeare & Company Bookshop as seen in Midnight in Paris (2011)',
'Doune Castle as seen in Monty Python and The Holy Grail (1975)',
'The Notting Hill Bookshop as seen in Notting Hill (1999)',
'Belchite as seen in Pan’s Labyrinth (2006)',
'Umschlagplatz as seen in The Pianist (2002)',
'Popeye Village as seen in Popeye (1980)',
'Cliffs of Moher as seen in The Princess Bride (1987)',
'Wicklow Mountains in P.S. I Love You (2007)',
'Mouth of Truth as seen in Roman Holiday (1953)',
'Piłsudskiego Bridge as seen in Schindler’s List (1993)',
'Kirkjufell Mountain as seen in The Secret Life of Walter Mitty (2013)',
'Residenzplatz as seen in The Sound of Music (1965)',
'The Fairy Glen as seen in Stardust (2007)',
'Skellig Michael as seen in Star Wars Episode VIII: The Last Jedi (2017)',
'Spanish Steps as seen in The Talented Mr Ripley (1999)',
'Riesenrad Ferris Wheel as seen in The Third Man (1949)',
'Hotel Carlton as seen in To Catch a Thief (1955)',
'Tibidabo Amusement Park as seen in Vicky Cristina Barcelona (2008)',
'Haweswater Reservoir as seen in Withnail & I (1989)',
'Aït Benhaddou as seen in Gladiator (2000)',
'Masai Mara as seen in Out of Africa (1985)',
'Sidi Idriss Hotel as seen in Star Wars Episode III: A New Hope (1977)',
'Maya Bay as seen in The Beach (2000)',
'Hongcun Ancient Village as seen in Crouching Tiger Hidden Dragon (2000)',
'Lebua State Tower as seen in The Hangover Part II (2011)',
'Petra as seen in Indiana Jones and the Last Crusade (1989)',
'Angkor Thom as seen in Lara Croft: Tomb Raider (2001)',
'Park Hyatt Hotel as seen in Lost in Translation (2003)',
'Phang Nga Bay as seen in The Man with the Golden Gun (1974)',
'Burj Khalifa as seen in Mission: Impossible – Ghost Protocol (2011)',
'Chhatrapati Shivaji Maharaj Terminus as seen in Slumdog Millionaire (2008)',
'King’s Canyon as seen in The Adventures of Priscilla, Queen of the Desert (1994)',
'Hobbiton as seen in The Lord of the Rings: The Fellowship of the Ring (2001)',
'Pine Oak Court as seen in Neighbours (1985-Present)',
'Devil’s Tower as seen in Close Encounters of the Third Kind (1977)',
'Route 66 as seen in Easy Rider (1969)',
'Art Institute of Chicago as seen in Ferris Bueller’s Day Off (1986)',
'Monument Valley as seen in Forrest Gump (1994)',
'New York Public Library as seen in Ghostbusters (1984)',
'Salvation Mountain as seen in Into the Wild (2007)',
'Martha’s Vineyard as seen in Jaws (1975)',
'Griffith Observatory as seen in Rebel Without a Cause (1955)',
'Philadelphia Museum of Art as seen in Rocky (1976)',
'Edmund Pettus Bridge as seen in Selma (2014)',
'Timberline Lodge as seen in The Shining (1980)',
'Dead Horse Point State Park as seen in Thelma & Louise (1991)',
'Golden Gate Bridge as seen in Vertigo (1958)',
'Katz’s Delicatessen as seen in When Harry Met Sally (1989)',
'Prairie Mountain as seen in Brokeback Mountain (2005)',
'Iguazu Falls as seen in Indiana Jones and the Crystal Skull (2008)',
'Machu Picchu as seen in The Motorcycle Diaries (2004)',
'Bahia De Cacaluta Beach as seen in Y Tu Mamá También (2001)',
'3 thoughts on “75+ Famous Movie Locations You Can Actually Visit”',
'Leave a Reply Cancel reply']
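The duplicate-name problem described above can be reproduced without any network access. In this sketch, `SimpleNamespace` objects stand in for the BeautifulSoup tags (an assumption made purely so the example is self-contained), and all variable names are hypothetical.

```python
from types import SimpleNamespace

# Stand-ins for the <h3> tags: objects that have a .text attribute
headings = [SimpleNamespace(text=" First title "), SimpleNamespace(text=" Second title ")]

# Buggy version: the list being iterated is also the list being appended to
titles = list(headings)
error = None
try:
    for t in titles:
        titles.append(t.text.strip())  # the loop later reaches these strings
except AttributeError as exc:
    error = str(exc)
print(error)

# Fixed version: collect the results under a different name
clean_titles = [h.text.strip() for h in headings]
print(clean_titles)
```

Once the loop has passed the original elements it reaches the appended strings, and a `str` has no `.text` attribute, which matches the error message in the question.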
This is a good use-case for a list comprehension:
import requests
from bs4 import BeautifulSoup as BS
with requests.Session() as session:
    r = session.get('https://almostginger.com/famous-movie-locations/')
    r.raise_for_status()
    soup = BS(r.text, 'lxml')
    titles = [title.text+'\n' for title in soup.select('h3')]
    print(*titles)
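Selecting every `h3` also picks up headings that are not locations, such as the comment and reply headings at the end of the output shown earlier. One possible way to filter them, and to split each heading into its parts, is sketched below; the pattern, the field names `place`, `film` and `years`, and the three sample headings are assumptions for illustration.

```python
import re

# A few headings copied from the output above, including one unwanted entry
headings = [
    "Café des Deux Moulins as seen in Amélie (2001)",
    "Highclere Castle as seen in Downton Abbey (2010-2019)",
    "Leave a Reply Cancel reply",
]

pattern = re.compile(r"(?P<place>.+?) as seen in (?P<film>.+) \((?P<years>[^()]+)\)$")

records = []
for heading in headings:
    # Replace non-breaking spaces so "as seen in" is matched consistently
    match = pattern.match(heading.replace("\xa0", " "))
    if match:
        records.append(match.groupdict())

print(records)
```

Headings that use different wording, for example "Wicklow Mountains in P.S. I Love You (2007)", would be skipped by this pattern and would need their own rule.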
I wrote the code below and built a dictionary from it, but I want to create tuples of (lemma, NER type) and collect counts over those tuples, and I don't know how to do it. Can you please help me? NER type means the named entity recognition label.
text = """
Seville.
Summers in the flamboyant Andalucían capital often nudge 40C, but spring is a delight, with the parks in bloom and the scent of orange blossom and jasmine in the air. And in Semana Santa (Holy Week, 14-20 April) the streets come alive with floats and processions. There is also the raucous annual Feria de Abril – a week-long fiesta of parades, flamenco and partying long into the night (4-11 May; expect higher hotel prices if you visit then).
Seville is a romantic and energetic place, with sights aplenty, from the Unesco-listed cathedral – the largest Gothic cathedral in the world – to the beautiful Alcázar royal palace. But days here are best spent simply wandering the medieval streets of Santa Cruz and along the river to La Real Maestranza, Spain’s most spectacular bullring.
Seville is the birthplace of tapas and perfect for a foodie break – join a tapas tour (try devoursevillefoodtours.com), or stop at the countless bars for a glass of sherry with local jamón ibérico (check out Bar Las Teresas in Santa Cruz or historic Casa Morales in Constitución). Great food markets include the Feria, the oldest, and the wooden, futuristic-looking Metropol Parasol.
Nightlife is, unsurprisingly, late and lively. For flamenco, try one of the peñas, or flamenco social clubs – Torres Macarena on C/Torrijano, perhaps – with bars open across town until the early hours.
Book it: In an atmospheric 18th-century house, the Hospes Casa del Rey de Baeza is a lovely place to stay in lively Santa Cruz. Doubles from £133 room only, hospes.com
Trieste.
"""
doc = nlp(text).ents
en = [(entity.text, entity.label_) for entity in doc]
en
from pprint import pprint
pprint(en)
# Group the entity texts by their NER type
from collections import defaultdict
type2entities = defaultdict(list)
for entity, entity_type in en:
    type2entities[entity_type].append(entity)
pprint(type2entities)
I hope the following code snippets solve your problem.
import spacy
# Load English tokenizer, tagger, parser, NER and word vectors
nlp = spacy.load("en_core_web_sm")
text = ("""Seville.
Summers in the flamboyant Andalucían capital often nudge 40C, but spring is a delight, with the parks in bloom and the scent of orange blossom and jasmine in the air. And in Semana Santa (Holy Week, 14-20 April) the streets come alive with floats and processions. There is also the raucous annual Feria de Abril – a week-long fiesta of parades, flamenco and partying long into the night (4-11 May; expect higher hotel prices if you visit then).
Seville is a romantic and energetic place, with sights aplenty, from the Unesco-listed cathedral – the largest Gothic cathedral in the world – to the beautiful Alcázar royal palace. But days here are best spent simply wandering the medieval streets of Santa Cruz and along the river to La Real Maestranza, Spain’s most spectacular bullring.
Seville is the birthplace of tapas and perfect for a foodie break – join a tapas tour (try devoursevillefoodtours.com), or stop at the countless bars for a glass of sherry with local jamón ibérico (check out Bar Las Teresas in Santa Cruz or historic Casa Morales in Constitución). Great food markets include the Feria, the oldest, and the wooden, futuristic-looking Metropol Parasol.
Nightlife is, unsurprisingly, late and lively. For flamenco, try one of the peñas, or flamenco social clubs – Torres Macarena on C/Torrijano, perhaps – with bars open across town until the early hours.
Book it: In an atmospheric 18th-century house, the Hospes Casa del Rey de Baeza is a lovely place to stay in lively Santa Cruz. Doubles from £133 room only, hospes.com
Trieste.""")
doc = nlp(text)
lemma_ner_list = []
for entity in doc.ents:
    lemma_ner_list.append((entity.lemma_, entity.label_))
# print list of lemma ner tuples
print(lemma_ner_list)
# print count of tuples
print(len(lemma_ner_list))
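The snippet above prints the total number of tuples. To collect a count per distinct (lemma, NER type) tuple, as the question asks, `collections.Counter` is one option. In this sketch `lemma_ner_list` is filled with made-up sample values rather than real spaCy output, so the labels shown are assumptions.

```python
from collections import Counter

# Hypothetical sample; in practice this list would come from doc.ents
lemma_ner_list = [
    ("Seville", "GPE"),
    ("Santa Cruz", "GPE"),
    ("Seville", "GPE"),
    ("14-20 April", "DATE"),
]

# Tuples are hashable, so Counter can count them directly
counts = Counter(lemma_ner_list)

for (lemma, label), n in counts.most_common():
    print(lemma, label, n)
```

`most_common()` returns the tuples ordered from most to least frequent, which is usually the order wanted when inspecting entity counts.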
How do I parse sentence-case phrases from a passage?
For example from this passage
Conan Doyle said that the character of Holmes was inspired by Dr. Joseph Bell, for whom Doyle had worked as a clerk at the Edinburgh Royal Infirmary. Like Holmes, Bell was noted for drawing large conclusions from the smallest observations.[1] Michael Harrison argued in a 1971 article in Ellery Queen's Mystery Magazine that the character was inspired by Wendell Scherer, a "consulting detective" in a murder case that allegedly received a great deal of newspaper attention in England in 1882.
We need to generate items like Conan Doyle, Holmes, Dr. Joseph Bell, Wendell Scherer, etc.
I would prefer a Pythonic solution if possible.
This kind of processing can be very tricky. This simple code does almost the right thing:
import re
for s in re.finditer(r"([A-Z][a-z]+[. ]+)+([A-Z][a-z]+)?", text):
    print(s.group(0))
produces:
Conan Doyle
Holmes
Dr. Joseph Bell
Doyle
Edinburgh Royal Infirmary. Like Holmes
Bell
Michael Harrison
Ellery Queen
Mystery Magazine
Wendell Scherer
England
To include "Dr. Joseph Bell", you need to be ok with the period in the string, which allows in "Edinburgh Royal Infirmary. Like Holmes".
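One way to soften this trade-off, sketched here under assumptions, is to allow a period only after a short list of known titles instead of anywhere. The title list `Dr|Mr|Mrs|Ms`, the pattern and the shortened sample text are all hypothetical choices and not part of the answer above.

```python
import re

text = ("Holmes was inspired by Dr. Joseph Bell, for whom Doyle had worked "
        "at the Edinburgh Royal Infirmary. Bell was noted for this.")

# An optional title followed by ". ", then capitalised words joined by single spaces
pattern = r"\b(?:(?:Dr|Mr|Mrs|Ms)\. )?[A-Z][a-z]+(?: [A-Z][a-z]+)*"

names = re.findall(pattern, text)
print(names)
```

A period now ends a match unless it follows one of the listed titles, so a name no longer runs on into the next sentence. Capitalised words at the start of a sentence, such as "Like" in the original passage, would still be picked up.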
I had a similar problem: Separating Sentences.
The "re" approach runs out of steam very quickly. Named entity recognition is a very complicated topic, way beyond the scope of an SO answer. If you think you have a good approach to this problem, please point it at Flann O'Brien a.k.a. Myles na cGopaleen, Sukarno, Harry S. Truman, J. Edgar Hoover, J. K. Rowling, the mathematician L'Hopital, Joe di Maggio, Algernon Douglas-Montagu-Scott, and Hugo Max Graf von und zu Lerchenfeld auf Köfering und Schönberg.
Update: the following is an "re"-based approach that finds a lot more valid cases. I still don't think that this is a good approach, though. N.B. I've asciified the Bavarian count's name in my text sample. If anyone really wants to use something like this, they should work in Unicode, and normalise whitespace at some stage (either on input or on output).
import re
text1 = """Conan Doyle said that the character of Holmes was inspired by Dr. Joseph Bell, for whom Doyle had worked as a clerk at the Edinburgh Royal Infirmary. Like Holmes, Bell was noted for drawing large conclusions from the smallest observations.[1] Michael Harrison argued in a 1971 article in Ellery Queen's Mystery Magazine that the character was inspired by Wendell Scherer, a "consulting detective" in a murder case that allegedly received a great deal of newspaper attention in England in 1882."""
text2 = """Flann O'Brien a.k.a. Myles na cGopaleen, I Zingari, Sukarno and Suharto, Harry S. Truman, J. Edgar Hoover, J. K. Rowling, the mathematician L'Hopital, Joe di Maggio, Algernon Douglas-Montagu-Scott, and Hugo Max Graf von und zu Lerchenfeld auf Koefering und Schoenberg."""
pattern1 = r"(?:[A-Z][a-z]+[. ]+)+(?:[A-Z][a-z]+)?"
joiners = r"' - de la du von und zu auf van der na di il el bin binte abu etcetera".split()
pattern2 = r"""(?x)
(?:
(?:[ .]|\b%s\b)*
(?:\b[a-z]*[A-Z][a-z]*\b)?
)+
""" % r'\b|\b'.join(joiners)
def get_names(pattern, text):
    for m in re.finditer(pattern, text):
        s = m.group(0).strip(" .'-")
        if s:
            yield s
for t in (text1, text2):
    print("*** text: %s ..." % t[:20])
    print("=== Ned B")
    # finditer needs the text to search as its second argument
    for s in re.finditer(pattern1, t):
        print(repr(s.group(0)))
    print("=== John M ==")
    for name in get_names(pattern2, t):
        print(repr(name))
Output:
C:\junk\so>\python26\python extract_names.py
*** text: Conan Doyle said tha ...
=== Ned B
'Conan Doyle '
'Holmes '
'Dr. Joseph Bell'
'Doyle '
'Edinburgh Royal Infirmary. Like Holmes'
'Bell '
'Michael Harrison '
'Ellery Queen'
'Mystery Magazine '
'Wendell Scherer'
'England '
=== John M ==
'Conan Doyle'
'Holmes'
'Dr. Joseph Bell'
'Doyle'
'Edinburgh Royal Infirmary. Like Holmes'
'Bell'
'Michael Harrison'
'Ellery Queen'
'Mystery Magazine'
'Wendell Scherer'
'England'
*** text: Flann O'Brien a.k.a. ...
=== Ned B
'Flann '
'Brien '
'Myles '
'Sukarno '
'Harry '
'Edgar Hoover'
'Joe '
'Algernon Douglas'
'Hugo Max Graf '
'Lerchenfeld '
'Koefering '
'Schoenberg.'
=== John M ==
"Flann O'Brien"
'Myles na cGopaleen'
'I Zingari'
'Sukarno'
'Suharto'
'Harry S. Truman'
'J. Edgar Hoover'
'J. K. Rowling'
"L'Hopital"
'Joe di Maggio'
'Algernon Douglas-Montagu-Scott'
'Hugo Max Graf von und zu Lerchenfeld auf Koefering und Schoenberg'
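The remark about working in Unicode and normalising whitespace could be handled by a small helper that runs before the patterns are applied. This is a sketch only: the `normalise` function and the choice of NFKC normalisation are assumptions and not something the answer prescribes.

```python
import re
import unicodedata

def normalise(text):
    """Return text in composed Unicode form with whitespace runs collapsed."""
    # NFKC composes characters such as o + combining diaeresis into a single
    # code point and maps the non-breaking space to an ordinary space
    text = unicodedata.normalize("NFKC", text)
    # Collapse any run of whitespace (spaces, tabs, newlines) into one space
    return re.sub(r"\s+", " ", text).strip()

sample = "Hugo  Max\nGraf von\xa0und zu Lerchenfeld auf Ko\u0308fering"
print(normalise(sample))
```

The character classes in the patterns above, such as `[a-z]`, match ASCII letters only, so they would still need widening before non-ASCII names match in full.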