How do I extract a substring from a larger string? - python

I'm very new to web scraping and I'm scraping a Billboard page that compiles the top 10 summer songs for each year from 1958 to 2021. My main goal is to end up with a dictionary with the year number as the key and a list with the 10 songs as the associated value.
{"1958": ["NEL BLU DIPINTO DI BLU (VOLARÉ)", ...], "1959": ["LONELY BOY", ...]}
What I have so far is a list of each year and their songs, where each value in the list is multiple lines and appears as follows:
1958Rank, Title, Artist
1, NEL BLU DIPINTO DI BLU (VOLARÉ), Domenico Modugno
2, POOR LITTLE FOOL, Ricky Nelson
3, PATRICIA, Perez Prado And His Orchestra
4, LITTLE STAR, The Elegants
5, MY TRUE LOVE, Jack Scott
6, JUST A DREAM, Jimmy Clanton And His Rockets
7, WHEN, Kalin Twins
8, BIRD DOG, The Everly Brothers
9, SPLISH SPLASH, Bobby Darin
10, REBEL-‘ROUSER, Duane Eddy His Twangy Guitar And The Rebels
Is there any way to extract just the song titles and add them to a separate list? I think it could be done either by checking whether a substring is fully capitalized, since the song titles are in all caps, or by taking the substring between two commas, since each title sits between the comma after its rank and the comma before the artist.
The link for the Billboard website is attached here:
https://www.billboard.com/pro/summer-songs-1985-present-top-10-tunes-each-summer-listen/

There is no need for regex. To get your expected output, select only the <p> elements that contain a <strong> and iterate over their text nodes with [s.split(', ')[1] for s in p.find_all(string=True)[2:]]:
from bs4 import BeautifulSoup
import pandas as pd
import requests

url = 'https://www.billboard.com/pro/summer-songs-1985-present-top-10-tunes-each-summer-listen/'
doc = BeautifulSoup(requests.get(url).text, 'html.parser')

data = []
for p in doc.select('.pmc-paywall p:has(strong)'):
    # text nodes: [0] is the year, [1] the "Rank, Title, Artist" header, the rest are song rows
    data.append({
        p.strong.text: [s.split(', ')[1] for s in p.find_all(string=True)[2:]]
    })
print(data)
Output:
[{'1958': ['NEL BLU DIPINTO DI BLU (VOLARÉ)', 'POOR LITTLE FOOL', 'PATRICIA', 'LITTLE STAR', 'MY TRUE LOVE', 'JUST A DREAM', 'WHEN', 'BIRD DOG', 'SPLISH SPLASH', 'REBEL-‘ROUSER']}, {'1959': ['LONELY BOY', 'THE BATTLE OF NEW ORLEANS', 'A BIG HUNK O’ LOVE', 'MY HEART IS AN OPEN BOOK', 'THE THREE BELLS', 'PERSONALITY', 'THERE GOES MY BABY', 'LAVENDER-BLUE', 'WATERLOO', 'TIGER']}, {'1960': ['I’M SORRY', 'IT’S NOW OR NEVER', 'EVERYBODY’S SOMEBODY’S FOOL', 'ALLEY-OOP', 'ITSY BITSY TEENIE WEENIE YELLOW POLKADOT BIKINI', 'ONLY THE LONELY (KNOW HOW I FEEL)', 'WALK — DON’T RUN', 'CATHY’S CLOWN', 'MULE SKINNER BLUES', 'BECAUSE THEY’RE YOUNG']},...]
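If the page has already been reduced to multi-line strings like the one shown in the question, the single dictionary the question asks for can also be built without BeautifulSoup. This is only a minimal sketch: it assumes every block starts with a four-digit year fused to the header line and that no title contains ", ". The names parse_block, sample and blocks are illustrative and do not come from the original post.

```python
def parse_block(block):
    """Return (year, titles) for one block shaped like the sample in the question."""
    header, *rows = block.strip().splitlines()
    year = header[:4]  # the header looks like "1958Rank, Title, Artist"
    # Each row is "rank, TITLE, artist"; maxsplit=2 leaves the artist part in one piece.
    titles = [row.split(", ", 2)[1] for row in rows]
    return year, titles


# Shortened sample; the real blocks have ten rows each.
sample = """1958Rank, Title, Artist
1, NEL BLU DIPINTO DI BLU (VOLARÉ), Domenico Modugno
2, POOR LITTLE FOOL, Ricky Nelson
3, PATRICIA, Perez Prado And His Orchestra"""

blocks = [sample]
songs_by_year = dict(parse_block(b) for b in blocks)
print(songs_by_year)
```

Building the dictionary from (year, titles) pairs gives one dict keyed by year rather than a list of single-key dicts.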
One approach to get a bit more structured data including rank and artist that you can use to build a dataframe easily could be:
...
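data = []
for p in doc.select('.pmc-paywall p:has(strong)'):
    texts = p.find_all(string=True)
    # texts[1] is the "Rank, Title, Artist" header; strip the spaces after the commas
    header = [h.strip() for h in texts[1].split(',')]
    for row in texts[2:]:
        record = dict(zip(header, row.strip().split(', ')))
        record['year'] = p.strong.text
        data.append(record)
pd.DataFrame(data)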
Rank  Title                            Artist                         year
1     NEL BLU DIPINTO DI BLU (VOLARÉ)  Domenico Modugno               1958
2     POOR LITTLE FOOL                 Ricky Nelson                   1958
3     PATRICIA                         Perez Prado And His Orchestra  1958
4     LITTLE STAR                      The Elegants                   1958
5     MY TRUE LOVE                     Jack Scott                     1958
....

Related

How do i turn this string into 2 arrays in python?

I have this string and want to turn it into two arrays, one has the film title and the other one has the year. Their positions in the array need to correspond with each other. Is there a way to do this?
films = ("""Endless Love (1981), Top Gun (1986), The Color of Money (1986), Rain Man (1988),
Born on the Fourth of July (1989), Interview with the Vampire: The Vampire Chronicles (1994),
Mission: Impossible (1996), Jerry Maguire (1996), The Matrix (1999), Mission: Impossible II (2000),
Vanilla Sky (2001), Cocktail (1988), A Few Good Men (1992), The Firm (1993), Eyes Wide Shut (1999),
Magnolia (1999), Minority Report (2002), Austin Powers in Goldmember (2002), Days of Thunder (1990),
The Powers of Matthew Star (1982), Cold Mountain (2003), The Talented Mr. Ripley (1999),
War of the Worlds (2005), The Oprah Winfrey Show (1986), Far and Away (1992), Taps (1981),
The Last Samurai (2003), Valkyrie (2008), Jack Reacher (2012), Edge of Tomorrow (2014),
Enemy of the State (1998), Mission: Impossible III (2006), Crimson Tide (1995), Reign Over Me (2007),
Batman Forever (1995), Batman Begins (2005), The Simpsons (1989), The Simpsons: Brother from the Same Planet (1993),
The Simpsons: When You Dish Upon a Star (1998), End of Days (1999), House of D (2004), The Indian Runner (1991),
Harry & Son (1984), Mission: Impossible - Ghost Protocol (2011), Aladdin (1992), Pacific Rim (2013),
Oblivion (2013), Knight and Day (2010),
""")
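First split the input string on commas to generate a list, then use comprehensions to get the title and year as separate lists.
import re
# strip() and the filter drop the empty piece left by the trailing comma
films_list = [x for x in re.split(r',\s*', films.strip()) if x]
# \s+ (not \s*) so the pattern can never match an empty string, which re.split also splits on since Python 3.7
titles = [re.split(r'\s+(?=\(\d+\))', x)[0] for x in films_list]
# years keep their parentheses, e.g. '(1981)'
years = [re.split(r'\s+(?=\(\d+\))', x)[1] for x in films_list]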
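A single re.findall call can also return both parts at once. The sketch below is one possible variant, not the answerer's method. It assumes every entry ends in a four-digit year in parentheses and that titles contain no commas; sample_films is a shortened, illustrative stand-in for the full string.

```python
import re

# Shortened stand-in for the question's string, including the line breaks and trailing comma.
sample_films = """Endless Love (1981), Top Gun (1986),
Born on the Fourth of July (1989), Mission: Impossible (1996),
"""

# Group 1 is the title (anything but a comma), group 2 is the four-digit year.
pairs = re.findall(r'\s*([^,]+?)\s*\((\d{4})\)', sample_films)

titles = [title for title, _ in pairs]
years = [year for _, year in pairs]

print(titles)
print(years)
```

Because both lists come from the same list of pairs, their positions always correspond.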
Tim's answer works well. Here is an alternative for anyone who would like to solve the problem without regex.
a = films.split(",")
years = []
for i in a:
    if "(" in i:  # skip the empty piece after the trailing comma
        years.append(i[i.find("(")+1:i.find(")")])
Same approach can be applied for titles.
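For completeness, the same slicing can collect the titles too. This is a sketch under the assumption that the year is always the last parenthesized part of an entry, which is why it searches from the right with rfind; the short films string here is an illustrative sample, not the full list.

```python
# Shortened sample with the same shape as the question's string.
films = "Endless Love (1981), Top Gun (1986),\nMission: Impossible (1996),\n"

titles = []
years = []
for entry in films.split(","):
    entry = entry.strip()
    if "(" not in entry:
        continue  # skip the empty piece after the trailing comma
    open_paren = entry.rfind("(")
    titles.append(entry[:open_paren].strip())
    years.append(entry[open_paren + 1:entry.rfind(")")])

print(titles)
print(years)
```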
You can do something like this (no imports, extra modules, or regex needed):
delimiter = ", "
# Join the lines and drop the trailing comma so every entry is separated by ", "
movies_with_year = films.replace("\n", " ").strip().rstrip(",").split(delimiter)
movies = []
years = []
for movie_with_year in movies_with_year:
    movie = movie_with_year[:-6].strip()  # everything before "(YYYY)"
    year = movie_with_year[-6:].replace("(", "").replace(")", "")
    movies.append(movie)
    years.append(year)
This script will result in something like this:
movies : ['Endless Love', ...]
years : ['1981', ...]
You should remove all newlines (\n) and use try/except to skip the empty last element.
films = ("""Endless Love (1981), Top Gun (1986), The Color of Money (1986), Rain Man (1988),
Born on the Fourth of July (1989), Interview with the Vampire: The Vampire Chronicles (1994),
Mission: Impossible (1996), Jerry Maguire (1996), The Matrix (1999), Mission: Impossible II (2000),
Vanilla Sky (2001), Cocktail (1988), A Few Good Men (1992), The Firm (1993), Eyes Wide Shut (1999),
Magnolia (1999), Minority Report (2002), Austin Powers in Goldmember (2002), Days of Thunder (1990),
The Powers of Matthew Star (1982), Cold Mountain (2003), The Talented Mr. Ripley (1999),
War of the Worlds (2005), The Oprah Winfrey Show (1986), Far and Away (1992), Taps (1981),
The Last Samurai (2003), Valkyrie (2008), Jack Reacher (2012), Edge of Tomorrow (2014),
Enemy of the State (1998), Mission: Impossible III (2006), Crimson Tide (1995), Reign Over Me (2007),
Batman Forever (1995), Batman Begins (2005), The Simpsons (1989), The Simpsons: Brother from the Same Planet (1993),
The Simpsons: When You Dish Upon a Star (1998), End of Days (1999), House of D (2004), The Indian Runner (1991),
Harry & Son (1984), Mission: Impossible - Ghost Protocol (2011), Aladdin (1992), Pacific Rim (2013),
Oblivion (2013), Knight and Day (2010),
""")
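movies = []
years = []
for item in films.replace("\n", "").split("),"):
    try:
        # rsplit yields a single piece for the empty last element, so unpacking fails
        title, year = item.strip().rsplit(" (", 1)
    except ValueError:
        continue
    movies.append(title)
    years.append(year)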

How to index a whole string as opposed to letter in list

I'm trying to append a string of words to a list; however, when I try to index that list, it gives back individual letters.
For example:
url = 'https://almostginger.com/famous-movie-locations/'
titles = []
r = requests.get(url)
soup = BeautifulSoup(r.content, 'lxml')
titles = soup.select('h3')
for t in titles:
    tt = t.text.strip()
    for s in range(len(tt)):
        print(s)
This shows that only individual letters are indexed, whereas if I try to build a list, I get the error:
titles.append(tt)
AttributeError: 'str' object has no attribute 'text'
Expected outcome:
'Café des Deux Moulins as seen in Amélie (2001)',
'Royal Palace of Caserta as seen in Angels and Demons (2009)'
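You get an error simply because of a duplicate variable name: titles is both the list you append to and the list you iterate over, so the loop eventually reaches the strings you appended, and a str has no .text attribute. Change one of the titles to something else.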
import requests
from bs4 import BeautifulSoup
url = 'https://almostginger.com/famous-movie-locations/'
titles_ = []
r = requests.get(url)
soup = BeautifulSoup(r.content, 'lxml')
titles = soup.select('h3')
for t in titles:
    tt = t.text.strip()
    titles_.append(tt)
titles_
Output:
['Café des Deux Moulins as seen in Amélie (2001)',
'Royal Palace of Caserta as seen in Angels and Demons (2009)',
'Odesa Steps as seen in Battleship Potemkin (1926)',
'Promenade Plantée as seen in Before Sunset (2004)',
'Curracloe\xa0Beach as seen in Brooklyn (2015)',
'Belfry of Bruges as seen in In Bruges (2008)',
'Sirmione as seen in Call Me By Your Name (2017)',
'Villa del Balbianello as seen in Casino Royale (2006)',
'Neuschwanstein Castle as seen in Chitty Chitty Bang Bang (1968)',
'Nyhavn Harbour as seen in The Danish Girl (2015)',
'Rosslyn Chapel as seen in The Da Vinci Code (2006)',
'Highclere Castle as seen in Downton Abbey (2010-2019)',
'Juvet Landscape Hotel as seen in Ex Machina (2014)',
'Mini Hollywood as seen in For a Few Dollars More (1965)',
'The Dark Hedges as seen in Game of Thrones (2011-2019)',
'Kaufhaus Görlitz as seen in The Grand Budapest Hotel (2014)',
'Bar Vitelli as seen in The Godfather (1972)',
'Glenfinnan Viaduct as seen in Harry Potter and the Chamber of Secrets (2002)',
'Old Royal Naval College as seen in The King’s Speech (2010)',
'Trevi Fountain as seen in La Dolce Vita (1960)',
'Juliet’s House as seen in Letters to Juliet (2010)',
'Church of Agios Ioannis Kastri as seen in Mamma Mia! (2008)',
'Palace of Versailles as seen in Marie Antoinette (2006)',
'Shakespeare & Company Bookshop as seen in Midnight in Paris (2011)',
'Doune Castle as seen in Monty Python and The Holy Grail (1975)',
'The Notting Hill Bookshop as seen in Notting Hill (1999)',
'Belchite as seen in Pan’s Labyrinth (2006)',
'Umschlagplatz as seen in The Pianist (2002)',
'Popeye Village as seen in Popeye (1980)',
'Cliffs of Moher as seen in The Princess Bride (1987)',
'Wicklow Mountains in P.S. I Love You (2007)',
'Mouth of Truth as seen in Roman Holiday (1953)',
'Piłsudskiego Bridge as seen in Schindler’s List (1993)',
'Kirkjufell Mountain as seen in The Secret Life of Walter Mitty (2013)',
'Residenzplatz as seen in The Sound of Music (1965)',
'The Fairy Glen as seen in Stardust (2007)',
'Skellig Michael as seen in Star Wars Episode VIII: The Last Jedi (2017)',
'Spanish Steps as seen in The Talented Mr Ripley (1999)',
'Riesenrad Ferris Wheel as seen in The Third Man (1949)',
'Hotel Carlton as seen in To Catch a Thief (1955)',
'Tibidabo Amusement Park as seen in Vicky Cristina Barcelona (2008)',
'Haweswater Reservoir as seen in Withnail & I (1989)',
'Aït Benhaddou as seen in Gladiator (2000)',
'Masai Mara as seen in Out of Africa (1985)',
'Sidi Idriss Hotel as seen in Star Wars Episode III: A New Hope (1977)',
'Maya Bay as seen in The Beach (2000)',
'Hongcun Ancient Village as seen in Crouching Tiger Hidden Dragon (2000)',
'Lebua State Tower as seen in The Hangover Part II (2011)',
'Petra as seen in Indiana Jones and the Last Crusade (1989)',
'Angkor Thom as seen in Lara Croft: Tomb Raider (2001)',
'Park Hyatt Hotel as seen in Lost in Translation (2003)',
'Phang Nga Bay as seen in The Man with the Golden Gun (1974)',
'Burj Khalifa as seen in Mission: Impossible – Ghost Protocol (2011)',
'Chhatrapati Shivaji Maharaj Terminus as seen in Slumdog Millionaire (2008)',
'King’s Canyon as seen in The Adventures of Priscilla, Queen of the Desert (1994)',
'Hobbiton as seen in The Lord of the Rings: The Fellowship of the Ring (2001)',
'Pine Oak Court as seen in Neighbours (1985-Present)',
'Devil’s Tower as seen in Close Encounters of the Third Kind (1977)',
'Route 66 as seen in Easy Rider (1969)',
'Art Institute of Chicago as seen in Ferris Bueller’s Day Off (1986)',
'Monument Valley as seen in Forrest Gump (1994)',
'New York Public Library as seen in Ghostbusters (1984)',
'Salvation Mountain as seen in Into the Wild (2007)',
'Martha’s Vineyard as seen in Jaws (1975)',
'Griffith Observatory as seen in Rebel Without a Cause (1955)',
'Philadelphia Museum of Art as seen in Rocky (1976)',
'Edmund Pettus Bridge as seen in Selma (2014)',
'Timberline Lodge as seen in The Shining (1980)',
'Dead Horse Point State Park as seen in Thelma & Louise (1991)',
'Golden Gate Bridge as seen in Vertigo (1958)',
'Katz’s Delicatessen as seen in When Harry Met Sally (1989)',
'Prairie Mountain as seen in Brokeback Mountain (2005)',
'Iguazu Falls as seen in Indiana Jones and the Crystal Skull (2008)',
'Machu Picchu as seen in The Motorcycle Diaries (2004)',
'Bahia De Cacaluta Beach as seen in Y Tu Mamá También (2001)',
'3 thoughts on “75+ Famous Movie Locations You Can Actually Visit”',
'Leave a Reply Cancel reply']
This is a good use-case for a list comprehension:
import requests
from bs4 import BeautifulSoup as BS
with requests.Session() as session:
    r = session.get('https://almostginger.com/famous-movie-locations/')
    r.raise_for_status()
    soup = BS(r.text, 'lxml')
    titles = [title.text+'\n' for title in soup.select('h3')]
    print(*titles)

Python taking specific data's from websites

I am new to Python and I'm working on an interface. I need to get the top 250 movies from the IMDb website.
def clicked(self):
    movie=self.movie_name.text()
    url="https://www.imdb.com/chart/top/"
    response=requests.get(url)
    html_content=response.content
    soup=BeautifulSoup(html_content,"html.parser")
    movie_name = soup.find_all("td",{"class":"titleColumn"})
    for i in movie_name:
        i=i.text
        i=i.strip()
        i=i.replace("\n","")
        if (movie == i):
            self.yazialani.setText(i)
With this code the output looks like this:
6. Schindler's List(1993)
7. The Lord of the Rings: The Return of the King(2003)
8. Pulp Fiction(1994)
But for my project I only want the movie names, not the years and rankings. How should I change my code?
One primitive solution (assuming each string has the form digits + ". " + movie name + "(YEAR)") is to take just the part between the first ". " and the trailing "(YEAR)":
a=["6. Schindler's List(1993)", "7. The Lord of the Rings: The Return of the King(2003)", "8. Pulp Fiction(1994)"]
just_names=[]
for name in a:
    i=0
    while True:
        if name[i]=='.':
            # i+2 skips the dot and the space after it; -6 drops "(YEAR)"
            just_names.append(name[i+2:-6])
            break
        i+=1
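The same string shape can also be handled with one regular expression instead of a manual scan. This is a hedged sketch, assuming every row looks like "rank. title(YEAR)" with a four-digit year; ROW_RE and strip_rank_and_year are hypothetical names introduced for illustration.

```python
import re

# rank, dot, optional space, lazily captured title, then "(YYYY)" at the very end
ROW_RE = re.compile(r"^\d+\.\s*(.*?)\s*\(\d{4}\)$")


def strip_rank_and_year(row):
    """Return only the title; rows that do not fit the pattern are returned unchanged."""
    match = ROW_RE.match(row.strip())
    return match.group(1) if match else row


rows = [
    "6. Schindler's List(1993)",
    "7. The Lord of the Rings: The Return of the King(2003)",
    "8. Pulp Fiction(1994)",
]
just_names = [strip_rank_and_year(row) for row in rows]
print(just_names)
```

Returning unmatched rows unchanged keeps the list the same length as the input, so a malformed row is easy to spot.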
Only the name of the movie is contained in the anchor tag, so select the anchor tag's text for each td:
import requests
from bs4 import BeautifulSoup
url="https://www.imdb.com/chart/top/"
response=requests.get(url)
html_content=response.content
soup=BeautifulSoup(html_content,"html.parser")
movie_name = soup.find_all("td",{"class":"titleColumn"})
for i in movie_name:
    print(i.find("a").get_text(strip=True))
Output:
The Shawshank Redemption
The Godfather
The Godfather: Part II
The Dark Knight
12 Angry Men
Schindler's List
The Lord of the Rings: The Return of the King
Pulp Fiction
Il buono, il brutto, il cattivo
The Lord of the Rings: The Fellowship of the Ring
Fight Club
Forrest Gump
Inception
Star Wars: Episode V - The Empire Strikes Back
The Lord of the Rings: The Two Towers
The Matrix
Goodfellas
One Flew Over the Cuckoo's Nest
Shichinin no samurai
Se7en
La vita è bella
Cidade de Deus
The Silence of the Lambs
Hamilton
It's a Wonderful Life
Star Wars
Saving Private Ryan
Sen to Chihiro no kamikakushi
Gisaengchung
The Green Mile
Interstellar
Léon
The Usual Suspects
Seppuku
The Lion King
Back to the Future
The Pianist
Terminator 2: Judgment Day
American History X
Modern Times
Psycho
Gladiator
City Lights
The Departed
The Intouchables
Whiplash
The Prestige
...
...
..

Create tuples of (lemma, NER type) in python , Nlp problem

I wrote the code below and built a dictionary from it, but I want to create tuples of (lemma, NER type) and collect counts over the tuples. I don't know how to do it. Can you please help me? NER type means the named entity recognition type.
text = """
Seville.
Summers in the flamboyant Andalucían capital often nudge 40C, but spring is a delight, with the parks in bloom and the scent of orange blossom and jasmine in the air. And in Semana Santa (Holy Week, 14-20 April) the streets come alive with floats and processions. There is also the raucous annual Feria de Abril – a week-long fiesta of parades, flamenco and partying long into the night (4-11 May; expect higher hotel prices if you visit then).
Seville is a romantic and energetic place, with sights aplenty, from the Unesco-listed cathedral – the largest Gothic cathedral in the world – to the beautiful Alcázar royal palace. But days here are best spent simply wandering the medieval streets of Santa Cruz and along the river to La Real Maestranza, Spain’s most spectacular bullring.
Seville is the birthplace of tapas and perfect for a foodie break – join a tapas tour (try devoursevillefoodtours.com), or stop at the countless bars for a glass of sherry with local jamón ibérico (check out Bar Las Teresas in Santa Cruz or historic Casa Morales in Constitución). Great food markets include the Feria, the oldest, and the wooden, futuristic-looking Metropol Parasol.
Nightlife is, unsurprisingly, late and lively. For flamenco, try one of the peñas, or flamenco social clubs – Torres Macarena on C/Torrijano, perhaps – with bars open across town until the early hours.
Book it: In an atmospheric 18th-century house, the Hospes Casa del Rey de Baeza is a lovely place to stay in lively Santa Cruz. Doubles from £133 room only, hospes.com
Trieste.
"""
doc = nlp(text).ents
en = [(entity.text, entity.label_) for entity in doc]
en
#entities
#The list stored in variable entities is has type list[list[tuple[str, str]]],
#from pprint import pprint
pprint(en)
sum(filter(None, entities), [])
from collections import defaultdict
type2entities = defaultdict(list)
for entity, entity_type in sum(filter(None, entities), []):
    type2entities[entity_type].append(entity)
from pprint import pprint
pprint(type2entities)
I hope the following code snippets solve your problem.
import spacy
# Load English tokenizer, tagger, parser, NER and word vectors
nlp = spacy.load("en_core_web_sm")
text = ("""Seville.
Summers in the flamboyant Andalucían capital often nudge 40C, but spring is a delight, with the parks in bloom and the scent of orange blossom and jasmine in the air. And in Semana Santa (Holy Week, 14-20 April) the streets come alive with floats and processions. There is also the raucous annual Feria de Abril – a week-long fiesta of parades, flamenco and partying long into the night (4-11 May; expect higher hotel prices if you visit then).
Seville is a romantic and energetic place, with sights aplenty, from the Unesco-listed cathedral – the largest Gothic cathedral in the world – to the beautiful Alcázar royal palace. But days here are best spent simply wandering the medieval streets of Santa Cruz and along the river to La Real Maestranza, Spain’s most spectacular bullring.
Seville is the birthplace of tapas and perfect for a foodie break – join a tapas tour (try devoursevillefoodtours.com), or stop at the countless bars for a glass of sherry with local jamón ibérico (check out Bar Las Teresas in Santa Cruz or historic Casa Morales in Constitución). Great food markets include the Feria, the oldest, and the wooden, futuristic-looking Metropol Parasol.
Nightlife is, unsurprisingly, late and lively. For flamenco, try one of the peñas, or flamenco social clubs – Torres Macarena on C/Torrijano, perhaps – with bars open across town until the early hours.
Book it: In an atmospheric 18th-century house, the Hospes Casa del Rey de Baeza is a lovely place to stay in lively Santa Cruz. Doubles from £133 room only, hospes.com
Trieste.""")
doc = nlp(text)
lemma_ner_list = []
for entity in doc.ents:
    lemma_ner_list.append((entity.lemma_, entity.label_))
# print list of lemma ner tuples
print(lemma_ner_list)
# print count of tuples
print(len(lemma_ner_list))
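The question also asks for counts over the tuples, which len() alone does not give. One way to get them is collections.Counter, since tuples are hashable. The tuples below are made-up placeholders standing in for lemma_ner_list, not actual spaCy output for this text.

```python
from collections import Counter

# Placeholder (lemma, NER type) tuples; in practice use the lemma_ner_list built from doc.ents.
lemma_ner_list = [
    ("Seville", "GPE"),
    ("Santa Cruz", "GPE"),
    ("Seville", "GPE"),
    ("April", "DATE"),
]

# Counter maps each distinct tuple to the number of times it occurs.
tuple_counts = Counter(lemma_ner_list)

for (lemma, ner_type), count in tuple_counts.most_common():
    print(lemma, ner_type, count)
```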

Regex to parse Product Size (Count, Pack size) from string

I have a long list of strings that are different products in my database, each with a product label and product sizes (including packaging size) of the product. The product label is slightly variable and I want a regex to account for all those: I want the regex to split the strings into 3 substrings: 1. Parse out a string if the label has "Pack" or "pck" or "Count" 2. Parse out the size--12, 24, etc 3. The size of each pack (12.5 oz, 10-oz). I have the following strings for example:
str1 = "Greenies Original Regular Natural Dental Dog Treats, 12 oz. Pack (12 Treats)"
str2 = "Greenies Pill Pockets Capsule Size Dog Treats Chicken Flavor, 7.9 oz. Pack (30 Treats)"
str3 = "Blue Buffalo Family Favorites Natural Adult Wet Dog Food, Sunday Chicken 12.5-oz can (Pack of 12)"
str4 = "Blue Buffalo Dental Bones Natural Adult Dental Chew Regular Dog Treat, 12-oz bag (12 Count)"
str5 = "Purina ONE Natural Dry Dog Food, SmartBlend Chicken & Rice Formula - 8 lb. Bag"
str6 = "Rachael Ray Nutrish Natural Dry Dog Food, Real Chicken & Veggies Recipe, 14 lbs"
str7 = "(12 Pack) Purina ALPO Gravy Wet Dog Food, Gravy Cravers Roast Beef Flavor in Gravy, 13.2 oz. Cans"
str8 = "Ol' Roy Munchy Bones Dog Treats, Greek Yogurt Flavor, 20 ounce, 7 Count"
What I would like to get is:
str1_group = ['12 oz', 'Pack', '12 Treats']
str2_group = ['7.9 oz', 'Pack', '30 Treats']
str3_group = ['12.5-oz', 'can', '12 Count']
str4_group = ['12-oz', 'bag', '12 Count']
str5_group = ['8 lb', 'Bag', '']
str6_group = ['14 lbs', '', '']
str7_group = ['13.2 oz', 'Cans', '12 Pack']
str8_group = ['20 ounce', '', '7 Count']
The challenge here is that a lot of products have different descriptions: some have Pack details at the start of string and some at the end, some have weight details in oz., Ounces and some have lbs.
What I have tried:
re.search(r'(\d+(\.\d+)?\s[ol]?[zbs])', text)
However, the above only accounts for strings like "12 oz", "12.5 oz" and "0.5 oz" (not even "12.5-oz"). It is getting a little challenging for me to write a regex that accounts for every scenario.
Can someone please help me with the best regex to solve this issue? Thanks in advance!
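One possible starting point is to use two smaller patterns instead of a single large one: one for the size plus an optional container word, and one for the count. This is only a sketch tuned to the eight sample strings above. The unit list (oz, ounce, lb), the container list (pack, can, bag) and the choice to report "Pack of 12" as "12 Count" are assumptions taken from the examples, and SIZE_RE, COUNT_RE and parse_product are hypothetical names.

```python
import re

# Size such as "12 oz", "12.5-oz", "8 lb", "20 ounce", optionally followed by a container word.
SIZE_RE = re.compile(
    r"(?P<size>\d+(?:\.\d+)?[\s-]?(?:oz|ounces?|lbs?))\b\.?"
    r"(?:\s+(?P<container>pack|cans?|bags?)\b)?",
    re.IGNORECASE,
)

# Count written either as "Pack of 12" or as "12 Pack" / "12 Count" / "12 Treats".
COUNT_RE = re.compile(
    r"pack\s+of\s+(?P<pack_of>\d+)"
    r"|(?P<number>\d+)\s*(?P<label>pack|pck|count|treats)\b",
    re.IGNORECASE,
)


def parse_product(text):
    """Return [size, container, count], using '' for any part that is not found."""
    size = container = count = ""
    size_match = SIZE_RE.search(text)
    if size_match:
        size = size_match.group("size")
        container = size_match.group("container") or ""
    count_match = COUNT_RE.search(text)
    if count_match:
        if count_match.group("pack_of"):
            # Assumption: "Pack of N" is reported as "N Count", as in the desired output.
            count = f"{count_match.group('pack_of')} Count"
        else:
            count = f"{count_match.group('number')} {count_match.group('label')}"
    return [size, container, count]


print(parse_product("Blue Buffalo Family Favorites Natural Adult Wet Dog Food, Sunday Chicken 12.5-oz can (Pack of 12)"))
print(parse_product("(12 Pack) Purina ALPO Gravy Wet Dog Food, Gravy Cravers Roast Beef Flavor in Gravy, 13.2 oz. Cans"))
```

Keeping the two patterns separate means the count can appear before or after the size, which a single left-to-right pattern handles poorly. New units or labels would need to be added to the alternations by hand.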
