I am trying to write a function in Python that opens a file and parses it into a dictionary. I want the first item in each list block to become a key in the dictionary data, and the value to be the rest of the block (everything after the first item). For some reason, when I run the following function, it parses incorrectly; I have provided the output below. How can I parse it as described? Any help would be greatly appreciated.
Function:
def parseData():
    filename = "testdata.txt"
    file = open(filename, "r+")
    block = []
    for line in file:
        block.append(line)
        if line in ('\n', '\r\n'):
            album = block.pop(1)
            data[block[1]] = album
            block = []
    print data
Input:
Bob Dylan
1966 Blonde on Blonde
-Rainy Day Women #12 & 35
-Pledging My Time
-Visions of Johanna
-One of Us Must Know (Sooner or Later)
-I Want You
-Stuck Inside of Mobile with the Memphis Blues Again
-Leopard-Skin Pill-Box Hat
-Just Like a Woman
-Most Likely You Go Your Way (And I'll Go Mine)
-Temporary Like Achilles
-Absolutely Sweet Marie
-4th Time Around
-Obviously 5 Believers
-Sad Eyed Lady of the Lowlands
Output:
{'-Rainy Day Women #12 & 35\n': '1966 Blonde on Blonde\n',
'-Whole Lotta Love\n': '1969 II\n', '-In the Evening\n': '1979 In Through the Outdoor\n'}
You can use groupby to group the data, using the empty lines as delimiters, and a defaultdict to handle repeated keys: after extracting the key (the first element) from each group that groupby returns, extend that key's list with the remaining values.
from itertools import groupby
from collections import defaultdict

d = defaultdict(list)

with open("file.txt") as f:
    for k, val in groupby(f, lambda x: x.strip() != ""):
        # if k is True we have a section
        if k:
            # the first line of each section becomes the key "k";
            # val then yields the remaining lines
            k, *v = val
            # add, or extend, the existing key/value pairing
            d[k].extend(map(str.rstrip, v))

from pprint import pprint as pp
pp(d)
Output:
{'Bob Dylan\n': ['1966 Blonde on Blonde',
'-Rainy Day Women #12 & 35',
'-Pledging My Time',
'-Visions of Johanna',
'-One of Us Must Know (Sooner or Later)',
'-I Want You',
'-Stuck Inside of Mobile with the Memphis Blues Again',
'-Leopard-Skin Pill-Box Hat',
'-Just Like a Woman',
"-Most Likely You Go Your Way (And I'll Go Mine)",
'-Temporary Like Achilles',
'-Absolutely Sweet Marie',
'-4th Time Around',
'-Obviously 5 Believers',
'-Sad Eyed Lady of the Lowlands'],
'Led Zeppelin\n': ['1979 In Through the Outdoor',
'-In the Evening',
'-South Bound Saurez',
'-Fool in the Rain',
'-Hot Dog',
'-Carouselambra',
'-All My Love',
"-I'm Gonna Crawl",
'1969 II',
'-Whole Lotta Love',
'-What Is and What Should Never Be',
'-The Lemon Song',
'-Thank You',
'-Heartbreaker',
"-Living Loving Maid (She's Just a Woman)",
'-Ramble On',
'-Moby Dick',
'-Bring It on Home']}
For Python 2 the unpacking syntax is slightly different:
with open("file.txt") as f:
    for k, val in groupby(f, lambda x: x.strip() != ""):
        if k:
            k, v = next(val), val
            d[k].extend(map(str.rstrip, v))
If you want to keep the newlines, remove the map(str.rstrip, ...).
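As a minimal illustration of the k, *v = val extended unpacking used above (the lines are made up to stand in for one groupby section):

```python
# Extended iterable unpacking: the first element goes to k,
# everything remaining is collected into the list v
lines = iter(["Bob Dylan\n", "1966 Blonde on Blonde\n", "-I Want You\n"])
k, *v = lines
print(k)  # 'Bob Dylan\n'
print(v)  # ['1966 Blonde on Blonde\n', '-I Want You\n']
```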
If you want the album and songs separately for each artist:
from itertools import groupby
from collections import defaultdict

d = defaultdict(lambda: defaultdict(list))

with open("file.txt") as f:
    for k, val in groupby(f, lambda x: x.strip() != ""):
        if k:
            k, alb, songs = next(val), next(val), val
            d[k.rstrip()][alb.rstrip()] = list(map(str.rstrip, songs))

from pprint import pprint as pp
pp(d)
{'Bob Dylan': {'1966 Blonde on Blonde': ['-Rainy Day Women #12 & 35',
'-Pledging My Time',
'-Visions of Johanna',
'-One of Us Must Know (Sooner or '
'Later)',
'-I Want You',
'-Stuck Inside of Mobile with the '
'Memphis Blues Again',
'-Leopard-Skin Pill-Box Hat',
'-Just Like a Woman',
'-Most Likely You Go Your Way '
"(And I'll Go Mine)",
'-Temporary Like Achilles',
'-Absolutely Sweet Marie',
'-4th Time Around',
'-Obviously 5 Believers',
'-Sad Eyed Lady of the Lowlands']},
'Led Zeppelin': {'1969 II': ['-Whole Lotta Love',
'-What Is and What Should Never Be',
'-The Lemon Song',
'-Thank You',
'-Heartbreaker',
"-Living Loving Maid (She's Just a Woman)",
'-Ramble On',
'-Moby Dick',
'-Bring It on Home'],
'1979 In Through the Outdoor': ['-In the Evening',
'-South Bound Saurez',
'-Fool in the Rain',
'-Hot Dog',
'-Carouselambra',
'-All My Love',
"-I'm Gonna Crawl"]}}
I guess this is what you want?
Even if this is not the format you wanted, there are a few things you might learn from the answer:
use with for file handling
nice to have:
PEP8 compliant code, see http://pep8online.com/
a shebang
numpydoc
if __name__ == '__main__'
#!/usr/bin/env python

"""Parse text files with songs, grouped by album and artist."""


def add_to_data(data, block):
    """
    Parameters
    ----------
    data : dict
    block : list

    Returns
    -------
    dict
    """
    artist = block[0]
    album = block[1]
    songs = block[2:]
    if artist in data:
        data[artist][album] = songs
    else:
        data[artist] = {album: songs}
    return data


def parseData(filename='testdata.txt'):
    """
    Parameters
    ----------
    filename : string
        Path to a text file.

    Returns
    -------
    dict
    """
    data = {}
    with open(filename) as f:
        block = []
        for line in f:
            line = line.strip()
            if line == '':
                data = add_to_data(data, block)
                block = []
            else:
                block.append(line)
        data = add_to_data(data, block)
    return data


if __name__ == '__main__':
    data = parseData()
    import pprint
    pp = pprint.PrettyPrinter(indent=4)
    pp.pprint(data)
which gives:
{ 'Bob Dylan': { '1966 Blonde on Blonde': [ '-Rainy Day Women #12 & 35',
'-Pledging My Time',
'-Visions of Johanna',
'-One of Us Must Know (Sooner or Later)',
'-I Want You',
'-Stuck Inside of Mobile with the Memphis Blues Again',
'-Leopard-Skin Pill-Box Hat',
'-Just Like a Woman',
"-Most Likely You Go Your Way (And I'll Go Mine)",
'-Temporary Like Achilles',
'-Absolutely Sweet Marie',
'-4th Time Around',
'-Obviously 5 Believers',
'-Sad Eyed Lady of the Lowlands']},
'Led Zeppelin': { '1969 II': [ '-Whole Lotta Love',
'-What Is and What Should Never Be',
'-The Lemon Song',
'-Thank You',
'-Heartbreaker',
"-Living Loving Maid (She's Just a Woman)",
'-Ramble On',
'-Moby Dick',
'-Bring It on Home'],
'1979 In Through the Outdoor': [ '-In the Evening',
'-South Bound Saurez',
'-Fool in the Rain',
'-Hot Dog',
'-Carouselambra',
'-All My Love',
"-I'm Gonna Crawl"]}}
When I execute this code...
from bs4 import BeautifulSoup

with open("games.html", "r") as page:
    doc = BeautifulSoup(page, "html.parser")
    titles = doc.select("a.title")
    prices = doc.select("span.price-inner")
    for game_soup in doc.find_all("div", {"class": "game-options-wrapper"}):
        game_ids = (game_soup.button.get("data-game-id"))
    for title, price_official, price_lowest in zip(titles, prices[::2], prices[1::2]):
        print(title.text + ',' + str(price_official.text.replace('$', '').replace('~', ''))
              + ',' + str(price_lowest.text.replace('$', '').replace('~', '')))
The output is...
153356
80011
130187
119003
73502
156474
96592
154207
155123
152790
165013
110837
Call of Duty: Modern Warfare II (2022),69.99,77.05
Red Dead Redemption 2,14.85,13.79
God of War,28.12,22.03
ELDEN RING,50.36,48.10
Cyberpunk 2077,29.99,28.63
EA SPORTS FIFA 23,41.99,39.04
Warhammer 40,000: Darktide,39.99,45.86
Marvels Spider-Man Remastered,30.71,27.07
Persona 5 Royal,37.79,43.32
The Callisto Protocol,59.99,69.41
Need for Speed Unbound,69.99,42.29
Days Gone,15.00,9.01
But I'm trying to get the game id next to the other values on the same line.
Expected output:
Call of Duty: Modern Warfare II (2022),69.99,77.05,153356
Red Dead Redemption 2,14.85,13.79,80011
...
Even when I add game_ids to the print() call, it repeats the same game id for every line.
How can I go about resolving this issue?
HTML file: https://jsfiddle.net/m3hqy54x/
I feel like all 3 details (title, price_official, price_lowest) are probably in a shared container. It would be better to loop through those containers and select the details from each one, to make sure the right prices and titles are paired up, but I can't tell you how to do that without seeing at least a snippet of (or all of) "games.html".
Anyway, assuming that '110837\nCall of Duty: Modern Warfare II (2022)' is from the first title here, you can rewrite your last loop as something like:
for z in zip(titles, prices[::2], prices[1::2]):
    z, lw = list(z), ''
    for i in range(len(z)):
        if i == 0:  # title
            if '\n' in z[0].text:
                lw = z[0].text.split('\n', 1)[0]
            z[0] = ' '.join(w for w in z[0].text.split('\n', 1)[-1].split() if w)
            continue
        z[i] = z[i].text.replace('$', '').replace('~', '')
    print(','.join(z + [lw]))
Added EDIT: After seeing the html, this is my suggested solution:
for g in doc.select('div[data-container-game-id]'):
    gameId = g.get('data-container-game-id')
    title = g.select_one('a.title')
    if title:
        title = ' '.join(w for w in title.text.split() if w)
    price_official = g.select_one('.price-wrap > div:first-child span.price')
    price_lowest = g.select_one('.price-wrap > div:first-child+div span.price')
    if price_official:
        price_official = price_official.text.replace('$', '').replace('~', '')
    if price_lowest:
        price_lowest = price_lowest.text.replace('$', '').replace('~', '')
    print(', '.join([title, price_official, price_lowest, gameId]))
prints
Call of Duty: Modern Warfare II (2022), 69.99, 77.05, 153356
Red Dead Redemption 2, 14.85, 13.79, 80011
God of War, 28.12, 22.03, 130187
ELDEN RING, 50.36, 48.10, 119003
Cyberpunk 2077, 29.99, 28.63, 73502
EA SPORTS FIFA 23, 41.99, 39.04, 156474
Warhammer 40,000: Darktide, 39.99, 45.86, 96592
Marvel's Spider-Man Remastered, 30.71, 27.07, 154207
Persona 5 Royal, 37.79, 43.32, 155123
The Callisto Protocol, 59.99, 69.41, 152790
Need for Speed Unbound, 69.99, 42.29, 165013
Days Gone, 15.00, 9.01, 110837
Btw, this might look OK for just four values, but if you have a large number of details to extract, you might want to consider using a helper function like this.
I'm trying to sort a list of author names and book titles (shown below) by the author's last name. Does anyone know how to get the index of the value right before the ',' delimiter, i.e. the last name?
I need to put that index in the lambda x: x[here].
Also, if two author names are the same, how do I order those entries alphabetically by book title?
name_list= ["Dan Brown,The Da Vinci Code",
"Cornelia Funke,Inkheart",
"H G Wells,The War Of The Worlds",
"William Goldman,The Princess Bride",
"Harper Lee,To Kill a Mockingbird",
"Gary Paulsen,Hatchet",
"Jodi Picoult,My Sister's Keeper",
"Philip Pullman,The Golden Compass",
"J R R Tolkien,The Lord of the Rings",
"J R R Tolkien,The Hobbit",
"J.K. Rowling,Harry Potter Series",
"C S Lewis,The Lion the Witch and the Wardrobe",
"Louis Sachar,Holes",
"F. Scott Fitzgerald,The Great Gatsby",
"Eric Walters,Shattered",
"John Wyndham,The Chrysalids"]
def sorting(name):
    last_name = []
    name_list = book_rec(name)
    for i in name_list:
        last_name.append(i.split())
    name_list = []
    for i in sorted(last_name, key=lambda x: x[]):
        name_list.append(' '.join(i))
    return name_list
Split on the comma and keep the first part; then split on whitespace and keep the last element:
name_list.sort(key=lambda x: x.split(',')[0].split()[-1])
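For example, on a small subset of the list (the full list behaves the same way):

```python
name_list = ["H G Wells,The War Of The Worlds",
             "Dan Brown,The Da Vinci Code",
             "Cornelia Funke,Inkheart"]
# key extracts the last word before the comma: Wells, Brown, Funke
name_list.sort(key=lambda x: x.split(',')[0].split()[-1])
print(name_list[0])  # Dan Brown,The Da Vinci Code
```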
If you also want to sort by book title for entries with the same author last name, then it's better to use a named function as the key:
def sorting_key(author_title):
    author, title = author_title.split(',')
    # first by author last name, then by book title
    return author.split()[-1], title

name_list.sort(key=sorting_key)
print(name_list)
Output:
['Dan Brown,The Da Vinci Code',
'F. Scott Fitzgerald,The Great Gatsby',
'Cornelia Funke,Inkheart',
'William Goldman,The Princess Bride',
'Harper Lee,To Kill a Mockingbird',
'C S Lewis,The Lion the Witch and the Wardrobe',
'Gary Paulsen,Hatchet',
"Jodi Picoult,My Sister's Keeper",
'Philip Pullman,The Golden Compass',
'J.K. Rowling,Harry Potter Series',
'Louis Sachar,Holes',
'J R R Tolkien,The Hobbit',
'J R R Tolkien,The Lord of the Rings',
'Eric Walters,Shattered',
'H G Wells,The War Of The Worlds',
'John Wyndham,The Chrysalids']
I am trying to save this scraped data to a file (pickle it), but I cannot figure out why it fails with this code:
url = "https://www.imdb.com/list/ls016522954/?ref_=nv_tvv_dvd"
req = Request(url, headers={'User-Agent': 'Mozilla/5.0'})
web_byte = urlopen(req).read()
webpage = web_byte.decode('utf-8')
html_soup = BeautifulSoup(webpage, 'html5lib')
dvdNames = html_soup.find_all("div", class_="lister-item-content")
for dvd in dvdNames:
    dvdArray.append(dvd.a.string)
viewtitles = input("Finished!, do you want to view the DVD titles? (Y/N): ")
if viewtitles == "y".casefold():
    num = 1
    for name in dvdArray:
        print("" + str(num) + " - " + name)
        num += 1
elif viewtitles == "n".casefold():
    print("Not Showing TItles!")
else:
    print("that is not an option!")
saveToFile = input("Do you want to save / update the data? (Y/N): ")
if saveToFile == "y".casefold():
    with open("IMDBDVDNames.dat", "wb") as f:
        pickle.dump(dvdArray, f)
    continue
elif saveToFile == "n".casefold():
    print("Data Not Saved!")
    continue
else:
    print("That's not one of the option!")
    continue
I've tried adding sys.setrecursionlimit(1000000) and it doesn't make a difference (FYI). I am getting the error "maximum recursion depth exceeded while pickling an object", but when I run this code:
import pickle

testarray = []
if input("1 or 2?: ") == "1":
    testarray = ['1917', 'Onward', 'The Hunt', 'The Invisible Man', 'Human Capital', 'Dolittle', 'Birds of Prey: And the Fantabulous Emancipation of One Harley Quinn', 'The Gentlemen', 'Bloodshot', 'The Way Back', 'Clemency', 'The Grudge', 'I Still Believe', 'The Song of Names', 'Treadstone', 'Vivarium', 'Star Wars: Episode IX - The Rise of Skywalker', 'The Current War', 'Downhill', 'The Call of the Wild', 'Resistance', 'Banana Split', 'Bad Boys for Life', 'Sonic the Hedgehog', 'Mr. Robot', 'The Purge', 'VFW', 'The Other Lamb', 'Slay the Dragon', 'Clover', 'Lazy Susan', 'Rogue Warfare: The Hunt', 'Like a Boss', 'Little Women', 'Cats', 'Madam Secretary', 'Escape from Pretoria', 'The Cold Blue', 'The Night Clerk', 'Same Boat', 'The 420 Movie: Mary & Jane', 'Manou the Swift', 'Gold Dust', 'Sea Fever', 'Miles Davis: Birth of the Cool', 'The Lost Husband', 'Stray Dolls', 'Mortal Kombat Legends: Scorpions Revenge', 'Just Mercy', 'The Righteous Gemstones', 'Criminal Minds', 'Underwater', 'Final Kill', 'Green Rush', 'Butt Boy', 'The Quarry', 'Abe', 'Bad Therapy', 'Yip Man 4', 'The Last Full Measure', 'Looking for Alaska', 'The Turning', 'True History of the Kelly Gang', 'To the Stars', 'Robert the Bruce', 'Papa, sdokhni', 'The Rhythm Section', 'Arrow', 'The Assistant', 'Guns Akimbo', 'The Dark Red', 'Dreamkatcher', 'Fantasy Island', 'The Etruscan Smile', "A Nun's Curse", 'Allagash']
    with open("test.dat", "wb") as f:
        pickle.dump(testarray, f)
else:
    with open("test.dat", "rb") as f:
        testarray = pickle.load(f)
    print(testarray)
with what should be the exact same information (I did a print(dvdArray) and got the list that way, FYI), it WILL let me pickle it when I do it like that.
Can someone let me know why, and how I can fix it?
I know I'm scraping the data from a website and converting it into a list, but I cannot figure out what is causing the error in example 1 vs example 2.
Any help would be appreciated.
Thanks,
lttlejiver
In case anyone is curious, I added strip() when appending to dvdArray and it worked!
dvdArray.append(dvd.a.string.strip())
BeautifulSoup objects are highly recursive, and so are very difficult to pickle. When you do dvdArray.append(dvd.a.string), dvd.a.string is not a python string, but a bs4.element.NavigableString - one of these complex objects. By using strip(), you're actually converting the bs4.element.NavigableString to a python string, which is easily pickled. The same would be true if you used dvd.a.getText().
For future reference, when pickling, always remember to convert (where possible) BeautifulSoup objects to simpler python objects.
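A minimal sketch of the pattern, using a plain str subclass as a hypothetical stand-in for bs4.element.NavigableString (the real object also drags its parse tree along, which is what triggers the recursion error):

```python
import pickle

class TreeString(str):
    """Hypothetical stand-in for bs4.element.NavigableString."""
    pass

scraped = [TreeString('1917'), TreeString('Onward')]
# str() (like .strip() or .get_text()) yields a plain Python string,
# detached from any wrapper object, so it pickles without trouble
plain = [str(s) for s in scraped]
data = pickle.dumps(plain)
print(pickle.loads(data))  # ['1917', 'Onward']
```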
I'm making a program that counts how many times a band has played a song, using a webpage of all their setlists. I have grabbed the webpage and converted all the songs played into one big list, so all I wanted to do was check whether the song name is in the list and increment a counter, but it isn't working and I can't figure out why.
I've tried using the count function instead and that didn't work either.
sugaree_counter = 0
link = 'https://www.cs.cmu.edu/~mleone/gdead/dead-sets/' + year + '/' + month + '-' + day + '-' + year + '.txt'
page = requests.get(link)
page_text = page.text
page_list = [page_text.split('\n')]
print(page_list)
This code returns the list:
[['Winterland Arena, San Francisco, CA (1/2/72)', '', "Truckin'", 'Sugaree',
'Mr. Charlie', 'Beat it on Down the Line', 'Loser', 'Jack Straw',
'Chinatown Shuffle', 'Your Love At Home', 'Tennessee Jed', 'El Paso',
'You Win Again', 'Big Railroad Blues', 'Mexicali Blues',
'Playing in the Band', 'Next Time You See Me', 'Brown Eyed Women',
'Casey Jones', '', "Good Lovin'", 'China Cat Sunflower', 'I Know You Rider',
"Good Lovin'", 'Ramble On Rose', 'Sugar Magnolia', 'Not Fade Away',
"Goin' Down the Road Feeling Bad", 'Not Fade Away', '',
'One More Saturday Night', '', '']]
But when I do:
sugaree_counter = int(sugaree_counter)
if 'Sugaree' in page_list:
    sugaree_counter += 1
print(str(sugaree_counter))
It will always be zero.
It should add 1 to that because 'Sugaree' is in that list
Your page_list is a list of lists, so 'Sugaree' is never an element of the outer list; you need two loops to reach the songs, plus an equality check:
for page in page_list:
    for item in page:
        if item == 'Sugaree':
            sugaree_counter += 1
Use sum() and a list comprehension:
sugaree_counter = sum([page.count('Sugaree') for page in page_list])
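Either approach can be checked on a toy nested list (the song names are just illustrative):

```python
page_list = [['Sugaree', 'Loser', 'Sugaree'], ['Casey Jones', 'Sugaree']]

# nested loops with an equality check
counter = 0
for page in page_list:
    for item in page:
        if item == 'Sugaree':
            counter += 1
print(counter)  # 3

# equivalent one-liner over the inner lists
print(sum(page.count('Sugaree') for page in page_list))  # 3
```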
I need to turn this file's content into a dictionary, so that every key in the dict is the name of a movie and every value is a set of the names of the actors who play in it.
Example of file content:
Brad Pitt, Sleepers, Troy, Meet Joe Black, Oceans Eleven, Seven, Mr & Mrs Smith
Tom Hanks, You have got mail, Apollo 13, Sleepless in Seattle, Catch Me If You Can
Meg Ryan, You have got mail, Sleepless in Seattle
Diane Kruger, Troy, National Treasure
Dustin Hoffman, Sleepers, The Lost City
Anthony Hopkins, Hannibal, The Edge, Meet Joe Black, Proof
This should get you started:
line = "a, b, c, d"
result = {}
names = line.split(", ")
actor = names[0]
movies = names[1:]
result[actor] = movies
Try the following:
res_dict = {}
with open('my_file.txt', 'r') as f:
    for line in f:
        my_list = [item.strip() for item in line.split(',')]
        res_dict[my_list[0]] = my_list[1:]  # To make it a set, use: set(my_list[1:])
Explanation:
split() is used to split each line into a list, using ',' as the separator
strip() is used to remove the spaces around each element of that list
When you use a with statement, you do not need to close your file explicitly.
[item.strip() for item in line.split(',')] is called a list comprehension.
Output:
>>> res_dict
{'Diane Kruger': ['Troy', 'National Treasure'], 'Brad Pitt': ['Sleepers', 'Troy', 'Meet Joe Black', 'Oceans Eleven', 'Seven', 'Mr & Mrs Smith'], 'Meg Ryan': ['You have got mail', 'Sleepless in Seattle'], 'Tom Hanks': ['You have got mail', 'Apollo 13', 'Sleepless in Seattle', 'Catch Me If You Can'], 'Dustin Hoffman': ['Sleepers', 'The Lost City'], 'Anthony Hopkins': ['Hannibal', 'The Edge', 'Meet Joe Black', 'Proof']}
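Since the question asks for sets as values, the per-line processing can be sketched like this (the line is taken from the sample file):

```python
line = "Meg Ryan, You have got mail, Sleepless in Seattle\n"
# split on commas, strip surrounding whitespace from each piece
parts = [item.strip() for item in line.split(',')]
# first piece is the key, the rest become a set
key, movies = parts[0], set(parts[1:])
print(key)     # Meg Ryan
print(movies)  # a set of the two titles (set order may vary)
```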