File content into dictionary - python

I need to turn this file content into a dictionary, so that every key in the dict is the name of a movie and every value is a set containing the names of the actors that play in it.
Example of file content:
Brad Pitt, Sleepers, Troy, Meet Joe Black, Oceans Eleven, Seven, Mr & Mrs Smith
Tom Hanks, You have got mail, Apollo 13, Sleepless in Seattle, Catch Me If You Can
Meg Ryan, You have got mail, Sleepless in Seattle
Diane Kruger, Troy, National Treasure
Dustin Hoffman, Sleepers, The Lost City
Anthony Hopkins, Hannibal, The Edge, Meet Joe Black, Proof

This should get you started:
line = "a, b, c, d"
result = {}
names = line.split(", ")
actor = names[0]
movies = names[1:]
result[actor] = movies

Try the following:
res_dict = {}
with open('my_file.txt', 'r') as f:
    for line in f:
        my_list = [item.strip() for item in line.split(',')]
        res_dict[my_list[0]] = my_list[1:]  # To make it a set, use: set(my_list[1:])
Explanation:
split(',') splits each line into a list, using ',' as the separator.
strip() removes the whitespace around each element of that list.
When you use a with statement, you do not need to close the file explicitly.
[item.strip() for item in line.split(',')] is called a list comprehension.
Output:
>>> res_dict
{'Diane Kruger': ['Troy', 'National Treasure'], 'Brad Pitt': ['Sleepers', 'Troy', 'Meet Joe Black', 'Oceans Eleven', 'Seven', 'Mr & Mrs Smith'], 'Meg Ryan': ['You have got mail', 'Sleepless in Seattle'], 'Tom Hanks': ['You have got mail', 'Apollo 13', 'Sleepless in Seattle', 'Catch Me If You Can'], 'Dustin Hoffman': ['Sleepers', 'The Lost City'], 'Anthony Hopkins': ['Hannibal', 'The Edge', 'Meet Joe Black', 'Proof']}
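The snippets above key the dictionary by actor, while the question asks for the opposite mapping (movie title to a set of actors). A minimal sketch of that inversion follows. The function name movies_to_actors and the in-memory sample list are assumptions made for illustration; with a real file you would pass the open file object instead of the list.

```python
from collections import defaultdict


def movies_to_actors(lines):
    """Map each movie title to the set of actors listed for it."""
    result = defaultdict(set)
    for line in lines:
        fields = [item.strip() for item in line.split(',')]
        if not fields[0]:
            continue  # skip blank lines
        actor, movies = fields[0], fields[1:]
        for movie in movies:
            result[movie].add(actor)
    return dict(result)


# Hypothetical sample taken from the question's file content.
sample = [
    "Brad Pitt, Sleepers, Troy, Meet Joe Black, Oceans Eleven, Seven, Mr & Mrs Smith",
    "Tom Hanks, You have got mail, Apollo 13, Sleepless in Seattle, Catch Me If You Can",
    "Meg Ryan, You have got mail, Sleepless in Seattle",
    "Diane Kruger, Troy, National Treasure",
]
movie_dict = movies_to_actors(sample)
print(movie_dict["Troy"])
```

A set is used for the values so that an actor listed twice for the same movie is stored only once.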

Related

How do I add lines to a key and different lines as values?

So I start out with a file that lists title, actor, title, actor, etc.
12 Years a Slave
Topsy Chapman
12 Years a Slave
Devin Maurice Evans
12 Years a Slave
Brad Pitt
12 Years a Slave
Jay Huguley
12 Years a Slave
Devyn A. Tyler
12 Years a Slave
Willo Jean-Baptiste
American Hustle
Christian Bale
American Hustle
Bradley Cooper
American Hustle
Amy Adams
American Hustle
Jeremy Renner
American Hustle
Jennifer Lawrence
I need to make a dictionary that looks like what's below and lists all actors in the movie
{'Movie Title': ['All actors'], 'Movie Title': ['All Actors']}
So far I only have this
d = {}
with open(file, 'r') as f:
    for key in f:
        d[key.strip()] = next(f).split()
print(d)
Using a defaultdict is usually a better choice:
from collections import defaultdict
data = defaultdict(list)
with open("filename.txt", 'r') as f:
    stripped = map(str.strip, f)
    for movie, actor in zip(stripped, stripped):
        data[movie].append(actor)
print(data)
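The zip(stripped, stripped) call works because both arguments are the same iterator, so each step of the loop consumes two consecutive lines: a title and then an actor. Here is a small self-contained sketch of that idiom on an in-memory list (the sample lines are an assumption standing in for the file). The explicit iter() is there because in Python 2 map returns a list, and zipping a list with itself would pair every line with itself.

```python
from collections import defaultdict

# Hypothetical stand-in for the lines of the file.
lines = [
    "12 Years a Slave\n", "Brad Pitt\n",
    "American Hustle\n", "Amy Adams\n",
    "12 Years a Slave\n", "Jay Huguley\n",
]

stripped = iter(map(str.strip, lines))
data = defaultdict(list)
# Both zip arguments share one iterator, so lines are taken in pairs.
for movie, actor in zip(stripped, stripped):
    data[movie].append(actor)

print(dict(data))
```

If the input has an odd number of lines, zip silently drops the unpaired last title.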
So you need to switch between reading the title and reading the actor from the input data. You also need to store the title, so you can use it in the actor line.
You can use the setting of the title for switching between reading the title and reading the actor.
Some key checking and you have working logic.
# pretty printer to make the output nice
from pprint import pprint
data = """ 12 Years a Slave
Topsy Chapman
12 Years a Slave
Devin Maurice Evans
12 Years a Slave
Brad Pitt
12 Years a Slave
Jay Huguley
12 Years a Slave
Devyn A. Tyler
12 Years a Slave
Willo Jean-Baptiste
American Hustle
Christian Bale
American Hustle
Bradley Cooper
American Hustle
Amy Adams
American Hustle
Jeremy Renner
American Hustle
Jennifer Lawrence"""
result = {}
title = None
for line in data.splitlines():
    # clean here once
    line = line.strip()
    if not title:
        # store the title
        title = line
    else:
        # check if title already exists
        if title in result:
            # if yes, append actor
            result[title].append(line)
        else:
            # if no, create it with new list for actors
            # and of course, add the current line as actor
            result[title] = [line]
        # reset title to None
        title = None
pprint(result)
Output:
{'12 Years a Slave': ['Topsy Chapman',
'Devin Maurice Evans',
'Brad Pitt',
'Jay Huguley',
'Devyn A. Tyler',
'Willo Jean-Baptiste'],
'American Hustle': ['Christian Bale',
'Bradley Cooper',
'Amy Adams',
'Jeremy Renner',
'Jennifer Lawrence']}
EDIT
When reading from a file, you need to do it slightly differently:
from pprint import pprint
result = {}
title = None
with open("somefile.txt") as infile:
    for line in infile.read().splitlines():
        line = line.strip()
        if not title:
            title = line
        else:
            if title in result:
                result[title].append(line)
            else:
                result[title] = [line]
            title = None
pprint(result)

Sort a list alphabetically by last name, and by book title if the names are the same

I am trying to sort a list of "author,book title" strings by the author's last name. Does anyone know how to get the index of the word right before the ',' delimiter, which is the last name?
I need to put that index in the lambda x: x[here].
Also, if the author names are the same, how do I order those entries alphabetically by book title?
name_list= ["Dan Brown,The Da Vinci Code",
"Cornelia Funke,Inkheart",
"H G Wells,The War Of The Worlds",
"William Goldman,The Princess Bride",
"Harper Lee,To Kill a Mockingbird",
"Gary Paulsen,Hatchet",
"Jodi Picoult,My Sister's Keeper",
"Philip Pullman,The Golden Compass",
"J R R Tolkien,The Lord of the Rings",
"J R R Tolkien,The Hobbit",
"J.K. Rowling,Harry Potter Series",
"C S Lewis,The Lion the Witch and the Wardrobe",
"Louis Sachar,Holes",
"F. Scott Fitzgerald,The Great Gatsby",
"Eric Walters,Shattered",
"John Wyndham,The Chrysalids"]
def sorting(name):
    last_name = []
    name_list = book_rec(name)
    for i in name_list:
        last_name.append(i.split())
    name_list = []
    for i in sorted(last_name, key=lambda x: x[]):
        name_list.append(' '.join(i))
    return name_list
Split on the comma and keep the first part; then split on whitespace and keep the last part:
name_list.sort(key=lambda x: x.split(',')[0].split()[-1])
If you also want to sort by book title for the same author last name, then it may be better to use a function that returns the key:
def sorting_key(author_title):
    author, title = author_title.split(',')
    # first by author last name, then by book title
    return author.split()[-1], title
name_list.sort(key=sorting_key)
print(name_list)
Output:
['Dan Brown,The Da Vinci Code',
'F. Scott Fitzgerald,The Great Gatsby',
'Cornelia Funke,Inkheart',
'William Goldman,The Princess Bride',
'Harper Lee,To Kill a Mockingbird',
'C S Lewis,The Lion the Witch and the Wardrobe',
'Gary Paulsen,Hatchet',
"Jodi Picoult,My Sister's Keeper",
'Philip Pullman,The Golden Compass',
'J.K. Rowling,Harry Potter Series',
'Louis Sachar,Holes',
'J R R Tolkien,The Hobbit',
'J R R Tolkien,The Lord of the Rings',
'Eric Walters,Shattered',
'H G Wells,The War Of The Worlds',
'John Wyndham,The Chrysalids']
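One caveat with the key function above is that author_title.split(',') raises a ValueError on unpacking if a title itself contains a comma. A hedged variant is sketched below: it uses maxsplit=1 so only the first comma separates author from title, and casefold() so the comparison ignores letter case. Both choices are assumptions about what is wanted, and the Lewis entry with an extra comma is a hypothetical variation of the list in the question.

```python
def sorting_key(author_title):
    # maxsplit=1: everything after the first comma belongs to the title
    author, title = author_title.split(',', 1)
    # sort by last name first, then by title, ignoring case
    return author.split()[-1].casefold(), title.casefold()


books = [
    "J R R Tolkien,The Lord of the Rings",
    "Dan Brown,The Da Vinci Code",
    "J R R Tolkien,The Hobbit",
    "C S Lewis,The Lion, the Witch and the Wardrobe",  # hypothetical comma in title
]
ordered = sorted(books, key=sorting_key)
print(ordered)
```

Returning a tuple from the key function is what gives the two-level ordering: tuples compare element by element, so the title is only consulted when the last names are equal.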

Python file parsing, can't catch strings in new line

I am parsing a large text file with 56,900 book titles, each with its authors and an etext number.
I am trying to find the authors by parsing the file.
The file looks like this:
TITLE and AUTHOR ETEXT NO.
Aspects of plant life; with special reference to the British flora,      56900
by Robert Lloyd Praeger
The Vicar of Morwenstow, by Sabine Baring-Gould 56899
[Subtitle: Being a Life of Robert Stephen Hawker, M.A.]
Raamatun tutkisteluja IV, mennessä Charles T. Russell 56898
[Subtitle: Harmagedonin taistelu]
[Language: Finnish]
Raamatun tutkisteluja III, mennessä Charles T. Russell 56897
[Subtitle: Tulkoon valtakuntasi]
[Language: Finnish]
Tom Thatcher's Fortune, by Horatio Alger, Jr. 56896
A Yankee Flier in the Far East, by Al Avery 56895
and George Rutherford Montgomery
[Illustrator: Paul Laune]
Nancy Brandon's Mystery, by Lillian Garis 56894
Nervous Ills, by Boris Sidis 56893
[Subtitle: Their Cause and Cure]
Pensées sans langage, par Francis Picabia 56892
[Language: French]
Helon's Pilgrimage to Jerusalem, Volume 2 of 2, by Frederick Strauss 56891
[Subtitle: A picture of Judaism, in the century
which preceded the advent of our Savior]
Fra Tommaso Campanella, Vol. 1, di Luigi Amabile 56890
[Subtitle: la sua congiura, i suoi processi e la sua pazzia]
[Language: Italian]
The Blue Star, by Fletcher Pratt 56889
Importanza e risultati degli incrociamenti in avicoltura, 56888
di Teodoro Pascal
[Language: Italian]
The Junior Classics, Volume 3: Tales from Greece and Rome, by Various 56887
~ ~ ~ ~ Posting Dates for the below eBooks: 1 Mar 2018 to 31 Mar 2018 ~ ~ ~ ~
TITLE and AUTHOR ETEXT NO.
The American Missionary, Volume 41, No. 1, January, 1887, by Various 56886
Morganin miljoonat, mennessä Sven Elvestad 56885
[Author a.k.a. Stein Riverton]
[Subtitle: Salapoliisiromaani]
[Language: Finnish]
"Trip to the Sunny South" in March, 1885, by L. S. D 56884
Balaam and His Master, by Joel Chandler Harris 56883
[Subtitle: and Other Sketches and Stories]
Susien saaliina, mennessä Jack London 56882
[Language: Finnish]
Forged Egyptian Antiquities, by T. G. Wakeling 56881
The Secret Doctrine, Vol. 3 of 4, by Helena Petrovna Blavatsky 56880
[Subtitle: Third Edition]
No Posting 56879
The author name usually starts after "by"; when there is no "by" in the line, the author name starts after a comma ",". However, the "," can be part of the title if the line has a "by".
So I parsed for "by" first, then for the comma.
Here is what I tried:
def search_by_author():
    fhand = open('GUTINDEX.ALL')
    print("Search by Author:")
    for line in fhand:
        if not line.startswith(" [") and not line.startswith("TITLE"):
            if not line.startswith("~"):
                words = line.rstrip()
                words = line.lstrip()
                words = words[:-6]
                if ", by" in words:
                    words = words[words.find(', by'):]
                    words = words[5:]
                    print (words)
                else:
                    words = words[words.find(', '):]
                    words = words[5:]
                    if "," in words:
                        words = words[words.find(', '):]
                        if words.startswith(','):
                            words =words[words.find(','):]
                            print (words)
                        else:
                            print (words)
                    else:
                        print (words)
                        if " by" in words:
                            words = words[words.find('by')]
                            print(words)
search_by_author()
However it can't seem to find the author name for lines like
Aspects of plant life; with special reference to the British flora,      56900
by Robert Lloyd Praeger
As per your file, info about a book can be spread across multiple lines. There is a blank line after each book info. I used that to gather all info about a book and then parse it to get the author info.
import re
def search_by_author():
    fhand = open('GUTINDEX.ALL')
    book_info = ''
    for line in fhand:
        line = line.rstrip()
        if (line.startswith('TITLE') or line.startswith('~')):
            continue
        if (len(line) == 0):
            # remove info in square bracket from book_info
            book_info = re.sub(r'\[.*$', '', book_info)
            if ('by ' in book_info):
                tokens = book_info.split('by ')
            else:
                tokens = book_info.split(',')
            if (len(tokens) > 1):
                authors = tokens[-1].strip()
                print(authors)
            book_info = ''
        else:
            # remove ETEXT NO. from line
            line = re.sub(r'\d+$', '', line)
            book_info += ' ' + line.rstrip()
search_by_author()
Output:
Robert Lloyd Praeger
Sabine Baring-Gould
mennessä Charles T. Russell
mennessä Charles T. Russell
Horatio Alger, Jr.
Al Avery and George Rutherford Montgomery
Lillian Garis
Boris Sidis
par Francis Picabia
Frederick Strauss
di Luigi Amabile
Fletcher Pratt
di Teodoro Pascal
Various
Various
mennessä Sven Elvestad
L. S. D
Joel Chandler Harris
mennessä Jack London
T. G. Wakeling
Helena Petrovna Blavatsky
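The same idea (collect the lines of one entry, then parse the joined text) can be tried without the index file. The sketch below runs on an in-memory sample and returns the authors as a list instead of printing them. The function name extract_authors and the sample string are assumptions for illustration, and the sketch relies on entries being separated by blank lines, as the answer describes.

```python
import re


def extract_authors(text):
    """Return one author string per blank-line-separated entry."""
    authors = []
    for block in re.split(r'\n\s*\n', text.strip()):
        parts = []
        for line in block.splitlines():
            line = line.strip()
            if not line or line.startswith(('TITLE', '~')):
                continue  # header and separator lines carry no entry data
            # drop the trailing ETEXT number, if the line has one
            parts.append(re.sub(r'\s*\d+$', '', line))
        # join the entry into one string and cut off bracketed annotations
        info = re.sub(r'\[.*$', '', ' '.join(parts)).strip()
        if 'by ' in info:
            authors.append(info.split('by ')[-1].strip())
        elif ',' in info:
            authors.append(info.split(',')[-1].strip())
    return authors


# Hypothetical excerpt in the layout shown in the question.
sample = """Aspects of plant life; with special reference to the British flora,      56900
 by Robert Lloyd Praeger

The Vicar of Morwenstow, by Sabine Baring-Gould                          56899
 [Subtitle: Being a Life of Robert Stephen Hawker, M.A.]

Nervous Ills, by Boris Sidis                                             56893
 [Subtitle: Their Cause and Cure]
"""
print(extract_authors(sample))
```

As in the output above, entries that use "par", "di" or "mennessä" instead of "by" fall through to the comma split and keep that word in front of the name.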

Python regex to capture a comma-delimited list of items

I have a list of weather forecasts that start with a similar prefix that I'd like to remove. I'd also like to capture the city names:
Some Examples:
If you have vacation or wedding plans in Phoenix, Tucson, Flagstaff,
Salt Lake City, Park City, Denver, Estes Park, Colorado Springs,
Pueblo, or Albuquerque, the week will...
If you have vacation or wedding plans for Miami, Jacksonville, Macon,
Charlotte, or Charleston, expect a couple systems...
If you have vacation or wedding plans in Pittsburgh, Philadelphia,
Atlantic City, Newark, Baltimore, D.C., Richmond, Charleston, or
Dover, expect the week...
The strings start with a common prefix "If you have vacation or wedding plans in" and the last city has "or" before it. The list of cities is of variable length.
I've tried this:
>>> text = 'If you have vacation or wedding plans in NYC, Boston, Manchester, Concord, Providence, or Portland'
>>> re.search(r'^If you have vacation or wedding plans in ((\b\w+\b), ?)+ or (\w+)', text).groups()
('Providence,', 'Providence', 'Portland')
>>>
I think I'm pretty close, but obviously it's not working. I've never tried to do something with a variable number of captured items; any guidance would be greatly appreciated.
Alternative solution here (probably just for sharing and educational purposes).
If you were to solve it with nltk, it would be called a Named Entity Recognition problem. Using the snippet based on nltk.chunk.ne_chunk_sents(), provided here:
import nltk
def extract_entity_names(t):
    entity_names = []
    if hasattr(t, 'label') and t.label:
        if t.label() == 'NE':
            entity_names.append(' '.join([child[0] for child in t]))
        else:
            for child in t:
                entity_names.extend(extract_entity_names(child))
    return entity_names
sample = "If you have vacation or wedding plans in Phoenix, Tucson, Flagstaff, Salt Lake City, Park City, Denver, Estes Park, Colorado Springs, Pueblo, or Albuquerque, the week will..."
sentences = nltk.sent_tokenize(sample)
tokenized_sentences = [nltk.word_tokenize(sentence) for sentence in sentences]
tagged_sentences = [nltk.pos_tag(sentence) for sentence in tokenized_sentences]
chunked_sentences = nltk.ne_chunk_sents(tagged_sentences, binary=True)
entity_names = []
for tree in chunked_sentences:
    entity_names.extend(extract_entity_names(tree))
print(entity_names)
It prints exactly the desired result:
['Phoenix', 'Tucson', 'Flagstaff', 'Salt Lake City', 'Park City', 'Denver', 'Estes Park', 'Colorado Springs', 'Pueblo', 'Albuquerque']
Here is my approach: use the csv module to parse the lines (I assume they are in a text file named data.csv; please change this to suit your situation). After parsing each line:
Discard the last cell, as it is not a city name
Remove 'If ...' from the first cell
Remove 'or ' from the last cell (which used to be next-to-last)
Here is the code:
import csv
def cleanup(row):
    new_row = row[:-1]
    new_row[0] = new_row[0].replace('If you have vacation or wedding plans in ', '')
    new_row[0] = new_row[0].replace('If you have vacation or wedding plans for ', '')
    new_row[-1] = new_row[-1].replace('or ', '')
    return new_row
if __name__ == '__main__':
    with open('data.csv') as f:
        reader = csv.reader(f, skipinitialspace=True)
        for row in reader:
            row = cleanup(row)
            print(row)
Output:
['Phoenix', 'Tucson', 'Flagstaff', 'Salt Lake City', 'Park City', 'Denver', 'Estes Park', 'Colorado Springs', 'Pueblo', 'Albuquerque']
['Miami', 'Jacksonville', 'Macon', 'Charlotte', 'Charleston']
['Pittsburgh', 'Philadelphia', 'Atlantic City', 'Newark', 'Baltimore', 'D.C.', 'Richmond', 'Charleston', 'Dover']
import re
s = "If you have vacation or wedding plans for Miami, Jacksonville, Macon, Charlotte, or Charleston, expect a couple systems"
p = re.compile(r"If you have vacation or wedding plans (in|for) ((\w+, )+)or (\w+)")
m = p.match(s)
print(m.group(2)) # output: Miami, Jacksonville, Macon, Charlotte,
cities = m.group(2).split(", ") # cities = ['Miami', 'Jacksonville', 'Macon', 'Charlotte', '']
cities[-1] = m.group(4) # add the city after or
print(cities) # cities = ['Miami', 'Jacksonville', 'Macon', 'Charlotte', 'Charleston']
Each city is matched by the pattern (\w+, ), and the last one by or (\w+).
The cities are then split on the separator ", ".
By the way, since the pattern is applied to many strings, it is preferable to work with the compiled object.
PS: the word after "plans" can be "for" or "in", according to the examples you provided.
How about this
>>> text = 'If you have vacation or wedding plans for Phoenix, Tucson, Flagstaff, Salt Lake City, Park City, Denver, Estes Park, Colorado Springs, Pueblo, or Albuquerque, the week will'
>>> match = re.search(r'^If you have vacation or wedding plans (in?|for?) ([\w+ ,]+)',text).groups()[1].split(", ")
Output
>>> match
['Phoenix', 'Tucson', 'Flagstaff', 'Salt Lake City', 'Park City', 'Denver', 'Estes Park', 'Colorado Springs', 'Pueblo', 'or Albuquerque', 'the week will']
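The regex answers above either stop at multi-word names (\w+ does not match the space in "Salt Lake City") or leave 'or Albuquerque' and the trailing text in the result. A possible variation is sketched here: capture everything up to the " or " that introduces the last city, then split that head on ", ". The names CITY_LIST and extract_cities are assumptions, and the sketch assumes each forecast is a single line and that no city name contains " or ".

```python
import re

CITY_LIST = re.compile(
    r'^If you have vacation or wedding plans (?:in|for) '
    r'(?P<head>.+?),? or (?P<last>[^,]+)'
)


def extract_cities(text):
    """Return the list of city names, or an empty list if the prefix is missing."""
    match = CITY_LIST.match(text)
    if match is None:
        return []
    # every city before the final "or" is separated by ", "
    cities = match.group('head').split(', ')
    # the last city runs up to the next comma (or the end of the string)
    cities.append(match.group('last').strip())
    return cities


text = ('If you have vacation or wedding plans in Phoenix, Tucson, Flagstaff, '
        'Salt Lake City, Park City, Denver, Estes Park, Colorado Springs, '
        'Pueblo, or Albuquerque, the week will...')
print(extract_cities(text))
```

The last city is captured with [^,]+, so if no comma follows it, any remaining text on the line ends up in that final item.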

Append items to dictionary Python

I am trying to write a function in python that opens a file and parses it into a dictionary. I am trying to make the first item in the list block the key for each item in the dictionary data. Then each item is supposed to be the rest of the list block less the first item. For some reason though, when I run the following function, it parses it incorrectly. I have provided the output below. How would I be able to parse it like I stated above? Any help would be greatly appreciated.
Function:
def parseData() :
    filename="testdata.txt"
    file=open(filename,"r+")
    block=[]
    for line in file:
        block.append(line)
        if line in ('\n', '\r\n'):
            album=block.pop(1)
            data[block[1]]=album
            block=[]
    print(data)
Input:
Bob Dylan
1966 Blonde on Blonde
-Rainy Day Women #12 & 35
-Pledging My Time
-Visions of Johanna
-One of Us Must Know (Sooner or Later)
-I Want You
-Stuck Inside of Mobile with the Memphis Blues Again
-Leopard-Skin Pill-Box Hat
-Just Like a Woman
-Most Likely You Go Your Way (And I'll Go Mine)
-Temporary Like Achilles
-Absolutely Sweet Marie
-4th Time Around
-Obviously 5 Believers
-Sad Eyed Lady of the Lowlands
Output:
{'-Rainy Day Women #12 & 35\n': '1966 Blonde on Blonde\n',
'-Whole Lotta Love\n': '1969 II\n', '-In the Evening\n': '1979 In Through the Outdoor\n'}
You can use groupby to group the lines, using the empty lines as delimiters. For each group, extract the first element as the key, then use a defaultdict to handle repeated keys by extending the key's list with the remaining lines of the group.
from itertools import groupby
from collections import defaultdict
d = defaultdict(list)
with open("file.txt") as f:
    for k, val in groupby(f, lambda x: x.strip() != ""):
        # if k is True we have a section
        if k:
            # get key "k" which is the first line
            # from each section, val will be the remaining lines
            k,*v = val
            # add or add to the existing key/value pairing
            d[k].extend(map(str.rstrip,v))
from pprint import pprint as pp
pp(d)
Output:
{'Bob Dylan\n': ['1966 Blonde on Blonde',
'-Rainy Day Women #12 & 35',
'-Pledging My Time',
'-Visions of Johanna',
'-One of Us Must Know (Sooner or Later)',
'-I Want You',
'-Stuck Inside of Mobile with the Memphis Blues Again',
'-Leopard-Skin Pill-Box Hat',
'-Just Like a Woman',
"-Most Likely You Go Your Way (And I'll Go Mine)",
'-Temporary Like Achilles',
'-Absolutely Sweet Marie',
'-4th Time Around',
'-Obviously 5 Believers',
'-Sad Eyed Lady of the Lowlands'],
'Led Zeppelin\n': ['1979 In Through the Outdoor',
'-In the Evening',
'-South Bound Saurez',
'-Fool in the Rain',
'-Hot Dog',
'-Carouselambra',
'-All My Love',
"-I'm Gonna Crawl",
'1969 II',
'-Whole Lotta Love',
'-What Is and What Should Never Be',
'-The Lemon Song',
'-Thank You',
'-Heartbreaker',
"-Living Loving Maid (She's Just a Woman)",
'-Ramble On',
'-Moby Dick',
'-Bring It on Home']}
For python2 the unpack syntax is slightly different:
with open("file.txt") as f:
    for k, val in groupby(f, lambda x: x.strip() != ""):
        if k:
            k, v = next(val), val
            d[k].extend(map(str.rstrip, v))
If you want to keep the newlines, remove the map(str.rstrip, ...) call.
If you want the album and songs separately for each artist:
from itertools import groupby
from collections import defaultdict
d = defaultdict(lambda: defaultdict(list))
with open("file.txt") as f:
    for k, val in groupby(f, lambda x: x.strip() != ""):
        if k:
            k, alb, songs = next(val),next(val), val
            d[k.rstrip()][alb.rstrip()] = list(map(str.rstrip, songs))
from pprint import pprint as pp
pp(d)
{'Bob Dylan': {'1966 Blonde on Blonde': ['-Rainy Day Women #12 & 35',
'-Pledging My Time',
'-Visions of Johanna',
'-One of Us Must Know (Sooner or '
'Later)',
'-I Want You',
'-Stuck Inside of Mobile with the '
'Memphis Blues Again',
'-Leopard-Skin Pill-Box Hat',
'-Just Like a Woman',
'-Most Likely You Go Your Way '
"(And I'll Go Mine)",
'-Temporary Like Achilles',
'-Absolutely Sweet Marie',
'-4th Time Around',
'-Obviously 5 Believers',
'-Sad Eyed Lady of the Lowlands']},
'Led Zeppelin': {'1969 II': ['-Whole Lotta Love',
'-What Is and What Should Never Be',
'-The Lemon Song',
'-Thank You',
'-Heartbreaker',
"-Living Loving Maid (She's Just a Woman)",
'-Ramble On',
'-Moby Dick',
'-Bring It on Home'],
'1979 In Through the Outdoor': ['-In the Evening',
'-South Bound Saurez',
'-Fool in the Rain',
'-Hot Dog',
'-Carouselambra',
'-All My Love',
"-I'm Gonna Crawl"]}}
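To see how groupby splits the input into blocks without needing file.txt, here is a small sketch on an in-memory list. The sample lines and the name discography are assumptions for illustration; the key function returns True for text lines and False for blank ones, so each run of consecutive non-blank lines comes out as one group.

```python
from collections import defaultdict
from itertools import groupby

# Hypothetical stand-in for the lines of the input file.
lines = [
    "Bob Dylan\n", "1966 Blonde on Blonde\n",
    "-Rainy Day Women #12 & 35\n", "-Pledging My Time\n",
    "\n",
    "Led Zeppelin\n", "1969 II\n", "-Whole Lotta Love\n",
    "\n",
    "Led Zeppelin\n", "1979 In Through the Outdoor\n", "-In the Evening\n",
]

discography = defaultdict(dict)
for has_text, block in groupby(lines, key=lambda ln: ln.strip() != ""):
    if not has_text:
        continue  # this group is the run of blank separator lines
    # first line is the artist, second the album, the rest are songs
    artist, album, *songs = (ln.strip() for ln in block)
    discography[artist][album] = songs

print(dict(discography))
```

A block with fewer than two lines would make the unpacking raise a ValueError, so real input may need a length check first.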
I guess this is what you want?
Even if this is not the format you wanted, there are a few things you might learn from the answer:
use with for file handling
nice to have:
PEP8 compliant code, see http://pep8online.com/
a shebang
numpydoc
if __name__ == '__main__'
And SE does not like a list being continued by code...
#!/usr/bin/env python
"""Parse text files with songs, grouped by album and artist."""
def add_to_data(data, block):
    """
    Parameters
    ----------
    data : dict
    block : list
    Returns
    -------
    dict
    """
    artist = block[0]
    album = block[1]
    songs = block[2:]
    if artist in data:
        data[artist][album] = songs
    else:
        data[artist] = {album: songs}
    return data
def parseData(filename='testdata.txt'):
    """
    Parameters
    ----------
    filename : string
        Path to a text file.
    Returns
    -------
    dict
    """
    data = {}
    with open(filename) as f:
        block = []
        for line in f:
            line = line.strip()
            if line == '':
                # skip empty blocks, e.g. after consecutive blank lines
                if block:
                    data = add_to_data(data, block)
                block = []
            else:
                block.append(line)
    # add the last block if the file does not end with a blank line
    if block:
        data = add_to_data(data, block)
    return data
if __name__ == '__main__':
    data = parseData()
    import pprint
    pp = pprint.PrettyPrinter(indent=4)
    pp.pprint(data)
which gives:
{ 'Bob Dylan': { '1966 Blonde on Blonde': [ '-Rainy Day Women #12 & 35',
'-Pledging My Time',
'-Visions of Johanna',
'-One of Us Must Know (Sooner or Later)',
'-I Want You',
'-Stuck Inside of Mobile with the Memphis Blues Again',
'-Leopard-Skin Pill-Box Hat',
'-Just Like a Woman',
"-Most Likely You Go Your Way (And I'll Go Mine)",
'-Temporary Like Achilles',
'-Absolutely Sweet Marie',
'-4th Time Around',
'-Obviously 5 Believers',
'-Sad Eyed Lady of the Lowlands']},
'Led Zeppelin': { '1969 II': [ '-Whole Lotta Love',
'-What Is and What Should Never Be',
'-The Lemon Song',
'-Thank You',
'-Heartbreaker',
"-Living Loving Maid (She's Just a Woman)",
'-Ramble On',
'-Moby Dick',
'-Bring It on Home'],
'1979 In Through the Outdoor': [ '-In the Evening',
'-South Bound Saurez',
'-Fool in the Rain',
'-Hot Dog',
'-Carouselambra',
'-All My Love',
"-I'm Gonna Crawl"]}}
