Creating a Dictionary from File [duplicate]

This question already has an answer here:
Closed 10 years ago.
Possible Duplicate:
Reading a File into a Dictionary And Keeping Count
I am trying to create a dictionary with two values: the first value is the text:
<NEW ARTICLE>
Take a look at
what I found.
<NEW ARTICLE>
It looks like something
dark and shiny.
<NEW ARTICLE>
But how can something be dark
and shiny at the same time?
<NEW ARTICLE>
I have no idea.
and the second value is the count of how many times the word "ARTICLE>" is used.
I tried different methods, and one of them produced this error:
(key, val) = line.split()
ValueError: need more than 1 value to unpack
I've tried a few different methods, but to no avail; one of them failed with "too many values to unpack".
I want to be able to search for a key/word in the dictionary later on and find its appropriate count.
Using Python 3.

This should do it:
>>> with open("data1.txt") as f:
...     lines = f.read()
...     spl = lines.split("<NEW ARTICLE>")[1:]
...     dic = dict((i, x.strip()) for i, x in enumerate(spl))
...     print(dic)
...
{0: 'Take a look at \nwhat I found.',
1: 'It looks like something\ndark and shiny.',
2: 'But how can something be dark\nand shiny at the same time?',
3: 'I have no idea.'}
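If the marker count is needed as well, the same split can provide both. What follows is only a minimal sketch, assuming the file contents have already been read into a string; `parse_articles` and `MARKER` are hypothetical names that do not appear in the question.

```python
MARKER = "<NEW ARTICLE>"  # assumed marker, taken from the sample text


def parse_articles(text):
    """Return (articles, marker_count) for text delimited by MARKER."""
    # Everything before the first marker is dropped, as in the answer above.
    chunks = text.split(MARKER)[1:]
    articles = {i: chunk.strip() for i, chunk in enumerate(chunks)}
    return articles, text.count(MARKER)


sample = (
    "<NEW ARTICLE>\nTake a look at\nwhat I found.\n"
    "<NEW ARTICLE>\nI have no idea.\n"
)
articles, marker_count = parse_articles(sample)
print(articles)
print(marker_count)
```

Keeping the count next to the dictionary avoids re-reading the file when the number of articles is needed later.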

Make sure you don't have an empty line somewhere:
if newdoc and line != "ARTICLE>" and line.strip():
    (key, val) = line.split()
(An empty line is split into [], which cannot be unpacked into two names.)
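The failure is easy to reproduce: splitting an empty or whitespace-only line yields an empty list, which cannot be unpacked into two names. Here is a small sketch with made-up sample lines; `parse_pairs` is a hypothetical helper that skips any line not made of exactly two fields.

```python
def parse_pairs(lines):
    """Collect key/value pairs, ignoring lines without exactly two fields."""
    pairs = {}
    for line in lines:
        fields = line.split()  # "" and "\n" both split to []
        if len(fields) != 2:
            continue  # blank or malformed line: nothing to unpack
        key, val = fields
        pairs[key] = val
    return pairs


lines = ["alpha 1\n", "\n", "beta 2\n", "too many fields here\n"]
result = parse_pairs(lines)
print(result)
```

Checking the number of fields covers both errors mentioned in the question: too few values and too many values to unpack.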

Related

Comparing items through a tuple in Python

I am given an assignment when I am supposed to define a function that returns the second element of a tuple if the first element of a tuple matches with the argument of a function.
Specifically, let's say that I have a list of student registration numbers that goes by:
particulars = (("S12345", "John"), ("S23456", "Max"), ("S34567", "Mary"))
And I have defined a function that is supposed to take in the argument of reg_num, such as "S12345", and return the name of the student in this case, "John". If the number does not match at all, I need to print "Not found" as a message. In essence, I understand that I need to sort through the larger tuple, and compare the first element [0] of each smaller tuple, then return the [1] entry of each smaller tuple. Here's what I have in mind:
def get_student_name(reg_num, particulars):
    for i in records:
        if reg_num == particulars[::1][0]:
            return particulars[i][1]
        else:
            print("Not found")
I know I'm wrong, but I can't tell why. I'm not well acquainted with how to sort through a tuple. Can anyone offer some advice, especially in syntax? Thank you very much!

When you write for i in particulars, in each iteration i is an item of the collection, not an index. As such, you cannot do particulars[i] (and there is no need to, as you already have the item). In addition, remove the else clause so that "Not found" is not printed for every item that fails the condition:
def get_student_name(reg_num, particulars):
    for i in particulars:
        if reg_num == i[0]:
            return i[1]
    print("Not found")
If you want to iterate using indices, you could do the following (though it is less nice):
for i in range(len(particulars)):
    if reg_num == particulars[i][0]:
        return particulars[i][1]

Another approach, provided to help learn new tricks for manipulating python data structures:
You can turn your tuple of tuples:
particulars = (("S12345", "John"), ("S23456", "Max"), ("S34567", "Mary"))
into a dictionary:
>>> pdict = dict(particulars)
>>> pdict
{'S12345': 'John', 'S23456': 'Max', 'S34567': 'Mary'}
You can look up the value by supplying the key:
>>> r = 'S23456'
>>> pdict[r]
'Max'
The function:
def get_student_name(reg, s_data):
    try:
        return dict(s_data)[reg]
    except (KeyError, TypeError, ValueError):
        return "Not Found"
The use of try ... except catches the KeyError raised when reg is not in the tuple in the first place, and just returns Not Found. It also catches the TypeError or ValueError raised if the supplied tuple is not a series of PAIRS and thus cannot be converted the way you expect.
You can read more about exceptions, both the basics and the docs, to learn how to respond differently to different types of error.
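As an illustration of responding differently to different error types, the lookup can be split into two steps so that malformed data and a missing key are reported separately. This is only a sketch: the "Invalid data" message is an assumption made for the example, not part of the assignment.

```python
def get_student_name(reg, s_data):
    """Look up reg in a sequence of (reg_num, name) pairs."""
    try:
        lookup = dict(s_data)
    except (TypeError, ValueError):
        # s_data was not a series of pairs, so dict() could not convert it.
        return "Invalid data"  # hypothetical message for this sketch
    try:
        return lookup[reg]
    except KeyError:
        # The data was fine, but this registration number is missing.
        return "Not found"


particulars = (("S12345", "John"), ("S23456", "Max"), ("S34567", "Mary"))
print(get_student_name("S23456", particulars))
print(get_student_name("S00000", particulars))
print(get_student_name("S23456", (("S12345", "John", "extra"),)))
```

Naming the exception classes also keeps unrelated problems, such as a typo in a variable name, from being silently swallowed.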

for loops in python
Gilad Green already answered your question with a way to fix your code and a quick explanation on for loops.
Here are five loops that do more or less the same thing; I invite you to try them out.
particulars = (("S12345", "John"), ("S23456", "Max"), ("S34567", "Mary"))
for t in particulars:
    print("{} {}".format(t[0], t[1]))
for i in range(len(particulars)):
    print("{}: {} {}".format(i, particulars[i][0], particulars[i][1]))
for i, t in enumerate(particulars):
    print("{}: {} {}".format(i, t[0], t[1]))
for reg_value, student_name in particulars:
    print("{} {}".format(reg_value, student_name))
for i, (reg_value, student_name) in enumerate(particulars):
    print("{}: {} {}".format(i, reg_value, student_name))
Using dictionaries instead of lists
Most importantly, I would like to add that using an unsorted list to store your student records is not the most efficient way.
If you sort the list and maintain it in sorted order, then you can use binary search to search for reg_num much faster than browsing the list one item at a time. Think of this: when you need to look up a word in a dictionary, do you read all words one by one, starting by "aah", "aback", "abaft", "abandon", etc.? No; first, you open the dictionary somewhere in the middle; you compare the words on that page with your word; then you open it again to another page; compare again; every time you do that, the number of candidate pages diminishes greatly, and so you can find your word among 300,000 other words in a very small time.
Instead of using a sorted list with binary search, you could use another data structure, for instance a binary search tree or a hash table.
But, wait! Python already does that very easily!
There is a data structure in python called a dictionary. See the documentation on dictionaries. This structure is perfectly adapted to most situations where you have keys associated to values. Here the key is the reg_number, and the value is the student name.
You can define a dictionary directly:
particulars = {'S12345': 'John', 'S23456': 'Max', 'S34567': 'Mary'}
Or you can convert your list of tuples to a dictionary:
particulars = (("S12345", "John"), ("S23456", "Max"), ("S34567", "Mary"))
particulars_as_dict = dict(particulars)
Then you can check whether a reg_number is in the dictionary with the keyword in; you can return the student name using square brackets or with the method get:
>>> particulars = {'S12345': 'John', 'S23456': 'Max', 'S34567': 'Mary'}
>>> 'S23456' in particulars
True
>>> 'S98765' in particulars
False
>>>
>>> particulars['S23456']
'Max'
>>> particulars.get('S23456')
'Max'
>>> particulars.get('S23456', 'not found')
'Max'
>>>
>>> particulars['S98765']
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
KeyError: 'S98765'
>>> print(particulars.get('S98765'))
None
>>> particulars.get('S98765', 'not found')
'not found'
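For comparison, the sorted-list approach described earlier in this answer can be sketched with the standard `bisect` module. This assumes the records are kept sorted by registration number; `find_student`, `records`, and `reg_nums` are hypothetical names.

```python
import bisect

# Records sorted by registration number (an assumption of this sketch).
records = [("S12345", "John"), ("S23456", "Max"), ("S34567", "Mary")]
reg_nums = [reg for reg, _ in records]  # parallel list of sorted keys


def find_student(reg_num):
    """Binary-search the sorted keys; return the name or None."""
    pos = bisect.bisect_left(reg_nums, reg_num)
    if pos < len(reg_nums) and reg_nums[pos] == reg_num:
        return records[pos][1]
    return None


print(find_student("S23456"))
print(find_student("S98765"))
```

In practice the dictionary version is shorter and needs no sorting, which is why it is usually preferred for this kind of lookup.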

How to delete item in nested list if it contains keyword?

I have 2 lists. The list named "keyword" is a list I manually created, and the nested list named "mylist" is an output of a function that I have in my script. This is what they look like:
keyword = ["Physics", "Spanish", ...]
mylist = [("Jack","Math and Physics"),
("Bob","English"),
("Emily","Physics"),
("Mark","Gym and Spanish"),
("Brian", "Math and Gym"),
...]
What I am trying to do is delete each item in the nested list if that item (in parentheses) contains any of the keywords written in the "keyword" list.
For example, in this case, any items in "mylist" that contain the words "Physics" or "Spanish" should be deleted from "mylist". Then, when I print "mylist", this should be the output:
[("Bob","English"), ("Brian", "Math and Gym")]
I tried searching through the internet and many different SO posts to learn how to do this (such as this), but when I modify (because I have a nested list, instead of just a list) the code and run it, I get the following error:
Traceback (most recent call last):
File "namelist.py", line 165, in <module>
asyncio.get_event_loop().run_until_complete(request1())
File "C:\Users\XXXX\AppData\Local\Programs\Python\Python37\lib\asyncio\base_events.py", line 576, in run_until_complete
return future.result()
File "namelist.py", line 154, in request1
mylist.remove(a)
ValueError: list.remove(x): x not in list
Does anyone know how to fix this error, and could you share your code?
EDIT: By the way, the real "mylist" I have on my script is much longer than what I wrote here, and I have about 15 keywords. When I run it on a small scale like this, the code works well, but as soon as I have more than 5 keywords, for some reason, I keep getting this error.

You could join each of the tuples into a string and then check whether any keyword is in that string to filter your list.
newlist = [m for m in mylist if not any(k in ' '.join(m) for k in keyword)]
print(newlist)
# [('Bob', 'English'), ('Brian', 'Math and Gym')]

for key in keyword:
    for tup in mylist[:]:  # iterate over a copy so removals don't skip items
        if any(key in field for field in tup):
            mylist.remove(tup)

You can start by splitting the fields on " and " and looking at the intersection between the keywords and the fields of each person. For instance, you could imagine something like this:
new_list = []
for name, fields in mylist:
    # Convert the string into a set of strings for the intersection
    field_set = set(fields.split(" and "))
    field_in_keys = field_set.intersection(keyword)
    # Add to the new list if no intersection is found
    if len(field_in_keys) == 0:
        new_list.append((name, fields))
You get:
[('Bob', 'English'), ('Brian', 'Math and Gym')]
If you care about speed, then pandas might do the work more efficiently.

for x in keyword:
    for i in mylist[:]:  # loop over a copy, since mylist is modified below
        if x in i[1].split(' '):
            mylist.remove(i)
If you just loop through each word, I think that will work as well.
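Removing items from a list while looping over that same list is a common source of both skipped items and `ValueError: list.remove(x): x not in list`, for example when one tuple matches two keywords and is removed twice. The question's own code is not shown, so this may or may not be the exact cause there. The sketch below uses made-up numbers to show the skipping; `drop_matching` is a hypothetical helper that builds a new list instead of mutating the old one.

```python
numbers = [1, 2, 3, 4]
for n in numbers:
    numbers.remove(n)  # mutates the list being iterated
print(numbers)  # elements 2 and 4 were skipped, not removed


def drop_matching(items, keywords):
    """Return a new list without tuples whose fields contain a keyword."""
    return [
        item for item in items
        if not any(k in field for k in keywords for field in item)
    ]


mylist = [("Jack", "Math and Physics"), ("Bob", "English"),
          ("Emily", "Physics"), ("Mark", "Gym and Spanish"),
          ("Brian", "Math and Gym")]
kept = drop_matching(mylist, ["Physics", "Spanish"])
print(kept)
```

Because the original list is never modified, the result does not depend on how many keywords there are or in which order they match.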

From an (ID, number) pair keep only those pairs that contain the largest number

I am new to Python and I would like some help with a small problem. I have a file in which each line has an ID plus an associated number. More than one number can be associated with the same ID. How can I get only the ID plus the largest number associated with it in Python?
Example:
Input: ID_file.txt
ENSG00000133246 2013
ENSG00000133246 540
ENSG00000133246 2010
ENSG00000253626 465
ENSG00000211829 464
ENSG00000158458 2577
ENSG00000158458 2553
What I want is the following:
ENSG00000133246 2013
ENSG00000253626 465
ENSG00000211829 464
ENSG00000158458 2577
Thanks in advance for any help!

I would think there are many ways to do this; I would use a dictionary:
id_value_dict = {}
with open('idfile.txt') as infile:
    for line in infile:
        id_, value = line.split()
        value = int(value)
        # Keep the value only if it is the largest seen so far for this ID.
        if id_ not in id_value_dict or id_value_dict[id_] < value:
            id_value_dict[id_] = value
The next step is to write the dictionary out:
with open('outputfile.txt', 'w') as out_ref:
    for key, value in id_value_dict.items():
        out_ref.write(key + '\t' + str(value) + '\n')
There are slicker ways to do this; I think the dictionary could be built in a one-liner using a lambda or a comprehension, but I like to start simple.
In case you need the results sorted, there are lots of ways to do it. I think it is critical to understand working with lists and dictionaries in Python: I have found that learning to think about the right data container is usually the key to solving many of my problems, though I am still new to this. Anyway, a straightforward way to get sorted results is to iterate over
sorted(id_value_dict)
This is one of the slick things about Python: sorted(id_value_dict) is a list of the keys of the dictionary in sorted order.
with open('outputfile.txt', 'w') as out_ref:
    for key in sorted(id_value_dict):
        out_ref.write(key + '\t' + str(id_value_dict[key]) + '\n')
It is really tricky, because you might want (I know I always want) to code
my_sorted_list = list(id_value_dict).sort()
However, you will find that my_sorted_list is None, because list.sort() sorts in place and returns nothing.

Given that your input consists of nothing but contiguous runs for each ID—that is, as soon as you see another ID, you never see the previous ID again—you can just do this:
import itertools
import operator
with open('ID_file.txt') as idfile, open('max_ID_file.txt', 'w') as maxidfile:
    keyvalpairs = (line.strip().split(None, 1) for line in idfile)
    for key, group in itertools.groupby(keyvalpairs, operator.itemgetter(0)):
        maxval = max(int(keyval[1]) for keyval in group)
        maxidfile.write('{} {}\n'.format(key, maxval))
To see what this does, let's go over it line by line.
A file is just an iterable full of lines, so for line in idfile means exactly what you'd expect. For each line, we're calling strip to get rid of extraneous whitespace, then split(None, 1) to split it on the first space, so we end up with an iterable full of pairs of strings.
Next, we use groupby to change that into an iterable full of (key, group) pairs. Try printing out list(keyvalpairs) to see what it looks like.
Then we iterate over that, and just use max to get the largest value in each group.
And finally, we write the key and the max value for the group to the output file.
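To see the grouping step in isolation, here is a sketch that applies the same idea to an in-memory list of lines (sample data from the question). It also shows a dictionary-based alternative that does not rely on the IDs being contiguous. `max_per_id` is a hypothetical name.

```python
import itertools
import operator

lines = [
    "ENSG00000133246 2013\n",
    "ENSG00000133246 540\n",
    "ENSG00000253626 465\n",
    "ENSG00000158458 2577\n",
    "ENSG00000158458 2553\n",
]

pairs = [line.split() for line in lines]

# groupby only merges *adjacent* items that share the same key.
grouped = {
    key: max(int(value) for _, value in group)
    for key, group in itertools.groupby(pairs, operator.itemgetter(0))
}
print(grouped)


def max_per_id(pairs):
    """Largest value per ID; works even if IDs are not contiguous."""
    best = {}
    for key, value in pairs:
        best[key] = max(int(value), best.get(key, int(value)))
    return best


print(max_per_id(pairs))
```

The groupby version streams through the file without holding every ID in memory, while the dictionary version tolerates input in any order.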

Converting dict values into a set while preserving the dict

I have a dict like this:
{100002: 'APPLE', 100004: 'BANANA', 100005: 'CARROT'}
I am trying to make my dict have ints for the keys (as it does now) but have sets for the values (rather than strings as it is now.) My goal is to be able to read from a .csv file with one column for the key (an int which is the item id number) and then columns for things like size, shape, and color. I want to add this information into my dict so that only the information for keys already in dict are added.
My goal dict might look like this:
{100002: set(['APPLE','MEDIUM','ROUND','RED']), 100004: set(['BANANA','MEDIUM','LONG','YELLOW']), 100005: set(['CARROT','MEDIUM','LONG','ORANGE'])}
Starting with my dict of just key + string for item name, I tried code like this to read the extra information in from a .csv file:
infile = open('FileWithTheData.csv', 'r')
for line in infile.readlines():
    spl_line = line.split(',')
    if int(spl_line[0]) in MyDict.keys():
        MyDict[int(spl_line[0])].update(spl_line[1:])
Unfortunately this errors out saying AttributeError: 'str' object has no attribute 'update'. My attempts to change my dictionary's values into sets so that I can then .update them have yielded things like this: {100002: set(['A','P','L','E']), 100004: set(['B','A','N']), 100005: set(['C','A','R','O','T'])}
I want to convert the values to a set so that the string that is currently the value will be the first string in the set rather than breaking up the string into letters and making a set of those letters.
I also tried making the values a set when I create the dict by zipping two lists together but it didn't seem to make any difference. Something like this
MyDict = dict(zip(listofkeys, set(listofnames)))
still makes the whole listofnames list into a set but it doesn't achieve my goal of making each value in MyDict into a set with the corresponding string from listofnames as the first string in the set.
How can I make the values in MyDict into a set so that I can add additional strings to that set without turning the string that is currently the value in the dict into a set of individual letters?
EDIT:
I currently make MyDict by using one function to generate a list of item ids (which are the keys) and another function which looks up those item ids to generate a list of corresponding item names (using a two column .csv file as the data source) and then I zip them together.
ANSWER:
Using the suggestions here I came up with this solution. I found that the section that has set()).update can easily be changed to list()).append to yield a list rather than a set (so that the order is preserved). I also found it easier to update my .csv data input files by adding the column containing names to FileWithTheData.csv, so that I didn't have to mess with making the dict, converting the values to sets, and then adding in more data. My code for this section now looks like this:
MyDict = {}
infile = open('FileWithTheData.csv', 'r')
for line in infile.readlines():
    spl_line = line.split(',')
    if int(spl_line[0]) in itemidlist: #note that this is the list I was formerly zipping together with a corresponding list of names to make my dict
        MyDict.setdefault(int(spl_line[0]), list()).append(spl_line[1:])
print(MyDict)

Your error is because originally your MyDict variable maps an integer to a string. When you are trying to update it you are treating the value like a set, when it is a string.
You can use a defaultdict for this:
from collections import defaultdict
combined_dict = defaultdict(set)
# first add all the values from MyDict
for key, value in MyDict.items():
    combined_dict[int(key)].add(value)
# then add the values from the file
infile = open('FileWithTheData.csv', 'r')
for line in infile.readlines():
    spl_line = line.strip().split(',')
    combined_dict[int(spl_line[0])].update(spl_line[1:])

Your issue is with how you are initializing MyDict, try changing it to the following:
MyDict = dict(zip(listofkeys, [set([name]) for name in listofnames]))
Here is a quick example of the difference:
>>> listofkeys = [100002, 100004, 100005]
>>> listofnames = ['APPLE', 'BANANA', 'CARROT']
>>> dict(zip(listofkeys, set(listofnames)))
{100002: 'CARROT', 100004: 'APPLE', 100005: 'BANANA'}
>>> dict(zip(listofkeys, [set([name]) for name in listofnames]))
{100002: set(['APPLE']), 100004: set(['BANANA']), 100005: set(['CARROT'])}
set(listofnames) is just going to turn your list into a set, and the only effect that might have is to reorder the values as seen above. You actually want to take each string value in your list, and convert it to a one-element set, which is what the list comprehension does.
After you make this change, your current code should work fine, although you can just do the contains check directly on the dictionary instead of explicitly checking the keys (key in MyDict is the same as key in MyDict.keys()).
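The set of letters in the question comes from passing the string itself to set(), which iterates over its characters. Here is a short sketch showing the difference and the one-element-set initialization. It is written in Python 3 syntax, whereas the thread above uses Python 2; the sample values come from the question and `my_dict` is a hypothetical name.

```python
letters = set("APPLE")      # iterates the string: one element per letter
single = {"APPLE"}          # set literal: one element, the whole string
print(sorted(letters))
print(single)

listofkeys = [100002, 100004, 100005]
listofnames = ["APPLE", "BANANA", "CARROT"]

# One single-element set per key, ready for .update() or .add().
my_dict = {key: {name} for key, name in zip(listofkeys, listofnames)}
my_dict[100002].update(["MEDIUM", "ROUND", "RED"])
print(my_dict[100002] == {"APPLE", "MEDIUM", "ROUND", "RED"})
```

Note that sets do not preserve insertion order, so the item name is not guaranteed to come first when the set is printed.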

Python name variable from string [duplicate]

This question already has answers here:
How can you dynamically create variables? [duplicate]
(8 answers)
Closed 8 years ago.
Is it possible to create a variable name based on the value of a string?
I have a script that will read a file for blocks of information and store them in a dictionary. Each block's dictionary will then be appended to a 'master' dictionary. The number of blocks of information in a file will vary and uses the word 'done' to indicate the end of a block.
I want to do something like this:
master={}
block=0
for lines in file:
    if line != "done":
        $block.append(line)
    elif line == "done":
        master['$block'].append($block)
        block = block + 1
If a file had content like so:
eggs
done
bacon
done
ham
cheese
done
The result would be a dictionary with 3 lists:
master = {'0': ["eggs"], '1': ["bacon"], '2': ["ham", "cheese"]}
How could this be accomplished?

I would actually suggest that you use a list instead. Is there any specific reason why you need dicts that are array-ish?
If you can make do with a list, you can use this:
with open("yourFile") as fd:
    arr = [x.strip().split() for x in fd.read().split("done")][:-1]
Output:
[['eggs'], ['bacon'], ['ham', 'cheese']]
If you want number-string indices, you could use this:
with open("yourFile") as fd:
    l = [x.strip().split() for x in fd.read().split("done")][:-1]
    print(dict(zip(map(str, range(len(l))), l)))

You seem to be misunderstanding how dictionaries work. They take keys that are objects, so no magic is needed here.
We can however, make your code nicer by using a collections.defaultdict to make the sublists as required.
from collections import defaultdict
master = defaultdict(list)
block = 0
for line in file:
    line = line.strip()  # lines read from a file end with "\n"
    if line == "done":
        block += 1
    else:
        master[block].append(line)
I would, however, suggest that a dictionary is unnecessary if you want continuous, numbered indices - that's what lists are for. In that case, I suggest you follow Thrustmaster's first suggestion, or, as an alternative:
from itertools import takewhile
def repeat_while(predicate, action):
    while True:
        current = action()
        if not predicate(current):
            break
        else:
            yield current
with open("test") as file:
    action = lambda: list(takewhile(lambda line: not line == "done", (line.strip() for line in file)))
    print(list(repeat_while(lambda x: x, action)))

I think that split on "done" is doomed to failure. Consider the list:
eggs
done
bacon
done
rare steak
well done steak
done
Stealing from Thrustmaster (which I gave a +1 for my theft) I'd suggest:
>>> dict(enumerate(l.split() for l in open(file).read().split('\ndone\n') if l))
{0: ['eggs'], 1: ['bacon'], 2: ['ham', 'cheese']}
I know this expects a trailing "\n". If there is a question there you could use "open(file).read()+'\n'" or even "+'\n\ndone\n'" if the final done is optional.

Use setattr or globals().
See How do I call setattr() on the current module?

Here's your code again, for juxtaposition:
master={}
block=0
for lines in file:
    if line != "done":
        $block.append(line)
    elif line == "done":
        master['$block'].append($block)
        block = block + 1
As mentioned in the post by Thrustmaster, it makes more sense to use a nested list here. Here's how you would do that; I've changed as little as possible structurally from your original code:
master=[[]] # Start with a list containing only a single list
for line in file: # Note the typo in your code: you wrote "for lines in file"
    if line != "done":
        master[-1].append(line) # Index -1 is the last element of your list
    else: # Since it's not not "done", it must in fact be "done"
        master.append([])
The only thing here is that you'll end up with one extra list at the end of your master list, so you should add a line to delete the last, empty sublist:
del master[-1]
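Lines read from a real file keep their trailing newline, so a comparison with "done" needs the line stripped first. The nested-list approach can be wrapped in a function along these lines; this is a sketch in which `split_blocks` is a hypothetical name and the sample lines stand in for the file.

```python
def split_blocks(lines, terminator="done"):
    """Group lines into blocks, each block ended by the terminator line."""
    blocks = []
    current = []
    for raw in lines:
        line = raw.strip()  # drop the trailing newline before comparing
        if line == terminator:
            blocks.append(current)
            current = []
        elif line:
            current.append(line)
    if current:  # keep a final block that lacks a terminator
        blocks.append(current)
    return blocks


sample = ["eggs\n", "done\n", "bacon\n", "done\n", "ham\n", "cheese\n", "done\n"]
blocks = split_blocks(sample)
print(blocks)
print({str(i): block for i, block in enumerate(blocks)})
```

Collecting each block in `current` and appending it only when "done" is seen avoids the extra empty sublist, so no cleanup step is needed afterwards.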
