Python File IO - building dictionary and finding max value

Python File IO - building dictionary and finding max value - python

Problem is to return the name of the event that has the highest number of participants in this text file:
#Beyond the Imposter Syndrome
32 students
4 faculty
10 industries
#Diversifying Computing Panel
15 students
20 faculty
#Movie Night
52 students
So I figured I had to split it into a dictionary with the keys as the event names and the values as the sum of the integers at the beginning of the other lines. I'm having a lot of trouble and I think I'm making it too complicated than it is.
This is what I have so far:
def most_attended(fname):
'''(str: filename, )'''
d = {}
f = open(fname)
lines = f.read().split(' \n')
print lines
indexes = []
count = 0
for i in range(len(lines)):
if lines[i].startswith('#'):
event = lines[i].strip('#').strip()
if event not in d:
d[event] = []
print d
indexes.append(i)
print indexes
if not lines[i].startswith('#') and indexes !=0:
num = lines[i].strip().split()[0]
print num
if num not in d[len(d)-1]:
d[len(d)-1] += [num]
print d
f.close()

import sys
from collections import defaultdict
from operator import itemgetter
def load_data(file_name):
events = defaultdict(int)
current_event = None
for line in open(file_name):
if line.startswith('#'):
current_event = line[1:].strip()
else:
participants_count = int(line.split()[0])
events[current_event] += participants_count
return events
if __name__ == '__main__':
if len(sys.argv) < 2:
print('Usage:\n\t{} <file>\n'.format(sys.argv[0]))
else:
events = load_data(sys.argv[1])
print('{}: {}'.format(*max(events.items(), key=itemgetter(1))))

Here's how I would do it.
with open("test.txt", "r") as f:
docText = f.read()
eventsList = []
#start at one because we don't want what's before the first #
for item in docText.split("#")[1:]:
individualLines = item.split("\n")
#get the sum by finding everything after the name, name is the first line here
sumPeople = 0
#we don't want the title
for line in individualLines[1:]:
if not line == "":
sumPeople += int(line.split(" ")[0]) #add everything before the first space to the sum
#add to the list a tuple with (eventname, numpeopleatevent)
eventsList.append((individualLines[0], sumPeople))
#get the item in the list with the max number of people
print(max(eventsList, key=lambda x: x[1]))
Essentially you first want to split up the document by #, ignoring the first item because that's always going to be empty. Now you have a list of events. Now for each event you have to go through, and for every additional line in that event (except the first) you have to add that lines value to the sum. Then you create a list of tuples like (eventname) (numPeopleAtEvent). Finally you use max() to get the item with the maximum number of people.
This code prints ('Movie Night', 104) obviously you can format it to however you like

Similar answers to the ones above.
result = {} # store the results
current_key = None # placeholder to hold the current_key
for line in lines:
# find what event we are currently stripping data for
# if this line doesnt start with '#', we can assume that its going to be info for the last seen event
if line.startswith("#"):
current_key = line[1:]
result[current_key] = 0
elif current_key:
# pull the number out of the string
number = [int(s) for s in line.split() if s.isdigit()]
# make sure we actually got a number in the line
if len(number) > 0:
result[current_key] = result[current_key] + number[0]
print(max(result, key=lambda x: x[1]))
This will print "Movie Night".

Your problem description says that you want to find the event with highest number of participants. I tried a solution which does not use list or dictionary.
Ps: I am new to Python.
bigEventName = ""
participants = 0
curEventName = ""
curEventParticipants = 0
# Use RegEx to split the file by lines
itr = re.finditer("^([#\w+].*)$", lines, flags = re.MULTILINE)
for m in itr:
if m.group(1).startswith("#"):
# Whenever a new group is encountered, check if the previous sum of
# participants is more than the recent event. If so, save the results.
if curEventParticipants > participants:
participants = curEventParticipants
bigEventName = curEventName
# Reset the current event name and sum as 0
curEventName = m.group(1)[1:]
curEventParticipants = 0
elif re.match("(\d+) .*", m.group(1)):
# If it is line which starts with number, extract the number and sum it
curEventParticipants += int(re.search("(\d+) .*", m.group(1)).group(1))
# This nasty code is needed to take care of the last event
bigEventName = curEventName if curEventParticipants > participants else bigEventName
# Here is the answer
print("Event: ", bigEventName)

You can do it without a dictionary and maybe make it a little simpler if just using lists:
with open('myfile.txt', 'r') as f:
lines = f.readlines()
lines = [l.strip() for l in lines if l[0] != '#'] # remove comment lines and '\n'
highest = 0
event = ""
for l in lines:
l = l.split()
if int(l[0]) > highest:
highest = int(l[0])
event = l[1]
print (event)

Related

How would I modify this counting script to account for duplicate values with different versions?

Basically I have a stdin that looks like this
5 3
calculator-2.4
data_feed-3.2
protocol_adapter-1.0
protocol_adapter-1.1
local_network_connector-3.4
data_feed-3.2.1
calculator-2.4.1
protocol_adapter-1.2
where the 5 rows, starting from the top, are current versions of something and the 3 are the new versions I need to compare the current versions to and determind if I need to upgrade. The answer will be a single int that will indicate the number of upgrades required. So for the example given the answer would be 4, calculator-2.4 data_feed-3.2 protocol_adapter-1.0 protocol_adapter-1.1 would be the ones needing upgrades. But if I have something like
6 4
integrator-4.6.3
data_feed-1.1
calculator-3.6
protocol_adapter-2.2.1
data_feed-1.1
integrator-4.6.2
data_feed-1.1.1
protocol_adapter-2.3
data_feed-1.2
validator-1.0
Where the data feed upgrade is in there twice my code will not pick it up. The answer to this second one would be 5 too. I have attached my code below.
import sys
line1 = sys.stdin.readline()
const = int(line1[0])
new = line1[2]
count = 0
dict1 = {}
listy = []
ans = 0
trbl = []
for line in sys.stdin:
if count < const:
line = line.rstrip()
listy.append(line)
count+=1
else:
line = line.rstrip()
data = line.split('-')
dict1[data[0]]=data[1]
for value in listy:
values1 = value.split('-')
if values1[0] in dict1.keys():
if value[1] != dict1[values1[0]]:
ans+=1
trbl.append(value)
else:
pass
print(ans)
print(trbl)
My code answers the first one but if there's more than one instance of a new version on the same component it won't pick it up.

The following code checks if the version is not equal to the version stored in dict1:
if value[1] != dict1[values1[0]]:
ans+=1
trbl.append(value)
This means that if the version is in the dict1, it would be ignored. You might want to try and create a second map that keeps track of the number of times the latest version is seen. Something like:
import sys
line1 = sys.stdin.readline()
const = int(line1[0])
new = line1[2]
count = 0
dict1 = {}
set1 = set([]) # this set keeps track of the occurence of the latest version
listy = []
ans = 0
trbl = []
for line in sys.stdin:
if count < const:
line = line.rstrip()
listy.append(line)
count+=1
else:
line = line.rstrip()
data = line.split('-')
dict1[data[0]]=data[1]
for value in listy:
values1 = value.split('-')
if values1[0] in dict1.keys():
if value[1] != dict1[values1[0]]:
ans+=1
trbl.append(value)
elif values1[0] in set1: # same version already occurred, so add it to ans and trbl
ans+=1
trbl.append(value)
else:
set1.add(values1[0]) # add to set after the first occurence
print(ans)
print(trbl)

Selecting line from file by using "startswith" and "next" commands

I have a file from which I want to create a list ("timestep") from the numbers which appear after each line "ITEM: TIMESTEP" so:
timestep = [253400, 253500, .. etc]
Here is the sample of the file I have:
ITEM: TIMESTEP
253400
ITEM: NUMBER OF ATOMS
378
ITEM: BOX BOUNDS pp pp pp
-2.6943709180241954e-01 5.6240920636804063e+01
-2.8194230631882372e-01 5.8851195163321044e+01
-2.7398090193568775e-01 5.7189372326936599e+01
ITEM: ATOMS id type q x y z
16865 3 0 28.8028 1.81293 26.876
16866 2 0 27.6753 2.22199 27.8362
16867 2 0 26.8715 1.04115 28.4178
16868 2 0 25.7503 1.42602 29.4002
16869 2 0 24.8716 0.25569 29.8897
16870 3 0 23.7129 0.593415 30.8357
16871 3 0 11.9253 -0.270359 31.7252
ITEM: TIMESTEP
253500
ITEM: NUMBER OF ATOMS
378
ITEM: BOX BOUNDS pp pp pp
-2.6943709180241954e-01 5.6240920636804063e+01
-2.8194230631882372e-01 5.8851195163321044e+01
-2.7398090193568775e-01 5.7189372326936599e+01
ITEM: ATOMS id type q x y z
16865 3 0 28.8028 1.81293 26.876
16866 2 0 27.6753 2.22199 27.8362
16867 2 0 26.8715 1.04115 28.4178
16868 2 0 25.7503 1.42602 29.4002
16869 2 0 24.8716 0.25569 29.8897
16870 3 0 23.7129 0.593415 30.8357
16871 3 0 11.9253 -0.270359 31.7252
To do this I tried to use "startswith" and "next" commands at once and it didn't work. Is there other way to do it? I send also the code I'm trying to use for that:
timestep = []
with open(file, 'r') as f:
lines = f.readlines()
for line in lines:
line = line.split()
if line[0].startswith("ITEM: TIMESTEP"):
timestep.append(next(line))
print(timestep)

The logic is to decide whether to append the current line to timestep or not. So, what you need is a variable which tells you append the current line when that variable is TRUE.
timestep = []
append_to_list = False # decision variable
with open(file, 'r') as f:
lines = f.readlines()
for line in lines:
line = line.strip() # remove "\n" from line
if line.startswith("ITEM"):
# Update add_to_list
if line == 'ITEM: TIMESTEP':
append_to_list = True
else:
append_to_list = False
else:
# append to list if line doesn't start with "ITEM" and append_to_list is TRUE
if append_to_list:
timestep.append(line)
print(timestep)
output:
['253400', '253500']

First - I don't like this, because it doesn't scale. You can only get the first immediately following line nicely, anything else will be just ugh...
But you asked, so ... for x in lines will create an iterator over lines and use that to keep the position. You don't have access to that iterator, so next will not be the next element you're expecting. But you can make your own iterator and use that:
lines_iter = iter(lines)
for line in lines_iter:
# whatever was here
timestep.append(next(line_iter))
However, if you ever want to scale it... for is not a good way to iterate over a file like this. You want to know what is in the next/previous line. I would suggest using while:
timestep = []
with open('example.txt', 'r') as f:
lines = f.readlines()
i = 0
while i < len(lines):
if line[i].startswith("ITEM: TIMESTEP"):
i += 1
while not line[i].startswith("ITEM: "):
timestep.append(next(line))
i += 1
else:
i += 1
This way you can extend it for different types of ITEMS of variable length.

So the problem with your code is subtle. You have a list lines which you iterate over, but you can't call next on a list.
Instead, turn it into an explicit iterator and you should be fine
timestep = []
with open(file, 'r') as f:
lines = f.readlines()
lines_iter = iter(lines)
for line in lines_iter:
line = line.strip() # removes the newline
if line.startswith("ITEM: TIMESTEP"):
timestep.append(next(lines_iter, None)) # the second argument here prevents errors
# when ITEM: TIMESTEP appears as the
# last line in the file
print(timestep)
I'm also not sure why you included line.split, which seems to be incorrect (in any case line.split()[0].startswith('ITEM: TIMESTEP') can never be true, since the split will separate ITEM: and TIMESTEP into separate elements of the resulting list.)
For a more robust answer, consider grouping your data based on when the line begins with ITEM.
def process_file(f):
ITEM_MARKER = 'ITEM: '
item_title = '(none)'
values = []
for line in f:
if line.startswith(ITEM_MARKER):
if values:
yield (item_title, values)
item_title = line[len(ITEM_MARKER):].strip() # strip off the marker
values = []
else:
values.append(line.strip())
if values:
yield (item_title, values)
This will let you pass in the whole file and will lazily produce a set of values for each ITEM: <whatever> group. Then you can aggregate in some reasonable way.
with open(file, 'r') as f:
groups = process_file(f)
aggregations = {}
for name, values in groups:
aggregations.setdefault(name, []).extend(values)
print(aggregations['TIMESTEP']) # this is what you want

You can use enumerate to help with index referencing. We can check to see if the string ITEM: TIMESTEP is in the previous line then add the integer to our timestep list.
timestep = []
with open('example.txt', 'r') as f:
lines = f.readlines()
for i, line in enumerate(lines):
if "ITEM: TIMESTEP" in lines[i-1]:
timestep.append(int(line.strip()))
print(timestep)

CSV Python Outputting: Outputting non-matching field once rather than once for every item in list

I've been trying to figure this out for about a year now and I'm really burnt out on it so please excuse me if this explanation is a bit rough.
I cannot include job data, but it would be accurate to imagine 2 csv files both with the first column populated with values (Serial numbers/phone numbers/names, doesn't matter - just values). Between both csv files, some values would match while other values would only be contained in one or the other (Timmy is in both files and is a match, Robert is only in file 1 and does not match any name in file 2).
I can successfully output a csv value ONCE that exists in the both csv files (I.e. both files contain "Value78", output file will contain "Value78" only once).
When I try to tack on an else statement to my if condition, to handle non-matching items, the program will output 1 entry for every item it does not match with (makes 100% sense, matches happen once but every other comparison result besides the match is a non-match).
I cannot envision a structure or method to hold the fields that don't match back so that they can be output once and not overrun my terminal or output file.
My goal is to output two csv files, matches and non-matches, with the non-matches having only one entry per value.
Anyways, onto the code:
import csv
MYUNITS = 'MyUnits.csv'
VENDORUNITS = 'VendorUnits.csv'
MATCHES = 'Matches.csv'
NONMATCHES = 'NonMatches.csv'
with open(MYUNITS,mode='r') as MFile,
open(VENDORUNITS,mode='r') as VFile,
open(MATCHES,mode='w') as OFile,
open(NONMATCHES,mode'w') as NFile:
MyReader = csv.reader(MFile,delimiter=',',quotechar='"')
MyList = list(MyReader)
VendorReader = csv.reader(VFile,delimiter=',',quotechar='"')
VList = list(VendorReader)
for x in range(len(MyList)):
for y in range(len(VList)):
if str(MyList[x][0]) == str(VList[y][0]):
OFile.write(MyList[x][0] + '\n')
else:
pass
The "else: pass" is where the logic of filtering out non-matches is escaping me. Outputting from this else statement will write the non-matching value (len(VList) - 1) times for an iteration that DOES produce 1 match, the entire len(VList) for an iteration with no match. I've tried using a counter and only outputting if the counter equals the len(VList), (incrementing in the else statement, writing output under the scope of the second for loop), but received the same output as if I tried outputting non-matches.

Below is one way you might go about deduplicating and then writing to a file:
import csv
MYUNITS = 'MyUnits.csv'
VENDORUNITS = 'VendorUnits.csv'
MATCHES = 'Matches.csv'
NONMATCHES = 'NonMatches.csv'
list_of_non_matches = []
with open(MYUNITS,mode='r') as MFile,
open(VENDORUNITS,mode='r') as VFile,
open(MATCHES,mode='w') as OFile,
open(NONMATCHES,mode'w') as NFile:
MyReader = csv.reader(MFile,delimiter=',',quotechar='"')
MyList = list(MyReader)
VendorReader = csv.reader(VFile,delimiter=',',quotechar='"')
VList = list(VendorReader)
for x in range(len(MyList)):
for y in range(len(VList)):
if str(MyList[x][0]) == str(VList[y][0]):
OFile.write(MyList[x][0] + '\n')
else:
list_of_non_matches.append(MyList[x][0])
# Remove duplicates from the non matches
new_list = []
[new_list.append(x) for x in list_of_non_matches if x not in new_list]
# Write the new list to a file
for i in new_list:
NFile.write(i + '\n')

Does this work?
import csv
MYUNITS = 'MyUnits.csv'
VENDORUNITS = 'VendorUnits.csv'
MATCHES = 'Matches.csv'
NONMATCHES = 'NonMatches.csv'
with open(MYUNITS,'r') as MFile,
(VENDORUNITS,'r') as VFile,
(MATCHES,'w') as OFile,
(NONMATCHES,mode,'w') as NFile:
MyReader = csv.reader(MFile,delimiter=',',quotechar='"')
MyList = list(MyReader)
MyVals = [x for x in MyList]
MyVals = [x[0] for x in MyVals]
VendorReader = csv.reader(VFile,delimiter=',',quotechar='"')
VList = list(VendorReader)
vVals = [x for x in VList]
vVals = [x[0] for x in vVals]
for val in MyVals:
if val in vVals:
OFile.write(Val + '\n')
else:
NFile.write(Val + '\n')
#for x in range(len(MyList)):
# for y in range(len(VList)):
# if str(MyList[x][0]) == str(VList[y][0]):
# OFile.write(MyList[x][0] + '\n')
# else:
# pass

Sorry, I had some issues with my PC. I was able to solve my own question the night I posted. The solution I used is so simple I'm kicking myself for not figuring it out way sooner:
import csv
MYUNITS = 'MyUnits.csv'
VENDORUNITS = 'VendorUnits.csv'
MATCHES = 'Matches.csv'
NONMATCHES = 'NonMatches.csv'
with open(MYUNITS,mode='r') as MFile,
open(VENDORUNITS,mode='r') as VFile,
open(MATCHES,mode='w') as OFile,
open(NONMATCHES,mode'w') as NFile:
MyReader = csv.reader(MFile,delimiter=',',quotechar='"')
MyList = list(MyReader)
VendorReader = csv.reader(VFile,delimiter=',',quotechar='"')
VList = list(VendorReader)
for x in range(len(MyList)):
tmpStr = ''
for y in range(len(VList)):
if str(MyList[x][0]) == str(VList[y][0]):
tmpStr = '' #Sets to blank so comparison fails, works because break
OFile.write(MyList[x][0] + '\n')
break
else:
tmp = str(MyList[x][0])
if tmp != '':
NFile.write(tmp + '\n')

Code doesn't print the last sequence in a file

I have a file that looks like this:
<s0> 3
line1
line2
line3
<s1> 5
line1
line2
<s2> 4
etc. up to more than a thousand
Each sequence has a header like <s0> 3, which in this case states that three lines follow. In the example above, the number of lines below <s1> is two, so I have to correct the header to <s1> 2.
The code I have below picks out the sequence headers and the correct number of lines below them. But for some reason, it never gets the details of the last sequence. I know something is wrong but I don't know what. Can someone point me to what I am doing wrong?
import re
def call():
with open('trial_perl.txt') as fp:
docHeader = open("C:\path\header.txt","w")
c = 0
c1 = 0
header = []
k = -1
for line in fp:
if line.startswith("<s"):
#header = line.split(" ")
#print header[1]
c = 0
else:
c1 = c + 1
c += 1
if c == 0 and c1>0:
k +=1
printing = c1
if printing >= 0:
s = "<s%s>" % (k)
#print "%s %d" % (s, printing)
docHeader.write(s+" "+str(printing)+"\n")
call()

you have no sentinel at the end of the last sequence in your data, so your code will need to deal with the last sequence AFTER the loop is done.

If I may suggest some python tricks to get to your results; you don't need those c/c1/k counter variables, as they make the code more difficult to read and maintain. Instead, populate a map of sequence header to sequence items and then use the map to do all your work:
(this code works only if all sequence headers are unique - if you have duplicates, it won't work)
with open('trial_perl.txt') as fp:
docHeader = open("C:\path\header.txt","w")
data = {}
for line in fp:
if line.startswith("<s"):
current_sequence = line
# create a list with the header as the key
data[current_sequence] = []
else:
# add each sequence to the list we defined above
data[current_sequence].append(line)
Your map is ready! It looks like this:
{"<s0> 3": ["line1", "line2", "line5"],
"<s1> 5": ["line1", "line2"]}
You can iterate it like this:
for header, lines in data.items():
# header is the key, or "<s0> 3"
# lines is the list of lines under that header ["line1", "line2", etc]
num_of_lines = len(lines)

The main problem is that you neglect to check the value of c after you have read the last line. You probably had difficulty spotting this problem because of all the superfluous code. You don't have to increment k, since you can extract the value from the <s...> tag. And you don't have to have all three variables c, c1, and printing. A single count variable will do.
import re, sys
def call():
with open('trial_perl.txt') as fp:
docHeader = sys.stdout #open("C:\path\header.txt","w")
count = 0
id = None
for line in fp:
if line.startswith("<s"):
if id != None:
tag = '<s%s>' % id
docHeader.write('<s%d> %d\n' % (id, count))
count = 0
id = int(line[2:line.find('>')])
else:
count += 1
if id != None:
tag = '<s%s>' % id
docHeader.write('<s%d> %d\n' % (id, count))
call()

Another approach using groupby from itertools, where you take the maximum number of line in each group - a group corresponding to a sequence of header + line in your file: :
from itertools import groupby
def call():
with open('stack.txt') as fp:
header = [-1]
lines = [0]
for line in fp:
if line.startswith("<s"):
header.append(header[-1]+1)
lines.append(0)
else:
header.append(header[-1])
lines.append(lines[-1] +1)
with open('result','w') as f:
for key, group in groupby(zip(header[1:],lines[1:]), lambda x: x[0]):
f.write(str(("<s%d> %d\n" % max(group))))
f.close()
call()
#<s0> 3
#<s1> 2
stack.txt is the file containing your data:
<s0> 3
line1
line2
line3
<s1> 5
line1
line2

Dictionaries overwriting in Python

This program is to take the grammar rules found in Binary.text and store them into a dictionary, where the rules are:
N = N D
N = D
D = 0
D = 1
but the current code returns D: D = 1, N:N = D, whereas I want N: N D, N: D, D:0, D:1
import sys
import string
#default length of 3
stringLength = 3
#get last argument of command line(file)
filename1 = sys.argv[-1]
#get a length from user
try:
stringLength = int(input('Length? '))
filename = input('Filename: ')
except ValueError:
print("Not a number")
#checks
print(stringLength)
print(filename)
def str2dict(filename="Binary.txt"):
result = {}
with open(filename, "r") as grammar:
#read file
lines = grammar.readlines()
count = 0
#loop through
for line in lines:
print(line)
result[line[0]] = line
print (result)
return result
print (str2dict("Binary.txt"))

Firstly, your data structure of choice is wrong. Dictionary in python is a simple key-to-value mapping. What you'd like is a map from a key to multiple values. For that you'll need:
from collections import defaultdict
result = defaultdict(list)
Next, where are you splitting on '=' ? You'll need to do that in order to get the proper key/value you are looking for? You'll need
key, value = line.split('=', 1) #Returns an array, and gets unpacked into 2 variables
Putting the above two together, you'd go about in the following way:
result = defaultdict(list)
with open(filename, "r") as grammar:
#read file
lines = grammar.readlines()
count = 0
#loop through
for line in lines:
print(line)
key, value = line.split('=', 1)
result[key.strip()].append(value.strip())
return result

Dictionaries, by definition, cannot have duplicate keys. Therefor there can only ever be a single 'D' key. You could, however, store a list of values at that key if you'd like. Ex:
from collections import defaultdict
# rest of your code...
result = defaultdict(list) # Use defaultdict so that an insert to an empty key creates a new list automatically
with open(filename, "r") as grammar:
#read file
lines = grammar.readlines()
count = 0
#loop through
for line in lines:
print(line)
result[line[0]].append(line)
print (result)
return result
This will result in something like:
{"D" : ["D = N D", "D = 0", "D = 1"], "N" : ["N = D"]}

We Keep Coding

Python is a programming language that lets you work quickly and integrate systems more effectively.

Python File IO - building dictionary and finding max value - python

Related

How would I modify this counting script to account for duplicate values with different versions?

Selecting line from file by using "startswith" and "next" commands

CSV Python Outputting: Outputting non-matching field once rather than once for every item in list

Code doesn't print the last sequence in a file

Dictionaries overwriting in Python

Categories

Resources