I have this type of string:
sheet = """
magenta
turquoise,PF00575
tan,PF00154,PF06745,PF08423,PF13481,PF14520
turquoise, PF00011
NULL
"""
Every line starts with an identifier (e.g. tan, magenta, ...). What I want is to count the number of occurrences of each PF number per identifier.
So, the final structure would be something like this:
         magenta  turquoise  tan  NULL
PF00575  0        0          0    0
PF00154  0        1          0    0
PF06745  0        0          1    0
PF08423  0        0          1    0
PF13481  0        0          1    0
PF14520  0        0          1    0
PF00011  0        1          0    0
I started by making a dictionary where the first word on each line is a key, and I want the PF numbers that follow it as the values.
When I use this code, I get the values as a list of strings instead of as separate values in the dictionary:
from collections import defaultdict

lines = []
lines.append(sheet.split("\n"))
flattened=[]
flattened = [val for sublist in lines for val in sublist]
pfams = []
for i in flattened:
    pfams.append(i.split(","))
d = defaultdict(list)
for i in pfams:
    pfam = i[0]
    d[pfam].append(i[1:])
So, the result is this:
defaultdict(<type 'list'>, {'': [[], []], 'magenta': [[]], 'NULL': [[]], 'turquoise': [['PF00575']], 'tan': [['PF00154', 'PF06745', 'PF08423', 'PF13481', 'PF14520']]})
How can I split up the PFnumbers so that they are separate values in the dictionary and then count the number of occurrences of each unique PF-number per key?
Use collections.Counter (https://docs.python.org/2/library/collections.html#collections.Counter)
import collections
sheet = """
magenta
turquoise,PF00575
tan,PF00154,PF06745,PF08423,PF13481,PF14520
NULL
"""
acc = {}
for line in sheet.split('\n'):
    if not line or line == "NULL":
        continue
    parts = line.split(',')
    # Count every PF value that follows the key
    acc[parts[0]] = collections.Counter(parts[1:])
EDIT: Now with accumulating all PF values for each key
acc = collections.defaultdict(list)
for line in sheet.split('\n'):
    if line == "NULL":
        continue
    parts = line.split(',')
    acc[parts[0]] += parts[1:]
# items() works on both Python 2 and Python 3
acc = {k: collections.Counter(v) for k, v in acc.items()}
Final edit: count the occurrences of colours per PF value, which is what we were after all along:
acc = collections.defaultdict(list)
for line in sheet.split('\n'):
    if line == "NULL":
        continue
    parts = line.split(',')
    for pfval in parts[1:]:
        acc[pfval.strip()] += [parts[0]]
# items() works on both Python 2 and Python 3
acc = {k: collections.Counter(v) for k, v in acc.items()}
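To get from these per-PF counters to the table shown in the question, one option is to fix a column order and look each colour up in the counter, since a Counter returns 0 for keys it has never seen. This is only a minimal sketch: the helper name `build_table` and the column order are assumptions for illustration, not part of the answer above.

```python
import collections

sheet = """
magenta
turquoise,PF00575
tan,PF00154,PF06745,PF08423,PF13481,PF14520
turquoise, PF00011
NULL
"""

def build_table(text, columns):
    """Return {pf_number: [count per column]} for the given column order."""
    acc = collections.defaultdict(collections.Counter)
    for line in text.splitlines():
        parts = [p.strip() for p in line.split(",")]
        for pfval in parts[1:]:  # lines without PF numbers add nothing
            acc[pfval][parts[0]] += 1
    # A Counter returns 0 for colours it has never seen
    return {pf: [counts[c] for c in columns] for pf, counts in acc.items()}

columns = ["magenta", "turquoise", "tan", "NULL"]  # hypothetical column order
table = build_table(sheet, columns)
print("\t".join([""] + columns))
for pf, row in table.items():
    print("\t".join([pf] + [str(n) for n in row]))
```

Each row of `table` can be passed straight to `csv.writer.writerow` if a file is preferred over printing.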
With thanks to dwblas on devshed, this is the most efficient way I've found to tackle the task:
I build a dictionary keyed by PF number, plus a list of colors in the order I want them printed; each dictionary value is a list of counts in that same order.
colors_list= ['cyan','darkorange','greenyellow','yellow','magenta','blue','green','midnightblue','brown','darkred','lightcyan','lightgreen','darkgreen','royalblue','orange','purple','tan','grey60','darkturquoise','red','lightyellow','darkgrey','turquoise','salmon','black','pink','grey','null']
lines = sheet.splitlines()
counts = {}
for line in lines:
    parts = line.split(",")
    if len(parts) > 1:
        ## doesn't break out the same item in the list many times
        color = parts[0].strip().lower()
        for key in parts[1:]:  ## skip color
            key = key.strip()
            if key not in counts:
                ## new key and list of zeroes-print it if you want to verify
                counts[key] = [0 for ctr in range(len(colors_list))]
            if color in colors_list:  ## color found
                ## offset number/location of this color in list
                el_number = colors_list.index(color)
                counts[key][el_number] += 1
            else:
                print("some error message")

import csv
## Python 3; on Python 2 open the file with mode "wb" and no newline argument
with open("out.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["PFAM",] + colors_list)
    for pfam in counts:
        writer.writerow([pfam] + counts[pfam])
Related
I have space-delimited data in a text file that looks like the following:
0 1 2 3
1 2 3
3 4 5 6
1 3 5
1
2 3 5
3 5
Each line has a different length.
I need to read it starting from line 2 ('1 2 3'),
parse it, and get the following information:
Number of unique data = (1,2,3,4,5,6)=6
Count of each data:
count data (1)=3
count data (2)=2
count data (3)=5
count data (4)=1
count data (5)=4
count data (6)=1
Number of lines=6
Sort the data in descending order:
data (3)
data (5)
data (1)
data (2)
data (4)
data (6)
I did this:
import csv

file = open('data.txt')
csvreader = csv.reader(file)
header = []
header = next(csvreader)
print(header)
rows = []
for row in csvreader:
    rows.append(row)
print(rows)
After this step, what should I do to get the expected results?
I would do something like this:
from collections import Counter

with open('data.txt', 'r') as file:
    lines = file.readlines()
lines = lines[1:]  # skip first line
data = []
for line in lines:
    data += line.strip().split(" ")
counter = Counter(data)
print(f'unique data: {list(counter.keys())}')
print(f'count data: {list(sorted(counter.most_common(), key=lambda x: x[0]))}')
print(f'number of lines: {len(lines)}')
print(f'sort data: {[x[0] for x in counter.most_common()]}')
A simple brute force approach:
nums = []
counts = {}
for row in open('data.txt'):
    if row[0] == '0':
        continue
    nums.extend( [int(k) for k in row.rstrip().split()] )
print(nums)
for n in nums:
    if n not in counts:
        counts[n] = 1
    else:
        counts[n] += 1
print(counts)
ordering = list(sorted(counts.items(), key=lambda k: -k[1]))
print(ordering)
Here is another approach
def getData(infile):
    """ Read file lines and return lines 1 thru end"""
    lnes = []
    with open(infile, 'r') as data:
        lnes = data.readlines()
    return lnes[1:]

def parseData(ld):
    """ Parse data and print desired results """
    unique_symbols = set()
    all_symbols = dict()
    for l in ld:
        symbols = l.strip().split()
        for s in symbols:
            unique_symbols.add(s)
            cnt = all_symbols.pop(s, 0)
            cnt += 1
            all_symbols[s] = cnt
    print(f'Number of Unique Symbols = {len(unique_symbols)}')
    print(f'Number of Lines Processed = {len(ld)}')
    for symb in unique_symbols:
        print(f'Number of {symb} = {all_symbols[symb]}')
    print(f"Descending Sort of Symbols = {', '.join(sorted(list(unique_symbols), reverse=True))}")
On executing:
infile = r'spaced_text.txt'
parseData(getData(infile))
Produces:
Number of Unique Symbols = 6
Number of Lines Processed = 6
Number of 2 = 2
Number of 5 = 4
Number of 3 = 5
Number of 1 = 3
Number of 6 = 1
Number of 4 = 1
Descending Sort of Symbols = 6, 5, 4, 3, 2, 1
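Note that the last line above is sorted by symbol, while the question asks for the data ordered by how often each value occurs. A minimal sketch of that variant follows; it reads from an inline string instead of a file so it runs on its own, and the function name `summarize` is an assumption for illustration.

```python
sample = """0 1 2 3
1 2 3
3 4 5 6
1 3 5
1
2 3 5
3 5
"""

def summarize(text):
    """Count whitespace-separated tokens on every line after the first."""
    lines = text.splitlines()[1:]  # skip the first line
    counts = {}
    for line in lines:
        for token in line.split():
            counts[token] = counts.get(token, 0) + 1
    # Sort by count (descending); ties are broken by the token itself
    ordering = sorted(counts, key=lambda t: (-counts[t], t))
    return len(lines), counts, ordering

n_lines, counts, ordering = summarize(sample)
print(n_lines)    # number of lines processed
print(counts)     # occurrences of each value
print(ordering)   # values from most to least frequent
```

For the sample data this gives the order 3, 5, 1, 2, 4, 6, matching the expected result in the question.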
I made a surname dict containing surnames like this:
(The file contains 200,000 words; this is a sample of surname_dict.)
['KRISTIANSEN', 'OLDERVIK', 'GJERSTAD', 'VESTLY SKIVIK', 'NYMANN', 'ØSTBY', 'LINNERUD', 'REMLO', 'SKARSHAUG', 'ELI', 'ADOLFSEN']
I am not allowed to use the Counter class or numpy, just native Python.
My idea was to use a for loop to sort through the dictionary, but I hit some walls. Please help with some advice.
Thanks.
surname_dict = []
count = 0
for index in data_list:
    if index["lastname"] not in surname_dict:
        count = count + 1
        surname_dict.append(index["lastname"])
for k, v in sorted(surname_dict.items(), key=lambda item: item[1]):
    if count < 10:  # Print only the top 10 surnames
        print(k)
        count += 1
    else:
        break
As mentioned in a comment, your dict is actually a list.
Try using the Counter object from the collections library. In the below example, I have edited your list so that it contains a few duplicates.
from collections import Counter
surnames = ['KRISTIANSEN', 'OLDERVIK', 'GJERSTAD', 'VESTLY SKIVIK', 'NYMANN', 'ØSTBY', 'LINNERUD', 'REMLO', 'SKARSHAUG', 'ELI', 'ADOLFSEN', 'OLDERVIK', 'ØSTBY', 'ØSTBY']
counter = Counter(surnames)
for name in counter.most_common(3):
    print(name)
The result becomes:
('ØSTBY', 3)
('OLDERVIK', 2)
('KRISTIANSEN', 1)
Change the integer argument to most_common to 10 for your use case.
One way to answer the question is to look at the top ten count categories rather than the top ten names: for example, the category of names used 9 times, the category of names used 200 times, and so on. Many different names can share the same count, and all of them belong in the top ten if that count does. Here is a script implementing this approach:
def counter(names: list):
    # Group names by how often each occurs: {count: [names with that count]}
    by_count = {}
    for name in set(names):
        n = 0
        for other in names:  # quadratic scan, slow for very large lists
            if name == other:
                n += 1
        by_count.setdefault(n, []).append(name)
    # Keep only the ten highest counts
    top_counts = sorted(by_count, reverse=True)[:10]
    return {c: by_count[c] for c in top_counts}
Note: this script calculates the top ten categories.
You could place all the values in a dictionary where each value is the number of times the name appears in the dataset, then filter that dictionary and push every name with a count of at least 10 into your final list.
edit: your surname_dict was initialized as a list, not a dictionary.
surname_dict = {}
top_ten = []
for index in data_list:
    if index['lastname'] not in surname_dict.keys():
        surname_dict[index['lastname']] = 1
    else:
        surname_dict[index['lastname']] += 1
for k, v in sorted(surname_dict.items()):
    if v >= 10:
        top_ten.append(k)
print(top_ten)
Just use a standard dictionary. I've added some duplicates to your data, and am using a threshold value to grab any names with at least 2 occurrences. Use threshold = 10 for your actual code.
names = ['KRISTIANSEN', 'OLDERVIK', 'GJERSTAD', 'VESTLY SKIVIK', 'NYMANN', 'ØSTBY','ØSTBY','ØSTBY','REMLO', 'LINNERUD', 'REMLO', 'SKARSHAUG', 'ELI', 'ADOLFSEN']
# you need 10 in your code, but I've only added a few dups to your sample data
threshold = 2
di = {}
for name in names:
    #grab name count, initialize to zero first time
    count = di.get(name, 0)
    di[name] = count + 1
#basic filtering, no sorting
unsorted = {name:count for name, count in di.items() if count >= threshold}
print(f"{unsorted=}")
#sorting by frequency: filter out the ones you don't want
bigenough = [(count, name) for name, count in di.items() if count >= threshold]
tops = sorted(bigenough, reverse=True)
print(f"{tops=}")
#or as another dict
tops_dict = {name:count for count, name in tops}
print(f"{tops_dict=}")
Output:
unsorted={'ØSTBY': 3, 'REMLO': 2}
tops=[(3, 'ØSTBY'), (2, 'REMLO')]
tops_dict={'ØSTBY': 3, 'REMLO': 2}
Update.
Wanted to share what code I made in the end. Thank you guys so much. The feedback really helped.
Code:
etternavn_dict = {}
for index in data_list:
    if index['etternavn'] not in etternavn_dict.keys():
        etternavn_dict[index['etternavn']] = 1
    else:
        etternavn_dict[index['etternavn']] += 1
print("\nTopp 10 etternavn:")
count = 0
# reverse=True puts the most frequent surnames first
for k, v in sorted(etternavn_dict.items(), key=lambda item: item[1], reverse=True):
    if count < 10:
        print(k)
        count += 1
    else:
        break
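The same idea can be wrapped in a small function that uses only built-ins. This is a sketch under assumptions: the function name `top_surnames` and the sample records are made up, and the records are assumed to be dictionaries with a "lastname" key as in the question.

```python
def top_surnames(records, n=10):
    """Return the n most frequent surnames as (name, count) pairs."""
    counts = {}
    for record in records:
        name = record["lastname"]
        counts[name] = counts.get(name, 0) + 1
    # Highest count first
    ranked = sorted(counts.items(), key=lambda item: item[1], reverse=True)
    return ranked[:n]

# Made-up records standing in for data_list
data_list = [{"lastname": s} for s in
             ["ØSTBY", "OLDERVIK", "ØSTBY", "ELI", "OLDERVIK", "ØSTBY"]]
top = top_surnames(data_list, n=2)
print(top)
```

Slicing the sorted list replaces the manual counter and `break`, and works even when there are fewer than `n` distinct names.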
I have a list and a function that calculates the scores of my teams. I put each team name and its score in a separate dictionary, but the list contains a few duplicate teams. Each entry also has a second field that says whether the team's response was valid. If an entry looks like "team1 - score 100 - validresponse 0", I just want to drop it, even if the team is a duplicate. However, if there are two entries for the SAME team and both submissions were valid, I want to add their scores together and store the total under one key in the dictionary. The only problem is that, when I do this, the dictionary automatically discards the other duplicates.
Here's my code:
import numpy as np
import pandas as pd

mylist = []
with open("input1.txt", "r") as input:
    for line in input:
        items = line.split()
        mylist.append([int(item) for item in items[0:]])
amountOfTestCases = mylist[0][0]
amountOfTeams = mylist[1][0]
amountOfLogs = mylist[1][1]
count = 1
count2 = 1
mydict = {}
teamlist = []
for i in mylist[2:]:
    count2 += 1
    teamlist.append(mylist[count2][1])

def find_repeating(lst, count=2):
    ret = []
    counts = [None] * len(lst)
    for i in lst:
        if counts[i] is None:
            counts[i] = i
        elif i == counts[i]:
            ret += [i]
            if len(ret) == count:
                return ret

rep_indexes = np.where(pd.DataFrame(teamlist).duplicated(keep=False))
print(teamlist)
print(rep_indexes)
duplicate = find_repeating(teamlist)

def calculate_points(row):
    points = mylist[row][3] * 100
    points -= mylist[row][0]
    return points

for i in teamlist:
    count += 1
    mydict['team%s' % mylist[count][1]] = calculate_points(count)
print(mydict)
the teamlist = [5, 4, 1, 2, 5, 4]
"validresponse 0 i just want to get rid of the team even if its a duplicate"

check if the response is valid
    if invalid, continue without doing anything else

"duplicates of the SAME team and both their submissions were valid then i want to add their scores together"

check if the key/team already exists (a duplicate)
    if it exists
        get its value
        add the new value
        assign the result to that dictionary key
    if it is not a duplicate
        make a new key with that value
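The steps above can be sketched in Python as follows. The input layout is an assumption, since the question does not show the file: each submission is taken to be a (team, score, valid) tuple with made-up values, and `total_scores` is a hypothetical helper name.

```python
def total_scores(submissions):
    """Sum the scores of valid submissions per team; skip invalid ones."""
    scores = {}
    for team, score, valid in submissions:
        if not valid:
            continue  # invalid response: do nothing else with it
        key = "team%s" % team
        # get() returns 0 for a new team, so duplicates simply accumulate
        scores[key] = scores.get(key, 0) + score
    return scores

# Made-up data: (team, score, valid)
submissions = [(5, 100, 1), (4, 80, 1), (1, 100, 0),
               (2, 50, 1), (5, 30, 1), (4, 60, 0)]
result = total_scores(submissions)
print(result)
```

Assigning with `scores[key] = ...` alone would overwrite earlier entries, which is why the duplicates seemed to vanish; adding to the existing value keeps them.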
Problem is to return the name of the event that has the highest number of participants in this text file:
#Beyond the Imposter Syndrome
32 students
4 faculty
10 industries
#Diversifying Computing Panel
15 students
20 faculty
#Movie Night
52 students
So I figured I had to split it into a dictionary with the event names as keys and, as values, the sum of the integers at the beginning of the other lines. I'm having a lot of trouble and I think I'm making it more complicated than it is.
This is what I have so far:
def most_attended(fname):
    '''(str: filename, )'''
    d = {}
    f = open(fname)
    lines = f.read().split(' \n')
    print lines
    indexes = []
    count = 0
    for i in range(len(lines)):
        if lines[i].startswith('#'):
            event = lines[i].strip('#').strip()
            if event not in d:
                d[event] = []
            print d
            indexes.append(i)
            print indexes
        if not lines[i].startswith('#') and indexes !=0:
            num = lines[i].strip().split()[0]
            print num
            if num not in d[len(d)-1]:
                d[len(d)-1] += [num]
    print d
    f.close()
import sys
from collections import defaultdict
from operator import itemgetter

def load_data(file_name):
    events = defaultdict(int)
    current_event = None
    for line in open(file_name):
        if line.startswith('#'):
            current_event = line[1:].strip()
        else:
            participants_count = int(line.split()[0])
            events[current_event] += participants_count
    return events

if __name__ == '__main__':
    if len(sys.argv) < 2:
        print('Usage:\n\t{} <file>\n'.format(sys.argv[0]))
    else:
        events = load_data(sys.argv[1])
        print('{}: {}'.format(*max(events.items(), key=itemgetter(1))))
Here's how I would do it.
with open("test.txt", "r") as f:
    docText = f.read()
eventsList = []
#start at one because we don't want what's before the first #
for item in docText.split("#")[1:]:
    individualLines = item.split("\n")
    #get the sum by finding everything after the name, name is the first line here
    sumPeople = 0
    #we don't want the title
    for line in individualLines[1:]:
        if not line == "":
            sumPeople += int(line.split(" ")[0]) #add everything before the first space to the sum
    #add to the list a tuple with (eventname, numpeopleatevent)
    eventsList.append((individualLines[0], sumPeople))
#get the item in the list with the max number of people
print(max(eventsList, key=lambda x: x[1]))
Essentially, you first split the document on #, ignoring the first item because it is always empty. That gives you a list of events. For each event, you go through every line except the first (the event name) and add that line's number to the sum. Then you build a list of tuples of the form (eventName, numPeopleAtEvent). Finally, you use max() to get the item with the largest number of people.
For the sample data shown in the question, this code prints ('Movie Night', 52); you can format the output however you like.
Similar answers to the ones above.
# assumes `lines` holds the lines of the input file
result = {}  # store the results
current_key = None  # placeholder to hold the current_key
for line in lines:
    # find what event we are currently stripping data for
    # if this line doesnt start with '#', we can assume that its going to be info for the last seen event
    if line.startswith("#"):
        current_key = line[1:].strip()
        result[current_key] = 0
    elif current_key:
        # pull the number out of the string
        number = [int(s) for s in line.split() if s.isdigit()]
        # make sure we actually got a number in the line
        if len(number) > 0:
            result[current_key] = result[current_key] + number[0]
# pick the key whose total is largest
print(max(result, key=result.get))
This will print "Movie Night".
Your problem description says that you want to find the event with highest number of participants. I tried a solution which does not use list or dictionary.
Ps: I am new to Python.
import re

# assumes `lines` holds the whole file as a single string
bigEventName = ""
participants = 0
curEventName = ""
curEventParticipants = 0
# Use RegEx to split the file by lines
itr = re.finditer(r"^([#\w+].*)$", lines, flags = re.MULTILINE)
for m in itr:
    if m.group(1).startswith("#"):
        # Whenever a new group is encountered, check if the previous sum of
        # participants is more than the recent event. If so, save the results.
        if curEventParticipants > participants:
            participants = curEventParticipants
            bigEventName = curEventName
        # Reset the current event name and sum as 0
        curEventName = m.group(1)[1:]
        curEventParticipants = 0
    elif re.match(r"(\d+) .*", m.group(1)):
        # If it is line which starts with number, extract the number and sum it
        curEventParticipants += int(re.search(r"(\d+) .*", m.group(1)).group(1))
# This nasty code is needed to take care of the last event
bigEventName = curEventName if curEventParticipants > participants else bigEventName
# Here is the answer
print("Event: ", bigEventName)
You can do it without a dictionary and maybe make it a little simpler if just using lists:
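with open('myfile.txt', 'r') as f:
    lines = f.readlines()
lines = [l.strip() for l in lines if l.strip()]  # remove blank lines and '\n'
highest = 0
event = ""
current = ""  # name of the event currently being summed
total = 0     # participants counted so far for that event
for l in lines + ['#']:  # the extra '#' flushes the last event
    if l.startswith('#'):
        if total > highest:
            highest = total
            event = current
        current = l[1:].strip()
        total = 0
    else:
        total += int(l.split()[0])
print (event)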
I am making a Python script that parses an Excel file using the xlrd library.
What I would like is to do calculations on different columns if the cells contain a certain value. Otherwise, skip those values. Then store the output in a dictionary.
Here's what I tried to do :
import xlrd

workbook = xlrd.open_workbook('filter_data.xlsx')
worksheet = workbook.sheet_by_name('filter_data')
num_rows = worksheet.nrows -1
num_cells = worksheet.ncols - 1
first_col = 0
scnd_col = 1
third_col = 2
# Read Data into double level dictionary
celldict = dict()
for curr_row in range(num_rows) :
    cell0_val = int(worksheet.cell_value(curr_row+1,first_col))
    cell1_val = worksheet.cell_value(curr_row,scnd_col)
    cell2_val = worksheet.cell_value(curr_row,third_col)
    if cell1_val[:3] == 'BL1' :
        if cell2_val=='toSkip' :
            continue
    elif cell1_val[:3] == 'OUT' :
        if cell2_val == 'toSkip' :
            continue
    if not cell0_val in celldict :
        celldict[cell0_val] = dict()
    # if the entry isn't in the second level dictionary then add it, with count 1
    if not cell1_val in celldict[cell0_val] :
        celldict[cell0_val][cell1_val] = 1
    # Otherwise increase the count
    else :
        celldict[cell0_val][cell1_val] += 1
So here as you can see, I count the number of "cell1_val" values for each "cell0_val". But I would like to skip those values which have "toSkip" in the adjacent column's cell before doing the sum and storing it in the dict.
I am doing something wrong here, and I feel like the solution is much more simple.
Any help would be appreciated. Thanks.
Here's an example of my sheet :
cell0  cell1  cell2
12     BL1    toSkip
12     BL1    doNotSkip
12     OUT3   doNotSkip
12     OUT3   toSkip
13     BL1    doNotSkip
13     BL1    toSkip
13     OUT3   doNotSkip
Use collections.defaultdict with collections.Counter for your nested dictionary.
Here it is in action:
>>> from collections import defaultdict, Counter
>>> import pprint
>>> d = defaultdict(Counter)
>>> d['red']['blue'] += 1
>>> d['green']['brown'] += 1
>>> d['red']['blue'] += 1
>>> pprint.pprint(dict(d))
{'green': Counter({'brown': 1}), 'red': Counter({'blue': 2})}
Here it is integrated into your code:
from collections import defaultdict, Counter
import xlrd

workbook = xlrd.open_workbook('filter_data.xlsx')
worksheet = workbook.sheet_by_name('filter_data')
first_col = 0
scnd_col = 1
third_col = 2
celldict = defaultdict(Counter)
for curr_row in range(1, worksheet.nrows):  # start at 1 skips header row
    cell0_val = int(worksheet.cell_value(curr_row, first_col))
    cell1_val = worksheet.cell_value(curr_row, scnd_col)
    cell2_val = worksheet.cell_value(curr_row, third_col)
    if cell2_val == 'toSkip' and cell1_val[:3] in ('BL1', 'OUT'):
        continue
    celldict[cell0_val][cell1_val] += 1
I also combined your if statements and changed the calculation of curr_row to be simpler.
It appears you want to skip the current line whenever cell2_val equals 'toSkip', so it would simplify the code if you add if cell2_val=='toSkip' : continue directly after computing cell2_val.
Also, where you have
    # if the entry isn't in the second level dictionary then add it, with count 1
    if not cell1_val in celldict[cell0_val] :
        celldict[cell0_val][cell1_val] = 1
    # Otherwise increase the count
    else :
        celldict[cell0_val][cell1_val] += 1
the usual idiom is more like
celldict[cell0_val][cell1_val] = celldict[cell0_val].get(cell1_val, 0) + 1
That is, use a default value of 0 so that if key cell1_val is not yet in celldict[cell0_val], then get() will return 0.
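To see the skip-then-count logic without an Excel file, here is a minimal sketch that runs on the sample rows from the question, typed in as tuples. The function name `count_cells` is an assumption for illustration, and the rows stand in for what xlrd would return.

```python
from collections import Counter, defaultdict

# Sample sheet from the question as (cell0, cell1, cell2) tuples
rows = [
    (12, "BL1", "toSkip"),
    (12, "BL1", "doNotSkip"),
    (12, "OUT3", "doNotSkip"),
    (12, "OUT3", "toSkip"),
    (13, "BL1", "doNotSkip"),
    (13, "BL1", "toSkip"),
    (13, "OUT3", "doNotSkip"),
]

def count_cells(rows):
    """Count cell1 values per cell0 value, ignoring rows marked 'toSkip'."""
    celldict = defaultdict(Counter)
    for cell0, cell1, cell2 in rows:
        if cell2 == "toSkip":
            continue  # skip before counting
        celldict[cell0][cell1] += 1
    return celldict

result = count_cells(rows)
print({key: dict(counter) for key, counter in result.items()})
```

Each of the four remaining (cell0, cell1) pairs ends up with a count of 1, because every pair in the sample has exactly one row that is not marked toSkip.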