Python 2D list to dictionary - python

I have a 2 Dimensional list and have to get 2 columns from the 2D list and place the values from each column as key:value pairs.
Example:
table = [[15, 29, 6, 2],
         [16, 9, 8, 0],
         [7, 27, 16, 0]]
def averages(table, col, by):
    columns = tuple(([table[i][col] for i in range(len(table))])) #Place col column into tuple so it can be placed into dictionary
    groupby = tuple(([table[i][by] for i in range(len(table))])) #Place groupby column into tuple so it can be placed into dictionary
    avgdict = {}
    avgdict[groupby] = [columns]
    print(avgdict)
averages(table, 1, 3)
Output is:
{(2, 0, 0): [(29, 9, 27)]}
I am trying to get the output to equal:
{0:36, 2:29}
So essentially the 2 keys of 0 have their values added
I'm having a hard time understanding how to separate each key with their values
and then adding the values together if the keys are equal.
Edit: I'm only using Python Standard library, and not implementing numpy for this problem.

You can create an empty dictionary, then iterate over every element of groupby. If the element already exists as a key in the dictionary, add the corresponding element of columns to the stored value. Otherwise, add the element of groupby as a new key with the corresponding element of columns as its value. The implementation is as follows:
table = [[15, 29, 6, 2],
         [16, 9, 8, 0],
         [7, 27, 16, 0]]
def averages(table, col, by):
    columns = tuple(([table[i][col] for i in range(len(table))])) #Place col column into tuple so it can be placed into dictionary
    groupby = tuple(([table[i][by] for i in range(len(table))])) #Place groupby column into tuple so it can be placed into dictionary
    avgdict = {}
    for x in range(len(groupby)):
        key = groupby[x]
        if key in avgdict:
            avgdict[key] += columns[x]
        else:
            avgdict[key] = columns[x]
    print(avgdict)
averages(table, 1, 3)
Otherwise, if you want to keep your initial avgdict, then you can change the averages() function to
def averages(table, col, by):
    columns = tuple(([table[i][col] for i in range(len(table))])) #Place col column into tuple so it can be placed into dictionary
    groupby = tuple(([table[i][by] for i in range(len(table))])) #Place groupby column into tuple so it can be placed into dictionary
    avgdict = {}
    avgdict[groupby] = [columns]
    newdict = {}
    for key in avgdict:
        for x in range(len(key)):
            if key[x] in newdict:
                newdict[key[x]] += avgdict[key][0][x]
            else:
                newdict[key[x]] = avgdict[key][0][x]
    print(newdict)
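As a hedged alternative to the explicit if/else test above, the same accumulate-by-key loop can be sketched with collections.defaultdict from the standard library. The function name sum_by_group is hypothetical (it is not from the question), and it returns the dictionary instead of printing it.

```python
from collections import defaultdict

def sum_by_group(table, col, by):
    """Sum the values of column `col`, grouped by the value in column `by`."""
    totals = defaultdict(int)  # a missing key starts at 0
    for row in table:
        totals[row[by]] += row[col]
    return dict(totals)

table = [[15, 29, 6, 2],
         [16, 9, 8, 0],
         [7, 27, 16, 0]]
print(sum_by_group(table, 1, 3))  # {2: 29, 0: 36}
```

The keys appear in first-seen order, so 2 is listed before 0; the contents match the requested {0:36, 2:29}.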

It took me a minute to figure out what you were trying to accomplish because your function and variable names reference averages but your output is a sum.
Based on your output, it seems you're trying to aggregate row values in a given column by a group in another column.
Here's a recommended solution (which could likely be reduced to a one-liner with a comprehension). It loops through the unique values (b, obtained with set) in your group-by column, creates a dictionary key (agg_dict[b]) for the group being processed, and sums the values in the given column (col) over the rows that belong to that group (table[i][by] == b).
table = [[15, 29, 6, 2],
[16, 9, 8, 0],
[7, 27, 16, 0]]
def aggregate(tbl, col, by):
    agg_dict = {}
    # loop over the unique values of the group-by column
    for b in set([tbl[i][by] for i in range(len(tbl))]):
        # sum column col over the rows whose group-by value equals b
        agg_dict[b] = sum([tbl[i][col] for i in range(len(tbl)) if tbl[i][by] == b])
    print(agg_dict)
aggregate(table, 1, 3)
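The one-liner mentioned above could look roughly like this. It is only a sketch: the name aggregate_one_liner is made up for illustration, and it returns the result rather than printing it.

```python
def aggregate_one_liner(tbl, col, by):
    # Outer comprehension: one key per unique group value.
    # Inner generator: sum column `col` over the rows in that group.
    return {b: sum(row[col] for row in tbl if row[by] == b)
            for b in {row[by] for row in tbl}}

table = [[15, 29, 6, 2],
         [16, 9, 8, 0],
         [7, 27, 16, 0]]
print(aggregate_one_liner(table, 1, 3))
```

Like the loop version, this rescans the table once per group, so it is concise rather than fast.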

You can also try the following answer. It doesn't use numpy, and is based on the use of sets to find unique elements in groupby.
table = [[15, 29, 6, 2],
         [16, 9, 8, 0],
         [7, 27, 16, 0]]
def averages(table, col, by):
    columns = tuple(([table[i][col] for i in range(len(table))])) #Place col column into tuple so it can be placed into dictionary
    groupby = tuple(([table[i][by] for i in range(len(table))])) #Place groupby column into tuple so it can be placed into dictionary
    '''groupby_unq: tuple data type
    stores list of unique entries in groupby.'''
    groupby_unq = tuple(set(groupby))
    '''avg: list data type
    list of zeros of same length as groupby_unq.'''
    avg = [0] * len(groupby_unq)
    for i in range(len(groupby)):
        for j in range(len(groupby_unq)):
            if groupby[i] == groupby_unq[j]:
                avg[j] += columns[i]
    avgdict = dict( (groupby_unq[i], avg[i]) for i in range(len(avg)) )
    return avgdict
result = averages(table, 1, 3)
print(result)
{0: 36, 2: 29}

Related

How can you find the maximum nth integer in a list in python? [duplicate]

I know how to find the highest value but not the rest. Keep in mind that I need to print the positions of the 1st, 2nd and 3rd highest values. Thank you, and please try to keep it simple, as I have only been coding for 2 months. Also, there can be joint ranks.
def linearSearch(Fscore_list):
    pos_list = []
    target = (max(Fscore_list))
    for i in range(len(Fscore_list)):
        if Fscore_list[i] >= target:
            pos_list.append(i)
    return pos_list
This will create a list of the 3 largest items, and a list of the corresponding indices:
lst = [9,7,43,2,4,7,8,5,4]
# pair each value with its index, sort in descending order, keep the top 3
top3 = sorted( [(x,i) for (i,x) in enumerate(lst)],
               reverse=True )[:3]
# unzip the pairs into a tuple of values and a tuple of positions
values, posns = zip(*top3)
Things are a bit more complicated if the same value can appear multiple times (this will show the highest position for a value):
lst = [9,7,43,2,4,7,8,5,4]
ranks = sorted( [(x,i) for (i,x) in enumerate(lst)], reverse=True )
values = []
posns = []
for x,i in ranks:
    if x not in values:
        values.append( x )
        posns.append( i )
        if len(values) == 3:
            break
print(list(zip( values, posns )))
Use heapq.nlargest:
>>> import heapq
>>> [i
... for x, i
... in heapq.nlargest(
... 3,
... ((x, i) for i, x in enumerate((0,5,8,7,2,4,3,9,1))))]
[7, 2, 3]
Add all the values in the list to a set. This will ensure you have each value only once.
Sort the set.
Find the index of the top three values in the set in the original list.
Make sense?
Edit
thelist = [1, 45, 88, 1, 45, 88, 5, 2, 103, 103, 7, 8]
theset = frozenset(thelist)
theset = sorted(theset, reverse=True)
print('1st = ' + str(theset[0]) + ' at ' + str(thelist.index(theset[0])))
print('2nd = ' + str(theset[1]) + ' at ' + str(thelist.index(theset[1])))
print('3rd = ' + str(theset[2]) + ' at ' + str(thelist.index(theset[2])))
Edit
You still haven't told us how to handle 'joint winners' but looking at your responses to other answers I am guessing this might possibly be what you are trying to do, maybe? If this is not the output you want please give us an example of the output you are hoping to get.
thelist = [1, 45, 88, 1, 45, 88, 5, 2, 103, 103, 7, 8]
theset = frozenset(thelist)
theset = sorted(theset, reverse=True)
thedict = {}
for j in range(3):
    positions = [i for i, x in enumerate(thelist) if x == theset[j]]
    thedict[theset[j]] = positions
print('1st = ' + str(theset[0]) + ' at ' + str(thedict.get(theset[0])))
print('2nd = ' + str(theset[1]) + ' at ' + str(thedict.get(theset[1])))
print('3rd = ' + str(theset[2]) + ' at ' + str(thedict.get(theset[2])))
Output
1st = 103 at [8, 9]
2nd = 88 at [2, 5]
3rd = 45 at [1, 4]
BTW : What if all the values are the same (equal first) or for some other reason there is no third place? (or second place?). Do you need to protect against that? If you do then I'm sure you can work out appropriate safety shields to add to the code.
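One possible "safety shield" for the case raised above, sketched under the assumption that returning fewer than three ranks is acceptable when there are fewer than three distinct values. The helper name top_ranks is hypothetical.

```python
def top_ranks(values, n=3):
    """Return up to n (value, positions) pairs, largest value first."""
    # Slicing past the end of a list never raises, so short inputs are safe.
    distinct = sorted(set(values), reverse=True)[:n]
    return [(v, [i for i, x in enumerate(values) if x == v]) for v in distinct]

print(top_ranks([1, 45, 88, 1, 45, 88, 5, 2, 103, 103, 7, 8]))
# [(103, [8, 9]), (88, [2, 5]), (45, [1, 4])]
print(top_ranks([7, 7, 7]))
# [(7, [0, 1, 2])]
```

The caller can then check the length of the result to see how many places were actually awarded.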
This question came up far too early in my Udemy machine learning course. Scott Hunter's answer helped me the most, but it didn't get me a pass on the site, so I had to think about the problem more deeply on my own. Here is my solution, since I couldn't find one anywhere else online in terms where I understood everything that was going on*:
lst = [9,7,43,2,4,7,8,9,4]
ranks = sorted( [(x,i) for (i,x) in enumerate(lst)], reverse=True )
box = []
for x,i in ranks:
    # note: i&x is the bitwise AND of the position and the value
    if i&x not in box:
        box.append( x )
        if len(box) == 3:
            break
print(box)
So we have a list of numbers. To rank the numbers we sort the value with its position for every position that has a value when we enumerate/iterate the list. Then we put the highest values on top by reversing it. Now we need a box to put our information in to pull out of later, so we build that box []. Now for every value with a position put that in the box, if the value and position isn't already in the box--meaning if the value is already in the box, but the position isn't, still put in the box. And we only want three answers. Finally tell me what is in the variable called box.
*Many of these answers, on this post, will most likely work.
Input : [4, 5, 1, 2, 9]
N = 2
Output : [9, 5]
Input : [81, 52, 45, 10, 3, 2, 96]
N = 3
Output : [81, 96, 52]
# Python program to find N largest
# element from given list of integers
l = [1000,298,3579,100,200,-45,900]
n = 4
l.sort()
print(l[-n:])
Output:
[298, 900, 1000, 3579]
lst = [9,7,43,2,4,7,8,9,4]
temp1 = list(lst) # copy, so that remove() does not alter lst
print(temp1)
#First Highest value:
print(max(temp1))
temp1.remove(max(temp1))
#output: 43
# Second Highest value:
print(max(temp1))
temp1.remove(max(temp1))
#output: 9
# Third Highest Value:
print(max(temp1))
#output: 9 (lst contains 9 twice and only one copy was removed)
There's a complicated O(n) algorithm, but the simplest way is to sort it, which is O(n * log n), then take the top. The trickiest part here is to sort the data while keeping the indices information.
from operator import itemgetter
def find_top_n_indices(data, top=3):
    indexed = enumerate(data) # create pairs [(0, v1), (1, v2)...]
    sorted_data = sorted(indexed,
                         key=itemgetter(1), # sort pairs by value
                         reverse=True) # in reversed order
    return [d[0] for d in sorted_data[:top]] # take first N indices
data = [5, 3, 6, 3, 7, 8, 2, 7, 9, 1]
print(find_top_n_indices(data)) # should be [8, 5, 4]
Similarly, it can be done with heapq.nlargest(), but still you need to pack the initial data into tuples and unpack afterwards.
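A minimal sketch of the heapq.nlargest() variant. One way to avoid packing and unpacking tuples by hand is to rank the indices themselves and pass the lookup as the key argument. The function name top_n_indices is an assumption for illustration.

```python
import heapq

def top_n_indices(data, n=3):
    # Rank the indices 0..len(data)-1, comparing them by the value they point to.
    return heapq.nlargest(n, range(len(data)), key=data.__getitem__)

data = [5, 3, 6, 3, 7, 8, 2, 7, 9, 1]
print(top_n_indices(data))  # [8, 5, 4]
```

For a small n this avoids sorting the whole list, which is the usual reason to reach for heapq here.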
To have a list filtered and returned in descending order with duplicates removed try using this function.
You can pass in how many descending values you want it to return as keyword argument.
As a side note, if the keyword argument (ordered_nums_to_return) is greater than the number of distinct values in the list, it returns all of them in descending order. If you need it to raise an exception instead, you can add a check to the function. If no argument is passed, it returns only the highest value; again, you can change this behaviour if you need to.
list_of_nums = [2, 4, 23, 7, 4, 1]
def find_highest_values(list_to_search, ordered_nums_to_return=None):
    if ordered_nums_to_return:
        return sorted(set(list_to_search), reverse=True)[0:ordered_nums_to_return]
    return [sorted(list_to_search, reverse=True)[0]]
print(find_highest_values(list_of_nums, ordered_nums_to_return=4))
If values can appear in your list repeatedly you can try this solution.
def search(Fscore_list, num=3):
    l = Fscore_list
    res = dict([(v, []) for v in sorted(set(l), reverse=True)[:num]])
    for index, val in enumerate(l):
        if val in res:
            res[val].append(index)
    return sorted(res.items(), key=lambda x: x[0], reverse=True)
First it finds the num=3 highest values and creates a dict with an empty list of indexes for each of them. Next it goes over the list and, for every one of the highest values (val in res), saves its indexes. Then it just returns a sorted list of tuples like [(highest_1, [indexes ...]), ..]. E.g.
>>> l = [9, 7, 43, 2, 4, 7, 43, 8, 5, 8, 4]
>>> print(search(l))
[(43, [2, 6]), (9, [0]), (8, [7, 9])]
To print the positions do something like:
>>> Fscore_list = [9, 7, 43, 2, 4, 7, 43, 8, 5, 8, 4, 43, 43, 43]
>>> result = search(Fscore_list)
>>> print("1st. %d on positions %s" % (result[0][0], result[0][1]))
1st. 43 on positions [2, 6, 11, 12, 13]
>>> print("2nd. %d on positions %s" % (result[1][0], result[1][1]))
2nd. 9 on positions [0]
>>> print("3rd. %d on positions %s" % (result[2][0], result[2][1]))
3rd. 8 on positions [7, 9]
In one line:
lst = [9,7,43,2,8,4]
index = [i[1] for i in sorted([(x,i) for (i,x) in enumerate(lst)])[-3:]]
print(index)
[4, 0, 2]
In Python 2, None is always considered smaller than any number (in Python 3 these comparisons raise a TypeError instead):
>>> None<4
True
>>> None>4
False
Find the highest element, and its index.
Replace it by None. Find the new highest element, and its index. This would be the second highest in the original list. Replace it by None. Find the new highest element, which is actually the third one.
Optional: restore the found elements to the list.
This is O(number of highest elements * list size), so it scales poorly if your "three" grows, but right now it's O(3n).
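A rough Python 3 sketch of the same repeated-scan idea. Because None cannot be compared with numbers in Python 3, this version remembers the indices it has already reported instead of overwriting them with None; that substitution, and the name top_positions_by_rescanning, are assumptions rather than part of the answer above.

```python
def top_positions_by_rescanning(values, count=3):
    """Return (value, index) pairs for the `count` largest entries."""
    used = set()   # indices already reported; the list itself is left untouched
    found = []
    for _ in range(min(count, len(values))):
        # One full scan per requested rank: O(count * len(values)) overall.
        best = max((i for i in range(len(values)) if i not in used),
                   key=values.__getitem__)
        used.add(best)
        found.append((values[best], best))
    return found

print(top_positions_by_rescanning([9, 7, 43, 2, 4, 7, 8, 5, 4]))
# [(43, 2), (9, 0), (8, 6)]
```

Since nothing is replaced in the list, the optional restore step is not needed.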

How to map over a list of dates and values and add the values based on unique dates?

I have this data:
data = [['20-01-22', '20-01-22', '09-09-21'],
[10, 10, 10],
[12, 10, 1 ]]
and I would like to add the value based on the date (ultimately going into an Excel to chart the data).
End result would be like so:
data = [['20-01-22', '09-09-21'],
[20, 10],
[22, 1 ]]
I have tried to pull out the first row and use it as keys to access the next rows, but I am a little stumped.
# get all the datetimes.
keys = data[0]
newlist = []
for x in keys: # returns unique keys
    if x not in newlist:
        newlist.append(x)
Can you give me tips on where to go from here - I need to use the keys to access the values and add them.
If you can use pandas:
import pandas as pd
data = [['20-01-22', '20-01-22', '09-09-21'],
[10, 10, 10],
[12, 10, 1 ]]
pd.DataFrame(data).T.groupby(0, as_index=False, sort=False).sum().T.values.tolist()
Output
[['20-01-22', '09-09-21'], [20, 10], [22, 1]]
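If pandas is not an option, the same grouping can be sketched with a plain dict that maps each date to its running sums. The helper name sum_by_date is hypothetical, and the sketch assumes every value row has the same length as the date row.

```python
def sum_by_date(data):
    """data[0] holds the dates; each later row holds values aligned with data[0]."""
    dates, *value_rows = data
    totals = {}  # date -> list of running sums, one entry per value row
    for column, date in enumerate(dates):
        sums = totals.setdefault(date, [0] * len(value_rows))
        for r, row in enumerate(value_rows):
            sums[r] += row[column]
    # Rebuild the row-oriented layout; dicts keep first-seen order (Python 3.7+).
    return [list(totals)] + [[totals[d][r] for d in totals]
                             for r in range(len(value_rows))]

data = [['20-01-22', '20-01-22', '09-09-21'],
        [10, 10, 10],
        [12, 10, 1]]
print(sum_by_date(data))
# [['20-01-22', '09-09-21'], [20, 10], [22, 1]]
```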

How to speed up nested loop and add condition?

I am trying to speed up my nested loop; it currently takes 15 minutes for 100k customers.
I am also having trouble adding an additional condition that multiplies by the lookup2 value only for states A, B and C, and by 1 otherwise.
customer_data = pd.DataFrame({"cust_id": [1, 2, 3, 4, 5, 6, 7, 8],
"state": ['B', 'E', 'D', 'A', 'B', 'E', 'C', 'A'],
"cust_amt": [1000,300, 500, 200, 400, 600, 200, 300],
"year":[3, 3, 4, 3, 4, 2, 2, 4],
"group":[10, 25, 30, 40, 55, 60, 70, 85]})
state_list = ['A','B','C','D','E']
# All lookups should be dataframes with the year and/or group and the value like these.
lookup1 = pd.DataFrame({'year': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
'lim %': 0.1})
lookup2 = pd.concat([pd.DataFrame({'group':g, 'lookup_val': 0.1, 'year':range(1, 11)}
for g in customer_data['group'].unique())]).explode('year')
multi_data = np.arange(250).reshape(10,5,5)
lookups = [lookup1, lookup2]
# Preprocessing.
# Transform the state to categorical code to use it as array index.
customer_data['state'] = pd.Categorical(customer_data['state'],
categories=state_list,
ordered=True).codes
# Set index on lookups.
for i in range(len(lookups)):
if 'group' in lookups[i].columns:
lookups[i] = lookups[i].set_index(['year', 'group'])
else:
lookups[i] = lookups[i].set_index(['year'])
calculation:
results = {}
for customer, state, amount, start, group in customer_data.itertuples(name=None, index=False):
    for year in range(start, len(multi_data)+1):
        if year == start:
            results[customer] = [[amount * multi_data[year-1, state, :]]]
        else:
            results[customer].append([results[customer][-1][-1] @ multi_data[year-1]])
        for lookup in lookups:
            if isinstance(lookup.index, pd.MultiIndex):
                value = lookup.loc[(year, group)].iat[0]
            else:
                value = lookup.loc[year].iat[0]
            results[customer][-1].append(value * results[customer][-1][-1])
example of expected output:
{1: [[array([55000, 56000, 57000, 58000, 59000]),
array([5500., 5600., 5700., 5800., 5900.]),
array([550., 560., 570., 5800., 5900.])],...
You could use multiprocessing if you have more than one CPU.
import multiprocessing as mp
from multiprocessing import Pool
def get_customer_data(data_tuple) -> dict:
    results = {}
    customer, state, amount, start, group = data_tuple
    for year in range(start, len(multi_data)+1):
        if year == start:
            results[customer] = [[amount * multi_data[year-1, state, :]]]
        else:
            results[customer].append([results[customer][-1][-1] @ multi_data[year-1]])
        for lookup in lookups:
            if isinstance(lookup.index, pd.MultiIndex):
                value = lookup.loc[(year, group)].iat[0]
            else:
                value = lookup.loc[year].iat[0]
            results[customer][-1].append(value * results[customer][-1][-1])
    return results
p = Pool(mp.cpu_count())
# Pool.map() takes a function and an iterable like a list or generator
results_list = p.map(get_customer_data, [data_tuple for data_tuple in customer_data.itertuples(name=None, index=False)] )
# results is a list of dict()
results_dict = {k:v for x in results_list for k,v in x.items()}
p.close()
Glad to see you posting this! As promised, my thoughts:
Pandas works with columns very well. What you need to do is remove loops as much as possible (in your case, I would get rid of the main loop and keep the year and lookups loops).
To do this, forget about the results{} variable for now. You want to do the calculations directly on the DataFrame. For example your first calculation would become something like:
customer_data['meaningful_column_name'] = [[amount * multi_data[customer_data['year']-1, customer_data['state'], :]]]
For your lookups loop you just have to be aware that the if statement will be looking at entire columns.
Finally, as it seems you want to have your data in a list of arrays you will need to do some formatting to extract the data from a DataFrame structure.
I hope that makes some sense

convert a dataframe column from string to List of numbers

I have created the following dataframe from a csv file:
id marks
5155 1,2,3,,,,,,,,
2156 8,12,34,10,4,3,2,5,0,9
3557 9,,,,,,,,,,
7886 0,7,56,4,34,3,22,4,,,
3689 2,8,,,,,,,,
It is indexed on id. The values for the marks column are string. I need to convert them to a list of numbers so that I can iterate over them and use them as index number for another dataframe. How can I convert them from string to a list? I tried to add a new column and convert them based on "Add a columns in DataFrame based on other column" but it failed:
df = df.assign(new_col_arr=lambda x: np.fromstring(x['marks'].values[0], sep=',').astype(int))
Here's a way to do:
df = df.assign(new_col_arr=df['marks'].str.split(','))
# convert to int
df['new_col'] = df['new_col_arr'].apply(lambda x: list(map(int, [i for i in x if i != ''])))
I presume that you want to create a NEW dataframe, since the number of items is different from the number of rows. I suggest the following:
#source data
df = pd.DataFrame({'id':[5155, 2156, 7886],
                   'marks':['1,2,3,,,,,,,,','8,12,34,10,4,3,2,5,0,9', '0,7,56,4,34,3,22,4,,,']})
# create dictionary from df:
dd = {row[0]:np.fromstring(row[1], dtype=int, sep=',') for _, row in df.iterrows()}
{5155: array([1, 2, 3]),
2156: array([ 8, 12, 34, 10, 4, 3, 2, 5, 0, 9]),
7886: array([ 0, 7, 56, 4, 34, 3, 22, 4])}
# here you pad the lists inside dictionary so that they have equal length
...
# convert dd to DataFrame:
df2 = pd.DataFrame(dd)
I found two similar alternatives:
1.
df['marks'] = df['marks'].str.split(',').map(lambda num_str_list: [int(num_str) for num_str in num_str_list if num_str])
2.
df['marks'] = df['marks'].map(lambda arr_str: [int(num_str) for num_str in arr_str.split(',') if num_str])
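To see what those lambdas do to a single cell, here is the same parsing step as a stand-alone function, without pandas. The name parse_marks is made up for illustration.

```python
def parse_marks(marks):
    """Turn a string such as '1,2,3,,,' into [1, 2, 3], skipping empty fields."""
    return [int(field) for field in marks.split(',') if field.strip()]

print(parse_marks('1,2,3,,,,,,,,'))          # [1, 2, 3]
print(parse_marks('0,7,56,4,34,3,22,4,,,'))  # [0, 7, 56, 4, 34, 3, 22, 4]
```

A named function like this can be passed straight to df['marks'].map in place of the lambda.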

Pythonic way for max-sum-max multiple lists

I have 3 lists:
a1 = range(10)
a2 = range(10,20)
a3 = range(20,30)
I need to do the following:
For each list, get max of every 5 element blocks, so hypothetically:
a1_maxes = [max1_a1, max2_a1]
a2_maxes = [max1_a2, max2_a2]
a3_maxes = [max1_a3, max2_a3]
Sum each "maxes" list, so:
for each i:
sum_i = sum(ai_maxes)
Take the max of these 3 sums, so:
max(sum_1, sum_2, sum_3)
I could not get myself to use map() here. What would be the most Pythonic (concise) way to do this? Thanks.
a1 = range(10)
a2 = range(10,20)
a3 = range(20,30)
print(max(sum(x[i:i+5]) for x in (a1,a2,a3) for i in xrange(0,len(a1),5)))
135
Just get the sum of each chunk x[i:i+5].
To make it more obvious, the lists are split into the following chunks:
print([list(x[i:i+5]) for x in [a1,a2,a3] for i in xrange(0,len(a1),5)])
[[0, 1, 2, 3, 4], [5, 6, 7, 8, 9], [10, 11, 12, 13, 14], [15, 16, 17, 18, 19], [20, 21, 22, 23, 24], [25, 26, 27, 28, 29]]
Then max just gets the largest sum.
If you want the highest two elements from each chunk:
mx_pair = max(sorted(x[i:i+5])[-2:] for x in (a1,a2,a3) for i in xrange(0,len(a1),5))
print(sum(mx_pair))
57
If the answer should be 53:
from itertools import izip,imap
def chunks(l):
for i in xrange(0,len(l), 5):
yield l[i:i+5]
sums = (max(izip(*ele)) for ele in imap(chunks,(a1,a2,a3)))
print(sum(max(sums)))
Let's break this into pieces.
The first point is that you probably don't want separate a1, a2, and a3 variables; if you're going to have to do the exact same thing repeatedly to multiple values, and then iterate over those values, they probably belong in a list. So:
a = [a1, a2, a3]
Now, how do you split an iterable into 5-element pieces? There are a number of ways to do it, from the grouper function in the itertools recipes to zipping slices to iterating over slices. I'll use grouper:
grouped = [grouper(sublist, 5) for sublist in a]
Now we just want the max value of each group, so:
maxes = [[max(group) for group in sublist] for sublist in grouped]
And now, we want to sum up each sublist:
sums = [sum(sublist) for sublist in maxes]
And finally, we want to take the max of these sums:
maxsum = max(sums)
Now, given that each of these list comprehensions is only being used as a one-shot iterable, we might as well turn them into generator expressions. And if you want to, you can merge some of the steps together:
maxsum = max(sum(max(group) for group in grouper(sublist, 5)) for sublist in a)
And, having done that, you don't actually need a to be created explicitly, because it only appears once:
maxsum = max(sum(max(group) for group in grouper(sublist, 5))
             for sublist in (a1, a2, a3))
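The answer above relies on the grouper recipe from the itertools documentation, which is not defined here. As a self-contained Python 3 sketch, a small slice-based helper can stand in for it; blocks is a hypothetical name, and unlike grouper it does not pad a short final block.

```python
def blocks(seq, size):
    """Yield consecutive slices of `seq` with at most `size` items each."""
    for start in range(0, len(seq), size):
        yield seq[start:start + size]

a1 = range(10)
a2 = range(10, 20)
a3 = range(20, 30)

# max of each 5-element block, summed per list, then the largest of the sums
maxsum = max(sum(max(block) for block in blocks(sublist, 5))
             for sublist in (a1, a2, a3))
print(maxsum)  # 53
```

Here the per-list sums are 4 + 9 = 13, 14 + 19 = 33 and 24 + 29 = 53, so the result is 53.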
