how to merge data in python

I have only been learning Python for a short time, and I have tried my best to present my data clearly.
Now I have some tuples that look like this:
('John', '5', 'Coke')
('Mary', '1', 'Pie')
('Jack', '3', 'Milk')
('Mary', '2', 'Water')
('John', '3', 'Coke')
I want to count how many items each person has bought.
Assume that different names are different people.
How can I get output like the following?
John: 8 Coke
Mary: 1 Pie
Mary: 2 Water
Jack: 3 Milk
I have no idea how to do this. I can't come up with any method, not even a naive one.

I'd suggest using the (name, drink) pair as a key for collections.Counter:
from collections import Counter
# tuples is the list of (name, amount, drink) tuples from the question
count = Counter()
for name, amount, drink in tuples:
    key = name, drink
    count.update({key: int(amount)})  # increment the value
# represent the aggregated data
for (name, drink), amount in count.items():
    print('{}: {} {}'.format(name, amount, drink))
Update: I made some simple measurements and found that
    count[name, drink] += int(amount)
is not only more readable but also much faster than calling update, which should not be a surprise. Moreover, defaultdict(int) is faster still (about twice as fast), presumably because Counter adds some overhead of its own.
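A rough way to repeat the measurement from the update, as a sketch only. The list name `tuples`, the function names, and the repetition counts are assumptions for illustration, and the timings will vary by machine and Python version, so no expected numbers are shown.

```python
from collections import Counter, defaultdict
from timeit import timeit

# Assumed name for the question's data, repeated to make timings measurable.
tuples = [
    ('John', '5', 'Coke'),
    ('Mary', '1', 'Pie'),
    ('Jack', '3', 'Milk'),
    ('Mary', '2', 'Water'),
    ('John', '3', 'Coke'),
] * 1000


def with_update():
    # Counter.update builds a throwaway dict for every row.
    count = Counter()
    for name, amount, drink in tuples:
        count.update({(name, drink): int(amount)})
    return count


def with_counter_add():
    # In-place addition on a Counter; missing keys count as 0.
    count = Counter()
    for name, amount, drink in tuples:
        count[name, drink] += int(amount)
    return count


def with_defaultdict():
    # Same loop, but backed by defaultdict(int).
    count = defaultdict(int)
    for name, amount, drink in tuples:
        count[name, drink] += int(amount)
    return count


for func in (with_update, with_counter_add, with_defaultdict):
    seconds = timeit(func, number=20)
    print('{:<18} {:.4f}s'.format(func.__name__, seconds))
```

All three functions produce the same totals, so the choice between them is about readability and speed only.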

Re-arranging the order of your data might help:
John: 8 Coke
Mary: 1 Pie
Mary: 2 Water
Jack: 3 Milk
might be more insightful, when written as
(John, Coke) : 8
(Mary, Pie) : 1
(Mary, Water): 2
(Jack, Milk) : 3
If you know SQL, this is more or less equivalent to GROUP BY name, dish together with SUM(count).
So, in Python, you can create a dictionary for that pair:
data = [
    ('John', '5', 'Coke'),
    ('Mary', '1', 'Pie'),
    ('Jack', '3', 'Milk'),
    ('Mary', '2', 'Water'),
    ('John', '3', 'Coke'),
]
orders = {}
for name, count, dish in data:
    if (name, dish) in orders:
        orders[(name, dish)] += int(count)
    else:
        # first entry
        orders[(name, dish)] = int(count)
Even more Pythonic, use collections.defaultdict:
from collections import defaultdict
orders = defaultdict(int)
for name, count, dish in data:
    orders[(name, dish)] += int(count)
or collections.Counter, as noted by @bereal.
Format the data as you like.
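The SQL analogy in this answer can be tried directly with the standard-library sqlite3 module. This is only a sketch: the table name `orders` and the column names `name`, `amount` and `dish` are made up for illustration and do not come from the question.

```python
import sqlite3

data = [
    ('John', '5', 'Coke'),
    ('Mary', '1', 'Pie'),
    ('Jack', '3', 'Milk'),
    ('Mary', '2', 'Water'),
    ('John', '3', 'Coke'),
]

# In-memory database with a hypothetical schema.
conn = sqlite3.connect(':memory:')
conn.execute('CREATE TABLE orders (name TEXT, amount INTEGER, dish TEXT)')
conn.executemany(
    'INSERT INTO orders VALUES (?, ?, ?)',
    [(name, int(amount), dish) for name, amount, dish in data],
)

# Group by the (name, dish) pair and sum the amounts within each group.
rows = conn.execute(
    'SELECT name, dish, SUM(amount) FROM orders '
    'GROUP BY name, dish ORDER BY name, dish'
).fetchall()
conn.close()

for name, dish, total in rows:
    print('{}: {} {}'.format(name, total, dish))
```

The dictionary keyed by (name, dish) plays the role of the GROUP BY clause, and the `+=` plays the role of SUM.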

Assuming you have a list of tuples:
tuples = [('John', '5', 'Coke'),
          ('Mary', '1', 'Pie'),
          ('Jack', '3', 'Milk'),
          ('Mary', '2', 'Water'),
          ('John', '3', 'Coke')]
memory = {}
# First, we calculate the amount for each pair
for item in tuples:
    # The key is built from the name and the product, e.g. (John, Coke), (Mary, Pie), (Jack, Milk), ...
    key = (item[0], item[2])
    number = int(item[1])
    if key in memory:
        memory[key] += number
    else:
        memory[key] = number
# Afterwards, we format the information
result = []
for key in memory:
    result.append((key[0], memory[key], key[1]))

Related

Pairing elements of list of lists and storing in tuple form

I have a file, say file1.txt, which has multiple rows and columns. I want to read it and store it as a list of lists. Then I want to pair the rows such that no two identical rows end up in the same pair. The second-to-last column represents the class. Below is my file:
27,28,29,30,1,0.67
31,32,33,34,1,0.84
35,36,37,38,1,0.45
39,40,41,42,1,0.82
43,44,45,46,1,0.92
43,44,45,46,1,0.92
Here all 6 rows are class 1. I am using the logic below to do the pairing.
from operator import itemgetter
rule_file_name = 'file1.txt'
rule_fp = open(rule_file_name)
list1 = []
for line in rule_fp.readlines():
    list1.append(line.replace("\n","").split(","))
list1=sorted(list1,key=itemgetter(-1),reverse=True)
length = len(list1)
middle_index = length // 2
first_half = list1[:middle_index]
second_half = list1[middle_index:]
result=[]
result=list(zip(first_half,second_half))
for a,b in result:
    if a==b:
        result.remove((a, b))
print(result)
print("-------------------")
This works fine when I have only one class. But if my file has multiple classes, I want the pairing to be done within the same class only. For example, if my file looks like the one below (say file2):
27,28,29,30,1,0.67
31,32,33,34,1,0.84
35,36,37,38,1,0.45
39,40,41,42,1,0.82
43,44,45,46,1,0.92
43,44,45,46,1,0.92
51,52,53,54,2,0.28
55,56,57,58,2,0.77
59,60,61,62,2,0.39
63,64,65,66,2,0.41
75,76,77,78,3,0.51
90,91,92,93,3,0.97
Then I want to make 3 pairs from class 1, 2 from class 2 and 1 from class 3. I am using this logic to build a dictionary where the keys are the classes:
d = {}
sorted_grouped = []
for row in list1:
    # Add the class to the dict if it does not exist yet
    if row[-2] not in d:
        d[row[-2]] = []
    # Add the row to the list for its class
    d[row[-2]].append(row)
#print(d.items())
for k,v in d.items():
    sorted_grouped.append(v)
#print(sorted_grouped)
gp_vals = {}
for i in sorted_grouped:
    gp_vals[i[0][-2]] = i
print(gp_vals)
Now, how can I do the pairing? Please help!
My desired output for file2 is:
[([43,44,45,46,1,0.92], [39,40,41,42,1,0.82]), ([43,44,45,46,1,0.92], [27,28,29,30,1,0.67]), ([31,32,33,34,1,0.84], [35,36,37,38,1,0.45])]
[([55,56,57,58,2,0.77], [59,60,61,62,2,0.39]), ([63,64,65,66,2,0.41], [51,52,53,54,2,0.28])]
[([90,91,92,93,3,0.97], [75,76,77,78,3,0.51])]
Edit1:
All the files will have an even number of rows, and every class will have an even number of rows as well.
For a particular class (say class 2) with n rows, there can be at most n/2 identical rows for that class in the dataset.
My primary intention was to get random pairing while making sure no self-pairing is allowed. For that, I thought of taking the row with the highest fitness value (the last column) within a class, picking any other row from that class at random, and making a pair, ensuring that the two rows are not exactly the same. The same thing is repeated for every class separately.

First, read in the data from the file. I'd use assert here to communicate your assumptions to people who read the code (including future you) and to confirm that the assumptions actually hold for the file. If they do not, it will raise an AssertionError.
rule_file_name = 'file2.txt'
list1 = []
with open(rule_file_name) as rule_fp:
    for line in rule_fp.readlines():
        list1.append(line.replace("\n","").split(","))
assert len(list1) & 1 == 0  # confirm length is even
Then use a defaultdict to store the lists for each class.
from collections import defaultdict
classes = defaultdict(list)
for _list in list1:
    classes[_list[4]].append(_list)
Then use sample to draw pairs and confirm they aren't the same. Here I'm including a seed to make the results reproducible, but you can take that out for randomness.
from random import sample, seed
seed(1)  # remove this line when you want actual randomness
for key, _list in classes.items():
    assert len(_list) & 1 == 0  # each class must also be even, else there is an error in the data
    _list.sort(key=lambda x: float(x[5]))  # compare fitness numerically, not as strings
    pairs = []
    while _list:
        first = _list[-1]  # row with the highest fitness
        candidate = sample(_list, 1)[0]
        if first != candidate:
            print(f'first {first}, candidate {candidate}')
            pairs.append((first, candidate))
            _list.remove(first)
            _list.remove(candidate)
    classes[key] = pairs
Note that an implicit assumption in this way of sampling (stated in the edit) is that the duplicates arise from the highest fitness values. If this is not true, the loop could run forever.
If you want to print them then iterate over the dictionary again:
for key, pairs in classes.items():
print(key, pairs)
which for me gives:
1 [(['43', '44', '45', '46', '1', '0.92'], ['27', '28', '29', '30', '1', '0.67']), (['43', '44', '45', '46', '1', '0.92'], ['31', '32', '33', '34', '1', '0.84']), (['39', '40', '41', '42', '1', '0.82'], ['35', '36', '37', '38', '1', '0.45'])]
2 [(['55', '56', '57', '58', '2', '0.77'], ['51', '52', '53', '54', '2', '0.28']), (['63', '64', '65', '66', '2', '0.41'], ['59', '60', '61', '62', '2', '0.39'])]
3 [(['90', '91', '92', '93', '3', '0.97'], ['75', '76', '77', '78', '3', '0.51'])]
Using these values for file2.txt (the first numbers are row numbers and not part of the actual file):
1 27,28,29,30,1,0.67
2 31,32,33,34,1,0.84
3 35,36,37,38,1,0.45
4 39,40,41,42,1,0.82
5 43,44,45,46,1,0.92
6 43,44,45,46,1,0.92
7 51,52,53,54,2,0.28
8 55,56,57,58,2,0.77
9 59,60,61,62,2,0.39
10 63,64,65,66,2,0.41
11 75,76,77,78,3,0.51
12 90,91,92,93,3,0.97
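The answer warns that its sampling loop can spin forever when only identical rows remain. One possible variation, sketched below under assumptions, draws the partner only from rows that differ from the current one and raises an error when no such row is left. The function name `pair_within_classes` and the inline `FILE2` string are hypothetical stand-ins for the answer's code and file. This does not guarantee that a valid pairing is always found; it only turns the endless loop into a visible failure.

```python
from collections import defaultdict
from random import Random

# Stand-in for the contents of file2.txt from the question.
FILE2 = """27,28,29,30,1,0.67
31,32,33,34,1,0.84
35,36,37,38,1,0.45
39,40,41,42,1,0.82
43,44,45,46,1,0.92
43,44,45,46,1,0.92
51,52,53,54,2,0.28
55,56,57,58,2,0.77
59,60,61,62,2,0.39
63,64,65,66,2,0.41
75,76,77,78,3,0.51
90,91,92,93,3,0.97"""


def pair_within_classes(rows, rng):
    """Pair each class's fittest remaining row with a random, different row."""
    classes = defaultdict(list)
    for row in rows:
        classes[row[4]].append(row)  # column 4 holds the class

    paired = {}
    for key, members in classes.items():
        # Ascending by fitness, so pop() returns the fittest remaining row.
        remaining = sorted(members, key=lambda r: float(r[5]))
        pairs = []
        while remaining:
            first = remaining.pop()
            # Only rows that differ from `first` are eligible partners.
            candidates = [r for r in remaining if r != first]
            if not candidates:
                raise ValueError(
                    'class {}: no distinct partner for {}'.format(key, first))
            partner = rng.choice(candidates)
            remaining.remove(partner)
            pairs.append((first, partner))
        paired[key] = pairs
    return paired


rows = [line.split(',') for line in FILE2.splitlines()]
paired = pair_within_classes(rows, Random(1))  # seeded for reproducibility
for key, pairs in paired.items():
    print(key, pairs)
```

Passing a `Random` instance keeps the seeding local to the call; an unseeded `Random()` gives different pairings on each run.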

How to add numbers in duplicate list

I've collected data from a text file and turned it into a list (there are actually many more players, so it is impossible to count without a loop), like:
data_list = [
['FW', '1', 'Khan', '2', '0'],
['FW', '25', 'Daniel', '0', '0'],
['FW', '3', 'Daniel', '1', '0'],
['FW', '32', 'Daniel', '0', '0'],
['FW', '4', 'Khan', '1', '0']
]
and I want to add up the goals of Khan and of Daniel and make a list like:
['Khan', 3]
['Daniel', 1]
I have a name list (name_list = ['Khan', 'Daniel']).
I've tried to do it with a for loop, like:
goal = []
num = 0
for i in name_list:
    for j in data_list:
        if i == j[2]:
            num += int(j[3])
            goal.append([i, num])
        else:
            continue
and it did not work.
I am a novice, so your comments will be a really big help.
Thanks!

Your code is very close to working; there are syntax errors and a single real problem.
The problem is that you are appending num too soon. You should sum over the rows that contain the name you are looking for and then, once all rows have been seen, append the value:
data_list = [
    ['pos', 'num', 'name', 'goal', 'assist'],
    ['FW', '1', 'Khan', '2', '0'],
    ['FW', '25', 'Daniel', '0', '0'],
    ['FW', '3', 'Daniel', '1', '0'],
    ['FW', '32', 'Daniel', '0', '0'],
    ['FW', '4', 'Khan', '1', '0']
]
name_list = ['Khan', 'Daniel']
goal = []
for name in name_list:
    total_score = 0
    for j in data_list:
        if name == j[2]:
            total_score += int(j[3])
    goal.append([name, total_score])
On the other hand, this strategy is not the most efficient, since for every name the code iterates over all rows. By using a dictionary to store intermediate results, you need only a single pass over the rows, independently of the number of names you are looking for.
name_list = {'Khan', 'Daniel'}
goal = dict()
for row in data_list:
    if row[2] in name_list:
        if row[2] not in goal:
            goal[row[2]] = 0
        goal[row[2]] += int(row[3])
This sets goal to {'Khan': 3, 'Daniel': 1}.
This can still be improved (for readability) using defaultdict. A defaultdict performs the existence check for a given key and the initialisation automatically, which simplifies the code:
from collections import defaultdict
goal = defaultdict(int)
for row in data_list:
    if row[2] in name_list:
        goal[row[2]] += int(row[3])
This does exactly the same thing as before. At this point it is not even clear that we really need to provide a list of names (unless memory is an issue). Getting a dictionary for all names would simplify the code again (we just need to make sure to ignore the header row using the slice notation [1:]):
goal = defaultdict(int)
for row in data_list[1:]:
    goal[row[2]] += int(row[3])
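The question asked for a list of [name, total] pairs rather than a dictionary. A minimal sketch of that last step, assuming the same `data_list` with a header row as in this answer (the name `goal_list` is made up here):

```python
from collections import defaultdict

data_list = [
    ['pos', 'num', 'name', 'goal', 'assist'],
    ['FW', '1', 'Khan', '2', '0'],
    ['FW', '25', 'Daniel', '0', '0'],
    ['FW', '3', 'Daniel', '1', '0'],
    ['FW', '32', 'Daniel', '0', '0'],
    ['FW', '4', 'Khan', '1', '0'],
]

goal = defaultdict(int)
for row in data_list[1:]:  # skip the header row
    goal[row[2]] += int(row[3])

# Dictionaries keep insertion order (Python 3.7+), so names appear
# in the order they were first seen in the data.
goal_list = [[name, total] for name, total in goal.items()]
print(goal_list)
```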

You can create a dictionary to keep the total number of goals, with the names as keys. This will make it easier to access the values:
goals_dict = {}
for name in name_list:
    goals_dict[name] = 0
# {'Khan': 0, 'Daniel': 0}
Then just sum it:
for name in name_list:
    for data in data_list:
        if data[2] == name:
            goals_dict[name] += int(data[3])
Now your dictionary is populated correctly. To get the result as the list you requested:
result = [[key, value] for key, value in goals_dict.items()]

Don't bother doing it manually. Use a Counter instead:
from collections import Counter
c = Counter()
for j in data_list:
    name = j[2]
    goal = int(j[3])
    c[name] += goal
print(c.most_common())  # -> [('Khan', 3), ('Daniel', 1)]

In your code above, num is never reset between players, so the counts run together. You'll want to initialize it to 0 inside the outer loop, before the inner for loop, and append the name/goal pair only after the inner loop has finished:
goal = []
for i in name_list:
    # Init num
    num = 0
    # Iterate through each data entry
    for j in data_list:
        if i == j[2]:
            # Increment goal count for this player
            num += int(j[3])
    # Append final count to goal list
    goal.append([i, num])
This should have the desired effect, although as @wjandrea has pointed out, a Counter would be a much cleaner implementation.

Sort list of names and scores in Python 3

I have a list of data that is in the structure of name and then score like this:
['Danny', '8', 'John', '5', 'Sandra', 10]
What I need, in the simplest way possible, is to sort the data from lowest to highest score, for example like this:
['John', '5', 'Danny', '8', 'Sandra', 10]

You should create pairings which will make your life a lot easier:
l = ['Danny', '8', 'John', '5', 'Sandra', '10']
it = iter(l)
srt = sorted(zip(it, it), key=lambda x: int(x[1]))
Which will give you:
[('John', '5'), ('Danny', '8'), ('Sandra', '10')]
it = iter(l) creates an iterator; zip(it, it) then effectively calls (next(it), next(it)) on each iteration, so you get tuples in the format (user, score). We then sort by the second element of each tuple, the score, casting it to int.
You may be better off casting to int first and then sorting if you plan on using the data. You could also create a flat list from the sorted data, but I think that would be a bad idea.
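A short sketch of the two options mentioned in the last paragraph: casting the scores to int before sorting, and flattening the sorted pairs back into the layout the question asked for. The names `pairs` and `flat` are chosen here for illustration.

```python
from itertools import chain

l = ['Danny', '8', 'John', '5', 'Sandra', '10']

# Pair up consecutive items and convert each score to int up front.
it = iter(l)
pairs = [(name, int(score)) for name, score in zip(it, it)]
pairs.sort(key=lambda pair: pair[1])  # lowest score first

# Optional: flatten back to name, score, name, score, ...
flat = list(chain.from_iterable(pairs))

print(pairs)
print(flat)
```

The flat list mixes strings and ints, which is part of why keeping the pairs is usually easier to work with.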

The best data structure for your problem is a dictionary.
In your situation you need to map names to scores.
scores = {'Danny': '8', 'John': '5', 'Sandra': '10'}
# sort by the numeric value of the score, lowest first
sorted_scores = sorted(scores.items(), key=lambda kv: int(kv[1]))
for k, v in sorted_scores:
    print((k, v))
which prints:
('John', '5')
('Danny', '8')
('Sandra', '10')

Writing a list with nested tuples to a csv file

I have a list with nested tuples, like the one below:
data = [('apple', 19.0, ['gala', '14', 'fuji', '5', 'dawn', '3', 'taylor', '3']),
('pear', 35.0, ['anjou', '29', 'william', '6', 'concorde', '4'])]
I want to flatten it out so that I can write a .csv file in which each item on every list corresponds to a column:
apple 19.0 gala 14 fuji 5 dawn 3 taylor 3
pear 35.0 anjou 29 william 6 concorde 4
I tried using simple flattening:
flattened = [value for pair in data for value in pair]
But the outcome has not been the desired one. Any ideas on how to solve this?

To write the data out to CSV, simply use the csv module and give it one row at a time; constructing the row is not that hard:
import csv
# outputfile is the path of the CSV file to write
with open(outputfile, 'w', newline='') as ofh:
    writer = csv.writer(ofh)
    for row in data:
        row = list(row[:2]) + row[2]
        writer.writerow(row)
This produces:
apple,19.0,gala,14,fuji,5,dawn,3,taylor,3
pear,35.0,anjou,29,william,6,concorde,4
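To try the row construction without touching the file system, the same writer can target an in-memory buffer. This is a sketch only: `io.StringIO` stands in for the real output file, and `lineterminator='\n'` is set just to keep the captured text easy to compare (the csv module defaults to '\r\n').

```python
import csv
import io

data = [
    ('apple', 19.0, ['gala', '14', 'fuji', '5', 'dawn', '3', 'taylor', '3']),
    ('pear', 35.0, ['anjou', '29', 'william', '6', 'concorde', '4']),
]

buffer = io.StringIO()  # stand-in for the opened output file
writer = csv.writer(buffer, lineterminator='\n')
for name, total, varieties in data:
    # One flat row: the two scalar fields followed by the nested list's items.
    writer.writerow([name, total, *varieties])

output = buffer.getvalue()
print(output, end='')
```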

Disclaimer - Not very efficient Python code.
But, it does the job. (You can adjust the width (currently 10))
data = [('apple', 19.0, ['gala', '14', 'fuji', '5', 'dawn', '3', 'taylor', '3']),
        ('pear', 35.0, ['anjou', '29', 'william', '6', 'concorde', '4'])]
flattened = list()
for i, each in enumerate(data):
    flattened.append(list())
    for item in each:
        if isinstance(item, list):
            flattened[i].extend(item)
        else:
            flattened[i].append(item)
# Now print the flattened list in the required prettified manner.
for each in flattened:
    print("".join(["{:<10}".format(item) for item in each]))
    # String is formatted such that all items are width 10 & left-aligned
Note - I tried to write the function for a more general case.
PS - Any code suggestions are welcome. I really want to improve this one.

This seems like it calls for recursion:
def flatten(inlist):
    outlist=[]
    if isinstance(inlist, (list, tuple)):
        for item in inlist:
            outlist+=flatten(item)
    else:
        outlist+=[inlist]
    return outlist
This should work no matter how nested your list becomes. Tested it with this:
>>> flatten([0,1,2,[3,4,[5,6],[7,8]]])
[0, 1, 2, 3, 4, 5, 6, 7, 8]
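A generator-based variant of the same recursion, offered as a sketch rather than a replacement. The name `iter_flatten` is made up here; it yields items one at a time instead of building intermediate lists, and it is applied row by row to the question's data so that each tuple becomes one flat row.

```python
def iter_flatten(item):
    """Yield the leaves of arbitrarily nested lists and tuples, in order."""
    if isinstance(item, (list, tuple)):
        for sub in item:
            yield from iter_flatten(sub)
    else:
        # Strings and numbers are treated as leaves and yielded unchanged.
        yield item


data = [
    ('apple', 19.0, ['gala', '14', 'fuji', '5', 'dawn', '3', 'taylor', '3']),
    ('pear', 35.0, ['anjou', '29', 'william', '6', 'concorde', '4']),
]

# Flatten each record separately so the rows stay distinct.
rows = [list(iter_flatten(record)) for record in data]
for row in rows:
    print(row)
```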

load text file then load different words into different lists

I have searched on strings, lists, append etc. but can't seem to handle this.
I have created some files on Android based on selections made in an app.
The output looks like this and is in a text file:
House 1,bwfront3,colorfront2,bwtilt3,colortilt3
House 2,bwfront6,colorfront6,bwtilt6,colortilt6
House 3,bwfront5,colorfront5,bwtilt5,colortilt5
House 4,bwfront4,colorfront4,bwtilt4,colortilt4
House 5,bwfront2,colorfront2,bwtilt2,colortilt2
The reason for the naming:
I have 5 houses; for each, the user first selects from 9 'bwfront..' pictures, then from the color images, and so on.
The exercise is to map different pictures to the 'house'.
I now wish to load the text file(s) and count how many times each of the different 'bwfront' values has been selected, etc. To clarify, the user selects four times per 'house'.
This will continue with all houses and types of pictures, but if any of you can get me started, I should be able to apply the solution to all my 23 files.
Does it make sense?

A possible way to parse such a file and count the different bwfronts:
import csv
from collections import Counter
def count_bwfronts():
    """Counts occurrences of different bwfronts in
    yourfile.txt. Returns a Counter object which
    maps different bwfront values to the number of their
    occurrences in the file."""
    with open('yourfile.txt', newline='') as f:
        reader = csv.reader(f, delimiter=",")
        counter = Counter()
        counter.update(row[1] for row in reader)
    return counter
if __name__ == "__main__":
    print(count_bwfronts())
As you might have guessed, each row taken from reader is just a list of the strings that were separated by commas in your input file. To do more complex calculations, you might want to rewrite the generator expression as a loop.
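One way the generator expression could be rewritten as a loop is sketched below, counting all four picture columns in a single pass. The column labels in `COLUMNS`, the function name `count_columns`, and the inline `SAMPLE` text (standing in for one of the files) are assumptions made for illustration.

```python
import csv
import io
from collections import Counter

# Stand-in for the contents of one input file.
SAMPLE = """House 1,bwfront3,colorfront2,bwtilt3,colortilt3
House 2,bwfront6,colorfront6,bwtilt6,colortilt6
House 3,bwfront5,colorfront5,bwtilt5,colortilt5
House 4,bwfront4,colorfront4,bwtilt4,colortilt4
House 5,bwfront2,colorfront2,bwtilt2,colortilt2
"""

# Hypothetical labels for the four columns that follow the house name.
COLUMNS = ('bwfront', 'colorfront', 'bwtilt', 'colortilt')


def count_columns(lines):
    """Return one Counter per picture column for an iterable of CSV lines."""
    counters = {column: Counter() for column in COLUMNS}
    for row in csv.reader(lines):
        if not row:
            continue  # skip blank lines
        # row[0] is the house name; the remaining fields are the selections.
        for column, value in zip(COLUMNS, row[1:]):
            counters[column][value] += 1
    return counters


counters = count_columns(io.StringIO(SAMPLE))
for column, counter in counters.items():
    print(column, counter.most_common())
```

To accumulate over several files, the same counters could be updated file by file instead of being created inside the function.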

import os
# First, create a dictionary for each column that maps each
# value (e.g. colorfront2) to a list of house names.
results = [{}, {}, {}, {}]
for filename in os.listdir('/wherever'):
    with open(os.path.join('/wherever', filename)) as f:
        s = f.read()
    for line in s.split('\n'):
        if not line.strip(): continue  # skip blank lines
        house, values = line.split(',', 1)
        values = values.split(',')
        assert len(values) == len(results)  # sanity check
        for value, result in zip(values, results):
            if value not in result:
                result[value] = []
            result[value].append(house)
# Then, do something with it -- e.g., show them in order
for i, result in enumerate(results):
    print('COLUMN %d' % i)
    def sortkey(pair): return len(pair[1])  # number of houses
    for (val, houses) in sorted(result.items(), key=sortkey, reverse=True):
        print(' %20s occurs for %d houses' % (val, len(houses)))

This will loop through the file, extract each part of a line, and store it in a list of 5-tuples; from there you can do whatever you need with the house/color/etc. This is just an example, because it is hard to determine exactly what you need from the script, but it should help get you started:
houses = open("houses.txt", "r")
house_list = []
for line in houses:
    # This assumes each line will always have 5 items separated by commas.
    # strip() drops the trailing newline so it does not end up in the last field.
    house, bwfront, colorfront, bwtilt, colortilt = line.strip().split(",")
    # This strips off the initial word (e.g. "bwfront") and just gives you the number
    house_num = house[6:]
    bwfront_num = bwfront[7:]
    colorfront_num = colorfront[10:]
    bwtilt_num = bwtilt[6:]
    colortilt_num = colortilt[9:]
    house_list.append((house_num, bwfront_num, colorfront_num, bwtilt_num, colortilt_num))
houses.close()
print(house_list)
Results in:
[('1', '3', '2', '3', '3'), ('2', '6', '6', '6', '6'), ('3', '5', '5', '5', '5'), ('4', '4', '4', '4', '4'), ('5', '2', '2', '2', '2')]
From there, you can do something like [h[1] for h in house_list] to get all of the bwfront numbers for each house, etc:
['3', '6', '5', '4', '2']
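The fixed slice offsets above depend on the exact length of each prefix. As an alternative sketch, a regular expression can split each field into its label and its number, so the offsets need not be counted by hand. The pattern `FIELD` and the helper names `split_field` and `parse_line` are hypothetical, and the pattern assumes every field is a label of letters followed by digits.

```python
import re

# Hypothetical pattern: a label (letters and spaces) followed by a number.
FIELD = re.compile(r'([A-Za-z ]+?)\s*(\d+)')


def split_field(field):
    """Split 'bwfront3' into ('bwfront', '3') and 'House 1' into ('House', '1')."""
    match = FIELD.fullmatch(field.strip())
    if match is None:
        raise ValueError('unexpected field: {!r}'.format(field))
    return match.group(1), match.group(2)


def parse_line(line):
    """Return only the numbers of one line, in column order."""
    return tuple(split_field(field)[1] for field in line.strip().split(','))


print(parse_line('House 1,bwfront3,colorfront2,bwtilt3,colortilt3\n'))
```

A field that does not fit the pattern raises a ValueError instead of silently producing a wrong slice.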
