How to understand this python code? - python

This code is from the book Learning Python, and it is used to sum the columns of a comma-separated text file. I really can't understand lines 7, 8 & 9.
Thanks for the help. Here is the code:
filename='data.txt'
sums={}
for line in open(filename):
cols=line.split(',')
nums=[int(col) for col in cols]
for(ix, num) in enumerate(nums):
sums[ix]=sums.get(ix, 0)+num
for key in sorted(sums):
print(key, '=', sums[key])

It looks like the input file contains rows of comma-separated integers. This program prints out the sum of each column.
You've mixed up the indentation, which changes the meaning of the program, and it wasn't terribly nicely written to begin with. Here it is with lots of commenting:
filename='data.txt'    # name of the text file
sums = {}              # dictionary of { column: sum }
                       # not initialized, because you don't know how many columns there are
# for each line in the input file,
for line in open(filename):
    # split the line by commas, resulting in a list of strings
    cols = line.split(',')
    # convert each string to an integer, resulting in a list of integers
    nums = [int(col) for col in cols]
    # Enumerating a list numbers the items - ie,
    #   enumerate([7,8,9]) -> [(0,7), (1,8), (2,9)]
    # It's used here to figure out which column each value gets added to
    for ix, num in enumerate(nums):
        # sums.get(index, defaultvalue) is the same as sums[index] IF sums already has a value for index
        # if not, sums[index] throws an error but .get returns defaultvalue
        # So this gets a running sum for the column if it exists, else 0;
        # then we add the new value and store it back to sums.
        sums[ix] = sums.get(ix, 0) + num
# Go through the sums in ascending order by column -
#   this is necessary because dictionaries have no inherent ordering
for key in sorted(sums):
    # and for each, print the column# and sum
    print(key, '=', sums[key])
I would write it a bit differently; something like
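from collections import Counter
sums = Counter()
with open('data.txt') as inf:
    for line in inf:
        values = [int(v) for v in line.split(',')]
        # update() only adds values when given a mapping, so wrap enumerate() in dict();
        # a bare iterable of (index, value) tuples would be counted as keys instead
        sums.update(dict(enumerate(values)))
for col, total in sorted(sums.items()):
    print("{}: {}".format(col, total))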

Assuming you understand lines 1-6…
Line 7:
sums[ix]=sums.get(ix, 0)+num
sums.get(ix, 0) is the same as sums[ix], except that if ix not in sums it returns 0 instead. So, this is just like sums[ix] += num, except that it first sets the value to 0 if this is the first time you've seen ix.
So, it should be clear that by the end of this loop, sums[ix] is going to have the sum of all values in column ix.
This is a silly way to do this. As mgilson points out, you could just use defaultdict so you don't need that extra logic. Or, even more simply, you could just use a list instead of a dict, because this (indexing by sequential small numbers) is exactly what lists are for…
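A minimal sketch of the two alternatives mentioned above, assuming the rows are already available as strings. The function names sum_columns_defaultdict and sum_columns_list are made up for illustration and are not from the book:

```python
from collections import defaultdict

def sum_columns_defaultdict(lines):
    # defaultdict(int) supplies 0 for a column seen for the first time
    sums = defaultdict(int)
    for line in lines:
        for ix, col in enumerate(line.split(',')):
            sums[ix] += int(col)
    return dict(sums)

def sum_columns_list(lines):
    # a plain list works because column indexes are 0, 1, 2, ...
    sums = []
    for line in lines:
        for ix, col in enumerate(line.split(',')):
            if ix == len(sums):
                sums.append(0)  # first time this column appears
            sums[ix] += int(col)
    return sums

rows = ["1,2,3", "4,5,6"]
print(sum_columns_defaultdict(rows))  # {0: 5, 1: 7, 2: 9}
print(sum_columns_list(rows))         # [5, 7, 9]
```

With the list version there is nothing to sort afterwards, since position in the list is the column number.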
Line 8:
for key in sorted(sums):
First, you can iterate over any dict as if it were a list or other iterable, and it has the same effect as iterating over sums.keys(). So, if sums looks like { 0: 4, 1: 6, 2: 3 }, you're going to iterate over 0, 1, 2.
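Except that dicts aren't sorted by key. Before Python 3.7 the iteration order was arbitrary, and since 3.7 it is insertion order, which need not be ascending. So in general you may get 0, 1, 2, or you may get 1, 0, 2, or any other order.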
So, sorted(sums) just returns a copy of that list of keys in sorted order, guaranteeing that you'll get 0, 1, 2 in that order.
Again, this is silly, because if you just used a list in the first place, you'd get things in order.
Line 9:
print(key, '=', sums[key])
This one should be obvious. If key iterates over 0, 1, 2, then this is going to print 0 = 4, 1 = 6, 2 = 3.
So, in other words, it's printing out each column number, together with the sum of all values in that column.

Related

Finding first time value occurs in an array when you don't know what it is

I have a very long array (over 2 million values) with repeating values. It looks something like this:
array = [1,1,1,1,......,2,2,2.....3,3,3.....]
With a bunch of different values. I want to create individual arrays for each group of points, i.e. an array for the ones, an array for the twos, and so forth. So something that would look like:
array1 = [1,1,1,1...]
array2 = [2,2,2,2.....]
array3 = [3,3,3,3....]
.
.
.
.
None of the values occurs the same number of times, however, and I don't know how many times each value occurs. Any advice?
Assuming that repeated values are grouped together (otherwise you simply need to sort the list), you can create a nested list (rather than a new list for every different value) using itertools.groupby:
from itertools import groupby
array = [1,1,1,1,2,2,2,3,3]
[list(v) for k,v in groupby(array)]
[[1, 1, 1, 1], [2, 2, 2], [3, 3]]
Note that this is more convenient than creating n new lists dynamically, as shown for instance in this post: you have no idea how many lists will be created, and you would have to refer to each list by its name rather than simply indexing a nested list.
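If the values are not already grouped, a small sketch along these lines may help. It sorts first and keys each group by its value; the sample data and the name groups are only illustrative:

```python
from itertools import groupby

array = [3, 1, 2, 1, 3, 1]  # deliberately not grouped

# groupby only merges adjacent equal items, so sort first
groups = {key: list(run) for key, run in groupby(sorted(array))}

print(groups)     # {1: [1, 1, 1], 2: [2], 3: [3, 3]}
print(groups[3])  # [3, 3]
```

Looking a group up by its value avoids having to know its position in the nested list.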
You can use bisect.bisect_left to find the index of the first occurrence of each element. This works only if the list is sorted:
from bisect import bisect_left
def count_values(l, values=None):
    if values is None:
        values = range(1, l[-1]+1) # Default assume list is [1..n]
    counts = {}
    consumed = 0
    val_iter = iter(values)
    curr_value = next(val_iter)
    next_value = next(val_iter)
    while True:
        ind = bisect_left(l, next_value, consumed)
        counts[curr_value] = ind - consumed
        consumed = ind
        try:
            curr_value, next_value = next_value, next(val_iter)
        except StopIteration:
            break
    counts[next_value] = len(l) - consumed
    return counts
l = [1,1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2,3,3,3,3,3,3,3]
print(count_values(l))
# {1: 9, 2: 8, 3: 7}
This avoids scanning the entire list, trading that for a binary search for each value. Expect this to be more performant where there are very many of each element, and less performant where there are few of each element.
Well, it seems to be wasteful and redundant to create all those arrays, each of which just stores repeating values.
You might want to just create a dictionary of unique values and their respective counts.
From this dictionary, you can always selectively create any of the individual arrays easily, whenever you want, and whichever particular one you want.
To create such a dictionary, you can use:
from collections import Counter
my_counts_dict = Counter(my_array)
Once you have this dict, you can get the number of 23's, for example, with my_counts_dict[23].
And if this returns 200, you can create your list of 200 23's with:
my_list23 = [23]*200
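As a rough, self-contained illustration of the idea above (the sample data and the helper name rebuild are assumptions, not part of the original answer):

```python
from collections import Counter

my_array = [1, 1, 1, 2, 2, 3]  # stand-in for the real 2-million-item array
my_counts_dict = Counter(my_array)

def rebuild(value):
    """Return the list of repeated `value`s, rebuilt from its count."""
    return [value] * my_counts_dict[value]

print(my_counts_dict[1])  # 3
print(rebuild(2))         # [2, 2]
print(rebuild(23))        # [] (a Counter returns 0 for missing keys)
```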
Use this code:
<?php
$arrayName = array(2,2,5,1,1,1,2,3,3,3,4,5,4,5,4,6,6,6,7,8,9,7,8,9,7,8,9);
$arr = array();
foreach ($arrayName as $value) {
    $arr[$value][] = $value;
}
sort($arr);
print_r($arr);
?>
Solution with no helper functions:
array = [1,1,2,2,2,3,4]
result = [[array[0]]]
for i in array[1:]:
    if i == result[-1][-1]:
        result[-1].append(i)
    else:
        result.append([i])
print(result)
# [[1, 1], [2, 2, 2], [3], [4]]

How to sort a string of numbers with time involved?

What I have is following code snippet:
a = ["2013-11-20,29,0,0", "2013-11-20,3,0,2"]
where the second field of each string is the index of the 5-minute interval within the day, and the last two fields are counts.
I want to sort this by the first two elements. But when I use sort, a[0] always comes first. In fact, I want a[1] to come first. How should I do this?
I have tried to use the key argument in sort(), for example a.sort(key=int). But then an error occurred saying:
ValueError: invalid literal for int() with base 10: '2013-11-20,29,0,0'
Make a key function that returns a tuple of values you want to sort on.
import datetime
a=["2013-11-20,29,0,0","2013-11-20,3,0,2"]
def f(thing):
    # separate the values
    date_str, b, c, d = thing.strip().split(',')
    # turn the first one into a datetime.date
    # (separate names, so the day does not overwrite the fourth field d)
    year, month, day = map(int, date_str.split('-'))
    date = datetime.date(year, month, day)
    # turn the others into ints
    b, c, d = map(int, (b, c, d))
    # return the values in order of precedence
    return (date, b, c, d)
Then use it to sort the list
a.sort(key = f)
Your issue is that each item in your list is a string. When strings are compared, they are compared character by character. In your example all characters are the same until after the first comma. After the comma, the next characters are '2' and '3'. As '3' > '2', the string containing 3 sorts after the one containing 29, which is not what you want. I assume you want 29 to be greater than 3.
In this particular case, you could just reverse the sorting
a.sort()
a.reverse()
But as you probably have a list with more items, this will not work... The only solution I see is to split each item in your list at the comma ','. Then convert the items which should be considered as integers to int. For example you can do it like this:
a=["2013-11-20,29,0,0","2013-11-20,3,0,2"]
a_temp=[]
for item in a:
    splitstr = item.split(',')
    i=0
    temp = []
    for elem in splitstr:
        if i>0:
            temp_i=int(elem)
        else:
            temp_i=elem
        temp.append(temp_i)
        i+=1
    a_temp.append(temp)
Your temporary list looks now like this:
[['2013-11-20', 29, 0, 0], ['2013-11-20', 3, 0, 2]]
Then sort it by the position as you wish. This you can do for example like this:
from operator import itemgetter
a_temp_sorted=sorted(a_temp, key=itemgetter(0,1,2,3))
By using the itemgetter you can define in what order you want to sort. Here it is sorted at first by the element 0, then 1, etc... but you can change the order. a_temp_sorted now looks like:
[['2013-11-20', 3, 0, 2], ['2013-11-20', 29, 0, 0]]
Now you can convert your result again to a string. This you can do like this:
a_sorted=[]
for item in a_temp_sorted:
    newstring=''
    i=0
    for elem in item:
        if i>0:
            temp_i=str(elem)
            newstring+=','+temp_i
        else:
            newstring+=elem
        i=1
    a_sorted.append(newstring)
a_sorted is now your sorted version of your source a. It now looks like this:
['2013-11-20,3,0,2', '2013-11-20,29,0,0']
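The split, convert, sort and re-join steps can also be folded into a single key function, so the strings themselves never need to be rebuilt. A sketch, assuming the dates are always zero-padded YYYY-MM-DD (so they sort correctly as text); sort_key is a hypothetical helper name:

```python
def sort_key(record):
    """Hypothetical helper: keep the date as text, compare the rest as ints."""
    date, *rest = record.split(',')
    return (date, *map(int, rest))

a = ["2013-11-20,29,0,0", "2013-11-20,3,0,2"]
a_sorted = sorted(a, key=sort_key)
print(a_sorted)  # ['2013-11-20,3,0,2', '2013-11-20,29,0,0']
```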

Unknown error on PySpark map + broadcast

I have a big group of tuples with tuple[0] = integer and tuple[1] = list of integers (resulting from a groupBy). I call the value tuple[0] key for simplicity.
The values inside the lists tuple[1] can be eventually other keys.
For key = n, all elements of its list are greater than n, sorted, and distinct.
In the problem I am working on, I need to find the number of common elements in the following way:
0, [1,2]
1, [3,4,5]
2, [3,7,8]
.....
list of values of key 0:
1: [3,4,5]
2: [3,7,8]
common_elements between list of 1 and list of 2: 3 -> len(common_elements) = 1
Then I apply the same for keys 1, 2 etc, so:
list of values of 1:
3: ....
4: ....
5: ....
The sequential script I wrote is based on pandas DataFrame df, with the first column v as list of 'keys' (as index = True) and the second column n as list of list of values:
for i in df.v: #iterate each value
    for j in df.n[i]: #iterate within the list
        common_values = set(df.n[i]).intersection(df.n[j])
        if len(common_values) > 0:
            return len(common_values)
Since it is a big dataset, I'm trying to write a parallelized version with PySpark.
df.A #column of integers
df.B #column of integers
val_colA = sc.parallelize(df.A)
val_colB = sc.parallelize(df.B)
n_values = val_colA.zip(val_colB).groupByKey().mapValues(sorted) # RDD -> n_values[0] will be the key, n_values[1] is the list of values
n_values_broadcast = sc.broadcast(n_values.collectAsMap()) #read only dictionary
def f(element):
    for i in element[1]: #iterating the values of "key" element[0]
        common_values = set(element[1]).intersection(n_values_broadcast.value[i])
        if len(common_values) > 0:
            return len(common_values)
collection = n_values.map(f).collect()
The program fails after a few seconds with an error like KeyError: 665, but does not give any more specific reason for the failure.
I'm a Spark beginner, so I'm not sure whether this is the correct approach (should I consider foreach or mapPartitions instead?) and especially where the error is.
Thanks for the help.
The error is actually pretty clear, and it is not Spark specific. You are accessing a Python dict with __getitem__ ([]):
n_values_broadcast.value[i]
and if key is missing in the dictionary you'll get KeyError. Use get method instead:
n_values_broadcast.value.get(i, [])
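The difference can be reproduced without Spark. In this sketch a plain dict stands in for n_values_broadcast.value, and the helper name neighbours is made up:

```python
# Plain dict standing in for the broadcast dictionary in the question
lookup = {0: [1, 2], 1: [3, 4, 5], 2: [3, 7, 8]}

def neighbours(key):
    # .get() falls back to an empty list when the key was never a group key
    return lookup.get(key, [])

print(neighbours(1))    # [3, 4, 5]
print(neighbours(665))  # []

try:
    lookup[665]  # plain indexing raises instead
except KeyError as exc:
    print("KeyError:", exc)  # KeyError: 665
```

A value such as 665 can appear inside some list without ever being a key itself, which is why the lookup has to tolerate missing keys.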

Python - Iterating and Replacing List Index via enumerate()

I have a python script that imports a CSV file and based on the file imported, I have a list of the indexes of the file.
I am trying to match the indexes in FILESTRUCT to the CSV file and then replace the data in the column with newly generated data. Here is a code snippet:
This is just a parsed CSV file returned from my fileParser method:
PARSED = fileParser()
This is a list of CSV column positions:
FILESTRUCT = [6,7,8,9,47]
This is the script that is in question:
def deID(PARSED, FILESTRUCT):
    for item in PARSED:
        for idx, lis in enumerate(item):
            if idx == FILESTRUCT[0]:
                lis = dataGen.firstName()
            elif idx == FILESTRUCT[1]:
                lis = dataGen.lastName()
            elif idx == FILESTRUCT[2]:
                lis = dataGen.email()
            elif idx == FILESTRUCT[3]:
                lis = dataGen.empid()
            elif idx == FILESTRUCT[4]:
                lis = dataGen.ssnGen()
            else:
                continue
    return(PARSED)
I have verified that it is correctly matching the indices (idx) with the integers in FILESTRUCT by adding a print statement at the end of each if statement. That works perfectly.
The problem is that return(PARSED) does not return the newly generated values; instead, it returns the original PARSED input values. I assume that I am probably messing something up with how I use the enumerate method in my second loop, but I do not understand the enumerate method well enough to really know what I am messing up here.
You can use
item[idx] = dataGen.firstName()
to modify the underlying item. The reason is that enumerate() yields (index, value) tuples; assigning to lis only rebinds that local name and never touches the list you passed in.
Given your example above you may not even need enumerate, because you're not parsing the lis at all. So you could also just do
for i in range(len(item)):
    # your if .. elif statements go here ...
    item[i] = dataGen.firstName()
On a side-note, the elif statements in your code will become unwieldy once you start adding more conditions and columns. Maybe consider making FILESTRUCT a dictionary like:
FILESTRUCT = {
    6: dataGen.firstName,
    7: dataGen.lastName,
    ....
}
...
for idx in range(len(item)):
    if idx in FILESTRUCT:
        item[idx] = FILESTRUCT[idx]()
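A runnable sketch of the dictionary approach. Since dataGen is not shown in the question, fake_first_name and fake_last_name are hypothetical stand-ins, and the column indexes are shrunk to fit a three-column example:

```python
# Hypothetical stand-ins for the dataGen functions in the question
def fake_first_name():
    return "FIRST"

def fake_last_name():
    return "LAST"

# column index -> generator function (indexes chosen for this small example)
FILESTRUCT = {1: fake_first_name, 2: fake_last_name}

def de_identify(rows):
    for item in rows:
        for idx in range(len(item)):
            if idx in FILESTRUCT:
                item[idx] = FILESTRUCT[idx]()  # assign through the index
    return rows

parsed = [["id1", "Ann", "Lee"], ["id2", "Bob", "Ray"]]
print(de_identify(parsed))
# [['id1', 'FIRST', 'LAST'], ['id2', 'FIRST', 'LAST']]
```

Adding another column then only means adding one entry to the dictionary.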
So PARSED is an iterable, and item is an element of it and is also an iterable, and you want to make changes to PARSED by changing elements of item.
So let's do a test.
a = [1, 2, 3]
print('Before:')
print(a)
for i, e in enumerate(a):
    e += 10
print('After:')
print(a)
for e in a:
    e += 10
print('Again:')
print(a)
a[0] += 10
print('Finally:')
print(a)
The results are:
Before:
[1, 2, 3]
After:
[1, 2, 3]
Again:
[1, 2, 3]
Finally:
[11, 2, 3]
And we see, a is not changed by changing the enumerated elements.
You aren't returning a changed variable. You never change the variable PARSED. Rather, make another variable, build it up as you loop through PARSED, and then return your new list.
You can't change the values in a loop like that. It's kind of like expecting this to print all x's:
demo_data = "A string with some words"
for letter in demo_data:
    letter = "x"
print(demo_data)
It won't; it will print: "A string with some words"
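One way to get the changed values is to build a new object instead of rebinding the loop variable. A sketch under that assumption; mask_column is a made-up helper and the sample rows are illustrative:

```python
demo_data = "A string with some words"

# Strings are immutable, so build a new one instead of rebinding `letter`
all_x = "".join("x" for letter in demo_data)

# For a list of lists, build new rows; `mask_column` is a made-up helper
def mask_column(rows, column, replacement):
    return [
        [replacement if idx == column else value for idx, value in enumerate(row)]
        for row in rows
    ]

rows = [["a", "b"], ["c", "d"]]
masked = mask_column(rows, 1, "x")
print(all_x == "x" * len(demo_data))  # True
print(masked)  # [['a', 'x'], ['c', 'x']]
print(rows)    # [['a', 'b'], ['c', 'd']]  (original untouched)
```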

Python Values in Lists

I am using Python 3.0 to write a program. In this program I deal a lot with lists which I haven't used very much in Python.
I am trying to write several if statements about these lists, and I would like to know how to look at just a specific value in the list. I also would like to be informed of how one would find the placement of a value in the list and input that in an if statement.
Here is some code to better explain that:
count = list.count(1)
if count > 1
(This is where I would like to have it look at where the 1 is that the count is finding)
Thank You!
Check out the documentation on sequence types and list methods.
To look at a specific element in the list you use its index:
>>> x = [4, 2, 1, 0, 1, 2]
>>> x[3]
0
To find the index of a specific value, use list.index():
>>> x.index(1)
2
Some more information about exactly what you are trying to do would be helpful, but it might be helpful to use a list comprehension to get the indices of all elements you are interested in, for example:
>>> [i for i, v in enumerate(x) if v == 1]
[2, 4]
You could then do something like this:
ones = [i for i, v in enumerate(your_list) if v == 1]
if len(ones) > 1:
    # each element in ones is an index in your_list where the value is 1
    pass
Also, naming a variable list is a bad idea because it conflicts with the built-in list type.
edit: In your example you use your_list.count(1) > 1, this will only be true if there are two or more occurrences of 1 in the list. If you just want to see if 1 is in the list you should use 1 in your_list instead of using list.count().
You can use list.index() to find elements in the list besides the first one, but you would need to take a slice of the list starting from one element after the previous match, for example:
your_list = [4, 2, 1, 0, 1, 2]
i = -1
while True:
    try:
        i = your_list[i+1:].index(1) + i + 1
        print("Found 1 at index", i)
    except ValueError:
        break
This should give the following output:
Found 1 at index 2
Found 1 at index 4
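The slicing can be avoided because list.index() also accepts an optional start position. A small sketch of the same search using it; find_all is a hypothetical helper name:

```python
your_list = [4, 2, 1, 0, 1, 2]

def find_all(values, target):
    """Return every index at which `target` occurs in `values`."""
    positions = []
    start = 0
    while True:
        try:
            # the second argument tells index() where to start searching
            found = values.index(target, start)
        except ValueError:
            return positions
        positions.append(found)
        start = found + 1

print(find_all(your_list, 1))  # [2, 4]
```

This version returns real indexes directly, so no offset arithmetic is needed and no copies of the list are made.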
First off, I would strongly suggest reading through a beginner’s tutorial on lists and other data structures in Python: I would recommend starting with Chapter 3 of Dive Into Python, which goes through the native data structures in a good amount of detail.
To find the position of an item in a list, you have two main options, both using the index method. First off, checking beforehand:
numbers = [2, 3, 17, 1, 42]
if 1 in numbers:
    index = numbers.index(1)
    # Do something interesting
Your other option is to catch the ValueError thrown by index:
numbers = [2, 3, 17, 1, 42]
try:
    index = numbers.index(1)
except ValueError:
    # The number isn't here
    pass
else:
    # Do something interesting
    pass
One word of caution: avoid naming your lists list: quite aside from not being very informative, it’ll shadow Python’s native definition of list as a type, and probably cause you some very painful headaches later on.
You can find the index of the element like this:
idx = lst.index(1)
And then access the element like this:
e = lst[idx]
If what you want is the next element:
n = lst[idx+1]
Now, you have to be careful - what happens if the element is not in the list? A way to handle that case would be:
try:
    idx = lst.index(1)
    n = lst[idx+1]
except ValueError:
    # do something if the element is not in the list
    pass
list.index(x)
Return the index in the list of the first item whose value is x. It is an error if there is no such item.
--
In the docs you can find some more useful functions on lists: http://docs.python.org/tutorial/datastructures.html#more-on-lists
--
Added suggestion after your comment: Perhaps this is more helpful:
for idx, value in enumerate(your_list):
    # `idx` will contain the index of the item and `value` will contain the value at index `idx`
    pass
