In an attempt to become a better programmer I want to be able to write cleaner code, and one of the things that looks "messy" in the way I currently code is how I manage lists.
Typically I will start (at the beginning of the code) by defining the lists like this (below are completely arbitrary values I put into the lists) and then iterate the process many times throughout loops:
value1 = []
datapre = []
datapost = []

for i in range(100):
    value1.append('Name')
    value1.append('Number')
    datapre.append(13)
    datapre.append(16)
    datapost.append(25)
    datapost.append(28)
The above example only has 3 lists, however sometimes I need to use many different lists (like 50), which makes the code quite long, and I expect that good programmers do not actually work like this. So can anyone provide some tips on how you actually should store data?
The initialize-append way:
my_list = []
for i in range(10):
    my_list.append(i**2)
The list-comprehension way:
my_list = [i**2 for i in range(10)]
What you choose to use depends on many factors, the most important of which (imo) is what else is done with the values apart from populating the list. If they have to be used elsewhere too, a for loop is advised to avoid re-iterating.
Then again, lists might not be the best choice of data structure for your specific problem, but that is impossible to judge from the info you provide.
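If the real problem is juggling 50 separately named lists, one common option is to keep them all in a single dict of lists, so adding a 51st "list" is just a new key rather than a new variable. A minimal sketch, reusing the names from the question (the values are still arbitrary):

from collections import defaultdict

# Hypothetical sketch: one dict of lists instead of many separately named lists.
data = defaultdict(list)
for i in range(100):
    data['value1'].append('Name')
    data['datapre'].append(13)
    data['datapost'].append(25)

print(data['datapre'][:3])   # first three values collected under 'datapre'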
I'm using python with the binance.client wrapper. I'm gathering all of the BTC trade pairs from the exchange and wanting to create a simple dict with tradepair: price.
I have figured out a way to do this, but it seems clunky to me and takes a minute or so to run. I'm currently a programming student just getting started with Python as well as some other languages.
Is there a better way to do it than this?
def BTCPair():
    BTCPair = []
    BTCPrice = []
    BTCPairAndPrice = {}
    exchange_info = client.get_exchange_info()
    for s in exchange_info['symbols']:
        if 'BTC' in (s['symbol'])[-3:]:
            BTCPair.append(s['symbol'])
            BTCPrice.append(client.get_avg_price(symbol=s['symbol'])['price'])
    for i in range(len(BTCPair)):
        BTCPairAndPrice[BTCPair[i]] = BTCPrice[i]
    return BTCPairAndPrice
I don't understand why you are using two loops; one to put the data into lists, and another to convert those lists into a dictionary - why not just construct the dictionary directly?
You could just construct your dictionary directly using a comprehension:
BTCPairAndPrice = {
    s['symbol']: client.get_avg_price(symbol=s['symbol'])['price']
    for s in exchange_info['symbols']
    if 'BTC' in (s['symbol'])[-3:]
}
The way the dictionary is constructed is unlikely to have a big impact on performance, but not iterating through all the data twice should have an impact if there is a lot of data.
Also consider that contacting a web service is likely to take some time, so contacting the exchange might be the slowest part.
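If you want to check that, a rough sketch of a timing probe (the symbol 'ETHBTC' here is just a placeholder, not taken from your code):

import time

# Hypothetical check: time one price request to see whether network round-trips,
# rather than the dict construction, dominate the runtime.
start = time.time()
client.get_avg_price(symbol='ETHBTC')
print(time.time() - start, 'seconds for a single request')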
First of all, for i in range(len(BTCPair)) is an antipattern. You could instead iterate over those zipped together.
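For illustration, a sketch of what that could look like with the two lists from the original function, assuming they have already been filled:

# Hypothetical sketch: pair the two existing lists up instead of indexing by position.
BTCPairAndPrice = {pair: price for pair, price in zip(BTCPair, BTCPrice)}
# equivalently: BTCPairAndPrice = dict(zip(BTCPair, BTCPrice))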
But we don't actually need to do that either! Rather than creating two lists and then iterating over them to fill in your dictionary, you could create everything in one go with a dictionary comprehension. Also, a cleaner way to check the end of a string is endswith().
def btc_pair():
    symbols = client.get_exchange_info()['symbols']
    return {
        s['symbol']: client.get_avg_price(symbol=s['symbol'])['price']
        for s in symbols
        if s['symbol'].endswith('BTC')
    }
This probably will run a little bit faster, but I doubt that the dictionary creation itself is the real performance bottleneck in your code.
Please look at this piece of code:
sig_array = []
...
for i in range(0, 2):
    ...
    temp = []
    for k in range(0, len(sig)):
        # print(k)
        temp.append(downsample(sig[k], sampl, new_freq))
    sig_array.append(temp)
In other words, temp is a list of arrays (my downsample function, as its name may suggest, returns an array), and then the temps are aggregated into sig_array, so it is a list of lists of arrays!
My questions are: how do I deal with that (indexing, ...), and is there a simpler way to proceed? I want to generate lists of arrays in a loop, but how do I keep them in a data structure?
Thanks
Regarding indexing, you'd just refer to elements like sig_array[0], sig_array[1][2] or sig_array[3][0][2] etc.
Regarding any better data structures, it really just depends on your use case. As @smagnan says in the comments, are you using it for easily accessing data? Matrix processing? If so, have a look at numpy ndarrays. You say that you need it for big data on time series analysis; in that case, the pandas module will be quite helpful (more info).
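For instance, a sketch of what that could look like with numpy, assuming every downsampled signal has the same length:

import numpy as np

# Hypothetical sketch: stack the nested lists into one 3-D array for easier indexing.
sig_matrix = np.array(sig_array)            # shape: (2, len(sig), samples_per_signal)
first_run_third_signal = sig_matrix[0, 2]   # same data as sig_array[0][2]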
Also, as @Bazingaa says, you can make your code less verbose by using list comprehensions (more info):
sig_array = [[downsample(sig[i], sampl, new_freq) for i in range(len(sig))] for _ in range(2)]
With list comprehensions, it's best to read from the outside in, starting at the end. The for _ in range(2) will run twice (I've replaced your i with _ as I couldn't see you using it anywhere; if you need it, replace _ with a relevant variable name). In each iteration, it appends the result of the inner list comprehension to sig_array. Inside the inner listcomp, the result of the downsample() function is added to the temporary list for each iteration of the for loop.
This will have exactly the same output as your code, but is clearly way shorter :)
Many of us know that enumerate is used when you are in a for loop and need to know the index. However, it has its downsides. According to my tests with the timeit module, just using enumerate makes the code 2x slower, and adding a tuple assignment on top of that makes it up to 3x slower. These numbers may seem fast enough for most programmers, but people dealing with algorithms know that every bit of code you can optimize is a huge advantage. Now to my question.
An example of this usage would be the need to find the indexes of multiple elements in a list. Say there are two elements we need to find. The first two solutions that occur to me are these:
x, y = 0, 0
for ind, val in enumerate(lst):
    if x and y:
        break
    if val == "a":
        x = ind
    elif val == "b":
        y = ind
The solution above iterates the list, assigns the values, then breaks once the two are found.
x = lst.index("a")
y = lst.index("b")
This is another solution, which I didn't want to use because it seemed really naive. It iterates over the same list twice to find two elements, whereas the first solution does this in a single iteration. So in complexity terms, even though we make extra assignments in the first solution, it should be faster than the second one on larger lists. But my assumption failed.
Here is the code I tested the performance: https://codeshare.io/XfvGA
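For reference, a minimal sketch of the kind of comparison described (the list contents and the positions of 'a' and 'b' below are assumptions, not the exact codeshare snippet):

import timeit

# Hypothetical setup: 'a' and 'b' sit near the end of a large list.
setup = "lst = list('xyz' * 10000) + ['a', 'b']"

enumerate_loop = '''
x, y = 0, 0
for ind, val in enumerate(lst):
    if x and y:
        break
    if val == "a":
        x = ind
    elif val == "b":
        y = ind
'''

index_calls = '''
x = lst.index("a")
y = lst.index("b")
'''

print(timeit.timeit(enumerate_loop, setup=setup, number=100))
print(timeit.timeit(index_calls, setup=setup, number=100))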
The second solution was 2x to 10x faster than the first one, depending on the position of these two elements. There are several possible explanations:
There is an optimization in index() method that I am unaware of.
Lower level assignments being made in index() method. Possible use of C++ code.
The conditions and extra assignments in the first solution, makes it slower than expected.
Even these reasons fall short of explaining the speed of iterating the list twice over iterating it once. Though languages differ a lot in how fast they run code, the iteration process itself is independent of the programming language: if you need to check a million elements, you still have to check a million elements (an example being map() not being much faster than using a loop to change values).
So, though I need you to examine the cases I presented, to clarify what is being asked the question can be put like this: we know that Python's for loop is actually a while loop running in the background (possibly in C?). This means the index is already being stored and incremented somewhere in memory. If there were a way to access it, this would eliminate the cost of calling and unpacking enumerate. My question is:
Does such a way exist? If not, could one be made (why, or why not)?
The sources I used for more information on the subject:
Python speed
Python objects time complexity
Performance tips for Python
I don't think that enumerate is the problem; to prove this you can do:
x, y = 0, 0
for val in a:
    if x and y:
        break
    if val == "a":
        x = val
    elif val == "b":
        y = val
This doesn't do the same thing you wanted in the first place (you don't get the index), but if you measure it with timeit, you will find that the difference is not so significant, meaning that enumerate is not the source of the problem (in my case it was 0.185 vs 0.155 when running your example, so it is faster, but the second solution got 0.055 on my computer).
The reason that lst.index is faster is that it is implemented in C.
You can see it's source code here:
https://svn.python.org/projects/python/trunk/Objects/listobject.c
the index function is called listindex in this file and is defined like this:
static PyObject *
listindex(PyListObject *self, PyObject *args)
(I couldn't find a way to add a link directly to the function.)
You are trying to be un-Pythonic, which isn't going to end terribly well for you. If you really need to have that iterator count information available, there is a well-known and optimized way to do that: enumerate(). If you need to find an item in a list, there is a well-known and optimized way to do that: lst.index(). As DorElias showed above/below, enumerate is not the problem, it's that you're attempting to reinvent the wheel with the rest of your for loop. enumerate is going to be the best-supported (clearest, fastest, etc.) way to maintain an iteration count in every situation where an iteration count is actually the thing you need.
I have an Excel CSV file with employee records in it. Something like this:
mail,first_name,surname,employee_id,manager_id,telephone_number
blah@blah.com,john,smith,503422,503423,+65(2)3423-2433
foo@blah.com,george,brown,503097,503098,+65(2)3423-9782
....
1. Import CSV into nested dictionary
I'm using DictReader to put this into a nested dictionary:
import csv
gd_extract = csv.DictReader(open('filename 20100331 original.csv'), dialect='excel')
employees = dict([(row['employee_id'], row) for row in gd_extract])
Is the above the proper way to do it - it does work, but is it the Right Way? Something more efficient? Also, the funny thing is, in IDLE, if I try to print out "employees" at the shell, it seems to cause IDLE to crash (there's approximately 1051 rows).
2. Remove employee_id from inner dict
The second issue: I'm putting it into a dictionary indexed by employee_id, with the value as a nested dictionary of all the values - however, employee_id is also a key:value pair inside the nested dictionary, which is a bit redundant. Is there any way to exclude it from the inner dictionary?
3. Manipulate data in comprehension
Thirdly, we need do some manipulations to the imported data - for example, all the phone numbers are in the wrong format, so we need to do some regex there. Also, we need to convert manager_id to an actual manager's name, and their email address. Most managers are in the same file, while others are in an external_contractors CSV, which is similar but not quite the same format - I can import that to a separate dict though.
Are these two items things that can be done within the single list comprehension, or should I use a for loop? Or do multiple comprehensions work? (Sample code would be really awesome here.) Or is there a smarter way in Python to do it?
Cheers,
Victor
Your first part has one simple issue (which might not even be an issue). You don't handle key collisions at all (unless you intend to simply overwrite).
>>> dict([('a', 'b'), ('a', 'c')])
{'a': 'c'}
If you're guaranteed that employee_id is unique, there isn't an issue though.
2) Sure you can exclude it, but there's no real harm done. Actually, especially in Python, if employee_id is a string or int (or some other primitive), the inner dict's value and the outer key reference the same object; they both point to the same spot in memory. The only duplication is in the reference (which isn't that big). If you're worried about memory consumption, you probably don't have to be.
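If you do want to drop it anyway, here is a sketch of one way (assuming Python 2.7+ dict comprehensions and a freshly opened gd_extract reader):

# Hypothetical sketch: key by employee_id while leaving that field out of the inner dict.
employees = {
    row['employee_id']: {k: v for k, v in row.items() if k != 'employee_id'}
    for row in gd_extract
}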
3) Don't try to do too much in one list comprehension. Just use a for loop after the first list comprehension.
To sum it all up, it sounds like you're really worried about the performance of iterating over the loop twice. Don't worry about performance initially. Performance problems come from algorithm problems, not specific language constructs like for loops vs list comprehensions.
If you're familiar with Big O notation, the list comprehension and for loop after (if you decide to do that) both have a Big O of O(n). Add them together and you get O(2n), but as we know from Big O notation, we can simplify that to O(n). I've over simplified a lot here, but the point is, you really don't need to worry.
If there are performance concerns, raise them after you've written the code and prove them to yourself with a code profiler.
response to comments
As for your #2 reply, Python really doesn't have a lot of mechanisms for making one-liners cute and extra snazzy. It's meant to force you into simply writing the code out vs sticking it all in one line. That being said, it's still possible to do quite a bit of work in one line. My suggestion is to not worry about how much code you can stick in one line. Python looks a lot more beautiful (IMO) when it's written out, not jammed into one line.
As for your #1 reply, you could try something like this:
employees = {}
for row in gd_extract:
    if row['employee_id'] in employees:
        ... handle duplicates in employees dictionary ...
    else:
        employees[row['employee_id']] = row
As for your #3 reply, not sure what you're looking for and what about the telephone numbers you'd like to fix, but... this may give you a start:
import re

retelephone = re.compile(r'[-\(\)\s]')  # remove dashes, open/close parens, and spaces
for empid, row in employees.iteritems():
    # re.sub returns a new string, so assign it back; the CSV column is telephone_number
    row['telephone_number'] = retelephone.sub('', row['telephone_number'])
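For the manager_id part of #3, a rough sketch along the same lines (external_contractors here stands for a dict you'd build from the other CSV; manager_name and manager_mail are made-up keys):

# Hypothetical sketch: resolve manager details with plain dict lookups.
for empid, row in employees.iteritems():
    manager = employees.get(row['manager_id']) or external_contractors.get(row['manager_id'])
    if manager is not None:
        row['manager_name'] = manager['first_name'] + ' ' + manager['surname']
        row['manager_mail'] = manager['mail']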
I'm working through some tutorials on Python and am at a position where I am trying to decide what data type/structure to use in a certain situation.
I'm not clear on the differences between arrays, lists, dictionaries and tuples.
How do you decide which one is appropriate - my current understanding doesn't let me distinguish between them at all - they seem to be the same thing.
What are the benefits/typical use cases for each one?
How do you decide which data type to use? Easy:
You look at which are available and choose the one that does what you want. And if there isn't one, you make one.
In this case a dict is a pretty obvious solution.
Tuples first. These are list-like things that cannot be modified. Because the contents of a tuple cannot change, you can use a tuple as a key in a dictionary. That's the most useful place for them in my opinion. For instance if you have a list like item = ["Ford pickup", 1993, 9995] and you want to make a little in-memory database with the prices you might try something like:
ikey = (item[0], item[1])
idata = item[2]
db[ikey] = idata
Lists seem to be like arrays or vectors in other programming languages and are usually used for the same types of things in Python. However, they are more flexible in that you can put different types of things into the same list. Generally, they are the most flexible data structure, since you can put a whole list into a single element of another list, but for real data crunching they may not be efficient enough.
a = [1,"fred",7.3]
b = []
b.append(1)
b[0] = "fred"
b.append(a) # now the second element of b is the whole list a
Dictionaries are often used a lot like lists, but now you can use any immutable thing as the index to the dictionary. However, unlike lists, dictionaries don't have a natural order and can't be sorted in place. Of course you can create your own class that incorporates a sorted list and a dictionary in order to make a dict behave like an Ordered Dictionary. There are examples on the Python Cookbook site.
c = {}
d = ("ford pickup",1993)
c[d] = 9995
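As an aside, collections.OrderedDict (Python 2.7+) already remembers insertion order, which covers many of those "ordered dictionary" needs without a custom class; a quick sketch (the second entry is made up):

from collections import OrderedDict

# Hypothetical sketch: an insertion-ordered mapping with tuple keys.
prices = OrderedDict()
prices[("ford pickup", 1993)] = 9995
prices[("honda civic", 1998)] = 4500
list(prices)   # [('ford pickup', 1993), ('honda civic', 1998)] - keys in insertion order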
Arrays are getting closer to the bit level for when you are doing heavy duty data crunching and you don't want the frills of lists or dictionaries. They are not often used outside of scientific applications. Leave these until you know for sure that you need them.
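If what's meant is the standard-library array module (as opposed to numpy arrays), a quick sketch with arbitrary values:

from array import array

# Hypothetical sketch: a typed array of C doubles, stored more compactly than a list.
samples = array('d', [1.0, 2.5, 3.75])
samples.append(4.0)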
Lists and Dicts are the real workhorses of Python data storage.
The best type for counting elements like this is usually defaultdict:
from collections import defaultdict
s = 'asdhbaklfbdkabhvsdybvailybvdaklybdfklabhdvhba'
d = defaultdict(int)
for c in s:
    d[c] += 1
print d['a'] # prints 7
Do you really require speed/efficiency? Then go with a pure and simple dict.
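A sketch of the same count with a plain dict, using dict.get to handle missing keys:

s = 'asdhbaklfbdkabhvsdybvailybvdaklybdfklabhdvhba'
d = {}
for c in s:
    d[c] = d.get(c, 0) + 1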
Personal:
I mostly work with lists and dictionaries.
It seems that this satisfies most cases.
Sometimes:
Tuples can be helpful if you want to pair/match elements. Besides that, I don't really use them.
However:
I write high-level scripts that don't need to drill down into the core "efficiency" where every byte and every memory/nanosecond matters. I don't believe most people need to drill this deep.