I have a Excel CSV files with employee records in them. Something like this:
mail,first_name,surname,employee_id,manager_id,telephone_number
blah#blah.com,john,smith,503422,503423,+65(2)3423-2433
foo#blah.com,george,brown,503097,503098,+65(2)3423-9782
....
I'm using DictReader to put this into a nested dictionary:
import csv
gd_extract = csv.DictReader(open('filename 20100331 original.csv'), dialect='excel')
employees = dict([(row['employee_id'], row) for row in gp_extract])
Is the above the proper way to do it - it does work, but is it the Right Way? Something more efficient? Also, the funny thing is, in IDLE, if I try to print out "employees" at the shell, it seems to cause IDLE to crash (there's approximately 1051 rows).
2. Remove employee_id from inner dict
The second issue issue, I'm putting it into a dictionary indexed by employee_id, with the value as a nested dictionary of all the values - however, employee_id is also a key:value inside the nested dictionary, which is a bit redundant? Is there any way to exclude it from the inner dictionary?
3. Manipulate data in comprehension
Thirdly, we need do some manipulations to the imported data - for example, all the phone numbers are in the wrong format, so we need to do some regex there. Also, we need to convert manager_id to an actual manager's name, and their email address. Most managers are in the same file, while others are in an external_contractors CSV, which is similar but not quite the same format - I can import that to a separate dict though.
Are these two items things that can be done within the single list comprehension, or should I use a for loop? Or does multiple comprehensions work? (sample code would be really awesome here). Or is there a smarter way in Python do it?
Cheers,
Victor
Your first part has one simple issue (which might not even be an issue). You don't handle key collisions at all (unless you intend to simply overwrite).
>>> dict([('a', 'b'), ('a', 'c')])
{'a': 'c'}
If you're guaranteed that employee_id is unique, there isn't an issue though.
2) Sure you can exclude it, but no real harm done. Actually, especially in python, if employee_id is a string or int (or some other primitive), the inner dict's reference and the key actually reference the same thing. They both point to the same spot in memory. The only duplication is in the reference (which isn't that big). If you're worried about memory consumption, you probably don't have to.
3) Don't try to do too much in one list comprehension. Just use a for loop after the first list comprehension.
To sum it all up, it sounds like you're really worried about the performance of iterating over the loop twice. Don't worry about performance initially. Performance problems come from algorithm problems, not specific language constructs like for loops vs list comprehensions.
If you're familiar with Big O notation, the list comprehension and for loop after (if you decide to do that) both have a Big O of O(n). Add them together and you get O(2n), but as we know from Big O notation, we can simplify that to O(n). I've over simplified a lot here, but the point is, you really don't need to worry.
If there are performance concerns, raise them after you written the code and prove it to yourself with a code profiler.
response to comments
As for your #2 reply, python really doesn't have a lot of mechanisms for making one liners cute and extra snazzy. It's meant to force you into simply writing the code out vs sticking it all in one line. That being said, it's still possible to do quite a bit of work in one line. My suggestion is to not worry about how much code you can stick in one line. Python looks a lot more beautiful (IMO) when its written out, not jammed in one line.
As for your #1 reply, you could try something like this:
employees = {}
for row in gd_extract:
if row['employee_id'] in employees:
... handle duplicates in employees dictionary ...
else:
employees[row['employee_id']] = row
As for your #3 reply, not sure what you're looking for and what about the telephone numbers you'd like to fix, but... this may give you a start:
import re
retelephone = re.compile(r'[-\(\)\s]') # remove dashes, open/close parens, and spaces
for empid, row in employees.iteritems():
retelephone.sub('',row['telephone'])
Related
I'm using python with the binance.client wrapper. I'm gathering all of the BTC trade pairs from the exchange and wanting to create a simple dict with tradepair: price.
I have figured a way to do this but it seems clunky to me and takes a minute or so to run. I'm currently a programming student just getting started with python as well as some other languages.
Is there a better way to do it than this?
def BTCPair():
BTCPair = []
BTCPrice = []
BTCPairAndPrice = {}
exchange_info = client.get_exchange_info()
for s in exchange_info['symbols']:
if 'BTC' in (s['symbol'])[-3:]:
BTCPair.append(s['symbol'])
BTCPrice.append(client.get_avg_price(symbol=s['symbol'])['price'])
for i in range(len(BTCPair)):
BTCPairAndPrice[BTCPair[i]] = BTCPrice[i]
return BTCPairAndPrice
I don't understand why you are using two loops; one to put the data into lists, and another to convert those lists into a dictionary - why not just construct the dictionary directly?
You could just construct your dictionary directly using a comprehension:
BTCPairAndPrice = {
s['symbol']: client.get_avg_price(symbol=s['symbol'])['price']
for s in exchange_info['symbols']
if 'BTC' in (s['symbol'])[-3:]
}
The way the dictionary is constructed is unlikely to have a big impact on performance, but not iterating through all the data twice should have an impact if there is a lot of data.
Also consider that contacting a web service is also likely to take some time, so contacting the exchange might be the slowest part.
First of all, for i in range(len(BTCPair)) is an antipattern. You could instead iterate over those zipped together.
But we don't actually need to do that either! Rather than creating two lists and then iterating over them to fill in your dictionary, you could create everything in one go with a dictionary comprehension. Also, a cleaner way to check the end of a string is endswith().
def btc_pair():
symbols = client.get_exchange_info()['symbols']
return {
s['symbol']: client.get_avg_price(symbol=s['symbol'])['price']
for s in symbols
if s['symbol'].endswith('BTC')
}
This probably will run a little bit faster, but I doubt that the dictionary creation itself is the real performance bottleneck in your code.
I'm a sucker for reducing code to its bare minimum and love keeping it short and slim, but occasionally I get into the dilemma of whether I'm doing more harm than good. Below is an example of a situation I frequently encounter and where I start pondering if I am minifying at the expense of speed.
str = "my name is john"
##Alternative 1
for el in str.split(" "):
print(el)
##Alternative 2
splittedStr = str.split(" ")
for el in splittedStr:
print(el)
What is faster? I'd assume it's the second one because we don't split the string after every iteration (not even sure we do that)?
str.split(" ") does the exact same thing in both cases. It creates an anonymous list of the split strings. In the second case you have the minor overhead of assigning it to a variable and then fetching the value of the variable. Its wasted time if you don't need to keep the object for other reasons. But this is a trivial amount of time compared to other object referencing taking place in the same loop. Alternative 2 also leaves the data in memory which is another small performance issue.
The real reason Alternative 1 is better than 2, IMHO, is that it doesn't leave the hint that splittedStr is going to be needed later.
Look my friend, if you want to actually reduce the amount of time in the code in general,loop on a tuple instead of list but assigning the result in a variable then using the variable is not the best approach is you just reserved a memory location just to store the value but sometimes you can do that just for the sake of having a clean code like if you have more than one operation in one line like
min(str.split(mylist)[3:10])
In this case, it is better to have a variable called min_value for example just to make things cleaner.
returning back to the performance issue, you could actually notice the difference in performance if you loop through a list or a tuple like
This is looping through a tuple
for i in (1,2,3):
print(i)
& This is looping through a list
for i in [1,2,3]:
print(i)
you will find that using tuple will be faster !
I'm parsing a xml file ... so there's a field called case: Sometimes it's a single OrderedDict, other times it's a list of OrderedDict. That's it:
OrderedDict([(u'duration', u'2.111'), (u'className', u'foo'), (u'testName', u'runTest'), (u'skipped', u'false'), (u'failedSince', u'0')])
[OrderedDict([(u'duration', u'0.062'), (u'className', u'foo'), (u'testName', u'runTest'), (u'skipped', u'false'), (u'failedSince', u'0')]), OrderedDict([(u'duration', u'0.461'), (u'className', u'bar'), (u'testName', u'runTest'), (u'skipped', u'false'), (u'failedSince', u'0')])]
I want to always have that expression as a single list. The reason is to have a for loop to take care of that. I thought about doing something like:
[case]
But as the later I would have [[case]]. I don't think list joins or concatenations would help me. A trivial solution would be to check if case is of the type list or OrderedDict, however I was looking for a simpler, one line, pythonic solution like the one I described above. How can I accomplish that?
Since list and OrderedDict are both kinds of containers, checking the type sounds like it might be the simplest solution, if you're sure that the xml parse will always use the list type.
There's no reason you can't do this in a one-liner:
case = [case] if not isinstance(case, list) else case
I have a library that does some "translation" and uses the awesome tokenize.generate_tokens() function to do so.
And it is pretty fast and I have things working correctly. But when translating, I've found that the function keeps growing with new tokens that I want to translate and the if and elif conditions start to pop all over. I also keep a few variables outside the generator that keeps track of "last keyword seen" and similar.
A good example of this is the actual Python documentation one seen here (at the bottom): http://docs.python.org/library/tokenize.html#tokenize.untokenize
Every time I add a new thing I need to translate this function grows a couple of conditionals. I don't think that having a function with so many conditionals is the way to or the proper way to pave the ground to grow.
Furthermore, I feel that the tokenizer consumes a lot of irrelevant lines that do not contain any of the keywords I am translating.
So 2 questions:
How can I avoid adding more and more conditional statements that will make this translation function easy/clean to keep growing (without a performance hit)?
How can I make it efficient for all the irrelevant lines I am not interested in?
You could use a dict dispatcher. For example, the code you linked to might look like this:
def process_number(result,tokval):
if '.' in tokval:
result.extend([
(NAME, 'Decimal'),
(OP, '('),
(STRING, repr(tokval)),
(OP, ')')
])
def process_default(result,tokval):
result.append((toknum, tokval))
dispatcher={NUMBER: process_number, }
for toknum, tokval, _, _, _ in g:
dispatcher.get(toknum,process_default)(result,tokval)
Instead of adding more if-blocks, you add key-value pairs to dispatcher.
This may be more efficient than evaluating a long list of if-else conditionals, since dict lookup is O(1), but it does require a function call. You'll have to benchmark to see how this compares to many if-else blocks.
I think its main advantage is that it keeps code organized in small(er), comprehensible units.
I'm working through some tutorials on Python and am at a position where I am trying to decide what data type/structure to use in a certain situation.
I'm not clear on the differences between arrays, lists, dictionaries and tuples.
How do you decide which one is appropriate - my current understanding doesn't let me distinguish between them at all - they seem to be the same thing.
What are the benefits/typical use cases for each one?
How do you decide which data type to use? Easy:
You look at which are available and choose the one that does what you want. And if there isn't one, you make one.
In this case a dict is a pretty obvious solution.
Tuples first. These are list-like things that cannot be modified. Because the contents of a tuple cannot change, you can use a tuple as a key in a dictionary. That's the most useful place for them in my opinion. For instance if you have a list like item = ["Ford pickup", 1993, 9995] and you want to make a little in-memory database with the prices you might try something like:
ikey = tuple(item[0], item[1])
idata = item[2]
db[ikey] = idata
Lists, seem to be like arrays or vectors in other programming languages and are usually used for the same types of things in Python. However, they are more flexible in that you can put different types of things into the same list. Generally, they are the most flexible data structure since you can put a whole list into a single list element of another list, but for real data crunching they may not be efficient enough.
a = [1,"fred",7.3]
b = []
b.append(1)
b[0] = "fred"
b.append(a) # now the second element of b is the whole list a
Dictionaries are often used a lot like lists, but now you can use any immutable thing as the index to the dictionary. However, unlike lists, dictionaries don't have a natural order and can't be sorted in place. Of course you can create your own class that incorporates a sorted list and a dictionary in order to make a dict behave like an Ordered Dictionary. There are examples on the Python Cookbook site.
c = {}
d = ("ford pickup",1993)
c[d] = 9995
Arrays are getting closer to the bit level for when you are doing heavy duty data crunching and you don't want the frills of lists or dictionaries. They are not often used outside of scientific applications. Leave these until you know for sure that you need them.
Lists and Dicts are the real workhorses of Python data storage.
Best type for counting elements like this is usually defaultdict
from collections import defaultdict
s = 'asdhbaklfbdkabhvsdybvailybvdaklybdfklabhdvhba'
d = defaultdict(int)
for c in s:
d[c] += 1
print d['a'] # prints 7
Do you really require speed/efficiency? Then go with a pure and simple dict.
Personal:
I mostly work with lists and dictionaries.
It seems that this satisfies most cases.
Sometimes:
Tuples can be helpful--if you want to pair/match elements. Besides that, I don't really use it.
However:
I write high-level scripts that don't need to drill down into the core "efficiency" where every byte and every memory/nanosecond matters. I don't believe most people need to drill this deep.