I have a very long list of emails that I would like to process to:
separate good emails from bad emails, and
remove duplicates but keep all the non-duplicates in the same order.
This is what I have so far:
email_list = ["joe#example.com", "invalid_email", ...]
email_set = set()
bad_emails = []
good_emails = []
dups = False
for email in email_list:
    if email in email_set:
        dups = True
        continue
    email_set.add(email)
    if email_re.match(email):
        good_emails.append(email)
    else:
        bad_emails.append(email)
I would like this chunk of code to be as fast as possible and, of lesser importance, to minimize memory requirements. Is there a way to improve this in Python? Maybe using list comprehensions or iterators?
EDIT: Sorry! Forgot to mention that this is Python 2.5, since this is for GAE.
email_re is from django.core.validators
Look at "Does Python have an ordered set?" and select an implementation you like.
So just:
email_list = OrderedSet(["joe@example.com", "invalid_email", ...])
bad_emails = []
good_emails = []
for email in email_list:
    if email_re.match(email):
        good_emails.append(email)
    else:
        bad_emails.append(email)
This is probably the fastest and simplest solution you can achieve.
I can't think of any way to speed up what you have. It's fast to use a set to keep track of what you've seen, and it's fast to append to a list.
I like the OrderedSet solution, but I doubt a Python implementation of OrderedSet would be faster than what you wrote.
You could use an OrderedDict to solve this problem, but it was only added in Python 2.7. You could use a recipe (like http://code.activestate.com/recipes/576693/) to get an OrderedDict, but again I don't think it would be any faster than what you have.
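For illustration, a minimal sketch of that approach (assuming an OrderedDict is available, either collections.OrderedDict on 2.7+ or the recipe linked above, and email_re as in the question):
# dict.fromkeys keeps only the first occurrence of each email, in order
from collections import OrderedDict  # or the OrderedDict from the recipe above

unique_emails = OrderedDict.fromkeys(email_list).keys()
good_emails = []
bad_emails = []
for email in unique_emails:
    if email_re.match(email):
        good_emails.append(email)
    else:
        bad_emails.append(email)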
I'm trying to think of a Python module that is implemented in C to solve this problem. I think that's the only hope of beating your code. But I haven't thought of anything.
If you can get rid of the dups flag, it will be faster simply by running less Python code.
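For example, a sketch of that idea using the names from the question: drop the flag from the loop body and derive it afterwards from the lengths.
email_set = set()
good_emails = []
bad_emails = []
for email in email_list:
    if email in email_set:
        continue
    email_set.add(email)
    if email_re.match(email):
        good_emails.append(email)
    else:
        bad_emails.append(email)
# compute the flag once, after the loop, instead of setting it on every duplicate
dups = len(email_set) < len(email_list)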
Interesting question. Good luck.
Related
Basically, I have a troubleshooting program in which I want the user to enter their input. Then I take this input and split the words into separate strings. After that, I want to create a dictionary from the contents of a .CSV file, with the keys as recognisable keywords and the second column as solutions. Finally, I want to check whether any of the strings from the split user input are in the dictionary keys and, if so, print the solution.
However, the problem I am facing is that while I can do what I have stated above, it loops through, and if my input was 'My phone is wet' and 'wet' was a recognisable keyword, it would say 'Not recognised', 'Not recognised', 'Not recognised', and only then print the solution. It says 'Not recognised' so many times because the strings 'My', 'phone' and 'is' are not recognised.
So how do I test whether a user's split input is in my dictionary without it outputting 'Not recognised' for every word that doesn't match?
Sorry if this was unclear, I'm quite confused by the whole matter.
Code:
import csv, easygui as eg
KeywordsCSV = dict(csv.reader(open('Keywords and Solutions.csv')))
Problem = eg.enterbox('Please enter your problem: ', 'Troubleshooting').lower().split()
for Problems, Solutions in KeywordsCSV.items():
    pass
Note that I have the pass there because this is the part I need help on.
My CSV file consists of:
problemKeyword | solution
For example;
wet | Put the phone in a bowl of rice.
Your code reads like some ugly code golf. Let's clean it up before we look at how to solve the problem.
import easygui as eg
import csv
# # KeywordsCSV = dict(csv.reader(open('Keywords and Solutions.csv')))
# why are you nesting THREE function calls? That's awful. Don't do that.
# KeywordsCSV should be named something different, too. `problems` is probably fine.
with open("Keywords and Solutions.csv") as f:
reader = csv.reader(f)
problems = dict(reader)
problem = eg.enterbox('Please enter your problem: ', 'Troubleshooting').lower().split()
# this one's not bad, but I lowercased your `Problem` because capital-case
# words are idiomatically class names. Chaining this many functions together isn't
# ideal, but for this one-shot case it's not awful.
Let's break a second here and notice that I changed something on literally every line of your code. Take time to familiarize yourself with PEP8 when you can! It will drastically improve any code you write in Python.
Anyway, once you've got a problems dict, and a problem that should be a KEY in that dict, you can do:
if problem in problems:
    solution = problems[problem]
or even using the default return of dict.get:
solution = problems.get(problem)
# if KeyError: solution is None
If you wanted to loop this, you could do something like:
while True:
    problem = eg.enterbox(...)  # as above
    solution = problems.get(problem)
    if solution is None:
        pass  # invalid problem, warn the user
    else:
        # display the solution? Do whatever it is you're doing with it and...
        break
Just have a boolean and an if after the loop that only runs if none of the words in the sentence were recognized.
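A minimal sketch of that suggestion, reusing the names from the question (Python 2 print, as elsewhere in this thread); treat it as a starting point, not a drop-in fix:
found = False
for word in Problem:
    if word in KeywordsCSV:
        print KeywordsCSV[word]  # print the matching solution
        found = True
if not found:
    print 'Not recognised'       # only printed when no word matched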
I think you might be able to use something like:
for word in Problem:
    if KeywordsCSV.has_key(word):
        KeywordsCSV.get(word)
or the list comprehension:
[KeywordsCSV.get(word) for word in Problem if KeywordsCSV.has_key(word)]
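Note that has_key() only exists in Python 2; a sketch of the same comprehension using the in operator (which also avoids looking each word up twice) would be:
solutions = [KeywordsCSV[word] for word in Problem if word in KeywordsCSV]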
I am trying to name keys in my dictionary in a generic way because the names will be based on the data I get from a file. I am a beginner in Python and am not able to solve it; I hope to get an answer from you guys.
For example:
from collections import defaultdict
dic = defaultdict(dict)
dic = {}
if cycle == fergurson:
    dic[cycle] = {}
    if loop == mourinho:
        a = 2
        dic[cycle][loop] = {a}
Sorry if there are syntax errors or any other mistakes.
The variables fergurson and mourinho will be changing due to the different files that I will import later on.
So I am expecting to see my output when I type:
dic[fergurson][mourinho]
the result will be:
>>>dic[fergurson][mourinho]
['2']
It will be done using Python.
Naming things, as they say, is one of the two hardest problems in Computer Science. That and cache invalidation and off-by-one errors.
Instead of focusing on what to call it now, think of how you're going to use the variable in your code a few lines down.
If you were to read code that was
for filename in directory_list:
    print filename
It would be easy to presume that it is printing out a list of filenames
On the other hand, if the same code had different names
for a in b:
    print a
it would be a lot less expressive as to what it is doing other than printing out a list of who knows what.
I know that this doesn't help what to call your 'dic' variable, but I hope that it gets you on the right track to find the right one for you.
I have found a way; if it is wrong, please correct it:
import re
dictionary = {}
dsw = "I am a Geography teacher"
abc = "I am a clever student"
search = re.search(r'(?<=Geography )(.\w+)', dsw).group()  # "teacher"
dictionary[search] = {}
again = re.search(r'(?<=clever )(.\w+)', abc).group()      # "student"
dictionary[search][again] = {}
number = 56
dictionary[search][again] = number
and so when you want to find your specific dictionary after running the program:
dictionary["teacher"]["student"]
you will get:
56
This is what I meant.
I want to pull all the objects corresponding to my list of keys. Is there a better way than how I'm doing it now?
list_of_objs = []
for obj_key in list_of_keys:
    this_obj = db.get(obj_key)
    list_of_objs.append(this_obj)
According to the documentation, db.get may accept a list of keys and fetch them in a single batch, which will be much faster.
list_of_objects = db.get(list_of_keys)
I'm not familiar with your specific application (specifically the db.get), but you may want to consider a list comprehension:
list_of_objs = [db.get(key) for key in list_of_keys]
You could use the IN operator
query = db.Query()
list_of_objs = query.filter('__key__ IN', list_of_keys)  # the IN filter takes a list of keys, not a joined string
this (or something similar) should do the job
According to the docs such queries are possible
I have a large set of URLs. Some are similar to each other, i.e. they represent a similar set of pages.
For example:
http://example.com/product/1/
http://example.com/product/2/
http://example.com/product/40/
http://example.com/product/33/
are similar. Similarly
http://example.com/showitem/apple/
http://example.com/showitem/banana/
http://example.com/showitem/grapes/
are also similar. So I need to represent them as http://example.com/product/(Integers)/
where (Integers) = 1, 2, 40, 33, and http://example.com/showitem/(strings)/ where strings = apple, banana, grapes... and so on.
Is there any built-in function or library in Python to find these similar URLs in a large set of mixed URLs? How can this be done more efficiently? Please suggest. Thanks in advance.
Use a string to store the first part of the URL and just handle IDs, example:
In [1]: PRODUCT_URL='http://example.com/product/%(id)s/'
In [2]: _ids = '1 2 40 33'.split() # split string into list of IDs
In [3]: for id in _ids:
   ...:     print PRODUCT_URL % {'id':id}
   ...:
http://example.com/product/1/
http://example.com/product/2/
http://example.com/product/40/
http://example.com/product/33/
The statement print PRODUCT_URL % {'id':id} uses Python string formatting to format the product URL depending on the variable id passed.
UPDATE:
I see you've changed your question. The solution for your problem is quite domain-specific and depends on your data set. There are several approaches, some more manual than others. One such approach would be to get the top-level URLs i.e. to retrieve the domain name:
In [7]: _url = 'http://example.com/product/33/' # url we're testing with
In [8]: ('/').join(_url.split('/')[:3]) # get domain
Out[8]: 'http://example.com'
In [9]: ('/').join(_url.split('/')[:4]) # get domain + first URL sub-part
Out[9]: 'http://example.com/product'
[:3] and [:4] above are just slicing the list resulting from split('/')
You can set the result as a key on a dict and keep a count of each time you encounter that URL part, and move on from there. Again, the solution depends on your data. If it gets more complex than the above, I suggest you look into regexes, as the other answers suggest.
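As a rough illustration of that counting idea (the sample URLs are taken from the question; defaultdict just saves the key-existence check):
from collections import defaultdict

urls = [
    'http://example.com/product/1/',
    'http://example.com/product/2/',
    'http://example.com/showitem/apple/',
    'http://example.com/showitem/banana/',
]

groups = defaultdict(list)
for url in urls:
    prefix = '/'.join(url.split('/')[:4])  # domain + first URL sub-part
    groups[prefix].append(url)

for prefix, members in groups.items():
    print prefix, len(members)  # e.g. http://example.com/product 2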
You can use regular expressions to handle those cases. You can go to the Python documentation to see how this is handled.
Also, you can see how Django implements this in its routing system.
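For example, a small hand-written sketch along those lines, using the two URL shapes from the question (the patterns below are assumptions; a real data set would need its own list):
import re

patterns = [
    (re.compile(r'^http://example\.com/product/\d+/$'), 'http://example.com/product/(Integers)/'),
    (re.compile(r'^http://example\.com/showitem/[a-z]+/$'), 'http://example.com/showitem/(strings)/'),
]

def classify(url):
    # return the template of the first pattern that matches, else None
    for pattern, template in patterns:
        if pattern.match(url):
            return template
    return None

print classify('http://example.com/product/40/')  # http://example.com/product/(Integers)/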
I'm not exactly sure what specifically you are looking for. It sounds to me like you are looking for something to match URLs. If this is indeed what you want, then I suggest you use something built with regular expressions. One example can be found here.
I also suggest you take a look at Django and its routing system.
Not in Python, but I've created a Ruby Library (and an accompanying app) --
https://rubygems.org/gems/LinkGrouper
It works on all links (doesn't need to know any pattern).
I use SPARQLWrapper in Python to query a web endpoint with many different queries in a loop.
So I tried to make it work like this (let queries hold all the different queries and result hold the results):
sparql = SPARQLWrapper("url")
prefix = "prefix..."
for i in range(len(queries)):
    sparql.setQuery(prefix + queries[i])
    result[i] = sparql.query().convert()
But this does not work. The first query I pick from the list would return the expected result, but the other queries wouldn't.
Instead of that, I now use this:
for i in range(len(queries)):
    [sparql, prefix] = initializeSPARQL()
    sparql.setQuery(prefix + queries[i])
    result[i] = sparql.query().convert()
and also
def initializeSPARQL():
    sparql = SPARQLWrapper("url")
    prefix = "prefix..."
    return sparql, prefix
That works and is also not an issue of performance, since the querying itself is the bottleneck. But is there a better solution? This appears to be so wrong...
It is strange, because I've been checking the code and the query() method is completely stateless, so I have no idea why it's failing.
With i > 1, what does result[i] contain?
May I suggest you try the following?
sparql = SPARQLWrapper("url")
prefix = "prefix..."
results = []
for i in range(0, len(queries)):
    sparql.resetQuery()
    sparql.setQuery(prefix + queries[i])
    results.append(sparql.query().convert())
I'm one of the developers of the library.
Your first try triggered a bug. I'll check what internal data structure keeps state from the previous usage, so that the library can be used that way.
Your second solution, even if it works, is not the right way to do it.
As I said, I'll take a look on how to fix this.
For the future, please submit a proper bug report to the project or send an email to the mailing list.