I asked another question about doing large queries in GAE, to which the answer was pretty much that it's not possible.
What I want to do is this: from an iOS device, I get all the user's contacts phone numbers. So now I have a list of say 250 phone numbers. I want to send these phone numbers back to the server and check to see which of these phone numbers belong to a User account.
So I need to do a query: query = User.query(User.phone.IN(phones_list))
However, with GAE, this is quite an expensive query. It will cost 250 reads for just this one query, and I expect to do this type of query often.
So I came up with a crazy idea: why don't I host the phone numbers on another host, in another database, where this type of query is cheaper? Then I can have GAE send an HTTP request to my other server to get the desired info.
So I have two questions:
1. Are there any databases better suited to handling these kinds of queries, where it would be cheaper to do so? Or will it all be the same as GAE?
2. Is this overkill? Is it a good idea? Should I suck it up and pay the cost?
GAE's datastore should be good enough for your service, since your application looks like it could be parallelized very well.
1. Use the phone number as the key_name of User.
If you set the number as the key_name of User, the following pattern will increase lookup speed and reduce read operations:
from google.appengine.api import memcache
from google.appengine.ext import db

found = memcache.get_multi(phones_list)                         # cached users first
missing = [n for n in phones_list if n not in found]
users = db.get([db.Key.from_path('User', n) for n in missing])  # batch get by key_name
memcache.set_multi({u.key().name(): u for u in users if u})     # cache the datastore hits
2. Store multiple numbers in one entity.
GAE's operation cost is not directly related to entity size, so one large entity storing multiple numbers is another way to save on operation costs.
For example, store several phone numbers that share the same number prefix together:
from google.appengine.ext import db

class Number(db.Model):
    # key_name is the number prefix, e.g. "01" or "03"
    number_prefix = db.StringProperty()
    numbers = db.StringListProperty(indexed=False)

# Check numbers 01234567 and 032123124:
entities = Number.get_by_key_name(["01", "03"])
# is "01234567" in entities[0].numbers ?
# is "032123124" in entities[1].numbers ?
This method could be further improved with memcache.
Generalizing slightly on other ideas offered... assuming that all your search keys are unique to a single User (e.g. email, phone, Twitter handle, etc.):
At User write time, you can generate a set of SearchIndex(...) and persist that. Each SearchIndex has the key of the User.
Then at search time you can construct the keys for any SearchIndex and do two ndb.get_multi_async calls. The first to get matching SearchIndex entities, and the second to get the Users associated with those index entities.
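A minimal sketch of that two-step lookup, assuming a hypothetical SearchIndex model keyed on the search value (the synchronous get_multi is shown for brevity; the async variant works the same way):

from google.appengine.ext import ndb

class SearchIndex(ndb.Model):
    # The entity id is the search key itself, e.g. "phone:+15551234567".
    user = ndb.KeyProperty(kind='User', required=True)

def find_users_by_phone(phones):
    # First batch get: the index entities whose ids match the phone numbers.
    index_keys = [ndb.Key(SearchIndex, 'phone:%s' % p) for p in phones]
    hits = [e for e in ndb.get_multi(index_keys) if e is not None]
    # Second batch get: the Users those index entities point at.
    return ndb.get_multi([e.user for e in hits])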
So I have a site that, on a per-user basis, is expected to query a very large database and let users flip through the results. Due to the number of entries returned, I run the query once (which takes some time...), store the result in a global, and let folks iterate through the results (or download them) as they want.
Of course, this isn't scalable, as the globals are shared across sessions. What is the correct way to do this in Django? I looked at session management, but I kept running into the "xyz is not JSON serializable" issue. Do I look into how to do this correctly using sessions, or is there another preferred way?
If the user is flipping through the results, you probably don't want to pull back and render any more than you have to. Most SQL dialects have TOP and LIMIT clauses that will let you pull back a limited range of results, as long as your data is ordered consistently. Django's Pagination classes are a nice abstraction of this on top of Django Model classes: https://docs.djangoproject.com/en/dev/topics/pagination/
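A minimal sketch of that pagination approach, assuming a hypothetical Entry model and template:

from django.core.paginator import Paginator, EmptyPage
from django.shortcuts import render
from myapp.models import Entry  # hypothetical model

def results(request, page_number=1):
    # Consistent ordering keeps the LIMIT/OFFSET pages stable between requests.
    queryset = Entry.objects.order_by('pk')
    paginator = Paginator(queryset, 50)  # 50 rows per page
    try:
        page = paginator.page(page_number)
    except EmptyPage:
        page = paginator.page(paginator.num_pages)
    # The QuerySet is lazy, so only this page's rows are pulled from the DB.
    return render(request, 'results.html', {'page': page})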
I would be careful of storing large amounts of data in user sessions, as it won't scale as your number of users grows, and user sessions can stay around for a while after the user has left the site. If you're set on this option, make sure you read about clearing the expired sessions. Django doesn't do it for you:
https://docs.djangoproject.com/en/1.7/topics/http/sessions/#clearing-the-session-store
Let's assume I am developing a service that provides a user with articles. Users can favourite articles and I am using Solr to store these articles for search purposes.
However, when the user adds an article to their favourites list, I would like to be able to figure out which articles the user has favourited so that I can highlight the favourite button.
I am thinking of two approaches:
1. Fetch articles from Solr and then loop through each article, fetching the "favourite status" of that article for this specific user from MySQL.
2. Whenever a user favourites an article, add that user's ID to a multi-valued column in Solr, and check whether the ID of the current user is in this column or not.
I don't know the capacity of the multi-valued column... and I also don't think the second approach would be "good practice" (saving user-related data in the index).
What other options do I have, if any? Is approach 2 a correct approach?
I'd go with a modified version of the first one: it keeps user-specific data that isn't going to be used for search out of the index, for now (although if you foresee a case where you want to search for favourited articles, it would probably be an interesting field to have in the index). For pure display purposes like this, I'd take all the IDs returned from Solr, fetch the favourite statuses in one SQL statement from the database, and then set the UI values depending on that. It's a fast and easy solution.
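A sketch of that single-statement lookup, assuming a hypothetical favourites(user_id, article_id) table and a DB-API cursor:

def mark_favourites(cursor, user_id, solr_results):
    article_ids = [doc['id'] for doc in solr_results]
    if not article_ids:
        return solr_results
    placeholders = ', '.join(['%s'] * len(article_ids))
    cursor.execute(
        "SELECT article_id FROM favourites "
        "WHERE user_id = %s AND article_id IN ({})".format(placeholders),
        [user_id] + article_ids)
    favourited = set(row[0] for row in cursor.fetchall())
    # One pass over the Solr results to set the UI flag.
    for doc in solr_results:
        doc['is_favourite'] = doc['id'] in favourited
    return solr_results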
If you foresee "search only in my favourited articles" as a use case, I would try to get that information into the index as well (or other filters on whether a specific user has favourited the article). In that case, I'd try to avoid indexing anything more than the IDs of the users who favourited the article.
Both solutions would work, although the latter would require more code - and the response from Solr could grow large if many users favourite an article, so I'd try to avoid returning the set of user IDs in that case (many favourites for a single article).
One of the core functions of my messaging app is allowing users to find friends who are also on the service based on their phone numbers. Apps like Whatsapp and Snapchat do have the same kind of mechanism.
I'm struggling to find a solution that returns a good number of results. I'm wondering how most other apps approach this pretty widely implemented feature.
My current implementation is that I have User model and a PhoneUser model. The PhoneUser model is keyed on the user's phone number that has been converted into the standardized E164 format. It has a KeyProperty to link it to the respective user.
class PhoneUser(ndb.Model):
    # id is the phone number in E164 format
    user = ndb.KeyProperty(kind='User', required=True)
When a user signs up for the service and grants access to their phone contacts, the app can get a large number of phone numbers from the user's phone book. 1,000 numbers is not impossible. I convert all these numbers into the standardized E164 format and then build keys for each (ie. ndb.Key('PhoneUser', PHONE_NUMBER)). With that list of PhoneUser keys, I can use ndb.get_multi(list_of_phoneuser_keys). This lets me avoid querying for 1,000 numbers.
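A minimal sketch of that batched lookup (the function name is mine):

from google.appengine.ext import ndb

def match_contacts(e164_numbers):
    # Build keys directly from the normalized numbers: one batched get
    # by key instead of up to 1,000 individual queries.
    keys = [ndb.Key('PhoneUser', number) for number in e164_numbers]
    phone_users = [p for p in ndb.get_multi(keys) if p is not None]
    # Resolve the KeyProperty links to the actual User entities.
    return ndb.get_multi([p.user for p in phone_users])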
This theoretically works well under the assumption that users enter their phone numbers with the country code correctly, so that the Python phonenumbers library can parse them.
However, that is often not the case, and this approach requires it, because getting entities by key requires exact matches.
This was just one approach I had thought of and it has its drawbacks. This seems like a very common function in apps and I was wondering if there was a better approach.
In any case you'll need to normalize phone numbers to a common format (E164). We use libphonenumber, which works pretty well. You might check out the Python port.
We replace missing country codes in friends' phone numbers with the country code of the user doing the search. Rationale: if a user does not have a country code entered for a contact, the two are probably from the same country.
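A sketch of that normalization step with the Python phonenumbers port (the function name is mine):

import phonenumbers

def to_e164(raw_number, searcher_region):
    # Fall back to the searching user's country when the contact was
    # saved without a country code.
    try:
        parsed = phonenumbers.parse(raw_number, searcher_region)
    except phonenumbers.NumberParseException:
        return None
    if not phonenumbers.is_valid_number(parsed):
        return None
    return phonenumbers.format_number(
        parsed, phonenumbers.PhoneNumberFormat.E164)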
Hint: things will get interesting when you want to implement reverse search - notifying existing users that one of their friends has shown up in the network.
The right way would be not to rely on users entering city/country codes and/or prefixes. I don't know anyone who saves their numbers with country codes unless they travel overseas frequently.
You will need to parse and correct the numbers. You can use current geolocation to try and add missing city/country codes and also remove unneeded prefixes. Probably involves some research into your target country audiences.
I have documents in Mongo that contain questionnaire-like data (company's name, address, phone numbers, etc.). I should have this data displayed in web interface.
Which is better: getting the whole document as a dict into a variable and filling the web form from its data, or making lots of database requests to fill each field of the web form?
If you are going to display all of the data on the web interface anyway, start simple: get the whole document. It reduces your latency by requiring fewer round trips to the database.
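A sketch with pymongo, assuming hypothetical database and collection names:

from pymongo import MongoClient

companies = MongoClient().mydb.companies  # hypothetical db and collection

def company_form_data(company_id):
    # One round trip: fetch the whole questionnaire document as a dict.
    doc = companies.find_one({'_id': company_id})
    if doc is None:
        return {}
    # Fill the web form fields from the in-memory dict.
    return {
        'name': doc.get('name'),
        'address': doc.get('address'),
        'phone': doc.get('phone'),
    }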
Agree with GPS - if you need all, or most, of the data, then it's better to retrieve the document in one request than to hit the database x number of times.
I'm working on an application that lets registered users create or upload content, and allows anonymous users to view that content and browse registered users' pages to find that content - this is very similar to how a site like Flickr, for example, allows people to browse its users' pages.
To do this, I need a way to identify the user in the anonymous HTTP GET request. A user should be able to type http://myapplication.com/browse/<userid>/<contentid> and get to the right page - the <userid> should be unique, but mustn't be something like the user's email address, for privacy reasons.
Through Google App Engine, I can get the email address associated with the user, but like I said, I don't want to use that. I can have users of my application pick a unique user name when they register, but I would like to make that optional if at all possible, so that the registration process is as short as possible.
Another option is to generate some random cookie (a GUID?) during the registration process and use that, but I don't see an obvious way of guaranteeing the uniqueness of such a cookie without a trip to the database.
Is there a way, given an App Engine user object, of getting a unique identifier for that object that can be used in this way?
I'm looking for a Python solution - I forgot that GAE also supports Java now. Still, I expect the techniques to be similar, regardless of the language.
Your timing is impeccable: Just yesterday, a new release of the SDK came out, with support for unique, permanent user IDs. They meet all the criteria you specified.
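For reference, the new ID is exposed on the User object; a minimal sketch:

from google.appengine.api import users

user = users.get_current_user()
if user:
    uid = user.user_id()  # unique, permanent, and not the e-mail address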
I think you should distinguish between two types of users:
1) users that have logged in via Google Accounts or that have already registered on your site with a non-google e-mail address
2) users that opened your site for the first time and are not logged in in any way
For the second case, I can see no other way than to generate some random string (e.g. via uuid.uuid4() or from this user's session cookie key), as an anonymous user does not carry any unique information with himself.
For users that are logged in, however, you already have a unique identifier -- their e-mail address. I agree with your privacy concerns -- you shouldn't use it as an identifier. Instead, how about generating a string that seems random, but is in fact generated from the e-mail address? Hashing functions are perfect for this purpose. Example:
>>> import hashlib
>>> email = 'user@host.com'
>>> salt = 'SomeLongStringThatWillBeAppendedToEachEmail'
>>> key = hashlib.sha1('%s$%s' % (email, salt)).hexdigest()
>>> print key
f6cd3459f9a39c97635c652884b3e328f05be0f7
hashlib.sha1 is not a random function: for given data it always returns the same result, yet it is proven to be practically irreversible, so you can safely present the hashed key on the website without compromising the user's e-mail address. Also, you can safely assume that no two hashes of distinct e-mails will be the same (they can be, but the probability of it happening is very, very small). For more information on hashing functions, consult the Wikipedia entry.
Do you mean session cookies?
Try http://code.google.com/p/gaeutilities/
What DzinX said. The only way to create an opaque key that can be authenticated without a database roundtrip is using encryption or a cryptographic hash.
Give the user a random number and hash it or encrypt it with a private key. You still run the (tiny) risk of collisions, but you can avoid this by touching the database on key creation, changing the random number in case of a collision. Make sure the random number is cryptographically secure, and add a long server-side random secret to prevent chosen-plaintext attacks.
You'll end up with a token like the Google Docs key, basically a signature proving the user is authenticated, which can be verified without touching the database.
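A minimal sketch of that signed-token idea using an HMAC (all names are mine; the secret must be persisted, since regenerating it invalidates every token):

import hashlib
import hmac

SERVER_SECRET = 'a-long-random-server-side-secret'

def make_token(user_number):
    # Sign the per-user random number so it can be verified later
    # without touching the database.
    mac = hmac.new(SERVER_SECRET, user_number, hashlib.sha1).hexdigest()
    return '%s.%s' % (user_number, mac)

def verify_token(token):
    user_number, _, mac = token.partition('.')
    expected = hmac.new(SERVER_SECRET, user_number, hashlib.sha1).hexdigest()
    return user_number if hmac.compare_digest(mac, expected) else None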
However, given the pricing of GAE and the speed of bigtable, you're probably better off using a session ID if you really can't use Google's own authentication.