How to use Graphlab recommend() for providing recommendations to a new user? - python

In Graphlab,
I am trying to use the recommend() method to see how it provides recommendations for a new user (user_id) that isn't present in the trained model built from the given dataset. Since the aim is to find similar users through this recommendation model, I plan to pass new_user_data to recommend(), but with exactly the same item ratings as an existing user, to check whether it returns the same recommendations.
Here is what I am doing:
(data is the dataset containing UserId, ItemId and Rating columns)
(say 104 is a new UserId which isn't in the dataset)
result = graphlab.factorization_recommender.create(data, user_id='UserId',
                                                   item_id='ItemId', target='Rating')
new_user_info = graphlab.SFrame({'UserId':104, 'ItemId':['x'], 'Rating':9})
r = result.recommend(users=104, new_user_data=new_user_info)
I am getting an error:
raise exc_type(exc_value)
TypeError: object of type 'int' has no len()
Can anyone help as to how to use the recommend() method for a new user?

Which of these lines gives you the exception? I think you have problems with both your SFrame creation and your usage of the .recommend() method.
new_user_info=graphlab.SFrame({'UserId':104,'ItemId':['x'],'Rating':9})
# should be
new_user_info=graphlab.SFrame({'UserId':[104],'ItemId':['x'],'Rating':[9]})
# construct SFrames from a dictionary where the values are lists
and
r = result.recommend(users=104,new_user_data=new_user_info)
# should be:
r = result.recommend(users=[104],new_user_data=new_user_info)
# users is a list, not an integer
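Putting both fixes together, a minimal end-to-end sketch might look like this (it assumes the same data SFrame with UserId, ItemId and Rating columns as above; the item id 'x' and rating 9 are just placeholders):
import graphlab

# train on the existing ratings
result = graphlab.factorization_recommender.create(data, user_id='UserId',
                                                   item_id='ItemId', target='Rating')

# every value in the SFrame dict must be a list, even for a single row
new_user_info = graphlab.SFrame({'UserId': [104], 'ItemId': ['x'], 'Rating': [9]})

# users must be a list (or SArray) of user ids, not a bare int
r = result.recommend(users=[104], new_user_data=new_user_info)
r.print_rows()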

Related

Graphlab SFrames: Error in using SFrames with the dataset

In Graphlab,
I am working with a small set of fitness data and want to use the recommender functions to provide recommendations. The dataset has a user id column but no item id column; instead, the different items are arranged as columns, and their respective ratings appear in the rows corresponding to each user id. In order to use any GraphLab recommender method, I need both user ids and item ids. Here is what I did:
v = graphlab.SFrame.read_csv('Data.csv')
userId = v["user_id"]
itemId = v["x","y","z","x1","y1","z1"]  # x,y,z,x1,y1,z1 are activities that are actually the columns in Data and contain the corresponding ratings given by each user
sf= graphlab.SFrame({'UserId':userId,'ItemId':itemId})
print sf.head(5)
Basically, I extracted the user_id column from Data and tried making a column for ItemId using the x, y, z, etc. columns extracted from the same data, in order to make another SFrame with just these two columns. This code results in a tabular SFrame with two columns as expected, but they are not arranged in the order I pass the arguments to SFrame: the output gives ItemId as the first column and then UserId. Even when I change the order in which I pass these two to the SFrame constructor, it still gives the same output. Does anyone know the reason why?
This is creating a problem further on when using any recommender method, as it gives the error: Column name user_id does not exist.
The reason for the column ordering is because you are passing a Python dictionary to the SFrame constructor. Dictionaries in Python will not keep keys in the order they are specified; they have their own order. If you prefer "UserId" to be first, you can call sf.swap_columns('UserId','ItemId').
The order of the columns does not affect the recommender method though. The Column name 'user_id' does not exist error will appear if you don't have a column named exactly user_id AND don't specify what the name of the user_id column is. In your case, you would want to do: graphlab.recommender.create(sf, user_id='UserId', item_id='ItemId').
Also, you may want to look at the stack method, which could help get your data into the form the recommender method expects. Your current SFrame sf will, I think, have a column of dictionaries where the item id is the key and the rating is the value. I believe this would work in that case:
sf.stack('ItemId', new_column_name=['ItemId','Rating'])
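For completeness, here is a rough sketch of going from the wide layout in the question (one column per activity) to the long UserId/ItemId/Rating layout; it assumes the activity columns are named x, y, z, x1, y1, z1 as in the question and that pack_columns is available in your GraphLab version:
import graphlab

v = graphlab.SFrame.read_csv('Data.csv')

# pack the activity columns into a single dict column {item: rating}
sf = v.pack_columns(columns=['x','y','z','x1','y1','z1'],
                    dtype=dict, new_column_name='ItemId')
sf = sf.rename({'user_id': 'UserId'})

# stack the dict column into (ItemId, Rating) pairs, one row per item
sf = sf.stack('ItemId', new_column_name=['ItemId','Rating'])

model = graphlab.recommender.create(sf, user_id='UserId',
                                    item_id='ItemId', target='Rating')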

Python categorize datatypes

I plan to make a 'table' class that I can use throughout my data-analysis program to store gathered data. The objective is to make simple tables like this:
ID  Mean size  Stdv   Date measured  Relative flatness
-------------------------------------------------------
1   133.4242   34.43  Oct 20, 2013   32093
2   239.244    34.43  Oct 21, 2012   3434
I will follow the sqlite3 suggestion from this post: python-data-structure-for-maintaing-tabular-data-in-memory, but I will still need to save it as a csv file (not as a database), and I want it to eat my data as we go: add columns on the fly whenever new measures become available and are deemed interesting. For that, the class will need to be able to determine the data type of the data thrown at it.
Sqlite3 has limited datatypes: float, int, date and string. Python and numpy together have many types. Is there an easy way to quickly decide what the datatype of a variable is? Then my table class can automatically add a column when entered data contains new fields.
I am not too concerned about performance, the table should be fairly small.
I want to use my class like so:
dt = Table()
dt.add_record({'ID':5, 'Mean size':39.4334})
dt.add_record({'ID':5, 'Goodness of fit': 12})
In the last line, there is new data. The Table class needs to figure out what kind of data that is and then add a column to the sqlite3 table. Making it all strings seems a bit too sloppy; I still want to keep my high-precision floats correct...
Also: If something like this already exists, I'd like to know about it.
It seems that your question is: "Is there an easy way to quickly decide what the datatype of a variable is?". This is a simple question, and the answer is:
type(variable).
But the context you provide requires a more careful answer.
Since SQLite3 provides only a few data types (slightly different ones than what you listed), you need to map your input variables to the types provided by SQLite3.
But you may encounter further problems: You may need to change the types of columns as you receive new records, if you do not want to require that the column type be fixed in advance.
For example, for the Goodness of fit column in your example, you get an int (12) first. But you may get a float (e.g. 10.1) the second time, which shows that both values must be interpreted as floats. And if next time you receive a string, then all of them must be strings, right? But then the exact formatting of the numbers also counts: whereas 12 and 12.0 are the same when you interpret them as floats, they are not when you interpret them as strings; and the first value may become "12.0" when you convert all of them to strings.
So either you throw an exception when the type of consecutive values for the same column do not match, or you try to convert the previous values according to the new ones; but occasionally you may need to re-read the input.
Nevertheless, once you make those decisions regarding the expected behavior, it should not be a very difficult problem to implement.
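As a rough illustration of that decision (not a full implementation), a type-detection helper with simple promotion rules could look something like this; the mapping and the promotion order int -> float -> str are assumptions you would adapt:
import datetime

# assumed, simplified mapping from Python types to SQLite column types
SQLITE_TYPES = {int: 'INTEGER', float: 'REAL', str: 'TEXT', datetime.date: 'DATE'}

# promotion order when a column sees mixed numeric/text values
PROMOTION = [int, float, str]

def detect_type(value):
    # return the Python type used to pick the SQLite column type
    for t in (bool, int, float, datetime.date, str):
        if isinstance(value, t):
            return int if t is bool else t
    return str                      # fall back to text for anything unknown

def promote(current, new):
    # widen a column's type so it can hold both the old and the new values
    if current is None or current is new:
        return new
    if current in PROMOTION and new in PROMOTION:
        return PROMOTION[max(PROMOTION.index(current), PROMOTION.index(new))]
    return str                      # incompatible types: store everything as text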
Regarding your last question: I personally do not know of an existing implementation to this problem.

How to store numerical lookup table in Python (with labels)

I have a scientific model which I am running in Python which produces a lookup table as output. That is, it produces a many-dimensional 'table' where each dimension is a parameter in the model and the value in each cell is the output of the model.
My question is how best to store this lookup table in Python. I am running the model in a loop over every possible parameter combination (using the fantastic itertools.product function), but I can't work out how best to store the outputs.
It would seem sensible to simply store the output as an ndarray, but I'd really like to be able to access the outputs based on the parameter values, not just on indices. For example, rather than accessing the values as table[16][5][17][14] I'd prefer to access them somehow using variable names/values, for example:
table[solar_z=45, solar_a=170, type=17, reflectance=0.37]
or something similar to that. It'd be brilliant if I were able to iterate over the values and get their parameter values back - that is, being able to find out that table[16]... corresponds to the outputs for solar_z = 45.
Is there a sensible way to do this in Python?
Why don't you use a database? I have found MongoDB (and the official Python driver, Pymongo) to be a wonderful tool for scientific computing. Here are some advantages:
Easy to install - simply download the executables for your platform (2 minutes tops, seriously).
Schema-less data model
Blazing fast
Provides map/reduce functionality
Very good querying functionalities
So, you could store each entry as a MongoDB entry, for example:
{"_id":"run_unique_identifier",
"param1":"val1",
"param2":"val2" # etcetera
}
Then you could query the entries as you will:
import pymongo

data = pymongo.Connection("localhost", 27017)["mydb"]["mycollection"]
for entry in data.find():      # iterate over all results
    print entry["param1"]      # do something with param1
Whether or not MongoDB/pymongo are the answer to your specific question, I don't know. However, you could really benefit from checking them out if you are into data-intensive scientific computing.
If you want to access the results by name, then you could use a Python nested dictionary instead of an ndarray, and serialize it to a .json text file using the json module.
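A minimal sketch of that idea, where run_model is a hypothetical stand-in for the actual model call and the parameter values are made up; note that JSON object keys must be strings, so numeric keys come back as strings after loading:
import json
import itertools

table = {}
for solar_z, solar_a in itertools.product([0, 45, 90], [0, 170]):
    # nested dict keyed by parameter value: table[solar_z][solar_a] = output
    table.setdefault(solar_z, {})[solar_a] = run_model(solar_z, solar_a)

with open('lookup.json', 'w') as f:
    json.dump(table, f)

with open('lookup.json') as f:
    loaded = json.load(f)
print loaded["45"]["170"]           # keys are strings after a JSON round-trip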
One option is to use a numpy ndarray for the data (as you do now), and write a parser function to convert the query values into row/column indices.
For example:
solar_z_dict = {...}   # e.g. maps the parameter value 45 to row index 16
solar_a_dict = {...}
...

def lookup(dataArray, solar_z, solar_a, type, reflectance):
    return dataArray[solar_z_dict[solar_z], solar_a_dict[solar_a], ...]
You could also build the index expression as a string and eval it, if you want some of the fields to be given as "None" and translated to ":" (to give the full table for that variable).
For example, rather than accessing the values as table[16][5][17][14]
I'd prefer to access them somehow using variable names/values
That's what numpy's dtypes are for:
import pylab as plb
from sys import argv

dt = [('L','float64'), ('T','float64'), ('NMSF','float64'), ('err','float64')]
data = plb.loadtxt(argv[1], dtype=dt)
Now you can access the data columns by name: data['L'], data['T'], data['NMSF'].
More info on dtypes:
http://docs.scipy.org/doc/numpy/reference/generated/numpy.dtype.html
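A tiny self-contained example of the same idea, building the structured array directly rather than loading it from a file (the field names are only illustrative):
import numpy as np

dt = [('solar_z','float64'), ('solar_a','float64'), ('output','float64')]
table = np.array([(45.0, 170.0, 0.37),
                  (50.0, 175.0, 0.41)], dtype=dt)

print table['solar_z']                    # all solar_z values as a 1-D array
print table[table['solar_z'] == 45.0]     # rows where solar_z is 45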

Store django forms.MultipleChoiceField in Models directly

Say I have choices defined as follows:
choices = (('1', 'a'),
           ('2', 'b'),
           ('3', 'c'))
And a form that renders and inputs these values in a MultipleChoiceField,
class Form1(forms.Form):
    field = forms.MultipleChoiceField(choices=choices)
What is the right way to store field in a model?
I can of course loop through form.cleaned_data['field'] and obtain a value that fits in models.CommaSeparatedIntegerField.
Again, each time I retrieve these values, I will have to loop and convert them back into options.
I think there is a better way to do this; as it stands, I am in a way re-implementing what CommaSeparatedIntegerField is supposed to do.
The first thing I would consider is better normalization of your database schema; if a single instance of your model can have multiple values for this field, the field perhaps should be a linked model with a ForeignKey instead.
If you're using Postgres, you could also use an ARRAY field; Django now has built-in support.
If you can't do either of those, then you do basically need to reimplement a (better) version of CommaSeparatedIntegerField. The reason is that CommaSeparatedIntegerField is nothing but a plain CharField whose default formfield representation is a regex-validated text input. In other words, it doesn't do anything that's useful to you.
What you need to write is a custom ListField or MultipleValuesField that expects a Python list and returns a Python list, but internally converts that list to/from a comma-separated string for insertion in the database. Read the documentation on custom model fields; I think in your case you'll want a subclass of CharField with two methods overridden: to_python (convert CSV string to Python list) and get_db_prep_value (convert Python list to CSV string).
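As a rough sketch of such a field (names and details here are illustrative, and it assumes a comma never appears inside a stored value):
from django.db import models

class MultipleValuesField(models.CharField):
    # stores a Python list of strings as a single comma-separated string

    def to_python(self, value):
        # database/string value -> Python list
        if isinstance(value, list):
            return value
        if not value:
            return []
        return value.split(',')

    def get_db_prep_value(self, value, connection=None, prepared=False):
        # Python list -> string stored in the database
        if isinstance(value, (list, tuple)):
            return ','.join(value)
        return value
Depending on your Django version you may also need to override from_db_value() so that values loaded from the database are converted back to a list.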
I just had this same problem, and since (as Carl Meyer put it) I didn't want a normalized version of this "list of strings", the solution for me was to just have a CharField in the model. That way the model stores the list of items as a single string; in my case the items are countries.
So the model declaration is just
countries = CharField(max_length=XXX)
where XXX is a precalculated value of 2x the size of my country list. It's simpler for us to check whether the current country is in this list than to model it as an M2M to a Country table.
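A rough illustration of that approach (the model, the has_country helper and the max_length value are made up):
from django.db import models

class Profile(models.Model):
    # comma-separated country codes, e.g. "US,DE,FR"; 600 is an arbitrary cap
    countries = models.CharField(max_length=600, blank=True)

    def has_country(self, code):
        # a simple membership check instead of an M2M join to a Country table
        return code in self.countries.split(',')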

Django filter based on custom function

I have a table AvailableDates with a column date that stores date information.
I want to filter on the date after performing some operation on it, defined by the convert_date_to_type function, which takes a parameter input_variable provided by the user.
def convert_date_to_type(date, input_variable):
    # perform some operation on date based on input_variable
    # the return value will be a type: any one item from the types list below
    return type
list of types:
types = []
types.append('type1')
types.append('type2')
types.append('type3')
Now I want to filter the table based on type. I will do this in for loop:
for i in range(0, len(types)):
    # filter the table here based on types[i], something like this
    AvailableDates.objects.filter(convert_date_to_type(date, input_variable)=types[i])
How can I achieve this? Any other approach is much appreciated.
I cannot store the type information in a separate column, because one date can be of different types depending on the input_variable given by the user.
The approach you are taking will result in very time-consuming queries, because you are looping over all the objects in Python and therefore giving up the query-time benefits that database systems provide.
The main question you have to answer is: "How frequently is this query going to be used?"
If it's going to be used a lot, then I suggest the following approach.
1. Create an extra column (or a table with a one-to-one relation to your model).
2. Override the model's save() method so that it processes your date and stores the result in the extra column from step 1 at save time.
3. Run your query against the extra column from step 1.
This approach has space overhead, but it will make the query faster.
If it's not going to be used a lot, but the query can make your web request noticeably slow, then also use the above approach. It will help keep the web experience smooth.
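A rough sketch of steps 1 and 2, using a hypothetical date_type column and the convert_date_to_type function from the question; it assumes input_variable is known at save time, which, as the question notes, may not hold in your case:
from django.db import models

class AvailableDates(models.Model):
    date = models.DateField()
    # step 1: extra column holding the precomputed type
    date_type = models.CharField(max_length=20, blank=True)

    def save(self, *args, **kwargs):
        # step 2: compute and store the type whenever the row is saved
        self.date_type = convert_date_to_type(self.date, input_variable)
        super(AvailableDates, self).save(*args, **kwargs)

# step 3: the filtering now happens entirely in the database
AvailableDates.objects.filter(date_type='type1')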
One solution is to get the list of ids first:
dates_, wanted_ids = AvailableDates.objects.all(), []
for t in types:
    # run the conversion in Python for every row, collecting matching ids
    wanted_ids += [d.id for d in dates_ if convert_date_to_type(d.date, input_variable) == t]
wanted_dates = AvailableDates.objects.filter(id__in=wanted_ids)
Not very performant, but it works.
