In GraphLab, I am working with a small set of fitness data and want to use the recommender functions to generate recommendations. The dataset has a user_id column but no item id column; instead, the items are arranged as separate columns, and the rows hold each user_id's ratings for those items. To use any GraphLab recommender method, I need both user ids and item ids. Here is what I did:
v = graphlab.SFrame.read_csv('Data.csv')
userId = v["user_id"]
itemId = v[["x","y","z","x1","y1","z1"]]  # x,y,z,x1,y1,z1 are activity columns in Data holding each user's ratings
sf= graphlab.SFrame({'UserId':userId,'ItemId':itemId})
print sf.head(5)
Basically, I extracted the user_id column from Data and tried to build an ItemId column from the x, y, z, etc. columns of the same data, in order to make another SFrame with just these two columns. This code results in a tabular SFrame with two columns as expected, but they are not in the order I pass the arguments to SFrame: the output puts ItemId first and then UserId. Even when I swap the order in which I pass the two columns, the output stays the same. Does anyone know why?
This causes a further problem when using any recommender method, as it gives the error: Column name user_id does not exist.
The reason for the column ordering is that you are passing a Python dictionary to the SFrame constructor. Dictionaries in Python do not keep keys in the order they are specified; they have their own ordering. If you prefer "UserId" to be first, you can call sf.swap_columns('UserId','ItemId').
The order of the columns does not affect the recommender method, though. The "Column name user_id does not exist" error appears if you don't have a column named exactly user_id and also don't tell the method which column holds the user ids. In your case, you would want to do: graphlab.recommender.create(sf, user_id='UserId', item_id='ItemId').
Also, you may want to look at the stack method, which can help get your data into the form the recommender method expects. Your current SFrame sf will, I think, have a column of dictionaries where the item id is the key and the rating is the value. I believe this would work in this case:
sf.stack('ItemId', new_column_name=['ItemId','Rating'])
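Putting it all together, here is a minimal, untested sketch of the whole reshaping flow, assuming the column names from the question (x, y, z, x1, y1, z1 and user_id) and the pack_columns/stack SFrame methods:

import graphlab

v = graphlab.SFrame.read_csv('Data.csv')

# Pack the per-activity rating columns into one dict column of
# {activity_name: rating}, keeping user_id alongside it.
sf = v.pack_columns(columns=['x', 'y', 'z', 'x1', 'y1', 'z1'],
                    dtype=dict, new_column_name='ItemId')

# Unpack each dict entry into its own row: one (user, item, rating) triple.
sf = sf.stack('ItemId', new_column_name=['ItemId', 'Rating'])

# Column order no longer matters, because the names are given explicitly.
model = graphlab.recommender.create(sf, user_id='user_id',
                                    item_id='ItemId', target='Rating')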
I am currently trying to filter my dataframe inside an if statement and get the value of a field into a variable.
Here is my code:
from pyspark.sql.functions import col

if df_table.filter(col(field).contains("val")):
    id_2 = df_table.select(another_field)
    print(id_2)
    # Recursive call with new variable
The problem is: the if filtering seems to work, but id_2 gives me the column name and type, where I want the value itself from that field.
The output for this code is:
DataFrame[ID_1: bigint]
DataFrame[ID_2: bigint]
...
If I try collect, like this: id_2 = df_table.select(another_field).collect()
I get this: [Row(ID_1=3013848), Row(ID_1=319481), Row(ID_1=391948)...], which just lists every id in my dataframe.
I thought of doing: id_2 = df_table.select(another_field).filter(col(field).contains("val"))
but I still get the same result as in the first attempt.
I would like id_2, on each iteration of my loop, to take the value from the field I am filtering on, like:
3013848
319481
...
But not a list of every matching value in my dataframe.
Any idea how I could get that into my variable?
Thank you for helping.
In fact, dataFrame.select(colName) is supposed to return a column (a DataFrame with only that one column), not the value of that column on a given row. I see in your comment that you want to do a recursive lookup in a Spark dataframe. The thing is, firstly, Spark, as far as I know, doesn't support recursive operations. If you have a deeply recursive operation to do, you'd better collect the dataframe you have and do it on your driver without Spark. You can then use whatever library you want, but you lose the advantage of processing the data in a distributed way.
Secondly, Spark isn't designed to iterate over each record one at a time. Try to achieve what you need with joins of dataframes, but that comes back to my first point: if each join depends on the result of the previous one, in a recursive way, just forget Spark.
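To make the first point concrete, here is a minimal sketch of pulling plain values out on the driver, assuming field and another_field are the column-name variables from your snippet:

from pyspark.sql.functions import col

# Collect the matching rows to the driver; each element is a Row object.
rows = df_table.filter(col(field).contains("val")).select(another_field).collect()

# A Row's fields are accessible by name or position, so this prints plain
# values such as 3013848 instead of DataFrame[ID_1: bigint].
for row in rows:
    id_2 = row[0]
    print(id_2)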
I'm trying to create a report of the cards I have in Trello through the REST API, where I need to show in the same report the card data and the names of the members assigned to each card.
The problem is that the Trello JSON is very cumbersome: I need to make several queries and then merge the resulting data frames.
I'm currently stuck, trying to add the cardmember names to the main card data frame.
Here is a summary of the problem:
I have created the main data frame (trello_dataframe), which holds card-level information from Trello, including the "ID Members" column (trello_dataframe['ID Members']), whose values are lists, and which I need to merge with another data frame.
More info about trello_dataframe: https://prnt.sc/boC6OL50Glwu
The second data frame (df_response_members) results from a query at the board-member level and has 3 columns: ID Members (df_response_members['ID Members']), FullName (df_response_members['Member (Full Name)']), and Username (df_response_members['Member (Username)']).
More info about "df_response_members": https://prnt.sc/x6tmzI04rohs
Now I want to merge these two data frames on df_response_members['ID Members'], so that the full name and username of the card members appear in the card data frame (the main one).
The problem occurs when I try to merge the two data frames with the following code:
trello_dataframe = pd.merge(df_response_members, trello_dataframe, on="ID Members", how='outer')
which raises the error:
TypeError: unhashable type: 'list'
Here is how I would like to see the main data frame: https://prnt.sc/7PSTmG2zahZO
Thank you in advance!
You can't do that, for two reasons: A) as the error says, lists aren't hashable, and DataFrame operations typically don't work on unhashable data types; and B) you are trying to merge a list column with a string column. Both columns must have the same type in order to perform a merge.
A solution is to first use DataFrame.explode() (https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.explode.html) on your first DataFrame, trello_dataframe, with the 'ID Members' column; this generates a separate row for each 'ID Member' in each list. You can then perform the merge with this exploded DataFrame.
To convert back to your desired format you can use GroupBy, as stated here: How to implode(reverse of pandas explode) based on a column.
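A minimal sketch of that explode/merge/implode round trip, assuming the column names from your screenshots:

import pandas as pd

# Keep each card's original row id as a column, because merge() discards
# the index and we need it to re-group afterwards.
exploded = trello_dataframe.reset_index().explode('ID Members')

# Both 'ID Members' columns now hold scalar ids, so the merge works.
merged = exploded.merge(df_response_members, on='ID Members', how='left')

# Implode: collapse back to one row per card, re-aggregating the
# member-level columns into lists.
cards = (merged.groupby('index')
               .agg({'ID Members': list,
                     'Member (Full Name)': list,
                     'Member (Username)': list}))

Any other card-level columns can be kept by adding them to the agg dict with 'first'.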
I am currently trying to read in a CSV file of a bank statement in order to build a budget from it, and I want to group similar items, e.g. fuel. So I'd like to get the values from column E (a.k.a. column 5), store these values in a list, pair them with their cost, and then group them into lumps, e.g. fuel. So far, simply trying to read the column, I have the following:
temp=pd.read_csv("statement.csv",usecols=['columnE'])
print(temp)
and the following table:
(Values removed for obvious reasons.) However, when I run this I get the error "Usecols do not match columns". Why is this? I assumed I would at least get a value, even if it's not the right one.
Correct the column name to
temp=pd.read_csv("statement.csv",usecols=['Transaction Description'])
and try again
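If you're not sure what the header row actually contains, a quick check (a sketch assuming the same statement.csv) is to load just the header and print the column names pandas found; the names in usecols must match these exactly:

import pandas as pd

# Read only the header row and list the real column names; 'columnE' fails
# because the CSV's fifth column is actually named 'Transaction Description'.
print(pd.read_csv("statement.csv", nrows=0).columns.tolist())

temp = pd.read_csv("statement.csv", usecols=['Transaction Description'])
print(temp)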
So, I have a large data frame with customer names. I used the phone number and email combined to create a unique ID key for each customer. But sometimes there is a typo in the email, so two keys get created for the same customer.
Like so:
Key                         | Order #
555261andymiller#gmail.com  | 901345
555261andymller#gmail.com   | 901345
I'm thinking of grouping the keys by the phone number (a partial string of the key) and then assigning every key within each group to the first key in that group. How would I go about doing this in pandas? I've tried iterating over the rows, and I've also tried the groupby method on the partial string, but I can't seem to assign new values using it.
If you really don't care what the new ID is, you can group by the first characters of the string (which represent the phone number).
For example:
df.groupby(df.Key.str[:6]).first()
This will result in a dataframe whose index is the first six characters of the key (the phone number) and whose values come from the first entry in each customer's group. This assumes that the phone number will always be correct, though it sounds like that should not be an issue.
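If you would rather keep every row and overwrite each key with the first key of its group, as described in the question, a small sketch (with made-up data matching the example) is:

import pandas as pd

df = pd.DataFrame({
    "Key": ["555261andymiller#gmail.com", "555261andymller#gmail.com"],
    "Order #": [901345, 901345],
})

# For each phone-number group, broadcast the group's first key back onto
# every row, collapsing typo variants into one canonical key.
df["Key"] = df.groupby(df["Key"].str[:6])["Key"].transform("first")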
I have a CSV file that looks like:
customer_ID, location, ....other info..., item-bought, score
I am trying to build a collaborative filtering recommender in Spark. Spark takes data of the form:
userID, itemID, value
but my data is wider; I want all of a user's info to be used instead of just the userID. I tried grouping the columns into one column, as:
(customerID,location,....),itemID,score
but ALS.train gives me this error:
TypeError: int() argument must be a string or a number, not 'tuple'
How can I make Spark take multiple key/values and not only three columns?
Thanks.
For each customer, identify the columns you would like to use to distinguish these user-entities. Create a table (e.g. in SQL) in which each row contains the information for one user-entity, and use the row number in this table as the userID.
Do the same for your items if necessary, and provide these integer IDs to the recommender.
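As a sketch of one way to do this in Spark itself (assuming a DataFrame df with the hypothetical column names customer_ID, location, item_bought, and score from the question, and the pyspark.mllib ALS API):

from pyspark.mllib.recommendation import ALS, Rating
from pyspark.sql import functions as F
from pyspark.sql.window import Window

# Assign an integer surrogate key to each distinct user-entity, where the
# entity is defined by all the user columns you care about.
users = (df.select("customer_ID", "location").distinct()
           .withColumn("userID",
                       F.row_number().over(Window.orderBy("customer_ID")) - 1))

items = (df.select("item_bought").distinct()
           .withColumn("itemID",
                       F.row_number().over(Window.orderBy("item_bought")) - 1))

# Swap the wide keys for the integer IDs and train on plain triples.
ratings = (df.join(users, ["customer_ID", "location"])
             .join(items, ["item_bought"])
             .select("userID", "itemID", "score")
             .rdd.map(lambda r: Rating(r[0], r[1], float(r[2]))))

model = ALS.train(ratings, rank=10, iterations=10)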