Pandas: fix typos in keys within a dataframe - python

So, I have a large data frame with customer names. I used the phone number and email combined to create a unique ID key for each customer. But, sometimes there will be a typo in the email so it will create two keys for the same customer.
Like so:
Key                        | Order #
555261andymiller#gmail.com | 901345
555261andymller#gmail.com  | 901345
I'm thinking of combining all the keys based on the phone number (partial string) and then assigning all the keys within each group to the first key in every group. How would I go about doing this in Pandas? I've tried iterating over the rows and I've also tried the groupby method by partial string, but I can't seem to assign new values using this method.

If you really don't care what the new ID is, you can group by the first six characters of the string (which represent the phone number).
For example:
df.groupby(df.Key.str[:6]).first()
This will result in a dataframe indexed by phone number, where each row is the first entry of the customer record. This assumes that the phone number will always be correct, though it sounds like that should not be an issue.
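If the goal is instead to keep every row but collapse the typo variants onto one key, a `groupby(...).transform('first')` sketch like the following (using made-up data mirroring the question) may be closer to what the asker wants:

```python
import pandas as pd

# Hypothetical data mirroring the question: same phone number, email typo
df = pd.DataFrame({
    "Key": ["555261andymiller#gmail.com", "555261andymller#gmail.com"],
    "Order #": [901345, 901345],
})

# Reassign every key to the first key seen in its phone-number group,
# so typo variants collapse into a single customer ID
df["Key"] = df.groupby(df.Key.str[:6]).Key.transform("first")
```

Unlike `.first()`, `transform` returns a Series aligned to the original index, so no rows are lost.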

Related

Reading a particular column from a csv

I am currently trying to read in a CSV file to build a budget from a statement, and I want to group similar items, e.g. fuel. So I'd like to get the values from column E (a.k.a. column 5), store these values in a list, pair them with the cost, and then group them into lumps, e.g. fuel. So far, simply to read the column, I have the following:
temp=pd.read_csv("statement.csv",usecols=['columnE'])
print(temp)
and the following table:
Values removed for obvious reasons. However, when I run this I get the error Usecols do not match columns. Why is this? I assumed I would at least get a value, even if it's not the right one.
The names passed to usecols must match the header names in the file exactly; 'columnE' is a position, not a header. Correct the column name to
temp=pd.read_csv("statement.csv",usecols=['Transaction Description'])
and try again
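When the header names are uncertain, it can help to read just the header row first. A minimal sketch, using an in-memory stand-in for statement.csv (the real file's headers may differ):

```python
import pandas as pd
from io import StringIO

# Stand-in for statement.csv; the real file's headers may differ
csv_text = "Date,Transaction Description,Amount\n2020-01-02,FUEL STATION,30.00\n"

# nrows=0 reads only the header, revealing the exact names usecols expects
cols = pd.read_csv(StringIO(csv_text), nrows=0).columns.tolist()

# Now select by the real header name
temp = pd.read_csv(StringIO(csv_text), usecols=["Transaction Description"])
```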

Iteration & Computation Pandas Dataframe

As a very new beginner with Python and Pandas, I am looking for your support on an issue.
I need to iterate over columns, find the maximum value across the corresponding rows of a dataframe, and write it to a new variable for each row. The number of columns is not manageable (almost 200), so I do not want to write each required column ID manually. Most importantly, I need to start from a given column ID and continue in increments of two columns up to a given last column ID.
I would appreciate sample code; see the attachment too.
Try:
df['x']=df.max(axis=1)
Replace x with the name for your desired output column.
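That takes the maximum over all columns. Since the question asks to start from a given column and step by two, a sketch using positional slicing may be closer; the frame and the starting column "b" here are hypothetical:

```python
import pandas as pd

# Hypothetical frame; "b" is the assumed starting column
df = pd.DataFrame({
    "a": [1, 2], "b": [5, 0], "c": [9, 9], "d": [3, 8], "e": [4, 1],
})

start = df.columns.get_loc("b")                   # position of the starting column
df["row_max"] = df.iloc[:, start::2].max(axis=1)  # every second column from there
```

Here `iloc[:, start::2]` selects columns b and d, so the row maxima are taken over just those.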

Discard rows in dataframe if particular column values in list [duplicate]

I have a dataframe customers with some "bad" rows, the key in this dataframe is CustomerID. I know I should drop these rows. I have a list called badcu that says [23770, 24572, 28773, ...] each value corresponds to a different "bad" customer.
Then I have another dataframe, let's call it sales, and I want to drop all the records for the bad customers, the ones in the badcu list.
If I do the following
sales[sales.CustomerID.isin(badcu)]
I get a dataframe with precisely the records I want to drop, but if I do
sales.drop(sales.CustomerID.isin(badcu))
it returns a dataframe with the first row dropped (which is a legitimate order) and the rest of the rows intact (it doesn't delete the bad ones). I think I know why this happens, but I still don't know how to drop the incorrect customer ID rows.
You need
new_df = sales[~sales.CustomerID.isin(badcu)]
You can also use query
sales.query('CustomerID not in @badcu')
I think the best way is to drop by index; try it and let me know:
sales.drop(sales[sales.CustomerID.isin(badcu)].index.tolist())
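Both answers can be checked side by side. A self-contained sketch with made-up data (the real frames are larger):

```python
import pandas as pd

badcu = [23770, 24572, 28773]
sales = pd.DataFrame({
    "CustomerID": [10001, 23770, 24572, 10002],
    "Order": ["A", "B", "C", "D"],
})

# Boolean-mask approach: keep rows whose CustomerID is NOT in the list
kept_mask = sales[~sales.CustomerID.isin(badcu)]

# Drop-by-index approach: find the offending rows' index labels, then drop them.
# Passing the boolean Series directly to drop() fails because drop expects
# index labels, not a mask -- which is why the original attempt misbehaved.
kept_drop = sales.drop(sales[sales.CustomerID.isin(badcu)].index)
```

Both produce the same result; the mask form is the more idiomatic pandas.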

Graphlab SFrames: Error in using SFrames with the dataset

In Graphlab,
I am working with a small set of fitness data, to use recommender functions that could provide recommendations. The dataset has a user ID column but no item IDs; instead, the different items are arranged in columns, with their respective ratings in the rows corresponding to each user ID. In order to use any Graphlab recommender method, I need to have user IDs and item IDs. Here is what I did:
v = graphlab.SFrame.read_csv('Data.csv')
userId = v["user_id"]
itemId = v["x","y","z","x1","y1","z1"]  # x,y,z,x1,y1,z1 are activities that are actually the columns in Data and contain the corresponding ratings given by each user
sf= graphlab.SFrame({'UserId':userId,'ItemId':itemId})
print sf.head(5)
Basically, I extracted the user_id column from Data and tried making a column for ItemId using the x, y, z, etc. columns extracted from the same data, in order to make another SFrame with just these two columns. This code results in a tabular SFrame with two columns as expected, but not in the order I pass the arguments to SFrame: the output gives ItemId as the first column and then UserId. Even though I tried changing the order in which I pass these two to the SFrame, it still gives the same output. Does anyone know why?
This is creating a problem further when using any recommender method as it gives the error: Column name user_id does not exist.
The reason for the column ordering is because you are passing a Python dictionary to the SFrame constructor. Dictionaries in Python will not keep keys in the order they are specified; they have their own order. If you prefer "UserId" to be first, you can call sf.swap_columns('UserId','ItemId').
The order of the columns does not affect the recommender method though. The Column name 'user_id' does not exist error will appear if you don't have a column named exactly user_id AND don't specify what the name of the user_id column is. In your case, you would want to do: graphlab.recommender.create(sf, user_id='UserId', item_id='ItemId').
Also, you may want to look at the stack method, which could help get your data in to the form the recommender method expects. Your current SFrame sf I think will have a column of dictionaries where the item id is the key and the rating is the value. I believe this would work in this case:
sf.stack('ItemId', new_column_name=['ItemId','Rating'])
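For readers without GraphLab Create (it is no longer maintained), the same wide-to-long reshape can be sketched in pandas with `melt`; the data and column names here are hypothetical stand-ins for the question's activity columns:

```python
import pandas as pd

# Wide layout from the question: one column per activity, ratings in rows
v = pd.DataFrame({
    "user_id": [1, 2],
    "x": [5, 3],
    "y": [4, 2],
})

# melt turns the activity columns into (ItemId, Rating) pairs per user,
# the long form that recommender APIs generally expect
sf = v.melt(id_vars="user_id", var_name="ItemId", value_name="Rating")
```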
