Start an iteration on the first row of a group in Pandas - Python

I have a dataset like this:
Policy | Customer | Employee | CoverageDate | LapseDate
123 | 1234 | 1234 | 2011-06-01 | 2015-12-31
124 | 1234 | 1234 | 2016-01-01 | ?
125 | 1234 | 1234 | 2011-06-01 | 2012-01-01
124 | 5678 | 5555 | 2014-01-01 | ?
I'm trying to iterate through each policy for each employee of each customer (a customer can have many employees, an employee can have multiple policies) and compare the covered date against the lapse date for a particular employee. If the covered date and lapse date are within 5 days, I'd like to add that policy to a results list.
So, expected output would be:
Policy | Customer | Employee
123 | 1234 | 1234
because policy 123's lapse date was within 5 days of policy 124's covered date.
I'm running into a problem while trying to iterate through each grouping of Customer/Employee numbers. I'm able to identify how many rows of data are in each EmployeeID/Customer number (EBCN below) group, but I need to reference specific data within those rows to assign variables for comparison.
So far, I've been able to write this code:
import pandas
import datetime

wd = pandas.read_csv(DATASOURCE)

l = 0
for row, i in wd.groupby(['EMPID', 'EBCN']).size().iteritems():
    Covdt = pandas.to_datetime(wd.loc[l, 'CoverageEffDate'])
    for each in range(i):
        LapseDt = wd.loc[l, 'LapseDate']
        if LapseDt != '?':
            LapseDt = pandas.to_datetime(LapseDt) + datetime.timedelta(days=5)
            if Covdt < LapseDt:
                print('got one!')
        l = l + 1
This code is not working because I'm trying to reference the coverage date and lapse date on a particular row with the loc function, with my row number stored in the 'l' variable. I initially thought that Pandas would iterate through groups in the order they appear in my dataset, so that I could simply start with l=0 (i.e. the first row in the data), assign the coverage date and lapse date variables based on that, and then move on. However, it appears that Pandas does not iterate through the groups in that order. As a result, I do indeed get a comparison of lapse/coverage dates, but they're not associated with the groups that end up getting output by the code.
The best solution I can figure is to determine what the row number is for the first row of each group and then iterate forward by the number of rows in that group.
I've read through a question regarding finding the first row of a group, and am able to do so by using
wd.groupby(['EMPID','EBCN']).first()
but I haven't been able to figure out what row number the results are stored on in a way that I can reference with the loc function. Is there a way to store the row number for the first row of a group in a variable or something so I can iterate my coverage date and lapse date comparison forward from there?
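For reference, a minimal sketch of one way to capture the positional row number of each group's first row (same EMPID/EBCN columns as in my code above; untested against the real data):
# positional (0-based) row numbers of the first row in each group, usable with .iloc
first_rows = wd.reset_index(drop=True).groupby(['EMPID', 'EBCN']).head(1)
first_row_positions = first_rows.index.tolist()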
Regarding my general method, I've read through the question here, which is very close to what I need:
pandas computation in each group
however, I need to compare each policy in the group against each other policy in the group - the question above just compares the last row in each group against the others.
Is there a way to do what I'm attempting in Pandas/Python?

For anyone needing this information in the future - I was able to implement Boud's suggestion to use the pandas.merge_asof() function to replace my code above. I had to do some data manipulation to get the desired result:
Splitting the dataframe into two separate frames - one with CoverageDate and one with LapseDate.
Replacing the '?' (null values) in my data with a numpy.nan datatype
Sorting the left and right dataframes by the Date columns
Once the data was in the correct format, I implemented the merge:
pandas.merge_asof(cov, term,
                  on='Date',
                  by='EMP|EBCN',
                  tolerance=pandas.Timedelta('5 days'))
Note that 'cov' is my dataframe containing coverage dates and 'term' is the dataframe with lapses. The 'EMP|EBCN' column is a concatenation of the employee ID and Customer # fields, to allow easy use of the 'by' argument.
Thanks so much to Boud for sending me down the correct path!
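For anyone wanting a concrete starting point, here is a rough sketch of the preparation and merge (column names taken from the code earlier in the question; an illustration, not the exact code I ran):
import numpy
import pandas

wd = pandas.read_csv(DATASOURCE)

# '?' stands for "no lapse date" in the raw data
wd = wd.replace('?', numpy.nan)

# concatenated Employee ID / Customer # key, used for merge_asof's 'by' argument
wd['EMP|EBCN'] = wd['EMPID'].astype(str) + '|' + wd['EBCN'].astype(str)

# one frame keyed on coverage dates, one keyed on lapse dates
cov = wd[['Policy', 'EMP|EBCN', 'CoverageEffDate']].rename(columns={'CoverageEffDate': 'Date'})
term = wd[['Policy', 'EMP|EBCN', 'LapseDate']].rename(columns={'LapseDate': 'Date'})

cov['Date'] = pandas.to_datetime(cov['Date'])
term['Date'] = pandas.to_datetime(term['Date'])
term = term.dropna(subset=['Date'])   # merge_asof cannot handle null keys

# both sides must be sorted on the 'on' column
cov = cov.sort_values('Date')
term = term.sort_values('Date')

matched = pandas.merge_asof(cov, term,
                            on='Date',
                            by='EMP|EBCN',
                            tolerance=pandas.Timedelta('5 days'))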

Related

How to use qcut in a dataframe with conditions based on values from columns

I have the following scenario in a sales dataframe, each row being a distinct sale:
Category  Product  Purchase Value | new_column_A  new_column_B
A         C        30             |
B         B        50             |
C         A        100            |
I've tried to find this in the qcut documentation but can't find it anywhere: how to add a series of columns based on the following logic:
df['new_column_A'] = when category = A and product = A then pd.qcut(df['Purchase_Value'], q=4)
df['new_column_B'] = when category = A and product = B then pd.qcut(df['Purchase_Value'], q=4)
Preferably I would like this new column of percentile cuts to be created in the same original dataframe.
The first thing that comes to my mind is to split the dataframe into separate ones by doing the filtering I need, but I would like to keep all these columns in the original dataframe.
Does anyone know if this is possible and how I can do it?
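A minimal sketch of one way to do this (assuming the column names from the example above), using boolean masks with .loc so the qcut results land in the original dataframe:
import pandas as pd

# boolean masks for each condition (column names assumed from the example)
mask_a = (df['Category'] == 'A') & (df['Product'] == 'A')
mask_b = (df['Category'] == 'A') & (df['Product'] == 'B')

# qcut only the matching rows and write the bins back into new columns;
# .astype(str) avoids issues when assigning a Categorical through .loc
df.loc[mask_a, 'new_column_A'] = pd.qcut(df.loc[mask_a, 'Purchase_Value'], q=4).astype(str)
df.loc[mask_b, 'new_column_B'] = pd.qcut(df.loc[mask_b, 'Purchase_Value'], q=4).astype(str)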

Run-time crashes when using CountVectorizer to create a term frequency dataframe from a column of lists

I am new to python and doing a data analysis project and would request some help.
I have a dataframe of 400,000 rows and it has columns: ID, Type 1 Categories, Type 2 Categories, Type 3 Categories, Amount, Age, Fraud.
The Category columns are columns of lists. Each list contains different terms, from which I want to create a matrix that counts how many times a particular term occurs in that row (with a column per term holding its frequency).
So the goal is to create a dataframe of a sparse matrix, with each of those unique categories becoming a column - my dataset has over 2000 different categories - maybe that's why the count vectorizer is not good for this?
I tried two methods: one using a CountVectorizer and another using for loops,
but CountVectorizer crashes every time it runs.
The second method is far too slow. Therefore, I was wondering if there is any way to improve these solutions.
I also split the dataframe into multiple chunks, and it still causes problems.
Example:
+------+--------------------------------------------+---------+---------+
| ID | Type 1 Category | Amount | Fraud |
+------+--------------------------------------------+---------+---------+
| ID1 | [Lex1, Lex2, Lex1, Lex4, Lex2, Lex1] | 110.0 | 0 |
| ID2 | [Lex3, Lex6, Lex3, Lex6, Lex3, Lex1, Lex2] | 12.5 | 1 |
| ID3 | [Lex7, Lex3, Lex2, Lex3, Lex3] | 99.1 | 0 |
+------+--------------------------------------------+---------+---------+
col = 'Type 1 Category'
# prior to this, I combined the entire dataframe based on ID
# this was from old dataframe where each row had different occurrence of id
# and only one category per row
terms = df_old[col].unique()
countvec = CountVectorizer(vocabulary=terms)
# create bag of words
df = df.join(pd.DataFrame(countvec.fit_transform(df[col]).toarray(),
                          columns=countvec.get_feature_names(),
                          index=df.index))
# drop original column of lists
df = df.drop(col, axis=1)

##### second split dataframe to chunks using np.split
df_l3 = df_split[3]
output.index = df_l3.index
# Assign the columns.
output[['ID', '[col]']] = df_l3[['ID', '[col]']]
# split dataframe into chunks and 114305 is where the index starts
last = 114305 + int(df_l3.shape[0])
for i in range(114305, last):
    print(i)
    for word in words:
        output.ix[i, str(word)] = output[col][i].count(str(word))
The first method runs out of memory in the count vectorizer step; the second one no longer counts frequencies. The second method worked for chunk 1, where the index starts from zero, but not for the other chunks.
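Since the Category cells already hold token lists, one memory-friendlier sketch is to give CountVectorizer a pass-through analyzer and keep the counts sparse instead of calling .toarray() (frame and column names assumed from the code above):
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

col = 'Type 1 Category'

# the cells are already lists of terms, so use a pass-through analyzer
countvec = CountVectorizer(analyzer=lambda tokens: tokens)
counts = countvec.fit_transform(df[col])   # scipy sparse matrix, never densified

# keep the counts sparse inside pandas as well
# (newer scikit-learn spells the method get_feature_names_out())
bow = pd.DataFrame.sparse.from_spmatrix(counts,
                                        columns=countvec.get_feature_names(),
                                        index=df.index)
df = df.drop(col, axis=1).join(bow)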

Create a column to denote # of occurrence based on ID, and take difference of dates based on ID

I have a CSV dataset of roughly 250k rows that is stored as a dataframe using pandas.
Each row is a record of a client coming in. A client could come in multiple times, which would result in multiple records being created. Some clients have only ever come in once, other clients come in dozens of times.
The CSV dataset has many columns that I use for other purposes, but the ones that this specific problem uses include:
CLIENT_ID | DATE_ARRIVED
0001 1/01/2010
0002 1/02/2010
0001 2/01/2010
0001 2/22/2010
0002 4/01/2010
....
I am trying to create a new column that would assign a number denoting what # occurrence the row is based on the ID. Then if there is a # occurrence >1, have it take the difference of the dates in days from the prior occurrence.
Important note:
The dataset is not ordered, so the script has to be able to determine which is the first based on the earliest date. If the client came in multiple times in the day, it would look at which is the earliest time within the date.
I tried creating a set of the CLIENT_IDs, then looping through each element in the set to get the count. This gives me the total count, but I can't figure out how to get it to create a new column with those incrementally increasing counts.
I haven't gotten far enough to the DATE_ARRIVED differences based on # occurrence.
Nothing viable, hoping to get some ideas! If there is an easier way to determine differences between two dates next to each other for a client, I'm also open to ideas! I have a way of doing this manually through Excel, which involves:
ordering the dataset by ID and date,
checking each to see if the ID before was equal (and if it is, increment by 1)
creating a new column that takes the difference of the above only if the previous number was >1
... but I have no idea how to do this in Python.
The output should look something like:
CLIENT_ID | DATE_ARRIVED | OCCURRENCE | DAYS_SINCE_LAST
0001 1/01/2010 1 N/A
0002 1/02/2010 1 N/A
0001 2/01/2010 2 31
0001 2/22/2010 3 21
0002 4/01/2010 2 90
Using groupby with transform count + diff
df['OCCURRENCE'] = df.groupby('CLIENT_ID').CLIENT_ID.transform('count')
df['DAYS_SINCE_LAST'] = df.groupby('CLIENT_ID')['DATE_ARRIVED'].diff().dt.days
df
Out[45]:
   CLIENT_ID DATE_ARRIVED  OCCURRENCE  DAYS_SINCE_LAST
0          1   2010-01-01           2              NaN
1          2   2010-01-02           1              NaN
2          1   2010-02-01           2             31.0
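If an incremental visit number (1, 2, 3, ...) is wanted, as in the expected output above, a cumcount-based variant may be closer - a sketch, assuming DATE_ARRIVED still needs to be parsed:
import pandas as pd

df['DATE_ARRIVED'] = pd.to_datetime(df['DATE_ARRIVED'])
df = df.sort_values(['CLIENT_ID', 'DATE_ARRIVED'])

# incremental visit number per client, then day gaps between consecutive visits
df['OCCURRENCE'] = df.groupby('CLIENT_ID').cumcount() + 1
df['DAYS_SINCE_LAST'] = df.groupby('CLIENT_ID')['DATE_ARRIVED'].diff().dt.days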

Pandas performance: loc vs using .head(x) in first x rows using Sorted Data

Suppose we have a dataframe of 100,000 rows and 3 columns, structured like so:
| visitorId | timestamp | url |
1 | 1 | 11 | A |
2 | 1 | 12 | B |
3 | 2 | 21 | A |
4 | 3 | 31 | A |
5 | 3 | 32 | C |
.
.
n | z | Z1 | A |
This dataframe is always sorted, and stored in a variable called sortedData. First, we extract all the unique visitor IDs into a list called visitors, and create a paths list to hold all the URLs each visitor has visited.
visitors = sortedData.visitorId.unique()
visitors = visitors.tolist()
paths = []
visitor_length = len(visitors)
Now what I did was go into a loop for each visitor in order to find and store each visitor's traversed path, their timestamps and ID, and use it as an input in a secondary algorithm (not relevant here).
I went two ways to do this:
A:
for visitor in visitors:
    dataResult = sortedData.loc[sortedData['visitorId'] == visitor]
    timestamps = dataResult.timestamp.tolist()
    path = dataResult.pageUrl.tolist()
    paths.append((visitor, timestamps, path))
    sortedData = sortedData.iloc[len(dataResult):]
This uses the built-in pandas loc function to find the rows with the matching visitorId and extract the timestamps and paths into lists, and finally append them together. Then it goes on to delete the first x rows (equal to the length of the query result, aka the number of matches) in order to not traverse them in the future when doing similar matches.
The process is repeated for every unique visitor in the visitors list.
By timing this, I found that it took about 6.31 seconds to complete. In my case this was an 11.7 MB file of 100,000 rows. On a 1.2 GB file this scaled up to 14 hours. As such, I tried way B hoping for a speedup.
Way B uses the logic that the data is always sorted, as such visitor 1 in visitors will always be the 1st visitor in sortedData, visitor 2 the 2nd etc. So I could use the pandas value_counts() function to count the occurrences x of the current visitor, and simply extract the data from the first x rows using head(x), since they will always match. This way it would not have to iterate and search through the whole dataframe every time. Then as before, I remove those rows from the dataframe and repeat the loop for the next visitor.
B:
for visitor in visitors:
    x = sortedData.visitorId.value_counts()[visitor]
    timestamps = sortedData.timestamp.head(x).tolist()
    path = sortedData.pageUrl.head(x).tolist()
    paths.append((visitor, timestamps, path))
    sortedData = sortedData.iloc[x:]
To my surprise, this made it almost twice as slow, with a 10.89-second runtime compared to the 6.31 seconds from A.
Timing loc and value_counts() on their own, it appeared that the latter was faster; however, when used in the loop the opposite is true.
Considering in B we know the positions of the visitors and we only have to iterate through the first x rows of the dataframe, and in A we have to search the whole dataframe every time, what causes this difference in performance?
In a previous optimization, deleting the already-traversed rows gave a considerable speedup, roughly doubling each time the dataframe size was halved, in contrast to leaving it whole. This made me suspect that way A iterates through the whole dataframe every time, unless I am missing something?
I am using a MacBook Pro 2014, Python 3.6.4(Anaconda) running on PyCharm 2018.
Creating your own list of visitors, iterating over those and searching the data frame again and again is not ideal.
Have a look at groupby, which can be used in your situation, if I understand your question correctly.
To have code similar to what you have now, you can start like this:
grouped = sortedData.groupby('visitorId')

for visitorId, group in grouped:
    print(visitorId)
    # your code here
    custom_url_algorithm(group)
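For completeness, a sketch of how the whole paths list could be built in a single groupby pass, keeping the order of appearance (pageUrl as in the question's code):
# one pass over the data: each group already holds one visitor's rows in order
paths = [
    (visitor_id, group['timestamp'].tolist(), group['pageUrl'].tolist())
    for visitor_id, group in sortedData.groupby('visitorId', sort=False)
]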

How to avoid Pandas Groupby key error when a GroupBy object might not contain a certain key

I am doing some analysis over a dataframe, with one of the columns being an integer with values either 0 or 1 (Sort of boolean, but in integer form). It looks something like this:
Nat. | Result
-------|-------
CA | 1
USA | 0
GB | 1
USA | 1
CA | 0
GB | 1
I grouped the data according to the nationality column, and one of the values (GB in the example above) produced - by chance - a group whose members were all 1. This has created a problem because I have a function that I call a lot that contains group_obj.get_group(0), and this causes a runtime error "KeyError: 0".
My question: I want to create the logic that follows:
if (group_obj contains key 0):
    return group_obj.get_group(0)
else:
    print "Group Object contains no 0s"
    return null
Thanks
I am using Python 2, pandas and IPython Notebook.
OK, so here is how I was able to do it:
if key1 in group_obj.groups.keys():
    # Do processing
So, the groups attribute of a groupby object already stores the available keys, and keys() on it can be checked directly.
Use value_counts, unstack the result to get the results in columns and then use fillna(0) to replace all NaNs.
>>> df.groupby('Nationality').Result.value_counts().unstack().fillna(0)
Result       0  1
Nationality
CA           1  1
GB           0  2
USA          1  1
To get a group from a groupby object, returning an empty data frame instead of an error if the group doesn't exist, you can do this:
def get_group(key, dataframe_group):
    if key in dataframe_group.groups.keys():
        return dataframe_group.get_group(key)
    else:
        original_df = dataframe_group.obj
        return original_df.drop(original_df.index)
Basically, it first checks whether the key exists in the groupby object, and if it doesn't, it returns the original data frame but without any contents.
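Example usage of the helper above, assuming the columns are literally named 'Nat.' and 'Result' as in the question's table:
# GB has no rows with Result == 0, so get_group(0) would normally raise KeyError
gb_grouped = df[df['Nat.'] == 'GB'].groupby('Result')

ones = get_group(1, gb_grouped)    # the GB rows where Result == 1
zeros = get_group(0, gb_grouped)   # empty DataFrame instead of a KeyError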
