I have two dataframes that I want to merge. One contains data on "assessments" done on particular dates for particular clients. The second contains data on different categories of "services" performed for clients on particular dates. See sample code below:
assessments = pd.DataFrame({'ClientID' : ['212','212','212','292','292'],
'AssessmentDate' : ['2018-01-04', '2018-07-03', '2019-06-10', '2017-08-08', '2017-12-21'],
'Program' : ['Case Mgmt','Case Mgmt','Case Mgmt','Coordinated Access','Coordinated Access']})
ClientID AssessmentDate Program
212 2018-01-04 Case Mgmt
212 2018-07-03 Case Mgmt
212 2019-06-10 Case Mgmt
292 2017-08-08 Coordinated Access
292 2017-12-21 Coordinated Access
services = pd.DataFrame({'ClientID' : ['212','212','212','292','292'],
'ServiceDate' : ['2018-01-02', '2018-04-08', '2018-05-23', '2017-09-08', '2017-12-03'],
'Assistance Navigation' : ['0','1','1','0','1'],
'Basic Needs' : ['1','0','0','1','2']})
ClientID ServiceDate Assistance Navigation Basic Needs
212 2018-01-02 0 1
212 2018-04-08 1 0
212 2018-05-23 1 0
292 2017-09-08 0 1
292 2017-12-03 1 2
I want to know how many services of each service type (Assistance Navigation and Basic Needs) occur between consecutive assessments of the same program. In other words, I want to append two columns to the assessments dataframe named 'Assistance Navigation' and 'Basic Needs' that tell me how many Assistance Navigation services and how many Basic Needs services have occurred since the last assessment of the same program. The resulting dataframe would look like this:
assessmentsFinal = pd.DataFrame({'ClientID' : ['212','212','212','292','292'],
'AssessmentDate' : ['2018-01-04', '2018-07-03', '2019-06-10', '2017-08-08', '2017-12-21'],
'Program' : ['Case Mgmt','Case Mgmt','Case Mgmt','Coordinated Access','Coordinated Access'],
'Assistance Navigation' : ['0','2','0','0','1'],
'Basic Needs' : ['0','0','0','0','3']})
ClientID AssessmentDate Program Assistance Navigation Basic Needs
212 2018-01-04 Case Mgmt 0 0
212 2018-07-03 Case Mgmt 2 0
212 2019-06-10 Case Mgmt 0 0
292 2017-08-08 Coordinated Access 0 0
292 2017-12-21 Coordinated Access 1 3
Of course, the real data has many more service categories than just 'Assistance Navigation' and 'Basic Needs' and the number of services and assessments is huge. My current attempt uses loops (which I know is a Pandas sin) and takes a couple of minutes to run, which may pose problems when our dataset gets even larger. Below is the current code for reference. Basically we loop through the assessments dataframe to get the ClientID and the date range and then we go into the services sheet and tally up the service type occurrences. There's got to be a quick and easy way to do this in Pandas but I'm new to the game. Thanks in advance.
servicesDict = {}
prevClient = -1
prevDate = ""
prevProg = ""
categories = ["ClientID","ServiceDate","Family Needs","Housing Navigation","Housing Assistance","Basic Needs","Professional","Education","Financial Aid","Healthcare","Counseling","Contact","Assistance Navigation","Referral","Misc"]
for index, row in assessmentDF.iterrows():
    curClient = row[0]
    curDate = datetime.strptime(row[1], '%m/%d/%y')
    curProg = row[7]
    curKey = (curClient, curDate)
    if curKey not in servicesDict:
        services = [curClient, curDate, 0,0,0,0,0,0,0,0,0,0,0,0,0]
        servicesDict.update({curKey : services})
    services = servicesDict[curKey]
    # if the client and program match the previous assessment, tally the services between the two dates
    if curClient == prevClient and curProg == prevProg:
        boundary = serviceDF[serviceDF['ClientID'] == curClient].index
        for x in boundary:
            curRowSer = serviceDF.iloc[x]
            curDateSer = datetime.strptime(curRowSer[1], '%m/%d/%y')
            if curDateSer >= prevDate and curDateSer < curDate:
                serviceCategory = curRowSer[5]
                i = categories.index(serviceCategory)
                services[i] = services[i] + 1
                servicesDict.update({curKey : services})
    prevClient = curClient
    prevDate = curDate
    prevProg = curProg
servicesCleaned = pd.DataFrame.from_dict(servicesDict, orient='index', columns=categories)
# then merge into assessments on ClientID and AssessmentDate
One way would be like this. You'll probably have to tweak it for your original dataset, and check the edge cases.
assessments['PreviousAssessmentDate'] = assessments.groupby(['ClientID', 'Program']).AssessmentDate.shift(1, fill_value='0000-00-00')
df = assessments.merge(services, on='ClientID', how='left')
df[df.columns[5:]] = df[df.columns[5:]].multiply((df.AssessmentDate > df.ServiceDate) & (df.PreviousAssessmentDate < df.ServiceDate), axis=0)
df = df.groupby(['ClientID', 'AssessmentDate', 'Program']).sum().reset_index()
ClientID AssessmentDate Program Assistance Navigation Basic Needs
0 212 2018-01-04 Case Mgmt 0 1
1 212 2018-07-03 Case Mgmt 2 0
2 212 2019-06-10 Case Mgmt 0 0
3 292 2017-08-08 Coordinated Access 0 0
4 292 2017-12-21 Coordinated Access 1 3
Logic
We shift AssessmentDate by 1 within each (ClientID, Program) group to determine the previous assessment date.
We merge the two dataframes on ClientID.
We zero out the service type columns wherever the ServiceDate does not fall between PreviousAssessmentDate and AssessmentDate.
We group by ClientID, AssessmentDate and Program and do a sum().
Assumptions
Service type columns are integers
Your dataframe is sorted on AssessmentDate (for the shift)
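If your real data stores the service columns as strings, as in the sample constructors above, a small preparation step along these lines (a sketch; the column names are taken from the sample) covers both assumptions:
# Cast the service category columns to integers and sort by assessment date,
# since the multiply/groupby approach above relies on both.
service_cols = ['Assistance Navigation', 'Basic Needs']
services[service_cols] = services[service_cols].astype(int)
assessments = assessments.sort_values(['ClientID', 'Program', 'AssessmentDate']).reset_index(drop=True)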
Related
I have an issue where I need to track the progression of patients insurance claim statuses based on the dates of those statuses. I also need to create a count of status based on certain conditions.
DF:
ClaimID  New                       Accepted                  Denied  Pending                   Expired                   Group
001      2021-01-01T09:58:35:335Z  2021-01-01T10:05:43:000Z                                                              A
002      2021-01-01T06:30:30:000Z  2021-03-01T04:11:45:000Z          2021-03-01T04:11:53:000Z                            A
003      2021-02-14T14:23:54:154Z  2021-02-15T11:11:56:000Z                                    2021-02-15T11:15:00:000Z  A
004      2021-02-14T15:36:05:335Z  2021-02-14T17:15:30:000Z                                                              A
005      2021-02-14T15:56:59:009Z  2021-03-01T10:05:43:000Z                                                              A
In the above dataset, we have 6 columns. ClaimID is simple and just indicates the ID of the claim. New, Accepted, Denied, Pending, and Expired indicate the status of the claim and the day/time those statuses were set.
What I need to do is get a count of how many claims are New on each day and how many move out of New into another status. For example, there are 2 new claims on 2021-01-01. On that same day, 1 moved to Accepted about 7 minutes later. Thus on 2021-01-01 the table of counts would read:
DF_Count:
Date        New   Accepted  Denied  Pending  Expired
2021-01-01  2     1         0       0        0
2021-01-02  1     0         0       0        0
2021-01-03  1     0         0       0        0
2021-01-04  1     0         0       0        0
2021-01-05  1     0         0       0        0
....        ....  ....      ....    ....     ....
2021-02-14  4     2         0       0        0
2021-02-15  2     3         0       0        1
2021-02-16  2     2         0       0        0
A few conditions:
If a claim moves from one status to another on the same day (even if they are minutes/hours apart), it is not subtracted from the original status until the next day. This can be seen on 2021-01-01, where claim 001 moves from New to Accepted on the same day but is not subtracted from New until 2021-01-02.
Until something happens to a claim, it should remain in its original status. Claim 002 remains in New until 2021-03-01, when it is accepted.
If a claim changes status on a later date than its original status, it is subtracted on that later date. For this, see claim 003. It is New on 2/14 but Accepted on 2/15. This is why New goes down by 2 on 2/15 (the other claim is 004, which is New and Accepted on the same day).
For certain statuses, I do not need to look at all columns. For example, for New I only look at the dates inside Accepted and Denied, not Pending and Expired. When I do these same steps for Accepted, I no longer need to look at New, just the other columns. How would I do that?
In the final DF_Count table, the dates should start from the earliest date in 'New' and end on today's date.
The code needs to be grouped by the Group column as well. For example, patients in group B (not pictured) must have the same start and end dates, but for their own claims.
I need to do this separately for all of the statuses, not just New.
Current Solution:
My current solution has been to create a dataset with just the dates from the min New date to today's date. Then, for each column, I use the .loc method to find dates that are greater than New in each of the other columns. For example, in the code below I look for all cases where New is equal to Accepted (a same-day acceptance).
df1 = df.loc[(df['New'] == df['Accepted']) &
             ((df['Expired'].isnull()) | (df['Expired'] >= df['Accepted'])) &
             ((df['Pending'].isnull()) | (df['Pending'] >= df['Accepted'])) &
             ((df['Denied'].isnull()) | (df['Denied'] >= df['Accepted']))]
newtoaccsday = df1.loc[:, ('Group', 'Accepted')]
newtoaccsday['Date'] = newtoaccsday['Accepted']
newtoaccsday = newtoaccsday.reset_index(drop=True)
newtoaccsday = newtoaccsday.groupby(['Date', 'Group'], as_index=False)['Accepted'].value_counts()
newtoaccsday.drop(columns={'Accepted'}, inplace=True)
newtoaccsday.rename(columns={'count': 'NewAccSDay'}, inplace=True)
newtoaccsday['Date'] = newtoaccsday['Date'] + timedelta(1)
df_count = df_count.merge(newtoaccsday, how='left', on=['Date', 'Group']).fillna(0)
After doing the above steps for all conditions (where New goes to Accepted on a later date, etc.), I do the final calculation for New:
df_count['New'] = df_count.eval('New = New - (NewAccSDay + NewAccLater + NewDenSDay + NewDenLater + NewExpLater + NewPendSDay + NewPendLater)').groupby(['Tier2_ID', 'ClaimType'])['New'].cumsum()
Any and all help would be greatly appreciated. My method above is extremely inefficient and is leading to some errors. Do I need to write a for loop for this? What is the best way to go about it?
First convert the date columns with something like
for i in ['New', 'Accepted', 'Denied', 'Pending', 'Expired']:
    df[i] = pd.to_datetime(df[i], format="%Y-%m-%dT%H:%M:%S:%f%z")
Then build the applicable date range based on your column conditions. In this logic, if Denied is present the range is New --> Denied; otherwise, if Accepted is present, New --> Accepted; otherwise New --> today. Something like the following (alter as per your rules):
from datetime import datetime  # needed for the "no acceptance" branch

df['new_range'] = df[['New', 'Accepted', 'Denied']].apply(
    lambda x: pd.date_range(x['New'], x['Denied']).date.tolist() if pd.notnull(x['Denied'])
    else pd.date_range(x['New'], x['Accepted']).date.tolist() if pd.notnull(x['Accepted'])
    else pd.date_range(x['New'], datetime.today()).date.tolist(),
    axis=1)
You should be able to filter on a group and see the date ranges in your df like:
df[df['Group']=='A']['new_range']
0 [2021-01-01]
1 [2021-01-01, 2021-01-02, 2021-01-03, 2021-01-0...
2 [2021-02-14]
3 [2021-02-14]
4 [2021-02-14, 2021-02-15, 2021-02-16, 2021-02-1..
Then you can explode the date ranges and group by date to get the New counts for each day, with code like:
new = pd.to_datetime(df[df['Group']=='A']['new_range'].explode('Date')).reset_index()
newc = new.groupby('new_range').count()
newc
new_range
2021-01-01 2
2021-01-02 1
2021-01-03 1
2021-01-04 1
2021-01-05 1
2021-01-06 1...
Similarly, get counts for Accepted and Denied, then left join on date to arrive at the final table, filling NA with 0.
By encoding your rules in the expanded date range, then exploding over the date range and grouping to get your counts, you should be able to avoid most of the expensive row-by-row work.
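As a rough sketch of how the pieces could then be assembled (the single-group filter, the daily_counts helper and the reindex to "earliest New date through today" are assumptions to adapt to your own rules):
import pandas as pd
from datetime import datetime

def daily_counts(s, name):
    # s is a Series of per-claim date lists (like df['new_range']); explode it and
    # count how many claims carry that status on each calendar day
    out = s.explode().value_counts().sort_index()
    out.name = name
    return out

grp = df[df['Group'] == 'A']
full_range = pd.date_range(grp['New'].min().date(), datetime.today().date()).date

df_count = pd.concat(
    [daily_counts(grp['new_range'], 'New')],  # build Accepted/Denied/... ranges the same way and add them here
    axis=1
).reindex(full_range).fillna(0).astype(int)
df_count.index.name = 'Date'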
I think this is what you want, or it can easily be modified to your needs:
import pandas as pd
import numpy as np
from datetime import timedelta
from datetime import date
def dateRange(d1, d2):
    return [d1 + timedelta(days=x) for x in range((d2 - d1).days)]

def addCount(dic, group, dat, cat):
    if group not in dic:
        dic[group] = {}
    if dat not in dic[group]:
        dic[group][dat] = {}
    if cat not in dic[group][dat]:
        dic[group][dat][cat] = 0
    dic[group][dat][cat] += 1

df = pd.read_csv("testdf.csv",
                 parse_dates=["New", "Accepted", "Denied", "Pending", "Expired"])

cdic = {}
for i, row in df.iterrows():
    cid = row["ClaimID"]
    dnew = row["New"].date()
    dacc = row["Accepted"].date()
    dden = row["Denied"].date()
    dpen = row["Pending"].date()
    dexp = row["Expired"].date()
    group = row["Group"]

    if not pd.isna(dacc):  # Claim has been accepted
        if dnew == dacc:
            dacc += timedelta(days=1)
        nend = dacc
        addCount(cdic, group, dacc, "acc")
    if not pd.isna(dden):  # Claim has been denied
        if dnew == dden:
            dden += timedelta(days=1)
        if pd.isna(dacc):
            nend = dden
        addCount(cdic, group, dden, "den")
    if not pd.isna(dpen):
        addCount(cdic, group, dpen, "pen")  # Claim is pending
    if not pd.isna(dexp):
        addCount(cdic, group, dexp, "exp")  # Claim is expired
    if pd.isna(dacc) and pd.isna(dden):
        nend = date.today() + timedelta(days=1)
    for d in dateRange(dnew, nend):  # Fill New status until first change
        addCount(cdic, group, d, "new")

ndfl = []
for group in cdic:
    for dat in sorted(cdic[group].keys()):
        r = cdic[group][dat]
        ndfl.append([group, dat, r.get("new", 0), r.get("acc", 0),
                     r.get("den", 0), r.get("pen", 0), r.get("exp", 0)])
ndf = pd.DataFrame(ndfl, columns=["Group", "Date", "New", "Accepted", "Denied", "Pending", "Expired"])
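If you also need every calendar day from the earliest New date through today to appear in the output (per the requirement in the question), one way is to reindex each group onto a full daily range. A sketch, assuming ndf was built as above with Date holding datetime.date values:
import pandas as pd
from datetime import date

full_days = pd.date_range(ndf['Date'].min(), date.today()).date  # earliest New ... today

ndf_a = (ndf[ndf['Group'] == 'A']
           .drop(columns='Group')
           .set_index('Date')
           .reindex(full_days, fill_value=0)  # insert the days with no recorded events
           .rename_axis('Date')
           .reset_index())
ndf_a.insert(0, 'Group', 'A')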
I have code that searches for the closest value between 2 CSV files. It reads a CSV file called "common_list" with some database which looks like this:
common_name  common_Price  common_Offnet  common_Traffic
name1        1300          250            13000
name2        1800          350            18000
The code puts these CSV rows into a list and then creates NumPy arrays.
common_list = pd.read_csv("common_list.csv")
common_list_offnet = common_list["common_Offnet"].to_list()
common_list_traffic = common_list["common_Traffic"].to_list()
array_offnet = np.array(common_list_offnet)
array_traffic = np.array(common_list_traffic)
array = np.column_stack((array_offnet,array_traffic))
We use this CSV file as a database for available cell phone plans (name of the plan, price, offnet calls, and internet traffic).
Then, the code reads another CSV file called "By_ARPU" with 100k+ rows with users and how they use their cell phone plans (how much money they spend (the price of plan), how much offnet calls, and traffic). The headers of this CSV file look like this:
User ID    ARPU_AVERAGE    Offnet Calls    Traffic (MB)
where ARPU_AVERAGE corresponds to the amount of money users spend (the price they pay). The code finds the closest value between the CSV files by 2 parameters: Offnet calls and Traffic (MB).
csv_data = pd.read_csv("By_ARPU.csv")
data = csv_data[['Offnet Calls', 'Traffic (MB)']]
data = data.to_numpy()
sol = []
for target in data:
    dist = np.sqrt(np.square(array[:, np.newaxis] - target).sum(axis=2))
    idx = np.argmin(dist)
    sol.append(idx)
csv_data["Suggested Plan [SP]"] = common_list['common_name'][sol].values
csv_data["SP: Offnet Minutes"] = common_list['common_Offnet'][sol].values
csv_data["SP: Traffic"] = common_list['common_Traffic'][sol].values
csv_data.to_csv ('4.7 final.csv', index = False, header=True)
It finds the closest value from the database and shows the name and the corresponding offnet calls and traffic. For example, if in the file "By_ARPU" the values for Offnet Calls and Traffic (MB) were 250 and 13000 respectively, it will show the name of the closest match from "common_list", which is name1.
I wanted to create additional code for the same search but with 3 parameters. You can see that the first database "common_list" has 3 parameters: common_Price, common_Offnet and common_Traffic. In the previous code, we found the closest value by only 2 of them.
The columns that correspond to each other in the two CSV files were: "common_Offnet" from common_list - "Offnet Calls" from By_ARPU, and "common_Traffic" from common_list - "Traffic (MB)" from By_ARPU.
And I want:
Find the closest value by 3 parameters: Price, Offnet Calls and Traffic. The column that corresponds to the price in the "By_ARPU" file is called "ARPU_AVERAGE".
Please help to modify the code to find the closest value by searching on those 3 parameters instead of 2.
Input data:
>>> plans
common_name common_Price common_Offnet common_Traffic
0 plan1 1300 250 13000
1 plan2 1800 350 18000
>>> df
User ID ARPU_AVERAGE Offnet Calls Traffic (MB)
0 Louis 1300 250 13000 # plan1 for sure (check 1)
1 Paul 1800 350 18000 # plan2, same values (check 2)
2 Alex 1500 260 14000 # plan1, probably
Create your matching function:
def rmse(user, plans):
    u = user[['ARPU_AVERAGE', 'Offnet Calls', 'Traffic (MB)']].values.astype(float)
    p = plans[['common_Price', 'common_Offnet', 'common_Traffic']].values
    plan = np.sqrt(np.square(np.subtract(p, u)).mean(axis=1)).argmin()
    return plans.iloc[plan]['common_name']
df['Best Plan'] = df.apply(rmse, axis="columns", plans=plans)
Output:
>>> df
User ID ARPU_AVERAGE Offnet Calls Traffic (MB) Best Plan
0 Louis 1300 250 13000 plan1
1 Paul 1800 350 18000 plan2
2 Alex 1500 260 14000 plan1
Edit: Full code with your variable names:
common_list = pd.read_csv("common_list.csv")
csv_data = pd.read_csv("By_ARPU.csv")
find_the_best_plan = lambda target: np.sqrt(np.square(np.subtract(array, target)).mean(axis=1)).argmin()
array = common_list[['common_Price', 'common_Offnet', 'common_Traffic']].values
data = csv_data[['ARPU_AVERAGE', 'Offnet Calls', 'Traffic (MB)']].values
sol = np.apply_along_axis(find_the_best_plan, 1, data)
csv_data["Suggested Plan [SP]"] = common_list['common_name'].iloc[sol].values
csv_data["SP: Offnet Minutes"] = common_list['common_Offnet'].iloc[sol].values
csv_data["SP: Traffic"] = common_list['common_Traffic'].iloc[sol].values
Imagine 2 Dataframes, where MasterDB (df1) is a Central Database and LatestResults (df2) is an incoming CSV I retrieve weekly.
df1 = pd.read_sv("Master.csv")
#with columns:
Index(['Timestamp', 'ID', 'Status', 'Username', 'URL', 'Last Seen'],
dtype='object')
Shape(1000,6)
df2 = pd.read_sv("LatestResults.csv")
#with same columns:
Index(['Timestamp', 'ID', 'Status', 'Username', 'URL', 'Last Seen'],
dtype='object')
Shape(300,6)
Based on certain conditions, I would like to iterate through my (new) items in df2 and treat items based on the following 3 conditions (assuming there's only 3 for now):
a) if new item does not pre-exist (for non-matching "URL")
append to MasterDB (df1)
set 'Status' = "Net New" (in df1)
set 'Last Seen' = today.date (in df1)
b) if new item pre-exists, edit its status to "Still Active" (for matching "URL", regardless of "ID")
set 'Status' to "Still Active"
set 'Last Seen' = today.date
c) if item in MasterDB (df1) is not found in LatestResults (only for matching "ID" AND "URL")
set 'Status' to 'Expired'
Here's the code I have so far:
def function(input1, input2):
    df1 = pd.read_csv(input1)
    df2 = pd.read_csv(input2)
    df1['Status'] = df1['Status'].astype(str)
    df1['Last Seen'] = df1['Last Seen'].astype(str)
    for i, row in df1.iterrows():
        tmp = df2.loc[df2['URL'] == row['URL']]
        # A) STILL ACTIVE
        if not tmp.empty:
            df1.at[i, 'Last Seen'] = str(datetime.now().replace(second=0, microsecond=0))
            df1.at[i, 'Status'] = "STILL ACTIVE"
    for i, row in df1.iterrows():
        tmp = df2.loc[(df2['URL'] == row['URL']) & (df2['ID'] == row['ID'])]
        # B) NOT FOUND // EXPIRED
        if tmp.empty:
            df1.at[i, 'Status'] = "EXPIRED"
    for i, row in df2.iterrows():
        tmp = df1.loc[(df1['URL'] == row['URL']) & (df1['ID'] == row['ID'])]
        # C) NET NEW
        if tmp.empty:
            df1.at[i, 'Last Seen'] = str(datetime.now().replace(second=0, microsecond=0))
            row["Status"] = "NET NEW"
            df1 = df1.append(row, ignore_index=True)
    df1['Last Seen'].fillna("Not Set", inplace=True)
    df1.to_csv("MasterDB.csv", index=False)
This solution works for the most part. New items are appended and (some) pre-existing items are labeled correctly. The problems I'm having are:
When matching still-active records, if an item is untouched (which is most cases), its value gets overwritten to NaN. How can I fix this?
For my third case, "Expired", I want to be strict with the AND condition and only set the status to expired if both conditions are met. It does not seem to work. Is this the right way to think about it?
I've explored using pandas merge, join and map, but persevered with the approach above. After a few days on this, though, I realize it's bulky. I'd love to hear ideas about how to design this differently. Many thanks in advance.
I'm going to share with you how I would implement a solution for your problem. Perhaps a different point of view can help you see what's wrong in your code.
First, the data that I used. This is my LatestResults.csv
ID Status Username URL Last Seen
0 1 new Harry Potter google 2020-10-20 12:00:00
1 20 new Hermione Granger yahoo 2020-10-20 12:00:00
2 3 new Ron Weasley twitter 2020-10-20 12:00:00
3 4 new Hagrid instagram 2020-10-20 12:00:00
And this is my MasterDB.csv
ID Status Username URL Last Seen
0 1 blabla Harry Potter google 2020-10-20 12:00:00
1 2 blabla Hermione Granger yahoo 2020-10-20 12:00:00
2 3 blabla Ron Weasley bing 2020-10-20 12:00:00
3 4 blabla McGonagall facebook 2020-10-20 12:00:00
4 5 blabla Dumbledore google 2020-10-20 12:00:00
5 6 blabla You-Know-Who badoo 2020-10-20 12:00:00
I tried to include all possible combinations:
Harry Potter has the same ID and URL in both DataFrames
Hermione Granger has the same URL, but different ID
Ron Weasley has the same ID, but different URL
McGonagall has a unique URL, but the same ID as Hagrid
Dumbledore has a unique ID, but the same URL as Harry Potter
You-Know-Who has a unique ID and a unique URL
Hagrid has a unique URL, but he isn't in your database yet
I know the username is not used anywhere in the code, but these names help us understand what's going on.
Now the code.
def new_function(db_file, input_file):
    today = datetime.now().replace(second=0, microsecond=0)
    df_input = pd.read_csv(input_file)
    df_database = pd.read_csv(db_file)
    df_database['Last Seen'] = df_database['Last Seen'].astype(str)

    # Filter for input data
    net_new_filter = ~ df_input['URL'].isin(df_database['URL'])

    # Filters for DB data
    still_active = df_database['URL'].isin(df_input['URL'])
    expired_filter = (pd.merge(df_database, df_input, on=['ID', 'URL'],
                               how='left', indicator=True)['_merge'] != 'both')
    expired_filter = expired_filter & (~ still_active)

    # Applying filters
    df_database.loc[still_active, 'Status'] = 'Still Active'
    df_database.loc[still_active, 'Last Seen'] = str(today)
    df_database.loc[expired_filter, 'Status'] = 'Expired'

    new_data = df_input[net_new_filter].copy()
    new_data['Status'] = 'Net New'
    new_data['Last Seen'] = str(today)

    final_database = pd.concat([df_database, new_data],
                               ignore_index=True)
    final_database.to_csv("Test_Data/Output.csv", index=False)
I enjoy writing DataFrame filters separately, since it makes it clearer which conditions you are applying.
You can use the .isin() method to check whether the elements of a column appear anywhere in another column. It directly detects the "Still Active" and the "Net New" conditions.
Then, if you want to check whether two columns of the same row are simultaneously equal to two other columns, you can use the .merge() method with indicator=True, as described here. However, we must not overwrite the "Still Active" condition, hence the additional expression expired_filter = expired_filter & (~ still_active).
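For reference, here is the indicator trick in isolation (a minimal sketch with made-up rows, not taken from the answer's data):
import pandas as pd

db = pd.DataFrame({'ID': [1, 2], 'URL': ['google', 'bing']})
new = pd.DataFrame({'ID': [1, 3], 'URL': ['google', 'bing']})

# indicator=True adds a '_merge' column saying whether each left-hand row
# found a match on both ID and URL in the right-hand frame
merged = pd.merge(db, new, on=['ID', 'URL'], how='left', indicator=True)
no_exact_match = merged['_merge'] != 'both'
print(no_exact_match.tolist())  # [False, True] -> only the first db row matches on both keys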
Once you run the code, the following DataFrame will be saved:
ID Status Username URL Last Seen
0 1 Still Active Harry Potter google 2020-10-23 20:18:00
1 2 Still Active Hermione Granger yahoo 2020-10-23 20:18:00
2 3 Expired Ron Weasley bing 2020-10-20 12:00:00
3 4 Expired McGonagall facebook 2020-10-20 12:00:00
4 5 Still Active Dumbledore google 2020-10-23 20:18:00
5 6 Expired You-Know-Who badoo 2020-10-20 12:00:00
6 3 Net New Ron Weasley twitter 2020-10-23 20:18:00
7 4 Net New Hagrid instagram 2020-10-23 20:18:00
The rows with new URLs are at the end, while those from the original database that didn't have an exact match on ID and URL were set to "Expired". Finally, the rows with a matching URL were kept, and their Status was set to "Still Active".
I hope this example helps you out. At least it's a little bit faster because it doesn't rely on for loops.
If the data wasn't helpful, please provide a sample of your own data, so we can test using the same dataset.
I have a dataframe from a CSV which contains userId, ISBN and ratings for a bunch of books. I want to find the subset of this dataframe in which userIds occur more than 200 times and ISBNs occur more than 100 times.
Following is what I tried:
ratings = pd.read_csv('../data/BX-Book-Ratings.csv', sep=';', error_bad_lines=False, encoding="latin-1")
ratings.columns = ['userId', 'ISBN', 'bookRating']
# Choose users with more than 200 ratings and books with more than 100 ratings
user_rating_count = ratings['userId'].value_counts()
relevant_ratings = ratings[ratings['userId'].isin(user_rating_count[user_rating_count >= 200].index)]
print(relevant_ratings.head())
print(relevant_ratings.shape)
books_rating_count = relevant_ratings['ISBN'].value_counts()
relevant_ratings_book = relevant_ratings[relevant_ratings['ISBN'].isin(
books_rating_count[books_rating_count >= 100].index)]
print(relevant_ratings_book.head())
print(relevant_ratings_book.shape)
# Check that userId occurs more than 200 times
users_grouped = pd.DataFrame(relevant_ratings.groupby('userId')['bookRating'].count()).reset_index()
users_grouped.columns = ['userId', 'ratingCount']
sorted_users = users_grouped.sort_values('ratingCount')
print(sorted_users.head())
# Check that ISBN occurs more than 100 times
books_grouped = pd.DataFrame(relevant_ratings.groupby('ISBN')['bookRating'].count()).reset_index()
books_grouped.columns = ['ISBN', 'ratingCount']
sorted_books = books_grouped.sort_values('ratingCount')
print(sorted_books.head())
Following is the output I got:
userId ISBN bookRating
1456 277427 002542730X 10
1457 277427 0026217457 0
1458 277427 003008685X 8
1459 277427 0030615321 0
1460 277427 0060002050 0
(527556, 3)
userId ISBN bookRating
1469 277427 0060930535 0
1471 277427 0060934417 0
1474 277427 0061009059 9
1495 277427 0142001740 0
1513 277427 0312966091 0
(13793, 3)
userId ratingCount
73 26883 200
298 99955 200
826 252827 200
107 36554 200
240 83671 200
ISBN ratingCount
0 0330299891 1
132873 074939918X 1
132874 0749399201 1
132875 074939921X 1
132877 0749399295 1
As seen above, when sorting the table grouped by userId in ascending order, it shows only userIds that occur at least 200 times.
But when sorting the table grouped by ISBN in ascending order, it shows ISBNs that occur even just once.
I expected both userIds and ISBNs to occur more than 200 and 100 times respectively.
Please let me know what I have done wrong and how to get the correct result.
You should try and produce a small version of the problem that can be solved without access to large csv files. Check this page for more details: https://stackoverflow.com/help/how-to-ask
That said, here is a dummy version of your dataset:
import pandas as pd
import random
import string
n=1000
isbn = [random.choice(['abc','def','ghi','jkl','mno']) for x in range(n)]
rating = [random.choice(range(9)) for x in range(n)]
userId = [random.choice(['x','y','z']) for x in range(n)]
df = pd.DataFrame({'isbn':isbn,'rating':rating,'userId':userId})
You can get the counts by userId and isbns this way:
df_userId_count = df.groupby('userId',as_index=False)['rating'].count()
df_isbn_count = df.groupby('isbn',as_index=False)['rating'].count()
and extract the unique values by:
userId_select = (df_userId_count[df_userId_count.rating>200].userId.values)
isbn_select = (df_isbn_count[df_isbn_count.rating>100].isbn.values)
So that your final filtered dataframe is:
df = df[df.userId.isin(userId_select) & df.isbn.isin(isbn_select) ]
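An equivalent one-pass variant (a sketch on the same dummy frame) filters on group sizes with transform, which avoids building the intermediate count frames:
user_mask = df.groupby('userId')['rating'].transform('size') > 200
isbn_mask = df.groupby('isbn')['rating'].transform('size') > 100
df_filtered = df[user_mask & isbn_mask]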
I have a DataFrame df which contains three columns: ['mid','2014_amt','2015_amt']
I want to extract rows of a particular merchant. For example, consider my data is:
df['mid'] = ['as','fsd','qww','fd']
df['2014_amt'] = [144,232,45,121]
df['2015_amt'] = [676,455,455,335]
I want to extract the whole rows corresponding to mid = ['fsd','qww']. How is this best done? I tried with the below code:
df.query('mid== "fsd"')
If I want to run a loop, how can I use the above code to extract rows for specified values of mid?
for val in mid:
    print df.query('mid == "val"')
This is giving an error, as val is not specified.
Option 1
df.query('mid in ["fsd", "qww"]')
mid 2014_amt 2015_amt
1 fsd 232 455
2 qww 45 455
Option 2
df[df['mid'].isin(['fsd', 'qww'])]
mid 2014_amt 2015_amt
1 fsd 232 455
2 qww 45 455
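If you want to keep the values in a Python variable, as in the loop attempt in the question, query can also reference local variables with @:
mids = ['fsd', 'qww']
df.query('mid in @mids')  # same result as Option 1

# or one value at a time inside a loop:
for val in mids:
    print(df.query('mid == @val'))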