Merge two dataframes based on condition - python

I have these two dataframes:
sp_client
ConnectionID Value
0 CN01493292 495
1 CN01492424 440
2 CN01491959 403
3 CN01493200 312
4 CN01493278 282
.. ... ...
110 CN01492864 1
111 CN01492513 1
112 CN01492899 1
113 CN01493010 1
114 CN01493032 1
[115 rows x 2 columns]
sp_server
ConnectionID Value
1 CN01491920 2
1 CN01491920 2
3 CN01491922 2
3 CN01491922 2
5 CN01491928 2
.. ... ...
595 CN01493166 3
595 CN01493166 3
595 CN01493166 3
597 CN01493163 2
597 CN01493163 2
[673 rows x 2 columns]
I would like to merge them so that sp_client['Value'] is incremented by sp_server['Value'], but only for rows that satisfy the condition sp_server['ConnectionID'] == sp_client['ConnectionID'].
This was a little complicated for me. I tried the following, but I am missing the condition part. Maybe they do not need to be merged in the first place. Happy to hear suggestions.

As per my comment, try appending the tables and grouping them by ID while summing the Value column, as in this example:
all_data = pd.concat([sp_server, sp_client])
all_data = all_data.groupby('ConnectionID')['Value'].sum().reset_index()
out:
ConnectionID Value
0 CN01491920 4
1 CN01491922 4
2 CN01491928 2
3 CN01491959 403
4 CN01492424 440
5 CN01493200 312
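Note that the concat approach keeps every ConnectionID from both tables, including IDs that only appear in sp_server. If only the sp_client rows should be kept and incremented where an ID matches, a possible alternative is to sum the server values per ID and map them onto the client table. This is a minimal sketch, not the answerer's method; the IDs and values below are made up for illustration:

```python
import pandas as pd

# Hypothetical sample data (not the real tables from the question)
sp_client = pd.DataFrame({"ConnectionID": ["CN01", "CN02", "CN03"],
                          "Value": [495, 440, 403]})
sp_server = pd.DataFrame({"ConnectionID": ["CN02", "CN02", "CN99"],
                          "Value": [2, 2, 5]})

# Total server Value per ConnectionID
server_totals = sp_server.groupby("ConnectionID")["Value"].sum()

# Add the matching server total to each client row; unmatched IDs add 0
result = sp_client.copy()
extra = result["ConnectionID"].map(server_totals).fillna(0).astype(int)
result["Value"] = result["Value"] + extra

print(result)  # Value becomes [495, 444, 403]; CN99 is not added
```

Whether server-only IDs should appear in the output decides which of the two approaches fits.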


Compare data series columns based on a column that contains duplicates

I have a dataset that I've created from merging 2 df's together on the "NAME" column and now I have a larger dataset. To finish the DF, I want to perform some logic to it to clean it up.
Requirements:
I want to select each unique 'NAME', matched with its row that has the highest Sales. If every Sales value for that name is less than 10, move to the Calls column and select the row with the highest Calls. If every value in Calls is also less than 10, move to the Target column and select the row with the highest Target. No rows are summed.
Here's my DF:
NAME CUSTOMER_SUPPLIER_NUMBER Sales Calls Target
0 OFFICE 1 2222277 84 170 265
1 OFFICE 1 2222278 26 103 287
2 OFFICE 1 2222278 97 167 288
3 OFFICE 2 2222289 7 167 288
4 OFFICE 2 2222289 3 130 295
5 OFFICE 2 2222289 9 195 257
6 OFFICE 3 1111111 1 2 286
7 OFFICE 3 1111111 5 2 287
8 OFFICE 3 1111112 9 7 230
9 OFFICE 4 1111171 95 193 299
10 OFFICE 5 1111191 9 193 298
Here's what I want to show in the final DF:
NAME CUSTOMER_SUPPLIER_NUMBER Sales Calls Target
0 OFFICE 1 2222277 97 167 288
5 OFFICE 2 2222289 9 195 257
7 OFFICE 3 1111111 5 2 287
9 OFFICE 4 1111171 95 193 299
10 OFFICE 5 1111191 9 193 298
I was thinking of solving this by using df.iterrows()
Here's what I've tried:
for n, v in df.iterrows():
    if int(v['Sales']) > 10:
        calls = df.loc[(v['NAME'] == v) & (int(v['Calls'].max()))]
        if int(calls['Calls']) > 10:
            target = df.loc[(v['NAME'] == v) & (int(v['Target'].max()))]
        else:
            print("No match found")
    else:
        sales = df.loc[(v['NAME'] == v) & (int(v['Sales'].max()))]
However, I keep getting KeyError: False error messages. Any thoughts on what I'm doing wrong?
This is not optimized, but it should meet your needs. The code snippet sends each NAME group to eval_group(), which checks the row holding the maximum of each column in turn until the Sales, Calls, Target criteria are met.
If you wanted to optimize it, you could apply vectorization or parallelism to eval_group so that it is evaluated against all groups at once instead of sequentially.
A couple of notes: this will return the first row if there is a tie (i.e. multiple records share the same maximum in the idxmax() call). Also, I believe the first row of the desired answer in your question should be row 2 for OFFICE 1, not row 0.
import pandas as pd

df = pd.read_csv('./data.txt')

def eval_group(df, keys):
    # Return the index of the max row for the first column whose max
    # reaches 10, falling back to the last column in keys.
    for key in keys:
        row_id = df[key].idxmax()
        if df.loc[row_id, key] >= 10 or key == keys[-1]:
            return row_id

row_ids = []
keys = ['Sales', 'Calls', 'Target']
for name in df['NAME'].unique().tolist():
    condition = df['NAME'] == name
    row_ids.append(eval_group(df[condition], keys))

df = df[df.index.isin(row_ids)]
df
NAME CUSTOMER_SUPPLIER_NUMBER Sales Calls Target
2 OFFICE 1 2222278 97 167 288
5 OFFICE 2 2222289 9 195 257
7 OFFICE 3 1111111 5 2 287
9 OFFICE 4 1111171 95 193 299
10 OFFICE 5 1111191 9 193 298
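For anyone who wants to try this without a data.txt file, here is a self-contained sketch of the same idea that iterates over groupby groups instead of filtering by name. The function name pick_row and the threshold parameter are my own choices for illustration; the data is copied from the question:

```python
import pandas as pd

# Data from the question
df = pd.DataFrame({
    "NAME": ["OFFICE 1"] * 3 + ["OFFICE 2"] * 3 + ["OFFICE 3"] * 3
            + ["OFFICE 4", "OFFICE 5"],
    "CUSTOMER_SUPPLIER_NUMBER": [2222277, 2222278, 2222278, 2222289, 2222289,
                                 2222289, 1111111, 1111111, 1111112, 1111171,
                                 1111191],
    "Sales": [84, 26, 97, 7, 3, 9, 1, 5, 9, 95, 9],
    "Calls": [170, 103, 167, 167, 130, 195, 2, 2, 7, 193, 193],
    "Target": [265, 287, 288, 288, 295, 257, 286, 287, 230, 299, 298],
})

def pick_row(group, keys=("Sales", "Calls", "Target"), threshold=10):
    """Index of the max row in the first column whose max reaches threshold."""
    for key in keys[:-1]:
        if group[key].max() >= threshold:
            return group[key].idxmax()
    # Nothing reached the threshold: fall back to the last column
    return group[keys[-1]].idxmax()

row_ids = [pick_row(g) for _, g in df.groupby("NAME", sort=False)]
result = df.loc[row_ids]
print(result)  # rows 2, 5, 7, 9, 10
```

As with idxmax() in general, ties are resolved in favour of the first matching row.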
This takes a couple of steps: you have to build intermediate dataframes, evaluate a condition, and filter based on the result:
import numpy as np

# index of the max row per NAME for each of Sales, Calls, Target
temp = (df
        .drop(columns = 'CUSTOMER_SUPPLIER_NUMBER')
        .groupby('NAME', sort = False)
        .idxmax()
       )
# get the booleans for rows less than 10
bools = df.loc(axis=1)['Sales':'Target'].lt(10)
# groupby for each NAME
bools = bools.groupby(df.NAME, sort = False).all()
# conditions buildup
condlist = [~bools.Sales, ~bools.Calls, ~bools.Target]
choicelist = [temp.Sales, temp.Calls, temp.Target]
# you might have to figure out what to use for default
indices = np.select(condlist, choicelist, default = temp.Sales)
# get matching rows
df.loc[indices]
NAME CUSTOMER_SUPPLIER_NUMBER Sales Calls Target
2 OFFICE 1 2222278 97 167 288
5 OFFICE 2 2222289 9 195 257
7 OFFICE 3 1111111 5 2 287
9 OFFICE 4 1111171 95 193 299
10 OFFICE 5 1111191 9 193 298
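The order of the lists passed to np.select matters here: for each position it takes the choice belonging to the first condition that is True, and only uses the default when none is. A tiny sketch with made-up arrays to show that behaviour:

```python
import numpy as np

# Made-up conditions, one entry per group
cond_a = np.array([True, False, False])
cond_b = np.array([True, True, False])

choice_a = np.array([1, 1, 1])
choice_b = np.array([2, 2, 2])

# First True condition wins; default fills positions where none is True
picked = np.select([cond_a, cond_b], [choice_a, choice_b], default=0)
print(picked)  # [1 2 0]
```

So listing Sales before Calls before Target gives Sales priority whenever a group has any Sales value of 10 or more.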

Pandas cumsum + cumcount on multiple columns

Aloha,
I have the following DataFrame
import numpy as np
import pandas as pd

stores = [1,2,3,4,5]
weeks = [1,1,1,1,1]
df = pd.DataFrame({'Stores' : stores,
                   'Weeks' : weeks})
df = pd.concat([df]*53)
df['Weeks'] = df['Weeks'].add(df.groupby('Stores').cumcount())
df['Target'] = np.random.randint(400,600,size=len(df))
df['Actual'] = np.random.randint(350,800,size=len(df))
df['Variance %'] = (df['Target'] - df['Actual']) / df['Target']
df.loc[df['Variance %'] >= 0.01, 'Status'] = 'underTarget'
df.loc[df['Variance %'] <= 0.01, 'Status'] = 'overTarget'
df['Status'] = df['Status'].fillna('atTarget')
df.sort_values(['Stores','Weeks'],inplace=True)
this gives me the following
print(df.head())
Stores Weeks Target Actual Variance % Status
0 1 1 430 605 -0.406977 overTarget
0 1 2 549 701 -0.276867 overTarget
0 1 3 471 509 -0.080679 overTarget
0 1 4 549 378 0.311475 underTarget
0 1 5 569 708 -0.244288 overTarget
0 1 6 574 650 -0.132404 overTarget
0 1 7 466 623 -0.336910 overTarget
What I'm trying to do now is a cumulative count, per store, of consecutive weeks spent over or under target, resetting whenever the status changes.
I thought this (and many variants of it) would be the best way to do it, but it does not reset the counter.
s = df.groupby(['Stores','Weeks','Status'])['Status'].shift().ne(df['Status'])
df['Count'] = s.groupby(df['Stores']).cumsum()
My logic was to group by my relevant columns and use a != shift comparison to reset the cumsum.
Naturally I've scoured lots of different questions but I can't seem to figure this out. Would anyone be so kind to explain to me what would be the best method to tackle this problem?
I hope everything here is clear and reproducible. Please let me know if you need any additional information.
Expected Output
Stores Weeks Target Actual Variance % Status Count
0 1 1 430 605 -0.406977 overTarget 1
0 1 2 549 701 -0.276867 overTarget 2
0 1 3 471 509 -0.080679 overTarget 3
0 1 4 549 378 0.311475 underTarget 1 # Reset here as status changes
0 1 5 569 708 -0.244288 overTarget 1 # Reset again.
0 1 6 574 650 -0.132404 overTarget 2
0 1 7 466 623 -0.336910 overTarget 3
Try creating a key for each run of identical statuses with cumsum, then group by that key together with Stores:
s = df.groupby('Stores')['Status'].transform(lambda x: x.ne(x.shift()).cumsum())
df['Count'] = df.groupby([df.Stores, s]).cumcount() + 1
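To see why this resets the counter, here is a small self-contained sketch of the same run-key idea. It uses a hand-written Status column (the first seven rows follow the expected output in the question, the second store is made up) and a lambda-free variant of the key built from groupby().shift():

```python
import pandas as pd

df = pd.DataFrame({
    "Stores": [1] * 7 + [2] * 3,
    "Weeks": list(range(1, 8)) + [1, 2, 3],
    "Status": ["overTarget"] * 3 + ["underTarget"] + ["overTarget"] * 3
              + ["overTarget", "underTarget", "underTarget"],
})

# True wherever the status differs from the previous row of the same store
# (the first row of each store compares against NaN, so it is always True)
changed = df["Status"].ne(df.groupby("Stores")["Status"].shift())

# Each True starts a new run, so the running sum labels the runs
run_id = changed.cumsum()

# Number the rows inside each (store, run) pair, starting at 1
df["Count"] = df.groupby(["Stores", run_id]).cumcount() + 1
print(df)  # Count: 1 2 3 1 1 2 3 1 1 2
```

The question's attempt grouped by Status itself, which lumps all overTarget rows together; grouping by the run label is what separates one streak from the next.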

Why am I not getting my whole output in the run module?

I'm not getting my whole output, or my column names, on my screen.
import sqlite3
import pandas as pd
hello = sqlite3.connect(r"C:\Users\ravjo\Downloads\Chinook.sqlite")
rs = hello.execute("SELECT * FROM PlaylistTrack INNER JOIN Track on PlaylistTrack.TrackId = Track.TrackId WHERE Milliseconds < 250000")
df = pd.DataFrame(rs.fetchall())
hello.close()
print(df.head())
actual result:
0 1 2 3 4 ... 6 7 8 9 10
0 1 3390 3390 One and the Same 271 ... 23 None 217732 3559040 0.99
1 1 3392 3392 Until We Fall 271 ... 23 None 230758 3766605 0.99
2 1 3393 3393 Original Fire 271 ... 23 None 218916 3577821 0.99
3 1 3394 3394 Broken City 271 ... 23 None 228366 3728955 0.99
4 1 3395 3395 Somedays 271 ... 23 None 213831 3497176 0.99
[5 rows x 11 columns]
expected result:
PlaylistId TrackId TrackId Name AlbumId MediaTypeId \
0 1 3390 3390 One and the Same 271 2
1 1 3392 3392 Until We Fall 271 2
2 1 3393 3393 Original Fire 271 2
3 1 3394 3394 Broken City 271 2
4 1 3395 3395 Somedays 271 2
GenreId Composer Milliseconds Bytes UnitPrice
0 23 None 217732 3559040 0.99
1 23 None 230758 3766605 0.99
2 23 None 218916 3577821 0.99
3 23 None 228366 3728955 0.99
4 23 None 213831 3497176 0.99
The ... in the middle means that some of the data has been omitted from the display. If you want to see all of the data, you should modify the pandas options. You can do so with the pandas.set_option() method. Documentation here.
In your case, you should set display.max_columns to None so that pandas displays an unlimited number of columns. You will also have to read in the column names from the database or set them manually. Refer here on how to read in the column names from the database itself.
To display all the columns, use the code snippet below.
pd.set_option("display.max_columns",None)
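For the missing column names, one option is to take them from the cursor's description attribute, or to let pandas run the query itself with pd.read_sql_query. Below is a minimal sketch using an in-memory database with a made-up, much smaller Track table standing in for Chinook:

```python
import sqlite3
import pandas as pd

# Hypothetical stand-in for the Chinook database
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Track (TrackId INTEGER, Name TEXT, Milliseconds INTEGER)")
conn.executemany("INSERT INTO Track VALUES (?, ?, ?)",
                 [(1, "Short", 200000), (2, "Long", 300000)])

query = "SELECT * FROM Track WHERE Milliseconds < 250000"

# Option 1: the first item of each cursor.description entry is the column name
cursor = conn.execute(query)
columns = [col[0] for col in cursor.description]
df = pd.DataFrame(cursor.fetchall(), columns=columns)

# Option 2: pandas runs the query and fills in the column names
df_sql = pd.read_sql_query(query, conn)

conn.close()
print(df)
```

With a join like the one in the question, both tables contribute a TrackId column, so the resulting frame will have two columns with that name unless the query selects or aliases them explicitly.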
By default, pandas limits the number of rows it displays. However, you can change this as needed. Here is a helper function I use whenever I need to print a full data frame:
def print_full(df):
    import pandas as pd
    pd.set_option('display.max_rows', len(df))
    print(df)
    pd.reset_option('display.max_rows')
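A variation on the helper above is pd.option_context, which changes the options only inside a with block and restores the previous values afterwards, even if printing raises. A small sketch with a made-up 100-row frame:

```python
import pandas as pd

df = pd.DataFrame({"n": range(100)})  # made-up data, long enough to be truncated

rows_before = pd.get_option("display.max_rows")

# Options are changed only inside the with block
with pd.option_context("display.max_rows", None, "display.max_columns", None):
    full_text = repr(df)  # the same text print(df) would show
    print(full_text)

# Outside the block the previous setting is back in place
rows_after = pd.get_option("display.max_rows")
```

This avoids having to remember the matching reset_option call.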

Selecting rows with lowest values based on combination two columns from pandas

I'm not even sure if the title makes sense.
I have a pandas dataframe with 3 columns: x, y, time. There are a few thousand rows. Example below:
x y time
0 225 0 20.295270
1 225 1 21.134015
2 225 2 21.382298
3 225 3 20.704367
4 225 4 20.152735
5 225 5 19.213522
.......
900 437 900 27.748966
901 437 901 20.898460
902 437 902 23.347935
903 437 903 22.011992
904 437 904 21.231041
905 437 905 28.769945
906 437 906 21.662975
.... and so on
What I want to do is retrieve, for every value of y, the row with the smallest time, while excluding rows where time is 0.0. That happens when x has the same value as y.
So for example, the fastest way to get to y-0 is by starting from x-225 and so on, therefore it could be the case that x repeats itself but for a different y.
e.g.
x y time
225 0 20.295270
438 1 19.648954
27 20 4.342732
9 438 17.884423
225 907 24.560400
So far I have tried groupby, but I only get the rows where x is the same as y.
print(df.groupby('id_y', sort=False)['time'].idxmin())
y
0 0
1 1
2 2
3 3
4 4
The one below just returns the df that I already have.
df.loc[df.groupby("id_y")["time"].idxmin()]
Just to point out one thing, I'm open to options, not just groupby, if there are other ways that is very good.
You first need to remove the rows where time equals 0 by boolean indexing, and then use your solution:
df = df[df['time'] != 0]
df2 = df.loc[df.groupby("y")["time"].idxmin()]
Similar alternative with filter by query:
df = df.query('time != 0')
df2 = df.loc[df.groupby("y")["time"].idxmin()]
Or use sort_values with drop_duplicates:
df2 = df[df['time'] != 0].sort_values(['y','time']).drop_duplicates('y')
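Here is a small runnable sketch of the first and last variants side by side, on made-up data where the zero-time rows are the ones with x equal to y:

```python
import pandas as pd

# Made-up sample: time is 0.0 where x == y
df = pd.DataFrame({
    "x": [0, 225, 438, 1, 225, 438],
    "y": [0, 0, 0, 1, 1, 1],
    "time": [0.0, 20.3, 25.0, 0.0, 21.1, 19.6],
})

# Drop the zero-time rows, then take the row with the smallest time per y
nonzero = df[df["time"] != 0]
fastest = nonzero.loc[nonzero.groupby("y")["time"].idxmin()]

# Same result via sorting and keeping the first row of each y
fastest_sorted = nonzero.sort_values(["y", "time"]).drop_duplicates("y")

print(fastest)  # x=225 for y=0, x=438 for y=1
```

Without the filter, idxmin() lands on the 0.0 rows, which is why the original attempt only returned rows where x equals y.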

Reset secondary index in pandas dataframe to start at 1

Suppose I construct a multi-index dataframe like the one shown here:
prim_ind=np.array(range(0,1000))
for i in range(0,1000):
    prim_ind[i]=round(i/4)
d = {'prim_ind' :prim_ind,
     'sec_ind' : np.array(range(1,1001)),
     'a' : np.array(range(325,1325)),
     'b' : np.array(range(8318,9318))}
df= pd.DataFrame(d).set_index(['prim_ind','sec_ind'])
The sec_ind runs sequentially from 1 upwards, but I want to reset this second index so that for each of the prim_ind levels the sec_ind always starts at 1. I have been trying to work out if I can use reset index to do this but am failing miserably.
I know I could iterate over the dataframe to get this outcome, but that would be a horrible way to do it and there must be a more pythonic way. Can anyone help?
Note: the dataframe I'm working with is actually imported from csv; the code above is just to illustrate this question.
You can use cumcount to number the rows within each category.
df.index = [df.index.get_level_values(0), df.groupby(level=0).cumcount() + 1]
Or better, if you also want to keep the index names, use MultiIndex.from_arrays:
df.index = pd.MultiIndex.from_arrays([df.index.get_level_values(0),
                                      df.groupby(level=0).cumcount() + 1],
                                     names=df.index.names)
print (df)
a b
prim_ind sec_ind
0 1 325 8318
2 326 8319
3 327 8320
1 1 328 8321
2 329 8322
3 330 8323
2 1 331 8324
So the sec_ind column is not necessary; you can also use:
d = {'prim_ind' :prim_ind,
     'a' : np.array(range(325,1325)),
     'b' : np.array(range(8318,9318))}
df = pd.DataFrame(d)
print (df.head(8))
a b prim_ind
0 325 8318 0
1 326 8319 0
2 327 8320 0
3 328 8321 1
4 329 8322 1
5 330 8323 1
6 331 8324 2
7 332 8325 2
df = df.set_index(['prim_ind', df.groupby('prim_ind').cumcount() + 1]) \
       .rename_axis(('first','second'))
print (df.head(8))
a b
first second
0 1 325 8318
2 326 8319
3 327 8320
1 1 328 8321
2 329 8322
3 330 8323
2 1 331 8324
2 332 8325
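For a quick check of the cumcount approach, here is a minimal sketch on a six-row frame (made-up, much smaller than the 1000 rows in the question) that rebuilds the second level so it restarts at 1 for every prim_ind:

```python
import pandas as pd

# Small made-up frame; sec_ind runs 1..6 across all groups
df = pd.DataFrame({
    "prim_ind": [0, 0, 0, 1, 1, 2],
    "sec_ind": [1, 2, 3, 4, 5, 6],
    "a": [325, 326, 327, 328, 329, 330],
}).set_index(["prim_ind", "sec_ind"])

# cumcount() numbers rows within each prim_ind from 0, so add 1
new_sec = df.groupby(level="prim_ind").cumcount() + 1

# Rebuild the MultiIndex, keeping the original level names
df.index = pd.MultiIndex.from_arrays(
    [df.index.get_level_values("prim_ind"), new_sec.to_numpy()],
    names=df.index.names,
)
print(df)  # sec_ind is now 1, 2, 3, 1, 2, 1
```

The to_numpy() call is just a precaution so that only the values, not the old index of new_sec, go into the new MultiIndex.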
