Apologies for any inaccuracies in my wording; I'm fairly new to Python and brand new to Pandas.
I currently have a dataframe containing about 1,000 accounts and their corresponding balances. However, some accounts appear twice in the data: once with their normal account number and once with "AM1" appended to it. How can I create a new dataframe, or edit the existing one (either way works), so that 900002 and 900002AM1's balances are combined into 900002's balance and 900002AM1 is then removed from the df? Thank you. I know groupby would work (it's how I got to the current DF), but obviously I would need to be able to remove 'AM1' from all account numbers first and then do:
df.groupby(['account#']).agg({'balance': 'sum'}).reset_index()
Current DF:
account#    balance
900001        35.00
900002        25.00
900002AM1     25.00
900003        40.00
900004        20.00
900004AM1     10.00
Desired DF:
account#    balance
900001        35.00
900002        50.00
900003        40.00
900004        30.00
Extract the numeric part of the account# column and group by it:
>>> df.groupby(df['account#'].str.extract(r'(^\d+)', expand=False)) \
.sum().reset_index()
account# balance
0 900001 35.0
1 900002 50.0
2 900003 40.0
3 900004 30.0
What does str.extract do?
>>> df['account#'].str.extract(r'(^\d+)', expand=False)
0 900001
1 900002
2 900002 # <- 900002AM1
3 900003
4 900004
5 900004 # <- 900004AM1
Name: account#, dtype: object
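As an aside, the approach you describe (stripping 'AM1' first and then grouping) works as well; a minimal sketch, assuming the suffix is always the literal string 'AM1' at the end of the account number:

# Drop a trailing 'AM1' from every account number, then sum the balances per account
df['account#'] = df['account#'].str.replace(r'AM1$', '', regex=True)
result = df.groupby('account#', as_index=False)['balance'].sum()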
I have two dataframes with the same headers
df1

       Date  prix moyen  mini  maxi  H-Value  C-Value
0  17/09/20           8     6     9      122  2110122
1  15/09/20           8     6     9      122  2110122
2  10/09/20           8     6     9      122  2110122

and

df2

       Date  prix moyen   mini   maxi  H-Value   C-Value
1  07/09/17        1.80   1.50   2.00      170   3360170
1  17/09/20        8.00   6.00   9.00      122   2110122
2  17/09/20        9.00   8.00  12.00      122   2150122
3  17/09/20       10.00   8.00  12.00      122  14210122
I want to compare the two dataframes along 3 parameters (Date, H-Value and C-Value), identify the new values present in df2 (values which do not occur in df1) and then append them to df1.
I am using
df_unique = df2[~(df2['Date'].isin(df1['Date']) & df2['H-Value'].isin(df1['H-Value']) & df2['C-Value'].isin(df1['C-Value']) )].dropna().reset_index(drop=True)
but it is not correctly identifying the new values in df2: the resulting table only picks up some of them and not others.
Where am I going wrong?
What is your question?
In [4]: df2[~(df2['Date'].isin(df1['Date']) & df2['H-Value'].isin(df1['H-Value']) & df2['C-Value'].isin(df1['C-Value']))].dropna().reset_index(drop=True)
Out[4]:
Date prix moyen mini maxi H-Value C-Value
0 1 07/09/17 1.8 1.5 2.0 170 3360170
1 2 17/09/20 9.0 8.0 12.0 122 2150122
2 3 17/09/20 10.0 8.0 12.0 122 14210122
These are all rows in df2 that are not present in df1. Looks good to me...
I was actually able to solve the problem. The issue was not the command used to compare the two datasets but rather the fact that one of the columns in df2 had a data format different from the same column in df1, making a direct comparison impossible.
Here's what I tried:
df1 = pd.concat([df1, df2[~df2.set_index(['Date', 'H-Value', 'C-Value']).index.isin(df1.set_index(['Date', 'H-Value', 'C-Value']).index)]])
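For reference, the dtype mismatch mentioned above can be removed before the comparison. A sketch, under the assumption that the offending column was Date; normalizing both frames to real datetimes makes the comparison consistent:

# Parse Date the same way in both frames so the isin()/index comparisons line up
df1['Date'] = pd.to_datetime(df1['Date'], format='%d/%m/%y')
df2['Date'] = pd.to_datetime(df2['Date'], format='%d/%m/%y')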
df['Withdrawal (Dr)/ Deposit (Cr)']
Out[571]:
0 214.82 (Cr)
1 50.00 (Dr)
2 50.00 (Dr)
3 50.00 (Dr)
4       19.00 (Dr)
           ...
785    161.00 (Dr)
786 155.45 (Dr)
787 69.00 (Dr)
788 51.00 (Dr)
789 73.00 (Cr)
Name: Withdrawal (Dr)/ Deposit (Cr), Length: 790, dtype: object
# Grab the text after "(" and drop the trailing ")", leaving 'Cr' or 'Dr'
df['dr/cr'] = df['Withdrawal (Dr)/ Deposit (Cr)'].apply(lambda x: x.split("(")[1].rstrip(")"))
# One-hot encode the Cr/Dr flag and append the dummy columns to the original frame
df2 = pd.get_dummies(df['dr/cr'])
df = pd.concat([df, df2], axis=1)
I hope this helps, or at least gives you an idea of what you have to do.
I suggest you split the data based on the identifier. Also, you most likely want the numbers as actual numbers. Let's generate some data:
my_df = pd.DataFrame({'mycol': ['214.82 (Cr)', '50.00 (Dr)', '50.00 (Dr)', '50.00 (Dr)',
                                '19.00 (Dr)', '161.00 (Dr)', '155.45 (Dr)', '69.00 (Dr)',
                                '51.00 (Dr)', '73.00 (Cr)']})
I create new columns CR and DR based on the characters in the original column, then remove the string part. (You can do this with regex tricks as well; see the sketch after the code.) Lastly I convert the figures into numbers for future use.
my_df['CR'] = my_df[my_df['mycol'].str.contains('Cr')]['mycol'].str.replace(r'\(Cr\)', '', regex=True).astype(float)
my_df['DR'] = my_df[my_df['mycol'].str.contains('Dr')]['mycol'].str.replace(r'\(Dr\)', '', regex=True).astype(float)
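The single-pass regex variant alluded to above could look like this (a sketch; it reuses my_df and the CR/DR column names from the snippet above):

# Pull the numeric part and the Cr/Dr flag out in one pass, then split into two columns
parts = my_df['mycol'].str.extract(r'(?P<amount>[\d.]+)\s*\((?P<flag>Cr|Dr)\)')
my_df['CR'] = parts['amount'].astype(float).where(parts['flag'] == 'Cr')
my_df['DR'] = parts['amount'].astype(float).where(parts['flag'] == 'Dr')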
I have been trying to perform a mathematical operation between two data frames, data1 and data2, based on a condition.
Note that for some accounts you have more than one operation, both incoming and outgoing. So to reflect the transfers between accounts in their balances, in a way compliant with the rules of accounting, you should process your data as follows:
1. Set the index in data1 to Account Number (it will be needed for easy access to the row concerning a particular account).
2. Create in data1 a new column called New Balance, initially filled with the data from Balance.
3. Loop over rows from data2 (concerning transfers between accounts). In each iteration read from the current row:
   - Outgoing Account,
   - Ingoing Account,
   - Amount.
4. Process the current transaction, performing the following operations:
   - locate in data1 the row with index == Outgoing Account and subtract the Amount from its New Balance field,
   - locate in data1 the row with index == Ingoing Account and add the Amount to its New Balance field.
The code to do it is:
data1.set_index('Account Number', inplace=True)
data1['New Balance'] = data1.Balance
for _, row in data2.iterrows():
    outAcnt = int(row['Outgoing Account'])
    inAcnt = int(row['Ingoing Account'])
    amt = row.Amount
    data1.at[outAcnt, 'New Balance'] -= amt
    data1.at[inAcnt, 'New Balance'] += amt
When you print(data1) after the loop, for your data, you will get:
Balance New Balance
Account Number
1 2356.0 1945.5
3 452.5 -2874.5
5 120.0 2197.0
7 124.0 13.5
9 4582.0 4595.0
12 230.0 497.5
14 9.5 -2001.5
16 423.0 -381.0
18 235.0 -66.5
20 12.0 4619.0
This way the new balance reflects the transfers performed during this business day. E.g. for account No 1 the following transfers were performed:
- an outgoing transfer of $23.00 to account No 9,
- an incoming transfer of $35.50 from account No 7,
- an outgoing transfer of $423.00 to account No 16
(the currency may be different). So:
2356.00 - 23.00 + 35.50 - 423.00 = 1945.50
(not 2333.0 as in the other solution).
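As a side note, the unconditional update above can also be expressed without an explicit loop; a minimal sketch, assuming the same column names and that data1 is already indexed by Account Number (this only fits the simple variant, since the overdraft check described below must process transfers in order):

# Total amount leaving and entering each account
out_sum = data2.groupby('Outgoing Account')['Amount'].sum()
in_sum = data2.groupby('Ingoing Account')['Amount'].sum()
# Apply both to the original balance; fill_value=0 covers accounts with no transfers
data1['New Balance'] = (data1['Balance']
                        .sub(out_sum, fill_value=0)
                        .add(in_sum, fill_value=0))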
How to handle attempts to overdraft
In the comment as of 23:32Z you asked about reverting transfers that would lead to a negative balance of the outgoing account.
I think that instead of reverting them, the easier and more natural way is to extend the existing loop with a check whether the outgoing account has enough money (New Balance in the respective row of data1 >= row.Amount). If not, then:
- neither the subtraction from the outgoing account nor the addition to the ingoing (receiving) account should be performed,
- this transaction should be marked (in data2) as refused.
In this variant, the code should be something like:
data1.set_index('Account Number', inplace=True)
data1['New Balance'] = data1.Balance
data2 = data2.assign(Refused=False)
for transKey, row in data2.iterrows():
    outAcnt = int(row['Outgoing Account'])
    inAcnt = int(row['Ingoing Account'])
    amt = row.Amount
    balBefore = data1.at[outAcnt, 'New Balance']
    if balBefore >= amt:  # The charged account has enough money
        data1.at[outAcnt, 'New Balance'] -= amt
        data1.at[inAcnt, 'New Balance'] += amt
        result = 'OK'
    else:
        data2.at[transKey, 'Refused'] = True
        result = 'Refused'
    print(f'{outAcnt:3} {inAcnt:3} {amt:8.2f} {balBefore:8.2f} {result}')
For demonstration purposes, I added a printout of both account numbers, the transfer amount, the balance of the outgoing account before the transaction, and the "OK / Refused" status.
This time the "trace" printout is:
1 9 23.00 2356.00 OK
18 12 242.50 235.00 Refused
7 14 120.00 124.00 OK
16 3 1227.00 423.00 Refused
20 7 45.00 12.00 Refused
5 18 59.00 120.00 OK
3 14 5.00 452.50 OK
3 20 4534.00 447.50 Refused
3 9 15.00 447.50 OK
7 1 35.50 4.00 Refused
9 16 25.00 4620.00 OK
16 12 25.00 448.00 OK
18 20 118.00 294.00 OK
14 5 2136.00 134.50 Refused
1 16 423.00 2333.00 OK
and the final result in data1 is as follows:
Balance New Balance
Account Number
1 2356.0 1910.0
3 452.5 432.5
5 120.0 61.0
7 124.0 4.0
9 4582.0 4595.0
12 230.0 255.0
14 9.5 134.5
16 423.0 846.0
18 235.0 176.0
20 12.0 130.0
Also print data2 to see the added column marking the refusal status.
Conversion of Time column
Note that the Time column in data2 contains only digits and by default is read as int64 type. If you want to convert the Time column to a time type, run:
data2['Time'] = pd.to_datetime(data2['Time'], format='%H%M').dt.time
To be more precise, the pandas dtype of this column will then be object (i.e. something other than the defined numeric data types), but if you read an individual value, e.g. data2.iat[0, 0], you will get datetime.time(9, 15) (and the same type for all other elements of this column).
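A quick check of this behaviour (a small sketch, run after the conversion above):

# The column dtype is the generic 'object', but each element is a datetime.time instance
print(data2['Time'].dtype)           # object
print(type(data2['Time'].iat[0]))    # <class 'datetime.time'>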
The following code should update the number of items in stock based on the index. The table dr with the old stock holds >1000 values. The updated data frame grp1 contains the number of sold items. I would like to subtract data frame grp1 from data frame dr and update dr.

Everything is fine until I join grp1 to dr with Pandas' join and fillna. First of all, the datatypes are changed from int to float, and not only the NaN values but also the non-null values are replaced by 0. Is this a problem with non-matching indices?
I tried to make the dtypes uniform but this has not changed anything. Removing fillna while joining the two dataframes returns NaN for all columns.
dr has the following format (example):
druck_pseudonym lager_nr menge_im_lager
80009359 62808 1
80009360 62809 10
80009095 62810 0
80009364 62811 11
80009365 62812 10
80008572 62814 10
80009072 62816 18
80009064 62817 13
80009061 62818 2
80008725 62819 3
80008940 62820 12
dr.dtypes
lager_nr int64
menge_im_lager int64
dtype: object
and grp1 (example):
LagerArtikelNummer1 ArtMengen1
880211066 1
80211070 1
80211072 2
80211073 2
80211082 2
80211087 4
80211091 1
80211107 2
88889272 1
88889396 1
grp1.dtypes
ArtMengen1 int64
dtype: object
#update list with "nicht_erledigt"
dr_update = dr.join(grp1).fillna(0)
dr_update["menge_im_lager"] = dr_update["menge_im_lager"] - dr_update["ArtMengen1"]
This returns:
lager_nr menge_im_lager ArtMengen1
druck_pseudonym
80009185 44402 26.0 0.0
80009184 44403 2.0 0.0
80009182 44405 16.0 0.0
80008894 44406 32.0 0.0
80008115 44407 3.0 0.0
80008974 44409 16.0 0.0
80008380 44411 4.0 0.0
dr_update.dtypes
lager_nr int64
menge_im_lager float64
ArtMengen1 float64
dtype: object
Edit after comment: the indices are of object dtype.
Your indices are string objects. You need to convert them to numeric, and note that sort_index returns a new frame, so assign the result back:
dr.index = pd.to_numeric(dr.index)
grp1.index = pd.to_numeric(grp1.index)
dr = dr.sort_index()
grp1 = grp1.sort_index()
Then try the rest...
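Once the indices are comparable, the whole update can also be sketched in one step with Series.sub and fill_value (an assumption here: the account number is the index of both frames, as in your examples); the final astype(int) is needed because the index alignment may upcast to float:

# Subtract sold quantities from the stock, aligning on the account-number index;
# accounts with no sales are left unchanged via fill_value=0
dr['menge_im_lager'] = (dr['menge_im_lager']
                        .sub(grp1['ArtMengen1'], fill_value=0)
                        .astype(int))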
You can filter the old stock dataframe dr to match the sold stock, then subtract, and assign back to the original filtered dataframe.
# Filter the old stock dataframe so that its index matches the sold dataframe,
# restrict to menge_im_lager, then subtract the sold stock
dr.loc[dr.index.isin(grp1.index), "menge_im_lager"] = (
    dr.loc[dr.index.isin(grp1.index), "menge_im_lager"] - grp1["ArtMengen1"]
)
If I understand correctly, you want the non-matching indices to be kept in your final dataset and you want the final values to be integers. You can use an 'outer' join and astype(int) on the result.
So, at the join you can do it this way:
dr.join(grp1,how='outer').fillna(0).astype(int)
I have a very big Pandas dataframe where I need an ordering within groups based on another column. I know how to iterate over groups, do an operation on each group and union all those groups back into one dataframe, but this is slow and I feel like there is a better way to achieve this. Here is the input and what I want out of it.
Input:
ID price
1 100.00
1 80.00
1 90.00
2 40.00
2 40.00
2 50.00
Output:
ID price order
1 100.00 3
1 80.00 1
1 90.00 2
2 40.00 1
2 40.00 2 (could be 1, doesn't matter too much)
2 50.00 3
Since this is about 5 million records with around 250,000 IDs, efficiency is important.
If speed is what you want, then the following should be pretty good, although it is a bit more complicated as it makes use of complex-number sorting in numpy. This is similar to the approach used (by me) when writing the aggregate-sort method in the package numpy-groupies.
import numpy as np

# get global sort order, for sorting by ID then price
full_idx = np.argsort(df['ID'] + 1j*df['price'])
# get the first position of each ID within that sort order
# (note that there are multiple ways of doing this)
n_for_id = np.bincount(df['ID'])
first_of_idx = np.cumsum(n_for_id) - n_for_id
# subtract first_of_idx from the position within the global sort
rank = np.empty(len(df), dtype=int)
rank[full_idx] = np.arange(len(df)) - first_of_idx[df['ID'][full_idx]]
df['rank'] = rank + 1
It takes 2s for 5m rows on my machine, which is about 100x faster than using groupby.rank from pandas (although I didn't actually run the pandas version with 5m rows because it would take too long; I'm not sure how @ayhan managed to do it in only 30s, perhaps a difference in pandas versions?).
If you do use this, then I recommend testing it thoroughly, as I have not.
You can use rank:
df["order"] = df.groupby("ID")["price"].rank(method="first")
df
Out[47]:
ID price order
0 1 100.0 3.0
1 1 80.0 1.0
2 1 90.0 2.0
3 2 40.0 1.0
4 2 40.0 2.0
5 2 50.0 3.0
It takes about 30s on a dataset of 5m rows with 250,000 IDs (i5-3330):
df = pd.DataFrame({"price": np.random.rand(5000000), "ID": np.random.choice(np.arange(250000), size = 5000000)})
%time df["order"] = df.groupby("ID")["price"].rank(method="first")
Wall time: 36.3 s
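If integer ranks are preferred (rank returns floats, as visible in the output above), a small addition is to cast the result, which is safe here because method="first" never produces NaN for non-null prices:

df["order"] = df.groupby("ID")["price"].rank(method="first").astype(int)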