I have a dataframe and I want to drop duplicates based on different conditions....
A B
0 1 1.0
1 1 1.0
2 2 2.0
3 2 2.0
4 3 3.0
5 4 4.0
6 5 5.0
7 - 5.1
8 - 5.1
9 - 5.3
I want to drop duplicates based on column A, except for the rows where A is "-". For those "-" rows, I want to drop duplicates based on their column B value instead. Given the input dataframe, this should return the following:
A B
0 1 1.0
2 2 2.0
4 3 3.0
5 4 4.0
6 5 5.0
7 - 5.1
9 - 5.3
I have the following code, but it's not very efficient for very large amounts of data. How can I improve it?
def generate(df):
    # handle the "-" rows separately from the rest
    str_col = df[df["A"] == "-"]
    df = df.drop(df[df["A"] == "-"].index)
    df = df.drop_duplicates(subset="A")
    str_col = str_col.drop_duplicates(subset="B")
    bigdata = pd.concat([df, str_col], ignore_index=True)
    return bigdata.sort_values("B")
duplicated and eq:
df[~df.duplicated('A') # keep those not duplicates in A
| (df['A'].eq('-') # or those '-' in A
& ~df['B'].duplicated())] # which are not duplicates in B
Output:
A B
0 1 1.0
2 2 2.0
4 3 3.0
5 4 4.0
6 5 5.0
7 - 5.1
9 - 5.3
Because, in this data, every A value other than "-" maps to a single B value, dropping duplicates on columns A and B together gives the same result:
df.drop_duplicates(subset=['A', 'B'])
Given a full set of data:
A B C
0 1 1.0 0
1 1 1.0 1
2 2 2.0 2
3 2 2.0 3
4 3 3.0 4
5 4 4.0 5
6 5 5.0 6
7 - 5.1 7
8 - 5.1 8
9 - 5.3 9
Result:
A B C
0 1 1.0 0
2 2 2.0 2
4 3 3.0 4
5 4 4.0 5
6 5 5.0 6
7 - 5.1 7
9 - 5.3 9
groupby + head: keep the first row of every (A, B) group
df.groupby(['A','B']).head(1)
Out[7]:
A B
0 1 1.0
2 2 2.0
4 3 3.0
5 4 4.0
6 5 5.0
7 - 5.1
9 - 5.3
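As a quick check (a sketch, assuming df is the example frame from the question), the three approaches select the same rows here:
m1 = df[~df.duplicated('A') | (df['A'].eq('-') & ~df['B'].duplicated())]
m2 = df.drop_duplicates(subset=['A', 'B'])
m3 = df.groupby(['A', 'B']).head(1)
print(m1.equals(m2) and m2.equals(m3))  # True for this data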
I have the following dataframe about auctions:
id.product_permutation  id.iteration  property.product  property.price
1                       1             1                 200
1                       2             1                 300
1                       3             1                 400
1                       4             3                 100
1                       5             3                 200
1                       6             3                 300
1                       7             2                 500
1                       8             2                 600
2                       1             3                 300
2                       2             3                 400
2                       3             1                 200
2                       4             1                 300
2                       5             2                 700
2                       6             2                 800
2                       7             2                 900
2                       8             2                 700
3                       1             1                 200
...                     ...           ...               ...
There are 3 different products in the auction, and the column property.product tells which product is for sale at the moment. If the number in property.product changes, the previous product has been sold.
property.price tells what the price is at the moment.
If the number in id.product_permutation changes, the whole auction is over, all 3 items are sold, and a new auction starts (with the same 3 items).
Now I would like to introduce a new column amount_of_sold_items that counts how many products have already been sold (as shown below). I have tried a lot, but unfortunately I do not get the desired result. Can anyone help me solve this?
id.product_permutation  id.iteration  property.product  property.price  amount_of_sold_items
1                       1             1                 200             0
1                       2             1                 300             0
1                       3             1                 400             0
1                       4             3                 100             1
1                       5             3                 200             1
1                       6             3                 300             1
1                       7             2                 500             2
1                       8             2                 600             2
1                       NaN           NaN               NaN             3
2                       1             3                 300             0
2                       2             3                 400             0
2                       3             1                 200             1
2                       4             1                 300             1
2                       5             2                 700             2
2                       6             2                 800             2
2                       7             2                 900             2
2                       8             2                 700             2
2                       NaN           NaN               NaN             3
3                       1             1                 200             0
...                     ...           ...               ...             ...
df["n_items_sold"] = (df.groupby("id.product_permutation")["property.product"]
.transform(lambda x: x.diff().ne(0, fill_value=0).cumsum()))
For each id.product_permutation group, we assign a new series that marks the turning points, i.e. where the difference of property.product is not equal to 0 (fill_value=0 prevents the very first row from counting as a turning point). The cumulative sum of these turning points keeps track of the items sold so far.
This gives:
id.product_permutation id.iteration property.product property.price n_items_sold
0 1 1 1 200 0
1 1 2 1 300 0
2 1 3 1 400 0
3 1 4 3 100 1
4 1 5 3 200 1
5 1 6 3 300 1
6 1 7 2 500 2
7 1 8 2 600 2
8 2 1 3 300 0
9 2 2 3 400 0
10 2 3 1 200 1
11 2 4 1 300 1
12 2 5 2 700 2
13 2 6 2 800 2
14 2 7 2 900 2
15 2 8 2 700 2
16 3 1 1 200 0
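As a quick sanity check of the diff/ne/cumsum step, here is a sketch on a single group, using the property.product values of permutation 1:
import pandas as pd

s = pd.Series([1, 1, 1, 3, 3, 3, 2, 2])  # property.product within one permutation
changes = s.diff().ne(0, fill_value=0)   # True exactly where the product changes
print(changes.cumsum().tolist())         # [0, 0, 0, 1, 1, 1, 2, 2]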
To put [id_prod_perm, NaN, NaN, NaN, 3] rows at the end of each id.product_permutation group, we can detect the changing points of id.product_permutation and insert columns into the transposed frame, which, once transposed back, amounts to inserting rows into the original one:
# following is [8, 16] for the above example
changing_points = np.where(df["id.product_permutation"]
                             .diff().ne(0, fill_value=0))[0].tolist()
# insert into the transpose and then come back
df = df.T
offset = 0  # helper for the insertion location
for j, point in enumerate(changing_points, start=1):
    # at the given point, insert a column with the same name
    df.insert(loc=point+offset, column=point, value=[j, *[np.nan]*3, 3],
              allow_duplicates=True)
    # since an insertion enlarges the frame, old changing points
    # need to increase; this is handled by the `offset`
    offset += 1
# go back to the original form, and also reset the index to 0..N-1
df = df.T.reset_index(drop=True)
to get
>>> df
id.product_permutation id.iteration property.product property.price n_items_sold
0 1.0 1.0 1.0 200.0 0.0
1 1.0 2.0 1.0 300.0 0.0
2 1.0 3.0 1.0 400.0 0.0
3 1.0 4.0 3.0 100.0 1.0
4 1.0 5.0 3.0 200.0 1.0
5 1.0 6.0 3.0 300.0 1.0
6 1.0 7.0 2.0 500.0 2.0
7 1.0 8.0 2.0 600.0 2.0
8 1.0 NaN NaN NaN 3.0
9 2.0 1.0 3.0 300.0 0.0
10 2.0 2.0 3.0 400.0 0.0
11 2.0 3.0 1.0 200.0 1.0
12 2.0 4.0 1.0 300.0 1.0
13 2.0 5.0 2.0 700.0 2.0
14 2.0 6.0 2.0 800.0 2.0
15 2.0 7.0 2.0 900.0 2.0
16 2.0 8.0 2.0 700.0 2.0
17 2.0 NaN NaN NaN 3.0
18 3.0 1.0 1.0 200.0 0.0
I would like to transform a regular dataframe to a multi-index dataframe with overlap and shift.
For example, the input dataframe is like this sample code:
import pandas as pd
import numpy as np
df = pd.DataFrame(data=np.arange(0, 12).reshape(-1, 2), columns=['d1', 'd2'], dtype=float)
df.index.name = 'idx'
print(df)
Output:
d1 d2
idx
0 0.0 1.0
1 2.0 3.0
2 4.0 5.0
3 6.0 7.0
4 8.0 9.0
5 10.0 11.0
What I want as output: overlapping batches that shift by one row each time, with a column batchid labelling every shift, like this (batchsize=4):
d1 d2
idx batchid
0 0 0.0 1.0
1 0 2.0 3.0
2 0 4.0 5.0
3 0 6.0 7.0
1 1 2.0 3.0
2 1 4.0 5.0
3 1 6.0 7.0
4 1 8.0 9.0
2 2 4.0 5.0
3 2 6.0 7.0
4 2 8.0 9.0
5 2 10.0 11.0
My work so far:
I can make it work by iterating and concatenating the pieces together, but it takes a lot of time.
batchsize = 4
ds, ids = [], []
idx = df.index.values
for bi in range(int(len(df) - batchsize + 1)):
    ids.append(idx[bi:bi+batchsize])
for k, idx in enumerate(ids):
    di = df.loc[pd.IndexSlice[idx], :].copy()
    di['batchid'] = k
    ds.append(di)
res = pd.concat(ds).fillna(0)
res.set_index('batchid', inplace=True, append=True)
Is there a way to vectorize and accelerate this process?
Thanks.
First we create a 'mask' that tells us which rows go into which batch id:
nrows = len(df)
batchsize = 4
mask_columns = {i:np.pad([1]*batchsize,(i,nrows-batchsize-i)) for i in range(nrows-batchsize+1)}
mask_df = pd.DataFrame(mask_columns)
df = df.join(mask_df)
this adds a few columns to df:
idx d1 d2 0 1 2
----- ---- ---- --- --- ---
0 0 1 1 0 0
1 2 3 1 1 0
2 4 5 1 1 1
3 6 7 1 1 1
4 8 9 0 1 1
5 10 11 0 0 1
This now looks like a df with 'dummies', and we need to 'reverse' the dummies:
df2 = df.set_index(['d1','d2'], drop=True)
df2[df2==1].stack().reset_index().drop(columns=0).sort_values('level_2').rename(columns={'level_2': 'batchid'})
produces
d1 d2 batchid
-- ---- ---- ---------
0 0 1 0
1 2 3 0
3 4 5 0
6 6 7 0
2 2 3 1
4 4 5 1
7 6 7 1
9 8 9 1
5 4 5 2
8 6 7 2
10 8 9 2
11 10 11 2
You can accomplish this with a list comprehension inside pd.concat, using iloc with a variable i that iterates through a range. This should be quicker:
batchsize = 4
df = (pd.concat([df.iloc[i:batchsize+i].assign(batchid=i)
for i in range(df.shape[0] - batchsize + 1)])
.set_index(['batchid'], append=True))
df
Out[1]:
d1 d2
idx batchid
0 0 0.0 1.0
1 0 2.0 3.0
2 0 4.0 5.0
3 0 6.0 7.0
1 1 2.0 3.0
2 1 4.0 5.0
3 1 6.0 7.0
4 1 8.0 9.0
2 2 4.0 5.0
3 2 6.0 7.0
4 2 8.0 9.0
5 2 10.0 11.0
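If you want to avoid the Python-level loop entirely, a fully vectorized variant is also possible with NumPy's sliding_window_view; this is a minimal sketch that starts again from the original 6-row df with its default integer index:
import numpy as np

batchsize = 4
n_batches = len(df) - batchsize + 1

# positional row indices of every batch, flattened: 0,1,2,3, 1,2,3,4, 2,3,4,5
rows = np.lib.stride_tricks.sliding_window_view(np.arange(len(df)), batchsize).ravel()
batchid = np.repeat(np.arange(n_batches), batchsize)

res = df.iloc[rows].copy()
res['batchid'] = batchid
res = res.set_index('batchid', append=True)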
I think this has probably been answered, but I can't find the answer anywhere. It is pretty trivial. How can I add a list to a pandas dataframe as a column, but keep the NaNs at the top?
This is the code I have:
df = pd.DataFrame()
a = [1,2,3,4,5,6,7]
b = [2,3,5,6,4,3,2]
c = [2,3,5,6,4,3]
d = [1,2,3,4]
df["a"] = a
df["b"] = b
df.loc[range(len(c)),'c'] = c
df.loc[range(len(d)),'d'] = d
print(df)
which returns this:
a b c d
0 1 2 2.0 1.0
1 2 3 3.0 2.0
2 3 5 5.0 3.0
3 4 6 6.0 4.0
4 5 4 4.0 NaN
5 6 3 3.0 NaN
6 7 2 NaN NaN
However, I would like it to return this instead:
a b c d
0 1 2 NaN NaN
1 2 3 2.0 NaN
2 3 5 3.0 NaN
3 4 6 5.0 1.0
4 5 4 6.0 2.0
5 6 3 4.0 3.0
6 7 2 3.0 4.0
Let us try
df=df.apply(lambda x : sorted(x,key=pd.notnull))
a b c d
0 1 2 NaN NaN
1 2 3 2.0 NaN
2 3 5 3.0 NaN
3 4 6 5.0 1.0
4 5 4 6.0 2.0
5 6 3 4.0 3.0
6 7 2 3.0 4.0
You can sort the dataframe rows using a key argument to keep the NaNs first:
l = df.apply(sorted, key=lambda s: (~np.isnan(s), s), axis=0)
If the problem is with assignment rather than transformation, you can also try iloc with get_loc after creating a dictionary (d):
d = {'c':c, 'd':d}
df = df.reindex(columns=df.columns.union(d.keys(), sort=False))
for k, v in d.items():
    df.iloc[-len(v):, df.columns.get_loc(k)] = v
print(df)
a b c d
0 1 2 NaN NaN
1 2 3 2.0 NaN
2 3 5 3.0 NaN
3 4 6 5.0 1.0
4 5 4 6.0 2.0
5 6 3 4.0 3.0
6 7 2 3.0 4.0
You can find out how many NaN values a column has (using s.isna().sum()) and then shift() that column down by that amount.
Code example on the d column:
import pandas as pd
df = pd.DataFrame()
a = [1,2,3,4,5,6,7]
b = [2,3,5,6,4,3,2]
c = [2,3,5,6,4,3]
d = [1,2,3,4]
df["a"] = a
df["b"] = b
df.loc[range(len(c)),'c'] = c
df.loc[range(len(d)),'d'] = d
df['d'] = df['d'].shift(df['d'].isna().sum())  # example on the 'd' column
print(df)
Output:
a b c d
0 1 2 2.0 NaN
1 2 3 3.0 NaN
2 3 5 5.0 NaN
3 4 6 6.0 1.0
4 5 4 4.0 2.0
5 6 3 3.0 3.0
6 7 2 NaN 4.0
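To apply the same idea to every column at once, a small sketch (starting from the frame as built in the question, before any shifting; columns without NaN are shifted by 0 and stay unchanged):
for col in df.columns:
    df[col] = df[col].shift(int(df[col].isna().sum()))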
Another way to do it: reset the index and sort the values with na_position='first' so the NaN rows come first.
df.reset_index()
df2 = df.sort_values(by =['a','b','c','d'], ascending = False, na_position='first')
#Result
a b c d
6 7 2 NaN NaN
5 6 3 3.0 NaN
4 5 4 4.0 NaN
3 4 6 6.0 4.0
2 3 5 5.0 3.0
1 2 3 3.0 2.0
0 1 2 2.0 1.0
I'd like to restrict my dropna operation to the first 3 rows of the dataframe. The original dataframe is:
A C
0 0.0 0
1 NaN 1
2 2.0 2
3 3.0 3
4 NaN 4
5 5.0 5
6 6.0 6
And I would love to see:
A C
0 0.0 0
2 2.0 2
3 3.0 3
4 NaN 4
5 5.0 5
6 6.0 6
With only the row at index 1 removed. Is it possible to do this in just one line of code?
Thanks!
You could use
In [594]: df[df.notnull().all(1) | (df.index > 3)]
Out[594]:
A C
0 0.0 0
2 2.0 2
3 3.0 3
4 NaN 4
5 5.0 5
6 6.0 6
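For this example, an equivalent variant that makes the "first 3 rows" explicit via positional slicing (a sketch):
import pandas as pd

pd.concat([df.iloc[:3].dropna(), df.iloc[3:]])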
I have the following dataframe describing the percent of shares held by a type of investor in a company:
company investor pct
1 A 1
1 A 2
1 B 4
2 A 2
2 A 4
2 A 6
2 C 10
2 C 8
And I would like to create a new column for each investor type containing the mean of the shares held in each company. I also need to keep the same length of the dataset, using transform for instance.
Here is the result I would like to have:
company investor pct pct_mean_A pct_mean_B pct_mean_C
1 A 1 1.5 4 0
1 A 2 1.5 4 0
1 B 4 1.5 4 0
2 A 2 4.0 0 9
2 A 4 4.0 0 9
2 A 6 4.0 0 9
2 C 10 4.0 0 9
2 C 8 4.0 0 9
Thanks a lot for your help!
Use groupby with mean aggregation, reshape with unstack into a helper DataFrame, and join it back to the original df:
s = (df.groupby(['company','investor'])['pct']
.mean()
.unstack(fill_value=0)
.add_prefix('pct_mean_'))
df = df.join(s, 'company')
print (df)
company investor pct pct_mean_A pct_mean_B pct_mean_C
0 1 A 1 1.5 4.0 0.0
1 1 A 2 1.5 4.0 0.0
2 1 B 4 1.5 4.0 0.0
3 2 A 2 4.0 0.0 9.0
4 2 A 4 4.0 0.0 9.0
5 2 A 6 4.0 0.0 9.0
6 2 C 10 4.0 0.0 9.0
7 2 C 8 4.0 0.0 9.0
Or use pivot_table with default aggregate function mean:
s = df.pivot_table(index='company',
columns='investor',
values='pct',
fill_value=0).add_prefix('pct_mean_')
df = df.join(s, 'company')
print (df)
company investor pct pct_mean_A pct_mean_B pct_mean_C
0 1 A 1 1.5 4 0
1 1 A 2 1.5 4 0
2 1 B 4 1.5 4 0
3 2 A 2 4.0 0 9
4 2 A 4 4.0 0 9
5 2 A 6 4.0 0 9
6 2 C 10 4.0 0 9
7 2 C 8 4.0 0 9
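Since the question mentions transform, here is a sketch of a transform-based alternative that builds one pct_mean_ column per investor type directly (the loop variable inv is just an illustrative name); for this data it reproduces the columns shown above:
for inv in df['investor'].unique():
    df[f'pct_mean_{inv}'] = (df['pct'].where(df['investor'].eq(inv))
                                      .groupby(df['company'])
                                      .transform('mean')
                                      .fillna(0))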