How to vectorize a non-overlapping dataframe into an overlapping, shifted dataframe? - python

I would like to transform a regular dataframe into a multi-index dataframe with overlapping, shifted batches.
For example, the input dataframe is created by this sample code:
import pandas as pd
import numpy as np
df = pd.DataFrame(data=np.arange(0, 12).reshape(-1, 2), columns=['d1', 'd2'], dtype=float)
df.index.name = 'idx'
print(df)
Output:
d1 d2
idx
0 0.0 1.0
1 2.0 3.0
2 4.0 5.0
3 6.0 7.0
4 8.0 9.0
5 10.0 11.0
What I want as output: overlapping batches that shift by one row each time, with a batchid column labelling each shift, like this (batchsize=4):
d1 d2
idx batchid
0 0 0.0 1.0
1 0 2.0 3.0
2 0 4.0 5.0
3 0 6.0 7.0
1 1 2.0 3.0
2 1 4.0 5.0
3 1 6.0 7.0
4 1 8.0 9.0
2 2 4.0 5.0
3 2 6.0 7.0
4 2 8.0 9.0
5 2 10.0 11.0
My work so far:
I can make it work by iterating and concatenating the pieces, but it takes a lot of time.
batchsize = 4
ds, ids = [], []
idx = df.index.values
for bi in range(int(len(df) - batchsize + 1)):
    ids.append(idx[bi:bi+batchsize])
for k, idx in enumerate(ids):
    di = df.loc[pd.IndexSlice[idx], :].copy()
    di['batchid'] = k
    ds.append(di)
res = pd.concat(ds).fillna(0)
res.set_index('batchid', inplace=True, append=True)
Is there a way to vectorize and accelerate this process?
Thanks.

First we create a 'mask' that will tell us which elements go into which batch id:
nrows = len(df)
batchsize = 4
mask_columns = {i:np.pad([1]*batchsize,(i,nrows-batchsize-i)) for i in range(nrows-batchsize+1)}
mask_df = pd.DataFrame(mask_columns)
df = df.join(mask_df)
this adds a few columns to df:
idx d1 d2 0 1 2
----- ---- ---- --- --- ---
0 0 1 1 0 0
1 2 3 1 1 0
2 4 5 1 1 1
3 6 7 1 1 1
4 8 9 0 1 1
5 10 11 0 0 1
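To see what each mask column looks like on its own, here is a quick illustration (a sketch, using the same np.pad call as the dict comprehension above) for batch id 1, where nrows=6 and batchsize=4:
import numpy as np

np.pad([1] * 4, (1, 6 - 4 - 1))   # pads with zeros -> array([0, 1, 1, 1, 1, 0])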
This now looks like a df with 'dummies', and we need to 'reverse' the dummies:
df2 = df.set_index(['d1','d2'], drop=True)
df2[df2==1].stack().reset_index().drop(columns=0).sort_values('level_2').rename(columns={'level_2':'batchid'})
produces
d1 d2 batchid
-- ---- ---- ---------
0 0 1 0
1 2 3 0
3 4 5 0
6 6 7 0
2 2 3 1
4 4 5 1
7 6 7 1
9 8 9 1
5 4 5 2
8 6 7 2
10 8 9 2
11 10 11 2

You can accomplish this with a list comprehension inside pd.concat, using iloc with a variable i that iterates through a range. This should be quicker:
batchsize = 4
df = (pd.concat([df.iloc[i:batchsize+i].assign(batchid=i)
                 for i in range(df.shape[0] - batchsize + 1)])
        .set_index(['batchid'], append=True))
df
Out[1]:
d1 d2
idx batchid
0 0 0.0 1.0
1 0 2.0 3.0
2 0 4.0 5.0
3 0 6.0 7.0
1 1 2.0 3.0
2 1 4.0 5.0
3 1 6.0 7.0
4 1 8.0 9.0
2 2 4.0 5.0
3 2 6.0 7.0
4 2 8.0 9.0
5 2 10.0 11.0
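For a fully vectorized take (a sketch of my own, not from the answers above, assuming NumPy >= 1.20 for sliding_window_view), you can build all window positions at once and take the rows in a single iloc call:
import numpy as np
import pandas as pd

df = pd.DataFrame(data=np.arange(0, 12).reshape(-1, 2), columns=['d1', 'd2'], dtype=float)
df.index.name = 'idx'

batchsize = 4
# one row per batch, one column per position inside the batch
windows = np.lib.stride_tricks.sliding_window_view(np.arange(len(df)), batchsize)
rows = windows.ravel()                                    # positional row index for every (batch, row) pair
batchid = np.repeat(np.arange(windows.shape[0]), batchsize)

res = df.iloc[rows].copy()
res['batchid'] = batchid
res = res.set_index('batchid', append=True)
print(res)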

Related

Row-wise replace operation in pandas dataframe

In the given data frame, I am trying to perform a row-wise replace operation where 1 should be replaced by the value in the Values column.
Input:
import pandas as pd
import numpy as np

df = pd.DataFrame({'ID': [1,1,1,2,3,3,4,5,6,7],
                   'A': [0,1,0,1,0,0,1,0,np.nan,0],
                   'B': [0,0,0,0,1,1,0,0,0,0],
                   'C': [1,0,1,0,0,0,0,0,1,1],
                   'Values': [10, 2, 3,4,9,3,4,5,2,3]})
Expected Output:
ID A B C Values
0 1 0.0 0 10 10
1 1 2.0 0 0 2
2 1 0.0 0 3 3
3 2 4.0 0 0 4
4 3 0.0 9 0 9
5 3 0.0 3 0 3
6 4 4.0 0 0 4
7 5 0.0 0 0 5
8 6 NaN 0 2 2
9 7 0.0 0 3 3
Note: The data is very large.
Use df.where
df[['A','B','C']]=df[['A','B','C']].where(df[['A','B','C']].ne(1),df['Values'], axis=0)
ID A B C Values
0 1 0.0 0 10 10
1 1 2.0 0 0 2
2 1 0.0 0 3 3
3 2 4.0 0 0 4
4 3 0.0 9 0 9
5 3 0.0 3 0 3
6 4 4.0 0 0 4
7 5 0.0 0 0 5
8 6 NaN 0 2 2
9 7 0.0 0 3 3
Or
df[['A','B','C']]=df[['A','B','C']].mask(df[['A','B','C']].eq(1),df['Values'], axis=0)
My data is really large and it is very slow.
If we exploit the nature of your dataset (the A, B, C columns only hold 1s, 0s or NaNs), you simply have to multiply df['Values'] with each column independently. This should be very fast as it is vectorized.
df['A'] = df['A']*df['Values']
df['B'] = df['B']*df['Values']
df['C'] = df['C']*df['Values']
print(df)
ID A B C Values
0 1 0.0 0 10 10
1 1 2.0 0 0 2
2 1 0.0 0 3 3
3 2 4.0 0 0 4
4 3 0.0 9 0 9
5 3 0.0 3 0 3
6 4 4.0 0 0 4
7 5 0.0 0 0 5
8 6 NaN 0 2 2
9 7 0.0 0 3 3
If you want to explicitly check the condition where the values of A, B, C are 1 (maybe because those columns could have values other than NaNs or 0s), then you can use this:
df[['A','B','C']] = (df[['A','B','C']] == 1)*df[['Values']].values
This will replace the columns A, B, C in the original data, but it also replaces NaNs with 0.
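If the frame is large enough that even the pandas alignment overhead matters, a plain NumPy sketch along the same lines (my own, not from the answers above) may be worth benchmarking:
import numpy as np
import pandas as pd

df = pd.DataFrame({'ID': [1,1,1,2,3,3,4,5,6,7],
                   'A': [0,1,0,1,0,0,1,0,np.nan,0],
                   'B': [0,0,0,0,1,1,0,0,0,0],
                   'C': [1,0,1,0,0,0,0,0,1,1],
                   'Values': [10, 2, 3,4,9,3,4,5,2,3]})

cols = ['A', 'B', 'C']
vals = df[cols].to_numpy()
# broadcast the Values column across A, B, C and replace the 1s in one shot
df[cols] = np.where(vals == 1, df[['Values']].to_numpy(), vals)
print(df)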

Add list to Pandas Dataframe, but keep NaNs at the top

I think this has probably been answered, but I can't find the answer anywhere. It is pretty trivial. How can I add a list to a pandas dataframe as a column, but keep the NaNs at the top?
This is the code I have:
import pandas as pd

df = pd.DataFrame()
a = [1,2,3,4,5,6,7]
b = [2,3,5,6,4,3,2]
c = [2,3,5,6,4,3]
d = [1,2,3,4]
df["a"] = a
df["b"] = b
df.loc[range(len(c)),'c'] = c
df.loc[range(len(d)),'d'] = d
print(df)
which returns this:
a b c d
0 1 2 2.0 1.0
1 2 3 3.0 2.0
2 3 5 5.0 3.0
3 4 6 6.0 4.0
4 5 4 4.0 NaN
5 6 3 3.0 NaN
6 7 2 NaN NaN
However, I would like it to return this instead:
a b c d
0 1 2 NaN NaN
1 2 3 2.0 NaN
2 3 5 3.0 NaN
3 4 6 5.0 1.0
4 5 4 6.0 2.0
5 6 3 4.0 3.0
6 7 2 3.0 4.0
Let us try
df=df.apply(lambda x : sorted(x,key=pd.notnull))
a b c d
0 1 2 NaN NaN
1 2 3 2.0 NaN
2 3 5 3.0 NaN
3 4 6 5.0 1.0
4 5 4 6.0 2.0
5 6 3 4.0 3.0
6 7 2 3.0 4.0
You can sort the dataframe rows using a key argument to keep NaNs first:
l = df.apply(sorted, key=lambda s: (~np.isnan(s), s), axis=0)
If the problem is with assignment rather than transformation, you can also try iloc with get_loc after creating a dictionary (d):
d = {'c':c,'d':d}
df = df.reindex(columns=df.columns.union(d.keys(), sort=False))
for k,v in d.items():
    df.iloc[-len(v):, df.columns.get_loc(k)] = v
print(df)
a b c d
0 1 2 NaN NaN
1 2 3 2.0 NaN
2 3 5 3.0 NaN
3 4 6 5.0 1.0
4 5 4 6.0 2.0
5 6 3 4.0 3.0
6 7 2 3.0 4.0
You can find out how many values in a column are NaN (using s.isna().sum()) and then shift() that column down by that amount.
Code example on the 'd' column:
import pandas as pd
df = pd.DataFrame()
a = [1,2,3,4,5,6,7]
b = [2,3,5,6,4,3,2]
c = [2,3,5,6,4,3]
d = [1,2,3,4]
df["a"] = a
df["b"] = b
df.loc[range(len(c)),'c'] = c
df.loc[range(len(d)),'d'] = d
df['d'] = df['d'].shift(df['d'].isna().sum())  # example on the 'd' column
print(df)
Output:
a b c d
0 1 2 2.0 NaN
1 2 3 3.0 NaN
2 3 5 5.0 NaN
3 4 6 6.0 1.0
4 5 4 4.0 2.0
5 6 3 3.0 3.0
6 7 2 NaN 4.0
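A hedged generalization of the same idea (a sketch, not part of the answer above): continuing from the df built just before, shift every column down by its own NaN count so all the NaNs end up on top while the value order is preserved.
df = df.apply(lambda s: s.shift(int(s.isna().sum())))  # columns without NaNs are shifted by 0
print(df)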
Another way to do it: just reset the index and sort the values with NaNs first.
df.reset_index()
df2 = df.sort_values(by =['a','b','c','d'], ascending = False, na_position='first')
#Result
a b c d
6 7 2 NaN NaN
5 6 3 3.0 NaN
4 5 4 4.0 NaN
3 4 6 6.0 4.0
2 3 5 5.0 3.0
1 2 3 3.0 2.0
0 1 2 2.0 1.0

Pandas: Drop duplicates based on row value

I have a dataframe and I want to drop duplicates based on different conditions....
A B
0 1 1.0
1 1 1.0
2 2 2.0
3 2 2.0
4 3 3.0
5 4 4.0
6 5 5.0
7 - 5.1
8 - 5.1
9 - 5.3
I want to drop all the duplicates from column A except the rows with "-". After that, I want to drop duplicates among the rows whose column A value is "-" based on their column B value. Given the input dataframe, this should return the following:
A B
0 1 1.0
2 2 2.0
4 3 3.0
5 4 4.0
6 5 5.0
7 - 5.1
9 - 5.3
I have the following code, but it's not very efficient for very large amounts of data. How can I improve this?
def generate(df):
    str_col = df[df["A"] == "-"]
    df.drop(df[df["A"] == "-"].index, inplace=True)
    df = df.drop_duplicates(subset="A")
    str_col = str_col.drop_duplicates(subset="B")
    bigdata = df.append(str_col, ignore_index=True)
    return bigdata.sort_values("B")
duplicated and eq:
df[~df.duplicated('A')            # keep those not duplicates in A
   | (df['A'].eq('-')             # or those '-' in A
      & ~df['B'].duplicated())]   # which are not duplicates in B
Output:
A B
0 1 1.0
2 2 2.0
4 3 3.0
5 4 4.0
6 5 5.0
7 - 5.1
9 - 5.3
df.drop_duplicates(subset=['A', 'B'])
Given a full set of data:
A B C
0 1 1.0 0
1 1 1.0 1
2 2 2.0 2
3 2 2.0 3
4 3 3.0 4
5 4 4.0 5
6 5 5.0 6
7 - 5.1 7
8 - 5.1 8
9 - 5.3 9
Result:
A B C
0 1 1.0 0
2 2 2.0 2
4 3 3.0 4
5 4 4.0 5
6 5 5.0 6
7 - 5.1 7
9 - 5.3 9
groupby + head
df.groupby(['A','B']).head(1)
Out[7]:
A B
0 1 1.0
2 2 2.0
4 3 3.0
5 4 4.0
6 5 5.0
7 - 5.1
9 - 5.3
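One caveat worth illustrating (my own sketch with made-up data, not from the answers above): the boolean-mask answer dedupes on A alone (apart from the '-' rows), while drop_duplicates(subset=['A','B']) and groupby(['A','B']).head(1) only drop rows duplicated in both columns, so the results can differ when a non-'-' value of A repeats with different B values.
import pandas as pd

df = pd.DataFrame({'A': ['1', '1', '-', '-'],
                   'B': [1.0, 2.0, 5.1, 5.1]})

mask = ~df.duplicated('A') | (df['A'].eq('-') & ~df['B'].duplicated())
print(df[mask])                               # keeps rows 0 and 2 only
print(df.drop_duplicates(subset=['A', 'B']))  # keeps rows 0, 1 and 2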

How to restrict the area for operation in python pandas dataframes?

I'd like to restrict my dropna operation to the first 3 rows of the dataframe. The original dataframe is:
A C
0 0.0 0
1 NaN 1
2 2.0 2
3 3.0 3
4 NaN 4
5 5.0 5
6 6.0 6
And I would love to see:
A C
0 0.0 0
2 2.0 2
3 3.0 3
4 NaN 4
5 5.0 5
6 6.0 6
With only the row at index 1 removed. Is it possible to do it in just one line of code?
Thanks!
You could use
In [594]: df[df.notnull().all(1) | (df.index > 3)]
Out[594]:
A C
0 0.0 0
2 2.0 2
3 3.0 3
4 NaN 4
5 5.0 5
6 6.0 6
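An alternative sketch (my own, not from the answer above, assuming the region to clean really is the first 3 rows): run dropna only on the head of the frame and concatenate the untouched remainder back on.
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [0.0, np.nan, 2.0, 3.0, np.nan, 5.0, 6.0],
                   'C': range(7)})

cutoff = 3  # only the first `cutoff` rows are subject to dropna
print(pd.concat([df.iloc[:cutoff].dropna(), df.iloc[cutoff:]]))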

Python pandas : groupby on two columns and create new variables

I have the following dataframe describing the percent of shares held by a type of investor in a company:
company investor pct
1 A 1
1 A 2
1 B 4
2 A 2
2 A 4
2 A 6
2 C 10
2 C 8
And I would like to create a new column for each investor type containing the mean of the shares held in each company. I also need to keep the same length of the dataset, using transform for instance.
Here is the result I would like to have:
company investor pct pct_mean_A pct_mean_B pct_mean_C
1 A 1 1.5 4 0
1 A 2 1.5 4 0
1 B 4 1.5 4 0
2 A 2 4.0 0 9
2 A 4 4.0 0 9
2 A 6 4.0 0 9
2 C 10 4.0 0 9
2 C 8 4.0 0 9
Thanks a lot for your help!
Use groupby with mean aggregation, reshape with unstack into a helper DataFrame, and join it back to the original df:
s = (df.groupby(['company','investor'])['pct']
       .mean()
       .unstack(fill_value=0)
       .add_prefix('pct_mean_'))
df = df.join(s, 'company')
print(df)
company investor pct pct_mean_A pct_mean_B pct_mean_C
0 1 A 1 1.5 4.0 0.0
1 1 A 2 1.5 4.0 0.0
2 1 B 4 1.5 4.0 0.0
3 2 A 2 4.0 0.0 9.0
4 2 A 4 4.0 0.0 9.0
5 2 A 6 4.0 0.0 9.0
6 2 C 10 4.0 0.0 9.0
7 2 C 8 4.0 0.0 9.0
Or use pivot_table, whose default aggregate function is mean:
s = df.pivot_table(index='company',
                   columns='investor',
                   values='pct',
                   fill_value=0).add_prefix('pct_mean_')
df = df.join(s, 'company')
print(df)
company investor pct pct_mean_A pct_mean_B pct_mean_C
0 1 A 1 1.5 4 0
1 1 A 2 1.5 4 0
2 1 B 4 1.5 4 0
3 2 A 2 4.0 0 9
4 2 A 4 4.0 0 9
5 2 A 6 4.0 0 9
6 2 C 10 4.0 0 9
7 2 C 8 4.0 0 9
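Along the same lines, a pd.crosstab sketch (my own, not from the answer above) builds the same helper table:
import pandas as pd

df = pd.DataFrame({'company': [1, 1, 1, 2, 2, 2, 2, 2],
                   'investor': ['A', 'A', 'B', 'A', 'A', 'A', 'C', 'C'],
                   'pct': [1, 2, 4, 2, 4, 6, 10, 8]})

s = (pd.crosstab(df['company'], df['investor'],
                 values=df['pct'], aggfunc='mean')
       .fillna(0)
       .add_prefix('pct_mean_'))
df = df.join(s, on='company')
print(df)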
