Increasing computation speed to find constant features across columns - Python

I'm using the code below to find rows where a set of columns all contain the same value. I would like to increase the computation speed, since I have a very big dataframe and want to run the same operation on other column subsets:
dfSPSSstudent[dfSPSSstudent.loc[:,['Q4_1a_1', 'Q4_1a_2', 'Q4_1a_3', 'Q4_1a_4', 'Q4_1a_5', 'Q4_1a_6']].nunique(axis=1) ==1]
What would you recommend? Many thanks for your help.

You can try the following:
import pandas as pd
import numpy as np
#generate sample data
np.random.seed(0)
arr = np.random.randint(0, 5, (10**6, 5))
df = pd.DataFrame(arr, columns=list("abcde"))
print(df.head())
It gives:
a b c d e
0 4 0 3 3 3
1 1 3 2 4 0
2 0 4 2 1 0
3 1 1 0 1 4
4 3 0 3 0 2
Select rows where the columns a, b and c have equal values:
a = df[['a', 'b', 'c']].to_numpy()
ddf = df[np.all(a == a[:, 0].reshape(-1, 1), axis=1)]
print(ddf.head())
It gives:
a b c d e
11 1 1 1 3 3
21 3 3 3 2 3
41 1 1 1 3 2
52 1 1 1 1 2
137 2 2 2 3 2
The above code omits rows where columns a, b and c are all NaN. To include such rows in the result, the code can be modified as follows:
a = df[['a', 'b', 'c']].to_numpy()
ddf = df[(np.all(a == a[:, 0].reshape(-1, 1), axis=1)) |
(df.loc[:, ['a', 'b', 'c']].isna().all(axis=1))]
Timing tests:
the above code: 14.7 ms ± 358 µs
the original code with nunique(): 5 s ± 291 ms
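Since you want to run the same check on other column subsets, here is a minimal sketch that wraps the same NumPy comparison in a reusable helper (the function name constant_rows is just for illustration, not part of the original answer):
import numpy as np
import pandas as pd

def constant_rows(df, cols, keep_all_nan=False):
    # rows of df where every column in cols holds the same value
    a = df[cols].to_numpy()
    mask = np.all(a == a[:, [0]], axis=1)
    if keep_all_nan:
        # optionally keep rows where all selected columns are NaN
        mask |= df[cols].isna().all(axis=1).to_numpy()
    return df[mask]

# e.g. constant_rows(dfSPSSstudent, ['Q4_1a_1', 'Q4_1a_2', 'Q4_1a_3'])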

Related

Pandas DataFrame: resampling along integer index / grouping by groups of n elements

I know about pandas resampling functions using a DatetimeIndex.
But how can I easily resample/group along an integer index?
The following code illustrates the problem and works:
import numpy as np
import pandas as pd
df = pd.DataFrame(np.random.randint(5, size=(10, 2)), columns=list('AB'))
print(df)
A B
0 3 2
1 1 1
2 0 1
3 2 3
4 2 0
5 4 0
6 3 1
7 3 4
8 0 2
9 4 4
# sum of n consecutive elements
n = 3
tuples = [(i, i+n-1) for i in range(0, len(df.index), n)]
df_new = pd.concat([df.loc[i[0]:i[1]].sum() for i in tuples], axis=1).T
print(df_new)
A B
0 4 4
1 8 3
2 6 7
3 4 4
But isn't there a more elegant way to accomplish this?
The code seems a bit heavy-handed to me.
Thanks in advance!
You can floor-divide the index and aggregate with some function:
df1 = df.groupby(df.index // n).sum()
If the index is not the default (a unique integer range), aggregate by a floor-divided numpy.arange created from the length of the DataFrame:
df1 = df.groupby(np.arange(len(df)) // n).sum()
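For instance, a quick sketch with a non-integer index (assuming n = 3), where grouping by position still works:
import numpy as np
import pandas as pd

n = 3
df = pd.DataFrame({'A': range(10), 'B': range(10)}, index=list('abcdefghij'))

# group every n consecutive rows by position, regardless of the index labels
df1 = df.groupby(np.arange(len(df)) // n).sum()
print(df1)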
You can use groupby on the integer division of the index by n, i.e.
df.groupby(lambda i: i//n).sum()
Here is the code:
import numpy as np
import pandas as pd
n=3
df = pd.DataFrame(np.random.randint(5, size=(10, 2)), columns=list('AB'))
print('df:')
print(df)
res = df.groupby(lambda i: i//n).sum()
print('using groupby:')
print(res)
tuples = [(i, i+n-1) for i in range(0, len(df.index), n)]
df_new = pd.concat([df.loc[i[0]:i[1]].sum() for i in tuples], axis=1).T
print('using your method:')
print(df_new)
and the output:
df:
A B
0 1 0
1 3 0
2 1 1
3 0 4
4 3 4
5 0 1
6 0 4
7 4 0
8 0 2
9 2 2
using groupby:
A B
0 5 1
1 3 9
2 4 6
3 2 2
using your method:
A B
0 5 1
1 3 9
2 4 6
3 2 2
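If you need more than one statistic per block of n rows, the same floor-division grouping also works with agg (a small sketch, not part of the original answers):
import numpy as np
import pandas as pd

n = 3
df = pd.DataFrame(np.random.randint(5, size=(10, 2)), columns=list('AB'))

# sum and mean of each block of n consecutive rows
res = df.groupby(df.index // n).agg(['sum', 'mean'])
print(res)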

Pandas : Sum multiple columns and get results in multiple columns

I have a "sample.txt" like this.
idx A B C D cat
J 1 2 3 1 x
K 4 5 6 2 x
L 7 8 9 3 y
M 1 2 3 4 y
N 4 5 6 5 z
O 7 8 9 6 z
With this dataset, I want to get sums both across rows and across columns.
Summing across rows is not a big deal.
I produced that result like this:
### MY CODE ###
import pandas as pd
df = pd.read_csv('sample.txt',sep="\t",index_col='idx')
df.info()
df2 = df.groupby('cat').sum()
print( df2 )
The result is like this.
A B C D
cat
x 5 7 9 3
y 8 10 12 7
z 11 13 15 11
But I don't know how to write code to get a result like this
(simply adding the values of columns A and B, and likewise columns C and D):
AB CD
J 3 4
K 9 8
L 15 12
M 3 7
N 9 11
O 15 15
Could anybody help me write this code?
By the way, I'd rather not do it like this
(it looks clumsy, but if it is the only way, I'll accept it):
df2 = df['A'] + df['B']
df3 = df['C'] + df['D']
df = pd.DataFrame([df2,df3],index=['AB','CD']).transpose()
print( df )
When you pass a dictionary or callable to groupby, it gets applied along an axis. Here I specified axis=1, which is the columns.
d = dict(A='AB', B='AB', C='CD', D='CD')
df.groupby(d, axis=1).sum()
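Note that grouping along axis=1 is deprecated in recent pandas versions; if that applies to your install, an equivalent sketch (assuming the same mapping d) is to transpose, group on the index, and transpose back:
d = dict(A='AB', B='AB', C='CD', D='CD')
# restrict to the mapped columns first, then group the transposed frame
df[list(d)].T.groupby(d).sum().T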
Use concat with sum:
df = df.set_index('idx')
df = pd.concat([df[['A', 'B']].sum(axis=1), df[['C', 'D']].sum(axis=1)], axis=1, keys=['AB','CD'])
print(df)
AB CD
idx
J 3 4
K 9 8
L 15 12
M 3 7
N 9 11
O 15 15
Does this do what you need? By using axis=1 with DataFrame.apply, you can use the data that you want in a row to construct a new column. Then you can drop the columns that you don't want anymore.
In [1]: import pandas as pd
In [5]: df = pd.DataFrame(columns=['A', 'B', 'C', 'D'], data=[[1, 2, 3, 4], [1, 2, 3, 4]])
In [6]: df
Out[6]:
A B C D
0 1 2 3 4
1 1 2 3 4
In [7]: df['CD'] = df.apply(lambda x: x['C'] + x['D'], axis=1)
In [8]: df
Out[8]:
A B C D CD
0 1 2 3 4 7
1 1 2 3 4 7
In [13]: df.drop(['C', 'D'], axis=1)
Out[13]:
A B CD
0 1 2 7
1 1 2 7
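As a side note, the same columns can usually be built without apply; a vectorized sketch of the same idea, assuming the small example frame above:
import pandas as pd

df = pd.DataFrame(columns=['A', 'B', 'C', 'D'], data=[[1, 2, 3, 4], [1, 2, 3, 4]])

# row-wise sums without apply
df['AB'] = df[['A', 'B']].sum(axis=1)
df['CD'] = df[['C', 'D']].sum(axis=1)
print(df.drop(['A', 'B', 'C', 'D'], axis=1))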

Drop pandas dataframe rows AND columns in a batch fashion based on value

Background: I have a matrix that represents the distances between data points. In this matrix, both the rows and the columns are the data points. For example:
A B C
A 0 999 3
B 999 0 999
C 3 999 0
In this toy example let's say I want to drop C for some reason, because it is far away from any other point. So I first aggregate the count:
df["far_count"] = df[df == 999].count()
and then batch remove them:
df = df[df["far_count"] == 2]
In this example this looks a bit redundant, but please imagine that I have many data points like this (say on the order of tens of thousands).
The problem with the above batch removal is that I would like to remove rows and columns at the same time (instead of just rows), and it is unclear to me how to do so elegantly. A naive way is to get a list of such data points, put it in a loop, and then:
for item in list:
    df = df.drop(item, axis=1).drop(item, axis=0)
But I was wondering if there is a better way. (Bonus if we could skip the intermediate step far_count.)
np.random.seed([3,14159])
idx = pd.Index(list('ABCDE'))
a = np.random.randint(3, size=(5, 5))
df = pd.DataFrame(
    a.T.dot(a) * (1 - np.eye(5, dtype=int)),
    idx, idx)
df
A B C D E
A 0 4 2 4 2
B 4 0 1 5 2
C 2 1 0 2 6
D 4 5 2 0 3
E 2 2 6 3 0
l = ['A', 'C']
m = df.index.isin(l)
df.loc[~m, ~m]
B D E
B 0 5 2
D 5 0 3
E 2 3 0
For your specific case, because the array is symmetric you only need to check one dimension.
m = (df.values == 999).sum(0) == len(df) - 1
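Putting those pieces together on the toy matrix from the question, a sketch that also skips the intermediate far_count column:
import numpy as np
import pandas as pd

idx = pd.Index(list('ABC'))
df = pd.DataFrame([[  0, 999,   3],
                   [999,   0, 999],
                   [  3, 999,   0]], index=idx, columns=idx)

# a point is "far" if it is 999 away from every other point
far = (df.values == 999).sum(axis=0) == len(df) - 1
print(df.loc[~far, ~far])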
In [66]: x = pd.DataFrame(np.triu(df), df.index, df.columns)
In [67]: x
Out[67]:
A B C
A 0 999 3
B 0 0 999
C 0 0 0
In [68]: mask = x.ne(999).all(1) | x.ne(999).all(0)
In [69]: df.loc[mask, mask]
Out[69]:
A C
A 0 3
C 3 0

What is the most efficient way to create a DataFrame from two unrelated series?

I'm looking at creating a Dataframe that is the combination of two unrelated series.
If we take two dataframes:
A = ['a','b','c']
B = [1,2,3,4]
dfA = pd.DataFrame(A)
dfB = pd.DataFrame(B)
I'm looking for this output:
A B
0 a 1
1 a 2
2 a 3
3 a 4
4 b 1
5 b 2
6 b 3
7 b 4
8 c 1
9 c 2
10 c 3
11 c 4
One way could be to loop over the lists directly and create the DataFrame, but there must be a better way. I'm sure I'm missing something from the pandas documentation.
result = []
for i in A:
    for j in B:
        result.append([i, j])
result_DF = pd.DataFrame(result, columns=['A','B'])
Ultimately I'm looking at combining months and UUIDs; I have something working, but it takes ages to compute and relies too much on the index. A generic solution would clearly be better:
from datetime import datetime
from uuid import UUID
start = datetime(year=2016, month=1, day=1)
end = datetime(year=2016, month=4, day=1)
months = pd.date_range(start=start, end=end, freq="MS")
benefit = pd.DataFrame(index=months)
A = [UUID('d48259a6-80b5-43ca-906c-8405ab40f9a8'),
     UUID('873a65d7-582c-470e-88b6-0d02df078c04'),
     UUID('624c32a6-9998-49f4-92b6-70e712355073'),
     UUID('7207ab0c-3c7f-477e-b5bc-fbb8059c1dec')]
dfA = pd.DataFrame(A)
result = pd.DataFrame(columns=['A', 'month'])
for i in dfA.index:
    newdf = pd.DataFrame(index=benefit.index)
    newdf['A'] = dfA.iloc[i, 0]
    newdf['month'] = newdf.index
    result = pd.concat([result, newdf])
result
result
You can use np.meshgrid:
pd.DataFrame(np.array(np.meshgrid(dfA, dfB, )).T.reshape(-1, 2))
0 1
0 a 1
1 a 2
2 a 3
3 a 4
4 b 1
5 b 2
6 b 3
7 b 4
8 c 1
9 c 2
10 c 3
11 c 4
This gives a roughly 2000x speedup on DataFrame objects of length 300 and 400, respectively:
A = ['a', 'b', 'c'] * 100
B = [1, 2, 3, 4] * 100
dfA = pd.DataFrame(A)
dfB = pd.DataFrame(B)
np.meshgrid:
%%timeit
pd.DataFrame(np.array(np.meshgrid(dfA, dfB, )).T.reshape(-1, 2))
100 loops, best of 3: 8.45 ms per loop
vs cross:
%timeit cross(dfA, dfB)
1 loop, best of 3: 16.3 s per loop
So if I understand your example correctly, you could:
A = ['a', 'b', 'c']
dfA = pd.DataFrame(A)
start = datetime(year=2016, month=1, day=1)
end = datetime(year=2016, month=4, day=1)
months = pd.date_range(start=start, end=end, freq="MS")
dfB = pd.DataFrame(months.month)
pd.DataFrame(np.array(np.meshgrid(dfA, dfB, )).T.reshape(-1, 2))
to also get:
0 1
0 a 1
1 a 2
2 a 3
3 a 4
4 b 1
5 b 2
6 b 3
7 b 4
8 c 1
9 c 2
10 c 3
11 c 4
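If you are on pandas 1.2 or newer, there is also a built-in cross join, which is not used in the answers above; a minimal sketch:
import pandas as pd

dfA = pd.DataFrame({'A': ['a', 'b', 'c']})
dfB = pd.DataFrame({'B': [1, 2, 3, 4]})

# Cartesian product of the two frames (pandas >= 1.2)
result = dfA.merge(dfB, how='cross')
print(result)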
Using itertools.product:
from itertools import product
result = pd.DataFrame(list(product(dfA.iloc[:,0], dfB.iloc[:,0])))
Not quite as efficient as np.meshgrid, but it's more efficient than the other solutions.
Alternatively, with np.repeat and np.tile:
a = [1, 2, 3]
b = ['a', 'b', 'c']
x, y = np.repeat(a, len(b)), np.tile(b, len(a))
pd.DataFrame({'x': x, 'y': y})
Outputs:
x y
0 1 a
1 1 b
2 1 c
3 2 a
4 2 b
5 2 c
6 3 a
7 3 b
8 3 c
%%timeit: 1000 loops, best of 3: 559 µs per loop
EDIT: You don't actually need np.tile. A simple comprehension will do
x,y = zip(*[(i,j) for i in a for j in b])
One-liner approach:
pd.DataFrame(0, A, B).stack().index.to_series().apply(pd.Series).reset_index(drop=True)
Or:
pd.MultiIndex.from_product([A, B]).to_series().apply(pd.Series).reset_index(drop=True)
From dataframes, assuming the information is in the first column.
pd.MultiIndex.from_product([dfA.iloc[:, 0], dfB.iloc[:, 0]]).to_series().apply(pd.Series).reset_index(drop=True)
Functionalized:
def cross(df1, df2):
    s1 = df1.iloc[:, 0]
    s2 = df2.iloc[:, 0]
    midx = pd.MultiIndex.from_product([s1, s2])
    df = midx.to_series().apply(pd.Series).reset_index(drop=True)
    df.columns = [s1.name, s2.name if s1.name != s2.name else 1]
    return df

print(cross(dfA, dfB))
0 1
0 a 1
1 a 2
2 a 3
3 a 4
4 b 1
5 b 2
6 b 3
7 b 4
8 c 1
9 c 2
10 c 3
11 c 4

Python: given list of columns and list of values, return subset of dataframe that meets all criteria

I have a dataframe like the following.
df = pd.DataFrame({'A' : ['Bob','Jean','Sally','Sue'], 'B' : [1,2,3, 2],'C' : [7,8,9,8] })
Assume that column A will always be in the dataframe, but sometimes there could be column B, columns B and C, or any number of other columns.
I have written code that saves the column names (other than A) in a list, as well as the unique permutations of the values in those columns in another list. For instance, in this example, columns B and C are saved into col:
col = ['B','C']
The permutations in the simple df are 1,7; 2,8; 3,9. For simplicity assume one permutation is saved as follows:
permutation = [2,8]
How do I select the entire rows (and only those) that equal that permutation?
Right now, I am using:
a[a[col].isin(permutation)]
Unfortunately, I don't get the values in column A.
(I know how to drop the NaN values later.) But how should I do this so it stays dynamic? Sometimes there will be multiple columns. (Ultimately, I'll run through a loop and save the different iterations based upon multiple permutations of the columns other than A.)
Use the intersection of boolean series (where both conditions are true). First, the setup code:
import pandas as pd
df = pd.DataFrame({'A' : ['Bob','Jean','Sally','Sue'], 'B' : [1,2,3, 2],'C' : [7,8,9,8] })
col = ['B','C']
permutation = [2,8]
And here's the solution for this limited example:
>>> df[(df[col[0]] == permutation[0]) & (df[col[1]] == permutation[1])]
A B C
1 Jean 2 8
3 Sue 2 8
To break that down:
>>> b, c = col
>>> per_b, per_c = permutation
>>> column_b_matches = df[b] == per_b
>>> column_c_matches = df[c] == per_c
>>> intersection = column_b_matches & column_c_matches
>>> df[intersection]
A B C
1 Jean 2 8
3 Sue 2 8
Additional columns and values
To take any number of columns and values, I would create a function:
def select_rows(df, columns, values):
    if not columns or not values:
        raise Exception('must pass columns and values')
    if len(columns) != len(values):
        raise Exception('columns and values must be same length')
    intersection = True
    for c, v in zip(columns, values):
        intersection &= df[c] == v
    return df[intersection]
and to use it:
>>> select_rows(df, col, permutation)
A B C
1 Jean 2 8
3 Sue 2 8
Or you can coerce the permutation to an array and accomplish this with a single comparison, assuming numeric values:
import numpy as np
def select_rows(df, columns, values):
    return df[(df[columns] == np.array(values)).all(axis=1)]
But this does not work with your code sample as given
I figured out a solution. Aaron's answer above works well if I only have two columns. I need a solution that works regardless of the size of the df (as it will have 3-7 columns).
df = pd.DataFrame({'A' : ['Bob','Jean','Sally','Sue'], 'B' : [1,2,3, 2],'C' : [7,8,9,8] })
permutation = [2,8]
col = ['B','C']
interim = df[col].isin(permutation)
df[df.index.isin(interim[(interim != 0).all(1)].index)]
You can do it this way:
In [77]: permutation = np.array([0,2,2])
In [78]: col
Out[78]: ['a', 'b', 'c']
In [79]: df.loc[(df[col] == permutation).all(axis=1)]
Out[79]:
a b c
10 0 2 2
15 0 2 2
16 0 2 2
Your solution will not always work properly:
sample DF:
In [71]: df
Out[71]:
a b c
0 0 2 1
1 1 1 1
2 0 1 2
3 2 0 1
4 0 1 0
5 2 0 0
6 2 0 0
7 0 1 0
8 2 1 0
9 0 0 0
10 0 2 2
11 1 0 1
12 2 1 1
13 1 0 0
14 2 1 0
15 0 2 2
16 0 2 2
17 1 0 2
18 0 1 1
19 1 2 0
In [67]: col = ['a','b','c']
In [68]: permutation = [0,2,2]
In [69]: interim = df[col].isin(permutation)
Pay attention to the result:
In [70]: df[df.index.isin(interim[(interim != 0).all(1)].index)]
Out[70]:
a b c
5 2 0 0
6 2 0 0
9 0 0 0
10 0 2 2
15 0 2 2
16 0 2 2
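For completeness, a dictionary-keyed variant of the same all(axis=1) comparison (a sketch, assuming the columns and values are kept paired in a dict):
import pandas as pd

df = pd.DataFrame({'A': ['Bob', 'Jean', 'Sally', 'Sue'],
                   'B': [1, 2, 3, 2],
                   'C': [7, 8, 9, 8]})

criteria = {'B': 2, 'C': 8}

# compare the selected columns against a Series keyed by column name
mask = (df[list(criteria)] == pd.Series(criteria)).all(axis=1)
print(df[mask])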
