I have a trivial problem that I have solved using loops, but I would like to know whether some of it can be vectorized to improve performance.
Essentially I have two dataframes (DF_A and DF_B), where each row in DF_B is the sum of the corresponding row in DF_A and the row above it in DF_B. I already have the first row of values in DF_B.
df_a = [
    [1,2,3,4],
    [5,6,7,8],
    [..... more rows]
]
df_b = [
    [1,2,3,4],
    [ rows of all 0 values here, so dimensions match df_a]
]
What I am trying to achieve is that the second row in df_b, for example, holds the values of the first row in df_b plus the values of the second row in df_a. So in this case:
df_b.loc[2] = [6,8,10,12]
I was able to accomplish this with a loop over the range of df_a, saving the previous row's values and adding the row at the current index to them. It doesn't seem very efficient.
Here is a numpy solution. This should be significantly faster than a pandas loop, especially since it uses JIT-compiling via numba.
import pandas as pd
from numba import jit

a = df_a.values
b = df_b.values

@jit(nopython=True)
def fill_b(a, b):
    # each row of b is the previous row of b plus the current row of a
    for i in range(1, len(b)):
        b[i] = b[i-1] + a[i]
    return b

df_b = pd.DataFrame(fill_b(a, b))
# 0 1 2 3
# 0 1 2 3 4
# 1 6 8 10 12
# 2 15 18 21 24
# 3 28 32 36 40
# 4 45 50 55 60
Performance benchmarking
import pandas as pd, numpy as np
from numba import jit

df_a = pd.DataFrame(np.arange(1,1000001).reshape(1000,1000))

@jit(nopython=True)
def fill_b(a, b):
    for i in range(1, len(b)):
        b[i] = b[i-1] + a[i]
    return b

def jp(df_a):
    a = df_a.values
    b = np.empty(df_a.values.shape)
    # seed the first row; the remaining rows are filled by fill_b
    b[0] = np.arange(1, 1001)
    return pd.DataFrame(fill_b(a, b))
%timeit df_a.cumsum() # 16.1 ms
%timeit jp(df_a) # 6.05 ms
If the first row of df_b is the same as the first row of df_a, as in your example, you can create df_b as the cumulative sum over df_a:
df_a = pd.DataFrame(np.arange(1,17).reshape(4,4))
df_b = df_a.cumsum()
0 1 2 3
0 1 2 3 4
1 6 8 10 12
2 15 18 21 24
3 28 32 36 40
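If the known first row of df_b differs from the first row of df_a, the same idea should still apply after overwriting that row before taking the cumulative sum. A minimal sketch, assuming numeric data; fill_b_vectorized and the sample first row are made-up names and values for illustration:

```python
import numpy as np
import pandas as pd

def fill_b_vectorized(df_a, first_row):
    """Hypothetical helper: b[0] = first_row, b[i] = b[i-1] + a[i]."""
    a = df_a.to_numpy().copy()      # copy so df_a itself is not modified
    a[0] = first_row                # seed with the known first row of df_b
    return pd.DataFrame(a.cumsum(axis=0), index=df_a.index, columns=df_a.columns)

df_a = pd.DataFrame(np.arange(1, 17).reshape(4, 4))
df_b = fill_b_vectorized(df_a, [10, 20, 30, 40])
print(df_b)
```

When first_row is just df_a's own first row, this reduces to df_a.cumsum().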
Suppose I have a dataframe:
data = [[11, 10, 13], [16, 15, 45], [35, 14,9]]
df = pd.DataFrame(data, columns = ['A', 'B', 'C'])
df
The data looks like:
A B C
0 11 10 13
1 16 15 45
2 35 14 9
The real data consists of a hundred columns and a thousand rows.
I have a function that counts how many values in one column are higher than the minimum value of another column. The function looks like this:
def get_count_higher_than_min(df, column_name_string, df_col_based):
    seriesObj = df.apply(lambda x: True if x[column_name_string] > df_col_based.min(skipna=True) else False, axis=1)
    numOfRows = len(seriesObj[seriesObj == True].index)
    return numOfRows
Example call of the function:
get_count_higher_than_min(df, 'A', df['B'])
The output is 3, because the minimum value of df['B'] is 10 and three values from df['A'] are higher than 10.
The problem is that I want to compute this for every pair of columns.
I don't know an effective and efficient way to do this. I want the output in a form similar to a confusion matrix or a correlation matrix.
Example output:
A B C
A X 3 X
B X X X
C X X X
This is O(n^2 m), where n is the number of columns and m the number of rows.
minima = df.min()
m = pd.DataFrame({c: (df > minima[c]).sum()
                  for c in df.columns})
Result:
>>> m
A B C
A 2 3 3
B 2 2 3
C 2 2 2
In theory O(n log(n) m) is possible.
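One way to get close to that bound, as a sketch rather than a definitive implementation: sort the n column minima once, binary-search every value against them, and turn the per-column histogram of search positions into counts with a reverse cumulative sum. It assumes numeric columns without NaN; count_above_minima is a made-up name.

```python
import numpy as np
import pandas as pd

def count_above_minima(df):
    """out.loc[a, b] = number of values in column a greater than min of column b."""
    vals = df.to_numpy()
    n = vals.shape[1]
    minima = vals.min(axis=0)
    order = np.argsort(minima)          # column positions sorted by their minimum
    sorted_min = minima[order]
    # k[i, j] = how many column minima are strictly below vals[i, j]
    k = np.searchsorted(sorted_min, vals, side='left')
    out = np.empty((n, n), dtype=np.int64)
    for j in range(n):
        hist = np.bincount(k[:, j], minlength=n + 1)
        # ge[r] = number of values in column j with k >= r
        ge = hist[::-1].cumsum()[::-1]
        # a value exceeds the r-th smallest minimum exactly when k > r
        out[j, order] = ge[1:]
    return pd.DataFrame(out, index=df.columns, columns=df.columns)

data = [[11, 10, 13], [16, 15, 45], [35, 14, 9]]
df = pd.DataFrame(data, columns=['A', 'B', 'C'])
result = count_above_minima(df)
print(result)
```

The binary search costs O(n m log n) in total, and the loop over columns adds O(n (m + n)).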
from itertools import product

import pandas as pd

pairs = product(df.columns, repeat=2)
min_value = {}
output = []
for each_pair in pairs:
    # compute each column's minimum only once and cache it
    if each_pair[1] not in min_value:
        min_value[each_pair[1]] = df[each_pair[1]].min()
    min_ = min_value[each_pair[1]]
    # count the values of the first column above the second column's minimum
    count = df[df[each_pair[0]]>min_][each_pair[0]].count()
    output.append(count)
df_desired = pd.DataFrame(
    [output[i: i+len(df.columns)] for i in range(0, len(output), len(df.columns))],
    columns=df.columns, index=df.columns)
print(df_desired)
A B C
A 2 3 3
B 2 2 3
C 2 2 2
I have a DataFrame as
Locality money
1 3
1 4
1 10
1 12
1 15
2 16
2 18
I have to generate combinations with replacement of the money column within each Locality group, filtered on the difference between the two money values. The target should look like this:
Locality money1 money2
1 3 3
1 3 4
1 4 4
1 10 10
1 10 12
1 10 15
1 12 12
1 12 15
1 15 15
2 16 16
2 16 18
2 18 18
Note that values are only combined when they share the same Locality and differ by less than 6.
My current code is
from itertools import combinations_with_replacement
import numpy as np
import pandas as pd

def generate_graph(input_series, out_cols):
    return pd.DataFrame(list(combinations_with_replacement(input_series, r=2)), columns=out_cols)

df = (
    df.groupby(['Locality'])['money'].apply(
        lambda x: generate_graph(x, out_cols=['money1', 'money2'])
    ).reset_index().drop(columns=['level_1'], errors='ignore')
)
# Ensure the Distance between money is within the permissible limit
df = df.loc[(
    df['money2'] - df['money1'] < 6
)]
The issue is that on a DataFrame with 100000 rows my code takes almost 33 seconds. I need to reduce that time, probably using numpy. In particular I want to optimize the groupby and the post-filter, which costs extra space and time. For sample data, you can use this code to generate the DataFrame.
# Generate dummy data
t1 = list(range(0, 100000))
b = np.random.randint(100, 10000, 100000)
a = (b/100).astype(int)
df = pd.DataFrame({'Locality': a, 'money': t1})
df = df.sort_values(by=['Locality', 'money'])
To both speed up the run time and reduce space consumption:
Instead of post-filtering, apply an extended function (say combine_values) that builds the dataframe from a generator expression yielding only the combinations that already satisfy the condition.
(factor below is a default argument holding the permissible limit mentioned above)
In [48]: def combine_values(values, out_cols, factor=6):
    ...:     return pd.DataFrame(((m1, m2) for m1, m2 in combinations_with_replacement(values, r=2)
    ...:                          if m2 - m1 < factor), columns=out_cols)
    ...:
In [49]: df_result = (
    ...:     df.groupby(['Locality'])['money'].apply(
    ...:         lambda x: combine_values(x, out_cols=['money1', 'money2'])
    ...:     ).reset_index().drop(columns=['level_1'], errors='ignore')
    ...: )
Execution time performance:
In [50]: %time df.groupby(['Locality'])['money'].apply(lambda x: combine_values(x, out_cols=['money1', 'money2'])).reset_index().drop(columns=['level_1'], errors='ignore')
CPU times: user 2.42 s, sys: 1.64 ms, total: 2.42 s
Wall time: 2.42 s
Out[50]:
Locality money1 money2
0 1 34 34
1 1 106 106
2 1 123 123
3 1 483 483
4 1 822 822
... ... ... ...
105143 99 99732 99732
105144 99 99872 99872
105145 99 99889 99889
105146 99 99913 99913
105147 99 99981 99981
[105148 rows x 3 columns]
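Since the question asks about numpy, here is a hedged sketch of a different technique from the generator above: sort each group's values, use np.searchsorted to find where the permitted window ends for every value, and expand the index ranges with np.repeat. combine_values_np and pairs_within are hypothetical names, and the sketch assumes factor > 0 and that sorted output order is acceptable.

```python
import numpy as np
import pandas as pd

def combine_values_np(values, factor=6):
    """Pairs (m1, m2) with m1 <= m2 and m2 - m1 < factor, built without a Python pair loop."""
    v = np.sort(np.asarray(values))
    pos = np.arange(len(v))
    # hi[i] = first position whose value is >= v[i] + factor (exclusive end of the window)
    hi = np.searchsorted(v, v + factor, side='left')
    counts = hi - pos                        # each value pairs with itself, so counts >= 1
    left = np.repeat(pos, counts)
    starts = np.cumsum(counts) - counts      # where each value's run of pairs begins
    right = left + np.arange(counts.sum()) - np.repeat(starts, counts)
    return v[left], v[right]

def pairs_within(df, factor=6):
    parts = []
    for loc, grp in df.groupby('Locality')['money']:
        m1, m2 = combine_values_np(grp.to_numpy(), factor)
        parts.append(pd.DataFrame({'Locality': loc, 'money1': m1, 'money2': m2}))
    return pd.concat(parts, ignore_index=True)

df = pd.DataFrame({'Locality': [1, 1, 1, 1, 1, 2, 2],
                   'money': [3, 4, 10, 12, 15, 16, 18]})
result = pairs_within(df)
print(result)
```

Only the pairs that pass the filter are ever materialized, so the work per group scales with the size of the output rather than with the square of the group size. I have not timed it against the version above.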
I have the following dataframe (it is actually several hundred MB long):
X Y Size
0 10 20 5
1 11 21 2
2 9 35 1
3 8 7 7
4 9 19 2
I want to discard any X, Y point whose Euclidean distance from any other X, Y point in the dataframe is less than delta=3. In those cases I want to keep only the row with the larger Size.
In this example the intended result would be:
X Y Size
0 10 20 5
2 9 35 1
3 8 7 7
As the question is stated, it is not clear how the desired algorithm should deal with chains of points, where A is close to B and B is close to C but A is not close to C.
If chaining is allowed, one solution is to cluster the dataset using a density-based clustering algorithm such as DBSCAN.
You just need to set the neighbourhood radius eps to delta and the min_samples parameter to 1 so that isolated points form their own clusters. Then you can find the point with the maximum size in each group.
from sklearn.cluster import DBSCAN
X = df[['X', 'Y']]
db = DBSCAN(eps=3, min_samples=1).fit(X)
df['grp'] = db.labels_
df_new = df.loc[df.groupby('grp').idxmax()['Size']]
print(df_new)
X Y Size grp
0 10 20 5 0
2 9 35 1 1
3 8 7 7 2
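If chaining is not wanted, one alternative (a different technique from DBSCAN, sketched here under my own assumptions) is a greedy pass: visit points from largest to smallest Size and keep a point only if no already-kept point lies within delta. The sketch uses scipy's cKDTree; drop_close_points is a made-up name, and query_ball_point treats distance <= delta as close, which differs from a strict "less than" only for points at exactly delta.

```python
import numpy as np
import pandas as pd
from scipy.spatial import cKDTree

def drop_close_points(df, delta=3.0):
    """Greedy filter: the largest Size wins, its neighbours within delta are dropped."""
    pts = df[['X', 'Y']].to_numpy(dtype=float)
    tree = cKDTree(pts)
    # neighbours[i] lists the positions within delta of point i (including i itself)
    neighbours = tree.query_ball_point(pts, r=delta)
    order = np.argsort(-df['Size'].to_numpy(), kind='stable')
    keep = np.zeros(len(df), dtype=bool)
    removed = np.zeros(len(df), dtype=bool)
    for i in order:
        if removed[i]:
            continue
        keep[i] = True
        removed[neighbours[i]] = True   # suppress everything close to the kept point
    return df[keep]

df = pd.DataFrame({'X': [10, 11, 9, 8, 9],
                   'Y': [20, 21, 35, 7, 19],
                   'Size': [5, 2, 1, 7, 2]})
result = drop_close_points(df)
print(result)
```

With chained points this can keep more rows than the clustering approach, because a point is only dropped when it is close to a kept point, not when it is merely connected to one through intermediates.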
You can use the script below and try to improve it.
# Get all pairwise Euclidean distances using sklearn (an n x n array),
# then collect the position pairs whose distance is less than 3.
import pandas as pd
from itertools import chain
from sklearn.metrics.pairwise import euclidean_distances

Z = df[['X', 'Y']]
euc = euclidean_distances(Z, Z)
idx = [(i, j) for i in range(len(euc)-1) for j in range(i+1, len(euc)) if euc[i, j] < 3]

# Collect every row that is within distance 3 of another row and find the
# row with the largest Size among them, then build df_new from that row
# plus the rows that are not close to any other row.
# Note: this keeps a single maximum across all close rows, not one per cluster.
df_idx = list(set(chain(*idx)))
df2 = df.iloc[df_idx]
idx_max = df2[df2['Size'] == df2['Size'].max()].index.tolist()
# idx_max holds index labels, so select with .loc rather than .iloc
df_new = pd.concat([df[~df.index.isin(df_idx)], df2.loc[idx_max]])
df_new
Result:
X Y Size
2 9 35 1
3 8 7 7
0 10 20 5
Suppose I have the pd.Series
import pandas as pd
import numpy as np
s = pd.Series(np.arange(10), list('abcdefghij'))
I'd like to "shuffle" this series like a deck of cards by interweaving the top half with the bottom half.
I'd expect results like this
a 0
f 5
b 1
g 6
c 2
h 7
d 3
i 8
e 4
j 9
dtype: int32
Conclusions
final function
def perfect_shuffle(s):
    n = s.values.shape[0]  # get length of s
    l = (n + 1) // 2 * 2  # smallest even number >= n
    # use even number to reshape and only use n of them after ravel
    a = np.arange(l).reshape(2, -1).T.ravel()[:n]
    # construct new series slicing both values and index
    return pd.Series(s.values[a], s.index.values[a])
demonstration
s = pd.Series(np.arange(11), list('abcdefghijk'))
print(perfect_shuffle(s))
a 0
g 6
b 1
h 7
c 2
i 8
d 3
j 9
e 4
k 10
f 5
dtype: int64
order='F' vs T
I had suggested using T.ravel() as opposed to ravel(order='F')
After investigation, it hardly matters but ravel(order='F') is better for larger arrays.
# pd.datetime is no longer available in recent pandas; use the standard library
from datetime import datetime

d = pd.DataFrame(dict(T=[], R=[]))
for n in np.power(10, np.arange(1, 8)):
    a = np.arange(n).reshape(2, -1)
    stamp = datetime.now()
    for _ in range(100):
        a.ravel(order='F')
    d.loc[n, 'R'] = (datetime.now() - stamp).total_seconds()
    stamp = datetime.now()
    for _ in range(100):
        a.T.ravel()
    d.loc[n, 'T'] = (datetime.now() - stamp).total_seconds()
d
d.plot()
Thanks unutbu and Warren Weckesser
In the special case where the length of the Series is even, you can do a perfect shuffle by reshaping its values into two rows and then using ravel(order='F') to read the items off in Fortran order:
In [12]: pd.Series(s.values.reshape(2,-1).ravel(order='F'), s.index)
Out[12]:
a 0
b 5
c 1
d 6
e 2
f 7
g 3
h 8
i 4
j 9
dtype: int64
Fortran order makes the left-most axis increment fastest. So in a 2D array the
values are read off by going down the rows of one column before progressing to
the next column. This has the effect of interleaving the values, compared to the
usual C-order.
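A small illustration of the difference between the two read-out orders, using a plain array of ten values (the variable names are only for this example):

```python
import numpy as np

# two rows: the "top half" 0..4 and the "bottom half" 5..9
a = np.arange(10).reshape(2, -1)

c_order = a.ravel()            # row by row: the original order
f_order = a.ravel(order='F')   # column by column: the two halves interleaved

print(c_order)
print(f_order)
# reading the transpose in C order visits the elements in the same sequence
print(np.array_equal(f_order, a.T.ravel()))
```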
In the general case where the length of the Series could be odd,
perhaps the fastest way is to reassign the values using shifted slices:
import numpy as np
import pandas as pd
def perfect_shuffle(ser):
    arr = ser.values
    result = np.empty_like(arr)
    N = (len(arr)+1)//2
    # the top half goes to the even positions, the bottom half to the odd ones
    result[::2] = arr[:N]
    result[1::2] = arr[N:]
    result = pd.Series(result, index=ser.index)
    return result
s = pd.Series(np.arange(11), list('abcdefghijk'))
print(perfect_shuffle(s))
yields
a 0
b 6
c 1
d 7
e 2
f 8
g 3
h 9
i 4
j 10
k 5
dtype: int64
To add to @unutbu's answer, some benchmarks:
>>> import timeit
>>> import numpy as np
>>>
>>> setup = '''
... import pandas as pd
... import numpy as np
... s = pd.Series(list('abcdefghij'), np.arange(10))
... '''
>>>
>>> funcs = ['s[np.random.permutation(s.index)]', "pd.Series(s.values.reshape(2,-1).ravel(order='F'), s.index)",
... 's.iloc[np.random.permutation(s.index)]', "s.values.reshape(-1, 2, order='F').ravel()"]
>>>
>>> for f in funcs:
...     print(f)
...     print(min(timeit.Timer(f, setup).repeat(3, 50)))
...
s[np.random.permutation(s.index)]
0.029795593000017107
pd.Series(s.values.reshape(2,-1).ravel(order='F'), s.index)
0.0035402200010139495
s.iloc[np.random.permutation(s.index)]
0.010904800990829244
s.values.reshape(-1, 2, order='F').ravel()
0.00019640100072138011
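The final f in funcs takes over 99% less time than the first np.random.permutation approach, so that's probably your best bet. Note that it returns a bare NumPy array rather than a Series, and that the two permutation-based entries produce a random order rather than a perfect shuffle.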
I have a 2-d dictionary in the following format:
myDict = {('a','b'):10, ('a','c'):20, ('a','d'):30, ('b','c'):40, ('b','d'):50,('c','d'):60}
How can I write this into a tab-delimited file so that the file contains the following? Each tuple (x, y) fills two locations: (x, y) and (y, x). (x, x) is always 0.
The output would be :
a b c d
a 0 10 20 30
b 10 0 40 50
c 20 40 0 60
d 30 50 60 0
PS: If somehow the dictionary can be converted into a dataframe (using pandas) then it can be easily written into a file using pandas function
You can do this with the lesser-known align method and a little unstack magic:
In [122]: s = Series(myDict, index=MultiIndex.from_tuples(myDict))
In [123]: df = s.unstack()
In [124]: lhs, rhs = df.align(df.T)
In [125]: res = lhs.add(rhs, fill_value=0).fillna(0)
In [126]: res
Out[126]:
a b c d
a 0 10 20 30
b 10 0 40 50
c 20 40 0 60
d 30 50 60 0
Finally, to write this to a CSV file, use the to_csv method:
In [128]: res.to_csv('res.csv', sep='\t')
In [129]: !cat res.csv
a b c d
a 0.0 10.0 20.0 30.0
b 10.0 0.0 40.0 50.0
c 20.0 40.0 0.0 60.0
d 30.0 50.0 60.0 0.0
If you want to keep things as integers, cast using DataFrame.astype(), like so:
In [137]: res.astype(int).to_csv('res.csv', sep='\t')
In [138]: !cat res.csv
a b c d
a 0 10 20 30
b 10 0 40 50
c 20 40 0 60
d 30 50 60 0
(It was cast to float because of the intermediate step of filling in nan values where indices from one frame were missing from the other)
@Dan Allan's answer using combine_first is nice:
In [130]: df.combine_first(df.T).fillna(0)
Out[130]:
a b c d
a 0 10 20 30
b 10 0 40 50
c 20 40 0 60
d 30 50 60 0
Timings:
In [134]: timeit df.combine_first(df.T).fillna(0)
100 loops, best of 3: 2.01 ms per loop
In [135]: timeit lhs, rhs = df.align(df.T); res = lhs.add(rhs, fill_value=0).fillna(0)
1000 loops, best of 3: 1.27 ms per loop
Those timings are probably a bit polluted by construction costs, so what do things look like with some really huge frames?
In [143]: df = DataFrame({i: randn(1e7) for i in range(1, 11)})
In [144]: df2 = DataFrame({i: randn(1e7) for i in range(10)})
In [145]: timeit lhs, rhs = df.align(df2); res = lhs.add(rhs, fill_value=0).fillna(0)
1 loops, best of 3: 4.41 s per loop
In [146]: timeit df.combine_first(df2).fillna(0)
1 loops, best of 3: 2.95 s per loop
DataFrame.combine_first() is faster for larger frames.
In [49]: data = list(map(list, zip(*myDict.keys()))) + [list(myDict.values())]
In [50]: df = DataFrame(list(zip(*data))).set_index([0, 1])[2].unstack()
In [52]: df.combine_first(df.T).fillna(0)
Out[52]:
a b c d
a 0 10 20 30
b 10 0 40 50
c 20 40 0 60
d 30 50 60 0
For posterity: If you are just tuning in, check out Phillip Cloud's answer below for a neater way to construct df.
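Putting the pieces together, here is a compact sketch of the combine_first route. It relies on pd.Series turning tuple keys into a MultiIndex, and the explicit reindex over all labels is my own addition so that both axes list a through d.

```python
import pandas as pd

myDict = {('a', 'b'): 10, ('a', 'c'): 20, ('a', 'd'): 30,
          ('b', 'c'): 40, ('b', 'd'): 50, ('c', 'd'): 60}

s = pd.Series(myDict)          # tuple keys become a two-level MultiIndex
df = s.unstack()               # rows a..c, columns b..d, NaN elsewhere

# make the frame square so every label appears on both axes
labels = sorted(set(df.index) | set(df.columns))
df = df.reindex(index=labels, columns=labels)

# mirror the upper triangle, then zero the diagonal and cast back to int
res = df.combine_first(df.T).fillna(0).astype(int)

text = res.to_csv(sep='\t')    # pass a path instead to write a file
print(text)
```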
Not as elegant as I'd like (and not using pandas) but until you find something better:
adj = dict()
for ((u, v), w) in myDict.items():
    if u not in adj: adj[u] = dict()
    if v not in adj: adj[v] = dict()
    adj[u][v] = adj[v][u] = w
keys = sorted(adj)
print('\t' + '\t'.join(keys))
for u in keys:
    def f(v):
        # pairs that never occur (including the diagonal) count as 0
        try:
            return str(adj[u][v])
        except KeyError:
            return "0"
    print(u + '\t' + '\t'.join(f(v) for v in keys))
or equivalently (if you don't want to construct the adjacency matrix):
k = dict()
for ((u, v), w) in myDict.items():
    k[u] = k[v] = True
keys = sorted(k)
print('\t' + '\t'.join(keys))
for u in keys:
    def f(v):
        # look the pair up in either orientation
        if (u, v) in myDict:
            return str(myDict[(u, v)])
        elif (v, u) in myDict:
            return str(myDict[(v, u)])
        else:
            return "0"
    print(u + '\t' + '\t'.join(f(v) for v in keys))
Got it working using the pandas package.
from pandas import DataFrame

# Find all column names
z = []
for x in myDict.keys():
    z.extend(x)
colnames = sorted(set(z))
# Create a DataFrame initialized with zeros
myDF = DataFrame(0, index=colnames, columns=colnames)
# Fill each item one by one; .loc avoids chained assignment
for val in myDict:
    myDF.loc[val[1], val[0]] = myDict[val]
    myDF.loc[val[0], val[1]] = myDict[val]
# Write to a file
outfilename = "matrixCooccurence.txt"
myDF.to_csv(outfilename, sep="\t", index=True, header=True, index_label="features")