Let's say I have the following data set:
A B
10.1 53
12.5 42
16.0 37
20.7 03
25.6 16
30.1 01
40.9 19
60.5 99
I have the following list of ranges:
[[9,15],[19,22],[39,50]]
How do I efficiently pull rows that lie in those ranges?
Wanted Output
A B
10.1 53
12.5 42
20.7 03
40.9 19
Edit:
It needs to work for floating-point values.
Update for modified question
For floats, you can construct a mask using NumPy array operations:
import numpy as np

L = np.array([[9,15],[19,22],[39,50]])
A = df['A'].values
# Broadcast-compare every value in A against each [lower, upper] pair,
# then keep rows that fall inside at least one range.
mask = ((A >= L[:, 0][:, None]) & (A <= L[:, 1][:, None])).any(0)
res = df[mask]
print(res)
A B
0 10.1 53
1 12.5 42
3 20.7 3
6 40.9 19
Previous answer to original question
For integers, you can use numpy.concatenate with numpy.arange:
L = [[9,15],[19,22],[39,50]]
vals = np.concatenate([np.arange(i, j) for i, j in L])
res = df[df['A'].isin(vals)]
print(res)
A B
0 10 53
1 12 42
3 20 3
6 40 19
An alternative solution with itertools.chain and range:
from itertools import chain
vals = set(chain.from_iterable(range(i, j) for i, j in L))
res = df[df['A'].isin(vals)]
Here's another method (edit: it works with floats or integers). @jpp's answer might be faster, but this code is easier to understand (in my opinion).
import pandas as pd

df = pd.DataFrame([[10.1,53],[12.5,42],[16.0,37],[20.7,3],[25.6,16],[30.1,1],[40.9,19],[60.5,99]], columns=list('AB'))
ranges = [[9,15],[19,22],[39,50]]
result = pd.DataFrame(columns=list('AB'))
for r in ranges:
    result = result.append(df[df['A'].between(r[0], r[1], inclusive=False)])
print(result)
Here's the output:
A B
0 10.1 53
1 12.5 42
3 20.7 3
6 40.9 19
PS: the following one-line list comprehension also works:
result = result.append([df[df['A'].between(r[0], r[1], inclusive=False)] for r in ranges])
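Another option along the same lines (just a sketch, not benchmarked here): build a pandas IntervalIndex from the ranges and let pd.cut assign each value of A to the interval it falls in (NaN if it falls in none), which also works for floats:
intervals = pd.IntervalIndex.from_tuples([(9, 15), (19, 22), (39, 50)], closed='both')
res = df[pd.cut(df['A'], intervals).notna()]  # keep rows whose A falls in some interval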
Related
I have a DataFrame as
Locality money
1 3
1 4
1 10
1 12
1 15
2 16
2 18
I have to take combinations with replacement of the money column, grouped by Locality, with a filter on the money difference. The target must look like:
Locality money1 money2
1 3 3
1 3 4
1 4 4
1 10 10
1 10 12
1 10 15
1 12 12
1 12 15
1 15 15
2 16 16
2 16 18
2 18 18
Note that the combinations are taken only within the same Locality and only for pairs whose difference is less than 6.
My current code is
from itertools import combinations_with_replacement
import numpy as np
import pandas as pd

def generate_graph(input_series, out_cols):
    return pd.DataFrame(list(combinations_with_replacement(input_series, r=2)), columns=out_cols)

df = (
    df.groupby(['Locality'])['money'].apply(
        lambda x: generate_graph(x, out_cols=['money1', 'money2'])
    ).reset_index().drop(columns=['level_1'], errors='ignore')
)
# Ensure the distance between money values is within the permissible limit
df = df.loc[(
    df['money2'] - df['money1'] < 6
)]
The issue is that my DataFrame has 100,000 rows, and my code takes almost 33 seconds to process it. I need to reduce the running time, probably using NumPy, and I am looking to optimize both the groupby and the post-filter, which take extra space and time. For sample data, you can use this code to generate the DataFrame:
# Generate dummy data
t1 = list(range(0, 100000))
b = np.random.randint(100, 10000, 100000)
a = (b/100).astype(int)
df = pd.DataFrame({'Locality': a, 'money': t1})
df = df.sort_values(by=['Locality', 'money'])
To gain a running-time speedup and reduce space consumption:
Instead of post-filtering, apply an extended function (say combine_values) that builds the DataFrame from a generator expression yielding only the combinations that already satisfy the condition.
(factor below is a default argument corresponding to the permissible limit mentioned above)
In [48]: def combine_values(values, out_cols, factor=6):
...: return pd.DataFrame(((m1, m2) for m1, m2 in combinations_with_replacement(values, r=2)
...: if m2 - m1 < factor), columns=out_cols)
...:
In [49]: df_result = (
...: df.groupby(['Locality'])['money'].apply(
...: lambda x: combine_values(x, out_cols=['money1', 'money2'])
...: ).reset_index().drop(columns=['level_1'], errors='ignore')
...: )
Execution time performance:
In [50]: %time df.groupby(['Locality'])['money'].apply(lambda x: combine_values(x, out_cols=['money1', 'money2'])).reset_index().drop(columns=['level_1'], errors='ignore')
CPU times: user 2.42 s, sys: 1.64 ms, total: 2.42 s
Wall time: 2.42 s
Out[50]:
Locality money1 money2
0 1 34 34
1 1 106 106
2 1 123 123
3 1 483 483
4 1 822 822
... ... ... ...
105143 99 99732 99732
105144 99 99872 99872
105145 99 99889 99889
105146 99 99913 99913
105147 99 99981 99981
[105148 rows x 3 columns]
I have a trivial problem that I have solved using loops, but I am trying to see whether some of it can be vectorized to improve performance.
Essentially I have two DataFrames (DF_A and DF_B), where each row in DF_B is the sum of the corresponding row in DF_A and the row above it in DF_B. I do have the first row of values for DF_B.
df_a = [
[1,2,3,4]
[5,6,7,8]
[..... more rows]
]
df_b = [
[1,2,3,4]
[ rows of all 0 values here, so dimensions match df_a]
]
What I am trying to achieve is that, for example, the 2nd row in df_b will be the values of the first row in df_b plus the values of the second row in df_a. So in this case:
df_b.loc[2] = [6,8,10,12]
I was able to accomplish this by looping over the range of df_a, keeping the previous row's value saved off and adding the current row to it. That doesn't seem super efficient.
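A minimal sketch of that loop, assuming df_a and df_b are same-shaped DataFrames with df_b's first row already filled:
prev = df_b.iloc[0].values.copy()
for i in range(1, len(df_a)):
    prev = prev + df_a.iloc[i].values   # previous df_b row + current df_a row
    df_b.iloc[i] = prev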
Here is a numpy solution. This should be significantly faster than a pandas loop, especially since it uses JIT-compiling via numba.
import pandas as pd
from numba import jit

a = df_a.values
b = df_b.values

@jit(nopython=True)
def fill_b(a, b):
    for i in range(1, len(b)):
        b[i] = b[i-1] + a[i]
    return b

df_b = pd.DataFrame(fill_b(a, b))
# 0 1 2 3
# 0 1 2 3 4
# 1 6 8 10 12
# 2 15 18 21 24
# 3 28 32 36 40
# 4 45 50 55 60
Performance benchmarking
import pandas as pd, numpy as np
from numba import jit

df_a = pd.DataFrame(np.arange(1, 1000001).reshape(1000, 1000))

@jit(nopython=True)
def fill_b(a, b):
    for i in range(1, len(b)):
        b[i] = b[i-1] + a[i]
    return b

def jp(df_a):
    a = df_a.values
    b = np.empty(df_a.values.shape)
    b[0] = np.arange(1, 1001)
    return pd.DataFrame(fill_b(a, b))

%timeit df_a.cumsum()  # 16.1 ms
%timeit jp(df_a)       # 6.05 ms
You can just create df_b using the cumulative sum over df_a, like so
df_a = pd.DataFrame(np.arange(1,17).reshape(4,4))
df_b = df_a.cumsum()
0 1 2 3
0 1 2 3 4
1 6 8 10 12
2 15 18 21 24
3 28 32 36 40
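Note that this relies on df_b's first row being equal to df_a's first row, as in the example above. If df_b had to start from a different, known first row b0 (a Series aligned with df_a's columns), one way (a sketch) would be to offset the cumulative sum:
b0 = pd.Series([10, 20, 30, 40])            # hypothetical first row for df_b
df_b = df_a.cumsum() + (b0 - df_a.iloc[0])  # b[i] = b0 + a[1] + ... + a[i]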
I have data of the following form in a text file.
Text file entry
#x y z
1 1 1
2 4
3 9
4 16
5 25
6 36
7 49
8 64 512
9 81 729
10 100 1000
11 121
12 144 1728
13 169
14 196
15 225
16 256 4096
17 289
18 324
19 361 6859
20 400
21 441 9261
22 484
23 529 12167
24 576
25 625
Some of the entries in the third column are empty. I am trying to create an array of x (column 1) and z (column 3), ignoring NaN. Let the array be B. The contents of B should be:
1 1
8 512
9 729
10 1000
12 1728
16 4096
19 6859
21 9261
23 12167
I tried doing this using the code:
import numpy as np

A = np.genfromtxt('data.dat', comments='#', delimiter='\t')
B = []
for i in range(len(A)):
    if ~ np.isnan(A[i, 2]):
        B = np.append(B, np.column_stack((A[i, 0], A[i, 2])))
print B.shape
This does not work; it produces one long flattened vector (np.append without an axis flattens its inputs) instead of a two-column array. How can this be done in Python?
Using pandas would make your life quite a bit easier (note the regular expression used as the delimiter):
import numpy as np
from pandas import read_csv

data = read_csv('data.dat', delimiter=r'\s+').values
print(data[~np.isnan(data[:, 2])][:, [0, 2]])
Which results in:
array([[ 8.00000000e+00, 5.12000000e+02],
[ 9.00000000e+00, 7.29000000e+02],
[ 1.00000000e+01, 1.00000000e+03],
[ 1.20000000e+01, 1.72800000e+03],
[ 1.60000000e+01, 4.09600000e+03],
[ 1.90000000e+01, 6.85900000e+03],
[ 2.10000000e+01, 9.26100000e+03],
[ 2.30000000e+01, 1.21670000e+04]])
If you read your data.dat file and assign its content to a variable, say data, you can iterate over the lines, split them, and process only the ones that have three elements:
B = []
for line in data.split('\n'):
    if len(line.split()) == 3:
        x, y, z = line.split()
        B.append((x, z))  # or B.append(str(x) + '\t' + str(z) + '\n'),
                          # or any other format you need
The functions provided by libraries are not always easy to use, as you found out. The following program does it manually and creates an array with the values from the data file.
import numpy as np

def main():
    B = np.empty([0, 2], dtype=int)
    with open("data.dat") as inf:
        for line in inf:
            if line[0] == "#": continue
            l = line.split()
            if len(l) == 3:
                l = [int(l[0]), int(l[2])]   # keep x (column 1) and z (column 3)
                B = np.vstack((B, l))
    print B.shape
    print B
    return 0

if __name__ == '__main__':
    main()
Note that:
1) The append() function works on lists, not on NumPy arrays - at least not in the syntax you used. The easiest way to extend an array is 'piling up' rows with vstack (or hstack for columns); see the short sketch after these notes.
2) Specifying a delimiter in genfromtxt() can come back to bite you. By default the delimiter is any whitespace, which is normally what you want.
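To illustrate point 1), a minimal sketch of 'piling up' rows with vstack (not from the original code):
B = np.empty([0, 2], dtype=int)   # start with an empty array of 0 rows, 2 columns
B = np.vstack((B, [1, 1]))        # stack one row...
B = np.vstack((B, [8, 512]))      # ...and another
print(B.shape)                    # (2, 2)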
From your input dataframe:
In [33]: df.head()
Out[33]:
x y z
0 1 1 1
1 2 4 NaN
2 3 9 NaN
3 4 16 NaN
4 5 25 NaN
... you can get to the output dataframe B by doing this:
In [34]: df.dropna().head().drop('y', axis=1)
Out[34]:
x z
0 1 1
7 8 512
8 9 729
9 10 1000
11 12 1728
I would like to know if there is a faster and more "pythonic" way of doing the following, e.g. using some built-in methods.
Given a pandas DataFrame or NumPy array of floats, if a value is equal to or smaller than 0.5, I need to calculate its reciprocal, multiply it by -1, and replace the old value with the newly calculated one.
"Transform" is probably a bad choice of words; please tell me if you have a better or more accurate description.
Thank you for your help and support!!
Data:
import numpy as np
import pandas as pd

dicti = {"A": np.arange(0.0, 3, 0.1),
         "B": np.arange(0, 30, 1),
         "C": list("ELVISLIVES")*3}
df = pd.DataFrame(dicti)
my function:
def transform_colname(df, colname):
    series = df[colname]
    newval_list = []
    for val in series:
        if val <= 0.5:
            newval = (1/val)*-1
            newval_list.append(newval)
        else:
            newval_list.append(val)
    df[colname] = newval_list
    return df
function call:
transform_colname(df, colname="A")
--> I'm summing up the results here, since comments don't allow posting code (or I don't know how to do it).
Thank you all for your fast and great answers!!
using ipython "%timeit" with "real" data:
my function:
10 loops, best of 3: 24.1 ms per loop
from jojo:
def transform_colname_v2(df, colname):
    series = df[colname]
    df[colname] = np.where(series <= 0.5, 1/series*-1, series)
    return df
100 loops, best of 3: 2.76 ms per loop
from FooBar:
def transform_colname_v3(df, colname):
    df.loc[df[colname] <= 0.5, colname] = -1 / df[colname][df[colname] <= 0.5]
    return df
100 loops, best of 3: 3.32 ms per loop
from dmvianna:
def transform_colname_v4(df, colname):
    df[colname] = df[colname].where(df[colname] <= 0.5, (1/df[colname])*-1)
    return df
100 loops, best of 3: 3.7 ms per loop
Please tell/show me if you would implement your code in a different way!
One final QUESTION: (answered)
How could "FooBar" and "dmvianna" 's versions be made "generic"? I mean, I had to write the name of the column into the function (since using it as a variable didn't work). Please explain this last point!
--> thanks jojo, ".loc" isn't the right way, but very simple df[colname] is sufficient. changed the functions above to be more "generic". (also changed ">" to be "<=", and updated timing)
Thank you very much!!
The typical trick is to write a general mathematical operation to apply to the whole column, but then use indicators to select rows for which we actually apply it:
df.loc[df.A < 0.5, 'A'] = - 1 / df.A[df.A < 0.5]
In[13]: df
Out[13]:
A B C
0 -inf 0 E
1 -10.000000 1 L
2 -5.000000 2 V
3 -3.333333 3 I
4 -2.500000 4 S
5 0.500000 5 L
6 0.600000 6 I
7 0.700000 7 V
8 0.800000 8 E
9 0.900000 9 S
10 1.000000 10 E
11 1.100000 11 L
12 1.200000 12 V
13 1.300000 13 I
14 1.400000 14 S
15 1.500000 15 L
16 1.600000 16 I
17 1.700000 17 V
18 1.800000 18 E
19 1.900000 19 S
20 2.000000 20 E
21 2.100000 21 L
22 2.200000 22 V
23 2.300000 23 I
24 2.400000 24 S
25 2.500000 25 L
26 2.600000 26 I
27 2.700000 27 V
28 2.800000 28 E
29 2.900000 29 S
If we are talking about arrays:
import numpy as np
a = np.array([0.1, 0.2, 0.3, 0.4, 0.5, 0.6], dtype=np.float)
print 1 / a[a <= 0.5] * (-1)
This will, however, only return the values that are smaller than or equal to 0.5.
Alternatively use np.where:
import numpy as np
a = np.array([0.1, 0.2, 0.3, 0.4, 0.5, 0.6], dtype=np.float)
print np.where(a < 0.5, 1 / a * (-1), a)
Talking about pandas DataFrame:
As in @dmvianna's answer (so give some credit to him ;) ), adapting it to pd.DataFrame:
df.a = df.a.where(df.a > 0.5, (1 / df.a) * (-1))
As in @jojo's answer, but using pandas:
df.A = df.A.where(df.A > 0.5, (1/df.A)*-1)
or
df.A.where(df.A > 0.5, (1/df.A)*-1, inplace=True) # this should be faster
.where docstring:
Definition: df.A.where(self, cond, other=nan, inplace=False,
axis=None, level=None, try_cast=False, raise_on_error=True)
Docstring:
Return an object of same shape as self and whose corresponding entries
are from self where cond is True and otherwise are from other.
I have a 2-d dictionary in the following format:
myDict = {('a','b'):10, ('a','c'):20, ('a','d'):30, ('b','c'):40, ('b','d'):50,('c','d'):60}
How can I write this into a tab-delimited file so that the file contains the following? When filling, a tuple (x, y) fills two locations, (x, y) and (y, x), and (x, x) is always 0.
The output would be:
a b c d
a 0 10 20 30
b 10 0 40 50
c 20 40 0 60
d 30 50 60 0
PS: If the dictionary can somehow be converted into a DataFrame (using pandas), then it can easily be written to a file using a pandas function.
You can do this with the lesser-known align method and a little unstack magic:
In [122]: s = Series(myDict, index=MultiIndex.from_tuples(myDict))
In [123]: df = s.unstack()
In [124]: lhs, rhs = df.align(df.T)
In [125]: res = lhs.add(rhs, fill_value=0).fillna(0)
In [126]: res
Out[126]:
a b c d
a 0 10 20 30
b 10 0 40 50
c 20 40 0 60
d 30 50 60 0
Finally, to write this to a CSV file, use the to_csv method:
In [128]: res.to_csv('res.csv', sep='\t')
In [129]: !cat res.csv
a b c d
a 0.0 10.0 20.0 30.0
b 10.0 0.0 40.0 50.0
c 20.0 40.0 0.0 60.0
d 30.0 50.0 60.0 0.0
If you want to keep things as integers, cast using DataFrame.astype(), like so:
In [137]: res.astype(int).to_csv('res.csv', sep='\t')
In [138]: !cat res.csv
a b c d
a 0 10 20 30
b 10 0 40 50
c 20 40 0 60
d 30 50 60 0
(It was cast to float because of the intermediate step of filling in NaN values where indices from one frame were missing from the other.)
@Dan Allan's answer using combine_first is nice:
In [130]: df.combine_first(df.T).fillna(0)
Out[130]:
a b c d
a 0 10 20 30
b 10 0 40 50
c 20 40 0 60
d 30 50 60 0
Timings:
In [134]: timeit df.combine_first(df.T).fillna(0)
100 loops, best of 3: 2.01 ms per loop
In [135]: timeit lhs, rhs = df.align(df.T); res = lhs.add(rhs, fill_value=0).fillna(0)
1000 loops, best of 3: 1.27 ms per loop
Those timings are probably a bit polluted by construction costs, so what do things look like with some really huge frames?
In [143]: df = DataFrame({i: randn(1e7) for i in range(1, 11)})
In [144]: df2 = DataFrame({i: randn(1e7) for i in range(10)})
In [145]: timeit lhs, rhs = df.align(df2); res = lhs.add(rhs, fill_value=0).fillna(0)
1 loops, best of 3: 4.41 s per loop
In [146]: timeit df.combine_first(df2).fillna(0)
1 loops, best of 3: 2.95 s per loop
DataFrame.combine_first() is faster for larger frames.
In [49]: data = map(list, zip(*myDict.keys())) + [myDict.values()]
In [50]: df = DataFrame(zip(*data)).set_index([0, 1])[2].unstack()
In [52]: df.combine_first(df.T).fillna(0)
Out[52]:
a b c d
a 0 10 20 30
b 10 0 40 50
c 20 40 0 60
d 30 50 60 0
For posterity: If you are just tuning in, check out Phillip Cloud's answer above for a neater way to construct df.
Not as elegant as I'd like (and not using pandas) but until you find something better:
adj = dict()
for ((u, v), w) in myDict.items():
    if u not in adj: adj[u] = dict()
    if v not in adj: adj[v] = dict()
    adj[u][v] = adj[v][u] = w

keys = adj.keys()
print '\t' + '\t'.join(keys)
for u in keys:
    def f(v):
        try:
            return str(adj[u][v])
        except KeyError:
            return "0"
    print u + '\t' + '\t'.join(f(v) for v in keys)
or equivalently (if you don't want to construct the adjacency matrix):
k = dict()
for ((u, v), w) in myDict.items():
    k[u] = k[v] = True

keys = k.keys()
print '\t' + '\t'.join(keys)
for u in keys:
    def f(v):
        if (u, v) in myDict:
            return str(myDict[(u, v)])
        elif (v, u) in myDict:
            return str(myDict[(v, u)])
        else:
            return "0"
    print u + '\t' + '\t'.join(f(v) for v in keys)
Got it working using the pandas package.
from pandas import DataFrame

# Find all column names
z = []
[z.extend(x) for x in myDict.keys()]
colnames = sorted(set(z))

# Create an empty DataFrame using pandas
myDF = DataFrame(index=colnames, columns=colnames)
myDF = myDF.fillna(0)  # Initialize with zeros

# Fill each item one by one
for val in myDict:
    myDF[val[0]][val[1]] = myDict[val]
    myDF[val[1]][val[0]] = myDict[val]

# Write to a file
outfilename = "matrixCooccurence.txt"
myDF.to_csv(outfilename, sep="\t", index=True, header=True, index_label="features")