I would like to know if there is a faster and more "pythonic" way of doing the following, e.g. using some built-in methods.
Given a pandas DataFrame or NumPy array of floats, for every value that is less than or equal to 0.5 I need to calculate its reciprocal, multiply it by -1, and replace the old value with the newly calculated one.
"Transform" is probably a bad choice of words, please tell me if you have a better/more accurate description.
Thank you for your help and support!!
Data:
import numpy as np
import pandas as pd
dicti = {"A" : np.arange(0.0, 3, 0.1),
"B" : np.arange(0, 30, 1),
"C" : list("ELVISLIVES")*3}
df = pd.DataFrame(dicti)
my function:
def transform_colname(df, colname):
    series = df[colname]
    newval_list = []
    for val in series:
        if val <= 0.5:
            newval = (1 / val) * -1
            newval_list.append(newval)
        else:
            newval_list.append(val)
    df[colname] = newval_list
    return df
function call:
transform_colname(df, colname="A")
**--> I'm summarizing the results here, since comments don't allow posting code (or I don't know how to do it).**
Thank you all for your fast and great answers!!
using ipython "%timeit" with "real" data:
my function:
10 loops, best of 3: 24.1 ms per loop
from jojo:
def transform_colname_v2(df, colname):
    series = df[colname]
    df[colname] = np.where(series <= 0.5, 1 / series * -1, series)
    return df
100 loops, best of 3: 2.76 ms per loop
from FooBar:
def transform_colname_v3(df, colname):
    df.loc[df[colname] <= 0.5, colname] = -1 / df[colname][df[colname] <= 0.5]
    return df
100 loops, best of 3: 3.32 ms per loop
from dmvianna:
def transform_colname_v4(df, colname):
    # .where keeps values where the condition is True, so the condition stays "> 0.5" here
    df[colname] = df[colname].where(df[colname] > 0.5, (1 / df[colname]) * -1)
    return df
100 loops, best of 3: 3.7 ms per loop
Please tell/show me if you would implement your code in a different way!
One final QUESTION: (answered)
How could "FooBar" and "dmvianna" 's versions be made "generic"? I mean, I had to write the name of the column into the function (since using it as a variable didn't work). Please explain this last point!
--> thanks jojo, ".loc" isn't the right way, but very simple df[colname] is sufficient. changed the functions above to be more "generic". (also changed ">" to be "<=", and updated timing)
Thank you very much!!
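To illustrate the "generic" point: attribute access only works with a literal column name, while bracket indexing accepts a variable. A minimal sketch (colname is just an ordinary string variable here):
colname = "A"
df[colname]    # works: bracket indexing accepts any string, including a variable
df.A           # also works, but only because "A" is spelled out literally
# df.colname   # fails with AttributeError: pandas looks for a column literally named "colname"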
The typical trick is to write a general mathematical operation to apply to the whole column, but then use indicators to select rows for which we actually apply it:
df.loc[df.A < 0.5, 'A'] = - 1 / df.A[df.A < 0.5]
In[13]: df
Out[13]:
A B C
0 -inf 0 E
1 -10.000000 1 L
2 -5.000000 2 V
3 -3.333333 3 I
4 -2.500000 4 S
5 0.500000 5 L
6 0.600000 6 I
7 0.700000 7 V
8 0.800000 8 E
9 0.900000 9 S
10 1.000000 10 E
11 1.100000 11 L
12 1.200000 12 V
13 1.300000 13 I
14 1.400000 14 S
15 1.500000 15 L
16 1.600000 16 I
17 1.700000 17 V
18 1.800000 18 E
19 1.900000 19 S
20 2.000000 20 E
21 2.100000 21 L
22 2.200000 22 V
23 2.300000 23 I
24 2.400000 24 S
25 2.500000 25 L
26 2.600000 26 I
27 2.700000 27 V
28 2.800000 28 E
29 2.900000 29 S
If we are talking about arrays:
import numpy as np
a = np.array([0.1, 0.2, 0.3, 0.4, 0.5, 0.6], dtype=float)
print(1 / a[a <= 0.5] * -1)
This will, however, only return the transformed values for the entries that are at most 0.5 (the rest are dropped).
Alternatively use np.where:
import numpy as np
a = np.array([0.1, 0.2, 0.3, 0.4, 0.5, 0.6], dtype=float)
print(np.where(a <= 0.5, 1 / a * -1, a))
Talking about pandas DataFrame:
As in @dmvianna's answer (so give some credit to him ;) ), adapting it to a pd.DataFrame column:
df.A = df.A.where(df.A > 0.5, (1 / df.A) * -1)
As in @jojo's answer, but using pandas:
df.A = df.A.where(df.A > 0.5, (1/df.A)*-1)
or
df.A.where(df.A > 0.5, (1/df.A)*-1, inplace=True) # this should be faster
.where docstring:
Definition: df.A.where(self, cond, other=nan, inplace=False,
axis=None, level=None, try_cast=False, raise_on_error=True)
Docstring:
Return an object of same shape as self and whose corresponding entries
are from self where cond is True and otherwise are from other.
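For reference (not part of the original answers), pandas also provides .mask, the logical complement of .where: it replaces values where the condition is True, so the question's "<= 0.5" condition can be used as written. A minimal sketch:
df['A'] = df['A'].mask(df['A'] <= 0.5, (1 / df['A']) * -1)  # replace where the condition is True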
Related
I am working on a large DataFrame with multiple columns. However, some of the columns hold their data as arrays nested within arrays (each containing a single value). I need to convert those columns to plain cell values, i.e. without the array wrapping. I have tried flatten and squeeze in different ways, but could not get the output I am looking for.
The following code reproduces the data format I am currently working with:
import pandas as pd
a = [[[10]], [[20]], [[30]], [[40]]]
b = [[50], [60], [70], [80]]
c = [90, 100, 110, 120]
df = pd.DataFrame(list(zip(a, b, c)), columns=['a', 'b', 'c'])
print(df)
The output of the above is:
a b c
0 [[10]] [50] 90
1 [[20]] [60] 100
2 [[30]] [70] 110
3 [[40]] [80] 120
However, I am looking to get the output as below:
a b c
0 10 50 90
1 20 60 100
2 30 70 110
3 40 80 120
It would really help if you could suggest how to approach this problem.
The head of the actual DataFrame is given below:
acoeff bcoeff refdiff ref18
0 [[0.33907555]] [11.51908656] 0.000 0.001
1 [[0.34024954]] [11.45693353] 0.001 0.001
2 [[0.34134777]] [11.40045124] 0.002 0.001
3 [[0.34297324]] [11.33036004] 0.004 0.001
4 [[0.34373931]] [11.2991075] 0.005 0.001
The head in dictionary format is given below:
{'acoeff': {0: '[[0.33907555]]', 1: '[[0.34024954]]', 2: '[[0.34134777]]', 3: '[[0.34297324]]', 4: '[[0.34373931]]'}, 'bcoeff': {0: '[11.51908656]', 1: '[11.45693353]', 2: '[11.40045124]', 3: '[11.33036004]', 4: '[11.2991075]'}, 'refdiff': {0: 0.0, 1: 0.001, 2: 0.002, 3: 0.004, 4: 0.005}, 'ref18': {0: 0.001, 1: 0.001, 2: 0.001, 3: 0.001, 4: 0.001}}
strings
Strip the [] and convert to numeric:
df.update(df.select_dtypes(exclude='number')
            .apply(lambda c: pd.to_numeric(c.str.strip('[]'))))
print(df)
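If the stringified cells can be nested to arbitrary depth, one alternative (a sketch, not part of the original answer) is to parse each string with ast.literal_eval and flatten it with NumPy:
from ast import literal_eval
import numpy as np

def parse_cell(s):
    # parse e.g. "[[0.339]]" into a nested list, flatten it, and take the single value
    return float(np.ravel(literal_eval(s))[0])

str_cols = df.select_dtypes(exclude='number').columns
df[str_cols] = df[str_cols].applymap(parse_cell)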
real lists
You can unnest the lists with the .str accessor:
df['a'].str[0].str[0]
output:
0 10
1 20
2 30
3 40
Name: a, dtype: int64
To automate things a bit, you can use a recursive function:
def unnest(x):
    from pandas.api.types import is_numeric_dtype
    if is_numeric_dtype(x):
        return x
    else:
        return unnest(x.str[0])

df2 = df.apply(unnest)
A variant using the first item of each Series to determine the nesting level:
def unnest(x):
    if len(x) > 0 and isinstance(x.iloc[0], list):
        return unnest(x.str[0])
    else:
        return x

df2 = df.apply(unnest)
output:
a b c
0 10 50 90
1 20 60 100
2 30 70 110
3 40 80 120
arbitrary nesting
If you had an arbitrary nesting for each cell, you could use the same logic per element:
def unnest(x):
    if isinstance(x, list) and len(x) > 0:
        return unnest(x[0])
    else:
        return x

df2 = df.applymap(unnest)
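Note that in recent pandas versions (2.1 and later) DataFrame.applymap has been renamed to DataFrame.map, so the element-wise call above would read:
df2 = df.map(unnest)  # pandas >= 2.1; applymap still works but emits a deprecation warning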
Maybe not the best solution. But it works.
import numpy as np

def ravel_series(s):
    try:
        return np.concatenate(s).ravel()
    except ValueError:
        return s

df.apply(ravel_series)
You can try this,
Code:
def clean(el):
    if any(isinstance(i, list) for i in el):
        return el[0][0]
    elif isinstance(el, list):
        return el[0]

df['a'] = df.a.apply(clean)
df['b'] = df.b.apply(clean)
print(df)
Output:
a b c
0 10 50 90
1 20 60 100
2 30 70 110
3 40 80 120
I have a DataFrame as
Locality money
1 3
1 4
1 10
1 12
1 15
2 16
2 18
I have to compute combinations with replacement of the money column within each Locality group, keeping only pairs whose money difference is small enough. The target must look like:
Locality money1 money2
1 3 3
1 3 4
1 4 4
1 10 10
1 10 12
1 10 15
1 12 12
1 12 15
1 15 15
2 16 16
2 16 18
2 18 18
Note that combinations are only formed between values in the same Locality, and only where the difference between the two values is less than 6.
My current code is
from itertools import combinations_with_replacement
import numpy as np
import pandas as pd

def generate_graph(input_series, out_cols):
    return pd.DataFrame(list(combinations_with_replacement(input_series, r=2)), columns=out_cols)
df = (
    df.groupby(['Locality'])['money'].apply(
        lambda x: generate_graph(x, out_cols=['money1', 'money2'])
    ).reset_index().drop(columns=['level_1'], errors='ignore')
)
# Ensure the Distance between money is within the permissible limit
df = df.loc[(
    df['money2'] - df['money1'] < 6
)]
The issue is that my DataFrame has 100,000 rows, and the code above takes almost 33 seconds to run. I need to reduce that time, probably using NumPy; in particular I am looking to optimize the groupby and the post-filter, which take extra space and time. You can use the code below to generate sample data.
# Generate dummy data
t1 = list(range(0, 100000))
b = np.random.randint(100, 10000, 100000)
a = (b/100).astype(int)
df = pd.DataFrame({'Locality': a, 'money': t1})
df = df.sort_values(by=['Locality', 'money'])
To both speed up the running time and reduce space consumption:
Instead of post-filtering, apply an extended function (say combine_values) that builds the DataFrame from a generator expression yielding only the combinations that already satisfy the condition.
(factor below is a default argument representing the permissible limit mentioned above)
In [48]: def combine_values(values, out_cols, factor=6):
...: return pd.DataFrame(((m1, m2) for m1, m2 in combinations_with_replacement(values, r=2)
...: if m2 - m1 < factor), columns=out_cols)
...:
In [49]: df_result = (
...: df.groupby(['Locality'])['money'].apply(
...: lambda x: combine_values(x, out_cols=['money1', 'money2'])
...: ).reset_index().drop(columns=['level_1'], errors='ignore')
...: )
Execution time performance:
In [50]: %time df.groupby(['Locality'])['money'].apply(lambda x: combine_values(x, out_cols=['money1', 'money2'])).reset_index().drop(columns=['level_1'], errors='ignore')
CPU times: user 2.42 s, sys: 1.64 ms, total: 2.42 s
Wall time: 2.42 s
Out[50]:
Locality money1 money2
0 1 34 34
1 1 106 106
2 1 123 123
3 1 483 483
4 1 822 822
... ... ... ...
105143 99 99732 99732
105144 99 99872 99872
105145 99 99889 99889
105146 99 99913 99913
105147 99 99981 99981
[105148 rows x 3 columns]
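Since money is already sorted within each Locality, a further option (a sketch only, not benchmarked here) is to skip the pairwise generator entirely: for every m1, np.searchsorted finds how many of the following values stay within the limit, and the pairs are then built with vectorized repeats and slices.
import numpy as np
import pandas as pd

def combine_sorted(values, factor=6):
    # assumes values are sorted ascending within the group
    v = np.asarray(values)
    idx = np.arange(len(v))
    # first position whose value is >= v[i] + factor, i.e. the first pair outside the limit
    upper = np.searchsorted(v, v + factor, side='left')
    m1 = np.repeat(v, upper - idx)
    m2 = np.concatenate([v[i:j] for i, j in zip(idx, upper)])
    return pd.DataFrame({'money1': m1, 'money2': m2})

df_result = (
    df.groupby(['Locality'])['money'].apply(
        lambda x: combine_sorted(x, factor=6)
    ).reset_index().drop(columns=['level_1'], errors='ignore')
)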
Let's say I have the following data set:
A B
10.1 53
12.5 42
16.0 37
20.7 03
25.6 16
30.1 01
40.9 19
60.5 99
I have the following list of ranges:
[[9,15],[19,22],[39,50]]
How do I efficiently pull rows that lie in those ranges?
Wanted Output
A B
10.1 53
12.5 42
20.7 03
40.9 19
Edit: this needs to work for floating-point values.
Update for modified question
For floats, you can construct a mask using NumPy array operations:
L = np.array([[9,15],[19,22],[39,50]])
A = df['A'].values
mask = ((A >= L[:, 0][:, None]) & (A <= L[:, 1][:, None])).any(0)
res = df[mask]
print(res)
A B
0 10.1 53
1 12.5 42
3 20.7 3
6 40.9 19
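For what it's worth, the broadcasting above compares every value of A against every interval at once (a boolean array with one row per interval) and then collapses over the intervals with .any(0). An equivalent mask can be built with Series.between (a sketch; note that between is inclusive on both ends by default):
import numpy as np

L = [[9, 15], [19, 22], [39, 50]]
mask = np.logical_or.reduce([df['A'].between(lo, hi) for lo, hi in L])
res = df[mask]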
Previous answer to original question
For integers, you can use numpy.concatenate with numpy.arange:
L = [[9,15],[19,22],[39,50]]
vals = np.concatenate([np.arange(i, j) for i, j in L])
res = df[df['A'].isin(vals)]
print(res)
A B
0 10 53
1 12 42
3 20 3
6 40 19
An alternative solution with itertools.chain and range:
from itertools import chain
vals = set(chain.from_iterable(range(i, j) for i, j in L))
res = df[df['A'].isin(vals)]
Here's another method (edit: works with floats or integers). @jpp's might be faster, but this code is easier to understand (in my opinion).
df = pd.DataFrame([[10.1,53],[12.5,42],[16.0,37],[20.7,3],[25.6,16],[30.1,1],[40.9,19],[60.5,99]],columns=list('AB'))
ranges = [[9,15],[19,22],[39,50]]
result = pd.DataFrame(columns=list('AB'))
for r in ranges:
result = result.append(df[df['A'].between(r[0], r[1], inclusive=False)])
print (result)
Here's the output:
A B
0 10.1 53
1 12.5 42
3 20.7 3
6 40.9 19
PS: the following one-line list comprehension also works:
result = result.append([df[df['A'].between(r[0], r[1], inclusive=False)] for r in ranges])
I have a trivial problem that I have solved using loops, but I am trying to see if there is a way I can vectorize some of it to improve performance.
Essentially I have two DataFrames (DF_A and DF_B), where each row in DF_B is the sum of the corresponding row in DF_A and the previous row in DF_B. I do have the first row of values for DF_B.
df_a = [
    [1, 2, 3, 4],
    [5, 6, 7, 8],
    [..... more rows]
]
df_b = [
    [1, 2, 3, 4],
    [rows of all 0 values here, so dimensions match df_a]
]
What I am trying to achieve is that, for example, the 2nd row in df_b will be the values of the first row in df_b plus the values of the second row in df_a. So in this case:
df_b.loc[1] = [6, 8, 10, 12]
I was able to accomplish this by looping over the range of df_a, keeping the previous row's value and adding the current row of df_a to it. That doesn't seem very efficient.
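For reference, a sketch of the loop-based baseline described above (my reconstruction, not the asker's exact code):
# assumes df_a and df_b have the same shape and df_b's first row is already filled
prev = df_b.iloc[0].to_numpy()
for i in range(1, len(df_a)):
    prev = prev + df_a.iloc[i].to_numpy()
    df_b.iloc[i] = prev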
Here is a numpy solution. This should be significantly faster than a pandas loop, especially since it uses JIT-compiling via numba.
from numba import jit

a = df_a.values
b = df_b.values

@jit(nopython=True)
def fill_b(a, b):
    for i in range(1, len(b)):
        b[i] = b[i-1] + a[i]
    return b

df_b = pd.DataFrame(fill_b(a, b))
# 0 1 2 3
# 0 1 2 3 4
# 1 6 8 10 12
# 2 15 18 21 24
# 3 28 32 36 40
# 4 45 50 55 60
Performance benchmarking
import pandas as pd, numpy as np
from numba import jit
df_a = pd.DataFrame(np.arange(1,1000001).reshape(1000,1000))
@jit(nopython=True)
def fill_b(a, b):
    for i in range(1, len(b)):
        b[i] = b[i-1] + a[i]
    return b

def jp(df_a):
    a = df_a.values
    b = np.empty(df_a.values.shape)
    b[0] = np.arange(1, 1001)
    return pd.DataFrame(fill_b(a, b))
%timeit df_a.cumsum() # 16.1 ms
%timeit jp(df_a) # 6.05 ms
You can just create df_b using the cumulative sum over df_a, like so:
df_a = pd.DataFrame(np.arange(1,17).reshape(4,4))
df_b = df_a.cumsum()
0 1 2 3
0 1 2 3 4
1 6 8 10 12
2 15 18 21 24
3 28 32 36 40
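If the given first row of df_b differs from the first row of df_a, the same idea still works with a shift, since b[i] = b[0] + a[1] + ... + a[i]. A sketch (b0 here is a hypothetical Series holding df_b's given first row, indexed by column):
df_b = df_a.cumsum() + (b0 - df_a.iloc[0])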
I have a 2-d dictionary in the following format:
myDict = {('a','b'):10, ('a','c'):20, ('a','d'):30, ('b','c'):40, ('b','d'):50,('c','d'):60}
How can I write this into a tab-delimited file so that the file contains the matrix below? Filling a tuple (x, y) fills two locations, (x, y) and (y, x), and (x, x) is always 0.
The output would be :
a b c d
a 0 10 20 30
b 10 0 40 50
c 20 40 0 60
d 30 50 60 0
PS: If the dictionary can somehow be converted into a DataFrame (using pandas), then it can easily be written to a file with a pandas function.
You can do this with the lesser-known align method and a little unstack magic:
In [122]: s = Series(myDict, index=MultiIndex.from_tuples(myDict))
In [123]: df = s.unstack()
In [124]: lhs, rhs = df.align(df.T)
In [125]: res = lhs.add(rhs, fill_value=0).fillna(0)
In [126]: res
Out[126]:
a b c d
a 0 10 20 30
b 10 0 40 50
c 20 40 0 60
d 30 50 60 0
Finally, to write this to a CSV file, use the to_csv method:
In [128]: res.to_csv('res.csv', sep='\t')
In [129]: !cat res.csv
a b c d
a 0.0 10.0 20.0 30.0
b 10.0 0.0 40.0 50.0
c 20.0 40.0 0.0 60.0
d 30.0 50.0 60.0 0.0
If you want to keep things as integers, cast using DataFrame.astype(), like so:
In [137]: res.astype(int).to_csv('res.csv', sep='\t')
In [138]: !cat res.csv
a b c d
a 0 10 20 30
b 10 0 40 50
c 20 40 0 60
d 30 50 60 0
(It was cast to float because of the intermediate step of filling in nan values where indices from one frame were missing from the other)
@Dan Allan's answer using combine_first is nice:
In [130]: df.combine_first(df.T).fillna(0)
Out[130]:
a b c d
a 0 10 20 30
b 10 0 40 50
c 20 40 0 60
d 30 50 60 0
Timings:
In [134]: timeit df.combine_first(df.T).fillna(0)
100 loops, best of 3: 2.01 ms per loop
In [135]: timeit lhs, rhs = df.align(df.T); res = lhs.add(rhs, fill_value=0).fillna(0)
1000 loops, best of 3: 1.27 ms per loop
Those timings are probably a bit polluted by construction costs, so what do things look like with some really huge frames?
In [143]: df = DataFrame({i: randn(1e7) for i in range(1, 11)})
In [144]: df2 = DataFrame({i: randn(1e7) for i in range(10)})
In [145]: timeit lhs, rhs = df.align(df2); res = lhs.add(rhs, fill_value=0).fillna(0)
1 loops, best of 3: 4.41 s per loop
In [146]: timeit df.combine_first(df2).fillna(0)
1 loops, best of 3: 2.95 s per loop
DataFrame.combine_first() is faster for larger frames.
In [49]: data = map(list, zip(*myDict.keys())) + [myDict.values()]
In [50]: df = DataFrame(zip(*data)).set_index([0, 1])[2].unstack()
In [52]: df.combine_first(df.T).fillna(0)
Out[52]:
a b c d
a 0 10 20 30
b 10 0 40 50
c 20 40 0 60
d 30 50 60 0
For posterity: if you are just tuning in, check out Phillip Cloud's answer above for a neater way to construct df.
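Since the construction in In [49]/In [50] relies on Python 2 semantics (map returning a list and dict.values() supporting +), a Python 3 sketch of the same idea might look like:
import pandas as pd

s = pd.Series(list(myDict.values()),
              index=pd.MultiIndex.from_tuples(myDict.keys()))
df = s.unstack()
df.combine_first(df.T).fillna(0)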
Not as elegant as I'd like (and not using pandas) but until you find something better:
adj = dict()
for (u, v), w in myDict.items():
    if u not in adj: adj[u] = dict()
    if v not in adj: adj[v] = dict()
    adj[u][v] = adj[v][u] = w

keys = adj.keys()
print('\t' + '\t'.join(keys))
for u in keys:
    def f(v):
        try:
            return str(adj[u][v])
        except KeyError:
            return "0"
    print(u + '\t' + '\t'.join(f(v) for v in keys))
or equivalently (if you don't want to construct the adjacency matrix):
k = dict()
for (u, v), w in myDict.items():
    k[u] = k[v] = True

keys = k.keys()
print('\t' + '\t'.join(keys))
for u in keys:
    def f(v):
        if (u, v) in myDict:
            return str(myDict[(u, v)])
        elif (v, u) in myDict:
            return str(myDict[(v, u)])
        else:
            return "0"
    print(u + '\t' + '\t'.join(f(v) for v in keys))
Got it working using the pandas package.
from pandas import DataFrame

# Find all column names
z = []
[z.extend(x) for x in myDict.keys()]
colnames = sorted(set(z))

# Create an empty DataFrame using pandas
myDF = DataFrame(index=colnames, columns=colnames)
myDF = myDF.fillna(0)  # Initialize with zeros

# Fill each item one by one
for val in myDict:
    myDF[val[0]][val[1]] = myDict[val]
    myDF[val[1]][val[0]] = myDict[val]

# Write to a file
outfilename = "matrixCooccurence.txt"
myDF.to_csv(outfilename, sep="\t", index=True, header=True, index_label="features")
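One caveat (not in the original answer): the chained indexing myDF[col][row] = ... can raise a SettingWithCopyWarning and no longer updates the frame at all when pandas' copy-on-write mode is enabled. Filling both cells with .loc is the safer equivalent:
for (r, c), w in myDict.items():
    # .loc[row, column] writes directly into the frame
    myDF.loc[r, c] = w
    myDF.loc[c, r] = w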