Python: numpy/pandas change values on condition

I would like to know if there is a faster and more "pythonic" way of doing the following, e.g. using some built-in methods.
Given a pandas DataFrame or numpy array of floats, wherever a value is equal to or smaller than 0.5 I need to replace it with its reciprocal multiplied by -1 (i.e. the old value is overwritten with the newly calculated one).
"Transform" is probably a bad choice of words; please tell me if you have a better/more accurate description.
Thank you for your help and support!
Data:
import numpy as np
import pandas as pd
dicti = {"A" : np.arange(0.0, 3, 0.1),
"B" : np.arange(0, 30, 1),
"C" : list("ELVISLIVES")*3}
df = pd.DataFrame(dicti)
my function:
def transform_colname(df, colname):
    series = df[colname]
    newval_list = []
    for val in series:
        if val <= 0.5:
            newval = (1/val)*-1
            newval_list.append(newval)
        else:
            newval_list.append(val)
    df[colname] = newval_list
    return df
function call:
transform_colname(df, colname="A")
**--> I'm summing up the results here, since comments don't allow posting code (or I don't know how to do it).**
Thank you all for your fast and great answers!!
Using IPython "%timeit" with "real" data:
my function:
10 loops, best of 3: 24.1 ms per loop
from jojo:
def transform_colname_v2(df, colname):
    series = df[colname]
    df[colname] = np.where(series <= 0.5, 1/series*-1, series)
    return df
100 loops, best of 3: 2.76 ms per loop
from FooBar:
def transform_colname_v3(df, colname):
    df.loc[df[colname] <= 0.5, colname] = -1 / df[colname][df[colname] <= 0.5]
    return df
100 loops, best of 3: 3.32 ms per loop
from dmvianna:
def transform_colname_v4(df, colname):
    # note: .where keeps values where the condition is True, so the condition must be "> 0.5"
    df[colname] = df[colname].where(df[colname] > 0.5, (1/df[colname])*-1)
    return df
100 loops, best of 3: 3.7 ms per loop
Please tell/show me if you would implement your code in a different way!
One final QUESTION: (answered)
How could "FooBar" and "dmvianna" 's versions be made "generic"? I mean, I had to write the name of the column into the function (since using it as a variable didn't work). Please explain this last point!
--> thanks jojo, ".loc" isn't the right way, but very simple df[colname] is sufficient. changed the functions above to be more "generic". (also changed ">" to be "<=", and updated timing)
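For illustration, a minimal sketch of the "generic" pattern (the function and column names here are arbitrary, chosen just for the example):
import numpy as np
import pandas as pd

def transform_any_column(df, colname):
    # df[colname] works with a column name held in a variable,
    # whereas attribute access (df.A) only works with a literal name
    series = df[colname]
    df[colname] = np.where(series <= 0.5, -1 / series, series)
    return df

df = pd.DataFrame({"A": np.arange(0.0, 3, 0.1)})
transform_any_column(df, colname="A")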
Thank you very much!!

The typical trick is to write a general mathematical operation to apply to the whole column, but then use indicators to select rows for which we actually apply it:
df.loc[df.A < 0.5, 'A'] = - 1 / df.A[df.A < 0.5]
In[13]: df
Out[13]:
A B C
0 -inf 0 E
1 -10.000000 1 L
2 -5.000000 2 V
3 -3.333333 3 I
4 -2.500000 4 S
5 0.500000 5 L
6 0.600000 6 I
7 0.700000 7 V
8 0.800000 8 E
9 0.900000 9 S
10 1.000000 10 E
11 1.100000 11 L
12 1.200000 12 V
13 1.300000 13 I
14 1.400000 14 S
15 1.500000 15 L
16 1.600000 16 I
17 1.700000 17 V
18 1.800000 18 E
19 1.900000 19 S
20 2.000000 20 E
21 2.100000 21 L
22 2.200000 22 V
23 2.300000 23 I
24 2.400000 24 S
25 2.500000 25 L
26 2.600000 26 I
27 2.700000 27 V
28 2.800000 28 E
29 2.900000 29 S

If we are talking about arrays:
import numpy as np
a = np.array([0.1, 0.2, 0.3, 0.4, 0.5, 0.6], dtype=float)
print(1 / a[a <= 0.5] * (-1))
This will, however, only return the transformed values (those <= 0.5), not the full array.
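If you need the whole array back, a minimal sketch of the same idea using an in-place boolean-mask assignment (only the selected elements are overwritten):
import numpy as np

a = np.array([0.1, 0.2, 0.3, 0.4, 0.5, 0.6], dtype=float)
mask = a <= 0.5
a[mask] = -1 / a[mask]  # overwrite only the selected elements
print(a)  # [-10. -5. -3.333... -2.5 -2. 0.6]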
Alternatively use np.where:
import numpy as np
a = np.array([0.1, 0.2, 0.3, 0.4, 0.5, 0.6], dtype=float)
print(np.where(a <= 0.5, 1 / a * (-1), a))
Talking about a pandas DataFrame:
As in @dmvianna's answer (so give some credit to him ;) ), adapted to a pd.DataFrame:
df.a = df.a.where(df.a > 0.5, (1 / df.a) * (-1))

As in @jojo's answer, but using pandas:
df.A = df.A.where(df.A > 0.5, (1/df.A)*-1)
or
df.A.where(df.A > 0.5, (1/df.A)*-1, inplace=True) # this should be faster
.where docstring:
Definition: df.A.where(self, cond, other=nan, inplace=False,
axis=None, level=None, try_cast=False, raise_on_error=True)
Docstring:
Return an object of same shape as self and whose corresponding entries
are from self where cond is True and otherwise are from other.
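As a side note, pandas also has Series.mask, the complement of where (it replaces values where the condition is True), so the condition can stay as "<= 0.5"; a minimal sketch:
import numpy as np
import pandas as pd

s = pd.Series(np.arange(0.0, 3, 0.1))
# mask() replaces entries where the condition holds and keeps the rest
s_new = s.mask(s <= 0.5, -1 / s)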

Related

cleaning dataframe columns with single cell arrays of different types

I am working on a large dataframe with multiple columns. However, some of the columns have data in the form of arrays within arrays (each holding a single value). I need to convert these columns to plain cell values, i.e. without the array wrapping. I have tried flatten and squeeze in different ways, but could not get the output I am looking for.
Following code reproduces the data format I am working at present:
import pandas as pd
a = [[[10]], [[20]], [[30]], [[40]]]
b = [[50], [60], [70], [80]]
c = [90, 100, 110, 120]
df = pd.DataFrame(list(zip(a, b, c)), columns=['a', 'b', 'c'])
print(df)
The output of the above is:
a b c
0 [[10]] [50] 90
1 [[20]] [60] 100
2 [[30]] [70] 110
3 [[40]] [80] 120
However, I am looking to get the output as below:
a b c
0 10 50 90
1 20 60 100
2 30 70 110
3 40 80 120
It would really help if you could suggest how to approach this problem.
The head of the actual dataframe is given below:
acoeff bcoeff refdiff ref18
0 [[0.33907555]] [11.51908656] 0.000 0.001
1 [[0.34024954]] [11.45693353] 0.001 0.001
2 [[0.34134777]] [11.40045124] 0.002 0.001
3 [[0.34297324]] [11.33036004] 0.004 0.001
4 [[0.34373931]] [11.2991075] 0.005 0.001
The head in dictionary format is given below:
{'acoeff': {0: '[[0.33907555]]', 1: '[[0.34024954]]', 2: '[[0.34134777]]', 3: '[[0.34297324]]', 4: '[[0.34373931]]'}, 'bcoeff': {0: '[11.51908656]', 1: '[11.45693353]', 2: '[11.40045124]', 3: '[11.33036004]', 4: '[11.2991075]'}, 'refdiff': {0: 0.0, 1: 0.001, 2: 0.002, 3: 0.004, 4: 0.005}, 'ref18': {0: 0.001, 1: 0.001, 2: 0.001, 3: 0.001, 4: 0.001}}
strings
strip the [] and convert to numeric:
(df.update(df.select_dtypes(exclude='number')
             .apply(lambda c: pd.to_numeric(c.str.strip('[]'))))
)
print(df)
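For illustration, a minimal sketch applying this to the head shown above (dictionary trimmed to two rows for brevity):
import pandas as pd

raw = {'acoeff': {0: '[[0.33907555]]', 1: '[[0.34024954]]'},
       'bcoeff': {0: '[11.51908656]', 1: '[11.45693353]'},
       'refdiff': {0: 0.0, 1: 0.001},
       'ref18': {0: 0.001, 1: 0.001}}
df = pd.DataFrame(raw)

# strip the brackets from the string columns and convert them to numbers
df.update(df.select_dtypes(exclude='number')
            .apply(lambda c: pd.to_numeric(c.str.strip('[]'))))
print(df)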
real lists
You can unnest the list with the str locator:
df['a'].str[0].str[0]
output:
0 10
1 20
2 30
3 40
Name: a, dtype: int64
To automate things a bit, you can use a recursive function:
def unnest(x):
    from pandas.api.types import is_numeric_dtype
    if is_numeric_dtype(x):
        return x
    else:
        return unnest(x.str[0])

df2 = df.apply(unnest)
A variant using the first item of each Series to determine the nesting level:
def unnest(x):
    if len(x) > 0 and isinstance(x.iloc[0], list):
        return unnest(x.str[0])
    else:
        return x

df2 = df.apply(unnest)
output:
a b c
0 10 50 90
1 20 60 100
2 30 70 110
3 40 80 120
arbitrary nesting
If you had an arbitrary nesting for each cell, you could use the same logic per element:
def unnest(x):
    if isinstance(x, list) and len(x) > 0:
        return unnest(x[0])
    else:
        return x

df2 = df.applymap(unnest)
Maybe not the best solution. But it works.
import numpy as np

def ravel_series(s):
    try:
        return np.concatenate(s).ravel()
    except ValueError:
        return s

df.apply(ravel_series)
You can try this,
Code:
def clean(el):
    if any(isinstance(i, list) for i in el):
        return el[0][0]
    elif isinstance(el, list):
        return el[0]

df['a'] = df.a.apply(clean)
df['b'] = df.b.apply(clean)
print(df)
Output:
a b c
0 10 50 90
1 20 60 100
2 30 70 110
3 40 80 120

Optimising itertools combination with grouped DataFrame and post filter

I have a DataFrame as
Locality money
1 3
1 4
1 10
1 12
1 15
2 16
2 18
I have to take combinations with replacement of the money column within each Locality group (a groupby on Locality), keeping only pairs whose money difference is within a limit. The target should look like
Locality money1 money2
1 3 3
1 3 4
1 4 4
1 10 10
1 10 12
1 10 15
1 12 12
1 12 15
1 15 15
2 16 16
2 16 18
2 18 18
Note that combinations are only formed between values in the same Locality, and only pairs whose difference is less than 6 are kept.
My current code is
from itertools import combinations_with_replacement
import numpy as np
import pandas as pd

def generate_graph(input_series, out_cols):
    return pd.DataFrame(list(combinations_with_replacement(input_series, r=2)), columns=out_cols)

df = (
    df.groupby(['Locality'])['money'].apply(
        lambda x: generate_graph(x, out_cols=['money1', 'money2'])
    ).reset_index().drop(columns=['level_1'], errors='ignore')
)
# Ensure the distance between money values is within the permissible limit
df = df.loc[(
    df['money2'] - df['money1'] < 6
)]
The issue is that on a DataFrame with 100,000 rows my code takes almost 33 seconds. I need to reduce the runtime, probably using numpy. In particular, I am looking to optimize the groupby and the post-filter, which cost extra space and time. For sample data, you can use this code to generate the DataFrame.
# Generate dummy data
t1 = list(range(0, 100000))
b = np.random.randint(100, 10000, 100000)
a = (b/100).astype(int)
df = pd.DataFrame({'Locality': a, 'money': t1})
df = df.sort_values(by=['Locality', 'money'])
To speed up the runtime and reduce space consumption, skip the post-filtering: apply an extended function (say combine_values) that builds the DataFrame from a generator expression yielding only the combinations that already satisfy the condition.
(factor below is a default argument that encodes the mentioned permissible limit)
In [48]: def combine_values(values, out_cols, factor=6):
    ...:     return pd.DataFrame(((m1, m2) for m1, m2 in combinations_with_replacement(values, r=2)
    ...:                          if m2 - m1 < factor), columns=out_cols)
    ...:
In [49]: df_result = (
    ...:     df.groupby(['Locality'])['money'].apply(
    ...:         lambda x: combine_values(x, out_cols=['money1', 'money2'])
    ...:     ).reset_index().drop(columns=['level_1'], errors='ignore')
    ...: )
Execution time performance:
In [50]: %time df.groupby(['Locality'])['money'].apply(lambda x: combine_values(x, out_cols=['money1', 'money2'])).reset_index().drop(columns=['l
...: evel_1'], errors='ignore')
CPU times: user 2.42 s, sys: 1.64 ms, total: 2.42 s
Wall time: 2.42 s
Out[50]:
Locality money1 money2
0 1 34 34
1 1 106 106
2 1 123 123
3 1 483 483
4 1 822 822
... ... ... ...
105143 99 99732 99732
105144 99 99872 99872
105145 99 99889 99889
105146 99 99913 99913
105147 99 99981 99981
[105148 rows x 3 columns]
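As a quick sanity check, a minimal sketch running combine_values on the small sample from the question; it should reproduce the target table shown there:
import pandas as pd
from itertools import combinations_with_replacement

def combine_values(values, out_cols, factor=6):
    return pd.DataFrame(((m1, m2) for m1, m2 in combinations_with_replacement(values, r=2)
                         if m2 - m1 < factor), columns=out_cols)

sample = pd.DataFrame({'Locality': [1, 1, 1, 1, 1, 2, 2],
                       'money': [3, 4, 10, 12, 15, 16, 18]})

out = (sample.groupby(['Locality'])['money']
             .apply(lambda x: combine_values(x, out_cols=['money1', 'money2']))
             .reset_index()
             .drop(columns=['level_1'], errors='ignore'))
print(out)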

Pulling Multiple Ranges from Dataframe Pandas

Let's say I have the following data set:
A B
10.1 53
12.5 42
16.0 37
20.7 03
25.6 16
30.1 01
40.9 19
60.5 99
I have the following list of ranges:
[[9,15],[19,22],[39,50]]
How do I efficiently pull rows that lie in those ranges?
Wanted Output
A B
10.1 53
12.5 42
20.7 03
40.9 19
Edit:
This needs to work for floating-point values.
Update for modified question
For floats, you can construct a mask using NumPy array operations:
L = np.array([[9,15],[19,22],[39,50]])
A = df['A'].values
mask = ((A >= L[:, 0][:, None]) & (A <= L[:, 1][:, None])).any(0)
res = df[mask]
print(res)
A B
0 10.1 53
1 12.5 42
3 20.7 3
6 40.9 19
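To see why the mask works, a minimal sketch of the broadcasting involved (shapes annotated in comments; the variable names are just for illustration):
import numpy as np

L = np.array([[9, 15], [19, 22], [39, 50]])  # shape (3, 2): one row per range
A = np.array([10.1, 12.5, 16.0, 20.7])       # shape (4,): the values to test

lower = L[:, 0][:, None]                     # shape (3, 1)
upper = L[:, 1][:, None]                     # shape (3, 1)

in_range = (A >= lower) & (A <= upper)       # broadcasts to shape (3, 4)
mask = in_range.any(0)                       # shape (4,): True if the value falls in any range
print(mask)                                  # [ True  True False  True]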
Previous answer to original question
For integers, you can use numpy.concatenate with numpy.arange:
L = [[9,15],[19,22],[39,50]]
vals = np.concatenate([np.arange(i, j) for i, j in L])
res = df[df['A'].isin(vals)]
print(res)
A B
0 10 53
1 12 42
3 20 3
6 40 19
An alternative solution with itertools.chain and range:
from itertools import chain
vals = set(chain.from_iterable(range(i, j) for i, j in L))
res = df[df['A'].isin(vals)]
Here's another method (edit: works with floats or integers). @jpp's might be faster, but this code is easier to understand (in my opinion).
df = pd.DataFrame([[10.1, 53], [12.5, 42], [16.0, 37], [20.7, 3], [25.6, 16], [30.1, 1], [40.9, 19], [60.5, 99]], columns=list('AB'))
ranges = [[9, 15], [19, 22], [39, 50]]
result = pd.DataFrame(columns=list('AB'))
for r in ranges:
    result = result.append(df[df['A'].between(r[0], r[1], inclusive=False)])
print(result)
Here's the output:
A B
0 10.1 53
1 12.5 42
3 20.7 3
6 40.9 19
PS: the following one-line list comprehension also works:
result = result.append([df[df['A'].between(r[0], r[1], inclusive=False)] for r in ranges])
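Note that DataFrame.append was removed in pandas 2.0; a minimal sketch of the same idea using pd.concat (and the string form of inclusive that current pandas expects):
import pandas as pd

df = pd.DataFrame([[10.1, 53], [12.5, 42], [16.0, 37], [20.7, 3],
                   [25.6, 16], [30.1, 1], [40.9, 19], [60.5, 99]], columns=list('AB'))
ranges = [[9, 15], [19, 22], [39, 50]]

result = pd.concat(df[df['A'].between(lo, hi, inclusive='neither')] for lo, hi in ranges)
print(result)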

Vectorize calculation of a Pandas Dataframe

I have a trivial problem that I have solved using loops, but I am trying to see whether I can vectorize some of it to improve performance.
Essentially I have two dataframes (DF_A and DF_B), where each row in DF_B is the sum of the corresponding row in DF_A and the row above it in DF_B. I do have the first row of values for DF_B.
df_a = [
    [1, 2, 3, 4],
    [5, 6, 7, 8],
    [..... more rows]
]
df_b = [
    [1, 2, 3, 4],
    [rows of all 0 values here, so dimensions match df_a]
]
What I am trying to achieve is that, for example, the 2nd row of df_b will be the values of the first row of df_b plus the values of the second row of df_a. So in this case:
df_b.loc[1] = [6, 8, 10, 12]
I was able to accomplish this by looping over the range of df_a, keeping the previous row's value saved off and then adding the current row of df_a to it. That doesn't seem very efficient.
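For reference, a minimal sketch of the kind of loop being described (assuming df_a and df_b are same-shaped DataFrames and df_b's first row is already given):
import numpy as np
import pandas as pd

df_a = pd.DataFrame(np.arange(1, 17).reshape(4, 4))
df_b = pd.DataFrame(np.zeros_like(df_a.values))
df_b.iloc[0] = df_a.iloc[0]  # the first row of df_b is given

# naive row-by-row loop: each row is the previous df_b row plus the current df_a row
for i in range(1, len(df_b)):
    df_b.iloc[i] = df_b.iloc[i - 1] + df_a.iloc[i]
print(df_b)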
Here is a numpy solution. It should be significantly faster than a pandas loop, especially since it uses JIT compilation via numba.
from numba import jit

a = df_a.values
b = df_b.values

@jit(nopython=True)
def fill_b(a, b):
    for i in range(1, len(b)):
        b[i] = b[i-1] + a[i]
    return b

df_b = pd.DataFrame(fill_b(a, b))
# 0 1 2 3
# 0 1 2 3 4
# 1 6 8 10 12
# 2 15 18 21 24
# 3 28 32 36 40
# 4 45 50 55 60
Performance benchmarking
import pandas as pd, numpy as np
from numba import jit

df_a = pd.DataFrame(np.arange(1, 1000001).reshape(1000, 1000))

@jit(nopython=True)
def fill_b(a, b):
    for i in range(1, len(b)):
        b[i] = b[i-1] + a[i]
    return b

def jp(df_a):
    a = df_a.values
    b = np.empty(df_a.values.shape)
    b[0] = np.arange(1, 1001)
    return pd.DataFrame(fill_b(a, b))

%timeit df_a.cumsum()  # 16.1 ms
%timeit jp(df_a)       # 6.05 ms
You can just create df_b as the cumulative sum of df_a, like so:
df_a = pd.DataFrame(np.arange(1,17).reshape(4,4))
df_b = df_a.cumsum()
0 1 2 3
0 1 2 3 4
1 6 8 10 12
2 15 18 21 24
3 28 32 36 40
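One caveat worth adding: cumsum reproduces df_b here only because the given first row of df_b equals the first row of df_a. For an arbitrary first row b0, a minimal sketch of the general adjustment (the offset follows from b[i] = b[i-1] + a[i]):
import numpy as np
import pandas as pd

df_a = pd.DataFrame(np.arange(1, 17).reshape(4, 4))
b0 = pd.Series([100, 200, 300, 400])  # hypothetical first row of df_b

# b[i] = b0 + a[1] + ... + a[i] = cumsum(a)[i] - a[0] + b0
df_b = df_a.cumsum() + (b0 - df_a.iloc[0])
print(df_b)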

Write 2d dictionary into a dataframe or tab-delimited file using python

I have a 2-d dictionary in the following format:
myDict = {('a','b'):10, ('a','c'):20, ('a','d'):30, ('b','c'):40, ('b','d'):50,('c','d'):60}
How can I write this into a tab-delimited file so that the file contains the following output? When filling, a tuple (x, y) fills two locations, (x, y) and (y, x), and (x, x) is always 0.
The output would be :
a b c d
a 0 10 20 30
b 10 0 40 50
c 20 40 0 60
d 30 50 60 0
PS: If the dictionary can somehow be converted into a DataFrame (using pandas), then it can easily be written to a file using a pandas function.
You can do this with the lesser-known align method and a little unstack magic:
In [122]: s = Series(myDict, index=MultiIndex.from_tuples(myDict))
In [123]: df = s.unstack()
In [124]: lhs, rhs = df.align(df.T)
In [125]: res = lhs.add(rhs, fill_value=0).fillna(0)
In [126]: res
Out[126]:
a b c d
a 0 10 20 30
b 10 0 40 50
c 20 40 0 60
d 30 50 60 0
Finally, to write this to a CSV file, use the to_csv method:
In [128]: res.to_csv('res.csv', sep='\t')
In [129]: !cat res.csv
a b c d
a 0.0 10.0 20.0 30.0
b 10.0 0.0 40.0 50.0
c 20.0 40.0 0.0 60.0
d 30.0 50.0 60.0 0.0
If you want to keep things as integers, cast using DataFrame.astype(), like so:
In [137]: res.astype(int).to_csv('res.csv', sep='\t')
In [138]: !cat res.csv
a b c d
a 0 10 20 30
b 10 0 40 50
c 20 40 0 60
d 30 50 60 0
(It was cast to float because of the intermediate step of filling in nan values where indices from one frame were missing from the other)
@Dan Allan's answer using combine_first is nice:
In [130]: df.combine_first(df.T).fillna(0)
Out[130]:
a b c d
a 0 10 20 30
b 10 0 40 50
c 20 40 0 60
d 30 50 60 0
Timings:
In [134]: timeit df.combine_first(df.T).fillna(0)
100 loops, best of 3: 2.01 ms per loop
In [135]: timeit lhs, rhs = df.align(df.T); res = lhs.add(rhs, fill_value=0).fillna(0)
1000 loops, best of 3: 1.27 ms per loop
Those timings are probably a bit polluted by construction costs, so what do things look like with some really huge frames?
In [143]: df = DataFrame({i: randn(1e7) for i in range(1, 11)})
In [144]: df2 = DataFrame({i: randn(1e7) for i in range(10)})
In [145]: timeit lhs, rhs = df.align(df2); res = lhs.add(rhs, fill_value=0).fillna(0)
1 loops, best of 3: 4.41 s per loop
In [146]: timeit df.combine_first(df2).fillna(0)
1 loops, best of 3: 2.95 s per loop
DataFrame.combine_first() is faster for larger frames.
In [49]: data = map(list, zip(*myDict.keys())) + [myDict.values()]
In [50]: df = DataFrame(zip(*data)).set_index([0, 1])[2].unstack()
In [52]: df.combine_first(df.T).fillna(0)
Out[52]:
a b c d
a 0 10 20 30
b 10 0 40 50
c 20 40 0 60
d 30 50 60 0
For posterity: If you are just tuning in, check out Phillip Cloud's answer below for a neater way to construct df.
Not as elegant as I'd like (and not using pandas) but until you find something better:
adj = dict()
for ((u, v), w) in myDict.items():
    if u not in adj: adj[u] = dict()
    if v not in adj: adj[v] = dict()
    adj[u][v] = adj[v][u] = w
keys = adj.keys()
print('\t' + '\t'.join(keys))
for u in keys:
    def f(v):
        try:
            return str(adj[u][v])
        except KeyError:
            return "0"
    print(u + '\t' + '\t'.join(f(v) for v in keys))
or equivalently (if you don't want to construct the adjacency matrix):
k = dict()
for ((u, v), w) in myDict.items():
    k[u] = k[v] = True
keys = k.keys()
print('\t' + '\t'.join(keys))
for u in keys:
    def f(v):
        if (u, v) in myDict:
            return str(myDict[(u, v)])
        elif (v, u) in myDict:
            return str(myDict[(v, u)])
        else:
            return "0"
    print(u + '\t' + '\t'.join(f(v) for v in keys))
I got it working using the pandas package.
from pandas import DataFrame

# Find all column names
z = []
[z.extend(x) for x in myDict.keys()]
colnames = sorted(set(z))

# Create an empty DataFrame using pandas
myDF = DataFrame(index=colnames, columns=colnames)
myDF = myDF.fillna(0)  # initialize with zeros

# Fill each item one by one
for val in myDict:
    myDF[val[0]][val[1]] = myDict[val]
    myDF[val[1]][val[0]] = myDict[val]

# Write to a file
outfilename = "matrixCooccurence.txt"
myDF.to_csv(outfilename, sep="\t", index=True, header=True, index_label="features")
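As a side note, chained assignment like myDF[val[0]][val[1]] = ... is discouraged in modern pandas and does not work reliably under copy-on-write; a minimal sketch of the same fill using .loc (labels as in the question):
import pandas as pd

myDict = {('a', 'b'): 10, ('a', 'c'): 20, ('a', 'd'): 30,
          ('b', 'c'): 40, ('b', 'd'): 50, ('c', 'd'): 60}

colnames = sorted({k for pair in myDict for k in pair})
myDF = pd.DataFrame(0, index=colnames, columns=colnames)

for (x, y), w in myDict.items():
    myDF.loc[x, y] = w  # .loc[row, column] avoids chained indexing
    myDF.loc[y, x] = w

myDF.to_csv("matrixCooccurence.txt", sep="\t", index_label="features")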
