How to pass a condition into a lambda? - python

I have a dictionary like this:
Dict={'A':0.0697,'B':0.1136,'C':0.2227,'D':0.2725,'E':0.4555}
I want my output like this:
Return A,B,C,D,E if the value in my dataframe is LESS THAN 0.0697,0.1136,0.2227,0.2725,0.4555 respectively; else return F
I tried:
TrainTest['saga1'] = TrainTest['saga'].apply(lambda x,v: Dict[x] if x<=v else 'F')
But it returns an error:
TypeError: <lambda>() takes exactly 2 arguments (1 given)
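(About the error itself: Series.apply hands the function a single value at a time, so a two-argument lambda fails. An extra argument can be bound with a default parameter, or passed through apply's args=. A minimal sketch of the mechanics only, with a made-up series and labels:

```python
import pandas as pd

s = pd.Series([1, 2, 3])

# apply() passes one value at a time, so bind v as a default argument...
out1 = s.apply(lambda x, v=2: 'low' if x <= v else 'high')

# ...or pass the extra positional argument through apply() itself:
out2 = s.apply(lambda x, v: 'low' if x <= v else 'high', args=(2,))
```

Either way the lambda sees both `x` and `v`. The dictionary-lookup logic is a separate problem, addressed below.)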

Let's make some test data:
import pandas as pd
saga = pd.Series([0.1, 0.2, 0.3, 0.4, 0.5, 0.9])
Next, note that Dict is a dict, whose ordering we should not rely on, so sort its items by value in descending order:
thresh = sorted(Dict.items(), key=lambda t: t[1], reverse=True)
Finally, solve the problem by looping over thresh rather than saga: explicit loops (and apply()) in Python/pandas are slow, and we assume saga is much longer than thresh:
result = pd.Series('F', saga.index)  # all F's to start
for name, value in thresh:
    result[saga < value] = name
Now result is a Series of values A, B, C, D, E, F as appropriate. We loop over the thresholds in descending order so that smaller thresholds overwrite larger ones: 0, for example, is smaller than all the values and should end up labelled A, not E.

Regarding run-times:
In [160]: %%timeit
# loop over the short thresh list, not the long saga
for name, value in thresh:
    result[saga < value] = name
100 loops, best of 3: 2.59 ms per loop
Here are pandas run-times:
saga1 = pd.DataFrame([0.05, 0.1, 0.2, 0.3, 0.4, 0.5, 0.9], columns=['c1'])
def mapF(s):
    # walk the thresholds in descending order; the smallest match wins
    curr = 'F'
    for name, value in thresh:
        if s < value:
            curr = name
    return curr
Using map/apply:
In [149]: %%timeit
saga1['result'] = saga1['c1'].map(lambda x: mapF(x) )
1000 loops, best of 3: 311 µs per loop
Using vectorization:
In [166]: %%timeit
import numpy as np
saga1['result'] = np.vectorize(mapF)(saga1['c1'])
1000 loops, best of 3: 244 µs per loop
saga1:
+---+------+--------+
| | c1 | result |
+---+------+--------+
| 0 | 0.05 | A |
| 1 | 0.1 | B |
| 2 | 0.2 | C |
| 3 | 0.3 | E |
| 4 | 0.4 | E |
| 5 | 0.5 | F |
| 6 | 0.9 | F |
+---+------+--------+
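For completeness, the same labelling can be done without any explicit Python loop: pandas' pd.cut bins values against sorted edges in one vectorized call. A minimal sketch, assuming the same Dict and the same strict "x < threshold" rule as mapF above:

```python
import numpy as np
import pandas as pd

Dict = {'A': 0.0697, 'B': 0.1136, 'C': 0.2227, 'D': 0.2725, 'E': 0.4555}
saga = pd.Series([0.05, 0.1, 0.2, 0.3, 0.4, 0.5, 0.9])

# Ascending bin edges; right=False makes each bin half-open [lo, hi),
# which reproduces the strict "x < threshold" rule from mapF.
edges = [-np.inf] + sorted(Dict.values()) + [np.inf]
labels = sorted(Dict, key=Dict.get) + ['F']  # A..E for the bins, F for the rest
result = pd.cut(saga, bins=edges, labels=labels, right=False)
```

This produces the same A/B/C/E/E/F/F labels as the table above, and scales to long series without a Python-level loop.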


How should I solve a logic error in timestamps using Python?

I have written code to calculate a, b, and c; all are initialized to 0.
This is my input file
-------------------------------------------------------------
| Line | Time | Command   | Data |
-------------------------------------------------------------
|  1   | 0015 | ACTIVE    |      |
|  2   | 0030 | WRITING   |      |
|  3   | 0100 | WRITING_A |      |
|  4   | 0115 | PRECHARGE |      |
|  5   | 0120 | REFRESH   |      |
|  6   | 0150 | ACTIVE    |      |
|  7   | 0200 | WRITING   |      |
|  8   | 0314 | PRECHARGE |      |
|  9   | 0318 | ACTIVE    |      |
| 10   | 0345 | WRITING_A |      |
| 11   | 0430 | WRITING_A |      |
| 12   | 0447 | WRITING   |      |
| 13   | 0503 | WRITING   |      |
-------------------------------------------------------------
and the timestamps and commands are used to process the calculation for a, b, and c.
import re
count = {}
timestamps = {}
with open("page_stats.txt", "r") as f:
    for line in f:
        m = re.split(r"\s*\|\s*", line)
        if len(m) > 3 and re.match(r"\d+", m[1]):
            count[m[3]] = count[m[3]] + 1 if m[3] in count else 1
            if m[3] in timestamps:
                timestamps[m[3]].append(m[2])
            else:
                timestamps[m[3]] = [m[2]]
a = b = c = 0
for key in count:
    print("%-10s: %2d, %s" % (key, count[key], timestamps[key]))
if timestamps["ACTIVE"] > timestamps["PRECHARGE"]:  # line causing the logic error
    a = a + 1
print(a)
Before getting into the calculation, I assign the timestamps with respect to the commands. This is the output for this section.
ACTIVE : 3, ['0015', '0150', '0318']
WRITING : 4, ['0030', '0200', '0447', '0503']
WRITING_A : 3, ['0100', '0345', '0430']
PRECHARGE : 2, ['0115', '0314']
REFRESH : 1, ['0120']
To get a, the timestamp of an ACTIVE must be greater than that of a PRECHARGE, and a WRITING must be greater than that ACTIVE. (Lines 4, 6, and 7 contribute to the first a; lines 8, 9, and 12 to the second.)
To get b, the timestamp of a WRITING must be greater than that of an ACTIVE. Lines already used for a (4, 6, 7, 8, 9, and 12) cannot be reused, so lines 1 and 2 contribute to b.
To get c, the remaining unused lines containing WRITING contribute to c.
The expected output:
a = 2
b = 1
c = 1
However, in my code, when I print a, it displays 0, which shows the logic has an error. Any suggestions on how to amend my code to achieve this? I have tried for a few days and the problem is still not solved.
I made a function that returns the commands, in order, that match a pattern with gaps allowed. I also made a more compact version of your file reading.
There is probably a better way to divide the list into two parts; the constraint was to keep only elements that match the whole pattern. In this version I iterate over the elements twice.
import re

commands = list()
with open("page_stats.txt", "r") as f:
    for line in f:
        m = re.split(r"\s*\|\s*", line)
        if len(m) > 3 and re.match(r"\d+", m[1]):
            _, line, time, command, data, _ = m
            commands.append((line, time, command))

def search_pattern(pattern, iterable, key=None):
    iter = 0
    count = 0
    length = len(pattern)
    results = []
    sentinel = object()
    for elem in iterable:
        original_elem = elem
        if key is not None:
            elem = key(elem)
        if elem == pattern[iter]:
            iter += 1
            results.append((original_elem, sentinel))
            if iter >= length:
                iter = iter % length
                count += length
        else:
            results.append((sentinel, original_elem))
    matching = []
    nonmatching = []
    for res in results:
        first, second = res
        if count > 0:
            if second is sentinel:
                matching.append(first)
                count -= 1
            elif first is sentinel:
                nonmatching.append(second)
        else:
            value = first if second is sentinel else second
            nonmatching.append(value)
    return matching, nonmatching

pattern_a = ['PRECHARGE', 'ACTIVE', 'WRITING']
pattern_b = ['ACTIVE', 'WRITING']
pattern_c = ['WRITING']

matching, nonmatching = search_pattern(pattern_a, commands, key=lambda t: t[2])
a = len(matching) // len(pattern_a)
matching, nonmatching = search_pattern(pattern_b, nonmatching, key=lambda t: t[2])
b = len(matching) // len(pattern_b)
matching, nonmatching = search_pattern(pattern_c, nonmatching, key=lambda t: t[2])
c = len(matching) // len(pattern_c)
print(f'{a=}')
print(f'{b=}')
print(f'{c=}')
Output:
a=2
b=1
c=1
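The same counts can also be obtained with a simpler (if less general) multi-pass greedy matcher: repeatedly match the pattern as a subsequence of the remaining commands, remove the matched elements, and count how many full passes succeed. A sketch (count_pattern is a hypothetical helper, not from the answer above), using the command sequence from the question's table:

```python
def count_pattern(pattern, seq):
    """Count disjoint subsequence matches of `pattern` in `seq` (gaps allowed);
    return (match_count, leftover_elements)."""
    remaining = list(seq)
    count = 0
    while True:
        matched_idx, pi = [], 0
        for i, cmd in enumerate(remaining):
            if pi < len(pattern) and cmd == pattern[pi]:
                matched_idx.append(i)
                pi += 1
        if pi < len(pattern):            # no full match left
            return count, remaining
        count += 1
        for i in reversed(matched_idx):  # drop the matched elements
            del remaining[i]

commands = ['ACTIVE', 'WRITING', 'WRITING_A', 'PRECHARGE', 'REFRESH',
            'ACTIVE', 'WRITING', 'PRECHARGE', 'ACTIVE', 'WRITING_A',
            'WRITING_A', 'WRITING', 'WRITING']
a, rest = count_pattern(['PRECHARGE', 'ACTIVE', 'WRITING'], commands)
b, rest = count_pattern(['ACTIVE', 'WRITING'], rest)
c, rest = count_pattern(['WRITING'], rest)
# a, b, c == 2, 1, 1
```

This reproduces a=2, b=1, c=1 on the sample input, at the cost of rescanning the list once per match.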

Efficient way to select subset dataframe

I've 9.5M rows in a DataFrame of following form:
Id | X | Y | Pass_Fail_Status
-------+---+---+-----------------
w0 | 0 | 0 | Pass
w0 | 0 | 1 | Fail
...
w1 | 0 | 0 | Fail
...
w6000 | 45| 45| Pass
What is the most efficient way to select subset DataFrame for each "Id" and do processing with that?
As of now I'm doing following:
I already have the set of possible "Id"s from another DataFrame:
for id in uniqueIds:
    subsetDF = mainDF[mainDF["Id"] == id]
    predLabel = predict(subsetDF)
But this has a severe performance problem: there are about 6.7K possible ids, each repeated about 1.4K times. Profiling with cProfile does not point at this line directly, but I do see a scalar-op call that takes time and has exactly 6.7K calls.
EDIT2: The requirement for the subset DataFrame is that all rows have the same Id. For training or prediction, 'Id' itself is not important; what matters is the X,Y location and the pass/fail at that location.
The subsetDF should be of following form:
Id | X | Y | Pass_Fail_Status
-------+---+---+-----------------
w0 | 0 | 0 | Pass
w0 | 0 | 1 | Fail
...
w1399 | 0 | 0 | Fail
...
w1399 |45 |45 | Pass
Conclusion:
Winner: groupby
According to my experiments, the most efficient way to select a subset DataFrame for each "Id" and process it is the groupby method.
Code (Jupyter Lab):
# Preparation:
import pandas as pd
import numpy as np

# Create a sample dataframe
n = 6700 * 1400  # = 9380000
freq = 1400
mainDF = pd.DataFrame({
    'Id': ['w{:04d}'.format(i//freq) for i in range(n)],
    'X': np.random.randint(0, 46, n),
    'Y': np.random.randint(0, 46, n),
    'Pass_Fail_Status': [('Pass', 'Fail')[i] for i in np.random.randint(0, 2, n)]
})
uniqueIds = set(mainDF['Id'])

# Experiments:

# Experiment (a): apply a pandas mask (the OP's method)
def exp_a():
    for _id in uniqueIds:
        subsetDF = mainDF[mainDF['Id'] == _id]
print('Experiment (a):')
%timeit exp_a()

# Experiment (b): use set_index
def exp_b():
    df_b = mainDF.set_index('Id')
    for _id in uniqueIds:
        subsetDF = df_b.loc[_id]
print('Experiment (b):')
%timeit exp_b()

# Experiment (c): use groupby
def exp_c():
    for _, subsetDF in mainDF.groupby('Id'):
        pass
print('Experiment (c):')
%timeit exp_c()
Output:
Experiment (a): # apply pandas mask (the OP's method)
39min 46s ± 992 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Experiment (b): # use set_index
1.19 s ± 7.49 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Experiment (c): # use groupby
997 ms ± 2.8 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Sample dataframe:
+---------+-------+----+----+------------------+
|         | Id    | X  | Y  | Pass_Fail_Status |
+---------+-------+----+----+------------------+
| 0       | w0000 | 9  | 28 | Fail             |
| 1       | w0000 | 42 | 28 | Pass             |
| 2       | w0000 | 26 | 36 | Pass             |
| ...     | ...   | .. | .. | ...              |
| 9379997 | w6699 | 12 | 14 | Fail             |
| 9379998 | w6699 | 8  | 40 | Fail             |
| 9379999 | w6699 | 17 | 21 | Pass             |
+---------+-------+----+----+------------------+
IIUC, you could use groupby + sample to randomly sample a certain fraction of the original df to split into train and test DataFrames:
train = df.groupby('Id').sample(frac=0.7)
test = df[~df.index.isin(train.index)]
For example, in the sample you have in the OP, the above code produces:
train:
Id X Y Pass_Fail_Status
0 w0 0 0 Pass
2 w1 0 0 Fail
3 w6000 45 45 Pass
test:
Id X Y Pass_Fail_Status
1 w0 0 1 Fail
I have tried the 'groupby' suggestion from both 'enke' and 'quasi-human'. It improved the overall performance (including the other operations) by about 6x (I measured three times for each approach; the gain is based on the average). The for loop now looks like this:
for id, subsetDF in mainDF.groupby("Id", as_index=False):
    predLabel = predict(subsetDF)
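If the per-Id frames are needed repeatedly (for example across several passes), it can also pay to materialize the groups once into a dict, so each later lookup is a hash access instead of a 9.5M-row scan. A sketch, with a tiny made-up frame standing in for the real mainDF:

```python
import pandas as pd

# Tiny stand-in for the real 9.5M-row mainDF
mainDF = pd.DataFrame({
    'Id': ['w0', 'w0', 'w1', 'w1'],
    'X': [0, 0, 1, 2],
    'Y': [0, 1, 0, 0],
    'Pass_Fail_Status': ['Pass', 'Fail', 'Fail', 'Pass'],
})

# One groupby pass builds the Id -> sub-DataFrame mapping;
# later lookups are O(1) dict accesses.
subsets = {name: group for name, group in mainDF.groupby('Id')}
```

The trade-off is memory: the dict keeps every sub-frame alive at once, which may not be acceptable at 9.5M rows.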

Python: Find Average Y-Value for Each X-Value in [X,Y] Coordinates

Let's say I have a list of x,y coordinates like this:
coordinate_list = [(4,6),(2,5),(0,4),(-2,-2),(0,2),(0,0),(8,8),(8,11),(8,14)]
I want to find the average y-value associated with each x-value. So for instance, there's only one "2" x-value in the dataset, so the average y-value would be "5". However, there are three 8's and the average y-value would be 11 [ (8+11+14) / 3 ].
What would be the most efficient way to do this?
y_values_by_x = {}
for x, y in coordinate_list:
y_values_by_x.setdefault(x, []).append(y)
average_y_by_x = {k: sum(v)/len(v) for k, v in y_values_by_x.items()}
You can use pandas:
import pandas as pd
coordinate_list = [(4,6),(2,5),(0,4),(-2,-2),(0,2),(0,0),(8,8),(8,11),(8,14)]
df = pd.DataFrame(coordinate_list)
df.groupby([0]).mean()
| 0  | 1  |
| --- | --- |
| -2 | -2 |
| 0  | 2  |
| 2  | 5  |
| 4  | 6  |
| 8  | 11 |
Try the mean() function from the statistics module with a list comprehension:
from statistics import mean
x0_filter_value = 0 # can be any value of your choice for finding average
result = mean([x[1] for x in coordinate_list if x[0] == x0_filter_value])
print(result)
And to print means for all X[0] values:
for i in set([x[0] for x in coordinate_list]):
    print(i, mean([x[1] for x in coordinate_list if x[0] == i]))
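The two ideas combine into a compact grouped version using collections.defaultdict and statistics.mean; with the sample list this yields averages -2→-2, 0→2, 2→5, 4→6, 8→11:

```python
from collections import defaultdict
from statistics import mean

coordinate_list = [(4, 6), (2, 5), (0, 4), (-2, -2), (0, 2), (0, 0),
                   (8, 8), (8, 11), (8, 14)]

groups = defaultdict(list)   # x -> list of y values
for x, y in coordinate_list:
    groups[x].append(y)
averages = {x: mean(ys) for x, ys in groups.items()}
```

Unlike the filter-per-x loop above, this makes a single pass over the list instead of one pass per distinct x.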

Nearest neighbors in a given range

I faced the problem of quickly finding the nearest neighbors in a given range.
Example of dataset:
id | string | float
0 | AA | 0.1
12 | BB | 0.5
2 | CC | 0.3
102| AA | 1.1
33 | AA | 2.8
17 | AA | 0.5
For each line, print the number of lines satisfying both of the following conditions:
string field is equal to the current row's
float field is within del below the current float, i.e. current float - del <= float field < current float
For this example with del = 1.5:
id | count
0 | 0
12 | 0
2 | 0
102| 2 (same string as rows id=0,33,17, but only rows id=0,17 qualify: 1.1-1.5<=0.1 and 1.1-1.5<=0.5)
33 | 0 (same string as rows id=0,102,17, but 2.8-1.5 is greater than 0.1, 1.1 and 0.5)
17 | 1
To solve this problem, I used the BallTree class with a custom metric, but it runs for a very long time on a large dataset due to the reverse tree walk.
Can someone suggest other solutions or how you can increase the speed of custom metrics to the speed of the metrics from the sklearn.neighbors.DistanceMetric?
My code:
from sklearn.neighbors import BallTree

def distance(x, y):
    if x[0] == y[0] and x[1] > y[1]:
        return x[1] - y[1]
    else:
        return x[1] + y[1]

tree2 = BallTree(X, leaf_size=X.shape[0], metric=distance)
# note: `del` is a reserved word in Python, so the radius is named delta here
mas = tree2.query_radius(X, r=delta, count_only=True)
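One loop-light alternative (not the BallTree approach above) is to sort each string group once and count with binary search: for each row, the qualifying rows are exactly those in the half-open window [current - del, current), which np.searchsorted finds in O(log n) per row. A sketch on the question's sample data, with `delta` standing in for the reserved name `del`:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'id': [0, 12, 2, 102, 33, 17],
    'string': ['AA', 'BB', 'CC', 'AA', 'AA', 'AA'],
    'float': [0.1, 0.5, 0.3, 1.1, 2.8, 0.5],
})
delta = 1.5  # `del` is a Python keyword, so a different name is needed

counts = np.zeros(len(df), dtype=int)
for _, g in df.groupby('string'):
    vals = np.sort(g['float'].to_numpy())
    cur = g['float'].to_numpy()
    # rows j in the same group with  cur - delta <= f_j < cur
    lo = np.searchsorted(vals, cur - delta, side='left')
    hi = np.searchsorted(vals, cur, side='left')
    counts[g.index.to_numpy()] = hi - lo
df['count'] = counts
# df['count'] -> [0, 0, 0, 2, 0, 1]
```

This costs O(n log n) per group for the sort plus O(log n) per query, and matches the strict "other < current" behaviour of the custom metric (ties at the current value are not counted).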

Grouping floating point numbers

I have an application where I need to block average a list of data (currently in a pandas.DataFrame) according to a timestamp, which may be a floating point value. For example, I may need to average the following df into groups of 0.3 secs:
+------+------+ +------+------+
| secs | A | | secs | A |
+------+------+ +------+------+
| 0.1 | .. | | 0.3 | .. | <-- avg of 0.1, 0.2, 0.3
| 0.2 | .. | --> | 0.6 | .. | <-- avg of 0.4, 0.5, 0.6
| 0.3 | .. | | ... | ... | <-- etc
| 0.4 | .. | +------+------+
| 0.5 | .. |
| 0.6 | .. |
| ... | ... |
+------+------+
Currently I am using the following (minimal) solution:
import pandas as pd
import numpy as np
def block_avg(df: pd.DataFrame, duration: float) -> pd.DataFrame:
    grouping = (df['secs'] - df['secs'][0]) // duration
    df = df.groupby(grouping, as_index=False).mean()
    df['secs'] = duration * np.arange(1, 1 + len(df))
    return df
which works just fine for integer durations, but floating-point values at the edges of blocks can fall on the wrong side. A simple test that the blocks are being created properly is to average by the same duration that the data is already in (0.1 in this example). This should return the input, but often doesn't (e.g. x=.1*np.arange(1,20); (x-x[0])//.1).
I found that the error with this method is usually that the LSB is 1 low, so a tentative fix is to add np.spacing(df['secs']) to the numerator in the grouping. (That is, x=.1*np.arange(1,20); all( (x-x[0]+np.spacing(x)) // .1 == np.arange(19) ) returns True.)
However, I am concerned that this is not a robust solution. Is there a better or preferred way to group floats which passes the above test?
I have had similar issues with a (perhaps more straightforward) algorithm that groups using x[(duration*i < x) & (x <= duration*(i+1))], looping i over an appropriate range.
To be extra careful (of float inaccuracy) I'd round early before doing the groupby:
In [11]: np.round(299 + df.secs * 1000).astype(int) // 300
Out[11]:
0 1
1 1
2 1
3 2
4 2
5 2
Name: secs, dtype: int64
In [12]: (np.round(299 + df.secs * 1000).astype(int) // 300) * 0.3
Out[12]:
0 0.3
1 0.3
2 0.3
3 0.6
4 0.6
5 0.6
Name: secs, dtype: float64
In [13]: df.groupby(by=(np.round(299 + df.secs * 1000).astype(int) // 300) * 0.3)["A"].sum()
Out[13]:
secs
0.3 1.753843
0.6 2.687098
Name: A, dtype: float64
I would prefer to use a timedelta:
In [21]: s = pd.to_timedelta(np.round(df["secs"], 1), unit="S")
In [22]: df["secs"] = pd.to_timedelta(np.round(df["secs"], 1), unit="S")
In [23]: df.groupby(pd.Grouper(key="secs", freq="0.3S")).sum()
Out[23]:
A
secs
00:00:00 1.753843
00:00:00.300000 2.687098
or with a resample:
In [24]: res = df.set_index("secs").resample("300ms").sum()
In [25]: res
Out[25]:
A
secs
00:00:00 1.753843
00:00:00.300000 2.687098
You can set the index to correct the labelling:*
In [26]: res.index += np.timedelta64(300, "ms")
In [27]: res
Out[27]:
A
secs
00:00:00.300000 1.753843
00:00:00.600000 2.687098
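Folding the round-early idea back into the original block_avg gives one possible robust version: convert to integer milliseconds before the floor division, so an edge sample can no longer sit one float LSB below its block boundary. This is a sketch, and it assumes the timestamps are clean at millisecond resolution, like the example's 0.1 s grid:

```python
import numpy as np
import pandas as pd

def block_avg(df: pd.DataFrame, duration: float) -> pd.DataFrame:
    # Work in integer milliseconds so boundary samples can't fall one
    # float LSB below a block edge (assumes ms-resolution timestamps).
    ms = np.round(df['secs'].to_numpy() * 1000).astype(int)
    step = int(round(duration * 1000))
    grouping = (ms - ms[0]) // step
    out = df.groupby(grouping, as_index=False).mean()
    out['secs'] = duration * np.arange(1, 1 + len(out))
    return out

# The self-test from the question: averaging at the existing 0.1 s
# spacing should return the input unchanged.
x = 0.1 * np.arange(1, 20)
df = pd.DataFrame({'secs': x, 'A': x})
res = block_avg(df, 0.1)
```

This passes the round-trip test that the pure-float grouping fails, while keeping the original function's interface.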
* There ought to be a way to set that through a resample argument, but they don't seem to work...
