Here's my data
id
123246512378
632746378456
378256364036
159204652855
327445634589
I want to produce data without the rows whose id contains two runs of three consecutive digits; ids like 123246512378 and 327445634589 (each contains two such runs) are removed, leaving:
id
632746378456
378256364036
159204652855
First, turn df.id into an array of single-digit integers.
a = np.array(list(map(list, map(str, df.id))), dtype=int)
Then check to see if one digit is one less than the next digit... twice
first = a[:, :-2] == a[:, 1:-1] - 1
second = a[:, 1:-1] == a[:, 2:] - 1
Create a mask for when we have this happen more than once
mask = np.count_nonzero(first & second, axis=1) < 2
df[mask]
id
1 632746378456
2 378256364036
3 159204652855
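Putting the pieces together, here is a self-contained run on the sample ids (a quick sketch; the DataFrame is rebuilt from the question's data):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'id': [123246512378, 632746378456, 378256364036,
                          159204652855, 327445634589]})

# digits of each id as a 2-D integer array
a = np.array(list(map(list, map(str, df.id))), dtype=int)

# a digit is one less than its successor... twice in a row
first = a[:, :-2] == a[:, 1:-1] - 1
second = a[:, 1:-1] == a[:, 2:] - 1

# keep rows with fewer than two runs of three consecutive digits
mask = np.count_nonzero(first & second, axis=1) < 2
print(df[mask])
```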
Not sure if this is faster than @piRSquared's, as I'm not good enough with pandas to generate my own test data, but it seems like it should be:
def mask_cons(df):
    a = np.array(list(map(list, df.id.astype(str))), dtype=float)
    # same as @piRSquared's, but float
    g_a = np.gradient(a, axis=1)[:, 1:-1]
    # the middle of 3 consecutive values gives grad(a) = +/-1
    # (caveat: the gradient only looks at the two outer digits, so
    # e.g. 2, 7, 4 also matches; this is an approximation)
    mask = (np.abs(g_a) == 1).sum(1) < 2  # keep rows with fewer than 2 runs
    # this assumes 4 consecutive values count as 2 instances of 3 consecutive values
    # otherwise more complicated methods are needed (probably @jit)
    return df[mask]
I am new to Python, coming from SciLab (an open source MatLab ersatz), which I am using as a toolbox for my analyses (test data analysis, reliability, acoustics, ...); I am definitely not a computer science lad.
I have data in the form of lists of same length (vectors of same size in SciLab).
I use some of them as parameter in order to select data from another one; e.g.
t_v = [1:10]; // a parameter vector
p_v = [20:29]; // another parameter vector
res_v(t_v > 5 & p_v < 28); // the elements of res_v whose "corresponding" p_v and t_v values meet my criteria; I can use them for analyses.
This is very direct and simple in SciLab; I have not found a way to achieve the same in Python, either "Pythonically" or by direct translation.
Any idea that could help me, please?
Have a nice day,
Patrick.
You could use numpy arrays. It's easy:
import numpy as np
par1 = np.array([1,1,5,5,5,1,1])
par2 = np.array([-1,1,1,-1,1,1,1])
data = np.array([1,2,3,4,5,6,7])
print(par1)
print(par2)
print(data)
bool_filter = (par1 > 1) & (par2 < 0)
# example of filtering directly on the array
filtered_data = data[par1 > 1]
print(filtered_data)
# filtering with the two parameters
filtered_data_twice = data[bool_filter]
print(filtered_data_twice)
output:
[1 1 5 5 5 1 1]
[-1 1 1 -1 1 1 1]
[1 2 3 4 5 6 7]
[3 4 5]
[4]
Note that it does not keep the same number of elements.
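If you do need the result to keep the original length (closer to some SciLab workflows), np.where can substitute a placeholder instead of dropping elements — a hypothetical variant of the example above:

```python
import numpy as np

par1 = np.array([1, 1, 5, 5, 5, 1, 1])
par2 = np.array([-1, 1, 1, -1, 1, 1, 1])
data = np.array([1, 2, 3, 4, 5, 6, 7])

# keep the original length, writing 0 where the condition fails
kept = np.where((par1 > 1) & (par2 < 0), data, 0)
print(kept)  # [0 0 0 4 0 0 0]
```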
Here's my modified solution according to your last comment.
t_v = list(range(1,10))
p_v = list(range(20,29))
res_v = list(range(30,39))
def first_index_greater_than(search_number, lst):
    for count, number in enumerate(lst):
        if number > search_number:
            return count

def first_index_lower_than(search_number, lst):
    for count, number in enumerate(lst[::-1]):
        if number < search_number:
            return len(lst) - count  # since I searched lst in reverse,
                                     # I need to flip the count back

t_v_index = first_index_greater_than(5, t_v)
p_v_index = first_index_lower_than(28, p_v)
print(res_v[min(t_v_index, p_v_index):max(t_v_index, p_v_index)])
It prints the list [35, 36, 37].
I'm sure you can optimize it better according to your needs.
The problem statement is not clearly defined, but this is what I interpret to be a likely solution.
import pandas as pd
tv = list(range(1, 11))
pv = list(range(20, 30))
res = list(range(30, 40))
df = pd.DataFrame({'tv': tv, 'pv': pv, 'res': res})
print(df)
def criteria(row, col1, a, col2, b):
    if (row[col1] > a) & (row[col2] < b):
        return True
    else:
        return False
df['select'] = df.apply(lambda row: criteria(row, 'tv', 5, 'pv', 28), axis=1)
selected_res = df.loc[df['select']]['res'].tolist()
print(selected_res)
# ... or another way ..
print(df.loc[(df.tv > 5) & (df.pv < 28)]['res'])
This builds a dataframe whose columns are the original lists, applies a selection criterion based on columns tv and pv, and stores the result in a new boolean column select that marks the rows where the criterion is satisfied.
[35, 36, 37]
5 35
6 36
7 37
I'm new in Python.
I have a 2D np.array (ex. 50 rows and 12 columns) and I need the mean of the 3rd column when the 1st column==x and the 9th column==y.
I can't figure out how to do it without using ifs...
Any help would be appreciated.
Let's assume your array is called arr. You want to apply two different filters: the first is 1st column == x, the second is 9th column == y. To begin with, you should create each filter (mask) separately and then decide what logical relation between them gives your expected output.
mask1 = arr[:, 0] == x # 1st column==x
mask2 = arr[:, 8] == y # 9th column==y
Now you can use OR, AND, or any other logical operator to create your final mask; in this case it's AND. For that, NumPy provides logical functions.
final_mask = np.logical_and(mask1, mask2)
And finally all you need is to filter your array based on the final_mask and perform the calculations you intended to do:
filtered_3rd_column = arr[final_mask, 2]
_mean = filtered_3rd_column.mean()
You can use np.where():
x = 1
y = 2
a[np.where((a[:, 0] == x) & (a[:, 8] == y))[0], 2].mean()
I solved the problem as follows (thanks to Kasrâmvd):
mask1 = arr[:, 0] == x # 1st column==x
mask2 = arr[:, 8] == y # 9th column==y
final_mask = np.logical_and(mask1, mask2)
filtered_arr = arr[final_mask,:]
mean_3rd_column = filtered_arr[:,2].mean()
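For reference, the whole thing on a small made-up array (the values and x, y are hypothetical; only the column positions match the question):

```python
import numpy as np

# hypothetical 2-D array: columns 0 and 8 hold the keys, column 2 the values
arr = np.array([
    [1., 0, 10, 0, 0, 0, 0, 0, 2],
    [1., 0, 20, 0, 0, 0, 0, 0, 2],
    [1., 0, 99, 0, 0, 0, 0, 0, 3],
    [4., 0, 50, 0, 0, 0, 0, 0, 2],
])
x, y = 1, 2

mask1 = arr[:, 0] == x   # 1st column == x
mask2 = arr[:, 8] == y   # 9th column == y
final_mask = np.logical_and(mask1, mask2)

# mean of the 3rd column over the rows matching both conditions
mean_3rd_column = arr[final_mask, 2].mean()
print(mean_3rd_column)  # 15.0
```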
The following is my script. Each equal part has self.number samples, in0 is input sample. There is an error as follows:
pn[i] = pn[i] + d
IndexError: list index out of range
Is this a problem with the size of pn? How can I define a list with a certain size but no exact values in it?
for i in range(0, len(in0) / self.number):
    pn = []
    m = i * self.number
    for d in in0[m: m + self.number]:
        pn[i] += d
    if pn[i] >= self.alpha:
        out[i] = 1
    elif pn[i] <= self.beta:
        out[i] = 0
    else:
        if pn[i] >= self.noise:
            out[i] = 1
        else:
            out[i] = 0
There are a number of problems in the code as posted, however, the gist seems to be something that you'd want to do with numpy arrays instead of iterating over lists.
For example, the set of if/else cases that check if pn[i] >= some_value and then sets a corresponding entry into another list with the result (true/false) could be done as a one-liner with an array operation much faster than iterating over lists.
import numpy as np
# for example, assuming you have 9 numbers in your list
# and you want them divided into 3 sublists of 3 values each
# in0 is your original list, which for example might be:
in0 = [1.05, -0.45, -0.63, 0.07, -0.71, 0.72, -0.12, -1.56, -1.92]
# convert into array
in2 = np.array(in0)
# reshape to 3 rows, the -1 means that numpy will figure out
# what the second dimension must be.
in2 = in2.reshape((3,-1))
print(in2)
output:
[[ 1.05 -0.45 -0.63]
[ 0.07 -0.71 0.72]
[-0.12 -1.56 -1.92]]
With this 2-d array structure, element-wise summing is super easy. So is element-wise threshold checking. Plus 'vectorizing' these operations has big speed advantages if you are working with large data.
# add corresponding entries, we want to add the columns together,
# as each row should correspond to your sub-lists.
pn = in2.sum(axis=0) # you can sum row-wise or column-wise, or all elements
print(pn)
output: [ 1. -2.72 -1.83]
# it is also trivial to check the threshold conditions
# here I check each entry in pn against a scalar
alpha = 0.0
out1 = ( pn >= alpha )
print(out1)
output: [ True False False]
# you can easily convert booleans to 1/0
x = out1.astype('int') # or simply out1 * 1
print(x)
output: [1 0 0]
# if you have a list of element-wise thresholds
beta = np.array([0.0, 0.5, -2.0])
out2 = (pn >= beta)
print(out2)
output: [True False True]
I hope this helps. Using the correct data structures for your task can make the analysis much easier and faster. There is a wealth of documentation on numpy, which is the standard numeric library for python.
You initialize pn to an empty list just inside the for loop, never assign anything into it, and then attempt to access an index i. There is nothing at index i because there is nothing at any index in pn yet.
for i in range(0, len(in0) / self.number):
    pn = []
    m = i * self.number
    for d in in0[m: m + self.number]:
        pn[i] += d
If you are trying to add the value d to the pn list, you should do this instead:
pn.append(d)
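For completeness, here is a hedged sketch of what the loop presumably intends: summing each chunk of self.number samples and thresholding the sum. The threshold values and the input data are placeholders, not from the question:

```python
# hypothetical thresholds and chunk size standing in for the self.* attributes
number, alpha, beta, noise = 3, 2.0, -2.0, 0.0
in0 = [1.05, -0.45, -0.63, 0.07, -0.71, 0.72, -0.12, -1.56, -1.92]

out = []
for i in range(len(in0) // number):        # integer division for Python 3
    chunk = in0[i * number:(i + 1) * number]
    pn = sum(chunk)                        # sum of this chunk
    if pn >= alpha:
        out.append(1)
    elif pn <= beta:
        out.append(0)
    else:
        out.append(1 if pn >= noise else 0)
print(out)
```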
I'm writing a few Python lines of code doing the following:
I have two arrays a and b; b contains (non-strictly) increasing integers.
I want to extract from a the values whose corresponding value in b is a multiple of 20, but without duplicates: if b contains ..., 40, 40, 41, ..., I only want the value of a corresponding to the first 40, not the second.
That's why a[b%20==0] does not work.
I've been using:
factors = [20*i for i in xrange(1,int(b[-1]/20 +1))]
sample = numpy.array([a[numpy.nonzero(b==factor)[0][0]] for factor in factors])
but it is both slow and fairly inelegant.
Is there a Pythonista 'cute' way of doing it?
a[(b % 20 == 0) & np.r_[True, np.diff(b) > 0]]
The b % 20 == 0 part gives a boolean mask selecting the elements of b that are multiples of 20. The np.r_[True, np.diff(b) > 0] part creates a boolean mask selecting only the elements that differ from the previous one (we explicitly prepend True, as the first element has no predecessor). AND the two masks together and voilà!
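A quick check on made-up arrays (b non-strictly increasing, with a duplicated 40):

```python
import numpy as np

b = np.array([10, 20, 39, 40, 40, 41, 60])
a = np.array([0, 1, 2, 3, 4, 5, 6])

# multiples of 20, first occurrence only
result = a[(b % 20 == 0) & np.r_[True, np.diff(b) > 0]]
print(result)  # [1 3 6] -- the second 40 is skipped
```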
Let's say we create a boolean array which marks the unique values in b:
c = np.zeros(b.shape, dtype=bool)
c[np.unique(b, return_index=True)[1]] = True
Now you can do:
a[np.logical_and(b % 20 == 0, c)]
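The same made-up arrays can sanity-check this variant:

```python
import numpy as np

b = np.array([10, 20, 39, 40, 40, 41, 60])
a = np.array([0, 1, 2, 3, 4, 5, 6])

# True at the first occurrence of each value in b
c = np.zeros(b.shape, dtype=bool)
c[np.unique(b, return_index=True)[1]] = True

result = a[np.logical_and(b % 20 == 0, c)]
print(result)  # [1 3 6]
```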
If your b is sorted, using diff should be a bit faster than using unique:
import numpy
a = numpy.random.randint(0, 1000, 1000)
b = numpy.random.randint(0, 1000, 1000)
b.sort()
# keep only the first occurrence of each value in b
subset = a[numpy.r_[True, numpy.diff(b) != 0] & (b % 20 == 0)]
I am trying to group on two columns to get an aggregated value and then test that value to see if it is greater or smaller than a threshold. What I have:
SEGMENT = df.groupby(['Col_1','Col_2'])['Number'].apply(lambda x: '1_5' if sum(x) < 6 else '6+')
It is slow. Is there a fundamental error in this approach? Thanks.
Edit:
SEGMENT = df.groupby(['Col_1','Col_2'])['Number'].sum().apply(lambda x: '1_5' if x < 6 else '6+')
This speeds it up 3x.
You can do a transform and use it as a boolean mask:
g = df.groupby(['Col_1','Col_2'])
mask = g["Number"].transform("sum") < 6
df[mask] # with group sum smaller than 6
df[~mask] # with group sum greater or equal 6
You can also use filter:
g.filter(lambda x: x['Number'].sum() >= 6)
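A small demonstration on a made-up frame (column names follow the question; the data itself is hypothetical):

```python
import pandas as pd

df = pd.DataFrame({
    'Col_1': ['a', 'a', 'b', 'b'],
    'Col_2': ['x', 'x', 'y', 'y'],
    'Number': [2, 3, 4, 5],
})

g = df.groupby(['Col_1', 'Col_2'])
mask = g['Number'].transform('sum') < 6   # group sums: (a, x) -> 5, (b, y) -> 9

small = df[mask]                                      # rows whose group sum is < 6
large = g.filter(lambda x: x['Number'].sum() >= 6)    # rows whose group sum is >= 6
print(small)
print(large)
```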