I'll explain with a simple example first, then go into the details.
Suppose I have a list of numbers:
t_original = [180,174,168,166,162,94,70,80,128,131,160,180]
If we graph this, the curve goes down from 180 to 70 and then climbs back up to 180.
But if the fourth value (166) is suddenly corrupted and becomes 700, the list turns into
t = [180,174,168,700,162,94,70,80,128,131,160,180]
which does not make sense on the graph.
I want to treat the fourth value (700) as a wrong value
and replace it with a plausible value; it doesn't have to be the original value, only something consistent with the two previous elements (168, 174).
I want to do the same across the whole list whenever another wrong value appears.
We can call that [filling gaps between a list of numbers].
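For example, a minimal sketch of this idea on the small list might look like the following. It assumes a value is treated as wrong when it exceeds 180, and replaces it with the linear extrapolation 2*prev1 - prev2 from the two previous elements (the same formula referenced in the code comments below):

def fill_gaps(values, max_valid=180):
    fixed = list(values[:2])               # keep the first two values as-is
    for x in values[2:]:
        if x > max_valid:                  # treat it as a wrong value
            x = 2*fixed[-1] - fixed[-2]    # extrapolate from the previous two elements
        fixed.append(x)
    return fixed

t = [180, 174, 168, 700, 162, 94, 70, 80, 128, 131, 160, 180]
print(fill_gaps(t))  # the 700 becomes 2*168 - 174 = 162, close to the original 166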
So I'm trying to apply the same idea to a bigger example.
The method I have tried:
I'll share my code and its output; "filtered" means the filling-gap function has been applied.
My code:
def preprocFN(*U):
    prePlst = []  # list after preprocessing
    # preprocessing formula: Fc = | 2*LF1 (previous by 1) - LF2 (previous by 2) |
    c0 = -2  # (previous) by 2
    c1 = -1  # (previous)
    c2 = 0   # (current)
    c3 = 1   # (next)
    preP = U[0]  # original list
    if c2 == 0:
        prePlst.append(preP[0])
        prePlst.append(preP[1])
        c1 += 2
        c2 += 2
        c0 += 2
    oldlen = len(preP)
    while oldlen > c2:
        Equ = abs(2*preP[c1] - preP[c0])  # preprocessing formula
        formatted_float = "{:.2f}".format(Equ)  # keep 2 decimal places
        equu = float(formatted_float)  # back from string to float
        prePlst.insert(c2, equu)  # insert the preprocessed value into the list
        c1 += 1
        c2 += 1
        c0 += 1
    return prePlst
with my input : https://textuploader.com/t1py9
the output will be : https://textuploader.com/t1pyk
and when printing the values higher than 180 (the wrong values):
result_list = [item for item in list if item > 180]
which doesn't make sense, since no human joint can pass an angle of 180 degrees.
the output was [183.6, 213.85, 221.62, 192.05, 203.39, 197.22, 188.45, 182.48, 180.41, 200.09, 200.67, 198.14, 199.44, 198.45, 200.55, 193.25, 204.19, 204.35, 200.59, 211.4, 180.51, 183.4, 217.91, 218.94, 213.79, 205.62, 221.35, 182.39, 180.62, 183.06, 180.78, 231.09, 227.33, 224.49, 237.02, 212.53, 207.0, 212.92, 182.28, 254.02, 232.49, 224.78, 193.92, 216.0, 184.82, 214.68, 182.04, 181.07, 234.68, 233.63, 182.84, 193.94, 226.8, 223.69, 222.77, 180.67, 184.72, 180.39, 183.99, 186.44, 233.35, 228.02, 195.31, 183.97, 185.26, 182.13, 207.09, 213.21, 238.41, 229.38, 181.57, 211.19, 180.05, 181.47, 199.69, 213.59, 191.99, 194.65, 190.75, 199.93, 221.43, 181.51, 181.42, 180.22]
So the filling-gaps function from the proposed method doesn't do its job.
Any suggestion for applying the same concept in a different way?
Extra info that may help:
The filtered graph is produced by the filling-gap function followed by a normalize function.
I don't think the problem is in the normalizing function, since the output of the filling-gaps function already looks wrong to me. Maybe I'm mistaken, but I'm providing the normalize steps anyway so you can see how the final filtered graph was made.
The function (my code):
def outLiersFN(*U):
    outliers = []  # list after preprocessing
    # preprocessing formula: Fc = | 2*LF1 (previous by 1) - LF2 (previous by 2) |
    c0 = -2  # (previous) by 2, from original
    c1 = -1  # (previous), from original
    c2 = 0   # (current), from original
    c3 = 1   # (next), from original
    preP = U[0]  # original list
    if c2 == 0:
        outliers.append(preP[0])
        c1 += 1
        c2 += 1
        c0 += 1
        c3 += 1
    oldlen = len(preP)
    M_RangeOfMotion = 90
    while oldlen > c2:
        if c3 == oldlen:
            outliers.insert(c2, preP[c2])  # preP[c2] is the last element of the old list
            break
        if (preP[c2] > M_RangeOfMotion and preP[c2] < (preP[c1] + preP[c3])/2) or (preP[c2] < M_RangeOfMotion and preP[c2] > (preP[c1] + preP[c3])/2):  # check paper 3.3.1
            Equ = (preP[c1] + preP[c3])/2  # replace the current frame with the neighbours' average
            formatted_float = "{:.2f}".format(Equ)  # keep 2 decimal places
            equu = float(formatted_float)  # back from string to float
            outliers.insert(c2, equu)  # insert the preprocessed value into the list
            c1 += 1
            c2 += 1
            c0 += 1
            c3 += 1
        else:
            Equ = preP[c2]  # keep the same element (do nothing)
            formatted_float = "{:.2f}".format(Equ)  # keep 2 decimal places
            equu = float(formatted_float)  # back from string to float
            outliers.insert(c2, equu)  # insert the preprocessed value into the list
            c1 += 1
            c2 += 1
            c0 += 1
            c3 += 1
    return outliers
I suggest the following algorithm:
A data point t[i] is considered an outlier if it deviates from the average of t[i-2], t[i-1], t[i], t[i+1], t[i+2] by more than the standard deviation of these 5 elements.
Outliers are replaced by the average of the two elements around them.
import matplotlib.pyplot as plt
from statistics import mean, stdev

t = [180,174,168,700,162,94,70,80,128,131,160,180]

def smooth(t):
    new_t = []
    for i, x in enumerate(t):
        neighbourhood = t[max(i-2, 0): i+3]
        m = mean(neighbourhood)
        s = stdev(neighbourhood, xbar=m)
        if abs(x - m) > s:
            # replace the outlier with the average of its two neighbours;
            # at the list edges the single available neighbour is used twice
            x = (t[i - 1 + (i == 0)*2] + t[i + 1 - (i + 1 == len(t))*2]) / 2
        new_t.append(x)
    return new_t

new_t = smooth(t)

plt.plot(t)
plt.plot(new_t)
plt.show()
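For the example list above, this should flag only the 700 (it deviates from the mean of its 5-element window by more than that window's standard deviation) and replace it with (168 + 162) / 2 = 165, leaving all the other values untouched.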
** I modified the entire question **
I have two example lists specified below. I want to find whether 2 values come from the same list, and I also want to know which list each value comes from.
list1 = ['a','b','c','d','e']
list2 = ['f','g','h','i','j']
c = 'b'
d = 'e'
I used a for loop to check whether the values exist in the lists, but I'm not sure how to find out which list a value actually comes from.
for x, y in zip(list1, list2):
    if c and d in x or y:
        print(True)
Please advise if there is any workaround.
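A minimal sketch of one possible approach (my suggestion, not from the original post): keep the candidate lists in a dictionary keyed by name, look each value up with a membership test, and compare the results. The sources mapping and find_source helper are made up for illustration:

list1 = ['a', 'b', 'c', 'd', 'e']
list2 = ['f', 'g', 'h', 'i', 'j']
c = 'b'
d = 'e'

sources = {'list1': list1, 'list2': list2}

def find_source(value):
    # return the name of the first list that contains the value, else None
    for name, lst in sources.items():
        if value in lst:
            return name
    return None

print(find_source(c), find_source(d))  # -> list1 list1
same_list = find_source(c) is not None and find_source(c) == find_source(d)
print(same_list)  # True: both 'b' and 'e' come from list1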
First, you might want to inspect the distribution of size values to see where you can improve the result with the least effort, like this:
df_inspect = df.copy()
df_inspect["size.value"] = df_inspect["size.value"].map(lambda x: ''.join(y.upper() for y in x if y.isalpha()))
df_inspect = df_inspect.groupby(["size.value"]).size().sort_values(ascending=False)
Then create a solution for the most frequently occurring size category, here "Wide":
long = "adasda, 9.5 W US"
short = "9.5 Wide"

def get_intersection(s1, s2):
    # longest substring of s1 that also appears in s2
    res = ''
    l_s1 = len(s1)
    for i in range(l_s1):
        for j in range(i + 1, l_s1 + 1):
            t = s1[i:j]
            if t in s2 and len(t) > len(res):
                res = t
    return res

print(len(get_intersection(long, short)) / len(short) >= 0.6)
Then apply the solution to the dataframe
df["defective_attributes"] = df.apply(lambda x: len(get_intersection(x["item_name.value"], x["size.value"])) / len(x["size.value"]) >= 0.6)
Basically, get_intersection searches for the longest common substring between the item name and the size value. The row is then considered not defective if that substring covers at least 60% of size.value, i.e. at least 60% of the size value also appears in the item name.
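For the example strings above, the longest common substring should be "9.5 W", so the ratio is 5 / 8 = 0.625 ≥ 0.6 and the check prints True.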
I know that Python loops themselves are relatively slow compared to other languages, but when the correct functions are used they become much faster.
I have a pandas dataframe called "acoustics" which contains over 10 million rows:
print(acoustics)
timestamp c0 rowIndex
0 2016-01-01T00:00:12.000Z 13931.500000 8158791
1 2016-01-01T00:00:30.000Z 14084.099609 8158792
2 2016-01-01T00:00:48.000Z 13603.400391 8158793
3 2016-01-01T00:01:06.000Z 13977.299805 8158794
4 2016-01-01T00:01:24.000Z 13611.000000 8158795
5 2016-01-01T00:02:18.000Z 13695.000000 8158796
6 2016-01-01T00:02:36.000Z 13809.400391 8158797
7 2016-01-01T00:02:54.000Z 13756.000000 8158798
and here is the code I wrote:
import numpy as np
import pandas as pd

acoustics = pd.read_csv("AccousticSandDetector.csv", skiprows=[1])
weights = [1/9, 1/18, 1/27, 1/36, 1/54]
sumWeights = np.sum(weights)
deltaAc = []
for i in range(5, len(acoustics)):
    time = acoustics.iloc[i]['timestamp']
    sum = 0
    for c in range(5):
        sum += (weights[c]/sumWeights)*(acoustics.iloc[i]['c0'] - acoustics.iloc[i-c]['c0'])
    print("Row " + str(i) + " of " + str(len(acoustics)) + " is iterated")
    deltaAc.append([time, sum])
deltaAc = pd.DataFrame(deltaAc)
It takes a huge amount of time, how can I make it faster?
You can use diff from pandas to create all the differences for each row in an array, then multiply by your weights and finally sum over axis 1, such as:
deltaAc = pd.DataFrame({'timestamp': acoustics.loc[5:, 'timestamp'],
                        'summation': (np.array([acoustics.c0.diff(i) for i in range(5)]).T[5:]
                                      * np.array(weights)).sum(1) / sumWeights})
and you get the same values as I get with your code:
print (deltaAc)
timestamp summation
5 2016-01-01T00:02:18.000Z -41.799986
6 2016-01-01T00:02:36.000Z 51.418728
7 2016-01-01T00:02:54.000Z -3.111184
First optimization: weights[c]/sumWeights could be computed outside the loop.
weights_array = np.array([1/9, 1/18, 1/27, 1/36, 1/54])
sumWeights = np.sum(weights_array)
tmp = weights_array / sumWeights
...
sum += tmp[c]*...
I'm not familiar with pandas, but if you can extract your columns as 1D numpy arrays it would help a lot. It might look something like:
# next lines to be tested, or find the correct way of extracting the columns
c0_column = acoustics['c0'].values
time_column = acoustics['timestamp'].values
...
sum = np.zeros(shape=(len(acoustics)-5,))
delta_ac = []
for c in range(5):
    sum += tmp[c]*(c0_column[5:] - c0_column[5-c:len(acoustics)-c])
for i in range(len(acoustics)-5):
    delta_ac.append([time_column[5+i], sum[i]])
Dataframes have a great method, rolling, for constructing and applying windowing transformations, so you don't need loops at all:
import numpy as np

# df is your data frame
window_size = 5
weights = np.array([1/9, 1/18, 1/27, 1/36, 1/54])
weights /= weights.sum()
df.loc[:, 'deltaAc'] = df.loc[:, 'c0'].rolling(window_size).apply(lambda x: ((x[-1] - x)*weights).sum())
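One caveat (my note, not from the original answer): on newer pandas versions you may need to pass raw=True to rolling(...).apply(...) so that the lambda receives a plain NumPy array and x[-1] is interpreted as positional indexing.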
I'm trying to generate a list of 8 numbers with the following code:
import numpy as np
import pandas as pd

n2 = np.random.uniform(0.1, 1.5)
c2 = np.random.uniform(4, 14)
c3 = np.random.uniform(0.1, 2.9)
ic4 = np.random.uniform(0.01, 1)
nc4 = np.random.uniform(0.01, 1)
ic5 = np.random.uniform(0, 0.2)
nc5 = np.random.uniform(0, 0.01)

comp_list = []
for i in range(1):
    if n2/c2 <= 0.11:
        comp_list.append(c2)
        comp_list.append(n2)
    if c3/c2 <= 0.26:
        comp_list.append(c3)
    if ic4/c3 <= 0.27:
        comp_list.append(ic4)
    if nc4/ic4 <= 1:
        comp_list.append(nc4)
    if ic5/nc4 <= 0.06:
        comp_list.append(ic5)
    if nc5/ic5 <= 0.25:
        comp_list.append(nc5)
    sum = n2+c2+c3+ic4+nc4+ic5+nc5
    c1 = 100-sum
    comp_list.append(c1)

df = pd.Series(comp_list)
print(df)
However, when I run the code, the number of values output is not consistent and can range from 3 to 5. For example, one run gives me:
0 1.560
1 0.251
2 0.008
3 86.665
a second run would give me:
0 12.929
1 1.015
2 2.126
3 0.093
4 0.0025
5 83.376
I have no idea why the output is not consistent.
Maybe I need to keep drawing from the random distributions until all the if statements are satisfied? Or am I missing something obvious?
You want to enclose all that in a while loop. Roll for all your numbers. If they all satisfy your conditions, create the necessary list. As long as one of the conditions isn't satisfied, all numbers will be rerolled.
The reason why you want to reroll all numbers is so that they are all independent of each other. For example, c2 is involved in both n2/c2 and c3/c2. If n2/c2 is satisfied and you keep those two numbers, the values of c3 you can have in order to satisfy the condition c3/c2 are constrained by what you have already chosen for c2.
import numpy as np

while True:
    n2 = np.random.uniform(0.1, 1.5)
    c2 = np.random.uniform(4, 14)
    c3 = np.random.uniform(0.1, 2.9)
    ic4 = np.random.uniform(0.01, 1)
    nc4 = np.random.uniform(0.01, 1)
    ic5 = np.random.uniform(0, 0.2)
    nc5 = np.random.uniform(0, 0.01)
    if (n2/c2 <= 0.11 and
        c3/c2 <= 0.26 and
        ic4/c3 <= 0.27 and
        nc4/ic4 <= 1 and
        ic5/nc4 <= 0.06 and
        nc5/ic5 <= 0.25):
        comp_list = [n2, c2, c3, ic4, nc4, ic5, nc5]
        comp_list.append(100 - sum(comp_list))
        break
Edit: In order to generate a list of such lists, iterate this code as many times as necessary and append comp_list each time.
big_list = []
for _ in range(10):
    # while stuff
    big_list.append(comp_list)
If this is something you expect to run in various locations in your code, you may as well put that in a function.
def generate_numbers():
    while True:
        ...
        return comp_list
And then you can do big_list = [generate_numbers() for _ in range(10)].
I am new to numpy. I have referred to the following SO question:
Why NumPy instead of Python lists?
The final comment in the above question seems to indicate that numpy is probably slower on a particular dataset.
I am working on a 1650*1650*1650 data set. These are essentially similarity values for each movie in the MovieLens data set along with the movie id.
My options are to either use a 3D numpy array or a nested dictionary. On a reduced data set of 100*100*100, the run times were not too different.
Please find the IPython code snippet below:
for id1 in range(1, count+1):
    data1 = df[df.movie_id == id1].set_index('user_id')[cols]
    sim_score = {}
    for id2 in range(1, count+1):
        if id1 != id2:
            data2 = df[df.movie_id == id2].set_index('user_id')[cols]
            sim = calculatePearsonCorrUnified(data1, data2)
        else:
            sim = 1
        sim_matrix_panel[id1]['Sim'][id2] = sim
import pdb
from math import sqrt
import numpy as np

def calculatePearsonCorrUnified(df1, df2):
    sim_score = 0
    common_movies_or_users = []
    for temp_id in df1.index:
        if temp_id in df2.index:
            common_movies_or_users.append(temp_id)
    #pdb.set_trace()
    n = len(common_movies_or_users)
    #print ('No. of common movies: ' + str(n))
    if n == 0:
        return sim_score
    # Ratings corresponding to user_1 / movie_1, present in the common list
    rating1 = df1.loc[df1.index.isin(common_movies_or_users)]['rating'].values
    # Ratings corresponding to user_2 / movie_2, present in the common list
    rating2 = df2.loc[df2.index.isin(common_movies_or_users)]['rating'].values
    sum1 = sum(rating1)
    sum2 = sum(rating2)
    # Sum up the squares
    sum1Sq = sum(np.square(rating1))
    sum2Sq = sum(np.square(rating2))
    # Sum up the products
    pSum = sum(np.multiply(rating1, rating2))
    # Calculate Pearson score
    num = pSum - (sum1*sum2/n)
    den = sqrt(float(sum1Sq - pow(sum1, 2)/n) * float(sum2Sq - pow(sum2, 2)/n))
    if den == 0:
        return 0
    sim_score = num/den
    return sim_score
What would be the best way to most precisely time the runtime with either of these options?
Any pointers would be greatly appreciated.
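A minimal sketch of how the two options could be timed (my suggestion, not from the original post), using the standard-library timeit module. build_numpy_matrix and build_dict_matrix are hypothetical placeholders for your two implementations; in IPython you could also simply prefix each call with the %timeit magic:

import timeit

def build_numpy_matrix():
    # placeholder: fill a 3D numpy array with the similarity scores
    ...

def build_dict_matrix():
    # placeholder: fill a nested dictionary with the similarity scores
    ...

# run each version a few times on the reduced data set and keep the best time;
# the minimum over the repeats is usually the most reproducible number
for name, fn in [("numpy array", build_numpy_matrix), ("nested dict", build_dict_matrix)]:
    best = min(timeit.repeat(fn, repeat=3, number=1))
    print(f"{name}: {best:.2f} s")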