Data Preprocessing Using Zip Pandas - python

Having done the clustering:
from sklearn.cluster import AffinityPropagation
import numpy as np

final_n_clusters = []
preference = np.arange(-20, 1, 0.3)  # preference values
iter_value = np.arange(1, 20, 1)     # maximum number of iterations
for k in preference:
    n_cluster_list = []
    for j in iter_value:
        af = AffinityPropagation(preference=k, max_iter=j, affinity='precomputed').fit(X)
        labels = af.labels_
        n_clusters = len(np.unique(labels))
        n_cluster_list.append(n_clusters)
    final_n_clusters.append(n_cluster_list)
And for final_n_clusters I get:
[[1, 97, 97, 97, 97, 1, 1, 97, 97, 97, 97, 97, 1, 97, 97, 97, 21, 1, 97,
...
[1, 30, 37, 5, 45, 33, 13, 8, 7, 7, 7, 8, 7, 7, 7, 7, 7, 7, 7]]
That is: each row corresponds to one value of "preference", starting from -20, and each number within a row corresponds to a different value of "iter_value", starting from 1.
My question is:
Can I get a data frame like the one below by applying zip, or some other method? I already have the cluster counts in final_n_clusters:
preference    iter_value    number_of_cluster
-20           1             1     # as you can see, the cluster counts come from `final_n_clusters`
-20           2             97
...           ...           ...
-20           3             1
-20           4             97

Use enumerate in a list comprehension, iterating preference in the outer loop so the rows come out in the order shown above:
data = [(kv, jv, final_n_clusters[ki][ji]) for ki, kv in enumerate(preference) for ji, jv in enumerate(iter_value)]
df = pd.DataFrame(data, columns=['preference', 'iter_value', 'number_of_cluster'])
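Equivalently, you could build the same rows with itertools.product; this is just an alternative sketch, assuming preference, iter_value and final_n_clusters as defined above:
from itertools import product
import pandas as pd

rows = [(kv, jv, final_n_clusters[ki][ji])
        for (ki, kv), (ji, jv) in product(enumerate(preference), enumerate(iter_value))]
df = pd.DataFrame(rows, columns=['preference', 'iter_value', 'number_of_cluster'])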

Related

Generate 3 random lists and create another one with the sum of their elements

I want to create an NxN matrix (represented as a list of lists), where the first N-1 columns have random numbers in the range 1 to 10, and the last column contains the result of adding the numbers in the previous columns.
import random

randomlist1 = []
for i in range(1, 10):
    n = random.randint(1, 100)
    randomlist1.append(n)
print(randomlist1)

randomlist2 = []
for i in range(1, 10):
    n = random.randint(1, 100)
    randomlist2.append(n)
print(randomlist2)

randomlist3 = []
for i in range(1, 10):
    n = random.randint(1, 100)
    randomlist3.append(n)
print(randomlist3)

# I have problems here
lists_of_lists = [sum(x) for x in (randomlist1, randomlist2, randomlist3)]
[sum(x) for x in zip(*lists_of_lists)]
print(lists_of_lists)
Your question calls for a few comments:
the title does not correspond to the question, and the code matches the title, not the question;
the rows randomlist1, randomlist2, randomlist3 are not in a matrix;
the final value is not a square matrix;
You write "the columns have random numbers in the range of 1 to 10" but your code randint(1,100) creates numbers in the range [1..100].
Solution to the question
import random

N = 5
# create an N by N-1 matrix of random integers
matrix = [[random.randint(1, 10) for j in range(N - 1)] for i in range(N)]
print(f"{N} by {N-1} matrix:\n{matrix}")
# add a column as the sum of the previous ones
for line in matrix:
    line.append(sum(line))
print(f"{N} by {N} matrix with the last column as sum of the previous ones:\n{matrix}")
Output:
5 by 4 matrix:
[[7, 10, 5, 6], [4, 10, 9, 3], [5, 5, 4, 9], [10, 7, 2, 4], [8, 8, 5, 3]]
5 by 5 matrix with the last column as sum of the previous ones:
[[7, 10, 5, 6, 28], [4, 10, 9, 3, 26], [5, 5, 4, 9, 23], [10, 7, 2, 4, 23], [8, 8, 5, 3, 24]]
IIUC, try it with numpy:
import numpy as np
np.random.seed(1) # just for demo purposes
# list comprehension to create your 3 lists inside a list
lsts = [np.random.randint(1,100, 10).tolist() for i in range(3)]
np.sum(lsts, axis=0)
# array([145, 100, 131, 105, 215, 115, 194, 247, 116, 45])
lsts
[[38, 13, 73, 10, 76, 6, 80, 65, 17, 2],
[77, 72, 7, 26, 51, 21, 19, 85, 12, 29],
[30, 15, 51, 69, 88, 88, 95, 97, 87, 14]]
Based on @It_is_Chris's answer, I propose a numpy-only implementation, without using lists:
np.random.seed(1)
final_shape = (3, 10)
lsts = np.random.randint(1, 100, np.prod(final_shape)).reshape(final_shape)
lstsum = np.sum(lsts, axis=0)
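To also cover the last-column requirement from the original question, a rough numpy-only sketch (the variable names are mine) could append the row sums like this:
import numpy as np

N = 5
block = np.random.randint(1, 11, size=(N, N - 1))   # N x (N-1) values in the range 1..10
full = np.column_stack([block, block.sum(axis=1)])  # last column = sum of each row
print(full)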

How to generate random numbers with if-statement in Python?

I would like to generate random numbers with a specific restriction using python. The code should do the following:
If an entered number is:
0, then generate 0 random non-recurrent numbers
<1, then generate 1 random non-recurrent numbers
<9, then generate 2 random non-recurrent numbers
<15, then generate 3 random non-recurrent numbers
<26, then generate 5 random non-recurrent numbers
<51, then generate 8 random non-recurrent numbers
<91, then generate 13 random non-recurrent numbers
<151, then generate 20 random non-recurrent numbers
<281, then generate 32 random non-recurrent numbers
The value of the random numbers should be limited by the entered number. So if 75 is entered, the code should generate 13 random numbers with 75 as the maximum possible value; 75 doesn't have to actually appear, it is just the upper bound.
My guess was to use numpy. Here is what I got so far (with a user's help):
num_files = [0, 1, 9, ...]
num_nums = [0, 1, 2, 3, 5, ...]
for zipp in zip(num_files, num_nums):
    if len(docx_files) < zipp[0]:
        list_of_rands = np.random.choice(len(docx_files) + 1,
                                         zipp[1], replace=False)
Any ideas or more starting points?
Here's one way of doing it. Just zip the lists of numbers and the cutoffs, and check if the number input (the variable number in the code below) is above the cutoff. Note that this doesn't handle the case of numbers larger than 281, since I'm not sure what's supposed to happen there based on your description.
import numpy as np

number = 134
parameters = zip([9, 15, 26, 51, 91, 151], [3, 5, 8, 13, 20, 32])
nums = 2
for item in parameters:
    if number > item[0]:
        nums = item[1]
np.random.choice(number, nums)
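If the drawn numbers must also be distinct (the question asks for non-recurrent numbers), one option, sketched here as an assumption about the intended behaviour, is to pass replace=False:
import numpy as np

number = 134
nums = 20  # as determined by the loop above for number = 134
unique_draw = np.random.choice(number, nums, replace=False) + 1  # 20 distinct values in 1..134
print(unique_draw)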
You could define a function using a dictionary with ranges as keys and number of random numbers as values:
import random

def rand_nums(input_num):
    d = {26: 5, 51: 8, 91: 13}
    for k, v in d.items():
        if input_num in range(k):
            nums = random.sample(range(k + 1), v)
            return nums

print(rand_nums(20))
print(rand_nums(50))
print(rand_nums(88))
[14, 23, 11, 9, 5]
[9, 49, 23, 16, 8, 50, 47, 33]
[20, 16, 28, 77, 21, 87, 85, 82, 10, 47, 43, 90, 57]
You can avoid a many-branched if-elif-else using np.searchsorted:
import numpy as np

def generate(x):
    boundaries = np.array([1, 2, 9, 15, 26, 51, 91, 151, 281])
    numbers = np.array([0, 1, 2, 3, 5, 8, 13, 20, 32])
    return [np.random.choice(j, n, False) + 1 if j else np.array([], np.int64)
            for j, n in np.broadcast(x, numbers[boundaries.searchsorted(x, 'right')])]
# demo
from pprint import pprint
# single value
pprint(generate(17))
# multiple values in one go
pprint(generate([19, 75, 3, 1, 2, 0, 8, 9]))
# interactive
i = int(input('Enter number: '))
pprint(generate(i))
Sample output:
[array([ 9, 1, 14, 4, 12])]
[array([ 8, 12, 6, 17, 4]),
array([17, 29, 2, 20, 16, 37, 36, 13, 34, 58, 49, 72, 41]),
array([1, 3]),
array([1]),
array([2, 1]),
array([], dtype=int64),
array([1, 8]),
array([3, 2, 6])]
Enter number: 280
[array([184, 73, 80, 280, 254, 164, 192, 145, 176, 29, 58, 251, 37,
107, 5, 51, 7, 128, 142, 125, 135, 87, 259, 83, 260, 10,
108, 210, 8, 36, 181, 64])]
How about:
def gen_rand_array(n):
    mapping = np.array([[1, 1],
                        [26, 5],
                        [51, 8],
                        [91, 13]])
    k = mapping[np.max(np.where(n > mapping[:, 0])), 1]
    return np.random.choice(n + 1, k)
Example:
>>> gen_rand_array(27)
array([ 0, 21, 26, 25, 23])
>>> gen_rand_array(27)
array([21, 5, 10, 3, 13])
>>> gen_rand_array(57)
array([30, 26, 50, 31, 44, 51, 39, 13])
>>> gen_rand_array(57)
array([21, 18, 35, 8, 13, 13, 20, 3])
Explanation:
The line k = mapping[np.max(np.where(n > mapping[:,0])),1] just finds the number of random values needed from the array mapping. n > mapping[:,0] returns a boolean array whose values are True for all thresholds smaller than n, and False otherwise. np.where(...) returns the indexes of the elements of that array that are True. Since the values in the first column of mapping (i.e. mapping[:,0]) are ascending, we can find the index of the largest one that is less than n by calling np.max(...). Finally, we want the corresponding value from the second column, which is why we pass the result as an index into mapping again, i.e. mapping[...,1], where the 1 selects the second column.
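To see the intermediate steps, here is a small walk-through for one concrete input, say n = 57:
import numpy as np

mapping = np.array([[1, 1], [26, 5], [51, 8], [91, 13]])
n = 57
print(n > mapping[:, 0])                    # [ True  True  True False]
print(np.where(n > mapping[:, 0]))          # (array([0, 1, 2]),)
print(np.max(np.where(n > mapping[:, 0])))  # 2, index of the largest threshold below n
print(mapping[2, 1])                        # 8, the number of random values to draw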
I don't know how to implement it in your code, but this is how you could then get the random numbers:
import random

x = 51
if x < 26:
    ar_Random = [None] * 5
    for i in range(5):
        ar_Random[i] = random.randint(startNumOfRandom, stopNumOfRandom)
elif x < 51:
    ar_Random = [None] * 8
    for i in range(8):
        ar_Random[i] = random.randint(startNumOfRandom, stopNumOfRandom)
...
I'm not sure how you're mapping the length to the input, but this is how you generate N random numbers with a maximum using Numpy:
import numpy as np

# set entered_num and desired_length to whatever you want
random_nums = np.random.randint(entered_num, size=desired_length)
import random

Starting_Number = int(input())
if Starting_Number < 26:
    print(random.sample(range(1, 26), 5))
elif Starting_Number < 51:
    print(random.sample(range(1, 51), 8))
elif Starting_Number < 91:
    print(random.sample(range(1, 91), 13))
Here you go!
random.sample is the function you are looking for.
Have a good one!

Repeat calculations over repeated blocks of 5 rows within numpy

I have an array, of which this is a small sample. It repeats each measurement 5 times, and I want to collate these blocks of five into a new array, where each block of five rows becomes one row giving the mean, median and standard deviation of the five initial rows.
data =
[[1, 9, 66, 74, -0.274035]
[1, 9, 66, 74, -0.269245]
[1, 9, 66, 74, -0.271161]
[1, 9, 66, 74, -0.269245]
[1, 9, 66, 74, -0.266370]
[2, 10, 65, 73, 0.085277]
[2, 10, 65, 73, 0.086235]
[2, 10, 65, 73, 0.090068]
[2, 10, 65, 73, 0.087193]
[2, 10, 65, 73, 0.085277]
What I would like to do is keep the values of the first 4 columns in each block, then find the mean, median and standard deviation of the last column, working iteratively over blocks of five rows.
data2 =
[[1, 9, 66, 74, mean[0:5,4], median[0:5,4], std[0:5,4]]
[2, 10, 65, 73, mean[5:10,4], median[5:10,4], std[5:10,4]]]
or in numerical terms:
[[1, 9, 66, 74, -0.270011, -0.269245, 0.002528]
[2, 10, 65, 73, 0.08681, 0.086235, 0.001777]]
I've tried this, but just get zeroes as output:
index.shape
Out[119]: (10,)

repeat = 5
a = 0
b = repeat
length = int((len(index) - repeat) / repeat)
meanVre = np.zeros(length)
for _ in range(length):
    np.append(meanVre, np.mean(data[a:b, 5]))
    a = a + 5
    b = b + 5
(repeat is used as a variable rather than a literal 5, as the number of rows in a block is liable to change at a later date.)
Any help you can give would be really appreciated.
import numpy as np

def block_stats(data, blocksize=5):
    inputs = data[::blocksize, :4]                 # first 4 columns, one row per block
    data_stat = data[:, 4].reshape(-1, blocksize)  # 5th column, one row per block
    means = np.mean(data_stat, axis=1, keepdims=True)
    medians = np.median(data_stat, axis=1, keepdims=True)
    stds = np.std(data_stat, axis=1, keepdims=True)
    # stack the pieces side by side: [first 4 columns, mean, median, std]
    return np.hstack([inputs, means, medians, stds])
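A quick sanity check with the sample data from the question (using the block_stats function above, with the statistics stacked as extra columns):
import numpy as np

data = np.array([[1,  9, 66, 74, -0.274035],
                 [1,  9, 66, 74, -0.269245],
                 [1,  9, 66, 74, -0.271161],
                 [1,  9, 66, 74, -0.269245],
                 [1,  9, 66, 74, -0.266370],
                 [2, 10, 65, 73,  0.085277],
                 [2, 10, 65, 73,  0.086235],
                 [2, 10, 65, 73,  0.090068],
                 [2, 10, 65, 73,  0.087193],
                 [2, 10, 65, 73,  0.085277]])
print(block_stats(data))
# roughly: [[ 1.   9.  66.  74.  -0.270011 -0.269245  0.002528]
#           [ 2.  10.  65.  73.   0.086810  0.086235  0.001777]]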

loop for computing average of selected data in dataframe using pandas

I have a 3 row x 96 column dataframe. I'm trying to compute the average of the two rows beneath the index (row1:96) for every 12 data points. Here is my dataframe:
Run 1 Run 2 Run 3 Run 4 Run 5 Run 6 \
0 1461274.92 1458079.44 1456807.1 1459216.08 1458643.24 1457145.19
1 478167.44 479528.72 480316.08 475569.52 472989.01 476054.89
2 ------ ------ ------ ------ ------ ------
Run 7 Run 8 Run 9 Run 10 ... Run 87 \
0 1458117.08 1455184.82 1455768.69 1454738.07 ... 1441822.45
1 473630.89 476282.93 475530.87 474200.22 ... 468525.2
2 ------ ------ ------ ------ ... ------
Run 88 Run 89 Run 90 Run 91 Run 92 Run 93 \
0 1445339.53 1461050.97 1446849.43 1438870.43 1431275.76 1430781.28
1 460076.8 473263.06 455885.07 475245.64 483875.35 487065.25
2 ------ ------ ------ ------ ------ ------
Run 94 Run 95 Run 96
0 1436007.32 1435238.23 1444300.51
1 474328.87 475789.12 458681.11
2 ------ ------ ------
[3 rows x 96 columns]
Currently I am trying to use df.irow(0) to select all the data in row index 0.
something along the lines of:
selection = np.arange(0, 13)
for i in selection:
    new_df = pd.DataFrame()
    data = df.irow(0)
    ........
then I get lost.
I just don't know how to link this range with the dataframe in order to compute the mean for every 12 data points in each column.
To summarize, I want the average for every 12 runs in each column, so I should end up with a separate dataframe with 2 x 8 average values (96/12).
any ideas?
thanks.
You can do a groupby on axis=1 (using some dummy data I made up):
>>> h = df.iloc[:2].astype(float)
>>> h.groupby(np.arange(len(h.columns))//12, axis=1).mean()
0 1 2 3 4 5 6 7
0 0.609643 0.452047 0.536786 0.377845 0.544321 0.214615 0.541185 0.544462
1 0.382945 0.596034 0.659157 0.437576 0.490161 0.435382 0.476376 0.423039
First we extract the data and force recognition of a float (the presence of the ------ row means that you've probably got an object dtype, which will make the mean unhappy.)
Then we make an array saying what groups we want to put the different columns in:
>>> np.arange(len(df.columns))//12
array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3,
3, 3, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 5, 5, 5, 5, 5, 5, 5, 5, 5,
5, 5, 5, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 7, 7, 7, 7, 7, 7, 7, 7,
7, 7, 7, 7], dtype=int32)
which we feed as an argument to groupby. .mean() handles the rest.
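For completeness, here is a self-contained sketch of the same grouping idea on made-up data (column names and values are assumed); transposing first avoids relying on the axis=1 keyword:
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.rand(2, 96), columns=[f"Run {i}" for i in range(1, 97)])
groups = np.arange(len(df.columns)) // 12  # 0,0,...,0, 1,1,...,1, ..., 7,7,...,7
means = df.T.groupby(groups).mean().T      # 2 rows x 8 block averages
print(means)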
It's always best to try to use pandas methods when you can, rather than iterating over the rows. The DataFrame's iloc method is useful for extracting any number of rows.
The following example shows you how to do what you want in a two-column DataFrame. The same technique will work independent of the number of columns:
In [14]: df = pd.DataFrame({"x": [1, 2, "-"], "y": [3, 4, "-"]})
In [15]: df
Out[15]:
x y
0 1 3
1 2 4
2 - -
In [16]: df.iloc[2] = df.iloc[0:2].sum()
In [17]: df
Out[17]:
x y
0 1 3
1 2 4
2 3 7
However, in your case you want to sum each group of eight cells in df.iloc[2], so you might do better simply taking the result of the summing expression with the statement
ds = df.iloc[0:2].sum()
which with your data will have the form
col1 0
col2 1
col3 2
col4 3
...
col93 92
col94 93
col95 94
col96 95
(These numbers are representative, you will obviously see your column sums). You can then turn this into a 12x8 matrix with
ds.values.reshape(12, 8)
whose value is
array([[ 0, 1, 2, 3, 4, 5, 6, 7],
[ 8, 9, 10, 11, 12, 13, 14, 15],
[16, 17, 18, 19, 20, 21, 22, 23],
[24, 25, 26, 27, 28, 29, 30, 31],
[32, 33, 34, 35, 36, 37, 38, 39],
[40, 41, 42, 43, 44, 45, 46, 47],
[48, 49, 50, 51, 52, 53, 54, 55],
[56, 57, 58, 59, 60, 61, 62, 63],
[64, 65, 66, 67, 68, 69, 70, 71],
[72, 73, 74, 75, 76, 77, 78, 79],
[80, 81, 82, 83, 84, 85, 86, 87],
[88, 89, 90, 91, 92, 93, 94, 95]])
but summing this array will give you the sum of all elements, so instead create another DataFrame with
rs = pd.DataFrame(ds.values.reshape(12, 8))
and then sum that:
rs.sum()
giving
0 528
1 540
2 552
3 564
4 576
5 588
6 600
7 612
dtype: int64
You may find in practice that it is easier to simply create two 12x8 matrices in the first place, which you can add together before creating a dataframe which you can then sum. Much depends on how you are reading your data.
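If you prefer to stay in numpy, a rough sketch (shapes and column names assumed) that takes the mean of 12 consecutive runs per row could look like this:
import numpy as np
import pandas as pd

# hypothetical stand-in for the two numeric rows of the 3 x 96 frame
df = pd.DataFrame(np.random.rand(2, 96), columns=[f"Run {i}" for i in range(1, 97)])

vals = df.to_numpy()                               # shape (2, 96)
block_means = vals.reshape(2, 8, 12).mean(axis=2)  # 8 consecutive blocks of 12 runs per row
print(pd.DataFrame(block_means))                   # 2 rows x 8 block averages
Note that this groups 12 consecutive columns together, whereas reshaping the column sums to (12, 8) and summing down the columns groups every 8th value; which bucketing you want depends on how the runs are ordered.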

Numpy: find index of the elements within range

I have a numpy array of numbers, for example,
a = np.array([1, 3, 5, 6, 9, 10, 14, 15, 56])
I would like to find all the indexes of the elements within a specific range. For instance, if the range is (6, 10), the answer should be (3, 4, 5). Is there a built-in function to do this?
You can use np.where to get indices and np.logical_and to set two conditions:
import numpy as np
a = np.array([1, 3, 5, 6, 9, 10, 14, 15, 56])
np.where(np.logical_and(a>=6, a<=10))
# returns (array([3, 4, 5]),)
As in #deinonychusaur's reply, but even more compact:
In [7]: np.where((a >= 6) & (a <=10))
Out[7]: (array([3, 4, 5]),)
Summary of the answers
To understand which is the best answer, we can do some timing using the different solutions.
Unfortunately, the question was not well posed, so there are answers to different questions; here I try to point all the answers at the same question. Given the array:
a = np.array([1, 3, 5, 6, 9, 10, 14, 15, 56])
The answer should be the indexes of the elements within a certain range, assumed inclusive; in this case, 6 and 10.
answer = (3, 4, 5)
Corresponding to the values 6,9,10.
To test the best answer we can use this code.
import timeit

setup = """
import numpy as np
import numexpr as ne

a = np.array([1, 3, 5, 6, 9, 10, 14, 15, 56])
# or test it with an array of a similar size
# a = np.random.rand(100)*23  # change the number to an estimate of your array size

# we define the left and right limits
ll = 6
rl = 10

def sorted_slice(a, l, r):
    start = np.searchsorted(a, l, 'left')
    end = np.searchsorted(a, r, 'right')
    return np.arange(start, end)
"""

functions = ['sorted_slice(a,ll,rl)',  # works only for sorted values
             'np.where(np.logical_and(a>=ll, a<=rl))[0]',
             'np.where((a >= ll) & (a <=rl))[0]',
             'np.where((a>=ll)*(a<=rl))[0]',
             'np.where(np.vectorize(lambda x: ll <= x <= rl)(a))[0]',
             'np.argwhere((a>=ll) & (a<=rl)).T[0]',  # we transpose to get a single row
             'np.where(ne.evaluate("(ll <= a) & (a <= rl)"))[0]']

functions2 = ['a[np.logical_and(a>=ll, a<=rl)]',
              'a[(a>=ll) & (a<=rl)]',
              'a[(a>=ll)*(a<=rl)]',
              'a[np.vectorize(lambda x: ll <= x <= rl)(a)]',
              'a[ne.evaluate("(ll <= a) & (a <= rl)")]']

rdict = {}
for i in functions:
    rdict[i] = timeit.timeit(i, setup=setup, number=1000)
    print("%s -> %s s" % (i, rdict[i]))

print("Sorted:")
for w in sorted(rdict, key=rdict.get):
    print(w, rdict[w])
Results
The results for a small array are reported in the plot below, with the fastest solution at the top. As noted by @EZLearner, they may vary depending on the size of the array: sorted_slice could be faster for larger arrays, but it requires your array to be sorted, and for arrays with over 10 M entries ne.evaluate could be an option. It is hence always better to perform this test with an array of the same size as yours.
If, instead of the indexes, you want to extract the values, you can perform the tests using functions2, but the results are almost the same.
I thought I would add this because the a in the example you gave is sorted:
import numpy as np
a = [1, 3, 5, 6, 9, 10, 14, 15, 56]
start = np.searchsorted(a, 6, 'left')
end = np.searchsorted(a, 10, 'right')
rng = np.arange(start, end)
rng
# array([3, 4, 5])
a = np.array([1,2,3,4,5,6,7,8,9])
b = a[(a>2) & (a<8)]
Another way is with:
np.vectorize(lambda x: 6 <= x <= 10)(a)
which returns:
array([False, False, False, True, True, True, False, False, False])
It is sometimes useful for masking time series, vectors, etc.
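For example, the mask can be used directly to pull out the matching values or their indices:
import numpy as np

a = np.array([1, 3, 5, 6, 9, 10, 14, 15, 56])
mask = np.vectorize(lambda x: 6 <= x <= 10)(a)
print(a[mask])            # [ 6  9 10]
print(np.where(mask)[0])  # indices: [3 4 5]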
This code snippet returns all the numbers in a numpy array between two values:
a = np.array([1, 3, 5, 6, 9, 10, 14, 15, 56] )
a[(a>6)*(a<10)]
It works as follows:
(a>6) returns a numpy array with True (1) and False (0) values, and so does (a<10). Multiplying these two together gives an array that is True wherever both statements are True (because 1x1 = 1) and False otherwise (because 0x0 = 0 and 1x0 = 0).
The part a[...] returns all values of array a where the array between the brackets is True.
Of course you can make this more complicated by saying, for instance,
...*(1 - (a<10))
which is similar to an "and not" statement.
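A small demo of both forms; note the extra parentheses and the cast back to bool, since 1 - (a < 10) produces an integer array rather than a boolean mask:
import numpy as np

a = np.array([1, 3, 5, 6, 9, 10, 14, 15, 56])
print(a[(a > 6) * (a < 10)])                    # both conditions True -> [9]
mask = ((a > 6) * (1 - (a < 10))).astype(bool)  # "> 6 and not < 10"
print(a[mask])                                  # [10 14 15 56]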
a = np.array([1, 3, 5, 6, 9, 10, 14, 15, 56])
np.argwhere((a>=6) & (a<=10))
Wanted to add numexpr into the mix:
import numpy as np
import numexpr as ne
a = np.array([1, 3, 5, 6, 9, 10, 14, 15, 56])
np.where(ne.evaluate("(6 <= a) & (a <= 10)"))[0]
# array([3, 4, 5], dtype=int64)
It would only make sense for larger arrays with millions of entries, or if you are hitting memory limits.
This may not be the prettiest, but works for any dimension
a = np.array([[-1, 2], [1, 5], [6, 7], [5, 2], [3, 4], [0, 0], [-1, -1]])
ranges = (0, 4), (0, 4)

def conditionRange(X: np.ndarray, ranges: list) -> np.ndarray:
    idx = set()
    for column, r in enumerate(ranges):
        tmp = np.where(np.logical_and(X[:, column] >= r[0], X[:, column] <= r[1]))[0]
        if idx:
            idx = idx & set(tmp)
        else:
            idx = set(tmp)
    idx = np.array(list(idx))
    return X[idx, :]

b = conditionRange(a, ranges)
print(b)
s = [52, 33, 70, 39, 57, 59, 7, 2, 46, 69, 11, 74, 58, 60, 63, 43, 75, 92, 65, 19, 1, 79, 22, 38, 26, 3, 66, 88, 9, 15, 28, 44, 67, 87, 21, 49, 85, 32, 89, 77, 47, 93, 35, 12, 73, 76, 50, 45, 5, 29, 97, 94, 95, 56, 48, 71, 54, 55, 51, 23, 84, 80, 62, 30, 13, 34]
dic = {}
for i in range(0, len(s), 10):
    dic[i, i + 10] = list(filter(lambda x: ((x >= i) & (x < i + 10)), s))
print(dic)
for keys, values in dic.items():
    print(keys)
    print(values)
Output:
(0, 10)
[7, 2, 1, 3, 9, 5]
(20, 30)
[22, 26, 28, 21, 29, 23]
(30, 40)
[33, 39, 38, 32, 35, 30, 34]
(10, 20)
[11, 19, 15, 12, 13]
(40, 50)
[46, 43, 44, 49, 47, 45, 48]
(60, 70)
[69, 60, 63, 65, 66, 67, 62]
(50, 60)
[52, 57, 59, 58, 50, 56, 54, 55, 51]
You can use np.clip() to achieve something similar:
a = [1, 3, 5, 6, 9, 10, 14, 15, 56]
np.clip(a, 6, 10)
However, it keeps the values below 6 and above 10, clamping them to 6 and 10 respectively, rather than dropping them.
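For comparison, this is what np.clip returns on the example array; nothing is removed, out-of-range values are just clamped:
import numpy as np

a = np.array([1, 3, 5, 6, 9, 10, 14, 15, 56])
print(np.clip(a, 6, 10))  # [ 6  6  6  6  9 10 10 10 10]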
