loop to normalize range (0,10) into (0,1) - python

What I need is to divide/spread 0 to 1 according to a single number which is more than 2.
For example, with the number 5, 0 to 1 will be divided like this:
0.00
0.25
0.50
0.75
1.00
5 values in a list.
And my other question is: what do I do to get a sequence like the one below, where the middle numbers are 1 and the first and last numbers are 0, if the number is 10?
0.00
0.25
0.50
0.75
1.00
1.00
0.75
0.50
0.25
0.00

The upper bound of range(..) is exclusive (meaning it is not enumerated), so you need to add one step to the range(..) call:
for i in range(0, 11):
    b = i * (1.0 / 10)
    print(b)
That being said, if you want to create such an array, you can use numpy.arange(..):
>>> import numpy as np
>>> np.arange(0, 1.1, 0.1)
array([0. , 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1. ])
This thus allows you to specify floats for the start, stop, and step parameters.
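If you want exactly n evenly spaced values from 0 to 1 (as in the first example, n = 5), np.linspace takes the number of samples directly instead of a float step, which sidesteps float-step rounding issues — a minimal sketch:

```python
import numpy as np

n = 5
# linspace takes the count of samples, not a step size
values = np.linspace(0, 1, n)
print(values.tolist())  # [0.0, 0.25, 0.5, 0.75, 1.0]
```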
As for your second question, you can itertools.chain iterables together, like:
from itertools import chain

for i in chain(range(0, 11), range(10, -1, -1)):
    print(i / 10.0)
Here we thus have one range(..) that iterates from 0 to 10 (both inclusive) and one that iterates from 10 down to 0 (both inclusive).
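For an arbitrary count, the same mirrored shape can be wrapped in a small helper (a sketch; `mirrored` is a hypothetical name, and it assumes n is even so the ramp and its mirror have equal length):

```python
def mirrored(n):
    # Build the rising half from 0.0 to 1.0 inclusive, then append its mirror.
    # Assumes n is even.
    half = n // 2
    ramp = [i / (half - 1) for i in range(half)]
    return ramp + ramp[::-1]

print(mirrored(10))  # [0.0, 0.25, 0.5, 0.75, 1.0, 1.0, 0.75, 0.5, 0.25, 0.0]
```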

You should use range(0,11) to get all the numbers from 0 to 10.

range(0, 10) will give you the numbers from 0 to 9. Here are some practical examples:
>>> list(range(0,10))
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
>>> list(range(0,11))
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
>>> list(range(0,1))
[0]


Adding commas after imported numerical values

I need the values from a CSV to have a comma after each individual value, as well as at the end of each row/array.
I have used tolist() to do this before. Converting the numerical values to strings is not wanted.
The code below is what I currently have.
import numpy as np
dataset = open("Dataset.csv")
next(dataset) # Skips first line of dataset
games = np.loadtxt(dataset, delimiter=",")
dataset.close()
print(games)
This is what the code outputs:
[[ 0.228 0.5 0.685 0.378 0.439 0.183 0.387 0.25 0.169]
[ 0.206 0.125 0.686 0.069 0.131 0.778 2.71 0.75 -0.092]]
I am looking for the code to output this:
[[0.228,0.5 ,0.685,0.378,0.439,0.183,0.387,0.25 ,0.169],
[0.206,0.125,0.686,0.069,0.131,0.778,2.71,0.75,-0.092]]
You can set any formatter you desire via np.set_printoptions (this does not change your original array type; it only changes the printing format, which I think is what you are looking for). You can define your desired format like this:
# be mindful: this puts a comma after each float, including the last number in each sub-array
float_formatter = "{:},".format
np.set_printoptions(formatter={'float_kind': float_formatter})
print(games)
print(games)
output:
[[0.228, 0.5, 0.685, 0.378, 0.439, 0.183, 0.387, 0.25, 0.169,]
[0.206, 0.125, 0.686, 0.069, 0.131, 0.778, 2.71, 0.75, -0.092,]]
and your datatype is still float:
print(games.dtype)
float64
A better option, mentioned by @David Buck in the comments, is to use repr:
print(repr(games))
output:
array([[ 0.228, 0.5 , 0.685, 0.378, 0.439, 0.183, 0.387, 0.25 ,
0.169],
[ 0.206, 0.125, 0.686, 0.069, 0.131, 0.778, 2.71 , 0.75 ,
-0.092 ]])
Make sure you understand what Python object you have, and what the commas, or lack thereof, mean.
With loadtxt you created a numpy array. A simpler way of doing the same:
In [212]: arr = np.arange(12).reshape(2,6)
The repr display for an array is:
In [213]: arr
Out[213]:
array([[ 0, 1, 2, 3, 4, 5],
[ 6, 7, 8, 9, 10, 11]])
The str display omits the commas. That's intentional, helping to distinguish an array from a list:
In [214]: print(arr)
[[ 0 1 2 3 4 5]
[ 6 7 8 9 10 11]]
In [215]: type(arr)
Out[215]: numpy.ndarray
The print display of a list has commas:
In [216]: print(arr.tolist())
[[0, 1, 2, 3, 4, 5], [6, 7, 8, 9, 10, 11]]
The distinction between a list (or list of lists) and an array is important. Whether the display uses commas or not is superficial.
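If all that's wanted is a comma-separated rendering of the array (without changing its type), np.array2string accepts a separator argument — a minimal sketch:

```python
import numpy as np

arr = np.array([[0.228, 0.5, 0.685],
                [0.206, 0.125, 0.686]])
# Render with commas between elements; arr itself is unchanged and stays float64.
s = np.array2string(arr, separator=',')
print(s)
```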

Numpy dtype=int

In the code below, I get the expected result for x1:
import numpy as np
x1 = np.arange(0.5, 10.4, 0.8)
print(x1)
[ 0.5 1.3 2.1 2.9 3.7 4.5 5.3 6.1 6.9 7.7 8.5 9.3 10.1]
But in the code below, when I set dtype=int, why is the result of x2 not [ 0 1 2 2 3 4 5 6 6 7 8 9 10]? Instead I get [ 0 1 2 3 4 5 6 7 8 9 10 11 12], where the last value 12 overshoots the end value of 10.4. Please clarify my understanding of this.
import numpy as np
x2 = np.arange(0.5, 10.4, 0.8, dtype=int)
print(x2)
[ 0 1 2 3 4 5 6 7 8 9 10 11 12]
According to the docs: https://docs.scipy.org/doc/numpy-1.15.0/reference/generated/numpy.arange.html
stop : number
End of interval. The interval does not include this value, except in some cases where step is not an integer and floating point round-off affects the length of out.
arange : ndarray
Array of evenly spaced values.
For floating point arguments, the length of the result is ceil((stop - start)/step). Because of floating point overflow, this rule may result in the last element of out being greater than stop.
So here the length of the result will be:
In [33]: np.ceil((10.4-0.5)/0.8)
Out[33]: 13.0
Hence we see the overshoot to 12 in the case of np.arange(0.5, 10.4, 0.8, dtype=int): the computed length is 13, so with dtype=int it effectively behaves like np.arange(13), starting from the default 0.
The output we observe is:
In [35]: np.arange(0.5, 10.4, 0.8, dtype=int)
Out[35]: array([ 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12])
Hence the better way of generating integer ranges is to use integer parameters, like so:
In [25]: np.arange(0, 11, 1)
Out[25]: array([ 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10])
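If the truncated values the asker expected are actually what's wanted, one option (a sketch) is to generate the float range first and cast afterwards:

```python
import numpy as np

# Generate the floats, then truncate each toward zero with a cast.
x = np.arange(0.5, 10.4, 0.8).astype(int)
print(x)  # [ 0  1  2  2  3  4  5  6  6  7  8  9 10]
```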

Using Pandas for Multiple Time-Varying Trials

I'm just getting into Pandas, and want to figure out a good way of holding time-varying data corresponding to multiple trials.
A concrete example might be:
Trial 1: Salinity = 0.1 (unchanging), pH (at time 1, 2, ...)
Trial 2: Salinity = 0.1 (unchanging), pH (at time 1, 2, ...)
Trial 3: Salinity = 0.2 (unchanging), pH (at time 1, 2, ...)
Trial 4: Salinity = 0.2 (unchanging), pH (at time 1, 2, ...)
Where you'll notice that experiments can be repeated multiple times with the same initial parameters (the salinity), but with different time-varying variables (pH).
A DataFrame is 2-dimensional, so I would have to create a DataFrame for each trial. Is this the best way to go about it, and how would I be able to combine them (ex: get the average pH over time for trials with the same initial setup)?
You can aggregate the data across trials in a single pd.DataFrame. Below is an example.
import pandas as pd

df = pd.DataFrame({'Trial': [1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3, 4, 4, 4, 4],
                   'Date': [1, 2, 3, 4, 1, 2, 3, 4, 1, 2, 3, 4, 1, 2, 3, 4],
                   'Salinity': [0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1,
                                0.2, 0.2, 0.2, 0.2, 0.2, 0.2, 0.2, 0.2],
                   'pH': [2, 4, 1, 4, 6, 8, 3, 2, 9, 3, 1, 4, 6, 11, 4, 6]})
df = df.set_index(['Trial', 'Date', 'Salinity'])
# pH
# Trial Date Salinity
# 1 1 0.1 2
# 2 0.1 4
# 3 0.1 1
# 4 0.1 4
# 2 1 0.1 6
# 2 0.1 8
# 3 0.1 3
# 4 0.1 2
# 3 1 0.2 9
# 2 0.2 3
# 3 0.2 1
# 4 0.2 4
# 4 1 0.2 6
# 2 0.2 11
# 3 0.2 4
# 4 0.2 6
Explanation
In your dataframe construction, assign an identifier column, in this case Trial with an integer identifier.
Setting index by ['Trial', 'Date', 'Salinity'] provides a natural index for pandas to use for grouping, indexing and slicing.
For example, df.loc[(1, 2, 0.1)] will return a pd.Series derived from the dataframe indicating pH = 4.
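To answer the follow-up directly — averaging pH over time across trials that share a setup — you can group by the fixed parameter and the time column. A sketch using the same example data:

```python
import pandas as pd

df = pd.DataFrame({'Trial': [1] * 4 + [2] * 4 + [3] * 4 + [4] * 4,
                   'Date': [1, 2, 3, 4] * 4,
                   'Salinity': [0.1] * 8 + [0.2] * 8,
                   'pH': [2, 4, 1, 4, 6, 8, 3, 2, 9, 3, 1, 4, 6, 11, 4, 6]})

# Mean pH at each time point, across all trials with the same salinity.
avg = df.groupby(['Salinity', 'Date'])['pH'].mean()
print(avg)
```

For instance, at Salinity 0.1 and Date 1 this averages trials 1 and 2 (pH 2 and 6) to 4.0.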

Accelerate a pandas operation involving several dataframes

Hello everyone,
For a school project, I am stuck on the duration of an operation with a Pandas DataFrame.
I have one dataframe df whose shape is (250 000 000, 200). This dataframe contains the values of variables describing the behaviour of sensors on a machine.
They are organized by 'Cycle' (every time the machine begins a new cycle, this variable is incremented by one), and within a cycle, 'CycleTime' gives the position of the row.
In the 'mean' DataFrame, I compute the mean of each variable grouped by 'CycleTime'.
The 'anomaly_matrix' DataFrame represents the global anomaly of each cycle, which is the sum of the squared differences between each row belonging to the cycle and the mean of the corresponding CycleTime.
An example of my code is below:
import pandas as pd

df = pd.DataFrame({'Cycle': [0, 0, 0, 1, 1, 1, 2, 2],
                   'CycleTime': [0, 1, 2, 0, 1, 2, 0, 1],
                   'variable1': [0, 0.5, 0.25, 0.3, 0.4, 0.1, 0.2, 0.25],
                   'variable2': [1, 2, 1, 1, 2, 2, 1, 2],
                   'variable3': [100, 5000, 200, 900, 100, 2000, 300, 300]})
mean = df.drop(['Cycle'], axis=1).groupby("CycleTime").agg('mean')
anomali_matrix = df.drop(['CycleTime'], axis=1).groupby("Cycle").agg('mean')
anomaly_matrix = anomali_matrix - anomali_matrix  # zero-filled frame with the right index
for index, row in df.iterrows():
    cycle = row["Cycle"]
    time = row["CycleTime"]
    anomaly_matrix.loc[cycle] += (row - mean.loc[time])**2
>>> anomaly_matrix
variable1 variable2 variable3
Cycle
0 0.047014 0.25 1.116111e+07
1 0.023681 0.25 3.917778e+06
2 0.018889 0.00 2.267778e+06
This calculation takes 6 hours for my (250 000 000, 200) DataFrame; the time is due to anomaly_matrix.loc[cycle] += (row - mean.loc[time])**2.
I tried to improve it by using an apply function, but I did not succeed in bringing the other DataFrames into that apply function. The same happened when trying to vectorize with pandas.
Do you have any idea how to accelerate this process? Thanks.
You can vectorize the whole computation, avoiding the per-row Python loop:
df1 = df.set_index(['Cycle', 'CycleTime'])
mean = df1.sub(df1.groupby('CycleTime').transform('mean'))**2
df2 = mean.groupby('Cycle').sum()
print(df2)
variable1 variable2 variable3
Cycle
0 0.047014 0.25 1.116111e+07
1 0.023681 0.25 3.917778e+06
2 0.018889 0.00 2.267778e+06
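The key step is transform('mean'), which returns a frame the same shape as the input, filled with each CycleTime group's mean, so the subtraction becomes one aligned, vectorized operation instead of a Python-level loop. A small illustration with one variable (using a slice of the toy df from the question):

```python
import pandas as pd

df = pd.DataFrame({'Cycle': [0, 0, 0, 1, 1, 1, 2, 2],
                   'CycleTime': [0, 1, 2, 0, 1, 2, 0, 1],
                   'variable1': [0, 0.5, 0.25, 0.3, 0.4, 0.1, 0.2, 0.25]})
df1 = df.set_index(['Cycle', 'CycleTime'])

# Same shape as df1: every row holds its CycleTime group's mean.
per_time_mean = df1.groupby('CycleTime').transform('mean')
print(per_time_mean)
```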

Sort a list based on a given distribution

While answering another question, I ended up with a roundabout solution to a problem that I believe could have been done in a better way, but I was clueless.
There are two lists:
percent = [0.23, 0.27, 0.4, 0.1]
optimal_partition = [3, 2, 2, 1]
optimal_partition is one of the integer partitions of the number 8 into 4 parts.
I would like to sort optimal_partition so that it matches the percentage distribution as closely as possible, which means each individual partition element should match the corresponding percentage magnitude as closely as possible.
So 3 -> 0.4, 2 -> 0.27 and 0.23 and 1 -> 0.1
So the final result should be
[2, 2, 3, 1]
The way I ended up solving this was
>>> from operator import itemgetter
>>> percent = [0.23, 0.27, 0.4, 0.1]
>>> optimal_partition = [3, 2, 2, 1]
>>> optimal_partition_percent = zip(sorted(optimal_partition),
...                                 sorted(enumerate(percent),
...                                        key=itemgetter(1)))
>>> optimal_partition = [e for e, _ in sorted(optimal_partition_percent,
...                                           key=lambda e: e[1][0])]
>>> optimal_partition
[2, 2, 3, 1]
Can you suggest an easier way to solve this?
By easier I mean: without the need for multiple sorts, or storing and later rearranging based on index.
Couple of more examples:
percent = [0.25, 0.25, 0.4, 0.1]
optimal_partition = [3, 2, 2, 1]
result = [2, 2, 3, 1]
percent = [0.2, 0.2, 0.4, 0.2]
optimal_partition = [3, 2, 2, 1]
result = [1, 2, 3, 2]
from numpy import take, argsort
take(opt, argsort(argsort(perc)[::-1]))
or without imports:
zip(*sorted(zip(sorted(range(len(perc)), key=perc.__getitem__)[::-1], opt)))[1]
# Test
l = [([0.23, 0.27, 0.4, 0.1], [3, 2, 2, 1]),
     ([0.25, 0.25, 0.4, 0.1], [3, 2, 2, 1]),
     ([0.2, 0.2, 0.4, 0.2], [3, 2, 2, 1])]

def f1(perc, opt):
    return take(opt, argsort(argsort(perc)[::-1]))

def f2(perc, opt):
    return zip(*sorted(zip(sorted(range(len(perc)),
                                  key=perc.__getitem__)[::-1], opt)))[1]

for i in l:
    perc, opt = i
    print f1(perc, opt), f2(perc, opt)
# output:
# [2 2 3 1] (2, 2, 3, 1)
# [2 2 3 1] (2, 2, 3, 1)
# [1 2 3 2] (1, 2, 3, 2)
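The take/argsort one-liner works because argsort(argsort(x)) yields, for each position, which rank it holds. The same idea translated into plain Python 3 list code, with a hypothetical helper name match_partition:

```python
def match_partition(perc, opt):
    # First argsort: indices that would sort perc ascending.
    asc = sorted(range(len(perc)), key=perc.__getitem__)
    # Reverse for descending order, then argsort again so each position
    # picks the element of opt that matches its percentage's rank.
    rev = asc[::-1]
    idx = sorted(range(len(rev)), key=rev.__getitem__)
    return [opt[i] for i in idx]

print(match_partition([0.23, 0.27, 0.4, 0.1], [3, 2, 2, 1]))  # [2, 2, 3, 1]
```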
Use the fact that the percentages sum to 1:
percent = [0.23, 0.27, 0.4, 0.1]
optimal_partition = [3, 2, 2, 1]
total = sum(optimal_partition)
output = [total*i for i in percent]
Now you need to figure out a way to redistribute the fractional components somehow. Thinking out loud:
from operator import itemgetter
# Use lists (not tuples) so the entries can be modified in place below
intermediate = [[i[0], int(i[1]), i[1] - int(i[1])] for i in enumerate(output)]
# Sort the list by the fractional component
s = sorted(intermediate, key=itemgetter(2))
# Now, distribute each item's fractional component to the rest, starting at the top:
for i, tup in enumerate(s):
    fraction = tup[2]
    # Go through the remaining items in reverse order
    for index in range(len(s) - 1, i, -1):
        this_fraction = s[index][2]
        if fraction + this_fraction >= 1:
            # increment this item by 1, clear the fraction, carry the remainder
            new_fraction = fraction + this_fraction - 1
            s[index][1] = s[index][1] + 1
            s[index][2] = 0
            fraction = new_fraction
        else:
            # just add the fraction to this element, clear the original element
            s[index][2] = s[index][2] + fraction
Now, I'm not sure I'd say that's "easier". I haven't tested it, and I may well have the logic wrong in that last section. But it's a different approach.
