How to use pd.cut to bin data in a natural way? - python

Say I have a pandas Series of 100 float data points and I need to put them into 10 equally wide bins, and then access, say, the indices of the data in the fourth bin. What I tried is:
import pandas as pd; import numpy as np
np.random.seed(1)
s = pd.Series(np.random.randn(100))
cut = pd.cut(s, bins=10, labels=range(10))
fourth_bin = s[cut == 4]
fourth_bin
Out[101]:
9 -0.249370
12 -0.322417
13 -0.384054
16 -0.172428
26 -0.122890
28 -0.267888
31 -0.396754
40 -0.191836
51 -0.352250
53 -0.349343
54 -0.208894
63 -0.298093
65 -0.075572
71 -0.504466
76 -0.306204
80 -0.222328
81 -0.200758
92 -0.375285
96 -0.343854
dtype: float64
which works, but isn't quite natural and looks a bit clumsy. For example, can I avoid manually setting the labels and just start from pd.cut(s, bins=10)? Then I would want to do something like
s[s in pd.cut(s, bins=10).categories[4]]
since categories is an index of Intervals, but this doesn't work.
Is there a more natural way to do this, so I don't have to set the labels manually?

pd.qcut
For evenly sized bins
np.random.seed(1)
s = pd.Series(np.random.randn(100))
cut = pd.qcut(s, 10, labels=False)
fourth_bin = s[cut == 4]
fourth_bin
16 -0.172428
18 0.042214
26 -0.122890
35 -0.012665
40 -0.191836
44 0.050808
54 -0.208894
65 -0.075572
81 -0.200758
97 0.043597
dtype: float64
pd.cut
For evenly spaced bins
np.random.seed(1)
s = pd.Series(np.random.randn(100))
cut = pd.cut(s, 10, labels=False)
fourth_bin = s[cut == 4]
fourth_bin
9 -0.249370
12 -0.322417
13 -0.384054
16 -0.172428
26 -0.122890
28 -0.267888
31 -0.396754
40 -0.191836
51 -0.352250
53 -0.349343
54 -0.208894
63 -0.298093
65 -0.075572
71 -0.504466
76 -0.306204
80 -0.222328
81 -0.200758
92 -0.375285
96 -0.343854
dtype: float64
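If you would rather not pass labels at all, a sketch of one alternative (assuming a pandas version where pd.cut returns a categorical with the .cat accessor) is to compare the cut against its own interval categories, or use the integer bin codes directly:
cut = pd.cut(s, bins=10)
# Select by the Interval object itself; categories[4] is the fifth interval
fourth_bin = s[cut == cut.cat.categories[4]]
# Or, equivalently, via the integer bin codes
fourth_bin = s[cut.cat.codes == 4]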

Related

How to read one column's data row by row from a csv file using python

I have a dataset with three input columns: x1, x2 and x3. I want to read just the x2 column, row by row. I wrote the code below, but it only prints letters.
Here is my code:
data = pd.read_csv('data6.csv')
row_num =0
x=[]
for col in data:
    if (row_num==1):
        x.append(col[0])
    row_num =+ 1
print(x)
result : x1,x2,x3
The output I expected (column x2, read row by row):
65
32
14
25
85
47
63
21
98
65
21
47
48
49
46
43
48
25
28
29
37
Subset of my csv file :
x1 x2 x3
6 65 78
5 32 59
5 14 547
6 25 69
7 85 57
8 47 51
9 63 26
3 21 38
2 98 24
7 65 96
1 21 85
5 47 94
9 48 15
4 49 27
3 46 96
6 43 32
5 48 10
8 25 75
5 28 20
2 29 30
7 37 96
Can anyone help me to solve this error?
If you want a list from x2, use:
x = data['x2'].tolist()
I am not sure I even get what you're trying to do from your code.
What you're doing (after fixing the indentation to make it somewhat correct):
Iterate through all columns of your dataframe
Take the first character of the column name if row_num is equal to 1.
Based on this guess:
import pandas as pd
data = pd.read_csv("data6.csv")
row_num = 0
x = []
for col in data:
    if row_num == 1:
        x.append(col[0])
    row_num = +1
print(x)
What you probably want to do:
import pandas as pd
data = pd.read_csv("data6.csv")
# Make a list containing the values in column 'x2'
x = list(data['x2'])
# Print all values at once:
print(x)
# Print one value per line:
for val in x:
    print(val)
Since you are already using pandas, you can get any specific column's values by converting the column directly with list(); no for loop is needed:
import pandas as pd
data = pd.read_csv('data6.csv')
print(list(data['x2']))
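As a side note, if you only ever need that one column, you can ask pandas to parse just it from the file. A minimal sketch, assuming the same data6.csv that pd.read_csv already reads correctly:
import pandas as pd
# usecols restricts parsing to the named columns
x = pd.read_csv('data6.csv', usecols=['x2'])['x2'].tolist()
print(x)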

Remove Specific Indices From 2D Numpy Array

If I have a set of data that's of shape (1000,1000) and I know that the values I need from it are contained within the indices (25:888,11:957), how would I go about separating the two sections of data from one another?
I couldn't figure out how to get np.delete() to handle this 2D case, and I also need both the good and the bad sections of data for analysis, so I can't just restrict my array bounds to the good indices.
I feel like there's a simple solution I'm missing here.
Is this how you want to divide the array?
In [364]: arr = np.ones((1000,1000),int)
In [365]: beta = arr[25:888, 11:957]
In [366]: beta.shape
Out[366]: (863, 946)
In [367]: arr[:25,:].shape
Out[367]: (25, 1000)
In [368]: arr[888:,:].shape
Out[368]: (112, 1000)
In [369]: arr[25:888,:11].shape
Out[369]: (863, 11)
In [370]: arr[25:888,957:].shape
Out[370]: (863, 43)
I'm imagining a square with a rectangle cut out of the middle. It's easy to specify that rectangle, but the frame has to be viewed as 4 rectangles - unless it is described via a mask of what is missing.
Checking that I got everything:
In [376]: x = np.array([_366,_367,_368,_369,_370])
In [377]: np.multiply.reduce(x, axis=1).sum()
Out[377]: 1000000
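If you want all the frame values in one flat array rather than four rectangles, a boolean mask is the compact way to describe the missing rectangle. A sketch using the shapes from the question:
mask = np.ones((1000, 1000), dtype=bool)
mask[25:888, 11:957] = False   # the "good" rectangle
good = arr[25:888, 11:957]     # 2D, shape (863, 946)
bad = arr[mask]                # 1D array of the frame values only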
Let's say your original numpy array is my_arr
Extracting the "Good" Section:
This is easy because the good section has a rectangular shape.
good_arr = my_arr[25:888, 11:957]
Extracting the "Bad" Section:
The "bad" section doesn't have a rectangular shape. Rather, it has the shape of a rectangle with a rectangular hole cut out of it.
So you can't really store the "bad" section alone in any array-like structure, unless you're ok with wasting some extra space to deal with the cut-out portion.
What are your options for the "Bad" Section?
Option 1:
Be happy and content with having extracted the good section. Let the bad section remain as part of the original my_arr. While iterating through my_arr, you can always discriminate between good and bad items based on the indices. The disadvantage is that whenever you want to process only the bad items, you have to do it through a nested double loop rather than use the vectorized features of numpy.
Option 2:
Suppose we want to perform some operations, such as row-wise or column-wise totals, on only the bad items of my_arr, and suppose you don't want the overhead of the nested for loops. You can create something called a numpy masked array. With a masked array, you can perform most of your usual numpy operations, and numpy will automatically exclude masked-out items from the calculations. Note that internally there will be some memory wastage involved, just to store an item as "masked".
The code below illustrates how you can create a masked array called masked_arr from your original array my_arr:
import numpy as np
my_size = 10 # In your case, 1000
r_1, r_2 = 2, 8 # In your case, r_1 = 25, r_2 = 889 (which is 888+1)
c_1, c_2 = 3, 5 # In your case, c_1 = 11, c_2 = 958 (which is 957+1)
# Using nested list comprehension, build a boolean mask as a list of lists, of shape (my_size, my_size).
# The mask will have False everywhere, except in the sub-region [r_1:r_2, c_1:c_2], which will have True.
mask_list = [[True if ((r in range(r_1, r_2)) and (c in range(c_1, c_2))) else False
              for c in range(my_size)] for r in range(my_size)]
# Your original, complete 2d array. Let's just fill it with some "toy data"
my_arr = np.arange((my_size * my_size)).reshape(my_size, my_size)
print (my_arr)
masked_arr = np.ma.masked_where(mask_list, my_arr)
print ("masked_arr is:\n", masked_arr, ", and its shape is:", masked_arr.shape)
The output of the above is:
[[ 0 1 2 3 4 5 6 7 8 9]
[10 11 12 13 14 15 16 17 18 19]
[20 21 22 23 24 25 26 27 28 29]
[30 31 32 33 34 35 36 37 38 39]
[40 41 42 43 44 45 46 47 48 49]
[50 51 52 53 54 55 56 57 58 59]
[60 61 62 63 64 65 66 67 68 69]
[70 71 72 73 74 75 76 77 78 79]
[80 81 82 83 84 85 86 87 88 89]
[90 91 92 93 94 95 96 97 98 99]]
masked_arr is:
[[0 1 2 3 4 5 6 7 8 9]
[10 11 12 13 14 15 16 17 18 19]
[20 21 22 -- -- 25 26 27 28 29]
[30 31 32 -- -- 35 36 37 38 39]
[40 41 42 -- -- 45 46 47 48 49]
[50 51 52 -- -- 55 56 57 58 59]
[60 61 62 -- -- 65 66 67 68 69]
[70 71 72 -- -- 75 76 77 78 79]
[80 81 82 83 84 85 86 87 88 89]
[90 91 92 93 94 95 96 97 98 99]] , and its shape is: (10, 10)
Now that you have a masked array, you will be able to perform most of the numpy operations on it, and numpy will automatically exclude the masked items (the ones that appear as "--" when you print the masked array)
Some examples of what you can do with the masked array:
# Now, you can print column-wise totals, of only the bad items.
print (masked_arr.sum(axis=0))
# Or row-wise totals, for that matter.
print (masked_arr.sum(axis=1))
The output of the above is:
[450 460 470 192 196 500 510 520 530 540]
[45 145 198 278 358 438 518 598 845 945]
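A small design note: the same boolean mask can be built with plain slicing instead of the nested list comprehension, which is shorter and faster for large arrays. A sketch equivalent to mask_list above:
mask = np.zeros((my_size, my_size), dtype=bool)
mask[r_1:r_2, c_1:c_2] = True   # True marks the region to mask out
masked_arr = np.ma.masked_array(my_arr, mask=mask)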

Linear regression: ValueError: all the input array dimensions except for the concatenation axis must match exactly

I am looking for a solution to the following problem, and it just won't work the way I want.
So my goal is to calculate a regression analysis and get the slope, intercept, rvalue, pvalue and stderr for multiple rows (this could go up to 10000). In this example, I have a file with 15 rows. Here are the first two rows:
array([
[ 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11,
12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22,
23, 24],
[ 100, 10, 61, 55, 29, 77, 61, 42, 70, 73, 98,
62, 25, 86, 49, 68, 68, 26, 35, 62, 100, 56,
10, 97]]
)
Full trial data set:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24
100 10 61 55 29 77 61 42 70 73 98 62 25 86 49 68 68 26 35 62 100 56 10 97
57 89 25 89 48 56 67 17 98 10 25 90 17 52 85 56 18 20 74 97 82 63 45 87
192 371 47 173 202 144 17 147 174 483 170 422 285 13 77 116 500 136 276 392 220 121 441 268
The first row is the x-variable and this is the independent variable. This has to be kept fixed while iterating over every following row.
For the following row, the y-variable and thus the dependent variable, I want to calculate the slope, intercept, rvalue, pvalue and stderr and have them in a dataframe (if possible added to the same dataframe, but this is not necessary).
I tried the following code:
import pandas as pd
import scipy.stats
import numpy as np
df = pd.read_excel("Directory\\file.xlsx")
def regr(row):
    r = scipy.stats.linregress(df.iloc[1:, :], row)
    return r

full_dataframe = None
for index, row in df.iterrows():
    x = regr(index)
    if full_dataframe is None:
        full_dataframe = x.T
    else:
        full_dataframe = full_dataframe.append([x.T])
full_dataframe.to_excel('Directory\\file.xlsx')
But this fails and gives the following error:
ValueError: all the input array dimensions except for the concatenation axis
must match exactly
I'm really lost in here.
So, what I want is the slope, intercept, pvalue, rvalue and stderr per row, starting from the second one, because the first row is the x-variable.
Does anyone have an idea HOW to do this, WHY mine isn't working, and WHAT the code should look like?
Thanks!!
Guessing the issue
Most likely, your problem is the format of your numbers: they are Unicode strings (dtype('<U21')) instead of integers or floats.
Always check types:
df.dtypes
Cast your dataframe using:
df = df.astype(np.float64)
Below a small example showing the issue:
import numpy as np
import pandas as pd
# DataFrame without numbers (will not work for Math):
df = pd.DataFrame(['1', '2', '3'])
df.dtypes # object: placeholder for everything that is not number or timestamps (string, etc...)
# Casting DataFrame to make it suitable for Math Operations:
df = df.astype(np.float64)
df.dtypes # float64
But it is difficult to be sure of this without having the original file or data you are working with.
Carefully read the Exception
This is consistent with the exception you get:
TypeError: ufunc 'add' did not contain a loop with signature matching types
dtype('<U21') dtype('<U21') dtype('<U21')
The method scipy.stats.linregress raises a TypeError (so it is about types) and is telling you that it cannot perform the add operation, because adding strings (dtype('<U21')) makes no sense in the context of a linear regression.
Understand the Design
Loading the data:
import io
import pandas as pd
import scipy.stats
fh = io.StringIO("""1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24
100 10 61 55 29 77 61 42 70 73 98 62 25 86 49 68 68 26 35 62 100 56 10 97
57 89 25 89 48 56 67 17 98 10 25 90 17 52 85 56 18 20 74 97 82 63 45 87
192 371 47 173 202 144 17 147 174 483 170 422 285 13 77 116 500 136 276 392 220 121 441 268""")
df = pd.read_fwf(fh).astype(float)
Then we can regress the second row vs the first:
scipy.stats.linregress(df.iloc[0,:].values, df.iloc[1,:].values)
It returns:
LinregressResult(slope=0.12419744768547877, intercept=49.60998434527584, rvalue=0.11461693561751324, pvalue=0.5938303095361301, stderr=0.22949908667668056)
Assembling all together:
result = pd.DataFrame(columns=["slope", "intercept", "rvalue"])
for i, row in df.iterrows():
    fit = scipy.stats.linregress(df.iloc[0, :], row)
    result.loc[i] = (fit.slope, fit.intercept, fit.rvalue)
Returns:
slope intercept rvalue
0 1.000000 0.000000 1.000000
1 0.124197 49.609984 0.114617
2 -1.095801 289.293224 -0.205150
Which is, as far as I understand your question, what you expected.
The second exception you get comes because of this line:
x = regr(index)
You sent the index of the row instead of the row itself to the regression method.
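If you also want pvalue and stderr, as asked in the question, the same loop extends naturally. A sketch under the same setup, skipping row 0, which serves as the x-variable here:
result = pd.DataFrame(columns=["slope", "intercept", "rvalue", "pvalue", "stderr"])
x = df.iloc[0, :]
for i, row in df.iloc[1:].iterrows():
    fit = scipy.stats.linregress(x, row)
    result.loc[i] = (fit.slope, fit.intercept, fit.rvalue, fit.pvalue, fit.stderr)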

Understand tensorflow slice operation

I am confused about the following code:
import tensorflow as tf
import numpy as np
from tensorflow.python.framework import ops
from tensorflow.python.ops import array_ops
from tensorflow.python.ops import control_flow_ops
from tensorflow.python.ops import math_ops
from tensorflow.python.framework import dtypes
'''
Randomly crop a tensor, then return the crop position
'''
def random_crop(value, size, seed=None, name=None):
    with ops.name_scope(name, "random_crop", [value, size]) as name:
        value = ops.convert_to_tensor(value, name="value")
        size = ops.convert_to_tensor(size, dtype=dtypes.int32, name="size")
        shape = array_ops.shape(value)
        check = control_flow_ops.Assert(
            math_ops.reduce_all(shape >= size),
            ["Need value.shape >= size, got ", shape, size],
            summarize=1000)
        shape = control_flow_ops.with_dependencies([check], shape)
        limit = shape - size + 1
        begin = tf.random_uniform(
            array_ops.shape(shape),
            dtype=size.dtype,
            maxval=size.dtype.max,
            seed=seed) % limit
        return tf.slice(value, begin=begin, size=size, name=name), begin
sess = tf.InteractiveSession()
size = [10]
a = tf.constant(np.arange(0, 100, 1))
print (a.eval())
a_crop, begin = random_crop(a, size = size, seed = 0)
print ("offset: {}".format(begin.eval()))
print ("a_crop: {}".format(a_crop.eval()))
a_slice = tf.slice(a, begin=begin, size=size)
print ("a_slice: {}".format(a_slice.eval()))
assert (tf.reduce_all(tf.equal(a_crop, a_slice)).eval() == True)
sess.close()
outputs:
[ 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23
24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47
48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71
72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95
96 97 98 99]
offset: [46]
a_crop: [89 90 91 92 93 94 95 96 97 98]
a_slice: [27 28 29 30 31 32 33 34 35 36]
There are two tf.slice calls:
(1) inside the function random_crop: tf.slice(value, begin=begin, size=size, name=name)
(2) at top level: a_slice = tf.slice(a, begin=begin, size=size)
The parameters (value, begin and size) of these two slice operations are the same.
However, why are the printed values of a_crop and a_slice different, yet tf.reduce_all(tf.equal(a_crop, a_slice)).eval() is True?
Thanks
EDIT1
Thanks @xdurch0, I understand the first question now.
TensorFlow's random_uniform seems to act like a random generator that draws a new value each time it is evaluated.
import tensorflow as tf
import numpy as np
sess = tf.InteractiveSession()
size = [10]
np_begin = np.random.randint(0, 50, size=1)
tf_begin = tf.random_uniform(shape = [1], minval=0, maxval=50, dtype=tf.int32, seed = 0)
a = tf.constant(np.arange(0, 100, 1))
a_slice = tf.slice(a, np_begin, size = size)
print ("a_slice: {}".format(a_slice.eval()))
a_slice = tf.slice(a, np_begin, size = size)
print ("a_slice: {}".format(a_slice.eval()))
a_slice = tf.slice(a, tf_begin, size = size)
print ("a_slice: {}".format(a_slice.eval()))
a_slice = tf.slice(a, tf_begin, size = size)
print ("a_slice: {}".format(a_slice.eval()))
sess.close()
output
a_slice: [42 43 44 45 46 47 48 49 50 51]
a_slice: [42 43 44 45 46 47 48 49 50 51]
a_slice: [41 42 43 44 45 46 47 48 49 50]
a_slice: [29 30 31 32 33 34 35 36 37 38]
The confusing thing here is that tf.random_uniform (like every random operation in TensorFlow) produces a new, different value on each evaluation call (each call to .eval() or, in general, each call to tf.Session.run). So if you evaluate a_crop you get one thing, and if you then evaluate a_slice you get a different thing, but if you evaluate tf.reduce_all(tf.equal(a_crop, a_slice)) you get True, because everything is computed in a single evaluation step: only one random value is produced, and it determines the value of both a_crop and a_slice. Another example: if you run tf.stack([a_crop, a_slice]).eval() you will get a tensor with two equal rows; again, only one random value was produced. More generally, if you call tf.Session.run with multiple tensors to evaluate, all the computations in that call will use the same random values.
As a side note, if you actually need a random value produced in one computation to be maintained for a later computation, the easiest thing is to retrieve it with tf.Session.run, along with any other needed computation, and feed it back later through feed_dict; alternatively, you could have a tf.Variable and store the random value there. A more advanced possibility would be to use partial_run, an experimental API that allows you to evaluate part of the computation graph and continue evaluating it later, while maintaining the same state (i.e. the same random values, among other things).
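A concrete sketch of the feed_dict approach, using the TF1-style API and the a, size, begin, a_crop and sess from the code above (begin_ph is a placeholder introduced here just for illustration):
begin_ph = tf.placeholder(tf.int32, shape=[1])
pinned = tf.slice(a, begin=begin_ph, size=size)
# One session call: begin and a_crop share the same random draw
begin_val, crop_val = sess.run([begin, a_crop])
# Feeding the materialized offset back reproduces the same slice later
print(sess.run(pinned, feed_dict={begin_ph: begin_val}))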

Python/Pandas Select Columns based on Best Value Distribution

I have a dataframe (df) in pandas/python with ['Product','OrderDate','Sales'].
I noticed that for some rows the values have a better distribution (as in a histogram) than others. By "best" I mean the shape is more spread out, i.e. the spread of values makes the shape look wider than for other rows.
If I want to pick, from 700+ Products, those with more spread-out values, is there an easy way to do that in pandas/python?
txs in advance.
The caveat here is that I'm not a stats expert, but basically scipy has a number of tests you can conduct on your data to test whether it could be considered a normal (Gaussian) distribution.
Here I create two series: one is simply a linear range, and the other is a random normal sample with mean 50 and standard deviation 25.
In [48]:
import numpy as np
import pandas as pd
import scipy.stats as stats
df = pd.DataFrame({'linear':np.arange(100), 'normal':np.random.normal(50, 25, 100)})
df
df
Out[48]:
linear normal
0 0 66.565374
1 1 63.453899
2 2 65.736406
3 3 65.848908
4 4 56.916032
5 5 93.870682
6 6 89.513998
7 7 9.949555
8 8 9.727099
9 9 47.072785
10 10 62.849321
11 11 33.263309
12 12 42.168484
13 13 38.488933
14 14 51.833459
15 15 54.911915
16 16 62.372709
17 17 96.928452
18 18 65.333546
19 19 26.341462
20 20 41.692790
21 21 22.852561
22 22 15.799415
23 23 50.600141
24 24 14.234088
25 25 72.428607
26 26 45.872601
27 27 80.783253
28 28 29.561586
29 29 51.261099
.. ... ...
70 70 32.826052
71 71 35.413106
72 72 49.415386
73 73 28.998378
74 74 32.237667
75 75 86.622402
76 76 105.098296
77 77 53.176413
78 78 -7.954881
79 79 60.313761
80 80 42.739641
81 81 56.667834
82 82 68.046688
83 83 72.189683
84 84 67.125708
85 85 24.798553
86 86 58.845761
87 87 54.559792
88 88 93.116777
89 89 30.209895
90 90 80.952444
91 91 57.895433
92 92 47.392336
93 93 13.136111
94 94 26.624532
95 95 53.461421
96 96 28.782809
97 97 16.342756
98 98 64.768579
99 99 68.410021
[100 rows x 2 columns]
From this page there are a number of tests we can use, which are combined to form the normaltest, namely the skewtest and kurtosistest. I cannot explain these in depth, but you can see that the p-value is poor for the linear series and is relatively closer to 1 for the normally distributed data:
In [49]:
print('linear skewtest teststat = %6.3f pvalue = %6.4f' % stats.skewtest(df['linear']))
print('normal skewtest teststat = %6.3f pvalue = %6.4f' % stats.skewtest(df['normal']))
print('linear kurtosis teststat = %6.3f pvalue = %6.4f' % stats.kurtosistest(df['linear']))
print('normal kurtosis teststat = %6.3f pvalue = %6.4f' % stats.kurtosistest(df['normal']))
print('linear normaltest teststat = %6.3f pvalue = %6.4f' % stats.normaltest(df['linear']))
print('normal normaltest teststat = %6.3f pvalue = %6.4f' % stats.normaltest(df['normal']))
linear skewtest teststat = 1.022 pvalue = 0.3070
normal skewtest teststat = -0.170 pvalue = 0.8652
linear kurtosis teststat = -5.799 pvalue = 0.0000
normal kurtosis teststat = -1.113 pvalue = 0.2656
linear normaltest teststat = 34.674 pvalue = 0.0000
normal normaltest teststat = 1.268 pvalue = 0.5304
From the scipy site:
When testing for normality of a small sample of t-distributed
observations and a large sample of normal distributed observation,
then in neither case can we reject the null hypothesis that the sample
comes from a normal distribution. In the first case this is because
the test is not powerful enough to distinguish a t and a normally
distributed random variable in a small sample.
So you'll have to try the above and see if it fits with what you want, hope this helps.
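To apply this to the original question, you could run the test per product and rank by p-value. A rough sketch, assuming df has the Product and Sales columns from the question and each product has enough rows for the test to be meaningful:
pvalues = df.groupby('Product')['Sales'].apply(lambda s: stats.normaltest(s)[1])
print(pvalues.sort_values().head())   # products least consistent with a normal shape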
Sure. What you'd like to do here is find the 700 entries with the largest standard deviation.
pandas.DataFrame.std() will return the standard deviation for an axis, and then you just need to keep track of the entries with the highest corresponding values.
Large Standard Deviation vs. Small Standard Deviation
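A minimal sketch of that approach, assuming df has the Product and Sales columns from the question; nlargest keeps the most spread-out products:
spread = df.groupby('Product')['Sales'].std()
widest = spread.nlargest(700)   # the 700 products with the widest spread
print(widest)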
