numpy array for linear regression - python

I have a numpy array with 3 rows and 3 columns, like this:
100 200 300
233 699 999
566 655 895
and I want to create a numpy array like this for my linear regression:
100 200 300 1
233 699 999 1
566 655 895 1
This is my code:
X=np.hstack((x[:,0]),x[:,1]),x[:,2]) ,np.ones(x.shape[0])))
How can I edit my code to get my target?

You don't need to break x apart inside hstack.
Also pass a 2-D shape to np.ones: with 1-D arrays, hstack concatenates along their single axis instead of appending a column, so the 1-D shape is not what you expect here.
X = np.hstack((x, np.ones((x.shape[0], 1))))
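As a quick check, here is that fix applied to the sample data from the question (a minimal, self-contained sketch):
import numpy as np

x = np.array([[100, 200, 300],
              [233, 699, 999],
              [566, 655, 895]])

# np.ones((n, 1)) is 2-D, so hstack appends it as a new column.
X = np.hstack((x, np.ones((x.shape[0], 1))))
print(X)
# [[100. 200. 300.   1.]
#  [233. 699. 999.   1.]
#  [566. 655. 895.   1.]]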

Related

Which ML algorithm would be appropriate for clustering a DataFrame with a combination of categorical and numerical columns?

I wish to cluster a DataFrame of dimension (120000 x 4).
It consists of two string-based "label" columns (Str1 and Str2) and two numerical columns, which look like the following:
    Str1  Str2    Energy    intensity
0    713   599  7678.159  5367.276014
1    715   598  7678.182  6576.100453
2    714   597  7678.183  5675.788001
3    684   587  7678.493  3040.650157
4    693   588  7678.585  5585.908164
5    695   586  7678.615  3184.001905
6    684   584  7678.674  4896.774505
7    799   509  7693.645  4907.484401
8    798   508  7693.754  4075.800912
9    797   507  7693.781  4407.800702
10   796   506  7694.043  3138.073328
11   794   505  7694.049  3653.699936
12   795   504  7694.077  3875.120022
13   675   277  7694.948  3081.797654
14   709   221  7698.216  3587.704908
15   708   220  7698.252  4070.050144
...
What would be the best ML algorithm to cluster/categorize this data?
I have tried plotting the individual Energy and intensity components belonging to one specific category (Str1 == "713", etc.), which didn't give me much information. I need a somewhat more compact clustering, if possible.
You can apply categorical (ordinal) encoding or one-hot encoding to Str1 and Str2 (ordinal encoding suits classes with a magnitude relation, while one-hot encoding is more widely used). These convert the strings into numerical data, after which you can simply use any clustering algorithm.
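A minimal sketch of that suggestion (the column names follow the question; pd.get_dummies and KMeans are one possible choice rather than anything prescribed by this thread, and k=2 is arbitrary):
import pandas as pd
from sklearn.cluster import KMeans

# Hypothetical frame with the question's columns.
df = pd.DataFrame({'Str1': ['713', '715', '714', '684'],
                   'Str2': ['599', '598', '597', '587'],
                   'Energy': [7678.159, 7678.182, 7678.183, 7678.493],
                   'intensity': [5367.276, 6576.100, 5675.788, 3040.650]})

# One-hot encode the string labels; numeric columns pass through unchanged.
features = pd.get_dummies(df, columns=['Str1', 'Str2'])

# Cluster the encoded data.
labels = KMeans(n_clusters=2, n_init=10).fit_predict(features)
Since Energy and intensity are on much larger scales than the 0/1 dummies, standardizing them (e.g. with sklearn's StandardScaler) before clustering is usually worthwhile.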

How do I sum all the numbers in a list of lists by column without numpy?

So in Python, I have a list of lists holding the numbers 0 to 99, split into 5 rows:
array_b = np.arange(0,100).reshape(5, 20)
list_a = array_b.tolist()
I want to add the numbers in the list by column so that the result will be:
[200 205 210 215 220 225 230 235 240 245 250 255 260 265 270 275 280 285 290 295]
I know how to do it with the array version, but I want to do the same thing with the list version (without using np.sum(array_b, axis=0)).
Any help?
Without numpy this can be done with zip and map quite elegantly:
list(map(sum, zip(*list_a)))
Explanation:
zip(*list_a) aggregates the lists element-wise
map(sum, ...) applies sum to each of these aggregations
finally, list(...) unpacks the iterator returned by map into a list.
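For instance, traced on a small hypothetical list of lists:
list_a = [[1, 2, 3],
          [4, 5, 6]]
print(list(zip(*list_a)))            # [(1, 4), (2, 5), (3, 6)]
print(list(map(sum, zip(*list_a))))  # [5, 7, 9]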
Easy as (num)py...
Use .sum(axis=0) on a numpy array
import numpy as np
result = np.array(list_a).sum(axis=0)
# [200 205 210 215 220 225 230 235 240 245 250 255 260 265 270 275 280 285 290 295]
With the other axis possibilities
result = np.array(list_a).sum(axis=1)  # [ 190  590  990 1390 1790]
result = np.array(list_a).sum()        # 4950
import numpy as np
a = [[...]]
sum_array = np.sum(a, axis=0)

Check numpy array values between 2 columns of a pandas DataFrame

I have a numpy array of shape (352, 5) and a pandas DataFrame.
The objective is to check whether the first and second columns of the numpy array fall within the ranges spanned by two pairs of DataFrame columns; if so, get the index of that row and then do something with it.
Example:-
stats = [[  246 1102 1678 2214  172182]
         [  678 1005 1688 2214 3528850]
         [ 1031  241   17   23     331]]
df:-
hpos  hpos_end  vpos  vpos_end
 245       298  1100      1124
 672       685  1000      1010
Result:-
stats[0] is present in the very first row of df, since 246 lies between 245 and 298 and 1102 lies between 1100 and 1124; the same goes for the next element of stats. I want to obtain the index of the row it lies in (if it does).
My approach till now:-
for x, y, w, h, area in stats[:]:
    for row in df.itertuples():
        if x in range(int(df['hpos'][row.Index]), int(df['hpos_end'][row.Index])) and y in range(
                int(df['vpos'][row.Index]), int(df['vpos_end'][row.Index])):
            desired_index = row.Index
Is there a faster/optimal way to achieve my objective? Iterating over a DataFrame is the last thing I would like to do.
Note: both the numpy array and the df are already sorted in ascending order, by the first two columns for the numpy array and by ['hpos', 'vpos'] for the df.
Any help will be appreciated, Thank you :)
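One way to avoid iterating over the DataFrame (a hedged sketch, not from the original thread, using the same half-open bounds that range() implies) is to broadcast the comparisons in NumPy:
import numpy as np
import pandas as pd

stats = np.array([[246, 1102, 1678, 2214, 172182],
                  [678, 1005, 1688, 2214, 3528850],
                  [1031, 241, 17, 23, 331]])
df = pd.DataFrame({'hpos': [245, 672], 'hpos_end': [298, 685],
                   'vpos': [1100, 1000], 'vpos_end': [1124, 1010]})

x = stats[:, 0][:, None]  # shape (n_stats, 1), broadcasts against df rows
y = stats[:, 1][:, None]
in_h = (x >= df['hpos'].to_numpy()) & (x < df['hpos_end'].to_numpy())
in_v = (y >= df['vpos'].to_numpy()) & (y < df['vpos_end'].to_numpy())
stat_idx, row_idx = np.nonzero(in_h & in_v)
# stat_idx: rows of stats that matched; row_idx: matching df row positions
Since both sides are sorted, an np.searchsorted-based approach could avoid the full comparison matrix, but the broadcast version already removes the Python-level loops.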

Keep one line out of many that start from one point

I am working on a project with OpenCV and Python but am stuck on this small problem.
I have the end-point coordinates of many lines stored in a list. Sometimes more than one line is detected starting from a single point. Among these lines, I want to keep the line of shortest length and eliminate all the others, so that my image contains no point from which more than one line is drawn.
My variable storing the information (coordinates of both end-points) of all the lines initially detected is as follows:
var = [[Line1_EndPoint1, Line1_EndPoint2],
       [Line2_EndPoint1, Line2_EndPoint2],
       [Line3_EndPoint1, Line3_EndPoint2],
       [Line4_EndPoint1, Line4_EndPoint2],
       [Line5_EndPoint1, Line5_EndPoint2]]
where LineX_EndPointY (line number "X", endpoint "Y" of that line) is of type [x, y], with x and y the coordinates of that point in the image.
Can someone suggest how to solve this problem?
You can modify the way the line data is stored. If you do, please explain your data structure and how it is created.
Example of such data:
[[[551, 752], [541, 730]],
 [[548, 738], [723, 548]],
 [[285, 682], [226, 676]],
 [[416, 679], [345, 678]],
 [[345, 678], [388, 674]],
 [[249, 679], [226, 676]],
 [[270, 678], [388, 674]],
 [[472, 650], [751, 473]],
 [[751, 473], [716, 561]],
 [[731, 529], [751, 473]]]
Python code would be appreciated.
A Numpy solution
The same result as in my first answer can be achieved based solely
on Numpy.
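The steps below call var.shape and reshape var, so they assume var is already a Numpy array; if it is still the nested list from the question, convert it first (a small assumed step, not part of the original answer):
import numpy as np

var = np.array(var)  # shape (10, 2, 2): 10 lines, 2 endpoints, 2 coordinates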
First, define two functions.
The first computes the square of the length of a line:
def sqLgth(line):
    p1, p2 = line
    return (p1[0] - p2[0]) ** 2 + (p1[1] - p2[1]) ** 2
The second converts a vector (a 1-D array) to a column array (a 2-D array
with a single column):
def toColumn(tbl):
    return tbl.reshape(-1, 1)
Both will be used later.
Then proceed as follows:
Get the number of lines:
lineNo = var.shape[0]
Generate line indices (the content of the lineInd column in the points
array, which will be created later):
id = np.repeat(np.arange(lineNo), 2)
Generate "origin indicators" (1 - start, 2 - end), to ease analysis
of any intermediate printouts:
origin = np.tile(np.array([1, 2]), lineNo)
Compute line lengths (the content of the lgth column in points):
lgth = np.repeat([ sqLgth(line) for line in var ], 2)
Create a list of points with some additional data (consecutive
columns contain origin, lineInd, x, y and lgth):
points = np.hstack([toColumn(origin), toColumn(id),
                    var.reshape(-1, 2), toColumn(lgth)])
Compute the "criterion array" to sort:
r = np.core.records.fromarrays(points[:, 2:].transpose(),
                               names='x, y, lgth')
Sort points (by x, y and lgth):
points = points[r.argsort()]
Compute "inverse unique indices" to points:
_, inv = np.unique(points[:,2:4], axis=0, return_inverse=True)
Shift inv by 1 position:
rInv = np.roll(inv, 1)
It will be used in the next step to get the previous element.
Generate a list of line indices to drop:
toDrop = points[[ i for i in range(2 * lineNo)
                  if inv[i] == rInv[i] ], 1]
The row indices (in the points array) are the indices of repeated points
(elements of inv equal to their previous element).
The column index (1) selects the lineInd column.
The whole result (toDrop) is a list of indices of the "owning" lines
(those containing the repeated points).
Generate the result: var stripped of the lines selected in the
previous step:
var2 = np.delete(var, toDrop, axis=0)
To print the reduced list of lines, you can run:
for line in var2:
    print(f'{line[0]}, {line[1]}')
The result is:
[551 752], [541 730]
[548 738], [723 548]
[345 678], [388 674]
[249 679], [226 676]
[731 529], [751 473]
To fully comprehend how this code works:
execute each step separately,
print the result,
compare it with printouts from previous steps.
Sometimes it is instructive to print even sub-expressions separately,
e.g. var.reshape(-1, 2), which converts your var (of shape (10, 2, 2))
into a 2D array of points (one point per row).
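For instance, on the first two lines of the data (a small hypothetical check):
import numpy as np

var = np.array([[[551, 752], [541, 730]],
                [[548, 738], [723, 548]]])  # shape (2, 2, 2)
print(var.reshape(-1, 2))
# [[551 752]
#  [541 730]
#  [548 738]
#  [723 548]]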
The whole result is of course just the same as in my first answer,
but since you wrote you had little experience with Pandas, you can now
compare both methods and see the cases where Pandas lets you do
something more easily and more intuitively.
Good examples are sorting by some columns or finding duplicated rows.
In Pandas each is a matter of a single instruction with suitable
parameters, whereas in Numpy you need more instructions and have to
know various details and tricks to achieve just the same.
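As a small illustration of that difference (a hedged sketch, not code from either answer):
import numpy as np
import pandas as pd

df = pd.DataFrame({'x': [2, 1, 1], 'y': [0, 3, 3]})

# Pandas: multi-column sort and duplicate detection are one call each.
df_sorted = df.sort_values(['x', 'y'])
dup_mask = df.duplicated(subset=['x', 'y'])

# Numpy: a multi-column sort needs lexsort (keys listed last-key-first)...
arr = df.to_numpy()
arr_sorted = arr[np.lexsort((arr[:, 1], arr[:, 0]))]  # sort by x, then y
# ...and duplicate detection needs np.unique plus the index bookkeeping
# shown in the Numpy answer above.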
I decided that it is easier to write a solution based on Pandas.
The reasons are that:
I can use column names (the code is more readable),
the Pandas API is more powerful, although it runs more slowly than "pure" Numpy.
Proceed as follows:
Convert var to a DataFrame:
lines = pd.DataFrame(var.reshape(10, 4), columns=pd.MultiIndex.from_product(
    (['P1', 'P2'], ['x', 'y'])))
The initial part of lines is:
    P1        P2
     x    y    x    y
0  551  752  541  730
1  548  738  723  548
2  285  682  226  676
3  416  679  345  678
Compute the square of the length of each line:
lines[('', 'lgth')] = (lines[('P1', 'x')] - lines[('P2', 'x')]) ** 2 \
                    + (lines[('P1', 'y')] - lines[('P2', 'y')]) ** 2
lines.columns = lines.columns.droplevel()
I deliberately "stopped" at the squares of the lengths, because it is
enough to compare squared lengths (computing the root would not change
the result of the comparison).
Note also that the first level of the MultiIndex on columns was needed
only to express the columns of interest more easily. Further on it is
not needed, so I dropped it.
This time I put the full content of lines:
     x    y    x    y    lgth
0  551  752  541  730     584
1  548  738  723  548   66725
2  285  682  226  676    3517
3  416  679  345  678    5042
4  345  678  388  674    1865
5  249  679  226  676     538
6  270  678  388  674   13940
7  472  650  751  473  109170
8  751  473  716  561    8969
9  731  529  751  473    3536
The next step is to compute the points DataFrame, where all points (start and
end of each line) are in the same columns, along with the (squared) length of
the corresponding line:
points = pd.concat([lines.iloc[:, [0, 1, 4]],
                    lines.iloc[:, [2, 3, 4]]], keys=['P1', 'P2'])\
    .sort_values(['x', 'y', 'lgth']).reset_index(level=1)
Here I used iloc to select columns (the first time for the starting points
and the second for the ending points).
To make this DataFrame easier to read, I passed keys to include "origin
indicators", and then I sorted the rows.
The content is:
   level_1    x    y    lgth
P2       5  226  676     538
P2       2  226  676    3517
P1       5  249  679     538
P1       6  270  678   13940
P1       2  285  682    3517
P1       4  345  678    1865
P2       3  345  678    5042
P2       4  388  674    1865
P2       6  388  674   13940
P1       3  416  679    5042
P1       7  472  650  109170
P2       0  541  730     584
P1       1  548  738   66725
P1       0  551  752     584
P2       8  716  561    8969
P2       1  723  548   66725
P1       9  731  529    3536
P2       9  751  473    3536
P1       8  751  473    8969
P2       7  751  473  109170
Note, e.g., that point 226, 676 occurs twice: the first time
in line 5 and the second in line 2 (indices in var and lines).
To find indices of rows to drop, run:
toDrop = points[points.duplicated(subset=['x', 'y'])]\
    .level_1.reset_index(drop=True)
To see how this code works, run it step by step and
inspect the results of each step.
The result is:
0    2
1    3
2    6
3    8
4    7
Name: level_1, dtype: int64
Note that the left column above is only the index (it doesn't matter).
The real information is in the right column (the values).
To show lines that should be left, run:
result = lines.drop(toDrop)
getting:
     x    y    x    y    lgth
0  551  752  541  730     584
1  548  738  723  548   66725
4  345  678  388  674    1865
5  249  679  226  676     538
9  731  529  751  473    3536
The above result doesn't contain e.g.:
line 2, since point 226, 676 already occurred in line 5,
line 3, since point 345, 678 already occurred in line 4.
Just these lines (2 and 3) have been dropped, because each is longer
than the other line sharing its point (see the earlier partial results).
Maybe this is enough; but if you need to drop the "duplicated" lines from
var (the original Numpy array) and save the result in another
variable, run:
var2 = np.delete(var, toDrop, axis=0)

Selecting Column from pandas Series

I have a Series named 'graph' in pandas that looks like this:
Wavelength
450 37
455 31
460 0
465 0
470 0
475 0
480 418
485 1103
490 1236
495 894
500 530
505 85
510 0
515 168
520 0
525 0
530 691
535 842
540 5263
545 4738
550 6237
555 1712
560 767
565 620
570 0
575 757
580 1324
585 1792
590 659
595 1001
600 601
605 823
610 0
615 134
620 3512
625 266
630 155
635 743
640 648
645 0
650 583
Name: A1, dtype: object
I am graphing the curve using graph.plot().
The goal is to smooth the curve. I was trying to use savgol_filter, but to do that I need to separate my series into x & y columns. As of right now, I can access the "Wavelength" column using graph.index, but I can't grab the next column to assign it as y.
I've tried using iloc and loc and haven't had any luck yet.
Any tips or new directions to try?
You don't need to pass an x and a y to savgol_filter. You just need the y values, which get passed automatically when you pass graph to it. What you are missing are the window size and polynomial order parameters that define the smoothing.
from scipy.signal import savgol_filter
import pandas as pd
# I passed `graph` but I could've passed `graph.values`
# It is `graph.values` that will get used in the filtering
pd.Series(savgol_filter(graph, 7, 3), graph.index).plot()
To address some other points of misunderstanding
graph is a pandas.Series and NOT a pandas.DataFrame. A pandas.DataFrame can be thought of as a pandas.Series of pandas.Series.
So you access the index of the series with graph.index and the values with graph.values.
You could have also done
import matplotlib.pyplot as plt
plt.plot(graph.index, savgol_filter(graph.values, 7, 3))
As you are using a Series instead of a DataFrame, some libraries cannot access the index to use it as a column. Use:
df = graph.reset_index()
It will convert the index into an extra column that you can use with savgol_filter or anything else.
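For example (a short stand-in for the question's Series, with the object dtype cast to float before filtering):
import pandas as pd
from scipy.signal import savgol_filter

# Abbreviated version of the Series from the question.
graph = pd.Series([37, 31, 0, 0, 0, 0, 418, 1103, 1236, 894],
                  index=pd.Index(range(450, 500, 5), name='Wavelength'),
                  name='A1', dtype=object)

df = graph.reset_index()  # columns: 'Wavelength' and 'A1'
x = df['Wavelength']
y_smooth = savgol_filter(df['A1'].astype(float), 7, 3)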
