Writing a loop that chooses random values from columns

I created a 2D array with shape (170, 10):
import numpy as np

i = np.arange(1, 1701).reshape(170, 10)
I want to write a loop that chooses 5 random values without replacement from each column (or from n columns) of i and outputs them as below:
Group 1: [ 7 37 124 41 17]
Group 2: [302 261 257 323 234]
Group 3: [464 486 463 440 474]
So far I can pull random values from a single column:
print(np.random.choice(i[:, 0], 5, replace=False))
How do I put this in a for loop that pulls from n columns and displays the output as above (I'll use .format for that)?

Let's use shape, arange, and random.choice:
for c in np.arange(i.shape[1]):
    print('Group {}: '.format(c + 1) + str(np.random.choice(i[:, c], 5, replace=False)))
Output:
Group 1: [1521 231 671 801 711]
Group 2: [ 612 192 1172 1242 1322]
Group 3: [ 543 213 1453 723 973]
Group 4: [ 404 1334 474 294 1044]
Group 5: [1615 1455 1025 1665 1395]
Group 6: [1116 1336 1086 1626 536]
Group 7: [367 347 887 297 237]
Group 8: [1088 1188 1288 58 608]
Group 9: [1439 1289 869 349 1589]
Group 10: [ 340 1260 880 1700 520]

for n in xrange(np.shape(i)[-1]):
    print np.random.choice(i[:, n], 5, replace=False)
n here is the column index;
np.shape(i)[-1] gives the total number of columns;
xrange is a lazy range object that yields each column index through the loop.
The print statement is reused from the question, except that instead of only the first column it goes through all columns, thanks to the for loop.
(Thanks rene for pointing me to the right way of doing things.)
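On a recent NumPy (1.20 or later) there is also a loop-free alternative (a sketch of my own, not from the answers above): Generator.permuted shuffles each column independently, so the first 5 rows of the shuffled array give 5 values without replacement from every column:
rng = np.random.default_rng()
sample = rng.permuted(i, axis=0)[:5]   # each column shuffled independently
for c in range(sample.shape[1]):
    print('Group {}: {}'.format(c + 1, sample[:, c]))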

Related

Keep one line out of many that start from one point

I am working on a project with OpenCV and Python but am stuck on this small problem.
I have the end-point coordinates of many lines stored in a list. Sometimes more than one line is detected starting from a single point. Among these lines, I want to keep the line of shortest length and eliminate all the others, so that my image contains no point from which more than one line is drawn.
My variable storing the information (coordinates of both end-points) of all the lines initially detected is as follows:
var = [[Line1_EndPoint1, Line1_EndPoint2],
       [Line2_EndPoint1, Line2_EndPoint2],
       [Line3_EndPoint1, Line3_EndPoint2],
       [Line4_EndPoint1, Line4_EndPoint2],
       [Line5_EndPoint1, Line5_EndPoint2]]
where LineX_EndPointY (line number X, endpoint Y of that line) is of type [x, y], with x and y being the coordinates of that point in the image.
Can someone suggest how to solve this problem?
You may modify the way the line data is stored. If you do, please explain your data structure and how it is created.
Example of such data:
[[[551, 752], [541, 730]],
[[548, 738], [723, 548]],
[[285, 682], [226, 676]],
[[416, 679], [345, 678]],
[[345, 678], [388, 674]],
[[249, 679], [226, 676]],
[[270, 678], [388, 674]],
[[472, 650], [751, 473]],
[[751, 473], [716, 561]],
[[731, 529], [751, 473]]]
Python code would be appreciated.
A Numpy solution
The same result as in my first answer can be achieved based solely
on Numpy.
First define two functions:
Compute square of the length of a line:
def sqLgth(line):
    p1, p2 = line
    return (p1[0] - p2[0]) ** 2 + (p1[1] - p2[1]) ** 2
Convert a vector (a 1D array) to a column array (a 2D array with a single column):
def toColumn(tbl):
    return tbl.reshape(-1, 1)
Both will be used later.
Then proceed as follows:
Get the number of lines (var is assumed to be a Numpy array here, of shape (10, 2, 2)):
lineNo = var.shape[0]
Generate line indices (the content of lineInd column in points array
(will be created later)):
lineInd = np.repeat(np.arange(lineNo), 2)
Generate "origin indicators" (1 - start, 2 - end), to ease analysis
of any intermediate printouts:
origin = np.tile(np.array([1, 2]), lineNo)
Compute line lengths (the content of lgth column in points):
lgth = np.repeat([ sqLgth(line) for line in var ], 2)
Create a list of points with some additional data (consecutive
columns contain origin, lineInd, x, y and lgth):
points = np.hstack([toColumn(origin), toColumn(lineInd),
    var.reshape(-1, 2), toColumn(lgth)])
Compute the "criterion array" to sort:
r = np.core.records.fromarrays(points[:, 2:].transpose(),
    names='x, y, lgth')
Sort points (by x, y and lgth):
points = points[r.argsort()]
Compute "inverse unique indices" to points:
_, inv = np.unique(points[:,2:4], axis=0, return_inverse=True)
Shift inv by 1 position:
rInv = np.roll(inv,1)
Will be used in the next step, to get the previous element.
Generate a list of line indices to drop:
toDrop = points[[i for i in range(2 * lineNo)
    if inv[i] == rInv[i]], 1]
Row indices (in points array) are indices of repeated points (elements
in inv equal to the previous element).
Column index (1) - specifies lineInd column.
The whole result (toDrop) is a list of indices of "owning" lines
(containing the repeated points).
Generate the result: var stripped from lines selected in the
previous step:
var2 = np.delete(var, toDrop, axis=0)
To print the reduced list of lines, you can run:
for line in var2:
print(f'{line[0]}, {line[1]}')
The result is:
[551 752], [541 730]
[548 738], [723 548]
[345 678], [388 674]
[249 679], [226 676]
[731 529], [751 473]
To fully comprehend how this code works:
execute each step separately,
print the result,
compare it with printouts from previous steps.
Sometimes it is instructive to print separately even some expressions
(parts of instructions), e.g. var.reshape(-1, 2) - converting your
var (of shape (10, 2, 2)) into a 2D array of points (each row is
a point).
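For convenience, here is a runnable sketch collecting all the steps above in one place (assuming var is the (10, 2, 2) Numpy array built from the example data):
import numpy as np

var = np.array([[[551, 752], [541, 730]], [[548, 738], [723, 548]],
                [[285, 682], [226, 676]], [[416, 679], [345, 678]],
                [[345, 678], [388, 674]], [[249, 679], [226, 676]],
                [[270, 678], [388, 674]], [[472, 650], [751, 473]],
                [[751, 473], [716, 561]], [[731, 529], [751, 473]]])

def sqLgth(line):
    p1, p2 = line
    return (p1[0] - p2[0]) ** 2 + (p1[1] - p2[1]) ** 2

def toColumn(tbl):
    return tbl.reshape(-1, 1)

lineNo = var.shape[0]
lineInd = np.repeat(np.arange(lineNo), 2)            # owning line of each point
origin = np.tile(np.array([1, 2]), lineNo)           # 1 - start, 2 - end
lgth = np.repeat([sqLgth(line) for line in var], 2)  # squared length, per point
points = np.hstack([toColumn(origin), toColumn(lineInd),
                    var.reshape(-1, 2), toColumn(lgth)])
r = np.core.records.fromarrays(points[:, 2:].transpose(), names='x, y, lgth')
points = points[r.argsort()]                         # sort by x, y, lgth
_, inv = np.unique(points[:, 2:4], axis=0, return_inverse=True)
rInv = np.roll(inv, 1)                               # previous element of inv
toDrop = points[[i for i in range(2 * lineNo) if inv[i] == rInv[i]], 1]
var2 = np.delete(var, toDrop, axis=0)                # lines that remain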
The whole result is of course just the same as in my first answer, but since you wrote that you have little experience with Pandas, you can now compare both methods and see the cases where Pandas lets you do something more easily and intuitively.
Good examples are sorting by some columns or finding duplicated rows.
In Pandas each is a matter of a single instruction with suitable parameters, whereas in Numpy you have to use more instructions and know various details and tricks to achieve the same thing.
I decided that it is easier to write a solution based on Pandas, for two reasons:
I can use column names (the code is more readable),
the Pandas API is more powerful, although it runs slower than "pure" Numpy.
Proceed as follows:
Convert var to a DataFrame:
import pandas as pd

lines = pd.DataFrame(var.reshape(10, 4), columns=pd.MultiIndex.from_product(
    (['P1', 'P2'], ['x', 'y'])))
The initial part of lines is:
P1 P2
x y x y
0 551 752 541 730
1 548 738 723 548
2 285 682 226 676
3 416 679 345 678
Compute the square of the length of each line:
lines[('', 'lgth')] = (lines[('P1', 'x')] - lines[('P2', 'x')]) ** 2 \
    + (lines[('P1', 'y')] - lines[('P2', 'y')]) ** 2
lines.columns = lines.columns.droplevel()
I deliberately "stopped" at squares of the lengths, because they are enough to compare lengths (computing the root would not change the result of the comparison).
Note also that the first level of the MultiIndex on columns was needed
only to easier express the columns of interest. Further on they will
not be needed, so I dropped it.
This time I put the full content of lines:
x y x y lgth
0 551 752 541 730 584
1 548 738 723 548 66725
2 285 682 226 676 3517
3 416 679 345 678 5042
4 345 678 388 674 1865
5 249 679 226 676 538
6 270 678 388 674 13940
7 472 650 751 473 109170
8 751 473 716 561 8969
9 731 529 751 473 3536
The next step is to compute points DataFrame, where all points (start and
end of each line) are in the same columns, along with the (square) length of
the corresponding line:
points = pd.concat([lines.iloc[:, [0, 1, 4]],
    lines.iloc[:, [2, 3, 4]]], keys=['P1', 'P2'])\
    .sort_values(['x', 'y', 'lgth']).reset_index(level=1)
This time I used iloc to specify columns (first for the starting points, then for the ending points).
To make this DataFrame easier to read, I passed keys to include the "origin indicators", and then I sorted the rows.
The content is:
level_1 x y lgth
P2 5 226 676 538
P2 2 226 676 3517
P1 5 249 679 538
P1 6 270 678 13940
P1 2 285 682 3517
P1 4 345 678 1865
P2 3 345 678 5042
P2 4 388 674 1865
P2 6 388 674 13940
P1 3 416 679 5042
P1 7 472 650 109170
P2 0 541 730 584
P1 1 548 738 66725
P1 0 551 752 584
P2 8 716 561 8969
P2 1 723 548 66725
P1 9 731 529 3536
P2 9 751 473 3536
P1 8 751 473 8969
P2 7 751 473 109170
Note e.g. that point 226, 676 occurs twice. The first time it occurred
in line 5 and the second in line 2 (indices in var and lines).
To find indices of rows to drop, run:
toDrop = points[points.duplicated(subset=['x', 'y'])]\
    .level_1.reset_index(drop=True)
To more easily comprehend how this code works, run it step by step and inspect the result of each step.
The result is:
0 2
1 3
2 6
3 8
4 7
Name: level_1, dtype: int64
Note that the left column above is only the index (it doesn't matter here).
The real information is in the right column (the values).
To show the lines that should be kept, run:
result = lines.drop(toDrop)
getting:
x y x y lgth
0 551 752 541 730 584
1 548 738 723 548 66725
4 345 678 388 674 1865
5 249 679 226 676 538
9 731 529 751 473 3536
The above result doesn't contain e.g.:
line 2, as point 226, 676 already occurred in line 5,
line 3, as point 345, 678 already occurred in line 4.
Just these lines (2 and 3) have been dropped, because each is longer than the other line containing the shared point (see the earlier partial results).
Maybe this is enough; but if you also need to drop the "duplicated" lines from var (the original Numpy array) and save the result in another variable, run:
var2 = np.delete(var, toDrop, axis=0)
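Collected in one place, a minimal runnable sketch of this Pandas approach (assuming var is the same (10, 2, 2) Numpy array as in the Numpy sketch above):
import numpy as np
import pandas as pd

lines = pd.DataFrame(var.reshape(10, 4), columns=pd.MultiIndex.from_product(
    (['P1', 'P2'], ['x', 'y'])))
lines[('', 'lgth')] = (lines[('P1', 'x')] - lines[('P2', 'x')]) ** 2 \
    + (lines[('P1', 'y')] - lines[('P2', 'y')]) ** 2
lines.columns = lines.columns.droplevel()

# gather start and end points into the same columns, with each line's squared length
points = pd.concat([lines.iloc[:, [0, 1, 4]], lines.iloc[:, [2, 3, 4]]],
                   keys=['P1', 'P2']).sort_values(['x', 'y', 'lgth'])\
                   .reset_index(level=1)

# a point seen before belongs to the longer of the lines sharing it: drop that line
toDrop = points[points.duplicated(subset=['x', 'y'])].level_1.reset_index(drop=True)
result = lines.drop(toDrop)
var2 = np.delete(var, toDrop, axis=0)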

How to use certain rows of a dataframe in a formula

So I have multiple data frames, and all of them need the same kind of formula applied to certain sets within them. I have the locations of the sets inside the df, but I don't know how to access those sets.
This is my code:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt  # might need it later to check the output

df = pd.read_csv('Dalfsen.csv')

l = []
x = []
y = []

# the formula (trendline)
def rechtzetten(x, y):
    a = (len(x)*sum(x*y) - sum(x)*sum(y)) / (len(x)*sum(x**2) - sum(x)**2)
    b = (sum(y) - a*sum(x)) / len(x)
    y1 = x*a + b
    print(y1)

METING = df.ID.str.contains("<METING>")   # locating the sets
indicatie = np.where(METING == False)[0]  # and saving them somewhere

if n in df[n] != indicatie & n+1 != indicatie:  # attempt to add parts of the set to l
    append.l
elif n in df[n] != indicatie & n+1 == indicatie:  # attempt to define the end of the set and apply the formula to it
    append.l
    rechtzetten(l.x, l.y)
else:  # emptying the storage for the new set
    l = []
indicatie has the following numbers:
0 12 13 26 27 40 41 53 54 66 67 80 81 94 95 108 109 121
122 137 138 149 150 162 163 177 178 190 191 204 205 217 218 229 230 242
243 255 256 268 269 291 292 312 313 340 341 373 374 401 402 410 411 420
421 430 431 449 450 468 469 487 488 504 505 521 522 538 539 558 559 575
576 590 591 604 605 619 620 633 634 647
Because my df looks like this:
ID,NUM,x,y,nap,abs,end
<PROFIEL>not used data
<METING>data</METING>
<METING>data</METING>
...
<METING>data</METING>
<METING>data</METING>
</PROFIEL>,,,,,,
<PROFIEL>not usde data
...
</PROFIEL>,,,,,,
tl;dr: I'm trying to apply the formula within each profile as shown above. I want to edit the data between two consecutive numbers of the list indicatie.
For example:
call rechtzetten(x, y) with df.x and df.y sliced to [1:11] (because [0] and [12] are in the list indicatie), and then the same for [14:25], etc.
What I try to avoid is typing the following hundreds of times manually:
x_#=df.x[1:11]
y_#=df.y[1:11]
rechtzetten(x_#,y_#)
I can't understand your question clearly, but if you want to replace a specific column of your pandas dataframe with a numpy array, you can simply assign it:
df['Column'] = numpy_array
Can you be more clear?
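As for the slicing itself, a minimal sketch (an addition of mine, assuming indicatie is the marker array shown above and that rechtzetten accepts pandas Series): loop over consecutive pairs of markers and apply the formula to the rows strictly between them:
for start, end in zip(indicatie, indicatie[1:]):
    seg = df.iloc[start + 1:end]   # rows strictly between two markers
    if len(seg) > 1:               # skip empty or single-point segments
        rechtzetten(seg.x, seg.y)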

How can I read a file with a different number of columns in each row?

My data looks like this:
0 199 1028 251 1449 847 1483 1314 23 1066 604 398 225 552 1512 1598
1 1214 910 631 422 503 183 887 342 794 590 392 874 1223 314 276 1411
2 1199 700 1717 450 1043 540 552 101 359 219 64 781 953
10 1707 1019 463 827 675 874 470 943 667 237 1440 892 677 631 425
How can I read this file structure in Python? I want to extract a specific column from specific rows. For example, if I want to extract the value in the second row, second column, how can I do that? I've tried loadtxt with a string data type, but that requires slicing the strings by index, which I could not get to work because the numbers have different numbers of digits. Moreover, each row has a different number of columns. Can you guys help me?
Thanks in advance.
Use something like this to split it (txt holds the file contents):
split2 = []
split1 = txt.split("\n")
for item in split1:
    split2.append(item.split(" "))
I have stored the given data in "data.txt". Try the code below:
res = []
arr = []
lines = open('data.txt').read().split("\n")
for i in lines:
    arr = i.split(" ")
    res.append(arr)
for i in range(len(res)):
    for j in range(len(res[i])):
        print(res[i][j], sep=' ', end=' ', flush=True)
    print()
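For the concrete lookup asked about, a shorter sketch (my own, assuming the data sits in data.txt): split() without an argument also copes with runs of spaces, and the ragged rows can then be indexed directly:
rows = [line.split() for line in open('data.txt')]
print(rows[1][1])   # second row, second column -> '1214' for the data above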

Sum of specific rows in a dataframe (Pandas)

I'm given a set of the following data:
week A B C D E
1 243 857 393 621 194
2 644 576 534 792 207
3 946 252 453 547 436
4 560 100 864 663 949
5 712 734 308 385 303
I’m asked to find the sum of each column for specified rows/a specified number of weeks, and then plot those numbers onto a bar chart to compare A-E.
Assuming I have the rows I need (e.g. df.iloc[2:4, :]), what should I do next? My assumption is that I need to produce a single row containing the sum of each column, but I'm not sure how to go about that.
I know how to do the final step (i.e. .plot(kind='bar')); I just need to know the middle step that obtains the sums I need.
You can select by position with iloc, then use sum and Series.plot.bar:
df.iloc[2:4].sum().plot.bar()
Or, if you want to select by index labels (here the weeks), use loc:
df.loc[2:4].sum().plot.bar()
The difference is that iloc excludes the last position:
print (df.loc[2:4])
A B C D E
week
2 644 576 534 792 207
3 946 252 453 547 436
4 560 100 864 663 949
print (df.iloc[2:4])
A B C D E
week
3 946 252 453 547 436
4 560 100 864 663 949
And if you also need to filter columns by position:
df.iloc[2:4, :4].sum().plot.bar()
And by labels (week numbers for the rows, letters for the columns):
df.loc[2:4, list('ABCD')].sum().plot.bar()
All you need to do is call .sum() on your subset of the data:
df.iloc[2:4,:].sum()
Returns:
week 7
A 1506
B 352
C 1317
D 1210
E 1385
dtype: int64
Furthermore, for plotting, I think you can probably get rid of the week column (as the sum of week numbers is unlikely to mean anything):
df.iloc[2:4,1:].sum().plot(kind='bar')
# or
df[list('ABCDE')].iloc[2:4].sum().plot(kind='bar')
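For reference, a minimal reproducible setup for the table above (a sketch assuming week is the index, as the first answer does):
import pandas as pd

df = pd.DataFrame({'A': [243, 644, 946, 560, 712],
                   'B': [857, 576, 252, 100, 734],
                   'C': [393, 534, 453, 864, 308],
                   'D': [621, 792, 547, 663, 385],
                   'E': [194, 207, 436, 303, 949]},
                  index=pd.Index(range(1, 6), name='week'))
df.loc[2:4].sum().plot.bar()   # sum weeks 2-4 for each of A-E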

Finding the rows of data in a database

I'm working on my Python script in which I want to count through a database in steps of 69 rows to get a list of values. I need some help with my code because some of it is wrong.
When I try this:
test = con.cursor()
test.execute("SELECT COUNT(*) FROM programs")
x = test.fetchone()[0]
running_total = 1  # Again, you want to start counting at 1 for some reason
while running_total < x:
    running_total += 69
    print running_total
I will get the result like this:
70
139
208
277
346
415
484
553
622
691
760
829
898
967
1036
1105
Here is what I want to achieve:
1
70
139
208
277
346
415
484
553
622
691
760
829
898
967
1036
I want to output the values starting with 1, then 70, 139, 208, etc., until I reach the last 69 rows in the database. Example: I have 1104 rows in my database, so the last value I want is 1036, and the last row 1104 is ignored.
Another example: if I add another 69 rows of data, making 1173 rows in total, the last value I want is 1104, and the last row 1173 is ignored. It depends on how many rows of data I add to the database.
Can you please help me output the values that start with 1 and add 69 each time (70, 139, 208, etc.) until the last 69 rows, ignoring the last row?
You should make use of the range/xrange function:
range (Python 3) and xrange (Python 2) have the form range(start, end, increment), where start is inclusive and end is exclusive.
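For instance, a quick sketch of the stepping with small numbers:
>>> list(range(1, 11, 3))   # start 1 inclusive, end 11 exclusive, step 3
[1, 4, 7, 10]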
test = con.cursor()
test.execute("SELECT COUNT(*) FROM programs")
x = test.fetchone()[0]
for running_total in xrange(1, x + 1, 69):
    print(running_total)
This should get you the desired output without resorting to C-style incrementing.
