I want to send an array of size 336 in split parts to my 8 workers. I want workers 0-7 to get the sizes 12, 18, 30, 36, 48, 54, 66 and 72. So add 6, then 12, then 6, and so on. Up to this point I have only been able to cut the array into pieces of 10.
This is what I came up with:
from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
rank = comm.Get_rank()
size = comm.Get_size()

v = np.random.rand(100, 1)  # array

if rank == 0:
    # Master process: send even chunks of data to the different processes.
    for i in range(1, size):
        v_splitted = [np.array_split(v, 10)[i - 1]]
        comm.send(v_splitted, dest=i, tag=i)
# worker processes
else:
    # each worker process receives data from the master process
    data = comm.recv(source=0, tag=rank)
How do I make sure that each worker gets the desired size?
You can use accumulate from itertools together with zip to build a list of slices, then use that to break your array down into chunks of the desired sizes:
from itertools import accumulate

sizes = [12, 18, 30, 36, 48, 54, 66, 72]  # or [*accumulate([12, 6] * 4)]
breaks = [*accumulate(sizes)]
slices = [slice(s, e) for s, e in zip([0] + breaks, breaks)]

v = list(range(336))
for i, chunk in enumerate(slices):
    print(len(v[chunk]), ":", *v[chunk][:3], "...", *v[chunk][-3:])
    # comm.send(v[chunk], dest=i, tag=i)
output:
12 : 0 1 2 ... 9 10 11
18 : 12 13 14 ... 27 28 29
30 : 30 31 32 ... 57 58 59
36 : 60 61 62 ... 93 94 95
48 : 96 97 98 ... 141 142 143
54 : 144 145 146 ... 195 196 197
66 : 198 199 200 ... 261 262 263
72 : 264 265 266 ... 333 334 335
How it works
The breaks list contains the cumulative numbers of items that are processed at the end of each chunk:
[12, 30, 60, 96, 144, 198, 264, 336]
These numbers correspond to the end of index ranges that would represent each chunk of data. To obtain the start of these ranges, we simply need to pair each end value with the end value of the preceding chunk (the first chunk starting at zero):
starts (s): [0] + [12, 30, 60, 96, 144, 198, 264, 336]
ends   (e):       [12, 30, 60, 96, 144, 198, 264, 336]
ranges:     (0, 12), (12, 30), (30, 60) ... (264, 336)
This is what the slices variable will contain, except that, to make later usage easier, it holds slice() objects rather than range() objects. The slice objects can be used directly as subscripts on the list or array containing the data (e.g. v[chunk]). The zip() function is used here to pair each end value with the previous end value (i.e. the start), which is obtained by prepending one extra entry (zero) to the list of breaks.
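Hooked back into the MPI code from the question, a minimal sketch could look like the following (it assumes the job is launched with 9 processes, one master plus 8 workers, and an array with 336 rows; the variable names are only illustrative):
from itertools import accumulate
from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

sizes = [*accumulate([12, 6] * 4)]                  # [12, 18, 30, 36, 48, 54, 66, 72]
breaks = [*accumulate(sizes)]
slices = [slice(s, e) for s, e in zip([0] + breaks, breaks)]

if rank == 0:
    v = np.random.rand(336, 1)                      # master owns the full array
    for i, chunk in enumerate(slices):
        comm.send(v[chunk], dest=i + 1, tag=i + 1)  # workers are ranks 1..8
else:
    data = comm.recv(source=0, tag=rank)
    print(rank, data.shape)                         # e.g. rank 1 gets (12, 1)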
Related
I have an array which is 1 -> 160. I want to split this into 10 arrays that are split every sixteen numbers. This is what I have so far:
import numpy as np

amplitude = []
for i in range(0, 160):
    amplitude.append(i + 1)
print(amplitude)

# split arrays up into a line for each sample
traceno = 10    # number of traces in file
samplesno = 16  # number of samples in each trace. This won't change.
amplitude_split = np.zeros((traceno, samplesno), dtype=int)

# fill in the arrays with amplitude/sample numbers
for i in range(len(amplitude)):
    for j in range(traceno):
        for k in range(samplesno):
            amplitude_split[j, k] = amplitude[i]
print(amplitude_split[1, :])
As an output I only get [160 160 160 160 160 160 160 160 160 160 160 160 160 160 160 160]
Where I require something along the lines of:
[1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16]
[17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32]
etc...
You are nesting the loops, so you keep overwriting the new array with the same number from the first one and end up with the last value, 160, repeated everywhere.
You only need to copy the list into a 1D numpy array, and then reshape it:
amplitude_split = np.array(amplitude, dtype=int).reshape((traceno, samplesno))
Well, if we're using Numpy arrays, we can use Numpy functionality:
amplitude = np.arange(1, 161)
amplitude_split = amplitude.reshape(10, 16)
Otherwise, you've already been linked to how to do it for plain lists, but I'd like to point out that you still don't need a loop to fill amplitude in the first place:
amplitude = list(range(1, 161))
In general, with Python you should try hard not to think in terms of starting with an initially blank "storage" area that you then fill in. Just create the data you want directly - by conversions of the sort above, by list comprehensions, etc., or if necessary by .append()-ing - rather than overwriting a dummy value.
See grouper in https://docs.python.org/2/library/itertools.html#recipes
def grouper(iterable, n, fillvalue=None):
    "Collect data into fixed-length chunks or blocks"
    # grouper('ABCDEFG', 3, 'x') --> ABC DEF Gxx
    args = [iter(iterable)] * n
    return izip_longest(fillvalue=fillvalue, *args)
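For illustration, here is how that recipe could be applied to the 1-160 list from the question; note that on Python 3 the import is itertools.zip_longest rather than izip_longest (the names below are just an example):
from itertools import zip_longest  # izip_longest on Python 2

def grouper(iterable, n, fillvalue=None):
    "Collect data into fixed-length chunks or blocks"
    args = [iter(iterable)] * n
    return zip_longest(fillvalue=fillvalue, *args)

amplitude = list(range(1, 161))
amplitude_split = list(grouper(amplitude, 16))
print(amplitude_split[0])  # (1, 2, 3, ..., 16)
print(amplitude_split[1])  # (17, 18, 19, ..., 32)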
I am looking for a solution for the following problem and it just won't work the way I want to.
So my goal is to calculate a regression analysis and get the slope, intercept, rvalue, pvalue and stderr for multiple rows (this could go up to 10000). In this example, I have a file with 15 rows. Here are the first two rows:
array([
[ 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11,
12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22,
23, 24],
[ 100, 10, 61, 55, 29, 77, 61, 42, 70, 73, 98,
62, 25, 86, 49, 68, 68, 26, 35, 62, 100, 56,
10, 97]]
)
Full trial data set:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24
100 10 61 55 29 77 61 42 70 73 98 62 25 86 49 68 68 26 35 62 100 56 10 97
57 89 25 89 48 56 67 17 98 10 25 90 17 52 85 56 18 20 74 97 82 63 45 87
192 371 47 173 202 144 17 147 174 483 170 422 285 13 77 116 500 136 276 392 220 121 441 268
The first row is the x-variable and this is the independent variable. This has to be kept fixed while iterating over every following row.
For the following row, the y-variable and thus the dependent variable, I want to calculate the slope, intercept, rvalue, pvalue and stderr and have them in a dataframe (if possible added to the same dataframe, but this is not necessary).
I tried the following code:
import pandas as pd
import scipy.stats
import numpy as np

df = pd.read_excel("Directory\\file.xlsx")

def regr(row):
    r = scipy.stats.linregress(df.iloc[1:, :], row)
    return r

full_dataframe = None
for index, row in df.iterrows():
    x = regr(index)
    if full_dataframe is None:
        full_dataframe = x.T
    else:
        full_dataframe = full_dataframe.append([x.T])

full_dataframe.to_excel('Directory\\file.xlsx')
But this fails and gives the following error:
ValueError: all the input array dimensions except for the concatenation axis
must match exactly
I'm really lost here.
So, I want to achieve that I have the slope, intercept, pvalue, rvalue and stderr per row, starting from the second one, because the first row is the x-variable.
Does anyone have an idea HOW to do this, WHY mine isn't working, and WHAT the code should look like?
Thanks!!
Guessing the issue
Most likely, your problem is the format of your numbers: they are Unicode strings (dtype('<U21')) instead of integers or floats.
Always check types:
df.dtypes
Cast your dataframe using:
df = df.astype(np.float64)
Below a small example showing the issue:
import numpy as np
import pandas as pd
# DataFrame without numbers (will not work for Math):
df = pd.DataFrame(['1', '2', '3'])
df.dtypes # object: placeholder for everything that is not number or timestamps (string, etc...)
# Casting DataFrame to make it suitable for Math Operations:
df = df.astype(np.float64)
df.dtypes # float64
But it is difficult to be sure of this without having the original file or data you are working with.
Carefully read the Exception
This is coherent with the Exception you get:
TypeError: ufunc 'add' did not contain a loop with signature matching types
dtype('<U21') dtype('<U21') dtype('<U21')
The method scipy.stats.linregress raises a TypeError (so it is about type) and is telling you that it cannot perform the add operation, because adding strings (dtype('<U21')) does not make any sense in the context of a linear regression.
Understand the Design
Loading the data:
import io
fh = io.StringIO("""1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24
100 10 61 55 29 77 61 42 70 73 98 62 25 86 49 68 68 26 35 62 100 56 10 97
57 89 25 89 48 56 67 17 98 10 25 90 17 52 85 56 18 20 74 97 82 63 45 87
192 371 47 173 202 144 17 147 174 483 170 422 285 13 77 116 500 136 276 392 220 121 441 268""")
df = pd.read_fwf(fh).astype(float)
Then we can regress the second row vs the first:
scipy.stats.linregress(df.iloc[0,:].values, df.iloc[1,:].values)
It returns:
LinregressResult(slope=0.12419744768547877, intercept=49.60998434527584, rvalue=0.11461693561751324, pvalue=0.5938303095361301, stderr=0.22949908667668056)
Assembling all together:
result = pd.DataFrame(columns=["slope", "intercept", "rvalue"])
for i, row in df.iterrows():
    fit = scipy.stats.linregress(df.iloc[0, :], row)
    result.loc[i] = (fit.slope, fit.intercept, fit.rvalue)
Returns:
      slope   intercept    rvalue
0  1.000000    0.000000  1.000000
1  0.124197   49.609984  0.114617
2 -1.095801  289.293224 -0.205150
Which is, as far as I understand your question, what you expected.
The second exception you get comes from this line:
x = regr(index)
You sent the index of the row instead of the row itself to the regression method.
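Since the question also asks for pvalue and stderr, the same loop can collect all five statistics; a minimal sketch building on the result DataFrame above (and skipping the first row, which is the x-variable):
result = pd.DataFrame(columns=["slope", "intercept", "rvalue", "pvalue", "stderr"])
x = df.iloc[0, :]                       # the fixed independent variable
for i, row in df.iloc[1:].iterrows():   # regress every following row against it
    fit = scipy.stats.linregress(x, row)
    result.loc[i] = (fit.slope, fit.intercept, fit.rvalue, fit.pvalue, fit.stderr)
# result.to_excel(...) would then write it back out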
What I'm looking for
import numpy as np

# I have an array
x = np.arange(0, 100)
# I have a size n
n = 10
# I have a random set of numbers
indexes = np.random.randint(n, 100, 10)
# What I want is a matrix where every row i is the i-th element of indexes plus the previous n elements
res = np.empty((len(indexes), n), int)
for (i, v) in np.ndenumerate(indexes):
    res[i] = x[v - n:v]
To reformulate, as I wrote in the title, what I am looking for is a way to take multiple subsets (of the same size) of an initial array.
Just to add a detail: this loopy version works; I just want to know if there is a more elegant, numpy-ish way to achieve this.
The following does what you are asking for. It uses numpy.lib.stride_tricks.as_strided to create a special view on the data which can be indexed in the desired way.
import numpy as np
from numpy.lib import stride_tricks
x = np.arange(100)
k = 10
i = np.random.randint(k, len(x)+1, size=(5,))
xx = stride_tricks.as_strided(x, strides=np.repeat(x.strides, 2), shape=(len(x)-k+1, k))
print(i)
print(xx[i-k])
Sample output:
[ 69 85 100 37 54]
[[59 60 61 62 63 64 65 66 67 68]
[75 76 77 78 79 80 81 82 83 84]
[90 91 92 93 94 95 96 97 98 99]
[27 28 29 30 31 32 33 34 35 36]
[44 45 46 47 48 49 50 51 52 53]]
A bit of explanation. Arrays store not only data but also a small "header" with layout information. Amongst this are the strides, which tell how to translate linear memory to nd. There is a stride for each dimension, which is just the offset at which the next element along that dimension can be found. So the strides for a 2d array are (row offset, element offset). as_strided lets you manipulate an array's strides directly; by setting the row offset to the same value as the element offset we create a view that looks like
0 1 2 ...
1 2 3 ...
2 3 4
. .
. .
. .
Note that no data are copied at this stage; for example, all the 2s refer to the same memory location in the original array. That is why this solution should be quite efficient.
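For what it's worth, on NumPy 1.20 and later the same view can be built without setting strides by hand, using numpy.lib.stride_tricks.sliding_window_view; a small sketch:
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

x = np.arange(100)
k = 10
i = np.random.randint(k, len(x) + 1, size=(5,))

xx = sliding_window_view(x, k)  # shape (len(x) - k + 1, k), still a view, no copy
print(xx[i - k])                # same rows as with as_strided above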
I must open a file, compute the averages of each row and column, and then the max of the data sheet. The data is imported from a text file. When I am done with the program, it should look like an excel sheet, only printed on my terminal.
Data file must be seven across by six down.
88 90 94 98 100 110 120
75 77 80 86 94 103 113
80 83 85 94 111 111 121
68 71 76 85 96 122 125
77 84 91 102 105 112 119
81 85 90 96 102 109 134
Later, I will have to print the above data. The math is easy; my problem is selecting the numbers from the indexed list. Ex:
selecting index 0, 8, 16, 24, 32, 40. Which should be numbers 88, 75, 80, 68, 77, 81.
What I get when I input the index number is 0 = 8, 1 = 8, 2 = " "... etc.
What have I done wrong here? I have another program where I had typed the list directly into the code, which works as I wanted this to work. That program was using the index numbers to select a month: 0 = a blank index, 1 = January, 2 = February, etc.
I hope this example makes clear what I intended to do, but cannot seem to do. Again, the only difference between my months program and this program is that for the below code, I open a file to fill the list. Have I loaded the data poorly? Split and stripped the list poorly? Help is more useful than answers, as I can learn rather than be given the answer.
def main():
    print("Program to output a report of noise for certain model cars.")
    print("Written by censored.")
    print()
    fileName = input("Enter the name of the data file: ")
    infile = open(fileName, "r")
    infileData = infile.read()
    line = infileData
    #for line in infile:
    list = line.split(',')
    list = line.strip("\n")
    print(list)
    n = eval(input("Enter a index number: ", ))
    print("The index is", line[n] + ".")
    print("{0:>38}".format(str("Speed (MPH)")))
    print("{0:>6}".format(str("Car")), ("{0:>3}".format(str(":"))),
          ("{0:>6}".format(str("30"))), ("{0:>4}".format(str("40"))),
          ("{0:>4}".format(str("50"))))
main()
Thank you for your time.
You keep overwriting your variables, and I wouldn't recommend masking a built-in (list).
infileData = infile.read()
line = infileData
#for line in infile:
list = line.split(',')
list = line.strip("\n")
should be:
infileData = list(map(int, infile.read().strip().split()))
This reads the file contents into a string, strips off the leading and trailing whitespace, splits it on whitespace, maps each element to an int, and creates a list out of that.
Or:
stringData = infile.read()
stringData = stringData.strip()
stringData = stringData.split()
infileData = []
for item in stringData:
    infileData.append(int(item))
Storing each element as an integer lets you easily do calculations on it, such as if item > 65 or exceed = item - 65. When you want to treat them as strings, such as for printing, cast them as such with str():
print("The index is", str(infileData[n]) + ".")
Just to be clear, it looks like your data is space-separated, not comma-separated. So when you call
list = line.split(',')
the list looks like this,
['88 90 94 98 100 110 120 75 77 80 86 94 103 113 80 83 85 94 111 111 121 68 71 76 85 96 122 125 77 84 91 102 105 112 119 81 85 90 96 102 109 134']
So when you access list[0] you will get '8', not '88', and when you access list[2] you will get ' ', not '94'.
list = line.split() # this is what you should call (space-separated)
Again this answer is based on how your data is presented.
I have a PPM file that I need to do certain operations on. The file is structured as in the following example. The first line, 'P3', just says what kind of document it is. The second line gives the pixel dimensions of the image, so in this case it's telling us that the image is 480x640. The third line declares the maximum value any color can take. After that come the lines of pixel data. Every group of three integers gives the rgb value for one pixel. So in this example, the first pixel has rgb value 49, 49, 49. The second pixel has rgb value 48, 48, 48, and so on.
P3
480 640
255
49 49 49 48 48 48 47 47 47 46 46 46 45 45 45 42 42 42 38 38
38 35 35 35 23 23 23 8 8 8 7 7 7 17 17 17 21 21 21 29 29
29 41 41 41 47 47 47 49 49 49 42 42 42 33 33 33 24 24 24 18 18
...
Now as you may notice, this particular picture is supposed to be 640 pixels wide which means 640*3 integers will provide the first row of pixels. But here the first row is very, very far from containing 640*3 integers. So the line-breaks in this file are meaningless, hence my problem.
The main way to read Python files is line-by-line. But I need to collect these integers into groups of 640*3 and treat that like a line. How would one do this? I know I could read the file in line-by-line and append every line to some list, but then that list would be massive and I would assume that doing so would place an unacceptable burden on a device's memory. But other than that, I'm out of ideas. Help would be appreciated.
To read three space-separated words at a time from a file:
with open(filename, 'rb') as file:
    kind, dimensions, max_color = map(next, [file]*3)  # read the 3 header lines
    rgbs = zip(*[(int(word) for line in file for word in line.split())] * 3)
Output
[(49, 49, 49),
(48, 48, 48),
(47, 47, 47),
(46, 46, 46),
(45, 45, 45),
(42, 42, 42),
...
See What is the most “pythonic” way to iterate over a list in chunks?
To avoid creating the whole list at once, you could use itertools.izip() (zip() in Python 3), which would let you read one rgb value at a time.
Probably not the most 'pythonic' way but...
Iterate through the lines containing integers.
Keep four counters: color_code_count (counts to 3), numbers_processed (counts to 1920), col (0-639), and row (0-479).
For each integer you encounter, add it to a temporary list at index color_code_count. Increment color_code_count and numbers_processed.
Once color_code_count is 3, take your temporary list and create a 3-tuple or triplet (your structure will look like (49, 49, 49) for the first pixel), and add that to a structure of 640 columns and 480 rows - insert your (49, 49, 49) into pixels[col][row].
Increment col.
Reset color_code_count.
'numbers_processed' will continue to increment until you get to 1920.
Once you hit 1920, you've reached the end of the first row.
Reset numbers_processed and col to zero, increment row by 1.
By this point, you should have 640 tuple3s or triplets in the row zero starting with (49,49,49), (48, 48, 48), (47, 47, 47), etc. And you're now starting to insert pixel values in row 1 column 0.
Like I said, probably not the most 'pythonic' way. There are probably better ways of doing this using join and map, but I think this might work. This 'solution', if you want to call it that, shouldn't care about the number of integers on any line, since you're keeping count of how many numbers you expect to run through (1920) before you start a new row.
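A rough sketch of that counting approach might look like this (the 640x480 dimensions are taken from the question; lines_with_integers is a hypothetical variable holding the data lines after the three header lines):
width, height = 640, 480                    # pixels per row, number of rows
pixels = [[None] * width for _ in range(height)]

temp = []                                   # collects up to 3 color codes
col = row = numbers_processed = 0
for line in lines_with_integers:            # hypothetical: the data lines after the header
    for word in line.split():
        temp.append(int(word))
        numbers_processed += 1
        if len(temp) == 3:                  # one complete (r, g, b) triplet
            pixels[row][col] = tuple(temp)
            temp = []
            col += 1
        if numbers_processed == width * 3:  # 1920: end of one pixel row
            numbers_processed = 0
            col = 0
            row += 1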
A possible way to go through each word is to iterate through each line and then .split() it into words.
the_file = open("file.txt", "r")
for line in the_file:
    for word in line.split():
        #-----Your Code-----
From there you can do whatever you want with your "words." You can add if-statements to check if there are numbers in each line with: (Though not very pythonic)
for line in the_file:
    if "1" not in line or "2" not in line ...:
        for word in line.split():
            #-----Your Code-----
Or you can test if there is anything in each line: (Much more pythonic)
for line in the_file:
    for word in line.split():
        if len(word) != 0 or word != "\n":
            #-----Your Code-----
I would recommend adding each of your new "lines" to a new document.
I am a C programmer. Sorry if this code looks like C Style:
f = open("pixel.ppm", "r")
type = f.readline()
height, width = f.readline().split()
height, width = int(height), int(width)
max_color = int(f.readline());

colors = []
count = 0
col_count = 0
line = []
while(col_count < height):
    count = 0
    i = 0
    row = []
    while(count < width * 3):
        temp = f.readline().strip()
        if(temp == ""):
            col_count = height
            break
        temp = temp.split()
        line.extend(temp)
        i = 0
        while(i + 2 < len(line)):
            row.append({'r': int(line[i]), 'g': int(line[i+1]), 'b': int(line[i+2])})
            i = i + 3
            count = count + 3
            if(count >= width * 3):
                break
        if(i < len(line)):
            line = line[i:len(line)]
        else:
            line = []
    col_count += 1
    colors.append(row)

for row in colors:
    for rgb in row:
        print(rgb)
    print("\n")
You can tweak this according to your needs. I tested it on this file:
P4
3 4
256
4 5 6 4 7 3
2 7 9 4
2 4
6 8 0
3 4 5 6 7 8 9 0
2 3 5 6 7 9 2
2 4 5 7 2
2
This seems to do the trick:
from re import findall

def _split_list(lst, i):
    return lst[:i], lst[i:]

def iter_ppm_rows(path):
    with open(path) as f:
        ftype = f.readline().strip()
        h, w = (int(s) for s in f.readline().split(' '))
        maxcolor = int(f.readline())
        rlen = w * 3
        row = []
        next_row = []
        for line in f:
            line_ints = [int(i) for i in findall(r'\d+\s+', line)]
            if not row:
                row, next_row = _split_list(line_ints, rlen)
            else:
                rest_of_row, next_row = _split_list(line_ints, rlen - len(row))
                row += rest_of_row
            if len(row) == rlen:
                yield row
                row = next_row
                next_row = []
It isn't very pretty, but it allows for varying whitespace between numbers in the file, as well as varying line lengths.
I tested it on a file that looked like the following:
P3
120 160
255
0 1 2 3 4 5 6 7
8 9 10 11 12 13
14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34
[...]
9993 9994 9995 9996 9997 9998 9999
That file used random line lengths, but printed numbers in order so it was easy to tell at what value the rows began and stopped. Note that its dimensions are different than in the question's example file.
Using the following test code...
for row in iter_ppm_rows('mock_ppm.txt'):
    print(len(row), row[0], row[-1])
...the result was the following, which suggests it is not skipping over any data and is returning rows of the right size.
480 0 479
480 480 959
480 960 1439
480 1440 1919
480 1920 2399
480 2400 2879
480 2880 3359
480 3360 3839
480 3840 4319
480 4320 4799
480 4800 5279
480 5280 5759
480 5760 6239
480 6240 6719
480 6720 7199
480 7200 7679
480 7680 8159
480 8160 8639
480 8640 9119
480 9120 9599
As can be seen, trailing data at the end of the file that can't represent a complete row was not yielded, which was expected but you'd likely want to account for it somehow.
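One simple way to account for it (just a sketch of one option) is to yield whatever is left once the file is exhausted, by adding a final check after the for loop inside iter_ppm_rows; the caller can then decide whether a short last row means padding or an error.
        # at the end of iter_ppm_rows, after the loop over the file's lines:
        if row:        # leftover values that never filled a complete row
            yield row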