Reading formatted array from file in Python

I have a file that contains some strings followed by two formatted arrays. It looks something like this:
megabuck
Hello world
[58, 50, 42, 34, 26, 18, 10, 2,
61, 53, 45, 37, 29, 21, 13, 5,
63, 55, 47, 39, 31, 23, 15, 7]
[57, 49, 41, 33, 25, 17, 9,
1, 58, 50, 42, 34, 26, 18,
14, 6, 61, 53, 45, 37, 29,
21, 13, 5, 28, 20, 12, 4]
I don't know the size of the arrays beforehand. The only thing I know is that each array is delimited by [ and ]. What would be an elegant way to read the arrays?
I am a newbie in Python.

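You can use a regular expression with re.findall.
Example:
import re
import ast

# filename is the path to your input file
with open(filename) as infile:
    data = infile.read()

# re.S lets "." match newlines; the non-greedy ".*?" stops at the first "]"
for i in re.findall(r"(\[.*?\])", data, flags=re.S):
    print(ast.literal_eval(i))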
Output:
[58, 50, 42, 34, 26, 18, 10, 2, 61, 53, 45, 37, 29, 21, 13, 5, 63, 55, 47, 39, 31, 23, 15, 7]
[57, 49, 41, 33, 25, 17, 9, 1, 58, 50, 42, 34, 26, 18, 14, 6, 61, 53, 45, 37, 29, 21, 13, 5, 28, 20, 12, 4]
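The same idea can be tried without a file on disk. The following is a minimal sketch, not the answerer's exact code: the helper name extract_arrays and the shortened sample text are made up for illustration. It assumes the arrays are flat, because the non-greedy pattern stops at the first closing bracket and would not handle nested lists.

```python
import ast
import re


def extract_arrays(text):
    """Return every flat bracketed list found in text as a Python list."""
    # re.S lets "." match newlines, so one list may span several lines.
    # The non-greedy ".*?" ends each match at the first closing bracket.
    matches = re.findall(r"\[.*?\]", text, flags=re.S)
    # ast.literal_eval safely parses the literal without executing code.
    return [ast.literal_eval(m) for m in matches]


# Shortened, hypothetical version of the file contents from the question.
sample = """megabuck
Hello world
[58, 50, 42,
61, 53, 45]
[57, 49,
1, 58]
"""

arrays = extract_arrays(sample)
print(arrays)
```

Because the items go through ast.literal_eval, they come back as integers rather than strings.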

I wouldn't call it elegant, but it works:
ars = """
megabuck
Hello world
[58, 50, 42, 34, 26, 18, 10, 2,
61, 53, 45, 37, 29, 21, 13, 5,
63, 55, 47, 39, 31, 23, 15, 7]
[57, 49, 41, 33, 25, 17, 9,
1, 58, 50, 42, 34, 26, 18,
14, 6, 61, 53, 45, 37, 29,
21, 13, 5, 28, 20, 12, 4]
"""
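arrays = []
for a in ars.split("["):
    # only the chunks that contain "]" hold an array
    if ']' in a:
        # drop the closing bracket, split on commas, and convert each item to int
        arrays.append([int(i) for i in a.replace("]", '').split(',')])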

Related

Transform a list of ranges into a single list

I have a data frame that holds points used to mark another dataset.
I'm creating a range from each start mark to the matching stop mark, and I want to combine the ranges into a single list or NumPy array.
I have the following:
list(map(lambda limits: np.arange(limits[1] - limits[0] - 1, -1, -1),
         zip(df_cycles['Start_point'], df_cycles['Stop_point'])))
This is returning a list of arrays:
[array([1155, 1154, 1153, ..., 2, 1, 0]),
array([71, 70, 69, 68, 67, 66, 65, 64, 63, 62, 61, 60, 59, 58, 57, 56, 55,
54, 53, 52, 51, 50, 49, 48, 47, 46, 45, 44, 43, 42, 41, 40, 39, 38,
37, 36, 35, 34, 33, 32, 31, 30, 29, 28, 27, 26, 25, 24, 23, 22, 21,
20, 19, 18, 17, 16, 15, 14, 13, 12, 11, 10, 9, 8, 7, 6, 5, 4,
3, 2, 1, 0]),
...]
How can I modify or transform the output to have a single list or NumPy array like this:
array([1155, 1154, 1153, ..., 2, 1, 0, 71, 70, 69, 68, 67, 66, 65,
64, 63, 62, 61, 60, 59, 58, 57, 56, 55, 54, 53, 52, 51, 50, 49, 48,
47, 46, 45, 44, 43, 42, 41, 40, 39, 38, 37, 36, 35, 34, 33, 32, 31,
30, 29, 28, 27, 26, 25, 24, 23, 22, 21, 20, 19, 18, 17, 16, 15, 14,
13, 12, 11, 10, 9, 8, 7, 6, 5, 4,3, 2, 1, 0,...])
Just do:
flatarray = np.concatenate(list_of_arrays)
np.concatenate joins two or more arrays into a single new array. You don't want to call it once per array in a loop (each call copies everything accumulated so far, which makes it a Schlemiel the Painter's algorithm), but once you have all the arrays, it is an efficient way to combine them.
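A small sketch of the single-call approach. The starts and stops lists are hypothetical stand-ins for the Start_point and Stop_point columns, which are not shown in the question.

```python
import numpy as np

# Hypothetical stand-ins for df_cycles['Start_point'] and df_cycles['Stop_point'].
starts = [0, 10, 20]
stops = [4, 13, 22]

# Build one descending range per (start, stop) pair, as in the question.
list_of_arrays = [
    np.arange(stop - start - 1, -1, -1) for start, stop in zip(starts, stops)
]

# Combine them all with a single call instead of concatenating inside a loop.
flat = np.concatenate(list_of_arrays)
print(flat)
```

If a plain list is wanted instead of an array, flat.tolist() converts the result.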

Select values from two different dataset in python

I'm having trouble working with my two datasets. Here is the problem:
I have two different datasets:
training_df = pd.read_csv('.../train.csv')
test_df = pd.read_csv('.../test.csv')
I have to take the values of some columns from train.csv and of other columns from test.csv. I tried this:
num_attrib = pd.DataFrame(training_df, columns=[0, 2, 3, 15, 16, 17, 18, 24, 32, 34, 35, 36, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 52, 54, 57, 59, 60, 64, 65, 66, 67, 68, 69, 70, 71, 72])
cat_attrib = pd.DataFrame(training_df, columns=[1, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 19, 20, 21, 22, 23, 25, 26, 27, 28, 29, 30, 31, 33, 37, 38, 39, 40, 51, 53, 55, 56, 58, 61, 62, 63, 73, 74])
num_attrib_test = pd.DataFrame(test_df, columns=[0, 2, 3, 15, 16, 17, 18, 24, 32, 34, 35, 36, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 52, 54, 57, 59, 60, 64, 65, 66, 67, 68, 69, 70, 71, 72])
cat_attrib_test = pd.DataFrame(test_df, columns=[1, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 19, 20, 21, 22, 23, 25, 26, 27, 28, 29, 30, 31, 33, 37, 38, 39, 40, 51, 53, 55, 56, 58, 61, 62, 63, 73, 74])
Both datasets contain numerical and categorical data. I have to separate the categorical columns from the numerical ones in each dataset, but my approach does not work.
I need this because I have to apply ColumnTransformer() to training_df and test_df.
Any suggestions?
Thank you so much
You are looking for iloc, which selects by integer position. See the pandas documentation for DataFrame.iloc.
num_attrib = training_df.iloc[:,[0,2,3,...,15]]
You can also slice:
# columns at even positions (0, 2, 4, ...)
num_attrib = training_df.iloc[:, ::2]
# columns at odd positions (1, 3, 5, ...)
num_attrib = training_df.iloc[:, 1::2]
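A minimal sketch of positional selection on a toy frame. The column names, values, and position lists are invented for illustration; the real position lists would be the ones from the question. The original attempt likely fails because passing columns= to pd.DataFrame selects by label, not by position, so integer labels that do not exist in the CSV header do not pick up any data.

```python
import pandas as pd

# Hypothetical toy frame standing in for training_df.
df = pd.DataFrame(
    {
        "age": [30, 41],
        "city": ["Oslo", "Rome"],
        "income": [52.0, 61.5],
        "sex": ["f", "m"],
    }
)

# Positions of the numerical and categorical columns (assumed for this toy frame).
num_positions = [0, 2]
cat_positions = [1, 3]

# iloc selects by integer position: all rows, the listed columns.
num_attrib = df.iloc[:, num_positions]
cat_attrib = df.iloc[:, cat_positions]

print(list(num_attrib.columns))
print(list(cat_attrib.columns))
```

The same two position lists can be reused on the test frame, provided both files have their columns in the same order.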

Divide Dataframe into 2 dataframe using index

I need to split my dataframe into two dataframes based on the index:
Df1 with this index:[5, 15, 22, 23, 24]
Df2 with this index:[0, 1, 2, 3, 4, 6, 7, 8, 9, 10, 11, 12, 13, 14, 16, 17, 18, 19, 20, 21, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54]
I'm unable to find a solution. Any help would be appreciated.
If the input is a list of index values, you can use Index.isin with boolean indexing (this also works correctly if some of the values do not exist in the original index):
idx = [5, 15, 22, 23, 24]
mask = df.index.isin(idx)
df1 = df[mask]
df2 = df[~mask]
A solution with DataFrame.loc is also possible (the trailing : can be omitted), but it requires that all values exist in the original index:
L1 = [5, 15, 22, 23, 24]
L2 = [0, 1, 2, 3, 4, 6, 7, 8, 9, 10, 11, 12, 13, 14, 16, 17, 18, 19, 20,
21, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40,
41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54]
df1 = df.loc[L1]
df2 = df.loc[L2]
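The difference between the two approaches shows up when a requested label is missing. This is a small sketch on an invented ten-row frame; the KeyError behaviour described here is what recent pandas versions do, and older releases may behave differently.

```python
import pandas as pd

# Hypothetical frame with the default index 0..9.
df = pd.DataFrame({"value": range(10)})

# 42 is deliberately absent from the index.
idx = [2, 5, 42]

# Boolean mask: True for rows whose index label is in idx; missing labels are ignored.
mask = df.index.isin(idx)
df1 = df[mask]
df2 = df[~mask]

# .loc with a list of labels expects every label to exist.
try:
    df.loc[idx]
    loc_failed = False
except KeyError:
    loc_failed = True

print(df1.index.tolist(), len(df2), loc_failed)
```

The mask also gives the complement for free through ~mask, so the second index list never has to be written out.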
You can use .loc:
df_1 = df.loc[[5, 15, 22, 23, 24], :]
df_2 = df.loc[[0, 1, 2, 3, 4, 6, 7, 8, 9, 10, 11, 12, 13, 14, 16, 17, 18, 19, 20, 21, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54], :]
Here is the documentation: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.loc.html

Overlapping writing tasks in multiprocessing (pool.map)

I'm seeing overlapping writes when I run the following piece of code with multiprocessing.
def spectrum(i):
    for j in range(num_x):
        coordinate = data[:,j,i]
        filtered = filter(lambda a: a != 0, coordinate)
        occupancy = float(len(filtered))/framespfile
        if filtered == [] or filtered[0] > 500:
            output = str([j, i]) + "\n" + str(filtered) + "\n"
            badpixelfile.write(output)
        else:
            output = str([j, i]) + "\n" + str(filtered) + "\n"
            coordinatefile.write(output)

pool2 = multiprocessing.Pool(multiprocessing.cpu_count())
pool2.map(spectrum, range(num_y))
pool2.close()
pool2.join()
It should write out results like:
[14,0]
[50, 51, 84]
[0, 314]
[60, 74, 12, 202, 129]
But sometimes the processes overlap and the file looks like this (it happens only occasionally, but it breaks the analysis):
[149, 27]
[27, 34, 26, 25, 19, 45, 32, 36, 46, 29, 25, 25, 40, 62, 24, 31, 23, 46, 33, 35, 60, 33, 8, 24, 49, 29, 29, 42, 8, 22, 31, 28, 25, 25, 56, 32, 31, 27, 11, 20, 29, 23, 51, 28, 31, 29, 28, 30, 23, 16, 34, 36, 25, 17, 25, 19, 19, 51, 27, 37, 9, 32, 26, 28, 27, 3, 44, 4, 38, 20, 34, 28, 22, 26, 26, 19, 21, 25, 25, 48, 24, 29, 22, 20, 23, 29, 15, 32, 42, 3, 23, 26, 34, 28, 26, 39, 17, [0, 123]
[20, 43, 33, 34, 18, 44, 15, 22, 33, 20, 45, 30, 21, 33, 32, 43, 30, 8, 37, 54, 9, 46, 33, 16, 27, 29, 31, 47, 26, 38, 40, 29, 34, 38, 17, 33, 47, 28, 24, 33, 40, 47, 16, 32, 33, 21, 49, 34, 26, 21, 47, 46, 49, 13, 62, 62, 31, 41, 14, 65, 36, 49, 27, 38, 44, 54, 55, 64, 32, 50, 28, 34, 41, 49, 33, 40, 28, 32, 31, 56, 16, 35, 37, 50, 33, 41, 38, 26, 41, 26, 28, 25, 37, 27, 20, 47, 31, 35, 28, 43, 48, 37, 31, 24, 34, 36, 41, 19, 41, 41, 3, 36]
[1, 123]
So the output for [149, 27] is not finished before the output for [0, 123] begins.
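One common way to avoid interleaved output is to let the workers only compute and return their text, and have the parent process do all the writing. The likely cause of the overlap is that each worker holds its own buffered copy of the file object and flushes it at arbitrary points. The following is a minimal sketch under stated assumptions: it targets Python 3 (where filter() returns an iterator, so a list comprehension is used instead), and the tiny data array and the output file names are hypothetical stand-ins for the asker's real ones.

```python
import multiprocessing

import numpy as np

# Hypothetical stand-in for the asker's data array, shaped (frames, num_x, num_y).
data = np.zeros((3, 2, 2), dtype=int)
data[:, 0, 0] = [10, 0, 20]  # good pixel
data[:, 1, 0] = [0, 0, 0]    # bad pixel: no non-zero values
data[:, 0, 1] = [600, 5, 0]  # bad pixel: first value above 500
data[:, 1, 1] = [7, 8, 9]    # good pixel


def spectrum(i):
    """Return (good_text, bad_text) for column i instead of writing to files."""
    good, bad = [], []
    for j in range(data.shape[1]):
        # Keep the non-zero values as plain Python ints.
        filtered = [int(v) for v in data[:, j, i] if v != 0]
        record = f"{[j, i]}\n{filtered}\n"
        if not filtered or filtered[0] > 500:
            bad.append(record)
        else:
            good.append(record)
    return "".join(good), "".join(bad)


def main():
    # Workers only compute; pool.map returns the results in input order.
    with multiprocessing.Pool() as pool:
        results = pool.map(spectrum, range(data.shape[2]))

    # Only the parent writes, so records cannot interleave.
    # The file names are hypothetical.
    with open("coordinates.txt", "w") as coordinatefile, \
            open("badpixels.txt", "w") as badpixelfile:
        for good, bad in results:
            coordinatefile.write(good)
            badpixelfile.write(bad)


if __name__ == "__main__":
    main()
```

This keeps every result in memory until the pool finishes. If that is too much for the real data, other designs exist, such as a dedicated writer process fed by a queue or a lock around each write, but those are not shown here.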

numpy arrays: basic questions

Sorry if this question is too basic.
A=np.arange(64).reshape(2,32)
array([[ 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16,
17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31],
[32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48,
49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63]])
A.reshape(4,4,4)
array([[[ 0, 1, 2, 3],
[ 4, 5, 6, 7],
[ 8, 9, 10, 11],
[12, 13, 14, 15]],
[[16, 17, 18, 19],
[20, 21, 22, 23],
[24, 25, 26, 27],
[28, 29, 30, 31]],
[[32, 33, 34, 35],
[36, 37, 38, 39],
[40, 41, 42, 43],
[44, 45, 46, 47]],
[[48, 49, 50, 51],
[52, 53, 54, 55],
[56, 57, 58, 59],
[60, 61, 62, 63]]])
Now, I expected something like A[2] or A[2,:] or A[2,:,:] to return the matrix
[[32, 33, 34, 35],
[36, 37, 38, 39],
[40, 41, 42, 43],
[44, 45, 46, 47]]
and A[2,2,2] to return 42, for example,
but I got this error:
IndexError: too many indices for array
You have to do
A = A.reshape(4,4,4)
instead of
A.reshape(4,4,4)
reshape does not work in place; it returns a new array, so you have to assign the result. Then you can do
A[2,2,2]
Out[301]: 42
After a bare A.reshape(4,4,4), A itself does not change.
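A short sketch that reproduces both the error and the fix. The variable names B, raised, and shape_after_bare_call are introduced here only for illustration.

```python
import numpy as np

A = np.arange(64).reshape(2, 32)

# The result of this call is discarded, so A keeps its (2, 32) shape.
A.reshape(4, 4, 4)
shape_after_bare_call = A.shape

# Indexing the 2-D array with three indices raises IndexError.
try:
    A[2, 2, 2]
    raised = False
except IndexError:
    raised = True

# Assigning the result gives an array that can be indexed in 3-D.
B = A.reshape(4, 4, 4)
print(shape_after_bare_call, raised, B[2, 2, 2])
print(B[2])
```

Here B[2], B[2, :], and B[2, :, :] all select the same 4 by 4 block.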
