My textfile consists of different "blocks", e.g.,
0 0 1
1 1 1
1 0 0

1 0 0 1
1 1 1 1
1 1 1 0

1 0 0 1
1 1 1 1
1 1 1 0
1 0 1 0

1 0 0
0 1 1
1 1 1

1 0 0 0 0 1
1 1 1 1 0 0
1 1 0 0 1 0
1 0 1 0 0 0
I want to read each block into its own np array.
I didn't find a parameter for np.loadtxt() that splits the input on blank lines.
I guess looping over the file with f = open('test_case_11x5.txt', 'r') and for line in f: ..., checking a condition on every line, would be slow.
Does anyone know a neat method?
Here is a working solution using re.split and a small list comprehension. It assumes the full text has first been loaded into the variable text:
import re, io
import numpy as np
# text = ... ## load here your file
[np.loadtxt(io.StringIO(t)) for t in re.split('\n\n', text)]
output:
[array([[0., 0., 1.],
[1., 1., 1.],
[1., 0., 0.]]),
array([[1., 0., 0., 1.],
[1., 1., 1., 1.],
[1., 1., 1., 0.]]),
array([[1., 0., 0., 1.],
[1., 1., 1., 1.],
[1., 1., 1., 0.],
[1., 0., 1., 0.]]),
array([[1., 0., 0.],
[0., 1., 1.],
[1., 1., 1.]]),
array([[1., 0., 0., 0., 0., 1.],
[1., 1., 1., 1., 0., 0.],
[1., 1., 0., 0., 1., 0.],
[1., 0., 1., 0., 0., 0.]])]
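A minimal, self-contained sketch of the same idea, in case the file has runs of several blank lines or a trailing newline. The sample text and the names chunks and blocks are made up for illustration; dtype=int and ndmin=2 are optional choices, not part of the answer above.

```python
import io
import re

import numpy as np

# Hypothetical sample input: three blocks, the last two separated by two blank lines.
text = """0 0 1
1 1 1
1 0 0

1 0 0 1
1 1 1 1
1 1 1 0


1 0 0
0 1 1
"""

# Split on one or more blank (or whitespace-only) lines and drop empty chunks.
chunks = [c for c in re.split(r"\n\s*\n", text.strip()) if c]

# ndmin=2 keeps single-row blocks two-dimensional.
blocks = [np.loadtxt(io.StringIO(c), dtype=int, ndmin=2) for c in chunks]

for b in blocks:
    print(b.shape)
```

To read from disk instead, text could be loaded with open(path).read() before the split.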
You can use the groupby function in itertools like this:
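from itertools import groupby
import numpy as np

arr = []
with open('data.txt') as f_data:
    # Treat blank lines (and lines starting with '#') as block separators.
    for is_sep, group in groupby(f_data, lambda line: not line.strip() or line.startswith('#')):
        if not is_sep:
            # Each remaining group of consecutive lines is one block.
            arr.append(np.array([[int(x) for x in row.split()] for row in group]))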
This will yield a list of np arrays.
Good morning everyone. I am working with Python and Pandas.
I have two DataFrames, of the following type:
df_C = pd.DataFrame(data=[[-3,-1,-1], [5,3,3], [3,3,1], [-1,-1,-3], [-3,-1,-1], [2,3,1], [1,1,1]], columns=['C1','C2','C3'])
C1 C2 C3
0 -3 -1 -1
1 5 3 3
2 3 3 1
3 -1 -1 -3
4 -3 -1 -1
5 2 3 1
6 1 1 1
df_F = pd.DataFrame(data=[[-1,1,-1,-1,-1],[1,1,1,1,1],[1,1,1,-1,1],[1,-1,-1,-1,1],[-1,0,0,-1,-1],[1,1,1,-1,0],[1,1,-1,1,-1]], columns=['F1','F2','F3','F4','F5'])
F1 F2 F3 F4 F5
0 -1 1 -1 -1 -1
1 1 1 1 1 1
2 1 1 1 -1 1
3 1 -1 -1 -1 1
4 -1 0 0 -1 -1
5 1 1 1 -1 0
6 1 1 -1 1 -1
I would like to "cross" these two DataFrames to generate a new 3D one.
The new data must compare the values of df_F with the values of df_C according to the following rules:
If both values are positive, generate 1
If both values are negative, generate 1
If one value is positive and the other negative, generate 0
If either value is zero, generate None (NaN)
Truth table for the comparison of df_C with df_F:
df_C df_F 3D
+ + 1
+ - 0
+ 0 None
- + 0
- - 1
- 0 None
0 + None
0 - None
0 0 None
Could you please guide me on how to generate this matrix by comparing the values? I would like to do it with Pandas. I have done it with loops (for) and conditions (if), but the result is visually unpleasant, and I think a Pandas solution would be more efficient and elegant.
Thank you.
Numpy broadcasting and np.select
Broadcast and multiply the values in df_C with the values in df_F so that the resulting product array has shape (3, 7, 5). The sign of each product encodes the rule: it is positive when both values have the same sign, negative when the signs differ, and zero when either value is zero. Then use np.select to assign 1, 0 and NaN for these three cases:
a = df_C.values.T[:, :, None] * df_F.values
a = np.select([a > 0, a < 0], [1, 0], np.nan)
array([[[ 1., 0., 1., 1., 1.],
[ 1., 1., 1., 1., 1.],
[ 1., 1., 1., 0., 1.],
[ 0., 1., 1., 1., 0.],
[ 1., nan, nan, 1., 1.],
[ 1., 1., 1., 0., nan],
[ 1., 1., 0., 1., 0.]],
[[ 1., 0., 1., 1., 1.],
[ 1., 1., 1., 1., 1.],
[ 1., 1., 1., 0., 1.],
[ 0., 1., 1., 1., 0.],
[ 1., nan, nan, 1., 1.],
[ 1., 1., 1., 0., nan],
[ 1., 1., 0., 1., 0.]],
[[ 1., 0., 1., 1., 1.],
[ 1., 1., 1., 1., 1.],
[ 1., 1., 1., 0., 1.],
[ 0., 1., 1., 1., 0.],
[ 1., nan, nan, 1., 1.],
[ 1., 1., 1., 0., nan],
[ 1., 1., 0., 1., 0.]]])
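For convenience, here is a self-contained sketch of the same broadcasting approach with the imports and the question's data included. The function name cross_signs and the slices dictionary are illustrative additions, not part of the answer above.

```python
import numpy as np
import pandas as pd

df_C = pd.DataFrame(
    data=[[-3, -1, -1], [5, 3, 3], [3, 3, 1], [-1, -1, -3],
          [-3, -1, -1], [2, 3, 1], [1, 1, 1]],
    columns=['C1', 'C2', 'C3'])
df_F = pd.DataFrame(
    data=[[-1, 1, -1, -1, -1], [1, 1, 1, 1, 1], [1, 1, 1, -1, 1],
          [1, -1, -1, -1, 1], [-1, 0, 0, -1, -1], [1, 1, 1, -1, 0],
          [1, 1, -1, 1, -1]],
    columns=['F1', 'F2', 'F3', 'F4', 'F5'])


def cross_signs(left, right):
    """Hypothetical helper: 1 where signs agree, 0 where they differ, NaN on zero."""
    # product[c, i, f] = left.iloc[i, c] * right.iloc[i, f]
    product = left.to_numpy().T[:, :, None] * right.to_numpy()
    return np.select([product > 0, product < 0], [1.0, 0.0], default=np.nan)


result = cross_signs(df_C, df_F)
print(result.shape)

# Optionally rewrap each 2D slice as a DataFrame keyed by the df_C column name.
slices = {c: pd.DataFrame(result[k], columns=df_F.columns)
          for k, c in enumerate(df_C.columns)}
print(slices['C1'])
```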
Given a matrix, such as:
1 0 0
0 1 1
1 1 0
I would like to expand each element into a "sub-matrix" of size AxA. For example, with A = 3 the result would be:
1 1 1 0 0 0 0 0 0
1 1 1 0 0 0 0 0 0
1 1 1 0 0 0 0 0 0
0 0 0 1 1 1 1 1 1
0 0 0 1 1 1 1 1 1
0 0 0 1 1 1 1 1 1
1 1 1 1 1 1 0 0 0
1 1 1 1 1 1 0 0 0
1 1 1 1 1 1 0 0 0
What is the fastest way of doing it in Python using numpy (or PyTorch)?
What you're describing is the Kronecker product, so use np.kron. From its documentation:
"Computes the Kronecker product, a composite array made of blocks of the second array scaled by the first."
x = np.array([[1, 0, 0], [0, 1, 1], [1, 1, 0]])
np.kron(x, np.ones((3, 3)))
array([[1., 1., 1., 0., 0., 0., 0., 0., 0.],
[1., 1., 1., 0., 0., 0., 0., 0., 0.],
[1., 1., 1., 0., 0., 0., 0., 0., 0.],
[0., 0., 0., 1., 1., 1., 1., 1., 1.],
[0., 0., 0., 1., 1., 1., 1., 1., 1.],
[0., 0., 0., 1., 1., 1., 1., 1., 1.],
[1., 1., 1., 1., 1., 1., 0., 0., 0.],
[1., 1., 1., 1., 1., 1., 0., 0., 0.],
[1., 1., 1., 1., 1., 1., 0., 0., 0.]])
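As an alternative sketch (not from the answer above), repeating each element along both axes with np.repeat gives the same layout and keeps the input's integer dtype. The name A for the block size is taken from the question; I have not benchmarked which approach is faster.

```python
import numpy as np

x = np.array([[1, 0, 0], [0, 1, 1], [1, 1, 0]])
A = 3  # side length of each sub-matrix

# Repeat every row A times, then every column A times.
expanded = np.repeat(np.repeat(x, A, axis=0), A, axis=1)
print(expanded)
```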
Just wondering if there is an off-the-shelf function to perform the following operation; given a matrix X, holding labels (that can be assumed to be integer numbers 0-to-N) in each entry e.g.:
X = [[0 1 1 2 2 3 3 3],
[0 1 1 2 2 3 3 4],
[0 1 5 5 5 5 3 4]]
I want its adjacency matrix G, i.e. G[i,j] = 1 if labels i and j are adjacent in X, and 0 otherwise.
For example, G[1,2] = 1 because labels 1 and 2 are adjacent at (X[0,2], X[0,3]), (X[1,2], X[1,3]), etc.
The naive solution is to loop through all entries and check their neighbors, but I'd rather avoid loops for performance reasons.
You can use fancy indexing to assign the values of G directly from your X array:
import numpy as np
X = np.array([[0,1,1,2,2,3,3,3],
[0,1,1,2,2,3,3,4],
[0,1,5,5,5,5,3,4]])
G = np.zeros([X.max() + 1]*2)
# left-right pairs
G[X[:, :-1], X[:, 1:]] = 1
# right-left pairs
G[X[:, 1:], X[:, :-1]] = 1
# top-bottom pairs
G[X[:-1, :], X[1:, :]] = 1
# bottom-top pairs
G[X[1:, :], X[:-1, :]] = 1
print(G)
#array([[ 1., 1., 0., 0., 0., 0.],
# [ 1., 1., 1., 0., 0., 1.],
# [ 0., 1., 1., 1., 0., 1.],
# [ 0., 0., 1., 1., 1., 1.],
# [ 0., 0., 0., 1., 1., 0.],
# [ 0., 1., 1., 1., 0., 1.]])
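Note that the output above has ones on the diagonal, because a label is often next to itself. If self-adjacency is not wanted, a small variation can clear the diagonal. This is only a sketch; the function name label_adjacency and the include_self flag are made up for illustration.

```python
import numpy as np


def label_adjacency(X, include_self=False):
    """Hypothetical helper: 0/1 adjacency matrix of the labels in a 2D array."""
    n = X.max() + 1
    G = np.zeros((n, n), dtype=int)
    # Left/right neighbour pairs followed by top/bottom neighbour pairs.
    a = np.concatenate([X[:, :-1].ravel(), X[:-1, :].ravel()])
    b = np.concatenate([X[:, 1:].ravel(), X[1:, :].ravel()])
    # Assign both directions so that G is symmetric.
    G[a, b] = 1
    G[b, a] = 1
    if not include_self:
        np.fill_diagonal(G, 0)
    return G


X = np.array([[0, 1, 1, 2, 2, 3, 3, 3],
              [0, 1, 1, 2, 2, 3, 3, 4],
              [0, 1, 5, 5, 5, 5, 3, 4]])
G = label_adjacency(X)
print(G)
```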
Say I have a dataframe like the following:
A B
0 bar one
1 bar three
2 flux six
3 bar three
4 foo five
5 flux one
6 foo two
I would like to apply dummy-coding contrasting on it so that I get:
A B
0 0 0
1 0 2
2 1 1
3 0 2
4 2 3
5 1 0
6 2 4
(i.e. mapping every unique value to a different integer, per column).
I have tried using scikit-learn's DictVectorizer, but I get:
> from sklearn.feature_extraction import DictVectorizer as DV
> vectorizer = DV( sparse = False )
> dict_to_vectorize = df.T.to_dict().values()
> df_vec = vectorizer.fit_transform(dict_to_vectorize )
> df_vec
array([[ 1., 0., 0., 0., 1., 0., 0., 0.],
[ 1., 0., 0., 0., 0., 0., 1., 0.],
[ 0., 1., 0., 0., 0., 1., 0., 0.],
[ 1., 0., 0., 0., 0., 0., 1., 0.],
[ 0., 0., 1., 1., 0., 0., 0., 0.],
[ 0., 1., 0., 0., 1., 0., 0., 0.],
[ 0., 0., 1., 0., 0., 0., 0., 1.]])
This is because scikit-learn's DictVectorizer is designed to output one-of-K encoding. What I want is a simple-encoding instead (one column per variable).
How can I do this with scikit-learn and/or pandas? Aside from that, are there any other Python packages that help with general contrasting methods?
You could use pd.factorize:
In [124]: df.apply(lambda x: pd.factorize(x)[0])
Out[124]:
A B
0 0 0
1 0 1
2 1 2
3 0 1
4 2 3
5 1 0
6 2 4
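A self-contained sketch of the same call that also keeps the value-to-integer mapping, which pd.factorize returns as its second element. The names encoded and mappings are illustrative only.

```python
import pandas as pd

df = pd.DataFrame({
    'A': ['bar', 'bar', 'flux', 'bar', 'foo', 'flux', 'foo'],
    'B': ['one', 'three', 'six', 'three', 'five', 'one', 'two'],
})

encoded = pd.DataFrame(index=df.index)
mappings = {}
for col in df.columns:
    # codes are integers numbered by first appearance; uniques holds the original values.
    codes, uniques = pd.factorize(df[col])
    encoded[col] = codes
    mappings[col] = list(uniques)

print(encoded)
print(mappings)
```

The position of a value in mappings[col] is its integer code, so the encoding can be reversed later.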
The patsy package provides all the contrasts you'd need (and the ability to make more) [1]. AFAIK, statsmodels is the only stats package that currently uses patsy's formula framework [2, 3].
[1] https://patsy.readthedocs.org/en/latest/API-reference.html#handling-categorical-data
[2] http://statsmodels.sourceforge.net/devel/contrasts.html
[3] http://statsmodels.sourceforge.net/devel/example_formulas.html
Dummy encoding is what you get when you call DictVectorizer. The integer encoding you want is actually something different:
sklearn.preprocessing.LabelBinarizer or DictVectorizer gives dummy encoding (like pandas.get_dummies)
sklearn.preprocessing.LabelEncoder gives integer categorical encoding (like pandas.factorize)
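To make the difference concrete, here is a pandas-only sketch that puts the two encodings side by side on the question's data. It uses category codes rather than LabelEncoder to avoid a scikit-learn dependency. As far as I know both number the values in sorted order, whereas pd.factorize numbers them by first appearance, but treat that equivalence as an assumption.

```python
import pandas as pd

df = pd.DataFrame({
    'A': ['bar', 'bar', 'flux', 'bar', 'foo', 'flux', 'foo'],
    'B': ['one', 'three', 'six', 'three', 'five', 'one', 'two'],
})

# Dummy (one-of-K) encoding: one indicator column per (column, value) pair.
dummies = pd.get_dummies(df)
print(dummies.columns.tolist())

# Integer encoding: one column per variable, codes follow the sorted categories.
codes = df.apply(lambda s: s.astype('category').cat.codes)
print(codes)
```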
Say I have a matrix in a numpy array in Python
In [3]: my_matrix
Out[3]:
array([[ 2., 2., 2., 2., 2., 2., 0., 0., 0., 0., 0., 0., 0.,
0., 0., 0., 0., 0.],
[ 0., 0., 0., 0., 0., 0., 2., 2., 2., 2., 0., 0., 0.,
0., 0., 0., 0., 0.],
[ 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 2., 2., 2.,
2., 2., 2., 2., 2.]])
Is there a way to have Python/IPython print my array as follows (similar to the way MATLAB does it)?
[ 2 2 2 2 2 2 0 0 0 0 0 0 0 0 0 0 0 0;
0 0 0 0 0 0 2 2 2 2 0 0 0 0 0 0 0 0;
0 0 0 0 0 0 0 0 0 0 2 2 2 2 2 2 2 2 ]
Also, I have noticed that IPython does not use the full width of my terminal when printing numpy arrays. Other functions do (e.g. pprint.pprint). How can I change that?
Use numpy.set_printoptions. For increasing the line width:
np.set_printoptions(linewidth=150)
Replace 150 by whatever you need. Now, to print as you asked (I guess it means without the decimal point):
print(my_matrix.astype('i'))
If you have floating point values you can also control the precision for printouts with the option precision:
np.set_printoptions(precision=3)
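If the semicolon-separated layout itself is wanted, I am not aware of a print option that produces it, but a small helper can build the string by hand. The sketch below uses a made-up function name, matlab_style, and assumes the values are whole numbers. It also shows np.printoptions, which applies the settings only inside a with block.

```python
import numpy as np

my_matrix = np.array([[2.] * 6 + [0.] * 12,
                      [0.] * 6 + [2.] * 4 + [0.] * 8,
                      [0.] * 10 + [2.] * 8])


def matlab_style(a):
    """Hypothetical helper: format a 2D array with ';' between rows."""
    rows = [" ".join(str(v) for v in row) for row in a.astype(int).tolist()]
    return "[ " + ";\n  ".join(rows) + " ]"


text = matlab_style(my_matrix)
print(text)

# Temporary print settings that are restored when the block exits.
with np.printoptions(linewidth=150, precision=3):
    print(my_matrix)
```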