I am trying to put array into a pandas dataframe - python

import pandas as pd
import numpy as np
zeros=np.zeros((6,6))
arra=np.array([zeros])
rownames=['A','B','C','D','E','F']
colnames=[['one','tow','three','four','five','six']]
df=pd.DataFrame(arra,index=rownames,columns=colnames)
print(df)
Error:
ValueError: Must pass 2-d input. shape=(1, 6, 6)
My desired output is :
A B C D E F
one 0 0 0 0 0 0
tow 0 0 0 0 0 0
three 0 0 0 0 0 0
four 0 0 0 0 0 0
five 0 0 0 0 0 0
six 0 0 0 0 0 0

Try this
pd.DataFrame(np.zeros((6,6)), columns=list('ABCDEF'), index=['one','tow','three','four','five','six'])

If you want to initialize your DataFrame with a single value, you don't need to bother creating a 2D array, just pass the desired scalar to the DataFrame constructor and it will broadcast:
import pandas as pd
rownames=['A','B','C','D','E','F']
colnames=['one','tow','three','four','five','six']
df=pd.DataFrame(0, index=rownames, columns=colnames)
print(df)
Output:
one tow three four five six
A 0 0 0 0 0 0
B 0 0 0 0 0 0
C 0 0 0 0 0 0
D 0 0 0 0 0 0
E 0 0 0 0 0 0
F 0 0 0 0 0 0

Try this
zeros=np.zeros((6,6), dtype=int)
df=pd.DataFrame(zeros, columns=['A','B','C','D','E','F'], index=['one','tow','three','four','five','six'])
Note that in your desired output 'A','B','C','D','E','F' are the column names and 'one','tow','three','four','five','six' are the index (row) labels; your code has them swapped.
The error comes from the line arra=np.array([zeros]), which wraps the 2-D array in an extra dimension and produces a 3-D array of shape (1, 6, 6) (see the '[[[' below: a length-1 array containing the 2-D array), whereas the DataFrame constructor needs a 2-D array.
array([[[0, 0, 0, 0, 0, 0],
[0, 0, 0, 0, 0, 0],
[0, 0, 0, 0, 0, 0],
[0, 0, 0, 0, 0, 0],
[0, 0, 0, 0, 0, 0],
[0, 0, 0, 0, 0, 0]]])
Hope this helped!
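To make the shape problem concrete, here is a minimal sketch (the variable name wrapped is mine, not from the question) that prints both shapes and then builds the frame from the inner 2-D array:

```python
import numpy as np
import pandas as pd

zeros = np.zeros((6, 6), dtype=int)
wrapped = np.array([zeros])      # the extra brackets add a leading axis

print(zeros.shape)               # (6, 6)
print(wrapped.shape)             # (1, 6, 6)

# Indexing away the leading axis gives back a 2-D array that DataFrame accepts.
df = pd.DataFrame(
    wrapped[0],
    index=['one', 'tow', 'three', 'four', 'five', 'six'],
    columns=list('ABCDEF'),
)
print(df)
```

In practice it is simpler to pass zeros directly, as in the code above; the indexing step is only there to show where the extra dimension came from.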

Related

Index-based access to rows in pandas.DataFrame with Sparse columns

Due to memory limitations I have to use sparse columns in a pandas.DataFrame (pandas version 1.0.5).
Unfortunately, with index-based access to rows (using .loc[]), I am running into the following issue:
df = pd.DataFrame.sparse.from_spmatrix(
scipy.sparse.csr_matrix([[0, 0, 0, 1],
[1, 0, 0, 0],
[0, 1, 0, 0]])
)
df
Output:
0 1 2 3
0 0 0 0 1
1 1 0 0 0
2 0 1 0 0
If using .loc:
df.loc[[0,1]]
Output:
0 1 2 3
0 0 0 NaN 1
1 1 0 NaN 0
Ideally, I would be expecting 0s for column two as well. My hypothesis of what's happening here is that the internal csc-matrix representation and the fact that I am accessing values in rows of a column that does not contain any non-zero values originally messes with the fill-value. The dtypes sort of speak against this:
df.loc[[0,1]].dtypes
Output:
0 Sparse[int32, 0]
1 Sparse[int32, 0]
2 Sparse[float64, 0]
3 Sparse[int32, 0]
(note that the fill-value is still given as 0, even though the view's dtype for column 2 has changed from Sparse[int32, 0] to Sparse[float64, 0]).
Can anyone tell me whether all NaNs occurring in a row-sliced pd.DataFrame with sparse columns indeed refer to the respective zero value and will not "hide" any actual non-zero entries? Is there a "safe" way to use index-based row access on pd.DataFrames with sparse columns?
So this indeed turned out to be a bug in pandas that has been fixed in version 1.1.0 (see GitHub for an issue description and the changelog for 1.1.0).
In 1.1.0 the minimal example works:
df = pd.DataFrame.sparse.from_spmatrix(
scipy.sparse.csr_matrix([[0, 0, 0, 1],
[1, 0, 0, 0],
[0, 1, 0, 0]])
)
df.loc[[0, 1]]
Output:
0 1 2 3
0 0 0 0 1
1 1 0 0 0
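If you are unsure whether your installed pandas is affected, one possible check (a sketch only, and the variable names are my own) is to compare the row slice against the dense source matrix:

```python
import numpy as np
import pandas as pd
import scipy.sparse

matrix = scipy.sparse.csr_matrix([[0, 0, 0, 1],
                                  [1, 0, 0, 0],
                                  [0, 1, 0, 0]])
df = pd.DataFrame.sparse.from_spmatrix(matrix)

rows = [0, 1]
# Densify the slice so it can be compared with the original values.
sliced = df.loc[rows].sparse.to_dense()
expected = matrix.toarray()[rows]

has_nan = bool(sliced.isna().any().any())
matches = (not has_nan) and np.array_equal(sliced.to_numpy(), expected)
print(pd.__version__, "NaN in slice:", has_nan, "matches source:", matches)
```

On a fixed version the slice should contain no NaN and should match the source rows; on an affected version has_nan would be expected to come out True.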

pandas grouping multiple column data at a column level [duplicate]

I'm trying to group the data in columns A, B, C, and D by the report column.
report A B C D
1 1 0 0 0
1 0 1 0 0
2 0 0 0 1
2 0 0 1 0
3 1 0 0 0
3 0 1 0 0
4 0 0 1 0
4 1 0 0 0
Here is what I'm trying to achieve
report A B C D
1 1 1 0 0
2 0 0 1 1
3 1 1 0 0
4 1 0 1 0
Is there any straightforward way to achieve the result?
Thank you very much!!
I appreciate any support!
I believe you're looking for groupby followed by sum in pandas. The script below outputs the expected result.
import pandas as pd
report = [1, 1, 2, 2, 3, 3, 4, 4]
A = [1, 0, 0, 0, 1, 0, 0, 1]
B = [0, 1, 0, 0, 0, 1, 0, 0]
C = [0, 0, 0, 1, 0, 0, 1, 0]
D = [0, 0, 1, 0, 0, 0, 0, 0]
df = pd.DataFrame(list(zip(report, A, B, C, D)), columns = ['report', 'A', 'B', 'C', 'D'])
print(df.groupby(['report']).sum())
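The desired output shows report as an ordinary column rather than as the index. A small variation, assuming that is what you want, is to pass as_index=False to groupby:

```python
import pandas as pd

df = pd.DataFrame({
    "report": [1, 1, 2, 2, 3, 3, 4, 4],
    "A": [1, 0, 0, 0, 1, 0, 0, 1],
    "B": [0, 1, 0, 0, 0, 1, 0, 0],
    "C": [0, 0, 0, 1, 0, 0, 1, 0],
    "D": [0, 0, 1, 0, 0, 0, 0, 0],
})

# as_index=False keeps "report" as a regular column in the result.
result = df.groupby("report", as_index=False).sum()
print(result)
```

Note that sum adds the flags up, so if a report could have the same flag set on more than one row you would get values above 1; max might be closer to the intent in that case.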

Pandas Dataframe: Split a column into multiple columns

I have a pandas dataframe with the following columns (all of dtype object):
1. x_id
2. y_id
3. Sentence1
4. Sentence2
5. Label
I want to split Sentence1 and Sentence2 into multiple columns in the same dataframe.
Here is an example dataframe, named df:
x_id y_id Sentence1 Sentence2 Label
0 2 This is a ball I hate you 0
1 5 I am a boy Ahmed Ali 1
2 1 Apple is red Rose is red 1
3 9 I love you so much Me too 1
After splitting the columns [Sentence1, Sentence2] on spaces, the dataframe should look like:
x_id y_id 1 2 3 4 5 6 7 8 Label
0 2 This is a ball NONE I hate you 0
1 5 I am a boy NONE Ahmed Ali NONE 1
2 1 Apple is red NONE NONE Rose is red 1
3 9 I love you so much Me too NONE 1
How can I split the columns like this in Python using pandas?
In [26]: x = pd.concat([df.pop('Sentence1').str.split(expand=True),
...: df.pop('Sentence2').str.split(expand=True)],
...: axis=1)
...:
In [27]: x.columns = np.arange(1, x.shape[1]+1)
In [28]: x
Out[28]:
1 2 3 4 5 6 7 8
0 This is a ball None I hate you
1 I am a boy None Ahmed Ali None
2 Apple is red None None Rose is red
3 I love you so much Me too None
In [29]: df = df.join(x)
In [30]: df
Out[30]:
x_id y_id Label 1 2 3 4 5 6 7 8
0 0 2 0 This is a ball None I hate you
1 1 5 1 I am a boy None Ahmed Ali None
2 2 1 1 Apple is red None None Rose is red
3 3 9 1 I love you so much Me too None
Bag-of-words (word-count) encoding solution:
In [14]: df.Sentence1 += ' ' + df.pop('Sentence2')
In [15]: df
Out[15]:
x_id y_id Sentence1 Label
0 0 2 This is a ball I hate you 0
1 1 5 I am a boy Ahmed Ali 1
2 2 1 Apple is red Rose is red 1
3 3 9 I love you so much Me too 1
In [16]: from sklearn.feature_extraction.text import CountVectorizer
In [17]: vect = CountVectorizer()
In [18]: X = vect.fit_transform(df.Sentence1.fillna(''))
X is a sparse (memory-saving) matrix:
In [23]: X
Out[23]:
<4x17 sparse matrix of type '<class 'numpy.int64'>'
with 19 stored elements in Compressed Sparse Row format>
In [24]: type(X)
Out[24]: scipy.sparse.csr.csr_matrix
In [19]: X.toarray()
Out[19]:
array([[0, 0, 0, 0, 1, 0, 1, 1, 0, 0, 0, 0, 0, 0, 1, 0, 1],
[1, 1, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
[0, 0, 0, 1, 0, 0, 0, 2, 0, 0, 0, 2, 1, 0, 0, 0, 0],
[0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 1, 0, 1, 1]], dtype=int64)
Most sklearn methods accept sparse matrices.
If you want to "unpack" it:
In [21]: r = pd.DataFrame(X.toarray(), columns=vect.get_feature_names_out())  # get_feature_names() in scikit-learn < 1.0
In [22]: r
Out[22]:
ahmed ali am apple ball boy hate is love me much red rose so this too you
0 0 0 0 0 1 0 1 1 0 0 0 0 0 0 1 0 1
1 1 1 1 0 0 1 0 0 0 0 0 0 0 0 0 0 0
2 0 0 0 1 0 0 0 2 0 0 0 2 1 0 0 0 0
3 0 0 0 0 0 0 0 0 1 1 1 0 0 1 0 1 1
Here is how to do it for the sentences in the column Sentence1. The idea is identical for the Sentence2 column.
splits = df.Sentence1.str.split(' ')
longest = splits.apply(len).max()
Note that longest is the length of the longest sentence. Now make the Null columns:
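for j in range(1, longest+1):
    df[str(j)] = None  # object dtype, so strings can be assigned later
And finally, go through the split values and assign them:
for i, words in zip(splits.index, splits):
    # Column labels start at '1', so number the words from 1 as well.
    for k, word in enumerate(words, start=1):
        df.loc[i, str(k)] = word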
It looks like a machine learning problem. Expanding one column into as many columns as the longest sentence has words may not be efficient.
Another (probably more efficient) solution is to convert each word to an integer and then pad every sentence to the length of the longest one. TensorFlow has tools for that.
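As a rough illustration of the integer-encoding idea without any framework, here is a plain-Python sketch. The names vocab, encoded and padded are mine, and reserving 0 as the padding value is an assumption, not something from the question:

```python
sentences = ["This is a ball", "I am a boy", "Apple is red", "I love you so much"]

vocab = {}      # word -> integer id; 0 is reserved for padding
encoded = []
for sentence in sentences:
    ids = []
    for word in sentence.lower().split():
        if word not in vocab:
            vocab[word] = len(vocab) + 1
        ids.append(vocab[word])
    encoded.append(ids)

# Pad every sequence on the right to the length of the longest sentence.
longest = max(len(ids) for ids in encoded)
padded = [ids + [0] * (longest - len(ids)) for ids in encoded]

for row in padded:
    print(row)
```

The result is a rectangular list of integers, which most machine learning libraries can consume directly.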

pandas DataFrame set non-contiguous sections

I have a DataFrame like below and would like for B to be 1 for n rows after the 1 in column A (where below n = 2)
index A B
0 0 0
1 1 0
2 0 1
3 0 1
4 1 0
5 0 1
6 0 1
7 0 0
8 1 0
9 0 1
I think I can do it using .ix similar to this example but I'm not sure how. I'd like to do it in a single pandas-style selection command if possible. (Ideally not using rolling_apply.)
Modifying a subset of rows in a pandas dataframe
EDIT: the application is that a 1 in column A is "ignored" if it falls within n rows of the previous 1. As per the comments, for n = 2 and these examples:
A = [1, 0, 1, 0, 1], B should be = [0, 1, 1, 0, 0]
A = [1, 1, 0, 0], B should be [0, 1, 1, 0]
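For reference, the rule in the edit can be written as a plain loop. This is only a sketch of my reading of the rule, not the single-selection pandas command asked for: the helper name mark_following is hypothetical, and I am assuming a 1 is ignored exactly when it lands on a row that is already being marked.

```python
def mark_following(a, n):
    """Return B where the n rows after each non-ignored 1 in a are set to 1."""
    b = [0] * len(a)
    remaining = 0                  # rows still to mark after the last active 1
    for i, value in enumerate(a):
        if remaining > 0:
            b[i] = 1               # inside a marked window: any 1 in a is ignored
            remaining -= 1
        elif value == 1:
            remaining = n          # an active 1 opens a new window of n rows
    return b


print(mark_following([1, 0, 1, 0, 1], 2))
print(mark_following([1, 1, 0, 0], 2))
print(mark_following([0, 1, 0, 0, 1, 0, 0, 0, 1, 0], 2))
```

The result can be assigned back with df['B'] = mark_following(df['A'].tolist(), 2).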

Python: fill hollow object

I am writing a script to calculate the volume of an arbitrarily shaped 3D object. I don't care whether the object is hollow or not; I need to calculate its total volume.
The data model I have is a 3D table (a histogram of pixels) of ones and zeros. Ones mark where the object is, and zeros mark empty space. To calculate the volume of a completely filled object, it is enough to sum all the pixels that contain a one and multiply by the pixel volume.
The main difficulty is a hollow object, where zeros are surrounded by ones, so the straightforward method described above is no longer valid. What we need to do is fill the whole object area with ones. Here is a 2D example to show what I mean.
A 2D table:
0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 1 1 1 1 1 1 0 0 0
0 0 1 1 0 0 0 1 1 1 0 0
0 0 0 1 0 0 1 0 0 0 0 0
0 0 1 0 0 0 0 1 0 0 0 0
0 0 1 0 0 0 0 1 0 0 0 0
0 0 1 1 1 1 1 1 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0
I need to transform it to this
0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 1 1 1 1 1 1 0 0 0
0 0 1 1 1 1 1 1 1 1 0 0
0 0 0 1 1 1 1 0 0 0 0 0
0 0 1 1 1 1 1 1 0 0 0 0
0 0 1 1 1 1 1 1 0 0 0 0
0 0 1 1 1 1 1 1 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0
If you use scipy you can do this in one line with binary_fill_holes. And this works in n-dimensions. With your example:
import numpy as np
from scipy import ndimage
shape=np.array([
[0,0,0,0,0,0,0,0,0,0,0,0],
[0,0,0,1,1,1,1,1,1,0,0,0],
[0,0,1,1,0,0,0,1,1,1,0,0],
[0,0,0,1,0,0,1,0,0,0,0,0],
[0,0,1,0,0,0,0,1,0,0,0,0],
[0,0,1,0,0,0,0,1,0,0,0,0],
[0,0,1,1,1,1,1,1,0,0,0,0],
[0,0,0,0,0,0,0,0,0,0,0,0]
])
shape[ndimage.binary_fill_holes(shape)] = 1
#Output:
[[0 0 0 0 0 0 0 0 0 0 0 0]
[0 0 0 1 1 1 1 1 1 0 0 0]
[0 0 1 1 1 1 1 1 1 1 0 0]
[0 0 0 1 1 1 1 0 0 0 0 0]
[0 0 1 1 1 1 1 1 0 0 0 0]
[0 0 1 1 1 1 1 1 0 0 0 0]
[0 0 1 1 1 1 1 1 0 0 0 0]
[0 0 0 0 0 0 0 0 0 0 0 0]]
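Since the goal is a volume, here is a small 3D sketch of the same call. The hollow cube and the voxel_volume value are made up for illustration:

```python
import numpy as np
from scipy import ndimage

# A 3x3x3 block of ones inside a 5x5x5 grid, with the centre voxel left empty.
voxels = np.zeros((5, 5, 5), dtype=int)
voxels[1:4, 1:4, 1:4] = 1
voxels[2, 2, 2] = 0

# binary_fill_holes returns a boolean array with enclosed holes filled.
filled = ndimage.binary_fill_holes(voxels)

voxel_volume = 0.5                 # hypothetical volume of a single voxel
volume = filled.sum() * voxel_volume
print(voxels.sum(), filled.sum(), volume)
```

Here the hollow object has 26 voxels set, the filled one has 27, and the volume follows from the filled count.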
A standard flood fill should be extensible to three dimensions. From Wikipedia, the 2-d version in outline:
1. If the color of node is not equal to target-color, return.
2. Set the color of node to replacement-color.
3. Perform Flood-fill (one step to the west of node, target-color, replacement-color).
Perform Flood-fill (one step to the east of node, target-color, replacement-color).
Perform Flood-fill (one step to the north of node, target-color, replacement-color).
Perform Flood-fill (one step to the south of node, target-color, replacement-color).
4. Return.
Notice that in step 3. you are keeping track of all the adjacent cells. If you change this to find all adjacent cells in 3-d and run as before it should work nicely.
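One way to sketch this for a hollow object is to flood the empty space from outside and treat everything the flood cannot reach as part of the object. The version below uses a queue instead of recursion to avoid hitting the recursion limit on large grids. The function name fill_hollow is mine, and I am assuming face-adjacent neighbours only (6 in 3-d, 4 in 2-d):

```python
from collections import deque

import numpy as np


def fill_hollow(volume):
    """Return a copy of a 0/1 array with enclosed zero regions set to 1."""
    vol = np.asarray(volume)
    # Pad with one layer of zeros so all outside space is a single region.
    padded = np.pad(vol, 1, mode="constant", constant_values=0)
    outside = np.zeros(padded.shape, dtype=bool)

    # Face neighbours: one step forwards or backwards along each axis.
    steps = []
    for axis in range(padded.ndim):
        for delta in (-1, 1):
            step = [0] * padded.ndim
            step[axis] = delta
            steps.append(tuple(step))

    start = (0,) * padded.ndim     # a padding cell, guaranteed to be outside
    outside[start] = True
    queue = deque([start])
    while queue:
        cell = queue.popleft()
        for step in steps:
            nxt = tuple(c + s for c, s in zip(cell, step))
            in_bounds = all(0 <= c < size for c, size in zip(nxt, padded.shape))
            if in_bounds and padded[nxt] == 0 and not outside[nxt]:
                outside[nxt] = True
                queue.append(nxt)

    # Everything the flood never reached is the object or an enclosed hole.
    filled = (~outside).astype(vol.dtype)
    inner = tuple(slice(1, -1) for _ in range(padded.ndim))
    return filled[inner]


cube = np.zeros((5, 5, 5), dtype=int)
cube[1:4, 1:4, 1:4] = 1
cube[2, 2, 2] = 0                  # hollow centre
print(cube.sum(), fill_hollow(cube).sum())
```

Because the neighbour list is built from the number of dimensions, the same function also handles the 2D table from the question.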
Not intuitive and hard to read, but compact:
matrix = [[0, 0, 0, 0, 0, 0],
[0, 0, 1, 0, 1, 0],
[0, 1, 0, 0, 0, 1],
[0, 0, 0, 0, 0, 0]]
ranges = [1 in m and range(m.index(1), len(m)-list(reversed(m)).index(1)) or None for m in matrix]
result = [[ranges[j] is not None and i in ranges[j] and 1 or 0 for i,a in enumerate(m)] for j,m in enumerate(matrix)]
result
[[0, 0, 0, 0, 0, 0],
[0, 0, 1, 1, 1, 0],
[0, 1, 1, 1, 1, 1],
[0, 0, 0, 0, 0, 0]]
matrix=[
[0,0,0,0,0,0,0,0,0,0,0,0],
[0,0,0,1,1,1,1,1,1,0,0,0],
[0,0,1,1,0,0,0,1,1,1,0,0],
[0,0,0,1,0,0,1,0,0,0,0,0],
[0,0,1,0,0,0,0,1,0,0,0,0],
[0,0,1,0,0,0,0,1,0,0,0,0],
[0,0,1,1,1,1,1,1,0,0,0,0],
[0,0,0,0,0,0,0,0,0,0,0,0]
]
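def fill(x, y):
    # Recursive flood fill: set every cell reachable from (x, y) to 1,
    # stopping at the matrix border or at cells that are already 1.
    if (x == len(matrix)
            or y == len(matrix[0])
            or x == -1
            or y == -1
            or matrix[x][y] == 1):
        return
    matrix[x][y] = 1
    fill(x+1, y)
    fill(x-1, y)
    fill(x, y+1)
    fill(x, y-1)

fill(4, 4)  # the seed must be a cell inside the hollow region
for row in matrix:
    print(row)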
Assuming you're talking about something like filling a voxel shape, why can't you just do something like this (take it as a pseudocode example for the simplified 2D case, as I don't know what data structure you're using - maybe a numpy.array? - so I'm just taking a hypothetical "list of lists as a matrix" and I don't take into consideration the problem of modifying an iterable while traversing it etc.):
for i, row in enumerate(matrix):
    last_filled_voxel_j = None  # column of the last filled voxel seen in this row
    for j, voxel in enumerate(row):
        if voxel:
            if last_filled_voxel_j is not None:
                fill_matrix(matrix, i, last_filled_voxel_j, j)
            last_filled_voxel_j = j
...assuming that fill_matrix(matrix, row, column_start, column_end) just fills the row of voxels between and not including column_start and column_end.
I guess this is probably not the answer you're looking for, but can you explain how what you actually need differs from the pseudocode above, so we can be of more help?
