Due to memory limitations I have to use sparse columns in a pandas.DataFrame (pandas version 1.0.5).
Unfortunately, with index-based access to rows (using .loc[]), I am running into the following issue:
import pandas as pd
import scipy.sparse

df = pd.DataFrame.sparse.from_spmatrix(
    scipy.sparse.csr_matrix([[0, 0, 0, 1],
                             [1, 0, 0, 0],
                             [0, 1, 0, 0]])
)
df
Output:
   0  1  2  3
0  0  0  0  1
1  1  0  0  0
2  0  1  0  0
If using .loc:
df.loc[[0,1]]
Output:
   0  1    2  3
0  0  0  NaN  1
1  1  0  NaN  0
Ideally, I would expect 0s for column 2 as well. My hypothesis is that the internal csc-matrix representation, combined with the fact that I am accessing rows of a column that originally contains no non-zero values, messes with the fill value. The dtypes sort of speak against this, though:
df.loc[[0,1]].dtypes
Output:
0 Sparse[int32, 0]
1 Sparse[int32, 0]
2 Sparse[float64, 0]
3 Sparse[int32, 0]
(note that the fill-value is still given as 0, even though the view's dtype for column 2 has changed from Sparse[int32, 0] to Sparse[float64, 0]).
Can anyone tell me whether all NaNs occurring in a row-sliced pd.DataFrame with sparse columns indeed refer to the respective zero value and will not "hide" any actual non-zero entries? Is there a "safe" way to use index-based row access on pd.DataFrames with sparse columns?
So this indeed turned out to be a bug in pandas that has been fixed in version 1.1.0 (see GitHub for an issue description and the changelog for 1.1.0).
In 1.1.0 the minimal example works:
df = pd.DataFrame.sparse.from_spmatrix(
    scipy.sparse.csr_matrix([[0, 0, 0, 1],
                             [1, 0, 0, 0],
                             [0, 1, 0, 0]])
)
df.loc[[0, 1]]
Output:
   0  1  2  3
0  0  0  0  1
1  1  0  0  0
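For anyone stuck on a pandas version before 1.1.0, one possible workaround (a sketch, not something from the changelog) is to do the row selection on the underlying scipy matrix before converting it to a DataFrame, so .loc never has to touch the sparse columns. Note that slicing the scipy matrix is positional, which happens to coincide with the default integer labels in this example:

import pandas as pd
import scipy.sparse

m = scipy.sparse.csr_matrix([[0, 0, 0, 1],
                             [1, 0, 0, 0],
                             [0, 1, 0, 0]])

# Select rows 0 and 1 on the scipy matrix first (positional indexing),
# then build the sparse DataFrame from the already-sliced matrix.
df_rows = pd.DataFrame.sparse.from_spmatrix(m[[0, 1]])
print(df_rows)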
I'm trying to group the data in columns A, B, C, and D by the report column.
report  A  B  C  D
1       1  0  0  0
1       0  1  0  0
2       0  0  0  1
2       0  0  1  0
3       1  0  0  0
3       0  1  0  0
4       0  0  1  0
4       1  0  0  0
Here is what I'm trying to achieve:
report  A  B  C  D
1       1  1  0  0
2       0  0  1  1
3       1  1  0  0
4       1  0  1  0
Is there any straightforward way to achieve this result?
Thank you very much!!
I appreciate any support!
I believe you're looking for groupby with sum in pandas. The script below produces the expected result.
import pandas as pd

report = [1, 1, 2, 2, 3, 3, 4, 4]
A = [1, 0, 0, 0, 1, 0, 0, 1]
B = [0, 1, 0, 0, 0, 1, 0, 0]
C = [0, 0, 0, 1, 0, 0, 1, 0]
D = [0, 0, 1, 0, 0, 0, 0, 0]

df = pd.DataFrame(list(zip(report, A, B, C, D)), columns=['report', 'A', 'B', 'C', 'D'])

# Group the rows by report and sum each column within each group
print(df.groupby(['report']).sum())
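Note that groupby moves report into the index of the result. If you want it back as an ordinary column, as in the desired output, one option is as_index=False (or, equivalently, a reset_index() afterwards):

# Keep 'report' as a regular column instead of the index
print(df.groupby('report', as_index=False).sum())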
I have a pandas dataframe with the following column names (the column type is object):
1. x_id
2. y_id
3. Sentence1
4. Sentence2
5. Label
I want to split Sentence1 and Sentence2 into multiple columns in the same dataframe.
Here is an example (the dataframe is named df):
x_id  y_id  Sentence1           Sentence2    Label
0     2     This is a ball      I hate you   0
1     5     I am a boy          Ahmed Ali    1
2     1     Apple is red        Rose is red  1
3     9     I love you so much  Me too       1
After splitting the columns [Sentence1, Sentence2] on the ' ' space character, the dataframe should look like this:
x_id  y_id  1      2     3    4     5     6      7     8     Label
0     2     This   is    a    ball  NONE  I      hate  you   0
1     5     I      am    a    boy   NONE  Ahmed  Ali   NONE  1
2     1     Apple  is    red  NONE  NONE  Rose   is    red   1
3     9     I      love  you  so    much  Me     too   NONE  1
How can I split the columns like this in Python using a pandas dataframe?
Assuming pandas and numpy are imported as pd and np:
In [26]: x = pd.concat([df.pop('Sentence1').str.split(expand=True),
...: df.pop('Sentence2').str.split(expand=True)],
...: axis=1)
...:
In [27]: x.columns = np.arange(1, x.shape[1]+1)
In [28]: x
Out[28]:
       1     2    3     4     5      6     7     8
0   This    is    a  ball  None      I  hate   you
1      I    am    a   boy  None  Ahmed   Ali  None
2  Apple    is  red  None  None   Rose    is   red
3      I  love  you    so  much     Me   too  None
In [29]: df = df.join(x)
In [30]: df
Out[30]:
   x_id  y_id  Label      1     2    3     4     5      6     7     8
0     0     2      0   This    is    a  ball  None      I  hate   you
1     1     5      1      I    am    a   boy  None  Ahmed   Ali  None
2     2     1      1  Apple    is  red  None  None   Rose    is   red
3     3     9      1      I  love  you    so  much     Me   too  None
A bag-of-words (one-hot-style) encoding solution:
In [14]: df.Sentence1 += ' ' + df.pop('Sentence2')
In [15]: df
Out[15]:
   x_id  y_id  Sentence1                   Label
0     0     2  This is a ball I hate you   0
1     1     5  I am a boy Ahmed Ali        1
2     2     1  Apple is red Rose is red    1
3     3     9  I love you so much Me too   1
In [16]: from sklearn.feature_extraction.text import CountVectorizer
In [17]: vect = CountVectorizer()
In [18]: X = vect.fit_transform(df.Sentence1.fillna(''))
X is a sparse (memory-saving) matrix:
In [23]: X
Out[23]:
<4x17 sparse matrix of type '<class 'numpy.int64'>'
with 19 stored elements in Compressed Sparse Row format>
In [24]: type(X)
Out[24]: scipy.sparse.csr.csr_matrix
In [19]: X.toarray()
Out[19]:
array([[0, 0, 0, 0, 1, 0, 1, 1, 0, 0, 0, 0, 0, 0, 1, 0, 1],
[1, 1, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
[0, 0, 0, 1, 0, 0, 0, 2, 0, 0, 0, 2, 1, 0, 0, 0, 0],
[0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 1, 0, 1, 1]], dtype=int64)
Most sklearn methods accept sparse matrices directly.
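For example, a minimal sketch of feeding the sparse matrix straight into an estimator, assuming the Label column from the question is the target (the classifier choice here is just illustrative):

from sklearn.linear_model import LogisticRegression

# The CountVectorizer output can be passed to the estimator as-is,
# without densifying it via .toarray()
clf = LogisticRegression()
clf.fit(X, df.Label)
print(clf.predict(X))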
If you want to "unpack" it:
In [21]: r = pd.DataFrame(X.toarray(), columns=vect.get_feature_names())
In [22]: r
Out[22]:
   ahmed  ali  am  apple  ball  boy  hate  is  love  me  much  red  rose  so  this  too  you
0      0    0   0      0     1    0     1   1     0   0     0    0     0   0     1    0    1
1      1    1   1      0     0    1     0   0     0   0     0    0     0   0     0    0    0
2      0    0   0      1     0    0     0   2     0   0     0    2     1   0     0    0    0
3      0    0   0      0     0    0     0   0     1   1     1    0     0   1     0    1    1
Here is how to do it for the sentences in the column Sentence1. The idea is identical for the Sentence2 column.
import numpy as np

splits = df.Sentence1.str.split(' ')
longest = splits.apply(len).max()
Note that longest is the number of words in the longest sentence. Now create the empty (NaN) columns:
for j in range(1,longest+1):
df[str(j)] = np.nan
And finally, go through the split values and assign them:
for i, words in zip(splits.index, splits.values):
    for k, word in enumerate(words):
        # the column names are the strings '1' .. str(longest)
        df.loc[i, str(k + 1)] = word
It looks like a machine learning problem. Converting from one column to max-number-of-words columns this way may not be efficient.
Another (probably more efficient) solution is converting each word to an integer and then padding every sequence to the length of the longest sentence. TensorFlow has tools for that, as sketched below.
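A minimal sketch of that idea, assuming TensorFlow 2.x and a df that still has the original Sentence1 and Sentence2 columns from the question:

import tensorflow as tf

# Combine the two sentence columns into one list of strings
sentences = (df.Sentence1 + ' ' + df.Sentence2).tolist()

# Map each word to an integer id
tokenizer = tf.keras.preprocessing.text.Tokenizer()
tokenizer.fit_on_texts(sentences)
sequences = tokenizer.texts_to_sequences(sentences)

# Pad every sequence with zeros up to the length of the longest one
padded = tf.keras.preprocessing.sequence.pad_sequences(sequences, padding='post')
print(padded)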
I am writing a script to calculate the volume of an arbitrarily shaped 3D object. I don't care whether the object is hollow or not; I need its total volume.
The data model I have is a 3D table (a histogram of pixels) with ones and zeros: ones where the object is, zeros where there is nothing. For a well-filled object, calculating the volume is as easy as summing all the pixels that contain a one and multiplying by the pixel volume.
The main difficulty arises with a hollow object, where we have zeros surrounded by ones, so the straightforward method described above is no longer valid. What we need to do is fill the whole object area with ones. Here is a 2D example so you can see what I mean.
A 2D table:
0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 1 1 1 1 1 1 0 0 0
0 0 1 1 0 0 0 1 1 1 0 0
0 0 0 1 0 0 1 0 0 0 0 0
0 0 1 0 0 0 0 1 0 0 0 0
0 0 1 0 0 0 0 1 0 0 0 0
0 0 1 1 1 1 1 1 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0
I need to transform it into this:
0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 1 1 1 1 1 1 0 0 0
0 0 1 1 1 1 1 1 1 1 0 0
0 0 0 1 1 1 1 0 0 0 0 0
0 0 1 1 1 1 1 1 0 0 0 0
0 0 1 1 1 1 1 1 0 0 0 0
0 0 1 1 1 1 1 1 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0
If you use scipy, you can do this in one line with binary_fill_holes, and it works in n dimensions. With your example:
import numpy as np
from scipy import ndimage
shape=np.array([
[0,0,0,0,0,0,0,0,0,0,0,0],
[0,0,0,1,1,1,1,1,1,0,0,0],
[0,0,1,1,0,0,0,1,1,1,0,0],
[0,0,0,1,0,0,1,0,0,0,0,0],
[0,0,1,0,0,0,0,1,0,0,0,0],
[0,0,1,0,0,0,0,1,0,0,0,0],
[0,0,1,1,1,1,1,1,0,0,0,0],
[0,0,0,0,0,0,0,0,0,0,0,0]
])
# binary_fill_holes returns a boolean mask with the interior holes filled;
# use it to set those positions to 1
shape[ndimage.binary_fill_holes(shape)] = 1
print(shape)
#Output:
[[0 0 0 0 0 0 0 0 0 0 0 0]
[0 0 0 1 1 1 1 1 1 0 0 0]
[0 0 1 1 1 1 1 1 1 1 0 0]
[0 0 0 1 1 1 1 0 0 0 0 0]
[0 0 1 1 1 1 1 1 0 0 0 0]
[0 0 1 1 1 1 1 1 0 0 0 0]
[0 0 1 1 1 1 1 1 0 0 0 0]
[0 0 0 0 0 0 0 0 0 0 0 0]]
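And to get back to the volume question, a small follow-up sketch (voxel_volume is a hypothetical value for the volume of a single pixel/voxel, which the question does not specify):

# Count the voxels that belong to the filled object and scale by the voxel volume
voxel_volume = 1.0
filled = ndimage.binary_fill_holes(shape)
volume = filled.sum() * voxel_volume
print(volume)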
A standard flood fill should be extensible to three dimensions. From Wikipedia, the 2-d version in outline:
1. If the color of node is not equal to target-color, return.
2. Set the color of node to replacement-color.
3. Perform Flood-fill (one step to the west of node, target-color, replacement-color).
   Perform Flood-fill (one step to the east of node, target-color, replacement-color).
   Perform Flood-fill (one step to the north of node, target-color, replacement-color).
   Perform Flood-fill (one step to the south of node, target-color, replacement-color).
4. Return.
Notice that in step 3 you visit all the adjacent cells. If you change this to find all adjacent cells in 3-d (six neighbours instead of four) and run as before, it should work nicely. A sketch of that idea follows.
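Here is a hedged sketch of that idea in Python. It is not your exact data structure: it assumes a numpy array of 0s and 1s whose corner voxel lies outside the object. Instead of recursing, it floods the exterior background from that corner with a queue and then marks everything that was not reached as part of the object, which avoids having to pick a seed inside every cavity:

from collections import deque
import numpy as np

def fill_holes_3d(vol):
    """Return a copy of a 0/1 volume with all enclosed cavities set to 1."""
    vol = np.asarray(vol)
    outside = np.zeros(vol.shape, dtype=bool)
    queue = deque([(0, 0, 0)])   # assumes the corner voxel is background
    outside[0, 0, 0] = True
    while queue:
        x, y, z = queue.popleft()
        # 6-connected neighbours: the 3-d analogue of west/east/north/south
        for dx, dy, dz in [(1, 0, 0), (-1, 0, 0), (0, 1, 0),
                           (0, -1, 0), (0, 0, 1), (0, 0, -1)]:
            nx, ny, nz = x + dx, y + dy, z + dz
            if (0 <= nx < vol.shape[0] and 0 <= ny < vol.shape[1]
                    and 0 <= nz < vol.shape[2]
                    and not outside[nx, ny, nz]
                    and vol[nx, ny, nz] == 0):
                outside[nx, ny, nz] = True
                queue.append((nx, ny, nz))
    filled = vol.copy()
    filled[~outside] = 1   # everything not reachable from outside is object
    return filled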
Not intuitive and hard to read, but compact:
matrix = [[0, 0, 0, 0, 0, 0],
[0, 0, 1, 0, 1, 0],
[0, 1, 0, 0, 0, 1],
[0, 0, 0, 0, 0, 0]]
ranges = [1 in m and range(m.index(1), len(m)-list(reversed(m)).index(1)) or None for m in matrix]
result = [[ranges[j] is not None and i in ranges[j] and 1 or 0 for i,a in enumerate(m)] for j,m in enumerate(matrix)]
result
[[0, 0, 0, 0, 0, 0],
[0, 0, 1, 1, 1, 0],
[0, 1, 1, 1, 1, 1],
[0, 0, 0, 0, 0, 0]]
matrix=[
[0,0,0,0,0,0,0,0,0,0,0,0],
[0,0,0,1,1,1,1,1,1,0,0,0],
[0,0,1,1,0,0,0,1,1,1,0,0],
[0,0,0,1,0,0,1,0,0,0,0,0],
[0,0,1,0,0,0,0,1,0,0,0,0],
[0,0,1,0,0,0,0,1,0,0,0,0],
[0,0,1,1,1,1,1,1,0,0,0,0],
[0,0,0,0,0,0,0,0,0,0,0,0]
]
def fill(x, y):
    # Recursive flood fill: stop at the array bounds or at a cell that is already 1
    if (x == len(matrix)
            or y == len(matrix[0])
            or x == -1
            or y == -1
            or matrix[x][y] == 1):
        return
    matrix[x][y] = 1
    fill(x + 1, y)
    fill(x - 1, y)
    fill(x, y + 1)
    fill(x, y - 1)

# Start from a point known to be inside the hollow part of the object
fill(4, 4)

for row in matrix:
    print(row)
Assuming you're talking about something like filling a voxel shape, why can't you just do something like this? Take it as pseudocode for the simplified 2D case, since I don't know what data structure you're using (maybe a numpy.array?), so I'm just using a hypothetical "list of lists as a matrix" and ignoring the problem of modifying an iterable while traversing it, etc.:
for i, row in enumerate(matrix):
    last_filled_voxel_j = None
    for j, voxel in enumerate(row):
        if voxel:
            if last_filled_voxel_j is not None:
                fill_matrix(matrix, i, last_filled_voxel_j, j)
            last_filled_voxel_j = j
...assuming that fill_matrix(matrix, row, column_start, column_end) just fills the row of voxels between and not including column_start and column_end.
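For completeness, a minimal sketch of that fill_matrix helper (a hypothetical name, not part of any library):

def fill_matrix(matrix, row, column_start, column_end):
    # Fill the voxels strictly between column_start and column_end in the given row
    for col in range(column_start + 1, column_end):
        matrix[row][col] = 1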
I guess this is probably not the answer you're looking for, but can you expand on what you need to do differently from what I pseudocoded above, so we can be of more help?