Index-based access to rows in pandas.DataFrame with Sparse columns - python

Due to memory limitations I have to use sparse columns in a pandas.DataFrame (pandas version 1.0.5).
Unfortunately, with index-based access to rows (using .loc[]), I am running into the following issue:
import pandas as pd
import scipy.sparse
df = pd.DataFrame.sparse.from_spmatrix(
scipy.sparse.csr_matrix([[0, 0, 0, 1],
[1, 0, 0, 0],
[0, 1, 0, 0]])
)
df
Output:
0 1 2 3
0 0 0 0 1
1 1 0 0 0
2 0 1 0 0
If using .loc:
df.loc[[0,1]]
Output:
0 1 2 3
0 0 0 NaN 1
1 1 0 NaN 0
Ideally, I would expect 0s in column 2 as well. My hypothesis is that the internal CSC-matrix representation, combined with the fact that I am accessing rows of a column that originally contains no non-zero values, messes with the fill value. The dtypes somewhat speak against this:
df.loc[[0,1]].dtypes
Output:
0 Sparse[int32, 0]
1 Sparse[int32, 0]
2 Sparse[float64, 0]
3 Sparse[int32, 0]
(note that the fill-value is still given as 0, even though the view's dtype for column 2 has changed from Sparse[int32, 0] to Sparse[float64, 0]).
Can anyone tell me whether all NaNs occurring in a row-sliced pd.DataFrame with sparse columns indeed stand for the respective zero value and will not "hide" any actual non-zero entries? Is there a "safe" way to use index-based row access on pd.DataFrames with sparse columns?

So this indeed turned out to be a bug in pandas that has been fixed in version 1.1.0 (see GitHub for an issue description and the changelog for 1.1.0).
In 1.1.0 the minimal example works:
df = pd.DataFrame.sparse.from_spmatrix(
scipy.sparse.csr_matrix([[0, 0, 0, 1],
[1, 0, 0, 0],
[0, 1, 0, 0]])
)
df.loc[[0, 1]]
Output:
0 1 2 3
0 0 0 0 1
1 1 0 0 0
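To check whether an installed pandas version is affected, a minimal sketch along these lines compares the row slice against the dense data. The variable names (mat, rows, sliced, expected, matches) are chosen here for illustration and are not part of the original answer.

```python
import numpy as np
import pandas as pd
import scipy.sparse

# Same matrix as in the minimal example above.
mat = scipy.sparse.csr_matrix([[0, 0, 0, 1],
                               [1, 0, 0, 0],
                               [0, 1, 0, 0]])
df = pd.DataFrame.sparse.from_spmatrix(mat)

rows = [0, 1]
sliced = df.loc[rows]

# Densify the slice and compare it with the same rows of the dense matrix.
# A NaN in the slice makes the comparison fail.
expected = mat.toarray()[rows]
matches = np.array_equal(sliced.sparse.to_dense().to_numpy(), expected)
print(pd.__version__, matches)
```

If matches is False, the slice contains values (such as NaN) that are not in the original matrix, which is the symptom described in the question.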

Related

Inplace operation on specific lines and columns of dataframe

Say I have a dataframe with negative values on specific columns:
df = pd.DataFrame([[1, 1, -1],[-1, 1, 1],[-1, -1, 1]])
Now, I want to inplace clip the negative values to 0 on only specific lines and columns:
df.loc[[1, 2], [0, 1]].clip(lower=0, inplace=True)
But this doesn't work:
df
Out:
0 1 2
0 1 1 -1
1 -1 1 1
2 -1 -1 1
This is because slicing a DataFrame with lists of labels returns a copy:
df.loc[[1, 2], [0, 1]] is df.loc[[1, 2], [0, 1]]
Out: False
How do I make inplace changes to specific rows and columns then?
How about using df.lt instead:
df[df.loc[[1, 2], [0, 1]].lt(0)] = 0
print(df)
0 1 2
0 1 1 -1
1 0 1 1
2 0 0 1
You can do this:
df.loc[[1, 2], [0, 1]] = df.loc[[1, 2], [0, 1]].clip(lower=0)
Output:
0 1 2
0 1 1 -1
1 0 1 1
2 0 0 1
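As a self-contained sketch of the assign-back approach, the snippet below clips the selection and writes the result into the same rows and columns. The names rows and cols are introduced here only for readability.

```python
import pandas as pd

df = pd.DataFrame([[1, 1, -1], [-1, 1, 1], [-1, -1, 1]])

rows, cols = [1, 2], [0, 1]

# .loc[...] with lists returns a copy, so clip the copy and
# assign it back to the same selection of the original frame.
df.loc[rows, cols] = df.loc[rows, cols].clip(lower=0)
print(df)
```

Row 0 is left untouched, so its -1 in column 2 remains.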

I am trying to put array into a pandas dataframe

import pandas as pd
import numpy as np
zeros=np.zeros((6,6))
arra=np.array([zeros])
rownames=['A','B','C','D','E','F']
colnames=[['one','tow','three','four','five','six']]
df=pd.DataFrame(arra,index=rownames,columns=colnames)
print(df)
Error:
ValueError: Must pass 2-d input. shape=(1, 6, 6)
My desired output is :
A B C D E F
one 0 0 0 0 0 0
tow 0 0 0 0 0 0
three 0 0 0 0 0 0
four 0 0 0 0 0 0
five 0 0 0 0 0 0
six 0 0 0 0 0 0
Try this
pd.DataFrame(np.zeros((6,6)), columns=list('ABCDEF'), index=['one','tow','three','four','five','six'])
If you want to initialize your DataFrame with a single value, you don't need to bother creating a 2D array, just pass the desired scalar to the DataFrame constructor and it will broadcast:
import pandas as pd
rownames=['A','B','C','D','E','F']
colnames=['one','tow','three','four','five','six']
df=pd.DataFrame(0, index=rownames, columns=colnames)
print(df)
Output:
one tow three four five six
A 0 0 0 0 0 0
B 0 0 0 0 0 0
C 0 0 0 0 0 0
D 0 0 0 0 0 0
E 0 0 0 0 0 0
F 0 0 0 0 0 0
Try this
zeros=np.zeros((6,6), dtype=int)
df=pd.DataFrame(zeros, columns=['A','B','C','D','E','F'], index=['one','tow','three','four','five','six'])
Note that in your desired output 'A','B','C','D','E','F' are the column names and 'one','tow','three','four','five','six' are the index labels; your code has them swapped.
The reason you got that error is the line arra=np.array([zeros]), which wraps the 2-D array in an extra dimension and produces a 3-D array of shape (1, 6, 6) (note the '[[[' in the output below), but you need a 2-D array to create a DataFrame.
array([[[0, 0, 0, 0, 0, 0],
[0, 0, 0, 0, 0, 0],
[0, 0, 0, 0, 0, 0],
[0, 0, 0, 0, 0, 0],
[0, 0, 0, 0, 0, 0],
[0, 0, 0, 0, 0, 0]]])
Hope this helped!
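The shape problem can be seen directly by printing the shapes. This is only an illustrative sketch: taking arra[0] is one possible way to recover the 2-D array, not something the answers above prescribe.

```python
import numpy as np
import pandas as pd

zeros = np.zeros((6, 6), dtype=int)
arra = np.array([zeros])      # wrapping in a list adds a leading dimension

print(zeros.shape)            # 2-D
print(arra.shape)             # 3-D, which the DataFrame constructor rejects

# Taking the first (and only) element gives back the 2-D array.
df = pd.DataFrame(arra[0],
                  columns=list('ABCDEF'),
                  index=['one', 'tow', 'three', 'four', 'five', 'six'])
print(df)
```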

Data frame mode function

Hi, I am using the df.mode() function to find the most common value in each row, calling df.mode(axis=1). This gives me an extra column; how can I get only one column?
for example I have a data frame
0 1 2 3 4
1 1 0 1 1 1
2 0 1 0 0 1
3 0 0 1 1 0
so I want the output
1 1
2 0
3 0
but I am getting
1 1 NaN
2 0 NaN
3 0 NaN
Does anyone know why?
The code you tried gives the expected output in Python 3.7.6 with Pandas 1.0.3.
import pandas as pd
df = pd.DataFrame(
data=[[1, 0, 1, 1, 1], [0, 1, 0, 0, 1], [0, 0, 1, 1, 0]],
index=[1, 2, 3])
df
0 1 2 3 4
1 1 0 1 1 1
2 0 1 0 0 1
3 0 0 1 1 0
df.mode(axis=1)
0
1 1
2 0
3 0
There could be different data types in your columns, and mode may not compare values of different data types the way you expect.
Use astype(str) or astype(int) to convert each Series to a suitable data type. Make sure the data type is consistent across the DataFrame before calling mode(axis=1).
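Another way an extra column full of NaN can appear, independent of dtypes, is a tie: if any row has more than one most-frequent value, mode(axis=1) returns one column per mode and pads the other rows with NaN. The sketch below assumes the real data contains such a row; row 4 is hypothetical and was added only to trigger the effect.

```python
import pandas as pd

df = pd.DataFrame(
    data=[[1, 0, 1, 1, 1],
          [0, 1, 0, 0, 1],
          [0, 0, 1, 1, 0],
          [0, 0, 1, 1, 2]],   # hypothetical row: 0 and 1 both occur twice
    index=[1, 2, 3, 4])

modes = df.mode(axis=1)
print(modes)                  # two columns; rows without a tie get NaN

# Keep only the first mode of each row.
first_mode = modes.iloc[:, 0].astype(int)
print(first_mode)
```

Selecting the first column keeps the smallest of the tied values for each row, which may or may not be the tie-breaking rule you want.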

pandas DataFrame set non-contiguous sections

I have a DataFrame like below and would like for B to be 1 for n rows after the 1 in column A (where below n = 2)
index A B
0 0 0
1 1 0
2 0 1
3 0 1
4 1 0
5 0 1
6 0 1
7 0 0
8 1 0
9 0 1
I think I can do it using .ix similar to this example but am not sure how. I'd like to do it in a single pandas-style selection command if possible. (Ideally not using rolling_apply.)
Modifying a subset of rows in a pandas dataframe
EDIT: the application is that a 1 in column A is "ignored" if it falls within n rows of the previous 1. As per the comments, for n = 2 and these examples:
A = [1, 0, 1, 0, 1], B should be = [0, 1, 1, 0, 0]
A = [1, 1, 0, 0], B should be [0, 1, 1, 0]
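The rule described above can be sketched with a plain loop. This is not the single selection command the question asks for; mark_following_rows is a hypothetical helper, and it assumes that a 1 is ignored when it lies within n rows of the last 1 that was not itself ignored.

```python
import pandas as pd

def mark_following_rows(a, n):
    """Return B for a 0/1 sequence A: 1 for the n rows after each counted 1."""
    b = [0] * len(a)
    last = None                       # position of the last counted 1
    for i, value in enumerate(a):
        if value == 1 and (last is None or i - last > n):
            last = i
            # Mark the next n rows, stopping at the end of the data.
            for j in range(i + 1, min(i + n + 1, len(a))):
                b[j] = 1
    return b

df = pd.DataFrame({'A': [0, 1, 0, 0, 1, 0, 0, 0, 1, 0]})
df['B'] = mark_following_rows(df['A'].tolist(), n=2)
print(df)
```

The loop is linear in the number of rows; a vectorized version would need extra care to handle the "ignored" 1s.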

custom sorting pandas dataframe

I have a (very large) table using pandas.DataFrame. It contains wordcounts from texts; the index is the wordlist:
one.txt third.txt two.txt
a 1 1 0
i 0 0 1
is 1 1 1
no 0 0 1
not 0 1 0
really 1 0 0
sentence 1 1 1
short 2 0 0
think 0 0 1
I want to sort the wordlist by the frequency of the words across all texts. I can easily create a Series which contains the frequency sum for each word (using the words as index). But how can I sort on this Series?
One easy way would be to add the list to the dataframe as column, sort on it and then delete it. For performance reasons I would like to avoid this.
Two other ways are described here, but the one duplicates the dataframe which is a problem because of its size, and the other creates a new index, but I need the information about the words further down the line.
You could compute the frequency and use the sort_values method (the replacement for the older Series.sort) to find the desired order of the index. Then use df.loc[order.index] to reorder the original DataFrame:
order = df.sum(axis=1).sort_values()
result = df.loc[order.index]
For example,
import pandas as pd
df = pd.DataFrame({
'one.txt': [1, 0, 1, 0, 0, 1, 1, 2, 0],
'third.txt': [1, 0, 1, 0, 1, 0, 1, 0, 0],
'two.txt': [0, 1, 1, 1, 0, 0, 1, 0, 1]},
index=['a', 'i', 'is', 'no', 'not', 'really', 'sentence', 'short', 'think'])
order = df.sum(axis=1).sort_values(ascending=False)
print(df.loc[order.index])
yields (rows with equal totals may come out in a different order depending on the pandas version and sort algorithm)
one.txt third.txt two.txt
sentence 1 1 1
is 1 1 1
short 2 0 0
a 1 1 0
think 0 0 1
really 1 0 0
not 0 1 0
no 0 0 1
i 0 0 1
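If the order of words with equal totals matters, one option is to sort positions with NumPy's stable argsort and reorder with iloc. This is a sketch of an alternative, not part of the answer above; the names totals and positions are illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'one.txt': [1, 0, 1, 0, 0, 1, 1, 2, 0],
    'third.txt': [1, 0, 1, 0, 1, 0, 1, 0, 0],
    'two.txt': [0, 1, 1, 1, 0, 0, 1, 0, 1]},
    index=['a', 'i', 'is', 'no', 'not', 'really', 'sentence', 'short', 'think'])

totals = df.sum(axis=1)

# Negate the totals so an ascending stable sort gives descending frequency,
# with ties kept in their original row order.
positions = np.argsort(-totals.to_numpy(), kind='stable')
result = df.iloc[positions]
print(result)
```

The word index is preserved, and no helper column is added to the DataFrame, though result is still a reordered copy of the data.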
