In pandas, how can I check how sparse a DataFrame is? Is there a function available, or will I need to write my own?
For now, I have this:
import pandas as pd

df = pd.DataFrame({'a':[1,0,1,1,3], 'b':[0,0,0,0,1], 'c':[4,0,0,0,0], 'd':[0,0,3,0,0]})
a b c d
0 1 0 4 0
1 0 0 0 0
2 1 0 0 3
3 1 0 0 0
4 3 1 0 0
sparsity = sum((df == 0).astype(int).sum())/df.size
This divides the number of zeros by the total number of elements; in this example it comes out to 0.65.
I wanted to know if there is a better way to do this, and whether there is a function that gives more information about the sparsity (such as the share of NaNs, or of any other prominent number like -1).
One idea is to convert to a NumPy array, compare, and take the mean:
a = (df.to_numpy() == 0).mean()
print (a)
0.65
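The same comparison generalizes to the other values mentioned in the question; a small sketch, assuming you want the share of NaNs or of another prominent value like -1:
import pandas as pd

df = pd.DataFrame({'a':[1,0,1,1,3], 'b':[0,0,0,0,1], 'c':[4,0,0,0,0], 'd':[0,0,3,0,0]})

# share of NaN entries (0.0 here, since this frame has none)
print(df.isna().to_numpy().mean())

# share of any other prominent value, e.g. -1
print((df.to_numpy() == -1).mean())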
If you want to use sparse dtypes, you can do:
#convert each column to SparseArray
sparr = df.apply(pd.arrays.SparseArray)
print (sparr)
a b c d
0 1 0 4 0
1 0 0 0 0
2 1 0 0 3
3 1 0 0 0
4 3 1 0 0
print (sparr.dtypes)
a Sparse[int64, 0]
b Sparse[int64, 0]
c Sparse[int64, 0]
d Sparse[int64, 0]
dtype: object
print (sparr.sparse.density)
0.35
As of September 16th, 2021 (and, I want to say, good for any version > 0.25.0, released July 2019), the sparse accessor gives DataFrame.sparse.density, which is exactly what you're looking for.
Of course, in order to do that you need to actually convert to a sparse DataFrame: df.astype(pd.SparseDtype("int", 0))
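Putting the conversion and the accessor together, a minimal sketch:
import pandas as pd

df = pd.DataFrame({'a':[1,0,1,1,3], 'b':[0,0,0,0,1], 'c':[4,0,0,0,0], 'd':[0,0,3,0,0]})

sdf = df.astype(pd.SparseDtype("int", 0))  # 0 is the fill value
print(sdf.sparse.density)                  # 0.35, so sparsity = 1 - 0.35 = 0.65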
Related
I have a dataframe as follows:
import pandas as pd
df = pd.DataFrame({'sent.1':[0,1,0,1],
'sent.2':[0,1,1,0],
'sent.3':[0,0,0,1],
'sent.4':[1,1,0,1]
})
I am trying to replace the non-zero values with the character at index 5 of the column name (which is the numeric part of the column name), so the output should be:
sent.1 sent.2 sent.3 sent.4
0 0 0 0 4
1 1 2 0 4
2 0 2 0 0
3 1 0 3 4
I have tried the following, but it does not work:
print(df.replace(1, pd.Series([i[5] for i in df.columns], [i[5] for i in df.columns])))
However, when I build the Series from the full column names instead, the code works, so I am not sure which part is wrong:
print(df.replace(1, pd.Series(df.columns, df.columns)))
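The part that is wrong is the Series index: as your second snippet shows, replace aligns the value Series against the full column names, not against their 5th characters. A fixed version of your attempt (assuming, as in your example, that the only non-zero value is 1):
print(df.replace(1, pd.Series([int(i[5]) for i in df.columns], index=df.columns)))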
Since you're dealing with 1s and 0s, you can actually just multiply the dataframe by a range:
df = df * range(1, df.shape[1] + 1)
Output:
sent.1 sent.2 sent.3 sent.4
0 0 0 0 4
1 1 2 0 4
2 0 2 0 0
3 1 0 3 4
Or, if you want to take the numbers from the column names:
df = df * df.columns.str.split('.').str[-1].astype(int)
You could use string multiplication on a boolean array to place the strings based on the condition, and where to restore the zeros:
mask = df.ne(0)
(mask*df.columns.str[5]).where(mask, 0)
To have integers:
mask = df.ne(0)
(mask*df.columns.str[5].astype(int))
Output:
sent.1 sent.2 sent.3 sent.4
0 0 0 0 4
1 1 2 0 4
2 0 2 0 0
3 1 0 3 4
And another one, working with an arbitrary condition (here s.ne(0)):
df.apply(lambda s: s.mask(s.ne(0), s.name.rpartition('.')[-1]))
I have a pandas dataframe which looks like the following
team_id skill_id inventor_id
1       A        Jack
1       B        Jack
1       A        Jill
1       B        Jill
2       A        Jack
2       B        Jack
2       A        Joe
2       B        Joe
So inventors can repeat over teams. I want to turn this dataframe into a matrix A of dummy indicators (I have included column names below for clarity; they wouldn't form part of the matrix). For this example, A =
Jack_A Jack_B Jill_A Jill_B Joe_A Joe_B
1      0      1      0      0     0
0      1      0      1      0     0
1      0      0      0      1     0
0      1      0      0      0     1
So that each row corresponds to one (team_id x skill_id combination), and each entry of the matrix is equal to one for that (inventor_id x skill_id) observation.
I tried creating an array of numpy zeros and thought of a double dictionary to map each (team_id x skill), (inventor_id x skill) combination to an A_ij entry, but I believe this cannot be the most efficient method.
I need the method to be memory efficient, as I have 220,000 (inventor x team x skill) observations. (So the dimension of the real df is (220,000, 3), not (8, 3) as in the example.)
In addition to @Ben.T's great answer, I figured out another way which allows me to stay memory efficient.
# Set the identifier for each row
inventor_data["team_id"] = inventor_data["team_id"].astype(str)
inventor_data["inv_skill_id"] = inventor_data["inventor_id"] + inventor_data["skill_id"]
inventor_data["team_skill_id"] = inventor_data["team_id"] + inventor_data["skill_id"]
# Using DictVectorizer requires a dictionary input
teams = list(inventor_data.groupby('team_skill_id')['inv_skill_id'].agg(dict))
# Change the dict entry from count to 1
for team_id, team in enumerate(teams):
    teams[team_id] = {v: 1 for k, v in team.items()}
from sklearn.feature_extraction import DictVectorizer
vectoriser = DictVectorizer(sparse=False)
X = vectoriser.fit_transform(teams)
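A follow-up note on memory: sparse=True (the DictVectorizer default) returns a scipy.sparse matrix instead of a dense array, and the inv_skill_id label for each column can be recovered from the vectoriser:
vectoriser = DictVectorizer(sparse=True)
X = vectoriser.fit_transform(teams)  # scipy.sparse CSR matrix
print(vectoriser.feature_names_)     # column labels, e.g. ['JackA', 'JackB', ...]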
IIUC, you can use crosstab:
print(
pd.crosstab(
index=[df['team_id'],df['skill_id']],
columns=[df['inventor_id'], df['skill_id']]
)#.to_numpy()
)
# inventor_id Jack Jill Joe
# skill_id A B A B A B
# team_id skill_id
# 1 A 1 0 1 0 0 0
# B 0 1 0 1 0 0
# 2 A 1 0 0 0 1 0
# B 0 1 0 0 0 1
and if you just want the matrix, then uncomment .to_numpy() in the above code.
Note: if you have some skills that are not shared between teams or inventors, you may need to reindex with all the possibilities, so do:
pd.crosstab(
index=[df['team_id'],df['skill_id']],
columns=[df['inventor_id'], df['skill_id']]
).reindex(
index=pd.MultiIndex.from_product(
[df['team_id'].unique(),df['skill_id'].unique()]),
columns=pd.MultiIndex.from_product(
[df['inventor_id'].unique(),df['skill_id'].unique()]),
fill_value=0
)#.to_numpy()
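Since memory was the concern, one further option (my own suggestion, not part of the original answer) is to convert the crosstab result to a scipy sparse matrix; note that crosstab still builds the dense table first, so this only saves memory downstream:
from scipy import sparse

ct = pd.crosstab(index=[df['team_id'], df['skill_id']],
                 columns=[df['inventor_id'], df['skill_id']])
A = sparse.csr_matrix(ct.to_numpy())  # compressed sparse row matrix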
I have a pandas dataframe with readings at 6-minute intervals. I want to mark each row as either NF or DF.
NF = rows with 5 consecutive entries being 0 and at least one prior reading being greater than 0
DF = All other rows that do not meet the NF rule
[[4,6,7,2,1,0,0,0,0,0],
 [6,0,0,0,0,0,2,2,2,5],
 [0,0,0,0,0,0,0,0,0,0],
 [0,0,0,0,0,4,6,7,2,1]]
Expected Result:
[NF, NF, DF, DF]
Can I use a sliding window for this? What is a good pythonic way of doing this?
A numpy vectorised solution, with two conditions operating on a truth matrix:
It uses the fact that True is 1, so cumsum() can be used.
The position of the 5th zero should be 4 places after the 1st.
If you just want the array, np.where() gives it without assigning back to a dataframe column.
I used another test case, [1,0,0,0,0,1,0,0,0,0], where there are many zeros but not 5 consecutive.
import numpy as np
import pandas as pd

df = pd.DataFrame([[4,6,7,2,1,0,0,0,0,0],
[6,0,0,0,0,0,2,2,2,5],
[0,0,0,0,0,0,0,0,0,0],
[1,0,0,0,0,1,0,0,0,0],
[0,0,0,0,0,4,6,7,2,1]])
df = df.assign(res=np.where(
# five consecutive zeros
((np.argmax(np.cumsum(df.eq(0).values, axis=1)==1, axis=1)+4) ==
np.argmax(np.cumsum(df.eq(0).values, axis=1)==5, axis=1)) &
# first zero somewhere other than 0th position
np.argmax(df.eq(0).values, axis=1)>0
,"NF","DF")
)
   0  1  2  3  4  5  6  7  8  9 res
0  4  6  7  2  1  0  0  0  0  0  NF
1  6  0  0  0  0  0  2  2  2  5  NF
2  0  0  0  0  0  0  0  0  0  0  DF
3  1  0  0  0  0  1  0  0  0  0  DF
4  0  0  0  0  0  4  6  7  2  1  DF
I am a beginner, and I really need help with the following:
I need to do something similar to the following, but on a two-dimensional dataframe: Identifying consecutive occurrences of a value.
I need to use that answer but for a two-dimensional dataframe, counting at least 2 consecutive ones along the columns dimension. Here is a sample dataframe:
my_df=
0 1 2
0 1 0 1
1 0 1 0
2 1 1 1
3 0 0 1
4 0 1 0
5 1 1 0
6 1 1 1
7 1 0 1
The output I am looking for is:
0 1 2
0 3 5 4
Instead of the column 'consecutive', I need a new output called "out_1_df" for the line
df.Count.groupby((df.Count != df.Count.shift()).cumsum()).transform('size') * df.Count
So that later I can do
threshold = 2
out_2_df = (out_1_df > threshold).astype(int)
I tried the following:
out_1_df = my_df.groupby((my_df != my_df.shift(axis=0)).cumsum(axis=0))
out_2_df = (out_1_df > threshold).astype(int)
How can I modify this?
Try:
import pandas as pd
df=pd.DataFrame({0:[1,0,1,0,0,1,1,1], 1:[0,1,1,0,1,1,1,0], 2: [1,0,1,1,0,0,1,1]})
out_2_df=((df.diff(axis=0).eq(0)|df.diff(periods=-1,axis=0).eq(0))&df.eq(1)).sum(axis=0)
>>> out_2_df
0    3
1    5
2    4
dtype: int64
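For reference, the one-dimensional line from the linked answer can also be applied column-wise, which yields the intermediate out_1_df the question asked for; a sketch, counting runs of at least 2 (hence >= 2 rather than > threshold):
def run_lengths(s):
    # length of the run each element belongs to, zeroed out where s is 0
    return s.groupby((s != s.shift()).cumsum()).transform('size') * s

out_1_df = df.apply(run_lengths)
out_2_df = (out_1_df >= 2).astype(int)
print(out_2_df.sum(axis=0))  # 3, 5, 4 again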
I am working on a data project, and I'm trying to speed up my initial data processing because inevitably I will want to do something else/new with the data. So far I've been doing more vectorization, using np.where and the like, and I've seen material gains from just that.
The last bit of my code that I need to process is the slowest: I am using iterrows to cycle through a very large dataframe (over a million rows).
What I am essentially trying to do is the SQL equivalent of
select curr.value, prev.value from t1 curr left join t1 prev on curr.number = prev.number - 1
As far as I know, there's no way to join a DataFrame on itself like that. Is there some other way to iterate through it to compare the current and previous values? Here's how the data frame currently looks
df =
[a b c
3 1 0
4 1 0
5 1 0
6 0 1]
Note that b goes from 1 to 0, and that's what I am trying to capture, such that I would end up with a df that looks like this:
[a b c b_c
3 1 0 0
4 1 0 0
5 1 0 0
6 0 1 1]
Any help is much appreciated, thank you.
I think you are looking for something like this. Basically you want to know the switch from b to c and back.
df = pd.DataFrame()
df["a"] = [3,4,5,6,7,8,9]
df["b"] = [1,1,1,0,0,1,1]
df["c"] = [0,0,0,1,1,0,0]
df["b_c"] = df["b"].eq(df["c"].shift()).astype(int)
print(df)
Output:
a b c b_c
0 3 1 0 0
1 4 1 0 0
2 5 1 0 0
3 6 0 1 1
4 7 0 1 0
5 8 1 0 1
6 9 1 0 0
I'm not sure if this is the fastest method out there, or whether it's faster than iterrows, but I assume it is (at least it looks nice).
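For completeness, the SQL-style self-join from the question can also be expressed directly with merge; a sketch, assuming a frame t1 with columns 'number' and 'value' as in the query:
import pandas as pd

t1 = pd.DataFrame({'number': [1, 2, 3, 4], 'value': [10, 20, 30, 40]})

# shift the join key so that curr.number matches prev.number - 1
prev = t1.assign(number=t1['number'] - 1)
joined = t1.merge(prev, on='number', how='left', suffixes=('_curr', '_prev'))
print(joined[['number', 'value_curr', 'value_prev']])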