I have sparse data stored in a dataframe:
import numpy as np
import pandas as pd

df = pd.DataFrame({'a':[1,3,5], 'b':[2,5,5], 'data':np.random.randn(3)})
a b data
0 1 2 -0.824022
1 3 5 0.503239
2 5 5 -0.540105
Since I care about the null entries as well, the actual data would look like this:
true_df
a b data
0 1 1 NaN
1 1 2 -0.824022
2 1 3 NaN
3 1 4 NaN
4 1 5 NaN
5 2 1 NaN
6 2 2 NaN
7 2 3 NaN
8 2 4 NaN
9 2 5 NaN
10 3 1 NaN
11 3 2 NaN
12 3 3 NaN
13 3 4 NaN
14 3 5 0.503239
15 4 1 NaN
16 4 2 NaN
17 4 3 NaN
18 4 4 NaN
19 4 5 NaN
20 5 1 NaN
21 5 2 NaN
22 5 3 NaN
23 5 4 NaN
24 5 5 -0.540105
My question is: how do I construct true_df? I was hoping there was some way to use pd.concat or pd.merge, that is, to construct a dataframe with the shape of the dense table and then join the two dataframes, but that doesn't join in the expected way (the columns are not combined). The ultimate goal is to pivot on a and b.
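One hedged sketch of the merge-based idea described above (the grid helper and its construction are my assumption, not from the post): build the full dense (a, b) grid first, then left-merge the sparse frame onto it so the missing pairs become NaN.
import numpy as np
import pandas as pd
df = pd.DataFrame({'a':[1,3,5], 'b':[2,5,5], 'data':np.random.randn(3)})
# Every (a, b) combination as a plain two-column frame ("grid" is a hypothetical helper name).
grid = pd.MultiIndex.from_product(
    [range(1, df['a'].max()+1), range(1, df['b'].max()+1)], names=['a', 'b']
).to_frame(index=False)
# The left merge keeps all 25 grid rows; pairs absent from df get NaN in 'data'.
true_df = grid.merge(df, on=['a', 'b'], how='left')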
As a follow-up, since I think kinjo's answer is correct: why does this only work for integers and not for floats? Using:
import pandas as pd
import numpy as np
df = pd.DataFrame({'a':[1.0,1.3,1.5], 'b':[1.2,1.5,1.5], 'data':np.random.randn(3)})
### Create all possible combinations of a,b
newindex = [(b,a) for b in np.arange(1,df.b.max()+0.1, 0.1) for a in np.arange(1,df.a.max()+0.1,0.1)]
### Set the index as a,b and reindex
df.set_index(['a','b']).reindex(newindex).reset_index()
Will return:
a b data
0 1.0 1.0 NaN
1 1.0 1.1 NaN
2 1.0 1.2 NaN
3 1.0 1.3 NaN
4 1.0 1.4 NaN
5 1.0 1.5 NaN
6 1.0 1.6 NaN
7 1.1 1.0 NaN
8 1.1 1.1 NaN
9 1.1 1.2 NaN
10 1.1 1.3 NaN
11 1.1 1.4 NaN
12 1.1 1.5 NaN
13 1.1 1.6 NaN
14 1.2 1.0 NaN
15 1.2 1.1 NaN
16 1.2 1.2 NaN
17 1.2 1.3 NaN
18 1.2 1.4 NaN
19 1.2 1.5 NaN
20 1.2 1.6 NaN
21 1.3 1.0 NaN
22 1.3 1.1 NaN
23 1.3 1.2 NaN
24 1.3 1.3 NaN
25 1.3 1.4 NaN
26 1.3 1.5 NaN
27 1.3 1.6 NaN
28 1.4 1.0 NaN
29 1.4 1.1 NaN
30 1.4 1.2 NaN
31 1.4 1.3 NaN
32 1.4 1.4 NaN
33 1.4 1.5 NaN
34 1.4 1.6 NaN
35 1.5 1.0 NaN
36 1.5 1.1 NaN
37 1.5 1.2 NaN
38 1.5 1.3 NaN
39 1.5 1.4 NaN
40 1.5 1.5 NaN
41 1.5 1.6 NaN
42 1.6 1.0 NaN
43 1.6 1.1 NaN
44 1.6 1.2 NaN
45 1.6 1.3 NaN
46 1.6 1.4 NaN
47 1.6 1.5 NaN
48 1.6 1.6 NaN
Reindex is a straightforward solution. Similar to @jezrael's solution, but with no need for merge.
import pandas as pd
import numpy as np
df = pd.DataFrame({'a':[1,3,5], 'b':[2,5,5], 'data':np.random.randn(3)})
### Create all possible combinations of a,b
newindex = [(a, b) for a in range(1, df.a.max()+1) for b in range(1, df.b.max()+1)]
### Set the index as a,b and reindex
df.set_index(['a','b']).reindex(newindex)
You can then reset the index if you want the numeric count as your overall index.
In the case that your index is floats you should use linspace and not arange:
import pandas as pd
import numpy as np
df = pd.DataFrame({'a':[1.0,1.3,1.5], 'b':[1.2,1.5,1.5], 'data':np.random.randn(3)})
### Create all possible combinations of a,b: a 1.0-1.5 grid in steps of 0.1
### (rounding the grid guards against residual floating-point error in the labels)
grid = np.round(np.linspace(1.0, 1.5, 6), 1)
newindex = [(a, b) for a in grid for b in grid]
### Set the index as a,b and reindex
df.set_index(['a','b']).reindex(newindex).reset_index()
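A likely reason the arange version above returns all NaN (my explanation, not from the original post): reindex requires exact, bit-for-bit label matches, and building the grid with a float step in np.arange accumulates rounding error, so the generated labels generally do not equal stored literals such as 1.2.
import numpy as np
# Labels built with a float step typically differ from the stored literals by a tiny amount,
# so the exact lookups done by reindex fail and every row comes back NaN.
grid = np.arange(1, 1.6, 0.1)
print(repr(grid[2]))      # often 1.2000000000000002 rather than 1.2
print(grid[2] == 1.2)     # typically False, hence the NaNs after reindex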
Since you intend to pivot on a and b, you could obtain the pivoted result with
import numpy as np
import pandas as pd
df = pd.DataFrame({'a':[1,3,5], 'b':[2,5,5], 'data':np.random.randn(3)})
result = pd.DataFrame(np.nan, index=range(1,6), columns=range(1,6))
result.update(df.pivot(index='a', columns='b', values='data'))
print(result)
which yields
1 2 3 4 5
1 NaN 0.436389 NaN NaN NaN
2 NaN NaN NaN NaN NaN
3 NaN NaN NaN NaN -1.066621
4 NaN NaN NaN NaN NaN
5 NaN NaN NaN NaN 0.328880
This is a nice, fast approach for converting numeric data from sparse to dense, using SciPy's sparse functionality. It works if your ultimate goal is the pivoted (i.e. dense) dataframe:
import numpy as np
import pandas as pd
from scipy.sparse import csr_matrix
df = pd.DataFrame({'a':[1,3,5], 'b':[2,5,5], 'data':np.random.randn(3)})
df_shape = df['a'].max()+1, df['b'].max()+1
sp_df = csr_matrix((df['data'], (df['a'], df['b'])), shape=df_shape)
df_dense = pd.DataFrame.sparse.from_spmatrix(sp_df)
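A small hedged follow-up, continuing the snippet above (not part of the original answer): the frame returned by from_spmatrix holds SparseDtype columns and is labeled 0..max on both axes, so you may want to densify it.
# Convert the sparse-backed columns to ordinary dense ones if needed.
dense = df_dense.sparse.to_dense()
print(dense.shape)   # (6, 6) here: rows and columns run 0..5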
I am looking for a method to create an array of numbers that labels groups, based on the values in the 'number' column, if that's possible.
With this abbreviated example DF:
import numpy as np
import pandas as pd

number = [np.nan, np.nan, 1, np.nan, np.nan, np.nan, 2, np.nan, np.nan, 3, np.nan, np.nan, np.nan, np.nan, np.nan, 4, np.nan, np.nan]
df = pd.DataFrame({'number': number})
Ideally I would like to make a new column, 'group', based on the ints in the 'number' column, so there would effectively be groups of 1, 2, 3, etc. FWIW, the DF is thousands of lines long, with sporadically placed ints.
The result would be a new column, something like this:
number group
0 NaN 0
1 NaN 0
2 1.0 1
3 NaN 1
4 NaN 1
5 NaN 1
6 2.0 2
7 NaN 2
8 NaN 2
9 3.0 3
10 NaN 3
11 NaN 3
12 NaN 3
13 NaN 3
14 NaN 3
15 4.0 4
16 NaN 4
17 NaN 4
All advice much appreciated!
You can use notna combined with cumsum:
df['group'] = df['number'].notna().cumsum()
NB. if you had zeros: df['group'] = df['number'].ne(0).cumsum().
output:
number group
0 NaN 0
1 NaN 0
2 1.0 1
3 NaN 1
4 NaN 1
5 NaN 1
6 2.0 2
7 NaN 2
8 NaN 2
9 3.0 3
10 NaN 3
11 NaN 3
12 NaN 3
13 NaN 3
14 NaN 3
15 4.0 4
16 NaN 4
17 NaN 4
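For illustration (my addition, not part of the original answer), the two intermediate pieces look like this: notna flags the rows that start a group, and cumsum turns those flags into running group ids.
starts = df['number'].notna()            # False, False, True, False, ... per row
print(starts.cumsum().tolist()[:6])      # [0, 0, 1, 1, 1, 1] -> the group ids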
You can use forward fill:
df['number'].ffill().fillna(0)
Output:
0 0.0
1 0.0
2 1.0
3 1.0
4 1.0
5 1.0
6 2.0
7 2.0
8 2.0
9 3.0
10 3.0
11 3.0
12 3.0
13 3.0
14 3.0
15 4.0
16 4.0
17 4.0
Name: number, dtype: float64
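If you prefer integer group labels (an optional tweak, not in the original answer), you can cast after filling; this is safe because fillna(0) removes every NaN before the cast:
df['group'] = df['number'].ffill().fillna(0).astype(int)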
What is the most pandastic way to forward fill with ascending logic (without iterating over the rows)?
input:
import pandas as pd
import numpy as np
df = pd.DataFrame()
df['test'] = np.nan,np.nan,1,np.nan,np.nan,3,np.nan,np.nan,2,np.nan,6,np.nan,np.nan
df['desired_output'] = np.nan,np.nan,1,1,1,3,3,3,3,3,6,6,6
print (df)
output:
test desired_output
0 NaN NaN
1 NaN NaN
2 1.0 1.0
3 NaN 1.0
4 NaN 1.0
5 3.0 3.0
6 NaN 3.0
7 NaN 3.0
8 2.0 3.0
9 NaN 3.0
10 6.0 6.0
11 NaN 6.0
12 NaN 6.0
In the 'test' column, the number of consecutive NaN's is random.
In the 'desired_output' column, trying to forward fill with ascending values only. Also, when lower values are encountered (row 8, value = 2.0 above), they are overwritten with the current higher value.
Can anyone help? Thanks in advance.
You can combine cummax, which takes the running maximum while leaving the NaNs in place, with ffill to fill in those NaNs:
df['desired_output'] = df['test'].cummax().ffill()
output:
test desired_output
0 NaN NaN
1 NaN NaN
2 1.0 1.0
3 NaN 1.0
4 NaN 1.0
5 3.0 3.0
6 NaN 3.0
7 NaN 3.0
8 2.0 3.0
9 NaN 3.0
10 6.0 6.0
11 NaN 6.0
12 NaN 6.0
intermediate Series:
df['test'].cummax()
0 NaN
1 NaN
2 1.0
3 NaN
4 NaN
5 3.0
6 NaN
7 NaN
8 3.0
9 NaN
10 6.0
11 NaN
12 NaN
Name: test, dtype: float64
I want to generate a column count that counts the pts values per id group. The condition is that a row's pts is counted only if both x and y are NaN; otherwise it is ignored.
Sample Df:
id pts x y
0 1 0.1 NaN NaN
1 1 0.2 1.0 NaN
2 1 1.1 NaN NaN
3 2 0.1 NaN NaN
4 2 0.2 2.0 1.0
5 3 1.1 NaN NaN
6 3 1.2 NaN 5.0
7 3 3.1 NaN NaN
8 3 3.2 NaN NaN
9 4 0.1 NaN NaN
Expected df:
id pts x y count
0 1 0.1 NaN NaN 2
1 1 0.2 1.0 NaN 2
2 1 1.1 NaN NaN 2
3 2 0.1 NaN NaN 1
4 2 0.2 2.0 1.0 1
5 3 1.1 NaN NaN 3
6 3 1.2 NaN 5.0 3
7 3 3.1 NaN NaN 3
8 3 3.2 NaN NaN 3
9 4 0.1 NaN NaN 1
I tried:
df['count'] = df.groupby(['id'])['pts'].value_counts()
You can test whether the values in both columns are missing with DataFrame.isna and DataFrame.all, then count the True values with sum via GroupBy.transform to build the new column:
df['count'] = df[['x','y']].isna().all(axis=1).groupby(df['id']).transform('sum')
print (df)
id pts x y count
0 1 0.1 NaN NaN 2
1 1 0.2 1.0 NaN 2
2 1 1.1 NaN NaN 2
3 2 0.1 NaN NaN 1
4 2 0.2 2.0 1.0 1
5 3 1.1 NaN NaN 3
6 3 1.2 NaN 5.0 3
7 3 3.1 NaN NaN 3
8 3 3.2 NaN NaN 3
9 4 0.1 NaN NaN 1
Or chain both masks with & (bitwise AND):
df['count'] = (df['x'].isna() & df['y'].isna()).groupby(df['id']).transform('sum')
print (df)
id pts x y count
0 1 0.1 NaN NaN 2
1 1 0.2 1.0 NaN 2
2 1 1.1 NaN NaN 2
3 2 0.1 NaN NaN 1
4 2 0.2 2.0 1.0 1
5 3 1.1 NaN NaN 3
6 3 1.2 NaN 5.0 3
7 3 3.1 NaN NaN 3
8 3 3.2 NaN NaN 3
9 4 0.1 NaN NaN 1
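For illustration (my addition, not from the answer), these are the per-id sums that transform broadcasts back onto every row:
mask = df[['x','y']].isna().all(axis=1)   # True where both x and y are NaN
print(mask.groupby(df['id']).sum())       # id: 1 -> 2, 2 -> 1, 3 -> 3, 4 -> 1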
If I have a pandas data frame of ones like this:
NaN 1 1 1 1 NaN 1 1 1 NaN 1
NaN NaN 1 1 1 1 NaN NaN 1 NaN 1
NaN NaN 1 1 1 1 1 1 1 1 1
How do I do a cumulative sum along each row, but then set every position in a group to the maximum value of that group's cumulative sum, so that I get a pandas data frame like this:
NaN 4 4 4 4 NaN 3 3 3 NaN 1
NaN NaN 4 4 4 4 NaN NaN 1 NaN 1
NaN NaN 9 9 9 9 9 9 9 9 9
First we stack the isnull mask, then create the sub-groups with cumsum and count the consecutive 1s with transform; the last step unstacks the data back into the original shape:
s = df.isnull().stack()                      # mark the NaNs and reshape to one cell per (row, column)
s = s.groupby(level=0).cumsum()[~s]          # per row, a run id that increases at each NaN; keep only the cells that hold a 1
# count the members of each (row, run id) group, then reshape back to the original frame
s = s.groupby([s.index.get_level_values(0), s]).transform('count').unstack().reindex_like(df)
1 2 3 4 5 6 7 8 9 10 11
0 NaN 4.0 4.0 4.0 4.0 NaN 3.0 3.0 3.0 NaN 1.0
1 NaN NaN 4.0 4.0 4.0 4.0 NaN NaN 1.0 NaN 1.0
2 NaN NaN 9.0 9.0 9.0 9.0 9.0 9.0 9.0 9.0 9.0
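For illustration (my addition, assuming a one-row frame that matches the question's first row), the intermediate run ids show how the consecutive 1s are numbered before counting:
import numpy as np
import pandas as pd
# One row matching the question's first line of data (an illustrative assumption).
df = pd.DataFrame([[np.nan, 1, 1, 1, 1, np.nan, 1, 1, 1, np.nan, 1]])
s = df.isnull().stack()
ids = s.groupby(level=0).cumsum()[~s]
print(ids.loc[0].tolist())   # [1, 1, 1, 1, 2, 2, 2, 3] -> runs of length 4, 3 and 1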
Many more steps than @YOBEN_S, but we can make use of melt and groupby.
We use cumcount to create a conditional helper column to group with.
from io import StringIO
import numpy as np
import pandas as pd
d = """ NaN 1 1 1 1 NaN 1 1 1 NaN 1
NaN NaN 1 1 1 1 NaN NaN 1 NaN 1
NaN NaN 1 1 1 1 1 1 1 1 1"""
df = pd.read_csv(StringIO(d), header=None, sep=r"\s+")
s = df.reset_index().melt(id_vars="index")
s.loc[s["value"].isnull(), "counter"] = s.groupby(
[s["index"], s["value"].isnull()]
).cumcount()
s["counter"] = s.groupby(["index"])["counter"].ffill()
s["val"] = s.groupby(["index", "counter"])["value"].cumsum()
s["val"] = s.groupby(["counter", "index"])["val"].transform("max")
s.loc[s["value"].isnull(), "val"] = np.nan
df2 = (
s.groupby(["index", "variable"])["val"]
.first()
.unstack()
.rename_axis(None, axis=1)
.rename_axis(None)
)
print(df2)
0 1 2 3 4 5 6 7 8 9 10
0 NaN 4.0 4.0 4.0 4.0 NaN 3.0 3.0 3.0 NaN 1.0
1 NaN NaN 4.0 4.0 4.0 4.0 NaN NaN 1.0 NaN 1.0
2 NaN NaN 9.0 9.0 9.0 9.0 9.0 9.0 9.0 9.0 9.0
Suppose I have a data set like:
> NaN NaN NaN 12 NaN NaN NaN NaN 10 NaN NaN NaN NaN 8 NaN 6 NaN
I want to distribute each value as evenly as possible across its surrounding NaNs. For example, the value 12 should take its surrounding NaNs into consideration and distribute itself evenly across them, until it touches the NaNs belonging to the 2nd non-NaN value.
For example, the 1st value, 12, should only take its closest NaNs into consideration:
> NaN NaN NaN 12 NaN NaN
The output should be:
2 2 2 2 2 (Distributed by the 12)
2 2 2 2 2 (Distributed by the 10)
2 2 2 2 (Distributed by the 8)
2 2 2 (Distributed by the 6)
> NaN NaN NaN 12 NaN NaN NaN NaN 10 NaN NaN NaN NaN 8 NaN 6 NaN
> 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
I was originally thinking about using smoothers, such as the interpolate function in pandas. It does not have to be lossless, meaning that the distributed total can end up less than or greater than the original sum. Are there any libraries that can perform this kind of distribution, rather than using a lossy smoother?
You can use interpolate(method='nearest'), ffill() and bfill() and finally groupby().
Short version:
>>> import numpy as np
>>> import pandas as pd
>>> x = [np.nan, np.nan, np.nan, 12, np.nan, np.nan, np.nan, np.nan, 10, np.nan, np.nan, np.nan, np.nan, 8, np.nan, 6, np.nan]
>>> series = pd.Series(x).interpolate(method='nearest').ffill().bfill()
>>> series.groupby(series).apply(lambda k: k/len(k))
[2.0, 2.0, 2.0, 2.0, 2.0, 2.0, 2.0, 2.0, 2.0, 2.0, 2.0, 2.0, 2.0, 2.0, 2.0, 3.0, 3.0]
To illustrate what's happening, create your df
df = pd.DataFrame()
df["x"] = x
where x is the series you gave. Now:
>>> df["inter"] = df.x.interpolate(method='nearest').ffill().bfill()
>>> df["inter"] = df.groupby("inter").inter.apply(lambda k: k/len(k))
>>> df
x inter
0 NaN 2.0
1 NaN 2.0
2 NaN 2.0
3 12.0 2.0
4 NaN 2.0
5 NaN 2.0
6 NaN 2.0
7 NaN 2.0
8 10.0 2.0
9 NaN 2.0
10 NaN 2.0
11 NaN 2.0
12 NaN 2.0
13 8.0 2.0
14 NaN 2.0
15 6.0 3.0
16 NaN 3.0