I have the following data frame structure:
SC0 Shape S1 S2 S3 C1 C2 C3 D1 D2 D3
2 1 Circle NaN NaN NaN 1 1 1 NaN NaN NaN
3 13 Square 2 1 2 NaN NaN NaN NaN NaN NaN
4 13 Diamond NaN NaN NaN NaN NaN NaN 2 1 2
5 16 Diamond NaN NaN NaN NaN NaN NaN 2 2 2
6 16 Square 2 2 2 NaN NaN NaN NaN NaN NaN
How can I combine S1, S2, S3 with C1, C2, C3 and D1, D2, D3 so that S1, C1 and D1 end up in the same column, S2, C2 and D2 in the next, and so on (all the way to S16, C16 and D16)?
When Shape = Circle the populated columns are C1-C16, when Shape = Square they are S1-S16, and for Shape = Diamond they are D1-D16.
I don't mind creating a new set of columns or copying two of them onto an existing set, as long as all the #1 scores end up in the same column, all the #2 scores in the same column, and so on.
Thank you!
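For reference, here is a minimal reproducible sketch of the sample frame (only the three score columns per shape shown above; the full data would run to S16/C16/D16):
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "SC0": [1, 13, 13, 16, 16],
    "Shape": ["Circle", "Square", "Diamond", "Diamond", "Square"],
    "S1": [np.nan, 2, np.nan, np.nan, 2],
    "S2": [np.nan, 1, np.nan, np.nan, 2],
    "S3": [np.nan, 2, np.nan, np.nan, 2],
    "C1": [1, np.nan, np.nan, np.nan, np.nan],
    "C2": [1, np.nan, np.nan, np.nan, np.nan],
    "C3": [1, np.nan, np.nan, np.nan, np.nan],
    "D1": [np.nan, np.nan, 2, 2, np.nan],
    "D2": [np.nan, np.nan, 1, 2, np.nan],
    "D3": [np.nan, np.nan, 2, 2, np.nan],
}, index=[2, 3, 4, 5, 6])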
Try:
n = 3
cols_prefixes = ["C", "S", "D"]
for i in range(n):
    # the i-th score column for each shape, e.g. ["C1", "S1", "D1"]
    cols = [f"{prefix}{i+1}" for prefix in cols_prefixes]
    # take the first (and only) populated value across the three columns
    df[f"res{i+1}"] = df[cols].bfill(axis=1).iloc[:, 0]
    df = df.drop(columns=cols)
Outputs:
SC0 Shape res1 res2 res3
2 1 Circle 1.0 1.0 1.0
3 13 Square 2.0 1.0 2.0
4 13 Diamond 2.0 1.0 2.0
5 16 Diamond 2.0 2.0 2.0
6 16 Square 2.0 2.0 2.0
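Since each row has values under exactly one prefix, bfill(axis=1).iloc[:, 0] simply picks the single populated value among C/S/D per row, so the order of cols_prefixes doesn't matter. For the full data described in the question, set n = 16 instead of 3.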
IIUC you have an equal number of columns for each category, and you want to compress them into shape-agnostic numeric columns. If so, this will work:
dfs = []
for var in ['S', 'D', 'C']:
    # filter columns with a regex (raw string so \d is not an invalid escape)
    res = df[df.iloc[:, 2:].filter(regex=var + r'\d{1,2}').columns].dropna()
    # rename columns to plain numbers to enable concatenation
    res.columns = range(3)
    dfs.append(res)
df = pd.concat([df.iloc[:, :2], pd.concat(dfs)], axis=1)
print(df)
Output:
SC0 Shape 0 1 2
2 1 Circle 1.0 1.0 1.0
3 13 Square 2.0 1.0 2.0
4 13 Diamond 2.0 1.0 2.0
5 16 Diamond 2.0 2.0 2.0
6 16 Square 2.0 2.0 2.0
import numpy as np
import pandas as pd

test_df = pd.DataFrame({'a': [np.nan, np.nan, np.nan, 4, np.nan, np.nan, 6]})
test_df
a
0 NaN
1 NaN
2 NaN
3 4.0
4 NaN
5 NaN
6 6.0
I'm trying to backfill each NaN with the next real value divided by the length of its run, i.e. the number of NaN values plus the value itself: here 4 is spread over four rows as 1.0 each, and 6 over three rows as 2.0 each. The following is what I'm trying to get:
a
0 1.0
1 1.0
2 1.0
3 1.0
4 2.0
5 2.0
6 2.0
Try:
# identify the blocks by cumsum on the reversed non-nan series
groups = test_df['a'].notna()[::-1].cumsum()
# groupby and transform
test_df['a'] = test_df['a'].fillna(0).groupby(groups).transform('mean')
Output:
a
0 1.0
1 1.0
2 1.0
3 1.0
4 2.0
5 2.0
6 2.0
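This works because fillna(0) leaves exactly one non-zero entry in each block, so the block mean is the real value divided by the block size, i.e. the count of NaNs plus one: 4 / 4 = 1.0 and 6 / 3 = 2.0.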
IIUC use:
# get reverse group
group = test_df.loc[::-1,'a'].notna().cumsum()
# get size and divide
test_df['a'] = (test_df['a']
                .bfill()
                .div(test_df.groupby(group)['a'].transform('size'))
                )
Or with rdiv:
test_df['a'] = (test_df
                .groupby(group)['a']
                .transform('size')
                .rdiv(test_df['a'].bfill())
                )
Output (as new column for clarity):
a a2
0 NaN 1.0
1 NaN 1.0
2 NaN 1.0
3 4.0 1.0
4 NaN 2.0
5 NaN 2.0
6 6.0 2.0
I am working with a pandas data frame that also contains NaN values. I want to substitute the NaNs with interpolated values via df.interpolate, but only if the length of the sequence of NaN values is <= N. As an example, let's assume I choose N = 2 (so I want to fill sequences of at most 2 consecutive NaNs) and I have a dataframe with
print(df)
A B C
1 1 1
nan nan 2
nan nan 3
nan 4 nan
5 5 5
In such a case I want to apply a function to df so that only the NaN sequences of length <= 2 get filled while the longer sequences stay untouched, resulting in my desired output of
print(df)
A B C
1 1 1
nan 2 2
nan 3 3
nan 4 4
5 5 5
Note that I am aware of the limit=N option in df.interpolate, but it doesn't do what I want: it fills sequences of any length, merely capping each fill at the first N NaNs of the sequence, which gives the undesired output
print(df)
A B C
1 1 1
2 2 2
3 3 3
nan 4 4
5 5 5
So do you know of a function, or how to construct code, that produces my desired output? Thanks!
You can perform run-length encoding and identify, for each column, the runs of NaN that are at most two elements long. One way to do that is to use get_id from the package pdrle (disclaimer: I wrote it).
import pdrle
chk = df.isna() & (df.apply(lambda x: x.groupby(pdrle.get_id(x)).transform(len)) <= 2)
df[chk] = df.interpolate()[chk]
# A B C
# 0 1.0 1.0 1.0
# 1 NaN 2.0 2.0
# 2 NaN 3.0 3.0
# 3 NaN 4.0 4.0
# 4 5.0 5.0 5.0
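If you would rather not add a dependency, the same run lengths can be computed with a shift/cumsum trick; a sketch of the idea (nan_run_length is just an illustrative helper name, not part of any library):
import pandas as pd

def nan_run_length(s):
    # length of the consecutive NaN/non-NaN run each element belongs to
    is_na = s.isna()
    run_id = is_na.ne(is_na.shift()).cumsum()  # new id whenever NaN-ness flips
    return is_na.groupby(run_id).transform('size')

chk = df.isna() & (df.apply(nan_run_length) <= 2)
df[chk] = df.interpolate()[chk]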
Try:
N = 2
df_interpolated = df.interpolate()
for c in df:
    mask = df[c].isna()
    x = (
        mask.groupby((mask != mask.shift()).cumsum()).transform(
            lambda x: len(x) > N
        )
        * mask
    )
    df_interpolated[c] = df_interpolated.loc[~x, c]
print(df_interpolated)
Prints:
A B C
0 1.0 1.0 1.0
1 NaN 2.0 2.0
2 NaN 3.0 3.0
3 NaN 4.0 4.0
4 5.0 5.0 5.0
Trying with a different df:
A B C
0 1.0 1.0 1.0
1 NaN NaN 2.0
2 NaN NaN 3.0
3 NaN 4.0 NaN
4 5.0 5.0 5.0
5 NaN 5.0 NaN
6 NaN 5.0 NaN
7 8.0 5.0 NaN
produces:
A B C
0 1.0 1.0 1.0
1 NaN 2.0 2.0
2 NaN 3.0 3.0
3 NaN 4.0 4.0
4 5.0 5.0 5.0
5 6.0 5.0 NaN
6 7.0 5.0 NaN
7 8.0 5.0 NaN
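For reference, (mask != mask.shift()).cumsum() labels each consecutive run of NaN/non-NaN values; the transform marks runs longer than N, and multiplying by mask restricts the marks to NaN positions, so ~x keeps interpolated values only inside short NaN runs.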
You can try the following -
n = 2
cols = df.columns[df.isna().sum() <= n]
df[cols] = df[cols].interpolate()
df
A B C
0 1.0 1.0 1.0
1 NaN 2.0 2.0
2 NaN 3.0 3.0
3 NaN 4.0 4.0
4 5.0 5.0 5.0
df.columns[df.isna().sum() <= n] selects the columns whose total NaN count is at most n; those columns are then simply overwritten with their interpolated values. (This matches the desired output here because each column contains a single run of NaNs.)
I have a pandas dataframe that effectively contains several different datasets. Between each dataset is a row full of NaN. Can I split the dataframe on the NaN row to make two dataframes? Thanks in advance.
You can use this to split into many data frames based on all NaN rows:
#index of all NaN rows (+ beginning and end of df)
idx = [0] + df.index[df.isnull().all(1)].tolist() + [df.shape[0]]
#list of data frames split at all NaN indices
list_of_dfs = [df.iloc[idx[n]:idx[n+1]] for n in range(len(idx)-1)]
And if you want to exclude the NaN rows from split data frames:
idx = [-1] + df.index[df.isnull().all(1)].tolist() + [df.shape[0]]
list_of_dfs = [df.iloc[idx[n]+1:idx[n+1]] for n in range(len(idx)-1)]
Example:
df:
0 1
0 1.0 1.0
1 NaN 1.0
2 1.0 NaN
3 NaN NaN
4 NaN NaN
5 1.0 1.0
6 1.0 1.0
7 NaN 1.0
8 1.0 NaN
9 1.0 NaN
list_of_dfs:
[ 0 1
0 1.0 1.0
1 NaN 1.0
2 1.0 NaN,
Empty DataFrame
Columns: [0, 1]
Index: [],
0 1
5 1.0 1.0
6 1.0 1.0
7 NaN 1.0
8 1.0 NaN
9 1.0 NaN]
Use df[df[COLUMN_NAME].isnull()].index.tolist() to get a list of indices corresponding to the NaN rows. You can then split the dataframe into multiple dataframes using those indices, as in the sketch below.
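A minimal sketch of that approach, using 'A' as a stand-in for COLUMN_NAME and assuming a unique index:
# positions of the NaN separator rows within the frame
bounds = [df.index.get_loc(i) for i in df[df['A'].isnull()].index]
# slice between consecutive separators, dropping the separator rows themselves
parts = [df.iloc[a + 1:b] for a, b in zip([-1] + bounds, bounds + [len(df)])]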
My solution lets you split your DataFrame into any number of chunks, one per row full of NaNs.
Assume that the input DataFrame contains:
A B C
0 10.0 Abc 20.0
1 11.0 NaN 21.0
2 12.0 Ghi NaN
3 NaN NaN NaN
4 NaN Hkx 30.0
5 21.0 Jkl 32.0
6 22.0 Mno 33.0
7 NaN NaN NaN
8 30.0 Pqr 40.0
9 NaN Stu NaN
10 32.0 Vwx 44.0
so that "split points" are rows with indices 3 and 7.
To do your task:
Generate the grouping criterion Series:
grp = (df.isnull().sum(axis=1) == df.shape[1]).cumsum()
Drop rows full of NaN and group the result by the above criterion:
gr = df.dropna(axis=0, thresh=1).groupby(grp)
thresh=1 means that for the current row it is enough to have 1
non-NaN value to be kept in the result.
Perform actual split, as a list comprehension:
result = [gr.get_group(key) for key in gr.groups]
To print the result, you can run:
for i, chunk in enumerate(result):
    print(f'Chunk {i}:')
    print(chunk, end='\n\n')
getting:
Chunk 0:
A B C
0 10.0 Abc 20.0
1 11.0 NaN 21.0
2 12.0 Ghi NaN
Chunk 1:
A B C
4 NaN Hkx 30.0
5 21.0 Jkl 32.0
6 22.0 Mno 33.0
Chunk 2:
A B C
8 30.0 Pqr 40.0
9 NaN Stu NaN
10 32.0 Vwx 44.0
If I have a pandas data frame like this:
A B C D E F G H
0 0 2 3 5 NaN NaN NaN NaN
1 2 7 9 1 2 NaN NaN NaN
2 1 5 7 2 1 2 1 NaN
3 6 1 3 2 1 1 5 5
4 1 2 3 6 NaN NaN NaN NaN
How do I move all of the numerical values to the end of each row and place the NaNs before them, so that I get a pandas data frame like this:
A B C D E F G H
0 NaN NaN NaN NaN 0 2 3 5
1 NaN NaN NaN 2 7 9 1 2
2 NaN 1 5 7 2 1 2 1
3 6 1 3 2 1 1 5 5
4 NaN NaN NaN NaN 1 2 3 6
One-line solution:
df.apply(lambda x: pd.concat([x[x.isna()], x[x.notna()]], ignore_index=True), axis=1)
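Note that ignore_index=True gives every returned row a fresh integer index, so the result comes back with integer column labels; if you want the original headers, assign them back afterwards with something like out.columns = df.columns.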
I guess the best approach is to work row by row. Make a function to do the job and use apply or transform to use that function on each row.
import numpy as np
import pandas as pd

def movenan(x):
    fl = len(x)
    nl = len(x.dropna())
    nanarr = np.empty(fl - nl)
    nanarr[:] = np.nan
    return pd.concat([pd.Series(nanarr), x.dropna()], ignore_index=True)

ddf = df.transform(movenan, axis=1)
ddf.columns = df.columns
Using your sample data, the resulting ddf is:
A B C D E F G H
0 NaN NaN NaN NaN 0.0 2.0 3.0 5.0
1 NaN NaN NaN 2.0 7.0 9.0 1.0 2.0
2 NaN 1.0 5.0 7.0 2.0 1.0 2.0 1.0
3 6.0 1.0 3.0 2.0 1.0 1.0 5.0 5.0
4 NaN NaN NaN NaN 1.0 2.0 3.0 6.0
The movenan function creates an array of NaN of the required length, drops the NaNs from the row, and concatenates the two resulting Series.
ignore_index=True is required because you don't want values to keep their original column positions (they are deliberately moved to different columns); the downside is that the column names are lost and replaced by integers, so the last line copies the original column names back into the new dataframe.
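A vectorized alternative (a sketch, assuming the frame is all-numeric so np.isnan applies): a stable argsort on the NaN mask pushes the NaNs to the front of each row while keeping the real values in their original order.
import numpy as np
import pandas as pd

arr = df.to_numpy(dtype=float)
# False (NaN positions) sorts before True (real values); kind='stable'
# preserves the left-to-right order within each group
order = np.argsort(~np.isnan(arr), axis=1, kind='stable')
ddf = pd.DataFrame(np.take_along_axis(arr, order, axis=1),
                   index=df.index, columns=df.columns)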
I have a following dataframe:
A B C D
0 NaN 2.0 NaN 0
1 3.0 4.0 NaN 1
2 NaN NaN NaN 5
3 NaN 3.0 NaN 4
Now I want to fill the null values of A with the values in B or D, i.e. if the value is null in B then check D. So the resulting dataframe looks like this:
A B C D
0 2.0 2.0 NaN 0
1 3.0 4.0 NaN 1
2 5 NaN NaN 5
3 3.0 3.0 NaN 4
I can do this using following code:
df['A'] = df['A'].fillna(df['B'])
df['A'] = df['A'].fillna(df['D'])
But I want to do this in one line, how can I do that?
You could simply chain the two .fillna() calls:
df['A'] = df.A.fillna(df.B).fillna(df.D)
A B C D
0 2.0 2.0 NaN 0
1 3.0 4.0 NaN 1
2 5.0 NaN NaN 5
3 3.0 3.0 NaN 4
Or using fillna with combine_first:
df['A'] = df.A.fillna(df.B.combine_first(df.D))
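Here df.B.combine_first(df.D) builds a B-with-D-as-fallback series first, and that combined series then fills the gaps in A in a single step.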
If you want to avoid chaining because there are many fallback columns, it is better to back fill the missing values along the row and select the first column by position:
df['A'] = df['A'].fillna(df[['B','D']].bfill(axis=1).iloc[:, 0])
print (df)
A B C D
0 2.0 2.0 NaN 0
1 3.0 4.0 NaN 1
2 5.0 NaN NaN 5
3 3.0 3.0 NaN 4
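The same pattern extends to any number of fallback columns in priority order, e.g. df['A'].fillna(df[['B','C','D']].bfill(axis=1).iloc[:, 0]).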