Can I split a pandas dataframe based on row values? - python

I have a pandas dataframe that effectively contains several different datasets. Between each dataset is a row full of NaN. Can I split the dataframe on the NaN row to make two dataframes? Thanks in advance.

You can use this to split into many data frames based on all NaN rows:
#index of all NaN rows (+ beginning and end of df)
idx = [0] + df.index[df.isnull().all(1)].tolist() + [df.shape[0]]
#list of data frames split at all NaN indices
list_of_dfs = [df.iloc[idx[n]:idx[n+1]] for n in range(len(idx)-1)]
And if you want to exclude the NaN rows from split data frames:
idx = [-1] + df.index[df.isnull().all(1)].tolist() + [df.shape[0]]
list_of_dfs = [df.iloc[idx[n]+1:idx[n+1]] for n in range(len(idx)-1)]
Example:
df:
0 1
0 1.0 1.0
1 NaN 1.0
2 1.0 NaN
3 NaN NaN
4 NaN NaN
5 1.0 1.0
6 1.0 1.0
7 NaN 1.0
8 1.0 NaN
9 1.0 NaN
list_of_dfs:
[ 0 1
0 1.0 1.0
1 NaN 1.0
2 1.0 NaN,
Empty DataFrame
Columns: [0, 1]
Index: [],
0 1
5 1.0 1.0
6 1.0 1.0
7 NaN 1.0
8 1.0 NaN
9 1.0 NaN]
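A minimal, self-contained sketch of the "exclude the NaN rows" variant, run on the example above. The helper name `split_on_nan_rows` is hypothetical, and it uses row positions from `np.flatnonzero` instead of `df.index` labels (a swapped-in detail, so that `iloc` also works when the index is not the default 0..n-1):

```python
import numpy as np
import pandas as pd

def split_on_nan_rows(df):
    """Split df at all-NaN rows, dropping the separator rows themselves."""
    # Positions (not labels) of the rows in which every column is NaN.
    nan_pos = np.flatnonzero(df.isnull().all(axis=1).to_numpy())
    # Sentinels: -1 before the first row, len(df) after the last row.
    bounds = [-1, *nan_pos.tolist(), len(df)]
    return [df.iloc[start + 1:stop] for start, stop in zip(bounds[:-1], bounds[1:])]

df = pd.DataFrame({
    0: [1, np.nan, 1, np.nan, np.nan, 1, 1, np.nan, 1, 1],
    1: [1, 1, np.nan, np.nan, np.nan, 1, 1, 1, np.nan, np.nan],
})

chunks = split_on_nan_rows(df)
# Two consecutive NaN rows produce an empty chunk; filter it out if unwanted.
non_empty = [c for c in chunks if not c.empty]
```

Filtering on `c.empty` is optional; it only matters when separator rows appear back to back, as rows 3 and 4 do here.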

Use df[df[COLUMN_NAME].isnull()].index.tolist() to get a list of indices corresponding to the NaN rows. You can then split the dataframe into multiple dataframes by using the indices.

My solution splits your DataFrame into any number of chunks,
on each row full of NaNs.
Assume that the input DataFrame contains:
A B C
0 10.0 Abc 20.0
1 11.0 NaN 21.0
2 12.0 Ghi NaN
3 NaN NaN NaN
4 NaN Hkx 30.0
5 21.0 Jkl 32.0
6 22.0 Mno 33.0
7 NaN NaN NaN
8 30.0 Pqr 40.0
9 NaN Stu NaN
10 32.0 Vwx 44.0
so that "split points" are rows with indices 3 and 7.
To do your task:
Generate the grouping criterion Series:
grp = (df.isnull().sum(axis=1) == df.shape[1]).cumsum()
Drop rows full of NaN and group the result by the above criterion:
gr = df.dropna(axis=0, thresh=1).groupby(grp)
thresh=1 means that for the current row it is enough to have 1
non-NaN value to be kept in the result.
Perform actual split, as a list comprehension:
result = [ gr.get_group(key) for key in gr.groups ]
To print the result, you can run:
for i, chunk in enumerate(result):
    print(f'Chunk {i}:')
    print(chunk, end='\n\n')
getting:
Chunk 0:
A B C
0 10.0 Abc 20.0
1 11.0 NaN 21.0
2 12.0 Ghi NaN
Chunk 1:
A B C
4 NaN Hkx 30.0
5 21.0 Jkl 32.0
6 22.0 Mno 33.0
Chunk 2:
A B C
8 30.0 Pqr 40.0
9 NaN Stu NaN
10 32.0 Vwx 44.0
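The steps of this answer can be put together as one runnable sketch on the same input. It is a slight variation, not the exact code above: it drops the separator rows with a boolean mask and iterates over the groupby object instead of calling `get_group`; the variable name `all_nan` is mine.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "A": [10, 11, 12, np.nan, np.nan, 21, 22, np.nan, 30, np.nan, 32],
    "B": ["Abc", np.nan, "Ghi", np.nan, "Hkx", "Jkl", "Mno", np.nan, "Pqr", "Stu", "Vwx"],
    "C": [20, 21, np.nan, np.nan, 30, 32, 33, np.nan, 40, np.nan, 44],
})

# True for the separator rows (every column is NaN).
all_nan = df.isnull().all(axis=1)
# The running count of separators gives each chunk its own group number.
grp = all_nan.cumsum()
# Drop the separators, then collect one DataFrame per group number.
result = [chunk for _, chunk in df[~all_nan].groupby(grp[~all_nan])]
```

Each chunk keeps its original row labels, so rows 4 to 6 stay labelled 4 to 6 in the second chunk.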

Related

Is there a way to forward fill with ascending logic in pandas / numpy?

What is the most pandastic way to forward fill with ascending logic (without iterating over the rows)?
input:
import pandas as pd
import numpy as np
df = pd.DataFrame()
df['test'] = np.nan,np.nan,1,np.nan,np.nan,3,np.nan,np.nan,2,np.nan,6,np.nan,np.nan
df['desired_output'] = np.nan,np.nan,1,1,1,3,3,3,3,3,6,6,6
print (df)
output:
test desired_output
0 NaN NaN
1 NaN NaN
2 1.0 1.0
3 NaN 1.0
4 NaN 1.0
5 3.0 3.0
6 NaN 3.0
7 NaN 3.0
8 2.0 3.0
9 NaN 3.0
10 6.0 6.0
11 NaN 6.0
12 NaN 6.0
In the 'test' column, the number of consecutive NaN's is random.
In the 'desired_output' column, trying to forward fill with ascending values only. Also, when lower values are encountered (row 8, value = 2.0 above), they are overwritten with the current higher value.
Can anyone help? Thanks in advance.
You can combine cummax to select the cumulative maximum value and ffill to replace the NaNs:
df['desired_output'] = df['test'].cummax().ffill()
output:
test desired_output
0 NaN NaN
1 NaN NaN
2 1.0 1.0
3 NaN 1.0
4 NaN 1.0
5 3.0 3.0
6 NaN 3.0
7 NaN 3.0
8 2.0 3.0
9 NaN 3.0
10 6.0 6.0
11 NaN 6.0
12 NaN 6.0
intermediate Series:
df['test'].cummax()
0 NaN
1 NaN
2 1.0
3 NaN
4 NaN
5 3.0
6 NaN
7 NaN
8 3.0
9 NaN
10 6.0
11 NaN
12 NaN
Name: test, dtype: float64
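As a self-contained check of the two steps (a sketch that uses a bare Series instead of the question's DataFrame; the names `running_max` and `filled` are mine):

```python
import numpy as np
import pandas as pd

test = pd.Series([np.nan, np.nan, 1, np.nan, np.nan, 3,
                  np.nan, np.nan, 2, np.nan, 6, np.nan, np.nan])

# cummax skips NaN by default: NaN positions stay NaN, other positions
# hold the largest value seen so far (so the 2.0 at row 8 becomes 3.0).
running_max = test.cummax()
# ffill then carries the last running maximum into the NaN gaps.
filled = running_max.ffill()
```

The leading NaNs stay NaN because there is no earlier value to carry forward.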

Pandas sum of two columns - dealing with nan-values correctly

When summing two pandas columns, I want to ignore nan-values when one of the two columns is a float. However when nan appears in both columns, I want to keep nan in the output (instead of 0.0).
Initial dataframe:
Surf1 Surf2
0 0
NaN 8
8 15
NaN NaN
16 14
15 7
Desired output:
Surf1 Surf2 Sum
0 0 0
NaN 8 8
8 15 23
NaN NaN NaN
16 14 30
15 7 22
Tried code:
The code below ignores NaN values, but when both values are NaN it gives 0.0 in the output. I want to keep NaN in that case, so that empty values stay distinguishable from sums that are actually 0.
import pandas as pd
import numpy as np
data = pd.DataFrame({"Surf1": [10,np.nan,8,np.nan,16,15], "Surf2": [22,8,15,np.nan,14,7]})
print(data)
data.loc[:,'Sum'] = data.loc[:,['Surf1','Surf2']].sum(axis=1)
print(data)
From the documentation pandas.DataFrame.sum
By default, the sum of an empty or all-NA Series is 0.
>>> pd.Series([]).sum()  # min_count=0 is the default
0.0
This can be controlled with the min_count parameter. For example, if you’d like the sum of an empty series to be NaN, pass min_count=1.
Change your code to
data.loc[:,'Sum'] = data.loc[:,['Surf1','Surf2']].sum(axis=1, min_count=1)
output
Surf1 Surf2
0 10.0 22.0
1 NaN 8.0
2 8.0 15.0
3 NaN NaN
4 16.0 14.0
5 15.0 7.0
Surf1 Surf2 Sum
0 10.0 22.0 32.0
1 NaN 8.0 8.0
2 8.0 15.0 23.0
3 NaN NaN NaN
4 16.0 14.0 30.0
5 15.0 7.0 22.0
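To see the effect of min_count side by side, here is a small sketch. It assumes the question's data with one change for illustration: the last Surf2 value is set to NaN, to show that a single missing value in either column is still ignored.

```python
import numpy as np
import pandas as pd

data = pd.DataFrame({
    "Surf1": [10, np.nan, 8, np.nan, 16, 15],
    # Last value changed to NaN (not in the original data) for illustration.
    "Surf2": [22, 8, 15, np.nan, 14, np.nan],
})

# Default min_count=0: a row with no valid values sums to 0.0.
default_sum = data[["Surf1", "Surf2"]].sum(axis=1)
# min_count=1: at least one valid value is required, otherwise NaN.
strict_sum = data[["Surf1", "Surf2"]].sum(axis=1, min_count=1)
```

Only row 3, where both values are missing, differs between the two results.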
You could mask the result by doing:
df.sum(axis=1).mask(df.isna().all(axis=1))
0 0.0
1 8.0
2 23.0
3 NaN
4 30.0
5 22.0
dtype: float64
You can do:
df['Sum'] = df.dropna(how='all').sum(1)
Output:
Surf1 Surf2 Sum
0 10.0 22.0 32.0
1 NaN 8.0 8.0
2 8.0 15.0 23.0
3 NaN NaN NaN
4 16.0 14.0 30.0
5 15.0 7.0 22.0
You can use min_count: each row is summed when it has at least one non-null value, and the result is null when all values are null.
df['SUM']=df.sum(min_count=1,axis=1)
#df.sum(min_count=1,axis=1)
Out[199]:
0 0.0
1 8.0
2 23.0
3 NaN
4 30.0
5 22.0
dtype: float64
I think all the solutions listed above work only when it is the FIRST column value that is missing. If you have cases where the first column value is present but the second column value is missing, try using:
df['sum'] = df['Surf1']
df.loc[(df['Surf2'].notnull()), 'sum'] = df['Surf1'].fillna(0) + df['Surf2']

Combining sets of columns using pandas

I have the following data frame structure:
SC0 Shape S1 S2 S3 C1 C2 C3 D1 D2 D3
2 1 Circle NaN NaN NaN 1 1 1 NaN NaN NaN
3 13 Square 2 1 2 NaN NaN NaN NaN NaN NaN
4 13 Diamond NaN NaN NaN NaN NaN NaN 2 1 2
5 16 Diamond NaN NaN NaN NaN NaN NaN 2 2 2
6 16 Square 2 2 2 NaN NaN NaN NaN NaN NaN
How can I combine S1,S2,S3 with C1,C2,C3,D1,D2,D3 so that S1, C1 and D1 are in the same column, S2, C2 and D2 in the next, and so on (all the way to S16, C16 and D16)?
When Shape = Circle the populated columns are C1-C16, when Shape = Square it's S1-S16, and for Shape = Diamond it's D1-D16.
I don't mind creating a new set of columns or copying two of them into an existing set, as long as I have all the #1 scores in the same column, all the #2 scores in the same column, etc.
Thank you!
Try:
n=3
cols_prefixes=["C", "S", "D"]
for i in range(n):
    cols=[f"{l}{i+1}" for l in cols_prefixes]
    df[f"res{i+1}"]=df[cols].bfill(axis=1).iloc[:,0]
    df=df.drop(columns=cols)
Outputs:
SC0 Shape res1 res2 res3
2 1 Circle 1.0 1.0 1.0
3 13 Square 2.0 1.0 2.0
4 13 Diamond 2.0 1.0 2.0
5 16 Diamond 2.0 2.0 2.0
6 16 Square 2.0 2.0 2.0
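For reference, a self-contained sketch of the same bfill idea on the question's sample data. The helper name `coalesce` is hypothetical, and it works on a copy rather than modifying `df` in place:

```python
import numpy as np
import pandas as pd

nan = np.nan
df = pd.DataFrame(
    {
        "SC0": [1, 13, 13, 16, 16],
        "Shape": ["Circle", "Square", "Diamond", "Diamond", "Square"],
        "S1": [nan, 2, nan, nan, 2],
        "S2": [nan, 1, nan, nan, 2],
        "S3": [nan, 2, nan, nan, 2],
        "C1": [1, nan, nan, nan, nan],
        "C2": [1, nan, nan, nan, nan],
        "C3": [1, nan, nan, nan, nan],
        "D1": [nan, nan, 2, 2, nan],
        "D2": [nan, nan, 1, 2, nan],
        "D3": [nan, nan, 2, 2, nan],
    },
    index=[2, 3, 4, 5, 6],
)

def coalesce(df, prefixes, n):
    """Merge <prefix>1..<prefix>n columns into res1..resn."""
    out = df.copy()
    for i in range(1, n + 1):
        cols = [f"{p}{i}" for p in prefixes]
        # bfill along each row pulls the first non-NaN value into the
        # leftmost column, which is then taken as the combined score.
        out[f"res{i}"] = out[cols].bfill(axis=1).iloc[:, 0]
        out = out.drop(columns=cols)
    return out

res = coalesce(df, ["C", "S", "D"], 3)
```

For the full data set the call would presumably be `coalesce(df, ["C", "S", "D"], 16)`, assuming all sixteen columns exist for each prefix.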
IIUC you have an equal amount of columns for each category, and you want to compress this into numeric columns which are shape agnostic. If so this will work:
dfs = []
for var in ['S', 'D', 'C']:
    # filter columns with a regex
    res = df[df.iloc[:, 2:].filter(regex=var + r'\d{1,2}').columns].dropna()
    # rename columns with just numbers to enable concatenation
    res.columns = range(3)
    dfs.append(res)
df = pd.concat([df.iloc[:, :2], pd.concat(dfs)], axis=1)
print(df)
Output:
SC0 Shape 0 1 2
2 1 Circle 1.0 1.0 1.0
3 13 Square 2.0 1.0 2.0
4 13 Diamond 2.0 1.0 2.0
5 16 Diamond 2.0 2.0 2.0
6 16 Square 2.0 2.0 2.0

How to move values over in each Pandas data frame row where np.nan are located?

If I have a pandas data frame like this:
A B C D E F G H
0 0 2 3 5 NaN NaN NaN NaN
1 2 7 9 1 2 NaN NaN NaN
2 1 5 7 2 1 2 1 NaN
3 6 1 3 2 1 1 5 5
4 1 2 3 6 NaN NaN NaN NaN
How do I move all of the numerical values to the end of each row and place the NANs before them? Such that I get a pandas data frame like this:
A B C D E F G H
0 NaN NaN NaN NaN 0 2 3 5
1 NaN NaN NaN 2 7 9 1 2
2 NaN 1 5 7 2 1 2 1
3 6 1 3 2 1 1 5 5
4 NaN NaN NaN NaN 1 2 3 6
One-line solution:
df.apply(lambda x: pd.concat([x[x.isna()==True], x[x.isna()==False]], ignore_index=True), axis=1)
I guess the best approach is to work row by row. Make a function to do the job and use apply or transform to use that function on each row.
def movenan(x):
    fl = len(x)
    nl = len(x.dropna())
    nanarr = np.empty(fl - nl)
    nanarr[:] = np.nan
    return pd.concat([pd.Series(nanarr), x.dropna()], ignore_index=True)
ddf = df.transform(movenan, axis=1)
ddf.columns = df.columns
Using your sample data, the resulting ddf is:
A B C D E F G H
0 NaN NaN NaN NaN 0.0 2.0 3.0 5.0
1 NaN NaN NaN 2.0 7.0 9.0 1.0 2.0
2 NaN 1.0 5.0 7.0 2.0 1.0 2.0 1.0
3 6.0 1.0 3.0 2.0 1.0 1.0 5.0 5.0
4 NaN NaN NaN NaN 1.0 2.0 3.0 6.0
The movenan function creates an array of nan of the required length, drops the nan from the row, and concatenates the two resulting Series.
ignore_index=True is required because you don't want to preserve data position in their columns (values are moved to different columns), but doing this the column names are lost and replaced by integers. The last line simply copies back the column names into the new dataframe.
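If row-by-row apply is too slow on a large frame, a vectorized alternative is possible. This is a different technique from the answers above, sketched under the assumption that every column is numeric; the function name `push_values_right` is hypothetical.

```python
import numpy as np
import pandas as pd

def push_values_right(df):
    """Move non-NaN values to the end of each row, keeping their order."""
    values = df.to_numpy(dtype=float)
    # Stable argsort on the "is not NaN" flag: NaNs (False) sort first,
    # and real values (True) keep their original left-to-right order.
    order = np.argsort(~np.isnan(values), axis=1, kind="stable")
    shifted = np.take_along_axis(values, order, axis=1)
    return pd.DataFrame(shifted, index=df.index, columns=df.columns)

df = pd.DataFrame(
    [
        [0, 2, 3, 5, np.nan, np.nan, np.nan, np.nan],
        [2, 7, 9, 1, 2, np.nan, np.nan, np.nan],
        [1, 5, 7, 2, 1, 2, 1, np.nan],
        [6, 1, 3, 2, 1, 1, 5, 5],
        [1, 2, 3, 6, np.nan, np.nan, np.nan, np.nan],
    ],
    columns=list("ABCDEFGH"),
)

ddf = push_values_right(df)
```

Because the result is rebuilt with the original index and columns, no separate step is needed to restore the column names.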

Insert value into column which is named in known column pandas

I'm preparing data for machine learning where data is in pandas DataFrame which looks like this:
Column v1 v2
first 1 2
second 3 4
third 5 6
Now I want to transform it into:
Column v1 v2 first-v1 first-v2 second-v1 second-v2 third-v1 third-v2
first 1 2 1 2 Nan Nan Nan Nan
second 3 4 Nan Nan 3 4 Nan Nan
third 5 6 Nan Nan Nan Nan 5 6
What I've tried is something like this:
# we know how many values there are but
# length can be changed into length of [1, 2, 3, ...] values
values = ['v1', 'v2']
# data with description from above is saved in data
for value in values:
    data[ str(data['Column'] + '-' + value)] = data[ value]
The result is columns named:
['first-v1' 'second-v1'..], ['first-v2' 'second-v2'..]
where the values are correct. What am I doing wrong? Is there a more efficient way to do this, since my data is big?
Thank you for your time!
You can use unstack, then swap and sort the MultiIndex levels in the columns:
df = (data.set_index('Column', append=True)[values].unstack()
          .swaplevel(0, 1, axis=1).sort_index(axis=1))
df.columns = df.columns.map('-'.join)
print (df)
first-v1 first-v2 second-v1 second-v2 third-v1 third-v2
0 1.0 2.0 NaN NaN NaN NaN
1 NaN NaN 3.0 4.0 NaN NaN
2 NaN NaN NaN NaN 5.0 6.0
Or stack + unstack:
df = data.set_index('Column', append=True).stack().unstack([1,2])
df.columns = df.columns.map('-'.join)
print (df)
first-v1 first-v2 second-v1 second-v2 third-v1 third-v2
0 1.0 2.0 NaN NaN NaN NaN
1 NaN NaN 3.0 4.0 NaN NaN
2 NaN NaN NaN NaN 5.0 6.0
Finally, join to the original:
df = data.join(df)
print (df)
Column v1 v2 first-v1 first-v2 second-v1 second-v2 third-v1 \
0 first 1 2 1.0 2.0 NaN NaN NaN
1 second 3 4 NaN NaN 3.0 4.0 NaN
2 third 5 6 NaN NaN NaN NaN 5.0
third-v2
0 NaN
1 NaN
2 6.0
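Put together as one runnable sketch on the question's sample data (the intermediate name `wide` is mine; keyword arguments are used for `axis` because recent pandas versions no longer accept it positionally in `sort_index`):

```python
import pandas as pd

data = pd.DataFrame({
    "Column": ["first", "second", "third"],
    "v1": [1, 3, 5],
    "v2": [2, 4, 6],
})
values = ["v1", "v2"]

# Move 'Column' into the index, then pivot it out into the columns.
wide = data.set_index("Column", append=True)[values].unstack()
# Columns are now a (value, Column) MultiIndex: put Column first and sort.
wide = wide.swaplevel(0, 1, axis=1).sort_index(axis=1)
# Flatten ('first', 'v1') into 'first-v1'.
wide.columns = wide.columns.map("-".join)

result = data.join(wide)
```

The new columns are floats because the cells that have no matching row are filled with NaN.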
