import pandas
import numpy
names = ['a', 'b', 'c']
df = pandas.DataFrame([1, 2, 3, numpy.nan, numpy.nan, 4, 5, 6, numpy.nan, numpy.nan, 7, 8, 9])
For the above, how will the condition change? Can someone please explain how I can get this:
df1 =
0
0 1.0
1 2.0
2 3.0
df2 =
0
4 4.0
5 5.0
6 6.0
df3 =
0
8 7.0
9 8.0
10 9.0
You can generate a temporary column, remove NaNs, and group by the temporary column:
cond = df.assign(cond=df.isna().cumsum()).dropna()['cond']
dataframes = {f'df{idx+1}': d for idx, (_, d) in enumerate(df.dropna().groupby(cond))}
Output:
>>> dataframes
{'df1': 0
0 1.0
1 2.0
2 3.0,
'df2': 0
5 4.0
6 5.0
7 6.0,
'df3': 0
10 7.0
11 8.0
12 9.0}
>>> dataframes['df1']
0
0 1.0
1 2.0
2 3.0
>>> dataframes['df2']
0
5 4.0
6 5.0
7 6.0
>>> dataframes['df3']
0
10 7.0
11 8.0
12 9.0
I have a dataframe that stores time-series data
Please find the code below
import pandas as pd
from pprint import pprint
d = {
't': [0, 1, 2, 0, 2, 0, 1],
'input': [2, 2, 2, 2, 2, 2, 4],
'type': ['A', 'A', 'A', 'B', 'B', 'B', 'A'],
'value': [0.1, 0.2, 0.3, 1, 2, 3, 1],
}
df = pd.DataFrame(d)
pprint(df)
df>
t input type value
0 2 A 0.1
1 2 A 0.2
2 2 A 0.3
0 2 B 1.0
2 2 B 2.0
0 2 B 3.0
1 4 A 1.0
When the first entry of the column t repeats, I would like to add an empty row.
Expected output:
df>
t input type value
0 2 A 0.1
1 2 A 0.2
2 2 A 0.3

0 2 B 1.0
2 2 B 2.0

0 2 B 3.0
1 4 A 1.0
I am not sure how to do this. Suggestions will be really helpful.
EDIT:
dup = df['t'].eq(0).shift(-1, fill_value=False)
helps when the starting value in column t is 0.
But it could also be a non-zero value like the example below.
Additional example:
d = {
't': [25, 35, 90, 25, 90, 25, 35],
'input': [2, 2, 2, 2, 2, 2, 4],
'type': ['A', 'A', 'A', 'B', 'B', 'B', 'A'],
'value': [0.1, 0.2, 0.3, 1, 2, 3, 1],
}
There are several ways to achieve this.
option 1
you can use groupby.apply:
(df.groupby(df['t'].eq(0).cumsum(), as_index=False, group_keys=False)
.apply(lambda d: pd.concat([d, pd.Series(index=d.columns, name='').to_frame().T]))
)
output:
t input type value
0 0.0 2.0 A 0.1
1 1.0 2.0 A 0.2
2 2.0 2.0 A 0.3
NaN NaN NaN NaN
3 0.0 2.0 B 1.0
4 2.0 2.0 B 2.0
NaN NaN NaN NaN
5 0.0 2.0 B 3.0
6 1.0 4.0 A 1.0
NaN NaN NaN NaN
option 2
An alternative if the index is already sorted:
dup = df['t'].eq(0).shift(-1, fill_value=False)
pd.concat([df, df.loc[dup].assign(**{c: '' for c in df})]).sort_index()
output:
t input type value
0 0 2 A 0.1
1 1 2 A 0.2
2 2 2 A 0.3
2
3 0 2 B 1.0
4 2 2 B 2.0
4
5 0 2 B 3.0
6 1 4 A 1.0
addendum on grouping
set the group when the value decreases:
dup = df['t'].diff().lt(0).cumsum()
(df.groupby(dup, as_index=False, group_keys=False)
.apply(lambda d: pd.concat([d, pd.Series(index=d.columns, name='').to_frame().T]))
)
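For the additional example where t restarts at 25 instead of 0, the decrease-based key produces the intended group labels; a quick sanity check (sketch):
import pandas as pd
t = pd.Series([25, 35, 90, 25, 90, 25, 35])
print(t.diff().lt(0).cumsum().tolist())
# [0, 0, 0, 1, 1, 2, 2] -> one label per run of increasing t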
Because groupby is generally slow, you can instead build a helper DataFrame of blank rows keyed by the consecutive groups (runs starting at 0 in the t column), join with concat, and sort:
# groups starting at 0
df.index = df['t'].eq(0).cumsum()
# or, if the starting value is not 0: start a new group wherever t does not increase
df.index = (~df['t'].diff().gt(0)).cumsum()
df = (pd.concat([df, pd.DataFrame('', columns=df.columns, index=df.index.unique())])
.sort_index(kind='mergesort', ignore_index=True)
.iloc[:-1])
print(df)
t input type value
0 0 2 A 0.1
1 1 2 A 0.2
2 2 2 A 0.3
3
4 0 2 B 1.0
5 2 2 B 2.0
6
7 0 2 B 3.0
8 1 4 A 1.0
df.index = (~df['t'].diff().gt(0)).cumsum()
df = (pd.concat([df, pd.DataFrame(' ', columns=df.columns, index=df.index.unique())])
.sort_index(kind='mergesort', ignore_index=True)
.iloc[:-1])
print(df)
t input type value
0 25 2 A 0.1
1 35 2 A 0.2
2 90 2 A 0.3
3
4 25 2 B 1.0
5 90 2 B 2.0
6
7 25 2 B 3.0
8 35 4 A 1.0
Here is my suggestion:
pd.concat([pd.DataFrame(index=df.index[df.t == df.t.iat[0]][1:]), df]).sort_index()
t input type value
0 25.0 2.0 A 0.1
1 35.0 2.0 A 0.2
2 90.0 2.0 A 0.3
3 NaN NaN NaN NaN
3 25.0 2.0 B 1.0
4 90.0 2.0 B 2.0
5 NaN NaN NaN NaN
5 25.0 2.0 B 3.0
6 35.0 4.0 A 1.0
I want to forward fill a column and I want to specify a limit, but I want the limit to be based on the index---not a simple number of rows like limit allows.
For example, say I have the dataframe given by:
df = pd.DataFrame({
'data': [0.0, 1.0, np.nan, 3.0, np.nan, 5.0, np.nan, np.nan, np.nan, np.nan],
'group': [0, 0, 0, 1, 1, 0, 0, 0, 1, 1]
})
which looks like
In [27]: df
Out[27]:
data group
0 0.0 0
1 1.0 0
2 NaN 0
3 3.0 1
4 NaN 1
5 5.0 0
6 NaN 0
7 NaN 0
8 NaN 1
9 NaN 1
If I group by the group column and forward fill in that group with limit=2, then my resulting data frame will be
In [35]: df.groupby('group').ffill(limit=2)
Out[35]:
group data
0 0 0.0
1 0 1.0
2 0 1.0
3 1 3.0
4 1 3.0
5 0 5.0
6 0 5.0
7 0 5.0
8 1 3.0
9 1 NaN
What I actually want to do here however is only forward fill onto rows whose indexes are within say 2 from the first index of each group, as opposed to the next 2 rows of each group. For example, if we just look at the groups on the dataframe:
In [36]: for i, group in df.groupby('group'):
...: print(group)
...:
data group
0 0.0 0
1 1.0 0
2 NaN 0
5 5.0 0
6 NaN 0
7 NaN 0
data group
3 3.0 1
4 NaN 1
8 NaN 1
9 NaN 1
I would want the second group here to only be forward filled to index 4---not 8 and 9. The first group's NaN values are all within 2 indexes from the last non-NaN values, so they would be filled completely. The resulting dataframe would look like:
group data
0 0 0.0
1 0 1.0
2 0 1.0
3 1 3.0
4 1 3.0
5 0 5.0
6 0 5.0
7 0 5.0
8 1 NaN
9 1 NaN
FWIW, in my actual use case my index is a DatetimeIndex (and it is sorted).
I currently have a solution that sort of works: it loops through the dataframe filtered on the group indexes, creates a time range for every event with a non-NaN value based on the index, and then combines those. But this is far too slow to be practical.
import numpy as np
import pandas as pd
df = pd.DataFrame({
'data': [0.0, 1.0, 1, 3.0, np.nan, 22, np.nan, 5, np.nan, np.nan],
'group': [0, 0, 1, 0, 1, 0, 1, 0, 1, 1]})
df = df.reset_index()
df['stop_index'] = df['index'] + 2
df['stop_index'] = df['stop_index'].where(pd.notnull(df['data']))
df['stop_index'] = df.groupby('group')['stop_index'].ffill()
df['mask'] = df['index'] <= df['stop_index']
df.loc[df['mask'], 'data'] = df.groupby('group')['data'].ffill()
print(df)
# index data group stop_index mask
# 0 0 0.0 0 2.0 True
# 1 1 1.0 0 3.0 True
# 2 2 1.0 1 4.0 True
# 3 3 3.0 0 5.0 True
# 4 4 1.0 1 4.0 True
# 5 5 22.0 0 7.0 True
# 6 6 NaN 1 4.0 False
# 7 7 5.0 0 9.0 True
# 8 8 NaN 1 4.0 False
# 9 9 NaN 1 4.0 False
# clean up df
df = df[['data', 'group']]
print(df)
yields
data group
0 0.0 0
1 1.0 0
2 1.0 1
3 3.0 0
4 1.0 1
5 22.0 0
6 NaN 1
7 5.0 0
8 NaN 1
9 NaN 1
This copies the index into a column, then
makes a second stop_index column which is the index augmented by the size of
the (time) window.
df = df.reset_index()
df['stop_index'] = df['index'] + 2
Then it makes null rows in stop_index to match null rows in data:
df['stop_index'] = df['stop_index'].where(pd.notnull(df['data']))
Then it forward-fills stop_index on a per-group basis:
df['stop_index'] = df.groupby('group')['stop_index'].ffill()
Now (at last) we can define the desired mask -- the places where we actually want to forward-fill data:
df['mask'] = df['index'] <= df['stop_index']
df.loc[df['mask'], 'data'] = df.groupby('group')['data'].ffill()
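Since the asker's real index is a DatetimeIndex, the same stop_index idea carries over by adding a Timedelta instead of an integer. A minimal sketch, assuming a 2-day window and a small date-indexed frame (both made up for illustration):
import numpy as np
import pandas as pd
df = pd.DataFrame({'data': [1.0, np.nan, np.nan, 4.0, np.nan],
                   'group': [0, 0, 0, 1, 1]},
                  index=pd.date_range('2023-01-01', periods=5, freq='D'))
df = df.reset_index()
df['stop_index'] = df['index'] + pd.Timedelta(days=2)  # window measured in time, not rows
df['stop_index'] = df['stop_index'].where(df['data'].notna())
df['stop_index'] = df.groupby('group')['stop_index'].ffill()
mask = df['index'] <= df['stop_index']
df.loc[mask, 'data'] = df.groupby('group')['data'].ffill()
df = df.set_index('index')[['data', 'group']]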
IIUC, you can reindex each group onto the full index so that ffill's limit counts steps of the original index, then keep only the group's own rows:
l=[]
for i, group in df.groupby('group'):
idx=group.index
l.append(group.reindex(df.index).ffill(limit=2).loc[idx])
pd.concat(l).sort_index()
data group
0 0.0 0.0
1 1.0 0.0
2 1.0 0.0
3 3.0 1.0
4 3.0 1.0
5 5.0 0.0
6 5.0 0.0
7 5.0 0.0
8 NaN 1.0
9 NaN 1.0
Testing data
data group
0 0.0 0
1 1.0 0
2 1.0 1
3 3.0 0
4 NaN 1
5 22 0
6 NaN 1
7 5.0 0
8 NaN 1
9 NaN 1
My method on the testing data:
data group
0 0.0 0.0
1 1.0 0.0
2 1.0 1.0
3 3.0 0.0
4 1.0 1.0
5 22.0 0.0
6 NaN 1.0 # not filled: the last valid value in group 1 (index 2) is more than 2 index steps away
7 5.0 0.0
8 NaN 1.0
9 NaN 1.0
Output with unutbu's method:
data group
0 0.0 0
1 1.0 0
2 1.0 1
3 3.0 0
4 1.0 1
5 22.0 0
6 1.0 1 # mismatch here
7 5.0 0
8 NaN 1
9 NaN 1
I have a pandas dataframe with two dimensions. I want to calculate the rolling standard deviation along axis 1 while also including datapoints in the rows above and below.
So say I have this df:
data = {'A': [1, 2, 3, 4],
'B': [5, 6, 7, 8],
'C': [9, 10, 11, 12]}
df = pd.DataFrame(data)
print(df)
A B C
0 1 5 9
1 2 6 10
2 3 7 11
3 4 8 12
I want a rectangular window 3 rows high and 2 columns across, moving from left to right. So, for example,
std_df.loc[1, 'C']
would be equal to
np.std([1, 5, 9, 2, 6, 10, 3, 7, 11])
But I have no idea how to achieve this without very slow iteration.
Looks like what you want is DataFrame.shift.
import pandas as pd
import numpy as np
data = {'A': [1,2,3,4], 'B': [5,6,7,8], 'C': [9,10,11,12]}
df = pd.DataFrame(data)
print(df)
A B C
0 1 5 9
1 2 6 10
2 3 7 11
3 4 8 12
Shifting the dataframe you provided by 1 yields the row above
print(df.shift(1))
A B C
0 NaN NaN NaN
1 1.0 5.0 9.0
2 2.0 6.0 10.0
3 3.0 7.0 11.0
Similarly, shifting the dataframe you provided by -1 yields the row below
print(df.shift(-1))
A B C
0 2.0 6.0 10.0
1 3.0 7.0 11.0
2 4.0 8.0 12.0
3 NaN NaN NaN
so the code below should do what you're looking for (add_prefix prefixes the column names to make them unique)
above_df = df.shift(1).add_prefix('above_')
below_df = df.shift(-1).add_prefix('below_')
lagged = pd.concat([df, above_df, below_df], axis=1)
lagged['std'] = lagged.apply(np.std, axis=1)
print(lagged)
A B C above_A above_B above_C below_A below_B below_C std
0 1 5 9 NaN NaN NaN 2.0 6.0 10.0 3.304038
1 2 6 10 1.0 5.0 9.0 3.0 7.0 11.0 3.366502
2 3 7 11 2.0 6.0 10.0 4.0 8.0 12.0 3.366502
3 4 8 12 3.0 7.0 11.0 NaN NaN NaN 3.304038
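As a side note, the row-wise apply can be replaced by DataFrame.std with ddof=0 (which matches np.std and also skips NaNs) and is typically faster. A sketch of the equivalent computation:
import pandas as pd
df = pd.DataFrame({'A': [1, 2, 3, 4], 'B': [5, 6, 7, 8], 'C': [9, 10, 11, 12]})
lagged = pd.concat([df, df.shift(1).add_prefix('above_'), df.shift(-1).add_prefix('below_')], axis=1)
# ddof=0 gives the population standard deviation, like np.std; NaNs are skipped
lagged['std'] = lagged.std(axis=1, ddof=0)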
I am trying to clean up a dataset. Only values smaller than the last value should be kept.
Right now it looks like this:
my_data
0 10
1 8
2 7
3 10
4 5
5 8
6 2
after the cleanup it should look like this:
my_data
0 10
1 8
2 7
3 7
4 5
5 5
6 2
I also have some working code but I am looking for a faster and more pythonic way of doing it.
import pandas as pd
df_results = pd.DataFrame()
df_results['my_data'] = [10, 8, 7, 10, 5, 8, 2]
data_idx = list(df_results['my_data'].index)
for i in range(1, len(df_results['my_data'])):
current_value = df_results['my_data'][data_idx[i]]
last_value = df_results['my_data'][data_idx[i - 1]]
df_results.loc[data_idx[i], 'my_data'] = current_value if current_value < last_value else last_value
You can use:
In [53]: df[df.my_data.diff() > 0] = np.nan
In [54]: df
Out[54]:
my_data
0 10.0
1 8.0
2 7.0
3 NaN
4 5.0
5 NaN
6 2.0
In [55]: df.ffill()
Out[55]:
my_data
0 10.0
1 8.0
2 7.0
3 7.0
4 5.0
5 5.0
6 2.0
I am using shift with diff:
s=df.my_data.diff().gt(0)
df.loc[s,'my_data']=df.loc[s.shift(-1).fillna(False),'my_data'].values
Out[71]:
my_data
0 10.0
1 8.0
2 7.0
3 7.0
4 5.0
5 5.0
6 2.0
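Since every value is replaced by the previous kept value whenever it is larger, the result is simply the running minimum, so cummin is a one-liner alternative (assuming "last value" means the last kept value, as in the question's loop):
import pandas as pd
df_results = pd.DataFrame({'my_data': [10, 8, 7, 10, 5, 8, 2]})
df_results['my_data'] = df_results['my_data'].cummin()
# 10, 8, 7, 7, 5, 5, 2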
I have a list of columns in a dataframe that shouldn't be empty.
I want to remove any rows that are empty in any of these columns. My solution would be to iterate through the required columns and set the column 'excluded' to the error message that the user will be shown before excluding them (I will present these to the user in the form of a report at the end of the process)
I'm currently trying something like this:
for col in requiredColumns:
df[pd.isnull(df[col])]['excluded'] = df[pd.isnull(df[col])]['excluded'].apply(lambda x: str(x) + col + ' empty, excluded')
but no luck: the column isn't updated. The filter by itself (to get only the empty rows) works; the update part doesn't seem to be working.
I'm used to SQL:
UPDATE df SET e = e & "empty, excluded" WHERE NZ(col, '') = ''
If you need to update a pandas DataFrame based on multiple conditions, you can simply use .loc:
>>> df
A B C
0 2 40 800
1 1 90 600
2 6 80 700
3 1998 70 55
4 1 90 300
5 7 80 700
6 4 20 300
7 1998 20 2
8 7 10 100
9 1998 60 2
>>> df.loc[(df['A'] > 7) & (df['B'] > 69) , 'C'] = 75
This will set 'C' = 75 where 'A' > 7 and 'B' > 69
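Applied to the original problem, the same .loc pattern avoids the chained-indexing copy that made the update silently fail. A sketch, assuming requiredColumns is the list of columns to check and 'excluded' already exists as a text column:
for col in requiredColumns:
    mask = df[col].isnull()
    df.loc[mask, 'excluded'] = df.loc[mask, 'excluded'].astype(str) + col + ' empty, excluded'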
One way is to use numpy functions to create a column with the desired marker.
Setup
import pandas as pd, numpy as np
df = pd.DataFrame({'A': [1, np.nan, 2, 3, 4, 5],
'B': [2, 3, np.nan, 5, 1, 9],
'C': [5, 8, 1, 9, np.nan, 7]})
A B C
0 1.0 2.0 5.0
1 NaN 3.0 8.0
2 2.0 NaN 1.0
3 3.0 5.0 9.0
4 4.0 1.0 NaN
5 5.0 9.0 7.0
Solution
df['test'] = np.any(np.isnan(df.values), axis=1)
A B C test
0 1.0 2.0 5.0 False
1 NaN 3.0 8.0 True
2 2.0 NaN 1.0 True
3 3.0 5.0 9.0 False
4 4.0 1.0 NaN True
5 5.0 9.0 7.0 False
Explanation
np.isnan returns a Boolean array indicating which elements of a numpy array are NaN.
Use np.any or np.all, as required, to determine which rows are in scope.
Use df.values to extract underlying numpy array from dataframe. For selected columns, you can use df[['A', 'B']].values.
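Note that np.isnan only works on numeric arrays; if some of the required columns hold strings or other objects, the pandas-native equivalent is isna. A sketch:
import numpy as np
import pandas as pd
df = pd.DataFrame({'A': [1, np.nan, 2], 'B': ['x', None, 'y']})
df['test'] = df.isna().any(axis=1)  # handles mixed dtypes, unlike np.isnan(df.values)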