create new columns where values of row is NaN - python

I have a column of data that contains NaN rows (see the sample below). I want to split the column at the NaN values and start a new column wherever a value appears after a NaN. For instance, a new column should start at row 7 and hold the rows that follow the NaN values. I have tried this, but it lumps the data together.
Col1
0 Start
1 65
2 oft
3 23:59:02
4 12-Feb-99
5 NaN
6 NaN
7 17
8 Sparkle
9 10
I have used the code below to break them into groups.
df['groups'] = df['Col1'].isnull().cumsum()
Col1 groups
0 Start 0
1 65 0
2 oft 0
3 23:59:02 0
4 12-Feb-99 0
5 NaN 1
6 NaN 2
7 17 2
8 Sparkle 2
9 10 2
I now intend to stack the data into different columns based on the group numbers:
Col1 Col2 Col3 ... ColN
0 Start NaN Nan ...
1 65 17 ....
2 oft Sparkle ....
3 23:59:02 10 ...
4 12-Feb-99

I suggest slicing the pandas DataFrame manually instead of using numpy to slice.
import numpy as np
# Get the index of the null values (assumes the default 0..n-1 index)
index = df.index[df.col.isna()].to_list()
# Each segment starts right after a NaN and ends right before the next one
starting_index = [0] + [i + 1 for i in index]
ending_index = [i - 1 for i in index] + [len(df) - 1]
n = 0
for i, j in zip(starting_index, ending_index):
    if i <= j:  # skip empty segments produced by consecutive NaNs
        n += 1
        df[f"col{n}"] = np.nan
        df.loc[: j - i, f"col{n}"] = df.loc[i:j, "col"].values
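The cumulative-sum grouping from the question can also be used directly. The following is a minimal sketch, assuming the sample values shown in the question; the names `group_no`, `valid`, `runs` and `out` are illustrative, not part of the original code.

```python
import numpy as np
import pandas as pd

# Sample data mirroring the question (an assumption; the real column may differ).
df = pd.DataFrame({"Col1": ["Start", 65, "oft", "23:59:02", "12-Feb-99",
                            np.nan, np.nan, 17, "Sparkle", 10]})

# Each NaN bumps the counter, so every run of values between NaNs shares one id.
group_no = df["Col1"].isna().cumsum()

# Drop the NaN separators, then collect each run as its own re-indexed Series.
valid = df["Col1"].dropna()
runs = [run.reset_index(drop=True)
        for _, run in valid.groupby(group_no[valid.index])]

# Place the runs side by side; shorter runs are padded with NaN.
out = pd.concat(runs, axis=1, keys=[f"Col{k}" for k in range(1, len(runs) + 1)])
print(out)
```

Unlike the loop above, this leaves the original DataFrame untouched and returns a new one whose length is that of the longest run.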

Related

how do I sort a list of values that correspond to 'n' into a large table ordered by 'n' [duplicate]

This question already has answers here:
Pandas: how to merge two dataframes on a column by keeping the information of the first one?
(4 answers)
Closed 3 years ago.
I have a list of uncertainties that correspond to particular values of n, which I'll call table 1. I would like to add those uncertainties to a large, comprehensive table of data, table 2, that is sorted in ascending order by n. How can I attach each uncertainty to the correct corresponding value of n?
My first issue is that my table of uncertainties is a Table, not a DataFrame. I have the separate arrays but am not sure how to combine them into a DataFrame.
table1 = Table([xrow,yrow])
xrow is the array of 'n' values shown below for table1, and yrow is the array of corresponding errors.
excerpt of table1:
n error
1 0.0
2 0.00496
3 0.0096
4 0.00913
6 0.00555
8 0.00718
10 0.00707
excerpt of table2:
n Energy g J error
0 1 0.000000 1 0 NaN
1 2 1827.486200 1 0 NaN
2 3 3626.681500 1 0 NaN
3 4 5396.686500 1 0 NaN
4 5 6250.149500 1 0 NaN
so the end result should look like this:
n Energy g J error
0 1 0.000000 1 0 0
1 2 1827.486200 1 0 0.00496
2 3 3626.681500 1 0 0.0096
3 4 5396.686500 1 0 0.00913
4 5 6250.149500 1 0 NaN
That is, rows for which there is no data should stay blank (e.g. n=5 in the above case).
I should note there is a lot of data (roughly 30k) in table 2 and 2.5k in table1.
You can use .merge like this:
import pandas as pd
from io import StringIO
table1 = pd.read_csv(StringIO("""
n error
1 0.0
2 0.00496
3 0.0096
4 0.00913
6 0.00555
8 0.00718
10 0.00707"""), sep=r"\s+")
table2 = pd.read_csv(StringIO("""
n Energy g J error
0 1 0.000000 1 0 NaN
1 2 1827.486200 1 0 NaN
2 3 3626.681500 1 0 NaN
3 4 5396.686500 1 0 NaN
4 5 6250.149500 1 0 NaN"""), sep=r"\s+")
table2["error"] = table1.merge(table2, on="n", how="right")["error_x"]
print(table2)
Output:
n Energy g J error
0 1 0.0000 1 0 0.00000
1 2 1827.4862 1 0 0.00496
2 3 3626.6815 1 0 0.00960
3 4 5396.6865 1 0 0.00913
4 5 6250.1495 1 0 NaN
EDIT: using .map should perform better (see comments):
table2["error"] = table2["n"].map(table1.set_index('n')['error'])
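Since the question mentions that the uncertainties exist only as separate arrays, here is a small sketch of building the lookup from them and applying .map. It assumes `xrow` and `yrow` are equal-length NumPy arrays and that the values of n in them are unique; the shortened sample values are for illustration only.

```python
import numpy as np
import pandas as pd

# Assumed stand-ins for the arrays from the question.
xrow = np.array([1, 2, 3, 4, 6])
yrow = np.array([0.0, 0.00496, 0.0096, 0.00913, 0.00555])

table1 = pd.DataFrame({"n": xrow, "error": yrow})
table2 = pd.DataFrame({"n": [1, 2, 3, 4, 5],
                       "Energy": [0.0, 1827.4862, 3626.6815, 5396.6865, 6250.1495]})

# Build a lookup Series keyed by n; map looks up each n and yields NaN when it is absent.
lookup = table1.set_index("n")["error"]
table2["error"] = table2["n"].map(lookup)
print(table2)
```

If n can repeat in table1, .map will raise on the non-unique index, so the lookup would need deduplicating first (for example with drop_duplicates on "n").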

Pandas: Insert dataframe into other dataframe without preserving indices

I want to insert a pandas dataframe into another pandas dataframe at certain indices.
Lets say we have this dataframe:
original_df = pd.DataFrame([[1,2,3],[4,5,6],[7,8,9]])
0 1 2
0 1 2 3
1 4 5 6
2 7 8 9
I can then change values at certain indices as following:
original_df = pd.DataFrame([[1,2,3],[4,5,6],[7,8,9]])
original_df.iloc[[0,2],[0,1]] = 2
0 1 2
0 2 2 3
1 4 5 6
2 2 2 9
However, if i use the same technique to insert another dataframe, it doesn't work:
original_df = pd.DataFrame([[1,2,3],[4,5,6],[7,8,9]])
df_to_insert = pd.DataFrame([[10,11],[12,13]])
original_df.iloc[[0,2],[0,1]] = df_to_insert
0 1 2
0 10.0 11.0 3.0
1 4.0 5.0 6.0
2 NaN NaN 9.0
I am looking for a way to get the following result:
0 1 2
0 10 11 3
1 4 5 6
2 12 13 9
It seems to me that with the syntax I am using, the values from df_to_insert are taken from the corresponding index at their target locations. Is there a way for me to avoid this?
When you insert, make sure to convert the DataFrame to its values. pandas is index sensitive, which means it always tries to align on the index and columns during assignment:
original_df.iloc[[0,2],[0,1]] = df_to_insert.values
original_df
Out[651]:
0 1 2
0 10 11 3
1 4 5 6
2 12 13 9
It does work with an array rather than a df:
original_df.iloc[[0,2],[0,1]] = np.array([[10,11],[12,13]])
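As a self-contained sketch of the positional assignment described above, the snippet below uses `to_numpy()`, which returns the same underlying array as `.values`. The frames are the ones from the question.

```python
import pandas as pd

original_df = pd.DataFrame([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
df_to_insert = pd.DataFrame([[10, 11], [12, 13]])

# to_numpy() strips the index and column labels, so the values are placed
# purely by position instead of being aligned on labels 0 and 1.
original_df.iloc[[0, 2], [0, 1]] = df_to_insert.to_numpy()
print(original_df)
```

The only requirement is that the array shape matches the shape of the selected block, here 2 rows by 2 columns.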

Refer to next index in pandas

If I had a simple pandas DataFrame like this:
frame = pd.DataFrame(np.arange(12).reshape((3,4)), columns=list('abcd'), index=list('123'))
I want to find the max value in each row, then use it to find the next value in the same column and add that value to a new column.
So the above DataFrame looks like this (with d2 changed to 3):
a b c d
1 1 2 3 4
2 5 6 7 3
3 9 10 11 12
So, conceptually the first row should be scanned, 4 is identified as the largest number, then 3 is found as the number within the same column but in the next index. Similarly for the row 2, 7 is the largest number, and 11 is the next number in that column. So 3 and 11 should get added to a new column like this:
a b c d Next
1 1 2 3 4 NaN
2 5 6 7 3 3
3 9 10 11 12 11
I started by making a function like this, but it only finds the max values.
f = lambda x: x.max()
max = frame.apply(f, axis='columns')
frame['Next'] = max
Based on your edit, you can use np.argmax:
i = np.arange(len(df))
j = pd.Series(np.argmax(df.values, axis=1))
df['next'] = df.shift(-1).values[i, j]
a b c d next
1 1 2 3 4 3.0
2 5 6 7 3 11.0
3 9 10 11 12 NaN
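For completeness, a runnable sketch of the same approach, using the edited values from the question and the question's name `frame`; the helper names `rows` and `cols` are illustrative.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(
    [[1, 2, 3, 4], [5, 6, 7, 3], [9, 10, 11, 12]],
    columns=list("abcd"), index=list("123"),
)

rows = np.arange(len(frame))
cols = frame.to_numpy().argmax(axis=1)  # column position of each row's maximum

# shift(-1) moves every row up by one, so indexing with [rows, cols]
# reads the value directly below each row's maximum.
frame["next"] = frame.shift(-1).to_numpy()[rows, cols]
print(frame)
```

The last row has nothing below it, so its entry is NaN, which also turns the new column into floats.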

Pandas - remove row similar to other row

I need to remove all rows from a pandas.DataFrame, which satisfy an unusual condition.
If there is another row that is exactly the same except that it has NaN in column "C", I want to remove the row with the NaN.
Given a table:
A B C D
1 2 NaN 3
1 2 50 3
10 20 NaN 30
5 6 7 8
I need to remove the first row, since it has NaN in column C and there is an otherwise identical row (the second) with a real value in column C.
However, the 3rd row must stay, because no other row has the same A, B and D values.
How do you perform this using pandas? Thank you!
You can achieve this using drop_duplicates.
Initial DataFrame:
df=pd.DataFrame(columns=['a','b','c','d'], data=[[1,2,None,3],[1,2,50,3],[10,20,None,30],[5,6,7,8]])
df
a b c d
0 1 2 NaN 3
1 1 2 50 3
2 10 20 NaN 30
3 5 6 7 8
Then you can sort the DataFrame by column C. This pushes the NaNs to the bottom of the column:
df = df.sort_values(['c'])
df
a b c d
3 5 6 7 8
1 1 2 50 3
0 1 2 NaN 3
2 10 20 NaN 30
Then remove duplicates, comparing all columns except C and keeping the first row found:
df1 = df.drop_duplicates(['a','b','d'], keep='first')
a b c d
3 5 6 7 8
1 1 2 50 3
2 10 20 NaN 30
But this is valid only if the NaNs are in column C.
You can try filling the NaNs (bfill/ffill) along with drop_duplicates:
df.bfill().ffill().drop_duplicates(subset=['A', 'B', 'D'], keep = 'last')
This also handles the scenario where the A, B and D values are the same but C has non-NaN values in both rows.
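With the sample data you get the following. Note that bfill also fills the NaN in row 2 with the 7 from the row below, so the unmatched NaN does not survive:
A B C D
1 1 2 50.0 3
2 10 20 7.0 30
3 5 6 7.0 8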
This feels right to me
notdups = ~df.duplicated(df.columns.difference(['C']), keep=False)
notnans = df.C.notnull()
df[notdups | notnans]
A B C D
1 1 2 50.0 3
2 10 20 NaN 30
3 5 6 7.0 8
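The mask-based answer above can be run end to end as follows. This is a sketch on the question's sample data; `key_cols`, `has_twin` and `result` are illustrative names.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"A": [1, 1, 10, 5], "B": [2, 2, 20, 6],
                   "C": [np.nan, 50, np.nan, 7], "D": [3, 3, 30, 8]})

# Compare rows on every column except C.
key_cols = df.columns.difference(["C"]).tolist()

# True for every row that shares its A, B, D values with some other row.
has_twin = df.duplicated(subset=key_cols, keep=False)

# Keep a row if it has no twin, or if its C value is present.
result = df[~has_twin | df["C"].notna()]
print(result)
```

One caveat of this rule: if two rows are identical and both have NaN in C, both are dropped, because each counts as the other's twin.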

fillna in clustered data in large pandas dataframes

Considering the following dataframe:
index group signal
1 1 1
2 1 NAN
3 1 NAN
4 1 -1
5 1 NAN
6 2 NAN
7 2 -1
8 2 NAN
9 3 NAN
10 3 NAN
11 3 NAN
12 4 1
13 4 NAN
14 4 NAN
I want to modify the signals by ffill NANs in each group so that I can have the following dataframe:
index group signal
1 1 1
2 1 1
3 1 1
4 1 -1
5 1 -1
6 2 NAN
7 2 -1
8 2 -1
9 3 NAN
10 3 NAN
11 3 NAN
12 4 1
13 4 1
14 4 1
The dataframe is big (around 800,000 rows with about 16,000 different groups). Currently I put it into a groupby object and try to modify each group there, which is very slow. Then I tried to convert it into a pivot_table and ffill() there, but the dataframe is simply too large and the program gives errors. Any suggestions? Thank you!
Can you try this?
data_group = data.groupby('group').apply(lambda v: v.ffill())
I think NAN is a string in your data, not an empty element. Empty data would appear as NaN. If it is a string, replace NAN first, like this:
data_group = data.groupby('group').apply(lambda v: v.replace('NAN', float('nan')).ffill())
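Or a better version, as Jeff suggested:
data['signal'] = data['signal'].replace('NAN', float('nan'))
# Select the column so the result can be assigned back without losing 'group'
data['signal'] = data.groupby('group')['signal'].ffill()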
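A self-contained sketch of the vectorised per-group forward fill on the question's sample follows. It assumes the missing entries arrive as the string "NAN", as the answer suspects; `pd.to_numeric` with `errors="coerce"` is used here as one way to turn them into real NaN values.

```python
import numpy as np
import pandas as pd

data = pd.DataFrame({
    "group":  [1, 1, 1, 1, 1, 2, 2, 2, 3, 3, 3, 4, 4, 4],
    "signal": [1, "NAN", "NAN", -1, "NAN", "NAN", -1, "NAN",
               "NAN", "NAN", "NAN", 1, "NAN", "NAN"],
})

# Assumption: placeholders are strings; anything non-numeric becomes NaN.
data["signal"] = pd.to_numeric(data["signal"], errors="coerce")

# Forward fill within each group only, so values never leak across groups.
data["signal"] = data.groupby("group")["signal"].ffill()
print(data)
```

Because the fill runs as a single grouped operation rather than a Python function per group, it avoids the per-group overhead of apply, which matters with about 16,000 groups.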
