Group rows by date and overwrite NaN values - python

I have a dataframe of the following structure, which has been simplified for this question.
            A    B    C    D    E
0  2014/01/01  nan  nan  0.2  nan
1  2014/01/01  0.1  nan  nan  nan
2  2014/01/01  nan  0.3  nan  0.7
3  2014/01/02  nan  0.4  nan  nan
4  2014/01/02  0.5  nan  0.6  0.8
What I have here is a series of readings across several timestamps on single days. The columns B, C, D and E represent different locations. The data I am reading in is set up such that, at a given timestamp, it takes readings from certain locations and fills in NaN for the other locations.
What I wish to do is group the data by timestamp, which I can easily do with a .groupby() call. From there I wish to have the NaN values in the grouped data overwritten with the valid values taken in the other rows, so that the following result is obtained.
            A    B    C    D    E
0  2014/01/01  0.1  0.3  0.2  0.7
1  2014/01/02  0.5  0.4  0.6  0.8
How do I go about achieving this?
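For reference, a minimal construction of the example frame (an editor's sketch, assuming the usual pd/np import aliases; the answers below operate on this df):

import numpy as np
import pandas as pd

# Rebuild the example frame from the question.
df = pd.DataFrame({
    'A': ['2014/01/01', '2014/01/01', '2014/01/01', '2014/01/02', '2014/01/02'],
    'B': [np.nan, 0.1, np.nan, np.nan, 0.5],
    'C': [np.nan, np.nan, 0.3, 0.4, np.nan],
    'D': [0.2, np.nan, np.nan, np.nan, 0.6],
    'E': [np.nan, np.nan, 0.7, np.nan, 0.8],
})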

Try df.groupby with DataFrameGroupBy.agg:
In [528]: df.groupby('A', as_index=False, sort=False).agg(np.nansum)
Out[528]:
            A    B    C    D    E
0  2014/01/01  0.1  0.3  0.2  0.7
1  2014/01/02  0.5  0.4  0.6  0.8
A shorter version with DataFrameGroupBy.sum (thanks MaxU!):
In [537]: df.groupby('A', as_index=False, sort=False).sum()
Out[537]:
            A    B    C    D    E
0  2014/01/01  0.1  0.3  0.2  0.7
1  2014/01/02  0.5  0.4  0.6  0.8

You can also try this using pandas' first:
df.groupby('A', as_index=False).first()
          A    B    C    D    E
0  1/1/2014  0.1  0.3  0.2  0.7
1  1/2/2014  0.5  0.4  0.6  0.8
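Note that first() and sum() agree here only because each date has at most one valid reading per location: first() keeps the first non-NaN value per column within each group, while sum() adds all of them (skipping NaN). If a location could report twice on the same day, first() would keep the earliest reading whereas sum() would silently add them, so choose accordingly.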

Related

Find consecutive values in rows in pandas Dataframe based on condition

I was looking at this question: "How can I find 5 consecutive rows in a pandas DataFrame where a value of a certain column is at least 0.5", which is similar to the one I have in mind. I would like to find at least 3 consecutive rows where a value is less than 0.5 (but neither negative nor NaN), considering the entire dataframe rather than just one column as in the linked question. Here is a facsimile dataframe:
import numpy as np
import pandas as pd

idx = pd.date_range("2018-01-01", periods=10, freq="M")
df = pd.DataFrame(
    {
        'A': [0, 0.4, 0.5, 0.3, 0, 0, 0, 0, 0, 0],
        'B': [0, 0.6, 0.8, 0, 0.3, 0.3, 0.9, 0.7, 0, 0],
        'C': [0, 0, 0.5, 0.4, 0.4, 0.2, 0, 0, 0, 0],
        'D': [0.4, 0, 0.6, 0.5, 0.7, 0.2, 0, 0.9, 0.8, 0],
        'E': [0.4, 0.3, 0.2, 0.7, 0.7, 0.8, 0, 0, 0, 0],
        'F': [0, 0, 0.6, 0.7, 0.8, 0.3, 0.4, 0.1, 0, 0],
    },
    index=idx,
)
df = df.replace({0: np.nan})
df
Hence, columns B and D, which don't satisfy the criteria, should be removed from the output.
I'd prefer not to use a for loop and the like, since it is a 2000-column df, so I tried the following:
def consecutive_values_in_range(s, min, max):
    return s.between(left=min, right=max)

min, max = 0, 0.5
df.apply(lambda col: consecutive_values_in_range(col, min, max), axis=0)
print(df)
But I didn't obtain what I was looking for, which would be something like this:
              A    C    E    F
2018-01-31  NaN  NaN  0.4  NaN
2018-02-28  0.4  NaN  0.3  NaN
2018-03-31  0.5  0.5  0.2  0.6
2018-04-30  0.3  0.4  0.7  0.7
2018-05-31  NaN  0.4  0.7  0.8
2018-06-30  NaN  0.2  0.8  0.3
2018-07-31  NaN  NaN  NaN  0.4
2018-08-31  NaN  NaN  NaN  0.1
2018-09-30  NaN  NaN  NaN  NaN
2018-10-31  NaN  NaN  NaN  NaN
Any suggestions? Thanks in advance.
lower, upper = 0, 0.5
n = 3
df.loc[:, ((df <= upper) & (df >= lower)).rolling(n).sum().eq(n).any()]
- get an is-between mask over df
- take the rolling sum of that mask per column, with window size 3
- since True == 1 and False == 0, a rolling sum of 3 at any point implies 3 consecutive True's, i.e., values with 0 <= val <= 0.5 in that column
- so check equality against 3 and see if there's any hit in a column
- lastly, index with the resulting True/False mask per column
to get
              A    C    E    F
2018-01-31  NaN  NaN  0.4  NaN
2018-02-28  0.4  NaN  0.3  NaN
2018-03-31  0.5  0.5  0.2  0.6
2018-04-30  0.3  0.4  0.7  0.7
2018-05-31  NaN  0.4  0.7  0.8
2018-06-30  NaN  0.2  0.8  0.3
2018-07-31  NaN  NaN  NaN  0.4
2018-08-31  NaN  NaN  NaN  0.1
2018-09-30  NaN  NaN  NaN  NaN
2018-10-31  NaN  NaN  NaN  NaN
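To see the mechanics on a single toy column (an illustrative sketch, not part of the original answer):

import pandas as pd

s = pd.Series([0.4, 0.3, 0.2, 0.8, 0.1])
mask = s.between(0, 0.5)                   # True where 0 <= value <= 0.5
print(mask.rolling(3).sum())               # NaN NaN 3.0 2.0 2.0 -- a 3.0 marks a run of three Trues
print(mask.rolling(3).sum().eq(3).any())   # True: this column qualifies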

Fill the dataframe values from other dataframe in pandas python

I have 2 dataframes df1 and df2. df1 is filled with values and df2 is empty.
df1 and df2, as can be seen below, will always have the same index and columns; the only difference is that df1 doesn't contain duplicate column and index values, while df2 does.
How do I fill values in df2 from df1, so that the duplicated index/column combinations are handled as well?
df1 = pd.DataFrame({'Ind': pd.Series([1, 2, 3, 4]),
                    1: pd.Series([1, 0.2, 0.2, 0.8]),
                    2: pd.Series([0.2, 1, 0.2, 0.8]),
                    3: pd.Series([0.2, 0.2, 1, 0.8]),
                    4: pd.Series([0.8, 0.8, 0.8, 1])})
df1 = df1.set_index(['Ind'])
df2 = pd.DataFrame(columns=[1, 1, 2, 2, 3, 4], index=[1, 1, 2, 2, 3, 4])
IIUC, you want DataFrame.update:
df2.update(df1)
print(df2)
     1    1    2    2    3    4
1  1.0  1.0  0.2  0.2  0.2  0.8
1  1.0  1.0  0.2  0.2  0.2  0.8
2  0.2  0.2  1.0  1.0  0.2  0.8
2  0.2  0.2  1.0  1.0  0.2  0.8
3  0.2  0.2  0.2  0.2  1.0  0.8
4  0.8  0.8  0.8  0.8  0.8  1.0
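The key point is that DataFrame.update aligns on both row and column labels, so every duplicated label in df2 picks up the value of the matching unique label in df1. The same alignment can be reproduced explicitly with reindex (an editorial sketch, not the answer's method):

import pandas as pd

df1 = pd.DataFrame(
    {1: [1.0, 0.2, 0.2, 0.8], 2: [0.2, 1.0, 0.2, 0.8],
     3: [0.2, 0.2, 1.0, 0.8], 4: [0.8, 0.8, 0.8, 1.0]},
    index=pd.Index([1, 2, 3, 4], name='Ind'),
)
# Reindexing broadcasts each unique df1 label onto every duplicate of it in df2's axes.
filled = df1.reindex(index=[1, 1, 2, 2, 3, 4], columns=[1, 1, 2, 2, 3, 4])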

How to do a fillna with zero values until data appears in each column, then use the forward fill for each column in pandas data frame

I have the following data frame,
date          x1    x2    x3
2001-01-01   nan   0.4   0.1
2001-01-02   nan   0.3   nan
2001-01-03   nan   nan   0.5
...
2001-05-05   nan   0.1   0.2
2001-05-06   0.1   nan   0.3
...
So I want to first fill the NaN values with zero until the first data point appears in each column; after that, I want the rest of the rows to use forward fill.
So the above data frame should look like this,
date          x1    x2    x3
2001-01-01   0     0.4   0.1
2001-01-02   0     0.3   0.1
2001-01-03   0     0.3   0.5
...
2001-05-05   0     0.1   0.2
2001-05-06   0.1   0.1   0.3
...
If I do fillna with 0 first and then forward-fill, like this,
df = df.fillna(0)
df = df.ffill()
all the NaN values just become zero, and the forward fill has nothing left to propagate where the data starts.
Is there a way to do the forward fill the way I want?
Reverse the logic: forward-fill first, then fill with zero. ffill propagates the last valid value forward but leaves the leading NaNs (before a column's first data point) untouched, and fillna(0) then zeroes exactly those:
out = df.ffill().fillna(0)
print(out)
# Output
         date   x1   x2   x3
0  2001-01-01  0.0  0.4  0.1
1  2001-01-02  0.0  0.3  0.1
2  2001-01-03  0.0  0.3  0.5
3  2001-05-05  0.0  0.1  0.2
4  2001-05-06  0.1  0.1  0.3
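A minimal reproduction of the trick, with the frame rebuilt from the question (an illustrative sketch):

import numpy as np
import pandas as pd

df = pd.DataFrame({
    'date': ['2001-01-01', '2001-01-02', '2001-01-03', '2001-05-05', '2001-05-06'],
    'x1': [np.nan, np.nan, np.nan, np.nan, 0.1],
    'x2': [0.4, 0.3, np.nan, 0.1, np.nan],
    'x3': [0.1, np.nan, 0.5, 0.2, 0.3],
})
# ffill() propagates the last valid value forward but leaves leading NaNs alone;
# fillna(0) then zeroes exactly those leading gaps.
out = df.ffill().fillna(0)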

Pandas Reshape DataFrame in a loop

I am new to Pandas in Python. I have a dataframe with 2 keys, 15 rows each, and 1 column, like below:
1
key1/1 0.5
key1/2 0.5
key1/3 0
key1/4 0
key1/5 0.6
key1/6 0.7
key1/7 0
key1/8 0
key1/9 0
key1/10 0.5
key1/11 0.5
key1/12 0.5
key1/13 0
key1/14 0.5
key1/15 0.5
key2/1 0.4
key2/2 0.2
key2/3 0
key2/4 0
key2/5 0.1
key2/6 0.2
key2/7 0
key2/8 0
key2/9 0.3
key2/10 0.2
key2/11 0
key2/12 0.5
key2/13 0
key2/14 0
key2/15 0.5
I want to iterate over the rows of the dataframe so that each time it meets a zero it creates a new column, like below:
          1    2    3    4
key1/1  0.5  0.6  0.5  0.5
key1/2  0.5  0.7  0.5  0.5
key1/3  nan  nan  0.5  nan
key1/4  nan  nan  nan  nan

          1    2    3    4    5
key2/1  0.4  0.1  0.3  0.5  0.5
key2/2  0.2  0.2  0.2  nan  nan
key2/3  nan  nan  nan  nan  nan
key2/4  nan  nan  nan  nan  nan
I have tried the following code, trying to iterate 'key1' only:
df2 = pd.DataFrame()
for row in df['key1'].index:
    new_df['key1'][row] == df['key1'][row]
    if df['key1'][row] == 0:
        new_df['key1'].append(df2, ignore_index=True)
Obviously it is not working; please send some help. Ideally I would like to modify the same dataframe instead of creating a new one. Thanks.
EDIT
Below is a drawing of what my data looks like, and below that is what I am trying to achieve (the images are not reproduced here).
You can mask the zeros and assign a group key. Based on the key you can group the values and transform them into columns.
All credit goes to this answer, where you will find a great explanation.
df2 = df.mask(df['1'] == 0)   # turn zeros into NaN
df2['group'] = (df2['1'].shift(1).isnull() & df2['1'].notnull()).cumsum()   # new key at each run start
df2 = df2.dropna()
df2.pivot(columns='group')
           1
group      1    2    3    4
key1/1   0.5  NaN  NaN  NaN
key1/10  NaN  NaN  0.5  NaN
key1/11  NaN  NaN  0.5  NaN
key1/12  NaN  NaN  0.5  NaN
key1/14  NaN  NaN  NaN  0.5
key1/15  NaN  NaN  NaN  0.5
key1/2   0.5  NaN  NaN  NaN
key1/5   NaN  0.6  NaN  NaN
key1/6   NaN  0.7  NaN  NaN
Your group key will look like this:
           1  group
key1/1   0.5      1
key1/2   0.5      1
key1/3   NaN      1
key1/4   NaN      1
key1/5   0.6      2
key1/6   0.7      2
key1/7   NaN      2
key1/8   NaN      2
key1/9   NaN      2
key1/10  0.5      3
key1/11  0.5      3
key1/12  0.5      3
key1/13  NaN      3
key1/14  0.5      4
key1/15  0.5      4
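To see why the cumsum line produces that key, run it on a toy series (an illustrative sketch, not from the original answer):

import numpy as np
import pandas as pd

s = pd.Series([0.5, np.nan, 0.6, 0.7, np.nan])
starts = s.shift(1).isnull() & s.notnull()   # True exactly where a run of valid values begins
print(starts.cumsum())                       # 1 1 2 2 2 -- a running counter of runs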
You can then translate this data into column format.
Complete solution:
df2 = df.mask(df['1'] == 0)
df2['group'] = (df2['1'].shift(1).isnull() & df2['1'].notnull()).cumsum()
df2 = df2.dropna()
x = df2.groupby('group')['1'].apply(list)   # one list of values per run
df3 = pd.DataFrame(x.values.tolist()).T     # lists become columns
df3.index = [f"key1/{i}" for i in range(1, len(df3) + 1)]
          0    1    2    3
key1/1  0.5  0.6  0.5  0.5
key1/2  0.5  0.7  0.5  0.5
key1/3  NaN  NaN  0.5  NaN
To get something in that format, you need the data to look like this:
group
1 [0.5, 0.5]
2 [0.6, 0.7]
3 [0.5, 0.5, 0.5]
4 [0.5, 0.5]
Name: 1, dtype: object
Update 1:
Assuming the data to be as described above, wrap the logic in a function and apply it per key:
def func(r):
    df2 = r.mask(r['1'] == 0)
    df2['group'] = (df2['1'].shift(1).isnull() & df2['1'].notnull()).cumsum()
    df2 = df2.dropna()
    x = df2.groupby('group')['1'].apply(list)
    df3 = pd.DataFrame(x.values.tolist()).T
    # df3.index = [r.name]*len(df3)
    return df3
df.groupby(df.index.str.split('/').str[0]).apply(func)   # group by the key prefix (key1, key2, ...)

How to use boolean indexing with Pandas

I have a dataframe:
df =
   time  time   b
0   0.0   1.1  21
1   0.1   2.2  22
2   0.2   3.3  23
3   0.3   4.4  24
4   0.4   5.5  24
I also have a series for my units, defined as
su =
time     sal
time    zulu
b        m/s
Now, I want to set df.index equal to the "time (sal)" values. Those values can be in any column and I will need to check.
I can do this as:
df.index = df.values[:,(df.columns == 'time') & (su.values == 'sal')]
But, my index looks like:
array([[0.0],
[0.1],
[0.2],
[0.3],
[0.4]])
However, this is an array of arrays (a 2-D slice). In bigger datasets, plotting seems to take longer. If I hardcode the column position, I get just a flat array:
df.index = df.values[:, 0]
array([0.0, 0.1, 0.2, 0.3, 0.4])
I can also do the following:
inx = ((df.columns == 'time') & (su.values == 'sal')).tolist().index(True)
This sets inx to 0, and then I get a single array:
df.index = df.values[:, inx]
However, I shouldn't have to do this. Am I using pandas and boolean indexing incorrectly?
I want:
df =
     time  time   b
0.0   0.0   1.1  21
0.1   0.1   2.2  22
0.2   0.2   3.3  23
0.3   0.3   4.4  24
0.4   0.4   5.5  24
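(Editorial note: the array-of-arrays comes from slicing df.values with a boolean column mask, which keeps the column axis. A minimal fix, assuming su is aligned with df.columns as shown above, is to flatten the 2-D slice:)

# The mask selects a single column, so the slice has shape (n, 1); ravel() flattens it to 1-D.
df.index = df.values[:, (df.columns == 'time') & (su.values == 'sal')].ravel()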
As I understood it, this is what you expected. However, I renamed the time columns as time1 and time2; otherwise a dictionary cannot be created with duplicate keys.
import pandas as pd

df = {'time1': [0.0, 0.1, 0.2, 0.3, 0.4], 'time2': [1.1, 2.2, 3.3, 4.4, 5.5], 'b': [21, 22, 23, 24, 24]}
su = {'time1': 'sal', 'time2': 'zulu', 'b': 'm/s'}
indexes = df[list(su.keys())[list(su.values()).index('sal')]]   # the column whose unit is 'sal'
df = pd.DataFrame(df, index=indexes, columns=['time1', 'time2', 'b'])
print(df)
Your original DataFrame has duplicate column names, which adds complexity.
Try modifying the column names.
Sample Code
unit = pd.Series(['sal', 'zulu', 'm/s'], index=['time', 'time', 'b'])
>>> df
   time  time     b
0   0.0   1.1  21.0
1   0.1   2.2  22.0
2   0.2   3.3  23.0
3   0.3   4.4  24.0
4   0.4   5.5  25.0
new_col = ['{}({})'.format(df.columns[i], unit.iloc[i]) for i in range(len(df.columns))]   # iloc: positional access, since unit's labels are duplicated
>>> new_col
['time(sal)', 'time(zulu)', 'b(m/s)']
>>> df.columns = new_col
>>> df
   time(sal)  time(zulu)  b(m/s)
0        0.0         1.1    21.0
1        0.1         2.2    22.0
2        0.2         3.3    23.0
3        0.3         4.4    24.0
4        0.4         5.5    25.0
>>> df.index = df['time(sal)'].values
>>> df
     time(sal)  time(zulu)  b(m/s)
0.0        0.0         1.1    21.0
0.1        0.1         2.2    22.0
0.2        0.2         3.3    23.0
0.3        0.3         4.4    24.0
0.4        0.4         5.5    25.0
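Renaming the duplicated columns by folding the units into the labels makes every label unique, after which plain label-based indexing (df['time(sal)']) just works. This is generally the more robust fix, since most pandas operations assume unique labels.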
