Extracting non-NaN values from multiple rows in a pandas DataFrame - Python

I am working on several taxi datasets. I have used pandas to concatenate all the datasets into a single dataframe.
My dataframe looks something like this:
                      675                   1039                 # ...and the remaining 125 taxis
                      longitude  latitude   longitude  latitude
date
2008-02-02 13:31:21   116.56359  40.06489   NaN        NaN
2008-02-02 13:31:51   116.56486  40.06415   NaN        NaN
2008-02-02 13:32:21   116.56855  40.06352   116.58243  39.6313
2008-02-02 13:32:51   116.57127  40.06324   NaN        NaN
2008-02-02 13:33:21   116.57120  40.06328   116.55134  39.6313
2008-02-02 13:33:51   116.57121  40.06329   116.55126  39.6123
2008-02-02 13:34:21   NaN        NaN        116.55134  39.5123
where 675 and 1039 are the taxi IDs. Basically there are 127 taxis in total, each with its own latitude and longitude columns placed side by side.
I have several ways to extract the non-null values for a single row.
df.ix[k,df.columns[np.isnan(df.irow(0))!=1]]
(or)
df.irow(0)[np.isnan(df.irow(0))!=1]
(or)
df.irow(0)[np.where(df.irow(0)[df.columns].notnull())[0]]
Any of the above commands will return:
675 longitude 116.56359
latitude 40.064890
4549 longitude 116.34642
latitude 39.96662
Name: 2008-02-02 13:31:21
Now I want to extract all the non-null values from the first few rows (say from row 1 to row 6).
How do I do that?
I could probably loop over the rows, but I want a non-looped way of doing it.
Any help or suggestions are welcome.
Thanks in advance! :)

df.ix[1:6].dropna(axis=1)
As a heads up, irow will be deprecated in the next release of pandas. New methods, with clearer usage, replace it.
http://pandas.pydata.org/pandas-docs/dev/indexing.html#deprecations
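For reference, a sketch of the same idea on current pandas releases (where .ix and irow have both since been removed), assuming "row 1 to row 6" means positional rows:
# Select rows 1..6 by position (iloc's end point is exclusive, hence 1:7),
# then drop every column that contains a NaN anywhere in that slice.
subset = df.iloc[1:7].dropna(axis=1)
# The same selection by label on the datetime index also works:
# subset = df.loc['2008-02-02 13:31:51':'2008-02-02 13:34:21'].dropna(axis=1)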

In 0.11 (0.11rc1 is out now), this is very easy: use .iloc to first select the first 6 rows, then dropna drops any row with a NaN (you can also pass some options to dropna to control exactly which columns are considered).
I realized you want rows 1:6; I used 0:6 in my answer...
In [8]: df = DataFrame(randn(10,3),columns=list('ABC'),index=date_range('20130101',periods=10))
In [9]: df.ix[6,'A'] = np.nan
In [10]: df.ix[6,'B'] = np.nan
In [11]: df.ix[2,'A'] = np.nan
In [12]: df.ix[4,'B'] = np.nan
In [13]: df.iloc[0:6]
Out[13]:
A B C
2013-01-01 0.442692 -0.109415 -0.038182
2013-01-02 1.217950 0.006681 -0.067752
2013-01-03 NaN -0.336814 -1.771431
2013-01-04 -0.655948 0.484234 1.313306
2013-01-05 0.096433 NaN 1.658917
2013-01-06 1.274731 1.909123 -0.289111
In [14]: df.iloc[0:6].dropna()
Out[14]:
A B C
2013-01-01 0.442692 -0.109415 -0.038182
2013-01-02 1.217950 0.006681 -0.067752
2013-01-04 -0.655948 0.484234 1.313306
2013-01-06 1.274731 1.909123 -0.289111
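For illustration, a short sketch of the dropna options mentioned above, applied to the same sliced frame:
df.iloc[0:6].dropna(subset=['A'])   # only consider column A when deciding which rows to drop
df.iloc[0:6].dropna(how='all')      # drop a row only if every value in it is NaN
df.iloc[0:6].dropna(axis=1)         # drop columns (rather than rows) that contain any NaN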

Using Jeff's dataframe:
import pandas as pd
import numpy as np
from numpy.random import randn
df = pd.DataFrame(randn(10,3),columns=list('ABC'),index=pd.date_range('20130101',periods=10))
df.ix[6,'A'] = np.nan
df.ix[6,'B'] = np.nan
df.ix[2,'A'] = np.nan
df.ix[4,'B'] = np.nan
We can replace NaNs with some number we know is not in the dataframe:
df = df.fillna(999)
If you want to keep only the non-null values without iterating, you can do:
df_nona = df.apply(lambda x: list(filter(lambda y: y != 999, x)))
df_na = df.apply(lambda x: list(filter(lambda y: y == 999, x)))
The problem with this approach is that the results are lists, so you lose the information about the index.
df_nona
A [-1.9804955861, 0.146116306853, 0.359075672435...
B [-1.01963803293, -0.829747654648, 0.6950551455...
C [2.40122968044, 0.79395493777, 0.484201174184,...
dtype: object
Another option is:
df1 = df.dropna()
index_na = df.index.symmetric_difference(df1.index)  # the rows that contained NaN
df_na = df.loc[index_na]
In this case you don't lose the information about the index, although this is really similar to the previous answers.
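Yet another option that avoids the sentinel value entirely, sketched on the original (un-filled) frame: stack() drops NaNs by default and keeps both the row and column labels, so no index information is lost.
non_null = df.stack()           # Series indexed by (date, column); NaN entries are dropped
non_null.loc['2013-01-05']      # the non-null values of that single row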
Hope it helps!

Grouper() and agg() functions produce multiple copies when squashed

I have a sample dataframe as given below.
import pandas as pd
import numpy as np

NaN = np.nan
data = {'ID': ['A', 'A', 'A', 'B', 'B', 'B'],
        'Date': ['2021-09-20 04:34:57', '2021-09-20 04:37:25', '2021-09-20 04:38:26',
                 '2021-09-01 00:12:29', '2021-09-01 11:20:58', '2021-09-02 09:20:58'],
        'Name': ['xx', 'xx', NaN, 'yy', NaN, NaN],
        'Height': [174, 174, NaN, 160, NaN, NaN],
        'Weight': [74, NaN, NaN, 58, NaN, NaN],
        'Gender': [NaN, 'Male', NaN, NaN, 'Female', NaN],
        'Interests': [NaN, NaN, 'Hiking,Sports', NaN, NaN, 'Singing']}
df1 = pd.DataFrame(data)
df1
I want to combine the data present on the same date into a single row. The 'Date' column is in timestamp format. Here is my attempt:
df1['Date'] = pd.to_datetime(df1['Date'])
df_out = (df1.groupby(['ID', pd.Grouper(key='Date', freq='D')])
.agg(lambda x: ''.join(x.dropna().astype(str)))
.reset_index()
).replace('', np.nan)
This gives an output where, if there are multiple entries with the same value, the final result repeats them within the same row, as shown below.
Obtained Output
However, I do not want the values to be repeated if there are multiple entries. The final output should look like the image shown below.
Required Output
For example, the first row should have 'xx' and 174.0 rather than 'xxxx' and '174.0174.0'.
Any help is greatly appreciated. Thank you.
In your case, replace the agg join with first:
df_out = (df1.groupby(['ID', pd.Grouper(key='Date', freq='D')])
.first()
.reset_index()
).replace('', np.nan)
df_out
Out[113]:
ID Date Name Height Weight Gender Interests
0 A 2021-09-20 xx 174.0 74.0 Male Hiking,Sports
1 B 2021-09-01 yy 160.0 58.0 Female None
2 B 2021-09-02 None NaN NaN None Singing
Since you're only trying to keep the first available value for each column for each date, you can do:
>>> df1.groupby(["ID", pd.Grouper(key='Date', freq='D')]).agg("first").reset_index()
ID Date Name Height Weight Gender Interests
0 A 2021-09-20 xx 174.0 74.0 Male Hiking,Sports
1 B 2021-09-01 yy 160.0 58.0 Female None
2 B 2021-09-02 None NaN NaN None Singing
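If some columns can genuinely hold several different values on the same day and you still want them joined rather than only keeping the first one, a sketch that keeps the join-based aggregation but de-duplicates repeated values first:
df_out = (df1.groupby(['ID', pd.Grouper(key='Date', freq='D')])
             .agg(lambda x: ', '.join(x.dropna().astype(str).unique()))
             .reset_index()
         ).replace('', np.nan)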

Pandas - Calculate row values based on prior row value, update the result to be the new row value (and so on)

Below is some dummy data that reflects the data I am working with.
import pandas as pd
import numpy as np
from numpy import random
random.seed(30)
# Dummy data that represents a percent change
datelist = pd.date_range(start='1983-01-01', end='1994-01-01', freq='Y')
df1 = pd.DataFrame({"P Change_1": np.random.uniform(low=-0.55528, high=0.0396181, size=(11,)),
                    "P Change_2": np.random.uniform(low=-0.55528, high=0.0396181, size=(11,))})

# This dataframe contains the rows we want to operate on
df2 = pd.DataFrame({
    'Loc1': [None, None, None, None, None, None, None, None, None, None, 2.5415],
    'Loc2': [None, None, None, None, None, None, None, None, None, None, 3.2126]})
#Set the datetime index
df1 = df1.set_index(datelist)
df2 = df2.set_index(datelist)
df1:
P Change_1 P Change_2
1984-12-31 -0.172080 -0.231574
1985-12-31 -0.328773 -0.247018
1986-12-31 -0.160834 -0.099079
1987-12-31 -0.457924 0.000266
1988-12-31 0.017374 -0.501916
1989-12-31 -0.349052 -0.438816
1990-12-31 0.034711 0.036164
1991-12-31 -0.415445 -0.415372
1992-12-31 -0.206852 -0.413107
1993-12-31 -0.313341 -0.181030
1994-12-31 -0.474234 -0.118058
df2:
Loc1 Loc2
1984-12-31 NaN NaN
1985-12-31 NaN NaN
1986-12-31 NaN NaN
1987-12-31 NaN NaN
1988-12-31 NaN NaN
1989-12-31 NaN NaN
1990-12-31 NaN NaN
1991-12-31 NaN NaN
1992-12-31 NaN NaN
1993-12-31 NaN NaN
1994-12-31 2.5415 3.2126
DataFrame details:
First off, Loc1 corresponds to P Change_1, Loc2 corresponds to P Change_2, and so on. Looking at Loc1 first, I want to either fill up the DataFrame containing Loc1 and Loc2 with the relevant values or compute a new dataframe that has columns Calc1 and Calc2.
The calculation:
I want to start with the 1994 value of Loc1 and calculate a new value for 1993 by taking Loc1 1993 = Loc1 1994 + (Loc1 1994 * P Change_1 1993). With the values filled in, this is 2.5415 + (-0.313341 * 2.5415), which is about 1.74514.
This 1.74514 value replaces the NaN value in 1993, and I then want to use that calculated value to get a value for 1992. This means we now compute Loc1 1992 = Loc1 1993 + (Loc1 1993 * P Change_1 1992). I want to carry out this operation row-wise until it reaches the earliest value in the time series.
What is the best way to go about implementing this row-wise equation? I hope this makes some sense and any help is greatly appreciated!
df = pd.merge(df1, df2, how='inner', right_index=True, left_index=True)  # merge dataframes on the date index
df['count'] = range(len(df))  # helper column for easy positional operations

# split the dataframe in two parts: one above the first non-NaN row and one below it
da1 = df[df['count'] <= df.dropna().iloc[0]['count']]
da2 = df[df['count'] >= df.dropna().iloc[0]['count']]
da1.sort_values(by=['count'], ascending=False, inplace=True)

g = [da1, da2]
num_col = len(df1.columns)
for w in range(len(g)):
    count = 0
    list_of_col = [list() for i in range(len(g[w]))]
    for item, rows in g[w].iterrows():
        n = []
        if count == 0:
            for p in range(1, num_col + 1):
                n.append(rows[f'Loc{p}'])
        else:
            for p in range(1, num_col + 1):
                n.append(list_of_col[count-1][p-1] + list_of_col[count-1][p-1] * rows[f'P Change_{p}'])
        list_of_col[count].extend(n)
        count += 1
    tmp = [list() for i in range(num_col)]
    for d_ in range(num_col):
        for x_ in range(len(list_of_col)):
            tmp[d_].append(list_of_col[x_][d_])
    z1 = []
    z1.extend(tmp)
    for i in range(num_col):
        g[w][f'Loc{i+1}'] = z1[i]

da1.sort_values(by=['count'], inplace=True)
final_df = pd.concat([da1, da2[1:]])
calc_df = pd.DataFrame()
for i in range(num_col):
    calc_df[f'Calc{i+1}'] = final_df[f'Loc{i+1}']
print(calc_df)
I have tried to explain everything obscure in the comments. I have edited my code so that the initial dataframes remain unaffected.
[Edited]: I have edited the code to work with any number of columns in the given dataframe.
[Edited]: If the column names in df1 and df2 are arbitrary, please run this block of code before running the code above. It renames the columns using a list comprehension:
df1.columns = [f'P Change_{i+1}' for i in range(len(df1.columns))]
df2.columns = [f'Loc{i+1}' for i in range(len(df2.columns))]
[EDITED] Perhaps there are better/more elegant ways to do this, but this worked fine for me:
def fill_values(df1, df2, cols1=None, cols2=None):
    if cols1 is None: cols1 = df1.columns
    if cols2 is None: cols2 = df2.columns
    for i in reversed(range(df2.shape[0] - 1)):
        for col1, col2 in zip(cols1, cols2):
            if np.isnan(df2[col2].iloc[i]):
                val = df2[col2].iloc[i+1] + df2[col2].iloc[i+1] * df1[col1].iloc[i]
                df2[col2].iloc[i] = val
    return df1, df2

df1, df2 = fill_values(df1, df2)
print(df2)
Loc1 Loc2
1983-12-31 0.140160 0.136329
1984-12-31 0.169291 0.177413
1985-12-31 0.252212 0.235614
1986-12-31 0.300550 0.261526
1987-12-31 0.554444 0.261457
1988-12-31 0.544976 0.524925
1989-12-31 0.837202 0.935388
1990-12-31 0.809117 0.902741
1991-12-31 1.384158 1.544128
1992-12-31 1.745144 2.631024
1993-12-31 2.541500 3.212600
This assumes that the rows in df1 and df2 correspond perfectly (I'm not querying the index, but only the location). Hope it helps!
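Since the recurrence is Loc[i] = Loc[i+1] * (1 + P Change[i]), the backward fill can also be written without an explicit Python loop as a reversed cumulative product. A sketch, under the same assumption that the rows of df1 and df2 align by position and that the last row of each Loc column holds the known value:
for loc_col, pct_col in [('Loc1', 'P Change_1'), ('Loc2', 'P Change_2')]:
    factors = 1 + df1[pct_col]
    # suffix product: (1 + P Change) multiplied from row i through the second-to-last row
    suffix = factors.iloc[:-1][::-1].cumprod()[::-1]
    # every row gets last_value * suffix product; the last row itself keeps its known value
    df2[loc_col] = suffix.reindex(df2.index).fillna(1) * df2[loc_col].iloc[-1]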
Just to be clear, what you need is Loc1[year]=Loc1[next_year] + PChange[year]*Loc1[next_year], right?
The loop below will do what you are looking for, but it assumes that the number of rows in both DataFrames is always equal (instead of matching on the index value). From your description, I think this works for your data.
for i in range(df2.shape[0] - 2, -1, -1):
    df2.Loc1[i] = df2.Loc1[i+1] + (df1['P Change_1'][i] * df2.Loc1[i+1])
Hope this helps :)

How to apply "Diff()" methode to multiple columns?

I'm trying to apply the diff() method to multiple columns to make the data stationary for a time series.
x1 = frc_data['004L004T10'].diff(periods=8)
x1
Date
2013-10-01 NaN
2013-11-01 NaN
2013-12-01 NaN
2014-01-01 NaN
2014-02-01 NaN
So diff is working for a single column.
However, diff is not working for all the columns:
for x in frc_data.columns:
    frc_data[x].diff(periods=1)
No errors, although the data remains unchanged.
In order to change the DataFrame, you need to assign the diff back to the column, i.e.
for x in frc_data.columns:
    frc_data[x] = frc_data[x].diff(periods=1)
A loop is not necessary; simply remove the [x] to take the difference of all the columns at once:
frc_data = frc_data.diff(periods=1)
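To see why the original loop appeared to do nothing, note that diff returns a new Series/DataFrame rather than modifying anything in place; a small sketch on a toy frame:
import pandas as pd
toy = pd.DataFrame({'a': [1, 3, 6], 'b': [2, 2, 8]})
toy['a'].diff(periods=1)         # returns a new Series; toy itself is unchanged
toy_diff = toy.diff(periods=1)   # one call differences every column at once
print(toy_diff)
#      a    b
# 0  NaN  NaN
# 1  2.0  0.0
# 2  3.0  6.0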

How to return a new DataFrame when using an apply function on an old DataFrame?

Input Data (df):
bookings rolling_mean rolling_std_dev
ds city
2013-01-01 City_2 69 NaN NaN
2013-01-02 City_2 101 NaN NaN
2013-01-03 City_2 134 101.333333 32.501282
2013-01-04 City_2 155 130.000000 27.221315
2013-01-05 City_2 104 131.000000 25.632011
Code:
import math

def f1(x):
    if (math.isnan(x.bookings) or math.isnan(x.rolling_mean) or math.isnan(x.rolling_std_dev)):
        print "Not enough information"
    elif abs(x.bookings - x.rolling_mean) > (2 * x.rolling_std_dev):
        print x.bookings
        print x.rolling_mean
        print x.rolling_std_dev

df.apply(lambda x: f1(x), axis=1)
Output:
Problem:
The function above compiles correctly with no errors. However, when I try to run it, it doesn't give me the output I want. It's not printing anything after the elif statement, but it should. Also, I don't understand the dataframe full of None values that shows up after the printed output. Where is that coming from?
What solution I want:
Return a new dataframe with all the rows that fulfill the elif statement.
When a function does not explicitly return anything, it returns None (every function call in Python returns something, and the default return value is None when nothing is returned explicitly).
This is why you are getting a dataframe of all None. I do not think you can achieve what you are trying to do with apply, as apply() with axis=1 runs the function for every row and replaces the row with the returned value (as you can see in your case).
What you are trying to do can be achieved in a vectorized way:
newdf = df.dropna()
result = newdf[(newdf['bookings'] - newdf['rolling_mean']) > (2 * newdf['rolling_std_dev'])]
Explanation:
df.dropna() drops any row with a NaN value in it.
The next line does a boolean comparison of Series (the comparison is applied element-wise and returns a boolean Series), and then uses that result for boolean indexing.
Demo (I changed a row so that there is at least one row meeting the condition):
In [50]: df
Out[50]:
bookings rolling_mean rolling_std_dev
ds city
2013-01-01 City_2 69 NaN NaN
2013-01-02 City_2 101 NaN NaN
2013-01-03 City_2 134 101.333333 32.501282
2013-01-04 City_2 155 130.000000 27.221315
2013-01-05 City_2 1000 131.000000 25.632011
In [51]: newdf = df.dropna()
In [52]: result = newdf[(newdf['bookings'] - newdf['rolling_mean']) > (2 * newdf['rolling_std_dev'])]
In [53]: result
Out[53]:
bookings rolling_mean rolling_std_dev
ds city
2013-01-05 City_2 1000 131 25.632011
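The same filter can also be written in one line with DataFrame.query, shown here as a sketch:
result = df.dropna().query('bookings - rolling_mean > 2 * rolling_std_dev')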

Resampling a multi-index DataFrame

I want to resample a DataFrame with a multi-index containing both a datetime column and some other key. The Dataframe looks like:
import pandas as pd
from StringIO import StringIO
csv = StringIO("""ID,NAME,DATE,VAR1
1,a,03-JAN-2013,69
1,a,04-JAN-2013,77
1,a,05-JAN-2013,75
2,b,03-JAN-2013,69
2,b,04-JAN-2013,75
2,b,05-JAN-2013,72""")
df = pd.read_csv(csv, index_col=['DATE', 'ID'], parse_dates=['DATE'])
df.columns.name = 'Params'
Because resampling is only allowed on datetime indexes, I thought unstacking the other index column would help. And indeed it does, but I can't stack it again afterwards.
print df.unstack('ID').resample('W-THU')
Params VAR1
ID 1 2
DATE
2013-01-03 69 69.0
2013-01-10 76 73.5
But then stacking 'ID' again results in an index-error:
print df.unstack('ID').resample('W-THU').stack('ID')
IndexError: index 0 is out of bounds for axis 0 with size 0
Strangely enough, I can stack the other column level with both:
print df.unstack('ID').resample('W-THU').stack(0)
and
print df.unstack('ID').resample('W-THU').stack('Params')
The index error also occurs if I reorder (swap) both column levels. Does anyone know how to overcome this issue?
The example unstacks the non-numerical column 'NAME', which is silently dropped but causes problems during re-stacking. The code below worked for me:
print df[['VAR1']].unstack('ID').resample('W-THU').stack('ID')
Params VAR1
DATE ID
2013-01-03 A 69.0
B 69.0
2013-01-10 A 76.0
B 73.5
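On newer pandas versions the unstack/resample/stack round-trip can be avoided by grouping on the index levels directly; a sketch, assuming a recent release where pd.Grouper accepts a level together with a resampling frequency:
weekly = (df[['VAR1']]
          .groupby([pd.Grouper(level='DATE', freq='W-THU'), pd.Grouper(level='ID')])
          .mean())
print(weekly)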
