Remove NaN from pandas series - python

Is there a way to remove NaN values from a pandas Series? I have a Series that may or may not contain some NaN values, and I'd like to return a copy of the Series with all the NaNs removed.

>>> s = pd.Series([1, 2, 3, 4, np.nan, 5, np.nan])
>>> s[~s.isnull()]
0    1.0
1    2.0
2    3.0
3    4.0
5    5.0
dtype: float64
Update: or, even better, as @DSM suggested in the comments, use pandas.Series.dropna():
>>> s.dropna()
0    1.0
1    2.0
2    3.0
3    4.0
5    5.0
dtype: float64
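As an aside, s.notna() is the positive counterpart of ~s.isnull() and reads a bit more directly; it gives the same result as the first snippet above:
>>> s[s.notna()]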

A small trick using the fact that np.nan != np.nan:
s[s == s]
Out[953]:
0    1.0
1    2.0
2    3.0
3    4.0
5    5.0
dtype: float64
More Info
np.nan == np.nan
Out[954]: False

If you have a pandas Series with NaNs and want to remove them (without losing the index):
serie = serie.dropna()
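If you also want a fresh, consecutive index afterwards (dropna keeps the original labels), a small addition:
serie = serie.dropna().reset_index(drop=True)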
# create data for the example
data = np.array(['g', 'e', 'e', 'k', 's'])
ser = pd.Series(data)
ser = ser.replace('e', np.nan)  # replace returns a new Series, so assign it back
print(ser)
0      g
1    NaN
2    NaN
3      k
4      s
dtype: object
# the code
ser = ser.dropna()
print(ser)
0    g
3    k
4    s
dtype: object

Related

python pandas dataframe multiply columns matching index or row name

I have two dataframes,
df1:
hash  a  b   c
ABC   1  2   3
def   5  3   4
Xyz   3  2  -1
df2:
hash  v
Xyz   3
def   5
I want to make
df:
hash   a   b   c
ABC    1   2   3   (= as is, because there is no matching 'ABC' in df2)
def   25  15  20   (= 5*5, 3*5, 4*5)
Xyz    9   6  -3   (= 3*3, 2*3, -1*3)
As shown above, I want to make a dataframe whose values are df1's values multiplied by df2's wherever their index (or first column name) matches.
As df2 only has one column (v), all of df1's columns except for the first one (the index) should be affected.
Is there any neat Pythonic and pandas way to achieve it?
df1.set_index(['hash']).mul(df2.set_index(['hash'])) and similar attempts don't seem to work...
One approach:
df1 = df1.set_index("hash")
df2 = df2.set_index("hash")["v"]  # a Series keyed by hash
res = df1.mul(df2, axis=0).combine_first(df1)  # rows with no match fall back to df1
print(res)
Output
         a     b     c
hash
ABC    1.0   2.0   3.0
Xyz    9.0   6.0  -3.0
def   25.0  15.0  20.0
One Method:
# We'll make this for convenience
cols = ['a', 'b', 'c']
# Merge the DataFrames, keeping everything from df1
df = df1.merge(df2, how='left').fillna(1)
# Make the v column integers again, since fillna upcast it to float
df.v = df.v.astype(int)
# Broadcast the multiplication across axis 0
df[cols] = df[cols].mul(df.v, axis=0)
# Drop the no-longer needed column:
df = df.drop('v', axis=1)
print(df)
Output:
  hash   a   b   c
0  ABC   1   2   3
1  def  25  15  20
2  Xyz   9   6  -3
Alternative Method:
# Set indices
df1 = df1.set_index('hash')
df2 = df2.set_index('hash')
# Apply multiplication and fill values
df = (df1.mul(df2.v, axis=0)
         .fillna(df1)
         .astype(int)
         .reset_index())
# Output:
  hash   a   b   c
0  ABC   1   2   3
1  Xyz   9   6  -3
2  def  25  15  20
The function you are looking for is actually multiply.
Here's how I have done it:
>>> df
  hash  a  b
0  ABC  1  2
1  DEF  5  3
2  XYZ  3 -1
>>> df2
  hash  v
0  XYZ  4
1  ABC  8
df = df.merge(df2, on='hash', how='left').fillna(1)
>>> df
  hash  a  b    v
0  ABC  1  2  8.0
1  DEF  5  3  1.0
2  XYZ  3 -1  4.0
df[['a','b']] = df[['a','b']].multiply(df['v'], axis='index')
>>> df
  hash     a     b    v
0  ABC   8.0  16.0  8.0
1  DEF   5.0   3.0  1.0
2  XYZ  12.0  -4.0  4.0
You can actually drop v at the end if you don't need it.
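For instance, as a one-liner:
df = df.drop('v', axis=1)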

How to create a data frame from lists of different lengths using python?

I want to create a pandas data frame from multiple lists of different lengths. Below is my python code.
import random
import pandas as pd

A = [1, 2]
B = [1, 2, 3]
C = [1, 2, 3, 4, 5, 6]
lenA = len(A)
lenB = len(B)
lenC = len(C)
df = pd.DataFrame(columns=['A', 'B', 'C'])
for i, v1 in enumerate(A):
    for j, v2 in enumerate(B):
        for k, v3 in enumerate(C):
            if i < random.randint(0, lenA):
                if j < random.randint(0, lenB):
                    if k < random.randint(0, lenC):
                        # note: DataFrame.append was removed in pandas 2.0
                        df = df.append({'A': v1, 'B': v2, 'C': v3}, ignore_index=True)
print(df)
My lists are as below:
A=[1,2]
B=[1,2,3]
C=[1,2,3,4,5,6,7]
Each run gives a different output, which is fine, but the output does not cover all the list items in every run. In one run I got the output below:
   A  B  C
0  1  1  3
1  1  2  1
2  1  2  2
3  2  2  5
In the above output, all of list 'A''s items (1, 2) are there, but list 'B' only has items (1, 2); item 3 is missing. Likewise, list 'C' only has items (1, 2, 3, 5); items (4, 6, 7) are missing. My expectation is that each item of each list should appear in the data frame at least once, and list 'C''s items should appear only once. My expected sample output is below:
   A  B  C
0  1  1  3
1  1  2  1
2  1  2  2
3  2  2  5
4  2  3  4
5  1  1  7
6  2  3  6
Guide me to get my expected output. Thanks in advance.
You can pad each list with random values drawn from itself up to the maximum length, and then use DataFrame.sample:
import numpy as np
import pandas as pd

A = [1, 2]
B = [1, 2, 3]
C = [1, 2, 3, 4, 5, 6]
L = [A, B, C]
m = max(len(x) for x in L)
print(m)
6

# pad each shorter list with random choices from itself, then shuffle the rows
a = [np.hstack((np.random.choice(x, m - len(x)), x)) for x in L]
df = pd.DataFrame(a, index=['A', 'B', 'C']).T.sample(frac=1)
print (df)
   A  B  C
2  2  2  3
0  2  1  1
3  1  1  4
4  1  2  5
5  2  3  6
1  2  2  2
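Note that sample(frac=1) shuffles the rows but keeps their original labels (visible in the output above); if you want a clean, consecutive index, reset it afterwards:
df = df.reset_index(drop=True)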
You can use transpose to achieve the same.
EDIT: Used random to randomize the output as requested.
import pandas as pd
from random import shuffle, choice

A = [1, 2]
B = [1, 2, 3]
C = [1, 2, 3, 4, 5, 6]
shuffle(A)
shuffle(B)
shuffle(C)
data = [A, B, C]
df = pd.DataFrame(data)
df = df.transpose()
df.columns = ['A', 'B', 'C']
# assign the result back; fillna(..., inplace=True) on a .loc slice can fail silently
df['A'] = df['A'].fillna(choice(A))
df['B'] = df['B'].fillna(choice(B))
Before the fillna step, the transposed frame looks like the output below; the two fillna calls then replace the remaining NaNs in A and B with a random choice from each list:
     A    B    C
0  1.0  1.0  1.0
1  2.0  2.0  2.0
2  NaN  3.0  3.0
3  NaN  4.0  4.0
4  NaN  NaN  5.0
5  NaN  NaN  6.0

Using Pandas to skip the next three rows from a specific column

I have been working on this code forever. I would like to accomplish the following:
- If Pout > 3, then drop/delete the next 3 rows
df = pd.read_csv(file, sep=',', usecols=['Iin', 'Iout', 'Pout'])
print(df['Pout'])
for i in df['Pout']:
    if i > 3:
        df.drop(df[3:])  # drop/delete the next 3 rows regardless of the value
print(df)
Any help will be greatly appreciated.
Thanks
I came up with this code based on your first code, but the updated version that you have just posted is more efficient. I am now dropping the next five rows after the condition has been met.
import pandas as pd

df = pd.DataFrame({'a': [1, 5.0, 1, 2.3, 2.1, 2, 1, 3, 4, 7],
                   'b': [1, 4, 0.2, 4.5, 8.2, 1, 2, 3, 4, 7],
                   'c': [1, 4.5, 5.4, 6, 2, 4, 2, 3, 4, 7]})
for index in range(len(df['c'])):
    if df['c'][index] > 3:
        df.at[index + 1, 'c'] = None
        df.at[index + 2, 'c'] = None
        df.at[index + 3, 'c'] = None
        df.at[index + 4, 'c'] = None
        df.at[index + 5, 'c'] = None
        print(df['c'])
        break
try this:
import pandas as pd

df = pd.DataFrame({'a': [1, 5, 1, 2, 2, 2, 1],
                   'b': [1, 4, 2, 4, 8, 1, 2],
                   'c': [1, 2, 6, 6, 2, 1, 2]})
for i in df['c']:
    if i > 3:
        try:
            # index of the first occurrence of the value > 3
            idx = df['c'].tolist().index(i)
            print(idx)
        except ValueError:
            pass
        for i in range(idx, idx + 3):
            df.at[i, 'c'] = None
print(df)
Output:
   a  b    c
0  1  1  1.0
1  5  4  2.0
2  1  2  NaN
3  2  4  NaN
4  2  8  NaN
5  2  1  1.0
6  1  2  2.0
My solution uses a dummy data frame. I get the index of the item; if the item is bigger than 3, I iterate through the range from the item's index to the item's index plus 3 and use the .at function to set the value to NaN. In my edit I just added try/except, and now it works.
For 5 rows:
I think this code is what you want; I also think it is more efficient:
import pandas as pd

df = pd.DataFrame({'a': [1, 5.0, 1, 2.3, 2.1, 2, 1, 3, 4, 7],
                   'b': [1, 4, 0.2, 4.5, 8.2, 1, 2, 3, 4, 7],
                   'c': [1, 4.5, 5.4, 6, 2, 4, 2, 3, 4, 7]})
for index in range(len(df['c'])):
    if df['c'][index] > 3:
        for i in range(index + 1, index + 6):
            df.at[i, 'c'] = None
        print(df['c'])
        break
Output:
0    1.0
1    4.5
2    NaN
3    NaN
4    NaN
5    NaN
6    NaN
7    3.0
8    4.0
9    7.0
Name: c, dtype: float64
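If you want to actually drop the rows (as the question title says) rather than set them to NaN, here is a small sketch, assuming "next 3 rows" means positional neighbours; the sample data is made up:
import numpy as np
import pandas as pd

df = pd.DataFrame({'Pout': [1, 5, 1, 2, 2, 2, 1, 4, 1, 1]})  # hypothetical data

# positions where the condition fires
hits = np.flatnonzero(df['Pout'].to_numpy() > 3)

# collect the 3 positions following each hit, clipped to the frame length
to_drop = set()
for h in hits:
    to_drop.update(range(h + 1, min(h + 4, len(df))))

result = df.drop(df.index[sorted(to_drop)])
print(result)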

Getting Pandas.groupby.shift() results with groupbyvars as cols / index?

Given this trivial dataset
df = pd.DataFrame({'one': ['a', 'a', 'a', 'b', 'b', 'b'],
                   'two': ['c', 'c', 'c', 'c', 'd', 'd'],
                   'three': [1, 2, 3, 4, 5, 6]})
grouping on one / two and applying .max() returns me a Series indexed on the groupby vars, as expected...
df.groupby(['one', 'two'])['three'].max()
output:
one  two
a    c      3
b    c      4
     d      6
Name: three, dtype: int64
...in my case I want to shift() my records, by group. But for some reason, when I apply .shift() to the groupby object, my results don't include the groupby variables:
df.groupby(['one', 'two'])['three'].shift()
output:
0    NaN
1    1.0
2    2.0
3    NaN
4    NaN
5    5.0
Name: three, dtype: float64
Is there a way to preserve those groupby variables in the results, as either columns or a multi-indexed Series (as in .max())? Thanks!
It is the difference between an aggregation like max and a transformation like shift - max aggregates values (and returns a reduced Series indexed by the group keys), while shift does not - it returns a Series of the same size as the original.
So is possible append output to new column:
df['shifted'] = df.groupby(['one', 'two'])['three'].shift()
Theoretically it is possible to use agg, but it returns an error in pandas 0.20.3:
df1 = df.groupby(['one', 'two'])['three'].agg(['max', lambda x: x.shift()])
print (df1)
ValueError: Function does not reduce
One possible solution is transform, if you need max together with shift:
g = df.groupby(['one', 'two'])['three']
df['max'] = g.transform('max')
df['shifted'] = g.shift()
print (df)
  one  three two  max  shifted
0   a      1   c    3      NaN
1   a      2   c    3      1.0
2   a      3   c    3      2.0
3   b      4   c    4      NaN
4   b      5   d    6      NaN
5   b      6   d    6      5.0
As Jez explained, shift returns a Series that keeps the same length as the dataframe; if you try to aggregate with it the way max() does, you get the error
Function does not reduce
df.assign(shifted=df.groupby(['one', 'two'])['three'].shift()).set_index(['one','two'])
Out[57]:
         three  shifted
one two
a   c        1      NaN
    c        2      1.0
    c        3      2.0
b   c        4      NaN
    d        5      NaN
    d        6      5.0
Using max as the key, and slicing the shifted values at the rows where the group max occurs:
df.groupby(['one', 'two'])['three'].apply(lambda x : x.shift()[x==x.max()])
Out[58]:
one  two
a    c    2    2.0
b    c    3    NaN
     d    5    5.0
Name: three, dtype: float64
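If you specifically want the shifted values as a Series indexed by the group keys (like the .max() output), one possible sketch is to move the group columns into the index first, so the shift result carries a MultiIndex:
shifted = (df.set_index(['one', 'two'])
             .groupby(level=['one', 'two'])['three']
             .shift())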

pandas subset using sliced boolean index

code to make test data:
import pandas as pd
import numpy as np
testdf = {'date': range(10),
          'event': ['A', 'A', np.nan, 'B', 'B', 'A', 'B', np.nan, 'A', 'B'],
          'id': [1] * 7 + [2] * 3}
testdf = pd.DataFrame(testdf)
print(testdf)
gives
   date event  id
0     0     A   1
1     1     A   1
2     2   NaN   1
3     3     B   1
4     4     B   1
5     5     A   1
6     6     B   1
7     7   NaN   2
8     8     A   2
9     9     B   2
subset testdf
df_sub = testdf.loc[testdf.event == 'A',:]
print(df_sub)
   date event  id
0     0     A   1
1     1     A   1
5     5     A   1
8     8     A   2
(Note: not re-indexed)
create conditional boolean index
bool_sliced_idx1 = df_sub.date < 4
bool_sliced_idx2 = (df_sub.date > 4) & (df_sub.date < 6)
I want to insert conditional values using this new index in original df, like
testdf['new_column'] = np.nan
testdf.loc[bool_sliced_idx1, 'new_column'] = 'new_conditional_value'
which obviously (now) gives error:
pandas.core.indexing.IndexingError: Unalignable boolean Series key provided
bool_sliced_idx1 looks like
>>> print(bool_sliced_idx1)
0     True
1     True
5    False
8    False
Name: date, dtype: bool
I tried testdf.ix[(bool_sliced_idx1==True).index,:], but that doesn't work because
>>> (bool_sliced_idx1==True).index
Int64Index([0, 1, 5, 8], dtype='int64')
IIUC, you can just combine all of your conditions at once, instead of trying to chain them. For example, df_sub.date < 4 is really just (testdf.event == 'A') & (testdf.date < 4). So, you could do something like:
# Create the conditions.
cond1 = (testdf.event == 'A') & (testdf.date < 4)
cond2 = (testdf.event == 'A') & (testdf.date.between(4, 6, inclusive=False))
# Make the assignments.
testdf.loc[cond1, 'new_col'] = 'foo'
testdf.loc[cond2, 'new_col'] = 'bar'
Which would give you:
   date event  id new_col
0     0     A   1     foo
1     1     A   1     foo
2     2   NaN   1     NaN
3     3     B   1     NaN
4     4     B   1     NaN
5     5     A   1     bar
6     6     B   1     NaN
7     7   NaN   2     NaN
8     8     A   2     NaN
9     9     B   2     NaN
This worked:
idx = np.where(bool_sliced_idx1 == True)[0]
## or
# np.ravel(np.where(bool_sliced_idx1 == True))
idx_original = df_sub.index[idx]  # these are labels from the original frame
testdf.loc[idx_original, :]       # so select with .loc (label-based), not .iloc
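Another option (a small sketch, not from the answers above): align the sliced boolean mask back onto the full index with reindex, treating the missing labels as False; then the .loc assignment from the question works directly:
# expand the sliced mask to cover every row of testdf
mask = bool_sliced_idx1.reindex(testdf.index, fill_value=False)
testdf.loc[mask, 'new_column'] = 'new_conditional_value'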
