I've been trying to do an efficient VLOOKUP-style operation in pandas, with an IF function...
Basically, I want to apply this to the column ccy_grp: if the value in a particular row is NaN, it should take the value from another column, ccy.
def func1(tkn1, tkn2):
    if tkn1 == 'NaN':
        return tkn2
    else:
        return tkn1

tmp1_.ccy_grp = tmp1_.apply(lambda x: func1(x.ccy_grp, x.ccy), axis=1)
But nope, it doesn't work: the code can't seem to detect the NaN values. I tried another way with np.isnan(tkn1), but I just get a boolean error message...
Does any experienced pandas developer know what's wrong?
Use pandas.isna to detect whether a value is NaN.
Generate data:
import pandas as pd
import numpy as np
data = pd.DataFrame({'value': [np.nan, None, 1, 2, 3],
                     'label': ['str: np.nan', 'str: None', 'str: 1', 'str: 2', 'str: 3']})
data
Create a function:
def func1(x):
    if pd.isna(x):
        return 'is a na'
    else:
        return f'{x}'
Apply the function to the data:
data['func1_result'] = data['value'].apply(func1)
data
There is a pandas method for what you are trying to do. Check out combine_first:
Update null elements with value in the same location in ‘other’.
Combine two Series objects by filling null values in one Series with
non-null values from the other Series.
tmp1_.ccy_grp = tmp1_.ccy_grp.combine_first(tmp1_.ccy)
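For example, a minimal sketch with made-up data (the frame and the currency values here are just for illustration; the column names come from your question):
import pandas as pd
import numpy as np

tmp1_ = pd.DataFrame({'ccy_grp': ['EUR', np.nan, 'USD', np.nan],
                      'ccy':     ['EUR', 'GBP', 'USD', 'JPY']})

# NaN rows of ccy_grp are filled from ccy; non-null values are kept as-is
tmp1_.ccy_grp = tmp1_.ccy_grp.combine_first(tmp1_.ccy)
print(tmp1_)
#   ccy_grp  ccy
# 0     EUR  EUR
# 1     GBP  GBP
# 2     USD  USD
# 3     JPY  JPY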
This looks like it should be a pandas mask/where/fillna problem, not an apply:
Given:
value values2
0 NaN 0.0
1 NaN 0.0
2 1.0 1.0
3 2.0 2.0
4 3.0 3.0
Doing:
# assign back rather than using inplace=True on a column, which is deprecated
df['value'] = df.value.fillna(df.values2)
print(df)
# or
df['value'] = df.value.mask(df.value.isna(), df.values2)
print(df)
# or
df['value'] = df.value.where(df.value.notna(), df.values2)
print(df)
Output:
value values2
0 0.0 0.0
1 0.0 0.0
2 1.0 1.0
3 2.0 2.0
4 3.0 3.0
I need to create a dataframe with two columns: a variable, and a function of that variable. The following code raises an error:
test = pd.DataFrame({'Column_1': pd.Series([], dtype='int'),
                     'Column_2': pd.Series([], dtype='float')})

for i in range(1, 30):
    k = 0.5**i
    test.append(i, k)

print(test)
TypeError: cannot concatenate object of type '<class 'int'>'; only Series and DataFrame objs are valid
What do I need to fix here? It looks like the answer should be easy, but I can't find it...
Many thanks for your help.
Is there a specific reason you are trying to use a loop? You can create the df with Column_1 and use pandas' vectorized operations to create Column_2:
import numpy as np

df = pd.DataFrame(np.arange(1, 30), columns=['Column_1'])
df['Column_2'] = 0.5**df['Column_1']
Column_1 Column_2
0 1 0.50000
1 2 0.25000
2 3 0.12500
3 4 0.06250
4 5 0.03125
I like Vaishali's way of approaching it. If you really want to use the for loop, this is how I would have done it:
import pandas as pd

test = pd.DataFrame({'Column_1': pd.Series([], dtype='int'),
                     'Column_2': pd.Series([], dtype='float')})

for i in range(1, 30):
    # note: DataFrame.append was removed in pandas 2.0; this requires pandas < 2.0
    test = test.append({'Column_1': i, 'Column_2': 0.5**i}, ignore_index=True)

test = test.round(5)
print(test)
Column_1 Column_2
0 1.0 0.50000
1 2.0 0.25000
2 3.0 0.12500
3 4.0 0.06250
4 5.0 0.03125
5 6.0 0.01562
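On pandas 2.0+, where DataFrame.append is gone, an equivalent rewrite (mine, not the answer's) collects the rows first and builds the frame once:
import pandas as pd

# build all rows up front, then construct the DataFrame in one go
rows = [{'Column_1': i, 'Column_2': 0.5**i} for i in range(1, 30)]
test = pd.DataFrame(rows).round(5)
print(test)  # note: Column_1 stays int here rather than being upcast to float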
I am applying an inner join in a for loop on another dataset, and now I need to remove the rows that are already part of the inner join. I went with DataFrame.isin(another_df), but it is not giving me the expected results. I checked the column names and their data types, and they are all the same. Can someone help me with this, please?
In the following code, isin is where I check between the two data frames, but I still get the same set of rows back even though they have the same number of rows and columns.
Note: I'm dropping an extra column in the isin call, as it is only present in one of the dataframes.
My code looks like this:
df = pd.DataFrame(columns=override.columns)

for i in list1:
    join_value = tuple(i)
    i.append('creditor_tier_interim')
    subset_df = override.merge(criteria[i].dropna(), on=list(join_value), how='inner')
    subset_df['PRE_CHARGEOFF_FLAG'] = pd.to_numeric(subset_df.PRE_CHARGEOFF_FLAG)
    override = override[~override.isin(subset_df.drop(columns='creditor_tier_interim'))].dropna(how='all')
    print('The override shape would be:', override.shape)
    df = df.append(subset_df)

df = df.append(override)
It sounds as if you have a 'left' and a 'right' DataFrame, and you're looking for the records that are exclusively in one or the other. The below returns the rows that are exclusively in either the right or the left DataFrame.
import pandas as pd
import numpy as np
from pandas import DataFrame, Series
dataframe_left = DataFrame(np.random.randn(25).reshape(5, 5),
                           columns=['A', 'B', 'C', 'D', 'E'], index=np.arange(5))
dataframe_right = DataFrame(np.random.randn(25).reshape(5, 5),
                            columns=['A', 'B', 'C', 'D', 'E'], index=np.arange(5))

insert_left = DataFrame(np.arange(5).reshape(1, 5),
                        columns=['A', 'B', 'C', 'D', 'E'], index=[7])
insert_right = DataFrame(np.arange(5).reshape(1, 5),
                         columns=['A', 'B', 'C', 'D', 'E'], index=[6])

# DataFrame.append was removed in pandas 2.0; pd.concat is the current equivalent
dataframe_right = pd.concat([dataframe_right, insert_right])
dataframe_left = pd.concat([dataframe_left, insert_left])
The code above produces this output:
Left Table:

          A         B         C         D         E
0 -0.324009  1.044155 -0.236404  0.546677 -0.212369
1 -0.042634 -0.485549 -1.558428  1.243852 -0.310872
2  0.698258 -0.423792  1.162509 -3.378898  1.055012
3  0.377434  0.640258 -0.278752  0.310718  0.344995
4 -0.133650  0.367977 -2.019671  1.286003 -0.496747
7  0.000000  1.000000  2.000000  3.000000  4.000000
Right Table:

          A         B         C         D         E
0 -0.099467 -0.033789 -0.411787  0.219765 -0.702053
1 -2.993618  0.424434 -0.168158 -0.508054 -0.294833
2 -0.656731 -1.221240 -1.260467  0.444725 -0.456297
3 -0.002770  0.132377 -0.110740 -0.359616  1.989453
4  0.517090 -1.169461  0.292387 -0.389125 -0.879307
6  0.000000  1.000000  2.000000  3.000000  4.000000
After setting up the test dataframes we can join the two and filter for the rows we're interested in:
tmp = pd.merge(
    left=dataframe_left,
    right=dataframe_right,
    right_index=True,
    left_index=True,
    how='outer',
    suffixes=['_left', '_right'],
    indicator=True
)

tmp[tmp._merge.isin(['right_only', 'left_only'])]
This produces the result below:
   A_left  B_left  C_left  D_left  E_left  A_right  B_right  C_right  D_right  E_right      _merge
6     NaN     NaN     NaN     NaN     NaN      0.0      1.0      2.0      3.0      4.0  right_only
7     0.0     1.0     2.0     3.0     4.0      NaN      NaN      NaN      NaN      NaN   left_only
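To answer the original question's "remove the rows that are already matched", you can also pull out just the left-only rows and their left-side columns; a small sketch under the same setup (the column filtering and renaming is my own addition):
# keep only rows found exclusively in the left frame, and only its columns
left_only = tmp.loc[tmp._merge == 'left_only',
                    [c for c in tmp.columns if c.endswith('_left')]]
# restore the original column names (str.removesuffix needs Python 3.9+)
left_only.columns = [c.removesuffix('_left') for c in left_only.columns]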
I have a dataframe:
import pandas as pd

df = pd.DataFrame({'cmplxnumbers': [1 + 1j, 2 - 2j, 3*(1 + 1j)]})
I need to get the imaginary parts of the numbers in the column.
I do it by:
df.cmplxnumbers.apply(lambda number: number.imag)
I get as a result:
0 1.0
1 -2.0
2 3.0
Name: cmplxnumbers, dtype: float64
Which is as expected.
Is there any quicker, more straightforward method, perhaps not involving the lambda function?
Pandas DataFrame/Series build on top of NumPy arrays, so they can be passed to most NumPy functions.
In this case, you can try the following, which should be faster than the non-vectorized .apply:
import numpy as np

df['imag'] = np.imag(df.cmplxnumbers)
df['real'] = np.real(df.cmplxnumbers)
Output:
cmplxnumbers imag real
0 1.000000+1.000000j 1.0 1.0
1 2.000000-2.000000j -2.0 2.0
2 3.000000+3.000000j 3.0 3.0
Or you can use agg:
df[['real','imag']] = df.cmplxnumbers.agg([np.real, np.imag])
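A third option (my addition, not from the answer above) is to drop to the underlying NumPy array, whose .real and .imag attributes are vectorized as well:
# operate on the complex ndarray behind the Series directly
arr = df.cmplxnumbers.to_numpy()
df['imag'] = arr.imag
df['real'] = arr.real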
I have a dataframe with 2 columns:
df = pd.DataFrame({'values': arrays, 'ii': lin_index})
I want to group the values by lin_index and get the mean per group and the most common value per group.
I tried this:
bii = df.groupby('ii').median()
bii2 = df.groupby('ii').agg(lambda x: x.value_counts().index[0])
bii3 = df.groupby('ii')['values'].agg(pd.Series.mode)
I wonder if bii2 and bii3 return the same values.
Then I want to map the mean and most common value back to the original array:
bs = np.zeros((np.unique(array).shape[0], 1))
bs[bii.index.values] = bii.values
Does this look good?
df looks like:
values ii
0 1.0 10446786
1 1.0 11316289
2 1.0 16416704
3 1.0 12151686
4 1.0 30312736
... ...
93071038 3.0 28539525
93071039 3.0 19667948
93071040 3.0 22240849
93071041 3.0 22212513
93071042 3.0 41641943
[93071043 rows x 2 columns]
Something like this, maybe:
# get the mean
df.groupby(['ii']).mean()
# get the most frequent
df.groupby(['ii']).agg(pd.Series.mode)
Your question seems similar to GroupBy pandas DataFrame and select most common value.
This link might also be useful: https://pandas.pydata.org/pandas-docs/stable/reference/frame.html#computations-descriptive-stats
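If you also want the group statistics aligned back onto the original 93M rows rather than one row per group, groupby().transform keeps the original index, so the result can be assigned straight back. A minimal sketch (column names taken from your df; the lambda-based mode is my own choice and can be slow on very large frames):
# mean per group, broadcast back onto every original row
df['group_mean'] = df.groupby('ii')['values'].transform('mean')

# most frequent value per group; mode()[0] picks the first mode on ties
df['group_mode'] = df.groupby('ii')['values'].transform(lambda x: x.mode()[0])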
I'd like to set the value of a column based on a query. I could probably use .where to accomplish this, but the criteria for .query are strings, which are easier for me to maintain, especially when the criteria become complex.
import numpy as np
import pandas as pd
np.random.seed(51723)
n = 10  # n was not defined in the original snippet; 10 matches the output below
df = pd.DataFrame(np.random.rand(n, 3), columns=list('abc'))
I'd like to make a new column, d, and set the value to 1 where these criteria are met:
criteria = '(a < b) & (b < c)'
Among other things, I've tried:
df['d'] = np.nan
df.query(criteria).loc[:,'d'] = 1
But that seems to do nothing except give a SettingWithCopyWarning, even though I'm using .loc.
And passing inplace like this:
df.query(criteria, inplace=True).loc[:,'d'] = 1
Gives AttributeError: 'NoneType' object has no attribute 'loc'
AFAIK, df.query() returns a new DataFrame, so try the following approach:
In [146]: df.loc[df.eval(criteria), 'd'] = 1
In [147]: df
Out[147]:
a b c d
0 0.175155 0.221811 0.808175 1.0
1 0.069033 0.484528 0.841618 1.0
2 0.174685 0.648299 0.904037 1.0
3 0.292404 0.423220 0.897146 1.0
4 0.169869 0.395967 0.590083 1.0
5 0.574394 0.804917 0.746797 NaN
6 0.642173 0.252437 0.847172 NaN
7 0.073629 0.821715 0.859776 1.0
8 0.999789 0.833708 0.230418 NaN
9 0.028163 0.666961 0.582713 NaN
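As a variant (my assumption, not part of the answer above), the same boolean mask from df.eval can be combined with np.where to build the column in a single assignment:
# 1 where the criteria hold, NaN elsewhere
df['d'] = np.where(df.eval(criteria), 1, np.nan)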