Not able to solve one condition in Python code

I want to find, in the SOFTWARE column, which entry is the new software with respect to the VIN column.
For example, 'c5D2-14N450-CBQ' updates to 'c5D2-14N450-CBR' (for both softwares the column C value must be less than or equal to 10), so 'c5D2-14N450-CBR' is my new software.
Condition: the update only counts when the value of column C is less than or equal to 10.
Below is my data frame:
import pandas as pd

data = {'VIN': ['aaaa','aaaa','aaaa','aaaa','bbb','bbb','bbb','bbb','CCCC','CCCC','CCCC','CCCC'],
        'SOFTWARE': ['P8G2-14B570-PRC','c5D2-14N450-CBR','P8G2-14B570-PRA','c5D2-14N450-CBQ',
                     'K9A2-13V570-BAI','K9A2-13V570-BAH','K9A2-13V570-BAH','K9A2-13V570-BAH',
                     'J4E2-12K532-K7N','J4E2-12K532-K7O','J4E2-12K532-K7O','J4E2-12K532-K7N'],
        'C': [1,3,15,9,9,12,17,88,3,5,9,10]
        }
df = pd.DataFrame(data)
I tried the method below, but I am not getting what I expected:
df['RESULT'] = df.apply(lambda x: x['SOFTWARE'] if x['C'] >= 10 else (x['SOFTWARE']), axis=1)
df
I also tried masking:
import numpy as np
mask = df.groupby('VIN')['C'].diff().le(10)
df['Result'] = np.where(mask | mask.groupby(df['VIN']), 1, 0)
Below is my expected output:
data = {'VIN': ['aaaa','aaaa','aaaa','aaaa','bbb','bbb','bbb','bbb','CCCC','CCCC','CCCC','CCCC'],
        'SOFTWARE': ['P8G2-14B570-PRC','c5D2-14N450-CBR','P8G2-14B570-PRA','c5D2-14N450-CBQ',
                     'K9A2-13V570-BAI','K9A2-13V570-BAH','K9A2-13V570-BAH','K9A2-13V570-BAH',
                     'J4E2-12K532-K7N','J4E2-12K532-K7O','J4E2-12K532-K7O','J4E2-12K532-K7N'],
        'C': [1,3,15,9,9,12,17,88,3,5,9,10],
        'RESULT': ['old software','new software','old software','old software','old software','old software',
                   'old software','old software','old software','new software','new software','old software']
        }
df = pd.DataFrame(data)
print(df)

You should do the following:
import pandas as pd
import numpy as np

data = {'VIN': ['aaaa','aaaa','aaaa','aaaa','bbb','bbb','bbb','bbb','CCCC','CCCC','CCCC','CCCC'],
        'SOFTWARE': ['P8G2-14B570-PRC','c5D2-14N450-CBR','P8G2-14B570-PRA','c5D2-14N450-CBQ',
                     'K9A2-13V570-BAI','K9A2-13V570-BAH','K9A2-13V570-BAH','K9A2-13V570-BAH',
                     'J4E2-12K532-K7N','J4E2-12K532-K7O','J4E2-12K532-K7O','J4E2-12K532-K7N'],
        'C': [1,3,15,9,9,12,17,88,3,5,9,10]
        }
df = pd.DataFrame(data)

# The software "family" is everything before the final revision letter, e.g.
# 'c5D2-14N450-CBQ' and 'c5D2-14N450-CBR' both belong to 'c5D2-14N450-CB'.
df['FAMILY'] = df['SOFTWARE'].str[:-1]
g = df.groupby(['VIN', 'FAMILY'])

# A row holds the new software when it carries the highest revision letter in
# its family and every row of that family has C <= 10.
is_latest = df['SOFTWARE'].eq(g['SOFTWARE'].transform('max'))
family_ok = g['C'].transform('max').le(10)
df['RESULT'] = np.where(is_latest & family_ok, 'new software', 'old software')
df = df.drop(columns='FAMILY')
which returns your desired output:
VIN SOFTWARE C RESULT
0 aaaa P8G2-14B570-PRC 1 old software
1 aaaa c5D2-14N450-CBR 3 new software
2 aaaa P8G2-14B570-PRA 15 old software
3 aaaa c5D2-14N450-CBQ 9 old software
4 bbb K9A2-13V570-BAI 9 old software
5 bbb K9A2-13V570-BAH 12 old software
6 bbb K9A2-13V570-BAH 17 old software
7 bbb K9A2-13V570-BAH 88 old software
8 CCCC J4E2-12K532-K7N 3 old software
9 CCCC J4E2-12K532-K7O 5 new software
10 CCCC J4E2-12K532-K7O 9 new software
11 CCCC J4E2-12K532-K7N 10 old software
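As a quick sanity check you can diff the result against the expected frame with pandas' own testing helper — a minimal sketch, assuming expected is the DataFrame built from the expected-output dict in the question:
import pandas as pd

# raises an AssertionError pinpointing the first mismatching cell; silent on success
pd.testing.assert_frame_equal(df, expected)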

Related

Combine if statement with apply in python

New to Python. I am trying to figure out the best way to create a column based on other columns. Ideally, the code would look like this:
df['new'] = np.where(df['Country'] == 'CA', df['x'], df['y'])
I do not think this works because it thinks that I am calling the entire column. I tried to do the same thing with apply but was having trouble with the syntax:
df['my_col'] = df.apply(
    lambda row:
        if row.country == 'CA':
            row.my_col == row.x
        else:
            row.my_col == row.y
I feel like there must be an easier way.
Any of these three approaches (np.where, apply, mask) works:
df['where'] = np.where(df.country=='CA', df.x, df.y)
df['apply'] = df.apply(lambda row: row.x if row.country == 'CA' else row.y, axis=1)
mask = df.country=='CA'
df.loc[mask, 'mask'] = df.loc[mask, 'x']
df.loc[~mask, 'mask'] = df.loc[~mask, 'y']
Full test code:
import pandas as pd
import numpy as np
df = pd.DataFrame({'country':['CA','US','CA','UK','CA'], 'x':[1,2,3,4,5], 'y':[6,7,8,9,10]})
print(df)
df['where'] = np.where(df.country=='CA', df.x, df.y)
df['apply'] = df.apply(lambda row: row.x if row.country == 'CA' else row.y, axis=1)
mask = df.country=='CA'
df.loc[mask, 'mask'] = df.loc[mask, 'x']
df.loc[~mask, 'mask'] = df.loc[~mask, 'y']
print(df)
Input:
country x y
0 CA 1 6
1 US 2 7
2 CA 3 8
3 UK 4 9
4 CA 5 10
Output:
country x y where apply mask
0 CA 1 6 1 1 1.0
1 US 2 7 7 7 7.0
2 CA 3 8 3 3 3.0
3 UK 4 9 9 9 9.0
4 CA 5 10 5 5 5.0
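Note that the mask column comes out as float (1.0, 7.0, ...): the first .loc assignment creates the column with NaN in the non-matching rows, which forces a float dtype. One way to keep integers, as a sketch against the same df (mask_int is a new column name used just for illustration), is to pre-fill the column and then overwrite the matching rows:
# pre-fill with y so no NaN ever enters the column, then overwrite the CA rows
df['mask_int'] = df['y']
df.loc[df.country == 'CA', 'mask_int'] = df['x']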
This might also work for you:
import pandas as pd
import numpy as np

data = {
    'Country': ['CA', 'NY', 'NC', 'CA'],
    'x': ['x_column', 'x_column', 'x_column', 'x_column'],
    'y': ['y_column', 'y_column', 'y_column', 'y_column']
}
df = pd.DataFrame(data)

condition_list = [df['Country'] == 'CA']
choice_list = [df['x']]
df['new'] = np.select(condition_list, choice_list, df['y'])
df
Your np.where() looked fine, though, so I would double-check that your columns are labeled correctly.
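For instance, stray whitespace in a header is a common culprit — a quick check, as a sketch:
# list the exact labels; look for things like 'Country ' vs 'Country'
print(df.columns.tolist())
# normalize by stripping surrounding whitespace if needed
df.columns = df.columns.str.strip()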

How to replace multiple unknown values with a default value, except NaN, in a pandas DataFrame column

I have a dataframe. How can I replace the various unknown values in a column with a single default value, while leaving the NaN entries untouched?
df =
   S.No.  Columns_A
   1      python
   2      java
   3      NAN
   4      C++
   5      python , java
How do I get the updated data frame below?
df_updated =
   S.No.  Columns_A
   1      Good
   2      Good
   3      NAN
   4      Good
   5      Good
How about this:
import pandas as pd
import numpy as np

df = pd.DataFrame(data={
    'S.No.': [1, 2, 3, 4, 5],
    'ColumnA': ['python', 'java', np.nan, 'C++', 'python , java'],
})
df['ColumnA'] = df.apply(lambda row: np.nan if pd.isna(row['ColumnA']) else 'Good', axis=1)
result:
S.No. ColumnA
0 1 Good
1 2 Good
2 3 NaN
3 4 Good
4 5 Good
You can simply select the non-NaN values with .loc and set them to 'Good'. Targeting the column explicitly keeps any other columns intact:
df.loc[~df.ColumnA.isna(), 'ColumnA'] = 'Good'
import pandas as pd

df = pd.DataFrame(data={
    'ColumnA': ['python', 'java', None, 'C++', 'python , java'],
})
df.loc[~df.ColumnA.isna(), 'ColumnA'] = 'Good'
df
ColumnA
0 Good
1 Good
2 NaN
3 Good
4 Good
Use df.where; this should be faster than the other solutions. Series.where keeps the original value wherever the condition is True (here, the NaNs) and substitutes 'Good' everywhere else:
In [1443]: df['Columns_A'] = df['Columns_A'].where(df['Columns_A'].isna(), 'Good')
In [1444]: df
Out[1444]:
S.No. Columns_A
0 1 Good
1 2 Good
2 3 NaN
3 4 Good
4 5 Good
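Series.mask is the mirror image of Series.where — it replaces values where the condition is True — so the following sketch is equivalent:
# replace where the condition is True, i.e. everywhere the value is not NaN
df['Columns_A'] = df['Columns_A'].mask(df['Columns_A'].notna(), 'Good')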

Changing row names in dataframe

I have a dataframe, and one of the columns roughly looks like the one shown below. Is there any way to rename the rows? Rows should be renamed psPARP8, psEXOC8, psTMEM128, psCFHR3, and so on, where ps represents pseudogene and the term in brackets is the code for that pseudogene. I would highly appreciate it if anyone could write a Python function, or suggest an alternative, to perform this task.
d = {'gene_final': ["poly(ADP-ribose) polymerase family member 8 (PARP8) pseudogene",
                    "exocyst complex component 8 (EXOC8) pseudogene",
                    "transmembrane protein 128 (TMEM128) pseudogene",
                    "complement factor H related 3 (CFHR3) pseudogene",
                    "mitochondrially encoded NADH 4L dehydrogenase (MT-ND4L) pseudogene",
                    "relaxin family peptide/INSL5 receptor 4 (RXFP4 ) pseudogene",
                    "nasGBP7and GBP2"
                    ]}
df = pd.DataFrame(data=d)
The desired output should look like this
gene_final
-----------
psPARP8
psEXOC8
psTMEM128
psCFHR3
psMT-ND4L
psRXFP4
nasGBP2
import pandas as pd
import re

# build dataframe (the 'nasGBP7and GBP2' entry has no bracketed code, so it is left out here)
df = pd.DataFrame({'gene_final': ["poly(ADP-ribose) polymerase family member 8 (PARP8) pseudogene",
                                  "exocyst complex component 8 (EXOC8) pseudogene",
                                  "transmembrane protein 128 (TMEM128) pseudogene",
                                  "complement factor H related 3 (CFHR3) pseudogene",
                                  "mitochondrially encoded NADH 4L dehydrogenase (MT-ND4L) pseudogene",
                                  "relaxin family peptide/INSL5 receptor 4 (RXFP4 ) pseudogene"]})

def extract_name(s):
    """Helper function to extract the ps name."""
    s = re.findall(r"\s\((\S*)\s?\)", s)[0]  # find the word between ' (' and ')'
    return f"ps{s}"  # prefix it with ps

# apply extract_name() to each row
df['gene_final'] = df['gene_final'].apply(extract_name)
print(df)
> gene_final
> 0 psPARP8
> 1 psEXOC8
> 2 psTMEM128
> 3 psCFHR3
> 4 psMT-ND4L
> 5 psRXFP4
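A vectorized alternative, assuming the same bracket pattern holds for every row, is Series.str.extract:
# pull the bracketed code straight out, no Python-level loop
df['gene_final'] = 'ps' + df['gene_final'].str.extract(r"\s\((\S*)\s?\)", expand=False)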
I think you are asking about index names (rows).
This is how you change the row names in a DataFrame:
import pandas as pd

df = pd.DataFrame({'A': [11, 21, 31],
                   'B': [12, 22, 32],
                   'C': [13, 23, 33]},
                  index=['ONE', 'TWO', 'THREE'])
print(df)
and you can also change the row names after building the dataframe, like this:
df_new = df.rename(columns={'A': 'Col_1'}, index={'ONE': 'Row_1'})
print(df_new)
# Col_1 B C
# Row_1 11 12 13
# TWO 21 22 23
# THREE 31 32 33
print(df)
# A B C
# ONE 11 12 13
# TWO 21 22 23
# THREE 31 32 33
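Assigning to df.index directly also works when you want to replace every label at once — a quick sketch:
# wholesale replacement of the row labels (the list length must match the frame)
df.index = ['Row_1', 'Row_2', 'Row_3']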

Pandas merge dataframe with conditions depends on value in a column

Any help will be appreciated.
I have 2 DataFrames.
The first data frame is an activity schedule per person, schedule, as follows:
PersonID Person Origin Destination
3-1 1 A B
3-1 1 B A
13-1 1 C D
13-1 1 D C
13-2 2 A B
13-2 2 B A
And I have another DataFrame, household, containing the details of the person/agent.
PersonID1 Age1 Gender1 PersonID2 Age2 Gender2
3-1 20 M NaN NaN NaN
13-1 45 F 13-2 17 M
I want to perform a VLOOKUP on these two using pd.merge. Since the lookup (merge) depends on the person's ID, I tried to do that with a condition:
def merging(row):
    if row['Person'] == 1:
        row = pd.merge(row, household, how='left', left_on=['PersonID'], right_on=['Age1', 'Gender1'])
    else:
        row = pd.merge(row, household, how='left', left_on=['PersonID'], right_on=['Age2', 'Gender2'])
    return row

schedule_merged = schedule.apply(merging, axis=1)
However, for some reason, it just doesn't work. The error says ValueError: len(right_on) must equal len(left_on). I'm aiming for this kind of data in the end:
PersonID Person Origin Destination Age Gender
3-1 1 A B 20 M
3-1 1 B A 20 M
13-1 1 C D 45 F
13-1 1 D C 45 F
13-2 2 A B 17 M
13-2 2 B A 17 M
I think I messed up the pd.merge lines. While it might be more efficient to use VLOOKUP in Excel, that's just too heavy for my PC, since I have to apply this to a hundred thousand rows. How could I do this properly? Thanks!
This is how I would do it, provided the real dataset is no more complicated than the given example. Otherwise, I would suggest looking at pd.melt() for more complex unpivoting (see the pd.wide_to_long sketch after the output below).
import pandas as pd
import numpy as np
# Create Dummy schedule DataFrame
d = {'PersonID': ['3-1', '3-1', '13-1', '13-1', '13-2', '13-2'], 'Person': ['1', '1', '1', '1', '2', '2'], 'Origin': ['A', 'B', 'C', 'D', 'A', 'B'], 'Destination': ['B', 'A', 'D', 'C', 'B', 'A']}
schedule = pd.DataFrame(data=d)
schedule
# Create dummy household DataFrame
d = {'PersonID1': ['3-1', '13-1'], 'Age1': ['20', '45'], 'Gender1': ['M', 'F'], 'PersonID2': [np.nan, '13-2'], 'Age2': [np.nan, '17'], 'Gender2': [np.nan, 'M']}
household = pd.DataFrame(data=d)
household
# Select the columns for PersonID1 and rename them
household1 = household[['PersonID1', 'Age1', 'Gender1']].copy()
household1.columns = ['PersonID', 'Age', 'Gender']
# Select the columns for PersonID2 and rename them
household2 = household[['PersonID2', 'Age2', 'Gender2']].copy()
household2.columns = ['PersonID', 'Age', 'Gender']
# Concat them together, dropping the all-NaN slot of the one-person household
household_new = pd.concat([household1, household2]).dropna(subset=['PersonID'])
# Merge household and schedule together on PersonID
schedule = schedule.merge(household_new, how='left', on='PersonID', validate='many_to_one')
Output:
PersonID Person Origin Destination Age Gender
3-1 1 A B 20 M
3-1 1 B A 20 M
13-1 1 C D 45 F
13-1 1 D C 45 F
13-2 2 A B 17 M
13-2 2 B A 17 M
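If a household can have more members (PersonID3, Age3, Gender3, ...), pd.wide_to_long unpivots all the numbered column groups in one call — a minimal sketch, assuming the same household frame and that every member column follows this name-plus-number pattern:
import pandas as pd

# wide_to_long needs a row identifier, so materialize one from the index first
household_w = household.reset_index().rename(columns={'index': 'hh'})
household_long = pd.wide_to_long(household_w,
                                 stubnames=['PersonID', 'Age', 'Gender'],
                                 i='hh', j='member')
# drop the all-NaN member slots, then merge exactly as before
household_long = household_long.dropna(subset=['PersonID']).reset_index(drop=True)
schedule = schedule.merge(household_long, how='left', on='PersonID')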

Create several columns with default values in pandas

I have a dataframe of 1000 rows and 10 columns.
I want to add 20 columns, each containing a single value (what I call a default value).
My final df would therefore be 1000 rows and 30 columns.
I know that I can do it 20 times over:
df['column 11'] = 'default value'
df['column 12'] = 'default value 2'
But I would like to do it in a cleaner way. I have a dict of {'column label': 'default value'} pairs.
How can I do so? I've tried df.insert and pd.concat but couldn't find my way through.
Thanks, regards,
Eric
One way to do so:
df_len = len(df)
new_df = pd.DataFrame({col: [val] * df_len for col,val in your_dict.items()})
df = pd.concat((df,new_df), axis=1)
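One caveat with the snippet above: pd.concat(axis=1) aligns on the index, so if df does not carry a default RangeIndex the new rows will not line up. Building the new frame on df's own index avoids that (scalar values broadcast in the constructor) — a sketch:
# build the new columns directly on df's index so concat lines the rows up
new_df = pd.DataFrame(your_dict, index=df.index)
df = pd.concat((df, new_df), axis=1)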
Generally, if the keys in the dictionary (the new column names) can contain spaces, use the DataFrame constructor with DataFrame.join:
df = pd.DataFrame({'a':range(5)})
print (df)
a
0 0
1 1
2 2
3 3
4 4
d = {'A 11' : 's', 'A 12':'c'}
df = df.join(pd.DataFrame(d, index=df.index))
print (df)
a A 11 A 12
0 0 s c
1 1 s c
2 2 s c
3 3 s c
4 4 s c
If the column names contain no spaces and no leading digits (they need to be valid identifiers), it is possible to use DataFrame.assign:
d = {'A11' : 's', 'A12':'c'}
df = df.assign(**d)
print (df)
a A11 A12
0 0 s c
1 1 s c
2 2 s c
3 3 s c
4 4 s c
Another solution is to loop over the dictionary and assign:
for k, v in d.items():
    df[k] = v
