.replace codes will not replace column with new column in python - python

I am trying to read a column in python, and create a new column using python.
import pandas as pd
df = pd.read_csv (r'C:\Users\User\Documents\Research\seqadv.csv')
print (df)
df = pd.DataFrame(data={'WT_RESIDUE':['']})
codes = {'ALA':'A', 'ARG':'R', 'ASN':'N', 'ASP':'D', 'CYS':'C', 'GLU':'E', 'GLN':'Q', 'GLY':'G', 'HIS':'H', 'ILE':'I', 'LEU':'L', 'LYS':'K', 'MET':'M', 'PHE':'F', 'PRO':'P', 'SER':'S', 'THR':'T', 'TRP':'W', 'TYR':'Y', 'VAL':'V'}
df['MUTATION_CODE'] = df['WT_RESIDUE'].replace(codes)
df.to_csv (r'C:\Users\User\Documents\Research\output.csv')
I tried this, but it will not create a new column no matter what I do.
example

It seems like you made a silly mistake
import pandas as pd
df = pd.read_csv (r'C:\Users\User\Documents\Research\seqadv.csv')
print (df)
df = pd.DataFrame(data={'WT_RESIDUE':['']}) # Why do you have this line?
codes = {'ALA':'A', 'ARG':'R', 'ASN':'N', 'ASP':'D', 'CYS':'C', 'GLU':'E', 'GLN':'Q', 'GLY':'G', 'HIS':'H', 'ILE':'I', 'LEU':'L', 'LYS':'K', 'MET':'M', 'PHE':'F', 'PRO':'P', 'SER':'S', 'THR':'T', 'TRP':'W', 'TYR':'Y', 'VAL':'V'}
df['MUTATION_CODE'] = df['WT_RESIDUE'].replace(codes)
df.to_csv (r'C:\Users\User\Documents\Research\output.csv')
Try removing the line with the comment. AFAIK, it is reinitializing your DataFrame and thus the WT_RESIDUE column becomes empty.

Considering sample from provided input.
We can use map function to map the keys of dict to existing column and persist corresponding values in new column.
df = pd.DataFrame({
'WT_RESIDUE':['ALA', "REMARK", 'VAL', "LYS"]
})
codes = {'ALA':'A', 'ARG':'R', 'ASN':'N', 'ASP':'D', 'CYS':'C', 'GLU':'E', 'GLN':'Q', 'GLY':'G', 'HIS':'H', 'ILE':'I', 'LEU':'L', 'LYS':'K', 'MET':'M', 'PHE':'F', 'PRO':'P', 'SER':'S', 'THR':'T', 'TRP':'W', 'TYR':'Y', 'VAL':'V'}
df['MUTATION_CODE'] = df.WT_RESIDUE.map(codes)
Input
WT_RESIDUE
0 ALA
1 REMARK
2 VAL
3 LYS
Output
WT_RESIDUE MUTATION_CODE
0 ALA A
1 REMARK NaN
2 VAL V
3 LYS K

Related

Combining Successive Pandas Dataframes in One Master Dataframe via a Loop

I'm trying to loop through a series of tickers cleaning the associated dataframes then combining the individual ticker dataframes into one large dataframe with columns named for each ticker. The following code enables me to loop through unique tickers and name the columns of each ticker's dataframe after the specific ticker:
import pandas as pd
def clean_func(tkr,f1):
f1['Date'] = pd.to_datetime(f1['Date'])
f1.index = f1['Date']
keep = ['Col1','Col2']
f2 = f1[keep]
f2.columns = [tkr+'Col1',tkr+'Col2']
return f2
tkrs = ['tkr1','tkr2','tkr3']
for tkr in tkrs:
df1 = pd.read_csv(f'C:\\path\\{tkr}.csv')
df2 = clean_func(tkr,df1)
However, I don't know how to create a master dataframe where I add each new ticker to the master dataframe. With that in mind, I'd like to align each new ticker's data using the datetime index. So, if tkr1 has data for 6/25/22, 6/26/22, 6/27/22, and tkr2 has data for 6/26/22, and 6/27/22, the combined dataframe would show all three dates but would produce a NaN for ticker 2 on 6/25/22 since there is no data for that ticker on that date.
When not in a loop looking to append each successive ticker to a larger dataframe (as per above), the following code does what I'd like. But it doesn't work when looping and adding new ticker data for each successive loop (or I don't know how to make it work in the confines of a loop).
combined = pd.concat((df1, df2, df3,...,dfn), axis=1)
Many thanks in advance.
You should only create the master DataFrame after the loop. Appending to the master DataFrame in each iteration via pandas.concat is slow since you are creating a new DataFrame every time.
Instead, read each ticker DataFrame, clean it, and append it to a list which store every ticker DataFrames. After the loop create the master DataFrame with all the Dataframes using pandas.concat:
import pandas as pd
def clean_func(tkr,f1):
f1['Date'] = pd.to_datetime(f1['Date'])
f1.index = f1['Date']
keep = ['Col1','Col2']
f2 = f1[keep]
f2.columns = [tkr+'Col1',tkr+'Col2']
return f2
tkrs = ['tkr1','tkr2','tkr3']
dfs_list = []
for tkr in tkrs:
df1 = pd.read_csv(f'C:\\path\\{tkr}.csv')
df2 = clean_func(tkr,df1)
dfs_list.append(df2)
master_df = pd.concat(dfs_list, axis=1)
As a suggestion here is a cleaner way of defining your clean_func using DataFrame.set_index and DataFrame.add_prefix.
def clean_func(tkr, f1):
f1['Date'] = pd.to_datetime(f1['Date'])
f2 = f1.set_index('Date')[['Col1','Col2']].add_prefix(tkr)
return f2
Or if you want, you can parse the Date column as datetime and set it as index directly in the pd.read_csv call by specifying index_col and parse_dates parameters (honestly, I'm not sure if those two parameters will play well together, and I'm too lazy to test it, but you can try ;)).
import pandas as pd
def clean_func(tkr,f1):
f2 = f1[['Col1','Col2']].add_prefix(tkr)
return f2
tkrs = ['tkr1','tkr2','tkr3']
dfs_list = []
for tkr in tkrs:
df1 = pd.read_csv(f'C:\\path\\{tkr}.csv', index_col='Date', parse_dates=['Date'])
df2 = clean_func(tkr,df1)
dfs_list.append(df2)
master_df = pd.concat(dfs_list, axis=1)
Before the loop create an empty df with:
combined = pd.DataFrame()
Then within the loop (after loading df1 - see code above):
combined = pd.concat((combined, clean_func(tkr, df1)), axis=1)
If you get:
TypeError: concat() got multiple values for argument 'axis'
Make sure your parentheses are correct per above.
With the code above, you can skip the original step:
df2 = clean_func(tkr,df1)
Since it is embedded in the concat function. Alternatively, you could keep the df2 step and use:
combined = pd.concat((combined,df2), axis=1)
Just make sure the dataframes are encapsulated by parentheses within the concat function.
Same answer as GC123 but here is a full example which mimics reading from separate files and concatenating them
import pandas as pd
import io
fake_file_1 = io.StringIO("""
fruit,store,quantity,unit_price
apple,fancy-grocers,2,9.25
pear,fancy-grocers,3,100
banana,fancy-grocers,1,256
""")
fake_file_2 = io.StringIO("""
fruit,store,quantity,unit_price
banana,bargain-grocers,667,0.01
apple,bargain-grocers,170,0.15
pear,bargain-grocers,281,0.45
""")
fake_files = [fake_file_1,fake_file_2]
combined = pd.DataFrame()
for fake_file in fake_files:
df = pd.read_csv(fake_file)
df = df.set_index('fruit')
combined = pd.concat((combined, df), axis=1)
print(combined)
Output
This method is slightly more efficient:
combined = []
for fake_file in fake_files:
combined.append(pd.read_csv(fake_file).set_index('fruit'))
combined = pd.concat(combined, axis=1)
print(combined)
Output:
store quantity unit_price store quantity unit_price
fruit
apple fancy-grocers 2 9.25 bargain-grocers 170 0.15
pear fancy-grocers 3 100.00 bargain-grocers 281 0.45
banana fancy-grocers 1 256.00 bargain-grocers 667 0.01

How to save values in pandas dataframe after editing some values

I have a dataframe which looks like this (It contains dummy data) -
I want to remove the text which occurs after "_________" identifier in each of the cells. I have written the code as follows (Logic: Adding a new column containing NaN and saving the edited values in that column) -
import pandas as pd
import numpy as np
df = pd.read_excel(r'Desktop\Trial.xlsx')
NaN = np.nan
df["Body2"] = NaN
substring = "____________"
for index, row in df.iterrows():
if substring in row["Body"]:
split_string = row["Body"].split(substring,1)
row["Body2"] = split_string[0]
print(df)
But the Body2 column still displays NaN and not the edited values.
Any help would be much appreciated!
`for index, row in df.iterrows():
if substring in row["Body"]:
split_string = row["Body"].split(substring,1)
#row["Body2"] = split_string[0] # instead use below line
df.at[index,'Body2'] = split_string[0]`
Make use of at to modify the value
Instead of iterating through the rows, do the operation on all rows at once. You can use expand to split the values into multiple columns, which I think is what you want.
substring = "____________"
df = pd.DataFrame({'Body': ['a____________b', 'c____________d', 'e____________f', 'gh']})
df[['Body1', 'Body2']] = df['Body'].str.split(substring, expand=True)
print(df)
# Body Body1 Body2
# 0 a____________b a b
# 1 c____________d c d
# 2 e____________f e f
# 3 gh gh None

Need help using pandas to read a column and print a new column in .csv file

I am trying to use pandas to read a column in an excel file and print a new column using my input. I am trying to convert 3-letter code to 1-letter code. So far, I've written this code, but when I run it, it will not print anything in the last column.
import pandas as pd
df = pd.read_csv (r'C:\Users\User\Documents\Research\seqadv.csv')
print (df)
codes = []
for i in df['WT_RESIDUE']:
if i == 'ALA':
codes.append('A')
if i == 'ARG':
codes.append('R')
if i == 'ASN':
codes.append('N')
if i == 'ASP':
codes.append('D')
if i == 'CYS':
codes.append('C')
if i == 'GLU':
codes.append('E')
print (codes)
codes = df ['MUTATION_CODE']
df.to_csv(r'C:\Users\User\Documents\Research\seqadv3.csv')
The way to do this is to define a dictionary with your replacement values, and then use either map() or replace() on your existing column to create your new column. The difference between the two is that
replace() will not change values not in the dictionary keys
map() will replace any values not in the dictionary keys with the dictionary's default value (if it has one) or with NaN (if the dictionary doesn't have a default value)
df = pd.DataFrame(data={'WT_RESIDUE':['ALA', 'REMARK', 'VAL', 'CYS', 'GLU']})
codes = {'ALA':'A', 'ARG':'R', 'ASN':'N', 'ASP':'D', 'CYS':'C', 'GLU':'E'}
df['code_m'] = df['WT_RESIDUE'].map(codes)
df['code_r'] = df['WT_RESIDUE'].replace(codes)
In: df
Out:
WT_RESIDUE code_m code_r
0 ALA A A
1 REMARK NaN REMARK
2 VAL NaN VAL
3 CYS C C
4 GLU E E
More detailed information is here: Remap values in pandas column with a dict
Write:
df['MUTATION_CODE'] = codes

Splitting text and numbers in dataframe in python

I have a dataframe df with column name 'col' as the second column and the data looks like:
Dataframe
Want to separate text part in one column with name "Casing Size" and numerical part with "DepthTo" in other column.
Desired Output
import pandas as pd
import io
from google.colab import files
uploaded = files.upload()
df = pd.read_excel(io.BytesIO(uploaded['Test-Checking.xlsx']))
#Method 1
df2 = pd.DataFrame(data=df, columns=['col'])
df2 = df2.col.str.extract('([a-zA-Z]+)([^a-zA-Z]+)', expand=True)
df2.columns = ['CasingSize', 'DepthTo']
df2
#Method 2
def split_col(x):
try:
numb = float(x.split()[0])
txt = x.split()[1]
except:
numb = float(x.split()[1])
txt = x.split()[0]
x['col1'] = txt
x['col2'] = numb
df2['col1'] = df.col.apply(split_col)
df2
Tried two methods but none of them work correctly. Is there anyone help me?
Code in Google Colab
Excel File Attached
Try this
first you need to return the the values from your functions. then you can unpack them into your columns using the to_list()
def sample(x):
b,y=x.split()
return b,y
temp_df=df2['col'].apply(sample)
df2[['col1','col2']]=pd.DataFrame(temp_df.tolist())
You could try splitting the values into a list, then sorting them so that the numerical part comes first. Then you could apply pd.Series and assign back to the two columns.
import pandas as pd
df = pd.DataFrame({'col':["PWT 69.2", '283.5 HWT', '62.9 PWT', '284 HWT']})
df[['Casing Size','DepthTO']] = df['col'].str.split().apply(lambda x: sorted(x)).apply(pd.Series)
print(df)
Output
col Casing Size DepthTO
0 PWT 69.2 69.2 PWT
1 283.5 HWT 283.5 HWT
2 62.9 PWT 62.9 PWT
3 284 HWT 284 HWT

How to compare columns of two dataframes and have consequences when they match in Python Pandas

I am trying to have Python Pandas compare two dataframes with each other. In dataframe 1, i have two columns (AC-Cat and Origin). I am trying to compare the AC-Cat column with the inputs of Dataframe 2. If a match is found between one of the columns of Dataframe 2 and the value of dataframe 1 being studied, i want Pandas to copy the header of the column of Dataframe 2 in which the match is found to a new column in Dataframe 1.
DF1:
f = {'AC-Cat': pd.Series(['B737', 'A320', 'MD11']),
'Origin': pd.Series(['AJD', 'JFK', 'LRO'])}
Flight_df = pd.DataFrame(f)
DF2:
w = {'CAT-C': pd.Series(['DC85', 'IL76', 'MD11', 'TU22', 'TU95']),
'CAT-D': pd.Series(['A320', 'A321', 'AN12', 'B736', 'B737'])}
WCat_df = pd.DataFrame(w)
I imported pandas as pd and numpy as np and tried to define a function to compare these columns.
def get_wake_cat(AC_cat):
try:
Wcat = [WCat_df.columns.values[0]][WCat_df.iloc[:,1]==AC_cat].values[0]
except:
Wcat = np.NAN
return Wcat
Flight_df.loc[:,'CAT'] = Flight_df.loc[:,'AC-Cat'].apply(lambda CT: get_wake_cat(CT))
However, the function does not result in the desired outputs. For example: Take the B737 AC-Cat value. I want Python Pandas to then find this value in DF2 in the column CAT-D and copy this header to the new column of DF 1. This does not happen. Can someone help me find out why my code is not giving the desired results?
Not pretty but I think I got it working. Part of the error was that the function did not have WCat_df. I also changed the indexing into two steps:
def get_wake_cat(AC_cat, WCat_df):
try:
d=WCat_df[WCat_df.columns.values][WCat_df.iloc[:]==AC_cat]
Wcat=d.columns[(d==AC_cat).any()][0]
except:
Wcat = np.NAN
return Wcat
Then you need to change your next line to:
Flight_df.loc[:,'CAT'] = Flight_df.loc[:,'AC-Cat'].apply(lambda CT: get_wake_cat(CT,WCat_df ))
AC-Cat Origin CAT
0 B737 AJD CAT-D
1 A320 JFK CAT-D
2 MD11 LRO CAT-C
Hope that solves the problem
This will give you 2 new columns with the name\s of the match\s found:
Flight_df['CAT1'] = Flight_df['AC-Cat'].map(lambda x: 'CAT-C' if x in list(WCat_df['CAT-C']) else '')
Flight_df['CAT2'] = Flight_df['AC-Cat'].map(lambda x: 'CAT-D' if x in list(WCat_df['CAT-D']) else '')
Flight_df.loc[Flight_df['CAT1'] == '', 'CAT1'] = Flight_df['CAT2']
Flight_df.loc[Flight_df['CAT1'] == Flight_df['CAT2'], 'CAT2'] = ''
IUC, you can do a stack and merge:
final=(Flight_df.merge(WCat_df.stack().reset_index(1,name='AC-Cat'),on='AC-Cat',how='left')
.rename(columns={'level_1':'New'}))
print(final)
Or with melt:
final=Flight_df.merge(WCat_df.melt(var_name='New',value_name='AC-Cat'),
on='AC-Cat',how='left')
AC-Cat Origin New
0 B737 AJD CAT-D
1 A320 JFK CAT-D
2 MD11 LRO CAT-C

Categories