Python pandas data frame clean with dictionary of regular expressions - python

I want to clean a pandas data frame using a dictionary of regular expressions representing allowed data entry formats.
I'm trying to iterate over the input data frame to check every row against the allowed data entry format for a given column.
If an entry doesn't meet the format allowed for the column, I want to replace it with NaN (see desired output below).
My current code gives me an error message: 'DataFrame' object has no attribute 'col'.
My MWE features two representative regular expressions, but for my actual data set I've got ~40.
Thanks for any help!
# Packages
import pandas as pd
import re
import numpy as np
# Input data frame
data = {'score': [71, 72, 55, 'a'],
        'bet': [0.260, 0.380, '0.8dd', 0.260]}
df1 = pd.DataFrame(data, columns=['score', 'bet'])
# Input dictionary (raw strings avoid invalid-escape warnings)
dict1 = {'score': r'^\d+$',
         'bet': r'^\d[\.]\d+$'}
# Cleaning function
def cleaner(df, dict):
    for col in df.columns:
        if col in dict:
            for row in df.col:
                if re.match(dict[col], str(row)):
                    row = row
                else:
                    row = np.nan
    return(df)

cleaned_df = cleaner(df1, dict1)
# ERROR MESSAGE
# 'DataFrame' object has no attribute 'col'
# Desired output
goal_data = {'score': [71, 72, 55, np.nan],
             'bet': [0.260, 0.380, np.nan, 0.260]}
goal_df = pd.DataFrame(goal_data, columns=['score', 'bet'])

There is a problem with your cleaning function: df.col looks up an attribute literally named "col" instead of the current column (hence the error), and reassigning the loop variable row never writes anything back to the data frame. Try running the following cleaner function in place of yours.
# Cleaning function
def cleaner(df, patterns):
    for col in df.columns:
        if col in patterns:
            for row in df.index:
                # Blank out entries that don't match the column's pattern.
                if not re.match(patterns[col], str(df.loc[row, col])):
                    df.loc[row, col] = np.nan
    return df

cleaned_df = cleaner(df1, dict1)
print(cleaned_df)

Try np.where(condition, value_if_true, value_if_false). Casting with astype(str) first makes .str.match safe on mixed-type columns:
import pandas as pd
import numpy as np
df1['score'] = np.where(df1['score'].astype(str).str.match(r'^\d+$'), df1['score'], np.nan)
df1['bet'] = np.where(df1['bet'].astype(str).str.match(r'^\d[\.]\d+$'), df1['bet'], np.nan)
score bet
0 71 0.26
1 72 0.38
2 55 NaN
3 NaN 0.26
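Since the question mentions ~40 regexes, the per-column np.where calls can be generalized by looping over the dictionary. This is a sketch (the cleaner function name and copy-before-mutate behavior are my own choices, not from the original answers); it casts each column to str so .str.match works on mixed-type entries:

```python
import numpy as np
import pandas as pd

def cleaner(df, patterns):
    # Replace any entry that does not fully match its column's
    # pattern with NaN; columns without a pattern are left alone.
    out = df.copy()
    for col, pat in patterns.items():
        if col in out.columns:
            mask = out[col].astype(str).str.match(pat)
            out[col] = out[col].where(mask, np.nan)
    return out

df1 = pd.DataFrame({'score': [71, 72, 55, 'a'],
                    'bet': [0.260, 0.380, '0.8dd', 0.260]})
patterns = {'score': r'^\d+$', 'bet': r'^\d\.\d+$'}
cleaned = cleaner(df1, patterns)
```

This scales to any number of columns by just extending the patterns dict.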

Related

Understanding initializing an empty dictionary

I really do not understand how the check if "entry" in langs_count can work when the dictionary was initialized to be empty. What is inside the dictionary, and how did it get there? I'm really confused.
import pandas as pd

# Import Twitter data as DataFrame: df
df = pd.read_csv("tweets.csv")

# Initialize an empty dictionary: langs_count
langs_count = {}

# Extract column from DataFrame: col
col = df['lang']

# Iterate over lang column in DataFrame
for entry in col:
    # If the language is in langs_count, add 1
    if entry in langs_count.keys():
        langs_count[entry] += 1
    # Else add the language to langs_count, set the value to 1
    else:
        langs_count[entry] = 1

# Print the populated dictionary
print(langs_count)
You can implement the count functionality using groupby.
import pandas as pd
# Import Twitter data as DataFrame: df
df = pd.read_csv("tweets.csv")
# Populate dictionary with count of occurrences in 'lang' column
langs_count = dict(df.groupby(['lang']).size())
# Print the populated dictionary
print(langs_count)
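The same tally can also be done with collections.Counter from the standard library; a minimal sketch, using made-up sample data in place of tweets.csv:

```python
from collections import Counter
import pandas as pd

# Sample data standing in for the 'lang' column of tweets.csv
df = pd.DataFrame({'lang': ['en', 'et', 'en', 'und', 'en']})

# Counter tallies occurrences in one pass
langs_count = dict(Counter(df['lang']))
print(langs_count)  # {'en': 3, 'et': 1, 'und': 1}
```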

Adding empty rows in Pandas dataframe

I'd like to insert empty rows at regular intervals in my dataframe.
I have the following code, which does what I want, but I'm struggling to adjust it to my needs:
s = pd.Series('', data_only_trades.columns)
f = lambda d: d.append(s, ignore_index=True)
set_rows = np.arange(len(data_only_trades)) // 4
empty_rows = data_only_trades.groupby(set_rows, group_keys=False).apply(f).reset_index(drop=True)
How can I adjust the code so it adds two or more rows instead of one?
How can I set a starting point (e.g. it should start with row 5 -- do I have to use .loc in arange then)?
I also tried the code below, but I struggled with setting the starting row and the values to blank (I got NaN):
df_new = pd.DataFrame()
for i, row in data_only_trades.iterrows():
    df_new = df_new.append(row)
    for _ in range(2):
        df_new = df_new.append(pd.Series(), ignore_index=True)
Thank you!
I think you can use NumPy:
import numpy as np
v = np.ndarray(shape=(numberOfRowsYouWant, df.values.shape[1]), dtype=object)
v[:] = ""
pd.DataFrame(np.vstack((df.values, v)))
But if you want to stick with your approach, simply convert the NaN values to "":
df.fillna("")
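Note that DataFrame.append was removed in pandas 2.0, so the append-based patterns above break on current versions. A pd.concat-based sketch that inserts two blank rows after every chunk of four (the function name and parameters are invented for illustration):

```python
import pandas as pd

def insert_blank_rows(df, every=4, n_blank=2):
    # A block of n_blank all-empty-string rows with the same columns
    blank = pd.DataFrame([[''] * len(df.columns)] * n_blank,
                         columns=df.columns)
    chunks = []
    for start in range(0, len(df), every):
        # One chunk of data, then the blank block
        chunks.append(df.iloc[start:start + every])
        chunks.append(blank)
    return pd.concat(chunks, ignore_index=True)

df = pd.DataFrame({'a': range(6), 'b': list('abcdef')})
out = insert_blank_rows(df)
```

Changing `every` sets the chunk size and `n_blank` sets how many empty rows go between chunks.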

.replace codes will not replace column with new column in python

I am trying to read a column with pandas and create a new column from it.
import pandas as pd
df = pd.read_csv (r'C:\Users\User\Documents\Research\seqadv.csv')
print (df)
df = pd.DataFrame(data={'WT_RESIDUE':['']})
codes = {'ALA':'A', 'ARG':'R', 'ASN':'N', 'ASP':'D', 'CYS':'C', 'GLU':'E', 'GLN':'Q', 'GLY':'G', 'HIS':'H', 'ILE':'I', 'LEU':'L', 'LYS':'K', 'MET':'M', 'PHE':'F', 'PRO':'P', 'SER':'S', 'THR':'T', 'TRP':'W', 'TYR':'Y', 'VAL':'V'}
df['MUTATION_CODE'] = df['WT_RESIDUE'].replace(codes)
df.to_csv (r'C:\Users\User\Documents\Research\output.csv')
I tried this, but it will not create a new column no matter what I do.
example
It seems like you made a silly mistake
import pandas as pd
df = pd.read_csv (r'C:\Users\User\Documents\Research\seqadv.csv')
print (df)
df = pd.DataFrame(data={'WT_RESIDUE':['']}) # Why do you have this line?
codes = {'ALA':'A', 'ARG':'R', 'ASN':'N', 'ASP':'D', 'CYS':'C', 'GLU':'E', 'GLN':'Q', 'GLY':'G', 'HIS':'H', 'ILE':'I', 'LEU':'L', 'LYS':'K', 'MET':'M', 'PHE':'F', 'PRO':'P', 'SER':'S', 'THR':'T', 'TRP':'W', 'TYR':'Y', 'VAL':'V'}
df['MUTATION_CODE'] = df['WT_RESIDUE'].replace(codes)
df.to_csv (r'C:\Users\User\Documents\Research\output.csv')
Try removing the line with the comment. AFAIK, it is reinitializing your DataFrame and thus the WT_RESIDUE column becomes empty.
Considering a sample from the provided input, we can use the map function to map the keys of the dict against the existing column and store the corresponding values in a new column.
df = pd.DataFrame({
    'WT_RESIDUE': ['ALA', 'REMARK', 'VAL', 'LYS']
})
codes = {'ALA':'A', 'ARG':'R', 'ASN':'N', 'ASP':'D', 'CYS':'C', 'GLU':'E', 'GLN':'Q', 'GLY':'G', 'HIS':'H', 'ILE':'I', 'LEU':'L', 'LYS':'K', 'MET':'M', 'PHE':'F', 'PRO':'P', 'SER':'S', 'THR':'T', 'TRP':'W', 'TYR':'Y', 'VAL':'V'}
df['MUTATION_CODE'] = df.WT_RESIDUE.map(codes)
Input
WT_RESIDUE
0 ALA
1 REMARK
2 VAL
3 LYS
Output
WT_RESIDUE MUTATION_CODE
0 ALA A
1 REMARK NaN
2 VAL V
3 LYS K

Need help using pandas to read a column and print a new column in .csv file

I am trying to use pandas to read a column in an excel file and print a new column using my input. I am trying to convert 3-letter code to 1-letter code. So far, I've written this code, but when I run it, it will not print anything in the last column.
import pandas as pd
df = pd.read_csv (r'C:\Users\User\Documents\Research\seqadv.csv')
print (df)
codes = []
for i in df['WT_RESIDUE']:
    if i == 'ALA':
        codes.append('A')
    if i == 'ARG':
        codes.append('R')
    if i == 'ASN':
        codes.append('N')
    if i == 'ASP':
        codes.append('D')
    if i == 'CYS':
        codes.append('C')
    if i == 'GLU':
        codes.append('E')
print(codes)
codes = df['MUTATION_CODE']
df.to_csv(r'C:\Users\User\Documents\Research\seqadv3.csv')
The way to do this is to define a dictionary with your replacement values, and then use either map() or replace() on the existing column to create the new column. The difference between the two is that
replace() leaves values that are not in the dictionary keys unchanged, while
map() turns any value not in the dictionary keys into NaN (or into the dictionary's default value, if it defines one, as a defaultdict does).
df = pd.DataFrame(data={'WT_RESIDUE':['ALA', 'REMARK', 'VAL', 'CYS', 'GLU']})
codes = {'ALA':'A', 'ARG':'R', 'ASN':'N', 'ASP':'D', 'CYS':'C', 'GLU':'E'}
df['code_m'] = df['WT_RESIDUE'].map(codes)
df['code_r'] = df['WT_RESIDUE'].replace(codes)
In: df
Out:
WT_RESIDUE code_m code_r
0 ALA A A
1 REMARK NaN REMARK
2 VAL NaN VAL
3 CYS C C
4 GLU E E
More detailed information is here: Remap values in pandas column with a dict
Also, your last assignment is reversed. To attach your list as the new column, write:
df['MUTATION_CODE'] = codes

Dataframe with arrays and key-pairs

I have a JSON structure which I need to convert into a data frame. I have converted it through the pandas library, but I am having issues with two columns: one holds an array and the other one holds a key-value pair.
Pito                      Value
{"pito-key": "Number"}    [{"WRITESTAMP": "2018-06-28T16:30:36Z", "S":"41bbc22","VALUE":"2"}]
How can I break these columns out into data-frame columns?
As far as I understood your question, you can apply regular expressions to do that.
import pandas as pd
import re

data = {'pito': ['{"pito-key": "Number"}'],
        'value': ['[{"WRITESTAMP": "2018-06-28T16:30:36Z", "S":"41bbc22","VALUE":"2"}]']}
df = pd.DataFrame(data)

def get_value(row):
    s = row['value']
    v = re.findall(r'VALUE\":\".*\"', s)
    return int(v[0][8:-1])

def get_pito(row):
    s = row['pito']
    v = re.findall(r'key\": \".*\"', s)
    return v[0][7:-1]

df['value'] = df.apply(get_value, axis=1)
df['pito'] = df.apply(get_pito, axis=1)
df.head()
Here I create two functions that transform your raw strings into the values you want them to have.
Let me know if that's not what you meant.
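Since both cells are actually valid JSON, an alternative to regular expressions is to parse them with the standard json module; a sketch using the same sample strings:

```python
import json
import pandas as pd

data = {'pito': ['{"pito-key": "Number"}'],
        'value': ['[{"WRITESTAMP": "2018-06-28T16:30:36Z", "S":"41bbc22","VALUE":"2"}]']}
df = pd.DataFrame(data)

# Parse each JSON string and pull out the field of interest
df['pito'] = df['pito'].map(lambda s: json.loads(s)['pito-key'])
df['value'] = df['value'].map(lambda s: int(json.loads(s)[0]['VALUE']))
```

This avoids the brittle slice offsets and keeps working if the key order or whitespace inside the strings changes.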
