Data Cleaning with Pandas - python

I have a dataframe column consisting of text data and I need to filter it according to the following conditions:
If the character "M" is present in the string, it can only be at the n-2 position.
The n-1 position of the string always has to be a "D".
ex:
KFLL
KSDS
KMDK
MDDL
In this case, for example, I would have to remove the first string, since the character at the n-1 position is not a "D", and the last one, since the character "M" appears outside the n-2 position.
How can I apply this to a whole dataframe column?

Here's with a list comprehension:
l = ['KFLL', 'KSDS', 'KMDK', 'MDDL']
[x for x in l if ((('M' not in x) or (x[-3] == 'M')) and (x[-2] == 'D'))]
Output:
['KSDS', 'KMDK']

This does what you want. It could probably be written more compactly with a list comprehension, but at least this is readable. It assumes that the strings are all longer than 3 characters, otherwise you get an IndexError; in that case you need to add a try/except (a sketch of that guard follows the output below).
from collections import Counter
import pandas as pd

df = pd.DataFrame(data=list(["KFLL", "KSDS", "KMDK", "MDDL"]), columns=["code"])
print("original")
print(df)

mask = list()
for code in df["code"]:
    flag = False
    if code[-2] == "D":
        counter = Counter(list(code))
        if counter["M"] == 0 or (counter["M"] == 1 and code[-3] == "M"):
            flag = True
    mask.append(flag)

df["mask"] = mask
df2 = df[df["mask"]].copy()
df2.drop("mask", axis=1, inplace=True)
print("new")
print(df2)
Output looks like this
original
code
0 KFLL
1 KSDS
2 KMDK
3 MDDL
new
code
1 KSDS
2 KMDK
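A minimal sketch of the guard mentioned above, reusing df and Counter from the snippet and assuming strings shorter than 3 characters should simply be filtered out rather than raise:
mask = list()
for code in df["code"]:
    flag = False
    try:
        if code[-2] == "D":
            counter = Counter(list(code))
            if counter["M"] == 0 or (counter["M"] == 1 and code[-3] == "M"):
                flag = True
    except IndexError:
        # strings with fewer than 3 characters fail the positional checks
        flag = False
    mask.append(flag)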

Thank you all for your help.
I ended up implementing it like this:
l = {"Sequence": [ 'KFLL', 'KSDS', 'KMDK', 'MDDL', "MMMD"]}
df = pd.DataFrame(data= l)
print(df)
df = df[df.Sequence.str[-2] == 'D']
df = df[~df.Sequence.apply(lambda x: ("M" in x and x[-3]!='M') or x.count("M") >1 )]
print(df)
Output:
Sequence
0 KFLL
1 KSDS
2 KMDK
3 MDDL
4 MMMD
Sequence
1 KSDS
2 KMDK

Related

How can I classify a column of strings with true and false values by comparing with another column of strings

So I have a column of strings that is listed as "compounds"
Composition (column title)
ZrMo3
Gd(CuS)3
Ba2DyInTe5
I have another column that has strings of metal elements from the periodic table, and I'll call that column "metals"
Elements (column title)
Li
Be
Na
The objective is to check each string from "compounds" against every single string listed in "metals"; if any string from "metals" is present, it should be classified as True. Any ideas how I can code this?
Example: (if "metals" has Zr, Ag, and Te)
ZrMo3 True
Gd(CuS)3 False
Ba2DyInTe5 True
I recently tried using the code below, but I ended up getting all False:
asd = subset['composition'].isin(metals['Elements'])
print(asd)
I also tried this code and got all False as well:
subset['Boolean'] = subset.apply(lambda x: True if any(word in x.composition for word in metals) else False, axis=1)
Assuming you are using pandas, you can use a list comprehension inside your lambda, since you essentially need to iterate over all elements in the elements list:
import pandas as pd
elements = ['Li', 'Be', 'Na', 'Te']
compounds = ['ZrMo3', 'Gd(CuS)3', 'Ba2DyInTe5']
df = pd.DataFrame(compounds, columns=['compounds'])
print(df)
output
compounds
0 ZrMo3
1 Gd(CuS)3
2 Ba2DyInTe5
df['boolean'] = df.compounds.apply(lambda x: any([True if el in x else False for el in elements]))
print(df)
output
compounds boolean
0 ZrMo3 False
1 Gd(CuS)3 False
2 Ba2DyInTe5 True
If you are not using pandas, you can apply the lambda function to the lists with the map function:
out = list(
    map(
        lambda x: any([True if el in x else False for el in elements]), compounds)
)
print(out)
output
[False, False, True]
Here is a more complex version which also tackles the potential errors #Ezon mentioned, based on the regular expression module re. Since this approach essentially loops not only over the elements to compare with a single compound string, but also over each constituent of the compounds, I made two helper functions to keep it readable.
import re
import pandas as pd

def split_compounds(c):
    # remove all non-alphabetic characters
    c_split = re.sub(r"[^a-zA-Z]", "", c)
    # split string at capital letters
    c_split = '-'.join(re.findall('[A-Z][^A-Z]*', c_split))
    return c_split

def compare_compound(compound, element):
    # split compound into a list of elements
    compound_list = compound.split('-')
    return any([element == c for c in compound_list])
# build sample data
compounds = ['SiO2', 'Ba2DyInTe5', 'ZrMo3', 'Gd(CuS)3']
elements = ['Li', 'Be', 'Na', 'Te', 'S']
df = pd.DataFrame(compounds, columns=['compounds'])
# split compounds into elements
df['compounds_elements'] = [split_compounds(x) for x in compounds]
print(df)
output
compounds compounds_elements
0 SiO2 Si-O
1 Ba2DyInTe5 Ba-Dy-In-Te
2 ZrMo3 Zr-Mo
3 Gd(CuS)3 Gd-Cu-S
# check if any item from 'elements' is in the compounds
df['boolean'] = df.compounds_elements.apply(
    lambda x: any([True if compare_compound(x, el) else False for el in elements])
)
print(df)
output
compounds compounds_elements boolean
0 SiO2 Si-O False
1 Ba2DyInTe5 Ba-Dy-In-Te True
2 ZrMo3 Zr-Mo False
3 Gd(CuS)3 Gd-Cu-S True
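To see why the element-wise comparison matters here: the plain substring check from the first snippet would also flag SiO2, because 'S' is a substring of 'Si'. A quick sketch on the same frame:
df['naive'] = df.compounds.apply(lambda x: any(el in x for el in elements))
print(df[['compounds', 'naive', 'boolean']])
# SiO2 is True under 'naive' but False under the element-wise comparison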

Groupby several columns, summing them up based on the presence of a sub-string

Context: I'm trying to sum up columns, based on a list, only if their names start with or contain one of the listed strings.
So with a config file like this:
{
    'exclude_granularity': True,
    'granularity_suffix_list': ['A', 'B']
}
And a dataframe like this:
tt = pd.DataFrame({'A_2':[1,2,3],'A_3':[3,4,2],'B_4':[5,2,1],'B_1':[8,2,1],'C_3':[2,4,2]})
How can I group the columns together if they start with a given substring present in the granularity_suffix_list?
Desired output:
A B C_3
0 4 13 2
1 6 4 4
2 5 2 2
Attempts:
I was trying this:
if exclude_granularity == True:
    def correct_categories(cols):
        return [cat if col.startswith(cat) else col for col in cols for cat in granularity_suffix_list]
    df = df.groupby(correct_categories(df.columns), axis=1).sum()
But it doesn't work. Instead, the function returns a list like ['A_2', 'A', 'A_3', 'A', 'B_4', 'B', ...]
Thank you
Okay, I finally managed to solve what I wanted.
Posting the solution in case anyone finds it relevant:
tt = pd.DataFrame({'A_2':[1,2,3],'A_3':[3,4,2],'B_4':[5,2,1],'B_1':[8,2,1],'C_3':[2,4,2]})
granularity_suffix_list = ['A','B']

def correct_categories(cols_to_aggregate):
    lst = []
    for _, column in enumerate(cols_to_aggregate):
        if not column.startswith(tuple(granularity_suffix_list)):
            lst.append(column)
        else:
            lst.append(granularity_suffix_list[
                [i for i, w in enumerate(granularity_suffix_list) if column.startswith(w)][0]
            ])
    return lst

df = tt.groupby(correct_categories(tt.columns), axis=1).sum()
You could write that a bit more compactly:
def grouper(c):
    for suffix in granularity_suffix_list:
        if c.startswith(suffix):
            return suffix
    return c

df = tt.groupby(grouper, axis=1).sum()
Or if you're not opposed to using re:
import re

re_suffix = re.compile("|".join(map(re.escape, granularity_suffix_list)))

def grouper(c):
    # the walrus assignment (m := ...) requires Python 3.8+
    return m[0] if (m := re_suffix.match(c)) else c

df = tt.groupby(grouper, axis=1).sum()
Another option would be:
pat = f"^({'|'.join(granularity_suffix_list)})"
suffixes = tt.columns.str.extract(pat, expand=False)
df = tt.groupby(suffixes.where(suffixes.notna(), tt.columns), axis=1).sum()
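Note that groupby(..., axis=1) is deprecated in recent pandas releases; on those versions, a transpose-based equivalent (sketched here with the grouper function defined above) would be:
df = tt.T.groupby(grouper).sum().T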

Split a dataframe column into multiple columns according to text length in Python

How can I separate a column of a pandas dataframe into multiple columns based on text length? Assume that the chunk size will be 3 and that the sample dataframe is:
id body
1 abcdefgh
2 xyzk
For this case, I would like to get:
id body1 body2 body3
1 abc def gh
2 xyz k
I am assuming that I should be able to handle it with something like: df[['body1','body2','body3']] = df['body'].str.split(...
Any suggestions?
You can do the following:
new_values = df['body'].str.findall('.{1,3}')
new_columns = [f'body{num}' for num in range(1, new_values.apply(len).max() +1)]
new_df = pd.DataFrame(data=new_values.tolist(), columns=new_columns)
You can also define your regex pattern based on the maximum number of characters you want on each column:
max_char_per_column = 3
regex_pattern = f".{{1,{max_char_per_column}}}"
new_values = df['body'].str.findall(regex_pattern)
If you don't want the None, feel free to .fillna("") your new_df.
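Putting the pieces together with the sample frame from the question (assuming the id/body columns shown there), that might look like:
import pandas as pd

df = pd.DataFrame({"id": [1, 2], "body": ["abcdefgh", "xyzk"]})
new_values = df['body'].str.findall('.{1,3}')
new_columns = [f'body{num}' for num in range(1, new_values.apply(len).max() + 1)]
new_df = pd.DataFrame(data=new_values.tolist(), columns=new_columns).fillna("")
print(new_df)
#   body1 body2 body3
# 0   abc   def    gh
# 1   xyz     k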
See this answer for splitting a string with a regex every nth character: Split string every nth character?
First, define a split_chunk function:
def split_chunk(txt, n=3):
    return [txt[i:i+n] for i in range(0, len(txt), n)]
Then create a new dataframe from body using apply:
>>> df2 = pd.DataFrame(df.body.apply(split_chunk).to_list())
>>> df2
0 1 2
0 abc def gh
1 xyz k None
You can replace the None values, and rename the columns with the following
>>> df2 = df2.fillna("").rename(columns=lambda x: f"body{x+1}")
>>> df2
body1 body2 body3
0 abc def gh
1 xyz k
Finally, restore the index:
>>> df2.index = df.id
>>> df2
body1 body2 body3
id
1 abc def gh
2 xyz k
Shorter version:
df = df.set_index("id")
df = pd.DataFrame(
    df.body.apply(split_chunk).to_list(),
    index=df.index
).fillna("").rename(columns=lambda x: f"body{x+1}")
Try this:
import pandas as pd
df = pd.DataFrame({"body": ["abcdefgh","xyzk"]})
df['body1'] = df['body'].astype(str).str[0:3]
df['body2'] = df['body'].astype(str).str[3:6]
df['body3'] = df['body'].astype(str).str[6:9]
df.drop('body',axis=1,inplace=True)
print(df)
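If the number of chunks is not known in advance, the same slicing idea can be generalized with a loop over the longest string (a sketch, assuming the original df with the body column and a chunk size of 3):
df = pd.DataFrame({"body": ["abcdefgh", "xyzk"]})
n = 3
max_chunks = (int(df['body'].str.len().max()) + n - 1) // n
for i in range(max_chunks):
    df[f'body{i+1}'] = df['body'].astype(str).str[i*n:(i+1)*n]
df.drop('body', axis=1, inplace=True)
print(df)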

More efficient way to find multiple keywords in column of strings pandas

I have a dataframe containing many rows of strings: btb['Title']. I would like to identify whether each string contains positive, negative or neutral keywords. The following works but is considerably slow:
positive_kw = ('rise','positive','high','surge')
negative_kw = ('sink','lower','fall','drop','slip','loss','losses')
neutral_kw = ('flat','neutral')

# create new columns
btb['Positive'] = np.nan
btb['Negative'] = np.nan
btb['Neutral'] = np.nan

# turn value to one if keyword exists in sentence
for index, row in btb.iterrows():
    if any(s in row.Title for s in positive_kw) == True:
        btb['Positive'].loc[index] = 1
    if any(s in row.Title for s in negative_kw) == True:
        btb['Negative'].loc[index] = 1
    if any(s in row.Title for s in neutral_kw) == True:
        btb['Neutral'].loc[index] = 1
I appreciate your time and am interested to see what is necessary to improve the performance of this code.
You can use '|'.join on a list of words to create a regex pattern which matches any of the words (at least one).
Then you can use the pandas.Series.str.contains() method to create a boolean mask for the matches.
import pandas as pd
# create regex pattern out of the list of words
positive_kw = '|'.join(['rise','positive','high','surge'])
negative_kw = '|'.join(['sink','lower','fall','drop','slip','loss','losses'])
neutral_kw = '|'.join(['flat','neutral'])
# creating some fake data for demonstration
words = [
    'rise high',
    'positive attitude',
    'something',
    'foo',
    'lowercase',
    'flat earth',
    'neutral opinion'
]
df = pd.DataFrame(data=words, columns=['words'])
df['positive'] = df['words'].str.contains(positive_kw).astype(int)
df['negative'] = df['words'].str.contains(negative_kw).astype(int)
df['neutral'] = df['words'].str.contains(neutral_kw).astype(int)
print(df)
Output:
words positive negative neutral
0 rise high 1 0 0
1 positive attitude 1 0 0
2 something 0 0 0
3 foo 0 0 0
4 lowercase 0 1 0
5 flat earth 0 0 1
6 neutral opinion 0 0 1
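One caveat visible in the output above: plain substring patterns also match inside longer words ('lowercase' was flagged because it contains 'lower'). If whole-word matches are what you want, wrapping the pattern in word boundaries is one option (a sketch reusing the negative keyword list):
import re

negative_kw = r'\b(?:' + '|'.join(map(re.escape, ['sink','lower','fall','drop','slip','loss','losses'])) + r')\b'
df['negative'] = df['words'].str.contains(negative_kw).astype(int)
print(df)  # 'lowercase' is no longer flagged as negative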

Splitting strings in tuples within a pandas dataframe column

I have a pandas dataframe where a column contains tuples:
p = pd.DataFrame({"sentence": [("A.Hi", "B.My", "C.Friend"),
                               ("AA.How", "BB.Are", "CC.You")]})
I'd like to split each string in the tuple on the period (.), take the second part of the split, and see how many match a list of strings:
p["tmp"] = p["sentence"].apply(lambda x: [i.split(".")[1] for i in x])
p["tmp"].apply(lambda x: [True if len(set(x).intersection(set(["Hi", "My"])))>0 else False])
This works as intended, but my dataframe has more than 100k rows, and apply doesn't seem very efficient at these sizes. Is there a way to optimize/vectorize the above code?
Use a nested list and set comprehension, and for the test convert the sets to bools - an empty set returns False:
s = set(["Hi", "My"])
p["tmp"] = [bool(set(i.split(".")[1] for i in x).intersection(s)) for x in p["sentence"]]
print (p)
sentence tmp
0 (A.Hi, B.My, C.Friend) True
1 (AA.How, BB.Are, CC.You) False
EDIT:
If splitting gives only 1 or 2 values, you can select the last value by indexing with [-1]:
p = pd.DataFrame({"sentence": [("A.Hi", "B.My", "C.Friend"),
                               ("AA.How", "BB.Are", "You")]})
print (p)
sentence
0 (A.Hi, B.My, C.Friend)
1 (AA.How, BB.Are, You)
s = set(["Hi", "My"])
p["tmp"] = [bool(set(i.split(".")[-1] for i in x).intersection(s)) for x in p["sentence"]]
print (p)
sentence tmp
0 (A.Hi, B.My, C.Friend) True
1 (AA.How, BB.Are, You) False
