Splitting strings in tuples within a pandas dataframe column - python

I have a pandas dataframe where a column contains tuples:
p = pd.DataFrame({"sentence" : [("A.Hi", "B.My", "C.Friend"), \
("AA.How", "BB.Are", "CC.You")]})
I'd like to split each string in the tuple on the punctuation character ., take the second part of the split, and see how many match a list of strings:
p["tmp"] = p["sentence"].apply(lambda x: [i.split(".")[1] for i in x])
p["tmp"].apply(lambda x: len(set(x).intersection(set(["Hi", "My"]))) > 0)
This works as intended, but my dataframe has more than 100k rows, and apply doesn't seem very efficient at these sizes. Is there a way to optimize/vectorize the above code?

Use a nested list and set comprehension, and for the test convert the sets to bools - an empty set returns False:
s = set(["Hi", "My"])
p["tmp"] = [bool(set(i.split(".")[1] for i in x).intersection(s)) for x in p["sentence"]]
print (p)
                   sentence    tmp
0    (A.Hi, B.My, C.Friend)   True
1  (AA.How, BB.Are, CC.You)  False
EDIT:
If splitting can produce lists of length 1 or 2, you can select the last value by indexing with [-1]:
p = pd.DataFrame({"sentence" : [("A.Hi", "B.My", "C.Friend"), \
("AA.How", "BB.Are", "You")]})
print (p)
                 sentence
0  (A.Hi, B.My, C.Friend)
1   (AA.How, BB.Are, You)
s = set(["Hi", "My"])
p["tmp"] = [bool(set(i.split(".")[-1] for i in x).intersection(s)) for x in p["sentence"]]
print (p)
                 sentence    tmp
0  (A.Hi, B.My, C.Friend)   True
1   (AA.How, BB.Are, You)  False
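A variant worth benchmarking (my suggestion, not part of the answer above): set.isdisjoint accepts any iterable and short-circuits at the first common element, so the intermediate set never has to be built in full:

```python
import pandas as pd

p = pd.DataFrame({"sentence": [("A.Hi", "B.My", "C.Friend"),
                               ("AA.How", "BB.Are", "CC.You")]})
s = {"Hi", "My"}

# isdisjoint consumes the generator lazily and stops at the first match
p["tmp"] = [not s.isdisjoint(i.split(".")[1] for i in x) for x in p["sentence"]]
print(p["tmp"].tolist())  # [True, False]
```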

How can I classify a column of strings with true and false values by comparing with another column of strings

So I have a column of strings that is listed as "compounds"
Composition (column title)
ZrMo3
Gd(CuS)3
Ba2DyInTe5
I have another column that contains strings of metal elements from the periodic table, and I'll call that column "metals"
Elements (column title)
Li
Be
Na
The objective is to check each string from "compounds" with every single string listed in "metals" and if any string from metals is there then it would be classified as true. Any ideas how I can code this?
Example: (if "metals" has Zr, Ag, and Te)
ZrMo3 True
Gd(CuS)3 False
Ba2DyInTe5 True
I recently tried using the code below, but I ended up getting all False:
asd = subset['composition'].isin(metals['Elements'])
print(asd)
I also tried this code and got all False as well:
subset['Boolean'] = subset.apply(lambda x: True if any(word in x.composition for word in metals) else False, axis=1)
Assuming you are using pandas, you can use a list comprehension inside your lambda, since you essentially need to iterate over all elements in the elements list.
import pandas as pd
elements = ['Li', 'Be', 'Na', 'Te']
compounds = ['ZrMo3', 'Gd(CuS)3', 'Ba2DyInTe5']
df = pd.DataFrame(compounds, columns=['compounds'])
print(df)
output
    compounds
0       ZrMo3
1    Gd(CuS)3
2  Ba2DyInTe5
df['boolean'] = df.compounds.apply(lambda x: any(el in x for el in elements))
print(df)
output
    compounds  boolean
0       ZrMo3    False
1    Gd(CuS)3    False
2  Ba2DyInTe5     True
If you are not using pandas, you can apply the lambda function to the list with the map function:
out = list(
    map(lambda x: any(el in x for el in elements), compounds)
)
print(out)
output
[False, False, True]
Here is a more complex version which also tackles the potential errors @Ezon mentioned, based on the regular expression module re. Since this approach essentially loops not only over the elements to compare with a single compound string but also over each constituent of the compounds, I made two helper functions to keep it readable.
import re
import pandas as pd

def split_compounds(c):
    # remove all non-alphabetic characters
    c_split = re.sub(r"[^a-zA-Z]", "", c)
    # split string at capital letters
    c_split = '-'.join(re.findall('[A-Z][^A-Z]*', c_split))
    return c_split

def compare_compound(compound, element):
    # split compound into its constituents
    compound_list = compound.split('-')
    return any(element == c for c in compound_list)
# build sample data
compounds = ['SiO2', 'Ba2DyInTe5', 'ZrMo3', 'Gd(CuS)3']
elements = ['Li', 'Be', 'Na', 'Te', 'S']
df = pd.DataFrame(compounds, columns=['compounds'])
# split compounds into elements
df['compounds_elements'] = [split_compounds(x) for x in compounds]
print(df)
output
    compounds compounds_elements
0        SiO2               Si-O
1  Ba2DyInTe5        Ba-Dy-In-Te
2       ZrMo3              Zr-Mo
3    Gd(CuS)3            Gd-Cu-S
# check if any item from 'elements' is in the compounds
df['boolean'] = df.compounds_elements.apply(
    lambda x: any(compare_compound(x, el) for el in elements)
)
print(df)
output
    compounds compounds_elements  boolean
0        SiO2               Si-O    False
1  Ba2DyInTe5        Ba-Dy-In-Te     True
2       ZrMo3              Zr-Mo    False
3    Gd(CuS)3            Gd-Cu-S     True
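As an alternative sketch (my own assumption, not from the answer above: element symbols are one capital letter followed by zero or more lowercase letters), re.findall can extract the constituents directly, without the intermediate '-'-joined string:

```python
import re

def constituents(compound):
    # element symbols: a capital letter plus optional lowercase letters;
    # digits and brackets are skipped automatically
    return re.findall(r"[A-Z][a-z]*", compound)

elements = {"Li", "Be", "Na", "Te", "S"}
compounds = ["SiO2", "Ba2DyInTe5", "ZrMo3", "Gd(CuS)3"]

# exact symbol matching, so 'S' does not falsely match inside 'Si'
flags = [not elements.isdisjoint(constituents(c)) for c in compounds]
print(flags)  # [False, True, False, True]
```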

Flipping a matrix-like string horizontally

The goal of this function is to flip a matrix-like string horizontally.
For example, the string '100010001' with three rows and three columns would look like:
1 0 0
0 1 0
0 0 1
but when flipped should look like:
0 0 1
0 1 0
1 0 0
So the function would return the following output:
'001010100'
The caveat, I cannot use lists or arrays. only strings.
The current code I have written up, I believe, should work, however it is returning an empty string.
def flip_horizontal(image, rows, column):
    horizontal_image = ''
    for i in range(rows):
        # This should slice the image string, and map image (the last element
        # in the column : to the first element of the column) onto
        # horizontal_image. This will repeat for the given amount of rows.
        horizontal_image = horizontal_image + image[(i+1)*column-1:i*column]
    return horizontal_image
Again this returns an empty string. Any clue what the issue is?
Use [::-1] to reverse each row of the image.
def flip(im, w):
    return ''.join(im[i:i+w][::-1] for i in range(0, len(im), w))
>>> im = '100010001'
>>> flip(im, 3)
'001010100'
The range function can be used to step through your string in row-sized chunks. While iterating through the string you can use [::-1] to reverse each row, achieving the horizontal flip.
string = '100010001'
output = ''
prev = 0

# Iterate through the string in steps of 3
for i in range(3, len(string) + 1, 3):
    # Isolate and reverse each row of the string
    row = string[prev:i]
    row = row[::-1]
    output = output + row
    prev = i
Input (as rows):
100
010
001
Output (as rows):
001
010
100
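For completeness, the empty string in the original flip_horizontal comes from the slice image[(i+1)*column-1:i*column]: its start index is past its end index, and with the default step of +1 such a slice is always ''. Slicing each row out first and then reversing it keeps the no-lists constraint while fixing the function - a sketch:

```python
def flip_horizontal(image, rows, column):
    horizontal_image = ''
    for i in range(rows):
        # take row i as a substring, then reverse it with [::-1]
        horizontal_image += image[i * column:(i + 1) * column][::-1]
    return horizontal_image

print(flip_horizontal('100010001', 3, 3))  # '001010100'
```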

Data Cleaning with Pandas

I have a dataframe column consisting of text data and I need to filter it according to the following conditions:
The character "M", if present in the string, can only be at the n-2 position.
The n-1 position of the string always has to be a "D".
ex:
KFLL
KSDS
KMDK
MDDL
In this case, for example, I would have to remove the first string, since the character at the n-1 position is not a "D", and the last one, since the character "M" appears out of the n-2 position.
How can I apply this to a whole dataframe column?
Here's with a list comprehension:
l = ['KFLL', 'KSDS', 'KMDK', 'MDDL']
[x for x in l if ((('M' not in x) or (x[-3] == 'M')) and (x[-2] == 'D'))]
Output:
['KSDS', 'KMDK']
This does what you want. It could probably be written more concisely with list comprehensions, but at least this is readable. It assumes that the strings are all at least three characters long; otherwise you get an IndexError, in which case you would need to add a try/except.
from collections import Counter
import pandas as pd
df = pd.DataFrame(data=list(["KFLL", "KSDS", "KMDK", "MDDL"]), columns=["code"])
print("original")
print(df)
mask = list()
for code in df["code"]:
    flag = False
    if code[-2] == "D":
        counter = Counter(code)
        if counter["M"] == 0 or (counter["M"] == 1 and code[-3] == "M"):
            flag = True
    mask.append(flag)

df["mask"] = mask
df2 = df[df["mask"]].copy()
df2.drop("mask", axis=1, inplace=True)
print("new")
print(df2)
Output looks like this:
original
   code
0  KFLL
1  KSDS
2  KMDK
3  MDDL
new
   code
1  KSDS
2  KMDK
Thank you all for your help.
I ended up implementing it like this:
l = {"Sequence": [ 'KFLL', 'KSDS', 'KMDK', 'MDDL', "MMMD"]}
df = pd.DataFrame(data= l)
print(df)
df = df[df.Sequence.str[-2] == 'D']
df = df[~df.Sequence.apply(lambda x: ("M" in x and x[-3]!='M') or x.count("M") >1 )]
print(df)
Output:
  Sequence
0     KFLL
1     KSDS
2     KMDK
3     MDDL
4     MMMD
  Sequence
1     KSDS
2     KMDK
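Both conditions can also be expressed as a single regular expression: zero or more non-M characters, at most one optional M (which by construction then sits at the n-2 position, since a D and one final character follow), a D, and one last non-M character. This is my own consolidation, not taken from the answers above, and it assumes strings of length at least 2 and pandas >= 1.1 for Series.str.fullmatch:

```python
import pandas as pd

df = pd.DataFrame({"Sequence": ["KFLL", "KSDS", "KMDK", "MDDL", "MMMD"]})

# [^M]*  - any prefix containing no M
# M?     - at most one M, forced to sit right before the D
# D[^M]  - 'D' at the n-1 position, then the final character (never M)
df = df[df["Sequence"].str.fullmatch(r"[^M]*M?D[^M]")]
print(df["Sequence"].tolist())  # ['KSDS', 'KMDK']
```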

Return a list item from a null value using list comprehension

I'm using a list comprehension to get a list of numerical values that are separated by a semicolon, ;, within a string.
I need to get a 0 from missing values on either side of the semicolon.
Example string without missing values:
job_from_serial_st = 'xxxxx\rxxxxxx\rxxxx\rGAX=77.00;85.00\rxxxxx\r'
Using my list comprehension I would get the following list of values: [7700, 8500]
But how can I get a list of values from strings like 'GAX=77.00;\r' or 'GAX=;85.00\r'?
I'm expecting to get the following lists from the example strings with missing values:
[7700, 0] or [0, 8500]
def get_term(A, B, phrase):
    n = len(A)
    start = phrase.find(A) + n
    end = phrase.find(B, start)
    term = phrase[start:end]
    return term
# GET GAX NUMBER
gax_nums = get_term(r'GAX=', r'\r\x1e\x1d', job_from_serial_st)
gax = [int(float(x) * 100) for x in gax_nums.split(';')]
print(gax)
You can add a condition to the list comprehension to return 0 if there is no number before or after the semicolon (also replaced your function with a series of splits to retrieve the numbers from the string).
s = 'xxxxx\rxxxxxx\rxxxx\rGAX=77.00;\rxxxxx\r'
nums = s.split('GAX=')[1].split('\r')[0].split(';')
gax = [int(float(n) * 100) if n else 0 for n in nums]
print(gax)
# [7700, 0]
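A regex variant that skips the splitting entirely (a sketch with a hypothetical helper name, assuming the two fields always sit directly after GAX= separated by a single ;):

```python
import re

def gax_values(s):
    # each group matches digits/dots, or empty when the value is missing
    m = re.search(r"GAX=([\d.]*);([\d.]*)", s)
    return [int(float(g) * 100) if g else 0 for g in m.groups()]

print(gax_values('xxxxx\rGAX=77.00;85.00\rxxxxx\r'))  # [7700, 8500]
print(gax_values('xxxxx\rGAX=77.00;\rxxxxx\r'))       # [7700, 0]
print(gax_values('xxxxx\rGAX=;85.00\rxxxxx\r'))       # [0, 8500]
```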

Extracting the n-th elements from a list of named tuples in pandas Python?

I am trying to extract the n-th element from a list of named tuples stored in a df looking like this:
df['text'] = [Tag(word='Come', pos='adj', lemma='Come'), Tag(word='on', pos='nounpl', lemma='on'), Tag(word='Feyenoord', pos='adj', lemma='Feyenoord')]
I am trying to extract only elements that contain the pos information from each tuple. This is the outcome that I would like to achieve:
df['text'] = ['adj', 'nounpl', 'adj']
This is what I have tried this far:
d = []
count = 0
while count < df['text'].size:
    d.append([item[1] for item in df['text'][count]])
    count += 1

dfpos = pd.DataFrame({'text': d})
df['text'] = pd.DataFrame({'text': d})
df['text'] = df['text'].apply(lambda x: ', '.join(x))
And this is the error: IndexError: tuple index out of range
What am I missing?
Solution: It seems that the easiest solution is to turn the tuples into a list. I am not sure if this is the best solution, but it works.
d = []
count = 0
while count < df['text'].size:
    temp = [list(item[1:-1]) for item in df['text'][count]]
    d.append(sum(temp, []))
    count += 1

df['text'] = pd.DataFrame({'text': d})
df['text2'] = df['text'].apply(lambda x: ', '.join(x))
Try indexing using apply if Tag is your namedtuple, i.e.
Data preparation:
from collections import namedtuple
import numpy as np
import pandas as pd

Tag = namedtuple('Tag', 'word pos lemma')
li = [Tag(word='Come', pos='adj', lemma='Come'), Tag(word='on', pos='nounpl', lemma='on'), Tag(word='Feyenoord', pos='adj', lemma='Feyenoord')]
df = pd.DataFrame({'text': li})
For attribute-based selection use . in apply, since it's a namedtuple, i.e.
df['new'] = df['text'].apply(lambda x : x.pos)
If you need index-based selection then use
df['new'] = df['text'].apply(lambda x : x[1] if len(x)>1 else np.nan)
Output of df['new']:
0       adj
1    nounpl
2       adj
Name: new, dtype: object
Another solution is to use str[1] to select the value from the namedtuple:
df['text1'] = df['text'].str[1]
print (df)
                          text   text1
0            (Come, adj, Come)     adj
1             (on, nounpl, on)  nounpl
2  (Feyenoord, adj, Feyenoord)     adj
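When the column holds plain Python objects like namedtuples, a list comprehension over the attribute is often faster than apply, since it avoids per-row overhead - a sketch of this alternative (my suggestion, using the same setup as above):

```python
from collections import namedtuple
import pandas as pd

Tag = namedtuple('Tag', 'word pos lemma')
li = [Tag(word='Come', pos='adj', lemma='Come'),
      Tag(word='on', pos='nounpl', lemma='on'),
      Tag(word='Feyenoord', pos='adj', lemma='Feyenoord')]
df = pd.DataFrame({'text': li})

# plain attribute access inside a list comprehension
df['text1'] = [t.pos for t in df['text']]
print(df['text1'].tolist())  # ['adj', 'nounpl', 'adj']
```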
