Delete specific string from array column in a pandas dataframe

I'm trying to remove the string '$A' from the array elements of column a, but the code below doesn't seem to work.
In it I'm trying to replace the '$A' string with an empty string (which doesn't work either); instead, I would actually like to just delete that string.
df = pd.DataFrame({'a': [['$A','1'], ['$A', '3','$A'],[]], 'b': ['4', '5', '6']})
df['a'] = df['a'].replace({'$A': ''}, regex=True)
print(df['a'])

replace doesn't look inside list elements, so you'll have to use a loop/apply in this case:
df['a'] = df.a.apply(lambda x: [s for s in x if s != '$A'])
df
#      a  b
# 0  [1]  4
# 1  [3]  5
# 2   []  6
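If you instead want the substring replacement the question first attempted (turning '$A' into '' inside each element), the same apply pattern works with str.replace; a minimal sketch:
# Replace the '$A' substring inside each element instead of dropping elements
df['a'] = df['a'].apply(lambda lst: [s.replace('$A', '') for s in lst])
# 0      [, 1]
# 1    [, 3, ]
# 2         []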

Related

Pandas replace() string with int "Cannot set non-string value in StringArray"

I'm trying to replace strings with integers in a pandas dataframe. I've already visited here but the solution doesn't work.
Reprex:
import pandas as pd
pd.__version__
> '1.4.1'
test = pd.DataFrame(data = {'a': [None, 'Y', 'N', '']}, dtype = 'string')
test.replace(to_replace = 'Y', value = 1)
> ValueError: Cannot set non-string value '1' into a StringArray.
I know that I could do this individually for each column, either explicitly or using apply, but I am trying to avoid that. I'd ideally replace all 'Y' in the dataframe with int(1), all 'N' with int(0) and all '' with None or pd.NA, so the replace function appears to be the fastest/clearest way to do this.
Use Int8Dtype. The nullable IntXX dtypes allow integer values alongside <NA>:
test['b'] = test['a'].replace({'Y': '1', 'N': '0', '': pd.NA}).astype(pd.Int8Dtype())
print(test)
# Output
      a     b
0  <NA>  <NA>
1     Y     1
2     N     0
3        <NA>
>>> [type(x) for x in test['b']]
[pandas._libs.missing.NAType,
 numpy.int8,
 numpy.int8,
 pandas._libs.missing.NAType]
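To apply this across the whole dataframe at once rather than per column, as the question wanted, the same mapping can go through DataFrame.replace followed by a single cast. A minimal sketch, assuming every column uses the same 'Y'/'N'/'' coding (the column 'b' here is made up for illustration):
import pandas as pd

test = pd.DataFrame({'a': [None, 'Y', 'N', ''], 'b': ['N', 'Y', '', None]}, dtype='string')
# StringArray only holds strings or <NA>, so map to string digits first,
# then cast every column to the nullable Int8 dtype in one go
out = test.replace({'Y': '1', 'N': '0', '': pd.NA}).astype('Int8')
print(out)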

Function to replace values in columns with Column Headers (Pandas)

I am trying to create a function that loops through specific columns in a dataframe and replaces the values with the column names. I have tried the below but it does not change the values in the columns.
def value_replacer(df):
    cols = ['Account Name', 'Account Number', 'Maintenance Contract']
    x = [i for i in df.columns if i not in cols]
    for i in x:
        for j in df[i]:
            if isinstance(j, str):
                j.replace(j, i)
    return df
What should be added to the function to change the values?
Similar to #lazy's solution, but using difference to get the unlisted columns and using a mask instead of the list comprehension:
df = pd.DataFrame({'w': ['a', 'b', 'c'], 'x': ['d', 'e', 'f'], 'y': [1, 2, '3'], 'z': [4, 5, 6]})

def value_replacer(df):
    cols_to_skip = ['w', 'z']
    for col in df.columns.difference(cols_to_skip):
        mask = df[col].map(lambda x: isinstance(x, str))
        df.loc[mask, col] = col
    return df
Output:
   w  x  y  z
0  a  x  1  4
1  b  x  2  5
2  c  x  y  6
This loops through only the columns of interest once, evaluates each row within each column to see whether it is a string, then uses the resulting mask to bulk-update all strings with the column name.
Note that this changes the dataframe in place, so make a copy first if you want to keep the original, and you don't necessarily need the return statement.
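For reference, the original function does nothing because str.replace returns a new string that is never assigned back (Python strings are immutable). A minimal sketch of the smallest repair, assigning a mapped Series back to each column:
def value_replacer(df):
    cols = ['Account Name', 'Account Number', 'Maintenance Contract']
    for col in df.columns.difference(cols):
        # map builds a new Series; assigning it back is the step the original loop skipped
        df[col] = df[col].map(lambda v: col if isinstance(v, str) else v)
    return df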

Remove digits from a list of strings in pandas column

I have this pandas dataframe:
  Tokens
1  ['rice', 'XXX', '250g']
2  ['beer', 'XXX', '750cc']
All the tokens here ('rice', 'XXX', '250g') are in the same list of strings, in the same column.
I want to remove the digits, but because they are attached to other characters ('250g', '750cc'), filtering out whole tokens cannot remove them.
I have tried this code:
def remove_digits(tokens):
    """
    Remove digits from a string
    """
    return [''.join([i for i in tokens if not i.isdigit()])]

df["Tokens"] = df.Tokens.apply(remove_digits)
df.head()
but it only joined the strings, and I clearly do not want to do that.
My desired output:
  Tokens
1  ['rice', 'XXX', 'g']
2  ['beer', 'XXX', 'cc']
This is possible using pandas methods, which are vectorised and therefore more efficient than looping.
import pandas as pd
df = pd.DataFrame({"Tokens": [["rice", "XXX", "250g"], ["beer", "XXX", "750cc"]]})
col = "Tokens"
df[col] = (
    df[col]
    .explode()
    .str.replace(r"\d+", "", regex=True)
    .groupby(level=0)
    .agg(list)
)
#            Tokens
# 0  [rice, XXX, g]
# 1  [beer, XXX, cc]
Here we use:
pandas.Series.explode to convert the Series of lists into rows
pandas.Series.str.replace to replace occurrences of \d+ (one or more digits 0-9) with "" (nothing)
pandas.Series.groupby to group the Series by index (level=0) and put them back into lists (.agg(list))
Here's a simple solution -
df = pd.DataFrame({'Tokens': [['rice', 'XXX', '250g'],
                              ['beer', 'XXX', '750cc']]})

def remove_digits_from_string(s):
    return ''.join([x for x in s if not x.isdigit()])

def remove_digits(l):
    return [remove_digits_from_string(s) for s in l]

df["Tokens"] = df.Tokens.apply(remove_digits)
You can use to_list + re.sub in order to update your original dataframe.
import re

for index, lst in enumerate(df['Tokens'].to_list()):
    lst = [re.sub(r'\d+', '', i) for i in lst]
    df.at[index, 'Tokens'] = lst  # .at reliably assigns a list into a single cell
print(df)
Output:
           Tokens
0  [rice, XXX, g]
1  [beer, XXX, cc]

Find a column name and retaining certain string in that entire column values

I would like to format the "Status" column in a csv and retain only the string inside single quotes that is followed by a comma ('sometext',).
Example:
If more than one value is found in a cell (as in rows 2 and 3 of the input), the values should be concatenated with a pipe symbol (|), e.g. Phone|Charger.
The expected output should be written back into the same Status column.
My attempt (not working):
import re
import pandas as pd

df = pd.read_csv("test projects.csv")
scol = df.columns.get_loc("Status")
statusRegex = re.compile("'\t',"?"'\t',")
mo = statusRegex.search(scol.column)
Let's say you have df as:
df = pd.DataFrame([[[{'a': '1', 'b': '4'}]],
                   [[{'a': '1', 'b': '2'}, {'a': '3', 'b': '5'}]]], columns=['pr'])
df:
                                              pr
0                        [{'a': '1', 'b': '4'}]
1  [{'a': '1', 'b': '2'}, {'a': '3', 'b': '5'}]
df['comb'] = df.pr.apply(lambda x: '|'.join([i['a'] for i in x]))
df:
                                              pr comb
0                        [{'a': '1', 'b': '4'}]    1
1  [{'a': '1', 'b': '2'}, {'a': '3', 'b': '5'}]  1|3
import pandas as pd

# simplified mock data
df = pd.DataFrame(dict(
    value=[23432] * 3,
    Status=[
        [{'product.type': 'Laptop'}],
        [{'product.type': 'Laptop'}, {'product.type': 'Charger'}],
        [{'product.type': 'TV'}, {'product.type': 'Remote'}]
    ]
))

# make a method to do the desired formatting / extraction of data
def da_piper(cell):
    """extracts product.type and concatenates with a pipe"""
    vals = [_['product.type'] for _ in cell]  # get only the product.type values
    return '|'.join(vals)  # join them with a pipe

# save to desired column
df['output'] = df['Status'].apply(da_piper)  # apply the method to the Status col
Additional help: you do not need to use read_excel, since csv is not an Excel format; it is comma-separated values, a standard format. In this case you can just do this:
import pandas as pd

# make a method to do the desired formatting / extraction of data
def da_piper(cell):
    """extracts product.type and concatenates with a pipe"""
    vals = [_['product.type'] for _ in cell]  # get only the product.type values
    return '|'.join(vals)  # join them with a pipe

# read csv to dataframe
df = pd.read_csv("test projects.csv")

# apply method and save to desired column
df['Status'] = df['Status'].apply(da_piper)  # apply the method to the Status col
Thank you all for the help and suggestions. Please find the final working code below.
import re
import pandas as pd

df = pd.read_csv('test projects.csv')
rows = len(df['input'])

def get_values(value):
    m = re.findall("'(.+?)'", value)
    word = ""
    for mm in m:
        if 'value' not in str(mm):
            if 'autolabel_strategy' not in str(mm):
                if 'String Matching' not in str(mm):
                    word += mm + "|"
    return str(word).rsplit('|', 1)[0]

al_lst = []
ans_lst = []
for r in range(rows):
    auto_label = df['autolabeledValues'][r]
    answers = df['answers'][r]
    al = get_values(auto_label)
    ans = get_values(answers)
    al_lst.append(al)
    ans_lst.append(ans)

df['a'] = al_lst
df['b'] = ans_lst
df.to_csv("Output.csv", index=False)
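As a side note, the three nested if statements in get_values can be collapsed into one membership test over a tuple of the excluded substrings; a sketch with equivalent logic:
EXCLUDE = ('value', 'autolabel_strategy', 'String Matching')

def get_values(value):
    m = re.findall("'(.+?)'", value)
    # keep tokens containing none of the excluded substrings, pipe-joined
    return '|'.join(mm for mm in m if not any(s in mm for s in EXCLUDE))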

Checking if a data series is strings

I want to check if a column in a dataframe contains strings. I would have thought this could be done just by checking dtype, but that isn't the case. A pandas series that contains strings just has dtype 'object', which is also used for other data structures (like lists):
df = pd.DataFrame({'a': [1,2,3], 'b': ['Hello', '1', '2'], 'c': [[1],[2],[3]]})
print(df['a'].dtype)
print(df['b'].dtype)
print(df['c'].dtype)
Produces:
int64
object
object
Is there some way of checking if a column contains only strings?
You can use this to see if all elements in a column are strings
df.applymap(type).eq(str).all()
a    False
b     True
c    False
dtype: bool
To just check if any are strings
df.applymap(type).eq(str).any()
You could map the data with a function that checks each element's type against str, producing True or False, then check whether the result contains any False elements.
The example below tests a list containing an element other than str; it evaluates to True if data of another type is present.
test = [1, 2, '3']
False in map((lambda x: type(x) == str), test)
Output: True
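If you'd rather not type-check every element yourself, pandas also ships a type-inference helper, pandas.api.types.infer_dtype, which reports what a Series actually holds; a short sketch:
import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3], 'b': ['Hello', '1', '2'], 'c': [[1], [2], [3]]})
print(pd.api.types.infer_dtype(df['a']))  # 'integer'
print(pd.api.types.infer_dtype(df['b']))  # 'string'
print(pd.api.types.infer_dtype(df['c']))  # 'mixed' (lists are reported as mixed)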
