I'm trying to replace strings with integers in a pandas dataframe. I've already visited here but the solution doesn't work.
Reprex:
import pandas as pd
pd.__version__
> '1.4.1'
test = pd.DataFrame(data = {'a': [None, 'Y', 'N', '']}, dtype = 'string')
test.replace(to_replace = 'Y', value = 1)
> ValueError: Cannot set non-string value '1' into a StringArray.
I know that I could do this individually for each column, either explicitly or using apply, but I am trying to avoid that. I'd ideally replace all 'Y' in the dataframe with int(1), all 'N' with int(0), and all '' with None or pd.NA; the replace function appears to be the fastest/clearest way to do this.
Use Int8Dtype. The nullable IntXX dtypes allow both integer values and <NA>:
test['b'] = test['a'].replace({'Y': '1', 'N': '0', '': pd.NA}).astype(pd.Int8Dtype())
print(test)
# Output
      a     b
0  <NA>  <NA>
1     Y     1
2     N     0
3        <NA>
>>> [type(x) for x in test['b']]
[pandas._libs.missing.NAType,
numpy.int8,
numpy.int8,
pandas._libs.missing.NAType]
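If the goal is to convert every column at once (as the question asks), the same replace-then-cast pattern also works on the whole DataFrame, since replace with a flat dict applies to all columns. A minimal sketch, assuming every column holds only 'Y'/'N'/''/missing:

```python
import pandas as pd

test = pd.DataFrame({'a': [None, 'Y', 'N', '']}, dtype='string')

# Replace with string digits first (a StringArray only accepts strings),
# then cast the whole frame to the nullable Int8 dtype.
out = test.replace({'Y': '1', 'N': '0', '': pd.NA}).astype('Int8')
```

The intermediate values stay strings ('1', '0') until the final astype, which is what avoids the "Cannot set non-string value" error.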
So let's say I have a dictionary that looks like this:
{Row1 : {Col1: Data, Col2: Data ..} , Row2: {Col1: Data,...},...}
Ex dictionary:
{'1': {'0': '1', '1': '2', '2': '3'}, '2': {'0': '4', '1': '5', '2': '6'}}
I was checking out pandas from_dict method with orient='index', but it's not quite what I need.
This is what I have that works:
df_pasted_data = pd.DataFrame()
for v in dictionary.values():
    # need to set the index to 0 since I'm passing in a basic dictionary that looks like: col1: data, col2: data
    # otherwise it will throw "ValueError: If using all scalar values, you must pass an index"
    temp = pd.DataFrame(v, index=[0])
    # append doesn't happen in place so I need to set it to itself
    df_pasted_data = df_pasted_data.append(temp, ignore_index=True)
This works, but I've read online that repeated appends are not very efficient. Is there a better way of going about this?
Make use of the DataFrame() constructor and the T (transpose) attribute:
import pandas as pd
df_pasted_data = pd.DataFrame(dictionary).T
# output
print(df_pasted_data)
   0  1  2
1  1  2  3
2  4  5  6
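For what it's worth, from_dict with orient='index' gives the same result here, since the outer keys are the rows. A sketch with the example dictionary from the question:

```python
import pandas as pd

dictionary = {'1': {'0': '1', '1': '2', '2': '3'},
              '2': {'0': '4', '1': '5', '2': '6'}}

# outer keys become the index, inner keys become the columns
df = pd.DataFrame.from_dict(dictionary, orient='index')
print(df)
```

This avoids both the append loop and the transpose.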
I would like to format the "status" column in a csv and retain the string inside single quotation adjoining comma ('sometext',)
Example:
Input
As in rows 2 & 3 - if more than one value is found in any column value, they should be concatenated with a pipe symbol (|), e.g. Phone|Charger
Expected output should get pasted in same status column like below
My attempt (not working):
import re
import pandas as pd

df = pd.read_csv("test projects.csv")
scol = df.columns.get_loc("Status")
statusRegex = re.compile("'\t',"?"'\t',")
mo = statusRegex.search(scol.column)
Let's say you have a df such as:
df = pd.DataFrame([[[{'a':'1', 'b': '4'}]], [[{'a':'1', 'b': '2'}, {'a':'3', 'b': '5'}]]], columns=['pr'])
df:
                                             pr
0                        [{'a': '1', 'b': '4'}]
1  [{'a': '1', 'b': '2'}, {'a': '3', 'b': '5'}]
df['comb'] = df.pr.apply(lambda x: '|'.join([i['a'] for i in x]))
df:
                                             pr comb
0                        [{'a': '1', 'b': '4'}]    1
1  [{'a': '1', 'b': '2'}, {'a': '3', 'b': '5'}]  1|3
import pandas as pd
# simplified mock data
df = pd.DataFrame(dict(
value=[23432] * 3,
Status=[
[{'product.type': 'Laptop'}],
[{'product.type': 'Laptop'}, {'product.type': 'Charger'}],
[{'product.type': 'TV'}, {'product.type': 'Remote'}]
]
))
# make a method to do the desired formatting / extraction of data
def da_piper(cell):
"""extracts product.type and concatenates with a pipe"""
vals = [_['product.type'] for _ in cell] # get only the product.type values
return '|'.join(vals) # join them with a pipe
# save to desired column
df['output'] = df['Status'].apply(da_piper) # apply the method to the Status col
Additional help: you do not need to use read_excel, since CSV is not an Excel format; it is comma-separated values, a standard text format. In this case you can just do this:
import pandas as pd
# make a method to do the desired formatting / extraction of data
def da_piper(cell):
"""extracts product.type and concatenates with a pipe"""
vals = [_['product.type'] for _ in cell] # get only the product.type values
return '|'.join(vals) # join them with a pipe
# read csv to dataframe
df = pd.read_csv("test projects.csv")
# apply method and save to desired column
df['Status'] = df['Status'].apply(da_piper) # apply the method to the Status col
Thank you all for the help and suggestions. Please find the final working code.
import re
import pandas as pd

df = pd.read_csv('test projects.csv')
rows = len(df['input'])

def get_values(value):
    m = re.findall(r"'(.+?)'", value)
    word = ""
    for mm in m:
        if 'value' not in str(mm):
            if 'autolabel_strategy' not in str(mm):
                if 'String Matching' not in str(mm):
                    word += mm + "|"
    return str(word).rsplit('|', 1)[0]

al_lst = []
ans_lst = []
for r in range(rows):
    auto_label = df['autolabeledValues'][r]
    answers = df['answers'][r]
    al = get_values(auto_label)
    ans = get_values(answers)
    al_lst.append(al)
    ans_lst.append(ans)

df['a'] = al_lst
df['b'] = ans_lst
df.to_csv("Output.csv", index=False)
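The triple-nested if inside get_values can be collapsed into a single comprehension with an exclusion list. A sketch on a hypothetical cell string; the exclusion strings mirror the code above:

```python
import re

# strings that should be filtered out, mirroring the nested ifs above
EXCLUDE = ('value', 'autolabel_strategy', 'String Matching')

def get_values(value):
    """Pull every single-quoted token out of the cell and pipe-join the kept ones."""
    tokens = re.findall(r"'(.+?)'", value)
    kept = [t for t in tokens if not any(e in t for e in EXCLUDE)]
    return '|'.join(kept)

# hypothetical cell resembling the CSV data
cell = "[{'autolabel_strategy': 'String Matching', 'value': 'Phone'}, {'value': 'Charger'}]"
print(get_values(cell))  # Phone|Charger
```

Using '|'.join also removes the need for the trailing rsplit('|', 1) cleanup.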
I'm trying to remove string '$A' from column a array elements.
But the code below doesn't seem to work.
In the code below I'm trying to replace the '$A' string with an empty string (it doesn't work, though); really, I'd just like to delete that string.
df = pd.DataFrame({'a': [['$A','1'], ['$A', '3','$A'],[]], 'b': ['4', '5', '6']})
df['a'] = df['a'].replace({'$A': ''}, regex=True)
print(df['a'])
replace doesn't look inside list elements; you'll have to use a loop/apply in this case:
df['a'] = df.a.apply(lambda x: [s for s in x if s != '$A'])
df
# a b
#0 [1] 4
#1 [3] 5
#2 [] 6
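Worth noting: even on a column of plain strings (not lists), '$A' would need escaping under regex=True, because $ is a regex anchor. A small sketch:

```python
import re
import pandas as pd

s = pd.Series(['$A1', '3$A'])
# re.escape turns '$A' into the literal pattern '\$A'
cleaned = s.replace({re.escape('$A'): ''}, regex=True)
print(cleaned.tolist())  # ['1', '3']
```

So the original attempt had two problems at once: list elements are not searched, and the pattern itself would not match literally.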
I want to check if a column in a dataframe contains strings. I would have thought this could be done just by checking dtype, but that isn't the case. A pandas series that contains strings just has dtype 'object', which is also used for other data structures (like lists):
df = pd.DataFrame({'a': [1,2,3], 'b': ['Hello', '1', '2'], 'c': [[1],[2],[3]]})
print(df['a'].dtype)
print(df['b'].dtype)
print(df['c'].dtype)
Produces:
int64
object
object
Is there some way of checking if a column contains only strings?
You can use this to see if all elements in a column are strings
df.applymap(type).eq(str).all()
a False
b True
c False
dtype: bool
To just check if any are strings
df.applymap(type).eq(str).any()
You could map the data with a function that tests whether each element is of type str, then check whether the result contains any False elements.
The example below tests a list containing an element that is not a str; it yields True if data of another type is present:
test = [1, 2, '3']
False in map((lambda x: type(x) == str), test)
Output: True
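Another option worth knowing about is pandas' own type inference, which reports what an object-dtype column actually holds. A sketch using infer_dtype from pd.api.types:

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3], 'b': ['Hello', '1', '2'], 'c': [[1], [2], [3]]})

# infer_dtype inspects the actual elements, not just the dtype label
print(pd.api.types.infer_dtype(df['a']))  # integer
print(pd.api.types.infer_dtype(df['b']))  # string
```

A column is all-strings exactly when infer_dtype returns 'string', which avoids scanning with applymap.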
At a bit of a loss despite much searching and experimentation...
Given this:
dictA = {'order': '1',
'char': {'glyph': 'A',
'case': 'upper',
'vowel': True}
}
dictB = {'order': '2',
'char': {'glyph': 'B',
'case': 'upper',
'vowel': False}
}
dictC = {'order': '3',
'char': {'glyph': 'C',
'case': 'upper',
'vowel': False}
}
dictD = {'order': '4',
'char': {'glyph': 'd',
'case': 'lower',
'vowel': False}
}
dictE = {'order': '5',
'char': {'glyph': 'e',
'case': 'lower',
'vowel': True}
}
letters = [dictA, dictB, dictC, dictD, dictE]
How do I turn letters into this? (first column is the index)
order char
glyph case vowel
0 1 A upper True
1 2 B upper False
2 3 C upper False
3 4 d lower False
4 5 e lower True
... and as a plus, then be able to operate on this frame to tally/plot the number of entries that are uppercase, the number that are vowels, etc.
Any ideas?
EDIT: My initial example was maybe too simple, but I'll leave it for posterity.
Given:
import re

class Glyph(dict):
    def __init__(self, glyph):
        super(Glyph, self).__init__()
        order = ord(glyph)
        self['glyph'] = glyph
        self['order'] = order
        kind = {'type': None}
        if re.search(r'\s+', glyph):
            kind = {'type': 'whitespace'}
        elif order in (list(range(ord('a'), ord('z') + 1)) +
                       list(range(ord('A'), ord('Z') + 1))):
            lowercase = glyph.lower()
            kind = {
                'type': lowercase,
                'vowel': lowercase in ['a', 'e', 'i', 'o', 'u'],
                'case': ['upper', 'lower'][lowercase == glyph],
                'number': ord(lowercase) - ord('a') + 1,
            }
        self['kind'] = kind

chars = [Glyph(x) for x in 'Hello World']
I can do this:
import pandas as pd
df = pd.DataFrame(chars) # dataframe where 'order' & 'glyph' are OK...
# unpack 'kind' Series into list of dicts and use those to make a table
kindDf = pd.DataFrame(data=[x for x in df['kind']])
My intuition would lead me to think I could then do this:
df['kind'] = kindDf
...But that only adds the first column of my kindDf and puts it under 'kind' in df. Next attempt:
df.pop('kind') # get rid of this column of dicts
joined = df.join(kindDf) # flattens 'kind'...
joined is so close! The trouble is I want those columns from kind to be under a 'kind' hierarchy, rather than flat (as the joined result is). I've tried stack/unstack magic, but I can't grasp it. Do I need a MultiIndex?
This gets you close on the first part:
## a list for storing properly formatted dataframes
container = []
for l in letters:
    ## loop through list of dicts, turn each into a dataframe,
    ## then add `order` to the index. Then make the dataframe wide using unstack
    temp = pd.DataFrame(data=l).set_index('order', append=True).unstack(level=[0])
    container.append(temp)

## throw all the dataframes together into one
result = pd.concat(container).reset_index()
result
  order   char
          case  glyph  vowel
0     1  upper      A   True
1     2  upper      B  False
2     3  upper      C  False
3     4  lower      d  False
4     5  lower      e   True
For the second part, you can just rely on groupby and then the built in plotting functions for quick visuals. Omit the plot call after size() if you just want to see the tally.
import matplotlib.pyplot as plt

ax = result.groupby(result.char.vowel).size().plot(kind='bar', figsize=[8, 6])
ax.set_title('Glyphs are awesome')
plt.show()
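As an aside, pd.json_normalize flattens the nested dicts in one call, and the dotted column names it produces can then be split into the desired two-level MultiIndex. A sketch, assuming a shortened version of the letters list from the question:

```python
import pandas as pd

dictA = {'order': '1', 'char': {'glyph': 'A', 'case': 'upper', 'vowel': True}}
dictB = {'order': '2', 'char': {'glyph': 'B', 'case': 'upper', 'vowel': False}}
letters = [dictA, dictB]

# flatten: columns come out as 'order', 'char.glyph', 'char.case', 'char.vowel'
df = pd.json_normalize(letters)

# split the dotted names into a two-level column MultiIndex
df.columns = pd.MultiIndex.from_tuples(
    [tuple(c.split('.')) if '.' in c else (c, '') for c in df.columns]
)
print(df)
```

This gives the same 'kind of column under a hierarchy' layout without per-row DataFrame construction.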