What to do when pandas column renaming creates column name duplicates - python

Why doesn't a pandas.DataFrame object complain when I rename a column if the new column name already exists?
This makes later references to that column return a pandas.DataFrame rather than a pandas.Series, which can cause further errors.
Secondly, is there a suggested way to handle such a situation?
Example:
import pandas as pd
df = pd.DataFrame( {'A' : ['foo','bar'] ,'B' : ['bar','foo'] } )
df.B.map( {'bar':'foo','foo':'bar'} )
# 0 foo
# 1 bar
# Name: B, dtype: object
df.rename(columns={'A':'B'},inplace=True)
Now, the following will fail:
df.B.map( {'bar':'foo','foo':'bar'} )
#AttributeError: 'DataFrame' object has no attribute 'map'

Let's say you had a dictionary mapping old columns to new column names. When renaming your DataFrame, you could use a dictionary comprehension to test if the new value v is already in the DataFrame:
df = pd.DataFrame({'a': [1, 2], 'b': [3, 4]})
d = {'a': 'B', 'b': 'B'}
df.rename(columns={k: v for k, v in d.items() if v not in df}, inplace=True)
>>> df
a B
0 1 3
1 2 4
df = pd.DataFrame({'a': [1, 2], 'b': [3, 4]})
d = {'a': 'b'}
df.rename(columns={k: v for k, v in d.items() if v not in df}, inplace=True)
>>> df
a b
0 1 3
1 2 4

Related

Create a dictionary from a list

I'm trying to create a dictionary in Python from this output:
["'a'=df2['a']", "'b'=df2['b']", "'c'=df2['c']", "'d'=df2['d']"]
I tried with this code:
list_columns = list(df2.columns)
list_dictionary = []
for row in list_columns:
    resultado = "'"+str(row)+"'" + "=" + "df2[" + "'" + row + "'" + "]"
    list_dictionary.append(resultado)
clean_list_dictionary = ','.join(list_dictionary).replace('"','')
dictionary = dict(clean_list_dictionary)
print(dictionary)
But I get an error:
ValueError: dictionary update sequence element #0 has length 1; 2 is required
Do you have any idea how I can make this work?
Thank you in advance!
Output dictionary should look like this:
{
'a' : df2['a'],
'b' : df2['b'],
'c' : df2['c'],
'd' : df2['d']
}
Method 1: Transforming your list of strings into a dictionary string for a later eval
As you have mentioned in your comment -
I would like to create a dictionary for with this format: ''' {'a' : df2['a'], 'b' : df2['b'], 'c' : df2['c'], 'd' : df2['d']} ''' I will use it as global variables in an eval() function.
You can use the following to convert your input string
#dummy dataframe
df2 = pd.DataFrame([[1,2,3,4]], columns=['a','b','c','d']) #Dummy dataframe
#your list of strings
l = ["'a'=df2['a']", "'b'=df2['b']", "'c'=df2['c']", "'d'=df2['d']"]
#Solution
def dict_string(l):
    s0 = [i.split('=') for i in l]
    s1 = '{' + ', '.join([': '.join([k,v]) for k,v in s0]) + '}'
    return s1
output = dict_string(l)
print(output)
eval(output)
#String before eval
{'a': df2['a'], 'b': df2['b'], 'c': df2['c'], 'd': df2['d']} #<----
#String after eval
{'a': 0 1
Name: a, dtype: int64,
'b': 0 2
Name: b, dtype: int64,
'c': 0 3
Name: c, dtype: int64,
'd': 0 4
Name: d, dtype: int64}
Method 2: Using eval as part of your iteration of the list of strings
Here is a way to do this using list comprehensions and eval, as part of the iteration on the list of strings itself. This will give you the final output that you would get if you were to use eval on the dictionary string you are expecting.
#dummy dataframe
df2 = pd.DataFrame([[1,2,3,4]], columns=['a','b','c','d']) #Dummy dataframe
#your list of strings
l = ["'a'=df2['a']", "'b'=df2['b']", "'c'=df2['c']", "'d'=df2['d']"]
#Solution
def eval_dict(l):
    s0 = [(eval(j) for j in i.split('=')) for i in l]
    s1 = {k:v for k,v in s0}
    return s1
output = eval_dict(l)
print(output)
{'a': 0 1
Name: a, dtype: int64,
'b': 0 2
Name: b, dtype: int64,
'c': 0 3
Name: c, dtype: int64,
'd': 0 4
Name: d, dtype: int64}
The output is a dict that has 4 keys, (a,b,c,d) and 4 corresponding values for columns a, b, c, d from df2 respectively.
You can loop over the list, split each item on the '=' character, and convert the result to a dict.
Code:
dic= {}
[dic.update(dict( [l.split('=')])) for l in ls]
dic
I think this is exactly what you want.
data = ["'a'=df2['a']", "'b'=df2['b']", "'c'=df2['c']", "'d'=df2['d']"]
dic = {}
for d in data:
    k = d.split("=")[0].strip("'")  # drop the quotes around the key
    v = df2[d.split("=")[1].split("'")[1]]
    dic.update({k: v})
print(dic)
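If the strings really must be parsed, a regular expression avoids both eval and chained split calls. This is only a sketch: it assumes every item has exactly the form 'key'=df2['column'], and the names pattern, items and result are illustrative.

```python
import re
import pandas as pd

# Dummy dataframe, same shape as in the answers above
df2 = pd.DataFrame([[1, 2, 3, 4]], columns=['a', 'b', 'c', 'd'])
items = ["'a'=df2['a']", "'b'=df2['b']", "'c'=df2['c']", "'d'=df2['d']"]

# Assumed format: 'key'=df2['column']
pattern = re.compile(r"^'(?P<key>[^']+)'=df2\['(?P<col>[^']+)'\]$")

result = {}
for item in items:
    match = pattern.match(item)
    if match is None:
        raise ValueError(f"unexpected item format: {item!r}")
    # Look the column up on the real DataFrame instead of evaluating the string
    result[match.group('key')] = df2[match.group('col')]

print(list(result))
```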
It's not clear what exactly you want to achieve.
If you have a pd.DataFrame and want to convert it to a dictionary whose keys are the column names and whose values are the columns, use df.to_dict('series').
import pandas as pd
# Generate the dataframe
data = {'a': [1, 2, 1, 0], 'b': [2, 3, 4, 5], 'c': [10, 11, 12, 13], 'd': [21, 22, 23, 24]}
df = pd.DataFrame.from_dict(data)
# Convert to dictionary
result = df.to_dict('series')
print(result)
If you have a list of strings that you need to convert to the desired output, then you should do it differently. What you have is the string 'df2', while df2 in your dict is a variable. So you only need to extract the column names and index the DataFrame variable (df here), not the string.
import pandas as pd
# Generate the dataframe
data = {'a': [1, 2, 1, 0], 'b': [2, 3, 4, 5], 'c': [10, 11, 12, 13], 'd': [21, 22, 23, 24]}
df = pd.DataFrame.from_dict(data)
# create string list
lst = ["'a'=df2['a']", "'b'=df2['b']", "'c'=df2['c']", "'d'=df2['d']"]
# Convert to dictionary
result = {}
for item in lst:
    key = item[1]
    result[key] = df[key]
print(result)
The results are the same, but in the second case the list of strings is created for no reason, because the first example achieves the same result without it.
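Since the goal stated in the question is a mapping of names to columns for use with eval(), the intermediate strings can be skipped entirely. A minimal sketch, assuming df2 is the DataFrame from the question; columns_by_name and the expression "a + b" are illustrative:

```python
import pandas as pd

df2 = pd.DataFrame([[1, 2, 3, 4]], columns=['a', 'b', 'c', 'd'])

# Build the name -> Series mapping directly from the columns
columns_by_name = {col: df2[col] for col in df2.columns}

# Pass a copy as the globals of eval(), because eval() adds
# '__builtins__' to the dict it receives
total = eval("a + b", dict(columns_by_name))
print(total)
```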

What to pass to aggfunc in pandas pivot table for summing up counters

I have a data table with two columns "A" and "B", and the elements in column "B" are counters. For example,
from collections import Counter
import pandas as pd
c = Counter(a=4, b=2)
df = pd.DataFrame({"A": ["group1", "group1", "group1", "group2", "group2"],
                   "B": [c, c, c, c, c]})
I would like to create a pivot table in which I group by the values in column "A" and aggregate column "B" by adding up the counters. What should I pass to aggfunc?
This is what I have tried, but sadly it does not work:
pt = pd.pivot_table(df, index = ['A'], values = ['B'], aggfunc = ['+'])
Any suggestions?
My expected output is
table
group1 Counter(a=12, b=6) # i.e., c+c+c
group2 Counter(a=8, b=4) # i.e., c+c
It's just sum.
>>> df.groupby('A')['B'].sum()
A
group1 {'a': 12, 'b': 6}
group2 {'a': 8, 'b': 4}
Name: B, dtype: object
Two notes:
Putting dictionaries into dataframe columns is usually not a good practice. I would use two columns to hold the value for 'a' and 'b' respectively.
"B": [c, c, c, c, c] initializes each element of column 'B' with the same counter object.
Demo:
>>> df.loc[0, 'B']['a'] = 100
>>> df
Out[9]:
A B
0 group1 {'a': 100, 'b': 2}
1 group1 {'a': 100, 'b': 2}
2 group1 {'a': 100, 'b': 2}
3 group2 {'a': 100, 'b': 2}
4 group2 {'a': 100, 'b': 2}
You might want "B": [c.copy() for _ in range(5)] - if you want to keep your original design at all, that is.
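If you keep the counters in the column, the per-group totals can also be computed in plain Python by iterating over the groups. This is a sketch that combines both notes above: each row gets its own Counter, and the built-in sum() is given an empty Counter as its start value. The name totals is illustrative.

```python
from collections import Counter
import pandas as pd

df = pd.DataFrame({
    "A": ["group1", "group1", "group1", "group2", "group2"],
    # One independent Counter per row, so editing one row leaves the others alone
    "B": [Counter(a=4, b=2) for _ in range(5)],
})

# Add up the counters of each group, starting from an empty Counter
totals = {name: sum(group["B"], Counter()) for name, group in df.groupby("A")}
print(totals)
```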

If item within list in pandas column is a dictionary key, replace with value, if not in dictionary, delete

If a pandas column contains a list, you can use a dictionary to convert all the values using
df['listColumn'] = df['listColumn'].apply(lambda x: [columnDictionary[i] for i in x])
However, there are cases where not all the items in a list are keys of the dictionary. In that case, how do you drop those items instead of replacing them?
For example
columnDictionary = {'a': 1, 'b': 2, 'd': 7, 'f': 8}
Specific Pandas row/column: [ a, b, c, d, e]
Specific Pandas row/column after conversion: [ 1, 2, 7]
With simple condition to check if a list value is in target dict keys list:
In [47]: df = pd.DataFrame({'listColumn': ['a', 123, list('abcde')]})
In [48]: repl_dict = {'a':1, 'b':2, 'd':7, 'f':8 }
In [49]: df['listColumn'].apply(lambda x: [repl_dict[v] for v in x if v in repl_dict] if isinstance(x, list) else x)
Out[49]:
0 a
1 123
2 [1, 2, 7]
Name: listColumn, dtype: object
Use "if else" inside the lambda function:
Method 1: apply lambda on columns, below on one column only ( axis = 0 )
# apply lambda on 1 column (axis = 0)
d = {'col1':[ 'a', 'b', 'c', 'd', 'e']}
df = pd.DataFrame(data=d)
columnDictionary ={'a':1, 'b':2, 'd':7, 'f':8 }
df['col1'] = df['col1'].apply(lambda x: [columnDictionary[x] if x in columnDictionary else ''])
df
Method 2: apply lambda on rows (axis = 1), row by row (I think it is slower)
d = {'col1':[ 'a', 'b', 'c', 'd', 'e']}
df = pd.DataFrame(data=d)
columnDictionary ={'a':1, 'b':2, 'd':7, 'f':8 }
df['listColumn'] = df.apply(lambda x: [columnDictionary[i] if i in columnDictionary else '' for i in x],axis=1)
df
Result :
col1 listColumn
0 a [1]
1 b [2]
2 c []
3 d [7]
4 e []
There is a built-in function to check whether something is a list: isinstance(mydata, list), which returns True or False accordingly.
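Putting the pieces from these answers together, the filter and the isinstance check can live in a small named function. This is a sketch only; replace_known and the sample data are made up for illustration.

```python
import pandas as pd

def replace_known(items, mapping):
    """Map list items through `mapping`, dropping items that are not keys."""
    if not isinstance(items, list):
        return items  # leave non-list cells untouched
    return [mapping[i] for i in items if i in mapping]

repl_dict = {'a': 1, 'b': 2, 'd': 7, 'f': 8}
df = pd.DataFrame({'listColumn': [list('abcde'), ['f', 'x'], []]})

df['listColumn'] = df['listColumn'].apply(lambda items: replace_known(items, repl_dict))
print(df)
```

Unlike the "if else" variants, which put an empty string in place of a missing key, this version removes the item from the list.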

Select columns of pandas dataframe using a dictionary list value

I have column names in a dictionary and would like to select those columns from a dataframe.
In the example below, how do I select dictionary values 'b', 'c' and save it in to df1?
import pandas as pd
ds = {'cols': ['b', 'c']}
d = {'a': [2, 3], 'b': [3, 4], 'c': [4, 5]}
df_in = pd.DataFrame(data=d)
print(ds)
print(df_in)
df_out = df_in[[ds['cols']]]
print(df_out)
TypeError: unhashable type: 'list'
Remove nested list - []:
df_out = df_in[ds['cols']]
print(df_out)
b c
0 3 4
1 4 5
According to the reference, you just need to drop one set of brackets.
df_out = df_in[ds['cols']]

Prevent Pandas from unpacking a tuple when creating a dataframe from dict

When creating a DataFrame in Pandas from a dictionary, a tuple is automatically expanded, i.e.
import pandas
d = {'a': 1, 'b': 2, 'c': (3,4)}
df = pandas.DataFrame.from_dict(d)
print(df)
returns
a b c
0 1 2 3
1 1 2 4
Apart from converting the tuple to string first, is there any way to prevent this from happening? I would want the result to be
a b c
0 1 2 (3, 4)
Try adding [], so that the value for key c is a list containing the tuple:
import pandas
d = {'a': 1, 'b': 2, 'c': [(3,4)]}
df = pandas.DataFrame.from_dict(d)
print(df)
a b c
0 1 2 (3, 4)
Pass param orient='index' and transpose the result so it doesn't broadcast the scalar values:
In [13]:
d = {'a': 1, 'b': 2, 'c': (3,4)}
df = pd.DataFrame.from_dict(d, orient='index').T
df
Out[13]:
a c b
0 1 (3, 4) 2
To handle the situation where the first dict entry is a tuple, you'd need to enclose all the dict values into a list so it's iterable:
In [20]:
d = {'a': (5,6), 'b': 2, 'c': 1}
d1 = dict(zip(d.keys(), [[x] for x in d.values()]))
pd.DataFrame.from_dict(d1, orient='index').T
Out[23]:
a b c
0 (5, 6) 2 1
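Another option, sketched here as an alternative to the approaches above, is to treat the dictionary as a single record and pass a one-element list to the DataFrame constructor. Each dict in the list becomes one row, so the tuple stays in a single cell regardless of which key holds it.

```python
import pandas as pd

d = {'a': 1, 'b': 2, 'c': (3, 4)}

# A list of dicts is read as a list of rows, so the tuple is not expanded
df = pd.DataFrame([d])
print(df)
```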
