Why can pandas DataFrames change each other? - python

I'm trying to keep a copy of a pandas DataFrame so that I can modify it while preserving the original. But when I modify the copy, the original DataFrame changes too. For example:
df1=pd.DataFrame({'col1':['a','b','c','d'],'col2':[1,2,3,4]})
df1
col1 col2
a 1
b 2
c 3
d 4
df2=df1
df2['col2']=df2['col2']+1
df1
col1 col2
a 2
b 3
c 4
d 5
I set df2 equal to df1, then when I modified df2, df1 also changed. Why is this and is there any way to save a "backup" of a pandas DataFrame without it being modified?

This is much deeper than dataframes: you are thinking about Python variables the wrong way. Python variables are pointers, not buckets. That is to say, when you write
>>> y = [1, 2, 3]
You are not putting [1, 2, 3] into a bucket called y; rather you are creating a pointer named y which points to [1, 2, 3].
When you then write
>>> x = y
you are not putting the contents of y into a bucket called x; you are creating a pointer named x which points to the same thing that y points to. Thus:
>>> x[1] = 100
>>> print(y)
[1, 100, 3]
Because x and y point to the same object, modifying it via one pointer modifies it for the other pointer as well. If you'd like to point to a copy instead, you need to create the copy explicitly. With lists you can do it like this:
>>> y = [1, 2, 3]
>>> x = y[:]
>>> x[1] = 100
>>> print(y)
[1, 2, 3]
With DataFrames, you can create a copy with the copy() method:
>>> df2 = df1.copy()

You need to make a copy:
df2 = df1.copy()
df2['col2'] = df2['col2'] + 1
print(df1)
Output:
col1 col2
0 a 1
1 b 2
2 c 3
3 d 4
You just create a second name for df1 with df2 = df1.

When you set a data frame equal to another it keeps the same location for its data in the computer's memory. This means if you change one value in the new data frame it will change that value in the old one. To fix this you should make a copy of it instead of just making it equal to the original. Example : df2 = df1.copy()
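The difference between a second name and a real copy can be checked directly. The following is a minimal sketch that reuses the question's data; the names alias and backup are only illustrative.

```python
import pandas as pd

df1 = pd.DataFrame({'col1': ['a', 'b', 'c', 'd'], 'col2': [1, 2, 3, 4]})

alias = df1          # a second name for the same object
backup = df1.copy()  # an independent copy (copy() is deep by default)

print(alias is df1)   # True: both names refer to one DataFrame
print(backup is df1)  # False: copy() created a new object

alias['col2'] = alias['col2'] + 1  # the change is visible through df1 as well

print(df1['col2'].tolist())     # [2, 3, 4, 5]
print(backup['col2'].tolist())  # [1, 2, 3, 4]
```

The `is` operator compares object identity, so it is a quick way to tell whether two names refer to the same DataFrame.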

Related

Iterating over lists produces unexpected results

In the first example below, I am iterating over a list of dataframes. The For loop creates column 'c'. Printing each df shows that both elements in the list were updated.
In the second example, I am iterating over a list of variables. The for loop applies some math to each element. But when printing, the list does not reflect the changes made in the for loop.
Please help me to understand why the elements in the second example are not being impacted by the For loop, like they are in the first example.
import pandas as pd
df1 = pd.DataFrame([[1,2],[3,4]], columns=['a', 'b'])
df2 = pd.DataFrame([[3,4],[5,6]], columns=['a', 'b'])
dfs = [df1, df2]
for df in dfs:
    df['c'] = df['a'] + df['b']
print(df1)
print(df2)
result:
a b c
0 1 2 3
1 3 4 7
a b c
0 3 4 7
1 5 6 11
Second example:
a, b = 2, 3
test = [a, b]
for x in test:
    x = x * 2
print(test)
result: [2, 3]
expected result: [4, 6]
In your second example, test is a list of ints which are not mutable. If you want a similar effect to your first snippet, you will have to store something mutable in your list:
a, b = 2, 3
test = [[a], [b]]
for x in test:
    x[0] = x[0] * 2
print(test)
Output: [[4], [6]]
When you iterate over a list like this, x takes the value at the current position.
for x in test:
    x = x * 2
When you try to assign a new value to x you are not changing the element in the list, you are changing what the variable x contains.
To change the actual value in the list iterate by index:
for i in range(len(test)):
    test[i] = test[i] * 2
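The same idea can be written with enumerate or a list comprehension. This is a small sketch, not the only way to do it; the variable names are illustrative.

```python
test = [2, 3]

# Rebinding the loop variable leaves the list untouched.
for x in test:
    x = x * 2
unchanged = list(test)

# Option 1: assign back through the index.
for i, value in enumerate(test):
    test[i] = value * 2
doubled_in_place = list(test)

# Option 2: build a new list with a comprehension.
doubled_again = [value * 2 for value in test]

print(unchanged)         # [2, 3]
print(doubled_in_place)  # [4, 6]
print(doubled_again)     # [8, 12]
```

Option 1 changes the existing list object, so every other name that refers to it sees the new values. Option 2 leaves the original list alone and returns a new one.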

add a list into panda data frame cell [duplicate]

I have a list 'abc' and a dataframe 'df':
abc = ['foo', 'bar']
df =
A B
0 12 NaN
1 23 NaN
I want to insert the list into cell 1B, so I want this result:
A B
0 12 NaN
1 23 ['foo', 'bar']
How can I do that?
1) If I use this:
df.ix[1,'B'] = abc
I get the following error message:
ValueError: Must have equal len keys and value when setting with an iterable
because it tries to insert the list (that has two elements) into a row / column but not into a cell.
2) If I use this:
df.ix[1,'B'] = [abc]
then it inserts a list that has only one element that is the 'abc' list ( [['foo', 'bar']] ).
3) If I use this:
df.ix[1,'B'] = ', '.join(abc)
then it inserts a string: ( foo, bar ) but not a list.
4) If I use this:
df.ix[1,'B'] = [', '.join(abc)]
then it inserts a list but it has only one element ( ['foo, bar'] ) but not two as I want ( ['foo', 'bar'] ).
Thanks for help!
EDIT
My new dataframe and the old list:
abc = ['foo', 'bar']
df2 =
A B C
0 12 NaN 'bla'
1 23 NaN 'bla bla'
Another dataframe:
df3 =
A B C D
0 12 NaN 'bla' ['item1', 'item2']
1 23 NaN 'bla bla' [11, 12, 13]
I want to insert the 'abc' list into df2.loc[1,'B'] and/or df3.loc[1,'B'].
If the dataframe has columns only with integer values and/or NaN values and/or list values then inserting a list into a cell works perfectly. If the dataframe has columns only with string values and/or NaN values and/or list values then inserting a list into a cell works perfectly. But if the dataframe has columns with integer and string values and other columns then the error message appears if I use this: df2.loc[1,'B'] = abc or df3.loc[1,'B'] = abc.
Another dataframe:
df4 =
A B
0 'bla' NaN
1 'bla bla' NaN
These inserts work perfectly: df.loc[1,'B'] = abc or df4.loc[1,'B'] = abc.
Since set_value has been deprecated since version 0.21.0, you should now use at. It can insert a list into a cell without raising a ValueError as loc does. I think this is because at always refers to a single value, while loc can refer to values as well as rows and columns.
df = pd.DataFrame(data={'A': [1, 2, 3], 'B': ['x', 'y', 'z']})
df.at[1, 'B'] = ['m', 'n']
df =
A B
0 1 x
1 2 [m, n]
2 3 z
You also need to make sure the column you are inserting into has dtype=object. For example
>>> df = pd.DataFrame(data={'A': [1, 2, 3], 'B': [1,2,3]})
>>> df.dtypes
A int64
B int64
dtype: object
>>> df.at[1, 'B'] = [1, 2, 3]
ValueError: setting an array element with a sequence
>>> df['B'] = df['B'].astype('object')
>>> df.at[1, 'B'] = [1, 2, 3]
>>> df
A B
0 1 1
1 2 [1, 2, 3]
2 3 3
Pandas >= 0.21
set_value has been deprecated. You can now use DataFrame.at to set by label, and DataFrame.iat to set by integer position.
Setting Cell Values with at/iat
# Setup
>>> df = pd.DataFrame({'A': [12, 23], 'B': [['a', 'b'], ['c', 'd']]})
>>> df
A B
0 12 [a, b]
1 23 [c, d]
>>> df.dtypes
A int64
B object
dtype: object
If you want to set a value in second row of the "B" column to some new list, use DataFrame.at:
>>> df.at[1, 'B'] = ['m', 'n']
>>> df
A B
0 12 [a, b]
1 23 [m, n]
You can also set by integer position using DataFrame.iat
>>> df.iat[1, df.columns.get_loc('B')] = ['m', 'n']
>>> df
A B
0 12 [a, b]
1 23 [m, n]
What if I get ValueError: setting an array element with a sequence?
I'll try to reproduce this with:
>>> df
A B
0 12 NaN
1 23 NaN
>>> df.dtypes
A int64
B float64
dtype: object
>>> df.at[1, 'B'] = ['m', 'n']
# ValueError: setting an array element with a sequence.
This is because your column is of float64 dtype, whereas lists are objects, so there's a mismatch. In this situation you have to convert the column to object first.
>>> df['B'] = df['B'].astype(object)
>>> df.dtypes
A int64
B object
dtype: object
Then, it works:
>>> df.at[1, 'B'] = ['m', 'n']
>>> df
A B
0 12 NaN
1 23 [m, n]
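The two steps above (convert the column to object, then assign with at) can be wrapped in a small function. This is only a sketch: set_list_cell is a hypothetical helper name, not part of pandas, and it assumes the row label already exists in a unique index.

```python
import numpy as np
import pandas as pd

def set_list_cell(df, row_label, column, values):
    """Store `values` as one cell (hypothetical helper, not a pandas API)."""
    if df[column].dtype != object:
        # A float64/int64 column cannot hold a Python list, so convert first.
        df[column] = df[column].astype(object)
    # .at addresses exactly one cell, so the list is stored as a single value.
    df.at[row_label, column] = values
    return df

df = pd.DataFrame({'A': [12, 23], 'B': [np.nan, np.nan]})
set_list_cell(df, 1, 'B', ['foo', 'bar'])

print(df.at[1, 'B'])
print(df['B'].dtype)
```

The helper modifies the DataFrame that is passed in; call it on df.copy() if the original must stay untouched.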
Possible, But Hacky
Even more wacky, I've found that you can hack through DataFrame.loc to achieve something similar if you pass nested lists.
>>> df.loc[1, 'B'] = [['m'], ['n'], ['o'], ['p']]
>>> df
A B
0 12 [a, b]
1 23 [m, n, o, p]
You can read more about why this works here.
df3.set_value(1, 'B', abc) works for any dataframe (note that set_value has been deprecated since pandas 0.21.0). Take care of the data type of column 'B'. For example, a list cannot be inserted into a float column; in that case df['B'] = df['B'].astype(object) can help.
Quick work around
Simply enclose the list within a new list, as done for col2 in the data frame below. The reason it works is that python takes the outer list (of lists) and converts it into a column as if it were containing normal scalar items, which is lists in our case and not normal scalars.
mydict={'col1':[1,2,3],'col2':[[1, 4], [2, 5], [3, 6]]}
data=pd.DataFrame(mydict)
data
col1 col2
0 1 [1, 4]
1 2 [2, 5]
2 3 [3, 6]
Also getting
ValueError: Must have equal len keys and value when setting with an iterable,
using .at rather than .loc did not make any difference in my case, but enforcing the datatype of the dataframe column did the trick:
df['B'] = df['B'].astype(object)
Then I could set lists, numpy array and all sorts of things as single cell values in my dataframes.
As mentioned in the post "pandas: how to store a list in a dataframe?", the dtypes in the dataframe may influence the result, as may the way the assignment is made.
I've got a solution that's pretty simple to implement.
Make a temporary class just to wrap the list object and later call the value from the class.
Here's a practical example:
Let's say you want to insert list object into the dataframe.
df = pd.DataFrame([
{'a': 1},
{'a': 2},
{'a': 3},
])
df.loc[:, 'b'] = [
[1,2,4,2,],
[1,2,],
[4,5,6]
] # This works. Because the list has the same length as the rows of the dataframe
df.loc[:, 'c'] = [1,2,4,5,3] # This does not work.
# ValueError: Must have equal len keys and value when setting with an iterable
## To force pandas to have list as value in each cell, wrap the list with a temporary class.
class Fake(object):
    def __init__(self, li_obj):
        self.obj = li_obj
df.loc[:, 'c'] = Fake([1,2,5,3,5,7,]) # This works.
df.c = df.c.apply(lambda x: x.obj) # Now extract the value from the class. This works.
Creating a fake class to do this might look like a hassle, but it can have some practical applications. For example, you can use this with apply when the return value is a list.
Pandas would normally refuse to insert list into a cell but if you use this method, you can force the insert.
I prefer .at and .loc. Note that the target column needs a dtype (object) that can hold a list.
import numpy as np
import pandas as pd
df = pd.DataFrame({
'A': [0, 1, 2, 3],
'B': np.array([np.nan]*3 + [[3, 33]], dtype=object),
})
print('df to start with:', df, '\ndtypes:', df.dtypes, sep='\n')
df.at[0, 'B'] = [0, 100] # at assigns a single element
df.loc[1, 'B'] = [[ [1, 11] ]] # loc expects 2d input
print('df modified:', df, '\ndtypes:', df.dtypes, sep='\n')
output
df to start with:
A B
0 0 NaN
1 1 NaN
2 2 NaN
3 3 [3, 33]
dtypes:
A int64
B object
dtype: object
df modified:
A B
0 0 [0, 100]
1 1 [[1, 11]]
2 2 NaN
3 3 [3, 33]
dtypes:
A int64
B object
dtype: object
First set the cell to blank. Next, use at to assign the abc list to the cell at 1, 'B':
abc = ['foo', 'bar']
df =pd.DataFrame({'A':[12,23],'B':[np.nan,np.nan]})
df.loc[1,'B']=''
df.at[1,'B']=abc
print(df)

Min of Str Column in Pandas

I have a dataframe where one column contains a list of values, e.g.
dict = {'a' : [0, 1, 2], 'b' : [4, 5, 6]}
df = pd.DataFrame(dict)
df.loc[:, 'c'] = -1
df['c'] = df.apply(lambda x: [x.a, x.b], axis=1)
So I get:
a b c
0 0 4 [0, 4]
1 1 5 [1, 5]
2 2 6 [2, 6]
I now would like to save the minimum value of each entry of column c in a new column d, which should give me the following data frame:
a b c d
0 0 4 [0, 4] 0
1 1 5 [1, 5] 1
2 2 6 [2, 6] 2
Somehow though I always fail to do it with min() or similar. Right now I am using df.apply(lambda x: min(x['c']), axis=1), but that is too slow in my case. Do you know of a faster way of doing it?
Thanks!
You can get help from numpy:
import numpy as np
df['d'] = np.array(df['c'].tolist()).min(axis=1)
As stated in the comments, if you don't need the column c then:
df['d'] = df[['a','b']].min(axis=1)
Remember that series (like df['c']) are iterable. You can then create a new list and set it as a key, just like you would a dictionary. The list will automatically be cast to a pd.Series object. No need to use fancy pandas functions unless you are dealing with really (really) big data.
df['d'] = [min(c) for c in df['c']]
Edit: update to comments below
df['d'] = [min(c, key=lambda v: v - df.a) for c in df['c']]
This doesn't work because v is a single value (in the first iteration it is 0, then 4, for example), while df.a is a Series. v - df.a is therefore a new Series with the elements [v - df.a[0], v - df.a[1], ...]. min then tries to compare these Series keys, which makes no sense: the comparison yields a Series of booleans ([True, False, ...] or similar) rather than a single truth value, and pandas raises an error. What you need is
df['d'] = [min(c, key=lambda v: v - df['a'][i]) for i, c in enumerate(df['c'])]
# I prefer to use df['a'] rather than df.a
so you take each value of df['a'] in turn from v, not the entire series df['a'].
However, subtracting the same value from every candidate in a row does not change which one is the minimum, so this key has no effect; I'm guessing this is simplified from what you are actually doing. As written, it gives the same result as the plain min(c) version above.
This is a functional solution.
df['d'] = list(map(min, df['c']))
It works because:
df['c'] is a pd.Series, which is an iterable object.
map is a lazy operator which applies a function to each element of an iterable.
Since map is lazy, we must apply list in order to assign to a series.
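The approaches from the answers above can be compared side by side. This sketch rebuilds the question's frame; the NumPy variant assumes every list in column c has the same length, which holds here but is not guaranteed in general.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [0, 1, 2], 'b': [4, 5, 6]})
df['c'] = [[a, b] for a, b in zip(df['a'], df['b'])]

via_comprehension = [min(c) for c in df['c']]
via_map = list(map(min, df['c']))
# Assumes equal-length lists, so the column converts to a 2-D array.
via_numpy = np.array(df['c'].tolist()).min(axis=1)

df['d'] = via_comprehension
print(df['d'].tolist())  # [0, 1, 2]
```

All three produce the same values for this data. Which one is fastest depends on the size and shape of the data, so it is worth timing them on a realistic sample.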

Pandas - Sorting By Column

I have a pandas data frame known as "df":
x y
0 1 2
1 2 4
2 3 8
I am splitting it up into two frames, and then trying to merge back together:
df_1 = df[df['x']==1]
df_2 = df[df['x']!=1]
My goal is to get it back in the same order, but when I concat, I am getting the following:
frames = [df_1, df_2]
solution = pd.concat(frames)
solution.sort_values(by='x', inplace=False)
x y
1 2 4
2 3 8
0 1 2
The problem is I need the 'x' values to go back into the new dataframe in the same order that I extracted. Is there a solution?
use .loc to specify the order you want. Choose the original index.
solution.loc[df.index]
Or, if you trust the index values in each component, then
solution.sort_index()
setup
df = pd.DataFrame([[1, 2], [2, 4], [3, 8]], columns=['x', 'y'])
df_1 = df[df['x']==1]
df_2 = df[df['x']!=1]
frames = [df_1, df_2]
solution = pd.concat(frames)
Try this:
In [14]: pd.concat([df_1, df_2.sort_values('y')])
Out[14]:
x y
0 1 2
1 2 4
2 3 8
When you sort the solution using
solution.sort_values(by='x', inplace=False)
the sorted frame is returned and solution itself is left unchanged. Either specify inplace=True or assign the result back, e.g. solution = solution.sort_values(by='x').
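A short sketch of the difference, using made-up data so that the unsorted and sorted orders are easy to tell apart:

```python
import pandas as pd

df = pd.DataFrame({'x': [3, 1, 2], 'y': [8, 2, 4]})

df.sort_values(by='x')              # result discarded; df is unchanged
print(df['x'].tolist())             # [3, 1, 2]

sorted_df = df.sort_values(by='x')  # keep the returned frame
print(sorted_df['x'].tolist())      # [1, 2, 3]

df.sort_values(by='x', inplace=True)  # or sort df itself
print(df['x'].tolist())               # [1, 2, 3]
```

Note that sorting by x only restores the original order when the original frame happened to be ordered by x.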
Based on these assumptions on df:
Columns x and y are not necessarily ordered.
The index is ordered.
Just order your result by index:
df = pd.DataFrame({'x': [1, 2, 3], 'y': [2, 4, 8]})
df_1 = df[df['x']==1]
df_2 = df[df['x']!=1]
frames = [df_2, df_1]
solution = pd.concat(frames).sort_index()
Now, solution looks like this:
x y
0 1 2
1 2 4
2 3 8
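The two approaches from the answers, sort_index() and .loc[df.index], differ when the original index is not sorted. The sketch below uses a deliberately unsorted, hypothetical index to show this.

```python
import pandas as pd

# Hypothetical data: the index is deliberately not in sorted order.
df = pd.DataFrame({'x': [1, 2, 3], 'y': [2, 4, 8]}, index=[10, 5, 7])

df_1 = df[df['x'] == 1]
df_2 = df[df['x'] != 1]
solution = pd.concat([df_2, df_1])   # rows now in the order 5, 7, 10

restored = solution.loc[df.index]    # back to the original order 10, 5, 7
by_index = solution.sort_index()     # 5, 7, 10: sorted, not the original order

print(restored.equals(df))   # True
print(by_index.equals(df))   # False
```

So sort_index() is enough when the original index was already sorted, while .loc[df.index] restores the original order in either case, provided the index labels are unique.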
