How to create a copy of an existing DataFrame (pandas)? - python

I have just started exploring pandas. I tried applying logarithmic scaling to a DataFrame column without affecting the source DataFrame. I passed the existing DataFrame (data_source) to the DataFrame constructor, thinking that it would create a copy.
data_source = pd.read_csv("abc.csv")
log_data = pd.DataFrame(data = data_source).apply(lambda x: np.log(x + 1))
I think it works properly, but is it a recommended/correct way of applying scaling on a copied DataFrame? How is it different from the DataFrame.copy method?

pd.DataFrame(data = data_source) does not make a copy. This is documented in the docs for the copy argument to the constructor:
copy : boolean, default False
Copy data from inputs. Only affects DataFrame / 2d ndarray input
This is also easily observed by trying to mutate the result:
>>> x = pandas.DataFrame({'x': [1, 2, 3], 'y': [1., 2., 3.]})
>>> y = pandas.DataFrame(x)
>>> x
   x    y
0  1  1.0
1  2  2.0
2  3  3.0
>>> y
   x    y
0  1  1.0
1  2  2.0
2  3  3.0
>>> y.iloc[0, 0] = 2
>>> x
   x    y
0  2  1.0
1  2  2.0
2  3  3.0
If you want a copy, call the copy method. You don't need a copy, though. apply already returns a new dataframe, and better yet, you can call numpy.log or numpy.log1p on dataframes directly:
>>> x = pandas.DataFrame({'x': [1, 2, 3], 'y': [1., 2., 3.]})
>>> numpy.log1p(x)
          x         y
0  0.693147  0.693147
1  1.098612  1.098612
2  1.386294  1.386294
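For completeness: if you do want the constructor itself to copy its input, the copy argument quoted from the docs above can be set explicitly. A minimal sketch:
>>> x = pandas.DataFrame({'x': [1, 2, 3], 'y': [1., 2., 3.]})
>>> y = pandas.DataFrame(x, copy=True)  # force a copy at construction time
>>> y.iloc[0, 0] = 2
>>> x.iloc[0, 0]  # the original is unchanged
1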

DataFrame.apply, DataFrame.applymap and np.log do not change the original data, so it is not necessary to call copy(). Also, np.log accepts arrays, so in this particular case it would be better to write:
log_data = pd.DataFrame(np.log(data_source.values + 1),
                        columns=data_source.columns,
                        index=data_source.index)
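Note that NumPy ufuncs applied to a whole DataFrame return a new DataFrame with the index and columns preserved, so the manual reconstruction above can be skipped. A minimal sketch, assuming the same abc.csv as in the question:
import numpy as np
import pandas as pd

data_source = pd.read_csv("abc.csv")  # file from the question
log_data = np.log1p(data_source)      # log(x + 1); index and columns carry over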


nested list comprehension to populate dataframe

Objective: compute some bivariate function, e.g. f(x,y) = sin(x^2 + y^2), for x ∈ [-1,1] and y ∈ [-1,1], and stick the values in a dataframe.
What I have...
def sunbrero(x, y):
    return np.sin(x**2 + y**2)

lower = -1
upper = 1
length = 1000
X = np.linspace(lower, upper, num=length)
Y = np.linspace(lower, upper, num=length)
Z = pd.DataFrame(index=X, columns=Y)
# [[sunbrero(x,y) for x in X] for y in Y]
for y in Y:
    Z[y] = [sunbrero(x, y) for x in X]
What I'm hoping to do is something that replaces...
for y in Y:
    Z[y] = [sunbrero(x, y) for x in X]
...with something like...
[[Z[y] = sunbrero(x,y) for x in X] for y in Y]
But obviously the above doesn't work.
I know that this works...
Z = [[sunbrero(x,y) for x in X] for y in Y]
...but it creates a list of lists rather than a dataframe.
Note 1: if others think a 2D array is more sensible than a dataframe, I'm open to that as well.
Note 2: I don't think lambda functions work, as they only allow one variable to be defined. Happy to be corrected.
I think the more Panda-esque way of doing this would be to calculate the values first and put them into a dataframe afterwards, not vice versa. Performing the calculations in a list comprehension does not put the internal vector optimizations of Numpy and Pandas to good use.
Instead, you can make use of Numpy's broadcasting to get the matrix first:
length = 5
X = np.linspace(lower, upper, num=length)
Y = np.linspace(lower, upper, num=length)
result = sunbrero(X[:, None], Y)
array([[0.90929743, 0.94898462, 0.84147098, 0.94898462, 0.90929743],
[0.94898462, 0.47942554, 0.24740396, 0.47942554, 0.94898462],
[0.84147098, 0.24740396, 0. , 0.24740396, 0.84147098],
[0.94898462, 0.47942554, 0.24740396, 0.47942554, 0.94898462],
[0.90929743, 0.94898462, 0.84147098, 0.94898462, 0.90929743]])
and put that in a dataframe like so:
df = pd.DataFrame(result, index=X, columns=Y)
          -1.0      -0.5       0.0       0.5       1.0
-1.0  0.909297  0.948985  0.841471  0.948985  0.909297
-0.5  0.948985  0.479426  0.247404  0.479426  0.948985
 0.0  0.841471  0.247404  0.000000  0.247404  0.841471
 0.5  0.948985  0.479426  0.247404  0.479426  0.948985
 1.0  0.909297  0.948985  0.841471  0.948985  0.909297
You're almost there:
df = pd.DataFrame([[sunbrero(x,y) for x in X] for y in Y])
You can do your list comprehension, then have pandas create a dataframe from a list of lists, for example:
list_of_lists = [[1, 2, 3], [4, 5, 6]]
df = pd.DataFrame(list_of_lists)
to get
   0  1  2
0  1  2  3
1  4  5  6
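If you also want labeled axes as in the original Z, the index and columns can be passed to the constructor. Note the outer comprehension runs over Y, so rows are labeled by Y here (the transpose of the original Z, although sunbrero happens to be symmetric). A sketch:
df = pd.DataFrame([[sunbrero(x, y) for x in X] for y in Y],
                  index=Y, columns=X)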

Filling an empty dataframe by assignment via loc selection with tuple indices

Why does this work?
a=pd.DataFrame()
a.loc[1,2]=0
>
     2
1  0.0
And, this does not work?
a=pd.DataFrame()
a.loc[(1,2),2]=0
>
KeyError: '[1 2] not in index'
The latter is what I would like to do: filling the values by assignment via loc selection with a tuple-specified index, starting from a dataframe with no values, 0 rows, and 0 columns.
Using a tuple as index will work if your dataframe already has a multi-index:
import pandas as pd
# Define multi-index
index = pd.MultiIndex.from_product([[],[]], names=['first', 'second'])
# or
# index = pd.MultiIndex.from_tuples([], names=['first', 'second'])
a = pd.DataFrame(index=index)
a.loc[(1,2), 2]=0
# 2
# first second
# 1.0 2.0 0.0
I like Julien's answer, as it feels less like magic. All of these are efforts to set a two-level MultiIndex.
set_index with empty arrays
i = np.array([])
a = pd.DataFrame().set_index([i, i])
a.loc[(1, 2), 2] = 0
a
2
1.0 2.0 0.0
Slightly more concise
a = pd.DataFrame().set_index([np.array([])] * 2)
a.loc[(1, 2), 2] = 0
pd.concat
a = pd.concat([pd.DataFrame()] * 2, keys=[1, 2])
a.loc[(1, 2), 2] = 0
a
2
1 2 0.0
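Putting the MultiIndex approach together, a minimal end-to-end sketch (the column name 'value' and the assigned numbers are illustrative):
import pandas as pd

index = pd.MultiIndex.from_tuples([], names=['first', 'second'])
a = pd.DataFrame(index=index)
# with a MultiIndex in place, a tuple addresses a (first, second) row key
a.loc[(1, 2), 'value'] = 0.0
a.loc[(3, 4), 'value'] = 1.5
print(a)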

replace empty list with NaN in pandas dataframe

I'm trying to replace some empty lists in my data with NaN values. But how do I represent an empty list in the expression?
import numpy as np
import pandas as pd
d = pd.DataFrame({'x' : [[1,2,3], [1,2], ["text"], []], 'y' : [1,2,3,4]})
d
           x  y
0  [1, 2, 3]  1
1     [1, 2]  2
2     [text]  3
3         []  4
d.loc[d['x'] == [],['x']] = d.loc[d['x'] == [],'x'].apply(lambda x: np.nan)
d
ValueError: Arrays were different lengths: 4 vs 0
Also, I want to select [text] using d[d['x'] == ["text"]], but it raises ValueError: Arrays were different lengths: 4 vs 1, whereas selecting 3 with d[d['y'] == 3] works. Why?
If you wish to replace empty lists in the column x with numpy nan's, you can do the following:
d.x = d.x.apply(lambda y: np.nan if len(y)==0 else y)
If you want to subset the dataframe on rows equal to ['text'], try the following:
d[[y==['text'] for y in d.x]]
I hope this helps.
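As an aside, the same row filter can be written with Series.apply instead of a list comprehension; a sketch:
mask = d['x'].apply(lambda v: v == ['text'])
d[mask]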
You can use apply to match a specified cell value regardless of whether it is a string, a list, or another type. For example, in your case:
import pandas as pd
d = pd.DataFrame({'x' : [[1,2,3], [1,2], ["text"], []], 'y' : [1,2,3,4]})
d
           x  y
0  [1, 2, 3]  1
1     [1, 2]  2
2     [text]  3
3         []  4
if you use d == 3 to select the cell whose value is 3, it's totally ok:
       x      y
0  False  False
1  False  False
2  False   True
3  False  False
However, if you use the equal sign to match a list, such as d == ['text'], the comparison raises an error instead of matching elementwise. There are some solutions:
Use apply() on the specified Series in your DataFrame, as in the answer at the top. A more general method is applymap() on the whole DataFrame, which can be used as a preprocessing step:
d.applymap(lambda x: x == [])
       x      y
0  False  False
1  False  False
2  False  False
3   True  False
I hope this helps you and future learners. It would be better to add a type check in your applymap function, which would otherwise probably raise exceptions.
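Building on that, the boolean frame from applymap can drive the actual replacement via DataFrame.mask, which writes NaN wherever the condition is True. A minimal sketch with the type check included:
import numpy as np
import pandas as pd

d = pd.DataFrame({'x' : [[1,2,3], [1,2], ["text"], []], 'y' : [1,2,3,4]})
is_empty_list = d.applymap(lambda v: isinstance(v, list) and len(v) == 0)
d = d.mask(is_empty_list)  # empty-list cells become NaN, everything else is kept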
To answer your main question, just leave out the empty lists altogether. If you build the dataframe with pandas.concat instead of from a dictionary, NaNs are filled in automatically wherever one column has a value and the other does not.
>>> import pandas as pd
>>> ser1 = pd.Series([[1,2,3], [1,2], ["text"]], name='x')
>>> ser2 = pd.Series([1,2,3,4], name='y')
>>> result = pd.concat([ser1, ser2], axis=1)
>>> result
           x  y
0  [1, 2, 3]  1
1     [1, 2]  2
2     [text]  3
3        NaN  4
About your second question, it seems that you can't search inside of an element. Perhaps you should make that a separate question since it's not really related to your main question.

Why can pandas DataFrames change each other?

I'm trying to keep of a copy of a pandas DataFrame, so that I can modify it while saving the original. But when I modify the copy, the original dataframe changes too. Ex:
df1=pd.DataFrame({'col1':['a','b','c','d'],'col2':[1,2,3,4]})
df1
  col1  col2
0    a     1
1    b     2
2    c     3
3    d     4
df2=df1
df2['col2']=df2['col2']+1
df1
  col1  col2
0    a     2
1    b     3
2    c     4
3    d     5
I set df2 equal to df1, then when I modified df2, df1 also changed. Why is this and is there any way to save a "backup" of a pandas DataFrame without it being modified?
This is much deeper than dataframes: you are thinking about Python variables the wrong way. Python variables are pointers, not buckets. That is to say, when you write
>>> y = [1, 2, 3]
You are not putting [1, 2, 3] into a bucket called y; rather you are creating a pointer named y which points to [1, 2, 3].
When you then write
>>> x = y
you are not putting the contents of y into a bucket called x; you are creating a pointer named x which points to the same thing that y points to. Thus:
>>> x[1] = 100
>>> print(y)
[1, 100, 3]
because x and y point to the same object, modifying it via one pointer modifies it for the other pointer as well. If you'd like to point to a copy instead, you need to explicitly create a copy. With lists you can do it like this:
>>> y = [1, 2, 3]
>>> x = y[:]
>>> x[1] = 100
>>> print(y)
[1, 2, 3]
With DataFrames, you can create a copy with the copy() method:
>>> df2 = df1.copy()
You need to make a copy:
df2 = df1.copy()
df2['col2'] = df2['col2'] + 1
print(df1)
Output:
  col1  col2
0    a     1
1    b     2
2    c     3
3    d     4
You just create a second name for df1 with df2 = df1.
When you set a data frame equal to another, it keeps the same location for its data in the computer's memory. This means that if you change one value in the new data frame, it will change that value in the old one. To fix this, make a copy instead of just assigning the original. Example: df2 = df1.copy()
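One way to see the difference is the is operator, which tests whether two names point to the same object:
>>> df2 = df1
>>> df2 is df1         # same object under two names
True
>>> df1.copy() is df1  # a copy is a distinct object
False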

A GroupBy with combinations of the categorical variables

Let's say I have data:
pd.DataFrame({'index': ['a','b','c','a','b','c'], 'column': [1,2,3,4,1,2]}).set_index(['index'])
which gives:
       column
index
a           1
b           2
c           3
a           4
b           1
c           2
Then to get the mean of each subgroup one would:
df.groupby(df.index).mean()
       column
index
a         2.5
b         1.5
c         2.5
However, what I've been trying to achieve, without constantly looping and slicing the data, is the mean for pairs of subgroups. For instance, the mean of a & b combined is 2.0, as if their values were pooled.
The output would be something akin to:
       column
index
a & a     2.5
a & b     2.0
a & c     2.5
b & b     1.5
b & c     2.0
c & c     2.5
Preferably this would involve manipulating the parameters in groupby, but as it is, I'm having to resort to looping and slicing, building all combinations of subgroups at some point.
I've revisited this 3 years later with a general solution to this problem. It's being used in this open source library, which is why I'm now able to post it here; it works with any number of index levels and creates combinations on them using numpy broadcasting.
First of all, that is not a valid dataframe for this: the indexes aren't unique. Let's add another index to that object and make it a Series:
df = pd.DataFrame({
    'unique': [1, 2, 3, 4, 5, 6],
    'index': ['a','b','c','a','b','c'],
    'column': [1,2,3,4,1,2]
}).set_index(['unique','index'])
s = df['column']
Let's unstack that index:
>>> idxs = ['index'] # set as variable to be used later on
>>> unstacked = s.unstack(idxs)
index     a    b    c
unique
1       1.0  NaN  NaN
2       NaN  2.0  NaN
3       NaN  NaN  3.0
4       4.0  NaN  NaN
5       NaN  1.0  NaN
6       NaN  NaN  2.0
>>> vals = unstacked.values
array([[ 1., nan, nan],
[ nan, 2., nan],
[ nan, nan, 3.],
[ 4., nan, nan],
[ nan, 1., nan],
[ nan, nan, 2.]])
The mean of two pooled groups is (sum_i + sum_j) / (count_i + count_j), so collect per-group sums and counts and let broadcasting build the whole pairwise matrix:
>>> sum = np.nansum(vals, axis=0)
>>> count = (~np.isnan(vals)).sum(axis=0)
>>> mean = (sum + sum[:, np.newaxis]) / (count + count[:, np.newaxis])
array([[ 2.5, 2. , 2.5],
[ 2. , 1.5, 2. ],
[ 2.5, 2. , 2.5]])
Now recreate the output dataframe:
>>> new_df = pd.DataFrame(mean, unstacked.columns, unstacked.columns.copy())
index_    a    b    c
index
a       2.5  2.0  2.5
b       2.0  1.5  2.0
c       2.5  2.0  2.5
>>> idxs_ = [ x+'_' for x in idxs ]
>>> new_df.columns.names = idxs_
>>> new_df.stack(idxs_, dropna=False)
index  index_
a      a         2.5
       b         2.0
       c         2.5
b      a         2.0
       b         1.5
       c         2.0
c      a         2.5
       b         2.0
       c         2.5
My current implementation is:
import pandas as pd
import itertools
import numpy as np

# get all pairs of categories here
def all_pairs(df, ix):
    hash = {
        ix: [],
        'p': []
    }
    for subset in itertools.combinations(np.unique(np.array(df.index)), 2):
        hash[ix].append(subset)
        # mean of the 'column' values across the combined pair of groups
        hash['p'].append(df.loc[pd.IndexSlice[subset], :].mean().iloc[0])
    return pd.DataFrame(hash).set_index(ix)
This gets the combinations and then adds them to the hash, which is then built back up into a data frame. It's hacky though :(
Here's an implementation that uses a MultiIndex and an outer join to handle the cross join.
import pandas as pd
from pandas import DataFrame, Series
import numpy as np
df = pd.DataFrame({'index': ['a','b','c','a','b','c'], 'column': [1,2,3,4,1,2]}).set_index(['index'])
groupedDF = df.groupby(df.index).mean()
# Create new MultiIndex using from_product which gives a paring of the elements in each iterable
p = pd.MultiIndex.from_product([groupedDF.index, groupedDF.index])
# Add column for cross join
groupedDF[0] = 0
# Outer Join
groupedDF = pd.merge(groupedDF, groupedDF, how='outer', on=0).set_index(p)
# get mean for every row (which is the average for each pair)
# unstack to get matrix for deduplication
crossJoinMeans = groupedDF[['column_x', 'column_y']].mean(axis=1).unstack()
# Create Identity matrix because each pair of itself will be needed
b = np.identity(3, dtype='bool')
# set the first column to True because it contains the rest of the unique means (the identity portion covers the first part)
b[:,0] = True
# invert for proper use of DataFrame Mask
b = np.invert(b)
finalDF = crossJoinMeans.mask(b).stack()
I'd guess that this could be cleaned up and made more concise.
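For comparison, a more direct sketch of the same pairwise-mean idea, using itertools.combinations_with_replacement over per-group sums and counts (the names agg, pairs and result are illustrative, not from the answers above):
import itertools
import pandas as pd

df = pd.DataFrame({'index': ['a','b','c','a','b','c'],
                   'column': [1,2,3,4,1,2]}).set_index(['index'])

# a pair's combined mean is (sum_i + sum_j) / (count_i + count_j)
agg = df.groupby(df.index)['column'].agg(['sum', 'count'])
pairs = {
    '%s & %s' % (i, j): (agg.loc[i, 'sum'] + agg.loc[j, 'sum'])
                        / (agg.loc[i, 'count'] + agg.loc[j, 'count'])
    for i, j in itertools.combinations_with_replacement(agg.index, 2)
}
result = pd.Series(pairs, name='column')  # index: 'a & a', 'a & b', ...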
