Why does pandas.DataFrame change the data source? - python

I'm learning Python, and I found something I can't understand.
I created a pandas.DataFrame from an ndarray, and then modified only the DataFrame, not the ndarray.
To my surprise, the ndarray changed too!
Is the data cached inside the DataFrame?
If yes, why did the values inside the ndarray change?
If no, what about a DataFrame created without any source?
from pandas import DataFrame
import numpy as np
if __name__ == '__main__':
    nda1 = np.zeros((3,3), dtype=float)
    print(f'original nda1:\n{nda1}\n')
    df1 = DataFrame(nda1)
    print(f'original df1:\n{df1}\n')
    df1.iat[2,2] = 999
    #print(f'df1 in main:\n{df1}\n')
    print(f'nda1 after modify:\n{nda1}\n')

DataFrames use NumPy arrays under the hood. Because your array has a single homogeneous dtype, it is kept as is, without a copy.
You can check it with (after import pandas as pd):
pd.DataFrame(nda1).values.base is nda1
# True
You can force a copy to avoid the issue:
df1 = pd.DataFrame(nda1.copy())
or copy from within the constructor:
df1 = pd.DataFrame(nda1, copy=True)
Check that the underlying array is different:
pd.DataFrame(nda1, copy=True).values.base is nda1
# False
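A minimal sketch of the difference, assuming a reasonably recent pandas and NumPy. Whether the default constructor shares memory with the array may depend on the pandas version and its Copy-on-Write setting, so the sketch prints that result instead of assuming it. The variable names are illustrative.

```python
import numpy as np
import pandas as pd

nda = np.zeros((3, 3), dtype=float)

# copy=True gives the DataFrame its own buffer.
df_copy = pd.DataFrame(nda, copy=True)
df_copy.iat[2, 2] = 999
print(nda[2, 2])  # the source array is untouched by the write above

# Whether the default constructor shares memory can vary between
# pandas versions, so inspect it rather than assume it.
df_default = pd.DataFrame(nda)
print(np.shares_memory(df_default.to_numpy(), nda))
```

If the second print shows True, writes through df_default can reach nda, which is the behaviour described in the question.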

Many programmers experience this. This is because of this line:
df1 = DataFrame(nda1)
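DataFrame(nda1) does not copy the array: the DataFrame and nda1 share the same underlying data, so a change made through one is visible through the other. If you want a DataFrame that is independent of its source, use: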
df2 = df1.copy()
or
df1 = DataFrame(nda1.copy())
Highly relevant post:
Why can pandas dataframes change each other

Related

Finding the Length of a Pandas Dataframe Within a Function

The objective of the code below is to create another identical pandas dataframe, where all values are replaced with zero.
import numpy as np
import pandas as pd
#Given preexisting dataframe
len(df) #Returns 1502
def zeroCreator(data):
    zeroFrame = pd.DataFrame(np.zeros(len(data),1))
    return zeroFrame
print(zeroCreator(df)) #Returns a TypeError: data type not understood
How do I work around this TypeError?
Edit: Thank you for all your clarifications, it appears that I hadn't entered the dataframe parameters correctly into np.zeros (missing a pair of parentheses), although a simpler solution does exist.
Just clone a new df and assign 0 to it
zero_df = df.copy()
zero_df[:] = 0
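For comparison, here is a small sketch of both fixes mentioned above: passing the shape to np.zeros as a single tuple, and building a zero-filled frame with the same index and columns. The function names and the sample data are made up for illustration.

```python
import numpy as np
import pandas as pd

def zero_creator(data):
    # np.zeros takes the shape as one tuple: (rows, columns).
    # np.zeros(len(data), 1) would pass 1 as the dtype instead.
    return pd.DataFrame(np.zeros((len(data), 1)))

def zero_like(data):
    # Same index and columns as the input, every cell set to 0.
    return pd.DataFrame(0, index=data.index, columns=data.columns)

# Hypothetical sample data standing in for the preexisting dataframe.
df = pd.DataFrame({"a": [1, 2, 3], "b": [4.5, 5.5, 6.5]})
zeros_one_col = zero_creator(df)
zeros_same_shape = zero_like(df)
print(zeros_one_col.shape, zeros_same_shape.shape)
```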

Efficient way to empty a pandas dataframe using its mutable alias property

So I want to create a function in which part of the code modifies an existing pandas dataframe df, and under some conditions df should be emptied. The challenge is that this function is not allowed to return the dataframe itself; it can only modify df through the alias it receives. An example of this is the following function:
import pandas as pd
import random
def random_df_modifier(df):
    letter_lst = list('abc')
    message_lst = [f'random {i}' for i in range(len(letter_lst) - 1)] + ['BOOM']
    chosen_tup = random.choice(list(zip(letter_lst, message_lst)))
    df[chosen_tup[0]] = chosen_tup[1]
    if chosen_tup[0] == letter_lst[-1]:
        print('Game over')
        df = pd.DataFrame()  # <-- this line won't work as intended
    return chosen_tup
testing_df = pd.DataFrame({'col1': [True, False]})
print(random_df_modifier(testing_df))
I am aware that df = pd.DataFrame() doesn't work because the local name df is then bound to the new pd.DataFrame() instead of the mutable alias of the input dataframe. So is there any way to change df in place to an empty dataframe?
Thank you in advance
EDIT1: df.drop(df.index, inplace=True) seems to work as intended, but I am not sure about its efficiency, because df.drop() may suffer from performance issues when the dataframe is big enough (by big enough I mean 1mil+ total entries).
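df = pd.DataFrame(columns=df.columns)
will give you an empty dataframe with the same columns (and be way faster than using the drop method). Note that this rebinds the name df, so inside a function it does not change the caller's dataframe.
I believe that is what you're asking.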
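A short sketch contrasting the two approaches from this thread when they are used inside a function. The helper names rebind and empty_in_place are made up for illustration.

```python
import pandas as pd

def rebind(df):
    # Rebinds the local name only; the caller's object is untouched.
    df = pd.DataFrame(columns=df.columns)

def empty_in_place(df):
    # Drops every row of the object the caller passed in.
    df.drop(df.index, inplace=True)

testing_df = pd.DataFrame({"col1": [True, False]})

rebind(testing_df)
rows_after_rebind = len(testing_df)

empty_in_place(testing_df)
rows_after_drop = len(testing_df)

print(rows_after_rebind, rows_after_drop)
```

Only the in-place drop is visible to the caller, which is why the rebinding line in the question has no effect outside the function.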

Applying functions declared as strings to a pandas dataframe

I have a pandas dataframe. I want to create new columns in the dataframe with
mathematical functional values of the existing columns.
I know how to do it for simple cases:
import pandas as pd
import numpy as np
# Basic dataframe
df = pd.DataFrame(data={'col1': [1,2], 'col2':[3,5]})
for i in df.columns:
    df[f'{i}_sqrt'] = df[i].apply(lambda x :np.sqrt(x))
This produces two new columns, col1_sqrt and col2_sqrt.
Now I want to extend it to the cases where the functions are written as strings like:
import itertools
one_func = ['(x)', '(np.sqrt(x))']
two_func = ['*'.join(i) for i in itertools.product(one_func, one_func)]
so that two_func = ['(x)*(x)','(x)*(np.sqrt(x))','(np.sqrt(x))*(x)', '(np.sqrt(x))*(np.sqrt(x))']. Is there any way I can create columns like the first example with these new functions?
That looks like a bad design, but I won't go down that road.
Answering your question, you can use df.eval
First of all, set
one_func = ['{x}', '(sqrt({x}))']
with {x} instead of (x), so that you can substitute your actual column name for {x}.
Then, for instance,
expr = two_func[0].format(x='col1')
df.eval(expr)
The for loop would look like
for col in list(df.columns):
    for func in two_func:
        expr = func.format(x=col)
        df[expr] = df.eval(expr)
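Putting the pieces of this answer together, a self-contained sketch might look like the following. The sample values are invented, and the results are collected in a separate frame (here called derived) so that the expression strings can serve as column names without touching df.

```python
import itertools
import pandas as pd

# Hypothetical sample data; perfect squares keep the output easy to read.
df = pd.DataFrame({"col1": [1, 4], "col2": [9, 16]})

one_func = ["{x}", "(sqrt({x}))"]
two_func = ["*".join(pair) for pair in itertools.product(one_func, one_func)]

new_cols = {}
for col in df.columns:
    for func in two_func:
        expr = func.format(x=col)      # e.g. "col1*(sqrt(col1))"
        new_cols[expr] = df.eval(expr)  # evaluated against df's columns

derived = pd.DataFrame(new_cols)
print(derived.columns.tolist())
```

Note that df.eval only understands column names and a fixed set of math functions such as sqrt, so np.sqrt from the question is written as sqrt here.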

Save pandas dataframe with numpy arrays column

Let us consider the following pandas dataframe:
df = pd.DataFrame([[1,np.array([6,7])],[4,np.array([8,9])]], columns = ['A','B'])
where the B column consists of two numpy arrays.
If we save the dataframe and then load it again, the numpy arrays are converted into strings.
df.to_csv('test.csv', index = False)
df = pd.read_csv('test.csv')
Is there any simple way to solve this problem? Here is the output of the loaded dataframe.
You can pickle the data instead.
df.to_pickle('test.pkl')
df = pd.read_pickle('test.pkl')
This will ensure that the format remains the same. However, it is not human readable.
If human readability is an issue, I would recommend converting it to a JSON file:
df.to_json('abc.json')
df = pd.read_json('abc.json')
Use the following function to format each row.
def formatting(string_numpy):
    """formatting : Conversion of String List to List

    Args:
        string_numpy (str)
    Returns:
        l (list): list of values
    """
    list_values = string_numpy.split(", ")
    list_values[0] = list_values[0][2:]
    list_values[-1] = list_values[-1][:-2]
    return list_values
Then use apply to convert the column back (here col holds the column name):
df[col] = df[col].apply(formatting)
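For the CSV case in the question, a minimal sketch of parsing the strings back could look like this. It assumes short one-dimensional integer arrays, which to_csv writes as text such as "[6 7]"; longer arrays may be abbreviated with "..." when converted to text and cannot be recovered this way. The helper name parse_array is made up for illustration.

```python
import io

import numpy as np
import pandas as pd

def parse_array(text):
    # "[6 7]" -> array([6, 7]); assumes a short 1-D integer array.
    return np.array([int(tok) for tok in text.strip("[]").split()])

df = pd.DataFrame({"A": [1, 4], "B": [np.array([6, 7]), np.array([8, 9])]})

# Round-trip through CSV using an in-memory buffer instead of a file.
buffer = io.StringIO()
df.to_csv(buffer, index=False)
buffer.seek(0)

loaded = pd.read_csv(buffer)
loaded["B"] = loaded["B"].apply(parse_array)
print(type(loaded["B"][0]), loaded["B"][0])
```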

Set value to an entire column of a pandas dataframe

I'm trying to set the entire column of a dataframe to a specific value.
In [1]: df
Out [1]:
  issueid industry
0     001      xxx
1     002      xxx
2     003      xxx
3     004      xxx
4     005      xxx
From what I've seen, loc is the best practice when replacing values in a dataframe (or isn't it?):
In [2]: df.loc[:,'industry'] = 'yyy'
However, I still received this much talked-about warning message:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_index,col_indexer] = value instead
If I do
In [3]: df['industry'] = 'yyy'
I got the same warning message.
Any ideas? Working with Python 3.5.2 and pandas 0.18.1.
EDIT Jan 2023:
Given the volume of visits on this question, it's worth stating that my original question was really more about dataframe copy-versus-slice than "setting value to an entire column".
On copy-versus-slice: My current understanding is that, in general, if you want to modify a subset of a dataframe after slicing, you should create the subset by .copy(). If you only want a view of the slice, no copy() needed.
On setting value to an entire column: simply do df[col_name] = col_value
You can use the assign function:
df = df.assign(industry='yyy')
Python can do unexpected things when new objects are defined from existing ones. You stated in a comment above that your dataframe is defined along the lines of df = df_all.loc[df_all['issueid']==specific_id,:]. In this case, df is really just a stand-in for the rows stored in the df_all object: a new object is NOT created in memory.
To avoid these issues altogether, I often have to remind myself to use the copy module, which explicitly forces objects to be copied in memory so that methods called on the new objects are not applied to the source object. I had the same problem as you, and avoided it using the deepcopy function.
In your case, this should get rid of the warning message:
from copy import deepcopy
df = deepcopy(df_all.loc[df_all['issueid']==specific_id,:])
df['industry'] = 'yyy'
EDIT: Also see David M.'s excellent comment below!
df = df_all.loc[df_all['issueid']==specific_id,:].copy()
df['industry'] = 'yyy'
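A small self-contained sketch of the .copy() approach, with invented sample data and a hypothetical specific_id, showing that the assignment affects only the copied subset:

```python
import pandas as pd

# Hypothetical stand-in for the full dataframe from the question.
df_all = pd.DataFrame({
    "issueid": ["001", "002", "003"],
    "industry": ["xxx", "xxx", "xxx"],
})
specific_id = "002"  # illustrative value

# .copy() makes the subset an independent dataframe.
df = df_all.loc[df_all["issueid"] == specific_id, :].copy()
df["industry"] = "yyy"

print(df)
print(df_all)
```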
df.loc[:,'industry'] = 'yyy'
This does the trick: use .loc with ':' to select all rows. Hope it helps.
You can do:
df['industry'] = 'yyy'
Assuming your DataFrame looks like the data below, you have to consider whether the values are strings or integers, since the two are treated differently. So in this case you need to be specific about the type.
import pandas as pd
data = [('001','xxx'), ('002','xxx'), ('003','xxx'), ('004','xxx'), ('005','xxx')]
df = pd.DataFrame(data,columns=['issueid', 'industry'])
print("Old DataFrame")
print(df)
df.loc[:,'industry'] = str('yyy')
print("New DataFrame")
print(df)
Now, if you want to put numbers instead of letters, you can create an array:
list_of_ones = [1,1,1,1,1]
df.loc[:,'industry'] = list_of_ones
print(df)
Or, if you are using NumPy:
import numpy as np
n = len(df)
df.loc[:,'industry'] = np.ones(n)
print(df)
This provides you with the possibility of adding conditions on the rows and then change all the cells of a specific column corresponding to those rows:
df.loc[(df['issueid'] == '001'), 'industry'] = str('yyy')
Seems to me that:
df1 = df[df['col1']==some_value] will not create a new DataFrame, basically, changes in df1 will be reflected in the parent df. This leads to the warning.
Whereas df1 = df[df['col1']==some_value].copy() will create a new DataFrame, and changes in df1 will not be reflected in df. The copy method is recommended if you don't want to make changes to your original df.
I had a similar issue before even with this approach df.loc[:,'industry'] = 'yyy', but once I refreshed the notebook, it ran well.
You may want to try refreshing the cells after you have df.loc[:,'industry'] = 'yyy'.
Use this instead:
df.iloc[:]['industry'] = 'yyy'
Remember: this only works with columns that already exist in the dataframe.
This is for people for whom .loc didn't work.
For anyone else coming to this answer who doesn't want to use copy:
df['industry'] = df['industry'].apply(lambda x: '')
If you have just created a new, empty data frame, you cannot directly assign a single value to a whole column. The column will come out empty, because pandas doesn't know how many rows the data frame should have. You need to either define the size first or have some existing columns.
df = pd.DataFrame()
df["A"] = 1
df["B"] = 2
df["C"] = 3
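A quick sketch of the difference, assuming that "defining the size" means giving the frame an index before assigning. The variable names are illustrative.

```python
import pandas as pd

# No index yet: the scalar has no rows to fill, so the column stays empty.
empty = pd.DataFrame()
empty["A"] = 1

# With an index defined first, the scalar is broadcast to every row.
sized = pd.DataFrame(index=range(3))
sized["A"] = 1

print(len(empty), sized["A"].tolist())
```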
