Finding the Length of a Pandas Dataframe Within a Function - python

The objective of the code below is to create another identical pandas dataframe, where all values are replaced with zero.
import numpy as np
import pandas as pd
#Given a preexisting dataframe
len(df) #Returns 1502
def zeroCreator(data):
    zeroFrame = pd.DataFrame(np.zeros(len(data),1))
    return zeroFrame
print(zeroCreator(df)) #Raises a TypeError: data type not understood
How do I work around this TypeError?
Edit: Thank you for all your clarifications; it turns out I hadn't passed the shape to np.zeros correctly (it takes a single tuple, so I was missing a pair of parentheses), although a simpler solution also exists.

Just clone the df and assign 0 to all of it:
zero_df = df.copy()
zero_df[:] = 0
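If you do want to build it with np.zeros, the shape must be passed as a single tuple; a minimal sketch of the corrected function (assuming df exists as above):
def zeroCreator(data):
    # np.zeros takes the shape as one tuple argument: (rows, cols)
    return pd.DataFrame(np.zeros((len(data), 1)))
The DataFrame constructor can also broadcast a scalar, which keeps the original index and column labels:
zero_df = pd.DataFrame(0, index=df.index, columns=df.columns)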

Related

Why does pandas.DataFrame change the data source?

I'm learning Python, and I found something I can't understand.
I created a pandas.DataFrame from an ndarray, and then modified only the DF, not the ndarray.
To my surprise, the ndarray changed too!
Is the data cached inside the DF?
If yes, why did the change show up in the ndarray?
If not, what about a DF created without any source?
from pandas import DataFrame
import numpy as np
if __name__ == '__main__':
    nda1 = np.zeros((3,3), dtype=float)
    print(f'original nda1:\n{nda1}\n')
    df1 = DataFrame(nda1)
    print(f'original df1:\n{df1}\n')
    df1.iat[2,2] = 999
    #print(f'df1 in main:\n{df1}\n')
    print(f'nda1 after modify:\n{nda1}\n')
DataFrames use numpy arrays under the hood. Since your data has a single homogeneous dtype, the array is kept as is rather than copied.
You can check it with:
pd.DataFrame(nda1).values.base is nda1
# True
You can force a copy to avoid the issue:
df1 = pd.DataFrame(nda1.copy())
or copy from within the constructor:
df1 = pd.DataFrame(nda1, copy=True)
Check that the underlying array is now a different one:
pd.DataFrame(nda1, copy=True).values.base is nda1
# False
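You can also check sharing with numpy directly; a small sketch using np.shares_memory:
import numpy as np
np.shares_memory(pd.DataFrame(nda1).values, nda1)             # True: same buffer
np.shares_memory(pd.DataFrame(nda1, copy=True).values, nda1)  # False: independent copy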
Many programmers run into this. It is because of this line:
df1 = DataFrame(nda1)
This does not copy the data: the DataFrame wraps the same underlying buffer as nda1, so the two stay intertwined. If you want a dataframe with no tie to its source, use:
df2 = df1.copy()
or
df1 = DataFrame(nda1.copy())
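To confirm the copy is decoupled, a quick check (continuing the snippet above):
df2 = df1.copy()
df2.iat[0, 0] = -1
print(nda1[0, 0])  # still 0.0: the copy no longer shares memory with the source array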
Highly relevant post:
Why can pandas dataframes change each other

How to implement a condition in a pandas DataFrame, to save only part of it

I would like to create a counter for a pandas DataFrame so that I save rows only up to a certain value in a specific column,
e.g. save only while df['cycle'] <= 2.
From what I gathered from the answers below, df[df['cycle']<=2] solves my problem.
Edit: If I am correct, pandas always reads the whole file. With nrows you can say, e.g., stop at index x, but what if I don't want to use an index and instead want to stop at a specific value in a column? How can I do that?
See my code below:
import pandas as pd
import numpy as np
l = list(np.linspace(0,10,12))
data = [
    ('time', l),
    ('A', [0, 5, 0.6, -4.8, -0.3, 4.9, 0.2, -4.7, 0.5, 5, 0.1, -4.6]),
    ('B', [0, 300, 20, -280, -25, 290, 30, -270, 40, 300, -10, -260]),
]
df = pd.DataFrame.from_dict(dict(data))
df['cycle'] = [df.index.get_loc(i) // 4 + 1 for i in df.index]
df = df[df['cycle'] <= 2]  # assign the filtered frame back, otherwise the filter result is discarded
df.to_csv(path_or_buf='test.out', index=True, sep='\t', columns=['time','A','B','cycle'], decimal='.')
I modified the code according to the suggestions from other users.
I am grateful for any help I can get.
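For the follow-up in the edit (stopping a read at a column value rather than a row count), a minimal sketch using chunked reading; it assumes the file written above and that 'cycle' is sorted ascending:
import pandas as pd
kept = []
for chunk in pd.read_csv('test.out', sep='\t', chunksize=1000):
    kept.append(chunk[chunk['cycle'] <= 2])
    if (chunk['cycle'] > 2).any():
        break  # since 'cycle' is sorted, no later chunk can match
result = pd.concat(kept, ignore_index=True)
This avoids holding the whole file in memory once the cutoff value appears.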

Python set to array and dataframe

Interpretation by a friendly editor:
I have data in the form of a set.
import numpy as n, pandas as p
s={12,34,78,100}
print(n.array(s))
print(p.DataFrame(s))
The above code seemingly converts the set into a numpy array without a problem (strictly, it yields a 0-d object array wrapping the set).
But when I try to create a DataFrame from it I get the following error:
ValueError: DataFrame constructor not properly called!
So is there any way to convert a python set/nested set into a numpy array/dictionary so I can create a DataFrame from it?
Pandas can't deal with sets (dicts are fine; for those you can use p.DataFrame.from_dict).
What you need to do is convert your set into a list and then build the DataFrame:
import pandas as pd
s = {12,34,78,100}
s = list(s)
print(pd.DataFrame(s))
You can use list(s):
import pandas as p
s = {12,34,78,100}
df = p.DataFrame(list(s))
print(df)
Why do you want to convert it to a list first? The DataFrame() constructor accepts iterables, and sets are iterable.
dataFrame = pandas.DataFrame(yourSet)
Note, though, that newer pandas versions reject sets outright ("Set type is unordered"), so the list conversion above is the safer route. When this works, it creates a column header "0", which you can rename like so:
dataFrame.columns = ['columnName']
import numpy as n, pandas as p
s = {12,34,78,100}
#Convert the set to a list first, then create the DataFrame from it
df = p.DataFrame(list(s))
#Can also create key-value pairs (a dictionary per element) and then create the DataFrame;
#this is useful because the key is used as the column header and the values as data
df1 = p.DataFrame({'Values': data} for data in s)
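The nested-set part of the question is not covered above. Since inner elements must be hashable, a nested set is typically a set of frozensets; a minimal sketch (the names are illustrative):
nested = {frozenset({1, 2}), frozenset({3, 4})}
#Convert each inner frozenset to a sorted list so each row has a stable order
df_nested = p.DataFrame([sorted(inner) for inner in nested])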

Save pandas dataframe with numpy arrays column

Let us consider the following pandas dataframe:
df = pd.DataFrame([[1, np.array([6,7])], [4, np.array([8,9])]], columns=['A','B'])
where the B column is composed of two numpy arrays.
If we save the dataframe and then load it again, the numpy arrays are converted into strings.
df.to_csv('test.csv', index=False)
df = pd.read_csv('test.csv')
Is there any simple way to solve this problem?
You can pickle the data instead.
df.to_pickle('test.pkl')
df = pd.read_pickle('test.pkl')
This ensures the format stays exactly the same. However, it is not human readable.
If human readability is an issue, I would recommend converting to a JSON file instead:
df.to_json('abc.json')
df = pd.read_json('abc.json')
Use the following function to parse each row back into a list.
def formatting(string_numpy):
    """formatting: conversion of a string-serialized list back to a list
    Args:
        string_numpy (str)
    Returns:
        list_values (list): list of values
    """
    list_values = string_numpy.split(", ")
    list_values[0] = list_values[0][2:]     # strip the leading bracket and quote
    list_values[-1] = list_values[-1][:-2]  # strip the trailing quote and bracket
    return list_values
Then use the following apply call to convert the column back:
df[col] = df[col].apply(formatting)  # attribute access (df.col) only works for a column literally named "col"
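A hedged alternative that avoids hand-parsing: serialize each array as a plain Python list before saving, then rebuild the arrays with ast.literal_eval on load. A minimal sketch, assuming the df from the question:
import ast
import numpy as np
import pandas as pd
df['B'] = df['B'].apply(lambda a: np.asarray(a).tolist())  # store plain lists, e.g. [6, 7]
df.to_csv('test.csv', index=False)
df2 = pd.read_csv('test.csv')
df2['B'] = df2['B'].apply(lambda cell: np.array(ast.literal_eval(cell)))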

Change one column of a DataFrame only

I'm using Pandas with Python 3. I have a dataframe with a bunch of columns, but I only want to change the data type of all the values in one of the columns and leave the others alone. The only way I could find to accomplish this is to edit the column, remove the original column and then merge the edited one back. I would like to edit the column without having to remove and merge, leaving the rest of the dataframe unaffected. Is this possible?
Here is my solution now:
import numpy as np
import pandas as pd
from pandas import Series, DataFrame
def make_float(var):
    return float(var)
#create a new series with the value type I want
df2 = df1['column'].apply(make_float)
#remove the original column
df3 = df1.drop('column', axis=1)
#merge the dataframes
df1 = pd.concat([df3, df2], axis=1)
It also doesn't work to apply the function to the column directly. For example:
df1['column'].apply(make_float)
print(type(df1.iloc[1]['column']))
yields:
<class 'str'>
df1['column'] = df1['column'].astype(float)
It will raise an error if conversion fails for some row.
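If some rows might fail to convert, a hedged alternative is pd.to_numeric, which can turn bad values into NaN instead of raising:
df1['column'] = pd.to_numeric(df1['column'], errors='coerce')  # unparseable values become NaN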
Apply does not work in place; it returns a new Series that you discard in this line:
df1['column'].apply(make_float)
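Assigning the result back fixes it:
df1['column'] = df1['column'].apply(make_float)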
Apart from Yakym's solution, you can also do this -
df['column'] += 0.0
Note that this only coerces an already-numeric (e.g. integer) column to float; it will raise for string values.
