Why do the values of my copied dataframe also change? [duplicate]

This question already has an answer here:
DataFrame.apply in python pandas alters both original and duplicate DataFrames
(1 answer)
Closed 3 months ago.
So here is my issue:
import numpy as np
import pandas as pd

# creation of dataframe
data = {'animal': ['cat', 'cat', 'snake', 'dog', 'dog', 'cat', 'snake', 'cat', 'dog', 'dog'],
        'age': [2.5, 3, 0.5, np.nan, 5, 2, 4.5, np.nan, 7, 3],
        'visits': [1, 3, 2, 3, 2, 3, 1, 1, 2, 1],
        'priority': ['yes', 'yes', 'no', 'yes', 'no', 'no', 'no', 'yes', 'no', 'no']}
labels = ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j']
df = pd.DataFrame(data, index=labels)
If I run the code below, the values of the column priority change for the dataframe dg (ok, good), but they change for the dataframe df too. Why?
# map function
dg = df
dg["priority"] = dg["priority"].map({"yes":True, "no":False})
print(dg)

Simply use df.copy().
df is a DataFrame instance, i.e. an object. When you assign it to another variable, you end up with two variables pointing to the same object; pandas does not create a new one.

That's because pandas DataFrames are mutable.
pandas.DataFrame
Two-dimensional, size-mutable, potentially
heterogeneous tabular data.
You want pandas.DataFrame.copy to keep the original dataframe (in your case df) unchanged.
# map function
dg = df.copy()
dg["priority"] = dg["priority"].map({"yes":True, "no":False})

Related

Pandas dropna() not removing entire row

When I searched for a way to remove an entire row in pandas if it contains a null/NaN value, the only appropriate function I found was dropna(). For some reason, it's not removing the entire row as intended, but instead replacing the null values with zero. As I want to discard the entire row and then compute the mean age of the animals in the dataframe, I need a way to not count the NaN values.
Here's the code:
import numpy as np
import pandas as pd
data = {'animal': ['cat', 'cat', 'snake', 'dog', 'dog', 'cat', 'snake', 'cat', 'dog', 'dog'],
'age': [2.5, 3, 0.5, np.nan, 5, 2, 4.5, np.nan, 7, 3],
'visits': [1, 3, 2, 3, 2, 3, 1, 1, 2, 1],
'priority': ['yes', 'yes', 'no', 'yes', 'no', 'no', 'no', 'yes', 'no', 'no']}
labels = ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j']
df = pd.DataFrame(data, index=labels)
df.dropna(inplace=True)
df.head()
In this case, I need to delete the dog 'd' and the cat 'h', but that's not what I get.
To note, I have also tried this, and it didn't work either:
df2 = df.dropna()
You have to specify axis=1 and how='any' to remove columns.
See: https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.dropna.html
df.dropna(axis=1, inplace=True, how='any')
If you just want to delete the rows:
df.dropna(inplace=True, how='any')
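Since the end goal in the question is the mean age, two related points worth noting (a hedged sketch built from the question's own data, not part of the answer above): dropna(subset=...) restricts the NaN check to chosen columns, and mean() skips NaN by default anyway.
import numpy as np
import pandas as pd

data = {'animal': ['cat', 'cat', 'snake', 'dog', 'dog', 'cat', 'snake', 'cat', 'dog', 'dog'],
        'age': [2.5, 3, 0.5, np.nan, 5, 2, 4.5, np.nan, 7, 3],
        'visits': [1, 3, 2, 3, 2, 3, 1, 1, 2, 1],
        'priority': ['yes', 'yes', 'no', 'yes', 'no', 'no', 'no', 'yes', 'no', 'no']}
labels = ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j']
df = pd.DataFrame(data, index=labels)

# Drop only the rows whose 'age' is NaN ('d' and 'h'), keep everything else.
cleaned = df.dropna(subset=['age'])
print(cleaned.index.tolist())  # ['a', 'b', 'c', 'e', 'f', 'g', 'i', 'j']

# mean() ignores NaN by default, so dropping rows first is not strictly required.
print(df['age'].mean())        # 3.4375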

How to properly apply filters in pandas from the given set of filters?

I am having trouble applying filters with pandas. The problem looks like this:
the first name in filter_names should correspond to the first value in filter_values (an exact match), and the value of the second column should be bigger than or equal to the value given.
In other words, in the input like this:
df = pd.DataFrame({'animal': ['cat', 'cat', 'snake', 'dog', 'dog', 'cat', 'snake', 'cat', 'dog', 'dog'],
'age': [2.5, 3, 0.5, np.nan, 5, 2, 4.5, np.nan, 7, 3],
'name': ['Murzik', 'Pushok', 'Kaa', 'Bobik', 'Strelka', 'Vaska', 'Kaa2', 'Murka', 'Graf', 'Muhtar'],
'visits': [1, 3, 2, 3, 2, 3, 1, 1, 2, 1],
'priority': ['yes', 'yes', 'no', 'yes', 'no', 'no', 'no', 'yes', 'no', 'no']},
index = ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j'])
filter_names = ["animal", "age"]
filter_values = ["cat", 3]
the condition to be put in the query looks like this:
"cat"=="animal", "age"<3.
It should provide the DF below:
animal age name visits priority
a cat 2.5 Murzik 1 yes
f cat 2.0 Vaska 3 no
I wrote the following code to achieve this effect:
df_filtered = df[(filter_names[0]==filter_values[0])&(df[filter_names[1]]>=filter_values[1])]
to no avail.
What do I seem to be missing?
I think you lost df[...] in the first condition and used the wrong sign in the second one:
df[(df[filter_names[0]] == filter_values[0]) & (df[filter_names[1]] < filter_values[1])]
It will work like this:
In [2]: df[(df[filter_names[0]] == filter_values[0]) & (df[filter_names[1]] < filter_values[1])]
Out[2]:
animal age name visits priority
a cat 2.5 Murzik 1 yes
f cat 2.0 Vaska 3 no
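If the number of filters can vary, one possible generalization (a sketch, not part of the answer above, assuming string filters mean equality and numeric filters mean strict "less than", as in the expected output) is to build the mask in a loop:
import numpy as np
import pandas as pd

df = pd.DataFrame({'animal': ['cat', 'cat', 'snake', 'dog', 'dog', 'cat', 'snake', 'cat', 'dog', 'dog'],
                   'age': [2.5, 3, 0.5, np.nan, 5, 2, 4.5, np.nan, 7, 3],
                   'name': ['Murzik', 'Pushok', 'Kaa', 'Bobik', 'Strelka', 'Vaska', 'Kaa2', 'Murka', 'Graf', 'Muhtar'],
                   'visits': [1, 3, 2, 3, 2, 3, 1, 1, 2, 1],
                   'priority': ['yes', 'yes', 'no', 'yes', 'no', 'no', 'no', 'yes', 'no', 'no']},
                  index=['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j'])
filter_names = ["animal", "age"]
filter_values = ["cat", 3]

# Start from an all-True mask and AND in one condition per (name, value) pair.
mask = pd.Series(True, index=df.index)
for name, value in zip(filter_names, filter_values):
    if isinstance(value, str):
        mask &= df[name] == value   # exact match for string filters
    else:
        mask &= df[name] < value    # strict "less than" for numeric filters

print(df[mask])  # rows 'a' and 'f', as in the expected output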

Trying to print only the even values of a DataFrame column in pandas

This is the dataframe:
exam_data = {'name': ['Anastasia', 'Dima', 'Katherine', 'James', 'Emily', 'Michael', 'Matthew', 'Laura', 'Kevin', 'Jonas'],
'score': [12.5, 9, 16.5, np.nan, 9, 20, 14.5, np.nan, 8, 19],
'attempts': [1, 3, 2, 3, 2, 3, 1, 1, 2, 1],
'qualify': ['yes', 'no', 'yes', 'no', 'no', 'yes', 'yes', 'no', 'no', 'yes']}
labels = ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j']
I'm trying to print only the even values of a column named "attempts".
Is there a function I can use, such as:
df["attempts"].even()
You can use
df.attempts[df.attempts % 2 == 0]
If you look at df.attempts % 2 == 0, you'll see it is a Series of Trues and Falses. We then use it as a boolean mask to select the desired entries, i.e. the ones that give 0 as the remainder when divided by 2 (the even ones).
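For illustration, a short self-contained sketch (rebuilding a DataFrame from the exam_data above with the labels as index) showing the mask and the selection:
import numpy as np
import pandas as pd

exam_data = {'name': ['Anastasia', 'Dima', 'Katherine', 'James', 'Emily', 'Michael', 'Matthew', 'Laura', 'Kevin', 'Jonas'],
             'score': [12.5, 9, 16.5, np.nan, 9, 20, 14.5, np.nan, 8, 19],
             'attempts': [1, 3, 2, 3, 2, 3, 1, 1, 2, 1],
             'qualify': ['yes', 'no', 'yes', 'no', 'no', 'yes', 'yes', 'no', 'no', 'yes']}
labels = ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j']
df = pd.DataFrame(exam_data, index=labels)

mask = df.attempts % 2 == 0  # boolean Series: True where attempts is even
print(df.attempts[mask])     # keeps only rows 'c', 'e' and 'i' (attempts == 2)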
Use DataFrame.loc to filter by the mask: compare modulo 2 with 0 and select the column by name:
df = pd.DataFrame(exam_data)
print (df.loc[df.attempts % 2 == 0, 'attempts'])
2 2
4 2
8 2
Name: attempts, dtype: int64
For odd values, compare with 1:
print (df.loc[df.attempts % 2 == 1, 'attempts'])
0 1
1 3
3 3
5 3
6 1
7 1
9 1
Name: attempts, dtype: int64
You can try with the following:
print(df[df['attempts']%2==0]['attempts'].values)

Pandas count groupby values [duplicate]

This question already has answers here:
Python: get a frequency count based on two columns (variables) in pandas dataframe some row appears
(3 answers)
Closed 2 years ago.
Example df:
df = pd.DataFrame({
'group': ['a', 'a', 'a', 'a', 'a', 'b', 'b', 'b', 'b',],
'value': ['yes', 'yes', 'yes', 'no', 'no', 'yes', 'no', 'no', 'no']
})
I need to count each value in group and get something like this:
group yes no
a 3 2
b 1 3
I tried df.groupby(['group', 'value'])['value'].count().to_frame() and it looks fine, but the result is multi-indexed; I need a simple table like the example above.
Try:
df.groupby(['group', 'value'])['value'].count().reset_index(name='count')
(passing name= gives the count column its own name and avoids a clash with the existing 'value' column). You can then rename the columns as needed.
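To get the wide layout shown in the question (one row per group, one column per value), an alternative not used in the answer above is unstack() after the groupby, or pd.crosstab:
import pandas as pd

df = pd.DataFrame({
    'group': ['a', 'a', 'a', 'a', 'a', 'b', 'b', 'b', 'b'],
    'value': ['yes', 'yes', 'yes', 'no', 'no', 'yes', 'no', 'no', 'no']
})

# Count per (group, value), then pivot the 'value' level into columns.
wide = df.groupby(['group', 'value']).size().unstack(fill_value=0)
print(wide)
# value  no  yes
# group
# a       2    3
# b       3    1

# pd.crosstab produces the same frequency table in one call.
print(pd.crosstab(df['group'], df['value']))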

Create a DataFrame birds from this dictionary data which has the index labels

Consider the following Python dictionary data and Python list labels:
data = {'birds': ['Cranes', 'Cranes', 'plovers', 'spoonbills', 'spoonbills', 'Cranes', 'plovers', 'Cranes', 'spoonbills', 'spoonbills'],
'age': [3.5, 4, 1.5, np.nan, 6, 3, 5.5, np.nan, 8, 4],
'visits': [2, 4, 3, 4, 3, 4, 2, 2, 3, 2],
'priority': ['yes', 'yes', 'no', 'yes', 'no', 'no', 'no', 'yes', 'no', 'no']}
labels = ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j']
Create a DataFrame birds from this dictionary data, using labels as the index, with pandas.
Assuming your dictionary is already ordered to match the ordering of the labels:
import numpy as np
import pandas as pd

data = {'birds': ['Cranes', 'Cranes', 'plovers', 'spoonbills', 'spoonbills', 'Cranes', 'plovers', 'Cranes', 'spoonbills', 'spoonbills'],
        'age': [3.5, 4, 1.5, np.nan, 6, 3, 5.5, np.nan, 8, 4],
        'visits': [2, 4, 3, 4, 3, 4, 2, 2, 3, 2],
        'priority': ['yes', 'yes', 'no', 'yes', 'no', 'no', 'no', 'yes', 'no', 'no']}
data['labels'] = ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j']
df = pd.DataFrame(data, columns=['birds', 'age', 'visits', 'priority', 'labels'])
df = df.set_index('labels')  # set_index returns a new DataFrame, so assign it back
Try the code below:
import numpy as np
import pandas as pd

labels = ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j']
data = {
    'birds': ['Cranes', 'Cranes', 'plovers', 'spoonbills', 'spoonbills', 'Cranes', 'plovers', 'Cranes', 'spoonbills', 'spoonbills'],
    'age': [3.5, 4, 1.5, np.nan, 6, 3, 5.5, np.nan, 8, 4],
    'visits': [2, 4, 3, 4, 3, 4, 2, 2, 3, 2],
    'priority': ['yes', 'yes', 'no', 'yes', 'no', 'no', 'no', 'yes', 'no', 'no'],
    'labels': labels
}
df = pd.DataFrame.from_dict(data)
df = df.set_index('labels')  # assign back so the new index actually sticks
You can reduce the code a bit:
DataFrame lets us pass in values like data, columns, index, and so on.
If we pass a dictionary, by default the dictionary keys become the columns and its values become the rows.
In the following code I use the name attribute of the DataFrame's index:
df = pd.DataFrame(data, index=labels)  # custom index
df.index.name = 'labels'  # by default df.index.name is None; this assigns the name 'labels' to the index
I hope this is helpful.
I encountered the exact same issue a few days back, and there is a very nice library for handling dataframes that arguably offers more than pandas.
Search for turicreate in Python; it is very similar to pandas but has a lot more to offer.
You can define SFrames in Turi Create, somewhat similar to the pandas DataFrame. After that you just have to run:
dataframe_name.show()
.show() visualizes any data structure in Turi Create.
You can visit this notebook for a better understanding: https://colab.research.google.com/drive/1DIFmRjGYx0UOiZtvMi4lOZmaBMnu_VlD
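A minimal hedged sketch of what that might look like (assuming turicreate is installed; SFrame and .show() are the pieces named above, and the columns here are just a subset of the question's data for illustration):
import turicreate as tc

# Keys of the dictionary become SFrame columns, much like a pandas DataFrame.
birds = tc.SFrame({
    'birds': ['Cranes', 'Cranes', 'plovers', 'spoonbills'],
    'age': [3.5, 4, 1.5, 6],
    'visits': [2, 4, 3, 3],
    'priority': ['yes', 'yes', 'no', 'no']
})

birds.show()  # opens Turi Create's interactive visualization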
You can try this out:
import pandas as pd
import numpy as np
from pandas import DataFrame
labels = ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j']
data = {'birds': ['Cranes', 'Cranes', 'plovers', 'spoonbills', 'spoonbills', 'Cranes', 'plovers', 'Cranes', 'spoonbills', 'spoonbills'],
'age': [3.5, 4, 1.5, np.nan, 6, 3, 5.5, np.nan, 8, 4],
'visits': [2, 4, 3, 4, 3, 4, 2, 2, 3, 2],
'priority': ['yes', 'yes', 'no', 'yes', 'no', 'no', 'no', 'yes', 'no', 'no']}
df = DataFrame(data, index=labels)
print(df)
