Pandas dropna() not removing entire row - python

When I searched for a way to remove an entire row in pandas if it contains a null/NaN value, the only appropriate function I found was dropna(). For some reason, it's not removing the entire row as intended, but instead seems to replace the null values with zero. Since I want to discard those rows and then compute the mean age of the animals in the dataframe, I need a way to not count the NaN values.
Here's the code:
import numpy as np
import pandas as pd

data = {'animal': ['cat', 'cat', 'snake', 'dog', 'dog', 'cat', 'snake', 'cat', 'dog', 'dog'],
        'age': [2.5, 3, 0.5, np.nan, 5, 2, 4.5, np.nan, 7, 3],
        'visits': [1, 3, 2, 3, 2, 3, 1, 1, 2, 1],
        'priority': ['yes', 'yes', 'no', 'yes', 'no', 'no', 'no', 'yes', 'no', 'no']}
labels = ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j']
df = pd.DataFrame(data, index=labels)
df.dropna(inplace=True)
df.head()
In this case, I need to delete dog 'd' and cat 'h', but the output still contains them. Note that I have also tried this, and it didn't work either:
df2 = df.dropna()

You have to specify axis=1 and how='any' to remove columns.
See: https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.dropna.html
df.dropna(axis=1, inplace=True, how='any')
If you just want to delete the rows:
df.dropna(inplace=True, how='any')
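As an aside for the stated goal: Series.mean() skips NaN by default, so you can compute the mean age without dropping any rows. A minimal sketch using the question's original df (before any dropna):
print(df['age'].mean())                         # 3.4375, computed over the 8 non-NaN ages
print(df.dropna(subset=['age'])['age'].mean())  # same result, dropping only rows where age is NaN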


why the values of my copied dataframe also change [duplicate]

This question already has an answer here:
DataFrame.apply in python pandas alters both original and duplicate DataFrames (1 answer)
So here is my issue:
# creation of dataframe
import numpy as np
import pandas as pd

data = {'animal': ['cat', 'cat', 'snake', 'dog', 'dog', 'cat', 'snake', 'cat', 'dog', 'dog'],
        'age': [2.5, 3, 0.5, np.nan, 5, 2, 4.5, np.nan, 7, 3],
        'visits': [1, 3, 2, 3, 2, 3, 1, 1, 2, 1],
        'priority': ['yes', 'yes', 'no', 'yes', 'no', 'no', 'no', 'yes', 'no', 'no']}
labels = ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j']
df = pd.DataFrame(data, index=labels)
If I run the code below, the values of the column priority change for the dataframe dg (fine, that's expected), but they change for the dataframe df too. Why?
# map function
dg = df
dg["priority"] = dg["priority"].map({"yes":True, "no":False})
print(dg)
Simply use df.copy().
df is a DataFrame instance and therefore an object; when you assign it to another variable, you end up with two variables pointing at the same object. pandas does not create a new one.
That's because pandas dataframes are mutable.
pandas.DataFrame
Two-dimensional, size-mutable, potentially heterogeneous tabular data.
You want pandas.DataFrame.copy to keep the original dataframe (in your case df) unchanged.
# map function
dg = df.copy()
dg["priority"] = dg["priority"].map({"yes":True, "no":False})

How to properly apply filters in pandas from the given set of filters?

I am having trouble applying filters with pandas. The problem looks like this.
The first entry in the set (filter_names) should correspond to the first entry in the set (filter_values). The value of the second column should be bigger than or equal to the value given.
In other words, in the input like this:
df = pd.DataFrame({'animal': ['cat', 'cat', 'snake', 'dog', 'dog', 'cat', 'snake', 'cat', 'dog', 'dog'],
                   'age': [2.5, 3, 0.5, np.nan, 5, 2, 4.5, np.nan, 7, 3],
                   'name': ['Murzik', 'Pushok', 'Kaa', 'Bobik', 'Strelka', 'Vaska', 'Kaa2', 'Murka', 'Graf', 'Muhtar'],
                   'visits': [1, 3, 2, 3, 2, 3, 1, 1, 2, 1],
                   'priority': ['yes', 'yes', 'no', 'yes', 'no', 'no', 'no', 'yes', 'no', 'no']},
                  index=['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j'])
filter_names = ["animal", "age"]
filter_values = ["cat", 3]
the condition to be put in the query looks like this:
"animal" == "cat", "age" < 3.
It should provide the DF below:
  animal  age    name  visits priority
a    cat  2.5  Murzik       1      yes
f    cat  2.0   Vaska       3       no
I wrote the following code to achieve this effect:
df_filtered = df[(filter_names[0]==filter_values[0])&(df[filter_names[1]]>=filter_values[1])]
to no avail.
What do I seem to be missing?
I think you lost df[...] in the first condition and used the wrong sign in the second one:
df[(df[filter_names[0]] == filter_values[0]) & (df[filter_names[1]] < filter_values[1])]
It will work like this:
In [2]: df[(df[filter_names[0]] == filter_values[0]) & (df[filter_names[1]] < filter_values[1])]
Out[2]:
  animal  age    name  visits priority
a    cat  2.5  Murzik       1      yes
f    cat  2.0   Vaska       3       no
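If the list of filters can grow beyond two, one way to generalize this (a sketch, under the assumption that string filters mean equality and numeric filters mean <, mirroring the condition above) is to build one mask per (name, value) pair and AND them together:
import numpy as np

masks = [df[name] == value if isinstance(value, str) else df[name] < value
         for name, value in zip(filter_names, filter_values)]
df_filtered = df[np.logical_and.reduce(masks)]  # combine all masks with AND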

Trying to print only the even values of a DataFrame column in pandas

This is the dataframe:
import numpy as np
import pandas as pd

exam_data = {'name': ['Anastasia', 'Dima', 'Katherine', 'James', 'Emily', 'Michael', 'Matthew', 'Laura', 'Kevin', 'Jonas'],
             'score': [12.5, 9, 16.5, np.nan, 9, 20, 14.5, np.nan, 8, 19],
             'attempts': [1, 3, 2, 3, 2, 3, 1, 1, 2, 1],
             'qualify': ['yes', 'no', 'yes', 'no', 'no', 'yes', 'yes', 'no', 'no', 'yes']}
labels = ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j']
I'm trying to print only the even values of a column named "attempts".
Is there a function I can use, such as:
df["attempts"].even()
You can use
df.attempts[df.attempts % 2 == 0]
If you look at df.attempts % 2 == 0, you'll see it is a series of Trues and Falses. We then use it as a boolean mask to select the desired entries, i.e. the ones that give 0 as the remainder when divided by 2, the even ones.
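For illustration, the intermediate mask looks like this (assuming df = pd.DataFrame(exam_data) with the default integer index, as in the next answer):
print(df.attempts % 2 == 0)
0    False
1    False
2     True
3    False
4     True
5    False
6    False
7    False
8     True
9    False
Name: attempts, dtype: bool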
Use DataFrame.loc to filter by a boolean mask (compare the column modulo 2 with 0) and select the column by name:
df = pd.DataFrame(exam_data)
print(df.loc[df.attempts % 2 == 0, 'attempts'])
2    2
4    2
8    2
Name: attempts, dtype: int64
For odd values, compare with 1:
print(df.loc[df.attempts % 2 == 1, 'attempts'])
0    1
1    3
3    3
5    3
6    1
7    1
9    1
Name: attempts, dtype: int64
You can try with the following:
print(df[df['attempts']%2==0]['attempts'].values)

Create a DataFrame birds from this dictionary data which has the index labels

Consider the following Python dictionary data and Python list labels:
data = {'birds': ['Cranes', 'Cranes', 'plovers', 'spoonbills', 'spoonbills', 'Cranes', 'plovers', 'Cranes', 'spoonbills', 'spoonbills'],
        'age': [3.5, 4, 1.5, np.nan, 6, 3, 5.5, np.nan, 8, 4],
        'visits': [2, 4, 3, 4, 3, 4, 2, 2, 3, 2],
        'priority': ['yes', 'yes', 'no', 'yes', 'no', 'no', 'no', 'yes', 'no', 'no']}
labels = ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j']
Create a DataFrame birds from this dictionary data which has the index labels using Pandas
Assuming your dictionary is already in the correct order for the labels:
import numpy as np
import pandas as pd

data = {'birds': ['Cranes', 'Cranes', 'plovers', 'spoonbills', 'spoonbills', 'Cranes', 'plovers', 'Cranes', 'spoonbills', 'spoonbills'],
        'age': [3.5, 4, 1.5, np.nan, 6, 3, 5.5, np.nan, 8, 4],
        'visits': [2, 4, 3, 4, 3, 4, 2, 2, 3, 2],
        'priority': ['yes', 'yes', 'no', 'yes', 'no', 'no', 'no', 'yes', 'no', 'no']}
data['labels'] = ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j']
df = pd.DataFrame(data, columns=['birds', 'age', 'visits', 'priority', 'labels'])
df = df.set_index('labels')  # set_index returns a new DataFrame, so reassign
Try the code below:
labels = ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j']
data = {
    'birds': ['Cranes', 'Cranes', 'plovers', 'spoonbills', 'spoonbills', 'Cranes', 'plovers', 'Cranes', 'spoonbills', 'spoonbills'],
    'age': [3.5, 4, 1.5, np.nan, 6, 3, 5.5, np.nan, 8, 4],
    'visits': [2, 4, 3, 4, 3, 4, 2, 2, 3, 2],
    'priority': ['yes', 'yes', 'no', 'yes', 'no', 'no', 'no', 'yes', 'no', 'no'],
    'labels': labels
}
df = pd.DataFrame.from_dict(data)
df = df.set_index('labels')  # reassign: set_index is not in-place by default
You can reduce the code a little.
DataFrame gives us the flexibility to provide several values such as data, columns, index, and so on.
If we are dealing with a dictionary, then by default the dictionary's keys are treated as columns and its values as rows.
In the following code I have used the index's name attribute through the DataFrame object:
df = pd.DataFrame(data, index=labels)  # custom index
df.index.name = 'labels'  # without this, df.index.name is None; this sets a name for the index
I hope this is helpful for you.
I encountered the exact same issue a few days back, and we have a very nice library for handling dataframes that in my experience is better than pandas.
Search for turicreate in Python; it is very similar to pandas but has a lot more to offer.
You can define SFrames in Turi Create, somewhat similar to the pandas dataframe. After that you just have to run:
dataframe_name.show()
.show() visualizes any data structure in Turi Create.
You can visit the mentioned notebook for a better understanding: https://colab.research.google.com/drive/1DIFmRjGYx0UOiZtvMi4lOZmaBMnu_VlD
You can try this out:
import pandas as pd
import numpy as np
from pandas import DataFrame

labels = ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j']
data = {'birds': ['Cranes', 'Cranes', 'plovers', 'spoonbills', 'spoonbills', 'Cranes', 'plovers', 'Cranes', 'spoonbills', 'spoonbills'],
        'age': [3.5, 4, 1.5, np.nan, 6, 3, 5.5, np.nan, 8, 4],
        'visits': [2, 4, 3, 4, 3, 4, 2, 2, 3, 2],
        'priority': ['yes', 'yes', 'no', 'yes', 'no', 'no', 'no', 'yes', 'no', 'no']}
df = DataFrame(data, index=labels)
print(df)

Replicate pandas dataframe for each instance of field in another dataframe

I have two dataframes. input_df1 is a concatenation of files with a 'filename' column, a 'unitid' column, and many columns of data, which I simplify to a single 'data' column for this example. input_df2 has a 'filename' column in addition to 'group'. I'm trying to replicate the data in input_df1 per filename for each instance of that filename in input_df2, and to add a 'group' column in the output dataframe showing which group each row belongs to.
import pandas as pd

input_df1 = pd.DataFrame(data={
    'filename': ['A', 'A', 'B', 'C'],
    'unitid':   [1, 2, 3, 4],
    'data':     [11, 12, 13, 14]
})
input_df2 = pd.DataFrame(data={
    'filename': ['A', 'B', 'C', 'C', 'A'],
    'group':    ['g1', 'g2', 'g3', 'g4', 'g5']
})
output_df = pd.DataFrame(data={
    'filename': ['A', 'A', 'A', 'A', 'B', 'C', 'C', 'A', 'A', 'A', 'A'],
    'unitid':   [1, 2, 1, 2, 3, 4, 4, 1, 2, 1, 2],
    'data':     [11, 12, 11, 12, 13, 14, 14, 11, 12, 11, 12],
    'group':    ['g1', 'g1', 'g1', 'g1', 'g2', 'g3', 'g4', 'g5', 'g5', 'g5', 'g5']
})
output_df is what I'm trying to create: replicated rows of input_df1 per instance of filename in input_df2, with the 'group' value added.
Another question I have: if I need to filter the rows of each replicated dataframe based on the group type, is it better to do that before joining or after? I was planning on filtering after, since I have a better idea of how to do that, but I figure computing on unneeded rows is inefficient when they could be dropped during the replication. Also, I'm dealing with about 30k rows in input_df1 and 800 rows in input_df2, so there is potential for 24M rows total.
Any direction on which functions I should research to achieve this would be much appreciated.
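A many-to-many merge on 'filename' is the usual tool for this kind of replication. A sketch, not a definitive implementation; note that a plain merge emits each input_df1 row once per matching input_df2 row, so for the example data it yields 7 rows, with the A block appearing once under g1 and once under g5 rather than twice each as in output_df above:
# each row of input_df2 picks up every matching row of input_df1,
# carrying its 'group' value along
output = input_df2.merge(input_df1, on='filename')
On the filtering question: filtering before the merge is generally cheaper, since it shrinks the frames before the row count fans out. For example (the 'g1'/'g5' group values here are purely illustrative):
wanted = input_df2[input_df2['group'].isin(['g1', 'g5'])]
output = wanted.merge(input_df1, on='filename')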
