Drop rows that contain NaN while preserving index - python

I am trying to clean a very large data frame using Pandas.
The data set contains duplicate columns for metrics like height, weight, sex, and age. Some of the rows have data for column name currentAge while other rows have data for column name currentAge2.
So, I want to drop the rows that have NaN in both currentAge and currentAge2 for example because they are useless data points. I would like to do the same for all of the other metrics.
The index of my data frame starts from 0. Below is the code I have tried.
for index, row in csv.iterrows():
    if ((math.isnan(row['currentAge']) and math.isnan(row['currentAge2'])) == True):
        csv.drop(csv.index[index])
This does not work, and when I use inplace=True I get an index out of bounds error. If someone could shed light on how I could properly clean this data frame, that would be great. csv is the name of my data frame.

I do not think we need iterrows here.
csv[~(csv['currentAge'].isnull() & csv['currentAge2'].isnull())]
This keeps only the rows where at least one of the two columns is present.

If you want to drop the rows with NaN in both currentAge and currentAge2 inplace, you can also try:
csv.dropna(how='all', subset=['currentAge','currentAge2'], inplace=True)
The docs explain how the kwargs how and subset work. This is also easier to use if you need to consider more columns.
I hope that helps.
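To make how='all' and subset concrete, here is a small illustrative example (the frame below is made up; only the column names come from the question):
import pandas as pd
import numpy as np

csv = pd.DataFrame({'currentAge':  [25, np.nan, np.nan, 40],
                    'currentAge2': [np.nan, 30, np.nan, 41]})

# how='all' drops a row only when every column listed in subset is NaN,
# so only the third row is removed; how='any' would also drop the first two.
csv.dropna(how='all', subset=['currentAge', 'currentAge2'], inplace=True)
print(csv)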

Related

Python Conditional NaN Value Replacement of existing Values in Dataframe

I am trying to transform my DataFrame, which I loaded from a CSV.
That CSV has columns with NaN / missing values, and the goal is to replace them all.
For example, in column 'gh' the value in row 45 is missing (as shown in the picture: Input Dataframe). I would like to replace it with the value from row 1, because 'latitude', 'longitude', 'time', 'step' and 'valid_time' are equal. So I would like a condition-based replacement keyed on those columns, and not just for 'gh' but also for meanSea, msl, t, u and v.
Input Dataframe
I tried something like this (just for 'gh'):
for i, row in df.iterrows():
    value = row["gh"]
    if pd.isnull(value):
        for j, rowx in df.iterrows():
            if row["latitude"] == rowx["latitude"] and row["longitude"] == rowx["longitude"] and row["time"] == rowx["time"] and row["step"] == rowx["step"] and row["valid_time"] == rowx["valid_time"]:
                valuex = rowx["gh"]
                row["gh"] = valuex
                break
This is very inefficient for big DataFrames, so I need a better solution.
Assuming all values can be found somewhere in the dataset, the easiest way is to sort your df by those columns ('latitude','longitude', 'time' ,'step','valid_time') and forward fill your NaN's:
df.sort_values(by=['latitude','longitude', 'time' ,'step','valid_time']).ffill()
However, this fails if there are rows which do not have a counterpart somewhere else in the dataset.
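If that caveat matters, one possible variant (a sketch, assuming the column names from the question) is to fill within each group of identical key values, so values cannot leak between unrelated rows:
key_cols = ['latitude', 'longitude', 'time', 'step', 'valid_time']
value_cols = ['gh', 'meanSea', 'msl', 't', 'u', 'v']

# Forward- and back-fill the gaps within each key group; rows whose group has
# no non-NaN value at all simply stay NaN.
df[value_cols] = (df.groupby(key_cols)[value_cols]
                    .transform(lambda s: s.ffill().bfill()))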

Pandas: how to keep data that has all the needed columns

I have this big csv file that has data from an experiment. The first part of each person's response is a trial part that doesn't record the time they took for each response, and I don't need that. After that part, the data adds another column, which is the time, and those are the rows I need. So, basically, the csv has a lot of unusable data that has 9 columns instead of 10 and I need only the data with the 10 columns. How can I manage to grab that data instead of all of it?
As an example, the first row below shows the data without the time column (second to last) and the second row shows the data I need with the time column added. Basically I only need rows like the second one, and there are thousands of them. Any tips would be appreciated.
1619922425,5fe43773223070f515613ba23f3b770c,PennController,7,0,experimental-trial2,NULL,PennController,9,_Trial_,End,1619922289638,FLOR, red, r,NULL
1619922425,5fe43773223070f515613ba23f3b770c,PennController,55,0,experimental-trial,NULL,PennController,56,_Trial_,Start,1619922296066,CASA, red, r,1230,NULL
Read the CSV using pandas, then filter with df[~df.time.isna()] to select all rows with non-NaN values in the "time" column.
You can change this to filter based on the presence of data in any column. Think of it as a mask (i.e. mask = ~df.time.isna() flags rows as True/False depending on the condition).
One option is to load the whole file and then keep only valid data:
import pandas as pd
df = pd.read_csv("your_file.csv")
invalid_rows = df.iloc[:,-1].isnull()  # Find rows where the last column is missing
df = df[~invalid_rows] # Select only valid rows
If you have columns named, then you can use df['column_name'] instead of df.iloc[:,-1].
Of course it means you first load the full dataset, but in many cases this is not a problem.
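For completeness, here is a sketch of reading such a ragged file in one go; the file name and generic column labels are assumptions, and it relies on the two sample lines above, where the complete rows carry the time in the second-to-last field:
import csv
import pandas as pd

# Find the widest row first, so read_csv does not fail on lines with extra fields.
with open("experiment.csv", newline="") as f:
    width = max(len(row) for row in csv.reader(f))

df = pd.read_csv("experiment.csv", header=None,
                 names=[f"col{i}" for i in range(width)])

# Complete rows carry the time in the second-to-last field; incomplete rows have
# the literal NULL there, which pandas reads as NaN, so this keeps only the
# complete rows.
df = df[~df.iloc[:, -2].isna()]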

How to insert data into an existing dataframe, replacing values according to a conditional

I'm looking to insert information into an existing dataframe. This dataframe's shape is 2001 rows × 13 columns; however, only the first column has information.
I have 12 more columns, but these do not have the same dimensions as the main dataframe, so I'd like to insert these additional columns into the main one using a conditional.
Example dataframe:
This is an example. I want to insert the var column into the 2001 × 13 dataframe, using the date as the condition; where there is no matching date, it should skip the row or simply add a 0.
I'm really new to python and programming in general.
Without a minimal working example it is hard to provide you with clear recommendations, but I think what you are looking for is the .loc accessor of a pd.DataFrame. What I would recommend doing is the following:
Selecting rows with .loc works better in your case if the dates are first converted to datetime, so a first step is to make this conversion:
# Pandas is quite smart about guessing date format. If this fails, please check the
# documentation https://docs.python.org/3/library/datetime.html to learn more about
# format strings.
df['date'] = pd.to_datetime(df['date'])
# Make this the index of your data frame.
df.set_index('date', inplace=True)
It is not clear how you intend to use the conditionals or what the content of your other columns is. Using .loc this is pretty straightforward:
# At Feb 1, 2020, add a value to columns 'var'.
df.loc['2020-02-01', 'var'] = 0.727868
This could also be used for ranges:
# Assuming you have a second `df2` which has a datetime column 'date' with the
# data you wish to add to `df`. This will only work if all df2['date'] are found
# in df.index. You can work out the logic for your case.
df.loc[df2['date'], 'var2'] = df2['vals']
If the logic is too complex and the dataframe is not too large, iterating with .iterrows could be easier, especially if you are beginning with Python.
for idx, row in df.iterrows():
    if idx in list_of_other_dates:
        df.loc[idx, 'var'] = (some code here)
Please clarify your problem a bit and you will get better answers. Do not forget to check the documentation.
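If the dates in df2 are unique, a join-style assignment is another option. This is only a sketch that reuses the df and df2 names from the snippets above; dates with no match in df2 get 0, as the question suggests:
# Align df2's values on df's date index; dates missing from df2 become 0.
extra = df2.set_index(pd.to_datetime(df2['date']))['var']
df['var'] = extra.reindex(df.index).fillna(0)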

How to tell pandas to read columns from the left?

I have a csv in which one header column is missing. E.g., I have n data columns, but n-1 header names. When this happens, it seems like pandas shifts my first column to be an index, as shown in the image. So what happens is that the column to the right of date_time in the csv ends up under the date_time column in the pandas data frame.
My question is: how can I force pandas to read from the left so that the date_time data remains under the date_time column instead of becoming the index? I'm thinking if pandas can simply read from left to right and add dummy column names at the end of the file, that would be great.
Side note: I concede that my input csv should be "clean"; however, I think that pandas/frameworks in general should be able to handle the case in which some data might be unclean, but the user wants to proceed with the analysis instead of spending 30 minutes writing a side function/script to fix these minor issues. In my case, the data I care about is usually in the first 15 columns and I don't really care if the columns after that are misaligned. However, when I read the dataframe into pandas, I'm forced to care and waste time fixing these issues even though I don't care about the remaining columns.
Since you don't care about the last column, just set index_col=False
df = pd.read_csv(file, index_col=False)
That way, it will sequentially match the columns with data for the first n-1 columns. Data after that will not be in the data frame.
You may also skip the first row to have all your data in the data frame first
df = pd.read_csv(file, skiprows=1)
and then just set the column names afterwards
df.columns = ['col1', 'col2', ....] + ['dummy_col1', 'dummy_col2'...]
where the first list comes from row 0 of your csv, and the second list you just fill dynamically with a list comprehension.
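As an illustration of that second approach, here is a sketch that pads the header dynamically (the file name is made up; it assumes a comma-separated file whose header row is short by one or more names):
import pandas as pd

# Read the data without a header, then read the header line separately.
df = pd.read_csv("data.csv", skiprows=1, header=None)
with open("data.csv") as f:
    header = f.readline().rstrip("\n").split(",")

# Pad the real names with dummy names for the extra, unnamed columns.
df.columns = header + [f"dummy_col{i}" for i in range(len(df.columns) - len(header))]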

How to deal with missing values in Pandas DataFrame?

I have a Pandas Dataframe that has some missing values. I would like to fill the missing values with something that doesn't influence the statistics that I will do on the data.
As an example, if in Excel you try to average a cell that contains 5 and an empty cell, the average will be 5. I'd like to have the same in Python.
I tried to fill with NaN but if I sum a certain column, for example, the result is NaN.
I also tried to fill with None but I get an error because I'm summing different datatypes.
Can somebody help? Thank you in advance.
There are many answers to your two questions.
Here is a solution for your first one:
If you wish to insert a certain value into the NaN entries of your DataFrame that won't alter your statistics, then I would suggest using the mean value of that data.
Example:
df # your dataframe with NaN values
df.fillna(df.mean(), inplace=True)
For the second question:
If you need to check descriptive statistics of your dataframe, and those descriptive stats should not be influenced by the NaN values, here are two solutions:
1)
df # your dataframe with NaN values
df.fillna(df.mean(), inplace=True)
df.mean()
df.std()
# or even:
df.describe()
2)
I would suggest using the numpy nan-aware functions such as numpy.nansum, numpy.nanmean and numpy.nanstd:
import numpy
df.apply(numpy.nansum)
df.apply(numpy.nanstd)  # ...
The answer to your question is that missing values work differently in Pandas than in Excel. You can read about the technical reasons for that here. Basically, there is no magic number that we can fill a df with that will cause Pandas to just overlook it. Depending on our needs, we will sometimes choose to fill the missing values, sometimes to drop them (either permanently or for the duration of a calculation), or sometimes to use methods that can work with them (e.g. numpy.nansum, as Philipe Riskalla Leal mentioned).
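As a small illustration of that last point (the numbers are made up): pandas aggregations skip NaN by default, which already gives the Excel-like behaviour asked about in the question without filling anything:
import pandas as pd
import numpy as np

s = pd.Series([5, np.nan])
print(s.mean())             # 5.0 -- NaN is skipped by default (skipna=True)
print(s.sum(skipna=False))  # nan -- only when you explicitly ask to include it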
You can use df.fillna(). Here is an example of how you can do the same.
import pandas as pd
import numpy as np

df = pd.DataFrame([[np.nan, 2, 1, np.nan],
                   [2, np.nan, 3, 4],
                   [4, np.nan, np.nan, 3],
                   [np.nan, 2, 1, np.nan]], columns=list('ABCD'))
df.fillna(0.0)
Generally, filling with a value like 0 will affect the statistics you do on your data.
So go for the mean of the data, which makes sure your statistics are not affected.
So, use df.fillna(df.mean()) instead.
If you want to change the datatype of a specific column so that its missing values are treated as NaN for statistical operations, you can simply use the line of code below: it converts all the values of that column to a numeric type, missing values are automatically replaced with NaN, and your statistical operations are not affected.
df['column_name'] = pd.to_numeric(df['column_name'], errors='coerce')
If you want to do the same for all the columns in dataframe you can use:
for i in df.columns:
    df[i] = pd.to_numeric(df[i], errors='coerce')
