Python Conditional NaN Value Replacement of existing Values in Dataframe - python

I am trying to transform a DataFrame which I loaded from a CSV.
That CSV has columns containing NaN / missing values, and the goal is to replace them all!
For example, in column 'gh' the value in row 45 is missing (as shown in the picture: Input Dataframe). I would like to replace it with the value from row 1, because 'latitude', 'longitude', 'time', 'step' and 'valid_time' are equal. So I would like a condition-based replacement keyed on those columns, and not just for 'gh' but also for meanSea, msl, t, u and v.
I tried something like this (just for 'gh'):
for i, row in df.iterrows():
    value = row["gh"]
    if pd.isnull(value):
        for j, rowx in df.iterrows():
            if (row["latitude"] == rowx["latitude"] and row["longitude"] == rowx["longitude"]
                    and row["time"] == rowx["time"] and row["step"] == rowx["step"]
                    and row["valid_time"] == rowx["valid_time"] and pd.notnull(rowx["gh"])):
                # assign via df.loc; writing to the iterrows() row copy would not change df
                df.loc[i, "gh"] = rowx["gh"]
                break
This is very inefficient for big DataFrames, so I need a better solution.

Assuming all values can be found somewhere in the dataset, the easiest way is to sort your df by those columns ('latitude', 'longitude', 'time', 'step', 'valid_time') and forward fill your NaNs:
df = df.sort_values(by=['latitude', 'longitude', 'time', 'step', 'valid_time']).ffill()
However, this fails if there are rows which do not have a counterpart somewhere else in the dataset.
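A sketch of a safer variant (assuming those five columns uniquely identify matching rows) is to fill within each group instead, so values never leak across different coordinate/time combinations:

key_cols = ['latitude', 'longitude', 'time', 'step', 'valid_time']
value_cols = ['gh', 'meanSea', 'msl', 't', 'u', 'v']
# fill forward and backward inside each group only
df[value_cols] = df.groupby(key_cols)[value_cols].transform(lambda g: g.ffill().bfill())

Since transform aligns on the original index, no sorting is needed and the row order is preserved.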

Related

New column in DataFrame from other columns AND rows

I want to create a new column, V, in an existing DataFrame, df. I would like the value of the new column to be the difference between the value in the 'x' column in that row, and the value of the 'x' column in the row below it.
As an example, in the picture below, I want the value of the new column to be
93.244598 - 93.093285 = 0.151313.
I know how to create a new column based on existing columns in Pandas, but I don't know how to reference other rows that way. Is there a way to do this that doesn't involve iterating over the rows in the dataframe? (I have read that this is generally a bad idea.)
You can use pandas.DataFrame.shift for your use case.
The last row will not have any row to subtract from, so the value for that cell will be NaN.
df['temp_x'] = df['x'].shift(-1)
df['new_col'] = df['x'] - df['temp_x']
Or as a one-liner:
df['new_col'] = df['x'] - df['x'].shift(-1)
The column new_col will contain the expected data.
An ideal solution is to use diff:
df['new'] = df['x'].diff(-1)
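As a quick sanity check, a minimal sketch (using the two 'x' values from the question) shows that both approaches agree:

import pandas as pd

df = pd.DataFrame({'x': [93.244598, 93.093285]})
df['new_col'] = df['x'] - df['x'].shift(-1)  # 0.151313, then NaN for the last row
df['new'] = df['x'].diff(-1)                 # diff(-1) subtracts the next row: same result
print(df)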

Convert Embedded JSON Dictionary to Pandas Dataframe

I have an embedded set of data given to me which needs to be converted to a pandas DataFrame:
"{'rows':{'data':[[{'column_name':'column','row_value':value}]]}"
It's just a snippet of what it looks like at the start. Everything inside data repeats over and over, i.e.
{'column_name': 'name', 'row_value': value}
I want the values of column_name to be the column headings. And the values of row_value to be the values in each row.
I've tried a few different ways. I thought it would be something along the lines of
df = pd.DataFrame(data=[data_rows['row_value'] for data_rows in raw_data['rows']['data']], columns=['column_name'])
But I might be way off. I'm probably not stepping into the data correctly with raw_data['rows']['data'].
Any suggestions would be great.
You can try to add another loop in your list comprehension to get elements out:
df = pd.DataFrame(data=[data_row for data_rows in raw_data['rows']['data'] for data_row in data_rows])
print(df)
                  name value type
0  dynamic_tag_tracker  null  null
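If the goal is specifically to get the column_name values as headers and the row_value values as rows, one sketch (assuming the nested structure shown in the question, with a toy stand-in for the data) is to build one dict per inner list:

import pandas as pd

raw_data = {'rows': {'data': [[{'column_name': 'column', 'row_value': 'value'}]]}}

# each inner list becomes one row; each dict contributes one column
records = [{cell['column_name']: cell['row_value'] for cell in row}
           for row in raw_data['rows']['data']]
df = pd.DataFrame(records)
print(df)
#   column
# 0  value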

How to insert a data point at a time to a pandas dataframe?

It might be non-Pythonic (if so, let me know), but I am running a function that produces only one data point at a time, and I would like to add those points to my dataframe. The reason is that for each row here, I have 252 rows in another dataframe that I feed into a function, which returns a single number.
I am using this method:
data.loc[row, 'ColumnA'] = some integer
but it appends the rows/values at the end of the dataframe, whereas I want to create a new column and populate it one data point at a time. So, for example, if I have this column in a dataframe:
Column A
NaN
NaN
NaN
and I run this:
data.loc[0, 'ColumnA'] = 10
I would like to see:
Column A
10
NaN
NaN
Thank you!
Have a look at the .at indexer on a dataframe:
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.at.html
It can be used to read and set a single value using a row/column pair:
df.at[row, column] = value
So your code would look like:
data.at[row, 'ColumnA'] = 10
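A minimal sketch on the column from the question (toy NaN data):

import numpy as np
import pandas as pd

data = pd.DataFrame({'ColumnA': [np.nan, np.nan, np.nan]})
data.at[0, 'ColumnA'] = 10   # sets the single cell; no rows are appended
print(data)
#    ColumnA
# 0     10.0
# 1      NaN
# 2      NaN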

Drop rows that contain NaN while preserving index

I am trying to clean a very large data frame using Pandas.
The data set contains duplicate columns for metrics like height, weight, sex, and age. Some of the rows have data in the column currentAge while other rows have data in the column currentAge2.
So I want to drop the rows that have NaN in both currentAge and currentAge2, for example, because they are useless data points. I would like to do the same for all of the other metrics.
The index of my data frame starts from 0. Below is the code I have tried.
for index, row in csv.iterrows():
    if math.isnan(row['currentAge']) and math.isnan(row['currentAge2']):
        csv.drop(csv.index[index])
This does not work, and when I use inplace=True I get an index-out-of-bounds error. If someone could shed light on how I could properly clean this data frame, that would be great. csv is the name of my data frame.
I do not think we need iterrows here.
csv[~(csv['currentAge'].isnull() & csv['currentAge2'].isnull())]
If you want to drop the rows with NaN in both currentAge and currentAge2 inplace, you can also try:
csv.dropna(how='all', subset=['currentAge','currentAge2'], inplace=True)
The docs explain how the kwargs how and subset work. This is also easier to use if you need to consider more columns.
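For example, a quick check on toy data (hypothetical ages) shows the all-NaN row dropped while the original index is preserved:

import numpy as np
import pandas as pd

csv = pd.DataFrame({'currentAge': [25, np.nan, np.nan],
                    'currentAge2': [np.nan, 30, np.nan]})
csv.dropna(how='all', subset=['currentAge', 'currentAge2'], inplace=True)
print(csv)
#    currentAge  currentAge2
# 0        25.0          NaN
# 1         NaN         30.0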
I hope that helps.

Cleaning Data: Replacing Current Column Values with Values mapped in Dictionary

I have been trying to wrap my head around this for a while now and have yet to come up with a solution.
My question is: how do I change the current values in multiple columns, based on the column name, if a criterion is met?
I have survey data which has been read in as a pandas csv dataframe:
import pandas as pd
df = pd.read_csv("survey_data")
I have created a dictionary with column names and the values I want in each column if the current column value is equal to 1. Each column contains 1 or NaN. Basically, any column within the data frame ending in '_SA' becomes 5, '_A' becomes 4, '_NO' becomes 3, '_D' becomes 2, and '_SD' stays at the current value 1. All of the NaN values remain as is. This is the dictionary:
op_dict = {
    'op_dog_SA': 5,
    'op_dog_A': 4,
    'op_dog_NO': 3,
    'op_dog_D': 2,
    'op_dog_SD': 1,
    'op_cat_SA': 5,
    'op_cat_A': 4,
    'op_cat_NO': 3,
    'op_cat_D': 2,
    'op_cat_SD': 1,
    'op_fish_SA': 5,
    'op_fish_A': 4,
    'op_fish_NO': 3,
    'op_fish_D': 2,
    'op_fish__SD': 1}
I have also created a list, op_cols, of the columns within the data frame I would like to be changed if the current column value is 1. I have been trying to use something like this, which iterates through the values in those columns and replaces 1 with the mapped value from the dictionary:
for i in df[op_cols]:
    if i == 1:
        df[op_cols].apply(lambda x: op_dict.get(x, x))
df[op_cols]
It does not raise an error, but it does not replace the 1 values with the corresponding value from the dictionary; they remain as 1.
Any advice on why this does not work, or suggestions for a more efficient way, would be greatly appreciated.
So if I understand your question, you want to replace all ones in a column with 1, 2, 3, 4, 5 depending on the column name?
I think all you need to do is iterate through your list and multiply by the value your dict returns:
for col in op_cols:
    df[col] = df[col] * op_dict[col]
This does what you describe and is far faster than replacing every value. NaNs will still be NaNs; you could also handle those in the loop with fillna if you like.
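A small check on toy data (hypothetical survey columns and a trimmed-down dict) shows the ones being mapped while the NaNs survive:

import numpy as np
import pandas as pd

op_dict = {'op_dog_SA': 5, 'op_dog_D': 2}
op_cols = ['op_dog_SA', 'op_dog_D']
df = pd.DataFrame({'op_dog_SA': [1, np.nan], 'op_dog_D': [np.nan, 1]})
for col in op_cols:
    df[col] = df[col] * op_dict[col]   # 1 -> mapped value, NaN stays NaN
print(df)
#    op_dog_SA  op_dog_D
# 0        5.0       NaN
# 1        NaN       2.0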
