How do I replace missing values with NaN - python

I am using the IMDB dataset for machine learning, and it contains a lot of missing values, which are entered as '\N'. Specifically, I want to convert the values in the startYear column, which contains the movie's release year, to integers, which I'm not able to do right now. I could drop these values, but I wanted to see why they're missing first. I tried several things without success.
This is my latest attempt:

Here is a way to do it without using replace:
import pandas as pd
import numpy as np
df_basics = pd.DataFrame({'startYear':['\\N']*78760+[2017]*18267 + [2018]*18263+[2016]*17837+[2019]*17769+['1996 ','1993 ','2000 ','2019 ','2029 ']})
print(df_basics.startYear.value_counts())
# overwrite the '\N' placeholders with NaN via boolean indexing
df_basics.loc[df_basics.startYear == '\\N', 'startYear'] = np.nan
print(df_basics.startYear.value_counts(dropna=False))
Output:
NaN 78760
2017 18267
2018 18263
2016 17837
2019 17769
1996 1
1993 1
2000 1
2019 1
2029 1
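Since the goal in the question is an integer column, here is a minimal sketch of one way to get there. It assumes a small stand-in frame rather than the real IMDB data, and it uses pandas' nullable "Int64" dtype so that missing years can stay missing instead of forcing a float column:

```python
import pandas as pd

# Small stand-in for the real data: strings, ints, a trailing space and '\N' markers.
df_basics = pd.DataFrame({"startYear": ["\\N", "2017", 2018, "1996 ", "\\N"]})

# Normalise everything to stripped strings, then let to_numeric turn
# anything that is not a number (such as '\N') into NaN.
cleaned = df_basics["startYear"].astype(str).str.strip()
years = pd.to_numeric(cleaned, errors="coerce")

# The nullable integer dtype keeps the missing entries as <NA>.
df_basics["startYear"] = years.astype("Int64")
print(df_basics)
```

If the file is loaded with pd.read_csv, passing na_values='\\N' should mark these entries as missing at load time, which may make the replacement step unnecessary.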

Related

Manipulate Dataframe

Let's say I'm working on a dataset:
# dummy dataset
import pandas as pd
data = pd.DataFrame({"Name_id" : ["John","Deep","Julia","John","Sandy",'Deep'],
"Month_id" : ["December","March","May","April","May","July"],
"Colour_id" : ["Red",'Purple','Green','Black','Yellow','Orange']})
data
How can I convert this data frame into something like this:
Where the A_id is unique and forms new columns based on both the value and the existence / non-existence of the other columns in order of appearance? I have tried to use pivot but I noticed it's more used for numerical data instead of categorical.
Probably you should try pivot
data['Rowid'] = data.groupby('Name_id').cumcount()+1
d = data.pivot(index='Name_id', columns='Rowid',values = ['Month_id','Colour_id'])
d.reset_index(inplace=True)
# pivot returns all Month_id columns first, then all Colour_id columns
d.columns = ['Name_id','Month_id1', 'Month_id2', 'Colour_id1', 'Colour_id2']
which gives
Name_id Month_id1 Month_id2 Colour_id1 Colour_id2
0 Deep March July Purple Orange
1 John December April Red Black
2 Julia May NaN Green NaN
3 Sandy May NaN Yellow NaN
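If the columns should alternate (month and colour of the first occurrence, then of the second), one option is to derive the names from the MultiIndex columns that pivot returns instead of typing them by hand. This is only a sketch; the `order` list is a helper introduced here for illustration:

```python
import pandas as pd

data = pd.DataFrame({
    "Name_id": ["John", "Deep", "Julia", "John", "Sandy", "Deep"],
    "Month_id": ["December", "March", "May", "April", "May", "July"],
    "Colour_id": ["Red", "Purple", "Green", "Black", "Yellow", "Orange"],
})

# Number each occurrence of a name: 1 for the first, 2 for the second, ...
data["Rowid"] = data.groupby("Name_id").cumcount() + 1
d = data.pivot(index="Name_id", columns="Rowid", values=["Month_id", "Colour_id"])

# Flatten the (column name, Rowid) pairs into names such as 'Month_id1'.
d.columns = [f"{name}{rowid}" for name, rowid in d.columns]

# Interleave month and colour for each occurrence.
order = [f"{name}{i}"
         for i in sorted(data["Rowid"].unique())
         for name in ("Month_id", "Colour_id")]
d = d[order].reset_index()
print(d)
```

Deriving the names this way keeps the labels tied to the data, so a name that occurs three or more times does not require editing the column list.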

How to plot average of values for a year

I have a data frame like the one below. I am trying to plot the mean of 'number' for each year on the y-axis against the year on the x-axis. I think I need to make a new data frame with two columns, 'year' and 'avg number'. How would I go about doing that?
year number
0 2010 40
1 2010 44
2 2011 33
3 2011 32
4 2012 34
5 2012 56
When opening a question about pandas, please make sure you follow these guidelines: How to make good reproducible pandas examples. It will help us reproduce your environment.
Assuming your dataframe is stored in the df variable:
df.groupby('year').mean().plot()
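If the intermediate frame with 'year' and 'avg number' columns is wanted as well, a minimal sketch using the sample data from the question could look like this (the `avg` name is just a label chosen here):

```python
import pandas as pd

df = pd.DataFrame({
    "year": [2010, 2010, 2011, 2011, 2012, 2012],
    "number": [40, 44, 33, 32, 34, 56],
})

# as_index=False keeps 'year' as a regular column instead of the index.
avg = (
    df.groupby("year", as_index=False)["number"]
      .mean()
      .rename(columns={"number": "avg number"})
)
print(avg)

# With matplotlib installed, the frame can then be plotted with:
# avg.plot(x="year", y="avg number")
```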

How to use the value of one column as part of a string to fill NaNs in another column?

Let's say I have the following df:
year date_until
1 2010 -
2 2011 30.06.13
3 2011 NaN
4 2015 30.06.18
5 2020 -
I'd like to fill all - and NaNs in the date_until column with 30/06/{year +1}. I tried the following but it uses the whole year column instead of the corresponding value of the specific row:
df['date_until'] = df['date_until'].str.replace('-', f'30/06/{df["year"]+1}')
My final goal is to calculate the difference between the year and the year of date_until, so maybe the step above is even unnecessary.
We can use pd.to_datetime here with errors='coerce' to ignore the faulty dates. Then use the dt.year to calculate the difference:
df['date_until'] = pd.to_datetime(df['date_until'], format='%d.%m.%y', errors='coerce')
df['diff_year'] = df['date_until'].dt.year - df['year']
year date_until diff_year
0 2010 NaT NaN
1 2011 2013-06-30 2.0
2 2011 NaT NaN
3 2015 2018-06-30 3.0
4 2020 NaT NaN
For everybody who is trying to replace values just like I wanted to in the first place, here is how you could solve it:
for i in df.index:
    # fill both NaN and '-' placeholders with 30.06.{year + 1}
    if pd.isna(df.loc[i, 'date_until']) or df.loc[i, 'date_until'] == '-':
        # df.loc avoids chained assignment, which may write to a copy
        df.loc[i, 'date_until'] = f'30.06.{df.loc[i, "year"] + 1}'
But @Erfan's approach is much cleaner.
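For larger frames, the same row-wise fill can probably be done without a Python loop. The sketch below assumes the sample data from the question; `missing` and `fallback` are names introduced here for illustration:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "year": [2010, 2011, 2011, 2015, 2020],
    "date_until": ["-", "30.06.13", np.nan, "30.06.18", "-"],
})

# Rows that hold either NaN or the '-' placeholder.
missing = df["date_until"].isna() | (df["date_until"] == "-")

# Row-wise replacement strings built from each row's own year.
fallback = "30.06." + (df["year"] + 1).astype(str)

# Keep existing dates, use the fallback only where the value is missing.
df["date_until"] = df["date_until"].where(~missing, fallback)
print(df)
```

Because `fallback` is a Series aligned with the frame, each row picks up its own year, which is what the original str.replace attempt could not do.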

Using groupby calculations in Pandas data frames

I am working on a geospatial project where I need to do some calculations between groups of data within a data frame. The data spans several years and is keyed by Local Authority District (LAD) code; each year has a numerical ID.
I need to be able to calculate the mean average of a group of years within that data set relative to the LAD code.
LAC        LAN                JAN    FEB    MAR    APR    MAY    JUN    ID
K04000001  ENGLAND AND WALES  56597  43555  49641  88049  52315  42577  5
E92000001  ENGLAND            53045  40806  46508  83504  49413  39885  5
I can use groupby to calculate the mean based on a LAC, but what I can't do is calculate the mean grouped by LAC for ID 1:3 for example.
Which is more efficient: separating the data into separate dataframes stored in a dict, for example, or keeping it in one dataframe and using an ID?
df.groupby('LAC').mean()
I come from a MATLAB background, so I'm just getting the hang of the best way to do things.
Secondly, once these operations are complete, I would like to do the following:
(mean of ID 1:5) - (mean of ID 6), using LAC as the key.
Sorry if I haven't explained this very well!
Edit: Expected output.
To be able to average a group of rows by specific ID for a given value of LAC.
For example:
Average monthly values for E92000001 rows with ID 3
LAC        JAN    FEB    MAR    APR    MAY    JUN    ID
K04000001  56706  43653  49723  88153  52374  42624  5
K04000001  56597  43555  49641  88049  52315  42577  5
E92000001  49186  36947  42649  79645  45554  36026  5
E92000001  53045  40806  46508  83504  49413  39885  3
E92000001  68715  56476  62178  99174  65083  55555  4
E92000001  41075  28836  34538  71534  37443  27915  3
E92000001  54595  42356  48058  85054  50963  41435  1
Rows to be averaged:
E92000001 53045 40806 46508 83504 49413 39885 3
E92000001 41075 28836 34538 71534 37443 27915 3
Result
E92000001 47060 34821 40523 77519 43428 33900 3
edit: corrected error.
To match the update in your question: this gives you a dataframe with one row for each ID-LAC combination, holding the average of all the rows that share that combination.
df.groupby(['ID', 'LAC']).mean()
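As a quick check against the expected output in the question, here is a sketch that rebuilds part of the sample table (only the JAN and FEB columns, to keep it short) and looks up the row for E92000001 with ID 3. Passing numeric_only=True is an addition made here so that text columns such as LAN would be skipped rather than raise an error:

```python
import pandas as pd

df = pd.DataFrame({
    "LAC": ["K04000001", "K04000001", "E92000001", "E92000001",
            "E92000001", "E92000001", "E92000001"],
    "JAN": [56706, 56597, 49186, 53045, 68715, 41075, 54595],
    "FEB": [43653, 43555, 36947, 40806, 56476, 28836, 42356],
    "ID": [5, 5, 5, 3, 4, 3, 1],
})

# One row per (ID, LAC) pair, averaged over every row sharing that pair.
means = df.groupby(["ID", "LAC"]).mean(numeric_only=True)

# Look up a single combination through the resulting MultiIndex.
row = means.loc[(3, "E92000001")]
print(row)
```

The values for that row match the "Result" line in the question (47060 for JAN and 34821 for FEB).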
I would start by setting the year and LAC as the index
df = df.set_index(['ID', 'LAC']).sort_index()
Now you can groupby Index and get the mean for every month, or even each row's average since the first year.
lac_groups = df.groupby(level='LAC')  # rows within each LAC are ordered by ID after sort_index
expanding_mean = lac_groups.cumsum().div(lac_groups.cumcount() + 1, axis=0)
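For the two remaining parts of the question (the mean over a range of IDs per LAC, and the difference between the mean of IDs 1 to 5 and the mean of ID 6), one possible sketch keeps everything in a single dataframe and filters on ID before grouping. The data below is made up for illustration, with short LAC codes and a single month column:

```python
import pandas as pd

# Made-up values, not taken from the question.
df = pd.DataFrame({
    "LAC": ["A", "A", "A", "B", "B", "B"],
    "ID": [1, 2, 6, 1, 3, 6],
    "JAN": [10, 20, 5, 30, 50, 10],
})
months = ["JAN"]

# Mean per LAC over IDs 1 to 5 (between is inclusive at both ends).
mean_1_to_5 = df[df["ID"].between(1, 5)].groupby("LAC")[months].mean()

# Mean per LAC for ID 6 only.
mean_6 = df[df["ID"] == 6].groupby("LAC")[months].mean()

# Both results are indexed by LAC, so the subtraction aligns on it.
diff = mean_1_to_5 - mean_6
print(diff)
```

Filtering a single frame like this avoids maintaining a dict of per-year dataframes, although which layout is faster will depend on the size of the data.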

Vectorized looping pandas

Hi, I need to create a column with values 1 or 0 based on certain conditions. My dataframe is enormous, so a general for loop or even apply is extremely slow. I want to use pandas or, preferably, NumPy vectorization. Below is a sample of the data and my code, which does not work:
election_year D_president
1992 0
1992 0
1996 0
1996 0
2000 0
2004 0
2008 0
2012 0
test_df['D_president'] = 0
election_year = test_df['election_year']
test_df['D_president'] = test_df.loc[((election_year == 1992) |
(election_year == 1996) |
(election_year == 2008)|
(election_year == 2012)), 'D_president'] = 1
So basically I need to get a value of 1 in a column 'D_president' for these certain years. However, when I execute this code I get all 1 even for 2000 and 2004. Can't understand what's wrong.
Also how could I transform this into a Numpy vectorization with .values?
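It looks like you have two "=" assignments in the same statement. Python assigns the right-hand value to every target of a chained assignment, so test_df['D_president'] = ... = 1 sets the whole column to 1. Try removing the leftmost target, test_df['D_president']. Also, you can replace the chain of comparisons with election_year.isin([1992, 1996, 2008, 2012]).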
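Put together, the suggestion could look like the sketch below, which uses the sample years from the question. The `d_years` list and the `D_president_np` column are names introduced here; the second column shows a NumPy variant of the same test:

```python
import numpy as np
import pandas as pd

test_df = pd.DataFrame(
    {"election_year": [1992, 1992, 1996, 1996, 2000, 2004, 2008, 2012]}
)
d_years = [1992, 1996, 2008, 2012]

# pandas version: start from 0 and set matching rows to 1 in one assignment.
test_df["D_president"] = 0
test_df.loc[test_df["election_year"].isin(d_years), "D_president"] = 1

# NumPy version: membership test on the underlying array, cast to 0/1.
years = test_df["election_year"].to_numpy()
test_df["D_president_np"] = np.isin(years, d_years).astype(int)
print(test_df)
```

Both columns end up with 0 for 2000 and 2004 and 1 for the listed years.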
