Python longitudinal data, filling NaN values

I have a dataset on health indicators, with columns such as 'Country', 'Year', 'GDP', and 'Life expectancy'. The data covers the years 2000-2015.
So, there is data for many health indicators for each country for each of the years from 2000-2015.
Many of the variables have missing (NaN) data for specific years/countries.
For instance, how would I replace NaN values with the mean of that variable for the given country over the 2000-2015 range, for every country?
Additionally, since this is longitudinal data, it would be great to maintain the general trend over time within each country's 16 years of data. Is there a way to replace NaN data for each country, accounting for the general trend for that country/variable over time?
If you guys could explain both methods, that would be phenomenal.
link to data: https://www.kaggle.com/kumarajarshi/life-expectancy-who
Thanks,
D

You probably want to look at the pd.DataFrame.interpolate() method. It supports several interpolation methods for filling NaNs in a time series or filling in missing values generally.
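For example, a minimal sketch of both approaches, assuming the columns are named 'Country', 'Year', 'GDP' and 'Life expectancy' as in your description (adjust the names and the file path to your copy of the Kaggle file):
import pandas as pd

df = pd.read_csv('life_expectancy.csv')  # hypothetical filename for the Kaggle file

# Method 1: replace NaNs with the mean of that variable within each country
df['GDP'] = df['GDP'].fillna(df.groupby('Country')['GDP'].transform('mean'))

# Method 2: sort by year and interpolate within each country, so the filled
# values follow that country's trend over its 16 years of data
df = df.sort_values(['Country', 'Year'])
df['Life expectancy'] = (df.groupby('Country')['Life expectancy']
                           .transform(lambda s: s.interpolate()))
interpolate() defaults to linear interpolation; other methods are available if you want a smoother fit to each country's trend.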

Related

How can I select the time period with the least amount of NaN values in pandas?

I have a dataset with quite a lot of data missing, which stores hourly data for several years. I now want to implement a seasonal filling method, for which I need the best data I have for two consecutive years (2*8760 entries). This means the least amount of data missing (or the least amount of NaN values) over two consecutive years. I then need the end time and start time of this period in datetime format. My data is stored in a dataframe where the index is the hourly datetime. How can I achieve this?
EDIT:
To make it a bit clearer: I need to select all entries (values and NaN values) from a time period of two years (or of 2*8760 rows) in which the least amount of NaN values occur.
You can remove all the NaN values from your data by using df = df.dropna()
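If instead you want to keep the gaps and just locate the two-year stretch that contains the fewest of them, a rough sketch (assuming the hourly values sit in a single column named 'value', a hypothetical name, and the index is the hourly datetime as described):
window = 2 * 8760  # two consecutive years of hourly rows

# number of NaNs in every window of 2*8760 consecutive rows
nan_counts = df['value'].isna().astype(int).rolling(window).sum()

# the window ending at this timestamp has the fewest NaNs
end_time = nan_counts.idxmin()
start_time = df.index[df.index.get_loc(end_time) - (window - 1)]
print(start_time, end_time)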

Problem assigning a new column in pandas when extracting values from different columns

I'm having a problem creating a new column with the average percentage of discount for a product category. My dataframe consists of rows with orders. Each order has its id, item name, category of a product, month of purchase, its retail and discounted price; I've also added a discount-in-percentage column. I want to add a new column which would consist of the average discount per category. To put it in simple terms, I want to know how much, on average, the products in Furniture were discounted. I then want to plot the top 3 categories with their discounts over time to see if there's seasonality (I was thinking of a bar plot).
Here is example data:
data = {'level_0': ['Furniture', 'Jewllery','Watches', 'Footwear', 'Furniture', 'Watches'],
'Discount_in_%': ['0.6', '.2', '0.3', '0.8', '0.7', '0.1']}
data = pd.DataFrame (data, columns = ['level_0','Discount_in_%'])
data
My problem is generating the column with the mean discount per category.
I was trying to use groupby() but I get a column full of NaNs:
df['discount_in_%'] = 1 - df['discounted_price']/df['retail_price']
df['mean_discount_cat'] = df.groupby('level_0')['discount_in_%'].sum()/len(df)
df['mean_discount_cat']
#level_0 is the main category column
0 NaN
1 NaN
2 NaN
3 NaN
4 NaN
..
19995 NaN
19996 NaN
19997 NaN
19998 NaN
19999 NaN
I tried summing it up and then dividing by the length, because when I used mean() I also got NaNs.
Can you please direct me on how to fix this? Also, I'm not sure how to then plot the mean discount for just the top 3 categories, but that might be another issue.
I'd really appreciate your help.
Thank you!
There are several problems here.
Initializing the dataframe: what you have isn't quite right. It mixes two different ways of initializing. If you name the columns in the data dictionary, you do not need to pass the columns parameter into the constructor. See https://www.geeksforgeeks.org/different-ways-to-create-pandas-dataframe/
data = {'level_0': ['Furniture', 'Jewllery','Watches', 'Footwear', 'Furniture', 'Watches'],
'discount_in_%': [0.6, .2, 0.3, 0.8, 0.7, 0.1]}
df = pd.DataFrame(data)
Now you have a proper dataframe.
The groupby is not quite right either. groupby() returns a special object, and you need an aggregation function (not just a column address) to produce results:
print(df.groupby('level_0').sum())
Dividing your results by len(df) doesn't make a lot of sense. If you have 1 item in a category that has a 5% discount, what would dividing it by 100 items in the whole dataframe accomplish? I am guessing you are looking for:
print(df.groupby('level_0').mean())
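If what you actually need is a new column holding the per-category mean on every row (which is what the assignment above is trying to do), transform is probably the tool: it returns a result aligned to the original row index rather than to the group keys, so assigning it back does not produce NaNs:
df['mean_discount_cat'] = df.groupby('level_0')['discount_in_%'].transform('mean')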

Find specific combination of values in pandas dataframe

I am preparing a dataframe for machine learning. The data set contains weather data from several weather stations in Australia over a period of 10 years. One of the measured attributes is Evaporation. It has about 50% missing values.
Now I want to find out whether the missing values are evenly distributed over all weather stations, or if roughly half of the weather stations just never measured Evaporation.
How can I find out about the distribution of a value in combination with another attribute?
I basically want to loop over the weather stations and get a count of NaNs and normal values.
rain_df.query('Location == "Albury"').Location.count()
This gives me the number of measurement points from the weather station in Albury. Now how can I find out how many NaNs were recorded in Albury compared to normal (non-NaN) measurements?
You can use .isnull() to mask a series with True for NaNs and False for everything else. Then you can use .value_counts(normalize=True) to get the proportions of NaN and non NaN in that series.
rain_df.query('Location == "Albury"').Evaporation.isnull().value_counts(normalize=True)
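To get the same breakdown for every station at once instead of looping, something along these lines should work (assuming the measured column is called Evaporation, as in the question):
# fraction of missing Evaporation readings per weather station
rain_df['Evaporation'].isnull().groupby(rain_df['Location']).mean()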

Should I use multi-rows or multi-columns with Pandas DataFrame?

Let's suppose I have data with the following structure:
(year, country, region, values)
Example:
Year, Country, Region, Values
2010 A 1 [1,2,3,...(1000 values)]
2010 A 2 [1,2,3,...(1000 values)]
...
2014 J 5 [1,2,3,...(1000 values)]
There are 5 years, 10 countries with 5 regions each and 1000 values for every combination of year, country, region.
I want to know how to decide if I should use multi-rows or multi-columns to store this kind of data. What are the main differences, if any? What are the advantages of each approach?
There are many possible ways to store this data, for example:
1. Multi-row (country, region), single column (year) and an array of values
2. Multi-column (year, country, region) and a single value per row
3. Multi-row (country, region), multi-column (year, index of value)
4. Single row, with one column for year, another for country, another for region and another for the array of values
Option 3 seems to be very bad, because there will be 5 years x 1000 columns.
Option 4 also seems to be very bad, because I would need to group by every time I need something.
You should look into "Tidy Data", which attempts to be a standard for organizing data values within a dataset.
Principles of Tidy Data
1. Columns represent separate variables
2. Rows represent individual observations
3. Observational units form individual DataFrames.
Based on what you are saying, it seems like multi-columns might be the way to go, and possibly several sets of data.
It depends on what you want to do, but I would go for multi-row, as I feel pandas is built for handling columnar data. The long data format also seems to be preferred in general: a quick Google search on 'long' and 'wide' data yields many results on converting wide to long, but not the other way around.
This blog post also points out some of the advantages of long over wide data format.
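As an illustration of the long ('tidy') layout discussed above, a minimal sketch with made-up values, where the position inside the original 1000-element array becomes its own column:
import pandas as pd

long_df = pd.DataFrame({
    'year':    [2010, 2010, 2010],
    'country': ['A', 'A', 'A'],
    'region':  [1, 1, 1],
    'idx':     [0, 1, 2],   # position within the original value array
    'value':   [1, 2, 3],
})

# a wide layout would instead pivot one of the dimensions into columns
wide_df = long_df.pivot_table(index=['country', 'region', 'idx'],
                              columns='year', values='value')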

JSON file with different array lengths in Python

I want to explore the population data freely available online at https://www.nomisweb.co.uk/api/v01/dataset/NM_31_1.jsonstat.json . It contains population details of UK from 1981 to 2017. The code I used so far is below
import requests
import json
import pandas
json_url = 'https://www.nomisweb.co.uk/api/v01/dataset/NM_31_1.jsonstat.json'
# download the data
j = requests.get(url=json_url)
# load the json
content = json.loads(j.content)
list(content.keys())
The last line of code above gives me the below output:
['version',
'class',
'label',
'source',
'updated',
'value',
'id',
'size',
'role',
'dimension',
'extension']
I then tried to have a look at the lengths of 'Value', 'size' and 'role'
print (len(content['value']))
print (len(content['size']))
print (len(content['role']))
And I got the results as below:
22200
5
3
As we can see, the lengths are very different. I cannot convert this into a dataframe as they all have different lengths.
How can I change this into a meaningful format so that I can start exploring it? I am required to do analysis as below:
1. A table showing the male, female and total population in columns, per UK region in rows, as well as the UK total, for the most recent year
2. Exploratory data analysis to show how the population progressed by regions and age groups
You should first read the content of the JSON file except value, because the other fields explain what the value field is: a flattened multidimensional matrix whose dimensions are content['size'], that is 37x4x3x25x2, with the description of each dimension given in content['dimension']. The first dimension is time, with 37 years from 1981 to 2017, then geography, with Wales, Scotland, Northern Ireland and England_and_Wales. Next comes sex, with Male, Female and Total, followed by age with 25 classes. At the very end you will find the measures, where the first is the total number of persons and the second is its percentage.
Long story short, only content['value'] will be used to feed the dataframe, but you first need to understand how.
But because of the 5 dimensions, it is probably better to first use a numpy matrix...
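A minimal sketch of that idea, reshaping the flat value list by the sizes the file itself reports (assuming content['value'] is a dense list, as its length of 22200 = 37x4x3x25x2 suggests):
import numpy as np

sizes = content['size']                        # [37, 4, 3, 25, 2]
values = np.array(content['value']).reshape(sizes)

# hypothetical slice, following the dimension order described above:
# most recent year, all geographies, sex == Total, all ages, measure == persons
values[-1, :, 2, :, 0]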
The data is a complex JSON file and, as you stated correctly, you need the data frame columns to be of equal length. What that really means is that you need to understand how the records are stored inside your dataset.
I would advise you to use some JSON Viewer/Prettifier to first research the file and understand its structure.
Only then will you be able to tell which data you need to load into the DataFrame. For example, there is obviously no need to load the 'version' and 'class' values into the DataFrame, as they are not part of any record but are metadata about the dataset itself.
This is the JSON-stat format. See https://json-stat.org. You can use the Python libraries pyjstat or jsonstat.py to get the data into a pandas dataframe.
You can explore this dataset using the JSON-stat explorer
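For instance, pyjstat can pull the whole dataset straight into a dataframe; a rough sketch (the exact API can differ between pyjstat versions):
from pyjstat import pyjstat

url = 'https://www.nomisweb.co.uk/api/v01/dataset/NM_31_1.jsonstat.json'
dataset = pyjstat.Dataset.read(url)   # download and parse the JSON-stat response
df = dataset.write('dataframe')       # one row per combination of dimensions
print(df.head())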
