I want to explore the population data freely available online at https://www.nomisweb.co.uk/api/v01/dataset/NM_31_1.jsonstat.json . It contains population details of the UK from 1981 to 2017. The code I have used so far is below:
import requests
import json
import pandas
json_url = 'https://www.nomisweb.co.uk/api/v01/dataset/NM_31_1.jsonstat.json'
# download the data
j = requests.get(url=json_url)
# load the json
content = json.loads(j.content)
list(content.keys())
The last line of code above gives me the below output:
['version',
'class',
'label',
'source',
'updated',
'value',
'id',
'size',
'role',
'dimension',
'extension']
I then tried to have a look at the lengths of 'value', 'size' and 'role':
print (len(content['value']))
print (len(content['size']))
print (len(content['role']))
And I got the results as below:
22200
5
3
As we can see, the lengths are very different, so I cannot convert the data into a dataframe directly. How can I change this into a meaningful format so that I can start exploring it? I am required to do the analysis below:
1. A table showing the male, female and total population in columns, per UK region in rows, as well as the UK total, for the most recent year
2. Exploratory data analysis to show how the population progressed by regions and age groups
You should first read the fields of the JSON other than 'value', because they explain what the 'value' field is: a flattened multidimensional matrix whose dimensions are given by content['size'], i.e. 37x4x3x25x2, with the description of each dimension in content['dimension']. The first dimension is time, with 37 years from 1981 to 2017; then geography, with Wales, Scotland, Northern Ireland and England_and_Wales. Next comes sex, with Male, Female and Total, followed by age, with 25 classes. At the very end are the measures, where the first is the total number of persons and the second is its percentage.
Long story short, only content['value'] will be used to feed the dataframe, but you first need to understand how.
But because of the 5 dimensions, it is probably better to first load the values into a numpy matrix, as in the sketch below...
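For illustration, here is a minimal sketch of that idea, assuming the dataset follows the JSON-stat layout described above (dimension order in content['id'], sizes in content['size'], values stored row-major with the last dimension varying fastest):
import requests
import json
import numpy as np
import pandas as pd

json_url = 'https://www.nomisweb.co.uk/api/v01/dataset/NM_31_1.jsonstat.json'
content = json.loads(requests.get(json_url).content)

dim_ids = content['id']    # dimension order, e.g. time, geography, sex, age, measures
shape = content['size']    # e.g. [37, 4, 3, 25, 2]

# view the flat value list as a 5-D matrix (None becomes NaN)
values = np.asarray(content['value'], dtype=float).reshape(shape)

# collect each dimension's category labels in their declared order
def category_labels(dim):
    cat = content['dimension'][dim]['category']
    labels = cat.get('label', {})
    index = cat.get('index', list(labels))
    codes = sorted(index, key=index.get) if isinstance(index, dict) else list(index)
    return [labels.get(code, code) for code in codes]

# one row per combination of dimension categories, one 'value' column
idx = pd.MultiIndex.from_product([category_labels(d) for d in dim_ids], names=dim_ids)
df = pd.DataFrame({'value': values.ravel()}, index=idx).reset_index()
print(df.head())
From there, filtering on the most recent year and pivoting sex against geography gives the table asked for in point 1.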
The data is a complex JSON file and, as you correctly stated, you need the dataframe columns to be of equal length. What that really means is that you need to understand how the records are stored inside your dataset.
I would advise you to use some JSON Viewer/Prettifier to first research the file and understand its structure.
Only then will you be able to understand which data you need to load into the DataFrame. For example, there is obviously no need to load the 'version' and 'class' values into the DataFrame, as they are not part of any record but metadata about the dataset itself.
This is JSON-stat format. See https://json-stat.org. You can use the Python libraries pyjstat or jsonstat.py to get the data into a pandas dataframe.
You can explore this dataset using the JSON-stat explorer
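For example, with pyjstat (assuming the library is installed, e.g. via pip install pyjstat) the whole dataset can be read straight into a dataframe:
from pyjstat import pyjstat

json_url = 'https://www.nomisweb.co.uk/api/v01/dataset/NM_31_1.jsonstat.json'
dataset = pyjstat.Dataset.read(json_url)  # downloads and parses the JSON-stat response
df = dataset.write('dataframe')           # one row per dimension combination, plus a 'value' column
print(df.head())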
Related
I'm using an imported .csv dataset, and I'm trying to get some information about a specific column. I am not entirely sure if I've done the correct thing here, thoughts? The other 17 columns in this list are different weather types such as visibility, pressure, dew point, etc.
weather=df.head(10)
print(weather)
Don't know exactly what your DataFrame looks like, so I'm using a toy example. The idea is to use groupby on the weather column, count it, and then sort it.
df = pd.DataFrame({'Weather': ['Rainy', 'Cloudy', 'Rainy', 'Sunny', 'Storm', 'Sunny']})
df.groupby('Weather')['Weather'].agg(weather_count = 'count').sort_values('weather_count', ascending = False)
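As a side note, the same counts can be obtained more directly with value_counts, which already returns the values sorted in descending order of frequency:
df['Weather'].value_counts()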
So I have the following dataset of trade flows that track imports, exports, by reporting country and partner countries. After I remove some unwanted columns, I edit my data frame such that trade flows between country A and country B is showing. I'm left with something like this:
[My data frame image]
My issue is that I want to be able to take the average of imports and exports for every partner country ('partner_code') per year, but when I run the following:
x = df[(df.location_code.isin(["IRN"])) &
       (df.partner_code.isin(['TCD']))]
grouped = x.groupby(['partner_code']).mean()
I end up getting an average of all exports divided by all the rows where there is a 'product_id' (so a much larger count), rather than the average of total imports or exports per year.
Taking the average of the following 5 export values gives an incorrect average:
[Images: the five export values and the resulting wrong average]
In pandas, we can group by multiple columns. Based on my understanding, you want to group by partner, country and year.
The following line would work:
df = df.groupby(['partner_code', 'location_code', 'year'])[['import_value', 'export_value']].mean()
Please note that the resulting dataframe has a MultiIndex.
For reference, the official documentation: DataFrame.groupby documentation
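If you prefer flat columns instead of the MultiIndex, a common follow-up is to call reset_index on the result (a sketch, assuming the same column names as above):
grouped = df.groupby(['partner_code', 'location_code', 'year'])[['import_value', 'export_value']].mean().reset_index()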
It's my first time using Jupyter Notebook to analyze survey data (a .sav file), and I would like to read it in a way that shows the metadata, so I can connect the answers with the questions. I'm totally a newbie in this field, so any help is appreciated!
import pandas as pd
import pyreadstat
df, meta = pyreadstat.read_sav('./SimData/survey_1.sav')
type(df)
type(meta)
df.head()
Please let me know if there is an additional step needed for me to be able to see the metadata!
The meta object contains the metadata you are looking for. Probably the most useful attributes to look at are:
meta.column_names_to_labels : a dictionary mapping the column names you have in your pandas dataframe to labels, i.e. longer explanations of the meaning of each column.
print(meta.column_names_to_labels)
meta.variable_value_labels : a dict where keys are column names and values are dicts mapping the values you find in your dataframe to their value labels.
print(meta.variable_value_labels)
For instance, if you have a column "gender" with values 1 and 2, you could get:
{"gender": {1:"male", 2:"female"}}
which means value 1 is male and 2 female.
You can get those labels from the beginning if you pass the argument apply_value_formats :
df, meta = pyreadstat.read_sav('survey.sav', apply_value_formats=True)
You can also apply those value formats to your dataframe anytime with pyreadstat.set_value_labels which returns a copy of your dataframe with labels:
df_copy = pyreadstat.set_value_labels(df, meta)
meta.missing_ranges : you get labels for missing values. Let's say a certain survey variable was encoded with 1 meaning yes, 2 meaning no, and then missing values: 5 meaning didn't answer, 6 meaning person not at home. When you read the dataframe, by default you will get the values 1 and 2 and NaN (missing) instead of 5 and 6. You can pass the argument user_missing to get 5 and 6 instead, and meta.missing_ranges will tell you that 5 and 6 are missing values. meta.variable_value_labels will give you the "didn't answer" and "person not at home" labels.
df, meta = pyreadstat.read_sav("survey.sav", user_missing=True)
print(meta.missing_ranges)
print(meta.variable_value_labels)
These are the pieces of information potentially useful for your case; not all of them will necessarily be present in your dataset.
More information here: https://ofajardo.github.io/pyreadstat_documentation/_build/html/index.html
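Putting the pieces together, here is a minimal sketch (the file path and the 'gender' column are just placeholders for your own data):
import pyreadstat

df, meta = pyreadstat.read_sav(
    './SimData/survey_1.sav',
    apply_value_formats=True,  # show value labels instead of raw codes
    user_missing=True,         # keep user-defined missing codes (e.g. 5, 6)
)

# print each column together with the question text it corresponds to
for col, question in meta.column_names_to_labels.items():
    print(col, ':', question)

# inspect the label mapping and missing-value definitions for one column
print(meta.variable_value_labels.get('gender'))
print(meta.missing_ranges.get('gender'))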
Let's suppose I have data with the following structure:
(year, country, region, values)
Example:
Year, Country, Region, Values
2010 A 1 [1,2,3,...(1000 values)]
2010 A 2 [1,2,3,...(1000 values)]
...
2014 J 5 [1,2,3,...(1000 values)]
There are 5 years, 10 countries with 5 regions each and 1000 values for every combination of year, country, region.
I want to know how to decide if I should use multi-rows or multi-columns to store this kind of data. What are the main differences, if any? What are the advantages of each approach?
There are many possible ways to store this data, for example:
1. Multi-row (country, region), single column (year) and an array of values
2. Multi-column (year, country, region) and a single value per row
3. Multi-row (country, region), multi-column (year, index of value)
4. Single row, with one column for year, another for country, another for region and another for the array of values.
Option 3 seems to be very bad, because there will be 5 years x 1000 columns.
Option 4 also seems to be very bad, because I would need to group by every time I need something.
You should look into "Tidy Data", which attempts to be a standard for organizing data values within a dataset.
Principles of Tidy Data
1. Columns represent separate variables
2. Rows represent individual observations
3. Observational units form individual DataFrames.
Based on what you are saying, it seems like multi columns might be the way to go. And possibly several sets of data.
It depends on what you want to do, but I would go for multi-row, as I feel pandas is built for handling columnar data. The long data format also seems to be preferred in general: a quick Google search on 'long' and 'wide' data yields many results on wide-to-long conversion, but not the other way around.
This blog post also points out some of the advantages of long over wide data format.
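To make the trade-off concrete, here is a minimal sketch (with made-up random values) that builds both the long and the wide layout for the shape of data described in the question:
import numpy as np
import pandas as pd

# toy dimensions matching the description: 5 years, 10 countries, 5 regions, 1000 values each
years = range(2010, 2015)
countries = list('ABCDEFGHIJ')
regions = range(1, 6)
n_values = 1000

# long ("tidy") layout: one row per (year, country, region, value index)
idx = pd.MultiIndex.from_product([years, countries, regions, range(n_values)],
                                 names=['year', 'country', 'region', 'value_idx'])
long_df = pd.DataFrame({'value': np.random.rand(len(idx))}, index=idx).reset_index()

# wide layout: one row per (country, region), one column per (year, value index)
wide_df = long_df.pivot_table(index=['country', 'region'],
                              columns=['year', 'value_idx'],
                              values='value')
print(long_df.shape, wide_df.shape)  # (250000, 5) and (50, 5000)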
I'm new to Python and I would appreciate an answer as soon as possible.
I'm processing a file containing reviews for products that can belong to more than 1 category. What I need is to group the review ratings by the categories, and date at the same time. Since I don't know the exact number of categories, or dates in advance, I need to add rows and columns as I'm processing the reviews data (50 GB file).
I've seen how I can add columns, however my trouble is adding a row without knowing how many columns are currently in the dataframe.
Here is my code:
import pandas
list1 = ['Movies & TV', 'Books']  # categories so far
dfMain = pandas.DataFrame(index=list1, columns=['2002-09'])  # only one column at the beginning
print(dfMain)
This is what dfMain looks like:
If I want to add a column, I simply do this:
dfMain.insert(0, date, 0) #where date is in format like '2002-09'
But what if I want to add a new category (row) and fill all the dates (columns) with zeros? How do I do that? I've tried the append method, but it asks for all the columns as parameters. The insert method doesn't seem to work either.
Here's a possible solution:
dfMain.append(pd.Series(index=dfMain.columns, name='NewRow').fillna(0))
            2002-09
Movies & TV     NaN
Books           NaN
NewRow          0.0
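Note that DataFrame.append was deprecated and later removed in pandas 2.0; on recent versions an equivalent way to add a zero-filled row (a sketch using the same toy frame) is to assign via .loc:
import pandas as pd

dfMain = pd.DataFrame(index=['Movies & TV', 'Books'], columns=['2002-09'])

# .loc creates the new row and fills every existing column with 0
dfMain.loc['NewRow'] = 0
print(dfMain)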