From the Titanic Dataset from Kaggle, I'm trying to extract how many people survived and how many died from the survived column. To do this, I imported the pandas library and saved the dataset in the variable dataframe and used the following code:
dataframe['survived'].value_counts()
Which gave me the output as
0 809
1 500
Name: survived, dtype: int64
From this, how do I print just the number of people who survived? Like if I want the count of 1, I need the output as 500. Same thing for when I want just the count of 0.
I tried the following code only to get a SyntaxError
dataframe['survived'].value_counts().1
I'm new to pandas, so I'd really appreciate it if anyone could help me with this!
In your case, you can use sum instead of value_counts because the column is binary (1 for survived, 0 for died), so the sum gives you the number of survivors:
>>> dataframe['survived'].sum()
500
If your column is not binary, you can use:
# 1 stands for survivors here
>>> dataframe['survived'].eq(1).sum()
500
Here is an answer that follows more human-like logic.
Ask the data frame for all observations (rows) with survivors in it.
# some datasets would use 'yes', 'si', 'alive', ..
alive = 1
# eq() means equal; like ==
survivors = dataframe[dataframe.survived.eq(alive)]
And then count the observations (rows).
print(len(survivors))
You can use:
dataframe['survived'].value_counts()[0]
or:
dataframe['survived'].value_counts().loc[0]
The .column_name/.index_name syntax is not recommended as it restricts the possibilities to column names that are valid python variables. Strings starting with a number are not valid python variable names.
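To tie the answers above together, here is a minimal sketch on a tiny made-up frame (the five rows are an assumption for illustration, not the Kaggle data). It shows label-based lookup with .loc on the result of value_counts, plus .get as a fallback for when a label might be absent.

```python
import pandas as pd

# Tiny made-up stand-in for the Titanic data (illustration only)
dataframe = pd.DataFrame({"survived": [0, 1, 1, 0, 0]})

counts = dataframe["survived"].value_counts()

# value_counts() is indexed by the distinct values, so look them up by label
survived = counts.loc[1]
died = counts.loc[0]

# .get() returns a default instead of raising if the label never occurs
missing_label = counts.get(2, 0)

print(survived, died, missing_label)
```

Using .loc makes it explicit that 1 and 0 are labels, not positions, which matters because value_counts sorts by frequency rather than by value.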
Related
I have a data frame where I want to replace the values '<=50K' and '>50K' in the 'Salary' column with '0' and '1' respectively. I have tried the replace function but it does not change anything. I have tried a lot of things but nothing seems to work. I am trying to do some logistic regression on the cells but the formulas do not work because of the datatype. The real data set has over 20,000 rows.
Age  Workclass  fnlwgt  education  education-num  Salary
39   state-gov  455     Bachelors  13             <=50K
25   private    22      Masters    89             >50K
df['Salary']= df['Salary'].replace(['<=50K'],'0')
df['Salary']
This is the error I get when I try to use smf.logit(). See the code below. I don't understand why I get an error, because Age and education-num are both int64.
mod = smf.logit(formula = 'education-num ~ Age', data= dftrn)
resmod = modelAdm.fit()
ValueError: endog has evaluated to an array with multiple columns that
has shape (26049, 16). This occurs when the variable converted to
endog is non-numeric (e.g., bool or str).
You can try this. For checking purposes I have created a new column; you can always change the original column instead by replacing new_salary with Salary:
df['new_salary'] = df['Salary']
df.loc[df['new_salary'] == '<=50K', 'new_salary'] = 0
df.loc[df['new_salary'] == '>50K', 'new_salary'] = 1
Regarding the first question, you should just use a single square bracket on the left side of the assignment.
df['Salary']= df['Salary'].replace(['<=50K'],'0')
df['Salary']= df['Salary'].replace(['>50K'],'1')
df['Salary']
As for the second part of the question, you are naming the model as mod but you are calling the fit function on modelAdm.
Anyway, those are two different questions and should be asked separately in two different posts.
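As an alternative sketch for the first part, map with a dictionary converts both labels in one step and lets you end up with an integer column. The two-row frame below is made up, and the leading space in the labels is an assumption: stray whitespace is one possible reason a plain replace appears to change nothing.

```python
import pandas as pd

# Made-up sample; the leading spaces are an assumption for illustration
df = pd.DataFrame({
    "Age": [39, 25],
    "Salary": [" <=50K", " >50K"],
})

# Strip whitespace, translate the labels, then store them as integers
df["Salary"] = (
    df["Salary"]
    .str.strip()
    .map({"<=50K": 0, ">50K": 1})
    .astype(int)
)

print(df["Salary"].tolist())
```

Any label that is not in the dictionary becomes NaN, so astype(int) fails loudly instead of silently leaving strings in the column.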
I want to explore the population data freely available online at https://www.nomisweb.co.uk/api/v01/dataset/NM_31_1.jsonstat.json . It contains population details of UK from 1981 to 2017. The code I used so far is below
import requests
import json
import pandas
json_url = 'https://www.nomisweb.co.uk/api/v01/dataset/NM_31_1.jsonstat.json'
# download the data
j = requests.get(url=json_url)
# load the json
content = json.loads(j.content)
list(content.keys())
The last line of code above gives me the below output:
['version',
'class',
'label',
'source',
'updated',
'value',
'id',
'size',
'role',
'dimension',
'extension']
I then tried to have a look at the lengths of 'value', 'size' and 'role'
print (len(content['value']))
print (len(content['size']))
print (len(content['role']))
And I got the results as below:
22200
5
3
As we can see, the lengths are very different. I cannot convert it into a dataframe as they are all different lengths.
How can I change this to a meaningful format so that I can start exploring it? I am required to do the analysis below:
1. A table showing the male, female and total population in columns, per UK region in rows, as well as the UK total, for the most recent year
2. Exploratory data analysis to show how the population progressed by regions and age groups
You should first read everything in the JSON file except value, because the other fields explain what the value field contains. It is a flattened multidimensional array with dimensions content['size'], that is 37x4x3x25x2, and each dimension is described in content['dimension']. The first dimension is time, with 37 years from 1981 to 2017; then geography, with Wales, Scotland, Northern Ireland and England_and_Wales. Next comes sex, with Male, Female and Total, followed by age, with 25 classes. The last dimension holds the measures: the first is the total number of persons, and the second is its percentage.
Long story short, only content['value'] will be used to feed the dataframe, but you first need to understand how.
But because of the 5 dimensions, it is probably better to first use a numpy array...
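A minimal sketch of that reshaping step, using a stand-in for content['value'] so it runs without downloading anything. It assumes the values are stored in row-major order (last dimension varying fastest), which is how I understand the JSON-stat format; check that against the real file before relying on it.

```python
import numpy as np

# Dimension sizes as described above: time, geography, sex, age, measures
size = [37, 4, 3, 25, 2]

# Stand-in for content['value'] (assumption: the real list has the same length)
flat = np.arange(np.prod(size))

# Row-major reshape: the last dimension (measures) varies fastest
cube = flat.reshape(size)

latest_year = cube[-1]      # all geographies, sexes, ages, measures for the last year
first_measure = cube[..., 0]  # first measure across every other dimension

print(cube.shape, latest_year.shape, first_measure.shape)
```

From here, slices of the array can be wrapped in a DataFrame, taking the row and column labels from content['dimension'].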
The data is a complex JSON file and as you stated correctly, you need the data frame columns to be of an equal length. What you mean to say by that, is that you need to understand how the records are stored inside your dataset.
I would advise you to use some JSON Viewer/Prettifier to first research the file and understand its structure.
Only then you would be able to understand which data you need to load to the DataFrame. For example, obviously, there is no need to load the 'version' and 'class' values into the DataFrame as they are not part of any record, but are metadata about the dataset itself.
This is JSON-stat format. See https://json-stat.org. You can use the python libraries pyjstat or json.stat.py to get the data to a pandas dataframe.
You can explore this dataset using the JSON-stat explorer
Summary
It's important in many scientific applications to keep track of different kinds of missing value. Is a value for 'weekly income from main job' missing because the person doesn't have a job, or because they have a job but refused to answer?
Storing all missing values as NA or NaN loses this information.
Storing missing value labels (e.g., 'missing because no job', 'missing because refused to answer') in a separate column means the researcher must keep track of two columns for every operation she performs – such as groupby, renaming, and so on. This creates endless opportunities for mistakes and errors.
Storing missing value labels within the same column (e.g., as negative numbers, as in the example below, or very large numbers like 99999) means the researcher must manually keep track of how missing value labels are encoded for every column, and creates many other opportunities for mistakes (e.g., forgetting that a column includes missing values and taking a raw mean instead of using the correct mask).
It is very easy to handle this problem in Stata (see below), by using a data type that stores both numeric values and missing value labels, and with functions that know how to handle this data type. This is highly performant (data type remains numeric, not string or mixed – think of NumPy's data types, except instead of having just NaN we have NaN1, NaN2, etc.) What is the best way of achieving something like this in pandas?
Note: I'm an economist, but this is also an incredibly common workflow for political scientists, epidemiologists, etc. – anyone who deals with survey data. In this context, the analyst knows what the missing values are via a codebook, really cares about keeping track of them, and has hundreds or thousands of columns to deal with – so, indeed, needs an automated way of keeping track of them.
Motivation/context
It's extremely common when dealing with any kind of survey data to have multiple kinds of missing data.
Here is a minimal example from a government questionnaire used to produce official employment statistics:
[Q1] Do you have a job?
[Q2] [If Q1=Yes] What is your weekly income from that job?
The above occurs in pretty much every government-run labor force survey in the world (e.g., the UK Labour Force Survey, the US Current Population Survey, etc.).
Now, for a given respondent, if [Q2] is missing, it could be that (1) they answered No to [Q1], and so were ineligible to be asked [Q2], or that (2) they answered Yes to [Q1] but refused to answer [Q2] (perhaps because they were embarrassed at how much/little they earn, or because they didn't know).
As a researcher, it matters a great deal to me whether it was (1) that occurred, or whether it was (2). Suppose my job is to report the average weekly income of workers in the United States. If there are many missing values for this [Q2] column, but they are all labeled 'missing because respondent answered no to [Q1]', then I can take the average of [Q2] with confidence – it is, indeed, the average weekly income of people in work. (All the missing values are people who didn't have a job.)
On the other hand, if those [Q2] missing values are all labeled 'missing because respondent was asked this question but refused to answer', then I cannot simply report the average of [Q2] as the average weekly income of workers. I'll need to issue caveats around my results. I'll need to analyze the kinds of people who don't answer (are they missing at random, or are people in higher-income occupations more likely to refuse, for example, biasing my results?). Possibly I'll try to impute missing values, and so on.
The problem
Because these 'reasons for being missing' are so important, government statistical agencies will code the different reasons within the column:
So the column containing the answers to [Q2] above might contain the values [1500, -8, 10000, -2, 3000, -1, 6400].
In this case, '1500', '10000', and so on are 'true' answers to [Q2] ($1,500 weekly income, $10,000 weekly income, etc.), whereas '-8' means they weren't eligible to answer (because they answered No to [Q1]), '-2' means they were eligible to answer but refused to do so, and so on.
Now, obviously, if I take the average of this column, I'm going to get something meaningless.
On the other hand, if I just replace all negative values with NaN, then I can take the average – but I've lost all this valuable information about why values are missing. For example, I may want to have a function that takes any column and reports, for that column, statistics like the mean and median, the number of eligible observations (i.e., everything except value=-8), and the percent of those that were non-missing.
It works great in Stata
Doing this in Stata is extremely easy. Stata has 27 numeric missing categories: '.' and '.a' to '.z'. (More details here.) I can write:
replace weekly_income = .a if weekly_income == -1
replace weekly_income = .b if weekly_income == -8
and so on.
Then (in pseudocode) I can write
stats weekly_income if weekly_income!=.b
When reporting the mean, Stata will automatically ignore the values coded as missing (indeed, they're now not numeric); but it will also give me missing value statistics only for the observations I care about (in this case, those eligible to be asked the question, i.e., those who weren't originally coded '-8').
What is the best way to handle this in Pandas?
Setup:
>>> import numpy as np
>>> import pandas as pd
>>> df = pd.DataFrame.from_dict({
...     'income': [1500, -8, 10000, -2, 3000, -1, 6400]})
Desired outcome:
>>> df.income.missing_dict = {'-1': ['.a', 'Don\'t know'], '-2': ['.b', 'Refused']} # etc.
>>> df
income
0 1500
1 Inapplic.
2 10000
3 Refused
4 3000
5 Don't know
6 6400
>>> assert df.income.mean() == np.mean([1500, 10000, 3000, 6400])
(passes)
The 'obvious' workaround
Clearly, one option is to split every column into two columns: one numeric column with non-missing values and NaNs, and the other a categorical column with categories for the different types of missing value.
But this is extremely inconvenient. These surveys often have thousands of columns, and a researcher might well use hundreds in certain kinds of economic analysis. Having two columns for every 'underlying' column means the researcher has to keep track of two columns for every operation she performs – such as groupby, renaming, and so on. This creates endless opportunities for mistakes and errors. It also means that displaying the table is very wasteful – for any column, I need to now display two columns, one of which for any given observation is always redundant. (This is wasteful both of screen real estate, and of the human analysts' attention, having to identify which two columns are a 'pair'.)
Other ideas
Two other thoughts that occur to me, both probably non-ideal:
(1) Create a new data type in pandas that works similarly to Stata (i.e., adds '.a', '.b', etc. to allowable values for numeric columns).
(2) Use the two-columns solution above, but (re)write 'wrapper' functions in pandas so that 'groupby' etc. keeps track of the pairs of columns for me.
I suspect that (1) is the best solution for the long term, but it would presumably require a huge amount of development.
On the other hand, maybe there are already packages that solve this? Or people have better work-arounds?
To show the solution, I'm taking the liberty of changing the missing_dict keys to match the data type of income.
>>> df
income
0 1500
1 -8
2 10000
3 -2
4 3000
5 -1
6 6400
>>> df.income.missing_dict
{-8: ['.c', 'Stifled by companion'], -2: ['.b', 'Refused'], -1: ['.a', "Don't know"]}
Now, here's how to filter the rows according to the values being in the "missing" list:
>>> df[(~df.income.isin((df.income.missing_dict)))]
income
0 1500
2 10000
4 3000
6 6400
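Note that the extra parentheses around df.income.missing_dict are only grouping and do not create a tuple: isin accepts any list-like, and when given a dict it checks against the keys. The tilde operator, element-wise negation, then inverts the resulting series of Booleans.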
Finally, apply mean to the resulting data column:
>>> df[(~df.income.isin((df.income.missing_dict)))].mean()
income 5225.0
dtype: float64
Does that toss you in the right direction? From here, you can simply replace income with the appropriate column or variable name as needed.
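Here is a self-contained sketch of the same idea that keeps the codebook in a plain dict next to the frame instead of attaching it to the column as an attribute. The labels in missing_codes and the choice of -8 as the "not eligible" code are assumptions taken from the question's example.

```python
import pandas as pd

df = pd.DataFrame({"income": [1500, -8, 10000, -2, 3000, -1, 6400]})

# Hypothetical codebook: code -> reason the value is missing
missing_codes = {-8: "Inapplicable", -2: "Refused", -1: "Don't know"}

# True wherever the value is one of the missing codes
is_missing = df["income"].isin(list(missing_codes))

# Replace coded values with NaN so numeric statistics skip them
valid = df["income"].mask(is_missing)
mean_income = valid.mean()

# Respondents who were eligible to be asked (everything except -8)
eligible = df["income"] != -8
pct_answered = valid[eligible].notna().mean()

# Count of each missing reason, recovered from the original codes
reasons = df.loc[is_missing, "income"].map(missing_codes).value_counts()

print(mean_income, pct_answered)
print(reasons)
```

The original column is never overwritten, so the reasons stay recoverable, at the cost of having to pass the codebook around yourself.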
Pandas recently introduced a custom array type called ExtensionArray that allows defining what is in essence a custom column type, allowing you to (sort of) use actual values alongside missing data without dealing with two columns. Here is a very, very crude implementation, which has barely been tested:
import numpy as np
import pandas as pd
from pandas.core.arrays.base import ExtensionArray
class StataData(ExtensionArray):
    def __init__(
        self, data, missing=None, factors=None, dtype=None, copy=False
    ):
        def own(array, dtype=dtype):
            array = np.asarray(array, dtype)
            if copy:
                array = array.copy()
            return array
        self.data = own(data)
        if missing is None:
            missing = np.zeros_like(data, dtype=int)
        else:
            missing = own(missing, dtype=int)
        self.missing = missing
        self.factors = own(factors)
    @classmethod
    def _from_sequence(cls, scalars, dtype=None, copy=False):
        return cls(scalars, dtype=dtype, copy=copy)
    @classmethod
    def _from_factorized(cls, data, original):
        return cls(original, None, data)
    def __getitem__(self, key):
        return type(self)(
            self.data[key], self.missing[key], self.factors
        )
    def __setitem__(self, key, value):
        self.data[key] = value
        self.missing[key] = 0
    def __len__(self):
        return len(self.data)
    def __iter__(self):
        return iter(self.data)
    @property
    def dtype(self):
        return self.data.dtype
    @property
    def shape(self):
        return self.data.shape
    @property
    def nbytes(self):
        return self.data.nbytes + self.missing.nbytes + self.factors.nbytes
    def view(self):
        return self
    @property
    def reason_missing(self):
        return self.missing
    def isna(self):
        return self.missing != 0
    def __repr__(self):
        s = {}
        for attr in ['data', 'missing', 'factors']:
            s[attr] = getattr(self, attr)
        return repr(s)
With this implementation, you can do the following:
>>> a = StataData([1, 2, 3, 4], [0, 0, 1, 0])
>>> s = pd.Series(a)
>>> print(s[s.isna()])
2 3
dtype: int32
>>> print(s[~s.isna()])
0 1
1 2
3 4
dtype: int32
>>> print(s.isna().values.reason_missing)
array([1])
Hopefully someone who understands this API can chime in and help improve this. For starters, a cannot be used in DataFrames, only Series.
>>> print(pd.DataFrame({'a': s}).isna())
0 False
1 False
2 False
3 False
I'm new to Pandas and Numpy. I was trying to solve the Kaggle | Titanic Dataset. Now I have to fix two columns, "Age" and "Embarked", because they contain NaN.
I tried fillna without any success, and soon discovered that I was missing inplace = True.
So I added it. The first imputation was successful, but the second one was not. I tried searching on SO and Google, but did not find anything useful. Please help me.
Here's the code that I was trying.
# imputing "Age" with mean
titanic_df["Age"].fillna(titanic_df["Age"].mean(), inplace = True)
# imputing "Embarked" with mode
titanic_df["Embarked"].fillna(titanic_df["Embarked"].mode(), inplace = True)
print(titanic_df["Age"][titanic_df["Age"].isnull()].size)
print(titanic_df["Embarked"][titanic_df["Embarked"].isnull()].size)
and I got the output as
0
2
However I managed to get what I want without using inplace=True
titanic_df["Age"] =titanic_df["Age"].fillna(titanic_df["Age"].mean())
titanic_df["Embarked"] = titanic_df.fillna(titanic_df["Embarked"].mode())
But I am curious what's with the second usage of inplace=True.
Please bear with me if I'm asking something extremely stupid; I'm totally new and may be missing small things. Any help is appreciated. Thanks in advance.
pd.Series.mode returns a Series.
A variable has a single arithmetic mean and a single median but it may have several modes. If more than one value has the highest frequency, there will be multiple modes.
pandas operates on labels.
titanic_df.mean()
Out:
PassengerId 446.000000
Survived 0.383838
Pclass 2.308642
Age 29.699118
SibSp 0.523008
Parch 0.381594
Fare 32.204208
dtype: float64
If I were to use titanic_df.fillna(titanic_df.mean()) it would return a new DataFrame where the column PassengerId is filled with 446.0, column Survived is filled with 0.38 and so on.
However, if I call the mean method on a Series, the returning value is a float:
titanic_df['Age'].mean()
Out: 29.69911764705882
There is no label associated here. So if I use titanic_df.fillna(titanic_df['Age'].mean()) all the missing values in all the columns will be filled with 29.699.
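A small sketch of the difference between filling with a labelled Series and filling with a bare scalar. The two-column frame is made up for illustration.

```python
import numpy as np
import pandas as pd

# Made-up frame with one gap per column
df = pd.DataFrame({
    "Age": [20.0, np.nan, 40.0],
    "Fare": [10.0, 30.0, np.nan],
})

# df.mean() is a Series labelled by column name, so each column gets its own mean
by_label = df.fillna(df.mean())

# A float has no label, so every gap in every column gets the same value
by_scalar = df.fillna(df["Age"].mean())

print(by_label)
print(by_scalar)
```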
Why the first attempt was not successful
You tried to fill the Embarked column with titanic_df["Embarked"].mode(). Let's check the output first:
titanic_df["Embarked"].mode()
Out:
0 S
dtype: object
It is a Series with a single element. The index is 0 and the value is S. Now, remember how it would work if we used titanic_df.mean() to fill: it would fill each column with the corresponding mean value, because the labels match. Here, we only have one label, 0. Called on a Series, fillna matches that label against the row index, so it would only fill a missing value in the row labelled 0. Called on a DataFrame, it matches the label against the column names, so it will only fill values if we have a column named 0. Try adding titanic_df[0] = np.nan and running titanic_df.fillna(titanic_df["Embarked"].mode()). You'll see that the new column is filled with S.
Why the second attempt was (not) successful
The right hand side of the equation, titanic_df.fillna(titanic_df["Embarked"].mode()) returns a new DataFrame. In this new DataFrame, Embarked column still has nan's:
titanic_df.fillna(titanic_df["Embarked"].mode())['Embarked'].isnull().sum()
Out: 2
However you didn't assign it back to the entire DataFrame. You assigned this DataFrame to a Series - titanic_df['Embarked']. It didn't actually fill the missing values in the Embarked column, it just used the index values of the DataFrame. If you actually check the new column, you'll see numbers 1, 2, ... instead of S, C and Q.
What you should do instead
You are trying to fill a single column with a single value. First, disassociate that value from its label:
titanic_df['Embarked'].mode()[0]
Out: 'S'
Now, it is not important if you use inplace=True or assign the result back. Both
titanic_df['Embarked'] = titanic_df['Embarked'].fillna(titanic_df['Embarked'].mode()[0])
and
titanic_df['Embarked'].fillna(titanic_df['Embarked'].mode()[0], inplace=True)
will fill the missing values in the Embarked column with S.
Of course this assumes you want to use the first value if there are multiple modes. You may need to improve your algorithm there (for example randomly select from the values if there are multiple modes).
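Putting that together as a runnable sketch on a made-up Embarked column. It uses .iloc[0] rather than [0] to take the first mode by position, and the assign-back form rather than inplace=True; both are my choices here, not the only way to do it.

```python
import pandas as pd

# Made-up column with two missing entries
titanic_df = pd.DataFrame({"Embarked": ["S", "C", None, "S", "Q", None]})

# mode() returns a Series; take its first element to get a plain value
most_common = titanic_df["Embarked"].mode().iloc[0]

# Fill the gaps with that scalar and assign the result back
titanic_df["Embarked"] = titanic_df["Embarked"].fillna(most_common)

remaining = int(titanic_df["Embarked"].isnull().sum())
print(most_common, remaining)
```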
I'm still very new to Python and Pandas, so bear with me...
I have a dataframe of passengers on a ship that sank. I have broken this down into other dataframes by male and female, and also by class, to create probabilities for survival. I made a function that compares one dataframe to a dataframe of only survivors, and calculates the probability of survival among this group:
def survivability(total_pass_df, column, value):
    survivors = sum(did_survive[column] == value)
    total = len(total_pass_df)
    survival_prob = round((survivors / total), 2)
    return survival_prob
But now I'm trying to compare survivability among smaller groups - male first class passengers vs female third class passengers for example. I did make dataframes for both of these groups, but I still can't use my survivability function because I'm comparing two different columns - sex and class - rather than just one.
I know exactly how I'd do it with Python - loop through the 'survived' column (which is either a 1 or 0), in the dataframe, if it equals 1, then add one to an index value, and once all the data has been gone through, divide the index value by the length of the dataframe to get the probability of survival....
But I'm supposed to use Pandas for this, and I can't for the life of me work out in my head how to do it....
:/
Without a sample of the data frames you're working with, I can't be sure if I understand your question correctly. But based on your description of the pure-Python procedure,
I know exactly how I'd do it with Python - loop through the 'survived' column (which is either a 1 or 0), in the dataframe, if it equals 1, then add one to an index value, and once all the data has been gone through, divide the index value by the length of the dataframe to get the probability of survival....
you can do this in Pandas by simply writing
dataframe['survived'].mean()
That's it. Given that all the values are either 1 or 0, the mean will be the number of 1's divided by the total number of rows.
If you start out with a data frame that has columns like survived, sex, class, and so on, you can elegantly combine this with Pandas' boolean indexing to pick out the survival rates for different groups. Let me use the Socialcops Titanic passengers data set as an example to demonstrate. Assuming the DataFrame is called df, if you want to analyze only male passengers, you can get those records as
df[df['sex'] == 'male']
and then you can take the survived column of that and get the mean.
>>> df[df['sex'] == 'male']['survived'].mean()
0.19198457888493475
So 19% of male passengers survived. If you want to narrow down to male second-class passengers, you'll need to combine the conditions using &, like this:
>>> df[(df['sex'] == 'male') & (df['pclass'] == 2)]['survived'].mean()
0.14619883040935672
This is getting a little unwieldy, but there's an easier way that actually lets you do multiple categories at once. (The catch is that this is a somewhat more advanced Pandas technique and it might take a while to understand it.) Using the DataFrame.groupby() method, you can tell Pandas to group the rows of the data frame according to their values in certain columns. For example,
df.groupby('sex')
tells Pandas to group the rows by their sex: all male passengers' records are in one group, and all female passengers' records are in another group. The thing you get from groupby() is not a DataFrame, it's a special kind of object that lets you apply aggregation functions - that is, functions which take a whole group and turn it into one number (or something). So, for example, if you do this
>>> df.groupby('sex').mean()
pclass survived age sibsp parch fare \
sex
female 2.154506 0.727468 28.687071 0.652361 0.633047 46.198097
male 2.372479 0.190985 30.585233 0.413998 0.247924 26.154601
body
sex
female 166.62500
male 160.39823
you see that for each column, Pandas takes the average over the male passengers' records of all that column's values, and also over all the female passenger's records. All you care about here is the survival rate, so just use
>>> df.groupby('sex').mean()['survived']
sex
female 0.727468
male 0.190985
One big advantage of this is that you can give more than one column to group by, if you want to look at small groups. For example, sex and class:
>>> df.groupby(['sex', 'pclass']).mean()['survived']
sex pclass
female 1 0.965278
2 0.886792
3 0.490741
male 1 0.340782
2 0.146199
3 0.152130
(you have to give groupby a list of column names if you're giving more than one)
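For anyone who wants to try the steps above without downloading the data set, here is a minimal sketch on six made-up passenger records. It selects the survived column before calling mean, which is a small variation on the code above that avoids averaging columns you don't need.

```python
import pandas as pd

# Made-up passenger records, only for illustration
df = pd.DataFrame({
    "sex": ["male", "male", "female", "female", "male", "female"],
    "pclass": [1, 3, 1, 3, 3, 1],
    "survived": [0, 0, 1, 0, 1, 1],
})

# Overall survival rate: the mean of a 0/1 column
overall_rate = df["survived"].mean()

# One specific group, using boolean indexing with &
male_third = df[(df["sex"] == "male") & (df["pclass"] == 3)]["survived"].mean()

# Every sex/class combination at once
rates = df.groupby(["sex", "pclass"])["survived"].mean()

print(overall_rate, male_third)
print(rates)
```

The result of the groupby is a Series with a two-level index, so a single group can be looked up with rates.loc[("female", 1)].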
Have you tried merging the two dataframes by passenger ID and then doing a pivot table in Pandas with whatever row subtotals and aggfunc=numpy.mean?
import pandas as pd
import numpy as np
# Passenger List
p_list = pd.DataFrame()
p_list['ID'] = [1,2,3,4,5,6]
p_list['Class'] = ['1','2','2','1','2','1']
p_list['Gender'] = ['M','M','F','F','F','F']
# Survivor List
s_list = pd.DataFrame()
s_list['ID'] = [1,2,3,4,5,6]
s_list['Survived'] = [1,0,0,0,1,0]
# Merge the datasets
merged = pd.merge(p_list,s_list,how='left',on=['ID'])
# Pivot to get sub means
result = pd.pivot_table(merged,index=['Class','Gender'],values=['Survived'],aggfunc=np.mean, margins=True)
# Reset the index so Class and Gender become ordinary columns
result = result.reset_index()
print(result)
Class Gender Survived
0 1 F 0.000000
1 1 M 1.000000
2 2 F 0.500000
3 2 M 0.000000
4 All 0.333333