Values in a dataframe column will not change - python

I have a data frame where I want to replace the values '<=50K' and '>50K' in the 'Salary' column with '0' and '1' respectively. I have tried the replace function but it does not change anything. I have tried a lot of things but nothing seems to work. I am trying to do some logistic regression on the cells but the formulas do not work because of the datatype. The real data set has over 20,000 rows.
Age Workclass fnlwgt education education-num Salary
39 state-gov 455 Bachelors 13 <=50K
25 private 22 Masters 89 >50K
df['Salary']= df['Salary'].replace(['<=50K'],'0')
df['Salary']
This is the error I get when I try to use smf.logit(); see the code below. I don't understand why I get an error, because Age and education-num are both int64.
mod = smf.logit(formula = 'education-num ~ Age', data= dftrn)
resmod = modelAdm.fit()
ValueError: endog has evaluated to an array with multiple columns that
has shape (26049, 16). This occurs when the variable converted to
endog is non-numeric (e.g., bool or str).

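You can try this. For checking purposes I have created a new column; you can also change the original column directly by replacing new_salary with Salary. Note that the assignment has to target the column through .loc, otherwise every column of the matching rows is overwritten:
df['new_salary'] = df['Salary'].astype(object)
df.loc[df['new_salary'] == '<=50K', 'new_salary'] = 0
df.loc[df['new_salary'] == '>50K', 'new_salary'] = 1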

Regarding the first question, you should just use a single square bracket on the left side of the equation.
df['Salary']= df['Salary'].replace(['<=50K'],'0')
df['Salary']= df['Salary'].replace(['>50K'],'1')
df['Salary']
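If the goal is a numeric column for the regression, mapping the labels to integers may be more direct than replacing them with the strings '0' and '1'. The following is a minimal sketch, assuming the column holds only the two labels shown in the question. The two-row frame and the str.strip() call (a guard against stray whitespace around the labels) are illustrative assumptions, not part of the original data.

```python
import pandas as pd

# Illustrative two-row sample based on the table in the question.
df = pd.DataFrame({
    "Age": [39, 25],
    "education-num": [13, 89],
    "Salary": ["<=50K", " >50K"],
})

# Strip stray whitespace, then map each label to an integer.
# Any label missing from the dictionary would become NaN.
df["Salary"] = df["Salary"].str.strip().map({"<=50K": 0, ">50K": 1})
print(df["Salary"].tolist())
```

Checking df["Salary"].isna().sum() afterwards is a quick way to see whether any label failed to match.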
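As for the second part of the question, you are naming the model mod but calling the fit function on modelAdm. The ValueError itself most likely comes from the formula: in the formula syntax the hyphen in education-num is read as a minus sign, so the left-hand side becomes the non-numeric education column. Quoting the name as Q("education-num") avoids that. Note also that logit expects a binary response variable.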
Anyways those are 2 different questions and should be asked separately in 2 different posts.

Related

implementing hot deck imputation in python

I have a data-set that contain both numeric and categorical data like this
subject_id hour_measure heart rate blood_pressure urine color
3 4 60
4 2 70 60 red
6 1 30 yellow
I tried various methods to handle missing data such as the following code
f = lambda x: x.mean() if np.issubdtype(x.dtype, np.number) else next(iter(x.mode()), None)
df[cols] = df[cols].fillna(df[cols].transform(f))
df= df.fillna(method='ffill')
but these techniques didn't give me the result I want. I then tried to use hot deck imputation. I already understand the concept of the technique, and it seems a suitable way to handle both numeric and categorical data.
If you are using your data as input for machine learning, you can convert the columns containing text to numbers (e.g. with a lookup table (LUT), or by converting the colors to their corresponding RGB values).
Regarding the second part of your question : could you be more specific about what results you are expecting and what your current code produces?
In the literature, the hot-deck method is defined as replacing missing values with randomly selected values from the current dataset. So I tried a hot-deck method to handle the missing data with the following code:
import random
import pandas as pd
def hotdeck_imputation(data):
    for c in data.columns:
        # Observed values of this column (assumes at least one is present)
        donors = data[c].dropna().tolist()
        data[c] = [random.choice(donors) if pd.isna(v) else v for v in data[c]]
    return data
I hope it helps with your problem.
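For reference, here is a self-contained sketch of the same random hot-deck idea on a small mixed-type frame. The function name hot_deck_impute, the seed parameter, and the sample values are hypothetical and chosen only for illustration. It draws donors per column and leaves a column untouched if it has no observed values.

```python
import numpy as np
import pandas as pd

def hot_deck_impute(df, seed=0):
    """Fill each missing cell with a random observed value from its column."""
    rng = np.random.default_rng(seed)
    out = df.copy()
    for col in out.columns:
        missing = out[col].isna()
        donors = out.loc[~missing, col].to_numpy()
        # Skip columns with nothing to fill or nothing to draw from.
        if missing.any() and len(donors) > 0:
            out.loc[missing, col] = rng.choice(donors, size=int(missing.sum()))
    return out

# Illustrative data with numeric and categorical gaps.
sample = pd.DataFrame({
    "heart_rate": [60.0, 70.0, np.nan],
    "urine_color": [None, "red", "yellow"],
})
filled = hot_deck_impute(sample)
print(filled)
```

Because pd.isna is used through isna(), the same loop handles numeric and text columns, which np.isnan alone would not.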

pandas dataframe throwing an empty list

I have a table whose column names are not consistently organized: different years of data appear under different column numbers.
So I should access each data through specified column names.
I am using this syntax to access a column.
df = df[["2018/12"]]
But when I just want to extract numbers under that column, using
df.iloc[0,0]
it throws an error like
single positional indexer is out-of-bounds
So I am using
df.loc[0]
but it has the column name with the numeric data.
How can I extract just the number of each row?
Below is the CSV data
Closing Date,2014/12,2015/12,2016/12,2017/12,2018/12,Trend
Net Sales,"31,634","49,924","62,051","68,137","72,590",
""
Net increase,"-17,909","-16,962","-34,714","-26,220","-29,721",
Net Received,-,-,-,-,-,
Net Paid,-328,"-6,038","-9,499","-9,375","-10,661",
When writing this dumb question, I was just a beginner, not even knowing what I wanted to ask.
The OP's question comes down to "getting the values as a list", since he ended his post asking how to get the numbers (though he said "number", maybe by mistake) of each row.
The answer is that he made the mistake of using double square brackets in his example, and it caused problems.
The solution is to use df = df["2018/12"] instead of df = df[["2018/12"]].
As for the things I (me at the time of writing the question) mentioned, I will answer them one by one:
Let's say the table looks like this
Unnamed: 0 2018/12 country drives_right
0 US 809 United States True
1 AUS 731 Australia False
2 JAP 588 Japan False
3 IN 18 India False
4 RU 200 Russia True
5 MOR 70 Morocco True
6 EG 45 Egypt True
1>df = df[["2018/12"]]
: it will output a dataframe which only has the column "2018/12" and the index column on the left side.
2>df.iloc[0,0]
Now, since from 1> we have a new dataframe having only one column(except for index column mentioning index values) this will output the first element of the column.
In the example above, the outcome will be "809" since it's the first element of the column.
3>
But when I just want to extract numbers under that column, using
df.iloc[0,0]
: This doesn't make sense if you want to extract numbers (plural). It will just output one element, 809, from the sub-dataframe you created using df = df[["2018/12"]].
it throws an error like
single positional indexer is out-of-bounds
: Maybe you are confused about the outcome. (Maybe in this case "df" is the dataframe from before the subset assignment df = df[["2018/12"]]?) Since df = df[["2018/12"]] produces a dataframe, df.iloc[0,0] will work fine on it.
4>
So I am using
df.loc[0]
but it has the column name with the numeric data.
: Yes, df.loc[0] on the result of df = df[["2018/12"]] will return the column name and the first element of that column.
5>
How can I extract just the number of each row?
You mean "numbers" of each row right?
Use this:
print(df["2018/12"].values.tolist())
To find columns or rows whose names vary, and then access each of them, you should think about using regex.
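Putting the points above together, here is a rough sketch that loads a trimmed copy of the CSV from the question, selects the column with single brackets, and converts the text to numbers. Reading everything as strings and treating "-" as missing are assumptions made for this illustration, as is the regex used to pick the year columns.

```python
import io
import pandas as pd

# Trimmed copy of the sample data from the question.
csv_text = '''Closing Date,2014/12,2015/12,2016/12,2017/12,2018/12,Trend
Net Sales,"31,634","49,924","62,051","68,137","72,590",
Net increase,"-17,909","-16,962","-34,714","-26,220","-29,721",
Net Received,-,-,-,-,-,
Net Paid,-328,"-6,038","-9,499","-9,375","-10,661",
'''

# Read every cell as text so the thousands separators can be removed explicitly.
df = pd.read_csv(io.StringIO(csv_text), index_col=0, dtype=str)

# Single brackets return a Series; strip the commas and convert to numbers.
# Cells that are not numeric (the "-" placeholders) become NaN.
col = df["2018/12"]
numbers = pd.to_numeric(col.str.replace(",", "", regex=False), errors="coerce")
values = numbers.tolist()
print(values)

# Columns whose names look like YYYY/MM can be selected with a regex.
year_cols = df.filter(regex=r"^\d{4}/\d{2}$").columns.tolist()
print(year_cols)
```

The regex selection is one way to cope with files where the year columns sit at different positions.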

What's wrong with fillna in Pandas running twice?

I'm new to Pandas and Numpy. I was trying to solve the Kaggle Titanic dataset. I have to fix two columns, "Age" and "Embarked", because they contain NaN.
I tried fillna without any success, and soon discovered that I was missing inplace=True.
After adding it, the first imputation was successful but the second one was not. I tried searching on SO and Google, but did not find anything useful. Please help me.
Here's the code that I was trying.
# imputing "Age" with mean
titanic_df["Age"].fillna(titanic_df["Age"].mean(), inplace = True)
# imputing "Embarked" with mode
titanic_df["Embarked"].fillna(titanic_df["Embarked"].mode(), inplace = True)
print titanic_df["Age"][titanic_df["Age"].isnull()].size
print titanic_df["Embarked"][titanic_df["Embarked"].isnull()].size
and I got the output as
0
2
However I managed to get what I want without using inplace=True
titanic_df["Age"] =titanic_df["Age"].fillna(titanic_df["Age"].mean())
titanic_df["Embarked"] = titanic_df.fillna(titanic_df["Embarked"].mode())
But I am curious what's with the second usage of inplace=True.
Please bear with me if I'm asking something extremely stupid; I'm totally new and may be missing small things. Any help is appreciated. Thanks in advance.
pd.Series.mode returns a Series.
A variable has a single arithmetic mean and a single median but it may have several modes. If more than one value has the highest frequency, there will be multiple modes.
pandas operates on labels.
titanic_df.mean()
Out:
PassengerId 446.000000
Survived 0.383838
Pclass 2.308642
Age 29.699118
SibSp 0.523008
Parch 0.381594
Fare 32.204208
dtype: float64
If I were to use titanic_df.fillna(titanic_df.mean()) it would return a new DataFrame where the column PassengerId is filled with 446.0, column Survived is filled with 0.38 and so on.
However, if I call the mean method on a Series, the returning value is a float:
titanic_df['Age'].mean()
Out: 29.69911764705882
There is no label associated here. So if I use titanic_df.fillna(titanic_df['Age'].mean()) all the missing values in all the columns will be filled with 29.699.
Why the first attempt was not successful
You tried to fill the Embarked column with titanic_df["Embarked"].mode(), which is a Series rather than a single value. Let's check the output first:
titanic_df["Embarked"].mode()
Out:
0 S
dtype: object
It is a Series with a single element. The index is 0 and the value is S. fillna aligns on labels: when a Series is filled with another Series, only a missing value in the row labelled 0 can be filled, so the two missing rows stay missing. The same applies to a DataFrame. Remember how it would work if we used titanic_df.mean() to fill: it would fill each column with the corresponding mean value. Here, we only have one label, so titanic_df.fillna(titanic_df["Embarked"].mode()) will only fill values if we have a column named 0. Try adding titanic_df[0] = np.nan and executing that call again. You'll see that the new column is filled with S.
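The label alignment can be seen on a tiny Series. This is only an illustrative sketch with made-up values, where fill_values stands in for the one-element result of mode():

```python
import pandas as pd

embarked = pd.Series(["S", None, "C", None])
fill_values = pd.Series(["S"])  # like mode(): a single value under label 0

# Filling with a Series aligns on index labels, so only a missing value
# at label 0 could be filled; rows 1 and 3 stay missing.
by_label = embarked.fillna(fill_values)

# Filling with the scalar fills every missing value.
by_scalar = embarked.fillna(fill_values.iloc[0])

print(by_label.isna().sum(), by_scalar.isna().sum())
```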
Why the second attempt was (not) successful
The right hand side of the equation, titanic_df.fillna(titanic_df["Embarked"].mode()) returns a new DataFrame. In this new DataFrame, Embarked column still has nan's:
titanic_df.fillna(titanic_df["Embarked"].mode())['Embarked'].isnull().sum()
Out: 2
However you didn't assign it back to the entire DataFrame. You assigned this DataFrame to a Series - titanic_df['Embarked']. It didn't actually fill the missing values in the Embarked column, it just used the index values of the DataFrame. If you actually check the new column, you'll see numbers 1, 2, ... instead of S, C and Q.
What you should do instead
You are trying to fill a single column with a single value. First, disassociate that value from its label:
titanic_df['Embarked'].mode()[0]
Out: 'S'
Now, it is not important if you use inplace=True or assign the result back. Both
titanic_df['Embarked'] = titanic_df['Embarked'].fillna(titanic_df['Embarked'].mode()[0])
and
titanic_df['Embarked'].fillna(titanic_df['Embarked'].mode()[0], inplace=True)
will fill the missing values in the Embarked column with S.
Of course this assumes you want to use the first value if there are multiple modes. You may need to improve your algorithm there (for example randomly select from the values if there are multiple modes).
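As a compact illustration of the recommended fix, the sketch below uses a made-up four-row frame rather than the real Titanic data. It uses the assignment form throughout; in recent pandas versions, calling fillna with inplace=True on a selected column may not update the parent DataFrame, so assigning the result back is the safer habit.

```python
import numpy as np
import pandas as pd

# Made-up stand-in for the Titanic data.
titanic_df = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, 35.0],
    "Embarked": ["S", "C", None, "S"],
})

# mean() on a Series is a scalar, so it can be used directly.
titanic_df["Age"] = titanic_df["Age"].fillna(titanic_df["Age"].mean())

# mode() is a Series; take its first element to get a scalar.
embarked_mode = titanic_df["Embarked"].mode().iloc[0]
titanic_df["Embarked"] = titanic_df["Embarked"].fillna(embarked_mode)

print(titanic_df)
```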

How to plot a graph for correlation co-efficient between each attributes of a dataset and target attribute using Python

I am new to Python and I need to plot the correlation coefficient of each attribute against the target value. I have an input dataset with a huge number of values; a sample is provided below. We need to predict whether a particular consumer will leave a company or not, and hence the RESULT column is the target variable.
SALARY DUE RENT CALLSPERDAY CALL DURATION RESULT
238790 7 109354 0 6 YES
56004 0 204611 28 15 NO
671672 27 371953 0 4 NO
786035 1 421999 19 11 YES
89684 2 503335 25 8 NO
904285 3 522554 0 13 YES
12072 4 307649 4 11 NO
23621 19 389157 0 4 YES
34769 11 291214 1 13 YES
945835 23 515777 0 5 NO
Here, as you can see, the RESULT column is a string whereas the rest of the columns are integers. Like RESULT, I also have a few other columns (not shown in the sample) which hold string values. I need to compute correlations for columns that hold both string and integer values.
Using a dictionary, I have assigned a value to each of the columns which hold string values.
Example: Result column has Yes or No. Hence assigned value as below:
D = {'NO': 0, 'YES': 1}
and, using a lambda function, looped through each column of the dataset and replaced NO with 0 and YES with 1.
I tried to calculate the correlation coefficient using the formula:
pearsonr(S.SALARY,targetVarible)
Where S is the dataframe which holds all values.
Similarly, I will loop through all the columns of the dataset and calculate the correlation coefficient of each column against the target variable.
Is this an efficient way of calculating correlation coefficient?
Because, I am getting value as below
(0.088327739664096655, 1.1787456108540725e-25)
e^-25 seems to be too small.
Is there any other way to calculate it? Would you suggest any other way to encode string values so that they can be treated as integers when compared with the integer columns (other than the dictionaries and lambdas which I used)?
I also need to plot a bar graph using the same code. I am planning to use from matplotlib import pyplot as plt.
Would you suggest any other function to plot a bar graph? Mostly I'm using sklearn, numpy and pandas, and I would like to use existing functions from them.
It would be great, if someone helps me. Thanks.
As mentioned in the comments you can use df.corr() to get the correlation matrix of your data. Assuming the name of your DataFrame is df you can plot the correlation with:
df_corr = df.corr()
df_corr[['RESULT']].plot(kind='hist')
Pandas DataFrames have a plot function that uses matplotlib. You can learn more about it here: http://pandas.pydata.org/pandas-docs/stable/visualization.html
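Since the question asks for a bar graph, a bar chart of each column's correlation with the target may be closer to the goal than a histogram. The sketch below is one possible way to do it, using the sample rows from the question. The column name CALL_DURATION and the helper plot_corr_bar are hypothetical names introduced here. As an aside, the second number returned by scipy.stats.pearsonr is a p-value, not a coefficient, so a value like 1e-25 there is not a correlation.

```python
import pandas as pd

# Sample rows from the question.
df = pd.DataFrame({
    "SALARY": [238790, 56004, 671672, 786035, 89684,
               904285, 12072, 23621, 34769, 945835],
    "DUE": [7, 0, 27, 1, 2, 3, 4, 19, 11, 23],
    "RENT": [109354, 204611, 371953, 421999, 503335,
             522554, 307649, 389157, 291214, 515777],
    "CALLSPERDAY": [0, 28, 0, 19, 25, 0, 4, 0, 1, 0],
    "CALL_DURATION": [6, 15, 4, 11, 8, 13, 11, 4, 13, 5],
    "RESULT": ["YES", "NO", "NO", "YES", "NO",
               "YES", "NO", "YES", "YES", "NO"],
})

# Encode the target as 0/1 so it takes part in the correlation matrix.
df["RESULT"] = df["RESULT"].map({"NO": 0, "YES": 1})

# Pearson correlation of every feature with the target.
corr_with_target = df.corr()["RESULT"].drop("RESULT")
print(corr_with_target)

def plot_corr_bar(corr):
    """Draw one bar per feature; matplotlib is imported only when plotting."""
    import matplotlib.pyplot as plt
    ax = corr.plot(kind="bar")
    ax.set_ylabel("Pearson correlation with RESULT")
    plt.tight_layout()
    return ax
```

Calling plot_corr_bar(corr_with_target) and then plt.show() (or savefig) draws the chart.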

Merge misaligned pandas dataframes

I have around 100 csv files. Each of them is read into its own pandas dataframe; the dataframes are merged later on and finally written to a database.
Each csv file contains 1000 rows and 816 columns.
Here is the problem:
Each of the csv files contains the 816 columns, but not all of the columns contain data. As a result, some of the csv files are misaligned: the data has been moved left, but the column has not been deleted.
Here's a made-up example:
CSV file A (which is correct):
Name Age City
Joe 18 London
Kate 19 Berlin
Math 20 Paris
CSV file B (with misalignment):
Name Age City
Joe 18 London
Kate Berlin
Math 20 Paris
I would like to merge A and B, but my current solution results in a misalignment.
I'm not sure whether this is easier to deal with in SQL or Python, but I hoped some of you could come up with a good solution.
The current solution to merge the dataframes is as follows:
def merge_pandas(csvpaths):
    frames = []
    for path in csvpaths:
        # sMainPath is the base directory, defined elsewhere in the script
        frame = pd.read_csv(sMainPath + path, header=0, index_col=None)
        frames.append(frame)
    return pd.concat(frames)
Thanks in advance.
A generic solution for these types of problems is most likely overkill. Note that the only possible mistake is a value written into the column to the left of where it belongs.
If your problem is more complex than the two column example you gave, you should keep an array that lists the expected column types for your convenience.
types = ['string', 'int']
Next, I would set up a marker to identify flaws:
df['error'] = 0
df.loc[df.City.isnull(), 'error'] = 1
The script can detect the error with certainty
In your simple scenario, whenever there is an error, we can simply check the value in the first column.
If it's a number, ignore and move on (keep NaN on second value)
If it's a string, move it to the right
In your trivial example, that would be
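def checkRow(row):
    try:
        # Age parses as an integer: the row is correctly aligned
        row['Age'] = int(row['Age'])
    except ValueError:
        # Age holds text: it belongs in City, so move it one column to the right
        row['City'] = row['Age']
        row['Age'] = np.nan
    return row
# apply returns a new dataframe, so assign the result back
df = df.apply(checkRow, axis=1)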
In case you have more than two columns, use your types variable to do iterated checks to find out where the NaN belongs.
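The detectable case can also be handled without a row-wise apply. The following is a sketch under the assumption that the misaligned row in file B is stored as "Kate,Berlin," (the value shifted left and the last field empty); the variable names are illustrative. Only rows where City is missing and Age holds non-numeric text are touched, so a genuinely missing Age does not overwrite a valid City.

```python
import io
import pandas as pd

# Assumed raw content of CSV file B from the question.
csv_b = "Name,Age,City\nJoe,18,London\nKate,Berlin,\nMath,20,Paris\n"
df = pd.read_csv(io.StringIO(csv_b))

# Non-numeric Age values become NaN here.
age_num = pd.to_numeric(df["Age"], errors="coerce")

# A row is shifted if City is empty and Age holds text that is not a number.
shifted = df["City"].isna() & df["Age"].notna() & age_num.isna()

# Move the misplaced text to City, then keep only the numeric ages.
df.loc[shifted, "City"] = df.loc[shifted, "Age"]
df["Age"] = age_num

print(df)
```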
The script cannot know the error with certainty
For example, if two adjacent columns both hold string values. In that case, you're screwed. Use a second marker to flag these rows and fix them manually. You could of course do advanced checks (if it should be a city name, check whether the value is a city name), but this is probably overkill, and doing it manually is faster.
