I am working on an SMS dataset that has two columns: a "label" column consisting of "ham"/"spam" values, and a "messages" column consisting of strings.
I converted the "label" column to numeric labels, ham=1 and spam=0:
# Converting our labels to numeric labels
# ham = 1 and spam = 0
dfcat = dataset['label']=dataset.label.map({'ham':1,'spam':0})
dfcat.head()
So when I ran the above code the first time, it gave me exactly what I was looking for, but after I ran it again it started giving me NaN:
Out[108]:
0 NaN
1 NaN
2 NaN
3 NaN
4 NaN
Name: label, dtype: float64
Please, I need a way to fix this.
@G. Anderson gave the reason why you are seeing those NaN values the second time you run it.
As for a way to handle categorical variables in Python, one could use one-hot encoding. Toy example below:
import pandas as pd
df = pd.DataFrame({"col1": ["a", "b", "c"], "label": ["ham", "spam", "ham"]})
df_ohe = pd.get_dummies(df, prefix="ohe", drop_first=True, columns=["label"])
df_ohe
However, it also depends on the number of categorical variables and their cardinality (if it is high, one-hot encoding might not be the best approach).
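For high-cardinality columns, one lighter-weight alternative (a sketch of my own, not part of the original answer) is to integer-encode each category with pandas' category dtype:
# Sketch: map each distinct label to an integer code instead of one column per label
df["label_code"] = df["label"].astype("category").cat.codes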
The behavior of Series.map() is to replace the values found in the provided dictionary and change all other values to NaN. If you want to run the same line of code multiple times, all values need to be accounted for. You can either use a defaultdict, which allows a default value to be set, or just include the results of the first run as inputs in case you run it a second time.
Change
dfcat = dataset['label']=dataset.label.map({'ham':1,'spam':0})
to
dfcat = dataset['label']=dataset.label.map({'ham':1,'spam':0,1:1,0:0})
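As a side note, a sketch of an alternative that is safe to re-run: Series.replace() only touches the values it finds in the mapping and leaves everything else unchanged, so a second run is a no-op:
# Idempotent alternative: values already converted to 1/0 are left alone on a re-run
dataset['label'] = dataset.label.replace({'ham': 1, 'spam': 0})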
So, I am working with data from my research lab and am trying to sort it, move it around, etc. Most of it isn't important to my issue, and I don't want to go into detail because of confidentiality, but I have a big table with columns and rows, and I want to swap the elements of two columns in ONLY one row.
The extremely bad attempt at code I have for it is this (I renamed the variables to be more generic, though, so they make sense):
for x in df.columna.values:
    # *some if statements*
    df.loc[df.index([df.loc[df['columna'] == x]]), ['columnb', 'columna']] = df[df.index([df.loc[df['columna'] == x]]), ['columna', 'columnb']].numpy()
I am aware that the code I have is trash (and so is the method, with the for loops and if statements; I know I can abstract it a TON, but I just want to figure out a way to make it work, and then I will clean it up and make it prettier and more efficient. I learned pandas existed on Tuesday, so I am not an expert), but I think my issue lies in the way I'm getting the row.
One error I was getting for a while was that the method I used to get the row gave me 1 row x 22 columns, and I think I needed the name/index of the row instead, which is why the index function is now there. However, I am now getting the error:
TypeError: 'RangeIndex' object is not callable
And I am just so confused all around. Sorry I've written a ton of text; basically: is there any simpler way to just switch the elements of two columns for one specific row (in terms of x, an element in that row)?
I think my biggest issue is trying to get the row's "name" in the format pandas wants, although I may have a ton of other problems, because honestly I am just really lost.
You're sooooo close! The error you're getting stems from trying to slice df.index([df.loc[df['columna'] == x]]). The parentheses are unneeded here and this should read as: df.index[df.loc[df['columna'] == x]].
However, here's an example of how to swap values between columns when provided a value (or multiple values) to swap at.
Sample Data
import pandas as pd

df = pd.DataFrame({
    "A": list("abcdefg"),
    "B": [1, 2, 3, 4, 5, 6, 7]
})
print(df)
A B
0 a 1
1 b 2
2 c 3
3 d 4
4 e 5
5 f 6
6 g 7
Let's say we're going to swap the values where A is either "c" or "f". To do this, we first need to create a mask that selects just those rows, which we can do with .isin. Then, to perform the swap, we take the exact same approach you had! Including the .to_numpy() is very important: without it, pandas will realign your columns for you and the values will not be swapped. Putting it all together:
swap_at = ["c", "f"]
swap_at_mask = df["A"].isin(swap_at) # mask where columns "A" is either equal to "c" or "f"
# Without the `.to_numpy()` at the end, pandas will realign the Dataframe
# and no values will be swapped
df.loc[swap_at_mask, ["A", "B"]] = df.loc[swap_at_mask, ["B", "A"]].to_numpy()
print(df)
A B
0 a 1
1 b 2
2 3 c
3 d 4
4 e 5
5 6 f
6 g 7
I think it was probably a syntax problem. I am assuming you are using TensorFlow, given the numpy() call? Try the following; it switches the columns based on the code you provided:
for x in df.columna.values:
    # *some if statements*
    # .values hands back a plain NumPy array, so pandas cannot realign the columns
    df.loc[
        (df["columna"] == x),
        ['columna', 'columnb']
    ] = df.loc[(df["columna"] == x), ['columnb', 'columna']].values
I am also a beginner and would recommend you aim to make it pretty from the get-go. It will save you a lot of extra time in the long run. Trial and error!
It's my first time using Jupyter Notebook to analyze survey data (a .sav file), and I would like to read it in a way that shows the metadata, so I can connect the answers with the questions. I'm a total newbie in this field, so any help is appreciated!
import pandas as pd
import pyreadstat
df, meta = pyreadstat.read_sav('./SimData/survey_1.sav')
type(df)
type(meta)
df.head()
Please let me know if there is an additional step needed for me to be able to see the metadata!
The meta object contains the metadata you are looking for. Probably the most useful attributes to look at are:
meta.column_names_to_labels: a dictionary mapping the column names in your pandas dataframe to labels, i.e. longer explanations of the meaning of each column.
print(meta.column_names_to_labels)
meta.variable_value_labels: a dict where the keys are column names and the values are themselves dicts mapping the values you find in your dataframe to value labels.
print(meta.variable_value_labels)
For instance, if you have a column "gender" with values 1 and 2, you could get:
{"gender": {1:"male", 2:"female"}}
which means value 1 is male and value 2 is female.
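As a side note (a sketch, not part of the original answer), that nested {column: {value: label}} shape is exactly what DataFrame.replace accepts, so you can also apply the labels manually:
# Apply every column's value labels in one call
df_labeled = df.replace(meta.variable_value_labels)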
Alternatively, you can get those labels from the beginning if you pass the argument apply_value_formats:
df, meta = pyreadstat.read_sav('survey.sav', apply_value_formats=True)
You can also apply those value formats to your dataframe at any time with pyreadstat.set_value_labels, which returns a copy of your dataframe with labels:
df_copy = pyreadstat.set_value_labels(df, meta)
meta.missing_ranges: gives you the ranges of user-defined missing values. Let's say that for a certain variable the survey encoded 1 as yes, 2 as no, and then used missing codes: 5 meaning "didn't answer" and 6 meaning "person not at home". When you read the dataframe, by default you will get the values 1, 2, and NaN (missing) instead of 5 and 6. You can pass the argument user_missing=True to get the 5s and 6s, and meta.missing_ranges will tell you that 5 and 6 are missing values. variable_value_labels will give you the "didn't answer" and "person not at home" labels.
df, meta = pyreadstat.read_sav("survey.sav", user_missing=True)
print(meta.missing_ranges)
print(meta.variable_value_labels)
These are the pieces of information potentially useful for your case; not all of them will necessarily be present in your dataset.
More information here: https://ofajardo.github.io/pyreadstat_documentation/_build/html/index.html
I have a dataset that contains both numeric and categorical data, like this:
subject_id  hour_measure  heart rate  blood_pressure  urine color
3           4             60
4           2             70          60              red
6           1             30                          yellow
I tried various methods to handle the missing data, such as the following code:
f = lambda x: x.mean() if np.issubdtype(x.dtype, np.number) else next(iter(x.mode()), None)
df[cols] = df[cols].fillna(df[cols].transform(f))
df= df.fillna(method='ffill')
but these techniques didn't give me the result I want. I would like to use hot-deck imputation instead; I already understand the concept, and it is a suitable way to handle both numeric and categorical data.
If you are using your data as input for machine learning, you can convert the columns containing text to numbers (e.g. via a LUT, or by converting the colors to corresponding RGB values).
Regarding the second part of your question: could you be more specific about what results you are expecting and what your current code produces?
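As a minimal sketch of the LUT idea (the column name and color values are assumptions taken from the sample data above):
# Hypothetical lookup table mapping each color string to a number
color_lut = {'red': 0, 'yellow': 1}
df['urine color'] = df['urine color'].map(color_lut)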
In the literature, the hot-deck method is defined as replacing missing values with randomly selected values from the current dataset on hand. So I tried to implement it, with code such as the following:
import random
import pandas as pd

def hotdeck_imputation(data):
    for c in data.columns:
        # Fill each missing entry with a random observed value from the same column;
        # pd.isna works for both numeric and string columns, unlike np.isnan
        pool = data[c].dropna().to_numpy()
        data.loc[:, c] = [random.choice(pool) if pd.isna(i) else i for i in data[c]]
    return data
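For example, on the sample frame from the question (note the function fills data in place and also returns it, so pass a copy if you want to keep the original):
df_imputed = hotdeck_imputation(df.copy())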
I hope it helps with your problem.
I have a feature called smoking_status, and it has 3 different values:
1) smokes
2) formerly smoked
3) never smoked
The feature column (smoking_status) has the above 3 values as well as a lot of NaN values. How can I treat the NaN values, given that my data is not numerical? If it were numerical, I could have replaced them using the median or mean. How can I replace the NaN values in my case?
There might be two better options than replacing NaN with "unknown", at least in the context of a data science challenge, which I think this is:
replace this with the most common value (mode).
predict the missing value using the data you have
Getting the most common value is easy. For this purpose you can use <column>.value_counts() to get the frequencies, followed by .idxmax(), which gives you the index element from value_counts() with the highest frequency. After that you just call fillna():
import pandas as pd
import numpy as np

df = pd.DataFrame(['formerly', 'never', 'never', 'never',
                   np.nan, 'formerly', 'never', 'never',
                   np.nan, 'never', 'never'], columns=['smoked'])
print(df)
print('--')
print(df.smoked.fillna(df.smoked.value_counts().idxmax()))
Gives:
smoked
0 formerly
1 never
2 never
3 never
4 NaN
5 formerly
6 never
7 never
8 NaN
9 never
10 never
--
0 formerly
1 never
2 never
3 never
4 never
5 formerly
6 never
7 never
8 never
9 never
10 never
You don't have the data for those rows. You could simply fill them with the median, mean, or most common value in that feature, but in this particular case that's a bad idea, considering the feature.
A better approach would be to fill with a string such as 'unknown'/'NA':
df['smoking_status'] = df['smoking_status'].fillna('NA')  # fillna returns a new Series, so assign it back
Then you can label encode it or convert the column to one-hot encoding, for example as sketched below.
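A minimal sketch of the one-hot step, using pandas only (scikit-learn's encoders would work too):
# One 0/1 column per category, including the 'NA' placeholder
smoking_ohe = pd.get_dummies(df['smoking_status'], prefix='smoking')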
Example with categorical data:
ser = pd.Categorical(['non', 'non', 'never', 'former', 'never', np.nan])
Fill it:
ser.add_categories(['unknown']).fillna('unknown')
Gives you:
[non, non, never, former, never, unknown]
Categories (4, object): [former, never, non, unknown]
It looks like the question is about methodology, not a technical issue.
So you can try:
1) The most frequent value among those three;
2) Statistics from other categorical fields in your dataset (e.g. the most common smoking status per group);
3) Random values;
4) "UNKNOWN" category
Then you can do one-hot encoding and definitely check your models with cross-validation to choose the proper way.
There is also a trickier way: use this status as the target variable and try to predict those NaNs with scikit-learn using all the other data, as sketched below.
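A minimal sketch of that idea (the feature names are hypothetical, the features are assumed to be numeric already, and you would still want to validate the result on held-out data):
from sklearn.ensemble import RandomForestClassifier

features = ['age', 'bmi']  # placeholder names for your other, numeric columns
missing_mask = df['smoking_status'].isna()
known = df[~missing_mask]

# Train on rows where smoking_status is observed, then predict the missing rows
clf = RandomForestClassifier()
clf.fit(known[features], known['smoking_status'])
df.loc[missing_mask, 'smoking_status'] = clf.predict(df.loc[missing_mask, features])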
I have a largish survey dataset to clean (300 columns, 30,000 rows) and the columns are mixed. I'm using Python with pandas and numpy, and I'm very much at the training-wheels stage with Python.
Some of the columns had Y or N answers to questions (and these are filled "Y" or "N").
Some were Likert-scale questions with 5 possible answers. In the CSV file each answer (agree, disagree, etc.) has its own column. These have imported as 1 for a yes and NaN otherwise.
Other questions had up to 10 possible answers (e.g. for age), and these have imported as strings in one column, i.e. "a. 0-18" or "b. 19-25" and so on. Changing those will be interesting!
As I go through, I'm changing the Y/N answers to 1 or 0. However, for the Likert-scale columns, I'm concerned there might be a risk in doing the same thing. Does anyone have a view on whether it would be preferable to leave the data for those as NaN for now? Gender is the same: there is a separate column for males and one for females, both populated with 1 for yes and NaN for no.
I'm intending to use Python for the data analysis/charting (will import matplotlib & seaborn). As this is new to me I'm guessing that changes I make now may have unintended consequences later!
Any guidance you can give would be much appreciated.
Thanks in advance.
If there aren't 0s that mean anything, it's fine to fill the NAs with a value (0 for convenience). It all depends on your data. That said, 300 x 30k isn't that big: save it off as a CSV and just experiment in IPython Notebook. pandas can probably read it in under a second, so if you screw anything up, just reload.
Here's a quick bit of code that can condense whatever multi-column question sets into single columns with some number:
import pandas as pd

df = pd.DataFrame({
    1: {'agree': 1},
    2: {'disagree': 1},
    3: {'whatevs': 1},
    4: {'whatevs': 1}}).transpose()
df

question_sets = {
    'set_1': ['disagree', 'whatevs', 'agree'],  # define these lists from 1 to whatever
}

for setname, setcols in question_sets.items():
    # plug the NaNs with 0 (fillna returns a new frame, so assign it back)
    df[setcols] = df[setcols].fillna(0)
    # scale each column with 0 or 1 in the question set with an ascending value
    for val, col in enumerate(setcols, start=1):
        df[col] *= val
    # create new column by summing all the question set columns
    df[setname] = df[setcols].sum(axis=1)
    # delete all the old columns
    df.drop(setcols, inplace=True, axis=1)

df