K-Means Clustering - Handling Non-Numerical Data - Python

I have Twitter data that I want to cluster. It is text data, and I learned that k-means cannot handle non-numerical data. I want to cluster the data just on the basis of the tweets. The data looks like this.
I found this code that converts the text into numerical data:
import numpy as np

def handle_non_numerical_data(df):
    columns = df.columns.values
    for column in columns:
        # Map each unique value in this column to an integer id
        text_digit_vals = {}

        def convert_to_int(val):
            return text_digit_vals[val]

        if df[column].dtype != np.int64 and df[column].dtype != np.float64:
            column_contents = df[column].values.tolist()
            unique_elements = set(column_contents)
            x = 0
            for unique in unique_elements:
                if unique not in text_digit_vals:
                    text_digit_vals[unique] = x
                    x += 1
            df[column] = list(map(convert_to_int, df[column]))
    return df
df = handle_non_numerical_data(data)
print(df.head())
Output:
label tweet
0 9 24
1 5 11
2 17 45
3 14 138
4 18 112
I'm quite new to this, and I don't think this is what I need to fit the data. What is a better way to handle non-numerical data (text) of this nature?
Edit: When running the k-means clustering algorithm on the raw text data I get this error:
ValueError: could not convert string to float

The most typical way of handling non-numerical data is to convert a single column into multiple binary columns. This is called "getting dummy variables" or "one-hot encoding" (among many other snobby terms).
There are other things you can do to translate the data to numbers, such as sentiment analysis (e.g. categorize each tweet as happy, sad, funny, angry, etc.), analyzing the tweets to determine whether they are about a certain subject (e.g. does this tweet talk about a virus?), the number of words in each tweet, the number of spaces per tweet, whether it has good grammar or not, and so on. As you can see, you are asking about a very broad subject.
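For example, a few of those hand-crafted features are one-liners in pandas. A minimal sketch, assuming the tweets live in a column named tweet (the example rows are made up):
import pandas as pd

tweets_df = pd.DataFrame({'tweet': ['I love sunny days', 'this virus is scary!!']})

tweets_df['n_words'] = tweets_df['tweet'].str.split().str.len()  # words per tweet
tweets_df['n_spaces'] = tweets_df['tweet'].str.count(' ')        # spaces per tweet
tweets_df['about_virus'] = tweets_df['tweet'].str.contains('virus', case=False).astype(int)  # subject flag
print(tweets_df)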
When transforming data to binary columns, you get the number of unique values in your column and make that many new columns, each one of them filled with zeros and ones.
Let's focus on your first column:
import pandas as pd
df = pd.DataFrame({'account':['realdonaldtrump','narendramodi','pontifex','pmoindia','potus']})
account
0 realdonaldtrump
1 narendramodi
2 pontifex
3 pmoindia
4 potus
You can one-hot encode this column with pd.get_dummies:
pd.get_dummies(df, columns=['account'], prefix='account')
account_narendramodi account_pmoindia account_pontifex account_potus \
0 0 0 0 0
1 1 0 0 0
2 0 0 1 0
3 0 1 0 0
4 0 0 0 1
account_realdonaldtrump
0 1
1 0
2 0
3 0
4 0
This is one of many methods; there are plenty of articles about one-hot encoding if you want to read further.
NOTE: When you have many unique values, doing this will give you many columns, and some algorithms will crash due to not having enough degrees of freedom (too many variables, not enough observations). Lastly, if you are running a regression, you will run into perfect multicollinearity if you do not drop one of the columns.
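For the regression case, pd.get_dummies can drop one dummy per column for you. A sketch, continuing the account example above:
pd.get_dummies(df, columns=['account'], prefix='account', drop_first=True)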
Going back to your example, if you want to turn all your columns into this kind of data, try:
pd.get_dummies(df)
However, I wouldn't do this for the tweet column because each tweet is its own unique value.

As k-means is a method of vector quantization, you should vectorize your textual data in one way or another.
See some examples of using k-means over text:
Word2Vec
tf-idf
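For instance, here is a minimal sketch that vectorizes tweets with scikit-learn's TfidfVectorizer and then clusters them with KMeans (the tweet texts and the choice of n_clusters are made up; tune them for your data):
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

# Made-up tweets standing in for the real data
tweets = [
    'the weather is great today',
    'sunny weather all week',
    'the election results are in',
    'votes counted after the election',
]

# Turn each tweet into a tf-idf weighted term vector
vectorizer = TfidfVectorizer(stop_words='english')
X = vectorizer.fit_transform(tweets)

# Cluster the sparse vectors directly; KMeans accepts them
km = KMeans(n_clusters=2, n_init=10, random_state=0)
labels = km.fit_predict(X)
print(labels)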

Related

How to replace values in only some columns in Python without it affecting the same values in other columns?

I have a Pandas data frame with different columns.
Some columns are just "yes" or "no" answers that I would need to replace with 1 and 0.
Some columns are 1s and 2s, where 2 equals no - these 2s need to be replaced by 0.
Other columns are numerical categories, for example 1,2,3,4,5 where 1 = lion, 2 = dog.
Other columns are string categories, like "A lot", "A little", etc.
The first 2 columns are the target variables
My problems:
If I just change all 2s to 0 in the data frame, it would end up changing the 2s in the target variables (which in this case act as a score rather than a "No").
Another problem is that columns with categories as numbers would have their 2s changed to 0 as well.
How can I clean this dataframe so that:
1. all columns with either yes/1 or no/2 become 1s and 0s,
2. the two target variables stay as scores from 1-5,
3. and all categorical variables remain unchanged until I do one-hot encoding on them?
These are the steps I took:
To change all the “yes” or “no” to 0 and 1
df.replace(('Yes', 'No'), (1, 0), inplace=True)
Now, in order to replace all the 2s that act as "No"s with 0s -
without affecting either the 2s that act as a score in the first two target columns
or the 2s that act as a category value in columns that have more than 2 unique values - I think I would need to combine the following two lines of code. Is that correct? I am trying different ways to combine them but I keep getting errors.
df.loc[:, df.nunique() <= 2] or df[df.columns.difference(['target1', 'target2'])].replace(2, 0)
It would be better if you showed your code here and a sample of the database. I'm a bit confused. Here is what I gleaned:
First, I created a dummy dataset (loaded from data.csv in the code below; you can see its contents in the output at the end).
Here is the code that I think solves your two problems. If something is missing, it's because I didn't quite follow the explanation, as I said.
import pandas as pd
import numpy as np
import os

filename = os.path.join(os.path.dirname(__file__), 'data.csv')
sample = pd.read_csv(filename)

# This solves your first problem: create a new column with numeric values
# instead of the yes/no string values, using a function
def create_answers_column(sample, colname):
    def is_yes(a):
        if a == 'yes':
            return 1
        else:
            return 0
    return sample[colname].apply(is_yes)

sample['Answers Numeric'] = create_answers_column(sample, 'Answers')

# This solves your second problem, using replace()
sample['Numbers'] = sample.Numbers.replace({2: 0})

print(sample)
And here's the output:
Answers Numbers Animals Quantifiers Answers Numeric
0 yes 1 1 a lot 1
1 yes 0 2 little 1
2 no 0 3 many 0
3 yes 1 4 some 1
4 no 1 5 several 0
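As a side note, the combination the asker hinted at can be written directly. A minimal sketch with made-up data (only the target1/target2 column names come from the question):
import pandas as pd

df = pd.DataFrame({
    'target1': [1, 2, 3, 4, 5],   # score, keep the 2s
    'target2': [5, 4, 3, 2, 1],   # score, keep the 2s
    'answer':  [1, 2, 2, 1, 2],   # 1 = yes, 2 = no
    'animal':  [1, 2, 3, 4, 5],   # numeric categories, leave unchanged
})

# Binary columns (at most 2 unique values) that are not targets
candidates = df.columns.difference(['target1', 'target2'])
binary_cols = [c for c in candidates if df[c].nunique() <= 2]

# Replace 2 -> 0 only in those columns
df[binary_cols] = df[binary_cols].replace(2, 0)
print(df)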

Why is Pandas DataFrame Function 'isin()' taking so much time?

The 'ratings' DataFrame has two columns of interest: User-ID and Book-Rating.
I'm trying to make a histogram showing the amount of books read per user in this dataset. In other words, I'm looking to count Book-Ratings per User-ID. I'll include the dataset in case anyone wants to check it out.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

!wget https://raw.githubusercontent.com/porterjenkins/cs180-intro-data-science/master/data/ratings_train.csv

ratings = pd.read_csv('ratings_train.csv')

# Remove values where ratings are zero
ratings2 = ratings.loc[(ratings != 0).all(axis=1)]

# Sort by user
ratings2 = ratings2.sort_values(by=['User-ID'])

usersList = []
booksRead = []
for i in range(2000):
    numBooksRead = ratings2.isin([i]).sum()['User-ID']
    if numBooksRead != 0:
        usersList.append(i)
        booksRead.append(numBooksRead)

new_dict = {'User_ID': usersList, 'booksRated': booksRead}
usersBooks = pd.DataFrame(new_dict)
usersBooks
The code works as is, but it took almost 5 minutes to complete. And this is the problem: the dataset has 823,000 values, so if it took 5 minutes to get through only the first 2000 IDs, I don't think it's feasible to go through all of the data.
I should also admit, I'm sure there's a better way to make a DataFrame than creating two lists, turning them into a dict, and then making that a DataFrame.
Mostly I just want to know how to go through all this data in a way that won't take all day.
Thanks in advance!!
It seems you want a list of user IDs with a count of how often each ID appears in the dataframe. Use value_counts() for that:
ratings = pd.read_csv('ratings_train.csv')
# Remove Values where Ratings are Zero
ratings2 = ratings.loc[(ratings != 0).all(axis=1)]
In [74]: ratings2['User-ID'].value_counts()
Out[74]:
11676 6836
98391 4650
153662 1630
189835 1524
23902 1123
...
258717 1
242214 1
55947 1
256110 1
252621 1
Name: User-ID, Length: 21553, dtype: int64
The result is a Series, with the User-ID as index, and the value is number of books read (or rather, number of books rated by that user).
Note: be aware that the result is heavily skewed: there are a few very active readers, but most will have rated very few books. As a result, your histogram will likely just show one bin.
Taking the log (or plotting with the x-axis on a log scale) may show a clearer histogram:
s = ratings2['User-ID'].value_counts()
np.log(s).hist()
First filter the Book-Rating column to remove 0 values, then count the values with Series.value_counts and convert to a DataFrame; a loop is not necessary here:
ratings = pd.read_csv('ratings_train.csv')
ratings2 = ratings[ratings['Book-Rating'] != 0]
usersBooks = (ratings2['User-ID'].value_counts()
.sort_index()
.rename_axis('User_ID')
.reset_index(name='booksRated'))
print (usersBooks)
User_ID booksRated
0 8 6
1 17 4
2 44 1
3 53 3
4 69 2
... ...
21548 278773 3
21549 278782 2
21550 278843 17
21551 278851 10
21552 278854 4
[21553 rows x 2 columns]

Pandas Faster Way for One Hot Encoding vs pd.get_dummies

I need to one hot encode categorical variables on my pandas data frame.
My dataset is really big with over 2000 productIDs to be one hot encoded.
I tried pd.get_dummies and it always crashes.
I have also tried scikit-learn's OneHotEncoder which also crashes! (it works fine with a smaller subset of dataframe)
What other methods are there? What is the most efficient way to one hot encode categorical variables for very big data set?
My data frame:
Month User ProductID
1 A ProdA
3 A ProdB
11 A ProdC
12 A ProdD
Required output:
Month User ProdA ProdB ProdC ProdD
1 A 1 0 0 0
3 A 0 1 0 0
11 A 0 0 1 0
12 A 0 0 0 1
My dataset is really big, with over 2000 product IDs and millions of user rows.
This will result in a huge dataset. Presumably it's crashing because of memory.
Perhaps you should consider alternatives to full one-hot encoding.
One way is to create dummies of the top categories, and "other" for the rest.
tops = df.ProductID.value_counts().head(10).index
will give you the top product IDs. You can then use
df.loc[~df.ProductID.isin(tops), 'ProductID'] = 'other'
and create dummies out of that, as in the sketch below.
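Putting it together, a minimal sketch with a made-up frame (sparse=True keeps the dummy matrix memory-friendly):
import pandas as pd

df = pd.DataFrame({
    'Month': [1, 3, 11, 12],
    'User': ['A', 'A', 'A', 'A'],
    'ProductID': ['ProdA', 'ProdB', 'ProdC', 'ProdD'],
})

# Keep the most frequent product IDs, lump the rest into 'other'
tops = df['ProductID'].value_counts().head(2).index
df.loc[~df['ProductID'].isin(tops), 'ProductID'] = 'other'

# Sparse dummies use far less memory than a dense matrix
dummies = pd.get_dummies(df, columns=['ProductID'], sparse=True)
print(dummies)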
If you have a response variable, you might alternatively use mean encoding.
For a feature with so many different possible values, one-hot encoding may not be the best option.
I suggest using Target Encoding (https://contrib.scikit-learn.org/categorical-encoding/). Unlike one-hot encoding, which will create k columns for k unique values of the feature, target encoding transforms the one feature into one column.
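As a rough illustration, target (mean) encoding can be sketched by hand. The data here is made up, and in practice you should compute the means on the training set only, to avoid target leakage:
import pandas as pd

df = pd.DataFrame({
    'ProductID': ['ProdA', 'ProdB', 'ProdA', 'ProdC', 'ProdB', 'ProdA'],
    'y':         [1, 0, 1, 0, 1, 0],  # response variable
})

# Each category is replaced by the mean response observed for it
means = df.groupby('ProductID')['y'].mean()
df['ProductID_encoded'] = df['ProductID'].map(means)
print(df)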

How to convert strings to numeric values?

I am cleaning a CSV file in Jupyter to do machine learning.
However, several columns have string values, like the column "description".
I know I need to use NLP to clean it, but I could not find out how to do it in Jupyter.
Could you advise me how to convert these values to numeric values?
Thank you
Numerical values are better suited for learning models than raw words or images, and common machine learning algorithms expect numerical input.
The technique used to convert a word to a corresponding numerical value is called word embedding: strings are converted to feature vectors (numbers).
Bag of words, word2vec, and GloVe can all be used to implement this.
It is generally advisable to ignore fields that wouldn't be significant for the model, so include the description only if it is absolutely essential.
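As a small illustration of the bag-of-words idea, here is a sketch using scikit-learn's CountVectorizer (the description strings are made up):
from sklearn.feature_extraction.text import CountVectorizer

descriptions = [
    'cozy apartment near the park',
    'spacious house with a park view',
]

# Each row becomes a vector of word counts over the shared vocabulary
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(descriptions)
print(vectorizer.get_feature_names_out())
print(X.toarray())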
The problem you are describing is that of converting categorical data, usually in the form of strings or numerical IDs, to purely numerical data. I'm sure you are aware that using numerical IDs has a problem: it leads to the false interpretation that the data has some sort of order, like apple < orange < lime, when this is not the case.
It is common to use one-hot encoding to produce numerical indicator variables. After encoding one column, you have N columns, where N is the amount of unique labels. The columns have a value of 1 when the corresponding categorical variable had that value and 0 otherwise. This is especially handy if there are few unique labels in one column. Both Pandas and sklearn have these sorts of functions available, albeit they are not as feature complete as one would hope.
The "description" column you have seems to be a bit trickier, because it actually includes language, not just categorical data. So that column would need to be parsed or handled in some other way. Although, the one-hot encoding scheme may very well be used for all the words in the description, producing a vector that has more 1's.
For example:
>>> import pandas as pd
>>> import numpy as np
>>> df = pd.DataFrame(['a', 'b', 'c', 'a', 'a', np.nan])
>>> pd.get_dummies(df)
0_a 0_b 0_c
0 1 0 0
1 0 1 0
2 0 0 1
3 1 0 0
4 1 0 0
5 0 0 0
Additional processing would be needed to get the encoding word by word. This approach considers only the full values as variables.
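For a word-by-word encoding, pandas can split the strings first. A sketch with made-up values, using str.get_dummies:
>>> df = pd.DataFrame({'description': ['red apple', 'green apple', 'red lime']})
>>> df['description'].str.get_dummies(sep=' ')
   apple  green  lime  red
0      1      0     0    1
1      1      1     0    0
2      0      0     1    1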

pandas.get_dummies on float numbers for machine learning

I have some data in the form of a pandas dataframe with the columns watercolor (string), place (string), and temperature (float).
I want to use one-hot encoding to turn the data into categories, like:
color: darkblue lightblue teal
       1        0         0
       0        1         0
For the strings it is no problem, but how do I set the intervals for the temperature(float)?
I tried writing:
output = pd.get_dummies(df.astype(str))
The problem is that every unique float value is turned into a separate category, like:
temperature: 37.6 37.7 37.9 38
             0    1    0    0
             1    0    0    0
That means my program will overfit the data, since every temperature becomes its own category. I would like to specify intervals for the third column (temperature), so I want to group all values from, say, 37.5-39, from 39-41.5, and so on.
Try using pd.cut before creating the dummy columns:
pd.cut(df['temperature'], [37.5, 39, 41, ...], labels=['37.5-39', '39-41', ...])
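Putting it together, a minimal sketch with made-up temperatures (note that values outside the outermost bin edges become NaN):
import pandas as pd

df = pd.DataFrame({'temperature': [37.6, 37.7, 37.9, 39.5, 41.0]})

# Bin the floats into intervals first, then one-hot encode the bins
df['temp_bin'] = pd.cut(df['temperature'],
                        bins=[37.5, 39, 41.5],
                        labels=['37.5-39', '39-41.5'])
output = pd.get_dummies(df, columns=['temp_bin'])
print(output)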
