Re-categorize a column in a pandas dataframe - python

I am trying to build a simple classification model for my data stored in the pandas dataframe train. To make this model more efficient, I created a list, category_cols, of the names of the columns I know store categorical data. I categorize these columns as follows:
# Define the lambda function: categorize_label
categorize_label = lambda x: x.astype('category')
# Convert train[category_cols] to a categorical type
train[category_cols] = train[category_cols].apply(categorize_label, axis=0)
My target variable, material, is categorical and has 64 unique labels it can be assigned to. However, some of these labels appear only once in train, which is too few to train the model well. So I'd like to filter out any observations in train that have these rare material labels. This answer provided a useful groupby+filter combination:
print('Num rows: {}'.format(train.shape[0]))
print('Material labels: {}'.format(len(train['material'].unique())))
min_count = 5
filtered = train.groupby('material').filter(lambda x: len(x) > min_count)
print('Num rows: {}'.format(filtered.shape[0]))
print('Material labels: {}'.format(len(filtered['material'].unique())))
----------------------
Num rows: 19999
Material labels: 64
Num rows: 19963
Material labels: 45
This works great in that it does filter the observations with rare material labels. However, something under the hood in the category type seems to maintain all the previous values for material even after they've been filtered. This becomes a problem when trying to create dummy variables, and happens even if I try to rerun my same categorize method:
filtered[category_cols] = filtered[category_cols].apply(categorize_label, axis=0)
print(pd.get_dummies(train['material']).shape)
print(pd.get_dummies(filtered['material']).shape)
----------------------
(19999, 64)
(19963, 64)
I would have expected the shape of the filtered dummies to be (19963, 45). However, pd.get_dummies includes columns for labels that have no appearances in filtered. I assume this has something to do with how the category type works. If so, could someone please explain how to re-categorize a column? Or if that is not possible, how to get rid of the unnecessary columns in the filtered dummies?
Thank you!

You can use Series.cat.remove_unused_categories:
Usage
df['category'] = df['category'].cat.remove_unused_categories()
Example
df = pd.DataFrame({'label': list('aabbccd'),
                   'value': [1] * 7})
print(df)
  label  value
0     a      1
1     a      1
2     b      1
3     b      1
4     c      1
5     c      1
6     d      1
Let's set label to type category:
df['label'] = df.label.astype('category')
print(df.label)
0    a
1    a
2    b
3    b
4    c
5    c
6    d
Name: label, dtype: category
Categories (4, object): [a, b, c, d]
Filter the DataFrame to remove label 'd':
df = df[df.label.ne('d')]
print(df)
  label  value
0     a      1
1     a      1
2     b      1
3     b      1
4     c      1
5     c      1
Remove the unused categories:
df['label'] = df.label.cat.remove_unused_categories()
print(df.label)
0    a
1    a
2    b
3    b
4    c
5    c
Name: label, dtype: category
Categories (3, object): [a, b, c]
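Tying this back to the question, a sketch that reuses the question's apply pattern (assuming material is one of the columns in category_cols):
# Drop unused categories from every categorical column of the filtered frame
filtered[category_cols] = filtered[category_cols].apply(
    lambda col: col.cat.remove_unused_categories(), axis=0)
print(pd.get_dummies(filtered['material']).shape)  # expected (19963, 45)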

Per this answer, this can be solved via reindexing and transposing the dummy dataframe:
labels = filtered['material'].unique()
dummies = pd.get_dummies(filtered['material'])
dummies = dummies.T.reindex(labels).T
print(dummies.shape)
----------------------
(19963, 45)
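The double transpose can also be avoided by reindexing the columns directly; a minimal equivalent sketch:
dummies = pd.get_dummies(filtered['material']).reindex(columns=labels)
print(dummies.shape)  # (19963, 45)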

Related

Saving small sub-dataframes containing all values associated to a specific 'key' string

I need a little suggestion on a procedure using pandas. I have a two-column dataset that looks like this:
A 0.4533
B 0.2323
A 1.2343
A 1.2353
B 4.3521
C 3.2113
C 2.1233
.. ...
where the first column contains strings and the second one floats. I would like to save the minimum value for each group of identical strings, so that I end up with the minimum associated with A, B, and C. Does anybody have any suggestions? It would also help if I could somehow store all the values associated with each string.
Many thanks,
James
Input data:
>>> df
   0       1
0  A  0.4533
1  B  0.2323
2  A  1.2343
3  A  1.2353
4  B  4.3521
5  C  3.2113
6  C  2.1233
Use groupby before min:
out = df.groupby(0).min()
Output result:
>>> out
        1
0
A  0.4533
B  0.2323
C  2.1233
Update: to filter out all the values in the original dataset that are more than 20% different from the minimum:
out = df[df.groupby(0)[1].apply(lambda x: x <= x.min() * 1.2)]
>>> out
   0       1
0  A  0.4533
1  B  0.2323
6  C  2.1233
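An equivalent sketch using transform, which broadcasts each group's minimum back onto its rows (same integer column labels 0 and 1 as above):
# Keep rows whose value is within 20% of their group's minimum
out = df[df[1] <= df.groupby(0)[1].transform('min') * 1.2]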
You can simply do it with:
min_A = df[df["column_1"] == "A"]["value"].min()
min_B = df[df["column_1"] == "B"]["value"].min()
min_C = df[df["column_1"] == "C"]["value"].min()
where df is the DataFrame, and column_1 and value are the names of its columns.
You can also use pandas' built-in groupby():
>>> df.groupby(["column_1"]).min()
The above gives the same result.

How to classify String Data into Integers?

I need to classify the string values of a feature of my dataset so that I can use it further for other things, let's say predicting or plotting.
How do I convert it?
I found this solution, but there I have to manually type in code for every unique value of the feature. For 2-3 unique values that's alright, but I've got a feature with more than 50 unique values (countries); I can't write code for every country.
def sex_class(x):
    if x == 'male':
        return 1
    else:
        return 0
This changes the male values to 1 and female values to 0 in the feature - sex.
You can make use of the scikit-learn LabelEncoder:
from sklearn import preprocessing

# given a list containing all possible labels
sex_classes = ['male', 'female']

le = preprocessing.LabelEncoder()
le.fit(sex_classes)
This will assign labels to all the unique values in the given list. You can save this label encoder object as a pickle file for later use as well.
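For instance, a short sketch (the pickle file name is illustrative):
import pickle

# Encode string labels to integer codes and map them back
codes = le.transform(['female', 'male', 'male'])    # array([0, 1, 1])
labels = le.inverse_transform(codes)                # back to ['female', 'male', 'male']

# Persist the fitted encoder for later use
with open('sex_label_encoder.pkl', 'wb') as fh:
    pickle.dump(le, fh)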
Use rank or pd.factorize, for example on a column of string IDs:
df = pd.DataFrame({'id': ['a', 'b', 'c', 'a', 'b', 'c', 'A', 'b']})
df['ID_int'] = df['id'].rank(method='dense').astype(int)
df['ID_int2'] = pd.factorize(df['id'])[0]
Output:
  id  ID_int  ID_int2
0  a       2        0
1  b       3        1
2  c       4        2
3  a       2        0
4  b       3        1
5  c       4        2
6  A       1        3
7  b       3        1
The labels are different, but consistent.
You can use a dictionary instead.
sex_class = {'male': 1, 'female': 0}
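Applied with map (assuming the column is named sex, as in the question):
df['sex'] = df['sex'].map(sex_class)
Values missing from the dictionary become NaN, which makes unexpected labels easy to spot.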

Unpack DataFrame with tuple entries into separate DataFrames

I wrote a small class to compute some statistics through bootstrap without replacement. For those not familiar with this technique, you get n random subsamples of some data, compute the desired statistic (let's say the median) on each subsample, and then compare the values across subsamples. This allows you to get a measure of the variance of the obtained median over the dataset.
I implemented this in a class but reduced it to an MWE given by the following function:
import numpy as np
import pandas as pd

def bootstrap_median(df, n=5000, fraction=0.1):
    if isinstance(df, pd.DataFrame):
        columns = df.columns
    else:
        columns = None
    # Get the values as an ndarray
    arr = np.array(df.values)
    # Get the bootstrap sample through random permutations
    sample_len = int(len(arr) * fraction)
    if sample_len < 1:
        sample_len = 1
    sample = []
    for n_sample in range(n):
        sample.append(arr[np.random.permutation(len(arr))[:sample_len]])
    sample = np.array(sample)
    # Compute the median on each sample
    temp = np.median(sample, axis=1)
    # Get the mean and std of the estimate across samples
    m = np.mean(temp, axis=0)
    s = np.std(temp, axis=0) / np.sqrt(len(sample))
    # Convert output to DataFrames if necessary and return
    if columns is not None:
        m = pd.DataFrame(data=m[None, ...], columns=columns)
        s = pd.DataFrame(data=s[None, ...], columns=columns)
    return m, s
This function returns the mean and standard deviation across the medians computed on each bootstrap sample.
Now consider this example DataFrame
data = np.arange(20)
group = np.tile(np.array([1, 2]).reshape(-1,1), (1,10)).flatten()
df = pd.DataFrame.from_dict({'data': data, 'group': group})
print(df)
print(bootstrap_median(df['data']))
this prints
    data  group
0      0      1
1      1      1
2      2      1
3      3      1
4      4      1
5      5      1
6      6      1
7      7      1
8      8      1
9      9      1
10    10      2
11    11      2
12    12      2
13    13      2
14    14      2
15    15      2
16    16      2
17    17      2
18    18      2
19    19      2
(9.5161999999999995, 0.056585753613431718)
So far so good because bootstrap_median returns a tuple of two elements. However, if I do this after a groupby
In: df.groupby('group')['data'].apply(bootstrap_median)
Out:
group
1 (4.5356, 0.0409710449952)
2 (14.5006, 0.0403772204095)
The values inside each cell are tuples, as one would expect from apply. I can unpack the result into two DataFrames by iterating over the elements like this:
index = []
data1 = []
data2 = []
for g, (m, s) in out.items():  # .iteritems() in older pandas
    index.append(g)
    data1.append(m)
    data2.append(s)
dfm = pd.DataFrame(data=data1, index=index, columns=['E[median]'])
dfm.index.name = 'group'
dfs = pd.DataFrame(data=data2, index=index, columns=['std[median]'])
dfs.index.name = 'group'
thus
In: dfm
Out:
E[median]
group
1 4.5356
2 14.5006
In: dfs
Out:
std[median]
group
1 0.0409710449952
2 0.0403772204095
This is a bit cumbersome, and my question is whether there is a more pandas-native way to "unpack" a DataFrame whose values are tuples into separate DataFrames.
This question seemed related but it concerned string regex replacements and not unpacking true tuples.
I think you need to change:
return m, s
to:
return pd.Series([m, s], index=['m','s'])
And then get:
df1 = df.groupby('group')['data'].apply(bootstrap_median)
print (df1)
group
1 m 4.480400
s 0.040542
2 m 14.565200
s 0.040373
Name: data, dtype: float64
Then it is possible to select by xs:
print (df1.xs('s', level=1))
group
1 0.040542
2 0.040373
Name: data, dtype: float64
print (df1.xs('m', level=1))
group
1 4.4804
2 14.5652
Name: data, dtype: float64
Also, if you need a one-column DataFrame, add to_frame:
df1 = df.groupby('group')['data'].apply(bootstrap_median).to_frame()
print (df1)
data
group
1 m 4.476800
s 0.041100
2 m 14.468400
s 0.040719
print (df1.xs('s', level=1))
data
group
1 0.041100
2 0.040719
print (df1.xs('m', level=1))
data
group
1 4.4768
2 14.4684
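If you want both statistics side by side instead, unstacking the inner index level is a compact alternative (a sketch building on the Series return above; exact values vary between runs):
# Pivot the 'm'/'s' index level into columns
stats = df.groupby('group')['data'].apply(bootstrap_median).unstack()
print(stats)
#              m         s
# group
# 1       4.4804  0.040542
# 2      14.5652  0.040373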

python dask dataframes - concatenate groupby.apply output to a single data frame

I am using dask dataframe.groupby().apply()
and get a dask series as a return value.
I am mapping each group to a list of triplets such as (a,b,1) and then wish to turn all the triplets into a single dask dataframe.
I am using this code at the end of the mapping function to return the triplets as a dask df:
# assume here that trip is a generator of triplets such as you would produce from itertools.product([l1, l2, l3])
trip = list(itertools.chain.from_iterable(trip))
df = pd.DataFrame.from_records(trip)
return dd.from_pandas(df, npartitions=1)
Then I try something similar to pandas concat with dask concatenate. Assume the result of the apply function is the variable result. I am trying to use:
import dask.dataframe as dd
dd.concat(result, axis=0)
and get the error
raise TypeError("dfs must be a list of DataFrames/Series objects")
TypeError: dfs must be a list of DataFrames/Series objects
But when I check for the type of result using
print(type(result))
I get
output: class 'dask.dataframe.core.Series'
What is the proper way to apply a function over groups of dask groupby object and get all the results into one dataframe?
Thanks
edit:--------------------------------------------------------------
in order to produce the use case, assume this fake data generation
import random
import pandas as pd
import dask.dataframe as dd
people = [[random.randint(1,3), random.randint(1,3), random.randint(1,3)] for i in range(1000)]
ddf = dd.from_pandas(pd.DataFrame.from_records(people, columns=["first name", "last name", "cars"]), npartitions=1)
Now my mission is to group people by first and last name (e.g. all the people with the same first name and last name), and then to get a new dask dataframe which will contain how many cars each group had.
Assume that the apply function can return either a series of lists of tuples e.g [(name,name,cars count),(name,name,cars count)] or a data frame with the same columns - name, name, car count.
Yes, I know that this particular use case can be solved in another way, but please trust me, my use case is more complex. I cannot share the data and cannot generate any similar data, so let's use dummy data :-)
The challenge is to collect all the results of the apply into a single dask dataframe (a pandas dataframe would be a problem here, since the data will not fit in memory, so a transition via a pandas dataframe is not an option).
This works for me if the output of apply is a pandas DataFrame, so convert to a dask DataFrame at the end if necessary:
def f(x):
    trip = ((1, 2, x) for x in range(3))
    df = pd.DataFrame.from_records(trip)
    return df

df1 = ddf.groupby('cars').apply(f, meta={'x': 'i8', 'y': 'i8', 'z': 'i8'}).compute()
# only to remove the MultiIndex
df1 = df1.reset_index()
print(df1)
   cars  level_1  x  y  z
0     1        0  1  2  0
1     1        1  1  2  1
2     1        2  1  2  2
3     2        0  1  2  0
4     2        1  1  2  1
5     2        2  1  2  2
6     3        0  1  2  0
7     3        1  1  2  1
8     3        2  1  2  2
ddf1 = dd.from_pandas(df1,npartitions=1)
print (ddf1)
cars level_1 x y z
npartitions=1
0 int64 int64 int64 int64 int64
8 ... ... ... ... ...
Dask Name: from_pandas, 1 tasks
EDIT:
import numpy as np
import dask.array as da

L = []

def f(x):
    trip = ((1, 2, x) for x in range(3))
    # append each group's triplets as a dask array
    L.append(da.from_array(np.array(list(trip)), chunks=(1, 3)))

ddf.groupby('cars').apply(f, meta={'x': 'i8', 'y': 'i8', 'z': 'i8'}).compute()
dar = da.concatenate(L, axis=0)
print(dar)
dask.array<concatenate, shape=(12, 3), dtype=int32, chunksize=(1, 3)>
For your edit:
In [8]: ddf.groupby(['first name', 'last name']).cars.count().compute()
Out[8]:
first name last name
1 1 107
2 107
3 110
2 1 117
2 120
3 99
3 1 119
2 103
3 118
Name: cars, dtype: int64

Importing Data and Columns from Another Python Pandas Data Frame

I have been trying to select a subset of a correlation matrix using the Pandas Python library.
For instance, if I had a matrix like
0 A B C
A 1 2 3
B 2 1 4
C 3 4 1
I might want to select a sub-matrix where some of the variables in the original matrix are correlated with some of the other variables, like:
0 A C
A 1 3
C 3 1
To do this, I tried using the following code to slice the original correlation matrix using the names of the desired variables in a list, transpose the correlation matrix, reassign the original column names, and then slice again.
data = pd.read_csv("correlationmatrix.csv")

initial_vertical_axis = pd.DataFrame()
for x in var_list:
    a = data[x]
    initial_vertical_axis = initial_vertical_axis.append(a)
print(initial_vertical_axis)

initial_vertical_axis = pd.DataFrame(data=initial_vertical_axis, columns=var_list)
initial_matrix = pd.DataFrame()
for x in var_list:
    a = initial_vertical_axis[x]
    initial_matrix = initial_matrix.append(a)
print(initial_matrix)
However, this returns an empty correlation matrix with the right row and column labels but no data like
0 A C
A
C
I cannot find the error in my code that would lead to this. If there is a simpler way to go about this, I am open to suggestions.
Suppose data contains your matrix,
In [122]: data
Out[122]:
A B C
0
A 1 2 3
B 2 1 4
C 3 4 1
In [123]: var_list = ['A','C']
In [124]: data.loc[var_list,var_list]
Out[124]:
A C
0
A 1 3
C 3 1
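A self-contained sketch of the same selection, with the matrix values taken from the question:
import pandas as pd

data = pd.DataFrame([[1, 2, 3], [2, 1, 4], [3, 4, 1]],
                    index=list('ABC'), columns=list('ABC'))
var_list = ['A', 'C']
print(data.loc[var_list, var_list])
#    A  C
# A  1  3
# C  3  1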
