I have a data frame with two columns :
state total_sales
AL 16714
AR 6498
AZ 107296
CA 33717
Now I want to map the strings in state column to int from 1 to N(where N is the no of rows,here 4 ) based on increasing order of values in total_sales . Result should be stored in another column (say label). That is, wanted a result like this :
state total_sales label
AL 16714 3
AR 6498 4
AZ 107296 1
CA 33717 2
Please suggest a vectorised implementation .
You can use rank with cast to int:
df['label'] = df['total_sales'].rank(method='dense', ascending=False).astype(int)
print (df)
state total_sales label
0 AL 16714 3
1 AR 6498 4
2 AZ 107296 1
3 CA 33717 2
One option for converting a column of values to integers is pandas.Categorical.
This actually groups identical values, which in a case like this, where all values are unique each "group" has only one value. The resulting object has a codes attribute, which is a Numpy array of integers indicating which group each input value is in.
Applied to this problem, if you have
In [12]: data = pd.DataFrame({
'state': ['AL', 'AR', 'AZ', 'CA'],
'total_sales': [16714, 6498, 107296, 33717]
})
you can add the label column as described using
In [13]: data['label'] = len(data) - pd.Categorical(data.total_sales, ordered=True).codes
In [14]: print(data)
state total_sales label
0 AL 16714 3
1 AR 6498 4
2 AZ 107296 1
3 CA 33717 2
For this example it is not as fast as jezrael's answer but it has a wide range of applications and it was faster for a larger series I was encoding to integers. It should be noted that if there are two identical values in the total_sales column it will assign them the same label.
Related
I am new to learning machine learning on datasets in python and am trying to perform the following on the below dataframe (only shown a snippet)
id
country
device
label
100
sg
samsung
0
100
ch
galaxy s
0
200
ab
pocophone
1
200
ee
iphone 1
1
200
my
iphone 2
1
i am trying to
get a list of all the countries where the labels have been = 1
for each id, out of all the countries , count the countries that are in the list in 1), and get the total count of the countries present for each id.
Update:
I have managed to get a list of countries where label = 1. For each id, how to find the number of countries that they have which falls into the list mentioned before?
You can use
df.loc[ df['label'] == 1 ] ['country']
This will find which indices have df['label'] as 1, locate them, and take the 'country' Series from them.
Try via loc accessor and boolean masking:
count=df.loc[df['label'].eq(1),['id','country']].value_counts()
#count the values of country where 'label' is 1
lst=count.index.get_level_values(1).unique().tolist()
#get the index of count for country names
output of lst:
['ab', 'ee', 'my']
output of count:
id country
200 ab 1
ee 1
my 1
dtype: int64
If I understand correctly:
unique countries with label = 1
>>> df.query('label == 1')['country'].unique()
array(['ab', 'ee', 'my'], dtype=object)
count of unique countries per id when label = 1
>>> df.query('label == 1').groupby('id')['country'].nunique()
id
200 3
Name: country, dtype: int64
Updated version:
countries = df.query('label == 1')['country'].unique()
df.query('country in #countries').groupby('id')['country'].nunique()
I have the following dataframe:
data = {'Names':['Antonio','Bianca','Chad','Damien','Edward','Frances','George'],'Sport':['Basketball','Placeholder','Football','Placeholder','Tennis','Placeholder','Placeholder']}
df = pd.DataFrame(data, columns = ['Names','Sport'])
I want to replace the value 'Placeholder' randomly with any value from the following list:
extra_sports = ['Football','Basketball','Tennis','Rowing']
The final outcome should look something like this whereby the value 'Placeholder' is now gone and replaced randomly with values from the list:
data = {'Names':['Antonio','Bianca','Chad','Damien','Edward','Frances','George'],'Sport':['Basketball','Tennis','Football','Rowing','Tennis','Football','Tennis']}
df = pd.DataFrame(data, columns = ['Names','Sport'])
And if possible how would I implement random.seed so that I can reproduce the results.
I believe you need replace only values Placeholder with list, for length of list use sum of boolean Trues for correct length of benerated array:
extra_sports = ['Football','Basketball','Tennis','Rowing']
np.random.seed(1)
m = df['Sport'].eq('Placeholder')
df.loc[m, 'Sport'] = np.random.choice(extra_sports, size=m.sum())
print (df)
Names Sport
0 Antonio Basketball
1 Bianca Basketball
2 Chad Football
3 Damien Rowing
4 Edward Tennis
5 Frances Football
6 George Football
My dataset looks like this:
ID | country
1 | USA
2 | USA
3 | Zimbabwe
4 | Germany
I do the following to take the name of the first country and its corresponding value. So in my case it would be:
df.groupby(['country']).country.value_counts().nlargest(5).index[0]
df.groupby(['country']).country.value_counts().nlargest(5)[0]
df.groupby(['country']).country.value_counts().nlargest(5).index[1]
df.groupby(['country']).country.value_counts().nlargest(5)[1]
etc.
and the output would be:
(USA), 388
(DEU), 245
etc.
And then I repeat it until I get the top 5 countries in my dataset.
However, how can I get a 'Other' or 'Rest' column whereby all other countries are lumped together. So countries like below are not so common in my dataset:
Zimbabwe, Irak, Malaysia, Kenya, Australia etc.
So I would like a sixth value with output that would look like this:
(Other), 3728
How can I achieve this in pandas?
Use:
N = 5
#get counts of column
s = df.country.value_counts()
#select top 5 values
out = s.iloc[:N]
#add sum of another values
out.loc['Other'] = s.iloc[N:].sum()
Last if need 2 column DataFrame:
df = out.reset_index()
df.columns=['country','count']
Replace less frequent countries with 'Other' before using value_counts. One efficient way to achieve this is via Categorical Data. If you want to keep your original data, then you work with a copy, e.g. new_country_series = df['country'].copy().
# convert series to categorical
df['country'] = df['country'].astype('category')
# extract labels
others = df['country'].value_counts().index[5:]
label = 'Other'
# apply new category label
df['country'] = df['country'].cat.add_categories([label])
df['country'] = df['country'].replace(others, label)
Then extract countries together with their counts:
for country, count in df['country'].value_counts():
print(country, count)
I have a problem filtering a pandas dataframe.
city
NYC
NYC
NYC
NYC
SYD
SYD
SEL
SEL
...
df.city.value_counts()
I would like to remove rows of cities that has less than 4 count frequency, which would be SYD and SEL for instance.
What would be the way to do so without manually dropping them city by city?
Here you go with filter
df.groupby('city').filter(lambda x : len(x)>3)
Out[1743]:
city
0 NYC
1 NYC
2 NYC
3 NYC
Solution two transform
sub_df = df[df.groupby('city').city.transform('count')>3].copy()
# add copy for future warning when you need to modify the sub df
This is one way using pd.Series.value_counts.
counts = df['city'].value_counts()
res = df[~df['city'].isin(counts[counts < 5].index)]
counts is a pd.Series object. counts < 5 returns a Boolean series. We filter the counts series by the Boolean counts < 5 series (that's what the square brackets achieve). We then take the index of the resultant series to find the cities with < 5 counts. ~ is the negation operator.
Remember a series is a mapping between index and value. The index of a series does not necessarily contain unique values, but this is guaranteed with the output of value_counts.
I think you're looking for value_counts()
# Import the great and powerful pandas
import pandas as pd
# Create some example data
df = pd.DataFrame({
'city': ['NYC', 'NYC', 'SYD', 'NYC', 'SEL', 'NYC', 'NYC']
})
# Get the count of each value
value_counts = df['city'].value_counts()
# Select the values where the count is less than 3 (or 5 if you like)
to_remove = value_counts[value_counts <= 3].index
# Keep rows where the city column is not in to_remove
df = df[~df.city.isin(to_remove)]
Another solution :
threshold=3
df['Count'] = df.groupby('City')['City'].transform(pd.Series.value_counts)
df=df[df['Count']>=threshold]
df.drop(['Count'], axis = 1, inplace = True)
print(df)
City
0 NYC
1 NYC
2 NYC
3 NYC
I have a DataFrame with a multiindex in the columns and would like to use dictionaries to append new rows.
Let's say that each row in the DataFrame is a city. The columns contains "distance" and "vehicle". And each cell would be the percentage of the population that chooses this vehicle for this distance.
I'm constructing an index like this:
index_tuples=[]
for distance in ["near", "far"]:
for vehicle in ["bike", "car"]:
index_tuples.append([distance, vehicle])
index = pd.MultiIndex.from_tuples(index_tuples, names=["distance", "vehicle"])
Then I'm creating a dataframe:
dataframe = pd.DataFrame(index=["city"], columns = index)
The structure of the dataframe looks good. Although pandas has added Nans as default values ?
Now I would like to set up a dictionary for the new city and add it:
my_home_city = {"near":{"bike":1, "car":0},"far":{"bike":0, "car":1}}
dataframe["my_home_city"] = my_home_city
But this fails:
ValueError: Length of values does not match length of index
Here is the complete error message (pastebin)
UPDATE:
Thank you for all the good answers. I'm afraid I've oversimplified the problem in my example. Actually my index is nested with 3 levels (and it could become more).
So I've accepted the universal answer of converting my dictionary into a list of tuples. This might not be as clean as the other approaches but works for any multiindex setup.
Multi index is a list of tuple , we just need to modify your dict ,then we could directly assign the value
d = {(x,y):my_home_city[x][y] for x in my_home_city for y in my_home_city[x]}
df.loc['my_home_city',:]=d
df
Out[994]:
distance near far
vehicle bike car bike car
city NaN NaN NaN NaN
my_home_city 1 0 0 1
More Info
d
Out[995]:
{('far', 'bike'): 0,
('far', 'car'): 1,
('near', 'bike'): 1,
('near', 'car'): 0}
df.columns.values
Out[996]: array([('near', 'bike'), ('near', 'car'), ('far', 'bike'), ('far', 'car')], dtype=object)
You can append to you dataframe like this:
my_home_city = {"near":{"bike":1, "car":0},"far":{"bike":0, "car":1}}
dataframe.append(pd.DataFrame.from_dict(my_home_city).unstack().rename('my_home_city'))
Output:
distance near far
vehicle bike car bike car
city NaN NaN NaN NaN
my_home_city 1 0 0 1
The trick is to create the dataframe row with from_dict then unstack to get structure of your original dataframe with multiindex columns then rename to get index and append.
Or if you don't want to create the empty dataframe first you can use this method to create the dataframe with the new data.
pd.DataFrame.from_dict(my_home_city).unstack().rename('my_home_city').to_frame().T
Output:
far near
bike car bike car
my_home_city 0 1 1 0
Explained:
pd.DataFrame.from_dict(my_home_city)
far near
bike 0 1
car 1 0
Now, let's unstack to create multiindex and get to that new dataframe into the structure of the original dataframe.
pd.DataFrame.from_dict(my_home_city).unstack()
far bike 0
car 1
near bike 1
car 0
dtype: int64
We use rename to give that series a name which becomes the index label of that dataframe row when appended to the original dataframe.
far bike 0
car 1
near bike 1
car 0
Name: my_home_city, dtype: int64
Now if you converted that series to a frame and transposed it would look very much like a new row, however, there is no need to do this because, Pandas does intrinsic data alignment, so appending this series to the dataframe will auto-align and add the new dataframe record.
dataframe.append(pd.DataFrame.from_dict(my_home_city).unstack().rename('my_home_city'))
distance near far
vehicle bike car bike car
city NaN NaN NaN NaN
my_home_city 1 0 0 1
I don't think you even need to initialise an empty dataframe. With your d, I can get your desired output with unstack and a transpose:
pd.DataFrame(d).unstack().to_frame().T
far near
bike car bike car
0 0 1 1 0
Initialize your empty dataframe using MultiIndex.from_product.
distances = ['near', 'far']
vehicles = ['bike', 'car']
df = pd.DataFrame([], columns=pd.MultiIndex.from_product([distances, vehicles]),
index=pd.Index([], name='city'))
Your dictionary results in a square matrix (distance by vehicle), so unstack it (which will result in a Series), then convert it into a dataframe row by calling (to_frame) using the relevant city name and transposing the column into a row.
>>> df.append(pd.DataFrame(my_home_city).unstack().to_frame('my_home_city').T)
far near
bike car bike car
city
my_home_city 0 1 1 0
Just to add to all of the answers, this is just another(maybe not too different) simple example, represented in a more reproducible way :
import itertools as it
from IPython.display import display # this is just for displaying output purpose
import numpy as np
import pandas as pd
col_1, col_2 = ['A', 'B'], ['C', 'D']
arr_size = len(col_2)
col = pd.MultiIndex.from_product([col_1, col_2])
tmp_df = pd.DataFrame(columns=col)
display(tmp_df)
for s in range(3):# no of rows to add to tmp_df
tmp_dict = {x : [np.random.random_sample(1)[0] for i in range(arr_size)] for x in range(arr_size)}
tmp_ser = pd.Series(it.chain.from_iterable([tmp_dict[x] for x in tmp_dict]), index=col)
# display(tmp_dict, tmp_ser)
tmp_df = tmp_df.append(tmp_ser[tmp_df.columns], ignore_index=True)
display(tmp_df)
Some things to note about above:
The number of items to add should always match len(col_1)*len(col_2), that is the product of element lengths your multi-index is made from.
list(it.chain.from_iterable([[2, 3], [4, 5]])) simply does this [2,3,4,5]
try this workaround
append to dict
then convert to pandas data frame
at the very last step select desired columns to create multi-index with set_index()
d = dict()
for g in predictor_types:
for col in predictor_types[g]:
tot = len(ames) - ames[col].count()
if tot:
d.setdefault('type',[]).append(g)
d.setdefault('predictor',[]).append(col)
d.setdefault('missing',[]).append(tot)
pd.DataFrame(d).set_index(['type','predictor']).style.bar(color='DodgerBlue')