My dataset looks like this:
ID | country
1 | USA
2 | USA
3 | Zimbabwe
4 | Germany
I do the following to get the name of each top country and its corresponding count. In my case that would be:
df.groupby(['country']).country.value_counts().nlargest(5).index[0]
df.groupby(['country']).country.value_counts().nlargest(5)[0]
df.groupby(['country']).country.value_counts().nlargest(5).index[1]
df.groupby(['country']).country.value_counts().nlargest(5)[1]
etc.
and the output would be:
(USA), 388
(DEU), 245
etc.
And then I repeat it until I get the top 5 countries in my dataset.
However, how can I get an 'Other' or 'Rest' row in which all remaining countries are lumped together? Countries like the ones below are not very common in my dataset:
Zimbabwe, Irak, Malaysia, Kenya, Australia etc.
So I would like a sixth value with output that would look like this:
(Other), 3728
How can I achieve this in pandas?
Use:
N = 5
# get the counts of the column
s = df.country.value_counts()
# select the top N values
out = s.iloc[:N]
# add the sum of all remaining values
out.loc['Other'] = s.iloc[N:].sum()
Finally, if you need a two-column DataFrame:
df = out.reset_index()
df.columns = ['country', 'count']
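A quick end-to-end check of the approach above on a tiny invented sample (N lowered to 2 so the small sample actually has an 'Other' bucket; the counts are illustrative, not the asker's real data):
import pandas as pd

df = pd.DataFrame({'country': ['USA', 'USA', 'USA', 'DEU', 'DEU',
                               'Zimbabwe', 'Irak', 'Kenya']})

N = 2
s = df.country.value_counts()
out = s.iloc[:N].copy()            # copy so the enlargement below doesn't warn
out.loc['Other'] = s.iloc[N:].sum()
print(out)
# USA      3
# DEU      2
# Other    3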
Replace the less frequent countries with 'Other' before calling value_counts. One efficient way to achieve this is via categorical data. If you want to keep your original data, work with a copy, e.g. new_country_series = df['country'].copy().
# convert series to categorical
df['country'] = df['country'].astype('category')
# extract the labels of all countries outside the top 5
others = df['country'].value_counts().index[5:]
label = 'Other'
# apply new category label
df['country'] = df['country'].cat.add_categories([label])
df['country'] = df['country'].replace(others, label)
Then extract countries together with their counts:
for country, count in df['country'].value_counts().items():
    print(country, count)
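A minimal end-to-end sketch of this categorical approach, with invented counts (keeping the top 2 here so the tiny sample produces an 'Other' bucket):
import pandas as pd

df = pd.DataFrame({'country': ['USA'] * 4 + ['DEU'] * 3 + ['Zimbabwe', 'Kenya']})
df['country'] = df['country'].astype('category')

others = df['country'].value_counts().index[2:]  # everything beyond the top 2
df['country'] = df['country'].cat.add_categories(['Other'])
df['country'] = df['country'].replace(others, 'Other')
df['country'] = df['country'].cat.remove_unused_categories()  # drop now-empty categories

for country, count in df['country'].value_counts().items():
    print(country, count)
# USA 4
# DEU 3
# Other 2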
I am new to machine learning on datasets in Python and am trying to perform the following on the dataframe below (only a snippet is shown).
id   country  device     label
100  sg       samsung    0
100  ch       galaxy s   0
200  ab       pocophone  1
200  ee       iphone 1   1
200  my       iphone 2   1
I am trying to:
1. get a list of all the countries where the label is 1;
2. for each id, count how many of its countries appear in the list from 1), i.e. get the total count of such countries per id.
Update:
I have managed to get the list of countries where label = 1. For each id, how do I find the number of its countries that fall into that list?
You can use
df.loc[df['label'] == 1, 'country']
This finds the rows where df['label'] is 1, locates them, and takes the 'country' Series from them.
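For part 1) of the question, deduplicating that Series gives the list itself; a short sketch (column names as in the question):
# unique countries whose rows have label == 1
countries_with_label_1 = df.loc[df['label'] == 1, 'country'].unique().tolist()
# ['ab', 'ee', 'my']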
Try via loc accessor and boolean masking:
# count the values of country where 'label' is 1
count = df.loc[df['label'].eq(1), ['id', 'country']].value_counts()
# get the country names from the index of count
lst = count.index.get_level_values(1).unique().tolist()
output of lst:
['ab', 'ee', 'my']
output of count:
id country
200 ab 1
ee 1
my 1
dtype: int64
If I understand correctly:
unique countries with label = 1
>>> df.query('label == 1')['country'].unique()
array(['ab', 'ee', 'my'], dtype=object)
count of unique countries per id when label = 1
>>> df.query('label == 1').groupby('id')['country'].nunique()
id
200 3
Name: country, dtype: int64
Updated version:
countries = df.query('label == 1')['country'].unique()
df.query('country in @countries').groupby('id')['country'].nunique()
(the @ prefix lets query reference the Python variable countries)
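A self-contained check of the two snippets, rebuilding the sample frame from the question:
import pandas as pd

df = pd.DataFrame({
    'id':      [100, 100, 200, 200, 200],
    'country': ['sg', 'ch', 'ab', 'ee', 'my'],
    'device':  ['samsung', 'galaxy s', 'pocophone', 'iphone 1', 'iphone 2'],
    'label':   [0, 0, 1, 1, 1],
})

countries = df.query('label == 1')['country'].unique()
print(df.query('country in @countries').groupby('id')['country'].nunique())
# id
# 200    3
# Name: country, dtype: int64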
I have the following dataframe:
data = {'Names':['Antonio','Bianca','Chad','Damien','Edward','Frances','George'],'Sport':['Basketball','Placeholder','Football','Placeholder','Tennis','Placeholder','Placeholder']}
df = pd.DataFrame(data, columns = ['Names','Sport'])
I want to replace the value 'Placeholder' randomly with any value from the following list:
extra_sports = ['Football','Basketball','Tennis','Rowing']
The final outcome should look something like this, where the value 'Placeholder' is gone, replaced randomly with values from the list:
data = {'Names':['Antonio','Bianca','Chad','Damien','Edward','Frances','George'],'Sport':['Basketball','Tennis','Football','Rowing','Tennis','Football','Tennis']}
df = pd.DataFrame(data, columns = ['Names','Sport'])
And if possible, how would I implement random.seed so that I can reproduce the results?
I believe you need to replace only the 'Placeholder' values with values from the list; for the size of the generated array, use the sum of the boolean mask, i.e. the number of True values:
import numpy as np

extra_sports = ['Football','Basketball','Tennis','Rowing']
np.random.seed(1)
m = df['Sport'].eq('Placeholder')
df.loc[m, 'Sport'] = np.random.choice(extra_sports, size=m.sum())
print(df)
Names Sport
0 Antonio Basketball
1 Bianca Basketball
2 Chad Football
3 Damien Rowing
4 Edward Tennis
5 Frances Football
6 George Football
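On newer NumPy, the Generator API is the recommended way to seed reproducibly; a sketch of the same replacement using it (same df and extra_sports names as above; the values drawn will differ from the legacy np.random.seed output):
import numpy as np

rng = np.random.default_rng(1)  # seeded Generator, reproducible across runs
m = df['Sport'].eq('Placeholder')
df.loc[m, 'Sport'] = rng.choice(extra_sports, size=m.sum())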
I am new to Python and will appreciate any help on this!
Suppose I have a bunch of columns in a dataset with categorical values, say gender, marital status, etc.
While doing input validation of the dataset, I need to check whether the values of the columns are within an acceptable range.
For instance, if the column is gender, acceptable values are male and female. If the column is marital status, acceptable values are single, married, and divorced.
If, for instance, the user inputs a dataset with values for these variables outside the acceptable range, I need to write a function to point it out.
How do I do this?
Suppose I create a static acceptable-value mapping like below, for all datasets:
dataset variable acceptable_values
demographics gender male,female
demographics marital status single,married,divorced
purchase region south,east,west,north
Ideally the code should go through all variables in all datasets listed in the above mapping file and check whether the values are in the "acceptable_values" list.
Suppose the below are new datasets; the code should throw output saying:
unacceptable values found for dataset: demographics, for variable: gender - Boy, Other, missing, (blank)
unacceptable values found for dataset: demographics, for variable: maritalstatus - separated
demographics:
id gender maritalstatus
1 male single
2 male single
3 Boy single
4 Other married
5 missing divorced
6 (blank) separated
Let me know how this can be achieved; it looks fairly complicated to me.
It would be great if the code could also convert the "new"/"unacceptable" values to NaN or 0 or something like that, but this is just nice to have.
You could do something like the following, where we assume that you're storing your data frames in a dictionary called df_dict, and the collection of accepted values in a data frame called df_accepted:
import numpy as np

# First, use the dataset and variable name as indices in df_accepted
# to make it easier to perform lookups
df_accepted.set_index(['dataset', 'variable'], inplace=True)

# Loop over all data frames
for name, df in df_dict.items():
    # Loop over all columns in the current data frame
    for c in df:
        # Find the indices for the given column for which the values
        # do /not/ belong to the list of accepted values for this column.
        try:
            mask = ~df[c].isin(df_accepted.loc[name, c].acceptable_values.split(','))
            # Print the values that did not belong to the list
            print(f'Bad values for {c} in {name}: {", ".join(df[c][mask])}')
            # Convert them into NaNs (.loc avoids chained-assignment issues)
            df.loc[mask, c] = np.nan
        except KeyError:
            print(f'Skipping validation of {c} in {name}')
With your given input:
In [200]: df_accepted
Out[200]:
dataset variable acceptable_values
0 demographics gender male,female
1 demographics maritalstatus single,married,divorced
2 purchase region south,east,west,north
In [201]: df_dict['demographics']
Out[201]:
gender maritalstatus
id
1 male single
2 male single
3 Boy single
4 Other married
5 missing divorced
6 (blank) separated
In [202]: df_dict['purchase']
Out[202]:
region count
0 south 60
1 west 90210
2 north-east 10
In [203]: df_accepted.set_index(['dataset', 'variable'], inplace=True)
     ...:
     ...: for name, df in df_dict.items():
     ...:     for c in df:
     ...:         try:
     ...:             mask = ~df[c].isin(df_accepted.loc[name, c].acceptable_values.split(','))
     ...:             print(f'Bad values for {c} in {name}: {", ".join(df[c][mask])}')
     ...:             df.loc[mask, c] = np.nan
     ...:         except KeyError:
     ...:             print(f'Skipping validation of {c} in {name}')
     ...:
Bad values for gender in demographics: Boy, Other, missing, (blank)
Bad values for maritalstatus in demographics: separated
Bad values for region in purchase: north-east
Skipping validation of count in purchase
In [204]: df_accepted
Out[204]:
acceptable_values
dataset variable
demographics gender male,female
maritalstatus single,married,divorced
purchase region south,east,west,north
In [205]: df_dict['demographics']
Out[205]:
gender maritalstatus
id
1 male single
2 male single
3 NaN single
4 NaN married
5 NaN divorced
6 NaN NaN
In [206]: df_dict['purchase']
Out[206]:
region count
0 south 60
1 west 90210
2 NaN 10
There might be a simpler way to do this, but this solution works:
import pandas as pd
import numpy as np
df = pd.DataFrame(columns=['region', 'number'], data=[['north',0],['south',-4],['hello',15]])
valid_values = {'region': {'north','south','west','east'}}
df = df.apply(lambda column:
              column.apply(lambda x: x if x in valid_values[column.name] else np.nan)
              if column.name in valid_values else column)
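Run on the example frame above, this prints:
print(df)
#   region  number
# 0  north       0
# 1  south      -4
# 2    NaN      15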
I have a problem filtering a pandas dataframe.
city
NYC
NYC
NYC
NYC
SYD
SYD
SEL
SEL
...
df.city.value_counts()
I would like to remove the rows of cities that have a frequency count of less than 4, which would be SYD and SEL, for instance.
What would be the way to do so without manually dropping them city by city?
Here you go, with filter:
df.groupby('city').filter(lambda x : len(x)>3)
Out[1743]:
city
0 NYC
1 NYC
2 NYC
3 NYC
Solution two, with transform:
sub_df = df[df.groupby('city').city.transform('count') > 3].copy()
# add .copy() to avoid a SettingWithCopyWarning when you later modify the sub df
This is one way using pd.Series.value_counts.
counts = df['city'].value_counts()
res = df[~df['city'].isin(counts[counts < 4].index)]
counts is a pd.Series object. counts < 4 returns a Boolean series. We filter the counts series by the Boolean counts < 4 series (that's what the square brackets achieve). We then take the index of the resultant series to find the cities with fewer than 4 occurrences. ~ is the negation operator.
Remember a series is a mapping between index and value. The index of a series does not necessarily contain unique values, but this is guaranteed with the output of value_counts.
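Worked on the sample rows shown (assuming the elided "..." rows add nothing new):
counts = df['city'].value_counts()
# NYC    4
# SYD    2
# SEL    2

counts[counts < 4].index                       # -> the cities to drop: ['SYD', 'SEL']
df[~df['city'].isin(counts[counts < 4].index)]  # only the four NYC rows remain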
I think you're looking for value_counts()
# Import the great and powerful pandas
import pandas as pd
# Create some example data
df = pd.DataFrame({
'city': ['NYC', 'NYC', 'SYD', 'NYC', 'SEL', 'NYC', 'NYC']
})
# Get the count of each value
value_counts = df['city'].value_counts()
# Select the values where the count is 3 or fewer (adjust the threshold as you like)
to_remove = value_counts[value_counts <= 3].index
# Keep rows where the city column is not in to_remove
df = df[~df.city.isin(to_remove)]
Another solution:
threshold = 3
df['Count'] = df.groupby('City')['City'].transform('count')
df=df[df['Count']>=threshold]
df.drop(['Count'], axis = 1, inplace = True)
print(df)
City
0 NYC
1 NYC
2 NYC
3 NYC
I have a DataFrame with a multiindex in the columns and would like to use dictionaries to append new rows.
Let's say that each row in the DataFrame is a city. The columns contains "distance" and "vehicle". And each cell would be the percentage of the population that chooses this vehicle for this distance.
I'm constructing an index like this:
index_tuples=[]
for distance in ["near", "far"]:
for vehicle in ["bike", "car"]:
index_tuples.append([distance, vehicle])
index = pd.MultiIndex.from_tuples(index_tuples, names=["distance", "vehicle"])
Then I'm creating a dataframe:
dataframe = pd.DataFrame(index=["city"], columns = index)
The structure of the dataframe looks good, although pandas has added NaNs as default values?
Now I would like to set up a dictionary for the new city and add it:
my_home_city = {"near":{"bike":1, "car":0},"far":{"bike":0, "car":1}}
dataframe["my_home_city"] = my_home_city
But this fails:
ValueError: Length of values does not match length of index
Here is the complete error message (pastebin)
UPDATE:
Thank you for all the good answers. I'm afraid I've oversimplified the problem in my example. Actually my index is nested with 3 levels (and it could become more).
So I've accepted the universal answer of converting my dictionary into a list of tuples. This might not be as clean as the other approaches but works for any multiindex setup.
A MultiIndex is a list of tuples; we just need to modify your dict, then we can directly assign the values:
d = {(x, y): my_home_city[x][y] for x in my_home_city for y in my_home_city[x]}
df.loc['my_home_city', :] = d
df
Out[994]:
distance near far
vehicle bike car bike car
city NaN NaN NaN NaN
my_home_city 1 0 0 1
More Info
d
Out[995]:
{('far', 'bike'): 0,
('far', 'car'): 1,
('near', 'bike'): 1,
('near', 'car'): 0}
df.columns.values
Out[996]: array([('near', 'bike'), ('near', 'car'), ('far', 'bike'), ('far', 'car')], dtype=object)
You can append to your dataframe like this:
my_home_city = {"near":{"bike":1, "car":0},"far":{"bike":0, "car":1}}
dataframe.append(pd.DataFrame.from_dict(my_home_city).unstack().rename('my_home_city'))
Output:
distance near far
vehicle bike car bike car
city NaN NaN NaN NaN
my_home_city 1 0 0 1
The trick is to create the dataframe row with from_dict, then unstack to get the structure of your original dataframe with MultiIndex columns, then rename to set the index label, and append.
Or, if you don't want to create the empty dataframe first, you can use this method to create the dataframe with the new data directly.
pd.DataFrame.from_dict(my_home_city).unstack().rename('my_home_city').to_frame().T
Output:
far near
bike car bike car
my_home_city 0 1 1 0
Explained:
pd.DataFrame.from_dict(my_home_city)
far near
bike 0 1
car 1 0
Now, let's unstack to create the MultiIndex and get that new dataframe into the structure of the original dataframe.
pd.DataFrame.from_dict(my_home_city).unstack()
far bike 0
car 1
near bike 1
car 0
dtype: int64
We use rename to give that series a name which becomes the index label of that dataframe row when appended to the original dataframe.
far bike 0
car 1
near bike 1
car 0
Name: my_home_city, dtype: int64
Now, if you converted that series to a frame and transposed it, it would look very much like a new row. However, there is no need to do this: pandas does intrinsic data alignment, so appending this series to the dataframe will auto-align and add the new record.
dataframe.append(pd.DataFrame.from_dict(my_home_city).unstack().rename('my_home_city'))
distance near far
vehicle bike car bike car
city NaN NaN NaN NaN
my_home_city 1 0 0 1
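Note that DataFrame.append was deprecated in pandas 1.4 and removed in 2.0; on current pandas, a sketch of the same row-add with pd.concat (same names as above):
row = pd.DataFrame.from_dict(my_home_city).unstack().rename('my_home_city')
dataframe = pd.concat([dataframe, row.to_frame().T])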
I don't think you even need to initialise an empty dataframe. With your d, I can get your desired output with unstack and a transpose:
pd.DataFrame(d).unstack().to_frame().T
far near
bike car bike car
0 0 1 1 0
Initialize your empty dataframe using MultiIndex.from_product.
distances = ['near', 'far']
vehicles = ['bike', 'car']
df = pd.DataFrame([], columns=pd.MultiIndex.from_product([distances, vehicles]),
                  index=pd.Index([], name='city'))
Your dictionary results in a square matrix (distance by vehicle), so unstack it (which results in a Series), then convert it into a dataframe row by calling to_frame with the relevant city name and transposing the column into a row.
>>> df.append(pd.DataFrame(my_home_city).unstack().to_frame('my_home_city').T)
far near
bike car bike car
city
my_home_city 0 1 1 0
Just to add to all of the answers, this is just another (maybe not too different) simple example, presented in a more reproducible way:
import itertools as it
from IPython.display import display # this is just for displaying output purpose
import numpy as np
import pandas as pd
col_1, col_2 = ['A', 'B'], ['C', 'D']
arr_size = len(col_2)
col = pd.MultiIndex.from_product([col_1, col_2])
tmp_df = pd.DataFrame(columns=col)
display(tmp_df)
for s in range(3):  # number of rows to add to tmp_df
    tmp_dict = {x: [np.random.random_sample(1)[0] for i in range(arr_size)] for x in range(arr_size)}
    tmp_ser = pd.Series(it.chain.from_iterable([tmp_dict[x] for x in tmp_dict]), index=col)
    # display(tmp_dict, tmp_ser)
    tmp_df = tmp_df.append(tmp_ser[tmp_df.columns], ignore_index=True)
display(tmp_df)
Some things to note about above:
The number of items to add should always match len(col_1)*len(col_2), that is, the product of the lengths of the lists your MultiIndex is made from.
list(it.chain.from_iterable([[2, 3], [4, 5]])) simply flattens to [2, 3, 4, 5].
Try this workaround:
append to a dict,
then convert it to a pandas DataFrame,
and at the very last step select the desired columns to create the MultiIndex with set_index().
d = dict()
for g in predictor_types:
    for col in predictor_types[g]:
        tot = len(ames) - ames[col].count()
        if tot:
            d.setdefault('type', []).append(g)
            d.setdefault('predictor', []).append(col)
            d.setdefault('missing', []).append(tot)

pd.DataFrame(d).set_index(['type', 'predictor']).style.bar(color='DodgerBlue')
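The snippet above relies on the answerer's own objects (predictor_types, ames). A self-contained sketch of the same pattern with hypothetical stand-ins:
import numpy as np
import pandas as pd

# Hypothetical stand-ins for the answerer's `ames` frame and
# `predictor_types` grouping -- purely illustrative
ames = pd.DataFrame({'LotFrontage': [65.0, np.nan, 80.0],
                     'Neighborhood': ['NAmes', None, None]})
predictor_types = {'numeric': ['LotFrontage'], 'categorical': ['Neighborhood']}

d = dict()
for g in predictor_types:
    for col in predictor_types[g]:
        tot = len(ames) - ames[col].count()  # number of missing values
        if tot:
            d.setdefault('type', []).append(g)
            d.setdefault('predictor', []).append(col)
            d.setdefault('missing', []).append(tot)

print(pd.DataFrame(d).set_index(['type', 'predictor']))
#                           missing
# type        predictor
# numeric     LotFrontage         1
# categorical Neighborhood        2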