Python: Removing Rows on Count condition - python

I have a problem filtering a pandas dataframe.
city
NYC
NYC
NYC
NYC
SYD
SYD
SEL
SEL
...
df.city.value_counts()
I would like to remove the rows for cities that appear fewer than 4 times, which would be SYD and SEL in this example.
What would be the way to do so without manually dropping them city by city?

Here you go with filter
df.groupby('city').filter(lambda x : len(x)>3)
Out[1743]:
city
0 NYC
1 NYC
2 NYC
3 NYC
A second solution uses transform:
sub_df = df[df.groupby('city').city.transform('count')>3].copy()
# .copy() avoids a SettingWithCopyWarning if you later modify sub_df
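As a rough check that the two approaches above select the same rows, here is a minimal sketch. The sample data is an assumption that mirrors the question's city column (4 NYC, 2 SYD, 2 SEL).

```python
import pandas as pd

# Assumed sample data mirroring the question: 4 x NYC, 2 x SYD, 2 x SEL
df = pd.DataFrame({"city": ["NYC"] * 4 + ["SYD"] * 2 + ["SEL"] * 2})

# filter: the lambda runs once per group and keeps or drops the whole group
by_filter = df.groupby("city").filter(lambda g: len(g) > 3)

# transform: broadcast each group's row count back onto its rows, then mask
group_size = df.groupby("city")["city"].transform("count")
by_transform = df[group_size > 3].copy()

print(by_filter.equals(by_transform))
```

On large frames with many groups, the transform version is usually the faster of the two because it avoids calling a Python lambda per group.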

This is one way using pd.Series.value_counts.
counts = df['city'].value_counts()
res = df[~df['city'].isin(counts[counts < 4].index)]
counts is a pd.Series object. counts < 4 returns a Boolean series. We filter the counts series by the Boolean counts < 4 series (that's what the square brackets achieve). We then take the index of the resulting series to find the cities with fewer than 4 rows. ~ is the negation operator.
Remember a series is a mapping between index and value. The index of a series does not necessarily contain unique values, but this is guaranteed with the output of value_counts.
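The steps described above can be followed one at a time in this small sketch. The data and the threshold of 4 are assumptions taken from the question (cities with fewer than 4 rows are dropped); the variable names are only illustrative.

```python
import pandas as pd

# Assumed sample data mirroring the question
df = pd.DataFrame({"city": ["NYC"] * 4 + ["SYD"] * 2 + ["SEL"] * 2})

counts = df["city"].value_counts()   # Series: index = city, value = row count
is_rare = counts < 4                 # Boolean Series with the same index
rare_cities = counts[is_rare].index  # Index of the cities to drop
res = df[~df["city"].isin(rare_cities)]

print(sorted(rare_cities))
print(res)
```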

I think you're looking for value_counts()
# Import the great and powerful pandas
import pandas as pd
# Create some example data
df = pd.DataFrame({
'city': ['NYC', 'NYC', 'SYD', 'NYC', 'SEL', 'NYC', 'NYC']
})
# Get the count of each value
value_counts = df['city'].value_counts()
# Select the values where the count is 3 or fewer (i.e. less than 4)
to_remove = value_counts[value_counts <= 3].index
# Keep rows where the city column is not in to_remove
df = df[~df.city.isin(to_remove)]

Another solution :
threshold=3
df['Count'] = df.groupby('City')['City'].transform('count')
df=df[df['Count']>=threshold]
df.drop(['Count'], axis = 1, inplace = True)
print(df)
City
0 NYC
1 NYC
2 NYC
3 NYC

Related

Optimize For Loop goes over a unique IDs to remove Sport after target Sport

I hope you can help me optimize a For loop that goes over unique IDs.
I have a data frame with an Index, ID, and Sport as follows:
I want to remove all sports after our target sport which is Volleyball (keep only sport before our target), the output will be like this:
The data is sorted out by index.
I tried to use a for loop, but on a large data set it takes a very long time to execute.
df = pd.read_csv('data.csv')
#Target sport
TargetArgs={'End_sport':'Volleyball'}
#empty dataframe to append all the dataframes from the loop
df_selected = pd.DataFrame()
# select unique IDs to use them in the loop
unique_id = df['ID'].unique()
for i in unique_id:
    # create df for each ID a dataFrame
    df_id = df[df['ID']==i]
    # select the last index where the target sport is last seen
    last_index = df_id['Sport'].where(df_id['Sport'] == TargetArgs.get('End_sport')).last_valid_index()
    # select the first index of an ID has first occurs
    first_index = df_id['ID'].where(df_id['ID']== i).first_valid_index()
    # create a dataframe which is the selection from first index and last index
    df_for_id = df_id.loc[first_index:last_index]
    # append dataframe for each id
    df_selected = df_selected.append(df_for_id)
Dataframes are not meant to be processed with for loops. It's like taking individual bristles of your toothbrush and brushing your teeth with them one by one. You can achieve what you want with a combination of optimized, pandas-internal methods like groupby, loc and idxmax.
import pandas as pd
import numpy as np
# set up test data
data = {'ID': [1, 1, 1, 2, 2, 2, 3],
'Sport': ['Volleyball',
'Foot',
'Basket',
'Tennis',
'Volleyball',
'Swimming',
'Volleyball']}
df = pd.DataFrame(data, index=np.arange(7) + 1)
   ID       Sport
1   1  Volleyball
2   1        Foot
3   1      Basket
4   2      Tennis
5   2  Volleyball
6   2    Swimming
7   3  Volleyball
For each ID group, get all rows up to the first occurrence of "Volleyball" in the Sport column and reset the index:
df2 = (df.groupby("ID")
.apply(lambda x:x.loc[:x.Sport.eq("Volleyball").idxmax()])
.reset_index(level=0, drop=True)
)
Result:
   ID       Sport
1   1  Volleyball
4   2      Tennis
5   2  Volleyball
7   3  Volleyball
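For comparison, here is a sketch that avoids apply altogether by counting, per ID, how many target rows come before each row. It is an alternative to the idxmax approach, not the answer's own method, and the names hits, prior_hits and kept are made up for illustration. One assumed difference: an ID that never contains the target sport keeps all of its rows here, whereas idxmax on an all-False mask falls back to the group's first row.

```python
import numpy as np
import pandas as pd

data = {'ID': [1, 1, 1, 2, 2, 2, 3],
        'Sport': ['Volleyball', 'Foot', 'Basket', 'Tennis',
                  'Volleyball', 'Swimming', 'Volleyball']}
df = pd.DataFrame(data, index=np.arange(7) + 1)

# 1 where the row is the target sport, 0 otherwise
hits = df['Sport'].eq('Volleyball').astype(int)

# number of target rows strictly before the current row, within the same ID
prior_hits = hits.groupby(df['ID']).cumsum() - hits

# keep every row up to and including the first target row of its ID
kept = df[prior_hits.eq(0)]
print(kept)
```

Because it relies only on a grouped cumulative sum, this version stays vectorized regardless of how many unique IDs there are.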

How to replace values in a column randomly using values from a list on Pandas?

I have the following dataframe:
data = {'Names':['Antonio','Bianca','Chad','Damien','Edward','Frances','George'],'Sport':['Basketball','Placeholder','Football','Placeholder','Tennis','Placeholder','Placeholder']}
df = pd.DataFrame(data, columns = ['Names','Sport'])
I want to replace the value 'Placeholder' randomly with any value from the following list:
extra_sports = ['Football','Basketball','Tennis','Rowing']
The final outcome should look something like this whereby the value 'Placeholder' is now gone and replaced randomly with values from the list:
data = {'Names':['Antonio','Bianca','Chad','Damien','Edward','Frances','George'],'Sport':['Basketball','Tennis','Football','Rowing','Tennis','Football','Tennis']}
df = pd.DataFrame(data, columns = ['Names','Sport'])
And if possible how would I implement random.seed so that I can reproduce the results.
I believe you need to replace only the Placeholder values with values from the list. To get an array of the correct length, use the sum of the Boolean mask (the number of True values) as the size of the generated array:
extra_sports = ['Football','Basketball','Tennis','Rowing']
np.random.seed(1)
m = df['Sport'].eq('Placeholder')
df.loc[m, 'Sport'] = np.random.choice(extra_sports, size=m.sum())
print (df)
Names Sport
0 Antonio Basketball
1 Bianca Basketball
2 Chad Football
3 Damien Rowing
4 Edward Tennis
5 Frances Football
6 George Football
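If you would rather not set NumPy's global seed, a local generator gives the same reproducibility. This is a minimal sketch: fill_placeholders is a hypothetical helper name, and np.random.default_rng draws a different sequence than np.random.seed, so the chosen sports will not match the output above.

```python
import numpy as np
import pandas as pd

data = {'Names': ['Antonio', 'Bianca', 'Chad', 'Damien', 'Edward', 'Frances', 'George'],
        'Sport': ['Basketball', 'Placeholder', 'Football', 'Placeholder',
                  'Tennis', 'Placeholder', 'Placeholder']}
df = pd.DataFrame(data, columns=['Names', 'Sport'])
extra_sports = ['Football', 'Basketball', 'Tennis', 'Rowing']

def fill_placeholders(frame, choices, seed):
    """Hypothetical helper: return a copy with 'Placeholder' replaced at random."""
    rng = np.random.default_rng(seed)        # local, seeded generator
    out = frame.copy()
    mask = out['Sport'].eq('Placeholder')    # True where a replacement is needed
    out.loc[mask, 'Sport'] = rng.choice(choices, size=int(mask.sum()))
    return out

first = fill_placeholders(df, extra_sports, seed=1)
second = fill_placeholders(df, extra_sports, seed=1)
print(first.equals(second))  # the same seed reproduces the same result
```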

What is the pythonic way to do a conditional count across pandas dataframe rows with apply?

I'm trying to do a conditional count across records in a pandas dataframe. I'm new at Python and have a working solution using a for loop, but running this on a large dataframe with ~200k rows takes a long time and I believe there is a better way to do this by defining a function and using apply, but I'm having trouble figuring it out.
Here's a simple example.
Create a pandas dataframe with two columns:
import pandas as pd
data = {'color': ['blue','green','yellow','blue','green','yellow','orange','purple','red','red'],
'weight': [4,5,6,4,1,3,9,8,4,1]
}
df = pd.DataFrame(data)
# for each row, count the number of other rows with the same color and a lesser weight
counts = []
for i in df.index:
    c = df.loc[i, 'color']
    w = df.loc[i, 'weight']
    ct = len(df.loc[(df['color']==c) & (df['weight']<w)])
    counts.append(ct)
df['counts, same color & less weight'] = counts
For each record, the 'counts, same color & less weight' column is intended to get a count of the other records in the df with the same color and a lesser weight. For example, the result for row 0 (blue, 4) is zero because no other records with color=='blue' have lesser weight. The result for row 1 (green, 5) is 1 because row 4 is also color=='green' but weight==1.
How do I define a function that can be applied to the dataframe to achieve the same?
I'm familiar with apply, for example to square the weight column I'd use:
df['weight squared'] = df['weight'].apply(lambda x: x**2)
... but I'm unclear how to use apply to do a conditional calculation that refers to the entire df.
Thanks in advance for any help.
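We can do this with groupby and transform('min'), comparing each weight with the minimum weight of its color. Note that this gives 1 whenever a row is heavier than its group's minimum, which equals the requested count here only because no color appears more than twice.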
df.weight.gt(df.groupby('color').weight.transform('min')).astype(int)
0 0
1 1
2 1
3 0
4 0
5 0
6 0
7 0
8 1
9 0
Name: weight, dtype: int64
#df['c...]=df.weight.gt(df.groupby('color').weight.transform('min')).astype(int)
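For groups with more than two rows, a within-group rank gives an actual count rather than a 0/1 flag. This sketch is not from the answer above; it assumes that rank(method='min') minus 1 equals the number of rows in the same color with a strictly smaller weight, which holds because the minimum rank of a value is one more than the number of smaller values. The column name n_lighter is made up for illustration.

```python
import pandas as pd

data = {'color': ['blue', 'green', 'yellow', 'blue', 'green',
                  'yellow', 'orange', 'purple', 'red', 'red'],
        'weight': [4, 5, 6, 4, 1, 3, 9, 8, 4, 1]}
df = pd.DataFrame(data)

# rank within each color; ties share the lowest rank, so rank - 1 counts
# the rows of the same color with a strictly lesser weight
df['n_lighter'] = (df.groupby('color')['weight'].rank(method='min') - 1).astype(int)
print(df)

# A group with three rows shows the difference from a 0/1 flag
extra = pd.DataFrame({'color': ['red'] * 3, 'weight': [1, 4, 7]})
extra_counts = (extra.groupby('color')['weight'].rank(method='min') - 1).astype(int)
print(extra_counts.tolist())
```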

Take nlargest 5 and sum/count the rest in pandas

My dataset looks like this:
ID | country
1 | USA
2 | USA
3 | Zimbabwe
4 | Germany
I do the following to take the name of the first country and its corresponding value. So in my case it would be:
df.groupby(['country']).country.value_counts().nlargest(5).index[0]
df.groupby(['country']).country.value_counts().nlargest(5)[0]
df.groupby(['country']).country.value_counts().nlargest(5).index[1]
df.groupby(['country']).country.value_counts().nlargest(5)[1]
etc.
and the output would be:
(USA), 388
(DEU), 245
etc.
And then I repeat it until I get the top 5 countries in my dataset.
However, how can I get an 'Other' or 'Rest' entry in which all other countries are lumped together? Countries like the ones below are not very common in my dataset:
Zimbabwe, Iraq, Malaysia, Kenya, Australia etc.
So I would like a sixth value with output that would look like this:
(Other), 3728
How can I achieve this in pandas?
Use:
N = 5
#get counts of column
s = df.country.value_counts()
#select top 5 values
out = s.iloc[:N]
#add the sum of the remaining values
out.loc['Other'] = s.iloc[N:].sum()
Finally, if you need a two-column DataFrame:
df = out.reset_index()
df.columns=['country','count']
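A self-contained variant of this idea, building the result with pd.concat instead of assigning into a slice of the counts. The sample countries and N = 2 are assumptions chosen to keep the data small; the question uses N = 5.

```python
import pandas as pd

# Assumed sample data: USA x3, Germany x2, two countries that appear once
df = pd.DataFrame({'country': ['USA'] * 3 + ['Germany'] * 2 + ['Zimbabwe', 'Kenya']})
N = 2

s = df['country'].value_counts()   # sorted by count, descending
top = s.iloc[:N]                   # the N most frequent countries
rest = s.iloc[N:].sum()            # everything else, lumped together

out = pd.concat([top, pd.Series({'Other': rest})])
print(out)
```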
Replace less frequent countries with 'Other' before using value_counts. One efficient way to achieve this is via Categorical Data. If you want to keep your original data, then you work with a copy, e.g. new_country_series = df['country'].copy().
# convert series to categorical
df['country'] = df['country'].astype('category')
# extract labels
others = df['country'].value_counts().index[5:]
label = 'Other'
# apply new category label
df['country'] = df['country'].cat.add_categories([label])
df['country'] = df['country'].replace(others, label)
Then extract countries together with their counts:
for country, count in df['country'].value_counts().items():
    print(country, count)
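The same lumping can be sketched without categoricals by masking with isin and where. This is an alternative under assumed sample data, with N = 2 instead of 5 to keep the example small; top_names and lumped are illustrative names.

```python
import pandas as pd

# Assumed sample data: USA x3, Germany x2, two countries that appear once
df = pd.DataFrame({'country': ['USA'] * 3 + ['Germany'] * 2 + ['Zimbabwe', 'Kenya']})
N = 2

# labels of the N most frequent countries
top_names = df['country'].value_counts().index[:N]

# keep the label where it is a top country, otherwise substitute 'Other'
lumped = df['country'].where(df['country'].isin(top_names), 'Other')

for country, count in lumped.value_counts().items():
    print(country, count)
```

This leaves the original column untouched, since where returns a new Series.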

Pandas: append row to DataFrame with multiindex in columns

I have a DataFrame with a multiindex in the columns and would like to use dictionaries to append new rows.
Let's say that each row in the DataFrame is a city. The columns contains "distance" and "vehicle". And each cell would be the percentage of the population that chooses this vehicle for this distance.
I'm constructing an index like this:
index_tuples=[]
for distance in ["near", "far"]:
    for vehicle in ["bike", "car"]:
        index_tuples.append([distance, vehicle])
index = pd.MultiIndex.from_tuples(index_tuples, names=["distance", "vehicle"])
Then I'm creating a dataframe:
dataframe = pd.DataFrame(index=["city"], columns = index)
The structure of the dataframe looks good. Although pandas has added Nans as default values ?
Now I would like to set up a dictionary for the new city and add it:
my_home_city = {"near":{"bike":1, "car":0},"far":{"bike":0, "car":1}}
dataframe["my_home_city"] = my_home_city
But this fails:
ValueError: Length of values does not match length of index
Here is the complete error message (pastebin)
UPDATE:
Thank you for all the good answers. I'm afraid I've oversimplified the problem in my example. Actually my index is nested with 3 levels (and it could become more).
So I've accepted the universal answer of converting my dictionary into a list of tuples. This might not be as clean as the other approaches but works for any multiindex setup.
A MultiIndex is essentially a list of tuples, so we just need to flatten your dict into tuple keys; then we can assign the values directly:
d = {(x,y):my_home_city[x][y] for x in my_home_city for y in my_home_city[x]}
df.loc['my_home_city',:]=d
df
Out[994]:
distance near far
vehicle bike car bike car
city NaN NaN NaN NaN
my_home_city 1 0 0 1
More Info
d
Out[995]:
{('far', 'bike'): 0,
('far', 'car'): 1,
('near', 'bike'): 1,
('near', 'car'): 0}
df.columns.values
Out[996]: array([('near', 'bike'), ('near', 'car'), ('far', 'bike'), ('far', 'car')], dtype=object)
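Since the question's update mentions an index with three or more levels, the dict comprehension above can be generalized with a small recursive helper. This is only a sketch: flatten is a hypothetical function, and the third level ("weekday"/"weekend") is invented for the example.

```python
import pandas as pd

def flatten(nested, prefix=()):
    """Hypothetical helper: turn a nested dict into {tuple_of_keys: value}."""
    flat = {}
    for key, value in nested.items():
        path = prefix + (key,)
        if isinstance(value, dict):
            flat.update(flatten(value, path))   # descend one level
        else:
            flat[path] = value                  # leaf: store under the full path
    return flat

# Invented three-level example
my_home_city = {
    "near": {"bike": {"weekday": 1, "weekend": 0}, "car": {"weekday": 0, "weekend": 1}},
    "far": {"bike": {"weekday": 0, "weekend": 0}, "car": {"weekday": 1, "weekend": 1}},
}

flat = flatten(my_home_city)

# A Series built from tuple keys gets a MultiIndex; transposing gives one row
row = pd.Series(flat, name="my_home_city").to_frame().T
print(row)
```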
You can append to your dataframe like this:
my_home_city = {"near":{"bike":1, "car":0},"far":{"bike":0, "car":1}}
dataframe.append(pd.DataFrame.from_dict(my_home_city).unstack().rename('my_home_city'))
Output:
distance near far
vehicle bike car bike car
city NaN NaN NaN NaN
my_home_city 1 0 0 1
The trick is to create the dataframe row with from_dict then unstack to get structure of your original dataframe with multiindex columns then rename to get index and append.
Or if you don't want to create the empty dataframe first you can use this method to create the dataframe with the new data.
pd.DataFrame.from_dict(my_home_city).unstack().rename('my_home_city').to_frame().T
Output:
far near
bike car bike car
my_home_city 0 1 1 0
Explained:
pd.DataFrame.from_dict(my_home_city)
far near
bike 0 1
car 1 0
Now, let's unstack to create multiindex and get to that new dataframe into the structure of the original dataframe.
pd.DataFrame.from_dict(my_home_city).unstack()
far bike 0
car 1
near bike 1
car 0
dtype: int64
We use rename to give that series a name which becomes the index label of that dataframe row when appended to the original dataframe.
far bike 0
car 1
near bike 1
car 0
Name: my_home_city, dtype: int64
Now if you converted that series to a frame and transposed it would look very much like a new row, however, there is no need to do this because, Pandas does intrinsic data alignment, so appending this series to the dataframe will auto-align and add the new dataframe record.
dataframe.append(pd.DataFrame.from_dict(my_home_city).unstack().rename('my_home_city'))
distance near far
vehicle bike car bike car
city NaN NaN NaN NaN
my_home_city 1 0 0 1
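DataFrame.append was removed in pandas 2.0, so on current versions the append calls above need to be written with pd.concat. The following is a minimal sketch of that, not the answer's own code: city_row is a hypothetical helper, and other_city is invented so that two rows can be concatenated.

```python
import pandas as pd

index = pd.MultiIndex.from_product(
    [["near", "far"], ["bike", "car"]], names=["distance", "vehicle"]
)

def city_row(name, nested):
    """Hypothetical helper: one-row frame for a city from a nested dict."""
    # read the values in the same order as the column MultiIndex
    values = [nested[distance][vehicle] for distance, vehicle in index]
    return pd.DataFrame([values], index=[name], columns=index)

my_home_city = {"near": {"bike": 1, "car": 0}, "far": {"bike": 0, "car": 1}}
other_city = {"near": {"bike": 0, "car": 1}, "far": {"bike": 0, "car": 1}}  # invented

# build the frame from rows instead of appending to an empty placeholder
dataframe = pd.concat([
    city_row("my_home_city", my_home_city),
    city_row("other_city", other_city),
])
print(dataframe)
```

Collecting the rows in a list and concatenating once is also cheaper than growing a frame row by row.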
I don't think you even need to initialise an empty dataframe. With your d, I can get your desired output with unstack and a transpose:
pd.DataFrame(d).unstack().to_frame().T
far near
bike car bike car
0 0 1 1 0
Initialize your empty dataframe using MultiIndex.from_product.
distances = ['near', 'far']
vehicles = ['bike', 'car']
df = pd.DataFrame([], columns=pd.MultiIndex.from_product([distances, vehicles]),
index=pd.Index([], name='city'))
Your dictionary results in a square matrix (distance by vehicle), so unstack it (which will result in a Series), then convert it into a dataframe row by calling (to_frame) using the relevant city name and transposing the column into a row.
>>> df.append(pd.DataFrame(my_home_city).unstack().to_frame('my_home_city').T)
far near
bike car bike car
city
my_home_city 0 1 1 0
Just to add to all of the answers, this is just another(maybe not too different) simple example, represented in a more reproducible way :
import itertools as it
from IPython.display import display # this is just for displaying output purpose
import numpy as np
import pandas as pd
col_1, col_2 = ['A', 'B'], ['C', 'D']
arr_size = len(col_2)
col = pd.MultiIndex.from_product([col_1, col_2])
tmp_df = pd.DataFrame(columns=col)
display(tmp_df)
for s in range(3):  # no of rows to add to tmp_df
    tmp_dict = {x : [np.random.random_sample(1)[0] for i in range(arr_size)] for x in range(arr_size)}
    tmp_ser = pd.Series(it.chain.from_iterable([tmp_dict[x] for x in tmp_dict]), index=col)
    # display(tmp_dict, tmp_ser)
    tmp_df = tmp_df.append(tmp_ser[tmp_df.columns], ignore_index=True)
display(tmp_df)
Some things to note about above:
The number of items to add should always match len(col_1)*len(col_2), that is the product of element lengths your multi-index is made from.
list(it.chain.from_iterable([[2, 3], [4, 5]])) simply does this [2,3,4,5]
Try this workaround: append the values to a dict, then convert the dict to a pandas DataFrame, and as the very last step select the desired columns to create the multi-index with set_index().
d = dict()
for g in predictor_types:
    for col in predictor_types[g]:
        tot = len(ames) - ames[col].count()
        if tot:
            d.setdefault('type',[]).append(g)
            d.setdefault('predictor',[]).append(col)
            d.setdefault('missing',[]).append(tot)
pd.DataFrame(d).set_index(['type','predictor']).style.bar(color='DodgerBlue')
