I have this dataframe:

   name   data
1  alex   asd
2  helen  sdd
3  alex   dss
4  helen  sdsd
5  john   sdadd

I am trying to get the most frequent value, or values (in this case, it is multiple values). So what I do is:
dataframe['name'].value_counts().idxmax()
but it returns only the value 'alex', even though 'helen' appears twice as well.
Use mode:
df.name.mode()
Out[712]:
0 alex
1 helen
dtype: object
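If you need the result as a plain list rather than a Series, tolist() works on the mode output:
df.name.mode().tolist()
# ['alex', 'helen']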
To get the n most frequent values, just subset .value_counts() and grab the index:
# get top 10 most frequent names
n = 10
dataframe['name'].value_counts()[:n].index.tolist()
You could try argmax like this:
dataframe['name'].value_counts().argmax()
Out[13]: 'alex'
value_counts returns a pandas.core.series.Series of counts, and argmax can be used to retrieve the key of the maximum value.
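A caveat: in recent pandas versions, Series.argmax returns the integer position of the maximum rather than its label, so idxmax is the safer choice when you want the name itself. A small sketch of the difference:
vc = dataframe['name'].value_counts()
vc.idxmax()   # 'alex' (the index label)
vc.argmax()   # 0 (the integer position, in recent pandas)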
df['name'].value_counts()[:5].sort_values(ascending=False)
value_counts returns a pandas.core.series.Series of counts, and sort_values(ascending=False) puts the highest counts first. (Note that value_counts already sorts in descending order, so the extra sort is redundant here.)
You can use this to get a full count; it tabulates the frequency of each value in a particular column:
df['name'].value_counts()
Use:
df['name'].mode()
or
df['name'].value_counts().idxmax()
This will give the top five most common names:
df['name'].value_counts().nlargest(5)
Here's one way:
df['name'].value_counts()[df['name'].value_counts() == df['name'].value_counts().max()]
which prints:
helen 2
alex 2
Name: name, dtype: int64
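A minor refinement of the same idea: compute value_counts once and reuse it, instead of calling it three times:
vc = df['name'].value_counts()
vc[vc == vc.max()]   # helen 2, alex 2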
Not Obvious, But Fast
import numpy as np
import pandas as pd

f, u = pd.factorize(df.name.values)  # integer codes and unique names
counts = np.bincount(f)              # frequency of each unique name
u[counts == counts.max()]            # all names tied for the top count
array(['alex', 'helen'], dtype=object)
Simply use this:
dataframe['name'].value_counts().nlargest(n)
The functions for the largest and smallest frequencies are:
nlargest() for the n most frequent values
nsmallest() for the n least frequent values
To get the top 5:
dataframe['name'].value_counts()[0:5]
You could use .apply with pd.value_counts to count the occurrences of all the names in the name column:
dataframe[['name']].apply(pd.value_counts)
To get the top five most common names:
dataframe['name'].value_counts().head()
My best solution for getting the top value is:
df['my_column'].value_counts().sort_values(ascending=False).idxmax()
I had a similar issue; the most compact answer to get, say, the top n most frequent values (head defaults to 5) is:
df["column_name"].value_counts().head(n)
Identifying the top 5, for example, using value_counts:
top5 = df['column'].value_counts()
Listing the contents of top5:
top5[:5]
Here n is the number of top most frequent items to get:
n = 2
a = dataframe['name'].value_counts()[:n].index.tolist()
dataframe["name"].value_counts()[a]
Getting the top 5 most common last names in pandas:
df['name'].apply(lambda name: name.split()[-1]).value_counts()[:5]
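An equivalent sketch using the vectorized .str accessor instead of apply, assuming every name has at least one whitespace-separated token:
df['name'].str.split().str[-1].value_counts()[:5]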
I have a DataFrame with two columns From and To, and I need to know the most frequent combination of locations From and To.
Example:
From        To
------------------
Home        Office
Home        Office
Home        Office
Airport     Home
Restaurant  Office
If the order does matter:
df['FROM_TO'] = df['From'] + df['To']
df['COUNT'] = 1
df.groupby(['FROM_TO'])['COUNT'].sum()
gives you all the occurrences in one go. Simply take the max to find the largest occurrence.
If the order does not matter, sort the values first:
df.loc[:, :] = np.sort(df.values, axis=1)  # if the df only consists of the From and To columns
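Putting the order-insensitive case together in one runnable sketch (assuming the From and To columns from the question):
import numpy as np
import pandas as pd

df = pd.DataFrame({'From': ['Home', 'Home', 'Home', 'Airport', 'Restaurant'],
                   'To':   ['Office', 'Office', 'Office', 'Home', 'Office']})

# Sort each row so ('Home', 'Office') and ('Office', 'Home') collapse to the same pair
pairs = pd.DataFrame(np.sort(df[['From', 'To']].values, axis=1), columns=['A', 'B'])
print(pairs.groupby(['A', 'B']).size().idxmax())   # ('Home', 'Office')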
You can group by the two columns together and count the number of occurrences of each pair, then sort the pairs by this count.
The following code does the job:
df.groupby(["From", "To"]).size().sort_values(ascending=False)
and, for the example of the question, it returns:
From To
-----------------------
Home Office 3
Restaurant Office 1
Airport Home 1
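If you only need the single most frequent pair, nlargest(1) on the same result is a compact alternative:
df.groupby(['From', 'To']).size().nlargest(1)
# From  To
# Home  Office    3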
IIUC, use SeriesGroupBy.value_counts and Series.idxmax:
df.groupby('From')['To'].value_counts().idxmax()
Output
('Home', 'Office')
In general, groupby.value_counts can be faster than groupby.size followed by a separate sort, since value_counts already returns the counts in descending order.
Another way:
df.apply(tuple, axis=1).value_counts().idxmax()
Or
df.apply(tuple, axis=1).mode()
Output
0 (Home, Office)
dtype: object
I am still relatively new to Python, and I am trying to do something more complicated. How can I use a for loop or iteration to count the names ranked 1, adding up repeats of the same name, while placing the unique names in one list and their counts in a separate list? The reason is that I will create a plot; that part I can do, but I am stuck on how to separate the total counts of each name from the names already counted.
Using this as example DataFrame:
RANK NAME
0 1 EMILY
1 1 DANIEL
2 1 EMILY
3 1 ISABELLA
You can do this to get the counted names:
counted_names = name_file[name_file.RANK == 1]['NAME'].value_counts()
print(counted_names)
EMILY 2
DANIEL 1
ISABELLA 1
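To feed a plot, the names and counts can then be pulled out of that result as two separate lists; a minimal sketch (the order among tied counts may vary):
names = counted_names.index.tolist()    # e.g. ['EMILY', 'DANIEL', 'ISABELLA']
counts = counted_names.values.tolist()  # e.g. [2, 1, 1]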
pandas.groupby()
To solve almost any aggregation task in pandas, all you need is to crack the groupby function.
In your case, if you want to sum over 'Count' for all the unique names and later plot it, use groupby.
Make sure you convert it into a DataFrame first and then apply the groupby magic:
name_file = pd.DataFrame(name_file)
name_file.groupby('Name').agg({'Count':'sum'})
This gives you the aggregated sum of counts for each unique name in your dataframe.
To get the count of each name, use size() with reset_index(), as below:
pd.DataFrame(name_file).groupby('Name').size().reset_index()
This returns the frequency of occurrence of each unique name in name_file.
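As a small refinement, the unnamed count column produced by reset_index can be labeled directly (assuming a reasonably recent pandas):
pd.DataFrame(name_file).groupby('Name').size().reset_index(name='Count')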
Hope this helps! Cheers!
I want to know if the values in two different columns of a DataFrame are the same.
My df looks something like this:
df['Name1']:
Alex,
Peter,
Herbert,
Seppi,
Huaba
df['Name2']:
Alexander,
peter,
herbert,
Sepp,
huaba
First I want to apply .rstrip() and .lower(), but these methods seem to work only on strings. I tried str(df['Name1']), which worked, but the comparison gave me the wrong result.
I also tried the following:
df["Name1"].isin(df["Name2"]).value_counts())
df["Name1"].eq(df["Name2"]).value_counts())
Problem 1: I think .isin also returns True if a substring is found, e.g. 'alex'.isin('alexander') would return True, which is not what I'm looking for.
Problem 2: I think .eq would do it for me, but I still have the problem with the .rstrip() and .lower() methods.
What is the best way to count the number of identical entries?
print(df)
Name1 Name2
0 Alex Alexander
1 Peter peter
2 Herbert herbert
3 Seppi Sepp
4 Huaba huaba
If you need to compare row by row:
out1 = df["Name1"].str.lower().eq(df["Name2"].str.lower()).sum()
If you need to compare all values of Name1 against all values of Name2:
out2 = df["Name1"].str.lower().isin(df["Name2"].str.lower()).sum()
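For the five sample rows above, both approaches happen to give 3: peter, herbert and huaba match once lowercased, while alex/alexander and seppi/sepp do not.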
Use set to find the common values between the two dataframe columns:
common_values = list(set(df.Name1) & set(df.Name2) )
count = len(common_values)
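Note that the set intersection as written is case-sensitive, so it finds no matches in the sample data above; lowercasing first makes it behave like the isin approach. A sketch:
common_values = set(df.Name1.str.lower()) & set(df.Name2.str.lower())
count = len(common_values)   # 3 for the sample data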
How can I replace the values 'Beer', 'Alcohol', 'Beverage', 'Drink' with only 'Drink'?
df.replace(['Beer','Alcohol','Beverage','Drink'],'Drink')
doesn't work
You almost had it. You need to pass a dictionary to df.replace.
df
Col1
0 Beer
1 Alcohol
2 Beverage
3 Drink
df.replace(dict.fromkeys(['Beer','Alcohol','Beverage','Drink'], 'Drink'))
Col1
0 Drink
1 Drink
2 Drink
3 Drink
This works for exact matches and replacements. For partial matches and substring matching, use
df.replace(
dict.fromkeys(['Beer','Alcohol','Beverage','Drink'], 'Drink'),
regex=True
)
This is not an in-place operation so don't forget to assign the result back.
Try the following approach:
lst = ['Beer','Alcohol','Beverage','Drink']
pat = r"\b(?:{})\b".format('|'.join(lst))
df = df.replace(pat, 'Drink', regex=True)
A slightly different take on MaxU's solution :)
df.replace({'|'.join(['Beer','Alcohol','Beverage','Drink']): 'Drink'}, regex=True)
It seems that your initial method works in recent versions of pandas:
df.replace(['Beer','Alcohol','Beverage','Drink'], 'Drink', inplace=True)
should work.
A slight change from the earlier answers:
The following code replaces values in a specific column (or columns):
df[['Col1']] = df[['Col1']].replace(dict.fromkeys(['Beer','Alcohol','Beverage','Drink'], 'Drink'))
I have a dataframe with many columns of data and different types. I have encountered one column that has strings and integers mixed within it. I am trying to find the values with the longest/shortest length (note: not the largest value). (NOTE: the example I am using below only has integers in it because I couldn't work out how to mix dtypes and still call this an int64 column.)
Name MixedField
a david 32252
b andrew 4023
c calvin 25
d david 2
e calvin 522
f david 35
The method I am using is to convert the df column to a String Series (because they might be double/int/string/combinations), and then I can get the max/min length items from this series:
df['MixedField'].apply(str).map(len).max()
df['MixedField'].apply(str).map(len).min()
But I can't work out how to select the actual values that have the maximum and minimum length (i.e. 32252 (longest) and 2 (shortest)).
(I possibly don't need to explain this, but there is a subtle difference between largest and longest: e.g. "aa" is longer than "z".) I appreciate your help. Thanks.
I think this should work if your df has a unique index:
field_length = df.MixedField.astype(str).map(len)
print(df.loc[field_length.idxmax(), 'MixedField'])
print(df.loc[field_length.idxmin(), 'MixedField'])
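If ties are possible, a boolean mask over the computed lengths returns every value at the extreme rather than just one; a minimal sketch:
lengths = df['MixedField'].astype(str).map(len)
longest = df.loc[lengths == lengths.max(), 'MixedField']    # 32252
shortest = df.loc[lengths == lengths.min(), 'MixedField']   # 2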