I want to know if the values in two different rows of a Dataframe are the same.
My df looks something like this:
df['Name1']:
Alex,
Peter,
Herbert,
Seppi,
Huaba
df['Name2']:
Alexander,
peter,
herbert,
Sepp,
huaba
First I want to apply .rstrip() and .lower(), but these methods only work on strings, not on a whole Series. I tried str(df['Name1']), which ran without error, but the comparison gave me the wrong result.
I also tried the following:
df["Name1"].isin(df["Name2"]).value_counts()
df["Name1"].eq(df["Name2"]).value_counts()
Problem 1: I think .isin also returns True if a substring is found, e.g. alex.isin(alexander) would return True then, which is not what I'm looking for.
Problem 2: I think .eq would do it for me, but I still have the problem with the .rstrip() and .lower() methods.
What is the best way to count the number of matching entries?
print(df)
Name1 Name2
0 Alex Alexander
1 Peter peter
2 Herbert herbert
3 Seppi Sepp
4 Huaba huaba
If you need to compare row by row:
out1 = df["Name1"].str.lower().eq(df["Name2"].str.lower()).sum()
If you need to compare all values of Name1 against all values of Name2:
out2 = df["Name1"].str.lower().isin(df["Name2"].str.lower()).sum()
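A minimal runnable sketch of both comparisons on the sample names, normalising case and trailing whitespace first (note that isin matches whole values, not substrings, so 'alex' does not match 'alexander'):

```python
import pandas as pd

df = pd.DataFrame({
    "Name1": ["Alex", "Peter", "Herbert", "Seppi", "Huaba"],
    "Name2": ["Alexander", "peter", "herbert", "Sepp", "huaba"],
})

# Normalise both columns before comparing
left = df["Name1"].str.strip().str.lower()
right = df["Name2"].str.strip().str.lower()

row_matches = left.eq(right).sum()    # positions where both columns agree
any_matches = left.isin(right).sum()  # Name1 values found anywhere in Name2

print(row_matches)  # 3  (peter, herbert, huaba)
print(any_matches)  # 3  ('alex' is not in Name2, 'alexander' is no match)
```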
Use set intersection to find the values common to the two DataFrame columns:
common_values = list(set(df.Name1) & set(df.Name2))
count = len(common_values)
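Note that the set intersection is case-sensitive and counts distinct shared values rather than matching rows; a small sketch with the sample data:

```python
import pandas as pd

df = pd.DataFrame({
    "Name1": ["Alex", "Peter", "Herbert", "Seppi", "Huaba"],
    "Name2": ["Alexander", "peter", "herbert", "Sepp", "huaba"],
})

# Case-sensitive: the raw columns share no values at all
print(len(set(df.Name1) & set(df.Name2)))  # 0

# Normalise case first to count shared names regardless of capitalisation
common = set(df.Name1.str.lower()) & set(df.Name2.str.lower())
print(sorted(common))  # ['herbert', 'huaba', 'peter']
print(len(common))     # 3
```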
I am still relatively new to Python and am trying to do something more complicated. How can I use a for loop or iteration to count the names ranked 1 and add them up, while also placing the names into one list and their counts into a separate list? I need this because I will create a plot; the plotting itself I can do, but I am stuck on how to separate the total counts from the names already counted.
Using this as example DataFrame:
RANK NAME
0 1 EMILY
1 1 DANIEL
2 1 EMILY
3 1 ISABELLA
You can do this to get the counted names:
counted_names = name_file[name_file.RANK == 1]['NAME'].value_counts()
print(counted_names)
EMILY 2
DANIEL 1
ISABELLA 1
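Since the goal is a plot, the value_counts result can be split into the two lists the question asks for; a minimal sketch using the example frame above:

```python
import pandas as pd

name_file = pd.DataFrame({
    "RANK": [1, 1, 1, 1],
    "NAME": ["EMILY", "DANIEL", "EMILY", "ISABELLA"],
})

counted_names = name_file[name_file.RANK == 1]["NAME"].value_counts()

# Split the result into two parallel lists, e.g. for plotting
names = counted_names.index.tolist()
counts = counted_names.tolist()

print(names[0], counts[0])  # EMILY 2 (value_counts sorts by count, descending)
```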
pandas .groupby()
To solve almost any aggregation problem in pandas, the groupby method is the tool to reach for.
For your case, if you want to sum over 'Count' for all the unique names and later plot it, use DataFrame.groupby.
Make sure you convert your data into a DataFrame first and then apply groupby:
name_file = pd.DataFrame(name_file)
name_file.groupby('Name').agg({'Count':'sum'})
This gives you the aggregated sum of counts for each unique name in your DataFrame.
To get the count of occurrences of each name, use .size().reset_index() as below:
pd.DataFrame(name_file).groupby('Name').size().reset_index()
This returns the frequency of occurrence of each unique name in name_file.
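A runnable sketch of both calls, on a hypothetical frame with a Name and a pre-existing Count column (the column names here are assumptions, matching the answer's example):

```python
import pandas as pd

# Hypothetical data: repeated names, each with its own Count value
name_file = pd.DataFrame({
    "Name":  ["EMILY", "DANIEL", "EMILY", "ISABELLA"],
    "Count": [2, 1, 3, 1],
})

# Aggregated sum of Count per unique name
sums = name_file.groupby("Name").agg({"Count": "sum"})
print(sums.loc["EMILY", "Count"])  # 5

# Frequency of occurrence of each unique name
freq = name_file.groupby("Name").size().reset_index(name="n")
print(freq.set_index("Name")["n"]["EMILY"])  # 2
```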
Hope this helps! Cheers!
How can I replace the values 'Beer', 'Alcohol', 'Beverage', and 'Drink' with only 'Drink'?
df.replace(['Beer','Alcohol','Beverage','Drink'],'Drink')
doesn't work
You almost had it. You need to pass a dictionary to df.replace.
df
Col1
0 Beer
1 Alcohol
2 Beverage
3 Drink
df.replace(dict.fromkeys(['Beer','Alcohol','Beverage','Drink'], 'Drink'))
Col1
0 Drink
1 Drink
2 Drink
3 Drink
This works for exact matches and replacements. For partial matches and substring matching, use
df.replace(
dict.fromkeys(['Beer','Alcohol','Beverage','Drink'], 'Drink'),
regex=True
)
This is not an in-place operation so don't forget to assign the result back.
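A runnable sketch contrasting the two modes, with an extra made-up value ('Ginger Beer') to show the difference between whole-cell and substring matching:

```python
import pandas as pd

df = pd.DataFrame({"Col1": ["Beer", "Ginger Beer", "Water"]})
mapping = dict.fromkeys(["Beer", "Alcohol", "Beverage", "Drink"], "Drink")

# Whole-cell matches only: 'Ginger Beer' is untouched
exact = df.replace(mapping)
print(exact["Col1"].tolist())    # ['Drink', 'Ginger Beer', 'Water']

# regex=True also replaces matching substrings inside cells
partial = df.replace(mapping, regex=True)
print(partial["Col1"].tolist())  # ['Drink', 'Ginger Drink', 'Water']
```

Note that both calls return a new frame, so the result is assigned to a fresh name rather than relying on in-place mutation.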
Try the following approach:
lst = ['Beer','Alcohol','Beverage','Drink']
pat = r"\b(?:{})\b".format('|'.join(lst))
df = df.replace(pat, 'Drink', regex=True)
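A runnable sketch of the word-boundary pattern (note the keyword is regex=True), with made-up values 'Craft Beer' and 'Beerish' to show that \b replaces whole words only:

```python
import pandas as pd

df = pd.DataFrame({"Col1": ["Beer", "Craft Beer", "Beerish", "Water"]})

lst = ["Beer", "Alcohol", "Beverage", "Drink"]
pat = r"\b(?:{})\b".format("|".join(lst))

# Word boundaries replace whole words only: 'Beerish' is left alone
out = df.replace(pat, "Drink", regex=True)
print(out["Col1"].tolist())  # ['Drink', 'Craft Drink', 'Beerish', 'Water']
```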
Here's a slightly different take on MaxU's solution :)
df.replace({'|'.join(['Beer','Alcohol','Beverage','Drink']):'Drink'},regex=True)
It seems that your initial method works in the latest versions of pandas.
df.replace(['Beer','Alcohol','Beverage','Drink'],'Drink', inplace=True)
Should work
A slight change from the earlier answers:
The following restricts the replacement to a specific column (or columns):
df[['Col1']] = df[['Col1']].replace(dict.fromkeys(['Beer','Alcohol','Beverage','Drink'], 'Drink'))
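A sketch with a hypothetical second column (Col2, made up for illustration) showing that only Col1 is touched:

```python
import pandas as pd

df = pd.DataFrame({
    "Col1": ["Beer", "Alcohol"],
    "Col2": ["Beer", "Water"],  # should stay untouched
})

mapping = dict.fromkeys(["Beer", "Alcohol", "Beverage", "Drink"], "Drink")
df[["Col1"]] = df[["Col1"]].replace(mapping)

print(df["Col1"].tolist())  # ['Drink', 'Drink']
print(df["Col2"].tolist())  # ['Beer', 'Water']
```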
I have this DataFrame:
    name   data
0   alex   asd
1   helen  sdd
2   alex   dss
3   helen  sdsd
4   john   sdadd
So I am trying to get the most frequent value or values (in this case there are two). What I do is:
dataframe['name'].value_counts().idxmax()
but it returns only alex, even though helen appears twice as well.
By using mode
df.name.mode()
Out[712]:
0 alex
1 helen
dtype: object
To get the n most frequent values, just subset .value_counts() and grab the index:
# get top 10 most frequent names
n = 10
dataframe['name'].value_counts()[:n].index.tolist()
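A runnable sketch with the question's data and n=2 (with a tie at the top, the order among equally frequent names can vary between pandas versions):

```python
import pandas as pd

dataframe = pd.DataFrame(
    {"name": ["alex", "helen", "alex", "helen", "john"]}
)

n = 2
top = dataframe["name"].value_counts()[:n].index.tolist()
print(top)  # the two most frequent names: alex and helen (tie order may vary)
```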
You could try argmax like this:
dataframe['name'].value_counts().argmax()
Out[13]: 'alex'
value_counts returns a pandas.core.series.Series of counts, and argmax fetches the key of the maximum value. Note that in current pandas versions Series.argmax returns the integer position of the maximum; use idxmax to get the label instead.
df['name'].value_counts()[:5].sort_values(ascending=False)
value_counts returns a pandas.core.series.Series already sorted by count in descending order, so the highest values come first (the extra sort_values(ascending=False) is technically redundant).
You can use this to get the count of every value in a particular column:
df['name'].value_counts()
Use:
df['name'].mode()
or
df['name'].value_counts().idxmax()
This will give the five most common names:
df['name'].value_counts().nlargest(5)
Here's one way:
df['name'].value_counts()[df['name'].value_counts() == df['name'].value_counts().max()]
which prints:
helen 2
alex 2
Name: name, dtype: int64
Not Obvious, But Fast
import numpy as np
import pandas as pd

# Encode names as integer codes, count codes, keep the most frequent labels
f, u = pd.factorize(df.name.values)
counts = np.bincount(f)
u[counts == counts.max()]
array(['alex', 'helen'], dtype=object)
Simply use:
dataframe['name'].value_counts().nlargest(n)
The frequency helpers are:
nlargest() for the n most frequent values
nsmallest() for the n least frequent values
To get the top 5:
dataframe['name'].value_counts()[0:5]
You could use .apply with pd.value_counts to count the occurrences of every value in the name column. Note that it must be applied to a one-column DataFrame rather than the Series, so that pd.value_counts receives the whole column instead of each individual string:
dataframe[['name']].apply(pd.value_counts)
To get the top five most common names:
dataframe['name'].value_counts().head()
My best solution to get the most frequent value is:
df['my_column'].value_counts().idxmax()
(value_counts already sorts in descending order, and idxmax returns the label of the maximum count; in recent pandas versions argmax would return its integer position instead.)
I had a similar issue; the most compact way to get the top n (5 by default) most frequent values is:
df["column_name"].value_counts().head(n)
Identify the top 5, for example, using value_counts:
top5 = df['column'].value_counts()
Then list the first five entries of top5:
top5[:5]
Here n sets how many of the most frequent items to get:
n = 2
a = dataframe['name'].value_counts()[:n].index.tolist()
dataframe['name'].value_counts()[a]
To get the top 5 most common last names in pandas:
df['name'].apply(lambda name: name.split()[-1]).value_counts()[:5]
I have a question which I have a feeling might have been asked before, but in a different form; point me to the original if that's the case, please.
Anyway, I am playing with pandas' extractall() method, and I don't quite like the fact that it returns a DataFrame with a MultiIndex (original index -> 'match' index), with all found elements listed under match 0, match 1, match 2, ...
I would prefer the output to be a singly indexed DataFrame, with multiple regex search results (where applicable) returned as a list in a single cell. Is that possible at the moment?
Here's a visualization of what I have in mind:
Current output:
X
index match
0 0 thank
1 0 thank
1 thanks
2 thanking
2 0 thanked
Desired output
X
index
0 thank
1 [thank, thanks, thanking]
2 thanked
I'll be grateful for any suggestions.
Let's try:
df.groupby(level=0)['X'].apply(list)
Output:
0 [thank]
1 [thank, thanks, thanking]
2 [thanked]
Name: X, dtype: object
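End to end, with made-up input strings chosen to reproduce the matches shown above:

```python
import pandas as pd

s = pd.Series(["thank you", "thank thanks thanking", "thanked"])

# extractall returns one row per match, under a MultiIndex (index, match);
# the named group becomes the column name X
matches = s.str.extractall(r"(?P<X>thank\w*)")

# Collapse the match level: one list of hits per original row
out = matches.groupby(level=0)["X"].apply(list)
print(out.tolist())
# [['thank'], ['thank', 'thanks', 'thanking'], ['thanked']]
```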