How can I replace the data 'Beer', 'Alcohol', 'Beverage', 'Drink' with only 'Drink'?
df.replace(['Beer','Alcohol','Beverage','Drink'],'Drink')
doesn't work
You almost had it. You need to pass a dictionary to df.replace.
df
Col1
0 Beer
1 Alcohol
2 Beverage
3 Drink
df.replace(dict.fromkeys(['Beer','Alcohol','Beverage','Drink'], 'Drink'))
Col1
0 Drink
1 Drink
2 Drink
3 Drink
This works for exact, whole-value matches. For partial and substring matching, use
df.replace(
dict.fromkeys(['Beer','Alcohol','Beverage','Drink'], 'Drink'),
regex=True
)
This is not an in-place operation so don't forget to assign the result back.
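For example, to persist the result, assign it back:
df = df.replace(dict.fromkeys(['Beer','Alcohol','Beverage','Drink'], 'Drink'))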
Try the following approach:
lst = ['Beer','Alcohol','Beverage','Drink']
pat = r"\b(?:{})\b".format('|'.join(lst))
df = df.replace(pat, 'Drink', regex=True)
Looks a bit different from MaxU's solution :)
df.replace({'|'.join(['Beer','Alcohol','Beverage','Drink']):'Drink'},regex=True)
It seems that your initial method works in the latest versions of pandas.
df.replace(['Beer','Alcohol','Beverage','Drink'], 'Drink', inplace=True)
should work.
A slight change from the earlier answers: the following replaces values only in a specific column (or columns).
df[['Col1']] = df[['Col1']].replace(dict.fromkeys(['Beer','Alcohol','Beverage','Drink'], 'Drink'))
I have two dataframes (DF1 & DF2) with multiple columns of text information. I need to match and update one column in DF2 using the information in DF1.
DF1:
Code Name
A A: Andrew
B B: Bill
C C: Chuck
DF2:
Number Codes
1 A
2 B;C
3 A;C
My required output is to transform DF2 as follows:
DF2:
Number Codes
1 A: Andrew
2 B: Bill;C: Chuck
3 A: Andrew;C: Chuck
So far I have tried to use:
df2['Codes'] = df2['Codes'].replace(to_replace="A", value="A: Andrew", regex=True)
But this is not practical for larger datasets.
Do I use the same df.replace function and do some looping to find every code and replace? Or are there better ways to do it?
One option I'm trying to learn about is using sub() with regex, but I'm new to regex and still learning the basics.
You can split the Name column, build a dict with zip, and then use replace:
di=dict(zip(df1.Name.str.split(":").str[0],df1.Name))
df2["Codes"]=df2["Codes"].replace(di, regex=True)
I want to know if the values in two different columns of a DataFrame are the same.
My df looks something like this:
df['Name1']:
Alex,
Peter,
Herbert,
Seppi,
Huaba
df['Name2']:
Alexander,
peter,
herbert,
Sepp,
huaba
First I want to apply .rstrip() and .lower(), but these methods seem to only work on strings. I tried str(df['Name1']), which worked, but the comparison gave me the wrong result.
I also tried the following:
df["Name1"].isin(df["Name2"]).value_counts())
df["Name1"].eq(df["Name2"]).value_counts())
Problem 1: I think .isin also returns true if a Substring is found e.g. alex.isin(alexander)would return true then. Which is not what I´m looking for.
Problem 2: I think .eg would do it for me. But I still have the problem with the .rstrip() and to.lower() methods.
What is the best way to count the amount of same entries?
print(df)
Name1 Name2
0 Alex Alexander
1 Peter peter
2 Herbert herbert
3 Seppi Sepp
4 Huaba huaba
If you need to compare each row:
out1 = df["Name1"].str.lower().eq(df["Name2"].str.lower()).sum()
If you need to compare all values of Name1 against all values of Name2:
out2 = df["Name1"].str.lower().isin(df["Name2"].str.lower()).sum()
Use set to find the common values between two dataframe columns
common_values = list(set(df.Name1) & set(df.Name2) )
count = len(common_values)
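Note that the set intersection is case-sensitive, so with the sample data it would find no common values. To mirror the case-insensitive comparison above, lower-case both columns first (a sketch, assuming the same df):
common_values = list(set(df.Name1.str.lower()) & set(df.Name2.str.lower()))
count = len(common_values)  # 3 for the sample data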
I have this dataframe:
0 name data
1 alex asd
2 helen sdd
3 alex dss
4 helen sdsd
5 john sdadd
So I am trying to get the most frequent value or values (in this case, values), and what I do is:
dataframe['name'].value_counts().idxmax()
but it returns only the value alex, even though helen appears twice as well.
By using mode
df.name.mode()
Out[712]:
0 alex
1 helen
dtype: object
To get the n most frequent values, just subset .value_counts() and grab the index:
# get top 10 most frequent names
n = 10
dataframe['name'].value_counts()[:n].index.tolist()
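With the sample data there are only three unique names, so n = 10 simply returns all of them:
dataframe['name'].value_counts()[:10].index.tolist()  # e.g. ['alex', 'helen', 'john']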
You could try argmax like this:
dataframe['name'].value_counts().argmax()
Out[13]: 'alex'
value_counts returns a pandas.core.series.Series, and argmax was used here to fetch the label of the maximum count. Note that in recent pandas versions argmax returns the integer position instead, so idxmax is the reliable way to get the label.
df['name'].value_counts()[:5].sort_values(ascending=False)
value_counts returns a pandas.core.series.Series sorted by count in descending order, so the extra sort_values(ascending=False) is redundant but harmless; the highest counts come first either way.
You can use this to get a full count of every value; it effectively computes the mode of a particular column:
df['name'].value_counts()
Use:
df['name'].mode()
or
df['name'].value_counts().idxmax()
This will give the top five most common names:
df['name'].value_counts().nlargest(5)
Here's one way:
df['name'].value_counts()[df['name'].value_counts() == df['name'].value_counts().max()]
which prints:
helen 2
alex 2
Name: name, dtype: int64
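If you only need the names themselves, grab the index (a sketch against the same df):
vc = df['name'].value_counts()
vc[vc == vc.max()].index.tolist()  # e.g. ['helen', 'alex']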
Not Obvious, But Fast
import numpy as np
import pandas as pd

f, u = pd.factorize(df.name.values)  # integer codes f, unique names u
counts = np.bincount(f)              # occurrences of each code
u[counts == counts.max()]            # names tied for the max count
array(['alex', 'helen'], dtype=object)
Simply use this:
dataframe['name'].value_counts().nlargest(n)
The functions for the largest and smallest frequencies are:
nlargest() for the 'n' most frequent values
nsmallest() for the 'n' least frequent values
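For example, to see the two least frequent names in the sample data:
dataframe['name'].value_counts().nsmallest(2)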
To get the top 5:
dataframe['name'].value_counts()[0:5]
You could use .apply with pd.value_counts to count the occurrences of all the names in the name column:
dataframe[['name']].apply(pd.value_counts)
To get the top five most common names:
dataframe['name'].value_counts().head()
My best solution to get the most frequent value is:
df['my_column'].value_counts().idxmax()
(value_counts already sorts in descending order, and idxmax returns the label rather than the position.)
I had a similar issue; the most compact answer to get, say, the top n most frequent values (5 is the default) is:
df["column_name"].value_counts().head(n)
Identifying the top 5, for example, using value_counts:
top5 = df['column'].value_counts()
Listing the contents of top5:
top5[:5]
n sets how many of the most frequent items to get:
n = 2
a = dataframe['name'].value_counts()[:n].index.tolist()
dataframe['name'].value_counts()[a]
Getting the top 5 most common last names in pandas:
df['name'].apply(lambda name: name.split()[-1]).value_counts()[:5]
I'm looking for a way to filter pandas rows via alternatives in a string. I have many different terms I would like to search for, so it would be easier to put them in a few variables rather than list them every time I need to access them.
I currently do:
df = df[df["A"].str.contains("BULL|BEAR|LONG|SHORT", case=False)]
Instead, I want to do something like:
bull = "BULL|LONG"
bear = "BEAR|SHORT"
leverage = bull + bear
df = df[df["A"].find(leverage, case=False)]
The problem is that this method only filters out one alternative from each variable. It will find "BULL" but not "LONG", and it will find "SHORT" but not "BEAR". It seems what it selects is arbitrary. Depending on if and where these terms occur in the file I'm reading from, results may differ.
I am assuming this is because | functions as OR, which is mutually exclusive.
If so, is there a mutually inclusive option? I would like to continue using strings for this, because I use str.contains in other places that rely on the same variables:
df.loc[df["A"].str.contains(bull, case=False), "B"]
df.loc[df["A"].str.contains(bear, case=False), "B"]
You needed to add an additional '|' to join your terms:
In [227]:
df = pd.DataFrame({'A':['bull', 'bear', 'short', 'null', 'LONG']})
df
Out[227]:
A
0 bull
1 bear
2 short
3 null
4 LONG
In [228]:
bull = "BULL|LONG"
bear = "BEAR|SHORT"
leverage = bull + '|' + bear
df = df[df["A"].str.contains(leverage, case=False)]
df
Out[228]:
A
0 bull
1 bear
2 short
4 LONG
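One caveat: str.contains does substring matching, so a value like 'BULLISH' would also match 'BULL'. If you only want whole-word hits, wrap the joined terms in word boundaries (a sketch using the same variables):
leverage = r'\b(?:{})\b'.format(bull + '|' + bear)
df = df[df['A'].str.contains(leverage, case=False)]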