Remove series of characters in pandas - python

Somewhat of a beginner in pandas.
I am trying to clean data in a specific column by removing a series of characters.
Currently the data looks like this:
**Column A**
(F) Red Apples
(F) Oranges
Purple (F)Grapes
(F) Fried Apples
I need to remove the (F)
I used: df['Column A']=df['Column A'].str.replace('[(F)]',' ')
This successfully removed the (F), but it also removed the other F letters (for example, Fried Apples became ied Apples). How can I remove only that exact sequence of characters?

Try this -
df['Column A'].str.replace(r'\(F\)', '', regex=True)
0 Red Apples
1 Oranges
2 Purple Grapes
3 Fried Apples
Name: Column A, dtype: object
OR
df['Column A'].str.replace('(F)','', regex=False)
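
To see why the original pattern also removed the other F letters, here is a minimal sketch using Python's re module directly (pandas uses the same pattern syntax when regex=True). The list `values` and the final whitespace cleanup step are assumptions added for illustration; they are not part of the answers above.

```python
import re

values = ["(F) Red Apples", "(F) Oranges", "Purple (F)Grapes", "(F) Fried Apples"]

# '[(F)]' is a character class: it matches any single '(', 'F' or ')',
# so every capital F in the string is removed as well.
char_class = [re.sub(r"[(F)]", "", v) for v in values]

# r'\(F\)' escapes the parentheses, so only the literal sequence "(F)" matches.
literal = [re.sub(r"\(F\)", "", v) for v in values]

# Optional extra step (an assumption): collapse and trim leftover whitespace.
cleaned = [re.sub(r"\s+", " ", v).strip() for v in literal]

print(char_class)
print(cleaned)
```

The same two patterns can be passed to df['Column A'].str.replace with regex=True.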

Please try this:
import pandas as pd
data={'Column A':["(F) Red Apples","(F) Oranges ","Purple (F)Grapes","(F) Fried Apples"]}
df=pd.DataFrame(data)
df['Column A']=df['Column A'].apply(lambda x: x.replace('(F)', ''))
0 Red Apples
1 Oranges
2 Purple Grapes
3 Fried Apples

Related

How to create a 100% stacked bar plot from a categorical dataframe

I have a dataframe structured like this:
User   Food 1    Food 2    Food 3    Food 4
Steph  Onions    Tomatoes  Cabbages  Potatoes
Tom    Potatoes  Tomatoes  Potatoes  Potatoes
Fred   Carrots   Cabbages            Eggplant
Phil   Onions    Eggplant  Eggplant
I want to use the distinct values from across the food columns as categories. I then want to create a Seaborn plot so the % of each category for each column is plotted as a 100% horizontal stacked bar.
My attempt to do this:
data = {
'User' : ['Steph', 'Tom', 'Fred', 'Phil'],
'Food 1' : ["Onions", "Potatoes", "Carrots", "Onions"],
'Food 2' : ['Tomatoes', 'Tomatoes', 'Cabbages', 'Eggplant'],
'Food 3' : ["Cabbages", "Potatoes", "", "Eggplant"],
'Food 4' : ['Potatoes', 'Potatoes', 'Eggplant', ''],
}
df = pd.DataFrame(data)
x_ax = ["Onions", "Potatoes", "Carrots", "Onions", "", 'Eggplant', "Cabbages"]
df.plot(kind="barh", x=x_ax, y=["Food 1", "Food 2", "Food 3", "Food 4"], stacked=True, ax=axes[1])
plt.show()
1. Replace '' with np.nan, because empty strings would otherwise be counted as values.
2. Use pandas.DataFrame.melt to convert the dataframe to long form.
3. Use pandas.crosstab with the normalize parameter to calculate the percent for each 'Food'.
4. Plot the dataframe with pandas.DataFrame.plot and kind='barh'.
   Putting the food names on the x-axis is not the correct way to create a 100% stacked bar plot. One axis must be numeric. The bars will be colored by food type.
5. Annotate the bars based on this answer.
6. Move the legend outside the plot based on this answer.
seaborn is a high-level API for matplotlib, and pandas uses matplotlib as the default backend, and it's easier to produce a stacked bar plot with pandas.
seaborn doesn't support stacked barplots, unless histplot is used in a hacked way, as shown in this answer, and would require an extra step of melting percent.
Tested in python 3.10, pandas 1.4.2, matplotlib 3.5.1
Assignment expressions (:=) require python >= 3.8. Otherwise, use [f'{v.get_width():.2f}%' if v.get_width() > 0 else '' for v in c ].
import pandas as pd
import numpy as np
# using the dataframe in the OP
# 1.
df = df.replace('', np.nan)
# 2.
dfm = df.melt(id_vars='User', var_name='Food', value_name='Type')
# 3.
percent = pd.crosstab(dfm.Food, dfm.Type, normalize='index').mul(100).round(2)
# 4.
ax = percent.plot(kind='barh', stacked=True, figsize=(8, 6))
# 5.
for c in ax.containers:
    # customize the label to account for cases when there might not be a bar section
    labels = [f'{w:.2f}%' if (w := v.get_width()) > 0 else '' for v in c]
    # set the bar label
    ax.bar_label(c, labels=labels, label_type='center')
# 6.
ax.legend(bbox_to_anchor=(1, 1.02), loc='upper left')
DataFrame Views
dfm
User Food Type
0 Steph Food 1 Onions
1 Tom Food 1 Potatoes
2 Fred Food 1 Carrots
3 Phil Food 1 Onions
4 Steph Food 2 Tomatoes
5 Tom Food 2 Tomatoes
6 Fred Food 2 Cabbages
7 Phil Food 2 Eggplant
8 Steph Food 3 Cabbages
9 Tom Food 3 Potatoes
10 Fred Food 3 NaN
11 Phil Food 3 Eggplant
12 Steph Food 4 Potatoes
13 Tom Food 4 Potatoes
14 Fred Food 4 Eggplant
15 Phil Food 4 NaN
ct
Type Cabbages Carrots Eggplant Onions Potatoes Tomatoes
Food
Food 1 0 1 0 2 1 0
Food 2 1 0 1 0 0 2
Food 3 1 0 1 0 1 0
Food 4 0 0 1 0 2 0
total
Food
Food 1 4
Food 2 4
Food 3 3
Food 4 3
dtype: int64
percent
Type Cabbages Carrots Eggplant Onions Potatoes Tomatoes
Food
Food 1 0.00 25.0 0.00 50.0 25.00 0.0
Food 2 25.00 0.0 25.00 0.0 0.00 50.0
Food 3 33.33 0.0 33.33 0.0 33.33 0.0
Food 4 0.00 0.0 33.33 0.0 66.67 0.0
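
The ct and total views above are not produced by the code in the answer. A minimal sketch of how they could be computed, assuming they are the unnormalized crosstab and its row sums (the names ct and total come from the views; how they were actually built is my assumption):

```python
import numpy as np
import pandas as pd

data = {
    'User': ['Steph', 'Tom', 'Fred', 'Phil'],
    'Food 1': ['Onions', 'Potatoes', 'Carrots', 'Onions'],
    'Food 2': ['Tomatoes', 'Tomatoes', 'Cabbages', 'Eggplant'],
    'Food 3': ['Cabbages', 'Potatoes', '', 'Eggplant'],
    'Food 4': ['Potatoes', 'Potatoes', 'Eggplant', ''],
}
# empty strings become NaN so that crosstab ignores them
df = pd.DataFrame(data).replace('', np.nan)
dfm = df.melt(id_vars='User', var_name='Food', value_name='Type')

# raw counts of each food type per 'Food' column
ct = pd.crosstab(dfm['Food'], dfm['Type'])
# number of non-missing entries per 'Food' column
total = ct.sum(axis=1)
# dividing the counts by the row totals should match normalize='index'
percent = ct.div(total, axis=0).mul(100).round(2)

print(ct)
print(total)
print(percent)
```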

How to split two strings into different columns in Python with Pandas?

I am new to this, and I need to split a column that contains two strings into 2 columns, like this:
Initial dataframe:
Full String
0 Orange Juice
1 Pink Bird
2 Blue Ball
3 Green Tea
4 Yellow Sun
Final dataframe:
First String Second String
0 Orange Juice
1 Pink Bird
2 Blue Ball
3 Green Tea
4 Yellow Sun
I tried this, but it doesn't work:
df['First String'] , df['Second String'] = df['Full String'].str.split()
and this:
df['First String', 'Second String'] = df['Full String'].str.split()
How to make it work? Thank you!!!
The key here is to include the parameter expand=True in your str.split() to expand the split strings into separate columns.
Type it like this:
df[['First String','Second String']] = df['Full String'].str.split(expand=True)
Output:
Full String First String Second String
0 Orange Juice Orange Juice
1 Pink Bird Pink Bird
2 Blue Ball Blue Ball
3 Green Tea Green Tea
4 Yellow Sun Yellow Sun
Have you tried this solution?
https://stackoverflow.com/a/14745484/15320403
df = pd.DataFrame(df['Full String'].str.split(' ', n=1).tolist(), columns = ['First String', 'Second String'])
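
If some rows might contain more than two words, the two approaches can be combined. This is a small sketch, not something from the answers above: the row 'Iced Green Tea' is made up for illustration, and n=1 limits the split to the first space so that exactly two columns come back.

```python
import pandas as pd

# 'Iced Green Tea' is a hypothetical three-word row added for illustration
df = pd.DataFrame({'Full String': ['Orange Juice', 'Pink Bird', 'Iced Green Tea']})

# split on the first space only and expand the pieces into two columns
parts = df['Full String'].str.split(' ', n=1, expand=True)
df[['First String', 'Second String']] = parts

print(df)
```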

How to efficiently avoid replacing a substring when there are string necessary to be replaced with python pandas?

I would like to update the words/phrases with their link.
However, since some words/phrases may be substrings of others, I am looking for an efficient method to replace all words/phrases without replacing anything twice.
Substitution list
An example: the following words/phrases need to be replaced with the corresponding markdown links shown after ">>":
ABC Apple >> [ABC Apple](http://abc_apple)
ABC Apples >> [ABC Apples](http://abc_apples)
Apple >> [Apple](http://apple)
Apples >> [Apples](http://apples)
Apple Pie >> [Apple Pie](http://apple_pie)
Red Apple >> [Red Apple](http://red_apple)
Red Apple Pie >> [Red Apple Pie](http://red_apple_pie)
Idea
If we have a data structure in which each word/phrase (substring) stores the words/phrases (strings) that contain it (say list_l), we could check whether a sentence contains an element of list_l before checking whether it contains the substring.
For example, now we have the following substring : {list_l(string)}
ABC Apple : {ABC Apples}
ABC Apples : {}
Apple : {ABC Apple, ABC Apples, Apples, Apple Pie, Red Apple,Red Apple Pie}
Apples : {}
Apple Pie : {Red Apple Pie}
Red Apple : {Red Apple Pie}
Red Apple Pie : {}
However, the computational effort will be quite high, since for each element in list_l we still need to check that element's own list_l.
Examples
Some sentences to be replaced as examples (walked through from backward):
"I love Apple Pie.": Red Apple Pie(x) >> Red Apple(x) >> Apple Pie(o) >> Red Apple Pie(x)
"I like ABC Apple!": Red Apple Pie(x) >> Red Apple(x) >> Apple Pie(x) >> Apples(x) >> Apple(o) >> Red Apple Pie(x) >> Red Apple(x) >> Red Apple Pie(x) >> Apple Pie(x) >> Red Apple Pie(x) >> Apples(x) >> ABC Apples (x) >> ABC Apple (o) >> ABC Apples(x)
Computational effort O(n^3)
length of sentence x length of substitution list x length of list_l
(Original sentence >> result sentence:)
Expected result:
"I like ABC Apple!" >> "I like [ABC Apple](http://abc_apple)!"
Wrong result:
"I like ABC Apple!" >> "I like ABC [Apple](http://apple)!"
There is a naive greedy O(MN + M log M) solution (N is the size of the string and M is the size of the substitution list).
The first step is to sort the possible substitutions by length, longest first (O(M log M)).
Then you search the original sentence for a substitution and, if it is found, do the replacement (O(N)). You need to do this for every substitution, in order, so this step is O(MN).
Searching in that order should solve your problem (if I understood it correctly).
To keep the complexity above in practice, you will probably need some tricks to avoid re-reading text that has already been replaced, but it shouldn't be too difficult.
In the end, I think there may be a solution with a lower time complexity using some data structures, but it would be much more difficult to implement.
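
The longest-first idea can be sketched as follows. Note that this is a variation, not the exact loop described above: instead of running one search per substitution, it joins the phrases (longest first) into a single regex alternation, so the sentence is scanned once and text that has already been turned into a link is never read again. The `links` dict and the `add_links` helper are hypothetical names, and word boundaries are not handled.

```python
import re

# substitution list from the question: phrase -> URL
links = {
    "ABC Apple": "http://abc_apple",
    "ABC Apples": "http://abc_apples",
    "Apple": "http://apple",
    "Apples": "http://apples",
    "Apple Pie": "http://apple_pie",
    "Red Apple": "http://red_apple",
    "Red Apple Pie": "http://red_apple_pie",
}

# sort longest first so the alternation tries longer phrases before their substrings
phrases = sorted(links, key=len, reverse=True)
pattern = re.compile("|".join(re.escape(p) for p in phrases))

def add_links(sentence):
    """Replace each matched phrase with its markdown link in a single pass."""
    return pattern.sub(lambda m: f"[{m.group(0)}]({links[m.group(0)]})", sentence)

print(add_links("I like ABC Apple!"))
print(add_links("I love Apple Pie."))
```

With pandas, the helper could be applied to a text column with something like df['text'].apply(add_links), where 'text' is a placeholder column name.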

For loop and if for multiple conditions and change another column in the same row in pandas

I am trying to change a column if certain strings appear in another column of the same row. I am new to Pandas.
I need to change the price of some oranges to 200 but not the price of 'Red Orange'. I cannot change the name of the "fruits". It is a much longer string and I just made it shorter for convenience here.
fruits price
Green apple from us 10
Orange Apple from US 11
Mango from Canada 15
Blue Orange from Mexico 16
Red Orange from Costa 15
Pink Orange from Brazil 19
Yellow Pear from Guatemala 32
Black Melon from Guatemala 4
Purple orange from Honduras 5
so that the final result would be
fruits price
Green apple from us 10
Orange Apple from US 11
Mango from Canada 15
Blue Orange from Mexico 200
Red Orange from Costa 15
Pink Orange from Brazil 200
Yellow Pear from Guatemala 32
Black Melon from Guatemala 4
Purple orange from Honduras 5
I tried
df.loc[df['fruits'].str.lower().str.contains('orange'), 'price'] = 200
But this changes the price of 4 items instead of only 2.
I also tried a for loop once, but that changed the price in the entire column.
You can use regex:
import re
df.loc[df['fruits'].str.lower().str.contains(r'(?<!red) orange', regex = True), 'price'] = 200
(?<!red) is a negative lookbehind, so if "orange" is preceded by "red" it won't match. The mandatory space before "orange" also ensures that it is not the first word, so you won't have to worry about it being the color that describes something else.
df.loc[((df['fruits'].str.contains('orange')) & (~df['fruits'].str.contains('Red'))),'price'] = 200
We check for orange and use ~ to confirm that Red is not present in the string. If both conditions are true, the price changes to 200.
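
If the expected output in the question is taken literally (only 'Blue Orange' and 'Pink Orange' change, while 'Purple orange' keeps its price), a case-sensitive variant of the lookbehind idea seems to reproduce it. This is only a sketch under that assumption; the pattern r'(?<!Red) Orange' is my guess at the intended rule, not something stated by the asker.

```python
import pandas as pd

df = pd.DataFrame({
    'fruits': [
        'Green apple from us', 'Orange Apple from US', 'Mango from Canada',
        'Blue Orange from Mexico', 'Red Orange from Costa', 'Pink Orange from Brazil',
        'Yellow Pear from Guatemala', 'Black Melon from Guatemala',
        'Purple orange from Honduras',
    ],
    'price': [10, 11, 15, 16, 15, 19, 32, 4, 5],
})

# case-sensitive: ' Orange' must follow a space and must not be preceded by 'Red'
mask = df['fruits'].str.contains(r'(?<!Red) Orange', regex=True)
df.loc[mask, 'price'] = 200

print(df)
```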

Assign specific nominal values randomly to rows using pandas

I want to assign some selected nominal values randomly to rows. For example:
I have three nominal values ["apple", "orange", "banana"].
Before assigning these values randomly to rows:
**Name Fruit**
Jack
Julie
Juana
Jenny
Christina
Dickens
Robert
Cersei
After assigning these values randomly to rows:
**Name Fruit**
Jack Apple
Julie Orange
Juana Apple
Jenny Banana
Christina Orange
Dickens Orange
Robert Apple
Cersei Banana
How can I do this using pandas dataframe?
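You can use np.random.choice with your values (pd.np is deprecated and no longer available in recent pandas versions, so import numpy directly):
import numpy as np
vals = ["apple", "orange", "banana"]
df['Fruit'] = np.random.choice(vals, len(df))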
>>> df
Name Fruit
0 Jack apple
1 Julie orange
2 Juana apple
3 Jenny orange
4 Christina apple
5 Dickens banana
6 Robert orange
7 Cersei orange
You can create a DataFrame in pandas and then assign random choices using numpy
ex2 = pd.DataFrame({'Name':['Jack','Julie','Juana','Jenny','Christina','Dickens','Robert','Cersei']})
ex2['Fruits'] = np.random.choice(['Apple','Orange','Banana'],ex2.shape[0])
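
If the assignment needs to be reproducible, a seeded generator can be used instead of the global random state. A minimal sketch, assuming NumPy's Generator API is available (numpy >= 1.17); the seed value 0 is arbitrary.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'Name': ['Jack', 'Julie', 'Juana', 'Jenny',
                            'Christina', 'Dickens', 'Robert', 'Cersei']})
vals = ['apple', 'orange', 'banana']

# a seeded generator gives the same "random" assignment on every run
rng = np.random.default_rng(seed=0)
df['Fruit'] = rng.choice(vals, size=len(df))

print(df)
```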
