How to split two strings into different columns in Python with Pandas? - python

I am new to this, and I need to split a column that contains two strings into 2 columns, like this:
Initial dataframe:
Full String
0 Orange Juice
1 Pink Bird
2 Blue Ball
3 Green Tea
4 Yellow Sun
Final dataframe:
  First String Second String
0       Orange         Juice
1         Pink          Bird
2         Blue          Ball
3        Green           Tea
4       Yellow           Sun
I tried this, but it doesn't work:
df['First String'] , df['Second String'] = df['Full String'].str.split()
and this:
df['First String', 'Second String'] = df['Full String'].str.split()
How can I make it work? Thank you!

The key here is to include the parameter expand=True in your str.split() to expand the split strings into separate columns.
Type it like this:
df[['First String','Second String']] = df['Full String'].str.split(expand=True)
Output:
    Full String First String Second String
0  Orange Juice       Orange         Juice
1     Pink Bird         Pink          Bird
2     Blue Ball         Blue          Ball
3     Green Tea        Green           Tea
4    Yellow Sun       Yellow           Sun
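If some rows might contain more than two words, expand=True would produce more than two columns and the assignment would fail. A minimal sketch of one way around that, assuming you want everything after the first space to stay together, is to pass n=1 (the three-word example row is made up for illustration):

```python
import pandas as pd

df = pd.DataFrame(
    {"Full String": ["Orange Juice", "Pink Bird", "Blue Ball", "Green Tea", "Yellow Sun"]}
)

# n=1 splits only at the first run of whitespace, so there are at most two parts
parts = df["Full String"].str.split(n=1, expand=True)
df[["First String", "Second String"]] = parts
print(df)

# Hypothetical three-word value: the remainder stays together in the second column
extra = pd.Series(["Green Tea Leaves"]).str.split(n=1, expand=True)
print(extra)
```

Without n=1, the three-word row would be split into three columns instead of two.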

Have you tried this solution?
https://stackoverflow.com/a/14745484/15320403
df = pd.DataFrame(df['Full String'].str.split(' ', n=1).tolist(), columns = ['First String', 'Second String'])

Related

How to rename Pandas columns based on mapping?

I have a dataframe where column Name contains values such as the following (the rest of the columns do not affect how this question is answered I hope):
Chicken
Chickens
Fluffy Chicken
Whale
Whales
Blue Whale
Tiger
White Tiger
Big Tiger
Now, I want to ensure that we rename these entries to be like the following:
Chicken
Chicken
Chicken
Whale
Whale
Whale
Tiger
Tiger
Tiger
Essentially, anything that contains 'Chicken' should become just 'Chicken', anything with 'Whale' just 'Whale', and anything with 'Tiger' just 'Tiger'.
What is the best way to do this? There are almost 1 million rows in the dataframe.
Sorry just to add, I have a list of what we expect i.e.
['Chicken', 'Whale', 'Tiger']
They can appear in any order in the column
What I should also add is, the column might contain things like "Mushroom" or "Eggs" that do not need substituting from the original list.
Try with str.extract:
l = ['Chicken', 'Whale', 'Tiger']
df['new'] = df['col'].str.extract('('+'|'.join(l)+')')[0]
Out[10]:
0 Chicken
1 Chicken
2 Chicken
3 Whale
4 Whale
5 Whale
6 Tiger
7 Tiger
8 Tiger
Name: 0, dtype: object
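Note that str.extract returns NaN for rows where none of the keywords match, so values like "Mushroom" or "Eggs" would be lost. A minimal sketch of one way to keep them, assuming the column is called col as above and that unmatched values should stay as they are (the sample rows are a shortened version of the question's data):

```python
import re
import pandas as pd

df = pd.DataFrame(
    {"col": ["Chicken", "Chickens", "Fluffy Chicken", "Whale",
             "Blue Whale", "White Tiger", "Mushroom", "Eggs"]}
)
keywords = ["Chicken", "Whale", "Tiger"]

# Escape each keyword in case it contains regex metacharacters, then OR them together
pattern = "(" + "|".join(re.escape(k) for k in keywords) + ")"

# expand=False returns a Series; rows without any keyword become NaN
matched = df["col"].str.extract(pattern, expand=False)

# Fall back to the original value where nothing matched
df["new"] = matched.fillna(df["col"])
print(df)
```

This stays vectorized, so it should still be reasonable on a column with around a million rows.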

Remove series of characters in pandas

Somewhat of a beginner in pandas.
I am trying to clean data in a specific column by removing a series of characters.
Currently the data looks like this:
**Column A**
(F) Red Apples
(F) Oranges
Purple (F)Grapes
(F) Fried Apples
I need to remove the (F)
I used: df['Column A'] = df['Column A'].str.replace('[(F)]', ' ')
This successfully removed the (F), but it also removed the other F letters (for example, Fried Apples became ied Apples). How can I remove only that exact "series" of characters?
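Try this (escape the parentheses and pass regex=True, since newer pandas versions treat the pattern as a literal string by default):
df['Column A'].str.replace(r'\(F\)', '', regex=True)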
0 Red Apples
1 Oranges
2 Purple Grapes
3 Fried Apples
Name: Column A, dtype: object
OR
df['Column A'].str.replace('(F)','', regex=False)
Please try this:
data={'Column A':["(F) Red Apples","(F) Oranges ","Purple (F)Grapes","(F) Fried Apples"]}
df=pd.DataFrame(data)
df['Column A']=df['Column A'].apply(lambda x: x.replace('(F)', ''))
0 Red Apples
1 Oranges
2 Purple Grapes
3 Fried Apples
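To see why the original attempt removed every F, here is a small sketch comparing the two patterns. As a regex, '[(F)]' is a character class that matches any single '(', 'F' or ')', whereas regex=False treats '(F)' as one literal substring. The trailing .str.strip() is an addition here, on the assumption that you also want the leftover leading space removed:

```python
import pandas as pd

s = pd.Series(["(F) Red Apples", "(F) Oranges", "Purple (F)Grapes", "(F) Fried Apples"])

# Character class: removes every '(', 'F' and ')' wherever they occur
char_class = s.str.replace("[(F)]", "", regex=True)

# Literal match: removes only the exact three-character sequence "(F)"
literal = s.str.replace("(F)", "", regex=False).str.strip()

print(char_class)
print(literal)
```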

Filtering a column in a data frame to get only column entries that contain a specific word

print(data['PROD_NAME'])
0 Natural Chip Compny SeaSalt175g
1 CCs Nacho Cheese 175g
2 Smiths Crinkle Cut Chips Chicken 170g
3 Smiths Chip Thinly S/Cream&Onion 175g
4 Kettle Tortilla ChpsHny&Jlpno Chili 150g
...
264831 Kettle Sweet Chilli And Sour Cream 175g
264832 Tostitos Splash Of Lime 175g
264833 Doritos Mexicana 170g
264834 Doritos Corn Chip Mexican Jalapeno 150g
264835 Tostitos Splash Of Lime 175g
Name: PROD_NAME, Length: 264836, dtype: object
I only want product names that have the word 'chip' in it somewhere.
new_data = pd.DataFrame(data['PROD_NAME'].str.contains("Chip"))
print(pd.DataFrame(new_data))
PROD_NAME
0 True
1 False
2 True
3 True
4 False
... ...
264831 False
264832 False
264833 False
264834 True
264835 False
[264836 rows x 1 columns]
My question is how do I remove the product_names that are False and instead of having True in the data frame above, get the product name which caused it to become True.
Btw, this is part of the Quantium data analytics virtual internship program.
Try using .loc with a boolean mask and column names to select the rows and columns that meet your criteria. The pandas documentation on .loc has more detail, but in short: the part before the comma is the boolean Series you want to use as a filter (in your case, str.contains('Chip')), and the part after the comma is the column or columns you want to return (in your case 'PROD_NAME', but it also works with other columns).
Example
import pandas as pd
example = {'PROD_NAME':['Chippy','ABC','A bag of Chips','MicroChip',"Product C"],'Weight':range(5)}
data = pd.DataFrame(example)
data.loc[data.PROD_NAME.str.contains('Chip'),'PROD_NAME']
#0 Chippy
#2 A bag of Chips
#3 MicroChip
You are almost there. Try this:
res = data[data['PROD_NAME'].str.contains("Chip")]
Output:
prod_name
0 Natural Chip Compny SeaSalt175g
2 Smiths Crinkle Cut Chips Chicken 170g
3 Smiths Chip Thinly S/Cream&Onion 175g
8 Doritos Corn Chip Mexican Jalapeno 150g
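Since the question asks for the word 'chip' regardless of capitalization, a case-insensitive match may be closer to what is wanted. A minimal sketch, assuming the column might also contain missing values (the short sample frame, including the None row, is made up for illustration):

```python
import pandas as pd

data = pd.DataFrame(
    {"PROD_NAME": [
        "Natural Chip Compny SeaSalt175g",
        "CCs Nacho Cheese 175g",
        "Smiths Crinkle Cut Chips Chicken 170g",
        "Doritos Mexicana 170g",
        "Kettle Tortilla ChpsHny&Jlpno Chili 150g",
        None,  # hypothetical missing product name
    ]}
)

# case=False ignores capitalization; na=False treats missing names as non-matches
mask = data["PROD_NAME"].str.contains("chip", case=False, na=False)

# Keep only the matching rows and return the product names themselves
chips = data.loc[mask, "PROD_NAME"]
print(chips)
```

Without na=False, missing values would produce NaN in the mask, which cannot be used directly for boolean indexing.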

Splitting a dataframe column on a pattern of characters and numerals

I have a dataframe that is:
A
1 king, crab, 2008
2 green, 2010
3 blue
4 green no. 4
5 green, house
I want to split the dates out into:
   A             B
1  king, crab    2008
2  green         2010
3  blue
4  green no. 4
5  green, house
I can't split on the first instance of ", " because that would give:
   A             B
1  king          crab, 2008
2  green         2010
3  blue
4  green no. 4
5  green         house
I can't split after the last instance of ", " because that would give:
A B
1 king crab 2008
2 green 2010
3 blue
4 green no. 4
5 green house
I also can't separate it by numbers because that would give:
   A             B
1  king, crab    2008
2  green         2010
3  blue
4  green no.     4
5  green, house
Is there some way to split by ", " followed by a 4-digit number that lies between two values? The range condition would be extra safety to filter out accidental 4-digit numbers that are clearly not years. For example:
Split by:
", " + (four digit number between 1000 - 2021)
Also appreciated are answers that split by:
", " + four digit number
Even better would be an answer that took into account that the number is ALWAYS at the end of the string.
Or you can just use Series.str.extract and Series.str.replace:
df = pd.DataFrame({"A":["king, crab, 2008","green, 2010","blue","green no. 4","green, house"]})
df["year"] = df["A"].str.extract(r"(\d{4})")
df["A"] = df["A"].str.replace(r",\s\d{4}", "", regex=True)
print(df)
              A  year
0    king, crab  2008
1         green  2010
2          blue   NaN
3   green no. 4   NaN
4  green, house   NaN
import pandas as pd
list_dict_Input = [{'A': 'king, crab, 2008'},
                   {'A': 'green, 2010'},
                   {'A': 'green no. 4'},
                   {'A': 'green no. 4'},]
df = pd.DataFrame(list_dict_Input)
for row_Index in range(len(df)):
    text = (df.iloc[row_Index]['A']).strip()
    last_4_Char = (text[-4:])
    if last_4_Char.isdigit() and int(last_4_Char) >= 1000 and int(last_4_Char) <= 2021:
        df.at[row_Index, 'B'] = last_4_Char
print(df)
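The question also asks for a version that only accepts a year at the very end of the string and within a plausible range. A vectorized sketch under those assumptions follows: the pattern is anchored with $, the 1000 to 2021 bounds come from the question, and the "model, 9999" row is a made-up case showing an out-of-range number being left alone:

```python
import pandas as pd

df = pd.DataFrame(
    {"A": ["king, crab, 2008", "green, 2010", "blue",
           "green no. 4", "green, house", "model, 9999"]}
)

# Capture a four-digit number only when it follows ", " at the very end of the string
year = pd.to_numeric(df["A"].str.extract(r", (\d{4})$", expand=False))

# Accept it only when it falls inside the range given in the question
is_year = year.between(1000, 2021)

# Nullable integer column: accepted years as integers, everything else missing
df["B"] = year.where(is_year).astype("Int64")

# Strip the ", YYYY" suffix only from the rows where the year was accepted
df.loc[is_year, "A"] = df.loc[is_year, "A"].str.replace(r", \d{4}$", "", regex=True)
print(df)
```

Because the suffix is removed only where the range check passed, rows like "green no. 4" and the out-of-range example keep their original text.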

For loop and if for multiple conditions and change another column in the same row in pandas

I am trying to change a column if certain strings appear in another column of the same row. I am new to Pandas.
I need to change the price of some oranges to 200 but not the price of 'Red Orange'. I cannot change the name of the "fruits". It is a much longer string and I just made it shorter for convenience here.
fruits price
Green apple from us 10
Orange Apple from US 11
Mango from Canada 15
Blue Orange from Mexico 16
Red Orange from Costa 15
Pink Orange from Brazil 19
Yellow Pear from Guatemala 32
Black Melon from Guatemala 4
Purple orange from Honduras 5
so that the final result would be
fruits price
Green apple from us 10
Orange Apple from US 11
Mango from Canada 15
Blue Orange from Mexico 200
Red Orange from Costa 15
Pink Orange from Brazil 200
Yellow Pear from Guatemala 32
Black Melon from Guatemala 4
Purple orange from Honduras 5
I tried
df.loc[df['fruits'].str.lower().str.contains('orange'), 'price'] = 200
But this changes the price of a total of 4 items instead of only 2.
I also tried a for loop once, but that changed the price for the entire column.
You can use regex:
import re
df.loc[df['fruits'].str.lower().str.contains(r'(?<!red) orange', regex = True), 'price'] = 200
(?<!red) is a negative lookbehind, so if 'orange' is preceded by 'red' it won't match. The mandatory space before 'orange' also ensures that it is not the first word, so you won't have to worry about it being the color that describes something else.
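df.loc[((df['fruits'].str.contains('orange')) & (~df['fruits'].str.contains('Red'))),'price'] = 200
We check for oranges and use ~ to confirm that 'Red' is not present in the string. If both conditions are true, the price changes to 200. Note that str.contains is case-sensitive by default, so 'orange' as written only matches the lowercase spelling; adjust the pattern to the capitalization in your data.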
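For the exact sample data in the question, a case-sensitive variant of the lookbehind pattern appears to reproduce the expected result. This is only a sketch, assuming the fruit is always spelled "Orange" with a capital O, is never the first word, and that "Red" directly precedes it in the rows to skip:

```python
import pandas as pd

df = pd.DataFrame({
    "fruits": [
        "Green apple from us", "Orange Apple from US", "Mango from Canada",
        "Blue Orange from Mexico", "Red Orange from Costa", "Pink Orange from Brazil",
        "Yellow Pear from Guatemala", "Black Melon from Guatemala",
        "Purple orange from Honduras",
    ],
    "price": [10, 11, 15, 16, 15, 19, 32, 4, 5],
})

# " Orange" must have a space in front of it and must not be directly preceded by "Red"
mask = df["fruits"].str.contains(r"(?<!Red) Orange", regex=True)

# Update the price only on the matching rows
df.loc[mask, "price"] = 200
print(df)
```

Dropping the .str.lower() call is what keeps "Purple orange" unchanged here, and the leading space is what excludes "Orange Apple".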
