How to remove part of a string by condition in Python/pandas?

I have the following DataFrame:
a = [{'name': 'AAA|YYY'},{ 'name': 'BBB|LLL'}]
df = pd.DataFrame(a)
print(df)
name
0 AAA|YYY
1 BBB|LLL
and I'm trying to remove everything from the '|' character to the end of the string:
df['name'] = [i.split('|')[:-1] for i in df['name']]
but I get the following result:
name
0 [AAA]
1 [BBB]
How can I get the following result?
name
0 AAA
1 BBB

You get lists because [:-1] is a slice: it selects a range of items from the result of split, and a slice of a list is itself a list.
To get your result, select just the first part of the split, which is at index 0:
df['name'] = [i.split('|')[0] for i in df['name']]
Or, if a value can contain several '|' characters and you want to remove only the last part, join the remaining parts back together after slicing:
df['name'] = ['|'.join(i.split('|')[:-1]) for i in df['name']]
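The same two results can also be sketched with pandas' vectorized string methods instead of list comprehensions. This is only an illustration: the third row and the names first_part and without_last are assumptions added here, not part of the question.

```python
import pandas as pd

# Sample data from the question, plus one hypothetical row with two '|'
df = pd.DataFrame({'name': ['AAA|YYY', 'BBB|LLL', 'CCC|DDD|EEE']})

# Keep only the text before the first '|'
first_part = df['name'].str.split('|').str[0]

# Drop only the last '|'-separated piece by splitting once from the right
without_last = df['name'].str.rsplit('|', n=1).str[0]

print(first_part.tolist())    # ['AAA', 'BBB', 'CCC']
print(without_last.tolist())  # ['AAA', 'BBB', 'CCC|DDD']
```

Splitting from the right with n=1 avoids the split-then-join round trip when only the last piece should go.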

Related

Split One Column to Multiple Columns in Pandas

I want to split one column into three. In the screenshot, the builder column needs to be split into three columns: builder name, city and country. I used the str.split() method, which gives a good result for two columns:
ownerName = df['owner_name']
df[["ownername", "owner_country"]] = df["owner_name"].str.split("-", expand=True)
But when I try three columns with two delimiters, ',' and '-':
ownerName = df['owner_name']
df[["ownername", "city", "owner_country"]] = df["owner_name"].str.split("," ,"-", expand=True)
I get this error:
File "C:\Users....\lib\site-packages\pandas\core\frame.py", line 3160, in setitem
self._setitem_array(key, value)
File "C:\Users....\lib\site-packages\pandas\core\frame.py", line 3189, in _setitem_array
raise ValueError("Columns must be same length as key")
ValueError: Columns must be same length as key
What is the best solution for two delimiters, ',' and '-'? There are also some empty rows.
Your exact input is unclear, but assuming the sample input kindly provided by @ArchAngelPwn, you could use str.split with a regex:
names = ['Builder_Name', 'City_Name', 'Country']
out = (df['Column1']
       .str.split(r'\s*[,-]\s*', expand=True)   # split on "," or "-" with optional spaces
       .rename(columns=dict(enumerate(names)))  # rename 0/1/2 with names in order
       )
output:
Builder_Name City_Name Country
0 Builder Name City Country
Some of these steps could be combined, but this is one possible option, and it should be readable for most developers on the project:
import pandas as pd

data = {
    'Column1': ['Builder Name - City, Country']
}
df = pd.DataFrame(data)
# Text before the first "-" is the builder name
df['Builder_Name'] = df['Column1'].apply(lambda x: x.split('-')[0].strip())
# Text between the first "-" and the "," is the city
df['City_Name'] = df['Column1'].apply(lambda x: x.split('-', 1)[1].split(',')[0].strip())
# Text after the "," is the country
df['Country'] = df['Column1'].apply(lambda x: x.split(',')[1].strip())
df = df[['Builder_Name', 'City_Name', 'Country']]
df
As mentioned in the question, there are two delimiters, "-" and ",". For a single delimiter, str.split("-", expand=True) is enough. For two different delimiters, the same call works with a regex pattern that lists the alternatives. For example, if the column has the form name - city, country (Owner = SANTIERUL NAVAL CONSTANTA - CONSTANTZA, ROMANIA), the code can be written as:
ownerName = df['owner_name']
df[["Owner_name", "City_Name", "owner_country"]] = df["owner_name"].str.split(r', |- |\*|\n', expand=True)
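Since the question also mentions empty rows, here is a minimal sketch of a regex split that tolerates them. The None row and the name parts are assumptions made for illustration; n=2 is used so that at most three pieces are produced.

```python
import pandas as pd

# One real-looking value from the answer above and one hypothetical empty row
df = pd.DataFrame({'owner_name': [
    'SANTIERUL NAVAL CONSTANTA - CONSTANTZA, ROMANIA',
    None,
]})

# Split on "," or "-" (with optional surrounding spaces), at most twice,
# so the result has at most three columns
parts = df['owner_name'].str.split(r'\s*[,-]\s*', n=2, expand=True)
parts.columns = ['Owner_name', 'City_Name', 'owner_country']

# Attach the new columns; the empty row simply stays missing in all three
df = df.join(parts)
print(df)
```

If no row in the data contains both delimiters, the split yields fewer than three columns and the column assignment would fail, so it may be worth checking parts.shape first.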

Regex replace first two letters within column in python

I have a dataframe such as
COL1
A_element_1_+_none
C_BLOCA_element
D_element_3
element_'
BasaA_bloc
B_basA_bloc
BbasA_bloc
and I would like to remove the first 2 letters of each row of COL1, but only if they appear in this list:
the_list =['A_','B_','C_','D_']
Then I should get the following output:
COL1
element_1_+_none
BLOCA_element
element_3
element_'
BasaA_bloc
basA_bloc
BbasA_bloc
So far I tried the following :
df['COL1']=df['COL1'].str.replace("A_","")
df['COL1']=df['COL1'].str.replace("B_","")
df['COL1']=df['COL1'].str.replace("C_","")
df['COL1']=df['COL1'].str.replace("D_","")
But this also removes the pattern elsewhere in the string (such as the A_ in row 2), not just the first 2 letters...
If the values to replace in the_list always have that format, you could also consider using str.replace with a simple pattern matching an uppercase char A-D followed by an underscore at the start of the string ^[A-D]_
import pandas as pd
strings = [
    "A_element_1_+_none",
    "C_BLOCA_element",
    "D_element_3",
    "element_'",
    "BasaA_bloc",
    "B_basA_bloc",
    "BbasA_bloc"
]
df = pd.DataFrame(strings, columns=["COL1"])
# regex=True is required in recent pandas versions, where str.replace is literal by default
df['COL1'] = df['COL1'].str.replace(r"^[A-D]_", "", regex=True)
print(df)
Output
COL1
0 element_1_+_none
1 BLOCA_element
2 element_3
3 element_'
4 BasaA_bloc
5 basA_bloc
6 BbasA_bloc
You can also use the apply() function from pandas. If the string starts with one of the patterns concerned, we omit the first two characters; otherwise we return the whole string.
df["COL1"] = df["COL1"].apply(lambda x: x[2:] if x.startswith(("A_","B_","C_","D_")) else x)
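If the prefixes live in a list such as the_list, the anchored pattern can also be built from that list rather than hard-coded. This is a sketch under the assumption that every entry is a literal prefix; the variable name pattern is introduced here for illustration.

```python
import re
import pandas as pd

the_list = ['A_', 'B_', 'C_', 'D_']
df = pd.DataFrame({'COL1': ['A_element_1_+_none', 'C_BLOCA_element',
                            'BasaA_bloc', 'B_basA_bloc']})

# Escape each prefix so it is matched literally, and anchor the
# alternation at the start of the string with ^
pattern = '^(?:' + '|'.join(map(re.escape, the_list)) + ')'

df['COL1'] = df['COL1'].str.replace(pattern, '', regex=True)
print(df['COL1'].tolist())
# ['element_1_+_none', 'BLOCA_element', 'BasaA_bloc', 'basA_bloc']
```

Because of the ^ anchor, a prefix that appears later in the string (like the A_ inside C_BLOCA_element) is left alone.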

How to add prefix to the selected records in python pandas df

I have a df where some of the records in a column contain a prefix and some do not. I would like to update only the records without the prefix. Unfortunately, my script adds the prefix to every record in the df:
new_list = []
prefix = 'x'
for ids in df['ids']:
    if ids.find(prefix) < 1:
        new_list.append(prefix + ids)
How can I skip the records that already have the prefix?
I've tried df[df['ids'].str.contains(prefix)], but I'm getting an error.
Use Series.str.startswith to build a mask, and numpy.where to add the prefix only where it is missing:
import numpy as np
import pandas as pd

df = pd.DataFrame({'ids':['aaa','ssx','xwe']})
prefix = 'x'
df['ids'] = np.where(df['ids'].str.startswith(prefix), '', prefix) + df['ids']
print (df)
ids
0 xaaa
1 xssx
2 xwe
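The same idea can be sketched without NumPy by using a boolean mask with .loc. The name missing is an assumption introduced here, and na=False is used so that any missing values do not break the mask.

```python
import pandas as pd

df = pd.DataFrame({'ids': ['aaa', 'ssx', 'xwe']})
prefix = 'x'

# True for rows that do not start with the prefix yet
missing = ~df['ids'].str.startswith(prefix, na=False)

# Prepend the prefix only on those rows
df.loc[missing, 'ids'] = prefix + df.loc[missing, 'ids']
print(df['ids'].tolist())  # ['xaaa', 'xssx', 'xwe']
```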

Rename columns in dataframe using bespoke function python pandas

I've got a data frame with column names like 'AH_AP' and 'AH_AS'.
Essentially, all I want to do is swap the part before the underscore and the part after it, so that the column headers become 'AP_AH' and 'AS_AH'.
I can do that if the elements are in a list, but I have no idea how to apply it to column names.
My solution, if it were a list, goes like this:
columns = ['AH_AP','AH_AS']
def rejig_col_names():
    elements_of_header = columns.split('_')
    new_title = elements_of_header[-1] + "_" + elements_of_header[0]
    return new_title
I'm guessing I need to apply this to something like the below, but I have no idea how, or how to reference a single column within df.columns:
df.columns = df.columns.map()
Any help appreciated. Thanks :)
You can do it this way:
Input:
df = pd.DataFrame(data=[['1','2'], ['3','4']], columns=['AH_PH', 'AH_AS'])
print(df)
AH_PH AH_AS
0 1 2
1 3 4
Output:
df.columns = df.columns.str.split('_').str[::-1].str.join('_')
print(df)
PH_AH AS_AH
0 1 2
1 3 4
Explained:
Use the string accessor and the split method on '_'.
Then, using the str accessor with a reversing slice, [::-1], you can reverse the order of the list.
Lastly, using the string accessor and join, concatenate the list back together.
You were almost there: you can do
df.columns = df.columns.map(rejig_col_names)
except that the function gets called with a column name as argument, so change it like this:
def rejig_col_names(col_name):
    elements_of_header = col_name.split('_')
    new_title = elements_of_header[-1] + "_" + elements_of_header[0]
    return new_title
An alternative to the other answer is to use your function with DataFrame.rename:
import pandas as pd
def rejig_col_names(columns):
    elements_of_header = columns.split('_')
    new_title = elements_of_header[-1] + "_" + elements_of_header[0]
    return new_title
data = {
    'A_B': [1, 2, 3],
    'C_D': [4, 5, 6],
}
df = pd.DataFrame(data)
df.rename(rejig_col_names, axis='columns', inplace=True)
print(df)
str.replace is also an option, by swapping capture groups.
Sample input borrowed from ScottBoston:
df = pd.DataFrame(data=[['1', '2'], ['3', '4']], columns=['AH_PH', 'AH_AS'])
Then capture everything before and after the '_' and swap capture groups 1 and 2:
df.columns = df.columns.str.replace(r'^(.*)_(.*)$', r'\2_\1', regex=True)
PH_AH AS_AH
0 1 2
1 3 4

Remove a SPECIFIC url from a string in a pandas dataframe

I have a dataframe:
Name url
A 'https://foo.com, https://www.bar.org, https://goo.com'
B 'https://foo.com, https://www.bar.org, https://www.goo.com'
C 'https://foo.com, https://www.bar.org, https://goo.com'
and then a keyword list:
keyword_list = ['foo','bar']
I'm trying to remove the URLs that contain the keywords while keeping the ones that don't. So far this is the only thing that has worked for me, but it removes only the keyword itself rather than the whole URL:
df['url'] = df['url'].str.replace('|'.join(keyword_list), ' ')
I've tried converting each string to a list of URLs, but I get an indexing error when combining it back with the larger dataframe it's part of. Has anyone run into this before?
Desired output:
Name url
A 'https://goo.com'
B 'https://www.goo.com'
C 'https://goo.com'
I'm pretty sure you can do this with a regex, but you can also split the URLs into separate rows and filter out the ones that contain a keyword:
new_df = df.set_index('Name').url.str.split(r',\s+', expand=True).stack()
(new_df[~new_df.str.contains('|'.join(keyword_list))]
.reset_index(level=1, drop=True)
.to_frame(name='url')
.reset_index()
)
Output:
Name url
0 A https://goo.com
1 B https://www.goo.com
2 C https://goo.com
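If a Name could keep more than one URL and the result should stay at one row per Name, a per-cell filter is another possibility. This is a minimal sketch, not the answer's method; the helper drop_matching_urls is a hypothetical name, and it assumes URLs are separated by commas.

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['A', 'B', 'C'],
    'url': [
        'https://foo.com, https://www.bar.org, https://goo.com',
        'https://foo.com, https://www.bar.org, https://www.goo.com',
        'https://foo.com, https://www.bar.org, https://goo.com',
    ],
})
keyword_list = ['foo', 'bar']

def drop_matching_urls(cell):
    """Return the comma-separated URLs in cell that contain no keyword."""
    urls = [u.strip() for u in cell.split(',')]
    kept = [u for u in urls if not any(k in u for k in keyword_list)]
    return ', '.join(kept)

df['url'] = df['url'].apply(drop_matching_urls)
print(df['url'].tolist())
# ['https://goo.com', 'https://www.goo.com', 'https://goo.com']
```

Because the column is updated in place, there is no need to merge anything back into the larger dataframe.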
