I have a pandas dataframe with a column named AA_IDs. In a few rows, the column values contain the special character sequence "-#". I need to determine three things:
Position of these special characters or delimiters
Find the string before the special character
Find the string after the special character
E.g. AFB001 9183Daily-#789876A
The answer would be: before the delimiter, AFB001 9183Daily; after the delimiter, 789876A.
Just use the apply function with split:
df['AA_IDs'].apply(lambda x: x.split('-#'))
This should give you a series with a list for each row, e.g. ['AFB001 9183Daily', '789876A'].
For a fixed delimiter like this, it is likely to be faster than a regex, and it is easier to read.
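To cover all three parts of the question (position, text before, text after), here is a minimal sketch using pandas string methods. The sample rows and the new column names (pos, before, after) are made up for illustration.

```python
import pandas as pd

# Hypothetical sample data; only the first row contains the "-#" delimiter.
df = pd.DataFrame({"AA_IDs": ["AFB001 9183Daily-#789876A", "NoDelimiterHere"]})

# Position of the delimiter in each string (-1 where it is absent).
df["pos"] = df["AA_IDs"].str.find("-#")

# str.partition splits at the first occurrence into (before, delimiter, after)
# and returns a DataFrame with columns 0, 1 and 2.
parts = df["AA_IDs"].str.partition("-#")
df["before"] = parts[0]
df["after"] = parts[2]

print(df)
```

For rows without the delimiter, pos is -1, before holds the whole string and after is empty.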
So let's say the dataframe is called df and the column with the text is A.
You can use
pattern = r'<your regex>'  # must contain at least one capture group, i.e. (...)
df['one'] = df.A.str.extract(pattern, expand=False)
This creates a new column containing the extracted text. You just need to write a regex whose capture group matches the part of the string you want; str.extract raises an error if the pattern has no capture group. I highly recommend regex101 to help you construct your regex.
Hope this helps!
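For the "-#" delimiter from the question, one possible pattern looks like the sketch below. The group names before and after and the sample rows are my own choices, not something from the question.

```python
import pandas as pd

# Hypothetical sample data in a column called A, as assumed in the answer.
df = pd.DataFrame({"A": ["AFB001 9183Daily-#789876A", "NoDelimiterHere"]})

# Two named capture groups: the text before "-#" and the text after it.
pattern = r"^(?P<before>.*?)-#(?P<after>.*)$"

# With named groups, str.extract returns a DataFrame with one column per group.
extracted = df["A"].str.extract(pattern)
print(extracted)
```

Rows that do not contain the delimiter come back as missing values in both columns.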
Related
I would like to check whether the pandas dataframe column id contains any of the following substrings: '.F1', '.N1', '.FW', '.SP'.
I am currently using the following code:
searchfor = ['.F1', '.N1', '.FW', '.SP']
mask = (df["id"].str.contains('|'.join(searchfor)))
The id column looks like this:
ID
0 F611B4E369F1D293B5
1 10302389527F190F1A
I am actually looking to see whether the id column contains any of the four substrings, each starting with a literal ".". For some reason, rows containing F1 are picked up by the mask, even though no value in the current example contains ".F1". I would really appreciate it if someone could let me know how to solve this. Thank you so much.
The unescaped "." is a regex metacharacter that matches any character, which is why "9F1" or "7F1" counts as a match. You can use re.escape() to escape the metacharacters, so you don't need to escape every string in searchfor by hand (no need to change the definition of searchfor):
import re
searchfor = ['.F1', '.N1', '.FW', '.SP'] # no need to escape each string
pattern = '|'.join(map(re.escape, searchfor)) # use re.escape() with map()
mask = (df["id"].str.contains(pattern))
re.escape() will escape each string for you:
print(pattern)
\.F1|\.N1|\.FW|\.SP
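To see the difference side by side, here is a small sketch comparing the unescaped and escaped patterns. The first two ids are from the question; the third row (with a real ".F1") is a made-up example.

```python
import re
import pandas as pd

# First two ids are from the question; the third is hypothetical.
df = pd.DataFrame({"id": ["F611B4E369F1D293B5", "10302389527F190F1A", "ABC.F1XYZ"]})
searchfor = [".F1", ".N1", ".FW", ".SP"]

# Unescaped: "." matches any character, so "9F1" and "7F1" count as hits.
unescaped = df["id"].str.contains("|".join(searchfor))

# Escaped: "." only matches a literal period.
escaped = df["id"].str.contains("|".join(map(re.escape, searchfor)))

print(pd.DataFrame({"id": df["id"], "unescaped": unescaped, "escaped": escaped}))
```

Only the third row should be flagged once the pattern is escaped.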
I tried to reference this SF answer: How to check if character exists in DataFrame cell
It gave a seemingly good solution, but it doesn't appear to work for the period character ".", which of course is the character I'm trying to filter on.
df_intials = df['Name'].str.contains('.')
Is there something about filtering a dataframe that makes every value in the column appear to contain a "."?
When I convert the column to a list and write a simple function that appends the strings containing ".", it works correctly.
pd.Series.str.contains treats the pattern as a regular expression by default, and in a regex "." matches any character. You can either escape the period with a backslash or pass regex=False:
Try
df_intials = df['Name'].str.contains(r'\.')
or
df_intials = df['Name'].str.contains('.', regex=False)
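A quick check of the three variants on some made-up names (the Series below is purely illustrative):

```python
import pandas as pd

# Hypothetical names; the last entry is an empty string.
names = pd.Series(["J. Smith", "Anna", ""])

# Default: "." is a regex that matches any single character.
as_regex = names.str.contains(".")

# Escaped regex and literal search both look for an actual period.
escaped = names.str.contains(r"\.")
as_literal = names.str.contains(".", regex=False)

print(pd.DataFrame({"name": names, "as_regex": as_regex,
                    "escaped": escaped, "as_literal": as_literal}))
```

With the default regex behaviour every non-empty string matches, which is why the original filter seemed to flag every row.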
I am having trouble removing a space at the beginning of the strings in my pandas dataframe cells. Looking at the cells, there seems to be a space at the start of each string, but printing one of them gives "\x95 12345", so the leading character is not a normal space but "\x95".
I already tried strip(), but it didn't help.
The dataframe was produced with str.split(pat=',').tolist(), which split the strings into separate cells at each ',', and now my strings have this character added.
Assuming col1 is your first column name:
import re
df.col1 = df.col1.apply(lambda x: re.sub(r'\x95',"",x))
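An alternative sketch that stays within the pandas string methods and also trims the ordinary space left behind after "\x95" is removed. The sample values are made up, modelled on the "\x95 12345" output in the question.

```python
import pandas as pd

# Hypothetical data modelled on the question's "\x95 12345" example.
df = pd.DataFrame({"col1": ["\x95 12345", "\x95 67890", "plain"]})

# Remove the control character as a literal string, then strip the
# leftover whitespace (strip() alone does not treat "\x95" as whitespace).
df["col1"] = df["col1"].str.replace("\x95", "", regex=False).str.strip()

print(df["col1"].tolist())
```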
I have a column named "New_category" and I have to replace multiple strings in the column with the word "brandies". I have created the following list:
brandies = ["apricot brandies", "cherry brandies", "peach brandies"]
I want to write an expression that replaces "apricot brandies", "cherry brandies" and "peach brandies" in the column with the word "brandies".
This is what I have tried, but it does not work.
iowa['New_category'] = iowa['New_Category'].str.replace('brandies','Brandies')
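When using str.replace, use a pattern that captures the whole string. In pandas 2.0 and later you also need to pass regex=True, because the pattern is treated as a literal string by default.
iowa['New_category'] = iowa['New_category'].str.replace('^.*brandies.*$', 'Brandies', regex=True)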
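If you would rather use the brandies list from the question directly, Series.replace accepts a list of values to swap out. This is only a sketch: it matches whole cell values exactly, and the sample rows (including "vodka") are invented.

```python
import pandas as pd

# Hypothetical sample of the iowa dataframe.
iowa = pd.DataFrame({"New_category": ["apricot brandies", "cherry brandies",
                                      "peach brandies", "vodka"]})
brandies = ["apricot brandies", "cherry brandies", "peach brandies"]

# Replace every cell whose value is exactly one of the listed strings.
iowa["New_category"] = iowa["New_category"].replace(brandies, "brandies")

print(iowa["New_category"].tolist())
```

Unlike the regex version, this leaves any other category that merely contains the word "brandies" untouched.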
I have a dataframe with multiple columns. I want to look at one column and if any of the strings in the column contain #, I want to replace them with another string. How would I go about doing this?
A dataframe in pandas is composed of columns, which are Series.
I'm going to use regex, because it's useful and everyone needs practice, myself included! See the pandas docs on working with text data.
Note the str.replace. The regex string you want is this (it worked for me): '.*#+.*', which says "any character (.) zero or more times (*), followed by a # one or more times (+), followed by any character (.) zero or more times (*)".
df['column'] = df['column'].str.replace('.*#+.*', 'replacement', regex=True)
This should work, where 'replacement' is whatever string you want to put in. The regex=True argument is needed in pandas 2.0 and later, where the pattern is otherwise treated as a literal string.
My suggestion:
df['col'] = ['new string' if '#' in x else x for x in df['col']]
Not sure which is faster.
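Assuming you called your dataframe df and every column holds strings, you can apply the replacement to all columns at once (in Python 3, nested map() calls return lazy iterators, so use apply instead):
df.apply(lambda col: col.map(lambda x: 'anotherString' if '#' in x else x))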
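Another option for the single-column case is a boolean mask from str.contains combined with Series.mask. This is a sketch on made-up data; the column name col and the replacement text are placeholders.

```python
import pandas as pd

# Hypothetical data; two of the three values contain "#".
df = pd.DataFrame({"col": ["abc#1", "plain", "x#y#z"]})

# regex=False makes "#" a plain substring search.
has_hash = df["col"].str.contains("#", regex=False)

# Series.mask replaces the values where the condition is True.
df["col"] = df["col"].mask(has_hash, "new string")

print(df["col"].tolist())
```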