Remove first 3 characters in string using a condition statement - python

Can anyone kindly help, please?
I'm trying to remove the first three characters of a string using the statement:
Data['COUNTRY_CODE'] = Data['COUNTRY1'].str[3:]
This creates a new column after removing the first three characters of the string. However, I do not want this applied to all of the values in the column, so I was hoping there is a way to use a conditional statement such as 'where' in order to change only the desired strings.

I assume you are using pandas, so your condition check can look like:
condition_mask = Data['COL_YOU_WANT_TO_CHECK'] == 'SOME CONDITION'
Your new column can be created as:
# Drop the first 3 characters (as in the question's slice) only on matching rows
Data.loc[condition_mask, 'COUNTRY_CODE'] = Data.loc[condition_mask, 'COUNTRY1'].str[3:]
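A minimal runnable sketch of the approach, where COUNTRY1, REGION, and the 'EU' condition are made-up placeholders:

```python
import pandas as pd

# Toy frame: strip the "EU-" prefix only for rows where REGION is 'EU'
Data = pd.DataFrame({
    'COUNTRY1': ['EU-FRA', 'EU-DEU', 'USA'],
    'REGION':   ['EU', 'EU', 'NA'],
})

condition_mask = Data['REGION'] == 'EU'
# Copy the column, then drop the first three characters only on masked rows
Data['COUNTRY_CODE'] = Data['COUNTRY1']
Data.loc[condition_mask, 'COUNTRY_CODE'] = Data.loc[condition_mask, 'COUNTRY1'].str[3:]
print(Data['COUNTRY_CODE'].tolist())  # ['FRA', 'DEU', 'USA']
```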

Related

Pandas DataFrames Lists

So I created a new column in my dataframe using a list. Now every entry has '[ ]' square brackets around the text. How do I remove them? Please help! It seems easy but I'm not getting there. Code used:
df.insert(1, 'Email', emails_list, True)
Now all the data in the Email column is in [square brackets]. I want to remove those brackets.
You probably have lists as values to each row in the column 'Email'. You can try the below code to take the first element of the list, and replace the original list with it.
df['Email'] = df['Email'].map(lambda x: x[0] if len(x) > 0 else '')
The above code takes each cell value of the column and checks whether it has non-zero length. If it does, it replaces the list in the cell with the list's first element; otherwise, it replaces it with an empty string.
This should help. If the issue persists, please check the type and shape of 'emails_list'.
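A small runnable sketch of that fix, with invented names and addresses:

```python
import pandas as pd

df = pd.DataFrame({'Name': ['Ann', 'Bob']})
emails_list = [['ann@example.com'], []]  # one list per row; second is empty
df.insert(1, 'Email', emails_list, True)

# Replace each list with its first element (or '' when the list is empty)
df['Email'] = df['Email'].map(lambda x: x[0] if len(x) > 0 else '')
print(df['Email'].tolist())  # ['ann@example.com', '']
```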

Get dummy variables from a string column full of mess

I'm a less-than-a-week beginner in Python and Data sciences, so please forgive me if these questions seem obvious.
I've scraped data on a website, but the result is unfortunately not very well formatted and I can't use it without transformation.
My Data
I have a string column which contains a lot of features that I would like to convert into dummy variables.
Example of string : "8 équipements & optionsextérieur et châssisjantes aluintérieurBluetoothfermeture électrique5 placessécuritékit téléphone main libre bluetoothABSautreAPPUI TETE ARclimatisation"
What I would like to do
I would like to create a dummy colum "Bluetooth" which would be equal to one if the pattern "bluetooth" is contained in the string, and zero if not.
I would like to create an other dummy column "Climatisation" which would be equal to one if the pattern "climatisation" is contained in the string, and zero if not.
...etc
And do it for 5 or 6 patterns which interest me.
What I have tried
I wanted to use a match test with regular expressions and combine it with the pd.get_dummies method.
import re
import pandas as pd

def match(My_pattern, My_strng):
    m = re.search(My_pattern, My_strng)
    if m:
        return True
    else:
        return False

pd.get_dummies(df["My messy strings colum"], ...)
I haven't succeeded in finding how to set the pd.get_dummies arguments to apply the test I want to the column.
I was even wondering whether this is the best strategy, or whether it would be easier to create other parallel columns and apply match.group() to my messy strings to populate them.
Not sure I would know how to program that anyway.
Thanks for your help
I think one way to do this would be:
df.loc[df['My messy strings colum'].str.contains("bluetooth", na=False),'Bluetooth'] = 1
df.loc[~(df['My messy strings colum'].str.contains("bluetooth", na=False)),'Bluetooth'] = 0
df.loc[df['My messy strings colum'].str.contains("climatisation", na=False),'Climatisation'] = 1
df.loc[~(df['My messy strings colum'].str.contains("climatisation", na=False)),'Climatisation'] = 0
The tilde (~) represents not, so the condition is reversed in this case to string does not contain.
na=False means that if your messy column contains any null values, these will not cause an error; they will simply be treated as not meeting the condition.
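The same idea can be rolled into a loop over the patterns of interest, casting straight to 0/1. A sketch with made-up rows; case=False is added here (an assumption) since the scraped text mixes upper and lower case:

```python
import pandas as pd

df = pd.DataFrame({'My messy strings colum': [
    'jantes alu Bluetooth climatisation', 'ABS appui tete', None]})

# Build one 0/1 column per pattern; na=False maps missing values to 0
for pattern, col in [('bluetooth', 'Bluetooth'), ('climatisation', 'Climatisation')]:
    df[col] = (df['My messy strings colum']
               .str.contains(pattern, case=False, na=False)
               .astype(int))

print(df[['Bluetooth', 'Climatisation']].values.tolist())  # [[1, 1], [0, 0], [0, 0]]
```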

Remove list type in columns while preserving list structure

I have two columns that from the way my data was pulled are in lists. This may be a really easy question, I just haven't found the exactly correct way to create the result I'm looking for.
I need the "a" column to be a string without the [] and the "b" column to be integers separated by commas, if that's possible.
I've tried this code to convert to a string:
df['a'] = df['a'].astype(str)
but it failed to produce what I need.
What I need the output to look like is:
a b
hbhprecision.com 123,1234,12345,123456
thomsonreuters.com 1234,12345,123456
etc.
Please help and thank you very much in advance!
For the first part, removing the brackets [ ]:
df['c_u'].apply(lambda x: x.strip("['").strip("']"))
for the second part (assuming you removed your brackets as well), splitting the values across columns:
df['tawgs.db_id'].str.split(',', expand=True)
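If the cells hold actual Python lists rather than strings that look like lists, a join-based approach avoids strip entirely. A sketch using the sample domains from the question and invented integers:

```python
import pandas as pd

df = pd.DataFrame({
    'a': [['hbhprecision.com'], ['thomsonreuters.com']],
    'b': [[123, 1234, 12345, 123456], [1234, 12345, 123456]],
})

# 'a': take the single string out of each one-element list
df['a'] = df['a'].apply(lambda x: x[0])
# 'b': join the integers into one comma-separated string
df['b'] = df['b'].apply(lambda x: ','.join(str(n) for n in x))

print(df['b'].tolist())  # ['123,1234,12345,123456', '1234,12345,123456']
```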

Define variable number of columns in for loop

I am new to pandas and I am creating new columns based on conditions from other existing columns using the following code:
df.loc[(df.item1_existing=='NO') & (df.item1_sold=='YES'),'unit_item1']=1
df.loc[(df.item2_existing=='NO') & (df.item2_sold=='YES'),'unit_item2']=1
df.loc[(df.item3_existing=='NO') & (df.item3_sold=='YES'),'unit_item3']=1
Basically, what this means is that if item is NOT existing ('NO') and the item IS sold ('YES') then give me a 1. This works to create 3 new columns but I am thinking there is a better way. As you can see, there is a repeated string in the name of the columns: '_existing' and '_sold'. I am trying to create a for loop that will look for the name of the column that ends with that specific word and concatenate the beginning, something like this:
unit_cols = ['item1','item2','item3']
for i in unit_cols:
    df.loc[('df.'+i+'_existing'=='NO') & ('df'+i+'_sold'=='YES'),'unit_'+i]=1
but of course, it doesn't work. As I said, I am able to make it work with the initial example, but I would like to have fewer lines of code instead of repeating the same code because I need to create several columns this way, not just three. Is there a way to make this easier? is the for loop the best option? Thank you.
You can use Boolean series, i.e. True / False depending on whether your condition is met. Coupled with pd.Series.eq and f-strings (PEP498, Python 3.6+), and using __getitem__ (or its syntactic sugar []) to allow string inputs, you can write your logic more readably:
unit_cols = ['item1','item2','item3']
for i in unit_cols:
    df[f'unit_{i}'] = df[f'{i}_existing'].eq('NO') & df[f'{i}_sold'].eq('YES')
If you need integers (1 / 0) instead of Boolean values, you can convert via astype:
df[f'unit_{i}'] = df[f'unit_{i}'].astype(int)
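Putting the loop and the cast together, a minimal runnable sketch with two invented items:

```python
import pandas as pd

df = pd.DataFrame({
    'item1_existing': ['NO', 'YES'], 'item1_sold': ['YES', 'YES'],
    'item2_existing': ['NO', 'NO'],  'item2_sold': ['NO', 'YES'],
})

unit_cols = ['item1', 'item2']
for i in unit_cols:
    # Boolean AND of the two conditions, cast straight to 1/0
    df[f'unit_{i}'] = (df[f'{i}_existing'].eq('NO') & df[f'{i}_sold'].eq('YES')).astype(int)

print(df[['unit_item1', 'unit_item2']].values.tolist())  # [[1, 0], [0, 1]]
```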

Remove Rows that Contain a specific Value anywhere in the row (Pandas, Python 3)

I am trying to remove all rows in a pandas DataFrame that contain the symbol "+" anywhere in the row. So ideally this:
Keyword
+John
Mary+Jim
David
would become
Keyword
David
I've tried doing something like this in my code but it doesn't seem to be working.
excluded = ('+')
removal2 = removal[~removal['Keyword'].isin(excluded)]
The problem is that sometimes the + is contained within a word, at the beginning of a word, or at the end. Any ideas how to help? Do I need to use an index function? Thank you!
Use the vectorised str method contains and pass the escaped pattern r'\+' ('+' is a regex special character); negate the boolean condition by using ~:
In [29]:
df[~df.Keyword.str.contains(r'\+')]
Out[29]:
Keyword
2 David
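An alternative sketch that sidesteps regex escaping entirely by passing regex=False, using the question's sample rows:

```python
import pandas as pd

removal = pd.DataFrame({'Keyword': ['+John', 'Mary+Jim', 'David']})

# regex=False treats '+' as a literal character, so no escaping is needed
removal2 = removal[~removal['Keyword'].str.contains('+', regex=False)]
print(removal2['Keyword'].tolist())  # ['David']
```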
