How do I separate two numbers in pandas - Python

How do I separate two numbers if the length of the combined number is 18? What I want to do is separate the mobile number (10 digits) and the landline number (8 digits) when they are joined into a single 18-digit string.
I have tried extracting the first 8 digits, but I don't know how to apply the length condition, and I need to remove the first 8 digits from the original number when the condition is satisfied.
df['Landline'] = df['Number'].str[:8]
I have tried this, but I know it's wrong:
df['Landline'] = df['Number'].apply(lambda x : x.str[:8] if len(x)==18 )

For extracting the first 8 digits, use findall:
df['Number'].str.findall(r'^\d{8}')
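As a hedged alternative sketch, str.extract returns a scalar (or NaN) per row rather than a list, and anchoring the pattern to exactly 18 digits builds the length condition into the regex itself:
# rows that are not exactly 18 digits get NaN in 'Landline'
df['Landline'] = df['Number'].str.extract(r'^(\d{8})\d{10}$', expand=False)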
Solution using an example
Here we use the Dummy Data made in the following section.
# separate landline and mobile numbers
phone_numbers = df.Numbers.str.findall(r'(^\d{8})*(\d{10})').tolist()
# store in a dict
d = dict((i, {'Landline': e[0][0], 'Mobile': e[0][1]}) for i, e in enumerate(phone_numbers))
# make a temporary dataframe
phone_df = pd.DataFrame(d).T
# update original dataframe
df['Landline'] = phone_df['Landline']
df['Mobile'] = phone_df['Mobile']
print(df)
Output:
              Numbers  Landline      Mobile
0  123456780123456789  12345678  0123456789
1          0123456789            0123456789
Dummy Data
df = pd.DataFrame({'Numbers': ['123456780123456789', '0123456789', ]})
print(df)
Output:
              Numbers
0  123456780123456789
1          0123456789

Looks like you need
df['Landline'] = df['Number'].apply(lambda x : x[:8] if len(x)==18 else x)
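As a hedged alternative sketch (column name 'Number' as in the question, assumed to hold strings), a boolean mask with .loc fills Landline only for the 18-digit rows and avoids the else branch entirely:
import pandas as pd

df = pd.DataFrame({'Number': ['123456780123456789', '0123456789']})
is_joined = df['Number'].str.len() == 18   # True where landline + mobile are concatenated
df.loc[is_joined, 'Landline'] = df.loc[is_joined, 'Number'].str[:8]
df['Mobile'] = df['Number'].str[-10:]      # the mobile number is always the last 10 digits
print(df)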

Related

Pandas remove duplicates within the list of values and identify IDs that share the same values

I have a pandas dataframe.
I used to have duplicate test_no values, so I removed the duplicates with
df['test_no'] = df['test_no'].apply(lambda x: ','.join(set(x.split(','))))
but, as you can see, the duplicates are still there. I think it's due to extra spaces, and I want to clean them up.
Part 1:
               my_id                 test_no
0  10000000000055910          461511, 461511
1  10000000000064510                  528422
2  10000000000064222  528422,528422 , 528421
3  10000000000161538      433091.0, 433091.0
4  10000000000231708                 nan,nan
Expected Output
               my_id         test_no
0  10000000000055910          461511
1  10000000000064510          528422
2  10000000000064222  528422, 528421
3  10000000000161538        433091.0
4  10000000000231708             nan
Part 2:
I also want to check whether any of the my_id values share any of the test_no values;
for example:
my_id matched_myid
10000000000064222 10000000000064510
You can use a regex to split:
import re
df['test_no'] = df['test_no'].apply(lambda x: ','.join(set(re.split(r',\s*', x))))
# or
df['test_no'] = [','.join(set(re.split(r',\s*', x))) for x in df['test_no']]
If you want to keep the original order use dict.fromkeys in place of set.
If the duplicates are successive you can also use:
df['test_no'] = df['test_no'].str.replace(r'([^,\s]+),\s*\1', r'\1', regex=True)
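Part 2 is not covered above; a hedged sketch (assuming test_no has already been cleaned as in Part 1) is to explode the comma-separated values and self-merge on them, so my_id pairs that share a test number end up in the same row:
exploded = (df.assign(test_no=df['test_no'].str.split(','))
              .explode('test_no'))
exploded = exploded[exploded['test_no'] != 'nan']        # ignore the 'nan' placeholder
pairs = exploded.merge(exploded, on='test_no')           # rows that share a test_no
pairs = pairs[pairs['my_id_x'] != pairs['my_id_y']]      # drop self-matches
print(pairs[['my_id_x', 'my_id_y']].drop_duplicates())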

Pandas remove every entry with a specific value

I would like to go through every row (entry) in my df and remove every entry that has the value "" (which, yes, is an empty string).
So if my data set is:
Name  Gender  Age
Jack          5
Anna  F       6
Carl  M       7
Jake  M       7
Therefore Jack would be removed from the dataset.
On another note, I would also like to remove entries that have the values "Unspecified" and "Undetermined" as well.
Eg:
Name  Gender  Age  Address
Jack          5    *address*
Anna  F       6    *address*
Carl  M       7    Undetermined
Jake  M       7    Unspecified
Now,
Jack will be removed due to empty field.
Carl will be removed due to the value Undetermined present in a column.
Jake will be removed due to the value Unspecified present in a column.
For now, this has been my approach, but I keep getting a TypeError.
list = []
for i in df.columns:
    if df[i] == "":
        # every time there is an empty string, add 1 to list
        list.append(1)
# count list to see how many entries there are with empty string
len(list)
Please help me with this. I would prefer a for loop, since there are about 22 columns and 9,000+ rows in my actual dataset.
Note - I do understand that there are other questions like this; it's just that none of them applies to my situation, since most of them only work for a few columns and I do not wish to hardcode all 22 columns.
Edit - Thank you for all your feedbacks, you all have been incredibly helpful.
To delete a row based on a condition use the following:
df = df.drop(df[condition].index)
For example:
df = df.drop(df[df.Age == 5].index) will drop the rows where Age is 5.
I've come across a post regarding the same dating back to 2017; it should help you understand this more clearly.
Regarding question 2, here's how to remove rows with the specified values in a given column:
df = df[~df["Address"].isin(("Undetermined", "Unspecified"))]
Let's assume we have a Pandas DataFrame object df.
To remove every row matching your conditions, build boolean masks with | (element-wise or) rather than the Python or keyword, and invert the combined mask:
df = df[~((df.Gender == "") | (df.Age == "") | df.Address.isin(["", "Undetermined", "Unspecified"]))]
If the unspecified fields are NaN, you can also do:
df = df.dropna(how="any", axis=0)
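Building on that, a hedged sketch that first converts the unwanted markers to NaN, so a single dropna pass removes all three kinds of rows (assumes the sample frame from the question):
import numpy as np

df = df.replace(["", "Undetermined", "Unspecified"], np.nan).dropna(how="any", axis=0)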
The answers from @ThatCSFresher and @Bence will help you remove rows based on a single column, which is great! However, your query has multiple conditions that need to be checked across multiple columns at once, so apply with a lambda can do the job; try the following code:
df = pd.DataFrame({"Name":["Jack","Anna","Carl","Jake"],
"Gender":["","F","M","M"],
"Age":[5,6,7,7],
"Address":["address","address","Undetermined","Unspecified"]})
df["Noise_Tag"] = df.apply(lambda x: "Noise" if ("" in list(x)) or ("Undetermined" in list(x)) or ("Unspecified" in list(x)) else "No Noise",axis=1)
df1 = df[df["Noise_Tag"] == "No Noise"]
del df1["Noise_Tag"]
# Output of df;
   Name Gender  Age       Address Noise_Tag
0  Jack          5       address      Noise
1  Anna      F    6       address   No Noise
2  Carl      M    7  Undetermined      Noise
3  Jake      M    7   Unspecified      Noise
# Output of df1;
   Name Gender  Age  Address
1  Anna      F    6  address
Well, the OP actually wants to delete any row with an "empty" string.
df = df[~(df=="").any(axis=1)] # deletes all rows that have empty string in any column.
If you want to delete specifically for address column, then you can just delete using
df = df[~df["Address"].isin(("Undetermined", "Unspecified"))]
Or, to drop rows where any column contains Undetermined or Unspecified, adapt the first solution in my post, replacing the empty string with those values:
df = df[~((df=="Undetermined") | (df=="Unspecified")).any(axis=1)]
You can build masks and then filter the df according to it:
m1 = df.eq('').any(axis=1)
# m1 is True if any cell in a row has an empty string
m2 = df['Address'].isin(['Undetermined', 'Unspecified'])
# m2 is True if a row has one of the values in the list in column 'Address'
out = df[~m1 & ~m2] # invert both condition and get the desired output
print(out)
Output:
   Name Gender  Age    Address
1  Anna      F    6  *address*
Used Input:
df = pd.DataFrame({'Name': ['Jack', 'Anna', 'Carl', 'Jake'],
                   'Gender': ['', 'F', 'M', 'M'],
                   'Age': [5, 6, 7, 7],
                   'Address': ['*address*', '*address*', 'Undetermined', 'Unspecified']})
Using a lambda function
Code:
df[df.apply(lambda x: False if (x.Address in ['Undetermined', 'Unspecified'] or '' in list(x)) else True, axis=1)]
Output:
   Name Gender  Age    Address
1  Anna      F    6  *address*

Multi-part manipulation post str.split() Pandas

I have a subset of data (single column) we'll call ID:
ID
0 07-1401469
1 07-89556629
2 07-12187595
3 07-381962
4 07-99999085
The current format is (usually) YY-[up to 8-character ID].
The desired output format is a more uniformed YYYY-xxxxxxxx:
ID
0 2007-01401469
1 2007-89556629
2 2007-12187595
3 2007-00381962
4 2007-99999085
Knowing that I've done padding in the past, the thought process was to combine
df['id'].str.split('-').str[0].apply(lambda x: '{0:20>4}'.format(x))
df['id'].str.split('-').str[1].apply(lambda x: '{0:0>8}'.format(x))
However I ran into a few problems:
The '20' in '{0:20>4}' must be a singular value and not a string
Trying to do something like the below just results in df['id'] taking the properties of the last lambda, and trying any other way to combine multiple apply/lambda calls just didn't work. I started going down the pad left/right route, but that seemed to be taking me backwards.
df['id'] = (df['id'].str.split('-').str[0].apply(lambda x: '{0:X>4}'.format(x)).str[1].apply(lambda x: '{0:0>8}'.format(x)))
The current solution I have (but HATE because it's long, messy, and just not clean IMO) is:
df['idyear'] = df['id'].str.split('-').str[0].apply(lambda x: '{:X>4}'.format(x)) # Split on '-' and pad with X
df['idyear'] = df['idyear'].str.replace('XX', '20') # Replace XX with 20 to conform to YYYY
df['idnum'] = df['id'].str.split('-').str[1].apply(lambda x: '{0:0>8}'.format(x)) # Pad 0s up to 8 digits
df['id'] = df['idyear'].map(str) + "-" + df['idnum'] # Merge idyear and idnum to remake id
del df['idnum'] # delete extra
del df['idyear'] # delete extra
Which does work
ID
0 2007-01401469
1 2007-89556629
2 2007-12187595
3 2007-00381962
4 2007-99999085
But my questions are:
Is there a way to run multiple apply() functions in a single line so I'm not making temp variables?
Is there a better way than replacing 'XX' with '20'?
I feel like this entire code block can be compressed to 1 or 2 lines; I just don't know how. Everything I've seen on SO and in the Pandas documentation relates to singular manipulation so far.
One option is to split; then use str.zfill to pad '0's. Also prepend '20's before splitting, since you seem to need it anyway:
tmp = df['ID'].radd('20').str.split('-')
df['ID'] = tmp.str[0] + '-'+ tmp.str[1].str.zfill(8)
Output:
ID
0 2007-01401469
1 2007-89556629
2 2007-12187595
3 2007-00381962
4 2007-99999085
I'd do it in two steps, using .str.replace:
df["ID"] = df["ID"].str.replace(r"^(\d{2})-", r"20\1-", regex=True)
df["ID"] = df["ID"].str.replace(r"-(\d+)", lambda g: f"-{g[1]:0>8}", regex=True)
print(df)
Prints:
ID
0 2007-01401469
1 2007-89556629
2 2007-12187595
3 2007-00381962
4 2007-99999085
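Both steps can also be collapsed into a single hedged one-liner sketch, assuming every ID matches the 'YY-digits' shape, by using a callable replacement that pads the second group:
df["ID"] = df["ID"].str.replace(
    r"^(\d{2})-(\d+)$",
    lambda m: f"20{m[1]}-{m[2]:0>8}",  # prepend '20' and zero-pad the ID part to 8 digits
    regex=True,
)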

Extract columns from string

I have a pandas df column containing the following strings:
0 Future(conId=462009617, symbol='CGB', lastTradeDateOrContractMonth='20211220', multiplier='1000', currency='CAD', localSymbol='CGBZ21', tradingClass='CGB')
1 Stock(conId=80268543, symbol='IJPA', exchange='AEB', currency='EUR', localSymbol='IJPA', tradingClass='IJPA')
2 Stock(conId=153454120, symbol='EMIM', exchange='AEB', currency='EUR', localSymbol='EMIM', tradingClass='EMIM')
I would like to extract data from strings and organize it as columns. As you can see, not all rows contain the same data and they are not in the same order. I only need some of the columns; this is the expected output:
Type conId symbol localSymbol
0 Future 462009617 CGB CGBZ21
1 Stock 80268543 IJPA IJPA
2 Stock 153454120 EMIM EMIM
I made some tests with str.extract but couldn't get what I want.
Any ideas on how to achieve it?
Thanks
You could try this using string methods. Assuming that the strings are stored in a column named 'main_col':
df["Type"] = df.main_col.str.split("(", expand = True)[0]
df["conId"] = df.main_col.str.partition("conId=")[2].str.partition(",")[0]
df["symbol"] = df.main_col.str.partition("symbol=")[2].str.partition(",")[0]
df["localSymbol"] = df.main_col.str.partition("localSymbol=")[2].str.partition(",")[0]
One solution using pandas.Series.str.extract (as you tried using it):
>>> df
col
0 Future(conId=462009617, symbol='CGB', lastTradeDateOrContractMonth='20211220', multiplier='1000', currency='CAD', localSymbol='CGBZ21', tradingClass='CGB')
1 Stock(conId=80268543, symbol='IJPA', exchange='AEB', currency='EUR', localSymbol='IJPA', tradingClass='IJPA')
2 Stock(conId=153454120, symbol='EMIM', exchange='AEB', currency='EUR', localSymbol='EMIM', tradingClass='EMIM')
>>> df.col.str.extract(r"^(?P<Type>Future|Stock).*conId=(?P<conId>\d+).*symbol='(?P<symbol>[A-Z]+)'.*localSymbol='(?P<localSymbol>[A-Z0-9]+)'")
Type conId symbol localSymbol
0 Future 462009617 CGB CGBZ21
1 Stock 80268543 IJPA IJPA
2 Stock 153454120 EMIM EMIM
In the above, I assume that:
Type takes the two values Future or Stock
conId consists of digits
symbol consists of capital letters
localSymbol consists of digits and capital letters
You may want to adapt the pattern to better fit your needs.
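If the set of wanted keys grows, a hedged, more generic sketch (column name 'col' as above is an assumption) is to pull every key=value pair with extractall and pivot them into columns:
pairs = df['col'].str.extractall(r"(\w+)='?([^',)]+)'?")   # key/value pairs, quotes optional
wide = (pairs.rename(columns={0: 'key', 1: 'value'})
             .reset_index()
             .pivot(index='level_0', columns='key', values='value'))
# the leading word before '(' is the Type; join it with the wanted columns
out = df['col'].str.extract(r'^(?P<Type>\w+)\(').join(wide[['conId', 'symbol', 'localSymbol']])
print(out)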

python - Replace first five characters in a column with asterisks

I have a column called SSN in a CSV file with values like this
289-31-9165
I need to loop through the values in this column and replace the first five characters so it looks like this
***-**-9165
Here's the code I have so far:
emp_file = "Resources/employee_data1.csv"
emp_pd = pd.read_csv(emp_file)
new_ssn = emp_pd["SSN"].str.replace([:5], "*")
emp_pd["SSN"] = new_ssn
How do I loop through the values and replace just the first five numbers (only) with asterisks, keeping the hyphens as they are?
Similar to Mr. Me's answer, this instead removes the first six characters and replaces them with your masked format:
emp_pd["SSN"] = emp_pd["SSN"].apply(lambda x: "***-**" + x[6:])
You can simply achieve this with the replace() method.
Example dataframe (borrowed from @AkshayNevrekar):
>>> df
ssn
0 111-22-3333
1 121-22-1123
2 345-87-3425
Result:
>>> df.replace(r'^\d{3}-\d{2}', "***-**", regex=True)
ssn
0 ***-**-3333
1 ***-**-1123
2 ***-**-3425
OR
>>> df.ssn.replace(r'^\d{3}-\d{2}', "***-**", regex=True)
0 ***-**-3333
1 ***-**-1123
2 ***-**-3425
Name: ssn, dtype: object
OR:
df['ssn'] = df['ssn'].str.replace(r'^\d{3}-\d{2}', "***-**", regex=True)
Put your asterisks in front, then grab the last 4 digits. Note the .str accessor, which slices each string rather than slicing the rows:
new_ssn = '***-**-' + emp_pd["SSN"].str[-4:]
You can use a regex:
import re

df = pd.DataFrame({'ssn': ['111-22-3333', '121-22-1123', '345-87-3425']})

def func(x):
    return re.sub(r'\d{3}-\d{2}', '***-**', x)

df['ssn'] = df['ssn'].apply(func)
print(df)
Output:
ssn
0 ***-**-3333
1 ***-**-1123
2 ***-**-3425
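As one more hedged alternative sketch, Series.str.slice_replace swaps the first six characters (three digits, hyphen, two digits) for the masked prefix in a single vectorized call:
emp_pd["SSN"] = emp_pd["SSN"].str.slice_replace(0, 6, "***-**")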
