Get 10 Digit Number - python

I'm trying to define a function that will create a column and clean the numbers down to their ten-digit area code and number. The DataFrame:
PNum1
0 18888888888
1 1999999999
2 +++(112)31243134
I have all the individual functions and even stored them into a DataFrame and Dictionary.
def GetGoodNumbers(col):
    column = col.copy()
    Cleaned = column.replace(r'\D+', '', regex=True)
    NumberCount = Cleaned.astype(str).str.len()
    FirstNumber = Cleaned.astype(str).str[0]
    SummaryNum = {'Number': Cleaned, 'First': FirstNumber, 'Count': NumberCount}
    df = pd.DataFrame(data=SummaryNum)
    return df
which returns:
Count First Number
0 11 1 18888888888
1 10 1 1999999999
2 11 1 11231243134
How can I loop through the dataframe column and return a new column that will:
-remove all non-digits.
-get the length (which will be usually 10 or 11)
-If length is 11, return the right 10 digits.
The desired output:
number
1231243134
1999999999
8888888888

You can remove every non-digit and slice the last 10 characters (in recent pandas versions, pass regex=True explicitly, since str.replace no longer treats the pattern as a regex by default):
df.PNum1.str.replace(r'\D+', '', regex=True).str[-10:]
0 8888888888
1 1999999999
2 1231243134
Name: PNum1, dtype: object
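If you also want to guard against values that clean to fewer than 10 digits (a case the question's sample data does not include), one option is a fullmatch check after slicing. A sketch; the extra '12345' row is added here only to illustrate:

```python
import pandas as pd

df = pd.DataFrame({'PNum1': ['18888888888', '1999999999', '+++(112)31243134', '12345']})

# Strip non-digits, then keep the last 10 characters.
cleaned = df['PNum1'].str.replace(r'\D+', '', regex=True).str[-10:]

# Optionally mask anything that did not clean to exactly 10 digits.
valid = cleaned.where(cleaned.str.fullmatch(r'\d{10}'))
```

Here '12345' becomes NaN in `valid` rather than passing through as a short number.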

Modifying pandas row value based on its length

I have a column in my pandas dataframe with the following values that represent hours worked in a week.
0 40
1 40h / week
2 46.25h/week on average
3 11
I would like to check every row, and if the length of the value is larger than 2 characters, extract only the number of hours from it.
I have tried the following:
df['Hours_per_week'].apply(lambda x: (x.extract('(\d+)') if(len(str(x)) > 2) else x))
However, I am getting AttributeError: 'str' object has no attribute 'extract'.
It looks like you could ensure having h after the number:
df['Hours_per_week'].str.extract(r'(\d{2}\.?\d*)h', expand=False)
Output:
0 NaN
1 40
2 46.25
3 NaN
Name: Hours_per_week, dtype: object
Assuming the series data are strings, try this:
df['Hours_per_week'].str.extract(r'(\d+)')
Why not extract the float pattern directly, i.e. \d+\.?\d+?
>>> s = pd.Series(['40', '40h / week', '46.25h/week on average', '11'])
>>> s.str.extract(r"(\d+\.?\d+)")
0
0 40
1 40
2 46.25
3 11
2 digits will still match either way.
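To keep the plain-number rows as well (they come out NaN with the h-anchored pattern above), one small tweak is to use \d+\.?\d* instead, which also allows single-digit values, and convert the result to float. A sketch:

```python
import pandas as pd

s = pd.Series(['40', '40h / week', '46.25h/week on average', '11'])

# \d+\.?\d* tolerates single-digit values (e.g. '7'), unlike \d+\.?\d+.
hours = s.str.extract(r'(\d+\.?\d*)', expand=False).astype(float)
```

Every row then yields a numeric value: [40.0, 40.0, 46.25, 11.0].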

iterating large pandas DataFrame too slow

I have a large dataframe where I would like to make a new column based on existing columns.
test = pd.DataFrame({'Test1': ["100", "4242", "3454", "2", "54"]})
test['Test2'] = ""
for i in range(0, len(test)):
    if len(test.iloc[i, 0]) == 4:
        test.iloc[i, -1] = test.iloc[i, 0][0:1]
    elif len(test.iloc[i, 0]) == 3:
        test.iloc[i, -1] = test.iloc[i, 0][0]
    elif len(test.iloc[i, 0]) < 3:
        test.iloc[i, -1] = 0
    else:
        test.iloc[i, -1] = np.nan
This is working for a small dataframe, but when I have a large data set, (10+ million rows), it is taking way too long. How can I make this process faster?
Use the str.len method to find the lengths of the strings in 'Test1', then use np.select to assign the relevant slice of 'Test1' (or a default value) to 'Test2'.
import numpy as np

lengths = test['Test1'].str.len()
test['Test2'] = np.select(
    [lengths == 4, lengths == 3, lengths < 3],
    [test['Test1'].str[0:1], test['Test1'].str[0], 0],
    np.nan,
)
Output:
Test1 Test2
0 100 1
1 4242 4
2 3454 3
3 2 0
4 54 0
Note that [0:1] only returns the first character (same as [0]), so maybe you meant [0:2] (or something else); otherwise you can save one condition there.
So, basically you want to extract the first character of the string if it is at least 3 characters long. (NB: for a string, [0] and [0:1] yield exactly the same thing.)
Just use a regex with a lookahead for that.
test['Test2'] = test['Test1'].str.extract('^(.)(?=..)').fillna(0)
output:
Test1 Test2
0 100 1
1 4242 4
2 3454 3
3 2 0
4 54 0
How the regex works:
^ # match beginning of string
(.) # capture one character
(?=..) # only if it is followed by at least two characters
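The pattern can be sanity-checked with plain re before applying it to a frame (a quick sketch on the question's values):

```python
import re

pattern = re.compile(r'^(.)(?=..)')

def first_char_if_long(s):
    # Returns the first character only when at least two more characters follow.
    m = pattern.match(s)
    return m.group(1) if m else None

results = [first_char_if_long(s) for s in ['100', '4242', '2', '54']]
```

'100' and '4242' yield their first character; '2' and '54' are too short and yield None (which str.extract would surface as NaN, hence the fillna(0) above).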

Python - count successive leading digits on a pandas row string without counting non successive digits

I need to create a new column that counts the number of leading 0s, however I am getting errors trying to do so.
I extracted data from Mongo using the regex [\^0[0]*[1-9][0-9]*\] and saved it to a csv file. This is all "Sequences" that start with a 0.
df['Sequence'].str.count('0')
and
df['Sequence'].str.count('0[0]*[1-9][0-9]')
give the results below. As you can see, both counts also include non-leading 0s, i.e. they return the total number of 0s.
Sequence 0s
0 012312312 1
1 024624624 1
2 036901357 2
3 002486248 2
4 045074305 3
5 080666140 3
I also tried writing it using loops, which worked when testing, but when using it on the data frame I encounter IndexError: string index out of range:
results = []
for item in df['Sequence']:
    count = 0
    index = 0
    while item[index] == "0":
        count = count + 1
        index = index + 1
    results.append(count)
df['0s'] = results
df
In short: if I can get 2 for 001230 instead of 3, I can save the results in a column to do my stats on.
You can use extract with the ^(0*) regex to match only the leading zeros. Then use str.len to get the length.
df['0s'] = df['sequence'].str.extract(r'^(0*)', expand=False).str.len()
Example input:
df = pd.DataFrame({'sequence': ['12040', '01230', '00010', '00120']})
Output:
sequence 0s
0 12040 0
1 01230 1
2 00010 3
3 00120 2
You can use this regex:
'^0+'
the ^ means the match must start at the beginning of the string.
the + means the preceding token must occur one or more times.
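Combined with str.extract and str.len, that regex yields the count directly (a sketch on sample data; an empty match comes back as NaN, hence the fillna):

```python
import pandas as pd

df = pd.DataFrame({'Sequence': ['012312312', '002486248', '123456789']})
df['0s'] = (
    df['Sequence']
    .str.extract(r'(^0+)', expand=False)  # the leading-zero run, NaN if none
    .str.len()
    .fillna(0)
    .astype(int)
)
```

This counts only the leading zeros: 1, 2 and 0 for the three sample rows.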
IIUC, you want to count the number of leading 0s, right? Take advantage of the fact that leading 0s disappear when a numeric string is converted to int. Here's one solution:
df['leading 0s'] = df['Sequence'].str.len() - df['Sequence'].astype(int).astype(str).str.len()
Output:
Sequence leading 0s
0 012312312 1
1 024624624 1
2 036901357 1
3 002486248 2
4 045074305 1
5 080666140 1
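One caveat with the int round-trip (my note, not part of the answer above): a value that is all zeros keeps one digit after the conversion, so its count comes out one short. A sketch:

```python
import pandas as pd

s = pd.Series(['012312312', '000'])
counts = s.str.len() - s.astype(int).astype(str).str.len()
# '000' -> int 0 -> '0': the length difference is 2, though it has 3 leading zeros.
```

If all-zero sequences can occur in the data, one of the regex-based answers is safer.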
Try str.findall:
df['0s'] = df['Sequence'].str.findall(r'^0*').str[0].str.len()
print(df)
# Output:
Sequence 0s
0 012312312 1
1 024624624 1
2 036901357 1
3 002486248 2
4 045074305 1
5 080666140 1

Remove leading zeroes pandas

For example I have such a data frame
import pandas as pd
nums = {'amount': ['0324','S123','0010', None, '0030', 'SA40', 'SA24']}
df = pd.DataFrame(nums)
And I need to remove all leading zeroes and replace Nones with zeros.
I did it with loops, but for large frames it is not fast enough.
I'd like to rewrite it using vectorized operations.
You can try str.replace (pass regex=True explicitly in recent pandas versions):
df['amount'].str.replace(r'^(0+)', '', regex=True).fillna('0')
0 324
1 S123
2 10
3 0
4 30
5 SA40
6 SA24
Name: amount, dtype: object
df['amount'] = df['amount'].str.lstrip('0').fillna(value='0')
There is already a nice answer from @Epsi95, though you can also try a character set with regex:
>>> df['amount'].str.replace(r'^[0]*', '', regex=True).fillna('0')
0 324
1 S123
2 10
3 0
4 30
5 SA40
6 SA24
Explanation:
^[0]*
^ asserts position at the start of the string.
[0] matches a single character from the set (here just 0).
* matches the previous token zero or more times, as many times as possible (greedy).
Step by step :
Remove all leading zeros:
Use str.lstrip which returns a copy of the string with leading characters removed (based on the string argument passed).
Here,
df['amount'] = df['amount'].str.lstrip('0')
For more, see https://www.programiz.com/python-programming/methods/string/lstrip
Replace None with zeros:
Use fillna, which also works with missing values other than None.
Here,
df['amount'].fillna(value='0')
And for more : https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.fillna.html
Result in one line:
df['amount'] = df['amount'].str.lstrip('0').fillna(value='0')
If you need to ensure single 0 or the last 0 is not removed, you can use:
df['amount'] = df['amount'].str.replace(r'^(0+)(?!$)', '', regex=True).fillna('0')
The (?!$) lookahead ensures the matched substring (the leading zeroes) does not include the last character, thus effectively keeping the final 0.
Demo
Input Data
nums = {'amount': ['0324','S123','0010', None, '0030', 'SA40', 'SA24', '0', '000']}
df = pd.DataFrame(nums)
amount
0 0324
1 S123
2 0010
3 None
4 0030
5 SA40
6 SA24
7 0 <== Added a single 0 here
8 000 <== Added a sequence of all 0's here
Output
print(df)
amount
0 324
1 S123
2 10
3 0
4 30
5 SA40
6 SA24
7 0 <== Single 0 is not removed
8 0 <== Last 0 is kept

Python parse dataframe element

I have a pandas dataframe column (Data Type) which I want to split into three columns
target_table_df = LoadS_A [['Attribute Name',
'Data Type',
'Primary Key Indicator']]
Example input (target_table_df)
Attribute Name Data Type Primary Key Indicator
0 ACC_LIM DECIMAL(18,4) False
1 ACC_NO NUMBER(11,0) False
2 ACC_OPEN_DT DATE False
3 ACCB DECIMAL(18,4) False
4 ACDB DECIMAL(18,4) False
5 AGRMNT_ID NUMBER(11,0) True
6 BRNCH_NUM NUMBER(11,0) False
7 CLRD_BAL DECIMAL(18,4) False
8 CR_INT_ACRD_GRSS DECIMAL(18,4) False
9 CR_INT_ACRD_NET DECIMAL(18,4) False
I aim to:
Reassign 'Data Type' to the text preceding the parenthesis.
If a parenthesis exists in 'Data Type':
-Create a new column 'Precision' and assign it the first comma-separated value.
-Create a new column 'Scale' and assign it the second comma-separated value.
Intended output would therefore become:
Data Type Precision Scale
0 decimal 18 4
1 number 11 0
2 date
3 decimal 18 4
4 decimal 18 4
5 number 11 0
I have tried hard to achieve this, but I'm new to dataframes and can't work out whether I should iterate over all rows or whether there is a way to apply this to all values in the dataframe at once.
Any help much appreciated
Use target_table_df['Data Type'].str.extract(pattern)
You'll need to assign pattern to be a regular expression that captures each of the components you're looking for.
pattern = r'([^\(]+)(\(([^,]*),(.*)\))?'
([^\(]+) says grab as many non-open parenthesis characters you can up to the first open parenthesis.
\(([^,]*), says to grab the first set of non-comma characters after an open parenthesis and stop at the comma.
,(.*)\) says to grab the rest of the characters between the comma and the close parenthesis.
(\(([^,]*),(.*)\))? says the whole parenthesis thing may not even happen, grab it if you can.
Solution
everything together looks like this:
pattern = r'([^\(]+)(\(([^,]*),(.*)\))?'
df = s.str.extract(pattern, expand=True).iloc[:, [0, 2, 3]]
# Formatting to get it how you wanted
df.columns = ['Data Type', 'Precision', 'Scale']
df.index.name = None
print(df)
I put a .iloc[:, [0, 2, 3]] at the end because the pattern I used grabs the whole parenthesis in column 1 and I wanted to skip it. Leave it off and see.
Data Type Precision Scale
0 decimal 18 4
1 number 11 0
2 date NaN NaN
3 decimal 18 4
4 decimal 18 4
5 number 11 0
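A variant of the same idea (a sketch, not from the answer above) uses named capture groups, so str.extract labels the columns directly and the positional iloc step is unnecessary:

```python
import pandas as pd

s = pd.Series(['DECIMAL(18,4)', 'NUMBER(11,0)', 'DATE'])

# Named groups become column names; (?:...) keeps the optional
# parenthesis part from producing an extra column.
pattern = r'(?P<DataType>[^(]+)(?:\((?P<Precision>[^,]*),(?P<Scale>[^)]*)\))?'
out = s.str.extract(pattern)
```

Rows without a parenthesis, such as DATE, simply get NaN in Precision and Scale.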
