Python parse dataframe element

I have a pandas DataFrame column ('Data Type') that I want to split into three columns:
target_table_df = LoadS_A[['Attribute Name',
                           'Data Type',
                           'Primary Key Indicator']]
Example input (target_table_df)
Attribute Name Data Type Primary Key Indicator
0 ACC_LIM DECIMAL(18,4) False
1 ACC_NO NUMBER(11,0) False
2 ACC_OPEN_DT DATE False
3 ACCB DECIMAL(18,4) False
4 ACDB DECIMAL(18,4) False
5 AGRMNT_ID NUMBER(11,0) True
6 BRNCH_NUM NUMBER(11,0) False
7 CLRD_BAL DECIMAL(18,4) False
8 CR_INT_ACRD_GRSS DECIMAL(18,4) False
9 CR_INT_ACRD_NET DECIMAL(18,4) False
I aim to:
Reassign 'Data Type' to the text preceding the parenthesis
[..if a parenthesis exists in 'Data Type']:
Create a new column 'Precision' and assign it the first comma-separated value
Create a new column 'Scale' and assign it the second comma-separated value
Intended output would therefore become:
Data Type Precision Scale
0 decimal 18 4
1 number 11 0
2 date
3 decimal 18 4
4 decimal 18 4
5 number 11 0
I have tried hard to achieve this, but I'm new to DataFrames. I can't work out whether I should iterate over all rows or whether there is a way to apply this to every value in the DataFrame?
Any help much appreciated

Use target_table_df['Data Type'].str.extract(pattern)
You'll need to assign pattern to be a regular expression that captures each of the components you're looking for.
pattern = r'([^\(]+)(\(([^,]*),(.*)\))?'
([^\(]+) says grab as many non-open parenthesis characters you can up to the first open parenthesis.
\(([^,]*), says to grab the first set of non-comma characters after an open parenthesis and stop at the comma.
,(.*)\) says to grab the rest of the characters between the comma and the close parenthesis.
(\(([^,]*),(.*)\))? says the whole parenthesis thing may not even happen, grab it if you can.
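To sanity-check each capture group on its own, you can run the pattern through Python's standard `re` module before handing it to pandas (a quick sketch, independent of any DataFrame):

```python
import re

pattern = r'([^\(]+)(\(([^,]*),(.*)\))?'

# With a parenthesis: groups 1, 3, 4 hold type, precision, scale
m = re.fullmatch(pattern, 'DECIMAL(18,4)')
print(m.group(1), m.group(3), m.group(4))  # DECIMAL 18 4

# Without a parenthesis: the optional groups come back as None
m = re.fullmatch(pattern, 'DATE')
print(m.group(1), m.group(3), m.group(4))  # DATE None None
```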
Solution
Everything together looks like this:
pattern = r'([^\(]+)(\(([^,]*),(.*)\))?'
df = target_table_df['Data Type'].str.extract(pattern, expand=True).iloc[:, [0, 2, 3]]
# Formatting to get it how you wanted
df.columns = ['Data Type', 'Precision', 'Scale']
df.index.name = None
print(df)
I put a .iloc[:, [0, 2, 3]] at the end because the pattern I used captures the whole parenthesized group in column 1, which I wanted to skip. Leave it off and see the difference.
Data Type Precision Scale
0 decimal 18 4
1 number 11 0
2 date NaN NaN
3 decimal 18 4
4 decimal 18 4
5 number 11 0
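Putting the whole answer into a self-contained sketch (reconstructing a few rows of the question's frame; column names as in the question):

```python
import pandas as pd

target_table_df = pd.DataFrame({
    'Attribute Name': ['ACC_LIM', 'ACC_NO', 'ACC_OPEN_DT'],
    'Data Type': ['DECIMAL(18,4)', 'NUMBER(11,0)', 'DATE'],
    'Primary Key Indicator': [False, False, True],
})

pattern = r'([^\(]+)(\(([^,]*),(.*)\))?'
# groups 0, 2, 3 are the type, precision, and scale; group 1 is the
# whole parenthesized chunk, which we skip
parts = target_table_df['Data Type'].str.extract(pattern, expand=True).iloc[:, [0, 2, 3]]
parts.columns = ['Data Type', 'Precision', 'Scale']
print(parts)
```

Rows with no parenthesis (such as DATE) get NaN in Precision and Scale, matching the output above.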

Related

Remove leading zeroes pandas

For example, I have a data frame like this:
import pandas as pd
nums = {'amount': ['0324','S123','0010', None, '0030', 'SA40', 'SA24']}
df = pd.DataFrame(nums)
And I need to remove all leading zeroes and replace Nones with zeros.
I did it with loops, but for large frames it is not fast enough.
I'd like to rewrite it using vectorized operations.
You can try str.replace:
df['amount'].str.replace(r'^(0+)', '', regex=True).fillna('0')
0 324
1 S123
2 10
3 0
4 30
5 SA40
6 SA24
Name: amount, dtype: object
df['amount'] = df['amount'].str.lstrip('0').fillna(value='0')
I see there is already a nice answer from @Epsi95, though you can also try a character set with regex:
>>> df['amount'].str.replace(r'^[0]*', '', regex=True).fillna('0')
0 324
1 S123
2 10
3 0
4 30
5 SA40
6 SA24
Explanation:
^[0]*
^ asserts position at the start of the string
[0] matches a single character in the set (just the character 0)
* matches the previous token zero or more times, as many times as possible (greedy)
Step by step :
Remove all leading zeros:
Use str.lstrip which returns a copy of the string with leading characters removed (based on the string argument passed).
Here,
df['amount'] = df['amount'].str.lstrip('0')
For more, see https://www.programiz.com/python-programming/methods/string/lstrip
Replace None with zeros:
Use fillna, which works with values other than None as well.
Here,
df['amount'].fillna(value='0')
And for more : https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.fillna.html
Result in one line:
df['amount'] = df['amount'].str.lstrip('0').fillna(value='0')
If you need to ensure single 0 or the last 0 is not removed, you can use:
df['amount'] = df['amount'].str.replace(r'^(0+)(?!$)', '', regex=True).fillna('0')
The (?!$) lookahead ensures the matched leading zeroes do not include the last 0, effectively keeping the final 0.
Demo
Input Data
nums = {'amount': ['0324','S123','0010', None, '0030', 'SA40', 'SA24', '0', '000']}
df = pd.DataFrame(nums)
amount
0 0324
1 S123
2 0010
3 None
4 0030
5 SA40
6 SA24
7 0 <== Added a single 0 here
8 000 <== Added a sequence of all 0's here
Output
print(df)
amount
0 324
1 S123
2 10
3 0
4 30
5 SA40
6 SA24
7 0 <== Single 0 is not removed
8 0 <== Last 0 is kept
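For reference, the lstrip and lookahead variants side by side in one runnable sketch (sample values adapted from the demo above; note how lstrip collapses a bare '0' to an empty string, while the lookahead keeps it):

```python
import pandas as pd

nums = {'amount': ['0324', 'S123', '0010', None, '0', '000']}
df = pd.DataFrame(nums)

# lstrip removes every leading '0', so a bare '0' collapses to ''
stripped = df['amount'].str.lstrip('0').fillna('0')

# the (?!$) lookahead refuses to consume the final character,
# so a lone '0' (or the last of '000') survives
kept = df['amount'].str.replace(r'^(0+)(?!$)', '', regex=True).fillna('0')

print(stripped.tolist())  # ['324', 'S123', '10', '0', '', '']
print(kept.tolist())      # ['324', 'S123', '10', '0', '0', '0']
```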

Manipulate Dataframe Series

I have a dataframe and I want to change some element of a column based on a condition.
In particular given this column:
... VALUE ....
0
"1076A"
12
9
"KKK0139"
5
I want to obtain this:
... VALUE ....
0
"1076A"
12
9
"0139"
5
In the 'VALUE' column there are both strings and numbers. When I find a particular substring in a string value, I want to keep the same value without that substring.
I have tried:
1) df['VALUE'] = np.where(df['VALUE'].str.contains('KKK', na=False), df['VALUE'].str[3:], df['VALUE'])
2) df.loc[df['VALUE'].str.contains('KKK', na=False), 'VALUE'] = df['VALUE'].str[3:]
But both of these attempts return an IndexError: invalid index to scalar variable.
Any advice?
As the column contains both numeric (non-string) and string values, you cannot use .str.replace(), which handles strings only; non-string elements would be converted to NaN by str.replace(). You have to use .replace() instead.
Here, you can use:
df['VALUE'] = df['VALUE'].replace(r'KKK', '', regex=True)
Input:
data = {'VALUE': [0, "1076A", 12, 9, "KKK0139", 5]}
df = pd.DataFrame(data)
Result:
0 0
1 1076A
2 12
3 9
4 0139
5 5
Name: VALUE, dtype: object
If you use .str.replace(), you will get:
Note the NaN values result for numeric values (not of string type)
0 NaN
1 1076A
2 NaN
3 NaN
4 0139
5 NaN
Name: VALUE, dtype: object
In general, if you want to remove leading alphabet substring, you can use:
df['VALUE'] = df['VALUE'].replace(r'^[A-Za-z]+', '', regex=True)
If you first cast everything to string, .str.replace() also works:
>>> df['VALUE'].astype(str).str.replace(r'KKK', '', regex=True)
0 0
1 1076A
2 12
3 9
4 0139
5 5
Name: VALUE, dtype: object
Your second solution fails because you also need to apply the row selector to the right side of your assignment.
df.loc[df['VALUE'].str.contains('KKK', na=False), 'VALUE'] = df.loc[df['VALUE'].str.contains('KKK', na=False), 'VALUE'].str[3:]
Looking at your sample data, if 'K' is the only problem character, just replace it with an empty string:
df['VALUE'].str.replace('K', '')
0 0
1 "1076A"
2 12
3 9
4 "0139"
5 5
Name: VALUE, dtype: object
If you want to do it only for specific occurrences or positions of 'K', you can do that as well.
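To see the difference between the two approaches concretely, a small sketch with the question's sample data (Series.replace touches only the string cells and leaves the numbers alone):

```python
import pandas as pd

df = pd.DataFrame({'VALUE': [0, '1076A', 12, 9, 'KKK0139', 5]})

# Series.replace with regex=True leaves non-string cells untouched
clean = df['VALUE'].replace(r'KKK', '', regex=True)
print(clean.tolist())  # [0, '1076A', 12, 9, '0139', 5]
```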

Get 10 Digit Number

I'm trying to define a function that will create a column and clean the numbers down to just their ten-digit area code and number. The DataFrame:
PNum1
0 18888888888
1 1999999999
2 +++(112)31243134
I have all the individual functions and have even stored the results in a DataFrame and a dictionary.
def GetGoodNumbers(col):
    column = col.copy()
    Cleaned = column.replace(r'\D+', '', regex=True)
    NumberCount = Cleaned.astype(str).str.len()
    FirstNumber = Cleaned.astype(str).str[0]
    SummaryNum = {'Number': Cleaned, 'First': FirstNumber, 'Count': NumberCount}
    df = pd.DataFrame(data=SummaryNum)
    return df
returns
Count First Number
0 11 1 18888888888
1 10 3 3999999999
2 11 2 11231243134
How can I loop through the dataframe column and return a new column that will:
-remove all non-digits
-get the length (usually 10 or 11)
-if the length is 11, return the rightmost 10 digits
The desired output:
number
1231243134
1999999999
8888888888
You can remove every non-digit and slice the last 10 digits.
df.PNum1.str.replace(r'\D+', '', regex=True).str[-10:]
0 8888888888
1 1999999999
2 1231243134
Name: PNum1, dtype: object
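As a self-contained sketch with the question's sample values:

```python
import pandas as pd

df = pd.DataFrame({'PNum1': ['18888888888', '1999999999', '+++(112)31243134']})

# strip every non-digit character, then keep the last ten digits;
# a 10-digit number is unaffected, an 11-digit one loses its leading digit
number = df['PNum1'].str.replace(r'\D+', '', regex=True).str[-10:]
print(number.tolist())  # ['8888888888', '1999999999', '1231243134']
```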

python pandas data frame replace ends of string values to another character

I want to replace the end of each string value in one column with another character. Here, I want to convert the last character of every string to '0'. The values in the 'Code' column are strings.
e.g
Code
1 11-1111
2 12-2231
3 12-1014
4 15-0117
5 16-2149
to
Code
1 11-1110
2 12-2230
3 12-1010
4 15-0110
5 16-2140
What method can I use?
One way could be:
df.Code = df.Code.str[:-1] + '0'
You get
Code
1 11-1110
2 12-2230
3 12-1010
4 15-0110
5 16-2140
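A regex alternative, replacing whatever the last character is with '0' (equivalent for these values; a sketch):

```python
import pandas as pd

df = pd.DataFrame({'Code': ['11-1111', '12-2231', '12-1014']})

# .$ matches the single last character of each string
df['Code'] = df['Code'].str.replace(r'.$', '0', regex=True)
print(df['Code'].tolist())  # ['11-1110', '12-2230', '12-1010']
```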

How to replace an entire cell with NaN on pandas DataFrame

I want to replace the entire cell that contains the word (as circled in the picture) with blank or NaN. However, when I try to replace, for example, '1.25 Dividend', it turns out as '1.25 NaN'. I want the whole cell to become NaN. Any idea how to do this?
Option 1
Use a regular expression in your replace
df.replace('^.*Dividend.*$', np.nan, regex=True)
From comments
(Using regex=True) means that the problem is interpreted as a regular-expression one. You still need an appropriate pattern. The '^' anchors the match at the beginning of the string, so '^.*' matches all characters from the start. '$' anchors at the end, so '.*$' matches all characters up to the end. Together, '^.*Dividend.*$' matches a whole string with 'Dividend' somewhere in the middle, and the entire match is replaced with np.nan.
Consider the dataframe df
df = pd.DataFrame([[1, '2 Dividend'], [3, 4], [5, '6 Dividend']])
df
0 1
0 1 2 Dividend
1 3 4
2 5 6 Dividend
then the proposed solution yields
0 1
0 1 NaN
1 3 4.0
2 5 NaN
Option 2
Another alternative is to use pd.DataFrame.mask in conjunction with applymap.
Pass a lambda to applymap that identifies whether a cell contains 'Dividend':
df.mask(df.applymap(lambda s: 'Dividend' in s if isinstance(s, str) else False))
0 1
0 1 NaN
1 3 4
2 5 NaN
Option 3
Similar in concept but using stack/unstack + pd.Series.str.contains
df.mask(df.stack().astype(str).str.contains('Dividend').unstack())
0 1
0 1 NaN
1 3 4
2 5 NaN
Replace all strings:
df.apply(lambda x: pd.to_numeric(x, errors='coerce'))
I would use applymap like this (with np.nan so the result is a real missing value rather than the string 'NaN'):
df.applymap(lambda x: np.nan if (isinstance(x, str) and 'Dividend' in x) else x)
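For comparison, options 1 and 2 on the sample frame in one runnable sketch (both produce real missing values via np.nan):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([[1, '2 Dividend'], [3, 4], [5, '6 Dividend']])

# Option 1: regex replace across the whole frame
opt1 = df.replace('^.*Dividend.*$', np.nan, regex=True)

# Option 2: mask with a cell-wise predicate
opt2 = df.mask(df.applymap(lambda s: 'Dividend' in s if isinstance(s, str) else False))

print(opt1[1].tolist())
print(opt2[1].tolist())
```

Both leave column 1 as [NaN, 4, NaN]; only the resulting dtypes can differ between the two approaches.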
