Pandas: How to extract a string from another string - python

I have a column of 8,000 rows, and I need to create a new column whose value is extracted from the existing column.
The strings look like this:
TP-ETU06-01-525-W-133
I want to create two new columns from this string: the first should hold the second segment, which is ETU06, and the second should hold the last segment, which is 133.
I have done this by using:
df["sys_no"] = df.apply(lambda x:x["test_no"].split("-")[1] if (pd.notnull(x["test_no"]) and x["test_no"]!="" and len(x["test_no"].split("-"))>0) else None,axis=1)
df["package_no"] = df.apply(lambda x:x["test_no"].split("-")[-1] if (pd.notnull(x["test_no"]) and x["test_no"]!="" and len(x["test_no"].split("-"))>0) else None,axis=1)
It actually works fine, but the existing column occasionally contains a random string that doesn't follow the pattern, and in those cases I want to leave the new columns empty.
How should I change my script?
Thank you

Use Series.str.contains to build a mask, split the values with Series.str.split, and then take the second and last elements with the str indexer, assigning only to the rows selected by the mask:
print(df)
                 test_no
0              temp data
1                    NaN
2  TP-ETU06-01-525-W-133
mask = df["test_no"].str.contains('-', na=False)
splitted = df["test_no"].str.split("-")
df.loc[mask, "sys_no"] = splitted[mask].str[1]
df.loc[mask, "package_no"] = splitted[mask].str[-1]
print(df)
                 test_no sys_no package_no
0              temp data    NaN        NaN
1                    NaN    NaN        NaN
2  TP-ETU06-01-525-W-133  ETU06        133

This approach uses regex and named capture groups to find and extract the strings of interest, in just two lines of code.
Benefit of regex over split:
It is true that regex is not required here. However, from a data-validation standpoint, using regex helps to prevent 'stray' data from creeping in. A 'blind' split() simply splits the data on a character; if the source data changes, the split function is blind to this. Regex, by contrast, will highlight the issue because the pattern simply won't match. Yes, you may end up with missing values or an error - but this is a good thing, as you'll be alerted to a data format change and get the opportunity to address the issue or update the regex pattern.
Additionally, regex provides a robust solution because the pattern must match the entire string; anything that doesn't fit the pattern is ignored - like the stray values mentioned in the question.
If you'd like some explanation on the regex pattern itself, just add a comment and I'll update the answer to explain.
Sample Data:
                 test_no
0  TP-ETU05-01-525-W-005
1  TP-ETU06-01-525-W-006
2  TP-ETU07-01-525-W-007
3  TP-ETU08-01-525-W-008
4  TP-ETU09-01-525-W-009
5                    NaN
6                    NaN
7             otherstuff
Code:
import re
exp = re.compile(r'^[A-Z]{2}-(?P<sys_no>[A-Z]{3}\d{2})-\d{2}-\d{3}-[A-Z]-(?P<package_no>\d{3})$')
df[['sys_no', 'package_no']] = df['test_no'].str.extract(exp, expand=True)
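As a side note, the same pattern can be written with re.VERBOSE so that each segment is documented inline (a sketch, equivalent to the compiled pattern above; the output below is the same either way):
exp = re.compile(r'''
    ^[A-Z]{2}-                   # two-letter prefix plus hyphen, e.g. "TP-"
    (?P<sys_no>[A-Z]{3}\d{2})    # three letters and two digits, e.g. "ETU06"
    -\d{2}-\d{3}-[A-Z]-          # middle segments, e.g. "-01-525-W-"
    (?P<package_no>\d{3})$       # three trailing digits, e.g. "133"
''', re.VERBOSE)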
Output:
                 test_no sys_no package_no
0  TP-ETU05-01-525-W-005  ETU05        005
1  TP-ETU06-01-525-W-006  ETU06        006
2  TP-ETU07-01-525-W-007  ETU07        007
3  TP-ETU08-01-525-W-008  ETU08        008
4  TP-ETU09-01-525-W-009  ETU09        009
5                    NaN    NaN        NaN
6                    NaN    NaN        NaN
7             otherstuff    NaN        NaN
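A follow-up to the validation point above: str.extract returns NaN for rows that don't fit the pattern rather than raising an error, so one way (a sketch, using the column names from this answer) to surface a format change is to flag non-null inputs that produced no match:
# hypothetical check: non-null test_no that yielded no sys_no means the pattern failed
unmatched = df['test_no'].notna() & df['sys_no'].isna()
if unmatched.any():
    print(df.loc[unmatched, 'test_no'])  # here this would show row 7, 'otherstuff'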

Related

Pandas column sometimes has one value, sometimes a list. Extract first element

I have a pandas column that looks like this:
0                      info@bakkersfinedrycleaning.com
1                             service@mobileagency.com
2                                                  NaN
3                                                  NaN
4                                                  NaN
5                                                  NaN
6    sales@sourcefurniture.com, support@sourcefurni...
7                                                  NaN
8                                                  NaN
9                           service@allfloridapool.com
I am trying to extract the email if there is only one, and the first email of the list if there is more than one. I can't get it to work.
Here is what I have so far, and I don't know why it is giving this error:
ValueError: Columns must be same length as key
df1['Alternate_Email_1_Explorium__c'] = df1['Contact email address'].fillna('~')
df1['Alternate_Email_1_Explorium__c'] = df1['Alternate_Email_1_Explorium__c'].str.extract(r'(.+)|^(.+?),')
Any help on what I'm doing wrong would be appreciated.
Your pattern contains two capture groups, so str.extract returns two columns, which cannot be assigned to a single column - hence the ValueError. The code below finds the first email address in your column, as long as the emails are consistently separated with a ',':
df.fillna('~', inplace = True)
df['first_email_list'] = df[0].apply(lambda x : x.split(',')[0])
You would simply need to replace df[0] with the name of the column that holds the email addresses. This creates a new column called 'first_email_list', but you can change that name to whatever you need.
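Alternatively (a sketch, untested against your exact data), the pandas string accessor propagates NaN on its own, so the '~' placeholder isn't needed:
# .str.split returns NaN for NaN rows, so no fillna is required
df1['Alternate_Email_1_Explorium__c'] = (
    df1['Contact email address'].str.split(',').str[0].str.strip()
)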

Why does it say NaN when I'm trying to sum rows using Python Pandas?

Title  Netquantity  Bases  Chambers
    x                   2         4
    y                             5
To get the totals of Bases and Chambers, I've used
concat_list.loc['Total'] = concat_list[['Bases','Chambers']].sum()
However, the output was
Title  Netquantity  Bases  Chambers
    x                   2         4
    y                             5
Total          NaN    NaN       NaN
Could you help me debug this issue? Some of the numbers are empty. I tried
concat_list = concat_list.fillna(0)
but it still didn't work.
The problem is that the values are strings mixed with numbers, or all strings, so first try converting to floats:
concat_list.loc['Total'] = concat_list[['Bases','Chambers']].astype(float).sum()
If that doesn't work, replace the unparseable values with missing values using to_numeric on each column via DataFrame.apply:
concat_list.loc['Total'] = concat_list[['Bases','Chambers']].apply(pd.to_numeric, errors='coerce').sum()
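A minimal reproduction of the fix, assuming the blank cells are empty strings (an assumption, since the question doesn't show the raw values):
import pandas as pd

concat_list = pd.DataFrame({'Netquantity': ['', ''],
                            'Bases': ['2', ''],
                            'Chambers': ['4', '5']},
                           index=['x', 'y'])
cols = ['Bases', 'Chambers']
# strings can't be summed numerically, so coerce to numbers first, then sum
concat_list.loc['Total'] = concat_list[cols].apply(pd.to_numeric, errors='coerce').sum()
print(concat_list)
#       Netquantity Bases Chambers
# x                     2        4
# y                              5
# Total         NaN   2.0      9.0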

How to split a column with string values like '[title:item][title2:item]' etc. into a dictionary with pandas

I am trying to clean some data in a dataframe. In particular a column that displays like this:
0    [Bean status:Whole][Type of Roast:Medium][Coff...
1          [Type of Roast:Espresso][Coffee Type:Blend]
2    [Bean status:Whole][Type of Roast:Dark][Coffee...
3    [Bean status:Whole][Type of Roast:Light][Coffe...
4                                                  NaN
5    [Roaster:Little City][Type of Roast:Light][Cof...
Name: options, dtype: object
My goal is to split this into four columns and assign the corresponding values so it looks something like this:
    Roaster  Bean Status  Type of Roast    Coffee Type
0       NaN        Whole         Medium          Blend
1       NaN          NaN       Espresso          Blend
..
5  Littl...        Whole          Light  Single Origin
I've tried df.str.split('[', expand=True) but it is not suitable because the options are not always present or in the same position.
My thought was to split the strings into a dictionary, store that dictionary in a new dataframe, and then join the two dataframes together. However, I'm getting lost trying to get the column into a dictionary. Following https://www.fir3net.com/Programming/Python/python-split-a-string-into-a-dictionary.html, I tried this:
roasts = {}
roasts = dict(x.split(':') for x in df['options'][0].split('[]'))
print(roasts)
and I get this error:
ValueError: dictionary update sequence element #0 has length 4; 2 is required
I tried investigating what was going on by storing the result in a list instead:
s = ([x.split(':') for x in df['options'][0].split('[]')])
print(s)
[['[Bean status', 'Whole][Type of Roast', 'Medium][Coffee Type', 'Blend]']]
So I see the code is not splitting the string up how I would like, and I have played around with substituting a single bracket in various positions, without proper results.
Is it possible to get this column into a dictionary or will I have to resort to regex?
Using AmiTavory's sample data
df = pd.DataFrame(dict(options=[
    '[Bean status:Whole][Type of Roast:Medium]',
    '[Type of Roast:Espresso][Coffee Type:Blend]'
]))
Combination of re.findall and str.split
import re
import pandas as pd

pd.DataFrame([
    dict(
        x.split(':')
        for x in re.findall(r'\[(.*?)\]', v)
    )
    for v in df.options
])
  Bean status Coffee Type Type of Roast
0       Whole         NaN        Medium
1         NaN       Blend      Espresso
You might use
df.options.apply(
    lambda s: pd.Series({e.split(':')[0]: e.split(':')[1]
                         for e in s[1:-1].split('][')}))
Example
df = pd.DataFrame(dict(options=[
    '[Bean status:Whole][Type of Roast:Medium]',
    '[Type of Roast:Espresso][Coffee Type:Blend]'
]))
>>> df.options.apply(
...     lambda s: pd.Series({e.split(':')[0]: e.split(':')[1]
...                          for e in s[1:-1].split('][')}))
  Bean status Coffee Type Type of Roast
0       Whole         NaN        Medium
1         NaN       Blend      Espresso
Explanation
Say you start with a string like
s = '[Bean status:Whole][Type of Roast:Medium]'
Then
s[1:-1]
removes the first and last brackets.
Then,
split('][')
splits on the '][' dividers between the entries.
Then,
e.split(':')[0]: e.split(':')[1]
maps, for each entry, the part before the colon to the part after it.
Finally, a Series is created from the resulting dictionary.
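Since the question mentions joining the parsed values back to the original dataframe, a possible follow-up (a sketch; note the real data contains NaN, which the lambda above would choke on, hence the dropna):
parsed = df['options'].dropna().apply(
    lambda s: pd.Series({e.split(':')[0]: e.split(':')[1]
                         for e in s[1:-1].split('][')}))
df = df.join(parsed)  # rows with NaN options stay NaN in the new columns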

Python, regular expressions - search dots in pandas data frame

I have a pandas.DataFrame with a column 'Country'; its head() is below:
0                                                  tmp
1                     Environmental Indicators: Energy
2                                                  tmp
3    Energy Supply and Renewable Electricity Produc...
4                                                  NaN
5                                                  NaN
6                                                  NaN
7    Choose a country from the following drop-down ...
8                                                  NaN
9                                              Country
When I use this line:
energy['Country'] = energy['Country'].str.replace(r'[...]', 'a')
there is no change.
But when I use this line instead:
energy['Country'] = energy['Country'].str.replace(r'[...]', np.nan)
all values become NaN.
Why does only the second line change the output? My goal is to change only the values containing a triple dot.
Is this what you want when you say "I need change whole values, not just the triple dots"?
mask = df.Country.str.contains(r'\.\.\.', na=False)
df.loc[mask, 'Country'] = 'a'
.replace(r'[...]', 'a') treats the first parameter as a regular expression, and inside a character class the dots are literal, so [...] matches any single dot rather than three in a row. To match a literal triple dot, you need .replace(r'\.\.\.', 'a').
As for your actual question: .str.replace requires a string as the second parameter. It attempts to convert np.nan to a string (which is not possible) and fails. For a reason not known to me, instead of raising a TypeError, it returns np.nan for each row.
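Incidentally (a sketch; it assumes a pandas version where str.replace accepts the regex keyword), you can skip the escaping entirely by telling pandas to treat the pattern literally:
energy['Country'] = energy['Country'].str.replace('...', 'a', regex=False)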

How to *extract* latitude and longitude greedily in Pandas?

I have a dataframe in Pandas like this:
            id                           loc
40   100005090         -38.229889,-72.326819
188  100020985     ut: -33.442101,-70.650327
249   10002732     ut: -33.437478,-70.614637
361  100039605   ut: 10.646041,-71.619039 \N
440  100048229           4.666439,-74.071554
I need to extract the GPS points. I first check for a match against a certain regex (found here on SO, see below) to select all cells that have a "valid" lat/long value. However, I also need to extract these numbers and either put them in a series of their own (and then split on the comma) or put them in two new pandas series. I have tried the following for the extraction part:
ids_with_latlong["loc"].str.extract("[-+]?([1-8]?\d(\.\d+)?|90(\.0+)?),\s*[-+]?(180(\.0+)?|((1[0-7]\d)|([1-9]?\d))(\.\d+)?)$")
but it looks, from the output, like the regexp is not doing the matching greedily, because I get something like this:
             0        1    2          3    4   5    6   7        8
40   38.229889  .229889  NaN  72.326819  NaN  72  NaN  72  .326819
188  33.442101  .442101  NaN  70.650327  NaN  70  NaN  70  .650327
Obviously it's matching more than I want (I would just need cols 0, 1, and 4), but simply dropping the rest is too much of a hack for me. Notice that the extract function also got rid of the +/- signs at the beginning. If anyone has a solution, I'd really appreciate it.
@HYRY's answer looks pretty good to me. This is just an alternate approach that uses built-in pandas methods rather than a regex. I think it's a little simpler to read, though I'm not sure it will be sufficiently general for all your cases (it works fine on this sample data, though).
df['loc'] = df['loc'].str.replace('ut: ','')
df['lat'] = df['loc'].apply( lambda x: x.split(',')[0] )
df['lon'] = df['loc'].apply( lambda x: x.split(',')[1] )
          id                    loc         lat         lon
0  100005090  -38.229889,-72.326819  -38.229889  -72.326819
1  100020985  -33.442101,-70.650327  -33.442101  -70.650327
2   10002732  -33.437478,-70.614637  -33.437478  -70.614637
3  100039605   10.646041,-71.619039   10.646041  -71.619039
4  100048229    4.666439,-74.071554    4.666439  -74.071554
As a general suggestion for this type of approach, you might think about doing it in the following steps (see the sketch after this list):
1) remove extraneous characters with replace (or maybe this is where the regex is best)
2) split into pieces
3) check that each piece is valid (all you really need to check is that it's a number, although you could additionally check that it falls into the valid range for a latitude or longitude)
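A sketch of those three steps on this data; the 'ut: ' prefix and the trailing \N remnant are stripped first, and pd.to_numeric doubles as the validity check by coercing anything non-numeric to NaN:
cleaned = (df['loc'].str.replace('ut: ', '', regex=False)
                    .str.replace(r'\s*\\N$', '', regex=True))
parts = cleaned.str.split(',', expand=True)   # step 2: split into pieces
df['lat'] = pd.to_numeric(parts[0], errors='coerce')  # step 3: validate as numbers
df['lon'] = pd.to_numeric(parts[1], errors='coerce')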
You can use (?:) to make a group non-capturing, so it is left out of the extracted columns:
df["loc"].str.extract(r"((?:[\+-])?\d+\.\d+)\s*,\s*((?:[\+-])?\d+\.\d+)")
