I have a pandas DataFrame with a column 'Country'; its first ten rows (head(10)) are below:
0 tmp
1 Environmental Indicators: Energy
2 tmp
3 Energy Supply and Renewable Electricity Produc...
4 NaN
5 NaN
6 NaN
7 Choose a country from the following drop-down ...
8 NaN
9 Country
When I use this line:
energy['Country'] = energy['Country'].str.replace(r'[...]', 'a')
There is no change.
But when I use this line instead:
energy['Country'] = energy['Country'].str.replace(r'[...]', np.nan)
All values are NaN.
Why does only the second line change the output? My goal is to change only the values that contain a triple dot.
Is this what you want when you say "I need change whole values, not just the triple dots"?
mask = df.Country.str.contains(r'\.\.\.', na=False)
df.loc[mask, 'Country'] = 'a'
.replace(r'[...]', 'a') treats the first parameter as a regular expression, where [...] is a character class that matches any single . character. You want to match the three dots literally, so you need .replace(r'\.\.\.', 'a').
As for your actual question: .str.replace requires a string as the second parameter. It attempts to convert np.nan to a string (which is not possible) and fails. For reasons unknown to me, instead of raising a TypeError it returns np.nan for each row.
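A minimal sketch of both fixes on toy data (the Series contents here are made up, and regex=True is passed explicitly because newer pandas versions no longer treat the pattern as a regex by default):
import numpy as np
import pandas as pd

energy = pd.DataFrame({'Country': ['Energy Supply...', 'tmp', np.nan]})

# Replace only the literal triple dot inside each string
energy['replaced'] = energy['Country'].str.replace(r'\.\.\.', 'a', regex=True)

# Or overwrite the whole value with NaN wherever a triple dot occurs
mask = energy['Country'].str.contains(r'\.\.\.', na=False)
energy.loc[mask, 'Country'] = np.nan
print(energy)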
I have a DataFrame like this:
Title Netquantity Bases Chambers
x 2 4
y 5
To get the totals of the Bases and Chambers columns, I used
concat_list.loc['Total'] = concat_list[['Bases','Chambers']].sum()
However, the output was
Title Netquantity Bases Chambers
x 2 4
y 5
Total NaN NaN NaN
Could you help me debug this issue? Some of the numbers are empty. I tried
concat_list = concat_list.fillna(0)
But it still didn't work.
The problem here is that the values are strings mixed with numbers, or all strings. First, try converting to floats:
concat_list.loc['Total'] = concat_list[['Bases','Chambers']].astype(float).sum()
If that doesn't work, try replacing unparseable values with missing values using to_numeric on all columns, via DataFrame.apply:
concat_list.loc['Total'] = concat_list[['Bases','Chambers']].apply(pd.to_numeric, errors='coerce').sum()
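For example, with a made-up frame where the numbers are stored as strings and one cell is empty (this only loosely mirrors the question's data):
import pandas as pd

concat_list = pd.DataFrame({'Bases': ['2', ''], 'Chambers': ['4', '5']}, index=['x', 'y'])

# errors='coerce' turns the unparseable empty string into NaN,
# which sum() then skips
totals = concat_list[['Bases', 'Chambers']].apply(pd.to_numeric, errors='coerce').sum()
concat_list.loc['Total'] = totals
print(concat_list)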
I have a dataset that contains NaN values. These values depend on another variable, and I am trying to clean the data using it. I wrote code to replace the NaN values, but it doesn't work. The code is:
df.loc[(df["house"]=="rented") & (df["car"]=="yes")]["debt"].fillna(2, inplace=True)
From the pandas documentation on DataFrame.loc (https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.loc.html):
"Conditional that returns a boolean Series with column labels specified"
df.loc[df['shield'] > 6, ['max_speed']]
max_speed
sidewinder 7
Based on the documentation, the call should be converted to this pattern:
df.loc[<row filter>, <selected columns>]
Give it a try like this. Note that calling .fillna(2, inplace=True) on the .loc slice itself would operate on a temporary copy and never write back to df, so assign the result instead:
mask = (df["house"] == "rented") & (df["car"] == "yes")
df.loc[mask, "debt"] = df.loc[mask, "debt"].fillna(2)
Or switch from df.loc to a plain loop (checking for NaN so existing debt values are not overwritten):
for idx in df.index:
    if df.loc[idx, "house"] == "rented" and df.loc[idx, "car"] == "yes" and pd.isna(df.loc[idx, "debt"]):
        df.loc[idx, "debt"] = 2
If I understand you correctly, you do not want to fill in all the NaN values. Rather, you'd like to fill the NaN values only when the house is rented and you have a car. To fill all NaN values in the 'debt' column,
df["debt"].fillna(2, inplace=True)
should be used rather than your second line of code.
I have a column of 8,000 rows, and I need to create new columns whose values are extracted from the existing column.
The strings look like this:
TP-ETU06-01-525-W-133
I want to create two new columns: the first takes its value from the second segment of the string (ETU06), and the second from the last segment (133).
I have done this by using:
df["sys_no"] = df.apply(lambda x:x["test_no"].split("-")[1] if (pd.notnull(x["test_no"]) and x["test_no"]!="" and len(x["test_no"].split("-"))>0) else None,axis=1)
df["package_no"] = df.apply(lambda x:x["test_no"].split("-")[-1] if (pd.notnull(x["test_no"]) and x["test_no"]!="" and len(x["test_no"].split("-"))>0) else None,axis=1)
It actually works fine, but the existing column also contains random strings that don't follow this pattern. I want to leave the new columns empty whenever such a random string appears.
How should I change my script?
Thank you.
Use Series.str.contains to build a mask, then split the values with Series.str.split and select the second and last elements by str indexing, assigning only the rows filtered by the mask:
print (df)
test_no
0 temp data
1 NaN
2 TP-ETU06-01-525-W-133
mask = df["test_no"].str.contains('-', na=False)
splitted = df["test_no"].str.split("-")
df.loc[mask, "sys_no"] = splitted[mask].str[1]
df.loc[mask, "package_no"] = splitted[mask].str[-1]
print (df)
test_no sys_no package_no
0 temp data NaN NaN
1 NaN NaN NaN
2 TP-ETU06-01-525-W-133 ETU06 133
This approach uses regex and named capture groups to find and extract the strings of interest, in just two lines of code.
Benefit of regex over split:
It is true that regex is not required. However, from the standpoint of data validation, using regex helps prevent 'stray' data from creeping in. A 'blind' split() simply splits the data on a character; if the source data format changes, the split function is blind to it. A regex, on the other hand, highlights the issue because the pattern simply won't match. You may get missing values or an error message, but that is a good thing: you are alerted to the data format change and get the opportunity to address the issue or update the regex pattern.
Additionally, regex provides a robust solution as the pattern matches the entire string, and anything outside of this pattern is ignored - like the example mentioned in the question.
If you'd like some explanation on the regex pattern itself, just add a comment and I'll update the answer to explain.
Sample Data:
test_no
0 TP-ETU05-01-525-W-005
1 TP-ETU06-01-525-W-006
2 TP-ETU07-01-525-W-007
3 TP-ETU08-01-525-W-008
4 TP-ETU09-01-525-W-009
5 NaN
6 NaN
7 otherstuff
Code:
import re
exp = re.compile(r'^[A-Z]{2}-(?P<sys_no>[A-Z]{3}\d{2})-\d{2}-\d{3}-[A-Z]-(?P<package_no>\d{3})$')
df[['sys_no', 'package_no']] = df['test_no'].str.extract(exp, expand=True)
Output:
test_no sys_no package_no
0 TP-ETU05-01-525-W-005 ETU05 005
1 TP-ETU06-01-525-W-006 ETU06 006
2 TP-ETU07-01-525-W-007 ETU07 007
3 TP-ETU08-01-525-W-008 ETU08 008
4 TP-ETU09-01-525-W-009 ETU09 009
5 NaN NaN NaN
6 NaN NaN NaN
7 otherstuff NaN NaN
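For reference, the pattern breaks down as follows:
# ^[A-Z]{2}-                   two capital letters and a dash ('TP-')
# (?P<sys_no>[A-Z]{3}\d{2})-   three letters plus two digits, captured as sys_no ('ETU06'), then a dash
# \d{2}-\d{3}-[A-Z]-           the middle segments ('01-525-W-')
# (?P<package_no>\d{3})$       three trailing digits, captured as package_no ('005')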
I am trying to replace all the country ISO codes with full country names to keep everything consistent as part of cleaning some data. I managed to find the pycountry package, which helps a ton! There are some fields in the CSV file that are empty, which I believe is causing issues when running my code below.
Also, an additional question, not sure if it's just me, but there are times when read_csv turns empty fields into null/NaN and times when they come through as simply empty strings. I don't really know what went wrong there, but if possible I would like to change all those empty cells into one "thing" or type, for ease of filtering or dropping them.
df = pd.read_csv("file.csv")
#use pycountry to match the Nationalities as actual country names
import pycountry
list_alpha_2 = [i.alpha_2 for i in list(pycountry.countries)]
list_alpha_3 = [i.alpha_3 for i in list(pycountry.countries)]
def country_flag(df):
    if len(df['Nationality']) == 2 and df['Nationality'] in list_alpha_2:
        return pycountry.countries.get(alpha_2=df['Nationality']).name
    elif len(df['Nationality']) == 3 and df['Nationality'] in list_alpha_3:
        return pycountry.countries.get(alpha_3=df['Nationality']).name
    elif len(df['Nationality']) > 3:
        return df['Nationality']
    else:
        return '#N/A'

df['Nationality'] = df.apply(country_flag, axis=1)
df
I was expecting the result to be something like:
0 AF 100 Afghanistan
1 #N/A
2 AUS 140 Australia
3 Germany 400 Germany
The error message I am getting is
TypeError: ("object of type 'float' has no len()", 'occurred at index 0')
Yet, there shouldn't be any float type values in the 'Nationality' column I am working on. I am guessing this is simply the empty/null/NaN values being considered a float type?
One idea is to remove missing values first with Series.dropna and then use Series.apply:
print (df)
Nationality
0 AF
1 NaN
2 AUS
3 Germany
import numpy as np
import pycountry

list_alpha_2 = [i.alpha_2 for i in list(pycountry.countries)]
list_alpha_3 = [i.alpha_3 for i in list(pycountry.countries)]

def country_flag(x):
    if len(x) == 2 and x in list_alpha_2:
        return pycountry.countries.get(alpha_2=x).name
    elif len(x) == 3 and x in list_alpha_3:
        return pycountry.countries.get(alpha_3=x).name
    elif len(x) >= 3:
        return x
    else:
        return np.nan

df['Nationality'] = df['Nationality'].dropna().astype(str).apply(country_flag)
print (df)
Nationality
0 Afghanistan
1 NaN
2 Australia
3 Germany
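As for the side question about empty cells arriving as either NaN or empty strings: one sketch for normalizing them to a single representation before any filtering (assuming empty strings are the only other variant in your file):
import numpy as np

df['Nationality'] = df['Nationality'].replace('', np.nan)
# now every empty cell is NaN, so isna()/dropna() treat them uniformly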
One thing to watch out for: when pandas reads from a data source and tries to automatically assign a data type to a column, it will sometimes assign a different type than you would expect, depending on whether there are empty values in the data source.
A classic example is integer values being converted to floats.
If you have a CSV file with this exact content (note missing value in row 2 of column A):
ColA,ColB
0,2
,1
5,4
then reading the file with
res_df = pandas.read_csv(filename)
will create a dataframe with floats in column A and integers in column B.
This is due to the fact that there is no canonical way to assign an "empty" value to an integer, whereas a float can just be set as NaN (not a number).
But if that value were present, you would get two columns of integers.
Just something to be aware of, as it is easily forgotten; then suddenly you are getting floats instead of integers in your code and are confused about it.
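You can see it by feeding the CSV above straight from a string (io.StringIO stands in for the file here); the Int64 line is one possible workaround using pandas' nullable integer dtype, assuming a reasonably recent pandas:
import io
import pandas

csv_text = "ColA,ColB\n0,2\n,1\n5,4"
res_df = pandas.read_csv(io.StringIO(csv_text))
print(res_df.dtypes)  # ColA: float64 (because of the gap), ColB: int64

# Nullable integers keep the column integral and store the gap as <NA>
res_df = pandas.read_csv(io.StringIO(csv_text), dtype={'ColA': 'Int64'})
print(res_df.dtypes)  # ColA: Int64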
I am trying to clean some data in a dataframe. In particular a column that displays like this:
0 [Bean status:Whole][Type of Roast:Medium][Coff...
1 [Type of Roast:Espresso][Coffee Type:Blend]
2 [Bean status:Whole][Type of Roast:Dark][Coffee...
3 [Bean status:Whole][Type of Roast:Light][Coffe...
4 NaN
5 [Roaster:Little City][Type of Roast:Light][Cof...
Name: options, dtype: object
My goal is to split this into four columns and assign the corresponding value to the columns to look something like this:
Roaster Bean Status Type of Roast Coffee Type
0 NaN Whole Medium Blend
1 NaN NaN Espresso Blend
..
5 Littl... Whole Light Single Origin
I've tried df['options'].str.split('[', expand=True), but it is not suitable because the options are not always present or in the same position.
My thought was to split the strings into a dictionary, store that dictionary in a new dataframe, and then join the two dataframes together. However, I'm getting lost trying to turn the column into a dictionary. I tried the approach from https://www.fir3net.com/Programming/Python/python-split-a-string-into-a-dictionary.html, like so:
roasts = {}
roasts = dict(x.split(':') for x in df['options'][0].split('[]'))
print(roasts)
and I get this error:
ValueError: dictionary update sequence element #0 has length 4; 2 is required
I tried investigating what was going on here by storing to a list instead:
s = ([x.split(':') for x in df['options'][0].split('[]')])
print(s)
[['[Bean status', 'Whole][Type of Roast', 'Medium][Coffee Type', 'Blend]']]
So I see the code is not splitting the string up how I would like, and have played around substituting a single bracket into those various locations without proper results.
Is it possible to get this column into a dictionary or will I have to resort to regex?
Using AmiTavory's sample data
df = pd.DataFrame(dict(options=[
    '[Bean status:Whole][Type of Roast:Medium]',
    '[Type of Roast:Espresso][Coffee Type:Blend]'
]))
Combination of re.findall and str.split
import re
import pandas as pd

pd.DataFrame([
    dict(
        x.split(':')
        for x in re.findall(r'\[(.*?)\]', v)
    )
    for v in df.options
])
Bean status Coffee Type Type of Roast
0 Whole NaN Medium
1 NaN Blend Espresso
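Note that the column in the question also contains NaN, which re.findall cannot take. One sketch for handling that is to parse only the non-null rows, keeping the original index so the NaN rows come back as NaN after a join:
options = df.options.dropna()
parsed = pd.DataFrame(
    [dict(x.split(':') for x in re.findall(r'\[(.*?)\]', v)) for v in options],
    index=options.index,
)
result = df.join(parsed)  # NaN rows get NaN in every new column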
You might use
df.options.apply(
    lambda s: pd.Series({e.split(':')[0]: e.split(':')[1] for e in s[1:-1].split('][')}))
Example
df = pd.DataFrame(dict(options=[
    '[Bean status:Whole][Type of Roast:Medium]',
    '[Type of Roast:Espresso][Coffee Type:Blend]'
]))
>>> df.options.apply(
...     lambda s: pd.Series({e.split(':')[0]: e.split(':')[1] for e in s[1:-1].split('][')}))
Bean status Coffee Type Type of Roast
0 Whole NaN Medium
1 NaN Blend Espresso
Explanation
Say you start with a string like
s = '[Bean status:Whole][Type of Roast:Medium]'
Then
s[1:-1]
removes the opening and closing brackets.
Then,
split('][')
splits on the '][' dividers between the entries.
Then,
e.split(':')[0]: e.split(':')[1]
for each of the splits, maps the first part to the second part.
Finally, create a Series from this.
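Putting those steps together on the example string:
s = '[Bean status:Whole][Type of Roast:Medium]'

inner = s[1:-1]            # 'Bean status:Whole][Type of Roast:Medium'
parts = inner.split('][')  # ['Bean status:Whole', 'Type of Roast:Medium']
d = {e.split(':')[0]: e.split(':')[1] for e in parts}
print(d)                   # {'Bean status': 'Whole', 'Type of Roast': 'Medium'}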