Creating a year column in Pandas - python

I'm trying to create a year column with the year taken from the title column in my dataframe. This code works, but the column dtype is object. For example, in row 1 the year displays as [2013].
How can I do this so that the column dtype is float instead?
year_list = []
for i in range(title_length):
    year = re.findall(r'\d{4}', wine['title'][i])
    year_list.append(year)
wine['year'] = year_list
Here is the head of my dataframe:
country designation points province title year
Italy Vulkà Bianco 87 Sicily Nicosia 2013 Vulkà Bianco [2013]

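re.findall returns a list of results, which is why each cell holds a list. Use re.search instead, which returns only the first match (note that this raises a TypeError for any title without a four-digit number):
wine['year'] = [re.search(r'\d{4}', title)[0] for title in wine['title']]
Better yet, use the pandas str.extract method. The pattern must contain a capturing group:
wine['year'] = wine['title'].str.extract(r'(\d{4})')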
Definition
Series.str.extract(pat, flags=0, expand=True)
For each subject string in the Series, extract groups from the first match of regular expression pat.

Instead of re.findall that returns a list of strings, you may use str.extract():
wine['year'] = wine['title'].str.extract(r'\b(\d{4})\b')
Or, in case you only want to match years in the 1900s and 2000s:
wine['year'] = wine['title'].str.extract(r'\b((?:19|20)\d{2})\b')
Note that the pattern in str.extract must contain at least one capturing group; its value is used to populate the new column. Only the first match is considered, so you may need to make the pattern more specific if a title can contain several four-digit numbers.
I suggest using word boundaries \b around the \d{4} pattern to match 4-digit chunks as whole words and avoid partial matches in strings like 1234567890.
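Neither answer shows the conversion to float that the question asks for. A minimal sketch, assuming a small made-up `wine` frame (the titles below are hypothetical sample data, and the intermediate name `year_text` is mine): extract the year as text with `expand=False` so a Series comes back, then cast it.

```python
import pandas as pd

# Hypothetical sample data modelled on the question's `title` column.
wine = pd.DataFrame({
    "title": [
        "Nicosia 2013 Vulkà Bianco",
        "Some Winery Reserve NV",        # no year in this title
        "Another Estate 1998 Riesling",
    ]
})

# expand=False returns a Series; rows without a match become NaN.
year_text = wine["title"].str.extract(r"\b((?:19|20)\d{2})\b", expand=False)

# Casting to float keeps NaN for titles that had no year.
wine["year"] = year_text.astype(float)
```

Float is a convenient target here because it can hold NaN for titles without a year, which a plain integer column cannot.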

Related

Extract substring from string and apply to entire dataframe column

I have a pandas dataframe with a bunch of urls in a column, eg
URL
www.myurl.com/python/us/learnpython
www.myurl.com/python/en/learnpython
www.myurl.com/python/fr/learnpython
.........
I want to extract the country code from each URL and put it in a new column called Country, containing us, en, fr and so on. I'm able to do this on a single string, e.g.
url = 'www.myurl.com/python/us/learnpython'
country = url.split("python/")
country = country[1]
country = country.split("/")
country = country[0]
How do I go about applying this to the entire column, creating a new column with the required data in the process? I've tried variations of this with a for loop without success.
Assuming the URLs would always have this format, we can just use str.extract here:
df["cc_code"] = df["URL"].str.extract(r'/([a-z]{2})/')
If the country code always appears after the second slash, it's better to just split the string, passing a value for n (the maxsplit parameter), and take only the piece you are interested in. You can then assign the result to a new column:
>>> df['URL'].str.split('/',n=2).str[-1].str.split('/', n=1).str[0]
0 us
1 en
2 fr
Name: URL, dtype: object
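For comparison, the asker's own two-step split can be applied to the whole column almost unchanged. This is only a sketch, assuming every URL contains the literal text "python/"; the intermediate name `after_marker` is made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "URL": [
        "www.myurl.com/python/us/learnpython",
        "www.myurl.com/python/en/learnpython",
        "www.myurl.com/python/fr/learnpython",
    ]
})

# Same two steps as the single-string version, applied column-wide:
# keep what follows "python/", then keep what precedes the next "/".
after_marker = df["URL"].str.split("python/", n=1).str[1]
df["Country"] = after_marker.str.split("/", n=1).str[0]
```

Rows that lack "python/" would end up as NaN rather than raising an error, so it may be worth checking for missing values afterwards.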

How to split a dataframe column into 2 new columns, by slicing the all strings before the last item and last item

I have a dataframe that has a column which contains addresses. I would like to split the addresses so that the last word is in a column Ending and everything before it is in a separate column Beginning. The addresses vary in length, e.g.:
Main Street
Jon Smith Close
The Rovers Avenue
After searching different resources I came up with the following
new_address_df['begining'], new_address_df['ending'] = new_address_df['street'].str.split().str[:-1].apply(lambda x: ' '.join(map(str, x))), new_address_df['street'].str.split().str[-1]
The code works, but I am not sure if it's the right way to write this in Python. Another option would have been to convert to a list, modify the data in list form and then convert back to a dataframe. I guess this might not be the best approach.
Is there a way to improve the above code if it's not Pythonic?
There are certainly a lot of ways of doing this :) I would go for the str accessor and rpartition. rpartition splits each string into three components: the part before the last occurrence of the separator, the separator itself, and the part after it. If you take just the first and last components, you should be done.
df[["begining", "ending"]]=df.street.str.rpartition(" ")[[0,2]]
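To see what rpartition produces before assigning it, here is a small sketch. The one-word street "Broadway" is an added, hypothetical row to show the edge case, and the columns are assigned one at a time (rather than with a list of new column names) purely to keep the example simple.

```python
import pandas as pd

df = pd.DataFrame({"street": ["Main Street", "Jon Smith Close", "Broadway"]})

# rpartition returns three columns: 0 = text before the last space,
# 1 = the space itself, 2 = text after it.
parts = df["street"].str.rpartition(" ")

# Column names follow the spelling used in the question.
df["begining"] = parts[0]
df["ending"] = parts[2]
```

For a street with no space at all, the first two components are empty strings and the whole value lands in the last component, so "Broadway" gets an empty begining.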
You could also use a regular expression for this, as follows:
import pandas as pd
df = pd.DataFrame({"street":["Main Street","Jon Smith Close","The Rovers Avenue"]})
df2 = df.street.str.extract(r"(?P<Beginning>.+)\s(?P<Ending>\S+)")
df = pd.concat([df,df2],axis=1)
print(df)
output
              street   Beginning  Ending
0        Main Street        Main  Street
1    Jon Smith Close   Jon Smith   Close
2  The Rovers Avenue  The Rovers  Avenue
Explanation: I used named capturing groups, which makes str.extract return a pandas.DataFrame with columns of those names, and then concatenated it with the original df using axis=1. In the pattern, the two groups are separated by a single whitespace character (\s). The Beginning group allows any characters, while the Ending group allows only non-whitespace (\S) characters, so Ending captures the last word.

Python - str.match for each string in a dataframe

I'm trying to use str.match to match a phrase exactly, but for each word in each row's string. I want to return the row's index number for the correct row, which is why I'm using str.match instead of regex.
I want to return the index for the row that contains exactly 'FL', not 'FLORIDA'. The problem with using str.contains though, is that it returns to me the index of the row with 'FLORIDA'.
import pandas as pd
data = [['Alex in FL','ten'],['Bob in FLORIDA','five'],['Will in GA','three']]
df = pd.DataFrame(data,columns=['Name','Age'])
df.index[df['Name'].str.contains('FL')]
df.index[df['Name'].str.match('FL')]
Here's what the dataframe looks like:
Name Age
0 Alex in FL ten
1 Bob in FLORIDA five
2 Will in GA three
The output should be returning the index of row 0:
Int64Index([0], dtype='int64')
Use contains with word boundaries:
import pandas as pd
data = [['Alex in FL','ten'],['Bob in FLORIDA','five'],['Will in GA','three']]
df = pd.DataFrame(data,columns=['Name','Age'])
print(df.index[df['Name'].str.contains(r'\bFL\b')])
Output
Int64Index([0], dtype='int64')
Try:
df[df.Name.str.contains(r'\bFL\b', regex=True)]
OR
df[['FL' in i for i in df.Name.str.split(r'\s')]]
Output:
Name Age
0 Alex in FL ten
According to the docs, str.contains treats the pattern ("FL" in your case) as a regular expression and checks whether it occurs anywhere in the string. Since "FLORIDA" does contain that substring, it matches.
One way around this would be to search for " FL " (padded with spaces), but you would also need to pad each of the values with spaces (for when "FL" is at the start or end of the string).
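The space-padding idea can be sketched as follows, reusing the data from the question. The names `padded`, `mask` and `matches` are mine, and `regex=False` is passed so that " FL " is treated as a plain substring.

```python
import pandas as pd

data = [["Alex in FL", "ten"], ["Bob in FLORIDA", "five"], ["Will in GA", "three"]]
df = pd.DataFrame(data, columns=["Name", "Age"])

# Pad every value so that a leading or trailing "FL" is also
# surrounded by spaces, then look for the padded token.
padded = " " + df["Name"] + " "
mask = padded.str.contains(" FL ", regex=False)

# Index labels of the rows where "FL" appears as a separate word.
matches = df.index[mask]
```

This only recognises words delimited by spaces; a value such as "Alex in FL, USA" would be missed, which is where the \b word-boundary pattern is more forgiving.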

Pattern Match in List of Strings, Create New Column in pandas

I have a pandas dataframe with the following general format:
id,product_name_extract
1,00012CDN
2,14311121NDC
3,NDC37ba
4,47CD27
I also have a list of product codes I would like to match (unfortunately, I have to do NLP extraction, so it will not be a clean match) and then create a new column with the matching list value:
product_name = ['12CDN','21NDC','37ba','7CD2']
id,product_name_extract,product_name_mapped
1,00012CDN,12CDN
2,14311121NDC,21NDC
3,NDC37ba,37ba
4,47CD27,7CD2
I am not too worried about there being collisions.
This would be easy enough if I just needed a True/False indicator using contains and the list values concatenated together with "|" for alternation, but I am a bit stumped now on how I would create a column value of the exact match. Any tips or trick appreciated!
Since you're not worried about collisions, you can join your product_name list with the | operator, and use that as a regex:
df['product_name_mapped'] = (df.product_name_extract.str
                             .findall('|'.join(product_name))
                             .str[0])
Result:
>>> df
id product_name_extract product_name_mapped
0 1 00012CDN 12CDN
1 2 14311121NDC 21NDC
2 3 NDC37ba 37ba
3 4 47CD27 7CD2
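A variation on the same idea, offered as a sketch rather than a replacement: wrapping the alternation in a capturing group lets str.extract return the first match directly, and re.escape guards against product codes that happen to contain regex metacharacters. The variable name `pattern` is my own.

```python
import re

import pandas as pd

df = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "product_name_extract": ["00012CDN", "14311121NDC", "NDC37ba", "47CD27"],
})
product_name = ["12CDN", "21NDC", "37ba", "7CD2"]

# Escape each code so characters such as "." or "+" are matched literally,
# then wrap the alternation in one capturing group for str.extract.
pattern = "(" + "|".join(re.escape(code) for code in product_name) + ")"

# expand=False yields a Series; rows with no match get NaN.
df["product_name_mapped"] = df["product_name_extract"].str.extract(pattern, expand=False)
```

As with findall(...).str[0], only the first match in each string is kept, so collisions are resolved by position in the string.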

sub string python pandas

I have a pandas dataframe that has a string column in it. The length of the frame is over 2 million rows and looping to extract the elements I need is a poor choice. My current code looks like the following
for i in range(len(table["series_id"])):
    table["state_code"] = table["series_id"][i][2:4]
    table["area_code"] = table["series_id"][i][5:9]
    table["supersector_code"] = table["series_id"][i][11:12]
where "series_id" is a string containing several information fields that I want to split out into their own columns. An example of the data:
columns:
[series_id, year, month, value, footnotes]
The data:
[['SMS01000000000000001' '2006' 'M01' 1966.5 '']
['SMS01000000000000001' '2006' 'M02' 1970.4 '']
['SMS01000000000000001' '2006' 'M03' 1976.6 '']
However series_id is column of interest that I am struggling with. I have looked at the str.FUNCTION for python and specifically pandas.
http://pandas.pydata.org/pandas-docs/stable/basics.html#testing-for-strings-that-match-or-contain-a-pattern
has a section describing each of the string functions i.e. specifically get & slice are the functions I would like to use. Ideally I could envision a solution like so:
table["state_code"] = table["series_id"].str.get(1:3)
or
table["state_code"] = table["series_id"].str.slice(1:3)
or
table["state_code"] = table["series_id"].str.slice([1:3])
When I try any of the above, I get an invalid syntax error for the ":".
Alas, I cannot seem to figure out the proper way to perform a vectorized substring operation on a pandas dataframe column.
Thank you
I think I would use str.extract with some regex (which you can tweak for your needs):
In [11]: s = pd.Series(["SMU78000009092000001"])
In [12]: s.str.extract(r'^.{2}(?P<state_code>.{3}).{1}(?P<area_code>\d{4}).{2}(?P<supersector_code>.{2})')
Out[12]:
  state_code area_code supersector_code
0        U78      0000               92
This reads as: the string starts (^) with any two characters (which are ignored); the next three characters (any) are state_code; then one character is ignored; the next four digits are area_code; two more characters are ignored; and the following two characters are supersector_code.
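If fixed positions are all that is needed, the slicing the question was reaching for also works directly on the column. A minimal sketch, assuming the offsets from the asker's loop (2:4, 5:9, 11:12) are the ones wanted; they are copied as-is and may not be the real field boundaries. The two sample IDs come from the question and the answer above.

```python
import pandas as pd

table = pd.DataFrame({"series_id": ["SMS01000000000000001", "SMU78000009092000001"]})

# .str[start:stop] slices every string in the column at once.
table["state_code"] = table["series_id"].str[2:4]
table["area_code"] = table["series_id"].str[5:9]

# .str.slice does the same, taking start and stop as ordinary arguments,
# which avoids the ":" syntax error seen in the question.
table["supersector_code"] = table["series_id"].str.slice(11, 12)
```

Both forms are vectorized, so no explicit loop over the two million rows is required.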
