I have a pandas DataFrame with a bunch of URLs in a column, e.g.
URL
www.myurl.com/python/us/learnpython
www.myurl.com/python/en/learnpython
www.myurl.com/python/fr/learnpython
.........
I want to extract the country code and add it to a new column called Country containing us, en, fr, and so on. I'm able to do this on a single string, e.g.
url = 'www.myurl.com/python/us/learnpython'
country = url.split("python/")
country = country[1]
country = country.split("/")
country = country[0]
How do I go about applying this to the entire column, creating a new column with the required data in the process? I've tried variations of this with a for loop without success.
Assuming the URLs would always have this format, we can just use str.extract here:
df["cc_code"] = df["URL"].str.extract(r'/([a-z]{2})/')
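For instance, applied to a small frame mirroring the question's data (a runnable sketch; the sample URLs and column names are taken from the question):

```python
import pandas as pd

# Sample frame shaped like the question's data
df = pd.DataFrame({"URL": [
    "www.myurl.com/python/us/learnpython",
    "www.myurl.com/python/en/learnpython",
    "www.myurl.com/python/fr/learnpython",
]})

# Pull out the two-letter code sitting between slashes
df["Country"] = df["URL"].str.extract(r'/([a-z]{2})/')
print(df["Country"].tolist())  # → ['us', 'en', 'fr']
```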
If the country code always appears after the second slash /, it's better to just split the string, passing a value for n (the maxsplit parameter), and take only the value you are interested in. Of course, you can assign the values to a new column:
>>> df['URL'].str.split('/',n=2).str[-1].str.split('/', n=1).str[0]
0 us
1 en
2 fr
Name: URL, dtype: object
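Assigning the result of that chain to a new column is then a one-liner (a sketch using the question's example URLs):

```python
import pandas as pd

df = pd.DataFrame({"URL": [
    "www.myurl.com/python/us/learnpython",
    "www.myurl.com/python/en/learnpython",
    "www.myurl.com/python/fr/learnpython",
]})

# Keep everything after the second slash, then keep everything before the next one
df["Country"] = df["URL"].str.split('/', n=2).str[-1].str.split('/', n=1).str[0]
print(df["Country"].tolist())  # → ['us', 'en', 'fr']
```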
What is the best way to filter multiple columns in a dataframe?
For example I have this sample from my data:
Index Tag_Number
666052 A1-1100-XY-001
666382 B1-001-XX-X-XX
666385 **FROM** C1-0001-XXX-100
666620 D1-001-XX-X-HP some text
The "Tag_Number" column contains tags, but I need to get rid of the text before or after the tag. The common delimiter is a space. My idea was to split the column into multiple columns and filter each of them for values that start with one of "A1-, B1-, C1-, D1-": if a cell does not start with one of these prefixes it is False, else True, then apply that to the table so that True values remain as before and False values become empty. Finally, once the tags are cleaned up, combine them back into one single column. I know this might be complicated and I'm really open to any suggestions.
What I have already tried:
Splitted = df.Tag_Number.str.split(" ",expand=True)
Splitted.columns = Splitted.columns.astype(str)
Splitted = Splitted.rename(columns=lambda s: "Tag"+s)
col_names = list(Splitted.columns)
Splitted
I got this Tag_Number column split into 30 columns, but now I'm struggling to filter out each column.
I have created a condition to filter each column by:
asset = ('A1-','B1-','C1-','D1-')
yet this did not help; I only got an array for the last column instead of all of them, which is expected I guess.
for col in col_names:
    Splitted_filter = Splitted[col].str.startswith(asset, na=False)
Splitted_filter
Is there a way to filter each column by this 'asset' filter?
Many Thanks
If you want to clean out the text that does not match the asset prefixes, then I think this would work.
import pandas as pd
from io import StringIO

sample = pd.read_csv(StringIO("""Index,Tag_Number
666052,A1-1100-XY-001
666382,B1-001-XX-X-XX
666385,**FROM** C1-0001-XXX-100
666620,D1-001-XX-X-HP some text"""))

asset = ('A1-', 'B1-', 'C1-', 'D1-')

def asset_filter(tag_n):
    tags = tag_n.split()  # the common delimiter is a space
    tags = [t for t in tags if any(t.startswith(a) for a in asset)]
    return tags  # can " ".join(tags) if str type is desired

sample['Filtered_Tag_Number'] = sample.Tag_Number.astype(str).apply(asset_filter)
See that it is possible to define a custom function asset_filter and then apply it to the column you wish to transform.
Result is this:
Index Tag_Number Filtered_Tag_Number
0 666052 A1-1100-XY-001 [A1-1100-XY-001]
1 666382 B1-001-XX-X-XX [B1-001-XX-X-XX]
2 666385 **FROM** C1-0001-XXX-100 [C1-0001-XXX-100]
3 666620 D1-001-XX-X-HP some text [D1-001-XX-X-HP]
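A vectorized alternative sketch, assuming every valid tag starts with one of the asset prefixes and contains no spaces, would be to pull the first matching token with str.extract (the pattern below is my assumption, not part of the original answer):

```python
import pandas as pd

sample = pd.DataFrame({"Tag_Number": [
    "A1-1100-XY-001",
    "B1-001-XX-X-XX",
    "**FROM** C1-0001-XXX-100",
    "D1-001-XX-X-HP some text",
]})

# First whitespace-free token that starts with one of the known prefixes
sample["Filtered_Tag_Number"] = sample["Tag_Number"].str.extract(
    r'((?:A1|B1|C1|D1)-\S+)', expand=False
)
print(sample["Filtered_Tag_Number"].tolist())
# → ['A1-1100-XY-001', 'B1-001-XX-X-XX', 'C1-0001-XXX-100', 'D1-001-XX-X-HP']
```

Unlike the apply-based version, this yields plain strings rather than lists, but it only keeps the first matching tag per row.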
New to Python, trying to take a CSV and get the country that has the max number of gold medals. I can get the country name as an Index type, but I need a string value for the submission.
The CSV has countries as the row indices and columns with stats.
ind = DataFrame.index.get_loc(index_result) doesn't work because it doesn't have a valid key.
If I run dataframe.loc[ind], it returns the entire row.
df = read_csv('csv', index_col=0,skiprows=1)
for loop to get the most gold medals:
mostMedals= iterator
getIndex = df[df['medals'] == mostMedals].index #check the column medals
#for mostMedals cell to see what country won that many
ind = dataframe.index.get_loc[getIndex] #doesn't like the key
What I'm going for is to get the index position of getIndex so I can run something like dataframe.index[getIndex], which will give me the string I need, but I can't figure out how to get that integer index position.
Expanding on my comments above, this is how I would approach it. There may be better/other ways, pandas is a pretty enormous library with lots of neat functionality that I don't know yet, either!
import pandas as pd

df = pd.read_csv('csv', index_col=0, skiprows=1)
max_medals = df['medals'].max()
countries = list(df.where(df['medals'] == max_medals).dropna().index)
Unpacking that expression: the where method returns a frame based on df that matches the condition expressed, dropna() removes any rows that are all NaN values, and index returns the remaining row index. Finally, I wrap that all in list, which isn't strictly necessary, but I prefer working with simple built-in types unless I have a greater need.
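As a runnable sketch with a made-up medals table (the real CSV's columns and index labels are assumptions here):

```python
import pandas as pd

# Made-up stand-in for the medals CSV: countries as the index
df = pd.DataFrame(
    {"medals": [10, 25, 25, 7]},
    index=["USA", "China", "Russia", "France"],
)

max_medals = df['medals'].max()
# Mask rows not matching the max, drop them, and keep the surviving labels
countries = list(df.where(df['medals'] == max_medals).dropna().index)
print(countries)  # → ['China', 'Russia']
```

Note this keeps every country tied for the maximum, which is why the list can hold more than one name; if only the first such label is wanted, `df['medals'].idxmax()` is a shorter alternative.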
I'm trying to create a year column with the year taken from the title column in my dataframe. This code works, but the column dtype is object. For example, in row 1 the year displays as [2013].
How can I do this, but change the column dtype to float?
import re

year_list = []
for i in range(title_length):
    year = re.findall(r'\d{4}', wine['title'][i])
    year_list.append(year)
wine['year'] = year_list
Here is the head of my dataframe:
country designation points province title year
Italy Vulkà Bianco 87 Sicily Nicosia 2013 Vulkà Bianco [2013]
re.findall returns a list of results; use re.search instead:
wine['year'] = [re.search(r'\d{4}', title)[0] for title in wine['title']]
Better yet, use the pandas str.extract method (note the pattern needs a capturing group):
wine['year'] = wine['title'].str.extract(r'(\d{4})')
Definition
Series.str.extract(pat, flags=0, expand=True)
For each subject string in the Series, extract groups from the first match of regular expression pat.
Instead of re.findall, which returns a list of strings, you may use str.extract():
wine['year'] = wine['title'].str.extract(r'\b(\d{4})\b')
Or, in case you want to only match 1900-2000s years:
wine['year'] = wine['title'].str.extract(r'\b((?:19|20)\d{2})\b')
Note that the pattern in str.extract must contain at least 1 capturing group; its value will be used to populate the new column. Only the first match is considered, so you may have to make the pattern more specific later if need be.
I suggest using word boundaries \b around the \d{4} pattern to match 4-digit chunks as whole words and avoid partial matches in strings like 1234567890.
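To also get the float dtype the question asked for, the extracted string column can be cast afterwards; a sketch with invented titles (only the first title comes from the question's data):

```python
import pandas as pd

wine = pd.DataFrame({"title": [
    "Nicosia 2013 Vulkà Bianco",
    "Quinta dos Avidagos 2011 Avidagos Red",  # made-up second row
]})

# Extract a 1900s/2000s year as a string, then cast the column to float
wine['year'] = wine['title'].str.extract(r'\b((?:19|20)\d{2})\b', expand=False).astype(float)
print(wine['year'].tolist())  # → [2013.0, 2011.0]
```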
I have a big data frame that contains 6 columns. When I want to print the info out of one cell, I use the following code:
df = pd.read_excel(Path_files_data)
info_rol = df.loc[df.Rank == Ranknumber]
print(info_rol['Art_Nr'])
Here Rank is the column that gives the rank of every item, and Ranknumber is the Rank of the item I try to look up. What I get back looks like this:
0 10399
Name: Art_Nr, dtype: object
Here 0 is the rank and 10399 is the Art_Nr. How do I get it to print out only the Art_Nr and leave out all the extra output like dtype: object?
PS. I tried strip but that didn't work for me.
I think you need to select the first value of the Series by iat or iloc for a scalar:
print(info_rol['Art_Nr'].iat[0])
print(info_rol['Art_Nr'].iloc[0])
If string or numeric output:
print(info_rol['Art_Nr'].values[0])
But after filtering it is possible to get multiple values, in which case the second, third... values are lost.
So converting to a list is a more general solution:
print(info_rol['Art_Nr'].tolist())
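A minimal sketch of the difference, using a made-up one-row frame shaped like the question's output:

```python
import pandas as pd

# Made-up frame shaped like the question's data
df = pd.DataFrame({"Rank": [0, 1], "Art_Nr": ["10399", "20517"]})
info_rol = df.loc[df.Rank == 0]

print(info_rol['Art_Nr'].iat[0])    # → 10399  (a plain scalar, no dtype noise)
print(info_rol['Art_Nr'].tolist())  # → ['10399']  (keeps every matching value)
```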
I've a pandas DataFrame consisting of a single column, which is the extraction of the From field of emails, e.g.
From
0 Grey Caulfu <grey.caulfu#ymail.com>
1 Deren Torculas <deren.e.torcs87#gmail.com>
2 Charlto Youna <youna.charlto4#yahoo.com>
I want to take advantage of the str accessor to split the data into two columns, such that the first column, Name, contains the actual name (first name, last name), and the second column, Email, contains the email address.
If I use:
df = pd.DataFrame(df.From.str.split(' ', 1).tolist(),
                  columns=['Name', 'Email'])
This is almost what I need, but it puts the surname in the Email column (i.e. it places the last two items from split() into this column). How do I modify this so that split() knows to stop after the first space when populating the first column?
Once we achieve this, we then need to make it a little more robust, so that it can handle names that contain three elements, e.g.
Billy R. Valentine <brvalentine#abc2mail.com>
Yurimov | Globosales <yurimov#globosaleseu.com>
You can use rsplit() instead of split() to split from the end of the string. Example -
In [12]: df1 = pd.DataFrame(df.From.str.rsplit(' ', n=1).tolist(), columns=['Name','Email'])
In [13]: df1
Out[13]:
Name Email
0 Grey Caulfu <grey.caulfu#ymail.com>
1 Deren Torculas <deren.e.torcs87#gmail.com>
2 Charlto Youna <youna.charlto4#yahoo.com>
You can pass expand=True and create new columns from the str without having to create a new df:
In [353]:
df[['Name','e-mail']] = df['From'].str.rsplit(' ', n=1, expand=True)
df
Out[353]:
From Name \
0 Grey Caulfu <grey.caulfu#ymail.com> Grey Caulfu
1 Deren Torculas <deren.e.torcs87#gmail.com> Deren Torculas
2 Charlto Youna <youna.charlto4#yahoo.com> Charlto Youna
e-mail
0 <grey.caulfu#ymail.com>
1 <deren.e.torcs87#gmail.com>
2 <youna.charlto4#yahoo.com>