Count the number of strings with length in pandas - python

I am trying to calculate the number of strings in a column with length of 5 or more. These strings are in a column separated by comma.
df= pd.DataFrame(columns=['first'])
df['first'] = ['Jack Ryan, Tom O','Stack Over Flow, StackOverFlow','Jurassic Park, IT', 'GOT']
This is the code I have tried so far, but it does not create a new column with the count of strings of 5 or more characters:
df['countStrings'] = df['first'].str.split(',').count(r'[a-zA-Z0-9]{5,}')
Expected output (count of strings of length 5 or more):
first                           countStrings
Jack Ryan, Tom O                0
Stack Over Flow, StackOverFlow  2
Jurassic Park, IT               1
GOT                             0
Edge case: strings of 5 or more characters that are separated by commas and contain multiple spaces.
first                                     wrongCounts  rightCounts
Accounts Payable Goods for Resale         4            1
Corporate Finance, Financial Engineering  4            2
TBD                                       0            0
Goods for Not Resale, SAP                 2            1

The pandas str.len() method returns the length of each string in a Series. It works only on a Series of strings.
Since this is a string method, it has to be called through the .str accessor every time.
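You can try this:
import pandas as pd
df = pd.DataFrame(columns=['first'])
df['first'] = ['jack,utah,TOMHAWK Somer,SORITNO','jill','bob,texas','matt,AR','john']
# replace commas with spaces so that every comma-separated piece becomes a separate word
df['first'] = df['first'].str.replace(',', ' ', regex=False)
# total number of words in the whole column
df['first'].str.count(r'\w+').sum()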
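For a per-row figure rather than one total for the column, a minimal sketch (not part of the answer above) is to count runs of five or more word characters in each cell. The column name countWords is made up for illustration. Note that this counts individual words, so on the edge-case rows from the question it reproduces the wrongCounts column rather than rightCounts.

```python
import pandas as pd

df = pd.DataFrame({'first': [
    'Jack Ryan, Tom O',
    'Stack Over Flow, StackOverFlow',
    'Accounts Payable Goods for Resale',
    'Goods for Not Resale, SAP',
]})

# \w{5,} matches each run of 5 or more word characters
# (letters, digits, underscore), so every long word is counted separately.
# 'countWords' is a hypothetical column name used only in this sketch.
df['countWords'] = df['first'].str.count(r'\w{5,}')
print(df)
```

This works for the first table in the question, but a comma-separated piece with several long words is counted more than once.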

You can match 5 consecutive alphanumeric characters and, on the left and right of them, any number of optional characters other than a comma.
[^,]*[A-Za-z0-9]{5}[^,]*
See a regex demo with the matches.
Example
import pandas as pd
df = pd.DataFrame(columns=['first'])
df['first'] = [
'Accounts Payable Goods for Resale',
'Corporate Finance, Financial Engineering',
'TBD',
'Goods for Not Resale, SAP',
'Jack Ryan, Tom O',
'Stack Over Flow, StackOverFlow',
'Jurassic Park, IT',
'GOT'
]
df['countStrings'] = df['first'].str.count(r'[^,]*[A-Za-z0-9]{5}[^,]*')
print(df)
Output
                                      first  countStrings
0         Accounts Payable Goods for Resale             1
1  Corporate Finance, Financial Engineering             2
2                                       TBD             0
3                 Goods for Not Resale, SAP             1
4                          Jack Ryan, Tom O             0
5            Stack Over Flow, StackOverFlow             2
6                         Jurassic Park, IT             1
7                                       GOT             0
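To see why the pattern counts comma-separated pieces rather than words, it can help to look at what it actually matches. The sketch below uses only the standard re module with the same pattern; the sample strings are taken from the example above.

```python
import re

# Same pattern as in the answer: a run of non-comma characters that
# contains at least 5 consecutive alphanumeric characters.
pattern = re.compile(r'[^,]*[A-Za-z0-9]{5}[^,]*')

samples = [
    'Goods for Not Resale, SAP',
    'Stack Over Flow, StackOverFlow',
    'Jack Ryan, Tom O',
]

# findall returns every non-overlapping match; str.count reports how many there are.
matches = {text: pattern.findall(text) for text in samples}
for text, found in matches.items():
    print(repr(text), '->', found)
```

Each match swallows a whole comma-separated piece, so a piece with several long words still counts once, and a piece such as 'Jack Ryan' with no run of 5 alphanumeric characters does not match at all.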

This is how I would get the number of strings with len >= 5 in the whole column:
data = [i for k in df['first']
        for i in k.split(',')
        if len(i) >= 5]
result = len(data)
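The list comprehension above produces a single total for the column. A per-row variant under the same length-based reading might look like the sketch below; the column name countByLen is made up, and stripping the space that follows each comma is an added assumption. Because spaces inside a piece count towards its length, 'Jack Ryan' and 'Tom O' both qualify here, which differs from the expected output in the question.

```python
import pandas as pd

df = pd.DataFrame({'first': [
    'Jack Ryan, Tom O',
    'Stack Over Flow, StackOverFlow',
    'Jurassic Park, IT',
    'GOT',
]})

# For each cell: split on commas, strip surrounding spaces, and count the
# pieces whose total length (inner spaces included) is at least 5.
# 'countByLen' is a hypothetical column name used only in this sketch.
df['countByLen'] = df['first'].apply(
    lambda cell: sum(len(piece.strip()) >= 5 for piece in cell.split(','))
)
print(df)
```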

Related

searching substring for match in dataframe

I am trying to use my df as a lookup table, and trying to determine if my string contains a value in that df. Simple example
str = 'John Smith Business Analyst'
df = pd.read_pickle('job_titles.pickle')
The df would be one column with several job titles.
df = accountant, lawyer, CFO, business analyst, etc..
Now I want to somehow determine that str contains the substring 'Business Analyst', because that value is contained in my df.
The return result would be the substring = 'Business Analyst'
If the original str was:
str = 'John Smith Business'
Then the return would be empty since no substring matches a string in the df.
I have it working if it is for one word. For example:
df = pd.read_pickle('cities.pickle')
df = Calgary, Edmonton, Toronto, etc
str = 'John Smith Business Analyst Calgary AB Canada'
str_list = str.split()
for word in str_list:
    df_location = df[df['name'].str.match(word)]
    if not df_location.empty:
        break
df_location = Calgary
The city will be found in the df, and that one row is returned. I am just not sure how to do this when the value is more than one word.
I am not sure what you want to do with the returned value exactly, but here is a way to identify it at least. First, I made a toy dataframe:
import pandas as pd
titles_df = pd.DataFrame({'title' : ['Business Analyst', 'Data Scientist', 'Plumber', 'Baker', 'Accountant', 'CEO']})
search_name = 'John Smith Business Analyst'
titles_df
title
0 Business Analyst
1 Data Scientist
2 Plumber
3 Baker
4 Accountant
5 CEO
Then, I loop through the values in the title column to see if any of them are in the search term:
for val in titles_df['title'].values:
    if val in search_name:
        print(val)
If you want to do this over all the names in a dataframe column and assign a new column with the title you can do the following:
First, I create a dataframe with some names:
names_df = pd.DataFrame({'name' : ['John Smith Business Analyst', 'Dorothy Roberts CEO', 'Jim Miller Dancer', 'Samuel Adams Accountant']})
Then, I loop through the values of names and values of titles and assign the matched titles to a title column in the names dataframe (unmatched ones will have an empty string):
names_df['title'] = ''
for name in names_df['name'].values:
    for title in titles_df['title'].values:
        if title in name:
            # .loc avoids chained assignment, which may not write back to names_df
            names_df.loc[names_df['name'] == name, 'title'] = title
names_df
name title
0 John Smith Business Analyst Business Analyst
1 Dorothy Roberts CEO CEO
2 Jim Miller Dancer
3 Samuel Adams Accountant Accountant
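The nested loop can also be expressed as a single regular expression built from the lookup values. This is a sketch of an alternative, not the approach used in the answer above: it escapes each title with re.escape, joins the titles into one alternation, and lets Series.str.extract pull out the first title found in each name. Sorting the titles longest first is an added assumption so that a longer title wins over a shorter one that starts at the same position.

```python
import re
import pandas as pd

titles_df = pd.DataFrame({'title': ['Business Analyst', 'Data Scientist', 'Plumber',
                                    'Baker', 'Accountant', 'CEO']})
names_df = pd.DataFrame({'name': ['John Smith Business Analyst', 'Dorothy Roberts CEO',
                                  'Jim Miller Dancer', 'Samuel Adams Accountant']})

# Build one capture group with every title as a literal alternative.
alternatives = sorted(titles_df['title'], key=len, reverse=True)
pattern = '(' + '|'.join(re.escape(title) for title in alternatives) + ')'

# expand=False returns a Series; rows without a match come back as NaN.
names_df['title'] = names_df['name'].str.extract(pattern, expand=False).fillna('')
print(names_df)
```

If the lookup values differ in case from the search strings, passing flags=re.IGNORECASE to str.extract should make the match case-insensitive.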

How to split data in a dataframe cell and perform a Pandas groupby on splits?

I have produced some data which lists parks in proximity to different areas of East London with use of the FourSquare API. It is here in the dataframe, df.
Location,Parks,Borough
Aldborough Hatch,Fairlop Waters Country Park,Redbridge
Ardleigh Green,Haynes Park,Havering
Bethnal Green,"Haggerston Park, Weavers Fields",Tower Hamlets
Bromley-by-Bow,"Rounton Park, Grove Hall Park",Tower Hamlets
Cambridge Heath,"Haggerston Park, London Fields",Tower Hamlets
Dalston,"Haggerston Park, London Fields",Hackney
Import data with df = pd.read_clipboard(sep=',')
What I would like to do is group by the borough column and count the distinct parks in that borough so that for example 'Tower Hamlets' = 5 and 'Hackney' = 2. I will create a new dataframe for this purpose which simply lists total number of parks for each borough present in the dataframe.
I know I can do:
df.groupby(['Borough', 'Parks']).size()
But I need to split parks by the delimiter ',' such that they are treated as unique, distinct entities for a borough.
What do you suggest?
Thanks!
The first rule of data science is to clean your data into a useful format.
Reformat the DataFrame to be usable:
df.Parks = df.Parks.str.split(r',\s*') # per user piRSquared; the raw string keeps \s intact
df = df.explode('Parks') # requires pandas 0.25 or later
Now the DataFrame is in a proper format that can be more easily analyzed
df.groupby('Borough').Parks.nunique()
Borough
Hackney 2
Havering 1
Redbridge 1
Tower Hamlets 5
That's three lines of code, but now the DataFrame is in a useful format, upon which more insights can easily be extracted.
Plot
df.groupby(['Borough']).Parks.nunique().plot(kind='bar', title='Unique Parks Counts by Borough')
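For readers without the clipboard data, the split, explode and nunique steps can be reproduced end to end with the sample rows from the question. This is a self-contained sketch that reads the rows from an in-memory string instead of the clipboard; the variable names csv_text, parks and counts are made up.

```python
import io
import pandas as pd

# Sample rows copied from the question.
csv_text = '''Location,Parks,Borough
Aldborough Hatch,Fairlop Waters Country Park,Redbridge
Ardleigh Green,Haynes Park,Havering
Bethnal Green,"Haggerston Park, Weavers Fields",Tower Hamlets
Bromley-by-Bow,"Rounton Park, Grove Hall Park",Tower Hamlets
Cambridge Heath,"Haggerston Park, London Fields",Tower Hamlets
Dalston,"Haggerston Park, London Fields",Hackney
'''
df = pd.read_csv(io.StringIO(csv_text))

# One row per park: split each cell into a list, then explode the lists.
parks = df.assign(Parks=df['Parks'].str.split(r',\s*')).explode('Parks')

# Count distinct parks per borough.
counts = parks.groupby('Borough')['Parks'].nunique()
print(counts)
```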
If you are using Pandas 0.25 or greater, consider the answer from Trenton_M
His answer provides a good suggestion for creating a more useful data set.
IIUC:
df.groupby('Borough').Parks.apply(
    lambda s: len(set(', '.join(s).split(', ')))
)
Borough
Hackney 2
Havering 1
Redbridge 1
Tower Hamlets 5
Name: Parks, dtype: int64
Similar
df.Parks.str.split(', ').groupby(df.Borough).apply(lambda s: len(set().union(*s)))
Borough
Hackney 2
Havering 1
Redbridge 1
Tower Hamlets 5
Name: Parks, dtype: int64

how to deal with a copy-pasted table in pandas- reshaping a column vector

I have a table I copied from a webpage which when pasted into librecalc or excel occupies a single cell, and when pasted into notebook becomes a 3507x1 column. If I import this as a pandas dataframe using pd.read_csv I see the same 3507x1 column , and I'd now like to reshape it into the 501x7 array that it started as.
I thought I could recast as a numpy array, reshape as I am familiar with in numpy and then put back into a df, but the to_numpy methods of pandas seem to want to work with a Series object (not Dataframe) and attempts to read the file into a Series using eg
ser= pd.Series.from_csv('billionaires')
led to tokenizing errors. Is there some simple way to do this? Maybe I should throw in the towel on this direction and read from the html?
A simple copy-paste does not give you any clear column separator, so this cannot be done easily.
You have only spaces, but spaces may also appear inside the column values (as in the name or the country), so it is impossible to give pandas.read_csv a column separator.
However, if I copy paste the table in a file, I notice regularity.
If you know regex, you can try using pandas.Series.str.extract. This method extracts capture groups in a regex pattern as columns of a DataFrame. The regex is applied to each element / string of the series.
You can then try to find a regex pattern to capture the various elements of the row to split them into separate columns.
df = pd.read_csv('data.txt', names=["A"]) #no header in the file
ss = df['A']
rdf = ss.str.extract(r'(\d)\s+(.+)(\$[\d\.]+B)\s+([+-]\$[\d\.]+[BM])\s+([+-]\$[\d\.]+B)\s+([\w\s]+)\s+([\w\s]+)')
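Here I tried to write a regex for the table in the link; the result on the first rows seems pretty good. Note that (\d) captures only a single digit, so a two-digit rank keeps only its last digit (the last row below shows 0, presumably for rank 10); (\d+) would capture the whole number.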
   0                             1       2        3        4              5            6
0  1                    Jeff Bezos   $121B   +$231M  -$3.94B  United States   Technology
1  3               Bernard Arnault   $104B   +$127M  +$35.7B         France     Consumer
2  4                Warren Buffett  $84.9B  +$66.3M  +$1.11B  United States  Diversified
3  5               Mark Zuckerberg  $76.7B   -$301M  +$24.6B  United States   Technology
4  6                Amancio Ortega  $66.5B   +$303M  +$7.85B          Spain       Retail
5  7                 Larry Ellison  $62.3B   +$358M  +$13.0B  United States   Technology
6  8                   Carlos Slim  $57.0B   -$331M  +$2.20B         Mexico  Diversified
7  9  Francoise Bettencourt Meyers  $56.7B  -$1.12B  +$10.5B         France     Consumer
8  0                    Larry Page  $55.7B   +$393M  +$4.47B  United States   Technology
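A variant of the same idea with named capture groups gives the resulting columns readable names. This is only a sketch under stated assumptions: the three sample lines are reconstructed from the output above and may not match the real file exactly, the group names are made up, and the pattern assumes the industry is a single word at the end of the line.

```python
import pandas as pd

# Hypothetical sample lines, reconstructed from the output shown above.
ss = pd.Series([
    '1 Jeff Bezos $121B +$231M -$3.94B United States Technology',
    '9 Francoise Bettencourt Meyers $56.7B -$1.12B +$10.5B France Consumer',
    '10 Larry Page $55.7B +$393M +$4.47B United States Technology',
])

# Named groups become column names. \d+ keeps multi-digit ranks intact and
# the lazy .+? stops the name at the first dollar amount.
pattern = (
    r'(?P<rank>\d+)\s+(?P<name>.+?)\s+(?P<worth>\$[\d.]+[BM])\s+'
    r'(?P<last_change>[+-]\$[\d.]+[BM])\s+(?P<ytd_change>[+-]\$[\d.]+[BM])\s+'
    r'(?P<country>[\w\s]+)\s+(?P<industry>\w+)$'
)
rdf = ss.str.extract(pattern)
print(rdf)
```

Lines that do not fit the pattern come back as rows of NaN, which makes it easy to spot the cases the regex still needs to cover.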
I used pandas.read_csv to read the file, since Series.from_csv is deprecated.
I found that converting to a numpy array was far easier than I had realized - the numpy asarray method can handle a df (and conveniently enough it works for general objects, not just numbers)
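import numpy as np
import pandas as pd
# one cell per line; sep='\n' worked with the pandas version used here, newer releases may reject it
df = pd.read_csv('billionaires',sep='\n')
print(df.shape)
-> (3507, 1)
n = np.asarray(df) # object array of shape (3507, 1)
m = np.reshape(n,[-1,7]) # 7 cells per row
df2=pd.DataFrame(m)
df2.head()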
0 1 2 3 4 \
0 0 Name Total net worth $ Last change $ YTD change
1 1 Jeff Bezos $121B +$231M -$3.94B
2 2 Bill Gates $107B -$421M +$16.7B
3 3 Bernard Arnault $104B +$127M +$35.7B
4 4 Warren Buffett $84.9B +$66.3M +$1.11B
5 6
0 Country Industry
1 United States Technology
2 United States Technology
3 France Consumer
4 United States Diversified
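In the output above the original header ends up as row 0 of the reshaped frame. A possible follow-up, sketched here on a tiny made-up 3-column table rather than the real 7-column data, is to reshape with to_numpy() and then promote the first row to column names. The names flat, n_cols and table are hypothetical.

```python
import pandas as pd

# Hypothetical stand-in for the pasted data: a 3-column table
# (header included) flattened into a single column, one cell per row.
flat = pd.DataFrame({'cell': ['Rank', 'Name', 'Country',
                              '1', 'Jeff Bezos', 'United States',
                              '2', 'Bill Gates', 'United States']})

n_cols = 3  # would be 7 for the table in the question
table = pd.DataFrame(flat['cell'].to_numpy().reshape(-1, n_cols))

# Promote the first row to column names and drop it from the data.
table.columns = table.iloc[0]
table = table.iloc[1:].reset_index(drop=True)
table.columns.name = None
print(table)
```

reshape(-1, n_cols) raises a ValueError when the number of cells is not a multiple of n_cols, which is a useful check that no cell was lost in the paste.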

Python remove row if cell value in dataframe contain characters less than 5

I have a dataframe like the one below, and I am trying to keep only the rows whose string has more than 5 characters. Here is what I tried, but it removes words such as 'of', 'U.', 'and', 'Arts', etc. I just need to remove the rows whose string has a length of less than 5.
id schools
1 University of Hawaii
2 Dept in Colorado U.
3 Dept
4 College of Arts and Science
5 Dept
6 Bldg
wrong output from my code:
0 University Hawaii
1 Colorado
2
3 College Science
4
5
Looking for output like this:
id schools
1 University of Hawaii
2 Dept in Colorado U.
4 College of Arts and Science
Code:
l = [1,2,3,4,5,6]
s = ['University of Hawaii', 'Dept in Colorado U.','Dept','College of Arts and Science','Dept','Bldg']
df1 = pd.DataFrame({'id':l, 'schools':s})
df1 = df1['schools'].str.findall('\w{5,}').str.join(' ') # not working
df1
Using a regex is a huge (and slow) overkill for this task. You can use simple pandas indexing:
filtered_df = df1[df1['schools'].str.len() > 5] # or >= depending on the required logic
There is a simpler filter for your data.
mask = df1['schools'].str.len() > 5
Then create a new data frame from the filter
df2 = df1[mask].copy()
import pandas as pd
name = ['University of Hawaii','Dept in Colorado U.','Dept','College of Arts and Science','Dept','Bldg']
labels =['schools']
df =pd.DataFrame.from_records([[i] for i in name],columns=labels)
df[df['schools'].str.len() >5 ]
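One detail worth knowing about the length filter is how it treats missing values. The sketch below reuses the data from the question and adds one hypothetical row with a missing school (id 7) to illustrate it: str.len() yields NaN for that row, the comparison with 5 is False, and the row is dropped along with the short strings.

```python
import pandas as pd

df1 = pd.DataFrame({
    'id': [1, 2, 3, 4, 5, 6, 7],
    'schools': ['University of Hawaii', 'Dept in Colorado U.', 'Dept',
                'College of Arts and Science', 'Dept', 'Bldg',
                None],  # hypothetical extra row with a missing value
})

# Length of each string; the missing value has no length.
lengths = df1['schools'].str.len()

# Keep only rows whose string is longer than 5 characters.
kept = df1[lengths > 5]
print(kept)
```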

Pandas - 'cut' everything after a certain character in a string column and paste it in the beginning of the column

In a pandas dataframe string column, I want to grab everything after a certain character and place it at the beginning of the column while stripping the character. What is the most efficient / cleanest way to achieve this?
Input Dataframe:
>>> df = pd.DataFrame({'city':['Bristol, City of', 'Newcastle, City of', 'London']})
>>> df
city
0 Bristol, City of
1 Newcastle, City of
2 London
>>>
My desired dataframe output:
city
0 City of Bristol
1 City of Newcastle
2 London
Assuming there are only two pieces to each string at most, you can split, reverse, and join:
df.city.str.split(', ').str[::-1].str.join(' ')
0 City of Bristol
1 City of Newcastle
2 London
Name: city, dtype: object
If there are more than two commas, split on the first one only:
df.city.str.split(', ', n=1).str[::-1].str.join(' ')
0 City of Bristol
1 City of Newcastle
2 London
Name: city, dtype: object
Another option is str.partition:
u = df.city.str.partition(', ')
u.iloc[:,-1] + ' ' + u.iloc[:,0]
0 City of Bristol
1 City of Newcastle
2 London
dtype: object
This always splits on the first comma only.
You can also use a list comprehension, if you need performance:
df.assign(city=[' '.join(s.split(', ', 1)[::-1]) for s in df['city']])
city
0 City of Bristol
1 City of Newcastle
2 London
Why should you care about loopy solutions? For loops are fast when working with string/regex functions (faster than pandas, at least). You can read more at For loops with pandas - When should I care?.
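Another way to express the same reordering, not among the options listed above, is a single regex replacement with backreferences. This is a minimal sketch on the data from the question: the lazy first group stops at the first ', ', and strings without a comma do not match and are left unchanged.

```python
import pandas as pd

df = pd.DataFrame({'city': ['Bristol, City of', 'Newcastle, City of', 'London']})

# Group 1 is the text before the first ', ', group 2 is everything after it;
# the replacement swaps them and drops the comma.
reordered = df['city'].str.replace(r'^(.*?), (.*)$', r'\2 \1', regex=True)
print(reordered)
```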
