I want to get a list of the dataframe columns in which every row contains exactly 2 spaces.
Input:
import pandas as pd
import numpy as np
pd.options.display.max_columns = None
pd.options.display.max_rows = None
pd.options.display.expand_frame_repr = False
df = pd.DataFrame({'id': [101, 102, 103],
'full_name': ['John Brown', 'Bob Smith', 'Michael Smith'],
'comment_1': ['one two', 'qw er ty', 'one space'],
'comment_2': ['ab xfd xsxws', 'dsd sdd dwde', 'wdwd ofjpoej oihoe'],
'comment_3': ['ckdf cenfw cd', 'cewfwf wefep lwcpem', np.nan],
'birth_year': [1960, 1970, 1970]})
print(df)
Output:
id full_name comment_1 comment_2 comment_3 birth_year
0 101 John Brown one two ab xfd xsxws ckdf cenfw cd 1960
1 102 Bob Smith qw er ty dsd sdd dwde cewfwf wefep lwcpem 1970
2 103 Michael Smith one space wdwd ofjpoej oihoe NaN 1970
Expected Output:
['comment_2', 'comment_3']
You can use series.str.count() to count the occurrences of a substring or pattern in each string, use .all() to check whether every item meets the criterion, and restrict the iteration over df.columns to string columns with select_dtypes('object'):
[i for i in df.select_dtypes('object').columns if (df[i].dropna().str.count(' ')==2).all()]
['comment_2', 'comment_3']
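A roughly equivalent way to express the same check without the explicit list comprehension, shown only as a sketch against the same df:
# True for each object column whose non-NaN values all contain exactly 2 spaces
mask = df.select_dtypes('object').apply(lambda s: s.dropna().str.count(' ').eq(2).all())
print(mask[mask].index.tolist())
# ['comment_2', 'comment_3']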
Try:
res = []
for col in df.columns:
    if df[col].dtype == object:  # only consider string (object) columns
        # fill NaN with two spaces so missing values pass the check,
        # strip every non-whitespace character, then measure what is left
        dftemp = df[col].fillna("  ").str.replace(r"[^\s]", "", regex=True).str.len()
        if dftemp.eq(2).all():
            res.append(col)
print(res)
Outputs:
['comment_2', 'comment_3']
It runs through all columns that might contain strings (object dtype), removes all non-space characters from their values, and then counts the characters that remain. If every value ends up with exactly 2 characters, the column name is added to the res list.
Please note this is just an example; the real data has many more columns and the resulting list ends up being very big, hence I don't want to iterate over it twice.
Having:
import pandas as pd
import numpy as np
data = pd.DataFrame({'Name':['Peter','Peter','Anna','Anna','Anna'],
'Country1':['Italy',np.nan,np.nan,'Sweden',np.nan],
'Country2':[np.nan,'Venezuela',np.nan,'Peru','Iceland'],
'Price':[12,33,45,6,9]})
I do
data_g_name = data.groupby('Name')
country_cols=['Country1','Country2']
g_stats = pd.DataFrame({
'Countries':data_g_name[country_cols].apply(lambda x:x.values.flatten().tolist()),
'TotalCost' : data_g_name['Price'].sum()
})
And obtain:
    Name                               Countries  TotalCost
0   Anna  [nan, nan, Sweden, Peru, nan, Iceland]         60
1  Peter            [Italy, nan, nan, Venezuela]         45
I would like (without having to iterate through the list if possible, real case list is big):
Name Countries TotalCost
0 Anna [Sweden,Peru,Iceland] 60
1 Peter [Italy,Venezuela] 45
Use melt to unpivot the dataframe, drop all rows with NaN in the 'Country' column, group by 'Name' and convert to lists, then join the sum of 'Price':
>>> data.melt(['Name', 'Price'], value_name='Country') \
        .dropna(subset=['Country']).groupby('Name')['Country'] \
        .apply(list).to_frame() \
        .join(data.groupby('Name')['Price'].sum().rename('TotalCost'))
Country TotalCost
Name
Anna [Sweden, Peru, Iceland] 60
Peter [Italy, Venezuela] 45
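If the exact layout from the question is wanted (Name as an ordinary column and the list column labelled Countries), the same chain can be finished with a column name and a reset_index; this is only a sketch under the same assumptions:
res = (data.melt(['Name', 'Price'], value_name='Country')
           .dropna(subset=['Country'])
           .groupby('Name')['Country'].apply(list)
           .to_frame('Countries')  # name the list column
           .join(data.groupby('Name')['Price'].sum().rename('TotalCost'))
           .reset_index())
print(res)
#     Name                Countries  TotalCost
# 0   Anna  [Sweden, Peru, Iceland]         60
# 1  Peter       [Italy, Venezuela]         45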
Aggregate country and cost separately and then combine results:
cost = data.Price.groupby(data.Name).sum().rename('TotalCost')
countries = (
data.melt('Name', ['Country1', 'Country2'], value_name='Countries')
.dropna()
.groupby('Name')
.Countries
.agg(list))
pd.concat([countries, cost], axis=1).reset_index()
# Name Countries TotalCost
#0 Anna [Sweden, Peru, Iceland] 60
#1 Peter [Italy, Venezuela] 45
Some of the rows of my dataframe contain extra spaces and numbers, for example Florida16, Florida19, Wisconsin (State of).
I want to remove those extra numbers and spaces and just keep the main names.
How do I do this with rename? Do I need a for loop?
df.rename()
Try the following:
import pandas as pd
import numpy as np
data = np.array([['Florida19','test with space', 'AnotherNumber18'],['Florida19','test with space', 'AnotherNumber18 andspace']])
df = pd.DataFrame(data)
patterns = [r'[0-9]+', r'\s.*']
replacement = ''
df.replace(patterns, replacement, regex=True, inplace=True)
print(df)
This results in:
0 1 2
0 Florida test AnotherNumber
1 Florida test AnotherNumber
Edit:
If the desired output for an entry such as Wisconsin (State of) should be Wisconsin(Stateof) (or, in general, you only want whitespace removed), then use patterns = [r'[0-9]+', r'\s']
This will result in:
0 1 2
0 Florida testwithspace AnotherNumber
1 Florida testwithspace AnotherNumberandspace
For index:
If you have these values set as "index" of your DataFrame like:
1 2
0
Florida19 'test with space' 'AnotherNumber18'
Florida16 'test with space' 'AnotherNumber18 andspace'
Wisconsin (State of) 'info1' 'info2'
You can use df.rename() with regular expressions to change these indices:
import pandas as pd
import numpy as np
import re
data = np.array([['Florida19','test with space', 'AnotherNumber18'],
['Florida16','test with space', 'AnotherNumber18 andspace'],
['Wisconsin (State of)', 'info1', 'info2']])
df = pd.DataFrame(data)
df.set_index(0, inplace=True)
pattern1 = r'[0-9]+|\s.*' # match numbers or string parts that start with a whitespace
pattern2 = r'[0-9]+|\s' # for only removing numbers and whitespaces
df1 = df.rename(index=lambda x: re.sub(pattern1, '', x))
df2 = df.rename(index=lambda x: re.sub(pattern2, '', x))
This produces:
df1 =
1 2
0
Florida 'test with space' 'AnotherNumber18'
Florida 'test with space' 'AnotherNumber18 andspace'
Wisconsin 'info1' 'info2'
df2 =
1 2
0
Florida 'test with space' 'AnotherNumber18'
Florida 'test with space' 'AnotherNumber18 andspace'
Wisconsin(Stateof) 'info1' 'info2'
I would like to filter a DataFrame. The resulting dataframe should contain all rows in which any of a number of columns contains any of the words in a list.
I started to use for loops, but there should be a better pythonic/pandonic way.
Example:
# importing pandas
import pandas as pd
# Creating the dataframe with dict of lists
df = pd.DataFrame({'Name': ['Geeks', 'Peter', 'James', 'Jack', 'Lisa'],
'Team': ['Boston', 'Boston', 'Boston', 'Chele', 'Barse'],
'Position': ['PG', 'PG', 'UG', 'PG', 'UG'],
'Number': [3, 4, 7, 11, 5],
'Age': [33, 25, 34, 35, 28],
'Height': ['6-2', '6-4', '5-9', '6-1', '5-8'],
'Weight': [89, 79, 113, 78, 84],
'College': ['MIT', 'MIT', 'MIT', 'Stanford', 'Stanford'],
'Salary': [99999, 99994, 89999, 78889, 87779]},
index =['ind1', 'ind2', 'ind3', 'ind4', 'ind5'])
df1 = df[df['Team'].str.contains("Boston") | df['College'].str.contains('MIT')]
print(df1)
So it is clear how to filter individual columns that contain a particular word,
and it is also clear how to filter the rows of one column that contain any of the strings in a list:
df[df.Name.str.contains('|'.join(search_values ))]
Where search_values contains a list of words or strings.
search_values = ['boston','mike','whatever']
I am looking for a short way to code
#pseudocode
give me a subframe of df where any of the columns 'Name','Position','Team' contains any of the words in search_values
I know I can do
df[df['Name'].str.contains('|'.join(search_values)) | df['Position'].str.contains('|'.join(search_values)) | df['Team'].str.contains('|'.join(search_values))]
but if I had, say, 20 columns, that would be a mess of a line of code.
Any suggestions?
EDIT Bonus:
When looking in a list of columns, i.e. 'Name','Position','Team', how do I also include the index? Passing ['index','Name','Position','Team'] does not work.
Thanks.
I had a look at these:
https://www.geeksforgeeks.org/get-all-rows-in-a-pandas-dataframe-containing-given-substring/
https://kanoki.org/2019/03/27/pandas-select-rows-by-condition-and-string-operations/
Filter out rows based on list of strings in Pandas
You can also stack the selected columns and reduce with any over level 0 of the resulting index:
cols_list = ['Name','Team'] # add column names
df[df[cols_list].stack().str.contains('|'.join(search_values), case=False, na=False)
   .groupby(level=0).any()]
Name Team Position Number Age Height Weight College Salary
ind1 Geeks Boston PG 3 33 6-2 89 MIT 99999
ind2 Peter Boston PG 4 25 6-4 79 MIT 99994
ind3 James Boston UG 7 34 5-9 113 MIT 89999
Use apply with any:
df[df[['Name','Position','Team']].apply(lambda x: x.str.contains('|'.join(search_values)), axis=1).any(axis=1)]
You can simply apply in this case:
cols_to_filter = ['Name', 'Position', 'Team']
search_values = ['word1', 'word2']
patt = '|'.join(search_values)
mask = df[cols_to_filter].apply(lambda x: x.str.contains(patt)).any(axis=1)
df[mask]
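As for the bonus question about also searching the index: the index is not a regular column, so one option (just a sketch, using the string index ind1..ind5 from the example) is to build a second mask from df.index and combine the two:
patt = '|'.join(search_values)
col_mask = df[['Name', 'Position', 'Team']].apply(lambda x: x.str.contains(patt, case=False, na=False)).any(axis=1)
idx_mask = df.index.str.contains(patt, case=False)  # Index objects have their own .str accessor
df[col_mask | idx_mask]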
I have a large dataset and would like to filter it to only show rows which contain a particular substring (In the following example, 'George') (also bonus points if you tell me how to pass multiple substrings)
For example, if I start with the code
data = {
'Employee': ['George Brazil', 'Tim Swak', 'Rajesh Mahta', 'Kristy Karns', 'Jamie Hyneman', 'Pamela Voss', 'Tyrone Johnson', 'Anton Lafreu'],
'Director': ['Omay Wanja', 'Omay Wanja', 'George Stafford', 'Omay Wanja', 'George Stafford', 'Kristy Karns', 'Carissa Karns', 'Christie Karns'],
'Supervisor': ['George Foreman', 'Mary Christiemas', 'Omay Wanja', 'CEO PERSON', 'CEO PERSON', 'CEO PERSON', 'CEO PERSON', 'George of the jungle'],
'A series of ints to make this more complex': [1,0,1,4 , 1, 3, 3, 7]}
df = pd.DataFrame(data, index = ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h'])
df
Employee Director Supervisor A series of ints to make this more complex
a George Brazil Omay Wanja George Foreman 1
b Tim Swak Omay Wanja Mary Christiemas 0
c Rajesh Mahta George Stafford Omay Wanja 1
d Kristy Karns Omay Wanja CEO PERSON 4
e Jamie Hyneman George Stafford CEO PERSON 1
f Pamela Voss Kristy Karns CEO PERSON 3
g Tyrone Johnson Carissa Karns CEO PERSON 3
h Anton Lafreu Christie Karns George of the jungle 7
I would like to then perform an operation such that it returns the dataframe but with only rows a, c, e, and h, because they are the only rows which contain the substring 'George'
Try this
filters = 'George'
df[df.apply(lambda row: row.astype(str).str.contains(filters).any(), axis=1)]
edited to return subset
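For the bonus about multiple substrings, a common approach (only a sketch, with a hypothetical list of terms) is to join them into a single regex pattern and reuse the same row-wise check:
terms = ['George', 'Omay']  # hypothetical list of substrings
pattern = '|'.join(terms)
df[df.apply(lambda row: row.astype(str).str.contains(pattern).any(), axis=1)]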
You can use a separate or statement for each column. There's probably a more elegant way to get it to work, but this will do.
df[df['Employee'].str.contains("George") | df['Director'].str.contains("George") | df['Supervisor'].str.contains("George")]
From your code, it seems you only want the rows that have 'George' in columns ['Employee', 'Director', 'Supervisor']. If so, try this:
# Lambda solution for first `n` columns
mask = df.iloc[:, 0:3].apply(lambda x: x.str.contains('George')).sum(axis=1) > 0
df[mask]
# Lambda solution with named columns
mask = df[['Employee','Director','Supervisor']].apply(lambda x: x.str.contains('George')).sum(axis=1) > 0
df[mask]
# Trivial solution
df[(df['Employee'].str.contains('George')) | (df['Director'].str.contains('George')) | (df['Supervisor'].str.contains('George'))]
I am trying to clean my df['Country'] column by creating a new column df['Country Clean'] that takes the standard country name whenever it finds one inside the df['Country'] value.
I figured out, though, that if I repeat my command, I also overwrite my previous findings and I end up with a column that only reports the match for 'Russia'.
Is there a way to do this?
data = {'Number':['1', '2', '1', '2', '1', '2'], 'Country':['Italy 1', 'Italie', 'Ecco', 'Russia is in Euroasia' , 'Yugoslavia', 'Russia']}
df = pd.DataFrame(data)
df['Country Clean'] = df['Country'].str.replace(r'(^.*Italy.*$)', 'Italy')
df['Country Clean'] = df['Country'].str.replace(r'(^.*Russia.*$)', 'Russia')
Expected output
data2 = {'Number':['1', '2', '1', '2', '1', '2'], 'Country':['Italy', 'Italy', np.nan, 'Russia', np.nan, 'Russia']}
exp = pd.DataFrame(data2)
exp
I suggest first normalizing the country names and then changing the Country Clean column values according to the allowed country list:
normalize_countries={"Italie": "Italy", "Rusia": "Russia"} # Spelling corrections
pattern = r"\b(?:{})\b".format("|".join(normalize_countries)) # Regex to find misspellings
countries = ["Italy", "Russia"] # Country list
df['Country Clean'] = df['Country'].str.replace(pattern, lambda x: normalize_countries[x.group()], regex=True)
def applyFunc(s):
    for e in countries:
        if e in s:
            return e
    return 'NaN'
df['Country Clean'] = df['Country Clean'].apply(applyFunc)
Output:
>>> df
Number Country Country Clean
0 1 Italy 1 Italy
1 2 Italie Italy
2 1 Ecco NaN
3 2 Russia is in Euroasia Russia
4 1 Yugoslavia NaN
5 2 Russia Russia
The df['Country'].str.replace(pattern, lambda x: normalize_countries[x.group()], regex=True) line searches for all misspelt country names as whole words in the Country column and replaces them with the correct spelling variants.
You may also add a whole-word check when searching for countries: use regexes in the countries list and use re.search instead of the plain if e in s test in applyFunc.
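A minimal sketch of that whole-word variant, assuming the same countries list as above:
import re
import numpy as np

def applyFuncWholeWord(s):
    # drop-in replacement for applyFunc: match a country only as a whole word
    for e in countries:
        if re.search(r'\b{}\b'.format(re.escape(e)), s):
            return e
    return np.nan  # a real NaN instead of the string 'NaN'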
Use:
In [15]: countries = ["italy", "russia", "yugoslavia", "italie"]
In [16]: for i in countries:
    ...:     df.loc[lambda x: x.Country.str.lower().str.contains(i), 'Country Clean'] = i.capitalize()
In [17]: df
Out[17]:
Number Country Country Clean
0 1 Italy 1 Italy
1 2 Italie Italie
2 1 Ecco NaN
3 2 Russia is in Euroasia Russia
4 1 Yugoslavia Yugoslavia
5 2 Russia Russia