Remove white space from entire DataFrame - python

I have a DataFrame with 22 columns and 65 rows. The data comes in from a CSV file.
Each value in the DataFrame has an extra unwanted whitespace character, so if I loop over the 'Year' column with len() I get:
2019 5
2019 5
2018 5
...
This one extra whitespace appears throughout the DataFrame in every value. I tried running .strip() on the DataFrame, but no such attribute exists.
I also tried a for loop with df[column].str.strip(), but the columns have various data types (dtypes: float64(6), int64(4), object(14)), so this errors.
Any ideas on how to apply a function to the entire DataFrame, and if so, which function/method? If not, what is the best way to handle this?

Handle the error:
for col in df.columns:
    try:
        df[col] = df[col].str.strip()
    except AttributeError:
        pass
Normally, I'd say select the object dtypes, but that can still be problematic if the data are messy enough to store numeric data in an object container.
import pandas as pd
df = pd.DataFrame({'foo': [1, 2, 3], 'bar': ['seven ']*3})
df['foo2'] = df.foo.astype(object)
for col in df.select_dtypes('object'):
    df[col] = df[col].str.strip()
# AttributeError: Can only use .str accessor with string values!
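One possible workaround (a sketch of my own, not from the answer above): strip only the values that are actually strings, so mixed dtypes never hit the .str accessor. On newer pandas versions DataFrame.map plays the same element-wise role as applymap.
import pandas as pd
df = pd.DataFrame({'Year': ['2019 ', '2018 '], 'foo': [1, 2]})
# Strip values that are strings; leave numbers untouched
df = df.applymap(lambda v: v.strip() if isinstance(v, str) else v)
print(df['Year'].map(len))  # 4, 4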

You should use the apply() function to do this:
df['Year'] = df['Year'].apply(lambda x: x.strip())
You can apply this to each column separately; guard against non-string values so numeric columns don't error:
for column in df.columns:
    df[column] = df[column].apply(lambda x: x.strip() if isinstance(x, str) else x)

Try this:
for column in df.columns:
    df[column] = df[column].apply(lambda x: str(x).replace(' ', ''))

Why not try this?
for column in df.columns:
    df[column] = df[column].apply(lambda x: str(x).strip())

Save & remove rows in a pandas dataframe based on a condition. Error: bad operand type for unary ~: 'str'

Say I have a dataframe as below (a representation of a much larger dataset) which has a code column along with another column (the actual dataset has many more).
import pandas as pd
df = pd.DataFrame({'code': [123456, 123758, 12334356, 4954968, 774853],
                   'col2': [1, 2, 3, 4, 5]})
Question: How can I store in a separate dataframe, and remove from the original dataframe, the entries of this dataframe (all columns associated with each entry as well) whose first 3 characters are not 123?
Attempted: To do this I have attempted to select all rows which start with 123 and then use the not symbol ~ to select everything which doesn't start with this. I have stored this in a new dataframe since I want it saved, and then tried dropping it from the original dataframe by its index, since it's not wanted there.
# Converting column to a string
df['code'] = df['code'].astype(str)
# Saving entries which DONT start with 123 in a separate dataframe
df2 = df[~df['code'].str[0:3] == '123']
# Dropping those bad entries (not starting with 123) from the dataframe
df = df.drop(df2.index, inplace=True)
However when I do this I come across the following error:
TypeError: bad operand type for unary ~: 'str'
Any alternate solutions along with corrections to my own would be much appreciated.
Desired Output: Should generalise for additional entries too. Notice that 4954968 & 774853 have gone since they don't start with 123.
df_final = pd.DataFrame({'code': [123456, 123758, 12334356], 'col2': [1, 2, 3]})
The problem in your solution is operator precedence, so parentheses are necessary:
df2 = df[~(df['code'].str[0:3] == '123')]
print (df2)
      code  col2
3  4954968     4
4   774853     5
Better is to change the logic and select only the matched values:
df = df[(df['code'].str[0:3] == '123')]
print (df)
You can use startswith to identify the rows that you want. No need for a double negative.
import pandas as pd
df = df.loc[df['code'].str.startswith('123'), :]
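Since the question also asks to keep the removed rows, here is a small sketch of my own (assuming code has already been converted to string, as in the question) that builds the mask once and reuses it for both frames:
mask = df['code'].str.startswith('123')
df_removed = df[~mask].copy()  # rows that don't start with 123, saved separately
df = df[mask].copy()           # rows that do start with 123 stay in the working dataframe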

Remove column based on header name type

Usually, when you want to remove columns which are not of type float, you can write pd.DataFrame.select_dtypes(include='float64'). However, I would like to remove a column in cases where the header name is not a float:
df = pd.DataFrame({'a' : [1,2,3], 10 : [3,4,5]})
df.dtypes
will give the output
a     int64
10    int64
dtype: object
How can I remove the column a based on the fact that its header is not a float or int?
Try filter with a regex: \d keeps only the columns whose header contains a digit (dropping a); conversely, \D keeps the non-digit headers if you wanted to drop 10 instead.
df.filter(regex='\d', axis=1)
# On the contrary, you can keep only the non-digit headers:
# df.filter(regex='\D', axis=1)
A solution based on type enumeration:
Code
sr_dtype = df.dtypes
df = df.drop(columns=sr_dtype.index[
    sr_dtype.index.map(lambda el: not isinstance(el, (int, float)))  # add more if necessary
])
Note that df.dtypes itself is a Series instance, so regular Series operations are applicable. In particular, index.map() is used as a wrapper for the isinstance() check in this example.
Result
print(df)
   10
0   3
1   4
2   5
Are you sure that's the right output? Your dataframe columns are 'a' and 10, so why does your input have a column named 'b'?
Anyway, to remove the column a by its header name, regardless of its type, use the drop method:
df = df.drop(columns=['a'])
This also works with a longer list of columns, instead of the single-element list used here.
Based on the other answers, you can also try:
1) To make sure to keep only float and int types:
df[[col for col in df.columns if type(col) in [float,int]]]
2) To just exclude string-like columns:
df.loc[:, [not isinstance(col, str) for col in df.columns]]  # return bool array
# or
df[[col for col in df.columns if not isinstance(col, str)]]  # return column names
3) To exclude columns whose headers are not float/int, based on a regex:
df.filter(regex='^\d+$|^\d+?\.{1}\d+$')
where the first expression ^\d+$ matches integers (start and end with a digit), and the second expression ^\d+?\.{1}\d+$ matches floats. We could just use ^[\d|\.]+$ (allowing only digits and dots) to match both of them, but it would also match headers like "1..2".

How do I filter out multiple columns with a certain string in Python

I'm new to Python and especially to pandas, so I don't really know what I'm doing. I have 10 columns with 100,000 rows of 4-letter strings. I need to filter out the rows which don't contain 'DDD' in all of the columns.
I tried to do it with iloc and loc, but it doesn't work:
import pandas as pd
df = pd.read_csv("data_3.csv", delimiter = '!')
df.iloc[:,10:20].str.contains('DDD', regex= False, na = False)
df.head()
It returns an error: 'DataFrame' object has no attribute 'str'
I suggest doing it without a for loop like this:
df[df.apply(lambda x: x.str.contains('DDD')).all(axis=1)]
To select only string columns
df[df.select_dtypes(include='object').apply(lambda x: x.str.contains('DDD')).all(axis=1)]
To select only some string columns
selected_cols = ['A','B']
df[df[selected_cols].apply(lambda x: x.str.contains('DDD')).all(axis=1)]
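If you want to keep the positional slice from the question, here is a rough sketch along the same lines (my addition, assuming the relevant columns really are positions 10 through 19, as in the iloc call above):
mask = df.iloc[:, 10:20].apply(lambda s: s.str.contains('DDD', regex=False, na=False)).all(axis=1)
df = df[mask]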
You can also do this with a loop, but only if all of your columns are string type:
for column in df.columns:
    df = df[df[column].str.contains('DDD')]
You can use str.contains, but only on Series not on DataFrames. So to use it we look at each column (which is a series) one by one by for looping over them:
>>> import pandas as pd
>>> df = pd.DataFrame([['DDDA', 'DDDB', 'DDDC', 'DDDD'],
...                    ['DDDE', 'DDDF', 'DDDG', 'DHDD'],
...                    ['DDDI', 'DDDJ', 'DDDK', 'DDDL'],
...                    ['DMDD', 'DNDN', 'DDOD', 'DDDP']],
...                   columns=['A', 'B', 'C', 'D'])
>>> for column in df.columns:
...     df = df[df[column].str.contains('DDD')]
In our for loop we're overwriting the DataFrame df with df where the column contains 'DDD'. By looping over each column we cut out rows that don't contain 'DDD' in that column until we've looked in all of our columns, leaving only rows that contain 'DDD' in every column.
This gives you:
>>> print(df)
      A     B     C     D
0  DDDA  DDDB  DDDC  DDDD
2  DDDI  DDDJ  DDDK  DDDL
As you're only looping over 10 columns this shouldn't be too slow.
Edit: You should probably do it without a for loop as explained by Christian Sloper as it's likely to be faster, but I'll leave this up as it's slightly easier to understand without knowledge of lambda functions.

What is the right way to substitute column values in a dataframe?

I want the following to happen:
for every column in df, check if its type is numeric; if not, use a label encoder to map str/obj values to numeric classes (e.g. 0, 1, 2, 3, ...).
I am trying to do it in the following way:
for col in df:
    if not np.issubdtype(df[col].dtype, np.number):
        df[col] = LabelEncoder().fit_transform(df[col])
I see a few problems here.
First: column names can repeat, and thus df[col] returns more than one column, which is not what I want.
Second: df[col].dtype throws an error:
AttributeError: 'DataFrame' object has no attribute 'dtype'
which I assume might arise due to issue #1, i.e. we get multiple columns returned, but I am not confident.
Third: would assigning df[col] = LabelEncoder().fit_transform(df[col]) lead to a column substitution in df, or should I do some esoteric df partitioning and concatenation?
Thank you
Since LabelEncoder supports only one column at a time, iteration over columns is your only option. You can make this a little more concise using select_dtypes to select the columns, and then df.apply to apply the LabelEncoder to each column.
cols = df.select_dtypes(exclude=[np.number]).columns
df[cols] = df[cols].apply(lambda x: LabelEncoder().fit_transform(x))
Alternatively, you could build a mask by selecting object dtypes only (a little more flaky but easily extensible):
m = df.dtypes == object
# m = [not np.issubdtype(d, np.number) for d in df.dtypes]
df.loc[:, m] = df.loc[:, m].apply(lambda x: LabelEncoder().fit_transform(x))
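For reference, a minimal end-to-end sketch of the select_dtypes approach with the imports it needs (the column names here are made up for illustration):
import numpy as np
import pandas as pd
from sklearn.preprocessing import LabelEncoder

df = pd.DataFrame({'city': ['Oslo', 'Lima', 'Oslo'],  # object column, gets encoded
                   'price': [1.0, 2.5, 3.0]})         # numeric column, left alone
cols = df.select_dtypes(exclude=[np.number]).columns
df[cols] = df[cols].apply(lambda x: LabelEncoder().fit_transform(x))
print(df)  # 'city' is now integer codes such as 1, 0, 1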

Replace NaN values of filtered column by the mean

I have a dataframe with the following columns:
Index([u'PRODUCT',u'RANK', u'PRICE', u'STARS', u'SNAPDATE', u'CAT_NAME'], dtype='object')
For each product of that dataframe I can have NaN values for a specific date.
The goal is to replace for each product the NaN values by the mean of the existing values.
Here is what I tried without success:
for product in df['PRODUCT'].unique():
    df = df[df['PRODUCT'] == product]['RANK'].fillna((df[df['PRODUCT'] == product]['RANK'].mean()), inplace=True)
    print df
gives me:
TypeError: 'NoneType' object has no attribute '__getitem__'
What am I doing wrong?
You can use groupby to create a mean series:
s = df.groupby('PRODUCT')['RANK'].mean()
Then use this series to fillna values:
df['RANK'] = df['RANK'].fillna(df['PRODUCT'].map(s))
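An equivalent one-liner (my own sketch, not from the answer above) using groupby with transform, which fills each product's NaNs with that product's mean:
df['RANK'] = df.groupby('PRODUCT')['RANK'].transform(lambda x: x.fillna(x.mean()))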
The reason you're getting this error is because of your use of inplace in fillna. Unfortunately, the documentation there is wrong:
Returns: filled : Series
This shows otherwise, though:
>>> df = pd.DataFrame({'a': [3]})
>>> type(df.a.fillna(6, inplace=True))
NoneType
>>> type(df.a.fillna(6))
pandas.core.series.Series
So when you assign
df = df[df['PRODUCT'] == product]['RANK'].fillna((df[df['PRODUCT'] == product]['RANK'].mean()), inplace=True)
you're assigning df = None, and the next iteration fails with the error you get.
You can omit the assignment df =, or, better yet, use the other answer.
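To illustrate what omitting the assignment looks like in practice, here is a rough sketch of the loop rewritten with .loc (my addition, not from the original answer), so each product's mean is filled in place without chained indexing:
for product in df['PRODUCT'].unique():
    mask = df['PRODUCT'] == product
    df.loc[mask, 'RANK'] = df.loc[mask, 'RANK'].fillna(df.loc[mask, 'RANK'].mean())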
