Process for multiple columns - python

I have this code which works for one pandas series. How to apply it to all columns of my large dataset? I have tried many solutions, but none works for me.
c = data["High_banks"]
c2 = pd.to_numeric(c.str.replace(',',''))
data = data.assign(High_banks = c2)
What is the best way to do this?

i think you can do it like this
df = df.replace(",","",regex=True )
after that you can convert datatype

You can use a combination of the methods apply and applymap.
Take this for an example:
df = pd.DataFrame([['1,', '2,12'], ['3,356', '4,567']], columns = ['a','b'])
new_df = (df.applymap(lambda x: x.replace(',',''))
.apply(pd.to_numeric, axis = 1))
new_df.dtypes
>> #successfully converted to numeric types
a int64
b int64
dtype: object
The first method, applymap runs element wise on the dataframe to remove , then apply applies the pd.to_numeric function across the column axis of the dataframe.

Related

pandas - mask works on whole dataframe but on selected columns?

I was replacing values in columns and noticed that if use mask on all the dataframe, it will produce expected results, but if I used it against selected columns with .loc, it won't change any value.
Can you explain why and tell if it is expected result?
You can try with a dataframe dt, containing 0 in columns:
dt = pd.DataFrame(np.random.randint(0,3,size=(10, 3)), columns=list('ABC'))
dt.mask(lambda x: x == 0, np.nan, inplace=True)
# will replace all zeros to nan, OK.
But:
dt = pd.DataFrame(np.random.randint(0,3,size=(10, 3)), columns=list('ABC'))
columns = list('BC')
dt.loc[:, columns].mask(lambda x: x == 0, np.nan, inplace=True)
# won't cange anything. I excpet B, C columns to have values replaced
i guess it's because the DataFrame.loc property is just giving access to a slice of your dataframe and you are masking a copy of the dataframe so it doesn't affect the data.
you can try this instead:
dt[columns] = dt[columns].mask(dt[columns] == 0)
The loc functions returns a copy of the dataframe. On this copy you are applying the mask function that perform the operation in place on the data. You can't do this on a one-liner, otherwise the memory copy remains inaccessible. To get access to that memory area you have to split the code into 2 lines, to get a reference to that memory area:
tmp = dt.loc[:, columns]
tmp.mask(tmp[columns] == 0, np.nan, inplace=True)
and then you can go and update the dataframe:
dt[columns] = tmp
Not using the inplace update of the mask function, on the other hand, you can do everything with one line of code
dt[columns] = dt.loc[:, columns].mask(dt[columns] == 0, np.nan, inplace=False)
Extra:
If you want to better understand the use of the inplace method in pandas, I recommend you read these posts:
Understanding inplace=True in pandas
In pandas, is inplace = True considered harmful, or not?
What is the difference between using loc and using just square brackets to filter for columns in Pandas/Python?

column filter and multiplication in dask dataframe

I am trying to replicate the following operation on a dask dataframe where I have to filter the dataframe based on column value and multiply another column on that.
Following is pandas equivalent -
import dask.dataframe as dd
df['adjusted_revenue'] = 0
df.loc[(df.tracked ==1), 'adjusted_revenue'] = 0.7*df['gross_revenue']
df.loc[(df.tracked ==0), 'adjusted_revenue'] = 0.3*df['gross_revenue']
I am trying to do this on a dask dataframe but it doesn't support assignment.
TypeError: '_LocIndexer' object does not support item assignment
This is working for me -
df['adjusted_revenue'] = 0
df1 = df.loc[df['tracked'] ==1]
df1['adjusted_revenue'] = 0.7*df1['gross_revenue']
df2 = df.loc[df['tracked'] ==0]
df2['adjusted_revenue'] = 0.3*df['gross_revenue']
df = dd.concat([df1, df2])
However, I was hoping if there is any simpler way to do this.
Thanks!
You should use .apply, which is probably the right thing to do with Pandas too; or perhaps where. However, to keep things as similar to your original, here it is with map_partitions, in which you act on each piece of the the dataframe independently, and those pieces really are Pandas dataframes.
def make_col(df):
df['adjusted_revenue'] = 0
df.loc[(df.tracked ==1), 'adjusted_revenue'] = 0.7*df['gross_revenue']
df.loc[(df.tracked ==0), 'adjusted_revenue'] = 0.3*df['gross_revenue']
return df
new_df = df.map_partitions(make_col)

What is the right way to substitute column values in dataframe?

I want to following thing to happen:
for every column in df check if its type is numeric, if not - use label encoder to map str/obj to numeric classes (e.g 0,1,2,3...).
I am trying to do it in the following way:
for col in df:
if not np.issubdtype(df[col].dtype, np.number):
df[col] = LabelEncoder().fit_transform(df[col])
I see few problems here.
First - column names can repeat and thus df[col] returns more than one column, which is not what I want.
Second - df[col].dtype throws error:
AttributeError: 'DataFrame' object has no attribute 'dtype'
which I assume might arise due to the issue #1 , e.g we get multiple columns returned. But I am not confident.
Third - would assigning df[col] = LabelEncoder().fit_transform(df[col]) lead to a column substitution in df or should I do some esoteric df partitioning and concatenation?
Thank you
Since LabelEncoder supports only one column at a time, iteration over columns is your only option. You can make this a little more concise using select_dtypes to select the columns, and then df.apply to apply the LabelEncoder to each column.
cols = df.select_dtypes(exclude=[np.number]).columns
df[cols] = df[cols].apply(lambda x: LabelEncoder().fit_transform(x))
Alternatively, you could build a mask by selecting object dtypes only (a little more flaky but easily extensible):
m = df.dtypes == object
# m = [not np.issubdtype(d, np.number) for d in df.dtypes]
df.loc[:, m] = df.loc[:, m].apply(lambda x: LabelEncoder().fit_transform(x))

applymap() does not work on Pandas MultiIndex Slice

I have an hierarchical dataset:
df = pd.DataFrame(np.random.rand(6,6),
columns=[['A','A','A','B','B','B'],
['mean', 'max', 'avg']*2],
index=pd.date_range('20000103', periods=6))
I want to apply a function to all values under the columns A. I can set the value to something:
df.loc[slice(None), 'A'] = 1
Easy enough. Now, instead of assigning a value, if I want to apply a mapping to this MultiIndex slice, it does not work.
For example, let me apply a simple formatting statement:
df.loc[slice(None), 'A'].applymap('{:.2f}'.format)
This step works fine. However, I cannot assign this to the original df:
df.loc[slice(None), 'A'] = df.loc[slice(None), 'A'].applymap('{:.2f}'.format)
Everything turns into a NaN. Any help would be appreciated.
You can do it in a couple of ways:
df['A'] = df['A'].applymap('{:.2f}'.format)
or (this will keep the original dtype)
df['A'] = df['A'].round(2)
or as a string
df['A'] = df['A'].round(2).astype(str)

Convert pandas data frame to series

I'm somewhat new to pandas. I have a pandas data frame that is 1 row by 23 columns.
I want to convert this into a series? I'm wondering what the most pythonic way to do this is?
I've tried pd.Series(myResults) but it complains ValueError: cannot copy sequence with size 23 to array axis with dimension 1. It's not smart enough to realize it's still a "vector" in math terms.
Thanks!
You can transpose the single-row dataframe (which still results in a dataframe) and then squeeze the results into a series (the inverse of to_frame).
df = pd.DataFrame([list(range(5))], columns=["a{}".format(i) for i in range(5)])
>>> df.squeeze(axis=0)
a0 0
a1 1
a2 2
a3 3
a4 4
Name: 0, dtype: int64
Note: To accommodate the point raised by #IanS (even though it is not in the OP's question), test for the dataframe's size. I am assuming that df is a dataframe, but the edge cases are an empty dataframe, a dataframe of shape (1, 1), and a dataframe with more than one row in which case the use should implement their desired functionality.
if df.empty:
# Empty dataframe, so convert to empty Series.
result = pd.Series()
elif df.shape == (1, 1)
# DataFrame with one value, so convert to series with appropriate index.
result = pd.Series(df.iat[0, 0], index=df.columns)
elif len(df) == 1:
# Convert to series per OP's question.
result = df.T.squeeze()
else:
# Dataframe with multiple rows. Implement desired behavior.
pass
This can also be simplified along the lines of the answer provided by #themachinist.
if len(df) > 1:
# Dataframe with multiple rows. Implement desired behavior.
pass
else:
result = pd.Series() if df.empty else df.iloc[0, :]
It's not smart enough to realize it's still a "vector" in math terms.
Say rather that it's smart enough to recognize a difference in dimensionality. :-)
I think the simplest thing you can do is select that row positionally using iloc, which gives you a Series with the columns as the new index and the values as the values:
>>> df = pd.DataFrame([list(range(5))], columns=["a{}".format(i) for i in range(5)])
>>> df
a0 a1 a2 a3 a4
0 0 1 2 3 4
>>> df.iloc[0]
a0 0
a1 1
a2 2
a3 3
a4 4
Name: 0, dtype: int64
>>> type(_)
<class 'pandas.core.series.Series'>
You can retrieve the series through slicing your dataframe using one of these two methods:
http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.iloc.html
http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.loc.html
import pandas as pd
import numpy as np
df = pd.DataFrame(data=np.random.randn(1,8))
series1=df.iloc[0,:]
type(series1)
pandas.core.series.Series
You can also use stack()
df= DataFrame([list(range(5))], columns = [“a{}”.format(I) for I in range(5)])
After u run df, then run:
df.stack()
You obtain your dataframe in series
If you have a one column dataframe df, you can convert it to a series:
df.iloc[:,0] # pandas Series
Since you have a one row dataframe df, you can transpose it so you're in the previous case:
df.T.iloc[:,0]
Another way -
Suppose myResult is the dataFrame that contains your data in the form of 1 col and 23 rows
# label your columns by passing a list of names
myResult.columns = ['firstCol']
# fetch the column in this way, which will return you a series
myResult = myResult['firstCol']
print(type(myResult))
In similar fashion, you can get series from Dataframe with multiple columns.
data = pd.DataFrame({"a":[1,2,3,34],"b":[5,6,7,8]})
new_data = pd.melt(data)
new_data.set_index("variable", inplace=True)
This gives a dataframe with index as column name of data and all data are present in "values" column
Another way is very simple
df= df.iloc[3].reset_index(drop=True).squeeze()
Squeeze -> is the one that converts to Series.

Categories