Process for multiple columns

Process for multiple columns - python

I have this code which works for one pandas series. How to apply it to all columns of my large dataset? I have tried many solutions, but none works for me.
c = data["High_banks"]
c2 = pd.to_numeric(c.str.replace(',',''))
data = data.assign(High_banks = c2)
What is the best way to do this?

i think you can do it like this
df = df.replace(",","",regex=True )
after that you can convert datatype

You can use a combination of the methods apply and applymap.
Take this for an example:
df = pd.DataFrame([['1,', '2,12'], ['3,356', '4,567']], columns = ['a','b'])
new_df = (df.applymap(lambda x: x.replace(',',''))
.apply(pd.to_numeric, axis = 1))
new_df.dtypes
>> #successfully converted to numeric types
a int64
b int64
dtype: object
The first method, applymap runs element wise on the dataframe to remove , then apply applies the pd.to_numeric function across the column axis of the dataframe.

Related

pandas - mask works on whole dataframe but on selected columns?

I was replacing values in columns and noticed that if use mask on all the dataframe, it will produce expected results, but if I used it against selected columns with .loc, it won't change any value.
Can you explain why and tell if it is expected result?
You can try with a dataframe dt, containing 0 in columns:
dt = pd.DataFrame(np.random.randint(0,3,size=(10, 3)), columns=list('ABC'))
dt.mask(lambda x: x == 0, np.nan, inplace=True)
# will replace all zeros to nan, OK.
But:
dt = pd.DataFrame(np.random.randint(0,3,size=(10, 3)), columns=list('ABC'))
columns = list('BC')
dt.loc[:, columns].mask(lambda x: x == 0, np.nan, inplace=True)
# won't cange anything. I excpet B, C columns to have values replaced

i guess it's because the DataFrame.loc property is just giving access to a slice of your dataframe and you are masking a copy of the dataframe so it doesn't affect the data.
you can try this instead:
dt[columns] = dt[columns].mask(dt[columns] == 0)

The loc functions returns a copy of the dataframe. On this copy you are applying the mask function that perform the operation in place on the data. You can't do this on a one-liner, otherwise the memory copy remains inaccessible. To get access to that memory area you have to split the code into 2 lines, to get a reference to that memory area:
tmp = dt.loc[:, columns]
tmp.mask(tmp[columns] == 0, np.nan, inplace=True)
and then you can go and update the dataframe:
dt[columns] = tmp
Not using the inplace update of the mask function, on the other hand, you can do everything with one line of code
dt[columns] = dt.loc[:, columns].mask(dt[columns] == 0, np.nan, inplace=False)
Extra:
If you want to better understand the use of the inplace method in pandas, I recommend you read these posts:
Understanding inplace=True in pandas
In pandas, is inplace = True considered harmful, or not?
What is the difference between using loc and using just square brackets to filter for columns in Pandas/Python?

column filter and multiplication in dask dataframe

I am trying to replicate the following operation on a dask dataframe where I have to filter the dataframe based on column value and multiply another column on that.
Following is pandas equivalent -
import dask.dataframe as dd
df['adjusted_revenue'] = 0
df.loc[(df.tracked ==1), 'adjusted_revenue'] = 0.7*df['gross_revenue']
df.loc[(df.tracked ==0), 'adjusted_revenue'] = 0.3*df['gross_revenue']
I am trying to do this on a dask dataframe but it doesn't support assignment.
TypeError: '_LocIndexer' object does not support item assignment
This is working for me -
df['adjusted_revenue'] = 0
df1 = df.loc[df['tracked'] ==1]
df1['adjusted_revenue'] = 0.7*df1['gross_revenue']
df2 = df.loc[df['tracked'] ==0]
df2['adjusted_revenue'] = 0.3*df['gross_revenue']
df = dd.concat([df1, df2])
However, I was hoping if there is any simpler way to do this.
Thanks!

You should use .apply, which is probably the right thing to do with Pandas too; or perhaps where. However, to keep things as similar to your original, here it is with map_partitions, in which you act on each piece of the the dataframe independently, and those pieces really are Pandas dataframes.
def make_col(df):
df['adjusted_revenue'] = 0
df.loc[(df.tracked ==1), 'adjusted_revenue'] = 0.7*df['gross_revenue']
df.loc[(df.tracked ==0), 'adjusted_revenue'] = 0.3*df['gross_revenue']
return df
new_df = df.map_partitions(make_col)

What is the right way to substitute column values in dataframe?

I want to following thing to happen:
for every column in df check if its type is numeric, if not - use label encoder to map str/obj to numeric classes (e.g 0,1,2,3...).
I am trying to do it in the following way:
for col in df:
if not np.issubdtype(df[col].dtype, np.number):
df[col] = LabelEncoder().fit_transform(df[col])
I see few problems here.
First - column names can repeat and thus df[col] returns more than one column, which is not what I want.
Second - df[col].dtype throws error:
AttributeError: 'DataFrame' object has no attribute 'dtype'
which I assume might arise due to the issue #1 , e.g we get multiple columns returned. But I am not confident.
Third - would assigning df[col] = LabelEncoder().fit_transform(df[col]) lead to a column substitution in df or should I do some esoteric df partitioning and concatenation?
Thank you

Since LabelEncoder supports only one column at a time, iteration over columns is your only option. You can make this a little more concise using select_dtypes to select the columns, and then df.apply to apply the LabelEncoder to each column.
cols = df.select_dtypes(exclude=[np.number]).columns
df[cols] = df[cols].apply(lambda x: LabelEncoder().fit_transform(x))
Alternatively, you could build a mask by selecting object dtypes only (a little more flaky but easily extensible):
m = df.dtypes == object
# m = [not np.issubdtype(d, np.number) for d in df.dtypes]
df.loc[:, m] = df.loc[:, m].apply(lambda x: LabelEncoder().fit_transform(x))

applymap() does not work on Pandas MultiIndex Slice

I have an hierarchical dataset:
df = pd.DataFrame(np.random.rand(6,6),
columns=[['A','A','A','B','B','B'],
['mean', 'max', 'avg']*2],
index=pd.date_range('20000103', periods=6))
I want to apply a function to all values under the columns A. I can set the value to something:
df.loc[slice(None), 'A'] = 1
Easy enough. Now, instead of assigning a value, if I want to apply a mapping to this MultiIndex slice, it does not work.
For example, let me apply a simple formatting statement:
df.loc[slice(None), 'A'].applymap('{:.2f}'.format)
This step works fine. However, I cannot assign this to the original df:
df.loc[slice(None), 'A'] = df.loc[slice(None), 'A'].applymap('{:.2f}'.format)
Everything turns into a NaN. Any help would be appreciated.

You can do it in a couple of ways:
df['A'] = df['A'].applymap('{:.2f}'.format)
or (this will keep the original dtype)
df['A'] = df['A'].round(2)
or as a string
df['A'] = df['A'].round(2).astype(str)

Convert pandas data frame to series

I'm somewhat new to pandas. I have a pandas data frame that is 1 row by 23 columns.
I want to convert this into a series? I'm wondering what the most pythonic way to do this is?
I've tried pd.Series(myResults) but it complains ValueError: cannot copy sequence with size 23 to array axis with dimension 1. It's not smart enough to realize it's still a "vector" in math terms.
Thanks!

You can transpose the single-row dataframe (which still results in a dataframe) and then squeeze the results into a series (the inverse of to_frame).
df = pd.DataFrame([list(range(5))], columns=["a{}".format(i) for i in range(5)])
>>> df.squeeze(axis=0)
a0 0
a1 1
a2 2
a3 3
a4 4
Name: 0, dtype: int64
Note: To accommodate the point raised by #IanS (even though it is not in the OP's question), test for the dataframe's size. I am assuming that df is a dataframe, but the edge cases are an empty dataframe, a dataframe of shape (1, 1), and a dataframe with more than one row in which case the use should implement their desired functionality.
if df.empty:
# Empty dataframe, so convert to empty Series.
result = pd.Series()
elif df.shape == (1, 1)
# DataFrame with one value, so convert to series with appropriate index.
result = pd.Series(df.iat[0, 0], index=df.columns)
elif len(df) == 1:
# Convert to series per OP's question.
result = df.T.squeeze()
else:
# Dataframe with multiple rows. Implement desired behavior.
pass
This can also be simplified along the lines of the answer provided by #themachinist.
if len(df) > 1:
# Dataframe with multiple rows. Implement desired behavior.
pass
else:
result = pd.Series() if df.empty else df.iloc[0, :]

It's not smart enough to realize it's still a "vector" in math terms.
Say rather that it's smart enough to recognize a difference in dimensionality. :-)
I think the simplest thing you can do is select that row positionally using iloc, which gives you a Series with the columns as the new index and the values as the values:
>>> df = pd.DataFrame([list(range(5))], columns=["a{}".format(i) for i in range(5)])
>>> df
a0 a1 a2 a3 a4
0 0 1 2 3 4
>>> df.iloc[0]
a0 0
a1 1
a2 2
a3 3
a4 4
Name: 0, dtype: int64
>>> type(_)
<class 'pandas.core.series.Series'>

You can retrieve the series through slicing your dataframe using one of these two methods:
http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.iloc.html
http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.loc.html
import pandas as pd
import numpy as np
df = pd.DataFrame(data=np.random.randn(1,8))
series1=df.iloc[0,:]
type(series1)
pandas.core.series.Series

You can also use stack()
df= DataFrame([list(range(5))], columns = [“a{}”.format(I) for I in range(5)])
After u run df, then run:
df.stack()
You obtain your dataframe in series

If you have a one column dataframe df, you can convert it to a series:
df.iloc[:,0] # pandas Series
Since you have a one row dataframe df, you can transpose it so you're in the previous case:
df.T.iloc[:,0]

Another way -
Suppose myResult is the dataFrame that contains your data in the form of 1 col and 23 rows
# label your columns by passing a list of names
myResult.columns = ['firstCol']
# fetch the column in this way, which will return you a series
myResult = myResult['firstCol']
print(type(myResult))
In similar fashion, you can get series from Dataframe with multiple columns.

data = pd.DataFrame({"a":[1,2,3,34],"b":[5,6,7,8]})
new_data = pd.melt(data)
new_data.set_index("variable", inplace=True)
This gives a dataframe with index as column name of data and all data are present in "values" column

Another way is very simple
df= df.iloc[3].reset_index(drop=True).squeeze()
Squeeze -> is the one that converts to Series.

We Keep Coding

Python is a programming language that lets you work quickly and integrate systems more effectively.

Process for multiple columns - python

I have this code which works for one pandas series. How to apply it to all columns of my large dataset? I have tried many solutions, but none works for me. c = data["High_banks"] c2 = pd.to_numeric(c.str.replace(',','')) data = data.assign(High_banks = c2) What is the best way to do this?

i think you can do it like this df = df.replace(",","",regex=True ) after that you can convert datatype

Related

pandas - mask works on whole dataframe but on selected columns?

column filter and multiplication in dask dataframe

What is the right way to substitute column values in dataframe?

applymap() does not work on Pandas MultiIndex Slice

Convert pandas data frame to series

Categories

Resources