Transform a dataframe without looping? - python

I would like to analyse and transform the following DataFrame
import random
import string
import numpy as np
import pandas as pd
# generate example dataframe
df=pd.DataFrame()
df['Name']=[str(x) for x in np.random.choice(['a','b','c'],10)]
df['Cat1']=[str(x) for x in np.random.choice(['x',''],10)]
df['Cat2']=[str(x) for x in np.random.choice(['x',''],10)]
df['Cat3']=[str(x) for x in np.random.choice(['x',''],10)]
df.head(10)
This produces a DataFrame like this:
Sample DataFrame
The task is to count the 'x' in columns Cat1, Cat2, Cat3 for each unique entry in column 'Name'. This can be achieved with the help of the groupby() function:
grouped = df.groupby(['Name'])
dfg = grouped[['Cat1', 'Cat2', 'Cat3']].sum()
dfg
Result of analysis
And the result is almost what I wanted. Now I needed to replace the 'x' strings by a number, e.g. 'xxxx' by 4, 'x' by 1, and so forth. My solution uses a loop over all columns:
for col in range(len(dfg.columns)):
    dfg[dfg.columns[col]] = list(map(len, dfg[dfg.columns[col]]))
dfg
Final result.
Now, I wonder how I can avoid that loop and achieve the same final result?
Thanks a lot for sharing your ideas and guidance.

Try:
df.set_index('Name').eq('x')\
    .groupby('Name')[['Cat1', 'Cat2', 'Cat3']].sum()\
    .astype(int).reset_index()
Output:
Name Cat1 Cat2 Cat3
0 a 5 3 4
1 b 1 1 0
2 c 1 1 1
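If the string-sum result from the question's dfg is already in hand, the per-column loop can also be replaced by one column-wise string-length pass. A sketch on a small hand-built stand-in for dfg (the values here are made up):

```python
import pandas as pd

# a stand-in for the grouped result: each cell holds a run of 'x'
dfg = pd.DataFrame({'Cat1': ['xxxx', 'x'], 'Cat2': ['xx', '']},
                   index=pd.Index(['a', 'b'], name='Name'))

# length of each cell, taken column by column, replaces the explicit loop
dfg = dfg.apply(lambda col: col.str.len())
```

An empty string naturally becomes 0, so no special-casing is needed.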

Depending on your source of data, this could be easily solved by replacing the "x" with a 1 and setting the empty cells to 0. You would also have to change the datatype of the columns to integer.
Calling sum() then on your group will already give you the numeric answer.
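A minimal sketch of that idea, on toy data invented here for illustration:

```python
import pandas as pd

df = pd.DataFrame({'Name': ['a', 'a', 'b'],
                   'Cat1': ['x', '', 'x'],
                   'Cat2': ['', 'x', '']})

# turn 'x' into 1 and '' into 0 up front, so the group sum is already numeric
num = df[['Cat1', 'Cat2']].eq('x').astype(int)
out = num.groupby(df['Name']).sum()
```

Grouping `num` by the external `df['Name']` Series works because both share the same index.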

How to do this transpose?

I have this situation and cannot find a way with Pandas to get the result I want.
I have a df with only one column, where each 'MSG ...' row is followed by its 'Y-...' values.
I want to transpose it so that each message and its values end up in one row.
I already tried transpose but am not getting the result I want.
Also, is there an easy way to put each value in a specific column? For example: Y-1 in a column named Y-1, Y-3 in a column named Y-3. And if there is no Y-2 value, leave it blank in that column.
You might be able to get away with dropping down to numpy to simply reshape. However, if there is a variable number of entries for each row, you can use a pivot with custom indices:
import pandas as pd
df = pd.DataFrame({"MSG": ["MSG XXX", "Y-1", "Y-2", "Y-3", "Y-5", "Y-7", "Y-19", "MSG XYZ", "Y-1", "Y-3", "Y-11", "Y-12", "Y-17", "Y-19"]})
groups = df["MSG"].str.startswith("MSG").cumsum()
out = (
    df
    .assign(index=groups, columns=df.groupby(groups).cumcount())
    .pivot(index="index", columns="columns", values="MSG")
)
out:
columns 0 1 2 3 4 5 6
index
1 MSG XXX Y-1 Y-2 Y-3 Y-5 Y-7 Y-19
2 MSG XYZ Y-1 Y-3 Y-11 Y-12 Y-17 Y-19
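For the follow-up about naming the columns after the values themselves (Y-1 into a column named Y-1, blank where a group has no Y-2), one hedged variation on the same idea is to pivot on the label instead of the cumcount:

```python
import pandas as pd

df = pd.DataFrame({"MSG": ["MSG XXX", "Y-1", "Y-2", "Y-3",
                           "MSG XYZ", "Y-1", "Y-3"]})
groups = df["MSG"].str.startswith("MSG").cumsum()

# keep only the Y-rows and use each value as its own column label
y = df[~df["MSG"].str.startswith("MSG")]
out = (y.assign(grp=groups, label=y["MSG"])
        .pivot(index="grp", columns="label", values="MSG"))
```

Groups that lack a given Y-label get NaN in that column, which matches the "leave it blank" requirement; the MSG text itself could be carried along with one more assigned column if needed.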

How to add multiple columns to a python dataframe by using other dataframe columns

Here is my requirement
I have an existing data frame df.A[a,b,c] and would like to create a new df.B[X,Y] from df.A by doing some arithmetic operations on the columns in df.A.
it will be like
df.A= a b c
0 1 2
0 2 0
1 3 2
My df.B will be derived as
df.B['X','Y']=df.A[(sum[A.a]+[A.b]),sum[A.c]]
The output should look like
df.B= X Y
7 4
Let me know if you need any further details to achieve this.
Is your only goal to get the sum of (A+B) and the sum of (C)?
If that's the case, just do something like this:
import pandas as pd
import numpy as np
df1 = pd.DataFrame({'A': np.random.randint(0, 10, 3),
                    'B': np.random.randint(0, 10, 3),
                    'C': np.random.randint(0, 10, 3)})
s = df1.sum()
s.groupby(s.index.isin(['A', 'B'])).sum().rename({False: 'Y', True: 'X'})
I'm having a hard time seeing the purpose of this, so if you add more context, it'd be helpful for creating a more elegant solution.
I think it'd be easier to just get outputs:
A_plus_B = df1[['A', 'B']].to_numpy().sum()
C = df1['C'].sum()
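If the goal really is just those two numbers in a one-row frame, a df.B can be built directly from the question's example data (a sketch; the column names X and Y follow the question):

```python
import pandas as pd

dfA = pd.DataFrame({'a': [0, 0, 1], 'b': [1, 2, 3], 'c': [2, 0, 2]})

# X = sum(a) + sum(b), Y = sum(c), as in the question's example output
dfB = pd.DataFrame({'X': [dfA['a'].sum() + dfA['b'].sum()],
                    'Y': [dfA['c'].sum()]})
```

With the question's numbers this reproduces X = 7 and Y = 4.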

Element-by-element division in pandas dataframe with "/"?

It would be great to understand how this actually works. Perhaps there is something in Python/Pandas that I don't quite understand.
I have a dataframe (price data) and would like to calculate the returns. Rows are the stocks while columns are the dates.
For simplicity, I have created the prices with some random numbers.
import pandas as pd
import numpy as np
df_price = pd.DataFrame(np.random.rand(10,10))
df_ret = df_price.iloc[:,1:]/df_price.iloc[:,:-1]-1
There are two things I find strange here:
My numerator and denominator are both 10 x 9. Why is the output a 10 x 10 with the first column being NaNs?
Why are the results all 0, besides the first column being NaNs? I.e., why wasn't the calculation performed?
Thanks.
When we do the division, pandas first aligns df_price.iloc[:,1:] and df_price.iloc[:,:-1] on both index and columns and matches labels. We need to add .values to strip the index and column labels from one side so that no matching happens; then the output is what we expected.
df_ret = df_price.iloc[:,1:]/df_price.iloc[:,:-1].values-1
Example
s=pd.Series([2,4,6])
s.iloc[1:]/s.iloc[:-1]
Out[54]:
0 NaN # here the index s.iloc[:-1] included
1 1.0
2 NaN # here the index s.iloc[1:] included
dtype: float64
From the above we can say that pandas objects align on the index first, much like an outer join.
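A small deterministic sketch of the .values fix, with hand-picked prices (invented here) so the returns are easy to check by eye:

```python
import pandas as pd

df_price = pd.DataFrame([[1.0, 2.0, 4.0],
                         [10.0, 20.0, 30.0]])

# .values strips index/column labels from the denominator,
# so the division is purely positional instead of label-aligned
df_ret = df_price.iloc[:, 1:] / df_price.iloc[:, :-1].values - 1
```

The result keeps the shape of the left-hand operand (2 x 2 here) rather than ballooning to the union of column labels.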

How to modify DataFrame column without getting SettingWithCopyWarning?

I have a DataFrame object df. And I would like to modify the job column so that all retired people are 1 and the rest 0 (as shown here):
df['job'] = df['job'].apply(lambda x: 1 if x == "retired" else 0)
But I get a warning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead
Why did I get it here, though? From what I read, it applies to situations where I take a slice of rows and then a column, but here I am just modifying elements of one column. Is there a better way to do that?
Use:
df['job'] = df['job'].eq('retired').astype(int)
or
df['job'] = np.where(df['job'].eq('retired'), 1, 0)
So here's an example dataframe:
import pandas as pd
import numpy as np
data = {'job':['retired', 'a', 'b', 'retired']}
df = pd.DataFrame(data)
print(df)
job
0 retired
1 a
2 b
3 retired
Now, you can make use of numpy's where function:
df['job'] = np.where(df['job']=='retired', 1, 0)
print(df)
job
0 1
1 0
2 0
3 1
I would not suggest using apply here, as in the case of a large data frame it could lower your performance.
I would prefer using numpy.select or numpy.where.
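As for why the warning appears at all: it usually means the df being written to was itself carved out of another frame. A hedged sketch of that situation and the usual fix, an explicit .copy() (the raw frame here is invented for illustration):

```python
import pandas as pd

raw = pd.DataFrame({'job': ['retired', 'a', 'b'], 'age': [70, 30, 40]})

# sub = raw[raw['age'] > 35] would be a slice; writing to it triggers the warning
sub = raw[raw['age'] > 35].copy()   # explicit copy: safe to modify, no warning

sub['job'] = (sub['job'] == 'retired').astype(int)
```

With the copy, pandas knows the assignment cannot silently fail to reach `raw`.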

Getting substring based on another column in a pandas dataframe

Hi, is there a way to get a substring of a column based on another column?
import pandas as pd
x = pd.DataFrame({'name':['bernard','brenden','bern'],'digit':[2,3,3]})
x
digit name
0 2 bernard
1 3 brenden
2 3 bern
What i would expect is something like:
for row in x.itertuples():
    print(row[2][:row[1]])
be
bre
ber
where the result is the substring of name based on digit.
I know that if I really want to I can build a list with the itertuples function, but that does not seem right; also, I always try to find a vectorized method.
Appreciate any feedback.
Use apply with axis=1 for row-wise with a lambda so you access each column for slicing:
In [68]:
x = pd.DataFrame({'name':['bernard','brenden','bern'],'digit':[2,3,3]})
x.apply(lambda x: x['name'][:x['digit']], axis=1)
Out[68]:
0 be
1 bre
2 ber
dtype: object
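Without row-wise apply, a plain comprehension over the two columns does the same slicing and tends to be faster on large frames (a sketch on the question's data):

```python
import pandas as pd

x = pd.DataFrame({'name': ['bernard', 'brenden', 'bern'],
                  'digit': [2, 3, 3]})

# pair each name with its digit and slice; no per-row apply needed
x['sub'] = [n[:d] for n, d in zip(x['name'], x['digit'])]
```

There is no truly vectorized variable-length string slice in pandas, so a comprehension is a common compromise here.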
