Divide each element by 2 and it should ignore "String" values - python

Divide each element by 2 and ignore "string" values. The end result should be a pandas DataFrame only.
import pandas as pd

df = pd.DataFrame({'a': [3, 6, 9], 'b': [2, 4, 6], 'c': [1, 2, 3]})
print(df)

You can do:
I added a column with string values for demonstration:
df = pd.DataFrame({'a': [3, 6, 9], 'b': [2, 4, 6], 'c': [1, 2, 3], 'd': ['a', 'b', 'c']})
for i in df.columns:
    try:
        df[i] = df[i] / 2
    except TypeError:
        pass  # non-numeric column, leave it unchanged
print(df)
This gives:
     a    b    c  d
0  1.5  1.0  0.5  a
1  3.0  2.0  1.0  b
2  4.5  3.0  1.5  c
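If, as here, every column is either fully numeric or fully string, you can also skip the loop entirely with select_dtypes; a sketch, not part of the original answer:
df = pd.DataFrame({'a': [3, 6, 9], 'b': [2, 4, 6], 'c': [1, 2, 3], 'd': ['a', 'b', 'c']})
num_cols = df.select_dtypes(include='number').columns  # picks a, b, c
df[num_cols] = df[num_cols] / 2  # column d is left untouched
print(df)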
If you have columns with mixed integer and string types:
df = pd.DataFrame({'a': [3, 6, 9, 'a'], 'b': [2, 4, 6, 8], 'c': [1, 2, 3, 4], 'd': ['a', 'b', 'c', 'd']})
for i in df.columns:
    try:
        df[i] = df[i] / 2
    except TypeError:
        df[i] = df[i].astype(str)
        mynewlist = [int(s) / 2 if s.isdigit() else s for s in df[i].values]
        df[i] = mynewlist
print(df)
which gives:
     a    b    c  d
0  1.5  1.0  0.5  a
1  3.0  2.0  1.0  b
2  4.5  3.0  1.5  c
3    a  4.0  2.0  d
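For the mixed-type case there is also a vectorized sketch (not from the original answer) built on pd.to_numeric with errors='coerce', which turns non-convertible entries into NaN, so you can halve what converts and keep the original strings elsewhere:
import pandas as pd

df = pd.DataFrame({'a': [3, 6, 9, 'a'], 'b': [2, 4, 6, 8], 'c': [1, 2, 3, 4], 'd': ['a', 'b', 'c', 'd']})
for col in df.columns:
    numeric = pd.to_numeric(df[col], errors='coerce')  # NaN where conversion fails
    df[col] = (numeric / 2).where(numeric.notna(), df[col])  # keep originals otherwise
print(df)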

Subtract with value in previous row to create a new column by subject

Using Python and this data set https://raw.githubusercontent.com/yadatree/AL/main/AK4.csv I would like to create a new column for each subject that starts with 0 (in the first row) and then subtracts the previous row's SCALE value from the current row's (row 2 minus row 1, row 3 minus row 2, row 4 minus row 3, etc.).
However, if this produces a negative value, then the output should be 0.
Edit: Thank you for the response. That worked perfectly. The only remaining issue is that I'd like to start again with each subject (SUBJECT column). The number of values for each subject is not fixed, so something that checks the SUBJECT column and then starts again from 0 would be ideal.
You can use .shift(1) to create a new column with the values moved down from the previous rows - then you have both values in the same row and can subtract the columns.
Later you can select all negative results and assign zero.
import pandas as pd

data = {
    'A': [1, 3, 2, 5, 1],
}
df = pd.DataFrame(data)

df['previous'] = df['A'].shift(1)
df['result'] = df['A'] - df['previous']
print(df)

#df['result'] = df['A'] - df['A'].shift(1)
#print(df)

df.loc[df['result'] < 0, 'result'] = 0
print(df)
Result:
   A  previous  result
0  1       NaN     NaN
1  3       1.0     2.0
2  2       3.0    -1.0
3  5       2.0     3.0
4  1       5.0    -4.0

   A  previous  result
0  1       NaN     NaN
1  3       1.0     2.0
2  2       3.0     0.0
3  5       2.0     3.0
4  1       5.0     0.0
EDIT:
If you use df['result'] = df['A'] - df['A'].shift(1) then you get the result column without creating the previous column.
And if you use .shift(1, fill_value=0) then it puts 0 instead of NaN in the first row.
EDIT:
You can use groupby("SUBJECT") to group by subject and later, in every group, put 0 in the first row.
import pandas as pd

data = {
    'S': ['A', 'A', 'A', 'B', 'B', 'B'],
    'A': [1, 3, 2, 1, 5, 1],
}
df = pd.DataFrame(data)

df['result'] = df['A'] - df['A'].shift(1, fill_value=0)
print(df)

df.loc[df['result'] < 0, 'result'] = 0
print(df)

all_groups = df.groupby('S')
first_index = all_groups.apply(lambda grp: grp.index[0])
df.loc[first_index, 'result'] = 0
print(df)
Results:
   S  A  result
0  A  1       1
1  A  3       2
2  A  2      -1
3  B  1      -1
4  B  5       4
5  B  1      -4

   S  A  result
0  A  1       1
1  A  3       2
2  A  2       0
3  B  1       0
4  B  5       4
5  B  1       0

   S  A  result
0  A  1       0
1  A  3       2
2  A  2       0
3  B  1       0
4  B  5       4
5  B  1       0
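The per-subject restart can also be written more compactly with a grouped diff; a sketch on the toy data above, not part of the original answer. Within each group, .diff() computes current minus previous and yields NaN for the group's first row, which fillna(0) turns into the required starting 0, and clip(lower=0) zeroes out the negatives:
import pandas as pd

df = pd.DataFrame({'S': ['A', 'A', 'A', 'B', 'B', 'B'],
                   'A': [1, 3, 2, 1, 5, 1]})

df['result'] = (df.groupby('S')['A'].diff()  # NaN at each group start
                  .fillna(0)                 # group starts -> 0
                  .clip(lower=0))            # negative differences -> 0
print(df)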

Pandas: apply result_type="expand": wrong dtypes

I want to add multiple columns to a DataFrame:
import pandas as pd

df = pd.DataFrame(
    [
        (0, 1),
        (1, 1),
        (1, 2),
    ],
    columns=['a', 'b']
)

def apply_fn(row) -> (int, float):
    return int(row.a + row.b), float(row.a / row.b)

df[['c', 'd']] = df.apply(apply_fn, result_type='expand', axis=1)
Result:
>>> df
   a  b    c    d
0  0  1  1.0  0.0
1  1  1  2.0  1.0
2  1  2  3.0  0.5
>>> df.dtypes
a      int64
b      int64
c    float64
d    float64
dtype: object
Why is column c not of dtype int? Can I specify this somehow? Something like .apply(..., dtypes=[int, float])?
I believe this is happening because result_type='expand' expands each returned tuple into a Series - the first row becomes its own Series, then the next row, and so on. Because a Series can only have one dtype, the ints get converted to floats.
For example, look at this:
>>> pd.Series([1, 0.0])
0    1.0
1    0.0
dtype: float64
One workaround would be to call .tolist() on the result of apply, and wrap that in a call to DataFrame:
>>> df[['c', 'd']] = pd.DataFrame(df.apply(apply_fn, axis=1).tolist())
>>> df
   a  b  c    d
0  0  1  1  0.0
1  1  1  2  1.0
2  1  2  3  0.5
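One caveat with the tolist workaround, as a sketch (column names are from the question): the wrapping DataFrame gets a fresh 0..n-1 index, so if df has a non-default index, pass it explicitly to keep the assignment aligned:
expanded = pd.DataFrame(df.apply(apply_fn, axis=1).tolist(),
                        index=df.index,     # align with df's own index
                        columns=['c', 'd'])
df[['c', 'd']] = expanded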
You can chain with astype:
df.apply(apply_fn, axis=1, result_type='expand').astype({0: 'int', 1: 'float'})
Out[147]:
   0    1
0  1  0.0
1  2  1.0
2  3  0.5
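To attach the correctly typed columns back to df under the names c and d, one option is set_axis plus join; a sketch starting from the original two-column df (the chaining is ours, not from the answer):
out = (df.apply(apply_fn, axis=1, result_type='expand')
         .astype({0: 'int', 1: 'float'})
         .set_axis(['c', 'd'], axis=1))  # rename columns 0/1 to c/d
df = df.join(out)                        # align on the index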

replace with multiple conditions not updating in pandas

I am trying to replace a value based on the row index, for only certain columns in a dataframe.
For columns b and c, I want to replace the value 1 with np.nan, for rows 1, 2 and 3:
import pandas as pd
import numpy as np

df = pd.DataFrame(data={'a': ['"dog", "cat"', '"dog"', '"mouse"', '"mouse", "cat", "bird"', '"circle", "square"', '"circle"', '"triangle", "square"', '"circle"'],
                        'b': [1, 1, 3, 4, 5, 1, 2, 3],
                        'c': [3, 4, 1, 3, 2, 1, 0, 0],
                        'd': ['a', 'a', 'b', 'c', 'b', 'c', 'd', 'e'],
                        'id': ['group1', 'group1', 'group1', 'group1', 'group2', 'group2', 'group2', 'group2']})
I am using the following line, but it's not updating in place; and if I try assigning it, it returns only the subset of amended rows rather than an updated version of the original dataframe.
df[df.index.isin([1,2,3])][['b','c']].replace(1, np.nan, inplace=True)
You could do it like this (the original line fails because chained indexing like df[...][['b','c']] operates on a temporary copy, so inplace=True modifies that copy, not df):
df.loc[1:3, ['b', 'c']] = df.loc[1:3, ['b', 'c']].replace(1, np.nan)
Output:
>>> df
                        a    b    c  d      id
0            "dog", "cat"  1.0  3.0  a  group1
1                   "dog"  NaN  4.0  a  group1
2                 "mouse"  3.0  NaN  b  group1
3  "mouse", "cat", "bird"  4.0  3.0  c  group1
4      "circle", "square"  5.0  2.0  b  group2
5                "circle"  1.0  1.0  c  group2
6    "triangle", "square"  2.0  0.0  d  group2
7                "circle"  3.0  0.0  e  group2
A more dynamic version:
cols = ['b', 'c']
rows = slice(1, 3)  # .loc slicing is label-based and inclusive, so this covers rows 1, 2 and 3; or use [1, 2, 3]
df.loc[rows, cols] = df.loc[rows, cols].replace(1, np.nan)
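An equivalent spelling uses mask, which replaces cells where the condition holds; a sketch reusing rows and cols from above:
import numpy as np

sub = df.loc[rows, cols]
df.loc[rows, cols] = sub.mask(sub == 1, np.nan)  # NaN wherever the cell equals 1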

How to aggregate and groupby in pandas

I have the following df. I'd like to group by customer and then count and sum at the same time; I also wish to add conditional grouping.
Is there any way to achieve this?
customer  product  score
A         a        1
A         b        2
A         c        3
B         a        4
B         a        5
B         b        6
My desired result is the following:
customer  count  sum  count(product=a)  sum(product=a)
A         3      6    1                 1
B         3      15   2                 9
My work so far is like this:
grouped = df.groupby('customer')
grouped.agg({"product": "count", "score": "sum"})
Thanks
Let us try crosstab:
s = pd.crosstab(df['customer'], df['product'], df['score'], margins=True, aggfunc=['sum', 'count']).drop('All')
Out[76]:
          sum                 count
product     a    b    c  All      a    b    c  All
customer
A         1.0  2.0  3.0    6    1.0  1.0  1.0    3
B         9.0  6.0  NaN   15    2.0  1.0  NaN    3
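If you want the question's exact four columns in a single pass, a sketch using helper columns plus named aggregation (the helper and output names are ours, not from the answers):
import pandas as pd

df = pd.DataFrame({'customer': ['A', 'A', 'A', 'B', 'B', 'B'],
                   'product': ['a', 'b', 'c', 'a', 'a', 'b'],
                   'score': [1, 2, 3, 4, 5, 6]})

out = (df.assign(is_a=df['product'].eq('a'),                         # flag product == 'a'
                 score_a=df['score'].where(df['product'].eq('a'), 0))  # score only for 'a'
         .groupby('customer')
         .agg(count=('product', 'count'),
              sum=('score', 'sum'),
              count_a=('is_a', 'sum'),
              sum_a=('score_a', 'sum'))
         .reset_index())
print(out)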
import pandas as pd

df = pd.DataFrame({'customer': ['A', 'A', 'A', 'B', 'B', 'B'], 'product': ['a', 'b', 'c', 'a', 'a', 'b'], 'score': [1, 2, 3, 4, 5, 6]})
df = df[df['product'] == 'a']
grouped = df.groupby('customer')
grouped = grouped.agg({"product": "count", "score": "sum"}).reset_index()
print(grouped)
Output:
  customer  product  score
0        A        1      1
1        B        2      9
Then merge this dataframe with the unfiltered grouped dataframe, as sketched below.
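A sketch of that merge; the filtered aggregate must be built before df is overwritten, so both aggregates are rebuilt here, and how='left' keeps customers who never bought product a:
import pandas as pd

df = pd.DataFrame({'customer': ['A', 'A', 'A', 'B', 'B', 'B'],
                   'product': ['a', 'b', 'c', 'a', 'a', 'b'],
                   'score': [1, 2, 3, 4, 5, 6]})

# unfiltered aggregate: overall count and sum per customer
overall = (df.groupby('customer')
             .agg({"product": "count", "score": "sum"})
             .rename(columns={'product': 'count', 'score': 'sum'})
             .reset_index())

# filtered aggregate: count and sum for product == 'a' only
only_a = (df[df['product'] == 'a']
          .groupby('customer')
          .agg({"product": "count", "score": "sum"})
          .rename(columns={'product': 'count(product=a)', 'score': 'sum(product=a)'})
          .reset_index())

result = overall.merge(only_a, on='customer', how='left')
print(result)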

How to create conditional columns in Pandas Data Frame in which column values are based on other columns

I am new to Python; I am attempting what would be a conditional mutate in R's dplyr.
In short, I would like to create a new column in the DataFrame called Result where: if df['test'] is greater than 1, df['Result'] equals the respective df['count'] for that row; if it is lower than 1, then df['Result'] is df['count'] * df['test'].
I have tried df['Result'] = df['test'].apply(lambda x: df['count'] if x >= 1 else ...). Unfortunately this results in a series; I have also attempted to write small functions, which also return series.
I would like the final DataFrame to look like this:
no_  Test  Count  Result
1    2     1      1
2    3     5      5
3    4     1      1
4    6     2      2
5    0.5   2      1
You can use np.where:
import numpy as np

df['Result'] = np.where(df['Test'] > 1, df['Count'], df['Count'] * df['Test'])
Output:
   No_  Test  Count  Result
0    1   2.0      1     1.0
1    2   3.0      5     5.0
2    3   4.0      1     1.0
3    4   6.0      2     2.0
4    5   0.5      2     1.0
You can work it out with a list comprehension:
df['Result'] = [df['Count'][i] if df['Test'][i] > 1 else
                df['Count'][i] * df['Test'][i]
                for i in range(df.shape[0])]
Here is a way to do this:
import pandas as pd

df = pd.DataFrame(columns=['Test', 'Count'],
                  data={'Test': [2, 3, 4, 6, 0.5], 'Count': [1, 5, 1, 2, 2]})
df['Result'] = df['Count']
df.loc[df['Test'] < 1, 'Result'] = df['Test'] * df['Count']
Output:
   Test  Count  Result
0   2.0      1     1.0
1   3.0      5     5.0
2   4.0      1     1.0
3   6.0      2     2.0
4   0.5      2     1.0
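The same conditional can also be written without numpy via Series.where, which keeps values where the condition is True and falls back elsewhere; a sketch on the df just built:
# keep Count where Test > 1, otherwise fall back to Count * Test
df['Result'] = df['Count'].where(df['Test'] > 1, df['Count'] * df['Test'])
print(df)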
