This is my data set:
  q1 q2 q3 q4
0  a  a  a  a
1  b  a  a  a
2  c  c  b  a
3  d  d  b  a
4  a  a  a  a
5  b  c  b  a
6  b  a  b  a
7  c  c  b  a
8  d  d  b  a
where column
q1 has 'a','b','c','d' column values.
q2 has 'a','c','d' column values.
q3 has 'a','b' column values.
q4 has 'a' column values.
I want to run a for-loop over all columns using the same expression, but not every column contains every value, so I am getting an error.
Example:
col = ['q1', 'q2', 'q3', 'q4']
for i in col:
    print((df[i].value_counts()['a']) + (df[i].value_counts()['b']) + (df[i].value_counts()['c']) + (df[i].value_counts()['d']))
You can try with multiple if conditions to check the value in the particular column:
col = ['q1', 'q2', 'q3', 'q4']
for i in col:
    count = 0
    if 'a' in df[i].unique():
        count += df[i].value_counts()['a']
    if 'b' in df[i].unique():
        count += df[i].value_counts()['b']
    if 'c' in df[i].unique():
        count += df[i].value_counts()['c']
    if 'd' in df[i].unique():
        count += df[i].value_counts()['d']
    print(count)
Alternatively, declare the values in a list and loop over them inside the for loop:
col = ['q1', 'q2', 'q3', 'q4']
values = ['a', 'b', 'c', 'd']
for i in col:
    count = 0
    for v in values:
        # only look up values that actually occur in this column
        if v in df[i].values:
            count += df[i].value_counts()[v]
    print(count)
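A shorter variant of the loop above, as a sketch only: `Series.get` returns a default instead of raising `KeyError` when a label is missing from the `value_counts` result, so the membership check can be dropped. The DataFrame is rebuilt from the question's data, and the name `totals` is just for illustration.

```python
import pandas as pd

# Data set from the question, one string of letters per column.
df = pd.DataFrame({
    'q1': list('abcdabbcd'),
    'q2': list('aacdacacd'),
    'q3': list('aabbabbbb'),
    'q4': list('aaaaaaaaa'),
})

values = ['a', 'b', 'c', 'd']
totals = {}
for col in df.columns:
    counts = df[col].value_counts()
    # .get(v, 0) falls back to 0 when v never occurs in the column.
    totals[col] = int(sum(counts.get(v, 0) for v in values))

print(totals)
```

Computing `value_counts` once per column also avoids recounting the column for every value.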
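You can also count each value with str.count, append the counts to a list, and sum them:
cols = ['q1', 'q2', 'q3', 'q4']
list_C = []
for i in cols:
    # str.count is case-sensitive, so match the lowercase values in the data
    a = df[i].str.count("a").sum()
    list_C.append(a)
    b = df[i].str.count("b").sum()
    list_C.append(b)
    c = df[i].str.count("c").sum()
    list_C.append(c)
    d = df[i].str.count("d").sum()
    list_C.append(d)
print(list_C)
print(sum(list_C))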
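If the per-value counts are wanted as a table rather than a flat list, one possible sketch is to reindex each column's `value_counts` against the full list of values, filling the missing ones with 0. The DataFrame is rebuilt from the question's data; `table` is an illustrative name.

```python
import pandas as pd

# Data set from the question, one string of letters per column.
df = pd.DataFrame({
    'q1': list('abcdabbcd'),
    'q2': list('aacdacacd'),
    'q3': list('aabbabbbb'),
    'q4': list('aaaaaaaaa'),
})

values = ['a', 'b', 'c', 'd']

# One row per value, one column per question; absent values become 0.
table = df.apply(lambda s: s.value_counts().reindex(values, fill_value=0))

print(table)
print(table.sum())  # total per column
```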
I have a df that looks something like this:
   name  A  B   C  D
1   bar  1  0   1  1
2   foo  0  0   0  1
3   cat  1  0  -1  0
4   pet  0  0   0  1
5   ser  0  0  -1  0
6  chet  0  0   0  1
I need to use the loc method to add values in a new column ('E') based on the values of the other columns taken as a group. For instance, if the values are [1,0,0,0], the value in column E should be 1. I've tried this:
d = {'A': 1, 'B': 0, 'C': 0, 'D': 0}
A = pd.Series(data=d, index=['A', 'B', 'C', 'D'])
df.loc[df.iloc[:, 1:] == A, 'E'] = 1
It didn't work. I need to use loc method or other numpy based method since the dataset is huge. If it is possible to avoid creating a series to compare the row that would also be great, somehow extracting the values of columns A B C D and compare them as a group for each row.
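Compare the columns with A and use DataFrame.all to test whether every value in a row matches. Select only the columns named in A first, because the name column has no counterpart in the Series:
df.loc[(df[A.index] == A).all(axis=1), 'E'] = 1
For a 0/1 column, either of these works:
df['E'] = (df[A.index] == A).all(axis=1).astype(int)
df['E'] = np.where((df[A.index] == A).all(axis=1), 1, 0)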
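Since the question also asks about avoiding the helper Series, here is a minimal NumPy-based sketch of the same row comparison. It assumes the sample data reads as shown in the question (with -1 in column C for rows 3 and 5), and it uses a hypothetical target pattern of [0, 0, 0, 1] so that some rows actually match; `cols`, `target` and `mask` are illustrative names.

```python
import numpy as np
import pandas as pd

# Sample data as read from the question (assumed reconstruction).
df = pd.DataFrame({
    'name': ['bar', 'foo', 'cat', 'pet', 'ser', 'chet'],
    'A': [1, 0, 1, 0, 0, 0],
    'B': [0, 0, 0, 0, 0, 0],
    'C': [1, 0, -1, 0, -1, 0],
    'D': [1, 1, 0, 1, 0, 1],
})

cols = ['A', 'B', 'C', 'D']
target = np.array([0, 0, 0, 1])  # hypothetical pattern to look for

# Broadcasting compares every row against target; all(axis=1) keeps full matches.
mask = (df[cols].to_numpy() == target).all(axis=1)
df['E'] = mask.astype(int)

print(df)
```

The comparison runs on a plain array, so no label alignment is involved and the order of `cols` must match the order of `target`.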
I have a dataframe with two levels of columns index.
Reproducible Dataset.
df = pd.DataFrame(
    [['Gaz', 'Gaz', 'Gaz', 'Gaz'],
     ['X', 'X', 'X', 'X'],
     ['Y', 'Y', 'Y', 'Y'],
     ['Z', 'Z', 'Z', 'Z']],
    columns=pd.MultiIndex.from_arrays([['A', 'A', 'C', 'D'],
                                       ['Name', 'Name', 'Company', 'Company']]))
I want to rename the duplicated MultiIndex columns, but only when the combination of level 0 and level 1 is duplicated, by adding a numeric suffix to the end.
Below is a solution I found, but it only works for single level column index.
class renamer():
    def __init__(self):
        self.d = dict()
    def __call__(self, x):
        if x not in self.d:
            self.d[x] = 0
            return x
        else:
            self.d[x] += 1
            return "%s_%d" % (x, self.d[x])
df = df.rename(columns=renamer())
I think the above method can be modified to support the multi level situation, but I am too new to pandas/python.
Thanks in advance.
@Datanovice
This is to clarify the output I need.
I have the snippet below.
import pandas as pd
import numpy as np
df = pd.DataFrame(
    [['Gaz', 'Gaz', 'Gaz', 'Gaz'],
     ['X', 'X', 'X', 'X'],
     ['Y', 'Y', 'Y', 'Y'],
     ['Z', 'Z', 'Z', 'Z']],
    columns=pd.MultiIndex.from_arrays([
        ['A', 'A', 'C', 'A'],
        ['A', 'A', 'C', 'A'],
        ['Company', 'Company', 'Company', 'Name']]))
s = pd.DataFrame(df.columns.tolist())
cond = s.groupby(0).cumcount()
s = [np.where(cond.gt(0), s[i] + '_' + cond.astype(str), s[i])
     for i in range(df.columns.nlevels)]
s = pd.DataFrame(s)
#print(s)
df.columns = pd.MultiIndex.from_arrays(s.values.tolist())
print(df)
The current result is-
What I need is that the last column should not be counted as a duplicate, as "A-A-Name" is not the same as the first two.
Thank you again.
There might be a better way to do this, but you could build a DataFrame from your columns, apply a conditional operation to it, and reassign the result as the new columns.
df = pd.DataFrame(
    [['Gaz', 'Gaz', 'Gaz', 'Gaz'],
     ['X', 'X', 'X', 'X'],
     ['Y', 'Y', 'Y', 'Y'],
     ['Z', 'Z', 'Z', 'Z']],
    columns=pd.MultiIndex.from_arrays([['A', 'A', 'C', 'A'],
                                       ['Name', 'Name', 'Company', 'Company']]))
s = pd.DataFrame(df.columns.tolist())
# number repeated (level 0, level 1) pairs: 0 for the first occurrence, 1 for the next, ...
cond = s.groupby([0, 1]).cumcount()
s[0] = np.where(cond.gt(0), s[0] + '_' + cond.astype(str), s[0])
s[1] = np.where(cond.gt(0), s[1] + '_' + cond.astype(str), s[1])
df.columns = pd.MultiIndex.from_frame(s)
print(df)
0    A    A_1       C       A
1 Name Name_1 Company Company
0  Gaz    Gaz     Gaz     Gaz
1    X      X       X       X
2    Y      Y       Y       Y
3    Z      Z       Z       Z
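The same idea can be written without hard-coding the number of levels. The following is a sketch, not the answerer's method: it walks the column tuples in order, counts how often each full tuple has been seen, and suffixes every level of a repeated tuple. The function name `dedupe_multiindex` is hypothetical.

```python
from collections import defaultdict

import pandas as pd


def dedupe_multiindex(columns):
    """Suffix every level of a column tuple that repeats an earlier tuple."""
    seen = defaultdict(int)
    out = []
    for tup in columns:
        n = seen[tup]      # how many times this exact tuple appeared before
        seen[tup] += 1
        out.append(tuple(f"{lvl}_{n}" for lvl in tup) if n else tup)
    return pd.MultiIndex.from_tuples(out, names=columns.names)


# Three-level columns taken from the question's follow-up snippet.
cols = pd.MultiIndex.from_arrays([
    ['A', 'A', 'C', 'A'],
    ['A', 'A', 'C', 'A'],
    ['Company', 'Company', 'Company', 'Name'],
])
new_cols = dedupe_multiindex(cols)
print(list(new_cols))
```

Because the whole tuple is the key, ('A', 'A', 'Name') is left alone even though its first two levels repeat.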
Try this -
arrays = [['A', 'A', 'A', 'A', 'B', 'B', 'B', 'B'],['A', 'A', 'A', 'B', 'C', 'C', 'D', 'D']]
tuples = list(zip(*arrays))
index = pd.MultiIndex.from_tuples(tuples)
df = pd.DataFrame(np.random.randn(3, 8), columns=index)
A B
A A A B C C D D
0 0 0 1 3 1 2 1 4
1 0 1 1 1 1 3 0 1
2 1 1 4 2 3 2 1 4
suffix = pd.DataFrame(df.columns)
suffix['count'] = suffix.groupby(0).cumcount()
suffix['new'] = [((i[0]+'_'+str(j)),(i[1]+'_'+str(j))) for i,j in zip(suffix[0],suffix['count'])]
new_index = pd.MultiIndex.from_tuples(list(suffix['new']))
df.columns = new_index
I would like to enumerate elements in a column which appear more than once. Elements that appear only once should not be modified.
I have come up with two solutions, but they seem to be very inelegant, and I am hoping that there is a better solution.
Input:
X
0 A
1 B
2 C
3 A
4 C
5 C
6 D
Output:
new_name
X
A A1
A A2
B B
C C1
C C2
C C3
D D
Here are two possible ways of achieving this, one using .expanding().count(), the other using .cumcount(), but both pretty ugly
import pandas as pd
def solution_1(df):
    pvt = (df.groupby(by='X')
           .expanding()
           .count()
           .rename(columns={'X': 'Counter'})
           .reset_index()
           .drop('level_1', axis=1)
           .assign(name=lambda s: s['X'] + s['Counter'].astype(int).astype(str))
           .set_index('X')
           )
    pvt2 = (df.reset_index()
            .groupby(by='X')
            .count()
            .rename(columns={'index': 'C'}
                    ))
    df2 = pd.merge(left=pvt, right=pvt2, left_index=True, right_index=True)
    ind = df2['C'] > 1
    df2.loc[ind, 'new_name'] = df2.loc[ind, 'name']
    df2.loc[~ind, 'new_name'] = df2.loc[~ind].index
    df2 = df2.drop(['Counter', 'C', 'name'], axis=1)
    return df2
def solution_2(df):
    pvt = pd.DataFrame(df.groupby(by='X')
                       .agg({'X': 'cumcount'})
                       ).rename(columns={'X': 'Counter'})
    pvt2 = pd.DataFrame(df.groupby(by='X')
                        .agg({'X': 'count'})
                        ).rename(columns={'X': 'Total Count'})
    # print(pvt2)
    df2 = df.merge(pvt, left_index=True, right_index=True)
    df3 = df2.merge(pvt2, left_on='X', right_index=True)
    ind = df3['Total Count'] > 1
    df3['Counter'] = df3['Counter'] + 1
    df3.loc[ind, 'new_name'] = df3.loc[ind, 'X'] + df3.loc[ind, 'Counter'].astype(int).astype(str)
    df3.loc[~ind, 'new_name'] = df3.loc[~ind, 'X']
    df3 = df3.drop(['Counter', 'Total Count'], axis=1).set_index('X')
    return df3
if __name__ == '__main__':
    s = ['A', 'B', 'C', 'A', 'C', 'C', 'D']
    df = pd.DataFrame(s, columns=['X'])
    print(df)
    sol_1 = solution_1(df)
    print(sol_1)
    sol_2 = solution_2(df)
    print(sol_2)
Any suggestions? Thanks a lot.
First we use GroupBy.cumcount to get a cumulative count for each unique value in X.
Then we add 1 and convert the numeric values to string with Series.astype.
Finally we concatenate the values to our original column with Series.str.cat:
df['new_name'] = df['X'].str.cat(df.groupby('X').cumcount().add(1).astype(str))
X new_name
0 A A1
1 A A2
2 B B1
3 C C1
4 C C2
5 C C3
6 D D1
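If you don't want a number on the values that appear only once, we can strip the digits from those rows (regex=True is needed because recent pandas versions treat the pattern as a literal string by default):
df['new_name'] = np.where(df.groupby('X')['X'].transform('size').eq(1),
                          df['new_name'].str.replace(r'\d', '', regex=True),
                          df['new_name'])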
X new_name
0 A A1
1 A A2
2 B B
3 C C1
4 C C2
5 C C3
6 D D
All in one line:
df['new_name'] = np.where(df.groupby('X')['X'].transform('size').ne(1),
df['X'].str.cat(df.groupby('X').cumcount().add(1).astype(str)),
df['X'])
IIUC
df.X+(df.groupby('X').cumcount()+1).mask(df.groupby('X').X.transform('count').eq(1),'').astype(str)
Out[18]:
0 A1
1 B
2 C1
3 A2
4 C2
5 C3
6 D
dtype: object
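For reference, a self-contained sketch that runs on the question's input in its original row order. It is a variation on the answers above rather than a new technique: `Series.where` keeps the original label for values that occur once and substitutes the numbered label everywhere else. The names `grp` and `counter` are illustrative.

```python
import pandas as pd

# Input from the question, in its original order.
df = pd.DataFrame({'X': list('ABCACCD')})

grp = df.groupby('X')['X']
counter = (grp.cumcount() + 1).astype(str)   # 1-based position within each group
is_unique = grp.transform('size').eq(1)      # True where the value occurs once

# Keep X where it is unique, otherwise use X followed by its counter.
df['new_name'] = df['X'].where(is_unique, df['X'] + counter)

print(df)
```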
Can I use the previous calculated answer from apply(axis=1) within the current row evaluation?
I have this df:
df = pd.DataFrame(np.random.randn(5,3),columns=list('ABC'))
df
A B C String_column
0 0.297925 -1.025012 1.307090 'a'
1 -1.527406 0.533451 -0.650252 'b'
2 -1.646425 0.738068 0.562747 'c'
3 -0.045872 0.088864 0.932650 'd'
4 -0.964226 0.542817 0.873731 'e'
and I'm trying to add to each row the value of the previous row multiplied by 0.5, without manipulating the string column (e.g. row = row + row(shift-1) * 0.5).
This is the code I have so far:
def calc_by_previous_answer(row):
    # here I have only the current row, so I'm unable to get the previous one
    row = row * 0.5
    return row
# adding the shift here will not propagate the previous answer
df = df.apply(calc_by_previous_answer, axis=1)
df
It is not easy, but it is possible by selecting the previous row's values with loc; to restrict the calculation to numeric columns, use DataFrame.select_dtypes:
def calc_by_previous_answer(row):
    # the first row has no previous row, so it is left unchanged
    if row.name > 0:
        row = df.loc[row.name-1, c] * 0.5 + row
    # else:
    #     row = row * 0.5
    return row
c = df.select_dtypes(np.number).columns
df[c] = df[c].apply(calc_by_previous_answer, axis=1)
print (df)
A B C String_column
0 0.297925 -1.025012 1.307090 'a'
1 -1.378443 0.020945 0.003293 'b'
2 -2.410128 1.004794 0.237621 'c'
3 -0.869085 0.457898 1.214023 'd'
4 -0.987162 0.587249 1.340056 'e'
Solution with no apply with DataFrame.add:
c = df.select_dtypes(np.number).columns
df[c] = df[c].add(df[c].shift() * 0.5, fill_value=0)
print (df)
A B C String_column
0 0.297925 -1.025012 1.307090 'a'
1 -1.378443 0.020945 0.003293 'b'
2 -2.410128 1.004794 0.237621 'c'
3 -0.869085 0.457898 1.214023 'd'
4 -0.987162 0.587249 1.340056 'e'
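EDIT: To use the previously calculated row rather than the original one, loop over the rows and update the DataFrame in place:
c = df.select_dtypes(np.number).columns
for idx, row in df.iterrows():
    if row.name > 0:
        df.loc[idx, c] = df.loc[idx-1, c] * 0.5 + df.loc[idx, c]
print (df)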
A B C String_column
0 0.297925 -1.025012 1.307090 'a'
1 -1.378443 0.020945 0.003293 'b'
2 -2.335647 0.748541 0.564393 'c'
3 -1.213695 0.463134 1.214847 'd'
4 -1.571074 0.774384 1.481154 'e'
There is no need to use apply, you can solve it as follows. Since you want to use the updated row value in the calculation of the following row value, you need to use a for loop.
cols = ['A','B','C']
for i in range(1, len(df)):
    df.loc[i, cols] = df.loc[i-1, cols] * 0.5 + df.loc[i, cols]
Result:
A B C String_column
0 0.297925 -1.025012 1.307090 'a'
1 -1.378443 0.020945 0.003293 'b'
2 -2.335647 0.748541 0.564393 'c'
3 -1.213695 0.463134 1.214847 'd'
4 -1.571074 0.774384 1.481154 'e'
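The two sets of results in this thread differ because the shift-based version reads the original previous row, while the loop reads the already updated one. A small sketch with made-up numbers makes the difference visible; it also runs the loop on a NumPy copy instead of on `df.loc`, which is an assumption about what may be faster on large frames, not something measured here. The names `shifted` and `recursive` are illustrative.

```python
import numpy as np
import pandas as pd

# Made-up numbers, chosen so the arithmetic is easy to follow.
df = pd.DataFrame({
    'A': [1.0, 2.0, 3.0],
    'B': [4.0, 0.0, -2.0],
    'String_column': ['a', 'b', 'c'],
})
cols = df.select_dtypes(np.number).columns

# Non-recursive: each row adds half of the ORIGINAL previous row.
shifted = df[cols].add(df[cols].shift() * 0.5, fill_value=0)

# Recursive: each row adds half of the UPDATED previous row.
values = df[cols].to_numpy(dtype=float, copy=True)
for i in range(1, len(values)):
    values[i] += 0.5 * values[i - 1]
recursive = pd.DataFrame(values, index=df.index, columns=cols)

print(shifted)
print(recursive)
```

The first two rows agree in both versions; they diverge from the third row on, which matches the pattern in the outputs above.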
I need to change individual elements in a DataFrame. I tried doing something like this, but it doesn't work:
for index, row in df.iterrows():
    if df.at[row, index] == 'something':
        df.at[row, index] = df.at[row, index] + 'add a string'
    else:
        df.at[row, index] = df.at[row, index] + 'add a value'
How can I do that?
If you need to modify all columns in the DataFrame, use numpy.where together with the DataFrame constructor, because where returns a NumPy array:
df = pd.DataFrame(np.where(df == 'something', df + 'add a string', df + 'add a value'),
index=df.index,
columns=df.columns)
To modify only one column, col:
df['col'] = np.where(df['col'] == 'something',
df['col'] + 'add a string',
df['col'] + 'add a value')
Sample:
df = pd.DataFrame({'col': ['a', 'b', 'a'], 'col1': ['a', 'b', 'b']})
print (df)
col col1
0 a a
1 b b
2 a b
df = pd.DataFrame(np.where(df == 'a', df + 'add a string', df + 'add a value'),
index=df.index,
columns=df.columns)
print (df)
col col1
0 aadd a string aadd a string
1 badd a value badd a value
2 aadd a string badd a value
df['col'] = np.where(df['col'] == 'a',
df['col'] + 'add a string',
df['col'] + 'add a value')
print (df)
col col1
0 aadd a string a
1 badd a value b
2 aadd a string b
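For completeness, a sketch of how the explicit loop from the question could be made to work on a small frame. `df.at` expects a row label and a column label, whereas the question's loop passes the row Series itself. This uses the sample data above with 'a' standing in for 'something'; the vectorized `np.where` versions remain the better fit for large data.

```python
import pandas as pd

df = pd.DataFrame({'col': ['a', 'b', 'a'], 'col1': ['a', 'b', 'b']})

# Visit every cell by (row label, column label) and rewrite it in place.
for idx in df.index:
    for col in df.columns:
        suffix = 'add a string' if df.at[idx, col] == 'a' else 'add a value'
        df.at[idx, col] = df.at[idx, col] + suffix

print(df)
```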
You can use .loc (the older .ix indexer was removed in pandas 1.0) and apply a function like this:
import pandas as pd
D = pd.DataFrame({'A': ['a', 'b', 3,7,'b','a'], 'B': ['a', 'b', 3,7,'b','a']})
D.loc[D.index%2 == 0,'A'] = D.loc[D.index%2 == 0,'A'].apply(lambda s: s+'x' if isinstance(s,str) else s+1)
D.loc[D.index[2:5],'B'] = D.loc[D.index[2:5],'B'].apply(lambda s: s+'y' if isinstance(s,str) else s-1)
The first example appends x to each string, or adds 1 to each non-string, in column A for every even index.
The second example appends y to each string, or subtracts 1 from each non-string, in column B for the indices 2, 3 and 4.
Original Frame:
A B
0 a a
1 b b
2 3 3
3 7 7
4 b b
5 a a
Modified Frame:
A B
0 ax a
1 b b
2 4 2
3 7 6
4 bx by
5 a a