How to count particular column values in python pandas?

I have a dataframe like below:
df1_data = {'sym' :{0:'AAA',1:'BBB',2:'CCC',3:'AAA',4:'CCC',5:'DDD',6:'EEE',7:'EEE',8:'FFF'},
'identity' :{0:'AD',1:'AD',2:'AU',3:'AU',4:'AU',5:'AZ',6:'AU',7:'AZ',8:'AZ'}}
I want to check the sym column in my dataframe. My intention is to generate two different files: one containing the same two columns in a different order, and a second file containing the columns sym,sym_count,AD_count,AU_count,neglected_count.
Edit 1 -
I want to ignore identities other than AD & AU. Syms with no AD or AU identity at all should not appear in either output file. The neglected_count column is optional.
Expected Result-
result.csv
sym,identity
AAA,AD
AAA,AU
BBB,AD
CCC,AU
CCC,AU
EEE,AU
result_count.csv
sym,sym_count,AD_count,AU_count,neglected_count
AAA,2,1,1,0
BBB,1,1,0,0
CCC,2,0,2,0
EEE,2,0,1,1
How can I perform this type of calculation in python pandas?

I think you need crosstab, with insert for adding the sum column at the first position and add_suffix for the column names. Last, write to_csv.
df1_data = {'sym' :{0:'AAA',1:'BBB',2:'CCC',3:'AAA',4:'CCC',5:'DDD',6:'EEE',7:'EEE',8:'FFF'},
'identity' :{0:'AD',1:'AD',2:'AU',3:'AU',4:'AU',5:'AZ',6:'AU',7:'AZ',8:'AZ'}}
df = pd.DataFrame(df1_data, columns=['sym','identity'])
print (df)
sym identity
0 AAA AD
1 BBB AD
2 CCC AU
3 AAA AU
4 CCC AU
5 DDD AZ
6 EEE AU
7 EEE AZ
8 FFF AZ
#keep only identities AD and AU
vals = ['AD','AU']
#write only rows with a wanted identity, sorted to match the expected result.csv
df[df['identity'].isin(vals)].sort_values(['sym','identity']).to_csv('result.csv', index=False)
#replace other values with neglected
neglected = df.loc[~df.identity.isin(vals), 'identity'].unique().tolist()
neglected = {x:'neglected' for x in neglected}
print (neglected)
{'AZ': 'neglected'}
df.identity = df.identity.replace(neglected)
df1 = pd.crosstab(df['sym'], df['identity'])
df1.insert(0, 'sym', df1.sum(axis=1))
df2 = df1.add_suffix('_count').reset_index()
#keep rows with at least one non-zero value in the vals columns
mask = ~df2.filter(regex='|'.join(vals)).eq(0).all(axis=1)
print (mask)
0 True
1 True
2 True
3 False
4 True
5 False
dtype: bool
#boolean indexing
df2 = df2[mask]
print (df2)
identity sym sym_count AD_count AU_count neglected_count
0 AAA 2 1 1 0
1 BBB 1 1 0 0
2 CCC 2 0 2 0
4 EEE 2 0 1 1
df2.to_csv('result_count.csv', index=False)
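For comparison, a minimal sketch of the same counts via groupby and unstack instead of crosstab (not the approach above; it continues from df after the neglected replacement and reuses vals):
counts = df.groupby(['sym','identity']).size().unstack(fill_value=0)
counts.insert(0, 'sym', counts.sum(axis=1))
df2 = counts.add_suffix('_count').reset_index()
df2 = df2[~df2.filter(regex='|'.join(vals)).eq(0).all(axis=1)]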

Related

Pandas Dataframe - (Column re structure)

I have a dataframe that has n columns. These contain letters; the number of letters a column contains varies, and a letter can appear in several columns. I need pandas code to convert the sheet to columns headed by the letters, where the rows contain the numbers of the columns that letter was in.
Link to example problem
[Image: a small before/after example, the input sheet on the left and the desired letter-to-column-numbers table on the right.]
The image describes my problem better. Thank you in advance for any help.
Use DataFrame.stack with DataFrame.reset_index to reshape, then DataFrame.sort_values and aggregate lists; last, create a DataFrame by the constructor and transpose:
s = (df.stack()
       .reset_index(name='a')
       .sort_values('level_1')
       .groupby('a')['level_1']
       .agg(list))
df1 = pd.DataFrame(s.tolist(), index=s.index).T
print (df1)
a a b c d e f
0 1 1 1 1 3 2
1 3 3 2 4 4 None
2 None 4 None None None None
Or use GroupBy.cumcount for a counter and reshape by DataFrame.pivot (keyword arguments, since the positional form was removed in pandas 2.0):
df2 = df.stack().reset_index(name='a').sort_values('level_1')
df2['g'] = df2.groupby('a').cumcount()
df2 = df2.pivot(index='g', columns='a', values='level_1')
print (df2)
a a b c d e f
g
0 1 1 1 1 3 2
1 3 3 2 4 4 NaN
2 NaN 4 NaN NaN NaN NaN
Last, if necessary, remove the index and columns names:
df1 = df1.rename_axis(index=None)
df2 = df2.rename_axis(index=None, columns=None)
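Since the original input sheet exists only in the lost image, here is a self-contained sketch with a made-up frame (the 1-4 column numbers and the letters are assumptions, not the asker's data) showing the reshaping end to end:
import pandas as pd

#hypothetical input: column names are sheet-column numbers, cells hold letters
df = pd.DataFrame({1: ['a', 'c', 'f'],
                   2: ['b', 'b', 'a'],
                   3: ['c', 'e', 'b'],
                   4: ['d', 'd', 'e']})

s = (df.stack()
       .reset_index(name='a')
       .sort_values('level_1')
       .groupby('a')['level_1']
       .agg(list))
df1 = pd.DataFrame(s.tolist(), index=s.index).T.rename_axis(index=None)
print (df1)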

Finding index of a data frame comparing with another data frame

I have two data frames, df and df1. Both have a column called description (which may not be unique). I want to get the indexes of df where the description matches the description in df1.
df
Name des
0 xyz1 abc
1 xyz2 bcd
2 xyz3 nna
3 xyz4 mmm
4 xyz5 man
df1
des
0 abc
1 nna
2 bcd
3 man
O/P required
df1
des index_df
0 abc 0
1 nna 2
2 bcd 1
3 man 4
This is possible with the .loc accessor, using reset_index to elevate the index to a column:
res = df.loc[df['des'].isin(set(df1['des'])), 'des'].reset_index()
# index des
# 0 0 abc
# 1 1 bcd
# 2 2 nna
# 3 4 man
Use map with a Series created from column des, with index and values swapped:
s = pd.Series(df.index, index=df['des'])
df1['index_df'] = df1['des'].map(s)
print (df1)
des index_df
0 abc 0
1 nna 2
2 bcd 1
3 man 4
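The question notes description may not be unique; map with a duplicated-index Series can raise an error in that case, so here is a hedged sketch using merge instead, which keeps every matching index:
res = df1.merge(df.reset_index()[['index','des']], on='des', how='left')
res = res.rename(columns={'index':'index_df'})
print (res)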

Delete a row if it doesn't contain a specified integer value (Pandas)

I have a Pandas dataset that I want to clean up prior to applying my ML algorithm. I am wondering if it is possible to remove a row if an element of its columns does not match a set of values. For example, if I have the dataframe:
a b
0 1 6
1 4 7
2 2 4
3 3 7
...
And I desire the values of a to be one of [1,3] and of b to be one of [6,7], such that my final dataset is:
a b
0 1 6
1 3 7
...
Currently, my implementation is not working, as some of my data rows have erroneous strings attached to the value. For example, instead of a value of 1 I'll have something like 1abc. Hence I would like to remove any row whose entry is not exactly an integer with that value.
My workaround is also a bit archaic, as I am removing entries for column a that do not have 1 or 3 via:
dataset = dataset[(dataset.commute != 1)]
dataset = dataset[(dataset.commute != 3)]
You can use boolean indexing with double isin and &:
df1 = df[(df['a'].isin([1,3])) & (df['b'].isin([6,7]))]
print (df1)
a b
0 1 6
3 3 7
Or use numpy.in1d:
df1 = df[(np.in1d(df['a'], [1,3])) & (np.in1d(df['b'], [6,7])) ]
print (df1)
a b
0 1 6
3 3 7
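Note: newer NumPy (2.0+) deprecates in1d in favour of numpy.isin; a drop-in sketch:
import numpy as np
df1 = df[np.isin(df['a'], [1,3]) & np.isin(df['b'], [6,7])]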
But if you need to remove all rows with non-numeric values, you need to_numeric with errors='coerce', which returns NaN for them, and then you can filter with notnull:
df = pd.DataFrame({'a':['1abc','2','3'],
'b':['4','5','dsws7']})
print (df)
a b
0 1abc 4
1 2 5
2 3 dsws7
mask = pd.to_numeric(df['a'], errors='coerce').notnull() & \
       pd.to_numeric(df['b'], errors='coerce').notnull()
df1 = df[mask].astype(int)
print (df1)
a b
1 2 5
If you need to check whether some value is NaN or None:
df = pd.DataFrame({'a':['1abc',None,'3'],
'b':['4','5',np.nan]})
print (df)
a b
0 1abc 4
1 None 5
2 3 NaN
print (df[df.isnull().any(axis=1)])
a b
1 None 5
2 3 NaN
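If the goal is to drop those rows, dropna is the usual follow-up on the frame above:
df1 = df.dropna()
print (df1)
      a  b
0  1abc  4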
You can use pandas isin():
df = df[df.a.isin([1,3]) & df.b.isin([6,7])]
a b
0 1 6
3 3 7
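To combine the two steps for the original goal (a in [1,3], b in [6,7], malformed strings dropped), a minimal sketch assuming the raw columns are strings that may contain junk like 1abc:
#coerce to numbers (malformed strings become NaN), then keep wanted values
a = pd.to_numeric(df['a'], errors='coerce')
b = pd.to_numeric(df['b'], errors='coerce')
df1 = df[a.isin([1,3]) & b.isin([6,7])]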

Selecting columns from two dataframes according to another column

I have 2 dataframes: one of them contains some general information about football players, and the second contains other information like winning matches for each player. They both have an "id" column. However, they are not the same length.
What I want to do is create a new dataframe which contains 2 columns: "x" from the first dataframe and "y" from the second dataframe, ONLY where the "id" column contains the same value in both dataframes. Thus I can match the "x" and "y" columns which belong to the same person.
I tried to do it using concat function:
pd.concat([firstdataframe['x'], seconddataframe['y']], axis=1, keys=['x', 'y'])
But I didn't manage to know how to apply the condition of the "id" being equal in both dataframes.
It seems you need merge with the default inner join; note that each value in the id columns has to be unique:
df = pd.merge(df1[['id','x']], df2[['id','y']], on='id')
Sample:
df1 = pd.DataFrame({'id':[1,2,3],'x':[4,3,8]})
print (df1)
id x
0 1 4
1 2 3
2 3 8
df2 = pd.DataFrame({'id':[1,2],'y':[7,0]})
print (df2)
id y
0 1 7
1 2 0
df = pd.merge(df1[['id','x']], df2[['id','y']], on='id')
print (df)
id x y
0 1 4 7
1 2 3 0
A solution with concat is possible, but a bit complicated, because it needs an inner join on the indexes:
df = pd.concat([df1.set_index('id')['x'],
                df2.set_index('id')['y']],
               axis=1, join='inner').reset_index()
print (df)
id x y
0 1 4 7
1 2 3 0
EDIT:
If the ids are not unique, duplicates create all combinations and the output dataframe is expanded:
df1 = pd.DataFrame({'id':[1,2,3],'x':[4,3,8]})
print (df1)
id x
0 1 4
1 2 3
2 3 8
df2 = pd.DataFrame({'id':[1,2,1,1],'y':[7,0,4,2]})
print (df2)
id y
0 1 7
1 2 0
2 1 4
3 1 2
df = pd.merge(df1[['id','x']], df2[['id','y']], on='id')
print (df)
id x y
0 1 4 7
1 1 4 4
2 1 4 2
3 2 3 0
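If that expansion is unwanted, merge can enforce uniqueness itself via the validate parameter (pandas 0.21+) and raise a MergeError instead of silently expanding:
df = pd.merge(df1[['id','x']], df2[['id','y']], on='id', validate='one_to_one')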

How to refactor simple dataframe parsing code with Pandas

I am using Pandas to parse a dataframe that I have created:
# Initial DF
A B C
0 -1 qqq XXX
1 20 www CCC
2 30 eee VVV
3 -1 rrr BBB
4 50 ttt NNN
5 60 yyy MMM
6 70 uuu LLL
7 -1 iii KKK
8 -1 ooo JJJ
My goal is to analyze column A and apply the following conditions to the dataframe:
Investigate every row
determine if df['A'].iloc[index]==-1
if true and index==0, mark the first row to be removed
if true and index==N, mark the last row to be removed
if 0<index<N and df['A'].iloc[index]==-1 and the previous or following row contains -1 (df['A'].iloc[index+1]==-1 or df['A'].iloc[index-1]==-1), mark the row to be removed; else replace -1 with the average of the previous and following values
The final dataframe should look like this:
# Final DF
A B C
0 20 www CCC
1 30 eee VVV
2 40 rrr BBB
3 50 ttt NNN
4 60 yyy MMM
5 70 uuu LLL
I was able to achieve my goal by writing a simple code that applies the conditions mentioned above:
import pandas as pd
# create dataframe
data = {'A':[-1,20,30,-1,50,60,70,-1,-1],
'B':['qqq','www','eee','rrr','ttt','yyy','uuu','iii','ooo'],
'C':['XXX','CCC','VVV','BBB','NNN','MMM','LLL','KKK','JJJ']}
df = pd.DataFrame(data)
# If df['A'].iloc[index]==-1:
# - option 1: remove row if the first or last row is equal to -1
# - option 2: remove row if the previous or following row contains -1 (df['A'].iloc[index-1]==-1 or df['A'].iloc[index+1]==-1)
# - option 3: replace df['A'].iloc[index] if df['A'].iloc[index]==-1 and neither neighbour contains -1
N = len(df.index)  # number of rows
index_vect = []  # store indexes of rows to be deleted
for index in range(0, N):
    # option 1
    if index == 0 and df['A'].iloc[index] == -1:
        index_vect.append(index)
    elif 0 < index < N-1 and df['A'].iloc[index] == -1:
        # option 2
        if df['A'].iloc[index-1] == -1 or df['A'].iloc[index+1] == -1:
            index_vect.append(index)
        # option 3
        else:
            df['A'].iloc[index] = int((df['A'].iloc[index+1]+df['A'].iloc[index-1])/2)
    # option 1
    elif index == N-1 and df['A'].iloc[index] == -1:
        index_vect.append(index)
# remove rows to be deleted
df = df.drop(index_vect).reset_index(drop=True)
As you can see the code is pretty long and I would like to know if you can suggest a smarter and more efficient way to obtain the same result.
Furthermore, I noticed my code returns a warning message caused by the line df['A'].iloc[index] = int((df['A'].iloc[index+1]+df['A'].iloc[index-1])/2).
Do you know how I could optimize that line of code?
Here's a solution:
import numpy as np
# Let's replace -1 by Not a Number (NaN)
df.loc[df.A==-1, 'A'] = np.nan
# If df.A is NaN and either the previous or next is also NaN, we don't select it
# This takes care of the condition on the first and last row too
df = df[~(df.A.isnull() & (df.A.shift(1).isnull() | df.A.shift(-1).isnull()))]
# Use interpolate to fill with the average of previous and next
df.A = df.A.interpolate(method='linear', limit=1)
Here's the resulting df:
A B C
1 20.0 www CCC
2 30.0 eee VVV
3 40.0 rrr BBB
4 50.0 ttt NNN
5 60.0 yyy MMM
6 70.0 uuu LLL
You can then reset the index if you want to.
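For example, a small follow-up sketch to restore a 0-based index and the integer dtype (safe here because no NaN remains after the interpolation):
df = df.reset_index(drop=True)
df['A'] = df['A'].astype(int)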
