Create new columns based on distinct values and count them - python
Sorry if the title is not clear enough. Let me explain what I want to achieve.
I have this DataFrame, let's call it df.
id | Area
A one
A two
A one
B one
B one
C one
C two
D one
D one
D two
D three
I would like to create a new DataFrame based on the values in the existing one. First, I would like to count the number of rows for each distinct id in df, e.g. id A has 3 entries, B has 2 entries, etc., and then create a new data frame out of those counts.
Let's call the new DataFrame df_new:
id | count
A 3
B 2
C 2
D 4
Next, I would like to create new columns based on the values in df['Area']. In this example, df['Area'] contains 3 distinct values (one, two, three). I would like to count how many times each id has been in each Area. For example, id A has been in area one twice, in area two once and in area three never. Then I will append those counts as new columns called one, two and three.
df_new:
id | count | one | two | three
A 3 2 1 0
B 2 2 0 0
C 2 1 1 0
D 4 2 1 1
I have written my own code that produces df_new, but I believe pandas has better functions for this sort of data extraction. Here is my code:
import pandas as pd

#Read the data
df = pd.read_csv('test_data.csv', sep=',')
df.columns = ['id', 'Area']  #Rename

#Count the total number of Area entries per id
df_new = pd.DataFrame({'count': df.groupby('id')['Area'].count()})
#Reset index
df_new = df_new.reset_index()

#Loop for counting and creating a new column for each area in df['Area']
for i in range(len(df)):
    #Get the id
    idx = df['id'][i]
    #Get the area name
    area_name = str(df['Area'][i])
    #Retrieve the df_new index of this particular id
    current_index = df_new.loc[df_new['id'] == idx].index[0]
    #If the area name already exists as a column
    if area_name in df_new.columns:
        #Then +1 at the location of the id (index)
        df_new[area_name][current_index] += 1
    #If it does not exist in the columns
    else:
        #Create an empty column of zeros
        df_new[area_name] = 0
        #Then +1 at the location of the id (index)
        df_new[area_name][current_index] += 1
The code is long and hard to read. It also triggers the warning "A value is trying to be set on a copy of a slice from a DataFrame". I would like to learn how to write this more effectively.
Thank you
You can use df.groupby with count for the first part and pd.crosstab for the second. Then, use pd.concat to join them:
In [1246]: pd.concat([df.groupby('id').count().rename(columns={'Area': 'count'}),
                      pd.crosstab(df.id, df.Area)], axis=1)
Out[1246]:
count one three two
id
A 3 2 0 1
B 2 2 0 0
C 2 1 0 1
D 4 2 1 1
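Note that the crosstab columns come back in alphabetical order (one, three, two). If you want them in the natural one, two, three order, you can reindex the columns before concatenating. A minimal sketch, rebuilding the sample data from the question:

```python
import pandas as pd

df = pd.DataFrame({'id': list('AAABBCCDDDD'),
                   'Area': ['one', 'two', 'one', 'one', 'one', 'one', 'two',
                            'one', 'one', 'two', 'three']})

ct = pd.crosstab(df['id'], df['Area'])
# crosstab sorts columns alphabetically; restore the natural order
ct = ct.reindex(columns=['one', 'two', 'three'])

df_new = pd.concat([df.groupby('id')['Area'].count().rename('count'), ct],
                   axis=1)
print(df_new)
```

This gives the columns in the order shown in the question's desired output: count, one, two, three.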
Here's the first part using df.groupby:
df.groupby('id').count().rename(columns={'Area' : 'count'})
count
id
A 3
B 2
C 2
D 4
Here's the second part with pd.crosstab:
pd.crosstab(df.id, df.Area)
Area one three two
id
A 2 0 1
B 2 0 0
C 1 0 1
D 2 1 1
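Since the count column is just the row sum of the crosstab, you can also skip the groupby entirely and derive it from the crosstab itself. A sketch, with the question's data rebuilt inline:

```python
import pandas as pd

df = pd.DataFrame({'id': list('AAABBCCDDDD'),
                   'Area': ['one', 'two', 'one', 'one', 'one', 'one', 'two',
                            'one', 'one', 'two', 'three']})

ct = pd.crosstab(df['id'], df['Area'])
# the per-id count is just the row sum of the crosstab;
# insert it as the first column
ct.insert(0, 'count', ct.sum(axis=1))
print(ct)
```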
For the second part, you can also use pd.get_dummies and do a dot product:
(pd.get_dummies(df.id).T).dot(pd.get_dummies(df.Area))
one three two
A 2 0 1
B 2 0 0
C 1 0 1
D 2 1 1
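Another equivalent way to get this table, for completeness, is a plain groupby over both columns followed by unstack; this avoids both crosstab and get_dummies. A sketch on the question's data:

```python
import pandas as pd

df = pd.DataFrame({'id': list('AAABBCCDDDD'),
                   'Area': ['one', 'two', 'one', 'one', 'one', 'one', 'two',
                            'one', 'one', 'two', 'three']})

# size() counts rows per (id, Area) pair; unstack pivots Area into columns,
# filling pairs that never occur with 0
counts = df.groupby(['id', 'Area']).size().unstack(fill_value=0)
print(counts)
```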
Related
Groupby and sum of multiple columns with the same value
I am working on a Pandas data frame and have the following dataframe:

data = pd.DataFrame()
data['HomeTeam'] = ['A', 'B', 'C', 'D', 'E']
data['AwayTeam'] = ['E', 'D', 'A', 'B', 'C']
data['HomePoint'] = [1, 3, 0, 1, 3]
data['AwayPoint'] = [1, 0, 3, 1, 0]
data['Match'] = data['HomeTeam'].astype(str) + ' Vs ' + data['AwayTeam'].astype(str)

# I want to duplicate the match
Nsims = 2
data_Dub = pd.DataFrame(pd.np.tile(data, (Nsims, 1)))
data_Dub.columns = data.columns

# Then I will assign the stage of the match
data_Dub['SimStage'] = data_Dub.groupby('Match').cumcount()

What I wanted to do is sum the HomePoint and AwayPoint obtained by each team and save the result to a new data frame, i.e. HomePoint and AwayPoint are added together per team, for the 5 teams in the dataframe. Can anyone advise how to do it? I used the following code and it does not work:

Point = data_Dub.groupby(['SimStage', 'HomeTeam', 'AwayTeam'])['HomePoint', 'AwayPoint'].sum()

Thanks.
You can aggregate sum separately for HomeTeam and AwayTeam, then use add, and finally sort_index and reset_index to get columns from the MultiIndex, change the column name and, if necessary, the order of columns:

a = data_Dub.groupby(['AwayTeam', 'SimStage'])['AwayPoint'].sum()
b = data_Dub.groupby(['HomeTeam', 'SimStage'])['HomePoint'].sum()
s = a.add(b).rename('Point')

df = s.sort_index(level=[1, 0]).reset_index().rename(columns={'AwayTeam': 'Team'})
df = df[['Team', 'Point', 'SimStage']]
print(df)

  Team  Point  SimStage
0    A      4         0
1    B      4         0
2    C      0         0
3    D      1         0
4    E      4         0
5    A      4         1
6    B      4         1
7    C      0         1
8    D      1         1
9    E      4         1
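An alternative sketch: instead of two separate groupbys that are added together, you can first stack the home and away records into one long frame with a single (Team, Point) pair per row, then do a single groupby. This assumes the same column names as in the question; a single SimStage is used here for brevity:

```python
import pandas as pd

data = pd.DataFrame({'HomeTeam': list('ABCDE'),
                     'AwayTeam': list('EDABC'),
                     'HomePoint': [1, 3, 0, 1, 3],
                     'AwayPoint': [1, 0, 3, 1, 0]})
data['SimStage'] = 0  # single simulation for brevity

# reshape to one row per (team, match), so one groupby covers home and away
home = data[['HomeTeam', 'HomePoint', 'SimStage']].rename(
    columns={'HomeTeam': 'Team', 'HomePoint': 'Point'})
away = data[['AwayTeam', 'AwayPoint', 'SimStage']].rename(
    columns={'AwayTeam': 'Team', 'AwayPoint': 'Point'})

points = (pd.concat([home, away], ignore_index=True)
            .groupby(['SimStage', 'Team'], as_index=False)['Point'].sum())
print(points)
```

This produces the same per-team totals (A: 4, B: 4, C: 0, D: 1, E: 4) as the groupby-and-add approach.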
Find the middle occurrence of "0" and the first occurrence of "1" of an event in a pandas dataframe
Hi, I have a pandas dataframe with an event column as well as other columns. I want to group by id and, within each group, look for runs of five or more consecutive 0s that are followed by a 1. For every id that contains such a pattern I want to return two records: the middle row of the last five consecutive 0s, and the row of the first 1 that follows them. If the pattern occurs multiple times for an id, take one pattern. For example:

import pandas as pd

data = {'id': [1]*40 + [2]*40,
        'name': ['a','b','c','d','e','f','g','h','i','j','k','l','m','n',
                 'o','p','q','r','s','t','a1','b1','c1','d1','e1','f1','g1','h1',
                 'i1','j1','k1','l1','m1','n1','o1','p1','q1','r1','s1','t1',
                 'aa','bb','cc','dd','ee','ff','gg','hh','ii','jj','kk','ll','mm','nn',
                 'oo','pp','qq','rr','ss','tt','aa1','bb1','cc1','dd1','ee1','ff1',
                 'gg1','hh1','ii1','jj1','kk1','ll1','mm1','nn1',
                 'oo1','pp1','qq1','rr1','ss1','tt1'],
        'value': [0,0,1,0,0,0,0,0,0,1,1,0,0,0,0,0,1,0,0,0,
                  0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
                  0,0,0,0,0,0,0,1,0,1,1,0,0,0,0,0,1,0,0,0,
                  0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0]}
df = pd.DataFrame.from_dict(data)

As output I want 2 records per id: one for the 0s and one for the 1. The 0 record should be the middle row of the 5 or more consecutive 0s. The expected output is:

    id name  value
16   1    q      0
19   1    t      1
64   2  ee1      0
67   2  hh1      1
You can do it using a pivot table and applying masks for the different values. First we should group by the (id, value) pair:

df_grouped = df.reset_index().pivot_table(index=['id', 'value'],
                                          values='name',
                                          aggfunc=lambda x: ','.join(x)).reset_index()
df_grouped['name'] = df_grouped['name'].str.split(',')
print(df_grouped)

   id  value             name
0   1      0  a,b,d,e,f,g,h,i
1   1      1              c,j
2   2      0        l,m,n,o,p
3   2      1    k,q,r,s,t,u,w

Then select the zeros per (value==0, id) pair and keep the middle value:

mask_zeros = ((df_grouped['value'] == 0) *
              (df_grouped['name'].apply(len) >= 5))
df_zeros = mask_zeros * df_grouped['name'].apply(
    lambda x: x[int(np.ceil(.5 * len(x)))] if len(x) % 2 == 1
    else x[int(.5 * len(x))])
print(df_zeros)

0    f
1
2    o
3

And select the first name per (value==1, id) pair:

mask_ones = (df_grouped['value'] == 1)
df_ones = mask_ones * df_grouped['name'].apply(
    lambda x: x[0] if len(x) > 0 else None)
print(df_ones)

0
1    c
2
3    k

Then keep only the selected names by assigning:

df_grouped['name'] = df_ones + df_zeros
df_grouped = df_grouped.merge(df.reset_index(),
                              on=['name', 'value', 'id']).set_index('index')
print(df_grouped)

       id  value name
index
5       1      0    f
2       1      1    c
14      2      0    o
10      2      1    k
I break down the steps:

df['New'] = df.value.diff().fillna(0).ne(0).cumsum()
df1 = df.loc[df.value.eq(0)]
s1 = (df1.groupby(['id', 'New'])
         .filter(lambda x: len(x) >= 5)
         .groupby('id')
         .apply(lambda x: x.iloc[len(x)//2 - 1:len(x)//2 + 1]
                if len(x) % 2 == 0
                else x.iloc[[(len(x) + 1)//2], :])
         .reset_index(level=0, drop=True))
s2 = df1.groupby(['id', 'New']).filter(lambda x: len(x) >= 5)
pd.concat([df.loc[s2.drop_duplicates(['id'], keep='last').index + 1], s1]).sort_index()

Out[1995]:
    id name  value  New
5    1    f      0    2
6    1    g      0    2
9    1    j      1    3
14   2    o      0    4
16   2    q      1    5
Converting pandas column of comma-separated strings into dummy variables
In my dataframe, I have a categorical variable that I'd like to convert into dummy variables. This column however has multiple values separated by commas:

0    'a'
1    'a,b,c'
2    'a,b,d'
3    'd'
4    'c,d'

Ultimately, I'd want to have binary columns for each possible discrete value; in other words, the final column count equals the number of unique values in the original column. I imagine I'd have to use split() to get each separate value, but I'm not sure what to do afterwards. Any hint much appreciated!

Edit: Additional twist: the column has null values. And in response to a comment, the following is the desired output. Thanks!

   a  b  c  d
0  1  0  0  0
1  1  1  1  0
2  1  1  0  1
3  0  0  0  1
4  0  0  1  1
Use str.get_dummies:

df['col'].str.get_dummies(sep=',')

   a  b  c  d
0  1  0  0  0
1  1  1  1  0
2  1  1  0  1
3  0  0  0  1
4  0  0  1  1

Edit: updating the answer to address some questions.

Q1: Why does the Series method get_dummies not accept the argument prefix=... while pandas.get_dummies() does?

Series.str.get_dummies is a series-level method (as the name suggests!). We are one-hot encoding values in one Series (or a DataFrame column), and hence there is no need for a prefix. pandas.get_dummies, on the other hand, can one-hot encode multiple columns, in which case the prefix parameter works as an identifier of the original column. If you want to apply a prefix to the result of str.get_dummies, you can always use DataFrame.add_prefix:

df['col'].str.get_dummies(sep=',').add_prefix('col_')

Q2: If you have more than one column to begin with, how do you merge the dummies back into the original frame?

You can use pd.concat to merge the one-hot encoded columns with the rest of the columns in the dataframe:

df = pd.DataFrame({'other': ['x', 'y', 'x', 'x', 'q'],
                   'col': ['a', 'a,b,c', 'a,b,d', 'd', 'c,d']})
df = pd.concat([df, df['col'].str.get_dummies(sep=',')], axis=1).drop('col', axis=1)

  other  a  b  c  d
0     x  1  0  0  0
1     y  1  1  1  0
2     x  1  1  0  1
3     x  0  0  0  1
4     q  0  0  1  1
The str.get_dummies method does not accept a prefix parameter, but you can rename the column names of the returned dummy DataFrame:

data['col'].str.get_dummies(sep=',').rename(lambda x: 'col_' + x, axis='columns')
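Regarding the "additional twist" in the question: str.get_dummies handles null values gracefully; a null entry simply produces an all-zero row rather than raising. A small sketch to illustrate (the None entry stands in for the question's nulls):

```python
import pandas as pd

s = pd.Series(['a', 'a,b,c', None, 'c,d'])
dummies = s.str.get_dummies(sep=',')
# the null row (index 2) comes back as all zeros
print(dummies)
```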
Count occurrences of specific values in a data frame, where all possible values are defined by a list
I have two categories A and B that can take on 5 different states (values, names or categories) defined by the list abcde. Counting the occurrence of each state and storing it in a data frame is fairly easy. However, I would also like the resulting data frame to include zeros for the possible values that have not occurred in Category A or B. First, here's a dataframe that matches the description:

In[1]:
import pandas as pd

possibleValues = list('abcde')
df = pd.DataFrame({'Category A': list('abbc'), 'Category B': list('abcc')})
print(df)

Out[1]:
  Category A Category B
0          a          a
1          b          b
2          b          c
3          c          c

I've tried different approaches with df.groupby(...).size() and .count(), combined with the list of possible values and the names of the categories in a list, but with no success. Here's the desired output:

   Category A  Category B
a           1           1
b           2           1
c           1           2
d           0           0
e           0           0

To go one step further, I'd also like to include a column with the totals for each possible state across all categories:

   Category A  Category B  Total
a           1           1      2
b           2           1      3
c           1           2      3
d           0           0      0
e           0           0      0

SO has many related questions and answers, but to my knowledge none that suggests a solution to this particular problem. Thank you for any suggestions!

P.S. I'd like the solution to be adjustable to the number of categories, possible values and number of rows.
Need apply + value_counts + reindex + sum:

cols = ['Category A', 'Category B']
df1 = df[cols].apply(pd.value_counts).reindex(possibleValues, fill_value=0)
df1['total'] = df1.sum(axis=1)
print(df1)

   Category A  Category B  total
a           1           1      2
b           2           1      3
c           1           2      3
d           0           0      0
e           0           0      0

Another solution is to convert the columns to categorical; the 0 values are then added without reindex:

cols = ['Category A', 'Category B']
df1 = df[cols].apply(lambda x: pd.Series.value_counts(
    x.astype('category', categories=possibleValues)))
df1['total'] = df1.sum(axis=1)
print(df1)

   Category A  Category B  total
a           1           1      2
b           2           1      3
c           1           2      3
d           0           0      0
e           0           0      0
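A note on the second solution: the `astype('category', categories=...)` form was removed in later pandas versions. A sketch of the same idea using pd.CategoricalDtype instead, which achieves the zero-filled counts without reindex:

```python
import pandas as pd

possibleValues = list('abcde')
df = pd.DataFrame({'Category A': list('abbc'), 'Category B': list('abcc')})

# casting to a categorical dtype makes value_counts report every category,
# including the ones that never occur (as 0)
dtype = pd.CategoricalDtype(possibleValues)
df1 = df.apply(lambda col: col.astype(dtype).value_counts().sort_index())
df1['Total'] = df1.sum(axis=1)
print(df1)
```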
Copy pandas DataFrame row to multiple other rows
Simple and practical question, yet I can't find a solution. The questions I took a look at were the following:

Modifying a subset of rows in a pandas dataframe
Changing certain values in multiple columns of a pandas DataFrame at once
Fastest way to copy columns from one DataFrame to another using pandas?
Selecting with complex criteria from pandas.DataFrame

The key difference between those and mine is that I need to insert not a single value, but a row. My problem is: I pick a row of a dataframe, say df1, so I have a series. Now I have this other dataframe, df2, in which I have selected multiple rows according to a criterion, and I want to replicate that series into all those rows.

df1:
Index/Col  A  B  C
1          0  0  0
2          0  0  0
3          1  2  3
4          0  0  0

df2:
Index/Col  A  B  C
1          0  0  0
2          0  0  0
3          0  0  0
4          0  0  0

What I want to accomplish is inserting df1[3] into rows df2[2] and df2[3], for example. So, something like this non-working code:

series = df1[3]
df2[df2.index>=2 and df2.index<=3] = series

returning df2:

Index/Col  A  B  C
1          0  0  0
2          1  2  3
3          1  2  3
4          0  0  0
Use loc and pass a list of the index labels of interest; after the following comma, the : indicates we want to set all column values. We then assign the series but call the attribute .values so that it's a numpy array. Otherwise you will get a ValueError from a shape mismatch, as you're intending to overwrite 2 rows with a single row, and a Series won't align as you desire:

In [76]:
df2.loc[[2, 3], :] = df1.loc[3].values
df2

Out[76]:
   A  B  C
1  0  0  0
2  1  2  3
3  1  2  3
4  0  0  0
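The question's own boolean condition also works once `and` is replaced with `&` (element-wise) and the assignment goes through .loc with .values, as described above. A self-contained sketch, rebuilding the question's df1 and df2:

```python
import pandas as pd

df1 = pd.DataFrame([[0, 0, 0], [0, 0, 0], [1, 2, 3], [0, 0, 0]],
                   columns=list('ABC'), index=[1, 2, 3, 4])
df2 = pd.DataFrame(0, index=[1, 2, 3, 4], columns=list('ABC'))

# note & instead of `and`, and .values so the row broadcasts over the selection
mask = (df2.index >= 2) & (df2.index <= 3)
df2.loc[mask, :] = df1.loc[3].values
print(df2)
```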
Suppose you have to copy certain rows and columns from one dataframe to another dataframe; do this:

df2 = df.loc[x:y, a:b]  # x and y are the row bounds and a and b are the column bounds you have to select