This code generates a very simple dummy dataframe, in which people filled in a survey form:
import pandas as pd
df2 = pd.DataFrame({
'name':['John','John','John','Rachel','Rachel','Rachel'],
'gender':['Male','Male','Male','Female','Female','Female'],
'age':[40,40,40,39,39,39],
'SurveyQuestion':['Married?','HasKids?','Smokes?','Married?','HasKids?','Smokes?'],
'answers':['Yes','Yes','No','Yes','No','No']
})
The output looks like so:
Because of the way the table is structured, with each question having its own row, we see that the first 3 columns always contain the same info, as it's just repeating the info based on the person that filled in the survey.
It would be better to visualize the dataframe as a pivot-table, similar to the following:
df2.pivot(index='name',columns='SurveyQuestion',values='answers')
However, doing it this way results in many of the previous columns being lost, since only 1 column can be used as the index.
I'm wondering what the most straightforward way of doing this would be, ideally one that doesn't involve an extra step of rejoining the columns.
You can use df.pivot_table:
In [27]: df2.pivot_table(values='answers', index=['name','gender','age'], columns='SurveyQuestion', aggfunc='first')
Out[27]:
SurveyQuestion HasKids? Married? Smokes?
name gender age
John Male 40 Yes Yes No
Rachel Female 39 No Yes No
OR, you can use df.pivot with df.set_index, like this:
In [30]: df = df2.set_index(['name', 'gender', 'age'])
In [32]: df.pivot(index=df.index, columns='SurveyQuestion')['answers']
Out[32]:
SurveyQuestion HasKids? Married? Smokes?
name gender age
John Male 40 Yes Yes No
Rachel Female 39 No Yes No
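As a further option, here is a minimal sketch assuming pandas 1.1 or later, where DataFrame.pivot accepts a list of column names for index. The names wide and flat are only illustrative, and the reset_index step is optional: it turns the three identifying levels back into ordinary columns.

```python
import pandas as pd

# Same survey data as in the question.
df2 = pd.DataFrame({
    'name': ['John', 'John', 'John', 'Rachel', 'Rachel', 'Rachel'],
    'gender': ['Male', 'Male', 'Male', 'Female', 'Female', 'Female'],
    'age': [40, 40, 40, 39, 39, 39],
    'SurveyQuestion': ['Married?', 'HasKids?', 'Smokes?'] * 2,
    'answers': ['Yes', 'Yes', 'No', 'Yes', 'No', 'No'],
})

# Assumption: pandas >= 1.1, where `index` may be a list of column names.
wide = df2.pivot(index=['name', 'gender', 'age'],
                 columns='SurveyQuestion',
                 values='answers')

# Optional: move name/gender/age back into regular columns.
flat = wide.reset_index()
flat.columns.name = None  # drop the leftover 'SurveyQuestion' axis label
```

Unlike pivot_table, pivot does not aggregate, so it raises an error if the same person answered the same question twice.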
I'm not sure there are any existing algorithms to do this for you, but I've had a similar problem in my projects.
If you're trying to condense the rows in your table, first you need to make sure every person can have the same columns applied to them. For example, you can't reasonably do this if you didn't ask Rachel the 'HasKids?' question, unless you include an N/A option.
After this, order the table by some unique ID, so that any repeated people will definitely be next to each other in the table.
Then iterate through this table, and every time you hit a row that belongs to the same person as the previous one, take whatever unique information it has, add it to the original row for that person, and delete the repeat. If this is done for the whole table, you should get your pivot.
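The manual procedure described above can be sketched in plain Python roughly as follows. This is only an illustration under assumptions: the function name condense, its parameters, and the list-of-dicts layout are made up for the example, and the combination of name, gender and age stands in for a unique ID.

```python
def condense(rows, id_keys, question_key, answer_key):
    """Merge consecutive rows that describe the same person into one record."""
    # Sort so that all rows for one person sit next to each other.
    ordered = sorted(rows, key=lambda r: tuple(r[k] for k in id_keys))
    result = []
    for row in ordered:
        same_person = bool(result) and all(result[-1][k] == row[k] for k in id_keys)
        if not same_person:
            # First row for this person: start a new record with the ID fields.
            result.append({k: row[k] for k in id_keys})
        # Add this row's unique information (the answer) as its own column.
        result[-1][row[question_key]] = row[answer_key]
    return result


rows = [
    {'name': 'Rachel', 'gender': 'Female', 'age': 39, 'question': 'Married?', 'answer': 'Yes'},
    {'name': 'John', 'gender': 'Male', 'age': 40, 'question': 'Married?', 'answer': 'Yes'},
    {'name': 'John', 'gender': 'Male', 'age': 40, 'question': 'HasKids?', 'answer': 'Yes'},
    {'name': 'Rachel', 'gender': 'Female', 'age': 39, 'question': 'HasKids?', 'answer': 'No'},
    {'name': 'John', 'gender': 'Male', 'age': 40, 'question': 'Smokes?', 'answer': 'No'},
    {'name': 'Rachel', 'gender': 'Female', 'age': 39, 'question': 'Smokes?', 'answer': 'No'},
]

condensed = condense(rows, ['name', 'gender', 'age'], 'question', 'answer')
```

A person who was never asked a question simply ends up without that key, which is where the N/A handling mentioned above would come in.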
I have a datatable like this:
import datatable as dt
DT_EX= dt.Frame({
'country':['a','a','a','a'],
'id':[3,3,3,3],
'shop':['dmart','dmart','dmart','dmart'],
'beef':[23,None,None,None],
'eggs':[None,33,None,None],
'fork':[None,None,10,None],
'veg':[None,None,None,40]})
Its output is:
And I would like to convert it to a datatable that has no NAs in its columns, as shown in this output:
Could you please explain how to do this operation (removing NAs) in py-datatable? Would dt.isna() be helpful in this case?
One way around it would be to select the first three columns (they have no nulls) and extend them with the sum of the remaining columns:
from datatable import f, first, sum
DT_EX[:,first(f[:3]).extend(sum(f[3:]))]
country id shop beef eggs fork veg
▪▪▪▪ ▪▪▪▪ ▪▪▪▪ ▪▪▪▪▪▪▪▪ ▪▪▪▪▪▪▪▪ ▪▪▪▪▪▪▪▪ ▪▪▪▪▪▪▪▪
0 a 3 dmart 23 33 10 40
UPDATE: simpler solution from another related question:
from datatable import by
DT_EX[:, sum(f[3:]), by(f[:3])]
So I have one more subgroup of items, and here is a new DT.
DT_EX= dt.Frame({
'country':['a','a','a','a','b','b','c','c'],
'id':[3,3,3,3,4,4,4,4],
'shop':['dmart','dmart','dmart','dmart','amzn','amzn','amzn','amzn'],
'beef':[23,None,None,None,93,None,None,None],
'eggs':[None,33,None,None,None,103,None,None],
'fork':[None,None,10,None,None,None,210,None],
'veg':[None,None,None,40,None,None,None,340]})
I tried to apply the recommended logic to it, as shown in the attached screenshot.
In the second code chunk, it summed up each column (beef, eggs, fork, veg).
In the third code chunk, I grouped on the first three columns. This gives a correct output, but it adds duplicate columns. Another observation is that it fills NA values with 0, as can be seen in the C observation.
Would you have any other ideas or suggestions for it?
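For comparison only, here is how the same condensing could look in pandas rather than py-datatable. This is a sketch of a swapped-in technique, not a datatable answer: it relies on GroupBy.first(), which takes the first non-null value per group, so groups with no value in a column keep NA instead of getting 0. The variable names are illustrative.

```python
import pandas as pd

# Same data as the second DT_EX above, built as a pandas DataFrame.
df = pd.DataFrame({
    'country': ['a', 'a', 'a', 'a', 'b', 'b', 'c', 'c'],
    'id': [3, 3, 3, 3, 4, 4, 4, 4],
    'shop': ['dmart', 'dmart', 'dmart', 'dmart', 'amzn', 'amzn', 'amzn', 'amzn'],
    'beef': [23, None, None, None, 93, None, None, None],
    'eggs': [None, 33, None, None, None, 103, None, None],
    'fork': [None, None, 10, None, None, None, 210, None],
    'veg': [None, None, None, 40, None, None, None, 340],
})

# One row per (country, id, shop); first() skips nulls within each group.
condensed = df.groupby(['country', 'id', 'shop'], as_index=False).first()
```

Here the group for country b keeps NA in fork and veg, and the group for country c keeps NA in beef and eggs, rather than being filled with 0.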
I have a dataframe named Tasks, containing a column named UserName. I want to count every occurrence of a row containing the same UserName, so that I know how many tasks each user has been assigned. For a better understanding, here's what my dataframe looks like:
In order to achieve this, I used the code below:
Most_Involved = Tasks['UserName'].value_counts()
But this got me a DataFrame like this:
Index Username
John 4
Paul 1
Radu 1
Which is not exactly what I am looking for. How should I re-write the code in order to achieve this:
Most_Involved
Index UserName Tasks
0 John 4
1 Paul 1
2 Radu 1
You can use transform to add a new column to existing data frame:
df['Tasks'] = df.groupby('UserName')['UserName'].transform('size')
# finally select the columns needed
df = df[['Index','UserName','Tasks']]
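If one row per user is wanted, as in the desired output, another option is to keep value_counts and turn its result into a two-column frame. A minimal sketch, with made-up sample data standing in for the real Tasks frame (the Task column is hypothetical):

```python
import pandas as pd

# Hypothetical sample data; only the UserName column matters here.
Tasks = pd.DataFrame({
    'UserName': ['John', 'Paul', 'John', 'Radu', 'John', 'John'],
    'Task': ['t1', 't2', 't3', 't4', 't5', 't6'],
})

Most_Involved = (
    Tasks['UserName']
    .value_counts()                # counts per user, largest first
    .rename_axis('UserName')       # name the index so it becomes a column
    .reset_index(name='Tasks')     # counts become the 'Tasks' column
)
```

The result has a default integer index and the columns UserName and Tasks, with the most involved user in the first row.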
You can find duplicate rows based on specific columns with pandas:
duplicateRowsDF = dataframe[dataframe.duplicated(['columnName'])]
I have an initial dummy dataframe with 7 columns and 1 row, with the given column names and initialised to zeros:
import numpy as np
import pandas as pd
d = pd.DataFrame(np.zeros((1, 7)))
d = d.rename(columns={0:"Gender_M",
1:"Gender_F",
2:"Employed_Self",
3:"Employed_Employee",
4:"Married_Y",
5:"Married_N",
6:"Salary"})
Now I have a single record
data = [['M', 'Employee', 'Y',85412]]
data_test = pd.DataFrame(data, columns = ['Gender', 'Employed', 'Married','Salary'])
From the single record I have to create a new dataframe, where:
if the Gender column has M, then Gender_M should be changed to 1 and Gender_F left at zero;
if the Employed column has Employee, then Employed_Employee should be changed to 1 and Employed_Self left at zero;
the same applies to Married, and for the integer column Salary the value 85412 should simply be set. I tried this with if statements, but it takes a lot of code. Is there a simpler way?
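Here is one way, using update twice (df below is the single-record frame, called data_test in the question):
df = data_test.copy()
# first pass: copies Salary, the only column name shared with d
d.update(df)
# rename each column to '<column>_<value>', e.g. 'Gender_M'
df.columns=df.columns+'_'+df.astype(str).iloc[0]
df.iloc[:]=1
# second pass: sets the matching indicator columns of d to 1
d.update(df)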
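A different route, sketched here as an alternative rather than as the method above, is to let pd.get_dummies build the indicator columns and then reindex the result onto the full set of expected columns. The names template_columns, encoded and result are illustrative.

```python
import pandas as pd

# The seven columns of the zero-initialised template from the question.
template_columns = ['Gender_M', 'Gender_F', 'Employed_Self',
                    'Employed_Employee', 'Married_Y', 'Married_N', 'Salary']

data = [['M', 'Employee', 'Y', 85412]]
data_test = pd.DataFrame(data, columns=['Gender', 'Employed', 'Married', 'Salary'])

# One-hot encode the categorical columns; Salary passes through unchanged.
encoded = pd.get_dummies(data_test,
                         columns=['Gender', 'Employed', 'Married'],
                         dtype=int)

# Align to the template: categories absent from this record are filled with 0.
result = encoded.reindex(columns=template_columns, fill_value=0)
```

Because the reindex step supplies the missing categories, this works on a single record as well as on many rows at once.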
Alas homework is often designed to be boring and repetitive ...
You do not have a problem - rather you want other people to do the work for you. SO is not for this purpose - post a problem, you will find many people willing to help.
So show your FULL answer then ask for "Is there a better way"
I'm still very new to Python and Pandas, so bear with me...
I have a dataframe of passengers on a ship that sank. I have broken this down into other dataframes by male and female, and also by class, to create probabilities for survival. I made a function that compares one dataframe to a dataframe of only survivors, and calculates the probability of survival among this group:
def survivability(total_pass_df, column, value):
    survivors = sum(did_survive[column] == value)
    total = len(total_pass_df)
    survival_prob = round((survivors / total), 2)
    return survival_prob
But now I'm trying to compare survivability among smaller groups: male first class passengers vs female third class passengers, for example. I did make dataframes for both of these groups, but I still can't use my survivability function because I'm comparing two different columns, sex and class, rather than just one.
I know exactly how I'd do it with Python - loop through the 'survived' column (which is either a 1 or 0), in the dataframe, if it equals 1, then add one to an index value, and once all the data has been gone through, divide the index value by the length of the dataframe to get the probability of survival....
But I'm supposed to use Pandas for this, and I can't for the life of me work out in my head how to do it....
:/
Without a sample of the data frames you're working with, I can't be sure if I understand your question correctly. But based on your description of the pure-Python procedure,
I know exactly how I'd do it with Python - loop through the 'survived' column (which is either a 1 or 0), in the dataframe, if it equals 1, then add one to an index value, and once all the data has been gone through, divide the index value by the length of the dataframe to get the probability of survival....
you can do this in Pandas by simply writing
dataframe['survived'].mean()
That's it. Given that all the values are either 1 or 0, the mean will be the number of 1's divided by the total number of rows.
If you start out with a data frame that has columns like survived, sex, class, and so on, you can elegantly combine this with Pandas' boolean indexing to pick out the survival rates for different groups. Let me use the Socialcops Titanic passengers data set as an example to demonstrate. Assuming the DataFrame is called df, if you want to analyze only male passengers, you can get those records as
df[df['sex'] == 'male']
and then you can take the survived column of that and get the mean.
>>> df[df['sex'] == 'male']['survived'].mean()
0.19198457888493475
So 19% of male passengers survived. If you want to narrow down to male second-class passengers, you'll need to combine the conditions using &, like this:
>>> df[(df['sex'] == 'male') & (df['pclass'] == 2)]['survived'].mean()
0.14619883040935672
This is getting a little unwieldy, but there's an easier way that actually lets you do multiple categories at once. (The catch is that this is a somewhat more advanced Pandas technique and it might take a while to understand it.) Using the DataFrame.groupby() method, you can tell Pandas to group the rows of the data frame according to their values in certain columns. For example,
df.groupby('sex')
tells Pandas to group the rows by their sex: all male passengers' records are in one group, and all female passengers' records are in another group. The thing you get from groupby() is not a DataFrame, it's a special kind of object that lets you apply aggregation functions - that is, functions which take a whole group and turn it into one number (or something). So, for example, if you do this
>>> df.groupby('sex').mean()
pclass survived age sibsp parch fare \
sex
female 2.154506 0.727468 28.687071 0.652361 0.633047 46.198097
male 2.372479 0.190985 30.585233 0.413998 0.247924 26.154601
body
sex
female 166.62500
male 160.39823
you see that for each column, Pandas takes the average of that column's values over all the male passengers' records, and also over all the female passengers' records. All you care about here is the survival rate, so just use
>>> df.groupby('sex').mean()['survived']
sex
female 0.727468
male 0.190985
One big advantage of this is that you can give more than one column to group by, if you want to look at small groups. For example, sex and class:
>>> df.groupby(['sex', 'pclass']).mean()['survived']
sex pclass
female 1 0.965278
2 0.886792
3 0.490741
male 1 0.340782
2 0.146199
3 0.152130
(you have to give groupby a list of column names if you're giving more than one)
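A small variation that may be useful: selecting the survived column before calling mean() computes only the one average that is needed. To my knowledge, recent pandas versions (2.0 and later) also raise an error when mean() is applied to a grouped frame that still contains non-numeric columns, unless numeric_only=True is passed, and selecting the column first avoids that. The data below is a tiny made-up sample, not the real passenger list.

```python
import pandas as pd

# Made-up sample with the same column names as in the examples above.
df = pd.DataFrame({
    'name': ['A', 'B', 'C', 'D', 'E', 'F'],
    'sex': ['male', 'male', 'male', 'female', 'female', 'female'],
    'pclass': [1, 3, 3, 1, 1, 3],
    'survived': [1, 0, 0, 1, 1, 0],
})

# Pick the column first, then aggregate: one survival rate per (sex, pclass).
rates = df.groupby(['sex', 'pclass'])['survived'].mean()

# A single group can be looked up by its (sex, pclass) pair.
male_third_class = rates.loc[('male', 3)]
```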
Have you tried merging the two dataframes by passenger ID and then doing a pivot table in Pandas with whatever row subtotals and aggfunc=numpy.mean?
import pandas as pd
import numpy as np
# Passenger List
p_list = pd.DataFrame()
p_list['ID'] = [1,2,3,4,5,6]
p_list['Class'] = ['1','2','2','1','2','1']
p_list['Gender'] = ['M','M','F','F','F','F']
# Survivor List
s_list = pd.DataFrame()
s_list['ID'] = [1,2,3,4,5,6]
s_list['Survived'] = [1,0,0,0,1,0]
# Merge the datasets
merged = pd.merge(p_list,s_list,how='left',on=['ID'])
# Pivot to get sub means
result = pd.pivot_table(merged,index=['Class','Gender'],values=['Survived'],aggfunc=np.mean, margins=True)
# Reset the index
for x in range(result.index.nlevels-1,-1,-1):
    result.reset_index(level=x,inplace=True)
print(result)
Class Gender Survived
0 1 F 0.000000
1 1 M 1.000000
2 2 F 0.500000
3 2 M 0.000000
4 All 0.333333
I am working on a project in which I scraped NBA data from ESPN and created a DataFrame to store it. One of the columns of my DataFrame is Team. Certain players that have been traded within a season have a value such as LAL/LAC under team, rather than just having one team name like LAL. With these rows of data, I would like to make 2 entries instead of one. Both entries would have the same, original data, except for 1 of the entries the team name would be LAL and for the other entry the team name would be LAC. Some team abbreviations are 2 letters while others are 3 letters.
I have already managed to create a separate DataFrame with just these rows of data that have values in the form team1/team2. I figured a good way of getting the data the way I want it would be to first copy this DataFrame with the multiple team entries, and then with one DataFrame, keep everything in the Team column up until the /, and with the other, keep everything in the Team column after the slash. I'm not quite sure what the code would be for this in the context of a DataFrame. I tried the following but it is invalid syntax:
first_team = first_team['TEAM'].str[:first_team[first_team['TEAM'].index("/")]]
where first_team is my DataFrame with just the entries with multiple teams. Perhaps this can give you a better idea of what I'm trying to accomplish!
Thanks in advance!
You're probably better off using split first to separate the teams into columns (also see Pandas DataFrame, how do i split a column into two), something like this:
d = pd.DataFrame({'player':['jordan','johnson'],'team':['LAL/LAC','LAC']})
pd.concat([d, pd.DataFrame(d.team.str.split('/').tolist(), columns = ['team1','team2'])], axis = 1)
player team team1 team2
0 jordan LAL/LAC LAL LAC
1 johnson LAC LAC None
Then if you want separate rows, you can use append.
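Note that DataFrame.append was removed in pandas 2.0. As one possible way to get separate rows directly, here is a sketch that assumes pandas 0.25 or later, where DataFrame.explode is available; the variable name rows is illustrative.

```python
import pandas as pd

d = pd.DataFrame({'player': ['jordan', 'johnson'],
                  'team': ['LAL/LAC', 'LAC']})

rows = (
    d.assign(team=d['team'].str.split('/'))  # 'LAL/LAC' -> ['LAL', 'LAC']
     .explode('team')                        # one row per list element
     .reset_index(drop=True)                 # fresh 0..n-1 index
)
```

Each traded player ends up with one row per team, with all other columns repeated, and players with a single team keep their one row. This works for both 2-letter and 3-letter abbreviations because the split is on the slash.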