I have an Excel document that contains a sports column in which both sport names and sports persons' names appear. If I click on a sport name, the sports persons' names disappear, i.e. the sports persons' names are children of the sport name.
Please look at the data below:
If I click on cricket, then the names ramesh, suresh and mahesh disappear, i.e. cricket is the parent of ramesh, suresh and mahesh; likewise, football is the parent of pankaj, riyansh and suraj.
I want to read this Excel document and convert it into a pandas DataFrame. I tried to read it with pandas pivot_table, but without success.
I tried to read the Excel sheet and convert it into a dataframe like this:
df = pd.read_excel("sports.xlsx", skiprows=7, header=0)
d = pd.pivot_table(df, index=["sports"])
print(d)
But I'm getting all the sports values in a single column; I want them split by sport name and the corresponding sports persons' names.
Expected Output:
sports_name player_name age address
cricket ramesh 20 aaa
cricket suresh 21 bbb
cricket mahesh 22 ccc
football pankaj 24 eee
football riyansh 25 fff
football suraj 26 ggg
basketball rajesh 28 iii
basketball abhijeet 29 jjj
pandas.pivot_table is there to support data analysis and helps you create pivot tables similar to Excel's; it is not meant to read Excel pivot tables. From the documentation:
Create a spreadsheet-style pivot table as a DataFrame. The levels in the pivot table will be stored in MultiIndex objects (hierarchical indexes) on the index and columns of the result DataFrame
Example from Documentation
>>> df
A B C D
0 foo one small 1
1 foo one large 2
2 foo one large 2
3 foo two small 3
4 foo two small 3
5 bar one large 4
6 bar one small 5
7 bar two small 6
8 bar two large 7
>>> table = pivot_table(df, values='D', index=['A', 'B'],
... columns=['C'], aggfunc=np.sum)
>>> table
small large
foo one 1 4
two 6 NaN
bar one 5 4
two 6 7
Now, to help with the problem, I created a sample data set and a pivot table, then read the Excel sheet into a pandas DataFrame. This DataFrame contains NaNs (from the merged parent cells) that can be forward-filled:
df = pd.read_excel(pivotfile, skiprows=12, header=0)
df = df.ffill()  # equivalent to the older df.fillna(method='ffill')
print(df)
output
Sports Name Address Age
0 basketball Abhijit 129 ABC 20
1 basketball Rajesh 128 ABC 20
2 Cricket Mahesh 123 ABC 20
3 Cricket Ramesh 126 ABC 20
4 Cricket Suresh 124 ABC 20
5 Football Riyash 125 ABC 20
6 Football suraj 127 ABC 20
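To get from that forward-filled frame to the asker's expected layout, a minimal sketch (assuming the headers read back as Sports/Name/Address/Age as in the sample above; the rename targets sports_name and player_name are the asker's desired headers, not something the Excel export produces):
import pandas as pd

# read the exported pivot sheet; merged parent cells come back as NaN
df = pd.read_excel("sports.xlsx", skiprows=7, header=0)

# forward-fill each sport name down over its players, then rename
df = df.ffill()
df = df.rename(columns={"Sports": "sports_name", "Name": "player_name",
                        "Age": "age", "Address": "address"})
print(df)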
Here is a simplified version of my dataframe (the number of persons in my dataframe is way more than 3):
df = pd.DataFrame({'Person': ['John', 'David', 'Mary', 'John', 'David', 'Mary'],
                   'Sales': [10, 15, 20, 11, 12, 18]})
Person Sales
0 John 10
1 David 15
2 Mary 20
3 John 11
4 David 12
5 Mary 18
I would like to add a column "Total" to this data frame, holding the total sales per person:
Person Sales Total
0 John 10 21
1 David 15 27
2 Mary 20 38
3 John 11 21
4 David 12 27
5 Mary 18 38
What would be the easiest way to achieve this?
I have tried
df.groupby('Person').sum()
but the shape of the output is not congruent with the shape of df.
Sales
Person
David 27
John 21
Mary 38
What you want is the transform method, which applies a function to each group and broadcasts the result back to the rows of the original frame:
df['Total'] = df.groupby('Person')['Sales'].transform('sum')
It gives as expected:
Person Sales Total
0 John 10 21
1 David 15 27
2 Mary 20 38
3 John 11 21
4 David 12 27
5 Mary 18 38
The easiest way to achieve this with the pandas groupby and sum functions is to map the per-person sums back onto the Person column (assigning the raw groupby().sum() result directly would not align with df's integer index and would produce NaNs):
df['Total'] = df['Person'].map(df.groupby('Person')['Sales'].sum())
This will add a column to the dataframe with the total sales per person.
Your 'Person' column in the dataframe contains repeated values, so the groupby result cannot be assigned back as a new column directly. I would suggest making a new dataframe based on the sales sum. The code below will help you with that:
newDf = df.groupby('Person')['Sales'].sum().reset_index()
This will create a new dataframe with 'Person' and 'Sales' as columns.
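If you then want that per-person total back on the original rows, one option is to merge the summary back in; a minimal sketch (the column name 'Total' matches the asker's expected output):
# merge the per-person sums back onto the original rows
totals = df.groupby('Person')['Sales'].sum().rename('Total').reset_index()
df = df.merge(totals, on='Person', how='left')
print(df)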
I have 2 dataframes of different sizes. The first dataframe (df1) has 4 columns, two of which have the same names as the columns in the second dataframe (df2), which comprises only 2 columns. The columns in common are ['ID'] and ['Department'].
I want to check if any ID from df2 are in df1. If so, I want to replace df1['Department'] value with df2['Department'] value.
For instance, df1 looks something like this:
ID Department Yrs Experience Education
1234 Science 1 Bachelors
2356 Art 3 Bachelors
2456 Math 2 Masters
4657 Science 4 Masters
And df2 looks something like this:
ID Department
1098 P.E.
1234 Technology
2356 History
I want to check if the ID from df2 is in df1 and, if so, update Department. The output should look something like this:
ID Department Yrs Experience Education
1234 **Technology** 1 Bachelors
2356 **History** 3 Bachelors
2456 Math 2 Masters
4657 Science 4 Masters
The expected updates to df1 are in bold
Is there an efficient way to do this?
Thank you for taking the time to read this and help.
You can map the ID column of df1 against df2 with ID set as the index and the Department column selected (this acts as a mapping table).
Then, where an ID has no match in df2, we fill in the original Department values from df1:
df1['Department'] = (df1['ID'].map(df2.set_index('ID')['Department'])
                              .fillna(df1['Department']))
Result:
print(df1)
ID Department Yrs Experience Education
0 1234 Technology 1 Bachelors
1 2356 History 3 Bachelors
2 2456 Math 2 Masters
3 4657 Science 4 Masters
Try .update() with a left merge; the merge preserves df1's row order (and, with a default RangeIndex, its index), so the resulting Department series aligns back correctly:
df1["Department"].update(
    df1[["ID"]].merge(df2, on="ID", how="left")["Department"]
)
print(df1)
Prints:
ID Department Yrs Experience Education
0 1234 Technology 1 Bachelors
1 2356 History 3 Bachelors
2 2456 Math 2 Masters
3 4657 Science 4 Masters
df_1 = pd.DataFrame(data={'ID': [1234, 2356, 2456, 4657], 'Department': ['Science', 'Art', 'Math', 'Science']})
df_2 = pd.DataFrame(data={'ID': [1234, 2356], 'Department': ['Technology', 'History']})
# note: this assignment aligns on the row index, so it works here only because
# the matching rows of df_1 happen to share index labels 0 and 1 with df_2
df_1.loc[df_1['ID'].isin(df_2['ID']), 'Department'] = df_2['Department']
Output
ID Department
0 1234 Technology
1 2356 History
2 2456 Math
3 4657 Science
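If the row indexes of the two frames do not line up like this, a safer variant is to match by ID value instead of index position; a sketch of the same idea:
# build an ID -> Department lookup and apply it only to the matching rows
mapper = df_2.set_index('ID')['Department']
mask = df_1['ID'].isin(df_2['ID'])
df_1.loc[mask, 'Department'] = df_1.loc[mask, 'ID'].map(mapper)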
I have two pandas dataframes with some columns in common. These columns are of type category but unfortunately the category codes don't match for the two dataframes. For example I have:
>>> df1
artist song
0 The Killers Mr Brightside
1 David Guetta Memories
2 Estelle Come Over
3 The Killers Human
>>> df2
artist date
0 The Killers 2010
1 David Guetta 2012
2 Estelle 2005
3 The Killers 2006
But:
>>> df1['artist'].cat.codes
0 55
1 78
2 93
3 55
Whereas:
>>> df2['artist'].cat.codes
0 99
1 12
2 23
3 99
What I would like is for my second dataframe df2 to take the same category codes as the first one df1 without changing the category values. Is there any way to do this?
(Edit)
Here is a screenshot of my two dataframes. Essentially, I want song_tags to have the same cat codes for artist_name and track_name as the songs dataframe. song_tags is created from a merge between songs and another tag dataframe (which contains song data and their tags, without the user information) and then saved and loaded through pickle. It might also be relevant that I had to cast artist_name and track_name in song_tags from type object to type category.
I think essentially my question is: how to modify category codes of an existing dataframe column?
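For reference, a minimal sketch of one way to do that with the pandas categorical API, assuming every artist in df2 also appears in df1 (any value missing from df1's categories would become NaN):
# reindex df2's categories to df1's so that identical values get identical codes
df2['artist'] = df2['artist'].cat.set_categories(df1['artist'].cat.categories)
print(df2['artist'].cat.codes)  # now consistent with df1 for shared artists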
I am learning Python and I have a question about creating a data frame for every 5 rows, transposing it, and merging the resulting data frames.
I have a .txt file with the following input. It has thousands of rows and I need to go through each line until the end of the file.
Name,Kamath
Age,23
Sex,Male
Company,ACC
Vehicle,Car
Name,Ram
Age,32
Sex,Male
Company,CCA
Vehicle,Bike
Name,Reena
Age,26
Sex,Female
Company,BARC
Vehicle,Cycle
I need to get this as my output:
Name,Age,Sex,Company,Vehicle
Kamath,23,Male,ACC,Car
Ram,32,Male,CCA,Bike
Reena,26,Female,BARC,Cycle
Use read_csv to build the DataFrame, then pivot_table with a groupby/cumcount counter as the new index:
import io
import pandas as pd
temp=u"""Name,Kamath
Age,23
Sex,Male
Company,ACC
Vehicle,Car
Name,Ram
Age,32
Sex,Male
Company,CCA
Vehicle,Bike
Name,Reena
Age,26
Sex,Female
Company,BARC
Vehicle,Cycle"""
# after testing, replace io.StringIO(temp) with the real 'filename.txt'
df = pd.read_csv(io.StringIO(temp), names=['a','b'])
print (df)
a b
0 Name Kamath
1 Age 23
2 Sex Male
3 Company ACC
4 Vehicle Car
5 Name Ram
6 Age 32
7 Sex Male
8 Company CCA
9 Vehicle Bike
10 Name Reena
11 Age 26
12 Sex Female
13 Company BARC
14 Vehicle Cycle
out = pd.pivot_table(df, index=df.groupby('a').cumcount(),
                     columns='a', values='b', aggfunc='first')
print(out)
a Age Company Name Sex Vehicle
0 23 ACC Kamath Male Car
1 32 CCA Ram Male Bike
2 26 BARC Reena Female Cycle
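As an aside, if the records always arrive in complete, identically ordered 5-row blocks, a plain numpy reshape is a simple alternative; a sketch under that assumption:
# every 5 consecutive values of column 'b' form one record
wide = pd.DataFrame(df['b'].to_numpy().reshape(-1, 5),
                    columns=df['a'].iloc[:5])
print(wide)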
Suppose we take a pandas dataframe...
name age family
0 john 1 1
1 jason 36 1
2 jane 32 1
3 jack 26 2
4 james 30 2
Then do a groupby() ...
group_df = df.groupby('family')
group_df = group_df.aggregate({'name': name_join, 'age': 'mean'})  # 'mean' instead of the removed pd.np.mean
Then do some aggregate/summarize operation (in my example, my function name_join aggregates the names):
def name_join(list_names, concat='-'):
    return concat.join(list_names)
The grouped summarized output is thus:
age name
family
1 23 john-jason-jane
2 28 jack-james
Question:
Is there a quick, efficient way to get to the following from the aggregated table?
name age family
0 john 23 1
1 jason 23 1
2 jane 23 1
3 jack 28 2
4 james 28 2
(Note: the age column values are just examples, I don't care for the information I am losing after averaging in this specific example)
The way I thought I could do it does not look too efficient:
create empty dataframe
from every line in group_df, separate the names
return a dataframe with as many rows as there are names in the starting row
append the output to the empty dataframe
The rough equivalent is .reset_index(), but it may not be helpful to think of it as the "opposite" of groupby(). You are splitting a string into pieces while maintaining each piece's association with 'family'. This old answer of mine does the job: just set 'family' as the index column first, then reset_index() at the end to get your desired result, as in the sketch below.
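A minimal sketch of that idea on the aggregated table from the question (assuming pandas >= 0.25 for DataFrame.explode, and that names never contain the '-' separator themselves):
# split the joined names back apart; age broadcasts to every exploded row
out = (group_df.assign(name=group_df['name'].str.split('-'))
               .explode('name')
               .reset_index()[['name', 'age', 'family']])
print(out)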
It turns out that DataFrame.groupby() returns an object with the original data stored in its obj attribute, so ungrouping is just pulling out the original data (note that obj is effectively an internal attribute, so this behaviour may not be guaranteed across versions).
group_df = df.groupby('family')
group_df.obj
Example
>>> dat_1 = df.groupby("category_2")
>>> dat_1
<pandas.core.groupby.generic.DataFrameGroupBy object at 0x7fce78b3dd00>
>>> dat_1.obj
order_date category_2 value
1 2011-02-01 Cross Country Race 324400.0
2 2011-03-01 Cross Country Race 142000.0
3 2011-04-01 Cross Country Race 498580.0
4 2011-05-01 Cross Country Race 220310.0
5 2011-06-01 Cross Country Race 364420.0
.. ... ... ...
535 2015-08-01 Triathalon 39200.0
536 2015-09-01 Triathalon 75600.0
537 2015-10-01 Triathalon 58600.0
538 2015-11-01 Triathalon 70050.0
539 2015-12-01 Triathalon 38600.0
[531 rows x 3 columns]
Here's a complete example that recovers the original dataframe from the grouped object
import pandas

def name_join(list_names, concat='-'):
    return concat.join(list_names)

print('create dataframe\n')
df = pandas.DataFrame({'name': ['john', 'jason', 'jane', 'jack', 'james'],
                       'age': [1, 36, 32, 26, 30],
                       'family': [1, 1, 1, 2, 2]})
df.index.name='indexer'
print(df)
print('create group_by object')
group_obj_df = df.groupby('family')
print(group_obj_df)
print('\nrecover grouped df')
group_joined_df = group_obj_df.aggregate({'name': name_join, 'age': 'mean'})
group_joined_df
create dataframe
name age family
indexer
0 john 1 1
1 jason 36 1
2 jane 32 1
3 jack 26 2
4 james 30 2
create group_by object
<pandas.core.groupby.generic.DataFrameGroupBy object at 0x7fbfdd9dd048>
recover grouped df
name age
family
1 john-jason-jane 23
2 jack-james 28
print('\nRecover the original dataframe')
print(pandas.concat([group_obj_df.get_group(key) for key in group_obj_df.groups]))
Recover the original dataframe
name age family
indexer
0 john 1 1
1 jason 36 1
2 jane 32 1
3 jack 26 2
4 james 30 2
There are a few ways to undo DataFrame.groupby; one is DataFrame.groupby(...).filter(lambda x: True), which simply returns the original DataFrame's rows.
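A quick sketch to confirm, using the family dataframe from the earlier example (filter keeps the surviving rows in their original order):
# a filter that keeps every group hands the original rows back
restored = df.groupby('family').filter(lambda x: True)
print(restored.equals(df))  # True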