Let's say I have this dataframe:
Name Salary Field
0 Megan 30000 Botany
1 Ann 24000 Psychology
2 John 24000 Police
3 Mary 45000 Genetics
4 Jay 60000 Data Science
I want to add some 0-indexed numbers above the column names, but I also want to keep the column names. I want to reach this form:
0 1 2
Name Salary Field
0 Megan 30000 Botany
1 Ann 24000 Psychology
2 John 24000 Police
3 Mary 45000 Genetics
4 Jay 60000 Data Science
How can I do that with pandas and Python?
You don't need to create a new dataframe:
df.columns = pd.MultiIndex.from_tuples(list(enumerate(df)))
As expected:
# 0 1 2
# Name Salary Field
# 0 Megan 30000 Botany
# 1 Ann 24000 Psychology
# 2 John 24000 Police
# 3 Mary 45000 Genetics
# 4 Jay 60000 Data Science
I hope this helps.
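For completeness, here is a runnable sketch of that approach, with the frame rebuilt from the question's data. The key point is that iterating over a DataFrame yields its column names, so enumerate(df) produces exactly the (position, name) tuples a two-level MultiIndex needs:

```python
import pandas as pd

# rebuild the example frame from the question
df = pd.DataFrame({'Name': ['Megan', 'Ann', 'John', 'Mary', 'Jay'],
                   'Salary': [30000, 24000, 24000, 45000, 60000],
                   'Field': ['Botany', 'Psychology', 'Police',
                             'Genetics', 'Data Science']})

# enumerate(df) yields (0, 'Name'), (1, 'Salary'), (2, 'Field')
df.columns = pd.MultiIndex.from_tuples(list(enumerate(df)))
```

After this, individual columns are addressed by the full tuple, e.g. df[(1, 'Salary')].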
Here is a quick way:
new_df = pd.DataFrame(df.values,
                      columns=[list(range(df.shape[1])), df.columns])
I'm sure there is a more elegant way.
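One caveat with rebuilding from df.values: it coerces every column to object, so the Salary column loses its integer dtype. A sketch that keeps the original dtypes (my own variant using set_axis, not part of the answer above):

```python
import pandas as pd

df = pd.DataFrame({'Name': ['Megan', 'Ann'],
                   'Salary': [30000, 24000],
                   'Field': ['Botany', 'Psychology']})

# set_axis attaches the two-level header in place of the old one,
# without round-tripping the data through a NumPy object array
new_df = df.set_axis(
    pd.MultiIndex.from_arrays([list(range(df.shape[1])), df.columns]),
    axis=1)
```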
Let's say I had this sample of a mixed dataset:
df:
Property Name Date of entry Old data Updated data
City Jim 1/7/2021 Jacksonville Miami
State Jack 1/8/2021 TX CA
Zip Joe 2/2/2021 11111 22222
Address Harry 2/3/2021 123 lane 123 street
Telephone Lisa 3/1/2021 111-111-11111 333-333-3333
Email Tammy 3/2/2021 tammy@yahoo.com tammy@gmail.com
Date Product Ordered Lisa 3/3/2021 2/1/2021 2/10/2021
Order count Tammy 3/4/2021 2 3
I'd like to group by all this data starting with property and have it look like this:
grouped:
Property Name Date of entry Old data Updated Data
City names1 date 1 data 1 data 2
names2 date 2 data 1 data 2
names3 date 3 data 1 data 2
State names1 date 1 data 1 data 2
names2 date 2 data 1 data 2
names3 date 3 data 1 data 2
grouped = pd.DataFrame(df.groupby(['Property', 'Name', 'Date of entry',
                                   'Old data', 'Updated data']).size(),
                       columns=['Count'])
grouped
and I get a TypeError: '<' not supported between instances of 'int' and 'datetime.datetime'
Is there some sort of formatting that I need to do to the df['Old data'] & df['Updated data'] columns to allow them to be added to the groupby?
added data types:
Property: Object
Name: Object
Date of entry: datetime
Old data: Object
Updated data: Object
*I modified your initial data to get a better view of the output.
You can try with pivot_table instead of groupby:
df.pivot_table(index=['Property', 'Name', 'Date of entry'], aggfunc=lambda x: x)
Output:
Old data Updated data
Property Name Date of entry
Address Harry 2/3/2021 123 lane 123 street
Lisa 2/3/2021 123 lane 123 street
City Jack 1/8/2021 TX Miami
Jim 1/7/2021 Jacksonville Miami
Tammy 1/8/2021 TX Miami
Date Product Ordered Lisa 3/3/2021 2/1/2021 2/10/2021
Email Tammy 3/2/2021 tammy@yahoo.com tammy@gmail.com
Order count Jack 3/4/2021 2 3
Tammy 3/4/2021 2 3
State Jack 1/8/2021 TX CA
Telephone Lisa 3/1/2021 111-111-11111 333-333-3333
Zip Joe 2/2/2021 11111 22222
The whole code:
import pandas as pd
from io import StringIO
txt = '''Property  Name  Date of entry  Old data  Updated data
City  Jim  1/7/2021  Jacksonville  Miami
City  Jack  1/8/2021  TX  Miami
State  Jack  1/8/2021  TX  CA
Zip  Joe  2/2/2021  11111  22222
Order count  Jack  3/4/2021  2  3
Address  Harry  2/3/2021  123 lane  123 street
Telephone  Lisa  3/1/2021  111-111-11111  333-333-3333
Address  Lisa  2/3/2021  123 lane  123 street
Email  Tammy  3/2/2021  tammy@yahoo.com  tammy@gmail.com
Date Product Ordered  Lisa  3/3/2021  2/1/2021  2/10/2021
Order count  Tammy  3/4/2021  2  3
City  Tammy  1/8/2021  TX  Miami
'''
df = pd.read_csv(StringIO(txt), header=0, skipinitialspace=True, sep=r'\s{2,}', engine='python')
print(df.pivot_table(index=['Property', 'Name', 'Date of entry'], aggfunc=lambda x: x))
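As for the original TypeError: it most likely comes from the object columns mixing ints and datetimes, which pandas cannot sort when it orders the group keys. A sketch of a direct fix, assuming that diagnosis (a minimal reproduction, not the asker's real data), is to cast the mixed columns to strings before grouping:

```python
import datetime
import pandas as pd

# minimal reproduction: object columns mixing ints and datetimes
df = pd.DataFrame({
    'Property': ['Order count', 'Date Product Ordered'],
    'Name': ['Tammy', 'Lisa'],
    'Date of entry': pd.to_datetime(['3/4/2021', '3/3/2021']),
    'Old data': [2, datetime.datetime(2021, 2, 1)],
    'Updated data': [3, datetime.datetime(2021, 2, 10)],
})

# casting the mixed columns to str makes the group keys comparable,
# so the original groupby runs without the TypeError
for col in ['Old data', 'Updated data']:
    df[col] = df[col].astype(str)

grouped = (df.groupby(['Property', 'Name', 'Date of entry',
                       'Old data', 'Updated data'])
             .size().to_frame('Count'))
```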
I have a dataframe with 2 columns i.e. UserId in integer format and Actors in string format as shown below:
Userid Actors
u1 Tony Ward,Bruce LaBruce,Kevin P. Scott,Ivar Johnson, Naomi Watts, Tony Ward,.......
u2 Tony Ward,Bruce LaBruce,Kevin P. Scott, Luke Wilson, Owen Wilson, Lumi Cavazos,......
It represents actors from all movies watched by a particular user of the platform
I want an output where we have the count of each actor for each user as shown below:
UserId Tony Ward Bruce LaBruce Kevin P. Scott Ivar Johnson Luke Wilson Owen Wilson Lumi Cavazos
u1 2 1 1 1 0 0 0
u2 1 1 1 0 1 1 1
It is something similar to CountVectorizer, I reckon, but I just have nouns here.
Kindly help.
Assuming it's a pandas.DataFrame, try this: DataFrame.explode transforms each element of a list-like (the result of the split) into a row, DataFrame.groupby aggregates the data, and DataFrame.unstack pivots it to the required format.
df['Actors'] = df['Actors'].str.replace(r",\s+", ",", regex=True).str.split(",")
(
    df.explode('Actors')
      .groupby(['Userid', 'Actors']).size()
      .unstack(fill_value=0)
)
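An equivalent sketch using pd.crosstab, which does the counting and the zero-fill in one call (sample data abbreviated from the question):

```python
import pandas as pd

df = pd.DataFrame({
    'Userid': ['u1', 'u2'],
    'Actors': ['Tony Ward,Bruce LaBruce,Kevin P. Scott,Ivar Johnson, Naomi Watts, Tony Ward',
               'Tony Ward,Bruce LaBruce,Kevin P. Scott, Luke Wilson, Owen Wilson, Lumi Cavazos'],
})

# normalise the separators, split into lists, explode to one row per
# actor, then cross-tabulate user vs actor
exploded = (df.assign(Actors=df['Actors']
                        .str.replace(r',\s+', ',', regex=True)
                        .str.split(','))
              .explode('Actors'))
counts = pd.crosstab(exploded['Userid'], exploded['Actors'])
```

Actors a user never watched come out as 0 automatically, since crosstab builds the full user-by-actor grid.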
Df before:
unnamed:0 unnamed:1 unnamed:2
0 Megan 30000 Botany
1 Ann 24000 Psychology
2 John 24000 Police
3 Mary 45000 Genetics
4 Jay 60000 Data Science
I want the df to look like this:
t0 t1 t2
0 Megan 30000 Botany
1 Ann 24000 Psychology
2 John 24000 Police
3 Mary 45000 Genetics
4 Jay 60000 Data Science
I have tried to rename the unnamed columns:
testfile.columns = testfile.columns.str.replace('Unnamed.*', 't')
testfile = testfile.rename(columns=lambda x: x+'x')
This will do it from 0 up to the number of columns you have:
testfile.columns = ['t{}'.format(i) for i in range(testfile.shape[1])]
You can use this to reset the column names and add a prefix to them:
df = df.T.reset_index(drop=True).T.add_prefix('t')
t0 t1 t2
0 Megan 30000 Botany
1 Ann 24000 Psychology
2 John 24000 Police
3 Mary 45000 Genetics
4 Jay 60000 Data Science
Try rename with a string split:
df = df.rename(lambda x: 't'+x.split(':')[-1], axis=1)
Out[502]:
t0 t1 t2
0 Megan 30000 Botany
1 Ann 24000 Psychology
2 John 24000 Police
3 Mary 45000 Genetics
4 Jay 60000 Data Science
If you don't care about the digit in unnamed:X and just want the increment on t, you may use numpy.arange and np.char.add to construct them:
np.char.add('t', np.arange(df.shape[1]).astype(str))
array(['t0', 't1', 't2'], dtype='<U12')
Assign it directly to columns:
df.columns = np.char.add('t', np.arange(df.shape[1]).astype(str))
Your column numbers are already increasing; you just want t instead of unnamed: as the prefix.
df.columns = df.columns.str.replace('unnamed:', 't')
Try this:
df.rename(lambda x: x.replace('unnamed:', 't'), axis=1)
output:
t0 t1 t2
0 Megan 30000 Botany
1 Ann 24000 Psychology
2 John 24000 Police
3 Mary 45000 Genetics
4 Jay 60000 Data Science
I have two dataframes as shown below.
Company Name BOD Position Ethnicity DOB Age Gender Degree ( Specialazation) Remark
0 Big Lots Inc. David J. Campisi Director, President and Chief Executive Offic... American 1956 61 Male Graduate NaN
1 Big Lots Inc. Philip E. Mallott Chairman of the Board American 1958 59 Male MBA, Finance NaN
2 Big Lots Inc. James R. Chambers Independent Director American 1958 59 Male MBA NaN
3 Momentive Performance Materials Inc Mahesh Balakrishnan director Asian 1983 34 Male BA Economics NaN
Company Name Net Sale Gross Profit Remark
0 Big Lots Inc. 5.2B 2.1B NaN
1 Momentive Performance Materials Inc 544M 146m NaN
2 Markel Corporation 5.61B 2.06B NaN
3 Noble Energy, Inc. 3.49B 2.41B NaN
4 Leidos Holding, Inc. 7.04B 852M NaN
I want to create a new dataframe from these two, so that the 2nd dataframe gains new columns with the count of each ethnicity per company, such as American - 2, Mexican - 5, and so on, so that later on I can calculate a diversity score.
The variables in the output dataframe would look like:
Company Name Net Sale Gross Profit Remark American Mexican German .....
Big Lots Inc. 5.2B 2.1B NaN 2 0 5 ....
First get counts per group by groupby with size and unstack, then join to the second DataFrame:
df1 = pd.DataFrame({'Company Name':list('aabcac'),
'Ethnicity':['American'] * 3 + ['Mexican'] * 3})
df1 = df1.groupby(['Company Name', 'Ethnicity']).size().unstack(fill_value=0)
# slower alternative
# df1 = pd.crosstab(df1['Company Name'], df1['Ethnicity'])
print (df1)
Ethnicity American Mexican
Company Name
a 2 1
b 1 0
c 0 2
df2 = pd.DataFrame({'Company Name':list('abc')})
print (df2)
Company Name
0 a
1 b
2 c
df3 = df2.join(df1, on=['Company Name'])
print (df3)
Company Name American Mexican
0 a 2 1
1 b 1 0
2 c 0 2
EDIT: You need to replace the unit suffixes with zeros and convert to floats:
print (df)
Name sale
0 A 100M
1 B 200M
2 C 5M
3 D 40M
4 E 10B
5 F 2B
d = {'M': '0' * 6, 'B': '0' * 9}
df['a'] = df['sale'].replace(d, regex=True).astype(float)
# to actually reorder the rows by value:
# df = df.sort_values('a', ascending=False)
print (df)
Name sale a
0 A 100M 1.000000e+08
1 B 200M 2.000000e+08
2 C 5M 5.000000e+06
3 D 40M 4.000000e+07
4 E 10B 1.000000e+10
5 F 2B 2.000000e+09
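Building on the join above, here is a sketch of the later diversity-score step the asker mentions. The score definition here (number of distinct ethnicities represented per company) is my own illustrative choice, not something specified in the question:

```python
import pandas as pd

# toy stand-ins for the two frames in the question
df1 = pd.DataFrame({'Company Name': list('aabcac'),
                    'Ethnicity': ['American'] * 3 + ['Mexican'] * 3})
df2 = pd.DataFrame({'Company Name': ['a', 'b', 'c'],
                    'Net Sale': ['5.2B', '544M', '5.61B']})

# one row per company, one column per ethnicity, zero-filled counts
counts = df1.groupby(['Company Name', 'Ethnicity']).size().unstack(fill_value=0)
df3 = df2.join(counts, on='Company Name')

# hypothetical score: how many ethnicities have a nonzero count
df3['Diversity'] = (df3[counts.columns] > 0).sum(axis=1)
```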
I have two data frames
df1 =
actorID actorName
0 annie_potts Annie Potts
1 bill_farmer Bill Farmer
2 don_rickles Don Rickles
3 erik_von_detten Erik von Detten
4 greg-berg Greg Berg
df2 =
directorID directorName
0 john_lasseter John Lasseter
1 joe_johnston Joe Johnston
2 donald_petrie Donald Petrie
3 forest_whitaker Forest Whitaker
4 charles_shyer Charles Shyer
What I ideally want is a concatenation of these two dataframes, like pd.concat((df1, df2)):
actorID-directorID actorName-directorName
0 annie_potts Annie Potts
1 bill_farmer Bill Farmer
2 don_rickles Don Rickles
3 erik_von_detten Erik von Detten
4 greg-berg Greg Berg
5 john_lasseter John Lasseter
6 joe_johnston Joe Johnston
7 donald_petrie Donald Petrie
8 forest_whitaker Forest Whitaker
9 charles_shyer Charles Shyer
However, I want an easy way to specify that df1.actorName and df2.directorName should be joined together, and likewise actorID / directorID. How can I do this?
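One way to sketch this (my own suggestion, not from the thread): give both frames a shared set of column labels before concatenating. The combined labels below are an assumption based on the desired output shown:

```python
import pandas as pd

df1 = pd.DataFrame({'actorID': ['annie_potts', 'bill_farmer'],
                    'actorName': ['Annie Potts', 'Bill Farmer']})
df2 = pd.DataFrame({'directorID': ['john_lasseter', 'joe_johnston'],
                    'directorName': ['John Lasseter', 'Joe Johnston']})

# set_axis relabels each frame's columns to the shared names, so
# concat stacks the rows instead of creating four separate columns
cols = ['actorID-directorID', 'actorName-directorName']
combined = pd.concat([df1.set_axis(cols, axis=1),
                      df2.set_axis(cols, axis=1)],
                     ignore_index=True)
```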