creating multi-index from dataframe - python

Geography Age group 2016
0 Toronto All 1525
1 Toronto 1~7 5
2 Toronto 7~20 7
3 Toronto 20~40 500
4 Vancouver All 3000
5 Vancouver 1~7 10
6 Vancouver 7~20 565
7 Vancouver 20~40 564
.
.
.
NOTE: This is just an example; my dataframe contains different numbers.
I want to create a MultiIndex where the first level is Geography and the second is Age group.
Also, is it possible to group by without applying any aggregation function at the end?
Output should be:
Geography Age group 2016
0 Toronto All 1525
1 1~7 5
2 7~20 7
3 20~40 500
4 Vancouver All 3000
5 1~7 10
6 7~20 565
7 20~40 564
.
.

In order to create a MultiIndex as specified, you can simply use DataFrame.set_index():
df.set_index(['Geography', 'Age group'])
2016
Geography Age group
Toronto All 1525
1~7 5
7~20 7
20~40 500
Vancouver All 3000
1~7 10
7~20 565
20~40 564
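To make this reproducible, here is a minimal, self-contained sketch using the question's sample numbers. Note that set_index only relabels the rows and performs no aggregation, which also answers the groupby question: no groupby is needed here at all.

```python
import pandas as pd

# Sample data from the question
df = pd.DataFrame({
    'Geography': ['Toronto'] * 4 + ['Vancouver'] * 4,
    'Age group': ['All', '1~7', '7~20', '20~40'] * 2,
    '2016': [1525, 5, 7, 500, 3000, 10, 565, 564],
})

# Build the two-level index directly; no aggregation happens here
out = df.set_index(['Geography', 'Age group'])
print(out)
```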

Related

I have a Position column in pandas containing values from 1 to 15 and a votes-percentage column (Votes_per) containing values from 0 to 100.

I need to find the data where the votes-percentage difference between position 1 and position 2 is less than 10 in pandas; how do I do that?
I have written this code and found all the position 1 and 2 rows:
df[(df['Position']==1) | (df['Position']==2)]
The data looks like this
Election_Year Position Name Votes Votes_per Party AC_name AC_No
0 2010 1 Rajesh Singh 42289 29.4 Janata Dal (United) Valmiki Nagar 1
1 2010 2 Mukesh Kumar 27618 19.2 Rashtriya Janata Dal Valmiki Nagar 1
14 2010 1 Bhagirathi Devi 51993 41.5 Bharatiya Janta Party Ramnagar 2
15 2010 2 Naresh Ram 22211 17.7 Indian National Congress Ramnagar 2
31 2010 1 Satish Chandra 45022 38.1 Bharatiya Janta Party Narkatiaganj 3
I can write this and find the answer for two rows:
a.Votes_per[0]-a.Votes_per[1]
Now how do I find this for all the rows?
This should work, provided each constituency contributes exactly one position-1 and one position-2 row in the same order. The two slices must be re-indexed first, otherwise the subtraction aligns on the original row labels and produces NaN:
p1 = df[df['Position'] == 1].reset_index(drop=True)
p2 = df[df['Position'] == 2].reset_index(drop=True)
p1[(p1['Votes_per'] - p2['Votes_per']) < 10]
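An alternative sketch that sidesteps index alignment entirely: pivot each constituency's position-1 and position-2 vote shares into side-by-side columns and compare them. The sample values below are hypothetical, modeled on the question's data:

```python
import pandas as pd

# Hypothetical data: two positions per constituency (AC_No)
df = pd.DataFrame({
    'Position':  [1, 2, 1, 2, 1, 2],
    'Votes_per': [29.4, 19.2, 41.5, 17.7, 38.1, 30.0],
    'AC_No':     [1, 1, 2, 2, 3, 3],
})

# One row per constituency; position-1 and position-2 shares become columns
wide = df.pivot(index='AC_No', columns='Position', values='Votes_per')

# Constituencies where the winner's margin is under 10 percentage points
close_races = wide[(wide[1] - wide[2]) < 10]
print(close_races)
```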

Get values of a column having more than a certain value for given number of times

I have a dataframe df as:
Election Year Votes Votes % Party Region
0 2000 42289 29.40 Janata Dal (United) A
1 2000 27618 19.20 Rashtriya Janata Dal B
2 2000 20886 14.50 Bahujan Samaj Party C
3 2000 17747 12.40 Congress D
4 2000 14047 9.80 Independent E
5 2005 8358 5.80 Janvadi Party A
6 2005 4428 13.10 Independent B
7 2005 1647 1.20 Independent C
8 2005 1610 11.10 Independent D
9 2005 1334 15.06 Nationalist Party E
10 2010 1114 0.80 Independent A
11 2010 1042 10.5 Bharatiya Janta Dal B
12 2010 835 0.60 Independent C
13 2010 14305 15.50 Independent D
14 2010 22211 17.70 Congress E
I need to find the "Regions" in which 3 or more parties got greater than 10% of the vote share in each "Election Year".
I have sorted by Election Year ascending and Votes % descending:
df1 = df.sort_values(['Election Year','Votes %'], ascending = (True, False))
Then I have taken the top 3 of each region:
top_3 = df1.groupby(['Election Year', 'Region']).head(3).reset_index()
Now how do I check whether those top 3 have 10% or more of the votes in each year?
Are you looking for something like this?
def election(df):
    count = df['Votes %'].gt(10).sum()
    regions = ','.join(df['Region'].where(df['Votes %'].gt(10), 'None').tolist())
    return pd.Series({'count': count, 'regions': regions})

ndf = df.groupby(['Election Year', 'Party']).apply(election)
ndf = ndf.replace(['None,', 'None'], '', regex=True)
or
mask = df['Votes %'].gt(10)
df.loc[mask.groupby(df['Election Year']).transform('sum').ge(3) & mask].groupby('Election Year')['Region'].agg(','.join)
(Note .ge(3) rather than .gt(3), since the requirement is "3 parties or more".)
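The transform-based approach can be checked end to end against the question's sample data. A runnable sketch, reading "3 parties or more" as a count of at least 3 (.ge(3)):

```python
import pandas as pd

# Sample data from the question, reduced to the relevant columns
df = pd.DataFrame({
    'Election Year': [2000] * 5 + [2005] * 5 + [2010] * 5,
    'Votes %': [29.4, 19.2, 14.5, 12.4, 9.8,
                5.8, 13.1, 1.2, 11.1, 15.06,
                0.8, 10.5, 0.6, 15.5, 17.7],
    'Region': list('ABCDE') * 3,
})

over_10 = df['Votes %'].gt(10)
# Keep rows over 10% in years where at least 3 rows are over 10%,
# then collect the matching regions per year
result = (df.loc[over_10.groupby(df['Election Year']).transform('sum').ge(3) & over_10]
            .groupby('Election Year')['Region']
            .agg(','.join))
print(result)
```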

Joining pandas dataframes by column headers

I have two data frames (Actual and Targets) with the following headers:
print (df1)
WorkWeek Area Actual
0 202001 South 5
1 202001 North 5
2 202001 West 6
3 202001 East 8
4 202002 South 7
5 202002 North 9
6 202002 West 6
7 202002 East 3
8 202003 South 5
9 202003 North 85
10 202003 West 5
11 202003 East 11
12 202004 South 2
13 202004 North 2
14 202004 West 2
15 202004 East 2
print (df2)
WorkWeek South North West East
0 202001 60 90 70 80
1 202002 60 90 70 80
2 202003 60 90 70 80
3 202004 60 90 70 80
I want a joined dataframe (Actual_vs_Targets) keyed by WorkWeek and Area.
In case I want to add more areas, how should I proceed?
Thank you!
Use DataFrame.melt with DataFrame.merge:
df22 = df2.melt('WorkWeek', var_name='Area', value_name='Target')
df = df1.merge(df22, on=['WorkWeek','Area'], how='left')
Or DataFrame.stack with DataFrame.join:
df22 = df2.set_index('WorkWeek').stack().rename_axis(['WorkWeek','Area']).rename('Target')
df = df1.join(df22, on=['WorkWeek','Area'])
print (df)
WorkWeek Area Actual Target
0 202001 South 5 60
1 202001 North 5 90
2 202001 West 6 70
3 202001 East 8 80
4 202002 South 7 60
5 202002 North 9 90
6 202002 West 6 70
7 202002 East 3 80
8 202003 South 5 60
9 202003 North 85 90
10 202003 West 5 70
11 202003 East 11 80
12 202004 South 2 60
13 202004 North 2 90
14 202004 West 2 70
15 202004 East 2 80
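Regarding the follow-up about adding more areas: melt turns every non-id column into rows, so a new area column in df2 is picked up with no code change. A small sketch with a hypothetical extra 'Central' area:

```python
import pandas as pd

df1 = pd.DataFrame({
    'WorkWeek': [202001, 202001],
    'Area': ['South', 'Central'],
    'Actual': [5, 4],
})
# Targets frame with an extra 'Central' column added
df2 = pd.DataFrame({
    'WorkWeek': [202001],
    'South': [60], 'North': [90], 'West': [70], 'East': [80], 'Central': [75],
})

# melt reshapes every non-id column into (Area, Target) rows automatically
df22 = df2.melt('WorkWeek', var_name='Area', value_name='Target')
df = df1.merge(df22, on=['WorkWeek', 'Area'], how='left')
print(df)
```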

Comparing Two Data Frames in python

I have two data frames. I have to compare them and get the positions of the unmatched data using Python.
Note:
The first column will not always be unique.
Data Frame 1:
0 1 2 3 4
0 1 Dhoni 24 Kota 60000.0
1 2 Raina 90 Delhi 41500.0
2 3 Kholi 67 Ahmedabad 20000.0
3 4 Ashwin 45 Bhopal 8500.0
4 5 Watson 64 Mumbai 6500.0
5 6 KL Rahul 19 Indore 4500.0
6 7 Hardik 24 Bengaluru 1000.0
Data Frame 2
0 1 2 3 4
0 3 Kholi 67 Ahmedabad 20000.0
1 7 Hardik 24 Bengaluru 1000.0
2 4 Ashwin 45 Bhopal 8500.0
3 2 Raina 90 Delhi 41500.0
4 6 KL Rahul 19 Chennai 4500.0
5 1 Dhoni 24 Kota 60000.0
6 5 Watson 64 Mumbai 6500.0
I expect the output of (3,5)-(Indore - Chennai).
df1=pd.DataFrame({'A':['Dhoni','Raina','KL Rahul'],'B':[24,90,67],'C':['Kota','Delhi','Indore'],'D':[6000.0,41500.0,4500.0]})
df2=pd.DataFrame({'A':['Dhoni','Raina','KL Rahul'],'B':[24,90,67],'C':['Kota','Delhi','Chennai'],'D':[6000.0,41500.0,4500.0]})
df1['df']='df1'
df2['df']='df2'
df=pd.concat([df1,df2],sort=False).drop_duplicates(subset=['A','B','C','D'],keep=False)
print(df)
A B C D df
2 KL Rahul 67 Indore 4500.0 df1
2 KL Rahul 67 Chennai 4500.0 df2
I have added a df column to show which dataframe each unmatched row comes from.
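If the goal is the (row, column) position of each mismatch, a sketch using numpy.where on an element-wise comparison may help. This assumes the two frames have already been sorted into the same row order (e.g. by the first column), as in the answer's sample frames:

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame({'A': ['Dhoni', 'Raina', 'KL Rahul'],
                    'B': [24, 90, 67],
                    'C': ['Kota', 'Delhi', 'Indore'],
                    'D': [6000.0, 41500.0, 4500.0]})
df2 = pd.DataFrame({'A': ['Dhoni', 'Raina', 'KL Rahul'],
                    'B': [24, 90, 67],
                    'C': ['Kota', 'Delhi', 'Chennai'],
                    'D': [6000.0, 41500.0, 4500.0]})

# Boolean mask of cell-level mismatches, then their (row, column) positions
rows, cols = np.where(df1.ne(df2))
for r, c in zip(rows, cols):
    print((r, c), df1.iat[r, c], '-', df2.iat[r, c])
```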

How can I swap the values of a selective column randomly?

Suppose I have a large data set (in CSV format) like the following:
Country Age Salary Purchased
0 France 44 72000 No
1 Spain 27 48000 Yes
2 Germany 30 54000 No
3 Spain 38 61000 No
4 Germany 40 45000 Yes
5 France 35 58000 Yes
6 Spain 75 52000 No
7 France 48 79000 Yes
8 Germany 50 83000 No
9 France 37 67000 Yes
Now how can I swap all the values of a selected column randomly? For example,
I want to shuffle all the values of the 'Country' column.
Looking forward to your suggestions. Thanks in advance!
Shuffle in place using np.random.shuffle:
# <= 0.23
# np.random.shuffle(df['Country'].values)
# 0.24+
np.random.shuffle(df['Country'].to_numpy())
Or, assign back with np.random.choice (with replace=False this draws a full permutation):
df['Country'] = np.random.choice(df['Country'], len(df), replace=False)
permutation
np.random.seed([3, 1415])
df.assign(Country=df.Country.to_numpy()[np.random.permutation(len(df))])
Country Age Salary Purchased
0 France 44 72000 No
1 Germany 27 48000 Yes
2 France 30 54000 No
3 Spain 38 61000 No
4 France 40 45000 Yes
5 Spain 35 58000 Yes
6 Germany 75 52000 No
7 Spain 48 79000 Yes
8 Germany 50 83000 No
9 France 37 67000 Yes
sample
df.assign(Country=df.Country.sample(frac=1).to_numpy())
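Whichever variant is used, a quick sanity check is that shuffling only moves values around, so the column's value counts must be unchanged. A sketch using the sample-based approach on a cut-down version of the question's data:

```python
import pandas as pd

df = pd.DataFrame({'Country': ['France', 'Spain', 'Germany', 'Spain',
                               'Germany', 'France', 'Spain', 'France',
                               'Germany', 'France'],
                   'Age': [44, 27, 30, 38, 40, 35, 75, 48, 50, 37]})

# sample(frac=1) returns all rows in random order; to_numpy() drops the
# shuffled index so the values are assigned back positionally
df['Country'] = df['Country'].sample(frac=1).to_numpy()

# Only positions move; the multiset of countries is preserved
print(df['Country'].value_counts().sort_index())
```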
