Getting a new dataframe from an existing dataframe with conditions on multiple columns - Python

I am trying to derive a new dataframe from an existing pandas dataframe. The data looks like this:
year  state    district  Party   rank  share in votes
2010  haryana  kaithal   Winner  1     40.12
2010  haryana  kaithal   bjp     2     30.52
2010  haryana  kaithal   NOTA    3     29
2010  goa      panji     Winner  3     10
2010  goa      panji     INC     2     40
2010  goa      panji     BJP     1     50
2013  up       meerut    Winner  2     40
2013  up       meerut    SP      1     60
2015  haryana  kaithal   Winner  2     15
2015  haryana  kaithal   BJP     3     35
2015  haryana  kaithal   INC     1     50
This data covers multiple states over multiple years.
In this dataset there are multiple rows for each district, one per party. I want to calculate the margin of share for each district as described below. I have tried the following, but could not finish it: I am unable to write the code that defines the margin of share and produces a dataframe with a single margin-of-share value per district instead of the party-wise shares.
for year in df['YEAR']:
    for state in df['STATE']:
        for district in df['DISTRICT']:
            for rank in df['RANK']:
                for party in df['PARTY']:
                    if rank == 1 and party == 'WINNER':
                        # then margin of share = ...
The rule is: margin of share = share of Winner - share of the party at rank 2. If the Winner row does not have rank 1, then margin of share = share of Winner - share of the party at rank 1. For example, in 2010 haryana/kaithal the Winner holds rank 1, so the margin is 40.12 - 30.52 = 9.6.
I am basically trying to get this output:
| year | state   | district | margin of share |
|------|---------|----------|-----------------|
| 2010 | haryana | kaithal  | 9.6             |
| 2010 | goa     | panji    | -40             |
| 2013 | up      | meerut   | -20             |
| 2015 | haryana | kaithal  | -35             |
I wish to create a separate dataframe with the columns year, state, district, and margin of share.
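(For anyone who wants to run the answers below, the sample data can be reconstructed like this; the column names follow the table above:)
import pandas as pd

df = pd.DataFrame({
    'year': [2010, 2010, 2010, 2010, 2010, 2010, 2013, 2013, 2015, 2015, 2015],
    'state': ['haryana'] * 3 + ['goa'] * 3 + ['up'] * 2 + ['haryana'] * 3,
    'district': ['kaithal'] * 3 + ['panji'] * 3 + ['meerut'] * 2 + ['kaithal'] * 3,
    'Party': ['Winner', 'bjp', 'NOTA', 'Winner', 'INC', 'BJP',
              'Winner', 'SP', 'Winner', 'BJP', 'INC'],
    'rank': [1, 2, 3, 3, 2, 1, 2, 1, 2, 3, 1],
    'share in votes': [40.12, 30.52, 29, 10, 40, 50, 40, 60, 15, 35, 50],
})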

Create a MultiIndex from the first 3 columns with DataFrame.set_index, create boolean masks, filter with DataFrame.loc and subtract the values, and finally use Series.fillna to replace the values not matched by condition m3 (the districts where the Winner does not hold rank 1):
df1 = df.set_index(['year', 'state', 'district'])
m1 = df1['Party'] == 'Winner'
m2 = df1['rank'] == 2
m3 = df1['rank'] == 1

s1 = (df1.loc[m1 & m3, 'share in votes']
         .sub(df1.loc[m2, 'share in votes']))
print(s1)
year state district
2010 goa panji NaN
haryana kaithal 9.6
2013 up meerut NaN
2015 haryana kaithal NaN
Name: share in votes, dtype: float64
s2 = (df1.loc[m1, 'share in votes']
         .sub(df1.loc[m3, 'share in votes']))
print(s2)
year state district
2010 haryana kaithal 0.0
goa panji -40.0
2013 up meerut -20.0
2015 haryana kaithal -35.0
Name: share in votes, dtype: float64
df = s1.fillna(s2).reset_index()
print(df)
year state district share in votes
0 2010 goa panji -40.0
1 2010 haryana kaithal 9.6
2 2013 up meerut -20.0
3 2015 haryana kaithal -35.0
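A small addition (not part of the original answer): if you want the final column to be named 'margin of share' rather than 'share in votes', rename the Series before resetting the index:
df = s1.fillna(s2).rename('margin of share').reset_index()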

Use groupby and where with conditions:
g = df.groupby(['year', 'state', 'district'])
cond1 = df['Party'].eq('Winner')
cond2 = df['rank'].eq(1)
cond3 = df['rank'].eq(2)
df1 = g['share in votes'].agg(
    lambda x: (x.where(cond1).sum() - x.where(cond3).sum())
              if x.where(cond1 & cond2).sum() != 0
              else (x.where(cond1).sum() - x.where(cond2).sum())
).reset_index()
Result (df1):
year state district share in votes
0 2010 goa panji -40.0
1 2010 haryana kaithal 9.6
2 2013 up meerut -20.0
3 2015 haryana kaithal -35.0
If you want the same row order as df, use the following code:
df.iloc[:, :3].drop_duplicates().merge(df1)
Result:
year state district share in votes
0 2010 haryana kaithal 9.6
1 2010 goa panji -40.0
2 2013 up meerut -20.0
3 2015 haryana kaithal -35.0
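If you prefer something more explicit, here is an alternative sketch (my own, using a plain per-group function rather than either answer above) that spells out the two cases:
def margin_of_share(g):
    # locate the Winner row in this (year, state, district) group
    is_winner = g['Party'].eq('Winner')
    winner_share = g.loc[is_winner, 'share in votes'].iloc[0]
    winner_rank = g.loc[is_winner, 'rank'].iloc[0]
    # subtract the rank-2 share if the Winner holds rank 1, otherwise the rank-1 share
    other_rank = 2 if winner_rank == 1 else 1
    other_share = g.loc[g['rank'].eq(other_rank), 'share in votes'].iloc[0]
    return winner_share - other_share

out = (df.groupby(['year', 'state', 'district'], sort=False)
         .apply(margin_of_share)
         .rename('margin of share')
         .reset_index())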

Related

Combine rows containing blanks with each other's data - Python [duplicate]

I have a pandas DataFrame as below:
id age gender country sales_year
1 None M India 2016
2 23 F India 2016
1 20 M India 2015
2 25 F India 2015
3 30 M India 2019
4 36 None India 2019
I want to group by id and take the latest row as per sales_year, filling any null elements from the earlier rows.
Expected output:
id age gender country sales_year
1 20 M India 2016
2 23 F India 2016
3 30 M India 2019
4 36 None India 2019
In PySpark:
df = df.withColumn('age', f.first('age', True).over(Window.partitionBy("id").orderBy(df.sales_year.desc())))
But I need the same solution in pandas.
EDIT:
This can be the case with any of the columns, not just age. I need it to pick up the latest non-null data (where it exists for that id) for all the ids.
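(For testing, the sample frame can be rebuilt like this; I'm assuming the blanks are genuine missing values, i.e. None/NaN rather than the string 'None':)
import pandas as pd

df = pd.DataFrame({
    'id': [1, 2, 1, 2, 3, 4],
    'age': [None, 23, 20, 25, 30, 36],
    'gender': ['M', 'F', 'M', 'F', 'M', None],
    'country': ['India'] * 6,
    'sales_year': [2016, 2016, 2015, 2015, 2019, 2019],
})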
Use GroupBy.first:
df1 = df.groupby('id', as_index=False).first()
print(df1)
id age gender country sales_year
0 1 20.0 M India 2016
1 2 23.0 F India 2016
2 3 30.0 M India 2019
3 4 36.0 NaN India 2019
If column sales_year is not sorted:
df2 = df.sort_values('sales_year', ascending=False).groupby('id', as_index=False).first()
print(df2)
id age gender country sales_year
0 1 20.0 M India 2016
1 2 23.0 F India 2016
2 3 30.0 M India 2019
3 4 36.0 NaN India 2019
import numpy as np

print(df.replace('None', np.nan).groupby('id').first())
First replace the 'None' strings with NaN, then group by 'id' with groupby(), and finally take the first non-null value in each column using first().
Use:
df.dropna(subset=['age']).sort_values('sales_year', ascending=False).groupby('id')['age'].first()
Output
id
1 20
2 23
3 30
4 36
Name: age, dtype: object
Remove the ['age'] to get full rows:
df.dropna(subset=['age']).sort_values('sales_year', ascending=False).groupby('id').first()
Output
age gender country sales_year
id
1 20 M India 2015
2 23 F India 2016
3 30 M India 2019
4 36 None India 2019
You can put the id back as a column with reset_index():
df.dropna(subset=['age']).sort_values('sales_year', ascending=False).groupby('id').first().reset_index()
Output
id age gender country sales_year
0 1 20 M India 2015
1 2 23 F India 2016
2 3 30 M India 2019
3 4 36 None India 2019

pandas: group years by decade

So I have data in a CSV file. Here is my code.
data = pd.read_csv('cast.csv')
data = pd.DataFrame(data)
print(data)
The result looks like this.
title year name type \
0 Closet Monster 2015 Buffy #1 actor
1 Suuri illusioni 1985 Homo $ actor
2 Battle of the Sexes 2017 $hutter actor
3 Secret in Their Eyes 2015 $hutter actor
4 Steve Jobs 2015 $hutter actor
... ... ... ... ...
74996 Mia fora kai ena... moro 2011 Penelope Anastasopoulou actress
74997 The Magician King 2004 Tiannah Anastassiades actress
74998 Festival of Lights 2010 Zoe Anastassiou actress
74999 Toxic Tutu 2016 Zoe Anastassiou actress
75000 Fugitive Pieces 2007 Anastassia Anastassopoulou actress
character n
0 Buffy 4 31.0
1 Guests 22.0
2 Bobby Riggs Fan 10.0
3 2002 Dodger Fan NaN
4 1988 Opera House Patron NaN
... ... ...
74996 Popi voulkanizater 11.0
74997 Unicycle Race Attendant NaN
74998 Guidance Counselor 20.0
74999 Demon of Toxicity NaN
75000 Laundry Girl 25.0
[75001 rows x 6 columns]
I want to group the data by year and type, and then I want to know the size of each type in a specific year. So here is my code.
grouped = data.groupby(['year', 'type']).size()
print(grouped)
The result looks like this.
year type
1912 actor 1
actress 2
1913 actor 9
actress 1
1914 actor 38
..
2019 actress 3
2020 actor 3
actress 1
2023 actor 1
actress 2
Length: 220, dtype: int64
The problem is: what if I want to get the size data from 1910 until 2020 with the year increasing by 10 (per decade)? So the year index would be 1910, 1920, 1930, 1940, and so on until 2020.
I see two simple options.
1- round the years down to the decade:
group = data['year'] // 10 * 10  # or data['year'].round(-1), which rounds to the nearest 10 instead
grouped = data.groupby([group, 'type']).size()
2- use pandas.cut:
years = list(range(1910, 2031, 10))
group = pd.cut(data['year'], bins=years, labels=years[:-1])
grouped = data.groupby([group, 'type']).size()
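If you then want one row per decade with the types as columns, you can unstack the result (a small follow-up, not part of the original answer):
print(grouped.unstack(fill_value=0))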

Pandas Python - How to create new columns with MultiIndex from pivot table

I have created a pivot table with 2 different types of values: i) Number of Apples from 2017-2020 and ii) Number of People from 2017-2020. I want to create additional columns to calculate iii) Apples per Person from 2017-2020. How can I do so?
Current code for pivot table:
tdf = df.pivot_table(index="States",
                     columns="Year",
                     values=["Number of Apples", "Number of People"],
                     aggfunc=lambda x: len(x.unique()),
                     margins=True)
tdf
Here is my current pivot table:
Number of Apples Number of People
2017 2018 2019 2020 2017 2018 2019 2020
California 10 18 20 25 2 3 4 5
West Virginia 8 35 25 12 2 5 5 4
...
I want my pivot table to look like this, where I add additional columns to divide Number of Apples by Number of People.
Number of Apples Number of People Number of Apples per Person
2017 2018 2019 2020 2017 2018 2019 2020 2017 2018 2019 2020
California 10 18 20 25 2 3 4 5 5 6 5 5
West Virginia 8 35 25 12 2 5 5 4 4 7 5 3
I've tried a few things, such as:
Creating a new column by assigning to a new column name, which does not work with a MultiIndex on the columns: tdf["Number of Apples per Person"][2017] = tdf["Number of Apples"][2017] / tdf["Number of People"][2017]
Tried the other assignment method tdf.assign(tdf["Number of Apples per Person"][2017] = tdf["Enrollment ID"][2017] / tdf["Student ID"][2017]); got this error: SyntaxError: expression cannot contain assignment, perhaps you meant "=="?
Appreciate any help! Thanks
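(To experiment with the answers below, the sample pivot table can be rebuilt directly; the numbers are hard-coded from the table above:)
import pandas as pd

cols = pd.MultiIndex.from_product(
    [["Number of Apples", "Number of People"], [2017, 2018, 2019, 2020]])
df = pd.DataFrame([[10, 18, 20, 25, 2, 3, 4, 5],
                   [8, 35, 25, 12, 2, 5, 5, 4]],
                  index=["California", "West Virginia"], columns=cols)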
What you can do here is stack(), do your thing, and then unstack():
s = df.stack()
s['Number of Apples per Person'] = s['Number of Apples'] / s['Number of People']
df = s.unstack()
Output:
>>> df
Number of Apples Number of People Number of Apples per Person
2017 2018 2019 2020 2017 2018 2019 2020 2017 2018 2019 2020
California 10 18 20 25 2 3 4 5 5.0 6.0 5.0 5.0
West Virginia 8 35 25 12 2 5 5 4 4.0 7.0 5.0 3.0
One-liner:
df = (df.stack()
        .pipe(lambda x: x.assign(**{'Number of Apples per Person':
                                    x['Number of Apples'] / x['Number of People']}))
        .unstack())
Given
df
Number of Apples Number of People
2017 2018 2019 2020 2017 2018 2019 2020
California 10 18 20 25 2 3 4 5
West Virginia 8 35 25 12 2 5 5 4
You can index on the first level to get sub-frames and then divide. The division will be auto-aligned on the columns.
df['Number of Apples'] / df['Number of People']
2017 2018 2019 2020
California 5.0 6.0 5.0 5.0
West Virginia 4.0 7.0 5.0 3.0
Append this back to your DataFrame:
pd.concat([df,
           pd.concat([df['Number of Apples'] / df['Number of People']],
                     keys=['Result'], axis=1)],
          axis=1)
Number of Apples Number of People Result
2017 2018 2019 2020 2017 2018 2019 2020 2017 2018 2019 2020
California 10 18 20 25 2 3 4 5 5.0 6.0 5.0 5.0
West Virginia 8 35 25 12 2 5 5 4 4.0 7.0 5.0 3.0
This is fast since it is completely vectorized.
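If you would rather have the new top-level label read 'Number of Apples per Person' than 'Result', pass that string as the key:
pd.concat([df,
           pd.concat([df['Number of Apples'] / df['Number of People']],
                     keys=['Number of Apples per Person'], axis=1)],
          axis=1)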

Wide to Long data frame returning NaN instead of float values

I have a large data frame that looks like this:
Country 2010 2011 2012 2013
0 Germany 4.625e+10 4.814e+10 4.625e+10 4.593e+10
1 France 6.178e+10 6.460e+10 6.003e+10 6.241e+10
2 Italy 4.625e+10 4.625e+10 4.625e+10 4.625e+10
I want to reshape the data so that the Country, Years, and Values are all columns. I used the melt method:
dftotal = pd.melt(dftotal, id_vars='Country',
                  value_vars=[2010, 2011, 2012, 2013, 2014, 2015, 2016, 2017],
                  var_name='Year', value_name='Total')
I was able to attain:
Country Year Total
0 Germany 2010 NaN
1 France 2010 NaN
2 Italy 2010 NaN
My issue is that the float values turn into NaN and I don't know how to reshape the data frame to keep the values as floats.
Omit the value_vars argument and it works:
pd.melt(dftotal, id_vars='Country', var_name ='Year', value_name='Total')
Country Year Total
0 Germany 2010 4.625000e+10
1 France 2010 6.178000e+10
2 Italy 2010 4.625000e+10
3 Germany 2011 4.814000e+10
4 France 2011 6.460000e+10
5 Italy 2011 4.625000e+10
6 Germany 2012 4.625000e+10
7 France 2012 6.003000e+10
8 Italy 2012 4.625000e+10
9 Germany 2013 4.593000e+10
10 France 2013 6.241000e+10
11 Italy 2013 4.625000e+10
The problem is probably that your column names are not ints but strings, so you could do:
dftotal = pd.melt(dftotal, id_vars='Country',
                  value_vars=['2010', '2011', '2012', '2013', '2014', '2015', '2016', '2017'],
                  var_name='Year', value_name='Total')
And it would also work.
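To confirm which case you are in, inspect the column labels; if they are strings you can cast them to ints so the original call works unchanged (a quick sketch, assuming the frame is named dftotal as above):
print(dftotal.columns.tolist())  # e.g. ['Country', '2010', '2011', ...] means string labels
dftotal.columns = ['Country'] + [int(c) for c in dftotal.columns[1:]]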
Alternatively, using stack:
dftotal = (dftotal.set_index('Country').stack()
                  .reset_index()
                  .rename(columns={'level_1': 'Year', 0: 'Total'})
                  .sort_values('Year'))
This will get you the same output (but less succinctly).
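Either way, if the melted Year column comes out as strings, you can cast it at the end:
dftotal['Year'] = dftotal['Year'].astype(int)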

Adding columns of different length into pandas dataframe

I have a dataframe detailing money awarded to people over several years:
Name -- Money -- Year
Paul 57.00 2012
Susan 67.00 2012
Gary 54.00 2011
Paul 77.00 2011
Andrea 20.00 2011
Albert 23.00 2011
Hal 26.00 2010
Paul 23.00 2010
From this dataframe, I want to construct a dataframe that details all the money awarded in a single year, for making a boxplot:
2012 -- 2011 -- 2010
57.00 54.00 26.00
67.00 77.00 23.00
20.00
23.00
So you see this results in columns of different lengths. When I try to do this using pandas, I get the error 'ValueError: Length of values does not match length of index'. I assume this is because I can't add columns of varying length to a dataframe.
Can anyone offer some advice on how to proceed? Perhaps I'm approaching this incorrectly? Thanks for any help!
I'd do this in a two-step process: first add a column giving each row's position within its year using cumcount, and then pivot so that the new column is the index, the years become the columns, and the money column becomes the values:
df["yindex"] = df.groupby("Year").cumcount()
new_df = df.pivot(index="yindex", columns="Year", values="Money")
For example:
>>> df = pd.read_csv("money.txt", sep="\s+")
>>> df
Name Money Year
0 Paul 57 2012
1 Susan 67 2012
2 Gary 54 2011
3 Paul 77 2011
4 Andrea 20 2011
5 Albert 23 2011
6 Hal 26 2010
7 Paul 23 2010
>>> df["yindex"] = df.groupby("Year").cumcount()
>>> df
Name Money Year yindex
0 Paul 57 2012 0
1 Susan 67 2012 1
2 Gary 54 2011 0
3 Paul 77 2011 1
4 Andrea 20 2011 2
5 Albert 23 2011 3
6 Hal 26 2010 0
7 Paul 23 2010 1
>>> df.pivot(index="yindex", columns="Year", values="Money")
Year 2010 2011 2012
yindex
0 26 54 57
1 23 77 67
2 NaN 20 NaN
3 NaN 23 NaN
After which you could get rid of the NaNs if you like, but it depends on whether you want to distinguish between cases like "knowing the value is 0" and "not knowing what the value is":
>>> df.pivot(index="yindex", columns="Year", values="Money").fillna(0)
Year 2010 2011 2012
yindex
0 26 54 57
1 23 77 67
2 0 20 0
3 0 23 0
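Since the end goal was a boxplot: pandas' boxplot drops missing values column by column, so the NaNs can usually stay as they are (a sketch, assuming matplotlib is installed):
import matplotlib.pyplot as plt

new_df = df.pivot(index="yindex", columns="Year", values="Money")
new_df.boxplot()  # one box per year; NaN entries are ignored
plt.show()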
