pandas: group years by decade - python

So I have data in a CSV file. Here is my code.
import pandas as pd

data = pd.read_csv('cast.csv')  # read_csv already returns a DataFrame, so no extra pd.DataFrame(...) wrapping is needed
print(data)
The result looks like this.
title year name type \
0 Closet Monster 2015 Buffy #1 actor
1 Suuri illusioni 1985 Homo $ actor
2 Battle of the Sexes 2017 $hutter actor
3 Secret in Their Eyes 2015 $hutter actor
4 Steve Jobs 2015 $hutter actor
... ... ... ... ...
74996 Mia fora kai ena... moro 2011 Penelope Anastasopoulou actress
74997 The Magician King 2004 Tiannah Anastassiades actress
74998 Festival of Lights 2010 Zoe Anastassiou actress
74999 Toxic Tutu 2016 Zoe Anastassiou actress
75000 Fugitive Pieces 2007 Anastassia Anastassopoulou actress
character n
0 Buffy 4 31.0
1 Guests 22.0
2 Bobby Riggs Fan 10.0
3 2002 Dodger Fan NaN
4 1988 Opera House Patron NaN
... ... ...
74996 Popi voulkanizater 11.0
74997 Unicycle Race Attendant NaN
74998 Guidance Counselor 20.0
74999 Demon of Toxicity NaN
75000 Laundry Girl 25.0
[75001 rows x 6 columns]
I want to group the data by year and type. Then I want to know the size of each type in a specific year. So here is my code.
grouped = data.groupby(['year', 'type']).size()
print(grouped)
The result looks like this.
year type
1912 actor 1
actress 2
1913 actor 9
actress 1
1914 actor 38
..
2019 actress 3
2020 actor 3
actress 1
2023 actor 1
actress 2
Length: 220, dtype: int64
The problem is: what if I want to get the size data from 1910 until 2020, with the year increasing by 10 (per decade)? So the year index would be 1910, 1920, 1930, 1940, and so on until 2020.

I see two simple options.
1- round the years down to the lower 10:
group = data['year'] // 10 * 10  # note: data['year'].round(-1) is close, but rounds to the *nearest* 10
grouped = data.groupby([group, 'type']).size()
2- use pandas.cut:
years = list(range(1910, 2031, 10))
group = pd.cut(data['year'], bins=years, labels=years[:-1], right=False)  # right=False makes each bin [decade, decade + 10)
grouped = data.groupby([group, 'type']).size()
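For reference, here is a minimal end-to-end sketch of option 1, assuming cast.csv has the columns shown in the question (the commented output is only illustrative):
import pandas as pd

data = pd.read_csv('cast.csv')

# floor each year to its decade, then count rows per (decade, type)
decade = data['year'] // 10 * 10
grouped = data.groupby([decade, 'type']).size()
print(grouped)
# year  type
# 1910  actor      ...
#       actress    ...
# 1920  actor      ...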

Related

getting new dataframe from existing in dataframe with conditions on multiple columns

I am trying to sort a pandas dataframe. The data looks like this:
year  state    district  Party   rank  share in votes
2010  haryana  kaithal   Winner  1     40.12
2010  haryana  kaithal   bjp     2     30.52
2010  haryana  kaithal   NOTA    3     29
2010  goa      panji     Winner  3     10
2010  goa      panji     INC     2     40
2010  goa      panji     BJP     1     50
2013  up       meerut    Winner  2     40
2013  up       meerut    SP      1     60
2015  haryana  kaithal   Winner  2     15
2015  haryana  kaithal   BJP     3     35
2015  haryana  kaithal   INC     1     50
This data is for multiple states for multiple years.
In this dataset, there are multiple rows for each district. I want to calculate the margin of share for each district in the manner below. I have tried this, but I am not able to write it fully: I cannot work out how to define the margin of share and get a dataframe with only one (margin of share) value per district instead of party-wise shares.
for year in df['YEAR']:
    for state in df['STATE']:
        for district in df['DISTRICT']:
            for rank in df['RANK']:
                for party in df['PARTY']:
                    if rank == 1 and party == 'WINNER':
                        # then margin of share = share of Winner - share of party at rank 2.
                        # If the Winner does not have rank 1, then
                        # margin of share = share of Winner - share of party at rank 1.
I am basically trying to get this output:
| year | state   | district | margin of share |
|------|---------|----------|-----------------|
| 2010 | haryana | kaithal  | 9.6             |
| 2010 | goa     | panji    | -40             |
| 2013 | up      | meerut   | -20             |
| 2015 | haryana | kaithal  | -35             |
I wish to create a different data frame with the columns year, state, district and margin of share.
Create a MultiIndex from the first 3 columns with DataFrame.set_index, create boolean masks, filter with DataFrame.loc and subtract the values, and finally use Series.fillna to fill in the rows not matched by condition m3:
df1 = df.set_index(['year', 'state', 'district'])
m1 = df1.Party == 'Winner'
m2 = df1['rank'] == 2
m3 = df1['rank'] == 1

s1 = (df1.loc[m1 & m3, 'share in votes']
         .sub(df1.loc[m2, 'share in votes']))
print(s1)
year state district
2010 goa panji NaN
haryana kaithal 9.6
2013 up meerut NaN
2015 haryana kaithal NaN
Name: share in votes, dtype: float64
s2 = (df1.loc[m1, 'share in votes']
         .sub(df1.loc[m3, 'share in votes']))
print(s2)
year state district
2010 haryana kaithal 0.0
goa panji -40.0
2013 up meerut -20.0
2015 haryana kaithal -35.0
Name: share in votes, dtype: float64
df = s1.fillna(s2).reset_index()
print(df)
year state district share in votes
0 2010 goa panji -40.0
1 2010 haryana kaithal 9.6
2 2013 up meerut -20.0
3 2015 haryana kaithal -35.0
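For reference, the sample frame used above can be reconstructed from the question's table like this (a sketch, purely for reproducibility):
import pandas as pd

df = pd.DataFrame({
    'year':     [2010]*6 + [2013]*2 + [2015]*3,
    'state':    ['haryana']*3 + ['goa']*3 + ['up']*2 + ['haryana']*3,
    'district': ['kaithal']*3 + ['panji']*3 + ['meerut']*2 + ['kaithal']*3,
    'Party':    ['Winner', 'bjp', 'NOTA', 'Winner', 'INC', 'BJP',
                 'Winner', 'SP', 'Winner', 'BJP', 'INC'],
    'rank':     [1, 2, 3, 3, 2, 1, 2, 1, 2, 3, 1],
    'share in votes': [40.12, 30.52, 29, 10, 40, 50, 40, 60, 15, 35, 50],
})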
Use groupby and where with the conditions:
g = df.groupby(['year', 'state', 'district'])
cond1 = df['Party'].eq('Winner')
cond2 = df['rank'].eq(1)
cond3 = df['rank'].eq(2)
df1 = (g['share in votes']
       .agg(lambda x: (x.where(cond1).sum() - x.where(cond3).sum())
                      if x.where(cond1 & cond2).sum() != 0
                      else (x.where(cond1).sum() - x.where(cond2).sum()))
       .reset_index())
Result (df1):
year state district share in votes
0 2010 goa panji -40.0
1 2010 haryana kaithal 9.6
2 2013 up meerut -20.0
3 2015 haryana kaithal -35.0
If you want the same row order as df, use the following code:
df.iloc[:, :3].drop_duplicates().merge(df1)
result:
year state district share in votes
0 2010 haryana kaithal 9.6
1 2010 goa panji -40.0
2 2013 up meerut -20.0
3 2015 haryana kaithal -35.0
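As an aside (not from the original answers), the same logic can also be written as a plain per-group function, which some may find easier to read; margin is a hypothetical helper name and df is the original frame:
def margin(g):
    winner = g['Party'].eq('Winner')
    winner_share = g.loc[winner, 'share in votes'].iloc[0]
    winner_rank = g.loc[winner, 'rank'].iloc[0]
    # compare with rank 2 if the Winner holds rank 1, otherwise with rank 1
    other = 2 if winner_rank == 1 else 1
    return winner_share - g.loc[g['rank'].eq(other), 'share in votes'].iloc[0]

out = (df.groupby(['year', 'state', 'district'])
         .apply(margin)
         .reset_index(name='margin of share'))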

Is it possible to conditionally combine data frame rows using pandas in python3?

I have the following data frame.
Names Counts Year
0 Jordan 1043 2000
1 Steve 204 2000
2 Brock 3 2000
3 Steve 33 2000
4 Mike 88 2000
... ... ... ...
20001 Bryce 2 2015
20002 Steve 11 2015
20003 Penny 24 2015
20004 Steve 15 2015
20005 Ryan 5 2015
I want to output the information about the name "Steve" over all years. The output should combine the "Counts" for the name "Steve" if the name appears multiple times within the same year.
Example output might look like:
Names Counts Year
0 Steve 237 2000
1 Steve 400 2001
2 Steve 35 2002
... ... ... ...
15 Steve 26 2015
Do you want something like this?
# first, make sure the relevant columns are numeric
cols = ['Counts', 'Year']
df[cols] = df[cols].astype('int32')

df = df[df['Names'] == 'Steve']
df = df.groupby('Year')['Counts'].sum().reset_index()
Filter the records for Steve, then group by Year, and finally calculate the aggregates, i.e. first for Names and sum for Counts:
(df[df['Names'].eq('Steve')]
   .groupby('Year')
   .agg({'Names': 'first', 'Counts': 'sum'})
   .reset_index())
Year Names Counts
0 2000 Steve 237
1 2015 Steve 26
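Equivalently, named aggregation (available since pandas 0.25) avoids the dict and keeps Year as a column; a sketch:
(df[df['Names'].eq('Steve')]
   .groupby('Year', as_index=False)
   .agg(Names=('Names', 'first'), Counts=('Counts', 'sum')))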

Wide to Long data frame returning NaN instead of float values

I have a large data frame that looks like this:
Country 2010 2011 2012 2013
0 Germany 4.625e+10 4.814e+10 4.625e+10 4.593e+10
1 France 6.178e+10 6.460e+10 6.003e+10 6.241e+10
2 Italy 4.625e+10 4.625e+10 4.625e+10 4.625e+10
I want to reshape the data so that the Country, Years, and Values are all columns. I used the melt method:
dftotal = pd.melt(dftotal, id_vars='Country',
                  value_vars=[2010, 2011, 2012, 2013, 2014, 2015, 2016, 2017],
                  var_name='Year', value_name='Total')
I was able to attain:
Country Year Total
0 Germany 2010 NaN
1 France 2010 NaN
2 Italy 2010 NaN
My issue is that the float values turn into NaN, and I don't know how to reshape the data frame so that the values stay floats.
Omit the value_vars argument and it works:
pd.melt(dftotal, id_vars='Country', var_name ='Year', value_name='Total')
Country Year Total
0 Germany 2010 4.625000e+10
1 France 2010 6.178000e+10
2 Italy 2010 4.625000e+10
3 Germany 2011 4.814000e+10
4 France 2011 6.460000e+10
5 Italy 2011 4.625000e+10
6 Germany 2012 4.625000e+10
7 France 2012 6.003000e+10
8 Italy 2012 4.625000e+10
9 Germany 2013 4.593000e+10
10 France 2013 6.241000e+10
11 Italy 2013 4.625000e+10
The problem is probably that your column names are not ints but strings, so you could do:
dftotal = pd.melt(dftotal, id_vars='Country',
                  value_vars=['2010', '2011', '2012', '2013',
                              '2014', '2015', '2016', '2017'],
                  var_name='Year', value_name='Total')
And it would also work.
Alternatively, using stack:
dftotal = (dftotal.set_index('Country').stack()
                  .reset_index()
                  .rename(columns={'level_1': 'Year', 0: 'Total'})
                  .sort_values('Year'))
This will get you the same output (but less succinctly).
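Either way, the resulting Year column holds the original column labels (strings, in the likely case above); if you need it numeric downstream, a one-line sketch:
dftotal['Year'] = dftotal['Year'].astype(int)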

Fuzzy Matching Two Columns in the Same Dataframe Using Python

I have two datasets within the same data frame, each showing a list of companies. One dataset is from 2017 and the other is from this year. I am trying to match the two company datasets to each other and figured fuzzy matching (FuzzyWuzzy) was the best way to do this. Using a partial ratio, I simply want columns holding: last year's company name, the highest fuzzy-matching ratio, and this year's company associated with that highest score. The original data frame has been given the variable "data", with last year's company names under the column "Company" and this year's company names under the column "Company name". To accomplish this task, I tried to create a function with the extractOne fuzzy-matching process and then apply that function to each value/row in the dataframe. I would then add the results to my original data frame.
Here is the code below:
names_array = []
ratio_array = []

def match_names(last_year, this_year):
    for row in last_year:
        x = process.extractOne(row, this_year)
        names_array.append(x[0])
        ratio_array.append(x[1])
    return names_array, ratio_array
# last year's company names dataset
last_year = data['Company'].dropna().values
# this year's company names dataset
this_year = data['Company name'].values

name_match, ratio_match = match_names(last_year, this_year)
data['this_year'] = pd.Series(name_match)
data['match_rating'] = pd.Series(ratio_match)
data.to_csv("test.csv")
However, every time I execute this part of the code, the two added columns do not show up in the csv. In fact, "test.csv" is just the same data frame as before, despite the computer showing it as recently created. If anyone could point out the problem or help me out in any way, it would be truly appreciated.
Edit (data frame preview):
Company Company name
0 BODYPHLO SPORTIQUE NaN
1 JOSEPH A PERRY NaN
2 PCH RESORT TENNIS SHOP NaN
3 GREYSTONE GOLF CLUB INC. NaN
4 MUSGROVE COUNTRY CLUB NaN
5 CITY OF PELHAM RACQUET CLUB NaN
6 NORTHRIVER YACHT CLUB NaN
7 LAKE FOREST NaN
8 TNL TENNIS PRO SHOP NaN
9 SOUTHERN ATHLETIC CLUB NaN
10 ORANGE BEACH TENNIS CENTER NaN
Then, after the Company entries (last year's company data) end, the "Company name" column (this year's company data) begins like so:
4168 NaN LEWIS TENNIS
4169 NaN CHUCKS PRO SHOP AT
4170 NaN CHUCK KINYON
4171 NaN LAKE COUNTRY RACQUET CLUB
4172 NaN SPORTS ACADEMY & RAC CLUB
Your dataframe structure is odd considering that one column only begins once the other ends; however, we can make it work. Let's take the following sample dataframe for the data that you supplied:
Company Company name
0 BODYPHLO SPORTIQUE NaN
1 JOSEPH A PERRY NaN
2 PCH RESORT TENNIS SHOP NaN
3 GREYSTONE GOLF CLUB INC. NaN
4 MUSGROVE COUNTRY CLUB NaN
5 CITY OF PELHAM RACQUET CLUB NaN
6 NORTHRIVER YACHT CLUB NaN
7 LAKE FOREST NaN
8 TNL TENNIS PRO SHOP NaN
9 SOUTHERN ATHLETIC CLUB NaN
10 ORANGE BEACH TENNIS CENTER NaN
11 NaN LEWIS TENNIS
12 NaN CHUCKS PRO SHOP AT
13 NaN CHUCK KINYON
14 NaN LAKE COUNTRY RACQUET CLUB
15 NaN SPORTS ACADEMY & RAC CLUB
Then perform your matching:
import pandas as pd
from fuzzywuzzy import process, fuzz

known_list = data['Company name'].dropna()

def find_match(x):
    match = process.extractOne(x['Company'], known_list,
                               scorer=fuzz.partial_token_sort_ratio)
    return pd.Series([match[0], match[1]])

data[['this year', 'match_rating']] = (data.dropna(subset=['Company'])
                                           .apply(find_match, axis=1, result_type='expand'))
Yields:
Company Company name this year \
0 BODYPHLO SPORTIQUE NaN SPORTS ACADEMY & RAC CLUB
1 JOSEPH A PERRY NaN CHUCKS PRO SHOP AT
2 PCH RESORT TENNIS SHOP NaN LEWIS TENNIS
3 GREYSTONE GOLF CLUB INC. NaN LAKE COUNTRY RACQUET CLUB
4 MUSGROVE COUNTRY CLUB NaN LAKE COUNTRY RACQUET CLUB
5 CITY OF PELHAM RACQUET CLUB NaN LAKE COUNTRY RACQUET CLUB
6 NORTHRIVER YACHT CLUB NaN LAKE COUNTRY RACQUET CLUB
7 LAKE FOREST NaN LAKE COUNTRY RACQUET CLUB
8 TNL TENNIS PRO SHOP NaN LEWIS TENNIS
9 SOUTHERN ATHLETIC CLUB NaN SPORTS ACADEMY & RAC CLUB
10 ORANGE BEACH TENNIS CENTER NaN LEWIS TENNIS
match_rating
0 47.0
1 43.0
2 67.0
3 43.0
4 67.0
5 72.0
6 48.0
7 64.0
8 67.0
9 50.0
10 67.0
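As an aside (not part of the original answer): if this becomes slow on larger frames, the rapidfuzz package exposes a near drop-in replacement for the same calls; a sketch, reusing known_list from above:
from rapidfuzz import process, fuzz

# same call shape as fuzzywuzzy; extractOne returns (match, score, key)
match = process.extractOne('LAKE FOREST', known_list,
                           scorer=fuzz.partial_token_sort_ratio)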

Adding columns of different length into pandas dataframe

I have a dataframe detailing money awarded to people over several years:
Name -- Money -- Year
Paul 57.00 2012
Susan 67.00 2012
Gary 54.00 2011
Paul 77.00 2011
Andrea 20.00 2011
Albert 23.00 2011
Hal 26.00 2010
Paul 23.00 2010
From this dataframe, I want to construct a dataframe that details all the money awarded in a single year, for making a boxplot:
2012 -- 2011 -- 2010
57.00 54.00 26.00
67.00 77.00 23.00
20.00
23.00
So you see this results in columns of different lengths. When I try to do this using pandas, I get the error 'ValueError: Length of values does not match length of index'. I assume this is because I can't add columns of varying length to a dataframe.
Can anyone offer some advice on how to proceed? Perhaps I'm approaching this incorrectly? Thanks for any help!
I'd do this in a two-step process: first add a column corresponding to the index in each year using cumcount, and then pivot so that the new column is the index, the years become the columns, and the money column becomes the values:
df["yindex"] = df.groupby("Year").cumcount()
new_df = df.pivot(index="yindex", columns="Year", values="Money")
For example:
>>> df = pd.read_csv("money.txt", sep="\s+")
>>> df
Name Money Year
0 Paul 57 2012
1 Susan 67 2012
2 Gary 54 2011
3 Paul 77 2011
4 Andrea 20 2011
5 Albert 23 2011
6 Hal 26 2010
7 Paul 23 2010
>>> df["yindex"] = df.groupby("Year").cumcount()
>>> df
Name Money Year yindex
0 Paul 57 2012 0
1 Susan 67 2012 1
2 Gary 54 2011 0
3 Paul 77 2011 1
4 Andrea 20 2011 2
5 Albert 23 2011 3
6 Hal 26 2010 0
7 Paul 23 2010 1
>>> df.pivot(index="yindex", columns="Year", values="Money")
Year 2010 2011 2012
yindex
0 26 54 57
1 23 77 67
2 NaN 20 NaN
3 NaN 23 NaN
After which you could get rid of the NaNs if you like, but it depends on whether you want to distinguish between cases like "knowing the value is 0" and "not knowing what the value is":
>>> df.pivot(index="yindex", columns="Year", values="Money").fillna(0)
Year 2010 2011 2012
yindex
0 26 54 57
1 23 77 67
2 0 20 0
3 0 23 0
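Since the stated goal was a boxplot, the leftover NaNs should be harmless there: pandas drops missing values per column when drawing box plots. A minimal sketch:
import matplotlib.pyplot as plt

new_df = df.pivot(index="yindex", columns="Year", values="Money")
new_df.plot(kind="box")  # NaNs are ignored per column, so no fillna is needed
plt.show()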
