Pandas: change column values over a certain amount - python

I have the following:
result.head(4)
district end party start state type id.thomas current
564 1 1987 Democrat 1985-01-03 HI rep 2 1985
565 1 1993 Democrat 1991-01-03 HI rep 2 1991
566 1 1995 Democrat 1993-01-05 HI rep 2 2019
567 1 1997 Democrat 1995-01-04 HI rep 2 2017
I would like to change all values greater than 2014 in the end column to 2014, but I'm not sure how to go about doing this.

Use clip_upper:
In [207]:
df['end'] = df['end'].clip_upper(1990)
df
Out[207]:
district end party start state type id.thomas current
564 1 1987 Democrat 1985-01-03 HI rep 2 1985
565 1 1990 Democrat 1991-01-03 HI rep 2 1991
566 1 1990 Democrat 1993-01-05 HI rep 2 2019
567 1 1990 Democrat 1995-01-04 HI rep 2 2017
So in your case, df['end'] = df['end'].clip_upper(2014) should work.
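Note that clip_upper was deprecated in pandas 0.24 and removed in 1.0; modern code uses clip(upper=...) instead. A minimal sketch with made-up values standing in for the end column:

```python
import pandas as pd

# Sample 'end' years, including values past the 2014 cutoff
df = pd.DataFrame({"end": [1987, 1993, 1995, 1997, 2019, 2017]})

# Cap every value at 2014; smaller values pass through unchanged
df["end"] = df["end"].clip(upper=2014)

print(df["end"].tolist())  # [1987, 1993, 1995, 1997, 2014, 2014]
```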

Related

Python Pandas: when I try to add a column to an existing dataframe, my new column is not correct

I am trying to finish a religion-adherence data visualization project, but I am stuck on this problem.
x = range(1945, 2011, 5)
for i in x:
    df_new = df_new.append(pd.DataFrame({'year': [i]}))
years
0 1945
0 1950
0 1955
0 1960
0 1965
0 1970
0 1975
0 1980
0 1985
0 1990
0 1995
0 2000
0 2005
0 2010
This is my dataframe so far, and I want to add a column which looks like this:
0 1.307603e+08
1 2.941211e+08
2 3.440720e+08
3 4.351231e+08
4 5.146341e+08
5 5.923423e+08
6 6.636743e+08
7 6.471395e+08
8 7.457716e+08
9 9.986003e+08
10 1.153186e+09
11 1.314048e+09
12 1.426454e+09
13 1.555483e+09
When I sum them up like this:
a = df.groupby(['year'], as_index=False)['islam'].sum()
b = a['islam']
df_new.insert(1, 'islam', b)
the dataframe looks like this, which is not correct:
year islam
0 1945 130760281.0
0 1950 130760281.0
0 1955 130760281.0
0 1960 130760281.0
0 1965 130760281.0
0 1970 130760281.0
0 1975 130760281.0
0 1980 130760281.0
0 1985 130760281.0
0 1990 130760281.0
0 1995 130760281.0
0 2000 130760281.0
0 2005 130760281.0
0 2010 130760281.0
df:
year name christianity judaism islam budism nonrelig
0 1945 USA 110265118 4641182.0 0.0 1601218 22874544
1 1950 USA 122994019 6090837.0 0.0 0 22568130
2 1955 USA 134001770 5333332.0 0.0 90173 23303540
3 1960 USA 150234347 5500000.0 0.0 2012131 21548225
4 1965 USA 167515758 5600000.0 0.0 1080892 19852362
... ... ... ... ... ... ... ...
1990 1990 WSM 159500 0.0 37.0 15 1200
1991 1995 WSM 161677 0.0 43.0 16 1084
1992 2000 WSM 174600 0.0 50.0 18 1500
1993 2005 WSM 177510 0.0 58.0 18 1525
1994 2010 WSM 180140 0.0 61.0 19 2750
Try skipping the first step where you build a dataframe of years: every row of df_new has index 0, and pandas aligns on the index when inserting a Series, which is why the same value gets repeated down the column. If you group the dataframe by year and leave off the as_index=False argument, it will give you what you're looking for:
summed_df = df.groupby('year')['islam'].sum()
That gives you a Series with the year as the index. Now you just have to reset the index, and you'll have a two-column dataframe with the years and the sum values:
summed_df = summed_df.reset_index()
(Note: the default for reset_index() is drop=False. The drop parameter specifies whether you discard the index values (True) or insert them as a column into the dataframe (False). You want False here to preserve those year values.)
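A self-contained sketch of that fix, using toy numbers in place of the real adherence data (the values below are made up):

```python
import pandas as pd

# Toy stand-in for the adherence data: two countries, two years
df = pd.DataFrame({
    "year":  [1945, 1945, 1950, 1950],
    "name":  ["USA", "WSM", "USA", "WSM"],
    "islam": [0.0, 37.0, 0.0, 43.0],
})

# Group by year; without as_index=False the year becomes the index...
summed = df.groupby("year")["islam"].sum()

# ...so reset_index() turns it back into an ordinary column
summed_df = summed.reset_index()
print(summed_df)
```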

Convert panel data wide to long by two variables in Python

I have a dataset in Python that I am trying to convert from a wide dataset like this:
ID  Name    2007  2008
1   Andy    324   412
2   Becky   123   422
3   Lizzie  332   564
To a long dataset such as this:
ID  Name    Year  Var
1   Andy    2007  324
1   Andy    2008  412
2   Becky   2007  123
2   Becky   2008  422
3   Lizzie  2007  332
3   Lizzie  2008  564
Unfortunately I can't use pivot due to the two identification columns and multiple observations for each year. Any help would be much appreciated.
You don't need pivot here, because this is actually a melt operation:
out = (df.melt(id_vars=["ID", "Name"],
               value_vars=["2007", "2008"],
               var_name="Year",
               value_name="Var")
         .sort_values(["ID", "Year"]))
print(out)
ID Name Year Var
0 1 Andy 2007 324
3 1 Andy 2008 412
1 2 Becky 2007 123
4 2 Becky 2008 422
2 3 Lizzie 2007 332
5 3 Lizzie 2008 564
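For reference, a self-contained version of the above, reconstructing the sample wide table inline:

```python
import pandas as pd

# The wide table from the question
df = pd.DataFrame({
    "ID": [1, 2, 3],
    "Name": ["Andy", "Becky", "Lizzie"],
    "2007": [324, 123, 332],
    "2008": [412, 422, 564],
})

# melt keeps ID/Name as identifiers and stacks the year columns long;
# the old column names land in 'Year', the cell values in 'Var'
out = (df.melt(id_vars=["ID", "Name"],
               value_vars=["2007", "2008"],
               var_name="Year",
               value_name="Var")
         .sort_values(["ID", "Year"])
         .reset_index(drop=True))
print(out)
```

Note that Year holds strings here, since it comes from the column labels; cast with out["Year"].astype(int) if you need numeric years.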

How to add null value rows into pandas dataframe for missing years in a multi-line chart plot

I am building a chart from a dataframe with a series of yearly values for six countries. This table is created by an SQL query and then passed to pandas with read_sql command...
country date value
0 CA 2000 123
1 CA 2001 125
2 US 1999 223
3 US 2000 235
4 US 2001 344
5 US 2002 355
...
Unfortunately, not every year has a value in each country; nevertheless the chart tool requires each country to have the same number of years in the dataframe. Years that have no value need a NaN (null) row added.
In the end, I want the pandas dataframe to look as follows for all six countries....
country date value
0 CA 1999 NaN
1 CA 2000 123
2 CA 2001 125
3 CA 2002 NaN
4 US 1999 223
5 US 2000 235
6 US 2001 344
7 US 2002 355
8 DE 1999 NaN
9 DE 2000 NaN
10 DE 2001 423
11 DE 2002 326
...
Are there any tools or shortcuts for determining min-max dates and then ensuring a new nan row is created if needed?
Use the unstack/stack trick; stack(dropna=False) keeps the missing country/year combinations as NaN rows:
df = df.set_index(['country','date']).unstack().stack(dropna=False).reset_index()
print (df)
country date value
0 CA 1999 NaN
1 CA 2000 123.0
2 CA 2001 125.0
3 CA 2002 NaN
4 US 1999 223.0
5 US 2000 235.0
6 US 2001 344.0
7 US 2002 355.0
Another idea with DataFrame.reindex:
mux = pd.MultiIndex.from_product(
    [df['country'].unique(),
     range(df['date'].min(), df['date'].max() + 1)],
    names=['country', 'date'])
df = df.set_index(['country', 'date']).reindex(mux).reset_index()
print (df)
country date value
0 CA 1999 NaN
1 CA 2000 123.0
2 CA 2001 125.0
3 CA 2002 NaN
4 US 1999 223.0
5 US 2000 235.0
6 US 2001 344.0
7 US 2002 355.0
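A self-contained sketch of the reindex approach, rebuilding the sample data inline (CA is missing 1999 and 2002):

```python
import pandas as pd

# Sample data with gaps: CA has no rows for 1999 or 2002
df = pd.DataFrame({
    "country": ["CA", "CA", "US", "US", "US", "US"],
    "date":    [2000, 2001, 1999, 2000, 2001, 2002],
    "value":   [123, 125, 223, 235, 344, 355],
})

# Build the full country x year grid, then reindex onto it;
# combinations absent from the data come back as NaN
mux = pd.MultiIndex.from_product(
    [df["country"].unique(),
     range(df["date"].min(), df["date"].max() + 1)],
    names=["country", "date"])
full = df.set_index(["country", "date"]).reindex(mux).reset_index()
print(full)
```

One design note: from_product gives every country the same min-to-max year range, which is exactly the "same number of years per country" shape the chart tool wants.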

Selection over different columns after a groupby

I am new to pandas, so please treat this question with patience.
I have a df with year, state, and population data collected over many years and across many states.
I want to find the max population during each year and the corresponding state,
for example:
1995 Alabama xx; 1996 New York yy; 1997 Utah zz
I did a groupby and got the population for all the states in a year; how do I iterate over all the years?
state_yearwise = df.groupby(["Year", "State"])["Pop"].max()
state_yearwise.head(10)
1990 Alabama 22.5
Arizona 29.4
Arkansas 16.2
California 34.1
2016 South Dakota 14.1
Tennessee 10.2
Texas 17.4
Utah 16.1
Then I tried
df.loc[df['pop'] == df['pop'].max(), ['year', 'State', 'pop']]
1992 Colorado 54.1
which gives me only one row: the max over all years and states.
What I want is per year which state had the max population
Suggestions?
You can use transform to broadcast each year's max back onto every row, then build a boolean mask of the rows whose pop equals it:
idx = df.groupby(['year'])['pop'].transform('max') == df['pop']
Now you can index the df using idx
df[idx]
You get
pop state year
2 210 B 2000
3 200 B 2001
For the other dataframe that you updated
Year State County Pop
0 2015 Mississippi Panola 6.4
1 2015 Mississippi Newton 6.7
2 2015 Mississippi Newton 6.7
3 2015 Utah Monroe 12.1
4 2013 Alabama Newton 10.4
5 2013 Alabama Georgi 4.2
idx = df.groupby(['Year'])['Pop'].transform('max') == df['Pop']
df[idx]
You get
Year State County Pop
3 2015 Utah Monroe 12.1
4 2013 Alabama Newton 10.4
Is this what you want?
df = pd.DataFrame([{'state': 'A', 'year': 2000, 'pop': 100},
                   {'state': 'A', 'year': 2001, 'pop': 110},
                   {'state': 'B', 'year': 2000, 'pop': 210},
                   {'state': 'B', 'year': 2001, 'pop': 200}])
maxpop = df.groupby("state", as_index=False)["pop"].max()
pd.merge(maxpop, df, how='inner')
I see for df:
pop state year
0 100 A 2000
1 110 A 2001
2 210 B 2000
3 200 B 2001
And for the final result:
state pop year
0 A 110 2001
1 B 210 2000
Why not get rid of groupby entirely, by using sort_values with drop_duplicates?
df.sort_values(['state','pop']).drop_duplicates('state',keep='last')
Out[164]:
pop state year
1 110 A 2001
2 210 B 2000
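A third option, not shown in the answers above but common for this pattern: groupby().idxmax() returns the row label of each group's maximum, which feeds straight into .loc. A sketch on the same toy frame, grouping by year as the question asks:

```python
import pandas as pd

# Same toy data as the merge answer above
df = pd.DataFrame([{'state': 'A', 'year': 2000, 'pop': 100},
                   {'state': 'A', 'year': 2001, 'pop': 110},
                   {'state': 'B', 'year': 2000, 'pop': 210},
                   {'state': 'B', 'year': 2001, 'pop': 200}])

# idxmax gives the index label of the max 'pop' within each year;
# .loc then pulls exactly those rows
winners = df.loc[df.groupby('year')['pop'].idxmax()]
print(winners)
```

Unlike the transform mask, this keeps exactly one row per year even when two states tie for the maximum.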

Adding columns of different length into pandas dataframe

I have a dataframe detailing money awarded to people over several years:
Name -- Money -- Year
Paul 57.00 2012
Susan 67.00 2012
Gary 54.00 2011
Paul 77.00 2011
Andrea 20.00 2011
Albert 23.00 2011
Hal 26.00 2010
Paul 23.00 2010
From this dataframe, I want to construct a dataframe that details all the money awarded in a single year, for making a boxplot:
2012 -- 2011 -- 2010
57.00 54.00 26.00
67.00 77.00 23.00
20.00
23.00
So you see this results in columns of different length. When I try to do this using pandas, I get the error 'ValueError: Length of values does not match length of index'. I assume this is because I can't add varying length columns to a dataframe.
Can anyone offer some advice on how to proceed? Perhaps I'm approaching this incorrectly? Thanks for any help!
I'd do this in a two-step process: first add a column corresponding to the index in each year using cumcount, and then pivot so that the new column is the index, the years become the columns, and the money column becomes the values:
df["yindex"] = df.groupby("Year").cumcount()
new_df = df.pivot(index="yindex", columns="Year", values="Money")
For example:
>>> df = pd.read_csv("money.txt", sep="\s+")
>>> df
Name Money Year
0 Paul 57 2012
1 Susan 67 2012
2 Gary 54 2011
3 Paul 77 2011
4 Andrea 20 2011
5 Albert 23 2011
6 Hal 26 2010
7 Paul 23 2010
>>> df["yindex"] = df.groupby("Year").cumcount()
>>> df
Name Money Year yindex
0 Paul 57 2012 0
1 Susan 67 2012 1
2 Gary 54 2011 0
3 Paul 77 2011 1
4 Andrea 20 2011 2
5 Albert 23 2011 3
6 Hal 26 2010 0
7 Paul 23 2010 1
>>> df.pivot(index="yindex", columns="Year", values="Money")
Year 2010 2011 2012
yindex
0 26 54 57
1 23 77 67
2 NaN 20 NaN
3 NaN 23 NaN
After which you could get rid of the NaNs if you like, but it depends on whether you want to distinguish between cases like "knowing the value is 0" and "not knowing what the value is":
>>> df.pivot(index="yindex", columns="Year", values="Money").fillna(0)
Year 2010 2011 2012
yindex
0 26 54 57
1 23 77 67
2 0 20 0
3 0 23 0
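The whole cumcount-then-pivot recipe, condensed into a self-contained sketch that rebuilds the sample data inline:

```python
import pandas as pd

# The awards table from the question
df = pd.DataFrame({
    "Name":  ["Paul", "Susan", "Gary", "Paul", "Andrea", "Albert", "Hal", "Paul"],
    "Money": [57.0, 67.0, 54.0, 77.0, 20.0, 23.0, 26.0, 23.0],
    "Year":  [2012, 2012, 2011, 2011, 2011, 2011, 2010, 2010],
})

# cumcount numbers the rows within each year: 0, 1, 2, ...
df["yindex"] = df.groupby("Year").cumcount()

# pivot turns the years into columns; shorter years are padded with NaN
wide = df.pivot(index="yindex", columns="Year", values="Money")
print(wide)
```

Since plotting libraries generally skip NaNs, leaving the padding in place usually gives the correct boxplot without the fillna(0) step.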
