creating an array based on dataframe values - python

If I have an id with a start date of 1988, an end date of 2018, and a value of 21100, I want to create an array or DataFrame of the years from 1988 to 2018, i.e. (1988, 1989, 1990, ..., 2018), with each year set to the same value of 21100.
So I basically want something that looks like:
date, id1, id2
1988, 21100, 0
1989, 21100, 0
1990, 21100, 0
...
1994, 21100, 4598
...
2013, 21100, 4598
...
2018, 21100, 0
How could I do this? I want the array to start populating the value at the start date and stop populating at the end date. I have multiple ids (268) and I want to loop through each one, adding a new column (id2, id3, ..., id268). So, for example, id2 runs from 1994 to 2013 with a value of 4598.

EDIT:
example = pd.DataFrame({
    'id': ['id1', 'id2', 'id3', 'id4'],
    'start date': ['1988', '1988', '2000', '2005'],
    'end date': ['2018', '2013', '2005', '2017'],
    'value': [2100, 4568, 7896, 68909]
})
print(example)
id start date end date value
0 id1 1988 2018 2100
1 id2 1988 2013 4568
2 id3 2000 2005 7896
3 id4 2005 2017 68909
You can create a Series for each row in a list comprehension, join them together with concat, replace missing values, convert to integers, and finally convert the index to a date column:
L = [pd.Series(v, index=range(int(s), int(e) + 1))
     for s, e, v in zip(example['start date'], example['end date'], example['value'])]

df1 = (pd.concat(L, axis=1, keys=example['id'])
         .fillna(0)
         .astype(int)
         .rename_axis('date')
         .reset_index())
print(df1)
id date id1 id2 id3 id4
0 1988 2100 4568 0 0
1 1989 2100 4568 0 0
2 1990 2100 4568 0 0
3 1991 2100 4568 0 0
4 1992 2100 4568 0 0
5 1993 2100 4568 0 0
6 1994 2100 4568 0 0
7 1995 2100 4568 0 0
8 1996 2100 4568 0 0
9 1997 2100 4568 0 0
10 1998 2100 4568 0 0
11 1999 2100 4568 0 0
12 2000 2100 4568 7896 0
13 2001 2100 4568 7896 0
14 2002 2100 4568 7896 0
15 2003 2100 4568 7896 0
16 2004 2100 4568 7896 0
17 2005 2100 4568 7896 68909
18 2006 2100 4568 0 68909
19 2007 2100 4568 0 68909
20 2008 2100 4568 0 68909
21 2009 2100 4568 0 68909
22 2010 2100 4568 0 68909
23 2011 2100 4568 0 68909
24 2012 2100 4568 0 68909
25 2013 2100 4568 0 68909
26 2014 2100 0 0 68909
27 2015 2100 0 0 68909
28 2016 2100 0 0 68909
29 2017 2100 0 0 68909
30 2018 2100 0 0 0
Use the DataFrame constructor with range:
start = 1988
end = 2019  # range() excludes the stop value, so this is the end year + 1
val = 21100

df = pd.DataFrame({'date': range(start, end), 'id1': val})
print(df)
date id1
0 1988 21100
1 1989 21100
2 1990 21100
3 1991 21100
4 1992 21100
5 1993 21100
6 1994 21100
7 1995 21100
8 1996 21100
9 1997 21100
10 1998 21100
11 1999 21100
12 2000 21100
13 2001 21100
14 2002 21100
15 2003 21100
16 2004 21100
17 2005 21100
18 2006 21100
19 2007 21100
20 2008 21100
21 2009 21100
22 2010 21100
23 2011 21100
24 2012 21100
25 2013 21100
26 2014 21100
27 2015 21100
28 2016 21100
29 2017 21100
30 2018 21100
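The question also mentions looping over many ids (268) to add id2, id3, ... as new columns. One way to extend the constructor approach to that case is a loop over per-id spans; the spans dict below is hypothetical illustration data built from the values mentioned in the question:

```python
import pandas as pd

# Hypothetical per-id metadata from the question: (start year, end year, value).
spans = {
    'id1': (1988, 2018, 21100),
    'id2': (1994, 2013, 4598),
}

df = pd.DataFrame({'date': range(1988, 2019)})
for name, (start, end, val) in spans.items():
    df[name] = 0                                        # outside the span
    df.loc[df['date'].between(start, end), name] = val  # inclusive on both ends

print(df.head())
```

Series.between is inclusive on both ends, which matches the expectation that both the start and end years carry the value.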

Related

Python Pandas when i try to add a column in an existing dataframe my new column is not correct

I am trying to finish a religion-adherence data visualization project, but I am stuck on this problem.
x = range(1945, 2011, 5)
for i in x:
    df_new = df_new.append(pd.DataFrame({'year': [i]}))
year
0 1945
0 1950
0 1955
0 1960
0 1965
0 1970
0 1975
0 1980
0 1985
0 1990
0 1995
0 2000
0 2005
0 2010
This is my DataFrame for now, and I want to add a column which looks like this:
0 1.307603e+08
1 2.941211e+08
2 3.440720e+08
3 4.351231e+08
4 5.146341e+08
5 5.923423e+08
6 6.636743e+08
7 6.471395e+08
8 7.457716e+08
9 9.986003e+08
10 1.153186e+09
11 1.314048e+09
12 1.426454e+09
13 1.555483e+09
When I add them up like this:
a = df.groupby(['year'], as_index=False)['islam'].sum()
b = a['islam']
df_new.insert(1, 'islam', b)
the DataFrame looks like this, which is not correct:
year islam
0 1945 130760281.0
0 1950 130760281.0
0 1955 130760281.0
0 1960 130760281.0
0 1965 130760281.0
0 1970 130760281.0
0 1975 130760281.0
0 1980 130760281.0
0 1985 130760281.0
0 1990 130760281.0
0 1995 130760281.0
0 2000 130760281.0
0 2005 130760281.0
0 2010 130760281.0
df:
year name christianity judaism islam budism nonrelig
0 1945 USA 110265118 4641182.0 0.0 1601218 22874544
1 1950 USA 122994019 6090837.0 0.0 0 22568130
2 1955 USA 134001770 5333332.0 0.0 90173 23303540
3 1960 USA 150234347 5500000.0 0.0 2012131 21548225
4 1965 USA 167515758 5600000.0 0.0 1080892 19852362
... ... ... ... ... ... ... ...
1990 1990 WSM 159500 0.0 37.0 15 1200
1991 1995 WSM 161677 0.0 43.0 16 1084
1992 2000 WSM 174600 0.0 50.0 18 1500
1993 2005 WSM 177510 0.0 58.0 18 1525
1994 2010 WSM 180140 0.0 61.0 19 2750
Try skipping the first step where you create a dataframe of years. If you group the dataframe by year and leave off the as_index argument, the year values become the index:
summed_df = df.groupby(['year'])['islam'].sum()
That gives you the summed values with the year as the index. Now you just have to reset the index, and you'll have a two-column dataframe with years and the sum values.
summed_df = summed_df.reset_index()
(Note: the default for reset_index() is drop=False. The drop parameter specifies whether you discard the index values (True) or insert them as a column into the dataframe (False). You want False here to preserve those year values.)
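A minimal sketch of the two drop settings on toy data (the year values here are arbitrary):

```python
import pandas as pd

s = pd.Series([10, 20], index=pd.Index([1945, 1950], name='year'))

# drop=False (the default): the index comes back as a regular column.
as_column = s.reset_index()

# drop=True: the index values are discarded and replaced by 0, 1, ...
discarded = s.reset_index(drop=True)
```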

How to replace values in a column of panel data with values from a list in Python?

I have a database in panel data form:
Date id variable1 variable2
2015 1 10 200
2016 1 17 300
2017 1 8 400
2018 1 11 500
2015 2 12 150
2016 2 19 350
2017 2 15 250
2018 2 9 450
2015 3 20 100
2016 3 8 220
2017 3 12 310
2018 3 14 350
And I have a list with the labels for each id:
List = ['Argentina', 'Brazil', 'Chile']
I want to replace the values of id with the labels from my list, so the result looks like this:
Date id variable1 variable2
2015 Argentina 10 200
2016 Argentina 17 300
2017 Argentina 8 400
2018 Argentina 11 500
2015 Brazil 12 150
2016 Brazil 19 350
2017 Brazil 15 250
2018 Brazil 9 450
2015 Chile 20 100
2016 Chile 8 220
2017 Chile 12 310
2018 Chile 14 350
map is the way to go, with enumerate:
d = {k: v for k, v in enumerate(List, start=1)}
df['id'] = df['id'].map(d)
Output:
Date id variable1 variable2
0 2015 Argentina 10 200
1 2016 Argentina 17 300
2 2017 Argentina 8 400
3 2018 Argentina 11 500
4 2015 Brazil 12 150
5 2016 Brazil 19 350
6 2017 Brazil 15 250
7 2018 Brazil 9 450
8 2015 Chile 20 100
9 2016 Chile 8 220
10 2017 Chile 12 310
11 2018 Chile 14 350
Try
df['id'] = df['id'].map({1: 'Argentina', 2: 'Brazil', 3: 'Chile'})
or
df['id'] = df['id'].map({k+1: v for k, v in enumerate(List)})
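One caveat worth noting with both variants: map turns any id that is missing from the dict into NaN, while replace leaves it untouched. A small sketch with an extra, unmapped id 4:

```python
import pandas as pd

df = pd.DataFrame({'id': [1, 2, 3, 4]})  # id 4 has no label in the list
labels = {1: 'Argentina', 2: 'Brazil', 3: 'Chile'}

mapped = df['id'].map(labels)        # id 4 becomes NaN
replaced = df['id'].replace(labels)  # id 4 stays as 4
```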

How to apply values in column based on condition in row values in python

I have a dataframe that looks like this:
df
Name year week date
0 Adam 2016 16 2016-04-24
1 Mary 2016 17 2016-05-01
2 Jane 2016 20 2016-05-22
3 Joe 2016 17 2016-05-01
4 Arthur 2017 44 2017-11-05
5 Liz 2017 41 2017-10-15
6 Janice 2016 47 2016-11-27
And I want to create a column season, i.e. df['season'], that assigns a season of MAM or OND depending on the value in week.
The result should look like this:
df_final
Name year week date season
0 Adam 2016 16 2016-04-24 MAM
1 Mary 2016 17 2016-05-01 MAM
2 Jane 2016 20 2016-05-22 MAM
3 Joe 2016 17 2016-05-01 MAM
4 Arthur 2017 44 2017-11-05 OND
5 Liz 2017 41 2017-10-15 OND
6 Janice 2016 47 2016-11-27 OND
In essence, values of week below 40 should be paired with MAM and values of 40 and above with OND.
So far I have this:
condition = df.week < 40
df['season'] = df[condition][[i for i in df.columns.values if i not in ['a']]].apply(lambda x: 'OND')
But it is clunky and does not produce the final response.
Thank you.
Use numpy.where:
import numpy as np

condition = df.week < 40
df['season'] = np.where(condition, 'MAM', 'OND')
print (df)
Name year week date season
0 Adam 2016 16 2016-04-24 MAM
1 Mary 2016 17 2016-05-01 MAM
2 Jane 2016 20 2016-05-22 MAM
3 Joe 2016 17 2016-05-01 MAM
4 Arthur 2017 44 2017-11-05 OND
5 Liz 2017 41 2017-10-15 OND
6 Janice 2016 47 2016-11-27 OND
EDIT:
To convert the strings to integers, use astype:
condition = df.week.astype(int) < 40
Or convert the column first:
df.week = df.week.astype(int)
condition = df.week < 40
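If more than two labels were ever needed, np.where generalizes to np.select, which checks a list of conditions in order; the third season label 'JFM' here is a hypothetical addition for illustration:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'week': [5, 16, 44, 50]})

conditions = [df['week'] < 10, df['week'] < 40]  # checked in order
choices = ['JFM', 'MAM']
df['season'] = np.select(conditions, choices, default='OND')
```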

Pandas Melt with Multiple Value Vars

I have a data set which is in wide format like this
Index Country Variable 2000 2001 2002 2003 2004 2005
0 Argentina var1 12 15 18 17 23 29
1 Argentina var2 1 3 2 5 7 5
2 Brazil var1 20 23 25 29 31 32
3 Brazil var2 0 1 2 2 3 3
I want to reshape my data to long so that year, var1, and var2 become new columns
Index Country year var1 var2
0 Argentina 2000 12 1
1 Argentina 2001 15 3
2 Argentina 2002 18 2
....
6 Brazil 2000 20 0
7 Brazil 2001 23 1
I got my code to work when I only had one variable by writing
df=(pd.melt(df,id_vars='Country',value_name='Var1', var_name='year'))
I can't figure out how to do this for var1, var2, var3, etc.
Instead of melt, you can use a combination of stack and unstack:
(df.set_index(['Country', 'Variable'])
   .rename_axis(['Year'], axis=1)
   .stack()
   .unstack('Variable')
   .reset_index())
Variable Country Year var1 var2
0 Argentina 2000 12 1
1 Argentina 2001 15 3
2 Argentina 2002 18 2
3 Argentina 2003 17 5
4 Argentina 2004 23 7
5 Argentina 2005 29 5
6 Brazil 2000 20 0
7 Brazil 2001 23 1
8 Brazil 2002 25 2
9 Brazil 2003 29 2
10 Brazil 2004 31 3
11 Brazil 2005 32 3
Option 1
Using melt then unstack for var1, var2, etc...
(df1.melt(id_vars=['Country', 'Variable'], var_name='Year')
    .set_index(['Country', 'Year', 'Variable'])
    .squeeze()
    .unstack()
    .reset_index())
Output:
Variable Country Year var1 var2
0 Argentina 2000 12 1
1 Argentina 2001 15 3
2 Argentina 2002 18 2
3 Argentina 2003 17 5
4 Argentina 2004 23 7
5 Argentina 2005 29 5
6 Brazil 2000 20 0
7 Brazil 2001 23 1
8 Brazil 2002 25 2
9 Brazil 2003 29 2
10 Brazil 2004 31 3
11 Brazil 2005 32 3
Option 2
Using pivot then stack:
(df1.pivot(index='Country', columns='Variable')
    .stack(0)
    .rename_axis(['Country', 'Year'])
    .reset_index())
Output:
Variable Country Year var1 var2
0 Argentina 2000 12 1
1 Argentina 2001 15 3
2 Argentina 2002 18 2
3 Argentina 2003 17 5
4 Argentina 2004 23 7
5 Argentina 2005 29 5
6 Brazil 2000 20 0
7 Brazil 2001 23 1
8 Brazil 2002 25 2
9 Brazil 2003 29 2
10 Brazil 2004 31 3
11 Brazil 2005 32 3
Option 3 (ayhan's solution)
Using set_index, stack, and unstack:
(df.set_index(['Country', 'Variable'])
   .rename_axis(['Year'], axis=1)
   .stack()
   .unstack('Variable')
   .reset_index())
Output:
Variable Country Year var1 var2
0 Argentina 2000 12 1
1 Argentina 2001 15 3
2 Argentina 2002 18 2
3 Argentina 2003 17 5
4 Argentina 2004 23 7
5 Argentina 2005 29 5
6 Brazil 2000 20 0
7 Brazil 2001 23 1
8 Brazil 2002 25 2
9 Brazil 2003 29 2
10 Brazil 2004 31 3
11 Brazil 2005 32 3
numpy
import numpy as np

# Pull out the year columns as a 2-D array.
years = df.drop(['Country', 'Variable'], axis=1)
y = years.values
m = y.shape[1]

# Integer codes and unique labels for each grouping column.
f0, u0 = pd.factorize(df.Country.values)
f1, u1 = pd.factorize(df.Variable.values)

# 3-D array: variable x country x year, filled from the wide data.
w = np.empty((u1.size, u0.size, m), dtype=y.dtype)
w[f1, f0] = y

results = pd.DataFrame(dict(
    Country=u0.repeat(m),
    Year=np.tile(years.columns.values, u0.size),
)).join(pd.DataFrame(w.reshape(u1.size, -1).T, columns=u1))
results
Country Year var1 var2
0 Argentina 2000 12 1
1 Argentina 2001 15 3
2 Argentina 2002 18 2
3 Argentina 2003 17 5
4 Argentina 2004 23 7
5 Argentina 2005 29 5
6 Brazil 2000 20 0
7 Brazil 2001 23 1
8 Brazil 2002 25 2
9 Brazil 2003 29 2
10 Brazil 2004 31 3
11 Brazil 2005 32 3

Transforming Dataframe columns into Dataframe of rows

I have following DataFrame:
data = {'year': [2010, 2010, 2011, 2012, 2011, 2012, 2010, 2011, 2012, 2013],
        'store_number': ['1944', '1945', '1946', '1947', '1948', '1949', '1947', '1948', '1949', '1947'],
        'retailer_name': ['Walmart', 'Walmart', 'CRV', 'CRV', 'CRV', 'Walmart', 'Walmart', 'CRV', 'CRV', 'CRV'],
        'product': ['a', 'b', 'a', 'a', 'b', 'a', 'b', 'a', 'a', 'c'],
        'amount': [5, 5, 8, 6, 1, 5, 10, 6, 12, 11]}

stores = pd.DataFrame(data, columns=['retailer_name', 'store_number', 'year', 'product', 'amount'])
stores.set_index(['retailer_name', 'store_number', 'year', 'product'], inplace=True)
stores.groupby(level=[0, 1, 2, 3]).sum()
I want to transform the following DataFrame:
amount
retailer_name store_number year product
CRV 1946 2011 a 8
1947 2012 a 6
2013 c 11
1948 2011 a 6
b 1
1949 2012 a 12
Walmart 1944 2010 a 5
1945 2010 b 5
1947 2010 b 10
1949 2012 a 5
into a DataFrame of rows:
retailer_name store_number year a b c
CRV 1946 2011 8 0 0
CRV 1947 2012 6 0 0
etc...
The products are known ahead of time.
Any idea how to do this?
Please see below for the solution. Thanks to EdChum for corrections to the original post.
Without reset_index()
stores.groupby(level=[0, 1, 2, 3]).sum().unstack().fillna(0)
amount
product a b c
retailer_name store_number year
CRV 1946 2011 8 0 0
1947 2012 6 0 0
2013 0 0 11
1948 2011 6 1 0
1949 2012 12 0 0
Walmart 1944 2010 5 0 0
1945 2010 0 5 0
1947 2010 0 10 0
1949 2012 5 0 0
With reset_index()
stores.groupby(level=[0, 1, 2, 3]).sum().unstack().reset_index().fillna(0)
retailer_name store_number year amount
product a b c
0 CRV 1946 2011 8 0 0
1 CRV 1947 2012 6 0 0
2 CRV 1947 2013 0 0 11
3 CRV 1948 2011 6 1 0
4 CRV 1949 2012 12 0 0
5 Walmart 1944 2010 5 0 0
6 Walmart 1945 2010 0 5 0
7 Walmart 1947 2010 0 10 0
8 Walmart 1949 2012 5 0 0
Unstack product from the index and fill NaN values with zero.
df = stores.groupby(level=[0, 1, 2, 3]).sum().unstack('product')
mask = pd.IndexSlice['amount', :]
df.loc[:, mask] = df.loc[:, mask].fillna(0)
>>> df
amount
product a b c
retailer_name store_number year
CRV 1946 2011 8 0 0
1947 2012 6 0 0
2013 0 0 11
1948 2011 6 1 0
1949 2012 12 0 0
Walmart 1944 2010 5 0 0
1945 2010 0 5 0
1947 2010 0 10 0
1949 2012 5 0 0
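The same result can also be reached in one call with pivot_table on the un-indexed frame, which groups, aggregates, and unstacks at once; the sketch below uses a trimmed-down version of the question's data:

```python
import pandas as pd

stores = pd.DataFrame({'retailer_name': ['Walmart', 'Walmart', 'CRV'],
                       'store_number': ['1944', '1945', '1946'],
                       'year': [2010, 2010, 2011],
                       'product': ['a', 'b', 'a'],
                       'amount': [5, 5, 8]})

# fill_value=0 replaces the missing retailer/store/year/product combinations.
out = (stores.pivot_table(index=['retailer_name', 'store_number', 'year'],
                          columns='product', values='amount',
                          aggfunc='sum', fill_value=0)
             .reset_index())
```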
