I have a dataframe that currently looks like this:
country series year value
usa a 2010 21
usa b 2015 22
usa a 2017 23
usa b 2010 22
usa b 2017 23
aus a 2010 21
aus b 2015 22
aus a 2017 23
aus b 2010 22
aus b 2017 23
When I run this code, it hides the repeated country labels but not the repeated series labels, as I expected it to.
pop2.set_index(['Country','Series'])
I want:
country series year value
usa     a      2010 21
               2017 23
        b      2010 22
               2015 22
               2017 23
aus     a      2010 21
               2017 23
        b      2010 22
               2015 22
               2017 23
Instead, it is returning:
country series year value
usa     a      2010 21
        b      2015 22
        a      2017 23
        b      2010 22
        b      2017 23
aus     a      2010 21
        b      2015 22
        a      2017 23
        b      2010 22
        b      2017 23
Every row needs an index label to display in a dataframe, so the innermost index level is always printed in full. Therefore, you need another level of index; then the outer levels can show the index "grouping" as you wish.
Let's try this:
df.set_index(['country','series',np.arange(df.shape[0])]).sort_index()
Output:
                  year  value
country series
aus     a      5  2010     21
               7  2017     23
        b      6  2015     22
               8  2010     22
               9  2017     23
usa     a      0  2010     21
               2  2017     23
        b      1  2015     22
               3  2010     22
               4  2017     23
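As a variation on the answer above, here is a minimal self-contained sketch that uses the existing year column as the third index level instead of a row counter. It assumes each (country, series, year) combination is a sensible key for the data, which holds for the sample in the question but may not hold in general. Note that sorting puts aus before usa.

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    "country": ["usa"] * 5 + ["aus"] * 5,
    "series": ["a", "b", "a", "b", "b"] * 2,
    "year": [2010, 2015, 2017, 2010, 2017] * 2,
    "value": [21, 22, 23, 22, 23] * 2,
})

# Use 'year' as the innermost level, then sort so equal labels sit together.
# Repeated labels in the outer levels are then hidden when printing.
out = df.set_index(["country", "series", "year"]).sort_index()
print(out)

# Rows of one group can be selected with a partial key.
print(out.loc[("usa", "a")])
```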
Related
I'm trying to create a table like the one in this example:
Example_picture
My code:
data = list(range(39)) # mockup for 39 values
columns = pd.MultiIndex.from_product([['1', '2', '6'], [str(year) for year in range(2007, 2020)]],
names=['Factor', 'Year'])
df = pd.DataFrame(data, index=['World'], columns=columns)
print(df)
But I get this error:
Shape of passed values is (39, 1), indices imply (1, 39)
What did I do wrong?
You need to wrap the data in a list to force the DataFrame constructor to interpret the list as a row:
import pandas as pd

data = list(range(39))
columns = pd.MultiIndex.from_product([['1', '2', '6'],
                                      [str(year) for year in range(2007, 2020)]],
                                     names=['Factor', 'Year'])
df = pd.DataFrame([data], index=['World'], columns=columns)
Output:
Factor 1 2 6
Year 2007 2008 2009 2010 2011 2012 2013 2014 2015 2016 2017 2018 2019 2007 2008 2009 2010 2011 2012 2013 2014 2015 2016 2017 2018 2019 2007 2008 2009 2010 2011 2012 2013 2014 2015 2016 2017 2018 2019
World 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38
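To make the shape mismatch in the error message concrete, the following sketch compares how the DataFrame constructor reads a flat list and a list wrapped in another list. The variable names as_column and as_row are only illustrative.

```python
import pandas as pd

data = list(range(39))
columns = pd.MultiIndex.from_product(
    [["1", "2", "6"], [str(year) for year in range(2007, 2020)]],
    names=["Factor", "Year"],
)

# A flat list is read as a single column with 39 rows.
as_column = pd.DataFrame(data)
print(as_column.shape)  # (39, 1)

# A list containing one list is read as a single row with 39 columns,
# which matches the 1-element index and the 39-element column index.
as_row = pd.DataFrame([data], index=["World"], columns=columns)
print(as_row.shape)  # (1, 39)

# Selecting one factor returns its 13 yearly values.
print(as_row["1"])
```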
Is it possible to convert a pandas dataframe like this:
into a time series where each year follows the previous one?
This is likely what df.unstack(level=1) is meant for.
import numpy as np
import pandas as pd

np.random.seed(111)  # reproducibility
df = pd.DataFrame(
    data={
        "2009": np.random.randn(12),
        "2010": np.random.randn(12),
        "2011": np.random.randn(12),
    },
    index=range(1, 13)
)
print(df)
Out[45]:
2009 2010 2011
1 -1.133838 -1.440585 0.570594
2 0.384319 0.773703 0.915420
3 1.496554 -1.027967 -1.669341
4 -0.355382 -0.090986 0.482714
5 -0.787534 0.492003 -0.310473
6 -0.459439 0.424672 2.394690
7 -0.059169 1.283049 1.550931
8 -0.354174 0.315986 -0.646465
9 -0.735523 -0.408082 -0.928937
10 -1.183940 -0.067948 -1.654976
11 0.238894 -0.952427 0.350193
12 -0.589920 -0.110677 -0.141757
df_out = df.unstack(1).reset_index()
df_out.columns = ["year", "month", "value"]
print(df_out)
Out[46]:
year month value
0 2009 1 -1.133838
1 2009 2 0.384319
2 2009 3 1.496554
3 2009 4 -0.355382
4 2009 5 -0.787534
5 2009 6 -0.459439
6 2009 7 -0.059169
7 2009 8 -0.354174
8 2009 9 -0.735523
9 2009 10 -1.183940
10 2009 11 0.238894
11 2009 12 -0.589920
12 2010 1 -1.440585
13 2010 2 0.773703
14 2010 3 -1.027967
15 2010 4 -0.090986
16 2010 5 0.492003
17 2010 6 0.424672
18 2010 7 1.283049
19 2010 8 0.315986
20 2010 9 -0.408082
21 2010 10 -0.067948
22 2010 11 -0.952427
23 2010 12 -0.110677
24 2011 1 0.570594
25 2011 2 0.915420
26 2011 3 -1.669341
27 2011 4 0.482714
28 2011 5 -0.310473
29 2011 6 2.394690
30 2011 7 1.550931
31 2011 8 -0.646465
32 2011 9 -0.928937
33 2011 10 -1.654976
34 2011 11 0.350193
35 2011 12 -0.141757
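If a date-indexed series is wanted rather than separate year and month columns, the long table can be taken one step further. The sketch below is one possible way to do that; it assumes the row labels 1 to 12 are month numbers and arbitrarily places each value on the first day of its month.

```python
import numpy as np
import pandas as pd

np.random.seed(111)
df = pd.DataFrame(
    {str(year): np.random.randn(12) for year in (2009, 2010, 2011)},
    index=range(1, 13),
)

# Same reshaping as above: one row per (year, month) pair.
long = df.unstack().reset_index()
long.columns = ["year", "month", "value"]

# Assumption: each value is dated to the first day of its month.
parts = long[["year", "month"]].astype(int).assign(day=1)
dates = pd.DatetimeIndex(pd.to_datetime(parts), name="date")

ts = pd.Series(long["value"].to_numpy(), index=dates, name="value")
print(ts.head())
```

With a DatetimeIndex in place, date-based selection such as ts.loc["2010"] becomes available.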
Here's the data in csv format:
Name 2012 2013 2014 2015 2016 2017 2018 2019 2020
Jack 1 15 25 3 5 11 5 8 3
Jill 5 10 32 5 5 14 6 8 7
I don't want the Name column to be included in the sum, as it gives an error.
I tried
df.cumsum()
Try with set_index and reset_index to keep the name column:
df.set_index('Name').cumsum().reset_index()
Output:
Name 2012 2013 2014 2015 2016 2017 2018 2019 2020
0 Jack 1 15 25 3 5 11 5 8 3
1 Jill 6 25 57 8 10 25 11 16 10
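The question does not say in which direction the running total should go. The output above accumulates down the rows (the default, axis=0). The sketch below rebuilds the sample data and also shows axis=1, which accumulates across the year columns for each name; which one is wanted is an assumption the reader has to settle.

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    "Name": ["Jack", "Jill"],
    "2012": [1, 5], "2013": [15, 10], "2014": [25, 32],
    "2015": [3, 5], "2016": [5, 5], "2017": [11, 14],
    "2018": [5, 6], "2019": [8, 8], "2020": [3, 7],
})

# Default axis=0: running total down the rows (Jack, then Jack + Jill).
down_rows = df.set_index("Name").cumsum().reset_index()
print(down_rows)

# axis=1: running total across the years, separately for each name.
across_years = df.set_index("Name").cumsum(axis=1).reset_index()
print(across_years)
```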
I have the following dates dataframe:
dates
0 2012 10 4
1
2 2012 01 19
3 20 6 11
4 20 10 7
5 19 11 12
6
7 2013 03 19
8 2016 2 5
9 2011 2 19
10
11 2011 05 23
12 2012 04 5
How can I normalize the dates column into:
dates
0 2012 10 04
1
2 2012 01 19
3 2020 06 11
4 2020 10 07
5 2019 11 12
6
7 2013 03 19
8 2016 02 05
9 2011 02 19
10
11 2011 05 23
12 2012 04 05
I tried using regex, splitting the string and tweaking each part separately, but I am overcomplicating the task. Is it possible to normalize the column into the latter dataframe? The rule is to add a leading 0 if the month or day has a single digit, and to add 20 at the beginning of the string if the year is incomplete. The format is yyyymmdd.
Solution:
x = (df.loc[df.dates.str.contains(r'\d+\s*\d+\s*\d+'), 'dates']
.str.split(expand=True)
.rename(columns={0:'year',1:'month',2:'day'})
.astype(int)
)
x.loc[x.year <= 50, 'year'] += 2000
df['new'] = pd.to_datetime(x, errors='coerce').dt.strftime('%Y%m%d')
Result:
In [148]: df
Out[148]:
dates new
0 2012 10 4 20121004
1 NaN
2 2012 01 19 20120119
3 20 6 11 20200611
4 20 10 7 20201007
5 19 11 12 20191112
6 NaN
7 2013 03 19 20130319
8 2016 2 5 20160205
9 2011 2 19 20110219
10 NaN
11 2011 05 23 20110523
12 2012 04 5 20120405
Explanation:
In [149]: df.loc[df.dates.str.contains(r'\d+\s*\d+\s*\d+'), 'dates']
Out[149]:
0 2012 10 4
2 2012 01 19
3 20 6 11
4 20 10 7
5 19 11 12
7 2013 03 19
8 2016 2 5
9 2011 2 19
11 2011 05 23
12 2012 04 5
Name: dates, dtype: object
In [152]: (df.loc[df.dates.str.contains(r'\d+\s*\d+\s*\d+'), 'dates']
...: .str.split(expand=True)
...: .rename(columns={0:'year',1:'month',2:'day'})
...: .astype(int))
Out[152]:
year month day
0 2012 10 4
2 2012 1 19
3 20 6 11
4 20 10 7
5 19 11 12
7 2013 3 19
8 2016 2 5
9 2011 2 19
11 2011 5 23
12 2012 4 5
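For readers who want to run the steps end to end, here is a self-contained sketch on a shortened copy of the data. It deviates from the solution above in two labeled ways: the row filter is anchored so that it only accepts exactly three whitespace-separated numbers, and every year below 100 is assumed to belong to the 2000s.

```python
import pandas as pd

# Shortened sample; the empty string stands in for the blank rows.
df = pd.DataFrame({"dates": ["2012 10 4", "", "2012 01 19", "20 6 11", "2016 2 5"]})

# Keep only rows made of exactly three whitespace-separated numbers.
mask = df["dates"].str.match(r"^\s*\d+\s+\d+\s+\d+\s*$")
parts = (
    df.loc[mask, "dates"]
    .str.split(expand=True)
    .rename(columns={0: "year", 1: "month", 2: "day"})
    .astype(int)
)

# Assumption: two-digit years all belong to the 2000s.
parts.loc[parts["year"] < 100, "year"] += 2000

# Rows that were filtered out become NaN through index alignment.
df["new"] = pd.to_datetime(parts, errors="coerce").dt.strftime("%Y%m%d")
print(df)
```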
I have a csv file like this:
year month Company A Company B Company C
1990 Jan 10 15 20
1990 Feb 11 14 21
1990 mar 13 8 23
1990 april 12 22 19
1990 may 15 12 18
1990 june 18 13 13
1990 june 12 14 15
1990 july 12 14 16
1991 Jan 11 16 13
1991 Feb 14 17 11
1991 mar 23 13 12
1991 april 23 21 10
1991 may 22 22 9
1991 june 24 20 32
1991 june 12 14 15
1991 july 21 14 16
1992 Jan 10 13 26
1992 Feb 9 11 19
1992 mar 23 12 18
1992 april 12 10 21
1992 may 17 9 10
1992 june 15 42 9
1992 june 16 9 26
1992 july 15 26 19
1993 Jan 18 19 20
1993 Feb 19 18 21
1993 mar 20 21 23
1993 april 21 10 19
1993 may 13 9 14
1993 june 14 23 23
1993 june 15 21 23
1993 july 16 10 22
For each company, I want to find the month and year in which it had its highest sales. For example, in 1990 company A had a highest sale of 18. I want to do this using pandas, but I don't understand how to proceed. Pointers would be appreciated.
PS: here is what I have done so far.
import pandas as pd
df = pd.read_csv('SAMPLE.csv')
num_of_rows = len(df.index)
years_list = []
months_list = []
company_list = df.columns[2:]
for each in df.columns[2:]:
    each = []
for i in range(0, num_of_rows):
    years_list.append(df[df.columns[0]][i])
    months_list.append(df[df.columns[1]][i])
years_list = list(set(years_list))
months_list = list(set(months_list))
for each in years_list:
    for c in company_list:
        print(df[(df.year == each)][c].max())
This gives me the biggest number per year for each company, but I don't know how to get the month and year as well.
Use a combination of idxmax() and loc to filter the dataframe:
In [36]:
import pandas as pd
import io
temp = """year month Company_A Company_B Company_C
1990 Jan 10 15 20
1990 Feb 11 14 21
1990 mar 13 8 23
1990 april 12 22 19
1990 may 15 12 18
1990 june 18 13 13
1990 june 12 14 15
1990 july 12 14 16
1991 Jan 11 16 13
1991 Feb 14 17 11
1991 mar 23 13 12
1991 april 23 21 10
1991 may 22 22 9
1991 june 24 20 32
1991 june 12 14 15
1991 july 21 14 16
1992 Jan 10 13 26
1992 Feb 9 11 19
1992 mar 23 12 18
1992 april 12 10 21
1992 may 17 9 10
1992 june 15 42 9
1992 june 16 9 26
1992 july 15 26 19
1993 Jan 18 19 20
1993 Feb 19 18 21
1993 mar 20 21 23
1993 april 21 10 19
1993 may 13 9 14
1993 june 14 23 23
1993 june 15 21 23
1993 july 16 10 22"""
df = pd.read_csv(io.StringIO(temp), sep=r'\s+')
# the call to .unique() is because the same row for A and C appears twice
df.loc[df[['Company_A', 'Company_B', 'Company_C']].idxmax().unique()]
Out[36]:
year month Company_A Company_B Company_C
13 1991 june 24 20 32
21 1992 june 15 42 9
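The result above returns the full rows, so the reader still has to work out which row belongs to which company. A possible follow-up, sketched below on a shortened copy of the data, builds one summary row per company with the year, month and value of its maximum. The names best_rows and summary are illustrative only.

```python
import io
import pandas as pd

# Shortened copy of the sample data (whitespace-separated).
temp = """year month Company_A Company_B Company_C
1990 june 18 13 13
1991 june 24 20 32
1992 june 15 42 9
1993 april 21 10 19
"""
df = pd.read_csv(io.StringIO(temp), sep=r"\s+")

companies = ["Company_A", "Company_B", "Company_C"]

# Row label of each company's maximum (first occurrence if there are ties).
best_rows = df[companies].idxmax()

# One row per company: when the maximum happened and how large it was.
summary = pd.DataFrame(
    {
        "year": df.loc[best_rows, "year"].to_numpy(),
        "month": df.loc[best_rows, "month"].to_numpy(),
        "sales": df[companies].max().to_numpy(),
    },
    index=companies,
)
print(summary)
```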