I have a pandas DataFrame like this
year id1 id2 jan jan1 jan2 feb feb1 feb2 mar mar1 mar2 ....
2018 01 10 3 30 31 2 23 25 7 52 53 ....
2018 01 20 ....
2018 02 10 ....
2018 02 20 ....
and I need this format
year month id1 id2 val val1 val2
2018 01 01 10 3 30 31
2018 02 01 10 2 23 25
2018 03 01 10 7 52 53
..........
As you can see, I have 3 values for each month, and in the result I only add a single month column plus 3 columns for the values. If there were only one value per month, I think I could use stack.
I wouldn't have any problem renaming the month columns to 01 01-1 01-2 (for January) or something like that to make this easier.
I'm also thinking of splitting the info into 3 different DataFrames, stacking them separately and then merging the results, or should I melt it?
Any ideas for achieving this easily?
using reshape and stack
pd.DataFrame(df.set_index(['year','id1','id2']).values.reshape(4,3,3).tolist(),  # 4 rows x 3 months x 3 values per month
             index=df.set_index(['year','id1','id2']).index,
             columns=[1,2,3])\
  .stack().apply(pd.Series).reset_index().rename(columns={'level_3':'month'})  # stack the month labels, expand each list of 3 values
Out[261]:
year id1 id2 month 0 1 2
0 2018 1 10 1 3 30 31
1 2018 1 10 2 2 23 25
2 2018 1 10 3 7 52 53
3 2018 1 20 1 3 30 31
4 2018 1 20 2 2 23 25
5 2018 1 20 3 7 52 53
6 2018 2 10 1 3 30 31
7 2018 2 10 2 2 23 25
8 2018 2 10 3 7 52 53
9 2018 2 20 1 3 30 31
10 2018 2 20 2 2 23 25
11 2018 2 20 3 7 52 53
So I renamed the header columns this way
01 01 01 02 02 02 03 03 03 ...
year id1 id2 val val1 val2 val val1 val2 val val1 val2 ....
2018 01 10 3 30 31 2 23 25 7 52 53 ....
2018 01 20 ....
2018 02 10 ....
2018 02 20 ....
in a file, and opened it this way
df = pd.read_csv('my_file.csv', header=[0, 1], index_col=[0, 1, 2], skipinitialspace=True)
df.columns = pd.MultiIndex.from_tuples(df.columns)
then, I actually only needed to stack it on level 0
df = df.stack(level=0)
and set the index names
df.index.names = ['year','id1','id2','month']
df = df.reset_index()
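The same MultiIndex-and-stack idea also works without writing the renamed header to a file. A sketch (assuming the month columns come in groups of three, in calendar order, and df is the original frame from the question):
import pandas as pd

tmp = df.set_index(['year', 'id1', 'id2'])
months = ['01', '02', '03']                    # extend to however many months the file has
vals = ['val', 'val1', 'val2']                 # the three value columns per month
tmp.columns = pd.MultiIndex.from_product([months, vals], names=['month', None])
out = tmp.stack(level=0).reset_index()         # columns: year, id1, id2, month, val, val1, val2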
Can anyone please elaborate, with a good example, on the difference between header and skiprows in the syntax of
pd.read_excel("name", header=number, skiprows=number)
You can follow this article, which explains the difference between the header and skiprows parameters with examples from the Olympics dataset, which can be downloaded here.
To summarize: the default behavior of pd.read_csv() is to read in all of the rows, which in the case of this dataset includes an unnecessary first row of column numbers (so the real header ends up as the first data row).
import pandas as pd
df = pd.read_csv('olympics.csv')
df.head()
0 1 2 3 4 ... 11 12 13 14 15
0 NaN № Summer 01 ! 02 ! 03 ! ... № Games 01 ! 02 ! 03 ! Combined total
1 Afghanistan (AFG) 13 0 0 2 ... 13 0 0 2 2
2 Algeria (ALG) 12 5 2 8 ... 15 5 2 8 15
3 Argentina (ARG) 23 18 24 28 ... 41 18 24 28 70
4 Armenia (ARM) 5 1 2 9 ... 11 1 2 9 12
However, the skiprows parameter lets you skip one or more rows when you read in the .csv file:
df1 = pd.read_csv('olympics.csv', skiprows = 1)
df1.head()
Unnamed: 0 № Summer 01 ! 02 ! ... 01 !.2 02 !.2 03 !.2 Combined total
0 Afghanistan (AFG) 13 0 0 ... 0 0 2 2
1 Algeria (ALG) 12 5 2 ... 5 2 8 15
2 Argentina (ARG) 23 18 24 ... 18 24 28 70
3 Armenia (ARM) 5 1 2 ... 1 2 9 12
4 Australasia (ANZ) [ANZ] 2 3 4 ... 3 4 5 12
And if you want to skip a bunch of different rows, you can do the following (notice the missing countries):
df2 = pd.read_csv('olympics.csv', skiprows = [0, 2, 3])
df2.head()
Unnamed: 0 № Summer 01 ! 02 ! ... 01 !.2 02 !.2 03 !.2 Combined total
0 Argentina (ARG) 23 18 24 ... 18 24 28 70
1 Armenia (ARM) 5 1 2 ... 1 2 9 12
2 Australasia (ANZ) [ANZ] 2 3 4 ... 3 4 5 12
3 Australia (AUS) [AUS] [Z] 25 139 152 ... 144 155 181 480
4 Austria (AUT) 26 18 33 ... 77 111 116 304
The header parameter tells pandas which row to use for the column names (and therefore where to start reading the data), which in the following case does the same thing as skiprows = 1:
# this gives the same result as df1 = pd.read_csv('olympics.csv', skiprows = 1)
df4 = pd.read_csv('olympics.csv', header = 1)
df4.head()
Unnamed: 0 № Summer 01 ! 02 ! ... 01 !.2 02 !.2 03 !.2 Combined total
0 Afghanistan (AFG) 13 0 0 ... 0 0 2 2
1 Algeria (ALG) 12 5 2 ... 5 2 8 15
2 Argentina (ARG) 23 18 24 ... 18 24 28 70
3 Armenia (ARM) 5 1 2 ... 1 2 9 12
4 Australasia (ANZ) [ANZ] 2 3 4 ... 3 4 5 12
However, you cannot use the header parameter to skip several arbitrary rows; you would not be able to replicate df2 using header alone. Hopefully this clears things up.
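For completeness (not covered in the article summarized above): skiprows also accepts a callable, which is another way to skip an arbitrary set of rows, and header accepts a list of row positions when a file has a multi-row header.
# sketch only, reusing olympics.csv from the examples above
df5 = pd.read_csv('olympics.csv', skiprows=lambda i: i in {0, 2, 3})  # skips the same rows as df2
# for a (hypothetical) file with two header rows, a list builds a column MultiIndex:
# df6 = pd.read_csv('two_header_rows.csv', header=[0, 1])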
I have a one_sec_flt DataFrame with 300,000+ points and a flask DataFrame with 230 points. Both DataFrames have Hour, Minute, and Second columns. I want to attach each flask row to the row in the one_sec_flt data with the same timestamp.
Flasks DataFrame
year month day hour minute second... gas1 gas2 gas3
0 2018 4 8 16 27 48... 10 25 191
1 2018 4 8 16 40 20... 45 34 257
...
229 2018 5 12 14 10 05... 3 72 108
one_sec_flt DataFrame
Year Month Day Hour Min Second... temp wind
0 2018 4 8 14 30 20... 300 10
1 2018 4 8 14 45 15... 310 8
...
305,212 2018 5 12 14 10 05... 308 24
I have this code I started with but I don't know how to append one DataFrame to another at that exact timestamp.
for i in range(len(flasks)):
    for j in range(len(one_sec_flt)):
        if (flasks.hour.iloc[i] == one_sec_flt.Hour.iloc[j]):
            if (flasks.minute.iloc[i] == one_sec_flt.Min.iloc[j]):
                if (flasks.second.iloc[i] == one_sec_flt.Sec.iloc[j]):
                    print('match')
My output goal would look like:
Year Month Day Hour Min Second... temp wind gas1 gas2 gas3
0 2018 4 8 14 30 20... 300 10 nan nan nan
1 2018 4 8 14 45 15... 310 8 nan nan nan
2 2018 4 8 15 15 47... ... ... nan nan nan
3 2018 4 8 16 27 48... ... ... 10 25 191
4 2018 4 8 16 30 11... ... ... nan nan nan
5 2018 4 8 16 40 20... ... ... 45 34 257
... ... ... ... ... ... ... ... ... ... ... ...
305,212 2018 5 12 14 10 05... 308 24 3 72 108
If you concatenate both DataFrames, Flasks and one_sec_flt, and then sort by the time columns, it might achieve what you are looking for (at least, if I understood the problem statement correctly).
Flasks
Out[13]:
year month day hour minute second
0 2018 4 8 16 27 48
1 2018 4 8 16 40 20
one_sec
Out[14]:
year month day hour minute second
0 2018 4 8 14 30 20
1 2018 4 8 14 45 15
df_res = pd.concat([Flasks,one_sec])
df_res
Out[16]:
year month day hour minute second
0 2018 4 8 16 27 48
1 2018 4 8 16 40 20
0 2018 4 8 14 30 20
1 2018 4 8 14 45 15
df_res.sort_values(by=['year','month','day','hour','minute','second'])
Out[17]:
year month day hour minute second
0 2018 4 8 14 30 20
1 2018 4 8 14 45 15
0 2018 4 8 16 27 48
1 2018 4 8 16 40 20
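If the goal is the layout shown in the question (every one_sec_flt row kept, with the gas columns filled only where a flask sample matches), a left merge on the time columns is another option. A sketch, using the column names from the question's sample data:
merged = one_sec_flt.merge(
    flasks,
    how='left',
    left_on=['Year', 'Month', 'Day', 'Hour', 'Min', 'Second'],
    right_on=['year', 'month', 'day', 'hour', 'minute', 'second'],
).drop(columns=['year', 'month', 'day', 'hour', 'minute', 'second'])  # drop the duplicated key columns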
I have a dataframe with 3 columns: SoldDate, Model and TotalSoldCount. How do I create a new column, 'CountSoldbyMonth', which will give the count of each of the many models sold monthly? A screenshot describing the problem is given.
The 'CountSoldbyMonth' should always be less than the 'TotalSoldCount'.
I am new to Python.
Date Model TotalSoldCount
Jan 19 A 4
Jan 19 A 4
Jan 19 A 4
Jan 19 B 6
Jan 19 C 2
Jan 19 C 2
Feb 19 A 4
Feb 19 B 6
Feb 19 B 6
Feb 19 B 6
Mar 19 B 6
Mar 19 B 6
The new df should look like this.
Date Model TotalSoldCount CountSoldbyMonth
Jan 19 A 4 3
Jan 19 A 4 3
Jan 19 A 4 3
Jan 19 B 6 1
Jan 19 C 2 2
Jan 19 C 2 2
Feb 19 A 4 1
Feb 19 B 6 3
Feb 19 B 6 3
Feb 19 B 6 3
Mar 19 B 6 2
Mar 19 B 6 2
I tried doing
df['CountSoldbyMonth'] = df.groupby(['date','model']).totalsoldcount.transform('sum')
but it is generating a different value.
Suppose you have this data set:
date model totalsoldcount
0 Jan 19 A 110
1 Jan 19 A 110
2 Jan 19 A 110
3 Jan 19 B 50
4 Jan 19 C 70
5 Jan 19 C 70
6 Feb 19 A 110
7 Feb 19 B 50
8 Feb 19 B 50
9 Feb 19 B 50
10 Mar 19 B 50
11 Mar 19 B 50
And you want to define a new column, countsoldbymonth. You can group by the date and model columns, sum totalsoldcount with a transform, and create the new column:
s['countsoldbymonth'] = s.groupby([
    'date',
    'model'
]).totalsoldcount.transform('sum')
print(s)
date model totalsoldcount countsoldbymonth
0 Jan 19 A 110 330
1 Jan 19 A 110 330
2 Jan 19 A 110 330
3 Jan 19 B 50 50
4 Jan 19 C 70 140
5 Jan 19 C 70 140
6 Feb 19 A 110 110
7 Feb 19 B 50 150
8 Feb 19 B 50 150
9 Feb 19 B 50 150
10 Mar 19 B 50 100
11 Mar 19 B 50 100
Or, if you just want to see the sums without creating a new column you can use sum instead of transform like this:
print(s.groupby([
    'date',
    'model'
]).totalsoldcount.sum())
date model
Feb 19 A 110
B 150
Jan 19 A 330
B 50
C 140
Mar 19 B 100
Edit
If you just want to know how many sales were made in each month, you can do the same groupby, but instead of sum use count:
df['CountSoldByMonth'] = df.groupby([
    'Date',
    'Model'
]).TotalSoldCount.transform('count')
print(df)
Date Model TotalSoldCount CountSoldByMonth
0 Jan 19 A 4 3
1 Jan 19 A 4 3
2 Jan 19 A 4 3
3 Jan 19 B 6 1
4 Jan 19 C 2 2
5 Jan 19 C 2 2
6 Feb 19 A 4 1
7 Feb 19 B 6 3
8 Feb 19 B 6 3
9 Feb 19 B 6 3
10 Mar 19 B 6 2
11 Mar 19 B 6 2
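A small variation (not part of the original answer): if TotalSoldCount could contain NaNs, transform('size') counts the rows in each group regardless of missing values, whereas 'count' skips them.
df['CountSoldByMonth'] = df.groupby(['Date', 'Model'])['TotalSoldCount'].transform('size')  # counts rows, NaNs included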
It's easier to help if you give code that lets the user experiment. In this case, I'd think taking your dataframe (df) and doing the following should work:
df['CountSoldbyMonth'] = df.groupby(['Date','Model'])['TotalSoldCount'].transform('sum')
I have a pandas dataframe such as below:
id price hour minute date
1 10 03 07 01/11
2 4 03 59 01/11
3 5 02 21 01/11
4 6 03 47 02/09
5 1 04 28 02/04
6 7 05 50 01/11
7 3 02 01 01/11
8 2 01 23 01/11
...
and I want an output like:
id price hour minute date cumprice
1 10 03 07 01/11 19
2 4 03 59 01/11 14
3 5 02 21 01/11 20
4 6 03 47 02/09 6
5 1 04 28 02/04 1
6 7 05 50 01/11 7
7 3 02 01 01/11 10
8 2 01 23 01/11 10
...
I don't have any idea how to do this job fast.
Could anybody help me do this quickly?
You could groupby the date and use transform to add a column with the sum of the prices per group:
df['cumsprice'] = df.groupby('date').price.transform('sum')
id price hour minute date cumsprice
0 1 10 3 7 01/11 19
1 2 4 3 59 01/11 19
2 3 5 2 21 01/11 19
3 4 6 3 47 02/09 6
4 5 1 4 28 02/04 1
Update
Update after the expected output was changed. In order to group by consecutive runs of equal dates, you can create a custom grouper by checking on which rows the date changes and taking the cumsum of that:
g = df.date.ne(df.date.shift(1))  # True wherever the date differs from the previous row
df['cumprice'] = df.groupby(g.cumsum()).price.transform('sum')  # cumsum labels each consecutive run of equal dates
print(df)
id price hour minute date cumsprice cumprice
0 1 10 3 7 01/11 31 19.0
1 2 4 3 59 01/11 31 19.0
2 3 5 2 21 01/11 31 19.0
3 4 6 3 47 02/09 6 6.0
4 5 1 4 28 02/04 1 1.0
5 6 12 5 50 01/11 31 12.0
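As a quick sanity check (not part of the original answer), you can also look at the per-run totals produced by that grouper directly:
print(df.groupby(g.cumsum()).price.sum())  # one total per run of consecutive equal dates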
I have the following dates dataframe:
dates
0 2012 10 4
1
2 2012 01 19
3 20 6 11
4 20 10 7
5 19 11 12
6
7 2013 03 19
8 2016 2 5
9 2011 2 19
10
11 2011 05 23
12 2012 04 5
How can I normalize the dates column into:
dates
0 2012 10 04
1
2 2012 01 19
3 2020 06 11
4 2020 10 07
5 2019 11 12
6
7 2013 03 19
8 2016 02 05
9 2011 02 19
10
11 2011 05 23
12 2012 04 05
I tried with regex, splitting and tweaking each column separately. However, I am complicating the task. Is it possible to normalize this into the latter dataframe? The rule is to zero-pad the month or day if it has a single digit, and to prepend 20 to the year if it is incomplete; the format is yyyymmdd.
Solution:
x = (df.loc[df.dates.str.contains(r'\d+\s*\d+\s*\d+'), 'dates']  # keep only the rows that contain a date
     .str.split(expand=True)                                     # split on whitespace into three columns
     .rename(columns={0:'year',1:'month',2:'day'})
     .astype(int)
    )
x.loc[x.year <= 50, 'year'] += 2000                              # treat short years as 20xx
df['new'] = pd.to_datetime(x, errors='coerce').dt.strftime('%Y%m%d')  # to_datetime accepts year/month/day columns
Result:
In [148]: df
Out[148]:
dates new
0 2012 10 4 20121004
1 NaN
2 2012 01 19 20120119
3 20 6 11 20200611
4 20 10 7 20201007
5 19 11 12 20191112
6 NaN
7 2013 03 19 20130319
8 2016 2 5 20160205
9 2011 2 19 20110219
10 NaN
11 2011 05 23 20110523
12 2012 04 5 20120405
Explanation:
In [149]: df.loc[df.dates.str.contains(r'\d+\s*\d+\s*\d+'), 'dates']
Out[149]:
0 2012 10 4
2 2012 01 19
3 20 6 11
4 20 10 7
5 19 11 12
7 2013 03 19
8 2016 2 5
9 2011 2 19
11 2011 05 23
12 2012 04 5
Name: dates, dtype: object
In [152]: (df.loc[df.dates.str.contains(r'\d+\s*\d+\s*\d+'), 'dates']
...: .str.split(expand=True)
...: .rename(columns={0:'year',1:'month',2:'day'})
...: .astype(int))
Out[152]:
year month day
0 2012 10 4
2 2012 1 19
3 20 6 11
4 20 10 7
5 19 11 12
7 2013 3 19
8 2016 2 5
9 2011 2 19
11 2011 5 23
12 2012 4 5
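A string-only variant (a sketch, assuming every non-empty value splits into exactly three whitespace-separated parts) pads each part directly instead of going through to_datetime:
parts = df['dates'].str.split(expand=True)               # year / month / day as strings
mask = parts.notna().all(axis=1)                         # skip the empty rows
p = parts[mask]
year = p[0].where(p[0].str.len() == 4, '20' + p[0])      # '20' -> '2020', '19' -> '2019'
df.loc[mask, 'new'] = year + p[1].str.zfill(2) + p[2].str.zfill(2)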