I have the following CSV:
Name Date Qty Date Qty Date Qty
---------------------------------------------------
ABC Jan 2023 10 Feb 2023 11 Mar 2023 12
XYZ Jan 2023 20 Feb 2023 21 Mar 2023 22
I want the output as follows, as a CSV/dataframe:
Name Date Qty
---------------------
ABC Jan 2023 10
ABC Feb 2023 11
ABC Mar 2023 12
XYZ Jan 2023 20
XYZ Feb 2023 21
XYZ Mar 2023 22
How do I achieve this result?
A bit complicated, but it does the job. You can execute it step by step to view the transformation:
>>> (df.melt('Name').assign(row=lambda x: x.groupby('variable').cumcount())
        .pivot(index=['row', 'Name'], columns='variable', values='value')
        .reset_index('Name').rename_axis(index=None, columns=None))
Name Date Qty
0 ABC Jan 2023 10
1 XYZ Jan 2023 20
2 ABC Feb 2023 11
3 XYZ Feb 2023 21
4 ABC Mar 2023 12
5 XYZ Mar 2023 22
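For completeness, a minimal setup for the snippet above (a sketch; the frame is reconstructed from the question, duplicate 'Date'/'Qty' labels included):
import pandas as pd
# rebuild the question's wide frame, keeping the duplicate column labels
df = pd.DataFrame(
    [['ABC', 'Jan 2023', 10, 'Feb 2023', 11, 'Mar 2023', 12],
     ['XYZ', 'Jan 2023', 20, 'Feb 2023', 21, 'Mar 2023', 22]],
    columns=['Name', 'Date', 'Qty', 'Date', 'Qty', 'Date', 'Qty'],
)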
A less streamlined solution than Corralien's; it also uses melt and pivot.
import pandas as pd
import io
#-----------------------------------------------#
#Recreate OP's table with duplicate column names#
#-----------------------------------------------#
df = pd.read_csv(io.StringIO("""
ABC Jan-2023 10 Feb-2023 11 Mar-2023 12
XYZ Jan-2023 20 Feb-2023 21 Mar-2023 22
"""),header=None,delim_whitespace=True)
df.columns = ['Name','Date','Qty','Date','Qty','Date','Qty']
#-----------------#
#Start of solution#
#-----------------#
#melt from wide to long (maintains order)
melted_df = df.melt(
    id_vars='Name',
    var_name='col',
    value_name='val',
)
#add a number for Date1/Date2/Date3 to keep track of Qty1/Qty2/Qty3 etc
melted_df['col_number'] = melted_df.groupby(['Name','col']).cumcount()
#pivot back to wide form
wide_df = melted_df.pivot(
    index=['Name','col_number'],
    columns='col',
    values='val',
).reset_index().drop(columns=['col_number'])
wide_df.columns.name = None #remove column index name
#Final output
print(wide_df)
Output
Name Date Qty
0 ABC Jan-2023 10
1 ABC Feb-2023 11
2 ABC Mar-2023 12
3 XYZ Jan-2023 20
4 XYZ Feb-2023 21
5 XYZ Mar-2023 22
I'm trying to create a table like the one in this example:
Example_picture
My code:
data = list(range(39)) # mockup for 39 values
columns = pd.MultiIndex.from_product([['1', '2', '6'], [str(year) for year in range(2007, 2020)]],
                                     names=['Factor', 'Year'])
df = pd.DataFrame(data, index=['World'], columns=columns)
print(df)
But I get an error:
Shape of passed values is (39, 1), indices imply (1, 39)
What did I do wrong?
You need to wrap the data in a list to force the DataFrame constructor to interpret the list as a row:
data = list(range(39))
columns = pd.MultiIndex.from_product([['1', '2', '6'],
                                      [str(year) for year in range(2007, 2020)]],
                                     names=['Factor', 'Year'])
df = pd.DataFrame([data], index=['World'], columns=columns)
Output:
Factor 1 2 6
Year 2007 2008 2009 2010 2011 2012 2013 2014 2015 2016 2017 2018 2019 2007 2008 2009 2010 2011 2012 2013 2014 2015 2016 2017 2018 2019 2007 2008 2009 2010 2011 2012 2013 2014 2015 2016 2017 2018 2019
World 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38
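Alternatively (a minimal sketch), you can make the row orientation explicit by reshaping with numpy instead of wrapping the list:
import numpy as np
# a 1 x 39 array is unambiguously one row with 39 columns
df = pd.DataFrame(np.asarray(data).reshape(1, -1), index=['World'], columns=columns)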
I have data in a Google Sheet with more than 200 columns, in the format below. I want to move rows 1 and 2 into the columns to create a dataframe.
0 1 2 3 4
0 profit sales sales
1 2019 2019 2020
2 Name Currency Jan Feb Mar
3 Ohashi JPY 1 22 43
4 Ohashi JPY 2 23 44
5 Lee USD 3 24 45
6 Lee USD 4 25 46
I tried:
headers = data[2]
df = pd.DataFrame.from_records(data[2:], columns=headers)
The result ignores the first two rows. I'm looking for a way to bring the first two rows into the columns as well.
Required output (I will group by year afterwards):
Year Month Name Transation Currency Value
2019 Jan Ohashi profit JPY 1
2019 Jan Ohashi profit JPY 2
2019 Jan Lee profit USD 3
2019 Jan Lee profit USD 4
2019 Feb Ohashi sales JPY 22
2019 Feb Ohashi sales JPY 23
2019 Feb Lee sales USD 24
2019 Feb Lee sales USD 25
2020 Mar Ohashi sales JPY 43
2020 Mar Ohashi sales JPY 44
2020 Mar Lee sales USD 45
2020 Mar Lee sales USD 46
First create a MultiIndex for df.index and df.columns in read_csv:
df = pd.read_csv(file, index_col=[0,1], header=[0,1,2,3])
# remove the top level of the column MultiIndex
df = df.droplevel(0, axis=1)
print (df)
profit sales
2019 2019 2020
Jan Feb Mar
Ohashi JPY 1 22 43
JPY 2 23 44
Lee USD 3 24 45
USD 4 25 46
print (df.columns)
MultiIndex([('profit', '2019', 'Jan'),
( 'sales', '2019', 'Feb'),
( 'sales', '2020', 'Mar')],
)
print (df.index)
MultiIndex([('Ohashi', 'JPY'),
('Ohashi', 'JPY'),
( 'Lee', 'USD'),
( 'Lee', 'USD')],
)
EDIT: if the real data contains duplicated column triples, aggregate them by summing first:
df = df.astype(int).groupby(level=[0,1,2], axis=1).sum()
Then it is possible to use DataFrame.stack, reorder the levels with DataFrame.reorder_levels, and set the index names:
df = (df.stack([0,1,2])
        .reorder_levels([3,4,0,2,1])
        .rename_axis(['Year','Month','Name','Transation','Currency'])
        .reset_index(name='Value'))
print (df)
EDIT:
print (df)
0 1 2 3 4
0 profit sales sales
1 2019 2019 2020
2 Name Currency Jan Feb Mar
3 Ohashi JPY 1 22 43
4 Ohashi JPY 2 23 44
5 Lee USD 3 24 45
6 Lee USD 4 25 46
Get 3rd row for index names:
names = df.iloc[2]
print (names)
0 Name
1 Currency
2 Jan
3 Feb
4 Mar
Name: 2, dtype: object
Convert first 2 columns to index:
df = df.set_index([0,1])
Convert first 3 rows to MultiIndex columns:
df.columns = pd.MultiIndex.from_frame(df.iloc[:3].T, names=['Transation','Year','Month'])
Remove the first 3 rows and set the index names from the names variable:
df = df.iloc[3:].rename_axis(names.iloc[:df.index.nlevels].tolist(), axis=0)
print (df)
Transation profit sales
Year 2019 2019 2020
Month Jan Feb Mar
Name Currency
Ohashi JPY 1 22 43
JPY 2 23 44
Lee USD 3 24 45
USD 4 25 46
print (df.columns)
MultiIndex([('profit', '2019', 'Jan'),
( 'sales', '2019', 'Feb'),
( 'sales', '2020', 'Mar')],
names=['Transation', 'Year', 'Month'])
print (df.index)
MultiIndex([('Ohashi', 'JPY'),
('Ohashi', 'JPY'),
( 'Lee', 'USD'),
( 'Lee', 'USD')],
names=['Name', 'Currency'])
Reshape; the real data has many columns, so the reorder_levels step is omitted:
df = df.stack([0,1,2]).reset_index(name='Value')
print (df)
Name Currency Transation Year Month Value
0 Ohashi JPY profit 2019 Jan 1
1 Ohashi JPY sales 2019 Feb 22
2 Ohashi JPY sales 2020 Mar 43
3 Ohashi JPY profit 2019 Jan 2
4 Ohashi JPY sales 2019 Feb 23
5 Ohashi JPY sales 2020 Mar 44
6 Lee USD profit 2019 Jan 3
7 Lee USD sales 2019 Feb 24
8 Lee USD sales 2020 Mar 45
9 Lee USD profit 2019 Jan 4
10 Lee USD sales 2019 Feb 25
11 Lee USD sales 2020 Mar 46
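Note: in pandas >= 2.1, stacking several levels at once this way raises a FutureWarning; the new implementation can be opted into explicitly (a sketch of the same reshape):
df = df.stack([0, 1, 2], future_stack=True).reset_index(name='Value')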
I'm quite new to programming, and I'm using Python for data manipulation and analysis.
I have a dataframe that looks like:
Brand Date Unit
A 1/1/19 10
B 3/1/19 11
A 11/1/19 15
B 11/1/19 5
A 1/1/20 10
A 9/2/19 18
B 12/2/19 11
B 19/2/19 8
B 1/1/20 5
And I would like to group by month, year and Brand. If it helps, I also have separate columns for Month and Year. The expected result should look like this:
Brand Date Unit
A Jan 2019 25
B Jan 2019 16
A Feb 2019 18
B Feb 2019 19
A Jan 2020 8
B Feb 2020 5
I tried adapting an answer from someone else's question:
per = df.Date.dt.to_period("M")
g = df.groupby(per,'Brand')
g.sum()
but I get:
ValueError: No axis named Brand for object type <class 'pandas.core.frame.DataFrame'>
and I don't have any idea how to solve this.
I used to do this with dictionaries, selecting each month/year individually, grouping by sum, and then building the dictionary, but that seems like brute force and it won't help when the df gets updated with new data.
What's more, maybe my approach to the situation is wrong. In the end I'd like to have a df looking like:
Brand Jan 19 Feb 19 Jan 20
A 25 18 8
B 16 19 5
Use pandas.to_datetime and pandas.DataFrame.pivot_table:
df["Date"] = pd.to_datetime(df["Date"], dayfirst=True).dt.strftime("%b %Y")
new_df = df.pivot_table(index="Brand", columns="Date", aggfunc="sum")
print(new_df)
Output:
Unit
Date Feb 2019 Jan 2019 Jan 2020
Brand
A 18 25 10
B 19 16 5
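If you want plain date labels rather than the leftover 'Unit' level in the column index, you can drop it afterwards (a small follow-up sketch, using the new_df from above):
new_df.columns = new_df.columns.droplevel(0)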
You were close, DataFrame.groupby wants a list of groupers, not bare arguments.
Here's how I did it:
import pandas
from io import StringIO
csv = StringIO("""\
Brand Date Unit
A 1/1/19 10
B 3/1/19 11
A 11/1/19 15
B 11/1/19 5
A 1/1/20 10
A 9/2/19 18
B 12/2/19 11
B 19/2/19 8
B 1/1/20 5
""")
(
    pandas.read_csv(csv, parse_dates=['Date'], sep=r'\s+', dayfirst=True)
    .groupby(['Brand', pandas.Grouper(key='Date', freq='1M')])
    .sum()
    .reset_index()
)
And that gives me:
Brand Date Unit
0 A 2019-01-31 25
1 A 2019-02-28 18
2 A 2020-01-31 10
3 B 2019-01-31 16
4 B 2019-02-28 19
5 B 2020-01-31 5
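If you prefer the 'Jan 2019' labels from the question, you can format the Date column afterwards (a sketch, assuming the chained result above is assigned to out):
out['Date'] = out['Date'].dt.strftime('%b %Y')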
I have a pandas data frame with a MultiIndex (id and datetime) and one column named X1.
X1
id datetime
a1ssjdldf 2019 Jul 10 2
2019 Jul 11 22
2019 Jul 12 21
r2dffs 2019 Jul 10 14
2019 Jul 11 13
2019 Jul 12 11
I want to create a new variable X2 where the corresponding value is the difference between the row's X1 value and the previous row's X1 value. But every time a new id starts, the value has to restart from zero.
For example:
X1 X2
id datetime
a1ssjdldf 2019 Jul 10 2 0
2019 Jul 11 22 20
2019 Jul 12 21 -1
r2dffs 2019 Jul 10 14 0
2019 Jul 11 13 -1
2019 Jul 12 11 -2
Use DataFrameGroupBy.diff grouped by the first index level, and replace the missing values with Series.fillna:
df['X2'] = df.groupby(level=0)['X1'].diff().fillna(0, downcast='int')
print (df)
X1 X2
id datetime
a1ssjdldf 2019 Jul 10 2 0
2019 Jul 11 22 20
2019 Jul 12 21 -1
r2dffs 2019 Jul 10 14 0
2019 Jul 11 13 -1
2019 Jul 12 11 -2
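In recent pandas versions the downcast argument of fillna is deprecated; an equivalent without it (a sketch) is:
df['X2'] = df.groupby(level=0)['X1'].diff().fillna(0).astype(int)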
I have a dataframe that looks like this:
Date_1 Date_2
0 5 Dec 2017 5 Dec 2017
1 14 Dec 2017 14 Dec 2017
2 15 Dec 2017 15 Dec 2017
3 18 Dec 2017 21 Dec 2017 18 Dec 2017 21 Dec 2017
4 22 Dec 2017 22 Dec 2017
Conditions to be checked:
Check whether any row contains two dates (like the 3rd row); if present, split it into two separate rows.
Convert both columns to datetime.
I am trying to apply the datetime conversion like below:
df['Date_1'] = pd.to_datetime(df['Date_1'], format='%d %b %Y')
But I get the error below:
ValueError: unconverted data remains:
Expected Output:
Date_1 Date_2
0 5 Dec 2017 5 Dec 2017
1 14 Dec 2017 14 Dec 2017
2 15 Dec 2017 15 Dec 2017
3 18 Dec 2017 18 Dec 2017
4 21 Dec 2017 21 Dec 2017
5 22 Dec 2017 22 Dec 2017
After using a regex with findall to extract the dates, your problem becomes an unnesting problem:
s=df.apply(lambda x : x.str.findall(r'((?:\d{,2}\s)?(?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)[a-z]*(?:-|\.|\s|,)\s?\d{,2}[a-z]*(?:-|,|\s)?\s?\d{,4})'))
unnesting(s,['Date_1','Date_2']).apply(pd.to_datetime)
Out[82]:
Date_1 Date_2
0 2017-12-05 2017-12-05
1 2017-12-14 2017-12-14
2 2017-12-15 2017-12-15
3 2017-12-18 2017-12-18
3 2017-12-21 2017-12-21
4 2017-12-22 2017-12-22
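The unnesting helper is not defined in the answer above; a minimal sketch that matches the usage shown, assuming every exploded column holds equal-length lists per row:
import numpy as np
import pandas as pd

def unnesting(df, explode):
    # repeat each index label once per list element in the first exploded column
    idx = df.index.repeat(df[explode[0]].str.len())
    # flatten every exploded column and reattach the repeated index
    out = pd.DataFrame({col: np.concatenate(df[col].to_numpy()) for col in explode},
                       index=idx)
    # keep any non-exploded columns
    return out.join(df.drop(columns=explode), how='left')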