Melting Transformation - python

I work with financial files that are organized with dates as the columns, like the example table below. However, I need to transform a table like this into one with the columns Name, Date, Apples, Oranges. How would you do this using Python, Power Query, or Excel?
Type     Name  Jan-21  Feb-21  Mar-21
Apples   John  $1.20   $1.05   $1.65
Oranges  John  $1.42   $1.15   $1.77
Apples   Jim   $1.60   $1.15   $1.85
Oranges  Jim   $1.62   $1.45   $1.37
I want the table to look like this:
Name  Dates   Apples  Oranges
John  Jan-21  $1.20   $1.42
John  Feb-21  $1.05   $1.15
Jim   Jan-21  $1.60   $1.62
Jim   Feb-21  $1.15   $1.45

Solved my own problem using Power Query (within Power BI). I just unpivoted all columns other than Type and Name, then pivoted the Type column using the new "Value" column. Also, this operation is apparently called melting, which I wasn't familiar with.
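For the pandas route the question also asks about, here is a minimal sketch (column names taken from the example table; pandas >= 1.1 assumed for the multi-column pivot). melt does the unpivot, then pivot restores one column per Type:

import pandas as pd

df = pd.DataFrame({
    'Type':   ['Apples', 'Oranges', 'Apples', 'Oranges'],
    'Name':   ['John', 'John', 'Jim', 'Jim'],
    'Jan-21': ['$1.20', '$1.42', '$1.60', '$1.62'],
    'Feb-21': ['$1.05', '$1.15', '$1.15', '$1.45'],
    'Mar-21': ['$1.65', '$1.77', '$1.85', '$1.37'],
})

# unpivot the date columns into long form (the "melt" step)
long_df = df.melt(id_vars=['Type', 'Name'], var_name='Date', value_name='Value')

# pivot Type back out so Apples and Oranges become columns again
result = (long_df
          .pivot(index=['Name', 'Date'], columns='Type', values='Value')
          .reset_index())
result.columns.name = None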

Related

Add column to DataFrame and assign number to each row

I have the following table
Father  Son    Year
James   Harry  1999
James   Alfi   2001
Corey   Kyle   2003
I would like to add a fourth column that makes the table look like the one below. It should show which child of each father was born first, second, third, and so on. How can I do that?
Father  Son    Year  Child
James   Harry  1999  1
James   Alfi   2001  2
Corey   Kyle   2003  1
Here is one way to do it, using cumcount:
# group by Father and take a cumcount, offset by 1
df['Child'] = df.groupby('Father')['Son'].cumcount() + 1
df
  Father    Son  Year  Child
0  James  Harry  1999      1
1  James   Alfi  2001      2
2  Corey   Kyle  2003      1
This assumes that df is sorted by Father and Year. If not, then:
df['Child'] = df.sort_values(['Father', 'Year']).groupby('Father')['Son'].cumcount() + 1
df
Here is an idea for solving this using the groupby and cumsum functions.
This assumes that the rows are ordered so that a younger sibling always appears below an elder one, and that all children of the same father form a contiguous block of rows.
Assume we have the following setup:
import pandas as pd

df = pd.DataFrame({'Father': ['James', 'James', 'Corey'],
                   'Son': ['Harry', 'Alfi', 'Kyle'],
                   'Year': [1999, 2001, 2003]})
Then here is the trick: we group the siblings with the same father into a groupby object and compute a cumulative sum of ones to assign a sequential number to each row.
df['temp_column'] = 1
df['Child'] = df.groupby('Father')['temp_column'].cumsum()
df = df.drop(columns='temp_column')
The result would look like this:
  Father    Son  Year  Child
0  James  Harry  1999      1
1  James   Alfi  2001      2
2  Corey   Kyle  2003      1
Now, to make the solution more general, reorder the rows to satisfy the preconditions before applying the solution, and then, if necessary, restore the DataFrame to its original order, as sketched below.
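A sketch of that more general version (assuming the same df as above, but in arbitrary row order):

df = df.sort_values(['Father', 'Year'])           # satisfy the precondition
df['temp_column'] = 1
df['Child'] = df.groupby('Father')['temp_column'].cumsum()
df = df.drop(columns='temp_column').sort_index()  # restore the original order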

Python: remove text if it is the same as in another column

I want to remove the text in one column of my dataframe when it starts with the same text that is in another column.
Example of dataframe:
name        var1
John Smith  John Smith Hello world
Mary Jane   Mary Jane Python is cool
James Bond  My name is James Bond
Peter Pan   Nothing happens here
Dataframe that I want:
name        var1
John Smith  Hello world
Mary Jane   Python is cool
James Bond  My name is James Bond
Peter Pan   Nothing happens here
Something as simple as:
df[~df.var1.str.contains(df.var1)]
does not work. How should I write my Python code?
Try using apply with a lambda:
df["var1"] = df.apply(
    lambda x: x["var1"][len(x["name"]):].strip()
    if x["name"] == x["var1"][:len(x["name"])]
    else x["var1"],
    axis=1,
)
How about this? Using str.removeprefix (Python 3.9+) so the name is only removed when var1 actually starts with it:
df['var1'] = [df.loc[i, 'var1'].removeprefix(df.loc[i, 'name']).strip() for i in df.index]

How to delete Pandas rows that have been seen before

If I have a table as follows in a Pandas DataFrame:
Date Name
15/12/01 John Doe
15/12/01 John Doe
15/12/01 John Doe
15/12/02 Mary Jean
15/12/02 Mary Jean
15/12/02 Mary Jean
I would like to delete all instances of John Doe/Mary Jean (or whatever name may be there) with the same date and only keep the latest one. After the operation it would look like this:
Date Name
15/12/01 John Doe
15/12/02 Mary Jean
Here the third instances of John Doe and Mary Jean have been kept and the rest deleted. How could I do this in an efficient and fast way in Pandas?
Thanks!
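A minimal sketch of one common approach, assuming the rows are already in chronological order as in the example: drop_duplicates on the Date/Name pair with keep='last' retains only the latest occurrence of each.

import pandas as pd

df = pd.DataFrame({
    'Date': ['15/12/01', '15/12/01', '15/12/01',
             '15/12/02', '15/12/02', '15/12/02'],
    'Name': ['John Doe', 'John Doe', 'John Doe',
             'Mary Jean', 'Mary Jean', 'Mary Jean'],
})

# keep only the last row for each (Date, Name) combination
deduped = df.drop_duplicates(subset=['Date', 'Name'], keep='last')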

Drop duplicate rows from a pandas DataFrame whose timestamps are within a specified range or duration

I have a DataFrame like this:
Subject Verb Object Date
---------------------------------
Bill Ate Food 7/11/2015
Steve Painted House 8/12/2011
Bill Ate Food 7/13/2015
Steve Painted House 8/25/2011
I would like to drop all duplicates, where a duplicate is defined as having the same Subject, Verb, Object, and falls within an X day range (in my example: 5 days).
Subject Verb Object Date
---------------------------------
Bill Ate Food 7/11/2015
Steve Painted House 8/12/2011
Steve Painted House 8/25/2011
Neither instance of "Steve - Painted - House" is removed because they are outside of a 5 day window.
I know I can do this using some data structures and the iterrows method of the DataFrame, but is there a way to do this using Pandas drop_duplicates?
Use duplicated + diff in conjunction with groupby to figure out what rows you want to remove.
c = ['Subject', 'Verb', 'Object']

def f(x):
    # within each Subject/Verb/Object group, flag rows that repeat the key
    # and fall within 5 days of the previous occurrence
    return x[c].duplicated() & x.Date.diff().dt.days.lt(5)

df['Date'] = pd.to_datetime(df['Date'])  # diff().dt.days needs datetimes
df = df.sort_values(c + ['Date'])        # dates must be ordered within groups
df[~df.groupby(c).apply(f).values]
Subject Verb Object Date
0 Bill Ate Food 2015-07-11
1 Steve Painted House 2011-08-12
3 Steve Painted House 2011-08-25

How to aggregate by month of a datetime in a pandas DataFrame?

I have the table below in a Pandas dataframe:
name birth
jack 1989-11-17
joe 1988-09-10
ben 1980-10-20
kate 1985-05-15
nichos 1986-07-05
john 1989-11-12
tom 1980-10-25
jason 1985-05-21
eron 1985-07-10
yun 1989-11-05
kung 1986-07-01
I want to do some aggregation by the month of birth; the results should look like this:
month    cnt
1989-11    3
1988-09    1
1986-07    2
1985-07    1
1985-05    2
1980-10    2
Is there any convenient way of doing this?
Many thanks!
Make your data into a time series with birth as a DatetimeIndex and then call resample:
s.resample("M").count()
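With current pandas, a fuller sketch (data taken from the question) that avoids emitting empty months by grouping on the year-month period directly:

import pandas as pd

df = pd.DataFrame({
    'name': ['jack', 'joe', 'ben', 'kate', 'nichos', 'john',
             'tom', 'jason', 'eron', 'yun', 'kung'],
    'birth': pd.to_datetime([
        '1989-11-17', '1988-09-10', '1980-10-20', '1985-05-15',
        '1986-07-05', '1989-11-12', '1980-10-25', '1985-05-21',
        '1985-07-10', '1989-11-05', '1986-07-01',
    ]),
})

# group on the year-month period of each birth date and count rows
cnt = df.groupby(df['birth'].dt.to_period('M')).size()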
