Python - rearranging dataframe data - python

I wanna rearrange my dataframe from the left one to the right table, like I show you in the next picture:
df = pd.DataFrame({
"Unnamed:0": ["Entity","","Var1","Var2","Var3","Var4"],
"Unnamed:1": ["A","X","0.45","0.14","0.16","0.28"],
"Unnamed:2": ["A","Y","0.66","0.55","0.39","0.49"],
"Unnamed:3": ["A","Z","0.3","0.24","0.31","0.13"],
"Unnamed:4": ["B","X","0.22","0.08","0.74","0.41"],
"Unnamed:5": ["B","Y","0.94","0.47","0.17","0.16"],
"Unnamed:6": ["B","Z","0.76","0.4","0.93","0.15"],
"Unnamed:7": ["C","X","0.4","0.76","0.71","0.01"],
"Unnamed:8": ["C","Y","0.86","1","0.26","0.32"],
"Unnamed:9": ["C","Z","0.35","0.1","0.36","0.4"],
})
I try using pd.melt, but I can't get what I want. Ty in advance

As your dataframe is not clean (the 2 first rows are a multiindex column name), you can first create the inner dataframe before melting it :
new_df = pd.DataFrame(df.iloc[2:,1:]).set_index(df.iloc[2:,0])
new_df.columns = pd.MultiIndex.from_frame(df.iloc[:2,1:].T)
new_df.melt(ignore_index=False).reset_index()

Related

Convert column to dataframe column geaders

I have the following pandas dataframe
I would like it to be converted to a pandas dataframe with one row. Is there a simple way to do it. I tried pivot but was getting weird results.
You can pivot, swap the level of columns names, shift values up to fill NaN values and flatten column names:
out = df.pivot(columns='Study Identification').swaplevel(0,1,axis=1).apply(lambda x: pd.Series(x.dropna().values)).fillna('')
out.columns = s.columns.map(''.join)
So in your case reshape the df with unstack
s = df.set_index('A',append=True).unstack(level=1).swaplevel(0,1,axis=1)
s.columns = s.columns.map(''.join)

Subtract dataframes with completely different row names and column names

My dataframe 1 looks like this:
windcodes
name
yield
perp
163197.SH
shangguo comp
2.9248
NO
154563.SH
guosheng comp
2.886
Yes
789645.IB
guoyou comp
3.418
NO
My dataframe 2 looks like this
windcodes
CALC
1202203.IB
2.5517
1202203.IB
2.48457
1202203.IB
2.62296
and I want my result dataframe 3 to have one more new column than dataframe 1 which is to use the value in column 'yield' in dataframe 1 subtract the value in column 'CALC' in dataframe 2:
The result dataframe 3 should be looking like this
windcodes
name
yield
perp
yield-CALC
163197.SH
shangguo comp
2.9248
NO
0.3731
154563.SH
guosheng comp
2.886
Yes
0.40413
789645.IB
guoyou comp
3.418
NO
0.79504
It would be really helpful if anyone can tell me how to do it in python.
Just in case you have completely different indexes, use df2's underlying numpy array:
df1['yield-CALC'] = df1['yield'] - df2['yield'].values
You can try something like this:
df1['yield-CALC'] = df1['yield'] - df2['yield']
I'm assuming you don't want to join the dataframes, since the windcodes are not the same.
Do we need to join 2 dataframes from windcodes column? The windcodes are all the same in the sample data you have given in Dataframe2. Can you explain this?
If we are going to join from the windscode field. The code below will work.
df = pd.merge(left=df1, right=df2,how='inner',on='windcodes')
df['yield-CALC'] = df['yield']-df['CALC']
I will try to keep it as elaborated as possible:
environment I have used for coding is Jupyter Notebook
importing our required pandas library
import pandas as pd
getting your first table data in form of lists of lists (you can also use csv,excel etc here)
data_1 = [["163197.SH","shangguo comp",2.9248,"NO"],\
["154563.SH","guosheng comp",2.886,"Yes"] , ["789645.IB","guoyou comp",3.418,"NO"]]
creating dataframe one :
df_1 = pd.DataFrame(data_1 , columns = ["windcodes","name","yield","perp"])
df_1
Output:
getting your second table data in form of lists of lists (you can also use csv,excel etc here)
data_2 = [["1202203.IB",2.5517],["1202203.IB",2.48457],["1202203.IB",2.62296]]
creating dataframe two :
df_2 = pd.DataFrame(data_2 , columns = ["windcodes","CALC"])
df_2
Output:
Now creating the third dataframe:
df_3 = df_1 # becasue first 4 columns are same as our first dataframe
df_3
Output:
Now calculating the fourth column i.e "yield-CALC" :
df_3["yield-CALC"] = df_1["yield"] - df_2["CALC"] # each df_1 datapoint will be subtracted from df_2 datapoint one by one (still confused? search for "SIMD")
df_3
Output:

Pandas Dataframes - Combine two Dataframes but leave out entry with same column

I'm trying to create a DataFrame out of two existing ones. I read the title of some articles in the web, first column is title and the ones after are timestamps
i want to concat both data frames but leave out the ones with the same title (column one)
I tried
df = pd.concat([df1,df2]).drop_duplicates().reset_index(drop=True)
but because the other columns may not be the exact same all the time, I need to leave out every data pack that has the same first column. how would I do this?
btw sorry for not knowing all the right terms for my problem
You should first remove the duplicate rows from df2 and then concat it with df1:
df = pd.concat([df1, df2[~df2.title.isin(df1.title)]]).reset_index(drop=True)
This probably solves your problem:
import pandas as pd
import numpy as np
df=pd.DataFrame(np.arange(2*5).reshape(2,5))
df2=pd.DataFrame(np.arange(2*5).reshape(2,5))
df.columns=['blah1','blah2','blah3','blah4','blah']
df2.columns=['blah5','blah6','blah7','blah8','blah']
for i in range(len(df.columns)):
for j in range(len(df2.columns)):
if df.columns[i] == df2.columns[j]:
df2 = df2.drop(df2.columns[j], axis = 1)
else:
continue
print(pd.concat([df, df2], axis =1))

Filling a dataframe with multiple dataframe values

I have some 100 dataframes that need to be filled in another big dataframe. Presenting the question with two dataframes
import pandas as pd
df1 = pd.DataFrame([1,1,1,1,1], columns=["A"])
df2 = pd.DataFrame([2,2,2,2,2], columns=["A"])
Please note that both the dataframes have same column names.
I have a master dataframe that has repetitive index values as follows:-
master_df=pd.DataFrame(index=df1.index)
master_df= pd.concat([master_df]*2)
Expected Output:-
master_df['A']=[1,1,1,1,1,2,2,2,2,2]
I am using for loop to replace every n rows of master_df with df1,df2... df100.
Please suggest a better way of doing it.
In fact df1,df2...df100 are output of a function where the input is column A values (1,2). I was wondering if there is something like
another_df=master_df['A'].apply(lambda x: function(x))
Thanks in advance.
If you want to concatenate the dataframes you could just use pandas concat with a list as the code below shows.
First you can add df1 and df2 to a list:
df_list = [df1, df2]
Then you can concat the dfs:
master_df = pd.concat(df_list)
I used the default value of 0 for 'axis' in the concat function (which is what I think you are looking for), but if you want to concatenate the different dfs side by side you can just set axis=1.

How to merge 2 columns in 1 within same dataframe (Python, Pandas)?

I'm following tutorial of Wes McKinney on using pandas/python for trading backtesting (http://youtu.be/6h0IVlp_1l8).
After pd.read_csv(...) he's using 'dt' (datetime) column as index of dataframe.
df.index = pd.to_datetime(df.pop('dt'))
However, my data has 2 separate columns, 'Date[G]' and 'Time[G]' and the data inside is something like 04-JAN-2013,00:00:00.000 (comma-separated).
How do i modify that line of code in order to do the same? I.e. merge two columns within one data frame, and then delete it. Or is there a way to do that during read_csv itself?
Thanks for all answers.
You should be able to concat two columns using apply() and then use to_datetime().
To remove columns from dataframe use drop() or just select columns you need:
df['dt'] = pd.to_datetime(df.apply(lambda x: x['Date[G]'] + ' ' + x['Time[G]'], 1))
df = df.drop(['Date[G]', 'Time[G]'], 1)
# ..or
# df = df[['dt', ...]]
df.set_index('dt', inplace = True)

Categories