Subtract dataframes with completely different row names and column names - python

My dataframe 1 looks like this:
windcodes  name           yield   perp
163197.SH  shangguo comp  2.9248  NO
154563.SH  guosheng comp  2.886   Yes
789645.IB  guoyou comp    3.418   NO
My dataframe 2 looks like this:
windcodes   CALC
1202203.IB  2.5517
1202203.IB  2.48457
1202203.IB  2.62296
I want my result, dataframe 3, to have one more column than dataframe 1: the value in column 'yield' of dataframe 1 minus the value in column 'CALC' of dataframe 2, row by row.
The result dataframe 3 should look like this:
windcodes  name           yield   perp  yield-CALC
163197.SH  shangguo comp  2.9248  NO    0.3731
154563.SH  guosheng comp  2.886   Yes   0.40143
789645.IB  guoyou comp    3.418   NO    0.79504
It would be really helpful if anyone can tell me how to do it in python.

Just in case you have completely different indexes, use df2's underlying numpy array:
df1['yield-CALC'] = df1['yield'] - df2['CALC'].values
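The difference between index-aligned and positional subtraction can be sketched as follows. This is a minimal illustration using the question's numbers; the non-default index on df1 (10, 11, 12) is an assumption added only to show what happens when the two indexes do not match.

```python
import pandas as pd

# df1 gets a hypothetical non-default index so that it differs from df2's index
df1 = pd.DataFrame(
    {"windcodes": ["163197.SH", "154563.SH", "789645.IB"],
     "yield": [2.9248, 2.886, 3.418]},
    index=[10, 11, 12],
)
df2 = pd.DataFrame(
    {"windcodes": ["1202203.IB"] * 3,
     "CALC": [2.5517, 2.48457, 2.62296]},
)

# Series - Series aligns on index labels; no labels match, so every result is NaN
aligned = df1["yield"] - df2["CALC"]

# Series - NumPy array ignores df2's index and subtracts by position
positional = df1["yield"] - df2["CALC"].to_numpy()
df1["yield-CALC"] = positional
```

Positional subtraction only makes sense when both frames have the same number of rows in the intended order.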

You can try something like this:
df1['yield-CALC'] = df1['yield'] - df2['CALC']
I'm assuming you don't want to join the dataframes, since the windcodes are not the same.

Do we need to join the two dataframes on the windcodes column? The windcodes in the sample data you have given for dataframe 2 are all the same. Can you explain this?
If we are going to join on the windcodes field, the code below will work.
df = pd.merge(left=df1, right=df2,how='inner',on='windcodes')
df['yield-CALC'] = df['yield']-df['CALC']
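As a quick check of the merge approach against the sample data from the question, the sketch below shows that an inner join on windcodes returns no rows, because none of the codes in dataframe 1 appear in dataframe 2. The merge is therefore only useful if the real data shares codes between the two frames.

```python
import pandas as pd

df1 = pd.DataFrame({
    "windcodes": ["163197.SH", "154563.SH", "789645.IB"],
    "name": ["shangguo comp", "guosheng comp", "guoyou comp"],
    "yield": [2.9248, 2.886, 3.418],
    "perp": ["NO", "Yes", "NO"],
})
df2 = pd.DataFrame({
    "windcodes": ["1202203.IB", "1202203.IB", "1202203.IB"],
    "CALC": [2.5517, 2.48457, 2.62296],
})

# Inner join keeps only windcodes present in both frames
merged = pd.merge(left=df1, right=df2, how="inner", on="windcodes")
merged["yield-CALC"] = merged["yield"] - merged["CALC"]

print(len(merged))  # no matching codes in the sample data, so no rows survive
```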

I will try to explain this in as much detail as possible.
The environment I used for coding is Jupyter Notebook.
Importing the required pandas library:
import pandas as pd
getting your first table data in form of lists of lists (you can also use csv,excel etc here)
data_1 = [["163197.SH","shangguo comp",2.9248,"NO"],\
["154563.SH","guosheng comp",2.886,"Yes"] , ["789645.IB","guoyou comp",3.418,"NO"]]
creating dataframe one :
df_1 = pd.DataFrame(data_1 , columns = ["windcodes","name","yield","perp"])
df_1
Output:
getting your second table data in form of lists of lists (you can also use csv,excel etc here)
data_2 = [["1202203.IB",2.5517],["1202203.IB",2.48457],["1202203.IB",2.62296]]
creating dataframe two :
df_2 = pd.DataFrame(data_2 , columns = ["windcodes","CALC"])
df_2
Output:
Now creating the third dataframe:
df_3 = df_1.copy() # copy df_1, because the first 4 columns are the same as in our first dataframe
df_3
Output:
Now calculating the fourth column i.e "yield-CALC" :
df_3["yield-CALC"] = df_1["yield"] - df_2["CALC"] # each df_2 value is subtracted from the df_1 value in the same row (a vectorized, element-wise operation)
df_3
Output:

Related

Best way to move an unexpected column in a Pandas DF to a new DF?

Wondering what the best way to tackle this issue is. If I have a DF with the following columns
df1()
type_of_fruit name_of_fruit price
..... ..... .....
and a list called
expected_cols = ['name_of_fruit','price']
What's the best way to automate the check of df1 against the expected_cols list? I was trying something like
df_cols=df1.columns.values.tolist()
if df_cols != expected_cols:
And then try to drop to another df any columns not in expected_cols, but this doesn't seem like a great idea to me. Is there a way to save the "dropped" columns?
df2 = df1.drop(columns=expected_cols)
But then this seems problematic depending on column ordering, and also in cases where there could be either more columns than expected or fewer columns than expected. In cases where there are fewer columns than expected (i.e. df1 only contains the column name_of_fruit) I'm planning on using
df1.reindex(columns=expected_cols)
But I'm a bit unsure how to do this programmatically, and how to handle the case where there are more columns than expected.
You can use set difference using -:
Assuming df1 having cols:
In [542]: df1_cols = df1.columns # ['type_of_fruit', 'name_of_fruit', 'price']
In [539]: expected_cols = ['name_of_fruit','price']
In [541]: unwanted_cols = list(set(df1_cols) - set(expected_cols))
In [542]: df2 = df1[unwanted_cols]
In [543]: df1.drop(unwanted_cols, axis=1, inplace=True)
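A set difference does not preserve column order, so a variant that keeps the original ordering and also covers the "fewer columns than expected" case from the question might look like this. It is only a sketch: the "origin" column is a hypothetical expected column that df1 lacks, added to show what reindex does.

```python
import pandas as pd

df1 = pd.DataFrame([[1, 2, 3]],
                   columns=["type_of_fruit", "name_of_fruit", "price"])
# "origin" is a hypothetical expected column that is missing from df1
expected_cols = ["name_of_fruit", "price", "origin"]

# Unexpected columns, in their original order
extra_cols = [c for c in df1.columns if c not in expected_cols]

# Save the unexpected columns in their own frame before dropping them
df2 = df1[extra_cols].copy()

# reindex keeps only the expected columns, in the expected order,
# and creates any missing ones filled with NaN
df1 = df1.reindex(columns=expected_cols)
```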
Use groupby along the columns axis to split the DataFrame succinctly. In this case, check whether the columns are in your list to form the grouper, and you can store the results in a dict where the True key gets the DataFrame with the subset of columns in the list and the False key has the subset of columns not in the list.
Sample Data
import pandas as pd
df = pd.DataFrame(data = [[1,2,3]],
columns=['type_of_fruit', 'name_of_fruit', 'price'])
expected_cols = ['name_of_fruit','price']
Code
d = dict(tuple(df.groupby(df.columns.isin(expected_cols), axis=1)))
# If you need to ensure columns are always there then do
#d[True] = d[True].reindex(expected_cols)
d[True]
# name_of_fruit price
#0 2 3
d[False]
# type_of_fruit
#0 1
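In pandas 2.1 and later, groupby with axis=1 is deprecated, so the same split can also be sketched with a plain boolean mask over the columns. This is an alternative to the groupby approach, not a reproduction of it; the variable names are illustrative.

```python
import pandas as pd

df = pd.DataFrame(data=[[1, 2, 3]],
                  columns=["type_of_fruit", "name_of_fruit", "price"])
expected_cols = ["name_of_fruit", "price"]

# Boolean array: True where the column name is in expected_cols
mask = df.columns.isin(expected_cols)

expected_part = df.loc[:, mask]     # columns in the list
unexpected_part = df.loc[:, ~mask]  # columns not in the list
```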

Merge two data-frames with only one column different. Need to append that column in the new dataframe. Please check below for detailed view

I'm fairly new to Python and am trying to merge two dataframes with similar columns. The second dataframe has one column that the first lacks, and I need to append that column in the new dataframe.
Detailed view of dataframes
Code Used :
df3 = pd.merge(df,df1[['Id','Value_data']],
on = 'Id')
df3 = pd.merge(df,df1[['Id','Value_data']],
on = 'Id', how='outer')
Getting Output csv as
Unnamed: 0 Id_x Number_x Class_x Section_x Place_x Name_x Executed_Date_x Version_x Value PartDateTime_x Cycles_x Id_y Mumber_y Class_y Section_y Place_y Name_y Executed_Date_y Version_y Value_data PartDateTime_y Cycles_y
whereas I don't want the _x and _y suffixes. I want the output to be:
Id Number Class Section Place Name Executed_Date Version Value Value_data PartDateTime Cycles
If I use df2=pd.concat([df,df1],axis=0,ignore_index=True)
then I get values in the format below in all columns except Value_data, which would be an empty column.
Id Number Class Section Place Name Executed_Date Version Value Value_data PartDateTime Cycles
Please help me with a solution for this. Thanks for your time.
I think easiest path is to make a temporary df, let's call it df_temp2 , which is a copy of df_2, with renamed column, then append it to df_1
df_temp2 = df_2.copy()
df_temp2.columns = ['..','..', .... 'value' ...]
then
df_total = df_1.append(df_temp2)
This provides you a total DataFrame with all the rows of DF_1 and DF_2. 'append()' method supports a few arguments, check the docs for more details.
--- Added --------
Another possible approach is the pd.concat() function, which works in the same way as the .append() method, like this
result = pd.concat([df_1, df_temp2])
In your case the two approaches would perform similarly. append() was a DataFrame method built on top of pd.concat(); it was deprecated in pandas 1.4 and removed in pandas 2.0, so on current versions pd.concat() is the one to use.
Full docs about concat() here: pd.Concat() docs
Hope this was helpful.
import pandas as pd
df =pd.read_csv('C:/Users/output_2.csv')
df1 = pd.read_csv('C:/Users/output_1.csv')
df1_temp=df1[['Id','Cycles','Value_data']].copy()
df3=pd.merge(df,df1_temp,on = ['Id','Cycles'], how='inner')
df3=df3.drop(columns="Unnamed: 0")
df3.to_csv('C:/Users/output.csv')
This worked
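Why selecting columns before the merge avoids the _x and _y suffixes can be sketched with toy data. The frames below are hypothetical stand-ins with only a few of the question's columns: pandas adds suffixes to every column name that exists in both frames and is not listed in on.

```python
import pandas as pd

# Hypothetical stand-ins for the two CSV files
left = pd.DataFrame({"Id": [1, 2], "Cycles": [10, 20], "Value": [0.1, 0.2]})
right = pd.DataFrame({"Id": [1, 2], "Cycles": [10, 20],
                      "Value": [0.1, 0.2], "Value_data": [5.0, 6.0]})

# Shared non-key columns (Cycles, Value) get _x / _y suffixes
suffixed = pd.merge(left, right, on="Id")

# Keep only the keys plus the one new column on the right side
clean = pd.merge(left, right[["Id", "Cycles", "Value_data"]],
                 on=["Id", "Cycles"], how="inner")

print(list(suffixed.columns))
print(list(clean.columns))
```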

how can I select data in a multiindex dataFrame and have the result dataFrame have an appropriate index

I have a multiindex DataFrame and I'm trying to select data in it based on certain criteria; so far so good. The problem is that once I have selected my data using .loc and pd.IndexSlice, the resulting DataFrame, which should logically have fewer rows and fewer elements in the first level of the multiindex, keeps exactly the same MultiIndex, with some keys in it referring to an empty dataframe.
I've tried creating a completely new DataFrame with a new index, but the structure of my data set is complicated and there is not always the same number of elements in a given level, so it is not easy to create a DataFrame with the right shape in which I can put the data.
import numpy as np
import pandas as pd
np.random.seed(3) # so my example is reproducible
idx = pd.IndexSlice
iterables = [['A','B','C'],[0,1,2],['some','rdm','data']]
my_index = pd.MultiIndex.from_product(iterables,names =
['first','second','third'])
my_columns = ['col1','col2','col3']
df1 = pd.DataFrame(data = np.random.randint(10,size =
(len(my_index),len(my_columns))),
index = my_index,
columns = my_columns
)
#Ok, so let's say I want to keep only the elements in the first level of my index (["A","B","C"]) for
#which the total sum in column 3 is less than 35 for some reasons
boolean_mask = (df1.groupby(level = "first").col3.sum() < 35).tolist()
first_level_to_keep = df1.index.levels[0][boolean_mask].tolist()
#lets select the wanted data and put it in df2
df2 = df1.loc[idx[first_level_to_keep,:,:],:]
So far, everything is as expected
The problem is when I want to access the df2 index. I expected the following:
df2.index.levels[0].tolist() == ['B','C']
to be true. But this is what gives a True statement:
df2.index.levels[0].tolist() == ['A','B','C']
So my question is the following: is there a way to select data and get in return a DataFrame with a MultiIndex reflecting what is in it? I find it weird to be able to select non-existing data in my df2.
I tried to include some images of the dataframes in question but I couldn't because I don't have enough «reputation»... sorry about that.
Thank you for your time!
Even if you delete the rows corresponding to a particular value in an index level, that value still exists. You can reset the index and then set those columns back as an index in order to generate a MultiIndex with new level values.
df2 = df2.reset_index().set_index(['first','second','third'])
print(df2.index.levels[0].tolist() == ['B','C'])
True
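pandas also provides MultiIndex.remove_unused_levels(), which drops level values that no longer appear in the index without a reset_index round trip. A minimal sketch on a smaller, hypothetical two-level frame:

```python
import numpy as np
import pandas as pd

# Smaller hypothetical frame with the same kind of MultiIndex
index = pd.MultiIndex.from_product([["A", "B", "C"], [0, 1]],
                                   names=["first", "second"])
df1 = pd.DataFrame({"col1": np.arange(6)}, index=index)

# Select only 'B' and 'C' on the first level
df2 = df1.loc[["B", "C"]].copy()

# The level values still list 'A' even though no row uses it
stale_levels = df2.index.levels[0].tolist()

# Rebuild the levels from the values actually present
df2.index = df2.index.remove_unused_levels()
fresh_levels = df2.index.levels[0].tolist()
```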

Summing columns in dataframe in python

I am trying to add 3 columns' values to come up with a new column as total value. Code is below:
df3[["Bronze","Gold","Silver"]] = df3[["Bronze","Gold","Silver"]].astype("int")
df3["Total Medal"]= df3.iloc[:, -3:0].sum(axis=1)
df3[["Total Medal"]].astype("int")
I know that Bronze, Gold, Silver columns have 1 and 0 values and they are the last 3 columns in the dataframe. Their original types were "uint8" so I changed them to "int".
Total Medal column after these lines come out as type "float" (instead of int) and yield only the value 0. How can I properly add these columns?
To add the value of 3 columns to a new column simply do
df['Total Medal'] = df.sum(axis=1)
This can e.g. be done using assign:
import numpy as np
import pandas as pd
#create data frame
data = {"gold":np.random.choice([0,1],size=10),"silver":np.random.choice([0,1],size=10), "bronze":np.random.choice([0,1],size=10)}
df = pd.DataFrame(data)
#calculate new column and add to dataframe
df = df.assign(mysum=df.gold+df.silver+df.bronze)
Edit: df["mysum"] = df.sum(axis=1) only works if your dataframe only has the three relevant columns, because it sums over all columns (and not only over the three you want).
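The all-zero result in the question is consistent with the slice -3:0 selecting no columns at all: it starts three from the end and stops before position 0, so it is empty. A sketch with hypothetical medal data, summing the three columns by name (or with iloc[:, -3:]) instead:

```python
import pandas as pd

# Hypothetical data; the medal columns are the last three, as in the question
df3 = pd.DataFrame({
    "Name": ["x", "y", "z"],
    "Bronze": [1, 0, 1],
    "Gold": [0, 1, 1],
    "Silver": [0, 0, 1],
})

# -3:0 stops before column 0, so no columns are selected
empty_slice = df3.iloc[:, -3:0]

# -3: (no stop value) selects the last three columns
last_three = df3.iloc[:, -3:]

# Selecting by name is more robust if columns are added later
df3["Total Medal"] = df3[["Bronze", "Gold", "Silver"]].sum(axis=1)
```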

"Expanding" pandas dataframe by using cell-contained list

I have a dataframe in which third column is a list:
import pandas as pd
pd.DataFrame([[1,2,['a','b','c']]])
I would like to separate that nest and create more rows with identical values of first and second column.
The end result should be something like:
pd.DataFrame([[[1,2,'a']],[[1,2,'b']],[[1,2,'c']]])
Note, this is simplified example. In reality I have multiple rows that I would like to "expand".
Regarding my progress, I have no idea how to solve this. Well, I imagine that I could take each member of nested list while having other column values in mind. Then I would use the list comprehension to make more list. I would continue so by and add many lists to create a new dataframe... But this seems just a bit too complex. What about simpler solution?
Create the dataframe with a single column, then add columns with constant values:
import pandas as pd
df = pd.DataFrame({"data": ['a', 'b', 'c']})
df['col1'] = 1
df['col2'] = 2
print(df)
This prints:
data col1 col2
0 a 1 2
1 b 1 2
2 c 1 2
Not exactly the same issue that the OP described, but related, and more pandas-like, is the situation where you have a dict of lists of unequal lengths. In that case, you can create a DataFrame in long format like this.
import pandas as pd
my_dict = {'a': [1,2,3,4], 'b': [2,3]}
df = pd.DataFrame.from_dict(my_dict, orient='index')
df = df.unstack() # to format it in long form
df = df.dropna() # to drop nan values which were generated by having lists of unequal length
df.index = df.index.droplevel(level=0) # if you don't want to store the index in the list
# NOTE this last step results in duplicate index values
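For the original question, where a cell holds a list and each element should become its own row, pandas 0.25 and later offer DataFrame.explode. A minimal sketch, assuming the column names below (they are made up for the example, as is the second row):

```python
import pandas as pd

# Column names and the second row are illustrative, not from the question
df = pd.DataFrame(
    [[1, 2, ["a", "b", "c"]],
     [3, 4, ["d", "e"]]],
    columns=["col1", "col2", "data"],
)

# One row per list element; the other columns are repeated
expanded = df.explode("data").reset_index(drop=True)
print(expanded)
```

Without reset_index(drop=True), the exploded rows keep the index of the row they came from, so index values repeat.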
