I have two dataframes with exactly the same index labels and exactly the same column labels, but different values in those tables. The number of rows and columns is also identical. Let's call them df1 and df2.
df1 = {'A':['a1','a2','a3','a4'],'B':['b1','b2','b3','b4'],'C':['c1','c2','c3','c4']}
df2 = {'A':['d1','d2','d3','d4'],'B':['e1','e2','e3','e4'],'C':['f1','f2','f3','f4']}
I want to perform several operations on these matrices i.e.
Multiplication - create the following matrix:
df3 = {'A':['a1*d1','a2*d2','a3*d3','a4*d4'],'B':['b1*e1','b2*e2','b3*e3','b4*e4'],'C':['c1*f1','c2*f2','c3*f3','c4*f4']}
as well as Addition, Subtraction, and Division using the exact same logic.
Please note that the question is more about the generic code which can be replicated since the matrix which I am using has hundreds of rows and columns.
This is pretty trivial to achieve using the pandas library. The data type of the columns is unclear from the OP's question, but if they are numeric, then the code below will run.
Try:
import pandas as pd
pd.DataFrame(df1) * pd.DataFrame(df2)
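The same pattern covers the other operations, since arithmetic between two DataFrames is applied element-wise and aligned on both index and columns. A minimal sketch with made-up numeric values (the string placeholders from the question are replaced with numbers so the arithmetic actually runs):
import pandas as pd

# Hypothetical numeric stand-ins for df1 and df2.
df1 = pd.DataFrame({'A': [1, 2, 3, 4], 'B': [5, 6, 7, 8], 'C': [9, 10, 11, 12]})
df2 = pd.DataFrame({'A': [2, 2, 2, 2], 'B': [3, 3, 3, 3], 'C': [4, 4, 4, 4]})

product    = df1 * df2  # element-wise multiplication
total      = df1 + df2  # element-wise addition
difference = df1 - df2  # element-wise subtraction
quotient   = df1 / df2  # element-wise division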
If you don't want to import pandas just for this operation, you can do it with a dict comprehension:
df1_2 = {key: [x*y for x,y in zip(df1[key],df2[key])] for key in df1.keys()}
NOTE: This works only if the values are numeric. For strings, use concatenation instead, e.g. x + '*' + y, and swap in whichever operation you need.
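For example, a minimal sketch of the string variant applied to the dicts from the question, producing the 'a1*d1' style output shown above:
df1 = {'A':['a1','a2','a3','a4'],'B':['b1','b2','b3','b4'],'C':['c1','c2','c3','c4']}
df2 = {'A':['d1','d2','d3','d4'],'B':['e1','e2','e3','e4'],'C':['f1','f2','f3','f4']}

# Join each pair of strings with '*' instead of multiplying.
df1_2 = {key: [x + '*' + y for x, y in zip(df1[key], df2[key])] for key in df1}
# df1_2['A'] == ['a1*d1', 'a2*d2', 'a3*d3', 'a4*d4']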
I want to merge two dataframes to create a single time-series with two variables.
I have a function that does this by iterating over each dataframe with iterrows(), which is terribly slow and doesn't take advantage of the vectorization that pandas and numpy provide.
Would you be able to help?
This code illustrates what I am trying to do:
a = pd.DataFrame(data={'timestamp':[1,2,5,6,10],'x':[2,6,3,4,2]})
b = pd.DataFrame(data={'timestamp':[2,3,4,10],'y':[3,1,2,1]})
#z = Magical line of code/function call here
#z output: {'timestamp':[1,2,3,4,5,6,10],'x':[2,6,6,6,3,4,2], 'y':[NaN,3,1,2,2,2,1] }
This can be broken down into 2 steps:
The first step is the equivalent of an outer join in SQL, where we create a table containing the keys of both source tables. This is done with merge(..., how="outer").
The second is filling each NaN with the previous non-NaN value, which can be done with ffill.
z = a.merge(b, on="timestamp", how="outer").sort_values("timestamp").ffill()
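Put together with the sample frames from the question (exact dtypes and print formatting vary a little by pandas version, but the filled values match the expected output):
import pandas as pd

a = pd.DataFrame(data={'timestamp': [1, 2, 5, 6, 10], 'x': [2, 6, 3, 4, 2]})
b = pd.DataFrame(data={'timestamp': [2, 3, 4, 10], 'y': [3, 1, 2, 1]})

# Outer join on timestamp, order by timestamp, then forward-fill the gaps.
z = a.merge(b, on="timestamp", how="outer").sort_values("timestamp").ffill()
# z['timestamp'].tolist() -> [1, 2, 3, 4, 5, 6, 10]
# z['x'].tolist()         -> [2.0, 6.0, 6.0, 6.0, 3.0, 4.0, 2.0]
# z['y'].tolist()         -> [nan, 3.0, 1.0, 2.0, 2.0, 2.0, 1.0]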
I have a data set where I want to match the index row and change the value of a column within that row.
I have looked at map and loc and have been able to locate the data using df.loc, but that filters the data down; all I want to do is change the value in a column on that row when that row is found.
What is the best approach - my original post can be found here:
Original post
It's simple to do in Excel, but I'm struggling with pandas.
Edit:
I have this so far, which seems to work, but the result includes a lot of extra numbers after the total calculation along with dtype: int64.
import pandas as pd
df = pd.read_csv(r'C:\Users\david\Documents\test.csv')
multiply = {2.1: df['Rate'] * df['Quantity']}
df['Total'] = df['Code'].map(multiply)
df.head()
How do I get around this?
The pandas method mask is likely a good option here. Mask takes two main arguments: a condition and something with which to replace values that meet that condition.
If you're trying to replace values with a formula that draws on values from multiple dataframe columns, you'll also want to pass in an additional axis argument.
The condition: in your case this would be something like:
df['Code'] == 2.1
The replacement value: this can be a single value, a series/dataframe, or a callable. Most useful for your purposes is a series computed from other columns, for example:
df['Rate'] * df['Quantity']
The axis: since the replacement values are drawn from other columns rather than being a single scalar, you also need to tell mask() how to align them. It might look something like this:
axis=0
So all together, the code would read like this:
df['Total'] = df['Code'].mask(
    df['Code'] == 2.1,
    df['Rate'] * df['Quantity'],
    axis=0
)
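A self-contained sketch with made-up numbers (the column names Code, Rate, and Quantity come from the question; the values are invented). Note that rows where the condition is False keep the original Code value, since mask was called on df['Code']:
import pandas as pd

# Hypothetical data standing in for test.csv.
df = pd.DataFrame({
    'Code':     [2.1, 3.0, 2.1],
    'Rate':     [10, 20, 30],
    'Quantity': [2, 3, 4],
})

# Where Code == 2.1, replace the value with Rate * Quantity;
# everywhere else the original Code value is kept.
df['Total'] = df['Code'].mask(
    df['Code'] == 2.1,
    df['Rate'] * df['Quantity'],
    axis=0
)
# df['Total'].tolist() -> [20.0, 3.0, 120.0]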
I checked the answer here but this doesn't work for me.
How to get the integer portion of a float column in pandas
This is because I need to write further conditional statements that will perform operations on the exact values in the columns and the corresponding values in other columns.
So basically I am hoping that for my two dataframes df1 and df2 I will form a concatenated dataframe using
dfn_c = pd.concat([df1, df2], axis=1)
then write something like
dfn_cn = dfn_c.loc[df1.X1.isin(df2['X2'])]
where X1 and X2 are the said columns respectively. The above line of course makes an exact comparison whereas I want to compare only the integer portion and then form the new dataframe.
IIUC, try casting to int then compare.
dfn_cn = dfn_c.loc[df1['X1'].astype(int).isin(df2['X2'].astype(int))]
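A small sketch with invented values, just to show the effect of truncating before the membership test (note that astype(int) will raise if a column contains NaN, so that may need handling first):
import pandas as pd

# Hypothetical float columns; only the integer portions are compared.
df1 = pd.DataFrame({'X1': [1.2, 2.7, 5.5]})
df2 = pd.DataFrame({'X2': [1.9, 3.1, 4.0]})

dfn_c = pd.concat([df1, df2], axis=1)
dfn_cn = dfn_c.loc[df1['X1'].astype(int).isin(df2['X2'].astype(int))]
# Only the first row survives: int(1.2) == 1 is among {1, 3, 4},
# while int(2.7) == 2 and int(5.5) == 5 are not.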
Is there a better pandas function than pandas.testing.assert_frame_equal? I am facing two issues while comparing.
If the data set is like this:
df1=pd.DataFrame({'a':['abc','pqr','ss','sd','sds'], 'b':['sdd','gbc','mqr','sas','ho']})
df2=pd.DataFrame({'m':['abc','pqr','ss','sd','sds'], 'n':['sdd','gbc','mqr','sas','ho']})
it will raise an error because the column names differ, even though the values are the same.
If the dataset is like this:
df1=pd.DataFrame({'a':['abc','pqr','ss','sd','sds'], 'b':['sdd','gbc','mqr','sas','ho']})
df2=pd.DataFrame({'a':['abc','pqr','sds','ss','sd'], 'b':['sdd','gbc','ho','mqr','sas']})
then I am getting an error because the order of the rows differs, even though the frames contain the same data.
pandas.testing.assert_frame_equal is a very robust function that checks a lot of things. If you just want to check that the data the frames contain are equal (without regard to column names, index, dtype, etc.), it might be easier to write a simple function to do it.
You will have to sort your values, then convert to a numpy array to get rid of the indices and column headers. Finally we can compare the arrays using np.array_equal().
import pandas as pd
import numpy as np
df1=pd.DataFrame({'a':['abc','pqr','ss','sd','sds'], 'b':['sdd','gbc','mqr','sas','ho']})
df2=pd.DataFrame({'x':['abc','pqr','sds','ss','sd'], 'b':['sdd','gbc','ho','mqr','sas']})
print(df1.equals(df2))
# False
def assert_equal_df(df1, df2):
    # Sort each frame by its own first column, then drop the index and
    # column labels by converting to a plain numpy array before comparing.
    df1 = df1.sort_values(df1.columns[0]).to_numpy()
    df2 = df2.sort_values(df2.columns[0]).to_numpy()
    return np.array_equal(df1, df2)
print(assert_equal_df(df1, df2))
# True
I am working with a multi-indexed dataframe in pandas and am wondering whether I should multiindex the rows or the columns.
My data looks something like this:
Code:
import numpy as np
import pandas as pd
arrays = [['condition1', 'condition2'],
          ['patient1', 'patient2'],
          ['measure1', 'measure2', 'measure3']]
colidxs = pd.MultiIndex.from_product(arrays,
                                     names=['condition', 'patient', 'measure'])
rowidxs = pd.Index([0, 1, 2, 3], name='time')
data = pd.DataFrame(np.random.randn(len(rowidxs), len(colidxs)),
                    index=rowidxs, columns=colidxs)
Here I chose to multiindex the columns, the rationale being that a pandas dataframe consists of series, and my data is ultimately a bunch of time series (hence the rows are indexed by time here).
I have this question because there seems to be some asymmetry between rows and columns for multiindexing. For example, this documentation page shows how query works for a row-multiindexed dataframe, but if the dataframe is column-multiindexed then the command in the documentation has to be replaced by something like df.T.query('color == "red"').T.
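For instance, with the column-multiindexed frame built above, a sketch of what selecting one patient via query ends up looking like (xs or loc with axis=1 would avoid the transpose):
# query only consults the row index, so the frame is transposed,
# filtered on the 'patient' level, and transposed back.
patient1 = data.T.query('patient == "patient1"').T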
My question might seem a bit silly, but I'd like to see if there is any difference in convenience between multiindexing rows vs. columns for dataframes (such as the query case above).
Thanks.
A rough personal summary of what I call the row/column-propensity of some common operations for DataFrame:
[]: column-first
get: column-only
attribute accessing as indexing: column-only
query: row-only
loc, iloc, ix: row-first
xs: row-first
sortlevel: row-first
groupby: row-first
"row-first" means the operation expects row index as the first argument, and to operate on column index one needs to use [:, ] or specify axis=1;
"row-only" means the operation only works for row index and one has to do something like transposing the dataframe to operate on the column index.
Based on this, it seems multiindexing rows is slightly more convenient.
A natural follow-up question: why don't the pandas developers unify the row/column propensity of DataFrame operations? For example, [] and loc/iloc/ix are the most common ways of indexing dataframes, yet one slices columns first and the others slice rows first, which seems a bit odd.