I have two data frames that I merged on a common ID, and I am trying to uncover which values differ between the two sides of each matching row.
After the merge I have the table below. I could probably approach this with a series of if statements, but the actual data file has hundreds of column attributes, which doesn't seem efficient at all. I'm trying to determine whether there's an easier way to do this.
x Loan_ID Trade_Quantity_x Principal_x Interest_x Late_Fee_x Trade_Quantity_y Principal_y Interest_y Late_Fee_y
0 1 10 30 0 0 10 30 0 0
1 2 10 0 0 5 10 0 0 0
2 3 10 0 50 0 10 0 0 0
3 4 10 0 0 0 10 0 0 0
4 5 10 100 10 0 10 100 10 0
5 6 9 0 0 0 9 0 0 0
6 7 10 0 0 0 10 0 0 0
Expected output should be:
2. Late_Fee_y
3. Interest_y
I am assuming that what you are after is to compare two data frames of the same structure, i.e. having the same list of columns and the same number of rows, identified by the values of a dedicated Loan_ID column.
The goal is to list all "cells" that differ between the two frames, where a cell is located by its Loan_ID and column name.
May I suggest merging the two frames differently first, to get one long list of values, and then finding the differences either by scanning the melted frames or by applying a filter?
Example data (think of id as Loan_ID):
import pandas as pd

x = {'id': [1, 2], 'A': [0, 1], 'B': [2, 3]}
y = {'id': [1, 2], 'A': [0, 2], 'B': [2, 4]}
df_x = pd.DataFrame(x)
df_y = pd.DataFrame(y)
print(df_x)
print(df_y)
Melted, with a source column added to each frame:
df_xm = pd.melt(df_x, id_vars=['id'])
df_xm['source']='x'
df_ym = pd.melt(df_y, id_vars=['id'])
df_ym['source']='y'
print(df_xm)
print(df_ym)
First method, assuming that both melted frames are sorted by id in the same order:
for i in df_xm.index:
    if df_xm['value'][i] != df_ym['value'][i]:
        print(f"{df_xm['id'][i]},{df_xm['variable'][i]}")
Second method, merging the melted frames and filtering:
merged = df_xm.merge(df_ym, left_on=['id', 'variable'], right_on=['id', 'variable'])
print(merged)
filter_diff = merged['value_x'] != merged['value_y']
print('differences:')
print(merged[filter_diff])
I'm sure this can be improved for efficiency, but this is my general idea of how to tackle the "difference between two table snapshots" problem with general frame/table and filter operations.
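To connect this back to the Loan_ID example, here is a rough sketch of the second method applied end to end; the two small frames below are assumed reconstructions of the data before the merge (Loan_IDs 1-3 only), not your actual files:

import pandas as pd

# assumed reconstruction of the two frames before the merge (Loan_IDs 1-3 only)
df_x = pd.DataFrame({'Loan_ID': [1, 2, 3],
                     'Principal': [30, 0, 0],
                     'Interest': [0, 0, 50],
                     'Late_Fee': [0, 5, 0]})
df_y = pd.DataFrame({'Loan_ID': [1, 2, 3],
                     'Principal': [30, 0, 0],
                     'Interest': [0, 0, 0],
                     'Late_Fee': [0, 0, 0]})

# melt both frames to long format and merge on (Loan_ID, variable)
xm = df_x.melt(id_vars='Loan_ID')
ym = df_y.melt(id_vars='Loan_ID')
merged = xm.merge(ym, on=['Loan_ID', 'variable'], suffixes=('_x', '_y'))

# keep only the cells whose values differ between the two snapshots
diff = merged[merged['value_x'] != merged['value_y']]
print(diff[['Loan_ID', 'variable']])  # Loan_ID 2 -> Late_Fee, Loan_ID 3 -> Interest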
I want to create a matrix for each observation in my dataset.
Each row of the matrix should correspond to a disease group (i.e. xx, yy, kk). Example data:
id xx_z xx_y xx_a yy_b yy_c kk_t kk_r kk_m kk_y
1 1 1 0 0 1 0 0 1 1
2 0 0 1 0 0 1 1 0 1
Given that there are 3 disease groups and a maximum of 4 diseases per group in the dataset, the matrix should be 3 x 4, and the output should look like:
id  matrix
          xx_z  xx_y  xx_a  null
1    xx [   1     1     0     0
          yy_b  yy_c  null  null
     yy     0     1     0     0
          kk_t  kk_r  kk_m  kk_y
     kk     0     0     1     1  ]
2       [   0     0     1     0
            0     0     0     0
            1     1     0     1  ]
Please note that I do not know the exact number of diseases per disease group. How could I do this in Python pandas?
P.S. I just need a nested matrix structure for each observation; later I will compare the matrices of different observations, e.g. the Jaccard similarity of the matrices for observation id == 1 and observation id == 2.
Ok, how about something like this:
# work on a copy, with 'id' set as the index so that only the disease
# columns take part in the grouping
d = df.set_index('id')
# get the disease groups from the column prefixes, preserving their order
# (here: 'xx', 'yy', 'kk')
groups = list(dict.fromkeys(col.split('_')[0] for col in d.columns))
# define the grouping condition: group the columns by their prefix
# (note: axis=1 groupby is deprecated in recent pandas versions)
gb = d.groupby(d.columns.map(lambda x: x.split('_')[0]), axis=1)
# aggregate the values of each group into a list stored in an extra column
for g in groups:
    d[g] = gb.get_group(g).values.tolist()
# now aggregate the group columns into one list of lists per row
d['matrix'] = d[groups].values.tolist()
# convert each list of lists to a matrix, padding shorter groups with 0
d['matrix'] = d['matrix'].apply(lambda x: pd.DataFrame.from_records(x).fillna(0).astype(int).values)
# for the desired output
d[['matrix']]
Not the most elegant, but I'm hoping it does the job :)
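Since the P.S. mentions comparing the matrices later (e.g. Jaccard similarity), here is a rough, self-contained sketch of that follow-up step on the example data; it builds the per-observation matrices a little more directly than the snippet above, and the helper name row_matrix is purely for illustration:

import numpy as np
import pandas as pd

# the example data from the question
df = pd.DataFrame({'id':   [1, 2],
                   'xx_z': [1, 0], 'xx_y': [1, 0], 'xx_a': [0, 1],
                   'yy_b': [0, 0], 'yy_c': [1, 0],
                   'kk_t': [0, 1], 'kk_r': [0, 1], 'kk_m': [1, 0], 'kk_y': [1, 1]})
d = df.set_index('id')

prefixes = ['xx', 'yy', 'kk']  # disease groups
width = max(sum(c.startswith(p + '_') for c in d.columns) for p in prefixes)

def row_matrix(row):
    # one matrix row per disease group, zero-padded on the right to the widest group
    mat = np.zeros((len(prefixes), width), dtype=int)
    for i, p in enumerate(prefixes):
        vals = row[[c for c in d.columns if c.startswith(p + '_')]].to_numpy()
        mat[i, :len(vals)] = vals
    return mat

matrices = {obs_id: row_matrix(row) for obs_id, row in d.iterrows()}

# Jaccard similarity of the binary matrices for id 1 and id 2
m1, m2 = matrices[1].astype(bool), matrices[2].astype(bool)
print((m1 & m2).sum() / (m1 | m2).sum())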
I have a pandas data frame where I am trying to replace/change duplicate values with 0 (I don't want to delete the values) within a certain range of days.
So, in the example given below, I want to replace duplicate values in all columns with 0 within a range of, let's say, 3 days (the number can be changed). The desired result is also given below.
A B C
01-01-2011 2 10 0
01-02-2011 2 12 2
01-03-2011 2 10 0
01-04-2011 3 11 3
01-05-2011 5 15 0
01-06-2011 5 23 1
01-07-2011 4 21 4
01-08-2011 2 21 5
01-09-2011 1 11 0
So, the output should look like
A B C
01-01-2011 2 10 0
01-02-2011 0 12 2
01-03-2011 0 0 0
01-04-2011 3 11 3
01-05-2011 5 15 0
01-06-2011 0 23 1
01-07-2011 4 21 4
01-08-2011 2 0 5
01-09-2011 1 11 0
Any help will be appreciated.
You can use df.shift() for this to look at a value from a row up or down (or several rows, specified by the number x in .shift(x)).
You can use that in combination with .loc to select all rows that have an identical value to the rows above them within the window, and then replace it with a 0.
Something like this should work:
(I edited the code to make it flexible for any number of columns and for the number of days.)
numberOfDays = 3  # number of days to compare
for col in df.columns:
    for x in range(1, numberOfDays):
        df.loc[df[col] == df[col].shift(x), col] = 0
print(df)
This gives me the output:
A B C
date
01-01-2011 2 10 0
01-02-2011 0 12 2
01-03-2011 0 0 0
01-04-2011 3 11 3
01-05-2011 5 15 0
01-06-2011 0 23 1
01-07-2011 4 21 4
01-08-2011 2 0 5
01-09-2011 1 11 0
I don't find anything better than looping over all columns, because every column leads to a different grouping.
First define a function which does what you want at the group level, i.e. setting all but the first entry to zero:
def set_zeros(g):
    g.values[1:] = 0
    return g

for c in df.columns:
    df[c] = df.groupby([c, pd.Grouper(freq='3D')], as_index=False)[c].transform(set_zeros)
This custom function is applied to each group, where a group is defined by a time range (freq='3D') and equal values of a column within that period. As the columns generally have their equal values in different rows, this has to be done for each column in a loop.
Change freq to 5D, 10D or 20D for other window sizes.
For a detailed description of how to define the time period, see http://pandas.pydata.org/pandas-docs/stable/timeseries.html#offset-aliases
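For completeness, here is a minimal self-contained version of the above; it assumes the dates in the question are MM-DD-YYYY and parses them into a DatetimeIndex, which pd.Grouper(freq='3D') requires:

import pandas as pd

# rebuild the example frame with a DatetimeIndex (dates assumed to be MM-DD-YYYY)
idx = pd.to_datetime(['01-01-2011', '01-02-2011', '01-03-2011',
                      '01-04-2011', '01-05-2011', '01-06-2011',
                      '01-07-2011', '01-08-2011', '01-09-2011'],
                     format='%m-%d-%Y')
df = pd.DataFrame({'A': [2, 2, 2, 3, 5, 5, 4, 2, 1],
                   'B': [10, 12, 10, 11, 15, 23, 21, 21, 11],
                   'C': [0, 2, 0, 3, 0, 1, 4, 5, 0]}, index=idx)

def set_zeros(g):
    # within one (value, 3-day-bin) group, keep the first entry and zero the rest
    g.values[1:] = 0
    return g

for c in df.columns:
    df[c] = df.groupby([c, pd.Grouper(freq='3D')], as_index=False)[c].transform(set_zeros)
print(df)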
Starting with a training data set for a variable var1 as:
var1
A
B
C
D
I want to create a model (let's call it dummy_model1) that would then transform the training data set to:
var1_A var1_B var1_C var1_D
1 0 0 0
0 1 0 0
0 0 1 0
0 0 0 1
This functionality (or similar) exists in, among others, the dummies package in R and get_dummies in Pandas, or even case statements in SQL.
I'd like to then be able to apply dummy_model1 to a new data set:
var1
C
7
#
A
and get the following output:
var1_A var1_B var1_C var1_D
0 0 1 0
0 0 0 0
0 0 0 0
1 0 0 0
I know I can do this in SQL with 'case' statements but would love to automate the process given I have ~2,000 variables. Also, the new data sets will almost always have "bad" data (e.g., 7 and # in the above example).
Somewhat language agnostic (as long as it's open source), but I would prefer Python or R. Please note the data is over 500 GB, which limits some of my options. Thanks in advance.
Assuming var1 fits in memory on its own, here is a possible solution:
First, read in var1.
Next, use get_dummies to get all the "training" categories encoded as dummy variables. Store the column names as a list or an array.
Then, read in the first few rows of your training dataset to get the column names and store them as a list (or if you know these already you can skip this step).
Create a new list or array containing the dummy variable column names and the relevant other columns (this could just be every column in the dataset except var1). This will be the final columns encoding.
Then, read in your test data. Use get_dummies to encode var1 in your test data, knowing it may be missing categories or have extraneous categories. Then reindex the data to match the final columns encoding.
After reindexing, you will end up with a test dataset whose var1 dummies are consistent with your training var1.
To illustrate:
import pandas as pd
import numpy as np
training = pd.DataFrame({
    'var1': ['a', 'b', 'c'],
    'other_var': [4, 7, 3],
    'yet_another': [8, 0, 2]
})
print(training)
other_var var1 yet_another
0 4 a 8
1 7 b 0
2 3 c 2
test = pd.DataFrame({
    'var1': ['a', 'b', 'q'],
    'other_var': [9, 4, 2],
    'yet_another': [9, 1, 5]
})
print(test)
other_var var1 yet_another
0 9 a 9
1 4 b 1
2 2 q 5
var1_dummied = pd.get_dummies(training.var1, prefix='var1')
var_dummy_columns = var1_dummied.columns.values
print(var_dummy_columns)
array(['var1_a', 'var1_b', 'var1_c'], dtype=object)
final_encoding_columns = np.append(training.drop(['var1'], axis = 1).columns, var_dummy_columns)
print(final_encoding_columns)
array(['other_var', 'yet_another', 'var1_a', 'var1_b', 'var1_c'], dtype=object)
test_encoded = pd.get_dummies(test, columns=['var1'])
print(test_encoded)
other_var yet_another var1_a var1_b var1_q
0 9 9 1 0 0
1 4 1 0 1 0
2 2 5 0 0 1
test_encoded_reindexed = test_encoded.reindex(columns = final_encoding_columns, fill_value=0)
print(test_encoded_reindexed)
other_var yet_another var1_a var1_b var1_c
0 9 9 1 0 0
1 4 1 0 1 0
2 2 5 0 0 0
This should be what you want, based on the expected output in your question and the comments.
If the test data easily fits in memory, you can extend this to multiple variables: save and then update final_encoding_columns iteratively for each training variable you want to encode, and then pass all of those columns to the columns= argument when reindexing the test data. Reindex with the complete final_encoding_columns and you should be all set.
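As a rough sketch of that extension, reusing the training and test frames from above (with these example frames the list contains only var1, but the same loop scales to your ~2,000 variables):

# the categorical columns to dummy-encode; grow this list as needed
categorical_vars = ['var1']

# start from the non-categorical training columns, then append each
# variable's dummy columns as seen in training
final_encoding_columns = list(training.drop(columns=categorical_vars).columns)
for var in categorical_vars:
    final_encoding_columns.extend(pd.get_dummies(training[var], prefix=var).columns)

# encode the test data and force it onto the training encoding:
# unseen test categories are dropped, missing ones are filled with 0
test_encoded = pd.get_dummies(test, columns=categorical_vars)
test_encoded = test_encoded.reindex(columns=final_encoding_columns, fill_value=0)
print(test_encoded)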
Just a try, in R:
# first set the variable to a factor with the levels specified
df$var1 <- factor(df$var1, levels = LETTERS[1:4])
# values outside A-D (e.g. 7, #) become NA and their rows are dropped by model.matrix
model.matrix(data = df, ~var1-1)
#   var1A var1B var1C var1D
# 1     0     0     1     0
# 4     1     0     0     0
# or even
sapply(LETTERS[1:4], function(x) as.numeric(x==df$var1))
# A B C D
#[1,] 0 0 1 0
#[2,] 0 0 0 0
#[3,] 0 0 0 0
#[4,] 1 0 0 0
Simple and practical question, yet I can't find a solution.
The questions I took a look were the following:
Modifying a subset of rows in a pandas dataframe
Changing certain values in multiple columns of a pandas DataFrame at once
Fastest way to copy columns from one DataFrame to another using pandas?
Selecting with complex criteria from pandas.DataFrame
The key difference between those and mine is that I need to insert not a single value, but a whole row.
My problem is: I pick a row from a dataframe, say df1, so I have a series.
Now I have this other dataframe, df2, in which I have selected multiple rows according to a criterion, and I want to replicate that series across all of those rows.
df1:
Index/Col A B C
1 0 0 0
2 0 0 0
3 1 2 3
4 0 0 0
df2:
Index/Col A B C
1 0 0 0
2 0 0 0
3 0 0 0
4 0 0 0
What I want to accomplish is inserting df1[3] into rows df2[2] and df2[3], for example. So something like the non-working code:
series = df1[3]
df2[df2.index>=2 and df2.index<=3] = series
returning
df2:
Index/Col A B C
1 0 0 0
2 1 2 3
3 1 2 3
4 0 0 0
Use loc and pass a list of the index labels of interest; after the following comma, the : indicates that we want to set all column values. We then assign the series, but call the .values attribute so that it's a numpy array. Otherwise you will get a ValueError from the shape mismatch, because you are intending to overwrite 2 rows with a single row, and if it's a Series it won't align as you desire:
In [76]:
df2.loc[[2,3],:] = df1.loc[3].values
df2
Out[76]:
A B C
1 0 0 0
2 1 2 3
3 1 2 3
4 0 0 0
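If you would rather keep a boolean condition like the one in the question instead of listing the labels explicitly, the same .values trick works with a mask; a small self-contained sketch of the frames above:

import pandas as pd

df1 = pd.DataFrame({'A': [0, 0, 1, 0],
                    'B': [0, 0, 2, 0],
                    'C': [0, 0, 3, 0]}, index=[1, 2, 3, 4])
df2 = pd.DataFrame(0, index=[1, 2, 3, 4], columns=['A', 'B', 'C'])

# the boolean mask selects rows 2 and 3; the single row broadcasts across them
mask = (df2.index >= 2) & (df2.index <= 3)
df2.loc[mask] = df1.loc[3].values
print(df2)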
Suppose you have to copy certain rows and columns from one dataframe to another data frame; do this:
code
# x and y are the row-label bounds, a and b are the column-label bounds to select
df2 = df.loc[x:y, a:b]
I have a Pandas series of 10000 rows which is populated with single letters of the alphabet, from A to Z.
However, I want to create dummy variables for only A, B, and C, using Pandas get_dummies.
How do I go about doing that?
I don't want to get dummies for all the row values in the column and then select the specific columns, as the column contains other redundant data, which eventually causes a Memory Error.
try this:
import pandas as pd

# create a mock dataframe
df = pd.DataFrame({'alpha': ['a', 'a', 'b', 'b', 'c', 'e', 'f', 'g']})
# use replace with a regex to set every character other than a-c to None
pd.get_dummies(df.replace({'[^a-c]': None}, regex=True))
output:
alpha_a alpha_b alpha_c
0 1 0 0
1 1 0 0
2 0 1 0
3 0 1 0
4 0 0 1
5 0 0 0
6 0 0 0
7 0 0 0
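For reference, an alternative that avoids the regex is to cast the column to a Categorical restricted to the wanted letters before calling get_dummies; values outside those categories become NaN and end up as all-zero rows. A minimal sketch on the same mock data:

import pandas as pd

df = pd.DataFrame({'alpha': ['a', 'a', 'b', 'b', 'c', 'e', 'f', 'g']})

# values outside the listed categories become NaN and get all-zero dummy rows
alpha = df['alpha'].astype(pd.CategoricalDtype(categories=['a', 'b', 'c']))
print(pd.get_dummies(alpha, prefix='alpha'))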