Natively compare two Pandas Dataframes - python

I want to compare two very similar DataFrames: one is loaded from a JSON file and resampled, the other is loaded from a CSV file in a somewhat more complicated use case.
These are the first values of df1:
                          page
logging_time
2021-07-04 18:14:47.000  748.0
2021-07-04 18:14:47.100    0.0
2021-07-04 18:14:47.200    0.0
2021-07-04 18:14:47.300    3.0
2021-07-04 18:14:47.400    4.0
[5 rows x 1 columns]
And these are the first values of df2:
  #timestamp per 100 milliseconds  Sum of page
0         2021-04-07 18:14:47.000        748.0
1         2021-04-07 18:14:47.100          0.0
2         2021-04-07 18:14:47.200          0.0
3         2021-04-07 18:14:47.300          3.0
4         2021-04-07 18:14:47.400          4.0
[5 rows x 2 columns]
I'm comparing them with pandas.testing.assert_frame_equal and trying to apply some customizations to the data so that they compare as equal; I would like some help with that.
The first column should be removed and the label names should be ignored.
I want to do this in the most pandas-native way, not by comparing only the values.
Any help would be appreciated.
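To be concrete, something along these lines is what I have in mind, just a rough sketch of the normalization (I'm not sure it is the most native way):
import pandas as pd
from pandas.testing import assert_frame_equal

left = df1.reset_index(drop=True)                # drop the logging_time index
right = df2.iloc[:, 1:].reset_index(drop=True)   # drop the first (timestamp) column
right.columns = left.columns                     # ignore the label names
assert_frame_equal(right, left)                  # raises AssertionError on mismatch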

This is a lot of code, but it is an almost comprehensive comparison of two data frames given a join key and column(s) to ignore. Its current weakness is that it does not compare/analyze values that may not exist in each of the data sets.
Also note that this script writes out .csv files of the rows that differ, keyed by the specified join key and containing only the column values from the two data sets (comment out that portion if you don't want to write those files).
Here is a link on GitHub if you prefer the Jupyter notebook view: https://github.com/marckeelingiv/MyPyNotebooks/blob/master/Test-Prod%20Compare.ipynb
# Imports
import pandas as pd

# Set target data sets
test_csv_location = 'test.csv'
prod_csv_location = 'prod.csv'

# Set which columns to join on and which columns to remove
join_columns = ['ORIGINAL_IID','CLAIM_IID','CLAIM_LINE','EDIT_MNEMONIC']
columns_to_remove = ['Original Clean']

# Peek at the data to get a list of the column names
test_df = pd.read_csv(test_csv_location, nrows=10)
prod_df = pd.read_csv(prod_csv_location, nrows=10)

# Create a dictionary to set all columns to strings
all_columns = set()
for c in test_df.columns.values:
    all_columns.add(c)
for c in prod_df.columns.values:
    all_columns.add(c)
dtypes = {}
for c in all_columns:
    dtypes[c] = str

# Perform the full import, setting data types and specifying the index
test_df = pd.read_csv(test_csv_location, dtype=dtypes, index_col=join_columns)
prod_df = pd.read_csv(prod_csv_location, dtype=dtypes, index_col=join_columns)

# Drop the desired columns
for c in columns_to_remove:
    try:
        del test_df[c]
    except KeyError:
        pass
    try:
        del prod_df[c]
    except KeyError:
        pass

# Join the data frames to prepare for comparing
compare_df = test_df.join(
    prod_df,
    how='outer',
    lsuffix='_test', rsuffix='_prod'
).fillna('')

# Create the list of columns to compare
columns_to_compare = []
for c in all_columns:
    if c not in columns_to_remove and c not in join_columns:
        columns_to_compare.append(c)

# Show the differences per column for each data set
list_of_different_columns = []
for column in columns_to_compare:
    are_different = ~(compare_df[f'{column}_test'] == compare_df[f'{column}_prod'])
    differences = are_different.sum()
    test_not_nulls = ~(compare_df[f'{column}_test'] == '')
    prod_not_nulls = ~(compare_df[f'{column}_prod'] == '')
    temp_df = compare_df[are_different & test_not_nulls & prod_not_nulls]
    if len(temp_df) > 0:
        print(f'{differences} differences in {column}')
        print(f'\t{(~test_not_nulls).sum()} Nulls in Test')
        print(f'\t{(~prod_not_nulls).sum()} Nulls in Prod')
        to_file = temp_df[[f'{column}_test', f'{column}_prod']].copy()
        to_file.to_csv(path_or_buf=f'{column}_Test.csv')
        list_of_different_columns.append(column)
        del to_file
    del temp_df, prod_not_nulls, test_not_nulls, differences, are_different

# Functions to show/analyze differences
def return_delta_df(column):
    mask = ~(compare_df[f'{column}_test'] == compare_df[f'{column}_prod'])
    mask2 = ~(compare_df[f'{column}_test'] == '')
    mask3 = ~(compare_df[f'{column}_prod'] == '')
    df = compare_df[mask & mask2 & mask3][[f'{column}_test', f'{column}_prod']].copy()
    try:
        df['Delta'] = df[f'{column}_prod'].astype(float) - df[f'{column}_test'].astype(float)
        df.sort_values(by='Delta', ascending=False, inplace=True)
    except ValueError:  # column is not numeric
        pass
    return df

def show_count_of_differences(column):
    df = return_delta_df(column)
    return pd.DataFrame(
        df.groupby(by=[f'{column}_test', f'{column}_prod']).size(),
        columns=['Count']
    ).sort_values('Count', ascending=False).copy()

# ### Code to run to see the differences
# Copy the resulting commands into individual Jupyter notebook cells to dig into the differences
for c in list_of_different_columns:
    print(f"## {c}")
    print(f"return_delta_df('{c}')")
    print(f"show_count_of_differences('{c}')")

You can use the equals function to compare the dataframes. The catch is that column names must match:
import pandas as pd

data = [
["2021-07-04 18:14:47.000", 748.0],
["2021-07-04 18:14:47.100", 0.0],
["2021-07-04 18:14:47.200", 0.0],
["2021-07-04 18:14:47.300", 3.0],
["2021-07-04 18:14:47.400", 4.0],
]
df1 = pd.DataFrame(data, columns = ["logging_time", "page"])
df1.set_index("logging_time", inplace=True)
# df2 mirrors the CSV-loaded frame, so build it with its own column names
df2 = pd.DataFrame(data, columns = ["#timestamp per 100 milliseconds", "Sum of page"])
df2.columns = df1.reset_index().columns  # align the column names before comparing
print(df1.reset_index().equals(df2))
Output:
True

from pandas.testing import assert_frame_equal
Dataframes used by me:
df1=pd.DataFrame({'page': {'2021-07-04 18:14:47.000': 748.0,
'2021-07-04 18:14:47.100': 0.0,
'2021-07-04 18:14:47.200': 0.0,
'2021-07-04 18:14:47.300': 3.0,
'2021-07-04 18:14:47.400': 4.0}})
df1.index.names=['logging_time']
df2=pd.DataFrame({'#timestamp per 100 milliseconds': {0: '2021-07-04 18:14:47.000',
1: '2021-07-04 18:14:47.100',
2: '2021-07-04 18:14:47.200',
3: '2021-07-04 18:14:47.300',
4: '2021-07-04 18:14:47.400'},
'Sum of page': {0: 748.0, 1: 0.0, 2: 0.0, 3: 3.0, 4: 4.0}})
Solution:
df1 = df1.reset_index()
# resetting the index of df1
df2.columns = df1.columns
# renaming the columns of df2 so that they become the same as df1's
print((df1.dtypes == df2.dtypes).all())
# If the above prints True, the dtypes are the same.
# If it prints False, check the output of print(df1.dtypes == df2.dtypes)
# and change the dtypes of one of the frames (either df1 or df2) accordingly.
# Finally:
print(assert_frame_equal(df1, df2))
# The above prints None if the frames are equal;
# otherwise it raises an AssertionError.
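If the dtypes differ only because the CSV side loads numbers or timestamps as text, assert_frame_equal can also be told to relax those checks instead of casting; a small sketch using its standard options:
from pandas.testing import assert_frame_equal

# check_dtype=False ignores dtype mismatches; check_names=False ignores the
# index/column name attributes
assert_frame_equal(df1, df2, check_dtype=False, check_names=False)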

Thanks for your answer.
But df2.columns=df1.columns
fails with this error: ValueError: Length mismatch: Expected axis has 3 elements, new values have 1 elements
Printing those columns gives:
print(df2.columns)
print(df1.columns)
Index(['index', '#timestamp per 100 milliseconds', 'Sum of page'], dtype='object')
Index(['page'], dtype='object')
And no possible change in the columns worked, how can I compare them?
Thanks very much for the help!
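Based on that output, df1 still has logging_time as its index (so only page shows up as a column) and df2 carries an extra index column, probably left over from a reset_index before the CSV export. A hedged sketch of one way to line them up, assuming that stray column can simply be dropped:
import pandas as pd
from pandas.testing import assert_frame_equal

left = df1.reset_index()              # -> columns ['logging_time', 'page']
right = df2.drop(columns=['index'])   # drop the leftover column
right.columns = left.columns          # ignore the differing label names

# the CSV side stores timestamps as text, so normalize both before comparing
left['logging_time'] = pd.to_datetime(left['logging_time'])
right['logging_time'] = pd.to_datetime(right['logging_time'])

assert_frame_equal(left, right)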

Related

Optimizing an Excel to Pandas import and transformation from wide to long data

I need to import and transform xlsx files. They are written in a wide format and I need to reproduce some of the cell information from each row and pair it up with information from all the other rows:
[Edit: changed format to represent the more complex requirements]
Source format
ID  Property  Activity1name  Activity1timestamp  Activity2name  Activity2timestamp
1   A         a              1.1.22 00:00        b              2.1.22 10:05
2   B         a              1.1.22 03:00        b              5.1.22 20:16
Target format
ID  Property  Activity  Timestamp
1   A         a         1.1.22 00:00
1   A         b         2.1.22 10:05
2   B         a         1.1.22 03:00
2   B         b         5.1.22 20:16
The following code works fine to transform the data, but the process is really, really slow:
def transform(data_in):
    data = pd.DataFrame(columns=columns)
    # Determine the number of process steps entered in a single row of the original file
    steps_per_row = int((data_in.shape[1] - (len(columns) - 2)) / len(process_matching) + 1)
    data_in = data_in.to_dict("records")  # Convert to dict for speed optimization
    for row_dict in tqdm(data_in):  # Iterate over each row of the original file
        new_row = {}
        # Set common columns for each process step
        for column in column_matching:
            new_row[column] = row_dict[column_matching[column]]
        for step in range(0, steps_per_row):
            rep = str(step + 1) if step > 0 else ""
            # Iterate for as many times as there are process steps in one row of the original file and
            # set specific columns for each process step, keeping common column values identical for the current row
            for column in process_matching:
                new_row[column] = row_dict[process_matching[column] + rep]
            data = data.append(new_row, ignore_index=True)  # append dict of new_row to existing data
    data.index.name = "SortKey"
    data[timestamp].replace(r'\.000', '', regex=True, inplace=True)  # Remove trailing zeros from timestamp
    data.replace(r'^\s*$', float('NaN'), regex=True, inplace=True)  # Replace cells with only spaces with NaN
    data.dropna(axis=0, how="all", inplace=True)  # Remove empty rows
    data.dropna(axis=1, how="all", inplace=True)  # Remove empty columns
    data.dropna(axis=0, subset=[timestamp], inplace=True)  # Drop rows with an empty Timestamp
    data.fillna('', inplace=True)  # Replace NaN values with empty cells
    return data
Obviously, iterating over each row and then even each column is not at all how to use pandas the right way, but I don't see how this kind of transformation can be vectorized.
I have tried using parallelization (modin) and played around with using dict or not, but it didn't work / help. The rest of the script literally just opens and saves the files, so the problem lies here.
I would be very grateful for any ideas on how to improve the speed!
The df.melt function should be able to do this type of operation much faster.
df = pd.DataFrame({'ID' : [1, 2],
'Property' : ['A', 'B'],
'Info1' : ['x', 'a'],
'Info2' : ['y', 'b'],
'Info3' : ['z', 'c'],
})
data=df.melt(id_vars=['ID','Property'], value_vars=['Info1', 'Info2', 'Info3'])
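For reference, a self-contained run of the snippet above and the long-format result it prints (variable and value are melt's default column names):
import pandas as pd

df = pd.DataFrame({'ID': [1, 2], 'Property': ['A', 'B'],
                   'Info1': ['x', 'a'], 'Info2': ['y', 'b'], 'Info3': ['z', 'c']})
data = df.melt(id_vars=['ID', 'Property'], value_vars=['Info1', 'Info2', 'Info3'])
print(data)
#    ID Property variable value
# 0   1        A    Info1     x
# 1   2        B    Info1     a
# 2   1        A    Info2     y
# 3   2        B    Info2     b
# 4   1        A    Info3     z
# 5   2        B    Info3     c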
** Edit to address the modified question **
Combine df.melt with a df.pivot_table operation.
import numpy as np
import pandas as pd

# create data
df = pd.DataFrame({'ID' : [1, 2, 3],
'Property' : ['A', 'B', 'C'],
'Activity1name' : ['a', 'a', 'a'],
'Activity1timestamp' : ['1_1_22', '1_1_23', '1_1_24'],
'Activity2name' : ['b', 'b', 'b'],
'Activity2timestamp' : ['2_1_22', '2_1_23', '2_1_24'],
})
# melt dataframe
df_melted = df.melt(id_vars=['ID','Property'],
value_vars=['Activity1name', 'Activity1timestamp',
'Activity2name', 'Activity2timestamp',],
)
# merge categories, i.e. Activity1name Activity2name become Activity
df_melted.loc[df_melted['variable'].str.contains('name'), 'variable'] = 'Activity'
df_melted.loc[df_melted['variable'].str.contains('timestamp'),'variable'] = 'Timestamp'
# add category ids (dataframe may need to be sorted before this operation)
u_category_ids = np.arange(1,len(df_melted.variable.unique())+1)
category_ids = np.repeat(u_category_ids,len(df)*2).astype(str)
df_melted.insert(0, 'unique_id', df_melted['ID'].astype(str) +'_'+ category_ids)
# pivot table
table = df_melted.pivot_table(index=['unique_id','ID','Property',],
columns='variable', values='value',
aggfunc=lambda x: ' '.join(x))
table = table.reset_index().drop(['unique_id'], axis=1)
Using pd.melt, as suggested by @Pantelis, I was able to speed up this transformation enormously. Before, a file with ~13k rows took 4-5 hours on a brand-new ThinkPad X1; now it takes less than 2 minutes! That's a speed-up by a factor of roughly 150, just wow. :)
Here's my new code, for inspiration / reference if anyone has a similar data structure:
def transform(data_in):
    # Determine the number of process steps entered in a single row of the original file
    steps_per_row = int((data_in.shape[1] - len(column_matching)) / len(process_matching))
    # Specify columns for pd.melt, transforming the wide data format to long format
    id_columns = column_matching.values()
    var_names = {"Erledigungstermin Auftragsschrittbeschreibung":
                 data_in["Auftragsschrittbeschreibung"].replace(" ", np.nan).dropna().values[0]}
    var_columns = ["Erledigungstermin Auftragsschrittbeschreibung"]
    for _ in range(2, steps_per_row + 1):
        try:
            var_names["Erledigungstermin Auftragsschrittbeschreibung" + str(_)] = \
                data_in["Auftragsschrittbeschreibung" + str(_)].replace(" ", np.nan).dropna().values[0]
        except IndexError:
            var_names["Erledigungstermin Auftragsschrittbeschreibung" + str(_)] = \
                data_in.loc[0, "Auftragsschrittbeschreibung" + str(_)]
        var_columns.append("Erledigungstermin Auftragsschrittbeschreibung" + str(_))
    data = pd.melt(data_in, id_vars=id_columns, value_vars=var_columns,
                   var_name="ActivityName", value_name=timestamp)
    data.replace(var_names, inplace=True)  # Replace "Erledigungstermin Auftragsschrittbeschreibung..." with the activity name
    data.sort_values(["Auftrags-\npositionsnummer", timestamp], ascending=True, inplace=True)
    # Improve column names
    data.index.name = "SortKey"
    column_names = {v: k for k, v in column_matching.items()}
    data.rename(mapper=column_names, axis="columns", inplace=True)
    data[timestamp].replace(r'\.000', '', regex=True, inplace=True)  # Remove trailing zeros from timestamp
    data.replace(r'^\s*$', float('NaN'), regex=True, inplace=True)  # Replace cells with only spaces with NaN
    data.dropna(axis=0, how="all", inplace=True)  # Remove empty rows
    data.dropna(axis=1, how="all", inplace=True)  # Remove empty columns
    data.dropna(axis=0, subset=[timestamp], inplace=True)  # Drop rows with an empty Timestamp
    data.fillna('', inplace=True)  # Replace NaN values with empty cells
    return data

Python: concat data frames then save them to one csv

I have multiple data frames. I want to get some rows from each data frame based on a certain condition, add them to one data frame, and then save them to one csv file.
I tried multiple methods; append with data frames is deprecated.
Here is the simple code. I want to retrieve the rows above and below every row whose value is larger than 2.
result = pd.concat(frames) returns the required rows with the headers, so on every iteration of the for loop it prints the required rows. However, when I save them to csv, only the last three are saved. How do I save/append the rows before writing them to the csv? What am I missing here?
df_sorted = pd.DataFrame({"ID": [1, 2, 3, 4, 5, 6],
                          "User": ['a', 'b', 'c', 'd', 'e', 'f']})
Max = pd.DataFrame()
above = pd.DataFrame()
below = pd.DataFrame()
for i in range(len(df_sorted)):
    if df_sorted.ID[i] > 2:
        Max = df_sorted.iloc[[i]]  # first df
        if i < len(df_sorted) - 1:
            above = df_sorted.iloc[[i+1]]  # second df
        if i > 0:
            below = df_sorted.iloc[[i-1]]  # third df
        frames = [above, Max, below]
        result = pd.concat(frames)
result.to_csv('new_df.csv')
The desired result should be,
ID User
2 b
3 c
4 d
3 c
4 d
5 e
4 d
5 e
6 f
5 e
6 f
What I get from result is:
ID User
5 e
6 f
6 f
Here it is:
columns = ['id', 'user']
Max = pd.DataFrame(columns=columns)
above = pd.DataFrame(columns=columns)
below = pd.DataFrame(columns=columns)
for i in range(len(df_sorted)):
    if df_sorted.ID[i] > 2:
        Max.loc[i, 'id'] = df_sorted.iloc[i, 0]
        Max.loc[i, 'user'] = df_sorted.iloc[i, 1]
        if i < len(df_sorted) - 1:
            above.loc[i, 'id'] = df_sorted.iloc[i+1, 0]
            above.loc[i, 'user'] = df_sorted.iloc[i+1, 1]
        if i > 0:
            below.loc[i, 'id'] = df_sorted.iloc[i-1, 0]
            below.loc[i, 'user'] = df_sorted.iloc[i-1, 1]
result = pd.concat([above, Max, below], axis=0)
result
It seems that you did not define Max, above and below in a way that accumulates rows.
Right now Max, above and below each hold only a single slice and are overwritten on every iteration.
You should define them as DataFrames (e.g. Max = pd.DataFrame(columns=...)) or as arrays, and likewise for above and below. That way the data is collected in these dataframes and, with concat, you don't lose it; see the sketch below.
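A more idiomatic sketch of the same idea: collect each slice in a plain Python list and call pd.concat once, outside the loop, so nothing gets overwritten (this reproduces the row order from the desired result above):
import pandas as pd

df_sorted = pd.DataFrame({"ID": [1, 2, 3, 4, 5, 6],
                          "User": ['a', 'b', 'c', 'd', 'e', 'f']})

frames = []
for i in range(len(df_sorted)):
    if df_sorted.ID[i] > 2:
        if i > 0:
            frames.append(df_sorted.iloc[[i - 1]])   # row above
        frames.append(df_sorted.iloc[[i]])           # the matching row
        if i < len(df_sorted) - 1:
            frames.append(df_sorted.iloc[[i + 1]])   # row below

result = pd.concat(frames)
result.to_csv('new_df.csv', index=False)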

How to calculate the mean and standard deviation of multiple dataframes at one go?

I have several hundred pandas dataframes, and the number of rows is not exactly the same in all of them: some have 600 rows while others have only 540.
I have two samples with exactly the same number of dataframes, and I want to read all the dataframes (around 2000) from both samples. This is what the data looks like, and I can read the files like this:
5113.440 1 0.25846 0.10166 27.96867 0.94852 -0.25846 268.29305 5113.434129
5074.760 3 0.68155 0.16566 120.18771 3.02654 -0.68155 101.02457 5074.745627
5083.340 2 0.74771 0.13267 105.59355 2.15700 -0.74771 157.52406 5083.337081
5088.150 1 0.28689 0.12986 39.65747 2.43339 -0.28689 164.40787 5088.141849
5090.780 1 0.61464 0.14479 94.72901 2.78712 -0.61464 132.25865 5090.773443
import os
import pandas as pd

# first sample
path_to_files = '/home/Desktop/computed_2d_blaze/'
lst = []
for filen in [x for x in os.listdir(path_to_files) if '.ares' in x]:
    df = pd.read_table(path_to_files+filen, skiprows=0, usecols=(0,1,2,3,4,8),
                       names=['wave','num','stlines','fwhm','EWs','MeasredWave'],
                       delimiter=r'\s+')
    df = df.sort_values('stlines', ascending=False)
    df = df.drop_duplicates('wave')
    df = df.reset_index(drop=True)
    lst.append(df)

# second sample
path_to_files1 = '/home/Desktop/computed_1d/'
lst1 = []
for filen in [x for x in os.listdir(path_to_files1) if '.ares' in x]:
    df1 = pd.read_table(path_to_files1+filen, skiprows=0, usecols=(0,1,2,3,4,8),
                        names=['wave','num','stlines','fwhm','EWs','MeasredWave'],
                        delimiter=r'\s+')
    df1 = df1.sort_values('stlines', ascending=False)
    df1 = df1.drop_duplicates('wave')
    df1 = df1.reset_index(drop=True)
    lst1.append(df1)
Now the data is stored in lists, and since the number of rows is not the same in all the dataframes, I can't subtract them directly.
So how can I subtract them correctly? And after that I want to take the average (mean) of the residuals to make a dataframe; see the sketch below.
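One hedged way to handle the unequal row counts is to align each pair of frames on the shared wave column before subtracting, then stack the residuals and aggregate. This sketch assumes lst and lst1 (built in the loops above) are paired by position and that EWs is the quantity to subtract:
import pandas as pd

residuals = []
for df_2d, df_1d in zip(lst, lst1):
    # an inner merge keeps only the wave values present in both frames
    merged = pd.merge(df_2d, df_1d, on='wave', suffixes=('_2d', '_1d'))
    merged['residual'] = merged['EWs_2d'] - merged['EWs_1d']
    residuals.append(merged[['wave', 'residual']])

all_residuals = pd.concat(residuals, ignore_index=True)
stats = all_residuals.groupby('wave')['residual'].agg(['mean', 'std'])
print(stats)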
You shouldn't use apply. Just use Boolean masking:
mask = df['wave'].between(lower_outlier, upper_outlier)
df[mask].plot(x='wave', y='stlines')
One solution that comes to mind is writing a function that finds outliers based on upper and lower bounds and then slices the data frames based on the outliers' index, e.g.
df1 = pd.DataFrame({'wave': [1, 2, 3, 4, 5]})
df2 = pd.DataFrame({'stlines': [0.1, 0.2, 0.3, 0.4, 0.5]})

def outlier(value, upper, lower):
    """
    Find outliers based on upper and lower bound
    """
    # Check if the input value is within bounds
    in_bounds = (value <= upper) and (value >= lower)
    return in_bounds

# Find outliers in the wave column of df1
outlier_index = df1.wave.apply(lambda x: outlier(x, 4, 1))
# Return df2 without values at the outlier index
df2[outlier_index]
# Return df1 without values at the outlier index
df1[outlier_index]

Pandas rename df rows from list

I have looked at many similar questions, yet I still cannot get pandas to rename the rows of a df from a list of values from another df. What am I doing wrong?
def calculate_liabilities(stakes_df):
    if not stakes_df.empty:
        liabilities_df = pd.DataFrame(decimal_odds_lay.values * stakes_df.values)  # makes df with stakes rows, decimal odds columns
        stakes_list = stakes_df.to_dict()
        print(stakes_list)
        liabilities_df = liabilities_df.rename(stakes_list)
        return liabilities_df
    else:
        print("Failure to calculate liabilities")
stakes_list = stakes_df.to_dict() gives the following dict:
{'Stakes': {0: 3.7400000000000002, 1: 5.5999999999999996, 2: 7.0700000000000003}}
I want the rows of liabilities_df to be renamed 3.7400000000000002, 5.5999999999999996 and 7.0700000000000003 respectively.
If you want to rename liabilities_df's row names (its index) to stakes_df's values, you need to pass a plain dict, not a dict of dicts.
liabilities_df = liabilities_df.rename(stakes_list['Stakes'])
example:
df= pd.DataFrame([1,2,3])
0
0 1
1 2
2 3
df.rename({0: 3.7400000000000002, 1: 5.5999999999999996, 2: 7.0700000000000003})
0
3.74 1
5.60 2
7.07 3
You can rename the rows with a flat mapping; here you are passing a dict of dicts, which is why it doesn't work.
It would be better if you gave us the data, but you don't need to build a nested dictionary from stakes_list; see the sketch below.
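A minimal sketch of the same fix inside the function (same idea as the answer above: pass the inner mapping straight to rename):
# stakes_df.to_dict() returns {'Stakes': {...}}; rename wants the inner dict
liabilities_df = liabilities_df.rename(index=stakes_df.to_dict()['Stakes'])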

Given a pandas dataframe, is there an easy way to print out a command to generate it?

After running some commands I have a pandas dataframe, eg.:
>>> print df
B A
1 2 1
2 3 2
3 4 3
4 5 4
I would like to print this out so that it produces simple code that would recreate it, eg.:
DataFrame([[2,1],[3,2],[4,3],[5,4]],columns=['B','A'],index=[1,2,3,4])
I tried pulling out each of the three pieces (data, columns and rows):
[[e for e in row] for row in df.iterrows()]
[c for c in df.columns]
[r for r in df.index]
but the first line fails because e is not a value but a Series.
Is there a pre-built command to do this, and if not, how do I do it? Thanks.
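For what it's worth, a minimal sketch that builds such a command from the frame's own pieces; df.values.tolist() avoids the (index, Series) tuples that iterrows() yields:
import pandas as pd

df = pd.DataFrame([[2, 1], [3, 2], [4, 3], [5, 4]], columns=['B', 'A'], index=[1, 2, 3, 4])
cmd = "DataFrame({0}, columns={1}, index={2})".format(
    df.values.tolist(), list(df.columns), list(df.index))
print(cmd)
# DataFrame([[2, 1], [3, 2], [4, 3], [5, 4]], columns=['B', 'A'], index=[1, 2, 3, 4])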
You can get the values of the data frame in array format by calling df.values:
df = pd.DataFrame([[2,1],[3,2],[4,3],[5,4]],columns=['B','A'],index=[1,2,3,4])
arrays = df.values
cols = df.columns
index = df.index
df2 = pd.DataFrame(arrays, columns = cols, index = index)
Based on @Woody Pride's approach, here is the full solution I am using. It handles hierarchical indices and index names.
from pandas import DataFrame, MultiIndex

def _gencmd(df, pandas_as='pd'):
    """
    With this addition to DataFrame's methods, you can use:
        df.command()
    to get the command required to regenerate the dataframe df.
    """
    if pandas_as:
        pandas_as += '.'
    index_cmd = df.index.__class__.__name__
    if type(df.index) == MultiIndex:
        index_cmd += '.from_tuples({0}, names={1})'.format([i for i in df.index], df.index.names)
    else:
        index_cmd += "({0}, name='{1}')".format([i for i in df.index], df.index.name)
    return 'DataFrame({0}, index={1}{2}, columns={3})'.format([[xx for xx in x] for x in df.values],
                                                              pandas_as,
                                                              index_cmd,
                                                              [c for c in df.columns])

# Attach as a method; on Python 3, assigning the function is enough
# (the original used types.MethodType, which was the Python 2 way)
DataFrame.command = _gencmd
I have only tested it on a few cases so far and would love a more general solution.
