Best way to compare two huge dataframes in Python

I have a use case where I need to compare a file from an S3 bucket with the output of a SQL query.
Below is how I am reading the S3 file:
import pandas as pd
from io import BytesIO

if s3_data is None:
    s3_data = pd.read_feather(BytesIO(obj.get()['Body'].read()))
else:
    s3_data = s3_data.append(pd.read_feather(BytesIO(obj.get()['Body'].read())), ignore_index=True)
Below is how I am reading from the database.
db_data = pandas.read_sql(...)
Now I need to compare s3_data with db_data. Both of these dataframes are huge with as much as 2 million rows of data each.
The format of the data is somewhat like below.
name | Age | Gender
--------------------
Peter| 30 | Male
Tom | 24 | Male
Riya | 28 | Female
Now I need to validate whether exact same rows with same column data exist in both s3 file and db.
I tried using dataframe.compare(), but the kind of results it gives is not what I am looking for. The position of a row in the dataframe is not relevant for me. So if 'Tom' appears in row 1 or 3 in one of the dataframes and at a different position in the other, it should still pass the equality validation, as long as the record itself, with the same column values, is present in both. dataframe.compare() is not helping me in this case.
The alternate approach I took was to use csv_diff. I merged all the columns into one in the following way and saved the result locally as a CSV, creating two CSV files: one for the S3 data and one for the DB data.
data
------------
Peter+30+Male
Tom+24+Male
Riya+28+Female
Then, using the code below, I am comparing the files.
from csv_diff import load_csv, compare

s3_file = load_csv(open("s3.csv"), key="data")
db_file = load_csv(open("db.csv"), key="data")
diff = compare(s3_file, db_file)
This works, but it is not very performant, as I first have to write huge CSV files (as big as 500 MB) to local disk and then read and compare them this way.
Is there a better way to handle the comparison without writing files to local disk, while still ensuring that I can compare entire records (with each column value compared for a given row) for equality, irrespective of the position of the row in the dataframe?
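For what it's worth, one in-memory way to do this kind of position-independent comparison is an outer merge with an indicator column; below is a rough sketch (not from the original post) that assumes both frames share the same column names and compatible dtypes, and that ignores duplicate-row counts.
# Rough sketch: rows present in only one of the two frames show up with
# _merge equal to 'left_only' or 'right_only'.
merged = s3_data.merge(db_data, how="outer", indicator=True)
only_in_s3 = merged[merged["_merge"] == "left_only"]
only_in_db = merged[merged["_merge"] == "right_only"]
frames_match = only_in_s3.empty and only_in_db.empty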

Related

Which function could do a further calculation on top of INDEX+MATCH (Excel) in Python?

I'm very new to using Python for my work. Here's the Excel worksheet I have:
A is the product name in July,
B is the product quantity in July,
C is the product name in August,
D is the product quantity in August.
I need to get the difference between them:
find the quantity of each product actually sold in the next month
and calculate the subtraction.
|A | B|C | D|
|SRHAVWRQT | 1|SRHAVWRQT | 4|
|SCMARAED3MQT| 0|SCMARAED3MQT| 2|
|WVVMUMCRQT | 7|WVVMUMCBR | 7|
...
...
I know how to solve this in Excel with INDEX + MATCH and a subtraction:
=G3-INDEX(B:B,MATCH(F3,A:A,0))
which gives me the result I need.
(screenshots: the original data and the desired result)
But how would I do this in Python, and which tool should I use (e.g. pandas? numpy?)?
The other answers I've read seem to perform only the INDEX/MATCH lookup itself, and/or they solve calculations across multiple sheets, but I just need the result for two columns.
How to perform an Excel INDEX MATCH equivalent in Python
Index match with python
Calculate Match percentage of values from One sheet to another Using Python
https://focaalvarez.medium.com/vlookup-and-index-match-equivalences-in-pandas-160ac2910399
Or is there a completely different way of processing this in Python?
A classic use case there. For anything involving Excel and Python, you'll want to familiarise yourself with the Pandas library; it can handle a lot of what you're asking for.
Now, on to how to solve this particular problem. I'm going to assume that the data in the relevant worksheet is as you showed it above: no column headings, with the data starting at row 1 in columns A, B, C and D. You could use the code below to load this into Python. It loads the sheet without column or row names, so the dataframe in Python starts at [0, 0] rather than "A1", since rows and columns in Pandas are numbered from 0.
import pandas as pd
excelData = pd.read_excel("<DOCUMENT NAME>", sheet_name="<SHEET NAME>", header=None)
After you have loaded the data, you then need to match the month 1 data to its month 2 indices. This is a little complicated, and the way I recommend doing it involves defining your own python function using the "def" keyword. A version of this I quickly whipped up is below:
# Extract columns "A & B" and "C & D" into separate month 1 and month 2 dataframes respectively.
month1_data: pd.DataFrame = excelData.iloc[:, 0:2]
month2_data: pd.DataFrame = excelData.iloc[:, 2:4]

# Define a matching function that matches a single row (series) against the first
# column (product names) of a passed dataframe and appends the matching row position.
def matchDataToIndex(dataLine: pd.Series, comparison: pd.DataFrame):
    matchIndex = comparison.iloc[:, 0].tolist().index(dataLine.tolist()[0])
    return pd.concat([dataLine, pd.Series([matchIndex])], ignore_index=True)

# Apply the matching function to each row of the month 1 data.
month1_data_with_match = month1_data.apply(matchDataToIndex, axis=1, args=(month2_data,))
There is a lot of stuff there that you are probably not familiar with if you are only just getting started with Python, which is why I recommend getting acquainted with Pandas. That being said, after it is run, the variable month1_data_with_match will be a three-column table containing the following columns:
Your Column A product name data.
Your Column B product quantity data.
An index expressing which row in month2_data contains the matching Column C and D data.
With those three pieces of information together, you should then be able to calculate your other statistics.
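For example, a possible continuation (a sketch with made-up column names, not part of the original answer) that uses the matched row position to compute the month-over-month quantity difference:
# Illustrative only: rename the columns, look up the month-2 quantity via the
# matched row position, and subtract.
month1_data_with_match.columns = ["product", "qty_month1", "month2_row"]
month2_qty = month2_data.iloc[month1_data_with_match["month2_row"].to_numpy(), 1].to_numpy()
month1_data_with_match["qty_diff"] = month2_qty - month1_data_with_match["qty_month1"]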

Combining several csv files and calculating averages of a column

I have a folder with several csv files. Each file has several columns but the ones I'm interested in are id, sequence (3 different sequences) and time. See an example of a csv file below. I have over 40 csv files-each file belongs to a different participant.
id  | sequence | time
---------------------
300 |    1     |  2
300 |    2     |  3
300 |    1     |  3
etc.
I need to calculate the average times for the 3 different sequences for each participant. The code I currently have combines the csv files into one dataframe, selects the columns I'm interested in (id, sequence and time), calculates the average for each person for the 3 conditions, pivots the table to wide format (the format I need) and exports this as a csv file. I am not sure if this is the best way to do it, but it works. However, for some rows 'time' is 0, and I want to exclude those rows from the averages. How do I do this? Many thanks in advance.
import glob
import pandas as pd

filenames = sorted(glob.glob('times*.csv'))
df = pd.concat((pd.read_csv(filename) for filename in filenames))
df_new = df[["id", "sequence", "time"]]
df_new_ave = df_new.groupby(['id', 'sequence'])['time'].mean().reset_index(name='Avg_Score')
df_new_ave_wide = df_new_ave.pivot(index='id', columns='sequence', values='Avg_Score')
df_new_ave_wide.to_csv('final_ave.csv', encoding='utf-8', index=True)
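To exclude the zero times, one option (a small sketch; it assumes a time of 0 simply means the row should not count towards the average) is to filter df_new before the groupby:
# drop rows whose time is 0, then compute the group means as before
df_new = df_new[df_new["time"] != 0]
df_new_ave = df_new.groupby(['id', 'sequence'])['time'].mean().reset_index(name='Avg_Score')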

Is it possible to modify output data file names in pySpark?

Simplified case.
Given that I have 5 input files in directory data_directory:
data_2020-01-01.txt,
data_2020-01-02.txt,
data_2020-01-03.txt,
data_2020-01-04.txt,
data_2020-01-05.txt
I read them all into a pySpark RDD and perform some operation on them that doesn't do any shuffling.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("Clean Data").getOrCreate()
sparkContext = spark.sparkContext
input_rdd = sparkContext.textFile("data_directory/*")
result = input_rdd.mapPartitions(lambda x: remove_corrupted_rows(x))
Now I want to save data:
result.saveAsTextFile(
    "results",
    compressionCodecClass="org.apache.hadoop.io.compress.GzipCodec",
)
And I get 5 files whose names all contain "part", so I've lost the information about which input file each output file came from:
._SUCCESS.crc
.part-00000.gz.crc
.part-00001.gz.crc
.part-00002.gz.crc
.part-00003.gz.crc
.part-00004.gz.crc
_SUCCESS
part-00000.gz
part-00001.gz
part-00002.gz
part-00003.gz
part-00004.gz
Is there any way to keep the input file names or introduce my own naming pattern in this case?
Expected desired result:
._SUCCESS.crc
.data_2020-01-01.gz.crc
.data_2020-01-02.gz.crc
.data_2020-01-03.gz.crc
.data_2020-01-04.gz.crc
.data_2020-01-05.gz.crc
_SUCCESS
data_2020-01-01.gz
data_2020-01-02.gz
data_2020-01-03.gz
data_2020-01-04.gz
data_2020-01-05.gz
You could use pyspark.sql.functions.input_file_name() (docs here https://spark.apache.org/docs/latest/api/python/pyspark.sql.html?highlight=functions#pyspark.sql.functions.input_file_name) and then partition your dataframe by the column created.
This way, 5 input files should give you a categorical column with 5 different values and partitioning on it should split your output into 5 parts.
Alternatively, if you wish to have a full naming pattern, then split the dataframe on the input_file_name() column (here into 5 dataframes), repartition (e.g. to 1 using coalesce(1)) and then save with custom logic (e.g. a dict mapping, or by extracting the filename from the column and passing it to DataFrameWriter.csv() as the name).
N.B.: When coalescing to 1 partition, be sure that all the data fits into memory!
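Below is a rough sketch of the first suggestion (input_file_name() plus partitionBy), written with the DataFrame API rather than the RDD API used in the question; the cleaning step is omitted, and the regexp for extracting the base file name is an assumption.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("Clean Data").getOrCreate()

# Read the raw lines and record which input file each line came from.
df = spark.read.text("data_directory/*")
df = df.withColumn("source_file",
                   F.regexp_extract(F.input_file_name(), r"([^/]+)\.txt$", 1))

# Partitioning on the derived column writes one sub-directory per input file,
# e.g. results/source_file=data_2020-01-01/part-....gz
(df.write
   .partitionBy("source_file")
   .option("compression", "gzip")
   .text("results"))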

Pandas apply Series- Order of the columns

To aggregate and find values per second, I am doing the following in Python using pandas; however, the output logged to a file doesn't show the columns in the way they appear here. Somehow the column names are sorted, and hence TotalDMLsSec shows up before UpdateTotal and UpdatesSec.
'DeletesTotal': x['Delete'].sum(),
'DeletesSec': x['Delete'].sum()/VSeconds,
'SelectsTotal': x['Select'].sum(),
'SelectsSec': x['Select'].sum()/VSeconds,
'UpdateTotal': x['Update'].sum(),
'UpdatesSec': x['Update'].sum()/VSeconds,
'InsertsTotal': x['Insert'].sum(),
'InsertsSec': x['Insert'].sum()/VSeconds,
'TotalDMLsSec':(x['Delete'].sum()+x['Update'].sum()+x['Insert'].sum())/VSeconds
})
)
df.to_csv('/home/summary.log', sep='\t', encoding='utf-8-sig')
Apart from the above question, I have a couple of other questions:
Despite logging in CSV format, all values/columns appear in a single column in Excel; is there any way to load the data properly as CSV?
Can rows be sorted based on one column (let's say InsertsSec) by default when writing to the CSV file?
Any help here would be really appreciated.
Assume that your DataFrame is something like this:
Deletes Selects Updates Inserts
Name
Xxx 20 10 40 50
Yyy 12 32 24 11
Zzz 70 20 30 20
Then both total and total per sec can be computed as:
total = df.sum().rename('Total')
VSeconds = 5 # I assumed some value
tps = (total / VSeconds).rename('Total per sec')
Then you can add both above rows to the DataFrame:
df = df.append(total).append(tps)
The downside is that all numbers are converted to float, but in Pandas there is no other way, as each column must have values of one type.
Then you can e.g. write it to a CSV file (with totals included).
This is how I ended up doing it:
df.to_excel(vExcelFile, 'All')
vSortedDF = df.sort_values(['Deletes%'], ascending=False)
vSortedDF.loc[vSortedDF['Deletes%'] > 5, ['DeletesTotal', 'DeletesSec', 'Deletes%']].to_excel(vExcelFile, 'Top Delete objects')
vExcelFile.save()
For the CSV, instead of the \t separator I used , and it worked just fine.
df.to_csv('/home/summary.log', sep=',', encoding='utf-8-sig')

What is the Best way to compare large datasets from two different sources in Python?

I have large datasets from 2 sources: one is a huge CSV file and the other comes from a database query. I am writing a validation script to compare the data from both sources and log/print the differences. One thing I think is worth mentioning is that the data from the two sources is not in the exact same format or order. For example:
Source 1 (CSV files):
email1#gmail.com,key1,1
email2#gmail.com,key1,3
email1#gmail.com,key2,1
email1#gmail.com,key3,5
email2#gmail.com,key3,2
email2#gmail.com,key3,2
email3#gmail.com,key2,3
email3#gmail.com,key3,1
Source 2 (Database):
email key1 key2 key3
email1#gmail.com 1 1 5
email2#gmail.com 3 2 <null>
email4#gmail.com 1 1 5
The output of the script I want is something like:
source1 - source2 (or csv - db): 2 rows total with differences
email2#gmail.com 3 2 2
email3#gmail.com <null> 3 1
source2 - source1 (or db-csv): 2 rows total with differences
email2#gmail.com 3 2 <null>
email4#gmail.com 1 1 5
The output format could be a little different, to show the differences more clearly (out of thousands/millions of records).
I started writing a script that saves the data from both sources into two dictionaries and then loops through the dictionaries or creates sets from them, but it seems like a very inefficient process. I considered using pandas, but pandas doesn't seem to have a way to do this type of comparison of dataframes.
Please tell me if there's a better/more efficient way. Thanks in advance!
You were on the right path. What you want is to quickly match the 2 tables. Pandas is probably overkill.
You probably want to iterate through your first table and create a dictionary. What you don't want to do is scan one list for each element of the other; even small lists would demand a large number of searches.
The csv module is a good way to read your data from disk. For each row, put it in a dictionary where the key is the email and the value is the complete row. On a common desktop computer you can iterate over 10 million rows in a second.
Now you iterate through the second table, and for each row you use the email to get the data from the dictionary. Note that this way, since a dict is a data structure where you can look up a key in O(1), you'll only walk through N + M rows. In a couple of seconds you should be able to compare both tables. It is really simple. Here is some sample code:
import csv

firstTable = {}
with open('firstTable.csv', 'r') as csvfile:
    reader = csv.reader(csvfile, delimiter=',')
    for row in reader:
        firstTable[row[0]] = row  # email is in row[0]

for row2 in get_db_table2():
    email = row2[0]
    row1 = firstTable[email]  # this is a hash lookup; the access is very quick
    my_complex_comparison_func(row1, row2)
If you don't have enough RAM to fit all the keys of the first dictionary in memory, you can use the shelve module for the firstTable variable. That will create an index on disk with very quick access.
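A minimal sketch of the shelve variant (the shelf file name is illustrative):
import csv
import shelve

# Build the email -> row lookup on disk instead of in RAM.
with shelve.open('firstTable.shelf') as firstTable:
    with open('firstTable.csv', 'r') as csvfile:
        for row in csv.reader(csvfile, delimiter=','):
            firstTable[row[0]] = row  # email is the key, as before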
Since one of your tables is already in a database, maybe what I'd do first is use your database to load the CSV data from disk into a temporary table. Create an index and do an inner join on the tables (or an outer join if you need to know which rows don't have data in the other table); databases are optimized for this kind of operation. You can then run a select from Python to get the joined rows and use Python for your complex comparison logic.
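A rough sketch of that idea, using SQLite as a stand-in for whatever database you actually have; the table names ('csv_data', 'db_data'), file name and column names are made up, and it assumes the existing DB table is reachable from the same connection.
import sqlite3
import pandas as pd

conn = sqlite3.connect('validation.db')  # stand-in database

# Load the CSV into a temporary table and index it on the join key.
csv_df = pd.read_csv('source1.csv', names=['email', 'key', 'value'])
csv_df.to_sql('csv_data', conn, if_exists='replace', index=False)
conn.execute('CREATE INDEX IF NOT EXISTS idx_csv_email ON csv_data(email)')

# Let the database do the matching; a LEFT JOIN keeps CSV rows with no DB match.
joined = pd.read_sql(
    """
    SELECT c.email, c.key, c.value, d.key1, d.key2, d.key3
    FROM csv_data AS c
    LEFT JOIN db_data AS d ON d.email = c.email
    """,
    conn,
)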
You can use pivot to convert the df, then use drop_duplicates after concat:
df2 = df2.applymap(lambda x: pd.to_numeric(x, errors='ignore'))

pd.concat([df.pivot(*df.columns).reset_index(), df2], keys=['db', 'csv']).\
    drop_duplicates(keep=False).\
    reset_index(level=0).\
    rename(columns={'level_0': 'source'})
Out[261]:
key source             email key1 key2    key3
1       db  email2#gmail.com    3    2       2
1      csv  email2#gmail.com    3    2  <null>
Notice that here I am using to_numeric to convert your df2 to numeric values.
