I have a number of tests where the Pandas dataframe output needs to be compared with a static baseline file. My preferred option for the baseline file format is the csv format for its readability and easy maintenance within Git. But if I were to load the csv file into a dataframe, and use
A.equals(B)
where A is the output dataframe and B is the dataframe loaded from the CSV file, inevitably there will be errors as the csv file does not record datatypes and what-nots. So my rather contrived solution is to write the dataframe A into a CSV file and load it back out the same way as B then ask whether they are equal.
Does anyone have a better solution that they have been using for some time without any issues?
If you are worried about the datatypes of the csv file, you can load it as a dataframe with specific datatypes as following:
import pandas as pd
B = pd.DataFrame('path_to_csv.csv', dtypes={"col1": "int", "col2": "float64", "col3": "object"} )
This will ensure that each column of the csv is read as a particular data type
After that you can just compare the dataframes easily by using
A.equals(B)
EDIT:
If you need to compare a lot of pairs, another way to do it would be to compare the hash values of the dataframes instead of comparing each row and column of individual data frames
hashA = hash(A.values.tobytes())
hashB = hash(B.values.tobytes())
Now compare these two hash values which are just integers to check if the original data frames were same or not.
Be Careful though: I am not sure if the data types of the original data frame would matter or not. Be sure to check that.
I came across a solution that does work for my case by making use of Pandas testing utilities.
from pandas.util.testing import assert_frame_equal
Then call it from within a try except block where check_dtype is set to False.
try:
assert_frame_equal(A, B, check_dtype=False)
print("The dataframes are the same.")
except:
print("Please verify data integrity.")
(A != B).any(1) returns a Series with Boolean values which tells you which rows are equal and which ones aren't ...
Boolean values are internally represented by 1's and 0's, so you can do a sum() to check how many rows were not equal.
sum((A != B).any(1))
If you get an output of 0, that would mean all rows were equal.
Related
I am using pyspark==2.3.1. I have performed data preprocessing on the data using pandas now I want to convert my preprocessing function into pyspark from pandas. But while reading the data CSV file using pyspark lot of values become null of the column that has actually some values. If I try to perform any operation on this dataframe then it is swapping the values of the columns with other columns. I also tried different versions of pyspark. Please let me know what I am doing wrong. Thanks
Result from pyspark:
The values of the column "property_type" have null but the actual dataframe has some value instead of null.
CSV File:
But pyspark is working fine with small datasets. i.e.
In our we faced the similar issue. Things you need to check
Check wether your data as " [double quotes] pypark would consider this as delimiter and data looks like malformed
Check wether your csv data as multiline
We handled this situation by mentioning the following configuration
spark.read.options(header=True, inferSchema=True, escape='"').option("multiline",'true').csv(schema_file_location)
Are you limited to use CSV fileformat?
Try parquet. Just save your DataFrame in pandas with .to_parquet() instead of .to_csv(). Spark works with this format really well.
So I have a python script that compares two dataframes and works to find any rows that are not in both dataframes. It currently iterates through a for loop which is slow.
I want to improve the speed of the process, and know that iteration is the problem. However, I haven't been having much luck using various numpy methods such as merge and where.
Couple of caveats:
The column names from my file sources aren't the same, so I set their names into variables and use the variable names to compare.
I want to only use the column names from one of the dataframes.
df_new represents new information to be checked against what is currently on file (df_current)
My current code:
set_current = set(df_current[current_col_name])
df_out = pd.DataFrame(columns=df_new.columns)
for i in range(len(df_new.index)):
# if the row entry is new, we add it to our dataset
if not df_new[new_col_name][i] in set_current:
df_out.loc[len(df_out)] = df_new.iloc[i]
# if the row entry is a match, then we aren't going to do anything with it
else:
continue
# create a xlsx file with the new items
df_out.to_excel("data/new_products_to_examine.xlsx", index=False)
Here are some simple examples of dataframes I would be working with:
df_current
|partno|description|category|cost|price|upc|brand|color|size|year|
|:-----|:----------|:-------|:---|:----|:--|:----|:----|:---|:---|
|123|Logo T-Shirt||25|49.99||apple|red|large|2021||
|456|Knitted Shirt||35|69.99||apple|green|medium|2021||
df_new
|mfgr_num|desc|category|cost|msrp|upc|style|brand|color|size|year|
|:-------|:---|:-------|:---|:---|:--|:----|:----|:----|:---|:---|
|456|Knitted Shirt||35|69.99|||apple|green|medium|2021|
|789|Logo Vest||20|39.99|||apple|yellow|small|2022|
There are usually many more columns in the current sheet, but I wanted the table displayed to be somewhat readable. The key is that I would only want the columns in the "new" dataframe to be output.
I would want to match partno with mfgr_num since the spreadsheets will always have them, whereas some items don't have upc/gtin/ean.
It's still a unclear what you want without providing examples of each dataframe. But if you want to test unique IDs in differently named columns in two different dataframes, try an approach like this.
Find the IDs that exist in the second dataframe
test_ids = df2['cola_id'].unique().tolist()
the filter the first dataframe for those IDs.
df1[df1['keep_id'].isin(test_ids)]
Here is the answer that works - was supplied to me by someone much smarter.
df_out = df_new[~df_new[new_col_name].isin(df_current[current_col_name])]
I am still learning python, kindly excuse if the question looks trivial to some.
I have a csv file with following format and I want to extract a small segment of it and write to another csv file:
So, this is what I want to do:
Just extract the entries under actor_list2 and the corresponding id column and write it to a csv file in following format.
Since the format is not a regular column headers followed by some values, I am not sure how to select starting point based on a cell value in a particular column.e.g. even if we consider actor_list2, then it may have any number of entries under that. Please help me understand if it can be done using pandas dataframe processing capability.
Update: The reason why I would like to automate it is because there can be thousands of such files and it would be impractical to manually get that info to create the final csv file which will essentially have a row for each file.
As Nour-Allah has pointed out the formatting here is not very regular to say the least. The best you can do if that is the case that your data comes out like this every time is to skip some rows of the file:
import pandas as pd
df = pd.read_csv('blabla.csv', skiprows=list(range(17)), nrows=8)
df_res = df.loc[:, ['actor_list2', 'ID']]
This should get you the result but given how erratic formatting is, this is no way to automate. What if next time there's another actor? Or one fewer? Even Nour-Allah's solution would not help there.
Honestly, you should just get better data.
As the CSV file you have is not regular, so a lot of empty position, that contains 'nan' objects. Meanwhile, the columns will be indexed.
I will use pandas to read
import pandas as pd
df = pd.read_csv("not_regular_format.csv", header=None)
Then, initialize and empty dictionary to store the results in, and use it to build an output DataFram, which finally send its content to a CSV file
target={}
Now you need to find actor_list2 in the second columns which is the column with the index 0, and if it exists, start store the names and scores from in the next rows and columns 1 and 2 in the dictionary target
rows_index = df[df[1] == 'actor_list2'].index
if len(rows_index) > 0:
i = rows_index[0]
while True:
i += 1
name = df.iloc[i, 1]
score = df.iloc[i, 2]
if pd.isna(name): # the names sequence is finished and 'nan' object exists.
break
target[name] = [score]
and finally, construct DataFrame and write the new output.csv file
df_output=pd.DataFrame(target)
df_output.to_csv('output.csv')
Now, you can go anywhere with the given example above.
Good Luck
import pandas as pd
DATA = pd.read_csv(url)
DATA.head()
I have a large dataset that have dozens of columns. After loading it like above into Colab, I can see the name of each column. But running DATA.columns just return Index([], dtype='object'). What's happening in this?
Now I find it impossible to pick out a few columns without column names. One way is to specify names = [...] when I load it, but I'm reluctant to do that since there're too many columns. So I'm looking for a way to index a column by integers, like in R df[:,[1,2,3]] would simply give me the first three columns of a dataframe. Somehow Pandas seems to focus on column names and makes integer indexing very inconvenient, though.
So what I'm asking is (1) What did I do wrong? Can I obtain those column names as well when I load the dataframe? (2) If not, how can I pick out the [0, 1, 10]th column by a list of integers?
It seems that the problem is in the loading as DATA.shape returns (10000,0). I rerun the loading code a few times, and all of a sudden, things go back normal. Maybe Colab was taking a nap or something?
You can perfectly do that using df.loc[:,[1,2,3]] but i would suggest you to use the names because if the columns ever change the order or you insert new columns, the code can break it.
I have a python code which performs mathematical calculations on multiple columns of the dataframe. This input comes from various sources so there is a possibility that sometimes one column is missing from the same.
This column is missing because its insignificant but i need to have a null column atleast for the code to run without errors.
I can add a null column using if loop but there are around 120 columns and i do not want to slow down the code. Is there any other way where the code can check each column is present in the original dataframe and then if any column is not present it adds a null column and then starts with execution of the actual code?
If you know that the column name is the same for every dataframe you could do something like this without having to loop over the column names
if col_name not in df.columns:
df[col_name] = '' # or whatever value you want to set it to
If speed is a super concern, which I can't tell, you could always convert the the columns to a set with set(df.columns) and reduce the search to O(1) time because it will be a hashed search. You can read more in detail on the efficiency of the in operator at this link How efficient is Python's 'in' or 'not in' operators?