How to validate dataframe in pandera using multiple columns - python

I have the following dataframe. I need to validate it to check whether any row has both the Name and Tag columns NULL at the same time.
I tried the following, but the indexes reported as failures are 0 & 2.
import pandas as pd
import pandera as pa

data = [['Alex', 10, 't1'], ['Bob', 12, None], ['Clarke', 13, 't3'],
        [None, 14, 't3'], [None, 15, None]]
df = pd.DataFrame(data, columns=['Name', 'Age', 'Tag'])

schema = pa.DataFrameSchema(
    checks=pa.Check(lambda df: ~(pd.notnull(df["Name"]) & pd.notnull(df["Tag"])))
)

try:
    schema.validate(df, lazy=True)
except pa.errors.SchemaErrors as err:
    print("Schema errors and failure cases:")
    print(err.failure_cases)
I want the above code to report index 4. How should I create the check for the pandera schema?

As per the docs on Handling null values:

By default, pandera drops null values before passing the objects to validate into the check function. For Series objects null elements are dropped (this also applies to columns), and for DataFrame objects, rows with any null value are dropped.

If you want to check the properties of a pandas data structure while preserving null values, specify Check(..., ignore_na=False) when defining a check.
That way, make sure to add ignore_na=False:

schema = pa.DataFrameSchema(
    checks=pa.Check(
        lambda df: ~(df['Name'].isnull() & df['Tag'].isnull()),
        ignore_na=False,
    )
)
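With the df from the question and the schema above, a quick check (lazy=True is pandera's way to collect all failures into a single SchemaErrors exception) should surface just the last row:

try:
    schema.validate(df, lazy=True)   # lazy=True gathers every failure into SchemaErrors
except pa.errors.SchemaErrors as err:
    print(err.failure_cases)         # the offending row index (4) appears in the 'index' column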

Related

How to convert a dataframe to JSON without the index?

I have the following dataframe output that I would like to convert to json, but it adds a leading zero, which gets added to the json. How do I remove it? Pandas by default numbers each row.
id version ... token type_id
0 10927076529 0 ... 56599bb6-3b56-425b-8688-8fc0c73fbedc 3
{"0":{"id":10927076529,"version":0,"token":"56599bb6-3b56-425b-8688-8fc0c73fbedc","type_id":3}}
df = df.rename(columns={'id': 'version', 'token': 'type_id' })
df2 = df.to_json(orient="index")
print(df2)
Pandas has that 0 value as the row index for your single DataFrame entry. You can't remove it in the actual DataFrame as far as I know.
This is showing up in your JSON specifically because you're using the "index" option for the "orient" parameter.
If you want each row in your final dataframe to be a separate entry, you can try the "records" option instead of "index".
df2 = df.to_json(orient="records")
This hyperlink has a good illustration of the different options.
Another option is to set one of your columns as the index, such as id or version. That way the JSON keys carry meaningful values instead of the default integer index provided by Pandas.
df = df.set_index('version')
df2 = df.to_json(orient="index")
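As a rough illustration with a made-up one-row frame (not the asker's data), the two orient options differ like this:

import pandas as pd

df = pd.DataFrame([{"id": 10927076529, "version": 0, "type_id": 3}])

print(df.to_json(orient="index"))
# {"0":{"id":10927076529,"version":0,"type_id":3}}   <- keyed by the row index
print(df.to_json(orient="records"))
# [{"id":10927076529,"version":0,"type_id":3}]       <- a plain list, no index key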

Process single data set with different JSON schema rows using Pyspark

I am using PySpark and I need to process log files that are appended into a single data frame. Most of the columns look normal, but one of the columns holds a JSON string in {}. Basically, each row is an individual event, and for the JSON string I can apply an individual schema per event. But I don't know the best way to process the data here.
Example:
This table will later help me aggregate the events in the way I need.
I tried to use the withColumn function with from_json. It worked successfully for a single event type:

from pyspark.sql.types import *
import pyspark.sql.functions as F

df = (df
      .withColumn("nested_json",
                  F.when(F.col("event_name") == "EventStart",
                         F.from_json("json_string", "Name String, Version Int, Id Int"))))

It did what I want for my 1st row when I query nested_json. But the schema is applied to the whole column, and I would like to process each row depending on event_name.
I was naive and tried to do this:
from pyspark.sql.types import *
import pyspark.sql.functions as F

df = (df
      .withColumn("nested_json",
                  F.when(F.col("event_name") == "EventStart",
                         F.from_json("json_string", "Name String, Version Int, Id Int"))
                  F.when(F.col("event_name") == "Action1",
                         F.from_json("json_string", "Name String, Version Int, UserName String, PosX int, PosY int"))))
And this failed to run with: when() can only be applied on a Column previously generated by when() function
I assumed my 1st withColumn applied the schema to the whole column.
What other options do I have to apply a JSON schema based on the event_name value and flatten the values?
What if you chain your when statements?
For example,
df.withColumn("nested_json",
              F.when(F.col("event_name") == "EventStart", F.from_json(...))
               .when(F.col("event_name") == "Action1", F.from_json(...)))
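One caveat worth hedging, beyond the answer above: all branches of a when chain must resolve to a compatible column type, so if the per-event struct schemas differ, Spark may refuse to put them in a single column. In that case, one alternative sketch is to parse each event type into its own column and flatten afterwards; the column names, schemas, and sample rows below are invented for illustration:

from pyspark.sql import SparkSession
import pyspark.sql.functions as F

spark = SparkSession.builder.getOrCreate()

# Toy stand-in for the appended log frame (names and values are assumptions)
df = spark.createDataFrame(
    [("EventStart", '{"Name": "app", "Version": 2, "Id": 7}'),
     ("Action1", '{"Name": "click", "Version": 2, "UserName": "bob", "PosX": 10, "PosY": 20}')],
    ["event_name", "json_string"])

# Parse each event type with its own schema into its own struct column,
# so the differing schemas never need to share a single column type.
parsed = (df
          .withColumn("start_json",
                      F.when(F.col("event_name") == "EventStart",
                             F.from_json("json_string", "Name STRING, Version INT, Id INT")))
          .withColumn("action_json",
                      F.when(F.col("event_name") == "Action1",
                             F.from_json("json_string",
                                         "Name STRING, Version INT, UserName STRING, PosX INT, PosY INT"))))

# Flatten: coalesce the shared fields, keep the event-specific ones as nullable columns
parsed.select(
    "event_name",
    F.coalesce("start_json.Name", "action_json.Name").alias("Name"),
    F.coalesce("start_json.Version", "action_json.Version").alias("Version"),
    F.col("start_json.Id").alias("Id"),
    F.col("action_json.UserName").alias("UserName"),
).show(truncate=False)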

How can I drop the rows with null values and categorical variables from the dataframe?

I am trying to drop the rows with null values and categorical variables from a dataframe that I imported from Excel. I've tried many functions and many different approaches, but I am not able to drop them all.
There are around 185,000 rows with 6 columns.
What I was trying to do is use a for loop to go through all the rows and drop a row if there is a null value or a categorical variable in the column "Order ID".
This is one of the codes I've tried:
f = 0
value = merged_file.at[f, 'Order ID']
for value in merged_file:
    if value is None:
        merged_file.drop(merged_file.index[f])
        merged_file.reset_index(inplace=True, drop=True)
        f += 1
        continue
    elif value == 'Order ID':
        merged_file.drop(merged_file.index[f])
        merged_file.reset_index(drop=True, inplace=True)
        f += 1
        continue
    elif f == 186845:
        break
    else:
        f += 1
        continue
I would be grateful if you could point out what I am doing wrong, and please let me know if there is a better way to identify and drop the rows or columns with null values and categorical variables.
Thank you.
So, it seems you're using pandas, even if the code doesn't look very pythonic.
Anyway, I would suggest not iterating through each row of the dataframe; in pandas, rows containing NaN can be dropped with dropna:
merged_file.dropna(subset=['Order ID'], inplace=True)
To remove the rows containing categorical (non-numeric) values you can instead use numpy's isreal. Simply apply isreal to the column, which labels as False the rows that do not hold numerical values:
import numpy as np
merged_file = merged_file[merged_file['Order ID'].apply(lambda x: np.isreal(x))]
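A quick, self-contained run-through on made-up data (the column name matches the question; the values are invented):

import numpy as np
import pandas as pd

merged_file = pd.DataFrame({'Order ID': [176558, None, 'Order ID', 176559],
                            'Product': ['USB-C Cable', 'Monitor', 'Product', 'Batteries']})

# 1) Drop rows where 'Order ID' is missing
merged_file.dropna(subset=['Order ID'], inplace=True)

# 2) Keep only rows where 'Order ID' is a real number (drops stray header-like rows)
merged_file = merged_file[merged_file['Order ID'].apply(np.isreal)]

print(merged_file)  # only the two numeric Order IDs remain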

Pandas - merge/join/vlookup df and delete all rows that get a match

I am trying to reference a list of expired orders from one spreadsheet (df name = data2) and vlookup them against the new orders spreadsheet (df name = data) to delete all the rows that contain expired orders, then return a new spreadsheet (df name = results).
I am having trouble mimicking in pandas what I do in Excel with vlookup/sort/delete. Please view the pseudo code/steps as code:
1. Import simple.xls as a dataframe called 'data'.
2. Import wo.xlsm, sheet name "T", as a dataframe called 'data2'.
3. Do a vlookup, using Column "A" in 'data' as the values to be matched against any of the same values in Column "A" of 'data2' (they're both just Order IDs).
4. For all values that exist inside Column A in 'data2' and also exist in Column "A" of 'data', group (if necessary) and delete the entire row (there are 26 columns) for each matched Order ID found in Column A of both datasets. To reiterate: delete the entire row for the matches found in the 'data' file. Save the smaller dataset as results.
import pandas as pd
data = pd.read_excel("ors_simple.xlsx", encoding = "ISO-8859-1",
dtype=object)
data2 = pd.read_excel("wos.xlsm", sheet_name = "T")
results = data.merge(data2,on='Work_Order')
writer = pd.ExcelWriter('vlookuped.xlsx', engine='xlsxwriter')
results.to_excel(writer, sheet_name='Sheet1')
writer.save()
I re-read your question and think I understand it correctly. You want to find out if any order in new_orders (you call it data) has expired, using expired_orders (you call it data2).
If you rephrase your question, what you want to do is: 1) find out if a value in a column of one DataFrame exists in a column of another DataFrame, and then 2) drop the rows where the value exists in both.
Using pd.merge is one way to do this. But since you only want to use expired_orders to filter new_orders, pd.merge seems a bit overkill.
Pandas actually has a method for this sort of thing, and it's called isin(), so let's use that! This method allows you to check whether a value in one column exists in another column.
df_1['column_name'].isin(df_2['column_name'])
isin() returns a Series of True/False values that you can apply to filter your DataFrame by using it as a mask: df[bool_mask].
So how do you use this in your situation?
is_expired = new_orders['order_column'].isin(expired_orders['order_column'])
results = new_orders[~is_expired].copy()  # Use copy to avoid a SettingWithCopyWarning.
~ is equal to not, so ~is_expired means that the order wasn't expired.
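Putting it together on invented data (the Work_Order column name comes from the merge call in the question; the values are made up):

import pandas as pd

new_orders = pd.DataFrame({'Work_Order': [101, 102, 103, 104],
                           'Qty': [5, 2, 7, 1]})
expired_orders = pd.DataFrame({'Work_Order': [102, 104]})

# True wherever the order also appears in the expired list
is_expired = new_orders['Work_Order'].isin(expired_orders['Work_Order'])

results = new_orders[~is_expired].copy()  # keep only the non-expired rows
print(results)  # Work_Orders 101 and 103 remain
results.to_excel('vlookuped.xlsx', sheet_name='Sheet1', index=False)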

How to search pandas data frame by index value and value in any column

I am trying to select data, read in from a file, represented by ones and zeros. I want to be able to select rows from a list of values and, at the same time, select any column in which each of the selected rows has a value of one. To make it more complex, I also want to select rows from a list of values where all values in a column for those rows are zero. Is this possible? Ultimately, if another method besides a pandas data frame would work better, I would be willing to try that.
To be clear, any column may be selected and I do not know which ones ahead of time.
Thanks!
You can use the all(), any(), and iloc[] operators. Check the official documentation, or this thread, for more details.
import pandas as pd
import random
import numpy as np
# Create some dummy data, since you didn't provide any
df = pd.DataFrame({'col1': [random.getrandbits(1) for i in range(10)], 'col2': [random.getrandbits(1) for i in range(10)], 'col3': [1]*10})
print(df)
# You can select a value directly using the iloc[] operator
# df.iloc selects by position, df.loc selects by label
row_indexer,column_indexer=3,1
print(df.iloc[row_indexer,column_indexer])
# You can filter the data of a specific column this way
print(df[df['col1']==1])
print(df[df['col2']==1])
# Want to be able to select rows from a list of values and at the same time select for any column in which each of the selected rows has a value of one.
print(df[(df.T == 1).any()])
# If you wanna filter a specific columns with a condition on rows
print(df[(df['col1']==1)|(df['col2']==1)])
# To make it more complex I also want to select rows from a list of values where all values in a column for these rows is zero.
print(df[(df.T == 0).all()])
# If you wanna filter a specific columns with a condition on rows
print(df[(df['col1']==0) & (df['col2']==0)])
