comparing a dataframe column to the values found in another dataframe - python

I have a dataframe df whose column names I would like to compare to the values stored under the 'Numbers' heading of another dataframe, set_cols. I have code that I previously used to compare dictionary keys to the df column names, but I can't figure out how to adapt it to compare one dataframe's column headings to the values held in a column of another dataframe.
import pandas as pd

filename = 'template'
df = pd.DataFrame(columns=['firstName', 'lastName', 'state', 'Communication_Language__c',
                           'country', 'company', 'email', 'industry', 'System_Type__c',
                           'AccountType', 'customerSegment', 'Existing_Customer__c',
                           'GDPR_Email_Permission__c', 'persons name'])
data = ['firstName', 'lastName', 'state', 'Communication_Language__c',
        'country', 'company', 'email', 'industry', 'System_Type__c',
        'AccountType', 'customerSegment', 'Existing_Customer__c',
        'GDPR_Email_Permission__c']
set_cols = pd.DataFrame(data, columns=['Numbers'])

errors = {}
errors[filename] = {}
df_cols = df[list(df.columns)]
mask = df_cols.apply(lambda d: d.isin(set_cols[d.name]))
df_cols.mask(mask | df_cols.eq(' ')).stack()
for err_i, (r, v) in enumerate(df_cols.mask(mask | df_cols.eq(' ')).stack().iteritems()):
    errors[filename][err_i] = {"column": r[1],
                               "message": r[1] + " is an invalid column heading"}
In the errors dictionary I would expect output along the lines of:
{'column': 'persons name', 'message': 'persons name is an invalid column heading'}
How do I compare the column headings of one dataframe to the values stored under a heading of another dataframe?

If I understand you correctly, the goal is to find the columns of df that are absent from set_cols.
In that case you can subtract one set from the other.
import pandas as pd

filename = 'template'
df = pd.DataFrame(columns=['firstName', 'lastName', 'state', 'Communication_Language__c',
                           'country', 'company', 'email', 'industry', 'System_Type__c',
                           'AccountType', 'customerSegment', 'Existing_Customer__c',
                           'GDPR_Email_Permission__c', 'persons name'])
data = ['firstName', 'lastName', 'state', 'Communication_Language__c',
        'country', 'company', 'email', 'industry', 'System_Type__c',
        'AccountType', 'customerSegment', 'Existing_Customer__c',
        'GDPR_Email_Permission__c']
set_cols = pd.DataFrame(data, columns=['Numbers'])

error_cols = set(df.columns) - set(set_cols.Numbers)
errors = {}
errors[filename] = {}
# enumerate so each offending column gets its own numbered entry
# instead of overwriting the same keys on every iteration
for err_i, _col in enumerate(error_cols):
    errors[filename][err_i] = {"column": _col,
                               "message": f"{_col} is an invalid column heading"}

It looks like a very complicated way to do:
diff = df.columns.difference(set_cols['Numbers'])
errors = {}
errors[filename] = dict(enumerate({'column': c,
                                   'message': f'{c} is an invalid column heading'}
                                  for c in diff))
print(errors)
output:
{'template': {0: {'column': 'persons name', 'message': 'persons name is an invalid column heading'}}}
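If you also need the opposite check (required headings that are missing from df), the same idea works in reverse. A minimal sketch, added here rather than taken from either answer:

missing = set_cols['Numbers'][~set_cols['Numbers'].isin(df.columns)].tolist()
print(missing)  # [] here, since every required heading is present in df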

unable to write tuple into xlsx file using python without pandas?

I am trying to write the output into an xlsx file, but I am only able to write the headers, not the data below the headers.
import xlsxwriter

csv_columns = (
    'id', 'name', 'place', 'salary', 'email',
)
details = [{'id': 1, 'name': 'A', 'place': 'B', 'salary': 2, 'email': 'c#d.com'},
           {'id': 3, 'name': 'C', 'place': 'D', 'salary': 4, 'email': 'e#f.com'}]
workbook = xlsxwriter.Workbook(path)
worksheet = workbook.add_worksheet()
for col, name in enumerate(csv_columns):
    worksheet.write(0, col, name)
for row, det in enumerate(details, 1):
    for col, value in enumerate(det):
        worksheet.write(row, col, value)
workbook.close()
This code only writes csv_columns to the xlsx file, repeating the same values in every row, as below:
id name place salary email
id name place salary email
id name place salary email
How do I solve this issue of the repeated columns in the xlsx? Any help?
I expected something like the below:
id name place salary email
1 A B 2 c#d.com
3 C D 4 e#f.com
from openpyxl import Workbook

wb = Workbook()
ws = wb.active
csv_columns = (
    'id', 'name', 'place', 'salary', 'email',
)
details = [{'id': 1, 'name': 'A', 'place': 'B', 'salary': 2, 'email': 'c#d.com'},
           {'id': 3, 'name': 'C', 'place': 'D', 'salary': 4, 'email': 'e#f.com'}]
details_values = [tuple(d.values()) for d in details]
details_values.insert(0, csv_columns)
for row in details_values:
    print(row)
    ws.append(row)
wb.save(output_file_path)
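(Note: this answer switches from xlsxwriter to openpyxl, whose ws.append writes one full row per call; output_file_path is assumed to be defined elsewhere.)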
I corrected your code. Now it works as you would expect:
import xlsxwriter

csv_columns = (
    'id', 'name', 'place', 'salary', 'email',
)
values = [(1, 'A', 'B', 2, 'c#d.com'),
          (3, 'C', 'D', 4, 'e#f.com')]
workbook = xlsxwriter.Workbook(path)
worksheet = workbook.add_worksheet()
row, col = 0, 0
worksheet.write_row(row, col, csv_columns)
row += 1
for value in values:
    worksheet.write_row(row, col, value)
    row += 1
workbook.close()
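(For reference: write_row writes an entire sequence starting at the given cell in a single call, so the dictionaries are flattened into plain tuples of values up front and no inner loop over keys is needed.)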
It would probably be best to map your dictionaries into a list of lists and then process them that way, but here is one way of doing it based on your sample code. The root cause of the repetition, by the way, is that iterating over a dict (for col, value in enumerate(det)) yields its keys, not its values, so every row was filled with the header names again; looking each key up instead fixes it:
import xlsxwriter

csv_columns = ('id', 'name', 'place', 'salary', 'email')
details = [{'id': 1, 'name': 'A', 'place': 'B', 'salary': 2, 'email': 'c#d.com'},
           {'id': 3, 'name': 'C', 'place': 'D', 'salary': 4, 'email': 'e#f.com'}]
workbook = xlsxwriter.Workbook("test.xlsx")
worksheet = workbook.add_worksheet()
worksheet.write_row(0, 0, csv_columns)
for row, det in enumerate(details, 1):
    for col, key in enumerate(csv_columns):
        worksheet.write(row, col, det.get(key, ''))
workbook.close()
Output: (screenshot omitted; the sheet contains the header row followed by the two data rows, matching the expected output above)

group by with python and streamlit

I have the below dataframe. I want to filter it and return results based on the user's selection from a multiselect box, grouped by name.
The selectbox holds the unique values of the name field.
import streamlit as st
import pandas as pd

data = {
    'ID': [1, 2, 3, 4],
    'name': ['peter', 'john', 'james', 'james'],
    'nickname': ['pet', 'jon', 'james', 'jem'],
    'mother_name': ['maria', 'linda', 'ana', 'beth'],
    'bd': ['2000-05-15', '2006-09-12', '2004-10-25',]
}
with st.sidebar.form(key='search_form', clear_on_submit=False):
    choices = df["name"].unique().tolist()
    regular_search_term = st.multiselect(" ", choices)
    if st.form_submit_button("search"):
        df_result_search = df[df["name"].isin(regular_search_term)]
        df_group = df_result_search.groupby('name')
        st.write(df_group)
If I select james it returns the 2 records, while I need it to return 1 record that includes both rows of data related to james.
How can I return this result?
There is a missing value for the key bd in your data dictionary.
You can use this:
import streamlit as st
import pandas as pd

data = {
    'ID': [1, 2, 3, 4],
    'name': ['peter', 'john', 'james', 'james'],
    'nickname': ['pet', 'jon', 'james', 'jem'],
    'mother_name': ['maria', 'linda', 'ana', 'beth'],
    'bd': ['2000-05-15', '2006-09-12', '2004-10-25', '2004-10-26']
}
df = pd.DataFrame(data)
with st.sidebar.form(key='search_form', clear_on_submit=False):
    choices = df["name"].unique().tolist()
    regular_search_term = st.multiselect(" ", choices)
    if st.form_submit_button("search"):
        st.text('Filter on name')
        st.write(df[df["name"].isin(regular_search_term)])
        st.text('Filter on nickname')
        st.write(df[df["nickname"].isin(regular_search_term)])
        df_gr = df[['ID', 'nickname', 'mother_name', 'bd']
                   ].astype(str).groupby(df['name']).agg('|'.join).reset_index()
        st.text('Filter on name with grouped columns')
        st.write(df_gr[df_gr["name"].isin(regular_search_term)])
Output (in browser): (screenshot omitted; three tables are shown, one per filter)
I'll let you choose whichever of the three filters/displays you want.
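A note on the original attempt (an addition of mine, not part of the answer): st.write(df_group) receives a DataFrameGroupBy object rather than a dataframe, which is why the grouped record never appeared; you have to aggregate it back into a frame first, for example:

df_group = df_result_search.groupby('name').agg(list).reset_index()
st.write(df_group)  # one row per name, with the grouped values collected into lists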

Pandas dataframe, get the row number for a column meeting certain conditions

This is similar to a question I asked before, but I needed to make some changes to my condition statement.
I have the below output that I make a dataframe from. I then check each row for whether Status is empty and comment is not empty. The next thing I want to get is the row number of each row whose Status column meets those conditions:
output = [['table_name', 'schema_name', 'column_name', 'data_type', 'null?', 'default', 'kind',
           'expression', 'comment', 'database_name', 'autoincrement', 'Status'],
          ['ACCOUNT', 'SO', '_LOAD_DATETIME', '{"type":"TIMESTAMP_LTZ","precision":0,"scale":9,"nullable":true}',
           'TRUE', '', 'COLUMN', '', 'date and time when table was loaded', 'VICE_DEV'],
          ['ACCOUNT', 'SO', '_LOAD_FILENAME', '{"type":"TEXT","length":16777216,"byteLength":16777216,"nullable":true,"fixed":false}',
           'TRUE', '', 'COLUMN', '', '', 'DEV'],
          ['ACCOUNT', 'SO', '_LOAD_FILE_ROW_NUMBER', '{"type":"TEXT","length":16777216,"byteLength":16777216,"nullable":true,"fixed":false}',
           'TRUE', '', 'COLUMN', '', '', 'DEV']]
df = pd.DataFrame(output)
df.columns = df.iloc[0]
df = df[1:]
query_list = []
for index, row in df.iterrows():
    if row['Status'] is None and row['comment'] is not None and row['comment'] != '':
        empty_status = df[df['Status'].isnull()].index.tolist()
I've tried with the empty_status var above, but I get:
empty_status = [1, 2, 3]
when I would like it to be the below, because only the first row meets both conditions (it has a comment and its Status is empty/null):
empty_status = [1]
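For what it's worth, a minimal sketch of one way to get this (an addition of mine, assuming the df built above): combine both checks into a single boolean mask and take the index of the matching rows, rather than re-querying the whole frame inside the loop:

mask = df['Status'].isnull() & df['comment'].notna() & df['comment'].ne('')
empty_status = df.index[mask].tolist()
print(empty_status)  # [1]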

Best way to convert a defaultdict(list) dictionary with list of dictionaries to a csv

My defaultdict has an address key and a list of dictionaries that match that key. I'd like to export this defaultdict to a csv file.
See below. Right now my structure looks like this defaultdict(list):
# As you can see, 1 key with multiple matching dictionaries.
# And I'm just copying 1 address, but I have ~10 with varying matches
defaultdic1 = defaultdict(list,
                          {'Address_1': [{'Name': 'name',
                                          'Address_match': 'address_match_1',
                                          'ID': 'id',
                                          'Type': 'abc'},
                                         {'Name': 'name',
                                          'Address_match': 'address_match_2',
                                          'ID': 'id',
                                          'Type': 'abc'},
                                         {'Name': 'name',
                                          'Address_match': 'address_match_3',
                                          'ID': 'id',
                                          'Type': 'abc'}]})
I tried doing this:
json_data = json.dumps(data_json, indent=2)
jsondf = pd.read_json(json_data, typ = 'series')
and my result was this:
Address 1 [{'Name':'name', 'Address_match':'address_match_1' 'ID' : 'id', 'Type':'abc'} {'Name':'name', 'Address_match':'address_match_2' 'ID' : 'id', 'Type':'abc'}, {'Name':'name', 'Address_match':'address_match_3' 'ID' : 'id', 'Type':'abc'}]
Result/output: I'd like to export this to an excel file.
Update: I tried the below. The first row prints the key, but the second row is still wrapped in {}; it would be great to get the values out of the brackets and split into columns. Any tips there?
for k, v in defaultdic1.items():
    f.writerow([k])
    for values in v:
        f.writerow([values])
results in CSV are:
Address 1
{'Name':'name', 'Address_match':'address_match_1' 'ID' : 'id', 'Type':'abc'}
{'Name':'name', 'Address_match':'address_match_1' 'ID' : 'id', 'Type':'abc'}
{'Name':'name', 'Address_match':'address_match_2' 'ID' : 'id', 'Type':'abc'}
I'd like my results to be:
Address 1 Name, Address_match1, ID, Type
Name, Address_match2, ID, Type
Name, Address_match3, ID, Type
Address 2 Name1, Address_match1, ID, Type
Name1, Address_match1, ID, Type
Address 3 Name1, Address_match1, ID, Type
Name1, Address_match1, ID, Type
Your input data and output data do not match, so it's awfully difficult to tell how to transform things, but here is something that takes your defaultdict and converts it to a CSV file:
import csv

dic1 = {'Address_2':
        [
            {'Address 1':
                [
                    {'Name': 'name', 'Address_match': 'address_match_1', 'ID': 'id', 'Type': 'abc'}
                ]},
            {'Address 2':
                [
                    {'Name': 'name', 'Address_match': 'address_match_2', 'ID': 'id', 'Type': 'abc'}
                ]},
            {'Address 3':
                [
                    {'Name': 'name', 'Address_match': 'address_match_3', 'ID': 'id', 'Type': 'abc'}
                ]}
        ]}
names = list(dic1['Address_2'][0]['Address 1'][0].keys())
myfile = csv.DictWriter(open('xxx.csv', 'w'), fieldnames=names)
for row in dic1['Address_2']:
    myfile.writerow({'Name': list(row.keys())[0]})
    myfile.writerow(list(row.values())[0][0])
This is what ended up solving it!
names = list(defaultdic1['Address_1'][0].keys())
with open("file.csv", "w", newline="") as f:
    writer = csv.writer(f)
    keys = names
    writer.writerow(["Address"] + keys)
    for k, vl in defaultdic1.items():
        for v in vl:
            writer.writerow([k] + [v[key] for key in keys])
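For reference, a pandas-based sketch of the same export (an addition of mine; it assumes the defaultdic1 structure shown above), flattening the data into one row per match:

import pandas as pd

rows = [{'Address': k, **v} for k, vl in defaultdic1.items() for v in vl]
pd.DataFrame(rows).to_csv('file.csv', index=False)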

Reading whitespace in heading of csv file using pandas

I need to read headings from a csv that have whitespace between them, and I need help fixing it. I tried different ways, like delimiter=' ' and delim_whitespace=True. Here is how I'm writing the code:
df = pd.read_csv(
    d,
    dtype='str',
    usecols=[
        'Owner First Name',
        'Owner Last Name',
        'StreetNumber',
        'StreetName',
        'State',
        'Zip Code',
        'Bdrms',
        'Legal Description',
        'Sq Ftg',
        'Address',
        'Orig Ln Amt',
        'Prop Value'
    ],
    names=[
        'Owner_FirstName',
        'Owner_LastName',
        'StreetNumber',
        'StreetName',
        'State',
        'ZipCode',
        'Bdrms',
        'Legal_Description',
        'Sq_Ftg',
        'Address',
        'Orig_Ln_Amt',
        'Prop_Value'
    ],
    skipinitialspace=True
)
Along with your existing options you can use the python engine and pass a separator that matches whitespace (\s+), commas, or tabs (\t) in the sep arg:
pd.read_csv(d, engine='python', sep=r'\s+|,|\t')
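One caution (a note of mine, not from the answer): headings such as 'Owner First Name' themselves contain spaces, so a whitespace separator would split them into several fields. If the goal is only to remove stray padding around comma-separated headings, a plain read followed by stripping the header may be safer:

df = pd.read_csv(d, dtype='str', skipinitialspace=True)
df.columns = df.columns.str.strip()  # trim leftover padding around each heading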