Reading whitespace in heading of csv file using pandas - python

I need to read the headings from a CSV file that have whitespace in them, and I need help fixing it. I have tried different ways, like delimiter=' ' and delim_whitespace=True. Here is how I'm writing the code:
df = pd.read_csv(
    d,
    dtype='str',
    usecols=[
        'Owner First Name',
        'Owner Last Name',
        'StreetNumber',
        'StreetName',
        'State',
        'Zip Code',
        'Bdrms',
        'Legal Description',
        'Sq Ftg',
        'Address',
        'Orig Ln Amt',
        'Prop Value'
    ],
    names=[
        'Owner_FirstName',
        'Owner_LastName',
        'StreetNumber',
        'StreetName',
        'State',
        'ZipCode',
        'Bdrms',
        'Legal_Description',
        'Sq_Ftg',
        'Address',
        'Orig_Ln_Amt',
        'Prop_Value'
    ],
    skipinitialspace=True
)

Along with your existing options, you can set the engine to python and provide a separator that matches whitespace (\s), commas, and tabs (\t) via the sep argument.
pd.read_csv(d, engine='python', sep=r'\s+|,|\t')
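If the real issue is simply that the headings contain (or are padded with) spaces, a minimal alternative sketch is to read the file with its original header and clean up/rename the columns afterwards (the rename mapping below just mirrors the names used in the question):

import pandas as pd

# Sketch: keep the original header, strip stray spaces around each heading,
# then rename the spaced headings to underscore versions.
df = pd.read_csv(d, dtype='str', skipinitialspace=True)
df.columns = df.columns.str.strip()
df = df.rename(columns={
    'Owner First Name': 'Owner_FirstName',
    'Owner Last Name': 'Owner_LastName',
    'Zip Code': 'ZipCode',
    'Legal Description': 'Legal_Description',
    'Sq Ftg': 'Sq_Ftg',
    'Orig Ln Amt': 'Orig_Ln_Amt',
    'Prop Value': 'Prop_Value',
})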

Related

comparing a dataframe column to the values found in another dataframe

I have a dataframe 'df' where I would like to compare its list of column names to the values found under the dataframe titled 'set_cols'. I have code that I previously used to compare a dictionary key to the df column names, but I can't figure out how to make it work for comparing one dataframe's column headings to the values under a heading in another dataframe.
import pandas as pd

filename = 'template'
df = pd.DataFrame(columns=['firstName', 'lastName', 'state', 'Communication_Language__c',
                           'country', 'company', 'email', 'industry', 'System_Type__c',
                           'AccountType', 'customerSegment', 'Existing_Customer__c',
                           'GDPR_Email_Permission__c', 'persons name'])
data = ['firstName', 'lastName', 'state', 'Communication_Language__c',
        'country', 'company', 'email', 'industry', 'System_Type__c',
        'AccountType', 'customerSegment', 'Existing_Customer__c',
        'GDPR_Email_Permission__c']
set_cols = pd.DataFrame(data, columns=['Numbers'])
errors = {}
errors[filename] = {}
df_cols = df[list(df.columns)]
mask = df_cols.apply(lambda d: d.isin(set_cols[d.name]))
df_cols.mask(mask | df_cols.eq(' ')).stack()
for err_i, (r, v) in enumerate(df_cols.mask(mask | df_cols.eq(' ')).stack().iteritems()):
    errors[filename][err_i] = {"column": r[1],
                               "message": r[1] + " is an invalid column heading"}
In the errors dictionary I would expect an output something along the lines of this:
{'column': 'person name', 'message': 'person name is an invalid column heading'}
How do I compare the column heading values of one dataframe to the values under a column of another dataframe?
If I understand you correctly, the goal is to find the columns of 'df' that are absent from 'set_cols'.
In this case, set subtraction can be used.
import pandas as pd

filename = 'template'
df = pd.DataFrame(columns=['firstName', 'lastName', 'state', 'Communication_Language__c',
                           'country', 'company', 'email', 'industry', 'System_Type__c',
                           'AccountType', 'customerSegment', 'Existing_Customer__c',
                           'GDPR_Email_Permission__c', 'persons name'])
data = ['firstName', 'lastName', 'state', 'Communication_Language__c',
        'country', 'company', 'email', 'industry', 'System_Type__c',
        'AccountType', 'customerSegment', 'Existing_Customer__c',
        'GDPR_Email_Permission__c']
set_cols = pd.DataFrame(data, columns=['Numbers'])
error_cols = set(df.columns) - set(set_cols.Numbers)
errors = {}
errors[filename] = {}
for _col in error_cols:
    errors[filename]["column"] = _col
    errors[filename]["message"] = f"{_col} is an invalid column heading"
It looks like a very complicated way to do:
diff = df.columns.difference(set_cols['Numbers'])
errors = {}
errors[filename] = dict(enumerate({'column': c,
                                   'message': f'{c} is an invalid column heading'}
                                  for c in diff))
print(errors)
output:
{'template': {0: {'column': 'persons name', 'message': 'persons name is an invalid column heading'}}}

Pandas dataframe, get the row number for a column meeting certain conditions

This is similar to a question I asked before, but I needed to make some changes to my condition statement.
I have the output below that I make a dataframe from. I then check each row to see whether Status is empty and comment is not empty. The next thing I want to get is the row number of the Status column that meets those conditions:
output = [['table_name', 'schema_name', 'column_name', 'data_type', 'null?', 'default', 'kind', 'expression', 'comment', 'database_name', 'autoincrement', 'Status'], ['ACCOUNT', 'SO', '_LOAD_DATETIME', '{"type":"TIMESTAMP_LTZ","precision":0,"scale":9,"nullable":true}', 'TRUE', '', 'COLUMN', '', 'date and time when table was loaded', 'VICE_DEV'], ['ACCOUNT', 'SO', '_LOAD_FILENAME', '{"type":"TEXT","length":16777216,"byteLength":16777216,"nullable":true,"fixed":false}', 'TRUE', '', 'COLUMN', '', '', 'DEV'], ['ACCOUNT', 'SO', '_LOAD_FILE_ROW_NUMBER', '{"type":"TEXT","length":16777216,"byteLength":16777216,"nullable":true,"fixed":false}', 'TRUE', '', 'COLUMN', '', '', 'DEV']]
df = pd.DataFrame(output)
df.columns = df.iloc[0]
df = df[1:]
query_list = []
for index, row in df.iterrows():
    if row['Status'] is None and row['comment'] is not None and row['comment'] != '':
        empty_status = df[df['Status'].isnull()].index.tolist()
I've tried with the empty_status var above, but I see:
empty_status = [1, 2, 3]
when I would like it to be as below, because only the first row has a comment and an empty or null Status:
empty_status = [1]
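A minimal sketch of one way to get only the qualifying row numbers, using a single boolean mask over both conditions (this assumes the dataframe built above, where the short rows are padded with None):

# Sketch: combine the "Status is empty" and "comment is non-empty" checks
# into one mask, then take the index labels where the mask is True.
mask = (
    (df['Status'].isnull() | df['Status'].eq(''))
    & df['comment'].notnull()
    & df['comment'].ne('')
)
empty_status = df[mask].index.tolist()
print(empty_status)  # [1] for the sample data above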

Match string value to dataframe value and add to string

I have a string of column names and their data types, called cols, below:
_LOAD_DATETIME datetime,
_LOAD_FILENAME string,
_LOAD_FILE_ROW_NUMBER int,
_LOAD_FILE_TIMESTAMP datetime,
ID int
Next I make a df from a gsheet I'm reading, shown below:
import pandas as pd
output = [['table_name', 'schema_name', 'column_name', 'data_type', 'null?', 'default', 'kind', 'expression', 'comment', 'database_name', 'autoincrement', 'DateTime Comment Added'], ['ACCOUNT', 'SO', '_LOAD_DATETIME', '{"type":"TIMESTAMP_LTZ","precision":0,"scale":9,"nullable":true}', 'TRUE', '', 'COLUMN', '', '', 'V'], ['ACCOUNT', 'SO', '_LOAD_FILENAME', '{"type":"TEXT","length":16777216,"byteLength":16777216,"nullable":true,"fixed":false}', 'TRUE', '', 'COLUMN', '', '', 'VE'], ['B_ACCOUNT', 'SO', '_LOAD_FILE_ROW_NUMBER', '{"type":"TEXT","length":16777216,"byteLength":16777216,"nullable":true,"fixed":false}', 'TRUE', '', 'COLUMN', '', '', 'V'], ['ACCOUNT', 'SO', '_LOAD_FILE_TIMESTAMP', '{"type":"TIMESTAMP_NTZ","precision":0,"scale":9,"nullable":true}', 'TRUE', '', 'COLUMN', '', 'TEST', 'VE', '', '2022-02-16'], ['ACCOUNT', 'SO', 'ID', '{"type":"TEXT","length":16777216,"byteLength":16777216,"nullable":false,"fixed":false}', 'NOT_NULL', '', 'COLUMN', '', 'ID of Account', 'V', '', '2022-02-16'],]
df = pd.DataFrame(output)
df.columns = df.iloc[0]
df = df[1:]
last_2_days = '2022-02-15'
query_list = []
for index, row in df.iterrows():
    if row['comment'] is not None and row['comment'] != '' and (row['DateTime Comment Added'] >= last_2_days):
        comment_data = row['column_name'], row['comment']
        query_list.append(comment_data)
When I print out query_list it looks like this, which is the correct data, since I only want the column_name and comment when the DateTime Comment Added column falls within the last 2 days of today:
[('_LOAD_FILE_TIMESTAMP', 'TEST'), ('ID', 'ID of Account')]
What I want to do next (and I'm having trouble figuring out how) is to take each comment from query_list, add it to the matching column name in my cols string from earlier, AND add the word COMMENT before the actual comment,
so cols next should look like this:
_LOAD_DATETIME datetime,
_LOAD_FILENAME string,
_LOAD_FILE_ROW_NUMBER int,
_LOAD_FILE_TIMESTAMP datetime COMMENT 'TEST',
ID int COMMENT 'ID of Account'
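A minimal sketch of one way to do that rewrite (assuming cols is the multi-line string shown at the top and query_list holds the (column_name, comment) tuples built above):

# Sketch: map column names to comments, then rebuild each line of cols,
# appending COMMENT '<comment>' before any trailing comma.
comments = dict(query_list)  # {'_LOAD_FILE_TIMESTAMP': 'TEST', 'ID': 'ID of Account'}
new_lines = []
for line in cols.splitlines():
    stripped = line.strip()
    if not stripped:
        continue
    has_comma = stripped.endswith(',')
    body = stripped.rstrip(',')
    name = body.split()[0]  # the column name is the first token on the line
    if name in comments:
        body += f" COMMENT '{comments[name]}'"
    new_lines.append(body + (',' if has_comma else ''))
cols = '\n'.join(new_lines)
print(cols)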

best way to write dictionary data into csv or excel

I am trying to write dictionary data into a csv file.
Keys:
['file_name', 'candidate_skills', 'SF_name', 'RB_name', 'mb_number', 'email']
Dictionary
{'file_name': 'Aarti Banarashi.docx', 'candidate_skills': ['JQuery', ' ', 'Bootstrap', 'codeigniter', '\n', 'Javascript', 'Analysis', 'Ajax', 'HTML', 'Html5', 'SQL', 'MySQL', 'PHP', 'CSS'], 'SF_name': None, 'RB_name': 'aarti banarashi\t\t\t', 'mb_number': ['+918108493333'], 'email': 'aartisingh271294#gmail.com'}
I was thinking each dictionary would be written in one row, with each value in a new column:
'file_name' 'candidate_skills' 'SF_name' 'RB_name' 'mb_number' 'email'
I am getting results like this, in a single column only:
file_name,Aarti Banarashi.docx
candidate_skills,"['JQuery', ' ', 'Bootstrap', 'codeigniter', '\n', 'Javascript', 'Analysis', 'Ajax', 'HTML', 'Html5', 'SQL', 'MySQL', 'PHP', 'CSS']"
SF_name,
RB_name,aarti banarashi
mb_number,['+918108493333']
email,aartisingh271294#gmail.com
Can you please help me write it in the correct manner? Also, when I add new records, they should get appended.
My code:
with open('dict.csv', 'w', newline='') as csv_file:
    writer = csv.writer(csv_file)
    for key, value in res.items():
        writer.writerow([key, value])
Expected output (screenshot): a single row per dictionary, with one column per key.
As soon as you work with tables, I recommend pandas. Here is the pandas solution:
import pandas as pd

d = {'file_name': 'Aarti Banarashi.docx', 'candidate_skills': ['JQuery', ' ', 'Bootstrap', 'codeigniter', '\n', 'Javascript', 'Analysis', 'Ajax', 'HTML', 'Html5', 'SQL', 'MySQL', 'PHP', 'CSS'], 'SF_name': None, 'RB_name': 'aarti banarashi\t\t\t', 'mb_number': ['+918108493333'], 'email': 'aartisingh271294#gmail.com'}
df = pd.DataFrame.from_dict(d, orient='index').T
df.to_csv("output.csv", index=False)
Output:
file_name,candidate_skills,SF_name,RB_name,mb_number,email
Aarti Banarashi.docx,"['JQuery', ' ', 'Bootstrap', 'codeigniter', '\n', 'Javascript', 'Analysis', 'Ajax', 'HTML', 'Html5', 'SQL', 'MySQL', 'PHP', 'CSS']",,aarti banarashi ,['+918108493333'],aartisingh271294#gmail.com
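If you would rather not have the raw Python list representation (for example candidate_skills) land in the CSV, an optional sketch is to join list values into a single string before building the frame (the separator and the whitespace cleanup are illustrative choices):

import pandas as pd

# Sketch: flatten list values to '; '-joined strings, dropping blank entries.
d_flat = {
    k: '; '.join(item.strip() for item in v if item.strip()) if isinstance(v, list) else v
    for k, v in d.items()
}
df = pd.DataFrame.from_dict(d_flat, orient='index').T
df.to_csv("output.csv", index=False)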
Your script was iterating over each key/value pair in your dictionary and then calling writerow() for each pair. writerow() writes a single new row, so calling it multiple times this way gives you one row per pair.
res only contains data for a single row in your CSV file. Using a csv.DictWriter(), a single call to writerow() will convert all the dictionary entries into a single output row:
import csv
res = {'file_name': 'Aarti Banarashi.docx', 'candidate_skills': ['JQuery', ' ', 'Bootstrap', 'codeigniter', '\n', 'Javascript', 'Analysis', 'Ajax', 'HTML', 'Html5', 'SQL', 'MySQL', 'PHP', 'CSS'], 'SF_name': None, 'RB_name': 'aarti banarashi\t\t\t', 'mb_number': ['+918108493333'], 'email': 'aartisingh271294#gmail.com'}
fieldnames = ['file_name', 'candidate_skills', 'SF_name', 'RB_name', 'mb_number', 'email']
with open('dict.csv', 'w', newline='') as f_file:
    csv_writer = csv.DictWriter(f_file, fieldnames=fieldnames)
    csv_writer.writeheader()
    csv_writer.writerow(res)
Giving you an output dict.csv file as:
file_name,candidate_skills,SF_name,RB_name,mb_number,email
Aarti Banarashi.docx,"['JQuery', ' ', 'Bootstrap', 'codeigniter', '\n', 'Javascript', 'Analysis', 'Ajax', 'HTML', 'Html5', 'SQL', 'MySQL', 'PHP', 'CSS']",,aarti banarashi ,['+918108493333'],aartisingh271294#gmail.com
Explicitly passing fieldnames forces the ordering of the columns in the output to match what you provide. If the ordering is not important, you can instead use fieldnames=res.keys().
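The question also asks for new records to be appended on later runs. A minimal sketch of one way to do that with the same DictWriter approach (the helper name and the os.path.exists check are illustrative):

import csv
import os

def append_record(res, path='dict.csv', fieldnames=None):
    # Append one dictionary as a row; write the header only if the file is new.
    fieldnames = fieldnames or list(res.keys())
    write_header = not os.path.exists(path)
    with open(path, 'a', newline='') as f_file:
        csv_writer = csv.DictWriter(f_file, fieldnames=fieldnames)
        if write_header:
            csv_writer.writeheader()
        csv_writer.writerow(res)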

How Do I Use a CSV File To Send Keys?

I am a beginner and I've tried searching online everywhere, but I'm not sure I'm searching the right terms.
My CSV file looks like this:
https://drive.google.com/file/d/0B74bmJNIxxW-dWl0Y0dsV1E4bjA/view?usp=sharing
I want to know how to use the CSV file to do something like this,
Email
driver.find_element_by_name('emailAddress').send_keys("johndoe#example.com")
print("Successfully Entered Email...")
There are lots of ways that you could do this. One would be to use the csv module.
with open("foo.csv", "r") as fh:
lines = csv.reader(fh)
for line in lines:
address = line[0]
driver.find_element_by_name('emailAddress').send_keys(address)
It really helps to post the data here so that we can see what the format really is and run the code ourselves. So, I invented some sample data:
emails.csv
Email,Password,First Name,Last Name,City
foo1#example.com,frobinate,John,Doe,District Heights
foo2#example.com,frobinate,John,Doe,District Heights
foo3#example.com,frobinate,John,Doe,District Heights
foo4#example.com,frobinate,John,Doe,District Heights
I can use the csv module to read that. csv.DictReader reads each row into its own dict, which lets me reference cells by the name given in the header. Since I'll be looking up records by email address later, I'll read it into another dict that will act as an index into the records. If the same user is in there multiple times, only the last one will be remembered.
With the index in place, I can grab the row by email address.
>>> import csv
>>> email_index = {}
>>> with open('emails.csv', newline='') as fp:
...     reader = csv.DictReader(fp)  # auto-reads header
...     for row in reader:
...         email_index[row['Email']] = row
...
>>> for item in email_index.items():
...     print(item)
...
('foo3#example.com', {'Email': 'foo3#example.com', 'City': 'District Heights', 'First Name': 'John', 'Password': 'frobinate', 'Last Name': 'Doe'})
('foo2#example.com', {'Email': 'foo2#example.com', 'City': 'District Heights', 'First Name': 'John', 'Password': 'frobinate', 'Last Name': 'Doe'})
('foo4#example.com', {'Email': 'foo4#example.com', 'City': 'District Heights', 'First Name': 'John', 'Password': 'frobinate', 'Last Name': 'Doe'})
('foo1#example.com', {'Email': 'foo1#example.com', 'City': 'District Heights', 'First Name': 'John', 'Password': 'frobinate', 'Last Name': 'Doe'})
>>>
>>> user = 'foo1#example.com'
>>> record = email_index[user]
>>> print("{Email} is {First Name} {Last Name} and lives in {City}".format(**record))
foo1#example.com is John Doe and lives in District Heights
>>>
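To tie this back to the original question, a short sketch of using a looked-up record with Selenium (the 'emailAddress' field name comes from the question; the 'password' field name is only an assumption for illustration):

record = email_index['foo1#example.com']
driver.find_element_by_name('emailAddress').send_keys(record['Email'])
driver.find_element_by_name('password').send_keys(record['Password'])  # field name is an assumption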
