I'm new to Python. I have a CSV file and need to check whether the inputs are correct. The code should scan through each row.
All columns for a particular row should contain values of the same type, e.g.:
All columns of the second row should contain only strings,
All columns of the third row should contain only numbers, etc.
I tried the following approach (it may seem like a blunder):
I have only 15 rows, but no idea of the number of columns (it's the user's choice).
df.iloc[1].str.isalpha()
This checks for strings. I don't know how to check for the other types.
A simple approach that can be modified:
Open the file with df = pandas.read_csv(<path_to_csv>).
For each column, cast with df['<column_name>'] = df['<column_name>'].astype(str) (str = string, int = integer, float = float64, etc.).
You can check the column types using df.dtypes.
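For the row-wise check the question actually asks about, here is a minimal sketch; the sample frame and the helper names are hypothetical, not from the original post:

```python
import pandas as pd

# Hypothetical frame standing in for the user's CSV rows
df = pd.DataFrame([["alpha", "beta", "gamma"],
                   ["1", "2", "3"]])

def row_is_all_alpha(row):
    # True if every cell in the row is purely alphabetic text
    return row.astype(str).str.isalpha().all()

def row_is_all_numeric(row):
    # True if every cell can be parsed as a number
    return pd.to_numeric(row, errors="coerce").notna().all()

print(row_is_all_alpha(df.iloc[0]))    # True
print(row_is_all_numeric(df.iloc[1]))  # True
```

With 15 known rows, you could keep a list of 15 expected checkers and apply each one to its row, regardless of how many columns the user supplies.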
I have an Excel file where column name might be a number, i.e. 2839238. I am reading it using pd.read_excel(bytes(filedata), engine='openpyxl') and, for some reason, this column name gets converted to a float 2839238.0. How to disable this conversion?
This is an issue for me because I then operate on column names using string-only methods like df = df.loc[:, ~df.columns.str.contains('^Unnamed')], and it gives me the following error:
TypeError: bad operand type for unary ~: 'float'
Column names are arbitrary.
Try changing the type of the column:
df['col'] = df['col'].astype(int)
The number in your example suggests you may have large values. Python ints themselves are unbounded, but a pandas integer column is typically int64, and values outside that range (or columns containing NaN) fall back to float. Check the ranges of the dtypes against your data to see which one you can use.
Verify that you don't have any duplicate column names. Pandas will add .0 or .1 if there is another instance of 2839238 as a header name.
See the description of the mangle_dupe_cols (bool) parameter,
which says:
Duplicate columns will be specified as ‘X’, ‘X.1’, …’X.N’, rather than ‘X’…’X’. Passing in False will cause data to be overwritten if there are duplicate names in the columns.
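If the float headers do come from the read itself rather than from duplicates, one option is simply to rename them after loading; a sketch, where the sample frame stands in for the result of pd.read_excel:

```python
import pandas as pd

# Hypothetical frame standing in for the result of pd.read_excel,
# where a numeric header came back as a float
df = pd.DataFrame({2839238.0: [1, 2], "name": ["a", "b"]})

# Rename only float headers back to plain integer strings,
# leaving real string headers untouched
df.columns = [str(int(c)) if isinstance(c, float) else c for c in df.columns]

print(df.columns.tolist())  # ['2839238', 'name']
```

After this, string-only methods like df.columns.str.contains work again because every column name is a string.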
I am working on automating a process with python using pandas. Previously I would use Excel PowerQuery to combine files and manipulate data but PowerQuery is not as versatile as I need so I am now using pandas. I have the process working up to a point where I can loop through files, select the columns that I need in the correct order, dependent on each workbook, and insert that into a dataframe. Once each dataframe is created, I then concatenate them into a single dataframe and write to csv. Before writing, I need to apply some validation to certain columns.
For example, I have a Stock Number column that will always need to be exactly 11 characters long. Sometimes, dependent on the workbook, the data will be missing the leading zeros or will have more than 11 characters (but those extra characters should be removed). I know that what I need to do is something along the lines of:
STOCK_NUM.zfill(13)[:13]
but I'm not sure how to actually modify the existing dataframe values. Do I actually need to loop through the dataframe or is there a way to apply formatting to an entire column?
e.g.
dataset = [['51346812942315.01', '01-15-2018'], ['13415678', '01-15-2018'], ['5134687155546628', '01/15/2018']]
df = pd.DataFrame(dataset, columns = ['STOCK_NUM', 'Date'])
for x in df["STOCK_NUM"]:
    print(x.zfill(13)[:13])
I would like to know the best way to apply that format to the existing values, and only where values are present (i.e. not touching nulls).
Also, I need to ensure that the date columns are truly date values. Sometimes the dates are formatted as MM-DD-YYYY, sometimes MM/DD/YY, etc., and any of those are fine; what is not fine is if the actual value in the date column is an Excel serial number that Excel can format as a date. Is there some way to apply validation logic to an entire dataframe column to ensure that there is a valid date instead of a serial number?
I honestly have no idea how to approach this date issue.
Any and all advice, insight would be greatly appreciated!
Not an expert, but from things I could gather here and there, you could try:
df['STOCK_NUM']=df['STOCK_NUM'].str.zfill(13)
followed by:
df['STOCK_NUM'] = df['STOCK_NUM'].str.slice(0,13)
For the first part.
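As a quick sanity check, those two steps can be chained on some made-up stock numbers:

```python
import pandas as pd

# Hypothetical values: one too short, one too long
df = pd.DataFrame({"STOCK_NUM": ["13415678", "5134687155546628"]})

# Pad to 13 characters with leading zeros, then truncate anything longer
df["STOCK_NUM"] = df["STOCK_NUM"].str.zfill(13).str.slice(0, 13)

print(df["STOCK_NUM"].tolist())  # ['0000013415678', '5134687155546']
```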
For dates you can do a try-except on:
df['Date'] = pd.to_datetime(df['Date'])
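As an alternative to the try-except, pd.to_datetime accepts errors='coerce', which turns unparseable entries into NaT so you can flag or drop them; a small sketch with made-up values:

```python
import pandas as pd

# Hypothetical column: two valid dates and one bad value
dates = pd.Series(["01-15-2018", "02-20-2018", "not a date"])

# Unparseable entries become NaT instead of raising
parsed = pd.to_datetime(dates, format="%m-%d-%Y", errors="coerce")

print(parsed.isna().sum())  # 1
```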
For your STOCK_NUM question, you could apply a function to the column, but the way I approach this is with list comprehensions. First replace all the NAs in your STOCK_NUM column with a unique placeholder string, then apply the list comprehension, as in the code below:
import pandas as pd
dataset = [['51346812942315.01', '01-15-2018'], ['13415678', '01-15-2018'], ['5134687155546628', '01/15/2018'], [None,42139]]
df = pd.DataFrame(dataset, columns = ['STOCK_NUM', 'Date'])
# replace NAs with a placeholder string
df['STOCK_NUM'] = df['STOCK_NUM'].fillna('IS_NA')
#use list comprehension to reformat the STOCK_NUM column
df['STOCK_NUM'] = [None if i=='IS_NA' else i.zfill(13)[:13] for i in df.STOCK_NUM]
Then for your question relating to converting excel serial number to a date, I looked at an already answered question. I am assuming that the serial number in your dataframe is an integer type:
import datetime
def xldate_to_datetime(xldate):
    # subtract 2 days to correct for Excel's 1900 date-system offset
    # and its phantom 1900 leap day
    temp = datetime.datetime(1900, 1, 1)
    delta = datetime.timedelta(days=xldate) - datetime.timedelta(days=2)
    return pd.to_datetime(temp + delta)
df['Date'] = [xldate_to_datetime(i) if type(i)==int else pd.to_datetime(i) for i in df.Date]
Hopefully this works for you! Accept this answer if it does, otherwise reply with whatever remains an issue.
I have 3 columns in my CSV. In the first column, I want to add the number "1" at the start of all the entries, in Python.
I can't figure out how to do that.
e.g
Current data in column: 5678967745
I want to add a 1 at the start so it becomes "15678967745".
I want to do this for all entries in the column.
You can convert your number to a string, add the "1" as a string, and convert the whole thing back to a number, assuming that's important.
numbers_as_strings = df["numbers"].astype(str)
numbers_with_1 = "1"+numbers_as_strings
numbers_as_numeric = pd.to_numeric(numbers_with_1)
df["numbers"] = numbers_as_numeric
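As a quick check of those steps on a small made-up column:

```python
import pandas as pd

# Hypothetical column of numeric entries
df = pd.DataFrame({"numbers": [5678967745, 1234567890]})

# string -> prepend "1" -> back to numeric, all vectorized
df["numbers"] = pd.to_numeric("1" + df["numbers"].astype(str))

print(df["numbers"].tolist())  # [15678967745, 11234567890]
```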
Suppose you have a dataframe df with a column named Datetime that holds ints. To modify the values without a for loop, you can use map on the column:
df['Datetime'] = df['Datetime'].map(lambda x: int("1" + str(x)))
This first converts each value in the column to a string, then concatenates "1" on the front; at the end, the new number is converted back to an int.
Hope it helps.
I've got a dataframe (haveleft) full of people who have left a service and their reason for leaving. The 'text' column is their reason, but some of them aren't strings. Not many, so I just want to remove those rows, either in place or to a new dataframe. Below code just gives me a dataframe populated with only NaN. Why doesn't it work?
cleanedleft = pd.DataFrame()
cleanedleft = haveleft[haveleft[haveleft['text'] == str]]
print(holder[0:10])
or if I remove one of the 'haveleft[ ]' I get an empty dataframe
cleanedleft = pd.DataFrame()
cleanedleft = haveleft[haveleft['text'] == str]
print(holder[0:10])
I've tried adding a type() check but can't figure out the right way to do it.
It doesn't work because comparing values to str tests equality with the type object itself, which is never true. The text column's dtype will be object even if some values are numeric, so the dtype alone won't identify the bad rows; you'll want to characterize the unwanted data and drop those rows accordingly.
For instance, to drop rows where 'text' consists only of digits as in the single-line example you give:
cleaned = df[~df['text'].str.match(r'^\d+$')]
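A runnable sketch of that filter on some hypothetical data:

```python
import pandas as pd

# Hypothetical data: one 'reason' is purely numeric
haveleft = pd.DataFrame({"text": ["too expensive", "12345", "moved away"]})

# keep only rows whose text is NOT composed entirely of digits
cleaned = haveleft[~haveleft["text"].str.match(r"^\d+$")]

print(cleaned["text"].tolist())  # ['too expensive', 'moved away']
```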
Recently I have been developing some code to read a csv file and store key data columns in a dataframe. Afterwards I plan to have some mathematical functions performed on certain columns in the dataframe.
I've been fairly successful in storing the correct columns in the dataframe. I have been able to have it do whatever maths is necessary such as summations, additions of dataframe columns, averaging etc.
My problem lies in accessing specific columns once they are stored in the dataframe. I was working with a test file to get everything working and managed this with no problem. The problems arise when I open a different csv file: it stores the data in the dataframe, but accessing the column I want no longer works, and it stops at the calculation part.
From what I can tell the problem lies with how it reads the column name. The column names are all numbers. For example, df['300'], df['301'] etc. When accessing the column df['300'] works fine in the testfile, while the next file requires df['300.0']. If I switch to a different file it may require df['300'] again. All the data was obtained in the same way so I am not certain why some are read as 300 and the others 300.0.
Short of constantly changing the column labels each time I open a different file, is there any way to have it automatically distinguish between '300' and '300.0' when opening the file, or to treat '300.0' as '300'?
Thanks
In your dataframe df, one way to keep things consistent is to normalize the column names to a single type. You can convert every column name to the string form of its integer value (i.e. '300.0' to '300') using .columns, as below. Then df['300'] (and likewise any other column) should work:
df.columns = [str(int(float(column))) for column in df.columns]
Or, if the integer form is not required, the extra int conversion can be removed and the float string value used:
df.columns = [str(float(column)) for column in df.columns]
Then, df['300.0'] can be used instead of df['300'].
If a string type is not required, converting the names to float works as well.
df.columns = [float(column) for column in df.columns]
Then, df[300.0] would work as well.
Another alternative for changing the column names is map:
Changing all columns to float values, then using df[300.0] as mentioned above:
df.columns = map(float, df.columns)
Changing to string value of float, then df['300.0']:
df.columns = map(str, map(float, df.columns))
Changing to string value of int, then df['300']:
df.columns = map(str, map(int, map(float, df.columns)))
Some solutions:
Go through all the files, change the columns names, then save the result in a new folder. Now when you read a file, you can go to the new folder and read it from there.
Wrap the normal file read function in another function that automatically changes the column names, and call that new function when you read a file.
Wrap column selection in a function. Use a try/except block to have the function try to access the given column, and if it fails, use the other form.
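The third option might look like the sketch below; get_col is a hypothetical helper name:

```python
import pandas as pd

def get_col(df, name):
    # Try the plain name first, then fall back to the
    # float-string form ('300' -> '300.0')
    try:
        return df[name]
    except KeyError:
        return df[str(float(name))]

# Hypothetical frame whose header came in as a float string
df = pd.DataFrame({"300.0": [1, 2, 3]})

print(get_col(df, "300").tolist())  # [1, 2, 3]
```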
This answer assumes you want only the integer part to remain in the column name. It takes the column names and does a float->int->string conversion to strip the decimal places.
Be careful, if you have numbers like '300.5' as a column name, this will turn them into '300'.
cols = df.columns.tolist()
new_columns = {c: str(int(float(c))) for c in cols}
df = df.rename(columns=new_columns)
For clarity, most of the 'magic' is happening on the middle line: it maps each existing column name to its new name, and df.rename takes that dictionary and then does the renaming for you.
My thanks to user Nipun Batra for this answer that explained df.rename.