Pandas: how to keep data that has all the needed columns - python

I have a big CSV file with data from an experiment. The first part of each person's responses is a practice block that doesn't record the time they took for each response, and I don't need those rows. After that part, the data adds another column, the response time, and those are the rows I need. So the CSV contains a lot of unusable rows with 9 columns instead of 10, and I only need the rows with all 10 columns. How can I grab just that data instead of all of it?
As an example, the first row below shows the data without the time column (second to last), and the second row shows the data I need, with the time column added. I basically only need the second kind of row, of which there are thousands. Any tips would be appreciated.
1619922425,5fe43773223070f515613ba23f3b770c,PennController,7,0,experimental-trial2,NULL,PennController,9,_Trial_,End,1619922289638,FLOR, red, r,NULL
1619922425,5fe43773223070f515613ba23f3b770c,PennController,55,0,experimental-trial,NULL,PennController,56,_Trial_,Start,1619922296066,CASA, red, r,1230,NULL

Read the CSV with pandas, then filter with df[~df.time.isna()] to select all rows with non-NaN values in the "time" column.
You can adapt this to filter on the presence of data in any column. Think of it as a mask (i.e. mask = ~df.time.isna()) that flags each row as True or False depending on the condition.
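A minimal sketch of that, assuming the file has a header row and the timing column is actually named "time" (the filename and column name below are just placeholders):
import pandas as pd
df = pd.read_csv("experiment.csv")
mask = ~df["time"].isna()  # True for rows that actually contain a time value
df_timed = df[mask]  # keep only the complete, 10-column rows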

One option is to load the whole file and then keep only valid data:
import pandas as pd
df = pd.read_csv("your_file.csv")
invalid_rows = df.iloc[:,-1].isnull() # Find rows where the last column is missing
df = df[~invalid_rows] # Keep only the valid rows
If your columns have names, you can use df['column_name'] instead of df.iloc[:,-1].
Of course this means you load the full dataset first, but in many cases that is not a problem.


Pandas: How to read contents of a CSV into a single column?

I want to read a file 'tos_year.csv' into a Pandas dataframe, such that all values are in one single column. I will later use pd.concat() to add this column to an existing dataframe.
The CSV file holds 80 entries in the form of years, i.e. "... 1966,1966,1966,1966,1967,1967,... "
What I can't figure out is how to read the values into one column with 80 rows, instead of 80 columns with one row.
This is probably quite basic but I'm new to this. Here's my code:
import pandas as pd
tos_year = pd.read_csv('tos_year.csv').T
tos_year.reset_index(inplace=True)
tos_year.columns = ['Year']
As you can see, I tried reading it in and then transposing the dataframe, but when it gets read in initially, the year numbers are interpreted as column names. Since there apparently cannot be several columns with identical names, I end up with a dataframe holding string values like
...
1966
1966.1
1966.2
1966.3
1967
1967.1
...
which is not what I want. So clearly, it's preferable to read it in correctly from the start.
Thanks for any advice!
Add header=None to avoid parsing the years as column names, then transpose and rename the column, e.g. with DataFrame.set_axis:
tos_year = pd.read_csv('tos_year.csv', header=None).T.set_axis(['Year'], axis=1)
Or:
tos_year = pd.read_csv('tos_year.csv', header=None).T
tos_year.columns = ['Year']
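Since you plan to pd.concat() this column onto an existing dataframe afterwards, a hedged usage sketch (existing_df is a placeholder name; this assumes both have the same number of rows):
existing_df = pd.concat([existing_df.reset_index(drop=True), tos_year], axis=1)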

How to update/apply validation to pandas columns

I am working on automating a process with python using pandas. Previously I would use Excel PowerQuery to combine files and manipulate data but PowerQuery is not as versatile as I need so I am now using pandas. I have the process working up to a point where I can loop through files, select the columns that I need in the correct order, dependent on each workbook, and insert that into a dataframe. Once each dataframe is created, I then concatenate them into a single dataframe and write to csv. Before writing, I need to apply some validation to certain columns.
For example, I have a Stock Number column that will always need to be exactly 11 characters long. Sometimes, dependent on the workbook, the data will be missing the leading zeros or will have more than 11 characters (but those extra characters should be removed). I know that what I need to do is something along the lines of:
STOCK_NUM.zfill(13)[:13]
but I'm not sure how to actually modify the existing dataframe values. Do I actually need to loop through the dataframe or is there a way to apply formatting to an entire column?
e.g.
dataset = [['51346812942315.01', '01-15-2018'], ['13415678', '01-15-2018'], ['5134687155546628', '01/15/2018']]
df = pd.DataFrame(dataset, columns = ['STOCK_NUM', 'Date'])
for x in df["STOCK_NUM"]:
print(x.zfill(13)[:13])
I would like to know the best way to apply that format to the existing values, and only when those values are present (i.e. not touching nulls).
Also, I need to ensure that the date columns are truly date values. Sometimes the dates are formatted as MM-DD-YYYY, sometimes MM/DD/YY, etc., and any of those are fine, but what is not fine is when the actual value in the date column is an Excel serial number that Excel can format as a date. Is there some way to apply validation logic to an entire dataframe column to ensure that there is a valid date instead of a serial number?
I honestly have no idea how to approach this date issue.
Any and all advice, insight would be greatly appreciated!
Not an expert, but from things I could gather here and there, you could try:
df['STOCK_NUM']=df['STOCK_NUM'].str.zfill(13)
followed by:
df['STOCK_NUM'] = df['STOCK_NUM'].str.slice(0,13)
For the first part.
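Both steps can also be chained in one line; note that the .str accessor skips NaN values, which covers the "only if those values are present" part (this assumes the column holds strings or NaN):
df['STOCK_NUM'] = df['STOCK_NUM'].str.zfill(13).str[:13]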
For dates you can do a try-except on:
df['Date'] = pd.to_datetime(df['Date'])
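If you would rather flag problem values than fail on the first one, here is a hedged sketch, assuming the Date column holds strings and treating all-digit entries as likely Excel serial numbers:
is_serial = df['Date'].astype(str).str.match(r'^\d+$')  # all-digit entries: likely Excel serial numbers
parsed = pd.to_datetime(df['Date'].where(~is_serial), errors='coerce')  # on newer pandas, mixed formats may need format='mixed'
bad_rows = df[parsed.isna()]  # serial numbers, blanks, and unparseable dates end up here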
For your STOCK_NUM question, you could apply a function to the column, but the way I approach this is with a list comprehension. The first thing I would do is replace all the NAs in your STOCK_NUM column with a unique string and then apply the list comprehension, as you can see in the code below:
import pandas as pd
dataset = [['51346812942315.01', '01-15-2018'], ['13415678', '01-15-2018'], ['5134687155546628', '01/15/2018'], [None,42139]]
df = pd.DataFrame(dataset, columns = ['STOCK_NUM', 'Date'])
#replace NAs with a string
df.STOCK_NUM.fillna('IS_NA',inplace=True)
#use list comprehension to reformat the STOCK_NUM column
df['STOCK_NUM'] = [None if i=='IS_NA' else i.zfill(13)[:13] for i in df.STOCK_NUM]
Then, for your question about converting an Excel serial number to a date, I adapted an already answered question. I am assuming that the serial number in your dataframe is an integer type:
import datetime
def xldate_to_datetime(xldate):
    temp = datetime.datetime(1900, 1, 1)
    delta = datetime.timedelta(days=xldate) - datetime.timedelta(days=2)
    return pd.to_datetime(temp + delta)
df['Date'] = [xldate_to_datetime(i) if type(i)==int else pd.to_datetime(i) for i in df.Date]
Hopefully this works for you! Accept this answer if it does, otherwise reply with whatever remains an issue.

How to tell pandas to read columns from the left?

I have a csv in which one header column is missing. E.g., I have n data columns but only n-1 header names. When this happens, pandas shifts my first column over to be the index, so the column to the right of date_time in the csv ends up under the date_time column in the pandas dataframe.
My question is: how can I force pandas to read from the left so that the date_time data remains under the date_time column instead of becoming the index? I'm thinking if pandas can simply read from left to right and add dummy column names at the end of the file, that would be great.
Side note: I concede that my input csv should be "clean"; however, I think that pandas/frameworks in general should be able to handle the case where some data is unclean but the user wants to proceed with the analysis instead of spending 30 minutes writing a side script to fix these minor issues. In my case, the data I care about is usually in the first 15 columns and I don't really care if the columns after that are misaligned. However, when I read the dataframe into pandas, I'm forced to care and waste time fixing these issues even though I don't care about the remaining columns.
Since you don't care about the last column, just set index_col=False
df = pd.read_csv(file, index_col=False)
That way, it will sequentially match the columns with data for the first n-1 columns; data after that will not be in the dataframe.
You may also skip the first row to have all your data in the dataframe first:
df = pd.read_csv(file, skiprows=1)
and then just set the column names afterwards
df.columns = ['col1', 'col2', ....] + ['dummy_col1', 'dummy_col2'...]
where the first list comes from row 0 of your csv, and the second list you fill dynamically with a list comprehension.
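A hedged sketch of that renaming step, continuing from the skiprows read above (the real column names and the number of extras depend on your file):
real_cols = ['date_time', 'col2', 'col3']  # the names from row 0 of your csv
n_extra = df.shape[1] - len(real_cols)  # unnamed columns left over on the right
df.columns = real_cols + ['dummy_col{}'.format(i + 1) for i in range(n_extra)]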

How to read a dataset into pandas and leave out rows with uneven column count

I am trying to read a dataset, which has few rows with uneven column count ('ragged'). I want to leave out those rows and read the rest of the rows. Is it possible in pandas instead of breaking the dataset into separate data frames and combining them?
If I understand your question, you have rows with uneven column counts but want to drop any row that doesn't have every column. If so, simply read the entire data set (read_csv) and then call dropna() on the dataframe. dropna() has a keyword argument 'how', which defaults to 'any', i.e. drop the row (or column) if any of its items are NA. (Consider also passing inplace=True.) See also: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.dropna.html
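A minimal sketch of that, with a placeholder filename (rows shorter than the header are padded with NaN when read, so dropna removes exactly those):
import pandas as pd
df = pd.read_csv('data.csv')  # 'data.csv' is a placeholder; short rows come in padded with NaN
df = df.dropna(how='any')  # drop every row that is missing at least one column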

Most efficient way to compare two near identical CSV's in Python?

I have two CSVs, each with about 1M lines and the same n columns. I want the most efficient way to compare the two files and find where any differences lie. I would prefer to parse this data with Python rather than use any Excel-related tools.
Are you using pandas?
import pandas as pd
df = pd.read_csv('file1.csv')
df = pd.concat([df, pd.read_csv('file2.csv')], ignore_index=True)
# boolean Series indicating which rows are duplicated
df.duplicated()
# dataframe with only unique rows
df[~df.duplicated()]
# dataframe with only duplicate rows
df[df.duplicated()]
# number of duplicate rows present
df.duplicated().sum()
An efficient way would be to read each line from the first file (the one with fewer lines) and store them in a structure like a set or dictionary, where lookups take O(1) time.
Then read the lines of the second file and check whether each one exists in the set.
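A rough sketch of that approach, comparing whole lines as text (the filenames are placeholders):
with open('file1.csv') as f1:
    seen = set(f1)  # one pass over the smaller file; membership checks are O(1)
with open('file2.csv') as f2:
    diff = [line for line in f2 if line not in seen]  # lines of file2 not present in file1
print(len(diff), 'differing lines')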
