Drop rows from Dask DataFrame where column count is not equal - python

I have a CSV file that I want to normalize for SQL input. I want to drop every row whose column count differs from a fixed number, so that I can ignore bad lines where a column shift has occurred. In the past I used AWK to normalize this CSV dataset, but I want to implement this program in Python, which is easier to parallelize than my GNU Parallel + AWK solution.
I tried the following to drop the lines:
df.drop(df[df.count(axis='columns') != len(usecols)].index, inplace=True)
df = df[df.count(axis=1) == len(usecols)]
df = df[len(df.index) == len(usecols)]
None of these work. I need some help, thank you!
EDIT:
I'm working on a single CSV file on a single worker.
EDIT 2:
Here is the awk script for reference:
{
    line = $0;
    # ...
    if (line ~ /^$/) next;  # if line is blank, then remove it
    if (NF != 13) next;     # if column count is not equal to 13, then remove it
}

The question is not easy to understand. From the first statement it appears that you are working with a single file, is that correct?
If so, and if there are unnamed columns, then pandas (or dask via pandas) will try to 'fix' the structure by labelling the missing columns with names like 'Unnamed: 0'. Once that happens, it's easy to drop the misaligned rows by keeping only the rows where that extra column is empty:
mask = df['Unnamed: 0'].isna()
df = df[mask]
Edit: if there are rows that contain more entries than the number of defined columns, pandas will raise an error saying it was not able to parse the CSV.
If, however, you are working with multiple CSV files, then one option is to use dask.delayed to enforce compatible columns, see this answer for further guidance.

It's easier to post a separate answer: it seems that this problem can be solved by passing the on_bad_lines kwarg to pandas.read_csv (note: if you are using a pandas version lower than 1.3.0, you will need to use error_bad_lines instead). Roughly, the code would look like this:
from pandas import read_csv
df = read_csv('some_file.csv', on_bad_lines='warn') # can use skip
Since dask.dataframe can pass kwargs to pandas, the above can also be written for dask.dataframe:
from dask.dataframe import read_csv
df = read_csv('some_file.csv', on_bad_lines='warn') # can use skip
With this, the imported CSV will not contain any lines that have more columns than the header defines. Lines with fewer fields than the number of columns are still included, with the missing values set to NaN.
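To also drop the short lines, as the awk script does with NF != 13, one option is to filter on the padding afterwards. The following is a minimal sketch using pandas on a made-up in-memory sample; it assumes the last column is never legitimately empty, so a NaN there can only come from a short row. The sample data and the name clean are illustrative, not taken from the question.

```python
import io

import pandas as pd

# Hypothetical sample: 3 expected columns, one short row, one long row.
raw = "a|b|c\n1|2|3\n4|5\n6|7|8|9\n10|11|12\n"

# Rows with too many fields are skipped by the parser (pandas >= 1.3).
df = pd.read_csv(io.StringIO(raw), sep="|", on_bad_lines="skip")

# Rows with too few fields are kept and padded with NaN on the right.
# Assumption: the last column is never legitimately empty, so a NaN
# there marks a short row that should be dropped.
clean = df.dropna(subset=[df.columns[-1]])

print(clean)
```

If the last column can be empty in valid rows, this check is not reliable and counting the delimiters in a pre-processing pass is the safer route.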

I ended up creating a function that pre-processes the zipped CSV file for Pandas/Dask. These are not CPU- or memory-heavy tasks and parallelization is not important in this step, so until there is a better way to do this, here we are. I'm adding a proper header to my pre-processed CSV file, too.
from io import TextIOWrapper
from zipfile import ZipFile

with open(csv_filename, 'wt', encoding='utf-8', newline='\n') as file:
    join = '|'.join(usecols)
    file.write(f"{join}\n")  # Add the header
    with ZipFile(destination) as z:
        with TextIOWrapper(z.open(f"{filename}.csv"), encoding='utf-8') as f:
            for line in f:
                line = line.strip()  # Remove surrounding whitespace, including the newline
                if line:  # Skip empty lines (after strip() a blank line is '')
                    array = line.split("|")
                    if len(array) == column_count:
                        del array[1:3]  # Remove the 2nd and 3rd fields (indexes 1 and 2)
                        array = [s.strip() for s in array]  # Strip whitespace from each field
                        join = '|'.join(array)
                        file.write(f"{join}\n")
PS: This is not an answer to my original question, which is why I won't accept it.

Related

How to read specific rows and columns, which satisfy some condition, from file while initializing a dataframe in Pandas?

I have been looking for an approach that lets me load only those columns from a CSV file that satisfy a certain condition while I create a DataFrame: something that can skip the unwanted columns, because I have a large number of columns and only some are actually useful for testing purposes. I also want to load only those columns that have mean > 0.0. The idea is similar to skipping a certain number of rows or reading the first nrows, but I am looking for condition-based filtering on column names and values.
Is this actually possible in Pandas? Can it be done on the fly, accumulating results first without loading everything into memory?
There's no direct or easy way of doing that (that I know of)!
The first idea that comes to mind is to read the first line of the CSV (i.e. the headers), then build a list of your desired columns with a list comprehension:
columnsOfInterest = [c for c in df.columns.tolist() if 'node' in c]
and get their positions in the CSV. With that, you have the column names and positions, so you can read only those from your CSV.
However, for the second part of your condition, which requires calculating the mean, you'll unfortunately have to read all the data for these columns, run the mean calculation, and then keep the ones of interest (where the mean is > 0). That's the extent of my knowledge, though; maybe someone else has a way of doing this and can help you out. Good luck!
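The two-step idea described above (filter on names from the header, then compute means over the data) can be done in a single streaming pass with the standard csv module, so that only one row is in memory at a time. This is a rough sketch, not a pandas feature: the function name columns_with_positive_mean and the sample data are made up for illustration, and it assumes the selected columns hold plain numeric text.

```python
import csv
import io


def columns_with_positive_mean(lines, name_filter):
    """Return names of columns that pass name_filter and have mean > 0.

    `lines` is any iterable of CSV text lines (for example an open file),
    so the data is consumed row by row.
    """
    reader = csv.reader(lines)
    header = next(reader)
    # Positions of the columns whose names pass the filter.
    wanted = [i for i, name in enumerate(header) if name_filter(name)]
    sums = {i: 0.0 for i in wanted}
    counts = {i: 0 for i in wanted}
    for row in reader:
        for i in wanted:
            # Ignore short rows and empty cells instead of failing on them.
            if i < len(row) and row[i] != "":
                sums[i] += float(row[i])
                counts[i] += 1
    return [header[i] for i in wanted if counts[i] and sums[i] / counts[i] > 0.0]


# Hypothetical sample standing in for a large file on disk.
sample = io.StringIO("node_a,node_b,other\n1,-2,5\n3,-4,6\n")
keep = columns_with_positive_mean(sample, lambda name: "node" in name)
print(keep)
```

The resulting list can then be passed as usecols to pandas.read_csv, at the cost of reading the file twice.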
I think usecols is what you are looking for.
df = pandas.read_csv('<file_path>', usecols=['col1', 'col2'])
You could preprocess the column headers using the csv library first.
import csv
with open('data.csv', newline='') as f:  # text mode: csv.reader expects str lines in Python 3
    reader = csv.reader(f)
    column_names = next(reader, None)  # first row holds the header
filtered_columns = [name for name in column_names if 'some_string' in name]
Then proceed using usecols from pandas as abhi mentioned.
df = pandas.read_csv('data.csv', usecols=filtered_columns)

read_csv read rows with value contained in a column of another data frame

I'm working with large dataframes (15 GB), and every time I try to open them I get a memory error.
I successfully opened dataframe A, whose first column is an ID that also appears in dataframe B.
B has many more rows and IDs that I don't care about, and since I can't filter rows after opening it because of the memory error, I tried to filter the rows I need while opening it.
Following this post, "skip specific line that contains certain value when you read pandas data frame", I tried:
import StringIO
import pandas as pd
emptylist = []
def read_file(file_name):
    with open(file_name, 'r') as fh:
        for line in fh.readlines():
            parts = line.split(',')
            if parts[0] not in emptylist:
                emptylist.append(parts[0])
                if parts[0] in set(idlist):
                    yield line
stream = StringIO.StringIO()
stream.writelines(read_file('B.csv'))
stream.seek(0)
df = pd.read_csv(stream)
where emptylist should contain the unique values of dataframe B's ID column, and idlist is the ID column of dataframe A converted to a list.
The problem is that I still get a memory error at stream.writelines(read_file('B.csv')), and I don't understand why: the number of rows should be exactly the same as in dataframe A, and B has only 2 columns against the 3 of dataset A, which I can open.
Thank you very much for your help!
You still get the memory error because fh.readlines() reads the whole of B.csv into RAM before processing it. You can iterate over the file object instead:
with open("B.csv") as infile:
    for line in infile:
        do_something_with(line)
It only reads one line at a time. When the next line is read, the previous one will be garbage collected unless you have stored a reference to it somewhere else.
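Applied to the question, the line-by-line loop could look roughly like the sketch below. It is only an illustration under a few assumptions: the ID is the first field, the file has a header row, and the in-memory b_file and idlist stand in for the real B.csv and for the ID column of dataframe A. The function name filter_rows_by_id is made up.

```python
import csv
import io


def filter_rows_by_id(lines, wanted_ids):
    """Yield the header, then every row whose first field is in wanted_ids."""
    wanted = set(wanted_ids)  # build the set once, not once per line
    reader = csv.reader(lines)
    header = next(reader, None)
    if header is not None:
        yield header
    for row in reader:
        if row and row[0] in wanted:
            yield row


# Hypothetical stand-ins for B.csv and for the ID column of dataframe A.
b_file = io.StringIO("id,value\n1,x\n2,y\n3,z\n1,w\n")
idlist = ["1", "3"]

filtered = list(filter_rows_by_id(b_file, idlist))
print(filtered)
```

With a real 15 GB file, the rows would be written to a smaller CSV with csv.writer as they are yielded, rather than collected in a list, and that smaller file loaded with pd.read_csv.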

Read one file and rewrite to another, dropping the first column

I'm reading a ;-separated CSV file and rewriting it to another CSV separated by ",". The result ends up with an additional first column in which the rows count from 0 to n. How do I leave that new column out?
I import pandas, read the file into df with delimiter=";" (because the file to be read is already in separate columns), and then call df.to_csv to write a new comma-separated CSV file, i.e. each row is one long string of data.
import pandas as pd
df = pd.read_csv("C:\\Users\\jcst\\Desktop\\Private\\Python data\\old_file.csv", delimiter=";", encoding='cp1252', error_bad_lines=False)
print(df.columns)
df.to_csv("C:\\Users\\jcst\\Desktop\\Private\\Python data\\new_file.csv", sep=",")
The code runs fine, but the result in new_file.csv looks as follows:
,data,data,data,....
0,data,data,data,....
1,data,data,data,....
2,data,data,data,....
3,data,data,data,....
4,data,data,data,....
…
11,625,data,data,data,....
So, I need to know how to rewrite the code to avoid the leftmost column, counting 0, 1, 2, ..., 11,625. How do I do that?
Thanks in advance.
Tell the exporter not to export the index:
df.to_csv("...", index=False)

UpperCasing CSV Columns by reading the header

I have a function that reads the csv, splits it, uppercases only ONE or ALL columns (by index) and joins it again.
I want to be able to uppercase multiple columns but I have no idea how.
This is my code.
def specific_upper(line, c):
    split = line.split(",")
    split[c] = split[c].upper()
    split = ','.join(split)
    return split
EDIT: I want to do this with Python only (no Spark, if possible).
EDIT 2: This is for NiFi, so it's Jython and not 100% Python.
You can do that easily with read_csv from pandas. By default, the first row of the CSV is taken to contain the column names.
import pandas as pd
df = pd.read_csv('<filename>')
df.columns = [x.upper() for x in df.columns]
This will uppercase all your column names. You can add conditions to uppercase only the columns you want.
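For the line-based function from the question, which has to run under Jython without pandas, a minimal sketch that accepts several column indexes might look like this. The name upper_columns and the sample values are illustrative; like the original it splits naively on commas, so it assumes no quoted fields containing commas.

```python
def upper_columns(line, indexes=None):
    """Uppercase the fields at the given indexes; all fields if indexes is None."""
    fields = line.split(",")
    if indexes is None:
        indexes = range(len(fields))
    for i in indexes:
        fields[i] = fields[i].upper()
    return ",".join(fields)


# Hypothetical header and data line; indexes are looked up by column name.
header = "first,last,city"
names = ["first", "city"]
indexes = [header.split(",").index(name) for name in names]

print(upper_columns("john,doe,nyc", indexes))
print(upper_columns("john,doe,nyc"))
```

Looking the indexes up from the header line once, then reusing them for every data line, keeps the per-line work to a split and a join.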

Output Issues with pandas, Python

Begin Code
import pandas as pd
df = pd.read_csv('C:/Users/lhicks/Documents/Corporate/test.csv', 'r')
saved_column = df.FirstName
saved_column2 = df.LastName
saved_column3 = df.Email
print saved_column
print saved_column2
print saved_column3
Itemlist = []
Itemlist.append(saved_column)
print Itemlist
End of Code
The objective is to select specific columns from a specified xls sheet, grab all the rows from the specified columns, and then print that data out.
The current issue is the data is grabbed correctly, but after 29-30 rows, it prints/stores a "...", and then jumps to line item 880s, and finishes out from there.
The additional issue is that it also stores this as the value, making it worthless due to not providing the full dataset.
The eventual process is to add the selected columns to a new xls sheet to clean up the old data, and then add the rows to a templated document to generate an advertisement letter.
The first question is: how do I get all the fields to populate? The second is: what is the best approach for this? Please provide additional links as well if possible; this is a practical learning experience for me.
Pandas shortens your data when printing it.
NOTE: all the data is still there (use print(df.shape) to check the shape of your DataFrame); truncating the output is just a convenient way not to flood your screen with tons of data rows/columns.
Try this:
fn = 'C:/Users/lhicks/Documents/Corporate/test.csv'
cols = ['FirstName','LastName','Email']
df = pd.read_csv(fn, usecols=cols)
df.to_excel('/path/to/excel.xlsx', index=False)
This will parse only the ['FirstName','LastName','Email'] columns from the CSV file and export them to an Excel file.
UPDATE:
If you want to control how many rows Pandas prints:
with pd.option_context("display.max_rows", 200):
    print(df)
