Uppercasing CSV columns by reading the header - Python

I have a function that reads a CSV line, splits it, uppercases only ONE or ALL columns (by index), and joins it again.
I want to be able to uppercase multiple columns, but I have no idea how.
This is my code:
def specific_upper(line, c):
    split = line.split(",")
    split[c] = split[c].upper()
    split = ','.join(split)
    return split
EDIT: I wanted to do this only with Python (no Spark, if possible).
EDIT 2: This is for NiFi, so it's Jython and not 100% Python.
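Since the edits rule out external libraries, here is a minimal pure-Python sketch that should also run under Jython; the function name and the list-of-indices signature are my own assumption, not from the post:

def specific_upper_multi(line, indices):
    # Split the CSV line, uppercase each requested column index, and rejoin
    split = line.split(",")
    for c in indices:
        split[c] = split[c].upper()
    return ','.join(split)

# Example: uppercase columns 0 and 2
# specific_upper_multi("a,b,c,d", [0, 2])  ->  'A,b,C,d'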

You can do that easily with read_csv from pandas. By default, it treats the first row of the CSV as the column names.
import pandas as pd
df = pd.read_csv('<filename>')
df.columns = [x.upper() for x in df.columns]
This will uppercase all your column names. You can add a condition in order to uppercase only the columns you want.
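For example, a small sketch that uppercases only selected header names (the names in cols_to_upper are hypothetical):

import pandas as pd

df = pd.read_csv('<filename>')
# Uppercase only the headers listed in cols_to_upper; leave the rest untouched
cols_to_upper = {'id', 'city'}
df.columns = [x.upper() if x in cols_to_upper else x for x in df.columns]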

Related

Pandas: faster string operations in dataframes

I am working on a python script that reads data from a database and saves this data into a .csv file.
In order to save it correctly I need to escape different characters such as \r\n or \n.
Here is how I am currently doing it:
Firstly, I use the read_sql pandas function in order to read the data from the database.
import pandas as pd
df = pd.read_sql(
    sql='SELECT * FROM exampleTable',
    con=SQLAlchemyConnection
)
The table I get has different types of values.
Then, the script updates the resulting dataframe, changing every string value to its raw representation.
To achieve that I use two nested for loops to operate on every single value.
def update_df(df):
    for rowIndex, row in df.iterrows():
        for colIndex, values in row.items():
            if isinstance(df.at[rowIndex, colIndex], str):
                df.at[rowIndex, colIndex] = repr(df.at[rowIndex, colIndex])
    return df
However, the amount of data I need to process is large (more than 1 million rows with more than 100 columns) and it takes hours.
What I need is a way to create the csv file in a faster way.
Thank you in advance.
It should be faster to use applymap if you really have mixed types:
df = df.applymap(lambda x: repr(x) if isinstance(x, str) else x)
However, if you can identify the string columns, then you can slice them (maybe in combination with re.escape?):
import re
str_cols = ['col1', 'col2']
df[str_cols] = df[str_cols].applymap(re.escape)
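If the string columns are not known up front, one possibility (my own assumption, not part of the answer above) is to detect them by dtype, continuing from the df above:

# Columns with dtype 'object' are typically the string-valued ones
str_cols = df.select_dtypes(include='object').columns
df[str_cols] = df[str_cols].applymap(lambda x: repr(x) if isinstance(x, str) else x)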

Drop rows from Dask DataFrame where column count is not equal

I have a CSV file which I want to normalize for SQL input. I want to drop every line where the column count is not equal to a certain number, so that I can ignore the bad lines where a column shift can happen. In the past I used AWK to normalize this CSV dataset, but I want to implement this program in Python for easier parallelization than the GNU Parallel + AWK solution.
I tried the following codes to drop the lines:
df.drop(df[df.count(axis='columns') != len(usecols)].index, inplace=True)
df = df[df.count(axis=1) == len(usecols)]
df = df[len(df.index) == len(usecols)]
None of these work; I need some help, thank you!
EDIT:
I'm working on a single CSV file on a single worker.
EDIT 2:
Here is the awk script for reference:
{
    line = $0;
    # ...
    if (line ~ /^$/) next;  # if line is blank, then remove it
    if (NF != 13) next;     # if column count is not equal to 13, then remove it
}
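For comparison, a rough plain-Python version of the same filter (the file names are placeholders, and the '|' delimiter is an assumption taken from the preprocessing code at the end of this thread; awk's default NF splits on whitespace):

# Keep only non-blank lines that have exactly 13 fields
with open('input.csv', encoding='utf-8') as src, open('filtered.csv', 'w', encoding='utf-8') as dst:
    for line in src:
        line = line.strip()
        if not line:                    # if line is blank, then remove it
            continue
        if len(line.split('|')) != 13:  # if column count is not equal to 13, then remove it
            continue
        dst.write(line + '\n')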
The question is not easy to understand. From the first statement it appears as if you are working with a single file, is that correct?
If so, and if there are unnamed columns, pandas (or dask via pandas) will attempt to 'fix' the structure by adding missing column labels like 'Unnamed: 0'. Once that happens, it's easy to drop the misaligned rows with something like:
mask = df['Unnamed: 0'].isna()
df = df[mask]
Edit: if there are rows that contain more entries than the number of defined columns, pandas will raise an error saying it was not able to parse the csv.
If, however, you are working with multiple csv files, then one option is to use dask.delayed to enforce compatible columns; see this answer for further guidance.
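For reference, a rough sketch of the dask.delayed route mentioned above (the file names and the usecols list are hypothetical, and this assumes the per-file cleanup can be expressed in pandas):

import dask
import dask.dataframe as dd
import pandas as pd

usecols = ['a', 'b', 'c']  # hypothetical expected columns

@dask.delayed
def load_one(path):
    pdf = pd.read_csv(path)
    # Reindex so every partition exposes the same set of columns
    return pdf.reindex(columns=usecols)

parts = [load_one(p) for p in ['part1.csv', 'part2.csv']]
# Optionally pass meta=... so dask knows the schema up front
df = dd.from_delayed(parts)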
It's easier to post a separate answer, but it seems that this problem can be solved by passing the on_bad_lines kwarg to pandas.read_csv (note: if you are using a pandas version lower than 1.3.0, you will need to use error_bad_lines). Roughly, the code would look like this:
from pandas import read_csv
df = read_csv('some_file.csv', on_bad_lines='warn') # can use skip
Since dask.dataframe can pass kwargs to pandas, the above can also be written for dask.dataframe:
from dask.dataframe import read_csv
df = read_csv('some_file.csv', on_bad_lines='warn') # can use skip
With this, the imported csv will not reflect any lines that have more columns than expected based on the header (if there is a line with fewer elements than the number of columns, it will be included, with the missing values set to NaN).
I ended up creating a function which pre-processes the zipped CSV file for Pandas/Dask. These are not CPU/memory-heavy tasks and parallelization is not important in this step, so until there's a better way to do this, here we are. I'm adding a proper header to my pre-processed CSV file, too.
from io import TextIOWrapper
from zipfile import ZipFile

with open(csv_filename, 'wt', encoding='utf-8', newline='\n') as file:
    join = '|'.join(usecols)
    file.write(f"{join}\n")  # Add the header
    with ZipFile(destination) as z:
        with TextIOWrapper(z.open(f"{filename}.csv"), encoding='utf-8') as f:
            for line in f:
                line = line.strip()  # Remove surrounding whitespace, including the newline
                if line:  # Exclude empty lines (after strip they are '')
                    array = line.split("|")
                    if len(array) == column_count:
                        del array[1:3]  # Remove the elements at index 1 and 2
                        array = [s.strip() for s in array]  # Strip whitespace from each field
                        join = '|'.join(array)
                        file.write(f"{join}\n")
PS: This is not an answer to my original question, which is why I won't accept it.

How to read specific rows and columns, which satisfy some condition, from file while initializing a dataframe in Pandas?

I have been trying to find an approach that will allow me to load only those columns from a csv file which satisfy a certain condition while I create a DataFrame, something which can skip the unwanted columns, because I have a large number of columns and only some are actually useful for testing purposes. I also want to load only those columns which have mean > 0.0. The idea is like skipping a certain number of rows or reading the first nrows, but I am looking for condition-based filtering on column names and values.
Is this actually possible with Pandas? Can it do things on the fly, accumulating results without loading everything into memory?
There's no direct/easy way of doing that (that I know of)!
The first idea that comes to mind is to read the first line of the csv (i.e. read the headers), then create a list of your desired columns using a list comprehension:
columnsOfInterest = [ c for c in df.columns.tolist() if 'node' in c]
and get their positions in the csv. With that, you'll have the columns/positions, so you can read only those from your csv.
However, for the second part of your condition, which needs the mean, you'll unfortunately have to read all the data for those columns, run the mean calculations, and then keep those of interest (where the mean is > 0). But that's the extent of my knowledge; maybe someone else has a way of doing this and can help you out. Good luck!
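One way to avoid loading everything into memory at once (my own suggestion, not from the answer above; the file name is a placeholder) is a two-pass scheme with chunked reads: accumulate per-column sums and counts first, then reload only the columns whose mean is positive:

import pandas as pd

# First pass: accumulate sums and counts per numeric column, chunk by chunk
sums, counts = None, None
for chunk in pd.read_csv('data.csv', chunksize=100_000):
    num = chunk.select_dtypes('number')
    sums = num.sum() if sums is None else sums.add(num.sum(), fill_value=0)
    counts = num.count() if counts is None else counts.add(num.count(), fill_value=0)

# Keep only the columns whose mean is > 0.0
means = sums / counts
keep = means[means > 0.0].index.tolist()

# Second pass: load only those columns
df = pd.read_csv('data.csv', usecols=keep)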
I think usecols is what you are looking for.
df = pandas.read_csv('<file_path>', usecols=['col1', 'col2'])
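Note that usecols also accepts a callable, which pandas evaluates against each column name and keeps the columns for which it returns True; the 'node' substring below is just an illustrative filter:

import pandas as pd

# Keep only the columns whose name contains 'node'
df = pd.read_csv('<file_path>', usecols=lambda name: 'node' in name)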
You could preprocess the column headers using the csv library first.
import csv

# Read just the header row to get the column names
with open('data.csv', 'r', newline='') as f:
    reader = csv.reader(f)
    column_names = next(reader, None)

filtered_columns = [name for name in column_names if 'some_string' in name]
Then proceed using usecols from pandas as abhi mentioned.
df = pandas.read_csv('data.csv', usecols=filtered_columns)

Why is the pandas isin - query - loc function not finding all matching items

I have a dataframe where I'd like to add a column "exists" based on the item existing in another dataframe.
Using the isin function only answers back with 1 match based on that other dataframe. The same goes for a loc filter when I set the column I want to filter on as the index.
It just doesn't work as expected when I use a reference to a list or column of another DF like this:
table.loc[table.index.isin(tableOther['column']), : ]
In this case it only returns 1 item.
import pandas as pd
import numpy as np
# Source that i like to enrich with additional column
table = pd.read_csv('keywordsDataSource.csv', encoding='utf-8', delimiter=';', index_col='Keyword')
# Source to compare keywords against
tableSubject = pd.read_csv('subjectDataSource.csv', encoding='utf-8', names=["subjects"])
### This column based check only returns 1 - seemingly random - match ###
table.loc[table.index.isin(tableSubject['subjects']), : ]
--------------
######## also tried it like this:
# Source that i like to enrich with additional column
table = pd.read_csv('keywordsDataSource.csv', encoding='utf-8', delimiter=';')
# Source to compare keywords against
tableSubject = pd.read_csv('subjectDataSource.csv', encoding='utf-8', names=["subjects"])
mask = table['Keyword'].isin(tableSubject.subjects)
table[mask]
I've also tried using .query and turning the subjects column into a list, which ends with the same single-match result as above.
As the output is the same in all tries, I expect that it is something with the data source.
Thank you for your thoughts!
The answer turned out to be as simple as the capitalization of words. Neither source of data was normalized to lowercase: one list had Capitalized Words Like This and the other was random.
Learning: make sure the columns are formatted exactly the same, as all the matching options look for exact matches.
This can be done as following:
table['Keyword'] = table['Keyword'].str.lower()
Also found a great answer here in case you don't need exact match:
How to test if a string contains one of the substrings in a list, in pandas?
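For completeness, a small sketch (my own addition, reusing the column names from the question) that normalizes both sides before the isin check:

# Lowercase both columns so isin can find exact matches
table['Keyword'] = table['Keyword'].str.lower()
tableSubject['subjects'] = tableSubject['subjects'].str.lower()

mask = table['Keyword'].isin(tableSubject['subjects'])
table[mask]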

Use multiple rows as column header for pandas

I have a dataframe that I've imported as follows.
df = pd.read_excel("./Data.xlsx", sheet_name="Customer Care", header=None)
I would like to set the first three rows as column headers but can't figure out how to do this. I gave the following a try:
df.columns = df.iloc[0:3,:]
but that doesn't seem to work.
I saw something similar in this answer. But it only applies if all sub columns are going to be named the same way, which is not necessarily the case.
Any recommendations would be appreciated.
df = pd.read_excel(
    "./Data.xlsx",
    sheet_name="Customer Care",
    header=[0, 1, 2]
)
This will tell pandas to read the first three rows of the Excel file as MultiIndex column labels.
If you have already loaded the data (e.g. with header=None) and want to set the rows as columns afterwards:
# Set the first three rows as a MultiIndex of column labels
df.columns = pd.MultiIndex.from_arrays(df.iloc[0:3].values)
# Drop the first three rows (because they are now the columns)
df = df.iloc[3:]
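Either way, with a three-level MultiIndex the columns are addressed by tuples; the level labels below are hypothetical examples, not from the original data:

# Select a single column by its full three-level label
df[('Region', 'Store', 'Sales')]
# Select every column under the first level
df['Region']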
