Ignore delimiters between parentheses when reading CSV - python

I am reading in a csv file delimited by | but some of the data includes extra | characters. When this occurs it appears to be only between two parentheses (example below). I want to be able to read in the data into a dataframe without the columns being messed up (or failing) due to these extra | characters.
I've been trying to find a way to either
set the pandas read csv delimiter to ignore delimiters between parentheses ()
or
parse over the csv file before loading it to a dataframe and remove any | characters between parentheses ()
I haven't had any luck so far. This is sample data that gets messed up when I attempt to pull it into a dataframe.
1|1|1|1|1|1|1|1||1||1||||2022-01-03 20:14:51|1||1|1|1%' AND 1111=DBMS_PIPE.RECEIVE_MESSAGE(ABC(111)||ABC(111)||ABC(111)||ABC(111),1) AND '%'='||website.com|192.168.1.1|Touch Email
I am trying to ignore the | characters between the parentheses () from (ABC(111) to ABC(111),1)
This pattern occurs repeatedly throughout the data, so I can't address each occurrence individually and need to handle it programmatically.
This person seems to be attempting something similar but their solution did not work for me (when changing to |)

Depending on the specifics of your input file you might succeed by applying a regular expression with this strategy:
Replace the substring containing the parentheses with a dummy substring.
Read the csv.
Re-replace the dummy substring with the original.
A dirty proof of concept looks as follows:
import re
import pandas as pd
test_line = """1|1|1|1|1|1|1|1||1||1||||2022-01-03 20:14:51|1||1|1|1%' AND 1111=DBMS_PIPE.RECEIVE_MESSAGE(ABC(111)||ABC(111)||ABC(111)||ABC(111),1) AND '%'='||website.com|192.168.1.1|Touch Email"""
# 1. step
dummy_string = '--REPLACED--'
pattern_brackets = re.compile(r'(\(.*\))')
pattern_replace = re.compile(dummy_string)
list_replaced = pattern_brackets.findall(test_line)
csvString = re.sub(pattern_brackets, dummy_string, test_line)
# 2. step
# Now I have to fake your csv
from io import StringIO
csvStringIO = StringIO(csvString)
df = pd.read_csv(csvStringIO, sep="|", header=None)
# 3. step - not the nicest way; just for illustration
# The critical column is the 20th.
new_col = pd.Series([
    re.sub(pattern_replace, list_replaced[nRow], cell)
    for nRow, cell in enumerate(df.iloc[:, 20])
])
df.loc[:, 20] = new_col
This works for the line given. Most probably you will have to adapt the recipe to the content of your input file. I hope you find your way from here.
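If permanently removing the extra | characters inside the parentheses is acceptable (the second option in the question), a shorter sketch is to clean each line with a substitution callback before handing the text to pandas. This is only a sketch: it assumes a parenthesised group never spans more than one line, and data.csv is a placeholder file name.
import re
import pandas as pd
from io import StringIO

def strip_pipes_in_parens(line):
    # Replace each outermost (...) group with the same text minus the '|' characters.
    return re.sub(r'\([^\n]*\)', lambda m: m.group(0).replace('|', ''), line)

with open('data.csv') as f:  # placeholder path
    cleaned = ''.join(strip_pipes_in_parens(line) for line in f)

df = pd.read_csv(StringIO(cleaned), sep='|', header=None)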

Related

Splitting fields from a CSV file using pyspark

I'm having issues splitting a CSV file through PySpark. I'm trying to output the country and name of the wine (this is just to prove the parsing is working), but I get an error.
This is how the CSV file looks:
,country,description,designation,points,price,province,region_1,region_2,variety,winery
20,US,"Heitz has made this stellar rosé from the rare Grignolino grape since 1961. Ruby grapefruit-red, it's sultry with strawberry, watermelon, orange zest and salty spice flavor, highlighted with vibrant floral aromas.",Grignolino,95,24.0,California,Napa Valley,Napa,Rosé,Heitz
and here is my code
from pyspark import SparkConf, SparkContext
conf = SparkConf().setMaster("local").setAppName("SQLProject")
sc = SparkContext(conf = conf)
def parseLine(line):
    fields = line.split(',')
    country = fields[1]
    points = fields[4]
    return country, points
lines = sc.textFile("file:///Users/luisguillermo/IE/Spark/Final Project/wine-reviews/winemag-data-130k-v2.csv")
rdd = lines.map(parseLine)
results = rdd.collect()
for result in results:
    print(result)
And get this error:
File "/Users/luisguillermo/IE/Spark/Final Project/wine-reviews/country_and_points.py", line 10, in parseLine
points = fields[4]
IndexError: list index out of range
It appears that the program gets confused as there are commas in the description. Any ideas on how to fix this?
I would recommend using Spark's built-in CSV data source, as it provides many options, including quote, which handles delimiters inside columns. Of course, the column containing the delimiter should be quoted with some character.
quote
When a column contains the character that is used to split the columns, use the quote option to specify the quote character; by default it is " and delimiters inside quotes are ignored, but with this option you can set any character.
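For illustration, a hedged sketch applying the quote option to the file from the question (the path is a placeholder, '"' is already the default quote character, and escape is set as well in case fields contain embedded quotes):
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("SQLProject").getOrCreate()

# The long description field is wrapped in double quotes, so telling Spark the
# quote character keeps the embedded commas from splitting it into extra columns.
df = (spark.read
      .option("header", True)
      .option("quote", '"')
      .option("escape", '"')
      .csv("winemag-data-130k-v2.csv"))  # placeholder path

df.select("country", "winery").show(5)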
If you want to read about the other options Spark CSV provides, along with examples, I would suggest reading the following articles.
spark-read-csv-file-into-dataframe
read-csv
Happy Learning !!
See this code:
df = spark.read\
.csv('data.csv')
df.printSchema()
df.show()
The resulting df is a DataFrame with the columns just like the CSV.
See more advanced features here

Parsing Dirty Text File with Pandas Header Issue

I am trying to parse a text file created back in '99 that is slightly difficult to deal with. The headers are in the first row and are delimited by '^' (the entire file is ^ delimited). The issue is that there are stray characters thrown in; long lines of spaces, for example, appear to separate the headers from the rest of the data points in the file. (Example files are located at https://www.chicagofed.org/applications/bhc/bhc-home; my example references Q3 1999.)
Issues:
1) There are too many headers to create manually, and I need to do this for many files that may have new headers as we move forward or backward through the time series.
2) I need to recreate the headers from the file and then remove them so that I don't pollute my entire first row with header duplicates. I realize I could probably slice the dataframe [1:] after the fact and just get rid of it, but that's sloppy and I'm sure there's a better way.
3) The fields a company does not report appear to show up as "^^^^^^^^^", which is fine, but will pandas automatically populate NaNs in that scenario?
My attempt below simply tries to isolate the headers, but I'm really stuck on the larger issue of the way the text file is structured. Any recommendations or obvious easy tricks I'm missing?
from zipfile import ZipFile
import pandas as pd
def main():
    # Driver
    FILENAME_PREFIX = 'bhcf'
    FILE_TYPE = '.txt'
    field_headers = []
    with ZipFile('reg_data.zip', 'r') as zip:
        with zip.open(FILENAME_PREFIX + '9909' + FILE_TYPE) as qtr_file:
            headers_df = pd.read_csv(qtr_file, sep='^', header=None)
            headers_df = headers_df[:1]
            headers_array = headers_df.values[0]
            parsed_data = pd.read_csv(qtr_file, sep='^', header=headers_array)
I tried with the file you linked and one I downloaded, I think from 2015:
import pandas as pd
df = pd.read_csv('bhcf9909.txt',sep='^')
first_headers = df.columns.tolist()
df_more_actual = pd.read_csv('bhcf1506.txt',sep='^')
second_headers = df_more_actual.columns.tolist()
print(df.shape)
print(df_more_actual.shape)
# df_more_actual has more columns than first one
# Normalize column names to avoid duplicate columns
df.columns = df.columns.str.upper()
df_more_actual.columns = df_more_actual.columns.str.upper()
new_df = df.append(df_more_actual)
print(new_df.shape)
The final dataframe has the rows of both CSVs and the union of their columns.
You can do this for the CSV of each quarter, appending as you go, so that you end up with all of the rows and the union of the columns.
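As a rough sketch of that loop (file names are assumed to follow the bhcfYYMM.txt pattern and the reg_data.zip archive name from the question):
from zipfile import ZipFile
import pandas as pd

frames = []
with ZipFile('reg_data.zip') as zf:
    for name in zf.namelist():
        if name.startswith('bhcf') and name.endswith('.txt'):
            with zf.open(name) as qtr_file:
                df = pd.read_csv(qtr_file, sep='^', low_memory=False)
                df.columns = df.columns.str.upper()  # normalize header case
                frames.append(df)

# Union of columns across all quarters; unreported fields become NaN.
all_quarters = pd.concat(frames, ignore_index=True, sort=False)
print(all_quarters.shape)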

Reading bad csv files with garbage values

I wish to read a csv file which has the following format using pandas:
atrrth
sfkjbgksjg
airuqghlerig
Name Roll
airuqgorqowi
awlrkgjabgwl
AAA 67
BBB 55
CCC 07
As you can see, if I use pd.read_csv, I get the fairly obvious error:
ParserError: Error tokenizing data. C error: Expected 1 fields in line 4, saw 2
But I wish to get the entire data into a dataframe. Using error_bad_lines=False will remove the important stuff and leave only the garbage values.
These are the possible column names, as given below:
Name : [Name , NAME , Name of student]
Roll : [Rollno , Roll , ROLL]
How to achieve this?
Open the csv file and find a row from where the column name starts:
with open(r'data.csv') as fp:
    skip = next(filter(
        lambda x: x[1].startswith(('Name', 'NAME')),
        enumerate(fp)
    ))[0]
The number of rows to skip will be stored in the skip variable:
import pandas as pd
df = pd.read_csv('data.csv', skiprows=skip)
Works in Python 3.X
I would like to suggest a slight modification/simplification to @RahulAgarwal's answer. Rather than closing and re-opening the file, you can continue loading the same stream directly into pandas. Instead of recording the number of rows to skip, you can record the header line and split it manually to provide the column names:
with open(r'data.csv') as fp:
    names = next(line for line in fp if line.casefold().lstrip().startswith('name'))
    df = pd.read_csv(fp, names=names.strip().split())
This has an advantage for files with large numbers of trash lines.
A more detailed check could be something like this:
def isheader(line):
    items = line.strip().split()
    if len(items) != 2:
        return False
    items = sorted(map(str.casefold, items))
    return items[0].startswith('name') and items[1].startswith('roll')
This function will handle all your possibilities, in any order, but it will also skip trash lines that contain spaces. You would use it as a filter:
names = next(line for line in fp if isheader(line))
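Putting the pieces together, a minimal sketch (assuming the data columns are whitespace-separated, as in the sample):
import pandas as pd

with open(r'data.csv') as fp:
    names = next(line for line in fp if isheader(line))
    df = pd.read_csv(fp, names=names.strip().split(),
                     sep=r'\s+', engine='python')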
If that's indeed the structure (and not just an example of the sort of garbage one can get), you can simply use the skiprows argument to indicate how many lines should be skipped. In other words, you should read your dataframe like this:
import pandas as pd
df = pd.read_csv('your.csv', skiprows=3)
Mind that skiprows can do much more. Check the docs.

Pandas python replace empty lines with string

I have a csv which at some point becomes like this:
57926,57927,"79961', 'dsfdfdf'",fdfdfdfd,0.40997048,5 x fdfdfdfd,
57927,57928,"fb0ec52878b165aa14ae302e6064aa636f9ca11aa11f5', 'fdfd'",fdfdfd,1.64948454,20 fdfdfdfd,"
US
"
57928,57929,"f55bf599dba600550de724a0bec11166b2c470f98aa06', 'fdfdf'",fdfdfd,0.81300813,10 fdfdfdfd,"
US
"
57929,57930,"82e6b', 'reetrtrt'",trtretrtr,0.79783365,fdfdfdf,"
NL
I want to get rid of these empty lines. So far I have tried the following script:
df = pd.read_csv("scedon_etoimo.csv")
df = df.replace(r'\\n',' ', regex=True)
and
df=df.replace(r'\r\r\r\r\n\t\t\t\t\t\t', '',regex=True)
as this is the error I am getting. So far I haven't managed to clean my file and do what I want to do. I am not sure if I am using the correct approach. I am using pandas to process my dataset. Any help?
"
I would first open and preprocess the file's data, and only then pass it to pandas:
import io
import pandas as pd

lines = []
with open('file.csv') as f:
    for line in f:
        if line.strip():
            lines.append(line.strip())
df = pd.read_csv(io.StringIO("\n".join(lines)))
Based on the file snippet you provided, here is how you can replace those empty lines Pandas is storing as NaNs with a blank string.
import numpy as np
import pandas as pd

df = pd.read_csv("scedon_etoimo.csv")
df = df.replace(np.nan, "", regex=True)
This will allow you to do everything on the base Pandas DataFrame without reading through your file(s) more than once. That being said, I would also recommend preprocessing your data before loading it in, as that is oftentimes a much safer way to handle data in non-uniform layouts.
Try:
df.replace(to_replace=r'[\n\r\t]', value='', regex=True, inplace=True)
This instruction replaces each \n, \r and Tab with nothing.
Due to inplace argument, no need to substitute the result to df again.
Alternative: use to_replace=r'\s' to also eliminate spaces, perhaps in selected columns only.
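For the selected-columns variant, a minimal sketch (the column name 'country' is only assumed for illustration):
# Strip newlines, carriage returns and tabs in a single column,
# leaving the rest of the dataframe untouched.
df['country'] = df['country'].replace(r'[\n\r\t]', '', regex=True)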

pandas read csv ignore newline

I have a dataset (for compbio people out there, it's a FASTA) that is littered with newlines that don't act as a delimiter of the data.
Is there a way for pandas to ignore newlines when importing, using any of the pandas read functions?
sample data:
>ERR899297.10000174
TGTAATATTGCCTGTAGCGGGAGTTGTTGTCTCAGGATCAGCATTATATATCTCAATTGCATGAATCATCGTATTAATGC
TATCAAGATCAGCCGATTCT
every entry is delimited by ">"
data is split across newlines (nominally limited to 80 chars per line, although that limit is not actually respected everywhere)
You need another character that tells pandas when you actually want to start a new row.
Here, for example, I create a file where the new line is encoded by a pipe (|):
csv = """
col1,col2, col3, col4|
first_col_first_line,2nd_col_first_line,
3rd_col_first_line
de,4rd_col_first_line|
"""
with open("test.csv", "w") as f:
    f.writelines(csv)
Then you read it with the C engine and specify the pipe as the lineterminator:
import pandas as pd
pd.read_csv("test.csv",lineterminator="|", engine="c")
which gives me a dataframe with one row per pipe-terminated record.
This should work simply by setting skip_blank_lines=True.
skip_blank_lines : bool, default True
If True, skip over blank lines rather than interpreting as NaN values.
However, I found that I had to set this to False to work with my data that has new lines in it. Very strange, unless I'm misunderstanding.
Docs
There is no good way to do this with pandas alone.
BioPython by itself turned out to be sufficient, rather than a hybrid solution of iterating through a BioPython object and inserting into a dataframe.
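For illustration, a minimal sketch of the BioPython route (assumes biopython is installed and uses a placeholder file name reads.fasta):
from Bio import SeqIO
import pandas as pd

# Each FASTA record becomes one row; SeqIO already joins the wrapped sequence lines.
records = [(rec.id, str(rec.seq)) for rec in SeqIO.parse("reads.fasta", "fasta")]
df = pd.DataFrame(records, columns=["id", "sequence"])
print(df.head())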
Is there a way for pandas to ignore newlines when importing, using any of the pandas read functions?
Yes, just look at the doc for pd.read_table()
You want to specify a custom line terminator (>) and then handle the newline (\n) appropriately: use the first as a column delimiter with str.split(maxsplit=1), and ignore subsequent newlines with str.replace (until the next terminator):
#---- EXAMPLE DATA ---
from io import StringIO
example_file = StringIO(
"""
>ERR899297.10000174
TGTAATATTGCCTGTAGCGGGAGTTGTTGTCTCAGGATCAGCATTATATATCTCAATTGCATGAATCATCGTATTAATGC
TATCAAGATCAGCCGATTCT
; this comment should not be read into a dataframe
>ERR123456.12345678
TGTAATATTGCCTGTAGCGGGAGTTGTTGTCTCAGGATCAGCATTATATATCTCAATTGCATGAATCATCGTATTAATGC
TATCAAGATCAGCCGATTCT
; this comment should not be read into a dataframe
"""
)
#----------------------
#---- EXAMPLE CODE ---
import pandas as pd
df = pd.read_table(
    example_file,          # Your file goes here
    engine='c',            # C parser must be used to allow custom lineterminator, see doc
    lineterminator='>',    # New lines begin with ">"
    skiprows=1,            # File begins with line terminator ">", so output skips first line
    names=['raw'],         # A single column which we will split into two
    comment=';'            # comment character in FASTA format
)
# The first line break ('\n') separates Column 0 from Column 1
df[['col0','col1']] = pd.DataFrame.from_records(df.raw.apply(lambda s: s.split(maxsplit=1)))
# All subsequent line breaks (which got left in Column 1) should be ignored
df['col1'] = df['col1'].apply(lambda s: s.replace('\n',''))
print(df[['col0','col1']])
# Show that col1 no longer contains line breaks
print('\nExample sequence is:')
print(df['col1'][0])
Returns:
col0 col1
0 ERR899297.10000174 TGTAATATTGCCTGTAGCGGGAGTTGTTGTCTCAGGATCAGCATTA...
1 ERR123456.12345678 TGTAATATTGCCTGTAGCGGGAGTTGTTGTCTCAGGATCAGCATTA...
Example sequence is:
TGTAATATTGCCTGTAGCGGGAGTTGTTGTCTCAGGATCAGCATTATATATCTCAATTGCATGAATCATCGTATTAATGCTATCAAGATCAGCCGATTCT
After pd.read_csv(), you can split the values yourself with the Series.str accessor (a DataFrame has no split() method):
import pandas as pd

data = pd.read_csv("test.csv")
data.iloc[:, 0] = data.iloc[:, 0].str.split()
