pandas read csv ignore newline - python

I have a dataset (for the compbio people out there, it's a FASTA file) that is littered with newlines that don't act as delimiters of the data.
Is there a way for pandas to ignore newlines when importing, using any of the pandas read functions?
sample data:
>ERR899297.10000174
TGTAATATTGCCTGTAGCGGGAGTTGTTGTCTCAGGATCAGCATTATATATCTCAATTGCATGAATCATCGTATTAATGC
TATCAAGATCAGCCGATTCT
Every entry is delimited by ">".
The sequence data is split across newlines (nominally wrapped at 80 characters per line, though that convention is not universally respected).

You need another character that tells pandas when you actually want to start a new row.
Here, for example, I create a file where the newline is encoded by a pipe (|):
csv = """
col1,col2, col3, col4|
first_col_first_line,2nd_col_first_line,
3rd_col_first_line
de,4rd_col_first_line|
"""
with open("test.csv", "w") as f:
    f.writelines(csv)
Then you read it with the C engine and specify the pipe as the lineterminator:
import pandas as pd
pd.read_csv("test.csv", lineterminator="|", engine="c")
which gives me:

This should work simply by setting skip_blank_lines=True.
skip_blank_lines : bool, default True
If True, skip over blank lines rather than interpreting as NaN values.
However, I found that I had to set this to False to work with my data that has new lines in it. Very strange, unless I'm misunderstanding.
Docs

There is no good way to do this in pandas alone.
BioPython by itself seems to be sufficient, and preferable to a hybrid solution that iterates through a BioPython parse object and inserts the records into a dataframe.
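For the parsing step itself, here is a minimal stdlib-only sketch (an illustrative helper, not BioPython's API), assuming a well-formed FASTA where headers start with ">" and ";" marks comment lines:

```python
def parse_fasta(text):
    """Yield (header, sequence) pairs, joining wrapped sequence lines.

    Illustrative helper, not part of pandas or BioPython."""
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith(';'):  # skip blanks and comments
            continue
        if line.startswith('>'):
            if header is not None:
                yield header, ''.join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        yield header, ''.join(chunks)

records = list(parse_fasta(">ERR899297.10000174\nTGTAATATTGCC\nTATCAAGA\n"))
print(records)  # [('ERR899297.10000174', 'TGTAATATTGCCTATCAAGA')]
```

Each (header, sequence) pair can then be collected into a list and passed to pd.DataFrame, which is the hybrid approach described above.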

Is there a way for pandas to ignore newlines when importing, using any of the pandas read functions?
Yes, just look at the doc for pd.read_table()
You want to specify a custom line terminator (>) and then handle the newline (\n) appropriately: use the first as a column delimiter with str.split(maxsplit=1), and ignore subsequent newlines with str.replace (until the next terminator):
#---- EXAMPLE DATA ---
from io import StringIO
example_file = StringIO(
"""
>ERR899297.10000174
TGTAATATTGCCTGTAGCGGGAGTTGTTGTCTCAGGATCAGCATTATATATCTCAATTGCATGAATCATCGTATTAATGC
TATCAAGATCAGCCGATTCT
; this comment should not be read into a dataframe
>ERR123456.12345678
TGTAATATTGCCTGTAGCGGGAGTTGTTGTCTCAGGATCAGCATTATATATCTCAATTGCATGAATCATCGTATTAATGC
TATCAAGATCAGCCGATTCT
; this comment should not be read into a dataframe
"""
)
#----------------------
#---- EXAMPLE CODE ---
import pandas as pd
df = pd.read_table(
    example_file,          # Your file goes here
    engine='c',            # C parser must be used to allow custom lineterminator, see doc
    lineterminator='>',    # New lines begin with ">"
    skiprows=1,            # File begins with line terminator ">", so output skips first line
    names=['raw'],         # A single column which we will split into two
    comment=';'            # Comment character in FASTA format
)
# The first line break ('\n') separates Column 0 from Column 1
df[['col0','col1']] = pd.DataFrame.from_records(df.raw.apply(lambda s: s.split(maxsplit=1)))
# All subsequent line breaks (which got left in Column 1) should be ignored
df['col1'] = df['col1'].apply(lambda s: s.replace('\n',''))
print(df[['col0','col1']])
# Show that col1 no longer contains line breaks
print('\nExample sequence is:')
print(df['col1'][0])
Returns:
                 col0                                               col1
0  ERR899297.10000174  TGTAATATTGCCTGTAGCGGGAGTTGTTGTCTCAGGATCAGCATTA...
1  ERR123456.12345678  TGTAATATTGCCTGTAGCGGGAGTTGTTGTCTCAGGATCAGCATTA...
Example sequence is:
TGTAATATTGCCTGTAGCGGGAGTTGTTGTCTCAGGATCAGCATTATATATCTCAATTGCATGAATCATCGTATTAATGCTATCAAGATCAGCCGATTCT

After pd.read_csv(), you can split the strings in a column with the .str.split() accessor (a DataFrame has no split() method of its own).
import pandas as pd
data = pd.read_csv("test.csv")
data["col1"].str.split()  # "col1" is a placeholder for your column name

Related

Python3: how to count columns in an external file

I am trying to count the number of columns in external files. Here is an example of a file, data.dat. Please note that it is not a CSV file. The whitespace is made up of spaces. Each file may have a different number of spaces between the columns.
Data Z-2 C+2
m_[a/b] -155555.0 -133333.0
n_[a/b] -188800.0 -133333.0
o_[a/b*Y] -13.5 -17.95
p1_[cal/(a*c)] -0.01947 0.27
p2_[a/b] -700.2 -200.44
p3_(a*Y)/(b*c) 5.2966 6.0000
p4_[(a*Y)/b] -22222.0 -99999.0
q1_[b/(b*Y)] 9.0 -6.3206
q2_[c] -25220.0 -171917.0
r_[a/b] 1760.0 559140
s 4.0 -4.0
I experimented with split(" ") but could not figure out how to get it to recognize multiple whitespaces; it counted each whitespace as a separate column.
This seems promising but my attempt only counts the first column. It may seem silly to attempt a CSV method to deal with a non-CSV file. Maybe this is where my problems are coming from. However, I have used CSV methods before to deal with text files.
For example, I import my data:
with open(data) as csvfile:
    reader = csv.DictReader(csvfile)
    n_cols = len(reader.fieldnames)
When I use this, only the first column is recognized. The code is too long to post but I know this is happening because when I manually enter n_cols = 3, I do get the results I expect.
It does work if I use commas to delimit the columns, but I can't do that (I need to use whitespace).
Does anyone know an alternative method that deals with arbitrary whitespace and non-CSV files? Thank you for any advice.
Yes, there are alternative methods:
Pandas
import pandas as pd
df = pd.read_csv('data.dat', delim_whitespace=True)  # in newer pandas, use sep=r'\s+' instead
NumPy
import numpy as np
arr = np.loadtxt('data.dat', dtype='str')
# or
arr = np.genfromtxt('data.dat', dtype='str')
Python's csv
If you want to use Python's csv library, you can normalize the whitespace before reading, e.g.:
import csv
import re
with open('data.dat') as csvfile:
    content = csvfile.read().strip()
normalized_content = re.sub(r' +', r' ', content)
reader = csv.reader(normalized_content.split('\n'), delimiter=' ')
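To actually count the columns, split one line on runs of whitespace; a stdlib-only sketch, using the header line from the sample data in the question:

```python
import re

# First line of the sample file; the runs of spaces vary in width
header = "Data          Z-2       C+2"
n_cols = len(re.split(r"\s+", header.strip()))
print(n_cols)  # 3
```

With the pandas or NumPy variants above, the equivalent count is df.shape[1] or arr.shape[1].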

Ignore delimiters between parentheses when reading CSV

I am reading in a csv file delimited by | but some of the data includes extra | characters. When this occurs it appears to be only between two parentheses (example below). I want to be able to read in the data into a dataframe without the columns being messed up (or failing) due to these extra | characters.
I've been trying to find a way to either
set the pandas read_csv delimiter to ignore delimiters between parentheses ()
or
parse over the csv file before loading it to a dataframe and remove any | characters between parentheses ()
I haven't had any luck so far. This is sample data that messes up when I attempt to pull it into a dataframe.
1|1|1|1|1|1|1|1||1||1||||2022-01-03 20:14:51|1||1|1|1%' AND 1111=DBMS_PIPE.RECEIVE_MESSAGE(ABC(111)||ABC(111)||ABC(111)||ABC(111),1) AND '%'='||website.com|192.168.1.1|Touch Email
I am trying to ignore the | characters between the parentheses () from (ABC(111) to ABC(111),1)
This sample data occurs repeatedly throughout the data so I cant address each time this pattern occurs so I am trying to address it programmatically.
This person seems to be attempting something similar but their solution did not work for me (when changing to |)
Depending on the specifics of your input file you might succeed by applying a regular expression with this strategy:
Replace the sub string containing the brackets with a dummy sub string.
Read the csv.
Re-replace the dummy sub string with the original.
A dirty proof of concept looks as follows:
import re
import pandas as pd
test_line = """1|1|1|1|1|1|1|1||1||1||||2022-01-03 20:14:51|1||1|1|1%' AND 1111=DBMS_PIPE.RECEIVE_MESSAGE(ABC(111)||ABC(111)||ABC(111)||ABC(111),1) AND '%'='||website.com|192.168.1.1|Touch Email"""
# 1. step
dummy_string = '--REPLACED--'
pattern_brackets = re.compile(r'(\(.*\))')
pattern_replace = re.compile(dummy_string)
list_replaced = pattern_brackets.findall(test_line)
csvString = re.sub(pattern_brackets, dummy_string, test_line)
# 2. step
# Now I have to fake your csv
from io import StringIO
csvStringIO = StringIO(csvString)
df = pd.read_csv(csvStringIO, sep="|", header=None)
# 3. step - not the nicest way; just for illustration
# The critical column is the 20th.
new_col = pd.Series(
    [re.sub(pattern_replace, list_replaced[nRow], cell)
     for nRow, cell in enumerate(df.iloc[:, 20])]
)
df.loc[:, 20] = new_col
This works for the line given. Most probably you will have to adapt the recipe to the content of your input file. I hope you find your way from here.
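Alternatively, the second strategy from the question (removing | characters between parentheses before parsing) can be sketched with a simple depth counter, assuming parentheses in the data are balanced; this also handles the nested parentheses in the sample:

```python
def strip_pipes_in_parens(line):
    """Drop '|' characters that occur inside (possibly nested) parentheses."""
    out, depth = [], 0
    for ch in line:
        if ch == '(':
            depth += 1
        elif ch == ')':
            depth -= 1
        elif ch == '|' and depth > 0:
            continue  # skip pipes inside parentheses
        out.append(ch)
    return ''.join(out)

print(strip_pipes_in_parens("a|b|(ABC(111)||ABC(111),1)|c"))
# a|b|(ABC(111)ABC(111),1)|c
```

The cleaned lines can then be fed to pd.read_csv with sep="|" without the extra delimiters breaking the columns.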

In Pandas, how can I extract certain value using the key off of a dataframe imported from a csv file?

Using Pandas, I'm trying to extract value using the key but I keep failing to do so. Could you help me with this?
There's a csv file like below:
value
"{""id"":""1234"",""currency"":""USD""}"
"{""id"":""5678"",""currency"":""EUR""}"
I imported this file in Pandas and made a DataFrame out of it:
However, when I tried to extract the value using a key (e.g. df["id"]), I'm facing an error message.
I'd like to see a value 1234 or 5678 using df["id"]. Which step should I take to get it done? This may be a very basic question but I need your help. Thanks.
The csv file isn't being read in correctly.
You haven't set a delimiter; pandas can automatically detect a delimiter but hasn't done so in your case. See the read_csv documentation for more on this. Because no delimiter was detected, the pandas dataframe has a single column, value, which holds entire lines from your file as individual cells; the first entry is "{""id"":""1234"",""currency"":""USD""}". So the file doesn't have a column id, and you can't select data by id.
The data aren't formatted as a pandas df, with row titles and columns of data. One option to read in this data is to manually process each row, though there may be slicker options.
file = 'test.dat'
id_vals = []
currency = []
with open(file, 'r') as f:
    for line in f.readlines()[1:]:
        ## remove obfuscating characters
        for c in '"{}\n':
            line = line.replace(c, '')
        line = line.split(',')
        ## extract values to two lists
        id_vals.append(line[0][3:])
        currency.append(line[1][9:])
You just need to clean up the CSV file a little and you are good. Here is every step:
import re
import pandas as pd

# open your csv and read as a text string
with open('My_CSV.csv', 'r') as f:
    my_csv_text = f.read()

# remove problematic strings
find_str = ['{', '}', '"', 'id:', 'currency:', 'value']
replace_str = ''
for i in find_str:
    my_csv_text = re.sub(i, replace_str, my_csv_text)

# Create new csv file and save cleaned text
new_csv_path = './my_new_csv.csv'  # or whatever path and name you want
with open(new_csv_path, 'w') as f:
    f.write(my_csv_text)

# Create pandas dataframe
df = pd.read_csv('my_new_csv.csv', sep=',', names=['ID', 'Currency'])
print(df)
print(df)
Output df:
ID Currency
0 1234 USD
1 5678 EUR
You need to extract each row of your dataframe using json.loads() or eval().
Something like this:
import json
for _, row in df.iterrows():
    print(json.loads(row["value"])["id"])
    # OR
    print(eval(row["value"])["id"])
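For reference, here is the whole round trip sketched with only the stdlib, using the two sample rows from the question (pandas sees the same unescaped strings after read_csv):

```python
import csv
import io
import json

raw = ('value\n'
       '"{""id"":""1234"",""currency"":""USD""}"\n'
       '"{""id"":""5678"",""currency"":""EUR""}"\n')

# The csv module unescapes the doubled quotes, leaving valid JSON strings
records = [json.loads(row["value"]) for row in csv.DictReader(io.StringIO(raw))]
ids = [r["id"] for r in records]
print(ids)  # ['1234', '5678']
```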

Reading bad csv files with garbage values

I wish to read a csv file which has the following format using pandas:
atrrth
sfkjbgksjg
airuqghlerig
Name Roll
airuqgorqowi
awlrkgjabgwl
AAA 67
BBB 55
CCC 07
As you can see, if I use pd.read_csv, I get the fairly obvious error:
ParserError: Error tokenizing data. C error: Expected 1 fields in line 4, saw 2
But I wish to get the entire data into a dataframe. Using error_bad_lines = False will remove the important stuff and leave only the garbage values
These are the 2 of the possible column names as given below :
Name : [Name , NAME , Name of student]
Roll : [Rollno , Roll , ROLL]
How to achieve this?
Open the csv file and find a row from where the column name starts:
with open(r'data.csv') as fp:
    skip = next(filter(
        lambda x: x[1].startswith(('Name', 'NAME')),
        enumerate(fp)
    ))[0]
The row index will be stored in the skip variable
import pandas as pd
df = pd.read_csv('data.csv', skiprows=skip)
Works in Python 3.X
I would like to suggest a slight modification/simplification to @RahulAgarwal's answer. Rather than closing and re-opening the file, you can continue loading the same stream directly into pandas. Instead of recording the number of rows to skip, you can record the header line and split it manually to provide the column names:
with open(r'data.csv') as fp:
    names = next(line for line in fp if line.casefold().lstrip().startswith('name'))
    df = pd.read_csv(fp, names=names.strip().split())
This has an advantage for files with large numbers of trash lines.
A more detailed check could be something like this:
def isheader(line):
    items = line.strip().split()
    if len(items) != 2:
        return False
    items = sorted(map(str.casefold, items))
    return items[0].startswith('name') and items[1].startswith('roll')
This function will handle all your possibilities, in any order, while also skipping trash lines that happen to contain spaces. You would use it as a filter:
names = next(line for line in fp if isheader(line))
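For example, run against lines from the sample file in the question (the function is repeated here so the snippet stands alone):

```python
def isheader(line):
    items = line.strip().split()
    if len(items) != 2:
        return False
    items = sorted(map(str.casefold, items))
    return items[0].startswith('name') and items[1].startswith('roll')

sample = ["atrrth", "sfkjbgksjg", "Name Roll", "AAA 67"]
print([line for line in sample if isheader(line)])  # ['Name Roll']
print(isheader("Rollno NAME"))  # True: order and case don't matter
```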
If that's indeed the structure (and not just an example of what sort of garbage one can get), you can simply use skiprows argument to indicate how many lines should be skipped. In other words, you should read your dataframe like this:
import pandas as pd
df = pd.read_csv('your.csv', skiprows=3)
Mind that skiprows can do much more. Check the docs.

Pandas python replace empty lines with string

I have a csv which at some point becomes like this:
57926,57927,"79961', 'dsfdfdf'",fdfdfdfd,0.40997048,5 x fdfdfdfd,
57927,57928,"fb0ec52878b165aa14ae302e6064aa636f9ca11aa11f5', 'fdfd'",fdfdfd,1.64948454,20 fdfdfdfd,"
US
"
57928,57929,"f55bf599dba600550de724a0bec11166b2c470f98aa06', 'fdfdf'",fdfdfd,0.81300813,10 fdfdfdfd,"
US
"
57929,57930,"82e6b', 'reetrtrt'",trtretrtr,0.79783365,fdfdfdf,"
NL
I want to get rid of these empty lines. So far I have tried the following script:
df = pd.read_csv("scedon_etoimo.csv")
df = df.replace(r'\\n',' ', regex=True)
and
df=df.replace(r'\r\r\r\r\n\t\t\t\t\t\t', '',regex=True)
as this is the error I am getting. So far I haven't managed to clean my file and do the stuff I want to do. I am not sure if I am using the correct approach. I am using pandas to process my dataset. Any help?
"
I would first open and preprocess the file's data, and only then pass it to pandas:
import io
import pandas as pd

lines = []
with open('file.csv') as f:
    for line in f:
        if line.strip():
            lines.append(line.strip())
df = pd.read_csv(io.StringIO("\n".join(lines)))
Based on the file snippet you provided, here is how you can replace those empty lines Pandas is storing as NaNs with a blank string.
import numpy as np
import pandas as pd

df = pd.read_csv("scedon_etoimo.csv")
df = df.replace(np.nan, "", regex=True)
This will allow you to do everything on the base Pandas DataFrame without reading through your file(s) more than once. That being said, I would also recommend preprocessing your data before loading it in as that is often times a much safer way to handle data in non-uniform layouts.
Try:
df.replace(to_replace=r'[\n\r\t]', value='', regex=True, inplace=True)
This instruction replaces each \n, \r and Tab with nothing.
Due to inplace argument, no need to substitute the result to df again.
Alternative: use to_replace=r'\s' to also eliminate spaces,
maybe in selected columns only.
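The effect of that character class can be checked on a plain string with the stdlib re module (pandas applies the same regex to each cell):

```python
import re

cell = "\nUS\n"  # a cell value like those in the question's file
cleaned = re.sub(r"[\n\r\t]", "", cell)
print(repr(cleaned))  # 'US'
```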
