Remove non-ASCII characters from DataFrame column headers - python

I have exported a comma-separated value file from an MSQL database (.rpt file ending). It only has two columns and 8 rows. Looking at the file in Notepad, everything looks OK. I tried to load the data into a pandas data frame using the code below:
import pandas as pd

with open('file.csv', 'r') as csvfile:
    df_data = pd.read_csv(csvfile, sep=',', encoding='utf-8')
print(df_data)
When printing to the console, the first column header name is wrong, with some extra characters at the start of column 1. I get no errors, but the first column name is obviously decoded wrongly in my code.
Anyone have any ideas on how to get this right?

Here's one possible option: Fix those headers after loading them in:
df.columns = [x.encode('utf-8').decode('ascii', 'ignore') for x in df.columns]
The str.encode call followed by the decode('ascii', 'ignore') call will drop those special characters, leaving only the ones in the ASCII range behind:
>>> '\ufeffaSA'.encode('utf-8').decode('ascii', 'ignore')
'aSA'
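If those stray characters are a UTF-8 byte-order mark (a common artifact of Windows CSV exports), an alternative is to strip it at read time; a minimal sketch, assuming the BOM is the only problem:

import pandas as pd

# 'utf-8-sig' decodes UTF-8 and silently strips a leading byte-order mark
df_data = pd.read_csv('file.csv', encoding='utf-8-sig')
print(df_data.columns)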

Related

Python Pandas - Read data rows and non text in quotes from csv file

I am having an issue trying to read a csv file with pandas, as the data are within quotes and whitespace is present.
The header row in csv file is "Serial No,First Name,Last Name,Country".
Example data of each row is "1 ,""David, T "",""Barnes "",""USA """.
Below is the code I have tried thus far, trying to remove the quotes and read the text that is within two quotes.
import pandas as pd
import csv

df = pd.read_csv('file1.csv', sep=',', encoding='ansi', quotechar='"', quoting=csv.QUOTE_NONNUMERIC, doublequote=True, engine='python')
Is there a way to pre-process the file so that the result is as follows?
Serial No, First Name, Last Name, Country
1, David,T, Barnes, USA
Try using this:
import csv
import pandas as pd

file1 = pd.read_csv('sample.txt', sep=r',\s+', skipinitialspace=True, quoting=csv.QUOTE_ALL, engine='python')
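If the file then loads but stray spaces remain inside the values, a small follow-up sketch (a hypothetical cleanup, assuming every text column should be trimmed):

file1 = file1.apply(lambda col: col.str.strip() if col.dtype == object else col)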
Closing this, as I am using Editpad to replace the commas and remove the quotes as a workaround.

How can I fix "Error tokenizing data" on pandas csv reader?

I'm trying to read a csv file with pandas.
This file actually has only one row but it causes an error whenever I try to read it.
Something seems to go wrong at line 8, but I can hardly find an 8th line, since there is clearly only one row in the file.
I do like:
import codecs
import pandas as pd

with codecs.open("path_to_file", "rU", "Shift-JIS", "ignore") as file:
    df = pd.read_csv(file, header=None, sep="\t")
df
Then I get:
ParserError: Error tokenizing data. C error: Expected 1 fields in line 8, saw 3
I don't get what's really going on, so any of your advice will be appreciated.
I struggled with this for almost half a day. I opened the csv with Notepad, noticed that the separator is a tab, not a comma, and then tried the combination below.
df = pd.read_csv('C:\\myfile.csv', sep='\t', lineterminator='\r')
Try df = pd.read_csv(file, header=None, error_bad_lines=False)
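Note that error_bad_lines was deprecated in pandas 1.3 and removed in 2.0; on recent versions the equivalent (assuming you want to drop the malformed rows) is:

df = pd.read_csv(file, header=None, on_bad_lines='skip')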
The existing answer will not include these additional lines in your dataframe. If you'd like your dataframe to be as wide as its widest row, you can use the following:
delimiter = ','
# The widest row has this many delimiters; the column count is one more
with open(path_name, 'r') as f:
    max_columns = max(line.count(delimiter) for line in f) + 1
df = pd.read_csv(path_name, header=None, skiprows=1, names=list(range(max_columns)))
Set skiprows=1 if there's actually a header; you can always retrieve the header column names later.
You can also identify rows that have more columns populated than the number of column names in the original header, as sketched below.
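A hypothetical sketch of that check, reusing delimiter and path_name from above and assuming the original header is the file's first line:

with open(path_name, 'r') as f:
    header_width = len(f.readline().split(delimiter))
# Rows holding values beyond the header's width were wider than the header
wide_rows = df[df[df.columns[header_width:]].notna().any(axis=1)]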

Pandas python replace empty lines with string

I have a csv which at some point becomes like this:
57926,57927,"79961', 'dsfdfdf'",fdfdfdfd,0.40997048,5 x fdfdfdfd,
57927,57928,"fb0ec52878b165aa14ae302e6064aa636f9ca11aa11f5', 'fdfd'",fdfdfd,1.64948454,20 fdfdfdfd,"
US
"
57928,57929,"f55bf599dba600550de724a0bec11166b2c470f98aa06', 'fdfdf'",fdfdfd,0.81300813,10 fdfdfdfd,"
US
"
57929,57930,"82e6b', 'reetrtrt'",trtretrtr,0.79783365,fdfdfdf,"
NL
"
I want to get rid of these empty lines. So far I have tried the following script:
df = pd.read_csv("scedon_etoimo.csv")
df = df.replace(r'\\n',' ', regex=True)
and
df=df.replace(r'\r\r\r\r\n\t\t\t\t\t\t', '',regex=True)
as this is the error I am getting. So far I haven't managed to clean my file and do the stuff I want to do. I am not sure if I am using the correct approach. I am using pandas to process my dataset. Any help?
"
I would first open and preprocess the file's data, and only then pass it to pandas:
import io
import pandas as pd

lines = []
with open('file.csv') as f:
    for line in f:
        if line.strip():
            lines.append(line.strip())
df = pd.read_csv(io.StringIO("\n".join(lines)))
Based on the file snippet you provided, here is how you can replace those empty lines Pandas is storing as NaNs with a blank string.
import numpy as np
import pandas as pd

df = pd.read_csv("scedon_etoimo.csv")
df = df.replace(np.nan, "", regex=True)
This will allow you to do everything on the base pandas DataFrame without reading through your file(s) more than once. That being said, I would also recommend preprocessing your data before loading it, as that is often a much safer way to handle data in non-uniform layouts.
Try:
df.replace(to_replace=r'[\n\r\t]', value='', regex=True, inplace=True)
This instruction replaces each \n, \r and tab with an empty string. Because of the inplace argument, there is no need to assign the result back to df.
Alternative: use to_replace=r'\s' to eliminate spaces as well, perhaps in selected columns only, as sketched below.
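A hypothetical sketch of the per-column variant (the column names here are placeholders):

cols = ['col_a', 'col_b']  # columns to clean
df[cols] = df[cols].replace(to_replace=r'\s', value='', regex=True)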

How to read pandas CSV file with comma separator and comma thousand separator

I have a csv file with rows that look like this:
87.89,"2,392.05",14.77,373.2 (the quoted value uses a comma as thousands separator)
pandas keeps treating the comma inside that quoted column as a field separator and showing an "Error tokenizing data" error.
Is there a way in pandas to ignore commas between double quotes?
Thanks.
Sample rows:
9999992613813558569,87.89,"2,392.05",14.77,373.2
9999987064038821584,95.11,"3,397.04",42.15,"1,461.14"
9999956300203713283,6.67,194.02,41.23,"1,105.45"
9999946809576027532,15.08,353.84,29.43,591.9
Edit:
I already tried:
read_csv(file, quotechar='"', encoding='latin1', thousands=',')
read_csv(file, quotechar='"', encoding='latin1', escapechar='"')
Try reading it with:
pd.read_csv(myfile, encoding='latin1', quotechar='"')
Each column that contains these quoted values will be treated as type object. Once you have that, to get back to float use:
df = df.apply(lambda x: pd.to_numeric(x.astype(str).str.replace(',', ''), errors='coerce'))
Alternatively you can try:
pd.read_csv(myfile, encoding='latin1', quotechar='"', error_bad_lines=False)
Here you can see what was omitted from the original csv, i.e. what caused the problem: for each omitted line you'll receive a warning instead of an error.
This worked for me:
pd.read_csv(myfile, encoding='latin1', quotechar='"', thousands=',')
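For reference, a self-contained sketch built from the sample rows in the question, showing that thousands=',' parses the quoted values as plain floats:

from io import StringIO
import pandas as pd

sample = StringIO(
    '9999992613813558569,87.89,"2,392.05",14.77,373.2\n'
    '9999987064038821584,95.11,"3,397.04",42.15,"1,461.14"\n'
)
df = pd.read_csv(sample, header=None, quotechar='"', thousands=',')
print(df.iloc[0, 2])  # 2392.05, a float, with the thousands comma removed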

pandas read csv ignore newline

I have a dataset (for the compbio people out there, it's a FASTA) that is littered with newlines that don't act as a delimiter of the data.
Is there a way for pandas to ignore newlines when importing, using any of the pandas read functions?
sample data:
>ERR899297.10000174
TGTAATATTGCCTGTAGCGGGAGTTGTTGTCTCAGGATCAGCATTATATATCTCAATTGCATGAATCATCGTATTAATGC
TATCAAGATCAGCCGATTCT
every entry is delimited by the ">"
the data is split by newlines (nominally limited to 80 chars per line, though that limit is not universally respected)
You need another marker that tells pandas when you actually want to start a new row.
Here, for example, I create a file where the new line is encoded by a pipe (|):
csv = """
col1,col2, col3, col4|
first_col_first_line,2nd_col_first_line,
3rd_col_first_line
de,4rd_col_first_line|
"""
with open("test.csv", "w") as f:
f.writelines(csv)
Then you read it with the C engine and specify the pipe as the lineterminator:
import pandas as pd
pd.read_csv("test.csv", lineterminator="|", engine="c")
which gives me a single data row with the four columns; the embedded newlines stay inside the field values.
This should work simply by setting skip_blank_lines=True.
skip_blank_lines : bool, default True
If True, skip over blank lines rather than interpreting as NaN values.
However, I found that I had to set this to False to work with my data that has new lines in it. Very strange, unless I'm misunderstanding.
Docs
There is no good way to do this. Biopython alone seems to be sufficient, compared with a hybrid solution that iterates through a Biopython object and inserts the records into a dataframe.
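A minimal sketch of that approach (assuming Biopython is installed and the input file is named example.fasta):

from Bio import SeqIO
import pandas as pd

# One row per FASTA record: the ">" header id plus the sequence with newlines joined
records = [(rec.id, str(rec.seq)) for rec in SeqIO.parse("example.fasta", "fasta")]
df = pd.DataFrame(records, columns=["id", "sequence"])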
Is there a way for pandas to ignore newlines when importing, using any of the pandas read functions?
Yes, just look at the doc for pd.read_table()
You want to specify a custom line terminator (>) and then handle the newlines (\n) appropriately: use the first newline as a column delimiter with str.split(maxsplit=1), and ignore the subsequent newlines with str.replace (up to the next terminator):
#---- EXAMPLE DATA ---
from io import StringIO

example_file = StringIO(
"""
>ERR899297.10000174
TGTAATATTGCCTGTAGCGGGAGTTGTTGTCTCAGGATCAGCATTATATATCTCAATTGCATGAATCATCGTATTAATGC
TATCAAGATCAGCCGATTCT
; this comment should not be read into a dataframe
>ERR123456.12345678
TGTAATATTGCCTGTAGCGGGAGTTGTTGTCTCAGGATCAGCATTATATATCTCAATTGCATGAATCATCGTATTAATGC
TATCAAGATCAGCCGATTCT
; this comment should not be read into a dataframe
"""
)
#----------------------
#---- EXAMPLE CODE ---
import pandas as pd

df = pd.read_table(
    example_file,        # Your file goes here
    engine='c',          # C parser must be used to allow a custom lineterminator, see doc
    lineterminator='>',  # New records begin with ">"
    skiprows=1,          # File begins with the terminator ">", so skip the empty first record
    names=['raw'],       # A single column which we will split into two
    comment=';',         # Comment character in FASTA format
)

# The first line break ('\n') separates Column 0 from Column 1
df[['col0', 'col1']] = pd.DataFrame.from_records(df.raw.apply(lambda s: s.split(maxsplit=1)))

# All subsequent line breaks (which got left in Column 1) should be ignored
df['col1'] = df['col1'].apply(lambda s: s.replace('\n', ''))

print(df[['col0', 'col1']])

# Show that col1 no longer contains line breaks
print('\nExample sequence is:')
print(df['col1'][0])
Returns:
                 col0                                               col1
0  ERR899297.10000174  TGTAATATTGCCTGTAGCGGGAGTTGTTGTCTCAGGATCAGCATTA...
1  ERR123456.12345678  TGTAATATTGCCTGTAGCGGGAGTTGTTGTCTCAGGATCAGCATTA...
Example sequence is:
TGTAATATTGCCTGTAGCGGGAGTTGTTGTCTCAGGATCAGCATTATATATCTCAATTGCATGAATCATCGTATTAATGCTATCAAGATCAGCCGATTCT
After pd.read_csv(), you can use the .str.split() string accessor on the relevant column (a DataFrame itself has no .split() method):
import pandas as pd

data = pd.read_csv("test.csv")
# hypothetical column name; each value is split on whitespace, including newlines
data["col"] = data["col"].str.split()
