Pandas python replace empty lines with string - python

I have a csv which at some point becomes like this:
57926,57927,"79961', 'dsfdfdf'",fdfdfdfd,0.40997048,5 x fdfdfdfd,
57927,57928,"fb0ec52878b165aa14ae302e6064aa636f9ca11aa11f5', 'fdfd'",fdfdfd,1.64948454,20 fdfdfdfd,"
US
"
57928,57929,"f55bf599dba600550de724a0bec11166b2c470f98aa06', 'fdfdf'",fdfdfd,0.81300813,10 fdfdfdfd,"
US
"
57929,57930,"82e6b', 'reetrtrt'",trtretrtr,0.79783365,fdfdfdf,"
NL
I want to get rid of these empty lines. So far I have tried the following script:
df = pd.read_csv("scedon_etoimo.csv")
df = df.replace(r'\\n',' ', regex=True)
and
df = df.replace(r'\r\r\r\r\n\t\t\t\t\t\t', '', regex=True)
as this is the error I am getting. So far I haven't managed to clean my file and do the stuff I want to do. I am not sure if I am using the correct approach. I am using pandas to process my dataset. Any help?

I would first open and preprocess the file's data, and only then pass it to pandas:
import io
import pandas as pd

lines = []
with open('file.csv') as f:
    for line in f:
        if line.strip():                      # keep only non-empty lines
            lines.append(line.strip())
df = pd.read_csv(io.StringIO("\n".join(lines)))

Based on the file snippet you provided, here is how you can replace those empty lines Pandas is storing as NaNs with a blank string.
import numpy as np
import pandas as pd

df = pd.read_csv("scedon_etoimo.csv")
df = df.replace(np.nan, "", regex=True)
This will allow you to do everything on the base Pandas DataFrame without reading through your file(s) more than once. That being said, I would also recommend preprocessing your data before loading it, as that is often a much safer way to handle data in non-uniform layouts.

Try:
df.replace(to_replace=r'[\n\r\t]', value='', regex=True, inplace=True)
This replaces each \n, \r, and tab character with an empty string.
Because of the inplace argument, there is no need to assign the result back to df.
Alternative: use to_replace=r'\s' to eliminate spaces as well, perhaps in selected columns only; see the sketch below.
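A minimal sketch of the selected-columns variant (the column name 'country' is hypothetical):
# Hypothetical column: strip all whitespace characters from just that column
df['country'] = df['country'].replace(to_replace=r'\s', value='', regex=True)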

Related

Python3: how to count columns in an external file

I am trying to count the number of columns in external files. Here is an example of a file, data.dat. Please note that it is not a CSV file. The whitespace is made up of spaces. Each file may have a different number of spaces between the columns.
Data Z-2 C+2
m_[a/b] -155555.0 -133333.0
n_[a/b] -188800.0 -133333.0
o_[a/b*Y] -13.5 -17.95
p1_[cal/(a*c)] -0.01947 0.27
p2_[a/b] -700.2 -200.44
p3_(a*Y)/(b*c) 5.2966 6.0000
p4_[(a*Y)/b] -22222.0 -99999.0
q1_[b/(b*Y)] 9.0 -6.3206
q2_[c] -25220.0 -171917.0
r_[a/b] 1760.0 559140
s 4.0 -4.0
I experimented with split(" ") but could not figure out how to get it to treat runs of whitespace as a single separator; it counted each space as a separate column boundary.
This seems promising but my attempt only counts the first column. It may seem silly to attempt a CSV method to deal with a non-CSV file. Maybe this is where my problems are coming from. However, I have used CSV methods before to deal with text files.
For example, I import my data:
import csv

with open(data) as csvfile:
    reader = csv.DictReader(csvfile)
    n_cols = len(reader.fieldnames)
When I use this, only the first column is recognized. The code is too long to post, but I know this is happening because when I manually enter n_cols = 3, I do get the results I expect.
It does work if I use commas to delimit the columns, but I can't do that (I need to use whitespace).
Does anyone know an alternative method that deals with arbitrary whitespace and non-CSV files? Thank you for any advice.
Yes, there are alternative methods:
Pandas
import pandas as pd
df = pd.read_csv('data.dat', delim_whitespace=True)
NumPy
import numpy as np

arr = np.loadtxt('data.dat', dtype='str')
# or
arr = np.genfromtxt('data.dat', dtype='str')
Python's csv
If you want to use Python's csv library, you can normalize the whitespace before reading, e.g.:
import csv
import re

with open('data.dat') as csvfile:
    content = csvfile.read().strip()
normalized_content = re.sub(r' +', ' ', content)
reader = csv.reader(normalized_content.split('\n'), delimiter=' ')
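Whichever variant you pick, counting the columns is then a one-liner; a minimal sketch, reusing the names from the snippets above:
n_cols = len(df.columns)      # pandas: number of parsed columns
n_cols = arr.shape[1]         # NumPy: second dimension of the loaded array
n_cols = len(next(reader))    # csv: length of the first parsed row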

Can't read .txt file with pandas because it's in a weird shape

I have a data set that contains information from an experiment about particles. You can find it here (hope links are OK; if not, let me know and I'll remove them immediately):
http://archive.ics.uci.edu/ml/datasets/MiniBooNE+particle+identification
Trying to read this set in pandas, I am encountering the problem of pandas reading this txt as a data frame with 130,064 lines, which is correct, but only 1 column. If you check the txt file in the link, you will see that it is in a weird shape, with spaces at the beginning and then 2 spaces between each column.
I tried the command
df = pd.read_csv("path/file.txt", header = None)
and also
df = pd.read_csv("path/file.txt", sep = " ", header = None)
where I set 2 spaces as the separator. Nothing works. The file also, in the 1st line, has 2 numbers that just represent the number of rows, which I deleted. For someone who can't or doesn't want to open the link or the data set: the original post included a picture of some of the columns (just a portion, not the whole data). On the leftmost side, there are 2 spaces between the edge of the window and the first column, as I said. When reading it using pandas, everything ends up in that single column.
Any advice/help would be appreciated. Thanks
EDIT
I tried doing the following and I think it worked. First I imported the .txt file using NumPy, after deleting the first row of the file, which contains the two irrelevant numbers.
import numpy as np

df1 = np.loadtxt("path/file.txt")
This, for some reason, worked and the resulting array was correct. Then I converted this array to a data frame using the commands
df = pd.DataFrame(df1)
df.columns = ['X' + str(x) for x in range(50)]
And yeah, I think it works (the original post showed a screenshot of the resulting dataframe). I think it's correct, but if you guys find something wrong, let me know.
Edited
columns = ['Obs' + str(i) for i in range(1, 51)]   # Obs1 ... Obs50
# names= is the correct keyword here; read_csv has no columns= parameter
df = pd.read_csv("path/file.txt", sep=r'\s+', names=columns, skiprows=1)
You could try creating the dataframe from lists instead of the txt file, something like the following:
import pandas as pd

# Put all the lines in a list
with open("dataset.txt") as fp:
    lines = fp.read()
data = lines.split('\n')

df_data = []
for item in data:
    df_data.append(item.split())  # split() with no argument handles 1 space or 2 between values
# df_data should be something like [[row1col1, row1col2, row1col3], [row2col1, row2col2, row2col3]]

# List of lists to dataframe
df = pd.DataFrame(df_data)
Doing this from memory, so watch out for syntax; hope this helps!

Parsing Dirty Text File with Pandas Header Issue

I am trying to parse a text file created back in '99 that is slightly difficult to deal with. The headers are in the first row and are delimited by '^' (the entire file is ^-delimited). The issue is that there are characters that appear to be thrown in; long runs of spaces, for example, appear to separate the headers from the rest of the data points in the file. (An example file is located at https://www.chicagofed.org/applications/bhc/bhc-home ; my example references Q3 1999.)
Issues:
1) Too many headers to manually create them and I need to do this for many files that may have new headers as we move forward or backwards throughout the time series
2) I need to recreate the headers from the file and then remove them so that I don't pollute my entire first row with header duplicates. I realize I could probably slice the dataframe [1:] after the fact and just get rid of it, but that's sloppy and I'm sure there's a better way.
3) The unreported fields by company appear to show up as "^^^^^^^^^", which is fine, but will pandas automatically populate NaNs in that scenario?
My attempt below simply tries to isolate the headers, but I'm really stuck on the larger issue of the way the text file is structured. Any recommendations or obvious easy tricks I'm missing?
from zipfile import ZipFile
import pandas as pd

def main():
    # Driver
    FILENAME_PREFIX = 'bhcf'
    FILE_TYPE = '.txt'
    field_headers = []
    with ZipFile('reg_data.zip', 'r') as zip:
        with zip.open(FILENAME_PREFIX + '9909' + FILE_TYPE) as qtr_file:
            headers_df = pd.read_csv(qtr_file, sep='^', header=None)
            headers_df = headers_df[:1]
            headers_array = headers_df.values[0]
            parsed_data = pd.read_csv(qtr_file, sep='^', header=headers_array)
I tried with the file you linked and one I downloaded, I think from 2015:
import pandas as pd
df = pd.read_csv('bhcf9909.txt', sep='^')
first_headers = df.columns.tolist()
df_more_actual = pd.read_csv('bhcf1506.txt', sep='^')
second_headers = df_more_actual.columns.tolist()
print(df.shape)
print(df_more_actual.shape)
# df_more_actual has more columns than first one
# Normalize column names to avoid duplicate columns
df.columns = df.columns.str.upper()
df_more_actual.columns = df_more_actual.columns.str.upper()
new_df = df.append(df_more_actual)
print(new_df.shape)
The final dataframe has the rows of both CSVs and the union of their columns.
You can do this for the CSV of each quarter, appending as you go (see the sketch below), so that you end up with all the rows and the union of the columns.
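A minimal sketch of that loop, assuming the bhcfYYMM file naming used above (the quarter codes in the list are illustrative) and using pd.concat, which supersedes the deprecated DataFrame.append:
import pandas as pd

quarters = ['9909', '9912', '0003']          # illustrative quarter codes
frames = []
for q in quarters:
    df_q = pd.read_csv('bhcf' + q + '.txt', sep='^')
    df_q.columns = df_q.columns.str.upper()  # normalize names to avoid duplicate columns
    frames.append(df_q)
all_quarters = pd.concat(frames, ignore_index=True, sort=False)  # all rows, union of columns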

Reading bad csv files with garbage values

I wish to read a csv file which has the following format using pandas:
atrrth
sfkjbgksjg
airuqghlerig
Name Roll
airuqgorqowi
awlrkgjabgwl
AAA 67
BBB 55
CCC 07
As you can see, if I use pd.read_csv, I get the fairly obvious error:
ParserError: Error tokenizing data. C error: Expected 1 fields in line 4, saw 2
But I wish to get the entire data into a dataframe. Using error_bad_lines = False will remove the important stuff and leave only the garbage values.
These are the possible column names, as given below:
Name : [Name , NAME , Name of student]
Roll : [Rollno , Roll , ROLL]
How to achieve this?
Open the csv file and find the row where the column names start:
with open(r'data.csv') as fp:
    skip = next(filter(
        lambda x: x[1].startswith(('Name', 'NAME')),
        enumerate(fp)
    ))[0]
The row index will be stored in the skip variable.
import pandas as pd
df = pd.read_csv('data.csv', skiprows=skip)
Works in Python 3.X
I would like to suggest a slight modification/simplification to @RahulAgarwal's answer. Rather than closing and re-opening the file, you can continue loading the same stream directly into pandas. Instead of recording the number of rows to skip, you can record the header line and split it manually to provide the column names:
with open(r'data.csv') as fp:
    names = next(line for line in fp if line.casefold().lstrip().startswith('name'))
    df = pd.read_csv(fp, names=names.strip().split())
This has an advantage for files with large numbers of trash lines.
A more detailed check could be something like this:
def isheader(line):
    items = line.strip().split()
    if len(items) != 2:
        return False
    items = sorted(map(str.casefold, items))
    return items[0].startswith('name') and items[1].startswith('roll')
This function will handle all your possibilities, in any order, while also skipping trash lines that contain spaces. You would use it as a filter:
names = next(line for line in fp if isheader(line))
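Putting it together, a minimal sketch (same data.csv as above):
import pandas as pd

with open(r'data.csv') as fp:
    names = next(line for line in fp if isheader(line))
    df = pd.read_csv(fp, names=names.strip().split())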
If that's indeed the structure (and not just an example of what sort of garbage one can get), you can simply use the skiprows argument to indicate how many lines should be skipped. In other words, you should read your dataframe like this:
import pandas as pd
df = pd.read_csv('your.csv', skiprows=3)
Mind that skiprows can do much more. Check the docs.
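For example, besides an integer, skiprows also accepts a list of row indices or a callable; a couple of variants (same hypothetical your.csv):
# skip specific rows by 0-based index
df = pd.read_csv('your.csv', skiprows=[0, 1, 2])
# or skip rows via a predicate on the row index
df = pd.read_csv('your.csv', skiprows=lambda i: i < 3)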

pandas read csv ignore newline

I have a dataset (for compbio people out there, it's a FASTA) that is littered with newlines that don't act as a delimiter of the data.
Is there a way for pandas to ignore newlines when importing, using any of the pandas read functions?
sample data:
>ERR899297.10000174
TGTAATATTGCCTGTAGCGGGAGTTGTTGTCTCAGGATCAGCATTATATATCTCAATTGCATGAATCATCGTATTAATGC
TATCAAGATCAGCCGATTCT
Every entry is delimited by the ">".
The data is split across newlines (nominally limited to 80 chars per line, though this is not actually respected everywhere).
You need another character that tells pandas when you actually want to start a new row.
Here, for example, I create a file where the row break is encoded by a pipe (|):
csv = """
col1,col2, col3, col4|
first_col_first_line,2nd_col_first_line,
3rd_col_first_line
de,4rd_col_first_line|
"""
with open("test.csv", "w") as f:
f.writelines(csv)
Then you read it with the C engine, specifying the pipe as the lineterminator:
import pandas as pd
pd.read_csv("test.csv",lineterminator="|", engine="c")
which gives me :
This should work simply by setting skip_blank_lines=True.
skip_blank_lines : bool, default True
If True, skip over blank lines rather than interpreting as NaN values.
However, I found that I had to set this to False to work with my data that has newlines in it. Very strange, unless I'm misunderstanding. See the pandas read_csv docs.
There is no good way to do this with the pandas readers alone.
BioPython by itself seems to be sufficient, and preferable to a hybrid solution that iterates through a BioPython object and inserts the records into a dataframe.
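For reference, a minimal Biopython sketch (assumes the biopython package is installed; the filename is hypothetical):
from Bio import SeqIO

# SeqIO reassembles each record's sequence across the wrapped lines automatically
records = [(rec.id, str(rec.seq)) for rec in SeqIO.parse('reads.fasta', 'fasta')]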
Is there a way for pandas to ignore newlines when importing, using any of the pandas read functions?
Yes, just look at the doc for pd.read_table()
You want to specify a custom line terminator (>) and then handle the newline (\n) appropriately: use the first newline as a column delimiter with str.split(maxsplit=1), and strip the subsequent newlines (up to the next terminator) with str.replace:
#---- EXAMPLE DATA ---
from io import StringIO
example_file = StringIO(
"""
>ERR899297.10000174
TGTAATATTGCCTGTAGCGGGAGTTGTTGTCTCAGGATCAGCATTATATATCTCAATTGCATGAATCATCGTATTAATGC
TATCAAGATCAGCCGATTCT
; this comment should not be read into a dataframe
>ERR123456.12345678
TGTAATATTGCCTGTAGCGGGAGTTGTTGTCTCAGGATCAGCATTATATATCTCAATTGCATGAATCATCGTATTAATGC
TATCAAGATCAGCCGATTCT
; this comment should not be read into a dataframe
"""
)
#----------------------
#---- EXAMPLE CODE ---
import pandas as pd
df = pd.read_table(
    example_file,         # Your file goes here
    engine='c',           # C parser must be used to allow custom lineterminator, see doc
    lineterminator='>',   # New lines begin with ">"
    skiprows=1,           # File begins with line terminator ">", so output skips first line
    names=['raw'],        # A single column which we will split into two
    comment=';'           # comment character in FASTA format
)
# The first line break ('\n') separates Column 0 from Column 1
df[['col0','col1']] = pd.DataFrame.from_records(df.raw.apply(lambda s: s.split(maxsplit=1)))
# All subsequent line breaks (which got left in Column 1) should be ignored
df['col1'] = df['col1'].apply(lambda s: s.replace('\n',''))
print(df[['col0','col1']])
# Show that col1 no longer contains line breaks
print('\nExample sequence is:')
print(df['col1'][0])
Returns:
col0 col1
0 ERR899297.10000174 TGTAATATTGCCTGTAGCGGGAGTTGTTGTCTCAGGATCAGCATTA...
1 ERR123456.12345678 TGTAATATTGCCTGTAGCGGGAGTTGTTGTCTCAGGATCAGCATTA...
Example sequence is:
TGTAATATTGCCTGTAGCGGGAGTTGTTGTCTCAGGATCAGCATTATATATCTCAATTGCATGAATCATCGTATTAATGCTATCAAGATCAGCCGATTCT
After pd.read_csv(), you can split a column on whitespace with the .str.split() accessor (shown here on a hypothetical column 'seq'):
import pandas as pd

data = pd.read_csv("test.csv")
data['seq'] = data['seq'].str.split()
