Having trouble removing headers when using pd.read_csv - python

I have a .csv file that contains column headers, displayed below. I need to suppress the column labeling when I ingest the file as a data frame.
date,color,id,zip,weight,height,locale
11/25/2013,Blue,122468,1417464,3546600,254,7
When I issue the following command:
df = pd.read_csv('c:/temp1/test_csv.csv', usecols=[4,5], names = ["zip","weight"], header = 0, nrows=10)
I get:
zip weight
0 1417464 3546600
I have tried various manipulations of header=True and header=0. If I don't use header=0, then the columns will all print out on top of the rows like so:
zip weight
height locale
0 1417464 3546600
I have tried skiprows=0 and skiprows=1, but neither removes the headers; the command just skips the specified line.
I could really use some additional insight or a solution. Thanks in advance for any assistance you can provide.
Tiberius

Using @jezrael's example, if you want to skip the header and suppress the column labeling:
import pandas as pd
import numpy as np
import io
temp=u"""date,color,id,zip,weight,height,locale
11/25/2013,Blue,122468,1417464,3546600,254,7"""
# after testing, replace io.StringIO(temp) with the filename
df = pd.read_csv(io.StringIO(temp), usecols=[4,5], header=None, skiprows=1)
print(df)
4 5
0 3546600 254
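If you also want readable labels rather than the positional 4/5 headers, one option (a sketch using the same inline data) is to supply a full names list and then select the columns by label:

```python
import io
import pandas as pd

temp = u"""date,color,id,zip,weight,height,locale
11/25/2013,Blue,122468,1417464,3546600,254,7"""

# Supply names for all seven columns, then select the two we want by label.
# skiprows=1 drops the original header line from the file.
cols = ["date", "color", "id", "zip", "weight", "height", "locale"]
df = pd.read_csv(io.StringIO(temp), names=cols,
                 usecols=["weight", "height"], skiprows=1)
print(df)
```

Because names is supplied for every column, usecols can select by label instead of position, and the frame keeps meaningful column names.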

I'm not sure I entirely understand why you want to remove the headers, but you could comment out the header line as follows as long as you don't have any other rows that begin with 'd':
>>> df = pd.read_csv('test.csv', usecols=[3,4], header=None, comment='d') # comments out lines beginning with 'date,color' . . .
>>> df
3 4
0 1417464 3546600
It would be better to comment out the line in the csv file with the crosshatch character (#) and then use the same approach (again, as long as you have not commented out any other lines with a crosshatch):
>>> df = pd.read_csv('test.csv', usecols=[3,4], header=None, comment='#') # comments out lines with #
>>> df
3 4
0 1417464 3546600

I think you are right: explicitly passing header=0 is what lets you replace the existing names. So you can change the column names to a and b:
import pandas as pd
import numpy as np
import io
temp=u"""date,color,id,zip,weight,height,locale
11/25/2013,Blue,122468,1417464,3546600,254,7"""
# after testing, replace io.StringIO(temp) with the filename
df = pd.read_csv(io.StringIO(temp), usecols=[4,5], names=["a","b"], header=0, nrows=10)
print(df)
a b
0 3546600 254
Now these columns have new names instead of weight and height.
df = pd.read_csv(io.StringIO(temp), usecols=[4,5], header=0, nrows=10)
print(df)
weight height
0 3546600 254
You can check the read_csv docs for the header parameter:
header : int, list of ints, default ‘infer’
Row number(s) to use as the column names, and the start of the data. Defaults to 0 if no names passed, otherwise None. Explicitly pass header=0 to be able to replace existing names. The header can be a list of integers that specify row locations for a multi-index on the columns E.g. [0,1,3]. Intervening rows that are not specified will be skipped (e.g. 2 in this example are skipped). Note that this parameter ignores commented lines and empty lines if skip_blank_lines=True, so header=0 denotes the first line of data rather than the first line of the file.
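A minimal sketch of that distinction, using an inline two-line CSV:

```python
import io
import pandas as pd

temp = u"""date,color,id,zip,weight,height,locale
11/25/2013,Blue,122468,1417464,3546600,254,7"""

# header=0: the first line supplies the column names, data starts on line 2.
df1 = pd.read_csv(io.StringIO(temp), header=0)

# header=None: no line is treated as names, so the header line becomes a
# data row and the columns get integer labels 0..6.
df2 = pd.read_csv(io.StringIO(temp), header=None)

print(len(df1), len(df2))  # 1 2
```

With header=None the former header line survives as the first row, which is exactly the doubled-label symptom described in the question when names and header interact unexpectedly.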

Related

Reading feature names from a csv file using pandas

I have a csv file looks like
F1 F2 F3
A1 2 4 2
A2 4 1 2
When I read the file using pandas, I see that the first column is unnamed.
import pandas as pd
df = pd.read_csv("data.csv")
features = df.columns
print( features )
Index(['Unnamed: 0', 'F1', 'F2', 'F3'])
In fact I want to get only F1, F2 and F3. I can fix that with some array manipulation. But I want to know if pandas has some builtin capabilities to do that. Any thought?
UPDATE:
Using index_col = False or None doesn't work either.
That Unnamed: 0 column appears because the index column is being read in as a feature; you can pass the index_col=[0] argument in the read call to resolve it.
This picks the first column as the index instead of a feature.
import pandas as pd
df = pd.read_csv("data.csv", index_col=[0])
features = df.columns
print( features )
Index(['F1', 'F2', 'F3'])
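A self-contained sketch of the same idea, with inline text standing in for data.csv (the leading empty field is an assumption about how the file stores the unnamed row labels):

```python
import io
import pandas as pd

# Inline stand-in for data.csv: the first field of the header row is empty,
# which is what produces the "Unnamed: 0" column on a plain read_csv.
temp = u""",F1,F2,F3
A1,2,4,2
A2,4,1,2
"""

# index_col=[0] turns that unnamed first column into the index,
# so df.columns contains only the real feature names.
df = pd.read_csv(io.StringIO(temp), index_col=[0])
print(list(df.columns))  # ['F1', 'F2', 'F3']
```

The row labels A1/A2 are then available through df.index rather than as a feature column.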

Pandas: Remove rows from the dataframe that begin with a letter and save CSV

Here is a sample CSV I'm working with
Here is my code:
import numpy as np
import pandas as pd

def deleteSearchTerm(inputFile):
    # (1) Open the file
    df = pd.read_csv(inputFile)
    # (2) Filter every row where the first letter is 's' from search term
    df = df[df['productOMS'].str.contains('^[a-z]+')]
    # REGEX to filter anything that would ^ (start with) a letter

inputFile = inputFile
deleteSearchTerm(inputFile)
What I want to do:
Anything in the column ProductOMS that begins with a letter would be a row that I don't want. So I'm trying to delete them based on a condition, and I was also trying regular expressions so I'd get a little bit more comfortable with them.
I tried to do that with:
df = df[df['productOMS'].str.contains('^[a-z]+')]
where if any of the rows starts with any lower case letter I would drop it (I think)
Please let me know if I need to add anything to my post!
Edit:
Here is a link to a copy of the file I'm working with.
https://drive.google.com/file/d/1Dsw2Ana3WVIheNT43Ad4Dv6C8AIbvAlJ/view?usp=sharing
Another Edit: Here is the dataframe I'm working with
productNum,ProductOMS,productPrice
2463448,1002623072,419.95,
2463413,1002622872,289.95,
2463430,1002622974,309.95,
2463419,1002622908,329.95,
2463434,search?searchTerm=2463434,,
2463423,1002622932,469.95,
New Edit:
Here's some updated code using an answer
import numpy as np
import pandas as pd

def deleteSearchTerm(inputFile):
    # (1) Open the file
    df = pd.read_csv(inputFile)
    print(df)
    # (2) Filter every row where the first letter is 's' from search term
    df = df[~pd.to_numeric(df['ProductOMS'], errors='coerce').isnull()]
    print(df)

inputFile = inputFile
deleteSearchTerm(inputFile)
When I run this code and print out the dataframes, it gets rid of the rows that start with 'search'. However, my CSV file is not updating.
The issue here is that you're most likely dealing with mixed data types.
If you just want numeric values you can use pd.to_numeric:
df = pd.DataFrame({'A' : [0,1,2,3,'a12351','123a6']})
df[~pd.to_numeric(df['A'],errors='coerce').isnull()]
A
0 0
1 1
2 2
3 3
But if you only want to test the first letter:
df[~df['A'].astype(str).str.contains('^[a-z]')==True]
A
0 0
1 1
2 2
3 3
5 123a6
Edit: it seems the first solution works, but you need to write the result back to your CSV. For that, use the to_csv method; I'd recommend reading "10 minutes to pandas" in the docs.
As for your function, let's edit it a little to take a source CSV file and produce an edited version; it will save the file to the same location with _edited appended. Feel free to edit/change.
import pandas as pd
from pathlib import Path

def delete_search_term(input_file, column):
    """
    Takes in a file and removes any strings from a given column.
    input_file : path to your file.
    column : column with strings that you want to remove.
    """
    file_path = Path(input_file)
    if not file_path.is_file():
        raise Exception('This file path is not valid')
    df = pd.read_csv(input_file)
    # (2) Filter out every row where the column value is not numeric
    df = df[~pd.to_numeric(df[column], errors='coerce').isnull()]
    print(f"Creating file as:\n{file_path.parent.joinpath(f'{file_path.stem}_edited.csv')}")
    return df.to_csv(file_path.parent.joinpath(f"{file_path.stem}_edited.csv"), index=False)
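A quick usage sketch of that function, restated compactly here so the snippet runs on its own (the file name, directory, and sample rows are just examples, not part of the original question):

```python
import pandas as pd
from pathlib import Path
from tempfile import TemporaryDirectory

def delete_search_term(input_file, column):
    """Drop rows whose `column` value is not numeric; save as <stem>_edited.csv."""
    file_path = Path(input_file)
    df = pd.read_csv(input_file)
    df = df[~pd.to_numeric(df[column], errors="coerce").isnull()]
    out = file_path.parent.joinpath(f"{file_path.stem}_edited.csv")
    df.to_csv(out, index=False)
    return out

with TemporaryDirectory() as tmp:
    # Write a tiny CSV shaped like the question's data.
    src = Path(tmp) / "products.csv"
    src.write_text(
        "productNum,ProductOMS,productPrice\n"
        "2463448,1002623072,419.95\n"
        "2463434,search?searchTerm=2463434,\n"
    )
    result = pd.read_csv(delete_search_term(src, "ProductOMS"))

print(result)  # only the row with a numeric ProductOMS survives
```

Returning the output path (rather than the DataFrame) makes it easy to read the edited file back and verify what was written.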
Solution:
import numpy as np
import pandas as pd

def deleteSearchTerm(inputFile):
    df = pd.read_csv(inputFile)
    print(df)
    # (2) Filter out every row where ProductOMS is not numeric
    df = df[~pd.to_numeric(df['ProductOMS'], errors='coerce').isnull()]
    print(df)
    return df.to_csv(inputFile)

inputFile = filePath
inputFile = deleteSearchTerm(inputFile)
Data from the source csv as shared at the google drive location:
'''
productNum,ProductOMS,productPrice,Unnamed: 3
2463448,1002623072,419.95,
2463413,1002622872,289.95,
2463430,1002622974,309.95,
2463419,1002622908,329.95,
2463434,search?searchTerm=2463434,,
2463423,1002622932,469.95,
'''
import pandas as pd
df = pd.read_clipboard(sep=',')
Output:
productNum ProductOMS productPrice Unnamed: 3
0 2463448 1002623072 419.95 NaN
1 2463413 1002622872 289.95 NaN
2 2463430 1002622974 309.95 NaN
3 2463419 1002622908 329.95 NaN
4 2463434 search?searchTerm=2463434 NaN NaN
5 2463423 1002622932 469.95 NaN
df1 = df.loc[df['ProductOMS'].str.isdigit()]
print(df1)
Output:
productNum ProductOMS productPrice Unnamed: 3
0 2463448 1002623072 419.95 NaN
1 2463413 1002622872 289.95 NaN
2 2463430 1002622974 309.95 NaN
3 2463419 1002622908 329.95 NaN
5 2463423 1002622932 469.95 NaN
I hope it helps you:
df = pd.read_csv(filename)
df = df[~df['ProductOMS'].str.contains('^[a-z]+')]
df.to_csv(filename)
For the most part your function is fine, but you seem to have forgotten to save the CSV, which is done by the df.to_csv() method.
Let me rewrite the code for you:
import pandas as pd

def processAndSaveCSV(filename):
    # Read the CSV file
    df = pd.read_csv(filename)
    # Retain only the rows where `ProductOMS` is numeric
    df = df[df['ProductOMS'].str.contains(r'^\d+')]
    # Save CSV file - rewrites the file
    df.to_csv(filename)
Hope this helps :)
It looks like a scope problem to me.
First we need to return df:
def deleteSearchTerm(inputFile):
    # (1) Open the file
    df = pd.read_csv(inputFile)
    print(df)
    # (2) Filter out every row where ProductOMS is not numeric
    df = df[~pd.to_numeric(df['ProductOMS'], errors='coerce').isnull()]
    print(df)
    return df
Then replace the line
deleteSearchTerm(inputFile)
with:
inputFile = deleteSearchTerm(inputFile)
Basically your function is not returning anything.
After you fix that you just need to redefine your inputFile variable to the new dataframe your function is returning.
If you already defined df earlier in your code and you're trying to manipulate it, then the function is not actually changing your existing global df variable. Instead it's making a new local variable under the same name.
To fix this we first return the local df and then re-assign the global df to the local one.
You should be able to find more information about variable scope at this link:
https://www.geeksforgeeks.org/global-local-variables-python/
It also appears you never actually update your original file.
Try adding this to the end of your code:
df.to_csv('CSV file name', index=True)
The index argument just says whether you want the row index written out as a column.
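A minimal sketch of the scope point, independent of pandas (the names here are purely illustrative):

```python
df = "original"

def rebuild():
    # This assignment creates a NEW local name `df` that shadows the global.
    df = "filtered"
    return df

rebuild()
first = df        # still "original": the call did not touch the global df

df = rebuild()    # capture the return value to actually update the global
second = df       # now "filtered"

print(first, second)  # original filtered
```

This is exactly why the fix is to return df from the function and re-assign it at the call site.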

Python script that efficiently drops columns from a CSV file

I have a csv file where the columns are separated by a tab delimiter, but the number of columns is not constant. I need to read the file up to the 5th column. (I don't want to read the whole file and then extract the columns; I would like to read line by line, for example, and skip the remaining columns.)
You can use the usecols argument in pd.read_csv to limit the columns that are read.
import pandas as pd

# test data
s = '''a,b,c
1,2,3'''
with open('a.txt', 'w') as f:
    print(s, file=f)

df1 = pd.read_csv("a.txt", usecols=range(1))
df2 = pd.read_csv("a.txt", usecols=range(2))
print(df1)
print()
print(df2)
# output
# a
#0 1
#
# a b
#0 1 2
You can also pass the column positions explicitly to usecols to read only the first five columns:
import pandas as pd
df = pd.read_csv('out122.txt', usecols=[0,1,2,3,4])

Opening csv file using python

I have a .csv file having multiple rows and columns:
chain Resid Res K B C Tr Kw Bw Cw Tw
A 1 ASP 1 0 0.000104504 NA 0 0 0.100087974 0.573972285
A 2 GLU 2 627 0.000111832 0 0.033974309 0.004533331 0.107822844 0.441666022
Whenever I open the file using pandas or using with open, it shows that there are only column and multiple rows:
629 rows x 1 columns
Here is the code im using:
data= pd.read_csv("F:/file.csv", sep='\t')
print(data)
and the result I'm getting is this:
A,1,ASP,1,0,0.0001045041279130...
I want the output to be in a dataframe form so that I can carry out future calculations. Any help will be highly appreciated
The separator is ,, so it is possible to omit the sep parameter, because sep=',' is the default separator in read_csv:
data= pd.read_csv("F:/file.csv")
you can read the csv using the following code snippet
import pandas as pd
data = pd.read_csv('F:/file.csv', sep=',')
Don't use '\t', because the file does not actually contain tabs between the values, so use the below:
data = pd.read_csv("F:/file.csv")
Or if really needed, use:
data = pd.read_csv("F:/file.csv", sep=r'\s{2,}', engine='python')
if your data values themselves contain single spaces.
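A sketch of that last variant on inline data shaped like the question's table (only a subset of the columns, for brevity):

```python
import io
import pandas as pd

# Columns separated by runs of 2+ spaces; single spaces inside a value
# would survive this split, which is the point of \s{2,}.
temp = u"""chain  Resid  Res  K
A      1      ASP  1
A      2      GLU  2
"""

# The regex separator requires the python engine.
df = pd.read_csv(io.StringIO(temp), sep=r"\s{2,}", engine="python")
print(list(df.columns))  # ['chain', 'Resid', 'Res', 'K']
```

The raw string (r"\s{2,}") avoids the invalid-escape warning that a plain '\s' string triggers on recent Python versions.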

How to skip text being used as column heading using python

I am importing a web log text file into Python using Pandas. Python is reading the headers, but it has used the text "Fields:" as a header and has then added another column of blanks (NaNs) at the end. How can I stop this text being used as a column heading?
here is my code:
arr = pd.read_table("path", skiprows=3, delim_whitespace=True, na_values=True)
Here is the start of the file:
Software: Microsoft Internet Information Services 7.5
Version: 1.0
Date: 2014-08-01 00:00:25
Fields: date time
2014-08-01 00:00:25...
Result is that 'Fields' is being used as a column heading and a column full of NaN values is being created for column 'time'.
You can do it by calling read_table twice.
# reads the forth line into 1x1 df being a string,
# then splits it and skips the first field:
col_names = pd.read_table('path', skiprows=3, nrows=1, header=None).iloc[0,0].split()[1:]
# reads the actual data:
df = pd.read_table('path', sep=' ', skiprows=4, names=col_names)
If you already know the names of the columns (eg. date and time) then it's even simpler:
df = pd.read_table('path', sep=' ', skiprows=4, names = ['date', 'time'])
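A self-contained sketch of the two-pass approach, with inline text standing in for the log file (using read_csv with an explicit sep, which is equivalent to read_table here):

```python
import io
import pandas as pd

log = u"""Software: Microsoft Internet Information Services 7.5
Version: 1.0
Date: 2014-08-01 00:00:25
Fields: date time
2014-08-01 00:00:25
"""

# Pass 1: read just line 4 as a single cell, split it on whitespace,
# and drop the leading "Fields:" token.
col_names = pd.read_csv(io.StringIO(log), skiprows=3, nrows=1,
                        header=None).iloc[0, 0].split()[1:]

# Pass 2: read the actual data with those names, skipping all four
# header lines.
df = pd.read_csv(io.StringIO(log), sep=" ", skiprows=4, names=col_names)
print(df)
```

For a real file, replace io.StringIO(log) with the path; the two passes over the file are cheap because the first reads only a single row.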
I think you may want skiprows = 4 and header = None
