Deleting Columns from a CSV in Python - python

I know similar questions to this have been asked, but I couldn't find any that were dealing with the error I'm getting (though I apologize if I'm missing something!). I am trying to remove a few columns from a CSV that wouldn't load in Excel so I couldn't just delete them within the file. I have the following code:
import os
import pandas as pd
os.chdir(r"C:\Users\maria\Desktop\Project\North American Breeding Bird Survey")
data = pd.read_csv("NABBSStateData.csv")
data.drop(["CountryNum", "Route", "RPID"], axis = 1, inplace = True)
but when I run it I get this error message:
c:\program files (x86)\microsoft visual studio\2019\professional\common7\ide\extensions\microsoft\python\core\Packages\ptvsd\_vendored\pydevd\pydevd.py:1664: DtypeWarning: Columns (0,1,2,3,4,5,6,7,8,9,10,11,12,13) have mixed types. Specify dtype option on import or set low_memory=False.
return self._exec(is_module, entry_point_fn, module_name, file, globals, locals)
I am relatively new to python/visual studio, and I am having a hard time figuring out what this error message is saying and how to fix it. Thank you!!
Edit: The CSV in question is the state files from this site concatenated together, so you can open one of the state files to see the columns/data types.

It looks like some of your columns contain mixed data types (the message lists columns 0 through 13). Note that this is a DtypeWarning rather than an error, so the script still runs.
A mixed data type means that in one column, say column 'a', most rows are numbers but some rows contain strings.
Try using the dtype option of pd.read_csv to specify the column types. If you are not sure about a type, use object or str.
This is an example:
df = pd.read_csv('D:\\foo.csv', header=0, dtype={'currency':str, 'v1':object, 'v2':object})
See the pandas documentation for read_csv.
The pandas documentation also lists all the dtypes you can specify.
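As a rough sketch of the suggestion above applied to the original task: the snippet reads every column as text and then drops the three columns. The sample data, the StateNum and Year column names, and the idea that concatenation left repeated header rows in the file are all assumptions for illustration; only CountryNum, Route and RPID come from the question.

```python
import io
import pandas as pd

# Hypothetical stand-in for NABBSStateData.csv. Concatenating several files
# can leave repeated header rows in the data, which would make every column
# mixed-type (an assumption about this particular file).
csv_text = (
    "CountryNum,StateNum,Route,RPID,Year\n"
    "840,02,001,101,1967\n"
    "CountryNum,StateNum,Route,RPID,Year\n"
    "840,03,002,101,1968\n"
)

# dtype=str reads every column as text, so pandas does not have to guess types.
data = pd.read_csv(io.StringIO(csv_text), dtype=str)

# Remove stray header rows, if there are any.
data = data[data["CountryNum"] != "CountryNum"]

# Drop the unwanted columns and keep the rest.
data = data.drop(columns=["CountryNum", "Route", "RPID"])
print(data.columns.tolist())
```

Reading everything as str also keeps leading zeros such as "02"; individual columns can be converted afterwards with pd.to_numeric if needed.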

Related

Picking out a specific column in a table

My goal is to import a table of astrophysical data that I have saved to my computer (obtained from matching 2 other tables in TOPCAT, if you know it), and extract certain relevant columns. I hope to then do further manipulations on these columns. I am a complete beginner in python, so I apologise for basic errors. I've done my best to try and solve my problem on my own but I'm a bit lost.
This script I have written so far:
import pandas as pd
input_file = "location\\filename"
dataset = pd.read_csv(input_file,skiprows=12,usecols=[1])
The file that I'm trying to import is listed as having file type "File", in my drive. I've looked at this file in Notepad and it has a lot of descriptive bumf in the first few rows, so to try and get rid of this I've used "skiprows" as you can see. The data in the file is separated column-wise by lines--at least that's how it appears in Notepad.
The problem is that when I try to extract the first column using "usecols", it instead returns what appears to be the first row in the command window, along with a load of vertical bars between each value. I assume it is somehow not interpreting the table correctly, not understanding what's a column and what's a row.
What I've tried: Modifying the file and saving it in a different filetype. This gives the following error:
FileNotFoundError: [Errno 2] No such file or directory: 'location\\filename'
Despite the fact that the new file is saved in exactly the same location.
I've tried using "pd.read_table" instead of csv, but this doesn't seem to change anything (nor does it give me an error).
When I've tried to extract multiple columns (i.e. "usecols=[1,2]") I get the following error:
ValueError: Usecols do not match columns, columns expected but not found: [1, 2]
My hope is that someone with experience can give some insight into what's likely going on to cause these problems.
Maybe you can try dataset.iloc[:, 0]. With iloc you can select the columns or rows you want by integer position (among other things); [:, 0] selects all the rows of the first column.
The file is incorrectly named.
I expect that you are reading a csv file or an xlsx or txt file. So the (windows) path would look similar to this:
import pandas as pd
input_file = "C:\\python\\tests\\test_csv.csv"
dataset = pd.read_csv(input_file,skiprows=12,usecols=[1])
The error message tells you this:
No such file or directory: 'location\\filename'
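The vertical bars in the output, together with the "Usecols do not match columns" error, suggest that pandas is seeing a single column because the file is delimited by "|" rather than by commas. A minimal sketch under that assumption follows; the sample text, its column names and the two descriptive lines are made up, and the real file may be laid out differently.

```python
import io
import pandas as pd

# Hypothetical pipe-delimited sample with two descriptive lines at the top
# (the real file reportedly has 12 such lines).
text = (
    "descriptive line 1\n"
    "descriptive line 2\n"
    "ra|dec|mag\n"
    "10.5|-3.2|18.1\n"
    "11.0|-3.4|17.6\n"
)

# sep="|" tells pandas where the columns are split; usecols is zero-based,
# so 1 selects the second column.
dataset = pd.read_csv(io.StringIO(text), sep="|", skiprows=2, usecols=[1])
print(dataset)
```

With the separator set, usecols=[1, 2] would also work, because pandas then finds more than one column.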

Reading csv-file in Python

I know this question has been asked a lot, but none of the solutions I can find seems to work.
I'm trying to read a csv in python using pandas. The csv file 'data.csv' contains 9 comma-separated columns and no header, in the format:
T,000027E7,24.56,3.41,5.03,12,1260497437.817,4,0.18
T,00006726,28.84,8.24,5.03,14,1260497437.818,4,3.62
However, when using the command below, only a single column containing all values is outputted.
import pandas as pd
data2=pd.read_csv('data.csv',header=None)
I've also tried specifying names of each column to no avail.
data2=pd.read_csv('data.csv',header=None, names=['Type','TagID','x','y','z','BatLvl','TimeStamp','Unit','DQI'])
Does anybody know of a way to solve this?
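For reference, here is a small sketch showing that the two sample rows, exactly as shown above, do split into nine columns. That suggests the real file differs from the sample in some way (for example, each line being wrapped in quotes), which is only a guess. The dtype for TagID is an addition: without it, a value like 000027E7 could be read as a number in scientific notation.

```python
import io
import pandas as pd

# The two sample rows from the question, held in memory instead of data.csv.
sample = (
    "T,000027E7,24.56,3.41,5.03,12,1260497437.817,4,0.18\n"
    "T,00006726,28.84,8.24,5.03,14,1260497437.818,4,3.62\n"
)

names = ["Type", "TagID", "x", "y", "z", "BatLvl", "TimeStamp", "Unit", "DQI"]

# header=None because the file has no header row; TagID is kept as text.
data2 = pd.read_csv(io.StringIO(sample), header=None, names=names,
                    dtype={"TagID": str})
print(data2.shape)
```

If the real file still loads as a single column, printing the first raw line with repr() is one way to see what the reader actually receives.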

Renaming the columns in Vaex

I tried to read a csv file of 4GB initially with pandas pd.read_csv but my system is running out of memory (I guess) and the kernel is restarting or the system hangs.
So I tried using the vaex library to convert the csv to HDF5 and do operations (aggregations, group by) on that. For that I've used:
df = vaex.from_csv('Wager-Win_April-Jul.csv',column_names = None, convert=True, chunk_size=5000000)
and
df = vaex.from_csv('Wager-Win_April-Jul.csv',header = None, convert=True, chunk_size=5000000)
But I'm still getting the first record in the csv file as the header (the column names, to be precise) and I'm unable to change the column names. I tried to find a function to change the names but didn't come across any. Please help me with that. Thanks :)
The column names 1559104, 10289, 991... are actually the first record in the csv; somehow vaex is taking the first row as my column names, which I want to avoid.
vaex.from_csv is a wrapper around pandas.read_csv with a few extra options for the conversion.
According to the pandas documentation, header='infer' (the default) tells the csv reader to infer the column names: when no names are passed, the first row of the file is used as the header. Alternatively, you can pass the column names manually via the names kwarg, in which case pandas treats the file as having no header row unless told otherwise. The same holds true for both vaex and pandas.
I would read the pandas.read_csv documentation to better understand all the options. Then you can use those options with vaex and the convert and chunk_size arguments.
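The header and names behaviour described above can be checked in plain pandas. This is only a sketch: the two rows and the column names wager_id, amount and win are hypothetical, and it does not call vaex itself.

```python
import io
import pandas as pd

# Hypothetical two-row file without a header line.
sample = "1559104,10289,991\n1559105,10290,992\n"

# Default (header='infer', no names): the first record becomes the header.
df_default = pd.read_csv(io.StringIO(sample))

# header=None: no row is consumed as a header; columns are numbered 0, 1, 2.
df_none = pd.read_csv(io.StringIO(sample), header=None)

# names=...: explicit column names, and all rows are kept as data.
df_named = pd.read_csv(io.StringIO(sample),
                       names=["wager_id", "amount", "win"])

print(list(df_default.columns), len(df_none), list(df_named.columns))
```

Since vaex.from_csv is described above as passing its options through to pandas.read_csv, the same names argument should be usable there alongside convert and chunk_size.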

Fixed Width File manipulation in Pandas

I have a fixed-width file with the following format:
5678223313570888271712000000024XAXX0101010006461801325345088800.0784001501.25abc#yahoo.com
5678223324686600271712000000070XAXX0101010006461801325390998280.0784001501.25abcde.12345#gmail.com 5678123422992299
Here's what I tried:
import pandas as pd
ColSpecs = [(0,16),(16,31),(31,44),(44,62),(62,70),(70,73),(73,77),(77,127),(127,143)]
df = pd.read_fwf("~/filename.txt",colspecs=ColSpecs,header=None)
This does let me load the file cleanly into a pandas DataFrame. However, the blanks (the fixed padding whitespace) get trimmed off. For example, the Email field (#8) is fixed at 50 characters, but its values lose their padding as soon as they are imported into the DataFrame.
For the data manipulation, I am creating 3 new fields that are extracted from the values of the previously imported fields.
Final Output file structure:
[(0,16),(16,31),(31,44),(44,62),(62,70),(70,73),(73,77),(77,127),(127,143),(143,153),(153,163),(164,165)]
Since I haven't found a to_fwf method on DataFrames, or any other way to go from pandas back to a flat file while keeping the original field lengths intact, I would really appreciate it if anyone has a better solution.
P.S.: I read that awk/sed on Unix work better for this, but I would still like to know how to do it in Python.
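One possible approach, sketched below, is to let read_fwf strip the padding and then pad each value back to its field width when writing. The to_fwf helper is hypothetical (it is not a pandas method), and the three-field layout and sample values are made up to keep the example short; the real file would use the nine colspecs above.

```python
import io
import pandas as pd

# Hypothetical three-field layout.
colspecs = [(0, 4), (4, 10), (10, 30)]
widths = [end - start for start, end in colspecs]

rows = [("5678", "000123", "abc#yahoo.com"),
        ("5679", "000456", "abcde.12345#gmail.com"[:20])]

# Build the fixed-width text by left-justifying each value to its width.
raw = "".join(
    "".join(value.ljust(width) for value, width in zip(row, widths)) + "\n"
    for row in rows
)

# dtype=str keeps leading zeros; read_fwf strips the padding on import.
df = pd.read_fwf(io.StringIO(raw), colspecs=colspecs, header=None, dtype=str)


def to_fwf(frame, field_widths):
    """Hypothetical helper: render a DataFrame as fixed-width text."""
    lines = []
    for record in frame.itertuples(index=False):
        # Pad each value to its width; longer values are cut to fit.
        lines.append("".join(str(value).ljust(width)[:width]
                             for value, width in zip(record, field_widths)))
    return "\n".join(lines) + "\n"


output = to_fwf(df, widths)
print(output == raw)
```

New fields can be added as extra DataFrame columns with matching extra entries in the widths list. Missing values would be written as the text "nan" by this sketch, so they would need filling beforehand.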

Pandas read csv - dealing with mixed named/nameless columns

I am trying to open a csv file using pandas.
This is a screenshot of the file opened in excel.
Some columns have names and some do not. When trying to read this in with pandas I get the "ValueError: Passed header names mismatches usecols" error.
When I open part of the file in excel, add column names, save, and then import with pandas it works.
The problem is the files are large and cannot fully open in excel (plus I'd prefer a more elegant solution anyway).
Is there a way to deal with this issue in pandas?
I have read answers to other questions regarding this error but none were relevant.
Thanks so much in advance!
You can provide the column names through the names parameter:
df = pd.read_csv('pandas_dataframe_importing_csv/example.csv', names=['col1', 'col2', 'col3'], engine='python')
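A small sketch of that idea, assuming the file has a header line in which only some columns are named. The sample data and the names extra1 and extra2 are hypothetical, since the screenshot of the real file is not available. Passing header=0 together with names replaces the partial header instead of reading it as a data row.

```python
import io
import pandas as pd

# Hypothetical file: four data columns, but only the first two are named.
sample = (
    "id,value,,\n"
    "1,2,3,4\n"
    "5,6,7,8\n"
)

# header=0 marks the first line as a header to be replaced by `names`.
df = pd.read_csv(io.StringIO(sample), header=0,
                 names=["id", "value", "extra1", "extra2"])
print(df)
```

Because everything is specified at read time, the file never has to be opened and edited in Excel first.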
