Pandas read csv - dealing with mixed named/nameless columns - python

I am trying to open a csv file using pandas.
This is a screenshot of the file opened in excel.
Some columns have names and some do not. When trying to read this in with pandas I get the "ValueError: Passed header names mismatches usecols" error.
When I open part of the file in excel, add column names, save, and then import with pandas it works.
The problem is the files are large and cannot fully open in excel (plus I'd prefer a more elegant solution anyway).
Is there a way to deal with this issue in pandas?
I have read answers to other questions regarding this error but none were relevant.
Thanks so much in advance!

In names you can provide column names:
df = pd.read_csv('pandas_dataframe_importing_csv/example.csv', names=['col1', 'col2', 'col3'], engine='python')

Related

Spark dataframe with strange values after reading CSV

Coming from here, I'm trying to read the correct values from this dataset in Pyspark. I made a good progress using df = spark.read.csv("hashtag_donaldtrump.csv", header=True, multiLine=True), but now I have some weird values in some cells, as you can see in this picture (last lins):
Do you know how could I get rid of them? Or else, how can I read the CSV with format using another program? It's very hard for me to use a text editor like Vim or Nano and try to guess where are the errors. Thank you!
Spark seems to have difficulty in reading this line:
2020-10-15 00:00:23,1.3165293165079306e+18,"""IS THIS WRONG??!!"" ...
because there are three double quotes. However pandas seem to understand that well, so as a workaround, you can use pandas to read the csv file first, and convert to a Spark dataframe. Normally this is not recommended because of the large overhead involved, but for this small csv file the performance hit should be acceptable.
df = spark.createDataFrame(pd.read_csv('hashtag_donaldtrump.csv').replace({float('nan'): None}))
The replace is for replacing nan with None in the pandas dataframe. Spark thinks nan is a float, and it gets confused when there is nan in string type columns.
If the file is too large for pandas, then you can consider dropping those rows that Spark cannot parse using mode='DROPMALFORMED':
df = spark.read.csv('hashtag_donaldtrump.csv', header=True, multiLine=True, mode='DROPMALFORMED')

Renaming the columns in Vaex

I tried to read a csv file of 4GB initially with pandas pd.read_csv but my system is running out of memory (I guess) and the kernel is restarting or the system hangs.
So, I tried using vaex library to convert csv to HDF5 and do operations(aggregations,group by)on that. For that I've used:
df = vaex.from_csv('Wager-Win_April-Jul.csv',column_names = None, convert=True, chunk_size=5000000)
and
df = vaex.from_csv('Wager-Win_April-Jul.csv',header = None, convert=True, chunk_size=5000000)
But still I'm getting my first record in csv file as the header(column names to be precise)and I'm unable to change the column names. I tried finding function to change the names but didn't come across any. Pls help me on that. Thanks :)
The column names 1559104, 10289, 991... is actually the first record in the csv and somehow vaex is taking the first row as my column names which I want to avoid
vaex.from_csv is a wrapper around pandas.read_csv with few extra options for the conversion.
So reading the pandas documentation, header='infer' (which is the default) if you want the csv reader to automatically infer the column names. Otherwise the 1st row of the file is used as the header. Alternatively you can pass the column names manually via the names kwarg. Same holds true for both vaex and pandas.
I would read the pandas.read_csv documentation to better understand all the options. Then you can use those options with vaex and the convert and chunk_size arguments.

Deleting Columns from a CSV in Python

I know similar questions to this have been asked, but I couldn't find any that were dealing with the error I'm getting (though I apologize if I'm missing something!). I am trying to remove a few columns from a CSV that wouldn't load in Excel so I couldn't just delete them within the file. I have the following code:
import os
import pandas as pd
os.chdir(r"C:\Users\maria\Desktop\Project\North American Breeding Bird Survey")
data = pd.read_csv("NABBSStateData.csv")
data.drop(["CountryNum", "Route", "RPID"], axis = 1, inplace = True)
but when I run it I get this error message:
c:\program files (x86)\microsoft visual studio\2019\professional\common7\ide\extensions\microsoft\python\core\Packages\ptvsd\_vendored\pydevd\pydevd.py:1664: DtypeWarning: Columns (0,1,2,3,4,5,6,7,8,9,10,11,12,13) have mixed types. Specify dtype option on import or set low_memory=False.
return self._exec(is_module, entry_point_fn, module_name, file, globals, locals)
I am relatively new to python/visual studio, and I am having a hard time figuring out what this error message is saying and how to fix it. Thank you!!
Edit: The CSV in question is the state files from this site concatenated together, so you can open one of the state files to see the columns/data types.
Looks like you have mixed data types in some of your columns (e.g. columns 0,1,2,3,4,5,6,7,8,9,10,11,12,13).
Mixed data type means in one column, say column 'a', most rows are numbers, but there might be strings in some rows as well.
Try use dtype option from pd.read_csv to specify the column types. If you are not sure about the type, use object or str.
This is an example:
df = pd.read_csv('D:\\foo.csv', header=0, dtype={'currency':str, 'v1':object, 'v2':object})
A link to use read_csv
Here's list of all the types you can specify.

Creating a dataframe from a csv file in pandas: column issue

I have a messy text file that I need to sort into columns in a dataframe so I
can do the data analysis I need to do. Here is the messy looking file:
Messy text
I can read it in as a csv file, that looks a bit nicer using:
import pandas as pd
data = pd.read_csv('phx_30kV_indepth_0_0_outfile.txt')
print(data)
And this prints out the data aligned, but the issue is that the output is [640 rows x 1 column]. And I need to separate it into multiple columns and manipulate it as a dataframe.
I have tried a number of solutions using StringIO that have worked here before, but nothing seems to be doing the trick.
However, when I do this, there is the issue that the
delim_whitespace=True
Link to docs ^
df = pd.read_csv('phx_30kV_indepth_0_0_outfile.txt', delim_whitespace=True)
Your input file is actually not in CSV format.
As you provided only .png picture, it is even not clear, whether this file
is divided into rows or not.
If not, you have to start from "cutting" the content into individual lines and
read the content from the output file - result of this cutting.
I think, this is the first step, before you can use either read_csv or read_table (of course, with delim_whitespace=True).

columns missing while reading a huge csv file in jupyter nootebook

So i am trying to read a csv file by the code as below:
import pandas as pd
user_cols = ['id','listing_type','status','listing_class','property_type','street_address','city','state',' 'zip_4','cross_street','street_index','unit','floor','location','Latitude',
'longitude','subway','neighborhood','price','incentives','fee_type','fee_percentage','fee_details_broker',
'fee_details_clients','application_information','maintenance','taxes','max_financing','other_costs','beds',
'baths','full_baths','three_quarter_baths','half_baths','total_rooms','square_feet','exterior_square_feet',
'lot_area','lot_dimensions','date_available','date_listed','closed_on','year_built','recent_renovation',
'lease_min','lease_max','date_added','date_edited','date_update','contact','access','keys','mls_name','mls_id',
'courtesy_of','vow_opt_out','idx_opt_out','pet_details','notes','sync','private','listing_score','added_by_id',
'featured_office_id','date_expires','exclusive_file_id','condition','guarantor','blast_link']
data = pd.read_csv("C:\\Users\\Desktop\\dump-4.csv", low_memory=False, dtype=object, header=None, names=user_cols)
I am able to read the file but when i try to display the columns there are about 15-16 column names that are missing. Why is this happening and what can I do.
So when i deleted the dtype=object and header=None..it did print all the columns. not really sure what wouldve been the correct dtype though! Thanks anyway! :)

Categories