How to separately add a header row while loading a parquet file? - python

While handling csv files we can say:
df = pd.read_csv("test.csv", names=header_list, dtype=dtype_dict)
This creates a DataFrame with header_list as the column names and the dtypes given in dtype_dict.
Can we do something similar with pd.read_parquet() ?
In my case the headers are passed in separately, so they are not available in the file itself.
Another workaround could be to shift the entire data in df down by one row (moving the current headers into the first row) and then replace the header with header_list, if that's even possible.
Is there an optimal solution to my issue?
I'm not too familiar with parquet so any guidance would be appreciated, thanks.

Can we do something similar with pd.read_parquet() ?
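Parquet files contain metadata, including the column names and their types, so there is no need to pass this information when loading the data. If you need different names or dtypes than the ones stored in the file, you can change them on the DataFrame after loading.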
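A minimal sketch of renaming after loading, assuming the stored column names are placeholders you want to replace. The helper name apply_header and the file name "test.parquet" are hypothetical, and an in-memory DataFrame stands in for the loaded file so the snippet runs without a parquet engine installed:

```python
import pandas as pd


def apply_header(df, header_list, dtype_dict=None):
    """Return a copy of df with columns renamed by position and optionally cast.

    Hypothetical helper, not part of pandas.
    """
    if len(header_list) != df.shape[1]:
        raise ValueError("header_list must have one name per column")
    out = df.copy()
    out.columns = list(header_list)
    if dtype_dict:
        out = out.astype(dtype_dict)
    return out


# Stand-in for: df = pd.read_parquet("test.parquet")
# (read_parquet needs pyarrow or fastparquet to be installed)
df = pd.DataFrame({"0": [1, 2], "1": ["a", "b"]})

renamed = apply_header(df, ["id", "label"], {"id": "int32"})
print(renamed.dtypes)
```

No rows need to be shifted: the names stored in the file are simply overwritten, and the data is left as it was.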

Related

Delete rows above headers in a CSV using Python Pandas

I need to clean up some files using Pandas, but the raw files we are using have a couple of rows above the column headers that I need to remove before getting to work. I can't find out how to get rid of them.
I suppose this has to be done before generating the frame.
Can someone help?
Thanks in advance.
Sample CSV raw file
You can try using the skiprows parameter in read_csv() :
pd.read_csv('filename.csv', skiprows=5)
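As a small illustration of skiprows, here is a sketch with made-up data held in memory: two junk lines sit above the real header, so skiprows=2 is used here rather than 5.

```python
import io

import pandas as pd

# Made-up sample: two lines of preamble above the real header row.
raw = io.StringIO(
    "Report generated by some export tool\n"
    "Source: example system\n"
    "name,value\n"
    "a,1\n"
    "b,2\n"
)

# Skip the two preamble lines; the next line is then used as the header.
df = pd.read_csv(raw, skiprows=2)
print(df)
```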

Renaming the columns in Vaex

I initially tried to read a 4GB csv file with pandas pd.read_csv, but my system runs out of memory (I guess) and the kernel restarts or the system hangs.
So I tried using the vaex library to convert the csv to HDF5 and do operations (aggregations, group by) on that. For that I've used:
df = vaex.from_csv('Wager-Win_April-Jul.csv',column_names = None, convert=True, chunk_size=5000000)
and
df = vaex.from_csv('Wager-Win_April-Jul.csv',header = None, convert=True, chunk_size=5000000)
But I'm still getting the first record in the csv file as the header (the column names, to be precise) and I'm unable to change the column names. I tried to find a function to change the names but didn't come across any. Please help me with that. Thanks :)
The column names 1559104, 10289, 991... are actually the first record in the csv; somehow vaex is taking the first row as my column names, which I want to avoid.
vaex.from_csv is a wrapper around pandas.read_csv with a few extra options for the conversion.
According to the pandas documentation, header='infer' (the default) means that, when no names are passed, the first row of the file is used as the header. If the file has no header row, pass header=None, and optionally pass the column names manually via the names kwarg. The same holds true for both vaex and pandas.
I would read the pandas.read_csv documentation to better understand all the options. Then you can use those options with vaex and the convert and chunk_size arguments.
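A short sketch of the difference in plain pandas, using the first record from the question plus one made-up row. The column names wager_id, player_id and amount are hypothetical:

```python
import io

import pandas as pd

# First row is taken from the question; the second row is made up.
csv_text = "1559104,10289,991\n1559105,10290,992\n"

# Default header='infer' with no names: the first record becomes the header.
default_df = pd.read_csv(io.StringIO(csv_text))

# header=None plus names: every line is data, and the names are supplied by hand.
named_df = pd.read_csv(
    io.StringIO(csv_text),
    header=None,
    names=["wager_id", "player_id", "amount"],  # hypothetical column names
)

print(list(default_df.columns))
print(named_df)
```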

How can I have Pandas recognize the structure of my data properly?

I have some data saved in ".txt" files. This is how they are stored:
I used the code below to read the data and save it in a DataFrame object (I'm using the pandas library for Python):
new_df = pd.read_csv(location, sep='\t', lineterminator='\n', names=None)
The problem is that when I get the shape of my DataFrame with new_df.shape, I end up with (123, 1). It does not recognize that the data have 4 columns. How can I fix this?
It seems your columns are separated by spaces rather than tabs, so use sep=r"\s+" (a raw string, so the backslash is not treated as an escape).
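To illustrate the point, a sketch with made-up, space-separated data: splitting on tabs leaves everything in a single column, while splitting on runs of whitespace recovers four.

```python
import io

import pandas as pd

# Made-up sample: columns separated by runs of spaces, not tabs.
text = "1.0   2.5   3.1   4.0\n10.2  3.3   1.1   2.2\n"

# No tab in the data, so each line ends up as one field.
tab_df = pd.read_csv(io.StringIO(text), sep="\t", header=None)

# Any run of whitespace is treated as one separator.
ws_df = pd.read_csv(io.StringIO(text), sep=r"\s+", header=None)

print(tab_df.shape, ws_df.shape)
```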
From your screenshot, your data appear to be in fixed width format.
Try to use pandas.read_fwf to read your data file:
pd.read_fwf(location)
You may pass the colspecs=... argument to tell it which character positions each column occupies, but by default the routine tries to infer this from the data.
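A sketch of read_fwf on made-up fixed-width data, once with inferred column boundaries and once with explicit colspecs. header=None is passed because this sample has no header row:

```python
import io

import pandas as pd

# Made-up sample: four right-aligned columns, 7 characters wide each.
text = (
    "   1.00   2.50   3.10   4.00\n"
    "  10.20   3.30   1.10   2.20\n"
    "   5.55  66.60   7.70   8.80\n"
)

# colspecs left at its default: pandas infers the column boundaries.
inferred_df = pd.read_fwf(io.StringIO(text), header=None)

# Explicit half-open (start, end) character ranges for each column.
explicit_df = pd.read_fwf(
    io.StringIO(text),
    colspecs=[(0, 7), (7, 14), (14, 21), (21, 28)],
    header=None,
)

print(inferred_df.shape, explicit_df.shape)
```

Explicit colspecs are worth the extra typing when the columns are not cleanly separated by whitespace in the rows pandas inspects.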

Pandas read csv - dealing with mixed named/nameless columns

I am trying to open a csv file using pandas.
This is a screenshot of the file opened in excel.
Some columns have names and some do not. When trying to read this in with pandas I get the "ValueError: Passed header names mismatches usecols" error.
When I open part of the file in excel, add column names, save, and then import with pandas it works.
The problem is the files are large and cannot fully open in excel (plus I'd prefer a more elegant solution anyway).
Is there a way to deal with this issue in pandas?
I have read answers to other questions regarding this error but none were relevant.
Thanks so much in advance!
You can provide the column names yourself via the names parameter:
df = pd.read_csv('pandas_dataframe_importing_csv/example.csv', names=['col1', 'col2', 'col3'], engine='python')
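A sketch of this approach for a file whose header row is only partly filled in, using made-up data and hypothetical column names. Passing header=0 together with names tells pandas to discard the existing header row and use the supplied names instead; without header=0 the old header row would be read as data.

```python
import io

import pandas as pd

# Made-up sample: the header row names only the first and third columns.
csv_text = "id,,score,\n1,a,3.5,x\n2,b,4.0,y\n"

names = ["id", "label", "score", "flag"]  # hypothetical names, one per column

# header=0 marks the first line as a header to be replaced by `names`.
df = pd.read_csv(io.StringIO(csv_text), names=names, header=0)
print(df)
```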

Pyspark - write a dataframe into 2 different csv files

I want to save a single DataFrame into 2 different csv files (splitting the DataFrame) - one would include just the header and another would include the rest of the rows.
I want to save the 2 files under the same directory, so having Spark handle all the logic would be the best option if possible, rather than splitting the csv file using pandas.
What would be the most efficient way to do this?
Thanks for your help!
Let's assume you've got a Dataset called "df".
You can:
Option one: write twice:
df.write.option("header", "false").csv(....)
df.limit(1).write.option("header", "true").csv(....)
As far as I remember, someone had problems saving a DataFrame without rows, so you must write at least one row and then manually cut that row using the normal Java or Python file API.
Option two: write once with header = true, then manually cut the header and place it in a new file using the normal Java or Python file API.
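Data, without header (using pandas):
df.to_csv("data.csv", header=False, index=False) # index=False keeps the row index out of the file
Header, without data:
df_new = pd.DataFrame(data=None, columns=df.columns) # data=None makes sure no rows are copied to the new dataframe
df_new.to_csv("header.csv", index=False) # a different file name, so the data file is not overwritten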
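A runnable sketch of the pandas variant, writing a made-up DataFrame into a temporary directory. The file names data.csv and header.csv are hypothetical, and df.head(0) is used here as another way to get an empty frame with the same columns:

```python
import tempfile
from pathlib import Path

import pandas as pd

# Made-up sample data.
df = pd.DataFrame({"name": ["a", "b"], "value": [1, 2]})

out_dir = Path(tempfile.mkdtemp())
data_path = out_dir / "data.csv"      # hypothetical file name
header_path = out_dir / "header.csv"  # hypothetical file name

# Rows only: no header line and no index column.
df.to_csv(data_path, header=False, index=False)

# Header only: head(0) keeps the columns but drops every row.
df.head(0).to_csv(header_path, index=False)

header_lines = header_path.read_text().splitlines()
data_lines = data_path.read_text().splitlines()
print(header_lines, data_lines)
```

Note that this collects the data on a single machine, so it only suits DataFrames that fit in memory; it does not let Spark handle the split.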
