After df.toPandas().to_csv('mycsv.csv'), data is garbled when reread - python

I have a table named result_25. I use this code to successfully export data to csv on my disk.
result_25.toPandas().to_csv('mycsv.csv')
In order to check whether I saved the file correctly, I read my table back in with this code:
rr = spark.read.csv('mycsv.csv', inferSchema=True, header=True)
I checked the data, and it looked fine.
(screenshot: rr and result_25 datasets)
But when I compared result_25 and rr with .describe().show(), the output differed (I was expecting them to be the same).
(screenshot: result_25 and rr .describe() output)
And when I grouped by 'prediction', they differed even more.
(screenshot: rr grouped by 'prediction')
(screenshot: result_25 grouped by 'prediction')
What's wrong here? Can anybody help me? Thanks!

By default, pandas.to_csv adds an index to the CSV export (from the docs):
index: boolean, default True
Write row names (index)
You can export to CSV without the index:
result_25.toPandas().to_csv('mycsv.csv', index=False)
and you won't see the additional column _c0 (the column name _c0 is added by pyspark since pandas does not give any name to the index column).
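If the CSV has already been written with the index, you can also just drop the stray column after reading it back; a minimal sketch (drop() is a standard PySpark DataFrame method, not something from the original answer):
rr = spark.read.csv('mycsv.csv', inferSchema=True, header=True).drop('_c0')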
If you only use spark (and don't need the saved data frame in human-readable format), another way to avoid this is to write/read pyspark data frames in other formats such as JSON or parquet:
# JSON
result_25.write.json('mydataframe.json')
rr = spark.read.json('mydataframe.json')
# parquet
result_25.write.parquet('mydataframe.parquet')
rr = spark.read.parquet('mydataframe.parquet')

Related

How can I print attributes/meta-data of a pandas dataframe while saving it in Excel or CSV file?

I have a pandas dataframe abc which I created as follows:
abc = pd.DataFrame({"A":[1,2,3],"B":[2,3,4]})
I added some additional attributes of the dataframe as follows:
abc.attrs = {"Name":"John", "Country":"Nepal"}
I'd like to save the pandas dataframe into an Excel file in xlsx or CSV format. I can do that using abc.to_excel("filename.xlsx") or abc.to_csv("filename.csv") where filename is the required name of the file.
However, I am not able to print the attributes in the saved file. I'd like to save the dataframe in Excel file such that first row gives Name and second row gives Country in two columns as shown below:
How can I do that?
Unfortunately, .to_excel() and .to_csv() do not provide any explicit functionality to insert meta information ahead of the actual dataframe as documented for the Excel and CSV write functions.
Regardless, one could exploit the header argument to hardcode this preamble into the frame. This can be achieved, for example, with
abc.to_csv("filename.csv", header=[str(k) + ',' + str(v) + '\n' for k,v in abc.attrs.items()])
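If the header hack does not produce the layout you want, a minimal alternative sketch (not part of the original answer) is to write the attribute rows yourself and then append the dataframe below them; to_csv accepts an open file handle:
import pandas as pd

abc = pd.DataFrame({"A": [1, 2, 3], "B": [2, 3, 4]})
abc.attrs = {"Name": "John", "Country": "Nepal"}

with open("filename.csv", "w", newline="") as f:
    for key, value in abc.attrs.items():
        f.write(f"{key},{value}\n")   # one metadata row per attribute
    abc.to_csv(f, index=False)        # the data table follows the preamble
A reader can then skip the metadata rows with, e.g., pd.read_csv("filename.csv", skiprows=2).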
Please note, however, that data tables store homogeneous data across rows and columns. Adding meta information on top makes the data harder to read and process. Consider adding it (a) in the file name, (b) in a distinct table, or (c) dropping it altogether.
Additionally, it should be noted that, as of now (pandas 1.4.3), the attributes feature is experimental and could change or disappear in any future version, which makes any implementation brittle.

Pandas - Change the default date format when creating a dataframe from CSVs

I have a script that loops through a folder of CSVs, reads them, removes any empty rows (they all have 'empty' rows that Pandas reads as NaN) and appends them to a master dataframe. It then writes the dataframe to a new CSV. This is all working as expected:
if pl.Path(file).suffix == '.csv':
    fullPath = os.path.join(sourceLoc, file)
    print(file)
    initDF = pd.read_csv(fullPath)
    cleanDF = initDF.dropna(subset=['Name'])
    masterDF = masterDF.append(cleanDF)
masterDF.to_csv(destLoc, index=False)
My only issue is that the input dates are displayed like this: 25/05/21, but the output dates end up formatted like this: 05/25/21. As I'm in the UK and using a UK version of Excel to analyse the output, this confuses all my functions.
The only solutions I've found so far are to reformat the date columns individually or to style them, which to my understanding only affects how they look in Jupyter and not the actual data. As there are multiple date columns in the source data files, I'd rather not have to reformat them all individually.
Is there any way of defining the date format when first creating the dataframe, or reformatting every date column once the dataframe is filled?
In the end this issue was caused by two different problems.
The first was Excel intermittently exporting my dates in US format despite the original format (and my Windows Region settings) being UK format. I've now added a short VBA loop in my export code to ensure those columns are formatted correctly every time the data is exported.
The second was the CSV date being imported with incorrect dtypes. I suspect this was again the fault of Excel (2010 is problematic) but I'm unsure. I'm now correcting this with an astype() method.
The end result is my dates are now imported into Pandas in the correct format and outputted to a new CSV in the correct format too.
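For reference, a minimal sketch of fixing this purely in pandas (not part of the original fix; the date column names below are hypothetical): parse the date columns with dayfirst=True when reading, then pin the output format when writing.
import pandas as pd

date_cols = ['Start Date', 'End Date']   # hypothetical column names
initDF = pd.read_csv(fullPath, parse_dates=date_cols, dayfirst=True)
cleanDF = initDF.dropna(subset=['Name'])
masterDF = pd.concat([masterDF, cleanDF])   # append() was removed in pandas 2.0
masterDF.to_csv(destLoc, index=False, date_format='%d/%m/%y')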

Renaming the columns in Vaex

I initially tried to read a 4 GB CSV file with pandas pd.read_csv, but my system runs out of memory (I guess) and the kernel restarts or the system hangs.
So I tried using the vaex library to convert the CSV to HDF5 and do operations (aggregations, group by) on that. For that I've used:
df = vaex.from_csv('Wager-Win_April-Jul.csv',column_names = None, convert=True, chunk_size=5000000)
and
df = vaex.from_csv('Wager-Win_April-Jul.csv',header = None, convert=True, chunk_size=5000000)
But I'm still getting the first record of the CSV file as the header (column names, to be precise), and I'm unable to change the column names. I tried to find a function to change the names but didn't come across any. Please help me with that. Thanks :)
The column names 1559104, 10289, 991... are actually the first record in the CSV; somehow vaex is taking the first row as my column names, which I want to avoid.
vaex.from_csv is a wrapper around pandas.read_csv with a few extra options for the conversion.
Reading the pandas documentation: header='infer' (the default) infers the column names from the first row of the file unless you pass names explicitly. Alternatively, you can pass the column names manually via the names kwarg (together with header=None so the first row is kept as data). The same holds true for both vaex and pandas.
I would read the pandas.read_csv documentation to better understand all the options. Then you can use those options with vaex, along with the convert and chunk_size arguments.
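For example, a minimal sketch assuming the file really has no header row (the column names here are made up, not from the original data):
import vaex

df = vaex.from_csv('Wager-Win_April-Jul.csv',
                   header=None,                                # treat the first row as data
                   names=['wager_id', 'win_amount', 'count'],  # hypothetical column names
                   convert=True,
                   chunk_size=5_000_000)
df.rename('wager_id', 'id')   # rename afterwards if needed (rename_column in older vaex versions)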

Drop partition columns when writing parquet in pyspark

I have a dataframe with a date column. I have parsed it into year, month, day columns. I want to partition on these columns, but I do not want the columns to persist in the parquet files.
Here is my approach to partitioning and writing the data:
df = df.withColumn('year', f.year(f.col('date_col'))).withColumn('month',f.month(f.col('date_col'))).withColumn('day',f.dayofmonth(f.col('date_col')))
df.write.partitionBy('year','month', 'day').parquet('/mnt/test/test.parquet')
This properly creates the parquet files, including the nested folder structure. However I do not want the year, month, or day columns in the parquet files.
Spark/Hive won't write the year, month, day columns in your parquet files, as they are already in the partitionBy clause.
Example:
val df=Seq((1,"a"),(2,"b")).toDF("id","name")
df.coalesce(1).write.partitionBy("id").csv("/user/shu/temporary2") //write csv file.
Checking contents of csv file:
hadoop fs -cat /user/shu/temporary2/id=1/part-00000-dc55f08e-9143-4b60-a94e-e28b1d7d9285.c000.csv
Output:
a
As you can see, there is no id value included in the CSV file; in the same way, if you write a parquet file, the partition columns are not included in the part-*.parquet files.
To check the schema of a parquet file:
parquet-tools schema <hdfs://nn:8020/parquet_file>
This lets you verify which columns are included in your parquet file.
If you use df.write.partitionBy('year', 'month', 'day'), these columns are not actually stored in the file data. They are simply rendered via the folder structure that partitionBy creates.
Ex. partitionBy('year').csv("/data") will create something like:
/data/year=2018/part1---.csv
/data/year=2019/part1---.csv
When you read the data back it uses the special path year=xxx to populate these columns.
You can prove it by reading in the data of a single partition directly.
Ex. year will not be a column in this case.
df = spark.read.csv("data/year=2019/")
df.printSchema()
@Shu's answer above could also be used to investigate.
You can rest assured that these columns are not taking up extra storage space.
If you simply don't want to see the columns when querying, you could put a view on top of this table that excludes them.
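A minimal sketch of that last suggestion (the path and view name are placeholders):
rr = spark.read.parquet('/mnt/test/test.parquet')
rr.drop('year', 'month', 'day').createOrReplaceTempView('test_without_partition_cols')
spark.sql('SELECT * FROM test_without_partition_cols').show()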

How to read data from excel from a particular column in python

I have an Excel sheet, and I am reading it using pandas in Python.
Now I want to read the Excel file based on a column: if the column has some value, do not read that row; if the column is empty, then read it and store the values in a list.
(screenshot: Excel example)
In the above image, when the UniqueIdentifier is 'Yes' the row should not be read, but if it is empty then reading should start from that row.
How can I do that using Python, and how can I get the index so that, after I have performed some function, I can write back to that blank UniqueIdentifier column to mark the row as read?
This is possible for csv files. There you could do
iter_csv = pandas.read_csv('file.csv', iterator=True, chunksize=100000)
df = pd.concat([chunk[chunk['UniqueIdentifier'] == 'True'] for chunk in iter_csv])
But pd.read_excel does not offer to return an iterator object; maybe some other Excel readers can, but I don't know which ones. Nevertheless, you could export your Excel file as CSV and use the solution for CSV files.
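For completeness, a minimal sketch of the same idea directly against the Excel file (the 'UniqueIdentifier' and 'Value' column names are assumptions based on the screenshot):
import pandas as pd

df = pd.read_excel('file.xlsx')                    # requires openpyxl for .xlsx files
unread = df['UniqueIdentifier'].isna()             # rows whose identifier column is empty
values = df.loc[unread, 'Value'].tolist()          # store those rows' values in a list

# ... process the values, then mark the rows as read and write the sheet back out
df.loc[unread, 'UniqueIdentifier'] = 'Yes'
df.to_excel('file.xlsx', index=False)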
