Spark dataframe with strange values after reading CSV - python

Coming from here, I'm trying to read the correct values from this dataset in PySpark. I made good progress using df = spark.read.csv("hashtag_donaldtrump.csv", header=True, multiLine=True), but now I have some weird values in some cells, as you can see in this picture (last lines):
Do you know how I could get rid of them? Or else, how can I read the CSV with its formatting using another program? It's very hard for me to use a text editor like Vim or Nano and try to guess where the errors are. Thank you!

Spark seems to have difficulty in reading this line:
2020-10-15 00:00:23,1.3165293165079306e+18,"""IS THIS WRONG??!!"" ...
because it contains three consecutive double quotes. pandas, however, seems to parse it correctly, so as a workaround you can read the CSV file with pandas first and then convert the result to a Spark dataframe. This is normally not recommended because of the large overhead involved, but for this small CSV file the performance hit should be acceptable.
import pandas as pd
df = spark.createDataFrame(pd.read_csv('hashtag_donaldtrump.csv').replace({float('nan'): None}))
The replace call converts NaN to None in the pandas dataframe. Spark treats NaN as a float, and it gets confused when NaN appears in string-typed columns.
If the file is too large for pandas, then you can consider dropping those rows that Spark cannot parse using mode='DROPMALFORMED':
df = spark.read.csv('hashtag_donaldtrump.csv', header=True, multiLine=True, mode='DROPMALFORMED')
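To see why that row is valid CSV in the first place, here is a minimal sketch using Python's standard csv module. The row below is a made-up sample modeled on the line quoted above (the real line is truncated, so the text after the closing quotes is an assumption). Inside a quoted field, a doubled double quote stands for one literal quote character, which is how the csv module (and apparently pandas) reads it.

```python
import csv
import io

# Hypothetical sample row modeled on the problematic line; the ending
# ' he said"' is invented because the original line is truncated.
line = '2020-10-15 00:00:23,1.3165293165079306e+18,"""IS THIS WRONG??!!"" he said"'

# csv.reader treats "" inside a quoted field as one literal double quote.
row = next(csv.reader(io.StringIO(line)))

print(len(row))  # number of parsed fields
print(row[2])    # the tweet text with its literal quotes restored
```

Spark's CSV reader uses a backslash as its default escape character rather than a doubled quote, which may be related to the behaviour seen here. Whether passing escape='"' to spark.read.csv fixes this particular file has not been checked.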

Related

save changes in a pandas column in python after text cleaning

I'm fairly new to Python. I have opened my CSV file using pandas. Here, I have applied text cleaning approaches to one of the columns (after copying the raw column "message").
My problem is,
When I convert my dataframe back into CSV the new column does not include the changes that I've applied such as removal of special characters. What am I doing wrong?
Thank you in advance.
This is the code that I've run:
Then I have converted into csv by adding:
df.to_csv(r'Path\filename.csv')
SORTEDDDD :DDD
You can use .to_csv() DataFrame method https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.to_csv.html
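The code from the question is not shown, so the actual cause is unknown. One common reason for this symptom is that the cleaned result is never assigned back to a column before calling to_csv(). A minimal sketch under that assumption (the sample messages, the column name clean_message and the regular expression are all hypothetical):

```python
import io

import pandas as pd

# Hypothetical data standing in for the "message" column of the question.
df = pd.DataFrame({"message": ["Hello, world!!", "a#b"]})

# String methods return a new Series, so the result has to be assigned
# back to a column; otherwise the dataframe is left unchanged.
df["clean_message"] = df["message"].str.replace(r"[^A-Za-z0-9 ]+", "", regex=True)

# Write to an in-memory buffer here; a file path works the same way.
buffer = io.StringIO()
df.to_csv(buffer, index=False)
csv_text = buffer.getvalue()
print(csv_text)
```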

Renaming the columns in Vaex

I initially tried to read a 4 GB csv file with pandas pd.read_csv, but my system runs out of memory (I guess) and the kernel restarts or the system hangs.
So I tried using the vaex library to convert the csv to HDF5 and do operations (aggregations, group by) on that. For that I've used:
df = vaex.from_csv('Wager-Win_April-Jul.csv',column_names = None, convert=True, chunk_size=5000000)
and
df = vaex.from_csv('Wager-Win_April-Jul.csv',header = None, convert=True, chunk_size=5000000)
But I'm still getting the first record in the csv file as the header (the column names, to be precise), and I'm unable to change the column names. I tried to find a function to change the names but didn't come across any. Please help me with that. Thanks :)
The column names 1559104, 10289, 991... are actually the first record in the csv; somehow vaex is taking the first row as my column names, which I want to avoid.
vaex.from_csv is a wrapper around pandas.read_csv with a few extra options for the conversion.
According to the pandas documentation, header='infer' (the default) makes the csv reader take the column names from the first row of the file, unless names is passed. If the file has no header row, pass header=None so that the first row is kept as data. Alternatively you can pass the column names manually via the names kwarg. The same holds true for both vaex and pandas.
I would read the pandas.read_csv documentation to better understand all the options. Then you can use those options with vaex and the convert and chunk_size arguments.
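As an illustration, here is a small pandas-only sketch of how header and names behave. The data is a made-up two-row file that starts with the values from the question, and the column names wager_id, player_id and amount are hypothetical. The same keyword arguments should pass through vaex.from_csv if it forwards them to pandas as described, but that is not verified here.

```python
import io

import pandas as pd

# Hypothetical headerless file: the first row is a record, not a header.
raw = "1559104,10289,991\n1559105,10290,992\n"

# Default header='infer' with no names: the first row becomes the header.
inferred = pd.read_csv(io.StringIO(raw))

# header=None keeps the first row as data; columns are numbered 0, 1, 2.
no_header = pd.read_csv(io.StringIO(raw), header=None)

# Passing names also keeps the first row as data and labels the columns.
named = pd.read_csv(io.StringIO(raw), names=["wager_id", "player_id", "amount"])

print(list(inferred.columns), len(inferred))
print(list(no_header.columns), len(no_header))
print(list(named.columns), len(named))
```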

How to read pipe-separated string into dataframe?

I have to read a file into spark (databricks) as bytes, and convert it to a string.
file_bytes.decode("utf-8")
This is all fine, and I have my data, as a pipe delimited string, including carriage returns etc. It all looks good. Something like:
"Column1"|"Column2"|"Column3"|"Column4"|"Column5"
"This"|"is"|"some"|"data."|
"Shorter"|"line."|||
I want this in a dataframe though so that I can manipulate it, and initially I was attempting to use the following:
df = (sqlContext.read.format("csv")
      .option("header", "true")
      .option("inferSchema", "true")
      .option("delimiter", '|')
      .load(???))
I appreciate that the load() portion is really meant to be a path to a location on the filesystem ... so have been struggling with that one.
I have therefore reverted to using pandas as it makes life a lot easier:
import io
import pandas
temp = io.StringIO(file_bytes.decode("utf-8"))
df = pandas.read_csv(temp, sep="|")
This is a pandas dataframe, not a Spark dataframe, which as far as I am aware (and it's a very loose awareness) has pros and cons relating to where it lives (in memory), which in turn affects scalability, cluster usage, etc.
Initially, is there a way for me to get my string into a spark dataframe using sqlContext? Maybe I am missing some parameter or switch etc., or should I just stick with pandas?
The main thing I am worried about is that right now files are quite small (200 kb or so), but they might not be forever, and I'd like to reuse a pattern that will allow me to work with larger things (which is why I am marginally concerned about using pandas).
You can actually load an RDD of strings using the CSV reader.
http://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.DataFrameReader
So, assuming lines is an RDD of strings that you parsed as you described, you can run:
df = spark.read.csv(lines, sep='|', header=True, inferSchema=True)
The CSV source will then scan the RDD instead of trying to load files. This lets you perform custom pre-processing prior to parsing.

Creating a dataframe from a csv file in pandas: column issue

I have a messy text file that I need to sort into columns in a dataframe so I
can do the data analysis I need to do. Here is the messy looking file:
Messy text
I can read it in as a csv file, that looks a bit nicer using:
import pandas as pd
data = pd.read_csv('phx_30kV_indepth_0_0_outfile.txt')
print(data)
And this prints out the data aligned, but the issue is that the output is [640 rows x 1 column]. And I need to separate it into multiple columns and manipulate it as a dataframe.
I have tried a number of solutions using StringIO that have worked here before, but nothing seems to be doing the trick.
However, when I do this, there is the issue that the
delim_whitespace=True
Link to docs ^
df = pd.read_csv('phx_30kV_indepth_0_0_outfile.txt', delim_whitespace=True)
Your input file is not actually in CSV format.
As you provided only a .png picture, it is not even clear whether this file
is divided into rows or not.
If not, you have to start by "cutting" the content into individual lines, writing
them to an output file, and reading the content from that file.
I think this is the first step before you can use either read_csv or read_table (with delim_whitespace=True, of course).
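Once the content is split into lines, the whitespace-delimited read could look like the sketch below. The sample text and its column names are invented, since the real file was only shown as a picture. It uses sep=r"\s+", which splits on runs of whitespace in the same way; recent pandas releases appear to deprecate delim_whitespace in favour of this form.

```python
import io

import pandas as pd

# Hypothetical stand-in for the real file: columns separated by
# a varying number of spaces, one record per line.
raw = """energy  depth   counts
30.0    0.10    12
30.0    0.20    15
"""

# A regex separator of one or more whitespace characters splits the
# aligned text into proper columns instead of a single wide column.
df = pd.read_csv(io.StringIO(raw), sep=r"\s+")

print(df.shape)
print(list(df.columns))
```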

Pandas read csv - dealing with mixed named/nameless columns

I am trying to open a csv file using pandas.
This is a screenshot of the file opened in excel.
Some columns have names and some do not. When trying to read this in with pandas I get the "ValueError: Passed header names mismatches usecols" error.
When I open part of the file in excel, add column names, save, and then import with pandas it works.
The problem is the files are large and cannot fully open in excel (plus I'd prefer a more elegant solution anyway).
Is there a way to deal with this issue in pandas?
I have read answers to other questions regarding this error but none were relevant.
Thanks so much in advance!
You can provide the column names via the names parameter:
df = pd.read_csv('pandas_dataframe_importing_csv/example.csv', names=['col1', 'col2', 'col3'], engine='python')
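A small sketch of that idea on made-up data follows. The file contents and the names id, label, value and flag are hypothetical, because the real file was only shown as a screenshot. It assumes the file does have a partially filled header row, so header=0 is added to make pandas discard that row and use the supplied names instead. Without it, the header row would be read as an ordinary data row.

```python
import io

import pandas as pd

# Hypothetical file where only some header cells are filled in.
raw = "id,,value,\n1,x,2.5,y\n2,z,3.5,w\n"

# names supplies a full set of labels; header=0 tells pandas that the
# first line is an existing header to be replaced, not a data row.
df = pd.read_csv(
    io.StringIO(raw),
    names=["id", "label", "value", "flag"],
    header=0,
)

print(list(df.columns))
print(len(df))
```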
