Save changes in a pandas column in Python after text cleaning

I'm fairly new to Python. I have opened my CSV file using pandas and applied text-cleaning steps to one of the columns (after copying the raw column "message").
My problem is:
When I convert my dataframe back into a CSV, the new column does not include the changes I've applied, such as the removal of special characters. What am I doing wrong?
Thank you in advance.
This is the code that I've run:
Then I have converted into csv by adding:
df.to_csv(r'Path\filename.csv')
Sorted! :D

You can use the DataFrame .to_csv() method: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.to_csv.html
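In case it helps: pandas string methods return a new Series rather than modifying the column in place, so the cleaned result has to be assigned back to a column before calling .to_csv(). A minimal sketch, assuming the cleaning uses .str.replace; the cleaned column name and the output file name are hypothetical:

import pandas as pd

df = pd.read_csv(r'Path\filename.csv')   # placeholder path from the question

# .str.replace returns a new Series; it must be assigned back to the DataFrame
df['message_clean'] = df['message'].str.replace(r'[^A-Za-z0-9 ]', '', regex=True)

df.to_csv(r'Path\filename_clean.csv', index=False)   # hypothetical output name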

Related

Dataframe to CSV returns one empty column which is visible in the dataframe

After scraping I have put the information in a dataframe and want to export it to a .csv, but one of the three columns ("Content") comes out empty in the .csv file. This is weird since all three columns are visible in the dataframe, see screenshot.
Screenshot dataframe
Line I use to convert:
df.to_csv('filedestination.csv')
Inspecting the df returns objects:
Inspecting dataframe
Does anyone know how it is possible that the last column, "Content" does not show any data in the .csv file?
Screenshot .csv file
After the suggestions, it seems that the data is available when opening the file as .txt. How is it possible that Excel does not show the data properly?
Screenshot .txt file data
What is the data type of the Content column?
If it is not a string, you can convert it to a string and then perform df.to_csv.
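A minimal sketch of that suggestion (the input file name is hypothetical; the output path is the one from the question):

import pandas as pd

df = pd.read_csv('scraped_data.csv')            # hypothetical input file
df['Content'] = df['Content'].astype(str)       # cast the Content column to string
df.to_csv('filedestination.csv', index=False)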
Sometimes this happens, weirdly: the view and the export will differ. Try resetting the index before exporting to .csv/Excel. This always works for me.
df = df.reset_index()  # reset_index returns a new DataFrame unless inplace=True is passed
then,
df.to_csv(r'file location/filename.csv')

Spark dataframe with strange values after reading CSV

Coming from here, I'm trying to read the correct values from this dataset in Pyspark. I made good progress using df = spark.read.csv("hashtag_donaldtrump.csv", header=True, multiLine=True), but now I have some weird values in some cells, as you can see in this picture (last lines):
Do you know how I could get rid of them? Or else, how can I fix the CSV's formatting using another program? It's very hard for me to use a text editor like Vim or Nano and try to guess where the errors are. Thank you!
Spark seems to have difficulty in reading this line:
2020-10-15 00:00:23,1.3165293165079306e+18,"""IS THIS WRONG??!!"" ...
because there are three consecutive double quotes. However, pandas seems to handle that well, so as a workaround you can use pandas to read the csv file first and then convert it to a Spark dataframe. Normally this is not recommended because of the large overhead involved, but for this small csv file the performance hit should be acceptable.
import pandas as pd
df = spark.createDataFrame(pd.read_csv('hashtag_donaldtrump.csv').replace({float('nan'): None}))
The replace call swaps nan for None in the pandas dataframe. Spark treats nan as a float, and it gets confused when nan appears in string-typed columns.
If the file is too large for pandas, then you can consider dropping those rows that Spark cannot parse using mode='DROPMALFORMED':
df = spark.read.csv('hashtag_donaldtrump.csv', header=True, multiLine=True, mode='DROPMALFORMED')

Fixed Width File manipulation in Pandas

I have a fixed-width file with the following format:
5678223313570888271712000000024XAXX0101010006461801325345088800.0784001501.25abc#yahoo.com
5678223324686600271712000000070XAXX0101010006461801325390998280.0784001501.25abcde.12345#gmail.com 5678123422992299
Here's what I tried:
import pandas as pd
ColSpecs = [(0,16),(16,31),(31,44),(44,62),(62,70),(70,73),(73,77),(77,127),(127,143)]
df = pd.read_fwf("~/filename.txt", colspecs=ColSpecs, header=None)  # note: lowercase 'header'; None because the sample file has no header row
Now this surely helps me to convert cleanly into a pandas DataFrame. However, the blanks (fixed-width padding spaces) get trimmed off. For example, the email field (#8) is fixed at 50 characters; it gets truncated as soon as it is imported into the pandas dataframe.
For the data manipulation, I am creating 3 new fields that are extracted from the values of the previously imported fields.
Final Output file structure:
[(0,16),(16,31),(31,44),(44,62),(62,70),(70,73),(73,77),(77,127),(127,143),(143,153),(153,163),(164,165)]
Since I haven't found any to_fwf method on dataframes, or any other alternative for pandas -> flat file that keeps the original field widths intact, I would really appreciate it if anyone has a better solution.
P.S.: I read that awk/sed in Unix works better, but I would still like to know how to do it in Python.
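pandas has no built-in to_fwf, so one workaround is to pad every field back to its fixed width by hand. A rough sketch using the question's colspecs (the output path and the blank-filling of missing values are assumptions):

import pandas as pd

colspecs = [(0, 16), (16, 31), (31, 44), (44, 62), (62, 70),
            (70, 73), (73, 77), (77, 127), (127, 143)]
widths = [stop - start for start, stop in colspecs]        # 16, 15, 13, 18, 8, 3, 4, 50, 16

df = pd.read_fwf('~/filename.txt', colspecs=colspecs, header=None, dtype=str)
df = df.fillna('')                                         # assumption: blanks for missing fields

with open('output_fixed_width.txt', 'w') as out:           # hypothetical output path
    for _, row in df.iterrows():
        # left-justify every value back to its original column width
        out.write(''.join(str(v).ljust(w) for v, w in zip(row, widths)) + '\n')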

Creating a dataframe from a csv file in pandas: column issue

I have a messy text file that I need to sort into columns in a dataframe so I
can do the data analysis I need to do. Here is the messy looking file:
Messy text
I can read it in as a csv file, that looks a bit nicer using:
import pandas as pd
data = pd.read_csv('phx_30kV_indepth_0_0_outfile.txt')
print(data)
This prints out the data aligned, but the issue is that the output is [640 rows x 1 column], and I need to separate it into multiple columns and manipulate it as a dataframe.
I have tried a number of solutions using StringIO that have worked here before, but nothing seems to be doing the trick.
However, when I do this, there is the issue that the …
Try the delim_whitespace=True argument of read_csv (link to the docs):
df = pd.read_csv('phx_30kV_indepth_0_0_outfile.txt', delim_whitespace=True)
Your input file is actually not in CSV format.
Since you provided only a .png picture, it is not even clear whether this file is divided into rows at all.
If it is not, you have to start by "cutting" the content into individual lines and reading the result of that cutting.
I think this is the first step before you can use either read_csv or read_table (with delim_whitespace=True, of course).
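A rough sketch of that idea, assuming the file already has (or has been cut into) one record per line and that fields are separated by runs of whitespace (header=None is an assumption; adjust it to the actual file):

import pandas as pd

df = pd.read_csv(
    'phx_30kV_indepth_0_0_outfile.txt',
    sep=r'\s+',        # same effect as delim_whitespace=True
    header=None,       # assumption: the file has no header row
)
print(df.shape)
print(df.head())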

Pyspark - write a dataframe into 2 different csv files

I want to save a single DataFrame into 2 different csv files (splitting the DataFrame) - one would include just the header and another would include the rest of the rows.
I want to save the 2 files under the same directory, so having Spark handle all the logic would be the best option if possible, instead of splitting the csv file using pandas.
What would be the most efficient way to do this?
Thanks for your help!
Let's assume you've got a Dataset called "df".
You can:
Option one: write twice:
df.write.option("header", "false").csv(...)
df.limit(1).write.option("header", "true").csv(...) // as far as I remember, someone had problems with saving a DataFrame without rows -> you must write at least one row and then manually cut that row out using the normal Java or Python file API
Or you can write once with header = true and then manually cut the header out and place it in a new file using the normal Java file API.
Data, without header:
df.to_csv("data_only.csv", header=False)
Header, without data:
df_new = pd.DataFrame(data=None, columns=df_old.columns) # data=None makes sure no rows are copied to the new dataframe
df_new.to_csv("header_only.csv", index=False)
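For completeness, a hedged PySpark sketch of the two-file approach (the input file and output directory names are hypothetical; it assumes an existing SparkSession):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.read.csv('input.csv', header=True)      # hypothetical input

# 1) all rows, without the header
df.write.mode('overwrite').option('header', 'false').csv('out/rows_only')

# 2) header only: a one-row DataFrame holding the column names,
#    written without a header so its single line becomes the header
header_df = spark.createDataFrame([tuple(df.columns)], df.columns)
header_df.write.mode('overwrite').option('header', 'false').csv('out/header_only')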
