Pandas, remove outer quote marks from specific columns on export? - python

I have a specific problem: we are migrating from an old system to a new one. The old database was adjusted to the new one with Pandas. However, I am facing an issue.
If the file is opened with SQL or as a CSV, the value has outer quotes:
"UUID_TO_BIN('5e6f7922-8ae9-11ea-a3bd-888888888788', true)"
I need to make sure it has no outer quotes, like this:
UUID_TO_BIN('5e6f7922-8ae9-11ea-a3bd-888888888788', true)
What would be a pandas solution to remove them for specific columns when exporting/saving to SQL or CSV? Right now the value is stored as a string, so it is exported with the quotes.

If your problem is that the old system produces files like .csv with the quotes, you might just want to edit the .csv file itself as described here.
If your problem is that pandas saves it as a string with double quotes, you can either run the same fix on the csv output of pandas, or you can pass the .to_csv() function the argument
quoting=csv.QUOTE_NONE
(using the standard csv module; note that quotechar must be a single character, so an empty quotechar is not accepted), for which you can find more info on this page.
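A minimal sketch of that export (the column name and file name are assumptions). Note the value itself contains commas, so with the default ',' separator QUOTE_NONE would need an escapechar; switching to ';' sidesteps that:
import csv
import pandas as pd

df = pd.DataFrame({
    'uuid': ["UUID_TO_BIN('5e6f7922-8ae9-11ea-a3bd-888888888788', true)"],
})

# QUOTE_NONE disables quoting entirely, so the value is written verbatim
df.to_csv('out.csv', sep=';', index=False, quoting=csv.QUOTE_NONE)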

Try to read your data in Pandas using:
df = pd.read_csv(filename, sep=',').replace('"','', regex=True)
This line should remove the " characters from your data as they are read.
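If only specific columns should be cleaned, as the question asks, the same idea works per column (the column name 'uuid' here is an assumption):
# Strip the outer double quotes from one column only
df['uuid'] = df['uuid'].str.strip('"')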

Related

Trying to write/read CSV file with None objects for empty cells [Python]

I'm trying to read CSV data into a pandas DataFrame so that empty cells will be recognized as None values.
The delimiter is ',' and I have two of them in a row wherever I need a None value. For example, the row:
12345,'abc','abc',,,12,'abc'
should be converted to a tuple with the empty cells replaced by None:
(12345,'abc','abc',None,None,12,'abc',)
I need this in order to insert the data into MySQL later; I'm using the cursor.execute() function with the query and the data.
I have tried to load the CSV file into a DataFrame and replace the NaNs, but this form is not supported:
chunk = chunk.replace(np.nan, None, regex=True)
Any suggestions?
Sorry, I did not fully grasp the question, but if it is in regards to CSV, why not use arbitrary placeholder values of your choice, or even empty strings, that you can then change later in the program when you want to write the data out or when you read it.
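For the NaN-to-None conversion itself, one approach I believe works is to cast to object dtype first, so that float columns can actually hold None (the file name and insert_query are assumptions):
import pandas as pd

chunk = pd.read_csv('data.csv', header=None)

# Cast to object so NaN can be swapped for a true Python None,
# then build plain tuples for cursor.execute()/executemany()
chunk = chunk.astype(object).where(chunk.notnull(), None)
rows = list(chunk.itertuples(index=False, name=None))
# cursor.executemany(insert_query, rows)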

Python/Pandas adding quotes to string

I'm using Python/Pandas to edit a csv file created by another program.
One of the columns contains values wrapped in double quotes:
"RGB(0,255,255)"
for example.
This is just how it is output by the program, and I need to preserve these quotes in order for it to be read back into the program once I have edited it. Currently, when I try exporting the edited data frame to a .csv, the quotes around the values disappear, so the values look like this:
RGB(0,255,255)
I tried adding quotes manually to the values in the column before exporting, but now the .csv file has triple quotes so looks like this:
"""RGB(0,255,255)"""
I'm not doing anything with this particular column; I literally just need it to retain the format it had before being read into my Python script. I'm assuming there are some arguments in either my read_csv or to_csv commands, but I'm not sure where to start. Any help gratefully appreciated!
Save the DataFrame as a pickle instead.
df.to_pickle('test.pkl')
# To load the dataframe again
df = pd.read_pickle('test.pkl')
This will preserve the structures!
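If the file does have to stay a .csv, note that pandas' default QUOTE_MINIMAL quoting re-adds the double quotes on export whenever a field contains the separator, so values with commas such as RGB(0,255,255) may round-trip without any manual quoting; a small sketch:
import pandas as pd

df = pd.DataFrame({'color': ['RGB(0,255,255)']})

# The commas inside the field force QUOTE_MINIMAL to wrap it in quotes
df.to_csv('out.csv', index=False)
# out.csv:
# color
# "RGB(0,255,255)"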

Pyspark: how to read a .csv file?

I am trying to read a .csv file that has a strange format.
This is what I am doing
df = spark.read.format('csv').option("header", "true").option("delimiter", ',').load("muyFile.csv")
df.show(5)
I do not understand why the lonlat entry of the third id is transposed. It seems that the file has two different delimiters. Your help would be much appreciated!
Your tag field probably contains a comma as a value, which is treated as the delimiter.
Enclose your data in quotes or any other quote char (remember to set .option('quote','')) and read the data again. It should work.
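A sketch of such a read, assuming the fields are wrapped in double quotes:
df = (spark.read.format('csv')
      .option('header', 'true')
      .option('delimiter', ',')
      .option('quote', '"')   # fields wrapped in this char may contain commas
      .option('escape', '"')  # doubled quotes inside a quoted field
      .load('muyFile.csv'))
df.show(5)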

PySpark caused mismatch column when reading from csv

Edit: The previous problem was solved by setting the argument multiLine to True in the spark.read.csv function. However, I discovered another problem when using the spark.read.csv function.
Another problem I encountered was with another csv file in the same dataset as described in the question. It is a review dataset from insideairbnb.com.
The csv file is like this: [screenshot of the csv file omitted]
But the output of the read.csv function concatenated several lines together and generated a weird format.
Any thoughts? Thank you for your time.
The problem below was solved by setting the argument multiLine to True in the spark.read.csv function. The root cause was that there were \r\n\n\r strings in one of the columns, which the function treated as line separators instead of part of the string.
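A minimal sketch of that fix (the escape setting is an assumption for quoted fields):
listings = (spark.read
            .option('header', 'true')
            .option('multiLine', 'true')  # keep \r\n / \n inside quoted fields in one record
            .option('escape', '"')
            .csv('listings.csv'))
listings.createOrReplaceTempView('listings')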
I attempted to load a large csv file to a spark dataframe using PySpark.
listings = spark.read.csv("listings.csv")
# Loading to SparkSession
listings.createOrReplaceTempView("listings")
When I tried to get a glance at the result using Spark SQL with the following code:
listing_query = "SELECT * FROM listings LIMIT 20"
spark.sql(listing_query).show()
I got a result with mismatched columns, which is very weird considering that reading the csv with pandas outputs the table in the correct format without the mismatched columns.
Any idea about what caused this issue and how to fix it?

How do I prevent a value from converting to a date or executing as division?

I have a column in a dataframe that has values in the format XX/XX (Ex: 05/23, 4/22, etc.). When I convert it to a csv, it converts to a date. How do I prevent this from happening?
I tried putting an equals sign in front but then it executes like division (Ex: =4/20 comes out to 0.5).
df['unique_id'] = '=' + df['unique_id']
I want the output to be in the original format XX/XX (Ex: 5/23 stays 5/23 in the csv file in Excel).
Check the datatypes of your dataframe with df.dtypes. I assume your column is interpreted as a date. Then you can do df[col] = df[col].astype(np_type_you_want).
If that doesn't bring the desired result, check why the column is interpreted as a date when creating the df. The solution depends on where you get the data from.
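A sketch of that check, forcing the column to stay a plain string on read (the file name is an assumption; the column name comes from the question):
import pandas as pd

# Force the column to be read as str so nothing is parsed as a number or date
df = pd.read_csv('data.csv', dtype={'unique_id': str})
print(df.dtypes)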
The issue is not an issue with Python or pandas. The issue is that Excel thinks it's clever and assumes it knows your data type. You were close with trying to put an = before your data, but your data also needs to be wrapped in quotes and prefixed with the =. I can't claim to have come up with this answer myself; I obtained it from this answer.
The following code will allow you to write a CSV file that will then open in Excel without any formatting trying to convert to a date or execute as division. However, it should be noted that this is only really a strategy if you will only be opening the CSV in Excel, as you are wrapping formatting info around your data which will then be stripped out by Excel. If you are using this csv in any other software, you might need to rethink it.
import pandas as pd

data = {'key1': [r'4/5']}
df = pd.DataFrame.from_dict(data)

# Wrap each value as ="..." so Excel treats it as a literal string
df['key1'] = '="' + df['key1'] + '"'
print(df)
print(df.dtypes)

# Write the csv; Excel strips the ="..." wrapper when displaying the value
df.to_csv(r'C:\Users\cd00119621\myfile.csv')
RAW OUTPUT in file
,key1
0,"=""4/5"""
EXCEL OUTPUT: Excel evaluates ="4/5" and displays the literal text 4/5, not a date and not 0.8.
