Convert bytestring containing line breaks inside quotes to CSV file - python

My goal is to create a CSV file from an API call.
The problem is that the API returns a bytestring as content, and I don't know how to convert it to a CSV file properly.
The content part of the response looks like this:
b'column_title_1, column_title_2, column_title_3, value1_1, value1_2, value1_3\nvalue2_1, value2_2(="Perfect!\nThank you"), value2_3\n, value3_1, value3_2, value3_3\n....'
How can I manage to get a clean CSV file from this? I tried Pandas, the CSV module and Numpy. Unfortunately, I was not able to handle the newline escapes which are sometimes within a string value (it is a column for comments) - see value2_2.
The result should look like this:
column_title_1, column_title_2, column_title_3
value1_1, value1_2, value1_3
value2_1, value2_2(="Perfect!\nThank you"), value2_3
value3_1, value3_2, value3_3
The closest of my results was this:
column_title_1, column_title_2, column_title_3
value1_1, value1_2, value1_3
value2_1, value2_2(="Perfect!
Thank you"), value2_3
value3_1, value3_2, value3_3
Although I got close, I could not keep the \n inside the values of some columns from being treated as a row break.
I could not figure out how to leave out only the \n characters that sit between double quotes.
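One possible approach, offered only as a sketch: decode the bytes, then use a regular expression to find each double-quoted span and replace the real line breaks inside it with the two literal characters \ and n. The function name escape_newlines_in_quotes and the shortened sample data are made up for illustration. The sketch assumes the quotes are balanced and that no quoted value contains an escaped double quote. The csv module is not used here because in the sample the quotes do not start at the beginning of a field, so a CSV parser would not treat them as field quoting.

```python
import re


def escape_newlines_in_quotes(raw: bytes, encoding: str = "utf-8") -> str:
    """Return the decoded text with line breaks inside "..." made literal."""
    text = raw.decode(encoding)

    def _escape(match: re.Match) -> str:
        # Replace real line breaks with a backslash followed by the letter n.
        return match.group(0).replace("\r\n", "\\n").replace("\n", "\\n")

    # "[^"]*" matches one double-quoted span, including embedded line breaks.
    return re.sub(r'"[^"]*"', _escape, text)


# Shortened, hypothetical sample in the same shape as the API content.
sample = b'a, b, c\nv1, v2(="Perfect!\nThank you"), v3\nx, y, z\n'
cleaned = escape_newlines_in_quotes(sample)
```

The cleaned text can then be written out with open("out.csv", "w", newline="") and fh.write(cleaned), so that each record ends up on exactly one physical line.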

Related

printSchema having all columns in the first one

I loaded a text file with the CSV loader, but when I print the schema it shows a single field under root that contains every column name, like this:
root
|-- Prscrbr_Geo_Lvl Prscrbr_Geo_Cd Prscrbr_Geo_Desc Brnd_Name
Any idea how to fix this?
Adding my comment as an answer since it seems to have solved the problem.
From the output, it looks like the CSV file is actually using tab characters as the separator between columns instead of commas. To get Spark to use tabs as the separator, you can use spark.read.format("csv").option("sep", "\t").load("/path/to/file")

text response from get request into a python pandas data frame excluding begin and end lines

I am new to Python. I am working on code that sends a GET request to an API and receives the response as text. When I use
print(response.text)
I get the response in the following format:
ResponseBegin
Name|Age|Gender|Country
"ABC"|23|M|USA
"ABCD"|21|F|CAN
ResponseEnd
Can anyone please advise how to convert this into a pandas DataFrame, removing the ResponseBegin and ResponseEnd lines at the beginning and end, and using the second row as the column header with | as the delimiter?
Thank you very much for your advice.
It's more helpful if you do not show what print(response.text) contains, but just what response.text contains, since the print function is doing some formatting for human readability.
But I will assume that response.text is just a single string that looks like this:
'ResponseBegin\nName|Age|Gender|Country\n"ABC"|23|M|USA\n"ABCD"|21|F|CAN\nResponseEnd'
Notice the \n, which is the "newline" character.
There are several ways to solve this, but the easiest (fewest lines of code) I think is to export it to a CSV file and then read it in:
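import pandas as pd
with open('mydf.csv', 'w') as fh:
    fh.write(response.text)
# skipfooter is only supported by the Python parsing engine, so request it explicitly
df = pd.read_csv('mydf.csv', sep='|', skiprows=1, skipfooter=1, engine='python')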
You can read more about read_csv for all of its handy tools, but here I am using:
sep: the thing to use as a separator, | in your case
skiprows/skipfooter: the number of lines at the beginning or end to skip
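If writing a temporary file is undesirable, the same idea can be sketched in memory with only the standard library: drop the first and last lines, then parse the rest with | as the delimiter. The function name parse_wrapped_response is hypothetical, and the sample string is the one assumed above. All values come back as strings here, so numeric columns would still need converting.

```python
import csv
import io


def parse_wrapped_response(text: str, delimiter: str = "|") -> list:
    """Drop the first and last lines, then parse the rest as delimited text."""
    lines = text.strip().splitlines()
    body = lines[1:-1]  # skip the ResponseBegin / ResponseEnd wrapper lines
    reader = csv.DictReader(io.StringIO("\n".join(body)), delimiter=delimiter)
    return list(reader)


sample = ('ResponseBegin\nName|Age|Gender|Country\n'
          '"ABC"|23|M|USA\n"ABCD"|21|F|CAN\nResponseEnd')
rows = parse_wrapped_response(sample)
```

The resulting list of dictionaries can be passed to pd.DataFrame(rows) if a DataFrame is needed.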

Overflow error when reading json file

I am trying to read a json which includes a number of tweets, but I get the following error.
OverflowError: int too large to convert
The script filters multiple JSON files to extract specific tweets, and it crashes when it reaches one particular file.
The line that creates the error is this one :
df_temp = pd.read_json(path_or_buf=json_path, lines=True)
Just store the user ID as a string and treat it as one (this is what you should do with this kind of ID anyway). If you can't change the JSON input format, you can read the file as plain text first and add quotes around the ID values before parsing it as JSON, for instance with a regular expression (see: Regex in python).
I don't know which library you are using to parse the JSON, but an explicit conversion may also work: read the value as a string instead of an integer where the parser allows it, or convert it in Python with something like x = str(record["id"]), where record and "id" stand for your parsed object and its ID field.
Python does not convert integers to strings implicitly, so the conversion has to be explicit.
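As a rough sketch of the "treat the ID as a string" idea, the file can be parsed one line at a time with the standard json module, which handles arbitrarily large integers, and the ID converted to a string afterwards. The function name parse_json_lines and the key "id" are assumptions for illustration; the real tweet files may use different field names.

```python
import json


def parse_json_lines(lines, id_keys=("id",)):
    """Parse JSON Lines input, converting the listed keys to strings."""
    records = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        record = json.loads(line)
        for key in id_keys:
            if key in record:
                # Keep large IDs as text so they never hit an integer limit.
                record[key] = str(record[key])
        records.append(record)
    return records


sample_lines = ['{"id": 123456789012345678901234567890, "text": "hello"}\n']
records = parse_json_lines(sample_lines)
```

With a real file, the function could be called as parse_json_lines(fh) inside a with open(json_path, encoding="utf-8") as fh: block, and the result passed to pd.DataFrame(records).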

Adding linebreaks within cells in CSV - Python 3

This is essentially the same question asked here: How can you parse excel CSV data that contains linebreaks in the data?
But I'm using Python 3 to write my CSV file. Does anyone know if there's a way to add line breaks to cell values from Python?
Here's an example of what the CSV should look like:
"order_number1", "item1\nitem2"
"order_number2", "item1"
"order_number3", "item1\nitem2\nitem3"
I've tried inserting HTML line breaks between the items, but the system I upload the data to doesn't seem to recognize HTML.
Any and all help is appreciated.
Thanks!
Figured it out after playing around and I feel so stupid.
for key in dictionary:
    outfile.writerow({
        "Order ID": key,
        "Item": "\n".join(dictionary[key])
    })
Here's an example of what the CSV should look like:
"order_number1", "item1\nitem2"
"order_number2", "item1"
"order_number3", "item1\nitem2\nitem3"
The proper way to use newlines in fields is like this:
"order_number1","item1
item2"
"order_number2","item1"
"order_number3","item1
item2
item3"
The \n sequences you show are just literal characters inside the string. Some software may convert them to a line break; other software may not.
Also, try to avoid spaces around the separators.
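The layout described above, with real line breaks inside quoted fields and no spaces around the separators, is what Python's csv module produces when a value contains a newline. Here is a minimal sketch, assuming the orders are held in a dictionary that maps an order ID to a list of items; the function name orders_to_csv and the sample data are made up for illustration.

```python
import csv
import io


def orders_to_csv(orders: dict) -> str:
    """Write one row per order, joining the items with real line breaks."""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer,
        fieldnames=["Order ID", "Item"],
        quoting=csv.QUOTE_ALL,   # quote every field, as in the example above
        lineterminator="\n",
    )
    writer.writeheader()
    for order_id, items in orders.items():
        writer.writerow({"Order ID": order_id, "Item": "\n".join(items)})
    return buffer.getvalue()


orders = {"order_number1": ["item1", "item2"], "order_number2": ["item1"]}
csv_text = orders_to_csv(orders)

# Reading the text back shows the line break survives inside the cell.
rows = list(csv.reader(io.StringIO(csv_text)))
```

When writing to a real file instead of a string buffer, open it with newline="" so that the csv module controls the line endings.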

Python replace plus sign from excel

The data I pull from DB comes in the following format:
+jacket
online trading account
+neptune
When I write this data to a CSV I end up with a #NAME? error. I tried adding single quote ' to the front of the values when I pull the data, however, this does not fix the issue. I need to write the values exactly as they come, with the plus sign at the front.
You simply need to format the desired output column as a text column. This will result in:
+jacket
online trading account
+neptune
being written to the file exactly as is. No more #NAME? errors.
