Using %s in Python strips leading zeroes in CSV to XML conversion


Take this test CSV file:
COLUMN1;COLUMN2;COLUMN3;COLUMN4;COLUMN5;COLUMN6;COLUMN7
CODE;1234;0123456789;0987654321;012345678987654321;012345;10110025
I want to convert this file to XML. To do so, I am using the code from this Stack Overflow answer. The complete test code is this:
import csv
import pandas as pd
df = pd.read_csv('test.csv', sep=';')
def convert_row(row):
    # Build one <root> element per CSV row.
    return """<root>
<column1>%s</column1>
<column2>%s</column2>
<column3>%s</column3>
<column4>%s</column4>
<column5>%s</column5>
<column6>%s</column6>
<column7>%s</column7>
</root>""" % (
        row.COLUMN1, row.COLUMN2, row.COLUMN3, row.COLUMN4, row.COLUMN5, row.COLUMN6, row.COLUMN7)
print('\n'.join(df.apply(convert_row, axis=1)))
However, every column value starting with a zero gets stripped of the leading zero character. This is the output:
<root>
<column1>CODE</column1>
<column2>1234</column2>
<column3>123456789</column3>
<column4>987654321</column4>
<column5>12345678987654321</column5>
<column6>12345</column6>
<column7>10110025</column7>
</root>
I thought using %s would keep the original string intact without modifying it in any way. Is this not the case?
How can I make sure that the XML output receives exactly the same value as in the CSV file?

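The problem doesn't lie with the string formatting, but with the CSV import. pandas infers a type for each column and converts the all-digit columns to int64 when importing, so the leading zeros are already gone before %s sees the values.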
Try df = pd.read_csv('test.csv', sep=';', dtype='str') to avoid this.
Hope this helps!
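A minimal sketch of the difference, using an in-memory copy of the test file from the question instead of test.csv (the io.StringIO wrapper and the variable names are assumptions made for illustration):

```python
import io
import pandas as pd

CSV_TEXT = (
    "COLUMN1;COLUMN2;COLUMN3;COLUMN4;COLUMN5;COLUMN6;COLUMN7\n"
    "CODE;1234;0123456789;0987654321;012345678987654321;012345;10110025\n"
)

# Default parsing: numeric-looking columns are inferred as integers,
# so the leading zeros are lost before any string formatting happens.
inferred = pd.read_csv(io.StringIO(CSV_TEXT), sep=";")

# dtype=str keeps every cell exactly as it is written in the file.
as_text = pd.read_csv(io.StringIO(CSV_TEXT), sep=";", dtype=str)

print(inferred.loc[0, "COLUMN3"])  # leading zero is gone
print(as_text.loc[0, "COLUMN3"])   # leading zero is kept
```

If only some columns need protecting, dtype also accepts a dict that maps column names to types, which leaves the remaining columns to normal inference.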

Related

Quotation Marks disappearing in pandas read csv

So, I have a .txt file and I want to read it in pandas.
The line is this when I open in Notepad++:
"1013764";"Test INT"12345678"";"TEST";"TEST";""
Then, to open in pandas, I do this:
data = pd.read_csv("TestFile.TXT", sep=";")
When I print "data", it appears like this:
Any solution for the quotation mark not to disappear?

You can remove the quotation marks by replacing them.
Let's say that the column name is col1; then:
df['col1'] = df['col1'].str.replace('"', '')

The simplest solution I could find is:
import csv
import pandas as pd
data = pd.read_csv("<your file>", sep=";", quoting=csv.QUOTE_NONE)
(In your case the code above will produce this:
Columns: ["1013764", "Test INT"12345678"", "TEST", "TEST", ""])
The problem with this is that read_csv will parse everything as a string. If you want to preserve the data types, I would advise you to use a different quote character in your CSV (like ') to signal to pandas that the data is a string. This can be done by adding the quotechar parameter:
import pandas as pd
data = pd.read_csv("<your file>", sep=";", quotechar="'")
More information about read_csv can be found in the pandas docs:
https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html
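As a rough sketch of the QUOTE_NONE approach, the line from the question can be parsed from memory. The header=None and dtype=str arguments are assumptions added here so that the single sample line is treated as data rather than as column names:

```python
import csv
import io
import pandas as pd

LINE = '"1013764";"Test INT"12345678"";"TEST";"TEST";""\n'

# QUOTE_NONE tells the parser to treat quote characters as ordinary data.
# header=None is an assumption: the sample is read as a data row here.
data = pd.read_csv(
    io.StringIO(LINE),
    sep=";",
    header=None,
    quoting=csv.QUOTE_NONE,
    dtype=str,
)

# Every field still carries its original quotation marks.
print(data.iloc[0].tolist())
```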

Saving columns from csv

I am trying to write code that reads a CSV file and saves each column as a specific variable. I am having difficulty because the header is 7 lines long (something I can control, but I would like to just ignore it if I can handle that in code), and my data is full of important decimal places, so it cannot be changed to int (or maybe string?). I've also tried saving each column by its placement in the file, but I am struggling to get that to run. Any ideas?
Image shows my current code that I have slimmed to show important parts and circles data that prints in my console.

To save each column as a specific variable:
import pandas as pd
df = pd.read_csv('file.csv')
x_col = df['X']
https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html

If what you are looking for is a way to iterate through the columns, no matter how many there are (which is what I think you are asking), then this code should do the trick:
import pandas as pd
import csv
data = pd.read_csv('optitest.csv', skiprows=6)
for column in data.columns:
    # You will need to define what this save() function is.
    # It is placed here only as an example.
    save(data[column])
The line about formatting your data as a number or a string was a little vague, but if it's decimal data, then you need to use float. See #9637665.
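A small sketch of the same idea that can be run without the real file. The preamble lines, the column names X and Y, and the dictionary standing in for save() are all hypothetical; the sketch also assumes the seventh line of the file holds the column names, which is what skiprows=6 implies:

```python
import io
import pandas as pd

# Hypothetical file contents: six lines of preamble, then the column names,
# then the data. skiprows=6 drops the preamble only.
TEXT = "".join(f"preamble line {n}\n" for n in range(1, 7)) + (
    "X,Y\n"
    "0.125,10.5\n"
    "0.250,11.75\n"
)

data = pd.read_csv(io.StringIO(TEXT), skiprows=6)

# One entry per column, keyed by the column name.
columns = {name: data[name] for name in data.columns}

print(columns["X"].dtype)     # decimal data is inferred as a float type
print(columns["Y"].tolist())
```

Keeping the columns in a dictionary avoids having to know in advance how many columns there are.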

Extracting individual rows from dataframe

I am currently doing one of my final assignments and I have a CSV file with a few columns of different data.
I am interested in extracting a single column and writing each of its rows to a separate txt file.
Here is my code:
import pandas as pd
import csv
df = pd.read_csv("AUS_NZ.csv")
print(df.head(10))
print(df["content"])
num_of_review = len(df["content"])
print(num_of_review)
for i in range(num_of_review):
    with open("{}.txt".format(i), "a", encoding="utf-8") as f:
        f.write(df["content"][i])
There is no issue with extracting the individual rows. But when I examine the extracted txt files, I notice that the text was copied (which is what I want) but copied twice (which is not what I want).
Example:
"This is an example of what the dataframe have at that particular column which I want to convert to a txt file."
This is what was copied to the txt file:
"This is an example of what the dataframe have at that particular column which I want to convert to a txt file.This is an example of what the dataframe have at that particular column which I want to convert to a txt file."
Any advice on how to copy the content only once?

Thanks! While thinking about how to rectify this, I came to the same conclusion as you. I switched from "a" to "w" and it solved the issue.
I am so used to append that I tried it before I tried write.
The correct code:
import pandas as pd
import csv
df = pd.read_csv("AUS_NZ.csv")
print(df.head(10))
print(df["content"])
num_of_review = len(df["content"])
print(num_of_review)
for i in range(num_of_review):
    # "w" truncates the file on open, so rerunning the script does not duplicate the text.
    with open("{}.txt".format(i), "w", encoding="utf-8") as f:
        f.write(df["content"][i])
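The difference between the two modes can be sketched in isolation. This assumes the duplication came from running the script more than once, which the thread does not state explicitly; the file names and the sample text are made up:

```python
import os
import tempfile

text = "example review text"

with tempfile.TemporaryDirectory() as folder:
    appended = os.path.join(folder, "appended.txt")
    written = os.path.join(folder, "written.txt")

    # Simulate running the export twice.
    for _ in range(2):
        with open(appended, "a", encoding="utf-8") as f:
            f.write(text)  # "a" adds to whatever is already in the file
        with open(written, "w", encoding="utf-8") as f:
            f.write(text)  # "w" truncates the file before writing

    with open(appended, encoding="utf-8") as f:
        appended_content = f.read()
    with open(written, encoding="utf-8") as f:
        written_content = f.read()

print(appended_content == text * 2)  # the appended file holds the text twice
print(written_content == text)       # the rewritten file holds it once
```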

Unable to parse string quoted csv data using pandas

I am trying to parse this CSV data, which has quotes in an unusual pattern and a semicolon at the end of each row.
I am not able to parse this file correctly using pandas.
Here is the link to the data (for some reason the pastebin did not recognize it as text/CSV and picked some random formatting; please ignore that):
https://paste.gnome.org/pr1pmw4w2
I have tried using "," as the delimiter, and also a plain call that constructs the pandas DataFrame with only the file name as a parameter.
header = ["Organization_Name","Organization_Name_URL","Categories","Headquarters_Location","Description","Estimated_Revenue_Range","Operating_Status","Founded_Date","Founded_Date_Precision","Contact_Email","Phone_Number","Full_Description","Investor_Type","Investment_Stage","Number_of_Investments","Number_of_Portfolio_Organizations","Accelerator_Program_Type","Number_of_Founders_(Alumni)","Number_of_Alumni","Number_of_Funding_Rounds","Funding_Status","Total_Funding_Amount","Total_Funding_Amount_Currency","Total_Funding_Amount_Currency_(in_USD)","Total_Equity_Funding_Amount","Total_Equity_Funding_Amount_Currency","Total_Equity_Funding_Amount_Currency_(in_USD)","Number_of_Lead_Investors","Number_of_Investors","Number_of_Acquisitions","Transaction_Name","Transaction_Name_URL","Acquired_by","Acquired_by_URL","Announced_Date","Announced_Date_Precision","Price","Price_Currency","Price_Currency_(in_USD)","Acquisition_Type","IPO_Status,Number_of_Events","SimilarWeb_-_Monthly_Visits","Number_of_Founders","Founders","Number_of_Employees"]
pd.read_csv("data.csv", sep=",", encoding="utf-8", names=header)

First, read the data normally; all of the data will end up in the first column. You can then use the pyparsing module to split each row on ',' and assign the result back. The code below does this for the first row; you need to do the same for all the rows. I hope this solves your query.
import pyparsing as pp
import pandas as pd
df = pd.read_csv('input.csv')
df.loc[0] = pp.commaSeparatedList.parseString(df['Organization Name'][0]).asList()
Output
df  # (since there are 42 columns, pasting just a snippet)
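For comparison, the same kind of split can be sketched with the standard library's csv module instead of pyparsing. This is a swapped-in technique, not the method from the answer, and the sample row below is hypothetical rather than taken from the linked data:

```python
import csv

# Hypothetical row: the whole record arrived as a single string.
raw = '"Acme, Inc.",https://example.com/acme,"Software, SaaS"'

# csv.reader accepts any iterable of lines; commas inside quotes stay in the field.
fields = next(csv.reader([raw]))

print(fields)
```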

Creating a dataframe from a csv file in pandas: column issue

I have a messy text file that I need to sort into columns in a dataframe so I
can do the data analysis I need to do. Here is the messy looking file:
Messy text
I can read it in as a csv file, that looks a bit nicer using:
import pandas as pd
data = pd.read_csv('phx_30kV_indepth_0_0_outfile.txt')
print(data)
And this prints out the data aligned, but the issue is that the output is [640 rows x 1 column]. And I need to separate it into multiple columns and manipulate it as a dataframe.
I have tried a number of solutions using StringIO that have worked here before, but nothing seems to be doing the trick.
However, when I do this, there is the issue that the

Pass delim_whitespace=True to read_csv so that the columns are split on runs of whitespace:
df = pd.read_csv('phx_30kV_indepth_0_0_outfile.txt', delim_whitespace=True)
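Recent pandas releases deprecate delim_whitespace in favor of the equivalent sep=r"\s+". A minimal sketch with that spelling follows; the column names and values are invented stand-ins, since the real file is only shown as an image:

```python
import io
import pandas as pd

# Hypothetical whitespace-aligned text standing in for the real file.
TEXT = (
    "depth   energy   counts\n"
    "0.0     30.0     120\n"
    "0.5     29.4     98\n"
)

# sep=r"\s+" splits on runs of whitespace, like delim_whitespace=True.
df = pd.read_csv(io.StringIO(TEXT), sep=r"\s+")

print(df.shape)
print(df.columns.tolist())
```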

Your input file is not actually in CSV format.
Since you provided only a .png picture, it is not even clear whether this file
is divided into rows.
If it is not, you first have to cut the content into individual lines and
then read the file that results from this cutting.
I think this is the first step before you can use either read_csv or read_table (with delim_whitespace=True, of course).
