Quotation marks disappearing in pandas read_csv - Python

So, I have a .txt file and I want to read it into pandas.
This is what a line looks like when I open the file in Notepad++:
"1013764";"Test INT"12345678"";"TEST";"TEST";""
Then, to open in pandas, I do this:
data = pd.read_csv("TestFile.TXT", sep=";")
When I print data, the quotation marks have disappeared from the values.
Is there any way to keep the quotation marks from disappearing?

You can remove the leftover quotation marks by replacing them.
Let's say that the column name is col1; then:
df['col1'] = df['col1'].str.replace('"', '')

The simplest solution I could find is
import csv
import pandas as pd

data = pd.read_csv("<your file>", sep=";", quoting=csv.QUOTE_NONE)
(In your case the code above will produce this:
Columns: ["1013764", "Test INT"12345678"", "TEST", "TEST", ""])
The problem with this is that read_csv will parse everything as a string. If you want to preserve the datatypes, I would advise you to use a different quote character in your CSV (such as ') to signal to pandas which fields are strings. This can be done by adding the quotechar parameter:
import pandas as pd

data = pd.read_csv("<your file>", sep=";", quotechar="'")
More information about read_csv can be found in the pandas docs:
https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html

Related

importing a csv file with clean columns using pandas?

So I'm trying to import this CSV file where each value is separated by a comma, but how do I make new rows and columns from the imported data?
I tried importing it as normal and printing the data frame in different ways.
Try the same with
df = pd.read_csv('file_name.csv', sep=',')
and this might work.
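As a quick sanity check (a minimal sketch, assuming the file really is comma-separated and is named file_name.csv as above), you can confirm that pandas split the rows and columns the way you expect:

import pandas as pd

# Read the comma-separated file; pandas uses the first row as the header by default
df = pd.read_csv('file_name.csv', sep=',')

# Inspect the result: column names, overall shape, and the first few rows
print(df.columns.tolist())
print(df.shape)
print(df.head())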

How can I read in csv files with columns as variable names in Python?

This is a Python question. I have a csv file and would like to read that in. The first row in the file are strings and I would like to use them as variable names. The other rows are integers and I would like them to be a vector of the name of the respective variable.
Thanks,
Tim
You need to extract your first row first. I suggest counting the characters of the first row and using this code to read them:
f = open("demofile.txt", "r")
print(f.read(5))  # put your desired character count inside f.read(n)
Once you have successfully read it, save it in a variable, and then use a regex to split it on ",":
import re

txt = "The rain in Spain"
x = re.split("[,]", txt, 1)  # split on the first "," only (maxsplit=1)
print(x)
After that, use dictionary methods to attain your desired result.
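A minimal sketch of that last step, assuming a file named data.csv whose first row holds the variable names and whose remaining rows hold integers; it builds a dictionary that maps each name to the list (vector) of its values:

import csv

columns = {}
with open("data.csv", "r") as f:
    reader = csv.reader(f)        # splits each line on ","
    header = next(reader)         # first row: the variable names
    for name in header:
        columns[name] = []
    for row in reader:            # remaining rows: integer values
        for name, value in zip(header, row):
            columns[name].append(int(value))

print(columns)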
You can simply use pandas to read .csv files. Just install pandas using 'pip install pandas'. Then use the following code:
import pandas as pd
dataframe = pd.read_csv('data.csv')
# Returns a list containing names of the columns
column_names = list(dataframe.columns.values)
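Since the question asks for each column as a vector of its values, a short follow-up sketch (using the same dataframe and column_names from above) collects them into a dictionary keyed by column name:

# Map each column name to a plain Python list of its values
vectors = {name: dataframe[name].tolist() for name in column_names}
print(vectors)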

Unable to parse string quoted csv data using pandas

I am trying to parse this CSV data, which has quotes in an unusual pattern and a semicolon at the end of each row.
I am not able to parse this file correctly using pandas.
Here is the link to the data (the pastebin for some reason was not recognizing it as text/CSV, so it picked up some random formatting; please ignore that):
https://paste.gnome.org/pr1pmw4w2
I have tried using "," as the delimiter, as well as a plain read_csv call with only the file name as a parameter.
header = ["Organization_Name","Organization_Name_URL","Categories","Headquarters_Location","Description","Estimated_Revenue_Range","Operating_Status","Founded_Date","Founded_Date_Precision","Contact_Email","Phone_Number","Full_Description","Investor_Type","Investment_Stage","Number_of_Investments","Number_of_Portfolio_Organizations","Accelerator_Program_Type","Number_of_Founders_(Alumni)","Number_of_Alumni","Number_of_Funding_Rounds","Funding_Status","Total_Funding_Amount","Total_Funding_Amount_Currency","Total_Funding_Amount_Currency_(in_USD)","Total_Equity_Funding_Amount","Total_Equity_Funding_Amount_Currency","Total_Equity_Funding_Amount_Currency_(in_USD)","Number_of_Lead_Investors","Number_of_Investors","Number_of_Acquisitions","Transaction_Name","Transaction_Name_URL","Acquired_by","Acquired_by_URL","Announced_Date","Announced_Date_Precision","Price","Price_Currency","Price_Currency_(in_USD)","Acquisition_Type","IPO_Status,Number_of_Events","SimilarWeb_-_Monthly_Visits","Number_of_Founders","Founders","Number_of_Employees"]
pd.read_csv("data.csv", sep=",", encoding="utf-8", names=header)
First, you can just read the data normally; all of the data will end up in the first column. You can then use the pyparsing module to split each row on "," and assign the result back. I hope this solves your query. You just need to do this for all the rows.
import pyparsing as pp
import pandas as pd
df = pd.read_csv('input.csv')
df.loc[0] = pp.commaSeparatedList.parseString(df['Organization Name'][0]).asList()
Output
df  # (since there are 42 columns, pasting just a snippet)
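To apply the same split to every row instead of just the first one, here is a sketch (assuming, as in the answer above, that the unsplit data sits in a column named 'Organization Name' and that header is the column-name list defined in the question):

import pyparsing as pp
import pandas as pd

raw = pd.read_csv('input.csv')

# Split each raw line on commas; pyparsing keeps quoted fields intact
rows = [pp.commaSeparatedList.parseString(line).asList()
        for line in raw['Organization Name']]

# header is the list of column names from the question
df = pd.DataFrame(rows, columns=header)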

How to fetch input from the csv file in python

I have a csv (input.csv) file as shown below:
VM IP Naa_Dev Datastore
vm1 xx.xx.xx.x1 naa.ab1234 ds1
vm2 xx.xx.xx.x2 naa.ac1234 ds1
vm3 xx.xx.xx.x3 naa.ad1234 ds2
I want to use this csv file as an input file for my Python script. In this file the first line, i.e. (VM IP Naa_Dev Datastore), is the column heading, and each value is separated by a space.
So my question is: how can I use this csv file for input values in Python, so that if I look up the IP of vm1 in the script it picks up xx.xx.xx.x1, and likewise, if I am looking for the VM whose Naa_Dev is naa.ac1234, it picks vm2?
I am using Python version 2.7.8
Any help is much appreciated.
Thanks
When working with tabular data like this, the best way is to use pandas.
Something like:
import pandas

dataframe = pandas.read_csv('csv_file.csv', sep=' ')

# finding the IP by VM name
print(dataframe[dataframe.VM == 'vm1'].IP)
# OUTPUT: xx.xx.xx.x1

# or find the VM by Naa_Dev
print(dataframe[dataframe.Naa_Dev == 'naa.ac1234'].VM)
# OUTPUT: vm2
For importing a csv into Python you can use pandas; in your case the code would look like:
import pandas as pd
df = pd.read_csv('input.csv', sep=' ')
and for locating certain rows in the created dataframe you have multiple options (which you can easily find in the pandas docs or just by googling 'filter data python'), for example:
df['VM'].where(df['Naa_Dev'] == 'naa.ac1234')
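Note that .where() keeps the original index and fills non-matching rows with NaN; if you only want the matching values, boolean indexing with .loc is a common alternative (a sketch against the same df):

# Returns only the matching rows, as a Series of VM names
print(df.loc[df['Naa_Dev'] == 'naa.ac1234', 'VM'])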
Use the pandas module to read the file into a DataFrame. There are a lot of parameters for reading csv files with pandas.read_csv. The dataframe.to_string() function is extremely useful.
Solution:
# import module with alias 'pd'
import pandas as pd

# Open the CSV file; the delimiter is set to a space, and then
# we specify the column names.
dframe = pd.read_csv("file.csv",
                     delimiter=" ",
                     header=0,  # the file's first line is a header row, replaced by the names below
                     names=["VM", "IP", "Naa_Dev", "Datastore"])

# print will output the table
print(dframe)

# to_string will allow you to align and adjust content,
# e.g. justify="left" to align columns to the left.
print(dframe.to_string(justify="left"))
Pandas is probably the best answer but you can also:
import csv

your_list = []
with open('dummy.csv') as csvfile:
    reader = csv.DictReader(csvfile, delimiter=' ')
    for row in reader:
        your_list += [row]

print(your_list)
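To answer the original lookup question with this approach, a short sketch over the your_list built above (each row is a dict keyed by the header names):

# Find the IP of vm1
print(next(row['IP'] for row in your_list if row['VM'] == 'vm1'))              # xx.xx.xx.x1

# Find the VM whose Naa_Dev is naa.ac1234
print(next(row['VM'] for row in your_list if row['Naa_Dev'] == 'naa.ac1234'))  # vm2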

Using %s in Python strips leading zeroes in CSV to XML conversion

Take this test CSV file:
COLUMN1;COLUMN2;COLUMN3;COLUMN4;COLUMN5;COLUMN6;COLUMN7
CODE;1234;0123456789;0987654321;012345678987654321;012345;10110025
I want to convert this file to XML. To do it, I am using the code from this Stack Overflow answer. The complete test code is this:
import csv
import pandas as pd

df = pd.read_csv('test.csv', sep=';')

def convert_row(row):
    return """<root>
<column1>%s</column1>
<column2>%s</column2>
<column3>%s</column3>
<column4>%s</column4>
<column5>%s</column5>
<column6>%s</column6>
<column7>%s</column7>
</root>""" % (
        row.COLUMN1, row.COLUMN2, row.COLUMN3, row.COLUMN4,
        row.COLUMN5, row.COLUMN6, row.COLUMN7)

print '\n'.join(df.apply(convert_row, axis=1))
However, every column value starting with a zero gets stripped of the leading zero character. This is the output:
<root>
<column1>CODE</column1>
<column2>1234</column2>
<column3>123456789</column3>
<column4>987654321</column4>
<column5>12345678987654321</column5>
<column6>12345</column6>
<column7>10110025</column7>
</root>
I thought using %s would keep the original string intact without modifying it in any way; is this not the case?
How can I make sure that the XML output receives exactly the same value as in the CSV file?
The problem doesn't lie with the string formatting but with the CSV import: pandas converts your data to int64 when importing.
Try df = pd.read_csv('test.csv', sep=';', dtype='str') to avoid this.
Hope this helps!
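A quick way to check, as a sketch against the same test.csv: with dtype=str every column is read as a string, so the leading zeros survive the %s formatting.

import pandas as pd

df = pd.read_csv('test.csv', sep=';', dtype=str)

# Every column is now object (string) dtype, so values keep their leading zeros
print(df.dtypes)
print(df.COLUMN3.iloc[0])  # 0123456789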
