Unable to parse string quoted csv data using pandas - python

I am trying to parse CSV data that has quotes in an unusual pattern and a semicolon at the end of each row.
I am not able to parse this file correctly using pandas.
Here is a link to the data (the pastebin did not recognize it as text/CSV, so it picked some random syntax highlighting; please ignore that):
https://paste.gnome.org/pr1pmw4w2
I have tried using "," as the delimiter, and also a plain pandas call that constructs the DataFrame with only the file name as a parameter.
header = ["Organization_Name","Organization_Name_URL","Categories","Headquarters_Location","Description","Estimated_Revenue_Range","Operating_Status","Founded_Date","Founded_Date_Precision","Contact_Email","Phone_Number","Full_Description","Investor_Type","Investment_Stage","Number_of_Investments","Number_of_Portfolio_Organizations","Accelerator_Program_Type","Number_of_Founders_(Alumni)","Number_of_Alumni","Number_of_Funding_Rounds","Funding_Status","Total_Funding_Amount","Total_Funding_Amount_Currency","Total_Funding_Amount_Currency_(in_USD)","Total_Equity_Funding_Amount","Total_Equity_Funding_Amount_Currency","Total_Equity_Funding_Amount_Currency_(in_USD)","Number_of_Lead_Investors","Number_of_Investors","Number_of_Acquisitions","Transaction_Name","Transaction_Name_URL","Acquired_by","Acquired_by_URL","Announced_Date","Announced_Date_Precision","Price","Price_Currency","Price_Currency_(in_USD)","Acquisition_Type","IPO_Status,Number_of_Events","SimilarWeb_-_Monthly_Visits","Number_of_Founders","Founders","Number_of_Employees"]
pd.read_csv("data.csv", sep=",", encoding="utf-8", names=header)

First, read the data normally. All of the data will end up in the first column. You can then use the pyparsing module to split each value on ',' and assign the result back. The code below does this for the first row; you need to do the same for all the rows. I hope this solves your query.
import pyparsing as pp
import pandas as pd
df = pd.read_csv('input.csv')
df.loc[0] = pp.commaSeparatedList.parseString(df['Organization Name'][0]).asList()
Output
df  # (since there are 42 columns, pasting just a snippet)
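If the file really does hold each row as one quoted string ending in a semicolon (an assumption, since the pastebin data is not reproduced here), the same idea can be applied to every row with only the standard library csv module instead of pyparsing. The sample rows and the parse_line helper below are made up for illustration and are not taken from the original file.

```python
import csv
import io

# Hypothetical sample: each row is a single quoted string with doubled
# inner quotes and a trailing semicolon.
raw = (
    '"Name,""City, Country"",Status";\n'
    '"Acme,""Paris, France"",Active";\n'
)


def parse_line(line):
    """Split one wrapped row into its fields (illustrative helper)."""
    # Drop the newline and the trailing semicolon.
    line = line.strip().rstrip(";")
    # First pass: remove the outer quotes and collapse doubled quotes.
    unwrapped = next(csv.reader([line]))[0]
    # Second pass: split on commas while honouring the inner quotes.
    return next(csv.reader([unwrapped]))


rows = [parse_line(line) for line in io.StringIO(raw) if line.strip()]
header, records = rows[0], rows[1:]
print(header)
print(records)
```

The resulting header and records lists could then be passed to pd.DataFrame(records, columns=header).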

Related

Treat everything as raw string (even formulas) when reading into pandas from excel

I am handling text responses from surveys, and it is common to have responses that start with -, for example: -I am sad today.
Excel interprets such a cell as a formula and shows #NAME?
So when I import the Excel file into pandas using read_excel, the value shows up as NaN.
Is there any method to force Excel to keep these cells as raw strings instead of interpreting them as formulas?
I wrote a VBA macro that sets the entire column to text and clicks through all the cells in the column, which is slow when there are ten thousand or more rows.
I was hoping this could be done at the Python level instead. Any ideas?
I hope this works for you: use openpyxl to extract the Excel data and then convert it into a pandas DataFrame.
from openpyxl import load_workbook
import pandas as pd
# .active returns the active worksheet, not the workbook
ws = load_workbook(filename='./formula_contains_raw.xlsx').active
# ws.values yields one tuple of cell values per row
rows = list(ws.values)
print(rows)
# the first row holds the column names, the rest is data
df = pd.DataFrame(rows[1:], columns=rows[0])
df.head()
It works for me using a CSV instead of an Excel file.
In the CSV file (opened in Excel), I select the Formulas > Show Formulas option and then save the file.
pd.read_csv('draft.csv')
Output:
Col1
0 hello
1 =-hello

Importing a csv file with clean columns using pandas?

So I'm trying to import this CSV file, and each value is separated by a comma, but how do I make new rows and columns from the imported data?
I tried importing it as normal and printing the data frame in different ways.
Try the same with
df = pd.read_csv('file_name.csv', sep=',')
This might work, although ',' is already the default separator for read_csv.

Quotation Marks disappearing in pandas read csv

So, I have a .txt file and I want to read it in pandas.
The line is this when I open in Notepad++:
"1013764";"Test INT"12345678"";"TEST";"TEST";""
Then, to open in pandas, I do this:
data = pd.read_csv("TestFile.TXT", sep=";")
When I print "data", it appears like this:
Is there any way to keep the quotation marks from disappearing?
You need to remove the quotation marks by replacing them.
Let's say that the column name is col1, then:
df['col1'] = df['col1'].str.replace('"', '')
The simplest solution I could find is
import csv
import pandas as pd
data = pd.read_csv("<your file>", sep=";", quoting=csv.QUOTE_NONE)
(In your case the code above will produce this:
Columns: ["1013764", "Test INT"12345678"", "TEST", "TEST", ""])
The problem with this is that read_csv will parse everything as a string. If you want to preserve the data types, I would advise you to use different quotes in your CSV (like ') to signal to pandas that the data is a string. This can be done by adding the quotechar parameter:
import pandas as pd
data = pd.read_csv("<your file>", sep=";", quotechar="'")
More information about the read_csv can be found in the pandas docs:
https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html
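As a rough check of the QUOTE_NONE approach, here is a small sketch that reads the sample line from the question out of an in-memory string instead of a file. Passing header=None and dtype=str are my assumptions (the question shows only this one line and does not say whether it is a header); they are not part of the answer above.

```python
import csv
import io

import pandas as pd

# The sample line from the question, held in memory instead of TestFile.TXT.
raw = '"1013764";"Test INT"12345678"";"TEST";"TEST";""\n'

# quoting=csv.QUOTE_NONE makes the parser treat " as an ordinary character,
# so the quotation marks stay in the values.
# header=None and dtype=str are assumptions made for this sketch.
data = pd.read_csv(
    io.StringIO(raw),
    sep=";",
    quoting=csv.QUOTE_NONE,
    header=None,
    dtype=str,
)

print(data.iloc[0].tolist())
```

Every value comes back as text with its quotation marks intact, which is the trade-off mentioned above: nothing is converted to a number.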

Split multiple times?

So I'm currently transferring a txt file into a csv. It's mostly cleaned up, but even after splitting there are still empty columns between some of my data.
Below is my messy CSV file
And here is my current code:
Sat_File = '/Users'
output = '/Users2'
import csv
import matplotlib as plt
import pandas as pd
with open(Sat_File, 'r') as sat:
    with open(output, 'w') as outfile:
        writer = csv.writer(outfile)
        # go through the text file line by line
        for line in sat:
            if "2004" in line:
                line = line.split(' ')
                writer.writerow(line)
Basically, I'm just trying to eliminate those gaps between columns in the CSV picture I've provided. Thank you!
You can use the Python pandas library to clear out the empty columns:
import pandas as pd
df = pd.read_csv('path_to_csv_file').dropna(axis=1, how='all')
df.to_csv('path_to_clean_csv_file')
Basically we:
Import the pandas library.
Read the csv file into a variable called df (stands for data frame).
Then we use the dropna function, which discards empty columns or rows. axis=1 means drop columns (0 means rows), and how='all' means drop only the columns in which all of the values are empty.
We save the clean data frame df to a new, clean csv file.
$$$ Pr0f!t $$$
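A different angle, in case the gaps come from the split itself: splitting on a single space produces an empty string for every extra space in a run, and each empty string becomes an empty CSV cell. Calling split() with no argument splits on any run of whitespace instead. This is only a sketch; the sample lines are invented, since the real text file is not shown.

```python
import csv
import io

# Invented sample lines with uneven spacing between values.
sample = (
    "2004 001   12.5    -3.2\n"
    "2003 002 1.0 2.0\n"
    "2004 002  13.1   -2.9\n"
)

# split(' ') keeps an empty string for each extra space.
print("2004 001   12.5".split(' '))

out = io.StringIO()
writer = csv.writer(out, lineterminator="\n")
for line in io.StringIO(sample):
    if "2004" in line:
        # split() with no argument collapses runs of whitespace and
        # also drops the trailing newline.
        writer.writerow(line.split())

result = out.getvalue()
print(result)
```

With this change the rows are written without empty cells in the first place, so no cleanup pass is needed afterwards.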

Using %s in Python strips leading zeroes in CSV to XML conversion

Take this test CSV file:
COLUMN1;COLUMN2;COLUMN3;COLUMN4;COLUMN5;COLUMN6;COLUMN7
CODE;1234;0123456789;0987654321;012345678987654321;012345;10110025
I want to convert this file to XML. To do it, I am using the code in this Stackoverflow answer. The complete test code is this:
import csv
import pandas as pd
df = pd.read_csv('test.csv', sep=';')
def convert_row(row):
    # build one <root> element per CSV row
    return """<root>
<column1>%s</column1>
<column2>%s</column2>
<column3>%s</column3>
<column4>%s</column4>
<column5>%s</column5>
<column6>%s</column6>
<column7>%s</column7>
</root>""" % (
        row.COLUMN1, row.COLUMN2, row.COLUMN3, row.COLUMN4, row.COLUMN5, row.COLUMN6, row.COLUMN7)
print('\n'.join(df.apply(convert_row, axis=1)))
However, every column value starting with a zero gets stripped of the leading zero character. This is the output:
<root>
<column1>CODE</column1>
<column2>1234</column2>
<column3>123456789</column3>
<column4>987654321</column4>
<column5>12345678987654321</column5>
<column6>12345</column6>
<column7>10110025</column7>
</root>
I thought using %s would keep the original string intact without modifying it in any way, is this not the case?
How can I make sure that the XML output receives exactly the same value in the CSV file?
The problem doesn't lie with the string formatting, but with the CSV import. pandas converts the numeric-looking columns to int64 when importing, and that is where the leading zeros are lost.
Try df = pd.read_csv('test.csv', sep=';', dtype='str') to avoid this.
Hope this helps!
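To see the difference side by side, here is a small sketch that reads a shortened version of the test data from an in-memory string (the shortening to three columns and the variable names are mine) once with the defaults and once with dtype=str.

```python
import io

import pandas as pd

# Shortened version of the test CSV from the question.
csv_text = "COLUMN1;COLUMN2;COLUMN3\nCODE;1234;0123456789\n"

# Default behaviour: numeric-looking columns are inferred as integers.
df_default = pd.read_csv(io.StringIO(csv_text), sep=";")

# dtype=str keeps every value exactly as written in the file.
df_text = pd.read_csv(io.StringIO(csv_text), sep=";", dtype=str)

print(df_default.loc[0, "COLUMN3"])
print(df_text.loc[0, "COLUMN3"])
```

Once the values are read as strings, %s inserts them unchanged, so the XML receives the same text as the CSV.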
