Split multiple times? - python

So I'm currently transferring a txt file into a csv. It's mostly cleaned up, but even after splitting there are still empty columns between some of my data.
Below is my messy CSV file
And here is my current code:
import csv

Sat_File = '/Users'
output = '/Users2'

with open(Sat_File, 'r') as sat:
    with open(output, 'w') as outfile:
        writer = csv.writer(outfile)
        for line in sat:
            if "2004" in line:
                # splitting on a single space leaves an empty field wherever
                # the source line has runs of consecutive spaces
                line = line.split(' ')
                writer.writerow(line)
Basically, I'm just trying to eliminate those gaps between columns in the CSV picture I've provided. Thank you!
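One likely cause, shown here as a quick aside (the post splits on a single space): every extra space in the source line produces an empty field, whereas str.split() with no argument collapses any run of whitespace. A tiny sketch with made-up data:

line = "2004  01  15   12.5"          # hypothetical line with runs of spaces
print(line.split(' '))   # ['2004', '', '01', '', '15', '', '', '12.5'] -- empty fields
print(line.split())      # ['2004', '01', '15', '12.5'] -- no gaps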

You can use the pandas library to clear out the empty columns:
import pandas as pd
df = pd.read_csv('path_to_csv_file').dropna(axis=1, how='all')
df.to_csv('path_to_clean_csv_file')
Basically we:
Import the pandas library.
Read the csv file into a variable called df (stands for data frame).
Then we use the dropna function, which lets us discard empty columns/rows. axis=1 means drop columns (0 means rows) and how='all' means drop only columns where all of the values are empty.
We save the clean data frame df to a new, clean csv file.
$$$ Pr0f!t $$$
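For illustration (a toy sketch, not part of the original answer), this is what how='all' keeps and drops:

import numpy as np
import pandas as pd

# toy frame: column B is entirely empty, column C is only partly empty
df = pd.DataFrame({'A': [1, 2], 'B': [np.nan, np.nan], 'C': [3, np.nan]})

print(df.dropna(axis=1, how='all'))   # drops only B (all values missing)
print(df.dropna(axis=1, how='any'))   # would drop both B and C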

Related

Importing a csv file with clean columns using pandas?

So I'm trying to import this csv file. Each value is separated by a comma, but how do I make new rows and columns from the imported data?
I tried importing it as normal and printing the data frame in different ways.
Try the same with
df = pd.read_csv('file_name.csv', sep=',')
This might work.
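If sep=',' does not help, a quick way to check what pandas actually parsed (a sketch, reusing the placeholder file name from the answer) is to inspect the shape and column names and try another common delimiter:

import pandas as pd

df = pd.read_csv('file_name.csv', sep=',')
print(df.shape)     # (rows, columns) -- a single column usually means the wrong delimiter
print(df.columns)

# if everything landed in one column, the file may use a different separator, e.g. ';'
df = pd.read_csv('file_name.csv', sep=';')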

Extracting individual rows from dataframe

I am currently doing one of my final assignments and I have a CSV file with a few columns of different data.
I'm currently interested in extracting a single column and converting each of its rows into its own txt file.
Here is my code:
import pandas as pd

df = pd.read_csv("AUS_NZ.csv")
print(df.head(10))
print(df["content"])

num_of_review = len(df["content"])
print(num_of_review)

for i in range(num_of_review):
    with open("{}.txt".format(i), "a", encoding="utf-8") as f:
        f.write(df["content"][i])
No issue with extracting the individual rows. But when I examine the txt files that were extracted and look at the content, I noticed that the text was copied (which is what I want), but it was copied twice (which is not what I want).
Example:
"This is an example of what the dataframe have at that particular column which I want to convert to a txt file."
This is what was copied to the txt file:
"This is an example of what the dataframe have at that particular column which I want to convert to a txt file.This is an example of what the dataframe have at that particular column which I want to convert to a txt file."
Any advice on how to copy the content only once?
Thanks! While thinking about how to rectify this, I came to the same conclusion as you. I made the switch from "a" to "w" and it solved the issue.
I'm too used to append mode, so I tried that before I tried write.
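As a quick illustration (a sketch, not from the thread) of why "a" duplicated the text: append mode keeps whatever is already in the file, so re-running the script writes the text again, while "w" truncates the file first.

# "a" keeps the existing contents; running this twice leaves two copies
with open("demo.txt", "a", encoding="utf-8") as f:
    f.write("example review text")

# "w" truncates the file before writing, so only one copy remains
with open("demo.txt", "w", encoding="utf-8") as f:
    f.write("example review text")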
The correct code:
import pandas as pd

df = pd.read_csv("AUS_NZ.csv")
print(df.head(10))
print(df["content"])

num_of_review = len(df["content"])
print(num_of_review)

for i in range(num_of_review):
    with open("{}.txt".format(i), "w", encoding="utf-8") as f:
        f.write(df["content"][i])

How do I prevent pandas from writing a new column when I save to csv

I wrote this code just to show the problem I'm having. I need to save my data to a csv and then reopen it later, but when I reload the data into a pandas dataframe from the csv it now has an extra unnamed column at the front that I don't want. It messes up my data when I try to use .drop_duplicates(), because each row now has its own number, and every time I reopen the file it gains another column of numbers at the front, making everything worse. How do I make it so it doesn't have this?
import numpy as np
import pandas as pd
df = pd.DataFrame(np.random.randn(100,4), columns=list('ABCD'))
df.to_csv('data.csv')
print(df.head())
df1 = pd.read_csv('data.csv')
print(df1.head())
It's the dataframe index. You can turn that off with
df.to_csv('data.csv', index=False)
The docs are the first stop to learn the different options you have when writing: pandas.DataFrame.to_csv
When reading, you can also drop rows that contain empty values like:
df = pd.read_csv("data.csv").dropna()
The solution was super easy. I needed to do
df.to_csv('data.csv', index= False)
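For reference, a small sketch (reusing the same random frame as the question) of both options: skip writing the index, or read an already-written index column back as the index instead of as data:

import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randn(100, 4), columns=list('ABCD'))

# Option 1: don't write the index at all
df.to_csv('data.csv', index=False)

# Option 2: if the file was already written with the index,
# read that first column back as the index instead of as a data column
df.to_csv('data_with_index.csv')
df1 = pd.read_csv('data_with_index.csv', index_col=0)
print(df1.head())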

Unable to parse string quoted csv data using pandas

I am trying to parse this CSV data, which has quotes in an unusual pattern and a semicolon at the end of each row.
I am not able to parse this file correctly using pandas.
Here is the link to the data (the pastebin for some reason was not recognizing it as text/csv, so it picked up some random formatting; please ignore that):
https://paste.gnome.org/pr1pmw4w2
I have tried using "," as the delimiter, and the normal pandas dataframe construction with only the file name as a parameter.
header = ["Organization_Name","Organization_Name_URL","Categories","Headquarters_Location","Description","Estimated_Revenue_Range","Operating_Status","Founded_Date","Founded_Date_Precision","Contact_Email","Phone_Number","Full_Description","Investor_Type","Investment_Stage","Number_of_Investments","Number_of_Portfolio_Organizations","Accelerator_Program_Type","Number_of_Founders_(Alumni)","Number_of_Alumni","Number_of_Funding_Rounds","Funding_Status","Total_Funding_Amount","Total_Funding_Amount_Currency","Total_Funding_Amount_Currency_(in_USD)","Total_Equity_Funding_Amount","Total_Equity_Funding_Amount_Currency","Total_Equity_Funding_Amount_Currency_(in_USD)","Number_of_Lead_Investors","Number_of_Investors","Number_of_Acquisitions","Transaction_Name","Transaction_Name_URL","Acquired_by","Acquired_by_URL","Announced_Date","Announced_Date_Precision","Price","Price_Currency","Price_Currency_(in_USD)","Acquisition_Type","IPO_Status,Number_of_Events","SimilarWeb_-_Monthly_Visits","Number_of_Founders","Founders","Number_of_Employees"]
pd.read_csv("data.csv", sep=",", encoding="utf-8", names=header)
First, you can just read the data normally; all of the data will end up in the first column. You can then use the pyparsing module to split each row on ',' and assign it back. You just need to do this for all the rows (a sketch of that follows the output below). I hope this solves your query.
import pyparsing as pp
import pandas as pd
df = pd.read_csv('input.csv')
df.loc[0] = pp.commaSeparatedList.parseString(df['Organization Name'][0]).asList()
Output:
df  # (since there are 42 columns, pasting just a snippet)
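A sketch of applying the same parse to every row (this assumes each row yields the same number of fields and reuses the column name from the answer above; adjust if yours differs):

import pyparsing as pp
import pandas as pd

df = pd.read_csv('input.csv')

# parse every raw line into its list of fields
parsed = [pp.commaSeparatedList.parseString(row).asList()
          for row in df['Organization Name']]

# rebuild the frame, assuming each row produced the same number of fields
clean = pd.DataFrame(parsed)
clean.to_csv('parsed.csv', index=False)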

Delete rows in CSV file after being read by pandas

So I want to have 1 script writing continually to a CSV file, and another script reading periodically from that same CSV file.
What I'm looking for is a way to delete the rows I've just read in from the CSV file (not from my pandas dataframe).
Can anybody help?
# Read data in to dataframe
deviceInfo = pd.read_csv("sampleData.csv", nrows = 100)
# Somehow delete those 100 rows from the CSV file
@JoseAngelSanchez is correct that you might want to read the whole csv into a dataframe, but I think this way lets you get a dataframe with the first 100 rows and still delete them from the csv file.
import pandas as pd
df = pd.read_csv("sampleData.csv")
deviceInfo = df.iloc[:100]
df.iloc[100:].to_csv("sampleData.csv")
Note: if you're doing this repeatedly, you'll probably want to write to_csv(..., index=False), or a new index column will be created in the .csv file on each iteration.
You should read the whole document and then delete the rows you don't want:
import pandas as pd
df = pd.read_csv("sampleData.csv")
df = df.iloc[100:]
df.to_csv("sampleData.csv")
