csv Writer using Datafields returned by Pandas - python

Hello, I'm working on a project that reads an Excel worksheet, collects columns of data based on header title, and then writes that data to a much leaner csv file which I'll be using for more fun later.
I'm getting a syntax error while trying to write my new csv file, and I think it has something to do with the datafields I'm using to get my columns in pandas.
I'm new to Python so any help you can provide would be great, thanks!
import pandas
import xlrd
import csv

def csv_from_excel():
    wb = xlrd.open_workbook("C:\\Python27\\Work\\spreadsheet.xlsx")
    sh = wb.sheet_by_name('Sheet1')
    spoofingFile = open('spoofing.csv', 'wb')
    wr = csv.writer(spoofingFile, quoting=csv.QUOTE_ALL)
    for rownum in xrange(sh.nrows):
        wr.writerow(sh.row_values(rownum))
    spoofingFile.close()

csv_from_excel()

df = pandas.read_csv('C:\\Python27\\Work\\spoofing.csv')
time = df["InviteTime (Oracle)"]
orignum = df["Orig Number"]
origip = df["Orig IP Address"]
destnum = df["Dest Number"]

sheet0bj = csv.writer(open("complete.csv", "wb")
sheet0bj.writerow([time,orignum,origip,destnum])
The syntax error is thus:
File "c:\python27\work\formatsheettest.py", line 36
    sheet0bj.writerow([time, orignum, origip, destnum])
    ^
SyntaxError: invalid syntax

You're missing a closing paren on the second to last line.
sheet0bj = csv.writer(open("complete.csv", "wb")
should be
sheet0bj = csv.writer(open("complete.csv", "wb"))
I assume you've figured that out by now, though.
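One thing worth flagging beyond the paren: writerow([time, orignum, origip, destnum]) will write each entire pandas Series into a single cell rather than as columns. If the goal is just a leaner csv with those columns, here is a minimal sketch using pandas directly, assuming the same column names as above:
import pandas

df = pandas.read_csv('C:\\Python27\\Work\\spoofing.csv')

# select just the wanted columns and let pandas write the csv;
# index=False leaves out the row-number column
cols = ["InviteTime (Oracle)", "Orig Number", "Orig IP Address", "Dest Number"]
df[cols].to_csv("complete.csv", index=False)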

Related

Removal of rows containing a particular text in a csv file using Python

I have a genomic dataset consisting of more than 3500 rows. I need to remove rows based on the values in two columns ("Length" and "Protein Name"). How do I specify the condition for this purpose?
import csv  # importing the csv module

# opening the csv file
file = open('C:\\Users\\Admin\\Downloads\\csv.csv', 'r')

# reading the csv file
csvreader = csv.reader(file)

# the first row is the header
header = next(csvreader)
print(header)

# extracting the remaining rows from the csv file
rows = []
for row in csvreader:
    rows.append(row)
print(rows)
I am a beginner in Python bioinformatic data analysis and I haven't tried any extensive methods. I have done the work of opening and reading the csv file, and I have also extracted the column headers, but I don't know how to proceed from here. Please help.
Try this (note that it operates on a pandas DataFrame, not on the csv.reader object):
df = df[df["columnName"].str.contains("string to delete") == False]
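For context, a minimal end-to-end sketch of that approach; the file path matches the question, while the column name and the "hypothetical protein" pattern are placeholder examples to replace with your own:
import pandas as pd

# load the csv into a DataFrame first
df = pd.read_csv('C:\\Users\\Admin\\Downloads\\csv.csv')

# keep only rows whose "Protein Name" does NOT contain the unwanted text
df = df[df["Protein Name"].str.contains("hypothetical protein") == False]

df.to_csv('C:\\Users\\Admin\\Downloads\\filtered.csv', index=False)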
It will be better to read the csv in pandas since you have lots of rows; that is the smart decision to make. Also set the conditional variables you will use to perform the operation. If this does not help, I suggest you provide a sample of your csv file.
df = pd.read_csv('C:\\Users\\Admin\\Downloads\\csv.csv')
length = 10
protein_name = "replace with protein name"
df = df[(df["Length"] > length) & (df["Protein Name"] != protein_name)]
print(df)
You can save the df back to a csv file if you want:
df.to_csv('C:\\Users\\Admin\\Downloads\\new_csv.csv', index=False)

Python - Merge Excel Files with missing column names

Help and valuable advice needed from the experts please.
I have 100s of csv files which I would like to merge, but the table columns aren't the same. For example
File 1: Header1, Header2, Header3
File 2: Header1, Header2
File 3: Header3, Header4
File 4: Header1, Header3, Header4
I would like to merge the data from all of these csv files.
I thought of a logic but I am struggling to implement it in Python code. I think it requires a DataFrame with pre-defined headers (Header1, Header2, Header3, Header4). Then I should loop through each csv file to search for those headers. If a header exists then its data is appended to the dataframe, else it is skipped.
Can someone please advise if there is a function in Python that can make this simple? I have tried to use read_csv but the data structure has to be the same when looping through the csv files.
import pandas as pd
import os
import glob

# Location of CSV Files
data_location = 'C:\\Tableau Reports\\ST Database\\Files\\Downloads\\CSV Files\\*.csv'

# This will give path location to all CSV files
csv_files = glob.glob(data_location)

for csv_file in csv_files:
    # Encoding to be changed (UTF-8).
    with open(csv_file, newline='', encoding='cp1252') as csvfile:
        df_file = pd.read_csv(csvfile)
        Combined_df = pd.concat(df_file)
print(Combined_df)
I tried to follow advice given in this forum but I am getting an error at line 12, df_file = pd.read_csv(csvfile):
UnicodeDecodeError: 'charmap' codec can't decode byte 0x9d in position 114934: character maps to <undefined>
Any help or advice is greatly appreciated. Thank you
EDIT
The code you posted seems overly complex. You don't have to use open with pd.read_csv; you can pass a file name directly as a string:
for csv_file in csv_files:
    # Encoding to be changed (UTF-8).
    with open(csv_file, newline='', encoding='cp1252') as csvfile:
        df_file = pd.read_csv(csvfile)
The following should work, using a generator expression fed straight to pd.concat
df = pd.concat(pd.read_csv(csv_file) for csv_file in csv_files)
If you're not a fan of comprehension-style code then you can instead start with an empty df and keep concatenating your small dfs onto it
large_df = pd.DataFrame()
for csv_file in csv_files:
    small_df = pd.read_csv(csv_file)
    large_df = pd.concat((large_df, small_df))
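Since the original error was a UnicodeDecodeError, it may also help that pd.read_csv accepts an encoding argument directly; a sketch assuming the files really are cp1252-encoded:
# pass the encoding straight to read_csv instead of wrapping the file in open()
df = pd.concat(pd.read_csv(csv_file, encoding='cp1252') for csv_file in csv_files)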
ORIGINAL
pd.concat can already do this for you! I've made example data. f1_r1 means "file1, row1" just to show that it's working as expected
import pandas as pd
import io
#The io.StringIO makes these "pretend" .csv files for illustration
#All of this is just to create the example data, you won't do this
file1_name = io.StringIO("""
Header1,Header2,Header3
f1_r1,f1_r1,f1_r1
f1_r2,f1_r2,f1_r2
""")
file2_name = io.StringIO("""
Header1,Header2
f2_r1,f2_r1
f2_r2,f2_r2
""")
file3_name = io.StringIO("""
Header3,Header4
f3_r1,f3_r1
f3_r2,f3_r2
""")
file4_name = io.StringIO("""
Header1,Header3,Header4
f4_r1,f4_r1,f4_r1
f4_r2,f4_r2,f4_r2
""")
#this would be a list of your .csv file names
files = [file1_name, file2_name, file3_name, file4_name]
#pd.concat already handles this use-case for you!
combined_df = pd.concat(pd.read_csv(f_name,encoding='utf8') for f_name in files)
print(combined_df)
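One small note on the result: columns a given file lacks come out as NaN for that file's rows, and if a clean 0..N row index matters downstream you can ask concat for one (ignore_index is a standard pd.concat parameter):
combined_df = pd.concat(
    (pd.read_csv(f_name, encoding='utf8') for f_name in files),
    ignore_index=True,  # renumber the combined rows 0..N-1
)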

Proper convert from excel to csv

I have an Excel file (.xls) with 1 sheet (Sheet1) and I want to convert it into csv using Python 3. I found this code:
import xlrd
import csv

def csv_from_excel():
    wb = xlrd.open_workbook('safir/fisier-safir.xls')
    sh = wb.sheet_by_name('Sheet1')
    your_csv_file = open('safir.csv', 'w')
    wr = csv.writer(your_csv_file, quoting=csv.QUOTE_ALL)
    for rownum in range(sh.nrows):
        wr.writerow(sh.row_values(rownum))
    your_csv_file.close()

csv_from_excel()
But the problem with this approach is that it does not preserve the Excel formatting of certain numbers: 1 becomes 1.00, 001 -> 1.0, 0,9 -> 1, and so on. If I do the Excel -> csv conversion manually I don't get these issues, and I got similar problems with other scripts as well. Does someone have a proper one? Thank you!
Can you please try the below code? It keeps the formatting as is.
import win32com.client

xl = win32com.client.Dispatch("Excel.Application")
xl.DisplayAlerts = False
xl.Workbooks.Open(Filename='C:\\pscript\\Copy.xlsx', ReadOnly=1)
wb = xl.Workbooks(1)
print(wb)
wb.SaveAs(Filename='C:\\pscript\\out.csv', FileFormat=6)  # 6 means csv
wb.Close(False)
xl.Application.Quit()
wb = None
xl = None
print("done")

Tablib export corrupting files

I'm writing some simple code to transform a csv back to xls with Tablib in Python.
As I understand it, Tablib does the conversion for you if you import the csv.
import tablib
imported_data = tablib.import_set(open('DB.csv',encoding='utf8').read())
f = open('workfile.xls', 'wb')
f.write(imported_data.xls)
f.close()
This code handles a small sample of the database, but fails past a certain point (~600 lines), meaning that it runs successfully but Excel cannot open the resulting file.
I'm not sure how to proceed: is tablib failing, or does Excel fail to read the encoded data?
These two functions allow you to import from csv and then export to an Excel file:
import csv
from xlsxwriter import Workbook
import operator

# This function imports from csv.
# Note: int(v) assumes every value in the file is an integer.
def CSV2list_dict(file_name):
    with open(file_name) as f:
        a = [{k: int(v) for k, v in row.items()}
             for row in csv.DictReader(f, skipinitialspace=True)]
    return a

# file_name must end with .xlsx
# The second parameter represents the header row of the data in Excel;
# it is a list of strings.
# The third parameter represents the data as a list of dictionaries.
# The last parameter represents the key to order by.
def Export2excel(file_name, header_row, list_dict, order_by):
    list_dict.sort(key=operator.itemgetter(order_by))
    wb = Workbook(file_name)
    ws = wb.add_worksheet("New Sheet")  # or leave it blank; the default name is "Sheet 1"
    first_row = 0
    for header in header_row:
        col = header_row.index(header)  # we are keeping order
        ws.write(first_row, col, header)  # the first row written is the header of the worksheet
    row = 1
    for art in list_dict:
        for _key, _value in art.items():
            col = header_row.index(_key)
            ws.write(row, col, _value)
        row += 1  # move to the next row
    wb.close()

csv_data = CSV2list_dict('DB.csv')
header = ['col0', 'col1', 'col2']
order = 'col0'  # the type of col0 is int
Export2excel('workfile.xlsx', header, csv_data, order)
As an alternative approach, you could just ask Excel to do the conversion as follows:
import win32com.client as win32
import os
excel = win32.gencache.EnsureDispatch('Excel.Application')
src_filename = r"c:\my_folder\my_file.csv"
name, ext = os.path.splitext(src_filename)
target_filename = name + '.xls'
wb = excel.Workbooks.Open(src_filename)
excel.DisplayAlerts = False
wb.DoNotPromptForConvert = True
wb.CheckCompatibility = False
wb.SaveAs(target_filename, FileFormat=56, ConflictResolution=2)
excel.Application.Quit()
Microsoft has a list of File formats that you can use, where 56 is used for xls.
If you are using the new openpyxl 2.5 this will not work. You need to remove 2.5 and instead pip install 2.4.9.
import tablib
Depending on whether it is a dataset (one page) or a databook (multiple pages), you need to declare (changes here):
imported_data = tablib.Dataset()
or
imported_data = tablib.Databook()
Then you can import your data (changes here):
imported_data.csv = open('DB.csv', encoding='utf8').read()
Without specifying .csv, as in your example, tablib doesn't know the format:
imported_data = tablib.import_set(open('DB.csv',encoding='utf8').read())
then you could print to see the various options you have.
print(imported_data)
print(imported_data.csv)
print(imported_data.xlsx)
print(imported_data.dict)
print(imported_data.dbf)
etc.
Then write your file.(No changes here)
f = open('workfile.xls', 'wb')
f.write(imported_data.xls) # or .xlsx
f.close()

Adding data frame to excel sheet

I am trying to write a dataframe to Excel using pandas.ExcelWriter after reading it from a huge csv file.
This code updates the Excel sheet, but it doesn't append the data to the file the way I want:
import pandas as pd

reader = pd.read_csv("H:/ram/temp/1.csv", delimiter='\t', chunksize=10000,
                     names=['neo_user_id',
                            'gender',
                            'age_range',
                            'main_geolocation',  # (user identifier of the client)
                            'interest_category_1',
                            'interest_category_2',
                            'interest_category_3',
                            'first_day_identifier'],
                     encoding="utf-8")
ew = pd.ExcelWriter('H:/ram/Formatted/SynthExport.xlsx', engine='xlsxwriter', options={'encoding': 'utf-8'})
for chunks in reader:
    chunks.to_excel(ew, 'Sheet1', encoding='utf-8')
    print len(chunks)
ew.save()
I also tried data.append() and data.to_excel, but doing this results in a memory error. Since I am reading the data in chunks, is there any way to write the data to Excel?
I got it working with this code:
import pandas as pd
import xlsxwriter

reader = pd.read_csv("H:/ram/user_action_export.2014.01.csv", delimiter='\t', chunksize=1000,
                     names=['day_identifier',
                            'user_id',
                            'site_id',
                            'device',  # (user identifier of the client)
                            'geolocation',
                            'referrer',
                            'pageviews'],
                     encoding="utf-8")
startrows = 0
ew = pd.ExcelWriter('H:/ram/Formatted/ActionExport.xlsx', engine='xlsxwriter', options={'encoding': 'utf-8'})
for chunks in reader:
    chunks.to_excel(ew, 'Sheet1', encoding='utf-8', startrow=startrows)
    startrows = startrows + len(chunks)
    print startrows
ew.save()
But it still takes a lot of time.
I don't know if it is causing the main issue, but you shouldn't be calling save() between chunks, since a single call to save() closes an xlsxwriter file.
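To make that concrete, here is a minimal sketch of the suggested shape; the header=(startrow == 0) part is an extra assumption on my side, to avoid repeating the column names inside the sheet:
import pandas as pd

reader = pd.read_csv("H:/ram/user_action_export.2014.01.csv", delimiter='\t',
                     chunksize=1000, encoding="utf-8")
ew = pd.ExcelWriter('H:/ram/Formatted/ActionExport.xlsx', engine='xlsxwriter')

startrow = 0
for chunk in reader:
    # write each chunk below the previous one; only the first chunk keeps a header
    chunk.to_excel(ew, 'Sheet1', startrow=startrow, header=(startrow == 0))
    if startrow == 0:
        startrow += 1  # account for the header row written with the first chunk
    startrow += len(chunk)

# save exactly once, after all chunks have been written
ew.save()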
