How to remove double quotes from values when reading a csv file - python

My csv file:
Mp4,Mp3,"1234554"
My code:
import re

csv = b''.join(csv).split(b'\n')
for index, row in enumerate(csv):
    row = re.split(b''',(?=(?:[^'"]|'[^']*'|"[^"]*")*$)''', row)
    for records in row:
        print(records)
When it prints the records, the 3rd element is printed with surrounding double quotes ("") and I need to ignore these double quotes.

I think this should do it (note that records is a bytes object, so the quote must be a bytes literal):
records = records.replace(b'"', b'')
Edit
Using pandas.read_csv is better for working with csv files
import pandas as pd
csv = pd.read_csv('data.csv', delimiter=',', names=['x', 'y', 'z'])
# iterate over the dataframe
for index, row in csv.iterrows():
    print(row['x'], row['y'], row['z'])
Assuming content of data.csv looks like
Mp4,Mp3,"1234554"
The Output would look like this:
Mp4 Mp3 1234554
If your csv file includes column names e.g.
file_type1,file_type2,size
mp4,mp3,"1234554"
Just remove the names parameter when you read in the csv file:
csv = pd.read_csv('data.csv', delimiter=',')
print(csv)
Then the Output would look like this:
file_type1 file_type2 size
0 mp4 mp3 1234554
Read more about pandas or pandas.read_csv

You could easily remove it with (using a bytes literal, since records is bytes in your code):
print(records.replace(b'"', b''))
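For completeness: if the file can be read in text mode, the standard library's csv module strips the quoting during parsing, so no regex or replace is needed. A minimal sketch, using an inline stand-in for the file contents shown in the question:

```python
import csv
import io

# Inline stand-in for the file contents shown in the question
raw = 'Mp4,Mp3,"1234554"\n'

# csv.reader strips the surrounding double quotes while parsing
rows = list(csv.reader(io.StringIO(raw)))
print(rows)  # [['Mp4', 'Mp3', '1234554']]
```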

Related

Join large set of CSV files where the header is the timestamp for the file

I have a large set of CSV files (approx. 15 000 files) and would like to figure out how to join them together into one file for data processing.
Each file follows a simple pattern, with a timestamp that corresponds to the period of time represented by the data in that CSV file.
Ex.
file1.csv
2021-07-23 08:00:00
Unit.Device.No03.ErrorCode;11122233
Unit.Device.No04.ErrorCode;0
Unit.Device.No05.ErrorCode;0
Unit.Device.No11.ErrorCode;0
file2.csv
2021-07-23 08:15:00
Unit.Device.No03.ErrorCode;0
Unit.Device.No04.ErrorCode;44556666
Unit.Device.No05.ErrorCode;0
Unit.Device.No11.ErrorCode;0
Each file starts with the timestamp. I would like to join all the files in a directory and transpose the "Unit.Device" entries to columns, using the original header as a timestamp column, so that each file adds a new row with the corresponding "ErrorCode" in each column.
Like this:
Timestamp;Unit.Device.No03.ErrorCode;Unit.Device.No04.ErrorCode;Unit.Device.No05.ErrorCode..
2021-07-23 08:00:00;11122233;0;0;0;0....
2021-07-23 08:15:00;0;44556666;0;0;0....
Any simple tools for this, or Python routines?
Thanks for the reply to my first question here!
I will also contribute a solution to this problem.
I did some reading up on Pandas after I found something similar to what I wanted to do. I found that the transpose method was very easy to use, and put together this snippet of Python code instead.
import pandas as pd
import os

folder = 'in'
frames = []
for filename in os.scandir(folder):
    if filename.is_file():
        print('Working on file ' + filename.path)
        df = pd.read_csv(filename.path, encoding='utf-16', sep=';', header=[0])
        # Transpose data with timestamp header to columns
        df_transposed = df.T
        frames.append(df_transposed)
df_out = pd.concat(frames)
df_out.to_csv('output.csv')
Try the following Pandas approach:
import pandas as pd
import glob

dfs = []
for csv_filename in glob.glob('./file*.csv'):
    print('Working on file', csv_filename)
    # Read the CSV file, assume no header and two columns
    df = pd.read_csv(csv_filename, sep=';', names=[0, 1], header=None)
    # Transpose from the 2nd row (skip the timestamp)
    df_transposed = df[1:].T
    # Allocate the column names from the first row
    df_transposed.columns = df_transposed.iloc[0]
    # Copy the timestamp into the transposed dataframe as a datetime value
    df_transposed['Timestamp'] = pd.to_datetime(df.iloc[0, 0])
    # Remove the first row (containing the names)
    df_transposed = df_transposed[1:]
    dfs.append(df_transposed)
# Concatenate all dataframes together and sort by Timestamp
df_output = pd.concat(dfs).sort_values(by='Timestamp')
# Sort the header columns and output to a CSV file
df_output.reindex(sorted(df_output.columns), axis=1).to_csv('output.csv', index=None)
Alternatively, it could be done using standard Python:
from datetime import datetime
import csv
import glob

data = []
fieldnames = set()
for fn in glob.glob('file*.csv'):
    with open(fn) as f_input:
        csv_input = csv.reader(f_input, delimiter=';')
        timestamp = next(csv_input)[0]
        row = {'Timestamp': timestamp}
        for device, error_code in csv_input:
            row[device] = error_code
            fieldnames.add(device)
        data.append(row)

with open('output.csv', 'w', newline='') as f_output:
    csv_output = csv.DictWriter(f_output, fieldnames=['Timestamp', *sorted(fieldnames)], delimiter=';')
    csv_output.writeheader()
    csv_output.writerows(sorted(data, key=lambda x: datetime.strptime(x['Timestamp'], '%Y-%m-%d %H:%M:%S')))
This gives output.csv as:
Timestamp;Unit.Device.No03.ErrorCode;Unit.Device.No04.ErrorCode;Unit.Device.No05.ErrorCode;Unit.Device.No11.ErrorCode
2021-07-23 08:00:00;11122233;0;0;0
2021-07-23 08:15:00;0;44556666;0;0
How does this work
First iterate over all .csv files in a given folder.
For each file open it using a csv.reader()
Read the header row as a special case, storing the value as a Timestamp entry in a dictionary row.
For each row, store additional key value entries in the row dictionary.
Keep a note of each device name using a set.
Append the complete row into a data list.
It is now possible to create an output.csv file. The full list of columns names can be assigned as fieldnames and a csv.DictWriter() used.
Write the header.
Use writerows() to write all the data rows sorted by timestamp. To do this convert each row's Timestamp entry into a datetime value for sorting.
This approach will also work if the CSV files happen to have different types of devices e.g. Unit.Device.No42.ErrorCode.

How do I replace empty cells from this excel file while I turn it into csv?

I am trying to change this Excel file into a csv, and I would like to replace empty cells with NaN. Also, do you have any advice on how to better clean up the data from Excel? My code so far:
import csv
import xlrd

wb = xlrd.open_workbook('data.xls')  # assumed; the question does not show how wb is opened
sheet1 = wb.sheet_by_index(1)
with open("data%s.csv" % (sheet1.name.replace(" ", "")), "w", encoding='utf-8') as file:
    writer = csv.writer(file, delimiter=",")
    header = [cell.value for cell in sheet1.row(1)]
    writer.writerow(header)
    for row_idx in range(2, sheet1.nrows):
        row = [int(cell.value) if isinstance(cell.value, float) else cell.value
               for cell in sheet1.row(row_idx)]
        writer.writerow(row)
You can try the Pandas data library for Python to organize your data better and more easily. It can load your data into a dataframe. If you use this module, you can replace the empty values with something like
df = df.replace(r'^\s*$', np.nan, regex=True)
You can write your dataframe back to a csv file again after you have cleaned it up.
The Pandas and numpy libraries have some great built-in functionality for working with CSVs (and Excel spreadsheets). You can load your Excel sheet into a dataframe very easily using Pandas read_excel, then replace whitespace-only cells with NaNs using a bit of regex and numpy. Then save the dataframe as a csv using to_csv.
import pandas as pd
import numpy as np

# read in your excel sheet, default is the first sheet
df = pd.read_excel("data.xlsx", sheet_name='data_tab')
# regex for hidden values e.g. spaces or empty strings
df = df.replace(r'^\s*$', np.nan, regex=True)
# now save this as a csv using to_csv
df.to_csv("csv_data.csv")
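The empty-cell replacement from the question can also be done without pandas, inside the csv.writer loop. A sketch of the per-cell logic, shown here on plain values instead of xlrd cells (clean_cell and the 'NaN' placeholder are illustrative names, not part of any library):

```python
def clean_cell(value):
    """Replace empty cells with a 'NaN' placeholder and convert floats to ints."""
    if value is None or (isinstance(value, str) and value.strip() == ''):
        return 'NaN'
    if isinstance(value, float):
        return int(value)
    return value

row = ['abc', '', 3.0, None]
print([clean_cell(v) for v in row])  # ['abc', 'NaN', 3, 'NaN']
```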

Python Pandas - Read data rows and non text in quotes from csv file

I am having an issue trying to read a csv file with pandas, as the data is within quotes and whitespace is present.
The header row in the csv file is "Serial No,First Name,Last Name,Country".
Example data for each row is "1 ,""David, T "",""Barnes "",""USA """.
Below is the code I have tried so far, trying to remove the quotes and read the text that is within two quotes.
import pandas as pd
import csv
df = pd.read_csv('file1.csv', sep=',', encoding='ansi', quotechar='"', quoting=csv.QUOTE_NONNUMERIC, doublequote=True, engine="python")
Is there a way to pre-process the file so that the result is as follows?
Serial No, First Name, Last Name, Country
1, David,T, Barnes, USA
Try using this:
file1 = pd.read_csv('sample.txt', sep=r',\s+', skipinitialspace=True, quoting=csv.QUOTE_ALL, engine='python')
Closing this, as I am using editpad to replace the commas and remove the quotes as a workaround.
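The same clean-up can be scripted rather than done in an editor: strip the single pair of outer quotes from each line, un-double the inner quotes, and let csv.reader handle the rest. A sketch, assuming every line of the file looks exactly like the samples above:

```python
import csv
import io

# Hypothetical stand-in for the raw file contents shown in the question
raw_lines = [
    '"Serial No,First Name,Last Name,Country"',
    '"1 ,""David, T "",""Barnes "",""USA """',
]

cleaned = []
for line in raw_lines:
    line = line.strip()
    # Drop exactly one pair of outer quotes, then turn doubled quotes into single ones
    if line.startswith('"') and line.endswith('"'):
        line = line[1:-1]
    cleaned.append(line.replace('""', '"'))

# The lines are now ordinary CSV; strip stray whitespace from each field
rows = [[field.strip() for field in row]
        for row in csv.reader(io.StringIO('\n'.join(cleaned)))]
print(rows)
```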

exporting data frame to csv file in python with pandas

I want to export my dataframe to a csv file. Normally I want my dataframe as 2 columns, but when I export it, the csv file has only one column and the data is separated with commas.
m is one column and s is another.
df = pd.DataFrame({'MSE':[m], 'SSIM': [s]})
To append new data frames, I used the function below and saved the data to a csv file:
with open('test.csv', 'a+') as f:
    df.to_csv(f, header=False)
print(df)
When I print the dataframe to the console, the output looks like:
MSE SSIM
0 0.743373 0.843658
But in the csv file the column looks like this (first is the index, second is m, and the last one is s); I want them in 3 separate columns:
0,1.1264238582283046,0.8178900901529639
How can I solve this?
Your Excel list-separator setting is most likely ; (semicolon). Use:
df.to_csv(f, header=False, sep=';')
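To see the separator's effect, write to an in-memory buffer (m and s here are made-up values standing in for the real MSE/SSIM numbers):

```python
import io
import pandas as pd

m, s = 0.743373, 0.843658  # hypothetical MSE/SSIM values
df = pd.DataFrame({'MSE': [m], 'SSIM': [s]})

buf = io.StringIO()
df.to_csv(buf, header=False, sep=';')
print(buf.getvalue())  # 0;0.743373;0.843658
```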

How to extract a single row from multiple CSV files to a new file

I have hundreds of CSV files on my disk, with one file added daily, and I want to extract one row from each of them and put the rows in a new file. Then I want to add values to that same file daily. The CSV files look like this:
business_day,commodity,total,delivery,total_lots
.
.
20160831,CTC,,201710,10
20160831,CTC,,201711,10
20160831,CTC,,201712,10
20160831,CTC,Total,,385
20160831,HTC,,201701,30
20160831,HTC,,201702,30
.
.
I want to fetch the row that contains 'Total' from each file. The new file should look like:
business_day,commodity,total,total_lots
20160831,CTC,Total,385
20160901,CTC,Total,555
.
.
The raw files on my disk are named '20160831_foo.CSV', '20160901_foo.CSV', etc.
After Googling this, I have not yet seen any examples of how to extract only one value from a CSV file. Any hints/help much appreciated. Happy to use pandas if that makes life easier.
I ended up with the following:
import pandas as pd
import glob
list_ = []
filenames = glob.glob('c:\\Financial Data\\*_DAILY.csv')
for filename in filenames:
    df = pd.read_csv(filename, index_col=None, usecols=['business_day', 'commodity', 'total', 'total_lots'], parse_dates=['business_day'], infer_datetime_format=True)
    df = df[(df['commodity'] == 'CTC') & (df['total'] == 'Total')]
    list_.append(df)
df = pd.concat(list_, ignore_index=True)
df['total_lots'] = df['total_lots'].astype(int)
df = df.sort_values(['business_day'])
df = df.set_index('business_day')
Then I save it as my required file.
Read the csv files and process them directly like so:
import csv

with open('some.csv', newline='') as f:
    reader = csv.reader(f)
    for row in reader:
        # do something here with `row`
        break
I would recommend appending the rows you want onto a list as you process them, then passing the list to a pandas DataFrame, which will simplify your data manipulation a lot.
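That suggestion can be sketched as follows, using an in-memory sample in place of the real files (the column names follow the question's files):

```python
import csv
import io
import pandas as pd

# In-memory stand-in for one day's file contents
sample = """business_day,commodity,total,delivery,total_lots
20160831,CTC,,201710,10
20160831,CTC,Total,,385
"""

rows = []
for record in csv.DictReader(io.StringIO(sample)):
    if record['total'] == 'Total':  # keep only the summary row
        rows.append({'business_day': record['business_day'],
                     'commodity': record['commodity'],
                     'total': record['total'],
                     'total_lots': int(record['total_lots'])})

df = pd.DataFrame(rows)
print(df)
```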