Renaming the header when using DictReader - Python

I'm looking for the best way to rename my header using DictReader / DictWriter, adding to the other steps I already have working.
This is what I am trying to do to the Source data example below.
Remove the first 2 lines
Reorder the columns (header & data) to 2, 1, 3 vs the source file
Rename the header to ASXCode, CompanyName, GICS
Here is where I'm at:
If I use 'reader = csv.reader(inf)' the first lines are removed and the columns reordered, but as expected there is no header rename.
Alternately, when I run the DictReader line 'reader = csv.DictReader(inf, fieldnames=('ASXCode', 'CompanyName', 'GICS'))' I receive the error 'dict contains fields not in fieldnames:' followed by the first row of data rather than the header.
I'm a bit stuck on how I get around this so any tips appreciated.
Source Data example
ASX listed companies as at Mon May 16 17:01:04 EST 2016
Company name ASX code GICS industry group
1-PAGE LIMITED 1PG Software & Services
1300 SMILES LIMITED ONT Health Care Equipment & Services
1ST AVAILABLE LTD 1ST Health Care Equipment & Services
My Code
import csv
import urllib.request
from itertools import islice
local_filename = "C:\\myfile.csv"
url = ('http://mysite/afile.csv')
temp_filename, headers = urllib.request.urlretrieve(url)
with open(temp_filename, 'r', newline='') as inf, \
        open(local_filename, 'w', newline='') as outf:
    # reader = csv.DictReader(inf, fieldnames=('ASXCode', 'CompanyName', 'GICS'))
    reader = csv.reader(inf)
    fieldnames = ['ASX code', 'Company name', 'GICS industry group']
    writer = csv.DictWriter(outf, fieldnames=fieldnames)
    # 1. Remove top 2 rows
    next(islice(reader, 2, 2), None)
    # 2. Reorder columns
    writer.writeheader()
    for row in csv.DictReader(inf):
        writer.writerow(row)

IIUC, here is a solution using pandas and its read_csv function:
import pandas as pd
# Considering that you have your data in a file called 'stock.txt'
# and it is tab separated. Blank lines are skipped by read_csv by default,
# so header=1 makes the second line (the real header) the header row.
df = pd.read_csv('stock.txt', sep='\t', header=1)
# Rename the columns as required
df.columns = ['CompanyName', 'ASXCode', 'GICS']
# Reorder the columns as required
df = df[['ASXCode', 'CompanyName', 'GICS']]
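If you'd rather stay with the csv module: the error in the question happens because once fieldnames is supplied, DictReader treats every row (including the real header) as data, so the junk lines and the original header have to be skipped manually. A minimal sketch, assuming a comma-separated file with two junk lines above the real header ('source.csv' and 'out.csv' are placeholder names):

import csv

with open('source.csv', newline='') as inf, \
        open('out.csv', 'w', newline='') as outf:
    # Skip the two junk lines plus the original header row
    for _ in range(3):
        next(inf)
    # With fieldnames supplied, DictReader treats every remaining row as data
    reader = csv.DictReader(inf, fieldnames=('CompanyName', 'ASXCode', 'GICS'))
    # DictWriter emits columns in whatever order its fieldnames lists them
    writer = csv.DictWriter(outf, fieldnames=('ASXCode', 'CompanyName', 'GICS'))
    writer.writeheader()
    writer.writerows(reader)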

Based on your tips I got it working in the end. I hadn't used pandas before so I had to read up a little first.
I eventually worked out that pandas uses a DataFrame, so I had to do a few things differently with the to_csv function, and eventually added the index=False parameter to to_csv to remove the DataFrame index.
Now all great, thank you.
import os
import urllib.request
import pandas as pd
local_filename = "C:\\myfile.csv"
url = ('http://mysite/afile.csv')
temp_filename, headers = urllib.request.urlretrieve(url)
#using pandas dataframe
df = pd.read_csv(temp_filename, sep=',', header=1) # first line is junk; second line is the header
df.columns = ['CompanyName', 'ASXCode', 'GICS'] #rename columns
df = df[['ASXCode','CompanyName','GICS']] #reorder columns
df.to_csv(local_filename, sep=',', index=False)
os.remove(temp_filename) # clean up

Related

Join large set of CSV files where the header is the timestamp for the file

I have a large set of CSV files, approximately 15,000, and I would like to figure out how to join them together into one file for data processing.
Each file follows a simple pattern, with a timestamp that corresponds to the period of time represented by the data in that CSV file.
Ex.
file1.csv
2021-07-23 08:00:00
Unit.Device.No03.ErrorCode;11122233
Unit.Device.No04.ErrorCode;0
Unit.Device.No05.ErrorCode;0
Unit.Device.No11.ErrorCode;0
file2.csv
2021-07-23 08:15:00
Unit.Device.No03.ErrorCode;0
Unit.Device.No04.ErrorCode;44556666
Unit.Device.No05.ErrorCode;0
Unit.Device.No11.ErrorCode;0
Each file starts with the timestamp. I would like to join all the files in a directory, transpose the "Unit.Device" rows to columns, and then use the original header as a Timestamp column. For each file, add a new row with the corresponding "ErrorCode" in each column.
Like this:
Timestamp;Unit.Device.No03.ErrorCode;Unit.Device.No04.ErrorCode;Unit.Device.No05.ErrorCode..
2021-07-23 08:00:00;11122233;0;0;0;0....
2021-07-23 08:15:00;0;44556666;0;0;0....
Any simple tools for this, or Python routines?
Thanks for the reply on my first question here!
I will also contribute with a solution for this problem.
I did some reading up on Pandas after I found something similar to what I wanted to do. I found that the transpose method was very easy to use, and put together this snippet of Python code instead.
import pandas as pd
import os
folder = 'in'
df_out = pd.DataFrame()
for filename in os.scandir(folder):
    if filename.is_file():
        print('Working on file ' + filename.path)
        df = pd.read_csv(filename.path, encoding='utf-16', sep=';', header=[0])
        # Transpose data with timestamp header to columns
        df_transposed = df.T
        df_out = df_out.append(df_transposed)
df_out.to_csv('output.csv')
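One caveat on that snippet: DataFrame.append was deprecated in pandas 1.4 and removed in pandas 2.0, so on a current install the loop fails with an AttributeError. Collecting the transposed frames in a list and concatenating once at the end is the modern equivalent; a sketch under the same file layout:

import os
import pandas as pd

folder = 'in'
frames = []
for entry in os.scandir(folder):
    if entry.is_file():
        df = pd.read_csv(entry.path, encoding='utf-16', sep=';', header=[0])
        frames.append(df.T)  # transpose so the header row becomes the index

# Concatenate once at the end instead of appending row by row
df_out = pd.concat(frames)
df_out.to_csv('output.csv')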
Try the following Pandas approach:
import pandas as pd
import glob

dfs = []
for csv_filename in glob.glob('./file*.csv'):
    print('Working on file', csv_filename)
    # Read the CSV file, assume no header and two columns
    df = pd.read_csv(csv_filename, sep=';', names=[0, 1], header=None)
    # Transpose from the 2nd row (skip the timestamp)
    df_transposed = df[1:].T
    # Allocate the column names from the first row
    df_transposed.columns = df_transposed.iloc[0]
    # Copy the timestamp into the transposed dataframe as a datetime value
    df_transposed['Timestamp'] = pd.to_datetime(df.iloc[0, 0])
    # Remove the first row (containing the names)
    df_transposed = df_transposed[1:]
    dfs.append(df_transposed)

# Concatenate all dataframes together and sort by Timestamp
df_output = pd.concat(dfs).sort_values(by='Timestamp')
# Sort the header columns and output to a CSV file
df_output.reindex(sorted(df_output.columns), axis=1).to_csv('output.csv', index=None)
Alternatively, it could be done using standard Python:
from datetime import datetime
import csv
import glob

data = []
fieldnames = set()

for fn in glob.glob('file*.csv'):
    with open(fn) as f_input:
        csv_input = csv.reader(f_input, delimiter=';')
        timestamp = next(csv_input)[0]
        row = {'Timestamp': timestamp}
        for device, error_code in csv_input:
            row[device] = error_code
            fieldnames.add(device)
        data.append(row)

with open('output.csv', 'w', newline='') as f_output:
    csv_output = csv.DictWriter(f_output, fieldnames=['Timestamp', *sorted(fieldnames)], delimiter=';')
    csv_output.writeheader()
    csv_output.writerows(sorted(data, key=lambda x: datetime.strptime(x['Timestamp'], '%Y-%m-%d %H:%M:%S')))
This gives output.csv as:
Timestamp;Unit.Device.No03.ErrorCode;Unit.Device.No04.ErrorCode;Unit.Device.No05.ErrorCode;Unit.Device.No11.ErrorCode
2021-07-23 08:00:00;11122233;0;0;0
2021-07-23 08:15:00;0;44556666;0;0
How does this work?
First, iterate over all .csv files in a given folder.
For each file, open it using csv.reader().
Read the header row as a special case, storing the value as a Timestamp entry in a dictionary row.
For each remaining row, store additional key-value entries in the row dictionary.
Keep a note of each device name using a set.
Append the complete row into a data list.
It is now possible to create an output.csv file. The full list of column names can be assigned as fieldnames and a csv.DictWriter() used.
Write the header.
Use writerows() to write all the data rows sorted by timestamp. To do this convert each row's Timestamp entry into a datetime value for sorting.
This approach will also work if the CSV files happen to have different types of devices e.g. Unit.Device.No42.ErrorCode.

Creating a single output file from 3 CSV files using Python

I have 3 CSV files. Names are below
AD.csv
ID.csv
MD.csv
AD.csv
A.Net ATVS
A&E HD 60 Days In
AXSTV 60 Days : Watch Along
BET HD Behind Bars: Rookie Year
Bloomberg Biggie: The Life of Notorious B.I.G.
ID.csv
I.Net ITvs
AETVHD 60 Days In
AXSTV 60 Days : Watch Along
BETHD Behind Bars: Rookie Year
BLOOMHD Dog the Bounty Hunter
MD.csv
A.Net I.Net
A&E HD AETVHD
AXSTV AXSTV
BET HD BETHD
Bloomberg BLOOMHD
In MD.csv, 'A.Net' maps to 'I.Net',
which means I have to map the data in 'ATVS' to 'ITvs' wherever MD.csv says 'A.Net' = 'I.Net'.
I am new to writing Python scripts; can anyone help me with this mapping?
import csv
with open('E:/ad.csv', 'r') as lookuplist:
    with open('E:/id.csv', 'r') as csvinput:
        with open('vlookupout', 'w') as output:
            reader = csv.reader(lookuplist)
            reader2 = csv.reader(csvinput)
            writer = csv.writer(output)
            for itvs in reader2:
                for atvs in reader:
                    if itvs[0] == atvs[0]:
                        itvs.append(atvs[1:])
                writer.writerow(itvs)
If you don't have any dependency constraints, use DataFrame from the pandas library.
Using DataFrames, you can simply read and load the CSVs as tables.
import pandas as pd

ad = pd.read_csv('E:/ad.csv')
id_df = pd.read_csv('E:/id.csv')  # 'id' would shadow the built-in id()
md = pd.read_csv('E:/md.csv')
... and perform joins/merges/aggregations on them, going through the MD.csv mapping first:
result = pd.merge(ad, md, on='A.Net')
result = pd.merge(result, id_df[['I.Net', 'ITvs']], on='I.Net')
It'll be much easier and more flexible for your requirements.
You can do this using Pandas.
import pandas as pd
# read in the csv's
ad_df = pd.read_csv('AD.csv', sep=r'\s\s+', engine='python')
id_df = pd.read_csv('ID.csv', sep=r'\s\s+', engine='python')
md_df = pd.read_csv('MD.csv', sep=r'\s\s+', engine='python')
# Combine the csv's using MD.csv
result = pd.merge(ad_df,md_df[['A.Net', 'I.Net']], on='A.Net')
result = pd.merge(result,id_df[['I.Net', 'ITvs']], on='I.Net')
# in case you want to drop 'I.Net' add:
result.drop('I.Net', axis=1, inplace=True)
#export to csv:
result.to_csv('result.csv', index=False)
Note: your CSVs have some inconsistencies in the header names. I used the names in my script exactly as provided.
As noted in my comment, your CSV separation looks off, so I made one small change to the CSV by adding an extra space between "BLOOMHD" and "Dog the...".

Python Pandas performing operation on each row of CSV file

I have a 1-million-line CSV file. I want to call a lookup function on each row's first column and append the result as a new column in the same CSV (if possible).
What I want is something like this:
for each row in dataframe
    string = row[1]
    result = lookupFunction(string)
    row.append(result)
I know I could do it using Python's csv library by opening my CSV, reading each row, doing my operation, and writing the results to a new CSV.
This is my code using Python's CSV library
with open(rawfile, 'r') as f:
    with open(newFile, 'a') as csvfile:
        csvwritter = csv.writer(csvfile, delimiter=' ')
        for line in f:
            # do operation
However I really want to do it with Pandas because it would be something new to me.
This is what my data looks like
77,#oshkosh # tannersville pa,,PA,US
82,#osithesakcom ca,,CA,US
88,#osp open records or,,OR,US
89,#ospbco tel ord in,,IN,US
98,#ospwmnwithn return in,,IN,US
99,#ospwmnwithn tel ord in,,IN,US
100,#osram sylvania inc ma,,MA,US
106,#osteria giotto montclair nj,,NJ,US
Any help and guidance will be appreciated. Thanks.
Here is a simple example of adding two columns together into a new column from your CSV file:
import pandas as pd
df = pd.read_csv("yourpath/yourfile.csv")
df['newcol'] = df['col1'] + df['col2']
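For the lookup case in the question, Series.apply runs a function over one column, and the result can be assigned straight back as a new column. A sketch, assuming lookupFunction is defined elsewhere and that the sample data has no header row (the column names below are made up):

import pandas as pd

def lookupFunction(value):
    # hypothetical stand-in for the real lookup
    return str(value).strip().upper()

# The sample rows have no header line, so supply names here (assumed)
df = pd.read_csv('rawfile.csv', header=None,
                 names=['id', 'text', 'blank', 'state', 'country'])

# Apply the lookup to the second column and store the result as a new column
df['lookup_result'] = df['text'].apply(lookupFunction)

df.to_csv('newfile.csv', index=False)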
create df and csv
import pandas as pd
df = pd.DataFrame(dict(A=[1, 2], B=[3, 4]))
df.to_csv('test_add_column.csv')
read csv into dfromcsv
dfromcsv = pd.read_csv('test_add_column.csv', index_col=0)
create new column
dfromcsv['C'] = df['A'] * df['B']
dfromcsv
write csv
dfromcsv.to_csv('test_add_column.csv')
read it again
dfromcsv2 = pd.read_csv('test_add_column.csv', index_col=0)
dfromcsv2

How to extract a single row from multiple CSV files to a new file

I have hundreds of CSV files on my disk, with one file added daily, and I want to extract one row from each of them and put them all in a new file. Then I want to add values to that same file daily. The CSV files look like this:
business_day,commodity,total,delivery,total_lots
.
.
20160831,CTC,,201710,10
20160831,CTC,,201711,10
20160831,CTC,,201712,10
20160831,CTC,Total,,385
20160831,HTC,,201701,30
20160831,HTC,,201702,30
.
.
I want to fetch the row that contains 'Total' from each file. The new file should look like:
business_day,commodity,total,total_lots
20160831,CTC,Total,385
20160901,CTC,Total,555
.
.
The raw files on my disk are named '20160831_foo.CSV', '20160901_foo.CSV', etc.
After Googling this I have not yet seen any examples of how to extract just one row from a CSV file. Any hints/help much appreciated. Happy to use pandas if that makes life easier.
I ended up with the following:
import pandas as pd
import glob
list_ = []
filenames = glob.glob('c:\\Financial Data\\*_DAILY.csv')
for filename in filenames:
    df = pd.read_csv(filename, index_col=None,
                     usecols=['business_day', 'commodity', 'total', 'total_lots'],
                     parse_dates=['business_day'], infer_datetime_format=True)
    df = df[(df['commodity'] == 'CTC') & (df['total'] == 'Total')]
    list_.append(df)

df = pd.concat(list_, ignore_index=True)
df['total_lots'] = df['total_lots'].astype(int)
df = df.sort_values(['business_day'])
df = df.set_index('business_day')
Then I save it as my required file.
Read the csv files and process them directly like so:
import csv

with open('some.csv', newline='') as f:
    reader = csv.reader(f)
    for row in reader:
        # do something here with `row`
        break
I would recommend appending the rows you want to a list as you process them, and then passing that list to a pandas DataFrame, which will simplify your data manipulations a lot. A minimal sketch of that approach is below.
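This sketch assumes the files match the naming pattern from the question and that each contains a single 'Total' row ('totals.csv' is a placeholder output name; the asker additionally filters on commodity):

import csv
import glob

import pandas as pd

rows = []
for filename in glob.glob('*_foo.CSV'):
    with open(filename, newline='') as f:
        for row in csv.reader(f):
            if 'Total' in row:  # keep only the summary row
                rows.append(row)
                break           # assume one 'Total' row per file

# Build a DataFrame once, using the header from the question
df = pd.DataFrame(rows, columns=['business_day', 'commodity',
                                 'total', 'delivery', 'total_lots'])
df = df.drop(columns='delivery').sort_values('business_day')
df.to_csv('totals.csv', index=False)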

Merging CSV Files with missing columns in Pandas

I'm new to pandas and Python, so I hope this will make sense.
I have parsed multiple tables from a website into multiple CSV files, and unfortunately if a value was not available for the parsed data, it was omitted from the table. Hence, I now have CSV files with varying numbers of columns.
I've used the read_csv() and to_csv() in the past and it works like a charm when the data is clean, but I'm stumped here.
I figured there might be a way to "map" the read data if I first fed the pandas DataFrame all the column headers and then mapped each file against the columns in the main file.
E.g. once I've used read_csv(), to_csv() would look at the main merged file and "map" the available fields to the correct columns in the merged file.
This is a short version of the data:
File 1:
ID, Price, Name,
1, $800, Jim
File 2:
ID, Price, Address, Name
2, $500, 1 Main St., Amanda
Desired Output:
ID, Price, Address, Name
1, $800, , Jim
2, $500, 1 Main St., Amanda
This is the code I have so far.
import os
from os import walk
import pandas as pd

mypath = 'I:\\Filepath\\'

# creating list of files to be read, and merged.
listFiles = []
for (dirpath, dirnames, filenames) in walk(mypath):
    listFiles.extend(filenames)
    break

# reading/writing "master headers" to new CSV using a "master header" file
headers = pd.read_csv('I:\\Filepath\\master_header.csv', index_col=0)
with open('I:\\Filepath\\merge.csv', 'wb') as f:
    headers.to_csv(f)

def mergefile(filenames):
    try:
        # Creating a list of files read.
        with open('I:\\Filepath\\file_list.txt', 'a') as f:
            f.write(str(filenames) + '\n')
        os.chdir('I:\\Filepath\\')
        # Reading file to add.
        df = pd.read_csv(filenames, index_col=0)
        # Appending data (w/o header) to the new merged data CSV file.
        with open('I:\\Filepath\\merge.csv', 'a') as f:
            df.to_csv(f, header=False)
    except Exception, e:  # Python 2 syntax, as in the original
        with open('I:\\Filepath\\all_error.txt', 'a') as f:
            f.write(str(e) + '\n')

for eachfilenames in listFiles:
    mergefile(eachfilenames)
This code merges the data, but since the numbers of columns vary, the values do not end up in the right columns...
Any help would be greatly appreciated.
Try using the pandas concat[1] function, which defaults to an outer join (all columns will be present, and missing values will be NaN). For example:
import pandas as pd
# you would read each table into its own data frame using read_csv
f1 = pd.DataFrame({'ID': [1], 'Price': [800], 'Name': ['Jim']})
f2 = pd.DataFrame({'ID': [2], 'Price': [500], 'Address': '1 Main St.', 'Name': ['Amanda']})
pd.concat([f1, f2]) # merged data frame
[1] http://pandas.pydata.org/pandas-docs/stable/merging.html
Here is a complete example that demonstrates how to load the files and merge them using concat:
In [297]:
import pandas as pd
import io
t="""ID, Price, Name
1, $800, Jim"""
df = pd.read_csv(io.StringIO(t), sep=r',\s+')
t1="""ID, Price, Address, Name
2, $500, 1 Main St., Amanda"""
df1 = pd.read_csv(io.StringIO(t1), sep=r',\s+')
pd.concat([df,df1], ignore_index=True)
Out[297]:
Address ID Name Price
0 NaN 1 Jim $800
1 1 Main St. 2 Amanda $500
Note that I pass ignore_index=True, otherwise you would get duplicate index entries, which I assume is not what you want. Also, I'm assuming that in your original data sample for 'File 1' you don't really have a trailing comma in your header line (ID, Price, Name,), so I removed it from my code above.
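If you want to keep the append-to-one-file pattern from the question, DataFrame.reindex can align each file to the master header before appending, filling missing columns with NaN. A sketch, assuming master_header.csv holds just the header row, merge.csv already starts with that header, and 'file2.csv' stands in for each file being merged:

import pandas as pd

# Read only the header row of the master file to get the full column list
master_cols = pd.read_csv('I:\\Filepath\\master_header.csv', nrows=0).columns

df = pd.read_csv('I:\\Filepath\\file2.csv')
# Align to the master layout: missing columns become NaN, extras are dropped
df = df.reindex(columns=master_cols)
# Append without repeating the header
df.to_csv('I:\\Filepath\\merge.csv', mode='a', header=False, index=False)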
