I'm new to pandas and Python, so I hope this makes sense.
I have parsed multiple tables from a website into multiple CSV files, and unfortunately if a value was not available for the parsed data, it was omitted from the table. As a result, I now have CSV files with varying numbers of columns.
I've used read_csv() and to_csv() in the past and they work like a charm when the data is clean, but I'm stumped here.
I figured there might be a way to "map" the data as it is read if I first fed the pandas DataFrame all of the column headers, and then mapped each file against the columns in the main file.
E.g. once I've used read_csv(), to_csv() would look at the main merged file and "map" the available fields to the correct columns in the merged file.
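Something along these lines is what I had in mind (untested sketch; I'm assuming master_header.csv holds the full set of column names and file1.csv is one of the parsed files):

import pandas as pd

# the master header defines the full, ordered set of columns
master_columns = pd.read_csv('I:\\Filepath\\master_header.csv', nrows=0).columns

df = pd.read_csv('I:\\Filepath\\file1.csv')
# reindex "maps" the available fields onto the master columns; missing ones become NaN
df = df.reindex(columns=master_columns)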
This is a short version of the data:
File 1:
ID, Price, Name,
1, $800, Jim
File 2:
ID, Price, Address, Name
2, $500, 1 Main St., Amanda
Desired Output:
ID, Price, Address, Name
1, $800, , Jim
2, $500, 1 Main St., Amanda
This is the code I have so far:
import os
from os import walk
import pandas as pd

mypath = 'I:\\Filepath\\'

# creating list of files to be read, and merged.
listFiles = []
for (dirpath, dirnames, filenames) in walk(mypath):
    listFiles.extend(filenames)
    break

# reading/writing "master headers" to new CSV using a "master header" file
headers = pd.read_csv('I:\\Filepath\\master_header.csv', index_col=0)
with open('I:\\Filepath\\merge.csv', 'w') as f:
    headers.to_csv(f)

def mergefile(filenames):
    try:
        # Creating a list of files read.
        with open('I:\\Filepath\\file_list.txt', 'a') as f:
            f.write(str(filenames) + '\n')
        os.chdir('I:\\Filepath\\')
        # Reading file to add.
        df = pd.read_csv(filenames, index_col=0)
        # Appending data (w/o header) to the new merged data CSV file.
        with open('I:\\Filepath\\merge.csv', 'a') as f:
            df.to_csv(f, header=False)
    except Exception as e:
        with open('I:\\Filepath\\all_error.txt', 'a') as f:
            f.write(str(e) + '\n')

for eachfilenames in listFiles:
    mergefile(eachfilenames)
This code merges the data, but since the number of columns varies, the values do not end up in the right places...
Any help would be greatly appreciated.
Try using the pandas concat[1] function, which defaults to an outer join (all columns will be present, and missing values will be NaN). For example:
import pandas as pd
# you would read each table into its own data frame using read_csv
f1 = pd.DataFrame({'ID': [1], 'Price': [800], 'Name': ['Jim']})
f2 = pd.DataFrame({'ID': [2], 'Price': [500], 'Address': '1 Main St.', 'Name': ['Amanda']})
pd.concat([f1, f2]) # merged data frame
[1] http://pandas.pydata.org/pandas-docs/stable/merging.html
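Applied to your file-merging setup, a sketch along these lines might work (assuming the files to merge sit in I:\Filepath\ and are all picked up by a *.csv glob; adjust the paths and pattern as needed):

import glob
import pandas as pd

# read every CSV in the folder into its own DataFrame
files = glob.glob('I:\\Filepath\\*.csv')
frames = [pd.read_csv(f) for f in files]

# outer-join style concat: all columns are kept, missing values become NaN
merged = pd.concat(frames, ignore_index=True)

# write the combined result out
merged.to_csv('I:\\Filepath\\merge.csv', index=False)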
Here is a complete example that demonstrates how to load the files and merge them using concat:
In [297]:
import pandas as pd
import io
t="""ID, Price, Name
1, $800, Jim"""
df = pd.read_csv(io.StringIO(t), sep=',\s+')
t1="""ID, Price, Address, Name
2, $500, 1 Main St., Amanda"""
df1 = pd.read_csv(io.StringIO(t1), sep=',\s+')
pd.concat([df,df1], ignore_index=True)
Out[297]:
Address ID Name Price
0 NaN 1 Jim $800
1 1 Main St. 2 Amanda $500
Note that I pass ignore_index=True, otherwise you will get duplicate index entries, which I assume is not what you want. I'm also assuming that in your original data sample for 'File 1' you don't really have a trailing comma in your header line (ID, Price, Name,), so I removed it from my code above.
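If some files really do carry the trailing comma (in both the header and the data rows), read_csv will surface it as an extra 'Unnamed: N' column; a small, hedged cleanup before concatenating could be (file1.csv is a hypothetical example file):

import pandas as pd

df = pd.read_csv('file1.csv')  # hypothetical file whose lines end with a trailing comma
# drop any auto-generated "Unnamed: N" columns produced by the trailing comma
df = df.loc[:, ~df.columns.str.startswith('Unnamed')]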
Related
With Python and Pandas I'm seeking to take values from CSV cells and write them as txt files via a loop. The structure of the CSV file is:
user_id, text, text_number
0, test text A, text_0
1,
2,
3,
4,
5, test text B, text_1
The script below successfully writes a txt file for the first row - it is named text_0.txt and contains test text A.
import pandas as pd

df = pd.read_csv("test.csv", sep=",")
for index in range(len(df)):
    with open(df["text_number"][index] + '.txt', 'w') as output:
        output.write(df["text"][index])
However, I receive an error when it proceeds to the next row:
TypeError: write() argument must be str, not float
I'm guessing the error is generated when it encounters values it reads as NaN. I attempted to add the dropna feature per the pandas documentation like so:
import pandas as pd

df = pd.read_csv("test.csv", sep=",")
df2 = df.dropna(axis=0, how='any')
for index in range(len(df)):
    with open(df2["text_number"][index] + '.txt', 'w') as output:
        output.write(df2["text"][index])
However, the same issue persists - a txt file is created for the first row, but a new error message is returned for the next row: KeyError: 1.
Any suggestions? All assistance greatly appreciated.
The issue here is that after dropna() you are indexing with a range that is not necessarily in the data frame's index. For your use case, you can just iterate through the rows of the data frame and write to the file.
for t in df.itertuples():
    if pd.notna(t.text_number):  # skip rows where the text number is missing (NaN)
        with open(t.text_number + '.txt', 'w') as output:
            output.write(str(t.text))
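Alternatively, if you want to keep the dropna() idea from your second attempt, iterate over the filtered frame itself instead of a range over the original; a minimal sketch, assuming the columns really are named text and text_number:

import pandas as pd

df = pd.read_csv("test.csv", sep=",")
# keep only rows that have both a text value and a target file name
df2 = df.dropna(subset=["text", "text_number"])

for t in df2.itertuples():
    with open(t.text_number + '.txt', 'w') as output:
        output.write(str(t.text))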
I have a .csv to read into a DataFrame, and the names of the columns are in the same .csv file in the previous rows. Usually I drop all the 'unnecessary' rows to create the DataFrame and then hardcode the column names for each dataframe.
Trigger time,2017-07-31,10:45:38
CH,Signal name,Input,Range,Filter,Span
CH1, "Tin_MIX_Air",TEMP,PT,Off,2000.000000,-200.000000,degC
CH2, "Tout_Fan2b",TEMP,PT,Off,2000.000000,-200.000000,degC
CH3, "Tout_Fan2a",TEMP,PT,Off,2000.000000,-200.000000,degC
CH4, "Tout_Fan1a",TEMP,PT,Off,2000.000000,-200.000000,degC
Here you can see the rows where the column names are in double quotes ("Tin_MIX_Air", "Tout_Fan2b", etc.); there are exactly 16 rows with names.
Logic/Pulse,Off
Data
Number,Date&Time,ms,CH1,CH2,CH3,CH4,CH5,CH7,CH8,CH9,CH10,CH11,CH12,CH13,CH14,CH15,CH16,CH20,Alarm1-10,Alarm11-20,AlarmOut
NO.,Time,ms,degC,degC,degC,degC,degC,degC,%RH,%RH,degC,degC,degC,degC,degC,Pa,Pa,A,A1234567890,A1234567890,A1234
1,2017-07-31 10:45:38,000,+25.6,+26.2,+26.1,+26.0,+26.3,+25.7,+43.70,+37.22,+25.6,+25.3,+25.1,+25.3,+25.3,+0.25,+0.15,+0.00,LLLLLLLLLL,LLLLLLLLLL,LLLL
And here is where the values of each variable start.
What I need to do is create a DataFrame from this .csv and use those names as the column names. I'm new to Python and I'm not very sure how to do it.
import pandas as pd

path = r'path-to-file.csv'
data = pd.DataFrame()
with open(path, 'r') as f:
    for line in f:
        data = pd.concat([data, pd.DataFrame([tuple(line.strip().split(','))])],
                         ignore_index=True)

data.drop(data.index[range(0, 29)], inplace=True)
x = len(data.iloc[0])
data.drop(data.columns[[0, 1, 2, x - 1, x - 2, x - 3]], axis=1, inplace=True)
data.reset_index(drop=True, inplace=True)
data = data.T.reset_index(drop=True).T
data = data.apply(pd.to_numeric)
This is what I've done so far to get my dataframe with the useful data: I'm dropping all the columns that aren't useful to me and keeping only the values. The last three lines reset the row/column indexes and convert the whole df to floats. What I would like is to name the columns with the names shown in the first snippet; as I said before, I'm currently doing this manually:
data.columns = ['a','b','c','d','e','f','g','h','i','j','k','l','m','n','o','p']
But I would like to get them from the .csv file, since there's a possibility that the CH# - "Name" combination changes.
Thank you very much for the help!
Comment: Is it possible for it to work within the other "open" loop that I have?
Assume the column names are in rows 3 up to 6 (after the CH header row) and the data runs from row 7 up to EOF.
For instance (untested code)
import pandas as pd

data = None
columns = []
with open(path) as fh:
    for row, line in enumerate(fh, 1):
        if 2 < row <= 6:
            # rows 3..6 carry the CH / name pairs
            ch, name = line.split(',')[:2]
            columns.append(name.strip().strip('"'))
        elif row > 6:
            # data rows
            row_data = [tuple(line.strip().split(','))]
            if data is None:
                data = pd.DataFrame(row_data, columns=columns)
            else:
                data = pd.concat([data, pd.DataFrame(row_data, columns=columns)],
                                 ignore_index=True)
Question: ... I would like to get them from the .csv file
Start with:
with open(path) as fh:
    for row, line in enumerate(fh, 1):
        if row > 2:
            ch, name = line.split(',')[:2]
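Putting the pieces together with your own cleanup code, a sketch of just the name-extraction step might look like this (assuming, as in your sample, that each name row starts with CH followed by a number, and that path points at your file):

import re

path = r'path-to-file.csv'

# collect the quoted signal names from the CH1, CH2, ... rows
names = []
with open(path) as fh:
    for line in fh:
        fields = line.strip().split(',')
        if len(fields) > 1 and re.match(r'CH\d+$', fields[0].strip()):
            names.append(fields[1].strip().strip('"'))

print(names)
# then, after your existing cleanup that leaves exactly 16 data columns:
# data.columns = names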
I have a large body of CSV data, around 40GB in size, that I need to process (let's call it the 'body'). Each file in this body is a single-column CSV. Each row is a keyword consisting of words and short sentences, e.g.
Dog
Feeding cat
used cars in Brighton
trips to London
.....
This data needs to be compared against another set of files (this one 7GB in size, which I will call 'Removals'); any keywords from the Removals need to be identified and removed from the body. The data in the Removals is similar to what's in the body, i.e.:
Guns
pricless ming vases
trips to London
pasta recipes
........
While I have an approach that will get the job done, it is very slow and could take a good week to finish. It is a multi-threaded approach in which every file from the body is compared in a for loop against the files from the 7GB Removals set. It casts the column from the Removals file as a list and then filters the body file to keep any row that is not in that list. The filtered data is then appended to an output file:
import glob
import pandas as pd

def thread_worker(file_):
    removal_path = "removal_files"
    allFiles_removals = glob.glob(removal_path + "/*.csv", recursive=True)
    print(allFiles_removals)
    print(file_)
    file_df = pd.read_csv(file_)
    file_df.columns = ['Keyword']
    for removal_file_ in allFiles_removals:
        print(removal_file_)
        removal_df = pd.read_csv(removal_file_, header=None)
        removal_df.columns = ['Keyword']
        removal_keyword_list = removal_df['Keyword'].values.tolist()
        file_df = file_df[~file_df['Keyword'].isin(removal_keyword_list)]
    file_df.to_csv('output.csv', index=False, header=False, mode='a')
Obviously, my main aim is to work out how to get this done faster. Is pandas even the best way to do this? I tend to default to using it when dealing with CSV files.
IIUC you can do it this way:
# read up "removal" keywords from all CSV files, get rid of duplicates
removals = pd.concat([pd.read_csv(f, sep='~', header=None, names=['Keyword']) for f in removal_files]
ignore_index=True).drop_duplicates()
df = pd.DataFrame()
for f in body_files:
# collect all filtered "body" data (file-by-file)
df = pd.concat([df,
pd.read_csv(f, sep='~', header=None, names=['Keyword']) \
.query('Keyword not in #removals.Keyword')],
ignore_index=True)
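Since speed is the main concern, a variant worth trying (a sketch under the assumption that the de-duplicated removal keywords fit in memory, and that the body files live in a hypothetical body_files/ folder) is to build a Python set once and filter each body file with isin, which avoids re-reading the removal files per body file and is typically faster than query for large membership tests:

import glob
import pandas as pd

removal_files = glob.glob("removal_files/*.csv")
body_files = glob.glob("body_files/*.csv")  # hypothetical location of the 40GB body

# build the removal set once
removals = set()
for f in removal_files:
    removals.update(pd.read_csv(f, sep='~', header=None, names=['Keyword'])['Keyword'])

# stream each body file, filter, and append to the output
for f in body_files:
    body = pd.read_csv(f, sep='~', header=None, names=['Keyword'])
    body[~body['Keyword'].isin(removals)].to_csv('output.csv', index=False,
                                                 header=False, mode='a')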
You can probably read them in small chunks and make the text column a category dtype (which stores each distinct value only once) while reading:
import pandas as pd
from pandas.api.types import CategoricalDtype

# chunksize is the number of rows per chunk
TextFileReader = pd.read_csv(path, chunksize=1000,
                             dtype={"text_column": CategoricalDtype()})
dfList = []
for df in TextFileReader:
    dfList.append(df)
df = pd.concat(dfList, sort=False)
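If the point is to end up with unique keywords, an explicit drop_duplicates() after the concat makes that step visible; a small sketch, assuming a hypothetical single-column keywords.csv and a 'Keyword' column name:

import pandas as pd

# read a large single-column keyword file in chunks and de-duplicate the combined result
chunks = pd.read_csv('keywords.csv', header=None, names=['Keyword'], chunksize=100_000)
unique_keywords = pd.concat(chunks, ignore_index=True).drop_duplicates()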
I'm looking for the best way to rename my header using DictReader/DictWriter, to add to the other steps I've already done.
This is what I am trying to do to the Source data example below.
Remove the first 2 lines
Reorder the columns (header & data) to 2, 1, 3 vs the source file
Rename the header to ASXCode, CompanyName, GICS
Where I'm at:
If I use reader = csv.reader(inf), the first lines are removed and the columns reordered, but as expected there is no header rename.
Alternately, when I run the DictReader line reader = csv.DictReader(inf, fieldnames=('ASXCode', 'CompanyName', 'GICS')), I receive the error 'dict contains fields not in fieldnames:' and it shows the first row of data rather than the header.
I'm a bit stuck on how I get around this so any tips appreciated.
Source Data example
ASX listed companies as at Mon May 16 17:01:04 EST 2016
Company name ASX code GICS industry group
1-PAGE LIMITED 1PG Software & Services
1300 SMILES LIMITED ONT Health Care Equipment & Services
1ST AVAILABLE LTD 1ST Health Care Equipment & Services
My Code
import csv
import urllib.request
from itertools import islice

local_filename = "C:\\myfile.csv"
url = ('http://mysite/afile.csv')
temp_filename, headers = urllib.request.urlretrieve(url)

with open(temp_filename, 'r', newline='') as inf, \
     open(local_filename, 'w', newline='') as outf:
    # reader = csv.DictReader(inf, fieldnames=('ASXCode', 'CompanyName', 'GICS'))
    reader = csv.reader(inf)
    fieldnames = ['ASX code', 'Company name', 'GICS industry group']
    writer = csv.DictWriter(outf, fieldnames=fieldnames)

    # 1. Remove top 2 rows
    next(islice(reader, 2, 2), None)

    # 2. Reorder Columns
    writer.writeheader()
    for row in csv.DictReader(inf):
        writer.writerow(row)
IIUC here is a solution using pandas and its function read_csv:
import pandas as pd
#Considering that you have your data in a file called 'stock.txt'
#and it is tab separated, by default the blank lines are not read by read_csv,
#hence set the header=1
df = pd.read_csv('stock.txt', sep='\t',header=1)
#Rename the columns as required
df.columns= ['CompanyName', 'ASXCode', 'GICS']
#Reorder the columns as required
df = df[['ASXCode','CompanyName','GICS']]
Based on your tips I got it working in the end. I hadn't used pandas before, so I had to read up a little first.
I eventually worked out that pandas uses a DataFrame, so I had to do a few things differently with the to_csv function, and eventually added the index=False parameter to to_csv to remove the df index.
Now it's all great, thank you.
import csv
import os
import urllib.request
import pandas as pd
local_filename = "C:\\myfile.csv"
url = ('http://mysite/afile.csv')
temp_filename, headers = urllib.request.urlretrieve(url)
#using pandas dataframe
df = pd.read_csv(temp_filename, sep=',',header=1) #skip header
df.columns = ['CompanyName', 'ASXCode', 'GICS'] #rename columns
df = df[['ASXCode','CompanyName','GICS']] #reorder columns
df.to_csv(local_filename, sep=',', index=False)
os.remove(temp_filename) # clean up
I'm reading a CSV file into pandas. The issue is that the file needs some rows removed and values calculated from the other rows. My current idea starts like this:
with open(down_path.name) as csv_file:
    rdr = csv.DictReader(csv_file)
    for row in rdr:
        type = row['']
        if type == 'Summary':
            current_ward = row['Name']
        else:
            name = row['Name']
            count1 = row['Count1']
            count2 = row['Count2']
            count3 = row['Count3']
            index_count += 1
            # write to someplace
,Name,count1,count2,count3
Ward Summary,Aloha 1,35,0,0
Individual Statistics,John,35,0,0
Ward Summary,Aloha I,794,0,0
Individual Statistics,Walter,476,0,0
Individual Statistics,Deborah,182,0,0
The end result needs to end up in a dataframe that I can concatenate to an existing dataframe.
The braindead way to do this is to simply do my conversions, create a new CSV file, and then read that back in. That seems like a non-Pythonic way to go.
I need to take out the summary lines, combine those with similar names (Aloha 1 and Aloha I), remove the individual stats label, and put the Aloha 1 label on each of the individuals. Plus I need to add which month this data is from. As you can see, the data needs some work :)
desired output would be
Jan-16, Aloha 1, John, 1,2,3
Where the Aloha 1 comes from the summary line above it
My personal preference would be to do everything in Pandas.
Perhaps something like this...
# imports
import numpy as np
import pandas as pd
from io import StringIO

# read in your data
data = """,Name,count1,count2,count3
Ward Summary,Aloha 1,35,0,0
Individual Statistics,John,35,0,0
Ward Summary,Aloha I,794,0,0
Individual Statistics,Walter,476,0,0
Individual Statistics,Deborah,182,0,0"""
df = pd.read_csv(StringIO(data))
# give the first column a better name for convenience
df.rename(columns={'Unnamed: 0': 'Desc'}, inplace=True)
# create a mask for the Ward Summary lines
ws_mask = df.Desc == 'Ward Summary'
# create a ward_name column that has names only for Ward Summary lines
df['ward_name'] = np.where(ws_mask, df.Name, np.nan)
# forward fill the missing ward names from the previous summary line
df.ward_name.fillna(method='ffill', inplace=True)
# get rid of the ward summary lines
df = df.loc[~ws_mask]
# get rid of the Desc column
df = df.drop('Desc', axis=1)
Yes; you pass over the data more than once, so you could potentially do better with a smarter single pass algorithm. But, if performance isn't your main concern, I think this has benefits in terms of conciseness and readability.
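To cover the remaining pieces of your desired output (the month column and collapsing 'Aloha I' into 'Aloha 1'), a short follow-on sketch, where the month string and the name mapping are assumptions you would adapt:

# map ward-name variants onto a single canonical name (assumed mapping)
df['ward_name'] = df['ward_name'].replace({'Aloha I': 'Aloha 1'})

# tag every row with the month this file covers (assumed value)
df['month'] = 'Jan-16'

# reorder the columns to match the desired output
df = df[['month', 'ward_name', 'Name', 'count1', 'count2', 'count3']]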