I'm trying to create a combined dataframe from a series of 12 individual CSVs (12 months to combine for the year). All the CSVs have the same format and column layout.
When I first ran it, it appeared to work and I was left with a combined dataframe with 6 columns (as expected). Upon looking at it, I found that the header row was applied as actual data in all the files, so I had some bad rows I needed to eliminate. I could manually make these changes but I'm looking to have the code take care of this automatically.
To that end, I updated the code so that it reads the first CSV with headers and the remaining CSVs without headers, then concatenates everything together. This appears to work, BUT I end up with 12 columns instead of 6, with the first 6 columns having NaNs for the first CSV and the last 6 columns having NaNs for the other 11 CSVs, which is obviously NOT what I want (see image below).
The code is similar, I just use the header=None parameter in pd.read_csv() for the 11 CSVs after the first (and I don't use that parameter for the first CSV). Can anyone give me a hint as to why I'm getting 12 columns (with the data placement as described) when I run this code? The layout of the CSV file is shown below.
Appreciate any help.
import pandas as pd
import numpy as np
import os
# Need to include the header row only for the first csv (otherwise header row will be included
# for each read csv, which places improperly formatted rows into the combined dataframe).
totrows = 0
# Get list of csv files to read.
files = os.listdir('c:/data/datasets')
# Read the first csv file, including the header row.
dfSD = pd.read_csv('c:/data/datasets/' + files[0], skip_blank_lines=True)
# Now read the remaining csv files (without header row) and concatenate their values
# into our full Sales Data dataframe.
for file in files[1:]:
    df = pd.read_csv('c:/data/datasets/' + file, skip_blank_lines=True, header=None)
    dfSD = pd.concat([dfSD, df])
    totrows += df.shape[0]
    print(file + " == " + str(df.shape[0]) + " rows")
print()
print("TOTAL ROWS = " + str(totrows + pd.read_csv('c:/data/datasets/' + files[0]).shape[0]))
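For what it's worth, the 12 columns can be reproduced on a tiny scale. pd.concat aligns frames on their column labels, and a file read with header=None gets the integer labels 0, 1, 2, ... instead of the names from the header row, so nothing lines up. The sketch below uses two invented two-column CSVs held in memory (the names a and b are hypothetical), so it shows 4 columns rather than 12:

```python
import io
import pandas as pd

# Two tiny CSVs with the same layout, kept in memory for the demonstration.
first = pd.read_csv(io.StringIO("a,b\n1,2\n"))
rest = pd.read_csv(io.StringIO("a,b\n3,4\n"), header=None)

# 'first' has columns ['a', 'b']; 'rest' has columns [0, 1] and its
# header line has been read as an ordinary data row.
mismatched = pd.concat([first, rest])
print(mismatched.columns.tolist())

# Reading the second file with its header keeps the labels identical,
# so the rows stack into the same columns.
rest_fixed = pd.read_csv(io.StringIO("a,b\n3,4\n"), header=0)
aligned = pd.concat([first, rest_fixed], ignore_index=True)
print(aligned)
```

With six columns per file the same mismatch produces six named columns plus six integer-labelled ones.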
One simple solution is the following.
import pandas as pd
import numpy as np
import os
totrows = 0
files = os.listdir('c:/data/datasets')
# Collect one dataframe per file, then concatenate once at the end.
columns = None
dfSD = []
for file in files:
    # Read every file with its header row so that row is not parsed as data.
    df = pd.read_csv('c:/data/datasets/' + file, skip_blank_lines=True)
    # Reuse the first file's column names for every later file.
    if columns is None:
        columns = df.columns
    df.columns = columns
    dfSD.append(df)
    totrows += df.shape[0]
    print(file + " == " + str(df.shape[0]) + " rows")
dfSD = pd.concat(dfSD, axis = 0)
dfSD = dfSD.reset_index(drop = True)
Another possibility is:
import pandas as pd
import numpy as np
import os
# Need to include the header row only for the first csv (otherwise header row will be included
# for each read csv, which places improperly formatted rows into the combined dataframe).
totrows = 0
# Get list of csv files to read.
files = os.listdir('c:/data/datasets')
# Read the first csv file, including the header row.
dfSD = pd.read_csv('c:/data/datasets/' + files[0], skip_blank_lines=True)
df_comb = [dfSD]
# Now read the remaining csv files (without header row) and concatenate their values
# into our full Sales Data dataframe.
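for file in files[1:]:
    df = pd.read_csv('c:/data/datasets/' + file, skip_blank_lines=True, header=None)
    # Give the headerless frame the same column labels so concat aligns on them.
    df.columns = dfSD.columns
    df_comb.append(df)
    totrows += df.shape[0]
    print(file + " == " + str(df.shape[0]) + " rows")
# pd.concat takes the list itself, not a list wrapped in another list.
dfSD = pd.concat(df_comb, axis = 0).reset_index(drop = True)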
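A more compact sketch of the same collect-then-concatenate idea, assuming every file starts with the same header row. The helper name combine_csvs, the file names, and the two-column layout are invented for the demonstration, and a temporary folder stands in for c:/data/datasets:

```python
import tempfile
from pathlib import Path

import pandas as pd


def combine_csvs(folder):
    """Read every .csv in folder (in sorted name order) and stack the rows."""
    paths = sorted(Path(folder).glob("*.csv"))
    frames = [pd.read_csv(p) for p in paths]
    return pd.concat(frames, ignore_index=True)


# Build three small monthly files in a temporary folder and combine them.
with tempfile.TemporaryDirectory() as tmp:
    for month in ("01", "02", "03"):
        Path(tmp, f"sales_{month}.csv").write_text("item,qty\nwidget,1\ngadget,2\n")
    combined = combine_csvs(tmp)

print(combined.shape)
```

Filtering on the .csv suffix also avoids picking up any other files that happen to sit in the folder, which os.listdir would return as well.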
I am quite new to Python and have been googling a lot, but I have not found a good solution. What I want to do is automate Excel's Text to Columns with Python for an Excel document without headers.
Here is the Excel sheet I have.
It is a CSV file where all the data is in one column, without headers:
ex. hi ho loe time jobs barber
jim joan hello
009 00487 08234 0240 2.0348 20.34829
The delimiter is a space and a comma.
What I want is the result saved in another Excel file, with the first two rows deleted and the remaining data separated into columns
(this can be done using Text to Columns in Excel, but I would like to automate it for several Excel sheets):
009 | 00487 | 08234 | 0240 | 2.0348 | 20.34829
The code I have written so far is:
import os
import pandas as pd
import csv
path = 'C:/Users/ionan/OneDrive - Universiteit Utrecht/Desktop/UCU/test_excel'
os.chdir(path)
for root, dirs, files in os.walk(path):
    for f in files:
        df = pd.read_csv(f, delimiter='\t' + ';', engine = 'python')
The original file, named data.xlsx:
This means all the data we need is under the column Data.
Code to split data into multiple columns for a single file:
import pandas as pd
import numpy as np
f = 'data.xlsx'
# -- Insert the following code in your `for f in files` loop --
file_data = pd.read_excel(f)
# Since number of values to be split is not known, set the value of `num_cols` to
# number of columns you expect in the modified excel file
num_cols = 20
# Create a dataframe with twenty columns
new_file = pd.DataFrame(columns = ["col_{}".format(i) for i in range(num_cols)])
# Change the column name of the first column in new_file to "Data"
new_file = new_file.rename(columns = {"col_0": file_data.columns[0]})
# Add the value of the first cell in the original file to the first cell of the
# new excel file
new_file.loc[0, new_file.columns[0]] = file_data.iloc[0, 0]
# Loop through all rows of original excel file
for index, row in file_data.iterrows():
    # Skip the first row
    if index == 0:
        continue
    # Split the row by `space`. This gives us a list of strings.
    split_data = file_data.loc[index, "Data"].split(" ")
    print(split_data)
    # Convert each element to a float (a number) if we want numbers and not strings
    # split_data = [float(i) for i in split_data]
    # Make sure the size of the list matches to the number of columns in the `new_file`
    # np.nan represents no value (the `np.NaN` alias no longer exists in NumPy 2.0).
    split_data = [np.nan] + split_data + [np.nan] * (num_cols - len(split_data) - 1)
    # Store the list at a given index using `.loc` method
    new_file.loc[index] = split_data
# Drop all the columns where there is not a single number
new_file.dropna(axis=1, how='all', inplace=True)
# Get the original excel file name
new_file_name = f.split(".")[0]
# Save the new excel file at the same location where the original file is.
new_file.to_excel(new_file_name + "_modified.xlsx", index=False)
This creates a new excel file (with a single sheet) of name data_modified.xlsx:
Summary (code without comments):
import pandas as pd
import numpy as np
f = 'data.xlsx'
file_data = pd.read_excel(f)
num_cols = 20
new_file = pd.DataFrame(columns = ["col_{}".format(i) for i in range(num_cols)])
new_file = new_file.rename(columns = {"col_0": file_data.columns[0]})
new_file.loc[0, new_file.columns[0]] = file_data.iloc[0, 0]
for index, row in file_data.iterrows():
    if index == 0:
        continue
    split_data = file_data.loc[index, "Data"].split(" ")
    split_data = [np.nan] + split_data + [np.nan] * (num_cols - len(split_data) - 1)
    new_file.loc[index] = split_data
new_file.dropna(axis=1, how='all', inplace=True)
new_file_name = f.split(".")[0]
new_file.to_excel(new_file_name + "_modified.xlsx", index=False)
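As a shorter alternative, pandas can do the splitting without an explicit loop. This is only a sketch under assumptions: the data sits in a single column named Data (as above), the first two rows are to be dropped, and values are separated by spaces and/or commas. The frame is built in memory here instead of being read with pd.read_excel, and the col_N names are invented:

```python
import pandas as pd

# Stand-in for pd.read_excel('data.xlsx'): one text column named "Data".
file_data = pd.DataFrame({
    "Data": [
        "hi ho loe time jobs barber",
        "jim joan hello",
        "009 00487 08234 0240 2.0348 20.34829",
    ]
})

# Drop the first two rows, then split on runs of spaces and/or commas.
# expand=True spreads the pieces across separate columns.
rows = file_data["Data"].iloc[2:]
split = rows.str.split(r"[ ,]+", expand=True)
split.columns = ["col_{}".format(i) for i in range(split.shape[1])]

print(split)
```

The pieces stay strings, so leading zeros such as 009 are preserved; split.to_excel(...) could then write the result out as in the answer above.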
I have a loop in which, on every iteration, I export a pandas dataframe to a CSV file. The problem is that I get the output shown in the first picture, but I need something similar to the second one.
I also tried several encodings, such as utf-8 and utf-16, but nothing changed.
The only difference between my solution and the ones found online is that my dataframe is built from a pickle file, but I don't think this is the problem.
import pickle
import lz4.frame
# files, diz, df and path are assumed to be defined earlier (not shown).
for pickle_file in files:
    key = pickle_file.split('/')[5].split('\\')[1] + '_' + pickle_file.split('/')[5].split('\\')[4]
    with lz4.frame.open(pickle_file, "rb") as f:
        while True:
            try:
                diz[key].append(pickle.load(f))
            except EOFError:
                break
for key in diz.keys():
    a = diz[key]
    for j in range(len(a)):
        t = a[j]
        for index,row in t.iterrows():
            if row['MODE'] != 'biflow':
                w = row['W']
                feature = row['FEATURE']
                mean = row['G-MEAN']
                rmse = row['RMSE']
                df.loc[-1] = [w] + [feature] + [rmse] + [mean] + [key]
                df.index = df.index + 1
    df = df.sort_values(by = ['W'])
    df.to_csv(path + key + '.csv', index = False)
    df = df[0:0]
The data is correctly formed. What you need to do is split each row into columns. In MS Excel it's Data > Text to Columns and then follow the function wizard.
If you are using a different application for opening the data, just google how to split text row data into columns for that application.
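If the file should open already split into columns, another option that is sometimes suggested is to write it with the separator the spreadsheet application expects. The sketch below assumes, purely for illustration, an Excel installation whose regional list separator is a semicolon; the column names and values are invented, and an in-memory buffer stands in for the output file:

```python
import io
import pandas as pd

# Invented stand-in for the dataframe built in the question.
df = pd.DataFrame({"W": [10, 20], "FEATURE": ["a", "b"], "RMSE": [0.5, 0.4]})

# Write with ';' instead of the default ',' as the field separator.
buffer = io.StringIO()
df.to_csv(buffer, sep=";", index=False)
csv_text = buffer.getvalue()

print(csv_text)
```

Whether this helps depends on the regional settings of the machine that opens the file, so it is worth checking on one file first.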
I have a number of CSV files. I need to extract the corresponding rows from each file and save them as new files,
i.e. the first output file must contain the first row of every input file, and so on.
I have done the following.
import pandas as pd
import os
import numpy as np
data = pd.DataFrame('', columns =['ObjectID', 'SPI'], index = np.arange(1,100))
path = r'C:\Users\bikra\Desktop\Pandas'
i = 1
for files in os.listdir(path):
    if files[-4:] == '.csv':
        for j in range(0,10, 1):
            #print(files)
            dataset = pd.read_csv(r'C:\Users\bikra\Desktop\Pandas'+'\\'+files)
            spi1 = dataset.loc[j,'SPI']
            data.loc[i]['ObjectID'] = files[:]
            data.loc[i]['SPI'] = spi1
            data.to_csv(r'C:\Users\bikra\Desktop\Pandas\output\\'+str(j)+'.csv')
        i + 1
It works well when index (i.e. 'j' ) is specified. But when I tried to loop, the output csv file contains only first row. Where am I wrong?
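You would be better off using append (note that DataFrame.append was removed in pandas 2.0; on newer versions, pd.concat with a list of dataframes is the replacement):
data = data.append(spi1)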
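A minimal sketch of the regrouping described in the question: output j holds row j of every input file. It assumes each input has an SPI column and at least n_rows rows. The helper name rows_by_position and the in-memory frames are hypothetical stand-ins for the real files, which would each be read once with pd.read_csv:

```python
import pandas as pd


def rows_by_position(frames, n_rows):
    """frames maps a file name to its DataFrame; returns one DataFrame per row position."""
    outputs = []
    for j in range(n_rows):
        # Take row j from every input and label it with the file it came from.
        records = [
            {"ObjectID": name, "SPI": frame["SPI"].iloc[j]}
            for name, frame in frames.items()
        ]
        outputs.append(pd.DataFrame(records))
    return outputs


# Invented inputs standing in for the CSV files on disk.
frames = {
    "a.csv": pd.DataFrame({"SPI": [1.0, 2.0]}),
    "b.csv": pd.DataFrame({"SPI": [3.0, 4.0]}),
}
outputs = rows_by_position(frames, 2)
print(outputs[0])
```

Each element of outputs could then be written with to_csv(str(j) + '.csv', index=False), once per row position rather than once per file.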
How can I add data to an existing empty column in a dataframe?
I have an empty dataframe with column names (stock tickers)
I am trying to add data to each stock, basically, populate the dataframe column by column, from left to right based on the header name.
I am pulling the data from other CSV files, which look like this (CSV file name = column name in the dataframe I'm trying to populate):
PS: an additional issue may arise from the length of the data available for each stock, e.g. I may have a list of 10 values for the first stock, 0 for the second, and 25 for the third. I plan to save this in a CSV, so perhaps it would not cause too big an issue.
I have tried the following idea, but without luck. Any suggestions are welcome.
import pandas as pd
import os
path = 'F:/pathToFiles'
Russell3k_Divs = 'Russel3000-Divs/'
Russell3k_Tickers = 'Russell-3000-Stock-Tickers-List.csv'
df_tickers = pd.read_csv(path + Russell3k_Tickers)
divFls = os.listdir(path + Russell3k_Divs)
for i in divFls:
    df = pd.read_csv(path + Russell3k_Divs + i)
    Div = df['Dividends']
    i = i[0].split('.')
    df_tickers[i] = df_tickers.append(Div)
    print(df_tickers)
    break
import pandas as pd
import os
from tqdm import tqdm
path = 'F:/pathToFiles'
Russell3k_Divs = 'Russel3000-Divs/'
Russell3k_Tickers = 'Russell-3000-Stock-Tickers-List.csv'
df_tickers = pd.DataFrame()
divFls = os.listdir(path + Russell3k_Divs)
for i in tqdm(divFls):
    df = pd.read_csv(path + Russell3k_Divs + i)
    i = i.split('.')[0]
    df[str(i)] = df['Date']
    df_tickers = df_tickers.join(df[str(i)], how='outer')
df_tickers.to_csv('Russell-3000-Stock-Tickers-List1.csv', encoding='utf-8', index=False)
This answer was posted as an edit to the question adding data to an existing empty dataframe containing only column names by the OP Mr.Riply under CC BY-SA 4.0.
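On the unequal-length concern raised in the question, here is a small sketch of one way to build a column per ticker, using pd.concat along axis=1 rather than the join above. The ticker names and values are invented, and the Series stand in for the Dividends column of each file:

```python
import pandas as pd

# One Series per ticker, deliberately of different lengths (one is empty).
dividends = {
    "AAA": pd.Series([0.1, 0.2, 0.3]),
    "BBB": pd.Series([], dtype="float64"),
    "CCC": pd.Series([1.0]),
}

# With a dict and axis=1, the keys become column names and shorter
# columns are padded with NaN.
wide = pd.concat(dividends, axis=1)

print(wide)
```

Written with to_csv, the NaN padding simply shows up as empty cells.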
I've got a number of Excel workbooks, each with multiple worksheets, that I'd like to combine.
I've set up two sets of loops (one while, one for) to read in rows for each sheet in a given workbook and then do the same for all workbooks.
I tried to do it on a subset of these, and it appears to work until I try to combine the two sets using the pd.concat function. Error given is
TypeError: first argument must be an iterable of pandas objects, you passed an object of type "DataFrame"
Any idea what I'm doing incorrectly?
import pandas as pd
d = 2013
numberOfSheets = 5
while d < 2015:
    #print(str(d) + ' beginning')
    f ='H:/MyDocuments/Z Project Work/scriptTest ' + str(d) + '.xlsx'
    for i in range(1,numberOfSheets+1):
        data = pd.read_excel(f, sheetname = 'Table '+str(i), header=None)
        print(i)
        df.append(data)
    print(str(d) + ' complete')
    print(df)
    d += 1
df = pd.concat(df)
print(df)
final = "H:/MyDocuments/Z Project Work/mergedfile.xlsx"
df.to_excel(final)
As the error says, pd.concat() requires an iterable, like a list: pd.concat([df1, df2]) will concatenate df1 and df2 along the default axis of 0, which means df2 is appended to the bottom of df1.
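A tiny reproduction of the difference, using two invented single-column frames:

```python
import pandas as pd

df1 = pd.DataFrame({"x": [1, 2]})
df2 = pd.DataFrame({"x": [3]})

# Passing a bare DataFrame instead of a list reproduces the TypeError.
try:
    pd.concat(df1)
    raised = False
except TypeError as err:
    raised = True
    print(err)

# Passing a list of DataFrames stacks them row-wise (axis=0).
stacked = pd.concat([df1, df2], ignore_index=True)
print(stacked)
```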
Two issues need fixing:
The for loop refers to df before assigning anything to it.
The variable df is overwritten with each iteration of the for loop.
One workaround is to create an empty list of DataFrames before the loops, then append DataFrames to that list, and finally concatenate all the DataFrames in that list. Something like this:
import pandas as pd
d = 2013
numberOfSheets = 5
dfs = []
while d < 2015:
    #print(str(d) + ' beginning')
    f ='H:/MyDocuments/Z Project Work/scriptTest ' + str(d) + '.xlsx'
    for i in range(1, numberOfSheets + 1):
        # Newer pandas versions spell this keyword sheet_name (formerly sheetname).
        data = pd.read_excel(f, sheet_name='Table ' + str(i), header=None)
        print(i)
        dfs.append(data)
    print(str(d) + ' complete')
    d += 1
# ignore_index=True gives the result a default RangeIndex
# starting from 0
df_final = pd.concat(dfs, ignore_index=True)
print(df_final)
final_path = "H:/MyDocuments/Z Project Work/mergedfile.xlsx"
df_final.to_excel(final_path)
Since I can't comment, I'll leave this as an answer: you can speed up this code by opening the file once then parsing the workbook to get each sheet. Should save a second or two off each iteration, since opening the Excel file takes the longest. Here's some code that might help.
Note: setting sheet_name=None will return ALL the sheets in the workbook:
dfs = {<sheetname1>: <DataFrame1>, <sheetname2>: <DataFrame2>, etc.}
Here's the code:
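xl = pd.ExcelFile(fpath)
dfs = xl.parse(sheet_name=None, header=None)
# dfs is a dict, so iterate over its (sheet name, DataFrame) pairs
for i, (name, df) in enumerate(dfs.items()):
    # do stuff with each sheet here, if you want
    print('Sheet {0} ({1}) looks like:\n{2}'.format(i+1, name, df))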
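Because the result is a dict keyed by sheet name, it can also be passed straight to pd.concat. The sketch below simulates that dict in memory instead of opening a workbook; the sheet names and values are invented, and the level names sheet and row are arbitrary labels chosen here:

```python
import pandas as pd

# Stand-in for the dict that sheet_name=None returns: {sheet name: DataFrame}.
sheets = {
    "Table 1": pd.DataFrame({"x": [1, 2]}),
    "Table 2": pd.DataFrame({"x": [3]}),
}

# Concatenating a dict uses its keys as an outer index level; moving that
# level into a column records which sheet each row came from.
combined = pd.concat(sheets, names=["sheet", "row"]).reset_index(level="sheet")

print(combined)
```

Keeping the sheet name as a column can be handy when the merged file later needs to be traced back to its sources.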
Thank you, both. I accepted the answer that addressed the specific question, but I was able to use the second answer, plus some additional googling (e.g. glob), to amend the original code so that it no longer depends on the number of workbooks or worksheets.
Final version of the above now below:
import pandas as pd
import glob
#import numpy as np
#import os, collections, csv
#from os.path import basename
fpath = "H:/MyDocuments/Z Project Work/"
dfs = []
files = glob.glob(fpath+'*.xlsx')
for f in files:
    xl = pd.ExcelFile(f)
    # sheet_name=None returns a dict of {sheet name: DataFrame}
    xls = xl.parse(sheet_name=None, header=0)
    for i, sheet in enumerate(xls):
        print(i)
        dfs.append(xls[sheet])
    print(f+ ' complete')
df_final = pd.concat(dfs, ignore_index=True)
final = "H:/MyDocuments/Z Project Work/mergedfile.xlsx"
df_final.to_excel(final)