Using pandas Combining/merging 2 different Excel files/sheets - python

I am trying to combine 2 different Excel files (thanks to the post "Import multiple excel files into python pandas and concatenate them into one dataframe").
What I have worked out so far is:
import os
import pandas as pd
df = pd.DataFrame()
for f in ['c:\\file1.xls', 'c:\\file2.xls']:
    data = pd.read_excel(f, 'Sheet1')
    df = df.append(data)
df.to_excel("c:\\all.xls")
Here is what they look like.
However, I want to:
1. Exclude the last rows of each file (i.e. row4 and row5 in File1.xls; row7 and row8 in File2.xls).
2. Add a column (or overwrite Column A) to indicate where the data came from.
For example:
Is it possible? Thanks.

For num. 1, you can pass skipfooter=2 to read_excel (the parameter was called skip_footer in older pandas versions); or, alternatively, do
data = data.iloc[:-2]
once you have read the data.
For num. 2, you may do:
from os.path import basename
data.index = [basename(f)] * len(data)
Also, it would perhaps be better to put all the data frames in a list and then concat them at the end; something like:
frames = []
for f in ['c:\\file1.xls', 'c:\\file2.xls']:
    data = pd.read_excel(f, 'Sheet1').iloc[:-2]
    data.index = [os.path.basename(f)] * len(data)
    frames.append(data)
df = pd.concat(frames)
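For reference, both steps can be folded into one small helper. This is only a sketch under a few assumptions: trim_and_tag and the Source column name are hypothetical names of mine, the two small DataFrames (with made-up values) stand in for what pd.read_excel would return, and the file name goes into a regular column rather than into the index.

```python
import os
import pandas as pd


def trim_and_tag(data, source_path, n_footer=2):
    """Drop the last n_footer rows and record the source file name in a column."""
    trimmed = data.iloc[:-n_footer] if n_footer else data
    # assign returns a new frame, so the caller's frame is left untouched
    return trimmed.assign(Source=os.path.basename(source_path))


# Stand-ins for pd.read_excel(f, 'Sheet1'); the values are made up
file1 = pd.DataFrame({"A": [1, 2, 3, 4], "B": [10, 20, 30, 40]})
file2 = pd.DataFrame({"A": [5, 6, 7], "B": [50, 60, 70]})

frames = [
    trim_and_tag(file1, "c:/file1.xls"),
    trim_and_tag(file2, "c:/file2.xls"),
]
combined = pd.concat(frames, ignore_index=True)
print(combined)
```

Keeping the file name in a column means it survives a later to_excel(..., index=False).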

import os
import xlrd  # note: xlrd 2.0+ reads only .xls; opening .xlsx needs xlrd < 2.0
import xlsxwriter
file_name = input("Enter the destination file name (without extension): ")
merged_file_name = file_name + ".xlsx"
dest_book = xlsxwriter.Workbook(merged_file_name)
dest_sheet_1 = dest_book.add_worksheet()
dest_row = 1  # row 0 is reserved for the header
header_written = False
path = input("Enter the path of the folder to merge: ")
for root, dirs, files in os.walk(path):
    files = [name for name in files if name.endswith('.xlsx')]
    for xlsfile in files:
        print("File in mentioned folder is: " + xlsfile)
        temp_book = xlrd.open_workbook(os.path.join(root, xlsfile))
        temp_sheet = temp_book.sheet_by_index(0)
        # Copy the header row once, from the first file only
        if not header_written:
            for col_index in range(temp_sheet.ncols):
                value = temp_sheet.cell_value(0, col_index)
                dest_sheet_1.write(0, col_index, value)
            header_written = True
        # Append every data row below what has been written so far
        for row_index in range(1, temp_sheet.nrows):
            for col_index in range(temp_sheet.ncols):
                value = temp_sheet.cell_value(row_index, col_index)
                dest_sheet_1.write(dest_row, col_index, value)
            dest_row = dest_row + 1
dest_book.close()
book = xlrd.open_workbook(merged_file_name)
sheet = book.sheet_by_index(0)
print("number of rows in destination file are: ", sheet.nrows)
print("number of columns in destination file are: ", sheet.ncols)

Change
df.to_excel("c:\\all.xls")
to
df.to_excel("c:\\all.xls", index=False)
You may need to play around with the double quotes, but I think that will work.

Related

How to use python to separate a one column CSV file if the columns have no headings, then save this into a new excel file?

So, I am quite new to Python and have been googling a lot but have not found a good solution. What I am looking to do is automate Excel's Text to Columns with Python for a file without headers.
Here is the Excel sheet I have:
It is a CSV file where all the data is in one column without headers,
e.g. hi ho loe time jobs barber
jim joan hello
009 00487 08234 0240 2.0348 20.34829
The delimiters are space and comma.
What I want is the result saved in another Excel file, with the first two rows deleted and the data separated into columns
(this can be done using Text to Columns in Excel, but I would like to automate it for several Excel sheets):
009 | 00487 | 08234 | 0240 | 2.0348 | 20.34829
The code I have written so far is:
import os
import pandas as pd
import csv
path = 'C:/Users/ionan/OneDrive - Universiteit Utrecht/Desktop/UCU/test_excel'
os.chdir(path)
for root, dirs, files in os.walk(path):
    for f in files:
        df = pd.read_csv(f, delimiter='\t' + ';', engine = 'python')
Original file, named data.xlsx:
This means all the data we need is under the column Data.
Code to split data into multiple columns for a single file:
import pandas as pd
import numpy as np
f = 'data.xlsx'
# -- Insert the following code in your `for f in files` loop --
file_data = pd.read_excel(f)
# Since the number of values to be split is not known, set `num_cols` to the
# number of columns you expect in the modified excel file
num_cols = 20
# Create a dataframe with twenty columns
new_file = pd.DataFrame(columns = ["col_{}".format(i) for i in range(num_cols)])
# Change the column name of the first column in new_file to "Data"
new_file = new_file.rename(columns = {"col_0": file_data.columns[0]})
# Add the value of the first cell in the original file to the first cell of the
# new excel file
new_file.loc[0, new_file.columns[0]] = file_data.iloc[0, 0]
# Loop through all rows of the original excel file
for index, row in file_data.iterrows():
    # Skip the first row
    if index == 0:
        continue
    # Split the row by `space`. This gives us a list of strings.
    split_data = file_data.loc[index, "Data"].split(" ")
    print(split_data)
    # Convert each element to a float (a number) if we want numbers and not strings
    # split_data = [float(i) for i in split_data]
    # Pad the list so its size matches the number of columns in `new_file`
    # (np.nan represents no value; the np.NaN alias was removed in NumPy 2.0)
    split_data = [np.nan] + split_data + [np.nan] * (num_cols - len(split_data) - 1)
    # Store the list at the given index using the `.loc` method
    new_file.loc[index] = split_data
# Drop all the columns that do not hold a single value
new_file.dropna(axis=1, how='all', inplace=True)
# Get the original excel file name
new_file_name = f.split(".")[0]
# Save the new excel file at the same location as the original file
new_file.to_excel(new_file_name + "_modified.xlsx", index=False)
This creates a new excel file (with a single sheet) named data_modified.xlsx:
Summary (code without comments):
import pandas as pd
import numpy as np
f = 'data.xlsx'
file_data = pd.read_excel(f)
num_cols = 20
new_file = pd.DataFrame(columns = ["col_{}".format(i) for i in range(num_cols)])
new_file = new_file.rename(columns = {"col_0": file_data.columns[0]})
new_file.loc[0, new_file.columns[0]] = file_data.iloc[0, 0]
for index, row in file_data.iterrows():
    if index == 0:
        continue
    split_data = file_data.loc[index, "Data"].split(" ")
    split_data = [np.nan] + split_data + [np.nan] * (num_cols - len(split_data) - 1)
    new_file.loc[index] = split_data
new_file.dropna(axis=1, how='all', inplace=True)
new_file_name = f.split(".")[0]
new_file.to_excel(new_file_name + "_modified.xlsx", index=False)
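If the number of columns is not known in advance, a vectorized split may be simpler than padding each row by hand. The sketch below is an alternative technique, not the code above: it assumes the single column has already been loaded as a Series of strings (the sample values, including the comma-separated last line, are made up), drops the first two rows as the question asks, and splits on runs of spaces and/or commas.

```python
import pandas as pd

# Made-up single-column input mirroring the layout in the question
raw = pd.Series([
    "hi ho loe time jobs barber",
    "jim joan hello",
    "009 00487 08234 0240 2.0348 20.34829",
    "010,00488 08235,0241 2.0349 20.34830",
])

# Drop the first two rows, then split on runs of spaces and/or commas.
# A pattern longer than one character is treated as a regular expression.
columns = raw.iloc[2:].str.split(r"[ ,]+", expand=True)
columns = columns.reset_index(drop=True)
print(columns)

# columns.to_excel("data_modified.xlsx", index=False, header=False)  # optional
```

The values stay strings, so leading zeros such as 009 are preserved; convert with astype(float) only if numbers are wanted.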

Column appended to dataframe coming up empty

I have the following code:
import glob
import pandas as pd
import os
import csv
myList = []
path = "/home/reallymemorable/Documents/git/COVID-19/csse_covid_19_data/csse_covid_19_daily_reports_us/*.csv"
for fname in glob.glob(path):
    df = pd.read_csv(fname)
    row = df.loc[df['Province_State'] == 'Pennsylvania']
    dateFromFilename = os.path.basename(fname).replace('.csv','')
    fileDate = pd.DataFrame({'Date': [dateFromFilename]})
    myList.append(row.join(fileDate))
concatList = pd.concat(myList, sort=True)
print(concatList)
concatList.to_csv('/home/reallymemorable/Documents/test.csv', index=False, header=True)
It goes through a folder of CSVs and grabs a specific row and puts it all in a CSV. The files themselves have names like 10-10-2020.csv. I have some code in there that gets the filename and removes the file extension, so I am left with the date alone.
I am trying to add another column called "Date" that contains the filename for each file.
The script almost works: it gives me a CSV of all the rows I pulled out of the various CSVs, but the Date column itself is empty.
If I do print(dateFromFilename), the date/filename prints as expected (e.g. 10-10-2020).
What am I doing wrong?
DataFrame.join does a left join on the index by default. Your fileDate DataFrame has index 0, while row keeps its original index label from df, so the labels do not match and Date comes out as NaN. Instead, do an assignment:
for fname in glob.glob(path):
    df = pd.read_csv(fname)
    row = df.loc[df['Province_State'] == 'Pennsylvania']
    dateFromFilename = os.path.basename(fname).replace('.csv','')
    myList.append(row.assign(Date=dateFromFilename))
concatList = pd.concat(myList, sort=True)
Another way is to store the DataFrames in a dictionary and then concat; the dictionary keys (the dates) become the outer level of the index:
myList = dict()
for fname in glob.glob(path):
    df = pd.read_csv(fname)
    row = df.loc[df['Province_State'] == 'Pennsylvania']
    dateFromFilename = os.path.basename(fname).replace('.csv','')
    myList[dateFromFilename] = row
concatList = pd.concat(myList, sort=True)
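The index mismatch is easy to reproduce on a toy frame. In this sketch the two-row DataFrame and its Confirmed values are made up; only the column name Province_State and the date string come from the question.

```python
import pandas as pd

# Toy stand-in for one daily report; the numbers are made up
df = pd.DataFrame({"Province_State": ["Ohio", "Pennsylvania"], "Confirmed": [1, 2]})

row = df.loc[df["Province_State"] == "Pennsylvania"]  # keeps index label 1
file_date = pd.DataFrame({"Date": ["10-10-2020"]})     # has index label 0

# Left join on the index: label 1 has no partner in file_date, so Date is NaN
joined = row.join(file_date)

# assign broadcasts the scalar to every row, regardless of the index
assigned = row.assign(Date="10-10-2020")

print(joined)
print(assigned)
```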

Word count of a row in excel file using python

I have an Excel file with multiple columns. One column contains comments. I want to create a column right beside it that holds the number of words in each comment, using Python. Is this possible?
Try this:
import os
from collections import Counter
from string import punctuation
import xlrd  # note: xlrd 2.0+ reads only .xls; opening .xlsx needs xlrd < 2.0
sheet_no = 0  # index 0 is the first sheet of the workbook
path = r'C:\Users\myUsername\Directory for Excel files'
# Map every punctuation character to a space so it does not glue words together
punctuation_map = dict((ord(c), ' ') for c in punctuation)
for filename in os.listdir(path):
    if filename.endswith('.xlsx'):
        print(filename)
        workbook = xlrd.open_workbook(os.path.join(path, filename))
        sheet = workbook.sheet_by_index(sheet_no)
        values = []
        for row in range(sheet.nrows):
            for col in range(sheet.ncols):
                c = sheet.cell(row, col)
                if c.ctype == xlrd.XL_CELL_TEXT:
                    cv = str(c.value)
                    wordlist = cv.translate(punctuation_map).split()
                    values.extend(wordlist)
                    numberWords = Counter(wordlist)
                    print(sum(numberWords.values()), ' words for that cell')
        count = Counter(values)
        print(sum(count.values()), ' total words counted (from all columns)')

import pandas as pd
# df is your DataFrame with a "Comments" column
counter = []  # the future column
for string in df.Comments.values:  # for each string in "Comments"
    counter.append(string.count(' ') + 1)  # number of spaces + 1
df['num_words'] = counter  # add the column
df = df[['num_words', 'Comments']]  # change the order of the columns
Starting from my original df, this gave me a df with the columns num_words and Comments.
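Counting spaces overcounts when a comment contains double spaces. A possible variant, sketched here on made-up comments, lets pandas split each comment on any run of whitespace and then takes the length of the resulting list.

```python
import pandas as pd

# Made-up comments; the second one contains double spaces
df = pd.DataFrame({"Comments": ["good product", "arrived  late,  but works", "ok"]})

# str.split() with no pattern splits on any run of whitespace,
# and str.len() then gives the number of pieces per row
df["num_words"] = df["Comments"].str.split().str.len()

df = df[["num_words", "Comments"]]
print(df)
```

Empty cells come out as NaN in num_words rather than raising an error, so they may need a fillna(0) afterwards.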

Python, Pandas from data frame to create new data

The original spreadsheets have 2 columns. I want to pick rows by a given criterion (the month) and put them into new files.
The original files looked like:
The code I am using:
import os
import pandas as pd
working_folder = "C:\\My Documents\\"
file_list = ["Jan.xlsx", "Feb.xlsx", "Mar.xlsx"]
with open(working_folder + '201703-1.csv', 'a') as f03:
    for fl in file_list:
        df = pd.read_excel(working_folder + fl)
        df_201703 = df[df.ARRIVAL.between(20170301, 20170331)]
        df_201703.to_csv(f03, header = True)
with open(working_folder + '201702-1.csv', 'a') as f02:
    for fl in file_list:
        df = pd.read_excel(working_folder + fl)
        df_201702 = df[df.ARRIVAL.between(20170201, 20170231)]
        df_201702.to_csv(f02, header = True)
with open(working_folder + '201701-1.csv', 'a') as f01:
    for fl in file_list:
        df = pd.read_excel(working_folder + fl)
        df_201701 = df[df.ARRIVAL.between(20170101, 20170131)]
        df_201701.to_csv(f01, header = True)
The results look like:
Improvements I want to make:
1. Save them as .xlsx files instead of .csv
2. Drop the leading index column
3. Keep only one header row at the top (now each csv has 3 header rows)
How can I do that? Thank you.
I think you need to create a list of DataFrames, concat them together and then write the result to a file:
dfs1 = []
for fl in file_list:
    df = pd.read_excel(working_folder + fl)
    dfs1.append(df[df.ARRIVAL.between(20170101, 20170131)])
pd.concat(dfs1).to_excel('201701-1.xlsx', index = False)
This can be simplified with a list comprehension:
file_list = ["Jan.xlsx", "Feb.xlsx", "Mar.xlsx"]
dfs1 = [pd.read_excel(working_folder + fl).query('20170101 <= ARRIVAL <= 20170131') for fl in file_list]
pd.concat(dfs1).to_excel('201701-1.xlsx', index = False)
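To avoid repeating the block once per month, the date ranges can live in a dictionary. This is a sketch only: the two small frames (with a made-up NAME column) stand in for what pd.read_excel would return, the months dictionary is a hypothetical helper, and the write to disk is left commented out.

```python
import pandas as pd

# Hypothetical stand-ins for the frames read from the monthly workbooks
frames = [
    pd.DataFrame({"ARRIVAL": [20170105, 20170210], "NAME": ["a", "b"]}),
    pd.DataFrame({"ARRIVAL": [20170120, 20170315], "NAME": ["c", "d"]}),
]
all_rows = pd.concat(frames, ignore_index=True)

# Output name -> inclusive (first, last) ARRIVAL value for that month
months = {
    "201701-1": (20170101, 20170131),
    "201702-1": (20170201, 20170228),
    "201703-1": (20170301, 20170331),
}

# One filtered frame per month, built from the single concatenated frame
by_month = {
    name: all_rows[all_rows["ARRIVAL"].between(first, last)]
    for name, (first, last) in months.items()
}

for name, part in by_month.items():
    print(name, len(part))
    # part.to_excel(name + ".xlsx", index=False)  # uncomment to write the files
```

Each workbook is read only once this way, instead of once per month.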

imported csv to dataframe objects not recognized

I have imported multiple csv files from a folder. First I created a list of all the csv files in the folder and then I provide the length of the list to my function.
The csv files have rows with different column lengths so that is why I think I have to use readlines.
The problem is that when I try to filter the DataFrame the values are not recognized.
I saved it to a SQLite table and pulled it into R, where a value that looks like "H"
shows up as "\"H\"".
How can I prevent those extra characters from being added to my object "H"?
Or do I have another problem?
x = []
count = 0
while (count < len(filelist) ):
    for file in filelist:
        filename = open(filelist[count])
        count = count + 1
        for line in filename.readlines():
            x.append(line.split(','))
df = pd.DataFrame(x)
For example I am just trying to create a mask. But I am getting all False. The DataFrame appears to contain "H"?
data['V1'] == "H"
Try this:
df_list = []
file_list = []  # fill in with the names of your csv files
path = 'file_path'
for file in file_list:
    # Collect the DataFrames themselves, not their names
    df_list.append(pd.read_csv(path + file))
new_df = pd.concat(df_list)
Answer: This code fixed the problem by removing the quotes throughout. Now the mask works.
for i, col in enumerate(df.columns):
    df.iloc[:, i] = df.iloc[:, i].str.replace('"', '')
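The quotes most likely survive because line.split(',') keeps the CSV quoting characters as part of each value. Assuming they are ordinary CSV quotes, the standard csv module strips them while reading and still copes with rows of different lengths. The sample text below is made up.

```python
import csv
import io

# Made-up sample: quoted fields, a quoted comma, and rows of different lengths
sample = '"H",1,2\n"K","x,y"\n'

# csv.reader removes the quoting and does not split inside quoted fields
rows = list(csv.reader(io.StringIO(sample)))
print(rows)
```

The resulting list of lists can be passed to pd.DataFrame(rows) just like x in the question; shorter rows are padded with None.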
