I posted part of this question a couple of days ago and got a good answer, but it solved only part of my problem.
So, I have an Excel file that needs some data mining, after which I need to write out another Excel file in the same .xlsx format.
The problem is that I get a strange column after I write the file, which cannot be seen before writing when I inspect the data in Anaconda. That makes it harder to develop a strategy to counter its appearance. Initially I thought I had solved the problem by reducing the column width to 0, but apparently at some point the file needs to be converted to text, and then the column reappears.
For more details, here is part of my code:
import os
import pandas as pd
import numpy as np
import xlsxwriter
# Retrieve current working directory (`cwd`)
cwd = os.getcwd()
cwd
# Change directory
os.chdir("/Users/s7c/Documents/partsstop")
# Assign spreadsheet filename to `file`
file = 'SC daily inventory retrieval columns for reports.xlsx'
# Load spreadsheet
xl = pd.ExcelFile(file)
# Load a sheet into a DataFrame by name: df
df = xl.parse('Sheet1')
#second file code:
#select just the columns we need and rename them:
df2 = df.iloc[:, [1, 3, 6, 9]]
df2.columns = ['Manufacturer Code', 'Part Number', 'Qty Available', 'List Price']
#then select just the rows we need:
df21 = df2[df2['Manufacturer Code'].str.contains("DRP")]#13837 entries
#select just the DRP, first 3 characters and dropping the ones after:
df21['Manufacturer Code'] = df21['Manufacturer Code'].str[:3]
#add a new column:
#in order to do that we need to convert the next column to numeric:
df21['List Price'] = pd.to_numeric(df21['List Price'], errors='coerce')
df21['Dealer Price'] = df21['List Price'].apply(lambda x: x*0.48) #new column equals half of other column
writer = pd.ExcelWriter('example2.xlsx', engine='xlsxwriter')
# Write your DataFrames to a file
df21.to_excel(writer, 'Sheet1')
The actual view of the problem:
Any constructive idea is appreciated. Thanks!
This column seems to be the index of your DataFrame. You can exclude it by passing index=False to to_excel().
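A minimal sketch of what that looks like, assuming a small made-up frame in place of df21 (the data and the helper name save_without_index are hypothetical). Filtering rows keeps the original row labels, and those labels are what shows up as the extra first column. to_csv takes the same index parameter as to_excel, so the effect is easy to see as text:

```python
import pandas as pd

# Hypothetical stand-in for the filtered DataFrame (df21 in the question).
# Filtering keeps the original row labels, so the index is not 0, 1, 2, ...
df = pd.DataFrame(
    {"Manufacturer Code": ["DRP", "DRP"], "List Price": [10.0, 20.0]},
    index=[4, 17],
)

# to_csv shares the `index` parameter with to_excel, which makes the
# difference visible as plain text.
with_index = df.to_csv()                 # header starts with an empty index cell
without_index = df.to_csv(index=False)   # no leading index column


def save_without_index(frame, path):
    """Write `frame` to an .xlsx file without the leading index column."""
    with pd.ExcelWriter(path, engine="xlsxwriter") as writer:
        frame.to_excel(writer, sheet_name="Sheet1", index=False)
```

Calling save_without_index(df21, 'example2.xlsx') would then write the sheet without the unnamed column, assuming xlsxwriter is installed.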
Currently I have two CSV files: one is the temporary file (df1), which has 10+ columns; the other is the master file (df2), which has only 2 columns. I would like to iterate over the rows and compare values from a column that is in both files (UserName). If the value of UserName is already present in the master file, add the value of the other column that appears in both files (Score) to that user's Score in the master file (df2). If, on the other hand, the value of UserName from the temporary file is not present in the master file, just append the row to the master file as a new row.
Example master file (df2):
Example temp file (df1):
New master file I would like to obtain after comparison:
I have the following code, but currently it appends all rows every time a comparison is made between the two files. I could use some help determining whether this is even a good approach for the described problem:
import os
import win32com.client
import pandas as pd
import numpy as np
path = os.path.expanduser("Attachments")
MasterFile=os.path.expanduser("MasterFile.csv")
fields = ['UserName', 'Score']
df1 = pd.read_csv(zipfilepath, skipinitialspace=True, usecols=fields)
df2 = pd.read_csv(MasterFile, skipinitialspace=True, usecols=fields)
comparison_values = df1['UserName'].values == df2['UserName'].values
print(comparison_values)
rows = np.where(comparison_values == False)
for item in comparison_values:
    if item==True:
        df2['Score']=df2['Score']+df1['Score']
    else:
        df2 = df2.append(df1[{'UserName', 'Score'}], ignore_index=True)
df2.to_csv(MasterFile, mode='a', index=False, header=False)
EDIT:
What about a mix of integers and strings in the two files? Example:
Example Master File (df2):
Example temp file (df1):
new master file i would like to obtain after comparison:
IIUC, you can use
df = pd.concat([df1, df2]).groupby('UserName', as_index=False).sum()
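A small sketch of that one-liner on made-up data (the user names and scores are hypothetical). Stacking both frames and grouping by UserName adds the scores of users present in both files and keeps users that appear in only one of them:

```python
import pandas as pd

# Hypothetical master file (df2) and temp file (df1)
df2 = pd.DataFrame({"UserName": ["alice", "bob"], "Score": [10, 5]})
df1 = pd.DataFrame({"UserName": ["bob", "carol"], "Score": [3, 7]})

# Stack both frames, then sum Score per UserName:
# bob appears in both, so his scores are added; carol becomes a new row.
merged = pd.concat([df1, df2]).groupby("UserName", as_index=False).sum()
```

Since merged already holds the complete new master table, it would probably make sense to write it with the default mode='w' and a header instead of mode='a', which adds the rows to the old file contents.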
I'm trying to merge a series of xlsx files into one, which works fine.
However, columns that contain ints when I read a file are transformed into floats (or dates?) once I merge the files and output them to CSV. I have tried to visualize this in the picture. I have seen some solutions where dtype is used to "force" specific columns into int format. However, I do not always know the index or the title of the column, so I need a more scalable solution.
Anyone with some thoughts on this?
Thank you in advance
#specify folder with xlsx-files
xlsFolder = "{}/system".format(directory)
dfMaster = pd.DataFrame()
#make a list of all xlsx-files in folder
xlsFolderContent = os.listdir(xlsFolder)
xlsFolderList = []
for file in xlsFolderContent:
    if file[-5:] == ".xlsx":
        xlsFolderList.append(file)
for xlsx in xlsFolderList:
    print(xlsx)
    xl = pd.ExcelFile("{}/{}".format(xlsFolder, xlsx))
    for sheet in xl.sheet_names:
        if "_Errors" in sheet:
            print(sheet)
            dfSheet = xl.parse(sheet)
            dfSheet.fillna(0, inplace=True)
            dfMaster = dfMaster.append(dfSheet)
print("len of dfMaster:", len(dfMaster))
dfMaster.to_csv("{}/dfMaster.csv".format(xlsFolder),sep=";")
Data sample:
Try passing dtype='object' to pd.read_excel (or ExcelFile.parse) to prevent pandas from inferring the data type of each column. You can also simplify your code using pathlib:
import pandas as pd
import pathlib
directory = pathlib.Path('your_path_directory')
xlsFolder = directory / 'system'
data = []
for xlsFile in xlsFolder.glob('*.xlsx'):
    # sheet_name=None returns a dict that maps sheet names to DataFrames
    sheets = pd.read_excel(xlsFile, sheet_name=None, dtype='object')
    for sheetname, df in sheets.items():
        if '_Errors' in sheetname:
            data.append(df.fillna('0'))
pd.concat(data).to_csv(xlsFolder / 'dfMaster.csv', sep=';')
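One common reason integer columns come out as floats is that concatenating frames with different columns inserts NaN, which forces an integer column to float64. The sketch below uses two small made-up frames (the names a, b, id and code are hypothetical) and astype(object) as a stand-in for reading with dtype='object'. It is only meant to illustrate that object columns keep their integers through concat and fillna:

```python
import pandas as pd

# Hypothetical sheets with different columns
a = pd.DataFrame({"id": [1, 2]})
b = pd.DataFrame({"code": [7]})

# With inferred dtypes, the missing cells become NaN and "id" turns into float64
inferred = pd.concat([a, b])

# With object dtype (what dtype='object' gives when reading), the ints survive
as_object = pd.concat([a.astype(object), b.astype(object)]).fillna("0")

csv_inferred = inferred.to_csv(index=False)    # contains 1.0, 2.0, 7.0
csv_object = as_object.to_csv(index=False)     # contains 1, 2, 7
```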
The code below exports only the last table on the page to Excel, but when I run the print function, it prints all of them. Is there an issue with my code that causes it not to export all the data to Excel?
I've also tried exporting as .csv file with no luck.
import pandas as pd
url = 'https://www.vegasinsider.com/college-football/matchups/'
dfs = pd.read_html(url)
for df in dfs:
    if len(df.columns) > 1:
        df.to_excel(r'VegasInsiderCFB.xlsx', index = False)
        #print(df)
Your problem is that each time df.to_excel is called, you are overwriting the file, so only the last df will be left. What you need to do is use a writer and specify a sheet name for each separate df, e.g.:
import pandas as pd
url = 'https://www.vegasinsider.com/college-football/matchups/'
writer = pd.ExcelWriter('VegasInsiderCFB.xlsx', engine='xlsxwriter')
dfs = pd.read_html(url)
counter = 0
for df in dfs:
    if len(df.columns) > 4:
        counter += 1
        df.to_excel(writer, sheet_name = f"sheet_{counter}", index = False)
# close() writes the file to disk (writer.save() was removed in pandas 2.0)
writer.close()
You might need pip install xlsxwriter xlwt to make it work.
Exporting to a csv will never work, since a csv is a single data table (like a single sheet in excel), so in that case you would need to use a new csv for each df.
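For the CSV route, a rough sketch of writing one file per table could look like this. The two small frames stand in for the tables returned by pd.read_html, and the temporary directory and the table_N.csv naming are assumptions made only for illustration:

```python
import tempfile
from pathlib import Path

import pandas as pd

# Hypothetical stand-ins for the tables returned by pd.read_html(url)
dfs = [
    pd.DataFrame({"team": ["A", "B"], "score": [1, 2]}),
    pd.DataFrame({"team": ["C", "D"], "score": [3, 4]}),
]

out_dir = Path(tempfile.mkdtemp())  # any existing output folder would do
written = []
for counter, df in enumerate(dfs, start=1):
    # One CSV per table, since a CSV file can hold only a single table
    path = out_dir / f"table_{counter}.csv"
    df.to_csv(path, index=False)
    written.append(path)
```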
As pointed out in the comments, it would be possible to write the data onto a single sheet without changing the dfs, but it is likely much better to merge them:
import pandas as pd
import numpy as np
url = 'https://www.vegasinsider.com/college-football/matchups/'
dfs = pd.read_html(url)
dfs = [df for df in dfs if len(df.columns) > 4]
columns = ["gameid","game time", "team"] + list(dfs[0].iloc[1])[1:]
N = len(dfs)
# use the builtin object; the np.object alias was removed in NumPy 1.24
values = np.empty((2*N,len(columns)),dtype=object)
for i,df in enumerate(dfs):
    time = df.iloc[0,0].replace(" Game Time","")
    values[2*i:2*i+2,2:] = df.iloc[2:,:]
    values[2*i:2*i+2,:2] = np.array([[i,time],[i,time]])
newdf = pd.DataFrame(values,columns = columns)
newdf.to_excel("output.xlsx",index = False)
I used a numpy.array of object type to be able to copy a submatrix from the original dataframes easily into their intended place. I also needed to create a gameid, that connects the games across rows. It should be now trivial to rewrite this so you loop through a list of urls and write these to separate sheets.
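If the tables already share the same columns, a different technique from the NumPy object array is pd.concat with keys, which can produce the connecting gameid directly. This is only a sketch on made-up tables (game_a, game_b and their columns are hypothetical, and the real page would still need the header rows cleaned up first):

```python
import pandas as pd

# Hypothetical, already cleaned per-game tables
game_a = pd.DataFrame({"team": ["Home A", "Away A"], "score": [21, 14]})
game_b = pd.DataFrame({"team": ["Home B", "Away B"], "score": [7, 10]})
tables = [game_a, game_b]

# keys labels the rows of each table; moving that level into a column
# gives a gameid that connects the two rows of every game
combined = (
    pd.concat(tables, keys=list(range(len(tables))), names=["gameid"])
    .reset_index(level="gameid")
    .reset_index(drop=True)
)
```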
Running dataframe.to_excel() automatically saves the dataframe as the last sheet in the Excel file.
Is there a way to save a dataframe as the very first sheet, so that, when you open the spreadsheet, Excel shows it as the first on the left?
The only workaround I have found is to first export an empty dataframe to the tab with the name I want as first, then export the others, then export the real dataframe I want to the tab with the name I want. Example in the code below. Is there a more elegant way? More generically, is there a way to specifically choose the position of the sheet you are exporting to (first, third, etc)?
Of course, this arises because the dataframe I want as first is the result of some calculations based on all the others, so I cannot export it before them.
import pandas as pd
import numpy as np
writer = pd.ExcelWriter('My excel test.xlsx')
first_df = pd.DataFrame()
first_df['x'] = np.arange(0,100)
first_df['y'] = 2 * first_df['x']
other_df = pd.DataFrame()
other_df['z'] = np.arange(100,201)
pd.DataFrame().to_excel(writer,'this should be the 1st')
other_df.to_excel(writer,'other df')
first_df.to_excel(writer,'this should be the 1st')
writer.save()
writer.close()
It is possible to re-arrange the sheets after they have been created:
import pandas as pd
import numpy as np
writer = pd.ExcelWriter('My excel test.xlsx')
first_df = pd.DataFrame()
first_df['x'] = np.arange(0,100)
first_df['y'] = 2 * first_df['x']
other_df = pd.DataFrame()
other_df['z'] = np.arange(100,201)
other_df.to_excel(writer,'Sheet2')
first_df.to_excel(writer,'Sheet1')
writer.save()
This will give you this output:
Add this before you save the workbook:
workbook = writer.book
workbook.worksheets_objs.sort(key=lambda x: x.name)
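Another option, sketched here under the assumption that all frames fit in memory, is to finish every calculation first and only then write the sheets in the order you want them to appear. The summary calculation and the helper name write_sheets are hypothetical:

```python
import numpy as np
import pandas as pd

other_df = pd.DataFrame({"z": np.arange(100, 201)})

# Computed last, because it depends on the other frames
first_df = pd.DataFrame({"total": [other_df["z"].sum()]})

# A dict keeps insertion order, so list the sheets in display order
sheets = {"this should be the 1st": first_df, "other df": other_df}


def write_sheets(path, frames):
    """Write each frame to its own sheet, in the order given by `frames`."""
    with pd.ExcelWriter(path) as writer:
        for name, frame in frames.items():
            frame.to_excel(writer, sheet_name=name, index=False)
```

Calling write_sheets('My excel test.xlsx', sheets) would then create the sheets left to right in that order, without the empty placeholder sheet.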
I just need advice on how to approach a problem. I need to create a new file for each sheet of an Excel file (it has about 80 sheets), with each file containing the corresponding sheet's data.
Is it possible to use xlrd library to do something like this?
You can solve your problem by using Pandas library:
#Import pandas module (if you don't have it open the command line and run: pip install pandas)
import pandas as pd
#Read the Excel file to a Pandas DataFrame
#Note: change "file_name" to the name of your Excel file
#This will give you a dictionary containing one DataFrame per each sheet on the Excel file
data = pd.read_excel("file_name.xlsx", sheet_name=None)
#Iterate through the dictionary
count = 0
#Iterating over the dictionary itself yields sheet names, so loop over the DataFrames
for d in data.values():
    #Give different names according to the sheet where data came from
    count = count + 1
    name = "sheet_" + str(count)
    #Save to an Excel file
    d.to_excel(name + ".xlsx")
OK, here is one solution:
import pandas as pd
df1 = pd.read_excel('C:\\Users\\xy\\Desktop\\Report.xls', 'Sheet 1',decimal=',')
df2 = pd.read_excel('C:\\Users\\xy\\Desktop\\Report.xls', 'Sheet 2',decimal=',')
df3 = pd.read_excel('C:\\Users\\xy\\Desktop\\Report.xls', 'Sheet 3',decimal=',')
df4 = pd.read_excel('C:\\Users\\xy\\Desktop\\Report.xls', 'Sheet 4',decimal=',')
# pd.merge joins two frames at a time, so chain the calls
df5 = df1.merge(df2).merge(df3).merge(df4)
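Note that pd.merge takes exactly two frames (its third positional argument is how), so joining several sheets means applying it repeatedly. A minimal sketch with functools.reduce, assuming the sheets share a key column (the column name id and the data are made up):

```python
from functools import reduce

import pandas as pd

# Hypothetical sheets that share an "id" key column
df1 = pd.DataFrame({"id": [1, 2], "a": [10, 20]})
df2 = pd.DataFrame({"id": [1, 2], "b": [30, 40]})
df3 = pd.DataFrame({"id": [1, 2], "c": [50, 60]})

# Merge the frames pairwise from left to right on the shared key
merged = reduce(
    lambda left, right: pd.merge(left, right, on="id"),
    [df1, df2, df3],
)
```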