Finding the difference between the columns of two excel sheets - python

I have two Excel files that both have multiple sheets. The two files have some sheets in common, i.e. they have the same sheet name but different data and values. However, these sheets with the same name have more columns in one file than in the other. What I want to do is copy the extra columns from the sheet that has them to the sheet (in the other Excel file) that is missing them. Again, the data in the common columns is different, so I can't simply copy the bigger sheet into the smaller one.
First, reading the two files:
import pandas as pd
v8 = pd.read_excel('Revised_V8.xlsx', sheet_name=None)
v9 = pd.read_excel('Revised_V9.xlsx', sheet_name=None)
Now reading one common sheet in both files:
MAP_8 = v8['MAP']
MAP_9 = v9['MAP']
Now both MAP_8 and MAP_9 are OrderedDict. I use this line to get the names of the extra columns in V9:
d=set(MAP_9)-set(MAP_8)
I'm stuck here. My idea is to retrieve the data in those columns in d and then add that to v8 dataframe
xtracol = MAP_9[d] # I want to return the values of those columns saved in d
I get an error here: TypeError: unhashable type: 'set'
Sorry, but I have no idea how to fix this or get the extra columns without using set.
To summarize, let's say MAP_9 has three columns A, B, C, whereas MAP_8 has only two columns A, B. The data in A and B is different between the two sheets. I only want to copy column C from MAP_9 and add it to MAP_8 without changing the values of A and B in MAP_8.
This is just a simple case, but I have more than a dozen common sheets, and some have tens of extra columns compared with the other.
Thank you in advance.

I do not know the syntax of operating Excel with Python, but I do know a fair bit about Excel and Python. Now that you have the names of the columns that are missing in the other sheet, add an empty column under the same name to the sheet that is missing it, once for every extra column. Then load the data from the extra column into Python and write it into the new empty column. To repeat the process automatically, use a simple Python loop such as:
sheets = set(v8) & set(v9)  # sheet names present in both files
for sheet in sheets:
    MAP_8 = v8[sheet]
    MAP_9 = v9[sheet]
Etc. I can expand on this in comments if need be.
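The TypeError in the question comes from indexing a DataFrame with a set; converting the set to a list, as in MAP_9[list(d)], avoids it. The loop above can be sketched as follows. This is only a minimal sketch: copy_extra_columns is a hypothetical helper name, the two dictionaries stand in for the result of pd.read_excel(..., sheet_name=None), and it assumes the rows of matching sheets line up by index.

```python
import pandas as pd

def copy_extra_columns(target, source):
    """Return new sheets: each target sheet plus the columns found only in source.

    `target` and `source` are dicts of DataFrames keyed by sheet name,
    the shape returned by pd.read_excel(..., sheet_name=None).
    """
    result = {}
    for name, left in target.items():
        if name not in source:
            # Sheet exists only in the target file: keep it unchanged.
            result[name] = left.copy()
            continue
        right = source[name]
        # A list (not a set) keeps the column order and is valid for indexing.
        extra = [col for col in right.columns if col not in left.columns]
        # join() aligns on the row index and leaves the existing columns untouched.
        result[name] = left.join(right[extra]) if extra else left.copy()
    return result

# Small in-memory stand-ins for the two workbooks (assumed data).
v8 = {"MAP": pd.DataFrame({"A": [1, 2], "B": [3, 4]})}
v9 = {"MAP": pd.DataFrame({"A": [9, 8], "B": [7, 6], "C": [5, 4]})}

merged = copy_extra_columns(v8, v9)
print(merged["MAP"])
```

If the two sheets have different numbers of rows, the left join keeps the rows of the target sheet and fills missing values in the copied columns with NaN.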

Related

Reading Excel files and detect column name in python

I have some Excel files that include some rows (one or more) at the top for a description, and below them are the tables with the column names and values. Some column names span two rows that I need to merge, and in some cases the column names span three rows.
I would like to go through each file and skip the first lines to detect the rows that include the column names. What would you suggest?
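One possible approach, sketched under the assumption that description rows are mostly empty while the header row is mostly filled: read the sheet without a header, then pick the first row whose share of non-empty cells passes a threshold. The helper name find_header_row and the 0.8 threshold are illustrative choices, and the small frame stands in for pd.read_excel(path, header=None).

```python
import pandas as pd

def find_header_row(raw, min_filled=0.8):
    """Return the label of the first row whose share of non-empty cells
    is at least `min_filled` (a heuristic, not a guarantee)."""
    filled = raw.notna().mean(axis=1)
    candidates = filled[filled >= min_filled]
    if candidates.empty:
        raise ValueError("no row looks like a header")
    return candidates.index[0]

# Stand-in for a sheet read with header=None (assumed data).
raw = pd.DataFrame([
    ["Report for 2020", None, None],
    [None, None, None],
    ["date", "site", "value"],
    ["2020-01-01", "X", 1],
    ["2020-01-02", "Y", 2],
])

row = find_header_row(raw)
# Everything below the detected row is data; the row itself supplies the names.
table = raw.iloc[row + 1:].reset_index(drop=True)
table.columns = raw.iloc[row].tolist()
print(table)
```

For column names that span two or three rows, the detected row number could be passed to read_excel as a list, for example header=[row, row + 1], which produces multilevel columns that can then be merged.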

How to check if all the required columns are present in excel using python/pyspark

I have multiple Excel files which I'm receiving in my blob storage.
There are certain columns that I definitely need in each of those Excel files.
For example:
excel1 = ['a','b','c']
excel2 = ['d','e','f']
I want to fetch only the column names from all these Excel files, check whether the required columns are present, and raise an assertion error if they are not.
How can I achieve this using PySpark?
See if this helps you:
listColumns=df.columns
"column_name" in listColumns
You can read more here.
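The membership check above can be extended to a whole list of required columns. The sketch below uses a pandas DataFrame so that it runs on its own; a PySpark DataFrame also exposes its column names through df.columns, so the same helper should apply there. The name missing_columns is hypothetical.

```python
import pandas as pd

def missing_columns(columns, required):
    """Return the required column names that are absent from `columns`."""
    present = set(columns)
    return [name for name in required if name not in present]

# Stand-in for a DataFrame loaded from one of the Excel files (assumed data).
df = pd.DataFrame(columns=["a", "b", "x"])

missing = missing_columns(df.columns, ["a", "b", "c"])
print(missing)

# To fail fast when something is missing, an assertion could be used:
# assert not missing, f"missing required columns: {missing}"
```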

Way to refer to a column with the same name under different merged cells?

I'm fairly new to pandas and stuck on how to refer to a column that has the same name under different merged columns. Here is an example of the problem I'm stuck on. I want to refer to the worker data for company C, but if I define this Excel file as df and write
dfcompanyAworker=df[Worker]
it won't work.
Is there a specific way to select data from identically named columns like this?
Here's the table:
https://i.stack.imgur.com/8Y6gp.png
Thanks!
First read the dataset that will be used, then set its shape. For example, using the Excel format:
dfcompanyAworker = pd.read_excel('Worker', skiprows=1, header=[1,2], index_col=0, skipfooter=7)
dfcompanyAworker
where:
skiprows=1 ignores the title row in the data
header=[1, 2] is a list because we have multilevel columns, namely Category (Company) and the other data
index_col=0 makes the Date column the index for easier processing and analysis
skipfooter=7 ignores the footer at the end of the data
You can try these steps on your own file.
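Once the sheet is loaded with multilevel columns, a column is addressed by a tuple of (top level, second level). The sketch below builds a small frame by hand instead of reading the file; the company names, the Salary column and the values are assumptions for illustration.

```python
import pandas as pd

# Two header levels: company on top, field underneath (assumed layout).
columns = pd.MultiIndex.from_product(
    [["Company A", "Company C"], ["Worker", "Salary"]]
)
df = pd.DataFrame([[10, 100, 30, 300], [11, 110, 31, 310]], columns=columns)

# A tuple selects one column under one merged header.
company_c_worker = df[("Company C", "Worker")]

# xs() takes the same second-level name across every company.
all_worker = df.xs("Worker", axis=1, level=1)

print(company_c_worker)
print(all_worker)
```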

How to make the pandas row a column name?

When I create the pandas DataFrame, it detects the empty line at the top of the Excel file as the column names and shows them as unnamed. But my column names should be the concentration names on the line below it. How can I do this in pandas? (Editing in Excel is a solution, but I want to automatically edit multiple Excel files with Python.)
I think the column there does not represent a real column; it simply indicates that there are many more columns. If it is a column and you don't want it, you can simply drop it:
df = df.drop(columns="...")
If it is still not resolved, do comment.
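To promote a data row to the column names, as the question asks, one option is to assign that row to the columns and keep only the rows below it. This is a minimal sketch with made-up concentration labels; the frame stands in for a sheet whose first line was empty.

```python
import pandas as pd

# Stand-in for a frame whose real header ended up in the first data row.
raw = pd.DataFrame(
    [["0.1 mM", "0.5 mM"], [1.0, 2.0], [3.0, 4.0]],
    columns=["Unnamed: 0", "Unnamed: 1"],
)

# Keep the rows below the header row, then use that row as the column names.
fixed = raw.iloc[1:].reset_index(drop=True)
fixed.columns = raw.iloc[0].tolist()

print(fixed)
```

When reading directly from a file, passing the zero-based header row to read_excel, for example header=1, may achieve the same result without the extra step.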

Loop through a list of dataframes in Python and write each df into different Excel sheets

I have a large dataset of almost 4 million records. I'd like to export them to Excel, but since each sheet of an Excel file can contain only about 1 million rows, I decided to split the dataframe and put each subset into its own sheet.
I used the code below:
df_split = np.array_split(promotion1, 4)
for i in df_split:
    i.to_excel("result_promotion1.xlsx", index = False, sheet_name = i)
but that raised the error below:
"'DataFrame' objects are mutable, thus they cannot be hashed"
Any help would be appreciated.
The issue is with sheet_name = i. The sheet_name argument is expecting a string, but you're passing it the whole dataframe that you're trying to output to Excel.
The easiest fix would probably be to omit the argument and use the default sheet name (Sheet1). Note that each call to to_excel with the same filename overwrites the file, so every chunk also needs its own target. You could use enumerate to number the dataframes and split them into several Excel files like so:
df_split = np.array_split(promotion1, 4)
for index, i in enumerate(df_split):
    filename = "result_promotion" + str(index) + ".xlsx"
    i.to_excel(filename, index = False)
Alternatively, this post (How to save a new sheet in an existing excel file, using Pandas?) goes into how to add a new sheet to an existing Excel file using pd.ExcelWriter.
Just to explain the error: since sheet_name expects a string and you're giving it a different object, pandas will attempt to hash the object to get a unique string representation of it instead. However, since DataFrames are mutable - you can change values in it, unlike a tuple - they cannot be hashed. See this post for a more detailed explanation on why hashable objects must be immutable.
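For the single-workbook variant, a minimal sketch with pd.ExcelWriter could look like this. The helper names split_by_rows and write_sheets and the part_N sheet names are assumptions, and writing requires an Excel engine such as openpyxl to be installed, so the write step is defined here but not called.

```python
import pandas as pd

def split_by_rows(df, rows_per_sheet=1_000_000):
    """Map a string sheet name to each consecutive slice of `df`."""
    return {
        f"part_{number + 1}": df.iloc[start:start + rows_per_sheet]
        for number, start in enumerate(range(0, len(df), rows_per_sheet))
    }

def write_sheets(path, sheets):
    """Write every chunk to its own sheet of one workbook."""
    with pd.ExcelWriter(path) as writer:
        for sheet_name, chunk in sheets.items():
            # sheet_name is a string here, which is what to_excel expects.
            chunk.to_excel(writer, sheet_name=sheet_name, index=False)

# Small stand-in for the 4-million-row frame from the question.
promotion1 = pd.DataFrame({"id": range(10)})
sheets = split_by_rows(promotion1, rows_per_sheet=3)
print({name: len(chunk) for name, chunk in sheets.items()})
```

Calling write_sheets("result_promotion1.xlsx", sheets) would then produce one file with one sheet per chunk.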
