Reading Excel files and detecting column names in Python

I have some Excel files that include one or more rows of description at the top, and below that the tables with the column names and values. Some column names are split across two rows and need to be merged, and in some cases the column name spans three rows.
I would like to go through each file, skip the first lines, and detect the rows that contain the column names. What would you suggest?
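One possible approach, sketched below, is to read each sheet without a header, find the first row that contains a column name you know always appears (the marker name "ID" and the helper function name are just placeholders here), and then re-read the file treating the following two (or three) rows as a merged multi-row header:

import pandas as pd

def read_with_detected_header(path, marker="ID", header_rows=2):
    # Read the sheet raw so nothing is interpreted as a header yet.
    raw = pd.read_excel(path, header=None)
    # Find the first row that contains the marker column name.
    header_row = next(
        i for i, row in raw.iterrows()
        if row.astype(str).str.strip().eq(marker).any()
    )
    # Re-read, treating header_rows consecutive rows as a multi-row header.
    df = pd.read_excel(path, header=list(range(header_row, header_row + header_rows)))
    # Merge the multi-row header into single column names.
    df.columns = [
        " ".join(str(part) for part in col if "Unnamed" not in str(part)).strip()
        for col in df.columns
    ]
    return df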

Related

Python, Pandas: How to automatically skip Excel header cells and add the rest to a dataframe

Greetings, I would like to transform an Excel document into a dataframe, but unfortunately the Excel documents are made by someone else and they will always have headers like so:
[Excel example image]
I would like to ignore the "made by stevens" and "made 04/02/21" parts and just read the relevant information like name, age, file.
How would I skip it using pandas?
Is there a way to always skip those header rows, even if the relevant info (name, age, file) starts at a different line in different documents? (i.e. in one document age is at row 4 and in another age is at row 7)
Thanks!
The function pandas.read_excel has a parameter called skiprows; if you feed it an integer, it will simply skip the first n lines at the start of the file.
In your case just use:
df = pd.read_excel(filepath, skiprows=4)
The second part of your question is trickier. Depending on your business use case you might have different solutions. If the columns are always the same (Name, Age, file) you could import the Excel file without skipping lines but with fixed column names, then drop the rows with empty data and the additional header row you didn't use.
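A minimal sketch of that second approach, assuming the table always has exactly the three columns Name, Age, File (the file name below is made up):

import pandas as pd

df = pd.read_excel("report.xlsx", header=None, usecols=[0, 1, 2],
                   names=["Name", "Age", "File"])
# The description rows only fill the first cell, so Age/File are empty there;
# drop them, then drop the original header row that came in as data.
df = df.dropna(subset=["Age", "File"])
df = df[df["Name"] != "Name"].reset_index(drop=True)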
If you want to skip a header which is on row 1, then you can try this:
pandas.read_excel(filepath, skiprows=1, skipfooter=0)
You can pass an integer to skiprows to skip header rows and to skipfooter to skip footer rows; the number depends on how many rows you want to skip.

How to make the pandas row a column name?

When I create the pandas dataframe, it detects the empty line at the top of the Excel file as the column name and shows it as unnamed. But my column names should be the concentration names on the line below it. How can I do this in pandas? (Editing in Excel is a solution, but I want to automatically edit multiple Excel files with Python.)
I think the column over there is not representing any real column; it is simply an indication that there are many columns there. If it is a column and you don't want it, you can simply drop it:
df.drop("...")
If it is still not resolved, do comment.
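If the goal is instead to make that lower row the header (which is what the question asks), one common approach is to tell read_excel which row holds the real names, or to promote it after loading. A rough sketch, assuming the concentration names sit in the second row of the sheet (the file name is made up):

import pandas as pd

# header=1 skips the blank first row and uses the second row as column names.
df = pd.read_excel("measurements.xlsx", header=1)

# Equivalent fix if the frame is already loaded with a blank header row:
# df.columns = df.iloc[0]                     # first data row becomes the header
# df = df.iloc[1:].reset_index(drop=True)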

How can I append the column values in multiple rows into one row and combine multiple rows into one in Python

I have this data:
I need the output as follows:
Basically I want to merge all the Risk Statements into one on the basis of the field ID1, and I want to do it in Python. Can someone please help me?
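In pandas, one way to do this is a groupby on ID1 with a string join. A minimal sketch, assuming the columns are literally named "ID1" and "Risk Statement" (the question does not show the exact headers or data):

import pandas as pd

df = pd.DataFrame({
    "ID1": [1, 1, 2],
    "Risk Statement": ["too hot", "too cold", "too slow"],
})

# One row per ID1, with all of its risk statements joined into one string.
merged = (
    df.groupby("ID1")["Risk Statement"]
      .apply("; ".join)
      .reset_index()
)
print(merged)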

Finding the difference between the columns of two excel sheets

I have two Excel files that both have multiple sheets. The two files have some sheets in common, i.e. they have the same sheet name but different data and values. However, these sheets with the same name have more columns in one file than in the other. What I want to do is copy the extra columns from the sheet that has them to the sheet (in the other Excel file) that is missing them. Again, the data in the common columns is different, so I can't simply copy the bigger sheet into the smaller one.
First reading the two files:
v8 = pd.read_excel('Revised_V8.xlsx', sheet_name=None)
v9 = pd.read_excel('Revised_V9.xlsx', sheet_name=None)
Now reading one common sheet in both files
MAP_8 = v8['MAP']
MAP_9 = v9['MAP']
Now both MAP_8 and MAP_9 are DataFrames. I use this line to get the names of the extra columns in V9:
d=set(MAP_9)-set(MAP_8)
I'm stuck here. My idea is to retrieve the data in those columns in d and then add it to the v8 dataframe:
xtracol = MAP_9[d] # I want to return the values of those columns saved in d
I get an error here: TypeError: unhashable type: 'set'
Sorry, but I have no idea how to fix this or get the extra columns without using set.
To summarize, let's say MAP_9 has three columns A, B, C where MAP_8 has only two columns A, B. The data in A and B is different between the two sheets. I only want to copy column C from MAP_9 and add it to MAP_8 without changing the values of A and B in MAP_8.
This is just a simple case, but I have more than a dozen common sheets, and some have tens of extra columns more than the others.
Thank you in advance
I do not know the syntax of operating Excel with Python, but I do know a fair bit about Excel and Python. Now that you have the names of the columns that are missing in the other sheet: for every extra column, add an empty column to the sheet that is missing it, under the same name. Then load the data from the extra column into Python and write it into the new empty column. To repeat the process automatically, do some simple Python looping such as:
for sheet in v8.keys() & v9.keys():   # sheets common to both files
    MAP_8 = v8[sheet]
    MAP_9 = v9[sheet]
Etc. I can expand on this in comments if need be.
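A rough sketch of that loop, under the assumption that the rows of matching sheets line up positionally (if they do not, you would merge on a key column instead); the output file name is made up:

import pandas as pd

v8 = pd.read_excel('Revised_V8.xlsx', sheet_name=None)
v9 = pd.read_excel('Revised_V9.xlsx', sheet_name=None)

# For every sheet the two files share, copy over the columns that exist
# only in the V9 version of that sheet.
for name in set(v8) & set(v9):
    extra = set(v9[name].columns) - set(v8[name].columns)
    for col in extra:
        v8[name][col] = v9[name][col].values

# Write the updated sheets back out.
with pd.ExcelWriter('Revised_V8_updated.xlsx') as writer:
    for name, sheet in v8.items():
        sheet.to_excel(writer, sheet_name=name, index=False)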

Iterate Through Folder and Add One Column of Each CSV to Dataframe

I have a folder that contains ~90 CSV files. Each relevant file is named xxxxx-2012 and has the same column names.
I would like to create a single DataFrame with a specific column power(MW) from each file, i.e. 90 columns in total, naming the column in the resulting DataFrame by the file name.
My objective with problems like this is to get to a simple datastructure as quickly as possible. In this case, that could be a dictionary of filenames to DataFrames.
frames = {filename: pd.read_csv(filename) for filename in os.listdir()}
You may have to filter out bad filenames, e.g. by extension, or you may be better off using glob... In either case, this breaks up the problem, so it shouldn't be too bad.
Then the question becomes much easier*:
How do I get one column from a DataFrame? df[colname].
How do I concat a list of columns into a DataFrame?
*Assuming you know your way around Python data structures, e.g. list comprehensions.
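A short sketch of that approach, assuming every matching CSV really does have a power(MW) column (the extension filter is the simplest form of the filtering mentioned above):

import os
import pandas as pd

frames = {
    filename: pd.read_csv(filename)
    for filename in os.listdir()
    if filename.endswith(".csv")
}

# One column per file, named after the file it came from.
result = pd.concat(
    {name: df["power(MW)"] for name, df in frames.items()},
    axis=1,
)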
Another option is to just concat the entire dict:
pd.concat(frames)
(which gives you a MultiIndex with all the information.)
