Force Pandas to keep multiple columns with the same name - python

I'm building a program that collects data and adds it to an ongoing Excel sheet weekly (read_excel() and concat() with the new data). The issue I'm having is that I need the columns to have the same name for presentation (it doesn't look great with x.1, x.2, ...).
I only need this on the final output. Is there any way to accomplish this? Would it be too time-consuming to modify pandas?

You can create a list of custom headers that is written to Excel in place of the DataFrame's own column names:
newColNames = ['x','x','x'.....]
df.to_excel(path,header=newColNames)
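A minimal sketch of the header-override idea, under one stated assumption: it uses to_csv instead of to_excel so that it runs without an Excel engine installed. Both writers accept a list for header. The variable names here are illustrative only.

```python
import io

import pandas as pd

# Frame as it might look after read_excel() + concat(): pandas has
# de-duplicated the repeated names to x, x.1, x.2.
df = pd.DataFrame([[1, 2, 3], [4, 5, 6]], columns=["x", "x.1", "x.2"])

# One display name per column; only the written output uses these.
new_col_names = ["x"] * len(df.columns)

buffer = io.StringIO()
df.to_csv(buffer, header=new_col_names, index=False)
csv_text = buffer.getvalue()
print(csv_text)
```

The DataFrame itself keeps its unique column names, so the weekly read and concat steps are unaffected; only the written file shows the repeated header.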

You can add trailing spaces to the column names, using a different number of spaces for each column. They will look the same in Excel, but pandas can tell them apart.
import pandas as pd
df = pd.DataFrame([[1,2,3],[4,5,6],[7,8,9]], columns=['x','x ','x  '])
df
x x x
0 1 2 3
1 4 5 6
2 7 8 9
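The trailing-space trick only works if every repeated name gets a different number of spaces. A small sketch of generating such names automatically; pad_duplicates is a hypothetical helper, not a pandas function, and it assumes the incoming names do not already end in spaces.

```python
import pandas as pd


def pad_duplicates(names):
    """Append one extra trailing space per repeat so every label is unique."""
    seen = {}
    padded = []
    for name in names:
        count = seen.get(name, 0)
        padded.append(name + " " * count)
        seen[name] = count + 1
    return padded


columns = pad_duplicates(["x", "x", "x", "y"])
df = pd.DataFrame([[1, 2, 3, 4], [5, 6, 7, 8]], columns=columns)

print(df.columns.tolist())
# The second "x" column is addressed with one trailing space.
print(df["x "].tolist())
```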

Related

Python Panda DataFrame add column headers to data from clipboard dynamically

I am copying data from my clipboard that contains no headers. I don't want the index column, and I want to name the columns dynamically by count (i.e. 1, 2, 3, ...), skipping the first column. The output data set would look like the following.
1 2 3 4 5 6 7 8 9 10
1981 5012.0 8269.0 10907.0 11805.0 13539.0 16181.0 18009.0 18608.0 18662.0 18834.0
Here is the code I'm starting with. The code works, but the column headers aren't dynamic and the data set may not always have the same number of columns. I'm not sure how to make the column headers dynamic.
import pandas as pd
df = pd.read_clipboard(index_col = 0, names = ["","1","2","3","4","5","6","7","8","9","10"])
To get exactly what you are looking for you can use:
df = pd.read_clipboard(header=None, index_col=0).rename_axis(None)
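To see why this gives headers 1, 2, 3, ..., here is a sketch that swaps the clipboard for an in-memory string, since read_clipboard hands the text to the same parser as read_csv. The two data rows are made-up sample values, and sep is set explicitly to whitespace for the sketch.

```python
import io

import pandas as pd

# Stand-in for the clipboard contents (sample values, not real data).
clipboard_text = "1981 5012.0 8269.0 10907.0\n1982 106.0 4285.0 5396.0\n"

# header=None numbers the columns 0..n; index_col=0 moves column 0 into
# the index, leaving 1..n as headers; rename_axis(None) drops the index name.
df = pd.read_csv(
    io.StringIO(clipboard_text), sep=r"\s+", header=None, index_col=0
).rename_axis(None)

print(df)
```

Because the labels come from the column positions, the headers follow however many columns the pasted data has.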

Iterating on Pandas DataFrame to pass data into API

I am creating a script that reads a GoogleSheet, transforms the data and passes it into my ERP API to automate the creation of Purchase Orders.
I have got as far as outputting the data in a dataframe but I need help on how I can iterate through this and pass it in the correct format to the API.
DataFrame Example (dfRow):
   productID  vatrateID  amount  price
0      46771          2       1   1.25
1      46771          2       1   2.25
2      46771          2       2   5.00
Formatting of the API data:
vatrateID1=dfRow.vatrateID[0],
amount1=dfRow.amount[0],
price1=dfRow.price[0],
productID1=dfRow.productID[0],
vatrateID2=dfRow.vatrateID[1],
amount2=dfRow.amount[1],
price2=dfRow.price[1],
productID2=dfRow.productID[1],
vatrateID3=dfRow.vatrateID[2],
amount3=dfRow.amount[2],
price3=dfRow.price[2],
productID3=dfRow.productID[2],
I would like to create a function that iterates through the DataFrame and returns the data in the correct format to pass to the API.
I'm new at Python and struggle most with iterating / loops so any help is much appreciated!
First, you can always loop over the rows of a dataframe using df.iterrows(). Each step through this iterator yields a tuple containing the row index and the row contents as a pandas Series object. So, for example, this would do the trick:
for ix, row in df.iterrows():
    for column in row.index:
        print(f"{column}{ix}={row[column]}")
You can also do it without resorting to loops. This is great if you need performance, but if performance isn't a concern then it is really just a matter of taste.
# first, "melt" the data, which puts all of the variables on their own row
x = df.reset_index().melt(id_vars='index')
# now join the columns together to produce the rows that we want
s = x['variable'] + x['index'].map(str) + '=' + x['value'].map(str)
print(s)
0 productID0=46771.0
1 productID1=46771.0
2 productID2=46771.0
3 vatrateID0=2.0
...
10 price1=2.25
11 price2=5.0
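Both approaches above print strings numbered from 0, while the API format in the question uses keyword arguments numbered from 1. A hedged sketch of building such a dictionary; build_payload is a hypothetical helper, and how the ERP client accepts the arguments is an assumption.

```python
import pandas as pd

dfRow = pd.DataFrame(
    {
        "productID": [46771, 46771, 46771],
        "vatrateID": [2, 2, 2],
        "amount": [1, 1, 2],
        "price": [1.25, 2.25, 5.00],
    }
)


def build_payload(frame):
    """Flatten rows into {"<column><n>": value} with n starting at 1."""
    payload = {}
    for position, record in enumerate(frame.to_dict("records"), start=1):
        for column, value in record.items():
            payload[f"{column}{position}"] = value
    return payload


payload = build_payload(dfRow)
print(payload)
```

If the API client takes keyword arguments, the dictionary could then be unpacked in the call, for example some_client.save(**payload), where some_client.save stands in for whatever function the ERP library actually provides.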

pandas read excel sheet with multiple sheets and different header offsets

I have to read an Excel file with multiple sheets in pandas.
Unfortunately, the number of blank rows before the header differs between sheets:
pd.read_excel('foo.xlsx', header=[2,3], sheet_name='first')
pd.read_excel('foo.xlsx', header=[1,2], sheet_name='second')
Is there an elegant way to fix this and read the Excel file into a pandas.DataFrame with an additional column that contains the name of each sheet?
I.e. how can
pd.read_excel(file_name, sheet_name=None)
be passed a varying header argument, or at least choose the first 2 (non-empty) rows as header?
edit
dynamically skip top blank rows of excel in python pandas
seems to be related but not the solution as only the first headers are accepted.
edit2
Description of exact file structure:
... (varying number of empty rows)
__irrelevant_row__
HEADER_1
HEADER_2
where currently it is either 1 or 0 empty rows. But as pointed out in the comment it would be great if that would be more dynamic.
I am certain this could be done in a more neat fashion, but a way to achieve (I think) what you want is:
import openpyxl
import pandas as pd

book = openpyxl.load_workbook(PATH_TO_FILE)
frames = []
for sh in book.sheetnames:
    # drop completely empty rows so every sheet starts at the same offset
    a = pd.DataFrame(book[sh].values).dropna(how='all').reset_index(drop=True)
    # row 0 is the irrelevant row, row 1 holds the header
    a.columns = a.iloc[1]
    a = a.iloc[2:]
    a.columns.name = None
    a["sheet"] = sh
    frames.append(a)
# DataFrame.append was removed in pandas 2.0, so collect and concat instead
b = pd.concat(frames)
print(b)
# header1 header2 sheet
#2 1 2 first
#3 3 4 first
#2 1 2 second
#3 3 4 second
#2 1 2 3rd
#3 3 4 3rd
Unfortunately I have no clue how it interacts with your actual data, but I do hope this helps you in your quest.
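If both header rows are needed as a two-level column index, the same drop-blank-rows idea can be sketched on raw frames. This is only a sketch under assumptions: promote_header is a hypothetical helper, the sample frames are made up, and it assumes exactly one irrelevant row precedes the two header rows, as described in the question.

```python
import pandas as pd


def promote_header(raw, skip=1, n_header=2):
    """Drop blank rows, skip `skip` rows, use the next `n_header` rows as columns."""
    body = raw.dropna(how="all").reset_index(drop=True)
    header_rows = body.iloc[skip:skip + n_header]
    data = body.iloc[skip + n_header:].reset_index(drop=True)
    data.columns = pd.MultiIndex.from_arrays(header_rows.to_numpy().tolist())
    return data


# Two made-up sheets that differ only in the number of leading blank rows.
raw_first = pd.DataFrame(
    [
        [None, None],
        [None, None],
        ["irrelevant", None],
        ["A", "B"],
        ["a1", "b1"],
        [1, 2],
        [3, 4],
    ]
)
raw_second = raw_first.iloc[1:].reset_index(drop=True)

raw_sheets = {"first": raw_first, "second": raw_second}
frames = {name: promote_header(raw) for name, raw in raw_sheets.items()}

# The dict keys become an index level, which is then moved into a column.
combined = pd.concat(frames, names=["sheet"]).reset_index(level="sheet")
print(combined)
```

With a real workbook, raw_sheets could come from pd.read_excel(file_name, sheet_name=None, header=None), which returns a dict of one unparsed DataFrame per sheet.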

How do I extract variables that repeat from an Excel Column using Python?

I'm a beginner at Python and I have a school project where I need to analyze an Excel document with information. It has approximately 7 columns and more than 1000 rows.
There's a column named "Materials" that starts at B13. It contains a code that we use to identify some materials. The material code looks like this -> 3A8356. There are different material codes in the same column, and they repeat a lot. I want to identify them and make a list with only one of each code, no repeats. Is there a way I can analyze the column and extract the codes that repeat so I can make a new column with only one of each material code?
An example would be:
12 Materials
13 3A8356
14 3A8376
15 3A8356
16 3A8356
17 3A8346
18 3A8346
and transform it to something like this:
1 Materials
2 3A8346
3 3A8356
4 3A8376
Yes.
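If df is your dataframe, you only have to do df = df.drop_duplicates(subset=['Materials']). Leave keep at its default of 'first': with keep=False, every code that occurs more than once is removed entirely, including its first occurrence.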
To load the dataframe from an excel file, just do:
import pandas as pd
df = pd.read_excel(path_to_file)
The subset argument indicates which columns to consider when looking for duplicates.
Docs: https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.drop_duplicates.html
According to the docs, the new data frame with the duplicates dropped is returned, so you can assign it to any variable you want. If you want to renumber the index, take a look at:
new_data_frame = new_data_frame.reset_index(drop=True)
Or simply
new_data_frame.reset_index(drop=True, inplace=True)
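Putting those steps together on the sample from the question, as a small sketch. The data is typed in by hand rather than read from the Excel file, and the sort is added only to match the ordering shown in the desired output.

```python
import pandas as pd

df = pd.DataFrame(
    {"Materials": ["3A8356", "3A8376", "3A8356", "3A8356", "3A8346", "3A8346"]}
)

# Keep the first occurrence of each code, sort, and renumber the index.
unique_codes = (
    df.drop_duplicates(subset=["Materials"])
    .sort_values("Materials")
    .reset_index(drop=True)
)
print(unique_codes)

# Codes that occur more than once, in case only the repeated ones are wanted.
repeated = df.loc[df["Materials"].duplicated(keep=False), "Materials"].unique().tolist()
print(repeated)
```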

Reading values in column x from specific worksheets using pandas

I am new to python and have looked at a number of similar problems on SO, but cannot find anything quite like the problem that I have and am therefore putting it forward:
I have an .xlsx dataset with data spread across eight worksheets and I want to do the following:
sum the values in the 14th column in each worksheet (the format, layout and type of data (scores) is the same in column 14 across all worksheets)
create a new worksheet with all summed values from column 14 in each worksheet
sort the totaled scores from highest to lowest
plot the summed values in a bar chart to compare
I cannot even begin this process because I am struggling at the first point. I am using pandas and am having trouble reading the data from one specific worksheet - I only seem to be able to read the data from the first worksheet only (I print the outcome to see what my system is reading in).
My first attempt produces an 'Empty DataFrame':
import pandas as pd
y7data = pd.read_excel('Documents\\y7_20161128.xlsx', sheetname='7X', header=0,index_col=0,parse_cols="Achievement Points",convert_float=True)
print y7data
I also tried this, but it only exported the first worksheet's data as opposed to the whole document (I am trying to do this so that I can understand how to export all data). I chose to do this thinking that if I exported the data to a .csv, it might give me a clearer view of what went wrong, but I am none the wiser:
import pandas as pd
import numpy as np
y7data = pd.read_excel('Documents\\y7_20161128.xlsx')
y7data.to_csv("results.csv")
I have tried a number of different things to try and specify which column within each worksheet, but cannot get this to work; it only seems to produce the results for the first worksheet.
How can I, firstly, read the data from column 14 in every worksheet, and then carry out the rest of the steps?
Any guidance would be much appreciated.
UPDATE (for those using Enthought Canopy and struggling with openpyxl):
I am using Enthought Canopy IDE and was constantly receiving an error message around openpyxl not being installed no matter what I tried. For those of you having the same problem, save yourself lots of time and read this post. In short, register for an Enthought Canopy account (it's free), then run this code via the Canopy Command Prompt:
enpkg openpyxl 1.8.5
I think you can use this sample file:
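First read the 14th column of each sheet into a list of DataFrames called y7data (current pandas names these arguments sheet_name and usecols; older versions used sheetname and parse_cols):
y7data = [pd.read_excel('y7_20161128.xlsx', sheet_name=i, usecols=[13]) for i in range(3)]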
print (y7data)
[ a
0 1
1 5
2 9, a
0 4
1 2
2 8, a
0 5
1 8
2 5]
Then concat all columns together. I add keys, which are used for the x axis of the graph, then sum all columns, remove the second level of the MultiIndex (a, a, a in the sample data) with reset_index, and finally sort with sort_values:
print (pd.concat(y7data, axis=1, keys=['a','b','c']))
a b c
a a a
0 1 4 5
1 5 2 8
2 9 8 5
summed = (pd.concat(y7data, axis=1, keys=['a','b','c'])
            .sum()
            .reset_index(drop=True, level=1)
            .sort_values(ascending=False))
print (summed)
c 18
a 15
b 14
dtype: int64
Create a new DataFrame df, set the column names, and write it with to_excel:
df = summed.reset_index()
df.columns = ['a','summed']
print (df)
a summed
0 c 18
1 a 15
2 b 14
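The sum-and-sort step can be tried without the workbook. This sketch rebuilds the three sample columns by hand and computes one total per sheet directly, which is a slightly different route from the concat with keys shown above; the names sheets and totals are illustrative.

```python
import pandas as pd

# Hand-built stand-ins for the single column read from each sheet.
sheets = {
    "a": pd.DataFrame({"a": [1, 5, 9]}),
    "b": pd.DataFrame({"a": [4, 2, 8]}),
    "c": pd.DataFrame({"a": [5, 8, 5]}),
}

# One total per sheet, highest first.
totals = pd.Series(
    {name: frame.iloc[:, 0].sum() for name, frame in sheets.items()}
).sort_values(ascending=False)

# Two-column frame ready to be written to a new worksheet.
df = totals.rename_axis("sheet").reset_index(name="summed")
print(df)
```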
If you need to add a new sheet to the existing workbook, open the writer in append mode (recent pandas versions no longer support assigning writer.book or calling writer.save()):
with pd.ExcelWriter('y7_20161128.xlsx', engine='openpyxl', mode='a') as writer:
    df.to_excel(writer, sheet_name='Main', index=False)
Finally, plot with Series.plot.bar:
import matplotlib.pyplot as plt
summed.plot.bar()
plt.show()
From what I understand, your immediate problem is managing to load the 14th column from each of your worksheets.
You could use ExcelFile.parse instead of read_excel and loop over your sheets.
xls_file = pd.ExcelFile('Documents\\y7_20161128.xlsx')
worksheets = ['Sheet1', 'Sheet2', 'Sheet3']
frames = [xls_file.parse(sheet, usecols=[13]) for sheet in worksheets]
df = pd.concat(frames, axis=1, keys=worksheets)
And from that, sum() your columns and keep going.
Using ExcelFile and then ExcelFile.parse() has the advantage of loading your Excel file only once and then iterating over each worksheet. Using read_excel reloads the file on every iteration, which is wasteful.
Documentation for pandas.ExcelFile.parse.
