From Excel to split Python structures - python

I have the following Excel file:
ID Name Budget
... ... ...
... ... ...
... some unfilled blank cells
ID Name Budget
... ... ...
... some unfilled blank cells
ID Name Budget
... ... ...
I want to read this Excel sheet using pandas (for instance ExcelFile) into separate structures (each table before the unfilled cells constitutes a dataframe/dictionary/...).
I need to do this so that I can process the data within each structure as well as across multiple structures (like summing the budget of a repeating ID or Name in each structure).
What is the easiest way to do this while keeping reasonable memory performance?

Here is code that reads all the data with read_excel() and splits it:
import pandas as pd

df = pd.read_excel("c:\\tmp\\book1.xlsx", "Sheet1")

# Repeated header rows ("ID" appearing again in the ID column) mark the start of a new table
mask = df["ID"] == "ID"
nmask = ~mask

# The cumulative count of header rows assigns each table its own group number
s = mask.astype(int).cumsum()
dfs = [x.dropna() for _, x in df[nmask].groupby(s[nmask])]

for frame in dfs:
    print(frame)
The values in the resulting dfs all have dtype object.
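Since every block is read under the single header of the combined sheet, the numeric columns come back as object dtype. A minimal sketch (assuming the ID/Name/Budget columns from the example above) for converting them and then aggregating across the tables:
import pandas as pd
# Convert the Budget column back to numbers; errors="coerce" turns any
# stray non-numeric cell into NaN instead of raising.
dfs = [
    frame.assign(Budget=pd.to_numeric(frame["Budget"], errors="coerce"))
    for frame in dfs
]
# Example cross-table aggregation: total budget per ID across all tables
total_per_id = pd.concat(dfs).groupby("ID")["Budget"].sum()
print(total_per_id)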

Related

parse xlsx file having merged cells using python or pyspark

I want to parse an xlsx file. Some of the cells in the file are merged and act as a header for the values underneath.
But I do not know what approach I should take to parse the file.
Should I convert the file from xlsx to JSON format and then perform the pivoting or transformation of the dataset,
OR
should I proceed with the xlsx format and try to read specific cell values? I believe this second approach will not make the code scalable and dynamic.
I tried to parse the file and convert it to JSON, but it did not load all the records. Unfortunately, it does not throw any exception.
from json import dumps
from xlrd import open_workbook
# load excel file
wb = open_workbook('/dbfs/FileStore/tables/filename.xlsx')
# get sheet by using sheet name
sheet = wb.sheet_by_name('Input Format')
# get total rows
total_rows = sheet.nrows
# get total columns
total_columns = sheet.ncols
# convert each row of the sheet into a dictionary and append it to a list
lst = []
for i in range(0, total_rows):
    row = {}
    for j in range(0, total_columns):
        if i + 1 < total_rows:
            column_name = sheet.cell(rowx=0, colx=j)
            row_data = sheet.cell_value(rowx=i+1, colx=j)
            row.update(
                {
                    column_name.value: row_data
                }
            )
    if len(row):
        lst.append(row)
# convert into json
json_data = dumps(lst)
print(json_data)
After executing the above code I received the following type of output:
{
    "Analysis": "M000000000000002001900000000000001562761",
    "KPI": "FELIX PARTY.MIX",
    "": 2.9969042460942
},
{
    "Analysis": "M000000000000002001900000000000001562761",
    "KPI": "FRISKIES ESTERILIZADOS",
    "": 2.0046260994622
},
Once the data is in good shape, Spark on Databricks will be used for the transformation.
I tried multiple approaches but failed :(
Hence I am seeking help from the community.
For more clarity on the question I have added sample input/output screenshots below.
Input dataset:
Expected Output1:
You can download the actual dataset and expected output from the following link
Dataset
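As a general preliminary step (a sketch only, separate from the solution below, and assuming the file path and sheet name from the question): merged header cells keep their value only in the top-left cell, so one option is to unmerge them with openpyxl and copy that value into every cell the merge covered before parsing row by row:
from openpyxl import load_workbook
wb = load_workbook('/dbfs/FileStore/tables/filename.xlsx')
ws = wb['Input Format']
# Copy each merged range's top-left value into every cell it covered,
# so the headers survive a normal row-by-row read afterwards.
for merged_range in list(ws.merged_cells.ranges):
    top_left_value = ws.cell(merged_range.min_row, merged_range.min_col).value
    ws.unmerge_cells(str(merged_range))
    for row in ws.iter_rows(min_row=merged_range.min_row, max_row=merged_range.max_row,
                            min_col=merged_range.min_col, max_col=merged_range.max_col):
        for cell in row:
            cell.value = top_left_value
wb.save('/dbfs/FileStore/tables/filename_unmerged.xlsx')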
To get the month column as per the requirement, you can use the following code:
import pandas as pd
for_cols = pd.read_excel('/dbfs/FileStore/HHP.xlsx', engine='openpyxl', skiprows=2,nrows=1)
main_cols = [for_cols[req][0] for req in for_cols if type(for_cols[req][0])==type('x')] #getting main header column names
#print(main_cols)
for_dates = pd.read_excel('/dbfs/FileStore/HHP.xlsx', engine='openpyxl',skiprows=4,usecols="C:R")
dates = for_dates.columns.to_list() #getting list of month names to be used
#print(dates)
pdf = pd.read_excel('/dbfs/FileStore/HHP.xlsx', engine='openpyxl',skiprows=4) #reading the file without main headers
#pdf
#all the columns i.e., 2021 Jan will be labeled differently like 2021 Jan, 2021 Jan.1, 2021 Jan.2 and so on. So the following code will create an array of arrays where each of the child array will be used to create a new small dataframe. All these new dataframes will be combined to a single dataframe (union).
req_cols = []
for i in range(len(main_cols)):
    current_dates = ['Market','Product']
    if(i!=0):
        for d in dates:
            current_dates.append(d+f'.{i}')
    else:
        current_dates.extend(dates)
    req_cols.append(current_dates)
print(req_cols)
#the following code will combine the dataframe to remove multiple yyyy MMM columns. Also added a column `stype` whose name would help identify to which main header column does the month belongs to for each product.
mydf = pdf[req_cols[0]]
mydf['stype'] = main_cols[0]
#display(mydf)
for i in range(1,len(req_cols)):
    temp = pdf[req_cols[i]]
    #print(temp.columns)
    temp['stype'] = main_cols[i]
    rename_cols = {'Market': 'Market', 'Product': 'Product','stype':'stype'} #renaming columns i.e., changing 2021 Jan.1 and such to just 2021 Jan.
    for j in req_cols[i][2:]:
        rename_cols[j] = j[:8] #if j is 2021 Jan.3 then we only take until j[:8] to get the actual name (2021 Jan)
    #print(rename_cols)
    temp.rename(columns = rename_cols, inplace = True)
    mydf = pd.concat([mydf,temp]) #combining the child dataframes to main dataframe.
mydf
tp = mydf[['Market','Product','2021 Jan','stype']]
req_df = tp.pivot(index=['Product','Market'],columns='stype', values='2021 Jan') #now pivoting the `stype` column
req_df['month'] = ['2021 Jan']*len(req_df) #initialising the month column
req_df.reset_index(inplace=True) #converting index columns to actual columns.
req_df #required data format for 2021 Jan.
#using the following code to get required result. Do it separately for each of the dates and then combine it to `req_df`
for dt in dates[1:]:
    tp = mydf[['Market','Product',dt,'stype']]
    tp1 = tp.pivot(index=['Product','Market'],columns='stype', values=dt)
    tp1['month'] = [dt]*len(tp1)
    tp1.reset_index(inplace=True)
    req_df = pd.concat([req_df,tp1])
display(req_df[(req_df['Product'] != 'Nestle Purina')]) #selecting only data where product name is not Nestle Purina
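As a side note, the per-date loop above could also be expressed with a single melt plus pivot; a sketch under the same column assumptions (Market, Product, stype and the renamed yyyy MMM month columns in mydf):
# Reshape all month columns at once instead of pivoting date by date.
long_df = mydf.melt(id_vars=['Market', 'Product', 'stype'],
                    var_name='month', value_name='value')
req_df_alt = long_df.pivot(index=['Product', 'Market', 'month'],
                           columns='stype', values='value').reset_index()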
To create a new column called Nestle Purina for one of the main columns (Penetration) you can use the following code:
nestle_purina = req_df[(req_df['Product'] == 'Nestle Purina')] #where product name is Nestle Purina
b = req_df[(req_df['Product'] != 'Nestle Purina')] #where product name is not nestle purina
a = b[['Product','Market','month','Penetration % (% of Households who bought a product atleast once in the given time period)']] #selecting required columns along with main column Penetration
n = nestle_purina[['month','Penetration % (% of Households who bought a product atleast once in the given time period)']] #selecting only required columns from nestle_purina df.
import numpy as np
a['Nestle Purina'] = np.nan #creating empty column to populate using code below
for dt in dates:
    val = [i for i in n[(n['month'] == dt)]['Penetration % (% of Households who bought a product atleast once in the given time period)']] #getting the corresponding Nestle Purina value for Penetration column
    a.loc[a['month'] == dt, 'Nestle Purina'] = val[0] #updating the `Nestle Purina` column value from nan to value extracted above.
a

Substituting column value if particular column exists in two DataFrames with Pandas

I have 2 data frames representing CSV files as such:
# 1.csv
id,email
1,someone@email.com
2,someoneelse@email.com
...
# 2.csv
id,email
3,someone@otheremail.com
4,someone@email.com
...
What I'm trying to do is to merge both tables into one DataFrame using Pandas based on whether a particular column (in this case column 2, email) is identical in both DataFrames.
I need the merged DataFrame to choose the id from 2.csv.
For example, using the sample data above, since the email column value someone@email.com exists in both CSVs, I need the merged DataFrame to output the following:
# 3.csv
id,email
4,someone@email.com
2,someoneelse@email.com
3,someone@otheremail.com
What I have so far is as follows:
df1 = pd.read_csv('/path/to/1.csv')
print("df1 has {} rows".format(len(df1.index)))
# "df1 has 14072 rows"
df2 = pd.read_csv('/path/to/2.csv')
print("df2 has {} rows".format(len(df2.index)))
# "df2 has 56766 rows"
join = pd.merge(df1, df2, on="email", how="inner")
print("join has {} rows".format(len(join.index)))
# "join has 321 rows"
The problem is that the join DataFrame contains only the rows where the email field exists in both DataFrames. What I expect is that the output DataFrame contains 56766 + 14072 - 321 = 70517 rows, with the id values being the ones from 2.csv when the email field is identical in both source DataFrames. I tried changing merge(how="left"|"right") but the results are identical.
from datatable import dt, f, by
df1 = dt.Frame("""
id,email
1,someone@email.com
2,someoneelse@email.com
""")
df1['csv'] = 1
df2 = dt.Frame("""
id,email
3,someone@otheremail.com
4,someone@email.com
""")
df2['csv'] = 2
df_all = dt.rbind(df1, df2)
df_result = df_all[-1, ['id'], by('email')]
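The same idea works in plain pandas; a minimal sketch (assuming df1 and df2 as read from 1.csv and 2.csv above): concatenate df1 then df2 and keep the last occurrence of each email, so whenever an email appears in both files the id from 2.csv wins.
import pandas as pd
df1 = pd.read_csv('/path/to/1.csv')
df2 = pd.read_csv('/path/to/2.csv')
# df2 comes last, so keep='last' prefers its id for emails present in both files
merged = (pd.concat([df1, df2], ignore_index=True)
            .drop_duplicates(subset='email', keep='last'))
print(len(merged))  # expected 14072 + 56766 - 321 = 70517 rows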
Resolved it by uploading the files to a Google Spreadsheet and using VLOOKUP.

How to extract Excel PivotCache into Pandas Data Frame?

First time posting here, I apologize if this question has been asked before - I can't find anything that applies.
Is there a way to read the underlying data from an Excel PivotTable into a Pandas Data Frame? For several years I've had an Excel Auto_Open macro that downloads several Excel files and double-clicks on the "Grand Total" row in order to extract all of the data, which ultimately gets imported into a database. This is done because the owners of the source data refuse to grant access to the database itself.
This macro has never been the ideal scenario and we need to move it to a better method soon. I have extensive SQL knowledge but have only recently begun to learn Python.
I have been able to read worksheets using openpyxl, but these files do not contain the source data on a separate worksheet by default - the PivotCache must be extracted to a new sheet first. What I would like to do, if possible, is read from the Excel PivotCache into a Pandas Data Frame and either save that output as a CSV or load it directly into our database. It seems that this is not possible with openpyxl and that I'll probably need to use win32com.client.
Does anybody have any experience with this, and know if it's even possible? Any pointers on where I might get started? I've tried several items from the Excel Object Model (PivotCache, GetData, etc.) but either I don't know how to use them or they don't return what I need.
Any help would be much appreciated. Thanks!
This answer is very late, but I came up with it while struggling with the same issue, and some of the comments above helped me nail it.
In essence, the steps one can take to solve this with openpyxl are:
1. Use openpyxl to get the openpyxl.pivot.table.TableDefinition object from the desired pivot table (let's call it my_pivot_table)
2. Get cached fields and their values from my_pivot_table.cache.cacheFields
3. Get rows data as a dict in two sub-steps:
3.1) Get all cached rows and their values from my_pivot_table.cache.records.r. Cache fields in these records are stored as indexes into my_pivot_table.cache.cacheFields
3.2) Replace cache fields in each record with their actual values by "joining" cache.records.r and cache.cacheFields
4. Convert the dict with rows into a pandas DataFrame
Below you will find a copy of the code that implements this solution. Since the structure of these Excel objects is somewhat complex, the code will probably look very cryptic (sorry about that). To address this, I'm adding, further below, minimal examples of the main objects being manipulated, so people can get a better sense of what is going on and what objects are being returned.
This was the simplest approach I could find to achieve this. I hope it is still useful for someone, although some tweaking may be needed for individual cases.
"Bare" code
import numpy as np
import pandas as pd
from openpyxl import load_workbook
from openpyxl.pivot.fields import Missing

file_path = 'path/to/your/file.xlsx'
workbook = load_workbook(file_path)
worksheet = workbook['Plan1']

# Name of desired pivot table (the same name that appears within Excel)
pivot_name = 'Tabela dinâmica1'
# Extract the pivot table object from the worksheet
pivot_table = [p for p in worksheet._pivots if p.name == pivot_name][0]

# Extract a dict of all cache fields and their respective values
fields_map = {}
for field in pivot_table.cache.cacheFields:
    if field.sharedItems.count > 0:
        fields_map[field.name] = [f.v for f in field.sharedItems._fields]

# Extract all rows from cache records. Each row is initially parsed as a dict
column_names = [field.name for field in pivot_table.cache.cacheFields]
rows = []
for record in pivot_table.cache.records.r:
    # If some field in the record is missing, we replace it by NaN
    record_values = [
        field.v if not isinstance(field, Missing) else np.nan for field in record._fields
    ]
    row_dict = {k: v for k, v in zip(column_names, record_values)}
    # Shared fields are mapped as an index, so we replace the field index by its value
    for key in fields_map:
        row_dict[key] = fields_map[key][row_dict[key]]
    rows.append(row_dict)

df = pd.DataFrame.from_dict(rows)
Results:
>>> df.head(2)
FUEL YEAR REGION STATE UNIT Jan Feb (...)
0 GASOLINE (m3) 2000.0 S TEXAS m3 9563.263 9563.263 (...)
1 GASOLINE (m3) 2000.0 NE NEW YORK m3 3065.758 9563.263 (...)
Some details about the objects
Object pivot_table
This is an object of type openpyxl.pivot.table.TableDefinition. It is quite complex. A small glimpse of it:
<openpyxl.pivot.table.TableDefinition object>
Parameters:
name='Tabela dinâmica1', cacheId=36, dataOnRows=True, dataPosition=None, (A LOT OF OMITTED STUFF...)
Parameters:
ref='B52:W66', firstHeaderRow=1, firstDataRow=2, firstDataCol=1, rowPageCount=2, colPageCount=1, pivotFields=[<openpyxl.pivot.table.PivotField object>
Parameters: (A LOT OF OMITTED STUFF...)
Object fields_map (from cache.cacheFields)
This is a dict with column name and their available values:
{'YEAR': [2000.0, 2001.0, 2002.0, 2003.0, 2004.0, 2005.0, 2006.0, 2007.0, 2008.0,
2009.0, 2010.0, 2011.0, 2012.0, 2013.0, 2014.0, 2015.0, 2016.0, 2017.0,
2018.0, 2019.0, 2020.0],
'FUEL': ['GASOLINE (m3)', 'AVIATION GASOLINE (m3)', 'KEROSENE (m3)'],
'STATE': ['TEXAS', 'NEW YORK', 'MAINE', (...)],
'REGION': ['S', 'NE', 'N', (...)]}
Object row_dict (before mapping)
Each row is a dict with column names and their values. Raw values for cache fields are not stored here. Here they are represented by their indexes in cache.cacheFields (see above)
{'YEAR': 0, # <<<--- 0 stands for index in fields_map
'Jan': 10719.983,
'Feb': 12482.281,
'FUEL': 0, # <<<--- index in fields_map
'Dec': 10818.094,
'STATE': 0, # <<<--- index in fields_map
(...)
'UNIT': 'm3'}
Object row_dict (after mapping)
After extracting raw values for cache fields from their indexes, we have a dict that represent all values of a row:
{'YEAR': 2000.0, # extracted column value from index in fields_map
'Jan': 10719.983,
'Feb': 12482.281,
'FUEL': 'GASOLINE (m3)', # extracted from fields_map
'Dec': 10818.094,
'STATE': 'TEXAS', # extracted from fields_map
(...)
'UNIT': 'm3'}
Building on @PMHM's excellent answer, I have modified the code to take care of source data with blank cells. The piece of code that needed modification is the following:
for field in pivot_table.cache.cacheFields:
    if field.sharedItems.count > 0:
        # take care of cases where f.v raises an AttributeError because the cell is empty
        # fields_map[field.name] = [f.v for f in field.sharedItems._fields]
        l = []
        for f in field.sharedItems._fields:
            try:
                l += [f.v]
            except AttributeError:
                l += [""]
        fields_map[field.name] = l
The complete code (mostly copy/paste from above) is therefore:
import numpy as np
import pandas as pd
from openpyxl import load_workbook
from openpyxl.pivot.fields import Missing

file_path = 'path/to/your/file.xlsx'
workbook = load_workbook(file_path)
worksheet = workbook['Plan1']

# Name of desired pivot table (the same name that appears within Excel)
pivot_name = 'Tabela dinâmica1'
# Extract the pivot table object from the worksheet
pivot_table = [p for p in worksheet._pivots if p.name == pivot_name][0]

# Extract a dict of all cache fields and their respective values
fields_map = {}
for field in pivot_table.cache.cacheFields:
    if field.sharedItems.count > 0:
        # take care of cases where f.v raises an AttributeError because the cell is empty
        # fields_map[field.name] = [f.v for f in field.sharedItems._fields]
        l = []
        for f in field.sharedItems._fields:
            try:
                l += [f.v]
            except AttributeError:
                l += [""]
        fields_map[field.name] = l

# Extract all rows from cache records. Each row is initially parsed as a dict
column_names = [field.name for field in pivot_table.cache.cacheFields]
rows = []
for record in pivot_table.cache.records.r:
    # If some field in the record is missing, we replace it by NaN
    record_values = [
        field.v if not isinstance(field, Missing) else np.nan for field in record._fields
    ]
    row_dict = {k: v for k, v in zip(column_names, record_values)}
    # Shared fields are mapped as an index, so we replace the field index by its value
    for key in fields_map:
        row_dict[key] = fields_map[key][row_dict[key]]
    rows.append(row_dict)

df = pd.DataFrame.from_dict(rows)

python numpy csv header in column not row

I have a script which produces a 15x1096 array of data using
np.savetxt("model_concentrations.csv", model_con, header=','.join(sources), delimiter=",")
Each of the 15 rows corresponds to a source of emissions, while each column is 1 day over 3 years. If at all possible I would like to have a 'header' in column 1 which states the emission source. When I use the option header='source1,source2,...' these labels get placed in the first row (as expected), i.e.:
2per 3rd_pvd 3rd_unpvd 4rai_rd 4rai_yd 5rmo 6hea
2.44E+00 2.12E+00 1.76E+00 1.33E+00 6.15E-01 3.26E-01 2.29E+00 ...
1.13E-01 4.21E-02 3.79E-02 2.05E-02 1.51E-02 2.29E-02 2.36E-01 ...
My question is: is there a way to move the header into the first column so the CSV appears like this:
2per 7.77E+00 8.48E-01 ...
3rd_pvd 1.86E-01 3.62E-02 ...
3rd_unpvd 1.04E+00 2.65E-01 ...
4rai_rd 8.68E-02 2.88E-02 ...
4rai_yd 1.94E-01 8.58E-02 ...
5rmo 7.71E-01 1.17E-01 ...
6hea 1.07E+01 2.71E+00 ...
...
Labels for rows and columns are one of the main reasons pandas exists.
import pandas as pd
# Assemble your source labels in a list
sources = ['2per', '3rd_pvd', '3rd_unpvd', '4rai_rd',
           '4rai_yd', '5rmo', '6hea', ...]
# Create a pandas DataFrame wrapping your numpy array
df = pd.DataFrame(model_con, index=sources)
# Saving it to a .csv file writes the index (the source labels) too
df.to_csv('model_concentrations.csv', header=None)
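If staying with plain numpy is preferred, a minimal sketch of the same idea (assuming model_con and sources as defined above): stack the labels as the first column and write everything as strings.
import numpy as np
# Prepending the string labels turns the whole array into strings,
# so fmt='%s' is required when writing.
labelled = np.column_stack([sources, model_con])
np.savetxt('model_concentrations.csv', labelled, fmt='%s', delimiter=',')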

Python Import data dictionary and pattern

If I have data as:
Code, data_1, data_2, data_3, [....], data204700
a,1,1,0, ... , 1
b,1,0,0, ... , 1
a,1,1,0, ... , 1
c,0,1,0, ... , 1
b,1,0,0, ... , 1
etc.: the same code appears with different values (0, 1, or ? for not known).
I need to create a big matrix and I want to analyze it.
How can I import the data into a dictionary?
I want to use a dictionary for the columns (204,700+1).
Is there a built-in function (or package) that returns the pattern to me?
(I expect a percentage pattern.) I mean something like 90% of 1s in column 1, 80% of 1s in column 2.
Alright, so I am going to assume you want this in a dictionary for storage purposes, and I will tell you that you don't want that with this kind of data: use a pandas DataFrame.
This is how you get your data into a dataframe:
import pandas as pd
my_file = 'file_name'
df = pd.read_csv(my_file)
Now, you don't need a package to return the pattern you are looking for; just write a simple function for it:
def one_percentage(data):
    # get total number of rows for calculating percentages
    size = len(data)
    # get the dtype so we only grab the correct (numeric) columns
    x = data.columns[1]
    x = data[x].dtype
    # list of tuples holding the number of 1s and the column names
    ones = [(i, sum(data[i])) for i in data if data[i].dtype == x]
    my_dict = {}
    # create dictionary with column names and fraction of 1s
    for x in ones:
        percent = x[1] / float(size)
        my_dict[x[0]] = percent
    return my_dict
Now if you want to get the fraction of ones in any column, this is what you do:
percentages = one_percentage(df)
column_name = 'any_column_name'
print(percentages[column_name])
Now, if you want to do it for every single column, you can grab all of the column names and loop through them:
columns = [name for name in percentages]
for name in columns:
    print(str(percentages[name] * 100) + "% of 1s in column " + name)
let me know if you need anything else!
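For reference, a vectorized sketch that computes the same per-column fraction of 1s directly with pandas (assuming the column names from the sample data, with values 0/1 and '?' for unknowns):
import numpy as np
import pandas as pd
df = pd.read_csv(my_file)
# Treat '?' as missing, coerce everything else to numbers, and take the column means;
# the mean of a 0/1 column is exactly the fraction of 1s (unknowns are ignored).
fractions = (df.drop(columns=['Code'])
               .replace('?', np.nan)
               .apply(pd.to_numeric, errors='coerce')
               .mean())
print(fractions * 100)  # as percentages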
