How does one use the mapdict argument in pyexcel? - python

I'm having some trouble with the mapdict argument in the save_to_database function in pyexcel.
It seems that I still need to have a row of column names in the beginning of my files otherwise I get an error. Does mapdict not specify the names to use for each column once they have been converted to a dictionary?
I'm very unsure of what this argument actually does...
Any help would be appreciated!!

Look, it's simple. If you have a CSV like this:
brand,sku,description,quantity,price
br,qw3234,s sdf sd ,4,23.5
br,qw3234,s sdf sd ,4,23.5
br,qw3234,s sdf sd ,4,23.5
br,qw3234,s sdf sd ,4,23.5
you don't need mapdict.
But if your CSV comes without that first header row, you do need it. For example, here is one piece from my Flask project:
def article_init_func(row):
    # 'id' and 'p' come from the surrounding view/closure in the original project
    warehouse = Warehouse.query.filter_by(id=id).first()
    a = Article()
    a.pricelist_id = p.id
    a.sku = row['sku']
    a.description = row['description']
    a.brand = row['brand']
    a.quantity = row['quantity']
    a.city = warehouse.city
    a.price = row['price']
    return a

map_row = ['brand', 'sku', 'description', 'quantity', 'price']

request.save_to_database(
    field_name='file', session=db.session,
    initializer=article_init_func,
    table=Article,
    mapdict=map_row)
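With mapdict supplied like this, the same upload is expected to work on a headerless file such as the one below, because each positional column is mapped onto the matching key in map_row (a sketch of the expected input, based on the sample above rather than a tested file):

br,qw3234,s sdf sd ,4,23.5
br,qw3234,s sdf sd ,4,23.5
br,qw3234,s sdf sd ,4,23.5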


Get only the name of a DataFrame - Python - Pandas

I'm actually working on an ETL project with crappy data that I'm trying to get right.
For this, I'm trying to create a function that would take the names of my DFs and export them to CSV files that would be easy for me to deal with in Power BI.
I've started with a function that will take my DFs and clean the dates:
df_liste = []

def facture(x):
    x = pd.DataFrame(x)
    for s in x.columns.values:
        if s.__contains__("Fact"):
            x.rename(columns={s: 'periode_facture'}, inplace=True)
            x['periode_facture'] = x[['periode_facture']].apply(lambda x: pd.to_datetime(x, format='%Y%m'))
If I don't set 'x' as a DataFrame, it doesn't work but that's not my problem.
As you can see, I have set a list variable which I would like to increment with the names of the DFs, and the names only. Unfortunately, after a lot of tries, I haven't succeeded yet so... There it is, my first question on Stack ever!
Just in case, this is the first version of the function I would like to have:
def export(x):
    for df in x:
        df.to_csv(f'{df}.csv', encoding='utf-8')
You'd have to set the name of your dataframe first using df.name (probably when you are creating them / reading data into them).
Then you can access the name like a normal attribute:
import pandas as pd
df = pd.DataFrame( data=[1, 2, 3])
df.name = 'my df'
and can use
df.to_csv(f'{df.name}.csv', encoding='utf-8')
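A minimal sketch of how the export loop from the question could then look, assuming every DataFrame in the list has had its name set this way (the names below are made up; note that pandas does not preserve this attribute across most operations, so set it shortly before exporting):

import pandas as pd

df1 = pd.DataFrame(data=[1, 2, 3])
df1.name = 'facture_2020'  # hypothetical name
df2 = pd.DataFrame(data=[4, 5, 6])
df2.name = 'facture_2021'  # hypothetical name

def export(frames):
    # write each named DataFrame to <name>.csv
    for df in frames:
        df.to_csv(f'{df.name}.csv', encoding='utf-8')

export([df1, df2])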

Fast Parsing a huge 12 GB JSON file with Python

I have a 12 GB JSON file where every line contains information about a scientific paper.
I want to parse it and create 3 pandas dataframes that contain information about venues, authors, and how many times an author has published in a venue. Below you can see the code I have written. My problem is that this code needs many days to run. Is there a way to make it faster?
venues = pd.DataFrame(columns=['id', 'raw', 'type'])
authors = pd.DataFrame(columns=['id', 'name'])
main = pd.DataFrame(columns=['author_id', 'venue_id', 'number_of_times'])

with open(r'C:\Users\dintz\Documents\test.json', encoding='UTF-8') as infile:
    papers = ijson.items(infile, 'item')
    for paper in papers:
        if 'id' not in paper["venue"]:
            if 'type' not in paper["venue"]:
                venues = venues.append({'raw': paper["venue"]["raw"]}, ignore_index=True)
            else:
                venues = venues.append({'raw': paper["venue"]["raw"], 'type': paper["venue"]["type"]}, ignore_index=True)
        else:
            venues = venues.append({'id': paper["venue"]["id"], 'raw': paper["venue"]["raw"], 'type': paper["venue"]["type"]}, ignore_index=True)
        paper_authors = paper["authors"]
        paper_authors_json = json.dumps(paper_authors)
        obj = ijson.items(paper_authors_json, 'item')
        for author in obj:
            authors = authors.append({'id': author["id"], 'name': author["name"]}, ignore_index=True)
            main = main.append({'author_id': author["id"], 'venue_raw': venues.iloc[-1]['raw'], 'number_of_times': 1}, ignore_index=True)

authors = authors.drop_duplicates(subset=None, keep='first', inplace=False)
venues = venues.drop_duplicates(subset=None, keep='first', inplace=False)
main = main.groupby(by=['author_id', 'venue_raw'], axis=0, as_index=False).sum()
Apache Spark can read JSON files in multiple chunks in parallel, which makes this much faster:
https://spark.apache.org/docs/latest/sql-data-sources-json.html
For a regular multi-line JSON file, set the multiLine parameter to True.
If you're not familiar with Spark, you can use a pandas-compatible layer on top of Spark called Koalas:
https://koalas.readthedocs.io/en/latest/
The Koalas read_json call:
https://koalas.readthedocs.io/en/latest/reference/api/databricks.koalas.read_json.html
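As a rough sketch of that approach (assuming a local Spark session and the file path from the question; not tested against the actual 12 GB file):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
# each paper is one line of JSON, so the default line-delimited mode applies;
# set multiLine=True only if a single record spans several lines
papers = spark.read.json(r'C:\Users\dintz\Documents\test.json')
papers.printSchema()
print(papers.count())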
You are using the wrong tool for this task; don't use pandas for this scenario.
Look at your last three lines of code: they are simple and clean, but filling that data into a pandas dataframe row by row is not easy or fast when you cannot use a pandas input function such as read_json() or read_csv().
I would prefer pure Python for this simple task: if your PC has sufficient memory, use a dict to collect the unique authors and venues, use itertools.groupby for the grouping, and use more_itertools.ilen to calculate the count.
import ijson

authors = {}
venues = {}
with open(r'C:\Users\dintz\Documents\test.json', encoding='UTF-8') as infile:
    for paper in ijson.items(infile, 'item'):
        venues[paper["venue"]["id"]] = (paper["venue"]["raw"], paper["venue"]["type"])
        for author in paper["authors"]:
            authors[author["id"]] = author["name"]

removing rows with given criteria

I am a beginner with both Python and pandas and I came across an issue I can't handle on my own.
What I am trying to do is:
1) remove all the columns except the three that I am interested in
2) remove all rows whose "asset number" column contains any of several strings. And here is the difficult part: I removed all the blanks, but I can't remove the other ones because nothing happens (for example with the string "TECHNOLOGIES" - I tried part of the word and the whole word, and neither works).
Here is the code:
import modin.pandas as pd
File1 = 'abi.xlsx'
df = pd.read_excel(File1, sheet_name = 'US JERL Dec-19')
df = df[['asset number','Cost','accumulated depr']] #removing other columns
df = df.dropna(axis=0, how='any', thresh=None, subset=None, inplace = False)
df = df[~df['asset number'].str.contains("TECHNOLOGIES, INC", na=False)]
df.to_excel("abi_output.xlsx")
And besides that, the file has 600k rows and it loads so slowly that it takes a long time to see the output. Do you have any advice for that?
Thank you!
@Kenan - thank you for your answer. The code now looks like below, but it still doesn't remove the rows whose chosen column contains the specified strings. I also attached a screenshot of the output to show you that the rows still exist. Any thoughts?
import modin.pandas as pd
File1 = 'abi.xlsx'
df = pd.read_excel(File1, sheet_name = 'US JERL Dec-19', usecols=['asset number','Cost','accumulated depr'])
several_strings = ['', 'TECHNOLOGIES', 'COST CENTER', 'Account', '/16']
df = df[~df['asset number'].isin(several_strings)]
df.to_excel("abi_output.xlsx")
[screenshot: the rows are still not deleted]
@Andy - I attach a sample of the input file. I just changed the numbers in two columns because they are confidential, and I removed the columns that aren't needed (removing them with code wasn't a problem).
Here is the link to the sample; let me know if it is not working properly.
You can combine your first two steps with:
df = pd.read_excel(File1, sheet_name = 'US JERL Dec-19', usecols=['asset number','Cost','accumulated depr'])
I assume this is what you're trying to remove:
several_strings = ['TECHNOLOGIES, INC','blah','blah']
df = df[~df['asset number'].isin(several_strings)]
df.to_excel("abi_output.xlsx")
Update
Based on the link you provided, this might be a better approach:
df = df[df['asset number'].str.len().eq(7)]
The code you were given is correct, so I guess there may be something wrong with the strings in your 'asset number' column. Can you give some examples for a code check?
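One thing that may explain the screenshot where the rows are still present: isin only drops rows whose cell equals one of the listed strings exactly, while str.contains drops rows where the string appears anywhere in the cell. A small self-contained comparison (the sample values are made up):

import pandas as pd

df = pd.DataFrame({'asset number': ['ABC TECHNOLOGIES, INC', '1234567', 'COST CENTER 1']})
# exact match: keeps all three rows, since no cell equals a listed string exactly
exact = df[~df['asset number'].isin(['TECHNOLOGIES, INC', 'COST CENTER'])]
# substring match: keeps only '1234567'
partial = df[~df['asset number'].str.contains('TECHNOLOGIES|COST CENTER', na=False)]
print(exact)
print(partial)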

How to extract Excel PivotCache into Pandas Data Frame?

First time posting here, I apologize if this question has been asked before - I can't find anything that applies.
Is there a way to read the underlying data from an Excel PivotTable into a pandas DataFrame? For several years I've had an Excel Auto_Open macro that downloads several Excel files and double-clicks on the "Grand Total" row in order to extract all of the data, which ultimately gets imported into a database. This is done because the owners of the source data refuse to grant access to the database itself.
This macro has never been the ideal scenario and we need to move it to a better method soon. I have extensive SQL knowledge but have only recently begun to learn Python.
I have been able to read worksheets using openpyxl, but these files do not contain the source data on a separate worksheet by default - the PivotCache must be extracted to a new sheet first. What I would like to do, if possible, is read the Excel PivotCache into a pandas DataFrame and either save that output as a CSV or load it directly into our database. It seems that this is not possible with openpyxl and that I'll probably need to use win32com.client.
Does anybody have any experience with this, and know if it's even possible? Any pointers for where I might get started? I've tried several items from the Excel Object model (PivotCache, GetData, etc etc) but either I don't know how to use them or they don't return what I need.
Any help would be much appreciated. Thanks!
This answer is very late, but I came up with it while struggling with the same issue, and some of the comments above helped me nail it.
In essence, the steps one can take to solve this with openpyxl are:
Use openpyxl to get the openpyxl.pivot.table.TableDefinition object from the desired pivot table (let's call it my_pivot_table)
Get cached fields and their values from my_pivot_table.cache.cacheFields
Get the row data as dicts, in two sub-steps:
3.1) Get all cached rows and their values from my_pivot_table.cache.records.r. Cache fields in these records are stored as indexes from my_pivot_table.cache.cacheFields
3.2) Replace cache fields from each record by their actual values, by "joining" cache.records.r and cache.cacheFields
Convert dict with rows into a pandas DataFrame
Below you will find a copy of the code that implements this solution. Since the structure of these Excel objects is somewhat complex, the code will probably look very cryptic (sorry about that). To address this, I'm adding minimal examples of the main objects being manipulated further below, so people can get a better sense of what is going on and what objects are being returned.
This was the simplest approach I could find to achieve this. I hope it is still useful for someone, albeit some tweaking may be needed for individual cases.
"Bare" code
import numpy as np
import pandas as pd
from openpyxl import load_workbook
from openpyxl.pivot.fields import Missing

file_path = 'path/to/your/file.xlsx'
workbook = load_workbook(file_path)
worksheet = workbook['Plan1']

# Name of desired pivot table (the same name that appears within Excel)
pivot_name = 'Tabela dinâmica1'

# Extract the pivot table object from the worksheet
pivot_table = [p for p in worksheet._pivots if p.name == pivot_name][0]

# Extract a dict of all cache fields and their respective values
fields_map = {}
for field in pivot_table.cache.cacheFields:
    if field.sharedItems.count > 0:
        fields_map[field.name] = [f.v for f in field.sharedItems._fields]

# Extract all rows from cache records. Each row is initially parsed as a dict
column_names = [field.name for field in pivot_table.cache.cacheFields]
rows = []
for record in pivot_table.cache.records.r:
    # If some field in the record is missing, we replace it by NaN
    record_values = [
        field.v if not isinstance(field, Missing) else np.nan for field in record._fields
    ]
    row_dict = {k: v for k, v in zip(column_names, record_values)}
    # Shared fields are mapped as an index, so we replace the field index by its value
    for key in fields_map:
        row_dict[key] = fields_map[key][row_dict[key]]
    rows.append(row_dict)

df = pd.DataFrame.from_dict(rows)
Results:
>>> df.head(2)
FUEL YEAR REGION STATE UNIT Jan Feb (...)
0 GASOLINE (m3) 2000.0 S TEXAS m3 9563.263 9563.263 (...)
1 GASOLINE (m3) 2000.0 NE NEW YORK m3 3065.758 9563.263 (...)
Some details of the objects
Object pivot_table
This is an object of type openpyxl.pivot.table.TableDefinition. It is quite complex. A small glimpse of it:
<openpyxl.pivot.table.TableDefinition object>
Parameters:
name='Tabela dinâmica1', cacheId=36, dataOnRows=True, dataPosition=None, (A LOT OF OMITTED STUFF...)
Parameters:
ref='B52:W66', firstHeaderRow=1, firstDataRow=2, firstDataCol=1, rowPageCount=2, colPageCount=1, pivotFields=[<openpyxl.pivot.table.PivotField object>
Parameters: (A LOT OF OMITTED STUFF...)
Object fields_map (from cache.cacheFields)
This is a dict with column names and their available values:
{'YEAR': [2000.0, 2001.0, 2002.0, 2003.0, 2004.0, 2005.0, 2006.0, 2007.0, 2008.0,
2009.0, 2010.0, 2011.0, 2012.0, 2013.0, 2014.0, 2015.0, 2016.0, 2017.0,
2018.0, 2019.0, 2020.0],
'FUEL': ['GASOLINE (m3)', 'AVIATION GASOLINE (m3)', 'KEROSENE (m3)'],
'STATE': ['TEXAS', 'NEW YORK', 'MAINE', (...)],
'REGION': ['S', 'NE', 'N', (...)]}
Object row_dict (before mapping)
Each row is a dict with column names and their values. Raw values for cache fields are not stored here. Here they are represented by their indexes in cache.cacheFields (see above)
{'YEAR': 0, # <<<--- 0 stands for index in fields_map
'Jan': 10719.983,
'Feb': 12482.281,
'FUEL': 0, # <<<--- index in fields_map
'Dec': 10818.094,
'STATE': 0, # <<<--- index in fields_map
(...)
'UNIT': 'm3'}
Object row_dict (after mapping)
After extracting raw values for cache fields from their indexes, we have a dict that represents all values of a row:
{'YEAR': 2000.0, # extracted column value from index in fields_map
'Jan': 10719.983,
'Feb': 12482.281,
'FUEL': 'GASOLINE (m3)', # extracted from fields_map
'Dec': 10818.094,
'STATE': 'TEXAS', # extracted from fields_map
(...)
'UNIT': 'm3'}
Building on @PMHM's excellent answer, I have modified the code to take care of source data with blank cells. The piece of code that needed modification is the following:
for field in pivot_table.cache.cacheFields:
    if field.sharedItems.count > 0:
        # take care of cases where f.v raises an AttributeError because the cell is empty
        # fields_map[field.name] = [f.v for f in field.sharedItems._fields]
        l = []
        for f in field.sharedItems._fields:
            try:
                l += [f.v]
            except AttributeError:
                l += [""]
        fields_map[field.name] = l
The complete code (mostly copy/paste from above) is therefore:
import numpy as np
import pandas as pd
from openpyxl import load_workbook
from openpyxl.pivot.fields import Missing

file_path = 'path/to/your/file.xlsx'
workbook = load_workbook(file_path)
worksheet = workbook['Plan1']

# Name of desired pivot table (the same name that appears within Excel)
pivot_name = 'Tabela dinâmica1'

# Extract the pivot table object from the worksheet
pivot_table = [p for p in worksheet._pivots if p.name == pivot_name][0]

# Extract a dict of all cache fields and their respective values
fields_map = {}
for field in pivot_table.cache.cacheFields:
    if field.sharedItems.count > 0:
        # take care of cases where f.v raises an AttributeError because the cell is empty
        # fields_map[field.name] = [f.v for f in field.sharedItems._fields]
        l = []
        for f in field.sharedItems._fields:
            try:
                l += [f.v]
            except AttributeError:
                l += [""]
        fields_map[field.name] = l

# Extract all rows from cache records. Each row is initially parsed as a dict
column_names = [field.name for field in pivot_table.cache.cacheFields]
rows = []
for record in pivot_table.cache.records.r:
    # If some field in the record is missing, we replace it by NaN
    record_values = [
        field.v if not isinstance(field, Missing) else np.nan for field in record._fields
    ]
    row_dict = {k: v for k, v in zip(column_names, record_values)}
    # Shared fields are mapped as an index, so we replace the field index by its value
    for key in fields_map:
        row_dict[key] = fields_map[key][row_dict[key]]
    rows.append(row_dict)

df = pd.DataFrame.from_dict(rows)
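Since the original goal was to save the output as a CSV or load it directly into a database, the resulting df can then be exported in the usual pandas way. A sketch with a hypothetical local SQLite target (your real database connection will differ):

import sqlite3

df.to_csv('pivot_cache_export.csv', index=False)
with sqlite3.connect('target.db') as conn:  # hypothetical database file
    df.to_sql('pivot_cache', conn, if_exists='replace', index=False)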

Function to print dataframe, that uses df name as an argument

Inside a function, I can't use an argument to define the name of the df in df.to_csv().
I have a long script to pull apart and understand. To do so, I want to save the different dataframes it uses and store them in order. I created a function to do this and to add the order number, e.g. 01 (number_of_interim_exports), to the name (taken from an argument).
My problem is that I need to use this for multiple dataframe names, but the df.to_csv part won't accept an argument in place of df...
def print_interim_results_any(name, num_exports, df_name):
    global number_of_interim_exports
    global print_interim_outputs
    if print_interim_outputs == 1:
        csvName = str(number_of_interim_exports).zfill(2) + "_" + name
        interimFileName = "interim_export_" + csvName + ".csv"
        # df is not defined inside the function - this is the problem described above
        df.to_csv(interimFileName, sep=';', encoding='utf-8', index=False)
    number_of_interim_exports += 1
I think I just screwed something else up; this works fine:
import pandas as pd

df = pd.DataFrame({1: [1, 2, 3]})

def f(frame):
    frame.to_csv("interimFileName.csv")

f(df)
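Putting the two together, a version of the original function that receives the DataFrame itself (rather than its name) might look like this; it is a sketch that keeps the same module-level globals as in the question:

def print_interim_results_any(name, frame):
    global number_of_interim_exports
    global print_interim_outputs
    if print_interim_outputs == 1:
        csv_name = str(number_of_interim_exports).zfill(2) + "_" + name
        # the DataFrame passed in as 'frame' is exported, so no global df is needed
        frame.to_csv("interim_export_" + csv_name + ".csv", sep=';', encoding='utf-8', index=False)
    number_of_interim_exports += 1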
