I am having trouble with the following and would very much appreciate your help.
What I am trying to do:
I am trying to run a function I created called ‘get_nav’, which takes the inputs ‘Identifiers’ and ‘asofdate’ from a csv file, ‘data’, and queries a third-party API for three additional metrics for each identifier as of the given date.
The source file ‘data’ has 31 columns and 49 rows (this will fluctuate over time). For each row in ‘data’, I need to:
1. select the value in the ‘IDnumber’ column and its associated as-of date in the ‘Date’ column
2. run the function ‘get_nav’
3. add the results of the query for that row back to the csv file ‘data’ on the correct row
4. save the updated csv file
The problem:
Parts 1 and 2 seem to be working as expected. However, part 3 is not: when it does work, only the results for the last IDnumber are saved to the output file.
What I’ve Tried:
I have tried using pandas and the csv writer to accomplish this, but each has had different issues and ultimately failed. I had difficulty iterating over the source file using pandas.
This approach has gotten me the furthest so far and when I print the results in the console, they are accurate – I just need to be able to store/access them.
I think this question is similar to mine, but unfortunately I cannot quite translate the answer to my problem: https://stackoverflow.com/a/46718677/12840483
I’ve included samples of my data, my code, and what I would like the output to look like for your reference.
Thank you for taking the time to review.
import pandas as pd
import csv

def get_nav(identifiers, asofdate):
    if not isinstance(identifiers, list):
        identifiers = [identifiers]
    columns = [
        ["Column", "Expression", "Function", "Parameter", "Display"],
        ["Name", None, None, None, "Name"],
        ["Height", None, None, None, "Height"],
        ["Weight", None, None, None, "Weight"],
        ["Age", None, None, None, "Age"],
    ]
    options = {"asof": asofdate}
    df = thirdparty.apiquery(ids=identifiers, columns=columns, options=options).as_dataframe()
    records = df.to_dict('records')
    return {rec['Name']: rec for rec in records}

data = csv.DictReader(open("data.csv", 'rU'))
for row in data:
    results = get_nav(row['IDnumber'], row['Date'])

output = pd.DataFrame.from_dict(results)

# sends results to csv file
output.to_csv(r'F:\backup\holding\Access\Runs\data.csv', index=False)
Desired output, which includes the three new columns "Height", "Weight", and "Age" and the related data.
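For reference, here is a minimal sketch of one way the per-row results could be accumulated and merged back into the source table before a single final write. It assumes the column names above, that get_nav returns one record per identifier keyed by 'Name', and it is untested against the real API:

import pandas as pd

# Sketch only: read the whole csv with pandas, query the API row by row,
# and collect every result before writing the file once at the end.
source = pd.read_csv("data.csv")

new_columns = []
for _, row in source.iterrows():
    result = get_nav(row["IDnumber"], row["Date"])   # same call as in the question
    record = next(iter(result.values()))             # assumes a single record per identifier
    new_columns.append({"Height": record.get("Height"),
                        "Weight": record.get("Weight"),
                        "Age": record.get("Age")})

# attach the three new columns to the original rows and save once, after the loop
output = pd.concat([source, pd.DataFrame(new_columns)], axis=1)
output.to_csv(r"F:\backup\holding\Access\Runs\data.csv", index=False)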
Related
I am using Selenium to extract data from the HTML body of a webpage and am writing the data to a .csv file using pandas.
The data is extracted and written to the file; however, I would like to control the formatting so the data is written to specified columns. After reading many threads and docs, I am not able to understand how to do this.
The current CSV file output is as follows, with all the data in one column:
0,
B09KBFH6HM,
dropdownAvailable,
90,
1,
B09KBNJ4F1,
dropdownAvailable,
100,
2,
B09KBPFPCL,
dropdownAvailable,
110
or, if I use the [count] count += 1 method, it will all be in one row:
0,B09KBFH6HM,dropdownAvailable,90,1,B09KBNJ4F1,dropdownAvailable,100,2,B09KBPFPCL,dropdownAvailable,110
I would like the output to be formatted as follows,
/col1 /col2 /col3 /col4
0, B09KBFH6HM, dropdownAvailable, 90,
1, B09KBNJ4F1, dropdownAvailable, 100,
2, B09KBPFPCL, dropdownAvailable, 110
I have tried using the columns= option but get errors in the terminal, and I don't understand from the docs for append which feature I should be using to achieve this:
https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.append.html?highlight=append#pandas.DataFrame.append
A simplified version is as follows
from selenium import webdriver
import pandas as pd

price = []
driver = webdriver.Chrome("./chromedriver")
driver.get("https://www.example.co.jp/dp/zzzzzzzzzz/")

select_box = driver.find_element_by_name("dropdown_selected_size_name")
options = [x for x in select_box.find_elements_by_tag_name("option")]

for element in options:
    price.append(element.get_attribute("value"))
    price.append(element.get_attribute("class"))
    price.append(element.get_attribute("data-a-html-content"))

output = pd.DataFrame(price)
output.to_csv("Data.csv", encoding='utf-8-sig')
driver.close()
Do I need to parse each item separately and append?
I would like each of the .get_attribute values to be written to a new column.
Is there any advice you can offer for a solution to this, as I am not very proficient with pandas? Thank you for your help.
An approach similar to #user17242583's, but a little shorter:
data = [[e.get_attribute("value"), e.get_attribute("class"), e.get_attribute("data-a-html-content")] for e in options]
df = pd.DataFrame(data, columns=['ASIN', 'dropdownAvailable', 'size']) # third column maybe is the product size
df.to_csv("Data.csv", encoding='utf-8-sig')
Adding all your items to the price list is going to cause them all to be in one column. Instead, store separate lists for each column, in a dict, like this (name them whatever you want):
data = {
    'values': [],
    'classes': [],
    'data_a_html_contents': [],
}

...

for element in options:
    data['values'].append(element.get_attribute("value"))
    data['classes'].append(element.get_attribute("class"))
    data['data_a_html_contents'].append(element.get_attribute("data-a-html-content"))

...

output = pd.DataFrame(data)
output.to_csv("Data.csv", encoding='utf-8-sig')
You were collecting the value, class and data-a-html-content and appending them to the same list price. Hence, the list becomes:
price = [value1, class1, data-a-html-content1, value2, class2, data-a-html-content2, ...]
Hence, within the dataframe, everything ends up stacked in a single column.
Solution
To get value, class and data-a-html-content in separate columns, you can adopt either of the two approaches below:
Pass a dictionary to the dataframe.
Pass a list of lists to the dataframe.
While #user17242583 and #h.devillefletcher suggest a dictionary, you can still achieve the same using a list of lists, as follows:
from selenium import webdriver
import pandas as pd

values = []
classes = []
data_a_html_contents = []

driver = webdriver.Chrome("./chromedriver")
driver.get("https://www.example.co.jp/dp/zzzzzzzzzz/")

select_box = driver.find_element_by_name("dropdown_selected_size_name")
options = [x for x in select_box.find_elements_by_tag_name("option")]

for element in options:
    values.append(element.get_attribute("value"))
    classes.append(element.get_attribute("class"))
    data_a_html_contents.append(element.get_attribute("data-a-html-content"))

df = pd.DataFrame(data=list(zip(values, classes, data_a_html_contents)), columns=['Value', 'Class', 'Data-a-Html-Content'])
df.to_csv("Data.csv", encoding='utf-8-sig')
References
You can find a couple of relevant detailed discussions in:
Selenium: Web-Scraping Historical Data from Coincodex and transform into a Pandas Dataframe
Python Selenium: How do I print the values from a website in a text file?
I have a 12GB JSON file in which every line contains information about a scientific paper.
I want to parse it and create three pandas dataframes that contain information about venues, authors, and how many times an author has published in a venue. Below you can see the code I have written. My problem is that this code takes many days to run. Is there a way to make it faster?
import json
import ijson
import pandas as pd

venues = pd.DataFrame(columns=['id', 'raw', 'type'])
authors = pd.DataFrame(columns=['id', 'name'])
main = pd.DataFrame(columns=['author_id', 'venue_id', 'number_of_times'])

with open(r'C:\Users\dintz\Documents\test.json', encoding='UTF-8') as infile:
    papers = ijson.items(infile, 'item')
    for paper in papers:
        if 'id' not in paper["venue"]:
            if 'type' not in paper["venue"]:
                venues = venues.append({'raw': paper["venue"]["raw"]}, ignore_index=True)
            else:
                venues = venues.append({'raw': paper["venue"]["raw"], 'type': paper["venue"]["type"]}, ignore_index=True)
        else:
            venues = venues.append({'id': paper["venue"]["id"], 'raw': paper["venue"]["raw"], 'type': paper["venue"]["type"]}, ignore_index=True)
        paper_authors = paper["authors"]
        paper_authors_json = json.dumps(paper_authors)
        obj = ijson.items(paper_authors_json, 'item')
        for author in obj:
            authors = authors.append({'id': author["id"], 'name': author["name"]}, ignore_index=True)
            main = main.append({'author_id': author["id"], 'venue_raw': venues.iloc[-1]['raw'], 'number_of_times': 1}, ignore_index=True)

authors = authors.drop_duplicates(subset=None, keep='first', inplace=False)
venues = venues.drop_duplicates(subset=None, keep='first', inplace=False)
main = main.groupby(by=['author_id', 'venue_raw'], axis=0, as_index=False).sum()
Apache Spark allows you to read JSON files in multiple chunks in parallel, which makes it faster:
https://spark.apache.org/docs/latest/sql-data-sources-json.html
For a regular multi-line JSON file, set the multiLine parameter to True.
If you're not familiar with Spark, you can use a pandas-compatible layer on top of Spark called Koalas:
https://koalas.readthedocs.io/en/latest/
Koalas read_json call -
https://koalas.readthedocs.io/en/latest/reference/api/databricks.koalas.read_json.html
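A minimal sketch of the Spark approach (assuming a local Spark installation, the file path from the question, one JSON object per line, and a venue struct with id/raw/type fields as shown above):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("papers").getOrCreate()

# JSON Lines layout: one paper object per line, read in parallel
papers = spark.read.json(r"C:\Users\dintz\Documents\test.json")

# for a single multi-line JSON document use instead:
# papers = spark.read.option("multiLine", True).json(r"C:\Users\dintz\Documents\test.json")

papers.printSchema()
papers.select("venue.id", "venue.raw", "venue.type").distinct().show()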
You are using the wrong tool to accomplish this task; do not use pandas for this scenario.
Let's look at the last 3 lines of your code: they are simple and clean, but filling that data into a pandas dataframe is not so easy when you cannot use a pandas input function such as read_json() or read_csv().
I prefer pure Python for this simple task. If your PC has sufficient memory, use a dict to collect the unique authors and venues, use itertools.groupby for the grouping, and use more_itertools.ilen to calculate the count.
authors = {}
venues = {}
for paper in papers:
    venues[paper["venue"]["id"]] = (paper["venue"]["raw"], paper["venue"]["type"])
    for author in paper["authors"]:
        authors[author["id"]] = author["name"]
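A sketch of the counting step mentioned above, assuming a pairs list of (author_id, venue_id) tuples collected inside the same loop (the name pairs is illustrative, not from the original code):

from itertools import groupby
from more_itertools import ilen

# pairs: list of (author_id, venue_id) tuples, one per (paper, author) combination
pairs.sort()  # groupby only groups consecutive equal keys, so sort first
counts = {key: ilen(group) for key, group in groupby(pairs)}
# counts[(author_id, venue_id)] -> how many times that author published in that venue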
I believe Python is the best choice, but I may be wrong.
Below is a sample from a data source in text format in Linux:
TUI,39832020:09:01,10.56| TUI,39832020:10:53,11.23| TUI,39832020:15:40,23.20
DIAN,39832020:09:04,11.56| TUI,39832020:11:45,11.23| DIAN,39832020:12:30,23.20| SLD,39832020:11:45,11.22
The size is unknown; let's presume a million rows.
Each line contains three or more sets delimited by |, and each set has fields separated by ,.
The first field in each set is the product ID. For example, in the sample above, TUI, DIAN, and SLD are product IDs.
I need to find out how many distinct products I have in the file. For example, the first line contains one (TUI) and the second line contains three (DIAN, TUI, and SLD).
In total, on those two lines, we can see there are three unique products.
Can anyone help?
Thank you very much. Any enlightenment is appreciated.
UPDATE
I would prefer a solution based on Python with Spark, i.e. PySpark.
I'm also looking for statistics like:
total amount of each product;
all records for a given time (the second field in each set, like 39832020:09:01);
minimum and maximum price for each product.
UPDATE 2
Thank you all for the code; I really appreciate it. I wonder if anyone can load the data into an RDD and/or a dataframe. I know that in Spark SQL it is very simple to obtain those statistics.
Thanks a lot in advance.
Similar to Accdias' answer: use a dictionary, read your file in line by line, split the data by | and then by commas, and total up the counts in your dictionary.
myFile="lines_to_read.txt"
productCounts = dict()
with open(myFile, 'r') as linesToRead:
for thisLine in linesToRead:
for myItem in thisLine.split("|"):
productCode=myItem.split(",")
productCode=productCode[0].strip()
if productCode in productCounts:
productCounts[productCode]+=1
else:
productCounts[productCode]=1
print(productCounts)
**** Update ****
Using a pandas DataFrame so that we can query stats on the data afterwards:
import pandas as pd

myFile = "lines_to_read.txt"
myData = pd.DataFrame(columns=['prodID', 'timeStamp', 'prodPrice'])

with open(myFile, 'r') as linesToRead:
    for thisLine in linesToRead:
        for myItem in thisLine.split("|"):
            thisItem = myItem.strip('\n, " "').split(",")
            myData = myData.append({'prodID': thisItem[0], 'timeStamp': thisItem[1], 'prodPrice': thisItem[2]}, ignore_index=True)

print(myData)                                                        # full table
print(myData.groupby('prodID').agg({'prodID': 'count'}))             # total of prodIDs
print(myData.loc[myData['timeStamp'] == '39832020:11:45'])           # all lines where time = 39832020:11:45
print(myData.groupby('prodID').agg({'prodPrice': ['min', 'max']}))   # min/max prices
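For the PySpark route mentioned in the question's update, here is a minimal sketch (assuming a local Spark session and the same lines_to_read.txt file; the column names mirror the pandas version above and the code is untested against the real data):

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("products").getOrCreate()

lines = spark.read.text("lines_to_read.txt")

# one row per set: split each line on "|", explode to rows, then split each set on ","
sets = lines.select(F.explode(F.split(F.col("value"), r"\|")).alias("set"))
parts = sets.select(F.split(F.trim(F.col("set")), ",").alias("p"))
products = parts.select(
    F.col("p")[0].alias("prodID"),
    F.col("p")[1].alias("timeStamp"),
    F.col("p")[2].cast("double").alias("prodPrice"),
)

print(products.select("prodID").distinct().count())              # number of unique products
products.groupBy("prodID").count().show()                        # total count per product
products.filter(F.col("timeStamp") == "39832020:09:01").show()   # all records for a given time
products.groupBy("prodID").agg(F.min("prodPrice"), F.max("prodPrice")).show()  # min/max price per product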
First time posting here, I apologize if this question has been asked before - I can't find anything that applies.
Is there a way to read the underlying data from an Excel PivotTable into a Pandas Data Frame? For several years I've had an Excel Auto_Open macro that downloads several Excel files and double-clicks on the "Grand Total" row in order to extract all of the data, which ultimately gets imported into a database. This is done because the owners of the source data refuse to grant access to the database itself.
This macro has never been the ideal scenario and we need to move it to a better method soon. I have extensive SQL knowledge but have only recently begun to learn Python.
I have been able to read worksheets using OpenPyXl, but these files do not contain the source data on a separate worksheet by default - the pivot cache must be extracted to a new sheet first. What I would like to do, if possible, is read from the Excel PivotCache into a Pandas Data Frame and either save that output as a CSV or load it directly into our database. It seems that this is not possible with OpenPyXl and that I'll probably need to use win32com.client.
Does anybody have any experience with this, and know if it's even possible? Any pointers for where I might get started? I've tried several items from the Excel Object Model (PivotCache, GetData, etc.), but either I don't know how to use them or they don't return what I need.
Any help would be much appreciated. Thanks!
This answer is very late, but I came up with it while struggling with the same issue, and some of the comments above helped me nail it.
In essence, the steps one can take to solve this with openpyxl are:
Use openpyxl to get the openpyxl.pivot.table.TableDefinition object from the desired pivot table (let's call it my_pivot_table)
Get cached fields and their values from my_pivot_table.cache.cacheFields
Get the row data as dicts in two sub-steps:
3.1) Get all cached rows and their values from my_pivot_table.cache.records.r. Cache fields in these records are stored as indexes from my_pivot_table.cache.cacheFields
3.2) Replace cache fields from each record by their actual values, by "joining" cache.records.r and cache.cacheFields
Convert dict with rows into a pandas DataFrame
Below you will find a copy of the code that implements this solution. Since the structure of these Excel objects is somewhat complex, the code will probably look very cryptic (sorry about that). To address this, I'm adding minimal examples of the main objects being manipulated further below, so people can get a better sense of what is going on and which objects are being returned.
This was the simplest approach I could find to achieve this. I hope it is still useful for someone, although some tweaking may be needed for individual cases.
"Bare" code
import numpy as np
import pandas as pd
from openpyxl import load_workbook
from openpyxl.pivot.fields import Missing

file_path = 'path/to/your/file.xlsx'
workbook = load_workbook(file_path)
worksheet = workbook['Plan1']

# Name of desired pivot table (the same name that appears within Excel)
pivot_name = 'Tabela dinâmica1'

# Extract the pivot table object from the worksheet
pivot_table = [p for p in worksheet._pivots if p.name == pivot_name][0]

# Extract a dict of all cache fields and their respective values
fields_map = {}
for field in pivot_table.cache.cacheFields:
    if field.sharedItems.count > 0:
        fields_map[field.name] = [f.v for f in field.sharedItems._fields]

# Extract all rows from cache records. Each row is initially parsed as a dict
column_names = [field.name for field in pivot_table.cache.cacheFields]
rows = []
for record in pivot_table.cache.records.r:
    # If some field in the record is missing, we replace it by NaN
    record_values = [
        field.v if not isinstance(field, Missing) else np.nan for field in record._fields
    ]
    row_dict = {k: v for k, v in zip(column_names, record_values)}

    # Shared fields are mapped as an index, so we replace the field index by its value
    for key in fields_map:
        row_dict[key] = fields_map[key][row_dict[key]]

    rows.append(row_dict)

df = pd.DataFrame.from_dict(rows)
Results:
>>> df.head(2)
FUEL YEAR REGION STATE UNIT Jan Feb (...)
0 GASOLINE (m3) 2000.0 S TEXAS m3 9563.263 9563.263 (...)
1 GASOLINE (m3) 2000.0 NE NEW YORK m3 3065.758 9563.263 (...)
Some of the objects details
Object pivot_table
This is an object of type openpyxl.pivot.table.TableDefinition. It is quite complex. A small glimpse of it:
<openpyxl.pivot.table.TableDefinition object>
Parameters:
name='Tabela dinâmica1', cacheId=36, dataOnRows=True, dataPosition=None, (A LOT OF OMITTED STUFF...)
Parameters:
ref='B52:W66', firstHeaderRow=1, firstDataRow=2, firstDataCol=1, rowPageCount=2, colPageCount=1, pivotFields=[<openpyxl.pivot.table.PivotField object>
Parameters: (A LOT OF OMITTED STUFF...)
Object fields_map (from cache.cacheFields)
This is a dict with column names and their available values:
{'YEAR': [2000.0, 2001.0, 2002.0, 2003.0, 2004.0, 2005.0, 2006.0, 2007.0, 2008.0,
2009.0, 2010.0, 2011.0, 2012.0, 2013.0, 2014.0, 2015.0, 2016.0, 2017.0,
2018.0, 2019.0, 2020.0],
'FUEL': ['GASOLINE (m3)', 'AVIATION GASOLINE (m3)', 'KEROSENE (m3)'],
'STATE': ['TEXAS', 'NEW YORK', 'MAINE', (...)],
'REGION': ['S', 'NE', 'N', (...)]}
Object row_dict (before mapping)
Each row is a dict with column names and their values. Raw values for cache fields are not stored here. Here they are represented by their indexes in cache.cacheFields (see above)
{'YEAR': 0, # <<<--- 0 stands for index in fields_map
'Jan': 10719.983,
'Feb': 12482.281,
'FUEL': 0, # <<<--- index in fields_map
'Dec': 10818.094,
'STATE': 0, # <<<--- index in fields_map
(...)
'UNIT': 'm3'}
Object row_dict (after mapping)
After extracting raw values for cache fields from their indexes, we have a dict that represent all values of a row:
{'YEAR': 2000.0, # extracted column value from index in fields_map
'Jan': 10719.983,
'Feb': 12482.281,
'FUEL': 'GASOLINE (m3)', # extracted from fields_map
'Dec': 10818.094,
'STATE': 'TEXAS', # extracted from fields_map
(...)
'UNIT': 'm3'}
Building on #PMHM's excellent answer, I have modified the code to take care of source data with blank cells. The piece of code that needed modification is the following:
for field in pivot_table.cache.cacheFields:
    if field.sharedItems.count > 0:
        # take care of cases where f.v raises an AttributeError because the cell is empty
        # fields_map[field.name] = [f.v for f in field.sharedItems._fields]
        l = []
        for f in field.sharedItems._fields:
            try:
                l += [f.v]
            except AttributeError:
                l += [""]
        fields_map[field.name] = l
The complete code (mostly copy/paste from above) is therefore:
import numpy as np
import pandas as pd
from openpyxl import load_workbook
from openpyxl.pivot.fields import Missing

file_path = 'path/to/your/file.xlsx'
workbook = load_workbook(file_path)
worksheet = workbook['Plan1']

# Name of desired pivot table (the same name that appears within Excel)
pivot_name = 'Tabela dinâmica1'

# Extract the pivot table object from the worksheet
pivot_table = [p for p in worksheet._pivots if p.name == pivot_name][0]

# Extract a dict of all cache fields and their respective values
fields_map = {}
for field in pivot_table.cache.cacheFields:
    if field.sharedItems.count > 0:
        # take care of cases where f.v raises an AttributeError because the cell is empty
        # fields_map[field.name] = [f.v for f in field.sharedItems._fields]
        l = []
        for f in field.sharedItems._fields:
            try:
                l += [f.v]
            except AttributeError:
                l += [""]
        fields_map[field.name] = l

# Extract all rows from cache records. Each row is initially parsed as a dict
column_names = [field.name for field in pivot_table.cache.cacheFields]
rows = []
for record in pivot_table.cache.records.r:
    # If some field in the record is missing, we replace it by NaN
    record_values = [
        field.v if not isinstance(field, Missing) else np.nan for field in record._fields
    ]
    row_dict = {k: v for k, v in zip(column_names, record_values)}

    # Shared fields are mapped as an index, so we replace the field index by its value
    for key in fields_map:
        row_dict[key] = fields_map[key][row_dict[key]]

    rows.append(row_dict)

df = pd.DataFrame.from_dict(rows)
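If needed, the resulting DataFrame can then be written out as a CSV (the file name below is just a placeholder) or loaded into a database, as the question mentions:

# dump the extracted pivot-cache rows; 'pivot_cache_dump.csv' is a placeholder name
df.to_csv('pivot_cache_dump.csv', index=False, encoding='utf-8-sig')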
For a given path, I process many gigabytes of files inside it and yield a dataframe for every file processed.
For every dataframe that is yielded, which includes two string columns of varying size, I want to dump it to disk using the very efficient HDF5 format. The error is raised when the HDFStore.append procedure is called, on the 4th or 5th iteration.
I use the following (simplified) routine to build the dataframes:
# imports implied by the snippet (df = DataFrame, zf = ZipFile)
import os
import re
import zipfile
from zipfile import ZipFile as zf
from pandas import DataFrame as df
from pandas import HDFStore

def build_data_frames(path):
    data = df({'headline': [],
               'content': [],
               'publication': [],
               'file_ref': []},
              columns=['publication', 'file_ref', 'headline', 'content'])
    for curdir, subdirs, filenames in os.walk(path):
        for file in filenames:
            if zipfile.is_zipfile(os.path.join(curdir, file)):
                with zf(os.path.join(curdir, file), 'r') as arch:
                    for arch_file_name in arch.namelist():
                        if re.search('A[r|d]\d+.xml', arch_file_name) is not None:
                            xml_file_ref = arch.open(arch_file_name, 'r')
                            xml_file = xml_file_ref.read()
                            metadata = XML2MetaData(xml_file)
                            headlineTokens, contentTokens = XML2TokensParser(xml_file)
                            rows = [{'headline': " ".join(headlineTokens),
                                     'content': " ".join(contentTokens)}]
                            rows[0].update(metadata)
                            data = data.append(df(rows,
                                                  columns=['publication',
                                                           'file_ref',
                                                           'headline',
                                                           'content']),
                                               ignore_index=True)
                arch.close()
                yield data
Then I use the following method to write these dataframes to disk:
def extract_data(path):
    hdf_fname = extract_name(path)
    hdf_fname += ".h5"
    data_store = HDFStore(hdf_fname)
    for dataframe in build_data_frames(path):
        data_store.append('df', dataframe, data_columns=True)
        ## passing min_itemsize doesn't work either
        ## data_store.append('df', dataframe, min_itemsize=8000)
        ## trying the "alternative" command didn't help
        ## dataframe.to_hdf(hdf_fname, 'df', format='table', append=True,
        ##                  min_itemsize=80000)
    data_store.close()
Then I run:
%time load_data(publications_path)
And the ValueError I get is:
...
ValueError: Trying to store a string with len [5761] in [values_block_0]
column but this column has a limit of [4430]!
Consider using min_itemsize to preset the sizes on these columns
I tried all the options, went through all the documentation necessary for this task, and tried all the tricks I saw on the Internet, yet I have no idea why it happens.
I am using pandas version 0.17.0.
I appreciate your help very much!
Have you seen this post? stackoverflow
data_store.append('df', dataframe, min_itemsize={'string': 5761})
Change 'string' to your type.
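For example, with the two string columns from the question, one could preset per-column minimum sizes (a sketch; the sizes are illustrative and must be at least as large as the longest string you expect to store):

data_store.append(
    'df', dataframe, data_columns=True,
    min_itemsize={'headline': 2000, 'content': 10000},  # illustrative sizes, not measured
)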