I have a .xlsx file that looks like the attached file. What is the most common way to extract the different data parts from this Excel file in Python?
Ideally there would be a method defined as:
pd.read_part_csv(columns=['data1', 'data2', 'data3'], rows=['val1', 'val2', 'val3'])
that returns an iterator over pandas dataframes which hold the values of the given table.
Here is a solution with pylightxl that might be a good fit for your project if all you are doing is reading. I wrote the solution in terms of rows, but you could just as well have done it in terms of columns. See the docs for more info on pylightxl: https://pylightxl.readthedocs.io/en/latest/quickstart.html
import pylightxl

db = pylightxl.readxl('Book1.xlsx')

# pull out all the rowIDs where data groups start
keyrows = [rowID for rowID, row in enumerate(db.ws('Sheet1').rows, 1) if 'val1' in row]

# find the columnIDs where data groups start (like in your example, not all data groups start in col A)
keycols = []
for keyrow in keyrows:
    # add +1 since python index start from 0
    keycols.append(db.ws('Sheet1').row(keyrow).index('val1') + 1)

# define a dict to hold your data groups
datagroups = {}
# populate datatables
for tableIndex, keyrow in enumerate(keyrows, 1):
    i = 0
    # data groups: keys are group IDs starting from 1, list: list of data rows (ie: val1, val2...)
    datagroups.update({tableIndex: []})
    while True:
        # pull out the current group row of data, and remove leading cells with keycols
        datarow = db.ws('Sheet1').row(keyrow + i)[keycols[tableIndex - 1]:]
        # check if the current row is still part of the datagroup
        if datarow[0] == '':
            # current row is empty and is no longer part of the data group
            break
        datagroups[tableIndex].append(datarow)
        i += 1

print(datagroups[1])
print(datagroups[2])
[[1, 2, 3, ''], [4, 5, 6, ''], [7, 8, 9, '']]
[[9, 1, 4], [2, 4, 1], [3, 2, 1]]
Note that the output of table 1 has an extra '' in each row; that is because the sheet data is wider than your group size. You can easily remove these with list.remove('') if you like.
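If you then want the groups back as pandas DataFrames, as in your hypothetical pd.read_part_csv, a small wrapper over the datagroups dict above could look like this. The row_labels/columns arguments are assumptions on my part, since the label cells are stripped out by the slicing above:

import pandas as pd

def iter_datagroups(datagroups, row_labels=None, columns=None):
    """Yield one DataFrame per detected data group."""
    for group_id in sorted(datagroups):
        # drop the '' padding cells mentioned above
        rows = [[cell for cell in row if cell != ''] for row in datagroups[group_id]]
        yield pd.DataFrame(rows, index=row_labels, columns=columns)

for df in iter_datagroups(datagroups,
                          row_labels=['val1', 'val2', 'val3'],
                          columns=['data1', 'data2', 'data3']):
    print(df)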
I have 3 different tables I'm looking to push directly to 3 separate tabs in a Google Sheet. I set up the gspread connection and that's working well. I started to adjust my first print statement into what I thought would append the information to Tab A (waveData), but no luck.
I'm looking to append the information to the FIRST blank row in a tab, so that the data will be ADDED to what is already in there.
I'm trying to use append_rows to do this, but am hitting gspread.exceptions.APIError: {'code': 400, 'message': 'Invalid value at 'data.values' (type.googleapis.com/google.protobuf.ListValue).
I'm really new to this; I just thought it would be a fun project to evaluate wave sizes in NJ across all major surf spots, but I'm really in over my head (no pun intended).
Any thoughts?
import requests
import pandas as pd
import gspread

gc = gspread.service_account(filename='creds.json')
sh = gc.open_by_key('152qSpr-4nK9V5uHOiYOWTWUx4ojjVNZMdSmFYov-n50')

waveData = sh.get_worksheet(0)
tideData = sh.get_worksheet(1)
lightData = sh.get_worksheet(2)

# AddValue = ["Test", 25, "Test2"]
# lightData.insert_row(AddValue, 3)

id_list = [
    '/Belmar-Surf-Report/3683/',
    '/Manasquan-Surf-Report/386/',
    '/Ocean-Grove-Surf-Report/7945/',
    '/Asbury-Park-Surf-Report/857/',
    '/Avon-Surf-Report/4050/',
    '/Bay-Head-Surf-Report/4951/',
    '/Belmar-Surf-Report/3683/',
    '/Boardwalk-Surf-Report/9183/',
]

for x in id_list:
    waveData.append_rows(
        pd.read_html(requests.get('http://magicseaweed.com' + x).text)[2]
        .iloc[:9, [0, 1, 2, 3, 4, 6, 7, 12, 15]]
        .to_json(),
        value_input_option="USER_ENTERED")
    # print(pd.read_html(requests.get('http://magicseaweed.com' + x).text)[0])
    # print(pd.read_html(requests.get('http://magicseaweed.com' + x).text)[1])
From your following reply,
there really is no relationship between the 3. When I scrape with IMPORTHTML into Google sheets, those are just Tables at the locations 0, 1, and 2. I'm basically just trying to have an output of each table on a separate tab
I understood that you want to retrieve the values with pd.read_html(requests.get('http://magicseaweed.com' + x).text)[2].iloc[:9, [0, 1, 2, 3, 4, 6, 7, 12, 15]] for each entry in id_list, and put those values into a sheet in a Google Spreadsheet.
At append_rows, JSON data cannot be used directly; a 2-dimensional array is required. I'm also concerned about NaN values in the dataframe. With these points reflected in your script, how about the following modifications?
Modified script 1:
In this sample, all values are put into a sheet.
import requests
import pandas as pd
import gspread

gc = gspread.service_account(filename='creds.json')
sh = gc.open_by_key('152qSpr-4nK9V5uHOiYOWTWUx4ojjVNZMdSmFYov-n50')
waveData = sh.get_worksheet(0)
id_list = [
    "/Belmar-Surf-Report/3683/",
    "/Manasquan-Surf-Report/386/",
    "/Ocean-Grove-Surf-Report/7945/",
    "/Asbury-Park-Surf-Report/857/",
    "/Avon-Surf-Report/4050/",
    "/Bay-Head-Surf-Report/4951/",
    "/Belmar-Surf-Report/3683/",
    "/Boardwalk-Surf-Report/9183/",
]

# I modified the below script.
res = []
for x in id_list:
    df = pd.read_html(requests.get("http://magicseaweed.com" + x).text)[2].iloc[:9, [0, 1, 2, 3, 4, 6, 7, 12, 15]].fillna("")
    # build a 2-dimensional array: the path, the header row, then the data rows
    values = [[x], df.columns.values.tolist(), *df.values.tolist()]
    res.extend(values)
    res.append([])
waveData.append_rows(res, value_input_option="USER_ENTERED")
When this script is run, the retrieved values are put into the 1st sheet. In this sample modification, the path and a blank row are inserted between each data group. Please modify this for your actual situation.
Modified script 2:
In this sample, each value is put into each sheet.
import requests
import pandas as pd
import gspread

gc = gspread.service_account(filename='creds.json')
sh = gc.open_by_key('152qSpr-4nK9V5uHOiYOWTWUx4ojjVNZMdSmFYov-n50')
id_list = [
    "/Belmar-Surf-Report/3683/",
    "/Manasquan-Surf-Report/386/",
    "/Ocean-Grove-Surf-Report/7945/",
    "/Asbury-Park-Surf-Report/857/",
    "/Avon-Surf-Report/4050/",
    "/Bay-Head-Surf-Report/4951/",
    "/Belmar-Surf-Report/3683/",
    "/Boardwalk-Surf-Report/9183/",
]

# create any missing sheets, named after the paths in id_list
obj = {e.title: e for e in sh.worksheets()}
for e in id_list:
    if e not in obj:
        obj[e] = sh.add_worksheet(title=e, rows="1000", cols="26")

for x in id_list:
    df = pd.read_html(requests.get("http://magicseaweed.com" + x).text)[2].iloc[:9, [0, 1, 2, 3, 4, 6, 7, 12, 15]].fillna("")
    values = [df.columns.values.tolist(), *df.values.tolist()]
    obj[x].append_rows(values, value_input_option="USER_ENTERED")
When this script is run, sheets named after the values in id_list are created if they don't already exist, and each table is appended to its own sheet.
Reference:
append_rows
I need to write a program that compares 2 CSV files and reports the differences in an Excel file. It compares the records based on a primary key (and sometimes a few secondary keys), ignoring a list of other specified columns. All these parameters are read from an Excel file.
I have written code that does this and works okay for small files, but the performance is very poor for huge files (some files to be compared have well over 200K rows).
The current logic uses csv.DictReader to read the files. I iterate over the rows of the first file, each time finding the corresponding record in the second file (comparing primary and secondary keys). If the record is found, I then compare all the columns, ignoring those specified in the Excel file. If there is a difference in any of the columns, I write both records to the Excel report, highlighting the difference.
Below is the code I have so far. It would be very kind if someone could provide tips to optimize this program or suggest a different approach.
primary_key = wb['Parameters'].cell(6, 2).value  # Read Primary Key

secondary_keys = []  # Read Secondary Keys into a list
col = 4
while wb['Parameters'].cell(6, col).value:
    secondary_keys.append(wb['Parameters'].cell(6, col).value)
    col += 1
len_secondary_keys = len(secondary_keys)

ignore_col = []  # Read Columns to be ignored into a list
row = 8
while wb['Parameters'].cell(row, 2).value:
    ignore_col.append(wb['Parameters'].cell(row, 2).value)
    row += 1

with open(filename1) as csv_file_1, open(filename2) as csv_file_2:
    file1_reader = csv.DictReader(csv_file_1, delimiter='~')
    for row_file1 in file1_reader:
        record_found = False
        # rewind file 2 so it can be scanned again for every row of file 1
        csv_file_2.seek(0)
        file2_reader = csv.DictReader(csv_file_2, delimiter='~')
        for row_file2 in file2_reader:
            if row_file2[primary_key] == row_file1[primary_key]:
                for key in secondary_keys:
                    if row_file2[key] != row_file1[key]:
                        break
                compare(row_file1, row_file2)
                record_found = True
                break
        if not record_found:
            report_not_found(sheet_name1, row_file1, row_no_file1)


def compare(row_file1, row_file2):
    global row_diff
    data_difference = False
    for key in row_file1:
        if key not in ignore_col:
            if row_file1[key] != row_file2[key]:
                data_difference = True
                break
    if data_difference:
        c = 1
        for key in row_file1:
            wb_report['DW_Diff'].cell(row=row_diff, column=c).value = row_file1[key]
            wb_report['DW_Diff'].cell(row=row_diff + 1, column=c).value = row_file2[key]
            if row_file1[key] != row_file2[key]:
                wb_report['DW_Diff'].cell(row=row_diff + 1, column=c).fill = PatternFill(patternType='solid',
                                                                                         fill_type='solid',
                                                                                         fgColor=Color('FFFF0000'))
            c += 1
        row_diff += 2
You are running into speed issues because of the structure of your comparison. You are using a nested loop, comparing each entry in one collection to every entry in another, which is O(N^2) slow.
One way you could modify your code slightly is to change the way you ingest the data: instead of using csv.DictReader to make a list of dictionaries for each file, build a single dictionary for each file manually, using the primary & secondary keys as the dictionary keys. This way you can compare entries between the two dictionaries very easily, and in constant time per lookup.
This construct assumes that you have unique primary/secondary key combinations in each file, which it seems you are assuming from above.
Here is a toy example. In it I'm just using an (integer, animal type) tuple as the (primary key, secondary key) dictionary key:
In [7]: file1_dict = {(1, 'dog'): [45, 22, 66], (3, 'bird'): [55, 20, 1], (15, 'cat'): [6, 8, 90]}
In [8]: file2_dict = {(1, 'dog'): [45, 22, 66], (3, 'bird'): [4, 20, 1]}
In [9]: file1_dict
Out[9]: {(1, 'dog'): [45, 22, 66], (3, 'bird'): [55, 20, 1], (15, 'cat'): [6, 8, 90]}
In [10]: file2_dict
Out[10]: {(1, 'dog'): [45, 22, 66], (3, 'bird'): [4, 20, 1]}
In [11]: for k in file1_dict:
    ...:     if k in file2_dict:
    ...:         if file1_dict[k] == file2_dict[k]:
    ...:             print('matched %s' % str(k))
    ...:         else:
    ...:             print('different %s' % str(k))
    ...:     else:
    ...:         print('no corresponding key for %s' % str(k))
    ...:
matched (1, 'dog')
different (3, 'bird')
no corresponding key for (15, 'cat')
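Applied to your actual files, a minimal sketch of this idea might look like the following. It reuses your primary_key, secondary_keys, '~' delimiter and existing compare()/report_not_found() helpers; treat it as an outline rather than a drop-in replacement (for example, it doesn't report rows that exist only in file 2):

import csv

def load_keyed(path, primary_key, secondary_keys, delimiter='~'):
    """Read a CSV once and key every row by (primary key, *secondary keys)."""
    keyed = {}
    with open(path, newline='') as f:
        for row in csv.DictReader(f, delimiter=delimiter):
            keyed[(row[primary_key], *(row[k] for k in secondary_keys))] = row
    return keyed

file1_dict = load_keyed(filename1, primary_key, secondary_keys)
file2_dict = load_keyed(filename2, primary_key, secondary_keys)

for k, row_file1 in file1_dict.items():
    row_file2 = file2_dict.get(k)   # constant-time lookup instead of a second file scan
    if row_file2 is None:
        report_not_found(sheet_name1, row_file1, None)  # row only exists in file 1
    else:
        compare(row_file1, row_file2)                   # your existing compare() still works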
I was able to achieve this using the pandas library as suggested by @Vaibhav Jadhav, using the steps below:
1. Import the 2 CSV files into dataframes, e.g.:
try:
    data1 = pd.read_csv(codecs.open(filename1, 'rb', 'utf-8', errors='ignore'), sep=delimiter1, dtype='str', error_bad_lines=False)
    print(data1[keys[0]])
except:
    data1 = pd.read_csv(codecs.open(filename1, 'rb', 'utf-16', errors='ignore'), sep=delimiter1, dtype='str', error_bad_lines=False)
2. Delete the columns not to be compared from both dataframes:
for col in data1.columns:
    if col in ignore_col:
        del data1[col]
        del data2[col]
3. Merge the 2 dataframes with indicator=True:
merged = pd.merge(data1, data2, how='outer', indicator=True)
4. From the merged dataframe, delete the rows that were present in both dataframes:
merged = merged[merged._merge != 'both']
5. Sort the dataframe by the key(s):
merged.sort_values(by = keys, inplace = True, kind = 'quicksort')
6. Iterate over the rows of the sorted dataframe, comparing the keys of consecutive rows. If the keys differ, the row exists in only one of the 2 CSV files; if the keys are the same, iterate over the individual columns to find which column values differ.
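A rough sketch of that final step, assuming keys holds the list of key column names and merged is the sorted dataframe from above (the _merge column tells you which file an unmatched row came from; NaN-vs-NaN will show up as a difference unless you fillna first):

rows = merged.reset_index(drop=True)
i = 0
while i < len(rows):
    # a matched pair: the same key values on two consecutive rows
    if i + 1 < len(rows) and all(rows.at[i, k] == rows.at[i + 1, k] for k in keys):
        diff_cols = [c for c in rows.columns
                     if c not in list(keys) + ['_merge'] and rows.at[i, c] != rows.at[i + 1, c]]
        print('difference in', [rows.at[i, k] for k in keys], '->', diff_cols)
        i += 2
    else:
        # key present in only one file; _merge is 'left_only' or 'right_only'
        print('only in', rows.at[i, '_merge'], ':', [rows.at[i, k] for k in keys])
        i += 1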
This is a good use case for Apache Beam.
Features like GroupByKey will make matching by keys more efficient.
Using an appropriate runner, you can efficiently scale to much larger datasets.
There is possibly no built-in Excel IO, but you could output to a CSV, a database, etc.
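For illustration only, a minimal sketch of the CoGroupByKey idea; the '~' delimiter, the key position, and the file names are assumptions, and header rows, secondary keys, and the Excel report are left out:

import apache_beam as beam

def key_by_primary(line):
    # assumption: '~'-delimited rows with the primary key in the first column
    fields = line.split('~')
    return fields[0], fields

with beam.Pipeline() as p:
    file1 = p | 'Read file1' >> beam.io.ReadFromText('file1.csv') | 'Key file1' >> beam.Map(key_by_primary)
    file2 = p | 'Read file2' >> beam.io.ReadFromText('file2.csv') | 'Key file2' >> beam.Map(key_by_primary)
    (
        {'file1': file1, 'file2': file2}
        | 'Group by key' >> beam.CoGroupByKey()
        # keep keys whose rows differ, including keys missing from one file
        | 'Keep differences' >> beam.Filter(lambda kv: list(kv[1]['file1']) != list(kv[1]['file2']))
        | 'Format' >> beam.Map(lambda kv: f"{kv[0]}: file1={list(kv[1]['file1'])} file2={list(kv[1]['file2'])}")
        | 'Write report' >> beam.io.WriteToText('differences')
    )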
https://beam.apache.org/documentation/
https://beam.apache.org/documentation/transforms/python/aggregation/groupbykey/
https://beam.apache.org/documentation/runners/capability-matrix/
https://beam.apache.org/documentation/io/built-in/
I want to add data (dictionaries) to a dictionary, where every added dictionary represents a new row. It is an iterative process and it is not known which column names a newly added dictionary (row) will have. In the end I want a pandas dataframe. Furthermore, I have to write the dataframe to a file every 1500 rows (which is a problem, because after 1500 rows it can of course happen that new data is added whose columns are not present in the 1500 rows already written to the file).
I need an approach that is very fast (maybe 26 ms per row). My approach is slow, because it has to check every piece of data for new column names, and at the end it has to re-read the file to create a new file where all columns have the same length. The data comes from a queue which is processed in another process.
import pandas as pd

def writingData(writingQueue, exportFullName='path', buffer=1500, maxFiles=150000):
    imagesPassed = 0
    with open(exportFullName, 'a') as f:
        columnNamesAllList = []
        columnNamesAllSet = set()
        dfTempAll = pd.DataFrame(index=range(buffer), columns=columnNamesAllList)
        columnNamesUpdated = False

        for data in iter(writingQueue.get, "STOP"):
            print(imagesPassed)
            dfTemp = pd.DataFrame([data], index=[imagesPassed])

            if set(dfTemp).difference(columnNamesAllSet):
                columnNamesAllSet.update(set(dfTemp))
                columnNamesAllList.extend(list(dfTemp))
                columnNamesUpdated = True
            else:
                columnNamesUpdated = False

            if columnNamesUpdated:
                print('Updated')
                dfTempAll = dfTemp.combine_first(dfTempAll)
            else:
                dfTempAll.iloc[imagesPassed - 1] = dfTemp.iloc[0]

            imagesPassed += 1
            if imagesPassed == buffer:
                dfTempAll.dropna(how='all', inplace=True)
                dfTempAll.to_csv(f, sep='\t', header=True)
                dfTempAll = pd.DataFrame(index=range(buffer), columns=columnNamesAllList)
                imagesPassed = 0
Reading it in again:
dfTempAll = pd.DataFrame(index=range(maxFiles), columns=columnNamesAllList)
for number, chunk in enumerate(pd.read_csv(exportFullName, delimiter='\t', chunksize=buffer, low_memory=True, memory_map=True, engine='c')):
    dfTempAll.iloc[number * buffer:(number + 1) * buffer] = pd.concat([chunk, columnNamesAllList]).values  # .to_csv(f, sep='\t', header=False)  # , chunksize=buffer
    # dfTempAll = pd.concat([chunk, dfTempAll])
dfTempAll.reset_index(drop=True, inplace=True)
dfTempAll.to_csv(exportFullName, sep='\t', header=True)
Small example with dataframes
To make it clear: let's say I have an already existing dataframe with 4 rows (in the real case it could have 150000 rows, like in the code above), where 2 rows are already filled with data. When I add a new row, it could look like this, except that in the raw input the new data is a dictionary:
df1 = pd.DataFrame(index=range(4), columns=['A', 'B', 'D'], data={'A': [1, 2, 'NaN', 'NaN'], 'B': [3, 4, 'NaN', 'NaN'], 'D': [3, 4, 'NaN', 'NaN']})
df2 = pd.DataFrame(index=[2], columns=['A', 'C', 'B'], data={'A': [0], 'B': [0], 'C': [0]})
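For reference, this is roughly what the combine_first call in the loop above does with these two example frames; a quick check (a sketch only, exact dtypes depend on the pandas version):

combined = df2.combine_first(df1)
print(combined.columns.tolist())  # ['A', 'B', 'C', 'D'] -- the new column 'C' is present for every row
print(combined.loc[2])            # row 2 takes df2's values for A, B and C, and keeps df1's D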
I am trying to create an array from a JSON object. I can print the required values, but I can't push them into an array in Python. How can I do that?
data={"wc":[{"value":8,"id":0},{"value":9,"id":1}]}
dataset = []
test=[]
for i in data['wc']:
print(i['value'],',',i['id'])
test=i['value'],i['id']
dataset.append(test)
print(dataset)
I am getting the correct values, but each pair is wrapped in '(' and ')'.
How can I remove them and get the final output as
[8, 0, 9, 1]
i.e. [value, id, value, id, ...]?
Your data is already a dict containing a list of dictionaries. Just iterate over the values of the nested dicts:
dataset = []
for entry in data['wc']:
    for value in entry.values():
        dataset.append(value)

>>> dataset
[8, 0, 9, 1]
Or, to make the order (value first, id second) explicit regardless of how the dicts are ordered:
dataset = []
for entry in data['wc']:
    dataset.extend([entry['value'], entry['id']])

>>> dataset
[8, 0, 9, 1]
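If you prefer a one-liner, the same result as a list comprehension:

dataset = [v for entry in data['wc'] for v in (entry['value'], entry['id'])]
# [8, 0, 9, 1]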
I have Python code below that will loop through a table and print out values within a particular column. What is not shown is the form in which the user selects a Feature Layer. Once the Feature Layer is selected, a second dropdown is populated with all the column headings for that feature, and the user chooses which column they want to focus on. Within the Python script, I simply print out each value in that column. But I want to store each value in a list or array and get distinct values. How can I do this in Python?
Also, is there a more efficient way to loop through the table than going row by row? That is very slow for some reason.
Many thanks
# Import system modules
import sys, string, os, arcgisscripting

# Create the Geoprocessor object
gp = arcgisscripting.create(9.3)
gp.AddToolbox("E:/Program Files (x86)/ArcGIS/ArcToolbox/Toolboxes/Data Management Tools.tbx")

# Declare our user input args
input_dataset = sys.argv[1]  # This is the Feature Layer the User wants to Query against
Atts = sys.argv[2]  # This is the Column Name The User Selected

# Lets Loop through the rows to get values from a particular column
fc = input_dataset
gp.AddMessage(Atts)
rows = gp.searchcursor(fc)
row = rows.next()

NewList = []
for row in gp.SearchCursor(fc):
    ##grab field values
    fcValue = fields.getvalue(Atts)
    NewList.add(fcValue)
You can store distinct values in a set:
>>> a = [ 1, 2, 3, 1, 5, 3, 2, 1, 5, 4 ]
>>> b = set( a )
>>> b
{1, 2, 3, 4, 5}
>>> b.add( 5 )
>>> b
{1, 2, 3, 4, 5}
>>> b.add( 6 )
>>> b
{1, 2, 3, 4, 5, 6}
Also, you can make your loop more pythonic, although I'm not sure why you loop over the rows to begin with, given that you are not using row inside the loop:
for row in gp.searchcursor(fc):
    ##grab field values
    fcValue = fields.getvalue(Atts)
    gp.AddMessage(fcValue)
And by the way, """ text """ is not a comment (it is just a string literal); Python only has single-line comments, starting with #.
One way to get distinct values is to use a set to see if you've seen the value already, and display it only when it's a new value:
fcValues = set()
for row in gp.searchcursor(fc):
    ##grab field values
    fcValue = fields.getvalue(Atts)
    if fcValue not in fcValues:
        gp.AddMessage(fcValue)
        fcValues.add(fcValue)
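If you also need the distinct values back in a plain list at the end (as in your NewList), you can convert the set once the loop is done:

NewList = list(fcValues)      # distinct values, in arbitrary order
# NewList = sorted(fcValues)  # or sorted, if the values are comparable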