Converting excel data to nested dict and list - python

This is almost the same as my question from yesterday. There, I took it for granted that a list of unique values could be used to create the nested dict & list structure. Now the question is how to build this dict & list structure (referred to below as the data structure) row by row from the Excel data.
The excel files (multiple files in a folder) all look like the following:
|Category |Subcategory|Name |
|---------|-----------|-------------|
|Main Dish|Noodle |Tomato Noodle|
|Main Dish|Stir Fry |Chicken Rice |
|Main Dish|Soup |Beef Goulash |
|Drink |Wine |Bordeaux |
|Drink |Softdrink |Cola |
My desired dict & list structure is:
data = {
    0: {'data': 0, 'Category': [
        {'name': 'Main Dish', 'Subcategory': [
            {'name': 'Noodle', 'key': 0, 'data': [{'key': 1, 'title': 'Tomato Noodle'}]},
            {'name': 'Stir Fry', 'key': 1, 'data': [{'key': 2, 'title': 'Chicken Rice'}]},
            {'name': 'Soup', 'key': 2, 'data': [{'key': 3, 'title': 'Beef Goulash'}]}]},
        {'name': 'Drink', 'Subcategory': [
            {'name': 'Wine', 'key': 0, 'data': [{'key': 1, 'title': 'Bordeaux'}]},
            {'name': 'Softdrink', 'key': 1, 'data': [{'key': 2, 'title': 'Cola'}]}]}]},
    1: {'data': 1, 'Category': ...},  # same structure as dataset 0
}
So, for each Excel file, looping through and setting {'data': 0, 'Category': []}, {'data': 1, 'Category': []}, and so on is fine. The key point is that each Category and Subcategory value should appear only once in the data structure: Main Dish has three rows in Excel but needs only one entry, and Drink has two rows but also needs only one entry. The subcategories nested in each category list follow the same rule, with only unique values nested under a category. Each dish Name then goes into the data structure according to its category and subcategory.
The issue is that I cannot find a good way to convert the data into this structure. In addition, there are more columns after the Name column, so it is fairly complicated. I considered first extracting the unique values from the entire Category and Subcategory columns, which simplifies that part, but it leads to problems when filling in the corresponding Name values. If I go row by row instead, designing a test for whether a category or subcategory already exists is difficult at my current skill level...
Therefore, what would be the best approach to convert such an Excel file into this data structure? Thank you very much.

One way could be to read the Excel file into a DataFrame using pandas, and then build on this excellent answer: Pandas convert DataFrame to Nested Json
import pandas as pd

excel_file = 'path-to-your-excel.xls'

def fdrec(df):
    drec = dict()
    ncols = df.values.shape[1]
    for line in df.values:
        d = drec
        # walk every column except the last, nesting one dict level per column
        for j, col in enumerate(line[:-1]):
            if col not in d:
                if j != ncols - 2:
                    d[col] = {}
                    d = d[col]
                else:
                    # second-to-last column: store the last column as the leaf value
                    d[col] = line[-1]
            else:
                if j != ncols - 2:
                    d = d[col]
    return drec

df = pd.read_excel(excel_file)
print(fdrec(df))
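For reference, if all that's needed is the unique-value nesting itself (each category once, each subcategory once, and the names collected in lists), a pandas groupby can build it directly. This is a minimal sketch using hypothetical inline data matching the sample table, not the asker's exact key/title numbering:

```python
import pandas as pd

# sample rows matching the Excel layout above (hypothetical inline data)
df = pd.DataFrame({
    'Category': ['Main Dish', 'Main Dish', 'Main Dish', 'Drink', 'Drink'],
    'Subcategory': ['Noodle', 'Stir Fry', 'Soup', 'Wine', 'Softdrink'],
    'Name': ['Tomato Noodle', 'Chicken Rice', 'Beef Goulash', 'Bordeaux', 'Cola'],
})

# groupby deduplicates categories and subcategories automatically;
# each (Category, Subcategory) pair keeps a list of its names
nested = {
    cat: {sub: g['Name'].tolist()
          for sub, g in cat_df.groupby('Subcategory', sort=False)}
    for cat, cat_df in df.groupby('Category', sort=False)
}
print(nested)
```

With `sort=False`, the categories keep their first-appearance order from the file, which makes it straightforward to assign sequential `key` numbers afterwards.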

Related

Openpyxl and Binary Search

The problem: I have two spreadsheets. Spreadsheet 1 has about 20,000 rows. Spreadsheet 2 has near 1 million rows. When a value from a row in spreadsheet 1 matches a value from a row in spreadsheet 2, the entire row from spreadsheet 2 is written to excel. The problem isn't too difficult, but with such a large number of rows, the run time is incredibly long.
Book 1 Example:
|Key |Value |
|------|------------------|
|397241|587727227839578000|
An example of book 2:
|ID |a |b |c |
|------------------|---|--|----|
|587727227839578000|393|24|0.43|
My current solution is:
import openpyxl

g1 = openpyxl.load_workbook('path/to/sheet/sheet1.xlsx', read_only=True)
grid1 = g1.active
grid1_rows = list(grid1.rows)

g2 = openpyxl.load_workbook('path/to/sheet2/sheet2.xlsx', read_only=True)
grid2 = g2.active
grid2_rows = list(grid2.rows)

# output_file is assumed to be an already-open text file
for row in grid1_rows:
    value1 = int(row[1].value)
    print(value1)
    for row2 in grid2_rows:
        value2 = int(row2[0].value)
        if value1 == value2:
            new_Name = int(row[0].value)
            print("match")
            output_file.write(str(new_Name))
            output_file.write(",")
            output_file.write(",".join(str(c.value) for c in row2[1:]))
            output_file.write("\n")
This solution works, but again the runtime is absurd. Ideally I'd like to take value1 (which comes from the first sheet), perform a binary search for that value in the other sheet, and then, just like in my current solution, copy the entire row to a new file on a match.
If there's an even faster method to do this I'm all ears. I'm not the greatest at python so any help is appreciated.
Thanks.
You are getting your butt kicked here because you are using an inappropriate data structure, which requires you to use the nested loop.
The example below uses sets to match indices from the first sheet to those in the second sheet. This assumes there are no duplicates on either sheet, which would be odd given your problem description. Once we make sets of the indices from both sheets, all we need to do is intersect the two sets to find the indices that appear on both.
That gives us the matches, but we can do better. If we put the second sheet's row data into a dictionary with the indices as keys, then we can hold onto the row data while we do the match, rather than having to go hunting for the matching rows after intersecting the sets.
I've also put in an enumeration, which may or may not be needed to identify which rows in the spreadsheet are the ones of interest. Probably not needed.
This should execute in the blink of an eye after things are loaded. If you start to have memory issues, you may want to just construct the dictionary at the start rather than the list and the dictionary.
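The core of the approach, stripped of the spreadsheet handling, looks like this (toy values, not the actual workbook contents):

```python
# index values from sheet 1 that we want to find
search_items = {397241, 455, 202}

# sheet-2 rows keyed by their index value
lookup_dict = {455: ('dogfood', 10), 202: ('steak', 3), 999: ('tuna', 7)}

# one set intersection replaces the whole nested loop
matches = search_items & lookup_dict.keys()
for key in sorted(matches):
    print(key, lookup_dict[key])
```

Set intersection and dict lookups are both effectively constant time per element, which is why this drops the quadratic nested loop down to roughly linear work.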
Book 1:
Book 2:
Code:
import openpyxl

g1 = openpyxl.load_workbook('Book1.xlsx', read_only=True)
grid1 = g1.active
grid1_rows = list(grid1.rows)[1:]  # exclude the header

g2 = openpyxl.load_workbook('Book2.xlsx', read_only=True)
grid2 = g2.active
grid2_rows = list(grid2.rows)[1:]  # exclude the header

# make a set of the values in Book 1 that we want to search for...
search_items = {int(t[0].value) for t in grid1_rows}
# print(search_items)

# make a dictionary (key-value pairing) for the items in the 2nd book, and
# include an enumeration so we can capture the row number
lookup_dict = {int(t[0].value): (idx, t) for idx, t in enumerate(grid2_rows, start=1)}
# print(lookup_dict)

# now intersect the set of search items and the dict keys to get the keys of the matches...
keys = search_items & lookup_dict.keys()
# print(keys)

for key in keys:
    idx, row_data = lookup_dict[key]  # the row index (if needed) and the row data
    print(f'row {idx} matched value {key} and has data:')
    print(f'    name: {row_data[1].value:10s} \t qty: {int(row_data[2].value)}')
Output:
row 3 matched value 202 and has data:
name: steak qty: 3
row 1 matched value 455 and has data:
name: dogfood qty: 10

Identifying elements in a dataframe

I have a dictionary of dataframes called names_and_places in pandas that looks like the below.
names_and_places:
Alfred,,,
Date,F_1,F_2,Key
4/1/2020,1,4,NAN
4/2/2020,2,5,NAN
4/3/2020,3,6,"[USA,NY,NY, NY]"
Brett,,,
Date,F_1,F_2,Key
4/1/2020,202,404,NAN
4/2/2020,101,401,NAN
4/3/2020,102,403,"[USA,CT, Fairfield, Stamford] "
Claire,,,
Date,F_1,F_2,Key
4/1/2020,NAN,12,NAN
4/2/2020,NAN,45,NAN
4/3/2020,7,78,"[USA,CT, Fairfield, Darian] "
Dane,,,
Date,F_1,F_2,Key
4/1/2020,4,17,NAN
4/2/2020,5,18,NAN
4/3/2020,7,19,"[USA,CT, Bridgeport, New Haven] "
Edward,,,
Date,F_1,F_2,Key
4/1/2020,4,17,NAN
4/2/2020,5,18,NAN
4/3/2020,7,19,"[USA,CT, Bridgeport, Milford] "
The Key column is either going to be NAN or of the form [Country, State, County, City], but it can have 3 or 4 elements (sometimes County is absent). I need to find all the names whose Key contains a given element. For instance, if element = "CT", the script returns Edward, Brett, Dane, and Claire (order is not important). If element = "Stamford", then only Brett is returned. However, my identification process seems very inefficient: I iterate through each possible combination of State, County, and City (all of which I am currently manually inputting into variables) to identify which names to extract, like below:
country = 'USA'  # this never needs to change
element = 'CT'
# These next two are actually in .txt files that I create once I am asked for
# a given breakdown, but I would like to not have to manually input these
middle_node = ['Fairfield', 'Bridgeport']
terminal_nodes = ['Stamford', 'Darian', 'New Haven', 'Milford']

names = []
for a in middle_node:
    for b in terminal_nodes:
        my_key = [country, element, a, b]
        for s in names_and_places:
            for z in names_and_places[s]['Key']:
                if my_key == z:
                    names.append(s)
# Note: having "if my_key in names_and_places[s]['Key']:" was causing
# sporadic failures for some reason

display(names)
Output:
Edward, Brett, Dane, Claire
What I would like is to input only the variable element, which can be a level-2 (State), level-3 (County), or level-4 (City) node. However, short of adding more for loops that dig into the Key column, I don't know how to do this. The one benefit (for a novice like myself) is that the double for loop keeps the bucketing intact and makes it easier for people to see where names are coming from when that is also needed.
But is there a better way? Bonus points if there is a way to handle the case where element is 'NY' and values in the Key column can be like [USA, NY, NY, NY] or [USA, NY, NY, Queens].
Edit: names_and_places is a dictionary keyed by name, so
display(names_and_places['Alfred'])
would be
Date,F_1,F_2,Key
4/1/2020,1,4,NAN
4/2/2020,2,5,NAN
4/3/2020,3,6,"[USA,NY,NY, NY]"
I do have the raw dataframe, which has the columns:
Date, Field Name, Value, Name
where Field Name is either F_1, F_2, or Key, and Value is the associated value of that field. I then pivot the data on Name, with columns from Field Name, to make my extraction easier.
Here's a somewhat more efficient way to do this: start by building a single dataframe out of the dictionary, and then do the actual work on that dataframe.
import numpy as np
import pandas as pd

single_df = pd.concat([df.assign(name=k) for k, df in names_and_places.items()])
single_df["Key"] = single_df.Key.replace("NAN", np.nan)
single_df.dropna(inplace=True)

# Since the location is a string, we have to parse it.
location_df = single_df.Key.str.replace(r"[\[\]]", "", regex=True).str.split(",", expand=True)
location_df.columns = ["Country", "State", "County", "City"]
single_df = pd.concat([single_df, location_df], axis=1)

# this is where the actual query goes.
single_df[(single_df.Country == "USA") & (single_df.State == "CT")].name
The output is:
2 Brett
2 Claire
2 Dane
2 Edward
Name: name, dtype: object
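For the follow-up question (matching an element at any level), one could test membership across all the parsed location columns instead of one fixed column. A sketch with hypothetical inline data standing in for the parsed location table from the answer above:

```python
import pandas as pd

# hypothetical parsed locations, indexed by name
location_df = pd.DataFrame(
    [["USA", "NY", "NY", "NY"],
     ["USA", "CT", "Fairfield", "Stamford"],
     ["USA", "CT", "Bridgeport", "Milford"]],
    columns=["Country", "State", "County", "City"],
    index=["Alfred", "Brett", "Edward"],
)

def names_matching(element):
    # True for any row where the element appears in State, County, or City
    mask = location_df[["State", "County", "City"]].eq(element).any(axis=1)
    return list(location_df.index[mask])

print(names_matching("CT"))
print(names_matching("Stamford"))
```

Because `any(axis=1)` looks at every level at once, a value like 'NY' matches whether it appears as State, County, or City, which also handles the [USA, NY, NY, NY] bonus case.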

Working with data from CSV with Python without using Pandas

I am very new to using Python to process data in CSV files. I have a CSV file with the data below. I want to take the averages of the time stamps in the Sprint, Jog, and Walk columns, per session. The example below has the subject John Doe with Session2 and Session3, which I would like to average separately and write to a new CSV file. Is there a way, without using pandas but with other modules like csv or NumPy, to gather the data by person (subject) and then by session? I have tried to make a dictionary, but the keys get overwritten. I have also tried using a list, but I cannot figure out how to target the sessions to average them. I also tried using DictReader to read the field names and then process the data, but I cannot figure out how to group all the John Doe Session2 data to find the average of the times.
Subject, Session, Course, Size, Category, Sprint, Jog, Walk
John Doe, Session2, 17, 2, Bad, 25s, 36s, 55s
John Doe, Session2, 3, 2, Good, 26s, 35s, 45s
John Doe, Session2, 1, 2, Good, 22s, 31s, 47s
John Doe, Session3, 5, 2, Good, 16s, 32s, 55s
John Doe, Session3, 2, 2, Good, 13s, 24s, 52s
John Doe, Session3, 16, 2, Bad, 15s, 26s, 49s
PS: I say no pandas because my groupmates do not want to add that module, since we already have so many other dependencies.
Given your input, these built-in Python libraries can generate the output you want:
import csv
from itertools import groupby
from operator import itemgetter
from collections import defaultdict

with open('input.csv', 'r', newline='') as fin, open('output.csv', 'w', newline='') as fout:
    # skipinitialspace is needed because the sample data has spaces after comma delimiters.
    reader = csv.DictReader(fin, skipinitialspace=True)
    # Output file will have these fieldnames
    writer = csv.DictWriter(fout, fieldnames='Subject Session Sprint Jog Walk'.split())
    writer.writeheader()
    # For each subject/session, groupby returns a 2-tuple of the sort key and an
    # iterator over the rows of that key. Data must be sorted by the key already!
    for (subject, session), group in groupby(reader, key=itemgetter('Subject', 'Session')):
        # Build the row to output. defaultdict(int) assumes integer 0 if a key doesn't exist.
        row = defaultdict(int)
        row['Subject'] = subject
        row['Session'] = session
        # Count the items for the average.
        count = 0
        for item in group:
            count += 1
            # sum the rows, removing the trailing 's'
            for col in ('Sprint', 'Jog', 'Walk'):
                row[col] += int(item[col][:-1])
        # produce the average
        for col in ('Sprint', 'Jog', 'Walk'):
            row[col] /= count
        writer.writerow(row)
Output:
Subject,Session,Sprint,Jog,Walk
John Doe,Session2,24.333333333333332,34.0,49.0
John Doe,Session3,14.666666666666666,27.333333333333332,52.0
Function links: itemgetter, groupby, defaultdict
If your data is not pre-sorted, you can use the following replacement lines to read in and sort the data by using the same key used in groupby. However, in this implementation the data must be small enough to load it all into memory at once.
sortkey = itemgetter('Subject', 'Session')
data = sorted(reader, key=sortkey)
for (subject, session), group in groupby(data, key=sortkey):
    ...
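A tiny illustration of why the pre-sort matters: itertools.groupby only groups consecutive rows with equal keys (toy tuples below, not the CSV above):

```python
from itertools import groupby
from operator import itemgetter

rows = [('John', 'S2'), ('John', 'S3'), ('John', 'S2')]

# unsorted: 'S2' shows up as two separate groups
print([k for k, _ in groupby(rows, key=itemgetter(0, 1))])
# [('John', 'S2'), ('John', 'S3'), ('John', 'S2')]

# sorted first: each key yields exactly one group
print([k for k, _ in groupby(sorted(rows, key=itemgetter(0, 1)), key=itemgetter(0, 1))])
# [('John', 'S2'), ('John', 'S3')]
```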
As you want the average grouped by subject and session, just compose unique keys out of that information:
import csv

times = {}
with open('yourfile.csv', 'r') as csvfile:
    reader = csv.reader(csvfile, delimiter=',')
    next(reader)  # skip the header row
    for row in reader:
        key = row[0] + row[1]
        if key not in times:
            times[key] = row[-3:]
        else:
            times[key].extend(row[-3:])

# note: this averages the Sprint, Jog, and Walk values together into one number per key
average = {k: sum(int(entry[:-1]) for entry in v) / len(v) for k, v in times.items()}
This assumes that the first two entries have a regular structure as in your example and that there is no ambiguity when concatenating the first two entries of each row. To be safe, one could insert a special delimiter between them in the key.
If you are also the person storing the data: Writing the unit of a column in the column header saves transformation effort later and avoids redundant information storage.

How do I combine multiple rows of a CSV that share data into one row using Pandas?

I have downloaded the ASCAP database, giving me a CSV that is too large for Excel to handle. I'm able to chunk the CSV to open parts of it, the problem is that the data isn't super helpful in its default format. Each song title has 3+ rows associated with it:
The first row includes the % share that ASCAP has in that song.
The rows after that include a character code (ROLE_TYPE) that indicates whether the row contains the writer or the performer of that song.
The first column of each row contains the song title.
This structure makes the data confusing because the rows that list the % share have blank cells in the NAME column, since those rows have no Writer/Performer associated with them.
What I would like to do is transform this data from having 3+ rows per song to having 1 row per song with all relevant data.
So instead of:
TITLE, ROLE_TYPE, NAME, SHARES, NOTE
I would like to change the data to:
TITLE, WRITER, PERFORMER, SHARES, NOTE
Here is a sample of the data:
TITLE,ROLE_TYPE,NAME,SHARES,NOTE
SCORE MORE,ASCAP,Total Current ASCAP Share,100,
SCORE MORE,W,SMITH ANTONIO RENARD,,
SCORE MORE,P,SMITH SHOW PUBLISHING,,
PEOPLE KNO,ASCAP,Total Current ASCAP Share,100,
PEOPLE KNO,W,SMITH ANTONIO RENARD,,
PEOPLE KNO,P,SMITH SHOW PUBLISHING,,
FEEDBACK,ASCAP,Total Current ASCAP Share,100,
FEEDBACK,W,SMITH ANTONIO RENARD,,
I would like the data to look like:
TITLE, WRITER, PERFORMER, SHARES, NOTE
SCORE MORE, SMITH ANTONIO RENARD, SMITH SHOW PUBLISHING, 100,
PEOPLE KNO, SMITH ANTONIO RENARD, SMITH SHOW PUBLISHING, 100,
FEEDBACK, SMITH ANTONIO RENARD, SMITH SHOW PUBLISHING, 100,
I'm using python/pandas to try and work with the data. I am able to use groupby('TITLE') to group rows with matching titles.
import pandas as pd

data = pd.read_csv("COMMA_ASCAP_TEXT.txt", low_memory=False)
title_grouped = data.groupby('TITLE')
for TITLE, group in title_grouped:
    print(TITLE)
    print(group)
Grouping by 'TITLE' for each song gives output that seems close to what I want:
SCORE MORE
TITLE ROLE_TYPE NAME SHARES NOTE
0 SCORE MORE ASCAP Total Current ASCAP Share 100.0 NaN
1 SCORE MORE W SMITH ANTONIO RENARD NaN NaN
2 SCORE MORE P SMITH SHOW PUBLISHING NaN NaN
What do I need to do to take this group and produce a single row in a CSV file with all the data related to each song?
I would recommend:
1. Decompose the data by ROLE_TYPE.
2. Prepare the data for the merge (rename columns and drop unnecessary columns).
3. Merge everything back into one DataFrame.
Merge will be automatically performed over the column which has the same name in the DataFrames being merged (TITLE in this case).
Seems to work nicely :)
import pandas as pd

data = pd.read_csv("data2.csv", sep=",")

# Create 3 individual DataFrames for the different roles
data_ascap = data[data["ROLE_TYPE"] == "ASCAP"].copy()
data_writer = data[data["ROLE_TYPE"] == "W"].copy()
data_performer = data[data["ROLE_TYPE"] == "P"].copy()

# Remove unnecessary columns for the ASCAP role
data_ascap.drop(["ROLE_TYPE", "NAME"], axis=1, inplace=True)

# Rename columns and remove unnecessary columns for the WRITER role
data_writer.rename(columns={"NAME": "WRITER"}, inplace=True)
data_writer.drop(["ROLE_TYPE", "SHARES", "NOTE"], axis=1, inplace=True)

# Rename columns and remove unnecessary columns for the PERFORMER role
data_performer.rename(columns={"NAME": "PERFORMER"}, inplace=True)
data_performer.drop(["ROLE_TYPE", "SHARES", "NOTE"], axis=1, inplace=True)

# Merge all together
result = data_ascap.merge(data_writer, how="left")
result = result.merge(data_performer, how="left")

# Print result
print(result)
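An alternative sketch that uses a single pivot instead of three merges. It assumes each title has at most one W row and at most one P row, and uses hypothetical inline data rather than the real ASCAP file:

```python
import pandas as pd

# hypothetical sample mirroring the question's data layout
data = pd.DataFrame({
    "TITLE": ["SCORE MORE", "SCORE MORE", "SCORE MORE", "FEEDBACK", "FEEDBACK"],
    "ROLE_TYPE": ["ASCAP", "W", "P", "ASCAP", "W"],
    "NAME": ["Total Current ASCAP Share", "SMITH ANTONIO RENARD",
             "SMITH SHOW PUBLISHING", "Total Current ASCAP Share",
             "SMITH ANTONIO RENARD"],
    "SHARES": [100, None, None, 100, None],
})

# writer/performer names go wide via pivot; shares come from the ASCAP rows
names = (data[data.ROLE_TYPE.isin(["W", "P"])]
         .pivot(index="TITLE", columns="ROLE_TYPE", values="NAME")
         .rename(columns={"W": "WRITER", "P": "PERFORMER"}))
shares = data[data.ROLE_TYPE == "ASCAP"].set_index("TITLE")["SHARES"]
result = names.join(shares).reset_index()
print(result)
```

Titles missing a role (like FEEDBACK's performer above) simply come out as NaN in that column, which the merge-based approach produces as well.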

Python CSV search for specific values after reading into dictionary

I need a little help reading specific values into a dictionary using Python. I have a CSV file of user numbers: each user (1, 2, 3, ...) belongs to a specific department (1, 2, 3, ...), and each department is in a specific building (1, 2, 3, ...). I need to list all the users in department 1 of building 1, then department 2 of building 1, and so on. I have read everything into a massive dictionary using csv.DictReader, but I cannot figure out how to sort which entries go into each dictionary of dictionaries. The CSV has over 150,000 rows; each row is one user and lists three attributes: user_name, department number, and department building. There are 100 departments, 100 buildings, and 150,000 users. Any ideas for a short script to sort them all out? Thanks in advance for your help.
A brute-force approach would look like:
import csv

with open('myfile.csv') as f:
    data = list(csv.reader(f))
# sort by building, then department, then name
data.sort(key=lambda x: (x[2], x[1], x[0]))
It could then be extended to
import csv
import collections

data = collections.defaultdict(lambda: collections.defaultdict(list))
with open('myfile.csv') as f:
    for name, dept, building in csv.reader(f):
        data[building][dept].append(name)

for building in sorted(data):
    print("Building {0}".format(building))
    for dept in sorted(data[building]):
        print("  Dept {0}".format(dept))
        for name in sorted(data[building][dept]):
            print("   ", name)
