complete newbie here.
I want to read data from two columns in several Excel sheets into a nested dictionary. In the end, I'd like to have a dictionary looking like this:
{SheetName1: {Index1: Value1, Index2: Value2, ...}, SheetName2: {Index1: Value1, Index2: Value2, ...}, ...}
So far I read in the data using pandas and figured out how to combine the two columns I need into the inner dictionary {Index: Value}, which afterwards gets the name of the sheet assigned as its key to form the outer dictionary:
#read excel sheet into dataframe
df = ExcelWorkbook.parse(sheet_name = None, header= 1, usecols= 16, skiprows= 6)
#read in the different excel sheet names in a List
SHEETNAMES = []
SHEETNAMES = ExcelWorkbook.sheet_names
#nested dictionary
for Sheet in SHEETNAMES:
    df[Sheet] = df[Sheet].loc[0:87, :]
    dic = dict(zip(df[Sheet].index, df[Sheet]['ColumnName']))
    dic = {Sheet: dic}
Now when I run this, it only returns the last sheet with its corresponding {Index: Value} pairs:
{'LastSheetName': {Key1: Value1, Key2: Value2, ...}}
Now it seems to me that I've done the "harder" part, but I can't figure out how to fill a new dictionary with the dictionaries generated by this loop.
Any help is greatly appreciated!
Best regards,
Jan
You are reassigning dic on every iteration of your for loop, so only the last sheet survives. Instead, instantiate dic as an empty list [] outside of the loop and then append the dictionaries you build inside the loop to it, such as:
#read excel sheet into dataframe
df = ExcelWorkbook.parse(sheet_name = None, header= 1, usecols= 16, skiprows= 6)
#nested dictionary
dic = []
for Sheet in ExcelWorkbook.sheet_names:
    df[Sheet] = df[Sheet].iloc[0:87, :]
    out = {Sheet: dict(zip(df[Sheet].index, df[Sheet]['ColumnName']))}
    dic.append(out)
Also, you want to use .iloc in place of .loc, since you are specifying integer index locations inside the dataframe.
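For what it's worth, the whole nested structure can also be built in a single dict comprehension. A sketch with a made-up stand-in for the dict of DataFrames that parse(sheet_name=None, ...) returns (the column name 'ColumnName' is from the question):

```python
import pandas as pd

# stand-in for the dict of DataFrames returned by
# ExcelWorkbook.parse(sheet_name=None, ...)
df = {
    'Sheet1': pd.DataFrame({'ColumnName': ['a', 'b']}),
    'Sheet2': pd.DataFrame({'ColumnName': ['c']}),
}

# {sheet: {index: value, ...}, ...} in one pass
dic = {
    sheet: dict(zip(frame.index, frame['ColumnName']))
    for sheet, frame in df.items()
}
print(dic)
```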
I just figured it out after tweaking @rahlf23's response a bit. So for anyone looking this up:
dic.append() does not work for dictionaries, so I used dic.update() instead:
#nested dictionary
dic1 = {}
for Sheet in SHEETNAMES:
    df[Sheet] = df[Sheet].iloc[0:87, :]
    out = dict(zip(df[Sheet].index, df[Sheet]['ColumnName']))
    out2 = {Sheet: out}
    dic1.update(out2)
Now one can access the values with:
print(dic1[SheetName][Index])
Thanks for your help @rahlf23, without your comment I'd still be trapped in the loop :)
Related
I want to create a nested dictionary from a CSV file (see pic), keeping the keys the same, e.g.
{'name': 'john', 'sname': 'doe', 'address': '120 Jefferson st'},
{'name': 'jack', 'sname': 'McGinnis', 'address': '202 hobo'}
i.e. all the row data in one dictionary per row, with the column names as keys.
stuck here
I understand you want a list of dictionaries.
import pandas as pd
data = pd.read_csv("dataset.csv")
dict_list = []
for i in range(len(data)):
    row_dict = {}  # avoid naming this "dict", which shadows the built-in
    for col in data.columns:
        row_dict[col] = data[col].iloc[i]
    dict_list.append(row_dict)
print(dict_list)
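As a side note, pandas can build this list of row dictionaries directly with DataFrame.to_dict('records'), which replaces the nested loop. A sketch with a small inline frame standing in for pd.read_csv("dataset.csv"):

```python
import pandas as pd

# inline stand-in for pd.read_csv("dataset.csv")
data = pd.DataFrame({
    'name': ['john', 'jack'],
    'sname': ['doe', 'McGinnis'],
})

dict_list = data.to_dict('records')  # one dict per row, keyed by column name
print(dict_list)
```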
I have a directory containing several excel files. I want to create a DataFrame with a list of the filenames, a count of the number of rows in each file, and a min and max column.
Example file 1:
Example file 2:
Desired result:
This is as far as I've gotten:
fileslist = os.listdir(folder)
for file in fileslist:
    str = file
    if not str.startswith('~$'):
        df = pd.read_excel(os.path.join(folder, file), header=0, sheet_name='Main', usecols=['Name', 'Number'])
        NumMax = max(df['Number'])
        NumMin = min(df['Number'])
        NameCount = df['Name'].count()
From here, I can't figure out how to create the final DataFrame as shown in the above "Desired Result." I'm very new at this and would appreciate any nudge in the right direction.
You're using str wrong. It is a built-in type in Python, and you don't need it at all here; you just mean to write file.startswith. Now, to store the data, append to a list at each iteration. You can use dictionaries to build each row:
import os
import pandas as pd

fileslist = os.listdir(folder)
data = []  # store the intermediate data in the loop
for file in fileslist:
    # no need to assign file to str
    if not file.startswith('~$'):
        df = pd.read_excel(os.path.join(folder, file), header=0,
                           sheet_name='Main', usecols=['Name', 'Number'])
        NumMax = max(df['Number'])
        NumMin = min(df['Number'])
        NameCount = df['Name'].count()
        data.append(
            {  # the dict keys will become pandas column names
                'Filename': file,  # you probably want to remove the extension here
                'Count': NameCount,
                'MinNumber': NumMin,
                'MaxNumber': NumMax
            })
df = pd.DataFrame(data)
From here, you just need to write the data frame to your excel file.
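Those last two steps might look like this; the rows below are made up, and the output filename summary.xlsx is arbitrary (writing it requires an Excel engine such as openpyxl installed):

```python
import pandas as pd

# made-up rows shaped like the dicts appended in the loop above
data = [
    {'Filename': 'file1.xlsx', 'Count': 3, 'MinNumber': 1, 'MaxNumber': 9},
    {'Filename': 'file2.xlsx', 'Count': 2, 'MinNumber': 4, 'MaxNumber': 7},
]

df = pd.DataFrame(data)  # dict keys become the column names
print(df)
# df.to_excel('summary.xlsx', index=False)  # the final write, left commented here
```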
First of all, I would just like to point out that you shouldn't name any variable "str" as you did here:
str = file
This redefines the built-in str type, which will cause issues if you ever try to convert some object to a string using str(object). The assignment itself is also unnecessary, so you can just take it out. Something similar applies to "file", which was a built-in name in Python 2; a name like "file_name" would be better.
As for how to create the final dataframe, it is somewhat simple. I would recommend you use a list and dictionaries and add all the data to that, then create the dataframe. Like this:
fileslist = os.listdir(folder)
# temporary list to store data
data = []
for file_name in fileslist:
    if not file_name.startswith('~$'):
        df = pd.read_excel(os.path.join(folder, file_name), header=0, sheet_name='Main', usecols=['Name', 'Number'])
        NumMax = max(df['Number'])
        NumMin = min(df['Number'])
        NameCount = df['Name'].count()
        # appending row of data with appropriate column names
        data.append({'Filename': file_name, 'Count': NameCount, 'MinNumber': NumMin, 'MaxNumber': NumMax})
# creating actual dataframe
df = pd.DataFrame(data)
I am using openpyxl to read a column (A) from an excel spreadsheet. I then iterate through a dictionary to find the matching information and then I want to write this data back to column (C) of the same Excel spreadsheet.
I have tried to figure out how to append data back to the corresponding row but without luck.
CODE
from openpyxl import load_workbook
my_dict = {
'Agriculture': 'ET_SS_Agriculture',
'Dance': 'ET_FA_Dance',
'Music': 'ET_FA_Music'
}
wb = load_workbook("/Users/administrator/Downloads/Book2.xlsx") # Work Book
ws = wb['Sheet1'] # Work Sheet
column = ws['A'] # Column
write_column = ws['C']
column_list = [column[x].value for x in range(len(column))]
for k, v in my_dict.items():
    for l in column_list:
        if k in l:
            print(f'The dict for {l} is {v}')
            # append v to row of cell index of column_list
So, if my excel spreadsheet looks like this:
I would like Column C to look like this after I have matched the data dictionary.
In order to do this with your method, you need the index (i.e. the row) to assign the values to column C; you can get it from enumerate when looping over column_list:
for i, l in enumerate(column_list):
    if k in l:
        print(f'The dict for {l} is {v}')
        # append v to row of cell index of column_list
        write_column[i].value = v
After writing all the values, you will need to run
wb.save("/Users/administrator/Downloads/Book2.xlsx")
to save your changes.
That said, you do a lot of unnecessary iteration over the spreadsheet data, and make things harder for yourself by working in columns rather than rows. You already have a dict keyed by the values in column A, so you can do direct lookups using split.
You are adding to each row, so it makes sense to loop over rows instead, in my opinion.
my_dict = {
'Agriculture': 'ET_SS_Agriculture',
'Dance': 'ET_FA_Dance',
'Music': 'ET_FA_Music'
}
wb = load_workbook("/Users/administrator/Downloads/Book2.xlsx") # Work Book
ws = wb['Sheet1'] # Work Sheet
for row in ws:
    try:
        # row[0] = column A
        v = my_dict[row[0].value.split("-")[0]]  # get the value from the dict using column A
    except KeyError:
        # leave rows which aren't in my_dict alone
        continue
    # row[2] = column C
    row[2].value = v
wb.save("/Users/administrator/Downloads/Book2.xlsx")
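A self-contained sketch of this row-wise approach, using an in-memory workbook instead of Book2.xlsx (the cell values here are invented, assuming column A holds strings like 'Agriculture-101'):

```python
from openpyxl import Workbook

my_dict = {
    'Agriculture': 'ET_SS_Agriculture',
    'Music': 'ET_FA_Music',
}

wb = Workbook()
ws = wb.active
ws.append(['Agriculture-101'])
ws.append(['Painting-900'])  # no match in my_dict: left alone
ws.append(['Music-202'])

# iterate rows, padding out to column C so row[2] always exists
for row in ws.iter_rows(min_col=1, max_col=3):
    try:
        v = my_dict[row[0].value.split('-')[0]]
    except KeyError:
        continue
    row[2].value = v  # column C

print([ws.cell(row=r, column=3).value for r in range(1, 4)])
```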
I have a CSV file with 100K+ lines of data in this format:
"{'foo':'bar' , 'foo1':'bar1', 'foo3':'bar3'}"
"{'foo':'bar' , 'foo1':'bar1', 'foo4':'bar4'}"
The quotes are there before the curly braces because my data came in a CSV file.
I want to extract the key value pairs in all the lines to create a dataframe like so:
Column Headers: foo, foo1, foo3, foo...
Rows: bar, bar1, bar3, bar...
I've tried implementing something similar to what's explained here ( Python: error parsing strings from text file with Ast module).
I've gotten the ast.literal_eval function to work on my file to convert the contents into a dict but now how do I get the DataFrame function to work? I am very much a beginner so any help would be appreciated.
import pandas as pd
import ast
with open('file_name.csv') as f:
    for string in f:
        parsed = ast.literal_eval(string.rstrip())
        print(parsed)
pd.DataFrame(???)
You can turn a dictionary into a pandas dataframe using pd.DataFrame.from_dict, but it will expect each value in the dictionary to be in a list.
for key, value in parsed.items():
    parsed[key] = [value]
df = pd.DataFrame.from_dict(parsed)
You can build this iteratively by appending to your dataframe; note that DataFrame.append returns a new frame rather than modifying in place (and has since been removed in pandas 2.0 in favour of pd.concat):
df = pd.DataFrame()
for string in f:
    parsed = ast.literal_eval(string.rstrip())
    for key, value in parsed.items():
        parsed[key] = [value]
    df = df.append(pd.DataFrame.from_dict(parsed))
parsed is a dictionary; make a dataframe from each one, then join all the frames together:
df = []
with open('file_name.csv') as f:
    for string in f:
        parsed = ast.literal_eval(string.rstrip())
        if not isinstance(parsed, dict):
            continue
        subDF = pd.DataFrame(parsed, index=[0])
        df.append(subDF)
df = pd.concat(df, ignore_index=True, sort=False)
Calling pd.concat on a list of dataframes is faster than calling DataFrame.append repeatedly. sort=False means that pd.concat will not sort the column names when it encounters a new one, like foo4 on the second row.
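Run end-to-end on two in-memory lines (io.StringIO standing in for the real file), the pattern looks like this:

```python
import ast
import io

import pandas as pd

# stand-in for open('file_name.csv')
f = io.StringIO("{'foo':'bar', 'foo1':'bar1'}\n{'foo':'bar', 'foo4':'bar4'}\n")

frames = []
for string in f:
    parsed = ast.literal_eval(string.rstrip())
    if not isinstance(parsed, dict):
        continue
    frames.append(pd.DataFrame(parsed, index=[0]))

df = pd.concat(frames, ignore_index=True, sort=False)
print(df)  # foo4 is NaN on row 0, foo1 is NaN on row 1
```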
I have an Excel workbook with 8 sheets in it. They all follow the same column header structure. The only difference is, the first sheet starts at row 1, but the rest of the sheets start at row 4.
I am trying to run a command like this, but this is giving me the wrong data - and I recognize that because I wrote sheet_name=None this will give me issues as the sheets start at different rows:
df = pd.concat(pd.read_excel(xlsfile, sheet_name=None, skiprows=4), sort=True)
My next attempt was to:
frames = []
df = pd.read_excel(xlsfile, sheet_name='Questionnaire')
for sheet in TREND_SHEETS:
    tmp = pd.read_excel(xlsfile, sheet_name=sheet, skiprows=4)
    # append tmp dynamically to frames, then use concat frames at the end.. ugly
    df.append(tmp, sort=False)
return df
Note, Questionnaire is the first sheet in the Excel workbook. I know the logic here is off, and I do not want to create dynamic variables holding the 'tmp', appending it to a list, and then concatenating the frames.
How can I go about solving this, so that I achieve a dataframe which incorporates all the sheet data?
Consider a list comprehension to build a list of data frames for concatenating once outside the loop. To borrow @Jenobi's dictionary approach:
sheets = {'sheet1': 1, 'sheet2': 4, 'sheet3': 4, 'sheet4': 4}
df_list = [pd.read_excel(xlsfile, sheet_name=k, skiprows=v)
           for k, v in sheets.items()]
final_df = pd.concat(df_list, ignore_index=True)
What I would do is have a config, like a Python dictionary with the sheet names as keys and the number of rows to skip as values.
Thanks to @parfait for providing a better solution: it is best to concatenate outside of the for loop, as it's more memory efficient. What you can do is append the dfs to a list within the for loop, then concatenate outside:
import pandas as pd
sheets = {
'Sheet1': 1,
'Sheet2': 4,
'Sheet3': 4,
'Sheet4': 4
}
list_df = list()
for k, v in sheets.items():
    tmp = pd.read_excel(xlsfile, sheet_name=k, skiprows=v)
    list_df.append(tmp)
final_df = pd.concat(list_df, ignore_index=True)