I am new to coding and want to create an individual dataframe from each Excel tab. Searching this forum got me most of the way (I found a sample that uses a dictionary), but I need one more step that I can't figure out.
This is the code I am using:
import pandas as pd
excel = 'sample.xlsx'
xls = pd.ExcelFile(excel)
d = {}
for sheet in xls.sheet_names:
    print(sheet)
    d[f'{sheet}'] = pd.read_excel(xls, sheet_name=sheet)
Let's say I have 3 Excel tabs called 'alpha', 'beta' and 'charlie'.
The code above gives me 3 dataframes, and I can access them by typing d['alpha'], d['beta'] and d['charlie'].
What I want is to rename the dataframes so that instead of typing (for example) d['alpha'], I just need to write alpha (without any extras).
Edit: The Excel file I want to parse has 50+ tabs, and it can grow.
Edit 2: Thank you all for the links and the answers! They are a great help.
Don't rename them.
I can think of two scenarios here:
1. The sheets are fundamentally different
When people ask how to dynamically assign to variable names, the usual (and best) answer is "Use a dictionary". Here's one example.
Indeed, this is the reason Pandas does it this way!
In this case, my opinion is that your best move here is to do nothing, and just use the dictionary you have.
2. The sheets are roughly the same
If the sheets are all basically the same, and only differ by one attribute (e.g. they represent monthly sales and the names of the sheets are 'May', 'June', etc), then your best move is to merge them somehow, adding a column to reflect the sheet name (month, in my example).
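As a rough sketch of that second scenario: the dictionary below is a made-up stand-in for the one built by the read_excel loop, and the sheet names and columns are assumptions for illustration only. pd.concat accepts a dictionary and uses its keys as an extra index level, which can then be turned into a regular column.

```python
import pandas as pd

# Hypothetical stand-in for the dictionary built by the read_excel loop;
# the sheet names and columns are invented for this example.
d = {
    "May": pd.DataFrame({"item": ["a", "b"], "sales": [10, 20]}),
    "June": pd.DataFrame({"item": ["a", "b"], "sales": [15, 25]}),
}

# Concatenating a dict uses its keys as an outer index level (named "month" here).
# The first reset_index turns that level into a column, the second discards
# the leftover per-sheet row numbers.
combined = (
    pd.concat(d, names=["month"])
    .reset_index(level="month")
    .reset_index(drop=True)
)
print(combined)
```

With real data, pd.read_excel(xls, sheet_name=None) returns a dictionary of all sheets in one call, so the explicit loop may not be needed at all.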
Whatever you do, don't use exec or eval, no matter what anyone tells you. They are not options for beginner programmers.
I think you are looking for the built-in exec function, which executes strings as code.
But I do not recommend using exec; it has been widely discussed why it shouldn't be used, or at least should be used cautiously.
I do not have your data, so this is untested, but I think it is achievable using the following code:
import pandas as pd
excel='sample.xlsx'
xls=pd.ExcelFile(excel)
for sheet in xls.sheet_names:
    print(sheet)
    # {sheet!r} inserts the quoted sheet name; this only works if the name is a valid identifier
    code_to_execute = f'{sheet} = pd.read_excel(xls,sheet_name={sheet!r})'
    exec(code_to_execute)
But again, I highlight that it is not the cleanest way to do that. Your approach is definitely cleaner, to be more precise, I would always use dicts for those kinds of assignments. See here for more about exec.
In general, you want to generate a string.
possible_string = 'a=10'
exec(possible_string)
print(a) # 10
You need to create variables which correspond to the three dataframes:
alpha, beta, charlie = d.values()
Edit:
Since you mentioned that the Excel file could have 50+ tabs and could grow, you may prefer to do it in your original loop. This can be done dynamically using exec:
import pandas as pd
excel = 'sample.xlsx'
xls = pd.ExcelFile(excel)
d = {}
for sheet in xls.sheet_names:
    print(sheet)
    exec(f'{sheet}' + " = pd.read_excel(xls, sheet_name=sheet)")
It might be better practice, however, to simply index your sheets and access them by index. A 50+ length collection of excel sheets is probably better organized by appending to a list and accessing by index:
d = []
for sheet in xls.sheet_names:
    print(sheet)
    d.append(pd.read_excel(xls, sheet_name=sheet))
#d[0] = alpha; d[1] = beta, and so on...
Related
My code is at the bottom.
The "parse_XML" function converts an XML file to a df; for example, "df=parse_XML("example.xml", lst_level2_tags)" works.
But I want to save the results to several dfs, so I want to have names like df_first_level_tag, etc.
when I run the bottom code, I get an error "f'df_{first_level_tag}'=parse_XML("example.xml", lst_level2_tags)
^
SyntaxError: can't assign to literal"
I also tried the .format method instead of an f-string, but that didn't work either.
There are at least 30 dfs to save and I don't want to do it one by one. I have always succeeded with f-strings in Python outside pandas, though.
Is the problem here the f-string/format method, or does my code have some other logic problem?
In case it helps, the parse_XML function is taken directly from this link
the function definition
for first_level_tag in first_level_tags:
    lst_level2_tags = []
    for subchild in root[0]:
        lst_level2_tags.append(subchild.tag)
    f'df_{first_level_tag}'=parse_XML("example.xml", lst_level2_tags)
This seems like a situation where you'd be best served by putting them into a dictionary:
dfs = {}
for first_level_tag in first_level_tags:
    lst_level2_tags = []
    for subchild in root[0]:
        lst_level2_tags.append(subchild.tag)
    dfs[first_level_tag] = parse_XML("example.xml", lst_level2_tags)
There's nothing structurally wrong with your f-string, but you generally can't get dynamic variable names in Python without doing ugly things. In general, storing the values in a dictionary ends up being a much cleaner solution when you want something like that.
One advantage of working with them this way is that you can then just iterate over the dictionary later on if you want to do something to each of them. For example, if you wanted to write each of them to disk as a CSV with a name matching the tag, you could do something like:
for key, df in dfs.items():
    df.to_csv(f'{key}.csv')
You can also just refer to them individually (so if there was a tag named a, you could refer to dfs['a'] to access it in your code later).
Good evening. I have an Excel file with zip codes and associated information. Those zip codes have a lot of duplicates. I'd like to figure out which zip codes I have by putting them all in a list without duplicates. This code works, but runs very slowly (it took over 100 seconds), and I was wondering what I could do to improve its efficiency.
I know that having to check the whole list for duplicates each time is contributing a lot to the inefficiency, but I'm not sure how to fix that. I also know that going through every row is probably not the best answer, but again I am pretty new and am now stuck.
Thanks in advance.
import sys
import xlrd
loc = ("locationOfFile")
wb = xlrd.open_workbook(loc)
sheet = wb.sheet_by_index(0)
def findUniqueZips():
    zipsInSheet = []
    for i in range(sheet.nrows):
        if str(sheet.cell(i,0).value) in zipsInSheet:
            pass
        else:
            zipsInSheet.append(str(sheet.cell(i,0).value))
    print(zipsInSheet)

findUniqueZips()
If you're looking to avoid duplicates, then you should definitely consider using sets in Python. See here
What I would do is create a set and simply add all your elements to it; note that a set is an unordered collection of unique items. Once all the data has been added, you can copy the elements of the set into your list. This avoids redundant data.
import sys
import xlrd
loc = ("locationOfFile")
wb = xlrd.open_workbook(loc)
sheet = wb.sheet_by_index(0)
def findUniqueZips():
    data = set()
    for i in range(sheet.nrows):
        # adding a value that is already in the set is a no-op
        data.add(str(sheet.cell(i,0).value))
    # now copy the unique elements from the set into your list
    zipsInSheet = list(data)
    print(zipsInSheet)

findUniqueZips()
I usually just convert it to a set. Sets are your friend. They are much faster than lists. Unless you intentionally need or want duplicates, use sets.
https://docs.python.org/3.7/tutorial/datastructures.html?highlight=intersection#sets
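To illustrate why this helps: checking "x in some_list" scans the list element by element, while a set lookup takes roughly constant time on average. The sketch below uses a made-up list in place of the sheet column (raw_zips is hypothetical) and also shows dict.fromkeys, which is one way to drop duplicates while keeping the first-seen order, something a plain set does not guarantee.

```python
# Hypothetical stand-in for the values read from the first column of the sheet.
raw_zips = ["90210", "10001", "90210", "60601", "10001"]

# A set removes duplicates, but its iteration order is not guaranteed.
unique_unordered = set(raw_zips)

# dict keys are unique and keep insertion order (Python 3.7+),
# so this removes duplicates while preserving first-seen order.
unique_in_order = list(dict.fromkeys(raw_zips))

print(unique_in_order)
```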
For a little background this is the csv file that I'm starting with. (the data is nonsensical and only used for proof of concept)
Jackson,Thompson,jackson.thompson#hotmail.com,test,
Luke,Wallace,luke.wallace#lycos.com,test,
David,Wright,david.wright#hotmail.com,test,
Nathaniel,Butler,nathaniel.butler#aol.com,test,
Eli,Simpson,noah.simpson#hotmail.com,test,
Eli,Mitchell,eli.mitchell#aol.com,,test2
Bob,Test,bob.test#aol.com,test,
What I am attempting to do with this csv, on a larger scale, is this: if the first value in a row is duplicated, I need to take the data from the second entry, append it to the row with the first instance of the value, and remove the duplicate row. For example, in the data above "Eli" appears twice. The first instance has "test" after the email value; the second instance has no value there and instead has a value in the next index over.
I would want it to go from this:
Eli,Simpson,noah.simpson#hotmail.com,test,
Eli,Mitchell,eli.mitchell#aol.com,,test2
To this:
Eli,Simpson,noah.simpson#hotmail.com,test,test2
I have been able to successfully import this csv into my code using what is below.
import csv
# raw string so the backslashes in the Windows path are not treated as escapes
f = open(r'C:\Projects\Python\Test.csv','r')
csv_f = csv.reader(f)
test_list = []
for row in csv_f:
    test_list.append(row[0])
print(test_list)
At this point I was able to import my csv, and put the first names into my list. I'm not sure how to compare the indexes to make the changes I'm looking for. I'm a python rookie so any help/guidance would be greatly appreciated.
If you want to use pandas, you could use the pandas .drop_duplicates() method. An example would look something like this.
import pandas as pd
data = pd.read_csv(r'C:\a file with addresses')
# with inplace=False the result is returned, so assign it back
data = data.drop_duplicates(subset=['thing_to_drop'], keep='first', inplace=False)
See the pandas documentation: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.drop_duplicates.html
I am kind of a newbie in Python as well, but I would suggest using csv.DictReader and treating the file as a collection of dictionaries, meaning every row is a dictionary.
This way you can iterate through the names easily.
Second, I would suggest keeping a list of the names you have already seen as you iterate through the file, so you can check whether a name is already known. For example:
name_list.append("eli")
Then check if "eli" in name_list:
and, if it is, add the new key and value to the first row with that name.
I don't know if this is best practice so don't roast me guys, but this is a simple and quick solution.
This will help you practice iterating through lists and dictionaries as well.
Here is a helpful link for reading about csv handling.
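Here is a minimal sketch of that idea using only the standard library. It uses plain csv.reader rather than DictReader because the sample file has no header row, and it assumes the rule is "keep the first row for each first name and fill its empty cells from later duplicates"; the helper name merge_duplicate_rows is made up for this example, and io.StringIO stands in for the real file.

```python
import csv
import io

# A few rows copied from the question; io.StringIO stands in for the real file.
SAMPLE = """\
Jackson,Thompson,jackson.thompson#hotmail.com,test,
Eli,Simpson,noah.simpson#hotmail.com,test,
Eli,Mitchell,eli.mitchell#aol.com,,test2
Bob,Test,bob.test#aol.com,test,
"""


def merge_duplicate_rows(rows):
    """Keep the first row per first name; fill its empty cells from later duplicates."""
    merged = {}  # first name -> row (dicts keep insertion order)
    for row in rows:
        key = row[0]
        if key not in merged:
            merged[key] = list(row)
            continue
        first = merged[key]
        for i, value in enumerate(row):
            if i >= len(first):
                first.append(value)      # duplicate row is longer: extend
            elif not first[i] and value:
                first[i] = value         # fill only cells that were empty
    return list(merged.values())


merged_rows = merge_duplicate_rows(csv.reader(io.StringIO(SAMPLE)))
for row in merged_rows:
    print(",".join(row))
```

Using a dictionary keyed on the first name avoids rescanning a list for every row, and non-empty cells in the first row are never overwritten.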
I'm trying to do an unbelievably simple thing: load parts of an Excel worksheet into a Numpy array. I've found a kludge that works, but it is embarrassingly unpythonic:
say my worksheet was loaded as "ws", the code:
A = np.zeros((37,3))
for i in range(2,39):
    for j in range(1,4):
        A[i-2,j-1]= ws.cell(row = i, column = j).value
loads the contents of "ws" into array A.
There MUST be a more elegant way to do this. For instance, csvread allows to do this much more naturally, and while I could well convert the .xlsx file into a csv one, the whole purpose of working with openpyxl was to avoid that conversion. So there we are, Collective Wisdom of the Mighty Intertubes: what's a more pythonic way to perform this conceptually trivial operation?
Thank you in advance for your answers.
PS: I operate Python 2.7.5 on a Mac via Spyder, and yes, I did read the openpyxl tutorial, which is the only reason I got this far.
You could do
A = np.array([[i.value for i in j] for j in ws['C1':'E38']])
EDIT - further explanation.
(Firstly, thanks for introducing me to openpyxl; I suspect I will use it quite a bit from time to time.)
The method of getting multiple cells from the worksheet object produces a generator. This is probably much more efficient if you want to work your way through a large sheet, as you can start straight away without waiting for it all to load into your list.
To force a generator to make a list, you can either use list(ws['C1':'E38']) or a list comprehension as above.
Each row is a tuple (even if only one column wide) of Cell objects. These carry a lot more than just a number, but if you want the number for your array you can use the .value attribute. This is really the crux of your question: csv files don't contain the structured info of an Excel spreadsheet.
There isn't (as far as I can tell) a built-in method for extracting values from a range of cells, so you will have to do something effectively like what you have sketched out.
The advantages of doing it my way are: no need to work out the dimensions of the array and make an empty one to start with, no need to work out the corrected index numbers of the np array, and list comprehensions are faster. The disadvantage is that it needs the "corners" defined in "A1" format. If the range isn't known, then you would have to use iter_rows, rows or columns:
A = np.array([[i.value for i in j[2:5]] for j in ws.rows])
If you don't know how many columns there are, then you will have to loop and check values, more like your original idea.
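For readers on a newer openpyxl than the one in the question, iter_rows with values_only=True may be a tidier route. The sketch below is written under that assumption (the values_only argument arrived in later openpyxl releases and is not available on the asker's Python 2.7 setup). It builds a small in-memory workbook so it runs without a file; the header names and numbers are made up.

```python
import numpy as np
from openpyxl import Workbook

# Build a small in-memory worksheet so the sketch runs without a file;
# in the question, ws would come from an existing workbook instead.
wb = Workbook()
ws = wb.active
ws.append(["x", "y", "z"])        # header in row 1 (made-up names)
for r in range(37):               # data in rows 2..38
    ws.append([r, r * 2.0, r * 3.0])

# values_only=True yields plain tuples of cell values instead of Cell objects.
rows = ws.iter_rows(min_row=2, max_row=38, min_col=1, max_col=3, values_only=True)
A = np.array(list(rows), dtype=float)
print(A.shape)
```

The row and column bounds mirror the loop in the question (rows 2 to 38, columns 1 to 3), so no index arithmetic is needed.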
If you don't need to load data from multiple files in an automated manner, the package tableconvert I recently wrote may help. Just copy and paste the relevant cells from the excel file into a multiline string and use the convert() function.
import numpy as np
from tableconvert.converter import convert
array = convert("""
123 456 3.14159
SOMETEXT 2,71828 0
""")
print(type(array))
print(array)
Output:
<class 'numpy.ndarray'>
[[ 123. 456. 3.14159]
[ nan 2.71828 0. ]]
I can set up an autofilter using pyWin32, but I wondered if it's possible to set a default filter and what the syntax would be.
For example, I'd like to set a filter on a year column and set the default for the current year.
from win32com.client import Dispatch
xl = Dispatch("Excel.Application")
xl.Workbooks.Open(file_path)
xl.ActiveWorkbook.Worksheets(sheetname).Range("A2:A6").AutoFilter(1)
xl.ActiveWorkbook.Close(SaveChanges=1)
I've looked on the web for documentation on pywin32, and also on Microsoft's site, but can't work out how to translate the MS syntax to pywin32:
Range("A2:A6").AutoFilter Field:=1, Criteria1:=rng.Value
I bumped into the same problem and after a bit of experimentation, I found that it was possible to set a range on the Columns attribute. Since I wanted to autofilter on columns A thru I, I set the criteria as follows:
xl.ActiveWorkbook.ActiveSheet.Columns("A:I").AutoFilter(1)
This worked for me. I'm assuming that you want to filter on Columns B thru F since AutoFilter is enabled only for columns. Perhaps the following criteria will work for you:
xl.ActiveWorkbook.ActiveSheet.Columns("B:F").AutoFilter(1)
Alok
The rather cryptic documentation is available at: http://msdn.microsoft.com/en-us/library/office/bb242013(v=office.12).aspx.
Each of the Excel VBA parameters translates to a function argument in pywin32. For example, if you want to show only the rows for the year "2012" (filtering out every year that isn't equal to it), you would do this by specifying the Criteria1 parameter as follows:
MyYearRange.AutoFilter(Field=1, Criteria1="2012")
I'm adding an answer here for future readers who want a different but similar solution that is a lot simpler. You will need to install xlwings and have pywin32. With xlwings, you can access the pywin32 API functions, which gives you a lot of flexibility on top of its own functions.
import xlwings
#Puts the Excel window into focus or opens it up. It even works on csv files.
wb = xlwings.Book('C:\\Users\\yourusername\\Desktop\\Excel.xlsx')
#Choose the sheet you want to focus on
datasht = wb.sheets['Sheet1']
#Pay attention to where the .api part goes. It matters if you are trying to achieve something specific. Also make sure that you follow the case-sensitive spelling of 'Range' and 'AutoFilter'.
datasht.api.Range('A1:J10').AutoFilter(3,'SomeFilterValue')
Unfortunately, I'm not sure how to pass the rest of the arguments by name. You pretty much just have to figure out how to translate the arguments into Python. I did get it to work, but I'm unsure whether you would run into any issues. Here is one that would work:
datasht.api.Range('A1:J10').AutoFilter(3,'filtervalue1',2,'filtervalue1',1)
Read the 2nd link specifically if you need to call on the Operator Parameter:
https://msdn.microsoft.com/en-us/vba/excel-vba/articles/range-autofilter-method-excel
https://msdn.microsoft.com/en-us/vba/excel-vba/articles/xlautofilteroperator-enumeration-excel
If you need to select multiple filter values in the same column:
ws.Columns('ColumnLetter:ColumnLetter').AutoFilter(column_number, value_list, 7)
From https://learn.microsoft.com/en-us/office/vba/api/excel.xlautofilteroperator:
xlFilterValues | 7 | Filter values
This works:
import win32com.client as win
Excel = win.Dispatch("Excel.Application")
Excel.visible = True
wb = Excel.Workbooks.open('path to xlsx')
ws = wb.Worksheets(1)
#use Range("A:A") for autofilter
ws.Columns("A:I").AutoFilter(1,"criteria string")
This applies AutoFilter to column A with Criteria1 set to "criteria string".