I have a question about how to perform the equivalent of an Excel INDEX MATCH lookup in Python.
As an Excel user performing data analytics and manipulation on large datasets, I have moved to Python for efficiency. What I am attempting to do is populate the column cells within a pandas DataFrame based on the value returned from a lookup in a dictionary.
In an attempt to do this I have used the following code:
import pandas as pd

# imported csv DataFrames
crew_data = pd.read_csv(r'C:\file_path\crew_data.csv')
export_template = pd.read_csv(r'C:\file_path\export_template.csv')

# contract number dictionary
contract = {'Northern': '046-2019',
            'Southern': '048-2015D'}

# function that attempts to perform an INDEX MATCH equivalent
def contract_num():
    for x, y in enumerate(crew_data.loc[:, 'Region']):
        if y in contract.keys():
            num = contract[y]
        else:
            print('ERROR')
    return num

# for loop which prepares then exports the load data
for i, r in enumerate(export_template):
    export_template.loc[:, 'Contract'] = contract_num()

export_template.to_csv(r'C:\file_path\export_files\UPLOADER.csv')
print(export_template)
To summarise, what the code is intended to do is as follows:
The for loop contained in the contract_num function begins by iterating over the Region column in the crew_data DataFrame.
If the value y from the DataFrame matches a key in the contract dictionary (note: the Region column only contains two values, 'Southern' and 'Northern'), it returns the corresponding value from the contract dictionary.
The for loop which prepares then exports the load data calls the contract_num() function to populate the Contract column in the export_template DataFrame.
Please note that there are 116 additional columns which are populated in this loop which have been excluded from the code above to save space.
When the code is executed it runs and produces output; however, the issue is that when the function is called in the second for loop it only ever returns the single value 048-2015D, instead of the value that corresponds to each row's Region.
As mentioned previously, this would typically have been carried out in Excel using INDEX MATCH; however, doing so is not as efficient as using a script such as the one above.
Being a beginner, I suspect the example code may appear convoluted and unnecessary, and could be replaced with a more concise method.
If anyone could provide a solution or guidance that would be greatly appreciated.
df = pd.DataFrame({'Region': ['Northern', 'Northern', 'Northern',
                              'Northern', 'Southern', 'Southern',
                              'Northern', 'Eastern']})

contract = {'Northern': '046-2019',
            'Southern': '048-2015D'}
# similar to INDEX MATCH
df['Contract'] = df.Region.map(contract)
out:
Region Contract
0 Northern 046-2019
1 Northern 046-2019
2 Northern 046-2019
3 Northern 046-2019
4 Southern 048-2015D
5 Southern 048-2015D
6 Northern 046-2019
7 Eastern NaN
You can add a print if Contract has not been matched:
if df.Contract.isna().any():
    print("ERROR")
or make an assertion:
assert not df.Contract.isna().any(), "found empty contract field"
and the output in this case:
AssertionError: found empty contract field
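Applied back to the original DataFrames, the same idea would look roughly like this (a sketch, assuming crew_data and export_template have the same number of rows and a default RangeIndex so the rows line up):
# map each row's Region straight to its contract number;
# regions not found in the dictionary become NaN instead of printing ERROR
export_template['Contract'] = crew_data['Region'].map(contract)

# optional: flag any unmapped regions
if export_template['Contract'].isna().any():
    print('ERROR')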
I am trying to assign a proportion value to a column in a specific row inside my df. Each row represents a unique product's sales in a specific month, in a dataframe (called testingAgain) like this:
Month ProductID(SKU) Family Sales ProporcionVenta
1 1234 FISH 10000.0 0.0
This row represents product 1234's sales during January. (It is an aggregate, so it represents every January in the DB)
Now I am trying to find the proportion of sales of that unique product-month relative to the sum of sales of its family-month. For example, if the family FISH sold 100,000 in month 1, this specific case would be calculated as 10,000/100,000 (product-month sales / family-month sales).
I am trying to do so like this:
for family in uniqueFamilies:
    for month in months:
        salesFamilyMonth = testingAgain[(testingAgain['Family']==family)&(testingAgain['Month']==month)]['Qty'].sum()
        for sku in uniqueSKU:
            salesSKUMonth = testingAgain[(testingAgain['Family']==family)&(testingAgain['Month']==month)&(testingAgain['SKU']==sku)]['Qty'].sum()
            proporcion = salesSKUMonth/salesFamilyMonth
            testingAgain[(testingAgain['SKU']==sku)&(testingAgain['Family']==familia)&(testingAgain['Month']==month)]['ProporcionVenta'] = proporcion
The code works, it runs, and I have even individually printed the proportions and calculated them in Excel, and they are correct; the problem is with the last line. As soon as the code finishes running, I print testingAgain and see all proportions listed as 0.0, even though they should have been assigned the new values.
I'm not completely convinced about my approach, but I think it is decent.
Any ideas on how to solve this problem?
Thanks, appreciate it.
Generally, in pandas (and even NumPy), unlike general-purpose Python, analysts should avoid for loops, as there are many vectorized options for running conditional or grouped calculations. In your case, consider groupby().transform(), which returns inline aggregates (i.e., aggregate values without collapsing rows) or,
as the docs indicate, values broadcast to match the shape of the input array.
Currently, your code attempts to assign a value to a subsetted slice of a DataFrame column, which should raise a SettingWithCopyWarning. Such an operation does not affect the original DataFrame. Your loop can instead use .loc for conditional assignment:
testingAgain.loc[(testingAgain['SKU']==sku) &
                 (testingAgain['Family']==familia) &
                 (testingAgain['Month']==month), 'ProporcionVenta'] = proporcion
However, avoid looping, since transform works nicely for assigning new DataFrame columns. Also, div below is the Series division method (functionally equivalent to the / operator).
testingAgain['ProporcionVenta'] = (testingAgain.groupby(['SKU', 'Family', 'Month'])['Qty'].transform('sum')
                                               .div(testingAgain.groupby(['Family', 'Month'])['Qty'].transform('sum'))
                                  )
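To see what the transform division produces, here is a small made-up frame (the numbers and column values are illustrative only, not from the question):
import pandas as pd

# hypothetical data: two SKUs of the FISH family over two months
demo = pd.DataFrame({
    'SKU':    [1234, 5678, 1234, 5678],
    'Family': ['FISH', 'FISH', 'FISH', 'FISH'],
    'Month':  [1, 1, 2, 2],
    'Qty':    [10000.0, 90000.0, 5000.0, 15000.0],
})

demo['ProporcionVenta'] = (demo.groupby(['SKU', 'Family', 'Month'])['Qty'].transform('sum')
                               .div(demo.groupby(['Family', 'Month'])['Qty'].transform('sum')))

print(demo)
# SKU 1234 in month 1 -> 10000 / 100000 = 0.1, matching the hand calculation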
I need to compare two DataFrames at a time to find out whether the values match. One DataFrame is from an Excel workbook and the other is from a SQL query. The problem is that not only might the columns be out of sequence, but the column headers might have different names as well. This prevents me from simply getting the Excel column headers and using those to rearrange the columns in the SQL DataFrame. In addition, I will be doing this across several tabs in an Excel workbook and against different queries. Not only do the column names differ from Excel to SQL, they may also differ from Excel to Excel and SQL to SQL.
I did create a solution, but not only is it very choppy, but I'm concerned it will begin to take up a considerable amount of memory to run.
The solution entails using lists in a list. If the excel value is in the same list as the SQL value they are considered a match and the function will return the final order that the SQL DataFrame must change to in order to match the same order that the Excel DataFrame is using. In case I missed some possibilities and the newly created order list has a different length than what is needed, I simply return the original SQL list of headers in the original order.
The example below is barely a fraction of what I will actually be working with. The actual number of variations and column names is much higher than in the example below. Any suggestions anyone has on how to improve this function, or a better solution to this problem, would be appreciated.
Here is an example:
# Example data
exceltab1 = {'ColA': [1, 2, 3],
             'ColB': [3, 4, 1],
             'ColC': [4, 1, 2]}

exceltab2 = {'cColumn': [10, 15, 17],
             'aColumn': [5, 7, 8],
             'bColumn': [9, 8, 7]}

sqltab1 = {'Col/A': [1, 2, 3],
           'Col/C': [4, 1, 2],
           'Col/B': [3, 4, 1]}

sqltab2 = {'col_banana': [9, 8, 7],
           'col_apple': [5, 7, 8],
           'col_carrot': [10, 15, 17]}

# Code
import pandas as pd

ec1 = pd.DataFrame(exceltab1)
ec2 = pd.DataFrame(exceltab2)
sq1 = pd.DataFrame(sqltab1)
sq2 = pd.DataFrame(sqltab2)

# This will fail because the columns are out of order
result1 = (ec1.values == sq1.values).all()

def translate(excel_headers, sql_headers):
    translator = [["ColA", "aColumn", "Col/A", "col_apple"],
                  ["ColB", "bColumn", "Col/B", "col_banana"],
                  ["ColC", "cColumn", "Col/C", "col_carrot"]]
    order = []
    for i in range(len(excel_headers)):
        for list in translator:
            for item in sql_headers:
                if excel_headers[i] in list and item in list:
                    order.append(item)
                    break
    if len(order) != len(sql_headers):
        return sql_headers
    else:
        return order

sq1 = sq1[translate(list(ec1.columns), list(sq1.columns))]

# This will pass because the columns now line up
result2 = (ec1.values == sq1.values).all()
print(f"Result 1: {result1} , Result 2: {result2}")
Result:
Result 1: False , Result 2: True
No code, but an algorithm.
We have a set of columns A and another B. We can compare a column from A and another from B and see if they're equal. We do that for all combinations of columns.
This can be seen as a bipartite graph where there are two groups of vertices A and B (one vertex for each column), and an edge exists between two vertices if those two columns are equal. Then the problem of translating column names is equivalent to finding a perfect matching in this bipartite graph.
An algorithm to do this is Hopcroft-Karp, which has a Python implementation here. That finds maximum matchings, so you still have to check whether it found a perfect matching (that is, whether each column from A has an associated column from B).
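A rough sketch of that idea, using networkx's Hopcroft-Karp implementation rather than the one linked above (the use of networkx and of Series.equals for column comparison are my assumptions):
import networkx as nx
from networkx.algorithms import bipartite

def match_columns(excel_df, sql_df):
    # build a bipartite graph: one vertex per column, an edge wherever two columns hold equal values
    g = nx.Graph()
    excel_nodes = [('excel', c) for c in excel_df.columns]
    g.add_nodes_from(excel_nodes, bipartite=0)
    g.add_nodes_from([('sql', c) for c in sql_df.columns], bipartite=1)
    for ec in excel_df.columns:
        for sc in sql_df.columns:
            # equals requires matching dtypes; reset the index so only the values are compared
            if excel_df[ec].reset_index(drop=True).equals(sql_df[sc].reset_index(drop=True)):
                g.add_edge(('excel', ec), ('sql', sc))

    matching = bipartite.hopcroft_karp_matching(g, top_nodes=excel_nodes)
    mapping = {}
    for ec in excel_df.columns:
        partner = matching.get(('excel', ec))
        mapping[ec] = partner[1] if partner else None  # None means this Excel column found no equal SQL column
    return mapping

# usage with the example frames above; reorder the SQL columns into the Excel order
mapping = match_columns(ec1, sq1)
sq1 = sq1[[mapping[c] for c in ec1.columns]]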
New to Python, trying to take a csv and get the country that has the max number of gold medals. I can get the country name as an Index type but need a string value for the submission.
The csv has countries as the row index, and columns with stats.
ind = DataFrame.index.get_loc(index_result) doesn't work because it doesn't have a valid key.
If I run dataframe.loc[ind], it returns the entire row.
df = read_csv('csv', index_col=0,skiprows=1)
for loop to get the most gold medals:
mostMedals = iterator
getIndex = df[df['medals'] == mostMedals].index #check the column medals
#for mostMedals cell to see what country won that many
ind = df.index.get_loc(getIndex) #doesn't like the key
What I'm going for is to get the integer index position of getIndex so I can run something like df.index[ind] and that will give me the string I need, but I can't figure out how to get that integer position.
Expanding on my comments above, this is how I would approach it. There may be better/other ways, pandas is a pretty enormous library with lots of neat functionality that I don't know yet, either!
df = read_csv('csv', index_col=0,skiprows=1)
max_medals = df['medals'].max()
countries = list(df.where(df['medals'] == max_medals).dropna().index)
Unpacking that expression, the where method returns a frame based on df that matches the condition expressed. dropna() tells us to remove any rows that are NaN values, and index returns the remaining row index. Finally, I wrap that all in list, which isn't strictly necessary but I prefer working with simple built-in types unless I have a greater need.
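As a side note, if only a single top country is needed and the medals column is numeric, idxmax returns the index label directly (here, the country name as a string); this is an alternative I'm adding, not part of the approach above:
top_country = df['medals'].idxmax()  # index label of the first row holding the maximum value
print(top_country)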
import xlrd

def data_search(security, statistic):
    """takes in specific asset and desired statistic (e.g. 'Gross ER', 'Asset Class', etc),
    returns the answer from the CME or Equity Ranking Template spreadsheet"""
    securities = xlrd.open_workbook('Desktop\\CME.xlsx').sheet_by_index(1)
    securities_array = securities.col_values(1, start_rowx=1, end_rowx=securities.nrows)
    desired_security_row = securities_array.index(security) + 1
    statistics_array = securities.row_values(0, start_colx=0, end_colx=securities.ncols)
    statistic_index = statistics_array.index(statistic)
    return securities.row(desired_security_row)[statistic_index].value
I'm using Python's xlrd to read an Excel spreadsheet to gather information and form a list of financial data for different assets. The "CME" spreadsheet referenced in the code above has about 500 different financial assets with their ticker/symbol (e.g. 'AAPL') running down column B. The x-axis has a bunch of headers for the desired statistics such as "Gross Expected Return" and "Yield". I want to be able to feed a particular ticker and desired statistic into the function and have it return the statistic for that ticker. The code currently uses xlrd to look at column B and generate a list of all 500 securities (e.g. ['AAPL', 'SCHB', 'DB', ...]), then finds the index of that security, which is how it identifies the right row. It then generates a list of all the statistic types on the spreadsheet (e.g. ['Gross Expected Return', 'Yield', 'Beta', ...]) and finds the index of the desired statistic. It then looks at the correct row and returns the value of the cell at the correct index.
When I run a for loop calling this function x times over a list of tickers, it appears to be highly inefficient. I was curious whether anyone has ideas on a faster way to do this, either involving xlrd or not.
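One likely culprit is that the workbook is reopened and rescanned on every call. A possible speed-up, sketched here with pandas rather than xlrd (the sheet layout and read_excel arguments are assumptions based on the description above), is to load the sheet once and index it by ticker:
import pandas as pd

# read the second sheet once; assume tickers are in column B and headers in the first row
cme = pd.read_excel('Desktop\\CME.xlsx', sheet_name=1, index_col=1)

def data_search_fast(security, statistic):
    # plain label lookup on the pre-loaded frame, no file I/O per call
    return cme.loc[security, statistic]

# example: look up many tickers without reopening the workbook
# values = [data_search_fast(t, 'Gross ER') for t in tickers]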
I have around 100 csv files. Each of them is read into its own pandas DataFrame, then merged later on and finally written into a database.
Each csv file contains 1000 rows and 816 columns.
Here is the problem:
Each of the csv files contains the 816 columns, but not all of the columns contain data. As a result, some of the csv files are misaligned - the data has been moved left, but the column has not been deleted.
Here's a made-up example:
CSV file A (which is correct):
Name   Age      City
Joe    18       London
Kate   19       Berlin
Math   20       Paris
CSV file B (with misalignment):
Name   Age      City
Joe    18       London
Kate   Berlin
Math   20       Paris
I would like to merge A and B, but my current solution results in a misalignment.
I'm not sure whether this is easier to deal with in SQL or Python, but I hoped some of you could come up with a good solution.
The current solution to merge the dataframes is as follows:
def merge_pandas(csvpaths):
    list = []
    for path in csvpaths:
        frame = pd.read_csv(sMainPath + path, header=0, index_col=None)
        list.append(frame)
    return pd.concat(list)
Thanks in advance.
A generic solution for these types of problems is most likely overkill. We note that the only possible mistake is when a value is written into a column to the left of where it belongs.
If your problem is more complex than the two-column example you gave, you should have an array that contains the expected column types for your convenience.
types = ['string', 'int']
Next, I would set up a marker to identify flaws:
df['error'] = 0
df.loc[df.City.isnull(), 'error'] = 1
The script can detect the error with certainty
In your simple scenario, whenever there is an error, we can simply check the value in the first column.
If it's a number, ignore and move on (keep NaN on second value)
If it's a string, move it to the right
In your trivial example, that would be
def checkRow(row):
    try:
        row['Age'] = int(row['Age'])
    except ValueError:
        row['City'] = row['Age']
        row['Age'] = np.NaN
    return row

df = df.apply(checkRow, axis=1)
In case you have more than two columns, use your types variable to do iterated checks to find out where the NaN belongs.
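A rough sketch of such an iterated check for more columns (the value_cols/types layout and the assumption of a one-slot left shift are mine, not from the answer above):
import numpy as np
import pandas as pd

# hypothetical layout: the columns that can shift, and the type each one should hold
value_cols = ['Age', 'City']
types = [int, str]

def fix_row(row):
    # only attempt a repair when the rightmost column is empty,
    # i.e. the row looks like it was shifted one slot to the left
    if not pd.isna(row[value_cols[-1]]):
        return row
    values = [row[c] for c in value_cols[:-1]]
    # find the first position whose value does not fit the expected type
    for pos, (value, expected) in enumerate(zip(values, types)):
        try:
            expected(value)
        except (TypeError, ValueError):
            break
    else:
        return row  # every value fits its column; nothing to fix
    # the NaN belongs at that position, so shift the remaining values one column right
    shifted = values[:pos] + [np.nan] + values[pos:]
    for col, value in zip(value_cols, shifted):
        row[col] = value
    return row

df = df.apply(fix_row, axis=1)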
The script cannot know the error with certainty
For example, if two adjacent columns both contain string values, the type check above cannot tell which column a value belongs in. In that case, you're screwed. Use a second marker to save these cases and fix them manually. You could of course do advanced checks (if it should be a city name, check whether the value is a city name), but this is probably overkill and doing it manually is faster.