I'm searching for a way to apply an arbitrage algorithm across multiple exchanges, multiple currencies, and multiple trading amounts. I've seen examples using Bellman-Ford and Floyd-Warshall, but the ones I've tried all seem to assume the graph data set is made up of prices for multiple currencies on a single exchange. I've tried tinkering to make it support prices across multiple exchanges, but I haven't had any success.
One article I read said to use Bellman-Ford and simply put only the best exchange's price in the graph (as opposed to all the exchanges' prices). While it sounds like that should work, I feel like I could be missing out on value that way. Is this the right way to go about it?
And regarding multiple amounts, should I just make one graph per trade amount? Say I want to run the algorithm for $100 and for $1000: do I literally populate the graph twice, once for each set of data? The prices will be different at $100 than at $1000, so the exchange with the best price at $100 may not be the best at $1000.
Examples:
The graph would look like this:
rates = [
[1, 0.23, 0.26, 17.41],
[4.31, 1, 1.14, 75.01],
[3.79, 0.88, 1, 65.93],
[0.057, 0.013, 0.015, 1],
]
currencies = ('PLN', 'EUR', 'USD', 'RUB')
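For reference, this is the kind of single-exchange Bellman-Ford check those examples use (just a sketch): convert each rate to a weight of -log(rate), and a negative cycle then corresponds to a set of trades whose rates multiply to more than 1. To merge exchanges the way the article below suggests, each cell would hold the best rate any exchange offers for that pair.

from math import log

def has_arbitrage(rates):
    n = len(rates)
    # Edge weight = -log(rate): a cycle whose rates multiply to more than 1
    # becomes a negative-weight cycle.
    w = [[-log(rates[i][j]) for j in range(n)] for i in range(n)]
    dist = [0.0] * n                       # start "everywhere at once"
    for _ in range(n - 1):                 # standard Bellman-Ford relaxation
        for u in range(n):
            for v in range(n):
                if dist[u] + w[u][v] < dist[v]:
                    dist[v] = dist[u] + w[u][v]
    # One extra pass: any further improvement means a negative cycle, i.e. arbitrage.
    return any(dist[u] + w[u][v] < dist[v] - 1e-12
               for u in range(n) for v in range(n))

print(has_arbitrage(rates))   # True for the matrix above (EUR -> USD -> EUR multiplies to about 1.003)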
REFERENCES:
Here is the code I've been using, but it assumes one exchange and a single trade quantity
Here is where someone mentions you can just include the best exchange's price in the graph in order to support multiple exchanges
Trying for accuracy over speed, there's a way to represent the whole order book of each exchange inside of a linear program. For each bid, we have a real-valued variable that represents how much we want to sell at that price, ranging between zero and the amount bid for. For each ask, we have a real-valued variable that represents how much we want to buy at that price, ranging between zero and the amount asked for. (I'm assuming that it's possible to get partial fills, but if not, you could switch to integer programming.) The constraints say, for each currency aside from dollars (or whatever you want more of), the total amount bought equals the total amount sold. You can strengthen this by requiring detailed balance for each (currency, exchange) pair, but then you might leave some opportunities on the table. Beware counterparty risk and slippage.
For different amounts of starting capital, you can split dollars into "in-dollars" and "out-dollars" and constrain your supply of "in-dollars", maximizing "out-dollars", with a one-to-one conversion with no limit from in- to out-dollars. Then you can solve for one in-dollars constraint, adjust the constraint, and use dual simplex to re-solve the LP faster than from scratch.
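To make that concrete, here is a minimal sketch with scipy.optimize.linprog. The prices, sizes, and the 100-dollar budget are made up, and a real model would have one variable per order-book level on every exchange, plus one balance constraint per currency:

from scipy.optimize import linprog

# Variables: x[0] = EUR bought on exchange A (ask: 1.10 USD/EUR, 500 EUR on offer)
#            x[1] = EUR sold  on exchange B (bid: 1.12 USD/EUR, 400 EUR wanted)
ask_price, ask_size = 1.10, 500.0
bid_price, bid_size = 1.12, 400.0
budget = 100.0                                    # the "in-dollars" we allow ourselves to spend

c = [ask_price, -bid_price]                       # minimize dollars spent minus dollars received
A_eq = [[1.0, -1.0]]                              # EUR balance: amount bought == amount sold
b_eq = [0.0]
A_ub = [[ask_price, 0.0]]                         # spend no more than the in-dollar budget
b_ub = [budget]
bounds = [(0.0, ask_size), (0.0, bid_size)]       # can't exceed what the order book offers

res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq,
              bounds=bounds, method="highs")
print("EUR bought/sold:", res.x, "dollar profit:", -res.fun)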
Related
I'm currently trying to get market betas for tickers pulled through the Yahoo Finance DataReader. I was wondering if there is a way to calculate each stock's market beta and put it in a DataFrame?
This is what I have for my code so far:
import pandas_datareader.data as pdr
Tickers=['SBUX','TLRY']
SD='2005-01-31'
ED='2018-12-31'
TickerW = pdr.DataReader(Tickers, 'yahoo', SD, ED)
TickerW.head()
Okay, to make sure we're on the same page, we use the formula and definition of market beta from here: https://www.investopedia.com/terms/b/beta.asp
Beta = Covariance(Stock Returns, Market Returns) / Variance(Market Returns)
So first of all, we need the ticker for the market as well as the tickers for the stocks. Which ticker you use here depends a lot on what market you want to compare against: Total stock market? Just the S&P 500? Maybe some other international equity index? There's no 100% right answer here, but a good way to pick is to think about who the "movers" of your stock are, and what other stocks they hold. (Check out Damodaran's course on valuation, free on the interwebs if you google it.)
So now your question becomes: How do I compute the covariance and variance of stock returns?
First, the pandas tickers have a bunch of information. The one we want is the "Adjusted Close". That's the daily closing price of the stock, retroactively adjusted for any "special" events like stock splits, reverse splits, and dividends. Say a stock trades for $1000 a pop one day, but then undergoes a 2-for-1 stock split, so now instead of 1 share for $1000, you have 2 shares for $500 each. In a "raw" price chart, it would appear as if your stock just lost 50% of its value in a single day, when in reality nothing happened. The Adjusted Close time series takes care of that, so that only "real" changes to the stock's value are reflected.
You can get that by calling prices = TickerW['Adj. Close'] or whatever key yahoo finance uses these days. By just looking at the TickerW dataframe you should be able to figure that out on your own :)
Next, we'd be changing prices into returns. That's just prices / prices.shift(1) - 1, which is what prices.pct_change() gives you (consult the documentation and try it out yourself). (Nerd note: instead of these simple returns, it is mathematically more sound to use logarithmic returns, because they have certain nice properties. If you want, throw a "log" around the price ratio.)
Finally, we now have a series of returns (or log returns). One series for the stock returns, one for the market returns (e.g. from SPY, for the S&P 500). Now we just need to use them in the formula for beta.
Well, the way to go here is to do what I just did: Hit up google for "pandas covariance between two series" and that gets us to https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.cov.html
So basically, cov = stock_returns.cov(market_returns) and var = market_returns.var() and then beta = cov / var.
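Putting the pieces together, a minimal sketch (the 'Adj Close' column name, SPY as the market proxy, and the 'yahoo' source are assumptions; check what your data source actually returns):

import numpy as np
import pandas_datareader.data as pdr

data = pdr.DataReader(['SBUX', 'SPY'], 'yahoo', '2005-01-31', '2018-12-31')
prices = data['Adj Close']                            # adjusted closing prices per ticker

returns = np.log(prices / prices.shift(1)).dropna()   # daily log returns

beta = returns['SBUX'].cov(returns['SPY']) / returns['SPY'].var()
print('SBUX beta vs. SPY:', beta)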
I'd say that should be enough info to send you on your way. Good luck.
I have daily stats churned out from a system that outputs total sales and units sold per region group. For my analysis, I want to break the entries down into regions instead of region groups. I'm looking for a way to split each row into one row per region with the respective measures.
I have historical percentages on the market share per region which I'll use to come up with the estimated sales and units sold.
I can do this manually in Excel, but since I'll be doing this on a weekly basis, I'm looking for a way to automate it with Python.
My data: https://imgur.com/a/pBr3y4D
Goal: https://imgur.com/a/Uc56PVR
Well, first of all, when you're doing DS research, try to find the approach that is most appropriate for your particular case. There's nothing wrong with using Excel functionality, scripting, etc. to solve your issue.
However, if you really want to use pandas, what I would do in your case is .append() the rows, then split them out by region and group by sales, or write a function with a for loop; a rough sketch is below.
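Here's that sketch with made-up column names, since I can only guess at your exact layout from the screenshots: daily has one row per region group (what the system spits out), and shares holds each region's historical share of its group.

import pandas as pd

daily = pd.DataFrame({'region_group': ['North', 'South'],
                      'total_sales': [1000.0, 2000.0],
                      'units_sold': [100, 250]})
# shares: one row per (region group, region) with that region's historical market share
shares = pd.DataFrame({'region_group': ['North', 'North', 'South', 'South'],
                       'region': ['N1', 'N2', 'S1', 'S2'],
                       'share': [0.6, 0.4, 0.7, 0.3]})

out = daily.merge(shares, on='region_group')          # one row per region
out['est_sales'] = out['total_sales'] * out['share']
out['est_units'] = out['units_sold'] * out['share']
print(out)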
I'm working on a model for a department store that uses data from previous purchases to predict the customer's probability to buy today. For the sake of simplicity, say that we have 3 categories of products (A, B, C) and I want to use the purchase history of the customers in Q1, Q2 and Q3 2017 to predict the probability to buy in Q4 2017.
How should I structure my indicators file?
My try:
The variables I want to predict are the red colored cells in the production set.
Please note the following:
Since my set of customers is the same for both years, I'm using a snapshot of how customers acted last year to predict what they will do at the end of this year (which is unknown).
Data is separated by quarter; a co-worker suggested this is not correct, because I'm unintentionally giving greater weight to the indicators by splitting each one into 4, when there should only be one per category.
Alternative:
Another approach that was suggested to me was to use two indicators per category, e.g. 'bought_in_category_A' and 'days_since_bought_A'. To me this looks simpler, but then the model will only be able to predict IF the customer will buy Y, not WHEN they will buy Y. Also, what happens if the customer never bought A? I cannot use a 0, since that would imply customers who never bought are closer to customers who bought just a few days ago.
Questions:
Is this structure ok or would you structure the data in another way?
Is it ok to use information from last year in this case?
Is it OK to 'split' a categorical variable into several binary variables? Does this affect the importance given to that variable?
Unfortunately, you need a different approach in order to achieve predictive analysis.
For example, the products' properties are unknown here (color, taste, size, seasonality, ...).
There is no information about the customers (age, gender, living area, etc.).
You need more "transactional" information (when, why, how did they buy, etc.).
What is the product's "lifecycle"? Does it have to do with fashion?
What branch are you in? (Retail, Bulk, Finance, Clothing...)
Meanwhile, have you done any campaigns? How will this be measured?
I would first (if applicable) concentrate on the relations and behaviour of the categories for each quarter: for example, when n1 decreases then n2 decreases, or when q1 is lower than q2, or q1/2016 vs q2/2017.
I think you should, first of all, work this out with a business analyst in order to find out the right "rules" and approach.
I do not think you can get a concrete answer with such generic, assumed data.
Usually you need data from at least the 3-5 most recent years to do some decent predictive analysis, depending, of course, on the nature of your product.
Hope this helped a bit.
;-)
-mwk
I'm doing an ongoing survey, every quarter. We get people to sign up (where they give extensive demographic info).
Then we get them to answer six short questions with 5 possible values: much worse, worse, same, better, much better.
Of course, over time we will not get the same participants; some will drop out and some new ones will sign up. So I'm trying to decide how best to build a DB and code (hoping to use Python, NumPy?) to allow for ongoing collection and analysis by the various categories defined by the initial demographic data. As of now we have 700 or so participants, so the dataset is not too big.
I.e.:
demographics: UID, North/South, residential/commercial, then the answers to the 6 questions for Q1.
Same for Q2 and so on. Then I need to be able to slice, dice, and average the values of the quarterly answers by the various demographics to see trends over time.
The averaging, grouping and so forth are modestly complicated by having different participants each quarter.
Any pointers to design patterns for this sort of DB and analysis? Is this a sparse matrix?
Regarding the survey analysis portion of your question, I would strongly recommend looking at the survey package in R (which includes a number of useful vignettes, including "A survey analysis example"). You can read about it in detail on the webpage "survey analysis in R". In particular, you may want to have a look at the page entitled database-backed survey objects which covers the subject of dealing with very large survey data.
You can integrate this analysis into Python with RPy2 as needed.
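For example, a minimal rpy2 sketch (this assumes R and the survey package are installed, and just reruns the package's own example data; swap in your survey design):

from rpy2.robjects import r
from rpy2.robjects.packages import importr

survey = importr('survey')                 # load the R survey package
r('data(api)')                             # example data shipped with the package
r('dstrat <- svydesign(id=~1, strata=~stype, weights=~pw, data=apistrat, fpc=~fpc)')
print(r('svymean(~api00, dstrat)'))        # a design-weighted mean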
This is a Data Warehouse. Small, but a data warehouse.
You have a Star Schema.
You have Facts:
response values are the measures
You have Dimensions:
time period. This has many attributes (year, quarter, month, day, week, etc.) This dimension allows you to accumulate unlimited responses to your survey.
question. This has some attributes. Typically your questions belong to categories or product lines or focus or anything else. You can have lots of question "category" columns in this dimension.
participant. Each participant has unique attributes and reference to a Demographic category. Your demographic category can -- very simply -- enumerate your demographic combinations. This dimension allows you to follow respondents or their demographic categories through time.
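As a rough illustration (table and column names are guesses at your survey, not a prescription), the star schema could look like this in SQLite:

import sqlite3

con = sqlite3.connect('survey.db')
con.executescript("""
CREATE TABLE dim_time        (time_id INTEGER PRIMARY KEY, year INT, quarter INT);
CREATE TABLE dim_question    (question_id INTEGER PRIMARY KEY, text TEXT, category TEXT);
CREATE TABLE dim_participant (participant_id INTEGER PRIMARY KEY, region TEXT, segment TEXT);
CREATE TABLE fact_response (
    time_id        INTEGER REFERENCES dim_time,
    question_id    INTEGER REFERENCES dim_question,
    participant_id INTEGER REFERENCES dim_participant,
    response       INTEGER            -- 1 = much worse ... 5 = much better
);
""")

# Typical analysis: average response per quarter and demographic segment
rows = con.execute("""
    SELECT t.year, t.quarter, p.segment, AVG(f.response)
    FROM fact_response f
    JOIN dim_time t        ON t.time_id = f.time_id
    JOIN dim_participant p ON p.participant_id = f.participant_id
    GROUP BY t.year, t.quarter, p.segment
""").fetchall()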
Get Ralph Kimball's The Data Warehouse Toolkit and follow those design patterns. http://www.amazon.com/Data-Warehouse-Toolkit-Complete-Dimensional/dp/0471200247
Buy the book. It's absolutely essential that you fully understand it all before you start down a wrong path.
Also, since you're doing data warehousing, look at all the [Data Warehousing] questions on Stack Overflow, and read every data warehousing blog you can find.
There's only one relevant design pattern -- the Star Schema. If you understand that, you understand everything.
On the analysis side, if your six questions have been posed in a way that would lead you to believe the answers will be correlated, consider conducting a factor analysis on the raw scores first. Often comparing the factors across regions or customer types has more statistical power than comparing across questions alone. Also, the factor scores are more likely to be normally distributed (they are a weighted sum of 6 observations), while the six questions alone would not be. This lets you apply t-tests based on the normal distribution when comparing factor scores.
One watchout, though. If you assign numeric values to answers - 1 = much worse, 2 = worse, etc. you are implying that the distance between much worse and worse is the same as the distance between worse and same. This is generally not true - you might really have to screw up to get a vote of "much worse" while just being a passive screw up might get you a "worse" score. So the assignment of cardinal (numerics) to ordinal (ordering) has a bias of its own.
The unequal number of participants per quarter isn't a problem - there are statistical t-tests that deal with unequal sample sizes.
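A small sketch of both suggestions, with random numbers standing in for your real 700 x 6 response matrix (scikit-learn's FactorAnalysis for the factor scores, then Welch's t-test, which tolerates unequal group sizes):

import numpy as np
from scipy import stats
from sklearn.decomposition import FactorAnalysis

rng = np.random.default_rng(0)
answers = rng.integers(1, 6, size=(700, 6)).astype(float)   # 700 participants x 6 questions

fa = FactorAnalysis(n_components=2)
scores = fa.fit_transform(answers)                          # factor scores per participant

north = scores[:350, 0]                                     # pretend the first 350 are "North"
south = scores[350:, 0]
t, p = stats.ttest_ind(north, south, equal_var=False)       # Welch's t-test
print(t, p)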
I'm new to the whole traveling-salesman problem as well as Stack Overflow, so let me know if I say something that isn't quite right.
Intro:
I'm trying to code a profit/time-optimized multiple-trade algorithm for a game which involves multiple cities (nodes) within multiple countries (areas), where:
The physical time it takes to travel between two connected cities is always the same;
Cities aren't linearly connected (you can teleport between some cities in the same time);
Some countries (areas) have teleport routes which make the shortest path available through other countries.
The traveler (or trader) has a limit on his coin-purse, the weight of his goods, and the quantity tradeable in a certain trade-route. The trade route can span multiple cities.
Question Parameters:
There already exists a database in memory (Python, SQLite) which holds trades keyed by their source city and destination city, with the shortest-path cities in between as an array, the amount, and the limiting factor with its % return on total capital (or, in case none of the factors are limiting, just the method that gives the highest return on total capital).
I'm trying to find the optimal profit for a certain preset chunk of time (i.e. 30 minutes)
The act of crossing into a new city is actually simultaneous
It usually takes the same defined amount of time to travel across the city map (i.e. 2 minutes)
The act of initiating the first or any new trade takes the same time as crossing one city map (i.e. 2 minutes)
My starting point might not actually have a valid trade (I would have to travel to the first/nearest/best one).
Pseudo-Solution So Far
Optimization
First, I realize that because I have a limit on the time it takes, and I know how long each hop takes (including -1 for initiating the trade), I can limit the graph to all trades whose hops are less than or equal to max_hops = int(max_time / route_time) - 1. I cut elements of the trade database that don't fall within this time limit, pruning cities that have shortest-path lengths greater than max_hops.
I make another entry in the trades database that includes the shortest paths between my current city and the starting cities of all the existing trades that aren't my current city, and give them a return of 0%. I would limit these to cases where the number of city hops is less than max_hops, and I would also calculate whether the path from the current city to the starting city plus that trade's shortest-path hops would exceed max_hops, and remove those that exceeded this limit.
Then I take the remaining trades that aren't (current_city -> starting_city) and add trade routes with a return of 0% between all destination and starting cities where the hops don't exceed max_hops.
Then I make one last prune for all cities that aren't in the trades database as either a starting city, destination city, or part of the shortest path city arrays.
Graph Search
I am left with a (much) smaller graph of trades feasible within the time limit (i.e. 30 mins).
Because all the nodes that are connected are adjacent, the connections are by default all weighted 1. I divide the % return by the number of hops in the trade, then take the inverse and add 1 (this would mean a weight of 1.01 for a 100% return route). In the case where the return is 0%, I add ... 2?
It should then return the most profitable route...
The Question:
Mostly,
How do I add the ability to take multiple routes when I have leftover money or space, while keeping the path-finding discrete to single trade routes? Due to the nature of the goods being sold at multiple prices and quantities within a city, there would be a lot of overlapping routes.
How do I penalize initiating a new trade route?
Is graph search even useful in this situation?
On A Side Note,
What kinds of prunes/optimizations to the graph should I (or should I not) make?
Is my weighting method correct? I have a feeling it will give me disproportionate weights. Should I use the actual return instead of the percentage return?
If I am coding in Python, are libraries such as python-graph suitable for my needs? Or would it save me a lot of overhead (as I understand it, graph-search algorithms can be computationally intensive) to write a specialized function?
Am I best off using A* search?
Should I be precalculating shortest-path points in the trade database and maxing out optimizations, or should I leave it all to the graph search?
Can you notice anything that I could improve?
If this is a game where you are playing against humans, I would assume the total size of the data space is actually quite limited. If so, I would be inclined to throw out all the fancy pruning, as I doubt it's worth it.
Instead, how about a simple breadth-first search?
Build a list of all cities, mark them unvisited
Take your starting city, mark the travel time as zero
for each city:
    if not finished and travel time != infinity then
        attempt to visit all neighbors, only record the time if the city is unvisited
        mark the city finished
repeat until all cities have been visited
O(): the outer loop executes cities * maximum hops times. The inner loop executes once per city. No memory allocations are needed.
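A rough runnable version of that idea, using a queue so each city is handled exactly once (the adjacency dict at the bottom is made up):

from collections import deque

def travel_times(graph, start):
    times = {city: float('inf') for city in graph}   # every city starts unvisited
    times[start] = 0                                 # starting city: travel time zero
    queue = deque([start])
    while queue:
        city = queue.popleft()
        for neighbor in graph[city]:                 # attempt to visit all neighbors
            if times[neighbor] == float('inf'):      # only record the time if unvisited
                times[neighbor] = times[city] + 1
                queue.append(neighbor)
    return times

print(travel_times({'A': ['B'], 'B': ['A', 'C'], 'C': ['B']}, 'A'))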
Now, for each city, look at what you can buy here and sell there. When figuring the rate of return on a trade, remember that growth is exponential, not linear. Twice the profit for a trade that takes twice as long is NOT a good deal! Look up how to calculate the internal rate of return.
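For example, compare trades of different lengths by their per-hop growth rate rather than total profit:

def per_hop_rate(total_return, hops):
    # Growth compounds, so normalize the return to a single hop before comparing.
    return (1 + total_return) ** (1 / hops) - 1

print(per_hop_rate(0.10, 1))   # 10% in one hop  -> 10.0% per hop
print(per_hop_rate(0.20, 2))   # 20% in two hops -> ~9.5% per hop: slightly worse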
If the current city has no trade, don't bother with the full analysis; simply look over the neighbors and run the analysis on them instead, adding one to the time for each move.
If you have CPU cycles to spare (and you very well might, anything meant for a human to play will have a pretty small data space) you can run the analysis on every city adding in the time it takes to get to the city.
Edit: Based on your comment you have tons of CPU power available as the game isn't running on your CPU. I stand by my solution: Check everything. I strongly suspect it will take longer to obtain the route and trade info than it will be to calculate the optimal solution.
I think you've defined something that fits into a class of problems called inventory-routing problems. I assume that since you have both goods and coin, the traveller is both buying and selling along the chosen route. Let's first assume that EVERYTHING is deterministic: all quantities of goods in demand, supply available, buying and selling prices, etc. are known in advance. The stochastic version gets more difficult (obviously).
One objective would be to maximize profits with a constraint on the purse and the goods. If the traveller has to return home, it looks like a tour; if not, it looks like a path. Since you haven't required the traveller to visit EVERY node, it is NOT a TSP. That's good: shortest-path problems are generally much easier to solve than TSPs.
Because of the side constraints and the limited choice of next steps at each node, I'd consider dynamic programming as a first attempt at a solution technique. It will help you enumerate what you buy and sell at each stage, and there's a limited number of stages. Also, because you put a time constraint on the decision, that limits the state space of choices.
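A very simplified sketch of that idea: the state is (city, time left) and the value is the best multiplier on your coin. The map, the per-trade returns, and the one-step costs are all made up, and it ignores the purse and weight caps, which you would fold into the state in a real version.

from functools import lru_cache

neighbors = {'A': ['B', 'C'], 'B': ['A', 'C'], 'C': ['A', 'B']}
trade_return = {('A', 'B'): 0.10, ('B', 'C'): 0.05, ('C', 'A'): 0.20}   # profit per unit of coin

@lru_cache(maxsize=None)
def best_multiplier(city, time_left):
    best = 1.0                                       # option: stop trading now
    for nxt in neighbors[city]:
        if time_left >= 1:                           # just travel to the neighbor
            best = max(best, best_multiplier(nxt, time_left - 1))
        r = trade_return.get((city, nxt))
        if r is not None and time_left >= 2:         # trade: one step to travel, one to initiate
            best = max(best, (1 + r) * best_multiplier(nxt, time_left - 2))
    return best

print(best_multiplier('A', 15))                      # best coin multiple within 15 time steps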
To those who suggested Dijkstra's algorithm: you may be right, but the labelling conventions would need to include the time, coin, and goods and the corresponding profits. It may be that the assumptions of Dijkstra's don't hold here with the added complexity of profit. Haven't thought that through yet.
Here's a link to a similar problem in capital budgeting.
Good luck !