Make loop iterate through range but limited to input - python

I'm new to python so I apologize in advance if the question is too easy.
I'm trying to make a simulation to find the optimization point on a dataframe. This is what I have so far:
import random
import pandas as pd
import math
import numpy as np
loops = int(input('Q of simulations: '))
cost = 175
sell_price = 250
sale_price = 250/2
# order = 1000
simulation = 0
profit = 0
rows = []
order = range(1000, 3000)
ordenes = []
for i in order:
    ordenes.append(i)
for i in ordenes:
    demand = math.trunc(1000 + random.random() * 2001)
    if demand >= i:
        profit = (sell_price - cost) * i
        rows.append([simulation, demand, i, profit, (demand - i)])
    else:
        profit = (sell_price - cost) * demand - (i - demand) * (sale_price - cost)
        rows.append([simulation, demand, i, profit, (demand - i)])
DataFrame = pd.DataFrame(rows, columns = ['#Simulation', 'Demand', 'Order', 'Utility', 'Shortage'])
print(DataFrame)
DataFrame.loc[DataFrame['Utility'].idxmax()]
The current output (for any number entered at the input) is:
#Simulation Demand Order Utility Shortage
0 0 2067 1000 75000.0 1067
1 0 1392 1001 75075.0 391
2 0 1042 1002 75150.0 40
3 0 1457 1003 75225.0 454
4 0 1930 1004 75300.0 926
... ... ... ... ... ...
1995 0 1823 2995 195325.0 -1172
1996 0 2186 2996 204450.0 -810
1997 0 1384 2997 184450.0 -1613
1998 0 1795 2998 194775.0 -1203
1999 0 1611 2999 190225.0 -1388
[2000 rows x 5 columns]
#Simulation 0.0
Demand 2922.0
Order 2989.0
Utility 222500.0
Shortage -67.0
Name: 1989, dtype: float64
Desired output (entering 5 at the input):
#Simulation Demand Order Utility Shortage
0 0 2067 1000 75000.0 1067
1 1 1392 1001 75075.0 391
2 2 1042 1002 75150.0 40
3 3 1457 1003 75225.0 454
4 4 1930 1004 75300.0 926
[5 rows x 5 columns]
#Simulation 4.0
Demand 1930.0
Order 1004.0
Utility 75300.0
Shortage 926.0
Name: 1989, dtype: float64
I really don't know how to make this happen. I've tried everything that comes to mind, but the outcome either fails on the 'Order' column or comes out as shown above.
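For reference, here is a minimal sketch of one way to get the desired output, assuming you want the order quantities to start at 1000 and want exactly `loops` rows, one per simulation. It reuses the profit formula from the question as-is:
```
rows = []
# Iterate only `loops` times, numbering each simulation as we go.
for simulation, i in enumerate(range(1000, 1000 + loops)):
    demand = math.trunc(1000 + random.random() * 2001)
    if demand >= i:
        profit = (sell_price - cost) * i
    else:
        profit = (sell_price - cost) * demand - (i - demand) * (sale_price - cost)
    rows.append([simulation, demand, i, profit, demand - i])

DataFrame = pd.DataFrame(rows, columns=['#Simulation', 'Demand', 'Order', 'Utility', 'Shortage'])
print(DataFrame)
print(DataFrame.loc[DataFrame['Utility'].idxmax()])
```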


How can we create a Chord Diagram with a dataframe object?

I found this generic code online.
import pandas as pd
import holoviews as hv
from holoviews import opts, dim
from bokeh.sampledata.les_mis import data
hv.extension('bokeh')
hv.output(size=200)
links = pd.DataFrame(data['links'])
print(links.head(3))
hv.Chord(links)
nodes = hv.Dataset(pd.DataFrame(data['nodes']), 'index')
nodes.data.head()
chord = hv.Chord((links, nodes)).select(value=(5, None))
chord.opts(
    opts.Chord(cmap='Category20', edge_cmap='Category20', edge_color=dim('source').str(),
               labels='name', node_color=dim('index').str()))
That produces a nice-looking chord diagram (image omitted here).
The sample data is sourced from here.
https://holoviews.org/reference/elements/bokeh/Chord.html
Apparently, 'links' is a pandas DataFrame and 'nodes' is a HoloViews dataset; the types are like this.
<class 'pandas.core.frame.DataFrame'>
<class 'holoviews.core.data.Dataset'>
So, my question is this: how can I feed a DataFrame into a chord diagram? Here is my sample DataFrame. Also, I don't know how to incorporate the <class 'holoviews.core.data.Dataset'> into the mix.
I think your data does not match the requirements of this function. Let me explain why I think so.
The Chord function expects at least one dataset (this can be a pandas DataFrame) with three columns, where all elements are numbers.
source target value
0 1 0 1
1 2 0 8
2 3 0 10
A second dataset is optional. It can contain strings in its second column, to add labels for example.
index name group
0 0 a 0
1 1 b 0
2 2 c 0
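To make this concrete, here is a tiny runnable sketch of the two-dataset form, built from the toy tables above (I added a fourth node, d, so that every link endpoint exists):
```
import pandas as pd
import holoviews as hv
hv.extension('bokeh')

# Links: all three columns are numeric.
links = pd.DataFrame({'source': [1, 2, 3], 'target': [0, 0, 0], 'value': [1, 8, 10]})
# Optional nodes: the label column may contain strings.
nodes = hv.Dataset(pd.DataFrame({'index': [0, 1, 2, 3],
                                 'name': ['a', 'b', 'c', 'd'],
                                 'group': [0, 0, 0, 0]}), 'index')
hv.Chord((links, nodes))
```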
Basic Example
Your given data looks like this.
Measure Country Value
0 Arrivals Greece 1590
1 Arrivals Spain 1455
2 Arrivals France 1345
3 Arrivals Iceland 1100
4 Arrivals Iceland 1850
5 Departures America 2100
6 Departures Ireland 1000
7 Departures America 950
8 Departures Ireland 1200
9 Departures Japan 1050
You can bring your data into this basic form if you replace the strings in your DataFrame df with numbers, like this:
_df = df.copy()
values = list(_df.Measure.unique())+list(_df.Country.unique())
d = {value: i for i, value in enumerate(values)}
def str2num(s):
    return d[s]
_df.Measure = _df.Measure.apply(str2num)
_df.Country = _df.Country.apply(str2num)
>>> _df
Measure Country Value
0 0 2 1590
1 0 3 1455
2 0 4 1345
3 0 5 1100
4 0 5 1850
5 1 6 2100
6 1 7 1000
7 1 6 950
8 1 7 1200
9 1 8 1050
Now your data matches the basic conditions and you can create a Chord diagram.
chord = hv.Chord(_df).select(value=(5, None))
chord.opts(
    opts.Chord(cmap='Category20', edge_cmap='Category20',
               edge_color=dim('Measure').str(),
               labels='Country',
               node_color=dim('index').str()))
As you can see, all the connection lines have only one of two colors. This is because the Measure column contains only two distinct values, so I think this is not what you want.
Modified Example
Let's modify your data a tiny bit:
_list = list(df.Country.values)
new_df = pd.DataFrame({'From':_list, 'To':_list[3:]+_list[:3], 'Value':df.Value})
>>> new_df
From To Value
0 Greece Iceland 1590
1 Spain Iceland 1455
2 France America 1345
3 Iceland Ireland 1100
4 Iceland America 1850
5 America Ireland 2100
6 Ireland Japan 1000
7 America Greece 950
8 Ireland Spain 1200
9 Japan France 1050
and:
node = pd.DataFrame()
for i, value in enumerate(df.Measure.unique()):
    _list = list(df[df['Measure'] == value].Country.unique())
    node = pd.concat([node, pd.DataFrame({'Name': _list, 'Group': i})], ignore_index=True)
>>> node
Name Group
0 Greece 0
1 Spain 0
2 France 0
3 Iceland 0
4 America 1
5 Ireland 1
6 Japan 1
Now we have to replace the strings in new_df again, and then we can call the Chord function once more.
values = list(df.Country.unique())
d = {value: i for i, value in enumerate(values)}
def str2num(s):
    return d[s]
new_df.From = new_df.From.apply(str2num)
new_df.To = new_df.To.apply(str2num)
hv.Chord(new_df)
nodes = hv.Dataset(pd.DataFrame(node), 'index')
chord = hv.Chord((new_df, nodes)).select(value=(5, None))
chord.opts(
    opts.Chord(cmap='Category20', edge_cmap='Category20', edge_color=dim('From').str(),
               labels='Name', node_color=dim('index').str()))
There are now two groups added to the HoverTool.
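As a side note, the hand-rolled str2num dictionary can also be built with pandas.factorize. A sketch of the same shared string-to-integer mapping (for the first example; not tested against the plot itself):
```
# factorize assigns codes in order of first appearance, so stacking
# Measure before Country reproduces the numbering used above.
stacked = pd.concat([df.Measure, df.Country], ignore_index=True)
codes, uniques = pd.factorize(stacked)
d = {value: i for i, value in enumerate(uniques)}
_df = df.assign(Measure=df.Measure.map(d), Country=df.Country.map(d))
```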

Trying to lookup a value from a pandas dataframe within a range of two rows in the index dataframe

I have two dataframes - "grower_moo" and "pricing" in a Python notebook to analyze harvested crops and price payments to the growers.
pricing is the index dataframe, and grower_moo has various unique load tickets with information about each load.
I need to pull the price per ton from the pricing index into a new column in the load data if the Fat of that load is not greater than the next Wet_Fat.
Below is a .head() sample of each dataframe and the code I tried. I received a "ValueError: Can only compare identically-labeled Series objects" error.
pricing
Price_Per_Ton Wet_Fat
0 306 10
1 339 11
2 382 12
3 430 13
4 481 14
5 532 15
6 580 16
7 625 17
8 665 18
9 700 19
10 728 20
11 750 21
12 766 22
13 778 23
14 788 24
15 797 25
grower_moo
Load Ticket Net Fruit Weight Net MOO Percent_MOO Fat
0 L2019000011817 56660 833 1.448872 21.92
1 L2019000011816 53680 1409 2.557679 21.12
2 L2019000011815 53560 1001 1.834644 21.36
3 L2019000011161 62320 2737 4.207080 21.41
4 L2019000011160 57940 1129 1.911324 20.06
grower_moo['price_per_ton'] = max(pricing[pricing['Wet_Fat'] < grower_moo['Fat']]['Price_Per_Ton'])
Example output: a grower_moo['Fat'] of 13.60 is less than a Wet_Fat of 14, and therefore gets a price per ton of $430.
grower_moo_with_price
Load Ticket Net Fruit Weight Net MOO Percent_MOO Fat price_per_ton
0 L2019000011817 56660 833 1.448872 21.92 750
1 L2019000011816 53680 1409 2.557679 21.12 750
2 L2019000011815 53560 1001 1.834644 21.36 750
3 L2019000011161 62320 2737 4.207080 21.41 750
4 L2019000011160 57940 1129 1.911324 20.06 728
This looks like a job for an "as of" merge, pd.merge_asof (documentation):
This is similar to a left-join except that we match on nearest key
rather than equal keys. Both DataFrames must be sorted by the key.
For each row in the left DataFrame:
A "backward" search [the default]
selects the last row in the right DataFrame whose ‘on’ key is less
than or equal to the left’s key.
In the following code, I use your example inputs, but with column names using underscores _ instead of spaces.
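For anyone who wants to run this end to end, here is a hedged reconstruction of the two inputs as literals, trimmed to the rows shown above (the real data of course has more rows):
```
import pandas as pd

pricing = pd.DataFrame({
    'Price_Per_Ton': [306, 339, 382, 430, 481, 532, 580, 625,
                      665, 700, 728, 750, 766, 778, 788, 797],
    'Wet_Fat': list(range(10, 26)),
})
grower_moo = pd.DataFrame({
    'Load_Ticket': ['L2019000011817', 'L2019000011816', 'L2019000011815',
                    'L2019000011161', 'L2019000011160'],
    'Net_Fruit_Weight': [56660, 53680, 53560, 62320, 57940],
    'Net_MOO': [833, 1409, 1001, 2737, 1129],
    'Percent_MOO': [1.448872, 2.557679, 1.834644, 4.207080, 1.911324],
    'Fat': [21.92, 21.12, 21.36, 21.41, 20.06],
})
```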
# Required by merge_asof: sort keys in left DataFrame
grower_moo = grower_moo.sort_values('Fat')
# Required by merge_asof: key column data types must match
pricing['Wet_Fat'] = pricing['Wet_Fat'].astype('float')
# Perform the asof merge
res = pd.merge_asof(grower_moo, pricing, left_on='Fat', right_on='Wet_Fat')
# Print result
res
Load_Ticket Net_Fruit_Weight Net_MOO Percent_MOO Fat Price_Per_Ton Wet_Fat
0 L2019000011160 57940 1129 1.911324 20.06 728 20.0
1 L2019000011816 53680 1409 2.557679 21.12 750 21.0
2 L2019000011815 53560 1001 1.834644 21.36 750 21.0
3 L2019000011161 62320 2737 4.207080 21.41 750 21.0
4 L2019000011817 56660 833 1.448872 21.92 750 21.0
# Optional: drop the key column from the right DataFrame
res.drop(columns='Wet_Fat')
Load_Ticket Net_Fruit_Weight Net_MOO Percent_MOO Fat Price_Per_Ton
0 L2019000011160 57940 1129 1.911324 20.06 728
1 L2019000011816 53680 1409 2.557679 21.12 750
2 L2019000011815 53560 1001 1.834644 21.36 750
3 L2019000011161 62320 2737 4.207080 21.41 750
4 L2019000011817 56660 833 1.448872 21.92 750
concat_df = pd.concat([grower_moo, pricing], axis=1)
concat_df = concat_df[concat_df['Wet_Fat'] < concat_df['Fat']]
del concat_df['Wet_Fat']

df.apply(sorted, axis=1) removes column names?

Working through Pandas Cookbook. Counting the Total Number of Flights Between Cities.
import pandas as pd
import numpy as np
# import matplotlib.pyplot as plt
print('NumPy: {}'.format(np.__version__))
print('Pandas: {}'.format(pd.__version__))
print('-----')
desired_width = 320
pd.set_option('display.width', desired_width)
pd.options.display.max_rows = 50
pd.options.display.max_columns = 14
# pd.options.display.float_format = '{:,.2f}'.format
file = "e:\\packt\\data_analysis_and_exploration_with_pandas\\section07\\data\\flights.csv"
flights = pd.read_csv(file)
print(flights.head(10))
print()
# This returns the total number of rows for each group.
flights_ct = flights.groupby(['ORG_AIR', 'DEST_AIR']).size()
print(flights_ct.head(10))
print()
# Get the number of flights between Atlanta and Houston in both directions.
print(flights_ct.loc[[('ATL', 'IAH'), ('IAH', 'ATL')]])
print()
# Sort the origin and destination cities:
# flights_sort = flights.sort_values(by=['ORG_AIR', 'DEST_AIR'], axis=1)
flights_sort = flights[['ORG_AIR', 'DEST_AIR']].apply(sorted, axis=1)
print(flights_sort.head(10))
print()
# Passing just the first row.
print(sorted(flights.loc[0, ['ORG_AIR', 'DEST_AIR']]))
print()
# Once each row is independently sorted, the column names are no longer correct.
# We will rename them to something generic, then again find the total number of flights between all cities.
rename_dict = {'ORG_AIR': 'AIR1', 'DEST_AIR': 'AIR2'}
flights_sort = flights_sort.rename(columns=rename_dict)
flights_ct2 = flights_sort.groupby(['AIR1', 'AIR2']).size()
print(flights_ct2.head(10))
print()
When I get to this line of code, my output differs from the author's:
```flights_sort = flights[['ORG_AIR', 'DEST_AIR']].apply(sorted, axis=1)```
My output does not contain any column names. As a result, when I get to:
```flights_ct2 = flights_sort.groupby(['AIR1', 'AIR2']).size()```
it throws a KeyError. This makes sense, as I am trying to rename columns when no column names exist.
My question is: why are the column names gone? All other output matches the author's output exactly:
Connected to pydev debugger (build 191.7141.48)
NumPy: 1.16.3
Pandas: 0.24.2
-----
MONTH DAY WEEKDAY AIRLINE ORG_AIR DEST_AIR SCHED_DEP DEP_DELAY AIR_TIME DIST SCHED_ARR ARR_DELAY DIVERTED CANCELLED
0 1 1 4 WN LAX SLC 1625 58.0 94.0 590 1905 65.0 0 0
1 1 1 4 UA DEN IAD 823 7.0 154.0 1452 1333 -13.0 0 0
2 1 1 4 MQ DFW VPS 1305 36.0 85.0 641 1453 35.0 0 0
3 1 1 4 AA DFW DCA 1555 7.0 126.0 1192 1935 -7.0 0 0
4 1 1 4 WN LAX MCI 1720 48.0 166.0 1363 2225 39.0 0 0
5 1 1 4 UA IAH SAN 1450 1.0 178.0 1303 1620 -14.0 0 0
6 1 1 4 AA DFW MSY 1250 84.0 64.0 447 1410 83.0 0 0
7 1 1 4 F9 SFO PHX 1020 -7.0 91.0 651 1315 -6.0 0 0
8 1 1 4 AA ORD STL 1845 -5.0 44.0 258 1950 -5.0 0 0
9 1 1 4 UA IAH SJC 925 3.0 215.0 1608 1136 -14.0 0 0
ORG_AIR DEST_AIR
ATL ABE 31
ABQ 16
ABY 19
ACY 6
AEX 40
AGS 83
ALB 33
ANC 2
ASE 1
ATW 10
dtype: int64
ORG_AIR DEST_AIR
ATL IAH 121
IAH ATL 148
dtype: int64
*** No column names *** Why?
0 [LAX, SLC]
1 [DEN, IAD]
2 [DFW, VPS]
3 [DCA, DFW]
4 [LAX, MCI]
5 [IAH, SAN]
6 [DFW, MSY]
7 [PHX, SFO]
8 [ORD, STL]
9 [IAH, SJC]
dtype: object
The author's output. Note that the column names are present.
sorted returns a list object and obliterates the columns:
In [11]: df = pd.DataFrame([[1, 2], [3, 4]], columns=["A", "B"])
In [12]: df.apply(sorted, axis=1)
Out[12]:
0 [1, 2]
1 [3, 4]
dtype: object
In [13]: type(df.apply(sorted, axis=1).iloc[0])
Out[13]: list
It's possible that this wouldn't have been the case in earlier pandas... but it would still be bad code.
You can do this by passing the columns explicitly:
In [14]: df.apply(lambda x: pd.Series(sorted(x), df.columns), axis=1)
Out[14]:
A B
0 1 2
1 3 4
A more efficient way to do this is to sort the underlying numpy array:
In [21]: df = pd.DataFrame([[1, 2], [3, 1]], columns=["A", "B"])
In [22]: df
Out[22]:
A B
0 1 2
1 3 1
In [23]: arr = df[["A", "B"]].values
In [24]: arr.sort(axis=1)
In [25]: df[["A", "B"]] = arr
In [26]: df
Out[26]:
A B
0 1 2
1 1 3
As you can see this sorts each row.
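A one-line variant of the same idea, using np.sort (which returns a sorted copy along the given axis) instead of the in-place ndarray.sort:
```
import numpy as np

# np.sort returns a new, row-wise sorted array, so no temporary name is needed.
df[["A", "B"]] = np.sort(df[["A", "B"]].values, axis=1)
```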
A final note. I just applied @AndyHayden's numpy-based solution from above.
flights_sort = flights[["ORG_AIR", "DEST_AIR"]].values
flights_sort.sort(axis=1)
flights[["ORG_AIR", "DEST_AIR"]] = flights_sort
All I can say is … wow. What an enormous performance difference. I get the exact same correct answer, and I get it as soon as I click the mouse, compared to the pandas lambda solution (also provided by @AndyHayden), which takes about 20 seconds to perform the sort. That dataset is 58,000+ rows; the numpy solution returns the sort instantly.

Python iteration slow

I have made the following code:
from math import sqrt
import time
def factors(n):
    x = set(reduce(list.__add__,
                   ([i, n//i] for i in range(1, int(n**0.5) + 1) if n % i == 0)))
    return sorted(x, reverse=True)
n=10**7
m=0
start_time = time.time()
for i in xrange(1, int(sqrt(n)) + 1):
    l = 0
    x = factors(i)
    for d in xrange(i, n/i + 1):
        if i == d:
            l += i
        else:
            for b in x:
                if d % b == 0:
                    l += 2*b
                    break
    m += l
    print i
elapsed_time = time.time() - start_time
print elapsed_time
print m
I think what the code does is add the greatest common divisor of i and d for all pairs with i ≤ d and i·d ≤ n.
Due to the "print i", I have realized that the second loop is slow when i is small. Why is this, and how do I optimize it?
I see that the iteration over d will be larger, but shouldn't it essentially just be iterating over all the values, whereas for larger i the third loop should take longer because of the greater size of x?
Could the second loop be slower for small values of i just because the xrange "spans" a larger amount of values for small i?
I mean, in the second loop declaration we have:
for d in xrange(i,n/i+1):
And the maximum value of the xrange (that is, n/i+1) is larger for small i (the quotient n/i is largest at i=1, then it decreases).
Your ideas about how long each loop should take for a single pass relative to one another are accurate, but the difference between them is minimal.
Your assumptions about how many times each loop runs are off by several orders of magnitude.
The i loop executes ~3000 times. The number of inner-loop executions per i varies, but on average it drops at a high rate. At the start, the d loop runs ~10,000,000 times per i, and then it drops off very quickly:
The total number of loops you run for i[0:215] is greater than for i[215:3161]:
i d_loops b_loops running_mean avg_last_10_loops
1 10000001 1 10000001.0 10000001.0
2 5000001 2 10000001.5 10000001.5
3 3333334 2 8888890.33333 8888890.33333
4 2500001 3 8541668.5 8541668.5
5 2000001 2 7633335.2 7633335.2
6 1666667 4 7472224.0 7472224.0
7 1428572 2 6812926.85714 6812926.85714
8 1250001 4 6586311.5 6586311.5
9 1111112 3 6224869.77778 6224869.77778
10 1000001 4 6002383.2 6002383.2
99 101011 6 1653200.16162 637628.2
199 50252 2 1035550.34171 324231.5
299 33445 4 779296.658863 203848.2
399 25063 8 634848.313283 192922.4
499 20041 2 540089.59519 149790.4
599 16695 2 472549.51586 114461.6
699 14307 4 421785.891273 103772.2
799 12516 4 382086.017522 100739.8
899 11124 4 349883.460512 80518.2
999 10011 8 323351.570571 80530.4
1099 9100 4 300961.77434 67638.0
1199 8341 4 281811.0 61978.2
1299 7699 4 265260.015396 65681.9
1399 7148 2 250684.336669 54528.4
1499 6672 2 237863.799199 49524.2
1599 6254 8 226449.282051 56452.4
1699 5886 2 216141.950559 47237.4
1799 5559 4 206859.735964 43485.8
1899 5266 6 198471.47762 49653.2
1999 5003 2 190769.076538 38112.8
2099 4765 2 183702.581706 34396.0
2199 4548 4 177231.36653 36467.0
2299 4350 6 171250.213136 35741.6
2399 4169 2 165683.010838 34256.8
2499 4002 12 160541.983994 39293.2
2599 3848 4 155707.039246 35478.6
2699 3706 2 151193.218229 27470.6
2799 3573 6 146973.628796 30790.8
2899 3450 4 143019.365643 29714.8
2999 3335 2 139271.870957 28064.8
3099 3227 4 135755.08777 27799.4
3153 3172 4 133946.542341 33037.6
3154 3171 8 133912.116677 30484.8
3155 3170 4 133873.691284 29208.8
3156 3169 12 133843.321926 29196.8
3157 3168 8 133808.95407 30460.0
3158 3167 4 133770.594047 29820.6
3159 3166 12 133740.27477 32349.4
3160 3165 16 133713.977215 25983.4
3161 3164 4 133675.679848 25979.4
3162 3163 16 133649.409235 27867.2
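The shape of these counts is easy to sanity-check from the loop bounds alone. A small sketch (written for Python 3, unlike the original code; the per-i counts differ from the table by an iteration or two because of how the original was instrumented, but the orders of magnitude match):
```
n = 10**7
# For each i, the d loop xrange(i, n//i + 1) runs n//i + 1 - i times.
d_loops = [n // i + 1 - i for i in range(1, int(n**0.5) + 1)]
print(len(d_loops))        # ~3162 values of i
print(sum(d_loops[:215]))  # total d iterations for the first 215 values of i
print(sum(d_loops[215:]))  # total for all the remaining values of i
```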

zip(,) string to float? [closed]

I am trying to compute a daily P&L from 10-minute prices in a .csv (there are 42 times for each date), where the number of buys and the number of sells in a day may be unequal. If they are unequal, the program should use the closing price for that date, df["price"][t], to subtract from or by, depending on whether the unmatched trade is a buy or a sell.
import pandas as pd
df=pd.read_csv("file.csv", names="date time price mag signal".split())
s=df["signal"]=="S"
b=df["signal"]=="B"
ns=df["signal"]!="S"
nb=df["signal"]!="B"
t=df["time"]=="1620"
a1=df["price"][buy|(nb & t)]
b1=df["date"][buy|(nb & t)]
h=df["price"][s|(ns & t)]
g=df["date"][s|(ns & t)]
c1=zip(b1,a1)
c=zip(g,h)
c1 and c are lists containing the buys and the sells, alongside their respective dates. The problem is that the prices in c1 and c are strings once they're zipped, so they cannot be subtracted. Is it possible to make a1 and h floating-point numbers so that I can difference them?
I want to match dates in c and c1 to subtract the prices at the sells minus the buys, S_i - B_i, for all i on a given day, then sum them all and return that one value for every date. I'd like to difference the prices h - a1 only when the dates match.
Some sample data:
date time price mag signal
1/3/2007 930 1422.8
1/3/2007 940 1423.2 0
1/3/2007 950 1422.8 0
1/3/2007 1000 1420.5 0
1/3/2007 1010 1422.8 0
1/3/2007 1020 1426.2 1 S
.
.
.
1/3/2007 1230 1424.2 -1 B
1/3/2007 1240 1424.8 0
1/3/2007 1250 1425.8 1 S
1/3/2007 1300 1426 0
1/3/2007 1310 1425 0
1/3/2007 1320 1423.5 -1 B
1/3/2007 1330 1421.8 0
1/3/2007 1340 1421.5 0
1/3/2007 1350 1420.5 0
1/3/2007 1400 1421 0
1/3/2007 1410 1417.2 -1 B
1/3/2007 1420 1412.8 -1 B
1/3/2007 1430 1414.8 0
1/3/2007 1440 1413.5 0
1/3/2007 1450 1410 0
1/3/2007 1500 1407.2 -1 B
1/3/2007 1510 1410.2 1 S
1/3/2007 1520 1409.5 -1 B
1/3/2007 1530 1410.5 1 S
1/3/2007 1540 1412.5 0
...
1/3/2007 1610 1415.5 1 S
1/3/2007 1620 1414 -1 B
1/4/2007 930 1412.2 0
1/4/2007 940 1411 0
1/4/2007 950 1413 0
1/4/2007 1000 1412.2 0
1/4/2007 1010 1407.2 -1 B
The result of the zip, say, c1 should look something like this:
[('1/3/2007', '1424.2'),
('1/3/2007', '1423.5'),
('1/3/2007', '1417.2'),
('1/3/2007', '1412.8'),
('1/3/2007', '1407.2'),
('1/3/2007', '1409.5'),
('1/3/2007', '1414'),
etc - all dates in between
('8/30/2012','1324')]
Thanks very much.
Don't use zip; you can keep the data in pandas' native data structures. The prices should then have been read correctly as floats in the DataFrame.
You can do something like sub, then groupby 'date':
df['dif'] = a1.sub(h, fill_value=0)
g = df.groupby('date')['dif'].sum()
Note that you can use the read_csv keyword parse_dates to parse the dates as datetime objects:
df = pd.read_csv("file.csv",
names="date time price mag signal".split()
parse_dates=[['date','time']])
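Putting the pieces together, here is a hedged sketch of the daily P&L itself. It assumes the prices parse as floats and that 1620 is the closing bar of each date; the variable names are mine, untested against the real file:
```
import pandas as pd

df = pd.read_csv("file.csv", names="date time price mag signal".split())

sells = df[df["signal"] == "S"].groupby("date")["price"].agg(["sum", "count"])
buys = df[df["signal"] == "B"].groupby("date")["price"].agg(["sum", "count"])
# Closing price per date; time may parse as int or str, so compare as str.
close = df[df["time"].astype(str) == "1620"].set_index("date")["price"]

# Matched trades: sum of sells minus sum of buys per date.
matched = sells["sum"].sub(buys["sum"], fill_value=0)
# Any imbalance is squared off at that date's closing price.
imbalance = buys["count"].sub(sells["count"], fill_value=0) * close
pnl = matched.add(imbalance, fill_value=0)
print(pnl)
```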
