zip(,) string to float? [closed]

I am trying to compute a daily P&L from 10-minute prices in a .csv (there are 42 timestamps for each date), where the number of buys and the number of sells in a day may be unequal. If they are unequal, the program should use the closing price for that date, df["price"][t], as the offsetting side, depending on whether the unmatched trade is a buy or a sell.
import pandas as pd

df = pd.read_csv("file.csv", names="date time price mag signal".split())
s = df["signal"] == "S"
b = df["signal"] == "B"
ns = df["signal"] != "S"
nb = df["signal"] != "B"
t = df["time"] == "1620"          # last 10-minute bar of the day
a1 = df["price"][b | (nb & t)]    # buy prices, plus the close on days without a matching buy
b1 = df["date"][b | (nb & t)]
h = df["price"][s | (ns & t)]     # sell prices, plus the close on days without a matching sell
g = df["date"][s | (ns & t)]
c1 = zip(b1, a1)
c = zip(g, h)
c1 and c are lists pairing each buy/sell price with its date. The problem is that the prices in c1 and c are strings once they're zipped, so they cannot be subtracted. Is it possible to make a1 and h floating-point numbers so that I can difference them?
I want to match dates in c and c1 and subtract the prices, sells minus buys: S_i - B_i for all i on a given day, then sum these and return one value per date. I'd like to difference the prices h - a1 only when the dates match.
Some sample data:
date time price mag signal
1/3/2007 930 1422.8
1/3/2007 940 1423.2 0
1/3/2007 950 1422.8 0
1/3/2007 1000 1420.5 0
1/3/2007 1010 1422.8 0
1/3/2007 1020 1426.2 1 S
.
.
.
1/3/2007 1230 1424.2 -1 B
1/3/2007 1240 1424.8 0
1/3/2007 1250 1425.8 1 S
1/3/2007 1300 1426 0
1/3/2007 1310 1425 0
1/3/2007 1320 1423.5 -1 B
1/3/2007 1330 1421.8 0
1/3/2007 1340 1421.5 0
1/3/2007 1350 1420.5 0
1/3/2007 1400 1421 0
1/3/2007 1410 1417.2 -1 B
1/3/2007 1420 1412.8 -1 B
1/3/2007 1430 1414.8 0
1/3/2007 1440 1413.5 0
1/3/2007 1450 1410 0
1/3/2007 1500 1407.2 -1 B
1/3/2007 1510 1410.2 1 S
1/3/2007 1520 1409.5 -1 B
1/3/2007 1530 1410.5 1 S
1/3/2007 1540 1412.5 0
...
1/3/2007 1610 1415.5 1 S
1/3/2007 1620 1414 -1 B
1/4/2007 930 1412.2 0
1/4/2007 940 1411 0
1/4/2007 950 1413 0
1/4/2007 1000 1412.2 0
1/4/2007 1010 1407.2 -1 B
The result of the zip, say, c1 should look something like this:
[('1/3/2007', '1424.2'),
('1/3/2007', '1423.5'),
('1/3/2007', '1417.2'),
('1/3/2007', '1412.8'),
('1/3/2007', '1407.2'),
('1/3/2007', '1409.5'),
('1/3/2007', '1414'),
etc - all dates in between
('8/30/2012','1324')]
Thanks very much.

Don't use zip; you can keep the data in pandas' native data structures.
The prices should have been read as floats in the DataFrame in the first place. One likely cause here: the file's header row is read as data when you pass names= without skipping it, which forces every column to strings; if so, adding skiprows=1 fixes the dtypes.
You can then do something like sub and a groupby on 'date':
df['dif'] = a1.sub(h, fill_value=0)
g = df.groupby('date')['dif'].sum()
Note you can use the read_csv keyword parse_dates to combine the date and time columns into proper datetime objects:
df = pd.read_csv("file.csv",
                 names="date time price mag signal".split(),
                 parse_dates=[['date', 'time']])
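Putting it together, here is a minimal sketch of the daily P&L under the question's sign convention (sells positive, buys negative, with any imbalance offset at the 16:20 close). The skiprows=1 and the numeric time comparison are assumptions about how the file reads:
import pandas as pd

# Assumes the file's first line is a header row; skiprows=1 lets
# "price" parse as float64 instead of strings.
df = pd.read_csv("file.csv",
                 names="date time price mag signal".split(),
                 skiprows=1)

# Sells contribute +price, buys -price, everything else 0.
sign = df["signal"].map({"S": 1, "B": -1}).fillna(0)

# Net open position per day: +1 per unmatched sell, -1 per unmatched buy.
net = sign.groupby(df["date"]).sum()

# Closing (16:20) price for each date; "time" is assumed to parse as an integer.
close = df.loc[df["time"] == 1620].set_index("date")["price"]

# Daily P&L: the sum of signed trade prices, flattened at the close.
daily_pnl = (sign * df["price"]).groupby(df["date"]).sum() - net * close
print(daily_pnl)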


How can we create a Chord Diagram with a dataframe object?

I found this generic code online.
import pandas as pd
import holoviews as hv
from holoviews import opts, dim
from bokeh.sampledata.les_mis import data
hv.extension('bokeh')
hv.output(size=200)
links = pd.DataFrame(data['links'])
print(links.head(3))
hv.Chord(links)
nodes = hv.Dataset(pd.DataFrame(data['nodes']), 'index')
nodes.data.head()
chord = hv.Chord((links, nodes)).select(value=(5, None))
chord.opts(
    opts.Chord(cmap='Category20', edge_cmap='Category20',
               edge_color=dim('source').str(),
               labels='name', node_color=dim('index').str()))
That makes this, which looks nice.
(image: the rendered chord diagram)
The sample data is sourced from here: https://holoviews.org/reference/elements/bokeh/Chord.html
Apparently, 'links' is a pandas DataFrame and 'nodes' is a HoloViews Dataset, and the types are as follows.
<class 'pandas.core.frame.DataFrame'>
<class 'holoviews.core.data.Dataset'>
So, my question is this: how can I feed a dataframe into a Chord diagram? Here is my sample dataframe. Also, I don't know how to incorporate the <class 'holoviews.core.data.Dataset'> part into the mix.
I think your data does not match the requirements of this function; let me explain why.
The Chord function expects at least one dataset (this can be a pandas DataFrame) with three columns, in which all elements are numbers.
source target value
0 1 0 1
1 2 0 8
2 3 0 10
A second dataset is optional. It may contain strings in its second column, to add labels for example.
index name group
0 0 a 0
1 1 b 0
2 2 c 0
Basic Example
Your given data looks like this.
Measure Country Value
0 Arrivals Greece 1590
1 Arrivals Spain 1455
2 Arrivals France 1345
3 Arrivals Iceland 1100
4 Arrivals Iceland 1850
5 Departures America 2100
6 Departures Ireland 1000
7 Departures America 950
8 Departures Ireland 1200
9 Departures Japan 1050
You can bring your data into this basic form if you replace the strings in your DataFrame df with numbers, like this:
_df = df.copy()
values = list(_df.Measure.unique()) + list(_df.Country.unique())
d = {value: i for i, value in enumerate(values)}

def str2num(s):
    return d[s]

_df.Measure = _df.Measure.apply(str2num)
_df.Country = _df.Country.apply(str2num)
>>> _df
Measure Country Value
0 0 2 1590
1 0 3 1455
2 0 4 1345
3 0 5 1100
4 0 5 1850
5 1 6 2100
6 1 7 1000
7 1 6 950
8 1 7 1200
9 1 8 1050
Now your data matches the basic conditions and you can create a Chord diagram.
chord = hv.Chord(_df).select(value=(5, None))
chord.opts(
    opts.Chord(cmap='Category20', edge_cmap='Category20',
               edge_color=dim('Measure').str(),
               labels='Country',
               node_color=dim('index').str()))
As you can see, all the connection lines have only one of two colors, because the Measure column contains only two distinct values. Therefore I think this is not what you want.
Modified Example
Let's modify your data a tiny bit:
_list = list(df.Country.values)
new_df = pd.DataFrame({'From':_list, 'To':_list[3:]+_list[:3], 'Value':df.Value})
>>> new_df
From To Value
0 Greece Iceland 1590
1 Spain Iceland 1455
2 France America 1345
3 Iceland Ireland 1100
4 Iceland America 1850
5 America Ireland 2100
6 Ireland Japan 1000
7 America Greece 950
8 Ireland Spain 1200
9 Japan France 1050
and:
node = pd.DataFrame()
for i, value in enumerate(df.Measure.unique()):
    _list = list(df[df['Measure'] == value].Country.unique())
    node = pd.concat([node, pd.DataFrame({'Name': _list, 'Group': i})], ignore_index=True)
>>> node
Name Group
0 Greece 0
1 Spain 0
2 France 0
3 Iceland 0
4 America 1
5 Ireland 1
6 Japan 1
Now we have to replace the strings in new_df again, and then we can call the Chord function once more.
values = list(df.Country.unique())
d = {value: i for i, value in enumerate(values)}

def str2num(s):
    return d[s]

new_df.From = new_df.From.apply(str2num)
new_df.To = new_df.To.apply(str2num)
hv.Chord(new_df)
nodes = hv.Dataset(pd.DataFrame(node), 'index')
chord = hv.Chord((new_df, nodes)).select(value=(5, None))
chord.opts(
    opts.Chord(cmap='Category20', edge_cmap='Category20',
               edge_color=dim('From').str(),
               labels='Name', node_color=dim('index').str()))
There are now two groups added to the HoverTool.
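If you need this more than once, the steps above can be wrapped in a small helper. A rough sketch (the function name and its arguments are my own invention, not part of HoloViews):
import pandas as pd
import holoviews as hv

def chord_from_df(df, source_col, target_col, value_col):
    """Build an hv.Chord from a DataFrame whose source/target columns hold strings."""
    # Encode every distinct label as an integer node index.
    labels = pd.unique(pd.concat([df[source_col], df[target_col]]))
    code = {label: i for i, label in enumerate(labels)}
    links = pd.DataFrame({'source': df[source_col].map(code),
                          'target': df[target_col].map(code),
                          'value': df[value_col]})
    # A node table so the original strings can be shown as labels.
    nodes = hv.Dataset(pd.DataFrame({'index': range(len(labels)),
                                     'name': labels}), 'index')
    return hv.Chord((links, nodes))

# e.g., on the string version of new_df from above:
# chord = chord_from_df(new_df, 'From', 'To', 'Value')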

Make loop iterate through range but limited to input

I'm new to Python, so I apologize in advance if the question is too easy.
I'm trying to build a simulation to find the optimal point in a dataframe. This is what I have so far:
import random
import pandas as pd
import math
import numpy as np
loops = int(input('Q of simulations: '))
cost = 175
sell_price = 250
sale_price = 250/2
# order = 1000
simulation = 0
profit = 0
rows = []
order = range(1000, 3000)
ordenes = []
for i in order:
    ordenes.append(i)
for i in ordenes:
    demand = math.trunc(1000 + random.random() * (2001))
    if demand >= i:
        profit = (sell_price - cost) * i
        rows.append([simulation, demand, i, profit, (demand - i)])
    else:
        profit = (sell_price - cost) * demand - (i - demand) * (sale_price - cost)
        rows.append([simulation, demand, i, profit, (demand - i)])
DataFrame = pd.DataFrame(rows, columns=['#Simulation', 'Demand', 'Order', 'Utility', 'Shortage'])
print(DataFrame)
DataFrame.loc[DataFrame['Utility'].idxmax()]
The current output (for any number entered at the input) is:
#Simulation Demand Order Utility Shortage
0 0 2067 1000 75000.0 1067
1 0 1392 1001 75075.0 391
2 0 1042 1002 75150.0 40
3 0 1457 1003 75225.0 454
4 0 1930 1004 75300.0 926
... ... ... ... ... ...
1995 0 1823 2995 195325.0 -1172
1996 0 2186 2996 204450.0 -810
1997 0 1384 2997 184450.0 -1613
1998 0 1795 2998 194775.0 -1203
1999 0 1611 2999 190225.0 -1388
[2000 rows x 5 columns]
#Simulation 0.0
Demand 2922.0
Order 2989.0
Utility 222500.0
Shortage -67.0
Name: 1989, dtype: float64
Desired Output (writing 5 in the input):
#Simulation Demand Order Utility Shortage
0 0 2067 1000 75000.0 1067
1 1 1392 1001 75075.0 391
2 2 1042 1002 75150.0 40
3 3 1457 1003 75225.0 454
4 4 1930 1004 75300.0 926
[5 rows x 5 columns]
#Simulation 4.0
Demand 1930.0
Order 1004.0
Utility 75300.0
Shortage 926.0
Name: 1989, dtype: float64
I really don't know how to make this happen; I've tried everything that comes to mind, but the outcome either fails on the 'order' column or comes out as shown above.
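For what it's worth, here is a minimal sketch of one way to get the desired output, assuming the intent is to run exactly `loops` simulations, one per order quantity starting at 1000, numbering each row with its simulation index:
import math
import random
import pandas as pd

loops = int(input('Q of simulations: '))
cost, sell_price, sale_price = 175, 250, 250 / 2

rows = []
# One simulated row per order quantity, limited to the number entered.
for simulation, order in enumerate(range(1000, 1000 + loops)):
    demand = math.trunc(1000 + random.random() * 2001)
    if demand >= order:
        profit = (sell_price - cost) * order
    else:
        profit = (sell_price - cost) * demand - (order - demand) * (sale_price - cost)
    rows.append([simulation, demand, order, profit, demand - order])

df = pd.DataFrame(rows, columns=['#Simulation', 'Demand', 'Order', 'Utility', 'Shortage'])
print(df)
print(df.loc[df['Utility'].idxmax()])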

Trying to lookup a value from a pandas dataframe within a range of two rows in the index dataframe

I have two dataframes - "grower_moo" and "pricing" in a Python notebook to analyze harvested crops and price payments to the growers.
pricing is the index dataframe, and grower_moo has various unique load tickets with information about each load.
I need to pull the price per ton from the pricing index to a new column in the load data if the Fat of that load is not greater than the next Wet Fat.
Below is a .head() sample of each dataframe and the code I tried. I received a "ValueError: Can only compare identically-labeled Series objects" error.
pricing
Price_Per_Ton Wet_Fat
0 306 10
1 339 11
2 382 12
3 430 13
4 481 14
5 532 15
6 580 16
7 625 17
8 665 18
9 700 19
10 728 20
11 750 21
12 766 22
13 778 23
14 788 24
15 797 25
grower_moo
Load Ticket Net Fruit Weight Net MOO Percent_MOO Fat
0 L2019000011817 56660 833 1.448872 21.92
1 L2019000011816 53680 1409 2.557679 21.12
2 L2019000011815 53560 1001 1.834644 21.36
3 L2019000011161 62320 2737 4.207080 21.41
4 L2019000011160 57940 1129 1.911324 20.06
grower_moo['price_per_ton'] = max(pricing[pricing['Wet_Fat'] < grower_moo['Fat']]['Price_Per_Ton'])
Example output - grower_moo['Fat'] of 13.60 is less than 14 Fat, therefore gets a price per ton of $430
grower_moo_with_price
Load Ticket Net Fruit Weight Net MOO Percent_MOO Fat price_per_ton
0 L2019000011817 56660 833 1.448872 21.92 750
1 L2019000011816 53680 1409 2.557679 21.12 750
2 L2019000011815 53560 1001 1.834644 21.36 750
3 L2019000011161 62320 2737 4.207080 21.41 750
4 L2019000011160 57940 1129 1.911324 20.06 728
This looks like a job for an "as of" merge, pd.merge_asof (documentation):
This is similar to a left-join except that we match on nearest key
rather than equal keys. Both DataFrames must be sorted by the key.
For each row in the left DataFrame:
A "backward" search [the default]
selects the last row in the right DataFrame whose ‘on’ key is less
than or equal to the left’s key.
In the following code, I use your example inputs, but with column names using underscores (_) instead of spaces.
# Required by merge_asof: sort keys in left DataFrame
grower_moo = grower_moo.sort_values('Fat')
# Required by merge_asof: key column data types must match
pricing['Wet_Fat'] = pricing['Wet_Fat'].astype('float')
# Perform the asof merge
res = pd.merge_asof(grower_moo, pricing, left_on='Fat', right_on='Wet_Fat')
# Print result
res
Load_Ticket Net_Fruit_Weight Net_MOO Percent_MOO Fat Price_Per_Ton Wet_Fat
0 L2019000011160 57940 1129 1.911324 20.06 728 20.0
1 L2019000011816 53680 1409 2.557679 21.12 750 21.0
2 L2019000011815 53560 1001 1.834644 21.36 750 21.0
3 L2019000011161 62320 2737 4.207080 21.41 750 21.0
4 L2019000011817 56660 833 1.448872 21.92 750 21.0
# Optional: drop the key column from the right DataFrame
res.drop(columns='Wet_Fat')
Load_Ticket Net_Fruit_Weight Net_MOO Percent_MOO Fat Price_Per_Ton
0 L2019000011160 57940 1129 1.911324 20.06 728
1 L2019000011816 53680 1409 2.557679 21.12 750
2 L2019000011815 53560 1001 1.834644 21.36 750
3 L2019000011161 62320 2737 4.207080 21.41 750
4 L2019000011817 56660 833 1.448872 21.92 750
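As an aside, the ValueError from the original one-liner comes from comparing two Series with different indexes (pricing['Wet_Fat'] against grower_moo['Fat']); pandas only allows element-wise comparison of identically-labeled Series. A row-by-row version of the original idea would be (a sketch, and much slower than merge_asof on large data):
grower_moo['price_per_ton'] = grower_moo['Fat'].apply(
    lambda fat: pricing.loc[pricing['Wet_Fat'] < fat, 'Price_Per_Ton'].max())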
Another approach, concatenating the two frames side by side and then filtering:
concat_df = pd.concat([grower_moo, pricing], axis=1)
concat_df = concat_df[concat_df['Wet_Fat'] < concat_df['Fat']]
del concat_df['Wet_Fat']

Pandas cumsum + cumcount on multiple columns

Aloha,
I have the following DataFrame
import numpy as np
import pandas as pd

stores = [1, 2, 3, 4, 5]
weeks = [1, 1, 1, 1, 1]
df = pd.DataFrame({'Stores': stores,
                   'Weeks': weeks})
df = pd.concat([df]*53)
df['Weeks'] = df['Weeks'].add(df.groupby('Stores').cumcount())
df['Target'] = np.random.randint(400,600,size=len(df))
df['Actual'] = np.random.randint(350,800,size=len(df))
df['Variance %'] = (df['Target'] - df['Actual']) / df['Target']
df.loc[df['Variance %'] >= 0.01, 'Status'] = 'underTarget'
df.loc[df['Variance %'] <= 0.01, 'Status'] = 'overTarget'
df['Status'] = df['Status'].fillna('atTarget')
df.sort_values(['Stores','Weeks'],inplace=True)
this gives me the following
print(df.head())
Stores Weeks Target Actual Variance % Status
0 1 1 430 605 -0.406977 overTarget
0 1 2 549 701 -0.276867 overTarget
0 1 3 471 509 -0.080679 overTarget
0 1 4 549 378 0.311475 underTarget
0 1 5 569 708 -0.244288 overTarget
0 1 6 574 650 -0.132404 overTarget
0 1 7 466 623 -0.336910 overTarget
Now what I'm trying to do is a cumulative count of stores that were either over or under target, resetting whenever the status changes.
I thought this (and many variants of it) would be the best way to do it, but it does not reset the counter:
s = df.groupby(['Stores','Weeks','Status'])['Status'].shift().ne(df['Status'])
df['Count'] = s.groupby(df['Stores']).cumsum()
My logic was to group by the relevant columns and do a != shift to reset the cumsum.
Naturally I've scoured lots of different questions, but I can't seem to figure this out. Would anyone be so kind as to explain the best method to tackle this problem?
I hope everything here is clear and reproducible. Please let me know if you need any additional information.
Expected Output
Stores Weeks Target Actual Variance % Status Count
0 1 1 430 605 -0.406977 overTarget 1
0 1 2 549 701 -0.276867 overTarget 2
0 1 3 471 509 -0.080679 overTarget 3
0 1 4 549 378 0.311475 underTarget 1 # Reset here as status changes
0 1 5 569 708 -0.244288 overTarget 1 # Reset again.
0 1 6 574 650 -0.132404 overTarget 2
0 1 7 466 623 -0.336910 overTarget 3
Try pd.Series.groupby() after creating the key with cumsum:
s = df.groupby('Stores')['Status'].apply(lambda x: x.ne(x.shift()).ne(0).cumsum())
df['Count'] = df.groupby([df.Stores, s]).cumcount() + 1
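To see why this works, here is the same idea written out step by step (a sketch against the df built above; the .ne(0) in the one-liner is a no-op on booleans and can be dropped):
# Step 1: an id for each unbroken run of equal Status within a store.
# x.ne(x.shift()) is True exactly where the status changes, so its
# cumulative sum numbers the runs 1, 2, 3, ...
run_id = df.groupby('Stores')['Status'].apply(lambda x: x.ne(x.shift()).cumsum())

# Step 2: number the rows within each (store, run) block, starting at 1.
df['Count'] = df.groupby([df['Stores'], run_id]).cumcount() + 1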

Convert regex values in pandas to 0 or 1

I have the pandas column below. I need to convert cells containing the word 'anaphylaxis' to 1 and cells not containing the word to 0.
So far I have tried the following, but something is missing:
df['Name']= df['Name'].replace(r"^(.(?=anaphylaxis))*?$", 1,regex=True)
df['Name']= df['Name'].replace(r"^(.(?<!anaphylaxis))*?$", 0, regex=True)
ID Name
84 Drug-induced anaphylaxis
1041 Acute anaphylaxis
1194 Anaphylactic reaction
1483 Anaphylactic reaction, due to adverse effect o...
2226 Anaphylaxis, initial encounter
2428 Anaphylaxis
2831 Anaphylactic shock
4900 Other anaphylactic reaction
Use str.contains for case-insensitive matching.
import re
df['Name'] = df['Name'].str.contains(r'anaphylaxis', flags=re.IGNORECASE).astype(int)
Or, more concisely,
df['Name'] = df['Name'].str.contains(r'(?i)anaphylaxis').astype(int)
df
ID Name
0 84 1
1 1041 1
2 1194 0
3 1483 0
4 2226 1
5 2428 1
6 2831 0
7 4900 0
contains is useful when you also want regex-based matching. In this case, though, you can probably get rid of the regex completely by adding regex=False for a bit more performance.
However, for even more performance, use a list comprehension.
import numpy as np
df['Name'] = np.array(['anaphylaxis' in x.lower() for x in df['Name']], dtype=int)
Or even better,
df['Name'] = [1 if 'anaphylaxis' in x.lower() else 0 for x in df['Name'].tolist()]
df
ID Name
0 84 1
1 1041 1
2 1194 0
3 1483 0
4 2226 1
5 2428 1
6 2831 0
7 4900 0
You can use pd.Series.str.contains with regex disabled. This method returns a Boolean Series, which we then convert to int.
df['Name'] = df['Name'].str.contains('anaphylaxis', case=False, regex=False)\
                       .astype(int)
Result:
ID Name
0 84 1
1 1041 1
2 1194 0
3 1483 0
4 2226 1
5 2428 1
6 2831 0
7 4900 0
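One caveat to both answers (my addition, relevant only if the column can contain missing values): str.contains propagates NaN, and .astype(int) raises on NaN, so fill before casting:
df['Name'] = (df['Name']
              .str.contains('anaphylaxis', case=False, regex=False)
              .fillna(False)
              .astype(int))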
