How to plot a Matplotlib chart which takes values from different columns - python

This is my dataframe
Order Time Profit
0 1 106 NaN
1 1 111 -296.0
2 2 14 NaN
3 2 16 -296.0
4 3 62 NaN
.. ... ... ...
335 106 32 -297.6
336 107 44 NaN
337 107 44 138.0
338 108 58 NaN
339 108 63 -303.4
I want to plot a chart where X is the time and Y is the absolute price (positive or negative), so we need to have 2 bars. The time should not be taken from the same row as the price, but from the first row with the same order number.
For example, the -296.0 would be plotted under time 106, not 111, because 106 was the first time under order number 1. How would we do something like that?
This is my code so far:
data = pd.read_csv(filename)
df = pd.DataFrame(data, columns = ['Order','Time','Profit']).astype(str)
#turns time column into hours of week
df['Time'] = df['Time'].apply(lambda x: findHourOfWeek(x))
df['Profit'] = df['Profit'].astype(float)

Assuming the structure we see in the sample of your data holds over the entire data set, i.e. there is only one Profit value per Order, you can do it like this. Group the DataFrame by Order and aggregate by taking the minimum:
df_grouped = df.groupby(by='Order').min()
resulting in this DataFrame:
Time Profit
Order
1 106 -296.0
2 14 -296.0
3 62 NaN
...
106 32 -297.6
107 44 138.0
108 58 -303.4
Then you can sort by Time and do the plot:
import matplotlib.pyplot as plt
df_grouped.sort_values(by='Time', inplace=True)
plt.plot(df_grouped['Time'], df_grouped['Profit'])

If you would rather rely on position in the data table, you can also do this:
plot_df = pd.DataFrame()
plot_df["Order"] = df.Order.unique()
plot_df["Profit"] = list(df.groupby("Order").nth(-1)["Profit"])
plot_df["Time"] = list(df.groupby("Order").nth(0)["Time"])
However, if you want the minimum value for time, you had better use the solution provided by Arne, since it is safer and more correct (provided that you only have one profit value for each order number).
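A minimal sketch combining both ideas with named aggregation, assuming the rows of each order appear in chronological order. The sample values are copied from the question and the name plot_df is only illustrative:

```python
import numpy as np
import pandas as pd

# First rows of the question's data
df = pd.DataFrame({
    "Order": [1, 1, 2, 2, 3],
    "Time": [106, 111, 14, 16, 62],
    "Profit": [np.nan, -296.0, np.nan, -296.0, np.nan],
})

# Per order: Time from the first row, Profit from the last non-NaN value
plot_df = (
    df.groupby("Order")
      .agg(Time=("Time", "first"), Profit=("Profit", "last"))
      .sort_values("Time")
)
print(plot_df)
```

plot_df["Time"] and plot_df["Profit"] can then be passed to plt.bar (or plt.plot) as x and y. Orders without any profit value keep NaN.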

Related

Pandas Dataframe - How to transpose one value for the row n to the row n-5 [duplicate]

I would like to shift a column in a Pandas DataFrame, but I haven't been able to find a method to do it from the documentation without rewriting the whole DF. Does anyone know how to do it?
DataFrame:
## x1 x2
##0 206 214
##1 226 234
##2 245 253
##3 265 272
##4 283 291
Desired output:
## x1 x2
##0 206 nan
##1 226 214
##2 245 234
##3 265 253
##4 283 272
##5 nan 291
In [18]: a
Out[18]:
x1 x2
0 0 5
1 1 6
2 2 7
3 3 8
4 4 9
In [19]: a['x2'] = a.x2.shift(1)
In [20]: a
Out[20]:
x1 x2
0 0 NaN
1 1 5
2 2 6
3 3 7
4 4 8
You need to use df.shift here.
df.shift(i) shifts the entire dataframe by i units down.
So, for i = 1:
Input:
x1 x2
0 206 214
1 226 234
2 245 253
3 265 272
4 283 291
Output:
x1 x2
0 NaN NaN
1 206 214
2 226 234
3 245 253
4 265 272
So, run this script to get the expected output:
import pandas as pd
df = pd.DataFrame({'x1': ['206', '226', '245', '265', '283'],
                   'x2': ['214', '234', '253', '272', '291']})
print(df)
df['x2'] = df['x2'].shift(1)
print(df)
Let's define the dataframe from your example by
>>> df = pd.DataFrame([[206, 214], [226, 234], [245, 253], [265, 272], [283, 291]],
columns=[1, 2])
>>> df
1 2
0 206 214
1 226 234
2 245 253
3 265 272
4 283 291
Then you could shift the index of a copy of the second column by one
>>> s2 = df[2].copy()
>>> s2.index = s2.index + 1
and finally re-combine the single columns
>>> pd.concat([df[1], s2], axis=1)
1 2
0 206.0 NaN
1 226.0 214.0
2 245.0 234.0
3 265.0 253.0
4 283.0 272.0
5 NaN 291.0
Perhaps not fast but simple to read. Consider setting variables for the column names and the actual shift required.
Edit: Generally, shifting is possible with df[2].shift(1) as already posted; however, that would cut off the carryover.
If you don't want to lose the values you shift past the end of your dataframe, simply add the required number of empty rows first:
offset = 5
# DataFrame.append was removed in pandas 2.0, so extend the index with reindex instead
DF = DF.reindex(range(len(DF) + offset)) # Only works if sequential index (0..n-1)
DF = DF.shift(periods=offset)
I assume this import:
import pandas as pd
First add a new row of NaN values at the end of the DataFrame (df).
n = len(df)
# reindex adds one all-NaN row; DataFrame.append was removed in pandas 2.0
df2 = df.reset_index(drop=True).reindex(range(n + 1))
This creates a new DataFrame, df2. Maybe there is a more elegant way, but this works.
Now you can shift it:
df2.x2 = df2.x2.shift(1) # shift what you want
While trying to solve a similar problem of my own, I found this in the pandas documentation, which I think answers the question:
DataFrame.shift(periods=1, freq=None, axis=0)
Shift index by desired number of periods with an optional time freq
Notes
If freq is specified then the index values are shifted but the data is not realigned. That is, use freq if you would like to extend the index when shifting and preserve the original data.
I hope this helps with future questions on this matter.
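A small sketch of what the freq note means in practice, assuming a DatetimeIndex. The series and its values are made up for illustration:

```python
import pandas as pd

idx = pd.date_range("2020-01-01", periods=3, freq="D")
s = pd.Series([1, 2, 3], index=idx, name="value")

by_rows = s.shift(1)             # data moves down, index stays: the last value is lost
by_index = s.shift(1, freq="D")  # index moves forward one day, data is kept

# Aligning the original with the index-shifted copy yields one extra row
combined = pd.concat([s, by_index.rename("shifted")], axis=1)
print(combined)
```

With freq the index itself is extended, so nothing falls off the end. This only applies to date-like indexes, not to a plain integer index like the one in the question.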
df3
1 108.210 108.231
2 108.231 108.156
3 108.156 108.196
4 108.196 108.074
... ... ...
2495 108.351 108.279
2496 108.279 108.669
2497 108.669 108.687
2498 108.687 108.915
2499 108.915 108.852
df3['yo'] = df3['yo'].shift(-1)
yo price
0 108.231 108.210
1 108.156 108.231
2 108.196 108.156
3 108.074 108.196
4 108.104 108.074
... ... ...
2495 108.669 108.279
2496 108.687 108.669
2497 108.915 108.687
2498 108.852 108.915
2499 NaN 108.852
This is how I do it:
df_ext = pd.DataFrame(index=pd.date_range(df.index[-1], periods=8, inclusive='right')) # 'closed' was renamed to 'inclusive' in pandas 1.4
df2 = pd.concat([df, df_ext], axis=0, sort=True)
df2["forecast"] = df2["some column"].shift(7)
Basically I am generating an empty dataframe with the desired index and then just concatenate them together. But I would really like to see this as a standard feature in pandas so I have proposed an enhancement to pandas.
I'm new to pandas, and I may not be understanding the question, but this solution worked for my problem:
# Shift contents of column 'x2' down 1 row
df['x2'] = df['x2'].shift()
Or, to create a new column with contents of 'x2' shifted down 1 row
# Create new column with contents of 'x2' shifted down 1 row
df['x3'] = df['x2'].shift()
I had a read of the official docs for shift() while trying to figure this out, but it doesn't make much sense to me, and has no examples referencing this specific behavior.
Note that the last row of column 'x2' is effectively pushed off the end of the Dataframe. I expected shift() to have a flag to change this behaviour, but I can't find anything.
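As far as I know shift() has no such flag, but a minimal sketch of a workaround is to add the empty rows before shifting. It assumes the default 0..n-1 index and uses the data from the question; the names periods and extended are only illustrative:

```python
import pandas as pd

df = pd.DataFrame({"x1": [206, 226, 245, 265, 283],
                   "x2": [214, 234, 253, 272, 291]})

periods = 1
# Add `periods` empty rows first (assumes the default 0..n-1 index) ...
extended = df.reindex(range(len(df) + periods))
# ... so the shifted values have somewhere to go
extended["x2"] = extended["x2"].shift(periods)
print(extended)
```

The columns become float because NaN cannot be stored in an integer column.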

How do you add the value for a certain column from a previous row to your current row in Python Pandas? [duplicate]

In python, how can I reference previous row and calculate something against it? Specifically, I am working with dataframes in pandas - I have a data frame full of stock price information that looks like this:
Date Close Adj Close
251 2011-01-03 147.48 143.25
250 2011-01-04 147.64 143.41
249 2011-01-05 147.05 142.83
248 2011-01-06 148.66 144.40
247 2011-01-07 147.93 143.69
Here is how I created this dataframe:
import pandas
url = 'http://ichart.finance.yahoo.com/table.csv?s=IBM&a=00&b=1&c=2011&d=11&e=31&f=2011&g=d&ignore=.csv'
data = pandas.read_csv(url)
## now I sorted the data frame ascending by date
data = data.sort_values(by='Date') # DataFrame.sort was removed from pandas; sort_values replaces it
Starting with row number 2, or in this case, I guess it's 250 (PS - is that the index?), I want to calculate the difference between 2011-01-03 and 2011-01-04, and so on for every entry in this dataframe. I believe the appropriate way is to write a function that takes the current row, figures out the previous row, and calculates the difference between them, then use the pandas apply function to update the dataframe with the value.
Is that the right approach? If so, should I be using the index to determine the difference? (note - I'm still in python beginner mode, so index may not be the right term, nor even the correct way to implement this)
I think you want to do something like this:
In [26]: data
Out[26]:
Date Close Adj Close
251 2011-01-03 147.48 143.25
250 2011-01-04 147.64 143.41
249 2011-01-05 147.05 142.83
248 2011-01-06 148.66 144.40
247 2011-01-07 147.93 143.69
In [27]: data.set_index('Date').diff()
Out[27]:
Close Adj Close
Date
2011-01-03 NaN NaN
2011-01-04 0.16 0.16
2011-01-05 -0.59 -0.58
2011-01-06 1.61 1.57
2011-01-07 -0.73 -0.71
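If you need the previous row's value itself and not only the difference, shift(1) lines each row up with the one before it. A short sketch using the Close values from the question; the new column names are only illustrative:

```python
import pandas as pd

data = pd.DataFrame({
    "Date": ["2011-01-03", "2011-01-04", "2011-01-05", "2011-01-06", "2011-01-07"],
    "Close": [147.48, 147.64, 147.05, 148.66, 147.93],
})

# shift(1) puts the previous row's value next to the current row
data["Prev Close"] = data["Close"].shift(1)
data["Change"] = data["Close"] - data["Prev Close"]  # same result as data["Close"].diff()
data["Pct Change"] = data["Change"] / data["Prev Close"]
print(data.round(4))
```

No row-by-row function or apply is needed; the first row has no predecessor, so its values are NaN.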
To calculate the difference of one column, here is what you can do.
df=
A B
0 10 56
1 45 48
2 26 48
3 32 65
We want to compute the row difference in A only and keep only the rows where that difference is less than 15.
df['A_dif'] = df['A'].diff()
df=
A B A_dif
0 10 56 NaN
1 45 48 35.0
2 26 48 -19.0
3 32 65 6.0
df = df[df['A_dif']<15]
df=
A B A_dif
2 26 48 -19.0
3 32 65 6.0
Row 0 is dropped as well, because the comparison NaN < 15 is False.
I don't know pandas, and I'm pretty sure it has something specific for this; however, I'll give you the pure-Python solution, that might be of some help even if you need to use pandas:
import csv
import io
import urllib.request
# This retrieves the CSV file and loads it into a list, converting
# all numeric values to floats
url='http://ichart.finance.yahoo.com/table.csv?s=IBM&a=00&b=1&c=2011&d=11&e=31&f=2011&g=d&ignore=.csv'
with urllib.request.urlopen(url) as response:
    reader = csv.reader(io.TextIOWrapper(response, encoding='utf-8'), delimiter=',')
    rows = list(reader)[1:]  # skip the header line
# We sort the output list so the records are ordered by date
cleaned = sorted([r[0]] + [float(v) for v in r[1:]] for r in rows)
# Pair each row with the row before it; starting at the second row avoids
# comparing row 0 with cleaned[-1] (the last row)
for prev, row in zip(cleaned, cleaned[1:]):
    # This calculates the difference of each numeric field with the same field
    # in the row before this one
    print(row[0], [row[j] - prev[j] for j in range(1, len(row))])

I wanna read each cell of pandas df one after another and do some calculation on them

I wanna read each cell of a pandas df one after another and do some calculation on them, but I have a problem using dictionaries or lists. For example, I wanna check for the i-th row whether the outdoor temperature is more than X and also whether the humidity is more/less than Y, and then do a special calculation for that row.
here is the body of loaded df:
data=pd.read_csv('/content/drive/My Drive/Thesis/DS1.xlsx - Sheet1.csv')
data=data.drop(columns=["Date","time","real feel","Humidity","indoor temp"])
print(data)
and here is the data:
outdoor temp Unnamed: 6 Humidity Estimation: (poly3)
0 26 NaN 64.1560
1 25 NaN 68.6875
2 25 NaN 68.6875
3 24 NaN 72.4640
4 24 NaN 72.4640
.. ... ... ...
715 35 NaN 22.5625
716 33 NaN 28.1795
717 32 NaN 32.3680
718 31 NaN 37.2085
719 30 NaN 42.5000
[720 rows x 3 columns]
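Create a function and then use .apply() to run it on each row. You can set temp and humid to your desired values. If you want to reference a specific row, use data.iloc[row_index]. I am not sure what calculation you want to do, so I just added one to the value.
temp, humid = 30, 50  # placeholder thresholds; set these to your own values
def calculation(row, temp, humid):
    if row["outdoor temp"] > temp:
        row["outdoor temp"] += 1
    # the plain "Humidity" column was dropped above, so use the remaining humidity column
    if row["Humidity Estimation: (poly3)"] > humid:
        row["Humidity Estimation: (poly3)"] += 1
    return row  # apply() needs the modified row back, otherwise every row becomes None
data = data.apply(lambda row: calculation(row, temp, humid), axis=1)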
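For larger frames the same kind of check can be done without visiting rows one by one, using a boolean mask. A sketch under assumptions: the thresholds, the "+ 1" calculation and the column name "adjusted temp" are placeholders, and only four rows from the question are used:

```python
import pandas as pd

data = pd.DataFrame({
    "outdoor temp": [26, 25, 35, 33],
    "Humidity Estimation: (poly3)": [64.1560, 68.6875, 22.5625, 28.1795],
})

temp_limit = 30     # placeholder for X
humid_limit = 25.0  # placeholder for Y

# True for rows that are hotter than X and drier than Y
mask = (data["outdoor temp"] > temp_limit) & (
    data["Humidity Estimation: (poly3)"] < humid_limit
)

# Apply the special calculation only to the matching rows
data["adjusted temp"] = data["outdoor temp"]
data.loc[mask, "adjusted temp"] = data.loc[mask, "outdoor temp"] + 1
print(data)
```

Both conditions are evaluated for the whole column at once, so no dictionaries or lists are needed.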

Python Pandas Fill Dataframe with another DataFrame

I have a dataframe
x = pd.DataFrame(index = ['wkdy','hr'],columns=['c1','c2','c3'])
This leads to 168 rows of data in the dataframe. 7 weekdays and 24 hours in each day.
I have another dataframe
dates = pd.date_range('20090101',periods = 10000, freq = 'H')
y = pd.DataFrame(np.random.randn(10000, 3), index = dates, columns = ['c1','c2','c3'])
y['hr'] = y.index.hour
y['wkdy'] = y.index.weekday
I want to fill 'y' with data from 'x', so that each weekday and hour has the same data but with a datestamp attached to it.
The only way I know is to loop through the dates and fill in the values. Is there a faster, more efficient way to do this?
My solution (rather crude, to say the least) iterates over the entire dataframe y row by row and tries to fill it from dataframe x through a lookup.
for r in range(0,len(y)):
    h = int(y.iloc[r]['hr'])
    w = int(y.iloc[r]['wkdy'])
    y.iloc[r] = x.loc[(w,h)]
Your dataframe x doesn't have 168 rows but looks like
c1 c2 c3
wkdy NaN NaN NaN
hr NaN NaN NaN
and you can't index it using a tuple like in x.loc[(w,h)]. What you probably had in mind was something like
x = pd.DataFrame(
index=pd.MultiIndex.from_product(
[range(7), range(24)], names=['wkdy','hr']),
columns=['c1','c2','c3'],
data=np.arange(3 * 168).reshape(3, 168).T)
x
c1 c2 c3
wkdy hr
0 0 0 168 336
1 1 169 337
... ... ... ... ...
6 22 166 334 502
23 167 335 503
168 rows × 3 columns
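Now your loop will work, although a more pythonic version would look like this:
for idx, row in y.iterrows():
    # iterrows() yields float rows, so cast the lookup keys back to int
    y.loc[idx, ['c1', 'c2', 'c3']] = x.loc[(int(row.wkdy), int(row.hr))].to_numpy()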
However, iterating through dataframes is very expensive and you should look for a vectorized solution by simply merging the 2 frames and removing the unwanted columns:
y = (x.merge(y.reset_index(), on=['wkdy', 'hr'])
.set_index('index')
.sort_index()
.iloc[:,:-3])
y
wkdy hr c1_x c2_x c3_x
index
2009-01-01 00:00:00 3 0 72 240 408
2009-01-01 01:00:00 3 1 73 241 409
... ... ... ... ... ...
2010-02-21 14:00:00 6 14 158 326 494
2010-02-21 15:00:00 6 15 159 327 495
10000 rows × 5 columns
Now y is a dataframe with columns c1_x, c2_x, c3_x having data from dataframe x where y.wkdy==x.wkdy and y.hr==x.hr. Merging here is 1000 times faster than looping.
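Another vectorized option, sketched here under the assumption that x has the (wkdy, hr) MultiIndex built above, is to look up the row positions directly instead of merging. The variable names keys and pos and the shortened date range are only illustrative:

```python
import numpy as np
import pandas as pd

# Lookup table: one row per (weekday, hour)
x = pd.DataFrame(
    np.arange(3 * 168).reshape(3, 168).T,
    index=pd.MultiIndex.from_product([range(7), range(24)], names=["wkdy", "hr"]),
    columns=["c1", "c2", "c3"],
)

# Eight days of hourly timestamps
dates = pd.date_range("2009-01-01", periods=192, freq=pd.Timedelta(hours=1))

# Build the (weekday, hour) key of every timestamp and find its row position in x
keys = pd.MultiIndex.from_arrays([dates.weekday, dates.hour], names=["wkdy", "hr"])
pos = x.index.get_indexer(keys)

# Take the matching rows of x and attach the timestamps as the index
y = pd.DataFrame(x.to_numpy()[pos], index=dates, columns=x.columns)
print(y.head())
```

This keeps the original column names and row order, so no suffixes or column dropping are needed afterwards.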

python pandas: How to drop items in a dataframe

I have a huge number of points in my dataframe, so I would like to drop some of them (ideally keeping the mean values).
e.g. currently I have
date calltime
0 1491928756414930 4643
1 1491928756419607 166
2 1491928756419790 120
3 1491928756419927 142
4 1491928756420083 121
5 1491928756420217 109
6 1491928756420409 52
7 1491928756420476 105
8 1491928756420605 35
9 1491928756420654 120
10 1491928756420787 105
11 1491928756420907 93
12 1491928756421013 37
13 1491928756421062 112
14 1491928756421187 41
Is there any way to drop some of the items, based on sampling?
To give more details: my problem is the number of values at very close intervals, e.g. 1491928756421062 and 1491928756421187.
So I have a chart like this (image not included).
Instead, I would like to somehow have a mean value for those close intervals, perhaps grouped by second...
I would use sample(), but as you said it selects randomly. If you want to take sample according to some logic, for instance, only keeping rows whose value is mean *.9 < value < mean * 1.1, you can try the following code. Actually, it all depends on your sampling strategy.
As an example, something like this could be done.
test.csv:
1491928756414930,4643
1491928756419607,166
1491928756419790,120
1491928756419927,142
1491928756420083,121
1491928756420217,109
1491928756420409,52
1491928756420476,105
1491928756420605,35
1491928756420654,120
1491928756420787,105
1491928756420907,93
1491928756421013,37
1491928756421062,112
1491928756421187,41
sampling:
df = pd.read_csv("test.csv", sep=",", header=None)
mean = df[1].mean()
my_sample = df[(mean *.90 < df[1]) & (df[1] < mean * 1.10)]
You're looking for resample
df.set_index(pd.to_datetime(df.date, unit='us')).calltime.resample('s').mean() # the 'date' values look like microseconds since the epoch, hence unit='us'
This is a more complete example
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
tidx = pd.date_range('2000-01-01', periods=10000, freq='10ms')
df = pd.DataFrame(dict(calltime=np.random.randint(200, size=len(tidx))), tidx)
fig, axes = plt.subplots(2, figsize=(25, 10))
df.plot(ax=axes[0])  # raw data
df.resample('s').mean().plot(ax=axes[1])  # one mean value per second
fig.tight_layout()
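Applied to the timestamps from the question, a sketch could look like this. It assumes the date column holds microseconds since the Unix epoch and uses only five of the rows; the millisecond bucket size is chosen just to make the grouping visible on such a short sample:

```python
import pandas as pd

df = pd.DataFrame({
    "date": [1491928756414930, 1491928756419607, 1491928756419790,
             1491928756420083, 1491928756420217],
    "calltime": [4643, 166, 120, 121, 109],
})

# Assumption: "date" holds microseconds since the Unix epoch
df.index = pd.to_datetime(df["date"], unit="us")

# One mean value per millisecond; buckets without any points come out as NaN
per_ms = df["calltime"].resample("ms").mean().dropna()
print(per_ms)
```

Swap "ms" for "s" to get one mean value per second, as in the answer above.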
