I have a dataframe DF1:
YEAR JAN_EARN FEB_EARN MAR_EARN APR_EARN MAY_EARN JUN_EARN JUL_EARN AUG_EARN SEP_EARN OCT_EARN NOV_EARN DEC_EARN
0 2017 20 21 22.0 23 24.0 25.0 26.0 27.0 28 29.0 30 31
1 2018 30 31 32.0 33 34.0 35.0 36.0 37.0 38 39.0 40 41
2 2019 40 41 42.0 43 NaN 45.0 NaN NaN 48 49.0 50 51
3 2017 50 51 52.0 53 54.0 55.0 56.0 57.0 58 59.0 60 61
4 2017 60 61 62.0 63 64.0 NaN 66.0 NaN 68 NaN 70 71
5 2021 70 71 72.0 73 74.0 75.0 76.0 77.0 78 79.0 80 81
6 2018 80 81 NaN 83 NaN 85.0 NaN 87.0 88 89.0 90 91
I want to group the rows that share a value in the "YEAR" column and sum the data of each column within that group.
I tried to check with this:
DF2['New'] = DF1.groupby(DF1.groupby('YEAR')).sum()
The Expected Output is like:
DF2:
YEAR JAN_EARN FEB_EARN ......
0 2017 130 133 ......
1 2018 110 112 ......
2 2019 40 41 ......
3 2021 70 71 ......
Thank You For Your Time :)
You were halfway there; just a few small details to fix:
Don't assign a groupby object to a newly defined column. Replace your DF2['New'] = ... line with:
DF2 = DF1.groupby('YEAR', as_index=False).sum()
With as_index=False the year stays a regular column, so no reset_index is needed afterwards.
If you wish to see all the columns relative to each year, build the list of years your dataframe covers, apply a mask for each element of that list, and you will obtain one dataframe per year; then concatenate them with axis=0, as sketched below.
Another way of doing this would be to sort DF1 by year in chronological order and then slice. If we misunderstood your question, please elaborate so we can help.
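A minimal sketch of the mask-and-concat idea (assuming DF1 as above; the names years, per_year and stacked are illustrative):
import pandas as pd

years = sorted(DF1['YEAR'].unique())               # the range of years in the frame
per_year = [DF1[DF1['YEAR'] == y] for y in years]  # one dataframe per year via a mask
stacked = pd.concat(per_year, axis=0)              # rows now grouped by year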
So I'm trying to apply different conditions depending on the date, months to be specific. For example, for January replace the data in TEMP that is above 45, but for February the data above 30, and so on. I already did that with the code below, but the problem is that the data from the previous month gets replaced with NaN.
This is my code:
meses = ["01", "02"]
for i in var_vars:
    if i in dataframes2.columns.values:
        for j in range(len(meses)):
            test_prueba_mes = dataframes2[i].loc[dataframes2['fecha'].dt.month == int(meses[j])]
            test_prueba = test_prueba_mes[dataframes2[i] < dataframes.loc[i]["X" + meses[j] + ".max"]]
            dataframes2["Prueba " + str(i)] = test_prueba
Output:
dataframes2.tail(5)
fecha TEMP_C_Avg RH_Avg Prueba TEMP_C_Avg Prueba RH_Avg
21 2020-01-01 22:00:00 46.0 103 NaN NaN
22 2020-01-01 23:00:00 29.0 103 NaN NaN
23 2020-01-02 00:00:00 31.0 3 NaN NaN
24 2020-01-02 12:00:00 31.0 2 NaN NaN
25 2020-02-01 10:00:00 29.0 5 29.0 5.0
My desired output is:
fecha TEMP_C_Avg RH_Avg Prueba TEMP_C_Avg Prueba RH_Avg
21 2020-01-01 22:00:00 46.0 103 NaN NaN
22 2020-01-01 23:00:00 29.0 103 29.0 NaN
23 2020-01-02 00:00:00 31.0 3 31.0 3.0
24 2020-01-02 12:00:00 31.0 2 31.0 2.0
25 2020-02-01 10:00:00 29.0 5 29.0 5.0
I'd appreciate it if anyone can help me.
Update: the ruleset for the 6 months is jan 45, feb 30, mar 45, apr 10, may 15, jun 30.
An example of the data:
fecha TEMP_C_Avg RH_Avg
25 2020-02-01 10:00:00 29.0 5
26 2020-02-01 11:00:00 32.0 105
27 2020-03-01 10:00:00 55.0 3
28 2020-03-01 11:00:00 40.0 5
29 2020-04-01 10:00:00 10.0 20
30 2020-04-01 11:00:00 5.0 15
31 2020-05-01 10:00:00 20.0 15
32 2020-05-01 11:00:00 5.0 106
33 2020-06-01 10:00:00 33.0 107
34 2020-06-01 11:00:00 20.0 20
With a clear understanding of the requirement:
- the monthly limits are encoded into a dict limits
- numpy select() is used: when a condition matches, it takes the value corresponding to that condition from the second parameter, and defaults to the third parameter otherwise
- the conditions are built dynamically from the limits dict
- the second parameter needs to be the same length as the conditions list, so the list of np.nan is built with a list comprehension to get the correct length
- to cover all columns, a dict comprehension builds the **kwargs params passed to assign()
import io
import numpy as np
import pandas as pd

df = pd.read_csv(io.StringIO("""fecha  TEMP_C_Avg  RH_Avg
25  2020-02-01 10:00:00  29.0  5
26  2020-02-01 11:00:00  32.0  105
27  2020-03-01 10:00:00  55.0  3
28  2020-03-01 11:00:00  40.0  5
29  2020-04-01 10:00:00  10.0  20
30  2020-04-01 11:00:00  5.0  15
31  2020-05-01 10:00:00  20.0  15
32  2020-05-01 11:00:00  5.0  106
33  2020-06-01 10:00:00  33.0  107
34  2020-06-01 11:00:00  20.0  20"""), sep=r"\s\s+", engine="python")
df.fecha = pd.to_datetime(df.fecha)
# The ruleset for 6 months is jan 45, feb 30, mar 45, apr 10, may 15, jun 30
limits = {1:45, 2:30, 3:45, 4:10, 5:15, 6:30}
df = df.assign(**{f"Prueba {c}":np.select( # construct target column name
# build a condition for each of the month limits
[df.fecha.dt.month.eq(m) & df[c].gt(l) for m,l in limits.items()],
[np.nan for m in limits.keys()], # NaN if beyond limit
df[c]) # keep value if within limits
for c in df.columns if "Avg" in c}) # do calc for all columns that have "Avg" in name
                  fecha  TEMP_C_Avg  RH_Avg  Prueba TEMP_C_Avg  Prueba RH_Avg
25  2020-02-01 10:00:00        29.0       5               29.0            5.0
26  2020-02-01 11:00:00        32.0     105                NaN            NaN
27  2020-03-01 10:00:00        55.0       3                NaN            3.0
28  2020-03-01 11:00:00        40.0       5               40.0            5.0
29  2020-04-01 10:00:00        10.0      20               10.0            NaN
30  2020-04-01 11:00:00         5.0      15                5.0            NaN
31  2020-05-01 10:00:00        20.0      15                NaN           15.0
32  2020-05-01 11:00:00         5.0     106                5.0            NaN
33  2020-06-01 10:00:00        33.0     107                NaN            NaN
34  2020-06-01 11:00:00        20.0      20               20.0           20.0
I have two dataframes with a number of the same column headers in each.
I'm looking to merge both dataframes but only use data from dataframe B if no data is available in dataframe A, i.e. dataframe B holds default values which should be used whenever dataframe A has no data.
Dataframe A
A B C
01/01/2020 78 45 78
02/01/2020 41 36 51
03/01/2020 81 43 51
04/01/2020 84 NaN NaN
05/01/2020 NaN NaN NaN
.
.
.
.
31/01/2022 NaN NaN NaN
Dataframe B:
A B C
01/01/2020 40 30 60
02/01/2020 40 30 60
03/01/2020 40 30 60
04/01/2020 40 30 60
.
.
.
.
31/01/2025 40 30 60
Example 04/01/2020 would read;
04/01/2020 84 30 60
Any form of join/merge I do seems to overwrite values incorrectly.
Any help much appreciated!
Assume df1
A B C
date
01/01/2020 78.0 45.0 78.0
02/01/2020 41.0 36.0 51.0
03/01/2020 81.0 43.0 51.0
04/01/2020 84.0 NaN NaN
05/01/2020 NaN NaN NaN
and df2
A B C
date
01/01/2020 40 30 60
02/01/2020 40 30 60
03/01/2020 40 30 60
04/01/2020 40 30 60
05/01/2020 40 30 60
With both having date as the index:
df3 = df1.fillna(df2)
A B C
date
01/01/2020 78.0 45.0 78.0
02/01/2020 41.0 36.0 51.0
03/01/2020 81.0 43.0 51.0
04/01/2020 84.0 30.0 60.0
05/01/2020 40.0 30.0 60.0
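As a side note, since both frames share the same index and columns, df1.combine_first(df2) gives the same result here: it keeps the values of df1 and falls back to df2 wherever df1 has NaN.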
I plotted a data frame like this:
Date Quote-Spread
0 2013-11-17 2.0
1 2013-12-10 8.0
2 2013-12-11 8.0
3 2014-06-01 5.0
4 2014-06-23 15.0
5 2014-06-24 45.0
6 2014-06-25 10.0
7 2014-06-28 20.0
8 2014-09-13 50000.0
9 2015-03-30 250000.0
10 2016-04-02 103780.0
11 2016-04-03 119991.0
12 2016-04-04 29994.0
13 2016-04-05 69993.0
14 2016-04-06 39997.0
15 2016-04-09 490321.0
16 2016-04-10 65485.0
17 2016-04-11 141470.0
18 2016-04-12 109939.0
19 2016-04-13 29983.0
20 2016-04-16 39964.0
21 2016-04-17 39964.0
22 2016-04-18 79920.0
23 2016-04-19 29997.0
24 2016-04-20 108414.0
25 2016-04-23 126849.0
26 2016-04-24 206853.0
27 2016-04-25 37559.0
28 2016-04-26 22817.0
29 2016-04-27 37506.0
30 2016-04-30 37597.0
31 2016-05-01 18799.0
32 2016-05-02 18799.0
33 2016-05-03 9400.0
34 2016-05-07 29890.0
35 2016-05-08 29193.0
36 2016-05-09 7792.0
37 2016-05-10 3199.0
38 2016-05-11 8538.0
39 2016-05-14 49937.0
I use this command to plot them in IPython:
df2.plot(x= 'Date', y='Quote-Spread')
plt.show()
But my figure is plotted like this:
As you can see, on day 2016-04-23 the Quote-Spread has a value of about 126,000, but in the plot it shows as just zero.
My whole plot is like this:
Here is my code of original data:
Sachad = df.loc[df['SID']== 40065016131938148]
#Drop rows with any zero
df1 = df1[~(df1 == 0).any(axis = 1)]
df1['Quote-Spread'] = (df1['SellPrice'].mask(df1['SellPrice'].eq(0))-
df1['BuyPrice'].mask(df1['BuyPrice'].eq(0))).abs()
df2 = df1.groupby('Date' , as_index = False )['Quote-Spread'].mean()
df2.plot(x= 'Date', y='Quote-Spread')
plt.show()
Another question: how can I restrict the plot to specific dates, say between 2014-04-01 and 2016-06-01, and draw vertical red lines at 2014-06-06 and 2016-01-06?
Please provide the code that produced the working plot. Any warning messages?
As for your last questions: to select the rows you want, you can simply use the > and < operators to compare datetimes in a boolean mask.
For vertical lines, you can use plt.axvline(x=date, color='r'), as in the sketch below.
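A minimal sketch, assuming df2['Date'] is already a datetime column (names follow the sample above):
import matplotlib.pyplot as plt
import pandas as pd

# keep only the rows inside the window of interest
mask = (df2['Date'] > '2014-04-01') & (df2['Date'] < '2016-06-01')
ax = df2.loc[mask].plot(x='Date', y='Quote-Spread')

# vertical red lines at the two requested dates
ax.axvline(x=pd.Timestamp('2014-06-06'), color='r')
ax.axvline(x=pd.Timestamp('2016-01-06'), color='r')
plt.show()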
I have a file-like object generated from StringIO which is a table with lines of information ahead of the table (see below, starting from #TIMESTAMP).
I want to add extra columns to the existing table using the information "Date" and "UTCOffset - Time (subtraction)" from #TIMESTAMP and "ZenAngle" from #GLOBAL_SUMMARY.
I used the pd.read_csv command to read it, but it only worked when I skipped the first 8 rows, which include the information I need. Also, the error "TypeError: data argument can't be an iterator" was reported when I tried to import the object below as a dataframe.
#TIMESTAMP
UTCOffset,Date,Time
+00:30:32,2011-09-05,08:32:21
#GLOBAL_SUMMARY
Time,IntACGIH,IntCIE,ZenAngle,MuValue,AzimAngle,Flag,TempC,O3,Err_O3,SO2,Err_SO2,F324
08:32:21,7.3576,52.758,59.109,1.929,114.427,000000,24,291,1,,,91.9
#GLOBAL
Wavelength,S-Irradiance,Time
290.0,0.000e+00
290.5,0.000e+00
291.0,4.380e-06
291.5,2.234e-05
292.0,2.102e-05
292.5,2.204e-05
293.0,2.453e-05
293.5,2.256e-05
294.0,3.088e-05
294.5,4.676e-05
295.0,3.384e-05
295.5,3.582e-05
296.0,4.298e-05
296.5,3.774e-05
297.0,4.779e-05
297.5,7.399e-05
298.0,9.214e-05
298.5,1.080e-04
299.0,2.143e-04
299.5,3.180e-04
300.0,3.337e-04
300.5,4.990e-04
301.0,8.688e-04
301.5,1.210e-03
302.0,1.133e-03
I think you can first use read_csv to create 3 DataFrames:
import pandas as pd
import io
temp=u"""#TIMESTAMP
UTCOffset,Date,Time
+00:30:32,2011-09-05,08:32:21
#GLOBAL_SUMMARY
Time,IntACGIH,IntCIE,ZenAngle,MuValue,AzimAngle,Flag,TempC,O3,Err_O3,SO2,Err_SO2,F324
08:32:21,7.3576,52.758,59.109,1.929,114.427,000000,24,291,1,,,91.9
#GLOBAL
Wavelength,S-Irradiance,Time
290.0,0.000e+00
290.5,0.000e+00
291.0,4.380e-06
291.5,2.234e-05
292.0,2.102e-05
292.5,2.204e-05
293.0,2.453e-05
293.5,2.256e-05
294.0,3.088e-05
294.5,4.676e-05
295.0,3.384e-05
295.5,3.582e-05
296.0,4.298e-05
296.5,3.774e-05
297.0,4.779e-05
297.5,7.399e-05
298.0,9.214e-05
298.5,1.080e-04
299.0,2.143e-04
299.5,3.180e-04
300.0,3.337e-04
300.5,4.990e-04
301.0,8.688e-04
301.5,1.210e-03
302.0,1.133e-03
"""
df1 = pd.read_csv(io.StringIO(temp),
skiprows=9)
print (df1)
Wavelength S-Irradiance Time
0 290.0 0.000000 NaN
1 290.5 0.000000 NaN
2 291.0 0.000004 NaN
3 291.5 0.000022 NaN
4 292.0 0.000021 NaN
5 292.5 0.000022 NaN
6 293.0 0.000025 NaN
7 293.5 0.000023 NaN
8 294.0 0.000031 NaN
9 294.5 0.000047 NaN
10 295.0 0.000034 NaN
11 295.5 0.000036 NaN
12 296.0 0.000043 NaN
13 296.5 0.000038 NaN
14 297.0 0.000048 NaN
15 297.5 0.000074 NaN
16 298.0 0.000092 NaN
17 298.5 0.000108 NaN
18 299.0 0.000214 NaN
19 299.5 0.000318 NaN
20 300.0 0.000334 NaN
21 300.5 0.000499 NaN
22 301.0 0.000869 NaN
23 301.5 0.001210 NaN
24 302.0 0.001133 NaN
df2 = pd.read_csv(io.StringIO(temp),
skiprows=1,
nrows=1)
print (df2)
UTCOffset Date Time
0 +00:30:32 2011-09-05 08:32:21
df3 = pd.read_csv(io.StringIO(temp),
skiprows=5,
nrows=1)
print (df3)
Time IntACGIH IntCIE ZenAngle MuValue AzimAngle Flag TempC O3 \
0 08:32:21 7.3576 52.758 59.109 1.929 114.427 0 24 291
Err_O3 SO2 Err_SO2 F324
0 1 NaN NaN 91.9
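From there, a minimal sketch of attaching the header information to the main table (the column names are taken from the sample data above):
df1['Date'] = df2.loc[0, 'Date']
df1['UTCOffset'] = df2.loc[0, 'UTCOffset']
df1['ZenAngle'] = df3.loc[0, 'ZenAngle']
Since df2 and df3 each hold a single row, assigning the scalar broadcasts it down every row of df1.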