I have this dataframe:
SRC Coup Vint Bal Mar Apr May Jun Jul BondSec
0 JPM 1.5 2021 43.9 5.6 4.9 4.9 5.2 4.4 FNCL
1 JPM 1.5 2020 41.6 6.2 6.0 5.6 5.8 4.8 FNCL
2 JPM 2.0 2021 503.9 7.1 6.3 5.8 6.0 4.9 FNCL
3 JPM 2.0 2020 308.3 9.3 7.8 7.5 7.9 6.6 FNCL
4 JPM 2.5 2021 345.0 8.6 7.8 6.9 6.8 5.6 FNCL
5 JPM 4.5 2010 5.7 21.3 20.0 18.0 17.7 14.6 G2SF
6 JPM 5.0 2019 2.8 39.1 37.6 34.6 30.8 24.2 G2SF
7 JPM 5.0 2018 7.3 39.8 37.1 33.4 30.1 24.2 G2SF
8 JPM 5.0 2010 3.9 23.3 20.0 18.6 17.9 14.6 G2SF
9 JPM 5.0 2009 4.2 22.8 21.2 19.5 18.6 15.4 G2SF
I want to duplicate all the rows that have FNCL as the BondSec, and rename the value of BondSec in those new duplicate rows to FGLMC. I'm able to accomplish half of that with the following code:
if "FGLMC" not in jpm['BondSec']:
is_FNCL = jpm['BondSec'] == "FNCL"
FNCL_try = jpm[is_FNCL]
jpm.append([FNCL_try]*1,ignore_index=True)
But if I instead try to implement the change to the BondSec value in the same line as below:
jpm.append(([FNCL_try]*1).assign(**{'BondSecurity': 'FGLMC'}),ignore_index=True)
I get the following error:
AttributeError: 'list' object has no attribute 'assign'
Additionally, I would like to insert the duplicated rows based on an index condition, not just at the bottom as additional rows. The condition cannot be simply a row position because this will have to work on future files with different numbers of rows. So I would like to insert the duplicated rows at the position where the BondSec column values change from FNCL to FNCI (FNCI is not showing here, but basically it would be right below the last row with FNCL). I'm assuming this could be done with an np.where function call, but I'm not sure how to implement that.
I'll also eventually want to do this same exact process with rows with FNCI as the BondSec value (duplicating them and transforming the BondSec value to FGCI, and inserting at the index position right below the last row with FNCI as the value).
I'd suggest a helper function to handle all your duplications:
def duplicate_and_rename(df, target, value):
    return pd.concat([df, df[df["BondSec"] == target].assign(BondSec=value)])
Then
for target, value in (("FNCL", "FGLMC"), ("FNCI", "FGCI")):
    df = duplicate_and_rename(df, target, value)
Then after all that, you can categorize the BondSec column and use a custom order:
ordering = ["FNCL", "FGLMC", "FNCI", "FGCI", "G2SF"]
df["BondSec"] = pd.Categorical(df["BondSec"], ordering).sort_values()
df = df.reset_index(drop=True)
Alternatively, you can use a dictionary for your ordering, as explained in this answer.
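A minimal sketch of that dictionary alternative (the rank mapping below just mirrors the ordering list above, and the key argument of sort_values requires pandas 1.1+):
order = {"FNCL": 0, "FGLMC": 1, "FNCI": 2, "FGCI": 3, "G2SF": 4}
df = df.sort_values("BondSec", key=lambda s: s.map(order)).reset_index(drop=True)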
I am trying to make a graph that shows the average temperature for each day of the year, averaged over 19 years of NOAA data. (Side note: is there a better way to get historical weather data? NOAA's seems super inconsistent.) I was wondering what the best way to set up the data would be. The relevant columns of my data look like this:
DATE PRCP TAVG TMAX TMIN TOBS
0 1990-01-01 17.0 NaN 13.3 8.3 10.0
1 1990-01-02 0.0 NaN NaN NaN NaN
2 1990-01-03 0.0 NaN 13.3 2.8 10.0
3 1990-01-04 0.0 NaN 14.4 2.8 10.0
4 1990-01-05 0.0 NaN 14.4 2.8 11.1
... ... ... ... ... ... ...
10838 2019-12-27 0.0 NaN 15.0 4.4 13.3
10839 2019-12-28 0.0 NaN 14.4 5.0 13.9
10840 2019-12-29 3.6 NaN 15.0 5.6 14.4
10841 2019-12-30 0.0 NaN 14.4 6.7 12.2
10842 2019-12-31 0.0 NaN 15.0 6.7 13.9
10843 rows × 6 columns
The DATE column is the datetime64[ns] type
Here's my code:
import pandas as pd
from matplotlib import pyplot as plt
data = pd.read_csv('1990-2019.csv')
# separate the data by station
oceanside = data[data.STATION == 'USC00047767']
downtown = data[data.STATION == 'USW00023272']
oceanside.loc[:,'DATE'] = pd.to_datetime(oceanside.loc[:,'DATE'],format='%Y-%m-%d')
#This is the area I need help with:
oceanside['DATE'].dt.year
I've been trying to separate the data by year, so I can then average it. I would like to do this without using a for loop because I plan on doing this with much larger data sets and that would be super inefficient. I looked in the pandas documentation but I couldn't find a function that seemed like it would do that. Am I missing something? Is that even the right way to do it?
I am new to pandas/python data analysis so it is very possible the answer is staring me in the face.
Any help would be greatly appreciated!
Create a dict of dataframes where each key is a year:
df_by_year = dict()
for year in oceanside.DATE.dt.year.unique():
    df_by_year[year] = oceanside[oceanside.DATE.dt.year == year]
Get data for a single year:
oceanside[oceanside.DATE.dt.year == 2019]
Get the average for each year:
oceanside.groupby(oceanside.DATE.dt.year).mean(numeric_only=True)
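If the end goal is the average for each calendar day across all the years (rather than one average per year), a hedged sketch of the same groupby idea keyed on month and day instead (TMAX is just an example column):
daily_avg = oceanside.groupby([oceanside.DATE.dt.month, oceanside.DATE.dt.day]).TMAX.mean()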
I've had a really tough time figuring out how to reshape this DataFrame. Sorry about the wording of the question, this problem seems a bit specific.
I have data on several countries along with a column of 6 repeating features and the year this data was recorded. It looks something like this (minus some features and columns):
Country Feature 2005 2006 2007 2008 2009
0 Afghanistan Age Dependency 99.0 99.5 100.0 100.2 100.1
1 Afghanistan Birth Rate 44.9 43.9 42.8 41.6 40.3
2 Afghanistan Death Rate 10.7 10.4 10.1 9.8 9.5
3 Albania Age Dependency 53.5 52.2 50.9 49.7 48.7
4 Albania Birth Rate 12.3 11.9 11.6 11.5 11.6
5 Albania Death Rate 5.95 6.13 6.32 6.51 6.68
There doesn't seem to be any way to make pivot_table() work in this situation and I'm having trouble finding what other steps I can take to make it look how I want:
Age Dependency Birth Rate Death Rate
Afghanistan 2005 99.0 44.9 10.7
2006 99.5 43.9 10.4
2007 100.0 42.8 10.1
2008 100.2 41.6 9.8
2009 100.1 40.3 9.5
Albania 2005 53.5 12.3 5.95
2006 52.2 11.9 6.13
2007 50.9 11.6 6.32
2008 49.7 11.5 6.51
2009 48.7 11.6 6.68
Where the unique values of the 'Feature' column each become a column and the year columns each become part of a multiIndex with the country. Any help is appreciated, thank you!
EDIT: I checked the "duplicate" but I don't see how that question is the same as this one. How would I place the repeated values within my feature column as unique columns while at the same time moving the years to become a multi index with the countries? Sorry if I'm just not getting something.
Use melt, then reshape with set_index and unstack:
df = (df.melt(['Country','Feature'], var_name='year')
        .set_index(['Country','year','Feature'])['value']
        .unstack())
print (df)
Feature Age Dependency Birth Rate Death Rate
Country year
Afghanistan 2005 99.0 44.9 10.70
2006 99.5 43.9 10.40
2007 100.0 42.8 10.10
2008 100.2 41.6 9.80
2009 100.1 40.3 9.50
Albania 2005 53.5 12.3 5.95
2006 52.2 11.9 6.13
2007 50.9 11.6 6.32
2008 49.7 11.5 6.51
2009 48.7 11.6 6.68
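As a hedged alternative sketch, the same shape can be reached with pivot_table after melting, in case that route is preferred (each Country/year/Feature combination is unique, so the default mean aggregation just passes the values through):
df = (df.melt(['Country','Feature'], var_name='year')
        .pivot_table(index=['Country','year'], columns='Feature', values='value'))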
My data file is like this:
abb
sdsdfmn
sfdf sdf
2011-12-05 11:00 1.0 9.0
2011-12-05 12:00 44.9 2.0
2011-12-05 13:00 66.8 4.2
2011-12-05 14:00 22.8 1.0 26.2 45.2 2.3
2011-12-05 15:00 45.7 2.0 45.0 45.6 1.4
2011-12-05 16:00 23.2 3.0 456.2 11.7 1.5
2011-12-05 17:00 67.4 4.0 999.1 45.8 0.9
2011-12-05 18:00 34.4 1.2
2011-12-05 19:00 12.4 4.2 345.1 11.1 7.6
I used numpy genfromtxt:
data = np.genfromtxt('data.txt', usecols=(0,1,3), skip_header=4, dtype=[('date','S10'),('hour','S5'),('myfloat','f8')])
The problem is that column 3 has some empty values (at the beginning and later on), so the wrong column gets read.
I tried the delimiter parameter, because all the float columns have a fixed width (delimiter=[10,5,5]), but that also fails.
Is there a workaround?
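One hedged sketch of a workaround, assuming the columns really are fixed width: pass a list of field widths as delimiter so every line splits into the same number of fields, and the empty trailing fields on short lines are then read as missing and filled with nan. The widths below are guesses from the sample and would need adjusting to the real file:
import numpy as np
data = np.genfromtxt('data.txt', skip_header=4, autostrip=True,
                     delimiter=[10, 6, 6, 6],  # assumed widths: date, hour, first two floats
                     usecols=(0, 1, 3),
                     dtype=[('date', 'S10'), ('hour', 'S5'), ('myfloat', 'f8')])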
I have a file with some metadata, and then some actual data consisting of 2 columns with headings. Do I need to separate the two types of data before using genfromtxt in numpy? Or can I somehow split the data maybe? What about placing the file pointer to the end of the line just above the headers, and then trying genfromtxt from there? Thanks
The format of the file is shown below:
&SRS
<MetaDataAtStart>
multiple=True
Wavelength (Angstrom)=0.97587
mode=assessment
background=True
issid=py11n2g
noisy=True
</MetaDataAtStart>
&END
Two Theta(deg) Counts(sec^-1)
10.0 41.0
10.1 39.0
10.2 38.0
10.3 38.0
10.4 41.0
10.5 42.0
10.6 38.0
10.7 44.0
10.8 42.0
10.9 39.0
11.0 37.0
11.1 37.0
11.2 45.0
11.3 36.0
11.4 37.0
11.5 37.0
11.6 40.0
11.7 44.0
11.8 45.0
11.9 46.0
12.0 44.0
12.1 40.0
12.2 41.0
12.3 39.0
12.4 41.0
If you don't want the first n rows, try (if there is no missing data):
data = numpy.loadtxt(yourFileName,skiprows=n)
or (if there are missing data):
data = numpy.genfromtxt(yourFileName, skip_header=n)
If you then want to parse the header information, you can go back and open the file parse the header, for example:
fh = open(yourFileName, 'r')
for i, line in enumerate(fh):
    if i == n:
        break
    do_other_stuff_to_header(line)
fh.close()
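Applied to the file above, a hedged sketch (the filename is hypothetical, and skip_header=11 assumes the ten metadata lines plus the column-header line are all that precede the numbers):
import numpy as np
# skip the metadata block and the column-header line, then unpack the two data columns
two_theta, counts = np.genfromtxt('scan.dat', skip_header=11, unpack=True)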
So I've pulled data from an SQL server and loaded it into a dataframe. All the data is discrete and increases in 0.1 steps in one direction (0.0, 0.1, 0.2, ..., 9.8, 9.9, 10.0), with multiple power values at each step (e.g. 1000, 1412, 134.5, 657.1 at 0.1; 14.5, 948.1, 343.8 at 5.5) - hopefully you see what I'm trying to say.
I've managed to group the data into these individual steps using the following, and have then taken the mean and standard deviation for each group.
group = df.groupby('step').power.mean()
group2 = df.groupby('step').power.std().fillna(0)
This results in two Series (group and group2) holding the mean and standard deviation for each of the 0.1 steps. It's then easy to create an upper and lower limit for each step using the following:
upperlimit = group + 3*group2
lowerlimit = group - 3*group2
lowerlimit[lowerlimit<0] = 0
Now comes the bit I'm confused about! I need to go back into the original dataframe and remove rows/instances where the power value is outside these calculated limits (note there is a different upper and lower limit for each 0.1 step).
Here's 50 lines of the sample data:
Index Power Step
0 106.0 5.0
1 200.4 5.5
2 201.4 5.6
3 226.9 5.6
4 206.8 5.6
5 177.5 5.3
6 124.0 4.9
7 121.0 4.8
8 93.9 4.7
9 135.6 5.0
10 211.1 5.6
11 265.2 6.0
12 281.4 6.2
13 417.9 6.9
14 546.0 7.4
15 619.9 7.9
16 404.4 7.1
17 241.4 5.8
18 44.3 3.9
19 72.1 4.6
20 21.1 3.3
21 6.3 2.3
22 0.0 0.8
23 0.0 0.9
24 0.0 3.2
25 0.0 4.6
26 33.3 4.2
27 97.7 4.7
28 91.0 4.7
29 105.6 4.8
30 97.4 4.6
31 126.7 5.0
32 134.3 5.0
33 133.4 5.1
34 301.8 6.3
35 298.5 6.3
36 312.1 6.5
37 505.3 7.5
38 491.8 7.3
39 404.6 6.8
40 324.3 6.6
41 347.2 6.7
42 365.3 6.8
43 279.7 6.3
44 351.4 6.8
45 350.1 6.7
46 573.5 7.9
47 490.1 7.5
48 520.4 7.6
49 548.2 7.9
To put your goal another way: you want to perform some manipulations on grouped data, and then project the results of those manipulations back to the ungrouped rows so you can use them for filtering those rows. One way to do this is with transform:
The transform method returns an object that is indexed the same (same size) as the one being grouped. Thus, the passed transform function should return a result that is the same size as the group chunk.
You can then create the new columns directly:
grouped = df.groupby('step').power
df['upper'] = grouped.transform('mean') + 3 * grouped.transform('std').fillna(0)
df['lower'] = grouped.transform('mean') - 3 * grouped.transform('std').fillna(0)
df.loc[df['lower'] < 0, 'lower'] = 0
And filter accordingly:
df = df[(df.power <= df['upper']) & (df.power >= df['lower'])]
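A hedged one-step variant of the same idea, building the bounds as plain Series instead of extra columns (df_filtered is just an illustrative name):
g = df.groupby('step').power
mean = g.transform('mean')
std = g.transform('std').fillna(0)       # std is NaN for single-row steps
lower = (mean - 3 * std).clip(lower=0)   # clamp the lower limit at zero
upper = mean + 3 * std
df_filtered = df[df.power.between(lower, upper)]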