I have a file with some metadata, followed by the actual data consisting of 2 columns with headings. Do I need to separate the two types of data before using genfromtxt in numpy, or can I somehow split the data? What about placing the file pointer at the end of the line just above the headers and then calling genfromtxt from there? Thanks
The format of the file is shown below:
&SRS
<MetaDataAtStart>
multiple=True
Wavelength (Angstrom)=0.97587
mode=assessment
background=True
issid=py11n2g
noisy=True
</MetaDataAtStart>
&END
Two Theta(deg) Counts(sec^-1)
10.0 41.0
10.1 39.0
10.2 38.0
10.3 38.0
10.4 41.0
10.5 42.0
10.6 38.0
10.7 44.0
10.8 42.0
10.9 39.0
11.0 37.0
11.1 37.0
11.2 45.0
11.3 36.0
11.4 37.0
11.5 37.0
11.6 40.0
11.7 44.0
11.8 45.0
11.9 46.0
12.0 44.0
12.1 40.0
12.2 41.0
12.3 39.0
12.4 41.0
If you don't want the first n rows, try (if there is no missing data):
data = numpy.loadtxt(yourFileName,skiprows=n)
or (if there are missing data):
data = numpy.genfromtxt(yourFileName, skip_header=n)
If you then want to parse the header information, you can go back, open the file, and parse the header, for example:
fh = open(yourFileName, 'r')
for i, line in enumerate(fh):
    if i == n:
        break
    do_other_stuff_to_header(line)
fh.close()
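Putting the two steps together, here is a minimal sketch, assuming the file is saved as scan.dat (a hypothetical name) and that, as in the sample above, the metadata block plus the column-heading line occupy the first 13 lines:
import numpy as np

fname = "scan.dat"  # hypothetical filename
n = 13  # lines before the numeric data: the metadata block plus the heading line

# numeric part: two columns, skipping the metadata and the heading line
data = np.genfromtxt(fname, skip_header=n)
two_theta, counts = data[:, 0], data[:, 1]

# metadata part: collect the key=value lines into a dict
metadata = {}
with open(fname) as fh:
    for i, line in enumerate(fh):
        if i == n:
            break
        if "=" in line:
            key, _, value = line.partition("=")
            metadata[key.strip()] = value.strip()

print(metadata.get("Wavelength (Angstrom)"), two_theta[:3], counts[:3])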
I have this dataframe:
SRC Coup Vint Bal Mar Apr May Jun Jul BondSec
0 JPM 1.5 2021 43.9 5.6 4.9 4.9 5.2 4.4 FNCL
1 JPM 1.5 2020 41.6 6.2 6.0 5.6 5.8 4.8 FNCL
2 JPM 2.0 2021 503.9 7.1 6.3 5.8 6.0 4.9 FNCL
3 JPM 2.0 2020 308.3 9.3 7.8 7.5 7.9 6.6 FNCL
4 JPM 2.5 2021 345.0 8.6 7.8 6.9 6.8 5.6 FNCL
5 JPM 4.5 2010 5.7 21.3 20.0 18.0 17.7 14.6 G2SF
6 JPM 5.0 2019 2.8 39.1 37.6 34.6 30.8 24.2 G2SF
7 JPM 5.0 2018 7.3 39.8 37.1 33.4 30.1 24.2 G2SF
8 JPM 5.0 2010 3.9 23.3 20.0 18.6 17.9 14.6 G2SF
9 JPM 5.0 2009 4.2 22.8 21.2 19.5 18.6 15.4 G2SF
I want to duplicate all the rows that have FNCL as the BondSec, and rename the value of BondSec in those new duplicate rows to FGLMC. I'm able to accomplish half of that with the following code:
if "FGLMC" not in jpm['BondSec']:
is_FNCL = jpm['BondSec'] == "FNCL"
FNCL_try = jpm[is_FNCL]
jpm.append([FNCL_try]*1,ignore_index=True)
But if I instead try to implement the change to the BondSec value in the same line as below:
jpm.append(([FNCL_try]*1).assign(**{'BondSecurity': 'FGLMC'}),ignore_index=True)
I get the following error:
AttributeError: 'list' object has no attribute 'assign'
Additionally, I would like to insert the duplicated rows based on an index condition, not just at the bottom as additional rows. The condition cannot be simply a row position because this will have to work on future files with different numbers of rows. So I would like to insert the duplicated rows at the position where the BondSec column values change from FNCL to FNCI (FNCI is not showing here, but basically it would be right below the last row with FNCL). I'm assuming this could be done with an np.where function call, but I'm not sure how to implement that.
I'll also eventually want to do this same exact process with rows with FNCI as the BondSec value (duplicating them and transforming the BondSec value to FGCI, and inserting at the index position right below the last row with FNCI as the value).
I'd suggest a helper function to handle all your duplications:
def duplicate_and_rename(df, target, value):
    return pd.concat([df, df[df["BondSec"] == target].assign(BondSec=value)])
Then
for target, value in (("FNCL", "FGLMC"), ("FNCI", "FGCI")):
    df = duplicate_and_rename(df, target, value)
Then after all that, you can categorize the BondSec column and use a custom order:
ordering = ["FNCL", "FGLMC", "FNCI", "FGCI", "G2SF"]
df["BondSec"] = pd.Categorical(df["BondSec"], ordering).sort_values()
df = df.reset_index(drop=True)
Alternatively, you can use a dictionary for your ordering, as explained in this answer.
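A minimal sketch of that dictionary-based alternative (assuming pandas >= 1.1 for the key argument of sort_values; order_map is an illustrative name):
order_map = {"FNCL": 0, "FGLMC": 1, "FNCI": 2, "FGCI": 3, "G2SF": 4}
# sort rows by the rank assigned to each BondSec value in the dictionary
df = df.sort_values("BondSec", key=lambda s: s.map(order_map), kind="stable").reset_index(drop=True)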
I have multiple dataframes (different years) that look like the following dataframe. Each dataframe contains the share of wealth each id holds, across 1000 equally sized bins on the x-axis (so, for instance, if there are 4,000,000 individuals, each bin represents the sum of 4,000 individuals, in descending order). What I want is to plot this in one chart. I am lacking creativity as to the best way to show these very skewed wealth distributions across different years...
When I look at my dataframe from year 2021, the top 0.1% holds 92% of all wealth. So when I plot it using a bar chart, it looks like just one straight vertical line, and if I use a line chart, it is an L-shaped graph. I was thinking maybe I should use different x-axis bin widths; that is, instead of using 1000 equally sized bins on the x-axis, maybe the top 0.1%, top 0.1-0.5%, top 0.5-1%, 1-5%, 5-10%, 10-20%, ... etc.
If anyone has a good idea, I'd really appreciate it!
x wealth_share_2016
1 0.33430437283205316
2 0.08857907028903435
3 0.05827083476711605
4 0.03862747269456592
5 0.034995688078949164
6 0.025653645763917113
7 0.021026627708501285
8 0.018026751734878957
9 0.01642864468243111
10 0.015728925648574896
11 0.013588290634843092
12 0.01227954727973525
13 0.011382643296594532
14 0.010141965617682762
15 0.008819245941582449
..
1000 0.000000000011221421
x wealth_share_2017
0.0 0.901371131515615
1.0 0.029149650261610725
2.0 0.01448219525035078
3.0 0.00924941242097224
4.0 0.006528547368042855
5.0 0.004915282901262396
6.0 0.0038227195841958007
7.0 0.003202422960559232
8.0 0.0027194902152005056
9.0 0.002256081738439025
10.0 0.001913906326353021
11.0 0.001655920262049755
12.0 0.001497315358785623
13.0 0.0013007783674694787
14.0 0.0011483994993211357
15.0 0.0010006446573525651
16.0 0.0009187314949837794
17.0 0.0008060306765341464
18.0 0.0007121683663280601
19.0 0.0006479765506981805
20.0 0.0006209618807503557
21.0 0.0005522371927723867
22.0 0.0004900821167110386
23.0 0.0004397140637940455
24.0 0.00039311806560654995
25.0 0.0003568253540177216
26.0 0.00033181209459040074
27.0 0.0003194446403240109
28.0 0.0003184084588259308
29.0 0.0003182506069381648
30.0 0.0003148797013444408
31.0 0.0002961487376129427
32.0 0.00027052175379974156
33.0 0.00024743766685454786
34.0 0.0002256857592625916
35.0 0.00020579998427225097
36.0 0.000189038268813506
37.0 0.00017386965729266948
38.0 0.0001613485014690905
39.0 0.0001574132034911388
40.0 0.0001490677750078641
41.0 0.00013790177558791725
42.0 0.0001282878615396144
43.0 0.00012095612436994448
44.0 0.00011214167633915717
45.0 0.00010421673782294511
46.0 9.715626623684205e-05
47.0 9.282271063116496e-05
48.0 8.696571645233427e-05
49.0 8.108410275243205e-05
50.0 7.672762907247785e-05
51.0 7.164556991989368e-05
52.0 6.712091046340094e-05
53.0 6.402983760430654e-05
54.0 6.340827259447476e-05
55.0 6.212579456204865e-05
56.0 6.0479432395632356e-05
57.0 5.871255187231619e-05
58.0 5.6732218205513816e-05
59.0 5.469844909188562e-05
60.0 5.272638831110061e-05
61.0 5.082941624023762e-05
62.0 4.9172657560503e-05
63.0 4.7723292856953955e-05
64.0 4.640794539328976e-05
65.0 4.4830504104868853e-05
66.0 4.33432435988776e-05
67.0 4.17840819038174e-05
68.0 4.0359335324500254e-05
69.0 3.890539627505912e-05
70.0 3.773843593447448e-05
71.0 3.650676651396156e-05
72.0 3.528219096983737e-05
73.0 3.440527767945646e-05
74.0 3.350747980104347e-05
75.0 3.26561659597071e-05
76.0 3.19802966664897e-05
77.0 3.1835209823474306e-05
78.0 3.183429293715699e-05
79.0 3.183429293715699e-05
80.0 3.179465449554639e-05
81.0 3.1754468203569435e-05
82.0 3.1704945367497785e-05
83.0 3.1660515386167146e-05
84.0 3.161204511239972e-05
85.0 3.160031088406889e-05
86.0 3.160031088406889e-05
87.0 3.159054611415194e-05
88.0 3.1527283185355765e-05
89.0 3.1443493604304305e-05
90.0 3.1323353389521874e-05
91.0 3.117894171029721e-05
92.0 3.0954278315859144e-05
93.0 3.057844960395481e-05
94.0 3.014447137763062e-05
95.0 2.9597164606371073e-05
96.0 2.887863910263771e-05
97.0 2.8423195872524498e-05
98.0 2.7793813070448293e-05
99.0 2.7040901735687525e-05
100.0 2.619028564470109e-05
101.0 2.5450004510283205e-05
102.0 2.4855217140189223e-05
103.0 2.403822662596923e-05
104.0 2.3244772756237742e-05
... ...
1000.0 0.000000023425324
Binning these data across irregular percentage ranges is a common way to present such distributions. You can categorize and aggregate the data using pd.cut() with a subsequent groupby():
import pandas as pd
import matplotlib.pyplot as plt
#sample data generation
import numpy as np
rng = np.random.default_rng(123)
n = 1000
df = pd.DataFrame({"x": range(n), "wealth_share_2017": np.sort(rng.pareto(a=100, size=n))[::-1]})
df.loc[0, "wealth_share_2017"] = 50
df["wealth_share_2017"] /= df["wealth_share_2017"].sum()
n = len(df)
#define bins in percent
#the last value is slightly above 100% to ensure that the final bin is included
bins = [0, 0.1, 0.5, 1.0, 10.0, 50.0, 100.01]
#create figure labels for intervals from bins
labels = [f"[{start:.1f}, {stop:.1f})" for start, stop in zip(bins[:-1], bins[1:])]
#categorize data
df["cats"] = pd.cut(df["x"], bins=[n*i/100 for i in bins], include_lowest=True, right=False, labels=labels)
#and aggregate
df_plot = df.groupby(by="cats")["wealth_share_2017"].sum().mul(100)
df_plot.plot.bar(rot=45, xlabel="Income percentile", ylabel="Wealth share (%)", title=df_plot.name)
plt.tight_layout()
plt.show()
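If the goal is a single chart covering several years, here is a small follow-up sketch under the assumption that the yearly share columns (wealth_share_2016, wealth_share_2017, ...) have already been merged into the same DataFrame on x: reuse the same bins and plot the aggregated shares side by side.
year_cols = ["wealth_share_2016", "wealth_share_2017"]  # whichever year columns you merged in
#aggregate every year column over the same percentile bins and plot grouped bars
df_years = df.groupby(by="cats")[year_cols].sum().mul(100)
df_years.plot.bar(rot=45, xlabel="Income percentile", ylabel="Wealth share (%)")
plt.tight_layout()
plt.show()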
time a b
2021-05-23 22:06:54 10.4 70.1
2021-05-23 22:21:41 10.7 68.3
2021-05-23 22:36:28 10.4 69.4
2021-05-23 22:51:15 9.9 71.7
2021-05-23 23:06:02 9.5 73.1
... ... ... ... ... ...
2021-11-19 08:18:31 19.8 43.0
2021-11-19 08:20:04 21.0 42.0
2021-11-19 08:21:25 35.5 20.0
2021-11-19 08:21:32 19.8 43.0
2021-11-19 08:23:05 21.0 42.0
Here, time is in the index, not a column.
When I did df.between_time("2021-11-17 08:15:00", "2021-11-19 08:00:00"),
it throws the error ValueError: Cannot convert arg ['2021-11-17 08:15:00'] to a time.
The data frame does not have a proper timestamp.
What I want to do: when I pass a time range or date range, I want to get all the data between the given times.
Thanks
Use truncate:
>>> df.truncate("2021-05-23 23:00:00", "2021-11-19 08:20:00")
a b
time
2021-05-23 23:06:02 9.5 73.1
2021-11-19 08:18:31 19.8 43.0
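As a side note, between_time only accepts times of day (e.g. "08:15"), which is why passing full timestamps raised the ValueError. A minimal alternative sketch, assuming the index is a sorted DatetimeIndex: plain label slicing with .loc covers the same date-range selection.
# select everything between the two timestamps (index must be sorted)
subset = df.loc["2021-11-17 08:15:00":"2021-11-19 08:00:00"]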
The small reproducible example below sets up a dataframe that is 100 years in length and contains some randomly generated values. It then inserts three 100-day stretches of missing values. Using this small example, I am attempting to sort out the pandas commands that will fill in the missing days using average values for that day of the year (hence the use of .groupby), with a condition. For example, if April 12th is missing, how can the last line of code be altered such that only the 10 nearest April 12ths are used to fill in the missing value? In other words, a missing April 12th value in 1920 would be filled in using the mean of April 12th values between 1915 and 1925; a missing April 12th value in 2000 would be filled in with the mean of April 12th values between 1995 and 2005, etc. I tried playing around with adding a .rolling() to the lambda function in the last line of the script, but was unsuccessful in my attempt.
Bonus question: The example below extends from 1918 to 2018. If a value is missing on April 12th 1919, for example, it would still be nice if ten April 12ths were used to fill in the missing value even though the window couldn't be 'centered' on the missing day because of its proximity to the beginning of the time series. Is there a solution to the first question above that would be flexible enough to still use a minimum of 10 values when missing values are close to the beginning and ending of the time series?
import pandas as pd
import numpy as np
import random
# create 100 yr time series
dates = pd.date_range(start="1918-01-01", end="2018-12-31").strftime("%Y-%m-%d")
vals = [random.randrange(1, 50, 1) for i in range(len(dates))]
# Create some arbitrary gaps
vals[100:200] = vals[9962:10062] = vals[35895:35995] = [np.nan] * 100
# Create dataframe
df = pd.DataFrame(dict(
    list(
        zip(["Date", "vals"],
            [dates, vals])
    )
))
# confirm missing vals
df.iloc[95:105]
df.iloc[35890:35900]
# set a date index (for use by groupby)
df.index = pd.DatetimeIndex(df['Date'])
df['Date'] = df.index
# Need help restricting the mean to the 10 nearest same-days-of-the-year:
df['vals'] = df.groupby([df.index.month, df.index.day])['vals'].transform(lambda x: x.fillna(x.mean()))
This answers both parts:
build a DF dfr that is the calculation you want
the lambda function returns a dict {year: val, ...}
make sure the indexes are named in a reasonable way
expand out the dict with apply(pd.Series)
reshape by putting the year columns back into the index
merge() the built DF with the original DF; the vals column contains the NaNs and the 0 column is the value to fill with
finally fillna()
import random

import numpy as np
import pandas as pd

# create 100 yr time series
dates = pd.date_range(start="1918-01-01", end="2018-12-31")
vals = [random.randrange(1, 50, 1) for i in range(len(dates))]
# Create some arbitrary gaps
vals[100:200] = vals[9962:10062] = vals[35895:35995] = [np.nan] * 100
# Create dataframe - simplified from question...
df = pd.DataFrame({"Date":dates,"vals":vals})
df[df.isna().any(axis=1)]
ystart = df.Date.dt.year.min()
# generate rolling means for month/day. bfill for when it's start of series
dfr = (df.groupby([df.Date.dt.month, df.Date.dt.day])["vals"]
       .agg(lambda s: {y+ystart: v for y, v in enumerate(s.dropna().rolling(5).mean().bfill())})
       .to_frame().rename_axis(["month", "day"])
      )
# expand dict into columns and reshape to by indexed by month,day,year
dfr = dfr.join(dfr.vals.apply(pd.Series)).drop(columns="vals").rename_axis("year",axis=1).stack().to_frame()
# get df index back, plus vals & fillna (column 0) can be seen alongside each other
dfm = df.merge(dfr, left_on=[df.Date.dt.month,df.Date.dt.day,df.Date.dt.year], right_index=True)
# finally what we really want to do - fill the NaNs
df.fillna(dfm[0])
analysis
taking the NaN for 11-Apr-1918, the fill value is 22 as it's backfilled from 1921
(12+2+47+47+2)/5 == 22
dfm.query("key_0==4 & key_1==11").head(7)
      key_0  key_1  key_2                 Date  vals     0
100       4     11   1918  1918-04-11 00:00:00   nan    22
465       4     11   1919  1919-04-11 00:00:00    12    22
831       4     11   1920  1920-04-11 00:00:00     2    22
1196      4     11   1921  1921-04-11 00:00:00    47    27
1561      4     11   1922  1922-04-11 00:00:00    47    36
1926      4     11   1923  1923-04-11 00:00:00     2  34.6
2292      4     11   1924  1924-04-11 00:00:00    37  29.4
I'm not sure how closely I've matched the intent of your question. The approach I've taken satisfies two requirements:
the need for an arbitrary number of averages
using those averages to fill in the NAs
I have addressed this as follows: simply put, instead of filling in the NAs with dates before and after, I fill them in with averages extracted from any number of years in a row.
import pandas as pd
import numpy as np
import random
# create 100 yr time series
dates = pd.date_range(start="1918-01-01", end="2018-12-31").strftime("%Y-%m-%d")
vals = [random.randrange(1, 50, 1) for i in range(len(dates))]
# Create some arbitrary gaps
vals[100:200] = vals[9962:10062] = vals[35895:35995] = [np.nan] * 100
# Create dataframe
df = pd.DataFrame(dict(
    list(
        zip(["Date", "vals"],
            [dates, vals])
    )
))
df['Date'] = pd.to_datetime(df['Date'])
df['mm-dd'] = df['Date'].apply(lambda x:'{:02}-{:02}'.format(x.month, x.day))
df['yyyy'] = df['Date'].apply(lambda x:'{:04}'.format(x.year))
df = df.iloc[:,1:].pivot(index='mm-dd', columns='yyyy')
df.columns = df.columns.droplevel(0)
df['nans'] = df.isnull().sum(axis=1)
df['10n_mean'] = df.iloc[:,:-1].sample(n=10, axis=1).mean(axis=1)
df['10n_mean'] = df['10n_mean'].round(1)
df.loc[df['nans'] >= 1]
yyyy 1918 1919 1920 1921 1922 1923 1924 1925 1926 1927 ... 2011 2012 2013 2014 2015 2016 2017 2018 nans 10n_mean
mm-dd
02-29 NaN NaN 34.0 NaN NaN NaN 2.0 NaN NaN NaN ... NaN 49.0 NaN NaN NaN 32.0 NaN NaN 76 21.6
04-11 NaN 43.0 12.0 28.0 29.0 28.0 1.0 38.0 11.0 3.0 ... 17.0 35.0 8.0 17.0 34.0 NaN 5.0 33.0 3 29.7
04-12 NaN 19.0 38.0 34.0 48.0 46.0 28.0 29.0 29.0 14.0 ... 41.0 16.0 9.0 39.0 8.0 NaN 1.0 12.0 3 21.3
04-13 NaN 33.0 26.0 47.0 21.0 26.0 20.0 16.0 11.0 7.0 ... 5.0 11.0 34.0 28.0 27.0 NaN 2.0 46.0 3 21.3
04-14 NaN 36.0 19.0 6.0 45.0 41.0 24.0 39.0 1.0 11.0 ... 30.0 47.0 45.0 14.0 48.0 NaN 16.0 8.0 3 24.7
df_mean = df.T.fillna(df['10n_mean'], downcast='infer').T
df_mean.loc[df_mean['nans'] >= 1]
yyyy 1918 1919 1920 1921 1922 1923 1924 1925 1926 1927 ... 2011 2012 2013 2014 2015 2016 2017 2018 nans 10n_mean
mm-dd
02-29 21.6 21.6 34.0 21.6 21.6 21.6 2.0 21.6 21.6 21.6 ... 21.6 49.0 21.6 21.6 21.6 32.0 21.6 21.6 76.0 21.6
04-11 29.7 43.0 12.0 28.0 29.0 28.0 1.0 38.0 11.0 3.0 ... 17.0 35.0 8.0 17.0 34.0 29.7 5.0 33.0 3.0 29.7
04-12 21.3 19.0 38.0 34.0 48.0 46.0 28.0 29.0 29.0 14.0 ... 41.0 16.0 9.0 39.0 8.0 21.3 1.0 12.0 3.0 21.3
04-13 21.3 33.0 26.0 47.0 21.0 26.0 20.0 16.0 11.0 7.0 ... 5.0 11.0 34.0 28.0 27.0 21.3 2.0 46.0 3.0 21.3
04-14 24.7 36.0 19.0 6.0 45.0 41.0 24.0 39.0 1.0 11.0 ... 30.0 47.0 45.0 14.0 48.0 24.7 16.0 8.0 3.0 24.7
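For reference, here is a more direct sketch of the centered-window idea mentioned in the question, under these assumptions: df already has the DatetimeIndex set up as in the question's code, and rolling(11, center=True) within each (month, day) group approximates "the 10 nearest same days of the year"; at the ends of the series the window simply shrinks rather than guaranteeing ten values, so it only partially addresses the bonus question.
# per (month, day) group, replace each NaN with the mean of up to the 11
# surrounding years' values for that same calendar day (NaNs are skipped)
df['vals'] = df.groupby([df.index.month, df.index.day])['vals'].transform(
    lambda s: s.fillna(s.rolling(11, center=True, min_periods=1).mean())
)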
df
Out[1]:
PRES HGHT TEMP DWPT RELH MIXR DRCT SKNT THTA THTE THTV
0 978.0 345 17.0 16.5 97 12.22 0 0 292.0 326.8 294.1
1 977.0 354 17.8 16.7 93 12.39 1 0 292.9 328.3 295.1
2 970.0 416 23.4 15.4 61 11.47 4 2 299.1 332.9 301.2
3 963.0 479 24.0 14.0 54 10.54 8 3 300.4 331.6 302.3
4 948.7 610 23.0 13.4 55 10.28 15 6 300.7 331.2 302.5
5 925.0 830 21.4 12.4 56 9.87 20 5 301.2 330.6 303.0
6 916.0 914 20.7 11.7 56 9.51 20 4 301.3 329.7 303.0
7 884.0 1219 18.2 9.2 56 8.31 60 4 301.8 326.7 303.3
8 853.1 1524 15.7 6.7 55 7.24 35 3 302.2 324.1 303.5
9 850.0 1555 15.4 6.4 55 7.14 20 2 302.3 323.9 303.6
10 822.8 1829 13.3 5.6 60 6.98 300 4 302.9 324.0 304.1
How do I interpolate the values of all the columns at specified PRES (pressure) values, say PRES=[950, 900, 875]? Is there an elegant, pandas-style way to do this?
The only way I can think of doing this is to first create empty (NaN) rows for each specified PRES value in a loop, then set PRES as the index, and then use pandas' native interpolate option:
df.interpolate(method='index', inplace=True)
Is there a more elegant solution?
Use your solution without a loop: reindex by the union of the original index values with the PRES list. This works only if all index values are unique:
PRES=[950, 900, 875]
df = df.set_index('PRES')
df = df.reindex(df.index.union(PRES)).sort_index(ascending=False).interpolate(method='index')
print (df)
HGHT TEMP DWPT RELH MIXR DRCT SKNT THTA THTE THTV
978.0 345.0 17.0 16.5 97.0 12.22 0.0 0.0 292.0 326.8 294.1
977.0 354.0 17.8 16.7 93.0 12.39 1.0 0.0 292.9 328.3 295.1
970.0 416.0 23.4 15.4 61.0 11.47 4.0 2.0 299.1 332.9 301.2
963.0 479.0 24.0 14.0 54.0 10.54 8.0 3.0 300.4 331.6 302.3
950.0 1829.0 13.3 5.6 60.0 6.98 300.0 4.0 302.9 324.0 304.1
948.7 610.0 23.0 13.4 55.0 10.28 15.0 6.0 300.7 331.2 302.5
925.0 830.0 21.4 12.4 56.0 9.87 20.0 5.0 301.2 330.6 303.0
916.0 914.0 20.7 11.7 56.0 9.51 20.0 4.0 301.3 329.7 303.0
900.0 1829.0 13.3 5.6 60.0 6.98 300.0 4.0 302.9 324.0 304.1
884.0 1219.0 18.2 9.2 56.0 8.31 60.0 4.0 301.8 326.7 303.3
875.0 1829.0 13.3 5.6 60.0 6.98 300.0 4.0 302.9 324.0 304.1
853.1 1524.0 15.7 6.7 55.0 7.24 35.0 3.0 302.2 324.1 303.5
850.0 1555.0 15.4 6.4 55.0 7.14 20.0 2.0 302.3 323.9 303.6
822.8 1829.0 13.3 5.6 60.0 6.98 300.0 4.0 302.9 324.0 304.1
If the values in the PRES column might not be unique, use concat with sort_index instead:
PRES=[950, 900, 875]
df = df.set_index('PRES')
df = (pd.concat([df, pd.DataFrame(index=PRES)])
        .sort_index(ascending=False)
        .interpolate(method='index'))
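Either way, if you only need the interpolated pressure levels afterwards, a small follow-up sketch:
# select just the requested pressure levels from the interpolated frame
print(df.loc[PRES])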