How to extract numbers from a DataFrame column in Python?

Recently I was working on a data cleaning assignment using an age_of_marriage dataset. While cleaning, I found that the "height" column is of object type, stored in a feet-and-inches format.
I want to extract the feet and inches from the data and convert them to centimetres. I have the formula ready for the conversion, but I am not able to extract the numbers.
I also want to convert the extracted parts to the int datatype before applying the formula. I am stuck at this step.
From df.info():
-------- 2 height 2449 non-null object --------
I am trying to extract the values using string manipulation, but I'm not able to do it. Can anybody help? The column looks like this:
height
5'3"
5'4"
I have attached a github link to access the dataset.
import numpy as np
import pandas as pd
from collections import Counter

agemrg = pd.read_csv('age_of_marriage_data.csv')
# height_list built from the height column (this line is inferred; not shown in the question)
height_list = agemrg['height'].tolist()

for height in range(len(height_list)):
    BrideGroomHeight = height_list[height].rstrip(height_list[height][-1])
    foot_int = int(BrideGroomHeight[0])
    inch_int = int(BrideGroomHeight[2:4])
    print(foot_int)
    print(inch_int)
    if height in ['nan']:
        continue
Output:
5
4
5
7
5
7
5
0
5
5
5
5
5
2
5
5
5
5
5
1
5
3
5
9
5
10
---------------------------------------------------------------------------
AttributeError Traceback (most recent call last)
~\AppData\Local\Temp/ipykernel_12772/2525694838.py in <module>
1 for height in range(len(height_list)):
----> 2 BrideGroomHeight = height_list[height].rstrip(height_list[height][-1])
3 foot_int = int(BrideGroomHeight[0])
4 inch_int = int(BrideGroomHeight[2:4])
5 print(foot_int)
AttributeError: 'float' object has no attribute 'rstrip'
There are some NaN values, due to which I am not able to perform this operation.

You can use .split() to get the feet and inches portion. If you are certain you only have to deal with a few NaN rows, then a simple version could be:
df['height_feet'] = df['height'].dropna().apply(lambda x: str(x).split("'")[0])
df['height_inches'] = df['height'].dropna().apply(lambda x: str(x).split("'")[-1][0:-1])
df[['height', 'height_feet', 'height_inches']]
Basically, the feet portion is the first piece in the split, and the inches portion is the last piece in the split but without the last character.
Output:
>>> print(df[['height', 'height_feet', 'height_inches']])
height height_feet height_inches
0 5'4" 5 4
1 5'7" 5 7
2 5'7" 5 7
3 5'0" 5 0
4 5'5" 5 5
... ... ... ...
2562 5'3" 5 3
2563 5'11" 5 11
2564 5'3" 5 3
2565 4'11" 4 11
2566 5'2" 5 2
[2567 rows x 3 columns]
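If you then want centimetres, the extracted pieces can be cast to numbers and plugged into the standard conversion (1 foot = 30.48 cm, 1 inch = 2.54 cm). A minimal sketch with made-up rows, building on the same split:

```python
import pandas as pd

df = pd.DataFrame({'height': ["5'4\"", "5'7\"", None, "4'11\""]})

# Same split as above, restricted to non-null rows
df['height_feet'] = df['height'].dropna().apply(lambda x: str(x).split("'")[0])
df['height_inches'] = df['height'].dropna().apply(lambda x: str(x).split("'")[-1][0:-1])

# Cast to numeric (the NaN row stays NaN) and apply the conversion
df['height_cm'] = (pd.to_numeric(df['height_feet']) * 30.48
                   + pd.to_numeric(df['height_inches']) * 2.54)
print(df)
```

For example, 5'4" comes out as 162.56 cm, and the NaN row stays NaN instead of raising an error.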

You can use str.extract:
df['height2'] = df['height'].str.extract(r'''(?P<ft>\d*)'(?P<in>\d+)"''') \
.astype(float).mul([30.48, 2.54]).sum(axis=1)
Or str.split and str.strip:
df['height3'] = df['height'].str.rstrip('"').str.split("'", expand=True) \
.astype(float).mul([30.48, 2.54]).sum(axis=1)
Output:
>>> df.filter(like='height')
height height2 height3
0 5'4" 162.56 162.56
1 5'7" 170.18 170.18
2 5'7" 170.18 170.18
3 5'0" 152.40 152.40
4 5'5" 165.10 165.10
... ... ... ...
2562 5'3" 160.02 160.02
2563 5'11" 180.34 180.34
2564 5'3" 160.02 160.02
2565 4'11" 149.86 149.86
2566 5'2" 157.48 157.48
[2567 rows x 3 columns]

Related

Trying to find the most efficient way to 'travel' from coordinates where the coordinates are in a pandas dataframe

I have the following dataframe:
id x_coordinate y_coordinate money time (hr)
0 545 0.676576 3.079094 4200 1.414706
1 4138 0.262979 -0.769170 700 0.943230
2 5281 -0.301234 -3.568590 200 1.314108
3 4369 -0.585544 1.610388 11600 0.703957
4 2173 -1.239105 3.168139 29200 0.666473
5 9971 -1.556373 -1.624628 18700 0.776165
6 2622 -1.747544 3.145381 100 0.842138
7 4522 -1.923251 -2.695298 36700 0.186741
8 7299 -2.697775 2.038365 500 0.469136
9 5425 -4.443474 0.428256 1400 0.760269
Say the starting point is the first row. I want to manipulate the dataframe so that the second row is the row whose coordinates are closest to the first row's and the third row is the row whose coordinates are closest to the second row's coordinates out of the remaining rows and so on.
Imagine the coordinates on a graph and I start at (0,0) for example. I am trying to find the most efficient way to 'travel' between these coordinates and end up at (0,0)
The output could be a list of coordinates sorted following the most efficient route for example. Or a new dataframe - just trying to figure out a way to solve this
So far I have tried to sort the df like this:
df = df.sort_values(["x_coordinate", "y_coordinate"], ascending = (False, True))
I have alternatively also tried to convert the df columns to lists using df.to_list() and then sorting them.
However, neither approach gives the result I want. So, any tips on how to go about manipulating a dataframe like this?
Any help is much appreciated!
I think sort alone won't help. You will need to compute the Euclidean distance of these points from your coordinates and then sort on it. That will give you the closest points.
This is what I tried quickly. See if this works; I didn't get a chance to verify the results.
import numpy as np
import pandas as pd
from scipy import spatial
df = pd.DataFrame({
    'id': [545, 4138, 5281, 4369, 2173, 9971, 2622, 4522, 7299, 5425],
    'x_coordinate': [0.676576, 0.262979, -0.301234, -0.585544, -1.239105, -1.556373, -1.747544, -1.923251, -2.697775, -4.443474],
    'y_coordinate': [3.079094, -0.769170, -3.568590, 1.610388, 3.168139, -1.624628, 3.145381, -2.695298, 2.038365, 0.428256],
})
print(df)
id x_coordinate y_coordinate
0 545 0.676576 3.079094
1 4138 0.262979 -0.769170
2 5281 -0.301234 -3.568590
3 4369 -0.585544 1.610388
4 2173 -1.239105 3.168139
5 9971 -1.556373 -1.624628
6 2622 -1.747544 3.145381
7 4522 -1.923251 -2.695298
8 7299 -2.697775 2.038365
9 5425 -4.443474 0.428256
def shortest_neighbour(pt, nebr):
    tree = spatial.KDTree(nebr)
    dist = tree.query(pt, 2)
    return dist[0][1], dist[1][1]
arr=df[['x_coordinate','y_coordinate']].to_numpy().reshape(len(df),2)
df[['Dist','Ord']]=pd.DataFrame(df.apply(lambda row : shortest_neighbour(arr[row.name],arr),axis = 1).tolist(),index=df.index)
print(df)
id x_coordinate y_coordinate Dist Ord
0 545 0.676576 3.079094 1.917749 4
1 4138 0.262979 -0.769170 2.010435 5
2 5281 -0.301234 -3.568590 1.842167 7
3 4369 -0.585544 1.610388 1.689299 4
4 2173 -1.239105 3.168139 0.508948 6
5 9971 -1.556373 -1.624628 1.131783 7
6 2622 -1.747544 3.145381 0.508948 4
7 4522 -1.923251 -2.695298 1.131783 5
8 7299 -2.697775 2.038365 1.458912 6
9 5425 -4.443474 0.428256 2.374851 8
dfs=df.sort_values(by=['Ord','Dist'],ascending =[True,True])
print(dfs)
id x_coordinate y_coordinate Dist Ord
6 2622 -1.747544 3.145381 0.508948 4
3 4369 -0.585544 1.610388 1.689299 4
0 545 0.676576 3.079094 1.917749 4
7 4522 -1.923251 -2.695298 1.131783 5
1 4138 0.262979 -0.769170 2.010435 5
4 2173 -1.239105 3.168139 0.508948 6
8 7299 -2.697775 2.038365 1.458912 6
5 9971 -1.556373 -1.624628 1.131783 7
2 5281 -0.301234 -3.568590 1.842167 7
9 5425 -4.443474 0.428256 2.374851 8
And one more thing I tried: the distance from (0,0). See below if that is what you want.
df['Dist']=df.apply(lambda row: np.linalg.norm(np.zeros(2)-np.array([row.x_coordinate,row.y_coordinate])),axis = 1)
print(df.sort_values(by=['Dist'],ascending =[True]))
id x_coordinate y_coordinate Dist
1 4138 0.262979 -0.769170 0.812884
3 4369 -0.585544 1.610388 1.713538
5 9971 -1.556373 -1.624628 2.249825
0 545 0.676576 3.079094 3.152551
7 4522 -1.923251 -2.695298 3.311122
8 7299 -2.697775 2.038365 3.381260
4 2173 -1.239105 3.168139 3.401836
2 5281 -0.301234 -3.568590 3.581281
6 2622 -1.747544 3.145381 3.598240
9 5425 -4.443474 0.428256 4.464064
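For the route itself (always hop to the nearest not-yet-visited point), a greedy nearest-neighbour pass is one possible sketch. Note this is a heuristic, not a guaranteed-optimal tour, and starting from the first row is an assumption; only the first four rows of the question's data are used here to keep it short:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'id': [545, 4138, 5281, 4369],
    'x_coordinate': [0.676576, 0.262979, -0.301234, -0.585544],
    'y_coordinate': [3.079094, -0.769170, -3.568590, 1.610388],
})

pts = df[['x_coordinate', 'y_coordinate']].to_numpy()
# Full pairwise Euclidean distance matrix
dist = np.linalg.norm(pts[:, None, :] - pts[None, :, :], axis=-1)

order = [0]                                    # assumed starting row
while len(order) < len(pts):
    last = order[-1]
    remaining = [i for i in range(len(pts)) if i not in order]
    order.append(min(remaining, key=lambda i: dist[last, i]))

route = df.iloc[order].reset_index(drop=True)  # rows in visiting order
print(route['id'].tolist())
```

The result is a reordered copy of the dataframe; to return to (0,0) at the end you would simply append the start point after the loop.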

Is there a way to rank some items in a pandas dataframe and exclude others?

I have a pandas dataframe called ranks with my clusters and their key metrics. I rank them using rank(); however, there are two specific clusters which I want ranked differently from the others.
ranks = pd.DataFrame(data={
    'Cluster': ['0', '1', '2', '3', '4', '5', '6', '7', '8', '9'],
    'No. Customers': [145118, 2, 1236, 219847, 9837, 64865, 3855, 219549, 34171, 3924120],
    'Ave. Recency': [39.0197, 47.0, 15.9716, 41.9736, 23.9330, 24.8281, 26.5647, 17.7493, 23.5205, 24.7933],
    'Ave. Frequency': [1.7264, 19.0, 24.9101, 3.0682, 3.2735, 1.8599, 3.9304, 3.3356, 9.1703, 1.1684],
    'Ave. Monetary': [14971.85, 237270.00, 126992.79, 17701.64, 172642.35, 13159.21, 54333.56, 17570.67, 42136.68, 4754.76],
})
ranks['Ave. Spend'] = ranks['Ave. Monetary']/ranks['Ave. Frequency']
Cluster No. Customers| Ave. Recency| Ave. Frequency| Ave. Monetary| Ave. Spend|
0 0 145118 39.0197 1.7264 14,971.85 8,672.07
1 1 2 47.0 19.0 237,270.00 12,487.89
2 2 1236 15.9716 24.9101 126,992.79 5,098.02
3 3 219847 41.9736 3.0682 17,701.64 5,769.23
4 4 9837 23.9330 3.2735 172,642.35 52,738.42
5 5 64865 24.8281 1.8599 13,159.21 7,075.19
6 6 3855 26.5647 3.9304 54,333.56 13,823.64
7 7 219549 17.7493 3.3356 17,570.67 5,267.52
8 8 34171 23.5205 9.1703 42,136.68 4,594.89
9 9 3924120 24.7933 1.1684 4,754.76 4,069.21
I then apply the rank() method like this:
ranks['r_rank'] = ranks['Ave. Recency'].rank()
ranks['f_rank'] = ranks['Ave. Frequency'].rank(ascending=False)
ranks['m_rank'] = ranks['Ave. Monetary'].rank(ascending=False)
ranks['s_rank'] = ranks['Ave. Spend'].rank(ascending=False)
ranks['overall'] = ranks.apply(lambda row: row.r_rank + row.f_rank + row.m_rank + row.s_rank, axis=1)
ranks['overall_rank'] = ranks['overall'].rank(method='first')
Which gives me this:
Cluster No. Customers|Ave. Recency|Ave. Frequency|Ave. Monetary|Ave. Spend|r_rank|f_rank|m_rank|s_rank|overall|overall_rank
0 0 145118 39.0197 1.7264 14,971.85 8,672.07 8 9 8 4 29 9
1 1 2 47.0 19.0 237,270.00 12,487.89 10 2 1 3 16 3
2 2 1236 15.9716 24.9101 126,992.79 5,098.02 1 1 3 8 13 1
3 3 219847 41.9736 3.0682 17,701.64 5,769.23 9 7 6 6 28 7
4 4 9837 23.9330 3.2735 172,642.35 52,738.42 4 6 2 1 13 2
5 5 64865 24.8281 1.8599 13,159.21 7,075.19 6 8 9 5 28 8
6 6 3855 26.5647 3.9304 54,333.56 13,823.64 7 4 4 2 17 4
7 7 219549 17.7493 3.3356 17,570.67 5,267.52 2 5 7 7 21 6
8 8 34171 23.5205 9.1703 42,136.68 4,594.89 3 3 5 9 20 5
9 9 3924120 24.7933 1.1684 4,754.76 4,069.21 5 10 10 10 35 10
This does what it's supposed to do; however, the cluster with the highest Ave. Spend needs to be ranked 1 at all times, and the cluster with the highest Ave. Recency needs to be ranked last at all times.
So I modified the code above to look like this:
if ranks['s_rank'].min() == 1:
    ranks['overall_rank_2'] = 1
elif ranks['r_rank'].max() == len(ranks):
    ranks['overall_rank_2'] = len(ranks)
else:
    ranks_2 = ranks.drop(ranks.index[[ranks[ranks['s_rank'] == ranks['s_rank'].min()].index[0],
                                      ranks[ranks['r_rank'] == ranks['r_rank'].max()].index[0]]])
    ranks_2['r_rank'] = ranks_2['Ave. Recency'].rank()
    ranks_2['f_rank'] = ranks_2['Ave. Frequency'].rank(ascending=False)
    ranks_2['m_rank'] = ranks_2['Ave. Monetary'].rank(ascending=False)
    ranks_2['s_rank'] = ranks_2['Ave. Spend'].rank(ascending=False)
    ranks_2['overall'] = ranks.apply(lambda row: row.r_rank + row.f_rank + row.m_rank + row.s_rank, axis=1)
    ranks['overall_rank_2'] = ranks_2['overall'].rank(method='first')
Then I get this
Cluster No. Customers|Ave. Recency|Ave. Frequency|Ave. Monetary|Ave. Spend|r_rank|f_rank|m_rank|s_rank|overall|overall_rank|overall_rank_2
0 0 145118 39.0197 1.7264 14,971.85 8,672.07 8 9 8 4 29 9 1
1 1 2 47.0 19.0 237,270.00 12,487.89 10 2 1 3 16 3 1
2 2 1236 15.9716 24.9101 126,992.79 5,098.02 1 1 3 8 13 1 1
3 3 219847 41.9736 3.0682 17,701.64 5,769.23 9 7 6 6 28 7 1
4 4 9837 23.9330 3.2735 172,642.35 52,738.42 4 6 2 1 13 2 1
5 5 64865 24.8281 1.8599 13,159.21 7,075.19 6 8 9 5 28 8 1
6 6 3855 26.5647 3.9304 54,333.56 13,823.64 7 4 4 2 17 4 1
7 7 219549 17.7493 3.3356 17,570.67 5,267.52 2 5 7 7 21 6 1
8 8 34171 23.5205 9.1703 42,136.68 4,594.89 3 3 5 9 20 5 1
9 9 3924120 24.7933 1.1684 4,754.76 4,069.21 5 10 10 10 35 10 1
Please help me modify the above if statement, or perhaps recommend a different approach altogether. This of course needs to be as dynamic as possible.
So you want a custom ranking on your dataframe, where the cluster(/row) with the highest Ave. Spend is always ranked 1, and the one with the highest Ave. Recency always ranks last.
The solution is five lines. Notes:
You had the right idea with DataFrame.drop(), just use idxmax() to get the index of both of the rows that will need special treatment, and store it, so you don't need a huge unwieldy logical filter expression in your drop.
No need to make so many temporary columns, or the temporary copy ranks_2 = ranks.drop(...); just pass the result of the drop() into a rank() ...
... via a .sum(axis=1) on your desired columns, no need to define a lambda, or save its output in the temp column 'overall'.
...then we just feed those sum-of-ranks into rank(), which will give us values from 1..8, so we add 1 to offset the results of rank() to be 2..9. (You can generalize this).
And we manually set the 'overall_rank' for the Ave. Spend, Ave. Recency rows.
(Yes you could also implement all this as a custom function whose input is the four Ave. columns or else the four *_rank columns.)
Code: (see at bottom for boilerplate to read in your dataframe, next time please make your example MCVE, to help us help you)
# Compute raw ranks like you do
ranks['r_rank'] = ranks['Ave. Recency'].rank()
ranks['f_rank'] = ranks['Ave. Frequency'].rank(ascending=False)
ranks['m_rank'] = ranks['Ave. Monetary'].rank(ascending=False)
ranks['s_rank'] = ranks['Ave. Spend'].rank(ascending=False)
# Find the indices of both the highest AveSpend and AveRecency
ismax = ranks['Ave. Spend'].idxmax()
irmax = ranks['Ave. Recency'].idxmax()
# Get the overall ranking for every row other than these... add 1 to offset for excluding the max-AveSpend row:
ranks['overall_rank'] = 1 + ranks.drop(index=[ismax, irmax])[['r_rank', 'f_rank', 'm_rank', 's_rank']].sum(axis=1).rank(method='first')
# (Note: in .loc[], can't mix indices (ismax) with column-names)
ranks.loc[ ranks['Ave. Spend'].idxmax(), 'overall_rank' ] = 1
ranks.loc[ ranks['Ave. Recency'].idxmax(), 'overall_rank' ] = len(ranks)
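As a quick sanity check on a small made-up frame (four rows, same column names), the whole pattern fits together like this; the numbers are hypothetical:

```python
import pandas as pd

ranks = pd.DataFrame({
    'Ave. Recency':   [39.0, 47.0, 16.0, 24.0],
    'Ave. Frequency': [1.7, 19.0, 24.9, 3.3],
    'Ave. Monetary':  [14971.0, 237270.0, 126992.0, 172642.0],
})
ranks['Ave. Spend'] = ranks['Ave. Monetary'] / ranks['Ave. Frequency']

# Raw per-metric ranks
ranks['r_rank'] = ranks['Ave. Recency'].rank()
ranks['f_rank'] = ranks['Ave. Frequency'].rank(ascending=False)
ranks['m_rank'] = ranks['Ave. Monetary'].rank(ascending=False)
ranks['s_rank'] = ranks['Ave. Spend'].rank(ascending=False)

ismax = ranks['Ave. Spend'].idxmax()      # pinned to rank 1
irmax = ranks['Ave. Recency'].idxmax()    # pinned to last rank

# Rank everyone else on the sum of their ranks, shifted by 1, then pin the two special rows
ranks['overall_rank'] = 1 + ranks.drop(index=[ismax, irmax])[
    ['r_rank', 'f_rank', 'm_rank', 's_rank']].sum(axis=1).rank(method='first')
ranks.loc[ismax, 'overall_rank'] = 1
ranks.loc[irmax, 'overall_rank'] = len(ranks)
print(ranks['overall_rank'].tolist())
```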
And here's the boilerplate to ingest your data:
import pandas as pd
from io import StringIO
# """Cluster No. Customers| Ave. Recency| Ave. Frequency| Ave. Monetary| Ave. Spend|
dat = """
0 145118 39.0197 1.7264 14,971.85 8,672.07
1 2 47.0 19.0 237,270.00 12,487.89
2 1236 15.9716 24.9101 126,992.79 5,098.02
3 219847 41.9736 3.0682 17,701.64 5,769.23
4 9837 23.9330 3.2735 172,642.35 52,738.42
5 64865 24.8281 1.8599 13,159.21 7,075.19
6 3855 26.5647 3.9304 54,333.56 13,823.64
7 219549 17.7493 3.3356 17,570.67 5,267.52
8 34171 23.5205 9.1703 42,136.68 4,594.89
9 3924120 24.7933 1.1684 4,754.76 4,069.21 """
# Remove the comma thousands-separator, to prevent your floats being read in as string
dat = dat.replace(',', '')
ranks = pd.read_csv(StringIO(dat), sep=r'\s+', names=
    "Cluster|No. Customers|Ave. Recency|Ave. Frequency|Ave. Monetary|Ave. Spend".split('|'))

How to create a triangular moving average in Python using a for loop

I use Python pandas to calculate the following formula:
(https://i.stack.imgur.com/XIKBz.png)
I do it in Python like this:
EURUSD['SMA2'] = EURUSD['Close'].rolling(2).mean()
EURUSD['TMA2'] = (EURUSD['Close'] + EURUSD['SMA2']) / 2
The problem is that the code gets long when I calculate TMA 100, so I need to use a for loop to easily change the TMA period.
Thanks in advance
Edit:
I found some code, but there is an error:
values = []
for i in range(1, 201):
    values.append(eurusd['Close']).rolling(window=i).mean()
    values.mean()
A TMA is an average of averages.
import numpy as np
import pandas as pd
df = pd.DataFrame(np.random.rand(10, 5))
print(df)
# df['mean0'] = df.mean(axis=0)
df['mean1'] = df.mean(axis=1)
print(df)
df['TMA'] = df['mean1'].rolling(window=10,center=False).mean()
print(df)
Or you can easily print it.
print(df["mean1"].mean())
Here is how it looks:
0 1 2 3 4
0 0.643560 0.412046 0.072525 0.618968 0.080146
1 0.018226 0.222212 0.077592 0.125714 0.595707
2 0.652139 0.907341 0.581802 0.021503 0.849562
3 0.129509 0.315618 0.711265 0.812318 0.757575
4 0.881567 0.455848 0.470282 0.367477 0.326812
5 0.102455 0.156075 0.272582 0.719158 0.266293
6 0.412049 0.527936 0.054381 0.587994 0.442144
7 0.063904 0.635857 0.244050 0.002459 0.423960
8 0.446264 0.116646 0.990394 0.678823 0.027085
9 0.951547 0.947705 0.080846 0.848772 0.699036
0 1 2 3 4 mean1
0 0.643560 0.412046 0.072525 0.618968 0.080146 0.365449
1 0.018226 0.222212 0.077592 0.125714 0.595707 0.207890
2 0.652139 0.907341 0.581802 0.021503 0.849562 0.602470
3 0.129509 0.315618 0.711265 0.812318 0.757575 0.545257
4 0.881567 0.455848 0.470282 0.367477 0.326812 0.500397
5 0.102455 0.156075 0.272582 0.719158 0.266293 0.303313
6 0.412049 0.527936 0.054381 0.587994 0.442144 0.404901
7 0.063904 0.635857 0.244050 0.002459 0.423960 0.274046
8 0.446264 0.116646 0.990394 0.678823 0.027085 0.451842
9 0.951547 0.947705 0.080846 0.848772 0.699036 0.705581
0 1 2 3 4 mean1 TMA
0 0.643560 0.412046 0.072525 0.618968 0.080146 0.365449 NaN
1 0.018226 0.222212 0.077592 0.125714 0.595707 0.207890 NaN
2 0.652139 0.907341 0.581802 0.021503 0.849562 0.602470 NaN
3 0.129509 0.315618 0.711265 0.812318 0.757575 0.545257 NaN
4 0.881567 0.455848 0.470282 0.367477 0.326812 0.500397 NaN
5 0.102455 0.156075 0.272582 0.719158 0.266293 0.303313 NaN
6 0.412049 0.527936 0.054381 0.587994 0.442144 0.404901 NaN
7 0.063904 0.635857 0.244050 0.002459 0.423960 0.274046 NaN
8 0.446264 0.116646 0.990394 0.678823 0.027085 0.451842 NaN
9 0.951547 0.947705 0.080846 0.848772 0.699036 0.705581 0.436115
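To make the TMA period easy to change, one option is to wrap the average-of-averages idea in a function that takes the window as a parameter. Note this split of the period into two rolling passes is just one common convention for a triangular MA, and the series here is a stand-in for EURUSD['Close']:

```python
import pandas as pd

def tma(close: pd.Series, period: int) -> pd.Series:
    """Triangular moving average: a simple MA applied to a simple MA."""
    half = (period + 1) // 2                       # first-pass window
    sma = close.rolling(window=half).mean()
    return sma.rolling(window=period - half + 1).mean()

close = pd.Series(range(1, 21), dtype=float)       # stand-in for EURUSD['Close']
for period in (2, 4, 10):                          # change the period freely
    print(period, tma(close, period).iloc[-1])
```

With the period as an argument, computing TMA 100 is just `tma(EURUSD['Close'], 100)` instead of a chain of hand-written columns.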

Python Bokeh Getting Empty Heatmap

I'm attempting to make a heatmap just like this one using Bokeh.
Here is my dataframe Data from which I'm trying to make the heatmap:
Day Code Total
0 1 6001 44
1 1 6002 40
2 1 6006 8
3 1 6008 2
4 1 6010 38
5 1 6011 1
6 1 6014 19
7 1 6018 1
8 1 6019 1
9 1 6023 10
10 1 6028 4
11 2 6001 17
12 2 6010 2
13 2 6014 4
14 2 6020 1
15 2 6028 2
16 3 6001 48
17 3 6002 24
18 3 6003 1
19 3 6005 1
20 3 6006 2
21 3 6008 18
22 3 6010 75
23 3 6011 1
24 3 6014 72
25 3 6023 34
26 3 6028 1
27 3 6038 3
28 4 6001 19
29 4 6002 105
30 5 6001 52
...
And here is my code:
from bokeh.io import output_file
from bokeh.io import show
from bokeh.models import (
    ColumnDataSource,
    HoverTool,
    LinearColorMapper
)
from bokeh.plotting import figure

output_file('SHM_Test.html', title='SHM', mode='inline')
source = ColumnDataSource(Data)
TOOLS = "hover,save"

# Creating the Figure
SHM = figure(title="HeatMap",
             x_range=[str(i) for i in range(1,32)],
             y_range=[str(i) for i in range(6043,6000,-1)],
             x_axis_location="above", plot_width=400, plot_height=970,
             tools=TOOLS, toolbar_location='right')

# Figure Styling
SHM.grid.grid_line_color = None
SHM.axis.axis_line_color = None
SHM.axis.major_tick_line_color = None
SHM.axis.major_label_text_font_size = "5pt"
SHM.axis.major_label_standoff = 0
SHM.toolbar.logo = None
SHM.title.text_alpha = 0.3

# Color Mapping
CM = LinearColorMapper(palette='RdPu9', low=Data.Total.min(), high=Data.Total.max())
SHM.rect(x='Day', y="Code", width=1, height=1, source=source,
         fill_color={'field': 'Total', 'transform': CM})

show(SHM)
When I execute my code I don't get any errors, but I just get an empty frame, as shown in the image below.
I've been struggling to find my mistake. Why am I getting this? Where is my error?

The problem with your code is that the data type you are setting for the x and y axis ranges differs from the data type in your ColumnDataSource. You are setting x_range and y_range to be lists of strings, but judging from your data in CSV format, Day and Code will be treated as integers.
In your case, you want to make sure that your Day and Code columns are in string format.
This can be easily done using the pandas package with
Data['Day'] = Data['Day'].astype('str')
Data['Code'] = Data['Code'].astype('str')
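A tiny self-contained illustration of that fix (the data here is made up to match the Day/Code shape): after astype(str), the column values are strings, so they can match the string categories given to x_range and y_range:

```python
import pandas as pd

# Hypothetical rows shaped like the question's Data
Data = pd.DataFrame({'Day': [1, 2, 3], 'Code': [6001, 6002, 6010], 'Total': [44, 17, 75]})

# The figure's ranges are lists of strings, so the source columns must match
Data['Day'] = Data['Day'].astype(str)
Data['Code'] = Data['Code'].astype(str)

print(Data['Day'].tolist(), Data['Code'].tolist())
```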

Get sliced dataframe by giving two values

Given two values, how can I get all the rows between those two values? Thank you.
for example:
dataframe:
Quarter GDP
0 1947q1 243.1
1 1947q2 246.3
2 1947q3 250.1
3 1947q4 260.3
4 1948q1 266.2
5 1948q2 272.9
6 1948q3 279.5
7 1948q4 280.7
8 1949q1 275.4
9 1949q2 271.7
10 1949q3 273.3
11 1949q4 271.0
12 1950q1 281.2
13 1950q2 290.7
14 1950q3 308.5
15 1950q4 320.3
16 1951q1 336.4
Given 1947q3 and 1948q4, I need to get all the data between (inclusive) those two values:
2 1947q3 250.1
3 1947q4 260.3
4 1948q1 266.2
5 1948q2 272.9
6 1948q3 279.5
7 1948q4 280.7
This will give you the desired result
df[(df['Quarter'] >= '1947q3') & (df['Quarter'] <= '1948q4')]
Quarter GDP
2 1947q3 250.1
3 1947q4 260.3
4 1948q1 266.2
5 1948q2 272.9
6 1948q3 279.5
7 1948q4 280.7
You can also use .between (note: in newer pandas versions the inclusive argument takes a string such as 'both' instead of True):
df[df['Quarter'].between('1947q3', '1948q4', inclusive='both')]
Put the data in dataframe.txt:
f = open('dataframe.txt', 'r')
f_r = f.readline()
data = []
while f_r:
    infos = f_r.split(' ')
    infos = [info.strip() for info in infos if info]
    if len(infos) == 3:
        data.append((infos[1], infos[2]))
    f_r = f.readline()
f.close()

def get_rangedata_by_quarter(quarter_s, quarter_b):
    """ quarter_s is the small one
        quarter_b is the big one
    """
    for info in data:
        quarter = info[0]
        if quarter_s <= quarter <= quarter_b:
            print(quarter, info[1])

get_rangedata_by_quarter('1947q3', '1948q4')
