Subset a data frame based on index value [duplicate] - python

This question already has answers here:
Pandas: Selecting DataFrame rows between two dates (Datetime Index)
(3 answers)
Select rows between two DatetimeIndex dates
(2 answers)
Closed 4 years ago.
I've got a data frame of weekly stock price returns that are indexed by date, as follows.
FTSE_350 SP_500
2005-01-14 -0.004498 -0.001408
2005-01-21 0.001287 -0.014056
2005-01-28 0.011469 0.002988
2005-02-04 0.016406 0.027037
2005-02-11 0.015315 0.001887
I would like to return a data frame of rows where the index is in some interval, let's say all dates in January 2005. I'm aware that I could do this by turning the index into a "Date" column, but I was wondering if there's any way to do this directly.

Yes, there is, and it's even simpler than creating a column.
Use .loc and slice the dates directly:
print(df.loc['2005-01-01':'2005-01-31'])
Output:
FTSE_350 SP_500
2005-01-14 -0.004498 -0.001408
2005-01-21 0.001287 -0.014056
2005-01-28 0.011469 0.002988
By the way, if the index holds plain strings (object dtype), first convert it:
df.index = pd.to_datetime(df.index)
As @Peter mentioned, the simplest option is partial string indexing:
print(df.loc['2005-01'])
Also outputs:
FTSE_350 SP_500
2005-01-14 -0.004498 -0.001408
2005-01-21 0.001287 -0.014056
2005-01-28 0.011469 0.002988
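Putting the two answers together, a minimal runnable sketch that recreates the sample frame above and shows both slicing styles:

```python
import pandas as pd

# Recreate the sample frame with a proper DatetimeIndex
df = pd.DataFrame(
    {"FTSE_350": [-0.004498, 0.001287, 0.011469, 0.016406, 0.015315],
     "SP_500": [-0.001408, -0.014056, 0.002988, 0.027037, 0.001887]},
    index=pd.to_datetime(["2005-01-14", "2005-01-21", "2005-01-28",
                          "2005-02-04", "2005-02-11"]),
)

jan = df.loc["2005-01-01":"2005-01-31"]   # explicit date slice
jan2 = df.loc["2005-01"]                  # partial string indexing
```

Both expressions select the same three January rows.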

How to put a condition while using a GroupBy in Pandas? [duplicate]

This question already has answers here:
How do I select rows from a DataFrame based on column values?
(16 answers)
Closed 2 years ago.
I have used the following code to make a distplot.
data_agg = data.groupby('HourOfDay')['travel_time'].aggregate(np.median).reset_index()
plt.figure(figsize=(12,3))
sns.pointplot(data.HourOfDay.values, data.travel_time.values)
plt.show()
However I want to choose hours above 8 only and not 0-7. How do I proceed with that?
What about filtering first?
data_filtered = data[data['HourOfDay'] > 7]  # keep hours 8 and above
data_agg = data_filtered.groupby('HourOfDay')['travel_time'].aggregate(np.median).reset_index()
plt.figure(figsize=(12,3))
sns.pointplot(data_filtered.HourOfDay.values, data_filtered.travel_time.values)
plt.show()
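To illustrate just the filtering-and-aggregation step, here is a self-contained sketch with toy data (the column names from the question are assumed; the plotting calls are omitted so it runs anywhere):

```python
import pandas as pd

# Toy data standing in for the original `data` frame
data = pd.DataFrame({
    "HourOfDay": [6, 7, 8, 9, 10, 8, 9],
    "travel_time": [12.0, 15.0, 20.0, 22.0, 18.0, 21.0, 25.0],
})

data_filtered = data[data["HourOfDay"] > 7]  # keep hours 8 and above
data_agg = (data_filtered.groupby("HourOfDay")["travel_time"]
            .median().reset_index())
```

The boolean mask drops hours 0-7 before any grouping happens, so the aggregate only ever sees the rows you care about.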

Columns in Pandas Dataframe [duplicate]

This question already has answers here:
Binning a column with pandas
(4 answers)
Closed 3 years ago.
I have a dataframe of cars. I have its car price column and I want to create a new column carsrange that would have values like 'high','low' etc according to car price. Like for example :
if price is between 0 and 9000 then cars range should have 'low' for those cars. similarly, if price is between 9000 and 30,000 carsrange should have 'medium' for those cars etc. I tried doing it, but my code is replacing one value to the other. Any help please?
I ran a for loop over the price column and used if-else branches to define my column values.
for i in cars_data['price']:
    if (i>0 and i<9000): cars_data['carsrange']='Low'
    elif (i<9000 and i<18000): cars_data['carsrange']='Medium-Low'
    elif (i<18000 and i>27000): cars_data['carsrange']='Medium'
    elif (i>27000 and i<36000): cars_data['carsrange']='High-Medium'
    else: cars_data['carsrange']='High'
Now, When I run the unique function for carsrange, it shows only 'High'.
cars_data['carsrange'].unique()
This is the Output:
In[74]:cars_data['carsrange'].unique()
Out[74]: array(['High'], dtype=object)
I believe I have applied the wrong concept here. Any ideas as to what I should do now?
You can use a list:
resultList = []
for i in cars_data['price']:
    if (i>0 and i<9000):
        resultList.append("Low")
    else:
        resultList.append("High")
    # write the other conditions here
cars_data["carsrange"] = resultList
Then find the unique values from cars_data["carsrange"].
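For reference, the linked duplicate covers the loop-free way to do this with pd.cut. A minimal sketch with made-up prices and the bin edges described in the question:

```python
import pandas as pd

# Toy prices; bin edges follow the ranges described in the question
cars_data = pd.DataFrame({"price": [5000, 12000, 20000, 30000, 40000]})

cars_data["carsrange"] = pd.cut(
    cars_data["price"],
    bins=[0, 9000, 18000, 27000, 36000, float("inf")],
    labels=["Low", "Medium-Low", "Medium", "High-Medium", "High"],
)
```

pd.cut assigns each price to its interval in one vectorized pass, which avoids the original bug of overwriting the whole column on every loop iteration.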

Using the format function to name columns [duplicate]

This question already has an answer here:
Renaming columns when using resample
(1 answer)
Closed 5 years ago.
The line of code below takes columns that represent each months total sales and averages the sales by quarter.
mdf = tdf[sel_cols].resample('3M',axis=1).mean()
What I need to do is title the columns with a str (cannot use pandas .Period function).
I attempted to use the following code, but I cannot get it to work.
mdf = tdf[sel_cols].resample('3M',axis=1).mean().rename(columns=lambda x: '{:}q{:}'.format(x.year, [1, 2, 3, 4][x.quarter==1]))
I want the columns to read... 2000q1, 2000q2, 2000q3, 2000q4, 2001q1,... etc, but keep getting wrong things like 2000q1, 2000q1, 2000q1, 2000q2, 2001q1.
How can I use the .format function to make this work properly?
The easiest way is to use the .quarter attribute of each datetime column label directly, like so:
mdf = tdf[sel_cols].resample('3M',axis=1).mean().rename(columns=lambda x: '{:}q{:}'.format(x.year,x.quarter))
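To see why x.quarter is enough on its own, here is a small sketch building the labels from a handful of quarter-end dates (the dates themselves are made up for illustration):

```python
import pandas as pd

# Quarter-end dates standing in for the resampled column labels
dates = pd.to_datetime(["2000-01-31", "2000-04-30", "2000-07-31",
                        "2000-10-31", "2001-01-31"])

labels = ["{}q{}".format(d.year, d.quarter) for d in dates]
```

Each Timestamp already knows its own quarter (1-4), so there is no need for the manual `[1, 2, 3, 4][...]` lookup that caused the repeated labels.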

Merging dataframes together in a for loop [duplicate]

This question already has answers here:
Pandas Merging 101
(8 answers)
Closed 4 years ago.
I have a dictionary of pandas dataframes, each frame contains timestamps and market caps corresponding to the timestamps, the keys of which are:
coins = ['dashcoin','litecoin','dogecoin','nxt']
I would like to create a new key 'merged' in the dictionary and, using the pd.merge method, merge the 4 existing dataframes on their timestamp (I want complete rows, so an 'inner' join is appropriate).
Sample of one of the data frames:
data2['nxt'].head()
Out[214]:
timestamp nxt_cap
0 2013-12-04 15091900
1 2013-12-05 14936300
2 2013-12-06 11237100
3 2013-12-07 7031430
4 2013-12-08 6292640
I'm currently getting a result using this code:
data2['merged'] = data2['dogecoin']
for coin in coins:
    data2['merged'] = pd.merge(left=data2['merged'], right=data2[coin], left_on='timestamp', right_on='timestamp')
but this repeats 'dogecoin' in 'merged'; however, if data2['merged'] is not initialized to data2['dogecoin'] (or some similar data), the merge won't work, because 'merged' would not exist yet.
EDIT: my desired result is create one merged dataframe seen in a new element in dictionary 'data2' (data2['merged']), containing the merged data frames from the other elements in data2
Try seeding the loop with one actual named dataframe first, then merging the rest onto it:
data2['merged'] = data2['dashcoin']
# leave out the first element, since it seeds the merge
for coin in coins[1:]:
    data2['merged'] = data2['merged'].merge(data2[coin], on='timestamp')
Since you've already made coins a list, why not just something like
data2['merged'] = data2[coins[0]]
for coin in coins[1:]:
    data2['merged'] = pd.merge(....
Unless I'm misunderstanding, this question isn't specific to dataframes, it's just about how to write a loop when the first element has to be treated differently to the rest.
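The same seed-then-loop pattern can also be collapsed with functools.reduce. A runnable sketch with toy frames (the column names here are invented for illustration):

```python
from functools import reduce
import pandas as pd

# Toy frames standing in for the real market-cap data
data2 = {
    "dashcoin": pd.DataFrame({"timestamp": ["2013-12-04", "2013-12-05"],
                              "dash_cap": [100, 110]}),
    "litecoin": pd.DataFrame({"timestamp": ["2013-12-04", "2013-12-05"],
                              "lite_cap": [200, 210]}),
    "nxt": pd.DataFrame({"timestamp": ["2013-12-04"],
                         "nxt_cap": [300]}),
}
coins = ["dashcoin", "litecoin", "nxt"]

# inner join keeps only timestamps present in every frame
data2["merged"] = reduce(
    lambda left, right: left.merge(right, on="timestamp", how="inner"),
    (data2[c] for c in coins),
)
```

reduce handles the "first element is special" case for you: it takes the first frame as the starting value and folds each remaining frame in with one merge.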

Year range to date time format

Currently I have a column of strings in a pandas dataframe, each representing a particular year range in "yyyy-yyyy" format; for example, "2004-2005" is a single string value in this column.
I wanted to know if there is anyway to convert this from string to something similar to datetime format.
The purpose for this is to calculate the difference between the values of this column and other similar column in "Years". For example something similar to below:
col 1 col2 Answer(Total years)
2004-2005 2006-2007 3
Note: One of the ways I thought of doing was to make a dictionary mapping each year to a unique integer value and then calculate the difference between them.
Although I was wondering if there is any simpler way of doing it.
It looks like you're subtracting the last year in column 2 from the first year in column 1. In that case I'd use str.extract (and convert the result to a number):
In [11]: pd.to_numeric(df['col 1'].str.extract('(\d{4})'))
Out[11]:
0 2004
Name: col 1, dtype: int64
In [12]: pd.to_numeric(df['col2'].str.extract('-(\d{4})')) - pd.to_numeric(df['col 1'].str.extract('(\d{4})'))
Out[12]:
0 3
dtype: int64
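In current pandas, str.extract defaults to returning a DataFrame even for a single group, so pass expand=False to get a Series as in the transcript above. A minimal sketch recreating the one-row example:

```python
import pandas as pd

df = pd.DataFrame({"col 1": ["2004-2005"], "col2": ["2006-2007"]})

# expand=False keeps each extract a Series rather than a DataFrame
start = df["col 1"].str.extract(r"(\d{4})", expand=False).astype(int)
end = df["col2"].str.extract(r"-(\d{4})", expand=False).astype(int)
df["Answer"] = end - start
```

The raw-string regexes (r"...") also avoid the invalid-escape-sequence warning that `'(\d{4})'` triggers in recent Python versions.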
What do you mean by "something similar to a datetime object"? Datetimes aren't designed to represent date ranges.
If you want to create a pair of datetime objects you could do something like this:
[datetime.datetime.strptime(x, '%Y') for x in '2005-2006'.split('-')]
Alternatively you could try using a Pandas date_range object if that's closer to what you want.
http://pandas.pydata.org/pandas-docs/version/0.17.0/generated/pandas.date_range.html
If you are trying to find the difference between the lowest year and the highest year, here is a go at it:
col1 = "2004-2005"
col2 = "2006-2007"
col1 = col1.split("-")  # make a list of the years in col1: ['2004', '2005']
col2 = col2.split("-")  # make a list of the years in col2: ['2006', '2007']
biglist = col1 + col2   # concatenate the two lists
biglist.sort()          # sort from lowest year to highest year
Answer = int(biglist[-1]) - int(biglist[0])  # difference between highest and lowest year
