I created a dataframe as:
df1 = pandas.read_csv(ifile_name, header=None, sep=r"\s+", usecols=[0,1,2,3,4],
index_col=[0,1,2], names=["year", "month", "day", "something1", "something2"])
Now I would like to create another dataframe with only the rows where year > 2008, so I tried:
df2 = df1[df1.year>2008]
But I get this error:
AttributeError: 'DataFrame' object has no attribute 'year'
I guess it is not seeing "year" among the columns because I defined it as part of the index. But how can I select rows based on year > 2008 in that case?
Get the level by name using MultiIndex.get_level_values and create a boolean mask for row selection:
df2 = df1[df1.index.get_level_values('year') > 2008]
If you plan to make modifications, create a copy of df1 so that you do not operate on a view:
df2 = df1[df1.index.get_level_values('year') > 2008].copy()
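For reference, a minimal self-contained sketch (the sample values are made up, standing in for the CSV):
import pandas as pd

# hypothetical sample data mirroring the question's layout
df1 = pd.DataFrame({'year': [2007, 2009, 2012],
                    'month': [1, 6, 3],
                    'day': [15, 20, 5],
                    'something1': [1.0, 2.0, 3.0],
                    'something2': [4.0, 5.0, 6.0]}).set_index(['year', 'month', 'day'])

# boolean mask built from the 'year' level of the MultiIndex
df2 = df1[df1.index.get_level_values('year') > 2008].copy()
print(df2)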
You are correct that year is an index rather than a column. One solution is to use pd.DataFrame.query, which lets you use index names directly:
df = pd.DataFrame({'year': [2005, 2010, 2015], 'value': [1, 2, 3]})
df = df.set_index('year')
res = df.query('year > 2008')
print(res)
value
year
2010 2
2015 3
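Note that query resolves MultiIndex level names too, so the same call works on the frame from the original question; a sketch with hypothetical values:
import pandas as pd

df1 = pd.DataFrame({'year': [2007, 2010], 'month': [1, 2], 'day': [3, 4],
                    'something1': [0.1, 0.2], 'something2': [0.3, 0.4]}
                   ).set_index(['year', 'month', 'day'])
print(df1.query('year > 2008'))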
Assuming your index is sorted, you can slice with .loc:
df.loc[2008:]
Out[259]:
value
year
2010 2
2015 3
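Label slicing needs a monotonic index; if yours may be unsorted, sort it first. A minimal sketch:
df = df.sort_index()   # label slicing needs a monotonic index
res = df.loc[2008:]    # rows with year >= 2008 (2008 itself is absent here)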
I have a dataframe (DF1) in which each Personal-ID has 3 dates associated with that ID:
I have created a dataframe (DF_ID) with one row for each Personal-ID and a column for each respective date (currently blank), and would like to load/loop the 3 dates per Personal-ID from DF1 into the respective date columns, so that the final dataframe looks as such:
I am trying to learn Python and have tried a number of coding scripts to accomplish this, such as:
for index, row in df_bnp_5.iterrows():
    df_id['Date-1'] = row.loc[0, 'hv_lab_test_dt']
    df_id['Date-2'] = row.loc[1, 'hv_lab_test_dt']
    df_id['Date-3'] = row.loc[2, 'hv_lab_test_dt']

for i in range(len(df_bnp_5)):
    df_id['Date-1'] = df1.iloc[i, 0]
    df_id['Date-2'] = df1.iloc[i, 2]
Any assistance would be appreciated.
Thank You!
Here is one way. I created a 'helper' column to arrange the dates for each Personal-ID.
import pandas as pd
# create data frame
df = pd.DataFrame({'Personal-ID': [1, 1, 1, 5, 5, 5],
'Date': ['10/01/2019', '12/28/2019', '05/08/2020',
'01/19/2020', '06/05/2020', '07/19/2020']})
# change data type
df['Date'] = pd.to_datetime(df['Date'], format='%m/%d/%Y')
# create grouping key
df['x'] = df.groupby('Personal-ID')['Date'].rank().astype(int)
# convert to wide table
df = df.pivot(index='Personal-ID', columns='x', values='Date')
# change column names
df = df.rename(columns={1: 'Date-1', 2: 'Date-2', 3: 'Date-3'})
print(df)
x Date-1 Date-2 Date-3
Personal-ID
1 2019-10-01 2019-12-28 2020-05-08
5 2020-01-19 2020-06-05 2020-07-19
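One caveat: if a Personal-ID could repeat a date, rank would produce ties and pivot would fail on duplicate index/column pairs. A tie-free alternative for the grouping key, assuming rows are already in the desired order, is cumcount:
# number rows 1, 2, 3 within each Personal-ID, regardless of duplicate dates
df['x'] = df.groupby('Personal-ID').cumcount() + 1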
I have two dataframes:
df1 with columns 'state', 'date', 'number'
df2 with columns 'state', 'specificDate' (one specificDate for one state, each state is mentioned just once)
In the end, I want to have a dataset with columns 'state', 'specificDate', 'number'. Also, I would like to add 14 days to each specific date and get numbers for those dates too.
I tried this
df = df1.merge(df2, left_on='state', right_on='state')
df['newcolumn'] = np.where((df.state == df.state)& (df.date == df.specificDate), df.numbers)
df['newcolumn'] = np.where((df.state == df.state)& (df.date == df.specificDate+datetime.timedelta(days=14)), df.numbers)
but I got this error:
ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()
When I add .all() it still gives me the same error.
I feel that my logic is not correct. How else can I insert those values into my dataset?
I think you want to use df2 as the left side of the join. You can use pd.DateOffset to add 14 days.
# create dataset with specificDate and specificDate + 14 days
df2_14 = (df2.set_index('state')['specificDate'] + pd.DateOffset(14)).reset_index()
df = pd.concat([df2, df2_14])

# now join the numbers from df1
df = df.join(df1.set_index(['state', 'date']),
             how='left',
             on=['state', 'specificDate'])
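A quick sanity check with made-up values matching the column layout described above:
import pandas as pd

df1 = pd.DataFrame({'state': ['Alabama', 'Alabama'],
                    'date': pd.to_datetime(['2020-03-13', '2020-03-27']),
                    'number': [5, 9]})
df2 = pd.DataFrame({'state': ['Alabama'],
                    'specificDate': pd.to_datetime(['2020-03-13'])})

df2_14 = (df2.set_index('state')['specificDate'] + pd.DateOffset(14)).reset_index()
df = pd.concat([df2, df2_14])
df = df.join(df1.set_index(['state', 'date']), how='left', on=['state', 'specificDate'])
print(df)   # one row per (state, specificDate) and (state, specificDate + 14) with its number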
You can declare an empty DataFrame and insert the filtered data into it.
To filter the data, you can iterate through the rows of df2 and build a mask that selects dates between specificDate and specificDate + 14 days for the matching state.
I have created two DataFrames, df1 and df2, with several values from your DataFrames and tested the procedure below.
import pandas as pd
import datetime
data1 = {
"state":["Alabama","Alabama","Alabama"],
"date":["3/12/20", "3/13/20", "3/14/20"],
"number":[0,5,7]
}
data2 = {
"state": ["Alabama", "Alaska"],
"specificDate": ["03.13.2020", "03.11.2020"]
}
df1 = pd.DataFrame(data1)
df1['date'] = pd.to_datetime(df1['date'])
df2 = pd.DataFrame(data2)
df2['specificDate'] = pd.to_datetime(df2['specificDate'])
final_df = pd.DataFrame()
for index, row in df2.iterrows():
    begin_date = row["specificDate"]
    end_date = begin_date + datetime.timedelta(days=14)
    mask = (df1['date'] >= begin_date) & (df1['date'] <= end_date) & (df1['state'] == row['state'])
    filtered_data = df1.loc[mask]
    if not filtered_data.empty:
        # DataFrame.append was removed in pandas 2.0; pd.concat works everywhere
        final_df = pd.concat([final_df, filtered_data], ignore_index=True)
print(final_df)
Output:
state date number
0 Alabama 2020-03-13 5
1 Alabama 2020-03-14 7
Updated Answer:
To show the data only for the specific date and the specific date + 14 days from df1, we should update the mask in the above code snippet.
import pandas as pd
import datetime
data1 = {
"state":["Alabama","Alabama","Alabama","Alabama","Alabama"],
"date":["3/12/20", "3/13/20", "3/14/20", "3/27/20", "3/28/20"],
"number":[0,5,7,9,3]
}
data2 = {
"state": ["Alabama", "Alaska"],
"specificDate": ["03.13.2020", "03.11.2020"]
}
df1 = pd.DataFrame(data1)
df1['date'] = pd.to_datetime(df1['date'])
df2 = pd.DataFrame(data2)
df2['specificDate'] = pd.to_datetime(df2['specificDate'])
final_df = pd.DataFrame()
for index, row in df2.iterrows():
    first_date = row["specificDate"]
    last_date = first_date + datetime.timedelta(days=14)
    mask = ((df1['date'] == first_date) | (df1['date'] == last_date)) & (df1['state'] == row['state'])
    filtered_data = df1.loc[mask]
    if not filtered_data.empty:
        final_df = pd.concat([final_df, filtered_data], ignore_index=True)
print(final_df)
Output:
state date number
0 Alabama 2020-03-13 5
1 Alabama 2020-03-27 9
Just a slight tweak on the first line in Eric's answer to make it a little simpler, as I was confused why he used set_index and reset_index.
df2_14 = df2.copy()
df2_14['specificDate'] = df2['specificDate'] + pd.DateOffset(14)
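If you prefer an explicit timedelta, this alternative spelling does the same thing:
df2_14['specificDate'] = df2['specificDate'] + pd.Timedelta(days=14)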
import pandas as pd
df = pd.DataFrame({
'col1':[99,99,99],
'col2':[4,5,6],
'col3':[7,None,9]
})
col_list = ['col1','col2']
df[col_list].replace(99,0,inplace=True)
This generates a Warning and leaves the dataframe unchanged.
SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame
I want to be able to apply the replace method on a subset of the columns specified by the user. I also want to use inplace = True to avoid making a copy of the dataframe, since it is huge. Any ideas on how this can be accomplished would be appreciated.
When you select the columns for replacement with df[col_list], a slice (a copy) of your dataframe is created. The copy is updated, but never written back into the original dataframe.
You should either replace one column at a time or use nested dictionary mapping:
df.replace(to_replace={'col1' : {99 : 0}, 'col2' : {99 : 0}},
inplace=True)
The nested dictionary for to_replace can be generated automatically:
d = {col : {99:0} for col in col_list}
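Putting the two pieces together with the sample frame from the question:
d = {col: {99: 0} for col in col_list}   # {'col1': {99: 0}, 'col2': {99: 0}}
df.replace(to_replace=d, inplace=True)
print(df)   # 99s in col1/col2 become 0; col3 is untouched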
You can use replace with loc. Here is a slightly modified version of your sample df:
d = {'col1':[99,99,9],'col2':[99,5,6],'col3':[7,None,99]}
df = pd.DataFrame(data=d)
col_list = ['col1','col2']
df.loc[:, col_list] = df.loc[:, col_list].replace(99,0)
You get
col1 col2 col3
0 0 0 7.0
1 0 5 NaN
2 9 6 99.0
Here is a nice explanation for a similar issue.
I have tried resetting the index, selecting that column, and then setting the index again, like so:
df.reset_index(inplace=True,drop=False)
country_names = df['Country'] #the Series I want to select
df.set_index('Country',drop=True,inplace=True)
But it seems like there should be a better way to do this.
To get the index of a dataframe as a pd.series object you can use the to_series method, for example:
df = pd.DataFrame([1, 3], index=['a', 'b'])
df.index.to_series()
a a
b b
dtype: object
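Applied to the question's setup, this collapses the reset/set round-trip to one line (assuming the index is named 'Country' as in the snippet above):
country_names = df.index.to_series()   # Series whose values (and index) are the country names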
I have dict that I would like to turn into a DataFrame with MultiIndex.
The dict is:
dikt = {'bloomberg': Timestamp('2009-01-26 10:00:00'),
'investingcom': Timestamp('2009-01-01 09:00:00')}
I construct a MultiIndex as follows:
MI = pd.MultiIndex(levels=[['Existing Home Sales MoM'], ['investingcom', 'bloomberg']],
                   codes=[[0, 0], [0, 1]],
                   names=['indicator', 'source'])
Then a DataFrame as such:
df = pd.DataFrame(index=MI, columns=["datetime"], data=np.full((2, 1), np.nan))
Then, lastly, I fill the df with the data stored in the dict:
for key in ['bloomberg', 'investingcom']:
    df.loc[('Existing Home Sales MoM', key), "datetime"] = dikt[key]
and get the expected result:
But would there be a more concise way of doing so, by passing dikt directly into the construction of the df, such as
df = pd.DataFrame(index=MI, columns=["datetime"], data=dikt)
so as to combine the last 2 steps into 1?
You can create a dataframe from a dictionary using from_dict:
pd.DataFrame.from_dict(dikt, orient='index')
0
bloomberg 2009-01-26 10:00:00
investingcom 2009-01-01 09:00:00
You can chain the column and index definitions to get the result you're after in 1 step:
pd.DataFrame.from_dict(dikt, orient='index') \
.rename(columns={0: 'datetime'}) \
.set_index(MI)
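In more recent pandas versions, from_dict can also name the column directly via its columns parameter (valid with orient='index'), which drops the rename step; note the row order still has to line up with MI:
pd.DataFrame.from_dict(dikt, orient='index', columns=['datetime']).set_index(MI)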