I have a column DOB(Year) in a dataframe df, which contains values like below:
DOB(Year)
1990.0
1998.0
2015.0
2017.0
I want to remove .0 from all values.
I have tried
df['DOB(Year)'] = df['DOB(Year)'].astype(str)
df['DOB(Year)'] = df['DOB(Year)'].str.replace(".0$", "", regex=True)
But the resulting column values are nan.
Can anyone please suggest a solution for this?
If you want a safe method that works on numeric/string input:
df['DOB(Year)'] = (pd.to_numeric(df['DOB(Year)'], errors='coerce')
                     .round().convert_dtypes()
                   )
Example (as new column):
DOB(Year) DOB(Year)_converted
0 1990.0 1990
1 1998.0 1998
2 2015.0 2015
3 2017.0 2017
4 2011.0001 2011
5 abc <NA>
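If you then need plain strings without the trailing .0 (rather than nullable integers), one extra cast should do it; a minimal sketch, reusing the conversion above:
df['DOB(Year)'] = (pd.to_numeric(df['DOB(Year)'], errors='coerce')
                     .round()
                     .convert_dtypes()   # nullable Int64, so no trailing .0
                     .astype(str))       # missing values become the string '<NA>'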
Try this:
df['DOB(Year)'] = df['DOB(Year)'].astype('int')
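Note that a plain astype('int') assumes the column is numeric and has no missing values; with NaNs it raises. A hedged variant using the nullable integer dtype (capital-I 'Int64', recent pandas versions):
# NaNs are kept as <NA> instead of raising
df['DOB(Year)'] = df['DOB(Year)'].astype('Int64')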
I cannot figure out why, when I assign a scaled variable (which contains no NaNs) to the original DataFrame, I get NaNs even though the index matches (years).
Can anyone help? I am leaving out details which I think are not necessary; happy to provide more if needed.
So, given the following multi-index dataframe df:
value
country year
Canada 2007 1
2006 2
2005 3
United Kingdom 2007 4
2006 5
And the following series scaled:
2006 99
2007 54
2005 78
dtype: int64
You can assign it as a new column if reindexed and converted to a list first, like this:
df.loc["Canada", "new_values"] = scaled.reindex(df.loc["Canada", :].index).to_list()
print(df.loc["Canada", :])
# Output
value new_values
year
2007 1 54.0
2006 2 99.0
2005 3 78.0
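If you want to fill the new column for every country at once rather than one country slice at a time, a minimal sketch (assuming, as above, that scaled is indexed by year) is to map the year level of the MultiIndex:
# look up each row's year in `scaled`, ignoring the country level entirely
df['new_values'] = df.index.get_level_values('year').map(scaled)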
I have data for many countries over a period of time (2001-2003). It looks something like this:
index  year  country  inflation  GDP
1      2001  AFG      nan        48
2      2002  AFG      nan        49
3      2003  AFG      nan        50
4      2001  CHI      3.0        nan
5      2002  CHI      5.0        nan
6      2003  CHI      7.0        nan
7      2001  USA      nan        220
8      2002  USA      4.0        250
9      2003  USA      2.5        280
I want to drop countries in case there is no data (i.e. values are missing for all years) for any given variable.
In the example table above, I want to drop AFG (because all of its inflation values are missing) and CHI (all of its GDP values are missing). I don't want to drop USA just because inflation is missing for one year (observation #7).
What's the best way to do that?
This should work by filtering out any country for which all values of either inflation or GDP are NaN:
(
    df.groupby(['country'])
      .filter(lambda x: not x['inflation'].isnull().all() and not x['GDP'].isnull().all())
)
Note, if you have more than two columns you can use a more general version of this:
df.groupby(['country']).filter(lambda x: not x.isnull().all().any())
If you want this to work with a specific range of years instead of all years, you can set up a mask and change the code a bit:
mask = (df['year'] >= 2002) & (df['year'] <= 2003) # mask of years
grp = df.groupby(['country']).filter(lambda x: not x[mask].isnull().all().any())
You can also try this:
# check where the sum is equal to 0 - means no values in the column for a specific country
group_by = df.groupby(['country']).agg({'inflation':sum, 'GDP':sum}).reset_index()
# extract only countries with information on both columns
indexes = group_by[ (group_by['GDP'] != 0) & ( group_by['inflation'] != 0) ].index
final_countries = list(group_by.loc[ group_by.index.isin(indexes), : ]['country'])
# keep only the rows for those countries
df = df.drop(df[~df.country.isin(final_countries)].index)
You could reshape the data frame from long to wide, drop nulls, and then convert back to long.
To convert from long to wide, you can use pivot functions. See this question too.
Here's code for dropping nulls, after it's reshaped:
df.dropna(axis=0, how= 'any', thresh=None, subset=None, inplace=True) # Delete rows, where any value is null
To convert back to long, you can use pd.melt.
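A rough sketch of that round trip on the example data, using pivot_table and then stack to return to long format (melt also works, but is fiddlier with two value columns); note that dropna(how='any') drops a country with any missing year, which is stricter than the groupby/filter approach above:
# long -> wide: one row per country, one column per (variable, year)
wide = df.pivot_table(index='country', columns='year', values=['inflation', 'GDP'])
# drop countries with any missing value
wide = wide.dropna(axis=0, how='any')
# wide -> long again
long_again = wide.stack().reset_index()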
Let's say I have the following df:
year date_until
1 2010 -
2 2011 30.06.13
3 2011 NaN
4 2015 30.06.18
5 2020 -
I'd like to fill all - and NaNs in the date_until column with 30/06/{year +1}. I tried the following but it uses the whole year column instead of the corresponding value of the specific row:
df['date_until'] = df['date_until'].str.replace('-', f'30/06/{df["year"]+1}')
My final goal is to calculate the difference between year and the year of date_until, so maybe the step above is even unnecessary.
We can use pd.to_datetime here with errors='coerce' to turn the faulty dates into NaT. Then use dt.year to calculate the difference:
df['date_until'] = pd.to_datetime(df['date_until'], format='%d.%m.%y', errors='coerce')
df['diff_year'] = df['date_until'].dt.year - df['year']
year date_until diff_year
0 2010 NaT NaN
1 2011 2013-06-30 2.0
2 2011 NaT NaN
3 2015 2018-06-30 3.0
4 2020 NaT NaN
For everybody who is trying to replace values just like I wanted to in the first place, here is how you could solve it:
for i in range(len(df)):
    if pd.isna(df['date_until'].iloc[i]):
        df['date_until'].iloc[i] = f'30.06.{df["year"].iloc[i] + 1}'
    if df['date_until'].iloc[i] == '-':
        df['date_until'].iloc[i] = f'30.06.{df["year"].iloc[i] + 1}'
But @Erfan's approach is much cleaner.
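For reference, the same replacement can also be written without an explicit loop; a small sketch, assuming date_until holds strings plus '-' and NaN placeholders as in the example:
import numpy as np
# per-row fallback '30.06.<year + 1>', used wherever date_until is '-' or missing
fallback = '30.06.' + (df['year'] + 1).astype(str)
df['date_until'] = df['date_until'].replace('-', np.nan).fillna(fallback)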
I have a dataset like this where data for some years are missing.
County Year Pop
12 1999 1.1
12 2001 1.2
13 1999 1.0
13 2000 1.1
I want something like
County Year Pop
12 1999 1.1
12 2000 NaN
12 2001 1.2
13 1999 1.0
13 2000 1.1
13 2001 NaN
I have tried setting the index to year and then using reindex with another dataframe of just years (mentioned here: Pandas: Add data for missing months), but it gives me the error "cannot reindex with duplicate values". I have also tried df.loc, but it has the same issue. I even tried a full outer join with a blank df of just years, but that also didn't work.
How can I solve this?
Make a MultiIndex so you don't have duplicates:
df.set_index(['County', 'Year'], inplace=True)
Then construct a full MultiIndex with all the combinations:
index = pd.MultiIndex.from_product(df.index.levels)
Then reindex:
df.reindex(index)
The construction of the MultiIndex is untested and may need a little tweaking (e.g. if a year is entirely absent from all counties), but I think you get the idea.
I'm working under the assumption that you may want to add all years between the minimum and maximum years. It may be the case that you were missing 2000 for both Counties 12 and 13.
I'll construct a pd.MultiIndex from_product using unique values from the 'County' column and all integer years between and including the min and max years in the 'Year' column.
Note: this solution fills in all missing years even if they aren't currently present.
mux = pd.MultiIndex.from_product([
    df.County.unique(),
    range(df.Year.min(), df.Year.max() + 1)
], names=['County', 'Year'])
df.set_index(['County', 'Year']).reindex(mux).reset_index()
County Year Pop
0 12 1999 1.1
1 12 2000 NaN
2 12 2001 1.2
3 13 1999 1.0
4 13 2000 1.1
5 13 2001 NaN
You can use pivot_table:
In [11]: df.pivot_table(values="Pop", index="County", columns="Year")
Out[11]:
Year 1999 2000 2001
County
12 1.1 NaN 1.2
13 1.0 1.1 NaN
and stack the result (which gives it back as a Series):
In [12]: df.pivot_table(values="Pop", index="County", columns="Year").stack(dropna=False)
Out[12]:
County Year
12 1999 1.1
2000 NaN
2001 1.2
13 1999 1.0
2000 1.1
2001 NaN
dtype: float64
Or you can try some black magic :P
min_year, max_year = df.Year.min(), df.Year.max()
df.groupby('County').apply(lambda g: g.set_index("Year").reindex(range(min_year, max_year+1))).drop("County", axis=1).reset_index()
You mentioned you've tried to join to a blank df and this approach can actually work.
Setup:
df = pd.DataFrame({'County': {0: 12, 1: 12, 2: 13, 3: 13},
                   'Pop': {0: 1.1, 1: 1.2, 2: 1.0, 3: 1.1},
                   'Year': {0: 1999, 1: 2001, 2: 1999, 3: 2000}})
Solution
#create a new blank df with all the required Years for each County
df_2 = pd.DataFrame(np.r_[pd.tools.util.cartesian_product([df.County.unique(),np.arange(1999,2002)])].T, columns=['County','Year'])
#Left join the new dataframe to the existing dataframe to populate the Pop values.
pd.merge(df_2,df,on=['Year','County'],how='left')
Out[73]:
County Year Pop
0 12 1999 1.1
1 12 2000 NaN
2 12 2001 1.2
3 13 1999 1.0
4 13 2000 1.1
5 13 2001 NaN
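Note that pd.tools.util.cartesian_product comes from an old pandas release and is not available in current versions; a sketch of the same blank frame built with MultiIndex.from_product instead (assuming the 1999-2001 range as above):
# all County/Year combinations as a plain DataFrame
mux = pd.MultiIndex.from_product([df.County.unique(), range(1999, 2002)],
                                 names=['County', 'Year'])
df_2 = mux.to_frame(index=False)
pd.merge(df_2, df, on=['Year', 'County'], how='left')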
Here is a function inspired by the accepted answer but for a case where the time-variable starts and stops at different places for different group ids. The only difference from the accepted answer is that I manually construct the multi-index.
def fill_gaps_in_panel(df, group_col, year_col):
    """
    Fills the gaps in a panel by constructing an index
    based on the group col and the sequence of years between min-year
    and max-year for each group id.
    """
    index_group = []
    index_time = []

    for group in df[group_col].unique():
        _min = df.loc[df[group_col] == group, year_col].min()
        _max = df.loc[df[group_col] == group, year_col].max() + 1
        index_group.extend([group for t in range(_min, _max)])
        index_time.extend([t for t in range(_min, _max)])

    multi_index = pd.MultiIndex.from_arrays(
        [index_group, index_time], names=(group_col, year_col))

    df.set_index([group_col, year_col], inplace=True)
    return df.reindex(multi_index)
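A quick usage sketch on the sample data from this question (note that the function mutates df in place via set_index):
df = pd.DataFrame({'County': [12, 12, 13, 13],
                   'Year':   [1999, 2001, 1999, 2000],
                   'Pop':    [1.1, 1.2, 1.0, 1.1]})
filled = fill_gaps_in_panel(df, 'County', 'Year')
# County 12 gains a NaN row for 2000; County 13 only gets 1999-2000,
# because gaps are filled between each group's own min and max year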
I have a dataframe like this:
Basic Stats Min Max Mean Stdev
1 LT50300282010256PAC01 0.336438 0.743478 0.592622 0.052544
2 LT50300282009269PAC01 0.313259 0.678561 0.525667 0.048047
3 LT50300282008253PAC01 0.374522 0.746828 0.583513 0.055989
4 LT50300282007237PAC01 -0.000000 0.749325 0.330068 0.314351
5 LT50300282006205PAC01 -0.000000 0.819288 0.600136 0.170060
and for the column Basic Stats I want to retain only the characters at positions 9 through 12 (the year), so for row 1 I only want to retain 2010 and for row 2 I only want to retain 2009. Is there a way to do this?
Just use the vectorised str method to slice your strings:
In [23]:
df['Basic Stats'].str[9:13]
Out[23]:
0 2010
1 2009
2 2008
3 2007
4 2006
Name: Basic Stats, dtype: object
One way would be to use
df['Basic Stats'] = df['Basic Stats'].map(lambda x: x[9:13])
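One caveat: unlike the .str accessor, map with a plain lambda will raise a TypeError if the column contains NaN; a guarded variant might look like:
df['Basic Stats'] = df['Basic Stats'].map(lambda x: x[9:13] if isinstance(x, str) else x)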
You can slice
df["Basic Stats"] = df["Basic Stats"].str.slice(9,13)
Output:
Basic Stats Min Max Mean Stdev
0 2010 0.336438 0.743478 0.592622 0.052544
1 2009 0.313259 0.678561 0.525667 0.048047
2 2008 0.374522 0.746828 0.583513 0.055989
3 2007 -0.000000 0.749325 0.330068 0.314351
4 2006 -0.000000 0.819288 0.600136 0.170060
You can do:
df["Basic Stats"] = [ x[9:13] for x in df["Basic Stats"] ]