I have a pandas DataFrame in which I would like to fill in some NaN values.
import pandas as pd
tuples = [('a', 1990),('a', 1994),('a',1996),('b',1992),('b',1997),('c',2001)]
index = pd.MultiIndex.from_tuples(tuples, names = ['Type', 'Year'])
vals = ['NaN','NaN','SomeName','NaN','SomeOtherName','SomeThirdName']
df = pd.DataFrame(vals, index=index)
print(df)
                         0
Type Year
a    1990            NaN
     1994            NaN
     1996       SomeName
b    1992            NaN
     1997  SomeOtherName
c    2001  SomeThirdName
The output that I would like is:
Type Year
a    1990       SomeName
     1994       SomeName
     1996       SomeName
b    1992  SomeOtherName
     1997  SomeOtherName
c    2001  SomeThirdName
This needs to be done on a much larger DataFrame (millions of rows) where each 'Type' can have between 1-5 unique 'Years' and the name value is only present for the most recent year. I'm trying to avoid iterating over rows for performance purposes.
You can sort your data frame by index in descending order and then ffill it:
import pandas as pd
df.sort_index(level=[0, 1], ascending=False).ffill()
#                             0
# Type Year
# c    2001  SomeThirdName
# b    1997  SomeOtherName
#      1992  SomeOtherName
# a    1996       SomeName
#      1994       SomeName
#      1990       SomeName
Note: the example data doesn't actually contain np.nan values but the literal string 'NaN', so for ffill to work you first need to replace the 'NaN' strings with np.nan:
import numpy as np
df[0] = np.where(df[0] == "NaN", np.nan, df[0])
Or, as @ayhan suggested, after replacing the string "NaN" with np.nan you can use df.bfill() instead.
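Putting it together, a minimal end-to-end sketch on the sample data (using a per-group bfill, a slight variation on the above, so values cannot leak between 'Type' groups):
import numpy as np
import pandas as pd

tuples = [('a', 1990), ('a', 1994), ('a', 1996), ('b', 1992), ('b', 1997), ('c', 2001)]
index = pd.MultiIndex.from_tuples(tuples, names=['Type', 'Year'])
df = pd.DataFrame(['NaN', 'NaN', 'SomeName', 'NaN', 'SomeOtherName', 'SomeThirdName'], index=index)

# The values are the literal string "NaN", so turn them into real missing values first.
df[0] = df[0].replace('NaN', np.nan)

# Back-fill within each Type so earlier years pick up the most recent name.
df[0] = df.groupby(level='Type')[0].bfill()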
I have data for many countries over a period of time (2001-2003). It looks something like this:
index  year  country  inflation  GDP
1      2001  AFG      nan        48
2      2002  AFG      nan        49
3      2003  AFG      nan        50
4      2001  CHI      3.0        nan
5      2002  CHI      5.0        nan
6      2003  CHI      7.0        nan
7      2001  USA      nan        220
8      2002  USA      4.0        250
9      2003  USA      2.5        280
I want to drop countries in case there is no data (i.e. values are missing for all years) for any given variable.
In the example table above, I want to drop AFG (because it misses all values for inflation) and CHI (GDP missing). I don't want to drop observation #7 just because one year is missing.
What's the best way to do that?
This should work: it filters out any country whose values are all NaN in either inflation or GDP:
(
    df.groupby(['country'])
      .filter(lambda x: not x['inflation'].isnull().all() and not x['GDP'].isnull().all())
)
Note that if you have more than two columns you can use a more general version of this:
df.groupby(['country']).filter(lambda x: not x.isnull().all().any())
If you want this to work with a specific range of year instead of all columns, you can set up a mask and change the code a bit:
mask = (df['year'] >= 2002) & (df['year'] <= 2003) # mask of years
grp = df.groupby(['country']).filter(lambda x: not x[mask].isnull().all().any())
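For reference, here is a small self-contained sketch that reproduces the sample data (column names taken from the question) and applies the general filter:
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'year': [2001, 2002, 2003] * 3,
    'country': ['AFG'] * 3 + ['CHI'] * 3 + ['USA'] * 3,
    'inflation': [np.nan, np.nan, np.nan, 3.0, 5.0, 7.0, np.nan, 4.0, 2.5],
    'GDP': [48, 49, 50, np.nan, np.nan, np.nan, 220, 250, 280],
})

kept = df.groupby('country').filter(lambda x: not x.isnull().all().any())
print(kept['country'].unique())  # ['USA'] - AFG and CHI are dropped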
You can also try this:
# a sum of 0 is taken to mean there are no values in that column for the country
# (this assumes the real values never sum to exactly 0)
group_by = df.groupby(['country']).agg({'inflation': sum, 'GDP': sum}).reset_index()
# extract only countries with information in both columns
indexes = group_by[(group_by['GDP'] != 0) & (group_by['inflation'] != 0)].index
final_countries = list(group_by.loc[group_by.index.isin(indexes), :]['country'])
# keep only the rows for those countries
df = df.drop(df[~df.country.isin(final_countries)].index)
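A slightly safer variant of the same idea counts non-null values instead of summing them, so a column whose real values happen to sum to 0 is not mistaken for missing data:
counts = df.groupby('country')[['inflation', 'GDP']].count()
good_countries = counts[(counts['inflation'] > 0) & (counts['GDP'] > 0)].index
df = df[df['country'].isin(good_countries)]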
You could reshape the data frame from long to wide, drop nulls, and then convert back to long.
To convert from long to wide, you can use pivot functions. See this question too.
Here's code for dropping nulls after it's reshaped:
df.dropna(axis=0, how='any', thresh=None, subset=None, inplace=True)  # delete rows where any value is null
To convert back to long, you can use pd.melt.
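For a single variable, a rough sketch of that round trip might look like this (column names taken from the question, and using how='all' so a country is only dropped when the variable is missing for every year):
wide = df.pivot(index='country', columns='year', values='GDP')       # long -> wide
wide = wide.dropna(how='all')                                        # drop countries with no GDP at all
back = wide.reset_index().melt(id_vars='country', value_name='GDP')  # wide -> long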
I have a DataFrame like this:
Year Month Day Rain (mm)
2021 1 1 15
2021 1 2 NaN
2021 1 3 12
And so on (there are multiple years). I have used pivot_table function to convert the DataFrame into this:
Year       2021  2020  2019  2018  2017
Month Day
1     1      15
      2     NaN
      3      12
I used:
df = df.pivot_table(index=['Month', 'Day'], columns='Year',
                    values='Rain (mm)', aggfunc='first')
Now I would like to replace all NaN values, and also any -1 values, with zeros in every column (by columns I mean the years), but I have not been able to do so. I have tried:
df = df.fillna(0)
And also:
df.loc[df['Rain (mm)'] == NaN, 'Rain (mm)'] = 0
But neither works: there is no error message or exception, the DataFrame just remains unchanged. What am I doing wrong? Any advice is highly appreciated.
I think the problem is that the NaN values are strings, so they cannot be replaced directly; first convert the values to numeric:
df['Rain (mm)'] = pd.to_numeric(df['Rain (mm)'], errors='coerce')
df = df.pivot_table(index=['Month', 'Day'], columns='Year',
                    values='Rain (mm)', aggfunc='first').fillna(0)
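The question also mentions possible -1 values; assuming those should become zeros as well, they can be replaced after the pivot:
df = df.replace(-1, 0)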
I am using a pandas DataFrame and I would like to remove all information after a space occurs. My DataFrame is similar to this one:
import pandas as pd
d = {'Australia': pd.Series([0, '1980 (F)\n\n1957 (T)\n\n', 1991], index=['Australia', 'Belgium', 'France']),
     'Belgium': pd.Series([1980, 0, 1992], index=['Australia', 'Belgium', 'France']),
     'France': pd.Series([1991, 1992, 0], index=['Australia', 'Belgium', 'France'])}
df = pd.DataFrame(d, dtype='str')
df
I am able to remove the values for one specific column; however, the split() function does not apply to the whole DataFrame.
f = lambda x: x["Australia"].split(" ")[0]
df = df.apply(f, axis=1)
Does anyone have an idea how I could remove the information after a space occurs for each value in the DataFrame?
I think you need to convert all columns to strings and then apply the split function:
df = df.astype(str).apply(lambda x: x.str.split().str[0])
Another solution:
df = df.astype(str).applymap(lambda x: x.split()[0])
print (df)
Australia Belgium France
Australia 0 1980 1991
Belgium 1980 0 1992
France 1991 1992 0
Let's try using assign, since the column names in this DataFrame are "well tamed", meaning they contain no spaces or special characters:
df.assign(Australia=df.Australia.str.split().str[0])
Output:
Australia Belgium France
Australia 0 1980 1991
Belgium 1980 0 1992
France 1991 1992 0
Or you can use apply and a lambda function if all your column datatypes are strings:
df.apply(lambda x: x.str.split().str[0])
Or if you have a mixture of numeric and string dtypes, you can use select_dtypes with assign like this (this assumes numpy has been imported as np):
df.assign(**df.select_dtypes(exclude=np.number).apply(lambda x: x.str.split().str[0]))
You could also loop over all columns and apply the split to each:
for column in df:
    df[column] = df[column].str.split().str[0]
I have data that contains prices, volumes and other data about various financial securities. My input data looks like the following:
import numpy as np
import pandas
prices = np.random.rand(15) * 100
volumes = np.random.randint(15, size=15) * 10
idx = pandas.Series([2007, 2007, 2007, 2007, 2007, 2008,
                     2008, 2008, 2008, 2008, 2009, 2009,
                     2009, 2009, 2009], name='year')
df = pandas.DataFrame({'price': prices, 'volume': volumes})
df.index = idx
# BELOW IS AN EXAMPLE OF WHAT THE INPUT MIGHT LOOK LIKE
# IT WON'T BE EXACT BECAUSE OF THE USE OF RANDOM
# price volume
# year
# 2007 0.121002 30
# 2007 15.256424 70
# 2007 44.479590 50
# 2007 29.096013 0
# 2007 21.424690 0
# 2008 23.019548 40
# 2008 90.011295 0
# 2008 88.487664 30
# 2008 51.609119 70
# 2008 4.265726 80
# 2009 34.402065 140
# 2009 10.259064 100
# 2009 47.024574 110
# 2009 57.614977 140
# 2009 54.718016 50
I want to produce a data frame that looks like:
year 2007 2008 2009
0 0.121002 23.019548 34.402065
1 15.256424 90.011295 10.259064
2 44.479590 88.487664 47.024574
3 29.096013 51.609119 57.614977
4 21.424690 4.265726 54.718016
I know of one way to produce the output above using groupby:
df = df.reset_index()
grouper = df.groupby('year')
df2 = None
for group, data in grouper:
    series = data['price'].copy()
    series.index = range(len(series))
    series.name = group
    df2 = pandas.DataFrame(series) if df2 is None else pandas.concat([df2, series], axis=1)
And I also know that you can do pivot to get a DataFrame that has NaNs for the missing indices on the pivot:
# df = df.reset_index()
df.pivot(columns='year', values='price')
# Output
# year 2007 2008 2009
# 0 0.121002 NaN NaN
# 1 15.256424 NaN NaN
# 2 44.479590 NaN NaN
# 3 29.096013 NaN NaN
# 4 21.424690 NaN NaN
# 5 NaN 23.019548 NaN
# 6 NaN 90.011295 NaN
# 7 NaN 88.487664 NaN
# 8 NaN 51.609119 NaN
# 9 NaN 4.265726 NaN
# 10 NaN NaN 34.402065
# 11 NaN NaN 10.259064
# 12 NaN NaN 47.024574
# 13 NaN NaN 57.614977
# 14 NaN NaN 54.718016
My question is the following:
Is there a way that I can create my output DataFrame in the groupby without creating the series, or is there a way I can re-index my input DataFrame so that I get the desired output using pivot?
You need to label the rows within each year 0-4. To do this, use cumcount after grouping. Then you can pivot correctly using that new column as the index.
df['year_count'] = df.groupby(level='year').cumcount()
df.reset_index().pivot(index='year_count', columns='year', values='price')
year 2007 2008 2009
year_count
0 61.682275 32.729113 54.859700
1 44.231296 4.453897 45.325802
2 65.850231 82.023960 28.325119
3 29.098607 86.046499 71.329594
4 67.864723 43.499762 19.255214
You can use groupby with apply, building a new Series from the underlying values, and then reshape with unstack:
print (df.groupby(level='year')['price'].apply(lambda x: pd.Series(x.values)).unstack(0))
year 2007 2008 2009
0 55.360804 68.671626 78.809139
1 50.246485 55.639250 84.483814
2 17.646684 14.386347 87.185550
3 54.824732 91.846018 60.793002
4 24.303751 50.908714 22.084445
I have a dataframe:
import pandas as pd
tuples = [('a', 1990),('a', 1994),('a',1996),('b',1992),('b',1997),('c',2001)]
index = pd.MultiIndex.from_tuples(tuples, names = ['Type', 'Year'])
vals = ['This','That','SomeName','This','SomeOtherName','SomeThirdName']
df = pd.DataFrame(vals, index=index, columns=['Whatev'])
df
Out[3]:
                  Whatev
Type Year
a    1990           This
     1994           That
     1996       SomeName
b    1992           This
     1997  SomeOtherName
c    2001  SomeThirdName
And I'd like to add a column of ascending integers corresponding to 'Year' that resets for each 'Type', like so:
                  Whatev  IndexInt
Type Year
a    1990           This         1
     1994           That         2
     1996       SomeName         3
b    1992           This         1
     1997  SomeOtherName         2
c    2001  SomeThirdName         1
Here's my current method:
grouped = df.groupby(level=0)
unique_loc = []
for name, group in grouped:
    unique_loc += range(1, len(group) + 1)
df['IndexInt'] = unique_loc
But this seems ugly and convoluted to me, and I imagine it could get slow on the ~50 million row dataframe that I'm working with. Is there a simpler way?
You can use groupby(level=0) + cumcount():
In [7]: df['IndexInt'] = df.groupby(level=0).cumcount()+1
In [8]: df
Out[8]:
                  Whatev  IndexInt
Type Year
a    1990           This         1
     1994           That         2
     1996       SomeName         3
b    1992           This         1
     1997  SomeOtherName         2
c    2001  SomeThirdName         1
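Note that cumcount numbers the rows in their current order, so if the years are not already sorted within each 'Type', you may want to sort the index first so that the numbering follows 'Year':
df = df.sort_index(level=['Type', 'Year'])
df['IndexInt'] = df.groupby(level='Type').cumcount() + 1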