I cannot figure out why, when I assign the scaled series (which contains no NaNs) to the original DataFrame, I get NaNs even though the index (years) matches.
Can anyone help? I am leaving out details which I think are not necessary; happy to provide more if needed.
So, given the following multi-index dataframe df:
                     value
country        year
Canada         2007      1
               2006      2
               2005      3
United Kingdom 2007      4
               2006      5
And the following series scaled:
2006 99
2007 54
2005 78
dtype: int64
The NaNs most likely come from index alignment: when you assign the series directly, pandas aligns its year-only index against the DataFrame's (country, year) MultiIndex, so no labels match. You can assign it as a new column if you reindex it to the sliced index and convert it to a list first, like this:
df.loc["Canada", "new_values"] = scaled.reindex(df.loc["Canada", :].index).to_list()
print(df.loc["Canada", :])
# Output
value new_values
year
2007 1 54.0
2006 2 99.0
2005 3 78.0
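Here is a minimal, self-contained sketch of the above; the construction of df and scaled is assumed (reconstructed from the example), and the commented-out line shows the direct assignment that most likely produced the NaNs:
import pandas as pd

idx = pd.MultiIndex.from_tuples(
    [("Canada", 2007), ("Canada", 2006), ("Canada", 2005),
     ("United Kingdom", 2007), ("United Kingdom", 2006)],
    names=["country", "year"],
)
df = pd.DataFrame({"value": [1, 2, 3, 4, 5]}, index=idx)
scaled = pd.Series({2006: 99, 2007: 54, 2005: 78})

# df.loc["Canada", "new_values"] = scaled   # aligns year vs. (country, year) -> all NaN
df.loc["Canada", "new_values"] = scaled.reindex(df.loc["Canada", :].index).to_list()
print(df.loc["Canada", :])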
I have one column DOB(Year) in the df dataframe, which consists of values like below:
DOB(Year)
1990.0
1998.0
2015.0
2017.0
I want to remove .0 from all values.
I have tried:
df['DOB(Year)'] = df['DOB(Year)'].astype(str)
df['DOB(Year)'] = df['DOB(Year)'].str.replace(".0$", "", regex=True)
But the resulting column values are NaN.
Can anyone please suggest a solution for this?
If you want a safe method that works on numeric/string input:
df['DOB(Year)'] = (pd.to_numeric(df['DOB(Year)'], errors='coerce')
.round().convert_dtypes()
)
Example (as new column):
DOB(Year) DOB(Year)_converted
0 1990.0 1990
1 1998.0 1998
2 2015.0 2015
3 2017.0 2017
4 2011.0001 2011
5 abc <NA>
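For reference, a self-contained sketch of that conversion, with sample values assumed from the example output above:
import pandas as pd

df = pd.DataFrame({'DOB(Year)': [1990.0, 1998.0, 2015.0, 2017.0, 2011.0001, 'abc']})
df['DOB(Year)_converted'] = (pd.to_numeric(df['DOB(Year)'], errors='coerce')  # non-numeric -> NaN
                             .round()            # 2011.0001 -> 2011.0
                             .convert_dtypes()   # float64 -> nullable Int64, NaN -> <NA>
                             )
print(df)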
Try this:
df['DOB(Year)'] = df['DOB(Year)'].astype('int')
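Note that astype('int') raises an error if the column still contains missing values; in that case the nullable integer dtype is an option (a sketch, assuming the column is already numeric/float):
df['DOB(Year)'] = df['DOB(Year)'].astype('Int64')   # keeps missing values as <NA>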
I have DF1 with several int columns and DF2 with 1 int column
DF1:
Year Industrial Consumer Discretionary Technology Utilities Energy Materials Communications Consumer Staples Health Care #No L1 US Agg Financials China Agg EU Agg
2001 5.884277 6.013842 6.216585 6.640594 6.701400 8.488806 7.175017 6.334284 6.082113 0.000000 5.439149 4.193736 4.686188 4.294788
2002 5.697814 6.277471 5.241045 6.608475 6.983511 8.089475 7.399775 5.882947 5.818563 7.250000 4.877012 3.635425 4.334125 3.944324
2003 5.144356 6.503754 6.270268 5.737079 6.466985 8.122228 7.040089 5.461827 5.385670 5.611753 4.163365 2.888026 3.955665 3.464020
2004 5.436486 6.463149 4.500574 5.329104 5.863406 7.562982 6.521106 5.990889 4.874258 6.554348 4.384878 3.502861 4.556418 3.412025
2005 5.003606 6.108812 5.732764 5.543677 6.131144 7.239053 7.228042 5.421092 5.561518 NaN 4.660754 3.970243 3.944251 3.106951
2006 4.505980 6.017253 4.923927 5.955308 5.799030 7.425253 6.942308
DF2:
Year Values
2002 4.514752
2003 3.994849
2004 4.254575
2005 4.277520
2006 4.784476
etc..
The indexes are the same for the two DataFrames.
The goal is to create DF3 by subtracting DF2 from every single column of DF1. (DF2 - DF1 = DF3)
Anywhere there is a NaN, it should skip the math.
Assuming "Year" is the index for both (if not, you can make it the index using set_index), you can use sub on axis:
df3 = df1.sub(df2['Values'], axis=0)
Output:
Industrial Consumer Discretionary Technology Utilities Energy \
Year
2001 NaN NaN NaN NaN NaN NaN
2002 1.183062 1.762719 0.726293 2.093723 2.468759 3.574723
2003 1.149507 2.508905 2.275419 1.742230 2.472136 4.127379
2004 1.181911 2.208574 0.245999 1.074529 1.608831 3.308407
2005 0.726086 1.831292 1.455244 1.266157 1.853624 2.961533
2006 -0.278496 1.232777 0.139451 1.170832 1.014554 2.640777
Materials Communications Consumer.1 Staples Health_Care US_Agg \
Year
2001 NaN NaN NaN NaN NaN NaN
2002 2.885023 1.368195 1.303811 2.735248 0.362260 -0.879327
2003 3.045240 1.466978 1.390821 1.616904 0.168516 -1.106823
2004 2.266531 1.736314 0.619683 2.299773 0.130303 -0.751714
2005 2.950522 1.143572 1.283998 NaN 0.383234 -0.307277
2006 2.157832 NaN NaN NaN NaN NaN
Financials China_Agg
Year
2001 NaN NaN
2002 -0.180627 -0.570428
2003 -0.039184 -0.530829
2004 0.301843 -0.842550
2005 -0.333269 -1.170569
2006 NaN NaN
If you want to subtract df1 from df2 instead, you can use rsub instead of sub. It's not clear which one you want, since you explain that you want df1 - df2 but your formula is the opposite.
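As a small, self-contained illustration of the alignment (toy numbers, not the asker's data):
import pandas as pd

df1 = pd.DataFrame({'Industrial': [5.88, 5.70], 'Utilities': [6.70, 6.98]},
                   index=pd.Index([2001, 2002], name='Year'))
df2 = pd.DataFrame({'Values': [4.51]}, index=pd.Index([2002], name='Year'))

print(df1.sub(df2['Values'], axis=0))    # df1 - df2; 2001 becomes NaN (missing in df2)
print(df1.rsub(df2['Values'], axis=0))   # df2 - df1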
I have data for many countries over a period of time (2001-2003). It looks something like this:
index  year  country  inflation  GDP
1      2001  AFG      nan        48
2      2002  AFG      nan        49
3      2003  AFG      nan        50
4      2001  CHI      3.0        nan
5      2002  CHI      5.0        nan
6      2003  CHI      7.0        nan
7      2001  USA      nan        220
8      2002  USA      4.0        250
9      2003  USA      2.5        280
I want to drop countries in case there is no data (i.e. values are missing for all years) for any given variable.
In the example table above, I want to drop AFG (because it is missing all values for inflation) and CHI (GDP missing). I don't want to drop observation #7 just because one year is missing.
What's the best way to do that?
This should work by filtering out countries where all values are NaN in either inflation or GDP:
(
df.groupby(['country'])
.filter(lambda x: not x['inflation'].isnull().all() and not x['GDP'].isnull().all())
)
Note: if you have more than two columns, you can use a more general version of this:
df.groupby(['country']).filter(lambda x: not x.isnull().all().any())
If you want this to check a specific range of years instead of all years, you can set up a mask and change the code a bit:
mask = (df['year'] >= 2002) & (df['year'] <= 2003) # mask of years
grp = df.groupby(['country']).filter(lambda x: not x[mask].isnull().all().any())
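A runnable sketch of the general (all-columns) version, with the data reconstructed from the question:
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'year': [2001, 2002, 2003] * 3,
    'country': ['AFG'] * 3 + ['CHI'] * 3 + ['USA'] * 3,
    'inflation': [np.nan, np.nan, np.nan, 3.0, 5.0, 7.0, np.nan, 4.0, 2.5],
    'GDP': [48, 49, 50, np.nan, np.nan, np.nan, 220, 250, 280],
})

kept = df.groupby('country').filter(lambda x: not x.isnull().all().any())
print(kept)   # only USA remains: AFG has no inflation at all, CHI has no GDP at all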
You can also try this:
# check where the sum is equal to 0 - means no values in the column for a specific country
group_by = df.groupby(['country']).agg({'inflation':sum, 'GDP':sum}).reset_index()
# extract only countries with information on both columns
indexes = group_by[ (group_by['GDP'] != 0) & ( group_by['inflation'] != 0) ].index
final_countries = list(group_by.loc[group_by.index.isin(indexes), :]['country'])
# keep only the rows for those countries
df = df.drop(df[~df.country.isin(final_countries)].index)
You could reshape the data frame from long to wide, drop nulls, and then convert back to long.
To convert from long to wide, you can use pivot functions. See this question too.
Here's code for dropping nulls, after its reshaped:
df.dropna(axis=0, how='any', inplace=True)  # Delete rows where any value is null
To convert back to long, you can use pd.melt.
I have a countrydf as below, in which each cell in the country column contains a list of the countries where the movie was released.
countrydf
id Country release_year
s1 [US] 2020
s2 [South Africa] 2021
s3 NaN 2021
s4 NaN 2021
s5 [India] 2021
I want to make a new df which look like this:
country_yeardf
Year US UK Japan India
1925 NaN NaN NaN NaN
1926 NaN NaN NaN NaN
1927 NaN NaN NaN NaN
1928 NaN NaN NaN NaN
It has the release year and the number of movies released in each country.
My solution is: with a blank df like the second one, run a for loop to count the number of movies released and then update the corresponding cell accordingly.
countrylist=['Afghanistan', 'Aland Islands', 'Albania', 'Algeria', 'American Samoa', 'Andorra', 'Angola', 'Anguilla', 'Antarctica', ….]
for x in countrylist:
    for j in list(range(0, 8807)):
        if x in countrydf.country[j]:
            t = int(countrydf.release_year[j])
            country_yeardf.at[t, x] = country_yeardf.at[t, x] + 1
An error occurred which read:
TypeError Traceback (most recent call last)
<ipython-input-25-225281f8759a> in <module>()
1 for x in countrylist:
2 for j in li:
----> 3 if x in countrydf.country[j]:
4 t=int(countrydf.release_year[j])
5 country_yeardf.at[t, x] = country_yeardf.at[t, x]+1
TypeError: argument of type 'float' is not iterable
I don't know which one is of float type here; I have checked the type of countrydf.country[j] and it returned int.
I am using pandas and I am just getting started with it. Can anyone please explain the error and suggest a solution for the df I want to create?
P.S.: my English is not so good, so I hope you guys understand.
Here is a solution using groupby
df = pd.DataFrame([['US', 2015], ['India', 2015], ['US', 2015], ['Russia', 2016]], columns=['country', 'year'])
country year
0 US 2015
1 India 2015
2 US 2015
3 Russia 2016
Now just groupby country and year and unstack the output:
df.groupby(['year', 'country']).size().unstack()
country India Russia US
year
2015 1.0 NaN 2.0
2016 NaN 1.0 NaN
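Since the Country column in the question actually holds lists, one way to feed it into the same groupby is to explode it first (a sketch; the small countrydf below is reconstructed from the question):
import pandas as pd

countrydf = pd.DataFrame({
    'id': ['s1', 's2', 's3', 's5'],
    'Country': [['US'], ['South Africa'], float('nan'), ['India']],
    'release_year': [2020, 2021, 2021, 2021],
})

exploded = countrydf.explode('Country')   # one row per (movie, country); NaN rows are dropped by groupby
print(exploded.groupby(['release_year', 'Country']).size().unstack())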
Some alternative ways to achieve this in pandas without loops.
If the Country column can have more than one value in the list in each row, you can try the below:
>>df['Country'].str.join("|").str.get_dummies().groupby(df['release_year']).sum()
India South Africa US
release_year
2020 0 0 1
2021 1 1 0
Otherwise, if Country has just one value per row in the list, as you have shown in the example, you can use crosstab:
>>pd.crosstab(df['release_year'],df['Country'].str[0])
Country India South Africa US
release_year
2020 0 0 1
2021 1 1 0
Given a dataset -
country year cases population
Afghanistan 1999 745 19987071
Brazil 1999 37737 172006362
China 1999 212258 1272915272
Afghanistan 2000 2666 20595360
Brazil 2000 80488 174504898
China 2000 213766 1280428583
The task is to get the ratio of cases to population, using the pandas apply function, in a new column called "prevalence".
This is what I have written:
def calc_prevalence(G):
    assert 'cases' in G.columns and 'population' in G.columns
    G_copy = G.copy()
    G_copy['prevalence'] = G_copy['cases','population'].apply(lambda x: (x['cases']/x['population']))
    display(G_copy)
but I am getting a
KeyError: ('cases', 'population')
Here is a solution that applies a named function to the dataframe without using lambda:
def calculate_ratio(row):
    return row['cases']/row['population']
df['prevalence'] = df.apply(calculate_ratio, axis = 1)
print(df)
#output:
country year cases population prevalence
0 Afghanistan 1999 745 19987071 0.000037
1 Brazil 1999 37737 172006362 0.000219
2 China 1999 212258 1272915272 0.000167
3 Afghanistan 2000 2666 20595360 0.000129
4 Brazil 2000 80488 174504898 0.000461
5 China 2000 213766 1280428583 0.000167
First, unless you've been explicitly told to use an apply function here for some reason, you can call the operation on the columns themselves, resulting in a much faster vectorized operation, i.e.:
G_copy['prevalence']=G_copy['cases']/G_copy['population']
Finally, if you must use apply for some reason, apply it on the df instead of the two series:
G_copy['prevalence']=G_copy.apply(lambda row: row['cases']/row['population'],axis=1)
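Putting both versions together on the question's data (a self-contained sketch; the DataFrame is rebuilt here for illustration):
import pandas as pd

df = pd.DataFrame({
    'country': ['Afghanistan', 'Brazil', 'China', 'Afghanistan', 'Brazil', 'China'],
    'year': [1999, 1999, 1999, 2000, 2000, 2000],
    'cases': [745, 37737, 212258, 2666, 80488, 213766],
    'population': [19987071, 172006362, 1272915272, 20595360, 174504898, 1280428583],
})

df['prevalence'] = df['cases'] / df['population']                                        # vectorized
df['prevalence_apply'] = df.apply(lambda row: row['cases'] / row['population'], axis=1)  # apply version
print(df)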