I have a large dataframe, ~1 million rows and 9 columns, with some rows missing data in a few of the columns.
dat = pd.read_table( 'file path', delimiter = ';')
I z Sp S B B/T r gf k
0 0.0303 2 0.606 0.31 0.04 0.23 0.03 0.38
1 0.0779 2 0.00 0.00 0.05 0.01 0.00
The first few columns are being read in as strings, and the last few as NaN, even when there is a numeric value there. When I include dtype='float64' I get:
ValueError: could not convert string to float:
Any help in fixing this?
You can use replace with a regex that turns whitespace-only strings into NaN, then cast to float.
Truly empty strings are already converted to NaN by read_table; whitespace-only fields are not, which is why the replace is needed.
import numpy as np

df = df.replace({r'\s+': np.nan}, regex=True).astype(float)
print (df)
I z Sp S B B/T r gf k
0 0.0 0.0303 2.0 0.606 0.31 0.04 0.23 0.03 0.38
1 1.0 0.0779 2.0 NaN 0.00 0.00 0.05 0.01 0.00
If the data contains other strings that need to be replaced with NaN, you can use to_numeric with apply:
df = df.apply(lambda x: pd.to_numeric(x, errors='coerce'))
print (df)
I z Sp S B B/T r gf k
0 0 0.0303 2 0.606 0.31 0.04 0.23 0.03 0.38
1 1 0.0779 2 NaN 0.00 0.00 0.05 0.01 0.00
How do I replace only the integer values in the ID column with a sequence of consecutive numbers? I'd like any non-integer or NaN cells skipped.
Current df:
ID AMOUNT
1 0.00
test 5.00
test test 0.00
test 0.00
1 0.00
xx 304.95
x xx 304.95
1 0.00
1 0.00
xxxxx 0.00
1 0.00
xxx 0.00
xx xx 0.00
1 0.00
Desired Outcome:
ID AMOUNT
1 0.00
test 5.00
test test 0.00
test 0.00
2 0.00
xx 304.95
x xx 304.95
3 0.00
4 0.00
xxxxx 0.00
5 0.00
xxx 0.00
xx xx 0.00
6 0.00
I tried making a new column using np.arange(len(df)) and then replacing the ID values with that, but it's not giving me the expected outcome.
Thank you!
You can use:
df['ID'] = (pd
    .to_numeric(df['ID'], errors='coerce')  # convert to numeric; non-numbers become NaN
    .cumsum()                               # running total (every numeric ID is 1 here, so this counts them)
    .convert_dtypes().astype(object)        # back to integer
    .fillna(df['ID'])                       # restore the non-numeric values
)
Alternative using slicing and updating:
s = pd.to_numeric(df['ID'], errors='coerce')
df['ID'].update(s[s.notna()].cumsum().astype(int).astype(object))
output:
ID AMOUNT
0 1 0.00
1 test 5.00
2 test test 0.00
3 test 0.00
4 2 0.00
5 xx 304.95
6 x xx 304.95
7 3 0.00
8 4 0.00
9 xxxxx 0.00
10 5 0.00
11 xxx 0.00
12 xx xx 0.00
13 6 0.00
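Note that cumsum counts correctly here only because every numeric ID happens to equal 1. If the numeric entries could hold other values, a sketch that counts the numeric rows explicitly would be:
s = pd.to_numeric(df['ID'], errors='coerce')
m = s.notna()
df.loc[m, 'ID'] = m.cumsum()[m]  # 1, 2, 3, ... at the numeric positions only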
Solution 1
Identify numeric values with a regex, then create a range counter and use boolean indexing to update the values:
m = df['ID'].str.match(r'\d+', na=False)
df.loc[m, 'ID'] = range(1, m.sum() + 1)
Solution 2
Identify numeric values with the pandas built-in to_numeric, then create a range counter and use boolean indexing to update the values:
m = pd.to_numeric(df['ID'], errors='coerce').notna()
df.loc[m, 'ID'] = range(1, m.sum() + 1)
Result
ID AMOUNT
0 1 0.00
1 test 5.00
2 test test 0.00
3 test 0.00
4 2 0.00
5 xx 304.95
6 x xx 304.95
7 3 0.00
8 4 0.00
9 xxxxx 0.00
10 5 0.00
11 xxx 0.00
12 xx xx 0.00
13 6 0.00
If you can iterate over the ID column, this can be done easily via Python's isinstance(object, class) function.
count = 1                                 # the desired sequence starts at 1
for index, value in enumerate(df['ID']):  # iterate over the column
    if isinstance(value, int):            # check whether this entry is an actual integer (not a digit string)
        df.loc[index, 'ID'] = count       # replace it with the running counter
        count += 1
I'm trying to calculate the relative weights of each row in df1, capped at a maximum value of 0.5. So far, I have been able to calculate the relative weights in df2, but without an upper bound. Here is a simple example:
import pandas as pd
df1 = pd.DataFrame({
    'Dates':['2021-01-01', '2021-01-02', '2021-01-03', '2021-01-04', '2021-01-05'],
    'ID1':[0,0,2,1,1],
    'ID2':[1,3,1,1,2],
    'ID3':[1,0,0,1,0],
    'ID4':[1,1,7,1,0],
    'ID5':[0,6,0,0,1]})
df1:
Dates ID1 ID2 ID3 ID4 ID5
0 2021-01-01 0 1 1 1 0
1 2021-01-02 0 3 0 1 6
2 2021-01-03 2 1 0 7 0
3 2021-01-04 1 1 1 1 0
4 2021-01-05 1 2 0 0 1
df1 = df1.set_index('Dates').T
df2 = df1.transform(lambda x: x/sum(x)).T
df2.round(2)
df2:
Dates ID1 ID2 ID3 ID4 ID5
2021-01-01 0.00 0.33 0.33 0.33 0.00
2021-01-02 0.00 0.30 0.00 0.10 0.60
2021-01-03 0.20 0.10 0.00 0.70 0.00
2021-01-04 0.25 0.25 0.25 0.25 0.00
2021-01-05 0.25 0.50 0.00 0.00 0.25
I'm trying to get df3 with the relative weights capped at a maximum of 0.5.
df3:
Dates ID1 ID2 ID3 ID4 ID5
2021-01-01 0.00 0.33 0.33 0.33 0.00
2021-01-02 0.00 0.30 0.00 0.10 0.50
2021-01-03 0.20 0.10 0.00 0.50 0.00
2021-01-04 0.25 0.25 0.25 0.25 0.00
2021-01-05 0.25 0.50 0.00 0.00 0.25
When I use the following adjusted function, I get the error: Transform function failed
df1.transform(lambda x: x/sum(x) if x/sum(x) < 0.5 else 0.5).T
Thanks a lot!
The transform fails because the lambda receives a whole Series at a time, so x/sum(x) < 0.5 is a boolean Series and cannot be used in an if. Instead of transposing and applying transformations to each element, we can manipulate the rows directly and cap the values with clip:
df3 = df1.copy().set_index('Dates')
df3 = df3.div(df3.sum(axis=1), axis=0).clip(upper=0.5).round(2).reset_index()
Output:
Dates ID1 ID2 ID3 ID4 ID5
0 2021-01-01 0.00 0.33 0.33 0.33 0.00
1 2021-01-02 0.00 0.30 0.00 0.10 0.50
2 2021-01-03 0.20 0.10 0.00 0.50 0.00
3 2021-01-04 0.25 0.25 0.25 0.25 0.00
4 2021-01-05 0.25 0.50 0.00 0.00 0.25
Would this work for you?
You can use apply(..., axis=1) and clip the values with a max of 0.5 (this assumes Dates is always the first column - alternatively, we could set it as the index):
df1[df1.columns[1:]] = df1[df1.columns[1:]].apply(lambda x:x/sum(x), axis=1).clip(upper=0.5)
Or with an explicit loop over the value columns, computing the row sums first:
row_sums = df1[df1.columns[1:]].sum(axis=1)
for col in df1.columns[1:]:
    df1[col] = (df1[col] / row_sums).clip(upper=0.5)
Have fun!
I'm fairly new to pandas and Python. I'm trying to compute a few selected interaction terms out of all possible interactions in a data frame, and then add them as new features to the df.
My solution was to calculate the interactions of interest using sklearn's PolynomialFeatures() and attach them to the df in a for loop. See example:
import numpy as np
import pandas as pd
from sklearn.preprocessing import PolynomialFeatures
np.random.seed(1111)
a1 = np.random.randint(2, size = (5,3))
a2 = np.round(np.random.random((5,3)),2)
df = pd.DataFrame(np.concatenate([a1, a2], axis = 1), columns = ['a','b','c','d','e','f'])
combinations = [['a', 'e'], ['a', 'f'], ['b', 'f']]
for comb in combinations:
    polynomizer = PolynomialFeatures(interaction_only=True, include_bias=False).fit(df[comb])
    newcol_nam = polynomizer.get_feature_names(comb)[2]
    newcol_val = polynomizer.transform(df[comb])[:,2]
    df[newcol_nam] = newcol_val
df
a b c d e f a e a f b f
0 0.0 1.0 1.0 0.51 0.45 0.10 0.00 0.00 0.10
1 1.0 0.0 0.0 0.67 0.36 0.23 0.36 0.23 0.00
2 0.0 0.0 0.0 0.97 0.79 0.02 0.00 0.00 0.00
3 0.0 1.0 0.0 0.44 0.37 0.52 0.00 0.00 0.52
4 0.0 0.0 0.0 0.16 0.02 0.94 0.00 0.00 0.00
Another solution would be to run
PolynomialFeatures(2, interaction_only=True, include_bias=False).fit_transform(df)
and then drop the interactions I'm not interested in.
However, neither option is ideal in terms of performance and I'm wondering if there is a better solution.
As commented, you can try:
df = df.join(pd.DataFrame({
    f'{x} {y}': df[x]*df[y] for x,y in combinations
}))
Or simply:
for comb in combinations:
    df[' '.join(comb)] = df[comb].prod(axis=1)
Output:
a b c d e f a e a f b f
0 0.0 1.0 1.0 0.51 0.45 0.10 0.00 0.00 0.10
1 1.0 0.0 0.0 0.67 0.36 0.23 0.36 0.23 0.00
2 0.0 0.0 0.0 0.97 0.79 0.02 0.00 0.00 0.00
3 0.0 1.0 0.0 0.44 0.37 0.52 0.00 0.00 0.52
4 0.0 0.0 0.0 0.16 0.02 0.94 0.00 0.00 0.00
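As a quick sanity check (my addition; it assumes the df and combinations defined in the question), the plain column products line up with the interaction column PolynomialFeatures would produce:
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

for x, y in combinations:
    poly = PolynomialFeatures(interaction_only=True, include_bias=False).fit(df[[x, y]])
    # with two input columns the transform yields [x, y, x*y], so column 2 is the interaction
    assert np.allclose(df[f'{x} {y}'], poly.transform(df[[x, y]])[:, 2])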
I have calculated a series of total tips by day of the week and appended it to the bottom of the totalpt dataframe.
I have set index.name for the totalpt dataframe to None.
However, while the dataframe displays the default 0,1,2,3 index, it doesn't display the default empty cell in the top-left corner directly above the index.
How can I make this cell empty in the dataframe?
total_bill tip sex smoker day time size tip_pct
0 16.54 1.01 F N Sun D 2 0.061884
1 12.54 1.40 F N Mon D 2 0.111643
2 10.34 3.50 M Y Tue L 4 0.338491
3 20.25 2.50 M Y Wed D 2 0.123457
4 16.54 1.01 M Y Thu D 1 0.061064
5 12.54 1.40 F N Fri L 2 0.111643
6 10.34 3.50 F Y Sat D 3 0.338491
7 23.25 2.10 M Y Sun B 3 0.090323
pivot = tips.pivot_table('total_bill', index=['sex', 'size'],columns=['day'],aggfunc='sum').fillna(0)
print pivot
day Fri Mon Sat Sun Thu Tue Wed
sex size
F 2 12.54 12.54 0.00 16.54 0.00 0.00 0.00
3 0.00 0.00 10.34 0.00 0.00 0.00 0.00
M 1 0.00 0.00 0.00 0.00 16.54 0.00 0.00
2 0.00 0.00 0.00 0.00 0.00 0.00 20.25
3 0.00 0.00 0.00 23.25 0.00 0.00 0.00
4 0.00 0.00 0.00 0.00 0.00 10.34 0.00
totals_row = tips.pivot_table('total_bill',columns=['day'],aggfunc='sum').fillna(0).astype('float')
totalpt = pivot.reset_index('sex').reset_index('size')
totalpt.index.name = None
totalpt = totalpt[['Fri', 'Mon','Sat', 'Sun', 'Thu', 'Tue', 'Wed']]
totalpt = totalpt.append(totals_row)
print totalpt
**day** Fri Mon Sat Sun Thu Tue Wed #problem text day
0 12.54 12.54 0.00 16.54 0.00 0.00 0.00
1 0.00 0.00 10.34 0.00 0.00 0.00 0.00
2 0.00 0.00 0.00 0.00 16.54 0.00 0.00
3 0.00 0.00 0.00 0.00 0.00 0.00 20.25
4 0.00 0.00 0.00 23.25 0.00 0.00 0.00
5 0.00 0.00 0.00 0.00 0.00 10.34 0.00
total_bill 12.54 12.54 10.34 39.79 16.54 10.34 20.25
That's the columns' name (df.columns.name), not part of the index.
In [11]: df = pd.DataFrame([[1, 2]], columns=['A', 'B'])
In [12]: df
Out[12]:
A B
0 1 2
In [13]: df.columns.name = 'XX'
In [14]: df
Out[14]:
XX A B
0 1 2
You can set it to None to clear it.
In [15]: df.columns.name = None
In [16]: df
Out[16]:
A B
0 1 2
An alternative, if you wanted to keep it, is to give the index a name:
In [21]: df.columns.name = "XX"
In [22]: df.index.name = "index"
In [23]: df
Out[23]:
XX A B
index
0 1 2
You can use rename_axis (available since pandas 0.17.0):
In [3939]: df
Out[3939]:
XX A B
0 1 2
In [3940]: df.rename_axis(None, axis=1)
Out[3940]:
A B
0 1 2
In [3942]: df = df.rename_axis(None, axis=1)
In [3943]: df
Out[3943]:
A B
0 1 2
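Applied to the frame from the question (assuming totalpt built as above), either of these clears the stray day label:
totalpt.columns.name = None
# or, equivalently
totalpt = totalpt.rename_axis(None, axis=1)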
I have demographic panel data, where each data point is categorized by a country, sex, year, and age. For a given country, sex, and year, my age-pattern has missing data, and I want to interpolate it based on the value of the age. For example, if 5 year-olds have a value of 5, and 10 year-olds have a value of 10, 6.3 year-olds should have a value of 6.3. I cannot use the default pandas 'linear' interpolation method because my age groups are not spaced linearly. My data look something like this:
iso3s = ['USA', 'CAN']
age_start_in_years = [0, 0.01, 0.1, 1]
years = [1990, 1991]
sexes = [1,2]
multi_index = pd.MultiIndex.from_product([iso3s, sexes, years, age_start_in_years],
                                         names=['iso3', 'sex', 'year', 'age_start'])
frame_length = len(iso3s)*len(age_start_in_years)*len(years)*len(sexes)
test_df = pd.DataFrame({'value':range(frame_length)},index=multi_index)
test_df = test_df.sort_index()
# Insert missingness to practice interpolating
idx = pd.IndexSlice
test_df.loc[idx[:,:,:,[0.01,0.1]],:] = np.nan
test_df
value
iso3 sex year age_start
CAN 1 1990 0.00 0
0.01 NaN
0.10 NaN
1.00 3
1991 0.00 4
0.01 NaN
0.10 NaN
1.00 7
2 1990 0.00 8
...
However, when I try to use test_df.interpolate(method='index'), I get this error:
ValueError: Only `method=linear` interpolation is supported on MultiIndexes.
Surely there must be some way to interpolate based on the index values.
This might come a little late, but I ran into the same problem today. What I came up with is also just a workaround, but it uses pandas built-ins at least. My approach was to reset the index, then group by the first subset of index columns (i.e. all but age_start). These sub-DataFrames can then be interpolated with the method='index' parameter and put back together into a whole frame with pd.concat. The resulting DataFrame then gets its original index reassigned.
idx_names = test_df.index.names
test_df = test_df.reset_index()
concat_list = [grp.set_index('age_start').interpolate(method='index') for _, grp in test_df.groupby(['iso3', 'sex', 'year'])]
test_df = pd.concat(concat_list)
test_df = test_df.reset_index().set_index(idx_names)
test_df
value
iso3 sex year age_start
CAN 1 1990 0.00 16.00
0.01 16.03
0.10 16.30
1.00 19.00
1991 0.00 20.00
0.01 20.03
0.10 20.30
1.00 23.00
2 1990 0.00 24.00
EDIT
I got back to this problem today and found a bug in my originally proposed solution. When the multi-index is not already sorted (as it happens to be in your example), the code above reorders your DataFrame by index values. To get around this, I join the result back onto a DataFrame with the original index so that the index order is preserved. I've also put it inside a function.
def interp_multiindex(df, interp_idx_name):
    """
    Provides index-based interpolation for a pd.MultiIndex, which usually only supports
    linear interpolation. Interpolates the full DataFrame.

    Parameters
    ----------
    df : pd.DataFrame
        The DataFrame with NaN values
    interp_idx_name : str
        The name of the multiindex level on which index-based interpolation should take place

    Returns
    -------
    df : pd.DataFrame
        The DataFrame with index-based interpolated values
    """
    # Get all index level names in order
    existing_multiidx = df.index
    # Remove the level on which interpolation will take place
    noninterp_idx_names = [idx_name for idx_name in existing_multiidx.names
                           if idx_name != interp_idx_name]
    df = df.reset_index()
    concat_list = [grp.set_index(interp_idx_name).interpolate(method='index')
                   for _, grp in df.groupby(noninterp_idx_names)]
    df = pd.concat(concat_list)
    df = df.reset_index().set_index(existing_multiidx.names)
    df = pd.DataFrame(index=existing_multiidx).join(df)
    return df
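A usage sketch with the test_df from the question:
test_df = interp_multiindex(test_df, 'age_start')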
I found this hacky work-around that gets rid of the MultiIndex and uses a combination of groupby and transform:
def multiindex_interp(x, interp_col, step_col):
    valid = ~pd.isnull(x[interp_col])
    invalid = ~valid

    # Closest valid values before and after each row
    x['last_valid_value'] = x[interp_col].ffill()
    x['next_valid_value'] = x[interp_col].bfill()

    # Generate a new column filled with NaN's
    x['last_valid_step'] = np.nan
    # Copy the step value where we have a valid value (.loc avoids chained assignment)
    x.loc[valid, 'last_valid_step'] = x.loc[valid, step_col]
    x['last_valid_step'] = x['last_valid_step'].ffill()

    x['next_valid_step'] = np.nan
    x.loc[valid, 'next_valid_step'] = x.loc[valid, step_col]
    x['next_valid_step'] = x['next_valid_step'].bfill()

    # Simple linear interpolation: distance from last step / (range between closest valid steps)
    # * difference between closest values + last value
    x.loc[invalid, interp_col] = (x[step_col] - x['last_valid_step']) \
        / (x['next_valid_step'] - x['last_valid_step']) \
        * (x['next_valid_value'] - x['last_valid_value']) \
        + x['last_valid_value']
    return x
test_df = test_df.reset_index(drop=False)
grouped = test_df.groupby(['iso3','sex','year'])
interpolated = grouped.transform(multiindex_interp,'value','age_start')
test_df['value'] = interpolated['value']
test_df
iso3 sex year age_start value
0 CAN 1 1990 0.00 16.00
1 CAN 1 1990 0.01 16.03
2 CAN 1 1990 0.10 16.30
3 CAN 1 1990 1.00 19.00
4 CAN 1 1991 0.00 20.00
5 CAN 1 1991 0.01 20.03
6 CAN 1 1991 0.10 20.30
7 CAN 1 1991 1.00 23.00
8 CAN 2 1990 0.00 24.00
9 CAN 2 1990 0.01 24.03
10 CAN 2 1990 0.10 24.30
11 CAN 2 1990 1.00 27.00
...
You can try something like this:
test_df.groupby(level=[0,1,2])\
       .apply(lambda g: g.reset_index(level=[0,1,2], drop=True)
                         .interpolate(method='index'))
Output:
value
iso3 sex year age_start
CAN 1 1990 0.00 16.00
0.01 16.03
0.10 16.30
1.00 19.00
1991 0.00 20.00
0.01 20.03
0.10 20.30
1.00 23.00
2 1990 0.00 24.00
0.01 24.03
0.10 24.30
1.00 27.00
1991 0.00 28.00
0.01 28.03
0.10 28.30
1.00 31.00
USA 1 1990 0.00 0.00
0.01 0.03
0.10 0.30
1.00 3.00
1991 0.00 4.00
0.01 4.03
0.10 4.30
1.00 7.00
2 1990 0.00 8.00
0.01 8.03
0.10 8.30
1.00 11.00
1991 0.00 12.00
0.01 12.03
0.10 12.30
1.00 15.00
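The same idea wrapped in a small reusable helper (a sketch; interpolate_on_level is just a name I made up):
def interpolate_on_level(df, group_levels):
    # Group on the given MultiIndex levels, drop them inside each group so the
    # remaining numeric level drives interpolate(method='index'), then let
    # groupby.apply rebuild the full MultiIndex.
    return (df.groupby(level=group_levels)
              .apply(lambda g: g.reset_index(level=group_levels, drop=True)
                                .interpolate(method='index')))

result = interpolate_on_level(test_df, ['iso3', 'sex', 'year'])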