I have a task that requires taking the first 6 digits of a column in pandas. However, if a number is less than 6 digits long, a decimal point gets appended to the end of it, which is not acceptable for my needs later down the road.
I'm sure I can get rid of the decimal with various workarounds, but they will probably become inefficient as the DataFrames get larger.
Current code:
import pandas as pd
import numpy as np
df1 = pd.DataFrame({'A': [np.NaN, np.NaN, 3, 4, 5, 5, 3, 1, 5, np.NaN],
                    'B': [1, 0, 3, 5, 0, 0, np.NaN, 9, 0, 0],
                    'C': [10, 0, 30, 50, 0, 0, 4, 10, 1, 0],
                    'D': [123456, 123456, 1234567, 12345678, 12345, 12345, 12345678, 123456789, 1234567, np.NaN],
                    'E': ['Assign', 'Unassign', 'Assign', 'Ugly', 'Appreciate', 'Undo', 'Assign', 'Unicycle', 'Assign', 'Unicorn']})
wow2 = df1
wow2['D'] = wow2['D'][:6]
print(wow2)
A B C D E
0 NaN 1.0 10 123456 Assign
1 NaN 0.0 0 123456 Unassign
2 3.0 3.0 30 123456 Assign
3 4.0 5.0 50 123456 Ugly
4 5.0 0.0 0 12345. Appreciate <--- Notice Decimal
5 5.0 0.0 0 12345. Undo <--- Notice Decimal
6 3.0 NaN 4 NaN Assign
7 1.0 9.0 10 NaN Unicycle
8 5.0 0.0 1 NaN Assign
9 NaN 0.0 0 NaN Unicorn
Is there a way I can leave the number alone if its length is not over 6 digits? I thought about converting the column to string and looping over it... but I believe that would be wildly inefficient and create more problems than solutions.
To trim a number to at most 6 digits (without converting to string and back), you may use the modulo operator. Note that x % 10**6 keeps the last six digits rather than the first six, but it leaves any number with six or fewer digits unchanged, as the output below shows.
In order to represent your numeric values as non-floating-point numbers you need to convert them into integers. However, mixing integers and np.NaN within the same column will result in float64 (see here for more). To get around this (which is kind of ugly) you need to convert the integers into strings, which forces the dtype to be object because you mix strings and float values.
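A quick way to see that promotion (illustrative only, not part of the original answer):
import numpy as np
import pandas as pd

# Mixing integers with NaN silently promotes the column to float64.
print(pd.Series([1, 2, np.nan]).dtype)        # float64
# Mixing strings with NaN gives an object column instead.
print(pd.Series(['1', '2', np.nan]).dtype)    # object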
The solution looks like the following:
wow2['D'] = wow2['D'].mod(10**6)\
                     .dropna()\
                     .astype(int)\
                     .astype(str)
print(wow2['D'])
0 123456
1 123456
2 234567
3 345678
4 12345
5 12345
6 345678
7 456789
8 234567
9 NaN
Name: D, dtype: object
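As a side note (not part of the original answer): if you would rather keep the column numeric instead of object, recent pandas versions offer the nullable Int64 dtype, which can hold integers and missing values side by side. A rough sketch:
import numpy as np
import pandas as pd

df1 = pd.DataFrame({'D': [123456, 123456, 1234567, 12345678, 12345,
                          12345, 12345678, 123456789, 1234567, np.nan]})

# Keep the last six digits; the nullable Int64 dtype preserves both
# integers and missing values without falling back to float64.
df1['D'] = df1['D'].mod(10**6).astype('Int64')
print(df1['D'])
# Rows that were NaN in the original column show up as <NA>, and the digits stay integers.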
Related
I have a dataframe with around 50 columns and around 3000 rows. Most cells are empty but not all of them. I am trying to add a new row at the end of the dataframe with the mean value of each column, and I need it to ignore the empty cells.
I am using df.mean(axis=0), which somehow turns all values of the dataframe into complex numbers: the values stay the same, but a +0j is added. I have no idea why.
Turbine.loc['Mean_Values'] = Turbine.mean(axis=0)
I couldn't find a solution for this. Is it because of the empty cells?
Based on this, df.mean() automatically skips NaN/null values, since its skipna parameter defaults to True. Example:
df = pd.DataFrame({'value': [1, 2, 3, np.nan, 5, 6, np.nan]})
# Note: DataFrame.append was removed in pandas 2.0; pd.concat is the replacement there.
df = df.append({'value': df.mean(numeric_only=True).value}, ignore_index=True)
print(df)
Output:
value
0 1.0
1 2.0
2 3.0
3 NaN
4 5.0
5 6.0
6 NaN
7 3.4
But if there is a complex number in any cell, the whole column is upcast to complex, and the result of df.mean() will be a complex number too. Example:
df = pd.DataFrame({'value': [1, 2, 3, np.nan, 5, 6, np.nan, complex(1, 0)]})
print(df)
print('\n')
df = df.append({'value': df.mean(numeric_only=True).value}, ignore_index=True)
print(df)
Output with a complex value in a cell:
value
0 (1+0j)
1 (2+0j)
2 (3+0j)
3 NaN
4 (5+0j)
5 (6+0j)
6 NaN
7 (1+0j)
value
0 (1+0j)
1 (2+0j)
2 (3+0j)
3 NaN
4 (5+0j)
5 (6+0j)
6 NaN
7 (1+0j)
8 (3+0j)
Hope this can help you :)
It turned out that some cells contained information about directions (north, west, ...), which caused the values to be interpreted as complex numbers.
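As a follow-up sketch (not part of the original answer): once stray text like that is in a numeric column, coercing it with pd.to_numeric before taking the mean turns the unparsable entries into NaN, which mean() then skips. The power column name below is made up.
import numpy as np
import pandas as pd

# Hypothetical example: a numeric column with a few direction strings mixed in.
Turbine = pd.DataFrame({'power': [1.5, 2.0, 'north', 3.5, np.nan, 'west']})

# Coerce to numeric: unparsable entries (the directions) become NaN and are
# then skipped by mean(), so no odd upcasting can sneak in.
Turbine['power'] = pd.to_numeric(Turbine['power'], errors='coerce')
Turbine.loc['Mean_Values'] = Turbine.mean(axis=0)
print(Turbine)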
I followed the method in this post to replace missing values with the group mode, but I encounter "IndexError: index out of bounds".
df['SIC'] = df.groupby('CIK').SIC.apply(lambda x: x.fillna(x.mode()[0]))
I guess this is probably because some groups contain only missing values and therefore have no mode. Is there a way to get around this? Thank you!
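(That guess can be confirmed: Series.mode() on an all-NaN group returns an empty Series, so indexing it with [0] raises. Below is a minimal sketch of a guarded version of the same lambda pattern, on made-up data, purely to illustrate the failure mode; the answer that follows avoids the problem, and the slowness, altogether.)
import numpy as np
import pandas as pd

df = pd.DataFrame({'CIK': ['A', 'A', 'B', 'B'],
                   'SIC': [10.0, np.nan, np.nan, np.nan]})  # group B is all NaN

def fill_with_group_mode(x):
    m = x.mode()                                   # empty Series for an all-NaN group
    return x.fillna(m[0]) if not m.empty else x    # so guard before indexing [0]

df['SIC'] = df.groupby('CIK').SIC.apply(fill_with_group_mode)
print(df)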
mode is quite difficult, given that there really isn't any agreed-upon way to deal with ties, and it's typically very slow. Here's one way that will be "fast": define a function that calculates the mode for each group, then fill the missing values afterwards with a map. This doesn't run into issues with all-NaN groups, though for ties it arbitrarily chooses the modal value that sorts first:
def fast_mode(df, key_cols, value_col):
    """
    Calculate a column mode, by group, ignoring null values.

    Parameters
    ----------
    df : pandas.DataFrame
        DataFrame over which to calculate the mode.
    key_cols : list of str
        Columns to groupby for calculation of mode.
    value_col : str
        Column for which to calculate the mode.

    Returns
    -------
    pandas.DataFrame
        One row for the mode of value_col per key_cols group. If ties,
        returns the one which is sorted first.
    """
    return (df.groupby(key_cols + [value_col]).size()
              .to_frame('counts').reset_index()
              .sort_values('counts', ascending=False)
              .drop_duplicates(subset=key_cols)
              .drop(columns='counts'))
Sample data df:
CIK SIK
0 C 2.0
1 C 1.0
2 B NaN
3 B 3.0
4 A NaN
5 A 3.0
6 C NaN
7 B NaN
8 C 1.0
9 A 2.0
10 D NaN
11 D NaN
12 D NaN
Code:
df.loc[df.SIK.isnull(), 'SIK'] = df.CIK.map(fast_mode(df, ['CIK'], 'SIK').set_index('CIK').SIK)
Output df:
CIK SIK
0 C 2.0
1 C 1.0
2 B 3.0
3 B 3.0
4 A 2.0
5 A 3.0
6 C 1.0
7 B 3.0
8 C 1.0
9 A 2.0
10 D NaN
11 D NaN
12 D NaN
I have a car dataset where I want to replace the '?' values in the normalized-losses column with the mean of the remaining numerical values. The code I have used is:
mean = df["normalized-losses"].mean()
df["normalized-losses"].replace("?",mean)
However, this produces the error:
ValueError: could not convert string to float: '???164164?158?158?192192188188??121988111811811814814814814811014513713710110110111078106106858585107????145??104104104113113150150150150129115129115?115118?93939393?142???161161161161153153???125125125137128128128122103128128122103168106106128108108194194231161161??161161??16116116111911915415415474?186??????1501041501041501048383831021021021021028989858587877477819191919191919191168168168168134134134134134134656565656519719790?1221229494949494?256???1037410374103749595959595'
Can anyone help with the way in which I can convert the '?' values to the mean value? Also, this is the first time I am working with the pandas package, so if I have made any silly mistakes, please forgive me.
Use to_numeric to convert non-numeric values to NaN, and then fillna with the mean:
vals = pd.to_numeric(df["normalized-losses"], errors='coerce')
df["normalized-losses"] = vals.fillna(vals.mean())
#data from jpp
print (df)
normalized-losses
0 1.0
1 2.0
2 3.0
3 3.4
4 5.0
5 6.0
6 3.4
Details:
print (vals)
0 1.0
1 2.0
2 3.0
3 NaN
4 5.0
5 6.0
6 NaN
Name: normalized-losses, dtype: float64
print (vals.mean())
3.4
Use replace() followed by fillna():
df['normalized-losses'] = df['normalized-losses'].replace('?', np.NaN)
df['normalized-losses'] = df['normalized-losses'].fillna(df['normalized-losses'].mean())
The mean of a series of mixed types is not defined. Convert to numeric and then use replace:
df = pd.DataFrame({'A': [1, 2, 3, '?', 5, 6, '??']})
mean = pd.to_numeric(df['A'], errors='coerce').mean()
df['B'] = df['A'].replace('?', mean)
print(df)
A B
0 1 1
1 2 2
2 3 3
3 ? 3.4
4 5 5
5 6 6
6 ?? ??
If you need to replace all non-numeric values, then use fillna:
nums = pd.to_numeric(df['A'], errors='coerce')
df['B'] = nums.fillna(nums.mean())
print(df)
A B
0 1 1.0
1 2 2.0
2 3 3.0
3 ? 3.4
4 5 5.0
5 6 6.0
6 ?? 3.4
Using Python and pandas: for a given data set, how does one find the total number of missing attributes? I have found the number for each column, but I need to sum across the columns to find the total. Below is the code I am currently using.
def num_missing(x):
    return sum(x.isnull())
print("Missing Values per Column:")
print(data_file1.apply(num_missing))
Consider df -
df
A B C
0 1.0 4 NaN
1 2.0 5 1.0
2 NaN 6 6.0
3 NaN 7 3.0
Column-wise NaN count -
df.isnull().sum(0)
A 2
B 0
C 1
dtype: int64
Row-wise NaN count -
df.isnull().sum(1)
0 1
1 0
2 1
3 1
dtype: int64
df-wide NaN count -
df.isnull().values.sum()
3
Option 1: call .sum() twice, where the second call finds the sum of the intermediate Series.
df = pd.DataFrame(np.ones((5,5)))
df.iloc[2:4, 1:3] = np.nan
df.isnull().sum().sum()
# 4
Option 2: use underlying NumPy array.
np.isnan(df.values).sum()
# 4
Option 2 should be significantly faster (8.5 us vs. 249 us on this sample data).
As noted by @root and here, np.isnan() works only on numeric data, not object dtypes; pandas.DataFrame.isnull() doesn't have this problem.
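A quick illustration of that difference (a sketch with made-up column names):
import numpy as np
import pandas as pd

df = pd.DataFrame({'num': [1.0, np.nan, 3.0],
                   'text': ['a', None, 'c']})   # object dtype column

print(df.isnull().values.sum())      # 2 -- counts missing values in both columns
try:
    np.isnan(df.values).sum()        # df.values is an object array here
except TypeError as exc:
    print('np.isnan failed:', exc)   # np.isnan cannot handle object arrays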
I would like to fill N/A values in a DataFrame in a selective manner. In particular, if there is a sequence of consecutive NaNs within a column, I want them to be filled with the preceding non-NaN value, but only if the length of the NaN sequence is below a specified threshold. For example, if the threshold is 3, then a within-column sequence of 3 or fewer NaNs will be filled with the preceding non-NaN value, whereas a sequence of 4 or more NaNs will be left as is.
That is, if the input DataFrame is
2 5 4
nan nan nan
nan nan nan
5 nan nan
9 3 nan
7 9 1
I want the output to be:
2 5 4
2 5 nan
2 5 nan
5 5 nan
9 3 nan
7 9 1
The fillna function, when applied to a DataFrame, has the method and limit options, but these are unfortunately not sufficient to achieve the task. I tried to specify method='ffill' and limit=3, but that fills in the first 3 NaNs of any sequence, not selectively as described above.
I suppose this can be coded by going column by column with some conditional statements, but I suspect there must be something more Pythonic. Any suggestions on an efficient way to achieve this?
Working with contiguous groups is still a little awkward in pandas... or at least I don't know of a slick way to do this, which isn't at all the same thing. :-)
One way to get what you want would be to use the compare-cumsum-groupby pattern:
In [68]: nulls = df.isnull()
...: groups = (nulls != nulls.shift()).cumsum()
...: to_fill = groups.apply(lambda x: x.groupby(x).transform(len) <= 3)
...: df.where(~to_fill, df.ffill())
...:
Out[68]:
0 1 2
0 2.0 5.0 4.0
1 2.0 5.0 NaN
2 2.0 5.0 NaN
3 5.0 5.0 NaN
4 9.0 3.0 NaN
5 7.0 9.0 1.0
Okay, another alternative which I don't like because it's too tricky:
def method_2(df):
    nulls = df.isnull()
    filled = df.ffill(limit=3)
    # Cells that are still NaN after the limited ffill sit more than 3 rows
    # past the last valid value, i.e. inside a run longer than the threshold.
    unfilled = nulls & (~filled.notnull())
    # Mark non-null cells with 2.0 and over-the-limit cells with True (1),
    # then back-fill so every cell in a too-long run picks up the 1 marker.
    nf = nulls.replace({False: 2.0, True: np.nan})
    do_not_fill = nf.combine_first(unfilled.replace(False, np.nan)).bfill() == 1
    return df.where(do_not_fill, df.ffill())
This doesn't use any groupby tools and so should be faster. Note that a different approach would be to manually (using shifts) determine which elements are to be filled because they're a group of length 1, 2, or 3.
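That shift-based alternative is not spelled out above, but one possible reading of it (a sketch under my own assumptions, not the answer's code) is: mark every window of limit + 1 consecutive NaNs, spread that mark over the cells each window covers, and forward-fill everything else.
import numpy as np
import pandas as pd

def ffill_short_runs(df, limit=3):
    """Forward-fill NaN runs of length <= limit; leave longer runs untouched."""
    nulls = df.isnull()
    # A window of (limit + 1) consecutive NaNs marks its last cell True.
    window = nulls.copy()
    for k in range(1, limit + 1):
        window &= nulls.shift(k, fill_value=False)
    # A cell sits in a "too long" run iff some such window covers it.
    in_long_run = window.copy()
    for k in range(1, limit + 1):
        in_long_run |= window.shift(-k, fill_value=False)
    # Keep the original NaNs in long runs; fill everything else forward.
    return df.where(in_long_run, df.ffill())

df = pd.DataFrame([[2, 5, 4],
                   [np.nan] * 3,
                   [np.nan] * 3,
                   [5, np.nan, np.nan],
                   [9, 3, np.nan],
                   [7, 9, 1]])
print(ffill_short_runs(df))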