I have a Series of dtype object. When using Series.str.strip(), the cells that contain only ints are turned into NaN.
How do I avoid this?
Example:
sr = pd.Series([1,2,3,'foo '])
sr.str.strip()
0 NaN
1 NaN
2 NaN
3 foo
dtype: object
Desired outcome:
0 1
1 2
2 3
3 foo
dtype: object
The simplest approach is to replace the missing values with the original values using Series.fillna:
sr = pd.Series([1,2,3,'foo '])
sr.str.strip().fillna(sr)
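The str.strip call produces NaN for the non-string cells, and fillna(sr) then restores the original values:
0      1
1      2
2      3
3    foo
dtype: object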
Or strip only the strings, tested with isinstance:
print (sr.apply(lambda x: x.strip() if isinstance(x, str) else x))
0 1
1 2
2 3
3 foo
dtype: object
You can cast the series to str altogether and then strip:
>>> sr.astype(str).str.strip()
0 1
1 2
2 3
3 foo
dtype: object
This way 1 becomes "1" and passes through stripping unchanged. But the values will remain strings at the end, not integers; not sure if that's the desired output.
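If you do need the numbers back afterwards, a minimal sketch combining this with pd.to_numeric (note that the numeric entries come back as floats, not the original ints):
stripped = sr.astype(str).str.strip()
# coerce turns non-numeric strings into NaN; fillna restores them from stripped
restored = pd.to_numeric(stripped, errors='coerce').fillna(stripped)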
How can I get the length of each string in a Pandas Series (like the len function)? And how can I access 'eee' (the 3-character string) without referring to its index?
test=pd.Series(['aaaa','bbbb','cccc','dddd','eee'])
Desired output:
4    eee
What you want is unclear.
To get the length of each string use str.len:
test.str.len()
output:
0 4
1 4
2 4
3 4
4 3
dtype: int64
To select the strings with 3 characters use boolean indexing:
test[test.str.len().eq(3)]
output:
4 eee
dtype: object
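If you want just the value without its index label, one option (assuming exactly one match, as here) is to take the first element of the filtered result:
test[test.str.len().eq(3)].iloc[0]
output:
'eee'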
Consider the following data frame:
a
0 1
1 1
2 2
3 4
4 5
5 6
6 4
Is there a convenient way (without iterating over rows) to create a column that represents "seen before" for every value of column a?
For example, the desired output for this example is (0 represents not seen before, 1 represents seen before):
0
1
0
0
0
0
1
If this is possible, is there a way to enhance it with counts of previous occurrences rather than just a binary indicator?
This should just be .duplicated() (see the documentation). Then if you want 0s and 1s instead of False and True, you can use .astype(int) on the output:
From pd.DataFrame:
df.duplicated(subset="a").astype(int)
0 0
1 1
2 0
3 0
4 0
5 0
6 1
dtype: int32
From pd.Series:
df["a"].duplicated().astype(int)
0 0
1 1
2 0
3 0
4 0
5 0
6 1
Name: a, dtype: int32
This will mark the first time a value is "seen" as False, and all subsequent values that have already been "seen" as True. Coercing it to an int datatype via astype changes False -> 0 and True -> 1.
Use assign and duplicated:
df.assign(seenbefore = lambda x: x.a.duplicated().astype(int))
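For the follow-up question about counts of previous occurrences, a short sketch using groupby.cumcount, which numbers the rows within each group from 0 and therefore gives the number of earlier occurrences of each value (the column name seen_count is just illustrative):
# seen_count: how many times this value of column a appeared in earlier rows
df['seen_count'] = df.groupby('a').cumcount()
For the example column a this yields 0, 1, 0, 0, 0, 0, 1.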
The following command works fine when I run Python 2:
df5b = pd.merge(df5a, df5bb, how='outer')
However, when I run the same command with the same dfs in Python 3, I get the following error:
ValueError: You are trying to merge on object and int64 columns. If you wish to proceed you should use pd.concat
My dataframes are very large; I hope that someone can help me out without my giving examples of the dataframes. The command is fine with Python 2, so I assume the problem is not the dataframes, but maybe a change to this command in Python 3?
The problem is that some columns are integers in one DataFrame and strings in the other, under the same names.
The simplest solution is to cast all columns to strings:
df5b = pd.merge(df5a.astype(str), df5bb.astype(str), how='outer')
Another is to check the dtypes:
print (df5a.dtypes)
print (df5bb.dtypes)
And then convert the mismatched columns to the same type, e.g. convert the string columns in a list to integers:
cols = ['col1','col12','col3']
df5a[cols] = df5a[cols].astype(int)
Sample:
df5a = pd.DataFrame({
'B':[4,5,4,5],
'C':[7,8,9,4],
'F':list('aaab')
})
df5bb = pd.DataFrame({
'B':['4','5','5'],
'F':list('aab')
})
df5b = pd.merge(df5a.astype(str), df5bb.astype(str), how='outer')
print (df5b)
B C F
0 4 7 a
1 4 9 a
2 5 8 a
3 5 4 b
print (df5a.dtypes)
B int64
C int64
F object
dtype: object
print (df5bb.dtypes)
B object
F object
dtype: object
cols = ['B']
df5bb[cols] = df5bb[cols].astype(int)
df5b = pd.merge(df5a, df5bb, how='outer')
print (df5b)
B C F
0 4 7 a
1 4 9 a
2 5 8 a
3 5 4 b
As I stated in the comments, coercion does not happen on mixed types (which may be int, str, or float), so you can either use concat, or convert the columns to str and then merge, as jezrael mentioned.
Just to determine the types, you can inspect:
>>> pd.concat([df5a, df5bb]).dtypes
B object
C float64
F object
dtype: object
>>> pd.concat([df5a, df5bb])
B C F
0 4 7.0 a
1 5 8.0 a
2 4 9.0 a
3 5 4.0 b
0 4 NaN a
1 5 NaN a
2 5 NaN b
I have a df with values:
A B C D
0 1 2 3 2
1 2 3 3 9
2 5 3 6 6
3 3 6 7
4 6 7
5 2
df.shape is 6x4, say
df.iloc[:,1] pulls out the B column, but len(df.iloc[:,1]) is also 6.
How do I "reshape" df.iloc[:,1]? Which function can I use so that the output is the length of the actual values in the column?
My expected output in this case is 3
You can use last_valid_index. Just note that since your series originally contained NaN values, which are considered float, even after filtering your series will be float. You may wish to convert to int as a separate step.
# first convert dataframe to numeric
df = df.apply(pd.to_numeric, errors='coerce')
# extract column
B = df.iloc[:, 1]
# filter to the last valid value
B_filtered = B[:B.last_valid_index()]
print(B_filtered)
0 2.0
1 3.0
2 3.0
3 6.0
Name: B, dtype: float64
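For the separate int conversion mentioned above (the exact integer dtype shown assumes a 64-bit default):
B_filtered.astype(int)
0    2
1    3
2    3
3    6
Name: B, dtype: int64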
You can use a list comprehension like this, assuming the empty cells contain empty strings:
len([x for x in df.iloc[:,1] if x != ''])
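If the empty cells are NaN instead (e.g. after the pd.to_numeric conversion above), Series.count() returns the number of non-missing values directly:
df.iloc[:, 1].count()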
shift converts my column from integer to float. It turns out that np.nan is float only. Is there any way to keep the shifted column as integer?
df = pd.DataFrame({"a":range(5)})
df['b'] = df['a'].shift(1)
df['a']
# 0 0
# 1 1
# 2 2
# 3 3
# 4 4
# Name: a, dtype: int64
df['b']
# 0 NaN
# 1 0.0
# 2 1.0
# 3 2.0
# 4 3.0
# Name: b, dtype: float64
Solution for pandas under 0.24:
The problem is that you get a NaN value, which is float, so the int is converted to float; see NA type promotions.
One possible solution is to convert the NaN values to some value like 0, after which conversion to int is possible:
df = pd.DataFrame({"a":range(5)})
df['b'] = df['a'].shift(1).fillna(0).astype(int)
print (df)
a b
0 0 0
1 1 0
2 2 1
3 3 2
4 4 3
Solution for pandas 0.24+ - check Series.shift:
fill_value : object, optional
The scalar value to use for newly introduced missing values. The default depends on the dtype of self. For numeric data, np.nan is used. For datetime, timedelta, or period data, NaT is used. For extension dtypes, self.dtype.na_value is used.
Changed in version 0.24.0.
df['b'] = df['a'].shift(fill_value=0)
Another solution, starting from pandas version 0.24.0: simply provide a value for the fill_value parameter:
df['b'] = df['a'].shift(1, fill_value=0)
You can construct a numpy array by prepending a 0 to all but the last element of column a:
df.assign(b=np.append(0, df.a.values[:-1]))
a b
0 0 0
1 1 0
2 2 1
3 3 2
4 4 3
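Since np.append here builds a plain integer array (no NaN is ever introduced), the resulting column keeps an integer dtype, which you can verify:
df.assign(b=np.append(0, df.a.values[:-1])).dtypes
a    int64
b    int64
dtype: object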
As of pandas 1.0.0 I believe you have another option, which is to first use convert_dtypes. This converts the dataframe columns to dtypes that support pd.NA, avoiding the issues with NaN.
df = pd.DataFrame({"a":range(5)})
df = df.convert_dtypes()
df['b'] = df['a'].shift(1)
print(df['a'])
# 0 0
# 1 1
# 2 2
# 3 3
# 4 4
# Name: a, dtype: Int64
print(df['b'])
# 0 <NA>
# 1 0
# 2 1
# 3 2
# 4 3
# Name: b, dtype: Int64
Another solution is to use the replace() function and a type cast:
df['b'] = df['a'].shift(1).replace(np.nan, 0).astype(int)
I don't like the other answers, which may change the original dtypes; what if you have float and str in the data? Since we don't need the first NaN row, why not skip it?
I would keep all dtypes and cast back:
dt = df.dtypes
df = df.shift(1).iloc[1:].astype(dt)
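Applied to the example frame, this drops the first (NaN) row and keeps column a as integer:
df = pd.DataFrame({"a": range(5)})
dt = df.dtypes
print(df.shift(1).iloc[1:].astype(dt))
   a
1  0
2  1
3  2
4  3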