I am trying to clean up a dataset. A value should only be kept if it is not larger than the previous value; any value that increases should be replaced by the previous one.
Right now it looks like this:
my_data
0 10
1 8
2 7
3 10
4 5
5 8
6 2
after the cleanup it should look like this:
my_data
0 10
1 8
2 7
3 7
4 5
5 5
6 2
I also have some working code but I am looking for a faster and more pythonic way of doing it.
import pandas as pd

df_results = pd.DataFrame()
df_results['my_data'] = [10, 8, 7, 10, 5, 8, 2]

data_idx = list(df_results['my_data'].index)
for i in range(1, len(df_results['my_data'])):
    current_value = df_results.loc[data_idx[i], 'my_data']
    last_value = df_results.loc[data_idx[i - 1], 'my_data']
    # keep the value only if it did not increase; otherwise carry the previous value
    df_results.loc[data_idx[i], 'my_data'] = current_value if current_value < last_value else last_value
You can use:
In [53]: df[df.my_data.diff() > 0] = np.nan
In [54]: df
Out[54]:
my_data
0 10.0
1 8.0
2 7.0
3 NaN
4 5.0
5 NaN
6 2.0
In [55]: df.ffill()
Out[55]:
my_data
0 10.0
1 8.0
2 7.0
3 7.0
4 5.0
5 5.0
6 2.0
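If you prefer a single expression, the two steps can be combined with mask and ffill; a minimal sketch (note that the intermediate NaN step upcasts the column to float):
import pandas as pd

df = pd.DataFrame({'my_data': [10, 8, 7, 10, 5, 8, 2]})
# mask() blanks out every value that increased; ffill() then carries the
# previous (smaller) value forward
df['my_data'] = df['my_data'].mask(df['my_data'].diff() > 0).ffill()
print(df)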
I am using shift together with diff:
s=df.my_data.diff().gt(0)
df.loc[s,'my_data']=df.loc[s.shift(-1).fillna(False),'my_data'].values
Out[71]:
my_data
0 10.0
1 8.0
2 7.0
3 7.0
4 5.0
5 5.0
6 2.0
I have a dataset in which I need to filter by the values of one of the columns.
I'll try to explain what I need with an example. Suppose we have the following dataset, whose columns may contain NaN values.
In [11]: df
Out[11]:
date A B C
2012-11-29 0 0 NaN
2012-11-30 1 1 NaN
2012-12-01 2 2 2
2012-12-02 NaN 3 3
2012-12-03 4 4 4
2012-12-04 5 5 NaN
2012-12-05 6 6 6
2012-12-06 7 7 7
2012-12-07 8 8 NaN
2012-12-08 9 9 NaN
I need to filter the dataframe to keep the rows between the ones holding the minimum and maximum values of column C (inclusive).
That is, the output should contain the rows in that interval, with any NaN values inside the interval left unchanged.
The result of the filtering should be the following:
date A B C
2012-12-01 2 2 2
2012-12-02 NaN 3 3
2012-12-03 4 4 4
2012-12-04 5 5 NaN
2012-12-05 6 6 6
2012-12-06 7 7 7
How can I do that? I tried this kind of construct but it didn't give any results:
interval_1 = pd.DataFrame(pd.date_range(df['C'].min(), df['C'].max()))
You can use idxmin/idxmax and slicing:
df.loc[df['C'].idxmin():df['C'].idxmax()]
output:
date A B C
2 2012-12-01 2.0 2 2.0
3 2012-12-02 NaN 3 3.0
4 2012-12-03 4.0 4 4.0
5 2012-12-04 5.0 5 NaN
6 2012-12-05 6.0 6 6.0
7 2012-12-06 7.0 7 7.0
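This assumes the row holding the minimum of C comes before the row holding the maximum, as in the example data. If that order is not guaranteed, a small variant can sort the two positions first (a sketch, assuming the default integer index shown above):
lo, hi = df['C'].idxmin(), df['C'].idxmax()
start, stop = sorted([lo, hi])   # handle the max occurring before the min
result = df.loc[start:stop]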
You could first get the min and max values of C, and then filter the whole df by the A and B column values:
min_c = df['C'].min()
max_c = df['C'].max()
df = df[(df['A'] >= min_c) & (df['A'] <= max_c) & (df['B'] >= min_c) & (df['B'] <= max_c)]
import pandas as pd
import numpy as np
columns = ['date','A','B','C']
data = [
['2012-11-29', 0 , 0 , np.nan],
['2012-11-30', 1 , 1 , np.nan],
['2012-12-01', 2 , 2 , 2],
['2012-12-02', np.nan, 3 , 3],
['2012-12-03', 4 , 4 , 4],
['2012-12-04', 5 , 5 , np.nan],
['2012-12-05', 6 , 6 , 6],
['2012-12-06', 7 , 7 , 7],
['2012-12-07' , 8 , 8 , np.nan],
['2012-12-08' , 9 , 9 , np.nan]]
df = pd.DataFrame(data=data, columns=columns)
minVal = df['C'].min()
maxVal = df['C'].max()
df_filter = df[((df['A'] >= minVal) | (df['B'] >= minVal)) & ((df['A'] <= maxVal) | (df['B'] <= maxVal))]
Output:
print(df_filter)
date A B C
2 2012-12-01 2.0 2 2.0
3 2012-12-02 NaN 3 3.0
4 2012-12-03 4.0 4 4.0
5 2012-12-04 5.0 5 NaN
6 2012-12-05 6.0 6 6.0
7 2012-12-06 7.0 7 7.0
import pandas
import numpy
names = ['a', 'b', 'c']
df = pandas.DataFrame([1, 2, 3, numpy.nan, numpy.nan, 4, 5, 6, numpy.nan, numpy.nan, 7, 8, 9])
For the dataframe above, how will the condition change? Can someone please explain how I can get this:
df1 =
0
0 1.0
1 2.0
2 3.0
df2 =
0
4 4.0
5 5.0
6 6.0
df3 =
0
8 7.0
9 8.0
10 9.0
You can generate a temporary column, remove NaNs, and group by the temporary column:
dataframes = {f'df{idx+1}': d for idx, (_, d) in enumerate(df.dropna().groupby(df.assign(cond=df.isna().cumsum()).dropna()['cond']))}
Output:
>>> dataframes
{'df1': 0
0 1.0
1 2.0
2 3.0,
'df2': 0
5 4.0
6 5.0
7 6.0,
'df3': 0
10 7.0
11 8.0
12 9.0}
>>> dataframes['df1']
0
0 1.0
1 2.0
2 3.0
>>> dataframes['df2']
0
5 4.0
6 5.0
7 6.0
>>> dataframes['df3']
0
10 7.0
11 8.0
12 9.0
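The one-liner can also be unrolled into a few named steps, which may be easier to read; this sketch is equivalent for the single-column frame from the question (column 0):
import numpy as np
import pandas as pd

df = pd.DataFrame([1, 2, 3, np.nan, np.nan, 4, 5, 6, np.nan, np.nan, 7, 8, 9])

labels = df[0].isna().cumsum()    # group label increases at every NaN row
clean = df.dropna()               # keep only the numeric rows
dataframes = {
    f'df{i + 1}': chunk
    for i, (_, chunk) in enumerate(clean.groupby(labels.loc[clean.index]))
}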
I have a pandas dataframe with two dimensions. I want to calculate the rolling standard deviation along axis 1 while also including datapoints in the rows above and below.
So say I have this df:
data = {'A': [1, 2, 3, 4],
'B': [5, 6, 7, 8],
'C': [9, 10, 11, 12]}
df = pd.DataFrame(data)
print(df)
A B C
0 1 5 9
1 2 6 10
2 3 7 11
3 4 8 12
I want a rectangular window 3 rows high and 2 columns across, moving from left to right. So, for example,
std_df.loc[1, 'C']
would be equal to
np.std([1, 5, 9, 2, 6, 10, 3, 7, 11])
But no idea how to achieve this without very slow iteration
Looks like what you want is DataFrame.shift.
import pandas as pd
import numpy as np
data = {'A': [1,2,3,4], 'B': [5,6,7,8], 'C': [9,10,11,12]}
df = pd.DataFrame(data)
print(df)
A B C
0 1 5 9
1 2 6 10
2 3 7 11
3 4 8 12
Shifting the dataframe you provided by 1 yields the row above
print(df.shift(1))
A B C
0 NaN NaN NaN
1 1.0 5.0 9.0
2 2.0 6.0 10.0
3 3.0 7.0 11.0
Similarly, shifting the dataframe you provided by -1 yields the row below
print(df.shift(-1))
A B C
0 2.0 6.0 10.0
1 3.0 7.0 11.0
2 4.0 8.0 12.0
3 NaN NaN NaN
so the code below should do what you're looking for (add_prefix prefixes the column names to make them unique)
above_df = df.shift(1).add_prefix('above_')
below_df = df.shift(-1).add_prefix('below_')
lagged = pd.concat([df, above_df, below_df], axis=1)
lagged['std'] = lagged.apply(np.std, axis=1)
print(lagged)
A B C above_A above_B above_C below_A below_B below_C std
0 1 5 9 NaN NaN NaN 2.0 6.0 10.0 3.304038
1 2 6 10 1.0 5.0 9.0 3.0 7.0 11.0 3.366502
2 3 7 11 2.0 6.0 10.0 4.0 8.0 12.0 3.366502
3 4 8 12 3.0 7.0 11.0 NaN NaN NaN 3.304038
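If the window needs to span more than one row above and below, the same idea generalizes by concatenating several shifts. This is a sketch under the assumptions that, as in the worked example, the window covers all columns and the population standard deviation (ddof=0, like np.std) is wanted:
import pandas as pd

data = {'A': [1, 2, 3, 4], 'B': [5, 6, 7, 8], 'C': [9, 10, 11, 12]}
df = pd.DataFrame(data)

n_above, n_below = 1, 1   # rows to include above and below the current row
shifted = [df.shift(k).add_prefix(f'shift{k}_')
           for k in range(-n_below, n_above + 1)]
lagged = pd.concat(shifted, axis=1)
# std(ddof=0) matches np.std and skips the NaN cells at the top/bottom edges
row_std = lagged.std(axis=1, ddof=0)
print(row_std)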
I have a text file of the form (the groups are separated by blank lines):
data.txt
2
8
4

3
1
9

6
5
7
How can I read it into a pandas dataframe like this?
0 1 2
0 2 8 4
1 3 1 9
2 6 5 7
Try this:
with open(filename, 'r') as f:
    data = f.read().replace('\n', ',').replace(',,', '\n')

In [7]: pd.read_csv(pd.compat.StringIO(data), header=None)
Out[7]:
0 1 2
0 2 8 4
1 3 1 9
2 6 5 7
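pd.compat.StringIO was removed in later pandas versions; with a recent pandas the same trick works with io.StringIO from the standard library (a sketch, with strip() added to guard against a trailing newline in the file):
import io
import pandas as pd

with open('data.txt') as f:
    data = f.read().strip().replace('\n', ',').replace(',,', '\n')

df = pd.read_csv(io.StringIO(data), header=None)
print(df)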
Option 1
Much easier: if you know there are always N elements in a group, just load your data and reshape:
pd.DataFrame(np.loadtxt('data.txt').reshape(3, -1))
0 1 2
0 2.0 8.0 4.0
1 3.0 1.0 9.0
2 6.0 5.0 7.0
To load integers, pass dtype to loadtxt -
pd.DataFrame(np.loadtxt('data.txt', dtype=int).reshape(3, -1))
0 1 2
0 2 8 4
1 3 1 9
2 6 5 7
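If the group size (three values per block) is known but the number of blocks is not, the row count can be left for reshape to infer (a sketch; np.loadtxt ignores the blank lines):
import numpy as np
import pandas as pd

arr = np.loadtxt('data.txt', dtype=int)   # blank lines are skipped by loadtxt
pd.DataFrame(arr.reshape(-1, 3))          # 3 values per row, row count inferred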
Option 2
This is more general, will work when you cannot guarantee that there are always 3 numbers at a time. The idea here is to read in blank lines as NaN, and separate your data based on the presence of NaNs.
df = pd.read_csv('data.txt', header=None, skip_blank_lines=False)
df
0
0 2.0
1 8.0
2 4.0
3 NaN
4 3.0
5 1.0
6 9.0
7 NaN
8 6.0
9 5.0
10 7.0
df_list = []
for _, g in df.groupby(df.isnull().cumsum().values.ravel()):
    df_list.append(g.dropna().reset_index(drop=True))

df = pd.concat(df_list, axis=1, ignore_index=True)
df
0 1 2
0 2.0 8.0 4.0
1 3.0 1.0 9.0
2 6.0 5.0 7.0
Caveat - if your data also has NaNs, this will not separate properly.
Although this is definitely not the best way to handle it, we can do some processing ourselves. In case the values are integers, the following should work:
import pandas as pd

with open('data.txt') as f:
    data = [list(map(int, row.split())) for row in f.read().split('\n\n')]

dataframe = pd.DataFrame(data)
which produces:
>>> dataframe
0 1 2
0 2 8 4
1 3 1 9
2 6 5 7
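If the values are not necessarily integers, the same approach should work by mapping float instead of int (a sketch):
import pandas as pd

with open('data.txt') as f:
    data = [list(map(float, row.split())) for row in f.read().split('\n\n')]

dataframe = pd.DataFrame(data)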
I have a pandas.DataFrame that contain string, float and int types.
Is there a way to set all strings that cannot be converted to float to NaN ?
For example:
A B C D
0 1 2 5 7
1 0 4 NaN 15
2 4 8 9 10
3 11 5 8 0
4 11 5 8 "wajdi"
to:
A B C D
0 1 2 5 7
1 0 4 NaN 15
2 4 8 9 10
3 11 5 8 0
4 11 5 8 NaN
You can use pd.to_numeric and set errors='coerce' (see the pandas.to_numeric documentation):
df['D'] = pd.to_numeric(df.D, errors='coerce')
Which will give you:
A B C D
0 1 2 5.0 7.0
1 0 4 NaN 15.0
2 4 8 9.0 10.0
3 11 5 8.0 0.0
4 11 5 8.0 NaN
Deprecated solution (pandas <= 0.20 only):
df.convert_objects(convert_numeric=True)
(see pandas.DataFrame.convert_objects). Here's the dev note in the convert_objects source code: # TODO: Remove in 0.18 or 2017, which ever is sooner. So don't make this a long-term solution if you use it.
Here is a way:
df['E'] = pd.to_numeric(df.D, errors='coerce')
And then you have:
A B C D E
0 1 2 5.0 7 7.0
1 0 4 NaN 15 15.0
2 4 8 9.0 10 10.0
3 11 5 8.0 0 0.0
4 11 5 8.0 wajdi NaN
You can use pd.to_numeric with errors='coerce'.
In [30]: df = pd.DataFrame({'a': [1, 2, 'NaN', 'bob', 3.2]})
In [31]: pd.to_numeric(df.a, errors='coerce')
Out[31]:
0 1.0
1 2.0
2 NaN
3 NaN
4 3.2
Name: a, dtype: float64
Here is one way to apply it to all columns:
for c in df.columns:
    df[c] = pd.to_numeric(df[c], errors='coerce')
(See comment by NinjaPuppy for a better way.)
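The better way referred to there is presumably to apply pd.to_numeric across the whole frame in a single call, which avoids the explicit loop (a sketch):
import pandas as pd

df = pd.DataFrame({'a': [1, 2, 'NaN', 'bob', 3.2],
                   'b': ['x', 4, 5, 6, 7]})
# apply() forwards the errors='coerce' keyword to pd.to_numeric for each column
df = df.apply(pd.to_numeric, errors='coerce')
print(df)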