This question already has answers here:
GroupBy pandas DataFrame and select most common value
(13 answers)
Closed 4 years ago.
I'm trying to obtain the mode of a column in a groupby object, but I'm getting this error: incompatible index of inserted column with frame index.
This is the line I'm getting this on, and I'm not sure how to fix it. Any help would be appreciated.
dfBitSeq['KMeans'] = df.groupby('OnBitSeq')['KMeans'].apply(lambda x: x.mode())
Pandas mode returns a Series (there can be ties), unlike mean and median, which return a scalar, so the inserted column's index doesn't align with the frame. You just need to select the first value using x.mode().iloc[0]:
dfBitSeq['KMeans'] = df.groupby('OnBitSeq')['KMeans'].apply(lambda x: x.mode().iloc[0])
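As a quick sanity check, here is a minimal sketch (with made-up data) showing the fixed version producing one scalar per group:

```python
import pandas as pd

df = pd.DataFrame({'OnBitSeq': [1, 1, 1, 2, 2],
                   'KMeans':   [5, 5, 4, 3, 3]})

# x.mode() returns a Series (there can be ties), so .iloc[0]
# collapses it to a single scalar per group.
modes = df.groupby('OnBitSeq')['KMeans'].apply(lambda x: x.mode().iloc[0])
print(modes.tolist())  # [5, 3]
```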
You can use scipy.stats.mode. Example below.
import pandas as pd
from scipy.stats import mode
df = pd.DataFrame([[1, 5], [2, 3], [3, 5], [2, 4], [2, 3], [1, 4], [1, 5]],
                  columns=['OnBitSeq', 'KMeans'])
#    OnBitSeq  KMeans
# 0         1       5
# 1         2       3
# 2         3       5
# 3         2       4
# 4         2       3
# 5         1       4
# 6         1       5
modes = df.groupby('OnBitSeq')['KMeans'].apply(lambda x: mode(x)[0][0]).reset_index()
# (note: recent SciPy versions changed mode's return shape; there you may need mode(x).mode instead of mode(x)[0][0])
#    OnBitSeq  KMeans
# 0         1       5
# 1         2       3
# 2         3       5
If you need to add this back to the original dataframe:
df['Mode'] = df['OnBitSeq'].map(modes.set_index('OnBitSeq')['KMeans'])
You could look at Attach a calculated column to an existing dataframe.
This error looks similar and the answer is pretty useful.
I have a problem with regards as to how to appropriately code this condition. I'm currently creating a new pandas column in my dataframe, new_column, which performs a subtraction on the values in column test, based on what index of the data we are at. I'm currently using this code to get it to subtract a different value every 4 times:
subtraction_value = 3
subtraction_value_2 = 6
data = pd.DataFrame({"test": [12, 4, 5, 4, 1, 3, 2, 5, 10, 9]})
data['new_column'] = np.where(data.index % 4,
                              data['test'] - subtraction_value,
                              data['test'] - subtraction_value_2)
print(data['new_column'])
[6, 1, 2, 1, -5, 0, -1, 2, 4, 6]
However, I now wish to get it performing the higher subtraction on the first two positions in the column, and then 3 subtractions with the original value, another two with the higher subtraction value, 3 small subtractions, and so forth. I thought I could do it this way, with an | condition in my np.where statement:
data['new_column'] = np.where((data.index%4) | (data.index%5),
data['test']-subtraction_value,
data['test']-subtraction_value_2)
However, this didn't work, and I feel my maths may be slightly off. My desired output would look like this:
print(data['new_column'])
[6,-2,2,1,-2,-3,-4,3,7,6]
As you can see, this slightly shifts the pattern. Can I still use numpy.where() here, or do I have to take a new approach? Any help would be greatly appreciated!
As mentioned in the comment section, the output should equal
[6,-2,2,1,-2,-3,-4,2,7,6] instead of [6,-2,2,1,-2,-3,-4,3,7,6] according to your logic. Given that, you can do the following:
import pandas as pd
import numpy as np
from itertools import chain
subtraction_value = 3
subtraction_value_2 = 6
data = pd.DataFrame({"test":[12, 4, 5, 4, 1, 3, 2, 5, 10, 9]})
index_pos_large_subtraction = list(chain.from_iterable((data.index[i], data.index[i+1]) for i in range(0, len(data)-1, 5)))
data['new_column'] = np.where(~data.index.isin(index_pos_large_subtraction), data['test']-subtraction_value, data['test']-subtraction_value_2)
# The next line is equivalent to the previous one
# data['new_column'] = np.where(data.index.isin(index_pos_large_subtraction), data['test']-subtraction_value_2, data['test']-subtraction_value)
   test  new_column
0    12           6
1     4          -2
2     5           2
3     4           1
4     1          -2
5     3          -3
6     2          -4
7     5           2
8    10           7
9     9           6
As you can see, np.where works fine. Your masking condition is the problem and needs to be adjusted; you were not selecting rows according to your logic.
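If the pattern really is "two large subtractions, then three small ones" repeating, the same result can arguably be obtained with a plain modulo condition instead of building an index list; a minimal sketch:

```python
import pandas as pd
import numpy as np

subtraction_value = 3
subtraction_value_2 = 6
data = pd.DataFrame({"test": [12, 4, 5, 4, 1, 3, 2, 5, 10, 9]})

# Positions 0 and 1 of every block of 5 get the larger subtraction.
data['new_column'] = np.where(data.index % 5 < 2,
                              data['test'] - subtraction_value_2,
                              data['test'] - subtraction_value)
print(data['new_column'].tolist())
# [6, -2, 2, 1, -2, -3, -4, 2, 7, 6]
```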
This question already has answers here:
Finding median of list in Python
(28 answers)
Closed 2 years ago.
I have a csv file and I want to sort the first column (lowest to greatest).
The first column's name is "CRIM".
I can read the first column, but I can't sort it, the numbers are floats.
Also, I would like to find the median of the list.
This is what I did so far:
import csv
with open('data.csv', newline='') as csvfile:
    data = csv.DictReader(csvfile)
    for line in data:
        print(line['CRIM'])
I would advise using pandas and DataFrame.median().
Eg data:
    A   B   C   D
0  12   5  20  14
1   4   2  16   3
2   5  54   7  17
3  44   3   3   2
4   1   2   8   6
# importing pandas as pd
import pandas as pd
# for your csv
# df = pd.read_csv('data.csv')
# Creating the dataframe (example)
df = pd.DataFrame({"A": [12, 4, 5, 44, 1],
                   "B": [5, 2, 54, 3, 2],
                   "C": [20, 16, 7, 3, 8],
                   "D": [14, 3, 17, 2, 6]})
# Find median Even if we do not specify axis = 0, the method
# will return the median over the index axis by default
df.median(axis = 0)
A 5.0
B 3.0
C 8.0
D 6.0
dtype: float64
df['A'].median(axis = 0)
5.0
https://www.programiz.com/python-programming/methods/built-in/sorted
Use sorted() on the whole column, after converting the values to float (calling sorted(line['CRIM']) on a single row would just sort the characters of that string):
crim = [float(line['CRIM']) for line in data]
CRIM_sorted = sorted(crim)
For the median, you can use a package or just build your own:
Finding median of list in Python
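Staying with the csv module, a sketch of the whole pipeline with hypothetical rows standing in for data.csv (in the real case they come from csv.DictReader):

```python
import statistics

# Hypothetical rows; in the real case these come from csv.DictReader('data.csv').
rows = [{'CRIM': '0.3'}, {'CRIM': '0.1'}, {'CRIM': '0.7'}]

# Convert to float first so the sort is numeric, not lexicographic.
crim = sorted(float(row['CRIM']) for row in rows)
print(crim)                     # [0.1, 0.3, 0.7]
print(statistics.median(crim))  # 0.3
```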
This question already has answers here:
How to deal with SettingWithCopyWarning in Pandas
(20 answers)
Closed 3 years ago.
I am coming from an R background. I need elementary help with pandas.
if I have a dataframe like this
df = pd.DataFrame(np.array([[1, 2, 3], [4, 5, 6]]))
I want to subset dataframe to select a fixed column and select a row by a boolean.
For example
df.iloc[df.2 > 4][2]
then I want to set the value for the subset cell to equal a value.
something like
df.iloc[df.2 > 4][2] = 7
It seems valid to me; however, it seems pandas works with booleans in a stricter way than R does.
Here you want .loc:
df.loc[df[2] > 4,2]
1 6
Name: 2, dtype: int64
df.loc[df[2] > 4,2]=7
df
   0  1  2
0  1  2  3
1  4  5  7
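To make the difference concrete, a small sketch: the chained form df[df[2] > 4][2] = 7 assigns into a temporary copy (and typically raises SettingWithCopyWarning), while .loc does the row selection and column selection in one call and writes into the original frame:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.array([[1, 2, 3], [4, 5, 6]]))

# Single .loc call: boolean row mask and column label together,
# so the assignment hits the original frame, not a copy.
df.loc[df[2] > 4, 2] = 7
print(df)
#    0  1  2
# 0  1  2  3
# 1  4  5  7
```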
Let's say that I have some data from a file where some columns are "of the same kind", only with different subscripts of some mathematical variable, say x:
n  A  B  C  x[0]  x[1]  x[2]
0  1  2  3     4     5     6
1  2  3  4     5     6     7
Is there some way I can load this into a pandas dataframe df and somehow treat the three x-columns as an indexable, array-like entity (I'm new to pandas)? I believe it would be convenient, because I could do operations on the data-series contained in x such as sum(df.x).
Kind regards.
EDIT:
Admittedly, my original post was not clear enough. I'm not just interested in getting the sum of three columns. That was just an example. I'm looking for a generally applicable abstraction that I hope is built into pandas.
I'd like to have multiple columns accessible through (sub-)indices of one entity, e.g. df.x[0], such that I (or any other user of the data) can do whichever operation he/she wants (sum/max/min/avg/standard deviation, you name it). You can consider the x's as an ensemble of time-dependent measurements if you like.
Kind regards.
Consider you define your dataframe like this:
df = pd.DataFrame([[1, 2, 3, 4, 5, 6],
                   [2, 3, 4, 5, 6, 7]], columns=['A', 'B', 'C', 'x0', 'x1', 'x2'])
Then with
x = ['x0', 'x1', 'x2']
you can use the following notation, allowing a quite general definition of x:
>>> df[x].sum(axis=1)
0 15
1 18
dtype: int64
Look for the columns which start with 'x' and perform the operations you need:
column_num=[col for col in df.columns if col.startswith('x')]
df[column_num].sum(axis=1)
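For completeness, the startswith approach runs end to end like this (reusing the same example frame and column names as above):

```python
import pandas as pd

df = pd.DataFrame([[1, 2, 3, 4, 5, 6],
                   [2, 3, 4, 5, 6, 7]],
                  columns=['A', 'B', 'C', 'x0', 'x1', 'x2'])

# Collect every column whose name starts with 'x', then aggregate row-wise.
x_cols = [col for col in df.columns if col.startswith('x')]
print(x_cols)                           # ['x0', 'x1', 'x2']
print(df[x_cols].sum(axis=1).tolist())  # [15, 18]
```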
I'll give you another answer which differs from your initial data structure in exchange for addressing the values of the dataframe by df.x[0] etc.
Consider you have defined your dataframe like this
>>> dv = pd.DataFrame(np.random.randint(10, size=20),
index=pd.MultiIndex.from_product([range(4), range(5)]), columns=['x'])
>>> dv
x
0 0 8
1 3
2 4
3 6
4 1
1 0 8
1 9
2 1
3 8
4 8
[...]
Then you can exactly do this
dv.x[1]
0 8
1 9
2 1
3 8
4 8
Name: x, dtype: int64
which is your desired notation. Requires some changes to your initial set-up but will give you exactly what you want.
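Once the data is in that MultiIndex shape, per-subscript operations fall out naturally; a sketch with fixed values instead of random ones so the numbers are reproducible:

```python
import pandas as pd

dv = pd.DataFrame({'x': [8, 3, 4, 6, 1, 8, 9, 1, 8, 8]},
                  index=pd.MultiIndex.from_product([range(2), range(5)]))

# dv.x[1] selects the sub-series for outer index 1...
print(dv.x[1].sum())  # 34
# ...and level-wise aggregation gives one value per outer index.
print(dv.x.groupby(level=0).sum().tolist())  # [22, 34]
```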
Here is my example:
import pandas as pd
df = pd.DataFrame({'col_1':[1,5,6,77,9],'col_2':[6,2,4,2,5]})
df.index = [8,9,10,11,12]
This sub-setting is by row order:
df.col_1[2:5]
returns
10 6
11 77
12 9
Name: col_1, dtype: int64
while this subsetting is by index label and does not work:
df.col_1[2]
returns:
KeyError: 2
I find it very confusing and am curious what the reason behind it is.
Your statements are ambiguous, so it is best to explicitly define what you want.
df.col_1[2:5] works like df.col_1.iloc[2:5], using integer location.
Whereas df.col_1[2] works like df.col_1.loc[2], using index-label location; since there is no index labelled 2, you get the KeyError.
Hence it is best to state whether you are using integer location with .iloc or index-label location with .loc.
See Pandas Indexing docs.
Let's assume this is the initial DataFrame:
df = pd.DataFrame(
{
'col_1':[1, 5, 6, 77, 9],
'col_2':[6, 2, 4, 2, 5]
},
index=list('abcde')
)
df
Out:
col_1 col_2
a 1 6
b 5 2
c 6 4
d 77 2
e 9 5
The index consists of strings so it is generally obvious what you are trying to do:
df['col_1']['b'] You passed a string so you are probably trying to access by label. It returns 5.
df['col_1'][1] You passed an integer so you are probably trying to access by position. It returns 5.
Same deal with slices: df['col_1']['b':'d'] uses labels and df['col_1'][1:4] uses positions.
When the index is also integer, nothing is obvious anymore.
df = pd.DataFrame(
{
'col_1':[1, 5, 6, 77, 9],
'col_2':[6, 2, 4, 2, 5]
},
index=[8, 9, 10, 11, 12]
)
df
Out:
col_1 col_2
8 1 6
9 5 2
10 6 4
11 77 2
12 9 5
Let's say you type df['col_1'][8]. Are you trying to access by label or by position? What if it was a slice? Nobody knows. At this point, pandas chooses one of them based on their usage. It is in the end a Series, and what distinguishes a Series from an array is its labels, so the choice for df['col_1'][8] is labels.
Slicing with labels, on the other hand, is not that common, so pandas is being smart here and using positions when you pass a slice. Is it inconsistent? Yes. Should you avoid it? Yes. This is the main reason ix was deprecated.
Explicit is better than implicit so use either iloc or loc when there is room for ambiguity. loc will always raise a KeyError if you try to access an item by position and iloc will always raise a KeyError if you try to access by label.
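A minimal sketch of that advice, using the integer-indexed frame from above:

```python
import pandas as pd

df = pd.DataFrame({'col_1': [1, 5, 6, 77, 9],
                   'col_2': [6, 2, 4, 2, 5]},
                  index=[8, 9, 10, 11, 12])

print(df['col_1'].loc[8])   # 1 -> by label, unambiguous
print(df['col_1'].iloc[2])  # 6 -> by position, unambiguous
# df['col_1'].loc[2] raises KeyError: there is no label 2.
```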