Change DataFrame index values while keeping other column data the same - python

I have a DataFrame with 4 columns and 251 rows, and an index that is a progression of numbers, e.g. 1000 to 1250. The index was initially necessary to aid in joining data from 4 different DataFrames. However, once I get the 4 columns together, I would like to change the index to a number progression from 250 to 0. This is because I will be performing the same operation on different sets of data (in groups of 4) that have different indices, e.g. 2000 to 2250 or 500 to 750, but all have the same number of rows. Re-indexing to 250 through 0 is a way of unifying these data sets, but I can't figure out how to do it, i.e. I'm looking for something that replaces any existing index with range(250, 0, -1).
I've tried set_index as below, along with a whole bunch of other attempts that invariably return errors,
df.set_index(range(250, 0, -1), inplace=True)
and in the one instance where I was able to set the index of the df to the range, the data in the 4 columns changed to NaN since there was no data matching the new index. I apologize if this is rudimentary, but I'm a week old in the world of Python/pandas, haven't programmed in 10+ years, and have spent 2 days trying to figure this out for myself as an exercise, but it's time to cry... Uncle!!

Try introducing the 250-to-0 values as a column first, then setting that column as the index:
df = pd.DataFrame({'col1': list('abcdefghij'), 'col2': range(0, 50, 5)})
df['new_index'] = range(30, 20, -1)
df = df.set_index('new_index')
Before:
col1 col2 new_index
0 a 0 30
1 b 5 29
2 c 10 28
3 d 15 27
4 e 20 26
5 f 25 25
6 g 30 24
7 h 35 23
8 i 40 22
9 j 45 21
After:
col1 col2
new_index
30 a 0
29 b 5
28 c 10
27 d 15
26 e 20
25 f 25
24 g 30
23 h 35
22 i 40
21 j 45

You can just do
df.index = range(250, 0, -1)
or am I missing something?
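One caveat worth hedging, based on the 251 rows mentioned in the question: range(250, 0, -1) produces only 250 labels, which raises a length-mismatch error on a 251-row frame; 250 down to 0 inclusive is range(250, -1, -1). A minimal sketch, assuming a hypothetical frame mimicking the question's shape:
import pandas as pd

# hypothetical 251-row frame with an index running 1000..1250, as in the question
df = pd.DataFrame({'col1': range(251)}, index=range(1000, 1251))
# 250 down to 0 inclusive is 251 labels, matching the row count
df.index = pd.RangeIndex(start=250, stop=-1, step=-1)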

Related

Efficient lookup between pandas column values and a list of values

I have a list of n elements, let's say:
[5, 30, 60, 180, 240]
And a DataFrame with the following characteristics:
id1 id2 feat1
1 1 40
1 2 40
1 3 40
1 4 40
2 6 87
2 7 87
2 8 87
The combination of id1 + id2 is unique, but all records with a common id1 share the same value of feat1. I would like to write a function, run via groupby + apply (or whatever is faster), that creates a column called 'closest_number'. For a given id1 + id2 (or just id1, since the records share feat1), 'closest_number' is the element of the list that is closest to the feat1 value.
Desired output:
id1 id2 feat1 closest_number
1 1 40 30
1 2 40 30
1 3 40 30
1 4 40 30
2 6 87 60
2 7 87 60
2 8 87 60
If this were a standard two-array lookup problem, I could do:
import numpy as np

def get_closest(array, values):
    # make sure array is a numpy array
    array = np.array(array)
    # get insert positions
    idxs = np.searchsorted(array, values, side="left")
    # find indexes where the previous element is closer
    prev_idx_is_less = ((idxs == len(array)) |
                        (np.fabs(values - array[np.maximum(idxs - 1, 0)]) <
                         np.fabs(values - array[np.minimum(idxs, len(array) - 1)])))
    idxs[prev_idx_is_less] -= 1
    return array[idxs]
And if I apply this to the columns, I get as output:
array([30, 60])
However, I will not get any information about which rows correspond to 30 and which to 60.
What would be the optimal way of doing this? As my list of elements is very small, I have created distance columns in my dataset and then selected the one that gives me the minimum distance. But I assume there is a more elegant way of doing this.
Use get_closest as follows:
# the candidate list from the question (already sorted, as searchsorted requires)
lst = [5, 30, 60, 180, 240]
# obtain the series with index id1 and values feat1
vals = df.groupby("id1")["feat1"].first().rename("closest_number")
# find the closest values and assign them back
vals[:] = get_closest(lst, vals)
# merge the series into the original DataFrame
res = df.merge(vals, right_index=True, left_on="id1", how="left")
print(res)
Output
id1 id2 feat1 closest_number
0 1 1 40 30
1 1 2 40 30
2 1 3 40 30
3 1 4 40 30
4 2 6 87 60
5 2 7 87 60
6 2 8 87 60
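Since feat1 is shared within each id1 group and get_closest is vectorised, a simpler sketch (using the sample data from the question and the get_closest function defined above, with no groupby needed) applies it to the whole column at once:
import pandas as pd

lst = [5, 30, 60, 180, 240]  # candidate list from the question
df = pd.DataFrame({
    "id1":   [1, 1, 1, 1, 2, 2, 2],
    "id2":   [1, 2, 3, 4, 6, 7, 8],
    "feat1": [40, 40, 40, 40, 87, 87, 87],
})

# get_closest works element-wise, so the whole feat1 column can be passed directly
df["closest_number"] = get_closest(lst, df["feat1"].to_numpy())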

Is there a way to avoid while loops using pandas in order to speed up my code?

I'm writing code to merge several DataFrames together using pandas.
Here is my first table :
Index Values Intensity
1 11 98
2 12 855
3 13 500
4 24 140
and here is the second one:
Index Values Intensity
1 21 1000
2 11 2000
3 24 0.55
4 25 500
With these two DataFrames, I concatenate and drop_duplicates on the Values column, which gives me the following df:
Index Values Intensity_df1 Intensity_df2
1 11 0 0
2 12 0 0
3 13 0 0
4 24 0 0
5 21 0 0
6 25 0 0
I would like to recover the intensity of each value in each DataFrame. For this purpose, I'm iterating through each line of each df, which is very inefficient. Here is the code I use:
m = 0
while m < len(num_df):
    n = 0
    while n < len(df3):
        temp_intens_abs = df[m]['Intensity'][df3['Values'][n] == df[m]['Values']]
        if temp_intens_abs.empty:
            merged.at[n, "Intensity_df%s" % df[m]] = 0
        else:
            merged.at[n, "Intensity_df%s" % df[m]] = pandas.to_numeric(temp_intens_abs, errors='coerce')
        n = n + 1
    m = m + 1
The resulting df3 looks like this at the end:
Index Values Intensity_df1 Intensity_df2
1 11 98 2000
2 12 855 0
3 13 500 0
4 24 140 0.55
5 21 0 1000
6 25 0 500
My question is: is there a way to directly recover "present" values in a df by comparing two columns with pandas? I've tried several solutions using numpy but without success. Thanks in advance for your help.
You can try joining these dataframes: df3 = df1.merge(df2, on="Values")
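If the goal is the filled df3 itself rather than row-by-row lookups, a sketch assuming the two sample frames from the question (the suffix names are mine): an outer merge keeps every Values entry and a zero fill replaces the missing intensities:
import pandas as pd

df1 = pd.DataFrame({"Values": [11, 12, 13, 24], "Intensity": [98, 855, 500, 140]})
df2 = pd.DataFrame({"Values": [21, 11, 24, 25], "Intensity": [1000, 2000, 0.55, 500]})

# outer join keeps Values present in either frame; unmatched intensities become 0
df3 = (df1.merge(df2, on="Values", how="outer", suffixes=("_df1", "_df2"))
          .fillna(0))
print(df3)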

How to select the "local" min and max out of a list in a pandas DataFrame

I'm struggling to figure out how to do the following:
I have a DataFrame that looks like this (it's a little more complicated; this is just an example):
df = pd.DataFrame({'id' : ['id1','id2'], 'coverage' : ['1 2 3 4 5 6 7 8 9 10 11 13 14 15 16 17 18 19 20 40 41 42 43 44 45 46 47 48 49 50','1 2 3 4 5 6 7 8 9 10 100 101 102 103 104 105 106 107 108 109 110']})
And I want to generate a new column that only holds the min and max of every contiguous segment; basically it should look like this:
id coverage
0 id1 1 11 13 20 40 50
1 id2 1 10 100 110
It's a simple problem, but I can't come up with any solution; I know that map(lambda x: ...) could work...
Thanks!
Let's try:
# split the values and convert to integers
s = df['coverage'].str.split().explode().astype(int)
# continuous blocks
blocks = s.diff().ne(1).groupby(level=0).cumsum()
df['coverage'] = (s.groupby([s.index, blocks])
                    .agg(['min', 'max'])
                    .astype(str).agg(' '.join, axis=1)
                    .groupby(level=0).agg(' '.join))
First, split those strings and explode into a long Series, keeping the index aligned with your rows. Then take the difference between successive values within each group and check where it's not equal to 1.
Slice the exploded Series by this mask and its shift to get the start and end points of each segment, then groupby and agg(list) (or ' '.join) to get your output.
# To numeric so values become numbers.
s = pd.to_numeric(df.set_index('id')['coverage'].str.split().explode())
m = s.groupby(level=0).diff().ne(1)
result = s[m | m.shift(-1).fillna(True)].groupby(level=0).agg(list)
id
id1 [1, 11, 13, 20, 40, 50]
id2 [1, 10, 100, 110]
Name: coverage, dtype: object
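If the exact space-separated string layout from the desired output is needed, a small follow-up sketch (reusing the result Series above) maps the lists back onto df:
# join each list of boundary points back into the space-separated string format
df['coverage'] = df['id'].map(result.apply(lambda ends: ' '.join(map(str, ends))))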

Sort pandas df into individual columns

I am trying to sort a pandas df into individual columns based on when values in certain columns change. For the df below I can split the df into separate columns whenever a value changes in Col B. But I'm trying to add Col C as well, so that a new column starts when the values change in both Col B and Col C.
import pandas as pd
df = pd.DataFrame({
'A' : [10,20,30,40,40,30,20,10,5,10,15,20,20,15,10,5],
'B' : ['X','X','X','X','Y','Y','Y','Y','X','X','X','X','Y','Y','Y','Y'],
'C' : ['W','W','Z','Z','Z','Z','W','W','W','W','Z','Z','Z','Z','W','W'],
})
d = df['B'].ne(df['B'].shift()).cumsum()
df['C'] = d.groupby(df['B']).transform(lambda x: pd.factorize(x)[0]).add(1).astype(str)
df['D'] = df.groupby(['B','C']).cumcount()
df = df.set_index(['D','C','B'])['A'].unstack([2,1])
df.columns = df.columns.map(''.join)
Output:
X1 Y1 X2 Y2
D
0 10 40 5 20
1 20 30 10 15
2 30 20 15 10
3 40 10 20 5
As you can see, this creates a new column every time there's a new value in Col B. But I'm trying to incorporate Col C as well. So it should be every time there's a change in both Col B and Col C.
Intended output:
XW1 XZ1 YZ1 YW1 XW2 XZ2 YZ2 YW2
0 10 30 40 20 5 15 20 10
1 20 40 30 10 10 20 15 5
Just based on your output, create the helper columns one by one.
df['key'] = df.B + df.C  # create the key
df['key2'] = (df.key != df.key.shift()).ne(0).cumsum()  # group consecutive identical keys together
df.key2 = df.groupby('key').key2.apply(lambda x: x.astype('category').cat.codes + 1)  # change the group number to 1 or 2
df['key3'] = df.groupby(['key', 'key2']).cumcount()  # create the index for the pivot
df['key'] = df.key + df.key2.astype(str)  # create the columns for the pivot
df.pivot('key3', 'key', 'A')  # yields
Out[126]:
key XW1 XW2 XZ1 XZ2 YW1 YW2 YZ1 YZ2
key3
0 10 5 30 15 20 10 40 20
1 20 10 40 20 10 5 30 15
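Two small notes, hedged for newer pandas versions: DataFrame.pivot takes keyword-only arguments there, and the pivoted columns come back alphabetically sorted, so reindexing recovers the intended column order:
out = df.pivot(index='key3', columns='key', values='A')
# reorder to match the intended output
out = out[['XW1', 'XZ1', 'YZ1', 'YW1', 'XW2', 'XZ2', 'YZ2', 'YW2']]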

How can I add two values of a row, and then put the result into a new cell?

In Python, I have a dataset/DataFrame with 2 columns: column A has values of, say, 10, 20, 30 and column B has values of 5, 10, 15, etc.
How can I add the values in each row across the two columns and have the result in a column next to them?
So essentially there will be a column C that holds the sums: the first row adds columns A and B, giving 15 in column C, and so on.
Thanks.
Simple addition will do:
df['C'] = df['A'] + df['B']
Using eval
making a copy by using inplace=False
df.eval('C = A + B', inplace=False)
# create a copy with a new column
A B C
0 10 5 15
1 20 10 30
2 30 15 45
altering the existing dataframe by using inplace=True
df.eval('C = A + B', inplace=True)
df
A B C
0 10 5 15
1 20 10 30
2 30 15 45
Like this:
df = pd.DataFrame({'A':[10,20,30],'B':[5,10,15]})
df = df.assign(C=df.A + df.B)
print(df)
Output:
A B C
0 10 5 15
1 20 10 30
2 30 15 45
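For completeness, a variant worth noting: if more columns need to be summed later, a row-wise sum over the selected columns generalises the same idea:
# row-wise sum across the chosen columns
df['C'] = df[['A', 'B']].sum(axis=1)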
