I am using Excel for this task now, but I was wondering if any of you know a way to find and insert missing sequence numbers in python.
Say I have a dataframe:
import pandas as pd
data = {'Sequence': [1, 2, 4, 6, 7, 9, 10],
        'Value': ["x", "x", "x", "x", "x", "x", "x"]}
df = pd.DataFrame(data, columns=['Sequence', 'Value'])
And now I want code that finds the missing sequence numbers in the column 'Sequence' and leaves blanks in the 'Value' column for the rows of missing sequence numbers, to get the following output:
print(df)
Sequence Value
0 1 x
1 2 x
2 3
3 4 x
4 5
5 6 x
6 7 x
7 8
8 9 x
9 10 x
Even better would be a solution in which you can also define the start and end of the sequence, for example when the sequence starts at 3 but you want it to start from 1 and end at 12. But a solution for only the first part would already help a lot. Thanks in advance!
You can set_index and reindex using a range from the Sequence's min and max values:
(df.set_index('Sequence')
.reindex(range(df.Sequence.min(), df.Sequence.max() + 1), fill_value='')
.reset_index())
Sequence Value
0 1 x
1 2 x
2 3
3 4 x
4 5
5 6 x
6 7 x
7 8
8 9 x
9 10 x
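The same approach also covers the custom start/end part of the question: pass explicit bounds to range instead of the observed min/max. A minimal sketch (1 and 12 below are just the example bounds from the question, and the sequence here is assumed to start at 3):

```python
import pandas as pd

df = pd.DataFrame({'Sequence': [3, 4, 6, 7, 9, 10],
                   'Value': ['x'] * 6})

# Reindex against an explicit range, e.g. 1 through 12
start, end = 1, 12
out = (df.set_index('Sequence')
         .reindex(range(start, end + 1), fill_value='')
         .reset_index())
print(out)
```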
Or do it by merging DataFrames:
seq = [1, 2, 4, 6, 7, 9, 10]
dfs0 = pd.DataFrame.from_dict({'Sequence': seq, 'Value': ['x']*len(seq)})
dfseq = (pd.DataFrame.from_dict({'Sequence': range(min(seq), max(seq) + 1)})
         .merge(dfs0, on='Sequence', how='outer').fillna(''))
print(dfseq)
Sequence Value
0 1 x
1 2 x
2 3
3 4 x
4 5
5 6 x
6 7 x
7 8
8 9 x
9 10 x
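The merge route handles explicit bounds the same way; build the full range yourself and merge the data onto it. A sketch (again, 1 and 12 are just the question's example bounds, and the input is assumed to start at 3):

```python
import pandas as pd

seq = [3, 4, 6, 7, 9, 10]
dfs0 = pd.DataFrame({'Sequence': seq, 'Value': ['x'] * len(seq)})

start, end = 1, 12
dfseq = (pd.DataFrame({'Sequence': range(start, end + 1)})
         .merge(dfs0, on='Sequence', how='left')   # keep every row of the full range
         .fillna(''))
print(dfseq)
```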
You can try this:
import numpy as np

Sequence = [1, 2, 4, 6, 7, 9, 10]
df = pd.DataFrame(np.arange(1, 11), columns=['Sequence'])
df.loc[df.Sequence.isin(Sequence), 'Value'] = 'x'
df = df.fillna('')
First you create your DataFrame with the full range of values you want 'Sequence' to cover.
Then you set 'Value' to 'x' for the rows where 'Sequence' is in your Sequence list. Finally you fill the missing values with ''.
I have a pandas dataframe like this:
col
0 3
1 5
2 9
3 5
4 6
5 6
6 11
7 6
8 2
9 10
that could be created in Python with the code:
import pandas as pd
df = pd.DataFrame(
{
'col': [3, 5, 9, 5, 6, 6, 11, 6, 2, 10]
}
)
I want to find the rows that have a value greater than 8, and also there is at least one row before them that has a value less than 4.
So the output should be:
col
2 9
9 10
You can see that index 0 has a value equal to 3 (less than 4) and then index 2 has a value greater than 8, so we add index 2 to the output and continue checking the following rows. From that point on we no longer consider indexes 0, 1, and 2; the search resets.
Index 6 has a value equal to 11, but none of the indexes 3, 4, 5 has a value less than 4, so we don't add index 6 to the output.
Index 8 has a value equal to 2 (less than 4) and index 9 has a value equal to 10 (greater than 8), so index 9 is added to the output.
My priority is to avoid any for-loops in the code. Do you have any ideas?
Boolean indexing to the rescue:
# value > 8
m1 = df['col'].gt(8)
# get previous value <4
# check if any occurred previously
m2 = df['col'].shift().lt(4).groupby(m1[::-1].cumsum()).cummax()
df[m1&m2]
Output:
col
2 9
9 10
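To see why this works, it helps to look at the intermediate pieces: m1[::-1].cumsum() assigns a group id that changes at each value greater than 8 (counting from the end, so each group ends at a candidate row), and the groupby/cummax propagates any "previous value < 4" hit forward within each group. A quick check:

```python
import pandas as pd

df = pd.DataFrame({'col': [3, 5, 9, 5, 6, 6, 11, 6, 2, 10]})

m1 = df['col'].gt(8)                 # candidate rows: value > 8
groups = m1[::-1].cumsum()           # group id changes after each candidate
m2 = df['col'].shift().lt(4).groupby(groups).cummax()

print(df[m1 & m2])                   # rows 2 and 9
```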
You can also try the code below, which uses cumsum to build group ids between the candidate rows:
import numpy as np

df['val'] = np.where(df['col'] > 8, True, False).cumsum()
df['val'] = np.where(df['col'] > 8, df['val'] - 1, df['val'])
df.assign(min_value=df.groupby('val')['col'].transform('min')).\
    query('col > 8 and min_value < 4')[['col']]
OUTPUT:
   col
2    9
9   10
Let's consider the following DataFrame:
import pandas as pd
df = pd.DataFrame([[1, -2, 3, -5, 4, 2, 7, -8, 2],
                   [2, -4, 6, 7, -8, 9, 5, 3, 2],
                   [2, 4, 6, 7, 8, 9, 5, 3, 2],
                   [1, 2, 3, 4, 5, 6, 7, 8, 9]]).transpose()
df.columns = ["A", "B", "C", "D"]
A B C D
0 1 2 2 1
1 -2 -4 4 2
2 3 6 6 3
3 -5 7 7 4
4 4 -8 8 5
5 2 9 9 6
6 7 5 5 7
7 -8 3 3 8
8 2 2 2 9
I want to append "pos" to the column name if the column contains only positive values. What I tried is:
pos_idx = df.loc[:, (df>0).all()].columns
df[pos_idx].columns = df[pos_idx].columns + "pos"
However, it seems not to work: it returns no error, but it does not change the column names. Moreover, and this is very interesting, this code:
df.columns = df.columns + "anything"
actually does add the word "anything" to the column names. Could you please explain why this happens (it works in the general case but not on the sliced frame), and how to do this correctly?
You are saving the new column names onto a copy of the dataframe. The statement below does not overwrite the column names of df, only those of the slice df[pos_idx]:
df[pos_idx].columns = df[pos_idx].columns + "pos"
Your second code example accesses df directly; that's why it works.
How to make it work? Build the full list of column names separately, then assign it to df.columns directly.
How to build the full list? Add "pos" as a suffix to every column that contains no values <= 0:
my_col_list = [col + 'pos' if count == 0 else col for col, count in (df <= 0).sum().items()]
df.columns = my_col_list
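A minimal end-to-end check of the idea; the comprehension over the (df > 0).all() mask below is an equivalent formulation of the same fix, shown on a small assumed frame:

```python
import pandas as pd

df = pd.DataFrame({'A': [1, -2, 3], 'B': [2, 4, 6], 'C': [1, 2, 3]})

# Append "pos" wherever every value in the column is strictly positive
mask = (df > 0).all()
df.columns = [col + 'pos' if is_pos else col for col, is_pos in mask.items()]
print(df.columns.tolist())  # ['A', 'Bpos', 'Cpos']
```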
First of all, you can use the .rename() function to change the names of columns.
To add ' pos' to the columns with only non-negative values, you can use this:
renamed_columns = {i:i+' pos' for i in df.columns if df[i].min()>=0}
df.rename(columns=renamed_columns,inplace=True)
How can I create a Pandas DataFrame that shows the relative position of each value, when those values are sorted from low to high for each column?
So in this case, how can you transform 'df' into 'dfOut'?
import pandas as pd
import numpy as np
#create DataFrame
df = pd.DataFrame({'A': [12, 18, 9, 21, 24, 15],
'B': [18, 22, 19, 14, 14, 11],
'C': [5, 7, 7, 9, 12, 9]})
# How to assign a value to the order in the column, when sorted from low to high?
dfOut = pd.DataFrame({'A': [2, 4, 1, 5, 6, 3],
'B': [3, 5, 4, 2, 2, 1],
'C': [1, 2, 2, 3, 4, 3]})
If you need the same values to map to the same output, try the rank method of a DataFrame, like this:
>>> dfOut = df.rank(method="dense").astype(int)  # type cast added to match your output
>>> dfOut
A B C
0 2 3 1
1 4 5 2
2 1 4 2
3 5 2 3
4 6 2 4
5 3 1 3
The rank method computes the rank for each column following a specific criteria. According to the Pandas documentation, the "dense" method ensures that "rank always increases by 1 between groups", and that might match your use case.
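The choice of method matters when there are ties. Column B above contains 14 twice: "dense" gives both occurrences rank 2 and continues with 3 for the next value, while "min" would leave a gap after the tie. A small comparison:

```python
import pandas as pd

s = pd.Series([18, 22, 19, 14, 14, 11])

print(s.rank(method='dense').astype(int).tolist())  # [3, 5, 4, 2, 2, 1]
print(s.rank(method='min').astype(int).tolist())    # [4, 6, 5, 2, 2, 1]
```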
Original answer: In case that repeated numbers are not required to map to the same out value, np.argsort could be applied on each column to retrieve the position of each value that would sort the column. Combine this with the apply method of a DataFrame to apply the function on each column and you have this:
>>> dfOut = df.apply(lambda column: np.argsort(column.values))
>>> dfOut
A B C
0 2 5 0
1 0 3 1
2 5 4 2
3 1 0 3
4 3 2 5
5 4 1 4
Here is my attempt using some functions:
def sorted_idx(l, num):
x = sorted(list(set(l)))
for i in range(len(x)):
if x[i]==num:
return i+1
def output_list(l):
ret = [sorted_idx(l, elem) for elem in l]
return ret
dfOut = df.apply(lambda column: output_list(column))
print(dfOut)
I reduce the original list to its unique values and then sort them. Finally, I return index+1 of the position where each element of the original list appears in this unique, sorted list, which gives the values in your expected output.
Output:
A B C
0 2 3 1
1 4 5 2
2 1 4 2
3 5 2 3
4 6 2 4
5 3 1 3
If I slice a dataframe with something like
>>> df = pd.DataFrame(data=[[x] for x in [1,2,3,5,1,3,2,1,1,4,5,6]], columns=['A'])
>>> df.loc[df['A'] == 1]
# or
>>> df[df['A'] == 1]
A
0 1
4 1
7 1
8 1
how could I pad my selections by a buffer of 1 and get each of the indices 0, 1, 3, 4, 5, 6, 7, 8, 9? I want to select all rows for which the value in column 'A' is 1, but also any row immediately before or after such a row.
edit I'm hoping to figure out a solution that works for arbitrary pad sizes, rather than just for a pad size of 1.
edit 2 here's another example illustrating what I'm going for
df = pd.DataFrame(data=[[x] for x in [1,2,3,5,3,2,1,1,4,5,6,0,0,3,1,2,4,5]], columns=['A'])
and we're looking for pad == 2. In this case I'd be trying to fetch rows 0, 1, 2, 4, 5, 6, 7, 8, 9, 12, 13, 14, 15, 16.
You can use shift together with bitwise or (|):
c = df['A'] == 1
df[c | c.shift(fill_value=False) | c.shift(-1, fill_value=False)]
A
0 1
1 2
3 5
4 1
5 3
6 2
7 1
8 1
9 4
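For an arbitrary pad size, the same shift idea generalizes to a centered rolling window over the boolean mask. A sketch using the question's second example (the window covers pad rows on each side):

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3, 5, 3, 2, 1, 1, 4, 5, 6, 0, 0, 3, 1, 2, 4, 5]})
pad = 2

c = (df['A'] == 1).astype(int)
# A row survives if any match lies within `pad` rows on either side
m = c.rolling(2 * pad + 1, center=True, min_periods=1).max().astype(bool)
print(df[m].index.tolist())
# [0, 1, 2, 4, 5, 6, 7, 8, 9, 12, 13, 14, 15, 16]
```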
For arbitrary pad sizes, you can try where, interpolate, and notna to create the mask:
n = 2
c = df.where(df['A'] == 1)
m = c.interpolate(limit=n, limit_direction='both').notna()
df[m]
Out[61]:
A
0 1
1 2
2 3
4 3
5 2
6 1
7 1
8 4
9 5
12 0
13 3
14 1
15 2
16 4
Here is an approach that allows for multiple pad levels. Use ffill and bfill on the boolean mask (df['A'] == 1), after converting the False values to np.nan:
import numpy as np
pad = 2
mask = ((df['A'] == 1)
        .replace(False, np.nan)
        .ffill(limit=pad)
        .bfill(limit=pad)
        .replace(np.nan, False)
        .astype(bool))
df[mask]
Here it is in action:
def padsearch(df, column, value, pad):
return df[(df[column] == value).replace(False, np.nan).ffill(limit=pad).bfill(limit=pad).replace(np.nan,False).astype(bool)]
# your first example
df = pd.DataFrame(data=[[x] for x in [1,2,3,5,1,3,2,1,1,4,5,6]], columns=['A'])
print(padsearch(df=df, column='A', value=1, pad=1))
# your other example
df = pd.DataFrame(data=[[x] for x in [1,2,3,5,3,2,1,1,4,5,6,0,0,3,1,2,4,5]], columns=['A'])
print(padsearch(df=df, column='A', value=1, pad=2))
Result:
A
0 1
1 2
3 5
4 1
5 3
6 2
7 1
8 1
9 4
A
0 1
1 2
2 3
4 3
5 2
6 1
7 1
8 4
9 5
12 0
13 3
14 1
15 2
16 4
Granted, the command is far less elegant, and it's a little clunky to convert False to and from null. But it still uses only pandas builtins, so it remains fairly quick.
I found another solution, though not nearly as slick as some of the ones already posted.
# setup
df = ...
pad = 2
# determine the set of indices, clipped to the frame's bounds
indices = {
    x + y
    for x in df[df['A'] == 1].index
    for y in range(-pad, pad + 1)
    if 0 <= x + y < len(df)
}
# fetch rows in index order
df.iloc[sorted(indices)]
so I have this data set below that I want to sort based on mylist from column 'name', as well as ascending by 'A' and descending by 'B'
import pandas as pd
import numpy as np
df1 = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6], 'name': ['x', 'x', 'x']})
df2 = pd.DataFrame({'B': [5, 6, 7], 'A': [8, 9, 10], 'name': ['y', 'y', 'y']})
df3 = pd.DataFrame({'C': [5, 6, 7], 'D': [8, 9, 10], 'A': [1, 2, 3], 'B': [4, 5, 7], 'name': ['z', 'z', 'z']})
df_list = [df1,df2,df3[['A','B','name']]]
df = pd.concat(df_list, ignore_index=True)
so my list is
mylist = ['z','x','y']
I want the dataset to be sorted first by my list, then ascending by column A, then descending by column B.
Is there a way to do this in Python?
======== Edit ==========
I want my final result to be something like
OK, one way to sort by a custom order is to create a dict that defines how the 'name' column should be ordered, call map to add a new column encoding that order, then call sort_values with the new column plus the others, passing the ascending parameter to selectively decide whether each column is sorted ascending or not, and finally drop the helper column:
In [20]:
name_sort = {'z':0,'x':1,'y':2}
df['name_sort'] = df.name.map(name_sort)
df
Out[20]:
A B name name_sort
0 1 4 x 1
1 2 5 x 1
2 3 6 x 1
3 8 5 y 2
4 9 6 y 2
5 10 7 y 2
6 1 4 z 0
7 2 5 z 0
8 3 7 z 0
In [23]:
df = df.sort_values(['name_sort','A','B'], ascending=[True, True, False])
df
Out[23]:
A B name name_sort
6 1 4 z 0
7 2 5 z 0
8 3 7 z 0
0 1 4 x 1
1 2 5 x 1
2 3 6 x 1
3 8 5 y 2
4 9 6 y 2
5 10 7 y 2
In [25]:
df = df.drop('name_sort', axis=1)
df
Out[25]:
A B name
6 1 4 z
7 2 5 z
8 3 7 z
0 1 4 x
1 2 5 x
2 3 6 x
3 8 5 y
4 9 6 y
5 10 7 y
We can also do this with an ordered categorical dtype:
t = pd.CategoricalDtype(categories=['z', 'x', 'y'], ordered=True)
df['sort'] = df['name'].astype(t)
df.sort_values(by=['sort', 'A', 'B'], ascending=[True, True, False], inplace=True)
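A self-contained sketch of the categorical approach, with the descending sort on B made explicit and the helper column dropped afterwards (the small frame below just mirrors the question's data):

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3, 8, 9, 10, 1, 2, 3],
                   'B': [4, 5, 6, 5, 6, 7, 4, 5, 7],
                   'name': ['x'] * 3 + ['y'] * 3 + ['z'] * 3})

# The ordered categorical encodes the custom 'z' < 'x' < 'y' order
t = pd.CategoricalDtype(categories=['z', 'x', 'y'], ordered=True)
out = (df.assign(sort=df['name'].astype(t))
         .sort_values(by=['sort', 'A', 'B'], ascending=[True, True, False])
         .drop(columns='sort'))
print(out['name'].tolist())  # ['z', 'z', 'z', 'x', 'x', 'x', 'y', 'y', 'y']
```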