I have the current DataFrame:
df = pd.DataFrame([[-1, 2, 1, 3], [4, 6, 7,8], [-2, 10, 11, 13], [5, 6, 8, 9]], columns=['0', '1', '2', '3'])
I am trying to automatically rename the columns in pairs so that they become lo1, up1, lo2, up2. This is just an example; I am hoping for an approach that scales to a DataFrame with many columns.
Thanks.
An efficient approach is to use itertools.product:
from itertools import product
cols = [f'{a}{b+1}' for b, a in product(range(len(df.columns)//2), ['lo', 'up'])]
df.columns = cols
Output:
lo1 up1 lo2 up2
0 -1 2 1 3
1 4 6 7 8
2 -2 10 11 13
3 5 6 8 9
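To see why this yields the names in the right order, here is a minimal sketch that builds the same list without a DataFrame. The variable names (n_columns, pairs, names) are illustrative and not part of the answer above.

```python
from itertools import product

n_columns = 4  # assumed column count, matching the example DataFrame

# product iterates the right-most iterable fastest, so each pair number
# is combined with 'lo' and then 'up' before moving on to the next number.
pairs = list(product(range(n_columns // 2), ["lo", "up"]))
names = [f"{prefix}{number + 1}" for number, prefix in pairs]

print(pairs)
print(names)
```

Note that with an odd number of columns, `n_columns // 2` leaves the list one name short, so assigning it to `df.columns` would raise a length-mismatch error.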
Here's one approach:
import pandas as pd
df = pd.DataFrame([[-1, 2, 1, 3], [4, 6, 7,8], [-2, 10, 11, 13], [5, 6, 8, 9]], columns=['0', '1', '2', '3'])
max_nr = int(len(df.columns)/2)
for i in range(max_nr):
    df.rename(columns={'{}'.format(i*2):'lo{}'.format(i), '{}'.format(i*2+1):'hi{}'.format(i)}, inplace=True)
output:
lo0 hi0 lo1 hi1
0 -1 2 1 3
1 4 6 7 8
2 -2 10 11 13
3 5 6 8 9
Although it is longer than the concise solutions posted above, another option is to define a function, col_namer, in which col_size is the number of columns in the DataFrame and names_list is the list of strings you want to use as column name prefixes:
def col_namer(col_size, names_list):
    rep = len(names_list)
    col_names = []
    mul = col_size // rep
    rem = col_size % rep
    if rem == 0:
        for i in range(1, mul + 1):
            for j in names_list:
                col_names.append(j + str(i))
        return col_names
    else:
        for i in range(1, mul + 1):
            for j in names_list:
                col_names.append(j + str(i))
        for i in range(mul + 1, mul + rem + 2):
            for j in names_list:
                if len(col_names) < col_size:
                    col_names.append(j + str(i))
        return col_names
df = pd.DataFrame([[-1, 2, 1, 3], [4, 6, 7,8], [-2, 10, 11, 13], [5, 6, 8, 9]], columns=['0', '1', '2', '3'])
df.columns = col_namer(df.shape[1], ['lo', 'up'])
df
Out:
lo1 up1 lo2 up2
0 -1 2 1 3
1 4 6 7 8
2 -2 10 11 13
3 5 6 8 9
Related
I have a table like this:
import pandas as pd
df = pd.DataFrame({
"day": [1, 2, 3, 4, 5, 6],
"tmin": [-2, -3, -1, -4, -4, -2]
})
I want to create a column like this:
df['days_under_0_until_now'] = [1, 2, 3, 4, 5, 6]
df['days_under_-2_until_now'] = [1, 2, 0, 1, 2, 3]
df['days_under_-3_until_now'] = [0, 1, 0, 1, 2, 0]
So days_under_X_until_now means the number of consecutive days, up to and including the current one, on which tmin was less than or equal to X.
I'd like to avoid doing this with loops since the data is huge. Is there an alternative way to do it?
To improve performance, avoid groupby: compare the values of the column against the list of thresholds, then use this solution to count consecutive Trues:
import numpy as np
vals = [0,-2,-3]
arr = df['tmin'].to_numpy()[:, None] <= np.array(vals)[None, :]
cols = [f'days_under_{v}_until_now' for v in vals]
df1 = pd.DataFrame(arr, columns=cols, index=df.index)
b = df1.cumsum()
df = df.join(b.sub(b.mask(df1).ffill().fillna(0)).astype(int))
print (df)
day tmin days_under_0_until_now days_under_-2_until_now \
0 1 -2 1 1
1 2 -3 2 2
2 3 -1 3 0
3 4 -4 4 1
4 5 -4 5 2
5 6 -2 6 3
days_under_-3_until_now
0 0
1 1
2 0
3 1
4 2
5 0
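The cumsum/mask/ffill step can be hard to follow, so here is a minimal sketch of the same idea on a single boolean Series. It uses the tmin <= -2 flags from the question; the names flags, running, baseline and streak are illustrative only.

```python
import pandas as pd

# tmin <= -2 for tmin = [-2, -3, -1, -4, -4, -2]
flags = pd.Series([True, True, False, True, True, True])

# Running count of True values seen so far.
running = flags.cumsum()

# Keep the running count only at False rows, carry it forward,
# and use 0 before the first False row.
baseline = running.mask(flags).ffill().fillna(0)

# Subtracting the count at the last False row restarts the streak there.
streak = (running - baseline).astype(int)
print(streak.tolist())
```

The result matches the days_under_-2_until_now column in the question.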
I have a pandas dataframe, with a column containing item numbers that are supposed to increase by 1, each row.
df1 = pd.DataFrame({
"item_number" : [1, 2, 3, 4, 5, 6, 8, 10],
"col_A" : ['aaa','bbb','ccc','ddd','eee','fff','hhh', 'jjj']})
df1
item_number col_A
0 1 aaa
1 2 bbb
2 3 ccc
3 4 ddd
4 5 eee
5 6 fff
6 8 hhh
7 10 jjj
As you can see, the item number increases by two between 6 and 8 and between 8 and 10. Is there a way to write a function that returns a list of the skipped numbers, i.e. ['7','9'], and otherwise returns True?
s = pd.Series(range(df1['item_number'].min(), df1['item_number'].max() + 1))
s[~s.isin(df1['item_number'])].values
array([7, 9], dtype=int64)
one-liner:
set(range(df1.item_number.min(), df1.item_number.max()+1)) - set(df1.item_number) or True
You can take advantage of Python set and list operations to check whether the input list has any gaps:
li = [1, 2, 3, 4, 5, 6, 8, 10]
def fun(l):
    a = list(set(list(range(l[0], l[-1]+1))) - set(l))
    if a == []:
        return True
    else:
        return a
print(fun(li))
Output:
[9, 7]
Also, you can use return sorted(a) if you want the list elements to be returned in order.
Use range with np.setdiff1d:
In [1518]: import numpy as np
In [1519]: rng = range(df1.item_number.min(), df1.item_number.max() + 1)
In [1523]: res = np.setdiff1d(rng, df1.item_number)
In [1524]: res
Out[1524]: array([7, 9])
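The question also asks for True when nothing is missing. A minimal sketch of a wrapper around np.setdiff1d that does this is shown below; the function name missing_items is hypothetical and not from the answer above.

```python
import numpy as np
import pandas as pd

def missing_items(numbers):
    """Return the sorted list of skipped integers, or True if there are none."""
    full = np.arange(numbers.min(), numbers.max() + 1)
    # setdiff1d returns the sorted values of `full` that are not in `numbers`.
    missing = np.setdiff1d(full, numbers)
    return missing.tolist() if missing.size else True

df1 = pd.DataFrame({
    "item_number": [1, 2, 3, 4, 5, 6, 8, 10],
    "col_A": ["aaa", "bbb", "ccc", "ddd", "eee", "fff", "hhh", "jjj"],
})

with_gaps = missing_items(df1["item_number"])
without_gaps = missing_items(df1["item_number"].iloc[:6])
print(with_gaps)
print(without_gaps)
```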
This will do it:
def foo(df):
    x = df.set_index('item_number').reindex(range(df.item_number.min(), df.item_number.max() + 1))
    x = list(x.index[x.col_A.isna()])
    return x if x else True
Examples:
y = foo(df1)
print(y)
y = foo(df1.loc[range(1, 6)])
print(y)
Output:
[7, 9]
True
I'm trying to figure out a way to loop over a pandas DataFrame to generate a new key.
Here's an example of the dataframe:
df = pd.DataFrame({"pdb" : ["a", "b"], "beg": [1, 2], "end" : [10, 11]})
for index, row in df.iterrows():
    df['range'] = [list(x) for x in zip(df['beg'], df['end'])]
And now I want to create a new key that takes the first and last number of df["range"] and fills the list with the numbers in between (i.e., the first one will be [1 2 3 4 5 6 7 8 9 10])
So far I think that I should be using something like this, but I could be completely wrong:
df["total"] = df["range"].map(lambda x: #and here I should append all the "x" that are betwen df["range"][0] and df["range"][1]
Here's an example of the result that I'm looking for:
pdb beg end range total
0 a 1 10 1 10 1 2 3 4 5 6 7 8 9 10
1 b 2 11 2 11 2 3 4 5 6 7 8 9 10 11
I could use some help with the lambda function, I get really confused with the syntax.
Try with apply
df['new'] = df.apply(lambda x : list(range(x['beg'],x['end']+1)),axis=1)
Out[423]:
0 [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
1 [2, 3, 4, 5, 6, 7, 8, 9, 10, 11]
dtype: object
This should work:
df['total'] = df['range'].apply(lambda x: [n for n in range(x[0], x[1]+1)])
As per your output, you need
In [18]: df['new'] = df.apply(lambda x : " ".join(list(map(str,range(x['beg'],x['end']+1)))),axis=1)
In [19]: df
Out[19]:
pdb beg end range new
0 a 1 10 [1, 10] 1 2 3 4 5 6 7 8 9 10
1 b 2 11 [2, 11] 2 3 4 5 6 7 8 9 10 11
If you want to use iterrows then you can do it in the loop itself as follows:
Code :
import pandas as pd
df = pd.DataFrame({"pdb" : ["a", "b"], "beg": [1, 2], "end" : [10, 11]})
for index, row in df.iterrows():
    df['range'] = [list(x) for x in zip(df['beg'], df['end'])]
    # end + 1 so that the last value is included in the list
    df['total'] = [list(range(beg, end + 1)) for beg, end in zip(df['beg'], df['end'])]
Output:
pdb beg end range total
0 a 1 10 [1, 10] [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
1 b 2 11 [2, 11] [2, 3, 4, 5, 6, 7, 8, 9, 10, 11]
I have a data frame say df1 with MULTILEVEL INDEX:
A B C D
0 0 0 1 2 3
4 5 6 7
1 2 8 9 10 11
3 2 3 4 5
and I have another data frame with 2 common columns in df2 also with MULTILEVEL INDEX
X B C Y
0 0 0 0 7 3
1 4 5 6 7
1 2 8 2 3 11
3 2 3 4 5
I need to remove the rows from df1 where the values of column B and C are the same as in df2, so I should be getting something like this:
A B C D
0 0 0 1 2 3
0 2 8 9 10 11
I have tried to do this by getting the index of the common elements and then remove them via a list, but they are all messed up and are in multi-level form.
You can do this in a one-liner using pandas.DataFrame.iloc, numpy.where and numpy.logical_or, like this (I find it to be the simplest way):
df1 = df1.iloc[np.where(np.logical_or(df1['B']!=df2['B'],df1['C']!=df2['C']))]
Of course, don't forget to:
import numpy as np
output:
A B C D
0 0 0 1 2 3
1 2 8 9 10 11
Hope this was helpful. If there are any questions or remarks please feel free to comment.
You could make MultiIndexes out of the B and C columns, and then call the index's isin method:
idx1 = pd.MultiIndex.from_arrays([df1['B'],df1['C']])
idx2 = pd.MultiIndex.from_arrays([df2['B'],df2['C']])
mask = idx1.isin(idx2)
result = df1.loc[~mask]
For example,
import pandas as pd
df1 = pd.DataFrame({'A': [0, 4, 8, 2], 'B': [1, 5, 9, 3], 'C': [2, 6, 10, 4], 'D': [3, 7, 11, 5], 'P': [0, 0, 1, 1], 'Q': [0, 0, 2, 3]})
df1 = df1.set_index(list('PQ'))
df1.index.names = [None,None]
df2 = pd.DataFrame({'B': [0, 5, 2, 3], 'C': [7, 6, 3, 4], 'P': [0, 0, 1, 1], 'Q': [0, 1, 2, 3], 'X': [0, 4, 8, 2], 'Y': [3, 7, 11, 5]})
df2 = df2.set_index(list('PQ'))
df2.index.names = [None,None]
idx1 = pd.MultiIndex.from_arrays([df1['B'],df1['C']])
idx2 = pd.MultiIndex.from_arrays([df2['B'],df2['C']])
mask = idx1.isin(idx2)
result = df1.loc[~mask]
print(result)
yields
A B C D
0 0 0 1 2 3
1 2 8 9 10 11
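Note that isin checks whether a (B, C) pair from df1 occurs anywhere in df2, not only on the row with the same label. A minimal sketch of that behaviour follows, using small hypothetical frames (left and right are made-up data, not the frames from the question):

```python
import pandas as pd

# Hypothetical frames: the pair (5, 6) appears in both, but on different rows.
left = pd.DataFrame({"B": [1, 5, 9], "C": [2, 6, 10]})
right = pd.DataFrame({"B": [5, 0, 9], "C": [6, 7, 99]})

left_pairs = pd.MultiIndex.from_arrays([left["B"], left["C"]])
right_pairs = pd.MultiIndex.from_arrays([right["B"], right["C"]])

# True where the (B, C) pair of `left` occurs in any row of `right`.
mask = left_pairs.isin(right_pairs)

# (9, 10) is kept: B matches a row of `right`, but the pair as a whole does not.
kept = left.loc[~mask]
print(kept)
```

If rows should only be dropped when the matching pair sits on the same row label, a row-wise comparison of aligned frames is the closer fit.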
My assignment states that I get a list of birthdays and that I have to arrange them chronologically. I must write my own sort, so I can't use Python's predefined functions, such as this:
import datetime
d = ['09-2012', '04-2007', '11-2012', '05-2013', '12-2006', '05-2006', '08-2007']
sorted(d, key=lambda x: datetime.datetime.strptime(x, '%m-%Y'))
Here is what I'm thinking of doing.
Step 1: Read all dates and put them into a list in dd/mm/yyyy order
date_list = [[1,2,1991],[2,1,1991],[3,4,1992],[5,6,1993],[4,5,1992],[8,5,1993]]
For better visualization, I will rearrange them like so:
1 / 2 / 1991
2 / 1 / 1991
3 / 4 / 1992
5 / 6 / 1993
4 / 5 / 1992
8 / 5 / 1993
Step 2: Sort the entire list by year (col 2)
1 / 2 / 1991
2 / 1 / 1991
3 / 4 / 1992
4 / 5 / 1992
5 / 6 / 1993
8 / 5 / 1993
Step 3: For each unique year, sort that sublist by the column near it (col 1)
2 / 1 / 1991
1 / 2 / 1991
3 / 4 / 1992
4 / 5 / 1992
8 / 5 / 1993
5 / 6 / 1993
Step 4: Do the same for the sublist of each unique month of that year (col 0)
2 / 1 / 1991
1 / 2 / 1991
3 / 4 / 1992
4 / 5 / 1992
8 / 5 / 1993
5 / 6 / 1993
And that should be it. I've used the following functions to try to do it:
# Sorts the sublist date_list[position..position+length] by the col
def insertion(date_list, position, length, col):
    for i in range(position + 1, position + length - 1):
        aux = date_list[i]
        j = i - 1
        while j >= 0 and aux[col] < date_list[j][col]:
            date_list[j+1] = date_list[j]
            j -= 1
        date_list[j+1] = aux
    return date_list
def sortDateList(date_list, position, length, col):
    # Nothing to do here
    if col < 0:
        return date_list
    # If it's the first sort, sort everything
    if col == 2:
        date_list = insertion(date_list, 0, len(date_list), 2)
    for i in range(position, position + length - 1):
        # Divides the list into sublists based on the column
        if date_list[i][col] == date_list[i][col]:
            length += 1
        else:
            # Sorts the sublist, then sorts it after the previous column in it
            date_list = insertion(date_list, position, length, col)
            date_list = sortDateList(date_list, position, length, col - 1)
            position += length
            length = 1
    date_list = insertion(date_list, position, length, col)
    return date_list
I'm not sure exactly what the problem is here, I'm pretty sure it's something really basic that slipped my mind, and I can't keep track of recursion in my brain that well. It gives me some index out of bound errors and such.
For debug, I've printed out info as such:
col position position + length
date_list[position:position+length] before insertion()
date_list[position:position+length] after insertion()
Here is what the console gives me:
2 0 6
2 0 7
[[1, 2, 1991], [2, 1, 1991], [3, 4, 1992], [4, 5, 1992], [5, 6, 1993], [8, 5, 1993]]
[[1, 2, 1991], [2, 1, 1991], [3, 4, 1992], [4, 5, 1992], [5, 6, 1993], [8, 5, 1993]]
1 0 7
[[1, 2, 1991], [2, 1, 1991], [3, 4, 1992], [4, 5, 1992], [5, 6, 1993], [8, 5, 1993]]
[[2, 1, 1991], [1, 2, 1991], [3, 4, 1992], [4, 5, 1992], [8, 5, 1993], [5, 6, 1993]]
0 0 7
[[2, 1, 1991], [1, 2, 1991], [3, 4, 1992], [4, 5, 1992], [8, 5, 1993], [5, 6, 1993]]
[[1, 2, 1991], [2, 1, 1991], [3, 4, 1992], [4, 5, 1992], [5, 6, 1993], [8, 5, 1993]]
0 7 8
[]
[]
0 8 9
[]
[]
0 9 10
[]
[]
0 10 11
[]
[]
0 11 12
Any help is greatly appreciated!
Just write a simple sort algorithm and a compare function, such as this:
date_list = [[1,2,1991],[2,1,1991],[3,4,1992],[5,6,1993],[4,5,1992],[8,5,1993]]
# first compare years, if equal compare months, if equal compare days
def compare(date1, date2):
    if date1[2] != date2[2]:
        return date1[2] < date2[2]
    if date1[1] != date2[1]:
        return date1[1] < date2[1]
    return date1[0] < date2[0]
for i in range(len(date_list)):
    for j in range(i+1, len(date_list)):
        if not compare(date_list[i], date_list[j]):
            date_list[i], date_list[j] = date_list[j], date_list[i]
print(date_list)
The time complexity is O(n^2) but you can improve it by using a more efficient sort algorithm.
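As one possible "more efficient sort algorithm", here is a minimal merge sort sketch that runs in O(n log n) and still avoids Python's built-in sorting functions. The names earlier_or_equal and merge_sort are illustrative, and the dates are assumed to be [day, month, year] lists as in the question.

```python
def earlier_or_equal(date1, date2):
    """Return True if date1 ([day, month, year]) is not later than date2."""
    # Compare year first, then month, then day.
    for col in (2, 1, 0):
        if date1[col] != date2[col]:
            return date1[col] < date2[col]
    return True

def merge_sort(dates):
    """Return a new chronologically sorted list without using sorted()."""
    if len(dates) <= 1:
        return list(dates)
    mid = len(dates) // 2
    left = merge_sort(dates[:mid])
    right = merge_sort(dates[mid:])
    merged = []
    i = j = 0
    # Repeatedly take the earlier of the two front elements.
    while i < len(left) and j < len(right):
        if earlier_or_equal(left[i], right[j]):
            merged.append(left[i])
            i += 1
        else:
            merged.append(right[j])
            j += 1
    merged.extend(left[i:])
    merged.extend(right[j:])
    return merged

date_list = [[1,2,1991],[2,1,1991],[3,4,1992],[5,6,1993],[4,5,1992],[8,5,1993]]
sorted_dates = merge_sort(date_list)
print(sorted_dates)
```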
If you convert each date to a YYYYMMDD string, you can sort it easily. Try sorting the concatenated strings instead of splitting each date into three parts.
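A minimal sketch of the YYYYMMDD idea, assuming the dates are [day, month, year] lists as in the question. The helper names date_key and insertion_sort_by_key are hypothetical, and a hand-written insertion sort is used because the assignment rules out the built-in sorting functions.

```python
def date_key(date):
    """Build a zero-padded YYYYMMDD string from a [day, month, year] list."""
    day, month, year = date
    return "{:04d}{:02d}{:02d}".format(year, month, day)

def insertion_sort_by_key(dates):
    """Return a new list sorted by date_key, using a plain insertion sort."""
    result = list(dates)
    for i in range(1, len(result)):
        current = result[i]
        j = i - 1
        # Zero-padded strings of equal length compare in chronological order.
        while j >= 0 and date_key(result[j]) > date_key(current):
            result[j + 1] = result[j]
            j -= 1
        result[j + 1] = current
    return result

date_list = [[1,2,1991],[2,1,1991],[3,4,1992],[5,6,1993],[4,5,1992],[8,5,1993]]
by_key = insertion_sort_by_key(date_list)
print(by_key)
```

With a single comparable key per date, only one sort pass is needed instead of one pass per column.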