Renaming parts of dataframe column names with values from another dataframe - python

I want to change column names using values from another DataFrame.
There are some similar questions on Stack Overflow, but I need a more advanced version of them.
import pandas as pd

data1 = {
    "ABC-123_afd": [420, 380, 390],
    "LFK-402_ote": [50, 40, 45],
    "BPM-299_qbm": [50, 40, 45],
}
data2 = {
    "ID": ['ABC-123', 'LFK-402', 'BPM-299'],
    "NewID": ['IQU', 'EUW', 'NMS'],
}
data1_df = pd.DataFrame(data1)
#    ABC-123_afd  LFK-402_ote  BPM-299_qbm
# 0          420           50           50
# 1          380           40           40
# 2          390           45           45
data2_df = pd.DataFrame(data2)
#         ID NewID
# 0  ABC-123   IQU
# 1  LFK-402   EUW
# 2  BPM-299   NMS
I want the final result to look like this:
data_final_df
#    IQU_afd  EUW_ote  NMS_qbm
# 0      420       50       50
# 1      380       40       40
# 2      390       45       45
I tried the code in Renaming columns of dataframe with values from another dataframe.
It ran without error, but nothing changed; I think the column names in data1 do not exactly match the values in data2's ID column.
How can I replace part of each column name using values from another pandas DataFrame?

We could create a mapping from "ID" to "NewID" and use it to modify column names:
mapping = dict(zip(data2['ID'], data2['NewID']))
# split each name into (prefix, suffix) and swap the prefix via the mapping
data1_df.columns = [mapping[x] + '_' + y for x, y in data1_df.columns.str.split('_')]
print(data1_df)
or
s = data1_df.columns.str.split('_')
data1_df.columns = s.str[0].map(mapping) + '_' + s.str[1]
or use the DataFrame data2_df:
s = data1_df.columns.str.split('_')
data1_df.columns = s.str[0].map(data2_df.set_index('ID')['NewID']) + '_' + s.str[1]
Output:
   IQU_afd  EUW_ote  NMS_qbm
0      420       50       50
1      380       40       40
2      390       45       45
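If some columns might not have a prefix listed in data2, a defensive variant (just a sketch, assuming every column name contains exactly one '_') falls back to the original prefix so unmatched columns are left untouched:
# mapping.get(x, x) keeps the old prefix when x has no replacement
data1_df.columns = [mapping.get(x, x) + '_' + y for x, y in data1_df.columns.str.split('_')]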

One option is to use replace:
mapping = dict(zip(data2['ID'], data2['NewID']))
s = pd.Series(data1_df.columns)
# each ID is treated as a regex pattern and replaced inside the column names
data1_df.columns = s.replace(regex=mapping)
data1_df
   IQU_afd  EUW_ote  NMS_qbm
0      420       50       50
1      380       40       40
2      390       45       45
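One caveat worth hedging: replace(regex=mapping) treats each ID as a regular expression. The '-' in these IDs happens to be harmless, but if IDs could contain regex metacharacters such as '.' or '+', escaping them first keeps the substitution literal (a sketch):
import re
mapping = {re.escape(k): v for k, v in zip(data2['ID'], data2['NewID'])}
s = pd.Series(data1_df.columns)
data1_df.columns = s.replace(regex=mapping)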

Related

Filtering dataframes based on one column with a different type of other column

I have the following problem:
import pandas as pd

data = {
    "ID": [420, 380, 390, 540, 520, 50, 22],
    "duration": [50, 40, 45, 33, 19, 1, 3],
    "next": ["390;50", "880;222", "520;50", "380;111", "810;111", "22;888", "11"],
}
# load data into a DataFrame object:
df = pd.DataFrame(data)
print(df)
As you can see I have:
    ID  duration     next
0  420        50   390;50
1  380        40  880;222
2  390        45   520;50
3  540        33  380;111
4  520        19  810;111
5   50         1   22;888
6   22         3       11
Things to notice:
ID is of type int.
next is a string containing one or more numbers, separated by ; when there is more than one.
I would like to keep only the rows where none of the next values appear in the ID column.
For example, in this case:
420 has a follow-up in both 390 and 50, so it is dropped.
380 has as next 880 and 222, neither of which is in ID, so this row is kept.
540 has as next 380 and 111; while 111 is not in ID, 380 is, so this row is dropped.
The same goes for 50.
In the end I want to get:
1  380  40  880;222
4  520  19  810;111
6   22   3       11
With only one value I used print(df[~df.next.astype(int).isin(df.ID)]), but here isin cannot be applied so simply.
How can I do this?
Let us try split, then explode, with an isin check:
# one row per individual "next" value, keeping the original row index
s = df.next.str.split(';').explode().astype(int)
# regroup by the original index and drop rows where any next value is in ID
out = df[~s.isin(df['ID']).groupby(level=0).any()]
Output:
    ID  duration     next
1  380        40  880;222
4  520        19  810;111
6   22         3       11
Use a regex with word boundaries for efficiency:
pattern = '|'.join(df['ID'].astype(str))
# drop rows whose "next" string contains any ID as a whole number
out = df[~df['next'].str.contains(fr'\b(?:{pattern})\b')]
Output:
    ID  duration     next
1  380        40  880;222
4  520        19  810;111
6   22         3       11
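If regexes feel opaque, a plain-Python alternative sketch using a set gives the same result and avoids building a large alternation pattern:
# keep a row only if none of its "next" parts occurs among the IDs
ids = set(df['ID'].astype(str))
keep = df['next'].str.split(';').apply(lambda parts: not any(p in ids for p in parts))
out = df[keep]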

How to calculate value of column in dataframe based on value/count of other columns in the dataframe in python?

I have a pandas DataFrame holding 24 hours of data per day for a whole month, with the following fields:
(df1): date, hour, mid, rid, percentage, total
I need to create a second DataFrame from it, with the following fields:
(df2): date, hour, mid, rid, hour_total
Here hour_total is to be calculated as follows:
for each (date, mid, rid) combination in df1, if the count of records where df1.percentage is 0 equals 24, then hour_total = df1.total / 24; otherwise hour_total = (df1.percentage / 100) * df1.total.
For example, if df1 is as below (the count of records for the (date, mid, rid) group where perc is 0 is 24):
date,hour,mid,rid,perc,total
2019-10-31,0,2,0,0,3170.87
2019-10-31,1,2,0,0,3170.87
2019-10-31,2,2,0,0,3170.87
2019-10-31,3,2,0,0,3170.87
2019-10-31,4,2,0,0,3170.87
...
2019-10-31,23,2,0,0,3170.87
Then df2 should be (hour_total = df1.total / 24 = 132.12 for every hour of the group):
date,hour,mid,rid,hour_total
2019-10-31,0,2,0,132.12
2019-10-31,1,2,0,132.12
2019-10-31,2,2,0,132.12
2019-10-31,3,2,0,132.12
2019-10-31,4,2,0,132.12
...
2019-10-31,23,2,0,132.12
How can I accomplish this?
You can try the apply function.
For example:
from datetime import datetime

import numpy as np
import pandas as pd

a = np.random.randint(100, 200, size=5)
b = np.random.randint(100, 200, size=5)
c = [datetime.now() for x in range(100) if x % 20 == 0]  # five timestamps
df1 = pd.DataFrame({'Time': c, 'A': a, 'B': b})
The above data frame looks like this:
                        Time    A    B
0 2019-10-24 20:37:38.907058  158  190
1 2019-10-24 20:37:38.907058  161  127
2 2019-10-24 20:37:38.908056  100  100
3 2019-10-24 20:37:38.908056  163  164
4 2019-10-24 20:37:38.908056  121  159
Now suppose we want to compute a new column whose value depends on the values of the other columns. You can define a function which does this computation:
def func(row):
    # row is a Series holding one row of the frame, indexed by column name
    a = row['A']
    b = row['B']
    return a + b
And apply this function to the data frame row by row:
df1["new_col"] = df1.apply(func, axis=1)
Which yields the following result:
                        Time    A    B  new_col
0 2019-10-24 20:37:38.907058  158  190      348
1 2019-10-24 20:37:38.907058  161  127      288
2 2019-10-24 20:37:38.908056  100  100      200
3 2019-10-24 20:37:38.908056  163  164      327
4 2019-10-24 20:37:38.908056  121  159      280
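Applied to the df1 from the question, the same row-wise idea can be combined with a groupby. A sketch, assuming df1 has the columns date, hour, mid, rid, percentage and total exactly as described:
import numpy as np
# per (date, mid, rid) group, count the rows where percentage is 0
zero_count = (df1.groupby(['date', 'mid', 'rid'])['percentage']
                 .transform(lambda s: s.eq(0).sum()))
df2 = df1[['date', 'hour', 'mid', 'rid']].copy()
# split total evenly when all 24 hourly records have percentage 0,
# otherwise weight it by the percentage
df2['hour_total'] = np.where(zero_count == 24,
                             df1['total'] / 24,
                             df1['percentage'] / 100 * df1['total'])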

Reading specific rows out of a pandas dataframe using a list

I have a pandas dataframe that I need to pull specific rows out of and into a new dataframe.
These row numbers are in a list that looks something like this: [42, 50, 52, 59, 60, 62]
I am creating the dataframe from a .csv file, but as far as I can tell there is no way to designate the row numbers when reading the .csv and creating the dataframe.
import pandas as pd
df = pd.read_csv('/Users/uni/Desktop/corrindex+id/rt35', index_col=False, header=None)
Here's a portion of the dataframe:
                  0
0    1 269 245 44 5
1    2 293 393 33 5
2   3 295 175 67 12
3    4 298 415 33 5
4  5 304 392 213 11
Use skiprows with a callable:
import pandas as pd

keep_rows = [42, 50, 52, 59, 60, 62]
df = pd.read_csv('/Users/uni/Desktop/corrindex+id/rt35',
                 header=None,
                 skiprows=lambda x: x not in keep_rows)
Unfortunately, pandas read_csv expects a true file, not a mere line generator, so it is not easy to select only a bunch of lines. But you can do that at the Python level easily:
import io

lines = [line for i, line in enumerate(open('/Users/uni/Desktop/corrindex+id/rt35'), 1)
         if i in [42, 50, 52, 59, 60, 62]]
df = pd.read_csv(io.StringIO(''.join(lines)), index_col=False, header=None)
You can also use skiprows to ignore all the lines except the ones to keep:
df = pd.read_csv('/Users/uni/Desktop/corrindex+id/rt35', index_col=False,
                 header=None, skiprows=lambda x: x not in [42, 50, 52, 59, 60, 62])
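Note that the callable passed to skiprows receives 0-based line numbers, while the enumerate(..., 1) variant above counts from 1, so the two snippets select different physical lines for the same list. A quick sanity check (a sketch with a small in-memory file):
import io
import pandas as pd

csv = io.StringIO('\n'.join(str(i) for i in range(10)))
keep_rows = [2, 5, 7]
df = pd.read_csv(csv, header=None, skiprows=lambda x: x not in keep_rows)
print(df)  # the values 2, 5 and 7: skiprows matched the 0-based line numbers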
You can go about it like this:
import pandas as pd

my_list = [42, 50, 52, 59, 60, 62]
df = pd.read_csv('/Users/uni/Desktop/corrindex+id/rt35',
                 index_col=False,
                 header=None,
                 nrows=max(my_list) + 1).iloc[my_list]

Dataframe/Row Indexing for Pandas

I was wondering how I could index datasets so that a row number from df1 can correspond to a different row number in df2, e.g. row 1 in df1 = row 3 in df2.
What I would like (in this case: row 1 of 2011 = row 2 of 2016):
row 49:50 of the 2011 data (b1) is the same item as row 51:52 (bt) of the 2016 data (the same item, but with a different value in a different year); it is sliced differently because it sits in a different cell in 2016.
I've been using pd.concat and pd.Series, but still no success.
# slicing 2011 data (total)
b1 = df1.iloc[49:50, 6:7]
m1 = df1.iloc[127:128, 6:7]
a1 = df1.iloc[84:85, 6:7]
data2011 = pd.concat([b1, m1, a1])
# slicing 2016 data (total)
bt = df2.iloc[51:52, 6:7]
mt = df2.iloc[129:130, 6:7]
at = df2.iloc[86:87, 6:7]
data2016 = pd.concat([bt, mt, at])
data20112016 = pd.concat([data2011, data2016])
print(data20112016)
Output I'm getting:
What I need to fix (in this case: row 49 = row 51, so 11849 in the left column and 13500 in the right column):
49     11849
127    22622
84     13658
51     13500
129    25281
86     18594
I would like to do a bar graph comparing b1 2011 to bt 2016 and so on, meaning 49 = 51, 127 = 129, etc.
#          Tot_x  Tot_y
# 49=51    11849  13500
# 127=129  22622  25281
# 84=86    13658  18594
I hope this clears things up.
Thanks in advance.
If I understood your question correctly, here is a solution using merge:
df1 = pd.DataFrame([9337, 2953, 8184], index=[49, 127, 84], columns=['Tot'])
df2 = pd.DataFrame([13500, 25281, 18594], index=[51, 129, 86], columns=['Tot'])
total_df = (df1.reset_index()
               .merge(df2.reset_index(), left_index=True, right_index=True))
And here is one using concat (note that concat keeps the duplicate column names, so it is the merge variant that produces the index_x/index_y labels used below):
total_df = pd.concat([df1.reset_index(), df2.reset_index()], axis=1)
To build the combined row labels and the resulting bar plot:
total_df.index = total_df['index_x'].astype(str) + '=' + total_df['index_y'].astype(str)
total_df
#          index_x  Tot_x  index_y  Tot_y
# 49=51         49   9337       51  13500
# 127=129      127   2953      129  25281
# 84=86         84   8184       86  18594
(total_df.drop(['index_x', 'index_y'], axis=1)
         .plot(kind='bar', rot=0))
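As a side note, a compact sketch of the same idea: renaming the two Tot columns up front (the names Tot_2011 and Tot_2016 are made up here) avoids the drop step before plotting:
plot_df = pd.concat(
    [df1['Tot'].reset_index(drop=True).rename('Tot_2011'),
     df2['Tot'].reset_index(drop=True).rename('Tot_2016')],
    axis=1)
plot_df.index = df1.index.astype(str) + '=' + df2.index.astype(str)
plot_df.plot(kind='bar', rot=0)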

filter pandas dataframe based on another column

this might be a basic question, but I have not been able to find a solution. I have two DataFrames with identical rows and columns, called Volumes and Prices, which look like this:
Volumes
Index  ProductA  ProductB  ProductC  ProductD  Limit
0           100       300       400        78    100
1           110       370        20        30    100
2            90       320       200       121    100
3           150       320       410        99    100
....
Prices
Index  ProductA  ProductB  ProductC  ProductD  Limit
0            50       110        30        90      0
1            51       110        29        99      0
2            49       120        25        88      0
3            51       110        22        96      0
....
I want to assign 0 to each "cell" of the Prices DataFrame whose corresponding volume is less than that row's Limit,
so the ideal output would be:
Prices
Index  ProductA  ProductB  ProductC  ProductD  Limit
0            50       110        30         0      0
1            51       110         0         0      0
2             0       120        25        88      0
3            51       110        22         0      0
....
I tried
import pandas as pd
import numpy as np

d_price = {'ProductA': [50, 51, 49, 51], 'ProductB': [110, 110, 120, 110],
           'ProductC': [30, 29, 25, 22], 'ProductD': [90, 99, 88, 96], 'Limit': [0] * 4}
d_volume = {'ProductA': [100, 110, 90, 150], 'ProductB': [300, 370, 320, 320],
            'ProductC': [400, 20, 200, 410], 'ProductD': [78, 30, 121, 99], 'Limit': [100] * 4}
Prices = pd.DataFrame(d_price)
Volumes = pd.DataFrame(d_volume)
Prices[Volumes > Volumes.Limit] = 0
but I do not obtain any changes to the Prices DataFrame. Obviously I'm having a hard time understanding boolean slicing; any help would be great.
The problem is in
Prices[Volumes > Volumes.Limit] = 0
Comparing a DataFrame with a Series aligns the Series index with the DataFrame's columns; here Volumes.Limit is indexed 0-3 while the columns are product names, so the mask is False everywhere and nothing changes. Since the limit must be compared per row, you can use apply; note that to match the desired output (zero where the volume is below the limit) the comparison is x < x.Limit:
Prices[Volumes.apply(lambda x: x < x.Limit, axis=1)] = 0
You can use a boolean mask with fillna to solve this problem. I am not an expert either, but this solution does what you want to do (note that .ix has been removed from pandas, so .loc is used here, and ge(..., axis=0) compares each column against that row's Limit):
test = Volumes.loc[:, 'ProductA':'ProductD'].ge(Volumes['Limit'], axis=0)
final = Prices[test].fillna(0)
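A fully vectorized sketch using mask, assuming Prices and Volumes share the same index and columns: compare every column to that row's Limit and zero the prices where the volume falls short.
# True where the volume is below that row's Limit (Limit < Limit is False, so the Limit column survives)
below = Volumes.lt(Volumes['Limit'], axis=0)
Prices_out = Prices.mask(below, 0)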
