Python replace all values in dataframe with values from other dataframe

Python replace all values in dataframe with values from other dataframe - python

I'm quite new to python (and pandas) and a have a replace task for a large dataframe i couldn't find a solution for.
So i have two dataframes, one (df1) which looks something like this:
Id Id Id
4954733 3929949 515674
2950086 1863885 4269069
1241018 3711213 4507609
3806276 2035233 4968071
4437138 1248817 1167192
5468160 4726010 2851685
1211786 2604463 5172095
2914539 5235788 4130808
4730974 5835757 1536235
2201352 5779683 5771612
3864854 4784259 2928288
the other dataframe (df2) containing all the 'old' id's and the corresponding new ones in the next column (from 1 to 20,000), which looks something like this:
Id Id_new
5774290 1
761000 2
3489755 3
1084156 4
2188433 5
3456900 6
4364416 7
3518181 8
3926684 9
5797492 10
4435820 11
what i would like to do is replace all the id's (all columns) in df1 with the corresponding Id_new from df2. I guess ideally without having to do a merge or join for each column, given the size of the dataset?
The result should be like this: df_new
Id_new Id_new Id_new
8 12 22
16 9 8
21 25 10
10 15 13
29 6 4
22 7 22
30 3 3
11 31 29
32 29 27
12 3 4
14 6 24
Any tips would be great, thanks in advance!

I think you need replace by Series created by set_index:
print (df1)
Id Id.1 Id.2
0 4954733 3929949 515674 <-first value changed for match data
1 2950086 1863885 4269069
2 1241018 3711213 4507609
3 3806276 2035233 4968071
4 4437138 1248817 1167192
5 5468160 4726010 2851685
6 1211786 2604463 5172095
7 2914539 5235788 4130808
8 4730974 5835757 1536235
9 2201352 5779683 5771612
10 3864854 4784259 2928288
df = df1.replace(df2.set_index('Id')['Id_new'])
print (df)
Id Id.1 Id.2
0 1 3929949 515674
1 2950086 1863885 4269069
2 1241018 3711213 4507609
3 3806276 2035233 4968071
4 4437138 1248817 1167192
5 5468160 4726010 2851685
6 1211786 2604463 5172095
7 2914539 5235788 4130808
8 4730974 5835757 1536235
9 2201352 5779683 5771612
10 3864854 4784259 2928288

Related

group rows based on a string in a column in pandas and count the number of occurrence of unique rows that contained the string

I have a dataset with a few columns. I would like to slice the data frame with finding a string "M22" in the column "Run number". I am able to do so. However, I would like to count the number of unique rows that contained the string "M22".
Here is what I have done for the below table (example):
RUN_NUMBER DATE_TIME CULTURE_DAY AGE_HRS AGE_DAYS
335991M 6/30/2022 0 0 0
M220621 7/1/2022 1 24 1
M220678 7/2/2022 2 48 2
510091M 7/3/2022 3 72 3
M220500 7/4/2022 4 96 4
335991M 7/5/2022 5 120 5
M220621 7/6/2022 6 144 6
M220678 7/7/2022 7 168 7
335991M 7/8/2022 8 192 8
M220621 7/9/2022 9 216 9
M220678 7/10/2022 10 240 10
here is the results I got:
RUN_NUMBER
335991M 0
510091M 0
335992M 0
M220621 3
M220678 3
M220500 1
Now I need to count the strings/rows that contained "M22" : so I need to get 3 as output.

Use the following approach with pd.Series.unique function:
df[df['RUN_NUMBER'].str.contains("M22")]['RUN_NUMBER'].unique().size
Or a more faster alternative using numpy.char.find function:
(np.char.find(df['RUN_NUMBER'].unique().astype(str), 'M22') != -1).sum()
3

HTML table in pandas with single header row

I have the following dataframe:
ID mutex add atomic add cas add ys_add blocking ticket queued fifo
Cores
1 21.0 7.1 12.1 9.8 32.2 44.6
2 121.8 40.0 119.2 928.7 7329.9 7460.1
3 160.5 81.5 227.9 1640.9 14371.8 11802.1
4 188.9 115.7 347.6 1945.1 29130.5 15660.1
There is both a column index (ID) and a row index (Cores). When I use DataFrame.to_html(), I get a table like this:
Instead, I'd like a table with a single header row, composed of all the column names (but without the column index name ID) and with the row index name Cores in that same header row, like so:
I'm open to manipulating the dataframe prior to the to_html() call, or adding parameters to the to_html() call, but not messing around with the generated html.

Initial setup:
import numpy as np
import pandas as pd
df = pd.DataFrame([[1,2,3,4],[5,6,7,8],[9,10,11,12],[13,14,15,16]],
columns = ['attr_a', 'attr_b', 'attr_c', 'attr_c'])
df.columns.name = 'ID'
df.index.name = 'Cores'
df
ID attr_a attr_b attr_c attr_c
Cores
0 1 2 3 4
1 5 6 7 8
2 9 10 11 12
3 13 14 15 16
Then set columns.name to 'Cores', and index.name to None. df.to_html() should then give you the output you want.
df.columns.name='Cores'
df.index.name = None
df.to_html()
Cores attr_a attr_b attr_c attr_c
0 1 2 3 4
1 5 6 7 8
2 9 10 11 12
3 13 14 15 16

Matching 'Date' dataframes in Pandas to enable joins/merging

I have two csv files with pandas dataframes with a 'Date' column, which is my desired target to join the two tables (my goal is to join my two csvs by dates and merge matching dataframes by summing them).
The issue is that despite sharing the same month-year format, my first csv abbreviated the years, whereas my desired output would be mm-yyyy (for example, Aug-2012 as opposed to Aug-12).
csv1:
0 Oct-12 1154293
1 Nov-12 885773
2 Dec-12 -448704
3 Jan-13 563679
4 Feb-13 555394
5 Mar-13 631974
6 Apr-13 957395
7 May-13 1104047
8 Jun-13 693464
...
has 41 rows; i.e. 41 months worth of data between Oct. 12 - Feb. 16
csv2:
0 Jan-2009 943690
1 Feb-2009 1062565
2 Mar-2009 210079
3 Apr-2009 -735286
4 May-2009 842933
5 Jun-2009 358691
6 Jul-2009 914953
7 Aug-2009 723427
8 Sep-2009 -837468
...
has 86 rows; i.e. 41 months worth of data between Jan. 2009 - Feb. 2016
I tried initially to do something akin to a 'find and replace' function as one would in Excel.
I tried :
findlist = ['12','13','14','15','16']
replacelist = ['2012','2013','2014','2015','2016']
def findReplace(find, replace):
s = csv1_df.read()
s = s.replace(Date, replacement)
csv1_dfc.write(s)
for item, replacement in zip(findlist, replacelist):
s = s.replace(Date, replacement)
But I am getting a
NameError: name 's' is not defined

You can use to_datetime to transform to datetime format, and then strftime to adjust your format:
df['col_date'] = pd.to_datetime(df['col_date'], format="%b-%y").dt.strftime('%b-%Y')
Input:
col_date val
0 Oct-12 1154293
1 Nov-12 885773
2 Dec-12 -448704
3 Jan-13 563679
4 Feb-13 555394
5 Mar-13 631974
6 Apr-13 957395
7 May-13 1104047
8 Jun-13 693464
Output:
col_date val
0 Oct-2012 1154293
1 Nov-2012 885773
2 Dec-2012 -448704
3 Jan-2013 563679
4 Feb-2013 555394
5 Mar-2013 631974
6 Apr-2013 957395
7 May-2013 1104047
8 Jun-2013 693464
Note the lower case y for 2 digits year and upper case Y for 4 digits year.

Supress printing of pandas output to console

I am using topic_.set_value(each_topic, word, prob) to change the value of cells in a pandas dataframe. Basically, I initialized a numpy array with a certain shape and converted it to a pandas dataframe. I am then replacing these zeros by iterating over all the columns and rows using the code above. The problem is that the number of cells are around 50,000 and every time I set the value pandas prints the array to the console. I want to suppress this behavior. Any ideas?
EDIT
I have two dataframes one is topic_ which is the target dataframe and tw which is the source dataframe. The topic_ is a topic by word matrix, where each cell stores the probability of a word occurring in a particular topic. I have initialized the topic_ dataframe to zero using numpy.zeros. A sample of the tw dataframe-
print(tw)
topic_id word_prob_pair
0 0 [(customer, 0.061703717964), (team, 0.01724444...
1 1 [(team, 0.0260560163563), (customer, 0.0247838...
2 2 [(customer, 0.0171786268847), (footfall, 0.012...
3 3 [(team, 0.0290787264225), (product, 0.01570401...
4 4 [(team, 0.0197917953222), (data, 0.01343226630...
5 5 [(customer, 0.0263740639141), (team, 0.0251677...
6 6 [(customer, 0.0289764173735), (team, 0.0249938...
7 7 [(client, 0.0265082412402), (want, 0.016477447...
8 8 [(customer, 0.0524006965405), (team, 0.0322975...
9 9 [(generic, 0.0373422774996), (product, 0.01834...
10 10 [(customer, 0.0305256248248), (team, 0.0241559...
11 11 [(customer, 0.0198707090364), (ad, 0.018516805...
12 12 [(team, 0.0159852971954), (customer, 0.0124540...
13 13 [(team, 0.033444510469), (store, 0.01961003290...
14 14 [(team, 0.0344793243818), (customer, 0.0210975...
15 15 [(team, 0.026416114692), (customer, 0.02041691...
16 16 [(campaign, 0.0486186973667), (team, 0.0236024...
17 17 [(customer, 0.0208270072145), (branch, 0.01757...
18 18 [(team, 0.0280889397541), (customer, 0.0127932...
19 19 [(team, 0.0297011415217), (customer, 0.0216007...
My topic_ dataframe is of the size of num_topics(which is 20) by number_of_unique_words (in the tw dataframe)
Following is the code I am using to replace each value in the topic_ dataframe
for each_topic in range(num_topics):
a = tw['word_prob_pair'].iloc[each_topic]
for word, prob in a:
topic_.set_value(each_topic, word, prob)

just redirect the output into variable:
>>> df.set_value(index=1,col=0,value=1)
0 1
0 0.621660 -0.400869
1 1.000000 1.585177
2 0.962754 1.725027
3 0.773112 -1.251182
4 -1.688159 2.372140
5 -0.203582 0.884673
6 -0.618678 -0.850109
>>> a=df.set_value(index=1,col=0,value=1)
>>>
To init df it's better to use this:
pd.DataFrame(np.zeros_like(pd_n), index=pd_n.index, columns=pd_n.columns)

If you do not wish to create a variable ('a' in the suggestion above) then use python's throwaway variable '_'. So your statement becomes :
_ = df.set_value(index=1,col=0,value=1)

Grouping by unique values in python pandas dataframe

I have a datafame that goes like this
id rev committer_id
date
1996-07-03 08:18:15 1 76620 1
1996-07-03 08:18:15 2 76621 2
1996-11-18 20:51:08 3 76987 3
1996-11-21 09:12:53 4 76995 2
1996-11-21 09:16:33 5 76997 2
1996-11-21 09:39:27 6 76999 2
1996-11-21 09:53:37 7 77003 2
1996-11-21 10:11:35 8 77006 2
1996-11-21 10:17:50 9 77008 2
1996-11-21 10:23:58 10 77010 2
1996-11-21 10:32:58 11 77012 2
1996-11-21 10:55:51 12 77014 2
I would like to group by quarterly periods and then show number of unique entries in the committer_id column. Columns id and rev are actually not used for the moment.
I would like to have a result as the following
committer_id
date
1996-09-30 2
1996-12-31 91
1997-03-31 56
1997-06-30 154
1997-09-30 84
The actual results are aggregated by number of entries in each time period and not by unique entries. I am using the following :
df[['committer_id']].groupby(pd.Grouper(freq='Q-DEC')).aggregate(np.size)
Can't figure how to use np.unique.
Any ideas, please.
Best,
--

df[['committer_id']].groupby(pd.Grouper(freq='Q-DEC')).aggregate(pd.Series.nunique)
Should work for you. Or df.groupby(pd.Grouper(freq='Q-DEC'))['committer_id'].nunique()
Your try with np.unique didn't work because that returns an array of unique items. The result for agg must be a scalar. So .aggregate(lambda x: len(np.unique(x)) probably would work too.

We Keep Coding

Python is a programming language that lets you work quickly and integrate systems more effectively.

Python replace all values in dataframe with values from other dataframe - python

Related

group rows based on a string in a column in pandas and count the number of occurrence of unique rows that contained the string

HTML table in pandas with single header row

Matching 'Date' dataframes in Pandas to enable joins/merging

Supress printing of pandas output to console

Grouping by unique values in python pandas dataframe

Categories

Resources