I have a dataset with a few columns. I would like to slice the data frame by finding the string "M22" in the column "Run number", which I am able to do. However, I would also like to count the number of unique values that contain the string "M22".
Here is what I have done for the below table (example):
RUN_NUMBER DATE_TIME CULTURE_DAY AGE_HRS AGE_DAYS
335991M 6/30/2022 0 0 0
M220621 7/1/2022 1 24 1
M220678 7/2/2022 2 48 2
510091M 7/3/2022 3 72 3
M220500 7/4/2022 4 96 4
335991M 7/5/2022 5 120 5
M220621 7/6/2022 6 144 6
M220678 7/7/2022 7 168 7
335991M 7/8/2022 8 192 8
M220621 7/9/2022 9 216 9
M220678 7/10/2022 10 240 10
Here are the results I got:
RUN_NUMBER
335991M 0
510091M 0
335992M 0
M220621 3
M220678 3
M220500 1
Now I need to count the unique strings that contain "M22", so I need to get 3 as the output.
Use the following approach with the pd.Series.unique function:
df[df['RUN_NUMBER'].str.contains("M22")]['RUN_NUMBER'].unique().size
Or a faster alternative using the numpy.char.find function:
(np.char.find(df['RUN_NUMBER'].unique().astype(str), 'M22') != -1).sum()
3
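For reference, here is a minimal, self-contained sketch of the same idea applied to the example table. It uses Series.nunique instead of unique().size, and regex=False because "M22" is a plain substring; both are choices made for this sketch rather than part of the answer above.

```python
import pandas as pd

# RUN_NUMBER column rebuilt from the example table in the question
df = pd.DataFrame({
    "RUN_NUMBER": [
        "335991M", "M220621", "M220678", "510091M", "M220500", "335991M",
        "M220621", "M220678", "335991M", "M220621", "M220678",
    ]
})

# Boolean mask of rows whose run number contains the literal text "M22"
mask = df["RUN_NUMBER"].str.contains("M22", regex=False)

# Count the distinct run numbers among the matching rows
n_unique = df.loc[mask, "RUN_NUMBER"].nunique()
print(n_unique)
```

Using the mask with .loc avoids chained indexing, and nunique returns a plain integer directly.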
I have the following dataframe:
ID mutex add atomic add cas add ys_add blocking ticket queued fifo
Cores
1 21.0 7.1 12.1 9.8 32.2 44.6
2 121.8 40.0 119.2 928.7 7329.9 7460.1
3 160.5 81.5 227.9 1640.9 14371.8 11802.1
4 188.9 115.7 347.6 1945.1 29130.5 15660.1
There is both a column index (ID) and a row index (Cores). When I use DataFrame.to_html(), I get a table like this:
Instead, I'd like a table with a single header row, composed of all the column names (but without the column index name ID) and with the row index name Cores in that same header row, like so:
I'm open to manipulating the dataframe prior to the to_html() call, or adding parameters to the to_html() call, but not messing around with the generated html.
Initial setup:
import numpy as np
import pandas as pd
df = pd.DataFrame([[1,2,3,4],[5,6,7,8],[9,10,11,12],[13,14,15,16]],
columns = ['attr_a', 'attr_b', 'attr_c', 'attr_c'])
df.columns.name = 'ID'
df.index.name = 'Cores'
df
ID attr_a attr_b attr_c attr_c
Cores
0 1 2 3 4
1 5 6 7 8
2 9 10 11 12
3 13 14 15 16
Then set columns.name to 'Cores', and index.name to None. df.to_html() should then give you the output you want.
df.columns.name='Cores'
df.index.name = None
df.to_html()
Cores attr_a attr_b attr_c attr_c
0 1 2 3 4
1 5 6 7 8
2 9 10 11 12
3 13 14 15 16
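As a rough check of this approach, the sketch below rebuilds a small part of the table from the question (only the first two columns and rows, as I read them) and counts the header rows in the generated HTML. The variable names flat, html and header are made up for this example.

```python
import pandas as pd

# Small excerpt of the question's table: row index "Cores", column index "ID"
df = pd.DataFrame(
    {"mutex add": [21.0, 121.8], "atomic add": [7.1, 40.0]},
    index=pd.Index([1, 2], name="Cores"),
)
df.columns.name = "ID"

# Work on a copy so the original frame keeps its index names
flat = df.copy()
flat.columns.name = flat.index.name  # "Cores" takes the place of "ID"
flat.index.name = None               # no index name, so no second header row

html = flat.to_html()

# Everything before </thead> is the table header
header = html.split("</thead>")[0]
print(header.count("<tr"))
```

With the index name cleared, the header should contain a single row whose first cell is the columns name.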
I have two csv files loaded as pandas dataframes, each with a 'Date' column that I want to use as the key to join the two tables (my goal is to join my two csvs by date and merge matching rows by summing them).
The issue is that despite sharing the same month-year format, my first csv abbreviated the years, whereas my desired output would be mm-yyyy (for example, Aug-2012 as opposed to Aug-12).
csv1:
0 Oct-12 1154293
1 Nov-12 885773
2 Dec-12 -448704
3 Jan-13 563679
4 Feb-13 555394
5 Mar-13 631974
6 Apr-13 957395
7 May-13 1104047
8 Jun-13 693464
...
has 41 rows; i.e. 41 months' worth of data between Oct. 2012 and Feb. 2016
csv2:
0 Jan-2009 943690
1 Feb-2009 1062565
2 Mar-2009 210079
3 Apr-2009 -735286
4 May-2009 842933
5 Jun-2009 358691
6 Jul-2009 914953
7 Aug-2009 723427
8 Sep-2009 -837468
...
has 86 rows; i.e. 86 months' worth of data between Jan. 2009 and Feb. 2016
I tried initially to do something akin to a 'find and replace' function as one would in Excel.
I tried :
findlist = ['12','13','14','15','16']
replacelist = ['2012','2013','2014','2015','2016']
def findReplace(find, replace):
s = csv1_df.read()
s = s.replace(Date, replacement)
csv1_dfc.write(s)
for item, replacement in zip(findlist, replacelist):
s = s.replace(Date, replacement)
But I am getting a
NameError: name 's' is not defined
You can use to_datetime to transform to datetime format, and then strftime to adjust your format:
df['col_date'] = pd.to_datetime(df['col_date'], format="%b-%y").dt.strftime('%b-%Y')
Input:
col_date val
0 Oct-12 1154293
1 Nov-12 885773
2 Dec-12 -448704
3 Jan-13 563679
4 Feb-13 555394
5 Mar-13 631974
6 Apr-13 957395
7 May-13 1104047
8 Jun-13 693464
Output:
col_date val
0 Oct-2012 1154293
1 Nov-2012 885773
2 Dec-2012 -448704
3 Jan-2013 563679
4 Feb-2013 555394
5 Mar-2013 631974
6 Apr-2013 957395
7 May-2013 1104047
8 Jun-2013 693464
Note the lowercase %y for a 2-digit year and the uppercase %Y for a 4-digit year.
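To connect this back to the stated goal of joining by date and summing, here is a hedged sketch. The column names Date and Value are assumptions, and the csv2 values below are made-up placeholders, since the matching csv2 rows are not shown in the question. It also assumes an English locale, because %b parses and prints locale-dependent month abbreviations.

```python
import pandas as pd

# First rows of csv1 from the question; "Date" and "Value" are assumed names
csv1_df = pd.DataFrame({
    "Date": ["Oct-12", "Nov-12", "Dec-12"],
    "Value": [1154293, 885773, -448704],
})

# Placeholder rows standing in for csv2 (values are invented for illustration)
csv2_df = pd.DataFrame({
    "Date": ["Oct-2012", "Nov-2012"],
    "Value": [10, 20],
})

# Normalize csv1 to the four-digit-year format used by csv2
csv1_df["Date"] = (
    pd.to_datetime(csv1_df["Date"], format="%b-%y").dt.strftime("%b-%Y")
)

# Stack both frames and sum the values that share a date
combined = (
    pd.concat([csv1_df, csv2_df])
    .groupby("Date", sort=False)["Value"]
    .sum()
)
print(combined)
```

Dates present in only one file keep their own value; dates present in both are summed.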
I am using topic_.set_value(each_topic, word, prob) to change the value of cells in a pandas dataframe. Basically, I initialized a numpy array of zeros with a certain shape and converted it to a pandas dataframe. I am then replacing these zeros by iterating over all the columns and rows using the code above. The problem is that there are around 50,000 cells, and every time I set a value pandas prints the array to the console. I want to suppress this behavior. Any ideas?
EDIT
I have two dataframes: topic_, which is the target dataframe, and tw, which is the source dataframe. topic_ is a topic-by-word matrix, where each cell stores the probability of a word occurring in a particular topic. I have initialized the topic_ dataframe to zero using numpy.zeros. A sample of the tw dataframe:
print(tw)
topic_id word_prob_pair
0 0 [(customer, 0.061703717964), (team, 0.01724444...
1 1 [(team, 0.0260560163563), (customer, 0.0247838...
2 2 [(customer, 0.0171786268847), (footfall, 0.012...
3 3 [(team, 0.0290787264225), (product, 0.01570401...
4 4 [(team, 0.0197917953222), (data, 0.01343226630...
5 5 [(customer, 0.0263740639141), (team, 0.0251677...
6 6 [(customer, 0.0289764173735), (team, 0.0249938...
7 7 [(client, 0.0265082412402), (want, 0.016477447...
8 8 [(customer, 0.0524006965405), (team, 0.0322975...
9 9 [(generic, 0.0373422774996), (product, 0.01834...
10 10 [(customer, 0.0305256248248), (team, 0.0241559...
11 11 [(customer, 0.0198707090364), (ad, 0.018516805...
12 12 [(team, 0.0159852971954), (customer, 0.0124540...
13 13 [(team, 0.033444510469), (store, 0.01961003290...
14 14 [(team, 0.0344793243818), (customer, 0.0210975...
15 15 [(team, 0.026416114692), (customer, 0.02041691...
16 16 [(campaign, 0.0486186973667), (team, 0.0236024...
17 17 [(customer, 0.0208270072145), (branch, 0.01757...
18 18 [(team, 0.0280889397541), (customer, 0.0127932...
19 19 [(team, 0.0297011415217), (customer, 0.0216007...
My topic_ dataframe has the shape num_topics (which is 20) by number_of_unique_words (in the tw dataframe).
The following is the code I am using to replace each value in the topic_ dataframe:
for each_topic in range(num_topics):
a = tw['word_prob_pair'].iloc[each_topic]
for word, prob in a:
topic_.set_value(each_topic, word, prob)
Just assign the return value to a variable:
>>> df.set_value(index=1,col=0,value=1)
0 1
0 0.621660 -0.400869
1 1.000000 1.585177
2 0.962754 1.725027
3 0.773112 -1.251182
4 -1.688159 2.372140
5 -0.203582 0.884673
6 -0.618678 -0.850109
>>> a=df.set_value(index=1,col=0,value=1)
>>>
To initialize the dataframe, it's better to use this:
pd.DataFrame(np.zeros_like(pd_n), index=pd_n.index, columns=pd_n.columns)
If you do not wish to create a variable ('a' in the suggestion above) then use python's throwaway variable '_'. So your statement becomes :
_ = df.set_value(index=1,col=0,value=1)
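Note that DataFrame.set_value was deprecated and then removed in pandas 1.0, so the calls above only run on older versions. A minimal sketch of the same fill loop with the .at accessor follows; assignment through .at is a statement, so nothing is echoed to the console. The sample keeps only the first (word, probability) pair of the first two topics from the question, because the remaining pairs are truncated there, and the way the word list is built is an assumption of this sketch.

```python
import numpy as np
import pandas as pd

# Trimmed sample of tw: only the first pair of topics 0 and 1 is complete
# in the question, so only those are reproduced here
tw = pd.DataFrame({
    "topic_id": [0, 1],
    "word_prob_pair": [
        [("customer", 0.061703717964)],
        [("team", 0.0260560163563)],
    ],
})

# Assumed setup: one column per unique word, one row per topic, all zeros
words = sorted({word for pairs in tw["word_prob_pair"] for word, _ in pairs})
topic_ = pd.DataFrame(np.zeros((len(tw), len(words))), columns=words)

# Fill in the probabilities; .at sets a single cell by row and column label
for each_topic, pairs in enumerate(tw["word_prob_pair"]):
    for word, prob in pairs:
        topic_.at[each_topic, word] = prob

print(topic_)
```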
I have a dataframe that looks like this:
id rev committer_id
date
1996-07-03 08:18:15 1 76620 1
1996-07-03 08:18:15 2 76621 2
1996-11-18 20:51:08 3 76987 3
1996-11-21 09:12:53 4 76995 2
1996-11-21 09:16:33 5 76997 2
1996-11-21 09:39:27 6 76999 2
1996-11-21 09:53:37 7 77003 2
1996-11-21 10:11:35 8 77006 2
1996-11-21 10:17:50 9 77008 2
1996-11-21 10:23:58 10 77010 2
1996-11-21 10:32:58 11 77012 2
1996-11-21 10:55:51 12 77014 2
I would like to group by quarterly periods and then show the number of unique entries in the committer_id column. The id and rev columns are not used for the moment.
I would like to get a result like the following:
committer_id
date
1996-09-30 2
1996-12-31 91
1997-03-31 56
1997-06-30 154
1997-09-30 84
The actual results are aggregated by number of entries in each time period and not by unique entries. I am using the following :
df[['committer_id']].groupby(pd.Grouper(freq='Q-DEC')).aggregate(np.size)
I can't figure out how to use np.unique.
Any ideas, please?
df[['committer_id']].groupby(pd.Grouper(freq='Q-DEC')).aggregate(pd.Series.nunique)
should work for you. Alternatively: df.groupby(pd.Grouper(freq='Q-DEC'))['committer_id'].nunique()
Your attempt with np.unique didn't work because it returns an array of unique items, whereas the result for agg must be a scalar. So .aggregate(lambda x: len(np.unique(x))) would probably work too.
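Here is a small sketch that contrasts the two aggregations on the first few rows from the question. It groups on df.index.to_period("Q") rather than pd.Grouper(freq='Q-DEC'); that is a substitution made for this sketch because the spelling of quarterly frequency aliases differs between pandas versions, and it yields a quarterly PeriodIndex instead of quarter-end timestamps.

```python
import pandas as pd

# First five rows of the question's data (only committer_id is needed)
df = pd.DataFrame(
    {"committer_id": [1, 2, 3, 2, 2]},
    index=pd.to_datetime([
        "1996-07-03 08:18:15",
        "1996-07-03 08:18:15",
        "1996-11-18 20:51:08",
        "1996-11-21 09:12:53",
        "1996-11-21 09:16:33",
    ]),
)
df.index.name = "date"

# Calendar quarter of each row, used as the grouping key
quarters = df.index.to_period("Q")

# Number of distinct committers per quarter (what the question asks for)
unique_per_quarter = df.groupby(quarters)["committer_id"].nunique()

# Number of entries per quarter (what np.size was computing)
entries_per_quarter = df.groupby(quarters)["committer_id"].size()

print(unique_per_quarter)
print(entries_per_quarter)
```

In the fourth quarter of this sample there are three entries but only two distinct committers, which is the difference between size and nunique.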