Looking up a row in a dataframe from a different dataframe - python

I have two data frames df and ctr.
df contains a column position a value between 1 and 100, and a column Average monthly searches which contains an integer.
position | Average monthly searches
1 | 250
2 | 10
3 | 30
2 | 40
4 | 100
ctr contains a column position a value between 1 and 100, and a column Decay Ctr which is a percentage that reflects the decay at each position.
Position | Decay Ctr
1 | 27.18%
2 | 18.27%
3 | 12.66%
4 | 9.13%
5 | 6.90%
What I want to do is for each row in df lookup that position in ctr and times Average monthly searches by the correct Decay Ctr.
with open("C:\Environments\ENV\Export npower Report Rebuild KWs.csv",newline='') as csvfile:
with open("C:\Environments\ENV\ctr_csv.csv", newline='') as ctrfile:
ctr = pd.read_csv(ctrfile)
df = pd.read_csv(csvfile)
But I am unsure how to extract the correct elements to put it into a new column visibility in df. I tried using an apply statement but was unsure how to reference ctr correctly.
df['visibility'] = df.apply(numpy.multiply(df['Average monthly searches'] , ctr[ctr[]] ), axis = 0 )

You can just merge the two frames and then create a new column.
comb_df = df.merge(ctr, left_on = 'Position', right_on = 'Position')
comb_df['visibility'] = comb_df['Avg Monthly searches'] *
comb_df['Decay']

I am unsure of what you are asking but if I understand it correctly, what you want to do it actually merge the two data frames and then you can perform whatever operations you like
First way
df1 = pd.merge(df,ctr, on='Position' index=False)
#Then you can multiply both columns however you like
df1['Visibility'] = df1.apply(numpy.multiply(df['Average monthly searches'] , df['Decay Ctr']), axis=1)
Second Way
df1 = pd.merge(df,ctr,on='Position',index=False)
def multiply(x):
for index, row in df1.iterrows():
row['Visibility'] = row['Average monthly searches'] * row['Decay Ctr']
return row['Visibility']
df1 =df1.apply(multiply, axis=1)

Related

Update each row value with constant value plus previous value using pandas

I have a data frame having 4 columns, 1st column is equal to the counter which has values in hexadecimal.
Data
counter frequency resistance phase
0 15000.000000 698.617126 -0.745298
1 16000.000000 647.001708 -0.269421
2 17000.000000 649.572265 -0.097540
3 18000.000000 665.282775 0.008724
4 19000.000000 690.836975 -0.011101
5 20000.000000 698.051025 -0.093241
6 21000.000000 737.854003 -0.182556
7 22000.000000 648.586792 -0.125149
8 23000.000000 643.014160 -0.172503
9 24000.000000 634.954223 -0.126519
a 25000.000000 631.901733 -0.122870
b 26000.000000 629.401123 -0.123728
c 27000.000000 629.442016 -0.156490
Expected output
| counter | sampling frequency | time. |
| --------| ------------------ |---------|
| 0 | - |t0=0 |
| 1 | 1 |t1=t0+sf |
| 2 | 1 |t2=t1+sf |
| 3 | 1 |t3=t2+sf |
The time column is the new column added to the original data frame. I want to plot time in the x-axis and frequency, resistance, and phase in y-axis.
Because in order to calculate the value of any row you need to calculate the value of the previous row before, you may have to use a for loop for this problem.
For a constant frequency, you could just calculate it in advance, no need to operate in the datafame:
sampling_freq = 1
df['time'] = [sampling_freq * i for i in range(len(df))]
If you need to operate in the dataframe (let's say the frequency may change at some point), in order to call each cell based on row number and column name, you can this suggestion. Syntax would be a lot easier using both numbers for row and column, but I prefer to refer to 'time' instead of 2.
df['time'] = np.zeros(len(df))
for i in range(1, len(df)):
df.iloc[i, df.columns.get_loc('time')] = df.iloc[i-1, df.columns.get_loc('time')] + df.iloc[i, df.columns.get_loc('sampling frequency')]
Or, alternatively, resetting the index so you can iterate through consecutive numbers:
df['time'] = np.zeros(len(df))
df = df.reset_index()
for i in range(1, len(df)):
df.loc[i, 'time'] = df.loc[i-1, 'time'] + df.loc[i, 'sampling frequency']
df = df.set_index('counter')
Note that, because your sampling frequency is likely constant in the whole experiment, you could simplify it like:
sampling_freq = 1
df['time'] = np.zeros(len(df))
for i in range(1,len(df)):
df.iloc[i, df.columns.get_loc('time')] = df.iloc[i-1, df.columns.get_loc('time')] + sampling_freq
But it's not going to be better than just create the time series as in the first example.

How to compare 2 data frames in python and highlight the differences?

I am trying to compare 2 files one is in xls and other is in csv format.
File1.xlsx (not actual data)
Title Flag Price total ...more columns
0 A Y 12 300
1 B N 15 700
2 C N 18 1000
..
..
more rows
File2.csv (not actual data)
Title Flag Price total ...more columns
0 E Y 7 234
1 B N 16 600
2 A Y 12 300
3 C N 17 1000
..
..
more rows
I used Pandas and moved those files to data frame. There is no unique columns(to make id) in the files and there are 700K records to compare. I need to compare File 1 with File 2 and show the differences. I have tried few things but I am not getting the outliers as expected.
If I use merge function as below, I am getting output with the values only for File 1.
diff_df = df1.merge(df2, how = 'outer' ,indicator=True).query('_merge == "left_only"').drop(columns='_merge')
output I am getting
Title Attention_Needed Price total
1 B N 15 700
2 C N 18 1000
This output is not showing the correct diff as record with Title 'E' is missing
I also tried using panda merge
diff_df = pd.merge(df1, df2, how='outer', indicator='Exist')
& output for above was
Title Flag Price total Exist
0 A Y 12 300 both
1 B N 15 700 left_only
2 C N 18 1000 left_only
3 E Y 7 234 right_only
4 B N 16 600 right_only
5 C N 17 1000 right_only
Problem with above output is it is showing records from both the data frames and it will be very difficult if there are 1000 of records in each data frame.
Output I am looking for (for differences) by adding extra column("Comments") and give message as matching, exact difference, new etc. or on the similar lines
Title Flag Price total Comments
0 A Y 12 300 matching
1 B N 15 700 Price, total different
2 C N 18 1000 Price different
3 E Y 7 234 New record
If above output can not be possible, then please suggest if there is any other way to solve this.
PS: This is my first question here, so please let me know if you need more details here.
Rows in DF1 Which Are Not Available in DF2
df = df1.merge(df2, how = 'outer' ,indicator=True).loc[lambda x : x['_merge']=='left_only']
Rows in DF2 Which Are Not Available in DF1
df = df1.merge(df2, how = 'outer' ,indicator=True).loc[lambda x : x['_merge']=='right_only']
If you're differentiating by row not column
pd.concat([df1,df2]).drop_duplicates(keep=False)
If each df has the same columns and each column should be compared individually
for col in data.columns:
set(df1.col).symmetric_difference(df2.col)
# WARNING: this way of getting column diffs likely won't keep row order
# new row order will be [unique_elements_from_df1_REVERSED] concat [unique_elements_from_df2_REVERSED]
lets assume df1 (left) is our "source of truth" for what's considered an original record.
after running
diff_df = df1.merge(df2, how = 'outer' ,indicator=True).query('_merge == "left_only"').drop(columns='_merge')
take the output and split it into 2 df's.
df1 = diff_df[diff_df["Exist"] in ["both", "left_only"]]
df2 = diff_df[diff_df["Exist"] == "right_only"]
Right now, if you drop the "exist" row from df1, you'll have records where the comment would be "matching".
Let's assume you add the 'comments' column to df1
you could say that everything in df2 is a new record, but that would disregard the "price/total different".
If you really want the difference comment, now is a tricky bit where the 'how' really depends on what order columns matter most (title > flag > ...) and how much they matter (weighting system)
After you have a wighting system determined, you need a 'scoring' method that will compare two rows in order to see how similar they are based on the column ranking you determine.
# distributes weight so first is heaviest, last is lightest, total weight = 100
# if i was good i'd do this with numpy not manually
def getWeights(l):
weights = [0 for col in l]
total = 100
while total > 0:
for i, e in enumerate(l):
for j in range(i+1):
weights[j] += 1
total -= 1
return weights
def scoreRows(row1, row2):
s = 0
for i, colName in enumerate(colRank):
if row1[colName] == row2[colName]:
s += weights[i]
colRank = ['title', 'flag']
weights = getWeights(colRank)
Let's say only these 2 matter and the rest are considered 'modifications' to an original row
That is to say, if a row in df2 doesn't have a matching title OR flag for ANY row in df1, that row is a new record
What makes a row a new record is completely up to you.
Another way of thinking about it is that you need to determine what makes some row in df2 'differ' from some row in df1 and not a different row in df1
if you have 2 rows in df1
row1: [1, 2, 3, 4]
row2: [1, 6, 3, 7]
and you want to compare this row against that df
[1, 6, 5, 4]
this row has the same first element as both, the same second element as row2, and the same 4th element of row1.
so which row does it differ from?
if this is a question you aren't sure how to answer, consider cutting losses and just keep df1 as "good" records and df2 as "new" records
if you're sticking with the 'differs' comment, our next step is to filter out truly new records from records that have slight differences by building a score table
# to recap
# df1 has "both" and "left_only" records ("matching" comment)
# df2 has "right_only" records (new records and differing records)
rowScores = []
# list of lists
# each inner list index correlates to the index for df2
# inner lists are
# made up of tuples
# each tuple first element is the actual row from df1 that is matched
# second element is the score for matching (out of 100)
for i, row1 in df2.itterrows():
thisRowsScores = []
#df2 first because they are what we are scoring
for j, row2 in df1.iterrows():
s = scoreRows(row1, row2)
if s>0: # only save rows and scores that matter
thisRowsScores.append((row2, s))
# at this point, you can either leave the scoring as a table and have comments refer how different differences relate back to some row
# or you can just keep the best score like i'll be doing
#sort by score
sortedRowScores = thisRowsScores.sort(key=lambda x: x[1], reverse=True)
rowScores.append(sortedRowScores[0])
# appends empty list if no good matches found in df1
# alternatively, remove 'reversed' from above and index at -1
The reason we save the row itself is so that it can be indexed by df1 in order to add a "differ" comments
At this point, lets just say that df1 already has the comments "matching" added to it
Now that each row in df2 has a score and reference to the row it matched best in df1, we can edit the comment to that row in df1 to list the columns with different values.
But at this point, I feel as though that df now needs a reference back to df2 so that the record and values those difference refer to are actually gettable.

python / pandas: How to count each cluster of unevenly distributed distinct values in each row

I am transitioning from excel to python and finding the process a little daunting. I have a pandas dataframe and cannot find how to count the total of each cluster of '1's' per row and group by each ID (example data below).
ID 20-21 19-20 18-19 17-18 16-17 15-16 14-15 13-14 12-13 11-12
0 335344 0 0 1 1 1 0 0 0 0 0
1 358213 1 1 0 1 1 1 1 0 1 0
2 358249 0 0 0 0 0 0 0 0 0 0
3 365663 0 0 0 1 1 1 1 1 0 0
The result of the above in the format
ID
LastColumn Heading a '1' occurs: count of '1's' in that cluster
would be:
335344
16-17: 3
358213
19-20: 2
14-15: 4
12-13: 1
365663
13-14: 5
There are more than 11,000 rows of data I would like to output the result to a txt file. I have been unable to find any examples of how the same values are clustered by row, with a count for each cluster, but I am probably not using the correct python terminology. I would be grateful if someone could point me in the right direction. Thanks in advance.
First step is use DataFrame.set_index with DataFrame.stack for reshape. Then create consecutive groups by compare for not equal Series.shifted values with cumulative sum by Series.cumsum to new column g. Then filter rows with only 1 and aggregate by named aggregation by GroupBy.agg with GroupBy.last and GroupBy.size:
df = df.set_index('ID').stack().reset_index(name='value')
df['g'] = df['value'].ne(df['value'].shift()).cumsum()
df1 = (df[df['value'].eq(1)].groupby(['ID', 'g'])
.agg(a=('level_1','last'), b=('level_1','size'))
.reset_index(level=1, drop=True)
.reset_index())
print (df1)
ID a b
0 335344 16-17 3
1 358213 19-20 2
2 358213 14-15 4
3 358213 12-13 1
4 365663 13-14 5
Last for write to txt use DataFrame.to_csv:
df1.to_csv('file.txt', index=False)
If need your custom format in text file use:
with open("file.txt","w") as f:
for i, g in df1.groupby('ID'):
f.write(f"{i}\n")
for a, b in g[['a','b']].to_numpy():
f.write(f"\t{a}: {b}\n")
You just need to use the sum method and then specify which axis you would like to sum on. To get the sum of each row, create a new series equal to the sum of the row.
# create new series equal to sum of values in the index row
df['sum'] = df.sum(axis=1) # specifies index (row) axis
The best method for getting the sum of each column is dependent on how you want to use that information but in general the core is just to use the sum method on the series and assign it to a variable.
# sum a column and assign result to variable
foo = df['20-21'].sum() # default axis=0
bar = df['16-17'].sum() # default axis=0
print(foo) # returns 1
print(bar) # returns 3
You can get the sum of each column using a for loop and add them to a dictionary. Here is a quick function I put together that should get the sum of each column and return a dictionary of the results so you know which total belongs to which column. The two inputs are 1) the dataframe 2) a list of any column names you would like to ignore
def get_df_col_sum(frame: pd.DataFrame, ignore: list) -> dict:
"""Get the sum of each column in a dataframe in a dictionary"""
# get list of headers in dataframe
dfcols = frame.columns.tolist()
# create a blank dictionary to store results
dfsums = {}
# loop through each column and append sum to list
for dfcol in dfcols:
if dfcol not in ignore:
dfsums.update({dfcol: frame[dfcol].sum()})
return dfsums
I then ran the following code
# read excel to dataframe
df = pd.read_excel(test_file)
# ignore the ID column
ignore_list = ['ID']
# get sum for each column
res_dict = get_df_col_sum(df, ignore_list)
print(res_dict)
and got the following result.
{'20-21': 1, '19-20': 1, '18-19': 1, '17-18': 3, '16-17': 3, '15-16':
2, '14-15': 2, '13-14': 1, '12-13': 1, '11-12': 0}
Sources: Sum by row, Pandas Sum, Add pairs to dictionary

How can I keep all columns in a dataframe, plus add groupby, and sum?

I have a data frame with 5 fields. I want to copy 2 fields from this into a new data frame. This works fine. df1 = df[['task_id','duration']]
Now in this df1, when I try to group by task_id and sum duration, the task_id field drops off.
Before (what I have now).
After (what I'm trying to achieve).
So, for instance, I'm trying this:
df1['total'] = df1.groupby(['task_id'])['duration'].sum()
The result is:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead
See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
I don't know why I can't just sum the values in a column and group by unique IDs in another column. Basically, all I want to do is preserve the original two columns (['task_id', 'duration']), sum duration, and calculate a percentage of duration in a new column named pct. This seems like a very simple thing but I can't get anything working. How can I get this straightened out?
The code will take care of having the columns retained and getting the sum.
df[['task_id', 'duration']].groupby(['task_id', 'duration']).size().reset_index(name='counts')
Setup:
X = np.random.choice([0,1,2], 20)
Y = np.random.uniform(2,10,20)
df = pd.DataFrame({'task_id':X, 'duration':Y})
Calculate pct:
df = pd.merge(df, df.groupby('task_id').agg(sum).reset_index(), on='task_id')
df['pct'] = df['duration_x'].divide(df['duration_y'])*100
df.drop('duration_y', axis=1) # Drops sum duration, remove this line if you want to see it.
Result:
duration_x task_id pct
0 8.751517 0 58.017921
1 6.332645 0 41.982079
2 8.828693 1 9.865355
3 2.611285 1 2.917901
4 5.806709 1 6.488531
5 8.045490 1 8.990189
6 6.285593 1 7.023645
7 7.932952 1 8.864436
8 7.440938 1 8.314650
9 7.272948 1 8.126935
10 9.162262 1 10.238092
11 7.834692 1 8.754639
12 7.989057 1 8.927129
13 3.795571 1 4.241246
14 6.485703 1 7.247252
15 5.858985 2 21.396850
16 9.024650 2 32.957771
17 3.885288 2 14.188966
18 5.794491 2 21.161322
19 2.819049 2 10.295091
disclaimer: All data is randomly generated in setup, however, calculations are straightforward and should be correct for any case.
I finally got everything working in the following way.
# group by and sum durations
df1 = df1.groupby('task_id', as_index=False).agg({'duration': 'sum'})
list(df1)
# find each task_id as relative percentage of whole
df1['pct'] = df1['duration']/(df1['duration'].sum())
df1 = pd.DataFrame(df1)

Merge subgroup into adjacent subgroup after groupby

If we run the following code
np.random.seed(0)
features = ['f1','f2','f3']
df = pd.DataFrame(np.random.rand(5000,4), columns=features+['target'])
for f in features:
df[f] = np.digitize(df[f], bins=[0.13,0.66])
df['target'] = np.digitize(df['target'], bins=[0.5]).astype(float)
df.groupby(features)['target'].agg(['mean','count']).head(9)
We get average values for each grouping of the feature set:
mean count
f1 f2 f3
0 0 0 0.571429 7
1 0.414634 41
2 0.428571 28
1 0 0.490909 55
1 0.467337 199
2 0.486726 113
2 0 0.518519 27
1 0.446281 121
2 0.541667 72
In the table above, some of the groups has too few observations and I want to merge it into 'adjacent' group by some rules. For example, I may want to merge the group [0,0,0] with group [0,0,1] since it has no more than 30 observations. I wonder if there is any good way of operating such group combinations according to columns values without creating a separate dictionary? More specifically, I may want to merge from the smallest count group to its adjacent group (the next group within the index order) until the total number of groups is no more than 10.
A simple way to do it is with a loop for on indexes meeting your condition:
df_group = df.groupby(features)['target'].agg(['mean','count'])
# Fist reset_index to get an easier manipulation
df_group = df_group.reset_index()
list_indexes = df_group[df_group['count'] <=58].index.values # put any value you want
# loop for on list_indexes
for ind in list_indexes:
# check again your condition in case at the previous iteration
# merging the row has increase the count above your cirteria
if df_group['count'].loc[ind] <= 58:
# add the count values to the next row
df_group['count'].loc[ind+1] = df_group['count'].loc[ind+1] + df_group['count'].loc[ind]
# do anything you want on mean
# drop the row
df_group = df_group.drop(axis = 0, index = ind)
# Reindex your df
df_group = df_group.set_index(features)

Categories