Python3, Pandas - New Column Value based on Column To Left Data (Dynamic) - python

I have a spreadsheet with several columns containing survey responses. This spreadsheet will be merged into others and I will then have duplicate rows similar to the ones below. I will then need to take all questions with the same text and calculate the percentages of the answers based on the entirety of the merged document.
Example Excel Data
**Poll Question** **Poll Responses**
The content was clear and effectively delivered  37 Total Votes
Strongly Agree 24.30%
Agree 70.30%
Neutral 2.70%
Disagree 2.70%
Strongly Disagree 0.00%
The Instructor(s) were engaging and motivating  37 Total Votes
Strongly Agree 21.60%
Agree 73.00%
Neutral 2.70%
Disagree 2.70%
Strongly Disagree 0.00%
I would attend another training session delivered by this Instructor(s) 37 Total Votes
Strongly Agree 21.60%
Agree 73.00%
Neutral 5.40%
Disagree 0.00%
Strongly Disagree 0.00%
This was a good format for my training  37 Total Votes
Strongly Agree 24.30%
Agree 62.20%
Neutral 8.10%
Disagree 2.70%
Strongly Disagree 2.70%
Any comments/suggestions about this training course?  5 Total Votes
My method for calculating a non-percent number of votes will be to convert the percentages to a number, e.g. find and extract 37 from 37 Total Votes, then use the following formula to get the number of users that voted for that particular answer: percent * total / 100.
So 24.30 * 37 / 100 = 8.99, which rounds up to 9, meaning 9 out of 37 people voted for "Strongly Agree".
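For illustration, a minimal sketch of that extraction and conversion, using the example values above:
import re

response = "37 Total Votes"
total = int(re.search(r'(\d+)\s+Total\s+Votes', response).group(1))  # -> 37

percent = 24.30
votes = round(percent * total / 100)  # 24.30 * 37 / 100 = 8.991 -> 9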
Here's an example spreadsheet of what I'd like to be able to do:
**Poll Question** **Poll Responses** **non-percent** **subtotal**
... 37 Total Votes 0 37
... 24.30% 9 37
... 70.30% 26 37
... 2.70% 1 37
... 2.70% 1 37
... 0.00% 0 37
(note: non-percent and subtotal would be newly created columns)
Currently I take a folder full of .xls files, loop through it, and save each file to another folder in .xlsx format. Inside that loop I've added a comment block containing my # NEW test CODE, where I'm trying to put the logic to do this.
As you can see, I'm trying to target the cell and get its value, then use a regex to extract the number from it and add it to the subtotal column in that row. I then want to keep adding it until I see a new instance of a row containing x Total Votes.
Here's my current code:
import re
import numpy as np
import pandas as pd

files = get_files('/excels/', '.xls')
df_array = []
for i, f in enumerate(files, start=1):
    sheet = pd.read_html(f, attrs={'class': 'reportData'}, flavor='bs4')
    event_id = get_event_id(pd.read_html(f, attrs={'id': 'eventSummary'}))
    event_title = get_event_title(pd.read_html(f, attrs={'id': 'eventSummary'}))
    filename = event_id + '.xlsx'
    rel_path = 'xlsx/' + filename
    writer = pd.ExcelWriter(rel_path)
    for df in sheet:
        # NEW test CODE
        q_total = 0
        df.columns = df.columns.str.strip()
        if df[df['Poll Responses'].str.contains("Total Votes")]:
            # if df['Poll Responses'].str.contains("Total Votes"):
            q_total = re.findall(r'.+?(?=\sTotal\sVotes)', df['Poll Responses'].str.contains("Total Votes"))[0]
            print(q_total)
        # df['Question Total'] = np.where(df['Poll Responses'].str.contains("Total Votes"), 'yes', 'no')
        # END NEW test Code
        df.insert(0, 'Event ID', event_id)
        df.insert(1, 'Event Title', event_title)
        df.to_excel(writer, 'sheet')
    writer.save()
    # progress of entire list
    if i <= len(files):
        print('\r{:*^10}{:.0f}%'.format('Converting: ', i/len(files)*100), end='')
print('\n')
TL;DR
This seems very convoluted, but if I can get the two new columns that contain the total votes for a question and the number (not percentage) of votes for an answer, then I can do some VLOOKUP magic with them on the merged document. Any help or methodology suggestions would be greatly appreciated. Thanks!

I solved this, I'll post the pseudo code below:
I loop through each sheet. Inside that loop, I loop through each row using for n, row in enumerate(df.itertuples(), 1).
I get the value of the "Poll Responses" field with poll_response = str(row[3]).
Using an if / else I check whether poll_response contains the text "Total Votes". If it does, it must be a question row; otherwise it must be a row with an answer.
In the if branch for a question I get the cells that contain the data I need. I then have a function that compares the question text with the question text of every object in the array. If there's a match, I simply update the fields of that object; otherwise I create a new question object.
In the else branch the row is an answer row, and I use the question text to find the object in the array and update/add the answers or data.
This process loops through all the rows in each spreadsheet, and now I have my array full of unique question objects.
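In Python, a rough sketch of that loop might look like the following; it assumes (as in the pseudo-code) that row[3] holds the Poll Responses value, that the column before it holds the question text or answer label, and that the regex and percentage conversion described earlier are used:
import re

questions = {}        # question text -> {'total': votes, 'answers': {label: votes}}
current_entry = None  # the question object the following answer rows belong to
current_total = 0     # total votes reported by the most recent question row

for n, row in enumerate(df.itertuples(), 1):
    label = str(row[2])          # assumed: question text or answer label
    poll_response = str(row[3])  # assumed: the "Poll Responses" value
    if 'Total Votes' in poll_response:
        # question row: extract the total and create/update the matching question object
        current_total = int(re.search(r'(\d+)\s*Total\s*Votes', poll_response).group(1))
        current_entry = questions.setdefault(label, {'total': 0, 'answers': {}})
        current_entry['total'] += current_total
    elif current_entry is not None and poll_response.endswith('%'):
        # answer row: convert the percentage into a vote count and accumulate it
        votes = round(float(poll_response.rstrip('%')) * current_total / 100)
        current_entry['answers'][label] = current_entry['answers'].get(label, 0) + votes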

Related

Number of concurrent events per username in Pandas

I have a table like the following but with approximately 7 million rows. What I am trying to find out is how many cases each user is working on simultaneously. I would like to group by the username and then get an average count of how many references are open concurrently between the two times.
Reference  starttime                stoptime                 Username
1          2020-07-28 06:41:56.000  2020-07-28 07:11:25.000  Arthur
2          2020-07-18 13:24:02.000  2020-07-18 13:38:42.000  Arthur
3          2020-07-03 09:27:03.000  2020-07-03 10:35:24.000  Arthur
4          2020-07-05 19:42:38.000  2020-07-05 20:07:52.000  Bob
5          2020-07-04 10:22:48.000  2020-07-04 10:24:32.000  Bob
Any ideas?
Someone asked a similar question just yesterday so here it is:
ends = df['starttime'].values < df['stoptime'].values[:, None]
starts = df['starttime'].values > df['starttime'].values[:, None]
same_name = (df['Username'].values == df['Username'].values[:, None])
# check for rows where all three conditions are met
# count the number of matches by summing across axis=1
df['overlap'] = (ends & starts & same_name).sum(1)
df
To answer your final question, for the mean value you would then run:
df['overlap'].mean()
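If it helps to test the snippet, here is a minimal sketch of the sample frame above with the timestamps parsed as datetimes (values taken from the question's table):
import pandas as pd

df = pd.DataFrame({
    'Reference': [1, 2, 3, 4, 5],
    'starttime': pd.to_datetime(['2020-07-28 06:41:56', '2020-07-18 13:24:02',
                                 '2020-07-03 09:27:03', '2020-07-05 19:42:38',
                                 '2020-07-04 10:22:48']),
    'stoptime': pd.to_datetime(['2020-07-28 07:11:25', '2020-07-18 13:38:42',
                                '2020-07-03 10:35:24', '2020-07-05 20:07:52',
                                '2020-07-04 10:24:32']),
    'Username': ['Arthur', 'Arthur', 'Arthur', 'Bob', 'Bob'],
})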
I would use Pandas groupby function as you suggested in your tag already, by username. Let me describe the general workflow below per grouped user:
Collect all start times and stop times as 'moments of change in activities'.
Loop over all of them in your grouped dataframe
Use e.g. Pandas.DataFrame.loc to check how many cases are 'active' at moments of changes.
Save these in a list to compute the average count of cases
I don't have your code, but in pseudo-code it would look something like:
import numpy as np

df = ...  # your raw df
grouped = df.groupby(by='Username')
for user, user_df in grouped:
    active_cases = []
    user_starts_cases = user_df['starttime'].to_numpy()
    user_stops_cases = user_df['stoptime'].to_numpy()
    times_of_activity_changes = np.concatenate((user_starts_cases, user_stops_cases))
    for xs in times_of_activity_changes:
        # count how many of this user's cases are active at this moment of change
        num_activities = len(user_df.loc[(user_df['starttime'] <= xs) & (user_df['stoptime'] >= xs)])  # mind the brackets
        active_cases.append(num_activities)
    print(sum(active_cases) / len(active_cases))
It depends a bit on what you would call 'on average', but with this you could sample the number of active cases at the times of your interest and compute an average.

fuzzy matching for SQL using fuzzy wuzzy and pandas

I have the following table in SQL and want to use FuzzyWuzzy to compare all the records in the table for any potential duplicates; in this instance line 1 is a duplicate of line 2 (or vice versa). Can someone explain how I can add two additional columns to this table (Highest Score and Record Line Num) using FuzzyWuzzy and pandas? Thanks.
Input:
Vendor Doc Date Invoice Date Invoice Ref Num Invoice Amount
ABC 5/12/2019 5/10/2019 ABCDE56. 56
ABC 5/13/2019 5/10/2019 ABCDE56 56
TIM 4/15/2019 4/10/2019 RTET5SDF 100
Desired Output:
Vendor Doc Date Invoice Date Invoice Ref Num Invoice Amount Highest Score Record Line Num
ABC 5/12/2019 5/10/2019 ABCDE56. 56 96 2
ABC 5/13/2019 5/10/2019 ABCDE56 56 96 1
TIM 4/15/2019 4/10/2019 RTET5SDF 100 0 N/A
Since you are looking for duplicates, you should filter your data frame first using the vendor name. This is to ensure it doesn't match with invoices of other vendors and reduce the processing time. However, since you didn't mention anything about it, you can skip it.
Decide on a threshold for duplicates based on the length of your invoice references. For example, if the average is 5 chars, make the threshold 80%. Then, use fuzzywuzzy to get the best match.
from fuzzywuzzy import fuzz, process

# Assuming no NaNs in the invoice references
inv_list = df['Invoice Ref Num'].to_list()
for i, inv in enumerate(inv_list):
    # compare against the other references only, so the invoice doesn't match itself
    others = inv_list[:i] + inv_list[i + 1:]
    result = process.extractOne(inv, others, scorer=fuzz.token_sort_ratio)
    if result[1] >= your_threshold:
        df.loc[i, 'Highest Score'] = result[1]
        df.loc[i, 'Record Line Num'] = inv_list.index(result[0])
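As a usage sketch, the frame below would be built from the sample rows in the question before running the loop above, with the 80% threshold suggested earlier (both are assumptions for illustration):
import pandas as pd

df = pd.DataFrame({
    'Vendor': ['ABC', 'ABC', 'TIM'],
    'Doc Date': ['5/12/2019', '5/13/2019', '4/15/2019'],
    'Invoice Date': ['5/10/2019', '5/10/2019', '4/10/2019'],
    'Invoice Ref Num': ['ABCDE56.', 'ABCDE56', 'RTET5SDF'],
    'Invoice Amount': [56, 56, 100],
})
your_threshold = 80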

pulling a column of data with a set number of rows from multiple text files into one text file

I have several hundred text files. I want to extract a specific column with a set number of rows. The files are exactly the same; the only thing different is the data values. I want to put that data into a new text file, with each new column preceding the previous one.
The file is a .sed, basically the same as a .txt file. This is what it looks like; the file actually goes from Wvl 350-2150.
Comment:
Version: 2.2
File Name: C:\Users\HyLab\Desktop\Curtis
Bernard\PSR+3500_1596061\PSR+3500_1596061\2019_Feb_16\Contact_00186.sed
<Metadata>
Collected By:
Sample Name:
Location:
Description:
Environment:
</Metadata>
Instrument: PSR+3500_SN1596061 [3]
Detectors: 512,256,256
Measurement: REFLECTANCE
Date: 02/16/2019,02/16/2019
Time: 13:07:52.66,13:29:17.00
Temperature (C): 31.29,8.68,-5.71,31.53,8.74,-5.64
Battery Voltage: 7.56,7.20
Averages: 10,10
Integration: 2,2,2,10,8,2
Dark Mode: AUTO,AUTO
Foreoptic: PROBE {DN}, PROBE {DN}
Radiometric Calibration: DN
Units: None
Wavelength Range: 350,2500
Latitude: n/a
Longitude: n/a
Altitude: n/a
GPS Time: n/a
Satellites: n/a
Calibrated Reference Correction File: none
Channels: 2151
Columns [5]:
Data:
Chan.# Wvl Norm. DN (Ref.) Norm. DN (Target) Reflect. %
0 350.0 1.173460E+002 1.509889E+001 13.7935
1 351.0 1.202493E+002 1.529762E+001 13.6399
2 352.0 1.232869E+002 1.547818E+001 13.4636
3 353.0 1.264006E+002 1.563467E+001 13.2665
4 354.0 1.294906E+002 1.578425E+001 13.0723
I've taken some coding classes, but that was a long time ago. I figured this is a pretty straightforward problem for even a novice coder, which I am not, but I can't seem to find anything like this, so I was hoping for help on here.
I honestly don't need anything fancy; just something like this would be amazing, so I don't have to copy and paste each file!
12.3 11.3 etc...
12.3 11.3 etc...
12.3 11.3 etc...
etc.. etc.. etc...
In MATLAB R2016b or later, the easiest way to do this would be using readtable:
t = readtable('file.sed', delimitedTextImportOptions( ...
'NumVariables', 5, 'DataLines', 36, ...
'Delimiter', ' ', 'ConsecutiveDelimitersRule', 'join'));
where
file.sed is the name of the file
'NumVariables', 5 means there are 5 columns of data
'DataLines', 36 means the data starts on the 36th line and continues to the end of the file
'Delimiter', ' ' means the character that separates the columns is a space
'ConsecutiveDelimitersRule', 'join' means treat more than one space as if they were just one (rather than as if they separate empty columns of data).
This assumes that the example file you've posted is in the exact format of your real data. If it's different you may have to modify the parameters above, possibly with reference to the help for delimitedTextImportOptions or as an alternative, fixedWidthImportOptions.
Now you have a MATLAB table t with five columns, of which column 2 is the wavelengths and column 5 is the reflectances - I assume that's the one you want? You can access that column with
t(:,5)
So to collect all the reflectance columns into one table you would do something like
fileList = something % get the list of files from somewhere - say as a string array or a cell array of char
resultTable = table;
for ii = 1:numel(fileList)
    sedFile = fileList{ii};
    t = readtable(sedFile, delimitedTextImportOptions( ...
        'NumVariables', 5, 'DataLines', 36, ...
        'Delimiter', ' ', 'ConsecutiveDelimitersRule', 'join'));
    t.Properties.VariableNames{5} = sprintf('Reflectance%d', ii);
    resultTable = [resultTable, t(:,5)];
end
The t.Properties.VariableNames ... line is there because column 5 of t will be called Var5 every time, but in the result table each variable name needs to be unique. Here we're renaming the output table variables Reflectance1, Reflectance2 etc but you could change this to whatever you want - perhaps the name of the actual file from sedFile - as long as it's a valid unique variable name.
Finally you can save the result table to a text file using writetable. See the MATLAB help for how to use that.
In Python 3.x with numpy:
import numpy as np

file_list = something  # filenames in a Python list
result_array = None
for sed_file in file_list:
    reflectance_column = np.genfromtxt(sed_file, skip_header=35, usecols=4)
    result_array = (reflectance_column if result_array is None else
                    np.column_stack((result_array, reflectance_column)))
np.savetxt('outputfile.txt', result_array)
Here
skip_header=35 ignores the first 35 lines
usecols=4 only returns column 5 (Python uses zero-based indexing)
see the help for savetxt for further details
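One possible way to build file_list (an assumption, not part of the original answer) is with the standard glob module:
import glob

# assumption: all .sed files sit in a single folder; adjust the pattern to your layout
file_list = sorted(glob.glob('path/to/sed_files/*.sed'))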

How to iterate over a data frame

I have a dataset of users, books and ratings, and I want to find users who rated a particular book highly, and then find what other books those users liked too.
My data looks like:
df.sample(5)
User-ID ISBN Book-Rating
49064 102967 0449244741 8
60600 251150 0452264464 9
376698 52853 0373710720 7
454056 224764 0590416413 7
54148 25409 0312421273 9
I did so far:
df_p = df.pivot_table(index='ISBN', columns='User-ID', values='Book-Rating').fillna(0)
lotr = df_p.ix['0345339703'] # Lord of the Rings Part 1
like_lotr = lotr[lotr > 7].to_frame()
users = like_lotr['User-ID']
last line failed for
KeyError: 'User-ID'
I want to obtain the users who rated LOTR > 7, and for those users further find the books they liked too from the matrix.
Help would be appreciated. Thanks.
In your like_lotr dataframe, 'User-ID' is the name of the index; you cannot select it like a normal column. That is why the line users = like_lotr['User-ID'] raises a KeyError. It is not a column.
Moreover, ix is deprecated; better to use loc in your case. And don't put quotes: it needs to be an integer, since 'User-ID' was originally a column of integers (at least from your sample).
Try like this:
df_p = df.pivot_table(index='ISBN', columns='User-ID', values='Book-Rating').fillna(0)
lotr = df_p.loc[452264464] # used another number from your sample dataframe to test this code.
like_lotr = lotr[lotr > 7].to_frame()
users = like_lotr.index.tolist()
users is now a list with the IDs you want.
Using your small sample above and the number I used to test, users is [251150].
An alternative solution is to use reset_index. The last two lines should look like this:
like_lotr = lotr[lotr > 7].to_frame().reset_index()
users = like_lotr['User-ID']
reset_index puts the index back into the columns.
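As a follow-up sketch (not part of the original answer) for the second half of the question, once users holds the matching User-IDs you could pull other books those users rated highly from the same pivot table:
# assumption: df_p (ISBN x User-ID pivot table) and users (list of User-IDs) come from above
liked_by_lotr_fans = df_p[users]                                     # keep only those users' columns
other_books = liked_by_lotr_fans[(liked_by_lotr_fans > 7).any(axis=1)]
print(other_books.index.tolist())                                    # ISBNs at least one of them rated > 7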

Computing aggregate by creating nested dictionary on the fly

I'm new to Python and could really use your help and guidance at the moment. I am trying to read a csv file with three columns and do some computation based on the first and second column, i.e.
A spent 100 A spent 2040
A earned 60
B earned 48
B earned 180
A spent 40
.
.
.
Where A spent 2040 would be the addition of all 'A' and 'spent' amounts. This does not give me an error but it's not logically correct:
entries = {}  # assuming both dicts start empty, as in the behaviour described below
records = {}
for row in rows:
    cols = row.split(",")
    truck = cols[0]
    if (truck != 'A' and truck != 'B'):
        continue
    record = cols[1]
    if (record != "earned" and record != "spent"):
        continue
    amount = int(cols[2])
    # print(truck+" "+record+" "+str(amount))
    if truck in entries:
        # entriesA[truck].update(record)
        if record in records:
            records[record].append(amount)
        else:
            records[record] = [amount]
    else:
        entries[truck] = records
        if record in records:
            records[record].append(amount)
        else:
            entries[truck][record] = [amount]
print(entries)
I am aware that this part is incorrect because I would be adding the same inner dictionary list to the outer dictionary but I'm not sure how to go from there:
entries[truck] = records
if record in records:
    records[record].append(amount)
However, I'm not sure of the syntax to create a new dictionary on the fly that would not be 'records'.
I am getting:
{'B': {'earned': [60, 48], 'spent': [100]}, 'A': {'earned': [60, 48], 'spent': [100]}}
But hoping to get:
{'B': {'earned': [48]}, 'A': {'earned': [60], 'spent': [100]}}
Thanks.
For the kind of calculation you are doing here, I highly recommend Pandas.
Assuming in.csv looks like this:
truck,type,amount
A,spent,100
A,earned,60
B,earned,48
B,earned,180
A,spent,40
You can do the totalling with three lines of code:
import pandas
df = pandas.read_csv('in.csv')
totals = df.groupby(['truck', 'type']).sum()
totals now looks like this:
              amount
truck  type
A      earned     60
       spent     140
B      earned    228
You will find that Pandas allows you to think on a much higher level and avoid fiddling with lower level data structures in cases like this.
if record in entries[truck]:
    entries[truck][record].append(amount)
else:
    entries[truck][record] = [amount]
I believe this is what you want. Now we are directly accessing the truck's records instead of checking a local dictionary called records, just like you already do when there isn't an entry for a truck yet.
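For completeness, a hedged sketch of the whole loop with that fix in place, using a fresh inner dictionary per truck (via setdefault) rather than the shared records; it assumes rows holds the CSV lines from the question:
entries = {}
for row in rows:
    cols = row.split(",")
    truck, record = cols[0], cols[1]
    if truck not in ('A', 'B') or record not in ('earned', 'spent'):
        continue
    amount = int(cols[2])
    # each truck gets its own inner dictionary the first time it appears
    truck_records = entries.setdefault(truck, {})
    if record in truck_records:
        truck_records[record].append(amount)
    else:
        truck_records[record] = [amount]
print(entries)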
