I have a script that does things for me, but very inefficiently. I asked for help on Code Review and was told to try Pandas instead. This is what I've done, but I'm having some difficulty understanding how it works. I've tried to read the documentation and other questions here, but I can't find an answer.
So, I've got a dataframe with a small number of rows (20 to a couple of hundred) and an even smaller number of columns. I've used pandas' read_table function to read the original data from a .txt file, which looks like this:
[ID1, Gene1, Sequence1, Ratio1, Ratio2, Ratio3]
[ID1, Gene1, Sequence2, Ratio1, Ratio2, Ratio3]
[ID2, Gene2, Sequence3, Ratio1, Ratio2, Ratio3]
[ID2, Gene3, Sequence4, Ratio1, Ratio2, Ratio3]
[ID3, Gene3, Sequence5, Ratio1, Ratio2, Ratio3]
... along with a whole bunch of unimportant columns.
What I want to be able to do is select all the ratios for each Sequence and perform some calculations and statistics on them (all three ratios for each sequence, that is). I've tried
df.groupby('Sequence')
for col in df:
    # do something / print(col) / print(col[0])
... but that only makes me more confused. If I print(col), I get some kind of dataframe construct printed, whereas if I print(col[0]), I only get the sequences. As far as I can see in that construct, I should still have all the other columns and their data, since groupby() doesn't remove any data, it just groups it by some input column. What am I doing wrong?
Although I haven't gotten that far yet, due to the problems above, I also want my script to be able to select all the ratios for every ID and perform the same calculations on them, but this time every ratio by itself (i.e. Ratio1 for all rows of ID1, the same for Ratio2, etc.). And, lastly, do the same thing for every gene.
EDIT:
So, say I want to perform this calculation on every ratio in the row, and then take the median of the three resulting values:
df['Value1'] = spike[data['ID']] / float(data['Ratio 1']) * (10**-12) * (6.022*10**23) / (1*10**6)
df['Value2'] = spike[data['ID']] / float(data['Ratio 2']) * (10**-12) * (6.022*10**23) / (1*10**6)
df['Value3'] = spike[data['ID']] / float(data['Ratio 3']) * (10**-12) * (6.022*10**23) / (1*10**6)
... where spike is a dictionary whose keys are the IDs. Ignoring the dict part, I can make the calculations (thanks!), but how do I access the dictionary using the dataframe IDs? With the above code, I just get an "unhashable type: Series" error.
Here's some real data:
ID Gene Sequence Ratio1 Ratio2 Ratio3
1 KRAS SFEDXXYR 15.822 14.119 14.488
2 KRAS VEDAXXXLVR 9.8455 8.9279 16.911
3 ELK4 IEXXXCESLNK 15.745 7.9122 9.5966
3 ELK4 IEGXXXSLNKR 1.177 NaN 12.073
df.groupby() does not modify or group df in place, so you have to assign the result to a new variable to use it further, e.g.:
grouped = df.groupby('Sequence')
BTW, in the example data you give, all data in the Sequence column are unique, so grouping on that column will not do much.
Furthermore, you normally don't need to 'iterate over the df' as you do here. To apply a function to all groups, you can do that directly on the groupby result, e.g. df.groupby(..).apply(..) or df.groupby(..).aggregate(..).
Can you give a more specific example of what kind of function you want to apply to the ratios?
To calculate the median of the three ratios for each sequence (each row), you can do:
df[['Ratio1', 'Ratio2', 'Ratio3']].median(axis=1)
The axis=1 means that you do not take the median of each column (over the rows), but of each row (over the columns).
Another example: to calculate the median of all Ratio1 values for each ID, you can do:
df.groupby('ID')['Ratio1'].median()
Here you group by ID, select column Ratio1 and calculate the median value for each group.
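For instance, with the real data you posted, a quick sketch of both calculations (so you can see the shapes of the results) could look like this:
import numpy as np
import pandas as pd

# the sample data from the question
df = pd.DataFrame({
    'ID': [1, 2, 3, 3],
    'Gene': ['KRAS', 'KRAS', 'ELK4', 'ELK4'],
    'Sequence': ['SFEDXXYR', 'VEDAXXXLVR', 'IEXXXCESLNK', 'IEGXXXSLNKR'],
    'Ratio1': [15.822, 9.8455, 15.745, 1.177],
    'Ratio2': [14.119, 8.9279, 7.9122, np.nan],
    'Ratio3': [14.488, 16.911, 9.5966, 12.073],
})

# median of the three ratios per row (one value per sequence)
print(df[['Ratio1', 'Ratio2', 'Ratio3']].median(axis=1))

# median of Ratio1 per ID (one value per group)
print(df.groupby('ID')['Ratio1'].median())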
UPDATE: you should probably split the questions into separate ones, but as an answer to your new question:
data['ID'] will give you the ID column, so you cannot use it as a key. You want one specific value of that column. To apply a function on each row of a dataframe, you can use apply:
def my_func(row):
    return spike[row['ID']] / float(row['Ratio 1']) * (10**-12) * (6.022*10**23) / (1*10**6)
df['Value1'] = df.apply(my_func, axis=1)
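Applied to your edit, a sketch could compute all three values and then take the row-wise median. Note that spike is assumed to be your {ID: value} dictionary, and the column names here follow the sample data table ('Ratio1', 'Ratio2', 'Ratio3'), so adjust them if your file uses different headers:
# sketch only: 'spike' is assumed to be your {ID: value} dictionary
factor = (10**-12) * (6.022 * 10**23) / (1 * 10**6)

def calc_value(row, ratio_col):
    # look up the spike value for this row's ID, then scale by the given ratio
    return spike[row['ID']] / float(row[ratio_col]) * factor

for i, col in enumerate(['Ratio1', 'Ratio2', 'Ratio3'], start=1):
    df['Value%d' % i] = df.apply(calc_value, axis=1, args=(col,))

# median of the three resulting values per row
df['ValueMedian'] = df[['Value1', 'Value2', 'Value3']].median(axis=1)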
Maybe someone can help me find the solution:
I have 100 dataframes. Each dataframe contains time / High_Price / Low_price
I would like to create a new DataFrame which contains the gain from each DataFrame.
Example:
df1 = pd.DataFrame({"high": [5, 4, 5, 2],
                    "low": [1, 2, 2, 1]},
                   index=["2019-04-06", "2019-04-07", "2019-04-08", "2019-04-09"])
df100 = pd.DataFrame({"high": [7, 5, 6, 7],
                      "low": [1, 2, 3, 4]},
                     index=["2019-04-06", "2019-04-07", "2019-04-08", "2019-04-09"])
Function:
def myfunc(data, amount):
    # drop rows where every value is 0
    data = data.loc[(data != 0).any(axis=1)]
    profit = (amount / data.iloc[0]['low']) * data.iloc[-1]['high']
    return profit
Output should be:
output = pd.DataFrame({"Gain": [1, 6]},
                      index=["df1", "df100"])
How can I apply the function to all 100 DataFrames and collect just the gains into a DataFrame where I can see the name of each DataFrame and its gain?
Put your dataframes in a list and access them by integer index. Having variables named df1 to df100 is bad programming style because a) the dataframes belong together, so put them in a collection (e.g. list) and b) you cannot get "the" name of an object from its value, leading to complications such as the one you are facing now.
So let dfs be your list of 100 dataframes, starting at index 0.
Use
amount = ... # the value you want to use
output = pd.DataFrame([myfunc(df, amount) for df in dfs], columns=['Gain'])
The index of output now corresponds to the index of dfs, starting at 0. There's no reason to rename it to 'df1' ... 'df100', you gain no information and the output becomes harder to handle.
In case of arbitrary dataframe names, use a dictionary that maps name to df. Let's call it dfs again. Then use
amount = ... # the value you want to use
output = pd.DataFrame([myfunc(df, amount) for df in dfs.values()], columns=['Gain'], index=dfs.keys())
I'm assuming myfunc is correct, I did not debug it.
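For example, with the two frames from your post collected in a dict, a sketch could look like this (I'm only assuming amount=1 here for illustration):
dfs = {'df1': df1, 'df100': df100}   # name -> dataframe
amount = 1                           # placeholder value

output = pd.DataFrame([myfunc(df, amount) for df in dfs.values()],
                      columns=['Gain'],
                      index=dfs.keys())
print(output)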
I am trying to assign a proportion value to a column in a specific row inside my df. Each row represents a unique product's sales in a specific month, in a dataframe (called testingAgain) like this:
Month ProductID(SKU) Family Sales ProporcionVenta
1 1234 FISH 10000.0 0.0
This row represents product 1234's sales during January. (It is an aggregate, so it represents every January in the DB)
Now I am trying to find the proportion of sales of that unique productid-month in relation to the sum of sales of family-month. For example, the family fish has sold 100,000 in month 1, so in this specific case it would be calculated 10,000/100,000 (productid-month-sales/family-month-sales)
I am trying to do so like this:
for family in uniqueFamilies:
    for month in months:
        salesFamilyMonth = testingAgain[(testingAgain['Family']==family)&(testingAgain['Month']==month)]['Qty'].sum()
        for sku in uniqueSKU:
            salesSKUMonth = testingAgain[(testingAgain['Family']==family)&(testingAgain['Month']==month)&(testingAgain['SKU']==sku)]['Qty'].sum()
            proporcion = salesSKUMonth/salesFamilyMonth
            testingAgain[(testingAgain['SKU']==sku)&(testingAgain['Family']==family)&(testingAgain['Month']==month)]['ProporcionVenta'] = proporcion
The code works, it runs, and I have even individually printed the proportions and calculated them in Excel and they are correct, but the problem is with the last line. As soon as the code finishes running, I print testingAgain and see all proportions listed as 0.0, even though they should have been assigned the new one.
I'm not completely convinced about my approach, but I think it is decent.
Any ideas on how to solve this problem?
Thanks, appreciate it.
Generally, in pandas (and even NumPy), unlike general-purpose Python, you should avoid for loops, since there are many vectorized options for running conditional or grouped calculations. In your case, consider groupby().transform(), which returns inline aggregates (i.e., aggregate values without collapsing rows) or, as the docs indicate, values broadcast to match the shape of the input array.
Currently, your code attempts to assign a value to a subsetted slice of a data frame column, which should raise SettingWithCopyWarning. Such an operation does not affect the original data frame. Your loop can instead use .loc for conditional assignment:
testingAgain.loc[(testingAgain['SKU']==sku) &
                 (testingAgain['Family']==family) &
                 (testingAgain['Month']==month), 'ProporcionVenta'] = proporcion
However, you can avoid looping altogether, since transform works nicely for assigning new data frame columns. Also, below, div is the Series division method (functionally equivalent to the / operator).
testingAgain['ProporcionVenta'] = (
    testingAgain.groupby(['SKU', 'Family', 'Month'])['Qty'].transform('sum')
    .div(testingAgain.groupby(['Family', 'Month'])['Qty'].transform('sum'))
)
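As a rough illustration with a tiny made-up frame (column names taken from your loop code), the transform version fills the proportion for every row without any loops:
import pandas as pd

toy = pd.DataFrame({
    'Month':  [1, 1, 1, 2],
    'SKU':    [1234, 5678, 1234, 1234],
    'Family': ['FISH', 'FISH', 'FISH', 'FISH'],
    'Qty':    [10000.0, 90000.0, 0.0, 5000.0],
})

sku_month_sales = toy.groupby(['SKU', 'Family', 'Month'])['Qty'].transform('sum')
family_month_sales = toy.groupby(['Family', 'Month'])['Qty'].transform('sum')
toy['ProporcionVenta'] = sku_month_sales.div(family_month_sales)
print(toy)   # per-row proportion of each SKU within its family/month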
I filter for duplicated Order IDs and join the products on each order with a comma using the code below, but I don't really understand why the items in the Join_Dup column are repeated:
dd = sales_all[sales_all['Order ID'].duplicated(keep=False)]
dd['Join_Dup'] = dd.groupby('Order ID')['Product'].transform(lambda x: ','.join(x))
print(dd.head())
dd = dd[['Order ID','Join_Dup']].drop_duplicates()
dd
Order ID Join_Dup
0 176558 USB-C Charging Cable,USB-C Charging Cable,USB-...
2 176559 Bose SoundSport Headphones,Bose SoundSport Hea...
3 176560 Google Phone,Wired Headphones,Google Phone,Wir...
5 176561 Wired Headphones,Wired Headphones,Wired Headph...
... ... ...
186846 259354 iPhone,iPhone,iPhone,iPhone,iPhone,iPhone
186847 259355 iPhone,iPhone,iPhone,iPhone,iPhone,iPhone
186848 259356 34in Ultrawide Monitor,34in Ultrawide Monitor,...
186849 259357 USB-C Charging Cable,USB-C Charging Cable,USB-...
[178437 rows x 2 columns]
I need to remove the duplicates from the cell in each row. Can someone please help?
IIUC, let's try to prevent the duplicates in the groupby transform statement:
dd['Join_No_Dup'] = dd.groupby('Order ID')['Product'].transform(lambda x: ','.join(set(x)))
Edit disregard the second part of the answer. I will delete that portion if it ends up not being useful.
So you commented that you want unique product strings for each Order ID. You can get that in a single step:
dd = (
sales_all.groupby(['Order ID', 'Product'])['some_other_column']
.size().rename('quantity').reset_index()
)
Now you have unique rows of OrderID/Product with the count of repeated products (or quantity, as in a regular invoice). You can work with that or you can groupby to form a list of products:
orders = dd.groupby('Order ID').Product.apply(list)
---apply vs transform---
Please note that if you use .transform, as in your question, you will invariably get a result with the same shape as the dataframe/series being grouped (i.e. the grouping is reversed and you end up with the same number of rows, which is what creates the duplicates). The .apply function passes each group of your groupby to a function, any function, but does not broadcast back to the original shape (it returns only one row per group).
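A toy example makes the shape difference clear (just a sketch):
import pandas as pd

toy = pd.DataFrame({'key': ['a', 'a', 'b'], 'val': [1, 2, 10]})

# transform: same length as the input, group result broadcast back to every row
print(toy.groupby('key')['val'].transform('sum'))   # -> 3, 3, 10

# apply / aggregate: one row per group
print(toy.groupby('key')['val'].sum())              # -> a: 3, b: 10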
Old Answer
So you are removing ALL Order IDs that appear in multiple rows (if ID 14 appears in two rows you discard both rows). This makes the groupby in the next line redundant, as every grouped ID will have just one line.
Ok, now that's out of the way. Then presumably each row in Product contains a list which you are joining with a lambda. This step would be a little faster with a pandas native function.
dd['Join_Dup'] = dd.Product.str.join(', ')
# perhaps choose a better name for the column, once you remove duplicates it will not mean much (does 'Join_Products' work?)
Now to handle duplicates. You didn't actually need to join in the last step if all you wanted was to remove dups. Pandas can handle lists as well. But the part you were missing is the subset attribute.
dd = dd[['Order ID', 'Join_Dup']].drop_duplicates(subset='Join_Dup')
I have daily CSVs that are automatically created for work, averaging about 1,000 rows and exactly 630 columns. I've been trying to work with pandas to create a summary report that I can write to a new txt file each day.
The problem that I'm facing is that I don't know how to group the data by 'provider', while also performing my own calculations based on the unique values within that group.
After 'Start', the rest of the columns (-2000 to 300000) are profit/loss data based on time (milliseconds). The file usually has 700-1000 lines and I usually don't use any data past column heading '20000' (not shown).
I am trying to make an output text file that will summarize the csv file by 'provider'(there are usually 5-15 unique providers per file and they are different each day). The calculations I would like to perform are:
Provider = df.groupby('provider')
total tickets = sum of 'filled' (filled column: 1=filled, 0=reject)
share % = a providers total tickets / sum of all filled tickets in file
fill rate = sum of filled / (sum of filled + sum of rejected)
Size = Sum of 'fill_size'
1s Loss = (count how many times column '1000' < $0) / total_tickets
1s Avg = average of column '1000'
10s Loss = (count how many times MIN of range ('1000':'10000') < $0) / total_tickets
10s Avg = average of range ('1000':'10000')
Ideally, my output file will have these headings transposed across the top and the 5-15 unique providers underneath
While I still don't understand the proper format for writing all of these custom calculations, my biggest hurdle is referencing one of my calculated values (e.g. total_tickets) in the new dataframe and applying it in the next calculation (e.g. 1s Loss).
I'm looking for someone to tell me the best way to perform these calculations and maybe provide an example of at least 2 or 3 of my metrics. I think that if I have the proper format, I'll be able to run with the rest of this project.
Thanks for the help.
The function you want is DataFrame.groupby, with more examples in the documentation here.
Usage is fairly straightforward.
You have a field called 'provider' in your dataframe, so to create groups, you simply call grouped = df.groupby('provider'). Note that this does no calculations; it just tells pandas how to find the groups.
To apply functions to this object, you can do a few things:
If it's an existing function (like sum), tell the grouped object which columns you want and then call .sum(), e.g., grouped['filled'].sum() will give the sum of 'filled' for each group. If you want the sum of every column, grouped.sum() will do that. For your second example, you could divide this resulting series by df['filled'].sum() to get your percentages.
If you want to pass a custom function, you can call grouped.apply(func) to apply that function to each group.
To store your values (e.g., for total tickets), you can just assign them to variables, e.g. total_tickets = df['filled'].sum() and tickets_by_provider = grouped['filled'].sum(). You can then use these in other calculations.
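For example, the first few of your metrics might look like this (a sketch; the column names 'provider', 'filled' and 'fill_size' are taken from your description, and the file path is a placeholder):
import pandas as pd

df = pd.read_csv('daily_report.csv')              # placeholder path for the daily file
grouped = df.groupby('provider')

total_tickets = df['filled'].sum()                # all filled tickets in the file
tickets_by_provider = grouped['filled'].sum()     # total tickets per provider
share_pct = tickets_by_provider / total_tickets   # share % per provider

# fill rate: filled / (filled + rejected); with a 0/1 column this is just the mean
fill_rate = grouped['filled'].mean()

size_by_provider = grouped['fill_size'].sum()     # Size = sum of 'fill_size'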
Update:
For one second loss (and for the other losses), you need two things:
The number of times df['1000'] < 0, for each provider
The total number of records for each provider
These both fit within groupby.
For the first, you can use grouped.apply with a lambda function. It could look like this:
_1s_loss_freq = grouped.apply(lambda x: x['filled'][x['1000'] < 0].sum())
For group totals, you just need to pick a column and get counts. This is done with the count() function.
records_per_group = grouped['1000'].count()
Then, because pandas aligns on indices, you can get your percentages with _1s_loss_freq / records_per_group.
This analogizes to the 10s Loss question.
The last question about the average over a range of columns relies on pandas understanding of how it should apply functions. If you take a dataframe and call dataframe.mean(), pandas returns the mean of each column. There's a default argument in mean() that is axis=0. If you change that to axis=1, pandas will instead take the mean of each row.
For your last question, 10s Avg, I'm assuming you've aggregated to the provider level already, so that each provider has one row. I'll do that with sum() below but any aggregation will do. Assuming the columns you want the mean over are stored in a list called cols, you want:
one_rec_per_provider = grouped[cols].sum()
provider_means_over_cols = one_rec_per_provider.mean(axis=1)
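Finally, to collect everything into one table per provider and write it out, you can combine the pieces computed above (a sketch that continues the variable names from the snippets in this answer; the output file name is a placeholder):
summary = pd.DataFrame({
    'total_tickets': tickets_by_provider,
    'share_pct': share_pct,
    'fill_rate': fill_rate,
    'size': size_by_provider,
    '1s_loss': _1s_loss_freq / records_per_group,
    '1s_avg': grouped['1000'].mean(),
    '10s_avg': provider_means_over_cols,
})
summary.to_csv('provider_summary.txt', sep='\t')   # one row per provider, metrics as columns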
So I think this is a relatively simple question:
I have a Pandas data frame (A) that has a key column (which is not unique/will have repeats of the key)
I have another Pandas data frame (B) that has a key column, which may have many matching entries/repeats.
So what I'd like is a bunch of data frames (a list, or a bunch of slice parameters, etc.), one for each key in A (regardless of whether it's unique or not)
In [bad] pseudocode:
for each key in A:
resultDF[] = Rows in B where B.key = key
I can easily do this iteratively with loops, but I've read that you're supposed to slice/merge/join data frames holistically, so I'm trying to see if I can find a better way of doing this.
A join will give me all the stuff that matches, but that's not exactly what I'm looking for, since I need a resulting dataframe for each key (i.e. for every row) in A.
Thanks!
EDIT:
I was trying to be brief, but here are some more details:
Eventually, what I need to do is generate some simple statistical metrics for elements in the columns of each row.
In other words, I have a DF, call it A, and it has a r rows, with c columns, one of which is a key. There may be repeats on the key.
I want to "match" that key with another [set of?] dataframe, returning however many rows match the key. Then, for that set of rows, I want to, say, determine the min and max of a certain element (and std. dev., variance, etc.), and then determine whether the corresponding element in A falls within that range.
You're absolutely right that it's possible that if row 1 and row 3 of DF A have the same key -- but potentially DIFFERENT elements -- they'd be checked against the same result set (the ranges of which obviously won't change). That's fine. These won't likely ever be big enough to make that an issue (but if there's the better way of doing it, that's great).
The point is that I need to be able to do the "in range" and stat summary computation for EACH key in A.
Again, I can easily do all of this iteratively. But this seems like the sort of thing pandas could do well, and I'm just getting into using it.
Thanks again!
FURTHER EDIT
The DF looks like this:
df = pd.DataFrame([[1,2,3,4,1,2,3,4], [28,15,13,11,12,23,21,15],['keyA','keyB','keyC','keyD', 'keyA','keyB','keyC','keyD']]).T
df.columns = ['SEQ','VAL','KEY']
SEQ VAL KEY
0 1 28 keyA
1 2 15 keyB
2 3 13 keyC
3 4 11 keyD
4 1 12 keyA
5 2 23 keyB
6 3 21 keyC
7 4 15 keyD
Both DF's A and B are of this format.
I can iteratively get the resultant sets with:
loop_iter = len(A) // max(A['SEQ_NUM'])
for start in range(0, loop_iter):
    matchA = A.iloc[start::loop_iter, :]['KEY']
That's simple. But I guess I'm wondering if I can do this "inline". Also, if for some reason the numeric ordering breaks (i.e. the SEQ values get out of order), this won't work. There seems to be no reason NOT to do it by explicitly splitting on the keys, right? So perhaps I have TWO questions: 1) How to split on keys, iteratively (i.e. accessing a DF one row at a time), and 2) How to match a DF and do summary statistics, etc., on the DF that matches on the key.
So, once again:
1). Iterate through DF A, going one at a time, and grabbing a key.
2). Match the key to the SET (matchB) of keys in B that match
3). Do some stats on "values" of matchB, check to see if val.A is in range, etc.
4). Profit!
OK, from what I understand, the problem at its simplest is that you have a pd.Series of values (i.e. a["key"], which let's just call keys) that correspond to rows of a pd.DataFrame (the df called b), such that set(b["key"]).issuperset(set(keys)). You then want to apply some function to each group of rows in b where b["key"] is one of the values in keys.
I'm purposefully disregarding the other df -- a -- that you mention in your prompt, because it doesn't seem to bear any significance to the problem, other than being the source of keys.
Anyway, this is a fairly standard sort of operation -- it's a groupby-apply.
def descriptive_func(df):
    """
    Takes a df where key is always equal and returns some summary.
    :type df: pd.DataFrame
    :rtype: pd.Series|pd.DataFrame
    """
    pass
# filter down to those rows we're interested in
valid_rows = b[b["key"].isin(set(keys))]
# this groups by the value and applies the descriptive func to each sub df in turn
summary = valid_rows.groupby("key").apply(descriptive_func)
There are a few built-in methods on the groupby object that are useful. For example, check out valid_rows.groupby("key").sum() or valid_rows.groupby("key").describe(). Under the covers, these are really similar uses of apply. The shape of the returned summary is determined by the applied function. The unique grouped-by values (those of b["key"]) always constitute the index, but if the applied function returns a scalar, summary is a Series; if the applied function returns a Series, then summary has the returned Series as its rows; and if the applied function returns a DataFrame, then the result is a multi-index DataFrame. This is a core pattern in pandas, and there's a whole, whole lot to explore here.
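For instance, one possible descriptive_func and the range check you described might look like this (a sketch; A and B are your two frames, the 'KEY'/'VAL' column names come from your sample, and VAL is assumed to be numeric):
import pandas as pd

def descriptive_func(df):
    # summarize one key's rows in B: min, max and standard deviation of VAL
    return pd.Series({'min': df['VAL'].min(),
                      'max': df['VAL'].max(),
                      'std': df['VAL'].std()})

keys = A['KEY']
valid_rows = B[B['KEY'].isin(set(keys))]
summary = valid_rows.groupby('KEY').apply(descriptive_func)

# check whether each row of A falls inside the [min, max] range for its key
checked = A.merge(summary, left_on='KEY', right_index=True)
checked['in_range'] = checked['VAL'].between(checked['min'], checked['max'])
print(checked)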