Python 3 - create combination of list within dictionary and aggregate

I think I'm going about this all in a backwards way in pandas. Here's an example dataframe:
Group  rstart  rend    qty
1      10000   11000   1000
1      10000   11000   8000
1      10000   11000  13000
1      10000   11000   1000
2       6000    8000   4000
2       6000    8000   9000
2       6000    8000   3000
In the end I'm trying to identify the quantity, or combination of quantities, within each group whose total falls within the range (rstart to rend), and put a flag in a new column (and, if possible, save the combination in a new column too).
Here's what I have done so far and where I'm running into an issue; I've been trying out all different ways since I'm new to this language.
import pandas as pd
import numpy as np
import itertools
df = pd.read_csv('test.csv')
d = df[['group','qty']]
s = d.groupby('group')['qty'].apply(list).to_dict()
comb = list(map(dict,itertools.combinations(s.items(),2)))
The comb statement (and the multiple variations I tried) just prints the dictionary items. I put in 2 to test two-item combinations, but it isn't working; this would have to be adjusted based on the number of values in the list.
I brought in the dataset and then was thinking it would be best to create a dictionary with a list for each grouping and qty in order to create all the combinations in a separate table. Once I have the combinations and sum of each of the values - link back to the main dataframe to compare against total and flag.
I'm running into issues with creating each combination of the quantities associated with the group and summing them. I can do it if the quantities are stored in one list across all dictionaries, but I need it grouped by the group. For instance, group 1 should have 1000,8000 and 1000,13000 and 1000,1000 and 1000,8000,13000 and so on. The number of combinations can vary by group.
Can anyone assist with guiding me in the right direction? Maybe my thinking is off on how to go about this.
Thank you

Here is one self-explanatory solution which uses itertools.combinations in conjunction with list comprehensions:
def aggregate(sub_df):
    # get boundaries and actual values
    bound_low = sub_df["rstart"].iloc[0]
    bound_high = sub_df["rend"].iloc[0]
    values = sub_df["qty"].values

    # get possible combinations, iterate all lengths of combinations
    combis = [itertools.combinations(values, x+1)
              for x in range(len(values))]

    # flatten all combis and apply filter condition
    result = [combi for sub_combi in combis
              for combi in sub_combi
              if bound_low <= sum(combi) <= bound_high]

    return result
print(df.groupby("Group").apply(aggregate))
Group
1 [(1000, 8000, 1000)]
2 [(4000, 3000)]
dtype: object
However, I don't understand your statement that group 1 should have 1000,8000 and 1000,13000 and 1000,1000 and 1000,8000,13000.
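As a follow-up, to get the flag column the question mentions, one possibility (just a sketch, reusing the aggregate function and the Group column above) is to map the per-group result back onto the original frame:

# Per-group list of combinations whose sum falls within [rstart, rend]
matches = df.groupby("Group").apply(aggregate)

# Flag rows whose group has at least one qualifying combination,
# and keep the combinations themselves for reference
df["flag"] = df["Group"].map(matches.apply(len) > 0)
df["combos"] = df["Group"].map(matches)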

Related

how to automatically classify a list of numbers

Well, the context is: I have a list of wind speeds, say 100 wind measurements from 0 to 50 km/h, loaded from a csv, and I want to automatically split them into bins of 5 km/h, i.e. the ones that go from 0 to 5, the ones that go from 5 to 10, etc.
Here is the code:
wind = pd.read_csv("wind.csv")
df = pd.DataFrame(wind)
x = df["Value"]
d = sorted(pd.Series(x))
lst = [[] for i in range(0,(int(x.max())+1),5)]
This gives me a list of empty lists, i.e. if the winds go from 0 to 54 km/h it will create 11 empty lists.
Now, to classify I did this:
for i in range(0,len(lst),1):
    for e in range(0,55,5):
        for n in d:
            if n>e and n< (e+5):
                lst[i].append(n)
            else:
                continue
My objective is that when the loop reaches a number greater than 5, it jumps to the next level, that is, it adds 5 to the limits of the interval (e) and moves on to the next i so that it fills the second empty list in lst. I tried it several ways, because I imagine the loops must go in a specific order to give a good result. This code is just one example of several that I tried, but they all gave me similar results: either all the lists were filled with all the numbers, or only the first list was filled with all the numbers.
Your title mentions classifying the numbers -- are you looking for a categorical output like calm | gentle breeze | strong breeze | moderate gale | etc.? If so, take a look at the second example on the pd.qcut docs.
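For example, a rough sketch of one way to get such labels with pd.cut, using made-up label names and bin edges (not the official Beaufort thresholds):

import pandas as pd

wind = pd.Series([2.0, 7.5, 18.0, 33.0, 49.0])  # example values in km/h
labels = ['calm', 'gentle breeze', 'strong breeze', 'moderate gale']  # hypothetical labels
categories = pd.cut(wind, bins=[0, 5, 20, 40, 55], labels=labels)     # hypothetical bin edges
print(categories.value_counts())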
Since you're already using pandas, use pd.cut with an IntervalIndex (constructed with the pd.interval_range function) to get a Series of bins, and then groupby on that.
import pandas as pd
from math import ceil
BIN_WIDTH = 5
wind_velocity = (pd.read_csv("wind.csv")["Value"]).sort_values()
upper_bin_lim = BIN_WIDTH * ceil(wind_velocity.max() / BIN_WIDTH)
bins = pd.interval_range(
    start=0,
    end=upper_bin_lim,
    periods=upper_bin_lim//BIN_WIDTH,
    closed='left')
velocity_bins = pd.cut(wind_velocity, bins)
groups = wind_velocity.groupby(velocity_bins)
for name, group in groups:
    # TODO: use `name` and `group` to do stuff
    pass
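For instance, a small usage sketch that counts the measurements per bin instead of looping:

# Number of measurements in each 5 km/h bin
print(groups.size())

# Equivalent, directly from the binned Series
print(velocity_bins.value_counts(sort=False))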

Pandas - using group by and including value counts which are larger than n

I have a table which includes salary and company_location.
I was trying to calculate the mean salary of a country, its works:
wage = df.groupby('company_location').mean()['salary']
However, many company_location values have fewer than 5 entries, and I would like to exclude those from the report.
I know how to find the 5 countries with the most entries:
Top_5 = df['company_location'].value_counts().head(5)
I am just having a problem connecting those two variables into one and making a graph out of it...
Thank you.
You can remove rows whose value occurrence is below a threshold:
df = df[df.groupby('company_location')['company_location'].transform('size') > 5]
You can do the following to only apply the groupby and aggregation to those with more than 5 records:
mask = (df['company_location'].map(df['company_location'].value_counts()) > 5)
wage = df[mask].groupby('company_location')['salary'].mean()
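To then turn the result into a graph (a sketch, assuming matplotlib is available):

import matplotlib.pyplot as plt

# Bar chart of mean salary per remaining company_location
wage.sort_values(ascending=False).plot(kind='bar')
plt.ylabel('mean salary')
plt.tight_layout()
plt.show()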

Pandas - partition a dataframe into two groups with an approximate mean value

I want to split all rows into two groups that have similar means.
I have a dataframe of about 50 rows (but this could grow to several thousand) with a column of interest called 'value'.
                     value     total  bucket
300048137           3.0741    3.0741       0
352969997           2.1024    5.1765       0
abc13.com           4.5237    9.7002       0
abc7.com            5.8202   15.5204       0
abcnews.go.com      6.7270   22.2474       0
...
www.legacy.com     12.6609  263.0797       1
www.math-aids.com  10.9832  274.0629       1
So far I tried using a cumulative sum, for which the total column was created, and then I essentially made the split based on where the mid-point of the total column is. Based on this solution.
test['total'] = test['value'].cumsum()
df_sum = test['value'].sum()//2
test['bucket'] = np.where(test['total'] <= df_sum, 0,1)
If I try to group them and take the average for each group then the difference is quite significant
display(test.groupby('bucket')['value'].mean())
bucket
0 7.456262
1 10.773905
Is there a way I could achieve this partition based on means instead of sums? I was thinking about using expanding means from pandas but couldn't find a proper way to do it.
I am not sure I understand what you are trying to do, but possibly you want to group by quantiles of a column. If so:
test['bucket'] = pd.qcut(test['value'], q=2, labels=False)
which will set bucket=0 for the half of rows with the lesser 'value' values, and 1 for the rest. By tweaking the q parameter you can have as many groups as you want (as long as it is <= the number of rows).
Edit:
New attempt, now that I think I understand your aim better:
import numpy as np

df = pd.DataFrame({'value': np.arange(100)})
df['group'] = df['value'].argsort().mod(2)
df.groupby('group')['value'].mean()
# group
# 0 49
# 1 50
# Name: value, dtype: int64
df['group'] = df['value'].argsort().mod(3)
df.groupby('group')['value'].mean()
#group
# 0 49.5
# 1 49.0
# 2 50.0
# Name: value, dtype: float64
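If the column is not already sorted (as in the question's data), a similar alternating split can be made along the sort order instead; a sketch, assuming a test frame with a 'value' column like the one in the question:

import numpy as np
import pandas as pd

test = pd.DataFrame({'value': np.random.default_rng(0).uniform(2, 15, 50)})

# Alternate bucket membership along the sorted order of 'value',
# so the two buckets interleave and end up with similar means
test['bucket'] = test['value'].rank(method='first').astype(int).mod(2)
print(test.groupby('bucket')['value'].mean())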

Why is Pandas DataFrame Function 'isin()' taking so much time?

The 'ratings' DataFrame has two columns of interest: User-ID and Book-Rating.
I'm trying to make a histogram showing the amount of books read per user in this dataset. In other words, I'm looking to count Book-Ratings per User-ID. I'll include the dataset in case anyone wants to check it out.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
!wget https://raw.githubusercontent.com/porterjenkins/cs180-intro-data-science/master/data/ratings_train.csv
ratings = pd.read_csv('ratings_train.csv')
# Remove Values where Ratings are Zero
ratings2 = ratings.loc[(ratings != 0).all(axis=1)]
# Sort by User
ratings2 = ratings2.sort_values(by=['User-ID'])
usersList = []
booksRead = []
for i in range(2000):
    numBooksRead = ratings2.isin([i]).sum()['User-ID']
    if numBooksRead != 0:
        usersList.append(i)
        booksRead.append(numBooksRead)
new_dict = {'User_ID':usersList,'booksRated':booksRead}
usersBooks = pd.DataFrame(new_dict)
usersBooks
The code works as is, but it took almost 5 minutes to complete. And this is the problem: the dataset has 823,000 values. So if it took me 5 minutes to sort through only the first 2000 numbers, I don't think it's feasible to go through all of the data.
I also should admit, I'm sure there's a better way to make a DataFrame than creating two lists, turning them into a dict, and then making that a DataFrame.
Mostly I just want to know how to go through all this data in a way that won't take all day.
Thanks in advance!!
It seems you want a list of user IDs with a count of how often each ID appears in the dataframe. Use value_counts() for that:
ratings = pd.read_csv('ratings_train.csv')
# Remove Values where Ratings are Zero
ratings2 = ratings.loc[(ratings != 0).all(axis=1)]
In [74]: ratings2['User-ID'].value_counts()
Out[74]:
11676     6836
98391     4650
153662    1630
189835    1524
23902     1123
          ...
258717       1
242214       1
55947        1
256110       1
252621       1
Name: User-ID, Length: 21553, dtype: int64
The result is a Series, with the User-ID as index, and the value is number of books read (or rather, number of books rated by that user).
Note: be aware that the result is heavily skewed: there are a few very active readers, but most will have rated very few books. As a result, your histogram will likely just show one bin.
Taking the log (or plotting with the x-axis on a log scale) may show a clearer histogram:
np.log(s).hist()   # where `s` is the value_counts() Series from above
First filter the Book-Rating column to remove 0 values, then count the values with Series.value_counts and convert to a DataFrame; a loop is not necessary here:
ratings = pd.read_csv('ratings_train.csv')
ratings2 = ratings[ratings['Book-Rating'] != 0]
usersBooks = (ratings2['User-ID'].value_counts()
                                 .sort_index()
                                 .rename_axis('User_ID')
                                 .reset_index(name='booksRated'))
print (usersBooks)
        User_ID  booksRated
0             8           6
1            17           4
2            44           1
3            53           3
4            69           2
...         ...         ...
21548    278773           3
21549    278782           2
21550    278843          17
21551    278851          10
21552    278854           4

[21553 rows x 2 columns]
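To get the histogram the question was after (a sketch reusing the usersBooks frame; log-spaced bins help with the skew mentioned above):

import numpy as np
import matplotlib.pyplot as plt

# Log-spaced bins, because a few very active users rate far more books than most
bins = np.logspace(0, np.log10(usersBooks['booksRated'].max()), 30)
usersBooks['booksRated'].plot.hist(bins=bins)
plt.xscale('log')
plt.xlabel('books rated per user')
plt.show()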

Python: Create a dataframe with pairs and count

I am new to python and I have a big dataframe. I want to count the pairs of column elements appearing in the dataframe:
Sample code
import pandas as pd
import itertools
data = {'compound': ['a','a','a','b','b','c','c','d','d','d','e','e','e','e'],
        'element': ['carbon','nitrogen','oxygen','hydrogen','nitrogen','nitrogen','oxygen','nitrogen','oxygen','carbon','carbon','nitrogen','oxygen','hydrogen']
        }
df = pd.DataFrame(data, columns = ['compound', 'element'])
pair = pd.DataFrame(list(itertools.product(df['element'].unique() , df['element'].unique() )))
pair.columns = ['element1', 'element2']
pair=pair[pair['element1']!=pair['element2']]
I want to create count of each pair i.e.
count = []
for index,row in pair.iterrows():
    df1 = df[df['element']==row['element1']]
    df2 = df[df['element']==row['element2']]
    df_merg = pd.merge(df1,df2,on='compound')
    count.append(len(df_merg.index))
pair['count'] = count
Problem 1
This does not work on a df of 2.8 million rows (or is very slow). Can somebody please suggest an efficient method?
Problem 2
pair contains duplicates due to the product, i.e. both ['carbon','nitrogen'] and ['nitrogen','carbon'] are part of pair. Can I somehow keep only unique combinations?
Problem 3
The final dataframe 'pair' has its indexes messed up. I am new to python and haven't used .iloc much. What am I missing? e.g.
(image of the row numbers)
Does this work?
I think this can be better done with dicts instead of dataframes. I first convert the input dataframe into a dict so we can use it easily without having to subset repeatedly (which would be slow). This should help with Problem 1.
Problem 2 can be solved by using itertools.combinations, as shown below. Problem 3 doesn't happen in my proposed solution. For your solution, you can solve problem 3 by resetting the index (assuming index is not useful) like so: pair.reset_index(drop=True).
import pandas as pd
import itertools
data = {'compound': ['a','a','a','b','b','c','c','d','d','d','e','e','e','e'],
        'element': ['carbon','nitrogen','oxygen','hydrogen','nitrogen','nitrogen','oxygen','nitrogen','oxygen','carbon','carbon','nitrogen','oxygen','hydrogen']
        }
df = pd.DataFrame(data, columns = ['compound', 'element'])
# If these are real compounds and elements, each value in the following
# dict should be small because there are only 118 elements (more are
# hypothesized but not yet made). Even if there are a large number of
# compounds, storing them as a dict should not be too different from storing
# them in a dataframe that has repeated compound names.
compound_to_elements = {
    compound: set(subset['element'])
    for compound, subset in df.groupby(by=['compound'])
}
# Generate combinations that ignore order
combos = list(itertools.combinations(list(set(df['element'])), 2))
counts = [0] * len(combos)
# For each element pair, find out how many distinct compounds does it belong to.
# The looping order can be switched depending upon whether there are more
# compounds or more 2-element combinations.
for _, elements in compound_to_elements.items():
    for i, (element1, element2) in enumerate(combos):
        if (element1 in elements) and (element2 in elements):
            counts[i] += 1
pairs = pd.DataFrame.from_records(combos, columns=['element1', 'element2'])
pairs['count'] = counts
# element1 element2 count
# 0 nitrogen hydrogen 2
# 1 nitrogen oxygen 4
# 2 nitrogen carbon 3
# 3 hydrogen oxygen 1
# 4 hydrogen carbon 1
# 5 oxygen carbon 3
Alternative Solution.
The solution above has room for improvement because we checked whether or not an element is a part of a compound multiple times (for example, we check "nitrogen" is a part of "a" multiple times -- once for each combination). The following alternative solution improves upon the previous solution by removing such an inefficiency. Which solution is feasible or faster would depend a little bit on your exact data and available memory.
# If these are real compounds and elements, then the number of keys in
# the following dict should be small because there are only 118 elements
# (more are hypothesized but not yet made). But, some values may be big
# sets of compounds (such as for Carbon).
element_to_compounds = {
    element: set(subset['compound'])
    for element, subset in df.groupby(by=['element'])
}
# Generate combinations that ignore order
combos = list(itertools.combinations(list(set(df['element'])), 2))
counts = [
    len(element_to_compounds[element1]
        .intersection(element_to_compounds[element2]))
    for (element1, element2) in combos
]
pairs = pd.DataFrame.from_records(combos, columns=['element1', 'element2'])
pairs['count'] = counts
# element1 element2 count
# 0 nitrogen hydrogen 2
# 1 nitrogen oxygen 4
# 2 nitrogen carbon 3
# 3 hydrogen oxygen 1
# 4 hydrogen carbon 1
# 5 oxygen carbon 3
import pandas as pd
import itertools
data = {'compound': ['a','a','a','b','b','c','c','d','d','d','e','e','e','e'],
        'element': ['carbon','nitrogen','oxygen','hydrogen','nitrogen','nitrogen','oxygen','nitrogen','oxygen','carbon','carbon','nitrogen','oxygen','hydrogen']
        }
df = pd.DataFrame(data, columns = ['compound', 'element'])
pair = pd.DataFrame(list(itertools.product(df['element'].unique() , df['element'].unique() )))
pair.columns = ['element1', 'element2']
pair=pair[pair['element1']!=pair['element2']]
## create a tuple of names
## sort the values in combined column if order doesn't matter to you
pair["combined"] = tuple(zip(pair.element1, pair.element2))
pair.groupby("combined").size().reset_index(name= "count")
