I am new to Python and I have a big dataframe. I want to count the pairs of column elements appearing in the dataframe:
Sample code
import pandas as pd
import itertools
data = {'compound': ['a','a','a','b','b','c','c','d','d','d','e','e','e','e'],
        'element': ['carbon','nitrogen','oxygen','hydrogen','nitrogen','nitrogen','oxygen','nitrogen','oxygen','carbon','carbon','nitrogen','oxygen','hydrogen']}
df = pd.DataFrame(data, columns=['compound', 'element'])
pair = pd.DataFrame(list(itertools.product(df['element'].unique(), df['element'].unique())))
pair.columns = ['element1', 'element2']
pair = pair[pair['element1'] != pair['element2']]
I want to create a count of each pair, i.e.
count = []
for index, row in pair.iterrows():
    df1 = df[df['element'] == row['element1']]
    df2 = df[df['element'] == row['element2']]
    df_merg = pd.merge(df1, df2, on='compound')
    count.append(len(df_merg.index))
pair['count'] = count
Problem 1
This does not work on a df of 2.8 million rows (or is very slow). Can somebody please suggest an efficient method?
Problem 2
pair contains duplicates due to product, i.e. both ['carbon','nitrogen'] and ['nitrogen','carbon'] appear. Can I somehow keep only unique combinations?
Problem 3
The final dataframe 'pair' has its index messed up. I am new to Python and haven't used .iloc much. What am I missing? (A screenshot of the row numbers was attached here.)
Does this work?
I think this can be better done with dicts instead of dataframes. I first convert the input dataframe into a dict so we can use it easily without having to subset repeatedly (which would be slow). This should help with Problem 1.
Problem 2 can be solved by using itertools.combinations, as shown below. Problem 3 doesn't happen in my proposed solution. For your solution, you can solve problem 3 by resetting the index (assuming index is not useful) like so: pair.reset_index(drop=True).
import pandas as pd
import itertools
data = {'compound': ['a','a','a','b','b','c','c','d','d','d','e','e','e','e'],
        'element': ['carbon','nitrogen','oxygen','hydrogen','nitrogen','nitrogen','oxygen','nitrogen','oxygen','carbon','carbon','nitrogen','oxygen','hydrogen']}
df = pd.DataFrame(data, columns=['compound', 'element'])
# If these are real compounds and elements, each value in the following
# dict should be small because there are only 118 elements (more are
# hypothesized but not yet made). Even if there are a large number of
# compounds, storing them as a dict should not be too different from storing
# them in a dataframe that has repeated compound names.
compound_to_elements = {
    compound: set(subset['element'])
    for compound, subset in df.groupby('compound')
}
# Generate combinations that ignore order
combos = list(itertools.combinations(list(set(df['element'])), 2))
counts = [0] * len(combos)
# For each element pair, find out how many distinct compounds it belongs to.
# The looping order can be switched depending upon whether there are more
# compounds or more 2-element combinations.
for _, elements in compound_to_elements.items():
    for i, (element1, element2) in enumerate(combos):
        if (element1 in elements) and (element2 in elements):
            counts[i] += 1
pairs = pd.DataFrame.from_records(combos, columns=['element1', 'element2'])
pairs['count'] = counts
# element1 element2 count
# 0 nitrogen hydrogen 2
# 1 nitrogen oxygen 4
# 2 nitrogen carbon 3
# 3 hydrogen oxygen 1
# 4 hydrogen carbon 1
# 5 oxygen carbon 3
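As an aside, the same double loop can be written more compactly with collections.Counter (a sketch reusing the compound_to_elements dict built above):
from collections import Counter

# Count each sorted 2-element combination once per compound
pair_counter = Counter(
    combo
    for elements in compound_to_elements.values()
    for combo in itertools.combinations(sorted(elements), 2)
)
pairs = pd.DataFrame(
    [(e1, e2, n) for (e1, e2), n in pair_counter.items()],
    columns=['element1', 'element2', 'count'],
)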
Alternative Solution.
The solution above has room for improvement because we check whether an element is part of a compound multiple times (for example, we check whether "nitrogen" is part of "a" once for each combination). The following alternative solution improves on the previous one by removing that inefficiency. Which solution is feasible or faster will depend a little on your exact data and available memory.
# If these are real compounds and elements, then the number of keys in
# the following dict should be small because there are only 118 elements
# (more are hypothesized but not yet made). But, some values may be big
# sets of compounds (such as for Carbon).
element_to_compounds = {
    element: set(subset['compound'])
    for element, subset in df.groupby('element')
}
# Generate combinations that ignore order
combos = list(itertools.combinations(list(set(df['element'])), 2))
counts = [
    len(element_to_compounds[element1]
        .intersection(element_to_compounds[element2]))
    for (element1, element2) in combos
]
pairs = pd.DataFrame.from_records(combos, columns=['element1', 'element2'])
pairs['count'] = counts
# element1 element2 count
# 0 nitrogen hydrogen 2
# 1 nitrogen oxygen 4
# 2 nitrogen carbon 3
# 3 hydrogen oxygen 1
# 4 hydrogen carbon 1
# 5 oxygen carbon 3
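If memory allows, a pandas-native sketch (my own variant, not part of the answer above) avoids Python-level loops entirely with a self-merge on compound; note the intermediate merge can get large if individual compounds have many rows:
# Pair every element with every other element of the same compound
merged = df.merge(df, on='compound')
# Keep each unordered pair exactly once
merged = merged[merged['element_x'] < merged['element_y']]
pairs = (merged.groupby(['element_x', 'element_y'])
               .size()
               .reset_index(name='count')
               .rename(columns={'element_x': 'element1', 'element_y': 'element2'}))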
import pandas as pd
import itertools
data = {'compound': ['a','a','a','b','b','c','c','d','d','d','e','e','e','e'],
        'element': ['carbon','nitrogen','oxygen','hydrogen','nitrogen','nitrogen','oxygen','nitrogen','oxygen','carbon','carbon','nitrogen','oxygen','hydrogen']}
df = pd.DataFrame(data, columns=['compound', 'element'])
pair = pd.DataFrame(list(itertools.product(df['element'].unique(), df['element'].unique())))
pair.columns = ['element1', 'element2']
pair = pair[pair['element1'] != pair['element2']]
## create a tuple of the two names, sorted so that the order within a pair doesn't matter
pair["combined"] = [tuple(sorted(t)) for t in zip(pair.element1, pair.element2)]
pair.groupby("combined").size().reset_index(name="count")
Related
Well, the context is: I have a list of wind speeds, say 100 measurements from 0 to 50 km/h, loaded by uploading a CSV. I want to automate the creation of a list of bins every 5 km/h: the values that go from 0 to 5, those from 5 to 10, and so on.
On to the code:
wind = pd.read_csv("wind.csv")
df = pd.DataFrame(wind)
x = df["Value"]
d = sorted(pd.Series(x))
lst = [[] for i in range(0,(int(x.max())+1),5)]
This gives me a list of empty lists, i.e. if the winds go from 0 to 54 km/h it will create 11 empty lists.
Now, to classify I did this:
for i in range(0, len(lst), 1):
    for e in range(0, 55, 5):
        for n in d:
            if n > e and n < (e + 5):
                lst[i].append(n)
            else:
                continue
My objective is that when the loop reaches a number greater than 5, it jumps to the next level, that is, it adds 5 to the limits of the interval (e) and moves on to the next i to fill the second empty list in lst. I tried several variants because I imagine the loops must be nested in a specific order to give a good result. This code is just one example of several that I tried, but they all gave similar results: either every list was filled with all the numbers, or only the first list was filled with all the numbers.
Your title mentions classifying the numbers -- are you looking for a categorical output like calm | gentle breeze | strong breeze | moderate gale | etc.? If so, take a look at the second example in the pd.qcut docs.
Since you're already using pandas, use pd.cut with an IntervalIndex (constructed with the pd.interval_range function) to get a Series of bins, and then groupby on that.
import pandas as pd
from math import ceil
BIN_WIDTH = 5
wind_velocity = (pd.read_csv("wind.csv")["Value"]).sort_values()
upper_bin_lim = BIN_WIDTH * ceil(wind_velocity.max() / BIN_WIDTH)
bins = pd.interval_range(
    start=0,
    end=upper_bin_lim,
    periods=upper_bin_lim // BIN_WIDTH,
    closed='left')
velocity_bins = pd.cut(wind_velocity, bins)
groups = wind_velocity.groupby(velocity_bins)
for name, group in groups:
    # TODO: use `groups` to do stuff
    pass
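If the goal is the per-bin lists from the question, one minimal follow-up using the groups object built above might be:
lists_per_bin = groups.apply(list)  # one list of velocities per interval
print(lists_per_bin)
print(groups.size())  # or just the number of measurements per bin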
I am trying to do the following with NumPy array or normal array:
To push the data I am doing:
ar1 = []
# Read from the Pandas dataframe column. i is the row number of the data - it's working fine.
ar1.append(df['rolenumber'][i])
OUTPUT:
[34768, 34739, 34726, 34719, 34715]
The result may come out ascending, descending, or mixed.
I want to take the last 3 values and validate whether they are ascending, descending, or mixed.
Ascending: the last 3 values increase steadily. Example: 34726, 34739, 34745
Descending: the last 3 values decrease steadily. Example: 34726, 34719, 34715
Mixed: the last 3 values go big, then small, then big. Example: 34726, 34719, 34725
Note: no need to sort, only validate.
This little snippet should get you going:
import numpy as np

a = np.array([34768, 34739, 34726, 34719, 34715])
is_descending = np.all(np.diff(a[-3:]) < 0)
is_ascending = np.all(np.diff(a[-3:]) > 0)
is_mixed = ~(is_ascending | is_descending)
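Wrapped into a small reusable helper (a sketch; the function name classify_last3 is my own):
import numpy as np

def classify_last3(values):
    # Label the trend of the last three values as
    # 'ascending', 'descending', or 'mixed'
    diffs = np.diff(np.asarray(values)[-3:])
    if np.all(diffs > 0):
        return 'ascending'
    if np.all(diffs < 0):
        return 'descending'
    return 'mixed'

print(classify_last3([34768, 34739, 34726, 34719, 34715]))  # descending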
I have the following dataframe
car wire color
x 1 red
x 2 red
x 3 red
In this case each wire is colored 'red'. For the sake of this program it is safe to say that anything with the same color is connected. I want to use this information to make a NEW dataframe that would look like the following:
car wire connected
x 1 2
x 1 3
x 2 3
I have an inefficient but effective solution. Now that I know car is important, I included it in the answer. With the added complexity this might not be the best solution, because we have three nested for loops, which will hurt performance. Also hurting performance is appending to the dataframe every time we see a new row. Long story short: this code will work, but slooooooowly.
from itertools import combinations

result = pd.DataFrame(columns=['wire', 'connected', 'car'])  # Initialize result
# Iterate through car options
for car in df['car'].unique():
    tdf = df[df['car'] == car]
    # Iterate through color options
    for color in tdf['color'].unique():
        wires = list(tdf[tdf['color'] == color]['wire'])  # Get a list of wires
        # Iterate through all possible combinations of connected wires when choosing 2 wires
        for w in combinations(wires, 2):
            # Append that combination to the dataframe (index=[0] is needed
            # because the dict values are scalars)
            result = pd.concat([result, pd.DataFrame(
                {'wire': w[0], 'connected': w[1], 'car': car}, index=[0])],
                ignore_index=True)
            # Optional: if you need the reverse too, e.g. if 1 -> 2 also needs
            # 2 -> 1 in the result
            result = pd.concat([result, pd.DataFrame(
                {'wire': w[1], 'connected': w[0], 'car': car}, index=[0])],
                ignore_index=True)
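For larger frames, a vectorized sketch of the same idea (my own variant) uses a self-merge on car and color instead of explicit loops:
# Self-merge: every wire pairs with every wire sharing its car and color
merged = df.merge(df, on=['car', 'color'], suffixes=('', '_other'))
# Keep each unordered pair once and drop self-pairs
connected = (merged[merged['wire'] < merged['wire_other']]
             .rename(columns={'wire_other': 'connected'})
             [['car', 'wire', 'connected']]
             .reset_index(drop=True))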
I have a column in a dataframe called 'CREDIT RATING' for a number of companies across rows. I need to assign a numerical category to ratings from AAA to DDD, from 1 (AAA) to 0 (DDD). Is there a quick, simple way to do this, basically creating a new column with numbers from 1 to 0 in steps of 0.1? Thanks!
You could use replace:
df['CREDIT RATING NUMERIC'] = df['CREDIT RATING'].replace({'AAA':1, ... , 'DDD':0})
The easiest way is to simply create a dictionary mapping:
mymap = {"AAA":1.0, "AA":0.9, ... "DDD":0.0}
and then apply it to the dataframe:
df["CREDIT MAPPING"] = df["CREDIT RATING"].replace(mymap)
OK, this was kind of tough with nothing to work with, but here we go:
# First, get a ratings list (acquired from Wikipedia), then set it into a
# dataframe to replicate your scenario
import pandas as pd
import numpy as np

ratings = ['AAA', 'AA1', 'AA2', 'AA3', 'A1', 'A2', 'A3', 'BAA1', 'BAA2', 'BAA3', 'BA1', 'BA2', 'BA3', 'B1', 'B2', 'B3', 'CAA', 'CA', 'C', 'C', 'E', 'WR', 'UNSO', 'SD', 'NR']
df_credit_ratings = pd.DataFrame({'Ratings_id': ratings})
df_credit_ratings = pd.concat([df_credit_ratings, df_credit_ratings])  # just to replicate duplicate records
# set() gets the unique values
unique_ratings = set(df_credit_ratings['Ratings_id'])
number_of_ratings = len(unique_ratings)  # count how many unique ratings there are
number_of_ratings_by_tenth = number_of_ratings / 10  # because from 0 to 1, in steps of 0.1, there are 10 positions
# numpy's arange fills in values over a range (first two numbers), stepping by the third number
dec = list(np.arange(0.0, number_of_ratings_by_tenth, 0.1))
After this you'll need to match the unique ratings to their weights:
df_ratings_unique = pd.DataFrame({'Ratings_id':list(unique_ratings)}) # list so it gets one value per row
EDIT: as Thomas suggested in another answer's comment, this sort probably won't fit you, because it won't reflect the real order of importance of the ratings. So you'll probably need to first create a dataframe with the ratings already in order, with no need to sort.
df_ratings_unique.sort_values(by='Ratings_id', ascending=True, inplace=True)  # sorting so it matches the order of our weights above
Continuing the solution:
df_ratings_unique['Weight'] = dec  # adding the weights to the DF
df_ratings_unique.set_index('Ratings_id', inplace=True)  # setting the ratings as index to map the values below
# Now this is the magic: we create a new column in the original dataframe and
# map it according to the Ratings_id of our unique dataframe
df_credit_ratings['Weight'] = df_credit_ratings['Ratings_id'].map(df_ratings_unique.Weight)
I think I'm going about this all in a backwards way in pandas. Here's an example dataframe:
Group rstart rend qty
1 10000 11000 1000
1 10000 11000 8000
1 10000 11000 13000
1 10000 11000 1000
2 6000 8000 4000
2 6000 8000 9000
2 6000 8000 3000
In the end I'm trying to identify the quantity, or combination of quantities, within each group that falls within the range, and put a flag in a new column (and if possible save the combination in a new column too).
Here's what I have done so far and where I'm running into an issue; I've been trying all different ways since I'm new to this language.
import pandas as pd
import numpy as np
import itertools
df = pd.read_csv('test.csv')
d = df[['Group', 'qty']]
s = d.groupby('Group')['qty'].apply(list).to_dict()
comb = list(map(dict, itertools.combinations(s.items(), 2)))
The comb statement (and the multiple variations of it I tried) just prints the dictionary. I put 2 to test two-value combinations, but it's not working; this would have to be adjusted based on the number of values in the list.
I brought in the dataset and was thinking it would be best to create a dictionary with a list of quantities for each group, in order to create all the combinations in a separate table. Once I have the combinations and the sum of each, I can link back to the main dataframe to compare against the total and set the flag.
I'm running into issues with creating each combination of the quantities associated with a group and summing them. I can do it if everything is stored in one list across all dictionaries, but I need it grouped by the group. For instance, group 1 should have 1000,8000 and 1000,13000 and 1000,1000 and 1000,8000,13000 and so on. The number of combinations can vary by group.
Can anyone assist with guiding me in the right direction? Maybe my thinking is off on how to go about this.
Thank you
Here is one self-explanatory solution which uses itertools.combinations in conjunction with list comprehensions:
def aggregate(sub_df):
    # get boundaries and actual values
    bound_low = sub_df["rstart"].iloc[0]
    bound_high = sub_df["rend"].iloc[0]
    values = sub_df["qty"].values
    # get possible combinations, iterating over all combination lengths
    combis = [itertools.combinations(values, x + 1)
              for x in range(len(values))]
    # flatten all combis and apply the filter condition
    result = [combi for sub_combi in combis
              for combi in sub_combi
              if bound_low <= sum(combi) <= bound_high]
    return result
print(df.groupby("Group").apply(aggregate))
Group
1 [(1000, 8000, 1000)]
2 [(4000, 3000)]
dtype: object
However, I don't understand your "group 1 should have 1000,8000 and 1000,13000 and 1000,1000 and 1000,8000,13000" here.
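To produce the flag column the question asks for, a hedged follow-up could map the per-group result back onto the original frame (the column names flag and combos are my own):
combos_per_group = df.groupby("Group").apply(aggregate)
# flag rows whose group has at least one qualifying combination
df["flag"] = df["Group"].map(combos_per_group.apply(lambda c: len(c) > 0))
# optionally store the combinations themselves
df["combos"] = df["Group"].map(combos_per_group)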