Create new dataframe based on relationships from other dataframe

Create new dataframe based on relationships from other dataframe - python

I have the following dataframe
car wire color
x 1 red
x 2 red
x 3 red
In this case each wire is colored 'red'. For the sake of this program it is safe to say that anything with the same color is connected. I want to use this information to make a NEW dataframe that would look like the following:
car wire connected
x 1 2
x 1 3
x 2 3

I have an inefficient, but effective solution. Now that I know car is important, I included in the answer. With the added complexity, this might not be the best solution because we have 3 nesting for loops which will negatively impact performance. Also negatively impacting performance is appending every time we see a new row. Long story short: this code will work but slooooooowly.
from itertools import permutations
result = pd.DataFrame(columns=['wire','connected','car']) # Initialize result
# Iterate through car options
for car in df['car'].unique():
tdf = df[df['car']==car]
# Iterate through color options
for color in tdf['color'].unique():
wires = list(tdf[tdf['color']==color]['wire']) # Get a list of wires
# Iterate thru all possible combinations of connected wires when shoosing 2 wires
for w in permutations(wires,2):
# Append that possible combination to the dataframe
result = result.append(
pd.DataFrame({
'wire':w[0],
'connected':w[1],
'car':car
})
)
# Optional: If you need the revers too. For example if 1 -> 2 also needs 2 -> 1 in the result
result = result.append(
pd.DataFrame({
'wire':w[1],
'connected':w[0],
'car':car
})
)

Related

how to automatically classify a list of numbers

Well, the context is: I have a list of wind speeds, let's imagine, 100 wind measurements from 0 to 50 km/h, so I want to automate the creation of a list by uploading the csv, let's imagine, every 5 km/h, that is, the ones that they go from 0 to 5, what go from 5 to 10... etc.
Let's go to the code:
wind = pd.read_csv("wind.csv")
df = pd.DataFrame(wind)
x = df["Value"]
d = sorted(pd.Series(x))
lst = [[] for i in range(0,(int(x.max())+1),5)]
this gives me a list of empty lists, i.e. if the winds go from 0 to 54 km/h will create 11 empty lists.
Now, to classify I did this:
for i in range(0,len(lst),1):
for e in range(0,55,5):
for n in d:
if n>e and n< (e+5):
lst[i].append(n)
else:
continue
My objective would be that when it reaches a number greater than 5, it jumps to the next level, that is, it adds 5 to the limits of the interval (e) and jumps to the next i to fill the second empty list in lst. I tried it in several ways because I imagine that the loops must go in a specific order to give a good result. This code is just an example of several that I tried, but they all gave me similar results, either all the lists were filled with all the numbers, or only the first list was filled with all the numbers

Your title mentions classifying the numbers -- are you looking for a categorical output like calm | gentle breeze | strong breeze | moderate gale | etc.? If so, take a look at the second example on the pd.qcut docs.
Since you're already using pandas, use pd.cut with an IntervalIndex (constructed with the pd.interval_range function) to get a Series of bins, and then groupby on that.
import pandas as pd
from math import ceil
BIN_WIDTH = 5
wind_velocity = (pd.read_csv("wind.csv")["Value"]).sort_values()
upper_bin_lim = BIN_WIDTH * ceil(wind_velocity.max() / BIN_WIDTH)
bins = pd.interval_range(
start=0,
end=upper_bin_lim,
periods=upper_bin_lim//BIN_WIDTH,
closed='left')
velocity_bins = pd.cut(wind_velocity, bins)
groups = wind_velocity.groupby(velocity_bins)
for name, group in groups:
#TODO: use `groups` to do stuff

A dataframe splitting problem in Pandas, any thoughts?

The probe of an instrument is cycling back and forward along an x direction while is recording its position and acquiring the measurements. The probe makes 10 cycles, let's say from 0 to 10 um (go and back) and records the measurements. This gives 2 columns of data: position and measurement, where the position number cycle 0um->10um->0->10->0..., but these numbers have an experimental error so they are all different.
I need to split the dataframe at the beginning of each cycle. Any interesting strategy to tackle this problem? Please, let me know if you need more info. Thank in advance.
Below is link to an example of the dataframe that I have.
https://www.dropbox.com/s/af4r8lw5lfhwexr/Example.xlsx?dl=0
In this example the instrument made 3 cycles and generated the data (measurement). Cycle 1 = Index 0-20; Cycle 1 = Index 20-40; and Cycle 1 = Index 40-60. I need to divide this dataframe into 3 dataframes, one for each cycle (Index 0-20; Index 20-40; Index 40-60).
The tricky part is that the method needs to be "general" because each cycle can have a different number of datapoints (in this example that is fixed to 20), and different experiments can be performed with a different number of cycles.

My objective is to keep tract when the numbers start increasing again after decreasing to determine the cycle number. Not very elegant sorry.
import pandas as pd
df = pd.read_excel('Example.xlsx')
def cycle(array):
increasing = 1
cycle_num = 0
answer = []
for ind,val in enumerate(array):
try:
if array[ind+1]-array[ind]>=0:
if increasing==0:
cycle_num+=1
increasing=1
answer.append(cycle_num)
else:
answer.append(cycle_num)
increasing=0
except:
answer.append(cycle_num)
return answer
df['Cycle'] = cycle(df['Distance'].to_list())
grouped = df.groupby(['Cycle'])
print(grouped.get_group(0))
print(grouped.get_group(1))
print(grouped.get_group(2))

How to switch column elements for only one specific row in pandas?

So, I am working w/data from my research lab and am trying to sort it and move it around etc. And most of the stuff isn't important to my issue and I don't want to go into detail because confidentiality stuff, but I have a big table w/columns and rows and I want to specifically switch the elements of two columns ONLY in one row.
The extremely bad attempt at code I have for it is this (I rewrote the variables to be more vague though so they make sense):
for x in df.columna.values:
*some if statements*
df.loc[df.index([df.loc[df['columna'] == x]]), ['columnb', 'columna']] = df[df.index([df.loc[df['columna'] == x]]), ['columna', 'columnb']].numpy()
I am aware that the code I have is trash (and also the method - w/the for loops and if statements. I know I can abstract it a TON but I just want to actually figure out a way to make it work and them I will clean it up and make it prettier and more efficient. I learned pandas existed on tuesday so I am not an expert), but I think my issue lies in the way I'm getting the row.
One error I was recently getting for a while is the method I was using to get the row was giving me 1 row x 22 columns and I think I needed the name/index of the row instead. Which is why the index function is now there. However, I am now getting the error:
TypeError: 'RangeIndex' object is not callable
And I am just so confused all around. Sorry I've written a ton of text, basically: is there any simpler way to just switch the elements of two columns for one specific row (in terms of x, an element in that row)?
I think my biggest issue is trying to like- get the rows "name" in the format it wants. Although I may have a ton of other problems because honestly I am just really lost.

You're sooooo close! The error you're getting stems from trying to slice df.index([df.loc[df['columna'] == x]]). The parentheses are unneeded here and this should read as: df.index[df.loc[df['columna'] == x]].
However, here's an example on how to swap values between columns when provided a value (or multiple values) to swap at.
Sample Data
df = pd.DataFrame({
"A": list("abcdefg"),
"B": [1,2,3,4,5,6,7]
})
print(df)
A B
0 a 1
1 b 2
2 c 3
3 d 4
4 e 5
5 f 6
6 g 7
Let's say we're going to swap the values where A is either "c" or "f". To do this we need to first create a mask that just selects those rows. To accomplish this, we can use .isin. Then to perform our swap, we actually take the same exact approach you had! Including the .to_numpy() is very important, because without it Pandas will actually realign your columns for you and cause the values to not be swapped. Putting it all together:
swap_at = ["c", "f"]
swap_at_mask = df["A"].isin(swap_at) # mask where columns "A" is either equal to "c" or "f"
# Without the `.to_numpy()` at the end, pandas will realign the Dataframe
# and no values will be swapped
df.loc[swap_at_mask, ["A", "B"]] = df.loc[swap_at_mask, ["B", "A"]].to_numpy()
print(df)
A B
0 a 1
1 b 2
2 3 c
3 d 4
4 e 5
5 6 f
6 g 7

I think it was probably a syntax problem. I am assuming you are using tensorflow with the numpy() function? Try this it switches the columns based on the code you provided:
for x in df.columna.values:
# *some if statements*
df.loc[
(df["columna"] == x),
['columna', 'columnb']
] = df.loc[(df["columna"] == x), ['columnb', 'columna']].values.numpy()
I am also a beginner and would recommend you aim to make it pretty from the get go. It will save you a lot of extra time in the long run. Trial and error!

What's the best way to use fuzzywuzzy to compare each value of a column with all the values of a separate dataframe's column?

Having a really tough time with this one. Say I have two dataframes, one that has fruits and another one that has types of fruit candy. There's lots of other data in each of the dataframes. So it looks something like this:
fruit:
fruitId fruitName
0 1 banana
1 2 orange
2 3 apple
3 4 pear
4 5 lemon
candy:
candyId candyName fruitId
0 1 Orange Julius null
1 2 Bananarama null
2 3 Sour Lemon Drops null
3 4 Chocolate Bar null
4 5 Applicious null
I need to match the candyName with the proper fruit, and then put the corresponding fruitId into the fruitId column in the candy dataframe. Let's assume for my purposes that .contains doesn't work at all; there too many creative spellings and outright misspellings in the candyName column.
I have tried to define a function that uses fuzzywuzzy, and then use that in .map, but I can't get the function to work. It needs to check each value of the first df to see if it's in the second, and then move onto the next value, etc. The functions I end up building keep wanting to do comparisons where they're either (a) in the same dataframe, or (b) in the same row.
I did find a solution to this, but it's ugly because it uses iterrows() which you're not supposed to use. Here it is:
import pandas as pd
from fuzzywuzzy import fuzz
candy_file = 'candy.csv'
fruit_file = 'fruits.csv'
candy = pd.read_csv(candy_file)
fruit = pd.read_csv(fruit_file)
dict = {}
for i, row1 in candy.iterrows():
for j, row2 in fruit.iterrows():
if fuzz.partial_ratio(row1['candyName'], row2['fruitName']) >= 80:
dict[row1['candyName']] = row2['fruitId']
candy['fruitId'] = candy['candyName'].map(dict)
This takes forever. Like, 10 minutes to get through 500 rows. Is there a better way to do this? I've written like a hundred different code snippets out for faster functions without getting anywhere.
Thanks!

It's slow because you're currently working in O(N^2).
Rather than using iterrows, use dictionaries to iterate instead. This can be done with the following:
candydict = candy.to_dict{}
fruitdict = fruit.to_dict{}
for k,v in candydict.items():
for k2,v2 in fruitdict.items():
#do the rest of your comparisons here
This should speed it up significantly.

Python: Create a dataframe with pairs and count

I am new to python and I have a big dataframe. Want to count the pairs of column elements appearing in the dataframe:
Sample code
import pandas as pd
import itertools
data = {'compound': ['a','a','a','b','b','c','c','d','d','d','e','e','e','e'],
'element': ['carbon','nitrogen','oxygen','hydrogen','nitrogen','nitrogen','oxygen','nitrogen','oxygen','carbon','carbon','nitrogen','oxygen','hydrogen']
}
df = pd.DataFrame(data, columns = ['compound', 'element'])
pair = pd.DataFrame(list(itertools.product(df['element'].unique() , df['element'].unique() )))
pair.columns = ['element1', 'element2']
pair=pair[pair['element1']!=pair['element2']]
I want to create count of each pair i.e.
count = []
for index,row in pair.iterrows():
df1 = df[df['element']==row['element1']]
df2 = df[df['element']==row['element2']]
df_merg = pd.merge(df1,df2,on='compound')
count.append(len(df_merg.index))
pair['count'] = count
Problem 1
This does not work on a df of 2.8 million rows (or is very slow), Can you somebody please suggest an efficient method?
Problem 2
The pair creates duplicates due to product i.e. ['carbon','nitrogen'] as well as ['nitrogen','carbon'] are part of pair. Can I somehow have unique combinations?
Problem 3
The final dataframe 'pair' has indexes messed up. I am new to python and havent used .iloc much. What am I missing? e.g.
Image for row number

Does this work?
I think this can be better done with dicts instead of dataframes. I first convert the input dataframe into a dict so we can use it easily without having to subset repeatedly (which would be slow). This should help with Problem 1.
Problem 2 can be solved by using itertools.combinations, as shown below. Problem 3 doesn't happen in my proposed solution. For your solution, you can solve problem 3 by resetting the index (assuming index is not useful) like so: pair.reset_index(drop=True).
import pandas as pd
import itertools
data = {'compound': ['a','a','a','b','b','c','c','d','d','d','e','e','e','e'],
'element': ['carbon','nitrogen','oxygen','hydrogen','nitrogen','nitrogen','oxygen','nitrogen','oxygen','carbon','carbon','nitrogen','oxygen','hydrogen']
}
df = pd.DataFrame(data, columns = ['compound', 'element'])
# If these are real compounds and elements, each value in the following
# dict should be small because there are only 118 elements (more are
# hypothesized but not yet made). Even if there are a large number of
# compounds, storing them as a dict should not be too different from storing
# them in a dataframe that has repeated compound names.
compound_to_elements = {
compound: set(subset['element'])
for compound, subset in df.groupby(by=['compound'])
}
# Generate combinations that ignore order
combos = list(itertools.combinations(list(set(df['element'])), 2))
counts = [0] * len(combos)
# For each element pair, find out how many distinct compounds does it belong to.
# The looping order can be switched depending upon whether there are more
# compounds or more 2-element combinations.
for _, elements in compound_to_elements.items():
for i, (element1, element2) in enumerate(combos):
if (element1 in elements) and (element2 in elements):
counts[i] += 1
pairs = pd.DataFrame.from_records(combos, columns=['element1', 'element2'])
pairs['count'] = counts
# element1 element2 count
# 0 nitrogen hydrogen 2
# 1 nitrogen oxygen 4
# 2 nitrogen carbon 3
# 3 hydrogen oxygen 1
# 4 hydrogen carbon 1
# 5 oxygen carbon 3
Alternative Solution.
The solution above has room for improvement because we checked whether or not an element is a part of a compound multiple times (for example, we check "nitrogen" is a part of "a" multiple times -- once for each combination). The following alternative solution improves upon the previous solution by removing such an inefficiency. Which solution is feasible or faster would depend a little bit on your exact data and available memory.
# If these are real compounds and elements, then the number of keys in
# the following dict should be small because there are only 118 elements
# (more are hypothesized but not yet made). But, some values may be big
# sets of compounds (such as for Carbon).
element_to_compounds = {
element: set(subset['compound'])
for element, subset in df.groupby(by=['element'])
}
# Generate combinations that ignore order
combos = list(itertools.combinations(list(set(df['element'])), 2))
counts = [
len(element_to_compounds[element1]
.intersection(element_to_compounds[element2]))
for (element1, element2) in combos
]
pairs = pd.DataFrame.from_records(combos, columns=['element1', 'element2'])
pairs['count'] = counts
# element1 element2 count
# 0 nitrogen hydrogen 2
# 1 nitrogen oxygen 4
# 2 nitrogen carbon 3
# 3 hydrogen oxygen 1
# 4 hydrogen carbon 1
# 5 oxygen carbon 3

import pandas as pd
import itertools
data = {'compound': ['a','a','a','b','b','c','c','d','d','d','e','e','e','e'],
'element': ['carbon','nitrogen','oxygen','hydrogen','nitrogen','nitrogen','oxygen','nitrogen','oxygen','carbon','carbon','nitrogen','oxygen','hydrogen']
}
df = pd.DataFrame(data, columns = ['compound', 'element'])
pair = pd.DataFrame(list(itertools.product(df['element'].unique() , df['element'].unique() )))
pair.columns = ['element1', 'element2']
pair=pair[pair['element1']!=pair['element2']]
## create a tuple of names
## sort the values in combined column if order doesn't matter to you
pair["combined"] = tuple(zip(pair.element1, pair.element2))
pair.groupby("combined").size().reset_index(name= "count")

We Keep Coding

Python is a programming language that lets you work quickly and integrate systems more effectively.

Create new dataframe based on relationships from other dataframe - python

Related

how to automatically classify a list of numbers

A dataframe splitting problem in Pandas, any thoughts?

How to switch column elements for only one specific row in pandas?

What's the best way to use fuzzywuzzy to compare each value of a column with all the values of a separate dataframe's column?

Python: Create a dataframe with pairs and count

Categories

Resources