The 'ratings' DataFrame has two columns of interest: User-ID and Book-Rating.
I'm trying to make a histogram showing the number of books read per user in this dataset; in other words, I'm looking to count Book-Ratings per User-ID. I'll include the dataset in case anyone wants to check it out.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
!wget https://raw.githubusercontent.com/porterjenkins/cs180-intro-data-science/master/data/ratings_train.csv
ratings = pd.read_csv('ratings_train.csv')
# Remove Values where Ratings are Zero
ratings2 = ratings.loc[(ratings != 0).all(axis=1)]
# Sort by User
ratings2 = ratings2.sort_values(by=['User-ID'])
usersList = []
booksRead = []
for i in range(2000):
    numBooksRead = ratings2.isin([i]).sum()['User-ID']
    if numBooksRead != 0:
        usersList.append(i)
        booksRead.append(numBooksRead)
new_dict = {'User_ID':usersList,'booksRated':booksRead}
usersBooks = pd.DataFrame(new_dict)
usersBooks
The code works as is, but it took almost 5 minutes to complete. And this is the problem: the dataset has 823,000 values. So if it took me 5 minutes to sort through only the first 2000 numbers, I don't think it's feasible to go through all of the data.
I should also admit that I'm sure there's a better way to make a DataFrame than creating two lists, turning them into a dict, and then making that into a DataFrame.
Mostly I just want to know how to go through all this data in a way that won't take all day.
Thanks in advance!!
It seems you want a list of user IDs with a count of how often each ID appears in the dataframe. Use value_counts() for that:
ratings = pd.read_csv('ratings_train.csv')
# Remove Values where Ratings are Zero
ratings2 = ratings.loc[(ratings != 0).all(axis=1)]
In [74]: ratings2['User-ID'].value_counts()
Out[74]:
11676     6836
98391     4650
153662    1630
189835    1524
23902     1123
          ...
258717       1
242214       1
55947        1
256110       1
252621       1
Name: User-ID, Length: 21553, dtype: int64
The result is a Series with the User-ID as index, and the value is the number of books read (or rather, the number of books rated) by that user.
Note: be aware that the result is heavily skewed: there are a few very active readers, but most users will have rated only a few books. As a result, almost all of the mass in your histogram will end up in the first bin.
Taking the log (or plotting with the x-axis on a log scale) may give a clearer histogram:
s = ratings2['User-ID'].value_counts()
np.log(s).hist()
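For the log-scale x-axis variant, one option (a minimal sketch reusing the imports from the question; the bin count of 30 is arbitrary) is to build log-spaced bins with NumPy:
counts = ratings2['User-ID'].value_counts()
bins = np.logspace(0, np.log10(counts.max()), 30)  # 30 log-spaced bins from 1 to the max count
plt.hist(counts, bins=bins)
plt.xscale('log')
plt.xlabel('books rated per user')
plt.ylabel('number of users')
plt.show()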
First filter column Book-Rating to remove 0 values, then count values with Series.value_counts and convert to a DataFrame; a loop is not necessary here:
ratings = pd.read_csv('ratings_train.csv')
ratings2 = ratings[ratings['Book-Rating'] != 0]
usersBooks = (ratings2['User-ID'].value_counts()
.sort_index()
.rename_axis('User_ID')
.reset_index(name='booksRated'))
print (usersBooks)
       User_ID  booksRated
0            8           6
1           17           4
2           44           1
3           53           3
4           69           2
...        ...         ...
21548   278773           3
21549   278782           2
21550   278843          17
21551   278851          10
21552   278854           4

[21553 rows x 2 columns]
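An equivalent way to build the same DataFrame, if you prefer groupby (just an alternative sketch; in recent pandas versions groupby(...).size() with as_index=False returns a DataFrame directly):
usersBooks = (ratings2.groupby('User-ID', as_index=False)
                      .size()
                      .rename(columns={'User-ID': 'User_ID', 'size': 'booksRated'}))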
The probe of an instrument cycles back and forth along the x direction while recording its position and acquiring measurements. The probe makes 10 cycles, say from 0 to 10 um (out and back), and records the measurements. This gives two columns of data: position and measurement, where the position numbers cycle 0 um -> 10 um -> 0 -> 10 -> 0..., but these numbers carry experimental error, so they are all different.
I need to split the dataframe at the beginning of each cycle. Any interesting strategy to tackle this problem? Please let me know if you need more info. Thanks in advance.
Below is link to an example of the dataframe that I have.
https://www.dropbox.com/s/af4r8lw5lfhwexr/Example.xlsx?dl=0
In this example the instrument made 3 cycles and generated the data (measurement). Cycle 1 = index 0-20; cycle 2 = index 20-40; and cycle 3 = index 40-60. I need to divide this dataframe into 3 dataframes, one for each cycle (index 0-20; index 20-40; index 40-60).
The tricky part is that the method needs to be "general", because each cycle can have a different number of datapoints (in this example it is fixed at 20), and different experiments can be performed with a different number of cycles.
My approach is to keep track of when the numbers start increasing again after decreasing, to determine the cycle number. Not very elegant, sorry.
import pandas as pd
df = pd.read_excel('Example.xlsx')
def cycle(array):
    increasing = 1
    cycle_num = 0
    answer = []
    for ind, val in enumerate(array):
        try:
            if array[ind + 1] - array[ind] >= 0:
                # position is (still) increasing; if it was decreasing before,
                # a new cycle starts here
                if increasing == 0:
                    cycle_num += 1
                increasing = 1
                answer.append(cycle_num)
            else:
                answer.append(cycle_num)
                increasing = 0
        except IndexError:
            # the last element has no successor; keep it in the current cycle
            answer.append(cycle_num)
    return answer

df['Cycle'] = cycle(df['Distance'].to_list())
grouped = df.groupby(['Cycle'])
print(grouped.get_group(0))
print(grouped.get_group(1))
print(grouped.get_group(2))
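A vectorized alternative (a minimal sketch of the same idea, not the code above): a new cycle starts wherever the position switches from decreasing to increasing, which can be detected with diff() and a cumulative sum. It may label flat segments slightly differently than the loop above.
import pandas as pd

df = pd.read_excel('Example.xlsx')
increasing = df['Distance'].diff().fillna(0) >= 0             # True while the probe moves forward
new_cycle = increasing & ~increasing.shift(fill_value=True)   # decreasing -> increasing transitions
df['Cycle'] = new_cycle.cumsum()                              # 0, 1, 2, ... per cycle
cycles = {k: g for k, g in df.groupby('Cycle')}               # one dataframe per cycle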
I am new to Python and I have a big dataframe. I want to count the pairs of column elements appearing in the dataframe:
Sample code
import pandas as pd
import itertools
data = {'compound': ['a','a','a','b','b','c','c','d','d','d','e','e','e','e'],
'element': ['carbon','nitrogen','oxygen','hydrogen','nitrogen','nitrogen','oxygen','nitrogen','oxygen','carbon','carbon','nitrogen','oxygen','hydrogen']
}
df = pd.DataFrame(data, columns = ['compound', 'element'])
pair = pd.DataFrame(list(itertools.product(df['element'].unique() , df['element'].unique() )))
pair.columns = ['element1', 'element2']
pair=pair[pair['element1']!=pair['element2']]
I want to create a count for each pair, i.e.
count = []
for index, row in pair.iterrows():
    df1 = df[df['element'] == row['element1']]
    df2 = df[df['element'] == row['element2']]
    df_merg = pd.merge(df1, df2, on='compound')
    count.append(len(df_merg.index))
pair['count'] = count
Problem 1
This does not work on a df of 2.8 million rows (or is very slow). Can somebody please suggest an efficient method?
Problem 2
pair contains duplicates due to the product, i.e. both ['carbon','nitrogen'] and ['nitrogen','carbon'] appear in pair. Can I somehow keep only unique combinations?
Problem 3
The final dataframe 'pair' has its index messed up (the row numbers are no longer consecutive after the filtering). I am new to Python and haven't used .iloc much. What am I missing?
Does this work?
I think this can be better done with dicts instead of dataframes. I first convert the input dataframe into a dict so we can use it easily without having to subset repeatedly (which would be slow). This should help with Problem 1.
Problem 2 can be solved by using itertools.combinations, as shown below. Problem 3 doesn't happen in my proposed solution. For your solution, you can solve problem 3 by resetting the index (assuming index is not useful) like so: pair.reset_index(drop=True).
import pandas as pd
import itertools
data = {'compound': ['a','a','a','b','b','c','c','d','d','d','e','e','e','e'],
'element': ['carbon','nitrogen','oxygen','hydrogen','nitrogen','nitrogen','oxygen','nitrogen','oxygen','carbon','carbon','nitrogen','oxygen','hydrogen']
}
df = pd.DataFrame(data, columns = ['compound', 'element'])
# If these are real compounds and elements, each value in the following
# dict should be small because there are only 118 elements (more are
# hypothesized but not yet made). Even if there are a large number of
# compounds, storing them as a dict should not be too different from storing
# them in a dataframe that has repeated compound names.
compound_to_elements = {
    compound: set(subset['element'])
    for compound, subset in df.groupby(by=['compound'])
}
# Generate combinations that ignore order
combos = list(itertools.combinations(list(set(df['element'])), 2))
counts = [0] * len(combos)
# For each element pair, find out how many distinct compounds does it belong to.
# The looping order can be switched depending upon whether there are more
# compounds or more 2-element combinations.
for _, elements in compound_to_elements.items():
    for i, (element1, element2) in enumerate(combos):
        if (element1 in elements) and (element2 in elements):
            counts[i] += 1
pairs = pd.DataFrame.from_records(combos, columns=['element1', 'element2'])
pairs['count'] = counts
# element1 element2 count
# 0 nitrogen hydrogen 2
# 1 nitrogen oxygen 4
# 2 nitrogen carbon 3
# 3 hydrogen oxygen 1
# 4 hydrogen carbon 1
# 5 oxygen carbon 3
Alternative Solution.
The solution above has room for improvement because we check whether an element is part of a compound multiple times (for example, we check whether "nitrogen" is part of "a" once per combination it appears in). The following alternative solution removes that inefficiency. Which solution is feasible or faster will depend a little on your exact data and available memory.
# If these are real compounds and elements, then the number of keys in
# the following dict should be small because there are only 118 elements
# (more are hypothesized but not yet made). But, some values may be big
# sets of compounds (such as for Carbon).
element_to_compounds = {
    element: set(subset['compound'])
    for element, subset in df.groupby(by=['element'])
}
# Generate combinations that ignore order
combos = list(itertools.combinations(list(set(df['element'])), 2))
counts = [
    len(element_to_compounds[element1]
        .intersection(element_to_compounds[element2]))
    for (element1, element2) in combos
]
pairs = pd.DataFrame.from_records(combos, columns=['element1', 'element2'])
pairs['count'] = counts
# element1 element2 count
# 0 nitrogen hydrogen 2
# 1 nitrogen oxygen 4
# 2 nitrogen carbon 3
# 3 hydrogen oxygen 1
# 4 hydrogen carbon 1
# 5 oxygen carbon 3
import pandas as pd
import itertools
data = {'compound': ['a','a','a','b','b','c','c','d','d','d','e','e','e','e'],
'element': ['carbon','nitrogen','oxygen','hydrogen','nitrogen','nitrogen','oxygen','nitrogen','oxygen','carbon','carbon','nitrogen','oxygen','hydrogen']
}
df = pd.DataFrame(data, columns = ['compound', 'element'])
pair = pd.DataFrame(list(itertools.product(df['element'].unique() , df['element'].unique() )))
pair.columns = ['element1', 'element2']
pair=pair[pair['element1']!=pair['element2']]
## create a tuple of names
## sort the values in combined column if order doesn't matter to you
pair["combined"] = tuple(zip(pair.element1, pair.element2))
pair.groupby("combined").size().reset_index(name= "count")
I have the following dataframe
car  wire  color
  x     1    red
  x     2    red
  x     3    red
In this case each wire is colored 'red'. For the sake of this program it is safe to say that anything with the same color is connected. I want to use this information to make a NEW dataframe that would look like the following:
car  wire  connected
  x     1          2
  x     1          3
  x     2          3
I have an inefficient but effective solution. Now that I know car matters, I have included it in the answer. With the added complexity this might not be the best solution, because we have three nested for loops, which will hurt performance. Also hurting performance is appending to the dataframe every time we see a new row. Long story short: this code will work, but slooooooowly.
from itertools import combinations

import pandas as pd

result = pd.DataFrame(columns=['wire', 'connected', 'car'])  # Initialize result

# Iterate through car options
for car in df['car'].unique():
    tdf = df[df['car'] == car]
    # Iterate through color options
    for color in tdf['color'].unique():
        wires = list(tdf[tdf['color'] == color]['wire'])  # Get a list of wires
        # Iterate through all possible combinations of connected wires when choosing 2 wires
        for w in combinations(wires, 2):
            # Append that possible combination to the dataframe
            # (DataFrame.append was removed in pandas 2.0; use pd.concat there)
            result = result.append(
                pd.DataFrame({
                    'wire': [w[0]],
                    'connected': [w[1]],
                    'car': [car]
                })
            )
            # Optional: if you need the reverse too, e.g. if 1 -> 2 also needs 2 -> 1 in the result
            result = result.append(
                pd.DataFrame({
                    'wire': [w[1]],
                    'connected': [w[0]],
                    'car': [car]
                })
            )
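A vectorized alternative worth considering (a minimal sketch, not the code above): merge the dataframe with itself on car and color, then keep each unordered pair once.
import pandas as pd

# example input from the question
df = pd.DataFrame({'car': ['x', 'x', 'x'],
                   'wire': [1, 2, 3],
                   'color': ['red', 'red', 'red']})

pairs = df.merge(df, on=['car', 'color'], suffixes=('', '_other'))
pairs = pairs[pairs['wire'] < pairs['wire_other']]            # keep each unordered pair once
result = (pairs.rename(columns={'wire_other': 'connected'})
               [['car', 'wire', 'connected']])
# gives (x, 1, 2), (x, 1, 3), (x, 2, 3) for the example input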
I have sorted a roughly 1 million row dataframe by a certain column. I would like to assign groups to each observation based on equal sums of another column but I'm not sure how to do this.
Example below:
import pandas as pd
value1 = [25,27,20,22,28,20]
value2 = [.34,.43,.54,.43,.5,.7]
df = pd.DataFrame({'value1':value1,'value2':value2})
df = df.sort_values('value1', ascending=False)
df['wanted_result'] = [1,1,1,2,2,2]
Like this example, I want to sum my column (here value1) and assign groups so that the group sums of value1 are as close to equal as possible. Is there a built-in function for this?
Greedy Loop
Using Numba's JIT to quicken it up.
from numba import njit
@njit
def partition(c, n):
    delta = c[-1] / n
    group = 1
    indices = [group]
    total = delta
    for left, right in zip(c, c[1:]):
        left_diff = total - left
        right_diff = total - right
        # step to the next group once the cumulative sum passes the current
        # target and the right value is farther from it than the left one
        if right > total and abs(right_diff) > abs(left_diff):
            group += 1
            total += delta
        indices.append(group)
    return indices
df.assign(result=partition(df.value1.to_numpy().cumsum(), n=2))
   value1  value2  result
4      28    0.50       1
1      27    0.43       1
0      25    0.34       1
3      22    0.43       2
2      20    0.54       2
5      20    0.70       2
This is NOT optimal; it is a greedy heuristic. It goes through the list and finds where we step over to the next group. At that point it decides whether it's better to include the current point in the current group or in the next group.
This should behave fairly well, except in cases with a huge disparity in values where the larger values come towards the end. That is because the algorithm is greedy and only looks at what it knows at the moment, not at everything at once.
But like I said, it should be good enough.
I think this is a kind of (non-linear) optimization problem, and Pandas is definitely not a good candidate to solve it.
The basic idea for solving the problem can be as follows.
Definitions:
- n: number of elements,
- groupNo: the number of groups to divide into.
1. Start by generating an initial solution, e.g. put consecutive chunks of n / groupNo elements into each bin.
2. Define a goal function, e.g. the sum of squared differences between the sum of each group and (sum of all elements) / groupNo.
3. Perform an iteration: for each pair of elements a and b from different bins, calculate the new goal function value if these elements were swapped; select the pair that gives the greatest improvement of the goal function and perform the swap (move a from its present bin to the bin where b is, and vice versa).
4. If no such pair can be found, we have the final result.
Maybe someone will propose a better solution, but at least this is a concept to start with; a rough sketch of these steps is shown below.
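A minimal Python/NumPy sketch of the swap-based local search described above (the function name and the swap-per-pass scheme are illustrative choices, and the quadratic pair scan makes it impractical for a million rows as written):
import numpy as np

def partition_equal_sums(values, group_no, max_passes=100):
    """Greedy local search: start with consecutive chunks, then keep swapping
    the pair of elements (from different bins) that most improves the goal."""
    values = np.asarray(values, dtype=float)
    n = len(values)
    target = values.sum() / group_no
    # initial solution: consecutive chunks of roughly n / group_no elements
    groups = np.repeat(np.arange(group_no), int(np.ceil(n / group_no)))[:n]

    def goal(g):
        sums = np.array([values[g == k].sum() for k in range(group_no)])
        return ((sums - target) ** 2).sum()

    best = goal(groups)
    for _ in range(max_passes):
        best_swap, best_val = None, best
        for i in range(n):
            for j in range(i + 1, n):
                if groups[i] == groups[j]:
                    continue
                groups[i], groups[j] = groups[j], groups[i]   # try the swap
                val = goal(groups)
                groups[i], groups[j] = groups[j], groups[i]   # undo it
                if val < best_val:
                    best_swap, best_val = (i, j), val
        if best_swap is None:
            break                                             # no improving swap left
        i, j = best_swap
        groups[i], groups[j] = groups[j], groups[i]           # keep the best swap
        best = best_val
    return groups + 1                                         # 1-based group labels

# usage on the example from the question:
# df['wanted_result'] = partition_equal_sums(df['value1'].to_numpy(), 2)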