normal = []
nine_plus = []
tw_plus = []
for i in df['SubjectID'].unique():
    x = df.loc[df['SubjectID'] == i]
    terms = len(x['Year Term ID'].unique())
    if terms <= 8:
        normal.append(i)
    elif 9 <= terms < 13:
        nine_plus.append(i)
    elif terms >= 13:
        tw_plus.append(i)
Hello, I am dealing with a dataset that has 10 million rows. The dataset contains student records, and I am trying to classify the students into three groups according to how many semesters they have attended. I feel like I am using a very crude method right now, and there could be a more efficient way of categorizing. Any suggestions?
You go through a lot of repeated iterations, which will likely make your data frame approach slower than even a simple Python list. Use the data frame's organization in your favor.
Group your rows by SubjectID, then look at Year Term ID within each group.
Extract the number of unique Year Term ID values in each group -- what you currently compute as len(x['Year Term ID'].unique()).
Make a function, lambda, or extra column that represents the classification; calling that count load:
0 if load <= 8 else 1 if load <= 12 else 3
Use that expression to re-group your students into the three desired classifications.
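A minimal sketch of that approach, assuming the column names from your snippet (the classify helper and the numeric group labels are just an illustration):

term_counts = df.groupby('SubjectID')['Year Term ID'].nunique()

def classify(load):
    return 0 if load <= 8 else 1 if load <= 12 else 3

groups = term_counts.map(classify)
# groups is a Series indexed by SubjectID; e.g. to list the "normal" students:
normal = groups[groups == 0].index.tolist()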
Do not iterate through the rows of the data frame: that is a "code smell" indicating you're missing a vectorized capability.
Does that get you moving?
Related
I am a newbie to Pandas, and somewhat of a newbie to Python.
I am looking at stock data, which I read in from CSV; a typical size is 500,000 rows.
The data looks like this
'''
'''
I need to check the data against itself - the basic algorithm is a loop similar to:
row = 0
x = get "Low" price in row
y = CalculateSomething(x)
go through the rest of the data, compare against y
if (a):
    append "A" at the end of row   # in the dataframe
else:
    append "B" at the end of row
row = row + 1
On the next iteration, the pointer should reset to row 1 and go through the same process.
Each time, it adds notes to the dataframe at the current row index.
I looked at Pandas, and figured the way to try this would be to use two loops, copying the dataframe to maintain two separate instances.
The actual code looks like this (simplified)
import pandas as pd

df = pd.read_csv('data.csv')

calc1 = 1  # this part is confidential so set to something simple
calc2 = 2  # this part is confidential so set to something simple

def func3_df_index(df):
    dfouter = df.copy()
    for outerindex in dfouter.index:
        dfouter_openval = dfouter.at[outerindex, "Open"]
        for index in df.index:
            if df.at[index, "Low"] <= calc1 and index >= outerindex:
                dfouter.at[outerindex, 'notes'] = "message 1"
                break
            elif df.at[index, "High"] >= calc2 and index >= outerindex:
                dfouter.at[outerindex, 'notes'] = "message2"
                break
            else:
                dfouter.at[outerindex, 'notes'] = "message3"
This method is taking a long time (7+ minutes per 5K rows), which will be far too long for 500,000 rows. There may be data exceeding 1 million rows.
I have tried the two-loop method with the following variants:
using iloc - e.g. df.iloc[index, 2]
using at - e.g. df.at[index, "low"]
using numpy & at - e.g. df.at[index, "low"] = np.where((df.at[index, "low"] < ..."
The data is floating-point values and datetime strings.
Is it better to use numpy? Maybe there is an alternative to using two loops?
Any other methods, like using R, Mongo, or some other database (anything different from Python), would also be useful; I just need the results and am not necessarily tied to Python.
Any help and constructs would be greatly appreciated.
Thanks in advance
You are copying the dataframe and manually looping over the indices. This will almost always be slower than vectorized operations.
If you only care about one row at a time, you can simply use the csv module.
numpy is not "better"; pandas uses numpy internally.
Alternatively, load the data into a database (sqlite, mysql/mariadb, postgres, or perhaps DuckDB) and run query commands against it. This has the added advantage of allowing type conversion from strings to floats, so numerical analysis is easier; a rough sketch of the sqlite route follows below.
If you really want to process the file in parallel directly from Python, you could move to Dask or PySpark, although Pandas should work with some tuning; for a start, reading from the database with Pandas' read_sql function would work better.
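A rough sketch of the database route, assuming the CSV is called data.csv and has "Open", "Low" and "High" columns (the file name, table name and threshold below are placeholders, not your confidential calculation):

import sqlite3
import pandas as pd

df = pd.read_csv('data.csv')
con = sqlite3.connect('stock.db')

# load the frame into a table, keeping the row index as row_id
df.to_sql('prices', con, if_exists='replace', index=True, index_label='row_id')

threshold = 1.0  # stand-in for the confidential calculation
# example query: rows whose Low falls at or below the threshold
hits = pd.read_sql('SELECT row_id, Low, High FROM prices WHERE Low <= ?',
                   con, params=(threshold,))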
You could split the main dataset into smaller datasets, e.g. 50 sub-datasets of 10,000 rows each, to increase speed. Run your function on each sub-dataset using threading or multiprocessing, and then combine the final results.
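A rough sketch of that chunked approach, assuming a picklable function process_chunk(frame) that does the per-chunk work and returns a frame (the function name is hypothetical):

import pandas as pd
from concurrent.futures import ProcessPoolExecutor

df = pd.read_csv('data.csv')

# split into sub-datasets of 10,000 rows each
chunk_size = 10_000
chunks = [df.iloc[i:i + chunk_size] for i in range(0, len(df), chunk_size)]

with ProcessPoolExecutor() as pool:
    results = list(pool.map(process_chunk, chunks))

combined = pd.concat(results)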
I'm using some machine learning from the SBERT python module to calculate the top-K most similar strings given an input corpus and a target corpus (in this case 100K vs 100K in size).
The module is pretty robust and gets the comparison done pretty fast, returning a list of dictionaries containing the top-K most similar comparisons for each input string in the format:
{Corpus ID : Similarity_Score}
I can then wrap it up in a dataframe with the query string list used as an index, giving me a dataframe in the format:
Query_String | Corpus_ID | Similarity_Score
The main time-sink with my approach, however, is matching up the Corpus ID with the string in the corpus so that I know which string the input is matched against. My current solution uses a combination of pandas apply with the pandarallel module:
def retrieve_match_text(row, corpus_list):
    dict_obj = row['dictionary']
    corpus_id = dict_obj['corpus_id']  # corpus ID is an integer representing the index of a list
    score = dict_obj['score']
    matched_corpus_keyword = corpus_list[corpus_id]  # list index lookup (speed this up)
    return [matched_corpus_keyword, score]

.....
.....

# expand the dictionary into two columns and match the corpus KW to its ID
output_df[['Matched Corpus KW', 'Score']] = output_df.parallel_apply(
    lambda x: pd.Series(retrieve_match_text(x, sentence_list_2)), axis=1)
This takes around 2 minutes for an input corpus of 100K against another corpus of 100K. However, I'm dealing with corpora of several million entries, so any further increase in speed here is welcome.
If I read the question correctly, you have the columns: Query_String and dictionary (is this correct?).
And then corpus_id and score are stored in that dictionary.
Your first target with pandas should be to work in a pandas-friendly way: avoid the nested dictionary and store values directly in columns. After that, you can use efficient pandas operations.
Indexing a list is not what is slow for you. If you do this correctly, it can be a whole-table merge/join and won't need any slow row-by-row apply or dictionary lookups.
Step 1. If you do this:
target_corpus = pd.Series(sentence_list_2, name="target_corpus")
Then you have an indexed series of one corpus (formerly the "list lookup").
Step 2. Get columns of score and corpus_id in your main dataframe.
Step 3. Use pd.merge to join the input corpus on corpus_id against the index of target_corpus, using how="left" (only items that match an existing corpus_id are relevant). This should be an efficient way to do it, and it's a whole-dataframe operation.
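A minimal sketch of these steps, assuming output_df has a 'dictionary' column holding {'corpus_id': ..., 'score': ...} as in your snippet (variable names are taken from the question; the joined column takes the Series name, here "target_corpus"):

import pandas as pd

# step 1: an indexed series of the target corpus (formerly the list lookup)
target_corpus = pd.Series(sentence_list_2, name="target_corpus")

# step 2: expand the nested dictionary into plain columns
expanded = pd.DataFrame(output_df['dictionary'].tolist(), index=output_df.index)
output_df[['corpus_id', 'score']] = expanded[['corpus_id', 'score']]

# step 3: whole-table left join against the indexed corpus
result = output_df.merge(target_corpus, left_on='corpus_id',
                         right_index=True, how='left')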
Develop and test the solution against a small subset (1K) to iterate quickly, then grow.
In setup for a collaborative filtering model on the MovieLens100k dataset in a Jupyter notebook, I'd like to show a dense crosstab of users vs movies. I figure the best way to do this is to show the n most frequent users against the m most frequent movies.
If you'd like to run it in a notebook, you should be able to copy/paste this after installing the fastai2 dependencies (it exports pandas among other internal libraries).
from fastai2.collab import *
from fastai2.tabular.all import *
path = untar_data(URLs.ML_100k)
# load the ratings from csv
ratings = pd.read_csv(path/'u.data', delimiter='\t', header=None,
                      names=['user','movie','rating','timestamp'])
# show a sample of the format
ratings.head(10)
# slice the most frequent n=20 users and movies
most_frequent_users = list(ratings.user.value_counts()[:20])
most_rated_movies = list(ratings.movie.value_counts()[:20])
denser_ratings = ratings[ratings.user.isin(most_frequent_users)]
denser_movies = ratings[ratings.movie.isin(most_rated_movies)]
# crosstab the most frequent users and movies, showing the ratings
pd.crosstab(denser_ratings.user, denser_movies.movie, values=ratings.rating, aggfunc='mean').fillna('-')
Results:
Expected:
The desired output is much denser than what I've done. My example seems to be a little bit better than random, but not by much. I have two hypotheses as to why it's not as dense as I want:
The most frequent users might not always rate the most rated movies.
My code has a bug which is making it index into the dataframe incorrectly for what I think I'm doing
Please let me know if you see an error in how I'm selecting the most frequent users and movies, or grabbing the matches with isin.
If that is correct (or really, regardless), I'd like to see how I would make a denser set of users and movies to crosstab. The next approach I've thought of is to grab the most frequent movies, and select the most frequent users from that dataframe instead of the global dataset. But I'm unsure how I'd do that - whether by searching for the most frequent users across all the top m movies, or somehow more generally finding the set of n*m most-linked users and movies.
I will post my code if I solve it before better answers arrive.
My code has a bug which is making it index into the dataframe
incorrectly for what I think I'm doing
True, there is a bug.
most_frequent_users = list(ratings.user.value_counts()[:20])
most_rated_movies = list(ratings.movie.value_counts()[:20])
is actually grabbing the value counts themselves. So if users 1, 2, and 3 made 100 reviews each, the above code would return [100, 100, 100] when really we wanted the ids [1, 2, 3]. To get the ids of the most frequent entries instead of the tally, add .index to value_counts:
most_frequent_users = list(ratings.user.value_counts().index[:20])
most_rated_movies = list(ratings.movie.value_counts().index[:20])
This alone improves the density to almost what the final result below shows. What I was doing before was effectively just a random sample (erroneously using value totals as the lookup for the user and movie ids).
Furthermore, the approach I mentioned at the end of the post is a more robust general solution for crosstabbing with highest density as the goal. Find the most frequent X, and within that specific set, find the most frequent Y. This will work well even in sparse datasets.
n_users = 10
n_movies = 20
# list the ids of the most frequent users (those who rated the most movies)
most_frequent_users = list(ratings.user.value_counts().index[:n_users])
# grab all the ratings made by these most frequent users
denser_users = ratings[ratings.user.isin(most_frequent_users)]
# list the ids of the most frequent movies within this group of users
dense_users_most_rated = list(denser_users.movie.value_counts().index[:n_movies])
# grab all the most frequent movies rated by the most frequent users
denser_movies = ratings[ratings.movie.isin(dense_users_most_rated)]
# plot the crosstab
pd.crosstab(denser_users.user, denser_movies.movie, values=ratings.rating, aggfunc='mean').fillna('-')
This is exactly what I was looking for.
The only questions remaining are: how standard is this approach? And why are some values floats?
I am doing some analysis in which I am applying a certain filter to remove certain users who have seen a particular piece of content (tactic).
The original dataframe looks like:
user    content    response
100     esample    0
101     esample    1
...
106     esample    0
Now this dataframe is simulated over various iterations, so in every iteration we have a different dataset with the same columns. At every iteration I want to filter out the users who have seen 'esample' and bring them back after two subsequent iterations. For instance, users in tactic1_list (those who have seen esample) should not be considered for analysis in the 3rd and 4th iterations; they are brought back into the analysis in the 5th iteration.
for i in range(2, 10):
    # apply a filter on the original dataframe (df) and create a list at each iteration:
    # segregate the users who have seen the Esample content
    tactic1_list = df.loc[(df.tactic == 'Esample') & (df.response > 0)]['muid'].tolist()
    # exclude these users from the original dataframe
    tactic1_sample_muids = list(set(df.muid).difference(tactic1_list))
    ### further analysis
Now I want to code it in such a way that the users in tactic1_list are only used again after the two subsequent iterations. I was thinking of using continue, but I am not sure how to do that. Any help would be appreciated.
This answer is about how to use continue in your case to achieve your goal:
for i in range(2, 10):
    if i in [3, 4]:  # i.e. the 3rd and 4th iterations
        continue
    # otherwise process users
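If the goal is to skip the excluded users rather than whole iterations, here is a minimal sketch of one way to combine continue with simple bookkeeping, assuming the column names from the question (excluded_until is a hypothetical helper dict):

excluded_until = {}  # maps muid -> last iteration in which that user stays excluded

for i in range(2, 10):
    # users who have seen the Esample content in this iteration's data
    tactic1_list = df.loc[(df.tactic == 'Esample') & (df.response > 0)]['muid'].tolist()
    for muid in df.muid.unique():
        if excluded_until.get(muid, -1) >= i:
            continue  # still inside the two-iteration exclusion window
        # ... further analysis for this user ...
    # users flagged this iteration sit out the next two iterations
    for muid in tactic1_list:
        excluded_until[muid] = i + 2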
I have two dataframes with different lengths (df, df1). They share one common label, "collo_number". I want to search the second dataframe for every collo_number in the first dataframe. The problem is that the second dataframe contains multiple rows, for different dates, for every collo_number. So I want to sum these values across dates and add the result as a new column in the first dataframe.
I now use a loop, but it is rather slow and has to perform this operation for all 7 days in a week. Is there a way to get better performance? I tried multiple solutions but keep getting the error that I cannot use the equals sign for two dataframes with different lengths. Help would really be appreciated! Here is an example of what works, but with rather bad performance:
df5=[df1.loc[(df1.index == nasa) & (df1.afleverdag == x1) & (df1.ind_init_actie=="N"), "aantal_colli"].sum() for nasa in df.collonr]
Your description is a bit vague (hence my comment). First, what you could do is select the rows of the dataframe that you want to search:
dftmp = df1[(df1.afleverdag==x1) & (df1.ind_init_actie=='N')]
so that you don't do this for every item in the loop.
Second, use .groupby.
newseries = dftmp['aantal_colli'].groupby(dftmp.index).sum()
newseries = newseries.reindex(df.collonr.unique())
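To attach those sums back onto the first dataframe as a new column (a small sketch; the column name sum_aantal_colli is just an illustration):

df['sum_aantal_colli'] = df['collonr'].map(newseries)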