How to group my dataframe `dataset` by the column `default` using Pandas?
From the cheat sheet:
And this is my code:
dataset.groupby(by="default")
Which returns:
<pandas.core.groupby.generic.DataFrameGroupBy object at 0x7fb4ac4f2490>
It seems your question is a little on the theoretical side, so I am going to explain how to do what you want and also what @It_is_Chris meant in his comment.
So how does groupby work in Pandas?
A: The idea is pretty simple: imagine you are grouping by one column with two different values. Pandas will create "small data frames", each filtered according to one of the values of the column being grouped. These generated "small data frames" come wrapped in a DataFrameGroupBy object, whose internals are honestly outside the given subject. Just think of them as small, separate data frames, each one holding the rows that share a given set of characteristics/attributes.
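To make this concrete, here is a minimal sketch (with hypothetical toy data) showing the "small data frames" you get when iterating over the groupby result:

import pandas as pd

# hypothetical toy data: 'default' has two distinct values
dataset = pd.DataFrame({'default': ['yes', 'no', 'yes', 'no'],
                        'balance': [100, 250, 75, 300]})

grouped = dataset.groupby(by="default")

# iterating yields one "small data frame" per distinct value
for value, small_df in grouped:
    print(value)      # 'no', then 'yes'
    print(small_df)   # the rows of dataset where default == value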
Now about your question: I didn't quite catch what you want, but I guess it is just the number of events, or the size of each group. Nonetheless, just take a look at the pandas docs and you will be able to find out which method fits best:
dataset.groupby(by="default").count()
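If what you want is simply how many rows fall into each group, size() may fit better than count(), which counts non-null values per column; a quick sketch on the same hypothetical dataset:

dataset.groupby(by="default").size()    # number of rows in each group
dataset.groupby(by="default").count()   # non-null values per column, within each group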
Related
I have a dataframe with 160,000 rows, and I need to know whether its values exist in a column of a different dataframe with over 7 million rows, using Vaex.
I have tried doing this in pandas but it takes way too long to run.
Once I run this code, I would like a list or a column that says either "True" or "False" for whether each value exists.
There are a few tricks you can try.
Some ideas:
You can try an inner join, and then get the list of unique values that appear in both dataframes. Then you can use the isin method on the smaller dataframe with that list to get your answer.
Dunno if this will work out of the box, but it would be something like:
df_join = df_small.join(df_big, on='key', how='inner', allow_duplication=True)
common_samples = df_join['key'].tolist()
df_small['is_in_df_big'] = df_small.key.isin(common_samples)
# If it is something you are going to reuse a lot, it may be worth doing
df_small = df_small.materialize('is_in_df_big')  # put it in memory; otherwise it will be lazily recomputed each time you need it
Similar idea: instead of doing a join, do something like:
unique_samples = df_small.key.unique()
common_samples = df_big[df_big.key.isin(unique_samples)].key.unique()
df_small['is_in_df_big'] = df_small.key.isin(common_samples)
I don't know which one would be faster. I hope this at least leads to some inspiration, if not the full solution.
So I have a python script that compares two dataframes and works to find any rows that are not in both dataframes. It currently iterates through a for loop which is slow.
I want to improve the speed of the process, and I know that iteration is the problem. However, I haven't had much luck using various numpy and pandas methods such as merge and where.
Couple of caveats:
The column names from my file sources aren't the same, so I set their names into variables and use the variable names to compare.
I want to only use the column names from one of the dataframes.
df_new represents new information to be checked against what is currently on file (df_current)
My current code:
set_current = set(df_current[current_col_name])
df_out = pd.DataFrame(columns=df_new.columns)

for i in range(len(df_new.index)):
    # if the row entry is new, we add it to our dataset
    if not df_new[new_col_name][i] in set_current:
        df_out.loc[len(df_out)] = df_new.iloc[i]
    # if the row entry is a match, then we aren't going to do anything with it
    else:
        continue

# create a xlsx file with the new items
df_out.to_excel("data/new_products_to_examine.xlsx", index=False)
Here are some simple examples of dataframes I would be working with:
df_current
|partno|description|category|cost|price|upc|brand|color|size|year|
|:-----|:----------|:-------|:---|:----|:--|:----|:----|:---|:---|
|123|Logo T-Shirt||25|49.99||apple|red|large|2021|
|456|Knitted Shirt||35|69.99||apple|green|medium|2021|
df_new
|mfgr_num|desc|category|cost|msrp|upc|style|brand|color|size|year|
|:-------|:---|:-------|:---|:---|:--|:----|:----|:----|:---|:---|
|456|Knitted Shirt||35|69.99|||apple|green|medium|2021|
|789|Logo Vest||20|39.99|||apple|yellow|small|2022|
There are usually many more columns in the current sheet, but I wanted the table displayed to be somewhat readable. The key is that I would only want the columns in the "new" dataframe to be output.
I would want to match partno with mfgr_num since the spreadsheets will always have them, whereas some items don't have upc/gtin/ean.
It's still a little unclear what you want without examples of each dataframe. But if you want to test whether unique IDs in differently named columns exist in two different dataframes, try an approach like this.
Find the IDs that exist in the second dataframe
test_ids = df2['cola_id'].unique().tolist()
Then filter the first dataframe for those IDs.
df1[df1['keep_id'].isin(test_ids)]
Here is the answer that works; it was supplied to me by someone much smarter.
df_out = df_new[~df_new[new_col_name].isin(df_current[current_col_name])]
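To see why this works with the sample tables above, here is a minimal sketch (reconstructing trimmed-down versions of the two frames):

import pandas as pd

# trimmed-down versions of the sample frames above
df_current = pd.DataFrame({'partno': [123, 456],
                           'description': ['Logo T-Shirt', 'Knitted Shirt']})
df_new = pd.DataFrame({'mfgr_num': [456, 789],
                       'desc': ['Knitted Shirt', 'Logo Vest']})

current_col_name = 'partno'
new_col_name = 'mfgr_num'

# isin builds a Boolean mask of matches; ~ inverts it, keeping only the new rows
df_out = df_new[~df_new[new_col_name].isin(df_current[current_col_name])]
print(df_out)  # only the 789 / Logo Vest row remains, with df_new's columns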
Let's say I have this pyspark Dataframe:
data = spark.createDataFrame(schema=['Country'], data=[('AT',), ('BE',), ('France',), ('Latvia',)])
And let's say I want to collect various statistics about this data. For example, I might want to know how many rows use a 2-character country code and how many use longer country names:
from pyspark.sql import functions as F

count_short = data.where(F.length(F.col('Country')) == 2).count()
count_long = data.where(F.length(F.col('Country')) > 2).count()
This works, but when I want to collect many different counts based on different conditions, it becomes very slow even for tiny datasets. In Azure Synapse Studio, where I am working, every count takes 1-2 seconds to compute.
I need to do 100+ counts, and it takes multiple minutes to compute for a dataset of 10 rows. And before somebody asks, the conditions for those counts are more complex than in my example. I cannot group by length or do other tricks like that.
I am looking for a general way to do multiple counts on arbitrary conditions, fast.
I am guessing that the reason for the slow performance is that for every count call, my pyspark notebook starts some Spark processes that have significant overhead. So I assume that if there was some way to collect these counts in a single query, my performance problems would be solved.
One possible solution I thought of is to build a temporary column that indicates which of my conditions have been matched, and then call countDistinct on it. But then I would have individual counts for all combinations of condition matches. I also noticed that depending on the situation, the performance is a bit better when I do data = data.localCheckpoint() before computing my statistics, but the general problem still persists.
Is there a better way?
The "count" function can be replaced by "sum" with a condition (Scala):
data.select(
  sum(
    when(length(col("Country")) === 2, 1).otherwise(0)
  ).alias("two_characters"),
  sum(
    when(length(col("Country")) > 2, 1).otherwise(0)
  ).alias("more_than_two_characters")
)
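Since the question is in PySpark, a Python equivalent of the same idea might look like this (a sketch, assuming the data frame and the F import from the question):

from pyspark.sql import functions as F

# both counts are computed in a single Spark job
counts = data.select(
    F.sum(F.when(F.length(F.col('Country')) == 2, 1).otherwise(0)).alias('two_characters'),
    F.sum(F.when(F.length(F.col('Country')) > 2, 1).otherwise(0)).alias('more_than_two_characters')
).first()

counts['two_characters'] and counts['more_than_two_characters'] then hold both numbers from a single pass over the data.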
While one way is to combine multiple queries in to one, the other way is to cache the dataframe that is being queried again and again.
By caching the dataframe, we avoid the re-evaluation each time the count() is invoked.
data.cache()
A few things to keep in mind: if you are applying multiple actions on your dataframe, there are a lot of transformations, and you are reading that data from some external source, then you should definitely cache that dataframe before you apply any action on it.
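As a sketch of the pattern (reusing data and the F import from the question):

# cache once, then run many actions against the same data
data.cache()

count_short = data.where(F.length(F.col('Country')) == 2).count()  # first action materializes the cache
count_long = data.where(F.length(F.col('Country')) > 2).count()    # later actions reuse the cached data

data.unpersist()  # release the cache when you are done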
The answer provided by @pasha701 works, but you will have to keep adding columns for every country code length you want to analyse.
You can use the below code to get the counts of the different country code lengths, all in one single dataframe.
# import statements
from pyspark.sql.functions import *

# sample dataframe
data = spark.createDataFrame(schema=['Country'], data=[('AT',), ('ACE',), ('BE',), ('France',), ('Latvia',)])

# add a column that gives the length of each country code
data1 = data.withColumn("CountryLength", length(col('Country')))

# column names for the final output
outputcolumns = ["CountryLength", "RecordsCount"]

# select the CountryLength column, convert it to an RDD, and run a
# map/reduce to count the occurrences of each length
countrieslength = (data1.select("CountryLength")
                   .rdd.map(lambda word: (word, 1))
                   .reduceByKey(lambda a, b: a + b)
                   .toDF(outputcolumns)
                   .select("CountryLength.CountryLength", "RecordsCount"))

# now you can do display or show on the dataframe to see the output
display(countrieslength)
Please see below for the kind of output snapshot you might get:
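For the five sample rows above (AT, ACE, BE, France, Latvia), the counts should work out as follows; note that reduceByKey gives no guaranteed row order:

|CountryLength|RecordsCount|
|:------------|:-----------|
|2|2|
|3|1|
|6|2|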
If you want to apply multiple filter conditions to this dataframe, you can cache it and then get the counts of different combinations of records based on the country code length.
I'm trying to organise data using a pandas dataframe.
Given the structure of the data it seems logical to use a composite index; 'league_id' and 'fixture_id'. I believe I have implemented this according to the examples in the docs, however I am unable to access the data using the index.
My code can be found here;
https://repl.it/repls/OldCorruptRadius
**I am very new to Pandas and programming in general, so any advice would be much appreciated! Thanks!**
For multi-indexing, you would need to use the pandas MultiIndex API, which brings its own learning curve; thus, I would not recommend it for beginners. Link: https://pandas.pydata.org/pandas-docs/stable/user_guide/advanced.html
The way that I use multi-indexing is only to display a final product to others (i.e. making it easy/pretty to view). Before any multi-indexing, filter on fixture_id and league_id as ordinary columns first:
df = pd.DataFrame(fixture, columns=features)
df[(df['fixture_id'] == 592) & (df['league_id'] == 524)]
This way, you are still effectively targeting the same rows you would have reached by multi-indexing the two columns.
If you have to use multi-indexing, try the transpose feature (`.T`) of a pandas DataFrame. This turns the index into columns and vice versa. For example, you can do something like this:
df = pd.DataFrame(fixture, columns=features).set_index(['league_id', 'fixture_id'])
df.T[524][592].loc['event_date'] # gets you the row of `event_date` values
df.T[524][592].loc['event_date'].iloc[0] # gets you the first instance of `event_date`
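That said, transposing is not required; once the MultiIndex is set, .loc can take a tuple of index values directly. A minimal sketch, assuming the same fixture/features data from the question:

df = pd.DataFrame(fixture, columns=features).set_index(['league_id', 'fixture_id'])

# .loc accepts a (league_id, fixture_id) tuple against the MultiIndex;
# the result is a Series for a unique match, or a DataFrame if several rows share the key
row = df.loc[(524, 592)]
event_date = row['event_date']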
I'm new to Python and just trying to figure out how this small bit of code works. Hoping this'll be easy to explain without an example data frame.
My data frame, called df_train, contains a column called Age. This column is NaN for 177 records.
I submit the following code...
df_train[df_train['Age'].isnull()]
... and it returns all records that are missing.
Now if I submit df_train['Age'].isnull(), all I get is a Boolean List of values. How does the data frame object then work to convert this Boolean List to the rows we actually want?
I don't understand how passing the boolean list to the data frame again results in just the 177 records that we need - could someone please ELI5 for a newbie?
You will have to create subsets of the dataframe you want to use. When you pass a Boolean Series to a dataframe's indexing operator, pandas keeps exactly the rows where the Series is True and drops the rows where it is False; this is called boolean masking. Suppose you want to use only those rows where df_train['Age'] is not null. In that case, you have to select
df_train_to_use = df_train[df_train['Age'].notnull()]
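To see the masking mechanics directly, here is a minimal sketch with hypothetical toy data:

import pandas as pd
import numpy as np

df = pd.DataFrame({'Age': [22.0, np.nan, 38.0]})

mask = df['Age'].isnull()  # Boolean Series: [False, True, False]
print(df[mask])            # keeps only the row where the mask is True (the NaN row)
print(df[~mask])           # '~' inverts the mask: rows where Age is present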
Now, you may cross-check whether any other column you want to use has nulls, like
df_train['Column_name'].isnull().any()
If this returns True, you may go ahead and replace the nulls with default values, averages, zeros, or whatever method you prefer, as is usually done for machine learning programs.
Example
df_train['Column_name'] = df_train['Column_name'].fillna('')   # for strings
df_train['Column_name'] = df_train['Column_name'].fillna(0)    # for int
df_train['Column_name'] = df_train['Column_name'].fillna(0.0)  # for float
df_train = df_train.dropna(subset=['Column_name'])             # or drop the rows with nulls entirely
(Note that these methods return new objects rather than modifying in place, hence the assignments.)
Etc.
I hope this helps explain it.