I'm new to pandas and would appreciate any help. I'm currently analyzing some Airbnb data and have over 50 different columns. Some of these columns have tens of thousands of unique values, while others have very few unique values (categorical).
How do I loop over the columns that have fewer than 10 unique values to generate plots for them?
Count of unique values in each column:
id 38185
last_scraped 3
name 36774
description 34061
neighborhood_overview 18479
picture_url 37010
host_since 4316
host_location 1740
host_about 14178
host_response_time 4
host_response_rate 78
host_acceptance_rate 101
host_is_superhost 2
host_neighbourhood 486
host_total_listings_count 92
host_verifications 525
host_has_profile_pic 2
host_identity_verified 2
neighbourhood_cleansed 222
neighbourhood_group_cleansed 5
property_type 80
room_type 4
The above is stored via unique_vals = df.nunique()
Apologies if this is a repeat question. The closest answer I could find was "Iterate through columns to generate separate plots in python", but it pertained to the entire data set.
Thanks!
You can filter the columns using df.columns[ unique_vals < 10 ]
You can also pass the df.nunique() call directly if you wish:
unique_columns = df.columns[ df.nunique() < 10 ]
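To cover the plotting part of the question, here is a minimal sketch that loops over the filtered columns and draws one bar chart of value counts per column. The small DataFrame is a made-up stand-in for the Airbnb data, and a bar chart is only one reasonable choice for categorical columns.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical stand-in for the Airbnb data: one high-cardinality column
# and two low-cardinality (categorical) columns.
df = pd.DataFrame({
    "id": range(12),
    "room_type": ["Entire home/apt", "Private room", "Shared room", "Hotel room"] * 3,
    "host_is_superhost": ["t", "f"] * 6,
})

# Keep only the columns with fewer than 10 unique values.
low_card_cols = df.columns[df.nunique() < 10]

figures = {}
for col in low_card_cols:
    fig, ax = plt.subplots()
    # One bar per distinct value, bar height = number of rows with that value.
    df[col].value_counts().plot(kind="bar", ax=ax, title=col)
    ax.set_ylabel("count")
    fig.tight_layout()
    figures[col] = fig

# plt.show()  # uncomment in an interactive session to display the figures
```

Each figure is kept in the figures dict so it can be shown or saved later with fig.savefig(...).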
Let's say I have a table with two columns: Date and Amount. The number of rows is no more than 3000.
Row Date Amount
1 15/05/2021 248
2 16/05/2021 115
3 17/05/2021 387
4 18/05/2021 214
5 19/05/2021 678
6 20/05/2021 489
7 21/05/2021 875
8 22/05/2021 123
................
I need to add a third column that holds the trimmed mean of the Amount column.
I will be using this function: my_table['TrimMean'] = stats.trim_mean(my_table['Amount'], 0.1), but adapted for my problem.
The problem is that the range is not fixed but dynamic: for each row in my table, the trimmed mean should be calculated from the previous 90 values of the Amount column, starting from the row above the current row. If there are fewer than 90 values, calculate it with however many rows are available.
e.g.
TrimMean[1000] = stats.trim_mean(array from column Amount containing values from rows 910 to 999)
TrimMean[12] = stats.trim_mean(array from column Amount containing values from rows 1 to 11)
Hope that makes sense.
Is there any way I can calculate this in a simple way, without going through row by row iteration?
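We can calculate the trim_mean by applying the function over a rolling window of size 90 with min_periods=1, so that early rows use however many values are available. The final shift() moves each result down one row, so that every row only sees the values above it: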
from scipy.stats import trim_mean
df['Amount'].rolling(90, min_periods=1).apply(trim_mean, args=(0.1, )).shift()
0 NaN
1 248.000000
2 181.500000
3 250.000000
4 241.000000
5 328.400000
6 355.166667
7 429.428571
Name: Amount, dtype: float64
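A self-contained version of the above, as a sketch: the table holds only the eight sample rows from the question, and raw=True is my own addition so that trim_mean receives a plain NumPy array. Note that with a proportion of 0.1, nothing is trimmed until a window holds at least 10 values (int(0.1 * n) is 0 for n < 10), which is why the sample output equals a plain running mean.

```python
import pandas as pd
from scipy.stats import trim_mean

# The eight sample rows from the question.
my_table = pd.DataFrame({"Amount": [248, 115, 387, 214, 678, 489, 875, 123]})

my_table["TrimMean"] = (
    my_table["Amount"]
    .rolling(90, min_periods=1)                # up to 90 rows, fewer near the top
    .apply(trim_mean, raw=True, args=(0.1,))   # trimmed mean of each window
    .shift()                                   # exclude the current row
)

# With 10 values, one value is cut from each end before averaging.
trimmed_ten = trim_mean(list(range(1, 10)) + [1000], 0.1)

print(my_table)
print(trimmed_ten)
```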
I have a pandas DataFrame of stock data, and I'm trying to filter out some of the tickers.
Some companies have two or more tickers (different share types, for example when one share is preferred and the other is not).
I want to drop the rows for those additional shares and keep only the share with the highest volume. The DataFrame also contains the company name, so maybe there is a way to use it as a condition and drop rows after comparing the volumes within the same company? How can I do this?
Use groupby and idxmax:
Suppose this dataframe:
>>> df
ticker volume
0 CEBR3 123
1 CEBR5 456
2 CEBR6 789 # <- keep for group CEBR
3 GOAU3 23 # <- keep for group GOAU
4 GOAU4 12
5 CMIN3 135 # <- keep for group CMIN
>>> df.loc[df.groupby(df['ticker'].str.extract(r'^(.*)\d', expand=False),
...                   sort=False)['volume'].idxmax().tolist()]
ticker volume
2 CEBR6 789
3 GOAU3 23
5 CMIN3 135
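Since the question mentions that the DataFrame also has a company name column, grouping on that column directly avoids parsing the ticker. A minimal sketch, assuming the column is called company (the column name and the company labels are made up):

```python
import pandas as pd

# Hypothetical data: 'company' and its values are placeholders.
df = pd.DataFrame({
    "company": ["Company A", "Company A", "Company A",
                "Company B", "Company B", "Company C"],
    "ticker": ["CEBR3", "CEBR5", "CEBR6", "GOAU3", "GOAU4", "CMIN3"],
    "volume": [123, 456, 789, 23, 12, 135],
})

# idxmax gives, per company, the row label of the highest volume;
# df.loc then selects exactly those rows.
top = df.loc[df.groupby("company", sort=False)["volume"].idxmax()]
print(top)
```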
I have a pandas dataframe as follows:
You will note that there are many rows with the same code_module, code_presentation, id_student combination.
What I want to do is merge all of these duplicate rows and, in doing so, sum the sum_click values within each group.
For example, the top rows would be merged into one row that looks as follows:
code_module code_presentation id_student sum_click
0 AAA 2013J 28400 18
In SQL terms, the primary key should be the code_module, code_presentation, id_student combination.
So far, I have tried to use groupby in the following way:
groupby(['id_student','code_presentation','code_module']).aggregate({'sum_click': 'sum',})
But this didn't work: it gave student IDs that aren't even in my dataset, and I don't understand why.
Also, groupby doesn't seem to be quite what I'm looking for, as it returns a data structure different from a standard pandas DataFrame, which is what I want.
The problem can be seen in the following output
sum_click
id_student code_presentation code_module
6516 2014J AAA 2791
8462 2013J DDD 646
2014J DDD 10
11391 2013J AAA 934
Rows 1 and 2 (indexing from 0) should be distinct rows, rather than grouped together as they are here.
Try this -
df.groupby(['code_module', 'code_presentation', 'id_student']).agg(sum_clicks=('sum_click', 'sum')).reset_index()
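On the structure concern: the blank cells in the grouped output are only how pandas prints a MultiIndex (repeated labels are hidden), so those rows are already distinct. Getting the keys back as ordinary columns gives a flat DataFrame. A small sketch with made-up rows, using as_index=False as an alternative to reset_index() and keeping the original sum_click column name:

```python
import pandas as pd

# Hypothetical rows; the first three share one key and sum to 18.
df = pd.DataFrame({
    "code_module": ["AAA", "AAA", "AAA", "DDD", "DDD"],
    "code_presentation": ["2013J", "2013J", "2013J", "2013J", "2014J"],
    "id_student": [28400, 28400, 28400, 8462, 8462],
    "sum_click": [4, 1, 13, 646, 10],
})

keys = ["code_module", "code_presentation", "id_student"]

# as_index=False keeps the grouping keys as regular columns.
merged = df.groupby(keys, as_index=False)["sum_click"].sum()
print(merged)
```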
I am trying to rewrite a .xlsx file from scratch using Python. The excel sheet has 99 rows and 11 columns. I have generated 99 rows x 8 columns already and I am currently working on generating the 99 rows x 9th column.
This 9th column is calculated based on a SUM-IFS formula in excel. It takes into account columns 2, 4 and 7.
Col. 2 has numerical int values.
Col. 4 has three letter airport code values like NYC for New York City
Col. 7 also has three letter airport code values like DEL for Delhi.
The sum-if formula for column 9 cells
SUMIFS(B:B, D:D, D2, G:G, G2)
Hence it sums the numerical values in column 2 for matching pairs of cities in col. 4 and col. 7. If a pair of cities occurs only once, there is nothing to sum, and the cell in col. 9 equals the int value of the cell in col. 2.
However, if there are multiple occurrences of the pair of cities in col. 4 and col. 7, then the corresponding values in col. 2 are summed, and that sum becomes the value of the cell in col. 9.
Example:
In this example, col. 2 is Sales, col.4 is Origin City, col. 7 is Destination City and col. 9 is the Result that utilizes =SUMIFS(B:B,C:C,C2,D:D,D2)
I am trying to calculate the column 9 using python on the large data set that I have. For now, I have been able to create a list of dictionaries, where I have made the key as origin_city-destination_city and the value as the integer value of col. 2. The list of dicts has 99 rows like the excel file, hence each row of the excel file is represented as a dict. On printing the dictionary, it is something like this:
{'YTO-YVR': 570}
{'YVR-YTO': 542}
{'YTO-YYC': 420}
{'YYT-YTO': 32}
{'YWG-YYC': 115}
I have been wondering whether it is possible to loop over the list of dicts and create a SUMIFS version of it, resulting in 99 dicts in the list, with each dict holding the SUMIFS value. After this I have to write all these values to the column in the Excel file.
I hope someone here can help !! Thank you very much in advance :)
You can use pandas' groupby with transform:
import pandas as pd
df = pd.DataFrame({'Sales': [100,110,200,300,150,200,100],
'Origin': ['YYZ','YEA','CDG','YYZ','YEA','YVR','YEA'],
'Dest': ['DEL','NYC','YUL','DEL','YTO','HKG','NYC']})
df['Result'] = df.groupby(['Origin','Dest']).Sales.transform('sum')
Result:
Sales Origin Dest Result
0 100 YYZ DEL 400
1 110 YEA NYC 210
2 200 CDG YUL 200
3 300 YYZ DEL 400
4 150 YEA YTO 150
5 200 YVR HKG 200
6 100 YEA NYC 210
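If you would rather stay with the list of dicts from the question, the same SUMIFS logic can be sketched in plain Python with two passes: first total the sales per route, then look up each row's total. The sample list is adapted from the question, with one repeated route added by me so the summing is visible.

```python
from collections import defaultdict

# One single-key dict per Excel row, keyed 'origin-destination'.
# The second 'YTO-YVR' entry is made up for illustration.
rows = [
    {"YTO-YVR": 570},
    {"YVR-YTO": 542},
    {"YTO-YYC": 420},
    {"YTO-YVR": 30},
    {"YWG-YYC": 115},
]

# Pass 1: total sales per route.
totals = defaultdict(int)
for row in rows:
    for route, sales in row.items():
        totals[route] += sales

# Pass 2: one dict per row again, now holding the route total.
result = [{route: totals[route] for route in row} for row in rows]
print(result)
```

The result list has the same length and order as the input, so it can be written to column 9 row by row.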
DataFrame1:
Device  MedDescription                                                       Quantity
RWCLD   Acetaminophen (TYLENOL) 325 mg Tab                                   54
RWCLD   Ampicillin Inj (AMPICILLIN) 2 g Each                                 13
RWCLD   Betamethasone Inj *5mL* (CELESTONE SOLUSPAN) 30 mg (5 mL) Each       2
RWCLD   Calcium Carbonate Chew (500mg) (TUMS) 200 mg Tab                     17
RWCLD   Carboprost Inj *1mL* (HEMABATE) 250 mcg (1 mL) Each                  5
RWCLD   Chlorhexidine Gluc Liq *UD* (PERIDEX/PERIOGARD) 0.12 % (15 mL) Each  5
Data Frame2:
Device DrwSubDrwPkt MedDescription BrandName MedID PISAlternateID CurrentQuantity Min Max StandardStock ActiveOrders DaysUnused
RWC-LD RWC-LD_MAIN Drw 1-Pkt 12 Mag/AlOH/Smc 200-200-20/5 *UD* (MYLANTA/MAALOX) (30 mL) Each MYLANTA/MAALOX A03518 27593 7 4 10 N Y 3
RWC-LD RWC-LD_MAIN Drw 1-Pkt 20 ceFAZolin in Dextrose(ISO-OS) (ANCEF/KEFZOL) 1 g (50 mL) Each ANCEF/KEFZOL A00984 17124 6 5 8 N N 2
RWC-LD RWC-LD_MAIN Drw 1-Pkt 22 Clindamycin Phosphate/D5W (CLEOCIN) 900 mg (50 mL) IV Premix CLEOCIN A02419 19050 7 6 8 N N 2
What I want to do is append DataFrame2 values to DataFrame1 ONLY where 'MedDescription' matches. When a match is found, I would like to add only certain columns from DataFrame2 (Min, Max, DaysUnused), which are all integers.
I had an iterative solution where I access DataFrame1 one row at a time and check for a match in DataFrame2; once found, I append the values of those columns to the original DataFrame.
Is there a better way? It is making my computer slow to a crawl, as I have thousands upon thousands of rows.
It sounds like you want to merge the target columns ('MedDescription', 'Min', 'Max', 'DaysUnused') into df1 based on a matching 'MedDescription'.
I believe the best way to do this is as follows (column names taken from the DataFrame2 header above, where the last column is 'DaysUnused' without a space):
target_cols = ['MedDescription', 'Min', 'Max', 'DaysUnused']
df1.merge(df2[target_cols], on='MedDescription', how='left')
how='left' ensures that all the data in df1 is returned, and only the target columns in df2 are appended if MedDescription matches.
Note: It is easier for others if you copy the results of df1/df2.to_dict(). The data above is difficult to parse.
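Here is a small runnable sketch of that left merge. The frames are cut down to a few rows and the Min/Max/DaysUnused numbers are made up; I am assuming the column is named DaysUnused, as in the DataFrame2 header. The drop_duplicates call is a precaution I added: if df2 lists the same MedDescription more than once, a plain merge would repeat the matching df1 rows.

```python
import pandas as pd

df1 = pd.DataFrame({
    "Device": ["RWCLD", "RWCLD"],
    "MedDescription": [
        "Acetaminophen (TYLENOL) 325 mg Tab",
        "Ampicillin Inj (AMPICILLIN) 2 g Each",
    ],
    "Quantity": [54, 13],
})

# Hypothetical df2 rows; only the first description exists in df1.
df2 = pd.DataFrame({
    "Device": ["RWC-LD", "RWC-LD"],
    "MedDescription": [
        "Acetaminophen (TYLENOL) 325 mg Tab",
        "Clindamycin Phosphate/D5W (CLEOCIN) 900 mg (50 mL) IV Premix",
    ],
    "Min": [4, 6],
    "Max": [10, 8],
    "DaysUnused": [3, 2],
})

target_cols = ["MedDescription", "Min", "Max", "DaysUnused"]

# Left merge: every df1 row is kept; unmatched rows get NaN in the new columns.
merged = df1.merge(
    df2[target_cols].drop_duplicates("MedDescription"),
    on="MedDescription",
    how="left",
)
print(merged)
```

Because unmatched rows are filled with NaN, the added integer columns come back as floats; fill or drop the missing values before casting them back to int if you need that.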
This sounds like an opportunity to use pandas' built-in functions for joining datasets: you should be able to join on MedDescription with the desired columns from DataFrame2. Joins in pandas are vectorized, so they should far outperform looping through the rows.
Pandas has documentation on merging datasets that includes some good examples, and you can find ample literature on the concepts of joins in SQL tutorials.
pd.merge(ld,ldAc,on='MedDescription',how='outer')
This is how I joined the two DataFrames. It seems to work, although it dropped one of the indexes that contained the devices.