Modify pandas iteration from iterrows to apply/vectorized format [duplicate] - python

This question already has answers here:
How can repetitive rows of data be collected in a single row in pandas?
(3 answers)
pandas group by and find first non null value for all columns
(3 answers)
Closed 7 months ago.
Using iterrows to implement the logic takes a lot of time. Can someone suggest how I could optimize the code with a vectorized approach or apply()?
Below is the input table. From a partition of (ITEMSALE, ITEMID), I need to populate the rows with rank=1. If any column value is null in the rank=1 row, I need to populate the next available value in that column. This has to be done for all columns in the dataset.
Below is the expected output format.
I have tried the logic below using iterrows, where I am accessing values row-wise. Performance is too low using this method.

This should get you what you need
df.loc[df.loc[df['Item_ID'].isna()].groupby('Item_Sale')['Date'].idxmin()]
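If the goal is one row per (ITEMSALE, ITEMID) partition holding the first non-null value of every column (as in the linked duplicates), here is a hedged sketch using groupby().first(). The column names ITEMSALE, ITEMID and RANK are assumptions taken from the question's description, not confirmed against the real data:
import pandas as pd
# Sketch only: sort so that rank=1 comes first within each partition,
# then take the first non-null value of every other column per group.
# GroupBy.first() skips NaN, so nulls in the rank=1 row are filled from
# the next available row in that partition.
out = (
    df.sort_values("RANK")
      .groupby(["ITEMSALE", "ITEMID"], as_index=False)
      .first()
)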


How do I get a count on the affected rows from an operation on a Pandas dataframe? [duplicate]

This question already has answers here:
Python Pandas Counting the Occurrences of a Specific value
(8 answers)
Closed 11 months ago.
Given the following data set, loaded into a Pandas DataFrame
BARCODE
ALTERNATE_BARCODE
123
456
789
Imagine I have the following pandas Python statement:
users.loc[users["BARCODE"] == "", "BARCODE"] = users["ALTERNATE_BARCODE"]
Is there any way, without rewriting this terse statement too much, to access the number of rows in the DataFrame that got affected?
Edit: I am mainly on the lookout for the existence of a library or something built into pandas that has knowledge of the last operation and could provide me with some metadata about it. Computing deltas is a good workaround, but not what I am after, since it would clutter the code.
Prior to replacing the values, get the length of the .loc selection.
len(users.loc[users["BARCODE"] == "", "BARCODE"].index)
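As a hedged alternative sketch under the same assumptions (a users DataFrame with BARCODE and ALTERNATE_BARCODE columns): summing the boolean mask gives the count directly, and the mask can be reused for the assignment so the condition is only written once.
# Sketch only: count the rows the assignment will touch, then apply it.
mask = users["BARCODE"] == ""
affected = mask.sum()          # number of rows with an empty BARCODE
users.loc[mask, "BARCODE"] = users["ALTERNATE_BARCODE"]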

Python Grouping on one column and detailing min and max alphabetical values from another column from a number of rows [duplicate]

This question already has answers here:
Concatenate strings from several rows using Pandas groupby
(8 answers)
Closed 3 years ago.
I am fairly new to Python. Coming from SQL, I have been using pandas to build reports from CSV files with reasonable success. I have been able to answer most of my questions thanks mainly to this site, but I don't seem to be able to find an answer to this one:
I have a dataframe with 2 columns. I want to group on the first column and display the lowest and highest alphabetical values from the second column, concatenated into a third column. I could do this fairly easily in SQL, but as I say, I am struggling to get my head around it in Python/pandas.
example:
source data:
LINK_NAME, CITY_NAME
Linka, Citya
Linka, Cityz
Linkb, Cityx
Linkb, Cityc
Desired output:
LINK_NAME, LINKID
Linka, CityaCityz
Linkb, CitycCityx
Edit:
Sorry for missing part of your question.
To sort the strings within each group alphabetically, you could define a function to apply to the grouped items:
def first_and_last_alpha(series):
    sorted_series = series.sort_values()
    return "".join([sorted_series.iloc[0], sorted_series.iloc[-1]])
df.groupby("LINK_NAME")["CITY_NAME"].apply(first_and_last_alpha)
Original:
Your question seems to be a duplicate of this one.
The same effect, with your data, is achieved by:
df.groupby("LINK_NAME")["CITY_NAME"].apply(lambda x: "".join(x))
where df is your pandas.DataFrame object.
In the future, it's good to provide a reproducible example, including anything you've attempted, before posting. For example, the output from df.to_dict() would allow me to recreate your example data instantly.

Dropping rows based on specific values through python pandas? [duplicate]

This question already has answers here:
How do I select rows from a DataFrame based on column values?
(16 answers)
Closed 3 years ago.
This is the dataset that I am attempting to use:
https://storage.googleapis.com/hewwo/NCHS_-_Leading_Causes_of_Death__United_States.csv
I am wondering how I can specifically drop rows that contain certain values. In this example, many rows from the "Cause Name" column have values of "All causes". I want to drop any row that has this value for that column. This is what I have tried so far:
death2[death2['cause_name' ]!= 'All Causes']
While this did not give me any errors, it also did not seem to do anything to my dataset. Rows with "All causes" were still present. Am I doing something wrong?
No changes were made to your DataFrame. Boolean filtering returns a new DataFrame, so you need to reassign it if you want to keep the result. Also note that the comparison is case-sensitive: the column holds "All causes", not "All Causes".
death2 = death2[death2['cause_name'] != 'All causes']
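If capitalization in the file turns out to be inconsistent, here is a hedged sketch of a case-insensitive filter, assuming the same death2 DataFrame and cause_name column:
# Sketch only: lower-case the column before comparing, so "All Causes"
# and "All causes" are both excluded.
death2 = death2[death2['cause_name'].str.lower() != 'all causes']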

Is there any way to extract data from dataframe based on conditions? [duplicate]

This question already has answers here:
Pandas loc multiple conditions [duplicate]
(3 answers)
How to filter Pandas dataframe using 'in' and 'not in' like in SQL
(11 answers)
Closed 3 years ago.
I have a dataframe with some investment data. I need to extract data from this dataframe based on certain conditions, say funding_type. There are many funding types available; I just need to extract the data matching particular fund types.
e.g. funding_type has values venture, seed, angel, equity and so on.
I just need the data matching funding types seed and angel.
I tried the following:
MF1[MF1['funding_round_type']=='seed']
Here MF1 is my dataframe. This gives all the data related to the seed fund type.
I need a condition somewhat like:
MF1[MF1['funding_round_type']=='seed' and MF1['funding_round_type']=='angel']
But pandas doesn't allow it.
Any clues?
and doesn't work here; for element-wise boolean logic you need & for and, and | for or. But since both conditions test the same column, a cell can only hold one of the two values, so combining them with & would be true for none of the rows. You need | (or) here:
MF1[(MF1['funding_round_type']=='seed') | (MF1['funding_round_type']=='angel')]
Or as already stated by someone else:
MF1[MF1['funding_round_type'].isin(['seed', 'angel'])]
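A hedged alternative sketch using DataFrame.query(), assuming the same MF1 DataFrame; some find the SQL-like syntax easier to read:
# Sketch only: equivalent to the isin() filter above.
MF1.query("funding_round_type in ['seed', 'angel']")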

python pandas loc - filter for list of values [duplicate]

This question already has answers here:
How to filter Pandas dataframe using 'in' and 'not in' like in SQL
(11 answers)
Closed 5 years ago.
This should be incredibly easy, but I can't get it to work.
I want to filter my dataset on two or more values.
#this works, when I filter for one value
df.loc[df['channel'] == 'sale']
#if I have to filter, two separate columns, I can do this
df.loc[(df['channel'] == 'sale')&(df['type']=='A')]
#but what if I want to filter one column by more than one value?
df.loc[df['channel'] == ('sale','fullprice')]
Would this have to be an OR statement? Can I do something like SQL's in?
There is a df.isin(values) method which tests whether each element in the DataFrame is contained in values.
So, as @MaxU wrote in the comment, you can use
df.loc[df['channel'].isin(['sale','fullprice'])]
to filter one column by multiple values.
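Since the linked duplicate also covers SQL's NOT IN, here is a hedged sketch of the negated filter under the same assumptions (a df with a channel column): prefix the isin() mask with ~ to invert it.
# Sketch only: keep rows whose channel is NOT in the list.
df.loc[~df['channel'].isin(['sale', 'fullprice'])]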
