This question already has answers here:
How do I select rows from a DataFrame based on column values?
(16 answers)
Closed 3 years ago.
This is the dataset that I am attempting to use:
https://storage.googleapis.com/hewwo/NCHS_-_Leading_Causes_of_Death__United_States.csv
I am wondering how I can specifically drop rows that contain certain values. In this example, many rows from the "Cause Name" column have values of "All causes". I want to drop any row that has this value for that column. This is what I have tried so far:
death2[death2['cause_name'] != 'All Causes']
While this did not give me any errors, it also did not seem to do anything to my dataset. Rows with "All causes" were still present. Am I doing something wrong?
No changes were made to your DataFrame because the filter returns a new DataFrame; you need to reassign the result if you want to keep it. Also note that the comparison is case sensitive, so match the value exactly as it appears in the data ('All causes', not 'All Causes'):
death2 = death2[death2['cause_name'] != 'All causes']
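A minimal end-to-end sketch (the column name 'cause_name' is taken from your snippet; if you are working directly from the CSV it may still be called 'Cause Name'):
import pandas as pd

url = "https://storage.googleapis.com/hewwo/NCHS_-_Leading_Causes_of_Death__United_States.csv"
death2 = pd.read_csv(url)

# Normalize the column name used below; adjust if your columns differ.
death2 = death2.rename(columns={"Cause Name": "cause_name"})

# Keep only the rows whose cause is not "All causes", and reassign.
death2 = death2[death2["cause_name"] != "All causes"]
print(death2["cause_name"].unique())  # "All causes" should no longer appear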
This question already has answers here:
How can repetitive rows of data be collected in a single row in pandas?
(3 answers)
pandas group by and find first non null value for all columns
(3 answers)
Closed 7 months ago.
Using iterrows to implement this logic takes a lot of time. Can someone suggest how I could optimize the code with a vectorized approach or apply()?
Below is the input table. From a partition of (ITEMSALE, ITEMID), I need to populate the rows with rank=1. If any column value is null in the rank=1 row, I need to populate the next available value in that column. This has to be done for all columns in the dataset.
Below is the output format expected
I have tried the logic below using iterrows, where I am accessing values row-wise. Performance is too low with this method.
This should get you what you need
df.loc[df.loc[df['Item_ID'].isna()].groupby('Item_Sale')['Date'].idxmin()]
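If what you ultimately need is the first non-null value per column within each group, another vectorized option (a sketch; the column names Item_Sale, Item_ID and Date are assumed from the snippet above, and the sample data is hypothetical) is groupby(...).first(), which skips NaNs column by column:
import pandas as pd
import numpy as np

# Hypothetical sample data for illustration only.
df = pd.DataFrame({
    "Item_Sale": ["A", "A", "B", "B"],
    "Item_ID":   [np.nan, 101, 202, np.nan],
    "Date":      ["2021-01-01", "2021-01-02", "2021-01-03", "2021-01-04"],
})

# first() returns the first non-null value of every column within each group.
result = df.groupby("Item_Sale", as_index=False).first()
print(result)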
This question already has answers here:
Pandas groupby with delimiter join
(2 answers)
Closed 8 months ago.
I have a CSV file called Project.csv
I am reading this file using pandas: df = pd.read_csv('Project.csv', low_memory=False)
Inside this CSV file there are rows with duplicate Project ID and Name, but the data in the other columns is unique. I am looking for a way to find the duplicate rows based on Project ID and merge the unique values from the other columns into one record, separated by ','.
Project.csv
I am looking to store this record in a data frame and filter it to make it look like this:
A simple groupby will do the job:
after_df = your_df.groupby(['Project Id'])[['Project Name','Project Types','Owner']].agg(set)
This will give you a result similar to what you want. If you want to remove the set braces and quote characters from the resulting strings so you get clean-looking values, do this:
after_df.astype(str).replace(r'{|}|\'','',regex=True)
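If you specifically want comma-separated strings rather than sets (as the question describes), one alternative sketch, reusing the column names from the answer above, is to join the unique values directly in the aggregation:
after_df = (
    your_df.groupby("Project Id")[["Project Name", "Project Types", "Owner"]]
           .agg(lambda s: ", ".join(s.dropna().astype(str).unique()))
)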
This question already has answers here:
How do I expand the output display to see more columns of a Pandas DataFrame?
(22 answers)
Closed 1 year ago.
I am a complete novice when it comes to Python so this might be badly explained.
I have a pandas dataframe with 2485 entries for years from 1960-2020. I want to know how many entries there are for each year, which I can easily get with the .value_counts() method. My issue is that when I print this, the output only shows me the top 5 and bottom 5 entries, rather than the number for every year. Is there a way to display all the value counts for all the years in the DataFrame?
Use pd.set_option and set display.max_rows to None:
>>> pd.set_option("display.max_rows", None)
Now you can display all rows of your dataframe.
Options and settings
pandas.set_option
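If you only need the full output once and don't want to change the global setting, a small sketch using a temporary option context (assuming your DataFrame is df and the column is called year):
with pd.option_context("display.max_rows", None):
    print(df["year"].value_counts())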
Suppose the name of the DataFrame is 'df'; then use
counts = df.year.value_counts()
counts.to_csv('name.csv', index=False)
The terminal can't display the entire output, so it collapses the middle rows and shows only the top and bottom values. Try saving to a CSV and inspecting the records there.
This question already has answers here:
Concatenate strings from several rows using Pandas groupby
(8 answers)
Closed 3 years ago.
I am fairly new to using Python and, having come from SQL, I have been using pandas to build reports from CSV files with reasonable success. I have been able to answer most of my questions, thanks mainly to this site, but I don't seem to be able to find an answer to this one:
I have a dataframe with two columns. I want to group on the first column and, in a third column, display the lowest and highest alphabetical values from the second column concatenated together. I could do this fairly easily in SQL, but I am struggling to get my head around it in Python/pandas.
example:
source data:
LINK_NAME,CITY_NAME
Linka,Citya
Linka,Cityz
Linkb,Cityx
Linkb,Cityc
Desired output:
LINK_NAME,LINKID
Linka,CityaCityz
Linkb,CitycCityx
Edit:
Sorry for missing part of your question.
To sort the strings within each group alphabetically, you could define a function to apply to the grouped items:
def first_and_last_alpha(series):
    sorted_series = series.sort_values()
    return "".join([sorted_series.iloc[0], sorted_series.iloc[-1]])
df.groupby("LINK_NAME")["CITY_NAME"].apply(first_and_last_alpha)
Original:
Your question seems to be a duplicate of this one.
The same effect, with your data, is achieved by:
df.groupby("LINK_NAME")["CITY_NAME"].apply(lambda x: "".join(x))
where df is your pandas.Dataframe object
In the future, it's good to provide a reproducible example, including anything you've attempted, before posting. For example, the output from df.to_dict() would allow me to recreate your example data instantly.
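For example (a sketch using the sample data above):
df.to_dict()
# {'LINK_NAME': {0: 'Linka', 1: 'Linka', 2: 'Linkb', 3: 'Linkb'},
#  'CITY_NAME': {0: 'Citya', 1: 'Cityz', 2: 'Cityx', 3: 'Cityc'}}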
This question already has answers here:
How to filter Pandas dataframe using 'in' and 'not in' like in SQL
(11 answers)
Closed 5 years ago.
This should be incredibly easy, but I can't get it to work.
I want to filter my dataset on two or more values.
#this works, when I filter for one value
df.loc[df['channel'] == 'sale']
#if I have to filter, two separate columns, I can do this
df.loc[(df['channel'] == 'sale')&(df['type']=='A')]
#but what if I want to filter one column by more than one value?
df.loc[df['channel'] == ('sale','fullprice')]
Would this have to be an OR statement? Can I do something like SQL's IN?
There is a df.isin(values) method which tests
whether each element in the DataFrame is contained in values.
So, as #MaxU wrote in the comment, you can use
df.loc[df['channel'].isin(['sale','fullprice'])]
to filter one column by multiple values.
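For comparison, the isin call above is equivalent to chaining the conditions with a boolean OR:
df.loc[(df['channel'] == 'sale') | (df['channel'] == 'fullprice')]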