Pandas Split Dataframe by Unique Column Value [duplicate] - python

This question already has answers here:
Split pandas dataframe based on groupby
(4 answers)
Closed 11 months ago.
I have a DataFrame that is being output to a spreadsheet called 'All Data'. Let's say this data contains business addresses (columns for street, city, zip, state). However, I also want to create a worksheet for each unique state containing the exact same columns.
My basic idea was to iterate over every row using df.iterrows() and split the dataframe that way, appending each row to a new dataframe, but that seems extremely inefficient. Is there a better way to do this?
I found this answer but that is just a boolean index.

The groupby answers on the other question will work for you too. In your case, something like:
df_list = [d for _, d in df.groupby(['state'])]
This uses a list comprehension to return a list of dataframes, with one dataframe for each state.
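If the end goal is the worksheet-per-state spreadsheet, you can also write each group straight to its own sheet with pd.ExcelWriter instead of collecting a list first. A minimal sketch, assuming df is the frame from the question, a hypothetical output file named all_data.xlsx, and the lowercase 'state' column:
import pandas as pd

with pd.ExcelWriter('all_data.xlsx') as writer:
    # First the full dataframe, then one worksheet per unique state
    df.to_excel(writer, sheet_name='All Data', index=False)
    for state, group in df.groupby('state'):
        # Excel caps sheet names at 31 characters
        group.to_excel(writer, sheet_name=str(state)[:31], index=False)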

A simple way to do it would be to get the unique states, then filter on each one and save the result as an individual CSV (or perform any other operation afterwards).
Here's an example:
# df[column].unique() returns the unique values in that particular column
for state in df['state'].unique():
    # Filter the dataframe on that column/value and save it as its own CSV
    df[df['state'] == state].to_csv(f"{state}.csv")

Related

Writing a For loop to delete words in a list from a dataframe [duplicate]

This question already has answers here:
How to filter Pandas dataframe using 'in' and 'not in' like in SQL
(11 answers)
Closed 8 days ago.
I have a list of strings:
non_dogs =['tiger_shark', 'upright', 'walking_stick', 'water_bottle']
I want to delete the rows containing the strings in that list from the dataframe. How do I do that with a for loop, using code like this:
clean_breeds = clean_images[clean_images['dog_breed'] == 'tiger_shark'].index
clean_images.drop(clean_breeds, inplace=True)
I tried writing a for loop, but it was not working.
To delete the rows matching strings in the list non_dogs from a dataframe clean_images, you can use a for loop and the dataframe's drop method. The code would look like this:
for breed in non_dogs:
    clean_breeds = clean_images[clean_images['dog_breed'] == breed].index
    clean_images.drop(clean_breeds, inplace=True)
This code will loop through each string in the non_dogs list. For each iteration, it will find all the rows in the dataframe clean_images where the value in the dog_breed column is equal to the current string. The indices of those rows will be stored in the variable clean_breeds. Then, those rows will be dropped from the dataframe using the drop method. The inplace=True argument makes sure that the changes are made to the dataframe itself, rather than creating a new dataframe with the changes.
Here's an example to help illustrate the code:
import pandas as pd
# Create a sample dataframe
clean_images = pd.DataFrame({'dog_breed': ['tiger_shark', 'labrador', 'walking_stick', 'beagle', 'water_bottle']})
# Create a list of strings to delete from the dataframe
non_dogs = ['tiger_shark', 'walking_stick', 'water_bottle']
# Use a for loop to delete the strings in the list from the dataframe
for breed in non_dogs:
    clean_breeds = clean_images[clean_images['dog_breed'] == breed].index
    clean_images.drop(clean_breeds, inplace=True)
# The resulting dataframe only contains rows with dog breeds
print(clean_images)
The output of this code will be:
dog_breed
1 labrador
3 beagle
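As the linked duplicate ('in' / 'not in' filtering) shows, the same cleanup can also be done without a loop using isin, which is usually faster on large frames:
# Keep only rows whose dog_breed is NOT in the non_dogs list
clean_images = clean_images[~clean_images['dog_breed'].isin(non_dogs)]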

Concatenate 2 dataframes while matching multiple columns [duplicate]

This question already has answers here:
Pandas Merging 101
(8 answers)
Closed 12 months ago.
I have 2 almost identical pandas dataframes with 5 common columns.
I want to add the second dataframe, which has one new column, to the first.
Dataframe 1
Dataframe 2
But I want it to update the same row given that the columns 'Lot name', 'wafer' and 'site' match (highlighted in green in the screenshots). If the columns do not match, I want the value to be NaN, as shown below.
Desired output
I have to do this with over 160 discrete columns, but with possibly matching 'Lot name', 'WAFER' and 'SITE' values.
I have tried the various merge options (left, right, outer) and concat, but I just can't seem to get it right. Any help or comments are appreciated.
Edit, follow-up question:
I am trying to use this in a loop, where each iteration generates a new dataframe assigned to TEMP that needs to be merged with the previous dataframe. I cannot merge with an empty dataframe as it gives a merge error. How can I achieve this?
alldata = pd.DataFrame()
for i in range(len(operation)):
    temp = data[data['OPE_NO'].isin([operation[i]])]
    temp = temp[temp['PARAM_NAME'].isin([parameter[i]])]
    temp = temp.reset_index(drop=True)
    temp = temp[["LOT", 'Lot name', 'WAFER', "SITE", "PRODUCT", 'PARAM_VALUE_NUMBER']]
    temp = temp.rename(columns={'PARAM_VALUE_NUMBER': 'PMRM28LEMCKLYTFR.1~' + operation[i] + '~' + parameter[i]})
    alldata.merge(temp, how='outer')
The example can be done with the following code:
df1.merge(df2, how="outer")
If I'm misunderstanding the problem, please let me know. My English is not good, but I am happy to help.
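Since the rows should line up where 'Lot name', 'WAFER' and 'SITE' agree, it is safer to name those key columns explicitly with on=; rows with no match get NaN in the new columns. For the follow-up, one option is to skip the merge on the first iteration rather than start from an empty frame. A minimal sketch under those assumptions (temp_frames is a hypothetical iterable of the per-iteration frames built in the question's loop):
# Outer merge on the three key columns; unmatched rows are filled with NaN
merged = df1.merge(df2, on=['Lot name', 'WAFER', 'SITE'], how='outer')

# Follow-up: accumulate in the loop without merging into an empty frame
alldata = None
for temp in temp_frames:
    alldata = temp if alldata is None else alldata.merge(
        temp, on=['Lot name', 'WAFER', 'SITE'], how='outer')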

For a requirement I need to transform a DataFrame by creating rows out of values from lists that are in a column of that DataFrame [duplicate]

This question already has answers here:
Pandas column of lists, create a row for each list element
(10 answers)
Closed 1 year ago.
I need to transform the below DataFrame into the required format without using a loop (or any other inefficient logic), as the dataframe is huge, i.e., 950 thousand rows, and the Points column holds lists with lengths of more than 1000. I'm getting this data after de-serializing blob data from the database and will need to use it to create some ML models.
input:
output:
for index, val in df.iterrows():
    tempDF = pd.DataFrame(
        [[
            df['I'][index], df['x'][index],
            df['y'][index], df['points'][index],
        ]] * int(df['points'][index]))
    tempDF["Data"] = df['data'][index]
    tempDF["index"] = list(range(1, int(df['k'][index]) + 1))
    FinalDF = FinalDF.append(tempDF, ignore_index=True)
I have tried using a for loop, but for 950 thousand rows it takes so much time that that logic is just not feasible. Please help me find a pandas approach or, failing that, some other method to do this.
*I had to post screenshots because I was unable to post the dataframe as a table. Sorry, I'm new to Stack Overflow.
Use explode:
df.explode('points')
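A minimal sketch of what explode does, on a made-up frame with a list-valued points column (the column names here are illustrative, not the asker's exact schema):
import pandas as pd

df = pd.DataFrame({'I': [1, 2], 'points': [[10, 20, 30], [40, 50]]})
# Each list element becomes its own row; the other columns are repeated
print(df.explode('points').reset_index(drop=True))
#    I points
# 0  1     10
# 1  1     20
# 2  1     30
# 3  2     40
# 4  2     50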

Creating a new dataframe column as a function of other columns [duplicate]

This question already has answers here:
Pandas Vectorized lookup of Dictionary
(2 answers)
Closed 2 years ago.
I have a dataframe with a Country column. It has rows for around 15 countries. I want to add a Continent column using a mapping dictionary, ContinentDict, that maps country names to continent names.
I see that these two work
df['Population'] = df['Energy Supply'] / df['Energy Supply per Capita']
df['Continent'] = df.apply(lambda x: ContinentDict[x['Country']], axis='columns')
but this does not
df['Continent'] = ContinentDict[df['Country']]
Looks like the issue is that df['Country'] is a Series object, so the statement is not smart enough to treat the last statement the same as statement 2.
Questions
Would love to understand why statement 1 works but not 3. Is it because dividing two Series objects is defined as an element-wise divide?
Any way to change 3 to say I want an element-wise operation, without having to go the apply route?
From your statement "a mapping dictionary, ContinentDict", it looks like ContinentDict is a Python dictionary. In this case,
ContinentDict[some_key]
is a pure Python call, regardless of what object some_key is. That's why the third call fails: df['Country'] is a Series, and a Series is mutable and therefore unhashable, so it can never be a dictionary key.
In that case, Python only allows indexing with an exact key and throws an error when the key is not in the dictionary.
Pandas does provide a tool for you to replace/map the values:
df['Continent'] = df['Country'].map(ContinentDict)
In case 1, you are dealing with two pandas Series, so pandas knows how to deal with them.
In case 2, you have a Python dictionary and a pandas Series; pandas doesn't know how to deal with the dictionary (df['Country'] is a pandas Series, not a key in the dictionary).
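A minimal sketch of the map approach with a made-up two-entry dictionary (the real ContinentDict comes from the question):
import pandas as pd

df = pd.DataFrame({'Country': ['France', 'Japan']})
ContinentDict = {'France': 'Europe', 'Japan': 'Asia'}
# map looks up each Series value in the dictionary, element-wise;
# countries missing from the dictionary would become NaN
df['Continent'] = df['Country'].map(ContinentDict)
print(df)
#   Country Continent
# 0  France    Europe
# 1   Japan      Asia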

Lookup in a pandas Dataframe [duplicate]

This question already has answers here:
How can I pivot a dataframe?
(5 answers)
Closed 4 years ago.
I have a dataframe which is similar to:
grades=pd.DataFrame(columns=["person","course_code","grade"],data=[[1,101,2.0],[2,102,1.0],[3,103,3.0],[2,104,4.0],[1,102,5.0],[3,104,2.5],[2,101,1.0]])
On each row is the grade of a certain student in a certain subject.
And want to convert it to another that looks like this:
students=pd.DataFrame(columns=[101,102,103,104],data=[[2.0,5.0,"NaN","NaN"],[1.0,1.0,"NaN",4.0],["NaN","NaN",3.0,2.5]])
On each row is a student (the index of the row) with the different grades obtained in every subject (every column is a different subject).
I have tried doing this:
for subj in grades["COURSE_CODE"].unique():
    grades_subj = grades[grades["COURSE_CODE"] == subj]
    grades_subj = grades_subj.set_index("EXPEDIENT_CODE", drop=True)
    for st in grades["EXPEDIENT_CODE"].unique():
        grade_num = grades_subj.loc[st]["GRADE"]
        student.loc[st][subj] = grade_num
But I get:
KeyError: 'the label [304208] is not in the [index]'
I have tried other ways too and get always errors...
Can someone help me, please?
try:
grades.pivot_table(index='person', columns='course_code', values='grade')
The values argument lets you choose the aggregation column.
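With the sample grades frame from the question, this reshape produces the desired table, with NaN where a student has no grade for a course:
import pandas as pd

grades = pd.DataFrame(columns=["person", "course_code", "grade"],
                      data=[[1, 101, 2.0], [2, 102, 1.0], [3, 103, 3.0], [2, 104, 4.0],
                            [1, 102, 5.0], [3, 104, 2.5], [2, 101, 1.0]])
print(grades.pivot_table(index='person', columns='course_code', values='grade'))
# course_code  101  102  103  104
# person
# 1            2.0  5.0  NaN  NaN
# 2            1.0  1.0  NaN  4.0
# 3            NaN  NaN  3.0  2.5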
In order to answer your comment below, you can always add different levels when indexing. This is simply done by passing a list rather than a single string to index. Note you can do the same with columns. So, based on the example you provide:
grades.pivot_table(index=['person', 'school'], columns='course_code', values='grade')
After this I usually recommend reset_index(), unless you are fluent in slicing and indexing with a MultiIndex.
Also, if the correspondence is 1 to 1, you could merge both dataframes using the appropriate join.
Here you have all the information about Reshaping and Pivot Tables in Pandas.
