Dropping a column in a dataframe based on another column - python

I have a dataframe called 'jobs':
position            software  salary  degree     location  industry
architect           autoCAD   400     masters    london    AEC
data analyst        python    500     bachelors  New York  Telecommunications
personal assistant  excel     200     bachelors  London    Media
.....
I have another dataframe called 'preference':
name      value
position  2
software  4
salary    3
degree    1
location  3
industry  1
I'd like to drop columns from the 'jobs' dataframe whose preference value is less than 2, so that I have:
position            software  salary  location
architect           autoCAD   400     london
data analyst        python    500     New York
personal assistant  excel     200     London
.....
This is what I have:
jobs.drop(list(jobs.filter(preference['value'] < 2), axis = 1, inplace = True)
but it doesn't seem to drop the 'degree' and 'industry' columns. Any help would be appreciated.

Your attempt is almost there, I think. Here's what I have:
>>> jobs.drop(preference.loc[preference['value'] < 2, 'name'], axis=1, inplace=True)
>>> jobs
             position software  salary  location
0           architect  autoCAD     400    london
1        data analyst   python     500  New York
2  personal assistant    excel     200    London

This should work for you:
jobs.drop(preference.loc[preference.value < 2, 'name'], axis=1, inplace=True)
This is why your line of code did not work:
- first of all, there is a closing parenthesis missing (but I guess that's just a typo)
- the selection should be applied to preference instead of jobs
- filter is not really what you want to use here anyway: preference.loc[preference.value < 2, 'name'] returns a Series of all names with value < 2, which is exactly what drop needs
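Putting it together, a minimal self-contained version (dataframes rebuilt from the samples in the question):
import pandas as pd

jobs = pd.DataFrame({
    'position': ['architect', 'data analyst', 'personal assistant'],
    'software': ['autoCAD', 'python', 'excel'],
    'salary': [400, 500, 200],
    'degree': ['masters', 'bachelors', 'bachelors'],
    'location': ['london', 'New York', 'London'],
    'industry': ['AEC', 'Telecommunications', 'Media'],
})
preference = pd.DataFrame({
    'name': ['position', 'software', 'salary', 'degree', 'location', 'industry'],
    'value': [2, 4, 3, 1, 3, 1],
})

# Names whose preference value is below 2, then drop those columns.
to_drop = preference.loc[preference['value'] < 2, 'name']
jobs = jobs.drop(columns=to_drop)
print(jobs)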

Related

How to access the current row data using column name in pandas dataframe?

I want to create a new column in a dataframe, where I search the dataframe using the values in the current row.
My columns are: Category, City, Store, OrderDate, Sales.
        Category      City           Store   OrderDate  Sales
0  Bakes & Cakes  New York       Bakesmith  2019-12-23    300
1  Bakes & Cakes  New York       Bakesmith  2020-01-18    500
2  Bakes & Cakes  New York  Cream Nd Cakes  2019-12-19    600
3  Bakes & Cakes  New York  Cream Nd Cakes  2020-01-12    400
4  Bakes & Cakes    London  Cream Nd Cakes  2019-12-31   1000
I want to do something like this:
df['Last month'] = df[(df['store'] == df.current.store) & (df['city'] == df.current.city) & (df['Category'] == df.current.Category) & (df['date'] == some_date)]
So basically I want to create a new column in a dataframe by slicing the same dataframe based on the current row's values.
How can I do this? Can someone help me with this, please?
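One possible approach (a sketch, not from this thread; I assume 'Last month' means the total Sales of the same Category/City/Store in the previous calendar month, with column names taken from the sample) is to aggregate per month and self-merge:
import pandas as pd

df = pd.DataFrame({
    'Category': ['Bakes & Cakes'] * 5,
    'City': ['New York', 'New York', 'New York', 'New York', 'London'],
    'Store': ['Bakesmith', 'Bakesmith', 'Cream Nd Cakes', 'Cream Nd Cakes', 'Cream Nd Cakes'],
    'OrderDate': pd.to_datetime(['2019-12-23', '2020-01-18', '2019-12-19', '2020-01-12', '2019-12-31']),
    'Sales': [300, 500, 600, 400, 1000],
})

# Total sales per (Category, City, Store) per calendar month.
df['Month'] = df['OrderDate'].dt.to_period('M')
monthly = (df.groupby(['Category', 'City', 'Store', 'Month'], as_index=False)['Sales']
             .sum()
             .rename(columns={'Sales': 'Last month'}))

# Shift each total forward one month so it lines up with the following month's rows.
monthly['Month'] = monthly['Month'] + 1

df = df.merge(monthly, on=['Category', 'City', 'Store', 'Month'], how='left')
print(df.drop(columns='Month'))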

how to deal with a copy-pasted table in pandas- reshaping a column vector

I have a table I copied from a webpage. Pasted into LibreOffice Calc or Excel it occupies a single cell, and pasted into a plain-text editor it becomes a 3507x1 column. If I import it as a pandas dataframe using pd.read_csv I see the same 3507x1 column, and I'd now like to reshape it into the 501x7 array that it started as.
I thought I could recast it as a numpy array, reshape it as I am familiar with in numpy, and then put it back into a df, but the to_numpy methods of pandas seem to want to work with a Series object (not a DataFrame), and attempts to read the file into a Series using e.g.
ser = pd.Series.from_csv('billionaires')
led to tokenizing errors. Is there some simple way to do this? Or should I throw in the towel on this direction and read from the html?
A simple copy-paste does not give you any clear column separator, so there is no easy direct import.
You have only spaces, but spaces may or may not also occur inside the column values (like in the name or the country), so it's impossible to give pandas.read_csv a column separator.
However, if I copy-paste the table into a file, I notice some regularity.
If you know regex, you can try using pandas.Series.str.extract. This method extracts capture groups in a regex pattern as columns of a DataFrame. The regex is applied to each element / string of the series.
You can then try to find a regex pattern to capture the various elements of the row to split them into separate columns.
df = pd.read_csv('data.txt', names=['A'])  # no header in the file
ss = df['A']
rdf = ss.str.extract(r'(\d)\s+(.+)(\$[\d\.]+B)\s+([+-]\$[\d\.]+[BM])\s+([+-]\$[\d\.]+B)\s+([\w\s]+)\s+([\w\s]+)')
Here I tried to write a regex for the table in the link; the result on the first rows seems pretty good.
   0                             1       2        3        4              5            6
0  1                    Jeff Bezos   $121B   +$231M  -$3.94B  United States   Technology
1  3               Bernard Arnault   $104B   +$127M  +$35.7B         France     Consumer
2  4                Warren Buffett  $84.9B  +$66.3M  +$1.11B  United States  Diversified
3  5               Mark Zuckerberg  $76.7B   -$301M  +$24.6B  United States   Technology
4  6                Amancio Ortega  $66.5B   +$303M  +$7.85B          Spain       Retail
5  7                 Larry Ellison  $62.3B   +$358M  +$13.0B  United States   Technology
6  8                   Carlos Slim  $57.0B   -$331M  +$2.20B         Mexico  Diversified
7  9  Francoise Bettencourt Meyers  $56.7B  -$1.12B  +$10.5B         France     Consumer
8  0                    Larry Page  $55.7B   +$393M  +$4.47B  United States   Technology
I used pandas.read_csv to read the file, since Series.from_csv is deprecated.
I found that converting to a numpy array was far easier than I had realized: the numpy asarray function can handle a df (and, conveniently enough, it works for general objects, not just numbers).
import numpy as np
import pandas as pd

df = pd.read_csv('billionaires', sep='\n')
print(df.shape)
# -> (3507, 1)
n = np.asarray(df)           # asarray accepts a DataFrame directly
m = np.reshape(n, [-1, 7])   # 7 columns per original table row
df2 = pd.DataFrame(m)
df2.head()
   0                1                2              3             4  \
0  0             Name  Total net worth  $ Last change  $ YTD change
1  1       Jeff Bezos            $121B         +$231M       -$3.94B
2  2       Bill Gates            $107B         -$421M       +$16.7B
3  3  Bernard Arnault            $104B         +$127M       +$35.7B
4  4   Warren Buffett           $84.9B        +$66.3M       +$1.11B

               5            6
0        Country     Industry
1  United States   Technology
2  United States   Technology
3         France     Consumer
4  United States  Diversified
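One possible cleanup step (my addition, not from the original post): the first reshaped row holds the original header labels, so it can be promoted to column names:
df2.columns = df2.iloc[0]                       # first row contains the header
df2 = df2.drop(index=0).reset_index(drop=True)  # then drop it from the data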

Python: remove row if cell value in dataframe contains fewer than 5 characters

I have a dataframe like the one below, and I am trying to keep only the rows whose 'schools' value has more than 5 characters. Here is what I tried, but it removes the short words ('of', 'U.', 'and', 'Arts', etc.) inside each row instead; I just need to remove the rows whose string has a length of less than 5.
id  schools
1   University of Hawaii
2   Dept in Colorado U.
3   Dept
4   College of Arts and Science
5   Dept
6   Bldg
wrong output from my code:
0    University Hawaii
1             Colorado
2
3      College Science
4
5
Looking for output like this:
id  schools
1   University of Hawaii
2   Dept in Colorado U.
4   College of Arts and Science
Code:
import pandas as pd

l = [1, 2, 3, 4, 5, 6]
s = ['University of Hawaii', 'Dept in Colorado U.', 'Dept',
     'College of Arts and Science', 'Dept', 'Bldg']
df1 = pd.DataFrame({'id': l, 'schools': s})
df1 = df1['schools'].str.findall(r'\w{5,}').str.join(' ')  # not working
df1
Using a regex is huge (and slow) overkill for this task. You can use simple pandas indexing:
filtered_df = df1[df1['schools'].str.len() > 5]  # or >= depending on the required logic
There is a simpler filter for your data.
mask = df1['schools'].str.len() > 5
Then create a new data frame from the filter:
df2 = df1[mask].copy()
import pandas as pd

name = ['University of Hawaii', 'Dept in Colorado U.', 'Dept',
        'College of Arts and Science', 'Dept', 'Bldg']
labels = ['schools']
df = pd.DataFrame.from_records([[i] for i in name], columns=labels)
df[df['schools'].str.len() > 5]
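For completeness, a quick check of the length filter against the question's full df1, with the id column kept (the output comment is written out by hand, so treat it as a sketch):
import pandas as pd

df1 = pd.DataFrame({
    'id': [1, 2, 3, 4, 5, 6],
    'schools': ['University of Hawaii', 'Dept in Colorado U.', 'Dept',
                'College of Arts and Science', 'Dept', 'Bldg'],
})
print(df1[df1['schools'].str.len() > 5])
#    id                      schools
# 0   1         University of Hawaii
# 1   2          Dept in Colorado U.
# 3   4  College of Arts and Science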

Python Pandas inserting new rows with colums values

I'm writing a Python script to clean a CSV file we receive from Qualtrics for an entrepreneurship competition.
So far, I've sliced the data and written it back to an Excel file with Pandas. However, I have some columns from which I would need to create new rows.
For example, for each team submission we have:
   Team Name  Nb of teammates  Team Leader One  Team Leader Two
1  x          2                Joe              Joey
2  y          1                Jack
...
I would need to return:
   Team Name  Nb of teammates  Team Leader
1  x          2                Joe
2                              Joey
3  y          1                Jack
...
This is a very simplified example of the real data I have (there are more columns), but I was wondering how I could do that in Pandas/Python.
I'm aware of these discussions on Inserting Row and Indexing: Setting with enlargement, but I don't know what I should do.
Thanks for your help!
You can use melt:
# set up the frame
df = pd.DataFrame({'Team Name': ['x', 'y'], 'Nb of teammates': [2, 1],
                   'Team Leader One': ['Joe', 'Jack'], 'Team Leader Two': ['Joey', None]})
Melt the frame:
pd.melt(df, id_vars=['Team Name', 'Nb of teammates'],
        value_vars=['Team Leader One', 'Team Leader Two']).dropna()
returns:
  Team Name  Nb of teammates         variable value
0         x                2  Team Leader One   Joe
1         y                1  Team Leader One  Jack
2         x                2  Team Leader Two  Joey
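To land exactly on the desired 'Team Leader' layout, a possible follow-up (my sketch, building on the same melt call) is to name the value column and drop the helper column:
out = (pd.melt(df, id_vars=['Team Name', 'Nb of teammates'],
               value_vars=['Team Leader One', 'Team Leader Two'],
               value_name='Team Leader')
         .dropna()                    # drop the missing second leader for team y
         .drop(columns='variable')    # the 'Team Leader One/Two' label is no longer needed
         .sort_values('Team Name')
         .reset_index(drop=True))
print(out)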

Getting unique rows conditioned on year pandas python dataframe

I have a dataframe of this form. However, in my final dataframe I'd like to keep only rows with unique names per year.
                  Name         Org  Year
4  New York University  doclist[1]  2004
5       Babson College  doclist[2]  2008
6       Babson College  doclist[5]  2008
So ideally, my dataframe will look like this instead:
                  Name         Org  Year
4  New York University  doclist[1]  2004
5       Babson College  doclist[2]  2008
What I've done so far: I've grouped by year, and I seem to be able to get the unique names per year. However, I am stuck because I lose all the other information, such as the 'Org' column. Advice appreciated!
import numpy as np

# how to get unique rows per year?
q = z.groupby(['Year'])
# print(q.head())
# q.reset_index(level=0, drop=True)
q.Name.apply(lambda x: np.unique(x))
For this I get the following output. How do I include the other column information, as well as remove the secondary index (e.g. 6, 68, 66, 72)?
Year
2008  6                             Babson College
      68    European Economic And Social Committee
      66                            European Union
      72          Ewing Marion Kauffman Foundation
If all you want to do is keep the first entry for each name, you can use drop_duplicates. Note that this will keep the first entry based on however your data is sorted, so you may want to sort first if you want to keep a specific entry.
In [98]: z.drop_duplicates(subset='Name')
Out[98]:
                  Name         Org  Year
0  New York University  doclist[1]  2004
1       Babson College  doclist[2]  2008
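If the same name can legitimately recur in different years and you only want uniqueness within each year (as the question asks), a small variation (my assumption, not part of the original answer) is to deduplicate on both columns:
z.drop_duplicates(subset=['Name', 'Year'])  # keeps one row per (Name, Year) pair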
