Python Pandas inserting new rows with column values - python

I'm writing a Python script to clean a CSV file we receive from Qualtrics for an entrepreneurship competition.
So far, I've sliced the data and written it back to an Excel file with Pandas. However, some columns hold values that I need to split out into new rows.
For example for each team submission we have
  Team Name  Nb of teammates  Team Leader One  Team Leader Two
1 x          2                Joe              Joey
2 y          1                Jack
...
I would need to return
  Team Name  Nb of teammates  Team Leader
1 x          2                Joe
2                             Joey
3 y          1                Jack
...
This is a very simplified example of the real data I have, which has more columns, but I was wondering how I could do that in Pandas/Python.
I'm aware of these discussions on Inserting Row and Indexing: Setting with enlargement, but I don't know what I should do.
Thanks for your help !

You can use melt:
# set up the frame
df = pd.DataFrame({'Team Name':['x','y'], 'Nb of teammates':[2,1], 'Team Leader One':['Joe','Jack'], 'Team Leader Two':['Joey',None]})
Melt the frame:
pd.melt(df,id_vars=['Team Name','Nb of teammates'],value_vars=['Team Leader One','Team Leader Two']).dropna()
returns:
  Team Name  Nb of teammates  variable         value
0 x          2                Team Leader One  Joe
1 y          1                Team Leader One  Jack
2 x          2                Team Leader Two  Joey
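To get exactly the shape asked for, the melted frame can then be tidied up. A sketch, using the sample frame from the question (the sort on the helper `variable` column keeps each team's leaders together, 'One' before 'Two'):

```python
import pandas as pd

# Frame from the question
df = pd.DataFrame({'Team Name': ['x', 'y'],
                   'Nb of teammates': [2, 1],
                   'Team Leader One': ['Joe', 'Jack'],
                   'Team Leader Two': ['Joey', None]})

out = (
    pd.melt(df,
            id_vars=['Team Name', 'Nb of teammates'],
            value_vars=['Team Leader One', 'Team Leader Two'])
      .dropna(subset=['value'])
      # group each team's leaders together, 'One' before 'Two'
      .sort_values(['Team Name', 'variable'])
      .drop(columns='variable')
      .rename(columns={'value': 'Team Leader'})
      .reset_index(drop=True)
)
```

`out` then holds one row per team leader, with the team's other columns repeated on each row.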

Related

How to labeling data in pandas based on value of column have similar value in another column

If there is someone who understands this, please help me resolve it. I want to label user data using Python Pandas. My dataset has two columns, author and retweeted_screen_name. I want to add a label with this criterion: rows whose retweeted_screen_name value is shared with at least one other row get 1, and rows whose value appears only once get 0.
Author  RT_Screen_Name  Label
Alice   John            1
Sandy   John            1
Lisa    Mario           0
Luna    Mark            0
Luna    John            1
Luke    Anthony         0
df['Label']=0
df.loc[df["RT_Screen_Name"]=="John", ["Label"]] = 1
It is unclear what condition you are using to decide the Label variable, but once your condition is clear you can swap it into the conditional statement in this code. If you edit your question to clarify the condition, notify me and I will adjust my answer.
IIUC, try with groupby:
df["Label"] = (df.groupby("RT_Screen_Name")["Author"].transform("count")>1).astype(int)
>>> df
Author RT_Screen_Name Label
0 Alice John 1
1 Sandy John 1
2 Lisa Mario 0
3 Luna Mark 0
4 Luna John 1
5 Luke Anthony 0
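An equivalent, slightly shorter sketch uses duplicated(keep=False), which flags every row whose RT_Screen_Name value occurs more than once (same assumption about the condition as above):

```python
import pandas as pd

df = pd.DataFrame({
    'Author': ['Alice', 'Sandy', 'Lisa', 'Luna', 'Luna', 'Luke'],
    'RT_Screen_Name': ['John', 'John', 'Mario', 'Mark', 'John', 'Anthony'],
})

# duplicated(keep=False) marks *all* rows that share a RT_Screen_Name value
df['Label'] = df['RT_Screen_Name'].duplicated(keep=False).astype(int)
```

This avoids the groupby and reads as a direct statement of the condition.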

Dropping a column in a dataframe based on another column

I have a dataframe called jobs
position            software  salary  degree     location  industry
architect           autoCAD   400     masters    london    AEC
data analyst        python    500     bachelors  New York  Telecommunications
personal assistant  excel     200     bachelors  London    Media
.....
I have another dataframe called 'preference'
name value
position 2
software 4
salary 3
degree 1
location 3
industry 1
I'd like to drop columns from the 'jobs' dataframe whose preference value is less than 2 so that I have
position            software  salary  location
architect           autoCAD   400     london
data analyst        python    500     New York
personal assistant  excel     200     London
.....
This is what I have
jobs.drop(list(jobs.filter(preference['value'] < 2), axis = 1, inplace = True)
but it doesn't seem to drop the (degree and industry) columns. Any help would be appreciated
Your attempt is almost there I think. Here's what I have:
>>> jobs.drop(preference.loc[preference['value'] < 2, 'name'], axis=1, inplace=True)
position software salary location
0 architect autoCAD 400 london
1 data analyst python 500 New York
2 personal assistant excel 200 London
This should work for you:
jobs.drop(preferences.loc[preferences.value < 2, 'name'], axis=1, inplace=True)
This is why your line of code did not work:
first of all, there is a closing parenthesis missing (but I guess that's just a typo)
the filter method should be applied to preferences instead of jobs
filter is not really what you want to use here to get a list of names: preferences.loc[preferences.value < 2, 'name'] returns a list of all names with value < 2
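Putting that approach together with the sample data from the question (values abbreviated), a runnable sketch:

```python
import pandas as pd

# Sample frames from the question (abbreviated)
jobs = pd.DataFrame({
    'position': ['architect', 'data analyst', 'personal assistant'],
    'software': ['autoCAD', 'python', 'excel'],
    'salary': [400, 500, 200],
    'degree': ['masters', 'bachelors', 'bachelors'],
    'location': ['london', 'New York', 'London'],
    'industry': ['AEC', 'Telecommunications', 'Media'],
})
preference = pd.DataFrame({
    'name': ['position', 'software', 'salary', 'degree', 'location', 'industry'],
    'value': [2, 4, 3, 1, 3, 1],
})

# Names of the columns whose preference value is below 2
to_drop = preference.loc[preference['value'] < 2, 'name']
jobs = jobs.drop(columns=to_drop)
```

After the drop, `jobs` keeps only position, software, salary and location, matching the desired output.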

Python remove row if cell value in dataframe contain characters less than 5

I have a dataframe like the one below. I am trying to keep rows that have more than 5 characters. Here is what I tried, but it removes short words like 'of', 'U.', 'and', 'Arts', etc. from within each string. I just need to drop whole rows whose value has len less than 5.
id schools
1 University of Hawaii
2 Dept in Colorado U.
3 Dept
4 College of Arts and Science
5 Dept
6 Bldg
wrong output from my code:
0 University Hawaii
1 Colorado
2
3 College Science
4
5
Looking for output like this:
id schools
1 University of Hawaii
2 Dept in Colorado U.
4 College of Arts and Science
Code:
l = [1,2,3,4,5,6]
s = ['University of Hawaii', 'Dept in Colorado U.','Dept','College of Arts and Science','Dept','Bldg']
df1 = pd.DataFrame({'id':l, 'schools':s})
df1 = df1['schools'].str.findall(r'\w{5,}').str.join(' ') # not working
df1
Using a regex is huge (and slow) overkill for this task. You can use simple pandas indexing:
filtrered_df = df1[df1['schools'].str.len() > 5] # or >= depending on the required logic
There is a simpler filter for your data.
mask = df1['schools'].str.len() > 5
Then create a new data frame from the filter
df2 = df1[mask].copy()
import pandas as pd
name = ['University of Hawaii', 'Dept in Colorado U.', 'Dept', 'College of Arts and Science', 'Dept', 'Bldg']
labels = ['schools']
df = pd.DataFrame.from_records([[i] for i in name], columns=labels)
df[df['schools'].str.len() > 5]
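For contrast, a short sketch of why the original findall attempt stripped words instead of rows: findall operates on matches *within* each string, while boolean indexing on str.len() keeps or drops each whole row.

```python
import pandas as pd

s = pd.Series(['University of Hawaii', 'Dept'])

# findall(r'\w{5,}') extracts only the 5+ letter words inside each string,
# which is why 'of' disappeared; it never removes whole rows
word_level = s.str.findall(r'\w{5,}').str.join(' ')

# Row-level boolean indexing keeps or drops each whole string instead
row_level = s[s.str.len() > 5]
```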

nested for loops with pandas dataframe

I am looping through a dataframe column of headlines (sp500news) and comparing against a dataframe of company names (co_names_df). I am trying to update the frequency each time a company name appears in a headline.
My current code is below and is not updating the frequency columns. Is there a cleaner, faster implementation - maybe without the for loops?
for title in sp500news['title']:
    for string in title:
        for co_name in co_names_df['Name']:
            if string == co_name:
                co_names_index = co_names_df.loc[co_names_df['Name']=='string'].index
                co_names_df['Frequency'][co_names_index] += 1
co_names_df sample
Name Frequency
0 3M 0
1 A.O. Smith 0
2 Abbott 0
3 AbbVie 0
4 Accenture 0
5 Activision 0
6 Acuity Brands 0
7 Adobe Systems 0
...
sp500news['title'] sample
title
0 Italy will not dismantle Montis labour reform minister
1 Exclusive US agency FinCEN rejected veterans in bid to hire lawyers
4 Xis campaign to draw people back to graying rural China faces uphill battle
6 Romney begins to win over conservatives
8 Oregon mall shooting survivor in serious condition
9 Polands PGNiG to sign another deal for LNG supplies from US CEO
You can probably speed this up; you're using dataframes where other structures would work better. Here's what I would try.
from collections import Counter
counts = Counter()
# checking membership in a set is very fast (O(1))
company_names = set(co_names_df["Name"])
for title in sp500news['title']:
    for word in title:  # did you mean title.split(" ")? or is title a list of strings?
        if word in company_names:
            counts.update([word])
counts is then a dictionary {company_name: count}. You can just do a quick loop over the elements to update the counts in your dataframe.
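That final write-back could look like the sketch below (with made-up sample titles). Note it splits titles on spaces, so it only matches single-word names; multi-word names like 'Adobe Systems' would need extra handling:

```python
from collections import Counter

import pandas as pd

# Made-up sample data in the shape described in the question
co_names_df = pd.DataFrame({'Name': ['3M', 'Abbott', 'Adobe Systems'],
                            'Frequency': 0})
sp500news = pd.DataFrame({'title': ['Abbott beats estimates',
                                    '3M and Abbott rally']})

company_names = set(co_names_df['Name'])  # O(1) membership tests
counts = Counter()
for title in sp500news['title']:
    for word in title.split(' '):
        if word in company_names:
            counts.update([word])

# Map the counts back onto the dataframe; names never seen get 0
co_names_df['Frequency'] = co_names_df['Name'].map(counts).fillna(0).astype(int)
```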

Filling in a pandas column based on existing number of strings

I have a pandas data-frame that looks like this:
ID  Hobby     Name
1   Travel    Kevin
2   Photo     Andrew
3   Travel    Kevin
4   Cars      NaN
5   Photo     Andrew
6   Football  NaN
.............. 1303 rows.
The number of distinct names might be larger than 2 as well. I would like the entire Name column to end up filled, with the rows split equally among the names (or one name getting an extra row when the counts don't divide evenly). I already store the total number of names in a variable; in the case above it's 2. I tried filtering and counting by each name, but I don't know how to do this when the number of names is dynamic.
Expected Dataframe:
ID  Hobby     Name
1   Travel    Kevin
2   Photo     Andrew
3   Travel    Kevin
4   Cars      Kevin
5   Photo     Andrew
6   Football  Andrew
I tried: replacing NaN with 0 in the Name column using fillna, filtering the column down to only the NaN rows, using len(df) to count them, and from there creating 2 dataframes each containing half of the df. But I think this approach is completely wrong, as I do not always have 2 names; there could be 2, 3, 4, etc. (this is given by a dictionary).
Any help highly appreciated.
Thanks.
It's difficult to tell, but I think you need ffill:
df['Name'] = df['Name'].ffill()
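On the question's sample rows, a minimal sketch (the 'Hobby' column name is taken from the example):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'ID': [1, 2, 3, 4, 5, 6],
                   'Hobby': ['Travel', 'Photo', 'Travel', 'Cars', 'Photo', 'Football'],
                   'Name': ['Kevin', 'Andrew', 'Kevin', np.nan, 'Andrew', np.nan]})

# Each missing Name takes the most recent non-missing value above it
df['Name'] = df['Name'].ffill()
```

Note that ffill fills each gap with the nearest name above it, which happens to match the expected output here; it does not balance the counts across names in general.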