How to add rows to a DataFrame within a for loop? - python

I want to add a row to an existing data frame wherever a regex pattern fails to match.
For example,
import pandas as pd
import numpy as np
import re

lst = ['Sarah Kim', 'Added on January 21']
df = pd.DataFrame(lst)
df.columns = ['Info']

name_pat = r"^[A-Z][a-z]+,?\s+(?:[A-Z][a-z]*\.?\s*)?[A-Z][a-z]+"
date_pat = r"\b(\w*Added on\w*)\b"
title_pat = r"\b(\w*at\w*)\b"

for index, row in df.iterrows():
    if re.findall(name_pat, str(row['Info'])):
        print("Name matched")
    elif re.findall(title_pat, str(row['Info'])):
        print("Title matched")
        if not re.findall(title_pat, str(row['Info'])):  # re.findall returns a list, never None
            # Add a row here in the dataframe
            pass
    elif re.findall(date_pat, str(row['Info'])):
        print("Date matched")
        if not re.findall(date_pat, str(row['Info'])):
            # Add a row here in the dataframe
            pass
So here in my dataframe df, I do not have a title, just a Name and a Date. While looping over df, I want to add an empty row for the title.
The output is:
Info
0 Sarah Kim
1 Added on January 21
My expected output is:
Info
0 Sarah Kim
1 None
2 Added on January 21
Is there any way that I can add an empty row, or is there a better way?
EDIT:
The dataset I'm working with is just one column with many rows. The rows follow a repeating structure of "name, title, date". For example,
Info
0 Sarah Kim
1 Added on January 21
2 Jesus A. Moore
3 Marketer
4 Added on May 30
5 Bobbie J. Garcia
6 CEO
7 Anita Jobe
8 Designer
9 Added on January 3
...
998 Michael B. Reedy
999 Salesman
1000 Added on December 13
I have sliced the data frame so that I can extract one section at a time, which looks like this:
Info
0 Sarah Kim
1 Added on January 21
And I'm trying to run a loop over each section; if a date or title is missing, I fill it with an empty row, so that in the end I will have:
Info
0 Sarah Kim
1 **NULL**
2 Added on January 21
3 Jesus A. Moore
4 Marketer
5 Added on May 30
6 Bobbie J. Garcia
7 CEO
8 **NULL**
9 Anita Jobe
10 Designer
11 Added on January 3
...
998 Michael B. Reedy
999 Salesman
1000 Added on December 13

I see you have a long dataframe with information, and each set of information is slightly different. I think your goal is to end up with a dataframe that has 3 columns:
Name, Title and Date
Here is a way I would approach this problem, with some code samples. I would take advantage of the df.shift method to tie related rows together, using your existing dataframe to create a new one.
I am also making some assumptions based on what you have listed above. First, I will assume that only the Title and Date fields can be missing. Second, I will assume that the order of the fields is Name, Title and Date, as you have mentioned above.
# first step: create test data
import pandas as pd
import re

test_list = ['Sarah Kim', 'Added on January 21', 'Jesus A. Moore', 'Marketer',
             'Added on May 30', 'Bobbie J. Garcia', 'CEO', 'Anita Jobe',
             'Designer', 'Added on January 3']
test_df = pd.DataFrame(test_list, columns=['Info'])

# second step: use your regexes to decide what type of value each Info row is
name_pat = r"^[A-Z][a-z]+,?\s+(?:[A-Z][a-z]*\.?\s*)?[A-Z][a-z]+"
date_pat = r"\b(\w*Added on\w*)\b"
title_pat = r"\b(\w*at\w*)\b"
test_df['Col'] = test_df['Info'].apply(
    lambda x: 'Name' if re.findall(name_pat, x)
    else ('Date' if re.findall(date_pat, x) else 'Title'))

# third step: get the next values from our dataframe using df.shift
test_df['Next_col'] = test_df['Col'].shift(-1)
test_df['Next_col2'] = test_df['Col'].shift(-2)
test_df['Next_val1'] = test_df['Info'].shift(-1)
test_df['Next_val2'] = test_df['Info'].shift(-2)
# now filter to only the names and apply a function to pull out name, title and date
new_df = test_df[test_df['Col'] == 'Name']

def apply_func(row):
    name = row['Info']
    title = None
    date = None
    if row['Next_col'] == 'Title':
        title = row['Next_val1']
    elif row['Next_col'] == 'Date':
        date = row['Next_val1']
    if row['Next_col2'] == 'Date':
        date = row['Next_val2']
    row['Name'] = name
    row['Title'] = title
    row['date'] = date
    return row

final_df = new_df.apply(apply_func, axis=1)[['Name', 'Title', 'date']].reset_index(drop=True)
print(final_df)
Name Title date
0 Sarah Kim None Added on January 21
1 Jesus A. Moore Marketer Added on May 30
2 Bobbie J. Garcia CEO None
3 Anita Jobe Designer Added on January 3
There is probably a way to do this in fewer lines of code, and I welcome anyone who can make it more efficient, but I believe this should work. Also, if you want to flatten this back into a single Info column:
flattened_df = pd.DataFrame(final_df.values.flatten(), columns=['Info'])
print(flattened_df)
Info
0 Sarah Kim
1 None
2 Added on January 21
3 Jesus A. Moore
4 Marketer
5 Added on May 30
6 Bobbie J. Garcia
7 CEO
8 None
9 Anita Jobe
10 Designer
11 Added on January 3
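As a possibly shorter alternative (just a sketch, only checked against the sample data above): since every section begins with a Name row, a cumulative count of Name rows gives each section an id, and pivot can then spread each section across the three columns. This assumes at most one Title and one Date per section, and reuses the 'Col' classification computed earlier.
# label each row with its section number, then pivot classified rows into columns
test_df['section'] = (test_df['Col'] == 'Name').cumsum()
alt_df = (test_df.pivot(index='section', columns='Col', values='Info')
                 [['Name', 'Title', 'Date']]
                 .reset_index(drop=True))
print(alt_df)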

Related

Comparing Similar Data Frames with Like-Columns in Python

I'd like to compare the differences between two data frames. xyz has all of the same columns as abc, but it has an additional column.
In the comparison, I'd like to match up the two like columns (Sport), but only show the SportLeague in the output (if a difference exists, that is). For example, instead of showing 'Soccer' as a difference, show 'Soccer:MLS', which is the adjacent column in xyz.
Here is the code that builds the two data frames:
import pandas as pd
import numpy as np
abc = {'Sport' : ['Football', 'Basketball', 'Baseball', 'Hockey'], 'Year' : ['2021','2021','2022','2022'], 'ID' : ['1','2','3','4']}
abc = pd.DataFrame({k: pd.Series(v) for k, v in abc.items()})
abc
xyz = {'Sport' : ['Football', 'Football', 'Basketball', 'Baseball', 'Hockey', 'Soccer'], 'SportLeague' : ['Football:NFL', 'Football:XFL', 'Basketball:NBA', 'Baseball:MLB', 'Hockey:NHL', 'Soccer:MLS'], 'Year' : ['2022','2019', '2022','2022','2022', '2022'], 'ID' : ['2','0', '3','2','4', '1']}
xyz = pd.DataFrame({k: pd.Series(v) for k, v in xyz.items()})
xyz = xyz.sort_values(by = ['ID'], ascending = True)
xyz
Code already tried:
abc.compare(xyz, align_axis=1, keep_shape=False, keep_equal=False)
This raises an error, since the data frames don't have the exact same columns. What I want is something like the following:
Example: if xyz['Sport'] does not show up anywhere within abc['Sport'], then show xyz['SportLeague'] as the difference between the data frames.
Further clarification of the logic:
Does abc['Sport'] appear anywhere in xyz['Sport']? If not, indicate "Not found in xyz data frame". If it does exist, are its corresponding abc['Year'] and abc['ID'] values the same? If not, show "Change from xyz['Year'] and xyz['ID'] to abc['Year'] and abc['ID']".
Does xyz['Sport'] appear anywhere in abc['Sport']? If not, indicate "Remove xyz['SportLeague']".
What I've explained above is similar to the .compare method. However, the data frames in this example may not be the same length and may have different numbers of variables.
If I understand you correctly, we basically want to merge both DataFrames, apply a number of comparisons between them, and add a column that explains the course of action to take for each comparison result.
Note: in the example here I have added one sport ('Cricket') to your df abc, to trigger the condition abc['Sport'] does not exist in xyz['Sport'].
abc = {'Sport' : ['Football', 'Basketball', 'Baseball', 'Hockey','Cricket'], 'Year' : ['2021','2021','2022','2022','2022'], 'ID' : ['1','2','3','4','5']}
abc = pd.DataFrame({k: pd.Series(v) for k, v in abc.items()})
print(abc)
Sport Year ID
0 Football 2021 1
1 Basketball 2021 2
2 Baseball 2022 3
3 Hockey 2022 4
4 Cricket 2022 5
I've left xyz unaltered. Now, let's merge these two dfs:
df = xyz.merge(abc, on='Sport', how='outer', suffixes=('_xyz','_abc'))
print(df)
Sport SportLeague Year_xyz ID_xyz Year_abc ID_abc
0 Football Football:XFL 2019 0 2021 1
1 Football Football:NFL 2022 2 2021 1
2 Soccer Soccer:MLS 2022 1 NaN NaN
3 Baseball Baseball:MLB 2022 2 2022 3
4 Basketball Basketball:NBA 2022 3 2021 2
5 Hockey Hockey:NHL 2022 4 2022 4
6 Cricket NaN NaN NaN 2022 5
Now, we have a df where we can evaluate your set of conditions using np.select(conditions, choices, default). Like this:
conditions = [
    df.Year_abc.isnull(),
    df.Year_xyz.isnull(),
    (df.Year_xyz != df.Year_abc) & (df.ID_xyz != df.ID_abc),
    df.Year_xyz != df.Year_abc,
    df.ID_xyz != df.ID_abc,
]
choices = [
    'Sport not in abc',
    'Sport not in xyz',
    'Change year and ID to xyz',
    'Change year to xyz',
    'Change ID to xyz',
]
df['action'] = np.select(conditions, choices, default=np.nan)
The result is below, with a new action column noting which course of action to take.
Sport SportLeague Year_xyz ID_xyz Year_abc ID_abc \
0 Football Football:XFL 2019 0 2021 1
1 Football Football:NFL 2022 2 2021 1
2 Soccer Soccer:MLS 2022 1 NaN NaN
3 Baseball Baseball:MLB 2022 2 2022 3
4 Basketball Basketball:NBA 2022 3 2021 2
5 Hockey Hockey:NHL 2022 4 2022 4
6 Cricket NaN NaN NaN 2022 5
action
0 Change year and ID to xyz # match, but mismatch year and ID
1 Change year and ID to xyz # match, but mismatch year and ID
2 Sport not in abc # no match: Sport in xyz, but not in abc
3 Change ID to xyz # match, but mismatch ID
4 Change year and ID to xyz # match, but mismatch year and ID
5 nan # complete match: no action needed
6 Sport not in xyz # no match: Sport in abc, but not in xyz
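One note on the output above: because the choices are strings, np.select coerces the default=np.nan to the string 'nan' (as seen in row 5). If you would rather have an explicit label for complete matches, a string default avoids that:
# label complete matches explicitly instead of the string 'nan'
df['action'] = np.select(conditions, choices, default='No action needed')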
Let me know if this is a correct interpretation of what you are looking to achieve.

Finding earliest date after groupby a specific column

I have a dataframe that looks like the one below.
id name tag location date
1 John 34 FL 01/12/1990
1 Peter 32 NC 01/12/1990
1 Dave 66 SC 11/25/1990
1 Mary 12 CA 03/09/1990
1 Sue 29 NY 07/10/1990
1 Eve 89 MA 06/12/1990
: : : : :
n John 34 FL 01/12/2000
n Peter 32 NC 01/12/2000
n Dave 66 SC 11/25/1999
n Mary 12 CA 03/09/1999
n Sue 29 NY 07/10/1998
n Eve 89 MA 06/12/1997
I need to find the location information based on the id column, but with one condition: only the earliest date. For example, the earliest date for the id=1 group is 01/12/1990, which means the locations are FL and NC. I then want to apply this to all the different id groups to get the top 3 locations. I have written code to do this:
# Get the earliest date based on id group
df_ear = df.loc[df.groupby('id')['date'].idxmin()]
# Count the occurrences of the location
df_ear['location'].value_counts()
The code works fine, but it cannot return more than one location (using my first line of code) when rows share the same earliest date; for example, the id=1 group returns only FL instead of FL and NC. I am wondering how I can fix my code to handle the case where the earliest date appears in more than one row.
Thanks!
Use GroupBy.transform to broadcast the minimal date per group back to a Series aligned with the original rows, so it can be compared against the date column with boolean indexing:
df['date'] = pd.to_datetime(df['date'])
df_ear = df[df.groupby('id')['date'].transform('min').eq(df['date'])]
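From there, the top 3 locations follow with the value_counts step from the question:
# count the occurrences of each location among the earliest-date rows
df_ear['location'].value_counts().head(3)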

How to count Pandas df elements with dynamic condition per row (=countif)

I am trying to do an equivalent of COUNTIF in Pandas. I am trying to get my head around doing it with groupby, but I am struggling because my logical grouping condition is dynamic.
Say I have a list of customers, and the day on which they visited. I want to identify new customers based on 2 logical conditions
They must be the same customer (same Guest ID)
They must have been there on the previous day
If both conditions are met, they are a returning customer. If not, they are new (hence newby = 1 - ..., to identify new customers).
I managed to do this with a for loop, but obviously performance is terrible and this goes pretty much against the logic of Pandas.
How can I wrap the following code into something smarter than a loop?
for i in range(0, len(df)):
    newby = 1 - np.sum((df["Day"] == df.iloc[i]["Day"] - 1) & (df["Guest ID"] == df.iloc[i]["Guest ID"]))
This post does not help, as the condition there is static. I would like to avoid introducing "dummy columns", such as by transposing the df, because I will have many categories (many customer names) and would like to build more complex logical statements. I do not want to run the risk of ending up with many auxiliary columns.
I have the following input
df
Day Guest ID
0 3230 Tom
1 3230 Peter
2 3231 Tom
3 3232 Peter
4 3232 Peter
and expect this output
df
Day Guest ID newby
0 3230 Tom 1
1 3230 Peter 1
2 3231 Tom 0
3 3232 Peter 1
4 3232 Peter 1
Note that elements 3 and 4 are not necessarily duplicates - given there might be additional, varying columns (such as their order).
Do:
# ensure the df is sorted by date
df = df.sort_values('Day')
# group by customer and find the diff within each group
df['newby'] = (df.groupby('Guest ID')['Day'].transform('diff').fillna(2) > 1).astype(int)
print(df)
Output
Day Guest ID newby
0 3230 Tom 1
1 3230 Peter 1
2 3231 Tom 0
3 3232 Peter 1
UPDATE
If multiple visits are allowed per day, you could do:
# only keep unique visits per day
uniques = df.drop_duplicates()
# ensure the df is sorted by date
uniques = uniques.sort_values('Day')
# group by customer and find the diff within each group
uniques['newby'] = (uniques.groupby('Guest ID')['Day'].transform('diff').fillna(2) > 1).astype(int)
# merge the uniques visits back into the original df
res = df.merge(uniques, on=['Day', 'Guest ID'])
print(res)
Output
Day Guest ID newby
0 3230 Tom 1
1 3230 Peter 1
2 3231 Tom 0
3 3232 Peter 1
4 3232 Peter 1
As an alternative, without sorting or merging, you could build a lookup set of every observed (Day, Guest ID) pair shifted forward one day, so that membership in the set means the guest visited the previous day:
lookup = {(day + 1, guest) for day, guest in df[['Day', 'Guest ID']].value_counts().to_dict()}
df['newby'] = (~pd.MultiIndex.from_arrays([df['Day'], df['Guest ID']]).isin(lookup)).astype(int)
print(df)
Output
Day Guest ID newby
0 3230 Tom 1
1 3230 Peter 1
2 3231 Tom 0
3 3232 Peter 1
4 3232 Peter 1

How to Solve a Data Science Question Using Python's Pandas Data Structure Syntax

Good afternoon.
I have a question I am trying to solve using "pandas" statistical data structures and related syntax from the Python scripting language. I have already graduated from a US university and am employed, and I am currently taking the "Python for Data Science" course offered online on Coursera's platform by the University of Michigan, purely for professional development. I'm not sharing answers with anyone, as I abide by Coursera's Honor Code.
First, I was given this pandas dataframe concerning Olympic medals won by countries around the world:
# Summer Gold Silver Bronze Total # Winter Gold.1 Silver.1 Bronze.1 Total.1 # Games Gold.2 Silver.2 Bronze.2 Combined total ID
Afghanistan 13 0 0 2 2 0 0 0 0 0 13 0 0 2 2 AFG
Algeria 12 5 2 8 15 3 0 0 0 0 15 5 2 8 15 ALG
Argentina 23 18 24 28 70 18 0 0 0 0 41 18 24 28 70 ARG
Armenia 5 1 2 9 12 6 0 0 0 0 11 1 2 9 12 ARM
Australasia 2 3 4 5 12 0 0 0 0 0 2 3 4 5 12 ANZ
Second, the question asked is, "Which country has won the most gold medals in summer games?"
Third, a hint given to me as to how to answer using Python's pandas syntax is this:
"This function should return a single string value."
Fourth, I tried entering this as the answer in Python's pandas syntax:
import pandas as pd

df = pd.read_csv('olympics.csv', index_col=0, skiprows=1)

def answer_one():
    if df.columns[:2] == '00':
        df.rename(columns={col: 'Country' + col[4:]}, inplace=True)
    df_max = df[df[max('Gold')]]
    return df_max['Country']

answer_one()
Fifth, I have tried various other answers like this in Coursera's auto-grader, but it keeps giving this error message:
There was a problem evaluating function answer_one, it threw an exception was thus counted as incorrect.
0.125 points were not awarded.
Could you please help me solve this question? Any hints/suggestions/comments are welcome.
Thanks, Kevin
You can use pandas' loc function to find the country name corresponding to the maximum of the "Gold" column:
import pandas as pd

data = [('Afghanistan', 13),
        ('Algeria', 12),
        ('Argentina', 23)]
df = pd.DataFrame(data, columns=['Country', 'Gold'])
df['Country'].loc[df['Gold'] == df['Gold'].max()]
The last line returns Argentina as the answer.
Edit 1:
I just noticed you import the .csv file using pd.read_csv('olympics.csv', index_col=0, skiprows=1). If you leave out the skiprows argument, you will get a dataframe where the first line of the .csv file corresponds to the column names of the dataframe. This makes handling your dataframe in pandas much easier and is encouraged. Second, I see that with the index_col=0 argument you use the country names as the index of the dataframe. In that case you should use the index rather than the loc function, as follows:
df.index[df['Gold'] == df['Gold'].max()][0]
import pandas as pd

def answer_one():
    df1 = pd.Series.max(df['Gold'])
    df1 = df[df['Gold'] == df1]
    return df1.index[0]

answer_one()
Series.idxmax() returns the index label of the maximum element (in older versions of pandas, argmax() behaved the same way):
return df['Gold'].idxmax()
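Putting the suggestions together, a minimal sketch of the graded function (assuming, as above, that olympics.csv has country names in its first column and summer gold counts in a 'Gold' column):
import pandas as pd

df = pd.read_csv('olympics.csv', index_col=0, skiprows=1)

def answer_one():
    # the index labels are country names, so idxmax returns a single string
    return df['Gold'].idxmax()

print(answer_one())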

Top 5 movies with most number of ratings

I'm currently facing a little problem. I'm working with the movie-lens 1M data, and trying to get the top 5 movies with the most ratings.
movies = pandas.read_table('movies.dat', sep='::', header=None, names= ['movie_id', 'title', 'genre'])
users = pandas.read_table('users.dat', sep='::', header=None, names=['user_id', 'gender','age','occupation_code','zip'])
ratings = pandas.read_table('ratings.dat', sep='::', header=None, names=['user_id','movie_id','rating','timestamp'])
movie_data = pandas.merge(movies,pandas.merge(ratings,users))
The above code is what I have written to merge the .dat files into one DataFrame.
Then I need the top 5 from that movie_data dataframe, based on the ratings.
Here is what I have done:
print(movie_data.sort_values('rating', ascending=False).head(5))
This seems to find the top 5 based on the rating. However, the output is:
movie_id title genre user_id \
0 1 Toy Story (1995) Animation|Children's|Comedy 1
657724 2409 Rocky II (1979) Action|Drama 101
244214 1012 Old Yeller (1957) Children's|Drama 447
657745 2409 Rocky II (1979) Action|Drama 549
657752 2409 Rocky II (1979) Action|Drama 684
rating timestamp gender age occupation_code zip
0 5 978824268 F 1 10 48067
657724 5 977578472 F 18 3 33314
244214 5 976236279 F 45 11 55105
657745 5 976119207 M 25 6 53217
657752 5 975603281 M 25 4 27510
As you can see, Rocky II appears 3 times. I would like to know if I can somehow remove the duplicates quickly, other than going through the list again and removing them that way.
I have looked at pivot_table, but I'm not quite sure how it works, so if this can be done with such a table, I need some explanation of how it works.
EDIT: The first comment did indeed remove the duplicates.
movie_data.drop_duplicates(subset='movie_id').sort_values('rating', ascending=False).head(5)
Thank you :)
You can drop the duplicate entries by calling drop_duplicates and passing param subset='movie_id':
movie_data.drop_duplicates(subset='movie_id').sort_values('rating', ascending=False).head(5)
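As a side note: the title asks for the five movies with the most ratings, rather than the highest rating values, so counting rating rows per title may be closer to that goal. A sketch:
# number of rating rows per movie title, largest first; keep the top 5
movie_data['title'].value_counts().head(5)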
