Splitting a dataframe column on a pattern of characters and numerals - python

I have a dataframe that is:
A
1 king, crab, 2008
2 green, 2010
3 blue
4 green no. 4
5 green, house
I want to split the dates out into:
A B
1 king, crab 2008
2 green 2010
3 blue
4 green no. 4
5 green, house
I cant split the first instance of ", " because that would make:
A B
1 king crab, 2008
2 green 2010
3 blue
4 green no. 4
5 green house
I cant split after the last instance of ", " because that would make:
A B
1 king crab 2008
2 green 2010
3 blue
4 green no. 4
5 green house
I also cant separate it by numbers because that would make:
A B
1 king, crab 2008
2 green 2010
3 blue
4 green no. 4
5 green, house
Is there some way to split by ", " and then a 4 digit number that is between two values? The two values condition would be extra safety to filter out accidental 4 digit numbers that are clearly not years. For example.
Split by:
", " + (four digit number between 1000 - 2021)
Also appreciated are answers that split by:
", " + four digit number
Even better would be an answer that took into account that the number is ALWAYS at the end of the string.

Or you can just use series.str.extract and replace:
df = pd.DataFrame({"A":["king, crab, 2008","green, 2010","blue","green no. 4","green, house"]})
df["year"] = df["A"].str.extract("(\d{4})")
df["A"] = df["A"].str.replace(",\s\d{4}","")
print (df)
A year
0 king, crab 2008
1 green 2010
2 blue NaN
3 green no. 4 NaN
4 green, house NaN

import pandas as pd
list_dict_Input = [{'A': 'king, crab, 2008'},
{'A': 'green, 2010'},
{'A': 'green no. 4'},
{'A': 'green no. 4'},]
df = pd.DataFrame(list_dict_Input)
for row_Index in range(len(df)):
text = (df.iloc[row_Index]['A']).strip()
last_4_Char = (text[-4:])
if last_4_Char.isdigit() and int(last_4_Char) >= 1000 and int(last_4_Char) <= 2021:
df.at[row_Index, 'B'] = last_4_Char
print(df)

Related

How to split two strings into different columns in Python with Pandas?

I am new to this, and I need to split a column that contains two strings into 2 columns, like this:
Initial dataframe:
Full String
0 Orange Juice
1 Pink Bird
2 Blue Ball
3 Green Tea
4 Yellow Sun
Final dataframe:
First String Second String
0 Orange Juice
1 Pink Bird
2 Blue Ball
3 Green Tea
4 Yellow Sun
I tried this but doesn't work:
df['First String'] , df['Second String'] = df['Full String'].str.split()
and this:
df['First String', 'Second String'] = df['Full String'].str.split()
How to make it work? Thank you!!!
The key here is to include the parameter expand=True in your str.split() to expand the split strings into separate columns.
Type it like this:
df[['First String','Second String']] = df['Full String'].str.split(expand=True)
Output:
Full String First String Second String
0 Orange Juice Orange Juice
1 Pink Bird Pink Bird
2 Blue Ball Blue Ball
3 Green Tea Green Tea
4 Yellow Sun Yellow Sun
have you tried this solution ?
https://stackoverflow.com/a/14745484/15320403
df = pd.DataFrame(df['Full String'].str.split(' ',1).tolist(), columns = ['First String', 'Second String'])

check if a name in one dataframe exist in other dataframe python

I am a beginner in Python and trying to find a solution for the following problem.
I have a csv file:
name, mark
Anna,24
John,19
Mike,22
Monica,20
Alex, 17
Daniel, 26
And xls file:
name, group
John, red
Anna, blue
Monica, blue
Mike, yellow
Alex, red
I am trying to get the result:
group, mark
Red, 26
Blue, 44
Yellow, 22
The number in result shows the total mark for the whole group.
I was trying to find similar problems but was not successful and I do not have much experience to find out what exactly I have to do and what commands to use.
Use pd.read_csv with df.merge and Groupby.sum:
In [89]: df1 = pd.read_csv('file1.csv')
In [89]: df1
Out[89]:
name mark
0 Anna 24
1 John 19
2 Mike 22
3 Monica 20
4 Alex 17
5 Daniel 26
In [90]: df2 = pd.read_csv('file2.csv')
In [90]: df2
Out[90]:
name group
0 John red
1 Anna blue
2 Monica blue
3 Mike yellow
4 Alex red
In [94]: df = df1.merge(df2).groupby('group').sum().reset_index()
In [95]: df
Out[95]:
group mark
0 blue 44
1 red 36
2 yellow 22
EDIT: If you have other columns, which you don't want to sum, do this:
In [284]: df1.merge(df2).groupby('group').agg({'mark': 'sum'}).reset_index()
Out[284]:
group mark
0 blue 44
1 red 36
2 yellow 22

How to compare two data row before concatenating them?

I have 2 datasets (in CSV format) with different size such as follow:
df_old:
index category text
0 spam you win much money
1 spam you are the winner of the game
2 not_spam the weather in Chicago is nice
3 not_spam pizza is an Italian food
4 neutral we have a party now
5 neutral they are driving to downtown
df_new:
index category text
0 spam you win much money
14 spam London is the capital of Canada
15 not_spam no more raining in winter
25 not_spam the soccer game plays on HBO
4 neutral we have a party now
31 neutral construction will be done
I am using a code that concatenates the df_new to the df_old in the way that df_new goes on top of df_old's each category.
The code is:
(pd.concat([df_new,df_old], sort=False).sort_values('category', ascending=False, kind='mergesort'))
Now, the problem is that some of the rows with similar index, category, text (all together at same row) being duplicated at the same time, and (like: [0, spam, you win much money]) I want to avoid this.
The expected output should be:
df_concat:
index category text
14 spam London is the capital of Canada
0 spam you win much money
1 spam you are the winner of the game
15 not_spam no more raining in winter
25 not_spam the soccer game plays on HBO
2 not_spam the weather in Chicago is nice
3 not_spam pizza is an Italian food
31 neutral construction will be done
4 neutral we have a party now
5 neutral they are driving to downtown
I tried this and this but these are removing either the category or text.
To remove duplicates on specific column(s), use subset in drop_duplicates:
df.drop_duplicates(subset=['index', 'category', 'text'], keep='first')
Try concat + sort_values:
res = pd.concat((new_df, old_df)).drop_duplicates()
res = res.sort_values(by=['category'], key=lambda x: x.map({'spam' : 0, 'not_spam' : 1, 'neutral': 2}))
print(res)
Output
index category text
0 0 spam you win much money
1 14 spam London is the capital of Canada
1 1 spam you are the winner of the game
2 15 not_spam no more raining in winter
3 25 not_spam the soccer game plays on HBO
2 2 not_spam the weather in Chicago is nice
3 3 not_spam pizza is an Italian food
4 31 neutral construction will be done
4 4 neutral we have a party now
5 5 neutral they are driving to downtown
Your code seems right , try to add this to the concat result it will remove your duplicates :
# this first lines will create a new column ‘index’ and will help the rest of the code be correct
df_new = df_new.reset_index()
df_ old = df_ old.reset_index()
df_concat = (pd.concat([df_new,df_old], sort=False).sort_values('category', ascending=False, kind='mergesort'))
df_concat.drop_duplicates()
If you want to reindex it you can do ofcourse mot chnging the ‘index’column):
df_concat.drop_duplicates(ignore_index =True)
You can always do combine_first
out = df_new.combine_first(df_old)

Duplicate & identify certain rows in a Pandas Dataframe - regex

I did'nt find any solution about my issue.
I want to identify & duplicate with regex certains rows of my DataFrame.
For example my df :
var1
0 House A and B
1 2 garage + garden
2 fridges
The result that i want in var2 (keep my var1 too) :
var1 var2
0 House A and B House A
1 House A and B House B
2 2 garage + garden Garage 1
3 2 garage + garden Garage 2
4 2 garage + garden Garden
5 fridges fridge 1
6 fridges fridge 2
I don't know exactly how to do that, I think with regex it's a good idea, but i'm not sur.
I tried with str.contains but the results was not good.
Thank for your help.
If those three cases are exhaustive, then you may use my solution, my solution uses a combination of regex matching and split.
#the hard part
def my_scan(t):
#Split
#only '+' and 'and' are considered
cond = re.findall(r'(.+)(and|\+)(.+)' , t)
if len(cond):
t = [_.strip() for _ in cond[0]]
else:
t = [t]
#Process
#Case 1 'House': and
if 'and' in t:
t.remove('and')
#add 'House' to the second element
t[1] = re.split(' ', t[0])[0]+' '+t[1]
#Case 2 'Garage + Garden': + with numeral
elif '+' in t:
t.remove('+')
x = []
##check for numerals in front
for _ in t:
if (re.match(r'^\d+', _)):
temp = _[(re.match(r'^\d+', _)).end()+1:] #'garage'
#append by the number of numeral times
for i in range(int(re.match(r'^\d+', _)[0])):
x.append(temp+' '+str(i+1))
else:
x.append(_)
t = x
#Case 3 'Fridges': single word that ends with an s
else:
if (re.match(r'^[A-Za-z]+s$', t[0])):
t = t[0][:-1]
t = [t+' 1', t+' 2']
else:
t[0] = t[0]+' 1'
return t
#the easier part
def get_df(t):
output1 = []
output2 = []
for _ in t:
dummy = my_scan(_)
for i in range(len(dummy)):
output1.append(_)
output2.append(dummy[i])
df = pd.DataFrame({'var1':output1,'var2':output2})
return df
#test it
data = {'var1':['House A and B','2 Garage + Garden', 'Fridges']}
df = get_df(data['var1'])
print(df)
#bonus test
data1 = {'var1':['House AZ and PL','5 Garage + 3 Garden', 'Fridge']}
df = get_df(data1['var1'])
print(df)
Printed df output from your original data, data = {'var1':['House A and B','2 Garage + Garden', 'Fridges']}.
var1 var2
0 House A and B House A
1 House A and B House B
2 2 Garage + Garden Garage 1
3 2 Garage + Garden Garage 2
4 2 Garage + Garden Garden
5 Fridges Fridge 1
6 Fridges Fridge 2
Printed df output from an additional test data, data1 = {'var1':['House AZ and PL','5 Garage + 3 Garden', 'Fridge']}.
var1 var2
0 House AZ and PL House AZ
1 House AZ and PL House PL
2 5 Garage + 3 Garden Garage 1
3 5 Garage + 3 Garden Garage 2
4 5 Garage + 3 Garden Garage 3
5 5 Garage + 3 Garden Garage 4
6 5 Garage + 3 Garden Garage 5
7 5 Garage + 3 Garden Garden 1
8 5 Garage + 3 Garden Garden 2
9 5 Garage + 3 Garden Garden 3
10 Fridge Fridge 1
Maybe, regular expression wouldn't be the best idea to do this task, yet you can write some expressions to split them, how-to-code-it or how to find the plural words (which you'd probably want some NLP library for it) would be some other different stories:
([A-Za-z]+?)\s([A-Z])(?=\s+and|$)|([0-9]+)?\s+([A-Za-z]*?)(?=\s+\+|$)
If you wish to simplify/modify/explore the expression, it's been explained on the top right panel of regex101.com. If you'd like, you can also watch in this link, how it would match against some sample inputs.

Generate columns of top ranked values in Pandas

I have a dataframe topic_data that contains the output of an LDA topic model:
topic_data.head(15)
topic word score
0 0 Automobile 0.063986
1 0 Vehicle 0.017457
2 0 Horsepower 0.015675
3 0 Engine 0.014857
4 0 Bicycle 0.013919
5 1 Sport 0.032938
6 1 Association_football 0.025324
7 1 Basketball 0.020949
8 1 Baseball 0.016935
9 1 National_Football_League 0.016597
10 2 Japan 0.051454
11 2 Beer 0.032839
12 2 Alcohol 0.027909
13 2 Drink 0.019494
14 2 Vodka 0.017908
This shows the top 5 terms for each topic, and the score (weight) for each. What I'm trying to do is reformat so that the index is the rank of the term, the columns are the topic IDs, and the values are formatted strings generated from the word and score columns (something along the lines of "%s (%.02f)" % (word,score)). That means the new dataframe should look something like this:
Topic 0 1 ...
Rank
0 Automobile (0.06) Sport (0.03) ...
1 Vehicle (0.017) Association_football (0.03) ...
... ... ... ...
What's the right way of going about this? I assume it involves a combination of index-setting, unstacking, and ranking, but I'm not sure of the right approach.
It would be something like this, note that Rank has to be generated first:
In [140]:
df['Rank'] = (-1*df).groupby('topic').score.transform(np.argsort)
df['New_str'] = df.word + df.score.apply(' ({0:.2f})'.format)
df2 = df.sort(['Rank', 'score'])[['New_str', 'topic','Rank']]
print df2.pivot(index='Rank', values='New_str', columns='topic')
topic 0 1 2
Rank
0 Automobile (0.06) Sport (0.03) Japan (0.05)
1 Vehicle (0.02) Association_football (0.03) Beer (0.03)
2 Horsepower (0.02) Basketball (0.02) Alcohol (0.03)
3 Engine (0.01) Baseball (0.02) Drink (0.02)
4 Bicycle (0.01) National_Football_League (0.02) Vodka (0.02)

Categories