Removing from pandas dataframe all rows having less than 3 characters - python

I have this dataframe
          Word  Frequency
0            :         79
1            ,         60
2         look         26
3            e         26
4            a         25
..         ...        ...
95       trump          2
96    election          2
97        step          2
98         day          2
99  university          2
I would like to remove all words having fewer than 3 characters.
I tried as follows:
df['Word']=df['Word'].str.findall('\w{3,}').str.join(' ')
but it does not remove them from my dataset.
Can you please tell me how to remove them?
My expected output would be:
          Word  Frequency
2         look         26
..         ...        ...
95       trump          2
96    election          2
97        step          2
98         day          2
99  university          2

Try with
df = df[df['Word'].str.len()>=3]

Instead of attempting a regular expression, you can use .str.len() to get the length of each string in the column, then keep only the rows where that length is >= 3.
It should look like:
df.loc[df["Word"].str.len() >= 3]

Please try:
df[df.Word.str.len()>=3]
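
All three suggestions are the same boolean-mask idea; here is a minimal runnable sketch, using a few rows from the question's data:

import pandas as pd

df = pd.DataFrame({'Word': [':', ',', 'look', 'e', 'a', 'trump'],
                   'Frequency': [79, 60, 26, 26, 25, 2]})

# str.len() gives each string's length; comparing to 3 yields a boolean
# mask, and indexing with the mask keeps only the matching rows.
df = df[df['Word'].str.len() >= 3]
print(df)
#     Word  Frequency
# 2   look         26
# 5  trump          2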

Related

Doing vlookup-like things on Python with multiple lookup values

Many of us know that the syntax for a Vlookup function on Excel is as follows:
=vlookup([lookup value], [lookup table/range], [column selected], [approximate/exact match (optional)])
I want to do something on Python with a lookup table (in dataframe form) that looks something like this:
Name  Date of Birth  ID#
Jack       1/1/2003    0
Ryan       1/8/2003    1
Bob       12/2/2002    2
Jack       3/9/2003    3
...and so on. Note how the two Jacks are assigned different ID numbers because they are born on different dates.
Say I have something like a gradebook (again, in dataframe form) that looks like this:
Name  Date of Birth  Test 1  Test 2
Jack       1/1/2003      89      91
Ryan       1/8/2003      92      88
Jack       3/9/2003      93      79
Bob       12/2/2002      80      84
...
How do I make it so that the result looks like this?
ID#  Name  Date of Birth  Test 1  Test 2
0    Jack       1/1/2003      89      91
1    Ryan       1/8/2003      92      88
3    Jack       3/9/2003      93      79
2    Bob       12/2/2002      80      84
...
It seems to me that the "lookup value" would involve multiple columns of data ('Name' and 'Date of Birth'). I kind of know how to do this in Excel, but how do I do it in Python?
Turns out that I can just do
pd.merge([lookup value], [lookup table], on=['Name', 'Date of Birth'])
which produces
Name  Date of Birth  Test 1  Test 2  ID#
Jack       1/1/2003      89      91    0
Ryan       1/8/2003      92      88    1
Jack       3/9/2003      93      79    3
Bob       12/2/2002      80      84    2
...
Then all that's left is to move the last column to the front.
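
A minimal runnable sketch of that merge, building the two frames from the example tables above (dates kept as plain strings for illustration):

import pandas as pd

lookup = pd.DataFrame({'Name': ['Jack', 'Ryan', 'Bob', 'Jack'],
                       'Date of Birth': ['1/1/2003', '1/8/2003', '12/2/2002', '3/9/2003'],
                       'ID#': [0, 1, 2, 3]})
grades = pd.DataFrame({'Name': ['Jack', 'Ryan', 'Jack', 'Bob'],
                       'Date of Birth': ['1/1/2003', '1/8/2003', '3/9/2003', '12/2/2002'],
                       'Test 1': [89, 92, 93, 80],
                       'Test 2': [91, 88, 79, 84]})

# merge on both key columns, then move 'ID#' to the front
merged = grades.merge(lookup, on=['Name', 'Date of Birth'])
merged = merged[['ID#'] + [c for c in merged.columns if c != 'ID#']]
print(merged)
#    ID#  Name Date of Birth  Test 1  Test 2
# 0    0  Jack      1/1/2003      89      91
# 1    1  Ryan      1/8/2003      92      88
# 2    3  Jack      3/9/2003      93      79
# 3    2   Bob     12/2/2002      80      84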

Strip the last character from a string if it is a letter in python dataframe

This can probably be done with regular expressions, which I am not very strong at.
My dataframe is like this:
import pandas as pd
import re

data = {'postcode': ['DG14', 'EC3M', 'BN45', 'M2', 'WC2A', 'W1C', 'PE35'],
        'total': [44, 54, 56, 78, 87, 35, 36]}
df = pd.DataFrame(data)
df
  postcode  total
0     DG14     44
1     EC3M     54
2     BN45     56
3       M2     78
4     WC2A     87
5      W1C     35
6     PE35     36
I want to get these strings in my column with the last letter stripped like so:
  postcode  total
0     DG14     44
1      EC3     54
2     BN45     56
3       M2     78
4      WC2     87
5       W1     35
6     PE35     36
Probably something using re.sub('', '\D')?
Thank you.
You could use str.replace here, passing regex=True since newer pandas versions no longer treat the pattern as a regular expression by default:
df["postcode"] = df["postcode"].str.replace(r'[A-Za-z]$', '', regex=True)
One of the approaches:
import re
import pandas as pd

data = {'postcode': ['DG14', 'EC3M', 'BN45', 'M2', 'WC2A', 'W1C', 'PE35'],
        'total': [44, 54, 56, 78, 87, 35, 36]}
# strip a single trailing letter, if present, before building the DataFrame
data['postcode'] = [re.sub(r'[a-zA-Z]$', '', item) for item in data['postcode']]
df = pd.DataFrame(data)
print(df)
Output:
  postcode  total
0     DG14     44
1      EC3     54
2     BN45     56
3       M2     78
4      WC2     87
5       W1     35
6     PE35     36
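
If you'd rather avoid regex entirely, a conditional slice also works; this sketch is not from the answers above and assumes the same df:

# True where the last character is a letter
last_is_letter = df['postcode'].str[-1].str.isalpha()
# keep the value where the last character is not a letter, otherwise drop it
df['postcode'] = df['postcode'].where(~last_is_letter, df['postcode'].str[:-1])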

Replace every nth row in df1 with every row from df 2

(Absolute beginner here)
The following code should replace every 9th row of the template df with EVERY row of the data df. However, it replaces every 9th row of template with every 9th row of data instead.
template.iloc[::9, 2] = data['Question (en)']
template.iloc[::9, 3] = data['Correct Answer']
template.iloc[::9, 4] = data['Incorrect Answer 1']
template.iloc[::9, 5] = data['Incorrect Answer 2']
Thank you for your help
The source of the problem with your code is that the initial step of
any operation involving 2 DataFrames is their alignment by index.
To avoid this step, take the underlying NumPy array from one of the DataFrames by invoking .values.
Since a NumPy array has no index, pandas can't perform the mentioned alignment.
Other corrections:
take from the second DataFrame only as many rows as are needed,
and only those columns that are to be written into the target DataFrame,
then perform the whole update "in one go" (see the code below).
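
Before that, here is a minimal sketch of the alignment pitfall, using hypothetical toy frames (not the OP's data):

import numpy as np
import pandas as pd

template = pd.DataFrame({'A': np.zeros(18)})         # index 0..17
data = pd.DataFrame({'A': np.arange(100.0, 118.0)})  # index 0..17

# Assigning a Series aligns on index labels: the slice covers labels 0 and 9,
# so it picks up data.loc[0] and data.loc[9] -- every 9th row of data.
t1 = template.copy()
t1.iloc[::9, 0] = data['A']
print(t1.iloc[::9, 0].tolist())        # [100.0, 109.0]

# Assigning a bare NumPy array skips alignment; values land positionally,
# so the first rows of data go into every 9th row of template.
t2 = template.copy()
t2.iloc[::9, 0] = data['A'].values[:2]
print(t2.iloc[::9, 0].tolist())        # [100.0, 101.0]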
To create both source test DataFrames, I defined the following function:
import numpy as np
import pandas as pd

def getTestDf(nRows: int, tt: str, valShift=0):
    qn = np.array(list(map(lambda i: tt + str(i), np.arange(nRows, dtype=int))))
    ans = np.arange(nRows * 3, dtype=int).reshape((-1, 3)) + valShift
    return pd.concat([pd.DataFrame({'Question (en)': qn}),
                      pd.DataFrame(ans, columns=['Correct Answer', 'Incorrect Answer 1',
                                                 'Incorrect Answer 2'])], axis=1)
and called it:
template = getTestDf(80, 'Question_')
data = getTestDf(9, 'New question ', 1000)
Note that after I created template I counted that just 9 rows of data
are needed (an 80-row frame updated at every 9th row touches rows 0, 9, ..., 72), so I created data with just 9 rows.
This way the initial part of template contains:
  Question (en)  Correct Answer  Incorrect Answer 1  Incorrect Answer 2
0    Question_0               0                   1                   2
1    Question_1               3                   4                   5
2    Question_2               6                   7                   8
3    Question_3               9                  10                  11
4    Question_4              12                  13                  14
...
and data (in full):
    Question (en)  Correct Answer  Incorrect Answer 1  Incorrect Answer 2
0  New question 0            1000                1001                1002
1  New question 1            1003                1004                1005
2  New question 2            1006                1007                1008
3  New question 3            1009                1010                1011
4  New question 4            1012                1013                1014
5  New question 5            1015                1016                1017
6  New question 6            1018                1019                1020
7  New question 7            1021                1022                1023
8  New question 8            1024                1025                1026
Now, to copy selected rows, run just:
template.iloc[::9] = data.values
The initial part of template now contains:
     Question (en)  Correct Answer  Incorrect Answer 1  Incorrect Answer 2
0   New question 0            1000                1001                1002
1       Question_1               3                   4                   5
2       Question_2               6                   7                   8
3       Question_3               9                  10                  11
4       Question_4              12                  13                  14
5       Question_5              15                  16                  17
6       Question_6              18                  19                  20
7       Question_7              21                  22                  23
8       Question_8              24                  25                  26
9   New question 1            1003                1004                1005
10     Question_10              30                  31                  32
11     Question_11              33                  34                  35
12     Question_12              36                  37                  38
13     Question_13              39                  40                  41
14     Question_14              42                  43                  44
15     Question_15              45                  46                  47
16     Question_16              48                  49                  50
17     Question_17              51                  52                  53
18  New question 2            1006                1007                1008
19     Question_19              57                  58                  59
I am pretty sure that there are simpler/nicer ways, but just off the top of my head:
template_9 = template.iloc[::9, 0:2].copy()
# cross join via a constant key
template_9['key'] = 0
data['key'] = 0
template_9 = template_9.merge(data, how='left')  # you don't need 'left' here, but I think it's clearer
template_9.drop('key', axis=1, inplace=True)
template = pd.concat([template, template_9]).drop_duplicates(keep='last')
In case you want to keep the index, replace the merge line with:
template_9 = template_9.reset_index().merge(data, how='left').set_index('index')
and then you can sort by index at the end.
P.S. I'm assuming column names are the same in both data frames, but it should be straightforward to adapt it anyway.

How to merge two columns of a dataframe based on values from a column in another dataframe?

I have a dataframe called df_location:
locations = {'location_id': [1,2,3,4,5,6,7,8,9,10],
             'temperature_value': [20,21,22,23,24,25,26,27,28,29],
             'humidity_value': [60,61,62,63,64,65,66,67,68,69]}
df_location = pd.DataFrame(locations)
I have another dataframe called df_islands:
islands = {'island_id': [10,20,30,40,50,60],
           'list_of_locations': [[1],[2,3],[4,5],[6,7,8],[9],[10]]}
df_islands = pd.DataFrame(islands)
Each island_id corresponds to one or more locations. As you can see, the locations are stored in a list.
What I'm trying to do is to look up each location_id in list_of_locations and merge the matching island_id into df_location, so that each location_id ends up with its island_id.
Final dataframe should be the following:
merged = {'location_id': [1,2,3,4,5,6,7,8,9,10],
          'temperature_value': [20,21,22,23,24,25,26,27,28,29],
          'humidity_value': [60,61,62,63,64,65,66,67,68,69],
          'island_id': [10,20,20,30,30,40,40,40,50,60]}
df_merged = pd.DataFrame(merged)
I don't know whether there is a method or function in python to do so. I would really appreciate it if someone can give me a solution to this problem.
The pandas method you're looking for to expand your df_islands dataframe is .explode(column_name). From there, rename your column to location_id and then join the dataframes using pd.merge(). It'll perform a SQL-like join method using the location_id as the key.
import pandas as pd

locations = {'location_id': [1,2,3,4,5,6,7,8,9,10],
             'temperature_value': [20,21,22,23,24,25,26,27,28,29],
             'humidity_value': [60,61,62,63,64,65,66,67,68,69]}
df_locations = pd.DataFrame(locations)

islands = {'island_id': [10,20,30,40,50,60],
           'list_of_locations': [[1],[2,3],[4,5],[6,7,8],[9],[10]]}
df_islands = pd.DataFrame(islands)

df_islands = df_islands.explode(column='list_of_locations')
df_islands.columns = ['island_id', 'location_id']
# explode leaves the new column as object dtype; cast it so the merge keys
# match (newer pandas refuses to merge an object column with an int64 one)
df_islands['location_id'] = df_islands['location_id'].astype('int64')
pd.merge(df_locations, df_islands)
Out[]:
   location_id  temperature_value  humidity_value  island_id
0            1                 20              60         10
1            2                 21              61         20
2            3                 22              62         20
3            4                 23              63         30
4            5                 24              64         30
5            6                 25              65         40
6            7                 26              66         40
7            8                 27              67         40
8            9                 28              68         50
9           10                 29              69         60
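
For reference, this is what df_islands looks like after the explode and rename steps in the code above, just before the merge (note the repeated index labels left over from the original rows):

print(df_islands)
#   island_id  location_id
# 0        10            1
# 1        20            2
# 1        20            3
# 2        30            4
# 2        30            5
# 3        40            6
# 3        40            7
# 3        40            8
# 4        50            9
# 5        60           10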
The df.apply() method also works here. It's a bit long-winded, but it does the job:
df_location['island_id'] = df_location['location_id'].apply(
    lambda x: [
        df_islands['island_id'][i]
        for i in df_islands.index
        if x in df_islands['list_of_locations'][i]
        # comment the line above and use this instead if the list is stored as a string
        # if x in eval(df_islands['list_of_locations'][i])
    ][0]
)
First we select the final value we want if the if statement is True: df_islands['island_id'][i].
Then we loop over each row of df_islands by using df_islands.index.
Then the if statement checks each list in df_islands['list_of_locations'] and returns True if the value of df_location['location_id'] is in it.
Finally, since the whole expression is wrapped in square brackets, it builds a list. We know there is only one matching value, so we take it with [0] at the end.
I hope this helps and happy for other editors to make the answer more legible!
print(df_location)
   location_id  temperature_value  humidity_value  island_id
0            1                 20              60         10
1            2                 21              61         20
2            3                 22              62         20
3            4                 23              63         30
4            5                 24              64         30
5            6                 25              65         40
6            7                 26              66         40
7            8                 27              67         40
8            9                 28              68         50
9           10                 29              69         60
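
For completeness, here is a shorter alternative that is not from the original answers (a sketch, assuming the locations/islands dicts defined above): build a location-to-island dict once, then map it.

# invert the islands dict into {location_id: island_id}
mapping = {loc: isl
           for isl, locs in zip(islands['island_id'], islands['list_of_locations'])
           for loc in locs}
# Series.map looks each location_id up in the dict
df_location['island_id'] = df_location['location_id'].map(mapping)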

Python Pandas search for strings with metacharacters

Currently I have a DataFrame as below:
index          Name  Value
0        j_smith[1]     32
1       j_smith[32]     46
2          r_lee[2]     52
3        m_brent[3]     61
4        j_perry[4]     75
5        j_perry[6]     81
6              j[3]     92
7              j[4]     72
8              r[4]     63
9      m_jackson[3]     78
10          r_j[11]     98
In the dataframe, the names are formatted as:
'first name initial'_'last name'[numbers]
'first name initial'[numbers]
'first name initial'_'last name initial'[numbers]
I tried to use the .str.contains method to find the rows with 'j_perry' and 'j' (but not r_j) as below:
Score = DF[DF['Name'].str.contains('j_perry[\d+]|j[\d+]')]
I got nothing in the Score DataFrame. I think the problem comes from the metacharacters [ and ]. How can I solve this problem?
Simply escape the [ and ] characters using \ (and use a raw string so the backslashes reach the regex engine intact):
Score = DF[DF['Name'].str.contains(r'j_perry\[\d+\]|j\[\d+\]')]
>>> Score
    index        Name  Value
4       4  j_perry[4]     75
5       5  j_perry[6]     81
6       6        j[3]     92
7       7        j[4]     72
10     10     r_j[11]     98
To make sure you don't get r_j, use ^ to specify that your string needs to start with j:
Score = DF[DF['Name'].str.contains(r'^j_perry\[\d+\]|^j\[\d+\]')]
>>> Score
   index        Name  Value
4      4  j_perry[4]     75
5      5  j_perry[6]     81
6      6        j[3]     92
7      7        j[4]     72
You need to escape the characters that have special meaning in regex:
In [41]: DF[DF['Name'].str.contains(r'^(?:j_perry\[\d+\]|j\[\d+\])')]
Out[41]:
             Name  Value
index
4      j_perry[4]     75
5      j_perry[6]     81
6            j[3]     92
7            j[4]     72
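
If you'd rather not escape by hand, re.escape can build the literal part of the pattern for you. A small sketch (the four-row DF here is a made-up subset of the data above, just for illustration):

import re
import pandas as pd

DF = pd.DataFrame({'Name': ['j_smith[1]', 'j_perry[4]', 'j[3]', 'r_j[11]'],
                   'Value': [32, 75, 92, 98]})

# re.escape('j_perry[') -> 'j_perry\[', so only the \d+\] tail is written by hand
pattern = '|'.join('^' + re.escape(name + '[') + r'\d+\]' for name in ('j_perry', 'j'))
Score = DF[DF['Name'].str.contains(pattern)]
print(Score)
#          Name  Value
# 1  j_perry[4]     75
# 2        j[3]     92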
