Python Pandas search for strings with metacharacters


Currently I have a DataFrame as below:
index Name Value
0 j_smith[1] 32
1 j_smith[32] 46
2 r_lee[2] 52
3 m_brent[3] 61
4 j_perry[4] 75
5 j_perry[6] 81
6 j[3] 92
7 j[4] 72
8 r[4] 63
9 m_jackson[3] 78
10 r_j[11] 98
In the dataframe, the names are formatted as
'first name initial'_'last name'[numbers]
'first name initial'[Numbers]
'first name initial'_'last name initial'[Numbers]
I tried to use Series.str.contains to find the rows for 'j_perry' and 'j' (but not the row with 'r_j'), as below:
Score = DF[DF['Name'].str.contains('j_perry[\d+]|j[\d+]')]
The resulting Score DataFrame is empty. I think the problem comes from the metacharacters [ and ]. How can I solve this problem?

Simply escape the [ and ] characters using \:
Score = DF[DF['Name'].str.contains(r'j_perry\[\d+\]|j\[\d+\]')]
>>> Score
index Name Value
4 4 j_perry[4] 75
5 5 j_perry[6] 81
6 6 j[3] 92
7 7 j[4] 72
10 10 r_j[11] 98
To make sure you don't get r_j, use the ^ to specify that your string needs to start with j:
Score = DF[DF['Name'].str.contains(r'^j_perry\[\d+\]|^j\[\d+\]')]
>>> Score
index Name Value
4 4 j_perry[4] 75
5 5 j_perry[6] 81
6 6 j[3] 92
7 7 j[4] 72

You need to escape those chars with special meaning in regex:
In [41]: DF[DF['Name'].str.contains(r'^(?:j_perry\[\d+\]|j\[\d+\])')]
Out[41]:
Name Value
index
4 j_perry[4] 75
5 j_perry[6] 81
6 j[3] 92
7 j[4] 72
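
As a complement to the answers above, here is a minimal sketch that lets re.escape do the escaping instead of writing the backslashes by hand. The prefixes list and the variable names are illustrative assumptions; the data is the DataFrame from the question. It also shows regex=False, which treats the search text as a plain substring.

```python
import re

import pandas as pd

# DataFrame from the question
DF = pd.DataFrame({
    "Name": ["j_smith[1]", "j_smith[32]", "r_lee[2]", "m_brent[3]", "j_perry[4]",
             "j_perry[6]", "j[3]", "j[4]", "r[4]", "m_jackson[3]", "r_j[11]"],
    "Value": [32, 46, 52, 61, 75, 81, 92, 72, 63, 78, 98],
})

# Hypothetical list of name prefixes to look for; re.escape protects any
# metacharacters they might contain.
prefixes = ["j_perry", "j"]
alternatives = "|".join(re.escape(p) for p in prefixes)

# Anchor at both ends so that 'r_j[11]' and 'j_smith[1]' are not matched.
pattern = rf"^(?:{alternatives})\[\d+\]$"
Score = DF[DF["Name"].str.contains(pattern)]
print(Score)

# For a purely literal search, switch the regex engine off instead.
literal_hits = DF[DF["Name"].str.contains("j[3]", regex=False)]
print(literal_hits)
```

Building the pattern from a list keeps the escaping in one place, which may be easier to maintain when the set of names changes.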

Related

Compare row wise elements of a single column. If there are 2 continuous L then select lowest from High column and ignore other. Conversely if 2 L

High D_HIGH D_HIGH_H
33 46.57 0 0L
0 69.93 42 42H
1 86.44 68 68H
34 56.58 83 83L
35 67.12 125 125L
2 117.91 158 158H
36 94.51 186 186L
3 120.45 245 245H
4 123.28 254 254H
37 83.20 286 286L
In column D_HIGH_H there is L & H at end.
If there are two consecutive H rows, the one with the highest value in the High column has to be kept and the other ignored (deleted).
If there are two consecutive L rows, the one with the lowest value in the High column has to be kept and the other ignored (deleted).
If the sequence is H, L, H, L, no changes are to be made.
Output I want is as follows:
High D_HIGH D_HIGH_H
33 46.57 0 0L
1 86.44 68 68H
34 56.58 83 83L
2 117.91 158 158H
36 94.51 186 186L
4 123.28 254 254H
37 83.20 286 286L
I tried various options using list and map, but they did not work out. I also tried groupby, but reached no logical conclusion.

Here's one way:
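# label each run of consecutive identical letters (walrus operator needs Python 3.8+)
g = ((l := df['D_HIGH_H'].str[-1]) != l.shift()).cumsum()
def f(x):
    # within a run of H keep the largest High, within a run of L the smallest
    if (x['D_HIGH_H'].str[-1] == 'H').any():
        return x.nlargest(1, 'High')
    return x.nsmallest(1, 'High')
df.groupby(g, as_index=False).apply(f)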
Output:
High D_HIGH D_HIGH_H
0 33 46.57 0 0L
1 1 86.44 68 68H
2 34 56.58 83 83L
3 2 117.91 158 158H
4 36 94.51 186 186L
5 4 123.28 254 254H
6 37 83.20 286 286L

You can use extract to get the letter, then compute a custom group and groupby.apply with a function that depends on the letter:
# extract letter
s = df['D_HIGH_H'].str.extract(r'(\D)$', expand=False)
# group by successive letters
# get the idxmin/idxmax depending on the type of letter
keep = (df['High']
        .groupby([s, s.ne(s.shift()).cumsum()], sort=False)
        .apply(lambda x: x.idxmin() if x.name[0] == 'L' else x.idxmax())
        .tolist()
        )
out = df.loc[keep]
Output:
High D_HIGH D_HIGH_H
33 46.57 0 0L
1 86.44 68 68H
34 56.58 83 83L
2 117.91 158 158H
36 94.51 186 186L
4 123.28 254 254H
37 83.20 286 286L
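
For readers who want to run the idea end to end, here is a self-contained sketch of the same run-based selection written as an explicit loop. The DataFrame is rebuilt from the table in the question; the names letter, run_id and keep are illustrative, and this is one possible formulation rather than the answerers' exact code.

```python
import pandas as pd

# Data rebuilt from the question, including its original index labels
df = pd.DataFrame(
    {
        "High": [46.57, 69.93, 86.44, 56.58, 67.12, 117.91, 94.51, 120.45, 123.28, 83.20],
        "D_HIGH": [0, 42, 68, 83, 125, 158, 186, 245, 254, 286],
        "D_HIGH_H": ["0L", "42H", "68H", "83L", "125L", "158H", "186L", "245H", "254H", "286L"],
    },
    index=[33, 0, 1, 34, 35, 2, 36, 3, 4, 37],
)

# Trailing letter of each row, and a counter that increases whenever the letter changes
letter = df["D_HIGH_H"].str[-1]
run_id = letter.ne(letter.shift()).cumsum()

keep = []
for _, run in df.groupby(run_id, sort=False):
    if run["D_HIGH_H"].str[-1].iloc[0] == "H":
        keep.append(run["High"].idxmax())  # run of H: keep the highest High
    else:
        keep.append(run["High"].idxmin())  # run of L: keep the lowest High

out = df.loc[keep]
print(out)
```

The cumulative sum over "letter differs from the previous row" is what turns consecutive equal letters into one group; a run of length one simply keeps its only row.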

Strip the last character from a string if it is a letter in python dataframe

It is possibly done with regular expressions, which I am not very strong at.
My dataframe is like this:
import pandas as pd
import regex as re
data = {'postcode': ['DG14','EC3M','BN45','M2','WC2A','W1C','PE35'], 'total':[44, 54,56, 78,87,35,36]}
df = pd.DataFrame(data)
df
postcode total
0 DG14 44
1 EC3M 54
2 BN45 56
3 M2 78
4 WC2A 87
5 W1C 35
6 PE35 36
I want to get these strings in my column with the last letter stripped like so:
postcode total
0 DG14 44
1 EC3 54
2 BN45 56
3 M2 78
4 WC2 87
5 W1C 35
6 PE35 36
Probably something using re.sub('', '\D')?
Thank you.

You could use str.replace here:
df["postcode"] = df["postcode"].str.replace(r'[A-Za-z]$', '', regex=True)

One of the approaches:
import pandas as pd
import re
data = {'postcode': ['DG14','EC3M','BN45','M2','WC2A','W1C','PE35'], 'total':[44, 54,56, 78,87,35,36]}
data['postcode'] = [re.sub(r'[a-zA-Z]$', '', item) for item in data['postcode']]
df = pd.DataFrame(data)
print(df)
Output:
postcode total
0 DG14 44
1 EC3 54
2 BN45 56
3 M2 78
4 WC2 87
5 W1 35
6 PE35 36
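
A vectorised pandas version of the same substitution, as a small self-contained sketch using the data from the question. Passing regex=True explicitly matters because Series.str.replace treats the pattern as a literal string by default since pandas 2.0.

```python
import pandas as pd

data = {
    "postcode": ["DG14", "EC3M", "BN45", "M2", "WC2A", "W1C", "PE35"],
    "total": [44, 54, 56, 78, 87, 35, 36],
}
df = pd.DataFrame(data)

# '[A-Za-z]$' matches a single ASCII letter at the very end of the string;
# strings that end in a digit are left unchanged.
df["postcode"] = df["postcode"].str.replace(r"[A-Za-z]$", "", regex=True)
print(df)
```

Only one trailing letter is removed per value; a pattern such as '[A-Za-z]+$' would be needed to strip a whole run of trailing letters.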

How to merge two columns of a dataframe based on values from a column in another dataframe?

I have a dataframe called df_location:
location = {'location_id': [1,2,3,4,5,6,7,8,9,10],
'temperature_value': [20,21,22,23,24,25,26,27,28,29],
'humidity_value':[60,61,62,63,64,65,66,67,68,69]}
df_location = pd.DataFrame(location)
I have another dataframe called df_islands:
islands = {'island_id':[10,20,30,40,50,60],
'list_of_locations':[[1],[2,3],[4,5],[6,7,8],[9],[10]]}
df_islands = pd.DataFrame(islands)
Each island_id corresponds to one or more locations. As you can see, the locations are stored in a list.
What I'm trying to do is to search the list_of_locations for each unique location and merge it to df_location in a way where each island_id will correspond to a specific location.
Final dataframe should be the following:
merged = {'location_id': [1,2,3,4,5,6,7,8,9,10],
'temperature_value': [20,21,22,23,24,25,26,27,28,29],
'humidity_value':[60,61,62,63,64,65,66,67,68,69],
'island_id':[10,20,20,30,30,40,40,40,50,60]}
df_merged = pd.DataFrame(merged)
I don't know whether there is a method or function in python to do so. I would really appreciate it if someone can give me a solution to this problem.

The pandas method you're looking for to expand your df_islands dataframe is .explode(column_name). From there, rename your column to location_id and then join the dataframes using pd.merge(). It'll perform a SQL-like join method using the location_id as the key.
import pandas as pd
locations = {'location_id': [1,2,3,4,5,6,7,8,9,10],
'temperature_value': [20,21,22,23,24,25,26,27,28,29],
'humidity_value':[60,61,62,63,64,65,66,67,68,69]}
df_locations = pd.DataFrame(locations)
islands = {'island_id':[10,20,30,40,50,60],
'list_of_locations':[[1],[2,3],[4,5],[6,7,8],[9],[10]]}
df_islands = pd.DataFrame(islands)
df_islands = df_islands.explode(column='list_of_locations')
df_islands.columns = ['island_id', 'location_id']
pd.merge(df_locations, df_islands)
Out[]:
location_id temperature_value humidity_value island_id
0 1 20 60 10
1 2 21 61 20
2 3 22 62 20
3 4 23 63 30
4 5 24 64 30
5 6 25 65 40
6 7 26 66 40
7 8 27 67 40
8 9 28 68 50
9 10 29 69 60
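
A slightly more defensive variant of the explode-and-merge approach, as a sketch under a few assumptions: the exploded column is cast back to an integer dtype (explode leaves it as object), the join key is named explicitly, and validate="one_to_one" is used to fail early if a location ever appears under two islands. The name exploded is illustrative.

```python
import pandas as pd

df_locations = pd.DataFrame({
    "location_id": [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
    "temperature_value": [20, 21, 22, 23, 24, 25, 26, 27, 28, 29],
    "humidity_value": [60, 61, 62, 63, 64, 65, 66, 67, 68, 69],
})
df_islands = pd.DataFrame({
    "island_id": [10, 20, 30, 40, 50, 60],
    "list_of_locations": [[1], [2, 3], [4, 5], [6, 7, 8], [9], [10]],
})

exploded = (
    df_islands.explode("list_of_locations")                   # one row per (island, location)
    .rename(columns={"list_of_locations": "location_id"})
    .astype({"location_id": "int64"})                         # explode yields object dtype
)

# Left join keeps every location, even one that belongs to no island
merged = df_locations.merge(exploded, on="location_id", how="left", validate="one_to_one")
print(merged)
```

With how="left", a location missing from every list would get NaN in island_id instead of silently disappearing from the result.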

The Series.apply() method works here. It's a bit long-winded but it works:
df_location['island_id'] = df_location['location_id'].apply(
    lambda x: [
        df_islands['island_id'][i]
        for i in df_islands.index
        if x in df_islands['list_of_locations'][i]
        # comment above line and use this instead if list is stored in a string
        # (needs import ast; safer than eval)
        # if x in ast.literal_eval(df_islands['list_of_locations'][i])
    ][0]
)
First we select the final value we want when the if condition is True: df_islands['island_id'][i]
Then we loop over each row of df_islands by using df_islands.index
Then the if condition checks whether the current value from df_location['location_id'] is in that row's df_islands['list_of_locations'] list.
Finally, since the whole expression sits inside square brackets, the result is a list. We know that it holds exactly one value, so we can take it by indexing with [0] at the end.
I hope this helps and happy for other editors to make the answer more legible!
print(df_location)
location_id temperature_value humidity_value island_id
0 1 20 60 10
1 2 21 61 20
2 3 22 62 20
3 4 23 63 30
4 5 24 64 30
5 6 25 65 40
6 7 26 66 40
7 8 27 67 40
8 9 28 68 50
9 10 29 69 60
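
The apply-based lookup above scans every island for every location. As an alternative sketch (not the answerer's method), the lists can be flattened once into a dictionary and then applied with Series.map. The name location_to_island is an illustrative assumption.

```python
import pandas as pd

df_location = pd.DataFrame({
    "location_id": [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
    "temperature_value": [20, 21, 22, 23, 24, 25, 26, 27, 28, 29],
    "humidity_value": [60, 61, 62, 63, 64, 65, 66, 67, 68, 69],
})
df_islands = pd.DataFrame({
    "island_id": [10, 20, 30, 40, 50, 60],
    "list_of_locations": [[1], [2, 3], [4, 5], [6, 7, 8], [9], [10]],
})

# Build {location_id: island_id} in a single pass over the islands
location_to_island = {
    loc: island
    for island, locs in zip(df_islands["island_id"], df_islands["list_of_locations"])
    for loc in locs
}

# map() looks each location up in the dictionary; unknown locations become NaN
df_location["island_id"] = df_location["location_id"].map(location_to_island)
print(df_location)
```

Unlike the [0] indexing in the list comprehension, this does not raise an IndexError when a location belongs to no island.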

How to read one column data as one by one row in csv file using python

I have a dataset with three input columns: x1, x2 and x3. I want to read only the x2 column and go through its values one row at a time.
I wrote the code below, but it only shows letters.
Here is my code:
data = pd.read_csv('data6.csv')
row_num =0
x=[]
for col in data:
if (row_num==1):
x.append(col[0])
row_num =+ 1
print(x)
result : x1,x2,x3
What I expected output is:
expected output x2 (read one by one row)
65
32
14
25
85
47
63
21
98
65
21
47
48
49
46
43
48
25
28
29
37
Subset of my csv file :
x1 x2 x3
6 65 78
5 32 59
5 14 547
6 25 69
7 85 57
8 47 51
9 63 26
3 21 38
2 98 24
7 65 96
1 21 85
5 47 94
9 48 15
4 49 27
3 46 96
6 43 32
5 48 10
8 25 75
5 28 20
2 29 30
7 37 96
Can anyone help me to solve this error?

If you want a list of the values in x2, use:
x = data['x2'].tolist()

I am not sure I even get what you're trying to do from your code.
What you're doing (after fixing the indentation to make it somewhat correct):
Iterate through all columns of your dataframe
Take the first character of the column name if row_num is equal to 1.
Based on this guess:
import pandas as pd
data = pd.read_csv("data6.csv")
row_num = 0
x = []
for col in data:
    if row_num == 1:
        x.append(col[0])
    row_num = +1
print(x)
What you probably want to do:
import pandas as pd
data = pd.read_csv("data6.csv")
# Make a list containing the values in column 'x2'
x = list(data['x2'])
# Print all values at once:
print(x)
# Print one value per line:
for val in x:
    print(val)

Since you are already using pandas, you can select the column by name and convert it directly with list(). No for loop is needed:
import pandas as pd
data = pd.read_csv('data6.csv')
print(list(data['x2']))
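
If only x2 is needed, the other columns do not have to be loaded at all. The sketch below uses the usecols parameter of read_csv; because data6.csv is not available here, a short in-memory CSV stands in for it, and the comma separator is an assumption about the real file.

```python
import io

import pandas as pd

# Stand-in for data6.csv (first three rows of the subset shown in the question);
# with the real file, pass its path instead of the StringIO object.
csv_text = "x1,x2,x3\n6,65,78\n5,32,59\n5,14,547\n"

# usecols restricts parsing to the listed columns
data = pd.read_csv(io.StringIO(csv_text), usecols=["x2"])

# Iterating over a Series yields its values one row at a time
for value in data["x2"]:
    print(value)
```

Iterating over the DataFrame itself, as in the question's loop, yields column names rather than row values, which is why only letters appeared.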

Debugging a print DataFrame issue in Pandas

How do I debug a problem with printing a Pandas DataFrame? I call this function and then print the output (which is a Pandas DataFrame).
n=ion_tab(y_ion,cycles,t,pH)
print(n)
The last part of the output looks like this:
58 O2 1.784306e-35 4 86 7.3
60 HCO3- 5.751170e+02 5 86 7.3
61 Ca+2 1.825748e+02 5 86 7.3
62 CO2 3.928413e+01 5 86 7.3
63 CaHCO3+ 3.755015e+01 5 86 7.3
64 CaCO3 4.616840e+00 5 86 7.3
65 SO4-2 1.393365e+00 5 86 7.3
66 CO3-2 8.243118e-01 5 86 7.3
67 CaSO4 7.363058e-01 5 86 7.3
... ... ... ... ...
[65 rows x 5 columns]
But if I do an n.tail() command, I see the missing data that ... seems to suggest.
print(n.tail())
Species ppm as ion Cycles Temp F pH
68 OH- 5.516061e-03 5 86 7.3
69 CaOH+ 6.097815e-04 5 86 7.3
70 HSO4- 5.395493e-06 5 86 7.3
71 CaHSO4+ 2.632098e-07 5 86 7.3
73 O2 1.783007e-35 5 86 7.3
[5 rows x 5 columns]
If I count the number of rows showing up on the screen, I get 60. If I add the 5 extra that show up with n.tail(), I get the expected 65 rows. Is there some limit in print that would only allow 60 rows? What's causing the ... at the end of my DataFrame?
Initially I thought there was something in the ion_tab function that was limiting the printing. But once I saw the missing data in the n.tail() output, I got confused.
Any help in debugging this would be appreciated.

Pandas limits the number of rows printed by default. You can change that with pd.set_option:
In [4]: pd.get_option('display.max_rows')
Out[4]: 60
In [5]: pd.set_option('display.max_rows', 100)
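
Two ways to see every row without changing the global setting, shown as a small sketch. The 65-row DataFrame n is made up for illustration and only stands in for the output of ion_tab.

```python
import pandas as pd

# Made-up 65-row frame standing in for the result of ion_tab(...)
n = pd.DataFrame({"Species": [f"s{i}" for i in range(65)], "ppm": range(65)})

# With the limit at 60 rows, the repr of a 65-row frame is truncated with '...'
with pd.option_context("display.max_rows", 60, "display.min_rows", 10):
    truncated = repr(n)

# Lift the limit only inside this block; None means "no limit"
with pd.option_context("display.max_rows", None):
    full_text = repr(n)

# to_string() renders the whole frame regardless of the display options
print(n.to_string())
```

option_context restores the previous settings when the block ends, so other printing in the session keeps the default limit.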
