Protect one specific case in regex in python - python

I need to replace German phone numbers in Python. The formats are well explained here:
Regexp for german phone number format
Possible formats are:
06442) 3933023
(02852) 5996-0
(042) 1818 87 9919
06442 / 3893023
06442 / 38 93 02 3
06442/3839023
042/ 88 17 890 0
+49 221 549144 – 79
+49 221 - 542194 79
+49 (221) - 542944 79
0 52 22 - 9 50 93 10
+49(0)121-79536 - 77
+49(0)2221-39938-113
+49 (0) 1739 906-44
+49 (173) 1799 806-44
0173173990644
0214154914479
02141 54 91 44 79
01517953677
+491517953677
015777953677
02162 - 54 91 44 79
(02162) 54 91 44 79
I am using the following code:
df['A'] = df['A'].replace(r'(\(?([\d \-\)\–\+\/\(]+)\)?([ .\-–\/]?)([\d]+))', r'\TEL', regex=True)
The problem is that I have dates in the text:
df['A']
2017-03-07 13:48:39 Dear Sear Madam...
These need to be kept. How can I exclude the formats 2017-03-07 and 13:48:39 from my regex replacement?
Short Example:
df['A']
2017-03-077
2017-03-07
0211 11112244
desired output:
df['A']
TEL
2017-03-07
TEL

Any way you slice it, you are not dealing with regular data, and regular expressions work best with regular data. You are always going to run into "false positives" in your situation.
Your best bet is to write out each pattern individually and join them into one large alternation (a giant OR). Below is the pattern for the first three phone numbers; the rest can be added in the same way.
\d{5}\) \d{7}|\(\d{5}\) \d{4}-\d|\(\d{3}\) \d{4} \d{2} \d{4}
https://regex101.com/r/6NPzup/1
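A different way to protect the dates, which is not part of the answer above, is to match the date and time formats first in the same alternation and put them back unchanged. The following is only a minimal sketch: PHONE is a deliberately simplified stand-in for a real phone pattern, and the names PROTECTED, PHONE and mask_phones are made up for this illustration.

```python
import re

# Formats to leave untouched: ISO-style dates and hh:mm:ss times.
# The (?!\d) look-ahead rejects values such as 2017-03-077.
PROTECTED = r"\d{4}-\d{2}-\d{2}(?!\d)|\d{2}:\d{2}:\d{2}(?!\d)"

# Simplified stand-in for a phone pattern (an assumption, not the full
# list of formats): optional + or (, a digit, at least five digits or
# separators, and a final digit.
PHONE = r"[+(]?\d[\d \-–/()]{5,}\d"

# Protected formats come first, so they win when both could match.
PATTERN = re.compile("(" + PROTECTED + ")|" + PHONE)


def mask_phones(text):
    """Replace phone-like runs with TEL and keep protected matches."""
    def keep_or_mask(match):
        protected = match.group(1)
        return protected if protected is not None else "TEL"
    return PATTERN.sub(keep_or_mask, text)


for sample in ["2017-03-077", "2017-03-07", "0211 11112244"]:
    print(mask_phones(sample))
```

With the short example from the question this prints TEL, 2017-03-07 and TEL. For a DataFrame column, something like df['A'].map(mask_phones) should apply the same function row by row.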

Related

Using groupby() for a dataframe in pandas resulted in an IndexError

I have this dataframe:
x y z parameter
0 26 24 25 Age
1 35 37 36 Age
2 57 52 54.5 Age
3 160 164 162 Hgt
4 182 163 172.5 Hgt
5 175 167 171 Hgt
6 95 71 83 Wgt
7 110 68 89 Wgt
8 89 65 77 Wgt
I'm using pandas to get this final result:
x y parameter
0 160 164 Hgt
1 182 163 Hgt
2 175 167 Hgt
I'm using groupby() to extract and isolate rows based on same parameter Hgt from the original dataframe
First, I added a column to set it as an index:
df = df.insert(0,'index', [count for count in range(df.shape[0])], True)
And the dataframe came out like this:
index x y z parameter
0 0 26 24 25 Age
1 1 35 37 36 Age
2 2 57 52 54.5 Age
3 3 160 164 162 Hgt
4 4 182 163 172.5 Hgt
5 5 175 167 171 Hgt
6 6 95 71 83 Wgt
7 7 110 68 89 Wgt
8 8 89 65 77 Wgt
Then, I used the following code to group based on index and extract the columns I need:
df1 = df.groupby('index')[['x', 'y','parameter']]
And the output was:
x y parameter
0 26 24 Age
1 35 37 Age
2 57 52 Age
3 160 164 Hgt
4 182 163 Hgt
5 175 167 Hgt
6 95 71 Wgt
7 110 68 Wgt
8 89 65 Wgt
After that, I used the following code to isolate only Hgt values:
df2 = df1[df1['parameter'] == 'Hgt']
When I ran df2, I got an error saying:
IndexError: Column(s) ['x', 'y', 'parameter'] already selected
Am I missing something here? What to do to get the final result?
Because you asked what you did wrong, let me point out the unnecessary or incorrect code.
Without any judgement (this is just to help you improve future code), almost every step is incorrect. It feels like a succession of complicated ways to do unnecessary things. Let me give some details:
df = df.insert(0,'index', [count for count in range(df.shape[0])], True)
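This seems a very convoluted way to do df.reset_index(). Even [count for count in range(df.shape[0])] could have been simplified by using range(df.shape[0]) directly. Note also that DataFrame.insert works in place and returns None, so assigning its result back to df replaces the DataFrame with None.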
But this step is not even needed for a groupby as you can group by index level:
df.groupby(level=0)
But... the groupby is useless anyway, as you only have single-member groups.
Also, when you do:
df1 = df.groupby('index')[['x', 'y','parameter']]
df1 is not a DataFrame but a DataFrameGroupBy object. Storing one in a variable is very useful when you know what you're doing, but here it causes the error because you assumed it was a DataFrame. You need to apply an aggregation or transformation method of the DataFrameGroupBy object to get back a DataFrame, which you didn't (likely because, as seen above, there isn't much of interest to do with single-member groups).
So when you run:
df1[df1['parameter'] == 'Hgt']
Again, this goes wrong because df1['parameter'] is equivalent to df.groupby('index')[['x', 'y','parameter']]['parameter'], which selects columns a second time on an already-selected groupby object (the cause of the error). Even without this error, the equality comparison would give a single True/False, as you still have a DataFrameGroupBy and not a DataFrame, and the result would then be used to select a nonexistent column of the DataFrameGroupBy.
I hope it helped!
Do you really need groupby?
>>> df.loc[df['parameter'] == 'Hgt', ['x', 'y', 'parameter']].reset_index(drop=True)
x y parameter
0 160 164 Hgt
1 182 163 Hgt
2 175 167 Hgt
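If a grouped object is wanted anyway, grouping by the parameter column (rather than by a per-row index) gives groups that are worth having. The sketch below rebuilds the DataFrame from the question and pulls out the Hgt group with get_group; the variable names grouped and hgt are only illustrative.

```python
import pandas as pd

# Data copied from the question.
df = pd.DataFrame({
    "x": [26, 35, 57, 160, 182, 175, 95, 110, 89],
    "y": [24, 37, 52, 164, 163, 167, 71, 68, 65],
    "z": [25, 36, 54.5, 162, 172.5, 171, 83, 89, 77],
    "parameter": ["Age"] * 3 + ["Hgt"] * 3 + ["Wgt"] * 3,
})

# grouped is a DataFrameGroupBy object, not a DataFrame.
grouped = df.groupby("parameter")

# get_group returns an ordinary DataFrame holding one group's rows.
hgt = grouped.get_group("Hgt")[["x", "y", "parameter"]].reset_index(drop=True)
print(hgt)
```

For a single filter the boolean mask shown above is simpler; the grouped form mainly pays off when several groups are processed in turn.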

Replace Not Correctly Removing for Object Variable Type

I'm trying to remove the ":30" portion of values in my First variable. The First variable data type is object.
Here are a few examples of the First variable with their counts (the counts can be ignored):
11a 211
7p 178
4p 127
2:30p 112
11:30a 108
1p 107
12p 105
9a 100
10p 85
2p 24
10:30a 12
6p 5
9:30a 2
9p 2
12:30a 2
8p 2
I wrote the following code, which runs without any errors; however, when I run the value counts, it still shows times with a ":30". The NewFirst variable data type is int64. I'm not quite sure what I'm doing wrong here.
bad_chars = ":30"
DF["NewFirst"] = DF.First.replace(bad_chars,'')
DF["NewFirst"].value_counts()
The desired output would have the NewFirst values like:
11a 211
7p 178
4p 127
2p 112
11a 108
1p 107
12p 105
9a 100
10p 85
2p 24
10a 12
6p 5
9a 2
9p 2
12a 2
8p 2
You shouldn't be looping over the characters in bad_chars. That will remove all 3 and 0 characters, so 10p will become 1p, and 3a will become a.
You should just replace the whole bad_chars string, with no loop.
You also need to use the .str accessor.
DF["NewFirst"] = DF["First"].str.replace(bad_chars, '', regex=False)
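The reason the original call changed nothing is that Series.replace, without regex=True, only replaces values that equal ":30" as a whole, while .str.replace works on substrings. A small sketch with a few of the values from the question (the variable whole_value is just for illustration):

```python
import pandas as pd

DF = pd.DataFrame({"First": ["11a", "2:30p", "11:30a", "10p", "12:30a"]})
bad_chars = ":30"

# Series.replace compares whole cell values, so nothing matches here.
whole_value = DF["First"].replace(bad_chars, "")

# .str.replace removes the substring; regex=False treats it literally.
DF["NewFirst"] = DF["First"].str.replace(bad_chars, "", regex=False)

print(whole_value.tolist())
print(DF["NewFirst"].tolist())
```

The first list is unchanged from the input, while the second has the ":30" removed and leaves values such as 10p intact.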

Write a pandas DataFrame mixing integers and floats in a csv file

I'm working with pandas DataFrames full of float numbers, but one line in every three is made up entirely of integers. When I print df, all the values are displayed as floats (the integer values have .000000 appended), for example:
aromatics charged polar unpolar
Ac_obs_counts 712.000000 1486.000000 2688.000000 2792.000000
Ac_obs_freqs 0.092732 0.193540 0.350091 0.363636
Ac_pvalues 0.524752 0.099010 0.356436 0.495050
Am_obs_counts 10.000000 59.000000 62.000000 50.000000
Am_obs_freqs 0.055249 0.325967 0.342541 0.276243
Am_pvalues 0.495050 0.980198 0.356436 0.009901
Ap_obs_counts 18.000000 34.000000 83.000000 78.000000
Ap_obs_freqs 0.084507 0.159624 0.389671 0.366197
Ap_pvalues 0.524752 0.039604 0.980198 0.663366
When I use df.iloc[range(0, len(df.index), 3)], I see integers displayed:
aromatics charged polar unpolar
Ac_obs_counts 712 1486 2688 2792
Am_obs_counts 10 59 62 50
Ap_obs_counts 18 34 83 78
Pa_obs_counts 47 81 125 144
Pf_obs_counts 31 58 99 109
Pg_obs_counts 27 106 102 108
Ph_obs_counts 7 49 42 36
Pp_obs_counts 15 83 45 65
Ps_obs_counts 57 125 170 216
Pu_obs_counts 14 62 102 84
When I use df.to_csv("mydf.csv", sep=",", encoding="utf-8"), the integers are written as floats. How can I force these lines to be written as integers? Would it be better to split the data into two DataFrames?
Thanks in advance.
Simply cast to object:
df.astype('object')
Out[1517]:
aromatics charged polar unpolar
Ac_obs_counts 712 1486 2688 2792
Ac_obs_freqs 0.092732 0.19354 0.350091 0.363636
Ac_pvalues 0.524752 0.09901 0.356436 0.49505
Am_obs_counts 10 59 62 50
Am_obs_freqs 0.055249 0.325967 0.342541 0.276243
Am_pvalues 0.49505 0.980198 0.356436 0.009901
Ap_obs_counts 18 34 83 78
Ap_obs_freqs 0.084507 0.159624 0.389671 0.366197
Ap_pvalues 0.524752 0.039604 0.980198 0.663366
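If casting to object still leaves a trailing .0 in the written file (this may depend on the pandas version, so it is worth checking), another option is to turn each value into text before writing. This is a sketch of that alternative, not the method of the answer above; as_text and text_df are hypothetical names, and only two columns and three rows of the data are used.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "aromatics": [712.0, 0.092732, 0.524752],
        "charged": [1486.0, 0.193540, 0.099010],
    },
    index=["Ac_obs_counts", "Ac_obs_freqs", "Ac_pvalues"],
)


def as_text(value):
    """Format whole numbers without a decimal part, others as floats."""
    number = float(value)
    return str(int(number)) if number.is_integer() else repr(number)


# Every cell becomes a string, so to_csv writes it exactly as formatted.
text_df = df.apply(lambda column: column.map(as_text))
csv_text = text_df.to_csv(sep=",")
print(csv_text)
```

One design caveat: any value that happens to be a whole number is written without a decimal part, so a p-value of exactly 1.0 would appear as 1. Restricting the conversion to the count rows avoids that if it matters.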

How to split string with different phones using re?

For example there are such phones:
phones = '+35(123) 456 78 90 (123) 555 55 55 (908)985 88 89 (593)592 56 95'
I need to get:
phones_list = ['+35(123) 456 78 90', '(123) 555 55 55', '(908)985 88 89', '(593)592 56 95']
I'm trying to solve this using re, but it is quite a hard task for me.
This approach uses the + or ( to signal the beginning of a phone number. It does not require multiple-spaces:
>>> phones = '+35(123) 456 78 90 (123) 555 55 55 (908)985 88 89 (593)592 56 95'
>>> re.split(r' +(?=[(+])', phones)
['+35(123) 456 78 90', '(123) 555 55 55', '(908)985 88 89', '(593)592 56 95']
This splits the string based on one-or-more spaces followed by either ( or +.
In the regular expression, a space followed by + matches one or more spaces. (?=[(+]) is a look-ahead. It requires that the spaces be followed by either ( or + but does not consume the ( or +. Because we are using a look-ahead instead of a plain match, the leading ( and + remain part of the phone number.
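To see what the look-ahead buys, the same split can be run with and without it. This is a small self-contained script version of the session above; the variable names with_lookahead and without_lookahead are only for illustration.

```python
import re

phones = '+35(123) 456 78 90 (123) 555 55 55 (908)985 88 89 (593)592 56 95'

# Look-ahead: the ( or + is checked but stays in the following piece.
with_lookahead = re.split(r' +(?=[(+])', phones)

# Plain match: the ( or + is consumed as part of the separator.
without_lookahead = re.split(r' +[(+]', phones)

print(with_lookahead)
print(without_lookahead)
```

In the second result every number after the first loses its opening parenthesis, which is why the look-ahead form is used.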

How do I convert a string of ASCII values to their original characters/numbers in Python

I have a string of numbers that I previously produced with my encoder, and now I am trying to decode it. I've searched around, but none of the answers seem to work.
If you have any idea how to do this, please let me know.
string = "91 39 65 97 66 98 67 99 32 49 50 51 39 93"
outcome = ABCabc 123
outcome = "".join([your_decoder.decode(x) for x in string.split(" ")])
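The your_decoder.decode placeholder above is left open. One plain-Python way to fill it, assuming each number is a decimal character code, is the built-in chr; decode_codes is a hypothetical helper name used only for this sketch.

```python
def decode_codes(encoded):
    """Turn space-separated decimal character codes back into text."""
    return "".join(chr(int(code)) for code in encoded.split())


string = "91 39 65 97 66 98 67 99 32 49 50 51 39 93"
outcome = decode_codes(string)
print(outcome)
```

With the numbers exactly as posted this prints ['AaBbCc 123'] rather than ABCabc 123, so the encoder may have been applied to the text form of a list; that is a guess from the sample, not something the question states.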
