Multiple regex replacements in a loop do not work [closed] - python

Closed. This question needs debugging details. It is not currently accepting answers.
Closed 5 years ago.
I'm attempting to replace URLs and #username mentions in Twitter data using Python regular expressions and a for loop.
d = df['text']
for i, e in enumerate(d):
    d[i] = re.sub('((www.\.[\s]+)|(https?://[^\s]+))','URL', e)
    d[i] = re.sub('#[^\s]+', 'AT_USER', e)
The problem is that the for loop only works for the second line of regex code ('AT_USER'). I want to replace the URL AND #username mentions. I was thinking of making two separate for loops for each but surely there's a more effective way?

So, the issue with your code as of now is here -
#                                  vvv
d[i] = re.sub('#[^\s]+', 'AT_USER', e)
You should be passing d[i] instead of e. Because you pass e, the second call works on the original string and overwrites the result of the first replacement. Change it, and it should work.
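As a minimal sketch of that fix, the loop below chains the two substitutions so the second one runs on the output of the first. The sample strings are made up, and the URL pattern is an assumption about what the question's pattern was meant to be (`www.` or `http(s)://` followed by non-whitespace), not the original regex.

```python
import re

# Hypothetical sample tweets, standing in for df['text'].
texts = [
    "check https://example.com now #alice",
    "visit www.example.org #bob",
]

cleaned = []
for text in texts:
    # First replacement: URLs (assumed pattern, see the note above).
    text = re.sub(r'(www\.[^\s]+)|(https?://[^\s]+)', 'URL', text)
    # Second replacement runs on the already-modified string, not the original.
    text = re.sub(r'#[^\s]+', 'AT_USER', text)
    cleaned.append(text)

print(cleaned)
```

Reusing one variable for both calls makes it hard to pass the stale original string by accident.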
You're using pandas. It's time to ditch the loop. First, initialise a dictionary of regex-replacement pairs -
p_dict = {r'((www.\.[\s]+)|(https?://[^\s]+))' : 'URL', r'#[^\s]+' : 'AT_USER'}
Now, pass this to df.replace with the regex switch -
df['text'] = df['text'].replace(p_dict, regex=True)
Here's a little example with some dummy data -
s
0 12.2
1 12.5
2 12.6
3 15.1
4 15.3
5 15.0
dtype: object
s[0]
Out[190]: '12.2' # a string
p_dict = {'\d' : '<DIGIT>', '\.' : '<DOT>'}
s.replace(p_dict, regex=True)
0 <DIGIT><DIGIT><DOT><DIGIT>
1 <DIGIT><DIGIT><DOT><DIGIT>
2 <DIGIT><DIGIT><DOT><DIGIT>
3 <DIGIT><DIGIT><DOT><DIGIT>
4 <DIGIT><DIGIT><DOT><DIGIT>
5 <DIGIT><DIGIT><DOT><DIGIT>
dtype: object


How do I remove unwanted parts from strings in a Python DataFrame column

Based on the script originally suggested by u/commandlineluser at reddit, I (as a Python novice) attempted to revise the original code to remove unwanted parts that vary across column values. The Python script involves creating a dictionary with keys and values and using a list comprehension with str.replace.
(part of the original script by u/commandlineluser at reddit)
extensions = "dat", "ssp", "dta", "v9", "xlsx"
(The next line is my revision to the above part, and below is the complete code block)
extensions = "dat", "ssp", "dta", "20dta", "u20dta", "f1dta", "f2dta", "v9", "xlsx"
Some of the results are different than what I desire. Please see below (what I tried).
import pandas as pd
import re
data = {"full_url": ['https://meps.ahrq.gov/data_files/pufs/h225/h225dat.zip',
                     'https://meps.ahrq.gov/data_files/pufs/h51bdat.zip',
                     'https://meps.ahrq.gov/data_files/pufs/h47f1dat.zip',
                     'https://meps.ahrq.gov/data_files/pufs/h225/h225ssp.zip',
                     'https://meps.ahrq.gov/data_files/pufs/h220i/h220if1dta.zip',
                     'https://meps.ahrq.gov/data_files/pufs/h220h/h220hv9.zip',
                     'https://meps.ahrq.gov/data_files/pufs/h220e/h220exlsx.zip',
                     'https://meps.ahrq.gov/data_files/pufs/h224/h224xlsx.zip',
                     'https://meps.ahrq.gov/data_files/pufs/h036brr/h36brr20dta.zip',
                     'https://meps.ahrq.gov/data_files/pufs/h036/h36u20dta.zip',
                     'https://meps.ahrq.gov/data_files/pufs/h197i/h197if1dta.zip',
                     'https://meps.ahrq.gov/data_files/pufs/h197i/h197if2dta.zip']}
df = pd.DataFrame(data)
extensions = ["dat", "ssp", "dta", "20dta", "u20dta", "f1dta", "f2dta", "v9", "xlsx"]
replacements = dict.fromkeys((f"{ext}[.]zip$" for ext in extensions), "")
df["file_id"] = df["full_url"].str.split("/").str[-1].replace(replacements, regex=True)
print(df["file_id"])
Annotated output
0 h225 (looks good)
1 h51b (looks good)
2 h47f1 (h47 -> desired)
3 h225 (looks good)
4 h220if1 (h220i -> desired)
5 h220h (looks good)
6 h220e (looks good)
7 h224 (looks good)
8 h36brr20 (h36brr -> desired)
9 h36u20 (h36 -> desired)
10 h197if1 (h197i -> desired)
11 h197if2 (h197i -> desired)
You have two issues here, and both are in this line:
extensions = ["dat", "ssp", "dta", "20dta", "u20dta", "f1dta", "f2dta", "v9", "xlsx"]
First issue
The first issue is the order of the elements of this list. "dat" and "dta" are substrings of other elements in this list, and they come before those longer elements. Let's take an example: h47f1dat.zip needs to become h47. But in these lines:
replacements = dict.fromkeys((f"{ext}[.]zip$" for ext in extensions), "")
df["file_id"] = df["full_url"].str.split("/").str[-1].replace(replacements, regex=True)
You keep the order, meaning that the "dat" pattern is applied first and h47f1dat.zip becomes h47f1. After that, no longer pattern can match because the .zip suffix is already gone. This can be easily fixed by reordering your list so that longer extensions come before the shorter ones they contain.
Second issue
You missed an entry in your extensions list: if you want h47f1dat.zip to become h47 you need to have "f1dat" in your list but you only have "f1dta".
Conclusion
You were almost there! There was simply a small issue with the order of the elements and one extension was missing (or you have a typo in your URLs).
The following extensions list:
extensions = ["ssp", "20dta", "u20dta", "f1dat", "f1dta", "f2dta", "v9", "dat", "dta", "xlsx"]
Together with the rest of your code gives you the result you want:
0 h225
1 h51b
2 h47
3 h225
4 h220i
5 h220h
6 h220e
7 h224
8 h36brr
9 h36u
10 h197i
11 h197i
Good catch about the issue of the order of the elements and the missing extension! Thank you.
Question 1: Do you mean the list extensions is not sorted alphabetically? Can I not use the Python sort() method to sort the list? I have over one thousand rows in the actual dataframe, and I prefer to sort the list programmatically. I hope I do not misunderstand your comments.
Question 2: I don't understand why I am getting h36u instead of the desired value h36 in the output even after reordering the list as you suggested. Any thoughts?
I have tried another approach (code below) using Jupyter Lab, which provides the output in which the first two values are different from the desired output (also shown below), but the other values seem to be what I desire including h36.
df["file_id"] = df["full_url"].str.split("/").str[-1].str.replace(r'(\dat.zip \
|f1dat.zip|dta.zip|f1dta.zip|f2dta.zip|20dta.zip|u20dta.zip|xlsx.zip|v9.zip|ssp.zip)' \
,'', regex=True)
print(df["file_id"])
Output (annotated)
0 h225dat.zip (not desired; h225 desired)
1 h51bdat.zip (not desired; h51b desired)
2 h47
3 h225
4 h220i
5 h220h
6 h220e
7 h224
8 h36brr
9 h36
10 h197i
11 h197i
Question 3: Any comments on the above alternative code snippets?
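One possible way to address Questions 1 to 3, sketched with the standard `re` module rather than pandas: sort the extensions by length (longest first) instead of alphabetically, and build a single anchored alternation from them. Alphabetical sorting would not help, because what matters is that "u20dta" is tried before "20dta", and "f1dat" before "dat". Regarding the alternative pattern, `\d` in `\dat.zip` matches a digit rather than the letter d, which would explain why the first two rows are left unchanged. The extension list below includes "f1dat" as suggested in the answer, and the file names are taken from the question's URLs.

```python
import re

extensions = ["dat", "ssp", "dta", "20dta", "u20dta",
              "f1dat", "f1dta", "f2dta", "v9", "xlsx"]

# Longest extensions first, so a short one never strips part of a longer one.
ordered = sorted(extensions, key=len, reverse=True)

# One pattern: any extension, then a literal ".zip" at the end of the string.
pattern = re.compile("(?:" + "|".join(map(re.escape, ordered)) + r")\.zip$")

names = ["h225dat.zip", "h47f1dat.zip", "h36brr20dta.zip",
         "h36u20dta.zip", "h197if2dta.zip", "h220hv9.zip"]
file_ids = [pattern.sub("", name) for name in names]
print(file_ids)
```

The same compiled pattern could presumably be passed to `Series.str.replace(..., regex=True)`, though that is not tested here.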

Binary numbers to list [closed]

Closed. This question needs debugging details. It is not currently accepting answers.
Closed 1 year ago.
I have written the following program in Python:
s = []
for e in random_key:
    s = str(e)
    print(s)
where the list random_key is
random_key = ['0011111011100101', '0000010111111011', '0011100110110100',
'1000010101010010', '0011001011001111', '1101101101110011',
'1100001111111011', '0000100000110100', '0101111010100101',
'1001100101100001']
The output of the program is
1111011010110011
1011000110011100
0011011001100010
0000011100100001
1111111010000100
0110110101100011
1011100011000101
1011101011100010
1101101101001010
1000011110110000
which is not correct. How can I fix the code?
If I am reading your intent correctly (I am not sure about that), would you like to convert them to base-10 numbers?
random_key = ['0011111011100101', '0000010111111011', '0011100110110100',
'1000010101010010', '0011001011001111', '1101101101110011',
'1100001111111011', '0000100000110100', '0101111010100101',
'1001100101100001']
numbers = [int(x, 2) for x in random_key]
print(numbers)
output
[16101, 1531, 14772, 34130, 13007, 56179, 50171, 2100, 24229, 39265]
Do you mean this?
s = list()
for e in random_key:
    s.append(str(e))
print(s)
Returns:
['0011111011100101', '0000010111111011', '0011100110110100', '1000010101010010', '0011001011001111', '1101101101110011', '1100001111111011', '0000100000110100', '0101111010100101', '1001100101100001']
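Since the question does not say which kind of list is wanted, here is a short sketch of three plausible readings, using the first two keys from the question. Note that the original loop rebinds `s` to a string on every iteration, so the empty list it started with is discarded. The variable names are illustrative only.

```python
random_key = ['0011111011100101', '0000010111111011']

# Reading 1: a list of the strings themselves (what append() builds).
as_strings = [str(e) for e in random_key]

# Reading 2: each key split into a list of individual bits.
as_bits = [[int(bit) for bit in e] for e in random_key]

# Reading 3: each key interpreted as a base-2 number.
as_ints = [int(e, 2) for e in random_key]

print(as_strings)
print(as_bits[1])
print(as_ints)
```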

my 'if' function won't work with strings? [closed]

Closed. This question needs debugging details. It is not currently accepting answers.
Closed 2 years ago.
It's a simple question:
lacount=0
ptcount=0
for line in list1:
    print(str(line))
    if 'LA'==str(line):
        lacount+=1
    if 'PT'==str(line):
        print('pt works')
        ptcount+=1
I'm trying to count how many 'PT' and 'LA' there are in a list but it seems like the if statements are not working, as my value still remains as a zero. Can someone help please?
The list printed by the code above comes out as:
PMID
TI
DP
AU
AU
AU
JT
LA
PT
PMID
TI
DP
AU
JT
LA
PT
PMID
TI
LID
DP
JT
AU
AU
LA
PT
PT = 0
LA = 0
Adding a strip() will remove any leading or trailing whitespace (for example a newline) that might be in the string:
lacount=0
ptcount=0
for line in list1:
    print(str(line))
    if 'LA'==str(line).strip():
        lacount+=1
    if 'PT'==str(line).strip():
        print('pt works')
        ptcount+=1
I can't see your reference text that feeds into this function but try this:
lacount, ptcount = 0, 0
for line in list1:
    print(str(line))
    if 'LA' in str(line):
        lacount+=1
    if 'PT' in str(line):
        print('pt works')
        ptcount+=1
If you have multiple occurrences in a line:
lacount=0
ptcount=0
for line in list1:
    lacount += str(line).count('LA')
    ptcount += str(line).count('PT')
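To illustrate why the equality test can fail silently, here is a small sketch that assumes the list items carry a trailing newline (a common result of reading lines from a file; the question does not confirm this). It also shows `collections.Counter` as one way to count every tag in a single pass. The contents of `list1` are invented for the example.

```python
from collections import Counter

# Hypothetical data: tags read from a file, each with a trailing newline.
list1 = ["PMID\n", "TI\n", "LA\n", "PT\n", "AU\n", "LA\n", "PT\n"]

# The raw comparison fails because 'LA' != 'LA\n'.
raw_match = ('LA' == list1[2])

# Strip whitespace once, then count every distinct tag.
counts = Counter(str(line).strip() for line in list1)
lacount = counts['LA']
ptcount = counts['PT']

print(raw_match, lacount, ptcount)
```

Printing `repr(line)` instead of `str(line)` in the original loop would make any hidden whitespace visible.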

Convert string to dictionary with counts of variables [closed]

Closed. This question needs to be more focused. It is not currently accepting answers.
Closed 5 years ago.
This is a snippet of the output:
{...,"resultMap":
{..."SEARCH_RESULTS":
[{..."resultList":[
{"userClientId":"1","preferenceValues":["48","51","94"],"MyDate":"7/26/2017 8:30:00 AM"},
{"userClientId":"2","preferenceValues":["42","11","84"],"MyDate":"7/26/2017 9:40:00 AM"},
{"userClientId":"3","preferenceValues":["4","16","24"],"MyDate":"7/26/2017 4:20:00 PM"},
{"userClientId":"4","preferenceValues":["7","2","94"],"MyDate":"7/27/2017 8:00:00 AM"},
{"userClientId":"1","preferenceValues":["48","22","94"],"MyDate":"7/27/2017 1:50:00 PM"},
{"userClientId":"2","preferenceValues":["42","11"],"MyDate":"7/27/2017 2:00:00 PM"},
{"userClientId":"3","preferenceValues":["4","24"],"MyDate":"7/27/2017 6:15:00 PM"},
{"userClientId":"4","preferenceValues":"7","MyDate":"7/27/2017 9:30:00 PM"}]
}]
}
}
I am looking to get a variable pageIdCount in dictionary format, where the keys are page_ids and the values are counts of occurrences of each page_id, per user_id. So for userId 1 it should look like:
{"userClientId":"1","preferenceValues":{48:2, 51:1, 94:2, 22:1}}
Note that when there is only one value inside preferenceValues, there are no brackets. There is also a field "preferenceValue" that never has brackets and is otherwise identical to "preferenceValues".
Is that possible?
In Python 2.7, I specify user, password and url and then I have the following:
req = requests.post(url = url, auth=(user, password))
ans = req.json()
print ans["resultMap"]["SEARCH_RESULTS"][0]["resultList"]
Any help is greatly appreciated.
# your_data is your list of records
final_data = {}
for line in your_data:
    uid = line["userId"]
    pids = line["PageId"]
    if uid not in final_data:
        final_data[uid] = {}
    for pid in pids:
        pid = int(pid)
        if pid not in final_data[uid]:
            final_data[uid][pid] = 0
        final_data[uid][pid] += 1
res = [{"userId": uid, "PageIDCount": pids} for uid, pids in final_data.items()]
I suppose you are a beginner; if so, the trickiest part of this code will probably be the last line, which uses a list comprehension.
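The same counting idea can be sketched against the field names that actually appear in the question's snippet ("userClientId" and "preferenceValues"). This is a Python 3 sketch under stated assumptions, not the answerer's code: `result_list` is a hand-made subset of the question's data, and a bare string is treated as a one-element list to cover the "no brackets" case the question mentions.

```python
from collections import Counter, defaultdict

# Hand-made subset of the "resultList" records from the question.
result_list = [
    {"userClientId": "1", "preferenceValues": ["48", "51", "94"]},
    {"userClientId": "1", "preferenceValues": ["48", "22", "94"]},
    {"userClientId": "4", "preferenceValues": ["7", "2", "94"]},
    {"userClientId": "4", "preferenceValues": "7"},  # single value, no brackets
]

counts = defaultdict(Counter)  # one Counter per user
for record in result_list:
    values = record["preferenceValues"]
    if isinstance(values, str):
        values = [values]  # normalise a lone value to a list
    counts[record["userClientId"]].update(int(v) for v in values)

page_id_count = [
    {"userClientId": uid, "preferenceValues": dict(counter)}
    for uid, counter in counts.items()
]
print(page_id_count)
```

`Counter.update` adds to existing counts, which replaces the explicit "initialise to 0, then increment" steps.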

convert csv columns to dictionary in python [closed]

Closed. This question needs details or clarity. It is not currently accepting answers.
Closed 8 years ago.
I have a csv file with the following data.
Column-1 Column-2 Column-3
bob sweet 4
alice uber 4.5
bob uber 4
alice sweet 4.5
razi fav 2.5
razi uber 3.5
bob fav 4
I want to convert it to a dictionary as shown,
A={'bob':{'sweet':'4', 'uber':'4', 'fav':'4'},
'alice':{'uber':'4.5', 'sweet':'4.5'},
'razi':{'fav':'2.5', 'uber':'3.5'}}
in python
To do that, I planned to first convert the CSV to a dictionary of lists like the one below and then build my output from it. I am unable to do so because the keys are repeated, as shown.
A={'bob':['sweet','4'],
'alice':['uber','4.5'],
'bob':['uber','4'],
'alice':['sweet','4.5'],
'razi':['fav','2.5'],
'razi':['uber','3.5'],
'bob':['fav','4']}
Can anyone suggest a way to solve this problem?
Assuming none of your data fields contain spaces, and every data row has exactly 3 fields:
import logging
logging.basicConfig(level=logging.INFO) # <- in a real application,
                                        #    should be set application-wide
                                        #    from a config file
logger = logging.getLogger("CSV import")
result = {}
nlines = 0
ok = 0
warnings = 0
with open("my_file.csv") as f:
    f.readline() # Skip header. Assuming only one line of heading
    for row in (line.split() for line in f):
        nlines += 1
        try:
            k1, k2, val = row
            result.setdefault(k1, {})[k2] = val
            ok += 1
        except ValueError:
            logger.warning("Format mismatch: %s", row)
            warnings += 1
            # what to do next?
logger.info("%d lines read. %d imported. %d warnings", nlines, ok, warnings)
from pprint import pprint
pprint(result)
Given your sample data file, this produces:
INFO:CSV import:7 lines read. 7 imported. 0 warnings
{'alice': {'sweet': '4.5', 'uber': '4.5'},
'bob': {'fav': '4', 'sweet': '4', 'uber': '4'},
'razi': {'fav': '2.5', 'uber': '3.5'}}
The trick here is to use setdefault to access the outer dictionary. It returns the existing value if the key is already present, or inserts and returns a new empty dictionary if this is the first time we encounter that key. After that, it is simply a matter of adding the value to the inner dictionary as usual.
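The setdefault step can be seen in isolation with a few in-memory rows. This is only a sketch: the rows are copied from the question's sample data, and the names `rows` and `inner` are illustrative.

```python
# A few rows from the sample file, already split into three fields.
rows = [
    ("bob", "sweet", "4"),
    ("alice", "uber", "4.5"),
    ("bob", "uber", "4"),
]

result = {}
for name, item, rating in rows:
    # Returns the existing inner dict for this name, or inserts a new empty one.
    inner = result.setdefault(name, {})
    inner[item] = rating

print(result)
```

Because the second "bob" row reuses the inner dictionary created by the first, repeated keys are merged rather than overwritten.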
