I have the following CSV file:
Summary,Issue key,Issue id,Issue Type,Status,Project key,Attachment,Attachment.1,Attachment.2,Attachment.3,Attachment.4,Attachment.5
Find issue,IS-11,576,Task,Solved,One-1,10/28/21 11:49;Olga_Sokolova;SALUPRJBKK-1663_2021-10-28 14-38-01-372.mp4;file://SALUPRJBKK/SALUPRJBKK-1663/SALUPRJBKK-1663_2021-10-28 14-38-01-372.mp4
I need to select all the attachment values and replace each space in the filename with "%20".
The main problem is skipping the first space (the one after the date) in each attachment value, and also getting all the attachment values in the first place.
I tried the standard csv reader, pandas, etc., but I can only get the column names:
import pandas as pd
data = pd.read_csv("SALUPRJBKK_new_10.csv")
for i in data:
    if "Attachment" in i:
        print(i)
Select the 'Attachment' columns using filter, replace every space with '%20', then update your dataframe in place:
df.update(df.filter(like='Attachment').replace(' ', '%20', regex=True))
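My advice, if you need to percent-encode the values for use in a URL, is to use quote from the urllib.parse module (note that it also encodes other characters, such as ';' and ':'):
from urllib.parse import quote
# In pandas 2.1+, applymap is deprecated in favor of DataFrame.map.
df.update(df.filter(like='Attachment').fillna('').applymap(quote))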
Update
To replace the spaces only in the file:// part of each value, try:
out = df.filter(like='Attachment').unstack().str.split(';').explode()
out = out.where(~(out.str.startswith('file://').fillna(False)),
                out.str.replace(' ', '%20'))
df.update(out.dropna().groupby(level=[0, 1]).apply(';'.join).unstack(0))
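For comparison, here is a minimal standard-library sketch of the same idea without pandas. It assumes each attachment value is a ';'-separated string and that only the part starting with file:// should have its spaces encoded. The helper name encode_attachment and the shortened sample row are illustrative, not taken from the question.

```python
import csv
import io


def encode_attachment(value: str) -> str:
    """Replace spaces with %20 only in the ';'-separated parts that are file:// URLs."""
    parts = value.split(";")
    return ";".join(
        p.replace(" ", "%20") if p.startswith("file://") else p
        for p in parts
    )


# Shortened, hypothetical sample in the same shape as the CSV from the question.
sample = (
    "Summary,Issue key,Attachment,Attachment\n"
    "Find issue,IS-11,10/28/21 11:49;Olga_Sokolova;a b.mp4;file://DIR/a b.mp4,\n"
)

reader = csv.reader(io.StringIO(sample))
header = next(reader)
# Positions of every column whose header starts with "Attachment".
attachment_cols = [i for i, name in enumerate(header) if name.startswith("Attachment")]

rows = []
for row in reader:
    for i in attachment_cols:
        row[i] = encode_attachment(row[i])
    rows.append(row)

print(rows[0][2])
```

The date and the plain filename keep their spaces; only the file:// part is changed.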
Consider specifying the separator:
data = pd.read_csv("SALUPRJBKK_new_10.csv", sep=",")
Note: your example mixes two separators: "," between the columns and ";" inside the attachment values.
If you only want to access the "Attachment" column, do:
data["Attachment"]
I think you need URL encoding.
Try this:
import urllib.parse
query = 'Hellóóó W r l d # Pyt## h on.mp4'
newUrl = urllib.parse.quote(query)
print(newUrl)
And now the encoded result:
Hell%C3%B3%C3%B3%C3%B3%20W%20r%20l%20d%20%23%20Pyt%23%23%20h%20on.mp4
The string is percent-encoded: every special character in it is replaced by its %XX escape.
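One detail worth checking before applying quote to a whole file:// value: by default only '/' is left unescaped, so the ':' of the scheme is encoded too. The sketch below uses a made-up path to show the difference the safe parameter makes.

```python
from urllib.parse import quote

url = "file://SALUPRJBKK/some file.mp4"  # hypothetical path, not from the question

default_encoded = quote(url)          # ':' is not in the default safe set ('/')
kept_scheme = quote(url, safe=":/")   # keep ':' and '/' unescaped

print(default_encoded)
print(kept_scheme)
```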
Related
I'm trying to parse a site without using selenium. Requests copes with fetching it, but something strange is happening: I can't extract the text I need with a regular expression. The text is there (you can see it if you do print(data.text)), but re doesn't find it. If I copy this text into Notepad++, it shows all of these characters as a single line.
import requests
import re
data = requests.get('https://ru.runetki3.com/?page=1')
print(data.text)
What is this, and how do I work with it? Pay attention to the line numbers.
You can try to use their Ajax API to load all usernames + thumb images:
import pandas as pd
import requests
url = 'https://ru.runetki3.com/tools/listing_v3.php?livetab=female&offset=0&limit={}'
headers = {'X-Requested-With': 'XMLHttpRequest'}
all_data = []
for p in range(1, 4):  # <-- increase number of pages here
    data = requests.get(url.format(p * 144), headers=headers).json()
    for m in data['models']:
        all_data.append((m['username'], m['display_name'], m['thumb_image'].replace('{ext}', 'jpg')))
df = pd.DataFrame(all_data, columns=['username', 'display_name', 'thumb'])
print(df.head())
Prints:
     username  display_name                                                  thumb
0     wetlilu   Little_Lilu  //i.bimbolive.com/live/034/263/131/xbig_lq/c30823.jpg
1  mellannie8  mellannieSEX  //i.bimbolive.com/live/034/24f/209/xbig_lq/314348.jpg
2    mokkoann      mokkoann  //i.bimbolive.com/live/034/270/279/xbig_lq/cb25cb.jpg
3    ogurezzi  CynEp-nuCbka  //i.bimbolive.com/live/034/269/02c/xbig_lq/3ebe2a.jpg
4   Pepetka22     _-Katya-_  //i.bimbolive.com/live/034/24f/36e/xbig_lq/18da8e.jpg
Avoid using . in a regex unless you really want to match any character. Here the usernames (as far as I can see) contain only hyphens and word characters, so you can retrieve them with:
re.findall(r'"username":"([\w-]+)"', data.text)
An even simpler way, which avoids dealing with special characters at all, is to match every character except ":
re.findall(r'"username":"([^"]+)"',data.text)
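To see how a character class of word characters and hyphens differs from a negated class, here is a small sketch on a made-up fragment shaped like the JSON embedded in the page. The third username, which contains a dot, is invented to show the difference.

```python
import re

# Hypothetical fragment; only the first two names resemble the real data.
text = (
    '{"username":"JuliaCute","display_name":"_Cute"},'
    '{"username":"Pepetka22","display_name":"_-Katya-_"},'
    '{"username":"ann.lee","display_name":"Ann"}'
)

# Word characters and hyphens only: a name containing '.' is skipped.
by_class = re.findall(r'"username":"([\w-]+)"', text)
# Anything up to the closing quote: every name is captured.
by_negation = re.findall(r'"username":"([^"]+)"', text)

print(by_class)
print(by_negation)
```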
So here's a way of getting the info you seek (I joined them into a dictionary, but you can change that to whatever you prefer):
import requests
import re
data = requests.get('https://ru.runetki3.com/?page=1')
with open("return.txt", 'w', encoding='utf-8') as f:
    f.write(data.text)
names = re.findall(r'"username":"([^"]+)"',data.text)
disp_names = re.findall(r'"display_name":"([^"]+)"',data.text)
thumbs = re.findall(r'"thumb_image":"([^"]+)"',data.text)
names_dict = {name:[disp, thumb.replace('{ext}', 'jpg')] for name, disp, thumb in zip(names, disp_names, thumbs)}
Example
names_dict['JuliaCute']
# ['_Cute',
# '\\/\\/i.bimbolive.com\\/live\\/055\\/2b0\\/15d\\/xbig_lq\\/d89ef4.jpg']
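The thumb value above still carries the JSON escapes (\/ for each slash), because the regex captures the raw text between the quotes. One way to undo them, sketched here under the assumption that the captured text is a valid JSON string body, is to wrap it in double quotes and let json.loads decode it.

```python
import json

# The captured text as it appears in the page source (single backslashes).
raw = r'\/\/i.bimbolive.com\/live\/055\/2b0\/15d\/xbig_lq\/d89ef4.jpg'

# Wrap in quotes to form a JSON string literal, then decode the escapes.
thumb = json.loads(f'"{raw}"')
print(thumb)
```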
How exactly can I delete the characters after .jpg? Is there a way in Python to separate the extension from what follows it?
For example, I have a link like this:
https://s13emagst.akamaized.net/products/29146/29145166/images/res_cd1fa80f252e88faa70ffd465c516741.jpg10DCC3DD9E74DC1D10104F623D7E9BDC
How can I delete everything after .jpg?
I tried replacing, but it didn't work.
Is there another way?
Use a for loop to count characters, or something like that?
I tried to get the jpg files with this:
import csv
import os

import requests
from bs4 import BeautifulSoup

filename = 'links.csv'  # the file that is both opened and checked for emptiness
for link in links:  # links: the page URLs collected earlier
    res = requests.get(link).text
    soup = BeautifulSoup(res, 'html.parser')
    img_links = []
    for img in soup.select('a.thumbnail img[src]'):
        print(img["src"])
        with open(filename, 'a', encoding='utf-8', newline='') as csv_file:
            # Write the header only when the file is still empty.
            file_is_empty = os.stat(filename).st_size == 0
            fieldname = ['links']
            writer = csv.DictWriter(csv_file, fieldnames=fieldname)
            if file_is_empty:
                writer.writeheader()
            writer.writerow({'links': img["src"]})
        img_links.append(img["src"])
You could use split. This assumes the string contains '.jpg'; if it doesn't, the code below just appends '.jpg' to the original URL.
string = 'https://s13emagst.akamaized.net/products/29146/29145166/images/res_cd1fa80f252e88faa70ffd465c516741.jpg10DCC3DD9E74DC1D10104F623D7E9BDC'
jpg_removed = string.split('.jpg')[0]+'.jpg'
Example
string = 'www.google.com'
com_removed = string.split('.com')[0]
# com_removed = 'www.google'
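A variant using str.partition leaves a URL without the extension untouched instead of appending '.jpg' to it. This is only a sketch; the function name truncate_after and the example URLs are made up for illustration.

```python
def truncate_after(url: str, ext: str = ".jpg") -> str:
    """Keep everything up to and including the first occurrence of ext."""
    head, sep, _tail = url.partition(ext)
    # If ext is absent, sep is '' and head is the whole string.
    return head + sep


with_junk = truncate_after("https://example.com/images/res_cd1f.jpg10DCC3")
without_ext = truncate_after("https://example.com/page")
print(with_junk)
print(without_ext)
```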
You can use a regular expression. You just want to ignore the characters after .jpg, so you can use something like this:
import re
new_url = re.findall(r"(.*\.jpg).*", old_url)[0]
(.*\.jpg) is a capturing group that matches any number of characters followed by .jpg. Since . has a special meaning in a regex, the . in .jpg must be escaped as \. The trailing .* also matches any number of characters, but because it is outside the capturing group () it is matched without being extracted.
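Note that .* is greedy, so the group extends to the last .jpg in the string. If the first occurrence is what you want, a lazy .*? stops there instead. A small sketch with a made-up URL containing two .jpg substrings:

```python
import re

url = "https://example.com/a.jpg/b.jpgXYZ"  # hypothetical URL

greedy = re.match(r"(.*\.jpg)", url).group(1)   # runs to the last ".jpg"
lazy = re.match(r"(.*?\.jpg)", url).group(1)    # stops at the first ".jpg"

print(greedy)
print(lazy)
```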
You can use the .find method to locate .jpg, then slice the string to keep everything up to and including it. Ex:
string = 'https://s13emagst.akamaized.net/products/29146/29145166/images/res_cd1fa80f252e88faa70ffd465c516741.jpg10DCC3DD9E74DC1D10104F623D7E9BDC'
index = string.find(".jpg")
new_string = string[:index + 4]
You have to add four because that is the length of .jpg, so that the extension itself is not cut off.
The find() method returns the lowest index of the substring if it is found in the given string. If it is not found, it returns -1.
url = 'https://s13emagst.akamaized.net/products/29146/29145166/images/res_cd1fa80f252e88faa70ffd465c516741.jpg10DCC3DD9E74DC1D10104F623D7E9BDC'
result = url.find('jpg')
print(result)
new_url = url[:result]
print(new_url + 'jpg')
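Because find() returns -1 when nothing is found, slicing with its result unchecked silently cuts the string in the wrong place. A minimal guarded sketch (the function name cut_after_jpg and the test URL are illustrative):

```python
def cut_after_jpg(url: str) -> str:
    """Return url truncated right after the first '.jpg', or unchanged if absent."""
    index = url.find(".jpg")
    if index == -1:
        return url
    return url[: index + len(".jpg")]


no_ext = "https://example.com/page"  # hypothetical URL without ".jpg"
unguarded = no_ext[: no_ext.find(".jpg") + 4]  # -1 + 4 == 3, keeps only 3 characters
guarded = cut_after_jpg(no_ext)

print(unguarded)
print(guarded)
```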
See: Extracting extension from filename in Python
Instead of extracting the extension, we extract the filename and add the extension back (this is fine if we know it's always .jpg):
import os
filename, file_extension = os.path.splitext('/path/to/somefile.jpg_corruptedpath')
result = filename + '.jpg'
Now, outside of the original question, I think there might be something wrong with how you got that piece of information in the first place. There must be a better way to extract that jpeg without messing around with the path. Sadly, I can't help you with that, since I am a novice with BeautifulSoup.
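For reference, this is how os.path.splitext splits the example path, together with one case where the approach would go wrong: it splits at the last dot, so junk that itself contains a dot leaves .jpg inside the name. The second path is a made-up illustration.

```python
import os

name, ext = os.path.splitext("/path/to/somefile.jpg_corruptedpath")
print(name, ext)

# Hypothetical path whose trailing junk contains another dot.
name2, ext2 = os.path.splitext("/path/to/somefile.jpg.v2")
print(name2 + ".jpg")
```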
You could use a regular expression to replace everything after .jpg with an empty string:
import re
url ='https://s13emagst.akamaized.net/products/29146/29145166/images/res_cd1fa80f252e88faa70ffd465c516741.jpg10DCC3DD9E74DC1D10104F623D7E9BDC'
name = re.sub(r'(?<=\.jpg).*',"",url)
print(name)
https://s13emagst.akamaized.net/products/29146/29145166/images/res_cd1fa80f252e88faa70ffd465c516741.jpg
I am trying to extract a string from a longer string in one of my columns.
Here is a sample of what I have tried:
df['Campaign'] = df.full_utm.str.extract('utm_campaign=([^&]*)')
and this is a sample of the string I am referring to:
?utm_source=Facebook&utm_medium=CPC&utm_campaign=April+Merchants+LAL+-+All+SA+-+CAP+250&utm_content=01noprice
The problem is that this only returns this:
A
The desired output in this context would be
April+Merchants+LAL+-+All+SA+-+CAP+250
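Use urllib.parse. Note that parse_qs returns a list of values for each key and decodes each + to a space, so take the first element with [0]:
Ex:
import urllib.parse as urlparse
df['Campaign'] = df["full_utm"].apply(lambda x: urlparse.parse_qs(urlparse.urlparse(x).query)["utm_campaign"][0])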
print(df)
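A plain-Python sketch on the sample string from the question shows the difference between the two approaches: parse_qs gives the decoded value, while a regex over the raw string keeps the + signs that the desired output shows.

```python
import re
from urllib.parse import parse_qs, urlparse

full_utm = (
    "?utm_source=Facebook&utm_medium=CPC"
    "&utm_campaign=April+Merchants+LAL+-+All+SA+-+CAP+250&utm_content=01noprice"
)

# parse_qs returns {key: [values]} and turns '+' into spaces.
decoded = parse_qs(urlparse(full_utm).query)["utm_campaign"][0]
# The regex from the question, applied to the raw string, keeps the '+' signs.
raw = re.search(r"utm_campaign=([^&]*)", full_utm).group(1)

print(decoded)
print(raw)
```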
I need to convert the form data below to a slightly different format to be able to submit it correctly.
I have this form data:
PaReq:eJxdUt1ugjAYvfcpyB6AlvpTMLUJG1vmEp2Z7mKXpHRIVMBSBvr0a9FatAlJz/lO6en5PrLZCs6j
NWe14HTgOGTBqypOuZMls6cydrGHgwn2UOA/6bISrMIvfrzsFfrjosqKnHoudBEBBpryggu2jXNp
CEXF7Pg8X9JRgAIICbhCWz9wMY+oj/EYDyfwugi40FaWxwdOPyJnXRZCVgR02JZZUedSnKiPJgQY
YMu12NOtlOUUgKZp3N+ikGUsRbF3WeHWO0CAVphXgMdnkFWtiap/Y5sldBGFjf1Yuzzv0PL8evrc
pDMCtMLqk1hyiqCHoT/0HIimCE/HmICO78V10OapNxy5QaDiukBbL7WT8CbSmj7VS6QWgufMRGKQ
FfC2LHKuzqg+3vY9v7xidBg5VTcryqfGt4QeAyEv73c9Z1J1LwxZ+takbbhOfr6h9sjC65rpSehE
d4Yy1TXkQb9zlNkWEmD+r642A6n71A0vHRBwP9j/7TDLBQ==
TermUrl:https://www.footpatrol.co.uk/checkout/3d
MD:
Wanted format:
PaReq=eJxdUt1ugjAYvfcpyB6AlvpTMLUJG1vmEp2Z7mKXpHRIVMBSBvr0a9FatAlJz%2FlO6en5PrLZCs6j%0D%0ANWe14HTgOGTBqypOuZMls6cydrGHgwn2UOA%2F6bISrMIvfrzsFfrjosqKnHoudBEBBpryggu2jXNp%0D%0ACEXF7Pg8X9JRgAIICbhCWz9wMY%2Boj%2FEYDyfwugi40FaWxwdOPyJnXRZCVgR02JZZUedSnKiPJgQY%0D%0AYMu12NOtlOUUgKZp3N%2BikGUsRbF3WeHWO0CAVphXgMdnkFWtiap%2FY5sldBGFjf1Yuzzv0PL8evrc%0D%0ApDMCtMLqk1hyiqCHoT%2F0HIimCE%2FHmICO78V10OapNxy5QaDiukBbL7WT8CbSmj7VS6QWgufMRGKQ%0D%0AFfC2LHKuzqg%2B3vY9v7xidBg5VTcryqfGt4QeAyEv73c9Z1J1LwxZ%2BtakbbhOfr6h9sjC65rpSehE%0D%0Ad4Yy1TXkQb9zlNkWEmD%2Br642A6n71A0vHRBwP9j%2F7TDLBQ%3D%3D%0D%0A&TermUrl=https%3A%2F%2Fwww.footpatrol.co.uk%2Fcheckout%2F3d&MD=
I have tried this, but it seems to produce a different format from the one I need to submit correctly.
Code:
import urllib.parse
print(urllib.parse.quote_plus('''PaReq:eJxdUt1ugjAYvfcpyB6AlvpTMLUJG1vmEp2Z7mKXpHRIVMBSBvr0a9FatAlJz/lO6en5PrLZCs6j
NWe14HTgOGTBqypOuZMls6cydrGHgwn2UOA/6bISrMIvfrzsFfrjosqKnHoudBEBBpryggu2jXNp
CEXF7Pg8X9JRgAIICbhCWz9wMY+oj/EYDyfwugi40FaWxwdOPyJnXRZCVgR02JZZUedSnKiPJgQY
YMu12NOtlOUUgKZp3N+ikGUsRbF3WeHWO0CAVphXgMdnkFWtiap/Y5sldBGFjf1Yuzzv0PL8evrc
pDMCtMLqk1hyiqCHoT/0HIimCE/HmICO78V10OapNxy5QaDiukBbL7WT8CbSmj7VS6QWgufMRGKQ
FfC2LHKuzqg+3vY9v7xidBg5VTcryqfGt4QeAyEv73c9Z1J1LwxZ+takbbhOfr6h9sjC65rpSehE
d4Yy1TXkQb9zlNkWEmD+r642A6n71A0vHRBwP9j/7TDLBQ==
TermUrl:https://www.footpatrol.co.uk/checkout/3d
MD:'''))
Is this achievable with Python? And what do I need to do to get the wanted end result?
If your parameters are separated by newlines, you can use the splitlines method to get a list of parameters, and use re.split on each item to get a [name, value] list.
Then apply quote_plus to each name and value, '='.join them, and '&'.join all the parameters.
import urllib.parse
import re
data = '''PaReq:eJxdUt1ugjAYvfcpyB6AlvpTMLUJG1vmEp2Z7mKXpHRIVMBSBvr0a9FatAlJz/lO6en5PrLZCs6jNWe14HTgOGTBqypOuZMls6cydrGHgwn2UOA/6bISrMIvfrzsFfrjosqKnHoudBEBBpryggu2jXNpCEXF7Pg8X9JRgAIICbhCWz9wMY+oj/EYDyfwugi40FaWxwdOPyJnXRZCVgR02JZZUedSnKiPJgQYYMu12NOtlOUUgKZp3N+ikGUsRbF3WeHWO0CAVphXgMdnkFWtiap/Y5sldBGFjf1Yuzzv0PL8evrcpDMCtMLqk1hyiqCHoT/0HIimCE/HmICO78V10OapNxy5QaDiukBbL7WT8CbSmj7VS6QWgufMRGKQFfC2LHKuzqg+3vY9v7xidBg5VTcryqfGt4QeAyEv73c9Z1J1LwxZ+takbbhOfr6h9sjC65rpSehEd4Yy1TXkQb9zlNkWEmD+r642A6n71A0vHRBwP9j/7TDLBQ==
TermUrl:https://www.footpatrol.co.uk/checkout/3d
MD:'''
data = [re.split(':(?!//)', line) for line in data.splitlines()]
data = '&'.join('='.join(urllib.parse.quote_plus(i) for i in l) for l in data)
If your data is split by newlines arbitrarily, you could join the lines and split by name. Then zip names and values, quote and join.
data = ''.join(data.splitlines())
data = zip(['PaReq', 'TermUrl', 'MD'], re.split('PaReq:|TermUrl:|MD:', data)[1:])
data = '&'.join('='.join(urllib.parse.quote_plus(i) for i in l) for l in data)
If you want to keep the newline characters, use only the last two lines of the second code snippet.
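If the names and values are already available separately, urllib.parse.urlencode does the quoting and joining in one step (it uses quote_plus by default). The sketch below uses a short made-up stand-in for the PaReq value, ending in \r\n on the assumption that the %0D%0A sequences in the wanted format come from CRLF line breaks.

```python
from urllib.parse import urlencode

fields = {
    "PaReq": "ab/c+d==\r\n",  # short hypothetical stand-in for the real value
    "TermUrl": "https://www.footpatrol.co.uk/checkout/3d",
    "MD": "",
}

# Each name and value is passed through quote_plus, then joined with '=' and '&'.
body = urlencode(fields)
print(body)
```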
I have a variable that looks like this:
data = {"add_content": {"errata_ids": [advisory]},"content_view_version_environments": [{"content_view_version_id": version_id}]}
I need to wrap this variable in single quotes, i.e. once I assign the variables
advisory and version_id, data should look like this:
data = '{"add_content": {"errata_ids": ["RHSA-2017:1390"]},"content_view_version_environments": [{"content_view_version_id": 160}]}'
With that string I am able to post to the API.
I have tried to add the single quotes in a variety of ways:
new_data = "'" + str(data) + "'"
>>> new_data
'\'{\'add_content\': {\'errata_ids\': [\'"RHSA-2017:1390"\']}, \'content_view_version_environments\': [{\'content_view_version_id\': \'160\'}]}\''
or using:
'"%s"'%(data)
and a few more ways.
How can I add the single quotes around the data variable, before the opening { and after the closing }?
This is exactly what JSON does:
import json
new_data = json.dumps(data)
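A short sketch with the values from the question filled in. The result is a str, so the single quotes are only how Python displays it; the name payload is illustrative.

```python
import json

advisory = "RHSA-2017:1390"
version_id = 160

data = {
    "add_content": {"errata_ids": [advisory]},
    "content_view_version_environments": [{"content_view_version_id": version_id}],
}

# Serialize the dict to a JSON string that can be sent as the request body.
payload = json.dumps(data)
print(payload)
```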
If, contrary to the previous answers and comments, you are not trying to convert to a JSON string, then use str.format around the variables:
data = {"add_content": {"errata_ids": '[{}]'.format(advisory)},"content_view_version_environments": [{"content_view_version_id": '{}'.format(version_id)}]}