Is something like this possible? I'd like to use a dictionary or set as the key for my file renamer. I have a lot of keywords that I'd like to filter out of the file names, but the only way I've found to do it so far is to search by a single string, such as key720 = "720". This works correctly but creates bloat: I have to keep a version of the code at the bottom for each keyword I want to remove.
How do I get the list to work as keys in the search?
I tried to take the list and make it a string with:
str1 = ""
keyres = (str1.join(keys))
This was closer, but I think it makes one string of all the entries, and it didn't pick up any keywords.
So I've come to this at the moment:
keys = ["720p", "720", "1080p", "1080"]
for filename in os.listdir(dirName):
    if keys in filename:
        filepath = os.path.join(dirName, filename)
        newfilepath = os.path.join(dirName, filename.replace(keys, ""))
        os.rename(filepath, newfilepath)
Is there a way to maybe go by index and increment it one at a time? Would that allow the strings in the list to be used one at a time?
What I'm trying to do is take a file name and rename it by removing all occurrences of the keywords.
How about using Regular Expressions, specifically the sub function?
from re import sub
KEYS = ["720p", "720", "1080p", "1080"]
old_filename = "filename1080p.jpg"
new_filename = sub('|'.join(KEYS),'',old_filename)
print(new_filename)
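One detail worth handling when joining the keys into a pattern: alternation in a regex tries the alternatives left to right, so a shorter key like "720" can fire before "720p" if the list order changes. A small defensive sketch is to sort the keys longest-first before joining:

```python
from re import sub

KEYS = ["720p", "720", "1080p", "1080"]

# Sort the keys longest-first before joining, so that e.g. "1080p" is
# tried before "1080"; otherwise the shorter key could match first and
# leave a stray "p" behind in the file name.
pattern = '|'.join(sorted(KEYS, key=len, reverse=True))

print(sub(pattern, '', "filename1080p.jpg"))  # filename.jpg
print(sub(pattern, '', "movie720.mkv"))       # movie.mkv
```

With the keys listed as in the answer above the order already happens to be safe; sorting just makes that guarantee explicit regardless of how the list is edited later.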
I have a large csv file whose URLs I want to compare against the URLs in my txt file.
How do I group rows with the same name so I get unique results in Python? And is there a way to compare the two files faster? The csv file is at least 1 GB.
file1.csv
[01/Nov/2019:09:54:26 +0900] ","","102.12.14.22","34.122.104.106","200","CONNECT","http://www.amazon.com/asdd/asd/","555976","1508"
[01/Nov/2019:09:54:26 +0900] ","","164.16.37.75","52.222.194.116","200","CONNECT","http://www.google.com:443","555976","1508"
[01/Nov/2019:09:54:26 +0900] ","","167.27.14.62","34.122.104.106","200","CONNECT","http://www.amazon.com/asdd/asd/","555976","1508"
[01/Nov/2019:09:54:26 +0900] ","","192.10.77.95","21.323.12.96","200","CONNECT","http://www.wakers.com/sg/wew/","555976","1508"
[01/Nov/2019:09:54:26 +0900] ","","167.27.14.62","34.122.104.106","200","CONNECT","http://www.amazon.com/asdd/asd/","555976","1508"
[01/Nov/2019:09:54:26 +0900] ","","197.99.94.32","34.122.104.106","200","CONNECT","http://www.amazon.com/asdd/asd/","555976","1508"
[01/Nov/2019:09:54:26 +0900] ","","157.87.34.72","34.122.104.106","200","CONNECT","http://www.amazon.com/asdd/asd/","555976","1508"
file2.txt
1 www.amazon.com shop
1 wakers.com shop
script:
import csv

with open("file1.csv", 'r') as f:
    reader = csv.reader(f)
    for k in reader:
        ko = set()
        srcip = k[2]
        url = k[6]
        lines = url.replace(":443", "").replace(":8080", "")
        war = lines.split("//")[-1].split("/")[0].split('?')[0]
        ko.add((war, srcip))
        for to in ko:
            with open("file2.txt", "r") as f2:
                all_val = set()
                for i in f2:
                    val = i.strip().split(" ")[1]
                    if val in to[0]:
                        all_val.add(to)
                for ki in all_val:
                    print(ki)
my output:
('www.amazon.com', '102.12.14.22')
('www.amazon.com', '167.27.14.62')
('www.wakers.com', '192.10.77.95')
('www.amazon.com', '167.27.14.62')
('www.amazon.com', '197.99.94.32')
('www.amazon.com', '157.87.34.72')
If the url is the same, how do I group all of its unique values under one entry?
How do I get results like this?
amazon.com 102.12.14.22
167.27.14.62
197.99.94.32
157.87.34.72
wakers.com 192.10.77.95
Short answer: you can't directly do so. Well, you can, but with poor performance.
CSV is a good storage format, but if you want to do something like that you might want to store everything in another, custom data file. You could first parse your file so it holds only unique IDs instead of long strings (like amazon = 0, wakers = 1, and so on) to perform better and reduce comparison cost.
The thing is, those approaches are pretty awkward for a csv that changes; memory-mapping the file or building a database from your csv might also be a good fit (make the changes in the database and only dump the csv when you need it).
Look at How to quickly search through a .csv file in Python for a more complete answer.
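As a minimal sketch of the database idea, the stdlib sqlite3 module can take over the grouping and de-duplication. The two-column schema and the sample rows here are invented for illustration; in practice you would load them from the csv once, then query instead of re-reading the file:

```python
import sqlite3

# Invented schema for illustration: one row per (domain, source ip) hit.
conn = sqlite3.connect(':memory:')  # use a file path to persist between runs
conn.execute('CREATE TABLE hits (domain TEXT, ip TEXT)')
rows = [('amazon.com', '102.12.14.22'),
        ('amazon.com', '167.27.14.62'),
        ('amazon.com', '167.27.14.62'),   # duplicate hit
        ('wakers.com', '192.10.77.95')]
conn.executemany('INSERT INTO hits VALUES (?, ?)', rows)

# One query replaces the nested file re-reads: unique ips per domain.
for domain, ips in conn.execute(
        'SELECT domain, GROUP_CONCAT(DISTINCT ip) FROM hits GROUP BY domain'):
    print(domain, ips)
```

The point is that the 1 GB file gets parsed exactly once at load time; every later lookup or grouping is a single SQL query instead of another pass over the text.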
Problem solution
import csv
import re

def possible_urls(filename, category, category_position, url_position):
    # Read a txt file to create a list of domains that could correspond to shops
    domains = []
    with open(filename, "r") as file:
        file_content = file.read().splitlines()
    for line in file_content:
        info_in_line = line.split(" ")
        # Here I use a regular expression to parse the domain from the url.
        domain = re.sub('www.', '', info_in_line[url_position])
        if info_in_line[category_position] == category:
            domains.append(domain)
    return domains

def read_from_csv(filename, ip_position, url_position, possible_domains):
    # Create a dictionary that maps each domain to
    # all the ips that this domain can have.
    # The dictionary will look like this:
    # {domain_name: [list of possible ips]}
    domain_ip = {domain: [] for domain in possible_domains}
    with open(filename, 'r') as f:
        reader = csv.reader(f)
        for line in reader:
            # <= because positions are 0-based indices into the line
            if len(line) <= max(ip_position, url_position):
                print(f'Not enough items in line {line} to obtain url or ip')
                continue
            ip = line[ip_position]
            url = line[url_position]
            # Using a python regular expression to get the domain name
            # from the url.
            domain = re.search(r'//[w]?[w]?[w]?\.?(.[^/]*)[:|/]', url).group(1)
            if domain in domain_ip.keys():
                domain_ip[domain].append(ip)
    return domain_ip

def print_formatted_result(result):
    # Prints the formatted result
    for shop_domain in result.keys():
        print(f'{shop_domain}: ')
        for shop_ip in result[shop_domain]:
            print(f'    {shop_ip}')

def create_list_of_shops():
    # Function that first creates a list of possible domains, and
    # then reads the ips for those domains from the csv
    possible_domains = possible_urls('file2.txt', 'shop', 2, 1)
    shop_domains_with_ip = read_from_csv('file1.csv', 2, 6, possible_domains)
    # Display the result we got in the previous operations
    print(shop_domains_with_ip)
    print_formatted_result(shop_domains_with_ip)

create_list_of_shops()
Output
A dictionary of ips where the domains are keys, so you can get all possible ips for a domain by giving the name of that domain:
{'amazon.com': ['102.12.14.22', '167.27.14.62', '167.27.14.62', '197.99.94.32', '157.87.34.72'], 'wakers.com': ['192.10.77.95']}
amazon.com:
102.12.14.22
167.27.14.62
167.27.14.62
197.99.94.32
157.87.34.72
wakers.com:
192.10.77.95
Regular expressions
A very useful thing you can learn from the solution is regular expressions. Regular expressions are tools that let you filter or retrieve information from lines in a very convenient way. They also greatly reduce the amount of code, which makes the code more readable and safe.
Let's consider your code of removing ports from strings and think how we can replace it with regex.
lines = url.replace(":443", "").replace(":8080", "")
Replacing ports this way is fragile, because you can never be sure which port numbers will actually appear in a url. What if port number 5460 shows up, or port number 1022, etc.? For each such port you would add a new replace, and soon your code would look something like this:
lines = url.replace(":443", "").replace(":8080", "").replace(":5460", "").replace(":1022", "")...
Not very readable. But with a regular expression you can describe a pattern. And the great news is that we actually know the pattern for a url with a port number: they all look like
:some_digits. Since we know the pattern, we can describe it with a regular expression and tell python to find everything that matches it and replace it with the empty string '':
re.sub(r':\d+', '', url)
This tells the python regular expression engine:
Look for all digits in the string url that come after a : and replace them with the empty string. This solution is shorter, safer, and much more readable than the chain of replace calls, so I suggest reading about regular expressions a little. A great resource for learning them is
this site. Here you can test your regex.
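As a quick sanity check, the port-stripping pattern from above can be tried on a couple of sample urls in the question's format:

```python
import re

# Trying the pattern on sample urls: only the :port suffix is removed,
# the rest of the url is left untouched.
print(re.sub(r':\d+', '', 'http://www.google.com:443'))          # http://www.google.com
print(re.sub(r':\d+', '', 'http://example.com:8080/some/path'))  # http://example.com/some/path
```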
Explanation of Regular expressions in code
re.sub('www.', '', info_in_line[url_position])
Look for www. in the string info_in_line[url_position] and replace it with the empty string. (Note that the unescaped . matches any character, so strictly this pattern also matches things like wwwX; escaping it as www\. would be safer.)
re.search(r'//[w]?[w]?[w]?\.?(.[^/]*)[:|/]', url).group(1)
Let's split it into parts:
//[w]?[w]?[w]?\.? - the // of the url followed by an optional www. prefix
[^/] - here can be anything except /
(.[^/]*) - here I used a match group. It tells the engine which part of the match we are interested in.
[:|/] - the characters that can stand in that place. Long story short: after the capturing group there can be : or (|) /.
So, summarizing, the regex can be expressed in words as follows:
Find the substring that comes after // (and an optional www.), ends with : or /, and return everything that stands between them.
group(1) - means get the first match group.
Hope the answer is helpful!
If you used the URL as the key in a dictionary, and had your IP address sets as the elements of the dictionary, would that achieve what you intended?
my_dict = {
    'www.amazon.com': {
        '102.12.14.22',
        '167.27.14.62',
        '197.99.94.32',
        '157.87.34.72',
    },
    'www.wakers.com': {'192.10.77.95'},
}
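One way to build such a dictionary from the (url, ip) pairs the question's loop already produces is collections.defaultdict with set values, which absorbs duplicate IPs automatically. The pairs below are hardcoded samples from the question's output, just for illustration:

```python
from collections import defaultdict

# Sample (url, ip) pairs as produced by the question's csv loop.
pairs = [('www.amazon.com', '102.12.14.22'),
         ('www.amazon.com', '167.27.14.62'),
         ('www.amazon.com', '167.27.14.62'),  # duplicate, absorbed by the set
         ('www.wakers.com', '192.10.77.95')]

# defaultdict(set) creates an empty set the first time a url is seen,
# so no "is this key already present?" check is needed.
my_dict = defaultdict(set)
for url, ip in pairs:
    my_dict[url].add(ip)

print(dict(my_dict))
```

The duplicate amazon entry disappears without any extra de-duplication pass.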
## I have used your code & the csv module to get your desired output
## Copy-paste the code & execute to get the result
import csv

url_dict = {}

## STEP 1: Open file2.txt to get url names
with open("file2.txt", "r") as f:
    for i in f:
        val = i.strip().split(" ")[1]
        url_dict[val] = []

## STEP 2: 2.1 Open csv file 'file1.csv' to extract url name & ip address
##         2.2 Check if url from file2.txt is available in the extracted url from 'file1.csv'
##         2.3 Create a dictionary with the matched url & its ip address
##         2.4 Remove duplicates in ip addresses from same url
with open("file1.csv", 'r') as f:                             ## 2.1
    reader = csv.reader(f)
    for k in reader:
        srcip = k[2]
        url = k[6]
        lines = url.replace(":443", "").replace(":8080", "")
        war = lines.split("//")[-1].split("/")[0].split('?')[0]
        for key, value in url_dict.items():
            if key in war:                                    ## 2.2
                url_dict[key].append(srcip)                   ## 2.3

## 2.4
for key, value in url_dict.items():
    url_dict[key] = list(set(value))

## STEP 3: Print dictionary output to .TXT file
with open('output_text.txt', 'w') as file3:
    for key, value in url_dict.items():
        file3.write('\n' + key + '\n')
        for item in value:
            file3.write(' ' * 15 + item + '\n')
Firstly: Python 2.7.11.
Overview: I'm gathering the directory names in a given path and passing them into a subprocess cmd. From that subprocess I'm iterating over the output line by line; the directory name is the key and the subprocess stdout is the value.
What I need is to keep the key the same but save only the unique values, and add them to a dict so I can write to a csv later.
Snippet of code showing 2 methods I have already tried (one is commented out). Both overwrite the existing key:value in the dict.
data = []
for dname in listdir(path):
    header = dname
    if isfile:
        entrydict = dict()
        cmd = "ct lsh -fmt \"%u \\n\" -since 01-Oct-2015 -all " + dname
        # output of cmd is "name \r\n"
        p1 = subp.Popen(cmd, stdout=subp.PIPE, stderr=subp.PIPE)
        usr = []
        for name in iter(p1.stdout.readline, ''):
            if name.rstrip() not in usr:
                usr.append(name.rstrip())
            else:
                entrydict[header] = usr
                for n in usr:
                    entrydict[header] = n
        data.append(entrydict)
Thanks!
Yes, you could collect all of the unique values into a list like names = ['f0', 'f1', 'f2'] and then assign it to your dict with the header as the key, like
entrydict[header] = names
Just make sure that all of the headers are different.
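A minimal sketch of that suggestion, with made-up headers and name lists standing in for the directory listing and subprocess output: build the list of unique names first, then assign it to the dict once per header, after the inner loop finishes.

```python
# Made-up (header, raw output) pairs standing in for listdir + subprocess.
data = []
for header, raw_names in [('dirA', ['f0', 'f1', 'f0']),
                          ('dirB', ['f2'])]:
    usr = []
    for name in raw_names:
        if name not in usr:        # keep first occurrence, drop duplicates
            usr.append(name)
    data.append({header: usr})     # one assignment per header, after the loop

print(data)  # [{'dirA': ['f0', 'f1']}, {'dirB': ['f2']}]
```

The key difference from the question's code is that nothing assigns to entrydict[header] inside the per-name loop, so the list is never overwritten by a single name.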
I'm having trouble getting anything to write to my output file (word_count.txt).
I expect the script to review all 500 phrases in my phrases.txt document and output a list of all the words and how many times they appear.
from re import findall,sub
from os import listdir
from collections import Counter

# path to folder containing all the files
str_dir_folder = '../data'

# name and location of output file
str_output_file = '../data/word_count.txt'

# the list where all the words will be placed
list_file_data = '../data/phrases.txt'

# loop through all the files in the directory
for str_each_file in listdir(str_dir_folder):
    if str_each_file.endswith('data'):

        # open file and read
        with open(str_dir_folder+str_each_file,'r') as file_r_data:
            str_file_data = file_r_data.read()

        # add data to list
        list_file_data.append(str_file_data)

# clean all the data so that we don't have all the nasty bits in it
str_full_data = ' '.join(list_file_data)
str_clean1 = sub('t','',str_full_data)
str_clean_data = sub('n',' ',str_clean1)

# find all the words and put them into a list
list_all_words = findall('w+',str_clean_data)

# dictionary with all the times a word has been used
dict_word_count = Counter(list_all_words)

# put data in a list, ready for output file
list_output_data = []
for str_each_item in dict_word_count:
    str_word = str_each_item
    int_freq = dict_word_count[str_each_item]
    str_out_line = '"%s",%d' % (str_word,int_freq)

    # populates output list
    list_output_data.append(str_out_line)

# create output file, write data, close it
file_w_output = open(str_output_file,'w')
file_w_output.write('n'.join(list_output_data))
file_w_output.close()
Any help would be great (especially if I'm able to actually output 'single' words within the output list).
Thanks very much.
It would be helpful if we got more information, such as what you've tried and what sorts of error messages you received. As kaveh commented above, this code has some major indentation issues. Once I got around those, there were a number of other logic errors to work through. I've made some assumptions:
- list_file_data is assigned '../data/phrases.txt', but there is then a loop through all the files in a directory. Since you don't have any handling for multiple files elsewhere, I've removed that logic and referenced the file listed in list_file_data (and added a small bit of error handling). If you do want to walk through a directory, I'd suggest using os.walk() (http://www.tutorialspoint.com/python/os_walk.htm).
- You named your file 'phrases.txt' but then check whether the files end with 'data'. I've removed this logic.
- You've placed the data set into a list when findall works just fine with strings, and it already ignores the special characters that you've manually removed. Test at https://regex101.com/ to make sure.
- Changed 'w+' to '\w+' - check out the above link.
- Converting to a list outside of the output loop isn't necessary - your dict_word_count is a Counter object, which has an 'iteritems' method to roll through each key and value. Also changed the variable name to 'counter_word_count' to be slightly more accurate.
- Instead of manually generating csv's, I've imported csv and used the writerow method (and the quoting options).
Code below, hope this helps:
import csv
import os
from collections import Counter
from re import findall,sub

# name and location of output file
str_output_file = '../data/word_count.txt'

# the file where all the words are
list_file_data = '../data/phrases.txt'

if not os.path.exists(list_file_data):
    raise OSError('File {} does not exist.'.format(list_file_data))

with open(list_file_data, 'r') as file_r_data:
    str_file_data = file_r_data.read()

# find all the words and put them into a list
list_all_words = findall('\w+', str_file_data)

# dictionary with all the times a word has been used
counter_word_count = Counter(list_all_words)

with open(str_output_file, 'w') as output_file:
    fieldnames = ['word', 'freq']
    writer = csv.writer(output_file, quoting=csv.QUOTE_ALL)
    writer.writerow(fieldnames)
    for key, value in counter_word_count.iteritems():
        output_row = [key, value]
        writer.writerow(output_row)
Something like this?
from collections import Counter
from glob import glob

def extract_words_from_line(s):
    # make this as complicated as you want for extracting words from a line
    return s.strip().split()

tally = sum(
    (Counter(extract_words_from_line(line))
     for infile in glob('../data/*.data')
     for line in open(infile)),
    Counter())

for k in sorted(tally, key=tally.get, reverse=True):
    print k, tally[k]
my data is located in a .txt file (no, I can't change it to a different format) and it looks like this:
variablename = value
something = thisvalue
youget = the_idea
Here is my code so far (taken from the examples in Pyparsing):
from pyparsing import Word, alphas, alphanums, Literal, restOfLine, OneOrMore, \
    empty, Suppress, replaceWith

input = open("text.txt", "r")
src = input.read()

# simple grammar to match #define's
ident = Word(alphas + alphanums + "_")
macroDef = ident.setResultsName("name") + "= " + ident.setResultsName("value") + \
           Literal("#") + restOfLine.setResultsName("desc")

for t,s,e in macroDef.scanString(src):
    print t.name, "=", t.value
So how can I tell my script to edit a specific value for a specific variable?
Example:
I want to change the value of variablename, from value to new_value.
So essentially variable = (the data we want to edit).
I should probably make it clear that I don't want to go directly into the file and change value to new_value as text; I want to parse the data, find the variable, and then give it a new value.
Even though you have already selected another answer, let me answer your original question, which was how to do this using pyparsing.
If you are trying to make selective changes in some body of text, then transformString is a better choice than scanString (although scanString or searchString are fine for validating your grammar expression by looking for matching text). transformString will apply token suppression or parse action modifications to your input string as it scans through the text looking for matches.
from pyparsing import Word, alphanums, ParseException

# alphas + alphanums is unnecessary, since alphanums includes all alphas
ident = Word(alphanums + "_")

# I find this shorthand form of setResultsName is a little more readable
macroDef = ident("name") + "=" + ident("value")

# define values to be updated, and their new values
valuesToUpdate = {
    "variablename" : "new_value"
}

# define a parse action to apply value updates, and attach to macroDef
def updateSelectedDefinitions(tokens):
    if tokens.name in valuesToUpdate:
        newval = valuesToUpdate[tokens.name]
        return "%s = %s" % (tokens.name, newval)
    else:
        raise ParseException("no update defined for this definition")
macroDef.setParseAction(updateSelectedDefinitions)

# now let transformString do all the work!
print macroDef.transformString(src)
Gives:
variablename = new_value
something = thisvalue
youget = the_idea
For this task you do not need a special utility or module.
What you need is to read the lines and split each one into a list, so the first index is the left side and the second index is the right side.
If you need these values later, you might want to store them in a dictionary.
Well, here is a simple way, for somebody new to Python. Uncomment the lines with print to use them as debug output.
f = open("conf.txt","r")
txt = f.read()  # all text is in txt
f.close()

fwrite = open("modified.txt","w")

splitedlines = txt.splitlines()
#print splitedlines
for line in splitedlines:
    #print line
    conf = line.split('=')
    # conf[0] is what is on the left and conf[1] is what is on the right
    #print conf
    # strip() so "youget " (note the space before the =) still matches
    if conf[0].strip() == "youget":
        # we get this
        conf[1] = " the_super_idea"  # the_idea is now the_super_idea
    # join conf with '=' and write
    newline = '='.join(conf)
    #print newline
    fwrite.write(newline + "\n")
fwrite.close()
Actually, you should have a look at the ConfigParser module,
which parses exactly your syntax (you only need to add a [section] at the beginning).
If you insist on your implementation, you can create a dictionary:
dictt = {}
for t,s,e in macroDef.scanString(src):
    dictt[t.name] = t.value
dictt[variable] = new_value
ConfigParser
import ConfigParser

config = ConfigParser.RawConfigParser()
config.read('example.txt')
variablename = config.get('variablename', 'float')

It'll yell at you if you don't have a [section] header, but that's OK - you can fake one.
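A sketch of that faking trick, using Python 3's configparser (in Python 2 the module is ConfigParser, as above); the [fake] section name is invented purely to satisfy the parser:

```python
import configparser

# Sample data in the question's format; the [fake] header is invented
# just so configparser will accept the file contents.
raw = "variablename = value\nsomething = thisvalue\nyouget = the_idea\n"

config = configparser.ConfigParser()
config.read_string("[fake]\n" + raw)

print(config.get('fake', 'youget'))        # the_idea

# Updating a value works the same way: set it, then write the file back out.
config.set('fake', 'variablename', 'new_value')
print(config.get('fake', 'variablename'))  # new_value
```

When reading from a real file rather than a string, one approach is to read the file's text, prepend the fake header, and feed the result to read_string as shown.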