I need to search by part of a filename.
I have a column in my database that contains product names, and each product has a file with its description, so I need to add that file's contents to the same row in my database.
I want to add a new column called description and fill it with the contents of the file whose name matches the value in the name column. The names in the column and in the file differ, though: for example, the product is called cs12 in the database, while the file is named datasheet-cs12, guide-cs12, or something like that.
You will need to figure out how to:
1. get a list of files in a folder,
2. look for a substring in each element of that list.
Re 1.: https://stackoverflow.com/search?q=%5Bpython%5D+get+list+of+files+in+directory
Re 2.:
You have a logical problem: there might be multiple files that match any given string, so I don't think you can solve this fully automatically at all. What if you have two files datasheet_BAT54alternative.txt and info_BAT54A, and two rows containing the strings BAT54 and BAT54A? A "BAT54" is not the same as a "BAT54A". So you'll always have to deal with a list of candidates. If you're lucky, that list has only one entry:
def give_candidates(list_of_file_names, substring):
    return [fname for fname in list_of_file_names if substring.lower() in fname.lower()]
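A minimal sketch of how the two steps fit together (the folder name and product values here are hypothetical):

import os

file_names = os.listdir('descriptions')          # step 1: list the files
for product in ['cs12', 'cs13']:                 # e.g. values from the name column
    candidates = give_candidates(file_names, product)  # step 2: substring match
    if len(candidates) == 1:                     # unambiguous match
        with open(os.path.join('descriptions', candidates[0])) as f:
            description = f.read()               # contents for the new column
    else:
        print(product, 'is ambiguous or missing:', candidates)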
First, I am relatively new at programming; Python is the only language I have any familiarity with. Secondly, I put DB in the question because that's what seems right to me after searching around, but I am open to not using a DB at all if that's easier or more efficient.
What I Have to Work With
I have a folder with ~75,000 JSON files. They all have the same structure; here is an example of what they look like (more on that below):
{
  "id": 93480,
  "author": "",
  "joined by": [],
  "date_created": "2010-04-28T16:07:21Z",
  "date_modified": "2020-02-21T21:42:45.655644Z",
  "type": "010combined",
  "page_count": null,
  "plain_text": "",
  "html": "",
  "extracted_by_ocr": false,
  "cited": []
}
One way that the real files differ from the above is that either the "plain_text" or the "html" key will have an actual value, namely text (whether plaintext or HTML). The length of that text can vary from a couple of sentences to over 200 pages' worth. Thus, the JSON files range in size from 907 bytes at the smallest to 2.1 MB at the largest.
What I'm Trying to Do
I want to be able, essentially, to search through all the files for a word or phrase contained in either the plain_text or HTML fields and, at a minimum, return a list of files containing that word or phrase. [Ideally, I'd do other things with them, as well, but I can figure that stuff out later. What I'm stumped on is where to begin.]
What I Can't Figure Out
Whether to even bother with a document-store DB like MongoDB (or PostgreSQL). If that's the appropriate way to handle this, I'm open to working my way through it. But I can't even tell if that's how I should attack the problem, or if I should instead just use a Python script to iterate over the files in the folder directly. Can you populate a DB with all the files in a folder and then search for a substring in each row? The fact that some of these files have a ton of text in one of the values makes it seem weird to me to use a DB at all, but again: I don't know what I'm doing.
I think I know how to iterate over the files directly with Python. I know how to open files, and I know how to get a list of keys from JSON files. But how do you search for a matching substring in two JSON values? And then, if the substring is found in one of them, how do you return the "id" field to a list, close the file, and move to the next one? (I mean, obviously, the basic structure is a conditional.) Here's the logical structure of what I'm thinking:
Variable = "substring I want to match"
List = []  # will hold ids of files containing Variable
Open file
Read file to the end
Search file [or just the two JSON keys?] for Variable
If Variable found, append "id" to List
Close file
Move to the next file in the directory
It's the actual code part that I'm stumbling over.
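A minimal sketch of that outline (the folder path is hypothetical, and it assumes every file parses as JSON):

import json
import os

substring = "substring I want to match"
ids = []  # will hold the ids of files containing the substring

folder = "json_folder"  # hypothetical path to the ~75,000 files
for name in os.listdir(folder):
    with open(os.path.join(folder, name)) as f:  # file closes when the block exits
        doc = json.load(f)
    if substring in doc["plain_text"] or substring in doc["html"]:
        ids.append(doc["id"])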
An idea using pandas, since I don't know about search engines; some of this is copied from: How to read multiple json files into pandas dataframe?
import glob
import pandas as pd

file_list = glob.glob('json_folder/*.json')  # hypothetical path to the files

dfs = []  # an empty list to store the data frames
for file in file_list:
    data = pd.read_json(file, lines=True)  # read data frame from json file (lines=True expects one object per line)
    dfs.append(data)  # append the data frame to the list
temp = pd.concat(dfs, ignore_index=True)  # concatenate all the data frames in the list
Creating it will take forever but once that's done you can search and do operations quickly. E.g. if you want to find all id where author is not empty:
id_list = temp.loc[temp['author'] != '']['id'].tolist()
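And for the actual goal, a substring search over the two text fields (the phrase here is just an example):

phrase = "some word or phrase"
mask = (temp['plain_text'].str.contains(phrase, case=False, na=False)
        | temp['html'].str.contains(phrase, case=False, na=False))
matching_ids = temp.loc[mask, 'id'].tolist()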
If the combined size of all your files is gigantic, you may want to consult the docs on storing things more efficiently (https://pandas.pydata.org/pandas-docs/stable/user_guide/scale.html) or use another method.
I have many different strings (which are files) that look like this:
20201225_00_ec_op_2m_temp_24hdelta_argentinacorn_timeseries.nc
20201225_00_ec_op_2m_temp_chinawheat_timeseries.nc
20201225_00_ec_op_snowfall_romaniawheat_timeseries.nc
And many more. I want to be able to loop through all of these files and store their file path in a dictionary. To do that, I want the key to be the text that is between the last two instances of an underscore. For example, if this was the file 20201225_00_ec_op_2m_temp_24hdelta_argentinacorn_timeseries.nc, then the dict would be
{'argentinacorn': 'path/to/file/20201225_00_ec_op_2m_temp_24hdelta_argentinacorn_timeseries.nc'}
How can I loop through and do this pythonically?
You can use regexes to extract the key from the strings like this:
import re
input_string = "20201225_00_ec_op_2m_temp_24hdelta_argentinacorn_timeseries.nc"
dict_key = re.findall(".*_(.+)_[^_]+", input_string)[0]
gives
'argentinacorn'
Or with just a simple split:
dict_key = input_string.split("_")[-2]
Regarding file names, you can get the list from current working directory like this:
import os
file_names = os.listdir()
You can just loop through this list and apply the split/regex as shown above.
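Putting it together, a small sketch that builds the dictionary (the folder path is hypothetical):

import os

folder = 'path/to/files'  # hypothetical directory holding the .nc files
paths = {}
for name in os.listdir(folder):
    if name.endswith('.nc'):
        key = name.split('_')[-2]  # text between the last two underscores
        paths[key] = os.path.join(folder, name)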
A simple split and pick:
parts = "20201225_00_ec_op_2m_temp_24hdelta_argentinacorn_timeseries.nc".split("_")
key = parts[-2]  # parts[-2:-1] would give a one-element list, not the string
I am a beginner in Python and I have been working on code to access two types of files (dcd and inp files), combine them, and create a new list with the matching strings.
I got stuck near the beginning. I want to get all the dcd files here. They all have the .dcd extension, but the first part of each name differs, so I was wondering if there is a way to access them after I have split the string.
# collect all dcd files into a list
import glob

list1 = []
for filename1 in glob.glob('*/FEP_SYAF014*/FEP1/298/*/*.dcd'):
    filename1 = filename1.split('/')
    filename1.sort()
    list1.append(filename1)
I want to get only the names with the .dcd extension, which end up at index [5] after the split, and create a new list or mutate this one, but I am not sure how to do that.
P.S. I have only posted the first part of the code.
Thank you!
(Screenshots omitted: the oddly sorted output, one that looks better, and the desired result, sorted and without the eq* files.)
Just use sorted with a sort key of os.path.basename (it extracts only the basename of the file to perform the sort):
import os, glob
list1 = sorted(glob.glob('*/FEP_SYAF014*/FEP1/298/*/*.dcd'), key = os.path.basename)
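If you also want to drop the eq* files mentioned above, a variation of the same call might look like this:

import os, glob

list1 = sorted(
    (p for p in glob.glob('*/FEP_SYAF014*/FEP1/298/*/*.dcd')
     if not os.path.basename(p).startswith('eq')),  # skip the eq* files
    key=os.path.basename)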
So this worked. I just added del filename1[:5] to get rid of the other, unnecessary path parts:
import os, glob

list1 = []
for filename1 in sorted(glob.glob('*/FEP_SYAF014*/FEP1/298/*/*.dcd'), key=os.path.basename):
    filename1 = filename1.split('/')
    filename1.sort()
    list1.append(filename1)
    del filename1[:5]  # removes the first 5 path parts, leaving just the file name (also inside list1)
    print(filename1)
Your sort function is applied to the parts of a file name. This is not what you want: if I understand well, you want to sort the list of filenames, not the parts of each filename.
The code given by Jean François is great but I guess you'd like to get your own code working.
You need to extract the file name by using only the last part of the split
A split returns a list of strings. Each item is a part of the original.
filename = filename.split('/')[-1]
This line will get you the last part of the split
Then you can add that part to your list
And after all that you can sort your list
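Put together, a sketch of those steps using the same glob pattern from above:

import glob

list1 = []
for path in glob.glob('*/FEP_SYAF014*/FEP1/298/*/*.dcd'):
    list1.append(path.split('/')[-1])  # keep only the last part: the file name
list1.sort()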
Hope this helps!
Issue
The code does not correctly identify the input (item). It simply falls through to my failure message even when a matching value exists in the CSV file. Can anyone help me determine what I am doing wrong?
Background
I am working on a small program that asks for user input (function not given here), searches a specific column in a CSV file (Item) and returns the entire row. The CSV data format is shown below. I have shortened the data from the actual amount (49 field names, 18000+ rows).
Code
import csv
from collections import namedtuple
from contextlib import closing

def search():
    item = 1000001
    raw_data = 'active_sanitized.csv'
    failure = 'No matching item could be found with that item code. Please try again.'
    check = False
    with closing(open(raw_data, newline='')) as open_data:
        read_data = csv.DictReader(open_data, delimiter=';')
        item_data = namedtuple('item_data', read_data.fieldnames)
        while check == False:
            for row in map(item_data._make, read_data):
                if row.Item == item:
                    return row
                else:
                    return failure
CSV structure
active_sanitized.csv
Item;Name;Cost;Qty;Price;Description
1000001;Name here:1;1001;1;11;Item description here:1
1000002;Name here:2;1002;2;22;Item description here:2
1000003;Name here:3;1003;3;33;Item description here:3
1000004;Name here:4;1004;4;44;Item description here:4
1000005;Name here:5;1005;5;55;Item description here:5
1000006;Name here:6;1006;6;66;Item description here:6
1000007;Name here:7;1007;7;77;Item description here:7
1000008;Name here:8;1008;8;88;Item description here:8
1000009;Name here:9;1009;9;99;Item description here:9
Notes
My experience with Python is relatively little, but I thought this would be a good problem to start with in order to learn more.
I worked out how to open the CSV file (wrapped so that it closes), read the data via DictReader (to get the field names), and then create a namedtuple to be able to quickly select the desired columns for the output (Item, Cost, Price, Name). Column order is important, hence the use of DictReader and namedtuple.
While there is the possibility of hard-coding each of the field names, I felt that if the program can read them on file open, it would be much more helpful when working on similar files that have the same column names but different column organization.
Research
CSV Header and named tuple:
What is the pythonic way to read CSV file data as rows of namedtuples?
Converting CSV data to tuple: How to split a CSV row so row[0] is the name and any remaining items are a tuple?
There were additional links of research, but I cannot post more than two.
You have three problems with this:
1. You return on the first failure, so it will never get past the first line.
2. You are reading strings from the file and comparing them to an int.
3. _make iterates over the dictionary keys, not the values, producing the wrong result (item_data(Item='Name', Name='Price', Cost='Qty', Qty='Item', Price='Cost', Description='Description')).
for row in (item_data(**data) for data in read_data):
    if row.Item == str(item):
        return row
return failure
This fixes the issues at hand - we check against a string, and we only return if none of the items matched (although you might want to begin converting the strings to ints in the data rather than this hackish fix for the string/int issue).
I have also changed the way you are looping - using a generator expression makes for a more natural syntax, using the normal construction syntax for named attributes from a dict. This is cleaner and more readable than using _make and map(). It also fixes problem 3.
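If you would rather fix the type than the comparison, a minimal variation along the lines suggested above:

for row in (item_data(**data) for data in read_data):
    if int(row.Item) == item:  # convert the CSV string once, keep item an int
        return row
return failure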
I need to write a program like this:
Write a program that reads the file .picasa.ini and copies the pictures into new files whose names are the identification numbers of the persons in those pictures (e.g. 8ff985a43603dbf8.jpg). If there are several persons in a picture, it makes several copies. If a person appears in several pictures, later copies override earlier ones; e.g. if person 8ff985a43603dbf8 appears in several pictures, only one file with this name will exist. You may presume that we have a simple .picasa.ini file.
I have a .picasa.ini that contains:
[img_8538.jpg]
faces=rect64(4ac022d1820c8624),**d5a2d2f6f0d7ccbc**
backuphash=46512
[img_8551.jpg]
faces=rect64(acb64583d1eb84cb),**2623af3d8cb8e040**;rect64(58bf441388df9592),**d85d127e5c45cdc2**
backuphash=8108
...
Is this a good way to start this program?
for line in open(r'C:\Users\Admin\Desktop\podatki-picasa\.picasa.ini'):  # raw string: \U would break otherwise
    if line.startswith('faces'):
        line.split()  # what must I do here to split the bolded words?
Is there a better way to do this? Remember that the .jpg file must be created with a new name, so I think I should pair the current .jpg file with the bolded id.
Consider using ConfigParser. Then you will have to split each value by hand, as you describe.
import configparser  # was ConfigParser in Python 2

config = configparser.ConfigParser()
config.read(r'C:\Users\Admin\Desktop\podatki-picasa\.picasa.ini')

imgs = []
for item in config.sections():
    imgs.append(config.get(item, 'faces'))
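Splitting one faces value by hand could then look like this (the value is copied from the example file above):

value = 'rect64(acb64583d1eb84cb),2623af3d8cb8e040;rect64(58bf441388df9592),d85d127e5c45cdc2'
persons = [chunk.split(',')[1] for chunk in value.split(';')]
# persons == ['2623af3d8cb8e040', 'd85d127e5c45cdc2']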
This is still work in progress; I just want to ask if it's correct.
Edit:
I still don't know how to split the bolded words out of there. This split function really is a pain for me.
Suggestions:
Your lines don't start with 'faces', so your second line won't work the way you want it to. Depending on how the rest of the file looks, you might only need to check whether the line is empty or not at that point.
To get the information you need, first split at ',' and work from there
Attempt at a solution: the elements you need seem to always have a ',' before them, so you can start by splitting at the ',' sign and taking everything from the 1-index element onwards ([1:]). Then, if I am thinking correctly, you split those elements twice more: at the ';', taking the 0-index element, and at ' ', again taking the 0-index element.
for line in open('thingy.ini'):
    if line != "\n":
        personelements = line.split(",")[1:]
        for person in personelements:
            personstring = person.split(";")[0].split(" ")[0].strip()  # strip the trailing newline
            print(personstring)
works for me to get:
d5a2d2f6f0d7ccbc
2623af3d8cb8e040
d85d127e5c45cdc2
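To finish the actual assignment (copying each picture once per person id), a sketch building on the same parsing, assuming the images sit next to the .ini file and the lines look like the example above:

import shutil

current_img = None
for line in open('thingy.ini'):
    line = line.strip()
    if line.startswith('[') and line.endswith(']'):
        current_img = line[1:-1]                          # e.g. img_8538.jpg
    elif line.startswith('faces='):
        for person in line.split(',')[1:]:
            person_id = person.split(';')[0]
            shutil.copy(current_img, person_id + '.jpg')  # later copies override earlier ones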