I have to read a file that always has the same format.
Since I know the format is fixed, I can call readline() and tokenize each line. But I suspect there is a way to read it that is, how to put it, "prettier to the eyes".
The file I have to read has this format :
Nom NMS-01
UDPport 2019
TCPport 9129
I just want a different way to read it without having to tokenize, if that is possible.
Your question seems to imply that "tokenizing" is some kind of mysterious and complicated process. But in fact, the thing you are trying to do is exactly tokenizing.
Here is a perfectly valid way to read the file you show, break it up into tokens, and store it in a data structure:
def read_file_data(data_file_path):
    result = {}
    with open(data_file_path) as data_file:
        for line in data_file:
            # Drop the trailing newline, then split on the first space only
            key, value = line.rstrip('\n').split(' ', maxsplit=1)
            result[key] = value
    return result
That wasn't complicated, it wasn't a lot of code, it doesn't need a third-party library, and it's easy to work with:
data = read_file_data('path/to/file')
print(data['Nom']) # prints "NMS-01"
Now, this implementation makes many assumptions about the structure of the file. Among other things, it assumes:
The entire file is structured as key/value pairs
Each key/value pair fits on a single line
Every line in the file is a key/value pair (no comments or blank lines)
The key cannot contain space characters
The value cannot contain newline characters
The same key does not appear multiple times in the file (or, if it does, it is acceptable for the last value given to be the only one returned)
Some of these assumptions may be false, but they are all true for the data sample you provided.
More generally: if you want to parse some kind of structured data, you need to understand the structure of the data and how values are delimited from each other. That's why common structured data formats like XML, JSON, and YAML (among many others!) were invented. Once you know the language you are parsing, tokenization is simply the code you write to match up the language with the text of your input.
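If some of those assumptions stop holding, the same idea can be loosened a little. The following is only a sketch: the function name `parse_key_values` and the `#` comment convention are assumptions for illustration, not something the sample file shows.

```python
def parse_key_values(lines, comment_prefix="#"):
    """Parse 'key value' lines, skipping blank lines and comment lines."""
    result = {}
    for raw_line in lines:
        line = raw_line.strip()
        # Assumption: blank lines and lines starting with '#' carry no data
        if not line or line.startswith(comment_prefix):
            continue
        # partition() splits on the first space and never raises,
        # even when a line has a key but no value
        key, _, value = line.partition(" ")
        result[key] = value.strip()
    return result


sample = [
    "Nom NMS-01\n",
    "\n",
    "# hypothetical comment line\n",
    "UDPport 2019\n",
    "TCPport 9129\n",
]
settings = parse_key_values(sample)
print(settings["Nom"], int(settings["UDPport"]), int(settings["TCPport"]))
```

Values come back as strings, so the port numbers are converted with `int()` at the point of use.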
Pandas does many magical things, so maybe that is prettier for you?
import pandas as pd
pd.read_csv('input.txt', sep=' ', header=None, index_col=0)
This gives you a dataframe that you can manipulate further:
              1
0
Nom      NMS-01
UDPport    2019
TCPport    9129
As the title mentions, my issue is that I don't quite understand how to extract the data I need for my table. (The columns I need are Date, Time, Courtroom, File Number, Defendant Name, Attorney, Bond, Charge, etc.)
I think regex is what I need, but my class did not go over it, so I am confused about how to parse the file in order to extract the correct data and output it as an organized table.
I am supposed to turn my text file from this
and export it into a more readable format like this (example output is below).
Here is what I have so far.
def readFile(court):
    csv_rows = []
    # read and split txt file into pages & chunks of data by paragraph
    with open(court, "r") as file:
        data_chunks = file.read().split("\n\n")
        for chunk in data_chunks:
            chunk = chunk.strip  # .strip removes useless spaces
            if str(data_chunks[:4]).isnumeric():  # if first 4 characters are digits
                entry = None  # initialize an empty dictionary
            elif (
                str(data_chunks).isspace() and entry
            ):  # if we're on an empty line and the entry dict is not empty
                csv_rows.DictWriter(dialect="excel")  # turn csv_rows into needed output
                entry = {}
            # parse here?
    return csv_rows
It is quite a lot of work to achieve that, but it is possible if you split it into a couple of sub-tasks.
First, your input looks like a text file, so you could parse it line by line using readlines(): https://www.w3schools.com/python/ref_file_readlines.asp
Then, I noticed that your data can be split into pages. You will need to prepare a lot of regular expressions, but you can start with one that identifies where each page starts. You may want to read this first, as your expressions might get quite complicated: https://www.w3schools.com/python/python_regex.asp
The goal of this step is to collect all the lines of a page in some container (a list, a dict, whatever you find suitable).
Afterwards, write some code that parses the information page by page. For simplicity, I suggest starting with something easy, like the columns for "no", "file number" and "defendant".
Once you are getting data in a reliable manner, you can address the export part using pandas: https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.to_excel.html
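The page-splitting step might look roughly like the sketch below. The real page header is not shown in the question, so the `PAGE <number>` pattern and the sample lines are purely hypothetical; only the grouping logic is the point.

```python
import re


def split_into_pages(lines, page_start_pattern):
    """Group lines into pages; a new page begins at each line matching the pattern."""
    page_start = re.compile(page_start_pattern)
    pages = []
    for line in lines:
        # Start a new page on a header line, or for leading lines before any header
        if page_start.search(line) or not pages:
            pages.append([])
        pages[-1].append(line.rstrip("\n"))
    return pages


# Hypothetical input: the actual header text has to be taken from the real file
sample_lines = ["PAGE 1\n", "first entry\n", "second entry\n", "PAGE 2\n", "third entry\n"]
pages = split_into_pages(sample_lines, r"^PAGE \d+")
for number, page in enumerate(pages, start=1):
    print(number, page)
```

Each page is then a plain list of lines that the per-column regular expressions can be run against one page at a time.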
I have a json file, from which I'm extracting quotes. It's the file from Kaggle (formatted exactly the same way).
My goal is to extract all the quotes (just the quotes, not the authors or other metadata) into a simple text document. The first 5 lines would be:
# Don't cry because it's over, smile because it happened.
# I'm selfish, impatient and a little insecure. I make mistakes, I am out of control and at times hard to handle. But if you can't handle me at my worst, then you sure as hell don't deserve me at my best.
# Be yourself; everyone else is already taken.
# Two things are infinite: the universe and human stupidity; and I'm not sure about the universe.
# Be who you are and say what you feel, because those who mind don't matter, and those who matter don't mind.
The challenge is that some quotes repeat and I only want to write each quote once. What's a good way to only write down unique values into a text doc?
The best I came up with was this:
import json
with open('quotes.json', 'r') as json_f:
    data = json.load(json_f)
quote_list = []
with open('quotes.txt', 'w') as text_f:
    for quote_object in data:
        quote = quote_object['Quote']
        if quote not in quote_list:
            quote_list.append(quote)
            text_f.write(quote + '\n')
But it feels grossly inefficient to have to create and maintain a separate list with 40,000 values.
I tried reading the file on each iteration of the write function, but somehow read always comes back empty:
with open('quotes.json', 'r') as json_f:
    data = json.load(json_f)
with open('quotes.txt', 'w+') as text_f:
    for quote_object in data:
        quote = quote_object['Quote']
        print(text_f.read())  # prints nothing?
        # if it can't read the doc, I can't check if quote already there
Would love to understand why text_f.read() comes back empty, and what's a more elegant solution.
You can use a set:
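import json
with open('quotes.json', 'r') as json_f:
    data = json.load(json_f)
quotes = set()
with open('quotes.txt', 'w') as text_f:
    for quote_object in data:
        quote = quote_object['Quote']
        if quote not in quotes:  # membership tests on a set are fast
            quotes.add(quote)
            text_f.write(quote + '\n')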
Adding the same quote to the set multiple times will have no effect: only a single object is kept!
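A plain set does not remember insertion order, so if the set is built first and written out afterwards, the quotes may not come out in file order. One possible alternative, sketched here with a hypothetical helper name `unique_quotes`, relies on `dict.fromkeys`, which drops duplicates while keeping the first-seen order:

```python
def unique_quotes(records, key="Quote"):
    """Return each quote once, in the order it first appears."""
    # dict keys are unique and preserve insertion order (Python 3.7+)
    return list(dict.fromkeys(record[key] for record in records))


# Small stand-in for the loaded JSON data
data = [
    {"Quote": "Be yourself; everyone else is already taken.", "Author": "a"},
    {"Quote": "Don't cry because it's over, smile because it happened.", "Author": "b"},
    {"Quote": "Be yourself; everyone else is already taken.", "Author": "c"},
]
for quote in unique_quotes(data):
    print(quote)
```

As for the empty `read()`: opening a file with 'w+' truncates it, and `read()` starts from the current file position, which sits at the end of whatever has just been written. Calling `text_f.seek(0)` before reading would return the contents, but re-reading the file on every iteration would be far slower than a set or dict lookup.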
I have a folder with 300+ .txt files with total size of 15GB+. These files contain tweets. Each line is a different tweet. I have a list of keywords I'd like to search the tweets for. I have created a script that searches each line of every file for every item on my list. If the tweet contains the keyword, then it writes the line into another file. This is my code:
# Search each file for every item in keywords
print("Searching the files of " + filename + " for the appropriate keywords...")
for file in os.listdir(file_path):
    f = open(file_path + file, 'r')
    for line in f:
        for key in keywords:
            if re.search(key, line, re.IGNORECASE):
This is the format each line has:
{"created_at":"Wed Feb 03 06:53:42 +0000 2016","id":694775753754316801,"id_str":"694775753754316801","text":"me with Dibyabhumi Multiple College students https:\/\/t.co\/MqmDwbCDAF","source":"\u003ca href=\"http:\/\/www.facebook.com\/twitter\" rel=\"nofollow\"\u003eFacebook\u003c\/a\u003e","truncated":false,"in_reply_to_status_id":null,"in_reply_to_status_id_str":null,"in_reply_to_user_id":null,"in_reply_to_user_id_str":null,"in_reply_to_screen_name":null,"user":{"id":5981342,"id_str":"5981342","name":"Lava Kafle","screen_name":"lkafle","location":"Kathmandu, Nepal","url":"http:\/\/about.me\/lavakafle","description":"#deerwalkinc 24000+ tweeps bigdata #Team #Genomics http:\/\/deerwalk.com #Genetic #Testing #population #health #management #BigData #Analytics #java #hadoop","protected":false,"verified":false,"followers_count":24742,"friends_count":23169,"listed_count":1481,"favourites_count":147252,"statuses_count":171880,"created_at":"Sat May 12 04:49:14 +0000 2007","utc_offset":20700,"time_zone":"Kathmandu","geo_enabled":true,"lang":"en","contributors_enabled":false,"is_translator":false,"profile_background_color":"EDECE9","profile_background_image_url":"http:\/\/abs.twimg.com\/images\/themes\/theme3\/bg.gif","profile_background_image_url_https":"https:\/\/abs.twimg.com\/images\/themes\/theme3\/bg.gif","profile_background_tile":false,"profile_link_color":"088253","profile_sidebar_border_color":"FFFFFF","profile_sidebar_fill_color":"E3E2DE","profile_text_color":"634047","profile_use_background_image":true,"profile_image_url":"http:\/\/pbs.twimg.com\/profile_images\/677805092859420672\/kzoS-GZ__normal.jpg","profile_image_url_https":"https:\/\/pbs.twimg.com\/profile_images\/677805092859420672\/kzoS-GZ__normal.jpg","profile_banner_url":"https:\/\/pbs.twimg.com\/profile_banners\/5981342\/1416802075","default_profile":false,"default_profile_image":false,"following":null,"follow_request_sent":null,"notifications":null},"geo":null,"coordinates":null,"place":null,"contributors":null,"is_quote_status":false,"retweet_count":0,"favorite_count":0,"entities":{"hashtags":[],"urls":[{"url":"https:\/\/t.co\/MqmDwbCDAF","expanded_url":"http:\/\/fb.me\/Yj1JW9bJ","display_url":"fb.me\/Yj1JW9bJ","indices":[45,68]}],"user_mentions":[],"symbols":[]},"favorited":false,"retweeted":false,"possibly_sensitive":false,"filter_level":"low","lang":"en","timestamp_ms":"1454482422661"}
The script works but it takes a lot of time. For ~40 keywords it needs more than 2 hours. Obviously my code is not optimized. What can I do to improve the speed?
p.s. I have read some relevant questions regarding searching and speed but I suspect that the problem in my script lies in the fact that I'm using a list for the keywords. I've tried some of the suggested solutions but to no avail.
1) External library
If you're willing to lean on external libraries (and time to execute is more important than the one-off time cost to install), you might be able to gain some speed by loading each file into a simple Pandas DataFrame and performing the keyword search as a vector operation. To get the matching tweets, you would do something like:
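import pandas as pd
# Each line of the file is one JSON object, so read it as JSON Lines rather than CSV
dataframe_from_text = pd.read_json("/path/to/file.txt", lines=True)
# .str methods live on a Series (one column); contains() searches anywhere in the text
matched_tweets_index = dataframe_from_text["text"].str.contains("keyword_a|keyword_b", case=False, na=False)
matched_tweets = dataframe_from_text[matched_tweets_index]  # Uses the boolean search above to filter the full dataframe
# You'd then have a mini dataframe of matching tweets in `matched_tweets`.
# You could loop through these to save them out to a file using the `.to_dict(orient="records")` format.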
Dataframe operations within Pandas can be really quick so might be worth investigating.
2) Group your regex
Looks like you're not logging which keyword you matched against. If this is true, you could group your keywords into a single regex query like so:
keywords_combined = "|".join(keywords)  # build the combined pattern once, outside the loop
for line in f:
    if re.search(keywords_combined, line, re.IGNORECASE):
I've not tested this but by reducing the number of loops per line, that could trim some time off.
Why it's slow
You are regex searching through a JSON dump, which is not always a good idea. For example, if your keywords include words like user, time, profile and image, every line will match, because the JSON format for tweets has all these terms as dictionary keys.
Besides, the raw JSON is huge: each tweet is more than 1 KB in size (this one is 2.1 KB), but the only part that's relevant in your sample is:
"text":"me with Dibyabhumi Multiple College students https:\/\/t.co\/MqmDwbCDAF",
And this is less than 100 bytes; a typical tweet is still less than 140 characters despite recent changes to the API.
Things to try:
Precompile the regex, as suggested by Padraic Cunningham.
Option 1. Load this data into a PostgreSQL JSONB field. JSONB fields are indexable and can be searched very quickly.
Option 2. Load this into any old database, with the content of the text field in its own column so that this column can be searched easily.
Option 3. Last but not least, extract just the text field into its own file. You can have a CSV file where the first column is the screen name and the second is the text of the tweet. Your 15GB will shrink to about 1GB.
In short, what you are doing now is searching the whole farm for the needle when you only need to search the haystack.
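To illustrate the "search only the text field" idea together with a precompiled pattern, here is a rough sketch. The helper names `build_keyword_pattern` and `matching_lines` are made up for this example, and `re.escape` assumes the keywords are literal words rather than regular expressions. Parsing every line as JSON has its own cost, so whether this beats a raw regex scan on 15GB would need measuring.

```python
import json
import re


def build_keyword_pattern(keywords):
    """Compile all keywords into one case-insensitive pattern, built once."""
    # Assumption: keywords are literal text, hence re.escape
    return re.compile("|".join(re.escape(k) for k in keywords), re.IGNORECASE)


def matching_lines(lines, pattern):
    """Yield the raw lines whose tweet text (not the JSON keys) matches."""
    for line in lines:
        try:
            tweet = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip lines that are not valid JSON
        if not isinstance(tweet, dict):
            continue
        if pattern.search(tweet.get("text") or ""):
            yield line


sample = [
    '{"text": "Big Data rocks", "user": {"name": "x"}}',
    '{"text": "hello", "profile": "data"}',
    "not json",
]
pattern = build_keyword_pattern(["data", "hadoop"])
hits = list(matching_lines(sample, pattern))
print(len(hits))
```

The second sample line contains "data" only outside the text field, so it is not reported, which is the false-positive case described above.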
I've been learning Python for the last couple of days in order to complete a data extraction. I'm not getting anywhere and hope one of you lovely people can advise.
I need to extract the data that follows these labels: RESP, CRESP, RTTime and RT.
Here's a snippet as an example of the mess I have to deal with.
Level: 4
*** LogFrame Start ***
Procedure: ActProcScenarios
No: 1
Line1: It is almost time for your town's spring festival. A friend of yours is
Line2: on the committee and asks if you would be prepared to help out with the
Line3: barbecue in the park. There is a large barn for use if it rains.
Line4: You hope that on that day it will be
pfrag: s-n-y
pword: sunny
pletter: u
Quest: Does the town have an autumn festival?
Correct: {LEFTARROW}
ScenarioListPract: 1
Topic: practice
Subtheme: practice
ActPracScenarios: 1
Running: ActPracScenarios
ActPracScenarios.Cycle: 1
ActPracScenarios.Sample: 1
DisplayFragInstr.OnsetDelay: 17
DisplayFragInstr.OnsetTime: 98031
DisplayFragInstr.DurationError: -999999
DisplayFragInstr.RTTime: 103886
DisplayFragInstr.ACC: 0
DisplayFragInstr.RT: 5855
DisplayFragInstr.RESP: {DOWNARROW}
FragInput.OnsetDelay: 13
FragInput.OnsetTime: 103899
FragInput.DurationError: -999999
FragInput.RTTime: 104998
I think regular expressions would be the right tool here because the \b word boundary anchors allow you to make sure that RESP only matches a whole word RESP and not just part of a longer word (like CRESP).
Something like this should get you started:
>>> import re
>>> for line in myfile:
...     match = re.search(r"\b(RT|RTTime|RESP|CRESP): (.*)", line)
...     if match:
...         print("Matched {0} with value {1}".format(match.group(1),
...                                                   match.group(2)))
Matched RTTime with value 103886
Matched RT with value 5855
Matched RESP with value {DOWNARROW}
Matched CRESP with value
Matched RTTime with value 104998
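If the values need to be kept together per trial rather than printed, one option is to start a new record at every "*** LogFrame Start ***" line. This is only a sketch based on the snippet in the question: the function name `extract_frames` is hypothetical, and it assumes every wanted line has the form `Prefix.FIELD: value`.

```python
WANTED_FIELDS = {"RESP", "CRESP", "RTTime", "RT"}
FRAME_START = "*** LogFrame Start ***"


def extract_frames(lines):
    """Return one dict per LogFrame, holding only the wanted fields."""
    frames = []
    current = None
    for raw_line in lines:
        line = raw_line.strip()
        if line == FRAME_START:
            current = {}
            frames.append(current)
            continue
        key, sep, value = line.partition(":")
        # Ignore lines before the first frame and lines without a colon
        if current is None or not sep:
            continue
        key = key.strip()
        # Compare only the part after the last dot, so RESP never matches CRESP
        if key.rsplit(".", 1)[-1] in WANTED_FIELDS:
            current[key] = value.strip()
    return frames


sample = [
    "Level: 4",
    "*** LogFrame Start ***",
    "Procedure: ActProcScenarios",
    "DisplayFragInstr.RTTime: 103886",
    "DisplayFragInstr.RT: 5855",
    "DisplayFragInstr.RESP: {DOWNARROW}",
    "FragInput.RTTime: 104998",
]
frames = extract_frames(sample)
print(frames)
```

Each dict can then be written out as one row, for example with `csv.DictWriter`.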
Transform it to a dict first, then just get items from the dict as you wish:
d = {k.strip(): v.strip() for (k, v) in
     [line.split(':', 1) for line in s.split('\n') if ':' in line]}
print (d['DisplayFragInstr.RESP'], d['DisplayFragInstr.CRESP'],
d['DisplayFragInstr.RTTime'], d['DisplayFragInstr.RT'])
>>> ('{DOWNARROW}', '', '103886', '5855')
I think you may be making things harder for yourself than needed. E-Prime has a file format called .edat that is designed for the purpose you are describing. An .edat file contains the same information as the .txt file, but in a way that makes extracting variables easier. I personally only use the type of text file you have posted here as a form of data storage redundancy.
If you are doing things this way because you do not have a software key, it might help to know that the E-Merge and E-DataAid programs for E-Prime don't require a key. You only need the key for editing build files. Whoever provided you with the .txt files should probably have an install disk for these programs. If not, it is available on the PST website (I believe you need a serial code to create an account, but I'm not certain).
E-Prime generally creates an .edat file that matches the content of the text file you have posted an example of. Sometimes, though, if E-Prime crashes, you don't get the .edat file and only have the .txt. Luckily, you can generate the .edat file from the .txt file.
Here's how I would approach this issue: if you do not have the .edat files available, first use E-DataAid to recover the files.
Then, presuming you have multiple participants, you can use E-Merge to merge the .edat files of all participants who completed this task.
Open the merged file. It might look a little chaotic depending on how much you have in the file. Go to Tools->Arrange Columns; this will show a list of all your variables. Adjust it so that only the desired variables are in the right-hand box. Hit OK.
Looking at the file you posted, it says level 4 at the top, so I'm guessing there are a lot of procedures in this experiment. If you have many procedures in the program, you might at this point have lines that just have startup info and NULL in the locations where your variables of interest are. You can fix this by going to Tools->Filter and creating a filter to eliminate those lines. Sometimes, depending on the file structure, you might also end up with duplicate lines of the same data. You can fix this with filtering as well.
You can then export this file as a CSV.
import re
import pprint

def parse_logs(file_name):
    with open(file_name, "r") as f:
        lines = [line.strip() for line in f.readlines()]
    # \b keeps "RESP" from also matching "CRESP"; the space after the
    # colon is optional because strip() removes it when the value is empty
    base_regex = r'^.*\b{0}: ?(.*)$'
    match_terms = ["RESP", "CRESP", "RTTime", "RT"]
    regexes = {term: base_regex.format(term) for term in match_terms}
    output_list = []
    for line in lines:
        for key, regex in regexes.items():
            match = re.match(regex, line)
            if match:
                match_tuple = (key, match.groups()[0])
                output_list.append(match_tuple)  # collect the (field, value) pair
    return output_list
Edit: Tim and Guy's answers are both better. I was in a hurry to write something and missed two much more elegant solutions.
The code does not correctly identify the input (item). It simply dumps to my failure message even if such a value exists in the CSV file. Can anyone help me determine what I am doing wrong?
I am working on a small program that asks for user input (function not given here), searches a specific column in a CSV file (Item) and returns the entire row. The CSV data format is shown below. I have shortened the data from the actual amount (49 field names, 18000+ rows).
import csv
from collections import namedtuple
from contextlib import closing

def search():
    item = 1000001
    raw_data = 'active_sanitized.csv'
    failure = 'No matching item could be found with that item code. Please try again.'
    check = False
    with closing(open(raw_data, newline='')) as open_data:
        read_data = csv.DictReader(open_data, delimiter=';')
        item_data = namedtuple('item_data', read_data.fieldnames)
        while check == False:
            for row in map(item_data._make, read_data):
                if row.Item == item:
                    return row
                return failure
CSV structure
1000001;Name here:1;1001;1;11;Item description here:1
1000002;Name here:2;1002;2;22;Item description here:2
1000003;Name here:3;1003;3;33;Item description here:3
1000004;Name here:4;1004;4;44;Item description here:4
1000005;Name here:5;1005;5;55;Item description here:5
1000006;Name here:6;1006;6;66;Item description here:6
1000007;Name here:7;1007;7;77;Item description here:7
1000008;Name here:8;1008;8;88;Item description here:8
1000009;Name here:9;1009;9;99;Item description here:9
My experience with Python is relatively little, but I thought this would be a good problem to start with in order to learn more.
I determined the methods to open (and wrap in a close function) the CSV file, read the data via DictReader (to get the field names), and then create a named tuple to be able to quickly select the desired columns for the output (Item, Cost, Price, Name). Column order is important, hence the use of DictReader and namedtuple.
While there is the possibility of hard-coding each of the field names, I felt that if the program can read them on file open, it would be much more helpful when working on similar files that have the same column names but different column organization.
CSV Header and named tuple:
What is the pythonic way to read CSV file data as rows of namedtuples?
Converting CSV data to tuple: How to split a CSV row so row[0] is the name and any remaining items are a tuple?
There were additional links of research, but I cannot post more than two.
You have three problems with this:
You return on the first failure, so it will never get past the first line.
You are reading strings from the file, and comparing to an int.
_make iterates over the dictionary keys, not the values, producing the wrong result (item_data(Item='Name', Name='Price', Cost='Qty', Qty='Item', Price='Cost', Description='Description')).
            for row in (item_data(**data) for data in read_data):
                if row.Item == str(item):
                    return row
            return failure
This fixes the issues at hand - we check against a string, and we only return if none of the items matched (although you might want to begin converting the strings to ints in the data rather than this hackish fix for the string/int issue).
I have also changed the way you are looping - using a generator expression makes for a more natural syntax, using the normal construction syntax for named attributes from a dict. This is cleaner and more readable than using _make and map(). It also fixes problem 3.
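Put together, a self-contained version of the search could look like the sketch below. The header row is an assumption pieced together from the field names mentioned above (Item, Name, Cost, Qty, Price, Description), since the sample data in the question omits it; `find_item` is a hypothetical name, and it takes an open file object so it can be tried on an in-memory sample.

```python
import csv
import io
from collections import namedtuple


def find_item(csv_file, item_code):
    """Return the first row whose Item column equals item_code, or None."""
    reader = csv.DictReader(csv_file, delimiter=";")
    ItemData = namedtuple("ItemData", reader.fieldnames)
    for data in reader:
        row = ItemData(**data)
        # CSV values are strings, so compare against the string form
        if row.Item == str(item_code):
            return row
    return None  # only reached after every row has been checked


# Assumed header row; the data rows are taken from the question
SAMPLE_CSV = (
    "Item;Name;Cost;Qty;Price;Description\n"
    "1000001;Name here:1;1001;1;11;Item description here:1\n"
    "1000002;Name here:2;1002;2;22;Item description here:2\n"
)
print(find_item(io.StringIO(SAMPLE_CSV), 1000002))
```

Returning `None` instead of the failure string leaves it to the caller to decide what message to show.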