python required fields from input file [duplicate]

This question already has answers here:
Parse key value pairs in a text file
(7 answers)
Closed 2 years ago.
I have an input file I am using for a python script.
Example of a file is here:
Name: Joe
Surname: Doe
Country: DE
Gender:
Can anybody suggest how to parse the file and make sure that all required info is supplied?
I am trying to avoid if/else statements and implement this in a more efficient way.
Here is what I do, but I am sure there is a better way:
for line in file_content:
    if re.match(r'Name\d+:\s+(\w+)', line, re.IGNORECASE):
        file_validation['name'] = True
    elif re.match(r'Surname:\s+(\w+)', line, re.IGNORECASE):
        file_validation['surname'] = True
    ...
Any suggestions?

Something like this:
>>> re.match(r'^(.+)\s*:\s*(.*)$', 'Surname: Doe').groups()
('Surname', 'Doe')
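Applied to the whole file, a minimal sketch of the same idea (assuming the lines are already in file_content, as in the question, and that an empty value should count as missing):
import re

required = {'name', 'surname', 'country', 'gender'}
supplied = {}

for line in file_content:
    m = re.match(r'^(.+)\s*:\s*(.*)$', line)
    if m:
        key, value = m.group(1).strip().lower(), m.group(2).strip()
        supplied[key] = bool(value)   # True only if the value part is non-empty

missing = [k for k in required if not supplied.get(k)]
if missing:
    print("Missing or empty fields:", ", ".join(sorted(missing)))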

Firstly, you should parse using regex and construct a dict from the file. The regex we'll be using is-
^(\w+):\s+(\w+)$
This will only match lines that have both a key and a value, so it will not match Gender: since its value is empty.
Now we just have to construct a corresponding dictionary
import re

# File contents
content = '''Name: Joe
Surname: Doe
Country: DE
Gender:
'''
data = {k:v for k, v in re.findall(r'(\w+):\s+(\w+)', content, re.M)}
Now if you look at data, it should look like-
>>> data
{'Name': 'Joe', 'Surname': 'Doe', 'Country': 'DE'}
Now all you have to do is verify that all the required fields exist in data.keys().
Initialize the required fields
required_fields = {'Name', 'Surname', 'Country', 'Gender'}
Check whether required_fields is a subset of data.keys() if you want to allow extra keys in the input, or use == if you want data.keys() to contain exactly the required keys.
>>> set.issubset(required_fields, set(data.keys()))
False
>>> data.keys() == required_fields
False
Let's try the same thing with valid data-
# File contents
content = '''Name: Joe
Surname: Doe
Country: DE
Gender: Male'''
required_fields = {'Name', 'Surname', 'Country', 'Gender'}
data = {k:v for k, v in re.findall(r'(\w+):\s+(\w+)', content, re.M)}
print(data.keys() == required_fields) # True
print(set.issubset(required_fields, set(data.keys()))) # True
Output-
True
True
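If you also want to report which fields are missing rather than just getting True/False, the set difference gives that directly (a small sketch building on the data and required_fields above):
missing = required_fields - set(data.keys())
if missing:
    print("Missing or empty fields:", ", ".join(sorted(missing)))
else:
    print("All required fields supplied")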

I would suggest using csv.reader because of its simplicity:
import csv

fields_to_validate = ["name", "surname", "country", "gender"]

with open('data.csv') as csvfile:
    csv_reader = csv.reader(csvfile, delimiter=':')
    for row in csv_reader:
        field_key = row[0].lower()
        field_value = row[1].strip()
        print("\n{} {}".format(field_key, field_value))
        if field_key in fields_to_validate and field_value:
            print("{} validated correctly!".format(field_key))
        else:
            print("{} NOT validated correctly.".format(field_key))
Output
name Joe
name validated correctly!
surname Doe
surname validated correctly!
country DE
country validated correctly!
gender
gender NOT validated correctly.
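One caveat worth noting: this flags empty values on lines that are present, but a required field whose line is missing from the file entirely is never reported. A small extension of the same approach (a sketch, reusing fields_to_validate and the data.csv name from above) could track which keys were actually seen:
import csv

fields_to_validate = ["name", "surname", "country", "gender"]
seen = set()

with open('data.csv') as csvfile:
    for row in csv.reader(csvfile, delimiter=':'):
        if len(row) >= 2 and row[1].strip():
            seen.add(row[0].strip().lower())

missing = [f for f in fields_to_validate if f not in seen]
if missing:
    print("Missing or empty fields: {}".format(", ".join(missing)))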

Related

How can I alter the value in a specific column of a certain row in python without the use of pandas?

I was playing around with the code provided here: https://www.geeksforgeeks.org/update-column-value-of-csv-in-python/ and couldn't seem to figure out how to change the value in a specific column of the row without it bringing up an error.
Say I wanted to change the status of the row belonging to the name Molly Singh, how would I go about it? I've tried the following below only to get an error and the CSV file turning out empty. I'd also prefer the solution be without the use of pandas tysm.
For example the row in the csv file will originally be
Sno Registration Number Name RollNo Status
1 11913907 Molly Singh RK19TSA01 P
What I want the outcome to be
Sno Registration Number Name RollNo Status
1 11913907 Molly Singh RK19TSA01 N
One more question: if I were to alter the value in the Sno column by doing addition/subtraction etc., how would I go about that as well? Thanks!
Here is my code; the error I get, as you can see, is that the Name column is changed to True then False etc.
import csv

op = open("AllDetails.csv", "r")
dt = csv.DictReader(op)
print(dt)
up_dt = []
for r in dt:
    print(r)
    row = {'Sno': r['Sno'],
           'Registration Number': r['Registration Number'],
           'Name'== "Molly Singh": r['Name'],
           'RollNo': r['RollNo'],
           'Status': 'P'}
    up_dt.append(row)
print(up_dt)
op.close()
op = open("AllDetails.csv", "w", newline='')
headers = ['Sno', 'Registration Number', 'Name', 'RollNo', 'Status']
data = csv.DictWriter(op, delimiter=',', fieldnames=headers)
data.writerow(dict((heads, heads) for heads in headers))
data.writerows(up_dt)
op.close()
Issues
Your error is because the field name in the input file is misspelled as Regristation rather than Registration.
The correction is to simply read the header names from the input file and propagate them to the output file, as shown below.
Alternatively, you can change your code to:
headers = ['Sno', 'Regristation Number', 'Name', 'RollNo', 'Status']
"One more question if I were to alter the value in column snow by doing addition/substraction etc how would I go about that as well"
I'm not sure what is meant by this. In the code below you would just have:
r['Sno'] = (some compute value)
Code
import csv

with open("AllDetails.csv", "r") as op:
    dt = csv.DictReader(op)
    headers = None
    up_dt = []
    for r in dt:
        # get header of input file
        if headers is None:
            headers = r
        # Change status of 'Molly Singh' record
        if r['Name'] == 'Molly Singh':
            r['Status'] = 'N'
        up_dt.append(r)

with open("AllDetails.csv", "w", newline='') as op:
    # Use headers from input file above
    data = csv.DictWriter(op, delimiter=',', fieldnames=headers)
    data.writerow(dict((heads, heads) for heads in headers))
    data.writerows(up_dt)
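As for the follow-up question about computing a new Sno: values read by csv come back as strings, so convert before doing arithmetic and convert back when storing. A minimal sketch of what could go inside the for r in dt: loop above (assuming Sno always holds an integer):
# inside the `for r in dt:` loop; assumes Sno is always an integer
r['Sno'] = str(int(r['Sno']) + 1)   # e.g. add 1, then store it back as text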

Python - Replacing a specific value in a CSV file while keeping the rest

So I have a CSV file that looks something like this:
Username,Password,Name,DOB,Fav Artist,Fav Genre
Den1994,Denis1994,Denis,01/02/1994,Eminem,Pop
Joh1997,John1997,John,03/04/1997,Daft Punk,House
What I need to be able to do is let the user edit and change their Fav Artist and Fav Genre so that their new values are saved to the file in place of the old ones. I'm not very advanced when it comes to CSV, so I'm not sure where to begin with it; any help and pointers will be greatly appreciated.
Thanks guys.
EDIT:
Adding the code I have so far so it doesn't seem like I'm just trying to get an easy way out of this; I'm generally not sure what to do after this bit:
def editProfile():
    username = globalUsername
    file = open("users.csv", "r")
    for line in file:
        field = line.split(",")
        storedUsername = field[0]
        favArtist = field[4]
        favGenre = field[5]
        if username == storedUsername:
            print("Your current favourite artist is:", favArtist, "\n" +
                  "Your current favourite genre is:", favGenre, "\n")
            wantNewFavArtist = input("If you want to change your favourite artist type in Y, if not N: ")
            wantNewFavGenre = input("If you want to change your favourite genre type in Y, if not N: ")
            if wantNewFavArtist == "Y":
                newFavArtist = input("Type in your new favourite artist: ")
            if wantNewFavGenre == "Y":
                newFavGenre = input("Type in your new favourite genre: ")
This is how it would look using pandas:
import pandas as pd
from io import StringIO
# Things you'll get from a user
globalUsername = "Den1994"
field = 'Fav Artist'
new_value = 'Linkin Park'
# Things you'll probably get from a data file
data = """
Username,Password,Name,DOB,Fav Artist,Fav Genre
Den1994,Denis1994,Denis,01/02/1994,Eminem,Pop
Joh1997,John1997,John,03/04/1997,Daft Punk,House
"""
# Load your data (e.g. from a CSV file)
df = pd.read_csv(StringIO(data)).set_index('Username')
print(df)
# Now change something
df.loc[globalUsername][field] = new_value
print(df)
Here df.loc[] allows you to access a row by the index. In this case Username is set as index. Then, [field] selects the column in that row.
Also, consider this:
df.loc[globalUsername][['Fav Artist', 'Fav Genre']] = 'Linkin Park', 'Nu Metal'
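A side note: chained indexing such as df.loc[user][col] = value can operate on a temporary copy in some pandas versions (the SettingWithCopyWarning case). Passing both labels to a single .loc call is the more robust spelling; a sketch of the equivalent assignments:
# single .loc call with row label and column label(s)
df.loc[globalUsername, field] = new_value
df.loc[globalUsername, ['Fav Artist', 'Fav Genre']] = ['Linkin Park', 'Nu Metal']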
In case you have a my-data.csv file you can load it with:
df = pd.read_csv('my-data.csv')
The code above will return
Password Name DOB Fav Artist Fav Genre
Username
Den1994 Denis1994 Denis 01/02/1994 Eminem Pop
Joh1997 John1997 John 03/04/1997 Daft Punk House
and
Password Name DOB Fav Artist Fav Genre
Username
Den1994 Denis1994 Denis 01/02/1994 Linkin Park Pop
Joh1997 John1997 John 03/04/1997 Daft Punk House
Try this
import pandas as pd
data = pd.read_csv("old_file.csv")
data.loc[data.Username=='Den1994',['Fav Artist','Fav Genre']] = ['Beyonce','Hard rock']
data.to_csv('new_file.csv',index=False)
Python has a built-in module for dealing with CSV; there are examples in the docs that will guide you right.
One way to do it is to use the csv module to get the file you have into a list of lists; then you can edit the individual lists (rows) and just rewrite to disk what you have in memory.
Good luck.
PS: in the code that you have posted, there is no assignment to the "CSV in memory" based on the user input.
A minimal example without the file handling could be:
fake = 'abcdefghijkl'
csv = [list(fake[i:i+3]) for i in range(0, len(fake), 3)]
print(csv)
for row in csv:
    if row[0] == 'd':
        row[0] = 'changed'
print(csv)
The file handling is easy to get from the docs, and a pandas dependency is avoided if that is on the wishlist; a sketch of it follows below.
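A sketch of that file handling (assumptions: the users.csv layout from the question, and globalUsername, newFavArtist, newFavGenre coming from the posted editProfile code):
import csv

# read the whole file into a list of lists
with open("users.csv", "r", newline="") as f:
    rows = list(csv.reader(f))

# edit the matching row in memory
for row in rows:
    if row and row[0] == globalUsername:
        row[4] = newFavArtist   # Fav Artist column
        row[5] = newFavGenre    # Fav Genre column

# write everything back to disk
with open("users.csv", "w", newline="") as f:
    csv.writer(f).writerows(rows)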

Split string, unicode, unicode, string in python

I was trying to split a combination of string and unicode in Python. The split has to be made on the ResultSet object retrieved from a website. Using the code below, I am able to get the details (they are user details):
from bs4 import BeautifulSoup
import urllib2
import re

url = "http://www.mouthshut.com/vinay_beriwal"
profile_user = urllib2.urlopen(url)
profile_soup = BeautifulSoup(profile_user.read())
usr_dtls = profile_soup.find("div", id=re.compile("_divAboutMe")).find_all('p')
for dt in usr_dtls:
    usr_dtls = " ".join(dt.text.split())
    print(usr_dtls)
The output is as below:
i love yellow..
Name: Vinay Beriwal
Age: 39 years
Hometown: New Delhi, India
Country: India
Member since: Feb 11, 2016
What I need is to create 5 distinct variables, Name, Age, Hometown, Country and Member since, and store the corresponding value after the ':' for each.
Thanks
You can use a dictionary to store name-value pairs. For example:
my_dict = {"Name":"Vinay","Age":21}
In my_dict, Name and Age are the keys of the dictionary; you can access values like this:
print (my_dict["Name"]) #This will print Vinay
Also, it's better to use complete words for variable names.
results = profile_soup.find("div", id=re.compile("_divAboutMe")).find_all('p')
user_data = {}  # dictionary initialization
for result in results:
    result = " ".join(result.text.split())
    try:
        var, value = result.strip().split(':')
        user_data[var.strip()] = value.strip()
    except:
        pass

# If you print the user_data now
print (user_data)
'''
This is what it'll print
{'Age': ' 39 years', 'Country': ' India', 'Hometown': 'New Delhi, India', 'Name': 'Vinay Beriwal', 'Member since': 'Feb 11, 2016'}
'''
You can use a dictionary to store your data:
my_dict = {}
for dt in usr_dtls:
    item = " ".join(dt.text.split())
    try:
        if ':' in item:
            k, v = item.split(':')
            my_dict[k.strip()] = v.strip()
    except:
        pass
Note: You should not reassign usr_dtls inside your for loop, because that would override your original usr_dtls.

Python: split string with delimiters from a list

I'd like to split a string with delimiters which are in a list.
The string has this pattern: Firstname, Lastname Email
The list of delimiters contains this: [', ',' '], taken out of the pattern.
I'd like to split the string to get a list like this
['Firstname', 'Lastname', 'Email']
For a better understanding of my problem, this is what I'm trying to achieve:
The user shall be able to provide a source pattern: %Fn%, %Ln% %Mail% of data to be imported
and a target pattern how the data shall be displayed:
%Ln%%Fn%; %Ln%, %Fn; %Mail%
This is my attempt:
data = "Firstname, Lastname Email"
for delimiter in source_pattern_delimiter:
prog = re.compile(delimiter)
data_tuple = prog.split(data)
How do I 'merge' the data_tuple list(s)?
import re
re.split(re.compile("|".join([", ", " "])), "Firstname, Lastname Email")
hope it helps
Seems you want something like this,
>> s = "Firstname, Lastname Email"
>>> delim = [', ',' ']
>>> re.split(r'(?:' + '|'.join(delim) + r')', s)
['Firstname', 'Lastname', 'Email']
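If the delimiters can contain regex metacharacters (a '.', '|', '+' and so on), escaping them before joining keeps the pattern literal; a small sketch:
import re

delim = [', ', ' ']
pattern = '|'.join(map(re.escape, delim))   # escape each delimiter so it is matched literally
print(re.split(pattern, "Firstname, Lastname Email"))
# ['Firstname', 'Lastname', 'Email']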
A solution without regexes and if you want to apply a particular delimiter at a particular position:
def split(s, delimiters):
    for d in delimiters:
        item, s = s.split(d, 1)
        yield item
    else:
        yield s
>>> list(split("Firstname, Lastname Email", [", ", " "]))
["Firstname", "Lastname", "Email"]
What about splitting on spaces, then removing any trailing commas?
>>> data = "Firstname, Lastname Email"
>>> [s.rstrip(',') for s in data.split(' ')]
['Firstname', 'Lastname', 'Email']
You are asking for a template-based way to reconstruct the split data. The following script could give you an idea of how to proceed. It first splits the data into the three parts and assigns each to a dictionary entry. This can then be used to fill in a target pattern:
import re
data = "Firstname, Lastname Email"
# Find a list of entries and display them
entries = re.findall("(\w+)", data)
print entries
# Convert the entries into a dictionary
dEntries = {"Fn": entries[0], "Ln": entries[1], "Mail": entries[2]}
# Use dictionary-based string formatting to provide a template system
print "%(Ln)s%(Fn)s; %(Ln)s, %(Fn)s; %(Mail)s" % dEntries
This displays the following:
['Firstname', 'Lastname', 'Email']
LastnameFirstname; Lastname, Firstname; Email
If you really need to use the exact template system you have provided then the following could be done to first convert your target pattern into one suitable for use with Python's dictionary system:
def display_with_template(data, target_pattern):
    entries = re.findall("(\w+)", data)
    dEntries = {"Fn": entries[0], "Ln": entries[1], "Mail": entries[2]}
    for item in ["Fn", "Ln", "Mail"]:
        target_pattern = target_pattern.replace("%%%s%%" % item, "%%(%s)s" % item)
    return target_pattern % dEntries

print display_with_template("Firstname, Lastname Email", r"%Ln%%Fn%; %Ln%, %Fn%; %Mail%")
Which would display the same result, but uses a custom target pattern:
LastnameFirstname; Lastname, Firstname; Email

Are there any tools that can build an object from text directly, like Google protocol buffers?

In most log processing systems, the log file is a tab-separated text file, and the schema of the file is provided separately.
For example:
12 tom tom#baidu.com
3 jim jim#baidu.com
the schema is
id : uint64
name : string
email : string
In order to find records like person.name == 'tom', the code is:
for each_line in sys.stdin:
    fields = each_line.strip().split('\t')
    if fields[1] == 'tom': # magic number
        print each_line
There are a lot of magic numbers (1, 2, 3).
Are there some tools like Google protocol buffers (which is for binary data), so we can build the object from text directly?
Message Person {
    uint64 id = 1;
    string name = 2;
    string email = 3;
}
So we can then build a person like this: person = lib.BuildFromText(line)
for each_line in sys.stdin:
    person = lib.BuildFromText(each_line) # no magic number
    if person.name == 'tom':
        print each_line
import csv

Person = {
    'id': int,
    'name': str,
    'email': str
}

persons = []
for row in csv.reader(open('CSV_FILE_NAME', 'r'), delimiter='\t'):
    persons.append({item[0]: item[1](row[index]) for index, item in enumerate(Person.items())})
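One thing to be aware of: this relies on Person.items() coming back in the declared column order, which is only guaranteed for plain dicts on Python 3.7+ (and CPython 3.6). On older interpreters, collections.OrderedDict makes the ordering explicit; a sketch:
from collections import OrderedDict

# same schema, but with a guaranteed iteration order on any Python version
Person = OrderedDict([
    ('id', int),
    ('name', str),
    ('email', str),
])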
How is the lib.BuildFromText() function supposed to know how to name the fields? They are just values in the line you pass to it, right? Here is how to do it in Python:
import sys
from collections import namedtuple

Person = namedtuple('Person', 'id, name, email')

for each_line in sys.stdin:
    person = Person._make(each_line.strip().split('\t'))
    if person.name == 'tom':
        print each_line
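If the fields should also carry their schema types (so person.id is an int rather than a string), a small extension of the same idea could pair the split values with per-field converters; the build_from_text helper below is hypothetical, not part of any library:
import sys
from collections import namedtuple

Person = namedtuple('Person', 'id, name, email')
converters = (int, str, str)   # matches the schema: uint64, string, string

def build_from_text(line):
    # hypothetical helper: split on tabs and apply each field's converter
    values = line.rstrip('\n').split('\t')
    return Person._make(conv(v) for conv, v in zip(converters, values))

for each_line in sys.stdin:
    person = build_from_text(each_line)
    if person.name == 'tom':
        print(each_line)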
