How do I transform a non-CSV text file into a CSV using Python/Pandas? - python

I have a text file that looks like this:
Id Number: 12345678
Location: 1234561791234567090-8.9
Street: 999 Street AVE
Buyer: john doe
Id Number: 12345688
Location: 3582561791254567090-8.9
Street: 123 Street AVE
Buyer: Jane doe # buyer % LLC
Id Number: 12345689
Location: 8542561791254567090-8.9
Street: 854 Street AVE
Buyer: Jake and Bob: Owner%LLC: Inc
I'd like the file to look like this:
Id Number | Location | Street | Buyer
12345678 | 1234561791234567090-8.9 | 999 Street AVE | john doe
12345688 | 3582561791254567090-8.9 | 123 Street AVE | Jane doe # buyer % LLC
12345689 | 8542561791254567090-8.9 | 854 Street AVE | Jake and Bob: Owner%LLC: Inc
I have tried the following:
# 1 Read text file and ignore bad lines (lines with extra colons thus reading as extra fields).
tr = pd.read_csv('C:\\File Path\\test.txt', sep=':', header=None, error_bad_lines=False)
# 2 Convert into a dataframe/pivot table.
ndf = pd.DataFrame(tr.pivot(index=None, columns=0, values=1))
# 3 Clean up the pivot table to remove NaNs and reset the index (line by line).
nf2 = ndf.apply(lambda x: x.dropna().reset_index(drop=True))
Here is where I got the last line (#3): https://stackoverflow.com/a/62481057/10448224
When I do the above and export to CSV the headers are arranged like the following:
(index) | Street | Buyer | Id Number | Location
The data is filled in nicely, but at some point the Buyer field becomes inaccurate, while the rest of the fields stay accurate through the entire DF.
My guesses:
When I run #1 part of my script I get the following errors 507 times:
b'Skipping line 500: expected 2 fields, saw 3\nSkipping line 728: expected 2 fields, saw 3\
At the tail end of the new DF I am missing exactly 507 entries for the Buyer field. So I think that when the bad lines are dropped, the remaining Buyer values get pushed up.
Pain Points:
The Buyer field will sometimes have extra colons and other odd characters. So when I try to use a colon as a delimiter I run into problems.
I am new to Python and I am very new to using functions. I primarily use Pandas to manipulate data at a somewhat basic level. So in the words of the great Michael Scott: "Explain it to me like I'm five." Many many thanks to anyone willing to help.

Here's what I meant by reading the file in and using split. It is very similar to the other answers. Untested, and I don't recall whether inputline includes the end-of-line character, so I strip it too.
import pandas as pd
with open('myfile.txt') as f:
    data = []    # holds all completed records
    record = {}  # holds the record being built up
    for inputline in f:
        # maxsplit=1: only the first colon separates key from value
        key, value = inputline.strip().split(':', 1)
        if key == "Id Number":  # new record starting
            if len(record):
                data.append(record)  # save previous record
                record = {}
        record.update({key: value.strip()})
    if len(record):
        data.append(record)  # save final record
df = pd.DataFrame(data)
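To see why the second argument to split matters for the Buyer lines, here is a small sketch using one line from the question. The variable names are only for illustration.

```python
line = "Buyer: Jake and Bob: Owner%LLC: Inc"

# Splitting on every colon breaks the buyer name into pieces.
naive = line.split(":")

# maxsplit=1 splits on the first colon only, so the value stays whole.
limited = line.split(":", 1)
key, value = limited[0].strip(), limited[1].strip()

print(len(naive))   # number of pieces from the naive split
print(key, "->", value)
```

This is the same reason pd.read_csv with sep=':' reports "expected 2 fields, saw 3" on those lines: every extra colon is treated as another field boundary.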

This is a minimal example that demonstrates the basics:
cat split_test.txt
Id Number: 12345678
Location: 1234561791234567090-8.9
Street: 999 Street AVE
Buyer: john doe
Id Number: 12345688
Location: 3582561791254567090-8.9
Street: 123 Street AVE
Buyer: Jane doe # buyer % LLC
Id Number: 12345689
Location: 8542561791254567090-8.9
Street: 854 Street AVE
Buyer: Jake and Bob: Owner%LLC: Inc
import csv
with open("split_test.txt", "r") as f:
    id_val = "Id Number"
    list_var = []
    for line in f:
        # maxsplit=1 keeps colons inside the value (as in Buyer) intact
        split_line = line.strip().split(':', 1)
        print(split_line)  # debug: show each parsed line
        if split_line[0] == id_val:
            # an Id Number line starts a new record
            d = {}
            d[split_line[0]] = split_line[1]
            list_var.append(d)
        else:
            d.update({split_line[0]: split_line[1]})
list_var
[{'Id Number': ' 12345678',
  'Location': ' 1234561791234567090-8.9',
  'Street': ' 999 Street AVE',
  'Buyer': ' john doe'},
 {'Id Number': ' 12345688',
  'Location': ' 3582561791254567090-8.9',
  'Street': ' 123 Street AVE',
  'Buyer': ' Jane doe # buyer % LLC'},
 {'Id Number': ' 12345689',
  'Location': ' 8542561791254567090-8.9',
  'Street': ' 854 Street AVE',
  'Buyer': ' Jake and Bob: Owner%LLC: Inc'}]
# newline="" stops the csv module from writing blank rows on Windows
with open("split_ex.csv", "w", newline="") as csv_file:
    field_names = list_var[0].keys()
    csv_writer = csv.DictWriter(csv_file, fieldnames=field_names)
    csv_writer.writeheader()
    for row in list_var:
        csv_writer.writerow(row)

I would try reading the file line by line, splitting the key-value pairs into a list of dicts to look something like:
data = [
    {
        "Id Number": "12345678",
        "Location": "1234561791234567090-8.9",
        ...
    },
    {
        "Id Number": ...
    }
]
# easy to create the dataframe from here
your_df = pd.DataFrame(data)
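The line-by-line approach could be sketched roughly as follows, using only the standard library. The function names parse_records and records_to_csv and the first_key parameter are made up for this example, and the sketch assumes every record starts with an "Id Number" line, as in the sample.

```python
import csv
import io


def parse_records(lines, first_key="Id Number"):
    """Group 'Key: value' lines into one dict per record."""
    records = []
    record = {}
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        # partition splits on the first colon only, so colons in the value survive
        key, sep, value = line.partition(":")
        if not sep:
            continue  # no colon at all: not a key/value line
        key, value = key.strip(), value.strip()
        if key == first_key and record:
            records.append(record)  # previous record is complete
            record = {}
        record[key] = value
    if record:
        records.append(record)  # do not forget the last record
    return records


def records_to_csv(records, fieldnames):
    """Return the records as CSV text with the given column order."""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer, fieldnames=fieldnames, extrasaction="ignore", lineterminator="\n"
    )
    writer.writeheader()
    writer.writerows(records)
    return buffer.getvalue()


sample = [
    "Id Number: 12345678",
    "Location: 1234561791234567090-8.9",
    "Street: 999 Street AVE",
    "Buyer: john doe",
    "Id Number: 12345689",
    "Location: 8542561791254567090-8.9",
    "Street: 854 Street AVE",
    "Buyer: Jake and Bob: Owner%LLC: Inc",
]
records = parse_records(sample)
csv_text = records_to_csv(records, ["Id Number", "Location", "Street", "Buyer"])
print(csv_text)
```

The same list of dicts can be passed to pd.DataFrame(records) if you would rather continue in Pandas. Passing the column names explicitly also fixes the column order in the output.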

Related

Python: Search for string from dictionaryCSV file and display matching rows

I have a program that lets users choose a category (pulled from the file) and then prints the University data using a dictionary.
What I want to do next is let users search the file for a specific string and display all of the keys for the rows that match. The search term can be a whole word or only part of a string from the file.
I need help with searching for a given string, or part of a string, and displaying the matching rows with all categories (NameID, StudentName, University, Phone, State).
Example:
search: on
output: (Note: that this is in dictionary format)
{'NameID': 'JSNOW', ' StudentName': ' Jon Snow', ' University': ' UofWinterfell', ' Phone': ' 324234423', ' State': 'Westeros'}
{'NameID': 'JJONS', ' StudentName': ' Joe Jonson', ' University': ' NYU', ' Phone': ' 123432333', ' State': 'New York'}
My text file looks like this:
NameID, StudentName, University, Phone, State
JJONS, Joe Jonson, NYU, 123432333, New York
SROGE, Steve Rogers, UofI, 324324423, New York
JSNOW, Jon Snow, UofWinterfell, 324234423, Westeros
DTARG, Daenerys Targaryen, Dragonstone, 345345, NULL
This is what I have so far:
import csv
def load_data(file_name):
    university_data=[]
    with open("file.csv", mode='r') as csv_file:
        csv_reader = csv.DictReader(csv_file, skipinitialspace=True)
        for col in csv_reader:
            university_data.append(dict(col))
    print(university_data)
    return university_data
# def search_file():
#     for l in data:
#         no idea what to do here
def main():
    filename='file.csv'
    university_data = load_data(filename)
    print('[1] University\n[2] Student Name\n[3] Exit\n[4] Search')
    while True:
        choice=input('Enter choice 1/2/3? ')
        if choice=='1':
            for university in university_data:
                print(university['University'])
        elif choice=='2':
            for university in university_data:
                print(university['StudentName'])
        elif choice =='3':
            print('Thank You')
            break
        elif choice =='4':
            search_file()
        else:
            print('Invalid selection')
main()
I need choice 4 to work. Choices 1 and 2 can be ignored because they just display the names and are not in dictionary format.
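You have to figure out which field you are searching by and then iterate over the list of dicts.
def search_file(data, field, query):
    # data is the list of dicts returned by load_data
    for l in data:
        if l.get(field, None) == query:
            return l  # first row whose field equals query exactly
    return None  # no row matched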
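Since the question asks for partial matches and for every matching row, a variation along these lines may be closer to what is wanted. It is only a sketch: load_rows and search_rows are hypothetical names, the sample text is the file from the question, and the match is case-insensitive by assumption.

```python
import csv
import io

SAMPLE = """NameID, StudentName, University, Phone, State
JJONS, Joe Jonson, NYU, 123432333, New York
SROGE, Steve Rogers, UofI, 324324423, New York
JSNOW, Jon Snow, UofWinterfell, 324234423, Westeros
DTARG, Daenerys Targaryen, Dragonstone, 345345, NULL
"""


def load_rows(text):
    """Parse CSV text into a list of dicts, one per row."""
    reader = csv.DictReader(io.StringIO(text), skipinitialspace=True)
    return [dict(row) for row in reader]


def search_rows(rows, query):
    """Return every row in which any field contains query (case-insensitive)."""
    needle = query.lower()
    return [
        row
        for row in rows
        if any(isinstance(v, str) and needle in v.lower() for v in row.values())
    ]


rows = load_rows(SAMPLE)
for match in search_rows(rows, "snow"):
    print(match)
```

Note that a short query such as "on" matches any field containing those letters, so with this data it also returns the DTARG row because of "Dragonstone".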

Error when creating dictionaries from text files

I've been working on a function which will update two dictionaries (similar authors, and awards they've won) from an open text file. The text file looks something like this:
Brabudy, Ray
Hugo Award
Nebula Award
Saturn Award
Ellison, Harlan
Heinlein, Robert
Asimov, Isaac
Clarke, Arthur
Ellison, Harlan
Nebula Award
Hugo Award
Locus Award
Stephenson, Neil
Vonnegut, Kurt
Morgan, Richard
Adams, Douglas
And so on. The first name is an author's name (last name first, first name last), followed by awards they may have won, and then authors who are similar to them. This is what I've got so far:
def load_author_dicts(text_file, similar_authors, awards_authors):
    name_of_author = True
    awards = False
    similar = False
    for line in text_file:
        if name_of_author:
            author = line.split(', ')
            nameA = author[1].strip() + ' ' + author[0].strip()
            name_of_author = False
            awards = True
            continue
        if awards:
            if ',' in line:
                awards = False
                similar = True
            else:
                if nameA in awards_authors:
                    listawards = awards_authors[nameA]
                    listawards.append(line.strip())
                else:
                    listawards = []
                    listawards.append(line.strip())
                    awards_authors[nameA] = listawards
        if similar:
            if line == '\n':
                similar = False
                name_of_author = True
            else:
                sim_author = line.split(', ')
                nameS = sim_author[1].strip() + ' ' + sim_author[0].strip()
                if nameA in similar_authors:
                    similar_list = similar_authors[nameA]
                    similar_list.append(nameS)
                else:
                    similar_list = []
                    similar_list.append(nameS)
                    similar_authors[nameA] = similar_list
                continue
This works great! However, if the text file contains an entry with just a name (i.e. no awards and no similar authors), it screws the whole thing up, raising IndexError: list index out of range at the line nameS = sim_author[1].strip() + ' ' + sim_author[0].strip().
How can I fix this? Maybe with a try/except block in that area?
Also, I wouldn't mind getting rid of those continue statements; I wasn't sure how else to keep the loop going. I'm still pretty new to this, so any help would be much appreciated! I keep trying things and they change another section I didn't want changed, so I figured I'd ask the experts.
How about doing it this way: just get the data in, then manipulate the dictionary any way you want.
test.txt contains your data
Brabudy, Ray
Hugo Award
Nebula Award
Saturn Award
Ellison, Harlan
Heinlein, Robert
Asimov, Isaac
Clarke, Arthur
Ellison, Harlan
Nebula Award
Hugo Award
Locus Award
Stephenson, Neil
Vonnegut, Kurt
Morgan, Richard
Adams, Douglas
And my code to parse it.
award_parse.py
data = {}
name = ""
awards = []
f = open("test.txt")
for l in f:
    # make sure the line is not blank; don't process blank lines
    if not l.strip() == "":
        # if this is a name and we're not already working on an author then set the author
        # otherwise treat this as a new author and set the existing author to a key in the dictionary
        if "," in l and len(name) == 0:
            name = l.strip()
        elif "," in l and len(name) > 0:
            # check to see if recipient is already in list, add to end of existing list if he/she already
            # exists.
            if not name.strip() in data:
                data[name] = awards
            else:
                data[name].extend(awards)
            name = l.strip()
            awards = []
        # process any lines that are not blank, and do not have a ,
        else:
            awards.append(l.strip())
f.close()
for k, v in data.items():
    print("%s got the following awards: %s" % (k,v))
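For the original goal of two dictionaries, one option is to track only the current author instead of three flags. The sketch below assumes that records are separated by blank lines (as the line == '\n' check in the question implies) and that the first line of each record is "Last, First". The helper flip_name is a made-up name, and unlike the question's code, this version returns new dictionaries and gives authors with no awards an empty list.

```python
def flip_name(line):
    """Turn 'Last, First' into 'First Last'."""
    last, first = [part.strip() for part in line.split(",", 1)]
    return first + " " + last


def load_author_dicts(lines):
    """Return (awards_authors, similar_authors) built from the text lines."""
    awards_authors = {}
    similar_authors = {}
    current = None  # author of the record being read, None between records
    for raw in lines:
        line = raw.strip()
        if not line:
            current = None  # a blank line ends the record
        elif current is None:
            # first line of a record is the author's own name
            current = flip_name(line)
            awards_authors.setdefault(current, [])
            similar_authors.setdefault(current, [])
        elif "," in line:
            similar_authors[current].append(flip_name(line))
        else:
            awards_authors[current].append(line)
    return awards_authors, similar_authors


sample = [
    "Brabudy, Ray\n",
    "Hugo Award\n",
    "Ellison, Harlan\n",
    "\n",
    "Adams, Douglas\n",  # an entry with just a name
    "\n",
    "Ellison, Harlan\n",
    "Nebula Award\n",
]
awards, similar = load_author_dicts(sample)
print(awards)
print(similar)
```

Because each line is classified on its own, a record with only a name needs no try/except, and no continue statements are required.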

Pyscripter and printing line of CSV file

Here is my code:
import csv,math
StudentsTXT=open('students.txt')
csv_students=csv.reader(StudentsTXT, delimiter=',')
am =input('please search ip adress')
I also tried this code :
import csv,math
StudentsTXT=open('students.txt')
csv_students=csv.reader(StudentsTXT, delimiter=',')
print(csv_students['013'])
But this one fails with an error saying: object has no attribute '__getitem__'
How can I make either of these snippets print out a line of my text file, already parsed as CSV?
The Text file looks something like this
010,Jane,Jones,30/11/2001,32|Ban Road,H.Num:899 421 223,Female,11Ca,JJ#school.com
012,John,Johnson,23/09/2001,43|Can Street,H.Num:999 123 323,Male,11Ca,JoJo#school.com
025,Jack,Jackson,29/02/2002,61|Cat grove,H.Num:998 434 656,Male,11Ca,JaJa#school.com
I want to be able to search for any of the names of any students and then print all information about them .
Question: i want it to print things like for example:
002,John,Smith,01/01/2001,1 example road,000 000 000,male,11ca,js#school.com instead of: 001, john 002,jane
Change to csv.DictReader, for instance:
Note: Assuming you have NO headers in the CSV file!
import csv
with open('students.txt') as fh:
    # Define header fieldnames, in the same order as the columns in the file
    fieldnames = ['ID', 'Name', 'SName', 'Date', 'Address', 'Kontakt', 'F/M', 'Class', 'EMail']
    csv_students = csv.DictReader(fh, fieldnames=fieldnames)
    # Iterate csv_students
    for student in csv_students:
        # Create a list of data ordered using fieldnames
        student_list = [student[f] for f in fieldnames]
        print('{}'.format(', '.join(student_list)))
        # Print only part of the fields
        print('{s[ID]}: {s[Name]} {s[Address]}'.format(s=student))
Output:
010, Jane, Jones, 30/11/2001, 32|Ban Road, H.Num:899 421 223, Female, 11Ca, JJ#school.com
010: Jane 32|Ban Road
012, John, Johnson, 23/09/2001, 43|Can Street, H.Num:999 123 323, Male, 11Ca, JoJo#school.com
012: John 43|Can Street
025, Jack, Jackson, 29/02/2002, 61|Cat grove, H.Num:998 434 656, Male, 11Ca, JaJa#school.com
025: Jack 61|Cat grove
Tested with Python: 3.4.2
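To search for a student by name and print the whole row, something like the following sketch could work. The function find_students is a hypothetical helper, the field names are illustrative labels ordered to match the sample file in the question (ID first), and the match is a case-insensitive substring test on the first and last name.

```python
import csv

FIELDNAMES = ["ID", "Name", "SName", "Date", "Address", "Kontakt", "F/M", "Class", "EMail"]

SAMPLE_LINES = [
    "010,Jane,Jones,30/11/2001,32|Ban Road,H.Num:899 421 223,Female,11Ca,JJ#school.com",
    "012,John,Johnson,23/09/2001,43|Can Street,H.Num:999 123 323,Male,11Ca,JoJo#school.com",
    "025,Jack,Jackson,29/02/2002,61|Cat grove,H.Num:998 434 656,Male,11Ca,JaJa#school.com",
]


def find_students(lines, query, fieldnames=FIELDNAMES):
    """Return all rows whose first or last name contains query."""
    needle = query.lower()
    reader = csv.DictReader(lines, fieldnames=fieldnames)
    return [
        row
        for row in reader
        if needle in (row["Name"] or "").lower()
        or needle in (row["SName"] or "").lower()
    ]


matches = find_students(SAMPLE_LINES, "john")
for row in matches:
    # print every field of the matching student on one line
    print(",".join(row[f] for f in FIELDNAMES))
```

With a real file, pass the open file object instead of SAMPLE_LINES, for example find_students(open('students.txt'), 'john'), since csv.DictReader accepts any iterable of lines.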

Python text extractor & organizer

I need to extract the details of some customers and save them in a new database. All I have is a txt file covering 5000 customers or more, and every record in it is saved in this way:
first and last name
NAME SURNAME
zip country n. phone number mobile
United Kingdom +1111111111
e-mail
email#email.email
guest first and last name 1°
NAME SURNAME
guest first and last name 2°
NAME SURNAME
name address city province
NAME SURNAME London London
zip
AAAAA
Cancellation of the reservation.
Since the file is always laid out like this, I was thinking there could be a way to scrape it. After some research, this is what I have come up with so far, but it is not really what I need:
with open('input.txt') as infile, open('output.txt', 'w') as outfile:
    copy = False
    for line in infile:
        if (line.find("first and last name") != -1):
            copy = True
        elif (line.find("Cancellation of the reservation.") != -1):
            copy = False
        elif copy:
            outfile.write(line)
The code works, but it simply reads the file from one marker line to the other and copies the content. I need something that copies the content into another format so that I can upload it to the database. The format I need is this:
first and last name | zip country n. phone number mobile|e-mail|guest first and last name 1°|name address city province|zip
So in this case I need it like this:
NAME SURNAME | United Kingdom +1111111111|email#email.email|NAME SURNAME London London |AAAAA
There should be one such line in output.txt for every record.
Here are some basic string-splitting techniques for what you're looking to do:
data = '''first and last name
NAME SURNAME
zip country n. phone number mobile
United Kingdom +1111111111
e-mail
email#email.email
guest first and last name 1
NAME SURNAME
guest first and last name 2
NAME SURNAME
name address city province
NAME SURNAME London London
zip
AAAAA
Cancellation of the reservation.
'''
# split on whitespace, convert to list
ldata = data.split()
# strip leading and trailing white space from each item
ldata = [i.strip() for i in ldata]
# split on line break, convert to list
ndata = data.split('\n')
ndata = [i.strip() for i in ndata]
# convert list to string
sdata = ' '.join(ldata)
print(ldata)
print(ndata)
print(sdata)
# two examples of split after, split before
name_surname = sdata.split('first and last name')[1].split('zip')[0]
print(name_surname)
country_phone = sdata.split('mobile')[1].split('e-mail')[0]
print(country_phone)
>>>
['first', 'and', 'last', 'name', 'NAME', 'SURNAME', 'zip', 'country', 'n.', 'phone', 'number', 'mobile', 'United', 'Kingdom', '+1111111111', 'e-mail', 'email#email.email', 'guest', 'first', 'and', 'last', 'name', '1', 'NAME', 'SURNAME', 'guest', 'first', 'and', 'last', 'name', '2', 'NAME', 'SURNAME', 'name', 'address', 'city', 'province', 'NAME', 'SURNAME', 'London', 'London', 'zip', 'AAAAA', 'Cancellation', 'of', 'the', 'reservation.']
['first and last name', 'NAME SURNAME', 'zip country n. phone number mobile', 'United Kingdom +1111111111', 'e-mail', 'email#email.email', 'guest first and last name 1', 'NAME SURNAME', 'guest first and last name 2', 'NAME SURNAME', 'name address city province', 'NAME SURNAME London London', 'zip', 'AAAAA', 'Cancellation of the reservation.', '']
first and last name NAME SURNAME zip country n. phone number mobile United Kingdom +1111111111 e-mail email#email.email guest first and last name 1 NAME SURNAME guest first and last name 2 NAME SURNAME name address city province NAME SURNAME London London zip AAAAA Cancellation of the reservation.
NAME SURNAME
United Kingdom +1111111111
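To get one pipe-separated line per customer, the label and value lines can be paired up instead of split on keywords. This is only a sketch under the assumption that every label line is followed by exactly one value line and that each record ends with the "Cancellation of the reservation." line. The names records_to_rows, WANTED and END_MARKER are invented for the example.

```python
END_MARKER = "Cancellation of the reservation."

# Labels to export, in output order (taken from the format in the question).
WANTED = [
    "first and last name",
    "zip country n. phone number mobile",
    "e-mail",
    "guest first and last name 1°",
    "name address city province",
    "zip",
]


def records_to_rows(lines, wanted=WANTED, end_marker=END_MARKER):
    """Return one 'a|b|c' string per record found in lines."""
    rows = []
    record = {}
    label = None  # label waiting for its value line
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        if line == end_marker:
            if record:
                rows.append("|".join(record.get(name, "") for name in wanted))
            record, label = {}, None
        elif label is None:
            label = line  # this line names the field
        else:
            record[label] = line  # this line is the field's value
            label = None
    return rows


sample = """first and last name
NAME SURNAME
zip country n. phone number mobile
United Kingdom +1111111111
e-mail
email#email.email
guest first and last name 1°
NAME SURNAME
guest first and last name 2°
NAME SURNAME
name address city province
NAME SURNAME London London
zip
AAAAA
Cancellation of the reservation.
""".splitlines()

rows = records_to_rows(sample)
print(rows[0])
```

A record that is not closed by the end marker is dropped in this sketch, and a missing value line would shift the pairing, so real data may need extra checks.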

File content into dictionary

I need to turn this file content into a dictionary, so that every key in the dict is the name of a movie and every value is a set containing the names of the actors who play in it.
Example of file content:
Brad Pitt, Sleepers, Troy, Meet Joe Black, Oceans Eleven, Seven, Mr & Mrs Smith
Tom Hanks, You have got mail, Apollo 13, Sleepless in Seattle, Catch Me If You Can
Meg Ryan, You have got mail, Sleepless in Seattle
Diane Kruger, Troy, National Treasure
Dustin Hoffman, Sleepers, The Lost City
Anthony Hopkins, Hannibal, The Edge, Meet Joe Black, Proof
This should get you started:
line = "a, b, c, d"
result = {}
names = line.split(", ")
actor = names[0]
movies = names[1:]
result[actor] = movies
Try the following:
res_dict = {}
with open('my_file.txt', 'r') as f:
    for line in f:
        my_list = [item.strip() for item in line.split(',')]
        res_dict[my_list[0]] = my_list[1:] # To make it a set, use: set(my_list[1:])
Explanation:
split() is used to split each line to form a list using , separator
strip() is used to remove spaces around each element of the previous list
When you use with statement, you do not need to close your file explicitly.
[item.strip() for item in line.split(',')] is called a list comprehension.
Output:
>>> res_dict
{'Diane Kruger': ['Troy', 'National Treasure'], 'Brad Pitt': ['Sleepers', 'Troy', 'Meet Joe Black', 'Oceans Eleven', 'Seven', 'Mr & Mrs Smith'], 'Meg Ryan': ['You have got mail', 'Sleepless in Seattle'], 'Tom Hanks': ['You have got mail', 'Apollo 13', 'Sleepless in Seattle', 'Catch Me If You Can'], 'Dustin Hoffman': ['Sleepers', 'The Lost City'], 'Anthony Hopkins': ['Hannibal', 'The Edge', 'Meet Joe Black', 'Proof']}
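The dictionaries above map each actor to their movies, while the question asks for the reverse: each movie mapped to a set of actors. A possible sketch of that inversion follows. The function name movies_to_actors is made up, and the lines are assumed to follow the "actor, movie, movie, ..." layout from the question.

```python
from collections import defaultdict


def movies_to_actors(lines):
    """Map each movie title to the set of actors who appear in it."""
    result = defaultdict(set)
    for line in lines:
        names = [item.strip() for item in line.split(",")]
        actor, movies = names[0], names[1:]
        if not actor:
            continue  # skip blank lines
        for movie in movies:
            if movie:
                result[movie].add(actor)
    return dict(result)


sample = [
    "Brad Pitt, Sleepers, Troy, Meet Joe Black, Oceans Eleven, Seven, Mr & Mrs Smith",
    "Tom Hanks, You have got mail, Apollo 13, Sleepless in Seattle, Catch Me If You Can",
    "Meg Ryan, You have got mail, Sleepless in Seattle",
    "Diane Kruger, Troy, National Treasure",
]
movie_dict = movies_to_actors(sample)
print(movie_dict["Troy"])
```

defaultdict(set) creates the empty set the first time a movie is seen, so no explicit membership check is needed.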
