I'm in need of some knowledge on how to fix an error I have made while collecting data. The collected data has the following structure:
["Author", "Message"]
["littleblackcat", " There's a lot of redditors here that live in the area maybe/hopefully someone saw something. "]
["Kruse", "In other words, it's basically creating a mini tornado."]
I normally wouldn't have added "[" or "]" to the .txt file when writing the data to it, line by line. However, the mistake was made, and as a result the data is separated incorrectly when the file is loaded.
Is there a way to load the data properly into pandas?
On the snippet that I could cut and paste from the question (which I saved as test.txt), I managed to read a dataframe by:
Purging the square brackets (with sed on a Linux command line, but this can be done e.g. with a text editor, or in Python if need be):
sed -i 's/^\[//g' test.txt # remove left square brackets assuming they are at the beginning of the line
sed -i 's/\]$//g' test.txt # remove right square brackets assuming they are at the end of the line
Loading the dataframe (in a Python console):
import pandas as pd
pd.read_csv("test.txt", skipinitialspace = True, quotechar='"')
(not sure that this will work for the entirety of your file though).
Consider the code below, which reads the text in myfile.txt, which looks like this:
["Author", "Message"]
["littleblackcat", " There's a lot of redditors here that live in the area maybe/hopefully someone saw something. "]
["Kruse", "In other words ,it's basically creating a mini tornado."]
The code below removes [ and ] from the text, and then splits every line on , except the first line, which holds the headers. Some messages contain commas of their own; a plain split would push the overflow into extra columns (NaN elsewhere), so the code rejoins everything after the first comma into a single Message string. Note that this drops the embedded commas themselves, as the output below shows.
Code:
import pandas as pd

with open('myfile.txt', 'r') as my_file:
    text = my_file.read()

text = text.replace("[", "")
text = text.replace("]", "")

df = pd.DataFrame({
    'Author': [i.split(',')[0] for i in text.split('\n')[1:]],
    'Message': [''.join(i.split(',')[1:]) for i in text.split('\n')[1:]]
}).applymap(lambda x: x.replace('"', ''))
Output:
Author Message
0 littleblackcat There's a lot of redditors here that live in the area maybe/hopefully someone saw something.
1 Kruse In other words it's basically creating a mini tornado.
Here are a few more options to add to the mix:
You could parse the lines yourself using ast.literal_eval, and then load them into a pd.DataFrame directly, using an iterator over the lines:
import pandas as pd
import ast
with open('data', 'r') as f:
    lines = (ast.literal_eval(line) for line in f)
    header = next(lines)
    df = pd.DataFrame(lines, columns=header)

print(df)
Note, however, that calling ast.literal_eval once for each line may not be very fast, especially if your data file has a lot of lines. If the file is not too big, though, this may be an acceptable, simple solution.
Another option is to wrap an arbitrary iterator (which yields bytes) in an IterStream. This very general tool (thanks to Mechanical snail) allows you to manipulate the contents of any file and then re-package it into a file-like object. Thus, you can fix the contents of the file, and yet still pass it to any function which expects a file-like object, such as pd.read_csv. (Note: I've answered a similar question using the same tool, here.)
import io
import pandas as pd

def iterstream(iterable, buffer_size=io.DEFAULT_BUFFER_SIZE):
    """
    http://stackoverflow.com/a/20260030/190597 (Mechanical snail)
    Lets you use an iterable (e.g. a generator) that yields bytestrings as a
    read-only input stream.
    The stream implements Python 3's newer I/O API (available in Python 2's io
    module).
    For efficiency, the stream is buffered.
    """
    class IterStream(io.RawIOBase):
        def __init__(self):
            self.leftover = None
        def readable(self):
            return True
        def readinto(self, b):
            try:
                l = len(b)  # We're supposed to return at most this much
                chunk = self.leftover or next(iterable)
                output, self.leftover = chunk[:l], chunk[l:]
                b[:len(output)] = output
                return len(output)
            except StopIteration:
                return 0    # indicate EOF
    return io.BufferedReader(IterStream(), buffer_size=buffer_size)

def clean(f):
    # strip the surrounding brackets from each line and restore the newline
    for line in f:
        yield line.strip()[1:-1] + b'\n'

with open('data', 'rb') as f:
    # https://stackoverflow.com/a/50334183/190597 (Davide Fiocco)
    df = pd.read_csv(iterstream(clean(f)), skipinitialspace=True, quotechar='"')
    print(df)
A pure pandas option is to change the separator from , to ", " in order to get only two columns, and then strip the unwanted characters, which to my understanding are [, ], " and spaces:
import pandas as pd
import io
string = '''
["Author", "Message"]
["littleblackcat", " There's a lot of redditors here that live in the area maybe/hopefully someone saw something. "]
["Kruse", "In other words, it's basically creating a mini tornado."]
'''
df = pd.read_csv(io.StringIO(string), sep='\", \"', engine='python').apply(lambda x: x.str.strip('[\"] '))
# the \" instead of simply " is to make sure python does not interpret it as an end-of-string character
df.columns = [df.columns[0][2:], df.columns[1][:-2]]
print(df)
# Output (note that the space before "There's" is also gone):
# Author Message
# 0 littleblackcat There's a lot of redditors here that live in t...
# 1 Kruse In other words, it's basically creating a mini...
For now the following solution was found:
sep = '[|"|]'
Using a multi-character separator allowed the brackets to be stored in separate columns of the pandas dataframe, which were then dropped. This avoids having to strip the words line by line.
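As a minimal sketch of that approach (the file name test.txt and the column positions are assumptions; sep='[|"|]' is treated as a regex character class matching the quote character, so the brackets and the ", " filler land in columns of their own):
import pandas as pd

# splitting on the quotes leaves '[', ', ' and ']' in columns 0, 2 and 4,
# which can simply be dropped; skiprows=1 skips the bracketed header line
df = pd.read_csv('test.txt', sep='[|"|]', engine='python', header=None, skiprows=1)
df = df.drop(columns=[0, 2, 4])
df.columns = ['Author', 'Message']
print(df)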
Related
I'm trying to loop through some unstructured text data in Python. The end goal is to structure it in a dataframe. For now I'm just trying to get the relevant data into an array and understand the line/readline() functionality in Python.
This is what the text looks like:
Title: title of an article
Full text: unfortunately the full text of each article,
is on numerous lines. Each article has a differing number
of lines. In this example, there are three..
Subject: Python
Title: title of another article
Full text: again unfortunately the full text of each article,
is on numerous lines.
Subject: Python
This same format is repeated for lots of text articles in the same file. So far I've figured out how to pull out lines that include certain text. For example, I can loop through it and put all of the article titles in a list like this:
a = "Title:"
titleList = []
sample = 'sample.txt'
with open(sample,encoding="utf8") as unstr:
for line in unstr:
if a in line:
titleList.append(line)
Now I want to do the following:
a = "Title:"
b = "Full text:"
d = "Subject:"
list = []
sample = 'sample.txt'
with open(sample,encoding="utf8") as unstr:
for line in unstr:
if a in line:
list.append(line)
if b in line:
1. Concatenate this line with each line after it, until i reach the line that includes "Subject:". Ignore the "Subject:" line, stop the "Full text:" subloop, add the concatenated full text to the list array.<br>
2. Continue the for loop within which all of this sits
As a Python beginner, I'm spinning my wheels searching google on this topic. Any pointers would be much appreciated.
If you want to stick with your for-loop, you're probably going to need something like this:
titles = []
texts = []
subjects = []

with open('sample.txt', encoding="utf8") as f:
    inside_fulltext = False
    for line in f:
        if line.startswith("Title:"):
            inside_fulltext = False
            titles.append(line)
        elif line.startswith("Full text:"):
            inside_fulltext = True
            full_text = line
        elif line.startswith("Subject:"):
            inside_fulltext = False
            texts.append(full_text)
            subjects.append(line)
        elif inside_fulltext:
            full_text += line
        else:
            # Possibly throw a format error here?
            pass
(A couple of things: Python is weird about names, and when you write list = [], you're actually overwriting the label for the list class, which can cause you problems later. You should really treat list, set, and so on like keywords, even though Python technically doesn't, just to save yourself the headache. Also, the startswith method is a little more precise here, given your description of the data.)
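For instance, a quick illustration of the shadowing problem (a hypothetical snippet, not from the question):
list = []        # shadows the built-in list class from here on
# list("abc")    # would now raise TypeError: 'list' object is not callable
del list         # removes the shadow and restores the built-in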
Alternatively, you could wrap the file object in an iterator (i = iter(f), and then next(i)), but that's going to cause some headaches with catching StopIteration exceptions, though it would let you use a more classic while-loop for the whole thing; a sketch follows below. For myself, I would stick with the state-machine approach above, and just make it sufficiently robust to deal with all your reasonably expected edge-cases.
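A minimal sketch of that iterator-based alternative, assuming the same sample.txt and markers as above (this is illustrative, not the asker's code):
titles, texts, subjects = [], [], []
with open('sample.txt', encoding="utf8") as f:
    i = iter(f)
    try:
        line = next(i)
        while True:
            if line.startswith("Title:"):
                titles.append(line)
                line = next(i)
            elif line.startswith("Full text:"):
                # concatenate lines until the Subject: marker
                full_text = line
                line = next(i)
                while not line.startswith("Subject:"):
                    full_text += line
                    line = next(i)
                texts.append(full_text)
            elif line.startswith("Subject:"):
                subjects.append(line)
                line = next(i)
            else:
                line = next(i)
    except StopIteration:
        pass  # reached end of file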
As your goal is to construct a DataFrame, here is a re+numpy+pandas solution:
import re
import pandas as pd
import numpy as np
# read the whole file
with open('sample.txt', encoding="utf8") as f:
    text = f.read()

keys = ['Subject', 'Title', 'Full text']
regex = r'(?:^|\n)(%s): ' % '|'.join(keys)

# split the text on the keys; the capturing group keeps each key in the result
chunks = re.split(regex, text)[1:]

# reshape the flat list to pair keys with values and group the pairs of each article
df = pd.DataFrame([dict(e) for e in np.array(chunks).reshape(-1, len(keys), 2)])
Output:
Title Full text Subject
0 title of an article unfortunately the full text of each article,\nis on numerous lines. Each article has a differing number \nof lines. In this example, there are three.. Python
1 title of another article again unfortunately the full text of each article,\nis on numerous lines. Python
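To see why the reshape works, note that re.split with a capturing group keeps each matched key in the output, so chunks alternates key, value, key, value, ... A small illustration (the sample text here is made up):
import re
import numpy as np

keys = ['Subject', 'Title', 'Full text']
regex = r'(?:^|\n)(%s): ' % '|'.join(keys)
text = "Title: t1\nFull text: body1\nSubject: Python"
chunks = re.split(regex, text)[1:]
# chunks == ['Title', 't1', 'Full text', 'body1', 'Subject', 'Python']
records = np.array(chunks).reshape(-1, len(keys), 2)
# each row now holds the three (key, value) pairs of one article
print([dict(e) for e in records])
# [{'Title': 't1', 'Full text': 'body1', 'Subject': 'Python'}]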
I am using Python, and I have a large outputString that consists of several outputs, each on a new line, looking something like this:
{size:1, title:"Hello", space:0}
{size:21, title:"World", space:10}
{size:3, title:"Goodbye", space:20}
However, there is so much data that I cannot see it all in the terminal, and would like to write code that automatically writes a json file. I am having trouble getting the json to keep the separated lines. Right now, it is all one large line in the json file. I have attached some code that I have tried. I have also attached the code used to make the string that I want to convert to a json. Thank you so much!
for value in outputList:
    newOutputString = json.dumps(value)
    outputString += (newOutputString + "\n")

with open('data.json', 'w') as outfile:
    for item in outputString.splitlines():
        json.dump(item, outfile)
        json.dump("\n", outfile)
If the input really is a string, you'll probably have to make sure it's properly formatted as JSON first:
outputString = '''{"size":1, "title":"Hello", "space":0}
{"size":21, "title":"World", "space":10}
{"size":3, "title":"Goodbye", "space":20}'''
You could then use pandas to manipulate your data (so it's not a problem of screen size anymore).
import pandas as pd
import json
pd.DataFrame([json.loads(line) for line in outputString.split('\n')])
Which gives:
size title space
0 1 Hello 0
1 21 World 10
2 3 Goodbye 20
On the other hand, from what I understand outputString is not a string but a list of dictionaries, so you could write a simpler version of this:
outputString = [{'size':1, 'title':"Hello", 'space':0},
{'size':21, 'title':"World", 'space':10},
{'size':3, 'title':"Goodbye", 'space':20}]
pd.DataFrame(outputString)
Which gives the same DataFrame as before. Using this DataFrame will allow you to query your data, and it will be much more comfortable than raw JSON. For example:
>>> df = pd.DataFrame(outputString)
>>> df[df['size'] >= 3]
size title space
1 21 World 10
2 3 Goodbye 20
You could also try IPython (or even Jupyter/JupyterLab), as it will probably make your life easier too.
You can use the code below:
# parse one JSON object per line, then write the whole list out indented
json_data = [json.loads(line) for line in outputString.splitlines()]
with open('data.json', 'w') as outfile:
    json.dump(json_data, outfile, indent=5)
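If the goal is instead to keep one object per line in the output file (the newline-separated layout described in the question), a minimal sketch, assuming outputList is the list of dicts built earlier (the file name data.jsonl is an assumption):
import json

with open('data.jsonl', 'w') as outfile:
    for value in outputList:
        # one compact JSON object per line
        outfile.write(json.dumps(value) + "\n")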
I'm writing a list of strings as a tab-delimited file, using Python 3.6.
It is rare, but hypothetically possible, that there are tabs in the data. If so, I need to replace them with spaces, which I do like this:
row = [x.replace("\t", " ") for x in row]
The trouble is, this one line is responsible for about 1/4 of the runtime of the whole program, and it almost never is actually doing anything.
Is there a faster way to purge tabs from my data?
Is there any way to take advantage of the fact that it probably doesn't have any tabs anyway?
I've tried working in bytes instead of strings, and that made no difference.
I tried various approaches, and the fastest one is to perform a conditional replacement only at the indexes where a tab is present:
def testReplace(sList):
    return [s.replace("\t", " ") for s in sList]

noTabs = str.maketrans("\t", " ")
def testTrans(sList):
    return [s.translate(noTabs) for s in sList]

def joinSplit(sList):
    return "\n".join(sList).replace("\t", " ").split("\n")

def conditional(sList):
    result = sList.copy()  # not needed if you intend to replace the list
    for i, s in enumerate(sList):
        if "\t" in s:
            result[i] = s.replace("\t", " ")
    return result
performance checks:
from timeit import timeit
count = 100
strings = ["Hello World"*10]*1000 # ["Hello \t World"*10]*1000
t = timeit(lambda:testReplace(strings),number=count)
print("replace",t)
t = timeit(lambda:testTrans(strings),number=count)
print("translate",t)
t = timeit(lambda:joinSplit(strings),number=count)
print("joinSplit",t)
t = timeit(lambda:conditional(strings),number=count)
print("conditional",t)
output:
# With tabs
replace 0.03365320100000002
translate 0.08165113099999993
joinSplit 0.027709890000000015
conditional 0.007067911000000038
# without tabs
replace 0.015160736000000008
translate 0.07439537500000004
joinSplit 0.017001820000000056
conditional 0.0065534649999999806
Untested, since this is a performance question, but I would use the csv module, which knows about fields containing newlines or separators and quotes them automatically:
import csv
with open(filename, 'w', newline='') as fd:
    wr = csv.writer(fd, delimiter='\t')
    ...
    wr.writerow(row)
Below is my piece of code.
import numpy as np
filename1=open(f)
xf = np.loadtxt(filename1, dtype=float)
Below is my data file.
0.14200E+02 0.18188E+01 0.44604E-03
0.14300E+02 0.18165E+01 0.45498E-03
0.14400E+02-0.17694E+01 0.44615E+03
0.14500E+02-0.17226E+01 0.43743E+03
0.14600E+02-0.16767E+01 0.42882E+03
0.14700E+02-0.16318E+01 0.42033E+03
0.14800E+02-0.15879E+01 0.41196E+03
As one can see, there are negative values that absorb the space between two columns, which causes numpy to raise:
ValueError: Wrong number of columns at line 3
This is just small snippet of my code. I want to read this data using numpy or pandas. Any suggestion would be great.
Edit 1:
@ZarakiKenpachi I used your suggestion of sep=' |-' but it gives me an extra 4th column with NaN values.
Edit 2:
@Serge Ballesta, nice suggestion, but all of these are some kind of pre-processing. I want some kind of built-in function to do this in pandas or numpy.
Edit 3:
Important note: there are also negative signs inside the numbers themselves, as in 0.4373E-03, so the exponent's minus must not be padded.
Thank you.
np.loadtxt can read from a (byte string) generator, so you can filter the input file while loading it, adding an additional space before each minus that follows a digit:
import re
import numpy as np

def filter(fd):
    # insert a space before a minus that directly follows a digit,
    # keeping the digit itself (hence the capture group)
    rx = re.compile(rb'(\d)-')
    for line in fd:
        yield rx.sub(rb'\1 -', line)

xf = np.loadtxt(filter(open(f, 'rb')), dtype=float)
This does not require preloading everything into memory, so it should be memory-efficient.
Requiring a digit before the minus in the regex avoids changing something like 0.16545E-012.
In my tests on 10k lines, this should be at most 10% slower than loading everything into memory, but it requires far less memory.
You can preprocess your data to add an additional space before the - signs. While there are many ways of doing it, the best approach in my opinion (in order to avoid adding whitespace at the start of a line) is using regex with re.sub:
import re
import numpy as np

with open(f) as file:
    raw_data = file.read()

# keep the digit and insert a space before the minus
processed_data = re.sub(r'(\d)-', r'\1 -', raw_data)
xf = np.loadtxt(processed_data.splitlines(), dtype=float)
This replaces every - that is directly preceded by a digit with ' -', keeping the digit in place.
Try the code below:
import re
import numpy as np

with open('app.txt') as f:
    data = f.read()

data_mod = []
for number in data.split('\n')[:-1]:
    num = re.findall(r'[\w\.-]+-[\w\.-]', number)
    for n in num:
        number = number.replace('-', ' -')
    data_mod.append(number)

with open('mod_text.txt', 'w') as f:
    for data in data_mod:
        f.write(data + "\n")

filename1 = 'mod_text.txt'
xf = np.loadtxt(filename1, dtype=float)
Actually you have to pre-process the data using regex; after that you can load it as required.
I hope this helps.
I'm trying to split downloaded data into a 2D array with different datatypes. The downloaded data looks like this:
000|17:40
000|17:45
010|17:50
025|17:55
056|18:00
178|18:05
202|18:10
203|18:15
190|18:20
072|18:25
013|18:30
002|18:35
000|18:40
000|18:45
000|18:50
000|18:55
000|19:00
000|19:05
000|19:10
000|19:15
000|19:20
000|19:25
000|19:30
000|19:35
000|19:40
I’m using the following code to parse this into a two dimensional array:
#!/usr/bin/python
import urllib2
response = urllib2.urlopen('http://gps.buienradar.nl/getrr.php?lat=52&lon=4')
html = response.read()
htmlsplit = []
for record in html.split("\r\n"):
    htmlsplit.append(record.split("|"))
print htmlsplit
This works great, but as expected it treats everything as a string. I've found some examples that split into integers; that's great if both sides were integers, but in my case it's an integer | string (or maybe some kind of Python time format).
How can I split this directly into different data types?
Something like this?
for record in html.split("\r\n"):  # beware, newlines are treacherous!
    s = record.split("|")
    htmlsplit.append((int(s[0]), s[1]))
Just write a parser for each record, if you have data this simple. However, I would add a try/except clause to catch errors for non-conforming lines, empty lines, etc. which may be present in the data; the code above is very fragile. Also, you might want to split only on \n and then clean your strings with strip() (i.e. replace s[1] by s[1].strip()); the integer conversion takes care of whitespace automatically.
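A minimal sketch of that more defensive version (the names mirror the snippet above; the exact exception handling is an assumption):
htmlsplit = []
for record in html.split("\n"):
    try:
        s = record.split("|")
        # int() tolerates surrounding whitespace; strip the time string ourselves
        htmlsplit.append((int(s[0]), s[1].strip()))
    except (ValueError, IndexError):
        pass  # skip empty or otherwise non-conforming lines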
Use str.splitlines instead of splitting on \r\n
Use the csv module to iterate over the lines:
import csv
txt = '000|17:40\n000|17:45\n000|17:50\n000|17:55\n000|18:00\n000|18:05\n000|18:10\n000|18:15\n000|18:20\n000|18:25\n000|18:30\n000|18:35\n000|18:40\n000|18:45\n000|18:50\n000|18:55\n000|19:00\n000|19:05\n000|19:10\n000|19:15\n000|19:20\n000|19:25\n000|19:30\n000|19:35\n000|19:40\n'
reader = csv.reader(txt.splitlines(), delimiter='|')
column1 = []
column2 = []
for c1, c2 in reader:
    column1.append(c1)
    column2.append(c2)
You can also use the DictReader
import StringIO
reader2 = csv.DictReader(StringIO.StringIO(txt),
                         fieldnames=['int', 'time'],
                         delimiter='|')
column1 = []
column2 = []
for row in reader2:
    column1.append(row['time'])
    column2.append(row['int'])