Python: Issue with delimiter inside column data

This is not a duplicate of another question, as I do not want to drop the rows. The accepted answer in the aforementioned post is very different from this one, and not aimed at keeping all the data.
Problem:
Delimiter inside column data from badly formatted csv-file
Tried solutions: csv module, shlex, StringIO (no working solution on SO)
Example data
Delimiters are inside the third data field, somewhere enclosed by (multiple) double-quotes:
08884624;6/4/2016;Network routing 21,5\"\" 4;8GHz1TB hddQwerty\"\";9999;resell:no;package:1;test
0085658;6/4/2016;Logic 111BLACK.compat: 29,46 cm (11.6\"\")deep: 4;06 cm height: 25;9 cm\"\";9999;resell:no;package:1;test
4235846;6/4/2016;Case Logic. compat: 39,624 cm (15.6\"\") deep: 3;05 cm height: 3 cm\"\";9999;resell:no;package:1;test
400015;6/4/2016;Cable\"\"Easy Cover\"\"\"\";1;5 m 30 Silver\"\";9999;resell:no;package:1;test
9791118;6/4/2016;Network routing 21,5\"\" (2013) 2;7GHz\"\";9999;resell:no;package:1;test
477000;6/4/2016;iGlaze. deep: 9,6 mm (67.378\"\") height: 14;13 cm\"\";9999;resell:no;package:1;test
4024001;6/4/2016;DigitalBOX. tuner: Digital, Power: 20 W., Diag: 7,32 cm (2.88\"\"). Speed 10;100 Mbit/s\"\";9999;resell:no;package:1;test
Desired sample output
Fixed length of 7:
['08884624','6/4/2016', 'Network routing 21,5\" 4,8GHz1TB hddQwerty', '9999', 'resell:no', 'package:1', 'test']
Parsing with the csv reader doesn't fix the problem (skipinitialspace is not the issue), shlex is no use, and StringIO is also of no help...
My initial idea was to import row by row, and replace ';' element by element in row.
But the importing is the problem, as it splits on every ';'.
The data comes from a larger file with 300,000+ rows (not all the rows have this problem).
Any advice is welcome.

As you know the number of input fields, and as only one field is badly formatted, you can simply split on ';' and then join the middle fields back into a single one:
for line in file:
    temp_l = line.rstrip('\n').split(';')
    # keep the first 2 and last 4 fields; re-join everything in between
    lst = temp_l[:2] + [';'.join(temp_l[2:-4])] + temp_l[-4:]  # lst should contain the expected 7 fields
I did not even try to process the double quotes, because I could not understand how you get from Network routing 21,5\"\" 4;8GHz1TB hddQwerty\"\" to 'Network routing 21,5\" 4,8GHz1TB hddQwerty'...
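For the full file mentioned in the question (300,000+ rows, only some of them malformed), a minimal sketch of the same idea; the filename is hypothetical and well-formed rows are assumed to already have 7 fields:
fixed_rows = []
with open('data.csv') as file:                  # hypothetical filename
    for line in file:
        temp_l = line.rstrip('\n').split(';')
        if len(temp_l) <= 7:                    # already well-formed, keep as-is
            fixed_rows.append(temp_l)
        else:                                   # re-join the overflowing middle fields
            fixed_rows.append(temp_l[:2] + [';'.join(temp_l[2:-4])] + temp_l[-4:])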

You can use the standard csv module.
To achieve what you are trying to accomplish, just change the csv delimiter to ';'.
Test the following in the terminal:
import csv
test = ["4024001;6/4/2016;DigitalBOX. tuner: Digital, Power: 20 W., Diag: 7,32 cm (2.88\"\"). Speed 10;100 Mbit/s\"\";9999;resell:no;package:1;test"]
delimited_colon = list(csv.reader(test, delimiter=";", skipinitialspace=True))
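A quick check (not part of the original answer) to see how many fields the reader actually produces for that sample row:
for row in delimited_colon:
    print(len(row), row)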

Related

Convert string from `ABCD0011` to `ABCD_11`

I have the following CSV file:
ABCD0011
ABCD1404
ABCD1255
There are many such rows in the CSV file which I want to convert as follows:
Input      Expected Output   Actual Output
ABCD0011   ABCD_11_0         ABCD_0011_0
ABCD1404   ABCD_1404_0       ABCD_144_0
ABCD1255   ABCD_1255_0       ABCD_1255_0
Basically, it takes the zeros after the letters and replaces them with an underscore ("_").
Code
import numpy as np
import pandas as pd
df = pd.read_csv('Book1.csv')
df.A = df.A.str.replace('[0-9]+', '')+'_'+df.A.str.replace('([A-Z])+', '')+'_0'
Actual Output and Issues
I got the values without leading zeros correctly converted, like ABCD1255 to ABCD_1255_0.
But for values with leading zeros it failed: ABCD0011 became ABCD_0011_0, i.e. the format did not change.
Even for values with zeros inside it failed: ABCD1404 became ABCD_144_0, deleting the zero in the middle.
Question
How can I fix this issue?
If we know the input strings will always be eight characters, with the first four being letters and the last four being digits, we could:
>>> s = "ABCD0011"
>>> f"{s[:4]}_{int(s[4:])}_0"
'ABCD_11_0'
If we don't know the lengths for sure, we can use re.sub with a lambda to transform two different matching groups.
>>> import re
>>> re.sub(r'([a-zA-Z]+)(\d+)', lambda m: f"{m.group(1)}_{int(m.group(2))}_0", s)
'ABCD_11_0'
>>> re.sub(r'([a-zA-Z]+)(\d+)', lambda m: f"{m.group(1)}_{int(m.group(2))}_0", 'A709')
'A_709_0'
Ignoring the apparent requirement for a dataframe, this is how you could parse the file and generate the strings you need. Uses re. Does not use numpy and/or pandas
import re

FILENAME = 'well.csv'
PATTERN = re.compile(r'^([a-zA-Z]+)(\d+)$')

with open(FILENAME) as csv_data:
    next(csv_data)  # skip header(s)
    for line in csv_data:
        if m := PATTERN.search(line):
            print(f'{m.group(1)}_{int(m.group(2))}_0')
This will work for the data shown in the question. Other data structures may cause this to fail.
Output:
ABCD_11_0
ABCD_1404_0
ABCD_1255_0

Formatting an unstructured csv in pandas

I'm having an issue reading accurate information from archived 4chan comments. Since the structure of a 4chan thread doesn't (seem to) translate very well into a rectangular dataframe, I'm having trouble getting the appropriate comments from each thread into a single row in pandas.
To exacerbate the problem, the dataset is 54 GB in size, and I asked a similar question on how to just read the data into a pandas dataframe (the solution to that problem is what made me realize this issue), which makes diagnosing every problem tedious.
The code I use to read in portions of the data is as follows:
import pandas as pd

def Four_pleb_chunker():
    """
    :return: 4pleb data is over 54 GB so this chunks it into something manageable
    """
    with open('pol.csv') as f:
        with open('pol_part.csv', 'w') as g:
            for i in range(1000):
                g.write(f.readline())
    name_cols = ['num', 'subnum', 'thread_num', 'op', 'timestamp', 'timestamp_expired', 'preview_orig', 'preview_w', 'preview_h',
                 'media_filename', 'media_w', 'media_h', 'media_size', 'media_hash', 'media_orig', 'spoiler', 'deleted', 'capcode',
                 'email', 'name', 'trip', 'title', 'comment', 'sticky', 'locked', 'poster_hash', 'poster_country', 'exif']
    cols = ['num', 'timestamp', 'email', 'name', 'title', 'comment', 'poster_country']
    df_chunk = pd.read_csv('pol_part.csv',
                           names=name_cols,
                           delimiter=None,
                           usecols=cols,
                           skip_blank_lines=True,
                           engine='python',
                           error_bad_lines=False)
    df_chunk = df_chunk.rename(columns={"comment": "Comments"})
    df_chunk = df_chunk.dropna(subset=['Comments'])
    df_chunk['Comments'] = df_chunk['Comments'].str.replace('[^0-9a-zA-Z]+', ' ')
    df_chunk.to_csv('pol_part_df.csv')
    return df_chunk
This code works fine; however, due to the structure of each thread, a parser that I wrote sometimes returns nonsensical results. In csv form this is what the first few rows of the dataset look like (pardon the screenshot, it's extremely difficult to actually write all those lines out using this UI).
As can be seen, the comments per thread are split by '\' but each comment doesn't get its own row. My goal is at least to get each comment into its own row so I can parse it correctly. However, the function I'm using to parse the data cuts off after 1000 iterations regardless of whether it's at a new line or not.
Fundamentally my questions are: how can I structure this data to actually read the comments accurately, and how can I read in a complete sample dataframe rather than a truncated one? As for solutions, I've tried:
df_chunk = pd.read_csv('pol_part.csv',
                       names=name_cols,
                       delimiter='',
                       usecols=cols,
                       skip_blank_lines=True,
                       engine='python',
                       error_bad_lines=False)
If I get rid of/change the argument delimiter I get this error:
Skipping line 31473: ',' expected after '"'
Which makes sense, because the data isn't separated by ',', so it skips every line that doesn't fit that condition, in this case the whole dataframe. Inputting '\' into the argument gives me a syntax error. I'm kind of at a loss for what to do next, so if anyone has had any experience dealing with an issue like this you'd be a lifesaver. Let me know if there is something I haven't included in here and I'll update the post.
Update, here are some sample lines from the CSV for testing:
2 23594708 1385716767 \N Anonymous \N Example: not identifying the fundamental scarcity of resources which underlies the entire global power structure, or the huge, documented suppression of any threats to that via National Security Orders. Or that EVERY left/right ideology would be horrible in comparison to ANY in which energy scarcity and the hierarchical power structures dependent upon it had been addressed.
3 23594754 1385716903 \N Anonymous \N ">>23594701\
\
No, /pol/ is bait. That's the point."
4 23594773 1385716983 \N Anonymous \N ">>23594754
\
Being a non-bait among baits is equal to being a bait among non-baits."
5 23594795 1385717052 \N Anonymous \N Don't forget how heavily censored this board is! And nobody has any issues with that.
6 23594812 1385717101 \N Anonymous \N ">>23594773\
\
Clever. The effect is similar. But there are minds on /pol/ who don't WANT to be bait, at least."
Here's a sample script that converts your csv into separate lines for each comment:
import csv

# open file for output and create csv writer
f_out = open('out.csv', 'w')
w = csv.writer(f_out)

# open input file and create reader
with open('test.csv') as f:
    r = csv.reader(f, delimiter='\t')
    for l in r:
        # skip empty lines
        if not l:
            continue
        # split the last field (the comment)
        # and loop over each resulting string
        for s in l[-1].split('\\\n'):
            # copy all fields except the last one
            output = l[:-1]
            # add a single comment
            output.append(s)
            w.writerow(output)
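As a small follow-up sketch (not part of the original answer), the rewritten out.csv could then be read back into pandas; the column layout here is an assumption, since the sample rows only show a subset of the full field list:
import pandas as pd

df = pd.read_csv('out.csv', header=None)
df = df.rename(columns={df.columns[-1]: 'Comments'})  # the last field now holds one comment per row
print(df.head())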

Reading irregular column data into python 3.X using pandas or numpy

Below is my piece of code.
import numpy as np
filename1=open(f)
xf = np.loadtxt(filename1, dtype=float)
Below is my data file.
0.14200E+02 0.18188E+01 0.44604E-03
0.14300E+02 0.18165E+01 0.45498E-03
0.14400E+02-0.17694E+01 0.44615E+03
0.14500E+02-0.17226E+01 0.43743E+03
0.14600E+02-0.16767E+01 0.42882E+03
0.14700E+02-0.16318E+01 0.42033E+03
0.14800E+02-0.15879E+01 0.41196E+03
As one can see, there are negative values whose sign takes up the space between two values; this causes numpy to give:
ValueError: Wrong number of columns at line 3
This is just a small snippet of my code. I want to read this data using numpy or pandas. Any suggestion would be great.
Edit 1:
@ZarakiKenpachi I used your suggestion of sep=' |-' but it gives me an extra 4th column with NaN values.
Edit 2:
@Serge Ballesta nice suggestion, but all these are some kind of pre-processing. I want some kind of built-in function to do this in pandas or numpy.
Edit 3:
Important note: it should be noted that there is also a negative sign in values such as 0.4373E-03.
Thank-you
np.loadtxt can read from a (byte string) generator, so you can filter the input file while loading it to add an additional space before a minus:
...
def filter(fd):
    rx = re.compile(rb'(\d)-')          # match a minus only when it directly follows a digit
    for line in fd:
        yield rx.sub(rb'\1 -', line)

xf = np.loadtxt(filter(open(f, 'rb')), dtype=float)
This does not require preloading everything into memory, so it is expected to be memory efficient.
The regex is required to avoid changing something like 0.16545E-012.
In my tests on 10k lines, this is at most 10% slower than loading everything into memory, but it requires far less memory.
You can preprocess your data to add an additional space before the - signs. While there are many ways of doing it, the best approach in my opinion (in order to avoid adding whitespace at the start of a line) is using regex re.sub:
import re
import io
import numpy as np

with open(f) as file:
    raw_data = file.read()
# insert a space before a '-' only when it directly follows a digit
processed_data = re.sub(r'(\d)-', r'\1 -', raw_data)
xf = np.loadtxt(io.StringIO(processed_data), dtype=float)
This inserts a space before every - that is preceded by a digit.
Try the below code:
import re
import numpy as np

with open('app.txt') as f:
    data = f.read()

data_mod = []
for number in data.split('\n')[:-1]:
    num = re.findall(r'[\w\.-]+-[\w\.-]', number)
    for n in num:
        number = number.replace('-', ' -')
    data_mod.append(number)

with open('mod_text.txt', 'w') as f:
    for data in data_mod:
        f.write(data + "\n")

filename1 = 'mod_text.txt'
xf = np.loadtxt(filename1, dtype=float)
Actually you have to pre-process the data, using regex. After that you can load the data as you require.
I hope this helps.

How to load a dataframe from a file containing unwanted characters?

I'm in need of some knowledge on how to fix an error I have made while collecting data. The collected data has the following structure:
["Author", "Message"]
["littleblackcat", " There's a lot of redditors here that live in the area maybe/hopefully someone saw something. "]
["Kruse", "In other words, it's basically creating a mini tornado."]
I normally wouldn't have added "[" or "]" to the .txt file when writing the data to it, line per line. However, the mistake was made, and thus when loading the file it gets split up the wrong way.
Is there a way to load the data properly to pandas?
On the snippet that I can cut and paste from the question (which I named test.txt), I could successfully read a dataframe via
Purging square brackets (with sed on a Linux command line, but this can be done e.g. with a text editor, or in python if need be)
sed -i 's/^\[//g' test.txt # remove left square brackets assuming they are at the beginning of the line
sed -i 's/\]$//g' test.txt # remove right square brackets assuming they are at the end of the line
Loading the dataframe (in a python console)
import pandas as pd
pd.read_csv("test.txt", skipinitialspace = True, quotechar='"')
(not sure that this will work for the entirety of your file though).
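If a pure-python alternative to sed is preferred, here is a minimal sketch of the same bracket-stripping, assuming the same test.txt file:
import re

with open('test.txt') as f:
    cleaned = [re.sub(r'^\[|\]$', '', line.rstrip('\n')) for line in f]
with open('test.txt', 'w') as f:
    f.write('\n'.join(cleaned) + '\n')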
Consider the below code, which reads the text in myfile.txt, which looks like this:
["Author", "Message"]
["littleblackcat", " There's a lot of redditors here that live in the area maybe/hopefully someone saw something. "]
["Kruse", "In other words, it's basically creating a mini tornado."]
The code below removes [ and ] from the text and then splits every string in the list of strings on ',', excluding the first string, which holds the headers. Some messages contain ',' themselves, which would otherwise spill into another column (NaN otherwise), hence the code joins those parts back into one string, as intended.
Code:
import pandas as pd

with open('myfile.txt', 'r') as my_file:
    text = my_file.read()
text = text.replace("[", "")
text = text.replace("]", "")
df = pd.DataFrame({
    'Author': [i.split(',')[0] for i in text.split('\n')[1:]],
    'Message': [''.join(i.split(',')[1:]) for i in text.split('\n')[1:]]
}).applymap(lambda x: x.replace('"', ''))
Output:
Author Message
0 littleblackcat There's a lot of redditors here that live in the area maybe/hopefully someone saw something.
1 Kruse In other words it's basically creating a mini tornado.
Here are a few more options to add to the mix:
You could parse the lines yourself using ast.literal_eval, and then load them into a pd.DataFrame directly using an iterator over the lines:
import pandas as pd
import ast

with open('data', 'r') as f:
    lines = (ast.literal_eval(line) for line in f)
    header = next(lines)
    df = pd.DataFrame(lines, columns=header)

print(df)
Note, however, that calling ast.literal_eval once for each line may not be very fast, especially if your data file has a lot of lines. But if the data file is not too big, this may be an acceptable, simple solution.
Another option is to wrap an arbitrary iterator (which yields bytes) in an IterStream. This very general tool (thanks to Mechanical snail) allows you to manipulate the contents of any file and then re-package it into a file-like object. Thus, you can fix the contents of the file, and yet still pass it to any function which expects a file-like object, such as pd.read_csv. (Note: I've answered a similar question using the same tool, here.)
import io
import pandas as pd

def iterstream(iterable, buffer_size=io.DEFAULT_BUFFER_SIZE):
    """
    http://stackoverflow.com/a/20260030/190597 (Mechanical snail)
    Lets you use an iterable (e.g. a generator) that yields bytestrings as a
    read-only input stream.
    The stream implements Python 3's newer I/O API (available in Python 2's io
    module).
    For efficiency, the stream is buffered.
    """
    class IterStream(io.RawIOBase):
        def __init__(self):
            self.leftover = None
        def readable(self):
            return True
        def readinto(self, b):
            try:
                l = len(b)  # We're supposed to return at most this much
                chunk = self.leftover or next(iterable)
                output, self.leftover = chunk[:l], chunk[l:]
                b[:len(output)] = output
                return len(output)
            except StopIteration:
                return 0    # indicate EOF
    return io.BufferedReader(IterStream(), buffer_size=buffer_size)

def clean(f):
    for line in f:
        yield line.strip()[1:-1] + b'\n'

with open('data', 'rb') as f:
    # https://stackoverflow.com/a/50334183/190597 (Davide Fiocco)
    df = pd.read_csv(iterstream(clean(f)), skipinitialspace=True, quotechar='"')
    print(df)
A pure pandas option is to change the separator from ',' to '", "' in order to have only 2 columns, and then strip the unwanted characters, which to my understanding are '[', ']', '"' and space:
import pandas as pd
import io
string = '''
["Author", "Message"]
["littleblackcat", " There's a lot of redditors here that live in the area maybe/hopefully someone saw something. "]
["Kruse", "In other words, it's basically creating a mini tornado."]
'''
df = pd.read_csv(io.StringIO(string),sep='\", \"', engine='python').apply(lambda x: x.str.strip('[\"] '))
# the \" instead of simply " is to make sure python does not interpret is as an end of string character
df.columns = [df.columns[0][2:],df.columns[1][:-2]]
print(df)
# Output (note the space before the There's is also gone
# Author Message
# 0 littleblackcat There's a lot of redditors here that live in t...
# 1 Kruse In other words, it's basically creating a mini...
For now the following solution was found:
sep = '[|"|]'
Using a multi-character separator allowed the brackets to be stored in separate columns of the pandas dataframe, which were then dropped. This avoids having to strip the words line by line.
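A hedged sketch of how that separator could be wired up; the filename is assumed, and since the exact column layout after the split depends on the data, it is worth inspecting df.head() before deciding which columns to drop:
import pandas as pd

df = pd.read_csv('test.txt', sep='[|"|]', engine='python', header=None)
print(df.head())                   # locate the columns that only hold brackets or ', '
df = df.drop(columns=[0, 2, 4])    # assumed positions for the sample data shown above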

Python: How to split data into different data types into a 2D array

I’m trying to split downloaded data into a 2D array with different data types. The downloaded data looks like this:
000|17:40
000|17:45
010|17:50
025|17:55
056|18:00
178|18:05
202|18:10
203|18:15
190|18:20
072|18:25
013|18:30
002|18:35
000|18:40
000|18:45
000|18:50
000|18:55
000|19:00
000|19:05
000|19:10
000|19:15
000|19:20
000|19:25
000|19:30
000|19:35
000|19:40
I’m using the following code to parse this into a two dimensional array:
#!/usr/bin/python
import urllib2

response = urllib2.urlopen('http://gps.buienradar.nl/getrr.php?lat=52&lon=4')
html = response.read()
htmlsplit = []
for record in html.split("\r\n"):
    htmlsplit.append(record.split("|"))
print htmlsplit
This is working great, but as expected, it treats everything as a string. I’ve found some examples that split into integers. That would be great if both sides were integers, but in my case it’s an integer | string (or maybe some kind of Python time format).
How can I split this directly into different data types?
Something like this?
for record in html.split("\r\n"):  # beware, newlines are treacherous!
    s = record.split("|")
    htmlsplit.append((int(s[0]), s[1]))
Just write a parser for each record, if you have data this simple. However, I would add a try/except clause to catch errors for non-conforming lines, empty lines, etc. which may be present in the data. The code above is very fragile. Also, you might want to split at only \n and then clean your strings with strip() (i.e. replace s[1] by s[1].strip()); the integer conversion takes care of that automatically.
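A minimal sketch incorporating those suggestions: splitting on '\n' only, stripping the time string, and skipping empty or non-conforming lines with try/except:
htmlsplit = []
for record in html.split("\n"):
    try:
        s = record.split("|")
        htmlsplit.append((int(s[0]), s[1].strip()))
    except (ValueError, IndexError):
        continue  # skip empty or malformed lines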
Use str.splitlines instead of splitting on \r\n
Use the csv module to iterate over the lines:
import csv

txt = '000|17:40\n000|17:45\n000|17:50\n000|17:55\n000|18:00\n000|18:05\n000|18:10\n000|18:15\n000|18:20\n000|18:25\n000|18:30\n000|18:35\n000|18:40\n000|18:45\n000|18:50\n000|18:55\n000|19:00\n000|19:05\n000|19:10\n000|19:15\n000|19:20\n000|19:25\n000|19:30\n000|19:35\n000|19:40\n'
reader = csv.reader(txt.splitlines(), delimiter='|')
column1 = []
column2 = []
for c1, c2 in reader:
    column1.append(c1)
    column2.append(c2)
You can also use the DictReader
import StringIO

reader2 = csv.DictReader(StringIO.StringIO(txt),
                         fieldnames=['int', 'time'],
                         delimiter='|')
column1 = []
column2 = []
for row in reader2:
    column1.append(row['time'])
    column2.append(row['int'])
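Since the question asks for different data types, here is a small hedged follow-up that converts the parsed strings into an int and a datetime.time; the format string '%H:%M' is an assumption based on the sample data:
import csv
from datetime import datetime

typed_rows = [(int(c1), datetime.strptime(c2, '%H:%M').time())
              for c1, c2 in csv.reader(txt.splitlines(), delimiter='|')]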
