Python3: how to count columns in an external file

I am trying to count the number of columns in external files. Here is an example of a file, data.dat. Please note that it is not a CSV file. The whitespace is made up of spaces. Each file may have a different number of spaces between the columns.
Data Z-2 C+2
m_[a/b] -155555.0 -133333.0
n_[a/b] -188800.0 -133333.0
o_[a/b*Y] -13.5 -17.95
p1_[cal/(a*c)] -0.01947 0.27
p2_[a/b] -700.2 -200.44
p3_(a*Y)/(b*c) 5.2966 6.0000
p4_[(a*Y)/b] -22222.0 -99999.0
q1_[b/(b*Y)] 9.0 -6.3206
q2_[c] -25220.0 -171917.0
r_[a/b] 1760.0 559140
s 4.0 -4.0
I experimented with split(" ") but could not get it to handle runs of spaces; it treated every single space as a delimiter, so each run of spaces produced extra empty columns.
This seems promising but my attempt only counts the first column. It may seem silly to attempt a CSV method to deal with a non-CSV file. Maybe this is where my problems are coming from. However, I have used CSV methods before to deal with text files.
For example, I import my data:
import csv

with open(data) as csvfile:
    reader = csv.DictReader(csvfile)
    n_cols = len(reader.fieldnames)
When I use this, only the first column is recognized. The code is too long to post, but I know this is happening because when I manually enter n_cols = 3, I do get the results I expect.
It does work if I use commas to delimit the columns, but I can't do that (I need to use whitespace).
Does anyone know an alternative method that deals with arbitrary whitespace and non-CSV files? Thank you for any advice.

Yes, there are alternative methods:
Pandas
import pandas as pd

df = pd.read_csv('data.dat', delim_whitespace=True)
# newer pandas deprecates delim_whitespace; sep=r'\s+' is the equivalent
n_cols = len(df.columns)
NumPy
import numpy as np

arr = np.loadtxt('data.dat', dtype='str')
# or
arr = np.genfromtxt('data.dat', dtype='str')
n_cols = arr.shape[1]
Python's csv
If you want to use Python's csv library, you can normalize the whitespace before reading, e.g.:
import csv
import re

with open('data.dat') as csvfile:
    content = csvfile.read().strip()
normalized_content = re.sub(r' +', ' ', content)
reader = csv.reader(normalized_content.split('\n'), delimiter=' ')
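For the original goal (counting columns), note that plain str.split() with no argument already splits on runs of arbitrary whitespace, so no normalization or external library is needed. A minimal stdlib sketch, using the sample rows inline:

```python
# str.split() with no argument splits on any run of whitespace,
# so uneven spacing between columns is handled automatically.
sample = "Data       Z-2       C+2\nm_[a/b]    -155555.0  -133333.0\n"
first_line = sample.splitlines()[0]
n_cols = len(first_line.split())
print(n_cols)  # 3
```

In the real script you would read the first line from the file with f.readline() instead of using the inline sample.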

Related

Ignore delimiters between parentheses when reading CSV

I am reading in a csv file delimited by | but some of the data includes extra | characters. When this occurs it appears to be only between two parentheses (example below). I want to be able to read in the data into a dataframe without the columns being messed up (or failing) due to these extra | characters.
I've been trying to find a way to either
set the pandas read_csv delimiter to ignore delimiters between parentheses ()
or
parse over the csv file before loading it into a dataframe and remove any | characters between parentheses ()
I haven't had any luck so far. This is sample data that messes up when I attempt to pull it into a dataframe.
1|1|1|1|1|1|1|1||1||1||||2022-01-03 20:14:51|1||1|1|1%' AND 1111=DBMS_PIPE.RECEIVE_MESSAGE(ABC(111)||ABC(111)||ABC(111)||ABC(111),1) AND '%'='||website.com|192.168.1.1|Touch Email
I am trying to ignore the | characters between the parentheses () from (ABC(111) to ABC(111),1)
This pattern occurs repeatedly throughout the data, so I can't fix each occurrence by hand; I am trying to address it programmatically.
This person seems to be attempting something similar but their solution did not work for me (when changing to |)
Depending on the specifics of your input file you might succeed by applying a regular expression with this strategy:
Replace the substring containing the parentheses with a dummy substring.
Read the csv.
Re-replace the dummy substring with the original.
A dirty proof of concept looks as follows:
import re
import pandas as pd
from io import StringIO

test_line = """1|1|1|1|1|1|1|1||1||1||||2022-01-03 20:14:51|1||1|1|1%' AND 1111=DBMS_PIPE.RECEIVE_MESSAGE(ABC(111)||ABC(111)||ABC(111)||ABC(111),1) AND '%'='||website.com|192.168.1.1|Touch Email"""
# 1. step
dummy_string = '--REPLACED--'
pattern_brackets = re.compile(r'(\(.*\))')
pattern_replace = re.compile(dummy_string)
list_replaced = pattern_brackets.findall(test_line)
csvString = re.sub(pattern_brackets, dummy_string, test_line)
# 2. step
# Now I have to fake your csv
csvStringIO = StringIO(csvString)
df = pd.read_csv(csvStringIO, sep="|", header=None)
# 3. step - not the nicest way; just for illustration
# The critical column is the 20th.
new_col = pd.Series(
    [re.sub(pattern_replace, list_replaced[nRow], cell)
     for nRow, cell in enumerate(df.iloc[:, 20])])
df.loc[:, 20] = new_col
This works for the line given. Most probably you will have to adapt the recipe to the content of your input file. I hope you find your way from here.
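If the regex round-trip feels fragile, an alternative sketch (not from the answer above) is to split each line manually while tracking parenthesis depth, so any | inside (...) is kept as part of its field:

```python
def split_outside_parens(line, sep='|'):
    """Split on sep, ignoring separators inside (possibly nested) parentheses."""
    fields, buf, depth = [], [], 0
    for ch in line:
        if ch == sep and depth == 0:
            fields.append(''.join(buf))
            buf = []
            continue
        if ch == '(':
            depth += 1
        elif ch == ')' and depth > 0:
            depth -= 1
        buf.append(ch)
    fields.append(''.join(buf))
    return fields

test_line = """1|1|1|1|1|1|1|1||1||1||||2022-01-03 20:14:51|1||1|1|1%' AND 1111=DBMS_PIPE.RECEIVE_MESSAGE(ABC(111)||ABC(111)||ABC(111)||ABC(111),1) AND '%'='||website.com|192.168.1.1|Touch Email"""
fields = split_outside_parens(test_line)
print(len(fields))  # 25 fields; the injection string lands intact in fields[20]
```

The resulting list of fields can then be fed to pandas row by row, with no dummy-string bookkeeping needed.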

Reading rows in CSV file and appending a list creates a list of lists for each value

I am copying list output data from a DataCamp course so I can recreate the exercise in Visual Studio Code or Jupyter Notebook. From the DataCamp Python interactive window, I type the name of the list, highlight the output, and paste it into a new file in VSCode. I use find-and-replace to delete all the commas and spaces, leaving 142 numeric values, and I save the file as life_exp.csv. It looks like this:
43.828
76.423
72.301
42.731
75.32
81.235
79.829
75.635
64.062
79.441
When I read the file in using either pandas read_csv or csv.reader, and build a list (via values.tolist() with pandas, or a for loop appending to an empty list with csv), both cases give me a list of lists, which then does not display the data correctly when I try to create matplotlib histograms.
I used NotePad to save the data as well as a .csv and both ways of saving the data produce the same issue.
import matplotlib.pyplot as plt
import csv

life_exp = []
with open(r'C:\data\life_exp.csv', 'rt') as life_expcsv:
    exp_read = csv.reader(life_expcsv, delimiter='\n')
    for row in exp_read:
        life_exp.append(row)
And
import pandas as pd
life_exp_df = pd.read_csv('c:\\data\\life_exp.csv', header = None)
life_exp = life_exp_df.values.tolist()
When you print life_exp after importing using csv, you get:
[['43.828'],
['76.423'],
['72.301'],
['42.731'],
['75.32'],
['81.235'],
['79.829'],
['75.635'],
['64.062'],
['79.441'],
['56.728'],
….
And when you print life_exp after importing using pandas read_csv, you get the same thing, but at least now it's not a string:
[[43.828],
[76.423],
[72.301],
[42.731],
[75.32],
[81.235],
[79.829],
[75.635],
[64.062],
[79.441],
[56.728],
…
and when you call plt.hist(life_exp) on either version of the list, you get each value as bin of 1.
I just want to read each value in the csv file and put each value into a simple Python list.
I have spent days scouring stackoverflow thinking someone has done this, but I can't seem to find an answer. I am very new to Python, so your help is greatly appreciated.
Try:
import pandas as pd
life_exp_df = pd.read_csv('c:\\data\\life_exp.csv', header = None)
# Select the values of your first column as a list
life_exp = life_exp_df.iloc[:, 0].tolist()
instead of:
life_exp = life_exp_df.values.tolist()
With csv.reader, each line is parsed into a list using the delimiter you provide. In this case you provide \n as the delimiter, so each line still comes back as a single-item list.
When you append each row, you are essentially appending that list to another list. The simplest workaround is to index into row to extract the value:
with open(r'C:\data\life_exp.csv', 'rt') as life_expcsv:
    exp_read = csv.reader(life_expcsv, delimiter='\n')
    for row in exp_read:
        life_exp.append(row[0])
However, if your data is not guaranteed to be formatted the way you have provided, you will need to handle that a bit differently:
with open(r'C:\data\life_exp.csv', 'rt') as life_expcsv:
    exp_read = csv.reader(life_expcsv, delimiter='\n')
    for row in exp_read:
        for number in row:
            life_exp.append(number)
A bit cleaner with list comprehension:
with open(r'C:\data\life_exp.csv', 'rt') as life_expcsv:
    exp_read = csv.reader(life_expcsv, delimiter='\n')
    [life_exp.append(number) for row in exp_read for number in row]
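Since the file is just one number per line, another option (a stdlib-only sketch, not using the csv module at all) is to convert each non-blank line straight to float, which gives plt.hist() the flat numeric list it expects:

```python
# With one value per line, each non-blank line converts directly to a float.
sample = "43.828\n76.423\n72.301\n42.731\n"
life_exp = [float(line) for line in sample.splitlines() if line.strip()]
print(life_exp)  # [43.828, 76.423, 72.301, 42.731]
```

In the real script you would iterate over open(r'C:\data\life_exp.csv') instead of sample.splitlines().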

Pandas python replace empty lines with string

I have a csv which at some point becomes like this:
57926,57927,"79961', 'dsfdfdf'",fdfdfdfd,0.40997048,5 x fdfdfdfd,
57927,57928,"fb0ec52878b165aa14ae302e6064aa636f9ca11aa11f5', 'fdfd'",fdfdfd,1.64948454,20 fdfdfdfd,"
US
"
57928,57929,"f55bf599dba600550de724a0bec11166b2c470f98aa06', 'fdfdf'",fdfdfd,0.81300813,10 fdfdfdfd,"
US
"
57929,57930,"82e6b', 'reetrtrt'",trtretrtr,0.79783365,fdfdfdf,"
NL
I want to get rid of these empty lines. So far I have tried the following script:
import pandas as pd

df = pd.read_csv("scedon_etoimo.csv")
df = df.replace(r'\\n', ' ', regex=True)
and
df=df.replace(r'\r\r\r\r\n\t\t\t\t\t\t', '',regex=True)
as this is the pattern I am seeing. So far I haven't managed to clean my file and do what I want. I am not sure if I am using the correct approach. I am using pandas to process my dataset. Any help?
I would first open and preprocess the file's data, and only then pass it to pandas:
import io
import pandas as pd

lines = []
with open('file.csv') as f:
    for line in f:
        if line.strip():
            lines.append(line.strip())
df = pd.read_csv(io.StringIO("\n".join(lines)))
Based on the file snippet you provided, here is how you can replace those empty lines Pandas is storing as NaNs with a blank string:
import numpy as np
import pandas as pd

df = pd.read_csv("scedon_etoimo.csv")
df = df.replace(np.nan, "", regex=True)
This will allow you to do everything on the base Pandas DataFrame without reading through your file(s) more than once. That being said, I would also recommend preprocessing your data before loading it in, as that is often a much safer way to handle data in non-uniform layouts.
Try:
df.replace(to_replace=r'[\n\r\t]', value='', regex=True, inplace=True)
This instruction replaces each \n, \r and Tab with nothing.
Due to inplace argument, no need to substitute the result to df again.
Alternative: use to_replace=r'\s' to also eliminate spaces, maybe in selected columns only.
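One more stdlib-only angle (a sketch, assuming the stray newlines live inside quoted fields as in the snippet above): the csv module already parses multi-line quoted fields correctly, so you can clean each field after parsing instead of regex-editing the raw text:

```python
import csv
import io

# Mimics the quoted "\nUS\n" fields from the question's snippet.
raw = '57927,57928,foo,"\nUS\n"\n57928,57929,bar,"\nUS\n"\n'
rows = [[field.replace('\n', ' ').strip() for field in row]
        for row in csv.reader(io.StringIO(raw))]
print(rows)  # [['57927', '57928', 'foo', 'US'], ['57928', '57929', 'bar', 'US']]
```

The cleaned rows can then be handed to pd.DataFrame(rows) directly.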

Merging column values into one column without comma on python (without pandas)

Question on how to merge different column values into one column, without commas, in Python...
My task is like this.
A big CSV file has rows like the following:
s,0,6,8,9,2,-,3,6,2,8,7,1,0,n,.,c,s,v
s,0,5,9,6,0,-,3,6,7,0,1,6,0,n,.,c,s,v
s,1,9,0,5,5,-,3,6,1,5,5,8,6,n,.,c,s,v
s,2,8,0,7,9,-,3,2,5,1,8,2,7,n,.,c,s,v
s,0,0,5,6,5,-,3,3,4,0,5,7,0,n,.,c,s,v
s,3,0,3,4,8,-,3,5,9,1,2,2,6,n,.,c,s,v
s,0,3,8,8,9,-,3,7,3,1,0,2,5,n,.,c,s,v
I want to make this look like follow:
06892
05960
19055
28079
00565
30348
03889
I attempted the following code, without success.
import csv

with open('/Desktop/case.csv', 'r') as h:
    reader = csv.reader(h)
    for row in reader:
        k = row[1:6]
        print(k)
When I did this, the following results came up:
0,6,8,9,2
0,5,9,6,0
1,9,0,5,5
2,8,0,7,9
0,0,5,6,5
3,0,3,4,8
0,3,8,8,9
How can I make this look like my desired output, i.e. without commas?
Use join:
from io import StringIO
import csv
txtfile = StringIO("""s,0,6,8,9,2,-,3,6,2,8,7,1,0,n,.,c,s,v
s,0,5,9,6,0,-,3,6,7,0,1,6,0,n,.,c,s,v
s,1,9,0,5,5,-,3,6,1,5,5,8,6,n,.,c,s,v
s,2,8,0,7,9,-,3,2,5,1,8,2,7,n,.,c,s,v
s,0,0,5,6,5,-,3,3,4,0,5,7,0,n,.,c,s,v
s,3,0,3,4,8,-,3,5,9,1,2,2,6,n,.,c,s,v
s,0,3,8,8,9,-,3,7,3,1,0,2,5,n,.,c,s,v""")
reader = csv.reader(txtfile)
for row in reader:
    k = row[1:6]
    print(''.join(k))
Output:
06892
05960
19055
28079
00565
30348
03889

Python does not read CSV file properly, could not convert string to float value error

I'm fairly new to coding (and this type of website) and right now I'm having difficulties with python. I need python to read data from a CSV file:
Here's a snippet of my code:
import numpy as np

data = np.loadtxt('Users/User/Documents/Data.csv', delimiter=',')
firstrow = data[0:,]
Here is just a sample of the CSV (the actual file is very large and contains a row of 2000+ numbers)
2 -2 2 5 -4 -2 0 4 -5
I want Python to read the first row of the file, but whenever I run the program it says "could not convert string to float". I don't understand what the problem is here or how I can fix it without making a new file (as mentioned before, the file is very large and remaking it would take a very long time), but any help would be very much appreciated, thanks!
I would use pandas. It is much faster than np.loadtxt.
import pandas as pd
data = pd.read_csv('Users/User/Documents/Data.csv', header = None)
The default delimiter is a comma, however if you want to pass another delimiter such as tab just do data = pd.read_csv('Users/User/Documents/Data.csv', sep = '\t'). If the delimiter should be any whitespace instead do data = pd.read_csv('Users/User/Documents/Data.csv', delim_whitespace = True).
If you only need to read the first row of your data, then you can just do pd.read_csv('Users/User/Documents/Data.csv', nrows = 1)
To convert from a pandas dataframe to numpy array, just do:
data_np = data.values
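Since the sample row is space-separated, the "could not convert string to float" error most likely comes from the comma delimiter: with delimiter=',' the whole line becomes one field that cannot be parsed as a number. A stdlib sketch of reading just the first row (with the sample data inline; in practice you would call f.readline() on the open file):

```python
# split() with no argument handles any run of whitespace,
# so there is no delimiter guessing to get wrong.
sample_first_line = "2 -2 2 5 -4 -2 0 4 -5"
first_row = [float(x) for x in sample_first_line.split()]
print(first_row)  # [2.0, -2.0, 2.0, 5.0, -4.0, -2.0, 0.0, 4.0, -5.0]
```
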
