Row reading issue in csv containing html format data


I have an HTML file containing a table of around 3500 rows. I want to read and print the rows that share the same value.
I converted the data to CSV, where the same data still appears in HTML format.
I want to print all rows containing "MyData", write them to another CSV, and then mail that file.
I tried BeautifulSoup but could not get the result.
I tried csv and pandas, but they did not return the expected output.
My Python code is as follows:
import csv
import numpy as np
import pandas as pd
import sys
csv.field_size_limit(sys.maxsize)
df = pd.read_csv('test.csv')
print(df.iloc[0:5])
Another attempt:
search_string = "MyData"
with open('test.csv') as f, open('test2.csv', 'w') as g:
    reader = csv.reader(f)
    next(reader, None)  # discard the header
    writer = csv.writer(g)
    for row in reader:
        if row[2] == search_string:
            writer.writerow(row[:2])
            print(row)
When I enter the complete cell value from info_data, I get that particular row, but not the other rows where the string "MyData" is present.
Thanks !

You are currently testing the entry for an exact match with your search string. That entry contains a JSON string, so you could use the in operator to check whether it contains search_string rather than matching it exactly, for example:
import csv
search_string = "MyData"
with open('test.csv', newline='') as f, open('test2.csv', 'w', newline='') as g:
    reader = csv.reader(f)
    next(reader, None)  # discard the header
    writer = csv.writer(g)
    for row in reader:
        if search_string in row[2]:
            writer.writerow(row[:2])
            print(row)
You would then want to add code to further decode your JSON data.
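A minimal sketch of the substring test above, run against in-memory sample data rather than the asker's file. The column names and the JSON-like cell contents are assumptions made up for illustration, as is the helper name filter_rows.

```python
import csv
import io

# Hypothetical sample data: the third column holds a JSON-like string.
SAMPLE = '''id,name,info_data
1,alpha,"{""source"": ""MyData"", ""value"": 10}"
2,beta,"{""source"": ""Other"", ""value"": 20}"
3,gamma,"{""source"": ""MyData"", ""value"": 30}"
'''

def filter_rows(text, search_string, column=2):
    """Return the data rows whose given column contains search_string."""
    reader = csv.reader(io.StringIO(text))
    next(reader, None)  # discard the header
    return [row for row in reader if search_string in row[column]]

matches = filter_rows(SAMPLE, "MyData")
for row in matches:
    print(row[:2])
```

Because the test is a substring match, it would also match cells where "MyData" appears as part of a longer word; decoding the cell with json.loads and comparing a specific key would be stricter.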


How to convert csv file into json in python so that the header of csv are keys of every json value

I have this use case
please create a function called “myfunccsvtojson” that takes in a filename path to a csv file (please refer to attached csv file) and generates a file that contains streamable line delimited JSON.
• Expected filename will be based on the csv filename, i.e. Myfilename.csv will produce Myfilename.json or File2.csv will produce File2.json. Please show this in your code and should not be hardcoded.
• csv file has 10000 lines including the header
• output JSON file should contain 9999 lines
• Sample JSON lines from the csv file below:
CSV:
nconst,primaryName,birthYear,deathYear,primaryProfession,knownForTitles
nm0000001,Fred Astaire,1899,1987,"soundtrack,actor,miscellaneous","tt0072308,tt0043044,tt0050419,tt0053137"
nm0000002,Lauren Bacall,1924,2014,"actress,soundtrack","tt0071877,tt0038355,tt0117057,tt0037382"
nm0000003,Brigitte Bardot,1934,\N,"actress,soundtrack,producer","tt0057345,tt0059956,tt0049189,tt0054452"
JSON lines:
{"nconst":"nm0000001","primaryName":"Fred Astaire","birthYear":1899,"deathYear":1987,"primaryProfession":"soundtrack,actor,miscellaneous","knownForTitles":"tt0072308,tt0043044,tt0050419,tt0053137"}
{"nconst":"nm0000002","primaryName":"Lauren Bacall","birthYear":1924,"deathYear":2014,"primaryProfession":"actress,soundtrack","knownForTitles":"tt0071877,tt0038355,tt0117057,tt0037382"}
{"nconst":"nm0000003","primaryName":"Brigitte Bardot","birthYear":1934,"deathYear":null,"primaryProfession":"actress,soundtrack,producer","knownForTitles":"tt0057345,tt0059956,tt0049189,tt0054452"}
What I am not able to understand is how the header can be used as the key for every value of the JSON.
Has anyone come across this scenario who can help me out?
This is what I was trying; I know the loop is not correct, but I am still figuring it out:
with open(file_name, encoding = 'utf-8') as file:
    csv_data = csv.DictReader(file)
    csvreader = csv.reader(file)
    # print(csv_data)
    keys = next(csvreader)
    print (keys)
    for i,Value in range(len(keys)), csv_data:
        data[keys[i]] = Value
        print (data)

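You can load the CSV into a pandas DataFrame and write it out as JSON. Passing lines=True together with orient='records' produces one JSON object per line:
import pandas as pd
df = pd.read_csv('data.csv')
df.to_json('data.json', orient='records', lines=True)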

import csv
import json
def csv_to_json(csv_file_path, json_file_path):
    data_dict = []
    with open(csv_file_path, encoding = 'utf-8') as csv_file_handler:
        csv_reader = csv.DictReader(csv_file_handler)
        for rows in csv_reader:
            data_dict.append(rows)
    with open(json_file_path, 'w', encoding = 'utf-8') as json_file_handler:
        json_file_handler.write(json.dumps(data_dict, indent = 4))
csv_to_json("/home/devendra/Videos/stackoverflow/Names.csv", "/home/devendra/Videos/stackoverflow/Names.json")
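The function above writes a single indented JSON array, whereas the question asks for line-delimited JSON with the output name derived from the input name. A minimal sketch of that variant follows. The function name csv_to_json_lines, the rule that all-digit fields become integers, and the treatment of \N as null are assumptions inferred from the sample lines in the question, not a confirmed specification.

```python
import csv
import json
import os
import tempfile

def csv_to_json_lines(csv_path):
    """Write one JSON object per CSV data row; return the output path."""
    # Myfilename.csv -> Myfilename.json, derived rather than hardcoded
    json_path = os.path.splitext(csv_path)[0] + ".json"
    with open(csv_path, newline="", encoding="utf-8") as src, \
         open(json_path, "w", encoding="utf-8") as dst:
        for row in csv.DictReader(src):  # header row supplies the keys
            record = {}
            for key, value in row.items():
                if value is None or value == r"\N":  # assumed null marker
                    record[key] = None
                elif value.isdecimal():  # assumed: all-digit fields are integers
                    record[key] = int(value)
                else:
                    record[key] = value
            dst.write(json.dumps(record, separators=(",", ":")) + "\n")
    return json_path

# Usage on a small temporary file built from the question's sample data
tmp_dir = tempfile.mkdtemp()
csv_path = os.path.join(tmp_dir, "Myfilename.csv")
with open(csv_path, "w", newline="", encoding="utf-8") as f:
    f.write("nconst,primaryName,birthYear,deathYear,primaryProfession,knownForTitles\n"
            'nm0000003,Brigitte Bardot,1934,\\N,"actress,soundtrack,producer","tt0057345,tt0059956"\n')
out_path = csv_to_json_lines(csv_path)
with open(out_path, encoding="utf-8") as f:
    lines = f.read().splitlines()
print(lines[0])
```

Writing row by row keeps memory use flat, so the same loop should cope with files much larger than the 10000 lines mentioned in the question.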

How to overwrite a particular column of a csv file using pandas or normal python?

I am new to python. I have a .csv file which has 13 columns. I want to round off the floating values of the 2nd column which I was able to achieve successfully. I did this and stored it in a list. Now I am unable to figure out how to overwrite the rounded off values into the same csv file and into the same column i.e. column 2? I am using python3. Any help will be much appreciated.
My code is as follows:
Import statements for module import:
import csv
Creating an empty list:
list_string = []
Reading a csv file
with open('/home/user/Desktop/wine.csv', 'r') as csvDataFile:
    csvReader = csv.reader(csvDataFile, delimiter = ',')
    next(csvReader, None)
    for row in csvReader:
        floatParse = float(row[1])
        closestInteger = int(round(floatParse))
        stringConvert = str(closestInteger)
        list_string.append(stringConvert)
print(list_string)
Writing into the same csv file for the second column (Overwrites the entire Excel file)
with open('/home/user/Desktop/wine.csv', 'w') as csvDataFile:
    writer = csv.writer(csvDataFile)
    next(csvDataFile)
    row[1] = list_string
    writer.writerows(row[1])
PS: The writing into the csv overwrites the entire csv and removes all the other columns which I don't want. I just want to overwrite the 2nd column with rounded off values and keep the rest of the data same.

This might be what you're looking for.
import pandas as pd
# Some sample data
data = {"Document_ID": [102994, 51861, 51879, 38242, 60880, 76139, 76139],
        "SecondColumnName": [7.256, 1.222, 3.16547, 4.145658, 4.154656, 6.12, 17.1568],
        }
wine = pd.DataFrame(data)
# This is how you'd read in your data
# wine = pd.read_csv('/home/user/Desktop/wine.csv')
# Replace SecondColumnName with the real column name
wine["SecondColumnName"] = wine["SecondColumnName"].map('{:,.2f}'.format)
# This overwrites the file, keeping all the other columns as before;
# index=False stops pandas from adding an extra index column
wine.to_csv('/home/user/Desktop/wine.csv', index=False)
pandas is much easier than reading the CSV by hand; I'd recommend checking it out.

I think this better answers the specific question. The key is to define both an input_file and an output_file in the with statement.
StringIO is only there to supply sample data in this example. newline='' is needed in Python 3; without it, blank lines typically appear between rows in the output on Windows.
import csv
from io import StringIO
s = '''A,B,C,D,E,F,G,H,I,J,K,L
1,4.4343,3,4,5,6,7,8,9,10,11
1,8.6775433,3,4,5,6,7,8,9,10,11
1,16.83389832,3,4,5,6,7,8,9,10,11
1,32.2711122,3,4,5,6,7,8,9,10,11
1,128.949483,3,4,5,6,7,8,9,10,11'''
list_string = []
with StringIO(s) as input_file, open('output_file.csv', 'w', newline='') as output_file:
    reader = csv.reader(input_file)
    next(reader, None)
    writer = csv.writer(output_file)
    for row in reader:
        floatParse = float(row[1]) + 1
        closestInteger = int(round(floatParse))
        stringConvert = str(closestInteger)
        row[1] = stringConvert
        writer.writerow(row)
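A minimal sketch of the same read-modify-write idea that also copies the header through and rounds without the extra + 1. It works on in-memory text so it can be run as-is; the sample data and the helper name round_second_column are made up for illustration.

```python
import csv
import io

# Hypothetical three-column sample; the real file has 13 columns.
SAMPLE = """A,B,C
1,4.4343,3
1,8.6775433,3
1,16.83389832,3
"""

def round_second_column(text):
    """Return CSV text with column 2 rounded to the nearest integer."""
    reader = csv.reader(io.StringIO(text))
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(next(reader))  # copy the header unchanged
    for row in reader:
        row[1] = str(round(float(row[1])))  # only column 2 is modified
        writer.writerow(row)
    return out.getvalue()

result = round_second_column(SAMPLE)
print(result)
```

To update a file in place, one option is to write the result to a temporary file and then swap it over the original with os.replace, so the original is never half-written.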

How to search for a specific column value in csv file, if present , write first two column values to a new csv file with Python?

file1.csv:
Country,Location,number,letter,name,pup-name,null
a,ab,1,qw,abcd,test1,3
b,cd,1,df,efgh,test2,4
c,ef,2,er,fgh,test3,5
d,gh,3,sd,sds,test4,
e,ij,DDDD,we,sdrt,test5,
f,kl,6,sc,asdf,test6,
g,mn,7,df,xcxc,test7,
h,op,8,gb,eretet,test8,
i,qr,8,df,hjjh,test9,
I want to search for string/number in 3rd column of above csv file. And if present, write the 'first two column values' to another file.
For example:
In 3rd column, number 6 is present --- > Then I want write 'f','kl' into a new csv file (with headers)
In 3rd column, string DDDD is present ---> Then I want to write 'e','ij' into a new csv file.
Please guide me how we can do this with Python?
I am trying with below code:
import csv
import time
search_string = "1"
with open('file1.csv') as f, open('file3.csv', 'w') as g:
    reader = csv.reader(f)
    next(reader, None)  # discard the header
    writer = csv.writer(g)
    for row in reader:
        if row[2] == search_string:
            writer.writerow(row[:2])
But it is printing only the last two row values.

I don't see any problem in your code:
The third column of the row is row[2], you are right.
The first two columns are row[0:2] or row[:2], you are right.
If I simulate the reading, like this:
import io
import csv
data = """Country,Location,number,letter,name,pup-name,null
a,ab,1,qw,abcd,test1,3
b,cd,1,df,efgh,test2,4
c,ef,2,er,fgh,test3,5
d,gh,3,sd,sds,test4,
e,ij,DDDD,we,sdrt,test5,
f,kl,6,sc,asdf,test6,
g,mn,7,df,xcxc,test7,
h,op,8,gb,eretet,test8,
i,qr,8,df,hjjh,test9,
"""
with io.StringIO(data) as f:
    reader = csv.reader(f)
    next(reader, None)  # discard the header
    for row in reader:
        if row[2] == "1":
            print(row[:2])
It prints:
['a', 'ab']
['b', 'cd']
Change the value of search_string…
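The question also asks for headers in the output file. A minimal sketch, assuming the output header should be the first two names of the input header; the trimmed sample data and the helper name first_two_columns_where are illustrative only.

```python
import csv
import io

# Trimmed version of file1.csv from the question.
SAMPLE = """Country,Location,number,letter
a,ab,1,qw
e,ij,DDDD,we
f,kl,6,sc
"""

def first_two_columns_where(text, search_string):
    """Return CSV text holding the first two columns of matching rows."""
    reader = csv.reader(io.StringIO(text))
    header = next(reader)  # keep the header instead of discarding it
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(header[:2])
    for row in reader:
        if row[2] == search_string:  # exact match on the third column
            writer.writerow(row[:2])
    return out.getvalue()

print(first_two_columns_where(SAMPLE, "6"))
print(first_two_columns_where(SAMPLE, "DDDD"))
```

Values read by csv.reader are always strings, so the number 6 has to be searched for as "6".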

Reading column names alone in a csv file

I have a csv file with the following columns:
id,name,age,sex
Followed by a lot of values for the above columns.
I am trying to read the column names alone and put them inside a list.
I am using DictReader and this gives out the correct details:
with open('details.csv') as csvfile:
    i=["name","age","sex"]
    re=csv.DictReader(csvfile)
    for row in re:
        for x in i:
            print row[x]
But instead of hardcoding the list of columns ("i" in the above case), I need it to be parsed automatically from the input CSV.
with open('details.csv') as csvfile:
    rows=iter(csv.reader(csvfile)).next()
    header=rows[1:]
    re=csv.DictReader(csvfile)
    for row in re:
        print row
        for x in header:
            print row[x]
This gives out an error
KeyError: 'name'
in the line print row[x]. Where am I going wrong? Is it possible to fetch the column names using DictReader?

Though you already have an accepted answer, I figured I'd add this for anyone else interested in a different solution-
Python's DictReader object in the CSV module (as of Python 2.6 and above) has a public attribute called fieldnames.
https://docs.python.org/3.4/library/csv.html#csv.csvreader.fieldnames
An implementation could be as follows:
import csv
with open('C:/mypath/to/csvfile.csv', 'r') as f:
    d_reader = csv.DictReader(f)
    #get fieldnames from DictReader object and store in list
    headers = d_reader.fieldnames
    for line in d_reader:
        #print value in MyCol1 for each row
        print(line['MyCol1'])
In the above, d_reader.fieldnames returns a list of your headers (assuming the headers are in the top row).
Which allows...
>>> print(headers)
['MyCol1', 'MyCol2', 'MyCol3']
If your headers are in, say the 2nd row (with the very top row being row 1), you could do as follows:
import csv
with open('C:/mypath/to/csvfile.csv', 'r') as f:
    #you can eat the first line before creating DictReader.
    #if no "fieldnames" param is passed into
    #DictReader object upon creation, DictReader
    #will read the upper-most line as the headers
    f.readline()
    d_reader = csv.DictReader(f)
    headers = d_reader.fieldnames
    for line in d_reader:
        #print value in MyCol1 for each row
        print(line['MyCol1'])
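Applied to the question's columns, fieldnames can replace the hardcoded list. The KeyError in the question most likely arises because the separate csv.reader call consumes the header line first, so DictReader then takes the first data row as its field names. A minimal sketch with made-up sample rows:

```python
import csv
import io

# Hypothetical contents of details.csv
SAMPLE = "id,name,age,sex\n1,Ann,30,F\n2,Bob,25,M\n"

reader = csv.DictReader(io.StringIO(SAMPLE))
headers = reader.fieldnames  # reads the header row on first access
wanted = headers[1:]         # every column except "id"
rows = [[row[col] for col in wanted] for row in reader]

print(headers)
print(rows)
```

Only one reader touches the file here, so the header is consumed exactly once and the keys line up with the data rows.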

You can read the header with next(), which returns the next row of the reader's iterable object as a list. You can then add the rest of the file's content to a list. In Python 2:
import csv
with open("C:/path/to/.filecsv", "rb") as f:
    reader = csv.reader(f)
    i = reader.next()
    rest = list(reader)
Now i holds the column names as a list.
print i
>>>['id', 'name', 'age', 'sex']
Also note that reader.next() does not work in Python 3. Instead, use the built-in next() to get the first line of the CSV immediately after creating the reader, and open the file in text mode rather than binary mode:
import csv
with open("C:/path/to/.filecsv", newline="") as f:
    reader = csv.reader(f)
    i = next(reader)
    print(i)
>>>['id', 'name', 'age', 'sex']

The csv.DictReader object exposes an attribute called fieldnames, and that is what you'd use. Here's example code, followed by input and corresponding output:
import csv
file = "/path/to/file.csv"
with open(file, mode='r', encoding='utf-8') as f:
    reader = csv.DictReader(f, delimiter=',')
    for row in reader:
        print([col + '=' + row[col] for col in reader.fieldnames])
Input file contents:
col0,col1,col2,col3,col4,col5,col6,col7,col8,col9
00,01,02,03,04,05,06,07,08,09
10,11,12,13,14,15,16,17,18,19
20,21,22,23,24,25,26,27,28,29
30,31,32,33,34,35,36,37,38,39
40,41,42,43,44,45,46,47,48,49
50,51,52,53,54,55,56,57,58,59
60,61,62,63,64,65,66,67,68,69
70,71,72,73,74,75,76,77,78,79
80,81,82,83,84,85,86,87,88,89
90,91,92,93,94,95,96,97,98,99
Output of print statements:
['col0=00', 'col1=01', 'col2=02', 'col3=03', 'col4=04', 'col5=05', 'col6=06', 'col7=07', 'col8=08', 'col9=09']
['col0=10', 'col1=11', 'col2=12', 'col3=13', 'col4=14', 'col5=15', 'col6=16', 'col7=17', 'col8=18', 'col9=19']
['col0=20', 'col1=21', 'col2=22', 'col3=23', 'col4=24', 'col5=25', 'col6=26', 'col7=27', 'col8=28', 'col9=29']
['col0=30', 'col1=31', 'col2=32', 'col3=33', 'col4=34', 'col5=35', 'col6=36', 'col7=37', 'col8=38', 'col9=39']
['col0=40', 'col1=41', 'col2=42', 'col3=43', 'col4=44', 'col5=45', 'col6=46', 'col7=47', 'col8=48', 'col9=49']
['col0=50', 'col1=51', 'col2=52', 'col3=53', 'col4=54', 'col5=55', 'col6=56', 'col7=57', 'col8=58', 'col9=59']
['col0=60', 'col1=61', 'col2=62', 'col3=63', 'col4=64', 'col5=65', 'col6=66', 'col7=67', 'col8=68', 'col9=69']
['col0=70', 'col1=71', 'col2=72', 'col3=73', 'col4=74', 'col5=75', 'col6=76', 'col7=77', 'col8=78', 'col9=79']
['col0=80', 'col1=81', 'col2=82', 'col3=83', 'col4=84', 'col5=85', 'col6=86', 'col7=87', 'col8=88', 'col9=89']
['col0=90', 'col1=91', 'col2=92', 'col3=93', 'col4=94', 'col5=95', 'col6=96', 'col7=97', 'col8=98', 'col9=99']

How about:
with open(csv_input_path + file, 'r') as ft:
    header = ft.readline()  # read only the first line; returns a string
    header_list = header.rstrip('\r\n').split(',')  # returns a list, without the trailing newline
I am assuming your input file is in CSV format and that no header name contains a quoted comma.
With pandas this takes longer when the file is large, because it loads the entire data set.

I am just mentioning how to get all the column names from a csv file.
I am using pandas library.
First we read the file.
import pandas as pd
file = pd.read_csv('details.csv')
Then, in order to just get all the column names as a list from input file use:-
columns = list(file.head(0))

Thanking Daniel Jimenez for his perfect solution to fetch column names alone from my csv, I extend his solution to use DictReader so we can iterate over the rows using column names as indexes. Thanks Jimenez.
with open('myfile.csv') as csvfile:
    rest = []
    with open("myfile.csv", "rb") as f:
        reader = csv.reader(f)
        i = reader.next()
        i=i[1:]
        re=csv.DictReader(csvfile)
        for row in re:
            for x in i:
                print row[x]

Here is the code to print only the headers (column names) of the CSV file.
import csv
HEADERS = next(csv.reader(open('filepath.csv')))
print (HEADERS)
Another method with pandas
import pandas as pd
HEADERS = list(pd.read_csv('filepath.csv').head(0))
print (HEADERS)

import pandas as pd
data = pd.read_csv("data.csv")
cols = data.columns

I literally just wanted the first row of my data, which holds the headers I need, and didn't want to iterate over all my data to get them, so I just did this:
with open(data, 'r', newline='') as csvfile:
    t = 0
    for i in csv.reader(csvfile, delimiter=',', quotechar='|'):
        if t > 0:
            break
        else:
            dbh = i
            t += 1

Using pandas is also an option.
But instead of loading the full file into memory, you can retrieve only the first chunk of it to get the field names by using an iterator.
import pandas as pd
file = pd.read_csv('details.csv', iterator=True)
column_names_full = file.get_chunk(1)
column_names = [column for column in column_names_full]
print(column_names)

Delete blank rows from CSV?

I have a large csv file in which some rows are entirely blank. How do I use Python to delete all blank rows from the csv?
After all your suggestions, this is what I have so far
import csv
# open input csv for reading
inputCSV = open(r'C:\input.csv', 'rb')
# create output csv for writing
outputCSV = open(r'C:\OUTPUT.csv', 'wb')
# prepare output csv for appending
appendCSV = open(r'C:\OUTPUT.csv', 'ab')
# create reader object
cr = csv.reader(inputCSV, dialect = 'excel')
# create writer object
cw = csv.writer(outputCSV, dialect = 'excel')
# create writer object for append
ca = csv.writer(appendCSV, dialect = 'excel')
# add pre-defined fields
cw.writerow(['FIELD1_','FIELD2_','FIELD3_','FIELD4_'])
# delete existing field names in input CSV
# ???????????????????????????
# loop through input csv, check for blanks, and write all changes to append csv
for row in cr:
    if row or any(row) or any(field.strip() for field in row):
        ca.writerow(row)
# close files
inputCSV.close()
outputCSV.close()
appendCSV.close()
Is this ok or is there a better way to do this?

Use the csv module:
import csv
...
with open(in_fnam, newline='') as in_file:
    with open(out_fnam, 'w', newline='') as out_file:
        writer = csv.writer(out_file)
        for row in csv.reader(in_file):
            if row:
                writer.writerow(row)
If you also need to remove rows where all of the fields are empty, change the if row: line to:
if any(row):
And if you also want to treat fields that consist of only whitespace as empty you can replace it with:
if any(field.strip() for field in row):
Note that in Python 2.x, the csv module expected binary files, so you'd need to open your files with the 'b' flag. In 3.x, doing this will result in an error.
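The three conditions above differ only in what counts as blank. A minimal sketch of the strictest one (whitespace-only fields count as empty), run on in-memory text; the sample data and the helper name drop_blank_rows are made up for illustration.

```python
import csv
import io

# One truly empty line, one row of empty fields, one row of whitespace-only fields.
SAMPLE = "a,b,c\n\n1,2,3\n,,\n  , ,\n4,5,6\n"

def drop_blank_rows(text):
    """Return CSV text without rows whose fields are all empty or whitespace."""
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    for row in csv.reader(io.StringIO(text)):
        # An empty line parses as [], and any([]) is False, so it is dropped too.
        if any(field.strip() for field in row):
            writer.writerow(row)
    return out.getvalue()

cleaned = drop_blank_rows(SAMPLE)
print(cleaned)
```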

Surprised that nobody here mentioned pandas. Here is a possible solution; it works because read_csv skips blank lines by default (skip_blank_lines=True).
import pandas as pd
df = pd.read_csv('input.csv')
df.to_csv('output.csv', index=False)

Delete empty row from .csv file using python
import csv
...
with open('demo004.csv') as input, open('demo005.csv', 'w', newline='') as output:
    writer = csv.writer(output)
    for row in csv.reader(input):
        if any(field.strip() for field in row):
            writer.writerow(row)
Thankyou

You have to open a second file, write all non blank lines to it, delete the original file and rename the second file to the original name.
EDIT: a real blank line will be like '\n':
for line in f1.readlines():
    if line.strip() == '':
        continue
    f2.write(line)
A line with all blank fields would look like ',,,,,\n'. If you consider this a blank line:
for line in f1.readlines():
    if ''.join(line.split(',')).strip() == '':
        continue
    f2.write(line)
Opening, closing, deleting and renaming the files is left as an exercise for you. (hint: import os, help(open), help(os.rename), help(os.unlink))
EDIT2: Laurence Gonsalves brought to my attention that a valid csv file could have blank lines embedded in quoted csv fields, like 1, 'this\n\nis tricky',123.45. In this case the csv module will take care of that for you. I'm sorry Laurence, your answer deserved to be accepted. The csv module will also address the concerns about a line like "","",""\n.
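A minimal sketch of the difference EDIT2 describes, using a made-up record whose quoted field contains a blank line. The line-based filter splits the record and loses the embedded blank line, while the csv module keeps the field intact.

```python
import csv
import io

# Hypothetical data: the first record has a blank line inside a quoted field,
# and a genuinely blank line separates the two records.
SAMPLE = '1,"this\n\nis tricky",123.45\n\n2,plain,6.7\n'

# Line-based filtering treats the embedded blank line as a blank row.
naive = [line for line in SAMPLE.splitlines() if line.strip()]

# csv.reader parses the quoted field as one value spanning several lines.
rows = [row for row in csv.reader(io.StringIO(SAMPLE))
        if any(field.strip() for field in row)]

print(naive)
print(rows)
```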

Doing it with pandas is very simple. Open your CSV file with pandas:
import pandas as pd
df = pd.read_csv("example.csv")
# Check the number of missing values in each column
print(df.isnull().sum())
# Drop the rows in which every field is empty
modifiedDF = df.dropna(how='all')
# Save the result to a new CSV file
modifiedDF.to_csv('modifiedExample.csv', index=False)

Python code to remove blank lines from a CSV file without creating another file:
import csv
def ReadWriteconfig_file(file):
    try:
        file_object = open(file, 'r')
        lines = csv.reader(file_object, delimiter=',', quotechar='"')
        flag = 0
        data = []
        for line in lines:
            if line == []:
                flag = 1
                continue
            else:
                data.append(line)
        file_object.close()
        if flag == 1:  # a blank line is present in the file
            file_object = open(file, 'w')
            for line in data:
                str1 = ','.join(line)
                file_object.write(str1 + "\n")
            file_object.close()
    except Exception as e:
        print(e)

Here is a solution using pandas that removes blank rows.
import pandas as pd
df = pd.read_csv('input.csv')
df.dropna(axis=0, how='all',inplace=True)
df.to_csv('output.csv', index=False)

I need to do this but not have a blank row written at the end of the CSV file like this code unfortunately does (which is also what Excel does if you Save-> .csv). My (even simpler) code using the CSV module does this too:
import csv
input = open("M51_csv_proc.csv", 'rb')
output = open("dumpFile.csv", 'wb')
writer = csv.writer(output)
for row in csv.reader(input):
    writer.writerow(row)
input.close()
output.close()
M51_csv_proc.csv has exactly 125 rows; the program always outputs 126 rows, the last one being blank.
I've been through all these threads and nothing seems to change this behaviour.

This script removes all the CR / CRLF characters from a CSV file that has lines like this:
"My name";mail#mail.com;"This is a comment.
Thanks!"
Execute the script https://github.com/eoconsulting/lr2excelcsv/blob/master/lr2excelcsv.py
Result (in Excel CSV format):
"My name",mail#mail.com,"This is a comment. Thanks!"

Replace PATH_TO_YOUR_CSV with the path to your file:
import pandas as pd
df = pd.read_csv('PATH_TO_YOUR_CSV')
new_df = df.dropna()
new_df.to_csv('output.csv', index=False)
or in-line:
import pandas as pd
pd.read_csv('data.csv').dropna().to_csv('output.csv', index=False)

I had the same problem.
I read the .csv file into a DataFrame and then wrote the DataFrame back out to a .csv file.
The initial .csv file with the blank lines was 'csv_file_logger2.csv'.
So I did the following:
import pandas as pd
df = pd.read_csv('csv_file_logger2.csv')
df.to_csv('out2.csv', index=False)
