Combining two scripts into one code for csv file data verification - python

Hello everyone. I currently have two scripts that I would like to combine into one. The first script finds missing timestamps in a data set, fills in a blank row of NaN values for each one, and saves the result to an output file. The second script tests each row of a data set and creates a new column with True/False values based on the test condition.
If I run each script as a function and then call both from another function, I get two separate output files. How can I make this run with only one saved output file?
First Code
import pandas as pd
df = pd.read_csv("data5.csv", index_col="DateTime", parse_dates=True)
df = df.resample('1min').mean()
df = df.reindex(pd.date_range(df.index.min(), df.index.max(), freq="1min"))
df.to_csv("output.csv", na_rep='NaN')
Second Code
with open('data5.csv', 'r') as f:
    rows = [row.split(',') for row in f]
rows = [[cell.strip() for cell in row if cell] for row in rows]
def isValidRow(row):
    return float(row[5]) <= 900 or all(float(val) > 7 for val in row[1:4])
header, rows = rows[0], rows[1:]
validRows = list(map(isValidRow, rows))
with open('output.csv', 'w') as f:
    f.write(','.join(header + ['IsValid']) + '\n')
    for row, valid in zip(rows, validRows):
        f.write(','.join(row + [str(valid)]) + '\n')

Put each script into a function that takes the input and output file names:
def first_code(file_in, file_out):
    df = pd.read_csv(file_in, ... )
    ...
    df.to_csv(file_out, ...)
def second_code(file_in, file_out):
    with open(file_in, 'r') as f:
        ...
    ....
    with open(file_out, 'w') as f:
        ...
Your solution can then be:
first_code('data5.csv', 'output.csv')
second_code('output.csv', 'output.csv')
Hope it helps.
Note that there is no problem with reading from and writing to the same file, as long as the file has been closed after reading and before it is reopened for writing; otherwise you risk side effects. Using with closes the file implicitly, which is good practice.

In the second script, change the input file from data5.csv to output.csv, so that it reads the file written by the first script. If you keep the scripts as separate files (file1.py and file2.py), make sure they are in the same directory. Combined into a single file, the modified code looks like this:
import pandas as pd
df = pd.read_csv("data5.csv", index_col="DateTime", parse_dates=True)
df = df.resample('1min').mean()
df = df.reindex(pd.date_range(df.index.min(), df.index.max(), freq="1min"))
df.to_csv("output.csv", na_rep='NaN')
with open('output.csv', 'r') as f:
    rows = [row.split(',') for row in f]
rows = [[cell.strip() for cell in row if cell] for row in rows]
def isValidRow(row):
    return float(row[5]) <= 900 or all(float(val) > 7 for val in row[1:4])
header, rows = rows[0], rows[1:]
validRows = list(map(isValidRow, rows))
with open('output.csv', 'w') as f:
    f.write(','.join(header + ['IsValid']) + '\n')
    for row, valid in zip(rows, validRows):
        f.write(','.join(row + [str(valid)]) + '\n')
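Both steps can also be done in one pass with pandas, so that only one file is ever written. The following is a minimal sketch rather than the original poster's code: the function name fill_and_flag is hypothetical, and it assumes the CSV has a DateTime column followed by at least five numeric columns, so that row[5] and row[1:4] in isValidRow correspond to df.iloc[:, 4] and df.iloc[:, 0:3] once DateTime becomes the index.

```python
import pandas as pd


def fill_and_flag(file_in, file_out):
    """Fill missing minutes with NaN rows and add an IsValid column (hypothetical helper)."""
    df = pd.read_csv(file_in, index_col="DateTime", parse_dates=True)
    # Resampling to 1-minute bins creates a row for every minute between
    # the first and last timestamp; minutes with no data become NaN.
    df = df.resample("1min").mean()
    # Assumption: 5th data column <= 900, or the first three data columns all > 7.
    # Comparisons against NaN are False, so filled-in rows are flagged False.
    df["IsValid"] = (df.iloc[:, 4] <= 900) | (df.iloc[:, 0:3] > 7).all(axis=1)
    df.to_csv(file_out, na_rep="NaN", index_label="DateTime")
```

It would be called as fill_and_flag('data5.csv', 'output.csv'). Doing the test on the DataFrame avoids re-reading the intermediate file and parsing it by hand.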

Related

Python output every nth row without Pandas

I'm really new to Python and my task is to rewrite a CSV with Python. I managed to program a working script for my task already. Now I would like to get only every 10th row of the CSV as output.
Is there an easy way to do this?
I already tried to use Jason Reeks answer.
Now it works, thank you!
import csv
import sys
userInputFileName = sys.argv[1]
outPutFileSkipped = userInputFileName.split('.')[0] + '-Skipped.csv'
cnt = 0  # number of rows written
with open(userInputFileName, newline='') as inputCSV, open(outPutFileSkipped, 'w', newline='') as outputCSV:
    # Strip NUL characters from each line before parsing
    csv_reader_object_skipped = csv.reader((x.replace('\0', '') for x in inputCSV), delimiter=',')
    csv_writer_object_skipped = csv.writer(outputCSV, delimiter=',')
    for row, line in enumerate(csv_reader_object_skipped):
        if row % 10 == 0:
            print(line)
            csv_writer_object_skipped.writerow(line)
            cnt += 1
print('Successfully formatted ' + str(cnt) + ' rows!')
Here's a native way to do it without pandas:
import csv
with open('file.csv', 'r') as f:
    reader = csv.reader(f)
    for row, line in enumerate(reader):
        # Depending on your reference point you may want to + 1 to row
        # to get every 10th row.
        if row % 10 == 0:
            print(line)
There's an easy way with Pandas:
import pandas as pd
df = pd.DataFrame({"a": range(100), "b": range(100, 200)})
df.loc[::10]
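Another option without pandas is itertools.islice, which steps through any iterator at a fixed interval. This is only a sketch: the helper name every_nth_row and its parameter n are hypothetical, and it keeps the first row (index 0) as the starting point, like the row % 10 == 0 test.

```python
import csv
from itertools import islice


def every_nth_row(lines, n=10):
    """Return every n-th parsed CSV row, starting with the first one."""
    reader = csv.reader(lines)
    # islice(iterable, start, stop, step): stop=None means "until exhausted".
    return list(islice(reader, 0, None, n))
```

With a file, this could be used as: with open('file.csv', newline='') as f: rows = every_nth_row(f).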

How to get the values occurring only once in first column of a csv file using python

I am new to Python. I'm trying to read a CSV file with 700 lines, including a header, and get a list of the unique values in the first column.
Sample CSV:
SKU;PRICE;SUPPLIER
X100;100;ABC
X100;120;ADD
X101;110;ABV
X102;100;ABC
X102;105;ABV
X100;119;ABG
I used the example here
How to create a list in Python with the unique values of a CSV file?
so I did the following:
import csv
mainlist=[]
with open('final_csv.csv', 'r', encoding='utf-8') as csvf:
    rows = csv.reader(csvf, delimiter=";")
    for row in rows:
        if row[0] not in rows:
            mainlist.append(row[0])
print(mainlist)
I noticed while debugging that rows contains 1 line, not 700, and I get only the ['SKU'] field. What did I do wrong?
Thank you.
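A likely cause, assuming the file looks like the sample above: csv.reader returns an iterator, so the test row[0] not in rows reads through all the remaining rows while looking for a match, and nothing is left for the for loop to continue with. Checking against the result list instead avoids this. The helper name first_column_values below is hypothetical, and the sketch skips the header row.

```python
import csv


def first_column_values(lines, delimiter=";"):
    """Return the distinct first-column values in order of first appearance."""
    rows = csv.reader(lines, delimiter=delimiter)
    next(rows, None)  # skip the header row
    seen = []
    for row in rows:
        # Compare against the values collected so far, not against the reader.
        if row and row[0] not in seen:
            seen.append(row[0])
    return seen
```

With a file, this could be used as: with open('final_csv.csv', encoding='utf-8', newline='') as csvf: mainlist = first_column_values(csvf).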
A solution using pandas. You'll need to call the unique method on the correct column, this will return a pandas series with the unique values in that column, then convert it to a list using the tolist method.
An example on the SKU column below.
import pandas as pd
df = pd.read_csv('final_csv.csv', sep=";")
sku_unique = df['SKU'].unique().tolist()
If you don't know / care for the column name you can use iloc on the correct number of column. Note that the count index starts at 0:
df.iloc[:,0].unique().tolist()
If the question is asking for only the values that occur exactly once, you can use the value_counts method. This creates a series whose index holds the SKU values and whose values are the counts; you then convert the index of the filtered series to a list in a similar manner. Using the first example:
import pandas as pd
df = pd.read_csv('final_csv.csv', sep=";")
sku_counts = df['SKU'].value_counts()
sku_single_counts = sku_counts[sku_counts == 1].index.tolist()
If you want the unique values of the first column, you could modify your code to use a set instead of a list. Maybe like this:
import collections
import csv
filename = 'final_csv.csv'
sku_list = []
with open(filename, 'r', encoding='utf-8') as f:
    csv_reader = csv.reader(f, delimiter=";")
    for i, row in enumerate(csv_reader):
        if i == 0:
            # skip the header
            continue
        try:
            sku = row[0]
            sku_list.append(sku)
        except IndexError:
            pass
print('All SKUs:')
print(sku_list)
sku_set = set(sku_list)
print('SKUs after removing duplicates:')
print(sku_set)
c = collections.Counter(sku_list)
sku_list_2 = [k for k, v in c.items() if v == 1]
print('SKUs that appear only once:')
print(sku_list_2)
with open('output.csv', 'w') as f:
    for sku in sorted(sku_set):
        f.write('{}\n'.format(sku))
A solution using neither pandas nor csv:
with open('file.csv', 'r') as f:
    lines = f.read().splitlines()[1:]
col0 = [v.split(';')[0] for v in lines]
# filter returns a lazy iterator in Python 3, so wrap it in list()
uniques = list(filter(lambda x: col0.count(x) == 1, col0))
or, using map (but less readable):
with open('file.csv', 'r') as f:
    col0 = list(map(lambda line: line.split(';')[0], f.read().splitlines()[1:]))
uniques = list(filter(lambda x: col0.count(x) == 1, col0))

Edit a piece of data inside a csv

I have a csv file looking like this
34512340,1
12395675,30
56756777,30
90673412,45
12568673,25
22593672,25
I want to be able to edit the data after the comma from python and then save the csv.
Does anybody know how I would be able to do this?
This bit of code below will write a new line, but not edit:
f = open("stockcontrol","a")
f.write(code)
Here is a sample, which adds 1 to the second column:
import csv
# In Python 3, open CSV files in text mode with newline=''
with open('data.csv', newline='') as infile, open('output.csv', 'w', newline='') as outfile:
    reader = csv.reader(infile)
    writer = csv.writer(outfile)
    for row in reader:
        # Transform the second column, which is row[1]
        row[1] = int(row[1]) + 1
        writer.writerow(row)
Notes
The csv module parses the CSV file correctly and is highly recommended.
By default, each row is parsed as text, which is why I converted the value to an integer: int(row[1])
Update
If you really want to edit the file "in place", then use the fileinput module:
import fileinput
for line in fileinput.input('data.csv', inplace=True):
    fields = line.strip().split(',')
    fields[1] = str(int(fields[1]) + 1) # "Update" second column
    line = ','.join(fields)
    print(line) # Write the line back to the file, in place
You can use pandas to edit the column you want, e.g. to increase the values in a column by n:
import pandas
n = 1  # amount to add to each value
data_df = pandas.read_csv('input.csv')
# Assign back to the column so the rest of the DataFrame is kept
data_df['column2'] = data_df['column2'] + n
print(data_df)
To add a different amount, change n.
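If the goal is to change the value for one particular code rather than every row, the file can be read into memory, edited, and written back. This is a minimal sketch under assumptions: the helper name set_quantity is hypothetical, the file is small enough to hold in memory, and the first column is treated as the code to match.

```python
import csv


def set_quantity(path, code, new_value):
    """Replace the second column of every row whose first column equals code.

    Returns the number of rows that were changed.
    """
    with open(path, newline="") as f:
        rows = list(csv.reader(f))
    changed = 0
    for row in rows:
        if len(row) >= 2 and row[0] == code:
            row[1] = str(new_value)
            changed += 1
    # Reopen the same file for writing only after it has been read and closed.
    with open(path, "w", newline="") as f:
        csv.writer(f).writerows(rows)
    return changed
```

For example, set_quantity('data.csv', '12395675', 31) would set the value after the comma to 31 on the matching line and leave the other lines unchanged.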

Replacing data to different file

First of all, we have two files:
file01.txt
101|10075.0|12|24/12/2015
102|1083.33|12|24/12/2015
The second file has only one line!
file02.txt
101|False|Section06
The first parameter (the code) is the same in both files and is unique.
I must replace a line in file01 with the matching line from file02. The match criterion is the first parameter (code).
I read a code from user input and call readlines on both files. What do I need to do next? I'm also working with lists.
Expected result:
input = 101
The output should be:
101|False|Section06
102|1083.33|12|24/12/2015
You could use csv.reader() to read the files into a dict keyed by the code, then update the values like this:
import csv
with open('file1') as f:
    d = {i[0]: i[1:] for i in csv.reader(f, delimiter='|')}
with open('file2') as f:
    d.update({i[0]: i[1:] for i in csv.reader(f, delimiter='|')})
And d looks like:
{'101': ['False', 'Section06'], '102': ['1083.33', '12', '24/12/2015']}
To get the expected output:
>>> ['|'.join([i[0]]+i[1]) for i in d.items()]
['101|False|Section06', '102|1083.33|12|24/12/2015']
And if you want to write them to a file:
with open('file1', 'w') as f:
    for i in d.items():
        # End each record with a newline so the lines do not run together
        f.write('|'.join([i[0]]+i[1]) + '\n')
Solution
This works for the given example:
with open('file01.txt') as fobj1, open('file02.txt') as fobj2:
    data1 = fobj1.readlines()
    data2 = fobj2.readline()
code = data2.split('|', 1)[0]
with open('file01.txt', 'w') as fobj_out:
    for line in data1:
        if line.split('|', 1)[0] == code:
            fobj_out.write(data2 + '\n')
        else:
            fobj_out.write(line)
Step by step
We open both files for reading:
with open('file01.txt') as fobj1, open('file02.txt') as fobj2:
    data1 = fobj1.readlines()
    data2 = fobj2.readline()
The read data looks like this:
>>> data1
['101|10075.0|12|24/12/2015\n', '102|1083.33|12|24/12/2015']
>>> data2
'101|False|Section06'
We only need the code from file02.txt:
>>> code = data2.split('|', 1)[0]
>>> code
'101'
The data2.split('|', 1) splits at |. Since we need only one split, we can limit it with 1.
Now we open file01.txt again. This time for writing:
with open('file01.txt', 'w') as fobj_out:
    for line in data1:
        if line.split('|', 1)[0] == code:
            fobj_out.write(data2 + '\n')
        else:
            fobj_out.write(line)
The line if line.split('|', 1)[0] == code: does the same split as above, but for every line of file01.txt. If the code is equal to the one from file02.txt, we write the line from file02.txt; otherwise we just write the line from file01.txt back.
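Since the question mentions working with lists, the same idea can be sketched as a function on lists of lines. This is an illustration under assumptions, not part of the answer above: the name replace_by_code is hypothetical, it allows file02.txt to contain more than one line, and it strips trailing newlines so that a final newline in either file does not matter.

```python
def replace_by_code(lines1, lines2):
    """Return lines1 with each line replaced by the lines2 entry sharing its code."""
    # Map code -> replacement line, ignoring blank lines.
    replacements = {}
    for line in lines2:
        line = line.rstrip("\n")
        if line:
            replacements[line.split("|", 1)[0]] = line
    result = []
    for line in lines1:
        line = line.rstrip("\n")
        code = line.split("|", 1)[0]
        # Keep the original line when no replacement exists for its code.
        result.append(replacements.get(code, line))
    return result
```

The returned list has no trailing newlines, so it could be written back with fobj_out.write('\n'.join(result) + '\n').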
You can simply concatenate the two sets of data into a single pandas.DataFrame(), as follows:
import pandas as pd
df1 = pd.DataFrame([[10075.0, 12,'24/12/2015'], [1083.33, 12, '24/12/2015']], index=[101,102], columns=['prc', 'code', 'date'])
'''
101|10075.0|12|24/12/2015
102|1083.33|12|24/12/2015
'''
df2 = pd.DataFrame([[False, 'Section06'], [True, 'Section07']], index=[101,102], columns=['Bool', 'Section'])
'''
101|False|Section06
102|True|Section07
'''
pd.concat([df1,df2], axis=1, join='outer')
Which gives:
          prc  code        date   Bool    Section
101  10075.00    12  24/12/2015  False  Section06
102   1083.33    12  24/12/2015   True  Section07
Now you can get rid of the columns you don't need (e.g. using DataFrame.drop()).

Writing a pandas DataFrame into a csv file with some empty rows

I create a one-column pandas DataFrame that contains only strings. One row is empty. When I write the file to disk, the empty row is written as a pair of empty quotes "", while I want no quotes at all. Here's how to replicate the issue:
import pandas as pd
df = "Name=Test\n\n[Actual Values]\nLength=12\n"
df = pd.DataFrame(df.split("\n"))
df.to_csv("C:/Users/Max/Desktop/Test.txt", header=False, index=False)
The output file should be like this:
Name=Test
[Actual Values]
Length=12
But instead is like this:
Name=Test
[Actual Values]
""
Length=12
Is there a way to instruct pandas not to write the quotes and to leave an empty row in the output text file? Thank you very much.
There is a parameter for DataFrame.to_csv called na_rep. If you have None values, it will replace them with whatever you pass into this field.
import pandas as pd
df = "Name=Test\n"
df += "\n[Actual Values]\n"
df += "Length=12\n"
df = pd.DataFrame(df.split("\n"))
df[df[0]==""] = None
df.to_csv("pandas_test.txt", header=False, index=False, na_rep=" ")
Unfortunately, it looks like passing in na_rep="" will print quotes into the csv. However, if you pass in a single space (na_rep=" ") it looks better aesthetically...
Of course you could always write your own function to output a csv, or simply replace the "" in the output file using:
with open(filename, 'r') as f:
    text = f.read()
text = text.replace("\"\"", "")
with open(filename, 'w') as f:
    f.write(text)
And here's how you could write your own to_csv() method:
def to_csv(df, filename, separator):
    with open(filename, 'w') as f:
        for row in df.values:
            for cell in row:
                f.write(cell + separator)
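Because the DataFrame has a single column of strings, the CSV writer can also be bypassed entirely by writing each value on its own line. The following is only a sketch: the helper name write_lines is hypothetical, and it assumes every value in the first column is a string.

```python
import pandas as pd


def write_lines(df, filename):
    """Write the first column of df to filename, one value per line, unquoted."""
    with open(filename, "w") as f:
        for value in df.iloc[:, 0]:
            # An empty string becomes an empty line instead of "".
            f.write(f"{value}\n")
```

For example, write_lines(df, "pandas_test.txt") would write the empty row as a blank line between Name=Test and [Actual Values].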
