I'm quite confused by a CSV data-cleaning task; any help would be appreciated.
I have a CSV file with multiple columns, like this:
col_1;col_2;col_3;col_4;col_5;col_6;col_7;col_8;
Object1;123;Something;456;Something2;0;0;someword;789
Object2;123;Something;456;Something2;0;0;someword;789
Object3;123;Something;456;Something2;0;0;someword;789
Object4;123;Something;456;Something2;0;0;someword;789
But some Objects have missing data in col_6, col_7 and col_8; instead, there's a Keyword in col_6, like this:
col_1;col_2;col_3;col_4;col_5;col_6;col_7;col_8;
Object1;123;Something;456;Something2;0;0;someword;789
Object2;123;Something;456;Something2;0;0;someword;789
Object3;123;Something;456;Something2;Keyword;789;Object4;123
Something;456;Something2;0;0
I've detected how many lines contain those Keywords, and their row numbers:
import csv

class FixIt:
    def test(self):
        count = 0
        with open('input.csv', mode='r') as file:
            reader = csv.reader(file, delimiter=';')
            for num, row in enumerate(reader):
                if 'Keyword' in row:
                    print(num, row)
                    count += 1
        print(count)

TryIt = FixIt()
TryIt.test()
I need to put two zeros (or some string values) in the cells before the Keyword so the output is re-ordered back to the original structure, like:
col_1;col_2;col_3;col_4;col_5;col_6;col_7;col_8;col_9
Object1;123;Something;456;Something2;0;0;someword;789
Object2;123;Something;456;Something2;0;0;someword;789
Object3;123;Something;456;Something2;corrective_data;corrective_data;Keyword;789
Object4;123;Something;456;Something2;0;0;someword;789
Maybe this can be done with pandas, but I don't know where or how to start; some orientation or an answer would be kindly appreciated.
Try 1:
I've tried to replace the string Keyword on each line with 0;0;Keyword using:
with open("input.csv", "r") as file_input:
with open("output.csv", "w") as file_output:
for line in file_input:
file_output.write(line.replace('Keyword','0;0;Keyword'))
But the result is wrong: it adds a ";" inside every cell and also writes the string ";"0;0;Keyword. After looking at the file in vim, I saw that I'll also need to add a new row after the 789 (because I see a " " as the line break).
I'm so lost right now; maybe creating one object with a list of properties for every row would be better(?).
Not sure if this is what you want, because the data in your second code block is not correctly formatted. I assume you want to make the following change:
col_1;col_2;col_3;col_4;col_5;col_6;col_7;col_8;col_9
Object3;123;Something;456;Something2;0;0;someword;789
Object4;123;Something;456;Something2;Keyword;231
# TO #
col_1;col_2;col_3;col_4;col_5;col_6;col_7;col_8;col_9
Object3;123;Something;456;Something2;0;0;someword;789
Object4;123;Something;456;Something2;0;0;Keyword;231
So here is how you can make the changes with pandas:
import pandas as pd
# input data from csv file
data = pd.read_csv("input.csv", delimiter=';')
# get the indices of the rows with "Keyword" appearing in col_6
idxs = data.loc[data['col_6'] == "Keyword"].index
# move the value in col_6 to col_8
data.loc[idxs, 'col_8'] = data.loc[idxs, 'col_6']
# move the value in col_7 to col_9
data.loc[idxs, 'col_9'] = data.loc[idxs, 'col_7']
data.loc[idxs, 'col_6'] = 0  # fill col_6 with 0
data.loc[idxs, 'col_7'] = 0  # fill col_7 with 0
# write the result to a new file (without the row index)
data.to_csv("result.csv", sep=';', index=False)
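If you'd rather stay with the plain csv module (as in your own attempts), the same column shift can be done row by row. This is only a sketch, assuming, like the answer above, that each shifted record sits on its own line and that 'Keyword' appears exactly where col_6 should be:

import csv

with open('input.csv', newline='') as f_in, open('output.csv', 'w', newline='') as f_out:
    reader = csv.reader(f_in, delimiter=';')
    writer = csv.writer(f_out, delimiter=';')
    for row in reader:
        # a shifted row has 'Keyword' in the col_6 position (index 5)
        if len(row) > 5 and row[5] == 'Keyword':
            # insert two placeholder zeros so 'Keyword' moves to col_8
            row = row[:5] + ['0', '0'] + row[5:]
        writer.writerow(row)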
Hi guys, my teacher has assigned me to extract the integer from the string in each row of one column. This is all to be done by reading a CSV file with Python. My terminal doesn't show an error, but I don't get anything either, so I have no hint about the problem; I want to take the integer from every row and print it.
Here is my code:
import pandas as pd

tx = ["T4.csv"]
for name_csv in tx:
    df = pd.read_csv(name_csv, names=["A"])
    for row in df:
        if row == ('NSIT ,A: ,'):
            # I don't know how to use split to take the integer and print it !!!!
            print("A", row)
        else:
            # I don't know how to use split to take the integer and print it !!!!
            print("B", row)
Also, here is what the CSV file contains (I just have them all in column A):
NSIT ,A: ,-213
NSIT ,A: ,-43652
NSIT ,B: ,-39
NSIT ,A: ,-2
NSIT ,B: ,-46
At the end I have put my own attempt in Python; I hope you guys understand the problem I have.
df = pd.read_csv( "T4.csv", names=["c1", "c2", "c3"])
print(df.c3)
Read the file one line at a time. Split each line on comma. Print the last item in the resulting list.
with open('T4.csv') as data:
    for line in data:
        if len(tokens := line.split(',')) == 3:
            print(tokens[2])
Alternative:
with open('T4.csv') as data:
    d = {}
    for line in data:
        if len(tokens := line.split(',')) == 3:
            _, b, c = map(str.strip, tokens)
            d.setdefault(b, []).append(c)
for k, v in d.items():
    print(k, end='')
    print(*v, sep=',', end='')
    print(f' sum={sum(map(int, v))}')
Output:
A:-213,-43652,-2 sum=-43867
B:-39,-46 sum=-85
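Since the question already imports pandas, roughly the same result can also be had with a DataFrame. A minimal sketch, assuming the same three-column layout with the label in the second column and the integer in the third:

import pandas as pd

df = pd.read_csv("T4.csv", names=["c1", "c2", "c3"], skipinitialspace=True)
df["c2"] = df["c2"].str.strip()      # labels like "A:" / "B:"
print(df["c3"])                      # the integers, one per row
print(df.groupby("c2")["c3"].sum())  # the per-label sums, as in the output above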
Your question was not very clear. So I assume you want to print out the 3rd column of the CSV file. I also think that you opened the CSV file in Excel, which is why you see that all the data is put in Column A.
A CSV (comma-separated values) file is a plain text file that contains data organised as a table of rows and columns, where each row represents a record, and each column represents a field or attribute of the record.
A newline character typically separates each row of data in a CSV file, and the values in each column are separated by a delimiter character, such as a comma (,). For example, here is a simple CSV file with three rows and three columns:
S.No, Student Name, Student Roll No.
1, Alpha, 123
2, Beta, 456
3, Gamma, 789
For a simple application like the one you mention, pandas might not be required. You can use Python's standard csv module (csv.reader) to do this.
Please find the code below to print out the 3rd column of your CSV file.
import csv

with open("T4.csv") as csv_file:
    csv_reader = csv.reader(csv_file, delimiter=",")
    headers = next(csv_reader)  # Get the column headers
    print(headers[2])  # Print the 3rd column header
    for row in csv_reader:
        print(row[2])  # Print the 3rd column data
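If the third-column values are needed as numbers rather than strings (for example to add them up), the same loop can convert them with int(). A small sketch that keeps the header handling from the block above:

import csv

total = 0
with open("T4.csv") as csv_file:
    csv_reader = csv.reader(csv_file, delimiter=",")
    next(csv_reader)  # skip the header row, as above
    for row in csv_reader:
        value = int(row[2])  # 3rd column as an integer
        print(value)
        total += value
print("Sum of the 3rd column:", total)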
I noticed pandas is smart when using read_excel / read_csv: it skips empty rows. So if my input has a blank row like

Col1, Col2

Value1, Value2

it just works. But is there a way to get the actual number of skipped rows? (In this case, 1.)
I want to tie the dataframe row numbers back to the raw input file's row numbers.
You could use skip_blank_lines=False to import the entire file including the empty lines. Then you can detect them, count them, and filter them out:
import pandas as pd

def custom_read(f_name, **kwargs):
    df = pd.read_csv(f_name, skip_blank_lines=False, **kwargs)
    non_empty = df.notnull().all(axis=1)
    print('Skipped {} blank lines'.format(sum(~non_empty)))
    return df.loc[non_empty, :]
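Since the goal is to tie the DataFrame rows back to the raw file's row numbers, the same idea can also record the original positions before the blank lines are dropped. A minimal sketch, assuming the header sits on line 1 of the file and that the column name file_row is free to use:

import pandas as pd

def read_with_file_rows(f_name, **kwargs):
    df = pd.read_csv(f_name, skip_blank_lines=False, **kwargs)
    non_empty = df.notnull().all(axis=1)
    df = df.loc[non_empty].copy()
    # the default RangeIndex counts data rows from 0, blanks included;
    # +2 turns that into a 1-based file line number (header on line 1)
    df['file_row'] = df.index + 2
    return df.reset_index(drop=True)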
You can also use csv.reader to import your file row-by-row and only allow non-empty rows:
import csv
import pandas as pd

def custom_read2(f_name):
    with open(f_name) as f:
        cont = []
        empty_counts = 0
        reader = csv.reader(f, delimiter=',')
        for row in reader:
            if len(row) > 0:
                cont.append(row)
            else:
                empty_counts += 1
        print('Skipped {} blank lines'.format(empty_counts))
        return pd.DataFrame(cont)
With the second approach, at most one blank line at a time occupies memory, as far as I can tell. This may be useful if you happen to have large files with many blank lines, but I am pretty sure option 1 will always be the better option in practice.
So I'm trying to clean up a .csv that our badging system exports. One of the issues with this export is that it doesn't separate the badging info (badge ID, activation state, company, etc.) into separate columns.
Here's what I need to do:
1. Create a new .csv with only some of the columns
2. Rename the top row
3. Clean up the CREDENTIALS column so it only outputs the activated badge number
Problem: I already did steps 1 and 2; however, I need help going through the CREDENTIALS column (index 3) to find the "Active" keyword and delete everything except the first set of numbers. Also, some credentials will have multiple badges separated by a |.
For instance, here is what the original .csv looks like:
COMMAND,PERSONID,PARTITION,CREDENTIALS,EMAIL,FIRSTNAME,LASTNAME
NoCommand,43,Master,{9065~9065~Company~Active~~~},personone#company.com,person,one
NoCommand,57,Master,{9482~9482~Company~Active~~~},persontwo#company.com,person,two
NoCommand,323,Master,{8045~8045~Company~Disabled~~~},personthree#company.com,person,three
NoCommand,84,Master,{8283~8283~Company~Disabled~~~|9861~9861~Company~Active~~~},personfour#company.com,person,four
NoCommand,46,Master,{9693~9693~Company~Lost~~~|9648~9648~Company~Active~~~},personfive#company.com,person,five
As you can see, the CREDENTIALS column [3] has a bunch of data included. It will also have multiple badge credentials separated by a |.
Here's what I have so far to complete steps 1 and 2:
import csv

# Empty data set that will eventually be written with the new sanitized data
data = []

# Keyword to search for
word = 'Active'

# Source .csv file that we will be working with
input_filename = '/path/to/original/csv'

# Output .csv file that we will create with the data from input_filename
output_filename = '/path/to/new/csv'

with open(input_filename, "rb") as the_file:
    reader = csv.reader(the_file, delimiter=",")
    next(reader, None)
    # Test sanitizing column 3
    for row in reader:
        for col in row[3]:
            if word in row[3]:
                print col
        new_row = [row[3], row[5], row[6], row[4]]
        data.append(new_row)

with open(output_filename, "w+") as to_file:
    writer = csv.writer(to_file, delimiter=",")
    writer.writerow(['BadgeID', 'FirstName', 'LastName', 'EmployeeEmail'])
    for new_row in data:
        writer.writerow(new_row)
So far the new .csv is looking like this:
BadgeID,FirstName,LastName,EmployeeEmail
{9065~9065~Company~Active~~~},person,one,personone#company.com
{9482~9482~Company~Active~~~},person,two,persontwo#company.com
{8045~8045~Company~Disabled~~~},person,three,personthree#company.com
{8283~8283~Company~Disabled~~~|9861~9861~Company~Active~~~},person,four,personfour#company.com
{9693~9693~Company~Lost~~~|9648~9648~Company~Active~~~},person,five,personfive#company.com
I want it to look like this, with the "Active" credentials:
BadgeID,FirstName,LastName,EmployeeEmail
9065,person,one,personone#company.com
9482,person,two,persontwo#company.com
8045,person,three,personthree#company.com
8283,person,four,personfour#company.com
9693,person,five,personfive#company.com
However, for my column 3 testing code block, I'm trying to at least make sure I'm grabbing the correct data. The weird thing is that when I print that column it comes out looking weird:
# Test sanitizing column 3
for row in reader:
    for col in row[3]:
        if word in row[3]:
            print col
It outputs something like this:
C
a
r
d
s
~
A
c
t
i
v
e
~
~
~
}
{
8
8
2
4
~
8
8
2
4
~
Anyone have any thoughts?
Going by your output, you're grabbing the correct data! The problem is: Column 3 is a string. You're treating it like a list from the outset, resulting in characters being pulled from words. Use string methods to get lists of words first.
Step by step with pseudo-code:
Strip those brackets
column3 = column3.strip("{}")
Since you might have multiple badges separated by "|", you should
badges_str = column3.split("|")
Now you have a list of strings, each representing a single badge.
badges = []
for badge in badges_str:
    badges.append(badge.split("~"))
Now you have a list of individual badge listings that you can use indexes on.
for badge in badges:
    # test for the Active badges, then do things
    if badge[3] == "Active":
        do_something(badge[0])
        do_something_else(badge[1])
        etc...
That doesn't give you actual code, but it should get you through the next steps.
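Putting those steps back into the question's own script, a minimal sketch could look like the following. It is written in Python 3 style (the question's code is Python 2), the paths and the word variable are taken from the question, and people with no Active badge (person three in the sample) simply fall back to their first badge number here; adjust that if such rows should be skipped or left blank.

import csv

word = 'Active'
input_filename = '/path/to/original/csv'
output_filename = '/path/to/new/csv'

data = []
with open(input_filename, 'r') as the_file:
    reader = csv.reader(the_file, delimiter=',')
    next(reader, None)  # skip the original header row
    for row in reader:
        # strip the braces, split on '|' into credential entries,
        # then split each entry on '~' into its fields
        badges = [b.split('~') for b in row[3].strip('{}').split('|')]
        badge_id = badges[0][0]  # fallback: first badge number
        for fields in badges:
            if len(fields) > 3 and fields[3] == word:
                badge_id = fields[0]  # prefer the Active badge
                break
        data.append([badge_id, row[5], row[6], row[4]])

with open(output_filename, 'w', newline='') as to_file:
    writer = csv.writer(to_file, delimiter=',')
    writer.writerow(['BadgeID', 'FirstName', 'LastName', 'EmployeeEmail'])
    writer.writerows(data)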
I am looking to take a CSV file and sort it using Python 2.7 to get an individual value based on two columns, a block and a lot. My data currently looks like the link below:
Beginning
Based on the lot value, I want to be able to use Python to automatically create extra lines in a new CSV, where the values will look like this when written out in the new CSV:
End Result
So I know that I need to read each row and, based on the cell value in the lot column, if there is a ",", copy the row into the other CSV once per value: the values in the other columns are copied unchanged, and then the first, second, third (etc.) lot values each go on their own row.
After the commas are separated out, the ranges will be handled in a similar way in a third CSV. If there is a single value, the whole row will be copied as-is.
Thank you in advance for the help.
This should work.
On Windows, open the files in binary mode or else you get doubled newlines.
I assumed the columns are separated by ; because the cells contain ,.
First split on ,, then check for ranges.
print line is for debugging
Error checking is left as an exercise for the reader.
Code:
import csv

file_in = csv.reader(open('input.csv', 'rb'), delimiter=';')
file_out = csv.writer(open('output.csv', 'wb'), delimiter=';')

for i, line in enumerate(file_in):
    if i == 0:
        # write header
        file_out.writerow(line)
        print line
        continue
    for j in line[1].split(','):
        if len(j.split('-')) > 1:
            # lines with -
            start = int(j.split('-')[0])
            end = int(j.split('-')[1])
            for k in xrange(start, end + 1):
                line[1] = k
                file_out.writerow(line)
                print line
        else:
            # lines with ,
            line[1] = j
            file_out.writerow(line)
            print line
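As an illustration (the real column layout is only visible in the linked screenshots, so this row is purely hypothetical), an input row whose lot cell is 2,5-7 would be expanded by the loop above into one output row per lot:

Block1;2,5-7;SomeValue

becomes

Block1;2;SomeValue
Block1;5;SomeValue
Block1;6;SomeValue
Block1;7;SomeValue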
I need to filter and do some math on data coming from CSV files.
I've written a simple Python script to isolate the rows I need (they should contain certain keywords like "Kite"), but my script does not work and I can't find out why. Can you tell me what is wrong with it? Another thing: once I get to the chosen row(s), how can I refer to each (comma-separated) column?
Thanks in advance.
R.
import csv

with open('sales-2013.csv', 'rb') as csvfile:
    sales = csv.reader(csvfile)
    for row in sales:
        if row == "Kite":
            print ",".join(row)
You are reading the file in bytes. Change the command to open('filepathAndName.csv', 'r'), or convert your strings, e.g. "Kite".encode('UTF-8'). The second mistake could be that you are looking for a line equal to the word "Kite"; if "Kite" is only part of the line, it will not be found. In this case you have to use if "Kite" in row:.
with open('sales-2013.csv', 'rb') as csvfile:  # <- change 'rb' to 'r'
    sales = csv.reader(csvfile)
    for row in sales:
        if row == "Kite":  # <- this would be better: if "Kite" in row:
            print ",".join(row)
Read this:
https://docs.python.org/2/tutorial/inputoutput.html#reading-and-writing-files
To find the rows that contain the word "Kite", you should use
for row in sales:  # here you iterate over every row (a *list* of cells)
    if "Kite" in row:
        # do stuff
Now that you know how to find the required rows, you can access the desired cells by indexing the rows. For example, if you want to select the second cell of a row, you simply do
cell = row[1] # remember, indexes start with 0
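Putting the two parts together, a minimal sketch of the whole script might look like this (the column index 2 and the float() conversion are assumptions about the file layout, so adjust them to the real columns):

import csv

total = 0.0
with open('sales-2013.csv', 'r') as csvfile:
    sales = csv.reader(csvfile)
    for row in sales:
        if "Kite" in row:           # keep only the rows we care about
            print(",".join(row))    # show the matching row
            total += float(row[2])  # "do some math" on one of its cells
print(total)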