How can I remove extra commas from a CSV file? Sometimes there are 3 or more extra commas, and I would like the affected middle part (mark,one 55, power in the row below) to become a single column.
The correct format is 11 columns; I just want to find the rows that have more than that and remove the extra commas.
84,855,648857,8787548,R,mark,one 55, power,0000081,3434,59190000,defen,six,
The first 5 and last 5 columns are static; only the middle should become a single column, and sometimes there are more than 3 extra columns.
I have split the 300 GB file so that I can process it with a Python script in a loop, so there is a folder containing the files.
The result should look like this:
84,855,648857,8787548,R,mark one 55 power,0000081,3434,59190000,defen,six,
I suggest reading the CSV data into a list, merging the extra fields, and writing it back:
def merge(data):
    # Keep the first 5 and last 5 items; join everything in between with spaces.
    result = []
    result += data[:5]
    temporary = ""
    for item in data[5:-5]:
        temporary += item + " "
    result.append(temporary[:-1])  # drop the trailing space
    result += data[-5:]
    return result
This function takes a list, keeps the first 5 and last 5 items unchanged, and merges everything in between into a single space-separated item.
For example, calling
merge(["84","855","648857","8787548","R","mark","one 55","power","0000081","3434","59190000","defen","six"])
will merge the items at indices 5, 6, and 7 and return:
['84', '855', '648857', '8787548', 'R', 'mark one 55 power', '0000081', '3434', '59190000', 'defen', 'six']
You can then write the list back into a csv.
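The read, merge, and write-back steps could be sketched roughly as follows. This is only an illustration under a few assumptions: the names EXPECTED_COLUMNS and fix_rows are made up here, the sample row from the question stands in for a real file, and the trailing comma is treated as one empty last field that is set aside before counting columns.

```python
import csv
import io

EXPECTED_COLUMNS = 11  # from the question: a well-formed row has 11 columns


def merge(fields):
    """Collapse everything between the first 5 and last 5 fields into one."""
    middle = " ".join(part.strip() for part in fields[5:-5])
    return fields[:5] + [middle] + fields[-5:]


def fix_rows(rows):
    """Yield each row, merging the middle fields only when there are too many."""
    for row in rows:
        # Assumption: a trailing comma shows up as one empty last field.
        trailing = row[-1:] if row and row[-1] == "" else []
        fields = row[: len(row) - len(trailing)]
        if len(fields) > EXPECTED_COLUMNS:
            fields = merge(fields)
        yield fields + trailing


# In-memory stand-in for one of the split files.
sample = "84,855,648857,8787548,R,mark,one 55, power,0000081,3434,59190000,defen,six,\n"
out = io.StringIO()
csv.writer(out, lineterminator="\n").writerows(fix_rows(csv.reader(io.StringIO(sample))))
fixed_line = out.getvalue()
print(fixed_line, end="")
```

For real files, the two StringIO objects would be replaced by files opened with newline="", and because fix_rows is a generator the rows are streamed one at a time rather than loaded into memory.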
I have a .tsv file, which I have attached to this post. Its rows (cells) are in the order A1, A2, A3 ... A12, B1 ... B12, ..., H1 ... H12. I need to rearrange this into the order A1, B1, C1, D1 ... H1, A2, B2, C2 ... H2, ..., A12, B12, C12 ... H12.
I need to do this using Python.
I have another .tsv file that I can compare this file against. It is called flipped.tsv. The flipped.tsv file contains the accurate well values corresponding to the cells. In other words, I must map the well values to their correct cell lines.
From what I have understood, the cell lines in the metadata are incorrectly arranged in column-major order, whereas they have to be arranged in row-major order, as they are in the flipped.tsv file.
For example :
"A2 of flipped_metadata.tsv has the same well values as that of B1 of metadata.tsv."
What is the logic that I can carry out to perform this in Python?
You could do the following:
import csv
# Read original file
with open("file.tsv", "r", newline="") as file:
    rows = list(csv.reader(file, delimiter="\t"))
# Key function for sorting
def key_func(row):
    """ Transform row into a sort key, e.g. ['A7', 1, 2] -> (7, 'A') """
    return int(row[0][1:]), row[0][0]
# Write `flipped` file: keep the original labels, take the values from the sorted rows
with open("file_flipped.tsv", "w", newline="") as file:
    csv.writer(file, delimiter="\t").writerows(
        row[:1] + flipped[1:]
        for row, flipped in zip(rows, sorted(rows, key=key_func))
    )
The flipping is done by sorting the original rows by
first the integer part of their first entry, int(row[0][1:]), and
then the character part of their first entry, row[0][0].
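To see the effect without any files, here is a small in-memory sketch. The 2x3 plate and the value_... strings are invented for illustration; only key_func follows the answer.

```python
def key_func(row):
    """Sort key: number part first, then the letter, e.g. ['B1', ...] -> (1, 'B')."""
    return int(row[0][1:]), row[0][0]


# Hypothetical 2x3 plate listed letter by letter: A1, A2, A3, B1, B2, B3.
rows = [[f"{letter}{number}", f"value_{letter}{number}"]
        for letter in "AB" for number in (1, 2, 3)]

sorted_rows = sorted(rows, key=key_func)
# Keep the original labels in column 1 and take the values from the sorted rows.
flipped_rows = [row[:1] + other[1:] for row, other in zip(rows, sorted_rows)]

for row in flipped_rows:
    print("\t".join(row))
```

In this toy data, the row labelled A2 ends up with the values that originally belonged to B1, which is the relationship described in the question.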
If the effect of the sorting isn't obvious, take a look at the result of the same operation, just without the relabelling of the first column:
with open("file_flipped.tsv", "w") as file:
    csv.writer(file, delimiter="\t").writerows(
        sorted(rows, key=key_func)
    )
Output:
A1 26403 23273
B1 27792 8805
C1 5668 19510
...
F12 100 28583
G12 18707 14889
H12 13544 7447
The blocks are built based on the number part first, and within each block the lines run through the sorted characters.
This only works as long as the non-number part always has exactly one character.
If the non-number part always has exactly 2 characters, then the return value of the key function has to be adjusted to int(row[0][2:]), row[0][:2], and so on.
If more variability is allowed, e.g. between 1 and 5 characters, then a regex approach would be more appropriate:
import re
re_key = re.compile(r"([a-zA-Z]+)(\d+)")
def key_func(row):
    """ Transform row in sort key, e.g. ['Aa7', 10, 20] -> (7, 2, 'Aa') """
    word, number = re_key.match(row[0]).group(1, 2)
    return int(number), len(word), word
Depending on how the words have to be sorted, it might be necessary to include the length of the word in the sort key: Python sorts ['B', 'AA', 'A'] into ['A', 'AA', 'B'] and not ['A', 'B', 'AA']. Adding the length, as in the function above, achieves the latter order.
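A quick way to check the difference is to sort a few labels with and without the length in the key. The labels and the helper key_without_length below are made up for this comparison.

```python
import re

re_key = re.compile(r"([a-zA-Z]+)(\d+)")


def key_func(row):
    """Key with the word length: (number, length of word, word)."""
    word, number = re_key.match(row[0]).group(1, 2)
    return int(number), len(word), word


def key_without_length(row):
    """Same key without the length, for comparison only."""
    word, number = re_key.match(row[0]).group(1, 2)
    return int(number), word


labels = [["B1"], ["AA1"], ["A1"], ["A2"]]
with_length = [row[0] for row in sorted(labels, key=key_func)]
without_length = [row[0] for row in sorted(labels, key=key_without_length)]
print(with_length)     # ['A1', 'B1', 'AA1', 'A2']
print(without_length)  # ['A1', 'AA1', 'B1', 'A2']
```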
I am using the following code to scrape content from a webpage, with the end goal of writing it to a CSV file. On the first iteration I had this portion working, but now that my data is formatted differently, it is written in a way that gets mangled when I view it in Excel.
If I use the code below, the "heading.text" data is correctly put into one cell when viewed in Excel, whereas the contents of "child.text" are packed into one cell rather than being split on the commas. You will see I have attempted to clean up the content of "child.text" to see whether that was my issue.
If I remove "heading.text" from "z" and try again, Excel shows one letter per cell. In the end I would like each comma-separated value to appear in its own cell when viewed in Excel. I believe I am doing something (many things?) incorrectly in structuring "z" and/or when I write the row.
Any guidance would be greatly appreciated. Thank you.
csvwriter = csv.writer(csvfile)
for heading in All_Heading:
    driver.execute_script("return arguments[0].scrollIntoView(true);", heading)
    print("------------- " + heading.text + " -------------")
    ChildElement = heading.find_elements_by_xpath("./../div/div")
    for child in ChildElement:
        driver.execute_script("return arguments[0].scrollIntoView(true);", child)
        #print(heading.text)
        #print(child.text)
        z = (heading.text, child.text)
        print (z)
        csvwriter.writerow(z)
When I print "z" I get the following:
('Flower', 'Afghani 3.5g Pre-Pack Details\nGREEN GOLD ORGANICS\nAfghani 3.5g Pre-Pack\nIndica\nTHC: 16.2%\n1/8 oz - \n$45.00')
When I print "z" with the older code that split the string on "\n" I get the following:
('Flower', "Cherry Limeade 3.5g Flower - BeWell Details', 'BE WELL', 'Cherry Limeade 3.5g Flower - BeWell', 'Hybrid', 'THC: 18.7 mg', '1/8 oz - ', '$56.67")
csvwriter.writerow() takes an iterable, and each element of it is written as a separate cell, separated by the writer's delimiter.
First, let's see what has been happening so far:
(heading.text, child.text) has two elements, i.e. two cells: heading.text and child.text.
(child.text) is simply child.text (it would be a tuple if it were (child.text,)), and a string's elements are its individual characters. Hence each letter got its own cell.
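The difference between passing a bare string and a one-element tuple can be checked in isolation. This is a minimal sketch; write_one_row is a helper invented here, and it writes into an in-memory buffer instead of a file.

```python
import csv
import io


def write_one_row(row):
    """Write a single row with csv.writer and return the resulting CSV line."""
    buffer = io.StringIO()
    csv.writer(buffer, lineterminator="\n").writerow(row)
    return buffer.getvalue()


as_string = write_one_row("abc")    # a bare string is iterated character by character
as_tuple = write_one_row(("abc",))  # a one-element tuple gives a single cell
print(as_string, end="")  # a,b,c
print(as_tuple, end="")   # abc
```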
To get different cells in a row we need separate elements in our iterable, so we want an iterable like [heading.text, child.text line 1, child.text line 2, ...]. You were right to split the text into lines, but the lines weren't being added to the row correctly.
Since tuples are immutable, I'll use a list instead.
We know heading.text is to take a single cell, so we can write the following to start with:
row = [heading.text] # this is what your z is
We want each line to be a separate element so we split child.text:
lines = child.text.split("\n")
# The text doesn’t start or end with a newline so this should suffice
Now we want each line to be added to the row as a separate element, so we can use the extend() method on lists:
row.extend(lines)
# e.g. extending [1, 2] with [3, 4, 5] results in [1, 2, 3, 4, 5]
Putting it together:
row = [heading.text]
lines = child.text.split("\n")
row.extend(lines)
or unpacking it in a single line:
row = [heading.text, *child.text.split("\n")] # You can also use a tuple here
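As a self-contained check, the same row construction can be run on the values printed in the question. Here heading_text and child_text are plain-string stand-ins for heading.text and child.text, and an in-memory buffer takes the place of the open csvfile.

```python
import csv
import io

# Stand-ins for heading.text and child.text, copied from the printed tuple.
heading_text = "Flower"
child_text = ("Afghani 3.5g Pre-Pack Details\nGREEN GOLD ORGANICS\n"
              "Afghani 3.5g Pre-Pack\nIndica\nTHC: 16.2%\n1/8 oz - \n$45.00")

# One element per cell: the heading first, then each line of the child text.
row = [heading_text, *child_text.split("\n")]

buffer = io.StringIO()  # in the real script this would be the open csvfile
csv.writer(buffer, lineterminator="\n").writerow(row)
csv_line = buffer.getvalue()
print(csv_line, end="")
```

Each of the eight elements becomes its own cell, so Excel should show the heading in the first column and one detail per column after it.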
I have a file that looks like this
a:1
a:2
a:3
b:1
b:2
b:2
and I would like to take the a and b portions of the file and use them as the column headers, with the numbers below them, like this:
a b
1 1
2 2
3 3
Can this be done?
A CSV (Comma-Separated Values) file should have commas in it, so the output should use commas instead of spaces as separators.
I recommend writing your code in two parts: The first part should read the input; the second should write out the output.
If your input looks like this:
a:1
a:2
a:3
b:1
b:2
b:2
c:7
you can read in the input like this:
#!/usr/bin/env python3
# Usage: python3 script.py < input.txt > output.csv
import sys
# Loop through all the input lines and put the values in
# a list according to their category:
categoryList = {}  # key => category, value => list of values
for line in sys.stdin.readlines():
    line = line.rstrip('\n')
    category, value = line.split(':')
    if category not in categoryList:
        categoryList[category] = []
    categoryList[category].append(value)
# print(categoryList)  # debug line
# Debug line prints: {'a': ['1', '2', '3'], 'b': ['1', '2', '2'], 'c': ['7']}
This will read in all your data into a categoryList dict. It's a dict that contains the categories (the letters) as keys, and contains lists (of numbers) as the values. Once you have all the data held in that dict, you can output it.
Outputting involves getting a list of categories (the letters, in your example case) so that they can be written out first as your header:
# Get the list of categories:
categories = sorted(categoryList.keys())
assert categories, 'No categories found!' # sanity check
From here, you can use Python's nice csv module to output the header and then the rest of the lines. When outputting the main data, we can use an outer loop to loop through the nth entries of each category, then we can use an inner loop to loop through every category:
import csv
csvWriter = csv.writer(sys.stdout)
# Output the categories as the CSV header:
csvWriter.writerow(categories)
# Now output the values we just gathered as
# Comma Separated Values:
i = 0  # the index into an individual category list
while True:
    values = []
    for category in categories:
        try:
            values.append(categoryList[category][i])
        except IndexError:
            values.append('')  # no value, so use an empty string
    if len(''.join(values)) == 0:
        break  # we've run out of categories that contain input
    csvWriter.writerow(values)
    i += 1  # increment index for the next time through the loop
If you don't want to use Python's csv module, you will still need to figure out how to group the entries in the category together. And if all you have is simple output (where none of the entries contain quotes, newlines, or commas), you can get away with manually writing out the output.
You could use something like this to output your values:
# Output the categories as the CSV header:
print(','.join(categories))
# Now output the values we just gathered as
# Comma Separated Values:
i = 0  # the index into an individual category list
while True:
    values = []
    for category in categories:
        try:
            values.append(categoryList[category][i])
        except IndexError:
            values.append('')  # no value, so use an empty string
    if len(''.join(values)) == 0:
        break  # we've run out of categories that contain input
    print(','.join(values))
    i += 1  # increment index for the next time through the loop
This will print out:
a,b,c
1,1,7
2,2,
3,2,
It does this by looping through all the list entries (the outer loop), and then looping through all the categories (the inner loop), and then printing out the values joined together by commas.
If you don't want the commas in your output, then you're technically not looking for CSV (Comma Separated Value) output. Still, in that case, it should be easy to modify the code to get what you want.
But if you have more complicated output (that is, values that have quotes, commas, and newlines in it) you should strongly consider using the csv module to output your data. Otherwise, you'll spend lots of time trying to fix obscure bugs with odd input that the csv module already handles.
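As an alternative to the explicit index loop, the same column-wise output could be sketched with itertools.zip_longest, which pads the shorter lists. This is a different technique from the loop above, not a rewrite of it; the input lines are hard-coded here instead of read from stdin, and the names lines and columns are chosen for this sketch.

```python
import csv
import io
from itertools import zip_longest

# Hard-coded stand-in for the lines that would come from the input file.
lines = ["a:1", "a:2", "a:3", "b:1", "b:2", "b:2", "c:7"]

# Group the values by category, preserving their order within each category.
columns = {}
for line in lines:
    category, value = line.split(":")
    columns.setdefault(category, []).append(value)

categories = sorted(columns)
out = io.StringIO()  # sys.stdout or an open file would work the same way
writer = csv.writer(out, lineterminator="\n")
writer.writerow(categories)
# zip_longest turns the per-category lists into rows, padding with ''.
writer.writerows(zip_longest(*(columns[c] for c in categories), fillvalue=""))
csv_text = out.getvalue()
print(csv_text, end="")
```

One difference from the loop version is that zip_longest runs to the end of the longest list, whereas the loop stops at the first row in which every value is empty.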
My name is Rhein and I have just started to learn Python, and I'm having a lot of fun :D. I just finished a course on YouTube and am now working on a project of my own. Currently, I am trying to separate the columns of a crime-data CSV file into their own strings.
with open('C:/Users/aferdous/python-works/data-set/crime-data/crime_data-windows-1000.csv') as crime_data:
    for crime in crime_data:
        id = crime_data.readline(8)  #<- prints the first x char of each line
        print(id)
        case_number = crime_data.readline(8)  #<- prints the first x char of each line
        print(case_number)
        date = crime_data.readline(22)  #<- prints the first x char of each line
        print(date)
        block = crime_data.readline(25)  #<- prints the first x char of each line
        print(block)
This was easy for the first two columns, since their values all have the same number of characters. But for 'block', the values have different lengths, so I do not know how to extract the right number of characters from each line. And there are 1000 lines in total.
- Thanks
I assume that your CSV format is "value1, value2, value3". If that is the case, you can use the Python string method split. Example:
...
columns = crime.split(",")  # crime is one line of the file
print(columns[0])  # print column 1
print(columns[1])  # print column 2
...
But there are much better options for reading CSV files in Python; you can search for examples. One that I found:
https://gist.github.com/ultrakain/79758ff811f87dd11a8c6c80c28397c4
Reading a CSV file using Python
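One such option is the standard library's csv module, which also copes with quoted fields that contain commas. The sketch below uses invented sample data, since the real file's columns are not shown in the question; the header names and rows are placeholders.

```python
import csv
import io

# Hypothetical sample; the header names and values are placeholders.
sample = io.StringIO(
    'ID,Case Number,Date,Block\n'
    '1001,AB123456,01/02/2020 10:00:00 AM,012XX EXAMPLE ST\n'
    '1002,AB123457,01/02/2020 11:30:00 AM,"034XX SAMPLE AVE, REAR"\n'
)

reader = csv.reader(sample)
header = next(reader)  # first line holds the column names
records = [dict(zip(header, row)) for row in reader]

for record in records:
    print(record["ID"], record["Block"])
```

With a real file, the StringIO object would be replaced by open(path, newline=""), and no character counting is needed because the reader splits each line on the delimiter.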
Assume I have A1 as the only cell in a workbook, and it's blank.
I want my code to add "1" "2" and "3" to it so it says "1 2 3"
As of now I have:
NUMBERS = [1, 2, 3, 4, 5]
ThisSheet.Cells(1,1).Value = NUMBERS
This just writes the first value to the cell. I tried
ThisSheet.Cells(1,1).Value = Numbers[0-2]
but that just puts the LAST value in there. Is there a way for me to just add all of the data in there? This information will always be in String format, and I need to use Win32Com.
update:
I did
stringVar = ', '.join(str(v) for v in LIST)
UPDATE: this .join works perfectly for the NUMBERS list. Now I tried applying it to another list that looks like this:
LIST = ['Description Good\nBad', 'Description Valid\nInvalid']
If I print LIST[0] The outcome is
Description Good
Bad
Which is what I want. But if I use .join on this one, it prints
('Description Good\nBad, Description Valid\nInvalid')
so for this one I need it to print as though I did LIST[0] and LIST[1]
If you want to put each number in a different cell, you could do something like:
it = 1
for num in NUMBERS:
    ThisSheet.Cells(1, it).Value = num
    it += 1
Or, if you want the first 3 numbers in the same cell:
ThisSheet.Cells(1, 1).Value = ' '.join([str(num) for num in NUMBERS[:3]])
Or all of the elements in NUMBERS:
ThisSheet.Cells(1,1).Value = ' '.join([str(num) for num in NUMBERS])
EDIT
Based on your question edit: for strings containing \n, and assuming that every time you find a newline character you want to jump to the next row:
# Split LIST[0] on the \n character
splitted_lst0 = LIST[0].split('\n')
# Iterate through the lines of LIST[0], writing each one to its own row
it = 1
for line in splitted_lst0:
    ThisSheet.Cells(it, 1).Value = line
    it += 1
If you want to do this for the whole LIST and not only for LIST[0], first merge it with the join method, using '\n' as the separator so that the last line of one item is not glued to the first line of the next, and split it right afterwards:
joined_list = '\n'.join(LIST).split('\n')
And then iterate through it the same way as before.
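The string handling can be tried without Excel or win32com. The sketch below only builds the values; the (row, column, value) triples in cells are an illustration of what would be assigned to ThisSheet.Cells(row, column).Value, and the names single_cell_text, all_lines, and glued are made up here. It also compares joining with '\n' against joining with an empty string.

```python
LIST = ["Description Good\nBad", "Description Valid\nInvalid"]
NUMBERS = [1, 2, 3, 4, 5]

# Text for a single cell holding the first three numbers.
single_cell_text = " ".join(str(num) for num in NUMBERS[:3])

# Joining with '\n' keeps item boundaries as line breaks before splitting.
all_lines = "\n".join(LIST).split("\n")

# Joining with '' glues the end of one item to the start of the next.
glued = "".join(LIST).split("\n")

# (row, column, value) triples for writing the lines one below the other.
cells = [(row, 1, line) for row, line in enumerate(all_lines, start=1)]

print(single_cell_text)
print(all_lines)
print(glued)
```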