What is the best way to take this string:
1
2
3
4
a
b
c
d
1
2
3
4
a
b
c
d
1
2
3
4
a
b
c
d
and transform it into a CSV containing 6 columns?
Desired output is a CSV that will be imported into Pandas:
1,a,1,a,1,a
2,b,2,b,2,b
etc.
Update: desired output changed to 6 rows, as per the comments.
Update: I can get the first row like this if I assign the string to the variable l:
l.split()[0::4]
['1', 'a', '1', 'a', '1', 'a']
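Extending that slice, a minimal sketch that builds every row by varying the starting offset (assuming the whole string is in l):
tokens = l.split()
rows = [tokens[i::4] for i in range(4)]  # offsets 0..3 give the four rows
print("\n".join(",".join(row) for row in rows))
# 1,a,1,a,1,a
# 2,b,2,b,2,b
# 3,c,3,c,3,c
# 4,d,4,d,4,d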
with open('data.txt', 'r') as f:
    data = f.read().split("\n")
for i in range(4):
    d = list()
    for j in range(i, len(data), 4):
        d.append(data[j])
    with open('data.csv', 'a') as csv:
        csv.write(','.join(d) + "\n")
Even though Art's answer is accepted, here is another way using pandas. You wouldn't need to export the data before importing it with pandas if you use something like this.
import pandas as pd

myFile = "lines_to_read2.txt"
myData = pd.DataFrame(columns=['col1', 'col2', 'col3', 'col4'])
mycolumns = 4
thisItem = list()
with open(myFile, 'r') as linesToRead:
    for thisLine in linesToRead:
        thisItem.append(thisLine.strip('\n, " "'))  # strip newlines, commas, quotes and spaces
        if len(thisItem) == mycolumns:
            # note: DataFrame.append was removed in pandas 2.0; use pd.concat there
            myData = myData.append({'col1': thisItem[0], 'col2': thisItem[1],
                                    'col3': thisItem[2], 'col4': thisItem[3]},
                                   ignore_index=True)
            thisItem = list()
myData.to_csv('lines_as_csv_file.csv', index=False)
print(myData)  # full table
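Since DataFrame.append is gone in pandas 2.0, one possible alternative sketch builds the frame in one pass (same assumed lines_to_read2.txt layout, with plain whitespace stripping):
import pandas as pd

with open("lines_to_read2.txt") as f:
    tokens = [line.strip() for line in f if line.strip()]

# reshape the flat token list into rows of 4 columns
rows = [tokens[i:i + 4] for i in range(0, len(tokens), 4)]
myData = pd.DataFrame(rows, columns=["col1", "col2", "col3", "col4"])
myData.to_csv("lines_as_csv_file.csv", index=False)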
I have the following input file in CSV:
A,B,C,D
1,2,|3|4|5|6|7|8,9
11,12,|13|14|15|16|17|18,19
How do I split column C right in the middle into two new rows, with an additional column E, where the first half of the split gets "0" in column E and the second half gets "1"?
A,B,C,D,E
1,2,|3|4|5,9,0
1,2,|6|7|8,9,1
11,12,|13|14|15,19,0
11,12,|16|17|18,19,1
Thank you
Here's how to do it without Pandas:
import csv

with open("input.csv", newline="") as f_in, open("output.csv", "w", newline="") as f_out:
    reader = csv.reader(f_in)
    header = next(reader)  # read header
    header += ["E"]  # modify header
    writer = csv.writer(f_out)
    writer.writerow(header)
    for row in reader:
        a, b, c, d = row  # unpack the 4 fields of each row
        c_items = [x.strip() for x in c.split("|") if x.strip()]
        n_2 = len(c_items) // 2  # halfway index
        c1 = "|" + "|".join(c_items[:n_2])
        c2 = "|" + "|".join(c_items[n_2:])
        writer.writerow([a, b, c1, d, 0])  # 0 & 1 will be converted to str on write
        writer.writerow([a, b, c2, d, 1])
If I understand you correctly, you can use str.split on column 'C', then .explode() the column and join it again:
df["C"] = df["C"].apply(
lambda x: [
(vals := x.strip(" |").split("|"))[: len(vals) // 2],
vals[len(vals) // 2 :],
]
)
df["E"] = df["C"].apply(lambda x: range(len(x)))
df = df.explode(["C", "E"])
df["C"] = "|" + df["C"].apply("|".join)
print(df.to_csv(index=False))
Prints:
A,B,C,D,E
1,2,|3|4|5,9,0
1,2,|6|7|8,9,1
11,12,|13|14|15,19,0
11,12,|16|17|18,19,1
Using a regex and str.findall to break the string, then explode and Groupby.cumcount:
(df.assign(C=df['C'].str.findall(r'(?:\|[^|]*){3}'))  # raw string; the {3} assumes 3 items per half
   .explode('C')
   .assign(E=lambda d: d.groupby(level=0).cumcount())
 # .to_csv('out.csv', index=False)
)
Output (before CSV export):
    A   B          C   D  E
0   1   2     |3|4|5   9  0
0   1   2     |6|7|8   9  1
1  11  12  |13|14|15  19  0
1  11  12  |16|17|18  19  1
Output CSV:
A,B,C,D,E
1,2,|3|4|5,9,0
1,2,|6|7|8,9,1
11,12,|13|14|15,19,0
11,12,|16|17|18,19,1
Another way
import numpy as np

df = (df.assign(C=df['C'].str.replace(r'^\|', '', regex=True)  # remove the leading | to allow splitting on the character
                         .str.split(r'\|')  # split to create a list
                         .apply(lambda x: np.array_split(x, 2)))  # split each list into two sublists
        .explode('C')  # explode into rows
      )
df = df.assign(C="|" + df["C"].apply("|".join),  # clean C: restore the leading |
               E=df.groupby('A').cumcount())  # note: cumcount takes no column argument
I am trying to import data from a file and then add it to an array. I know that this is not the best way to add elements to a numpy array. Nevertheless, why is the data not appending? The last element of the CSV is 1.1, and that's what I get when I do print(dd).
import csv
import numpy as np

with open('C:\\Users\jez40\.PyCharmCE2018.2\8_Data.csv', 'r') as data_file:
    data = csv.reader(data_file, delimiter=',')
    for i in data:
        t = []
        d = []
        dd = []
        t.append([float(i[0])])
        d.append([float(i[1])])
        dd.append([float(i[2])])
t = np.array(t)
d = np.array(d)
dd = np.array(dd)
print(dd)
The root of your problem is that on every iteration of your loop you re-assign t, d and dd to empty lists []. If your end goal is to get numpy arrays for these variables, I would recommend using pd.read_csv() to convert your CSV file to a DataFrame. Take this sample CSV:
t,d,dd
1,2,3
4,5,6
7,8,9
Using pd.read_csv():
import pandas as pd

df = pd.read_csv(r'C:\\Users\jez40\.PyCharmCE2018.2\8_Data.csv')
Gives:
   t  d  dd
0  1  2   3
1  4  5   6
2  7  8   9
Then you can query your columns to return them as pd.Series():
t = df['t']
d = df['d']
dd = df['dd']
Or you can convert them to np.array():
import numpy as np

t = np.array(df['t'])
d = np.array(df['d'])
dd = np.array(df['dd'])
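For completeness, if you would rather keep the manual loop, a minimal sketch of the fix is to create the lists once, before the loop (path shortened to 8_Data.csv here):
import csv
import numpy as np

t, d, dd = [], [], []  # created once, not on every iteration
with open('8_Data.csv', 'r') as data_file:
    for row in csv.reader(data_file, delimiter=','):
        t.append(float(row[0]))
        d.append(float(row[1]))
        dd.append(float(row[2]))

t, d, dd = np.array(t), np.array(d), np.array(dd)
print(dd)  # now holds every value of the third column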
I would like to create a list for every column in a txt file.
The file looks like this:
NAME S1 S2 S3 S4
A 1 4 3 1
B 2 1 2 6
C 2 1 3 5
PROBLEM 1. How do I dynamically create the number of lists that fits the number of columns, such that I can fill them? In some files I will have 4 columns, in others 6 or 8...
PROBLEM 2. What is a pythonic way to iterate through each column and make a list of the values, like this:
list_s1 = [1,2,2]
list_s2 = [4,1,1]
etc.
Right now I have read in the txt file and I have each individual line. As input I give the number of NAMES in the file (here HOW_MANY_SAMPLES = 4):
def parse_textFile(file):
    list_names = []
    with open(file) as f:
        header = f.next()
        head_list = header.rstrip("\r\n").split("\t")
        for i in f:
            e = i.rstrip("\r\n").split("\t")
            list_names.append(e)
    for i in range(1, HOW_MANY_SAMPLES):
        l+i = []  # invalid syntax: variable names cannot be built like this
        l+i.append([a[i] for a in list_names])
I need a dynamic way of creating and filling the number of lists that corresponds to the number of columns in my table.
By using pandas you can create a list of lists or a dict to get what you are looking for.
Create a DataFrame from your file, then iterate through each column and add it to a list or dict.
from StringIO import StringIO
import pandas as pd

TESTDATA = StringIO("""NAME S1 S2 S3 S4
A 1 4 3 1
B 2 1 2 6
C 2 1 3 5""")

columns = []
c_dic = {}
df = pd.read_csv(TESTDATA, sep=" ", engine='python')
for column in df:
    columns.append(df[column].tolist())
    c_dic[column] = df[column].tolist()
Then you will have a list of lists for all the columns:
for x in columns:
    print x
Returns
['A', 'B', 'C']
[1, 2, 2]
[4, 1, 1]
[3, 2, 3]
[1, 6, 5]
and
for k, v in c_dic.iteritems():
    print k, v
returns
S3 [3, 2, 3]
S2 [4, 1, 1]
NAME ['A', 'B', 'C']
S1 [1, 2, 2]
S4 [1, 6, 5]
in case you need to keep track of column names along with the data.
Problem 1:
You can use len(head_list) instead of having to specify HOW_MANY_SAMPLES.
You can also try using Python's CSV module and setting the delimiter to a space or a tab instead of a comma.
See this answer to a similar StackOverflow question.
Problem 2:
Once you have a list representing each row, you can use zip to create lists representing each column; see this answer.
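For instance, a minimal sketch of the zip approach (row data taken from the sample table above):
rows = [['A', '1', '4', '3', '1'],
        ['B', '2', '1', '2', '6'],
        ['C', '2', '1', '3', '5']]
columns = list(zip(*rows))  # transpose: one tuple per column
print(columns[1])  # ('1', '2', '2')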
With the CSV module, you can follow this suggestion, which is another way to invert the data from row-based lists to column-based lists.
Sample:
import csv

# open the file in universal line ending mode
with open('data.txt', 'rU') as infile:
    # register a dialect that skips extra whitespace
    csv.register_dialect('ignorespaces', delimiter=' ', skipinitialspace=True)
    # read the file as a dictionary for each row ({header: value})
    reader = csv.DictReader(infile, dialect='ignorespaces')
    data = {}
    for row in reader:
        for header, value in row.items():
            try:
                if (header):
                    data[header].append(value)
            except KeyError:
                data[header] = [value]

for column in data.keys():
    print(column + ": " + str(data[column]))
this yields:
S2: ['4', '1', '1']
S1: ['1', '2', '2']
S3: ['3', '2', '3']
S4: ['1', '6', '5']
NAME: ['A', 'B', 'C']
Is there an efficient way to store each column of a tab-delimited file in a separate dictionary using python?
A sample input file (the real input file contains thousands of lines and hundreds of columns; the number of columns is not fixed and changes frequently):
A B C
1 4 7
2 5 8
3 6 9
I need to print values in column A:
for cell in mydict["A"]:
print cell
and to print values in the same row:
for i in range(1, numrows):
    for key in keysOfMydict:
        print mydict[key][i]
The simplest way is to use DictReader from the csv module:
import csv

with open('somefile.txt', 'r') as f:
    reader = csv.DictReader(f, delimiter='\t')
    rows = list(reader)  # if your file is not large, you can consume it entirely
    # if your file is large, you might want to step over each row instead:
    # for row in reader:
    #     print(row['A'])

for row in rows:
    print(row['A'])
@Marius made a good point: you might be looking to collect all columns separately by their header.
If that's the case, you'll have to adjust your reading logic a bit:
from collections import defaultdict

by_column = defaultdict(list)
for row in rows:
    for k, v in row.items():
        by_column[k].append(v)
Another option is pandas:
>>> import pandas as pd
>>> i = pd.read_csv('foo.csv', sep=' ')
>>> i
   A  B  C
0  1  4  7
1  2  5  8
2  3  6  9
>>> i['A']
0    1
1    2
2    3
Name: A, dtype: int64
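If plain Python lists per column are the goal, to_dict with orient='list' is a short follow-up on the same frame:
>>> i.to_dict('list')
{'A': [1, 2, 3], 'B': [4, 5, 6], 'C': [7, 8, 9]}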
Not sure this is relevant, but you can do this using rpy2.
from rpy2 import robjects
dframe = robjects.DataFrame.from_csvfile('/your/csv/file.csv', sep=' ')
d = dict([(k, list(v)) for k, v in dframe.items()])
output:
{'A': [1, 2, 3], 'C': [7, 8, 9], 'B': [4, 5, 6]}
I'm new to Python and I have a problem removing unwanted rows from a CSV file. For instance, I have 3 columns and a lot of rows:
A      B    C
hi     1.0  5
hello  2.0  6
ok     3.0  7
I loaded the data using numpy (instead of csv):
import numpy as np

a = np.loadtxt('data.csv', delimiter=',', skiprows=1)
I want to introduce a range for the 2nd column:
b = np.arange(0, 2.1, 0.1)
I don't have any idea how I should use that piece of code.
What I want as a final output is the following:
A      B    C
hi     1.0  5
hello  2.0  6
The last row would be removed, since I chose a range for the 2nd column up to 2.0 only. I don't have any idea how I can accomplish this.
Try with Pandas:
import pandas as pd
a = pd.read_csv('data.csv', index_col=0) # column A will be the index.
a
       B  C
A
hi     1  5
hello  2  6
ok     3  7
For every value of B up to 2:
a[a.B <= 2]
       B  C
A
hi     1  5
hello  2  6
Details:
a.B
A
hi       1
hello    2
ok       3
Name: B, dtype: float64
a.B <= 2
A
hi        True
hello     True
ok       False
Name: B, dtype: bool
You can do it using logical indexing:
index = (x[:, 1] <= 2.0)
Then:
x = x[index]
which selects only the rows that satisfy the condition.
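A fuller sketch of that idea, assuming the numeric columns are loaded with np.loadtxt via usecols (column A holds text, which np.loadtxt cannot mix with floats):
import numpy as np

# load only columns B and C; B is then the first loaded column
x = np.loadtxt('data.csv', delimiter=',', skiprows=1, usecols=(1, 2))
index = (x[:, 0] <= 2.0)  # boolean mask over column B
x = x[index]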
You can just use the csv module. N.B. the following expects that the CSV fields are comma-separated, not tab-separated (as your sample suggests).
import csv

with open('data.csv') as data:
    reader = csv.reader(data)  # or csv.reader(data, delimiter='\t') for tabs
    field_names = next(reader)
    filtered_rows = [row for row in reader if 0 <= float(row[1]) <= 2.0]
>>> field_names
['A', 'B', 'C']
>>> filtered_rows
[['hi', '1.0', '5'], ['hello', '2.0', '6']]
>>> filtered_rows.insert(0, field_names)
>>> filtered_rows
[['A', 'B', 'C'], ['hi', '1.0', '5'], ['hello', '2.0', '6']]
If you require that values be exact tenths within the required range, then you can do this:
import csv
import numpy as np

allowed_values = np.arange(0, 2.1, 0.1)

with open('data.csv') as data:
    reader = csv.reader(data)
    field_names = next(reader)
    filtered_rows = [row for row in reader if float(row[1]) in allowed_values]
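One caveat: np.arange produces values like 0.30000000000000004, so the exact membership test above can miss legitimate tenths. A safer sketch compares with np.isclose:
import numpy as np

allowed_values = np.arange(0, 2.1, 0.1)

def in_allowed(value):
    # tolerant float comparison instead of exact equality
    return bool(np.isclose(allowed_values, float(value)).any())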
Edit after updated requirements
With an extra constraint on column "C", e.g. the value must be >= 6:
import csv
import numpy as np

allowed_values_B = np.arange(0, 2.1, 0.1)

def accept_row(row):
    return (float(row[1]) in allowed_values_B) and (int(row[2]) >= 6)

with open('data.csv') as data:
    reader = csv.reader(data)
    field_names = next(reader)
    filtered_rows = [row for row in reader if accept_row(row)]
>>> filtered_rows
[['hello', '2.0', '6']]