I am trying to import data from a file and then add it to an array. I know that this is not the best way to add elements to a numpy array. Nevertheless, why is the data not appending? The last element of the csv is 1.1 and that's what I get when I do print(dd):
import csv
import numpy as np

with open('C:\\Users\\jez40\\.PyCharmCE2018.2\\8_Data.csv', 'r') as data_file:
    data = csv.reader(data_file, delimiter=',')
    for i in data:
        t = []
        d = []
        dd = []
        t.append([float(i[0])])
        d.append([float(i[1])])
        dd.append([float(i[2])])

t = np.array(t)
d = np.array(d)
dd = np.array(dd)
print(dd)
The root of your problem is that on every iteration of the loop you re-assign t, d and dd to empty lists []. If your end goal is to get numpy arrays for these variables, I would recommend using pd.read_csv() to convert your csv file to a DataFrame. Take this sample csv:
t,d,dd
1,2,3
4,5,6
7,8,9
Using pd.read_csv():
import pandas as pd

df = pd.read_csv(r'C:\Users\jez40\.PyCharmCE2018.2\8_Data.csv')
Gives:
   t  d  dd
0  1  2   3
1  4  5   6
2  7  8   9
Then you can query your columns to return them as pd.Series():
t = df['t']
d = df['d']
dd = df['dd']
Or you can convert them to np.array():
t = np.array(df['t'])
d = np.array(df['d'])
dd = np.array(df['dd'])
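If you would rather keep the csv.reader approach, a minimal fix of the original loop (a sketch of the idea, not of your exact file) is to create the lists once, before the loop, and convert them after it:

import csv
import numpy as np

t, d, dd = [], [], []
with open(r'C:\Users\jez40\.PyCharmCE2018.2\8_Data.csv', 'r') as data_file:
    for row in csv.reader(data_file, delimiter=','):
        t.append(float(row[0]))
        d.append(float(row[1]))
        dd.append(float(row[2]))

# the lists now hold every row, not just the last one
t, d, dd = np.array(t), np.array(d), np.array(dd)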
I have a CSV file with, say, $n=100$ elements, so the file looks like an $n$-dimensional vector. The question is: how can I average every 4 elements and save the results in a new csv file?
For example I generate a list of random numbers:
import random

import pandas as pd

my_random_list = []
for i in range(0, 9):
    n = random.randint(1, 100)
    my_random_list.append(n)

df = pd.DataFrame(my_random_list)
df.to_csv('my_csv.csv', index=False, header=None)
This is similar to my code. Now I want to create a new csv (because I already have the data in csv form) where I average out and save the first 4 elements, then the next 4, and so on. So I will end up with a csv file with only 25 elements.
Use DataFrame.groupby with integer division of the index to form groups of 4 values, and aggregate with mean:
import numpy as np
import pandas as pd

np.random.seed(2021)
df = pd.DataFrame({'a': np.random.randint(1, 10, size=10)})
print (df)
   a
0  5
1  6
2  1
3  7
4  6
5  9
6  7
7  7
8  7
9  7
df1 = df.groupby(df.index // 4).mean()
print (df1)
      a
0  4.75
1  7.25
2  7.00
Detail:
print (df.index // 4)
Int64Index([0, 0, 0, 0, 1, 1, 1, 1, 2, 2], dtype='int64')
All together (if the row count is not a multiple of 4, the last group simply averages the rows that remain):
df = pd.read_csv(file, header=None)
df1 = df.groupby(df.index // 4).mean()
df1.to_csv('my_csv.csv', index=False, header=None)
import csv
import random

import pandas as pd

# FIRST PART -- GENERATES THE ORIGINAL CSV FILE
my_random_list = []
for i in range(0, 100):
    n = random.randint(1, 100)
    my_random_list.append(n)

df = pd.DataFrame(my_random_list)
df.to_csv('my_csv.csv', index=False, header=None)

# SECOND PART -- POPULATES A LIST WITH THE CONTENTS OF THE
# ORIGINAL CSV FILE
with open('my_csv.csv') as file_CSV:
    list_CSV = list(csv.reader(file_CSV))

# THIRD PART -- GENERATES A NEW LIST CONTAINING
# THE AVERAGE OF EVERY FOURTH ELEMENT
# AND ITS THREE PREDECESSORS
new_list = []
for i in range(0, len(list_CSV)):
    if i % 4 == 0:
        s = int(list_CSV[i + 0][0])
        s = s + int(list_CSV[i + 1][0])
        s = s + int(list_CSV[i + 2][0])
        s = s + int(list_CSV[i + 3][0])
        s = s / 4
        new_list.append(s)

# FOURTH PART -- GENERATES A NEW CSV
df = pd.DataFrame(new_list)
df.to_csv('new_csv.csv', index=False, header=None)
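For comparison, the same averaging can be done without an explicit loop by reshaping with numpy; this is a sketch that assumes the number of values is an exact multiple of 4:

import numpy as np
import pandas as pd

values = pd.read_csv('my_csv.csv', header=None)[0].to_numpy()
means = values.reshape(-1, 4).mean(axis=1)  # one mean per group of 4
pd.DataFrame(means).to_csv('new_csv.csv', index=False, header=None)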
What is the best way to take this string:
1
2
3
4
a
b
c
d
1
2
3
4
a
b
c
d
1
2
3
4
a
b
c
d
and transform it into a CSV containing 6 columns?
Desired output is a CSV which will be imported into Pandas:
1,a,1,a,1,a
2,b,2,b,2,b
etc..
Update: desired output changed to 6 rows as per the comments.
Update: I can get the first row like this if I assign the string to the variable l:
l.split()[0::4]
['1', 'a', '1', 'a', '1', 'a']
with open('data.txt', 'r') as f:
    data = f.read().split("\n")

for i in range(4):
    d = list()
    for j in range(i, len(data), 4):
        d.append(data[j])
    with open('data.csv', 'a') as csv:
        csv.write(','.join(d) + "\n")
Even though Art's answer is accepted, here is another way using pandas. You wouldn't need to export the data prior to importing with pandas if you use something like this.
import pandas as pd

myFile = "lines_to_read2.txt"
mycolumns = 4

# DataFrame.append was removed in pandas 2.0, so collect the rows in a
# plain list and build the DataFrame once at the end
rows = []
thisItem = list()
with open(myFile, 'r') as linesToRead:
    for thisLine in linesToRead:
        thisItem.append(thisLine.strip('\n, " "'))
        if len(thisItem) == mycolumns:
            rows.append({'col1': thisItem[0], 'col2': thisItem[1],
                         'col3': thisItem[2], 'col4': thisItem[3]})
            thisItem = list()

myData = pd.DataFrame(rows, columns=['col1', 'col2', 'col3', 'col4'])
myData.to_csv('lines_as_csv_file.csv', index=False)
print(myData)  # Full Table
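If the whole file fits in memory, the same grouping can be written in a few lines with a numpy reshape; a sketch assuming the line count is an exact multiple of 4:

import pandas as pd

lines = pd.read_csv('lines_to_read2.txt', header=None)[0]
myData = pd.DataFrame(lines.to_numpy().reshape(-1, 4),
                      columns=['col1', 'col2', 'col3', 'col4'])
myData.to_csv('lines_as_csv_file.csv', index=False)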
I'm trying to change the structure of my data from a text file (.txt) which looks like this:
:1:A
:2:B
:3:C
:1:D
:2:E
:3:F
:4:G
:1:H
:3:I
:4:J
And I would like to transform it into this format (like a pivot table in Excel, where the column name is the character between the colons and each group always starts with :1:):
Group  :1:  :2:  :3:  :4:
1      A    B    C
2      D    E    F    G
3      H         I    J
Does anyone have any idea? Thanks in advance.
First, create a DataFrame with read_csv and header=None, because the file has no header:
import pandas as pd
temp=u""":1:A
:2:B
:3:C
:1:D
:2:E
:3:F
:4:G
:1:H
:3:I
:4:J"""
from io import StringIO

# after testing, replace 'StringIO(temp)' with 'filename.csv'
df = pd.read_csv(StringIO(temp), header=None)
print (df)
      0
0  :1:A
1  :2:B
2  :3:C
3  :1:D
4  :2:E
5  :3:F
6  :4:G
7  :1:H
8  :3:I
9  :4:J
Extract the original column with DataFrame.pop, remove the leading : with Series.str.strip and split the values into 2 new columns with Series.str.split. Then create the groups by comparing with Series.eq (==) against the string '1' and taking Series.cumsum, build a MultiIndex with DataFrame.set_index, and finally reshape with Series.unstack:
df[['a','b']] = df.pop(0).str.strip(':').str.split(':', expand=True)
df1 = df.set_index([df['a'].eq('1').cumsum(), 'a'])['b'].unstack(fill_value='')
print (df1)
a  1  2  3  4
a
1  A  B  C
2  D  E  F  G
3  H     I  J
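Detail (the grouping key created by eq and cumsum; every '1' in column a starts a new group):

print (df['a'].eq('1').cumsum().tolist())
[1, 1, 1, 2, 2, 2, 2, 3, 3, 3]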
Use:
# Reading text file (assuming stored in CSV format, you can also use pd.read_fwf)
df = pd.read_csv('SO.csv', header=None)
# Splitting data into two columns
ndf = df.iloc[:, 0].str.split(':', expand=True).iloc[:, 1:]
# Grouping and creating a dataframe. Later dropping NaNs
res = ndf.groupby(1)[2].apply(pd.DataFrame).apply(lambda x: pd.Series(x.dropna().values))
# Post processing (optional)
res.columns = [':' + ndf[1].unique()[i] + ':' for i in range(ndf[1].nunique())]
res.index.name = 'Group'
res.index = range(1, res.shape[0] + 1)
res
Group  :1:  :2:  :3:  :4:
1      A    B    C
2      D    E    F    G
3      H         I    J
Another way to do this:
import pandas as pd

# read the file
with open("t.txt") as f:
    content = f.readlines()

# Build a dictionary from the lines, keeping the column name
# (e.g. :1:) as the key and the row value (e.g. A) as the value.
my_dict = {}
for v in content:
    line = v.strip()
    key = line[0:3]    # the label, e.g. ':1:'
    value = line[3:]   # the value, e.g. 'A'
    my_dict.setdefault(key, []).append(value)

# convert the dictionary to a dataframe and transpose it
df = pd.DataFrame.from_dict(my_dict, orient='index').transpose()
df
The output will look like this (note that the values are packed upward within each column, so alignment with the original groups is lost):
  :1:   :2: :3:   :4:
0   A     B   C     G
1   D     E   F     J
2   H  None   I  None
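A variant that preserves the group alignment starts a new row whenever the :1: label comes around again; this sketch reuses content from the block above (the other names are illustrative):

import pandas as pd

rows, current = [], {}
for v in content:
    line = v.strip()
    key, value = line[0:3], line[3:]
    if key == ':1:' and current:  # a new group begins
        rows.append(current)
        current = {}
    current[key] = value
rows.append(current)  # flush the last group

df = pd.DataFrame(rows, columns=[':1:', ':2:', ':3:', ':4:'])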
I have a large space-separated input file input.csv which I can't hold in memory:
## Header
# More header here
A B
1 2
3 4
If I use the iterator=True argument for pandas.read_csv, then it returns a TextFileReader / TextParser object. This allows filtering the file on the fly and only selecting rows for which column A is greater than 2.
But how do I add a third column to the dataframe on the fly without having to loop over all of the data once more?
Specifically I want column C to be equal to column A multiplied by the value in a dictionary d, which has the value of column B as its key; i.e. C = A*d[B].
Currently I have this code:
import pandas
d = {2: 2, 4: 3}
TextParser = pandas.read_csv('input.csv', sep=' ', iterator=True, comment='#')
df = pandas.concat([chunk[chunk['A'] > 2] for chunk in TextParser])
print(df)
Which prints this output:
   A  B
1  3  4
How do I get it to print this output (C = A*d[B]):
   A  B  C
1  3  4  9
You can use a generator to work on the chunks one at a time:
Code:
import pandas as pd

def on_the_fly(the_csv):
    d = {2: 2, 4: 3}
    chunked_csv = pd.read_csv(
        the_csv, sep=r'\s+', iterator=True, comment='#')
    for chunk in chunked_csv:
        rows_idx = chunk['A'] > 2
        chunk.loc[rows_idx, 'C'] = chunk[rows_idx].apply(
            lambda x: x.A * d[x.B], axis=1)
        yield chunk[rows_idx]
Test Code:
from io import StringIO

data = StringIO(u"""#
A B
1 2
3 4
4 4
""")

df = pd.concat([c for c in on_the_fly(data)])
print(df)
Results:
   A  B     C
1  3  4   9.0
2  4  4  12.0
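The same filtering and new column can also be produced without a generator, using chunksize and a vectorised lookup with Series.map; a sketch under the same assumptions about the input file:

import pandas as pd

d = {2: 2, 4: 3}
chunks = []
for chunk in pd.read_csv('input.csv', sep=r'\s+', comment='#', chunksize=100000):
    kept = chunk[chunk['A'] > 2].copy()  # chunk size is illustrative
    kept['C'] = kept['A'] * kept['B'].map(d)  # vectorised C = A*d[B]
    chunks.append(kept)

df = pd.concat(chunks)
print(df)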
Is there an efficient way to store each column of a tab-delimited file in a separate dictionary using Python?
A sample input file (the real input file contains thousands of lines and hundreds of columns; the number of columns is not fixed and changes frequently):
A B C
1 4 7
2 5 8
3 6 9
I need to print the values in column A:
for cell in mydict["A"]:
    print(cell)
and to print the values in the same row:
for i in range(1, numrows):
    for key in keysOfMydict:
        print(mydict[key][i])
The simplest way is to use DictReader from the csv module:
import csv

with open('somefile.txt', 'r') as f:
    reader = csv.DictReader(f, delimiter='\t')
    rows = list(reader)  # if your file is not large, you can
                         # consume it entirely
    # If your file is large, you might want to
    # step over each row instead:
    # for row in reader:
    #     print(row['A'])

for row in rows:
    print(row['A'])
@Marius made a good point - that you might be looking to collect all columns separately by their header.
If that's the case, you'll have to adjust your reading logic a bit:
from collections import defaultdict

by_column = defaultdict(list)
for row in rows:
    for k, v in row.items():
        by_column[k].append(v)
Another option is pandas:
>>> import pandas as pd
>>> i = pd.read_csv('foo.csv', sep=' ')
>>> i
   A  B  C
0  1  4  7
1  2  5  8
2  3  6  9
>>> i['A']
0    1
1    2
2    3
Name: A, dtype: int64
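Since the question asks for each column in its own dictionary-like container, pandas can also hand back a plain dict of lists directly; a small follow-up sketch:

>>> i.to_dict(orient='list')
{'A': [1, 2, 3], 'B': [4, 5, 6], 'C': [7, 8, 9]}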
Not sure this is relevant, but you can do this using rpy2.
from rpy2 import robjects
dframe = robjects.DataFrame.from_csvfile('/your/csv/file.csv', sep=' ')
d = dict([(k, list(v)) for k, v in dframe.items()])
output:
{'A': [1, 2, 3], 'C': [7, 8, 9], 'B': [4, 5, 6]}