I have an input .txt file with an adjacency matrix that looks like this:
A B C
A 0 55 0
B 55 0 0
C 0 0 0
How can I parse this input into a 2D array or nested dictionary?
e.g.
map['A']['B'] = 55
from io import StringIO

# This is just for the sake of a self-contained example;
# this would be your actual file opened with open().
inf = StringIO(" A B C\n"
               "A 0 55 0\n"
               "B 55 0 0\n"
               "C 0 0 0")

import re

# Note: the name `map` shadows the built-in; it is kept here only to
# match the question's example.
map = {}
lines = inf.readlines()
headers = []
# Extract each group of consecutive word characters.
# If your headers might contain dashes or other non-word characters,
# you might want r'([^\s]+)' instead.
for header in re.findall(r'(\w+)', lines[0]):
    headers.append(header)
    map[header] = {}
for line in lines[1:]:
    items = re.findall(r'(\w+)', line)
    rowname = items[0]
    for idx, item in enumerate(items[1:]):
        map[headers[idx]][rowname] = int(item)  # int() so map['A']['B'] == 55
print(map)
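If you'd rather not use a regex, the same nested dict can be built with plain str.split(), since the file is whitespace-delimited; a sketch under that assumption:

from io import StringIO

inf = StringIO(" A B C\n"
               "A 0 55 0\n"
               "B 55 0 0\n"
               "C 0 0 0")

lines = inf.read().splitlines()
headers = lines[0].split()
graph = {h: {} for h in headers}
for line in lines[1:]:
    # First field is the row label, the rest are the matrix entries.
    rowname, *values = line.split()
    for header, value in zip(headers, values):
        graph[header][rowname] = int(value)

print(graph['A']['B'])  # 55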
from io import StringIO
d = StringIO(" A B C\nA 0 55 0\nB 55 0 0\nC 0 0 0\n")

import pandas as pd
# sep=r'\s+' is the modern equivalent of the deprecated delim_whitespace=True
map = pd.read_csv(d, sep=r'\s+', header=0, index_col=0)
print(map)
    A   B  C
A   0  55  0
B  55   0  0
C   0   0  0
print(map['A']['B'])
55
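Since the question also asks for a nested dictionary, it may be worth noting that the DataFrame converts directly; a small follow-up on the `map` frame built above:

graph = map.to_dict()   # {'A': {'A': 0, 'B': 55, 'C': 0}, ...}
print(graph['A']['B'])  # 55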
I am trying to convert a file into an adjacency matrix. I need to do this in a way that allows files of different sizes to fill the matrix. My current working file is of size 4; it is only a test file, and what I need is an abstract way of handling larger files.
This is my test file. The 1 through 4 in the first column indicate which matrix column each Boolean value belongs to.
1,0
1,0
1,1
1,1
2,0
2,0
2,0
2,1
3,1
3,0
3,0
3,1
4,1
4,1
4,1
4,0
I would like an end result of:
0 0 1 1
0 0 0 1
1 0 0 1
1 1 1 0
Here is the code I have that produces a dataframe similar to my input file.
# Importing needed libraries
import os.path
from math import sqrt
import numpy as np
import pandas as pd

# Changing the filepath to a variable name
fileName = "./testAlgorithm.csv"

# Opening the file, doing a file check, converting
# the file to a dataframe
if os.path.isfile(fileName):
    with open(fileName, "r") as csvfile:
        df = pd.read_csv(csvfile, header=None)
else:
    print(f"file {fileName} does not exist")

# Method used to count the number of lines
# in the data file
def simpleCount(fileName):
    lines = 0
    with open(fileName) as f:
        for line in f:
            lines += 1
    return sqrt(lines)

# Method call for the line count.
lineNum = simpleCount(fileName)
print(df)
num = int(simpleCount(fileName))
df = pd.DataFrame({"A":[0,0,1,1,0,0,0,1,1,0,0,1,1,1,1,0]})
df.values.reshape(4,4)
If you want to turn it back into a dataframe:
pd.DataFrame(df.values.reshape(4,4), columns=["A", "B", "C", "D"])
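To keep this abstract for larger files, the dimension can be computed from the row count instead of hardcoding 4. A sketch, assuming the two-column id,value layout of the test file (./testAlgorithm.csv is the filename used in the question):

import numpy as np
import pandas as pd

df = pd.read_csv("./testAlgorithm.csv", header=None)

n = int(np.sqrt(len(df)))            # an n x n matrix needs n * n input rows
matrix = df[1].values.reshape(n, n)  # column 1 holds the Boolean values
adj = pd.DataFrame(matrix, index=range(1, n + 1), columns=range(1, n + 1))
print(adj)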
you can try:

import numpy as np
import pandas as pd

# Assumes df has two columns named 'c1' (group id) and 'c2' (Boolean value).
# Note: columns must be a flat list; a nested list would create a MultiIndex.
dummy = pd.DataFrame(columns=['c1', 'c2', 'c3', 'c4'])
dummy['c1'] = np.array(df['c2'].loc[df['c1'] == 1])
dummy['c2'] = np.array(df['c2'].loc[df['c1'] == 2])
dummy['c3'] = np.array(df['c2'].loc[df['c1'] == 3])
dummy['c4'] = np.array(df['c2'].loc[df['c1'] == 4])
It will give you:
   c1  c2  c3  c4
0   0   0   1   1
1   0   0   0   1
2   1   0   0   1
3   1   1   1   0
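If the number of groups isn't known in advance, the same result can be had without hardcoding 'c1' through 'c4'. A sketch using a pivot, assuming the same two-column layout (the names 'c1'/'c2' are just illustrative):

import pandas as pd

df = pd.read_csv("./testAlgorithm.csv", header=None, names=["c1", "c2"])

# Number each value within its group, then pivot the groups into columns.
df["pos"] = df.groupby("c1").cumcount()
adj = df.pivot(index="pos", columns="c1", values="c2")
print(adj)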
I have a dataframe as follows:
Index  A  B  C  D  E  F
1      0  0  C  0  E  0
2      A  0  0  0  0  F
3      0  0  0  0  E  0
4      0  0  C  D  0  0
5      A  B  0  0  0  0
Basically I would like to write the dataframe to a txt file, such that every row consists of the index followed by the column names of the non-zero entries only.
For example:
txt file
1 C E
2 A F
3 E
4 C D
5 A B
The dataset is quite big, about 1k rows, 16k columns. Is there any way I can do this using a function in Pandas?
Take a matrix-vector multiplication between the Boolean matrix generated by "is this entry "0" or not?" and the column names of the dataframe, then write the result to a text file with to_csv (thanks to @Andreas' answer!):
df.ne("0").dot(df.columns + " ").str.rstrip().to_csv("text_file.txt")
where we strip the trailing space that the appended " " leaves on each row's last entry.
If you don't want the name Index appearing in the text file, you can chain a rename_axis(index=None) to get rid of it i.e.,
df.ne("0").dot(df.columns + " ").str.rstrip().rename_axis(index=None)
and then to_csv as above.
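To see why the dot trick works: each True in the Boolean matrix picks up the corresponding column name, and the string "dot product" concatenates them per row. A minimal sketch on made-up data:

import pandas as pd

df = pd.DataFrame({"A": ["A", "0"], "B": ["0", "B"], "C": ["C", "C"]},
                  index=[1, 2])

mask = df.ne("0")                              # True where the entry is not "0"
print(mask.dot(df.columns + " ").str.rstrip())
# 1    A C
# 2    B C
# dtype: object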
You can try this (replace '0' with 0 if those are numeric zeros instead of strings):

# Credits to Pygirl, who made the code even better.
import numpy as np

df.set_index('Index', inplace=True)
df = df.replace('0', np.nan)
df.stack().groupby(level=0).apply(list)
# Out[79]:
# Index
# 1    [C, E]
# 2    [A, F]
# 3    [E]
# 4    [C, D]
# 5    [A, B]
# dtype: object
For the writing to text, you can use pandas as well:
df.to_csv('your_text_file.txt')
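Note that to_csv would write the Python lists verbatim (e.g. "['C', 'E']"); if the goal is the space-separated layout from the question, one option is to join each group into a string first and write the lines manually, a sketch:

out = df.stack().groupby(level=0).apply(' '.join)
with open('your_text_file.txt', 'w') as f:
    for idx, line in out.items():
        f.write(f"{idx} {line}\n")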
You could replace the string '0' with an empty string '', then do some string/list/join manipulation to get the final result. Finally, append each line to a text file. See the code:

import pandas as pd

df = pd.DataFrame([
    ['0', '0', 'C', '0', 'E', '0'],
    ['A', '0', '0', '0', '0', 'F'],
    ['0', '0', '0', '0', 'E', '0'],
    ['0', '0', 'C', 'D', '0', '0'],
    ['A', 'B', '0', '0', '0', '0']], columns=['A', 'B', 'C', 'D', 'E', 'F']
)
df = df.replace('0', '')

with open('test.txt', 'a') as logfile:
    for i in range(len(df)):
        temp = ''.join(list(df.loc[i, :]))
        logfile.write(str(i + 1) + ' ' + ' '.join(list(temp)) + '\n')
Output test.txt
1 C E
2 A F
3 E
4 C D
5 A B
My Python code:

import operator

with open('index.txt') as f:
    lines = f.read().splitlines()
print(type(lines))
print(len(lines))

l2 = lines[1::3]
print(len(l2))
print(l2[0])

list1 = [0, 2]
my_items = operator.itemgetter(*list1)
new_list = [my_items(x) for x in l2]

with open('newindex1.txt', 'w') as thefile:
    for item in l2:
        thefile.write("%s\n" % item)
A couple of lines from index.txt:
0 0 0
0 1 0
0 2 0
1 0 0
1 1 0
1 2 0
2 0 0
2 1 0
2 2 0
3 0 0
A couple of lines from newindex1.txt:
0 1 0
1 1 0
2 1 0
3 1 0
4 1 0
5 1 0
6 1 0
7 1 0
8 1 0
9 1 0
I wanted to read the file as a list, then choose every third row, and finally select the first and the third column from that list. It seems that I do not understand how operator works.
If I try Back2Basics' solution:
import numpy as np
myarray = np.fromfile('index.txt', dtype=int, sep=' ')
anotherarray = myarray[::3][0,2]
I got:
  File "a12.py", line 4, in <module>
    anotherarray = myarray[::3][0,2]
IndexError: too many indices
You don't need to read all the data into memory at all; you can use itertools.islice to parse the rows you want, and the csv lib to read and write the data:
from operator import itemgetter
from itertools import islice
import csv

with open("in.txt") as f, open('newindex1.txt', 'w') as out:
    r = csv.reader(f, delimiter=" ")
    wr = csv.writer(out, delimiter=" ")
    # islice(r, 1, 3, 3) consumes three rows per call and keeps the second,
    # matching the lines[1::3] selection from the question.
    for row in iter(lambda: list(islice(r, 1, 3, 3)), []):
        wr.writerow(itemgetter(0, 2)(row[0]))
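The two-argument form of iter() may be the unfamiliar part here: iter(callable, sentinel) keeps calling the callable until it returns the sentinel, which above is the empty list islice yields once the reader is exhausted. A minimal sketch of the same pattern on plain numbers:

from itertools import islice

nums = iter(range(10))

# Pull chunks of three until the source runs dry; iter() stops at the
# sentinel value [] that islice produces on an exhausted iterator.
for chunk in iter(lambda: list(islice(nums, 3)), []):
    print(chunk)
# [0, 1, 2]
# [3, 4, 5]
# [6, 7, 8]
# [9]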
I'd highly suggest using numpy for this, since it's all numerical data that fits nicely into memory. The code looks like this:
import numpy as np

# np.fromfile with sep=' ' returns a flat 1-D array, so reshape it into
# rows of three columns before slicing with two axes.
myarray = np.fromfile('index.txt', dtype=int, sep=' ').reshape(-1, 3)
anotherarray = myarray[::3, ::2]
and then you want to write the file
anotherarray.tofile('newfile.txt', sep=" ")
The slice [::3, ::2] reads as "take every 3rd row starting from row 0, and take every other column starting from column 0".
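For reference, the asker's IndexError came from applying a two-axis index to the flat 1-D array; after the reshape, the two-axis slice works. A quick sketch on the sample rows, using [1::3] to match the asker's lines[1::3] selection:

import numpy as np

flat = np.array([0, 0, 0, 0, 1, 0, 0, 2, 0, 1, 0, 0, 1, 1, 0, 1, 2, 0])
table = flat.reshape(-1, 3)   # six rows of three columns

print(table[1::3, ::2])       # every third row starting at 1; columns 0 and 2
# [[0 0]
#  [1 0]]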
I think you need something like this:
lines = []
with open('index.txt', 'r') as fi:
    lines = fi.read().splitlines()
lines = [line.split() for line in lines]

with open('answer.txt', 'w') as fo:
    for row in range(len(lines)):
        if row % 3 == 1:  # keep every third row, matching lines[1::3]
            fo.write('%s %s\n' % (lines[row][0], lines[row][2]))
I have a big file of word/tag pairs saved like this:
This/DT gene/NN called/VBN gametocide/NN
Now I want to put these pairs into a DataFrame with their counts like this:
        DT  NN  ...
This     1   0
Gene     0   1
  :
I tried doing this with a dict that counts the pairs and then putting it into the DataFrame:

from collections import defaultdict
import pandas as pd

with open("data.txt", "r") as file:
    train = file.read()
words = train.split()

data = defaultdict(int)
for i in words:
    data[i] += 1

matrixB = pd.DataFrame()
for elem, count in data.items():
    word, tag = elem.split('/')
    matrixB.loc[tag, word] = count
But this takes a really long time (file has like 300000 of these). Is there a faster way to do this?
What was wrong with the answers from your other question?
from collections import Counter
import pandas as pd

with open('data.txt') as f:
    train = f.read()

c = Counter(tuple(x.split('/')) for x in train.split())
s = pd.Series(c)
df = s.unstack().fillna(0).astype(int)  # unstack introduces NaN/floats; cast back to int
print(df)
yields
            DT  NN  VBN
This         1   0    0
called       0   0    1
gametocide   0   1    0
gene         0   1    0
I thought this question was remarkably similar... Why did you post twice?
from collections import Counter
import pandas as pd

text = "This/DT gene/NN called/VBN gametocide/NN"
>>> pd.Series(Counter(tuple(pair.split('/')) for pair in text.split())).unstack().fillna(0)
            DT  NN  VBN
This         1   0    0
called       0   0    1
gametocide   0   1    0
gene         0   1    0
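As a variation on the same idea, pandas can count the pairs itself with pd.crosstab, avoiding the explicit Counter; a small sketch:

import pandas as pd

text = "This/DT gene/NN called/VBN gametocide/NN"
# Split each word/tag pair into two parallel sequences.
words, tags = zip(*(pair.split('/') for pair in text.split()))
print(pd.crosstab(pd.Series(words, name='word'), pd.Series(tags, name='tag')))
# tag         DT  NN  VBN
# word
# This         1   0    0
# called       0   0    1
# gametocide   0   1    0
# gene         0   1    0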
I have a csv file which contains 65000 lines (approximately 28 MB). Each line begins with a certain path, e.g. "c:\abc\bcd\def\123\456". Now let's say the path "c:\abc\bcd\" is common to all the lines and the rest of the content differs. I have to remove the common part (in this case "c:\abc\bcd\") from all the lines using a Python script. For example, the content of the CSV file is as follows:
C:/Abc/bcd/Def/Test/temp/test/GLNext/FILE0.frag 0 0 0
C:/Abc/bcd/Def/Test/temp/test/GLNext/FILE0.vert 0 0 0
C:/Abc/bcd/Def/Test/temp/test/GLNext/FILE0.link-link-0.frag 16 24 3
C:/Abc/bcd/Def/Test/temp/test/GLNext/FILE0.link-link-0.vert 87 116 69
C:/Abc/bcd/Def/Test/temp/test/GLNext/FILE0.link-link-0.vert.bin 75 95 61
C:/Abc/bcd/Def/Test/temp/test/GLNext/FILE0.link-link-0 0 0
C:/Abc/bcd/Def/Test/temp/test/GLNext/FILE0.link-link-6 0 0 0
In the above example I need the output as below
FILE0.frag 0 0 0
FILE0.vert 0 0 0
FILE0.link-link-0.frag 17 25 2
FILE0.link-link-0.vert 85 111 68
FILE0.link-link-0.vert.bin 77 97 60
FILE0.link-link-0 0 0
FILE0.link 0 0 0
Can any of you please help me out with this?
^\S+/
You can simply apply this regex to each line and replace the match with an empty string. See the demo:
https://regex101.com/r/cK4iV0/17
import re

p = re.compile(r'^\S+/', re.MULTILINE)
test_str = ("C:/Abc/Def/Test/temp/test/GLNext/FILE0.frag 0 0 0\n"
            "C:/Abc/Def/Test/temp/test/GLNext/FILE0.vert 0 0 0\n"
            "C:/Abc/Def/Test/temp/test/GLNext/FILE0.link-link-0.frag 16 24 3\n"
            "C:/Abc/Def/Test/temp/test/GLNext/FILE0.link-link-0.vert 87 116 69\n"
            "C:/Abc/Def/Test/temp/test/GLNext/FILE0.link-link-0.vert.bin 75 95 61\n"
            "C:/Abc/Def/Test/temp/test/GLNext/FILE0.link-link-0 0 0\n"
            "C:/Abc/Def/Test/temp/test/GLNext/FILE0.link-link-6 0 0 0")
subst = ""
result = re.sub(p, subst, test_str)
What about something like this:

import csv

sl = []
with open("file.csv", 'r', newline='') as f:
    csvread = csv.reader(f, delimiter=' ')
    for line in csvread:
        # Each line is a list of fields; strip the prefix from the path field.
        sl.append([line[0].replace("C:/Abc/bcd/Def/Test/temp/test/GLNext/", "")] + line[1:])

To write the list sl out to filenew use:

with open('filenew.csv', 'w', newline='') as f:
    csvwrite = csv.writer(f, delimiter=' ')
    for line in sl:
        csvwrite.writerow(line)
You can automatically detect the common prefix without the need to hardcode it, and you don't really need a regex for this: os.path.commonprefix can be used instead.
import csv
import os.path

with open('data.csv', 'r', newline='') as csvfile:
    reader = csv.reader(csvfile)
    paths = []  # stores all paths
    rows = []   # stores all lines
    for row in reader:
        paths.append(row[0].split("/"))  # split each path by "/"
        rows.append(row)

commonprefix = os.path.commonprefix(paths)  # finds the prefix common to all paths
for row in rows:
    row[0] = row[0].replace('/'.join(commonprefix) + '/', "")  # remove prefix
rows is now a list of lists which you can write to a file:
with open('data2.csv', 'w', newline='') as csvfile:
    writer = csv.writer(csvfile)
    for row in rows:
        writer.writerow(row)
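One design note: os.path.commonprefix compares character by character, so calling it on raw strings can cut a path mid-component; splitting on "/" first, as above, keeps whole folders. A quick sketch of the difference, on made-up paths:

import os.path

paths = ["C:/Abc/bcd/Def", "C:/Abc/bc/Def"]

print(os.path.commonprefix(paths))
# 'C:/Abc/bc'  -- character-level, cuts mid-folder
print(os.path.commonprefix([p.split("/") for p in paths]))
# ['C:', 'Abc']  -- component-level, keeps whole folders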
The following Python script will read your file in (assuming it looks like your example) and will create a version removing the common folders:
import os.path, csv

finput = open("d:\\input.csv", "r", newline='')
csv_input = csv.reader(finput, delimiter=" ", skipinitialspace=True)
csv_output = csv.writer(open("d:\\output.csv", "w", newline=''), delimiter=" ")

# Create a set of unique folder names
set_folders = set()
for input_row in csv_input:
    set_folders.add(os.path.split(input_row[0])[0])

# Determine the common prefix
base_folder = os.path.split(os.path.commonprefix(list(set_folders)))[0]
nprefix = len(base_folder) + 1

# Go back to the start of the input CSV
finput.seek(0)
for input_row in csv_input:
    csv_output.writerow([input_row[0][nprefix:]] + input_row[1:])
Using the following as input:
C:/Abc/Def/Test/temp/test/GLNext/FILE0.frag 0 0 0
C:/Abc/Def/Test/temp/test/GLNext/FILE0.vert 0 0 0
C:/Abc/Def/Test/temp/test/GLNext/FILE0.link-link-0.frag 16 24 3
C:/Abc/Def/Test/temp/test/GLNext2/FILE0.link-link-0.vert 87 116 69
C:/Abc/Def/Test/temp/test/GLNext5/FILE0.link-link-0.vert.bin 75 95 61
C:/Abc/Def/Test/temp/test/GLNext7/FILE0.link-link-0 0 0
C:/Abc/Def/Test/temp/test/GLNext/FILE0.link-link-6 0 0 0
The output is as follows:
GLNext/FILE0.frag 0 0 0
GLNext/FILE0.vert 0 0 0
GLNext/FILE0.link-link-0.frag 16 24 3
GLNext2/FILE0.link-link-0.vert 87 116 69
GLNext5/FILE0.link-link-0.vert.bin 75 95 61
GLNext7/FILE0.link-link-0 0 0
GLNext/FILE0.link-link-6 0 0 0
With one space between each column, although this could easily be changed.
So I tried something like this:

import os
import fileinput

# Directory is assumed to be defined elsewhere as the root folder to walk.
for dirName, subdirList, fileList in os.walk(Directory):
    for fname in fileList:
        if fname.endswith('.csv'):
            for line in fileinput.input(os.path.join(dirName, fname), inplace=1):
                location = line.find('GLNext')
                if location > 0:
                    location += len('GLNext')
                    print(line.replace(line[:location], "."), end="")
                else:
                    print(line, end="")
You can use the pandas library for this. Doing so, you can leverage pandas' amazing handling of big CSV files (even in the hundreds of MB).
Code:
import pandas as pd

csv_file = 'test_csv.csv'
df = pd.read_csv(csv_file, header=None)
print(df)
print("-------------------------------------------")

path = "C:/Abc/bcd/Def/Test/temp/test/GLNext/"
df[0] = df[0].replace({path: ""}, regex=True)
print(df)
# df.to_csv("truncated.csv")  # Export to new file.
Result:
0 1 2 3
0 C:/Abc/bcd/Def/Test/temp/test/GLNext/FILE0.frag 0 0 0
1 C:/Abc/bcd/Def/Test/temp/test/GLNext/FILE0.vert 0 0 0
2 C:/Abc/bcd/Def/Test/temp/test/GLNext/FILE0.lin... 16 24 3
3 C:/Abc/bcd/Def/Test/temp/test/GLNext/FILE0.lin... 87 116 69
4 C:/Abc/bcd/Def/Test/temp/test/GLNext/FILE0.lin... 75 95 61
5 C:/Abc/bcd/Def/Test/temp/test/GLNext/FILE0.lin... 0 0 NaN
6 C:/Abc/bcd/Def/Test/temp/test/GLNext/FILE0.lin... 0 0 0
-------------------------------------------
0 1 2 3
0 FILE0.frag 0 0 0
1 FILE0.vert 0 0 0
2 FILE0.link-link-0.frag 16 24 3
3 FILE0.link-link-0.vert 87 116 69
4 FILE0.link-link-0.vert.bin 75 95 61
5 FILE0.link-link-0 0 0 NaN
6 FILE0.link-link-6 0 0 0
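If the common prefix isn't known ahead of time, the two ideas above combine naturally: detect it with os.path.commonprefix and strip it with pandas. A hedged sketch, assuming the same test_csv.csv layout as in the answer above:

import os.path
import pandas as pd

df = pd.read_csv("test_csv.csv", header=None)

# Component-wise common prefix, so no folder name gets cut in half.
prefix = os.path.commonprefix([p.split("/") for p in df[0]])
base = "/".join(prefix) + "/"

df[0] = df[0].str.replace(base, "", regex=False)  # plain substring removal
df.to_csv("truncated.csv", index=False, header=False)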