2D list to csv - by column - python

I'd like to export the content of a 2D-list into a csv file.
The sizes of the sublists can differ. For example, the 2D list can be something like:
a = [ ['a','b','c','d'], ['e','f'], ['g'], [], ['h','i'] ]
I want my csv to store the data "by column", like this:
a,e,g, ,h
b,f, , ,i
c
d
Do I have to add some blank spaces to get the same size for each sublist? Or is there another way to do it?
Thank you for your help

You can use itertools.zip_longest:
import itertools, csv
a = [['a','b','c','d'], ['e','f'], ['g'], [], ['h','i']]
with open('filename.csv', 'w', newline='') as f:  # newline='' prevents blank rows on Windows
    writer = csv.writer(f)
    writer.writerows(itertools.zip_longest(*a, fillvalue=''))
Output:
a,e,g,,h
b,f,,,i
c,,,,
d,,,,

It can be done using pandas and the transpose attribute (.T):
import pandas as pd
pd.DataFrame(a).T.to_csv('test.csv')
Result (test.csv):
,0,1,2,3,4
0,a,e,g,,h
1,b,f,,,i
2,c,,,,
3,d,,,,
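If you want the file to match the exact output asked for, without the index and the header row, to_csv can suppress both via its index and header parameters:
pd.DataFrame(a).T.to_csv('test.csv', index=False, header=False)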

import itertools
import pandas as pd
First create a dataframe using a nested array:
a = ['a','b','c','d']
b = ['e','f']
c = ['g']
d = []
e = ['h','i']
nest = [a,b,c,d,e]
df = pd.DataFrame(list(itertools.zip_longest(*nest)), columns=['a', 'b', 'c', 'd', 'e'])
The dataframe looks like this:
a b c d e
0 a e g None h
1 b f None None i
2 c None None None None
3 d None None None None
and then store it using pandas:
df.to_csv('filename.csv', index=False)
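Note that to_csv writes the None cells as empty fields by default; if you also want to omit the a, b, c, d, e header row, pass header=False as well:
df.to_csv('filename.csv', index=False, header=False)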

We have three tasks to do here: pad the sublists so they all have the same length, transpose, and write to csv.
Sadly Python has no built-in function for the padding, but it can be done relatively easily. I would do it the following way
(the code below is intended to give the result exactly as requested in the OP):
a = [['a','b','c','d'],['e','f'],['g'],[],['h','i']]
ml = max(len(i) for i in a)  # number of elements in the longest sublist
a = [(i + [' '] * ml)[:ml] for i in a]  # pad sublists shorter than the longest with ' '
a = list(zip(*a))  # transpose
a = [','.join(i) for i in a]  # create the list of lines to be written
a = [i.rstrip(', ') for i in a]  # jettison padding that is not followed by a value
a = '\n'.join(a)  # create the string to be written to the file
with open('myfile.csv', 'w') as f:
    f.write(a)
Content of myfile.csv:
a,e,g, ,h
b,f, , ,i
c
d

Related

Is there an elegant way to map an alias to its real entity name in a connected data file?

I have a big csv file containing connected data such as:
Source,Target
a,token
b,token2
c,token3
d,j
e,k
f,l
token,g
token2,h
token3,i
the structure of the file is mixed, so the rows where the relation is
a,token
b,token2
c,token3
do not identify specific relations in the network graph, but instead define aliases to which the entities a, b and c are mapped.
In the rest of the file, I have standard relations (d,j; e,k; f,l), but also relations where the real name of the entity is replaced by its alias:
token,g
token2,h
token3,i
Currently, I am looping over the file with ugly 'for' loops; that way I am able to map the relations as desired and get:
a,g
b,h
c,i
but it's not elegant and is, perhaps, heavy on my CPU.
Is there any built-in function (maybe in pandas) or some elegant and quick way (a few lines of code) to map the file as desired in Python?
import pandas as pd

data = [
    ['a', 'token'],
    ['b', 'token2'],
    ['c', 'token3'],
    ['d', 'j'],
    ['e', 'k'],
    ['f', 'l'],
    ['token', 'g'],
    ['token2', 'h'],
    ['token3', 'i'],
]
df = pd.DataFrame(data, columns=['Source', 'Target'])
source_to_target = {row.Source: row.Target for row in df.itertuples()}
df.loc[:, 'AliasedTarget'] = df.loc[:, 'Target'].apply(lambda x: source_to_target.get(x, x))
print(df.head())
Source Target AliasedTarget
0 a token g
1 b token2 h
2 c token3 i
3 d j j
4 e k k
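If you also want to drop the alias-definition rows and keep only the six resolved relations (my reading of the desired output), one way is to keep only the rows whose Source never occurs as a Target:
df = df[~df.Source.isin(df.Target)][['Source', 'AliasedTarget']]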
IIUC, you can try:
Fetch the common elements from source / target.
Replace the value of common elements in target with the required values.
Delete the rows with common elements.
import numpy as np
# the aliases are the values that appear both as a Source and as a Target
common_elements = np.intersect1d(df.Source.values, df.Target.values)
# replace aliased targets with the real targets defined by the alias rows
df.Target = df.Target.replace(dict(df[df.Source.isin(common_elements)].values))
# drop the alias-definition rows themselves
df = df[~df.Source.isin(common_elements)]
OUTPUT:
Source Target
0 a g
1 b h
2 c i
3 d j
4 e k
5 f l
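With the sample data, common_elements works out to ['token', 'token2', 'token3'], i.e. exactly the aliases.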

Python find and replace tool using pandas and a dictionary

Having issues with building a find-and-replace tool in Python. The goal is to search a column in an Excel file for a string and swap out every letter of the string based on the key:value pairs of the dictionary, then write the entire new string back to the same cell. So "ABC" should convert to "BCD". I have to find and replace any occurrence of individual characters.
The code below runs without errors, but newvalue never gets created and I don't know why. There are no issues writing data to the cell when newvalue does get created.
input: df = pd.DataFrame({'Code1': ['ABC1', 'B5CD', 'C3DE']})
expected output: df = pd.DataFrame({'Code1': ['BCD1', 'C5DE', 'D3EF']})
mycolumns = ["Col1", "Col2"]
mydictionary = {'A': 'B', 'B': 'C', 'C': 'D'}
for x in mycolumns:
    # 1. If the mycolumn value exists in the headerlist of the file
    if x in headerlist:
        # 2. Get column coordinate
        col = df.columns.get_loc(x) + 1
        # 3. iterate through the rows underneath that header
        for ind in df.index:
            # 4. log the row coordinate
            rangerow = ind + 2
            # 5. get the original value of that coordinate
            oldval = df[x][ind]
            for count, y in enumerate(oldval):
                # 6. generate replacement value
                newval = df.replace({y: mydictionary}, inplace=True, regex=True, value=None)
                print("old: " + str(oldval) + " new: " + str(newval))
                # 7. update the cell
                ws.cell(row=rangerow, column=col).value = newval
            else:
                print("not in the string")
    else:
        # print(df)
        print("column doesn't exist in workbook, moving on")
else:
    print("done")
wb.save(filepath)
wb.close()
I know there's something going on with enumerate, and I'm probably not stitching the string back together after I do the replacements? Or maybe a dictionary is the wrong solution for what I am trying to do; the key:value pairing is what led me to use it. I have a little programming background but very little with Python. Appreciate any help.
newvalue never gets created and I don't know why.
DataFrame.replace with inplace=True will return None.
>>> df = pd.DataFrame({'Code1': ['ABC1', 'B5CD', 'C3DE']})
>>> df = df.replace('ABC1','999')
>>> df
Code1
0 999
1 B5CD
2 C3DE
>>> q = df.replace('999','zzz', inplace=True)
>>> print(q)
None
>>> df
Code1
0 zzz
1 B5CD
2 C3DE
>>>
An alternative could be to use str.translate on the column (via its .str accessor) to translate the entire Series:
>>> df = pd.DataFrame({'Code1': ['ABC1', 'B5CD', 'C3DE']})
>>> mydictionary = {'A': 'B', 'B': 'C', 'C': 'D'}
>>> table = str.maketrans('ABC','BCD')
>>> df
Code1
0 ABC1
1 B5CD
2 C3DE
>>> df.Code1.str.translate(table)
0 BCD1
1 C5DD
2 D3DE
Name: Code1, dtype: object
>>>
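Since the mapping already lives in mydictionary, the translation table can also be built straight from it; str.maketrans accepts a dict as a single argument as long as the keys are single characters, and the result can be written back to the frame:
>>> table = str.maketrans(mydictionary)
>>> df['Code1'] = df['Code1'].str.translate(table)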

How to combine two rows in Python list

Suppose I have a 2D list,
a = [['a','b','c',1],
     ['a','b','d',2],
     ['a','e','d',3],
     ['a','e','c',4]]
I want to obtain a list such that, if the first two elements of two rows are identical, the rows are combined: the fourth elements are summed and the third element is dropped, like the following:
b = [['a','b',3],
     ['a','e',7]]
What is the most efficient way to do this?
If your list is already sorted, then you can use itertools.groupby. Once you group by the first two elements, you can use a generator expression to sum the 4th element and create your new lists.
>>> from itertools import groupby
>>> a = [['a','b','c',1],
...      ['a','b','d',2],
...      ['a','e','d',3],
...      ['a','e','c',4]]
>>> [g[0] + [sum(i[3] for i in g[1])] for g in groupby(a, key = lambda i : i[:2])]
[['a', 'b', 3],
['a', 'e', 7]]
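If the list isn't sorted yet, sort it by the same key first, since groupby only merges consecutive runs:
>>> a.sort(key=lambda i: i[:2])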
Using pandas's groupby:
import pandas as pd
df = pd.DataFrame(a)
df.groupby([0, 1]).sum().reset_index().values.tolist()
Output:
[['a', 'b', 3L], ['a', 'e', 7L]]
(The L suffix comes from Python 2's long integers. In recent pandas versions you may need to select the numeric column before summing, as the next answer does, since sum() no longer silently drops the string column.)
You can use pandas groupby methods to achieve that goal.
import pandas as pd
a = [['a','b','c',1],
     ['a','b','d',2],
     ['a','e','d',3],
     ['a','e','c',4]]
df = pd.DataFrame(a)
df_sum = df.groupby([0,1])[3].sum().reset_index()
array_return = df_sum.values
list_return = array_return.tolist()
print(list_return)
list_return is the result you want.
If you're interested, here is an implementation in plain Python. I've only tested it on the dataset you provided.
a = [['a','b','c',1],
     ['a','b','d',2],
     ['a','e','d',3],
     ['a','e','c',4]]
b_dict = {}
for row in a:
    key = (row[0], row[1])
    b_dict[key] = b_dict[key] + row[3] if key in b_dict else row[3]
b = [[key[0], key[1], value] for key, value in b_dict.items()]
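For what it's worth, the same accumulation reads a little cleaner with collections.defaultdict, which removes the key-existence check:
from collections import defaultdict

b_dict = defaultdict(int)  # missing keys start at 0
for row in a:
    b_dict[(row[0], row[1])] += row[3]
b = [[k0, k1, total] for (k0, k1), total in b_dict.items()]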

Add bi-grams to a pandas dataframe

I have a list of bi-grams like this:
[['a','b'], ['e','f']]
Now I want to add these bigrams to a DataFrame with their frequencies like this:
   b  f
a  1  0
e  0  1
I tried doing this with the following code, but this raises an error, because the index doesn't exist yet. Is there a fast way to do this for really big data? (like 200000 bigrams)
matrixA = pd.DataFrame()
# Put the counts in a matrix
for elem in grams:
    tag1, tag2 = elem[0], elem[1]
    matrixA.loc[tag1, tag2] += 1
from collections import Counter
bigrams = [[['a','b'],['e', 'f']], [['a','b'],['e', 'g']]]
pairs = []
for bg in bigrams:
    pairs.append((bg[0][0], bg[0][1]))
    pairs.append((bg[1][0], bg[1][1]))
c = Counter(pairs)
>>> pd.Series(c).unstack() # optional: .fillna(0)
b f g
a 2 NaN NaN
e NaN 1 1
The above is for the intuition; it can be wrapped up in a one-line generator expression as follows:
pd.Series(Counter((bg[i][0], bg[i][1]) for bg in bigrams for i in range(2))).unstack()
You can use Counter from the collections package. Note that I changed the contents of the list to be tuples rather than lists. This is because Counter keys (like dict keys) must be hashable.
from collections import Counter
import pandas as pd

l = [('a','b'), ('e','f')]
index, cols = zip(*l)
df = pd.DataFrame(0, index=index, columns=cols)
c = Counter(l)
for (row, col), count in c.items():  # unpack each pair, avoiding reuse of the name c
    df.loc[row, col] = count
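The resulting frame matches the matrix asked for:
   b  f
a  1  0
e  0  1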

How to get two arrays out of one csv file in Python?

I've a csv file containing lines like this:
A,x1
A,x2
A,x3
B,x4
B,x5
B,x6
The first part reflects the group (A or B) a value (x1, x2, ...) belongs to.
What I want to do now is import that csv file into Python, so that in the end I have two lists:
ListA = [x1, x2, x3]
ListB = [x4, x5, x6]
Can someone help me out with that?
Thanks in advance :)
file_path = "path_to_your_csv"
A = []
B = []
with open(file_path) as stream_in:
    for line in stream_in:
        group, value = line.strip().split(',')  # compare the group field, not the whole line
        if group == 'A':
            A.append(value)
        elif group == 'B':
            B.append(value)
print(A)
print(B)
After putting your data in a pandas Series object named ser, with the group letters as the index, ser.loc['A'] and ser.loc['B'] give you the slices you want.
Using preassigned names for your vectors leads to lots of duplicated logic that gets more and more complicated as you add new vectors to your data description...
It's much better to use dictionaries:
data=[['a', 12.3], ['a', 12.4], ['b', 0.4], ['c', 1.2]]
vectors = {} # an empty dictionary
for key, value in data:
    vectors.setdefault(key, []).append(value)
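Afterwards vectors is {'a': [12.3, 12.4], 'b': [0.4], 'c': [1.2]}, and adding a new group to the data requires no code changes.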
The relevant docs, from the python official documentation
setdefault(key[, default])
If key is in the dictionary, return its value.
If not, insert key with a value of default and return default.
default defaults to None.
append(x)
appends x to the end of the sequence (same as s[len(s):len(s)] = [x])
You could try:
In[1]: import pandas as pd
In[2]: df = pd.read_csv(file_name, header=None)
In[3]: print(df)
Out[3]:
   0   1
0  A  x1
1  A  x2
2  A  x3
3  B  x4
4  B  x5
5  B  x6
In[4]: ListA = df[0].tolist()
In[5]: print(ListA)
['A', 'A', 'A', 'B', 'B', 'B']
In[6]: ListB = df[1].tolist()
In[7]: print(ListB)
['x1', 'x2', 'x3', 'x4', 'x5', 'x6']
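This gives you the two columns; if you need the values split by group, as originally asked, a short follow-up (assuming the same df) is to group column 1 by column 0:
In[8]: groups = df.groupby(0)[1].apply(list).to_dict()
In[9]: ListA, ListB = groups['A'], groups['B']
In[10]: print(ListA)
['x1', 'x2', 'x3']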
