Accumulating data from a CSV file using Python

out_gate,useless_column,in_gate,num_connect
a,u,b,1
a,s,b,3
b,e,a,2
b,l,c,4
c,e,a,5
c,s,b,5
c,s,b,3
c,c,a,4
d,o,c,2
d,l,c,3
d,u,a,1
d,m,b,2
Shown above is a sample CSV file. My final goal is to get the answer in the form of a CSV file like the one below:
,a,b,c,d
a,0,4,0,0
b,2,0,4,0
c,9,8,0,0
d,1,2,5,0
I am trying to match each gate (a, b, c, d) against the in_gate and accumulate the connections; for example, out_gate 'c' -> in_gate 'b' gives 8 total connections (5 + 3), and 'c' -> 'a' gives 9 (5 + 4).
I want to solve it with lists (or tuples, dictionaries, sets) or collections.defaultdict, WITHOUT USING PANDAS OR NUMPY, and I want a solution that scales to many gates (around 10 to 40) as well.
I understand there is a similar question and it helped a lot, but I still have some trouble getting my code to run. Lastly, is there any way to do this using lists of columns and a for loop?
(e.g. list1 = ['a','b','c','d'], list2 = ['b','b','a','c','a','b','b','a','c','c','a','b'])
What if there are some useless columns that are not related to the data, but the final goal remains the same?
thanks

I'd use a Counter for this task. To keep the code simple, I'll read the data from a string. And I'll let you figure out how to produce the output as a CSV file in the format of your choice.
import csv
from collections import Counter
data = '''\
out_gate,in_gate,num_connect
a,b,1
a,b,3
b,a,2
b,c,4
c,a,5
c,b,5
c,b,3
c,a,4
d,c,2
d,c,3
d,a,1
d,b,2
'''.splitlines()
reader = csv.reader(data)
# Skip the header
next(reader)

# A Counter to accumulate the data
counts = Counter()

# Accumulate the data
for ogate, igate, num in reader:
    counts[ogate, igate] += int(num)

# We could grab the keys from the data, but it's easier to hard-code them
keys = 'abcd'

# Display the accumulated data
for ogate in keys:
    print(ogate, [counts[ogate, igate] for igate in keys])
output
a [0, 4, 0, 0]
b [2, 0, 4, 0]
c [9, 8, 0, 0]
d [1, 2, 5, 0]
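As a starting point for that last step, a minimal sketch of dumping the accumulated data as CSV (assuming the counts and keys from above; 'output.csv' is just a placeholder name):
import csv

# Write the accumulated counts as a matrix with a leading header row: ,a,b,c,d
with open('output.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    writer.writerow([''] + list(keys))
    for ogate in keys:
        writer.writerow([ogate] + [counts[ogate, igate] for igate in keys])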

If I understand your problem correctly, you could try using a nested collections.defaultdict for this:
import csv
from collections import defaultdict

d = defaultdict(lambda: defaultdict(int))
with open('gates.csv') as in_file:
    csv_reader = csv.reader(in_file)
    next(csv_reader)  # skip the header
    for row in csv_reader:
        outs, _unused, ins, connect = row  # the useless column is simply ignored
        d[outs][ins] += int(connect)

gates = sorted(d)
for outs in gates:
    print(outs, [d[outs][ins] for ins in gates])
Which outputs:
a [0, 4, 0, 0]
b [2, 0, 4, 0]
c [9, 8, 0, 0]
d [1, 2, 5, 0]


count how many times a record appears in a pandas dataframe and create a new feature with this counter

I have two dataframes, df_t and df_u. I want to count how many times a record in the text feature appears, and create a new feature in df_u associating each id with that counter. So id_u = 1 and id_u = 2 will both have counter = 3, since "hello" appears 3 times in df_t and both published a post with "hello" in the text.
import pandas as pd
import numpy as np
df_t = pd.DataFrame({'id_t': [0, 1, 2, 3, 4], 'id_u': [1, 1, 3, 2, 2], 'text': ["hello", "hello", "friend", "hello", "my"]})
print(df_t)
df_u = pd.DataFrame({'id_u': [1, 2, 3]})
print()
print(df_u)
df_u_new = pd.DataFrame({'id_u': [1, 2, 3], 'counter': [3, 3, 1]})
print()
print(df_u_new)
The code I have written so far is below, but it is very slow, and my dataset is huge, so this approach is impractical.
user_counter_dict = {}
tmp = dict(df_t["text"].value_counts())
# to speed up the process we set the text column as the index
df_t.set_index(["text"], inplace=True)
for i, (k, v) in enumerate(tmp.items()):
    row = (k, v)
    text = row[0]
    counter = row[1]
    # this is slow and takes most of the time
    uniques_id = df_t.loc[text]["id_u"].unique()
    for elem in uniques_id:
        value = user_counter_dict.setdefault(str(elem), counter)
        if value < counter:
            user_counter_dict[str(elem)] = counter
# and now I will put the data from the dict into a new column in df_u
Is there a very fast way to compute this?
You can do:
df_u_new = df_t.assign(counter=df_t["text"].map(df_t["text"].value_counts()))[
    ["id_u", "counter"]
].groupby("id_u", as_index=False).max()
Get the value_counts of text, map them onto each row, then group by id_u and take the maximum value, which is what you were trying to get, IIUC.
print(df_u_new)
   id_u  counter
0     1        3
1     2        3
2     3        1
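To see what the chained expression is doing, it can help to run it one step at a time; a rough decomposition (same df_t as above, before any set_index call):
# 1. How many times does each text value occur overall?
freq = df_t["text"].value_counts()  # hello -> 3, friend -> 1, my -> 1

# 2. Attach that frequency to every row of df_t.
with_counter = df_t.assign(counter=df_t["text"].map(freq))

# 3. For each user, keep the highest frequency among their posts.
df_u_new = with_counter[["id_u", "counter"]].groupby("id_u", as_index=False).max()
print(df_u_new)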

Compare cell values csv file python

I have the following dataset in a CSV file
[1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 1, 1, 2]
Now I want to count runs of equal consecutive values and store the run lengths in an array; I don't want the overall frequency of each value. So my output should be like this:
[3, 4, 3, 2, 1]
My code is as follows:
import csv

with open("c:/Users/Niels/Desktop/test.csv", 'rb') as f:
    reader = csv.reader(f, delimiter=';')
    data = []
    for column in reader:
        data.append(column[0])

results = data
results = [int(i) for i in results]
print results

dataFiltered = []
for i in results:
    if i == (i+1):
        counter = counter + 1
        dataFiltered.append(counter)
    counter = 0
print dataFiltered
My idea was to compare the cell values. I know something is wrong in the for loop over results, but I can't figure out where my mistake is.
I won't go into the details of your loop, which is very wrong; if i == (i+1): just cannot be True, for starters.
Instead, you'd be better off with itertools.groupby, summing the lengths of the groups:
import itertools
results = [1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 1, 1, 2]
freq = [len(list(v)) for _,v in itertools.groupby(results)]
print(freq)
len(list(v)) uses list to force the iteration on the grouped items so we can compute the length (maybe sum(1 for x in v) would be more performant/appropriate; I haven't benchmarked both approaches).
I get:
[3, 4, 3, 2, 1]
Aside: reading the first column of a csv file and converting the result to integers can be simply achieved by:
results = [int(row[0]) for row in reader]
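Putting those two pieces together, a minimal end-to-end sketch (assuming Python 3 and the same ';' delimiter as in the question; the path is a placeholder):
import csv
import itertools

with open("test.csv", newline='') as f:
    reader = csv.reader(f, delimiter=';')
    results = [int(row[0]) for row in reader]

# Length of each run of equal consecutive values
freq = [len(list(v)) for _, v in itertools.groupby(results)]
print(freq)  # [3, 4, 3, 2, 1] for the sample data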

Write multiple rows from dict using csv

Update: I do not want to use pandas because I have a list of dicts and want to write each one to disk as it comes in (part of a web-scraping workflow).
I have a dict that I'd like to write to a CSV file. I've come up with a solution, but I'd like to know if there's a more pythonic solution available. Here's what I envisioned (but it doesn't work):
import csv

test_dict = {"review_id": [1, 2, 3, 4],
             "text": [5, 6, 7, 8]}

with open('test.csv', 'w') as csvfile:
    fieldnames = ["review_id", "text"]
    writer = csv.DictWriter(csvfile, fieldnames=fieldnames)
    writer.writeheader()
    writer.writerows(test_dict)
Which would ideally result in:
review_id text
1 5
2 6
3 7
4 8
The code above doesn't seem to work the way I'd expect it to and throws a ValueError. So I've turned to the following solution (which does work, but seems verbose).
with open('test.csv', 'w') as csvfile:
    fieldnames = ["review_id", "text"]
    writer = csv.DictWriter(csvfile, fieldnames=fieldnames)
    writer.writeheader()
    response = test_dict
    cells = [{x: {key: val}} for key, vals in response.items()
             for x, val in enumerate(vals)]
    rows = {}
    for d in cells:
        for key, val in d.items():
            if key in rows:
                rows[key].update(d.get(key, None))
            else:
                rows[key] = d.get(key, None)
    for row in [val for _, val in rows.items()]:
        writer.writerow(row)
Again, to reiterate what I'm looking for: the block of code directly above works (i.e., produces the desired result mentioned earlier in the post), but seems verbose. So, is there a more pythonic solution?
Thanks!
Your first example will work with minor edits. DictWriter expects a list of dicts rather than a dict of lists. Assuming you can't change the format of the test_dict:
import csv

test_dict = {"review_id": [1, 2, 3, 4],
             "text": [5, 6, 7, 8]}

def convert_dict(mydict, numentries):
    data = []
    for i in range(numentries):
        row = {}
        for k, l in mydict.items():  # mydict.iteritems() in Python 2
            row[k] = l[i]
        data.append(row)
    return data

with open('test.csv', 'w') as csvfile:
    fieldnames = ["review_id", "text"]
    writer = csv.DictWriter(csvfile, fieldnames=fieldnames)
    writer.writeheader()
    writer.writerows(convert_dict(test_dict, 4))
Try using pandas.
Here is a simple example:
import pandas as pd
test_dict = {"review_id": [1, 2, 3, 4],
"text": [5, 6, 7, 8]}
d1 = pd.DataFrame(test_dict)
d1.to_csv("output.csv")
Cheers
The built-in zip function can join together different iterables into tuples which can be passed to writerows. Try this as the last line:
writer.writerows(zip(test_dict["review_id"], test_dict["text"]))
You can see what it's doing by making a list:
>>> list(zip(test_dict["review_id"], test_dict["text"]))
[(1, 5), (2, 6), (3, 7), (4, 8)]
Edit: In this particular case, you probably want a regular csv.writer, since what you effectively have now is a list of rows.
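For completeness, a minimal sketch of that suggestion (same test_dict; the header row is written manually, since csv.writer has no writeheader):
import csv

test_dict = {"review_id": [1, 2, 3, 4],
             "text": [5, 6, 7, 8]}

with open('test.csv', 'w', newline='') as csvfile:
    writer = csv.writer(csvfile)
    writer.writerow(["review_id", "text"])  # header row
    writer.writerows(zip(test_dict["review_id"], test_dict["text"]))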
If you don't mind using a 3rd-party package, you could do it with pandas.
import pandas as pd
pd.DataFrame(test_dict).to_csv('test.csv', index=False)
Update:
So, you have several dictionaries, and all of them seem to come from a scraping routine.
import pandas as pd

test_dict = {"review_id": [1, 2, 3, 4],
             "text": [5, 6, 7, 8]}
pd.DataFrame(test_dict).to_csv('test.csv', index=False)

list_of_dicts = [test_dict, test_dict]
for d in list_of_dicts:
    pd.DataFrame(d).to_csv('test.csv', index=False, mode='a', header=False)
This time, you are appending to the file, without the header.
The output is:
review_id,text
1,5
2,6
3,7
4,8
1,5
2,6
3,7
4,8
1,5
2,6
3,7
4,8
The problem is that with DictWriter.writerows() you are forced to have a dict for each row. Instead, you can simply write the values by changing your csv creation:
with open('test.csv', 'w') as csvfile:
    fieldnames = test_dict.keys()
    fieldvalues = zip(*test_dict.values())
    writer = csv.writer(csvfile)
    writer.writerow(fieldnames)
    writer.writerows(fieldvalues)
You have two different problems in your question:
1. Create a csv file from a dictionary where the values are containers and not primitives.
For the first problem, the solution is generally to transform the container type into a primitive type. The most common method is creating a JSON string. So, for example:
>>> import json
>>> x = [2, 4, 6, 8, 10]
>>> json_string = json.dumps(x)
>>> json_string
'[2, 4, 6, 8, 10]'
So your data conversion might look like:
import csv
import json

def convert(datadict):
    '''Generator which converts a dictionary of containers into a dictionary of json-strings.

    args:
        datadict(dict): dictionary which needs conversion

    yields:
        tuple: key and string
    '''
    for key, value in datadict.items():
        yield key, json.dumps(value)

def dump_to_csv_using_dict(datadict, fields=None, filepath=None, delimiter=None):
    '''Dumps a list of dictionaries into csv

    args:
        datadict(list): list of dictionaries to dump
        fields(list): field sequence to use from the dictionaries [default: sorted keys of the first dict]
        filepath(str): filepath to save to [default: 'tmp.csv']
        delimiter(str): delimiter to use in csv [default: '|']
    '''
    fieldnames = sorted(datadict[0].keys()) if fields is None else fields
    filepath = 'tmp.csv' if filepath is None else filepath
    delimiter = '|' if not delimiter else delimiter
    with open(filepath, 'w') as csvfile:
        writer = csv.DictWriter(csvfile, fieldnames, restval='', extrasaction='ignore', delimiter=delimiter)
        writer.writeheader()
        for each_dict in datadict:
            writer.writerow(each_dict)
So the naive conversion looks like this:
# Conversion code
test_data = {
    "review_id": [1, 2, 3, 4],
    "text": [5, 6, 7, 8]
}
converted_data = dict(convert(test_data))
data_list = [converted_data]
dump_to_csv_using_dict(data_list)
2. Create a final value that is actually some sort of a merging of two disparate data sets.
To do this, you need to find a way to combine data from different keys. This is not an easy problem to generically solve.
That said, it's easy to combine two lists with zip.
>>> x = [2, 4, 6]
>>> y = [1, 3, 5]
>>> zip(y, x)
[(1, 2), (3, 4), (5, 6)]
In addition, in the event that your lists are not the same size, Python's itertools package provides a function, izip_longest (renamed zip_longest in Python 3), which will yield the full zip even if one list is shorter than another. Note that izip_longest returns a generator.
>>> from itertools import izip_longest
>>> x = [2, 4]
>>> y = [1, 3, 5]
>>> z = izip_longest(y, x, fillvalue=None)  # default fillvalue is None
>>> list(z)  # z is a generator
[(1, 2), (3, 4), (5, None)]
So we could add another function here:
from itertools import izip_longest  # itertools.zip_longest in Python 3

def combine(data, fields=None, default=None):
    '''Combines fields within data

    args:
        data(dict): a dictionary with lists as values
        fields(list): a list of keys to combine [default: all fields in random order]
        default: default fill value [default: None]

    yields:
        tuple: columns combined into rows
    '''
    fields = list(data.keys()) if fields is None else fields
    columns = [data.get(field) for field in fields]
    for values in izip_longest(*columns, fillvalue=default):
        yield values
And now we can use this to update our original conversion.
def dump_to_csv(data, filepath=None, delimiter=None):
    '''Dumps an iterable of rows into csv

    args:
        data(iterable): rows (tuples or lists) to dump
        filepath(str): filepath to save to [default: 'tmp.csv']
        delimiter(str): delimiter to use in csv [default: '|']
    '''
    filepath = 'tmp.csv' if filepath is None else filepath
    delimiter = '|' if not delimiter else delimiter
    with open(filepath, 'w') as csvfile:
        writer = csv.writer(csvfile, delimiter=delimiter)
        for each_row in data:
            writer.writerow(each_row)
# Conversion code
test_data = {
    "review_id": [1, 2, 3, 4],
    "text": [5, 6, 7, 8]
}
combined_data = combine(test_data)
dump_to_csv(combined_data)

Converting a raw list to JSON data with Python

I have a raw list sorteddict in the form of:
["with", 1]
["witches", 1]
["witchcraft", 3]
and I want to generate more legible data by making it a JSON object that looks like:
"Frequencies": {
"with": 1,
"witches": 1,
"witchcraft": 3,
"will": 2
}
Unfortunately, so far I have only found a manual way to create the data shown above, and was wondering if there is a more elegant way of generating it than my messy script. I got to the point where I needed to retrieve the last item in the list and ensure there was no comma on the last line before I thought I should seek some advice. Here's what I had:
comma_count = 0
for i in sorteddict:
    comma_count += 1
with open("frequency.json", 'w') as f:
    json_head = "\"Frequencies\": {\n"
    f.write(json_head)
    while comma_count > 0:
        for s in sorteddict:
            f.write('\t\"' + s[0] + '\"' + ":" + str(s[1]) + ",\n")
            comma_count -= 1
    f.write("}")
I have used json.JSONEncoder().encode(), which I thought was what I was looking for, but what ended up happening is that "Frequencies" would be prepended to each s[0] item. Any ideas to clean up the code?
You need to make a nested dict out of your current one, and use json.dumps. Not sure how sorteddict works, but:
json.dumps({"Frequencies": mySortedDict})
should work.
Additionally, you say that you want something JSON-encoded, but your example is not valid JSON, so I will assume that you actually want legitimate JSON.
Here's some example code:
In [4]: import json
In [5]: # No idea what a sorteddict is, we assume it has the same interface as a normal dict.
In [6]: the_dict = dict([
...: ["with", 1],
...: ["witches", 1],
...: ["witchcraft", 3],
...: ])
In [7]: the_dict
Out[7]: {'witchcraft': 3, 'witches': 1, 'with': 1}
In [8]: json.dumps({"Frequencies": the_dict})
Out[8]: '{"Frequencies": {"with": 1, "witches": 1, "witchcraft": 3}}'
I may not be understanding you correctly - but do you just want to turn a list of [word, frequency] lists into a dictionary?
frequency_lists = [
["with", 1],
["witches", 1],
["witchcraft", 3],
]
frequency_dict = dict(frequency_lists)
print(frequency_dict) # {'with': 1, 'witches': 1, 'witchcraft': 3}
If you then want to write this to a file:
import json
with open('frequency.json', 'w') as f:
f.write(json.dumps(frequency_dict))
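As a side note, json.dump can write straight to the file handle, and indent produces the pretty-printed layout shown in the question (indent=4 is just a stylistic choice):
import json

with open('frequency.json', 'w') as f:
    json.dump({"Frequencies": frequency_dict}, f, indent=4)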

Python - convert edge list to adjacency matrix

I have data in the following format:
user,item,rating
1,1,3
1,2,2
2,1,2
2,4,1
and so on
I want to convert this into matrix form, so the output is like this:
Item--> 1,2,3,4....
user
1 3,2,0,0....
2 2,0,0,1
....and so on..
How do I do this in Python?
Thanks
import numpy as np

data = [
    (1, 1, 3),
    (1, 2, 2),
    (2, 1, 2),
    (2, 4, 1),
]
# To read from a CSV file instead:
# import csv
# with open('data.csv') as f:
#     next(f)  # Skip header
#     data = [map(int, row) for row in csv.reader(f)]
#     # Python 3.x: map(int, row) -> tuple(map(int, row))

n = max(max(user, item) for user, item, rating in data)  # Get size of matrix
matrix = np.zeros((n, n), dtype=int)
for user, item, rating in data:
    matrix[user-1][item-1] = rating  # Convert to 0-based index.
for row in matrix:
    print(row)
prints
[3 2 0 0]
[2 0 0 1]
[0 0 0 0]
[0 0 0 0]
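If the goal is to write that matrix back out as a CSV file with the item numbers as a header row (an assumption; the question only sketches the desired layout), a minimal follow-up:
import csv

with open('matrix.csv', 'w', newline='') as f:  # 'matrix.csv' is a placeholder name
    writer = csv.writer(f)
    writer.writerow(['user'] + list(range(1, n + 1)))  # header: item numbers
    for user, row in enumerate(matrix, start=1):
        writer.writerow([user] + list(row))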
A different approach from @falsetru's: do you read from a file and write to a file? Maybe work with a dictionary:
from collections import defaultdict

valdict = defaultdict(int)
nuser = 0
nitem = 0
# infile and outfile are assumed to be open file handles, as in the original
for line in infile:
    user, item, rating = line.strip().split(",")
    valdict[int(user), int(item)] = int(rating)
    nuser = max(nuser, int(user))
    nitem = max(nitem, int(item))

towrite = "," + ",".join(str(i) for i in range(1, nitem + 1)) + "\n"  # header row of item numbers
for i in range(1, nuser + 1):
    towrite += str(i)
    for j in range(1, nitem + 1):
        towrite += "," + str(valdict[i, j])
    towrite += "\n"
outfile.write(towrite)
