How to create a dictionary from a CSV file - python

I have a CSV file like this:
w syn
0 abaca http://kaiko.getalp.org/dbnary/fra/Musa_textilis
1 abaca http://kaiko.getalp.org/dbnary/fra/chanvre_de_...
2 abaca http://kaiko.getalp.org/dbnary/fra/tagal
3 abaca http://kaiko.getalp.org/dbnary/fra/Musa_textilis
4 abaca http://kaiko.getalp.org/dbnary/fra/chanvre_de_...
.. ... ...
95 abandon http://kaiko.getalp.org/dbnary/fra/apostasie
96 abandon http://kaiko.getalp.org/dbnary/fra/capitulation
97 abandon http://kaiko.getalp.org/dbnary/fra/cession_de_...
98 abandon http://kaiko.getalp.org/dbnary/fra/confiance
99 abandon http://kaiko.getalp.org/dbnary/fra/défection
[100 rows x 2 columns]
6
{'abaca': 'tagal', 'abdomen': 'ventre', 'abricot': 'michemis', 'ADN': 'acide désoxyribonucléique', 'Indien': 'sauvage', 'abandon': 'défection'}
I am trying to create a dictionary with each word and its synonyms. I came up with the code below, but the final dictionary only contains one synonym per word, even though, as you can see in the CSV file, a word can have more than one synonym.
import os
import pandas as pd

# read specific columns of csv file using Pandas
df = pd.read_csv("sparql.csv", usecols=["w","syn"]) #usecols = ["l","f","s","w","syn","synonyme"]
print(df)

liste_mot = df['w'].tolist()
liste_mot = set(liste_mot)
print(len(liste_mot))

liste_sys = []
dict_syn = {}
for index, row in df.iterrows():
    k, v = row
    sys = os.path.basename(v)
    if "_" in sys:
        sys = sys.split("_")
        sys = " ".join(sys)
        dict_syn[k] = sys
    else:
        dict_syn[k] = sys
print(dict_syn)
What I want is each word as a key and a list of all its synonyms as the value, but so far I only get one synonym (syn) per word (w), not all of them.
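A minimal change to your own loop fixes this: append each synonym to a per-word list instead of overwriting the value. A sketch, assuming the same sparql.csv and the same two columns:
import os
import pandas as pd

df = pd.read_csv("sparql.csv", usecols=["w", "syn"])

dict_syn = {}
for index, row in df.iterrows():
    k, v = row
    syn = os.path.basename(v).replace("_", " ")
    # setdefault creates an empty list the first time a word is seen, then we append to it
    dict_syn.setdefault(k, []).append(syn)

print(dict_syn)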

Another approach:
import os
df = pd.read_csv("sparql.csv", usecols=["w","syn"])
df["syn_new"] = df.syn.map(os.path.basename).str.replace("_", " ")
dict_syn = {
    key: group.syn_new.to_list()
    for key, group in df[["w", "syn_new"]].groupby("w")
}
Result for your sample:
{'abaca': ['Musa textilis',
'chanvre de ...',
'tagal',
'Musa textilis',
'chanvre de ...'],
'abandon': ['apostasie',
'capitulation',
'cession de ...',
'confiance',
'défection']}
You could also try whether
df["syn_new"] = df.syn.str.rsplit("/", n=1, expand=True)[1].str.replace("_", " ")
works; it could be faster.
And maybe you don't want lists but sets as dict_syn values to avoid duplicates:
...
key: set(group.syn_new.to_list())
...
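Written out in full, that set variant of the comprehension would look like this (same df and syn_new column as above, so duplicates such as 'Musa textilis' collapse to one entry):
dict_syn = {
    key: set(group.syn_new.to_list())
    for key, group in df[["w", "syn_new"]].groupby("w")
}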

Here's a working example based partly on your code. Synonyms are put in a list:
from io import StringIO
import pandas as pd
text = """
w syn
0 abaca http://kaiko.getalp.org/dbnary/fra/Musa_textilis
1 abaca http://kaiko.getalp.org/dbnary/fra/chanvre_de_...
2 abaca http://kaiko.getalp.org/dbnary/fra/tagal
3 abaca http://kaiko.getalp.org/dbnary/fra/Musa_textilis
4 abaca http://kaiko.getalp.org/dbnary/fra/chanvre_de_...
95 abandon http://kaiko.getalp.org/dbnary/fra/apostasie
95 abandon http://kaiko.getalp.org/dbnary/fra/apostasie
96 abandon http://kaiko.getalp.org/dbnary/fra/capitulation
97 abandon http://kaiko.getalp.org/dbnary/fra/cession_de_...
98 abandon http://kaiko.getalp.org/dbnary/fra/confiance
99 abandon http://kaiko.getalp.org/dbnary/fra/défection
"""
# read in data
df = pd.read_csv(StringIO(text), sep=r'\s+')
# get the synonym out of the url
df['real_syn'] = df['syn'].str.extract(r'.*/(.*)')
# dictionary to write results to
result = {}
# loop over every row of the dataframe
for _, row in df[['w', 'real_syn']].iterrows():
    word = row['w']
    syn = row['real_syn']
    # if the word is already in the result dictionary, only add synonyms it doesn't have yet
    if word in result:
        if syn not in result[word]:
            result[word].append(syn)
    else:
        # if the word is not yet in the dictionary, add it as a key with the synonym in a list
        result[word] = [syn]
print(result)

I'm not sure if your CSV is actually fixed-width, or if that's just a nice printout.
If you don't need Pandas, Python's standard CSV module is up to the job.
import csv
import os
import pprint
from collections import defaultdict
def syn_splitter(s):
    syn = os.path.basename(s)
    syn = syn.replace('_', ' ')
    return syn

# So we can just start appending syns, without having to "prime" the dictionary with an empty list
word_syn_map = defaultdict(list)

with open('sample.csv', 'r', newline='') as f:
    reader = csv.reader(f)
    next(reader)  # discard header
    for row in reader:
        w, syn = row
        syn = syn_splitter(syn)
        word_syn_map[w].append(syn)

pprint.pprint(word_syn_map)
# word_syn_map = dict(word_syn_map) if you want to get rid of the defaultdict wrapper
I mocked up sample.csv:
w,syn
abaca,http://kaiko.getalp.org/dbnary/fra/Musa_textilis
abaca,http://kaiko.getalp.org/dbnary/fra/tagal
abaca,http://kaiko.getalp.org/dbnary/fra/Musa_textilis
abandon,http://kaiko.getalp.org/dbnary/fra/apostasie
abandon,http://kaiko.getalp.org/dbnary/fra/capitulation
abandon,http://kaiko.getalp.org/dbnary/fra/confiance
abandon,http://kaiko.getalp.org/dbnary/fra/défection
and I got:
defaultdict(<class 'list'>,
{'abaca': ['Musa textilis', 'tagal', 'Musa textilis'],
'abandon': ['apostasie',
'capitulation',
'confiance',
'défection']})
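If the duplicate 'Musa textilis' should be dropped as well, the same loop works with a set per word instead of a list; a sketch against the same sample.csv:
import csv
import os
from collections import defaultdict

word_syn_map = defaultdict(set)

with open('sample.csv', 'r', newline='') as f:
    reader = csv.reader(f)
    next(reader)  # discard header
    for row in reader:
        w, syn = row
        # same basename + underscore cleanup as above
        word_syn_map[w].add(os.path.basename(syn).replace('_', ' '))

print(dict(word_syn_map))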

Related

Blank column appearing in .csv output, how can I remove it?

*Updated to add more lines of input file
I have a .csv file with header and subsequent data as follows (shown only first few rows here):
gene_name VarXCRep.1 VarX1Rep.1 VarX2Rep.1 VarXCRep.2 VarX3Rep.2 VarX1Rep.2 VarX2Rep.2 VarXCRep.3 VarX3Rep.3 VarX1Rep.3 VarX2Rep.3
1 Soltu.DM.01G000010 360.7000522 395.2279977 323.2595994 361.5910696 327.7380499 386.8290979 336.3997167 333.0843759 317.4954424 377.756613 396.666783
2 Soltu.DM.01G000020 91.12422371 69.30538348 77.36127164 135.060696 61.85252412 110.6099 68.21624475 108.7053612 55.31681029 56.52040232 36.14709293
3 Soltu.DM.01G000030 439.1681337 183.5656103 232.0838149 579.546161 220.9018719 179.6646995 179.2348391 291.2746216 222.4196747 266.8621527 208.321404
4 Soltu.DM.01G000040 268.3102142 185.4387288 192.0217278 301.5640936 130.9345641 237.108515 203.9799475 236.921941 92.19468382 198.1791322 38.04957151
5 Soltu.DM.01G000050 341.7158389 479.5183289 504.229717 322.2876925 528.5579334 390.4957244 470.1570594 342.8399852 554.3205365 424.9761896 634.4766049
6 Soltu.DM.01G000060 468.2772607 839.1570756 759.7982036 514.516937 886.0173261 572.6048416 579.8380803 549.1014398 1011.836655 598.8300854 1077.754113
7 Soltu.DM.01G000070 2.531228436 0 5.525805117 1.429213714 8.032795341 1.83331326 5.350293706 0 4.609734191 0 7.609914302
8 Soltu.DM.01G000090 84.79615262 54.3204357 75.97982036 98.61574626 102.0165008 83.11020113 84.26712586 108.7053612 98.53306833 80.13019064 93.2214502
9 Soltu.DM.01G000100 67.07755356 73.05162042 12.43306151 118.6247383 6.426236273 77.61026135 36.11448251 97.55609336 8.643251608 67.25212429 15.2198286
10 Soltu.DM.01G000110 1.265614218 0 1.381451279 2.143820571 0 1.22220884 4.012720279 0 2.304867095 0.715448131 0.951239288
11 Soltu.DM.01G000120 821.3836276 451.4215518 846.8296342 820.3686718 737.4106123 497.4389979 835.9833915 798.5663071 752.5391067 704.7164087 532.6940011
12 Soltu.DM.01G000130 2.531228436 3.746236945 5.525805117 2.143820571 0.803279534 0.61110442 2.00636014 1.393658477 1.728650322 2.146344392 10.46363217
13 Soltu.DM.01G000140 93.65545214 127.3720561 102.2273947 105.7618148 104.4263394 108.7765868 115.7001014 98.94975183 108.9049703 110.8944603 126.5148253
14 Soltu.DM.01G000150 112.6396654 84.29033126 91.17578444 86.46742969 154.2296705 99.61002047 111.0185944 115.6736536 111.7860541 115.187149 163.6131575
15 Soltu.DM.01G000160 644.197637 573.1742525 222.413656 760.3416958 178.3280566 761.4361074 594.551388 1053.605808 222.4196747 585.2365709 303.4453328
16 Soltu.DM.01G000170 751.7748456 841.0301941 910.3763931 773.9192261 835.4107154 820.7132361 1148.975573 804.140941 849.3435247 710.4399938 946.4830913
17 Soltu.DM.01G000190 6.328071091 1.873118472 5.525805117 6.431461713 8.836074875 5.49993978 8.694227272 11.14926781 4.609734191 7.869929438 0.951239288
18 Soltu.DM.01G000200 88.59299527 73.05162042 66.30966141 74.31911313 63.45908319 78.83247019 74.23532517 86.40682554 59.35032771 59.38219485 44.70824652
19 Soltu.DM.01G000210 108.8428228 112.3871083 85.64997932 111.4786697 73.0984376 123.4430928 113.6937412 143.5468231 67.41736254 77.26839812 86.56277518
20 Soltu.DM.01G000220 5.062456873 86.16344973 93.938687 20.72359885 507.6726655 30.555221 24.74510839 6.968292383 551.4394526 54.37405793 920.7996305
This is how the file appears in Bash shell
gene_name,VarXCRep.1,VarX1Rep.1,VarX2Rep.1,VarXCRep.2,VarX3Rep.2,VarX1Rep.2,VarX2Rep.2,VarXCRep.3,VarX3Rep.3,VarX1Rep.3,VarX2Rep.3
Soltu.DM.01G000010,360.7000522,395.2279977,323.2595994,361.5910696,327.7380499,386.8290979,336.3997167,333.0843759,317.4954424,377.756613,396.666783
Soltu.DM.01G000020,91.12422371,69.30538348,77.36127164,135.060696,61.85252412,110.6099,68.21624475,108.7053612,55.31681029,56.52040232,36.14709293
Soltu.DM.01G000030,439.1681337,183.5656103,232.0838149,579.546161,220.9018719,179.6646995,179.2348391,291.2746216,222.4196747,266.8621527,208.321404
Soltu.DM.01G000040,268.3102142,185.4387288,192.0217278,301.5640936,130.9345641,237.108515,203.9799475,236.921941,92.19468382,198.1791322,38.04957151
Soltu.DM.01G000050,341.7158389,479.5183289,504.229717,322.2876925,528.5579334,390.4957244,470.1570594,342.8399852,554.3205365,424.9761896,634.4766049
Soltu.DM.01G000060,468.2772607,839.1570756,759.7982036,514.516937,886.0173261,572.6048416,579.8380803,549.1014398,1011.836655,598.8300854,1077.754113
Soltu.DM.01G000070,2.531228436,0,5.525805117,1.429213714,8.032795341,1.83331326,5.350293706,0,4.609734191,0,7.609914302
Soltu.DM.01G000090,84.79615262,54.3204357,75.97982036,98.61574626,102.0165008,83.11020113,84.26712586,108.7053612,98.53306833,80.13019064,93.2214502
Soltu.DM.01G000100,67.07755356,73.05162042,12.43306151,118.6247383,6.426236273,77.61026135,36.11448251,97.55609336,8.643251608,67.25212429,15.2198286
I was asked to remove various columns and their data, which I have done successfully in the code below. I was then asked to rearrange the data so that the control (VarXC) repeats 1, 2 and 3 and the experiment 1 (VarX1) repeats appear in adjacent columns, which is also done in the code below:
empty_list = []
for ln in open("FinalXVartest.csv").readlines():
    col = ln.split(",")
    del col[3]
    del col[4]
    del col[5]
    del col[6]
    del col[7]
    col.append(col.pop(2))
    col.append(col.pop(3))
    col.append(col.pop(4))
    empty_list += col
    empty_list += '\n'

file_out = open("Xtest_2Var.csv", "w")
file_out.write(','.join(empty_list))
file_out.close()
When I put all of this together, the output shows up like this:
(screenshot of the final output, with an extra blank column on the left)
I am not sure how I am getting that space on the left side. Can someone help me remove it so that all the rows shift one cell to the left?
You should change the code a little to make it work as you expect. The problem with your code is that you are building a single flat list and adding the EOL character '\n' to it as separate elements. Therefore, when you write this list to a file with
file_out.write(','.join(empty_list))
there is a comma after each line break. Instead, build a list of lists and add '\n' right after the join to avoid the problem:
empty_list = []
for ln in open("files/FinalXVartest.csv").readlines():
    col = ln.split(",")
    del col[3]
    del col[4]
    del col[5]
    del col[6]
    del col[7]
    col.append(col.pop(2))
    col.append(col.pop(3))
    col.append(col.pop(4))
    empty_list.append(col)

file_out = open("files/Xtest_2Var.csv", "w")
for item in empty_list:
    file_out.write(','.join(item) + '\n')
file_out.close()
But it's better to use the csv library, which is designed for reading and writing CSV files.
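For example, a sketch with csv.reader and csv.writer that keeps and reorders columns by index; the index list is derived from the header shown in the question (gene_name, the three VarXC repeats, then the three VarX1 repeats):
import csv

# column indices to keep, in the desired order
keep = [0, 1, 4, 8, 2, 6, 10]

with open("FinalXVartest.csv", newline="") as f_in, \
     open("Xtest_2Var.csv", "w", newline="") as f_out:
    reader = csv.reader(f_in)
    writer = csv.writer(f_out)
    for row in reader:
        writer.writerow([row[i] for i in keep])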
Using pandas:
import pandas as pd
import re
df = pd.read_csv('FinalXVartest.csv', index_col='gene_name')
parsed = sorted([(re.match(r'VarX(.)Rep.(\d)', k).groups()[::-1], k) for k in df.columns])
cols = [k for (i, j), k in parsed if j in {'1', 'C'}]
df[cols].to_csv('Xtest_2Var.csv')
>>> df[cols]
VarX1Rep.1 VarXCRep.1 VarX1Rep.2 VarXCRep.2 VarX1Rep.3 VarXCRep.3
gene_name
Soltu.DM.01G000010 395.227998 360.700052 386.829098 361.591070 377.756613 333.084376
Soltu.DM.01G000020 69.305383 91.124224 110.609900 135.060696 56.520402 108.705361
Soltu.DM.01G000030 183.565610 439.168134 179.664700 579.546161 266.862153 291.274622
Soltu.DM.01G000040 185.438729 268.310214 237.108515 301.564094 198.179132 236.921941
Soltu.DM.01G000050 479.518329 341.715839 390.495724 322.287692 424.976190 342.839985
Soltu.DM.01G000060 839.157076 468.277261 572.604842 514.516937 598.830085 549.101440
Soltu.DM.01G000070 0.000000 2.531228 1.833313 1.429214 0.000000 0.000000
Soltu.DM.01G000090 54.320436 84.796153 83.110201 98.615746 80.130191 108.705361
Soltu.DM.01G000100 73.051620 67.077554 77.610261 118.624738 67.252124 97.556093

How to sort out a text file in python using numbers in the text file

I have the following text file:
345 eee
12 nt
3 s
9 test
How can I sort it in numerical order while keeping the text next to each number?
The output I'm hoping for is
345 eee
12 nt
9 test
3 s
Note: I'm grabbing data from text files
45 eee
12 nt
945 test
344 s
45 gh
Current Code
Credit: @CypherX
import pandas as pd

s = """
345 eee
1200 nt
9 test
-3 s
"""

# Custom Function
def sort_with_digits(s, ascending=True):
    lines = s.strip().split('\n')
    df = pd.DataFrame({'Lines': lines})
    df2 = df.Lines.str.strip().str.split(' ', expand=True).rename(columns={0: 'Numbers', 1: 'Text'})
    df['Numbers'] = df2['Numbers'].astype(float)
    df['Text'] = df2['Text'].str.strip()
    df.sort_values(['Numbers', 'Text'], ascending=ascending, inplace=True)
    return df.Lines.tolist()

print(s)
sort_with_digits(s, ascending=True)  # this is your output
Using python and no system calls:
# This is the function to amend when you want to change the ordering
def key_function(line):
    # To sort by the first number when there is a space
    return int(line.split()[0])
To extract any number that begins the line you can use a regex
import re

def key_function(line):
    match = re.match(r'^\d+', line)
    if match:
        return int(match.group())
    else:
        return 0
Then the rest of the method is the same
with open(file_name, 'r') as f:
    # Read all lines into a list
    lines = f.readlines()

with open(file_name, 'w') as f:
    # Sort all the lines by "key_function"; readlines() keeps the '\n', so write each line as-is
    for line in sorted(lines, key=key_function, reverse=True):
        f.write(line)
Here is a solution in Bash; you can use subprocess to run it from Python:
sort -k1 -r -n file > new_file
Using this with Python subprocess:
import subprocess
# Simple command
subprocess.Popen(['sort -k1 -r -n test.txt'], shell=True)
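If you'd rather avoid the shell, roughly the same thing works with an argument list and a redirected output file (a sketch, Python 3.5+; the file names are assumptions):
import subprocess

# sort numerically, descending, on the first key, and write the result to a new file
with open("sorted.txt", "w") as out:
    subprocess.run(["sort", "-k1", "-r", "-n", "test.txt"], stdout=out, check=True)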
EDIT:
The OP described later that the requirement is to first order by numbers and then order by the rest of text that follows. The solution now reflects this requirement.
I wrote a custom function (sort_with_digits) which finds out the numbers and then sorts the lines accordingly using pandas library. All you have to do is:
# read in data from a text file:
with open('input.txt', 'r') as f:
    s = f.read()

sort_with_digits(s, ascending=True)
Code with Example Data
s = """
345 eee
12 nt
9 test
3 s
"""
import pandas as pd
# Custom Function
def sort_with_digits(s, ascending = True):
lines = s.strip().split('\n')
df = pd.DataFrame({'Lines': lines})
df2 = df.Lines.str.strip().str.split(' ', expand=True).rename(columns={0: 'Numbers', 1: 'Text'})
df['Numbers'] = df2['Numbers'].astype(float)
df['Text'] = df2['Text'].str.strip()
df.sort_values(['Numbers', 'Text'], ascending = ascending, inplace=True)
return df.Lines.tolist()
sort_with_digits(s, ascending = True)
Output:
['3 s', '9 test', '12 nt', '345 eee']
Note:
If you apply a simple '\n'.join(result) to the returned list (result), you get a string formatted like the input (s).
result = sort_with_digits(s, ascending = True)
print('\n'.join(result))
Output:
12 nt
45 eee
45 gh
344 s
945 test
With Another Dummy Dataset
Dummy Data: A
s = """
345 eee
1200 nt
9 test
-3 s
"""
# Expected Result: ['-3 s', '9 test', '345 eee', '1200 nt']
# And the solution produces this as well.
Dummy Data: B
s = """
45 eee
12 nt
945 test
344 s
45 gh
"""
# Expected Result: ['12 nt', '45 eee', '45 gh', '344 s', '945 test']
# And the solution produces this as well.
All right, here's a bad answer:
#!/usr/bin/python
import os
os.system('sort -n -r /path/to/file')
I used python to run a shell command "sort" using the numeric and reverse options.
I used python because you tagged the question python.
I used the -r option because your output example seems to be sorted in reverse order.
This would be a better answer if it used subprocess instead of os.system (as the other answer mentions).

How to skip more than one line of header in an RDD in Spark

Data in my first RDD is like
1253
545553
12344896
1 2 1
1 43 2
1 46 1
1 53 2
Now the first 3 integers are some counters that I need to broadcast.
After that all the lines have the same format like
1 2 1
1 43 2
I will map all the values after the 3 counters to a new RDD after doing some computation with them in a function.
But I'm not able to work out how to separate those first 3 values and map the rest normally.
My Python code is like this
documents = sc.textFile("file.txt").map(lambda line: line.split(" "))
final_doc = documents.map(lambda x: (int(x[0]), function1(int(x[1]), int(x[2])))).reduceByKey(lambda x, y: x + " " + y)
It works only when the first 3 values are not in the text file; with them it gives an error.
I don't want to skip those first 3 values, but store them in 3 broadcast variables and then pass the remaining dataset to the map function.
And yes, the text file has to be in that format only; I cannot remove those 3 values/counters.
Function1 is just doing some computation and returning the values.
Imports for Python 2
from __future__ import print_function
Prepare dummy data:
s = "1253\n545553\n12344896\n1 2 1\n1 43 2\n1 46 1\n1 53 2"
with open("file.txt", "w") as fw: fw.write(s)
Read raw input:
raw = sc.textFile("file.txt")
Extract header:
header = raw.take(3)
print(header)
### [u'1253', u'545553', u'12344896']
Filter lines:
using zipWithIndex
content = raw.zipWithIndex().filter(lambda kv: kv[1] > 2).keys()
print(content.first())
## 1 2 1
using mapPartitionsWithIndex
from itertools import islice
content = raw.mapPartitionsWithIndex(
    lambda i, iter: islice(iter, 3, None) if i == 0 else iter)
print(content.first())
## 1 2 1
NOTE: All credit goes to pzecevic and Sean Owen (see linked sources).
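To cover the broadcast part of the question, a sketch built on the header/content split above (function1 is the OP's function and is assumed to be defined; the reduce is kept as in the question):
# broadcast the three counters taken from the header;
# inside function1 they are available as counters.value
counters = sc.broadcast([int(x) for x in header])

final_doc = (content
             .map(lambda line: line.split(" "))
             .map(lambda x: (int(x[0]), function1(int(x[1]), int(x[2]))))
             .reduceByKey(lambda x, y: x + " " + y))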
In my case I have a csv file like below
----- HEADER START -----
We love to generate headers
#who needs comment char?
----- HEADER END -----
colName1,colName2,...,colNameN
val__1.1,val__1.2,...,val__1.N
Took me a day to figure out
val rdd = spark.read.textFile(pathToFile).rdd
  .zipWithIndex()                                       // get tuples (line, index)
  .filter({ case (line, index) => index > numberOfLinesToSkip })
  .map({ case (line, index) => line })                  // get rid of the index
val ds = spark.createDataset(rdd)                       // convert RDD to Dataset
val df = spark.read.option("inferSchema", "true").option("header", "true").csv(ds)  // parse csv
Sorry, the code is in Scala, but it can easily be converted to Python.
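A rough PySpark equivalent of the Scala above (pathToFile and numberOfLinesToSkip are assumed to be defined; recent PySpark versions let csv() parse an RDD of strings):
rdd = (spark.read.text(pathToFile).rdd
       .map(lambda r: r[0])          # Row(value=...) -> plain string
       .zipWithIndex()               # (line, index) pairs
       .filter(lambda li: li[1] > numberOfLinesToSkip)
       .map(lambda li: li[0]))       # drop the index again

df = (spark.read
      .option("inferSchema", "true")
      .option("header", "true")
      .csv(rdd))                     # parse the remaining CSV rows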
First take the values using the take() method as zero323 suggested:
raw = sc.textFile("file.txt")
headers = raw.take(3)
Then filter them out of the remaining lines:
final_raw = raw.filter(lambda x: x not in headers)
and done.

Need to create a median function that draws from a dictionary

I need to find the median of all the integers associated with each key (AA, BB). The basic format my code leads to:
AA - 21
AA - 52
BB - 3
BB - 2
My code:
def scoreData(filename):
    d = dict()
    fin = open(filename)
    contents = fin.readlines()
    for line in contents:
        parts = line.split()
        parts[1] = int(parts[1])
        if parts[0] not in d:
            d[parts[0]] = [parts[1]]
        else:
            d[parts[0]].append(parts[1])
    names = list(d.keys())
    names.sort()  # alphabetizes the names
    print("Name\tMax\tMin\tMedian")
    for name in names:  # makes the table
        print(name, "\t", max(d[name]), "\t", min(d[name]), "\t", median(d[name]))
I'm afraid following the same format as the "names" and "names.sort" will completely restructure the data. I've thought about "from statistics import median," but once again I do not know how to only select the values associated with each of the same keys.
Thanks in advance
You can do it easily with pandas and numpy:
import pandas
import numpy as np
and aggregating by the first column:
score = pandas.read_csv(filename, delimiter=' - ', header=None)
print score.groupby(0).agg([np.median, np.min, np.max])
which returns:
                1
           median amin amax
0
AA           36.5   21   52
BB            2.5    2    3
There are many, many ways you can go about this. But here's a 'naive' implementation that will get the job done.
Assuming your data looks like:
AA 1
BB 5
AA 2
CC 7
BB 1
You can do the following:
import numpy as np
from collections import defaultdict

def find_medians(input_file):
    result_dict = defaultdict(list)
    for line in input_file.readlines():
        key, value = line.split()
        result_dict[key].append(int(value))
    # np.median gives the per-key median the question asks for
    return [(key, np.median(values)) for key, values in result_dict.items()]
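Since the question mentions "from statistics import median", here is a sketch with no third-party libraries that groups the values per key and prints the table the OP is after (the file name and the "AA - 21" line format are assumptions based on the question):
from collections import defaultdict
from statistics import median

def score_data(filename):
    d = defaultdict(list)
    with open(filename) as fin:
        for line in fin:
            name, _, value = line.split()   # lines look like "AA - 21"
            d[name].append(int(value))

    print("Name\tMax\tMin\tMedian")
    for name in sorted(d):   # alphabetical order, like names.sort() in the question
        print(name, max(d[name]), min(d[name]), median(d[name]), sep="\t")

score_data("scores.txt")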

manipulating two values of a key in dictionary at the same time

I am reading a file in the format below:
0.012281001 00:1c:c4:c2:1f:fe 1 30
0.012285001 00:1c:c4:c2:1f:fe 3 40
0.012288001 00:1c:c4:c2:1f:fe 2 50
0.012292001 00:1c:c4:c2:1f:fe 4 60
0.012295001 24:1c:c4:c2:2f:ce 5 70
I intend to use the column 2 entries as keys and columns 3 and 4 as two separate values. For each line I encounter, the respective values for that key must add up (value 1 and value 2 should be aggregated separately per key). For the example above, I need output like this:
'00:1c:c4:c2:1f:fe': 10 : 180, '24:1c:c4:c2:2f:ce': 5 : 70
The program I have written for the simple one-key, one-value case is below:
#!/usr/bin/python
import collections

result = collections.defaultdict(int)
clienthash = dict()

with open("luawrite", "r") as f:
    for line in f:
        hashes = line.split()
        ckey = hashes[1]
        val1 = float(hashes[2])
        result[ckey] += val1

print result
How can I extend this to two values, and how can I print them in the format shown above? I am not getting any ideas. Please help! BTW I am using Python 2.6.
You can store all of the values in a single dictionary, using a tuple as the stored value:
with open("luawrite", "r") as f:
for line in f:
hashes = line.split()
ckey = hashes[1]
val1 = int(hashes[2])
val2 = int(hashes[3])
a,b = result[ckey]
result[ckey] = (a+val1, b+val2)
print result
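To print it in the "key: value1 : value2" shape shown in the question, something like this could follow (Python 2 syntax, matching the question):
for ckey, (v1, v2) in result.items():
    print "%s: %s : %s" % (ckey, v1, v2)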
