Create a dictionary from lists, overwrite duplicate keys - python

I have my code below. I am trying to create a dictionary from lists extracted from a txt file, but the loop overwrites the previous information:
f = open('data.txt', 'r')
lines = f.readlines()
lines = [line.rstrip('\n') for line in open('data.txt')]
columns = lines.pop(0)
for i in range(len(lines)):
    lines[i] = lines[i].split(',')
dictt = {}
for line in lines:
    dictt[line[0]] = line[1:]
print('\n')
print(lines)
print('\n')
print(dictt)
I know I have to play with the
for line in lines:
    dictt[line[0]] = line[1:]
part, but what can I do? Do I have to use numpy? If so, how?
My lines list is :
[['USS-Enterprise', '6', '6', '6', '6', '6'],
['USS-Voyager', '2', '3', '0', '4', '1'],
['USS-Peres', '10', '4', '0', '0', '5'],
['USS-Pathfinder', '2', '0', '0', '1', '2'],
['USS-Enterprise', '2', '2', '2', '2', '2'],
['USS-Voyager', '2', '1', '0', '1', '1'],
['USS-Peres', '8', '5', '0', '0', '4'],
['USS-Pathfinder', '4', '0', '0', '2', '1']]
My dict becomes:
{'USS-Enterprise': ['2', '2', '2', '2', '2'],
'USS-Voyager': ['2', '1', '0', '1', '1'],
'USS-Peres': ['8', '5', '0', '0', '4'],
'USS-Pathfinder': ['4', '0', '0', '2', '1']}
It keeps only the last values; I actually want to add the values together. I am really confused.

You are trying to append multiple values for the same key. You can use defaultdict for that, or modify your code and utilize the get method for dictionaries.
for line in lines:
    dictt[line[0]] = dictt.get(line[0], []) + line[1:]
(Note that list.extend returns None, so dictt.get(line[0], []).extend(line[1:]) would store None; concatenating with + keeps the combined list.) This looks up each key, assigns line[1:] if the key is new, and if it is a duplicate, simply appends the new values onto the previous ones.

dict_output = {}
for line in list_input:
    if line[0] not in dict_output:
        dict_output[line[0]] = line[1:]
    else:
        dict_output[line[0]] += line[1:]
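If the end goal is to add the numeric values together per key (as the question mentions), a minimal sketch with collections.defaultdict, assuming every row carries the same number of values:

```python
from collections import defaultdict

rows = [['USS-Enterprise', '6', '6', '6', '6', '6'],
        ['USS-Voyager', '2', '3', '0', '4', '1'],
        ['USS-Enterprise', '2', '2', '2', '2', '2']]

# Each new key starts with a list of five zeros
totals = defaultdict(lambda: [0] * 5)
for row in rows:
    for i, value in enumerate(row[1:]):
        totals[row[0]][i] += int(value)  # sum element-wise as integers

print(dict(totals))
```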

EDIT: You subsequently clarified in comments that your input has duplicate keys, and you want later rows to overwrite earlier ones.
ORIGINAL ANSWER: The input is not a dictionary, it's a CSV file. Just use pandas.read_csv() to read it:
import pandas as pd
df = pd.read_csv('my.csv', sep=r'\s+', header=None)
df
0 1 2 3 4 5
0 USS-Enterprise 6 6 6 6 6
1 USS-Voyager 2 3 0 4 1
2 USS-Peres 10 4 0 0 5
3 USS-Pathfinder 2 0 0 1 2
4 USS-Enterprise 2 2 2 2 2
5 USS-Voyager 2 1 0 1 1
6 USS-Peres 8 5 0 0 4
7 USS-Pathfinder 4 0 0 2 1
Seems your input didn't have a header row. If your input columns had names, you can add them with df.columns = ['Ship', 'A', 'B', 'C', 'D', 'E'] or whatever.
If you really want to write a dict output (beware of duplicate keys being suppressed), see df.to_dict()
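And if the duplicate rows should be summed rather than overwritten, pandas can do that too. A hedged sketch (the column names here are made up for illustration):

```python
import pandas as pd

# Small made-up frame mirroring the duplicate-key structure above
df = pd.DataFrame({'Ship': ['USS-Enterprise', 'USS-Voyager', 'USS-Enterprise'],
                   'A': [6, 2, 2],
                   'B': [6, 3, 2]})

# Sum all rows sharing the same key, keeping first-appearance order
summed = df.groupby('Ship', sort=False).sum()
print(summed)
```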


column comparison of two dataframe, return df with mismatches python

I want to print, for two dataframes, the rows where there is a mismatch in a given column, here the "second_column".
"first_column" is a key value that identifies the same product in both dataframes:
import pandas as pd

data1 = {
    'first_column': ['id1', 'id2', 'id3'],
    'second_column': ['1', '2', '2'],
    'third_column': ['1', '2', '2'],
    'fourth_column': ['1', '2', '2']
}
df1 = pd.DataFrame(data1)
print(df1)
test = df1['second_column'].nunique()
data2 = {
    'first_column': ['id1', 'id2', 'id3'],
    'second_column': ['3', '4', '2'],
    'third_column': ['1', '2', '2'],
    'fourth_column': ['1', '2', '2']
}
df2 = pd.DataFrame(data2)
print(df2)
expected output:
IIUC
btw, your screenshots don't match your DF definition
df1.loc[~df1['second_column'].isin(df2['second_column'])]
first_column second_column third_column fourth_column
0 1 1 1 1
df2.loc[~df2['second_column'].isin(df1['second_column'])]
first_column second_column third_column fourth_column
0 1 3 1 1
1 2 4 2 2
The compare method can do what you want:
different_rows = df1.compare(df2, align_axis=1).index
df1.loc[different_rows]
One important caveat with this method: compare only works on identically-labeled DataFrames, so rows that exist in only one frame cannot be reported as differences.
Or, if you want to find differences in one column only, you can first join on the index and then check whether the values match:
joined_df = df1.join(df2['second_column'], rsuffix='_df2')
diff = joined_df['second_column']!=joined_df['second_column_df2']
print(joined_df.loc[diff, df1.columns])
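For reference, a self-contained run of the compare approach on the question's data (only the two relevant columns shown; requires pandas >= 1.1):

```python
import pandas as pd

df1 = pd.DataFrame({'first_column': ['id1', 'id2', 'id3'],
                    'second_column': ['1', '2', '2']})
df2 = pd.DataFrame({'first_column': ['id1', 'id2', 'id3'],
                    'second_column': ['3', '4', '2']})

# Index labels of rows where any column differs between the two frames
different_rows = df1.compare(df2, align_axis=1).index
print(df1.loc[different_rows])
```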

Python if/else statement confusion

How can you create an if/else statement in Python when you have a file with both text and numbers? Say I want to replace values in the third-to-last column of the file below: values < 5, or a dot ".", should become a zero, and if possible I'd like to use that value as an integer for a sum.
A quick and dirty solution using awk would look like this, but I'm curious on how to handle this type of data with python:
awk -F"[ :]" '{if ( (!/^#/) && ($9<5 || $9==".") ) $9="0" ; print }'
So how do you solve this problem?
Thanks
Input file:
##Comment1
#Header
sample1 1 2 3 4 1:0:2:1:.:3
sample2 1 4 3 5 1:3:2:.:3:3
sample3 2 4 6 7 .:0:6:5:4:0
Desired output:
##Comment1
#Header
sample1 1 2 3 4 1:0:2:0:0:3
sample2 1 4 3 5 1:3:2:0:3:3
sample3 2 4 6 7 .:0:6:5:4:0
SUM = 5
Result so far
['sample1', '1', '2', '3', '4', '1', '0', '2', '0', '0', '3\n']
['sample2', '1', '4', '3', '5', '1', '3', '2', '0', '3', '3\n']
['sample3', '2', '4', '6', '7', '.', '0', '6', '5', '4', '0']
Here's what I have tried so far:
import re

data = open("inputfile.txt", 'r')
for line in data:
    if not line.startswith("#"):
        nodots = line.replace(":.", ":0")
        final_nodots = re.split(r'\t|:', nodots)
        if int(final_nodots[8]) < 5:
            final_nodots[8] = "0"
            print(final_nodots)
        else:
            print(final_nodots)
import re

data = open("inputfile.txt", 'r')
sums = 0
for line in data:
    if not line.startswith("#"):
        nodots = line.replace(".", "0")
        final_nodots = list(re.findall(r'\d:.+\d+', nodots)[0])
        if int(final_nodots[6]) < 5:
            final_nodots[6] = "0"
        print(final_nodots)
        sums += int(final_nodots[6])
print(sums)
You were pretty close, but your final_nodots returns a split on : instead of a split on the first few numbers, so your 8 should have been a 3. After that, just add a sums counter to keep track of that slot.
['sample1 1 2 3 4 1', '0', '2', '0', '0', '3\n']
There are better ways to achieve what you want but I just wanted to fix your code.
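One of those "better ways" might look like the sketch below. The sample lines are inlined here instead of reading inputfile.txt, and only the last space-separated token is split on colons; fields[3] is the third-to-last colon field, i.e. $9 in the awk one-liner:

```python
# Sample input inlined for illustration (the question reads inputfile.txt)
lines = [
    "##Comment1",
    "#Header",
    "sample1 1 2 3 4 1:0:2:1:.:3",
    "sample2 1 4 3 5 1:3:2:.:3:3",
    "sample3 2 4 6 7 .:0:6:5:4:0",
]

total = 0
for line in lines:
    if line.startswith("#"):
        print(line)                      # pass comment/header lines through
        continue
    line = line.replace(":.", ":0")      # dots after a colon become 0
    prefix, last = line.rsplit(" ", 1)   # last token holds the colon fields
    fields = last.split(":")
    if int(fields[3]) < 5:               # fields[3] is $9 in the awk version
        fields[3] = "0"
    total += int(fields[3])
    print(prefix, ":".join(fields))
print("SUM =", total)
```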

Python pandas: Nested List of Dictionary into Dataframe

Here is the data structure below... It is a list of inner lists, each containing two dictionaries.
I want it into dataframe with these headings: hasPossession, score and spread.
[[{'hasPossession': '0', 'score': '23', 'spread': '-0'},
{'hasPossession': '0', 'score': '34', 'spread': '0.0'}],
[{'hasPossession': '0', 'score': '', 'spread': '-7.5'},
{'hasPossession': '0', 'score': '', 'spread': '7.5'}],
[{'hasPossession': '0', 'score': '', 'spread': '-1'},
{'hasPossession': '0', 'score': '', 'spread': '1.0'}]]
In short, the structure above is a list containing 3 lists, and each inner list contains 2 dictionaries of 3 key-value pairs.
How do I transform such into pandas dataframe?
Flatten the list and use the default constructor:
pd.DataFrame([k for item in initial_list for k in item])
hasPossession score spread
0 0 23 -0
1 0 34 0.0
2 0 -7.5
3 0 7.5
4 0 -1
5 0 1.0
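Put together, assuming the nested structure above is bound to a name like initial_list:

```python
import pandas as pd

initial_list = [
    [{'hasPossession': '0', 'score': '23', 'spread': '-0'},
     {'hasPossession': '0', 'score': '34', 'spread': '0.0'}],
    [{'hasPossession': '0', 'score': '', 'spread': '-7.5'},
     {'hasPossession': '0', 'score': '', 'spread': '7.5'}],
    [{'hasPossession': '0', 'score': '', 'spread': '-1'},
     {'hasPossession': '0', 'score': '', 'spread': '1.0'}],
]

# Flatten the list of lists into one flat list of dicts, then build the frame
df = pd.DataFrame([k for item in initial_list for k in item])
print(df)
```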

Separate a file in paragraphs

I have a file like this:
cluster number 1
1
2
3
cluster number 2
1
2
3
cluster number x
1
2
3
I want to split this file in paragraph of cluster numbers, like this
cluster number 1
1
2
3
I tried to search for an answer but I couldn't work it out.
Thanks for your help!
Use a regular expression:
import re

input_text = "..."
r = re.findall(r"(cluster number (\d+)\n\n(\d+)\n\n(\d+)\n\n(\d+))", input_text)
print(r)
This code returns the list below:
[('cluster number 1\n\n1\n\n2\n\n3', '1', '1', '2', '3'),
('cluster number 2\n\n1\n\n2\n\n3', '2', '1', '2', '3')]
As recommended, you should use regular expressions. Perhaps the re.split function would be suitable here:
>>> l = re.split(r'cluster number (?:\d+)', x)[1:]
>>> [a.split() for a in l]
[['1', '2', '3'], ['1', '2', '3'], ...]
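If the real file doesn't actually contain blank lines between entries (the sample above doesn't), a plain loop that starts a new group at each "cluster number" header is a regex-free alternative. A sketch, assuming every header starts with that phrase:

```python
text = """cluster number 1
1
2
3
cluster number 2
1
2
3"""

paragraphs = []
for line in text.splitlines():
    if line.startswith('cluster number'):
        paragraphs.append([line])        # a header starts a new paragraph
    elif line.strip() and paragraphs:
        paragraphs[-1].append(line)      # other non-blank lines join the current one

for p in paragraphs:
    print('\n'.join(p))
    print()
```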

Python: split by (different) n spaces

I have lines like this:
2 20 164 "guid" Some name^7 0 ip.a.dd.res:port -21630 25000
6 30 139 "guid" Other name^7 0 ip.a.dd.res:port 932 25000
I would like to split this, but the problem is that there is a different number of spaces between these "words"...
How can I do this?
Python's split function doesn't care about the number of spaces:
>>> ' 2 20 164 "guid" Some name^7 0 ip.a.dd.res:port -21630 25000'.split()
['2', '20', '164', '"guid"', 'Some', 'name^7', '0', 'ip.a.dd.res:port', '-21630', '25000']
Have you tried split()? It will "compress" runs of spaces, so after splitting you will get:
'2', '20', '164', '"guid"', etc.
>>> l = "1 2 4 'ds' 5 66"
>>> l
"1 2 4 'ds' 5 66"
>>> l.split(' ')
['1', '', '', '2', '', '', '4', "'ds'", '5', '', '66']
>>> [x for x in l.split()]
['1', '2', '4', "'ds'", '5', '66']
Just use the split() function with no arguments. The default delimiter behaves like the regex \s+, i.e. any kind and any number of whitespace.
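For comparison, splitting on the \s+ regex explicitly gives the same tokens as the no-argument split(), once the leading whitespace is stripped (re.split would otherwise keep an empty leading field):

```python
import re

line = ' 2    20 164 "guid" Some name^7 0 ip.a.dd.res:port -21630 25000'

# Explicit regex split on runs of whitespace
print(re.split(r'\s+', line.strip()))
```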
