I merged two CSV files on the 'NAME' column using pandas. Now I am trying to compare the 'STANCE' values to 'bipedal' and print the matching rows. In short, I would like to know how to compare any column's values with a string.
s1:
NAME LEG_LENGTH DIET
0 Hadrosaurus 1.20 herbivore
s2:
NAME STRIDE_LENGTH STANCE
3 Hadrosaurus 1.40 bipedal
merged:
NAME LEG_LENGTH DIET STRIDE_LENGTH STANCE
0 Hadrosaurus 1.20 herbivore 1.40 bipedal
Code:
import pandas as pd

csv1 = 'dataset1.csv'
csv2 = 'dataset2.csv'

def splits(c1, c2):
    s1 = pd.read_csv(c1)
    s2 = pd.read_csv(c2)
    print('%s\n%s' % (s1, s2))
    merged = s1.merge(s2, on="NAME", how="outer")  # Merging the two files on column NAME
    print(merged)
    return

splits(csv1, csv2)
hey little Pandas apprentice, try this (with your merged frame):
merged.loc[merged.STANCE.str.contains('bipedal')]
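As a general pattern (a minimal sketch with hypothetical data), you can compare a column to a string either with an exact equality test or with a substring match:

```python
import pandas as pd

# Hypothetical frame standing in for the merged result
df = pd.DataFrame({"NAME": ["Hadrosaurus", "Triceratops"],
                   "STANCE": ["bipedal", "quadrupedal"]})

exact = df[df["STANCE"] == "bipedal"]            # exact string equality
partial = df[df["STANCE"].str.contains("ped")]   # substring match
print(exact)    # one row: Hadrosaurus
print(partial)  # both rows contain "ped"
```

Use == when the whole cell must equal the string, and str.contains when a partial match is enough.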
I'm new to Python and want to ask a quick question about how to replace multiple characters simultaneously, given that the entries may have different data types. I just want to change the strings and keep everything else as it is:
import pandas as pd

def test_me(text):
    replacements = [("ID", ""), ("u", "a")]
    return [text.replace(a, b) for a, b in replacements if type(text) == str]

cars = {'Brand': ['HonduIDCivic', 1, 3.2, 'CarIDA4'],
        'Price': [22000, 25000, 27000, 35000]}
df = pd.DataFrame(cars, columns=['Brand', 'Price'])
df['Brand'] = df['Brand'].apply(test_me)
resulting in
Brand Price
0 [HonduCivic, HondaIDCivic] 22000
1 [] 25000
2 [] 27000
3 [CarA4, CarIDA4] 35000
rather than
Brand Price
0 HondaCivic 22000
1 1 25000
2 3.2 27000
3 CarA4 35000
Appreciate any suggestions!
If the replacements never have identical search phrases, it is easier to convert the list of tuples into a dictionary and then use re.sub:
import re
#...
def test_me(text):
    replacements = dict([("ID", ""), ("u", "a")])
    if type(text) == str:
        pattern = "|".join(sorted(map(re.escape, replacements.keys()), key=len, reverse=True))
        return re.sub(pattern, lambda x: replacements[x.group()], text)
    else:
        return text
The "|".join(sorted(map(re.escape, replacements.keys()),key=len,reverse=True)) part will create a regular expression out of re.escaped dictionary keys starting with the longest so as to avoid issues when handling nested search phrases that share the same prefix.
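A small sketch of why the longest-first sort matters: with overlapping keys (hypothetical here), an alternation that lists the shorter key first shadows the longer one.

```python
import re

# Hypothetical overlapping keys: "abc" shares the prefix "ab"
replacements = {"ab": "X", "abc": "Y"}

longest_first = "|".join(sorted(map(re.escape, replacements), key=len, reverse=True))
shortest_first = "|".join(sorted(map(re.escape, replacements), key=len))

print(re.sub(longest_first, lambda m: replacements[m.group()], "abc"))   # Y
print(re.sub(shortest_first, lambda m: replacements[m.group()], "abc"))  # Xc
```

With "ab|abc" the regex engine matches "ab" at position 0 and never tries "abc", so "abc" becomes "Xc" instead of "Y".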
Pandas test:
>>> df['Brand'].apply(test_me)
0 HondaCivic
1 1
2 3.2
3 CarA4
Name: Brand, dtype: object
I have two dictionaries:
time = {'JAN':'A','FEB':'B','MAR':'C','APR':'D','MAY':'E','JUN':'F','JUL':'H'}
currency={'USD':'US','EUR':'EU','GBP':'GB','HUF':'HF'}
and a table consisting of one single column where bond names are contained:
bond_names=pd.DataFrame({'Names':['Bond.USD.JAN.21','Bond.USD.MAR.25','Bond.EUR.APR.22','Bond.HUF.JUN.21','Bond.HUF.JUL.23','Bond.GBP.JAN.21']})
I need to replace each name with a string of the following format: EUA21, where the first two letters are the dictionary value for the currency key, the next letter is the value for the month key, and the last two digits are the year from the name.
I tried to split the name using the following code:
bond_names['Names']=bond_names['Names'].apply(lambda x: x.split('.'))
but I am not sure how to proceed from here: I need to look up the currency and month in their dictionaries, join the resulting values, and append the year from the name.
This will give you a list of what you need:
time = {'JAN':'A','FEB':'B','MAR':'C','APR':'D','MAY':'E','JUN':'F','JUL':'H'}
currency={'USD':'US','EUR':'EU','GBP':'GB','HUF':'HF'}
bond_names = {'Names':['Bond.USD.JAN.21','Bond.USD.MAR.25','Bond.EUR.APR.22','Bond.HUF.JUN.21','Bond.HUF.JUL.23','Bond.GBP.JAN.21']}
result = []
for names in bond_names['Names']:
    bond = names.split('.')
    result.append(currency[bond[1]] + time[bond[2]] + bond[3])
print(result)
You can do that like this:
import pandas as pd
time = {'JAN':'A','FEB':'B','MAR':'C','APR':'D','MAY':'E','JUN':'F','JUL':'H'}
currency = {'USD':'US','EUR':'EU','GBP':'GB','HUF':'HF'}
bond_names = pd.DataFrame({'Names': ['Bond.USD.JAN.21', 'Bond.USD.MAR.25', 'Bond.EUR.APR.22', 'Bond.HUF.JUN.21', 'Bond.HUF.JUL.23', 'Bond.GBP.JAN.21']})
bond_names['Names2'] = bond_names['Names'].apply(lambda x: currency[x[5:8]] + time[x[9:12]] + x[-2:])
print(bond_names['Names2'])
# 0 USA21
# 1 USC25
# 2 EUD22
# 3 HFF21
# 4 HFH23
# 5 GBA21
# Name: Names2, dtype: object
With extended regex substitution:
In [42]: bond_names['Names'].str.replace(r'^[^.]+\.([^.]+)\.([^.]+)\.(\d+)', lambda m: '{}{}{}'.format(currency.get(m.group(1), m.group(1)), time.get(m.group(2), m.group(2)), m.group(3)), regex=True)
Out[42]:
0 USA21
1 USC25
2 EUD22
3 HFF21
4 HFH23
5 GBA21
Name: Names, dtype: object
You can try this:
import pandas as pd
time = {'JAN':'A','FEB':'B','MAR':'C','APR':'D','MAY':'E','JUN':'F','JUL':'H'}
currency={'USD':'US','EUR':'EU','GBP':'GB','HUF':'HF'}
bond_names=pd.DataFrame({'Names':['Bond.USD.JAN.21','Bond.USD.MAR.25','Bond.EUR.APR.22','Bond.HUF.JUN.21','Bond.HUF.JUL.23','Bond.GBP.JAN.21']})
bond_names['Names'] = bond_names['Names'].apply(lambda x: x.split('.'))
for idx, bond in enumerate(bond_names['Names']):
    currencyID = currency.get(bond[1])
    monthID = time.get(bond[2])
    yearID = bond[3]
    bond_names.loc[idx, 'Names'] = currencyID + monthID + yearID
Output
Names
0 USA21
1 USC25
2 EUD22
3 HFF21
4 HFH23
5 GBA21
I have the following DataFrame, and I have an input list of values.
I want to match each item from the input list against the Symbol and Synonyms columns in the DataFrame and extract only those rows where the input value appears in either the Symbol column or the Synonyms column (please note that the Synonyms values are separated by the '|' symbol).
In the output DataFrame I need an additional column, Input_symbol, which denotes the matching value. So in this case the desired output should look like the image below.
How can I do that?
IIUIC, use
In [346]: df[df.Synonyms.str.contains('|'.join(mylist))]
Out[346]:
Symbol Synonyms
0 A1BG A1B|ABG|GAB|HYST2477
1 A2M A2MD|CPAMD5|FWP007|S863-7
2 A2MP1 A2MP
6 SERPINA3 AACT|ACT|GIG24|GIG25
Check both columns with str.contains, chain the conditions with | (or), and last filter by boolean indexing:
mylist = ['GAB', 'A2M', 'GIG24']
m1 = df.Synonyms.str.contains('|'.join(mylist))
m2 = df.Symbol.str.contains('|'.join(mylist))
df = df[m1 | m2]
Another solution is numpy.logical_or.reduce over all the masks created by a list comprehension:
import numpy as np

masks = [df[x].str.contains('|'.join(mylist)) for x in ['Symbol','Synonyms']]
m = np.logical_or.reduce(masks)
Or use apply, then DataFrame.any to check for at least one True per row:
m = df[['Symbol','Synonyms']].apply(lambda x: x.str.contains('|'.join(mylist))).any(axis=1)
df = df[m]
print (df)
Symbol Synonyms
0 A1BG A1B|ABG|GAB|HYST2477
1 A2M A2MD|CPAMD5|FWP007|S863-7
2 A2MP1 A2MP
6 SERPINA3 AACT|ACT|GIG24|GIG25
The question has changed. What you want to do now is to look through the two columns (Symbol and Synonyms) and, if you find a value that is inside mylist, return it. If there is no match you can return 'No match!' (for instance).
import pandas as pd
import io
s = '''\
Symbol,Synonyms
A1BG,A1B|ABG|GAB|HYST2477
A2M,A2MD|CPAMD5|FWP007|S863-7
A2MP1,A2MP
NAT1,AAC1|MNAT|NAT-1|NATI
NAT2,AAC2|NAT-2|PNAT
NATP,AACP|NATP1
SERPINA3,AACT|ACT|GIG24|GIG25'''
mylist = ['GAB', 'A2M', 'GIG24']
df = pd.read_csv(io.StringIO(s))
# Store the lookup series
lookup_serie = df['Symbol'].str.cat(df['Synonyms'], sep='|').str.split('|')
# Lambda that returns the first value found in mylist, or 'No match!' when the iterator is exhausted
f = lambda x: next((i for i in x if i in mylist), 'No match!')
df.insert(0,'Input_Symbol',lookup_serie.apply(f))
print(df)
Returns
Input_Symbol Symbol Synonyms
0 GAB A1BG A1B|ABG|GAB|HYST2477
1 A2M A2M A2MD|CPAMD5|FWP007|S863-7
2 No match! A2MP1 A2MP
3 No match! NAT1 AAC1|MNAT|NAT-1|NATI
4 No match! NAT2 AAC2|NAT-2|PNAT
5 No match! NATP AACP|NATP1
6 GIG24 SERPINA3 AACT|ACT|GIG24|GIG25
Old solution:
f = lambda x: [i for i in x.split('|') if i in mylist] != []
m1 = df['Symbol'].apply(f)
m2 = df['Synonyms'].apply(f)
df[m1 | m2]
I have the following code from this question Df groupby set comparison:
import pandas as pd
wordlist = pd.read_csv('data/example.txt', sep='\r', header=None, index_col=None, names=['word'])
wordlist = wordlist.drop_duplicates(keep='first')
# wordlist['word'] = wordlist['word'].astype(str)
wordlist['split'] = ''
wordlist['anagrams'] = ''
for index, row in wordlist.iterrows():
    row['split'] = list(row['word'])
anaglist = wordlist['anagrams'] = wordlist['word'].apply(lambda x: ''.join(sorted(list(x))))
wordlist['anagrams'] = anaglist
wordlist = wordlist.drop(['split'], axis=1)
wordlist = wordlist['anagrams'].drop_duplicates(keep='first')
print(wordlist)
print(wordlist.dtypes)
Some input in my example.txt file seems to be read as ints, particularly if the strings are of different character lengths. I can't seem to force pandas to treat the data as strings using .astype(str).
What's going on?
To force a column to be read as strings you can pass the parameter dtype=str to read_csv, but that is only needed when numeric-looking values must be converted explicitly. Here all the values in the column are strings, so they are converted to str implicitly anyway.
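A minimal sketch of dtype=str at read time (with hypothetical inline data):

```python
import io
import pandas as pd

# Purely numeric-looking lines: without dtype=str this column would be read as int64
temp = "123\n456\n789"
wordlist = pd.read_csv(io.StringIO(temp), header=None, names=['word'], dtype=str)
print(wordlist['word'].tolist())  # ['123', '456', '789'] -- every value is a str
```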
I changed your code a bit:
Setup:
import pandas as pd
import numpy as np
temp=u'''"acb"
"acb"
"bca"
"foo"
"oof"
"spaniel"'''
import io
#after testing, replace 'io.StringIO(temp)' with 'example.txt'
wordlist = pd.read_csv(io.StringIO(temp), sep="\r", index_col=None, names=['word'])
print (wordlist)
word
0 acb
1 acb
2 bca
3 foo
4 oof
5 spaniel
#first remove duplicates
wordlist = wordlist.drop_duplicates()
#create lists and join them
wordlist['anagrams'] = wordlist['word'].apply(lambda x: ''.join(sorted(list(x))))
print (wordlist)
word anagrams
0 acb abc
2 bca abc
3 foo foo
4 oof foo
5 spaniel aeilnps
#sort DataFrame by column anagrams
wordlist = wordlist.sort_values('anagrams')
#get first duplicated rows
wordlist1 = wordlist[wordlist['anagrams'].duplicated()]
print (wordlist1)
word anagrams
2 bca abc
4 oof foo
#get all duplicated rows
wordlist2 = wordlist[wordlist['anagrams'].duplicated(keep=False)]
print (wordlist2)
word anagrams
0 acb abc
2 bca abc
3 foo foo
4 oof foo
I need to find the median of all the integers associated with each key (AA, BB). The basic format my code produces is:
AA - 21
AA - 52
BB - 3
BB - 2
My code:
from statistics import median

def scoreData(filename):
    d = dict()
    fin = open(filename)
    contents = fin.readlines()
    for line in contents:
        parts = line.split()
        parts[1] = int(parts[1])
        if parts[0] not in d:
            d[parts[0]] = [parts[1]]
        else:
            d[parts[0]].append(parts[1])
    names = list(d.keys())
    names.sort()  # alphabetizes the names
    print("Name\tMax\tMin\tMedian")
    for name in names:  # makes the table
        print(name, "\t", max(d[name]), "\t", min(d[name]), "\t", median(d[name]))
I'm afraid that following the same format as the "names" and "names.sort" lines will completely restructure the data. I've thought about "from statistics import median", but once again I do not know how to select only the values associated with each key.
Thanks in advance
You can do it easily with pandas and numpy:
import pandas
import numpy as np
and aggregating by first row:
score = pandas.read_csv(filename, delimiter=' - ', header=None, engine='python')
print(score.groupby(0).agg([np.median, np.min, np.max]))
which returns:
1
median amin amax
0
AA 36.5 21 52
BB 2.5 2 3
There are many, many ways you can go about this. But here's a 'naive' implementation that will get the job done.
Assuming your data looks like:
AA 1
BB 5
AA 2
CC 7
BB 1
You can do the following:
import numpy as np
from collections import defaultdict

def find_averages(input_file):
    result_dict = defaultdict(list)
    for line in input_file.readlines():
        key, value = line.split()
        result_dict[key].append(int(value))
    return [(key, np.mean(values)) for key, values in result_dict.items()]
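Since the question asks for the median rather than the mean, the same structure works with statistics.median (a sketch using the sample data from the question):

```python
from collections import defaultdict
from statistics import median

# Sample lines standing in for the file contents
lines = ["AA 21", "AA 52", "BB 3", "BB 2"]

d = defaultdict(list)
for line in lines:
    key, value = line.split()
    d[key].append(int(value))

# Median of the values collected for each key
medians = {key: median(values) for key, values in d.items()}
print(medians)  # {'AA': 36.5, 'BB': 2.5}
```

This matches the pandas groupby output above: 36.5 for AA and 2.5 for BB.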