I have the following dataframe with the columns id, start, end, name:
A 7 340 string1
B 12 113 string2
B 139 287 string3
B 301 348 string4
B 379 434 string5
C 41 73 string6
C 105 159 string7
and I am reading this into python3 using pandas:
import pandas
df = pandas.read_csv("table", comment="#", header=None, names=["id", "start", "end", "name"])
Now I need to parse the df and extract for each id the start, end and name into a list of the following format:
mylist = [GraphicFeature(start=XXX, end=YYY, color="#ffffff", label="ZZZ")]
XXX here is the start, YYY is the end, and ZZZ is the name. The list therefore has as many items as there are rows for that id.
GraphicFeature is just a member name of a module.
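(For readers who want to run the snippets below without that module, a minimal stand-in could be defined; this namedtuple is a hypothetical substitute, not the real class:)
from collections import namedtuple

# Hypothetical stand-in so the examples run without the real module;
# the actual GraphicFeature comes from the poster's plotting library.
GraphicFeature = namedtuple("GraphicFeature", ["start", "end", "color", "label"])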
I thought of looping over the dataframe like this:
uniq_val = list(df["id"].unique())
for i in uniq_val:
    extracted = df.loc[df["id"] == i]
But how do I construct mylist? (There will be some other plotting commands after constructing the list).
My expected "output" in the loop therefore is:
for id A:
mylist = [GraphicFeature(start=7, end=340, color="#ffffff", label="string1")]
for id B:
mylist = [GraphicFeature(start=12, end=113, color="#ffffff", label="string2"), GraphicFeature(start=139, end=287, color="#ffffff", label="string3"), GraphicFeature(start=301, end=348, color="#ffffff", label="string4"), GraphicFeature(start=379, end=434, color="#ffffff", label="string5")]
for id C:
mylist = [GraphicFeature(start=41, end=73, color="#ffffff", label="string6"), GraphicFeature(start=105, end=159, color="#ffffff", label="string7")]
Using a nested list comprehension:
l = [[GraphicFeature(start=x[0], end=x[1], color="#ffffff", label=x[2])
      for x in zip(y.start, y.end, y.name)]
     for _, y in df.groupby('id')]
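Spelled out, the same grouping can be written with itertuples and collected into a dict keyed by id; this is a sketch, not the answer's exact code, but it makes the per-id lookup explicit:
# Sketch: the comprehension above, expanded with itertuples and a dict
# keyed by id instead of a positional list.
features_by_id = {
    group_id: [
        GraphicFeature(start=row.start, end=row.end,
                       color="#ffffff", label=row.name)
        for row in group.itertuples(index=False)
    ]
    for group_id, group in df.groupby("id")
}
# features_by_id["B"] is then the four-element list from the expected output.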
One approach would be to let
mylists = df.groupby('id').apply(lambda group: group.apply(lambda row: GraphicFeature(start=row['start'], end=row['end'], color='#ffffff', label=row['name']), axis=1).tolist())
Spelling this out a little bit, note that pandas operations tend to fit together most tidily if one takes a functional programming approach: we want to turn each row into a GraphicFeature, and in turn we want to turn each group of rows with the same id into a list of GraphicFeatures. As such, the above could also be expanded to
def row_to_graphic_feature(row):
    return GraphicFeature(start=row['start'], end=row['end'], color='#ffffff', label=row['name'])

def id_group_to_list(group):
    return group.apply(row_to_graphic_feature, axis=1).tolist()
mylists = df.groupby('id').apply(id_group_to_list)
With your example data:
In [38]: df
Out[38]:
id start end name
0 A 7 340 string1
1 B 12 113 string2
2 B 139 287 string3
3 B 301 348 string4
4 B 379 434 string5
5 C 41 73 string6
6 C 105 159 string7
In [39]: mylists = df.groupby('id').apply(id_group_to_list)
In [40]: mylists['A']
Out[40]: [GraphicFeature(start=7, end=340, color='#ffffff', label='string1')]
In [41]: mylists['B']
Out[41]:
[GraphicFeature(start=12, end=113, color='#ffffff', label='string2'),
GraphicFeature(start=139, end=287, color='#ffffff', label='string3'),
GraphicFeature(start=301, end=348, color='#ffffff', label='string4'),
GraphicFeature(start=379, end=434, color='#ffffff', label='string5')]
In [42]: mylists['C']
Out[42]:
[GraphicFeature(start=41, end=73, color='#ffffff', label='string6'),
GraphicFeature(start=105, end=159, color='#ffffff', label='string7')]
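Since mylists is a Series indexed by id, the follow-up plotting loop can consume it directly; a minimal sketch:
# Sketch: iterate the per-id feature lists for the later plotting commands.
for uid, features in mylists.items():
    print(uid, len(features))  # the plotting calls would go here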
I have a data frame where, whenever the first column is empty, I have to fill it by concatenating the other two columns:
Cuenta CeCo    GLAccount    CeCoCeBe
123 A          123          A
234 S          234          S
NaN            345          B
NaN            987          A
for x in df1["Cuenta CeCo"].isna():
    if x:
        df1["Cuenta CeCo"] = df1["GLAccount"].apply(str) + " " + df1["CeCoCeBe"]
    else:
        df1["Cuenta CeCo"]
Types:
df1["Cuenta CeCo"].dtype  # dtype('O')
df1["GLAccount"].dtype    # dtype('float64')
df1["CeCoCeBe"].dtype     # dtype('O')
expected output:
Cuenta CeCo    GLAccount    CeCoCeBe
123 A          123          A
234 S          234          S
345 B          345          B
987 A          987          A
However, the concatenation seems to do something strange and gives me other numbers and letters:
Cuenta CeCo
251 O
471 B
791 R
341 O
Could someone help me understand why this happens and how to correct it to get my expected output?
Iterating over dataframes like this is typically bad practice and not what you intend. Your loop iterates over the boolean values of the isna() Series, and each time it hits a True it reassigns the entire "Cuenta CeCo" column, which overwrites the rows that already had values (hence the strange results). Note also that iterating over a DataFrame directly yields its column labels. Try
for x in df:
    print(x)
and you will see it print the column headings.
As for what you're trying to do, try this:
cols = ['Cuenta CeCo', 'GLAccount', 'CeCoCeBe']
mask = df[cols[0]].isna()
df.loc[mask, cols[0]] = df.loc[mask, cols[1]].map(str) + " " + df.loc[mask, cols[2]]
This generates a mask (a Series of True/False values) that selects just the NaN rows; we then replace those rows by converting the second column to string and concatenating it with the third, using the mask on both sides so that only the rows we need are touched.
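Here is a minimal runnable version of that fix, built from the sample data in the question. GLAccount is float64 there, so it is cast through int below so that 345.0 renders as "345" rather than "345.0" (that cast is an assumption about the desired output):
import numpy as np
import pandas as pd

df1 = pd.DataFrame({
    'Cuenta CeCo': ['123 A', '234 S', np.nan, np.nan],
    'GLAccount': [123.0, 234.0, 345.0, 987.0],  # float64, as in the question
    'CeCoCeBe': ['A', 'S', 'B', 'A'],
})

mask = df1['Cuenta CeCo'].isna()
# Cast through int so 345.0 renders as '345', not '345.0'.
df1.loc[mask, 'Cuenta CeCo'] = (
    df1.loc[mask, 'GLAccount'].astype(int).astype(str)
    + ' '
    + df1.loc[mask, 'CeCoCeBe']
)
print(df1)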
import pandas as pd
import numpy as np
df = pd.DataFrame([
    ['123 A', 123, 'A'],
    ['234 S', 234, 'S'],
    [np.nan, 345, 'B'],
    [np.nan, 987, 'A'],
], columns=['Cuenta CeCo', 'GLAccount', 'CeCoCeBe'])
def f(r):
    if pd.notna(r['Cuenta CeCo']):
        return r['Cuenta CeCo']
    else:
        return f"{r['GLAccount']} {r['CeCoCeBe']}"
df['Cuenta CeCo'] = df.apply(f, axis=1)
df
prints
  Cuenta CeCo  GLAccount CeCoCeBe
0       123 A        123        A
1       234 S        234        S
2       345 B        345        B
3       987 A        987        A
Sample data was provided as an image (not reproduced here). The input of file A and file B is given, and the output format is also given. Can someone help me with this?
I'd also be curious to see a clever/pythonic solution to this. My "ugly" solution iterating over index is as follows:
dfa, dfb are the two dataframes, columns named as in example.
dfa = pd.DataFrame({'c1':['v','f','h','m','s','d'],'c2':['100','110','235','999','333','39'],'c3':['tech','jjj',None,'iii','mnp','lf'],'c4':['hhh','scb','kkk','lop','sos','kdk']})
dfb = pd.DataFrame({'c1':['v','h','m','f','L','s'],'c2':['100','235','999','110','777','333'],'c3':['tech',None,'iii','jkl','9kdf','mnp1'],'c4':['hhh','mckkk','lok','scb','ooo','sos1']})
Now let's create lists of indexes to identify the rows that don't match between dfa and dfb
dfa, dfb = dfa.set_index(['c1','c2']), dfb.set_index(['c1','c2'])
mismatch3, mismatch4 = [], []
for i in dfa.index:
    if i in dfb.index:
        if dfa.loc[i, 'c3'] != dfb.loc[i, 'c3']:
            mismatch3.append(i)
        if dfa.loc[i, 'c4'] != dfb.loc[i, 'c4']:
            mismatch4.append(i)
mismatch = list(set(mismatch3 + mismatch4))
Now that this is done, we want to rename dfb, perform the join operation on the mismatched indexes, and add the "status" columns based on mismatch3 and mismatch4.
dfb = dfb.rename(index=str, columns={'c3':'b_c3','c4':'b_c4'})
df = dfa.loc[mismatch].join(dfb)
df['c3_status'] = 'match'
df['c4_status'] = 'match'
df.loc[mismatch3, 'c3_status'] = 'mismatch'
df.loc[mismatch4, 'c4_status'] = 'mismatch'
Finally, let's get those columns in the right order :)
result = df[['c3','b_c3','c3_status','c4','b_c4','c4_status']]
Once again, I'd love to see a prettier solution. I hope this helps!
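In that spirit, here is a sketch of a more vectorized variant (assuming the same indexed dfa and dfb as above; missing values are filled with '' so that two NaNs compare as equal, matching the loop's behaviour):
import numpy as np

# Sketch: join once on the shared (c1, c2) index, then derive the status
# columns without a Python-level loop.
joined = dfa.join(dfb.rename(columns={'c3': 'b_c3', 'c4': 'b_c4'}), how='inner')

ne3 = joined['c3'].fillna('').ne(joined['b_c3'].fillna(''))
ne4 = joined['c4'].fillna('').ne(joined['b_c4'].fillna(''))

out = joined[ne3 | ne4].copy()
out['c3_status'] = np.where(ne3[ne3 | ne4], 'mismatch', 'match')
out['c4_status'] = np.where(ne4[ne3 | ne4], 'mismatch', 'match')
result = out[['c3', 'b_c3', 'c3_status', 'c4', 'b_c4', 'c4_status']]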
Here are four lines of code that may do what you are looking for:
columns_to_compare = ['c2', 'c3']
dfa['Combo'] = dfa[columns_to_compare].apply(lambda x: ', '.join(x[x.notnull()]), axis=1)
dfb['Combo1'] = dfb[columns_to_compare].apply(lambda x: ', '.join(x[x.notnull()]), axis=1)
[i for i, x in enumerate(dfb['Combo1'].tolist()) if x not in dfa['Combo'].tolist()]
Explanation
Assume that you want to see what dfb rows are not in dfa, for columns c2 and c3.
To do this, consider the following approach:
Create a column "Combo" in dfa, where each row of "Combo" contains a comma-separated string of the values of the chosen comparison columns for that row:
dfa['Combo'] = dfa[columns_to_compare].apply(lambda x: ', '.join(x[x.notnull()]), axis=1)
c1 c2 c3 c4 Combo
0 v 100 tech hhh 100, tech
1 f 110 jjj scb 110, jjj
2 h 235 None kkk 235
3 m 999 iii lop 999, iii
4 s 333 mnp sos 333, mnp
5 d 39 lf kdk 39, lf
Apply the same logic to dfb
c1 c2 c3 c4 Combo1
0 v 100 tech hhh 100, tech
1 h 235 None mckkk 235
2 m 999 iii lok 999, iii
3 f 110 jkl scb 110, jkl
4 L 777 9kdf ooo 777, 9kdf
5 s 333 mnp1 sos1 333, mnp1
Create a list containing the required indices from dfb:
[i for i,x in enumerate(dfb['Combo1'].tolist()) if x not in dfa['Combo'].tolist()]
or to show the actual row values (not indices):
[[x] for i,x in enumerate(dfb['Combo1'].tolist()) if x not in dfa['Combo'].tolist()]
Row Index Result
[3, 4, 5]
Row Value Result
[['110, jkl'], ['777, 9kdf'], ['333, mnp1']]
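The same membership test can also be phrased as a left merge with an indicator; a sketch on the original, unindexed frames, again comparing only c2 and c3 (note that pandas matches missing join keys with each other, so two missing c3 values count as equal, just as with the Combo strings):
# Sketch: flag the rows of dfb whose (c2, c3) pair never occurs in dfa.
probe = dfb.merge(dfa[['c2', 'c3']].drop_duplicates(),
                  on=['c2', 'c3'], how='left', indicator=True)
dfb_only = probe[probe['_merge'] == 'left_only'].drop(columns='_merge')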
I would like to replace the letters with their position in the alphabet:
import string
import pandas as pd
new_vals = {c: ord(c)-96 for c in string.ascii_lowercase}
df = pd.DataFrame({'Values': ['aaa', 'abc', 'def']})
df['Values_new'] = [''.join(str(new_vals[c]) for c in row) for row in df['Values']]
df is now:
>>> df
Values Values_new
0 aaa 111
1 abc 123
2 def 456
Then you can go in and add your what-seems-like-decimal notation, although the logic there seems a little unclear to me (you have a comma listed above):
df['Values_new'] = [v[0] + '.' + v[1:] for v in df['Values_new']]
Result:
>>> df
Values Values_new
0 aaa 1.11
1 abc 1.23
2 def 4.56
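An equivalent without the Python-level loop (a sketch) pushes the whole character mapping into a single translation table; str.maketrans accepts a dict from characters to replacement strings, so multi-digit positions such as 'z' -> '26' work too:
import string
import pandas as pd

table = str.maketrans({c: str(ord(c) - 96) for c in string.ascii_lowercase})
df = pd.DataFrame({'Values': ['aaa', 'abc', 'def']})
df['Values_new'] = df['Values'].str.translate(table)  # 'aaa' -> '111', etc.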
I have a dataframe like the one displayed below:
# Create an example dataframe about a fictional army
raw_data = {'regiment': ['Nighthawks', 'Nighthawks', 'Nighthawks', 'Nighthawks'],
            'company': ['1st', '1st', '2nd', '2nd'],
            'deaths': ['kkk', 52, '25', 616],
            'battles': [5, '42', 2, 2],
            'size': ['l', 'll', 'l', 'm']}
df = pd.DataFrame(raw_data, columns = ['regiment', 'company', 'deaths', 'battles', 'size'])
My goal is to transform every single string inside the dataframe to upper case.
Notice: all data types are objects and must not be changed; the output must contain all objects. I want to avoid converting every single column one by one; I would like to do it generically over the whole dataframe if possible.
What I tried so far is to do this but without success
df.str.upper()
astype(str) will cast each Series to strings; the .str accessor on the converted Series then exposes string methods, so upper() can be called on every value. Note that after this, the dtype of all columns changes to object.
In [17]: df
Out[17]:
regiment company deaths battles size
0 Nighthawks 1st kkk 5 l
1 Nighthawks 1st 52 42 ll
2 Nighthawks 2nd 25 2 l
3 Nighthawks 2nd 616 2 m
In [18]: df.apply(lambda x: x.astype(str).str.upper())
Out[18]:
regiment company deaths battles size
0 NIGHTHAWKS 1ST KKK 5 L
1 NIGHTHAWKS 1ST 52 42 LL
2 NIGHTHAWKS 2ND 25 2 L
3 NIGHTHAWKS 2ND 616 2 M
You can later convert the 'battles' column to numeric again, using to_numeric():
In [42]: df2 = df.apply(lambda x: x.astype(str).str.upper())
In [43]: df2['battles'] = pd.to_numeric(df2['battles'])
In [44]: df2
Out[44]:
regiment company deaths battles size
0 NIGHTHAWKS 1ST KKK 5 L
1 NIGHTHAWKS 1ST 52 42 LL
2 NIGHTHAWKS 2ND 25 2 L
3 NIGHTHAWKS 2ND 616 2 M
In [45]: df2.dtypes
Out[45]:
regiment object
company object
deaths object
battles int64
size object
dtype: object
This can be solved with the following applymap call, which applies a function element-wise over the whole dataframe:
df = df.applymap(lambda s: s.upper() if isinstance(s, str) else s)
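Note that applymap is deprecated since pandas 2.1 in favour of DataFrame.map; the equivalent element-wise call is:
# pandas >= 2.1: DataFrame.map replaces the deprecated applymap.
df = df.map(lambda s: s.upper() if isinstance(s, str) else s)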
Rather than applying a function to each individual cell, you can collect the column names and loop over them, converting each column with the vectorized .str accessor; this per-column vector operation is faster than an element-wise apply. (This assumes every column holds strings.)
for col in dataset.columns:
    dataset[col] = dataset[col].str.lower()
Since str only works for series, you can apply it to each column individually then concatenate:
In [6]: pd.concat([df[col].astype(str).str.upper() for col in df.columns], axis=1)
Out[6]:
regiment company deaths battles size
0 NIGHTHAWKS 1ST KKK 5 L
1 NIGHTHAWKS 1ST 52 42 LL
2 NIGHTHAWKS 2ND 25 2 L
3 NIGHTHAWKS 2ND 616 2 M
Edit: performance comparison
In [10]: %timeit df.apply(lambda x: x.astype(str).str.upper())
100 loops, best of 3: 3.32 ms per loop
In [11]: %timeit pd.concat([df[col].astype(str).str.upper() for col in df.columns], axis=1)
100 loops, best of 3: 3.32 ms per loop
Both answers perform equally on a small dataframe.
In [15]: df = pd.concat(10000 * [df])
In [16]: %timeit pd.concat([df[col].astype(str).str.upper() for col in df.columns], axis=1)
10 loops, best of 3: 104 ms per loop
In [17]: %timeit df.apply(lambda x: x.astype(str).str.upper())
10 loops, best of 3: 130 ms per loop
On a large dataframe my answer is slightly faster.
Try this:
df2 = df2.apply(lambda x: x.str.upper() if x.dtype == "object" else x)
If you want to conserve the dtypes of non-string columns, check the column dtype rather than using isinstance (a Series is always an instance of object, so isinstance(x, object) is always True):
df.apply(lambda x: x.str.upper().str.strip() if x.dtype == object else x)
If you want to conserve dtypes, or change only one type, try a for loop with an if:
for x in dados.columns:
    if dados[x].dtype == 'object':
        print('object - allow upper')
        dados[x] = dados[x].str.upper()
    else:
        print('other - do not upper')
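A tidier version of the same dtype filter (a sketch) selects the object columns up front with select_dtypes:
# Sketch: upper-case only the object-dtype columns, leaving the rest untouched.
obj_cols = dados.select_dtypes(include='object').columns
dados[obj_cols] = dados[obj_cols].apply(lambda s: s.str.upper())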
You can also apply it to all the column names:
oh_df.columns = map(str.lower, oh_df.columns)
I have the following dataframe:
ID     first  mes1.1  mes1.2  ...  mes1.10  mes2.[1-10]     mes3.[1-10]
123df  John   5.5     130     ...  45       [12, 312, ...]  [123, 346, 53]
...
where I have abbreviated columns using [] notation. So in this dataframe I have 31 columns: first, mes1.[1-10], mes2.[1-10], and mes3.[1-10]. Each row is keyed by a unique index: ID.
I would like to form a new table where the identifying values (represented here by ID and first) are replicated, and the mes2 and mes3 columns (20 of them) are moved "down", giving me something like this:
ID first mes1 mes2 ... mes10
123df John 5.5 130 45
123df John 341 543 53
123df John 123 560 567
...
# How I set up your dataframe (please include a reproducible df next time)
df = pd.DataFrame(np.random.rand(6,31), index=["ID" + str(i) for i in range(6)],
columns=['first'] + ['mes{0}.{1}'.format(i, j) for i in range(1,4) for j in range(1,11)])
df['first'] = 'john'
Then there are two ways to do this
# Generate new underlying array
first = np.repeat(df['first'].values, 3)[:, np.newaxis]
new_vals = df.values[:, 1:].reshape(18,10)
new_vals = np.hstack((first, new_vals))
# Create new df
m = pd.MultiIndex.from_product((df.index, range(1,4)), names=['ID', 'MesNum'])
pd.DataFrame(new_vals, index=m, columns=['first'] + list(range(1,11)))
or using only Pandas
df.columns = ['first'] + list(range(1, 11)) * 3
pieces = [df.iloc[:, i:i+10] for i in range(1, 31, 10)]
df2 = pd.concat(pieces, keys=['first', 'second', 'third'])
df2 = df2.swaplevel(1, 0).sort_index(level=0)  # sortlevel was removed in later pandas
df2.insert(0, 'first', df['first'].repeat(3).values)
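An alternative sketch of the same reshape parses the mesN.M column names into a two-level column MultiIndex and stacks the mes level (this assumes the synthetic frame from the setup block above, before its columns were renamed):
# Sketch: reshape via a column MultiIndex and stack, as an alternative
# to the manual slicing above.
vals = df.drop(columns='first')
vals.columns = pd.MultiIndex.from_tuples(
    [tuple(int(p) for p in c[3:].split('.')) for c in vals.columns],
    names=['MesNum', 'Measure'],
)
long = vals.stack(level='MesNum')  # index: (ID, MesNum); columns: 1..10
long.insert(0, 'first', df['first'].reindex(long.index.get_level_values(0)).values)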