I have a csv table like so:
a, b, c, d
value, value, value, value
value, value, value, value
which I'm loading into a DataFrame. I also have a dictionary that looks like this:
data = {'a': some_value, 'b': some_value, 'c': some_value}
I want to navigate to the cell in column d where the row has values a, b and c as specified by this dictionary. I know that there exists exactly one cell that matches these criteria. How can I do this?
You could convert the data into a dataframe, then use a merge:
data = pd.DataFrame({'a':[1,2,3,4], 'b':[1,2,3,4],'c':[1,2,3,4],'d':[1,2,3,4]})
lookup = {'a':2,'b':2, 'c':2}
lookupdf = pd.DataFrame(lookup, index = [1]) #need the index, as they are all scalar
pd.merge(lookupdf, data)
a b c d
0 2 2 2 2
Another approach is to reduce a list of boolean conditions (in Python 3, reduce has to be imported from functools):
In [1034]: data[np.logical_and.reduce(pd.DataFrame(data[x] == lookup[x] for x in lookup))]
Out[1034]:
a b c d
1 2 2 2 2
In [1035]: data[reduce(lambda x, y: x & y, [data[x] == lookup[x] for x in lookup])]
Out[1035]:
a b c d
1 2 2 2 2
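For reference, here is a minimal self-contained sketch of the mask approach on Python 3. The d values (10 to 40) are made up so that the matched cell is easy to tell apart from the lookup values, and .iloc[0] relies on the question's guarantee that exactly one row matches.

```python
from functools import reduce

import pandas as pd

# Hypothetical data: same shape as above, but with distinct values in column d.
data = pd.DataFrame({'a': [1, 2, 3, 4],
                     'b': [1, 2, 3, 4],
                     'c': [1, 2, 3, 4],
                     'd': [10, 20, 30, 40]})
lookup = {'a': 2, 'b': 2, 'c': 2}

# One boolean Series per lookup key, combined with a logical AND.
mask = reduce(lambda x, y: x & y,
              [data[col] == val for col, val in lookup.items()])

# Exactly one row is assumed to match, so take the first matching cell of d.
value = data.loc[mask, 'd'].iloc[0]
print(value)
```

If no row matched, .iloc[0] would raise an IndexError, which may be preferable to silently returning nothing.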
Another way is to build a query string and pass it to DataFrame.query:
In [1009]: query = ' and '.join(['%s == %s' % (x, lookup[x]) for x in lookup])
In [1010]: query
Out[1010]: 'a == 2 and c == 2 and b == 2'
In [1011]: data.query(query)
Out[1011]:
a b c d
1 2 2 2 2
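Note that the '%s == %s' template only produces a valid expression for numeric values. A hedged variant that also quotes string values could format each value with repr, as sketched below. The string column b and the d values are made up for illustration, and the sketch assumes that column names are valid Python identifiers and that the lookup values are plain Python ints or strings.

```python
import pandas as pd

# Hypothetical data with one numeric and one string key column.
data = pd.DataFrame({'a': [1, 2, 3],
                     'b': ['x', 'y', 'z'],
                     'd': [10, 20, 30]})
lookup = {'a': 2, 'b': 'y'}

# !r inserts repr(val), so strings are quoted and numbers are left bare.
query = ' and '.join(f'{col} == {val!r}' for col, val in lookup.items())
print(query)

result = data.query(query)
print(result)
```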
Details
In [1012]: lookup
Out[1012]: {'a': 2, 'b': 2, 'c': 2}
In [1013]: data
Out[1013]:
a b c d
0 1 1 1 1
1 2 2 2 2
2 3 3 3 3
3 4 4 4 4
I have a dataframe with a column like this
Col1
1 A, 2 B, 3 C
2 B, 4 C
1 B, 2 C, 4 D
I have used .str.split(',', expand=True), and the result looks like this:
0 | 1 | 2
1 A | 2 B | 3 C
2 B | 4 C | None
1 B | 2 C | 4 D
what I am trying to achieve is to get this one:
Col A| Col B| Col C| Col D
1 A | 2 B | 3 C | None
None | 2 B | 4 C | None
None | 1 B | 2 C | 4 D
I am stuck. How can I get the new columns formatted like this?
Let's try:
# split and explode
s = df['Col1'].str.split(', ').explode()
# create new multi-level index
s.index = pd.MultiIndex.from_arrays([s.index, s.str.split().str[-1].tolist()])
# unstack to reshape
out = s.unstack().add_prefix('Col ')
Details:
# split and explode
0 1 A
0 2 B
0 3 C
1 2 B
1 4 C
2 1 B
2 2 C
2 4 D
Name: Col1, dtype: object
# create new multi-level index
0 A 1 A
B 2 B
C 3 C
1 B 2 B
C 4 C
2 B 1 B
C 2 C
D 4 D
Name: Col1, dtype: object
# unstack to reshape
Col A Col B Col C Col D
0 1 A 2 B 3 C NaN
1 NaN 2 B 4 C NaN
2 NaN 1 B 2 C 4 D
Most probably there are more general approaches you can use but this worked for me. Please note that this is based on a lot of assumptions and constraints of your particular example.
test_dict = {'col_1': ['1 A, 2 B, 3 C', '2 B, 4 C', '1 B, 2 C, 4 D']}
df = pd.DataFrame(test_dict)
First, we split the df into initial columns:
df2 = df.col_1.str.split(pat=',', expand=True)
Result:
0 1 2
0 1 A 2 B 3 C
1 2 B 4 C None
2 1 B 2 C 4 D
Next (first assumption), we need to ensure that we can later use ' ' as the delimiter to extract the column names. To do that, we strip all leading and trailing spaces from each string:
func = lambda x: pd.Series([i.strip() for i in x])
df2 = df2.astype(str).apply(func, axis=1)
Next, we need a list of the unique columns. To get it, we first extract the column name from each cell:
func = lambda x: pd.Series([i.split(' ')[1] for i in x if i != 'None'])
df3 = df2.astype(str).apply(func, axis=1)
Result:
0 1 2
0 A B C
1 B C NaN
2 B C D
Then create a list of unique columns ['A', 'B', 'C', 'D'] that are present in your DataFrame:
columns_list = pd.unique(df3[df3.columns].values.ravel('K'))
columns_list = [x for x in columns_list if not pd.isna(x)]
And create an empty base dataframe with those columns which will be used to assign the corresponding values:
result_df = pd.DataFrame(columns=columns_list)
Once the preparations are done we can assign column values for each of the rows and use pd.concat to merge them back in to one DataFrame:
result_list = []
result_list.append(result_df) # Adding the empty base table to ensure the columns are present
for row in df2.iterrows():
    result_object = {}  # dict that will be used to represent each row in source DataFrame
    for column in columns_list:
        for value in row[1]:  # row is returned in the format of tuple where first value is row_index that we don't need
            if value != 'None':
                if value.split(' ')[1] == column:  # Checking for a correct column to assign
                    result_object[column] = [value]
    result_list.append(pd.DataFrame(result_object))  # Adding dicts per row
Once the list of DataFrames is generated we can use pd.concat to put it together:
final_df = pd.concat(result_list, ignore_index=True) # ignore_index will rebuild the index for the final_df
And the result will be:
A B C D
0 1 A 2 B 3 C NaN
1 NaN 2 B 4 C NaN
2 NaN 1 B 2 C 4 D
I don't think this is the most elegant and efficient way to do it but it will produce the results you need
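A more compact sketch of the same idea (one dict per source row, keyed by the trailing letter) is shown below. It is a different technique from the loop above, not a rewrite of it, and it rests on the same assumption that every item ends in a single-letter column name separated by a space.

```python
import pandas as pd

df = pd.DataFrame({'col_1': ['1 A, 2 B, 3 C', '2 B, 4 C', '1 B, 2 C, 4 D']})

# Build one dict per row: the last token of each item is the column name,
# the stripped item itself is the cell value.
rows = [
    {item.split()[-1]: item.strip() for item in cell.split(',')}
    for cell in df['col_1']
]

# Keys missing from a row's dict become NaN in that row.
final_df = pd.DataFrame(rows, index=df.index)

# Sort the columns so the order does not depend on first appearance.
final_df = final_df.reindex(columns=sorted(final_df.columns))
print(final_df)
```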
I have the following dataframe which is a small part of a bigger one:
acc_num trans_cdi
0 1 c
1 1 d
3 3 d
4 3 c
5 3 d
6 3 d
I'd like to delete all rows where the last items are "d". So my desired dataframe would look like this:
acc_num trans_cdi
0 1 c
3 3 d
4 3 c
So the point is, that a group shouldn't have "d" as the last item.
There is a code that deletes the last row in the groups where the last item is "d". But in this case, I have to run the code twice to delete all last "d"-s in group 3 for example.
clean_3 = clean_2[clean_2.groupby('account_num')['trans_cdi'].transform(lambda x: (x.iloc[-1] != "d") | (x.index != x.index[-1]))]
Is there a better solution to this problem?
We can reverse each group with [::-1] and use idxmax to find the last row that is not "d", then keep everything up to that index:
grps = df['trans_cdi'].ne('d').groupby(df['acc_num'], group_keys=False)
idx = grps.apply(lambda x: x.loc[:x[::-1].idxmax()]).index
df.loc[idx]
acc_num trans_cdi
0 1 c
3 3 d
4 3 c
Testing on consecutive value
acc_num trans_cdi
0 1 c
1 1 d <--- d between two c, so we need to keep
2 1 c
3 1 d <--- row to be dropped
4 3 d
5 3 c
6 3 d
7 3 d
grps = df['trans_cdi'].ne('d').groupby(df['acc_num'], group_keys=False)
idx = grps.apply(lambda x: x.loc[:x[::-1].idxmax()]).index
df.loc[idx]
acc_num trans_cdi
0 1 c
1 1 d
2 1 c
4 3 d
5 3 c
Still gives correct result.
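An alternative that avoids apply is sketched below. It is a different technique from the idxmax one: walk the frame in reverse and use a per-group cumulative maximum to flag every row at or before the last non-"d" row of its group. The booleans are cast to integers before cummax as a precaution, and a group that contains only "d" rows is dropped entirely, which is an assumption about the desired behaviour.

```python
import pandas as pd

# Same data as the "consecutive value" test above.
df = pd.DataFrame({
    'acc_num':   [1, 1, 1, 1, 3, 3, 3, 3],
    'trans_cdi': ['c', 'd', 'c', 'd', 'd', 'c', 'd', 'd'],
})

# Reverse the rows so that "trailing" rows come first within each group.
rev = df.iloc[::-1]

# 1 where the row is not 'd', 0 where it is.
not_d = rev['trans_cdi'].ne('d').astype(int)

# Within each group the running max flips to 1 at the first non-'d' row
# (counted from the end) and stays 1 for all earlier rows.
seen_non_d = not_d.groupby(rev['acc_num']).cummax().astype(bool)

# Restore the original row order before using the mask.
result = df[seen_non_d.iloc[::-1]]
print(result)
```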
You can try this not so pandorable solution.
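def r(x):
    c = 0  # number of trailing 'd' rows in this group
    for v in x['trans_cdi'].iloc[::-1]:
        if v == 'd':
            c = c + 1
        else:
            break
    return x.iloc[:len(x) - c]  # x.iloc[:-c] would return an empty frame when c == 0
df.groupby('acc_num', group_keys=False).apply(r)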
acc_num trans_cdi
0 1 c
3 3 d
4 3 c
First, use shift to check whether a row and the row before it are both equal to 'd'; ~ then filters those rows out.
Second, make sure the value in the last row is not 'd'. If it is, delete that row.
code:
df = df[~((df['trans_cdi'] == 'd') & (df.shift(1)['trans_cdi'] == 'd'))]
if df['trans_cdi'].iloc[-1] == 'd': df = df.iloc[0:-1]
df
input (I tested it on more input data to ensure there were no bugs):
acc_num trans_cdi
0 1 c
1 1 d
3 3 d
4 3 c
5 3 d
6 3 d
7 1 d
8 1 d
9 3 c
10 3 c
11 3 d
12 3 d
output:
acc_num trans_cdi
0 1 c
1 1 d
4 3 c
5 3 d
9 3 c
10 3 c
I have two dataframes:
ONE=pd.read_csv('ONE.csv')
value_one value_two
2 4
3 1
4 2
TWO=pd.read_csv('TWO.csv')
X 1 2 3 4 5 6 7 8
1 a c j a d c c d
2 c k a d c c d e
3 f c k a d c c d
4 c k a d c c d j
I need to create additional column in ONE dataframe ( ONE['result'])
in conditions:
if value_one is equal to value from header of dataframe TWO
and value_two is equal to value from TWO dataframe in X column,
set in new column common value.
expected result:
value_one value_two result
2 4 k
3 1 j
4 2 d
I tried comparing only against the header with ONE[value_one]==TWO.iloc[0].
Thank you,
S.
lookup
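You can use lookup on your second dataframe (note that DataFrame.lookup was deprecated in pandas 1.2 and removed in 2.0, so this needs an older pandas):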
df_two = df_two.set_index('X') # set 'X' column as index
df_two.columns = df_two.columns.astype(int) # ensure column labels are numeric
df_one['result'] = df_two.lookup(df_one['value_two'], df_one['value_one'])
print(df_one)
value_one value_two result
0 2 4 k
1 3 1 j
2 4 2 d
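On pandas versions without DataFrame.lookup, one possible substitute is to translate the labels into positions with Index.get_indexer and index the underlying NumPy array. The sketch below rebuilds both frames inline instead of reading the CSV files and assumes that every requested label exists, because get_indexer returns -1 for a missing label and that would silently select the last row or column.

```python
import pandas as pd

df_one = pd.DataFrame({'value_one': [2, 3, 4], 'value_two': [4, 1, 2]})

# TWO with 'X' already set as the index and integer column labels.
df_two = pd.DataFrame(
    [list('acjadccd'), list('ckadccde'), list('fckadccd'), list('ckadccdj')],
    index=pd.Index([1, 2, 3, 4], name='X'),
    columns=range(1, 9),
)

# Translate row and column labels into integer positions.
rows = df_two.index.get_indexer(df_one['value_two'])
cols = df_two.columns.get_indexer(df_one['value_one'])

# Pick one cell per (row, column) pair from the underlying array.
df_one['result'] = df_two.to_numpy()[rows, cols]
print(df_one)
```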
I have the following question: I have the following table:
A B C
1 A A
2 A A.B
3 B B.C
4 A,B A.A,A.B,B.C
Column A is an index (1 through 4). Column B lists the letters that appear before the point in column C. If an entry in C has no point, the letter before it is implicit: the entry A in row 1 of column C is the letter after the point, so it stands for A.A.
Column C lists either both letters (before and after the point) or only the letter after it.
The idea is to split these points and lists up. Column C should first be split on the comma into separate rows (that works). The problem arises whenever several letters are possible in B, because after splitting, B should also contain only one letter (the correct one for column C).
So the result should look like this:
A B C
1 A A
2 A B
3 B C
4 A A
4 B B
4 B C
Can someone help me with ensuring, that column B contains the correct (i.e., fitting) information, which is denoted in column C?
Thanks and kind regards.
First, stack your dataframe to get your combinations:
out = (
    df.set_index(['A', 'B']).C
      .str.split(',').apply(pd.Series)
      .stack().reset_index([0, 1]).drop(columns='B')
)
A 0
0 1 A
1 2 A.B
2 3 B.C
3 4 A.A
4 4 A.B
5 4 B.C
Then replace single entries with their counterpart and apply pd.Series:
(out.set_index('A')[0].str
    .replace(r'^([A-Z])$', r'\1.\1', regex=True)
    .str.split('.').apply(pd.Series)
    .reset_index()
).rename(columns={0: 'B', 1: 'C'})
Output:
A B C
0 1 A A
1 2 A B
2 3 B C
3 4 A A
4 4 A B
5 4 B C
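A possible variant with explode, sketched here as a different technique from the stack-based one above: split C on commas, explode to one row per entry, split each entry on the point, and fall back to the entry itself where there is no point. Like the regex replacement above, it takes the letter for B from C and ignores the original B column.

```python
import pandas as pd

df = pd.DataFrame({
    'A': [1, 2, 3, 4],
    'B': ['A', 'A', 'B', 'A,B'],
    'C': ['A', 'A.B', 'B.C', 'A.A,A.B,B.C'],
})

# One row per comma-separated entry of C.
out = df.assign(C=df['C'].str.split(',')).explode('C').reset_index(drop=True)

# Split "X.Y" into two columns; an entry without a point has no second part.
parts = out['C'].str.split('.', expand=True)

out['B'] = parts[0]
# An entry like "A" is shorthand for "A.A", so reuse the first part.
out['C'] = parts[1].fillna(parts[0])
print(out)
```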
With a comprehension
def m0(x):
    """Take a string, return a dictionary split on '.' or a self mapping"""
    if '.' in x:
        return dict([x.split('.')])
    else:
        return {x: x}

def m1(s):
    """split string on ',' then do the dictionary thing in m0"""
    return [*map(m0, s.split(','))]

pd.DataFrame([
    (a, b, m[b])
    for a, B, C in df.itertuples(index=False)
    for b in B.split(',')
    for m in m1(C) if b in m
], df.index.repeat(df.C.str.count(',') + 1), df.columns)
A B C
0 1 A A
1 2 A B
2 3 B C
3 4 A A
3 4 A B
3 4 B C
Is there any way to merge two data frames when one of them has duplicated indices, such as in the following:
dataframe A:
value
key
a 1
b 2
b 3
b 4
c 5
a 6
dataframe B:
number
key
a I
b X
c V
after merging, I want to have a data frame like the following:
value number
key
a 1 I
b 2 X
b 3 X
b 4 X
c 5 V
a 6 I
Or maybe there are better ways to do it using groupby?
>>> a.join(b).sort_values('value')
value number
key
a 1 I
b 2 X
b 3 X
b 4 X
c 5 V
a 6 I
Use join:
>>> a = pd.DataFrame(range(1,7), index=list('abbbca'), columns=['value'])
>>> b = pd.DataFrame(['I', 'X', 'V'], index=list('abc'), columns=['number'])
>>> a.join(b)
value number
a 1 I
a 6 I
b 2 X
b 3 X
b 4 X
c 5 V
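If the original row order of A has to be kept no matter how join orders its result, one option is to look each key up in B directly. This is a sketch of a different technique, not of join itself, and it assumes that B's index is unique, as in the question. Keys missing from B would come back as NaN.

```python
import pandas as pd

a = pd.DataFrame({'value': range(1, 7)},
                 index=pd.Index(list('abbbca'), name='key'))
b = pd.DataFrame({'number': ['I', 'X', 'V']},
                 index=pd.Index(list('abc'), name='key'))

# reindex looks up every key of `a` in `b`, repeating values for duplicate keys;
# to_numpy() drops the index so the values are assigned by position.
merged = a.assign(number=b['number'].reindex(a.index).to_numpy())
print(merged)
```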