How to look up a value in another table in Python

I have two (actually many, but stick with two) datasets and I need to merge them together. However, they do not cover the same range and they have different reference values. Let's consider
a 1
b 2
c 3
e 4
and
a 2
b 3
d 7
e 2
I tried to simulate Excel's INDEX and MATCH functions, but I am not able to get the right result:
b = []
f = []
for i in data1["c1"]:
    if i in data2["c1"]:
        a = d3[data2["c4"].index[i]]
        f = b.append(a)
    else:
        continue
print(f)
Can you please help me understand how this works? I would also welcome a link with further information about this topic. Thank you.
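For reference, the usual pandas equivalent of Excel's INDEX/MATCH is a merge (or Series.map) on the key column. A minimal sketch, assuming the two datasets are DataFrames whose columns are named c1 (keys) and c2 (values); those names are placeholders, not from the question:
import pandas as pd

data1 = pd.DataFrame({"c1": ["a", "b", "c", "e"], "c2": [1, 2, 3, 4]})
data2 = pd.DataFrame({"c1": ["a", "b", "d", "e"], "c2": [2, 3, 7, 2]})

# An inner merge keeps only the keys present in both tables,
# much like INDEX/MATCH restricted to matching rows.
merged = pd.merge(data1, data2, on="c1", suffixes=("_1", "_2"))
print(merged)
#   c1  c2_1  c2_2
# 0  a     1     2
# 1  b     2     3
# 2  e     4     2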

If you want to create a consolidated file from the two above like:
Col1 Col2 Col3
a 1 2
b 2 3
c 3 7
d 4 2
You can simply use a dictionary, with keys as your column 1 values (a, b, c, d) and values as lists of the second-column values from your two DataFrames respectively, like:
your_dict = {'a': [1, 2], 'b': [2, 3], 'c': [3, 7], 'd': [4, 2]}
Then to output that into one DataFrame such as the one above, just use the DataFrame.from_dict() method in pandas with the orient parameter set to 'index' (see the pandas documentation).
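A minimal sketch of that last step, reusing the dictionary above (the column names are assumptions to match the table):
import pandas as pd

your_dict = {'a': [1, 2], 'b': [2, 3], 'c': [3, 7], 'd': [4, 2]}

# orient='index' makes the dictionary keys the row index,
# so each list becomes one row of the DataFrame.
out = pd.DataFrame.from_dict(your_dict, orient='index', columns=['Col2', 'Col3'])
out = out.rename_axis('Col1').reset_index()
print(out)
#   Col1  Col2  Col3
# 0    a     1     2
# 1    b     2     3
# 2    c     3     7
# 3    d     4     2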

Related

Saving small sub-dataframes containing all values associated to a specific 'key' string

I need a suggestion on a procedure using pandas. I have a 2-column dataset that looks like this:
A 0.4533
B 0.2323
A 1.2343
A 1.2353
B 4.3521
C 3.2113
C 2.1233
.. ...
where the first column contains strings and the second one floats. I would like to save the minimum value for each group of unique strings, so that I end up with the minimum associated with A, B, and C. Does anybody have any suggestions? It would also help if I could somehow store all the values associated with each string.
Many thanks,
James
Input data:
>>> df
0 1
0 A 0.4533
1 B 0.2323
2 A 1.2343
3 A 1.2353
4 B 4.3521
5 C 3.2113
6 C 2.1233
Use groupby before min:
out = df.groupby(0).min()
Output result:
>>> out
1
0
A 0.4533
B 0.2323
C 2.1233
Update:
filter out all the values in the original dataset that are more than 20% different from the minimum
out = df[df.groupby(0)[1].apply(lambda x: x <= x.min() * 1.2)]
>>> out
0 1
0 A 0.4533
1 B 0.2323
6 C 2.1233
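An equivalent filter that avoids apply is groupby.transform, which keeps the original index by construction (a sketch on the same data):
# transform('min') broadcasts each group's minimum back onto its rows,
# so the boolean mask aligns with df's original index.
out = df[df[1] <= df.groupby(0)[1].transform('min') * 1.2]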
You can simply do it with:
min_A = min(df[df["column_1"] == "A"]["value"])
min_B = min(df[df["column_1"] == "B"]["value"])
min_C = min(df[df["column_1"] == "C"]["value"])
where df is the DataFrame, and column_1 and value are the names of its columns.
You can also do it using the built-in pandas method groupby():
>>> df.groupby(["column_1"]).min()
The above will give the same result.
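For the second part of the question, storing all the values associated with each string, a minimal sketch not taken from the answers above, using a list aggregation on the same groupby:
# Collect every value belonging to each key into a list.
all_vals = df.groupby(0)[1].apply(list)
print(all_vals)
# 0
# A    [0.4533, 1.2343, 1.2353]
# B    [0.2323, 4.3521]
# C    [3.2113, 2.1233]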

Split Pandas Dataframe Column According To a Value

I searched and couldn't find a problem like mine, so if one exists and I somehow missed it, please let me know and I can delete this post.
I am stuck on a problem: splitting a pandas DataFrame into several DataFrames by a value.
I have a dataset inside a text file and I store it as a pandas DataFrame that has only one column. There is more than one set of information inside the dataset, and a certain value marks the end of each set; you can see a sample below:
The Sample Input
In [8]: df
Out[8]:
var1
0 a
1 b
2 c
3 d
4 endValue
5 h
6 f
7 b
8 w
9 endValue
So I want to split this df into different DataFrames. I couldn't find a way to do that, but I'm sure there must be an easy way. The format I show in the sample output may be wrong, so if you have a better idea I'd love to see it. Thank you for your help.
The sample output I'd like
var1
{[0 a
1 b
2 c
3 d
4 endValue]},
{[0 h
1 f
2 b
3 w
4 endValue]}
You could check where var1 is endValue, take the cumsum, and use the result as a custom grouper. Then Groupby and build a dictionary from the result:
d = dict(tuple(df.groupby(df.var1.eq('endValue').cumsum().shift(fill_value=0.))))
Or for a list of dataframes (effectively indexed in the same way):
l = [v for _,v in df.groupby(df.var1.eq('endValue').cumsum().shift(fill_value=0.))]
print(l[0])
var1
0 a
1 b
2 c
3 d
4 endValue
One idea, assuming unique index values, is to replace non-matched index values with NaN and backfill them, then loop over the groupby object to get a list of DataFrames:
g = df.index.to_series().where(df['var1'].eq('endValue')).bfill()
dfs = [a for i, a in df.groupby(g, sort=False)]
print(dfs)
[ var1
0 a
1 b
2 c
3 d
4 endValue, var1
5 h
6 f
7 b
8 w
9 endValue]

What's the fastest way to select values from columns based on keys in another columns in pandas?

I need a fast way to extract the right values from a pandas dataframe:
Given a dataframe with (a lot of) data in several named columns and an additional column whose values only contain names of the other columns, how do I select values from the data columns with the additional column as the key?
It's simple to do via an explicit loop, but this is extremely slow with something like .iterrows() directly on the DataFrame. Converting to numpy arrays is faster, but still not fast enough. Can I combine methods from pandas to do it even faster?
Example: This is the kind of DataFrame structure, where columns A and B contain data and column keys contains the keys to select from:
import pandas
df = pandas.DataFrame(
    {'A': [1, 2, 3, 4],
     'B': [5, 6, 7, 8],
     'keys': ['A', 'B', 'B', 'A']},
)
print(df)
output:
Out[1]:
A B keys
0 1 5 A
1 2 6 B
2 3 7 B
3 4 8 A
Now I need some fast code that returns a DataFrame like
Out[2]:
val_keys
0 1
1 6
2 7
3 4
I was thinking something along the lines of this:
tmp = df.melt(id_vars=['keys'], value_vars=['A','B'])
out = tmp.loc[tmp['keys'] == tmp['variable']]
which produces:
Out[2]:
keys variable value
0 A A 1
3 A A 4
5 B B 6
6 B B 7
but doesn't have the right order or index. So it's not quite a solution.
Any suggestions?
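For what it's worth, the melt idea can be completed; a sketch assuming pandas >= 1.1, where melt gained the ignore_index parameter:
# Keep the original index through the melt so row order can be restored.
tmp = df.melt(id_vars=['keys'], value_vars=['A', 'B'], ignore_index=False)
df['val_keys'] = tmp.loc[tmp['keys'] == tmp['variable'], 'value'].sort_index()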
See if either of these works for you:
import numpy as np

df['val_keys'] = np.where(df['keys'] == 'A', df['A'], df['B'])
or
df['val_keys'] = np.select([df['keys'] == 'A', df['keys'] == 'B'], [df['A'], df['B']])
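If there are more than two data columns, the same idea generalizes with plain numpy integer indexing (a sketch, not from the original answer):
import numpy as np

data = df[['A', 'B']]                            # only the data columns
col_pos = data.columns.get_indexer(df['keys'])   # column position of each row's key
# Pick one element per row: row i, column col_pos[i].
df['val_keys'] = data.to_numpy()[np.arange(len(df)), col_pos]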
No need to specify the column names explicitly with the code below:
def value(row):
    a = row.name        # this row's index label
    b = row['keys']     # the column name stored in 'keys'
    c = df.loc[a, b]    # look up that cell in the original frame
    return c

df.apply(value, axis=1)
Have you tried filtering and then mapping? Note that the dictionaries are keyed by row index rather than by the 'keys' value, since several rows share the same key:
df_A = df[df['keys'].isin(['A'])]
df_B = df[df['keys'].isin(['B'])]
A_dict = dict(zip(df_A.index, df_A['A']))
B_dict = dict(zip(df_B.index, df_B['B']))
df['val_keys'] = df.index.to_series().map(A_dict)
df['val_keys'] = df.index.to_series().map(B_dict).fillna(df['val_keys'])  # non-exhaustive mapping for the second one
Your df['val_keys'] column will now contain the result as in your val_keys output.
If you want you can just retain that column as in your expected output by:
df = df[['val_keys']]
Hope this helps :))

how to set first column to index using iloc[:,0]

I have a dataframe and want to set the first column as the index using iloc[:,0], but something's wrong.
I applied iloc[:,0] to set the first column as the index:
import pandas as pd

data12 = pd.DataFrame({"b": ["a", "h", "r", "e", "a"],
                       "a": range(5)})
data2 = data12.set_index(data12.iloc[:,0])
data2
b a
b
a a 0
h h 1
r r 2
e e 3
a a 4
I want to get the following result:
a
b
a 0
h 1
r 2
e 3
a 4
Thank you very much.
Use the name of the Series, not the Series itself.
data12.set_index(data12.iloc[:, 0].name) # or data12.columns[0]
a
b
a 0
h 1
r 2
e 3
a 4
From the documentation for set_index:
keys This parameter can be either a single column key, a single array of the same length as the calling DataFrame, or a list containing an arbitrary combination of column keys and arrays. Here, “array” encompasses Series, Index and np.ndarray.
You need to pass a key, not an array, if you want to set the index and have the respective Series no longer included as a column of the DataFrame.
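A minimal sketch of the difference, using the data12 frame from the question:
# Passing the column name (a key): the column moves into the index.
data2 = data12.set_index('b')               # or data12.set_index(data12.columns[0])

# Passing the Series itself (an array): the index is set, but
# column 'b' also stays in the frame, as seen in the question.
data2_dup = data12.set_index(data12.iloc[:, 0])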

programmatically add pandas DataFrame name to columns

This should be a pretty simple question, but I'm looking to programmatically insert the name of a pandas DataFrame into that DataFrame's column names.
Say I have the following DataFrame:
name_of_df = pandas.DataFrame({1: ['a','b','c','d'], 2: [1,2,3,4]})
print name_of_df
1 2
0 a 1
1 b 2
2 c 3
3 d 4
I want to have the following:
name_of_df = %%some_function%%(name_of_df)
print name_of_df
name_of_df1 name_of_df2
0 a 1
1 b 2
2 c 3
3 d 4
where, as you can see, the name of the DataFrame is programmatically inserted into the column names. I know pandas DataFrames don't have a __name__ attribute, so I'm drawing a blank on how to do this.
Please note that I want to do this programmatically, so altering the column names with a hardcoded 'name_of_df' string won't work.
From the linked question, you can do something like this. Multiple names can point to the same DataFrame, so this will just grab the "first" one.
def first_name(obj):
    return [k for k in globals() if globals()[k] is obj and not k.startswith('_')][0]
In [24]: first_name(name_of_df)
Out[24]: 'name_of_df'
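To finish the job, the retrieved name can be prefixed onto the column labels; a sketch building on first_name above (the helper name add_name_to_columns is made up):
def add_name_to_columns(df):
    # Find the global variable name bound to this DataFrame,
    # then prefix it onto every column label.
    name = first_name(df)
    df.columns = ['{}{}'.format(name, c) for c in df.columns]
    return df

name_of_df = add_name_to_columns(name_of_df)
print(name_of_df)
#   name_of_df1  name_of_df2
# 0           a            1
# 1           b            2
# 2           c            3
# 3           d            4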
