I have a program that reads a pandas data frame. One of its columns contains several values separated by : in a single field. Another column says what each of those values means.
I want to split these values and put them in new columns. The problem is that not all inputs to my program have exactly the same layout: the order of the values can change, and new values can appear.
It is easier to explain with an example:
df1
Column1 Column2
GT:AV:AD 0.1:123:23
GT:AV:AD 0.2:456:24
df2
Column1 Column2
GT:AD:AV 0.4:23:123
GT:AD:AV 0.5:12:323
Before being aware of this issue, what I did to split this data and put it in new columns was something like this:
file_data["GT"] = file_data[name_sample].apply(lambda x: x.split(":")[1])
file_data["AD"] = file_data[name_sample].apply(lambda x: x.split(":")[2])
If all I want is GT and AD (when they are present in the input data frame), how can I do this in a safer way?
import pandas as pd
df = pd.DataFrame({"col1":["GT:AV:AD","GT:AD:AV"],"col2":["0.1:123:23","0.4:23:123"]})
df["keyvalue"] = df.apply(lambda x:dict(zip(x.col1.split(":"),x.col2.split(":"))), axis=1)
print(df)
output
col1 col2 keyvalue
0 GT:AV:AD 0.1:123:23 {'GT': '0.1', 'AV': '123', 'AD': '23'}
1 GT:AD:AV 0.4:23:123 {'GT': '0.4', 'AD': '23', 'AV': '123'}
Explanation: I create a column keyvalue holding dicts whose keys come from col1 and whose values come from col2, using the dict(zip(keys_list, values_list)) construct. apply with axis=1 applies the function to each row, and lambda is Python's way of creating a nameless function. If you would rather have a pandas.DataFrame than a column of dicts, you can do
df2 = df.apply(lambda x:dict(zip(x.col1.split(":"),x.col2.split(":"))), axis=1).apply(pd.Series)
print(df2)
output
GT AV AD
0 0.1 123 23
1 0.4 123 23
Have a look at this answer:
keys = ['a', 'b', 'c']
values = [1, 2, 3]
dictionary = dict(zip(keys, values))
print(dictionary) # {'a': 1, 'b': 2, 'c': 3}
You need to split column 1 into a list of keys and column 2 into a list of values.
That way you can look up dictionary["GT"] and so on.
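To pull out only GT and AD regardless of their position, the dict(zip(...)) idea above can be combined with dict.get, so that a field that is absent from a row becomes a missing value rather than the wrong value or an error. This is a minimal sketch, assuming the col1/col2 column names from the answer above; the wanted list, the pick_fields helper and the third sample row (which has no AD) are made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "col1": ["GT:AV:AD", "GT:AD:AV", "GT:AV"],       # third row: hypothetical input without AD
    "col2": ["0.1:123:23", "0.4:23:123", "0.7:55"],
})

wanted = ["GT", "AD"]  # hypothetical: the fields we want as columns

def pick_fields(row):
    # Pair each key with its value for this row, whatever the order is.
    pairs = dict(zip(row["col1"].split(":"), row["col2"].split(":")))
    # dict.get returns None when the key is absent from this row.
    return pd.Series({key: pairs.get(key) for key in wanted})

result = df.apply(pick_fields, axis=1)
print(result)
```

The values stay strings here; converting them with pd.to_numeric afterwards is a separate step if numbers are needed.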
I have two dataframes:
df1 = pd.DataFrame({'Code' : ['10', '100', '1010'],
'Value' : [25, 50, 75]})
df2 = pd.DataFrame({'ID' : ['A', 'B', 'C'],
'Codes' : ['10', '100;1010', '100'],
'Value' : [25, 125, 50]})
Column "Codes" in df2 can contain multiple codes separated by ";". If this is the case, I need to sum up their values from df1.
I tried .map(), but this did not work for rows with multiple codes in a row. Also, I end up converting code '1010' to value '2525'.
How do I specify a perfect match and the summation for ";" separated values?
explode() the list of Codes
merge() with df1 and sum Value, grouping on the original index of df2
join() the result back to df2 as a new column, Value_calc
df1 = pd.DataFrame({"Code": ["10", "100", "1010"], "Value": [25, 50, 75]})
df2 = pd.DataFrame(
{"ID": ["A", "B", "C"], "Codes": ["10", "100;1010", "100"], "Value": [25, 125, 50]}
)
df2.join(
df2["Codes"]
.str.split(";")
.explode()
.reset_index()
.merge(df1, left_on="Codes", right_on="Code")
.groupby("index")
.agg({"Value": "sum"}),
rsuffix="_calc",
)
ID Codes Value Value_calc
0 A 10 25 25
1 B 100;1010 125 125
2 C 100 50 50
def sum(df1, df2):
df1['sum'] = df1['Value'] + df2['Value']
print(df1)
df1.loc[df2['Codes'].isin(df1['Code'])].apply(sum(df1, df2))
If the code in df2 is in df1, then add the values.
We can make a lookup table of Code to Value mapping from df1, then use .map() on df2 to map the expanded list of Codes to the mapping. Finally, sum up the mapped values for the same ID to arrive at the desired value, as follows:
1. Make a lookup table of Code to Value mapping from df1:
mapping = df1.set_index('Code')['Value']
2. Use .map() on df2 to map the expanded list of Codes to the mapping. Sum up the mapped values for the same ID to arrive at the desired value:
df2a = df2.set_index('ID') # set `ID` as index
df2a['value_map'] = (
df2a['Codes'].str.split(';') # split by semicolon
.explode() # expand split values into rows
.map(mapping) # map Code from mapping
.groupby('ID').sum() # group sum by ID
)
df2 = df2a.reset_index() # reset `ID` from index back to data column
Result:
print(df2)
ID Codes Value value_map
0 A 10 25 25
1 B 100;1010 125 125
2 C 100 50 50
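The lookup approach above can also report codes that have no entry in df1, since .map() leaves them as NaN and sum() skips NaN. A minimal self-contained sketch; the extra row D with the unknown code 999 is hypothetical and only there to show that behaviour.

```python
import pandas as pd

df1 = pd.DataFrame({"Code": ["10", "100", "1010"], "Value": [25, 50, 75]})
df2 = pd.DataFrame({
    "ID": ["A", "B", "C", "D"],
    "Codes": ["10", "100;1010", "100", "10;999"],  # "999" is a made-up unknown code
})

# Exact-match lookup table: Code -> Value
mapping = df1.set_index("Code")["Value"]

# One row per (ID, single code)
exploded = df2.set_index("ID")["Codes"].str.split(";").explode()
mapped = exploded.map(mapping)          # unknown codes become NaN

totals = mapped.groupby(level=0).sum()  # NaN values are skipped by sum()
unknown = exploded[mapped.isna()]       # codes that were not found in df1

df2["value_map"] = df2["ID"].map(totals)
print(df2)
print(unknown)
```

Because the codes are split before the lookup, '1010' is matched as a whole string and never as two occurrences of '10'.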
I would like to group the ids by Type column and apply a function on the grouped stocks that returns the first row where the Value column of the grouped stock is not NaN and copies it into a separate data frame.
I got the following so far:
dummy data:
df1 = {'Date': ['04.12.1998','05.12.1998','06.12.1998','04.12.1998','05.12.1998','06.12.1998'],
'Type': [1,1,1,2,2,2],
'Value': ['NaN', 100, 120, 'NaN', 'NaN', 20]}
df2 = pd.DataFrame(df1, columns = ['Date', 'Type', 'Value'])
print (df2)
Date Type Value
0 04.12.1998 1 NaN
1 05.12.1998 1 100
2 06.12.1998 1 120
3 04.12.1998 2 NaN
4 05.12.1998 2 NaN
5 06.12.1998 2 20
import pandas as pd
selectedStockDates = {'Date': [], 'Type': [], 'Values': []}
selectedStockDates = pd.DataFrame(selectedStockDates, columns = ['Date', 'Type', 'Values'])
first_valid_index = df2[['Values']].first_valid_index()
selectedStockDates.loc[df2.index[first_valid_index]] = df2.iloc[first_valid_index]
The code above should work for the first id, but I am struggling to apply this to all ids in the data frame. Does anyone know how to do this?
Let's mask the rows of the dataframe where column Value is NaN, then group the dataframe on Type and aggregate using first:
df2['Value'] = pd.to_numeric(df2['Value'], errors='coerce')
df2.mask(df2['Value'].isna()).groupby('Type', as_index=False).first()
Type Date Value
0 1.0 05.12.1998 100.0
1 2.0 06.12.1998 20.0
Just use groupby and first but you need to make sure that your null values are np.nan and not strings like they are in your sample data:
df2.groupby('Type')['Value'].first()
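If the whole first valid row is wanted with its original index and integer Type, one alternative is to drop the NaN rows first and then take the first remaining row of each group. This is only a sketch of that variation, not the method from the answers above; it reuses the dummy data from the question.

```python
import pandas as pd

df = pd.DataFrame({
    "Date": ["04.12.1998", "05.12.1998", "06.12.1998",
             "04.12.1998", "05.12.1998", "06.12.1998"],
    "Type": [1, 1, 1, 2, 2, 2],
    "Value": ["NaN", 100, 120, "NaN", "NaN", 20],
})

# Turn the "NaN" strings into real missing values.
df["Value"] = pd.to_numeric(df["Value"], errors="coerce")

# Keep rows with a valid Value, then take the first one per Type.
first_valid = df.dropna(subset=["Value"]).groupby("Type").head(1)
print(first_valid)
```

Unlike the mask-based version, head(1) keeps the original row labels, so the result can be traced back to the source frame.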
I have two DFs. I want to iterate through the rows in DF1, filter all the rows in DF2 with the same id, and collect their column "B" values in a new column of DF1.
data = {'id': [1,2,3]}
df1 = pd.DataFrame(data)
data = {'id': [1, 1, 3,3,3], 'B': ['ab', 'bc','ad','ds','sd']}
df2 = pd.DataFrame(data)
DF1 - id (15k rows)
DF2 - id, col1 (50M rows)
Desired output
data = {'id': [1,2,3],'B':['[ab,bc]','[]','[ad,ds,sd]']}
pd.DataFrame(data)
def func(df1):
temp3=df2.merge(pd.DataFrame(data=[df1.values]*len(df1),columns=df1.index),how='right',on=
['id'])
temp1 = temp3.B.values
return temp1
df1['B']=df1.apply(func,axis=1)
I am using merge for filtering and applying the function to each row of df1. The code takes 1 hour to execute on the large data frame. How can I make this run faster?
Are you looking for a simple filter and grouped listification?
df2[df2['id'].isin(df1['id'])].groupby('id', as_index=False)[['B']].agg(list)
id B
0 1 [ab, bc]
1 3 [ad, ds, sd]
Note that grouping as lists is considered suboptimal in terms of performance.
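The desired output in the question also keeps ids from df1 that have no match in df2, with an empty list. One way to get that, sketched here on the sample data from the question, is to build the lists once with groupby and then map them onto df1; the isinstance check that fills the gaps is an illustrative choice, not part of the answer above.

```python
import pandas as pd

df1 = pd.DataFrame({"id": [1, 2, 3]})
df2 = pd.DataFrame({"id": [1, 1, 3, 3, 3],
                    "B": ["ab", "bc", "ad", "ds", "sd"]})

# Build one list of B values per id, only for ids that occur in df1.
lists = df2[df2["id"].isin(df1["id"])].groupby("id")["B"].agg(list)

# Map the lists onto df1; ids without a match come back as NaN,
# which is replaced by an empty list.
df1["B"] = df1["id"].map(lists).apply(lambda v: v if isinstance(v, list) else [])
print(df1)
```

This does a single grouped pass over df2 instead of one merge per row of df1.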
This may be a simple question, but I cannot figure out how to do this. Let's say that I have two variables as follows.
a = 2
b = 3
I want to construct a DataFrame from this:
df2 = pd.DataFrame({'A':a,'B':b})
This generates an error:
ValueError: If using all scalar values, you must pass an index
I tried this also:
df2 = (pd.DataFrame({'a':a,'b':b})).reset_index()
This gives the same error message.
The error message says that if you're passing scalar values, you have to pass an index. So you can either not use scalar values for the columns -- e.g. use a list:
>>> df = pd.DataFrame({'A': [a], 'B': [b]})
>>> df
A B
0 2 3
or use scalar values and pass an index:
>>> df = pd.DataFrame({'A': a, 'B': b}, index=[0])
>>> df
A B
0 2 3
You may try wrapping your dictionary into a list:
my_dict = {'A':1,'B':2}
pd.DataFrame([my_dict])
A B
0 1 2
You can also use pd.DataFrame.from_records which is more convenient when you already have the dictionary in hand:
df = pd.DataFrame.from_records([{ 'A':a,'B':b }])
You can also set index, if you want, by:
df = pd.DataFrame.from_records([{ 'A':a,'B':b }], index='A')
You need to create a pandas series first. The second step is to convert the pandas series to pandas dataframe.
import pandas as pd
data = {'a': 1, 'b': 2}
pd.Series(data).to_frame()
You can even provide a column name.
pd.Series(data).to_frame('ColumnName')
Maybe Series would provide all the functions you need:
pd.Series({'A':a,'B':b})
A DataFrame can be thought of as a collection of Series, so you can:
Concatenate multiple Series into one data frame
Add a Series to an existing data frame
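Both Series-based options can be sketched in a few lines. The Series names x, y and z and their values are made up for illustration; a and b are the scalars from the question.

```python
import pandas as pd

a, b = 2, 3

# A Series built from the scalars, turned into a one-row frame.
s = pd.Series({"A": a, "B": b})
row = s.to_frame().T            # columns A and B, single row labelled 0

# Concatenate several Series side by side into one data frame.
s1 = pd.Series([1, 2, 3], name="x")
s2 = pd.Series([4, 5, 6], name="y")
combined = pd.concat([s1, s2], axis=1)

# Add another Series to the existing frame as a new column.
combined["z"] = pd.Series([7, 8, 9])
print(row)
print(combined)
```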
Pandas magic at work. All logic is out.
The error message "ValueError: If using all scalar values, you must pass an index" says you must pass an index.
This does not necessarily mean that passing an index makes pandas do what you want it to do.
When you pass an index, pandas will treat your dictionary keys as column names and the values as what the column should contain for each of the values in the index.
a = 2
b = 3
df2 = pd.DataFrame({'A':a,'B':b}, index=[1])
A B
1 2 3
Passing a larger index:
df2 = pd.DataFrame({'A':a,'B':b}, index=[1, 2, 3, 4])
A B
1 2 3
2 2 3
3 2 3
4 2 3
An index is usually automatically generated by a dataframe when none is given. However, pandas does not know how many rows of 2 and 3 you want. You can however be more explicit about it
df2 = pd.DataFrame({'A':[a]*4,'B':[b]*4})
df2
A B
0 2 3
1 2 3
2 2 3
3 2 3
The default index is 0 based though.
I would recommend always passing a dictionary of lists to the dataframe constructor when creating dataframes. It's easier to read for other developers. Pandas has a lot of caveats; don't make other developers have to be experts in all of them in order to read your code.
You could try:
df2 = pd.DataFrame.from_dict({'a':a,'b':b}, orient = 'index')
From the documentation on the 'orient' argument: If the keys of the passed dict should be the columns of the resulting DataFrame, pass ‘columns’ (default). Otherwise if the keys should be rows, pass ‘index’.
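Note that with orient='index' the dictionary keys become row labels, not columns, so the result has one row per key. A short sketch of what that looks like, assuming the same a and b as in the question; the column name value is arbitrary.

```python
import pandas as pd

a, b = 2, 3

# Keys become the row labels; the single column is named 0 by default.
by_rows = pd.DataFrame.from_dict({"a": a, "b": b}, orient="index")

# Transposing gives the one-row layout with a and b as columns.
one_row = by_rows.T

# The columns argument names the value column when orient="index".
named = pd.DataFrame.from_dict({"a": a, "b": b}, orient="index", columns=["value"])
print(by_rows)
print(one_row)
print(named)
```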
I usually use the following to quickly create a small table from dicts.
Let's say you have a dict where the keys are filenames and the values their corresponding filesizes, you could use the following code to put it into a DataFrame (notice the .items() call on the dict):
files = {'A.txt':12, 'B.txt':34, 'C.txt':56, 'D.txt':78}
filesFrame = pd.DataFrame(files.items(), columns=['filename','size'])
print(filesFrame)
filename size
0 A.txt 12
1 B.txt 34
2 C.txt 56
3 D.txt 78
You need to provide iterables as the values for the Pandas DataFrame columns:
df2 = pd.DataFrame({'A':[a],'B':[b]})
I had the same problem with numpy arrays and the solution is to flatten them:
data = {
'b': array1.flatten(),
'a': array2.flatten(),
}
df = pd.DataFrame(data)
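The answer does not say what shape array1 and array2 had. The sketch below assumes two plausible cases, 2-D column vectors and 0-dimensional arrays, and shows that flatten() turns both into 1-D arrays that the constructor accepts; the array contents are invented.

```python
import numpy as np
import pandas as pd

# Assumed case 1: column vectors of shape (3, 1)
array1 = np.array([[1], [2], [3]])
array2 = np.array([[4], [5], [6]])
df = pd.DataFrame({
    "b": array1.flatten(),   # shape (3,)
    "a": array2.flatten(),
})

# Assumed case 2: 0-dimensional arrays holding a single number
scalar_a = np.array(2)
scalar_b = np.array(3)
df_single = pd.DataFrame({
    "A": scalar_a.flatten(),  # shape (1,), so it counts as one row
    "B": scalar_b.flatten(),
})
print(df)
print(df_single)
```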
import pandas as pd
a=2
b=3
data = {'A': a, 'B': b}  # avoid naming the variable dict, which shadows the built-in
pd.DataFrame(pd.Series(data)).T
# .T transposes the DataFrame, turning the rows A and B into columns
Result:
A B
0 2 3
To make sense of this ValueError, you need to understand how DataFrame treats "scalar values".
To create a DataFrame from a dict without passing an index, at least one array-like value is needed.
IMO, an array carries its own index.
Therefore, if there is an array-like value, there is no need to specify an index.
E.g. the indices of the elements in ['a', 's', 'd', 'f'] are 0, 1, 2, 3 respectively.
df_array_like = pd.DataFrame({
'col' : 10086,
'col_2' : True,
'col_3' : "'at least one array'",
'col_4' : ['one array is arbitrary length', 'multi arrays should be the same length']})
print("df_array_like: \n", df_array_like)
Output:
df_array_like:
col col_2 col_3 col_4
0 10086 True 'at least one array' one array is arbitrary length
1 10086 True 'at least one array' multi arrays should be the same length
As the output shows, the index of the DataFrame is 0 and 1,
the same as the positions of the elements in the array ['one array is arbitrary length', 'multi arrays should be the same length'].
If you comment out 'col_4', it will raise
ValueError("If using all scalar values, you must pass an index")
because scalar values (integer, bool, and string) do not have an index.
Note that Index(...) must be called with a collection of some kind.
Since the index is used to locate the rows of the DataFrame,
the index should be an array, e.g.
df_scalar_value = pd.DataFrame({
'col' : 10086,
'col_2' : True,
'col_3' : "'at least one array'"
}, index = ['fst_row','snd_row','third_row'])
print("df_scalar_value: \n", df_scalar_value)
Output:
df_scalar_value:
col col_2 col_3
fst_row 10086 True 'at least one array'
snd_row 10086 True 'at least one array'
third_row 10086 True 'at least one array'
I'm a beginner, I'm learning python and English. 👀
I tried transpose() and it worked.
Downside: You create a new object.
testdict1 = {'key1':'val1','key2':'val2','key3':'val3','key4':'val4'}
df = pd.DataFrame.from_dict(data=testdict1,orient='index')
print(df)
print(f'ID for DataFrame before Transpose: {id(df)}\n')
df = df.transpose()
print(df)
print(f'ID for DataFrame after Transpose: {id(df)}')
Output
0
key1 val1
key2 val2
key3 val3
key4 val4
ID for DataFrame before Transpose: 1932797100424
key1 key2 key3 key4
0 val1 val2 val3 val4
ID for DataFrame after Transpose: 1932797125448
With from_records, the input does not have to be a list of records; it can be a single dictionary as well:
pd.DataFrame.from_records({'a':1,'b':2}, index=[0])
a b
0 1 2
Which seems to be equivalent to:
pd.DataFrame({'a':1,'b':2}, index=[0])
a b
0 1 2
This is because a DataFrame has two intuitive dimensions - the columns and the rows.
You are only specifying the columns using the dictionary keys.
If you only want to specify one dimensional data, use a Series!
If you intend to convert a dictionary of scalars, you have to include an index:
import pandas as pd
alphabets = {'A': 'a', 'B': 'b'}
index = [0]
alphabets_df = pd.DataFrame(alphabets, index=index)
print(alphabets_df)
Although an index is not required for a dictionary of lists, you can pass one there too:
planets = {'planet': ['earth', 'mars', 'jupiter'], 'length_of_day': ['1', '1.03', '0.414']}
index = [0, 1, 2]
planets_df = pd.DataFrame(planets, index=index)
print(planets_df)
Of course, for the dictionary of lists, you can build the dataframe without an index:
planets_df = pd.DataFrame(planets)
print(planets_df)
Change your 'a' and 'b' values to a list, as follows:
a = [2]
b = [3]
then execute the same code as follows:
df2 = pd.DataFrame({'A':a,'B':b})
df2
and you'll get:
A B
0 2 3
The simplest option is:
import numpy as np
data = {'A':a,'B':b}
df = pd.DataFrame(data, index = np.arange(1) )
Another option is to convert the scalars into lists on the fly using a dictionary comprehension:
df = pd.DataFrame(data={k: [v] for k, v in mydict.items()})
The expression {...} creates a new dict whose values are one-element lists, such as:
In [20]: mydict
Out[20]: {'a': 1, 'b': 2}
In [21]: mydict2 = { k: [v] for k, v in mydict.items()}
In [22]: mydict2
Out[22]: {'a': [1], 'b': [2]}
Convert Dictionary to Data Frame
col_dict_df = pd.Series(col_dict).to_frame('new_col').reset_index()
Give new name to Column
col_dict_df.columns = ['col1', 'col2']
If you have a dictionary you can turn it into a pandas data frame with the following line of code:
pd.DataFrame({"key": list(d.keys()), "value": list(d.values())})
Just pass the dict in a list:
a = 2
b = 3
df2 = pd.DataFrame([{'A':a,'B':b}])
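Several of the one-row approaches suggested in this thread can be put side by side to check that they build the same frame. This is only a comparison sketch; the variable names are chosen for illustration, and a and b are the scalars from the question.

```python
import pandas as pd

a, b = 2, 3
scalars = {"A": a, "B": b}

# Each line is one of the approaches from the answers above.
from_lists = pd.DataFrame({"A": [a], "B": [b]})
with_index = pd.DataFrame(scalars, index=[0])
from_record_list = pd.DataFrame([scalars])
from_records = pd.DataFrame.from_records([scalars])
from_comprehension = pd.DataFrame({k: [v] for k, v in scalars.items()})

candidates = [from_lists, with_index, from_record_list,
              from_records, from_comprehension]
for frame in candidates:
    # Print the content of each frame as plain Python data for comparison.
    print(frame.to_dict("list"), list(frame.index))
```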