I have this situation and cannot find a way with Pandas to get the result I want.
I have a df with only one column, where each "MSG ..." row is followed by several "Y-..." values (see the sample data in the answer below).
And I want to transpose it so that each "MSG" entry starts a new row, with its "Y-..." values spread across the columns.
I already tried transpose() but am not getting the result I want.
And is there an easy way to put each value in a specific column?
For example: Y-1 in a column named Y-1, Y-3 in a column named Y-3. And if there is no Y-2 value, leave it blank in that column.
You might be able to get away with dropping down to numpy to simply reshape. However, if there is a variable number of entries for each row, you can use a pivot with custom indices:
import pandas as pd
df = pd.DataFrame({"MSG": ["MSG XXX", "Y-1", "Y-2", "Y-3", "Y-5", "Y-7", "Y-19", "MSG XYZ", "Y-1", "Y-3", "Y-11", "Y-12", "Y-17", "Y-19"]})
# each "MSG ..." row starts a new group; cumsum turns those start flags into group ids
groups = df["MSG"].str.startswith("MSG").cumsum()
out = (
df
.assign(index=groups, columns=df.groupby(groups).cumcount())
.pivot(index="index", columns="columns", values="MSG")
)
out:
columns 0 1 2 3 4 5 6
index
1 MSG XXX Y-1 Y-2 Y-3 Y-5 Y-7 Y-19
2 MSG XYZ Y-1 Y-3 Y-11 Y-12 Y-17 Y-19
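For the follow-up question of putting each value into a column named after it, a sketch building on the code above: reuse the groups ids, but pivot on the Y-label itself, so that Y-1 lands in a column named Y-1 and a missing Y-2 simply comes out as NaN.
# use each value as its own column name; "MSG ..." rows go into a "MSG" column
labels = df["MSG"].where(~df["MSG"].str.startswith("MSG"), "MSG")
named = (
    df
    .assign(index=groups, columns=labels)
    .pivot(index="index", columns="columns", values="MSG")
)
And if every message block really has the same number of rows (here 7), the numpy reshape mentioned above is simply pd.DataFrame(df["MSG"].to_numpy().reshape(-1, 7)).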
I have two data frames:
import pandas as pd
import numpy as np
sgRNA = pd.Series(["ABL1_sgABL1_130854834","ABL1_sgABL1_130862824","ABL1_sgABL1_130872883","ABL1_sgABL1_130884018"])
sequence = pd.Series(["CTTAGGCTATAATCACAATG","GGTTCATCATCATTCAACGG","TCAGTGATGATATAGAACGG","TTGCTCCCTCGAAAAGAGCG"])
df1 = pd.DataFrame(sgRNA, columns=["sgRNA"])
df1["sequence"] = sequence
df2 = pd.DataFrame(columns=["column"], index=np.arange(len(df1) * 2))
I want to add values from both columns from df1 to df2 every other row, like this:
ABL1_sgABL1_130854834
CTTAGGCTATAATCACAATG
ABL1_sgABL1_130862824
GGTTCATCATCATTCAACGG
ABL1_sgABL1_130872883
TCAGTGATGATATAGAACGG
ABL1_sgABL1_130884018
TTGCTCCCTCGAAAAGAGCG
To do this for df1["sgRNA"] I used this code:
df2.iloc[0::2, :]=df1["sgRNA"]
But I get this error:
ValueError: could not broadcast input array from shape (4,) into shape (4,1).
What am I doing wrong?
I think you're looking for DataFrame.stack():
df2["column"] = df1.stack().reset_index(drop=True)
print(df2)
Prints:
column
0 ABL1_sgABL1_130854834
1 CTTAGGCTATAATCACAATG
2 ABL1_sgABL1_130862824
3 GGTTCATCATCATTCAACGG
4 ABL1_sgABL1_130872883
5 TCAGTGATGATATAGAACGG
6 ABL1_sgABL1_130884018
7 TTGCTCCCTCGAAAAGAGCG
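As a brief note on why this works: stack() moves each row's column values into consecutive rows of the result ('sgRNA' first, then 'sequence', row by row), which is exactly the interleaved order wanted; reset_index(drop=True) then flattens the resulting MultiIndex into a plain 0..7 range.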
Besides Andrej Kesely's superior solution, to answer the question of what went wrong in the code, it's really minor:
df1["sgRNA"] is a series, one-dimensional, while df2.iloc[0::2, :] is
a dataframe, two-dimensional.
The solution would be to make the "df2" part one-dimensional by selecting the
one and only column, instead of selecting a slice of "all one columns", so to
say:
df2.iloc[0::2, 0] = df1["sgRNA"]
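One caveat worth hedging: because df1 (index 0..3) and df2 (index 0..7) have different indices, assigning the raw Series can still trigger label alignment and leave NaNs in the unmatched rows; appending .values sidesteps that. A minimal sketch filling both halves:
# even rows get the names, odd rows the sequences; .values avoids
# aligning df1's 0..3 index against df2's 0..7 index
df2.iloc[0::2, 0] = df1["sgRNA"].values
df2.iloc[1::2, 0] = df1["sequence"].values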
I am trying to complete missing information in some rows of a column in a dataframe, using another dataframe. In the first df (dfPivote) I have two columns of interest: 'Entrega', and 'Transportador', which is the one with missing information. In the second df (dfTransportadoEntregadoFaltante) I have two columns of interest: 'EntregaBusqueda', which is the key into my other df, and 'Transportador', which holds the information missing from the other df. I have the following code, and it is not working. How could I solve this problem?
I would recommend using dataframe operations to fill in missing values. If I've followed your example code correctly, I think you're trying to do something like this:
import pandas as pd
import numpy as np
# Create fake data
# "dfPivote" dataframe with an empty string in the "Transportador" column:
dfPivote = pd.DataFrame({'Entrega':[1,2,3],'Transportador':['a','','c']})
# "dfTransportadoEntregadoFaltante" lookup dataframe
dfTransportadoEntregadoFaltante = pd.DataFrame({'EntregaBusqueda':[1,2,3], 'Transportador':['a','b','c']})
# 1. Replace empty strings in dfPivote['Transportador'] with np.nan values:
dfPivote['Transportador'] = dfPivote['Transportador'].apply(lambda x: np.nan if len(x)==0 else x)
# 2. Merge the two dataframes together on the "Entrega" and "EntregaBusqueda" columns respectively:
df = dfPivote.merge(dfTransportadoEntregadoFaltante, left_on='Entrega', right_on='EntregaBusqueda', how='left')
# Entrega Transportador_x EntregaBusqueda Transportador_y
# 1 a 1 a
# 2 NaN 2 b
# 3 c 3 c
# 3. Fill NaNs in "Transportador_x" column with corresponding values in "Transportador_y" column:
df['Transportador_x'] = df['Transportador_x'].fillna(df['Transportador_y'])
# Entrega Transportador_x EntregaBusqueda Transportador_y
# 1 a 1 a
# 2 b 2 b
# 3 c 3 c
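To tidy up afterwards (an optional step beyond the original three), the lookup columns can be dropped and the filled column renamed back:
# 4. Drop the helper columns and restore the original column name:
df = df.drop(columns=['EntregaBusqueda', 'Transportador_y']).rename(columns={'Transportador_x': 'Transportador'})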
It would be great to understand how this actually works. Perhaps there is something in Python/Pandas that I don't quite understand.
I have a dataframe (price data) and would like to calculate the returns. Rows are the stocks while columns are the dates.
For simplicity, I have created the prices with some random numbers.
import pandas as pd
import numpy as np
df_price = pd.DataFrame(np.random.rand(10,10))
df_ret = df_price.iloc[:,1:]/df_price.iloc[:,:-1]-1
There are two things I find strange here:
My numerator and denominator are both 10 x 9. Why is the output 10 x 10, with the first column being NaNs?
Why are the results all 0 apart from the NaNs? I.e., why wasn't the calculation actually performed?
Thanks.
When we do the division, pandas first matches the index and columns of df_price.iloc[:,1:] and df_price.iloc[:,:-1]. Adding .values to one side strips its index and columns so that no matching happens, and then the output is what we expected:
df_ret = df_price.iloc[:,1:]/df_price.iloc[:,:-1].values-1
Example
s=pd.Series([2,4,6])
s.iloc[1:]/s.iloc[:-1]
Out[54]:
0    NaN # index 0 exists only in s.iloc[:-1]
1    1.0
2    NaN # index 2 exists only in s.iloc[1:]
dtype: float64
From the above we can say that pandas objects match on the index first, and the match behaves like an outer join.
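As an aside, pandas also ships a built-in that does this exact return calculation and handles the alignment for you:
# pct_change along axis=1 divides each column by the previous one and
# subtracts 1; the first column is NaN because it has no predecessor
df_ret = df_price.pct_change(axis=1)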
I would like to analyse and transform the following DataFrame:
import random
import string
import numpy as np
import pandas as pd
# generate example dataframe
df=pd.DataFrame()
df['Name']=[str(x) for x in np.random.choice(['a','b','c'],10)]
df['Cat1']=[str(x) for x in np.random.choice(['x',''],10)]
df['Cat2']=[str(x) for x in np.random.choice(['x',''],10)]
df['Cat3']=[str(x) for x in np.random.choice(['x',''],10)]
df.head(10)
This produces a DataFrame like this:
[image: Sample DataFrame]
The task is to count the 'x' entries in columns Cat1, Cat2, Cat3 for each unique entry in column 'Name'. This can be achieved with help of the groupby() function:
grouped = df.groupby(['Name'])
dfg = grouped[['Cat1','Cat2','Cat3']].sum()
dfg
[image: Result of analysis]
And the result is almost what I wanted. Now, I needed to replace the 'x' strings by a number, e.g., 'xxxx' by 4, 'x' by 1, and so forth. The solution uses a loop over all columns:
for col in range(0, len(dfg.columns)):
    dfg[dfg.columns[col]] = list(map(lambda x: len(x), dfg[dfg.columns[col]]))
dfg
[image: Final result]
Now, I wonder how I can avoid that loop and achieve the same final result?
Thanks a lot for sharing your ideas and guidance.
Try:
df.set_index('Name').eq('x')\
  .groupby('Name')[['Cat1','Cat2','Cat3']].sum()\
  .astype(int).reset_index()
Output:
Name Cat1 Cat2 Cat3
0 a 5 3 4
1 b 1 1 0
2 c 1 1 1
Depending on your source of data, this could easily be solved by replacing the "x" with a 1 and setting the empty cells to 0. You would also have to change the datatype of the columns to integer.
Calling sum() on your groups then already gives you the numeric answer, as sketched below.
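A minimal sketch of that idea, assuming the df from the question above:
cats = ['Cat1', 'Cat2', 'Cat3']
# replace 'x' with 1 and the empty string with 0, then cast to integer
df[cats] = df[cats].replace({'x': 1, '': 0}).astype(int)
# a plain groupby sum now yields the counts per Name directly
dfg = df.groupby('Name')[cats].sum()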
I'm not used to pandas at all, hence the several questions about my problem.
I have a function computing a list called solutions. This list can either be made of tuples of 3 values (a, b, c) or be empty.
solutions = [(a,b,c), (d,e,f), (g,h,i)]
To save it, I first turn it into a numpy array, and then I save it with pandas after naming the columns.
solutions = np.asarray(solutions)
df = pd.DataFrame(solutions)
df.columns = ["Name1", "Name2", "Name3"]
df.to_pickle(path)
My issue is that I sometimes have an empty solutions list: solutions = []. In that case, the df.columns line raises an error. To bypass it, I currently check the size of solutions, and if it is empty, I do:
with open(path, "wb") as f:
    pickle.dump([], f)
I would like to be more consistent with my data types, and to save the SAME format in both scenarios.
=> If the list is empty, I would like to save the 3 column names with an empty data frame. The ultimate goal is to reopen the file with pd.read_pickle() and to access the data in it easily.
Second issue: I would like to reopen the pickled files and add a column. Could you show me the right way to do so?
And third question: how can I select a part of the dataframe? For instance, I want all rows in which the column Name1 value % 0.25 == 0.
Thanks
Create your dataframe using:
df = pd.DataFrame(data=solutions, columns=['name1', 'name2', 'name3'])
If solutions is empty, it will nevertheless create a dataframe with 3 columns and 0 rows.
In [2]: pd.DataFrame(data=[(1,2,3), (4,5,6)], columns=['a','b','c'])
Out[2]:
a b c
0 1 2 3
1 4 5 6
In [3]: pd.DataFrame(data=[], columns=['a','b','c'])
Out[3]:
Empty DataFrame
Columns: [a, b, c]
Index: []
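For your second question (reopening a pickled file and adding a column), which the snippet above doesn't cover, a minimal sketch; the column name 'name4' and its values are illustrative placeholders:
df = pd.read_pickle(path)         # reopen the pickled dataframe
df['name4'] = df['name1'] * 2     # hypothetical new column; any values work
df.to_pickle(path)                # save it back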
For your third question:
df["Name1"] % 0.25 == 0
computes a series of booleans which are true where the value in the first column can be divided by 0.25. You can use it to select the rows of your dataframe:
df[ df["Name1"] % 0.25 == 0 ]
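One caveat worth adding: floating-point remainders are inexact, so a value that is mathematically divisible by 0.25 can yield a remainder just below 0.25 instead of exactly 0. A tolerance-based check is safer; a minimal sketch, assuming numpy is imported as np:
rem = df["Name1"] % 0.25
# accept remainders within floating-point noise of either 0 or 0.25
df[np.isclose(rem, 0) | np.isclose(rem, 0.25)]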