How to create 1 row dataframe from a dataset in pandas - python

I have a .csv file with many rows and columns. For analysis purposes, I want to select a row number from the dataset and pass it as a dataframe in pandas.
Instead of writing the column names and input values inside a dict, how can I make it faster?
Right now I have:
df = pd.read_csv('filename.csv')
df2 = pd.DataFrame({'var1': [5], 'var2': [10], 'var3': [15]})
var1, var2 and var3 are columns of df. I want to make a separate dataframe from df's data.
You can either select a random row, or a given row number.
Thank you for your help.

df2 = df.iloc[rownum:rownum + 1, :]

If you want to extract data from an existing dataframe into a new one, you can select the required rows by position:
df2 = df.iloc[4:5,:]
or select rows that satisfy a condition:
df3 = df[df['var1'] < 10]
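The question also asks about picking a random row. The following is a minimal sketch of both options, using a small made-up frame in place of the CSV file (the data values and `rownum` are assumptions for illustration). Passing a list to `iloc` keeps the result a DataFrame rather than a Series, and `sample(n=1)` draws one row at random.

```python
import pandas as pd

# Hypothetical stand-in for pd.read_csv('filename.csv')
df = pd.DataFrame({"var1": [5, 6, 7], "var2": [10, 11, 12], "var3": [15, 16, 17]})

rownum = 1  # zero-based position of the row to extract

# A list of positions returns a one-row DataFrame (a bare integer would return a Series)
by_position = df.iloc[[rownum]]

# One random row; random_state is only set here to make the draw repeatable
random_row = df.sample(n=1, random_state=0)
```

Both results keep the original column names and the original index label of the selected row.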

Related

Best way to move an unexpected column in a Pandas DF to a new DF?

Wondering what the best way to tackle this issue is. If I have a DF with the following columns
df1()
type_of_fruit name_of_fruit price
..... ..... .....
and a list called
expected_cols = ['name_of_fruit','price']
what's the best way to automate the check of df1 against the expected_cols list? I was trying something like
df_cols=df1.columns.values.tolist()
if df_cols != expected_cols:
And then try to drop to another df any columns not in expected_cols, but this doesn't seem like a great idea to me. Is there a way to save the "dropped" columns?
df2 = df1.drop(columns=expected_cols)
But then this seems problematic depending on column ordering, and also in cases where there could be either more columns than expected or fewer columns than expected. In cases where there are fewer columns than expected (i.e. df1 only contains the column name_of_fruit) I'm planning on using
df1.reindex(columns=expected_cols)
But I'm a bit iffy on how to do this programmatically, and on how to handle the case where there are more columns than expected.
You can use a set difference:
Assuming df1 has these columns:
df1_cols = df1.columns # ['type_of_fruit', 'name_of_fruit', 'price']
expected_cols = ['name_of_fruit','price']
unwanted_cols = list(set(df1_cols) - set(expected_cols))
df2 = df1[unwanted_cols] # keep the unexpected columns in a separate DataFrame
df1.drop(columns=unwanted_cols, inplace=True) # then remove them from df1
Use groupby along the columns axis to split the DataFrame succinctly. In this case, check whether the columns are in your list to form the grouper, and you can store the results in a dict where the True key gets the DataFrame with the subset of columns in the list and the False key has the subset of columns not in the list.
Sample Data
import pandas as pd
df = pd.DataFrame(data = [[1,2,3]],
columns=['type_of_fruit', 'name_of_fruit', 'price'])
expected_cols = ['name_of_fruit','price']
Code
d = dict(tuple(df.groupby(df.columns.isin(expected_cols), axis=1)))
# If you need to ensure columns are always there then do
#d[True] = d[True].reindex(expected_cols)
d[True]
# name_of_fruit price
#0 2 3
d[False]
# type_of_fruit
#0 1
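As far as I know, `groupby(..., axis=1)` is deprecated in recent pandas releases (2.1 onward), so here is a hedged sketch of the same split using a plain boolean mask over the columns. The names `kept` and `extra` are hypothetical; the sample data is the same as above.

```python
import pandas as pd

df = pd.DataFrame(data=[[1, 2, 3]],
                  columns=['type_of_fruit', 'name_of_fruit', 'price'])
expected_cols = ['name_of_fruit', 'price']

# True for every column that appears in expected_cols
mask = df.columns.isin(expected_cols)

# Expected columns; reindex adds any missing expected column (filled with NaN)
# and puts the columns in the order given by expected_cols
kept = df.loc[:, mask].reindex(columns=expected_cols)

# Everything that was not expected, saved instead of dropped
extra = df.loc[:, ~mask]
```

This handles both cases from the question: extra columns end up in `extra`, and missing expected columns appear in `kept` as NaN columns.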

Subtract dataframes with completely different row names and column names

My dataframe 1 looks like this:
windcodes   name            yield    perp
163197.SH   shangguo comp   2.9248   NO
154563.SH   guosheng comp   2.886    Yes
789645.IB   guoyou comp     3.418    NO
My dataframe 2 looks like this:
windcodes    CALC
1202203.IB   2.5517
1202203.IB   2.48457
1202203.IB   2.62296
I want my result, dataframe 3, to have one more column than dataframe 1. The new column is the value in column 'yield' of dataframe 1 minus the value in column 'CALC' of dataframe 2.
The result dataframe 3 should look like this:
windcodes   name            yield    perp   yield-CALC
163197.SH   shangguo comp   2.9248   NO     0.3731
154563.SH   guosheng comp   2.886    Yes    0.40143
789645.IB   guoyou comp     3.418    NO     0.79504
It would be really helpful if anyone can tell me how to do it in python.
In case the two dataframes have completely different indexes, use df2's underlying numpy array so the subtraction is done by position:
df1['yield-CALC'] = df1['yield'] - df2['CALC'].values
You can try something like this (it relies on both dataframes having the same index):
df1['yield-CALC'] = df1['yield'] - df2['CALC']
I'm assuming you don't want to join the dataframes, since the windcodes are not the same.
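The difference between the two answers above comes down to index alignment. The sketch below uses made-up index labels (an assumption, purely to force a mismatch) to show that subtracting two Series aligns them by label, while subtracting a NumPy array works by position.

```python
import pandas as pd

# Hypothetical, deliberately different index labels
df1 = pd.DataFrame({"yield": [2.9248, 2.886, 3.418]}, index=["a", "b", "c"])
df2 = pd.DataFrame({"CALC": [2.5517, 2.48457, 2.62296]}, index=["x", "y", "z"])

# Series minus Series aligns on labels: no label matches, so every result is NaN
aligned = df1["yield"] - df2["CALC"]

# Series minus array ignores labels and subtracts row by row
positional = df1["yield"] - df2["CALC"].to_numpy()
```

With the default integer index on both frames, as in the question, the two forms give the same result. The positional form requires both columns to have the same length.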
Do we need to join 2 dataframes from windcodes column? The windcodes are all the same in the sample data you have given in Dataframe2. Can you explain this?
If we are going to join on the windcodes column, the code below will work.
df = pd.merge(left=df1, right=df2,how='inner',on='windcodes')
df['yield-CALC'] = df['yield']-df['CALC']
I will try to be as detailed as possible.
The environment I used for coding is Jupyter Notebook.
Import the required pandas library:
import pandas as pd
getting your first table data in form of lists of lists (you can also use csv,excel etc here)
data_1 = [["163197.SH","shangguo comp",2.9248,"NO"],\
["154563.SH","guosheng comp",2.886,"Yes"] , ["789645.IB","guoyou comp",3.418,"NO"]]
creating dataframe one :
df_1 = pd.DataFrame(data_1 , columns = ["windcodes","name","yield","perp"])
df_1
getting your second table data in form of lists of lists (you can also use csv,excel etc here)
data_2 = [["1202203.IB",2.5517],["1202203.IB",2.48457],["1202203.IB",2.62296]]
creating dataframe two :
df_2 = pd.DataFrame(data_2 , columns = ["windcodes","CALC"])
df_2
Now creating the third dataframe:
df_3 = df_1.copy() # because the first 4 columns are the same as in our first dataframe; copy() keeps df_1 unchanged
df_3
Now calculating the fourth column i.e "yield-CALC" :
df_3["yield-CALC"] = df_1["yield"] - df_2["CALC"] # vectorized subtraction aligned on the index: each df_2 value is subtracted from the df_1 value in the same row
df_3

Creating a function which creates a new column based on values in two columns?

I have data frame like -
ID Min_Value Max_Value
1 0 0.10562
2 0.10563 0.50641
3 0.50642 1.0
I have another data frame that contains Value as a column. I want to create a new column in the second data frame that returns the ID whose Min_Value and Max_Value enclose the Value, as defined in the data frame above. I could use if-else conditions, but the number of IDs is large and the code becomes too bulky. Is there an efficient way to do this?
If I understand correctly, you can join the two DataFrames into one and then use the "between" method to build a mask that selects the matching rows of the second DataFrame.
import pandas as pd
data = {"Min_Value": [0, 0.10563, 0.50642],
"Max_Value": [0.10562, 0.50641, 1.0]}
df = pd.DataFrame(data,
index=[1, 2, 3])
df2 = pd.DataFrame({"Value": [0, 0.1, 0.58]}, index=[1,2,3])
df = df.join(df2)
mask_between_values = df['Value'].between(df['Min_Value'], df['Max_Value'], inclusive="both")
# This is the result
df2[mask_between_values]
#    Value
# 1   0.00
# 3   0.58
Suppose you have two dataframes, df and new_df, and you want to add a column 'new_column' to new_df. It should hold the ID from df whenever 'Value' lies between 'Min_Value' and 'Max_Value'. The loop below compares row i of new_df with row i of df, so it assumes both dataframes share the same default integer index.
for i in range(0, len(df)):
    if df.loc[i, 'Max_Value'] > new_df.loc[i, 'Value'] and df.loc[i, 'Min_Value'] < new_df.loc[i, 'Value']:
        new_df.loc[i, 'new_column'] = df.loc[i, 'ID']
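Both answers above compare rows pairwise, whereas the question asks for a lookup of each Value against all ranges. One possible vectorized sketch uses `pd.IntervalIndex`, assuming the ranges do not overlap and both endpoints are inclusive. The frame names `ranges` and `values` and the sample values are hypothetical.

```python
import numpy as np
import pandas as pd

# Lookup table from the question
ranges = pd.DataFrame({"ID": [1, 2, 3],
                       "Min_Value": [0.0, 0.10563, 0.50642],
                       "Max_Value": [0.10562, 0.50641, 1.0]})

# Hypothetical second frame; 1.5 falls outside every range
values = pd.DataFrame({"Value": [0.05, 0.3, 0.9, 1.5]})

# One closed interval per ID
intervals = pd.IntervalIndex.from_arrays(ranges["Min_Value"],
                                         ranges["Max_Value"],
                                         closed="both")

# Position of the interval containing each value, or -1 if there is none
pos = intervals.get_indexer(values["Value"].to_numpy())

# Map positions to IDs; values without a matching range become NaN
values["ID"] = np.where(pos >= 0, ranges["ID"].to_numpy()[pos], np.nan)
```

Because unmatched values are stored as NaN, the new column has a float dtype. Values that fall in the small gaps between ranges (for example between 0.10562 and 0.10563) are also left unmatched.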

Panda append and concat reorders the dataframe?

I created an empty dataframe (df1) with the python pandas package, only containing the columns: [var1, var2, var3]
I also have another dataframe (df2) which looks like this:
columns: [var 2, var1, var3]
values: [1, 2, 3]
When I append df2 to df1 the orders of the columns in the dataframe change. I tried to reorder the dataframe with the old list of columns with sort_values and sort, but it didn't work. Does anyone know how I can solve it? I am using python version 2.7
If I'm understanding this correctly, append is not a problem. It is only about column order. To change the order of columns in a Dataframe, simply slice with column names.
import pandas as pds
df1 = pds.DataFrame(columns=['var1', 'var2', 'var3']) # desired order
# generate a disordered df somehow
df_disordered = pds.DataFrame(columns=['var2', 'var1', 'var3'])
df_adjusted = df_disordered[df1.columns] # slice with column names
# or explicitly df_disordered[['var1', 'var2', 'var3']]
# now df_adjusted has the same column order as df1
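The reordering step above can be combined with the concatenation itself. This is a small sketch using `pd.concat`, with `df1` given one made-up row (an assumption, so that neither frame is empty) and the values from the question in `df2`.

```python
import pandas as pd

# Target frame with the desired column order (one hypothetical existing row)
df1 = pd.DataFrame({"var1": [10], "var2": [20], "var3": [30]})

# Frame whose columns arrive in a different order
df2 = pd.DataFrame({"var2": [1], "var1": [2], "var3": [3]})

# Concatenate, then select columns by name to enforce df1's order explicitly
combined = pd.concat([df1, df2], ignore_index=True)[df1.columns]
```

Selecting with `df1.columns` after the concatenation makes the result independent of whatever column order the concatenation itself produces.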

Python Pandas Timeseries Sum Daily Column Data

I'm stuck trying to figure out how to sum one of the columns in my dataframe based on day/month/year etc. I don't want to perform the aggregation on the other columns. As the dataframe will become shorter, I would like to use the minimum value from the other columns of the dataframe.
This is what I have, but it does not produce what I want. It only sums the first and last part and then gives me NaN values for the rest.
df = pd.DataFrame(zip(points, data, junk), columns=['Dates', 'Data', 'Junk'])
df.set_index('Dates', inplace=True)
_add = {'Data': np.sum, 'Junk': np.min}
newdf = df.resample('D', how=_add)
Thanks
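A possible approach, sketched under assumptions: in current pandas the `how` argument of `resample` no longer exists, and per-column aggregations are passed to `.agg` instead. The dates and values below are made up, since the original `points`, `data` and `junk` are not shown.

```python
import pandas as pd

# Hypothetical data: two observations per day over three days
dates = pd.to_datetime(["2024-01-01 00:00", "2024-01-01 12:00",
                        "2024-01-02 00:00", "2024-01-02 12:00",
                        "2024-01-03 00:00", "2024-01-03 12:00"])
df = pd.DataFrame({"Data": [1, 2, 3, 4, 5, 6],
                   "Junk": [9, 8, 7, 6, 5, 4]},
                  index=dates)

# Sum 'Data' and take the minimum of 'Junk' within each calendar day
newdf = df.resample("D").agg({"Data": "sum", "Junk": "min"})
```

The index must be a DatetimeIndex for `resample` to work. If the dates are strings, converting them with `pd.to_datetime` before `set_index` may be what is missing.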
