How can I stack pandas dataframes with different column names vertically - python

I have 2 dataframes that look like this:
Index1 Games1
1 1
2 5
3 10
Index2 Games2
4 2
5 4
6 6
How can I combine them to make it like this:
Index Games
1 1
2 5
3 10
4 2
5 4
6 6
Thank you!

Try this:
import pandas as pd
import numpy as np

# Assuming your dataframes are named df1 and df2
new_frame = pd.DataFrame(np.vstack((df1.values, df2.values)),
                         columns=['Index', 'Games'])
print(new_frame)
This creates a new dataframe from numpy's vstack operation, which concatenates the two value arrays row-wise and preserves their row order. Passing columns= restores meaningful column names, since vstack itself discards them.
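A pandas-only alternative (a minimal sketch, assuming the column names are exactly Index1/Games1 and Index2/Games2 as in the example) is to rename both frames to a shared set of column names and stack them with pandas.concat:
import pandas as pd

# Give both frames the same column names, then stack them row-wise.
renamed1 = df1.rename(columns={'Index1': 'Index', 'Games1': 'Games'})
renamed2 = df2.rename(columns={'Index2': 'Index', 'Games2': 'Games'})
combined = pd.concat([renamed1, renamed2], ignore_index=True)
print(combined)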

Related

Why do we need to add : when defining a new column using .iloc function

When we make a new column in a dataset in pandas
df["Max"] = df.iloc[:, 5:7].sum(axis=1)
If we are only getting the columns from index 5 to index 7, why do we need to pass : for all the columns?
pandas.DataFrame.iloc is used purely for integer-location based indexing, i.e. selection by position (see the pandas documentation). The : means all rows in the selected columns, here column indices 5 and 6 (iloc is not inclusive of the stop index).
You are using .iloc to take a slice out of the dataframe and apply an aggregate function across the columns of that slice.
Consider an example:
df = pd.DataFrame({"a":[0,1,2],"b":[2,3,4],"c":[4,5,6]})
df
would produce the following dataframe
a b c
0 0 2 4
1 1 3 5
2 2 4 6
You are using iloc to avoid dealing with named columns, so that
df.iloc[:,1:3]
would look as follows
b c
0 2 4
1 3 5
2 4 6
Now a slight modification of your code would get you a new column containing sums across columns
df.iloc[:,1:3].sum(axis=1)
0 6
1 8
2 10
Alternatively you could use function application:
df.apply(lambda x: x.iloc[1:3].sum(), axis=1)
0 6
1 8
2 10
Thus you explicitly tell pandas to apply the sum across columns. However, your original syntax is more succinct and preferable to explicit function application. The result is the same either way.
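Putting it together with the pattern from the question (a minimal sketch using the toy frame above, with positions 1:3 standing in for 5:7):
df["Max"] = df.iloc[:, 1:3].sum(axis=1)
print(df)
a b c Max
0 0 2 4 6
1 1 3 5 8
2 2 4 6 10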

Pandas Dataframe groupby with overlapping

I'm using a pandas dataframe to read a csv that has data points for machine learning. I'm trying to come up with a way that would allow me to index the dataframe so that it returns that index and the next N rows. I don't want to group the dataframe into bins with no overlap (i.e. index 0:4, 4:8, etc.). What I do want is a result like this: index 0:4, 1:5, 2:6, etc. How would this be done?
Maybe you can create a list of DataFrames, like:
import pandas as pd
import numpy as np

nrows = 7
group_size = 5
df = pd.DataFrame({'col1': np.random.randint(0, 10, nrows)})
print(df)
# Slide a window of group_size rows forward one row at a time.
grp = [df.iloc[x:x + group_size] for x in range(df.shape[0] - group_size + 1)]
print(grp[1])
Original DataFrame:
col1
0 2
1 6
2 6
3 5
4 3
5 3
6 8
2nd DataFrame from the list of DataFrames:
col1
1 6
2 6
3 5
4 3
5 3
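If you only need an aggregate per overlapping window rather than the windows themselves, pandas' built-in rolling window may be simpler (a sketch, assuming the same df and group_size as above):
# Mean of each overlapping 5-row window; the first group_size - 1 values are NaN.
print(df['col1'].rolling(group_size).mean())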

How to read in Pandas DataFrame while ignoring index and column labels?

A while back I made a DataFrame full of ints with strings for column and index labels and saved it as a .csv.
Something like this:
A B C
A 1 5 8
B 5 2 4
C 8 4 0
Now I am trying to read the csv and perform operations on it. In order to do that, I have to get rid of those labels. I have tried using drop but they don't go away. This is my code:
import pandas as pd
df = pd.read_csv(filepath_or_buffer='path',header=None,index_col=False)
print(df.head())
This is what comes out:
0 ... 12
0 NaN ... 10.1021/nn502895s
1 10.1063/1.4973245 ... 3.1641066942926606
2 10.3891/acta.chem.scand.26-0333 ... 3.8644527240688675
3 10.1063/1.463096 ... 2.9273855677735448
4 10.1146/annurev-physchem-040412-110130 ... 6.1534904155247325
How do I get rid of the labels (the strings)?
Thank you!
Use skiprows=1 so the csv header does not become the first row of the DataFrame, add index_col=[0] so the first column is parsed as the index, and then drop that index with DataFrame.reset_index(drop=True):
df = pd.read_csv('file.csv', header=None, skiprows=1, index_col=[0]).reset_index(drop=True)
print (df)
1 2 3
0 1 5 8
1 5 2 4
2 8 4 0
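If you only need the numbers and not a labelled DataFrame at all, another option (a sketch, assuming the labelled file shown above) is to read the file with its labels and then convert to a plain array:
df = pd.read_csv('file.csv', index_col=0)
values = df.to_numpy()  # 2-D numpy array of the values, no row or column labels
print(values)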

I want to pick out one column of the DataFrame but the result is automatically ordered by values

I just need one column of my dataframe, but in the original order. When I extract it, it is sorted by the values, and I can't understand why. I tried different ways to pick out one column, but every time it was sorted by the values.
this is my code:
import pandas
data = pandas.read_csv('/data.csv', sep=';')
longti = data.iloc[:,4]
To return the first column, your code should work:
import pandas as pd
df = pd.DataFrame(dict(A=[1,2,3,4,5,6], B=['A','B','C','D','E','F']))
df.iloc[:, 0]
Out:
0 1
1 2
2 3
3 4
4 5
5 6
If you want to return the second column, you can use the following:
df.iloc[:, 1]
Out:
0 A
1 B
2 C
3 D
4 E
5 F
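Note that df.iloc[:, n] preserves the original row order, so any sorting is coming from somewhere else in the pipeline. If you want the selection to stay a DataFrame rather than become a Series, pass a list of positions (a minimal sketch using the frame above):
second_col = df.iloc[:, [1]]  # still a DataFrame, rows in their original order
print(second_col)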

Using itertools.combinations with columns

I have a DataFrame df with 3 columns: A, B and C
A B C
2 4 4
5 2 5
6 9 5
My goal is to use itertools.combinations to find all non-repeating column pairs and to put the first column of each pair in one DataFrame and the second in the other. All pairs here would be A:B, A:C, B:C.
So the first dataframe df1 would have the first column of each of those pairs:
A A B
2 2 4
5 5 2
6 6 9
and the second df2:
B C C
4 4 4
2 5 5
9 5 5
I'm trying to do something with itertools like:
for cola, colb in itertools.combinations(df, 2):
    df1[cola] = cola
    df2[colb] = colb
I know that makes no sense, but I can change each column to a list, itertool a list of lists, append each to lists A and B, and then turn those lists back into a DataFrame, but then I'm missing the headers. I tried adding the headers to the lists, but when I remake the DataFrame the indexing seems off and I can't seem to fix it. So I'm just trying to see if there is a way to itertool entire columns together with their headers.
Use the zip function to separate the columns to be used in each DataFrame, and then use pandas.concat to construct your new DataFrames:
from itertools import combinations
import pandas as pd

# Split each (first, second) column pair into the columns for df1 and for df2.
df1_cols, df2_cols = zip(*combinations(df.columns, 2))
df1 = pd.concat([df[col] for col in df1_cols], axis=1)
df2 = pd.concat([df[col] for col in df2_cols], axis=1)
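For the sample data in the question this yields df1 with columns A, A, B and df2 with columns B, C, C. A quick check (a sketch, rebuilding the example frame shown above):
df = pd.DataFrame({'A': [2, 5, 6], 'B': [4, 2, 9], 'C': [4, 5, 5]})
df1_cols, df2_cols = zip(*combinations(df.columns, 2))
print(df1_cols)  # ('A', 'A', 'B')
print(df2_cols)  # ('B', 'C', 'C')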
