I have two CSV files (say, a and b) containing different datasets. The only column they have in common is id_no. I would like to create a final CSV file that combines the columns of both files, matched on id_no.
a looks like
id_no a1 a2 a3 a4
1 0.5 0.2 0.1 10.20
2 1.5 0.1 0.2 11.25
3 2.5 0.7 0.3 12.90
4 3.5 0.8 0.4 13.19
5 7.5 0.6 0.3 14.21
b looks like
id_no A1
6 10.1
8 2.5
4 12.5
2 20.5
1 2.51
I am looking for a final CSV file, say c, with the following output:
id_no a1 a2 a3 a4 A1
1 0.5 0.2 0.1 10.20 2.51
2 1.5 0.1 0.2 11.25 20.5
3 2.5 0.7 0.3 12.90 0
4 3.5 0.8 0.4 13.19 12.5
5 7.5 0.6 0.3 14.21 0
Use pandas.merge:
import pandas as pd
a = pd.read_csv("data1.csv")
b = pd.read_csv("data2.csv")
output = a.merge(b, on="id_no", how="left").fillna(0).set_index("id_no")
output.to_csv("output.csv")
>>> output
a1 a2 a3 a4 A1
id_no
1 0.5 0.2 0.1 10.20 2.51
2 1.5 0.1 0.2 11.25 20.50
3 2.5 0.7 0.3 12.90 0.00
4 3.5 0.8 0.4 13.19 12.50
5 7.5 0.6 0.3 14.21 0.00
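Note that how="left" keeps every row of a and fillna(0) fills the A1 values that have no match, which reproduces the desired output c exactly. If you only wanted the strictly matching rows instead, an inner join is the small variant to use:
matched = a.merge(b, on="id_no", how="inner").set_index("id_no")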
Using plain old Python:
from csv import reader, writer
from pathlib import Path

# read b (file2) into a dict keyed on id_no
with Path("file2.csv").open() as f:
    r = reader(f)
    b_header = next(r)
    data = {row[0]: row[1] for row in r}

rows = []
with Path("file1.csv").open() as f:
    r = reader(f)
    # output header: all of a's columns plus b's value column
    header = next(r) + [b_header[-1]]
    for i, *row in r:
        if i in data:
            rows.append([i] + row + [data[i]])
        else:
            rows.append([i] + row + [0])

with Path("output.csv").open("w", newline="") as f:
    w = writer(f)
    w.writerow(header)
    w.writerows(rows)
For example, df is the data with the features. I want to split it into train and test sets using indices that have been given in a file. How do I get the train/test DataFrames?
df=
0 2 0.3 0.5 0.5
1 4 0.5 0.7 0.4
2 2 0.5 0.1 0.4
3 4 0.4 0.1 0.3
4 2 0.3 0.1 0.5
The indices file is read with
train = pd.read_csv("data_train.txt")
This DataFrame contains only indices. How should I get the training data for those indices?
Contents of data_train.txt (the full dataset has 10000 rows; this file lists the training indices):
0
2
4
I want to use these indices to select the training data with its features, so that the final train set looks like this (note the index):
0 2 0.3 0.5 0.5
2 2 0.5 0.1 0.4
4 2 0.3 0.1 0.5
If you have a df as given by:
0 1 2 3 4
0 0 2 0.3 0.5 0.5
1 1 4 0.5 0.7 0.4
2 2 2 0.5 0.1 0.4
3 3 4 0.4 0.1 0.3
4 4 2 0.3 0.1 0.5
and another train_indices as given by:
0
0 0
1 2
2 4
then the way to get the corresponding rows of df depends on how the data is organised:
#if you're trying to match the index of the df itself
train_df = df.iloc[train_indices]
#if you're trying to match column 0, which might be important
#if it's not aligned to the index
train_df = df.loc[df[0].isin(train_indices)]
Both of these (in this case) return:
0 1 2 3 4
0 0 2 0.3 0.5 0.5
2 2 2 0.5 0.1 0.4
4 4 2 0.3 0.1 0.5
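For completeness, a minimal sketch of loading the indices file from the question into train_indices (assuming data_train.txt has one index per line and no header), after which either of the two selections above applies:
import pandas as pd

# assuming one index per line and no header row in data_train.txt
train_indices = pd.read_csv("data_train.txt", header=None)[0].tolist()
train_df = df.iloc[train_indices]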
I've searched for quite a while, but I haven't found any similar question. If there is one, please let me know!
I am currently trying to divide one dataframe into n dataframes, where n is equal to the number of columns of the original dataframe. Each of the resulting dataframes must keep the first column of the original dataframe. As an extra, it would be nice to gather them all in a list, for example, for further access.
To visualize my intention, here is a brief example:
>> original df
GeneID A B C D E
1 0.3 0.2 0.6 0.4 0.8
2 0.5 0.3 0.1 0.2 0.6
3 0.4 0.1 0.5 0.1 0.3
4 0.9 0.7 0.1 0.6 0.7
5 0.1 0.4 0.7 0.2 0.5
My desired output would be something like this:
>> df1
GeneID A
1 0.3
2 0.5
3 0.4
4 0.9
5 0.1
>> df2
GeneID B
1 0.2
2 0.3
3 0.1
4 0.7
5 0.4
....
And so on, until all the columns from the original dataframe be covered.
What would be the best solution?
You can use df.columns to get all column names and then create sub-dataframes:
outdflist = []
# for each column beyond the first:
for col in oridf.columns[1:]:
    # create a sub-dataframe with the desired columns:
    subdf = oridf[['GeneID', col]]
    # append it to the list of dataframes:
    outdflist.append(subdf)

# to view all dataframes created:
for df in outdflist:
    print(df)
Output:
GeneID A
0 1 0.3
1 2 0.5
2 3 0.4
3 4 0.9
4 5 0.1
GeneID B
0 1 0.2
1 2 0.3
2 3 0.1
3 4 0.7
4 5 0.4
GeneID C
0 1 0.6
1 2 0.1
2 3 0.5
3 4 0.1
4 5 0.7
GeneID D
0 1 0.4
1 2 0.2
2 3 0.1
3 4 0.6
4 5 0.2
GeneID E
0 1 0.8
1 2 0.6
2 3 0.3
3 4 0.7
4 5 0.5
Above for loop can also be written more simply as list comprehension:
outdflist = [ oridf[['GeneID', col]]
for col in oridf.columns[1:] ]
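If named access is more convenient than a positional list, the same comprehension can build a dict keyed by column name instead (a small variation, not part of the answer above):
outdfdict = {col: oridf[['GeneID', col]] for col in oridf.columns[1:]}
outdfdict['B']   # the GeneID/B sub-dataframe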
You can do this with groupby:
d={'df'+ str(x): y for x , y in df.groupby(level=0,axis=1)}
d
Out[989]:
{'dfA': A
0 0.3
1 0.5
2 0.4
3 0.9
4 0.1, 'dfB': B
0 0.2
1 0.3
2 0.1
3 0.7
4 0.4, 'dfC': C
0 0.6
1 0.1
2 0.5
3 0.1
4 0.7, 'dfD': D
0 0.4
1 0.2
2 0.1
3 0.6
4 0.2, 'dfE': E
0 0.8
1 0.6
2 0.3
3 0.7
4 0.5, 'dfGeneID': GeneID
0 1
1 2
2 3
3 4
4 5}
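Note that this also produces a 'dfGeneID' entry and the other pieces do not carry the GeneID column; if each piece should keep GeneID as in the question, a dict comprehension in the same spirit would be (a sketch, not part of the original answer):
d = {'df' + col: df[['GeneID', col]] for col in df.columns if col != 'GeneID'}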
You can create a list of column names and loop through it, creating a new DataFrame on each iteration.
>>> import pandas as pd
>>> d = {'col1':[1,2,3], 'col2':[3,4,5], 'col3':[6,7,8]}
>>> df = pd.DataFrame(data=d)
>>> df
col1 col2 col3
0 1 3 6
1 2 4 7
2 3 5 8
>>> newstuff=[]
>>> columns = list(df)
>>> for column in columns:
... newstuff.append(pd.DataFrame(data=df[column]))
Unless your dataframe is unreasonably large, the above code should do the job.
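Note that each element of newstuff holds a single column; to also keep the first column in every piece, as the question asks, selecting two columns per iteration is a small variation along the same lines:
newstuff = [df[[columns[0], col]] for col in columns[1:]]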
I have a pandas dataframe which looks like one long row.
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14
________________________________________________________________________________________________
2010 | 0.1 0.5 0.5 0.7 0.5 0.5 0.5 0.5 0.9 0.5 0.5 0.8 0.3 0.3 0.6
I would like to reshape it as:
0 1 2 3 4
____________________________________
|0| 0.1 0.5 0.5 0.7 0.5
2010 |1| 0.5 0.5 0.5 0.9 0.5
|2| 0.5 0.8 0.3 0.3 0.6
I can certainly do it using a loop, but I'm guessing (un)stack and/or pivot might be able to do the trick; I just couldn't figure out how...
Symmetry/filling in blanks (if the length of the data is not evenly divisible by the number of rows after unstacking) is not important for now.
EDIT:
I coded up the loop solution meanwhile:
df = my_data_frame
dk = pd.DataFrame()
break_after = 3
for i in range(len(df) // break_after):   # integer division for Python 3
    dl = pd.DataFrame(df[i*break_after:(i+1)*break_after]).T
    dl.columns = range(break_after)
    dk = pd.concat([dk, dl])
If there is only one index (2010), this will work fine.
import numpy as np
df1 = pd.DataFrame(np.reshape(df.values, (3, 5)))
df1['Index'] = '2010'
df1.set_index('Index', append=True, inplace=True)
df1 = df1.reorder_levels(['Index', None])
Output:
0 1 2 3 4
Index
2010 0 0.1 0.5 0.5 0.7 0.5
1 0.5 0.5 0.5 0.9 0.5
2 0.5 0.8 0.3 0.3 0.6
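A slightly more general variant of the same idea that takes the row label from the original frame instead of hard-coding '2010' (a sketch, assuming the row length is evenly divisible by the chosen number of columns):
import pandas as pd

ncols = 5
values = df.values.reshape(-1, ncols)
df1 = pd.DataFrame(values,
                   index=pd.MultiIndex.from_product([df.index, range(len(values))]))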
I have two pandas data frames, which look like this:
import pandas as pd
df_one = pd.DataFrame( {
'A': [1,1,2,3,4,4,4],
'B1': [0.5,0.0,0.2,0.1,0.3,0.2,0.1],
'B2': [0.2,0.3,0.1,0.5,0.3,0.1,0.2],
'B3': [0.1,0.2,0.0,0.9,0.0,0.3,0.5]} );
df_two = pd.DataFrame( {
'A': [1,2,3,4],
'C1': [1.0,9.0,2.1,9.0],
'C2': [2.0,3.0,0.7,1.1],
'C3': [5.0,4.0,2.3,3.4]} );
df_one
A B1 B2 B3
0 1 0.5 0.2 0.1
1 1 0.0 0.3 0.2
2 2 0.2 0.1 0.0
3 3 0.1 0.5 0.9
4 4 0.3 0.3 0.0
5 4 0.2 0.1 0.3
6 4 0.1 0.2 0.5
df_two
A C1 C2 C3
0 1 1.0 2.0 5.0
1 2 9.0 3.0 4.0
2 3 2.1 0.7 2.3
3 4 9.0 1.1 3.4
What I would like to do is compute a scalar product, multiplying the rows of the first data frame by the rows of the second data frame, i.e., \sum_i B_i * C_i, but in such a way that a row in the first data frame is multiplied by a row in the second data frame only if the values of the A column match in both frames. I know how to do it with loops and if's, but I would like to do it in a more efficient numpy-like or pandas-like way. Any help much appreciated :)
Not sure if you want unique values for column A (If you do, use groupby on the result below)
pd.merge(df_one, df_two, on='A')
A B1 B2 B3 C1 C2 C3
0 1 0.5 0.2 0.1 1.0 2.0 5.0
1 1 0.0 0.3 0.2 1.0 2.0 5.0
2 2 0.2 0.1 0.0 9.0 3.0 4.0
3 3 0.1 0.5 0.9 2.1 0.7 2.3
4 4 0.3 0.3 0.0 9.0 1.1 3.4
5 4 0.2 0.1 0.3 9.0 1.1 3.4
6 4 0.1 0.2 0.5 9.0 1.1 3.4
pd.merge(df_one, df_two, on='A').apply(lambda s: sum([s['B%d'%i] * s['C%d'%i] for i in range(1, 4)]) , axis=1)
0 1.40
1 1.60
2 2.10
3 2.63
4 3.03
5 2.93
6 2.82
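A fully vectorized variant of the same merge-based approach, which avoids the per-row apply (a sketch, assuming the B and C columns pair up as B1/C1, B2/C2, B3/C3):
merged = pd.merge(df_one, df_two, on='A')
b_cols = ['B1', 'B2', 'B3']
c_cols = ['C1', 'C2', 'C3']
# element-wise product of the paired columns, summed across each row
result = (merged[b_cols].values * merged[c_cols].values).sum(axis=1)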
Another approach would be something similar to this:
import pandas as pd
df_one = pd.DataFrame( {
'A': [1,1,2,3,4,4,4],
'B1': [0.5,0.0,0.2,0.1,0.3,0.2,0.1],
'B2': [0.2,0.3,0.1,0.5,0.3,0.1,0.2],
'B3': [0.1,0.2,0.0,0.9,0.0,0.3,0.5]} );
df_two = pd.DataFrame( {
'A': [1,2,3,4],
'C1': [1.0,9.0,2.1,9.0],
'C2': [2.0,3.0,0.7,1.1],
'C3': [5.0,4.0,2.3,3.4]} );
lookup = df_two.groupby(df_two.A)

def multiply_rows(row):
    other = lookup.get_group(row['A'])
    # We want every column after "A"
    x = row.values[1:]
    # "other" is a one-row DataFrame, so take its first row of values after "A"
    y = other.values[0, 1:]
    return x.dot(y)

# The "axis=1" makes each row be passed in, rather than each column
result = df_one.apply(multiply_rows, axis=1)
print(result)
This results in:
0 1.40
1 1.60
2 2.10
3 2.63
4 3.03
5 2.93
6 2.82
I would zip together the rows and use a filter or a comprehension that takes only the rows where A columns match.
Something like
[scalar_product(a, b) for a, b in zip(frame1, frame2) if a[0] == b[0]]
assuming that you're willing to fill in the appropriate material for scalar_product
(apologies if I've made a thinko here - this code is for example purposes only and has not been tested!)
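Note that zipping two DataFrames directly iterates over their column labels rather than their rows; a concrete sketch of the same idea over row values would be (assuming A is the first column in both frames, and accepting the quadratic pairwise comparison):
# dot product of the non-A values for every pair of rows whose A values match
result = [a[1:].dot(b[1:])
          for a in df_one.values
          for b in df_two.values
          if a[0] == b[0]]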