Finding common values between two CSV files - python

I have two CSV files (say, a and b) that contain different datasets. The only column they have in common is id_no. I would like to create a final CSV file that contains the data from both CSV files where the id_no values match.
a looks like
id_no a1 a2 a3 a4
1 0.5 0.2 0.1 10.20
2 1.5 0.1 0.2 11.25
3 2.5 0.7 0.3 12.90
4 3.5 0.8 0.4 13.19
5 7.5 0.6 0.3 14.21
b looks like
id_no A1
6 10.1
8 2.5
4 12.5
2 20.5
1 2.51
I am looking for a final CSV file, say c, that shows the following output:
id_no a1 a2 a3 a4 A1
1 0.5 0.2 0.1 10.20 2.51
2 1.5 0.1 0.2 11.25 20.5
3 2.5 0.7 0.3 12.90 0
4 3.5 0.8 0.4 13.19 12.5
5 7.5 0.6 0.3 14.21 0

Use pandas.merge:
import pandas as pd
a = pd.read_csv("data1.csv")
b = pd.read_csv("data2.csv")
output = a.merge(b, on="id_no", how="left").fillna(0).set_index("id_no")
output.to_csv("output.csv")
>>> output
a1 a2 a3 a4 A1
id_no
1 0.5 0.2 0.1 10.20 2.51
2 1.5 0.1 0.2 11.25 20.50
3 2.5 0.7 0.3 12.90 0.00
4 3.5 0.8 0.4 13.19 12.50
5 7.5 0.6 0.3 14.21 0.00

Using plain old python:
from csv import reader, writer
from pathlib import Path

# build a lookup of id_no -> A1 from the second file
with Path("file2.csv").open(newline="") as f:
    r = reader(f)
    extra_header = next(r)                 # ["id_no", "A1"]
    data = {row[0]: row[1] for row in r}

rows = []
with Path("file1.csv").open(newline="") as f:
    r = reader(f)
    header = next(r) + [extra_header[-1]]  # append the "A1" column to file1's header
    for id_no, *rest in r:
        # use the matching A1 value, or 0 when there is no match
        rows.append([id_no] + rest + [data.get(id_no, 0)])

with Path("file3.csv").open("w", newline="") as f:
    w = writer(f)
    w.writerow(header)
    w.writerows(rows)


create new rows based on a specific condition and iterate over a list in pandas

I have a df as shown below
B_ID No_Show Session slot_num Cumulative_no_show
1 0.4 S1 1 0.4
2 0.3 S1 2 0.7
3 0.8 S1 3 1.5
4 0.3 S1 4 1.8
5 0.6 S1 5 2.4
6 0.8 S1 6 3.2
7 0.9 S1 7 4.1
8 0.4 S1 8 4.5
9 0.6 S1 9 5.1
12 0.9 S2 1 0.9
13 0.5 S2 2 1.4
14 0.3 S2 3 1.7
15 0.7 S2 4 2.4
20 0.7 S2 5 3.1
16 0.6 S2 6 3.7
17 0.8 S2 7 4.5
19 0.3 S2 8 4.8
The code to create the above df is shown below.
import pandas as pd
import numpy as np
df = pd.DataFrame({'B_ID': [1,2,3,4,5,6,7,8,9,12,13,14,15,20,16,17,19],
'No_Show': [0.4,0.3,0.8,0.3,0.6,0.8,0.9,0.4,0.6,0.9,0.5,0.3,0.7,0.7,0.6,0.8,0.3],
'Session': ['s1','s1','s1','s1','s1','s1','s1','s1','s1','s2','s2','s2','s2','s2','s2','s2','s2'],
'slot_num': [1,2,3,4,5,6,7,8,9,1,2,3,4,5,6,7,8],
})
df['Cumulative_no_show'] = df.groupby(['Session'])['No_Show'].cumsum()
and a list called walkin_no_show = [0.3, 0.4, 0.3, 0.4, 0.3, 0.4, and so on with length 1000]
From the above, whenever u_cumulative > 0.8, create a new row just below that one with
df[No_Show] = walkin_no_show[i]
with the same Session and slot_num as the previous row, and create a new column called u_cumulative by subtracting (1 - walkin_no_show[i]) from the previous value.
Expected Output:
B_ID No_Show Session slot_num Cumulative_no_show u_cumulative
1 0.4 S1 1 0.4 0.4
2 0.3 S1 2 0.7 0.7
3 0.8 S1 3 1.5 1.5
walkin1 0.3 S1 3 1.5 0.8
4 0.3 S1 4 1.8 1.1
walkin2 0.4 S1 4 1.8 0.5
5 0.6 S1 5 2.4 1.1
walkin3 0.3 S1 5 2.4 0.4
6 0.8 S1 6 3.2 1.2
walkin4 0.4 S1 6 3.2 0.6
7 0.9 S1 7 4.1 1.5
walkin5 0.3 S1 7 4.1 0.8
8 0.4 S1 8 4.5 1.2
walkin6 0.4 S1 8 4.5 0.6
9 0.6 S1 9 5.1 1.2
12 0.9 S2 1 0.9 0.9
walkin1 0.3 S2 1 0.9 0.2
13 0.5 S2 2 1.4 0.7
14 0.3 S2 3 1.7 1.0
walkin2 0.4 S2 3 1.7 0.4
15 0.7 S2 4 2.4 1.1
walkin3 0.3 S2 4 2.4 0.4
20 0.7 S2 5 3.1 1.1
walkin4 0.4 S2 5 3.1 0.5
16 0.6 S2 6 3.7 1.1
walkin5 0.3 S2 6 3.7 0.4
17 0.8 S2 7 4.5 1.2
walkin6 0.4 S2 7 4.5 0.6
19 0.3 S2 8 4.8 0.9
I tried the code below with a minor edit, as answered by @Ben.T on my question mentioned below:
create new rows based on the values of one of the columns in pandas or numpy
Thanks @Ben.T, full credit to you.
def create_u_columns (ser):
l_index = []
arr_ns = ser.to_numpy()
# array for latter insert
arr_idx = np.zeros(len(ser), dtype=int)
walkin_id = 1
for i in range(len(arr_ns)-1):
if arr_ns[i]>0.8:
# remove 1 to u_no_show
arr_ns[i+1:] -= (1-walkin_no_show[arr_idx])
# increment later idx to add
arr_idx[i] = walkin_id
walkin_id +=1
#return a dataframe with both columns
return pd.DataFrame({'u_cumulative': arr_ns, 'mask_idx':arr_idx}, index=ser.index)
df[['u_cumulative', 'mask_idx']] = df.groupby(['Session'])['Cumulative_no_show'].apply(create_u_columns)
# select the rows
df_toAdd = df.loc[df['mask_idx'].astype(bool), :].copy()
# replace the values as wanted
df_toAdd['No_Show'] = walkin_no_show[mask_idx]
df_toAdd['B_ID'] = 'walkin'+df_toAdd['mask_idx'].astype(str)
df_toAdd['u_cumulative'] -= 1
# add 0.5 to index for later sort
df_toAdd.index += 0.5
new_df_0.8 = pd.concat([df,df_toAdd]).sort_index()\
.reset_index(drop=True).drop('mask_idx', axis=1)
I would also like to iterate over a list of thresholds: change the condition (arr_ns[i]>0.8) to each of [0.8, 0.9, 1.0] and create 3 DataFrames, such as new_df_0.8, new_df_0.9 and new_df_1.0.
IIUC, you can do it this way:
def create_u_columns (ser, threshold_ns = 0.8):
arr_ns = ser.to_numpy()
# array for latter insert
arr_idx = np.zeros(len(ser), dtype=int)
walkin_id = 0 #start at 0 not 1 for list indexing
for i in range(len(arr_ns)-1):
if arr_ns[i]>threshold_ns:
# remove 1 to u_no_show
arr_ns[i+1:] -= (1-walkin_no_show[walkin_id]) #this is slightly different
# increment later idx to add
arr_idx[i] = walkin_id+1
walkin_id +=1
#return a dataframe with both columns
return pd.DataFrame({'u_cumulative': arr_ns, 'mask_idx':arr_idx}, index=ser.index)
#create empty dict for storing the dataframes
d_dfs = {}
#iterate over the value for the threshold
for th_ns in [0.8, 0.9, 1.0]:
#create a copy and do the same kind of operation
df_ = df.copy()
df_[['u_cumulative', 'mask_idx']]= \
df_.groupby(['Session'])['Cumulative_no_show']\
.apply(lambda x: create_u_columns(x, threshold_ns=th_ns))
# select the rows
df_toAdd = df_.loc[df_['mask_idx'].astype(bool), :].copy()
# replace the values as wanted
df_toAdd['No_Show'] = np.array(walkin_no_show)[df_toAdd.groupby('Session').cumcount()]
df_toAdd['B_ID'] = 'walkin'+df_toAdd['mask_idx'].astype(str)
df_toAdd['u_cumulative'] -= (1 - df_toAdd['No_Show'])
# add 0.5 to index for later sort
df_toAdd.index += 0.5
d_dfs[th_ns] = pd.concat([df_,df_toAdd]).sort_index()\
.reset_index(drop=True).drop('mask_idx', axis=1)
Then if you want to have access to the dataframes, you can do for example:
for th, df_ in d_dfs.items():
print (th)
print (df_.head(4))
The only trick that you have to consider is the way you increase the index values.
Here is a solution:
walkin_no_show = [0.3, 0.4, 0.3, 0.4, 0.3]
df = pd.DataFrame({'B_ID': [1,2,3,4,5],
'No_Show': [0.1,0.1,0.3,0.5,0.6],
'Session': ['s1','s1','s1','s2','s2'],
'slot_num': [1,2,3,1,2],
'Cumulative_no_show': [1.5, 0.4, 1.6, 0.3, 1.9]
})
df = df[['B_ID', 'No_Show', 'Session', 'slot_num', 'Cumulative_no_show']]
df['u_cumulative'] = df['Cumulative_no_show']
print(df.head())
Output:
B_ID No_Show Session slot_num Cumulative_no_show u_cumulative
0 1 0.1 s1 1 1.5 1.5
1 2 0.1 s1 2 0.4 0.4
2 3 0.3 s1 3 1.6 1.6
3 4 0.5 s2 1 0.3 0.3
4 5 0.6 s2 2 1.9 1.9
then:
def Insert_row(row_number, df, row_value):
# Starting value of upper half
start_upper = 0
# End value of upper half
end_upper = row_number
# Start value of lower half
start_lower = row_number
# End value of lower half
end_lower = df.shape[0]
# Create a list of upper_half index
upper_half = [*range(start_upper, end_upper, 1)]
# Create a list of lower_half index
lower_half = [*range(start_lower, end_lower, 1)]
# Increment the value of lower half by 1
lower_half = [x.__add__(1) for x in lower_half]
# Combine the two lists
index_ = upper_half + lower_half
# Update the index of the dataframe
df.index = index_
# Insert a row at the end
df.loc[row_number] = row_value
# Sort the index labels
df = df.sort_index()
# return the dataframe
return df
walkin_count = 1
skip = False
last_Session = ''
i = 0
while True:
row = df.loc[i]
if row['Session'] != last_Session:
walkin_count = 1
last_Session = row['Session']
values_to_append = ['walkin{}'.format(walkin_count), walkin_no_show[i],
row['Session'], row['slot_num'], row['Cumulative_no_show'], (1 - walkin_no_show[i])]
if row['Cumulative_no_show'] > 0.8:
df = Insert_row(i+1, df, values_to_append)
walkin_no_show.insert(i+1, 0)
walkin_count += 1
i += 1
i += 1
if i == df.shape[0]:
break
print(df)
output:
B_ID No_Show Session slot_num Cumulative_no_show u_cumulative
0 1 0.1 s1 1 1.5 1.5
1 walkin1 0.3 s1 1 1.5 0.7
2 2 0.1 s1 2 0.4 0.4
3 3 0.3 s1 3 1.6 1.6
4 walkin2 0.3 s1 3 1.6 0.7
5 4 0.5 s2 1 0.3 0.3
6 5 0.6 s2 2 1.9 1.9
7 walkin3 0.3 s2 2 1.9 0.7
I hope it helps.
The Insert_row function used above is taken from: Insert row at given position

Python pandas data frame reshape

The data shown below is a simplified example. The actual data frame is a 3750-row, 2-column data frame. I need to reshape the data frame into another structure.
A A2
0.1 1
0.4 2
0.6 3
B B2
0.8 1
0.7 2
0.9 3
C C2
0.3 1
0.6 2
0.8 3
How can I reshape the above data frame horizontally, as follows:
A A2 B B2 C C2
0.1 1 0.8 1 0.3 1
0.4 2 0.7 2 0.6 2
0.6 3 0.9 3 0.8 3
You can reshape your data and create a new dataframe:
cols = 6
rows = 4
df = pd.DataFrame(df.values.T.reshape(cols,rows).T)
df.rename(columns=df.iloc[0]).drop(0)
A B C A2 B2 C2
1 0.1 0.8 0.3 1 1 1
2 0.4 0.7 0.6 2 2 2
3 0.6 0.9 0.8 3 3 3
Try this if you don't want to hard-code your values.
df['header']=pd.to_numeric(df[0],errors='coerce')
l= df['header'].values
m_l = l.reshape((np.isnan(l).sum(),-1))[:,1:]
h=df[df['header'].isnull()][0].values
print(pd.DataFrame(dict(zip(h, m_l))))
Output:
A B C
0 0.1 0.8 0.3
1 0.4 0.7 0.6
2 0.6 0.9 0.8
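
Neither snippet keeps each pair of columns together; here is a hedged sketch (my addition, not from the answers above) that rebuilds the requested A A2 B B2 C C2 layout, assuming the stacked frame repeats a header row every four rows:
import pandas as pd

# hypothetical reconstruction of the stacked input: a header row followed by
# three data rows, repeated for each block (A/A2, B/B2, C/C2)
raw = pd.DataFrame([
    ["A", "A2"], [0.1, 1], [0.4, 2], [0.6, 3],
    ["B", "B2"], [0.8, 1], [0.7, 2], [0.9, 3],
    ["C", "C2"], [0.3, 1], [0.6, 2], [0.8, 3],
])

block = 4  # one header row plus three data rows per block
frames = [
    raw.iloc[i + 1:i + block].reset_index(drop=True)
       .set_axis(raw.iloc[i].tolist(), axis=1)
    for i in range(0, len(raw), block)
]
wide = pd.concat(frames, axis=1)  # columns: A A2 B B2 C C2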

How to make big file into small chunk and then use the chunk file for comparison with another file in linux or R or python

I have file-1, which is a big file with 9.9 million rows and 8 columns. I have another file, file-2, with 3.3 million rows and 9 columns. I am extracting the rows from file-1 whose column 2 matches column 2 of file-2.
I used the following code
awk 'BEGIN { while ((getline<"file-1")>0) {REC[$2]=$0}} {print REC[$2]}' < file-2 > file-3
But with file-1 at 9.9 million rows, this code gives me no output. If I split the 9.9 million rows into chunks of 100 thousand rows I get output, but if I split it into chunks of 500 thousand rows there is again no output.
This was my split command
split -l 100000 -d file-1 smallfile- --additional-suffix=.txt
Now I need help to split this file into small chunks of 100 thousand rows and run each chunk through the above awk command, so that each output gets a suffix identifying the chunk; for example, the output of the first 100 thousand rows would be file3_split_100_thousand, and so on.
Similarly, for the full 9.9 million rows it would be file3_split_9.9_million.
My example file 1
A B C D E F G H
chr1 1 0.5 0.6 0.7 0.7 0.9 10.0
chr1 2 0.5 0.6 0.11 0.7 0.9 10.0
chr1 3 0.5 0.6 0.1 0.7 0.9 10.0
chr1 4 0.5 0.6 0.7 0.7 0.9 10.0
File2
A B C D E F G H I
chr1 3 0.1 0.2 0.3 0.7 0.9 10.0 19
chr1 4 0.100 0.3 0.7 0.7 0.9 10.0 19
File3
A B C D E F G H
chr1 3 0.5 0.6 0.1 0.7 0.9 10.0
chr1 4 0.5 0.6 0.7 0.7 0.9 10.0
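
No answer was posted here, but as a hedged sketch in Python (my addition; the file names, whitespace-separated columns, and the headers A..I are assumptions taken from the examples above), pandas can stream file-1 in 100,000-row chunks and keep only rows whose column B appears in file-2:
import pandas as pd

# read file-2 once and keep the set of values in its column 2 (header "B")
keys = set(pd.read_csv("file-2", sep=r"\s+")["B"])

# stream file-1 in chunks of 100,000 rows and write one matched output per chunk
for i, chunk in enumerate(pd.read_csv("file-1", sep=r"\s+", chunksize=100_000), start=1):
    matched = chunk[chunk["B"].isin(keys)]
    matched.to_csv(f"file3_split_{i}", sep=" ", index=False)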

Reshape (stack) pandas dataframe based on predefined number of rows

I have a pandas dataframe which looks like one long row.
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14
________________________________________________________________________________________________
2010 | 0.1 0.5 0.5 0.7 0.5 0.5 0.5 0.5 0.9 0.5 0.5 0.8 0.3 0.3 0.6
I would like to reshape it as:
0 1 2 3 4
____________________________________
|0| 0.1 0.5 0.5 0.7 0.5
2010 |1| 0.5 0.5 0.5 0.9 0.5
|2| 0.5 0.8 0.3 0.3 0.6
I can certainly do it using a loop, but I'm guessing (un)stack and/or pivot might be able to do the trick; I just couldn't figure out how...
Symmetry/filling up blanks - if the data is not integer divisible by the number of rows after unstack - is not important for now.
EDIT:
I coded up the loop solution meanwhile:
df=my_data_frame
dk=pd.DataFrame()
break_after=3
for i in range(len(df) // break_after):
dl=pd.DataFrame(df[i*break_after:(i+1)*break_after]).T
dl.columns=range(break_after)
dk=pd.concat([dk,dl])
If there is only one index (2010), this will work fine.
df1 = pd.DataFrame(np.reshape(df.values,(3,5)))
df1['Index'] = '2010'
df1.set_index('Index',append=True,inplace=True)
df1 = df1.reorder_levels(['Index', None])
Output:
0 1 2 3 4
Index
2010 0 0.1 0.5 0.5 0.7 0.5
1 0.5 0.5 0.5 0.9 0.5
2 0.5 0.8 0.3 0.3 0.6
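
If the width of 5 is fixed but the number of values is not known in advance, a small hedged variant (my addition) lets NumPy infer the row count, assuming the length is a multiple of 5:
# -1 lets NumPy infer the number of rows; the total length must be a multiple of 5
df1 = pd.DataFrame(df.values.reshape(-1, 5))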

numpy/pandas effective multiplication of arrays/dataframes

I have two pandas data frames, which look like this:
import pandas as pd
df_one = pd.DataFrame( {
'A': [1,1,2,3,4,4,4],
'B1': [0.5,0.0,0.2,0.1,0.3,0.2,0.1],
'B2': [0.2,0.3,0.1,0.5,0.3,0.1,0.2],
'B3': [0.1,0.2,0.0,0.9,0.0,0.3,0.5]} );
df_two = pd.DataFrame( {
'A': [1,2,3,4],
'C1': [1.0,9.0,2.1,9.0],
'C2': [2.0,3.0,0.7,1.1],
'C3': [5.0,4.0,2.3,3.4]} );
df_one
A B1 B2 B3
0 1 0.5 0.2 0.1
1 1 0.0 0.3 0.2
2 2 0.2 0.1 0.0
3 3 0.1 0.5 0.9
4 4 0.3 0.3 0.0
5 4 0.2 0.1 0.3
6 4 0.1 0.2 0.5
df_two
A C1 C2 C3
0 1 1.0 2.0 5.0
1 2 9.0 3.0 4.0
2 3 2.1 0.7 2.3
3 4 9.0 1.1 3.4
What I would like to do is compute a scalar product, multiplying rows of the first data frame by rows of the second data frame, i.e. \sum_i B_i * C_i, but in such a way that a row of the first data frame is multiplied by a row of the second data frame only if the values of the A column match in both frames. I know how to do it with loops and ifs, but I would like to do it in a more efficient numpy-like or pandas-like way. Any help much appreciated :)
Not sure if you want unique values for column A (If you do, use groupby on the result below)
pd.merge(df_one, df_two, on='A')
A B1 B2 B3 C1 C2 C3
0 1 0.5 0.2 0.1 1.0 2.0 5.0
1 1 0.0 0.3 0.2 1.0 2.0 5.0
2 2 0.2 0.1 0.0 9.0 3.0 4.0
3 3 0.1 0.5 0.9 2.1 0.7 2.3
4 4 0.3 0.3 0.0 9.0 1.1 3.4
5 4 0.2 0.1 0.3 9.0 1.1 3.4
6 4 0.1 0.2 0.5 9.0 1.1 3.4
pd.merge(df_one, df_two, on='A').apply(lambda s: sum([s['B%d'%i] * s['C%d'%i] for i in range(1, 4)]) , axis=1)
0 1.40
1 1.60
2 2.10
3 2.63
4 3.03
5 2.93
6 2.82
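As a hedged follow-up (my addition, not part of the original answer): the row-wise apply above can be replaced by a plain NumPy product over the merged frame, which avoids the Python-level loop:
import pandas as pd

# merge on A, then take the row-wise dot product of the B and C columns
merged = pd.merge(df_one, df_two, on='A')
b = merged[['B1', 'B2', 'B3']].to_numpy()
c = merged[['C1', 'C2', 'C3']].to_numpy()
result = pd.Series((b * c).sum(axis=1), index=merged.index)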
Another approach would be something similar to this:
import pandas as pd
df_one = pd.DataFrame( {
'A': [1,1,2,3,4,4,4],
'B1': [0.5,0.0,0.2,0.1,0.3,0.2,0.1],
'B2': [0.2,0.3,0.1,0.5,0.3,0.1,0.2],
'B3': [0.1,0.2,0.0,0.9,0.0,0.3,0.5]} );
df_two = pd.DataFrame( {
'A': [1,2,3,4],
'C1': [1.0,9.0,2.1,9.0],
'C2': [2.0,3.0,0.7,1.1],
'C3': [5.0,4.0,2.3,3.4]} );
lookup = df_two.groupby(df_two.A)
def multiply_rows(row):
other = lookup.get_group(row['A'])
# We want every column after "A"
x = row.values[1:]
# In this case, other is a 2D array with one row, similar to "row" above...
y = other.values[0, 1:]
return x.dot(y)
# The "axis=1" makes each row to be passed in, rather than each column
result = df_one.apply(multiply_rows, axis=1)
print(result)
This results in:
0 1.40
1 1.60
2 2.10
3 2.63
4 3.03
5 2.93
6 2.82
I would zip together the rows and use a filter or a comprehension that takes only the rows where A columns match.
Something like
[scalar_product(a,b) for a,b in zip (frame1, frame2) if a[0]==b[0]]
assuming that you're willing to fill in the appropriate material for scalar_product
(apologies if I've made a thinko here - this code is for example purposes only and has not been tested!)
