I have two CSV files, one of which is an excerpt of the other. I want to compare the two and write the differences to a new CSV. Originally I built a set from each file and compared them, appending non-matching lines to a new file. However, I ran into the issue that even a slight change in a number makes a row show up as a difference in both files. Is there a memory-efficient way to compare the cell values and append a row only when the difference is more than 2?
The code I have used is below; samples of both files follow it:
with open('T1.csv', 'r') as orig, open('T2.csv', 'r') as new:
    bigb = set(new) - set(orig)
print(bigb)
with open('different.csv', 'w') as file_out:
    for line in bigb:
        file_out.write(line)
csv1:
0 1.1 -19.1 -29.1
1 2.1 -18.1 -28.1
2 3.1 -17.1 -27.1
3 4.1 -16.1 -26.1
4 5.1 -15.1 -25.1
5 6.1 -14.1 -24.1
6 7.1 -13.1 -23.1
7 8.1 -12.1 -22.1
8 9.1 -11.1 -21.1
9 10.1 -10.1 -20.1
10 11.1 -9.1 -19.1
csv2:
0 1.4 -19.6 -29.8
1 2.4 -18.6 -28.8
2 3.4 -17.6 -27.8
3 4.4 -16.6 -26.8
4 5.4 -15.6 -25.8
5 6.4 -14.6 -24.8
6 7.4 -13.6 -23.8
7 8.4 -12.6 -22.8
8 9.4 -11.6 -21.8
9 10.4 -10.6 -20.8
10 11.4 -9.6 -19.8
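A memory-efficient alternative to the set comparison is to stream both files in lockstep and compare cell by cell; a minimal sketch, assuming T1.csv and T2.csv contain the same rows in the same order, with whitespace-separated numeric cells as in the samples above:
# read both files line by line; neither is ever loaded fully into memory
with open('T1.csv') as orig, open('T2.csv') as new, \
        open('different.csv', 'w') as file_out:
    for line_orig, line_new in zip(orig, new):
        cells_orig = [float(x) for x in line_orig.split()]
        cells_new = [float(x) for x in line_new.split()]
        # keep the row only when some cell differs by more than 2
        if any(abs(a - b) > 2 for a, b in zip(cells_orig, cells_new)):
            file_out.write(line_new)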
I have 4 CSV files with tab (\t) as the delimiter.
alok#alok-HP-Laptop-14s-cr1:~/tmp/krati$ for file in sample*.csv; do echo $file; cat $file; echo ; done
sample1.csv
ProbeID p_code intensities
B1_1_3 6170 2
B2_1_3 6170 2.2
B3_1_4 6170 2.3
12345 6170 2.4
1234567 6170 2.5
sample2.csv
ProbeID p_code intensities
B1_1_3 5320 3
B2_1_3 5320 3.2
B3_1_4 5320 3.3
12345 5320 3.4
1234567 5320 3.5
sample3.csv
ProbeID p_code intensities
B1_1_3 1234 4
B2_1_3 1234 4.2
B3_1_4 1234 4.3
12345 1234 4.4
1234567 1234 4.5
sample4.csv
ProbeID p_code intensities
B1_1_3 3120 5
B2_1_3 3120 5.2
B3_1_4 3120 5.3
12345 3120 5.4
1234567 3120 5.5
All 4 files have the same headers. The ProbeID values are the same across all files, in the same order, and each file has a single p_code value throughout. I have to merge all these CSV files into one in this format:
alok#alok-HP-Laptop-14s-cr1:~/tmp/krati$ cat output1.csv
ProbeID 6170 5320 1234 3120
B1_1_3 2 3 4 5
B2_1_3 2.2 3.2 4.2 5.2
B3_1_4 2.3 3.3 4.3 5.3
12345 2.4 3.4 4.4 5.4
1234567 2.5 3.5 4.5 5.5
In this output file the columns are dynamic, based on the p_code values.
I can do this easily in Python using a dictionary. How can I produce such output using Pandas?
We can achieve this using pandas.concat and DataFrame.pivot_table:
import os
import pandas as pd

# stack all sample*.csv files into one long frame
df = pd.concat(
    [pd.read_csv(f, sep="\t") for f in os.listdir() if f.endswith(".csv") and f.startswith("sample")],
    ignore_index=True
)
# one row per ProbeID, one column per p_code, intensities as the values
df = df.pivot_table(index="ProbeID", columns="p_code", values="intensities", aggfunc="sum")
print(df)
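Note that pivot_table sorts the columns, so the p_code order comes out ascending rather than in file order. A sketch for restoring the order shown above and writing the merged table back out (the explicit column list is an assumption based on the sample files):
# put the columns back in the original file order, then write tab-separated
df = df.reindex(columns=[6170, 5320, 1234, 3120])
df.to_csv("output1.csv", sep="\t")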
I am reading a text file using Pandas in Python 2.7. The dataset used in this question is related to a question that I asked before here. Specifically, the first two rows and the first column of my data comprise text information. The following is a snapshot of a truncated version of my dataset.
The data file can be found here. I am using the helpful answers given here to load the dataset (df = pd.read_csv('dum.txt',sep='\t', header=[0,1], index_col=0)).
I want to get descriptive statistics of my pandas dataframe along rows, instead of columns. I have tried using df.describe(), but it gives me descriptive statistics along columns. I had a look at the answers given in this question, but I get the following error when I use the answers suggested in that link.
TypeError: ('unbound method describe() must be called with DataFrame instance as first argument (got Series instance instead)', u'occurred at index foxq1')
How can I get descriptive statistics using Pandas for the numerical entries in every row for the dataset that I have? Thanks in advance.
Following a few comments, I am including the actual code that I am using and the error message.
The actual code is this:
df = pd.read_csv('dum.txt',sep='\t', header=[0,1], index_col=0)
df.apply(pd.DataFrame.describe, axis=1)
Error message:
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-20-0d7a5fde0f42> in <module>()
----> 1 df.apply(pd.DataFrame.describe, axis=1)
2 #df.apply(pd.DataFrame.describe, axis=1)
/Users/LG/anaconda2/lib/python2.7/site-packages/pandas/core/frame.pyc in apply(self, func, axis, broadcast, raw, reduce, args, **kwds)
4260 f, axis,
4261 reduce=reduce,
-> 4262 ignore_failures=ignore_failures)
4263 else:
4264 return self._apply_broadcast(f, axis)
/Users/LG/anaconda2/lib/python2.7/site-packages/pandas/core/frame.pyc in _apply_standard(self, func, axis, ignore_failures, reduce)
4356 try:
4357 for i, v in enumerate(series_gen):
-> 4358 results[i] = func(v)
4359 keys.append(v.name)
4360 except Exception as e:
TypeError: ('unbound method describe() must be called with DataFrame instance as first argument (got Series instance instead)', u'occurred at index object1')
From the question you referenced, the idea is to apply describe along the rows. Under Python 2.7, pd.DataFrame.describe is an unbound method, which is exactly why apply raises the TypeError above when it hands each row over as a Series; wrapping the call in a lambda avoids that and works on both Python 2 and 3:
df.apply(lambda row: row.describe(), axis=1)
And you get the following result:
count mean std min 25% 50% 75% max
object1 5.0 3.1 1.581139 1.1 2.1 3.1 4.1 5.1
object2 5.0 3.2 1.581139 1.2 2.2 3.2 4.2 5.2
object3 5.0 3.3 1.581139 1.3 2.3 3.3 4.3 5.3
object4 5.0 3.4 1.581139 1.4 2.4 3.4 4.4 5.4
object5 5.0 3.5 1.581139 1.5 2.5 3.5 4.5 5.5
object6 5.0 3.6 1.581139 1.6 2.6 3.6 4.6 5.6
object7 5.0 3.7 1.581139 1.7 2.7 3.7 4.7 5.7
object8 5.0 3.8 1.581139 1.8 2.8 3.8 4.8 5.8
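An alternative that sidesteps apply entirely: describe works column-wise, so transposing before and after gives the same per-row table (a sketch):
# rows become columns, describe them, then flip the result back
row_stats = df.T.describe().T
print row_stats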
You can use NumPy to obtain most of these statistics for the rows:
import numpy as np
import pandas as pd

df = pd.read_csv('dum.txt', sep='\t', header=[0,1], index_col=0)
print df
Type T1 T2 T3 T4 T5 T6 T7
Tag Tag1 Tag1 Tag1 Tag5 Tag5 Tag6 Tag6
object1 1.1 2.1 3.1 4.1 5.1 6.1 7.1
object2 1.2 2.2 3.2 4.2 5.2 6.2 7.2
object3 1.3 2.3 3.3 4.3 5.3 6.3 7.3
object4 1.4 2.4 3.4 4.4 5.4 6.4 7.4
object5 1.5 2.5 3.5 4.5 5.5 6.5 7.5
object6 1.6 2.6 3.6 4.6 5.6 6.6 7.6
object7 1.7 2.7 3.7 4.7 5.7 6.7 7.7
object8 1.8 2.8 3.8 4.8 5.8 6.8 7.8
data = df.values                   # raw 2-D array, labels dropped
data_mean = np.mean(data, axis=1)  # per-row mean
data_std = np.std(data, axis=1)    # per-row standard deviation
data_min = np.min(data, axis=1)    # per-row minimum
data_max = np.max(data, axis=1)    # per-row maximum
print data_mean
[ 4.1 4.2 4.3 4.4 4.5 4.6 4.7 4.8]
print data_std
[ 2. 2. 2. 2. 2. 2. 2. 2.]
print data_min
[ 1.1 1.2 1.3 1.4 1.5 1.6 1.7 1.8]
print data_max
[ 7.1 7.2 7.3 7.4 7.5 7.6 7.7 7.8]
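If you want these row statistics gathered in one DataFrame instead of separate arrays, a small sketch:
# collect the per-row statistics, keyed by the original row labels
stats = pd.DataFrame({'mean': data_mean, 'std': data_std,
                      'min': data_min, 'max': data_max},
                     index=df.index)
print stats[['mean', 'std', 'min', 'max']]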
I am struggling with a simple from_dict conversion. I have a dictionary of nested dictionaries of lists, as below (quite confusing to me as well):
dict_total = {'Jane' : {'a1' : [1.1,1.3,1.4,1.9],
'a2' : [3.1,2.4,2.3,1.2],
'a3' : [4.3,2.3,1.5,5.3],
'st' : ['d','dc','sc','sc']},
'Mark' : {'a1' : [3.1,2.3,1.3,1.9],
'a2' : [1.2,2.3,9.3,1.2],
'a3' : [1.1,5.5,1.2,5.3],
'st' : ['cs','s','wc','cd']}
}
The above is just a simple example; my original contains more than 20,000 keys in dict_total. I want to convert this dictionary to a dataframe (hopefully in a loop) like below.
df_total =
a1 a2 a3 st
Jane 1.1 3.1 4.3 d
Jane 1.3 2.4 2.3 dc
Jane 1.4 2.3 1.5 sc
Jane 1.9 1.2 5.3 sc
Mark 3.1 1.2 1.1 cs
Mark 2.3 2.3 5.5 s
Mark 1.3 9.3 1.2 wc
Mark 1.9 1.2 5.3 cd
As you can see, the keys of dict_total become the index of the dataframe, the keys under "Jane" and "Mark" become the column names, and the lists supply the values.
I hope there is a pythonic way to solve this. Thanks!
You can use concat with a dict comprehension, then drop the second index level (the integer positions) with reset_index:
df_total = (pd.concat({k: pd.DataFrame(v) for k, v in dict_total.items()})
.reset_index(level=1, drop=True))
print (df_total)
a1 a2 a3 st
Jane 1.1 3.1 4.3 d
Jane 1.3 2.4 2.3 dc
Jane 1.4 2.3 1.5 sc
Jane 1.9 1.2 5.3 sc
Mark 3.1 1.2 1.1 cs
Mark 2.3 2.3 5.5 s
Mark 1.3 9.3 1.2 wc
Mark 1.9 1.2 5.3 cd
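If you later need the names as a regular column rather than the index, a one-line follow-up (rename_axis labels the index before resetting it; the column name 'name' is an arbitrary choice):
df_total = df_total.rename_axis('name').reset_index()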
I need to combine two dataframes that contain information about train track sections. While "Line" identifies a track section, the two attributes "A" and "B" are given for subsections of the Line, defined by their start and end points on the line; these subsections do not match between the two dataframes:
df1
Line startpoint endpoint Attribute_A
100 2.506 2.809 B-70
100 2.809 2.924 B-91
100 2.924 4.065 B-84
100 4.065 4.21 B-70
100 4.21 4.224 B-91
...
df2
Line startpoint endpoint Attribute_B
100 2.5 2.6 140
100 2.6 2.7 158
100 2.7 2.8 131
100 2.8 2.9 124
100 2.9 3.0 178
...
What I would need is a merged dataframe that gives me the combination of Attributes A and B for the respective minimal subsections where they are shared:
df3
Line startpoint endpoint Attribute_A Attribute_B
100 2.5 2.506 nan 140
100 2.506 2.6 B-70 140
100 2.6 2.7 B-70 158
100 2.7 2.8 B-70 131
100 2.8 2.809 B-70 124
100 2.809 2.9 B-91 124
100 2.9 2.924 B-91 178
100 2.924 3.0 B-84 178
...
How can I best do this in Python? I'm somewhat new to it, and while I get around basic calculations between rows and columns, I'm at my wit's end with this problem. The approach of merging and sorting the two dataframes and calculating the respective differences between start and end points didn't get me very far, and I can't seem to find applicable information on the forums. I'm grateful for any hint!
Here is my solution, a bit long but it works:
First step is finding the intervals:
# collect every start/end boundary from both dataframes
all_start_points = set(df1['startpoint'].values.tolist() + df2['startpoint'].values.tolist())
all_end_points = set(df1['endpoint'].values.tolist() + df2['endpoint'].values.tolist())
all_points = sorted(all_start_points.union(all_end_points))
# consecutive boundary pairs form the minimal subsections
intervals = [(start, end) for start, end in zip(all_points[:-1], all_points[1:])]
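For the truncated sample data above, this yields exactly the subsection boundaries seen in df3:
print(intervals[:5])
# [(2.5, 2.506), (2.506, 2.6), (2.6, 2.7), (2.7, 2.8), (2.8, 2.809)]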
Then we need to find the relevant interval in each dataframe (if present):
import numpy as np

def find_interval(df, interval):
    # rows whose section fully covers the given interval
    return df[(df['startpoint'] <= interval[0]) &
              (df['endpoint'] >= interval[1])]

attr_A = [find_interval(df1, intv)['Attribute_A'] for intv in intervals]
attr_A = [el.iloc[0] if len(el) > 0 else np.nan for el in attr_A]
attr_B = [find_interval(df2, intv)['Attribute_B'] for intv in intervals]
attr_B = [el.iloc[0] if len(el) > 0 else np.nan for el in attr_B]
Finally, we put everything together:
out = pd.DataFrame(intervals, columns = ['startpoint', 'endpoint'])
out = pd.concat([out, pd.Series(attr_A).to_frame('Attribute_A'), pd.Series(attr_B).to_frame('Attribute_B')], axis = 1)
out['Line'] = 100
And I get the expected result:
out
Out[111]:
startpoint endpoint Attribute_A Attribute_B Line
0 2.500 2.506 NaN 140.0 100
1 2.506 2.600 B-70 140.0 100
2 2.600 2.700 B-70 158.0 100
3 2.700 2.800 B-70 131.0 100
4 2.800 2.809 B-70 124.0 100
5 2.809 2.900 B-91 124.0 100
6 2.900 2.924 B-91 178.0 100
7 2.924 3.000 B-84 178.0 100
8 3.000 4.065 B-84 NaN 100
9 4.065 4.210 B-70 NaN 100
10 4.210 4.224 B-91 NaN 100
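To match the column layout of df3 exactly, you could reorder at the end (a sketch):
# move Line to the front
out = out[['Line', 'startpoint', 'endpoint', 'Attribute_A', 'Attribute_B']]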
I have a dataframe df1
pid stat h1 h2 h3 h4 h5 h6 ... h20
1 a 3.2 3.5 6.2 7.1 1.2 2.3 ... 3.2
1 b 3.3 1.5 4.2 7.7 4.2 3.5 ... 8.4
1 a 3.1 3.8 2.2 1.1 6.2 5.3 ... 9.2
1 b 3.7 1.2 8.2 4.7 3.2 8.5 ... 2.4
: : : : : : : : : :
2 a 2.2 3.8 6.2 7.3 1.3 4.3 ... 3.2
2 b 4.3 1.3 4.2 5.7 2.2 3.1 ... 2.4
2 a 2.1 3.7 2.4 1.6 6.4 9.3 ... 9.6
2 b 3.8 1.3 8.7 3.7 7.2 8.3 ... 9.4
: : : : : : : : : :
3 a 2.2 3.8 6.2 7.3 1.3 4.3 ... 3.2
3 b 4.3 1.3 4.2 5.7 2.2 3.1 ... 2.4
3 a 2.1 3.7 2.4 1.6 6.4 9.3 ... 9.6
3 b 3.8 1.3 8.7 3.7 7.2 8.3 ... 9.4
: : : : : : : : : :
I would like to obtain groups indexed on pid and stat, and then subtract the h values of one group from the h values of another for a final dataframe (df2). This final dataframe needs to be reindexed with numbers from 0 to len(groups). This should be repeated iteratively for all permutations of pid, like 1-2, 1-3, 1-4, 2-1, 2-3, etc. I then need to perform other calculations on the final dataframe df2 (the values in the df2 below are not the exact differences, just a representation):
pid(string) stat h1p1-h1p2 h2p1-h2p2 h3p1-h3p2 h4p1-h4p2 h5p1-h5p2 h6p1-h6p2 ... h20p1-h20p2
1-2 a 3.2 3.5 6.2 7.1 1.2 2.3 ... 3.2
1-2 b 3.3 1.5 4.2 7.7 4.2 3.5 ... 8.4
1-2 a 3.1 3.8 2.2 1.1 6.2 5.3 ... 9.2
1-2 b 3.7 1.2 8.2 4.7 3.2 8.5 ... 2.4
1-3 ....
I looked at options like:
for (pid, stat), group in df1.groupby(['pid', 'stat']):
    print('pid = %s Stat = %s' % (pid, stat))
    print group
This gives me the groups, but I am not sure how to access the dataframes from this for loop and use them for subtracting from other groups. I also tried
df_grouped = df.groupby(['pid', 'stat']).groups()
but am still not sure how to access the new dataframe of groups and perform operations on it. I would like to know if this can be done using groupby, or if there is a better approach. Thanks in advance!
I implemented a generator, and ignored the stat column because it makes no difference to any group according to your sample. Please tell me if I got that wrong.
import pandas as pd
from itertools import permutations

def subtract_group(df, col):
    pid = df['pid'].unique()
    # select piece with pid == i, realigned to a fresh 0-based index
    segment = lambda df, i: df[df['pid'] == i].reset_index()[col]
    for x, y in permutations(pid, 2):
        result_df = pd.DataFrame(segment(df, x) - segment(df, y))
        # rename columns
        result_df.columns = ["%sp%d-%sp%d" % (c, x, c, y) for c in col]
        # insert pid column
        result_df.insert(0, 'pid', '-'.join([str(x), str(y)]))
        yield result_df
You can test it with:
# column names in your case
columns = ['h' + str(i+1) for i in range(20)]
print next(subtract_group(df1, columns))
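If you want a self-contained check, a toy df1 with hypothetical values and only two h columns will do:
# miniature stand-in for df1 (hypothetical numbers)
df1 = pd.DataFrame({'pid': [1, 1, 2, 2],
                    'stat': ['a', 'b', 'a', 'b'],
                    'h1': [3.2, 3.3, 2.2, 4.3],
                    'h2': [3.5, 1.5, 3.8, 1.3]})
print next(subtract_group(df1, ['h1', 'h2']))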
Hope it helps.