I have a file whose content looks like:
A B 2 4
C D 1 2
A D 3 4
A D 1 2
A B 4 7
...and so on.
My objective is to get the final output as below:
A B 3 5.5
C D 1 2
A D 2 3
That is, for each unique combination of the first two columns, the result should be the column-wise average of the other two columns. I tried using loops, but it just increases the complexity of the program. Is there any other way to achieve this?
Sample code:
with open(r"C:\Users\priya\Desktop\test.txt") as f:
    content = f.readlines()
content = [x.split() for x in content]
for i in range(len(content)):
    valueofa = [content[i][2]]  # note: these values are still strings
    valueofb = [content[i][3]]
    for j in range(i + 1, len(content)):  # range, not the Python 2 xrange
        if content[i][0] == content[j][0] and content[i][1] == content[j][1]:
            valueofa.append(content[j][2])
            valueofb.append(content[j][3])
and I intended to take the average of both lists by index.
You can store each combination of letters as a tuple in a dictionary and then average at the end, e.g.:
d = {}
with open(r"C:\Users\priya\Desktop\test.txt") as f:
    for line in f:
        a, b, x, y = line.split()
        d.setdefault((a, b), []).append((int(x), int(y)))

for (a, b), v in d.items():
    xs, ys = zip(*v)
    print("{} {} {:g} {:g}".format(a, b, sum(xs) / len(v), sum(ys) / len(v)))
Output:
A B 3 5.5
C D 1 2
A D 2 3
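For what it's worth, a close variant of the dictionary approach uses collections.defaultdict so the setdefault call disappears; the averaging loop at the end stays exactly the same:
from collections import defaultdict

d = defaultdict(list)  # (a, b) -> list of (x, y) pairs
with open(r"C:\Users\priya\Desktop\test.txt") as f:
    for line in f:
        a, b, x, y = line.split()
        d[(a, b)].append((int(x), int(y)))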
If you can use pandas, it will be much simpler:
import pandas as pd
df = pd.read_csv(r"C:\Users\priya\Desktop\test.txt", sep=r"\s+", names=['A','B','C','D'])  # sep=r"\s+" because the file is space-separated
df
A B C D
0 A B 2 4
1 C D 1 2
2 A D 3 4
3 A D 1 2
4 A B 4 7
df.groupby(['A','B']).mean().reset_index()
A B C D
0 A B 3.0 5.5
1 A D 2.0 3.0
2 C D 1.0 2.0
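As a small variant, passing as_index=False to groupby keeps the key columns as ordinary columns and saves the reset_index call:
df.groupby(['A', 'B'], as_index=False).mean()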
I'm trying to count the values in two columns and then put the results in the same table.
import pandas as pd

# named "data" rather than "dict", to avoid shadowing the built-in
data = {"before": list("ABCDEFABDCFEFF"),
        "after": list("FABFCFFEEDEBFF")}
df = pd.DataFrame(data)
Output:
before after
0 A F
1 B A
2 C B
3 D F
4 E C
5 F F
6 A F
7 B E
8 D E
9 C D
10 F E
11 E B
12 F F
13 F F
I've achieved something close to what I want, but this looks messy, and I'm hoping for a "smoother" solution.
df.melt().groupby("variable")["value"].value_counts().to_frame().unstack()
Output:
value
value A B C D E F
variable
after 1 2 1 1 3 6
before 2 2 2 2 2 4
df.apply(lambda x: x.value_counts())
If you want before and after as the row indexes, as in your current output, use the following.
df.apply(lambda x: x.value_counts()).transpose()
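One caveat (my addition): if some value appears in only one of the two columns, its count in the other row comes back as NaN. Assuming you want integer zeros there instead, fill before transposing:
df.apply(lambda x: x.value_counts()).fillna(0).astype(int).transpose()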
A different way with melt using pivot_table:
>>> df.melt().assign(count=1).pivot_table('count', 'variable', 'value', aggfunc='count')
value A B C D E F
variable
after 1 2 1 1 3 6
before 2 2 2 2 2 4
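As another option (my suggestion, not part of the original answer), pd.crosstab builds the same table from the melted frame without the dummy count column:
m = df.melt()
pd.crosstab(m['variable'], m['value'])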
My csv file's rows and columns look like this:
a a a a a
b b b b b
c c c c c
d d d d d
a b c d e
a d b c c
When a row follows the patterns of rows 1-5 (all values the same, or all values distinct), I want to return 0.
When a row is like row 6 (some, but not all, values repeated), I want to return 1.
How do I do this in Python? It must work from the csv file.
You can read your csv file into a pandas dataframe using:
import pandas as pd

df = pd.read_csv("your_file.csv", header=None)  # placeholder path; add sep=r"\s+" if the file is space-separated
output:
0 1 2 3 4
0 a a a a a
1 b b b b b
2 c c c c c
3 d d d d d
4 a b c d e
5 a d b c c
Then use nunique to count the number of unique values per row: a row is valid when the count is 1 (all values equal) or 5 (all distinct, the maximum number of columns). between flags the in-between counts:
df.nunique(axis=1).between(2, len(df.columns) - 1).astype(int)
output:
0 0
1 0
2 0
3 0
4 0
5 1
dtype: int64
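An equivalent spelling (my suggestion, same result) states the rule directly: a row is flagged unless its unique count is 1 or equals the number of columns:
nunique = df.nunique(axis=1)
(~nunique.isin([1, len(df.columns)])).astype(int)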
I have the following dataframe which is a small part of a bigger one:
acc_num trans_cdi
0 1 c
1 1 d
3 3 d
4 3 c
5 3 d
6 3 d
I'd like to delete all rows where the last items are "d". So my desired dataframe would look like this:
acc_num trans_cdi
0 1 c
3 3 d
4 3 c
So the point is, that a group shouldn't have "d" as the last item.
I have code that deletes the last row of each group when that row is "d". But in that case I have to run it twice to delete all the trailing "d" rows in group 3, for example.
clean_3 = clean_2[clean_2.groupby('account_num')['trans_cdi'].transform(lambda x: (x.iloc[-1] != "d") | (x.index != x.index[-1]))]
Is there a better solution to this problem?
We can use idxmax here after reversing the data with [::-1]: ne('d') marks the non-'d' rows, so reversing each group and taking idxmax finds the label of the last non-'d' row, and slicing up to that label drops the trailing run of 'd' rows:
grps = df['trans_cdi'].ne('d').groupby(df['acc_num'], group_keys=False)
idx = grps.apply(lambda x: x.loc[:x[::-1].idxmax()]).index
df.loc[idx]
acc_num trans_cdi
0 1 c
3 3 d
4 3 c
Testing on consecutive values:
acc_num trans_cdi
0 1 c
1 1 d <--- d between two c's, so we need to keep it
2 1 c
3 1 d <--- row to be dropped
4 3 d
5 3 c
6 3 d
7 3 d
grps = df['trans_cdi'].ne('d').groupby(df['acc_num'], group_keys=False)
idx = grps.apply(lambda x: x.loc[:x[::-1].idxmax()]).index
df.loc[idx]
acc_num trans_cdi
0 1 c
1 1 d
2 1 c
4 3 d
5 3 c
It still gives the correct result.
You can try this not-so-pandorable solution.
def r(x):
    # count the trailing run of 'd' values, scanning from the end
    c = 0
    for v in x['trans_cdi'].iloc[::-1]:
        if v == 'd':
            c = c + 1
        else:
            break
    # slice off the trailing 'd' rows; len(x) - c avoids the empty
    # frame that x.iloc[:-c] would return when c == 0
    return x.iloc[:len(x) - c]

df.groupby('acc_num', group_keys=False).apply(r)
acc_num trans_cdi
0 1 c
3 3 d
4 3 c
First, compare each row's value with the previous row's using shift, and keep only the rows where the two are not both 'd' (~ negates the mask).
Second, make sure the last remaining row is not 'd'. If it is, delete it.
code:
df = df[~((df['trans_cdi'] == 'd') & (df['trans_cdi'].shift(1) == 'd'))]
if df['trans_cdi'].iloc[-1] == 'd':
    df = df.iloc[:-1]
df
input (I tested it on more input data to ensure there were no bugs):
acc_num trans_cdi
0 1 c
1 1 d
3 3 d
4 3 c
5 3 d
6 3 d
7 1 d
8 1 d
9 3 c
10 3 c
11 3 d
12 3 d
output:
acc_num trans_cdi
0 1 c
1 1 d
4 3 c
5 3 d
9 3 c
10 3 c
Sorry if this seems simple, but I have been struggling to find an answer to this.
I have a large dataframe of the format in the picture:
Each row can be uniquely identified by the multi-index built from the columns "trip_id", "direction_id", "stop_sequence".
I would like a method (loops are fine) to build a Python dictionary of dataframes, where each dataframe is the subset of the large one containing all rows for one "trip_id" + "direction_id" combination.
At the end I would like a dictionary of dataframes that I can access with a simple key: either an integer index (e.g. 0 to 10,000) or the (trip_id, direction_id) combination itself.
E.g. for the image above, I would like all the rows where the trip_id is "17067064.T0.2-EPP-F-mjp-1.8.R" and the direction ID is "1" to be in one dataframe of this dictionary collection.
Thank you for your help.
Use groupby with a dictionary comprehension:
df = pd.DataFrame({
    'A': list('abcdef'),
    'B': [4, 5, 5, 5, 5, 4],
    'C': [7, 8, 9, 4, 2, 3],
    'D': [1, 3, 5, 7, 1, 0],
    'E': [5, 3, 6, 9, 2, 4],
    'F': list('aaabbb')
}).set_index(['F', 'B', 'C'])
print(df)
A D E
F B C
a 4 7 a 1 5
5 8 b 3 3
9 c 5 6
b 5 4 d 7 9
2 e 1 2
4 3 f 0 4
# Python 3.6+
dfs = {f'{a}_{b}': v for (a, b), v in df.groupby(level=['F', 'B'])}

# Python below 3.6:
# dfs = {'{}_{}'.format(a, b): v for (a, b), v in df.groupby(level=['F', 'B'])}

print(dfs)
{'a_4': A D E
F B C
a 4 7 a 1 5, 'a_5': A D E
F B C
a 5 8 b 3 3
9 c 5 6, 'b_4': A D E
F B C
b 4 3 f 0 4, 'b_5': A D E
F B C
b 5 4 d 7 9
2 e 1 2}
print(dfs['a_4'])
A D E
F B C
a 4 7 a 1 5
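Applied to your columns, the same pattern would look like this (a sketch, assuming your large frame is named df and has trip_id and direction_id as regular columns):
dfs = {key: sub for key, sub in df.groupby(['trip_id', 'direction_id'])}
# access one sub-frame by its key tuple (use direction_id's real dtype), e.g.:
# dfs[('17067064.T0.2-EPP-F-mjp-1.8.R', 1)]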
I want to append 2 dataframes:
data1:
a
1 a
2 b
3 c
4 d
5 e
data2:
b
1 f
2 g
3 h
4 i
5 j
output:
1 a
2 b
3 c
4 d
5 e
6 f
7 g
8 h
9 i
10 j
Currently I am using:
all_data= data1.append(data2, ignore_index=True)
This gives me the result:
a b
1 a
2 b
3 c
4 d
5 e
6 f
7 g
8 h
9 i
10 j
i.e., in different columns. How can I get them in the same column?
I also tried converting the dataframes to lists and appending those, but that gave me this error (list.append, unlike the pandas method, takes no keyword arguments):
TypeError: append() takes no keyword arguments
Also, is there any other function to remove duplicates from a dataframe of strings? drop_duplicates() does not work in my case; the data still has duplicates.
You need to change one column name, so append can detect what you want to do:
data2.columns = ["a"]
or
data1.columns = ["b"]
And then, after using data2.columns = ["a"]:
all_data = data1.append(data2, ignore_index=True)
all_data
a
0 a
1 b
2 c
3 d
4 e
5 f
6 g
7 h
8 i
9 j
The resulting column keeps the name from data1, which you can rename if you want:
all_data.columns = ["Foo"]
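On your drop_duplicates question: it returns a new DataFrame rather than modifying in place, so (assuming that is the issue) you need to assign the result back:
all_data = all_data.drop_duplicates().reset_index(drop=True)
If duplicates still remain, check for hidden whitespace first, e.g. all_data['a'] = all_data['a'].str.strip() before dropping.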
merge and concat work on keys, and in this case there are no common columns. However, why not use numpy's append and build the dataframe from the raw values? (pd.np was removed in pandas 2.0, so import numpy directly.)
In [67]: import numpy as np

In [68]: pd.DataFrame(np.append(data1.values, data2.values), columns=['A'])
Out[68]:
A
0 a
1 b
2 c
3 d
4 e
5 f
6 g
7 h
8 i
9 j
df1.columns = ['b']
Out[78]:
b
0 a
1 b
2 c
3 d
4 e
pd.concat([df1 , df2] , ignore_index=True)
Out[80]:
b
0 a
1 b
2 c
3 d
4 e
5 f
6 g
7 h
8 i
9 j
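Alternatively (my suggestion, not shown in the answers above), you can skip the renaming altogether by concatenating the underlying Series; stacking Series along the rows ignores their differing names:
combined = pd.concat([data1['a'], data2['b']], ignore_index=True)
all_data = combined.to_frame(name='a')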