I have a data frame like this:
no,frc,val
1121,1,"John"
1121,0,236
3612,1,"Mary"
3612,0,545
I want to combine the data like this:
"John",236
"Mary",545
You can self-join two subsets of this DataFrame using the merge() method:
In [21]: (df[df['frc']==1]
            .drop(columns='frc')
            .rename(columns={'val':'name'})
            .merge(df[df['frc']==0].drop(columns='frc')))
Out[21]:
no name val
0 1121 John 236
1 3612 Mary 545
Alternatively, reshape with unstack():
df.set_index(['no', 'frc']).val.unstack().rename(columns={0:'val', 1:'name'})
frc val name
no
1121 236 John
3612 545 Mary
Or, to produce the OP's exact output:
print(
    df.set_index(['no', 'frc']).val
      .unstack()[[1, 0]]
      .to_csv(index=False, header=False)
)
John,236
Mary,545
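For completeness, a minimal sketch of the frame both snippets above assume (how the CSV is read is an assumption; note that val ends up as an object column mixing strings and numbers):
import pandas as pd

df = pd.DataFrame({'no': [1121, 1121, 3612, 3612],
                   'frc': [1, 0, 1, 0],
                   'val': ['John', 236, 'Mary', 545]})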
I have two pandas dataframes. The first one contains some data I want to multiply by the second dataframe, which is a reference table.
So in my example I want to get a new column in df1 for every column in my reference table - but also add up every row in that column.
Like this (Index 205368421 with R21 17): (1205 * 0.526499) + (7562 * 0.003115) + (1332 * 0.000267) = 658
In Excel VBA I iterated through both tables and did it that way - but it took very long. I've read that pandas is much better at this without iterating.
df1 = pd.DataFrame({'Index': ['205368421', '206321177','202574796','200212811', '204376114'],
'L1.09A': [1205,1253,1852,1452,1653],
'L1.10A': [7562,7400,5700,4586,4393],
'L1.10C': [1332, 0, 700,1180,290]})
df2 = pd.DataFrame({'WorkerID': ['L1.09A', 'L1.10A', 'L1.10C'],
'R21 17': [0.526499,0.003115,0.000267],
'R21 26': [0.458956,0,0.001819]})
Index L1.09A L1.10A L1.10C
205368421 1205 7562 1332
206321177 1253 7400 0
202574796 1852 5700 700
200212811 1452 4586 1180
204376114 1653 4393 290
WorkerID R21 17 R21 26
L1.09A 0.526499 0.458956
L1.10A 0.003115 0
L1.10C 0.000267 0.001819
I want this:
Index L1.09A L1.10A L1.10C R21 17 R21 26
205368421 1205 7562 1332 658 555
206321177 1253 7400 0 683 575
202574796 1852 5700 700 993 851
200212811 1452 4586 1180 779 669
204376114 1653 4393 290 884 759
I would be okay with some hints. Someone told me this might be matrix multiplication, so .dot() would be helpful. Is this the right direction?
Edit:
I have now done the following:
df1 = df1.set_index('Index')
df2 = df2.set_index('WorkerID')
common_cols = list(set(df1.columns).intersection(df2.index))
df2 = df2.loc[common_cols]
df1_sorted = df1.reindex(sorted(df1.columns), axis=1)
df2_sorted = df2.sort_index(axis=0)
df_multiplied = df1_sorted @ df2_sorted
This works with my example dataframes, but not with my real dataframes.
My real ones have these dimensions: df1_sorted (10429, 69) and df2_sorted (69, 18).
It should work, but my df_multiplied is full of NaN.
Alright, I did it!
I had to replace all NaN with 0.
So the final solution is:
df1 = df1.set_index('Index')
df2 = df2.set_index('WorkerID')
common_cols = list(set(df1.columns).intersection(df2.index))
df2 = df2.loc[common_cols]
df1_sorted = df1.reindex(sorted(df1.columns), axis=1)
df2_sorted = df2.sort_index(axis=0)
df1_sorted = df1_sorted.fillna(0)
df2_sorted = df2_sorted.fillna(0)
df_multiplied = df1_sorted @ df2_sorted
I have a dataframe with the following structure:
Cluster 1 Cluster 2 Cluster 3
ID Name Revenue ID Name Revenue ID Name Revenue
1234 John 123 1235 Jane 761 1237 Mary 276
1376 Peter 254 1297 Paul 439 1425 David 532
However I am unsure how to perform basic functions like .unique() or .value_counts() on columns, as I am unsure how to refer to them in the code...
For example, if I want to see the unique values in the Cluster 2 Name column, how would I code that?
Usually I would type df.Name.unique() or df['Name'].unique() but neither of these work.
My original data looked like this:
ID Name Revenue Cluster
1234 John 123 1
1235 Jane 761 2
1237 Mary 276 3
1297 Paul 439 2
1376 Peter 254 1
1425 David 532 3
And I used this code to get me to my current point:
df = (df.set_index([df.groupby('Cluster').cumcount(), 'Cluster'])
        .unstack()
        .swaplevel(1, 0, axis=1)
        .sort_index(axis=1)
        .rename(columns=lambda x: f'Cluster {x}', level=0))
You just need to select the column levels in sequence.
So your first step is to select Cluster 2, then get the unique names.
For example:
df["Cluster 2"]["Names"].unique()
I have a data frame which I want to group based on the value of another column in the same data frame.
For example:
The Parent_Id and ID columns are linked and define who is related to whom in a hierarchical tree.
The dataframe looks like this (input from a csv file):
No Name ID Parent_Id
1 Tom 211 111
2 Galie 209 111
3 Remo 200 101
4 Carmen 212 121
5 Alfred 111 191
6 Marvela 101 111
7 Armin 234 101
8 Boris 454 109
9 Katya 109 323
I would like to group this data frame based on the ID and Parent_Id in the below grouping, and generate CSV files out of this based on the top-level parent, i.e. Alfred.csv, Carmen.csv (which will have only its own entry, i.e. line #4), Katya.csv, using the to_csv() function.
Alfred
|_ Galie
|_ Tom
|_ Marvela
   |_ Remo
   |_ Armin
Carmen
Katya
|_ Boris
And, I want to create a new column in the same data frame, that will have a tag indicating the hierarchy. Like:
No Name ID Parent_Id Tag
1 Tom 211 111 Alfred
2 Galie 209 111 Alfred
3 Remo 200 101 Marvela, Alfred
4 Carmen 212 121
5 Alfred 111 191
6 Marvela 101 111 Alfred
7 Armin 234 101 Marvela, Alfred
8 Boris 454 109 Katya
9 Katya 109 323
Note that the names can repeat, but the ID will be unique.
Kindly let me know how to achieve this using pandas. I tried groupby(), but it seems a little complicated and I am not getting what I intend. There should be one file for each parent, with the child records in the parent's file.
If a child has children of its own (like Marvela), it qualifies to have its own csv file.
And the final output would be
Alfred.csv - All records matching Galie, Tom, Marvela
Marvela.csv - All records matching Remo, Armin
Carmen.csv - Only the record matching Carmen (its own row)
Katya.csv - all records matching katya, boris
I am assuming your dataframe is built from a dictionary:
import pandas as pd

mydf = {"No": [1,2,3,4,5,6,7,8,9],
        "Name": ["Tom","Galie","Remo","Carmen","Alfred","Marvela","Armin","Boris","Katya"],
        "ID": [211,209,200,212,111,101,234,454,109],
        "Parent_Id": [111,111,101,121,191,111,101,109,323]}
df = pd.DataFrame(mydf)
Then I look up each row's Parent_Id in the ID column and store the matching name in a new column:
tag = []
for z in df['Parent_Id']:
    try:
        # .item() raises ValueError when the parent ID is not in the table
        tag.append(df.query('ID==%s' % z)['Name'].item())
    except ValueError:
        tag.append('')
df['Tag'] = tag
To filter the dataframe based on a value in column Tag, e.g. Alfred:
df[df['Tag'].str.match('Alfred')]
Then save it to a csv file. Repeat for the other values. Alternatively, if you have a large number of names in the Tag column, use a for loop, as sketched below.
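A minimal sketch of that loop, assuming you want one file per distinct non-empty tag:
# write one CSV per non-empty tag
for name in df['Tag'].unique():
    if name:
        df[df['Tag'] == name].to_csv('%s.csv' % name, index=False)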
I am trying to convert the following data structure:
To the format below, in Python 3:
If your data looks like:
array = [['PIN: 123 COD: 222 \n', 'LOA: 124 LOC: Sea \n'],
['PIN:456 COD:555 \n', 'LOA:678 LOC:Chi \n']]
You can do this:
Step 1: use regular expressions to parse your data, because it is a string.
(See the re module documentation for more about regular expressions.)
raws = list()
for index in range(0, len(array)):
    raws.append(re.findall(r'(PIN|COD|LOA|LOC): ?(\w+)', str(array[index])))
Output:
[[('PIN', '123'), ('COD', '222'), ('LOA', '124'), ('LOC', 'Sea')], [('PIN', '456'), ('COD', '555'), ('LOA', '678'), ('LOC', 'Chi')]]
Step 2: extract the raw values and the column names.
columns = np.array(raws)[0,:,0]
raws = np.array(raws)[:,:,1]
Output:
raws -
[['123' '222' '124' 'Sea']
['456' '555' '678' 'Chi']]
columns -
['PIN' 'COD' 'LOA' 'LOC']
Step 3: now we can just create the df.
df = pd.DataFrame(raws, columns=columns)
Output:
PIN COD LOA LOC
0 123 222 124 Sea
1 456 555 678 Chi
Is this what you want?
I hope it helps; I'm not sure about your input format.
And don't forget to import the libraries! (I used pandas as pd, numpy as np, and re.)
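Putting the three steps together, a self-contained sketch with the imports (using the sample input assumed above):
import re
import numpy as np
import pandas as pd

array = [['PIN: 123 COD: 222 \n', 'LOA: 124 LOC: Sea \n'],
         ['PIN:456 COD:555 \n', 'LOA:678 LOC:Chi \n']]

# parse each record, then split the (key, value) pairs into columns and values
raws = [re.findall(r'(PIN|COD|LOA|LOC): ?(\w+)', str(row)) for row in array]
columns = np.array(raws)[0, :, 0]
values = np.array(raws)[:, :, 1]
df = pd.DataFrame(values, columns=columns)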
UPD: another way. I have created a log file like yours:
array = open('example.log').readlines()
Output:
['PIN: 123 COD: 222 \n',
'LOA: 124 LOC: Sea \n',
'PIN: 12 COD: 322 \n',
'LOA: 14 LOC: Se \n']
Then split by ' ', drop the '\n' and reshape:
raws = np.array([i.split(' ')[:-1] for i in array]).reshape(2, 4, 2)
In reshape, the first number is the row count of your future dataframe, the second is the column count, and the last one you don't need to change. It won't work if you don't have whitespace between the info and the '\n' in each row; if you don't, I will change the example.
Output:
array([[['PIN:', '123'],
['COD:', '222'],
['LOA:', '124'],
['LOC:', 'Sea']],
[['PIN:', '12'],
['COD:', '322'],
['LOA:', '14'],
['LOC:', 'Se']]],
dtype='<U4')
And then take raws and columns:
columns = np.array(raws)[:,:,0][0]
raws = np.array(raws)[:,:,1]
Finally, create the dataframe (and cut the last character, the ':', from the column names):
pd.DataFrame(raws, columns=[i[:-1] for i in columns])
Output:
PIN COD LOA LOC
0 123 222 124 Sea
1 12 322 14 Se
If you have many log files, you can do this for each one in a for-loop, save each dataframe in a list (say, DF_array), and then use pd.concat to make one dataframe out of the list of dataframes.
pd.concat(DF_array)
If you need it, I can add an example.
UPD:
I have created a directory with log files and then built a list of all the files from PATH:
PATH = "logs_data/"
files = [PATH + i for i in os.listdir(PATH)]
Then do a for-loop like in the last update:
dfs = list()
for f in files:
    array = open(f).readlines()
    # note the integer division: reshape needs an int, not a float
    raws = np.array([i.split(' ')[:-1] for i in array]).reshape(len(array) // 2, 4, 2)
    columns = np.array(raws)[:,:,0][0]
    raws = np.array(raws)[:,:,1]
    df = pd.DataFrame(raws, columns=[i[:-1] for i in columns])
    dfs.append(df)
result = pd.concat(dfs)
Output:
PIN COD LOA LOC
0 123 222 124 Sea
1 12 322 14 Se
2 1 32 4 Ses
0 15673 2324 13464 Sss
1 12452 3122 11234 Se
2 11 132 4 Ses
0 123 222 124 Sea
1 12 322 14 Se
2 1 32 4 Ses
I am using Python Pandas to try and match the references from CSV2 to the data in CSV1 and create a new output file.
CSV1
reference,name,house
234 8A,john,37
564 68R,bill,3
RT4 VV8,kate,88
76AA,harry ,433
CSV2
reference
234 8A
RT4 VV8
CODE
import pandas as pd
df1 = pd.read_csv(r'd:\temp\data1.csv')
df2 = pd.read_csv(r'd:\temp\data2.csv')
df3 = pd.merge(df1,df2, on= 'reference', how='inner')
df3.to_csv('outpt.csv')
I am getting a KeyError for 'reference' when I run it. Could it be the spaces in the data that are causing the issue? The data is comma delimited.
Most probably you have either leading or trailing whitespace in the reference column name after reading your CSV files.
You can check it this way:
print(df1.columns.tolist())
print(df2.columns.tolist())
you can "fix" it by adding sep=r'\s*,\s*' parameter to your pd.read_csv() calls
Example:
In [74]: df1
Out[74]:
reference name house
0 234 8A john 37
1 564 68R bill 3
2 RT4 VV8 kate 88
3 76AA harry 433
In [75]: df2
Out[75]:
reference
0 234 8A
1 RT4 VV8
In [76]: df2.columns.tolist()
Out[76]: ['reference ']
In [77]: df1.columns.tolist()
Out[77]: ['reference', 'name', 'house']
In [78]: df1.merge(df2, on='reference')
...
KeyError: 'reference'
Fixing df2:
import io

data = """\
reference
234 8A
RT4 VV8"""
df2 = pd.read_csv(io.StringIO(data), sep=r'\s*,\s*')
Now it works:
In [80]: df1.merge(df2, on='reference')
Out[80]:
reference name house
0 234 8A john 37
1 RT4 VV8 kate 88
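Alternatively, a sketch of another common fix, assuming the stray spaces may sit in both the header and the reference values themselves: strip them after reading, then merge.
# strip whitespace from the column names and the join key, then merge
for df in (df1, df2):
    df.columns = df.columns.str.strip()
    df['reference'] = df['reference'].str.strip()
df3 = df1.merge(df2, on='reference', how='inner')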