CSV value mapping from 2 files like map in pandas - python

I have two CSV files that I created with Python from unstructured data, but I don't want my script to output two files when I run it on a JSON. So let's say I have File 1 with columns as follows:
File 1:
feats ID A B C E
AA 123 3343 234 2342 112
BB 121 3342 237 2642 213
CC 122 3341 232 2352 912
DD 123 3343 233 5342 12
EE 121 3345 235 2442 2112
...and so on with, let's say, 10000 rows of different values and 6 columns. Now I want to check the values of column "ID" against file 2 and merge on the values of ID.
File 2:
Char_Name ID Cosmic Awareness
Uatu 123 3.4
Galan 121 4.5
Norrin Radd 122 1.6
Shalla-bal 124 0.3
Nova 125 1.2
File 2 has only 5 rows, for 5 different ID values, and let's say 23 columns. I can do this easily with map or apply in pandas, but I'm dealing with thousands of files and don't want to do that. Is there any way of mapping the file 2 values (the name and cosmic awareness columns) onto File 1, adding new columns titled 'name' and 'cosmic' (from file 2) by matching the corresponding ID values in File 1 and File 2? The expected output should be somewhat like this.
Final File:
feats ID A B C E Char_Name Cosmic Awareness
AA 123 3343 234 2342 112 Uatu 3.4
BB 121 3342 237 2642 213 Galan 4.5
CC 122 3341 232 2352 912 Norrin Radd 1.6
DD 123 3343 233 5342 12 Uatu 3.4
EE 121 3345 235 2442 2112 Galan 4.5
Thanks in advance, and if there is any way to improve this question, suggestions are welcome; I will incorporate them here. I have added the expected outcome above.

I think you need glob for all the file names and then a list comprehension to create the DataFrames:
from functools import reduce
import glob
import pandas as pd
files = glob.glob('files/*.csv')
dfs = [pd.read_csv(fp) for fp in files]
Finally, merge them all together:
df = reduce(lambda left,right: pd.merge(left,right,on='ID'), dfs)
For an outer join it is possible to use concat:
import glob
import pandas as pd
files = glob.glob('files/*.csv')
dfs = [pd.read_csv(fp, index_col=['ID']) for fp in files]
df = pd.concat(dfs, axis=1)
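Applied to just the two sample files from the question, this is a single merge. A minimal sketch, assuming the files are saved as file1.csv and file2.csv (hypothetical names):
import pandas as pd
df1 = pd.read_csv('file1.csv')   # feats, ID, A, B, C, E
df2 = pd.read_csv('file2.csv')   # Char_Name, ID, Cosmic Awareness
# a left join keeps every File 1 row and pulls in the matching File 2 columns by ID
final = df1.merge(df2[['ID', 'Char_Name', 'Cosmic Awareness']], on='ID', how='left')
final.to_csv('final.csv', index=False)
Rows whose ID has no match in File 2 get NaN in the new columns instead of being dropped.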

Related

How to get a particular Excel cell value for use in a while loop

I am new to Python and pandas. I have a text file (data.txt) whose content looks like "123 456 789 101123 456 789 101 112 113 110 112 123 456 789 101 112 113 110 113 123 456 789 101 112 113 110 110 ............. " etc., and an Excel file (combination.xlsx) that carries some combinations (in the Excel sheet, cell A1 = 123 456, A2 = 456 789, A3 = 789 101123, .................). My problem is how to get each cell value from combination.xlsx, count how often it occurs in data.txt, and print the result to another text file (final.txt). I want to make a while loop that starts by picking the first cell value (A1); if its count is equal to or greater than 1 it prints to final.txt, otherwise it picks the second cell value (A2), and so on until the cell value/data is empty.
It seems to me that you don't need an explicit while loop here. You can get each cell value using pd.read_excel,
which returns a dataframe with all cells. To count the frequency of occurrence, for each row of the dataframe you can apply len over re.findall with the regular expression \b({x})\b. This regex ensures that the number sequence (x in this particular f-string) matches only between word boundaries. To print to another file, you can use df["Qnt"].to_csv.
import pandas as pd
import re
data_txt = "123 456 789 101123 456 789 101 112 113 110 112 123 456 789 101 112 113 110 113 123 456 789 101 112 113 110 110"
# read XLSX cells
df = pd.read_excel("combination.xlsx", header=None, names=["Comb"])
# count occurrences
find_qnt = lambda x: len(re.findall(rf"\b({x})\b", data_txt))
# apply to each row
df["Qnt"] = df["Comb"].apply(find_qnt)
print(df)
# print into another text file
df["Qnt"].to_csv("final.txt", index=False)
Output from df
Comb Qnt
0 123 456 3
1 456 789 4
2 789 101123 1
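If the numbers really live in data.txt rather than in a hard-coded string, only the assignment of data_txt changes; a minimal sketch, assuming data.txt sits next to the script:
with open("data.txt") as fh:
    data_txt = fh.read()
The rest of the snippet (read_excel, the apply, and to_csv) stays exactly the same.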

Splitting string multiple times in DataFrame

I have a column in a DataFrame that contains a string from which I must retrieve two pieces of information by different separators:
ID STR
280 11040402-38.58551%;11050101-9.29086%;11070101-52.12363%
351 11130203-35%;11130230-65%
510 11070103-69%
655 11090103-41.63463%;11160102-58.36537%
666 11130205-50.00%;11130207-50%
I have been trying to use the .apply method on this series together with a lambda function to make the splitting in one go, to no avail:
df['STR'].apply(lambda x: y.split('-') for y in x.split(';'))
Ideally, not only I would be able to split the string in one go, but also separate the left side of the - from the right side:
ID STR.LEFT STR.RIGHT
280 [11040402, 11050101, 11070101] [38.58551%, 9.29086%, 52.12363%]
351 [11130203, 11130230] [35%, 65%]
510 [11070103] [69%]
655 [11090103, 11160102] [41.63463%, 58.36537%]
666 [11130205, 11130207] [50.00%, 50%]
I believe this could be achievable with .apply and slicing, but any other solution is welcome.
You can try splitting several times:
# set ID as index
df.set_index('ID', inplace=True)
new_series = df.STR.str.split(';', expand=True).stack().reset_index(level=-1,drop=True)
new_df = new_series.str.split('-', expand=True)
new_df.groupby('ID').agg(list).reset_index()
Output:
ID 0 1
-- ---- ------------------------------------ --------------------------------------
0 280 ['11040402', '11050101', '11070101'] ['38.58551%', '9.29086%', '52.12363%']
1 351 ['11130203', '11130230'] ['35%', '65%']
2 510 ['11070103'] ['69%']
3 655 ['11090103', '11160102'] ['41.63463%', '58.36537%']
4 666 ['11130205', '11130207'] ['50.00%', '50%']
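To match the column names requested in the question, the grouped result can simply be relabelled; a small, assumed addition to the snippet above:
out = new_df.groupby('ID').agg(list).reset_index()
out.columns = ['ID', 'STR.LEFT', 'STR.RIGHT']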
str.split
Assuming the pattern always leaves 'l-r;l-r;l-r...'
s = df.STR.str.split('-|;')
df[['ID']].join(pd.concat({'STR.LEFT': s.str[::2], 'STR.RIGHT': s.str[1::2]}, axis=1))
ID STR.LEFT STR.RIGHT
0 280 [11040402, 11050101, 11070101] [38.58551%, 9.29086%, 52.12363%]
1 351 [11130203, 11130230] [35%, 65%]
2 510 [11070103] [69%]
3 655 [11090103, 11160102] [41.63463%, 58.36537%]
4 666 [11130205, 11130207] [50.00%, 50%]
If you want to explode these lists into separate rows
import numpy as np
s = df.STR.str.split('-|;')
i = np.arange(len(df)).repeat(s.str.len() // 2)
d = {'STR.LEFT': np.concatenate(s.str[::2]),
     'STR.RIGHT': np.concatenate(s.str[1::2])}
df[['ID']].iloc[i].assign(**d).reset_index(drop=True)
ID STR.LEFT STR.RIGHT
0 280 11040402 38.58551%
1 280 11050101 9.29086%
2 280 11070101 52.12363%
3 351 11130203 35%
4 351 11130230 65%
5 510 11070103 69%
6 655 11090103 41.63463%
7 655 11160102 58.36537%
8 666 11130205 50.00%
9 666 11130207 50%
A single str.extractall call will suffice to extract the pairs into separate columns. You can then aggregate them into lists using groupby.
(df['STR'].str.extractall(r'([^;]+)-([^;]+)')
.groupby(level=0)
.agg(list)
.set_axis(['STR.LEFT', 'STR.RIGHT'], axis=1, inplace=False))
STR.LEFT STR.RIGHT
0 [11040402, 11050101, 11070101] [38.58551%, 9.29086%, 52.12363%]
1 [11130203, 11130230] [35%, 65%]
2 [11070103] [69%]
3 [11090103, 11160102] [41.63463%, 58.36537%]
4 [11130205, 11130207] [50.00%, 50%]
To join with ID, you use just that: join.
(df['STR'].str.extractall(r'([^;]+)-([^;]+)')
.groupby(level=0)
.agg(list)
.set_axis(['STR.LEFT', 'STR.RIGHT'], axis=1, inplace=False)
.join(df['ID']))
STR.LEFT STR.RIGHT ID
0 [11040402, 11050101, 11070101] [38.58551%, 9.29086%, 52.12363%] 280
1 [11130203, 11130230] [35%, 65%] 351
2 [11070103] [69%] 510
3 [11090103, 11160102] [41.63463%, 58.36537%] 655
4 [11130205, 11130207] [50.00%, 50%] 666
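If the per-row layout (one pair per row) is wanted instead of lists, the extractall result already has one row per match, so the groupby can be skipped and the ID joined back in; a sketch of that variant:
pairs = df['STR'].str.extractall(r'([^;]+)-([^;]+)')
pairs.columns = ['STR.LEFT', 'STR.RIGHT']
pairs = (pairs.reset_index(level='match', drop=True)   # keep only the original row index
              .join(df['ID'])                           # attach the matching ID
              .reset_index(drop=True))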

Pandas - Grouping columns based on other columns and tagging them into new column

I have a data frame which I want to group based on the value of another column in the same data frame.
For example:
The Parent_ID and Child ID are linked and define who is related to whom in a hierarchical tree.
The dataframe looks like (input from a csv file)
No Name ID Parent_Id
1 Tom 211 111
2 Galie 209 111
3 Remo 200 101
4 Carmen 212 121
5 Alfred 111 191
6 Marvela 101 111
7 Armin 234 101
8 Boris 454 109
9 Katya 109 323
I would like to group this data frame based on ID and Parent_ID into the grouping below, and generate CSV files out of it based on the top-level parent, i.e. Alfred.csv, Carmen.csv (which will have only its own entry, i.e. line #4), Katya.csv, using the to_csv() function.
Alfred
|_ Galie
|_ Tom
|_ Marvela
   |_ Remo
   |_ Armin
Carmen
Katya
|_ Boris
And, I want to create a new column in the same data frame, that will have a tag indicating the hierarchy. Like:
No Name ID Parent_Id Tag
1 Tom 211 111 Alfred
2 Galie 209 111 Alfred
3 Remo 200 101 Marvela, Alfred
4 Carmen 212 121
5 Alfred 111 191
6 Marvela 101 111 Alfred
7 Armin 234 101 Marvela, Alfred
8 Boris 454 109 Katya
9 Katya 109 323
Note that the names can repeat, but the ID will be unique.
Kindly let me know how to achieve this using pandas. I tried out groupby(), but it seems a little complicated and I am not getting what I intend. There should be one file for each parent, with the child records in the parent file.
If a child has other children (like Marvela), it qualifies to have its own CSV file.
And the final output would be
Alfred.csv - All records matching Galie, Tom, Marvela
Marvela.csv - All records matching Remo, Armin
Carmen.csv - Only the record matching Carmen (its own row)
Katya.csv - all records matching katya, boris
I am assuming your dataframe is built from a dictionary:
mydf = ({"No":[1,2,3,4,5,6,7,8,9],"Name":["Tom","Galie","Remo","Carmen","Alfred","Marvela","Armin","Boris","Katya"],
"ID":[211,209,200,212,111,101,234,454,109],"Parent_Id":[111,111,101,121,191,111,101,109,323]})
df = pd.DataFrame(mydf)
Then I look up the parent's name from each row's Parent_Id and finally store them in a new column:
tag = []
for z in df['Parent_Id']:
    try:
        tag.append(df.query('ID==%s' % z)['Name'].item())
    except:
        tag.append('')
df['Tag'] = tag
To filter the dataframe based on a value in column Tag, e.g. Alfred:
df[df['Tag'].str.match('Alfred')]
Then save it to a CSV file and repeat for the other values. Alternatively, if you have a large number of names in column Tag, use a for loop.
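A sketch of that loop, based on the Tag column built above (note it uses the immediate-parent tag only, not the full 'Marvela, Alfred' chains from the question):
for tag in df['Tag'].unique():
    if not tag:                      # rows whose parent is not in the table get an empty tag
        continue
    df[df['Tag'] == tag].to_csv('%s.csv' % tag, index=False)
This writes one CSV per parent that actually has children; Carmen, who has no children and no known parent here, would not get a file from this loop.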

How to preserve the column ordering when accessing a multi-index dataframe using `.loc`?

Let's be given the following dataframe with multi-index columns
import numpy as np
import pandas as pd
a = ['i', 'ii']
b = list('abc')
mi = pd.MultiIndex.from_product([a,b])
df = pd.DataFrame(np.arange(100,100+len(mi)*3).reshape([-1,len(mi)]),
columns=mi)
print(df)
# i ii
# a b c a b c
# 0 100 101 102 103 104 105
# 1 106 107 108 109 110 111
# 2 112 113 114 115 116 117
Using .loc[] and pd.IndexSlice I try to select the columns 'c' and 'b', in that very ordering.
idx = pd.IndexSlice
df.loc[:, idx[:, ['c','b']]]
However, if I look at the output, the requested ordering is not respected!
# i ii
# b c b c
# 0 101 102 104 105
# 1 107 108 110 111
# 2 113 114 116 117
Here are my questions:
Why is the ordering not preserved by pandas? I consider this pretty dangerous, because the list ['c', 'b'] implies an ordering from a user point of view.
How to access the columns via loc[] while preserving the ordering at the same time?
Update: (02.02.2020)
The issue has been identified as a pandas bug. In the process of fixing it, this related issue was identified, which addresses a semantic ambiguity for expressions like df.loc[:, pd.IndexSlice[:, ['c','b']]].
In the meantime, the problem can be circumvented using the approach described in the accepted answer.
Quoting from this link:
I don't think we make guarantees about the order of returned values
from a .loc operation so I am inclined to say this is not a bug but
let's see what others say
So we should be using reindex instead:
df.reindex(columns=pd.MultiIndex.from_product([a,['c','b']]))
i ii
c b c b
0 102 101 105 104
1 108 107 111 110
2 114 113 117 116
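If you prefer not to hardcode the level-0 labels, you can build the column order explicitly and pass it to reindex; a plain list of label tuples given to .loc should also keep the requested order. A sketch, continuing with the df defined above:
cols = [(top, sub) for top in df.columns.levels[0] for sub in ['c', 'b']]
df.reindex(columns=cols)   # same result as the reindex above
df.loc[:, cols]            # explicit list of tuples, order preserved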

Convert a list of lists to a structured pandas dataframe

I am trying to convert the following data structure;
To the format below in python 3;
If your data looks like:
array = [['PIN: 123 COD: 222 \n', 'LOA: 124 LOC: Sea \n'],
['PIN:456 COD:555 \n', 'LOA:678 LOC:Chi \n']]
You can do this:
Step 1: use regular expressions to parse your data, because it is a string.
(See the re documentation for more about regular expressions.)
raws = list()
for index in range(0, len(array)):
    raws.append(re.findall(r'(PIN|COD|LOA|LOC): ?(\w+)', str(array[index])))
Output:
[[('PIN', '123'), ('COD', '222'), ('LOA', '124'), ('LOC', 'Sea')], [('PIN', '456'), ('COD', '555'), ('LOA', '678'), ('LOC', 'Chi')]]
Step 2: extract the raw values and column names.
columns = np.array(raws)[0,:,0]
raws = np.array(raws)[:,:,1]
Output:
raws -
[['123' '222' '124' 'Sea']
['456' '555' '678' 'Chi']]
columns -
['PIN' 'COD' 'LOA' 'LOC']
Step 3: now we can just create the df.
df = pd.DataFrame(raws, columns=columns)
Output:
PIN COD LOA LOC
0 123 222 124 Sea
1 456 555 678 Chi
Is this what you want?
I hope it helps; I'm not sure about your input format.
And don't forget to import the libraries! (I used pandas as pd, numpy as np, and re.)
UPD: another way. I have created a log file like yours:
array = open('example.log').readlines()
Output:
['PIN: 123 COD: 222 \n',
'LOA: 124 LOC: Sea \n',
'PIN: 12 COD: 322 \n',
'LOA: 14 LOC: Se \n']
Then split by ' ', drop '\n' and reshape:
raws = np.array([i.split(' ')[:-1] for i in array]).reshape(2, 4, 2)
In reshape, the first number is the row count of your future dataframe, the second is the column count, and the last one you don't need to change. It won't work if you don't have a whitespace between the info and '\n' in each row. If you don't, I will change the example.
Output:
array([[['PIN:', '123'],
['COD:', '222'],
['LOA:', '124'],
['LOC:', 'Sea']],
[['PIN:', '12'],
['COD:', '322'],
['LOA:', '14'],
['LOC:', 'Se']]],
dtype='|S4')
And then take the raw values and columns:
columns = np.array(raws)[:,:,0][0]
raws = np.array(raws)[:,:,1]
Finally, create the dataframe (and cut the last symbol from the column names):
pd.DataFrame(raws, columns=[i[:-1] for i in columns])
Output:
PIN COD LOA LOC
0 123 222 124 Sea
1 12 322 14 Se
If you have many log files, you can do that for each one in a for loop, save each dataframe in a list (for example, a list called DF_array) and then use pd.concat to make one dataframe from the list of dataframes.
pd.concat(DF_array)
If you need I can add an example.
UPD:
I have created a dir with log files and then made a list with all files from PATH:
PATH = "logs_data/"
files = [PATH + i for i in os.listdir(PATH)]
Then do for-loop like in last update:
dfs = list()
for f in files:
    array = open(f).readlines()
    raws = np.array([i.split(' ')[:-1] for i in array]).reshape(len(array) // 2, 4, 2)
    columns = np.array(raws)[:,:,0][0]
    raws = np.array(raws)[:,:,1]
    df = pd.DataFrame(raws, columns=[i[:-1] for i in columns])
    dfs.append(df)
result = pd.concat(dfs)
Output:
PIN COD LOA LOC
0 123 222 124 Sea
1 12 322 14 Se
2 1 32 4 Ses
0 15673 2324 13464 Sss
1 12452 3122 11234 Se
2 11 132 4 Ses
0 123 222 124 Sea
1 12 322 14 Se
2 1 32 4 Ses
