I am trying to convert the following data structure to the format below in Python 3:
If your data looks like:
array = [['PIN: 123 COD: 222 \n', 'LOA: 124 LOC: Sea \n'],
['PIN:456 COD:555 \n', 'LOA:678 LOC:Chi \n']]
You can do this:
Step 1: use regular expressions to parse your data, because it is stored as strings (see the re module documentation for more about regular expressions).
import re

raws = list()
for index in range(0, len(array)):
    raws.append(re.findall(r'(PIN|COD|LOA|LOC): ?(\w+)', str(array[index])))
Output:
[[('PIN', '123'), ('COD', '222'), ('LOA', '124'), ('LOC', 'Sea')], [('PIN', '456'), ('COD', '555'), ('LOA', '678'), ('LOC', 'Chi')]]
Step 2: extract the raw values and the column names.
columns = np.array(raws)[0,:,0]
raws = np.array(raws)[:,:,1]
Output:
raws -
[['123' '222' '124' 'Sea']
['456' '555' '678' 'Chi']]
columns -
['PIN' 'COD' 'LOA' 'LOC']
Step 3: now we can simply create the DataFrame.
df = pd.DataFrame(raws, columns=columns)
Output:
PIN COD LOA LOC
0 123 222 124 Sea
1 456 555 678 Chi
Is this what you want?
I hope it helps; I'm not sure about your exact input format.
And don't forget to import the libraries! (I used pandas as pd, numpy as np, and re.)
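For convenience, here are the three steps above as one self-contained, runnable sketch (same toy data; nothing beyond what the steps already use):

import re
import numpy as np
import pandas as pd

array = [['PIN: 123 COD: 222 \n', 'LOA: 124 LOC: Sea \n'],
         ['PIN:456 COD:555 \n', 'LOA:678 LOC:Chi \n']]

# Step 1: parse the key/value pairs out of each record's strings
raws = [re.findall(r'(PIN|COD|LOA|LOC): ?(\w+)', str(row)) for row in array]

# Steps 2-3: split the (name, value) tuples and build the frame
arr = np.array(raws)
df = pd.DataFrame(arr[:, :, 1], columns=arr[0, :, 0])
print(df)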
UPD: as another approach, I created a log file like yours:
array = open('example.log').readlines()
Output:
['PIN: 123 COD: 222 \n',
'LOA: 124 LOC: Sea \n',
'PIN: 12 COD: 322 \n',
'LOA: 14 LOC: Se \n']
Then split by ' ', drop the '\n' and reshape:
raws = np.array([i.split(' ')[:-1] for i in array]).reshape(2, 4, 2)
In reshape, the first number is the row count of your future dataframe, the second is the column count, and the last one you don't need to change. It won't work if there is no whitespace between the info and the '\n' in each row; if that's your case, I can change the example.
Output:
array([[['PIN:', '123'],
['COD:', '222'],
['LOA:', '124'],
['LOC:', 'Sea']],
[['PIN:', '12'],
['COD:', '322'],
['LOA:', '14'],
['LOC:', 'Se']]],
dtype='<U4')
And then take raws and columns:
columns = np.array(raws)[:,:,0][0]
raws = np.array(raws)[:,:,1]
Finally, create the dataframe (and cut the last character off each column name to drop the ':'):
pd.DataFrame(raws, columns=[i[:-1] for i in columns])
Output:
PIN COD LOA LOC
0 123 222 124 Sea
1 12 322 14 Se
If you have many log files, you can do this for each file in a for-loop, collect the dataframes in a list (say, DF_array) and then use pd.concat to build one dataframe from the list of dataframes.
pd.concat(DF_array)
If you need it, I can add an example.
UPD: I created a dir with log files and then built a list of all the files from PATH:
PATH = "logs_data/"
files = [PATH + i for i in os.listdir(PATH)]
Then run a for-loop like in the last update:
dfs = list()
for f in files:
    array = open(f).readlines()
    # note the integer division: reshape needs ints in Python 3
    raws = np.array([i.split(' ')[:-1] for i in array]).reshape(len(array) // 2, 4, 2)
    columns = np.array(raws)[:, :, 0][0]
    raws = np.array(raws)[:, :, 1]
    df = pd.DataFrame(raws, columns=[i[:-1] for i in columns])
    dfs.append(df)
result = pd.concat(dfs)
Output:
PIN COD LOA LOC
0 123 222 124 Sea
1 12 322 14 Se
2 1 32 4 Ses
0 15673 2324 13464 Sss
1 12452 3122 11234 Se
2 11 132 4 Ses
0 123 222 124 Sea
1 12 322 14 Se
2 1 32 4 Ses
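Note that the row index repeats (0, 1, 2 for each file) because every per-file dataframe keeps its own index. If you want a single running index instead, pass ignore_index=True:
result = pd.concat(dfs, ignore_index=True)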
I have two Pandas DataFrames with different numbers of columns.
df1 is a single-row DataFrame:
a X0 b Y0 c
0 233 100 56 shark -23
df2, instead, is a DataFrame with multiple rows:
d X0 e f Y0 g h
0 snow 201 32 36 cat 58 336
1 rain 176 99 15 tiger 63 845
2 sun 193 81 42 dog 48 557
3 storm 100 74 18 shark 39 673 # <-- This row
4 cloud 214 56 27 wolf 66 406
I would like to verify whether df1's row is in df2, but considering the X0 AND Y0 columns only, ignoring all the other columns.
In this example df1's row matches df2's row at index 3, which has 100 in X0 and 'shark' in Y0.
The output for this example is:
True
Note: True/False as output is enough for me; I don't care about the index of the matched row.
I found similar questions, but all of them check the entire row...
Use df.merge with an if condition checking the len:
In [219]: if len(df1[['X0', 'Y0']].merge(df2)):
...: print(True)
...:
True
OR:
In [225]: not (df1[['X0', 'Y0']].merge(df2)).empty
Out[225]: True
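For context: df1[['X0', 'Y0']].merge(df2) inner-joins on the columns the two frames share, here X0 and Y0, so the result is non-empty exactly when some df2 row matches df1 on both. Written out explicitly:
matched = df1[['X0', 'Y0']].merge(df2, on=['X0', 'Y0'])
print(not matched.empty)  # True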
Try this:
df2[(df2.X0.isin(df1.X0))&(df2.Y0.isin(df1.Y0))]
Output:
d X0 e f Y0 g h
3 storm 100 74 18 shark 39 673
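Since you only need True/False, you can reduce this to a boolean by checking whether the filtered frame is empty:
not df2[(df2.X0.isin(df1.X0)) & (df2.Y0.isin(df1.Y0))].empty
One caveat: isin tests each column independently, so with a multi-row df1 this could report a match where X0 comes from one df1 row and Y0 from another; with the single-row df1 here it is exact.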
duplicated
df2.append(df1).duplicated(['X0', 'Y0']).iat[-1]
True
Save a tad bit of time
df2[['X0', 'Y0']].append(df1[['X0', 'Y0']]).duplicated().iat[-1]
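A note for readers on newer pandas: DataFrame.append was removed in pandas 2.0, but the same trick works with pd.concat:
pd.concat([df2, df1]).duplicated(['X0', 'Y0']).iat[-1]
This stacks df1's row under df2, flags ('X0', 'Y0') pairs that already appeared higher up, and reads the flag of that last, appended row.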
I have two dataframes, one with some missing values and another with the values that should replace them, so the 2nd dataframe is shorter than the 1st one.
The missing values in the first dataframe are marked by either "Height Info Not Found" or "Player Info Not Found".
Is there a way to replace the missing values in the first dataframe with the corresponding values from the 2nd dataframe without looping?
I tried using .map(), but values that are not replaced come back as NaN.
filled_df['height']= filled_df['height'].astype(str) #dataframe with real values
main_df['height']= main_df['height'].astype(str) #dataframe with missing values
mapping = dict(filled_df[['name','height']].values)
main_df['height'] = main_df['url_names'].map(mapping,na_action='ignore')
print(main_df)
name url_names height
0 John Mcenroe John_Mcenroe Height Info Not Found
1 Jimmy Connors Jimmy_Connors Player Info Not Found
2 Ivan Lendl Ivan_Lendl 1.88 m (6 ft 2 in)
3 Mats Wilander Mats_Wilander 1.83 m (6 ft 0 in)
4 Andres Gomez Andres_Gomez 1.93 m (6 ft 4 in)
5 Anders Jarryd Anders_Jarryd 1.80 m (5 ft 11 in)
6 Henrik Sundstrom Henrik_Sundstrom 1.88 m (6 ft 2 in)
7 Pat Cash Pat_Cash Height Info Not Found
8 Eliot Teltscher Eliot_Teltscher 1.75 m (5 ft 9 in)
9 Yannick Noah Yannick_Noah 1.93 m (6 ft 4 in)
10 Joakim Nystrom Joakim_Nystrom 1.87 m (6 ft 2 in)
11 Aaron Krickstein Aaron_Krickstein 6 ft 2 in (1.88 m)
12 Johan Kriek Johan_Kriek 1.75 m (5 ft 9 in)
name height
0 John_Mcenroe 1.80
1 Jimmy_Connors 1.78
2 Pat_Cash 183
3 Jimmy_Arias 175
4 Juan_Aguilera 1.82
5 Henri_Leconte 1.84
6 Balazs_Taroczy 1.82
7 Sammy_Giammalva_Jr 1.78
8 Thierry_Tulasne 1.77
I think you need to replace only the missing values, using values matched via the dictionary:
main_df['height'] = main_df['height'].fillna(main_df['url_names'].map(mapping))
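One catch: the missing values here are the strings 'Height Info Not Found' and 'Player Info Not Found', not NaN, so fillna alone won't see them. A sketch that first converts the placeholders to NaN (reusing the mapping dict from the question, and assuming numpy is imported as np):

import numpy as np

placeholders = ['Height Info Not Found', 'Player Info Not Found']
main_df['height'] = main_df['height'].replace(placeholders, np.nan)
main_df['height'] = main_df['height'].fillna(main_df['url_names'].map(mapping))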
This code can do the job
import pandas as pd
d = {'url_names': ['John_Mcenroe', 'Jimmy_Connors', 'Ivan_Lendl'], 'height': ['Height Info Not Found', 'Player Info Not Found', '1.88 m (6 ft 2 in)']}
main_df = pd.DataFrame(d)
d = {'url_names': ['John_Mcenroe', 'Jimmy_Connors'], 'height': ['1.80', '1.78']}
filled_df = pd.DataFrame(d)
missing = (main_df.height == 'Height Info Not Found') | (main_df.height == 'Player Info Not Found')
df1 = main_df[missing].drop(['height'], axis=1).merge(filled_df, on="url_names")
df2 = main_df[~missing]
pd.concat([df1, df2])
I have the following multi-indexed dataframe:
count
site_url visit_id
a.com 1 100
11 102
21 99
b.com 2 231
12 229
22 229
Where all the first-level index groups have the same size, that is, for each site_url I have an equal number N of elements (I only included 3 in my example, but there are actually more).
I would like to turn this data frame into:
a.com b.com
visit_index
1 100 231
2 102 229
3 99 229
.
.
.
N ... ...
So that I can plot each column as a line with N data points.
The issue I'm running into is, how do I convert each visit_id value (all unique) into a "visit index relative to its website"? That is, visit ids 1, 11, 21, ... for website a.com would map to visit indexes 1, 2, 3, ... N, and so would visit ids for b.com, c.com, etc.
Once this is done, I believe I can use values of site_url as columns with df.unstack().
df['visit_index'] = df.groupby(level='site_url').cumcount()
df = df.reset_index().set_index('visit_index')
print(df.pivot(columns='site_url', values='count'))
Prints:
site_url a.com b.com
visit_index
0 100 231
1 102 229
2 99 229
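The visit_index here starts at 0 rather than at 1 as in your target layout; if that matters, shift the cumcount by one in the first line:
df['visit_index'] = df.groupby(level='site_url').cumcount() + 1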
I have two CSV files that I created with Python from unstructured data, but I don't want my script to output two files when I run it on a JSON. So let's say I have File 1 with columns as follows:
File 1:
feats ID A B C E
AA 123 3343 234 2342 112
BB 121 3342 237 2642 213
CC 122 3341 232 2352 912
DD 123 3343 233 5342 12
EE 121 3345 235 2442 2112
...and so on with, let's say, 10000 rows of different values and 6 columns. Now I want to check the values of column "ID" against File 2 and merge on the values of ID.
File 2:
Char_Name ID Cosmic Awareness
Uatu 123 3.4
Galan 121 4.5
Norrin Radd 122 1.6
Shalla-bal 124 0.3
Nova 125 1.2
File 2 has only 5 rows, one for each of 5 different ID values, and, let's say, 23 columns. I can do this easily with map or apply in pandas, but I'm dealing with thousands of files and don't want to do that. Is there any way of mapping the File 2 values (the name and cosmic awareness columns) onto File 1, adding new columns titled 'Char_Name' and 'Cosmic Awareness' (from File 2) by matching the corresponding ID values in File 1 and File 2? The expected output should be somewhat like this:
Final File:
feats ID A B C E Char_Name Cosmic Awareness
AA 123 3343 234 2342 112 Uatu 3.4
BB 121 3342 237 2642 213 Galan 4.5
CC 122 3341 232 2352 912 Norrin Radd 1.6
DD 123 3343 233 5342 12 Uatu 3.4
EE 121 3345 235 2442 2112 Galan 4.5
Thanks in advance, and if there is any way to improve this question, suggestions are welcome; I will incorporate them here. I have added the expected outcome above.
I think you need glob to collect all the file names and then create the DataFrames in a list comprehension:
from functools import reduce
import glob
import pandas as pd

files = glob.glob('files/*.csv')
dfs = [pd.read_csv(fp) for fp in files]
Last, merge them all together:
df = reduce(lambda left, right: pd.merge(left, right, on='ID'), dfs)
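For just the two files shown in the question (the file names here are hypothetical), the same idea collapses to a single merge; how='left' keeps every row of File 1 even if an ID has no match in File 2:

df1 = pd.read_csv('file1.csv')  # feats, ID, A, B, C, E
df2 = pd.read_csv('file2.csv')  # Char_Name, ID, Cosmic Awareness
out = df1.merge(df2, on='ID', how='left')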
For an outer join, it is possible to use concat:
import glob
import pandas as pd

files = glob.glob('files/*.csv')
dfs = [pd.read_csv(fp, index_col=['ID']) for fp in files]
df = pd.concat(dfs, axis=1)
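Here axis=1 aligns the frames on the shared ID index, so an ID missing from one file gets NaN in that file's columns (the outer-join behavior mentioned above). If you want ID back as an ordinary column afterwards, add reset_index():
df = pd.concat(dfs, axis=1).reset_index()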
I am using Python Pandas to try and match the references from CSV2 to the data in CSV1 and create a new output file.
CSV1
reference,name,house
234 8A,john,37
564 68R,bill,3
RT4 VV8,kate,88
76AA,harry ,433
CSV2
reference
234 8A
RT4 VV8
CODE
import pandas as pd
df1 = pd.read_csv(r'd:\temp\data1.csv')
df2 = pd.read_csv(r'd:\temp\data2.csv')
df3 = pd.merge(df1,df2, on= 'reference', how='inner')
df3.to_csv('outpt.csv')
I am getting a KeyError for 'reference' when I run it. Could it be the spaces in the data that are causing the issue? The data is comma delimited.
Most probably you have either leading or trailing whitespace in the reference column name after reading your CSV files.
You can check it this way:
print(df1.columns.tolist())
print(df2.columns.tolist())
you can "fix" it by adding sep=r'\s*,\s*' parameter to your pd.read_csv() calls
Example:
In [74]: df1
Out[74]:
reference name house
0 234 8A john 37
1 564 68R bill 3
2 RT4 VV8 kate 88
3 76AA harry 433
In [75]: df2
Out[75]:
reference
0 234 8A
1 RT4 VV8
In [76]: df2.columns.tolist()
Out[76]: ['reference ']
In [77]: df1.columns.tolist()
Out[77]: ['reference', 'name', 'house']
In [78]: df1.merge(df2, on='reference')
...
KeyError: 'reference'
Fixing df2 (note the io import):
import io

data = """\
reference
234 8A
RT4 VV8"""

df2 = pd.read_csv(io.StringIO(data), sep=r'\s*,\s*')
Now it works:
In [80]: df1.merge(df2, on='reference')
Out[80]:
reference name house
0 234 8A john 37
1 RT4 VV8 kate 88
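An alternative fix that avoids the regex separator (and the Python engine it forces) is to strip the whitespace after reading. A minimal sketch, stripping the values too in case the data cells are padded as well as the header:

df1 = pd.read_csv(r'd:\temp\data1.csv')
df2 = pd.read_csv(r'd:\temp\data2.csv')

# strip stray whitespace from the column names
df1.columns = df1.columns.str.strip()
df2.columns = df2.columns.str.strip()

# strip stray whitespace inside the join key as well
df1['reference'] = df1['reference'].str.strip()
df2['reference'] = df2['reference'].str.strip()

df3 = df1.merge(df2, on='reference', how='inner')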