This question has been on my mind for a long time. I have log files that I want to convert to CSV. My problem is that empty fields have been omitted from the log files. I want to end up with a CSV file containing all fields.
At the moment I parse the log files and write them to XML, because one of the nice features of Microsoft Excel is that when you open an XML file whose records have different numbers of elements, Excel shows all elements as separate columns.
Last week I came up with the idea that this might be possible with Pandas, but I cannot find a good example of how to do it.
Does anyone have a good idea how I can get this done?
Updated
I can't share the actual logs here. Below is a fictional sample:
Sample 1:
First : John Last : Doe Address : Main Street Email : j_doe#notvalid.gov Sex : male State : TX City : San Antonio Country : US Phone : 210-354-4030
First : Carolyn Last : Wysong Address : 1496 Hewes Avenue Sex : female State : TX City : KEMPNER Country : US Phone : 832-600-8133 Bank_Account : 0123456789
regex :
matches = re.findall(r'(\w+) : (.*?) ', line, re.IGNORECASE)
Sample 2:
:1: John :2: Doe :3: Main Street :4: j_doe#notvalid.gov :5: male :6: TX :7: San Antonio :8: US :9: 210-354-4030
:1: Carolyn :2: Wysong :3: 1496 Hewes Avenue :5: female :6: TX :7: KEMPNER :8: US :9: 832-600-8133 :10: 0123456789
regex:
matches = re.findall(r':(\d+): (.*?) ', line, re.IGNORECASE)
Allow me to concentrate on your first example. Your regex only matches the first word of each field, but let's keep it like that for now as I'm sure you can easily fix that.
You can create a pandas DataFrame to store your parsed data, then for each line you run your regexp, convert it to a dictionary and load it into a pandas Series. Then you append it to your dataframe. Pandas is smart enough to fill missing data with NaN.
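import re
import pandas as pd

df = pd.DataFrame()
for l in lines:
    matches = re.findall(r'(\w+) : (.*?) ', l, re.IGNORECASE)
    s = pd.Series(dict(matches))
    # DataFrame.append was removed in pandas 2.0; pd.concat does the same job
    df = pd.concat([df, s.to_frame().T], ignore_index=True)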
>>> print(df)
Address City Country Email First Last Sex State Phone
0 Main San US j_doe#notvalid.gov John Doe male TX NaN
1 1496 KEMPNER US NaN Carolyn Wysong female TX 832-600-8133
I'm not sure the dict step is needed, maybe there's a pandas way to directly parse your list of tuples.
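A minimal sketch of a more direct route, assuming every field looks like "key : value" and that a value never itself contains " word : ". The lookahead pattern below is my own assumption, not part of the original regex; it lets a value span several words, and pd.DataFrame accepts the resulting list of dicts directly, filling missing keys with NaN.

```python
import re
import pandas as pd

# Fictional sample lines taken from the question.
lines = [
    "First : John Last : Doe Address : Main Street Email : j_doe#notvalid.gov "
    "Sex : male State : TX City : San Antonio Country : US Phone : 210-354-4030",
    "First : Carolyn Last : Wysong Address : 1496 Hewes Avenue Sex : female "
    "State : TX City : KEMPNER Country : US Phone : 832-600-8133 "
    "Bank_Account : 0123456789",
]

# Assumed pattern: a value runs until the next " key : " or the end of the line.
FIELD = re.compile(r"(\w+) : (.*?)(?= \w+ : |$)")

# One dict per line; keys that a line lacks simply do not appear in its dict.
records = [dict(FIELD.findall(line)) for line in lines]

# Building the frame in one go; missing keys become NaN.
df = pd.DataFrame(records)
print(df)
```

Building the frame once from a list is usually cheaper than growing it row by row inside the loop.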
Then you can easily convert it to CSV; you will retain all your columns, with empty fields where appropriate.
df.to_csv("result.csv", index=False)
>>> !cat result.csv
Address,City,Country,Email,First,Last,Sex,State,Phone
Main,San,US,j_doe#notvalid.gov,John,Doe,male,TX,
1496,KEMPNER,US,,Carolyn,Wysong,female,TX,832-600-8133
As for performance on big files: if you know all the field names in advance, you can initialize the dataframe with a columns argument and run the parsing and CSV saving one chunk at a time. IIRC there's a mode parameter for to_csv that should allow you to append to an existing file.
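The chunked approach could look roughly like the sketch below. The field list, the regex, the function name log_to_csv and the chunk size are all assumptions made for illustration; the only pandas features relied on are the columns argument of DataFrame and the mode and header parameters of to_csv.

```python
import re
import pandas as pd

# Assumed to be known up front; this fixes the column order of the CSV.
FIELDS = ["First", "Last", "Address", "Email", "Sex", "State",
          "City", "Country", "Phone", "Bank_Account"]
# Illustrative pattern: a value runs until the next " key : " or end of line.
PATTERN = re.compile(r"(\w+) : (.*?)(?= \w+ : |$)")


def log_to_csv(lines, out_path, chunk_size=1000):
    """Parse an iterable of log lines and write them to CSV chunk by chunk."""
    chunk, first = [], True

    def flush():
        nonlocal chunk, first
        if not chunk:
            return
        # Same columns for every chunk, so appended rows line up with the header.
        df = pd.DataFrame(chunk, columns=FIELDS)
        # First chunk creates the file with a header, later chunks append rows.
        df.to_csv(out_path, mode="w" if first else "a", header=first, index=False)
        chunk, first = [], False

    for line in lines:
        chunk.append(dict(PATTERN.findall(line.rstrip("\n"))))
        if len(chunk) >= chunk_size:
            flush()
    flush()  # write whatever is left over
```

Because lines can be any iterable, an open file object can be passed in directly, so the whole log never has to sit in memory.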
Related
I have a *.txt file coming from a SQL query organised in rows.
I'm reading it with pandas library through:
df = pd.read_csv('./my_file_path/my_file.txt', sep='\n', header=0)
df.rename(columns = {list(df.columns)[0]: 'cols'}, inplace = True)
The output is a set of rows with the information separated by spaces in a standard structure (dots are meant to be spaces):
name................address........country..........age
0 Bob.Hope............Broadway.......United.States....101
1 Richard.Donner......Park.Avenue....United.States.....76
2 Oscar.Meyer.........Friedrichshain.Germany...........47
I tried to create a dictionary to get the info with list comprehensions:
col_dict = {'name': [df.cols[i][0:20].strip() for i in range(0, len(df.cols))],
            'address': [df.cols[i][21:36].strip() for i in range(0, len(df.cols))],
            'country': [df.cols[i][36:52].strip() for i in range(0, len(df.cols))],
            'age': [df.cols[i][53:].strip() for i in range(0, len(df.cols))],
            }
This script runs well and creates a dictionary as a basis for a dataframe to work with. But I was wondering whether there is a more pythonic way to write it, looping directly through a dictionary with the column names and avoiding the repetition of the same code for every column (the actual dataset is much longer).
The question is how I can store the string slice indices so that I can use them later with the column names to parse everything at once.
You can read it directly with pandas:
df = pd.read_csv('./my_file_path/my_file.txt', sep=r'\s+')  # equivalent to the older delim_whitespace=True
If you know that the space between the columns is going to be at least 2 spaces, you can do it this way:
df = pd.read_csv('./my_file_path/my_file.txt', sep=r'\s{2,}', engine='python')
In your case, the file is fixed width so you need to use a different method:
df = pd.read_fwf(StringIO(my_text), widths=[20,15,16, 10],skiprows=1)
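To address the "store the indices once" part of the question, here is a small sketch that keeps the column boundaries in a dictionary and hands them to read_fwf through its colspecs and names parameters. The boundaries below are assumptions that fit the sample text built in the snippet, not the real file.

```python
import io
import pandas as pd

# Hypothetical layout: column name -> (start, end) character positions, end exclusive.
COLSPECS = {
    "name": (0, 20),
    "address": (20, 35),
    "country": (35, 52),
    "age": (52, 55),
}

# Rebuild the sample from the question as fixed-width text (age is right-aligned).
rows = [
    ("Bob Hope", "Broadway", "United States", 101),
    ("Richard Donner", "Park Avenue", "United States", 76),
    ("Oscar Meyer", "Friedrichshain", "Germany", 47),
]
text = "\n".join(f"{n:<20}{a:<15}{c:<17}{g:>3}" for n, a, c, g in rows)

# The dict supplies both the slices and the column names, so nothing is repeated.
df = pd.read_fwf(
    io.StringIO(text),
    colspecs=list(COLSPECS.values()),
    names=list(COLSPECS),
)
print(df)
```

Adding a column then only means adding one entry to COLSPECS.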
The pandas.read_fwf method is what you are looking for.
df = pd.read_fwf('data.txt')
data.txt
name                address        country          age
Bob Hope            Broadway       United States    101
Richard Donner      Park Avenue    United States     76
Oscar Meyer         Friedrichshain Germany           47
df
id  name            address         country        age
0   Bob Hope        Broadway        United States  101
1   Richard Donner  Park Avenue     United States  76
2   Oscar Meyer     Friedrichshain  Germany        47
I have a pandas dataframe which is essentially 2 columns and 9000 rows
CompanyName | CompanyAddress
and the address is in the form
Line1, Line2, ..LineN, PostCode
i.e. basically different numbers of comma-separated items in a string (or dtype 'object'), and I want to just pull out the post code i.e. the item after the last comma in the field
I've tried the Dot notation string manipulation suggestions (possibly badly):
df_address['CompanyAddress'] = df_address['CompanyAddress'].str.rsplit(', ')
which just put '[ ]' around the fields - I had no success trying to isolate the last component of any split-up/partitioned string, with maxsplit kicking up errors.
I had a small degree of success following EdChum's comment on "Pandas split Column into multiple columns by comma":
pd.concat([df_address[['CompanyName']], df_address['CompanyAddress'].str.rsplit(', ', expand=True)], axis=1)
However, whilst isolating the Postcode, this just creates multiple columns and the post code is in columns 3-6... equally no good.
It feels incredibly close, please advise.
EmployerName Address
0 FAUCET INN LIMITED [Union, 88-90 George Street, London, W1U 8PA]
1 CITIBANK N.A [Citigroup Centre,, Canary Wharf, Canada Squar...
2 AGENCY 2000 LIMITED [Sovereign House, 15 Towcester Road, Old Strat...
3 Transform Trust [Unit 11 Castlebridge Office Village, Kirtley ...
4 R & R.C.BOND (WHOLESALE) LIMITED [One General Street, Pocklington Industrial Es...
5 MARKS & SPENCER FINANCIAL SERVICES PLC [Marks & Spencer Financial, Services Kings Mea...
Given the DataFrame,
df = pd.DataFrame({'Name': ['ABC'], 'Address': ['Line1, Line2, LineN, PostCode']})
Address Name
0 Line1, Line2, LineN, PostCode ABC
If you need only post code, you can extract that using rsplit and re-assign it to the column Address. It will save you the step of concat.
df['Address'] = df['Address'].str.rsplit(',').str[-1]
You get
Address Name
0 PostCode ABC
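A short sketch of the same idea on addresses with different numbers of comma-separated parts, assuming the column holds plain strings. The second row is made up for illustration. Splitting once from the right with n=1 avoids building the full list, and str.strip removes the space that follows the last comma.

```python
import pandas as pd

df = pd.DataFrame({
    "CompanyName": ["FAUCET INN LIMITED", "EXAMPLE LTD"],  # second row is invented
    "CompanyAddress": [
        "Union, 88-90 George Street, London, W1U 8PA",
        "1 High Street, AB1 2CD",
    ],
})

# Split only at the last comma, then take the right-hand piece.
parts = df["CompanyAddress"].str.rsplit(",", n=1)
df["PostCode"] = parts.str[-1].str.strip()
print(df[["CompanyName", "PostCode"]])
```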
Edit: Given that you have a dataframe with the address values in a list
df = pd.DataFrame({'Name': ['FAUCET INN LIMITED'], 'Address': [['Union, 88-90 George Street, London, W1U 8PA']]})
Address Name
0 [Union, 88-90 George Street, London, W1U 8PA] FAUCET INN LIMITED
You can get last element using
df['Address'] = df['Address'].apply(lambda x: x[0].split(',')[-1])
You get
Address Name
0 W1U 8PA FAUCET INN LIMITED
Just rsplit the existing column into 2 columns - the existing one and a new one. Or two new ones if you want to keep the existing column intact.
df[['Address', 'PostCode']] = df['Address'].str.rsplit(', ', n=1, expand=True)
Edit: Since OP's Address column is a list with 1 string in it, here is a solution for that specifically:
df[['Address', 'PostCode']] = df['Address'].map(lambda x: x[0]).str.rsplit(', ', n=1, expand=True)
rsplit returns a list; try rsplit(',')[-1] to get the last element of the source line.
I am trying to import a txt file containing radiology reports from patients. Each row is supposed to be a radiology exam (MRI/CT/etc). The original txt file looks something like this:
Name | MRN | DOB | Type_Imaging | Report_Status | Report_Text
John Doe | 1234 | 01/01/1995 | MRI |Complete | Exam Number: A5678
Report status: final
Type: MRI of brain
-----------
REPORT:
HISTORY: History of meningioma, surveillance
FINDINGS: Again demonstrated is a small left frontal parasaggital meningioma, not interval growth. Evidence of cerebrovascular disease unchanged from prior.
Again demonstrated are post-surgical changes associated with prior craniotomy.
[report_end]
James Smith | 5678 | 05/05/1987 |CT | Complete |Exam Number: A8623
Report status: final
Type: CT of chest
-----------
REPORT:
HISTORY: Admitted patient with new fever, concern for pneumonia
FINDINGS: A CT of the chest demostrates bla bla bla
bla bla bla
[report_end]
When I import into pandas using pd.read_csv('filename', sep='|', header=0), the df I get has only "Exam Number: A5678" for report text in the first row. Then, the next row has "Report status: final" in the first cell and the rest of the row has NaN. The third row starts with "Type: MRI of brain" in the first cell and NaN in the rest. etc etc.
It seems like the import is taking both my defined delimiter ('|') and the tabs in the original txt as separators when reading the txt file. There are no '|' within the text of the report.
Is there a way to import this file so that all the information between "Exam Number: A5678" and "[report end]" collapses into one cell (the last cell in each row)?
Alternatively, I was considering pre-processing this as a text file in order to extract all the report texts iteratively and append them to a list that I can eventually add to a df as a column. Looking online and on SO, I haven't been able to find a way to do this when I need to use unique start ("Exam Number:") and end ("[report end]") delimiters for the string of interest, or a way to have the script continue reading the text where it left off (as opposed to just extracting the first report text).
Any thoughts?
Thanks!
Maya
Please be careful that your [report_end] is consistent. You gave both [report_end] and [report end]. I'm assuming that is a typo.
Assuming your file name is test.txt
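import pandas as pd

with open('test.txt') as f:
    txt = f.read()
# the first line holds the column names, the rest holds the records
names, txt_ = txt.split('\n', 1)
names = [n.strip() for n in names.split('|')]
pd.DataFrame(
    [t.strip().split('|') for t in txt_.split('[report_end]') if t.strip()],
    columns=names)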
Name MRN DOB Type_Imaging Report_Status Report_Text
0 John Doe 1234 01/01/1995 MRI Complete Exam Number: A5678\nReport status: final\nTyp...
1 James Smith 5678 05/05/1987 CT Complete Exam Number: A8623\nReport status: final\nType...
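For reference, here is a self-contained variant of the split-on-marker idea that runs on an inline, shortened copy of the sample, so it can be tried without the real file. It assumes the end marker is always exactly "[report_end]". The maxsplit argument is an addition of mine: it keeps any stray "|" inside the report text in the last field.

```python
import pandas as pd

# Shortened copy of the sample from the question.
raw = """Name | MRN | DOB | Type_Imaging | Report_Status | Report_Text
John Doe | 1234 | 01/01/1995 | MRI |Complete | Exam Number: A5678
Report status: final
Type: MRI of brain
[report_end]
James Smith | 5678 | 05/05/1987 |CT | Complete |Exam Number: A8623
Report status: final
Type: CT of chest
[report_end]
"""

header, body = raw.split("\n", 1)
columns = [c.strip() for c in header.split("|")]

records = []
for chunk in body.split("[report_end]"):
    if not chunk.strip():
        continue  # skip the empty piece after the last marker
    # Split into at most len(columns) fields; the multi-line report stays whole.
    fields = chunk.strip().split("|", len(columns) - 1)
    records.append([f.strip() for f in fields])

df = pd.DataFrame(records, columns=columns)
print(df[["Name", "Type_Imaging", "Report_Text"]])
```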
I ended up doing this which worked:
import re
import pandas as pd
f = open("filename.txt", "r")
data = f.read().replace("\n", "")
matches = re.findall(r"\|Exam Number:(.*?)\[report_end\]", data, re.DOTALL)
df = pd.read_csv("filename.txt", sep="|", parse_dates=[5]).dropna(axis=0, how="any")
I got this in my data frame
name : john,
address : Milton Kings,
phone : 43133241
Concern:
customer complaint about the services is so suck
thank you
How can I process the above to remove only the lines of text in the data frame that contain ':'? My objective is to keep only lines such as the following.
customer complaint about the services is so suck
Kindly help.
One thing you can do is to separate the sentence after ':' from your data frame. And you can do this by creating a series from your data frame.
Let's say c is your series.
c=pd.Series(df['column'])
s=[c[i].split(':')[1] for i in range(len(c))]
By doing this you will be able to separate your sentence from colon.
Assuming you want to keep the second part of the sentences, you can use the applymap method to solve your problem.
import pandas as pd
#Reproduce the dataframe
l = ["name : john",
"address : Milton Kings",
"phone : 43133241",
"Concern : customer complaint about the services is so suck" ]
df = pd.DataFrame(l)
#split on each element of the dataframe, and keep the second part
df.applymap(lambda x: x.split(":")[1])
input :
0
0 name : john
1 address : Milton Kings
2 phone : 43133241
3 Concern : customer complaint about the services is so suck
output :
0
0 john
1 Milton Kings
2 43133241
3 customer complaint about the services is so suck
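Indexing the split result with [1] raises an IndexError for rows that contain no colon, such as "thank you" in the question. The sketch below shows two alternatives using the pandas string accessor; it assumes the text sits in a single Series, which is my reading of the question rather than something it states.

```python
import pandas as pd

# The lines from the question, one per row.
s = pd.Series([
    "name : john,",
    "address : Milton Kings,",
    "phone : 43133241",
    "Concern:",
    "customer complaint about the services is so suck",
    "thank you",
])

# Option 1: drop every row that contains a colon.
no_colon = s[~s.str.contains(":", regex=False)]

# Option 2: keep the text after the first colon; rows without one stay unchanged.
after_colon = s.str.split(":", n=1).str[-1].str.strip()

print(no_colon.tolist())
print(after_colon.tolist())
```

Option 1 matches the stated goal of keeping only the free-text lines, while option 2 mirrors the answers above without failing on colon-free rows.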
I have 2 CSV files with the columns in a different order. For example, the first file starts with the 10-digit mobile numbers, while that column is at position 4 in the second file.
I need to merge all the customer data into a single csv file. The order of the columns should be as follows:
mobile pincode model Name Address Location pincode date
mobile Name Address Model Location pincode Date
9845299999 Raj Shah nagar No 22 Rivi Building 7Th Main I Crz Mumbai 17/02/2011
9880877777 Managing Partner M/S Aitas # 1010, 124Th Main, Bk Stage. - Bmw 320 D Hyderabad 560070 30-Dec-11
Name Address Location mobile pincode Date Model
Asvi Developers pvt Ltd fantry Road Nariman Point, 1St Floor, No. 150 Chennai 9844066666 13/11/2011 Crz
L R Shiva Gaikwad & Sudha Gaikwad # 42, Suvarna Mansion, 1St Cross, 17Th Main, Banjara Hill, B S K Stage,- Bangalore 9844233333 560085 40859 Mercedes_E 350 Cdi
The second task, which may be slightly more difficult, is that the new files I expect may have a totally different column sequence. In that case I need to extract the 10-digit mobile number column and the 6-digit pincode column. I also need to write code that guesses the city column by checking whether it matches any entry in a given list of cities. The new files are expected to have relevant column headings, but the headings may be slightly different, e.g. "customer address" instead of "address". How do I handle such data?
sed 's/.*\([0-9]\{10\}\).*/\1,&/' input
It has been suggested that I use sed to move the 10-digit column to the beginning. But I also need to rearrange the text columns. For example, if a column matches the entries in the following list, then it is undoubtedly the model column.
['Crz', 'Bmw 320 D', 'Benz', 'Mercedes_E 350 Cdi', 'Toyota_Corolla He 1.8']
If any column matches 10% of the entries with the above list then it is a "model" column and should be at number 3 followed by mobile and pincode.
For your first question, I suggest using pandas to load both files and then concat. After that you can rearrange your columns.
import pandas as pd
dataframe1 = pd.read_csv('file1.csv')
dataframe2 = pd.read_csv('file2.csv')
combined = pd.concat([dataframe1, dataframe2])  # columns are aligned by name; their order may not be the one you want
To get the desired order,
result_df = combined[['mobile', 'pincode', 'Model', 'Name', 'Address', 'Location', 'Date']]
and then result_df.to_csv('output.csv', index=False) to export to a CSV file.
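A tiny runnable illustration of why concat works here: pandas aligns the frames by column name, not by position, so the differing column order in the two files does not matter. The two in-memory "files" and their values are made up for the example, and reading with dtype=str is my choice to keep mobile numbers and pincodes as text.

```python
import io
import pandas as pd

# Two made-up files with the same columns in a different order.
file1 = io.StringIO("mobile,Name,Model,pincode\n9845299999,Raj,Crz,560070\n")
file2 = io.StringIO("Name,mobile,pincode,Model\nAsvi Developers,9844066666,560085,Bmw 320 D\n")

frames = [pd.read_csv(f, dtype=str) for f in (file1, file2)]

# Rows are stacked and columns are matched by name.
combined = pd.concat(frames, ignore_index=True)

# reindex puts the columns in the wanted order; a missing column would
# show up as NaN instead of raising a KeyError.
ORDER = ["mobile", "pincode", "Model", "Name"]
result = combined.reindex(columns=ORDER)
print(result)
```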
For the second one, you can do something like this (assuming you have loaded a csv file into df like above)
models = ['Crz', 'Bmw 320 D', 'Benz', 'Mercedes_E 350 Cdi', 'Toyota_Corolla He 1.8']
match_model = lambda m: m in models
for c in df:
    # share of values in this column that appear in the known model list
    if df[c].map(match_model).sum() / len(df) > 0.1:
        print("Column %s is 'Model'" % c)
        df.rename(columns={c: 'Model'}, inplace=True)
You can modify the matching function match_model to use regex instead if you want.
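The same guessing idea can be extended to the mobile and pincode columns and to slightly different headings. The sketch below is only one possible approach: the helper names guess_by_pattern and normalise_headers, the thresholds and the canonical header list are all assumptions of mine. It relies on Series.str.fullmatch for the value patterns and on difflib.get_close_matches from the standard library for fuzzy header matching.

```python
import difflib
import pandas as pd

# Assumed target headings, taken from the sample files in the question.
CANONICAL = ["mobile", "pincode", "Model", "Name", "Address", "Location", "Date"]


def guess_by_pattern(df, pattern, threshold=0.5):
    """Return the first column where at least `threshold` of the values
    fully match the regex `pattern`, or None if there is no such column."""
    for col in df.columns:
        share = df[col].astype(str).str.fullmatch(pattern).mean()
        if share >= threshold:
            return col
    return None


def normalise_headers(df, cutoff=0.6):
    """Rename each heading to the closest canonical name (case-insensitive),
    leaving headings without a close enough match untouched."""
    lowered = {name.lower(): name for name in CANONICAL}
    mapping = {}
    for col in df.columns:
        close = difflib.get_close_matches(col.lower(), list(lowered), n=1, cutoff=cutoff)
        if close:
            mapping[col] = lowered[close[0]]
    return df.rename(columns=mapping)
```

For example, guess_by_pattern(df, r"\d{10}") would look for the mobile column and guess_by_pattern(df, r"\d{6}") for the pincode column, provided the values were read as strings. The cutoff will probably need tuning on real headings, since fuzzy matching can also produce wrong matches.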