Extract data from file.txt text file to csv - python

I have this .txt file and I would like to convert it to .csv in Python. Could you help me?
File.txt
I need to extract the txt data into a csv file in this format:
|Case code| Site of injury| If is not fatal,day away from home| Gender | Nationality |...........
| 239 | Head |1 | M | ITALY |...........
And so on for every tag before " : " .
This is the result I'm trying to achieve: Final results
Please let me know how I can solve this problem. I'm a beginner in programming and I don't know where to start. Thank you.

Here is an approach using pandas.read_fwf and pandas.DataFrame.transpose:
import pandas as pd

(
    pd.read_fwf("input.txt")                           # read the file as a single fixed-width column
    .squeeze()                                         # collapse the one-column DataFrame to a Series
    .loc[lambda x: x.str.contains(":", na=False)]      # keep only the "tag : value" lines
    .str.split(":", expand=True)                       # split each line into tag / value
    .set_index(0)                                      # the tags become the index...
    .transpose()                                       # ...and then the column headers
    .to_csv("output.csv", index=False)
)
# Output: first nine columns
| Case code | Site of injury | Type of injury | day away from home | Gender | Nationality | Type of work contract | Job | Seniority of job |
| 239 | Head | Fracture | 1 | M | ITALY | Permanent employee | Other jobs | over 3 years |
Shape of the output: (1 row, 17 columns)

Getting empty dataframe after foreachPartition execution in Pyspark

I'm kinda new to PySpark and I'm trying to perform a foreachPartition function on my dataframe, and then I want to perform another function with the same dataframe.
The problem is that after using the foreachPartition function, my dataframe gets empty, so I cannot do anything else with it. My code looks like the following:
def my_random_function(partition, parameters):
    # performs something with the partition's rows
    # does not return anything
    ...

my_py_spark_dataframe.foreachPartition(
    lambda partition: my_random_function(partition, parameters))
Could someone tell me how can I perform this foreachPartition and also use the same dataframe to perform other functions?
I saw some users talking about copying the dataframe using df.toPandas().copy(), but in my case this causes some performance issues, so I would like to use the same dataframe instead of creating a new one.
Thank you in advance!
It is not clear which operation you are trying to perform, but here is a sample usage of foreachPartition.
The sample data is a list of countries from three continents:
+---------+-------+
|Continent|Country|
+---------+-------+
| NA| USA|
| NA| Canada|
| NA| Mexico|
| EU|England|
| EU| France|
| EU|Germany|
| ASIA| India|
| ASIA| China|
| ASIA| Japan|
+---------+-------+
The following code partitions the data by "Continent", iterates over each partition using foreachPartition, and writes the "Country" names to a separate file per partition, i.e. per continent.
import pyspark.sql.functions as F

df = spark.createDataFrame(
    data=[["NA", "USA"], ["NA", "Canada"], ["NA", "Mexico"],
          ["EU", "England"], ["EU", "France"], ["EU", "Germany"],
          ["ASIA", "India"], ["ASIA", "China"], ["ASIA", "Japan"]],
    schema=["Continent", "Country"])
df.withColumn("partition_id", F.spark_partition_id()).show()  # partition ids before repartitioning

df = df.repartition(F.col("Continent"))  # hash-partition so each continent lands in one partition
df.withColumn("partition_id", F.spark_partition_id()).show()  # partition ids after repartitioning

def write_to_file(rows):
    # runs on the executors; rows is an iterator over one partition's rows
    for row in rows:
        with open(f"/content/sample_data/{row.Continent}.txt", "a+") as f:
            f.write(f"{row.Country}\n")
df.foreachPartition(write_to_file)
Output:
Three files: one for each partition.
!ls -1 /content/sample_data/
ASIA.txt
EU.txt
NA.txt
Each file has country names for that continent (partition):
!cat /content/sample_data/ASIA.txt
India
China
Japan
!cat /content/sample_data/EU.txt
England
France
Germany
!cat /content/sample_data/NA.txt
USA
Canada
Mexico
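As for reusing the same DataFrame afterwards: foreachPartition is an action that returns None, so assigning its result back to your DataFrame variable would leave you with nothing. If the concern is recomputing the lineage before the next operation, persisting the DataFrame first is a common pattern. A minimal sketch, assuming the df and write_to_file defined above:
df = df.cache()                      # keep the computed partitions around
df.foreachPartition(write_to_file)   # action: returns None, df itself is unchanged
print(df.count())                    # the same DataFrame is still usable afterwards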

Parsing log files and write to csv (different number of fields)

This is a question that has concerned me for a long time. I have log files that I want to convert to csv. My problem is that the empty fields were omitted in the log files. I want to end up with a csv file containing all fields.
Right now I parse the log files and write them to XML, because one of the nice features of Microsoft Excel is that when you open an XML file with a varying number of elements, Excel shows all the elements as separate columns.
Last week I came up with the idea that this might be possible with pandas, but I cannot find a good example to get this done.
Does anyone have a good idea how I can get this done?
Updated
I can't share the actual logs here. Below a fictional sample:
Sample 1:
First : John Last : Doe Address : Main Street Email : j_doe#notvalid.gov Sex : male State : TX City : San Antonio Country : US Phone : 210-354-4030
First : Carolyn Last : Wysong Address : 1496 Hewes Avenue Sex : female State : TX City : KEMPNER Country : US Phone : 832-600-8133 Bank_Account : 0123456789
regex :
matches = re.findall(r'(\w+) : (.*?) ', line, re.IGNORECASE)
Sample 2:
:1: John :2: Doe :3: Main Street :4: j_doe#notvalid.gov :5: male :6: TX :7: San Antonio :8: US :9: 210-354-4030
:1: Carolyn :2: Wysong :3: 1496 Hewes Avenue :5: female :6: TX :7: KEMPNER :8: US :9: 832-600-8133 :10: 0123456789
regex:
matches = re.findall(r':(\d+): (.*?) ', line, re.IGNORECASE)
Allow me to concentrate on your first example. Your regex only matches the first word of each field, but let's keep it like that for now as I'm sure you can easily fix that.
You can create a pandas DataFrame to store your parsed data, then for each line run your regex, convert the matches to a dictionary and load it into a pandas Series. Then you append it to your dataframe. Pandas is smart enough to fill the missing data with NaN.
import re
import pandas as pd

# lines: an iterable of log lines, e.g. open("logfile").readlines()
df = pd.DataFrame()
for l in lines:
    matches = re.findall(r'(\w+) : (.*?) ', l, re.IGNORECASE)
    s = pd.Series(dict(matches))
    # DataFrame.append was removed in pandas 2.0; pd.concat does the same job
    df = pd.concat([df, s.to_frame().T], ignore_index=True)
>>> print(df)
Address City Country Email First Last Sex State Phone
0 Main San US j_doe#notvalid.gov John Doe male TX NaN
1 1496 KEMPNER US NaN Carolyn Wysong female TX 832-600-8133
I'm not sure the dict step is needed, maybe there's a pandas way to directly parse your list of tuples.
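For what it's worth, one way to skip the per-row Series (and the repeated concat) is to collect the dicts in a list and build the frame in one go; a short sketch with the same regex:
rows = [dict(re.findall(r'(\w+) : (.*?) ', l, re.IGNORECASE)) for l in lines]
df = pd.DataFrame(rows)   # missing fields become NaN automatically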
Then you can easily convert it to csv; you will retain all your columns, with empty fields where appropriate.
df.to_csv("result.csv", index=False)
>>> !cat result.csv
Address,City,Country,Email,First,Last,Sex,State,Phone
Main,San,US,j_doe#notvalid.gov,John,Doe,male,TX,
1496,KEMPNER,US,,Carolyn,Wysong,female,TX,832-600-8133
Regarding performance on big files: if you know all the field names in advance, you can initialize the DataFrame with a columns argument and run the parsing and csv saving one chunk at a time. The mode parameter of to_csv lets you append to an existing file.
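A sketch of that chunked approach (the field list and file names here are assumptions for illustration, not taken from the real logs):
import re
import pandas as pd

# hypothetical field list, assumed to be known in advance
COLUMNS = ["First", "Last", "Address", "Email", "Sex", "State",
           "City", "Country", "Phone", "Bank_Account"]
CHUNK_SIZE = 10_000

with open("input.log") as f:          # "input.log" is a placeholder name
    header_written = False
    chunk = []
    for line in f:
        # same regex as above: {field: value} for one line
        chunk.append(dict(re.findall(r'(\w+) : (.*?) ', line, re.IGNORECASE)))
        if len(chunk) == CHUNK_SIZE:
            # mode="a" appends to the csv; write the header only once
            pd.DataFrame(chunk, columns=COLUMNS).to_csv(
                "result.csv", mode="a", index=False, header=not header_written)
            header_written = True
            chunk = []
    if chunk:  # flush whatever is left over
        pd.DataFrame(chunk, columns=COLUMNS).to_csv(
            "result.csv", mode="a", index=False, header=not header_written)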

Dataframe getting all data in one 'cell'

I'm having some problems putting the data I want into a specific df.
When I print the value without a df I get the output shown in the linked image (it doesn't display properly when typed out here).
Then I try to insert it into a pandas dataframe and I get:
dinner = pd.DataFrame([dinner])
dinner.head()
Home Made - Tuna Poke, 472 gm (4 Ounces) {'cal...
So, basically, everything ends up in just one cell. I would like to get something like:
A | Calories | carbohydrates
Home made - tuna poke | 592 | 8
Does anyone know how can I do it?
dinner looks like a string parsed from HTML text. If that is the case and the data follows a regular pattern, then the following code may work.
import pandas as pd

# split the menu name from the {...} block of nutrition values
nutritions = dinner.split('{')[1].split('}')[0].split(', ')
menu = dinner.split('{')[0].strip('<').strip()

dict_dinner = {}
for n in nutritions:
    item, qty = n.split(': ')
    dict_dinner[item.strip("'")] = qty

df = pd.DataFrame(dict_dinner, index=[menu])
print(df)
This outputs a one-row DataFrame with the menu item as the index and one column per nutrient.
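For instance, with a made-up dinner string in the shape the code assumes (the real string's exact format is an assumption here):
import pandas as pd

# hypothetical input, assumed to match the "menu {nutrients}" shape above
dinner = "Home Made - Tuna Poke, 472 gm (4 Ounces) {'calories': 592, 'carbohydrates': 8, 'protein': 59}"

nutritions = dinner.split('{')[1].split('}')[0].split(', ')
menu = dinner.split('{')[0].strip('<').strip()
dict_dinner = {item.strip("'"): qty for item, qty in (n.split(': ') for n in nutritions)}

df = pd.DataFrame(dict_dinner, index=[menu])
print(df)
#                                           calories carbohydrates protein
# Home Made - Tuna Poke, 472 gm (4 Ounces)       592             8      59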

Python : How to collapse the contents of several rows into one cell when importing csv in pandas

I am trying to import a txt file containing radiology reports from patients. Each row is supposed to be a radiology exam (MRI/CT/etc). The original txt file looks something like this:
Name | MRN | DOB | Type_Imaging | Report_Status | Report_Text
John Doe | 1234 | 01/01/1995 | MRI |Complete | Exam Number: A5678
Report status: final
Type: MRI of brain
-----------
REPORT:
HISTORY: History of meningioma, surveillance
FINDINGS: Again demonstrated is a small left frontal parasaggital meningioma, not interval growth. Evidence of cerebrovascular disease unchanged from prior.
Again demonstrated are post-surgical changes associated with prior craniotomy.
[report_end]
James Smith | 5678 | 05/05/1987 |CT | Complete |Exam Number: A8623
Report status: final
Type: CT of chest
-----------
REPORT:
HISTORY: Admitted patient with new fever, concern for pneumonia
FINDINGS: A CT of the chest demonstrates bla bla bla
bla bla bla
[report_end]
When I import into pandas using pd.read_csv('filename', sep='|', header=0), the df I get has only "Exam Number: A5678" for report text in the first row. Then, the next row has "Report status: final" in the first cell and the rest of the row has NaN. The third row starts with "Type: MRI of brain" in the first cell and NaN in the rest. etc etc.
It seems like the import is taking both my defined delimiter ('|') and the tabs in the original txt as separators when reading the txt file. There are no '|' within the text of the report.
Is there a way to import this file in a way that collapses all the information between "Exam Number: A5678" and "[report end]" into one cell (the last cell in each row)?
Alternatively, I was considering pre-processing this as a text file in order to extract all the Report texts in an iterative manner and append them onto a list that I will eventually be able to add to a df as a column. Looking online and on SO, I haven't been able to find a way to do this when I need to use unique start ("Exam Number:") and end ("[report end]") delimiters for the string of interest. As well as find a way to have the script continue to read the text where it left off (as opposed to just extracting the first report text).
Any thoughts?
Thanks!
Maya
Please be careful that your [report_end] is consistent. You gave both [report_end] and [report end]. I'm assuming that is a typo.
Assuming your file name is test.txt
import pandas as pd

txt = open('test.txt').read()
names, txt_ = txt.split('\n', 1)   # the first line holds the column names
names = names.split('|')

pd.DataFrame(
    [t.strip().split('|') for t in txt_.split('[report_end]') if t.strip()],
    columns=names)
Name MRN DOB Type_Imaging Report_Status Report_Text
0 John Doe 1234 01/01/1995 MRI Complete Exam Number: A5678\nReport status: final\nTyp...
1 James Smith 5678 05/05/1987 CT Complete Exam Number: A8623\nReport status: final\nType...
I ended up doing this which worked:
import re
import pandas as pd
f = open("filename.txt", "r")
data = f.read().replace("\n", "")
matches = re.findall(r"\|Exam Number:(.*?)\[report_end\]", data, re.DOTALL)
df = pd.read_csv("filename.txt", sep="|", parse_dates=[5]).dropna(axis=0, how="any")
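Presumably the extracted report texts then become the last column of the frame. A sketch of that final step (an assumption: it relies on the matches lining up one-to-one, in order, with the rows that survive the dropna):
df["Report_Text"] = matches   # hypothetical final step, not part of the original solution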

How to join two or more DataFrames in pandas, python [duplicate]

This question already has answers here:
Pandas Merging 101
(8 answers)
Closed 4 years ago.
I've been trying to join 3 dataframes; however, I'm having problems doing it. Let me show the scenario.
I have 3 dataframes:
First: Country
just two columns: Country_ID, Country_Name
Primary Key: Country_ID
Country_ID | Country_Name
BR | Brazil
Second: Sports
just three columns: Country_ID, Sport_ID, Sport_Name
Primary Key: Country_ID, Sport_ID
Country_ID | Sport_ID| Sport_Name
BR | 1234 | Football
Third: University
just three columns: Country_ID, University_ID, University_Name
Primary Key: Country_ID, University_ID
Country_ID | University_ID| University_Name
BR | UFCABC | Federal University of ABC
Final Result: just these columns: Country_Name, Sport_Name, University_Name
Country_Name | Sport_Name | University_Name
Brazil | Football | Federal University of ABC
I tried to join Country with Sport and, after that, join the result with the University DataFrame; however, I wasn't able to do it.
Here is the code for creation and join the dataframes:
country_raw_data = {
    'country_id': [country.id for country in countries],
    'country_name': [country.name for country in countries]
}
sport_raw_data = {
    'country_id': [sport.country.id for sport in sports],
    'sport_id': [sport.id for sport in sports],
    'sport_name': [sport.name for sport in sports]
}
university_raw_data = {
    'country_id': [university.country.id for university in universities],
    'university_id': [university.state.id for university in universities],
    'university_name': [university.name for university in universities]
}
Now, the dataframe instances:
I tried to create the df like this:
country_df = pd.DataFrame(country_raw_data, columns=['country_id', 'country_name'])
I don't know why, but country_df was created wrong; some columns didn't come out with the right values. So I had to create it like this instead, which works:
country_df = pd.DataFrame(country_raw_data)
sport_df = pd.DataFrame(sport_raw_data)
university_df = pd.DataFrame(university_raw_data)
Here, is the joins declarations:
I tried it like this; however, the result didn't join correctly, and some columns didn't come through to the dataframe correctly.
country_state_df = pd.merge(country_df, state_df, on='country_id', how='inner')
Another attempt I made, which had the same problem as before:
country_sport_df = pd.merge(country_df, sport_df,
                            left_on='country_id',
                            right_on='sport_id',
                            how='inner')
So, after the first join, I did the next join, between country_sport and university:
country_sport_university_df = pd.merge(country_sport_df, university_df,
                                       on='country_id',
                                       how='inner')
I'd like the final result to have these columns:
country_name | Sport_Name | University_Name
Brazil | Football | Federal University of ABC
Is it possible to do this with dataframes, or do I need another library?
Note that there is a lot of data, around millions of rows.
Can anyone help me or give me a suggestion to solve the problem?
Thank you very much!
You should be able to:
country_sport_df = country_df.merge(sport_df, on='country_id', how='inner')
final_df = country_sport_df.merge(university_df, on='country_id', how='inner').drop(['country_id', 'sport_id', 'university_id'], axis=1)
I assume that it's on purpose that country_id is the only link between sport_id and university_id.
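A minimal runnable sketch with the single toy row given for each table in the question (the merge keys and how='inner' follow the post):
import pandas as pd

country_df = pd.DataFrame({'country_id': ['BR'], 'country_name': ['Brazil']})
sport_df = pd.DataFrame({'country_id': ['BR'], 'sport_id': [1234],
                         'sport_name': ['Football']})
university_df = pd.DataFrame({'country_id': ['BR'], 'university_id': ['UFCABC'],
                              'university_name': ['Federal University of ABC']})

final_df = (country_df
            .merge(sport_df, on='country_id', how='inner')
            .merge(university_df, on='country_id', how='inner')
            .drop(['country_id', 'sport_id', 'university_id'], axis=1))
print(final_df)
#   country_name sport_name            university_name
# 0       Brazil   Football  Federal University of ABC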
