Convert plain text to CSV - python

My source file which contains 5000 rows is as below,
1.4.1 This is my text and is not limited to three words 3 ALL ALL
1.4.2 This is second sentence 1 ALL ALL
1.4.3 An another sentence that I just made up 2 ALL ALL
I want to search and replace (or any other method) to produce the output below
"1.4.1", "This is my text and is not limited to three words", "3", "ALL", "ALL"
"1.4.2","This is second sentence","1","ALL","ALL"
"1.4.3", "An another sentence that I just made up","2","ALL", "ALL"
The sentence that I would like in the 2nd column is of varying length but is always between two numbers - 1.4.1 and 3 for example. This is a complicated part that I am trying to figure out how to achieve.
Edit:
The last two columns are optional, may or may not appear on all lines
"1.4.1", "This is my text and is not limited to three words", "3"

This can actually be fairly simple assuming all columns but the 2nd one will never contain a space. You can simply split the entire string and pull out the necessary parts as below:
[ 1.4.1, This, is, my, text, and, is, not, limited, to, three, words, 3, ALL, ALL ]
  +---+  +---------------------------------------------------------+  +  +-+  +-+
   1st                               2nd                             3rd 4th  5th
Once the full line is split like above, you can simply access the proper elements for the one word / number columns and join the 2nd column elements. Below is a simple python function that should accomplish what you are looking to do:
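def parse_record(line):
    # Split on whitespace; only the 2nd column may contain spaces
    parts = line.split()
    col_1 = parts[0]
    # Everything between the first field and the last three is the sentence
    col_2 = " ".join(parts[1:-3])
    col_3 = parts[-3]
    col_4 = parts[-2]
    col_5 = parts[-1]
    return col_1, col_2, col_3, col_4, col_5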
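The function above assumes all five columns are present on every line. Since the question's edit says the last two columns are optional, here is a minimal sketch of one way to handle both layouts and emit quoted CSV with the standard csv module. It assumes the 3rd column is always an integer and that the 1st and the trailing columns never contain spaces; the names LINE_RE, parse_line and to_csv are illustrative and not from the original answer.

```python
import csv
import io
import re

# Hypothetical pattern: id, free-text sentence, integer, then two optional words.
LINE_RE = re.compile(r"^(\S+)\s+(.*?)\s+(\d+)(?:\s+(\S+)\s+(\S+))?$")

def parse_line(line):
    """Return the columns of one line, dropping the optional ones if absent."""
    match = LINE_RE.match(line.strip())
    if match is None:
        raise ValueError(f"unrecognised line: {line!r}")
    return [group for group in match.groups() if group is not None]

def to_csv(lines):
    """Render parsed lines as CSV text with every field quoted."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, quoting=csv.QUOTE_ALL, lineterminator="\n")
    for line in lines:
        if line.strip():  # skip blank lines
            writer.writerow(parse_line(line))
    return buffer.getvalue()

sample = [
    "1.4.1 This is my text and is not limited to three words 3 ALL ALL",
    "1.4.2 This is second sentence 1",
]
print(to_csv(sample), end="")
```

One caveat with this sketch: a sentence that itself contains a number followed by exactly two more words at the end of the line is ambiguous, and the pattern will treat those two words as the optional columns.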
If you have control over the file format, this becomes much simpler. If you can change the way the 2nd column is written, the file is technically already a CSV. Most CSV parsers let you specify the delimiter between values, in this case a space. For example, suppose the same file quoted the 2nd column like this:
1.4.1 "This is my text and is not limited to three words" 3 ALL ALL
1.4.2 "This is second sentence" 1 ALL ALL
1.4.3 "An another sentence that I just made up" 2 ALL ALL
Since the 2nd column's values are wrapped in quotes they will be parsed as a single value rather than many separate ones, allowing you to simply use a space as the csv delimiter rather than the default comma.
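As a rough illustration of the quoted variant, Python's standard csv module can read such a file when told to use a space as the delimiter. The sample text below is inlined through io.StringIO purely for demonstration; with a real file you would pass the open file object instead.

```python
import csv
import io

# Inline stand-in for the quoted file shown above.
sample = (
    '1.4.1 "This is my text and is not limited to three words" 3 ALL ALL\n'
    '1.4.2 "This is second sentence" 1 ALL ALL\n'
)

# delimiter=" " makes the reader split on spaces; quoted fields stay whole.
rows = list(csv.reader(io.StringIO(sample), delimiter=" "))

for row in rows:
    print(row)
```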

Related

Python: string not splitting correctly at "|||" substring

I have a column in Pandas DataFrame that stores long strings, in which different chunks of information are separated by a "|||".
This is an example:
"intermediation|"mechanical turk"|precarious "public policy" ||| intermediation|"mechanical turk"|precarious high-level
I need to split this column into multiple columns, each column containing the string between the separators "|||".
However, when I run the following code:
df['query_ids'].str.split('|||', n=5, expand = True)
I get splits done for every single character, like this:
0 1 2 3 4 5
0 " r e g ulatory capture"|"political lobbying" policy-m...
I suspect it's because "|" is a Python operator, but I cannot think of a suitable workaround.
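You need to escape | (a multi-character pattern is treated as a regular expression, in which | means alternation). Use a raw string so the backslashes reach the regex engine unchanged:
df['query_ids'].str.split(r'\|\|\|', n=5, expand=True)
or pass regex=False (available in pandas 1.4 and later):
df['query_ids'].str.split('|||', n=5, expand=True, regex=False)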
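The per-character splitting can be reproduced without pandas. The sketch below uses only the standard re module (Python 3.7 or later is assumed, since older versions do not split on empty matches) and a made-up sample string; it is meant to show why the unescaped pattern misbehaves, not how pandas is implemented internally.

```python
import re

# Made-up sample with the same "|||" separator as in the question.
text = 'a|"b c"|d ||| e|f ||| g'

# Unescaped, "|||" is a regex made of empty alternatives, so it matches the
# empty string at every position instead of the literal separator.
naive = re.split("|||", text)

# re.escape() builds the escaped pattern, so each | is matched literally.
escaped = re.split(re.escape("|||"), text)

# Plain str.split() never interprets its argument as a regex.
literal = text.split("|||")

print(len(naive))
print(escaped)
```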

Replacing a Character in .csv file only for specific strings

I am trying to clean a file and have removed the majority of unnecessary data excluding this one issue. The file I am cleaning is made up of rows containing numbers, see below example of a few rows.
Example of data: https://i.stack.imgur.com/0bADX.png
You can see that I have cleaned the data so that there is a space between each character aside from the four characters that start each row. There are some character groupings that I have not yet added a space between each character because I need to replace the "1"s with a space rather than keeping the "1"s.
Strings I still need to clean: https://i.stack.imgur.com/gmeUs.png
I have tried the following two methods in order to replace the 1's in these specific strings, but both produce results that I do not want.
Method 1 - Replacing 1's before splitting characters into their own columns
import pandas as pd
Data2 = pd.read_csv('filename.csv')
Data2['Column']=Data2['Column'].apply(lambda x: x.replace('1',' ') if len(x)>4 else x)
This method results in the replacement of every 1 in the entire file, not just the 1's in the strings like those pictured above (formatted like "8181818"). I would think that the if statement would exclude the removal of the 1's where there are fewer than 4 characters grouped together.
Method 2 - Replacing 1's after splitting characters into their own columns
Since Method 1 was resulting in the removal of each 1 in the file, I figured I could split each string into its own column (essentially using the spaces as a delimiter) and then try a similar method to clean these unnecessary 1's by focusing on the specific columns where these strings are located (columns 89, 951, and 961).
Data2[89]=Data2[89].apply(lambda x: x.replace('1',' ') if len(x)!=1 else x)
Data2[89].str.split(' ').tolist()
Data2[89] = pd.DataFrame(Data2[89].str.split(' ').tolist())
Data2[951]=Data2[951].apply(lambda x: x.replace('1',' ') if len(x)!=1 else x)
Data2[951].str.split(' ').tolist()
Data2[951] = pd.DataFrame(Data2[951].str.split(' ').tolist())
Data2[961]=Data2[961].apply(lambda x: x.replace('1',' ') if len(x)!=1 else x)
Data2[961].str.split(' ').tolist()
Data2[961] = pd.DataFrame(Data2[961].str.split(' ').tolist())
This method successfully removed only the 1's in these strings, but when I then split the remaining numbers into their own columns, they overwrite the existing values in those columns rather than pushing those existing values into columns further down the line.
Any assistance on either of these methods or advice on if there is a different approach I should be taking would be much appreciated.
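One possible explanation for Method 1, offered as a guess because the data is only shown in screenshots: if each cell of Data2['Column'] holds a whole row as one string, then len(x) > 4 is true for every cell, so every 1 in the row is replaced. A minimal sketch of an alternative is to apply the length test to each space-separated group instead of to the whole cell. The function name clean_row, the min_len threshold and the sample row are assumptions made for illustration.

```python
def clean_row(row, min_len=5):
    """Replace '1' with a space only inside groups of at least min_len chars."""
    tokens = row.split(" ")
    cleaned = [t.replace("1", " ") if len(t) >= min_len else t for t in tokens]
    return " ".join(cleaned)

# Hypothetical row: a 4-character prefix, single digits, and one long group.
row = "1234 1 5 8181818 1 7"
print(clean_row(row))
```

Because the long group is expanded inside the string before any splitting into columns, the values that follow it are shifted along rather than overwritten. With pandas this could be applied as Data2['Column'].apply(clean_row) before splitting on spaces.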

Python: Trim strings in a column

I have a dataframe column from which I would like to trim the leading and trailing parts. The column has contents such as: ['Tim [Boy]', 'Gina [Girl]'...] and I would like to make a new column that just has ['Boy','Girl'... etc.]. I tried using rstrip and lstrip but have had no luck. Please advise. Thank you
I assume that the cells of the column are 'Tim [Boy]', etc.
Such as in:
name_gender
0 AAa [Boy]
1 BBc [Girl]
You want to use a replace method call passing a regular expression to pandas.
Assuming that your dataframe is called df, the original column name is 'name_gender' and the destination (new column) name is 'gender', you can use the following code:
df['gender'] = df['name_gender'].replace('.*\\[(.*)\\]', '\\1', regex=True)
or as suggested by @mozway below, this can also be written as:
df['gender'] = df['name_gender'].str.extract('.*\\[(.*)\\]')
You end up with:
name_gender gender
0 AAa [Boy] Boy
1 BBc [Girl] Girl
The regexp '.*\\[(.*)\\]' can be read as: match anything, then a '[', then anything (captured into a group, which is what the parentheses are for), then a ']'. The whole match is then replaced by the second argument, '\\1', which refers to the contents of group 1 (the only group in the pattern).
You might want to read up on regular expressions if you are not familiar with them.
Anything which does not match the entry will not be replaced. You might want to add a test to detect whether some rows don't match that pattern ("name [gender]").
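A minimal sketch of such a test, written with the standard re module on a plain list so that it runs without pandas. The helper name extract_gender and the sample values are made up for illustration, and re.fullmatch is a deliberately stricter choice than the replace call above, since it also rejects values with text after the closing bracket. With a DataFrame, the same pattern could be passed to Series.str.match to build a boolean mask.

```python
import re

# Same pattern as above, written as a raw string.
PATTERN = re.compile(r".*\[(.*)\]")

def extract_gender(value):
    """Return the bracketed text, or None when the value has no [..] part."""
    match = PATTERN.fullmatch(value)
    return match.group(1) if match else None

# Hypothetical column contents; the last entry lacks the "[gender]" part.
values = ["AAa [Boy]", "BBc [Girl]", "CCd"]

unmatched = [v for v in values if extract_gender(v) is None]
print(unmatched)
```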

Python: Replace the first nth matching letter to another letter

This is related to cleaning up a csv file.
I have a malformed csv file that should have 4 columns, but the last column contains an unknown number of extra commas.
I want to replace the delimiter with another character such as "|".
For example, turn string = "a,b,c,d,e,f" into "a|b|c|d,e,f".
The following code works, but I would like to find a better and more efficient way to process a large txt file.
sample_txt='a,b,c,d,e,f'
temp=sample_txt.split(",")
output_txt='|'.join(temp[0:3])+'|'+','.join(temp[3:])
Python has the perfect way to do this, with str.replace:
>>> sample_txt='a,b,c,d,e,f'
>>> print(sample_txt.replace(',', '|', 3))
a|b|c|d,e,f
str.replace takes an optional third argument (or fourth if you count self) which dictates the maximum number of replacements to happen.
sample_txt='a,b,c,d,e,f'
output_txt = sample_txt.replace(',', '|', 3)
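For a large file, the same call can be applied one line at a time so the whole file never has to sit in memory. The sketch below is one way to do that; the generator name fix_delimiters is hypothetical, and io.StringIO stands in for an open file object.

```python
import io

def fix_delimiters(lines, old=",", new="|", count=3):
    """Yield each line with only its first `count` delimiters replaced."""
    for line in lines:
        yield line.replace(old, new, count)

# Stand-in for open("input.txt"); iterating a file yields lines the same way.
source = io.StringIO("a,b,c,d,e,f\n1,2,3,4,5\n")
result = list(fix_delimiters(source))
print("".join(result), end="")
```

In real use you would write each yielded line straight to the output file instead of collecting them in a list.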

Extracting part of a string based on its naming convention

I'm trying to extract a piece of information about a certain file. The file name is extracted from an xml file.
The information I want is stored in the name of the file, I want to know how to extract the letters between the 2nd and 3rd period in the string.
Eg. name is extracted from the xml, it is stored as a string that looks something like this "aa.bb.cccc.dd.ee" and I need to find what "cccc" actually is in each of the strings I extract (~50 of them).
I've done some searching and some playing around with slicing etc. but I can't get even close.
I can't just specify the letter in the range [6:11] because the length of the string varies as does the number of characters before the part I want to find.
UPDATE: Solution Added.
Due to the fact the data that I was trying to split and extract part from was from an xml file it was being stored as an element.
I iterated through the list of Estate Names and stored the EstateName attribute for each one as a variable
for element in EstateList:
    EstateStr = element.getAttribute('EstateName')
I then used the split on this new variable which contains strings rather than elements and wrote them to the desired text file:
asset = EstateStr.split('.', 3)[2]
z.write(asset + "\n")
If you are certain it will always have this format (5 blocks of characters, separated by 4 periods) you can split on '.' and then take the third element, at index [2].
>>> 'aa.bb.cccc.dd.ee'.split('.')[2]
'cccc'
This works for strings of any length, so you don't have to worry about absolute positions as you would with slicing, your first approach.
>>> 'a.b.c.d.e'.split('.')[2]
'c'
>>> 'eeee.ddddd.ccccc.bbbbb.aaaa'.split('.')[2]
'ccccc'
Split the string on the period:
third_part = inputstring.split('.', 3)[2]
I've used str.split() with a limit here for efficiency; no point in splitting the dd.ee part here, for example.
The [2] index then picks out the third result from the split, your cccc string:
>>> "aa.bb.cccc.dd.ee".split('.', 3)[2]
'cccc'
You could use the re module to extract the string between the second and third dots.
>>> import re
>>> re.search(r'^[^.]*\.[^.]*\.([^.]*)\..*', "aa.bb.cccc.dd.ee").group(1)
'cccc'
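The answers above assume the name always contains at least three periods. If some of the extracted names might not, a small guard avoids an IndexError (or an AttributeError in the re version). This is only a sketch; the function name third_field and the choice to return None are assumptions.

```python
def third_field(name, sep="."):
    """Return the text between the 2nd and 3rd separator, or None if absent."""
    parts = name.split(sep, 3)  # at most 4 pieces; the tail stays unsplit
    if len(parts) < 4:
        # Fewer than three separators: there is no 2nd-to-3rd segment.
        return None
    return parts[2]

print(third_field("aa.bb.cccc.dd.ee"))
print(third_field("aa.bb"))
```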
