How to avoid new line as a delimiter in pandas dataframe - python

I have an Excel sheet which has a column named remarks. For example, a cell contains data in a format like:
1. There is a book.
2. There is also a pen along with the book.
3. So, I decided to study for a while.
When I convert that Excel file into a pandas data frame, the data frame only captures the first point, up to the newline; it won't capture point 2. So, how can I get all the points from one Excel cell into a single cell of the data frame?
The data which I get looks like:
1. There is a book.
The data which I want should look like:
1. There is a book. 2. There is also a pen along with the book. 3. So, I decided to study for a while.

I created an Excel file with a column named remarks which looks like this:
remarks
0 1. There is a book.
2. There is also a pen along with the book.
3. So, I decided to study for a while.
Here, I have entered all the text mentioned in your question into a single cell.
import pandas as pd
df = pd.read_excel('remarks.xlsx')
Now when I try to print the column remarks it gives:
df['remarks']
0 1. There is a book.\n2. There is also a pen al...
Name: remarks, dtype: object
To solve your problem try:
df['remarks_without_linebreak'] = df['remarks'].replace('\n',' ', regex=True)
If you print the row in the 'remarks_without_linebreak' column you will get the result you want.
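An equivalent alternative, if you prefer the vectorized string accessor (the replacement here is a literal newline, so no regex is needed):
df['remarks_without_linebreak'] = df['remarks'].str.replace('\n', ' ', regex=False)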

Related

Divide data on the basis of specific column number using pandas

I am trying to load a .txt file using the pandas read_csv function.
My data looks like this:
84-121123-0000 GO DO YOU HEAR
84-121123-0001 BUT IN LESS THAN FIVE MINUTES THE STAIRCASE GROANED BENEATH AN EXTRAORDINARY WEIGHT
84-121123-0002 AT THIS MOMENT THE WHOLE SOUL OF THE OLD MAN SEEMED CENTRED IN HIS EYES WHICH BECAME BLOODSHOT THE VEINS OF THE THROAT SWELLED HIS CHEEKS AND TEMPLES BECAME PURPLE AS THOUGH HE WAS STRUCK WITH EPILEPSY NOTHING WAS WANTING TO COMPLETE THIS BUT THE UTTERANCE OF A CRY
84-121123-0003 AND THE CRY ISSUED FROM HIS PORES IF WE MAY THUS SPEAK A CRY FRIGHTFUL IN ITS SILENCE
84-..
..
..
First column is ID
Second column is data
The problem I have while loading the data with a space separator is that it splits everything after the second column into new fields, which is not what I want.
I want ID as the first column, and the second column should contain all the space-separated data.
I can think of using pandas for this task, but if there is any better library please let me know.
Here's the code snippet I tried:
test = pd.read_csv('my_file.txt', sep=' ', names=['id', 'data'])
I get unexpected output. The output I want should be:
id data
84-121123-0000 GO DO YOU HEAR
84-121123-0001 BUT IN LESS THAN FIVE MINUTES THE STAIRCASE GROANED BENEATH AN EXTRAORDINARY WEIGHT
....
If your id always has the format "xx-xxxxxx-xxxx", you can use a lookbehind on it as the separator:
df = pd.read_csv(
    "your_file.txt",
    sep=r"(?<=\d{2}-\d{6}-\d{4})\s",
    engine="python",
    names=["id", "data"],
)
print(df)
Prints:
id data
0 84-121123-0000 GO DO YOU HEAR
1 84-121123-0001 BUT IN LESS THAN FIVE MINUTES THE STAIRCASE GR...
2 84-121123-0002 AT THIS MOMENT THE WHOLE SOUL OF THE OLD MAN S...
3 84-121123-0003 AND THE CRY ISSUED FROM HIS PORES IF WE MAY TH...
I'm not sure how much slower it is to first read the text file and then create a pandas DataFrame, but you could first get each line into a list, separate each element by the first space (using split(" ",1)) and then create a DataFrame.
with open(TXTFILE, "r") as f:
    # Split each line on the first space only: [id, rest-of-line]
    data = [s.split(" ", 1) for s in f.readlines()]
df = pd.DataFrame(data, columns=['col1', 'col2'])
Note that f.readlines() consumes the file, so it only works once after opening; save the result in a separate list if you are going to use it more than once.
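If you'd rather stay in pandas, here is a sketch of the same first-space split using vectorized string methods; it assumes the file contains no tab characters, so "\t" acts as a separator that never matches:
import pandas as pd

# Read each full line as a single string column named 'line'.
lines = pd.read_csv("my_file.txt", sep="\t", header=None, names=["line"])["line"]

# Split each line on the first space only (n=1) into two columns.
df = lines.str.split(" ", n=1, expand=True)
df.columns = ["id", "data"]
print(df)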

How do I extract variables that repeat from an Excel Column using Python?

I'm a beginner at Python and I have a school project where I need to analyze an Excel document with information. It has approximately 7 columns and more than 1000 rows.
There's a column named "Materials" that starts at B13. It contains a code that we use to identify some materials. A material code looks like this -> 3A8356. There are different material codes in the same column, and they repeat a lot. I want to identify them and make a list with only one of each code, no repeats. Is there a way I can analyze the column and extract the codes that repeat, so I can take them and make a new column with only one of each material code?
An example would be:
12 Materials
13 3A8356
14 3A8376
15 3A8356
16 3A8356
17 3A8346
18 3A8346
and transform it to something like this:
1 Materials
2 3A8346
3 3A8356
4 3A8376
Yes.
If df is your dataframe, you only have to do df = df.drop_duplicates(subset=['Materials'], keep='first'). This keeps the first occurrence of each code and drops the rest; note that keep=False would instead remove every code that appears more than once, which is not what your expected output shows.
To load the dataframe from an excel file, just do:
import pandas as pd
df = pd.read_excel(path_to_file)
The subset argument indicates which column headings you want to look at.
Docs: https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.drop_duplicates.html
Per the docs, the new data frame with the duplicates dropped is returned, so you can assign it to any variable you want. If you want to re-index the first column, take a look at:
new_data_frame = new_data_frame.reset_index(drop=True)
Or simply
new_data_frame.reset_index(drop=True, inplace=True)
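Putting it together, a minimal sketch; the file name materials.xlsx is assumed, as is the header row being where pandas expects it:
import pandas as pd

# Load the sheet ('materials.xlsx' is an assumed file name).
df = pd.read_excel('materials.xlsx')

# Keep the first occurrence of each material code, then sort and re-index.
unique_codes = (
    df.drop_duplicates(subset=['Materials'], keep='first')
      .sort_values('Materials')
      .reset_index(drop=True)
)
print(unique_codes)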

Python: How to write a CSV file?

I would like to write a CSV that outputs the following:
John
Titles Values
color black
age 15
Laly
Titles Values
color pink
age 20
total age 35
And so far have:
import csv
students_file = open('./students_file', 'w')
file_writer = csv.DictWriter(students_file)
…
file_writer.writeheader(name_title) #John
file_writer.writeheader('Titles':'Values')
file_writer.writerow(color_title:color_val) #In first column: color, in second column: black
file_writer.writerow(age_title:age_val) #In first column: age, in second column: 15
file_writer.writerow('total age': total_val) #In first column: 'total age', in second column: total_val
But it seems like it just writes on new rows rather than putting corresponding values next to each other.
What is the proper way of creating the example above?
I don't think you got the concept of .csv right. The C stands for comma (or any other delimiter); it means Comma-Separated Values.
Think of it as an Excel sheet or a relational database table (e.g. MySQL).
Usually the document starts with a line of column names, then the values follow. You don't need to write e.g. age twice.
Example structure of a .csv
firstname,lastname,age,favorite_meal
Jim,Joker,33,"fried hamster"
Bert,Uber,19,"salad, vegan"
Text is often enclosed in quotes ("), too, so that it can itself contain commas without disturbing your .csv structure.
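If you still want to produce the exact layout from your question, here is a minimal sketch with csv.writer; the names and values are assumed from your example:
import csv

students = [
    ('John', [('color', 'black'), ('age', 15)]),
    ('Laly', [('color', 'pink'), ('age', 20)]),
]

with open('students_file.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    for name, pairs in students:
        writer.writerow([name])               # the student's name on its own row
        writer.writerow(['Titles', 'Values'])
        for title, value in pairs:
            writer.writerow([title, value])   # title in column 1, value in column 2
    writer.writerow(['total age', 15 + 20])   # summary row at the end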

How can I implement functions like mean, median and variance if I have a dictionary with 2 keys in Python?

I have many files in a folder like this one:
[image: sample data file]
I'm trying to build a dictionary for the data. I'm interested in creating it with 2 keys (the first one is the http address and the second is the third field, the plugin used, like adblock). The values are the different metrics, so my intention is to compute, for each site and plugin, the mean, median and variance of each metric once the dictionary has been built. For example, for the mean, my intention is to consider all the 4th-field values in the file, etc. I tried to write this code but, first of all, I'm not sure that it is correct.
[image: attempted code]
I read other posts, but none solved my problem, since they either treat only one key or don't show how to access the different values inside the dictionary to compute the mean, median and variance.
The problem is simple: assuming the dictionary implementation is OK, how must I access the different values for key1:www.google.it -> key2:adblock?
Any kind of help is appreciated, and I'm available for any follow-up questions.
You can do what you want using a dictionary, but you should really consider using the Pandas library. This library is centered around a tabular data structure called "DataFrame" that excels in column-wise and row-wise calculations such as the ones you seem to need.
To get you started, here is the Pandas code that reads one text file using the read_fwf() method. It also displays the mean and variance for the fourth column:
# import the Pandas library:
import pandas as pd
# Read the file 'table.txt' into a DataFrame object. Assume
# a header-less, fixed-width file like in your example:
df = pd.read_fwf("table.txt", header=None)
# Show the content of the DataFrame object:
print(df)
# Print the fourth column (zero-indexed):
print(df[3])
# Print the mean for the fourth column:
print(df[3].mean())
# Print the variance for the fourth column:
print(df[3].var())
There are different ways of selecting columns and rows from a DataFrame object. The square brackets [ ] in the previous examples selected a column in the data frame by column number. If you want to calculate the mean of the fourth column only from those rows that contain adblock in the third column, you can do it like so:
# Print those rows from the data frame that have the value 'adblock'
# in the third column (zero-indexed):
print(df[df[2] == "adblock"])
# Print only the fourth column (zero-indexed) from that data frame:
print(df[df[2] == "adblock"][3])
# Print the mean of the fourth column from that data frame:
print(df[df[2] == "adblock"][3].mean())
EDIT:
You can also calculate the mean or variance for more than one column at the same time:
# Use a list of column numbers to calculate the mean for all of them
# at the same time:
l = [3, 4, 5]
print(df[l].mean())
END EDIT
If you want to read the data from several files and do the calculations for the concatenated data, you can use the concat() method. This method takes a list of DataFrame objects and concatenates them (by default, row-wise). Use the following line to create a DataFrame from all *.txt files in your directory:
# glob finds all matching file names in the directory:
import glob

df = pd.concat(
    [pd.read_fwf(file, header=None) for file in glob.glob("*.txt")],
    ignore_index=True,
)
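Once everything is in one DataFrame, the per-site, per-plugin statistics you describe can be computed directly with groupby instead of a nested dictionary. A sketch, assuming column 0 holds the site, column 2 the plugin, and column 3 a metric:
# Group by site (column 0) and plugin (column 2), then aggregate column 3:
stats = df.groupby([0, 2])[3].agg(["mean", "median", "var"])
print(stats)

# Look up a single (site, plugin) pair:
print(stats.loc[("www.google.it", "adblock")])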

Pandas/Python - Excel Data Manipulation

So, I'm relatively new to Python/Pandas, but I do have a couple of years of programming under my belt, mainly with Java/C++, so nothing like a scripting language like Python.
My new job has me doing some scripting work, and it's been pretty basic so far, so I decided to try to do more and hopefully show my bosses that I am driven and willing to work hard and move up the ladder. With that in mind, I wanted to make one of our data analysis tasks more efficient by using Pandas to remove redundancies from an Excel sheet. However, the redundancy that I'm trying to "parse" for is a substring within a "Description" Excel column.
import pandas as pd
xlsx = pd.ExcelFile('Filename.xlsx')
sheet1 = xlsx.parse(0)
So I read the Excel file and parsed it into a data frame. I realized it may be easier to just use read_csv instead, but by the time I thought about it, I was already committed to following through with Excel. (Unless the transition isn't difficult; I'm just confused how I could export as comma-delimited when the original file is space-delimited.)
Here is how the data is kind of laid out:
ID Count#1 Count#2 Count#3 Description
1A42H4 1 0 2 Blahblah JIG GN=TRAC Blah Blah
242JB4 0 0 2 Blahblah JIG GN=SMOOTH Blah Blah
3MIVJ2 2 0 2 Blahblah JIG GN=TRAC Blah Blah
4JAXI3 1 0 3 BlahBlah JIG GN=TRAC Blah Blah
So I want to parse this datasheet, look for any redundant GN=TRAC (really any matching GN=something), and then organize those rows together into a separate datasheet. So I made an array of just the Description column:
array = dataframe.description
Then, I decided to do a string split on 'JIG', because I didn't need it and it was constant for all rows:
Splits = array.str.split('JIG')
That left me with
array[0] = Blahblah, GN=TRAC Blah Blah
and now I wanted to isolate the GN=TRAC part, so I collected the second halves into another array:
array2[n] = splits[n][1]
and did another split, splits2 = array2.str.split(' '), to put GN=TRAC in the first position and isolate it by itself. I realize I could have just split the original description on spaces, but the rows have different numbers of words, so I wouldn't be able to parse or compare, since the position of GN=TRAC varies from row to row.
Now, to iterate and compare them all, I came up with this little loop:
counter = 0
temp = counter + 1
print(sheet1.iloc[counter])
while counter <= len(sheet1):
    if splits2[counter][0] == splits2[temp][0]:
        print(sheet1.iloc[temp])
        temp += 1
    if splits2[counter][0] != splits2[temp][0]:
        temp += 1
    counter += 1
But I can't get past here. I'm able to iterate through and find all of the redundant rows matching the first row's GN=TRAC value, but the counter isn't advancing to the next row for comparison. I've tried a couple of variations, but I was hoping for a fresh pair of eyes. Based on the table above, it would then go to the second row and find all the rows that match GN=SMOOTH, and so on until the counter reaches the final row.
Lastly, I was hoping I could get some help on the best way to organize the rows by their GN=? value into an output.xlsx. I realize there is the writer and to_excel, but I'm just not sure how I would use them here. I read through the documentation as much as I could, and it doesn't seem like there is a single function that does this for me, which is why it's pretty complicated. (Do let me know how to make it more efficient and scriptable, though; I can generalize it later.)
P.S. Is there also a way to write to the Excel file in descending order of Count#1?
You could try
sheet1['GN'] = sheet1.Description.apply(lambda x: x.split('JIG')[1].split()[0])
which should insert a new column with the name GN into your DataFrame with the appropriate GN=* values.
To sort the DataFrame by any particular column you can use sheet1.sort_values('GN'); pass ascending=False for descending order (the older sheet1.sort('GN') has been removed from pandas).
To save the DataFrame to an Excel file you can use sheet1.to_excel('filename'). You can chain this with the sort above to write a file ordered by a particular column.
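Putting the pieces together, a minimal end-to-end sketch; the file names and the column names ('Description', 'Count#1') are assumed from your question, and "organize by GN" is interpreted here as one sheet per GN=* value:
import pandas as pd

df = pd.read_excel('Filename.xlsx')

# Extract the GN=* token that follows 'JIG' in each description.
df['GN'] = df['Description'].apply(lambda x: x.split('JIG')[1].split()[0])

# Sort descending by Count#1; groupby preserves this order within each group.
df = df.sort_values('Count#1', ascending=False)

# Write each GN=* group to its own sheet of output.xlsx.
with pd.ExcelWriter('output.xlsx') as writer:
    for gn, group in df.groupby('GN'):
        group.to_excel(writer, sheet_name=gn, index=False)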
