Divide data on the basis of specific column number using pandas - python

I am trying to load a .txt file using pandas read_csv function.
My data looks like this:
84-121123-0000 GO DO YOU HEAR
84-121123-0001 BUT IN LESS THAN FIVE MINUTES THE STAIRCASE GROANED BENEATH AN EXTRAORDINARY WEIGHT
84-121123-0002 AT THIS MOMENT THE WHOLE SOUL OF THE OLD MAN SEEMED CENTRED IN HIS EYES WHICH BECAME BLOODSHOT THE VEINS OF THE THROAT SWELLED HIS CHEEKS AND TEMPLES BECAME PURPLE AS THOUGH HE WAS STRUCK WITH EPILEPSY NOTHING WAS WANTING TO COMPLETE THIS BUT THE UTTERANCE OF A CRY
84-121123-0003 AND THE CRY ISSUED FROM HIS PORES IF WE MAY THUS SPEAK A CRY FRIGHTFUL IN ITS SILENCE
84-..
..
..
First column is ID
Second column is data
The problem with loading this data using a space separator is that it splits every space-separated word after the second column into a new field, which is not what I want.
I want ID as first column, and then second column should have all the space separated data.
I can think of using pandas for this task, but if there is a better library please let me know.
Here's the code snippet I tried:
test = pd.read_csv('my_file.txt', sep=' ', names=['id', 'data'])
I get unexpected output. The output I want should be:
id data
84-121123-0000 GO DO YOU HEAR
84-121123-0001 BUT IN LESS THAN FIVE MINUTES THE STAIRCASE GROANED BENEATH AN EXTRAORDINARY WEIGHT
....

If your id always has the format "xx-xxxxxx-xxxx", you can use it as a separator with a regex lookbehind (the python engine is required because the separator is a regular expression):
df = pd.read_csv(
    "your_file.txt",
    sep=r"(?<=\d{2}-\d{6}-\d{4})\s",
    engine="python",
    names=["id", "data"],
)
print(df)
Prints:
id data
0 84-121123-0000 GO DO YOU HEAR
1 84-121123-0001 BUT IN LESS THAN FIVE MINUTES THE STAIRCASE GR...
2 84-121123-0002 AT THIS MOMENT THE WHOLE SOUL OF THE OLD MAN S...
3 84-121123-0003 AND THE CRY ISSUED FROM HIS PORES IF WE MAY TH...

I'm not sure how much slower it is to first read the text file and then create a pandas DataFrame, but you could first get each line into a list, separate each element by the first space (using split(" ",1)) and then create a DataFrame.
f = open(TXTFILE, "r")
data = [s.split(" ", 1) for s in f.readlines()]
df = pd.DataFrame(data, columns=['col1', 'col2'])
Note that f.readlines() only works once after opening the file, so save the result as a separate list if you are going to use it more than once.
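A minimal sketch of that idea, reusing the file name from the question, keeping the lines in a list for reuse and stripping the trailing newline so it does not end up in the second column (it assumes every line contains at least one space):
import pandas as pd

with open("my_file.txt", "r") as f:
    lines = f.readlines()  # read once; the list can be reused as often as needed

# split each line at the first space only: [id, rest-of-line]
data = [line.rstrip("\n").split(" ", 1) for line in lines]
df = pd.DataFrame(data, columns=["id", "data"])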

Related

Dropping Rows that Contain a Specific String wrapped in square brackets?

I'm trying to drop rows that contain strings wrapped in square brackets in a column. I want to drop all values that contain the strings '[removed]' or '[deleted]'.
My df looks like this:
Comments
1 The main thing is the price appreciation of the token (this determines the gains or losses more
than anything). Followed by the ecosystem for the liquid staking asset, the more opportunities
and protocols that accept the asset as collateral, the better. Finally, the yield for staking
comes into play.
2 [deleted]
3 [removed]
4 I could be totally wrong, but sounds like destroying an asset and claiming a loss, which I
believe is fraudulent. Like someone else said, get a tax guy - for this year anyway and then
you'll know for sure. Peace of mind has value too.
I have tried df[df["Comments"].str.contains("removed")==False]
But when I try to save the dataframe, the rows are still not removed.
EDIT:
My full code
import pandas as pd
sol2020 = pd.read_csv("Solana_2020_Comments_Time_Adjusted.csv")
sol2021 = pd.read_csv("Solana_2021_Comments_Time_Adjusted.csv")
df = pd.concat([sol2021, sol2020], ignore_index=True, sort=False)
df[df["Comments"].str.contains("deleted")==False]
df[df["Comments"].str.contains("removed")==False]
Try this. I created a data frame with a Comments column using my own sample comments, but it should work for you:
import pandas as pd
sample_data = { 'Comments': ['first comment whatever','[deleted]','[removed]','last comments whatever']}
df = pd.DataFrame(sample_data)
data = df[df["Comments"].str.contains("deleted|removed")==False]
print(data)
Output I got:
Comments
0 first comment whatever
3 last comments whatever
You can do it like this:
new_df = df[~(df['Comments'].str.startswith('[') & df['Comments'].str.endswith(']'))].reset_index(drop=True)
Output:
>>> new_df
Comments
0 The main thing is the price appreciation of th...
3 I could be totally wrong, but sounds like dest...
That will remove all rows where the value of the Comments column for that row starts with [ and ends with ].
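Note that in the question the filtered frame is never assigned back to a variable, which is why nothing changes when the file is saved. A minimal sketch of the missing step (the output file name is just an example):
df = df[~df["Comments"].isin(["[deleted]", "[removed]"])]  # assign the filtered result back
df.to_csv("comments_filtered.csv", index=False)  # example output file name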

How to modify multiple values in one column, but skip others in pandas python

Going on two months in Python and I am focusing hard on pandas right now. In my current position I use VBA on data frames, so I am learning this to slowly replace it and further my career.
As of now I believe my true problem is the lack of understanding a key concept(s). Any help would be greatly appreciated.
That said here is my problem:
Where could I go to learn more about how to do things like this for more precise filtering? I'm very close, but there is one key aspect I'm missing.
Goal(s)
Main goal: I need to skip certain values in my ID column.
The code below takes out the dashes "-" and keeps only the first nine characters. Yet, I need to skip certain IDs because they are unique.
After that I'll start to work on comparing multiple sheets.
The main data frame's IDs are formatted as 000-000-000-000.
The other data frames that I will compare it to have the IDs with no dashes "-", as 000000000, with one 000 group fewer, totaling nine digits.
The unique IDs that I need skipped are the same in both data frames, but are formatted completely differently, e.g. 000-000-000_#12, 000-000-000_35, or 000-000-000_z.
My code that I will use on each ID except the unique ones:
dfSS["ID"] = dfSS["ID"].str.replace("-", "").str[:9]
but I want to use an if statement like (This does not work)
lst = ["000-000-000_#69B", "000-000-000_a", "etc.. random IDs", ]
if ~dfSS["ID"].isin(lst).any():
    dfSS["ID"] = dfSS["ID"].str.replace("-", "").str[:9]
else:
    pass
For more clarification my input DataFrame is this:
ID Street # Street Name
0 004-330-002-000 2272 Narnia
1 021-521-410-000_128 2311 Narnia
2 001-243-313-000 2235 Narnia
3 002-730-032-000 2149 Narnia
4 000-000-000_a 1234 Narnia
And I am looking to do this as the output:
ID Street # Street Name
0 004330002 2272 Narnia
1 021-521-410-000_128 2311 Narnia
2 001243313000 2235 Narnia
3 002730032000 2149 Narnia
4 000-000-000_a 1234 Narnia
Notes:
dfSS is my DataFrame variable name, i.e. the Excel sheet I am using. "ID" is my column heading; I will make it an index after the fact.
My data frame for this job is small, with (rows, columns) of (2500, 125).
I do not get an error message, so I am guessing maybe I need a loop of some kind. I'm starting to test for loops with this as well; no luck there... yet.
Here is where I have been to research this:
Comparison of a Dataframe column values with a list
How to filter Pandas dataframe using 'in' and 'not in' like in SQL
if statement with ~isin() in pandas
recordlinkage module-I didn't think this was going to work
Regular expression operations - Having a hard time fully understanding this at the moment
There are a number of ways to do this. The first way here doesn't involve writing a function.
# Create a placeholder column with all transformed IDs
dfSS["ID_trans"] = dfSS["ID"].str.replace("-", "").str[:9]
dfSS.loc[~dfSS["ID"].isin(lst), "ID"] = dfSS.loc[~dfSS["ID"].isin(lst), "ID_trans"] # conditional indexing
The second way is to write a function that conditionally converts the IDs, and it's not as fast as the first method.
def transform_ID(ID_val):
    if ID_val not in lst:
        return ID_val.replace("-", "")[:9]
    return ID_val  # leave IDs from the skip list unchanged
dfSS['ID_trans'] = dfSS['ID'].apply(transform_ID)
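A vectorized alternative, not from the answer above but a sketch using Series.where: keep the original value wherever the ID is in the skip list, and use the cleaned-up form everywhere else.
dfSS['ID'] = dfSS['ID'].where(
    dfSS['ID'].isin(lst),                                  # True  -> keep the unique ID as-is
    dfSS['ID'].str.replace('-', '', regex=False).str[:9],  # False -> cleaned nine-character ID
)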
This is based on @xyzxyzjayne's answer, but I have two issues I cannot figure out.
First issue: I get this warning (see Edit):
SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead
Documentation for this warning
You'll see in the code below that I tried to put in .loc, but I can't seem to figure out how to eliminate this warning by using .loc correctly. Still learning it. No, I will not just ignore it even though it works; this is a learning opportunity, I say.
Second issue: I do not understand this part of the code. I know the left side is supposed to be rows, and the right side is columns. That said, why does this work? ID is a column, not a row, when this code is run. (I make the ID the index later.)
df.loc[~df["ID "].isin(uniqueID ), "ID "] = df.loc[~df["ID "].isin(uniqueID ), "Place Holder"]
The area I don't understand yet is the left side of the comma (,) in this part:
df.loc[~df["ID "].isin(uniqueID), "ID "]
That said, here is the final result. Basically, as I said, it's XYZ's help that got me here, but I'm adding more .locs and playing with the documentation until I can eliminate the warning.
# a whole list of IDs I had to manually enter (1000+ entries) that will be used
# in the code below; these IDs get skipped, e.g. "032-234-987_#4256"
uniqueID = [...]
# get the columns I need, to make the DataFrame smaller
df = df[['ID ', 'Street #', 'Street Name', 'Debris Finish', 'Number of Vehicles',
         'Number of Vehicles Removed', 'County']]
# Place Holder will be our new column with this filter
df.loc[:, "Place Holder"] = df.loc[:, "ID "].str.replace("-", "").str[:9]
# the next line is the filter that goes through the list and skips those IDs
# (work in progress to fully understand)
df.loc[~df["ID "].isin(uniqueID), "ID "] = df.loc[~df["ID "].isin(uniqueID), "Place Holder"]
# make the ID our index
df = df.set_index("ID ")
# just here to add the date to our file name; must import time for this to work
todaysDate = time.strftime("%m-%d-%y")
# make it an Excel file
df.to_excel("ID TEXT " + todaysDate + ".xlsx")
Will edit this once I get rid of the warning and figure out the left side, so I can explain it for everyone who needs/sees this post.
Edit: SettingWithCopyWarning:
Fixed this chained-indexing problem by making a copy of the original data frame before filtering and making everything .loc, as XYZ helped me with. Before you start to filter, use DataFrame.copy(), where DataFrame is the name of your own dataframe.
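A minimal sketch of that fix, reusing the column names from the code above (df_original is a hypothetical name for the unfiltered frame; the trailing spaces in the column names are as in the original sheet):
# take an explicit copy of the column subset so later .loc assignments work on
# an independent DataFrame, not a view of the original (avoids the warning)
df = df_original[['ID ', 'Street #', 'Street Name']].copy()
df.loc[:, "Place Holder"] = df["ID "].str.replace("-", "").str[:9]
df.loc[~df["ID "].isin(uniqueID), "ID "] = df.loc[~df["ID "].isin(uniqueID), "Place Holder"]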

How to avoid new line as a delimiter in pandas dataframe

I have an Excel sheet which has a column called remarks. So, for example, a cell contains data in a format like:
There is a book.
There is also a pen along with the book.
So, I decided to study for a while.
When I convert that Excel file into a pandas data frame, the data frame only captures the 1st point up to the new line. It won't capture point no. 2. So, how can I get all the points from one Excel cell into one cell of a data frame?
The data which I get looks like:
There is a book.
The data which I want should look like:
There is a book. 2. There is also a pen along with the book. 3. So, I decided to study for a while.
I created an excel file with a column named remarks which looks like below:
remarks
0 1. There is a book.
2. There is also a pen along with the book.
3. So, I decided to study for a while.
Here, I have entered all the text mentioned in your question into a single cell.
import pandas as pd
df = pd.read_excel('remarks.xlsx')
Now when I try to print the column remarks it gives:
df['remarks']
0 1. There is a book.\n2. There is also a pen al...
Name: A, dtype: object
To solve your problem try:
df['remarks_without_linebreak'] = df['remarks'].replace('\n',' ', regex=True)
If you print the row in the column 'remarks_without_linebreak' you will get the result you want.
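An equivalent sketch using the string accessor (same column names as above):
# replace literal newlines inside each cell with a space
df['remarks_without_linebreak'] = df['remarks'].str.replace('\n', ' ', regex=False)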

Pandas/Python - Excel Data Manipulation

So, I'm relatively new to Python/pandas, but I do have a couple of years of programming under my belt; it's mainly with Java/C++, so nothing like a scripting language like Python.
My new job has me doing some scripting, and it's been pretty basic so far, so I decided to try to do more and hopefully show my bosses that I am driven and willing to work hard and move up the ladder. With that in mind, I wanted to make one of our data analysis tasks more efficient by using pandas to remove redundancies from an Excel sheet. However, the redundancies that I'm trying to "parse" for are a substring within a "description" Excel column.
import pandas as pd
xlsx = pd.ExcelFile('Filename.xlsx')
sheet1 = xlsx.parse(0)
So I read the Excel file and parsed it into a data frame. I realized it may be easier to just use read_csv instead, but by the time I thought about it, I was already committed to following through with Excel. (Unless the transition isn't difficult; I'm just confused about how I could export as comma-delimited when the original file is space-delimited.)
Here is how the data is kind of laid out:
ID Count#1 Count#2 Count#3 Description
1A42H4 1 0 2 Blahblah JIG GN=TRAC Blah Blah
242JB4 0 0 2 Blahblah JIG GN=SMOOTH Blah Blah
3MIVJ2 2 0 2 Blahblah JIG GN=TRAC Blah Blah
4JAXI3 1 0 3 BlahBlah JIG GN=TRAC Blah Blah
So I want to parse this datasheet, look for any redundant GN=TRAC (really any similar GN=something), and then organize them all together into a separate datasheet. So I made an array of just the description column:
array = dataframe.description
Then, I decided to use a string split on "JIG", because I didn't need that, and it was constant for all rows. So:
Splits = array.str.split('JIG')
Because of that I was left with
array[0] = Blahblah, GN=TRAC Blah Blah
and now I wanted to isolate it again for GN=TRAC, so I added them all into an array:
array2[n] = splits[n][1]
and then did another split, splits2 = array2.str.split(' '), to put GN=TRAC in the first position and isolate it by itself. I realize I could have just done a space-delimited split on the original description, but the rows have different numbers of words, so I wouldn't be able to parse or compare them, since the position of GN=TRAC varies from row to row.
Now, to iterate and compare them all, I came up with this little function:
counter = 0
temp = counter + 1
print(sheet1.iloc[counter])
while counter <= len(sheet1):
    if splits2[counter][0] == splits2[temp][0]:
        print(sheet1.iloc[temp])
        temp += 1
    if splits2[counter][0] != splits2[temp][0]:
        temp += 1
    counter += 1
But I can't get past here. I'm able to iterate through and find all of the redundant rows matching the first row's GN=TRAC value, but the counter isn't advancing to the next row for comparison. I've tried a couple of variations, but I was hoping for a new pair of eyes. Based on the table above, it would then go to the second row and find all the rows that match GN=SMOOTH, and so on until the counter reaches the final row.
Lastly, I was hoping I could get some help on the best way to organize them together based on the GN=? value into an output.xlsx. I realize that there is the writer and to_excel, but I'm just not sure how I would use them here. I read through the documentation as much as I could, and it doesn't seem like there is a function that would do this for me, which is why it's pretty complicated (do let me know how to make it more efficient and scriptable though; I can generalize it later).
P.S. Is there also a way to write to the Excel file in descending order of Count#1?
You could try
sheet1['GN'] = sheet1.Description.apply(lambda x: x.split('JIG')[1].split()[0])
which should insert a new column with the name GN into your DataFrame with the appropriate GN=* values.
To sort the DataFrame by any particular column you can use sheet1.sort_values('GN').
To save the DataFrame to an Excel file you can use sheet1.to_excel('filename'). You can chain this with the sort above to write a file ordered by a particular column.
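A minimal end-to-end sketch of that idea, reusing the column names from the sample table above (Description, Count#1); the output file name and the one-sheet-per-GN-group layout are just one possible reading of "organize them together":
import pandas as pd

sheet1 = pd.ExcelFile('Filename.xlsx').parse(0)

# pull the GN=... token out of each description
sheet1['GN'] = sheet1['Description'].str.extract(r'(GN=\S+)', expand=False)

# sort by Count#1, highest first, then write one sheet per GN group
out = sheet1.sort_values('Count#1', ascending=False)
with pd.ExcelWriter('output.xlsx') as writer:
    for gn_value, group in out.groupby('GN'):
        group.to_excel(writer, sheet_name=gn_value.replace('=', '_'), index=False)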

I need your help with read_fwf in python pandas

An example of the text file is shown in the picture.
According to the file, the direction of the data changes after the word 'chapter'.
In other words, the direction of reading changes from horizontal to vertical.
In order to solve this big problem, I found read_fwf in the pandas module and applied it, but failed.
linefwf = pandas.read_fwf('File.txt', widths=[33,33,33], header=None, nwors = 3)
The gap between the categories (Chapter, Title, Assignment) is 33.
But the command (linefwf) prints every line of the page, which includes the horizontal categories such as Title, Date, and Reservation, as well as blank lines.
Please, I want to know how to export the vertical data only.
Let me take a stab in the dark: you wish to turn this table into a column (aka "vertical category"), ignoring the other columns?
I didn't have your precise text, so I guesstimated it. My column widths were different from yours ([11,21,31]) and I omitted the nwors argument (you probably meant to use nrows, but it's superfluous in this case). While the column spec isn't very precise, a few seconds of fiddling left me with a workable DataFrame.
This is pretty typical of read-in datasets. Let's clean it up slightly, by giving it real column names, and taking out the separator rows:
df.columns = list(df.loc[0])  # promote the first row to column names
df = df.loc[2:6]              # keep only the data rows
This renames the columns and drops the separator rows, leaving us with a cleaned-up df.
We won't take the time to reindex the rows. Assuming we want the value of a column, we can get it by indexing:
df['Chapter']
Yields:
2 1-1
3 1-2
4 1-3
5 1-4
6 1-5
Name: Chapter, dtype: object
Or if you want it not as a pandas.Series but a native Python list:
list(df['Chapter'])
Yields:
['1-1', '1-2', '1-3', '1-4', '1-5']
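Putting the pieces together, a minimal end-to-end sketch (the widths are the guessed values from above and will need adjusting to the real file):
import pandas as pd

# read the fixed-width file with guessed column widths
df = pd.read_fwf('File.txt', widths=[11, 21, 31], header=None)

df.columns = list(df.loc[0])    # promote the first row to column names
df = df.loc[2:6]                # keep only the data rows

chapters = list(df['Chapter'])  # the "vertical" data as a plain Python list
print(chapters)                 # e.g. ['1-1', '1-2', '1-3', '1-4', '1-5']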
