An example of the text file is in the attached picture.
According to the file, the direction of the data changes after the word 'Chapter'.
In other words, the reading direction changes from horizontal to vertical.
To solve this problem, I found read_fwf in the pandas module and applied it, but it failed.
linefwf = pandas.read_fwf('File.txt', widths=[33,33,33], header=None, nwors = 3)
The gap between the categories (Chapter, Title, Assignment) is 33 characters.
But the command (linefwf) prints every line of the page, including the horizontal categories such as Title, Date, and Reservation, as well as blank lines.
I want to know how to export the vertical data only.
Let me take a stab in the dark: you wish to turn this table into a column (aka "vertical category"), ignoring the other columns?
I didn't have your precise text, so I guesstimated it. My column widths were different from yours ([11,21,31]) and I omitted the nwors argument (you probably meant to use nrows, but it's superfluous in this case). While the column spec isn't very precise, a few seconds of fiddling left me with a workable DataFrame:
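For reference, a minimal sketch of the call I ended up with (the file name is yours, but the widths are my guesses, not a confirmed match for your file):

import pandas as pd
linefwf = pd.read_fwf('File.txt', widths=[11, 21, 31], header=None)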
This is pretty typical of read-in datasets. Let's clean it up slightly, by giving it real column names, and taking out the separator rows:
df.columns = list(df.loc[0])
df = df.loc[2:6]  # .ix was removed from pandas; label-based .loc selects the same rows here
This has the following effect:
Leaving us with df as:
We won't take the time to reindex the rows. Assuming we want the value of a column, we can get it by indexing:
df['Chapter']
Yields:
2 1-1
3 1-2
4 1-3
5 1-4
6 1-5
Name: Chapter, dtype: object
Or if you want it not as a pandas.Series but a native Python list:
list(df['Chapter'])
Yields:
['1-1', '1-2', '1-3', '1-4', '1-5']
Important initial information: these values are IDs, not calculation results, so I really have no way to change how they are saved in the file.
Dataframe example:
datetime          match_name          match_id     runner_name  runner_id  ...
2022/01/01 10:10  City v Real Madrid  1.199632310  City         122.23450  ...
2021/01/01 01:01  Celtic v Rangers    1.23410      Rangers      101.870    ...
But match_id in the DataFrame appears as:
1.19963231
1.2341
And runner_id in the DataFrame appears as:
122.2345
101.87
I tried to cast all values to strings so pandas would see the numbers as strings and not remove the zeros:
df = pd.read_csv(filial)
df = df.astype(str)
But it didn't help; it kept removing the trailing zeros.
I am aware of float_format, but it requires specifying the number of decimal places, so I could not use it; and since these are IDs, I cannot risk a very large value being rounded.
Note: there are hundreds of different columns.
By the time your data is read, the zeros are already removed, so your conversion to str can no longer help.
You need to pass the option directly to read_csv():
df = pd.read_csv(filial, dtype={'runner_id': str})
If you have many columns like this, you can set dtype=str (instead of a dictionary), but then all your columns will be str, so you need to re-parse each of the interesting ones as their correct dtype (e.g. datetime).
More details in the docs; maybe play with the converters param too.
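For example, a minimal sketch (the file name is a placeholder and the column names are taken from the example above):

import pandas as pd
# Read every column as a string so no trailing zeros are lost,
# then re-parse only the genuinely non-ID columns.
df = pd.read_csv('matches.csv', dtype=str)
df['datetime'] = pd.to_datetime(df['datetime'], format='%Y/%m/%d %H:%M')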
I'm going on two months in Python and am focusing hard on pandas right now. In my current position I use VBA on data frames, so I'm learning this to slowly replace it and further my career.
As of now I believe my true problem is a lack of understanding of a key concept or two. Any help would be greatly appreciated.
That said, here is my problem:
Where could I go to learn more about how to do things like this for more precise filtering? I'm very close, but there is one key aspect I need.
Goal(s)
Main goal: I need to skip certain values in my ID column.
The code below takes out the dashes "-" and reads only up to 9 digits. Yet I need to skip certain IDs because they are unique.
After that I'll start to work on comparing multiple sheets.
Main data frame IDs are formatted as 000-000-000-000.
The other data frames that I will compare it to have the IDs with no dashes, as 000000000, with three fewer digits, totaling nine.
The unique IDs that I need skipped are the same in both data frames, but are formatted completely differently, e.g. 000-000-000_#12, 000-000-000_35, or 000-000-000_z.
My code that I will use on each ID except the unique ones:
dfSS["ID"] = dfSS["ID"].str.replace("-", "").str[:9]
but I want to use an if statement like this (which does not work):
lst = ["000-000-000_#69B", "000-000-000_a", "etc.. random IDs"]
if ~dfSS["ID"].isin(lst).any():
    dfSS["ID"] = dfSS["ID"].str.replace("-", "").str[:9]
else:
    pass
For more clarification my input DataFrame is this:
ID Street # Street Name
0 004-330-002-000 2272 Narnia
1 021-521-410-000_128 2311 Narnia
2 001-243-313-000 2235 Narnia
3 002-730-032-000 2149 Narnia
4 000-000-000_a 1234 Narnia
And I am looking to do this as the output:
ID Street # Street Name
0 004330002 2272 Narnia
1 021-521-410-000_128 2311 Narnia
2 001243313000 2235 Narnia
3 002730032000 2149 Narnia
4 000-000-000_a 1234 Narnia
Notes:
dfSS is my DataFrame variable name, i.e. the Excel file I am using. "ID" is my column heading; I will make it an index after the fact.
My data frame on this job is small, with (rows, columns) of (2500, 125).
I do not get an error message, so I am guessing maybe I need a loop of some kind. I'm starting to test for loops with this as well. No luck there... yet.
Here is where I have been to research this:
Comparison of a Dataframe column values with a list
How to filter Pandas dataframe using 'in' and 'not in' like in SQL
if statement with ~isin() in pandas
recordlinkage module - I didn't think this was going to work
Regular expression operations - Having a hard time fully understanding this at the moment
There are a number of ways to do this. The first way here doesn't involve writing a function.
# Create a placeholder column with all transformed IDs
dfSS["ID_trans"] = dfSS["ID"].str.replace("-", "").str[:9]
dfSS.loc[~dfSS["ID"].isin(lst), "ID"] = dfSS.loc[~dfSS["ID"].isin(lst), "ID_trans"] # conditional indexing
The second way is to write a function that conditionally converts the IDs, and it's not as fast as the first method.
def transform_ID(ID_val):
    if ID_val not in lst:
        return ID_val.replace("-", "")[:9]
    return ID_val  # pass the unique IDs through unchanged instead of returning None
dfSS['ID_trans'] = dfSS['ID'].apply(transform_ID)
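As a quick check, here is a minimal demonstration of the second method on a cut-down version of the sample input, reusing the transform_ID function above (the lst entries are assumptions):

import pandas as pd

lst = ["021-521-410-000_128", "000-000-000_a"]  # hypothetical unique IDs to skip
dfSS = pd.DataFrame({"ID": ["004-330-002-000", "021-521-410-000_128",
                            "001-243-313-000", "000-000-000_a"]})
dfSS["ID_trans"] = dfSS["ID"].apply(transform_ID)
print(dfSS)  # the unique IDs pass through unchanged; the rest are stripped to 9 digits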
This is based on @xyzxyzjayne's answer, but I have two issues I cannot figure out.
First issue
is that I get this warning (see Edit):
SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead
Documentation for this warning
You'll see in the code below that I tried to put in .loc, but I can't seem to figure out how to eliminate this warning by using .loc correctly. I'm still learning it. No, I will not just ignore it even though the code works; this is a learning opportunity, I say.
Second issue
is that I do not understand this part of the code. I know the left side is supposed to be rows and the right side columns. That said, why does this work? ID is a column, not a row, when this code is run (I only make ID the index later):
df.loc[~df["ID "].isin(uniqueID ), "ID "] = df.loc[~df["ID "].isin(uniqueID ), "Place Holder"]
The area I don't understand yet is the left side of the comma (,) in this part:
df.loc[~df["ID "].isin(uniqueID), "ID "]
That said, here is the final result. Basically, as I said, it's xyzxyzjayne's help that got me here, but I'm adding more .loc calls and playing with the documentation until I can eliminate the warning.
# A whole list of 1000+ IDs I had to enter manually; these IDs get skipped.
uniqueID = ["032-234-987_#4256", ...]  # example entry; the full list goes here
# Keep only the columns I need, to make the DataFrame smaller
df = df[['ID ', 'Street #', 'Street Name', 'Debris Finish', 'Number of Vehicles',
         'Number of Vehicles Removed', 'County']]
# Place Holder will become our new column, built with this filter
df.loc[:, "Place Holder"] = df.loc[:, "ID "].str.replace("-", "").str[:9]
# The next line is the filter that goes through the list and skips those IDs. Work in progress to fully understand.
df.loc[~df["ID "].isin(uniqueID), "ID "] = df.loc[~df["ID "].isin(uniqueID), "Place Holder"]
# Make the ID our index
df = df.set_index("ID ")
# Just here to add the date to our file name. Must import time for this to work.
todaysDate = time.strftime("%m-%d-%y")
# Write it out as an Excel file
df.to_excel("ID TEXT " + todaysDate + ".xlsx")
I will edit this once I get rid of the warning and figure out the left side, so I can explain it for everyone who needs/sees this post.
Edit: SettingWithCopyWarning:
Fixed this chained-indexing problem by making a copy of the original DataFrame before filtering, and by making everything .loc as xyzxyzjayne helped me with. Before you start to filter, use DataFrame.copy(), where DataFrame is the name of your own dataframe.
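A minimal sketch of the fix, using the same column subset as above:

# Copy the trimmed DataFrame so later .loc assignments write to an
# independent object instead of a view of the original.
df = df[['ID ', 'Street #', 'Street Name', 'Debris Finish', 'Number of Vehicles',
         'Number of Vehicles Removed', 'County']].copy()
df.loc[:, "Place Holder"] = df.loc[:, "ID "].str.replace("-", "").str[:9]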
I am trying to load a .txt file using pandas read_csv function.
My data looks like this:
84-121123-0000 GO DO YOU HEAR
84-121123-0001 BUT IN LESS THAN FIVE MINUTES THE STAIRCASE GROANED BENEATH AN EXTRAORDINARY WEIGHT
84-121123-0002 AT THIS MOMENT THE WHOLE SOUL OF THE OLD MAN SEEMED CENTRED IN HIS EYES WHICH BECAME BLOODSHOT THE VEINS OF THE THROAT SWELLED HIS CHEEKS AND TEMPLES BECAME PURPLE AS THOUGH HE WAS STRUCK WITH EPILEPSY NOTHING WAS WANTING TO COMPLETE THIS BUT THE UTTERANCE OF A CRY
84-121123-0003 AND THE CRY ISSUED FROM HIS PORES IF WE MAY THUS SPEAK A CRY FRIGHTFUL IN ITS SILENCE
84-..
..
..
First column is ID
Second column is data
The problem I have when loading this data with a space separator is that every space after the second column starts a new field, which is not what I want.
I want ID as first column, and then second column should have all the space separated data.
I can think of using pandas for this task, but if there is a better library, please let me know.
Here's the code snippet I tried:
test = pd.read_csv('my_file.txt', sep=' ', names=['id', 'data'])
I get unexpected output. The output I want is:
id data
84-121123-0000 GO DO YOU HEAR
84-121123-0001 BUT IN LESS THAN FIVE MINUTES THE STAIRCASE GROANED BENEATH AN EXTRAORDINARY WEIGHT
....
If your id always has the format "xx-xxxxxx-xxxx", you can use a lookbehind on it as the separator (the regex below matches a whitespace character only when the id pattern immediately precedes it):
df = pd.read_csv(
"your_file.txt",
sep=r"(?<=\d{2}-\d{6}-\d{4})\s",
engine="python",
names=["id", "data"],
)
print(df)
Prints:
id data
0 84-121123-0000 GO DO YOU HEAR
1 84-121123-0001 BUT IN LESS THAN FIVE MINUTES THE STAIRCASE GR...
2 84-121123-0002 AT THIS MOMENT THE WHOLE SOUL OF THE OLD MAN S...
3 84-121123-0003 AND THE CRY ISSUED FROM HIS PORES IF WE MAY TH...
I'm not sure how much slower it is to first read the text file and then create a pandas DataFrame, but you could first get each line into a list, separate each element by the first space (using split(" ",1)) and then create a DataFrame.
import pandas as pd

with open(TXTFILE, "r") as f:  # TXTFILE is the path to your text file
    data = [s.rstrip("\n").split(" ", 1) for s in f.readlines()]
df = pd.DataFrame(data, columns=['col1', 'col2'])
Note that f.readlines() only works once after opening the file, so save the result as a separate list if you are going to use it more than once.
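If you would rather stay inside pandas, the same first-space split can be done with Series.str.split (the sep="\t" trick below assumes your text contains no tab characters):

import pandas as pd

lines = pd.read_csv(TXTFILE, sep="\t", header=None, names=["raw"])  # one whole line per row
df = lines["raw"].str.split(" ", n=1, expand=True)
df.columns = ["id", "data"]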
https://github.com/rgrantham82/Hate_Crimes_Analysis/blob/master/Data%20Wrangling%20(1).ipynb
If the above link doesn't work, use https://github.com/rgrantham82/Hate_Crimes_Analysis and click on the Data Wrangling Jupyter notebook.
I am currently analyzing hate crimes data for Austin, TX. So far I'm in the cleaning phase of it, and I am having a brain-fart on how best to proceed.
So far, I have concatenated 4 datasets from data.austintexas.gov -- reported hate crimes from 2017 to the present. The resulting set produced several new columns because the original columns, especially the 'date...', 'victim...', and 'offender...' columns, were all formatted differently by the creator(s)/curator(s)... great work, whoever you are, working for austintexas.gov... Anyway, my goals now are:
The most important column for my purposes is the 'bias' column. How would I convert the data to a numerical type? I cannot visualize it with Matplotlib because, obviously, it's not numerical.
Somehow convert the 'incident_number' data to datetime, or some other numerical data type, to make visualization better.
Unless it's possible to clean up and merge the various 'date' columns and convert them, the simplest way seems to be manipulating the 'incident_number' column.
Btw, I'm very much a novice with Python. Any help is greatly appreciated, but I'm also very open to advice, etc. Thanks y'all!
1) I believe you could convert the bias column to an int.
Let's say you have a dataframe named df with the column bias.
You could do:
df['bias'] = df['bias'].astype(int)
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.astype.html
2) For incident number you could do:
df['incident_number'] = pd.to_datetime(df['incident_number'])
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.to_datetime.html
Hope this helps!
Your bias column is text and you want numerals; for that, use Categorical:
df['numerical_bias'] = pd.Categorical(df.bias).codes  # .codes yields one integer per category
For date-time formatting problems, use the format argument of the to_datetime function:
df['incident_number'] = pd.to_datetime(df['incident_number'], format='%d%m%Y')
The formatting documentation is available here: https://docs.python.org/3/library/datetime.html#strftime-and-strptime-behavior
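To see what Categorical actually does with the text, here is a tiny sketch (the bias values are made up):

import pandas as pd

s = pd.Series(["Anti-Black", "Anti-Semitic", "Anti-Black"])  # hypothetical bias values
cat = pd.Categorical(s)
print(list(cat.codes))       # [0, 1, 0] -- one integer code per row
print(list(cat.categories))  # ['Anti-Black', 'Anti-Semitic']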
Renaming columns in pandas:
df2.columns = df1.columns
# or if the columns don't match
df2.columns = ['all', 'the', 'columns', 'you', 'require']
# if you want to rename only one column (a pandas Index is immutable, so item
# assignment like df2.columns[i] = 'new_name' raises a TypeError)
i = 4
df2 = df2.rename(columns={df2.columns[i]: 'new_name'})
I'm new to Python and I would appreciate an answer as soon as possible.
I'm processing a file containing reviews for products that can belong to more than one category. What I need is to group the review ratings by category and date at the same time. Since I don't know the exact number of categories or dates in advance, I need to add rows and columns as I process the review data (a 50 GB file).
I've seen how to add columns; however, my trouble is adding a row without knowing how many columns are currently in the dataframe.
Here is my code:
list1=['Movies & TV', 'Books'] #categories so far
dfMain=pandas.DataFrame(index=list1,columns=['2002-09']) #only one column at the beginning
print(dfMain)
This is what dfMain looks like:
             2002-09
Movies & TV      NaN
Books            NaN
If I want to add a column, I simply do this:
dfMain.insert(0, date, 0) #where date is in format like '2002-09'
But what if I want to add a new category (row) and fill all the date columns with zeros? How do I do that? I've tried the append method, but it asks for all the columns as parameters. The insert method doesn't seem to work either.
Here's a possible solution:
dfMain.append(pd.Series(index=dfMain.columns, name='NewRow').fillna(0))
2002-09
Movies & TV NaN
Books NaN
NewRow 0.0
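Note that DataFrame.append was removed in pandas 2.0, so on current versions the simplest equivalent is a direct .loc assignment:

dfMain.loc['NewRow'] = 0  # adds a row labelled 'NewRow' filled with zeros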