'utf-8' Codec Can't Decode Bytes in Position 5893-5894: Invalid Continuation Byte
If you are running into the error "UnicodeDecodeError: 'utf8' codec can't decode byte 0xa5 in position 0: invalid start byte", don't worry. This article explains why it happens and how to fix it.
Reason for the "UnicodeDecodeError: 'utf8' codec can't decode byte 0xa5 in position 0: invalid start byte" error
This problem is common when reading a CSV file with pandas. By default, the read_csv() function decodes the file as UTF-8, but the file contains bytes that are not valid UTF-8. This usually means the file was saved in a different encoding (for example Windows-1252 or Latin-1) and contains special characters.
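The error can be reproduced without any file at all, which makes the cause easier to see. The snippet below is a minimal sketch: the helper name describe_decode_failure and the sample byte strings are made up for illustration and do not come from the CSV file used in this article.

```python
def describe_decode_failure(raw, encoding="utf-8"):
    """Try to decode raw bytes and describe the first failure, if any."""
    try:
        raw.decode(encoding)
    except UnicodeDecodeError as exc:
        # exc.start is the offset of the first offending byte,
        # exc.reason is the short explanation shown in the error message.
        return f"byte 0x{raw[exc.start]:02x} at position {exc.start}: {exc.reason}"
    return "decoded cleanly"


# 0xa5 can only appear in the middle of a UTF-8 sequence, never at its start.
print(describe_decode_failure(b"\xa5abc"))

# 0xe9 announces a multi-byte sequence, but the next byte does not continue it.
print(describe_decode_failure(b"\xe9abc"))

# The same character encoded as real UTF-8 decodes without problems.
print(describe_decode_failure("é".encode("utf-8")))
```

The first call reports "invalid start byte" and the second "invalid continuation byte", the two variants of the message mentioned in this article.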
Now we will read a CSV file from the biomedical domain with pandas to see how the error happens.
You can download the CSV file here.
Code:
    import pandas as pd

    data = pd.read_csv("alldata_1_for_kaggle.csv")
    data.head()

Result:
    UnicodeDecodeError                        Traceback (most recent call last)
    <ipython-input-76-0c9089169b2f> in <module>
          1 import pandas as pd
    ----> 2 a = pd.read_csv('/content/drive/MyDrive/LearnShareIT/alldata_1_for_kaggle.csv')

    /usr/local/lib/python3.7/dist-packages/pandas/_libs/parsers.pyx in pandas._libs.parsers.raise_parser_error()

    UnicodeDecodeError: 'utf-8' codec can't decode byte 0x99 in position 3: invalid start byte

Note: You may get the same error in this general form: UnicodeDecodeError: 'utf-8' codec can't decode byte <<byte value>> in position <<position>>: invalid start byte. The byte value is the offending byte in hexadecimal, and the position is its offset in the data being decoded.
Solutions to this problem
Solution for reading a CSV file:
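Python supports many encodings. Some common ones, such as latin1, iso-8859-1, ascii, and us-ascii, can bypass the codecs lookup machinery to improve performance. More importantly for this error, latin1 (an alias of iso-8859-1) maps every possible byte value to a character, so decoding with it never raises UnicodeDecodeError. The trade-off is that some characters may come out wrong if the file was actually saved in another encoding.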
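Since a permissive encoding can hide a wrong choice, one option is to try stricter candidates first and fall back to latin1 only as a last resort. The following is a minimal sketch, not a definitive method: the helper decode_with_fallback and its default candidate list are assumptions made for illustration, and the right candidates depend on where your file came from.

```python
def decode_with_fallback(raw, encodings=("utf-8", "cp1252", "latin1")):
    """Return (text, encoding) for the first candidate that decodes raw."""
    for encoding in encodings:
        try:
            return raw.decode(encoding), encoding
        except UnicodeDecodeError:
            continue  # this candidate rejected the bytes, try the next one
    raise ValueError("none of the candidate encodings could decode the data")


# 0x99 is not valid UTF-8 here, but Windows-1252 defines it as the trademark sign.
print(decode_with_fallback(b"abc\x99"))

# 0x9d is undefined in Windows-1252, so only latin1 accepts it.
print(decode_with_fallback(b"abc\x9d"))
```

The encoding name that the helper returns can then be passed on to whatever reads the file. Note that a successful decode only proves the bytes are legal in that encoding, not that the resulting text is what the author intended.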
You can pass the "encoding" parameter to read_csv() with a string value that names the encoding used to decode the file.
In our example, we use "latin1" to decode the data.
Code:
    import pandas as pd

    data = pd.read_csv("alldata_1_for_kaggle.csv", encoding='latin1')  # pass encoding parameter
    data.head()

Result:
       Unnamed: 0               0                                                  a
    0           0  Thyroid_Cancer  Thyroid surgery in  children in a single insti...
    1           1  Thyroid_Cancer  " The adopted strategy was the same as that us...
    2           2  Thyroid_Cancer  coronary arterybypass grafting thrombosis ï¬b...
    3           3  Thyroid_Cancer   Solitary plasmacytoma SP of the skull is an u...
    4           4  Thyroid_Cancer   This study aimed to investigate serum matrix ...

The file now loads without an error. Row 2 still shows some garbled characters (ï¬), which suggests that latin1 may not match how that part of the text was originally encoded.

Solution for reading text and JSON files:
The initial content of the JSON and text files:

a.json:
    {"student":[
        { "firstName":"™œœ''™™œ""××""™"ˆ'γ°°'ˆ'"œ™"ε""Ãö", "lastName":"Doe" },
        { "firstName":"Anna", "lastName":"Smith" },
        { "firstName":"Peter", "lastName":"Jones" }
      ]
    }

a.txt:
    œMedical Informatics and œHealth Care Sciences

Open file and read with binary mode
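Syntax: file_reader = open("path/to/file", "rb"), where "rb" is the binary reading mode. In binary mode, Python returns the raw bytes without decoding them, so reading the file cannot raise UnicodeDecodeError.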
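Because binary mode hands back undecoded bytes, it is also a convenient way to find out exactly which byte breaks decoding. The following is a minimal sketch: the helper find_decode_error and the demo file contents are hypothetical and exist only for illustration.

```python
import os
import tempfile


def find_decode_error(path, encoding="utf-8"):
    """Return (position, offending_bytes, context) for the first decode failure, or None."""
    with open(path, "rb") as f:  # binary mode: nothing is decoded yet
        raw = f.read()
    try:
        raw.decode(encoding)
    except UnicodeDecodeError as exc:
        # Keep a few bytes on either side to see where in the file the problem is.
        context = raw[max(0, exc.start - 10):exc.end + 10]
        return exc.start, raw[exc.start:exc.end], context
    return None


# Demo with a throwaway file that contains one Latin-1 byte (0xe9).
with tempfile.NamedTemporaryFile(delete=False, suffix=".csv") as tmp:
    tmp.write(b"id,name\n1,caf\xe9\n")
    demo_path = tmp.name

result = find_decode_error(demo_path)
os.remove(demo_path)
print(result)
```

The reported position is the same offset that appears in the error message, so the surrounding context usually makes it clear which character in the file is responsible.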
Read JSON file:
    import json

    # json.load() accepts a binary file object and detects UTF-8, UTF-16, or UTF-32 itself
    with open('a.json', 'rb') as file:
        content = json.load(file)

    print(content)

Result:
    {'student': [{'firstName': "™œ\x9dœ\x9d''™™œ\x9d""××""™"ˆ'γ°°'ˆ'"œ\x9d™"ε""Ã\xadö", 'lastName': 'Doe'}, {'firstName': 'Anna', 'lastName': 'Smith'}, {'firstName': 'Peter', 'lastName': 'Jones'}]}

Read text file:
    # Binary mode returns the raw bytes without decoding them
    with open('a.txt', 'rb') as file:
        print(file.read())

Result:
    b'\xc5\x93Medical Informatics\xc2\x9d and \xc5\x93Health Care Sciences'

Ignoring errors when reading file
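Syntax: file = open("path/to/file", "r", errors="ignore"). The errors="ignore" argument skips any bytes that cannot be decoded instead of raising an error. Note that ignoring encoding errors can lead to data loss.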
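To see what is actually lost, it can help to compare "ignore" with two other standard error handlers on the same input. This is a small sketch with a made-up byte string; "replace" and "backslashreplace" are built-in Python error handlers, and the same names can be passed to the errors argument of open().

```python
raw = b"caf\xe9 ok"  # 0xe9 is "é" in Latin-1, but it is not valid UTF-8 here

results = {
    handler: raw.decode("utf-8", errors=handler)
    for handler in ("ignore", "replace", "backslashreplace")
}

for handler, text in results.items():
    # ignore drops the byte, replace inserts U+FFFD,
    # backslashreplace keeps the byte value as a visible escape.
    print(f"{handler:>16}: {text!r}")
```

With "ignore" the bad byte disappears silently, while "replace" and "backslashreplace" leave a visible marker where the data could not be decoded, which is often easier to audit afterwards.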
Read JSON file:
    import json

    # errors='ignore' silently drops any bytes that cannot be decoded
    with open('a.json', 'r', errors='ignore') as file:
        content = json.load(file)

    print(content)

Result:
    {'student': [{'firstName': "â„¢Å"ÂÅ"Â''™™Å"Ââ€â€œÃƒâ€"Ãâ€"â€â€â„¢â€œË†â€™ÃŽÂ³Â°Â°â€™Ë†â€™â€œÅ"™“ε““ÃÂ\xadö", 'lastName': 'Doe'}, {'firstName': 'Anna', 'lastName': 'Smith'}, {'firstName': 'Peter', 'lastName': 'Jones'}]}

Read text file:
    # errors='ignore' silently drops any bytes that cannot be decoded
    with open('a.txt', 'r', errors='ignore') as file:
        print(file.read())

Result:
    Å"Medical Informatics and Å"Health Care Sciences

Summary
"UnicodeDecodeError: 'utf8' codec can't decode byte 0xa5 in position 0: invalid start byte" is a common error when reading files. We hope this article has helped you understand the root of the problem and how to solve it.
             
          
Full Name: Huan Nguyen
Name of the university: HUST
Major: IT
Programming Languages: Python, C, C++, Machine Learning/Deep Learning/NLP
Source: https://learnshareit.com/unicodedecodeerror-utf8-codec-cant-decode-byte-0xa5-in-position-0-invalid-start-byte/