picture


picture
picture
picture

One of the Python text analysis series: NLTK corpus download





picture
picture
picture
Ordinary Day

Generally speaking, data analysis includes structured and unstructured data analysis. The former is for example structured data analysis in common list format, while the latter is for data analysis in unstructured formats such as text, images and videos. In fact, similar to structured data, plain text is also a common data format.

Text analytics extracts patterns and insights beneficial to end users by parsing unstructured text data into a more structured form using techniques such as natural language processing (NLP), information retrieval, and machine learning (ML).

Techniques such as text classification, text clustering, sentiment analysis, and similarity analysis and relationship modeling are common text analysis techniques.

For unstructured text data, we need to use the Python Natural Language Toolkit NLTK (The Python Natural Language Toolkit) for analysis. NLTK, which originated in 2001 and was originally designed for teaching, includes a text sample set called corpora. Obviously, unrolling text analysis requires us to obtain NLTK first.







picture
picture
picture


01


Download nltk_data from the official website

picture

Click the Refresh button in the lower right corner of the NLTK Downloader, and first modify the URL on the right side of the Server Index to "https://www.nltk.org/nltk_data" on the NLTK official website;

After selecting the installation package to be downloaded, click Download to download the nltk_data corpus to the "C:\Users\Administrator\AppData\Roaming\nltk_data" folder, see Figure 1.



picture

Figure 1 Download the nltk corpus from the official website


The capacity of the nltk corpus downloaded from the official website is as high as 1.8GB, and the download speed is slow. A feasible alternative is to use Baidu Cloud to download the compressed package, at the cost of manually decompressing each sub-compressed file in nltk_data.zip.




picture


picture
picture


02


Baidu cloud download compressed package nltk_data.zip

picture

Enter the following file link in the search bar of the 360 ​​browser: "https://pan.baidu.com/s/1LWM3o7iRZMF8XaD91vx9Dw", enter the dynamic verification code sent by the mobile phone to open the Baidu network disk, and then enter the extraction code "cnpf" to download Compressed package nltk_data.zip, see Figure 2.


picture

Figure 2 Baidu cloud download nltk corpus


Unzip the downloaded compressed package, you can get 9 subfolders such as chunkers, corpora, etc. We put them in the Download Directory path "C:\Users\Administrator\AppData\Roaming\nltk_data", see Figure 3.



picture

Figure 3 9 subfolders contained in the nltk_data folder





picture
picture
picture


03


Test whether the nltk corpus download is successful

picture

Open Jupyter Notebook, click the New button on the right to create a new Python file, and enter the following commands in turn to check whether the nltk corpus is downloaded successfully, see Figure 4.

picture


picture

Figure 4 nltk download test: access to Brown corpus


Brown is the world's first million-level English corpus, also known as the "Contemporary American Standard Corpus of English", developed by Kucera and Francis of Brown University in 1961. The corpus consists of texts from different sources and classifications.

The command execution result in Figure 4 tells us that there are 15 types in the corpus, such as news (news), mystery (mystery), legend (fiction), etc., which indicates that the native nltk corpus has been successfully installed.





picture
picture

04


An example of natural language processing:

Filtering of stop words, names and numbers based on Gutenberg corpus


NLTK contains the Gutenberg Corpus, a digital library project for people to read on the Internet.

1. Unzip the gutenberg, punkt, stopwords and words compressed packages in the nltk_data subfolder corpora, see Figure 5 .


picture

Figure 5 Decompression of the nltk_data subfolder


2. Create a new PY3 subfolder in the following path, and place the english.pickle file in this path in the newly created subfolder PY3, see Figure 6.

picture



picture

Figure 6 New subfolder PY3


3. Open Jupyter Notebook, click the New button on the right to create a new Python file, and enter the following commands in turn. See Figure 7 and Figure 8 for the running results.

picture


picture

Figure 7 NLP demo based on Jupyter Notebook



picture

Figure 8 NLP demo for filtering out stop words, names and numbers: based on the Gutenberg project


Figure 8 shows that stop words, names, and numbers have been filtered out of the words list.

picture

Editor: Cao Chengzhou

Review: Yang Lu

picture

Past reviews:

Python Data Analysis Series Eight: Interconnection of Python and Stata Data Analysis

Python Data Analysis Series Seven: Multiple Regression Analysis

Python data analysis series six: data visualization

Python Data Analysis Series Five: Numerical Operations

Python data analysis series four: data preprocessing

Python data analysis series three: descriptive statistics

Python data analysis series two: correlation operation

One of the Python data analysis series: the installation of Anaconda


picture
picture
      picture

    Introduction to Empirical Accounting Made Easy

     Scan the QR code to follow us

                    picture  

Dingyuan Accounting WeChat Group

Theme of this group:

Exchange Stata with Python,

Analyze structured data,

Explore unstructured text accounting,

Write the life of Dingyuan Accounting together .


picture