

Explore charts with visualization





1. Data visualization and exploration diagrams
      Data visualization means presenting data as graphs or tables. A chart makes the nature of the data and the relationships among its attributes easy to read and interpret. With an exploratory graph, users can grasp the characteristics of the data, spot its trends, and lower the barrier to understanding it.
2. Common chart examples
      This chapter plots mainly through Pandas rather than calling the Matplotlib module directly. Pandas integrates Matplotlib's drawing methods into the DataFrame, so in practice most charts can be drawn from the DataFrame itself; Matplotlib is referenced only to display the figure with plt.show().
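      The examples in this chapter use a DataFrame named df_iris whose origin is not shown. The sketch below assumes it is the copy of the iris data shipped with scikit-learn (an assumption; the column names happen to match), and also checks that a Pandas plot call returns an ordinary Matplotlib Axes object.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display; drop this line to see windows
import matplotlib.axes
import pandas as pd
from sklearn.datasets import load_iris

# Assumption: df_iris is the scikit-learn iris data as a DataFrame
# (four measurement columns plus a 'target' column with the class label).
df_iris = load_iris(as_frame=True).frame

# DataFrame.plot.line() delegates to Matplotlib and returns an Axes object.
ax = df_iris[['sepal length (cm)']].plot.line()

print(df_iris.shape)
print(list(df_iris.columns))
print(type(ax))
```

      Because the return value is a Matplotlib Axes, any Axes method (for example ax.set) can be used to adjust the chart afterwards.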
1. Line chart
      A line chart is the most basic chart and can be used to show the relationship between continuous data in different fields. It is drawn with the plot.line() method, and parameters such as color and line style can be set. Because the method inherits Matplotlib's usage, the program must still call plt.show() at the end to display the figure, as shown in Figure 8.4.
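import matplotlib.pyplot as plt

# basic line chart of one column
df_iris[['sepal length (cm)']].plot.line()
plt.show()
# customized color, title, line style, and axis labels
ax = df_iris[['sepal length (cm)']].plot.line(color='green',title="Demo",style='--')
ax.set(xlabel="index", ylabel="length")
plt.show()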





2. Scatter plot
     A scatter chart is used to view the relationship between discrete data points in two different fields. It is drawn with df.plot.scatter(), as shown in Figure 8.5.
df = df_iris
df.plot.scatter(x='sepal length (cm)', y='sepal width (cm)')

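import matplotlib.pyplot as plt
# plt.get_cmap replaces cm.get_cmap, which recent Matplotlib releases no longer provide
cmap = plt.get_cmap('Spectral')
# point size follows petal length, point color follows the class label
df.plot.scatter(x='sepal length (cm)',
          y='sepal width (cm)',
          s=df['petal length (cm)']*20,
          c=df['target'],
          cmap=cmap,
          title='different circle size by petal length (cm)')
plt.show()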



3. Histogram, bar chart
     A histogram is usually used to show the distribution of continuous data within a single column. A bar chart looks similar but is used to view the distribution of discrete (categorical) values within a single column, as shown in Figure 8.6.
df[['sepal length (cm)','sepal width (cm)','petal length (cm)','petal width (cm)']].plot.hist()
df.target.value_counts().plot.bar()
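
     The difference between the two charts lies in how the counts are produced. A minimal sketch on hypothetical toy values (not the iris data) shows the numbers each chart would draw: a histogram first groups continuous values into bins, while a bar chart counts each distinct category directly.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, chosen only for illustration.
lengths = pd.Series([4.9, 5.0, 5.1, 5.8, 6.0, 6.3, 6.4, 7.1])  # continuous values
labels = pd.Series([0, 0, 0, 1, 1, 1, 2, 2])                    # categorical labels

# Histogram: split the value range into 3 equal-width bins, then count per bin.
counts, edges = np.histogram(lengths, bins=3)

# Bar chart: count how often each distinct label occurs.
per_class = labels.value_counts().sort_index()

print(counts, edges)
print(per_class)
```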




4. Pie Chart, Box Chart
      A pie chart shows the proportion of each category within a single column, and a box chart shows the distribution of a single column or compares the distributions of several columns, as shown in Figure 8.7.
df.target.value_counts().plot.pie(legend=True)
df.boxplot(column=['target'],figsize=(10,5))
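
      The quantities behind these two charts can also be computed directly. The sketch below uses hypothetical toy values: the pie slices are the category shares, and the box is built from the quartiles, with points beyond 1.5 times the interquartile range (the Matplotlib default whisker length) drawn as outliers.

```python
import pandas as pd

# Hypothetical toy data, chosen only for illustration.
target = pd.Series([0, 0, 1, 1, 1, 2])
values = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 20.0])

# Pie chart: the share of each category (shares sum to 1).
shares = target.value_counts(normalize=True).sort_index()

# Box chart: the box spans the first to third quartile, with a line at the median.
q1, median, q3 = values.quantile([0.25, 0.5, 0.75])
iqr = q3 - q1

# Points outside [q1 - 1.5*IQR, q3 + 1.5*IQR] are plotted as individual outliers.
outliers = values[(values < q1 - 1.5 * iqr) | (values > q3 + 1.5 * iqr)]

print(shares)
print(q1, median, q3, iqr)
print(outliers.tolist())
```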



Data exploration practice sharing



          
     This section uses two real datasets to demonstrate several approaches to data exploration.
1. 2013 American Community Survey
      In the American Community Survey, about 3.5 million households are asked each year detailed questions about who they are and how they live. The survey covers a number of topics, including ancestry, education, work, transportation, internet use and housing.
      Data source: https://www.kaggle.com/census/2013-american-community-survey.
        Data name: 2013 American Community Survey.
       First observe the appearance and characteristics of the data, as well as the meaning, type and scope of each field.
import pandas as pd

# read the data
df = pd.read_csv("./ss13husa.csv")
# number of rows and fields
df.shape
# (756065,231)

# value ranges of the fields
df.describe()
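
      Beyond shape and describe(), the type and completeness of each field can be checked the same way. The sketch below runs on a hypothetical miniature table (the column names are made up and are not ACS fields), since the real file is large.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature table; names and values are illustrative only.
df_demo = pd.DataFrame({
    "rooms": [3, 4, np.nan, 5],
    "rent": [800.0, 950.0, 700.0, np.nan],
    "state": ["CA", "NY", "CA", "TX"],
})

print(df_demo.dtypes)            # data type of each field
missing = df_demo.isna().sum()   # number of missing values per field
summary = df_demo.describe()     # count, mean, std, min, quartiles, max (numeric fields only)
ranges = summary.loc[["min", "max"]]  # value range of each numeric field

print(missing)
print(ranges)
```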

      Next, concatenate the two files ss13pusa.csv and ss13pusb.csv. The combined data contains a total of 300,000 records and 3 fields: SCHL (education, School Level), PINCP (income, Income) and ESR (work status, Work Status).
pusa = pd.read_csv("ss13pusa.csv")
pusb = pd.read_csv("ss13pusb.csv")
# concatenate the two files, keeping only the three fields of interest
col = ['SCHL','PINCP','ESR']
ac_survey = pd.concat([pusa[col], pusb[col]], axis=0)

      Group the data by educational background, observe how many people fall into each education level, and then calculate the average income of each level.
group = ac_survey.groupby(by=['SCHL'])
print('Education distribution:', group.size())
print('Average income:', group['PINCP'].mean())
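
      The same concatenate-then-group pattern can be tried without downloading the survey files. The sketch below uses a handful of hypothetical rows with the same three field names; the values are made up and carry no meaning.

```python
import pandas as pd

# Hypothetical rows standing in for the two survey files.
part_a = pd.DataFrame({"SCHL": [16, 21, 21],
                       "PINCP": [20000, 50000, 70000],
                       "ESR": [1, 1, 6]})
part_b = pd.DataFrame({"SCHL": [16, 22],
                       "PINCP": [30000, 90000],
                       "ESR": [1, 1]})

# Stack the two parts row-wise and renumber the index.
survey = pd.concat([part_a, part_b], axis=0, ignore_index=True)

group = survey.groupby("SCHL")
share = group.size() / len(survey)     # proportion of records per education level
mean_income = group["PINCP"].mean()    # average income per education level

print(share)
print(mean_income)
```

      Selecting the PINCP column before calling mean() keeps the result to income only; calling mean() on the whole group would also average the ESR codes, which is rarely meaningful.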

2. Boston Housing Dataset
      The Boston House Price Dataset contains information about houses in the Boston area, including 506 data samples and 13 feature dimensions.
     Data source: https://archive.ics.uci.edu/ml/machine-learning-databases/housing/.
       Data name: Boston House Price Dataset.
     First observe the appearance and characteristics of the data, as well as the meaning, type and scope of each field.
     The distribution of house prices (MEDV) can be plotted as a histogram, as shown in Figure 8.8.
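import pandas as pd
import matplotlib.pyplot as plt

# housing.data is whitespace-delimited and has no header row, so the 14 column names are supplied here
cols = ['CRIM','ZN','INDUS','CHAS','NOX','RM','AGE','DIS','RAD','TAX','PTRATIO','B','LSTAT','MEDV']
df = pd.read_csv("./housing.data", sep=r"\s+", header=None, names=cols)
# number of rows and fields
df.shape
# (506, 14)

# value ranges of the fields
df.describe()
df[['MEDV']].plot.hist()
plt.show()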



Note: The English in the figure corresponds to the names specified by the author in the code or data. In practice, readers can replace them with the words they need.
     The next thing you need to know is which dimensions are significantly related to "house price". First use the scatter plot to observe, as shown in Figure 8.9 .
# draw scatter chart 
df.plot.scatter(x='MEDV', y='RM')
plt.show()



      Finally, the correlation coefficients are calculated and visualized using a heatmap, as shown in Figure 8.10.
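# compute Pearson correlation
corr = df.corr()
# draw heatmap
import seaborn as sns
# a diverging colormap centered at 0 shows positive values in red, negative in blue, and values near 0 in white
sns.heatmap(corr, cmap='RdBu_r', center=0)
plt.show()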



      In the heatmap, red indicates a positive correlation, blue a negative correlation, and white little or no correlation. The cell for RM and the house price leans red, so the two are positively related; the cells for LSTAT and PTRATIO lean dark blue, so both are negatively related to the house price; the cells for CRIM, RAD and AGE lean white, indicating almost no relationship with the house price.
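
      Reading colors off a heatmap can be complemented by sorting the correlation coefficients themselves. The sketch below uses synthetic data (not the Boston dataset; the column names price, rooms, poverty and noise are hypothetical) built so that one column rises with the price, one falls with it, and one is unrelated.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in data with a fixed seed; not the Boston values.
rng = np.random.default_rng(0)
n = 500
rooms = rng.normal(6, 1, n)      # constructed to push the price up
poverty = rng.normal(12, 4, n)   # constructed to push the price down
noise = rng.normal(0, 1, n)      # unrelated to the price
price = 5 * rooms - 1.0 * poverty + rng.normal(0, 2, n)

toy = pd.DataFrame({"rooms": rooms, "poverty": poverty,
                    "noise": noise, "price": price})

# Pearson correlation matrix, then the column for 'price' sorted from
# most negative to most positive (excluding price's correlation with itself).
corr = toy.corr()
ranked = corr["price"].drop("price").sort_values()

print(ranked)
```

      On real data the same two lines, applied to the MEDV column, would list the fields in order from most negatively to most positively correlated with the house price.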

      Disclaimer: This article is selected from the book "In-depth Python Data Analysis" by Tsinghua University Press, with slight modifications, and published here with the authorization of the publisher.