Source: blog.csdn.net/master_hunter/article/details/125215433

Background of the project

As data analysts, we often need to make statistical analysis charts. But when there are too many reports, it often takes us most of our time to make reports. This delays us using a lot of time for data analysis. But as a data analyst, we should try our best to dig out the hidden related information behind the data of tables and charts, instead of simply making charts from statistical tables and sending reports. Since the work of reporting is unavoidable, how should we use the techniques we have learned to better handle the work?

This requires us to make a small Python program to implement it by ourselves, so that we have more time to do data analysis. We call this process of letting programs run on their own as automation.

The purpose of report automation

1. Save time and improve efficiency

Automation is always a great time saver and improves our work efficiency. Let our program programming reduce the coupling of each function implementation code as much as possible, and better maintain the code. In this way, we will save a lot of time and free us to do more valuable and meaningful work.

2. Reduce errors

If the coding effect is correct, it can be used all the time. If it is done manually, it may make some mistakes. It is more reassuring to leave it to a fixed program, and only modifying part of the code can solve the problem when the requirements change.

The scope of report automation

First of all, we need to formulate the reports we need according to business needs. Not every report needs to be automated. Some complex secondary development indicator data is more complicated to realize automatic programming, and there may be various bugs hidden. . Therefore, we need to summarize the characteristics of the reports used in our work. The following are several aspects that we need to consider comprehensively:

1. Frequency

For some tables that are often used in business, we may want to include these tables in the scope of automation programs. For example, customer information list, sales flow report, business loss report, month-on-month report, etc.

These frequently used reports are all necessary to automate. For those reports that need to be used occasionally, or secondary development indicators, reports that need to replicate statistics, these reports do not need to be automated.

2. Development time

This is equivalent to the cost and interest rate. If it is difficult to automate some reports and exceeds the time required for our ordinary statistical analysis, there is no need to automate them. So when you start automating work, you need to measure which time it takes to develop the script or the time it takes to manually make the table is shorter. Of course, I will provide a set of implementation solutions, but only for some common and simple reports.

3. Process

For each process and step of our report, each company is different, and we need to code and implement the function of each step according to the business scenario. Therefore, the process we make should be in line with business logic, and the program we make should also be in line with logic.

3. Implementation steps

First we need to know what metrics we need:

index

The overall overview indicator
reflects the overall size of a data indicator

comparative indicators

The index in the adjacent time period is directly worse than the chain
Comparison
of indicators at a common time point in adjacent time periods

Central trend indicator

median
Mode
Average/Weighted Average

Dispersion index

standard deviation
variance
Quartile
Full distance (range)
maximum bound minus minimum bound

Correlation Metrics

We take a simple report for simulation implementation:

Step 1: Read the data source file

First we need to understand where our data comes from, that is, the data source. Our final data processing is converted into DataFrame for analysis, so we need to convert the data source into DataFrame form:

import pandas as pdimport jsonimport pymysqlfrom sqlalchemy import create_engine # 打开数据库连接conn = pymysql.connect(host='localhost',                       port=3306,                       user='root',                       passwd='xxxx',                       charset = 'utf8'                       )engine=create_engine('mysql+pymysql://root:xxxx@localhost/mysql?charset=utf8') def read_excel(file):    df_excel=pd.read_excel(file)    return df_exceldef read_json(file):    with open(file,'r')as json_f:        df_json=pd.read_json(json_f)        return df_jsondef read_sql(table):    sql_cmd ='SELECT * FROM %s'%table    df_sql=pd.read_sql(sql_cmd,engine)    return df_sqldef read_csv(file):    df_csv=pd.read_csv(file)    return df_csv

‍

The above codes can be used normally after passing the test, but the read function of pandas is aimed at different forms of file reading, and its read function parameters also have different meanings, which need to be adjusted directly according to the form of the table.

Other read functions will be added later after the article is written. Except that read_sql needs to connect to the database, the others are relatively simple.

Step 2: DataFrame calculation

Let's take user information as an example:

The metrics we need to count are:

#Indicator description

Single table diagram:

Top 10 Product Audience Regions

Audience area for the product:

#将城市空值的一行删除    df=df[df['city_num'].notna()]    #删除error    df=df.drop(df[df['city_num']=='error'].index)    #统计    df = df.city_num.value_counts()

‍

We can just get the top 10 cities, encapsulated as a pie chart:

def pie_chart(df):    #将城市空值的一行删除    df=df[df['city_num'].notna()]    #删除error    df=df.drop(df[df['city_num']=='error'].index)    #统计    df = df.city_num.value_counts()    df.head(10).plot.pie(subplots=True,figsize=(5, 6),autopct='%.2f%%',radius = 1.2,startangle = 250,legend=False)pie_chart(read_csv('user_info.csv'))

‍

Save the chart:

plt.savefig('fig_cat.png')

If you think matplotlib's pictures are not very beautiful, you can also replace them with echarts pictures, which will look better:

pie = Pie()pie.add("",words)pie.set_global_opts(title_opts=opts.TitleOpts(title="前十地区"))#pie.set_series_opts(label_opts=opts.LabelOpts(user_df))pie.render_notebook()

‍

After packaging, it can be used directly:

def echart_pie(user_df):    user_df=user_df[user_df['city_num'].notna()]    user_df=user_df.drop(user_df[user_df['city_num']=='error'].index)    user_df = user_df.city_num.value_counts()    name=user_df.head(10).index.tolist()    value=user_df.head(10).values.tolist()    words=list(zip(list(name),list(value)))    pie = Pie()    pie.add("",words)    pie.set_global_opts(title_opts=opts.TitleOpts(title="前十地区"))    #pie.set_series_opts(label_opts=opts.LabelOpts(user_df))    return pie.render_notebook()user_df=read_csv('user_info.csv')echart_pie(user_df)

It can be saved, but unfortunately it is not a moving picture:

from snapshot_selenium import snapshotmake_snapshot(snapshot,echart_pie(user_df).render(),"test.png")

Save it as a web page to automatically load JS for rendering:

echart_pie(user_df).render('problem.html')os.system('problem.html')

‍‍

Step 3: Automatically send emails

A series of reports are generally sent to others for viewing. For some reports that need to be sent to a designated mailbox every day or multiple reports need to be sent, Python can be used to automatically send mailboxes.

Sending emails in Python mainly relies on the two modules smtplib and email.

smtplib: mainly used to establish and disconnect the work of connecting with the server.

email: mainly used to set some content related to the email itself.

Different types of mailbox servers have different connection addresses. You can set up the corresponding server to connect according to the mailboxes you usually use. Here the blogger uses NetEase mailbox to show:

First, you need to open the POP3/SMTP/IMAP service:

After that, you can use python to log in according to the authorization code.

import smtplibfrom email import encodersfrom email.header import Headerfrom email.utils import parseaddr,formataddrfrom email.mime.application import MIMEApplicationfrom email.mime.multipart import MIMEMultipartfrom email.mime.text import MIMEText #发件人邮箱asender="[email protected]"#收件人邮箱areceiver="[email protected]"#抄送人邮箱acc="[email protected]"#邮箱主题asubject="谢谢关注"#发件人地址from_addr="[email protected]"#邮箱授权码password="####"#邮件设置msg=MIMEMultipart()msg['Subject']=asubjectmsg['to']=areceivermsg['Cc']=accmsg['from']="fanstuck"#邮件正文body="你好，欢迎关注fanstuck，您的关注就是我继续创作的动力！"msg.attach(MIMEText(body,'plain','utf-8'))#添加附件htmlFile = 'C:/Users/10799/problem.html'html = MIMEApplication(open(htmlFile , 'rb').read())html.add_header('Content-Disposition', 'attachment', filename='html') msg.attach(html)#设置邮箱服务器地址和接口smtp_server="smtp.163.com"server = smtplib.SMTP(smtp_server,25)server.set_debuglevel(1)#登录邮箱server.login(from_addr,password)#发生邮箱server.sendmail(from_addr,areceiver.split(',')+acc.split(','),msg.as_string())#断开服务器连接server.quit()

Run the test:

download file:

definitely no problem! ! !

————END————